### Process for incorporation of named entity recognition

We want to recognize:<br>
<li>Email addresses</li>
<li>Phone numbers</li>
<li>Dates</li>
<li>Times</li>
<li>Addresses (locations)</li>


To implement this, we take the following steps:<br>
<li>For each document inside the BM25 class, instantiate a regex_matches dictionary and a category dictionary</li>
<li>Run the phone number and email regexes on each document and record the number of matches in the regex_matches dictionary</li>
<li>Record the frequency of date, time and address (location) named entities, as determined by spaCy, in the category dictionary</li>
<li>Update the document's frequency dictionary (the count of each individual word in that document) with the category and regex_matches dictionaries for that document</li>
<li>Nothing else needs to change, since these updates account for the category counts that verbatim term frequency and inverse document frequency would have missed</li>
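
The steps above can be sketched for a single document as follows. This is a minimal illustration, not the class used later: the function name `document_frequencies` and the `entity_counts` argument (a stand-in for the counts spaCy would provide) are hypothetical. It also makes one design choice of its own: category counts are added to any verbatim count and empty categories are skipped, rather than merged with `dict.update`.

```python
import re
from collections import Counter

# Patterns for emails and hyphenated US phone numbers.
EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
PHONE_RE = re.compile(r'\b[0-9]{3}-[0-9]{3}-[0-9]{4}\b')


def document_frequencies(document, entity_counts=None):
    """Return word frequencies for one document, plus category counts.

    entity_counts is a hypothetical dict such as {'dates': 2}, standing in
    for the named-entity counts that spaCy would supply.
    """
    # Verbatim term frequencies, tokenizing on single spaces.
    frequencies = Counter(document.split(" "))

    # Number of regex matches per category.
    regex_matches = {
        'email': len(EMAIL_RE.findall(document)),
        'phone': len(PHONE_RE.findall(document)),
    }

    # Merge both sources of category counts into the word frequencies.
    for source in (regex_matches, entity_counts or {}):
        for category, count in source.items():
            if count > 0:  # skip categories with no matches in this document
                frequencies[category] += count
    return dict(frequencies)


freqs = document_frequencies("my email is ahfs@ndf.com")
print(freqs['email'])  # the word 'email' plus one regex match
```

Skipping zero counts keeps a category key out of documents that contain no such entity, which matters when document frequencies are later derived from the keys of each dictionary.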


In [15]:
import re
import spacy
import math
import numpy as np
from multiprocessing import Pool, cpu_count
nlp = spacy.load('en_core_web_sm')

### Regular expressions

A US phone number has 10 digits and may be written with hyphens or brackets. The regular expression below matches only the hyphenated form ddd-ddd-dddd.
An email address needs to contain an '@' symbol, followed by a domain that includes a period and a domain extension.<br>
Credit to <a href = 'https://www.geeksforgeeks.org/check-if-email-address-valid-or-not-in-python/'>GeeksforGeeks</a> for guidance on finding emails in strings.
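
Phone numbers written with brackets or other separators would need a looser pattern. A minimal sketch, assuming we want to accept an optional bracketed area code and optional '-', '.', or space separators, and then normalize every match to one form; `LOOSE_PHONE_RE` and `find_phone_numbers` are hypothetical names, not part of the notebook's code.

```python
import re

# Hypothetical looser pattern: optional brackets around the area code and
# optional '-', '.', or space between the three digit groups.
LOOSE_PHONE_RE = re.compile(r'\(?\b(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})\b')


def find_phone_numbers(text):
    """Return phone numbers found in text, normalized to ddd-ddd-dddd."""
    # findall returns one (area, prefix, line) tuple per match.
    return ['-'.join(groups) for groups in LOOSE_PHONE_RE.findall(text)]


print(find_phone_numbers("Call (203) 292-5006 or 652-982-3987."))
```

Normalizing at match time means that differently formatted copies of the same number compare equal afterwards.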

In [21]:
email_regex = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
phone_regex = re.compile(r'\b([0-9]{3}-[0-9]{3}-[0-9]{4})\b')

Testing:

In [23]:
string = 'his phone number is 652-982-3987. Mine is 872-974-8342'

In [24]:
re.findall(phone_regex, string)

['652-982-3987', '872-974-8342']

In [122]:
re.findall(email_regex, 'email fas@dfasd.edu ihsfa@jdf.com')

['fas@dfasd.edu', 'ihsfa@jdf.com']

### Incorporating in BM25

This is adapted from the rank_bm25 library; credit to <a href="https://github.com/dorianbrown/rank_bm25">Dorian Brown</a>. Changes I made:<br>
<li>Adding regular expressions to detect emails and phone numbers</li>
<li>Resolving entities in the initialization of the BM25 class</li>
<li>Tokenizing by splitting on spaces inside the class</li>
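
The entity bookkeeping amounts to mapping spaCy entity labels to category names and counting them. A minimal sketch of that mapping as a standalone function; `LABEL_TO_CATEGORY` and `count_categories` are hypothetical names, and the function takes plain label strings so that it runs without a spaCy model.

```python
from collections import Counter

# spaCy entity labels mapped to the category names used as dictionary keys.
LABEL_TO_CATEGORY = {
    'GPE': 'locations',
    'LOC': 'locations',
    'DATE': 'dates',
    'TIME': 'times',
}


def count_categories(labels):
    """Count entity labels per category, ignoring labels outside the mapping."""
    return Counter(
        LABEL_TO_CATEGORY[label] for label in labels if label in LABEL_TO_CATEGORY
    )


print(count_categories(['GPE', 'DATE', 'PERSON', 'LOC']))
```

With spaCy loaded, the labels could be supplied as `count_categories(ent.label_ for ent in nlp(document).ents)`.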

In [62]:
class BM25:
    def __init__(self, corpus, tokenizer=None):
        self.corpus_size = len(corpus)
        self.avgdl = 0
        self.doc_freqs = []
        self.idf = {}
        self.doc_len = []
        self.tokenizer = tokenizer
        # compiling regular expressions
        self.email_regex = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
        self.phone_regex = re.compile(r'\b([0-9]{3}-[0-9]{3}-[0-9]{4})\b')

        if tokenizer:
            corpus = self._tokenize_corpus(corpus)

        nd = self._initialize(corpus)
        self._calc_idf(nd)

    def _initialize(self, corpus):
        nd = {}  # word -> number of documents with word
        num_doc = 0
        for document in corpus:
            # document is a raw string here, so this length is in characters, not tokens
            self.doc_len.append(len(document))
            num_doc += len(document)
            # we apply spaCy's named entity recognition to the document
            category = {}
            doc = nlp(document)
            for ent in doc.ents:
                if ent.label_ == 'GPE' or ent.label_ == 'LOC':
                    if 'locations' not in category:
                        category['locations'] = 0
                    category['locations'] += 1
                elif ent.label_ == 'DATE':
                    if 'dates' not in category:
                        category['dates'] = 0
                    category['dates'] += 1
                elif ent.label_ == 'TIME':
                    if 'times' not in category:
                        category['times'] = 0
                    category['times'] += 1
                else:
                    continue
            
            # regular expressions to detect phone numbers and emails
            regex_matches = {}
            email_matches = re.findall(self.email_regex, document)
            phone_matches = re.findall(self.phone_regex, document)
            regex_matches['email'] = len(email_matches)
            regex_matches['phone'] = len(phone_matches)
            
            # this is for verbatim matching with the query
            frequencies = {}
            # we tokenize the document by splitting on spaces
            tokenized_corpus = document.split(" ")
            for word in tokenized_corpus:
                if word not in frequencies:
                    frequencies[word] = 0
                frequencies[word] += 1
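            # note: update() overwrites any verbatim count for the words
            # 'locations', 'dates', 'times', 'email' and 'phone'
            frequencies.update(category)
            frequencies.update(regex_matches)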
            self.doc_freqs.append(frequencies)
            
            for word, freq in frequencies.items():
                try:
                    nd[word]+=1
                except KeyError:
                    nd[word] = 1

        self.avgdl = num_doc / self.corpus_size
        return nd

    def _tokenize_corpus(self, corpus):
        # the context manager releases the worker processes once map() returns
        with Pool(cpu_count()) as pool:
            tokenized_corpus = pool.map(self.tokenizer, corpus)
        return tokenized_corpus

    def _calc_idf(self, nd):
        raise NotImplementedError()

    def get_scores(self, query):
        raise NotImplementedError()

    def get_batch_scores(self, query, doc_ids):
        raise NotImplementedError()

    def get_top_n(self, query, documents, n=5):

        assert self.corpus_size == len(documents), "The documents given don't match the index corpus!"

        scores = self.get_scores(query)
        top_n = np.argsort(scores)[::-1][:n]
        return [documents[i] for i in top_n]


class BM25Okapi(BM25):
    def __init__(self, corpus, tokenizer=None, k1=1.5, b=0.75, epsilon=0.25):
        self.k1 = k1
        self.b = b
        self.epsilon = epsilon
        super().__init__(corpus, tokenizer)

    def _calc_idf(self, nd):
        """
        Calculates frequencies of terms in documents and in corpus.
        This algorithm sets a floor on the idf values to eps * average_idf
        """
        # collect idf sum to calculate an average idf for epsilon value
        idf_sum = 0
        # collect words with negative idf to set them a special epsilon value.
        # idf can be negative if word is contained in more than half of documents
        negative_idfs = []
        for word, freq in nd.items():
            idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
            self.idf[word] = idf
            idf_sum += idf
            if idf < 0:
                negative_idfs.append(word)
        self.average_idf = idf_sum / len(self.idf)

        eps = self.epsilon * self.average_idf
        for word in negative_idfs:
            self.idf[word] = eps

    def get_scores(self, query):
        """
        The ATIRE BM25 variant uses an idf function which uses a log(idf) score. To prevent negative idf scores,
        this algorithm also adds a floor to the idf value of epsilon.
        See [Trotman, A., X. Jia, M. Crane, Towards an Efficient and Effective Search Engine] for more info
        :param query:
        :return:
        """
        score = np.zeros(self.corpus_size)
        doc_len = np.array(self.doc_len)
        for q in query:
            q_freq = np.array([(doc.get(q) or 0) for doc in self.doc_freqs])
            score += (self.idf.get(q) or 0) * (q_freq * (self.k1 + 1) /
                                               (q_freq + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)))
        return score

    def get_batch_scores(self, query, doc_ids):
        """
        Calculate bm25 scores between query and subset of all docs
        """
        assert all(di < len(self.doc_freqs) for di in doc_ids)
        score = np.zeros(len(doc_ids))
        doc_len = np.array(self.doc_len)[doc_ids]
        for q in query:
            q_freq = np.array([(self.doc_freqs[di].get(q) or 0) for di in doc_ids])
            score += (self.idf.get(q) or 0) * (q_freq * (self.k1 + 1) /
                                               (q_freq + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)))
        return score.tolist()

In [36]:
corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
    "my email is ahfs@ndf.com",
    "Please Contact : Traci Kantzas 203-292-5006 | Traci@greenfieldsource.com"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Okapi(corpus)
query = "phone numbers"
tokenized_query = query.split(" ")
doc_scores = bm25.get_scores(tokenized_query)
print(doc_scores)

[0.        0.        0.        0.        0.1355042]


### Testing

In [1]:
import pandas as pd
import dateutil.parser
import datetime

The test data is the message history of the #jobs-and-internships Slack channel at the I School (School of Information), obtained with permission from the IT team. It spans 2020-11-02 to 2021-12-15. I keep only the following fields:<br>
<li>'user', which is the Slack user ID</li>
<li>'text', the message content</li>
<li>'ts', the Unix timestamp</li>
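
The 'ts' field is stored as seconds since the Unix epoch, so it is not directly readable. A minimal sketch of converting one value to a UTC datetime with the standard library; `slack_ts_to_datetime` is a hypothetical helper and is not used elsewhere in this notebook.

```python
from datetime import datetime, timezone


def slack_ts_to_datetime(ts):
    """Convert a 'ts' value (seconds since the Unix epoch) to a UTC datetime."""
    return datetime.fromtimestamp(float(ts), tz=timezone.utc)


# A timestamp from early November 2020.
print(slack_ts_to_datetime(1604527000.0).date().isoformat())
```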

In [2]:
first_date = '2020-11-02'
first_date = dateutil.parser.parse(first_date)
df = pd.read_json("jobs-and-internships/2020-11-02.json")
df = df[['user', 'text', 'ts']]

In [3]:
for i in range(341):
    first_date += datetime.timedelta(days=1)
    try:
        df_add = pd.read_json("jobs-and-internships/" + first_date.strftime('%Y-%m-%d') + '.json')
        df_add = df_add[['user', 'text', 'ts']]
        df = pd.concat([df, df_add], ignore_index = True)
    except (FileNotFoundError, ValueError, KeyError):
        # skip days with no export file or with missing fields
        continue

In [4]:
df.shape

(1497, 3)

In [7]:
df.sample(10)

| | user | text | ts |
|---|---|---|---|
| 1318 | U5N1K4YG0 | <@U01G53LLB42>The recruiter from Cruise reache... | 1629753000.0 |
| 78 | U01DSD43ECF | <@U01DSD43ECF> has joined the channel | 1606238000.0 |
| 1247 | U01FRBJM4SV | Here are internship positions at FB: <https://... | 1628354000.0 |
| 5 | UEV3JDG12 | HI, Any one here who can help me with a referr... | 1604527000.0 |
| 572 | URRQ1E50S | Karen do you have specific contacts that could... | 1614629000.0 |
| 83 | U01CKHVJMH7 | <@U01CKHVJMH7> has joined the channel | 1606268000.0 |
| 882 | U1UHKAMFD | you know, i never considered that before joini... | 1619020000.0 |
| 1428 | U1UJVQ0G4 | Another goodie: <https://twitter.com/egoodman/... | 1632315000.0 |
| 253 | UEKS87R5K | This is a posting from the firm my wife used t... | 1608751000.0 |
| 1226 | U0298P5SH40 | I didn’t know McD Tech Labs existed! How excit... | 1628014000.0 |


I save this as a CSV file for testing; the file is attached.

In [11]:
df.to_csv('Slack_test_data.csv', index = False)

In [8]:
test_corpus = list(df['text'])

In [63]:
tokenized_test_corpus = [doc.split(" ") for doc in test_corpus]
bm25 = BM25Okapi(test_corpus)

#### "phone numbers"

In [40]:
query = "phone numbers"
tokenized_query = query.split(" ")
doc_scores = bm25.get_scores(tokenized_query)

In [45]:
for i in bm25.get_top_n(tokenized_query, test_corpus, n=5):
    print('Result:')
    print(i)

Result:
<https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fworkforcenow.adp.com%2Fmascsr%2Fdefault%2Fmdf%2Frecruitment%2Frecruitment.html%3Fcid%3D35e6c783-42c6-456d-a2d7-5348ce39bd8d%26ccId%3D19000101_000001%26lang%3Den_US&amp;data=04%7C01%7Cjfogel%40mediquant.com%7Cc6e50d0fdfbd4592a3a508d8ea254ccb%7Cee99115b93684a8ba65e601d3afd6236%7C1%7C0%7C637516793210062319%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=bHTIeZzD6wqcclhCoZggD5Mnprv%2BeFoZx6kTKQKe0XM%3D&amp;reserved=0|MediQuant> out of Ohio is looking for Data Engineers, Data Modelers, Clinical Business Systems Analysts, and more. Contact <mailto:lindsay_ludlow@mediquant.com|Lindsay Ludlow> if interested (<tel:9102325259|910-232-5259>).
Result:
Annual <https://skydeckrecuiting2020.eventbrite.com/|MEng + SkyDeck Startup Recruiting Event> This is not another Zoom event! Please join us on the interactive platform, Remo, for our annual <https://skydeckrecuiting2

The results include messages containing phone numbers, none of which Slack's own search listed.

#### Email addresses

In [46]:
query = "email addresses"
tokenized_query = query.split(" ")
doc_scores = bm25.get_scores(tokenized_query)

In [53]:
for i in bm25.get_top_n(tokenized_query, test_corpus, n=5):
    print('Result:')
    print(i)

Result:
<@UKJAL3UKB> well if we're putting grievances out there I have one that might be resolved by now. Originally zscalar did not have company specific pools of public IP addresses allocated to the VPN service, this meant we couldn't use the corporate VPN to open up access to a large pool of data because a lot of PII/PHI data in the particular business unit I worked with. I believe there was a release where enterprise companies could pay for their own IP pool in the last year or so.
Result:
Hi everyone, my friend is an MPH student at Berkeley and asked to share this job opp working on a COVIDScholar tool to contribute to the dissemination of scientific information regarding the pandemic. There is more information on the flyer attached.

If you have any questions feel free to shoot an email to <mailto:egainor33@berkeley.edu|egainor33@berkeley.edu>, <mailto:hildy.fong@berkeley.edu|hildy.fong@berkeley.edu> and/or <mailto:jdagdelen@berkeley.edu|jdagdelen@berkeley.edu>.
Result:
Hi all! M

#### Dates

In [64]:
query = "dates"
tokenized_query = query.split(" ")
doc_scores = bm25.get_scores(tokenized_query)

In [65]:
for i in bm25.get_top_n(tokenized_query, test_corpus, n=5):
    print('Result:')
    print(i)

Result:
I think they hired some last year but more this year.
Result:
I know, right?! On the other hand, I didn't really think we needed to be in the '90s by the first week in April, so I welcome the variety!
Result:
I applied for their summer internships this spring 2021. In my past experience, they didn't post intern listings until the early spring before summer. There are no intern positions that I am aware of right now.
Result:
Not sure how it works in other companies, but we usually get budgets only till the end of the year this early in the year. Probably towards the end of the year, we'd have more clarity on recruiting for next year.
Result:
<@U1UKGEH4J> how often is this internship offered - is it offered at other times of year or is it summer only? I'm interested, but I'm in my first year of MIDS so I think that it may be helpful to finish the core courses first.


#### Locations

In [66]:
query = "locations"
tokenized_query = query.split(" ")
doc_scores = bm25.get_scores(tokenized_query)
for i in bm25.get_top_n(tokenized_query, test_corpus, n=5):
    print('Result:')
    print(i)

Result:
We are location agnostic within the continental US, fully remote is available and we have offices in St. Louis, Chicago, Minneapolis, Dallas, Atlanta, and New Jersey (sort of).
Result:
Georgia Pacific is hiring Senior Data Scientists! If you are interested in either of these roles, please *email me your resume by Thursday, March 4th* (no cover letter required).

Senior Data Scientist – Atlanta, GA
<https://referrals.kochcareers.com/jobs/6330678-senior-data-scientist?bid=186&amp;referral=1&amp;internal=1>

Sr. Data Scientist - Utility Analytics (Multiple locations):
<https://jobs.kochcareers.com/jobs/5953900-sr-data-scientist-utility-analytics>
NOTE: locations for this role are -  Santa Clara, California, Tacoma, Washington, Wichita, Kansas, St. Louis, Missouri, Atlanta, Georgia, New York, New York, Tulsa, Oklahoma, Chicago, Illinois, Denver, Colorado
 
*When submitting your resume, please indicate which role you are interested in, and for the Utility Analytics role, also indica

#### Times

In [67]:
query = "times"
tokenized_query = query.split(" ")
doc_scores = bm25.get_scores(tokenized_query)
for i in bm25.get_top_n(tokenized_query, test_corpus, n=5):
    print('Result:')
    print(i)

Result:
Thanks! Will contact tmr morning!
Result:
Just a reminder to apply by tonight if you're interested in working with the D-Lab! The <https://berkeley.qualtrics.com/jfe/form/SV_bxgWFFHiQy7JZRP|application> should take you 10 minutes or less.
Result:
Shared via the noise listserve earlier today...
<https://megagon.ai/jobs/internships/>
Result:
From a friend:  Would any of your current/former data science students be interested part time / contractor work few hours a week? Just need to know SQL and possibly some Python, bonus if familiar with Mode. Analyzing teacher/school marketplace data / for my company Swing Education. Data spans product analytics to marketing and ops. Rate is $40-50/hour but negotiable depending on experience and # hours.
Result:
and we are hiring a data product PM.  your chance to boss me around.
or if anyone knows anyone interested …
<https://boards.greenhouse.io/voxmedia/jobs/2952098>
