# Using an agreed-upon corpus, extract neighborhoods and crimes for a given dataframe
The following code searches a given dataframe for exact and partial matches against the neighborhood and crime corpora. The approach is:
1. Look for exact corpus matches first using CountVectorizer.
2. For records that had no exact matches, tokenize into n-grams (sizes 1 to 4) and look for high-scoring partial matches.

In [1]:
import pandas as pd
import py_stringmatching as sm
import nltk
from nltk.util import ngrams

from sklearn.feature_extraction.text import CountVectorizer

## To use this code, specify the dataframe location and the column to be searched.

In [2]:
'''
df = pd.read_csv('../data/cleaned_reddit_12-21_to_1115.csv')
df.reset_index(inplace = True)
col = 'post_text'
'''

Load the data, keep one half of the rows (set by `is_part_1`), then prep the search column by converting it to string.

In [3]:
is_part_1 = True
df = pd.read_csv('../data/processed_reddit_data/cleaned_reddit_12-21_to_1115.csv')
#df.drop(columns=['Unnamed: 0', 'level_0', 'index'], inplace=True)
df.reset_index(inplace = True)
midpoint = round(df.shape[0] / 2)
if is_part_1:
    df = df.iloc[:midpoint]
else:
    df = df.iloc[midpoint:]
col = 'post_text'
df

Unnamed: 0,index,subreddit,title,post_id,post_author,post_utc,full_link,post_text,post_text_count
0,0,sandiego,going to visit san diego next week any places...,x4nzh2,Fearmkultra,2022-09-03 06:57:58+00:00,https://www.reddit.com/r/sandiego/comments/x4n...,going to visit san diego next week any places ...,12
1,1,sandiego,interesting illusion’s,x4ny4c,Break-these-cuffs,2022-09-03 06:55:24+00:00,https://www.reddit.com/r/sandiego/comments/x4n...,interesting illusion’s,2
2,2,sandiego,whaley house picture of ghost,x4ntm7,Open_Construction_31,2022-09-03 06:47:09+00:00,https://www.reddit.com/r/sandiego/comments/x4n...,whaley house picture of ghost as a kid i saw t...,199
3,3,sandiego,language exchange,x4n6xv,Poshorock,2022-09-03 06:07:46+00:00,https://www.reddit.com/r/sandiego/comments/x4n...,language exchange is there someone by there wh...,31
4,4,SanDiegan,chula vista police stopping cars going east on...,x4n5aj,kaptaincorn,2022-09-03 06:04:54+00:00,https://www.reddit.com/r/SanDiegan/comments/x4...,chula vista police stopping cars going east on...,57
...,...,...,...,...,...,...,...,...,...
22673,22673,UCSD,graduate students needed for a quick 5 min survey,uggnor,Yall-re-nt,2022-05-02 03:35:17+00:00,https://www.reddit.com/r/UCSD/comments/uggnor/...,graduate students needed for a quick 5 min sur...,98
22674,22674,UCSD,any keshi fans want to help me with his sun go...,uggkev,-fflux,2022-05-02 03:29:40+00:00,https://www.reddit.com/r/UCSD/comments/uggkev/...,any keshi fans want to help me with his sun go...,88
22675,22675,sandiego,are there any abandoned buildings in san diego,uggkci,ForceInternational50,2022-05-02 03:29:33+00:00,https://www.reddit.com/r/sandiego/comments/ugg...,are there any abandoned buildings in san diego...,71
22676,22676,sandiego,cherimoya anyone seen them in stores,uggixq,Canary-Admirable,2022-05-02 03:26:56+00:00,https://www.reddit.com/r/sandiego/comments/ugg...,cherimoya anyone seen them in stores any asian...,14


In [4]:
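# Assign the result back instead of calling fillna(inplace=True) on a column of a slice,
# which can silently leave df unchanged.
df = df.copy()
df[col] = df[col].fillna('').astype('str')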

Import the agreed-upon crime and neighborhood corpora.

In [5]:
crime_corpus = pd.read_csv('../data/crime_corpus.csv')
# astype returns a new Series, so chain it with the strip and assign the result back.
crime_corpus['crime'] = crime_corpus['crime'].astype('str').str.strip()
crime_corpus.drop_duplicates(inplace = True)
crime_corpus.reset_index(drop = True, inplace = True)
crime_corpus.reset_index(inplace = True)

neighborhood_corpus = pd.read_csv('../data/neighborhood_corpus.csv')
neighborhood_corpus['neighborhood'] = neighborhood_corpus['neighborhood'].astype('str').str.strip()
# Drop duplicates before adding the 'index' column; once that column exists every row is unique.
neighborhood_corpus.drop_duplicates(inplace = True)
neighborhood_corpus.reset_index(drop = True, inplace = True)
neighborhood_corpus.reset_index(inplace = True)

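Initialize a vectorizer for each corpus with a fixed vocabulary. Because the vocabulary is supplied up front, no fitting is needed. Note that with the default `ngram_range=(1, 1)` the vectorizer only produces single-word tokens, so multi-word corpus entries cannot match at this step.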

In [6]:
crime_vectorizer = CountVectorizer(vocabulary = crime_corpus['crime'])
neighborhood_vectorizer = CountVectorizer(vocabulary = neighborhood_corpus['neighborhood'])

In [7]:
print(crime_vectorizer.get_feature_names_out()[0:10])
print(neighborhood_vectorizer.get_feature_names_out()[0:10])

['reckless driving' 'stolen vehicle log' 'ambulance call overdose'
 'abandoned refrigerator' 'calling for help' 'adw cost recovery'
 'stayout of area no radio trans' 'receive sell stolen prop' 'adw'
 'officer needs help']
['clairemont mesa east' 'clairemont mesa west' 'bay ho' 'north clairemont'
 'university city' 'bay park' 'mission beach' 'pacific beach'
 'mission bay park' 'la jolla']
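The vocabulary printed above contains multi-word entries such as 'reckless driving'. A minimal sketch of how `ngram_range` affects whether such entries can be counted is shown here; the two-entry vocabulary and the sample sentences are made up for illustration, and widening `ngram_range` is a possible adjustment, not something the notebook does.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy vocabulary and documents (illustrative only, not the real corpora).
vocab = ['theft', 'reckless driving']
docs = ['reckless driving and theft downtown', 'nothing to report']

# Default ngram_range=(1, 1): documents are split into single words only.
unigram_vec = CountVectorizer(vocabulary=vocab)
# ngram_range=(1, 2): single words and two-word phrases are both generated.
ngram_vec = CountVectorizer(vocabulary=vocab, ngram_range=(1, 2))

# With a fixed vocabulary, transform can be called without fit.
unigram_counts = unigram_vec.transform(docs).toarray()
ngram_counts = ngram_vec.transform(docs).toarray()

print(unigram_counts)  # the 'reckless driving' column stays at zero
print(ngram_counts)    # the 'reckless driving' column is counted
```

If multi-word exact matches are wanted, setting `ngram_range=(1, 4)` would line up with the n-gram sizes used for partial matching, at the cost of a slower transform.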


Find exact matches using the transform method.

In [8]:
%%time
crime_vectorizer_matches = crime_vectorizer.transform(df[col])
neighborhood_vectorizer_matches = neighborhood_vectorizer.transform(df[col])

CPU times: total: 3.52 s
Wall time: 3.78 s


Check the shape of the matches matrix to make sure it has the right dimensions.

In [9]:
print(crime_vectorizer_matches.shape)
print(df.shape[0], len(crime_corpus['crime']))

print(neighborhood_vectorizer_matches.shape)
print(df.shape[0], len(neighborhood_corpus['neighborhood']))

(22678, 712)
22678 712
(22678, 402)
22678 402


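Get the indices of the exact matches as a pandas dataframe. Here dfindex is the row position in df, and crimeindex or neighborhoodindex is the column position in the corresponding vocabulary.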

In [10]:
crime_exact_matches_df = pd.DataFrame({'dfindex': crime_vectorizer_matches.nonzero()[0], 
                                      'crimeindex': crime_vectorizer_matches.nonzero()[1]})
neighborhood_exact_matches_df = pd.DataFrame({'dfindex': neighborhood_vectorizer_matches.nonzero()[0], 
                                      'neighborhoodindex': neighborhood_vectorizer_matches.nonzero()[1]})
crime_exact_matches_df

Unnamed: 0,dfindex,crimeindex
0,2,674
1,4,171
2,4,307
3,4,367
4,4,672
...,...,...
17399,22672,409
17400,22672,421
17401,22672,453
17402,22677,332
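For readers unfamiliar with sparse matrices, here is a small sketch of what `nonzero()` returns and why it yields (row, column) pairs. The 3x3 matrix and the column name `termindex` are invented for the example.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy count matrix: 3 documents (rows) by 3 vocabulary terms (columns).
counts = csr_matrix([[0, 2, 0],
                     [0, 0, 0],
                     [1, 0, 3]])

# nonzero() returns two parallel arrays: row positions and column positions
# of every non-zero cell. Calling it once avoids scanning the matrix twice.
rows, cols = counts.nonzero()

pairs = pd.DataFrame({'dfindex': rows, 'termindex': cols})
print(pairs)
```

Document 1 has no non-zero cells, so it does not appear in `pairs` at all. The row values are positions in the matrix, not labels from the dataframe's index.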


In [11]:
neighborhood_exact_matches_df

Unnamed: 0,dfindex,neighborhoodindex
0,9,61
1,13,356
2,21,39
3,33,356
4,64,89
...,...,...
1376,22558,303
1377,22568,18
1378,22613,18
1379,22629,356


Merge the exact matches with the original corpora, then collapse the matches for each record into a list.

In [12]:
crime_exact_matches_df = crime_exact_matches_df.merge(crime_corpus, how = 'inner', left_on = 'crimeindex', right_index = True)
crime_exact_matches_df = crime_exact_matches_df.groupby(by = 'dfindex', as_index = False).agg({'crime': lambda x: x.tolist()})
crime_exact_matches_df

Unnamed: 0,dfindex,crime
0,2,[child]
1,4,"[police, shooting, street, investigations]"
2,5,"[street, traffic, poor, culture, buy]"
3,6,"[abuse, business]"
4,11,"[scam, evidence, mistake, money, security, wro..."
...,...,...
7875,22662,[kill]
7876,22663,[kill]
7877,22671,[court]
7878,22672,"[buy, money, steal, redeem, nazi]"


In [13]:
neighborhood_exact_matches_df = neighborhood_exact_matches_df.merge(neighborhood_corpus, how = 'inner', 
                                                                    left_on = 'neighborhoodindex', 
                                                                    right_index = True)
neighborhood_exact_matches_df = neighborhood_exact_matches_df.groupby(by = 'dfindex', 
                                                                      as_index = False).agg({'neighborhood': lambda x: x.tolist()})
neighborhood_exact_matches_df

Unnamed: 0,dfindex,neighborhood
0,9,[gaslamp]
1,13,[downtown]
2,21,[skyline]
3,33,[downtown]
4,64,[border]
...,...,...
1240,22558,[spectrum]
1241,22568,[miramar]
1242,22613,[miramar]
1243,22629,[downtown]
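The merge-then-collapse step can be sketched on toy data as follows. The three-term corpus and the index pairs are made up; only the pattern (merge on vocabulary position, then group by record and aggregate to a list) mirrors the cells above.

```python
import pandas as pd

# Toy corpus: the row label is the vocabulary position.
corpus = pd.DataFrame({'crime': ['theft', 'assault', 'fraud']})

# Toy (record, term) pairs, as produced from the non-zero cells.
pairs = pd.DataFrame({'dfindex': [0, 0, 2],
                      'crimeindex': [2, 0, 1]})

# Look up each term position in the corpus by its row label.
merged = pairs.merge(corpus, how='inner',
                     left_on='crimeindex', right_index=True)

# One row per record, with all matched terms gathered into a list.
collapsed = merged.groupby('dfindex', as_index=False).agg({'crime': list})
print(collapsed)
```

This only works if the corpus row labels run 0..n-1 in the same order as the vectorizer's vocabulary, which is why the corpus index is reset before the vectorizer is built.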


Get records that did not have exact matches to be used for partial matching

In [14]:
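# Compare against the matched row positions in 'dfindex', not the grouped frame's own 0..n-1 index.
# This assumes df['index'] equals the row position, which holds for the first half (is_part_1 = True).
crime_partial_matches_df = df[~df['index'].isin(crime_exact_matches_df['dfindex'])].copy(deep = True)
print(crime_partial_matches_df.shape)
neighborhood_partial_matches_df = df[~df['index'].isin(neighborhood_exact_matches_df['dfindex'])].copy(deep = True)
print(neighborhood_partial_matches_df.shape)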

(14798, 9)
(21433, 9)
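A small sketch of why the filter needs to compare against the `dfindex` values and not the grouped frame's own row labels. The toy frames are invented; after a `groupby(..., as_index=False)` the result is labeled 0..n-1 regardless of which records matched.

```python
import pandas as pd

# Toy posts: 'index' holds the row position of each post.
posts = pd.DataFrame({'index': [0, 1, 2, 3],
                      'post_text': ['a', 'b', 'c', 'd']})

# Toy exact matches: posts 1 and 3 matched, but the frame's own
# row labels are 0 and 1 because they were renumbered by groupby.
exact = pd.DataFrame({'dfindex': [1, 3],
                      'crime': [['theft'], ['fraud']]})

# Filtering on the row labels removes posts 0 and 1 (the wrong ones).
by_row_label = posts[~posts['index'].isin(exact.index)]

# Filtering on the dfindex column removes posts 1 and 3 (the matched ones).
by_dfindex = posts[~posts['index'].isin(exact['dfindex'])]

print(by_row_label['index'].tolist())
print(by_dfindex['index'].tolist())
```

Both filters remove the same number of rows, so the printed shapes alone cannot reveal which one was used.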


Set up the n-gram tokenizer

In [15]:
al_tok = sm.AlphabeticTokenizer()

al_tok.tokenize('hello! world?this is: our2 project*')

['hello', 'world', 'this', 'is', 'our', 'project']

In [16]:
def ngram_tokenize(txt, min_len = 1, max_len = 4):
    alph_toks = al_tok.tokenize(txt)
    tokens = []
    for i in range(min_len, max_len + 1):
        tokens += [' '.join(gram) for gram in list(ngrams(alph_toks, i))]
    return tokens

In [17]:
ngram_tokenize('hello! world?this is: our2 project*')

['hello',
 'world',
 'this',
 'is',
 'our',
 'project',
 'hello world',
 'world this',
 'this is',
 'is our',
 'our project',
 'hello world this',
 'world this is',
 'this is our',
 'is our project',
 'hello world this is',
 'world this is our',
 'this is our project']

Create a tokenized column for records that did not have exact matches.

In [18]:
crime_partial_matches_df[col+'_tok'] = crime_partial_matches_df[col].apply(lambda x: ngram_tokenize(x))
neighborhood_partial_matches_df[col+'_tok'] = neighborhood_partial_matches_df[col].apply(lambda x: ngram_tokenize(x))

In [19]:
crime_partial_matches_df[col+'_tok'].head()

2704    [hey, everyone, i, just, need, some, help, fig...
2705    [as, long, as, we, buy, from, reputable, store...
2706    [more, self, inflicted, stupidity, from, our, ...
2707    [be, careful, of, this, guy, with, two, pitbul...
2708    [hi, does, anyone, know, what, s, going, on, i...
Name: post_text_tok, dtype: object

In [20]:
neighborhood_partial_matches_df[col+'_tok'].head()

794    [porch, pirate, right, before, pm, still, dayl...
795    [was, at, balboa, park, with, my, son, i, usua...
796    [last, night, someone, came, into, my, yard, t...
797    [dead, cats, i, ve, lost, cats, now, in, the, ...
798    [looking, for, doggie, foster, home, my, best,...
Name: post_text_tok, dtype: object

Establish the similarity measure and the matching function.

In [21]:
jaro = sm.Jaro()
print(jaro.get_raw_score('the', 'theft'))

0.8666666746139526
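To make the score above less of a black box, here is a plain-Python sketch of the standard Jaro similarity. It is an illustrative re-implementation, not py_stringmatching's code, and the function name `jaro_similarity` is an assumption; returning 0.0 when either string is empty is also a choice made for this sketch.

```python
def jaro_similarity(s1, s2):
    """Standard Jaro similarity between two strings (illustrative sketch)."""
    if not s1 or not s2:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    s1_used = [False] * len(s1)
    s2_used = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not s2_used[j] and s2[j] == ch:
                s1_used[i] = s2_used[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    m1 = [c for c, used in zip(s1, s1_used) if used]
    m2 = [c for c, used in zip(s2, s2_used) if used]
    transpositions = sum(a != b for a, b in zip(m1, m2)) // 2
    return (matches / len(s1)
            + matches / len(s2)
            + (matches - transpositions) / matches) / 3


print(jaro_similarity('the', 'theft'))     # about 0.867, in line with the cell above
print(jaro_similarity('thefts', 'theft'))  # about 0.944
```

For 'the' against 'theft', all three characters match with no transpositions, giving (3/3 + 3/5 + 3/3) / 3. Note that 'thefts' against 'theft' comes out just under 0.95, so a threshold of 0.95 is strict enough to reject a simple plural.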


In [22]:
def get_partial_matches(toks, corpus, min_score = 0.95):
    matches = []
    for token in toks:
        for corp in corpus:
            if (jaro.get_raw_score(token, corp) >= min_score):
                matches.append(corp)
    matches = list(set(matches))
    return matches
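The function above scores every token against every corpus entry, including tokens that repeat within and across posts. One possible way to cut the work, sketched under assumptions, is to deduplicate tokens and cache the result per token. The helper name `make_partial_matcher` is hypothetical, and `difflib.SequenceMatcher` is used only as a stand-in scorer so the sketch runs without py_stringmatching; in the notebook the scorer would be `jaro.get_raw_score`.

```python
from difflib import SequenceMatcher
from functools import lru_cache


def make_partial_matcher(score_fn, corpus, min_score=0.95):
    """Build a matcher that remembers the corpus hits for each distinct token."""
    corpus = tuple(dict.fromkeys(corpus))  # unique entries, original order

    @lru_cache(maxsize=None)
    def matches_for_token(token):
        # Scored once per distinct token, then served from the cache.
        return frozenset(c for c in corpus if score_fn(token, c) >= min_score)

    def get_partial_matches(toks):
        found = set()
        for token in set(toks):  # skip repeated tokens within one record
            found |= matches_for_token(token)
        return sorted(found)

    return get_partial_matches


# Stand-in scorer for the demo (NOT Jaro): difflib's similarity ratio.
def ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()


matcher = make_partial_matcher(ratio, ['theft', 'burglary'], min_score=0.9)
result = matcher(['thefts', 'the', 'theft', 'theft'])
print(result)
```

The cache grows with the number of distinct n-grams, so memory use is the trade-off for fewer scorer calls.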

Get the partial matches. Only the crime pass is run here (about 23 minutes); the neighborhood pass is left commented out.

In [23]:
%%time
crime_partial_matches_df['crime'] = crime_partial_matches_df[col+'_tok'].apply(lambda x: 
                                                                                            get_partial_matches(x, 
                                                                                                                crime_corpus['crime'].to_list()))

CPU times: total: 23min
Wall time: 23min 9s


In [24]:
'''
%%time
neighborhood_partial_matches_df['neighborhood'] = neighborhood_partial_matches_df[col+'_tok'].apply(lambda x: 
                                                                                            get_partial_matches(x, 
                                                                                                                neighborhood_corpus['neighborhood'].to_list()))
'''

Concatenate the exact and partial crime matches, then join them back onto the original dataframe.

In [32]:
crime_exact_matches_df.rename(columns={"dfindex": "index"}, inplace = True)

In [33]:
crime_matches = pd.concat([crime_exact_matches_df[['index', 'crime']], 
                           crime_partial_matches_df[['index', 'crime']]])

In [None]:
df_crime = df.merge(crime_matches, left_on = 'index', right_on = 'index', how = 'left')
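A toy sketch of the concatenate-and-join step, with an added guard. The frames are invented, and the `validate` argument is an optional safety check that the notebook does not use; it raises an error if a post appears in both the exact and partial sets.

```python
import pandas as pd

# Toy posts and toy match tables keyed by the post's 'index' value.
posts = pd.DataFrame({'index': [0, 1, 2],
                      'post_text': ['a', 'b', 'c']})
exact = pd.DataFrame({'index': [0], 'crime': [['theft']]})
partial = pd.DataFrame({'index': [1, 2], 'crime': [['fraud'], []]})

# Stack both match tables; each post should appear at most once.
matches = pd.concat([exact, partial], ignore_index=True)

# Left join keeps every post; validate guards against duplicated keys.
merged = posts.merge(matches, on='index', how='left',
                     validate='one_to_one')
print(merged)
```

A post that went through partial matching without any hit carries an empty list, while a post missing from both tables would carry NaN after the left join.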