# Police nodes


# reddit nodes (NECESSARY, TBD)
## posts
* source: merged_df found below in reddit sections
* node id: post id
* attributes
    * url
    * post title
    * keywords (may be empty)
        * format with ";" delimiter
    * (future ideas: total comments, upvotes, has media, etc.)
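
A minimal sketch of the post-node spec above, assuming the files are loaded with `neo4j-admin import` (whose default array delimiter is `;`). The toy frame stands in for `merged_df`; the `reddit_post` label, the `post_id:ID` and `keywords:string[]` headers, and the helper `format_keywords` are illustrative choices, not something already fixed in this notebook.

```python
import ast

import pandas as pd


def format_keywords(raw) -> str:
    """Turn a stringified Python list such as "['a', 'b']" into "a;b"."""
    if not isinstance(raw, str) or not raw.strip():
        return ""  # NaN or empty cell: the post has no keywords
    keywords = ast.literal_eval(raw)
    # a ";" inside a keyword would split it in two on import, so replace it
    return ";".join(str(k).replace(";", " ").strip() for k in keywords)


# toy stand-in for merged_df (values are illustrative only)
toy_df = pd.DataFrame(
    {
        "post_id": ["p1", "p2"],
        "full_link": ["https://example.com/p1", "https://example.com/p2"],
        "title": ["first post", "second post"],
        "keywords": ["['grand ave', 'pb']", None],
    }
)

# one row per post, columns named the way the import tool reads headers
post_nodes = pd.DataFrame(
    {
        "post_id:ID": toy_df["post_id"],
        "url": toy_df["full_link"],
        "title": toy_df["title"],
        "keywords:string[]": toy_df["keywords"].apply(format_keywords),
        ":LABEL": "reddit_post",
    }
)
print(post_nodes)
```

Posts without keywords end up with an empty string, which matches the "may be empty" note in the spec.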

## users (NOT NECESSARY, TBD, NICE TO HAVE)
* source: original cleaned reddit data
* node id: author
* attributes
    * has_posted
    * has_commented
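
One possible way to derive the two user flags, sketched on toy data. The posts frame mirrors the `post_author` column of the cleaned Reddit data; the comments frame and its `comment_author` column are hypothetical, since no comment table is loaded in this notebook.

```python
import pandas as pd

# toy stand-ins; "comment_author" is an assumed column name
posts = pd.DataFrame({"post_author": ["ann", "bob"]})
comments = pd.DataFrame({"comment_author": ["bob", "cat"]})

# every author seen in either table becomes one user node
authors = sorted(set(posts["post_author"]) | set(comments["comment_author"]))

user_nodes = pd.DataFrame({"author": authors})
user_nodes["has_posted"] = user_nodes["author"].isin(posts["post_author"])
user_nodes["has_commented"] = user_nodes["author"].isin(comments["comment_author"])
print(user_nodes)
```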

# nextdoor nodes (NECESSARY, TBD)
* source: nd_keywords_ner.csv
* node_id: post_id
* attributes
    * ShortLink (url)
    * keywords
    * no post title (the Nextdoor data has no title column)

# police data nodes (NECESSARY, DONE)
* node_id: incident_id
* attributes
    * priority
    * crime_type (we manually populated)

# Crime corpus nodes (NECESSARY, DONE)
* node: crime type

# Neighborhood corpus nodes (NECESSARY, DONE)
* node: neighborhood location

# Time nodes (DONE)

# Relationships

## Reddit (NOT NECESSARY, nice to have)
* start_id = user id
* end_id = post id
* type: comment, post

## Crime (BELONGS_TO) (NECESSARY)
* start_id = crime post/call (reddit, nextdoor, police data)
* end_id = crime node (crime corpus)
* source type = reddit, nextdoor, police calls (:TYPE)
* time type = time bin
* neighborhood type?
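
A rough sketch of how a BELONGS_TO relationship file could be assembled. It assumes a post belongs to a crime node when one of its `;`-delimited keywords exactly equals a crime corpus entry; that matching rule is an assumption, not a decision recorded in these notes. Here `:TYPE` carries the relationship name and the data source is kept as a `source` property; putting the source in `:TYPE` instead, as the notes hint, would only change the last two columns.

```python
import pandas as pd

# toy crime nodes, shaped like nodes_crime.csv
crime_nodes = pd.DataFrame(
    {":ID": [1, 2], "crime": ["reckless driving", "stolen vehicle log"]}
)

# toy posts from two sources with ";"-delimited keywords
posts = pd.DataFrame(
    {
        "post_id": ["p1", "p2", "nd1"],
        "source": ["reddit", "reddit", "nextdoor"],
        "keywords": ["reckless driving;grand ave", "", "stolen vehicle log"],
    }
)

# one row per (post, keyword) pair
exploded = posts.assign(keyword=posts["keywords"].str.split(";")).explode("keyword")

# keep only keywords that exactly match a crime corpus entry
matches = exploded.merge(crime_nodes, left_on="keyword", right_on="crime", how="inner")

belongs_to = pd.DataFrame(
    {
        ":START_ID": matches["post_id"],
        ":END_ID": matches[":ID"],
        "source": matches["source"],
        ":TYPE": "BELONGS_TO",
    }
)
print(belongs_to)
```

The same pattern should carry over to HAPPENED_IN by swapping the crime nodes for the neighborhood nodes.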

## Crime (HAPPENED_IN) (NECESSARY)
* start_id = crime post/call
* end_id = neighborhood node
* source type = reddit, nextdoor, police calls (:TYPE)

## Crime (HAPPENED_AT) (TBD, nice to have)
* start_id = crime post/call
* end_id = time
* source type = reddit, nextdoor, police calls (:TYPE)

## Other relationships
* candidate topics: ethnicity, drugs?, immigration?
* general pattern:
    * start_id: reddit, nextdoor, or police node
    * end_id: corpus node

In [1]:
# import libraries
from pathlib import Path

import pandas as pd

In [2]:
# set paths
data_p             = Path("../data")

corpi_p            = data_p / "corpi"
neighborhood_p     = corpi_p / "neighborhood_corpus.csv"
crime_p            = corpi_p / "crime_corpus.csv"

reddit_processed_p = data_p / "processed_reddit_data"

# create out path
out_p = data_p / "neo4j_files"
out_p.mkdir(exist_ok=True)

node_p = out_p / "nodes"
node_p.mkdir(exist_ok=True)

relations_p = out_p / "relationships"
relations_p.mkdir(exist_ok=True)

## Make Corpus Nodes

In [3]:
# read in neighborhood corpus and write to node file
neighborhood_df = pd.read_csv(neighborhood_p)

# prepare corpus csv
neighborhood_df[":ID"] = neighborhood_df.index + 1
neighborhood_df[":LABEL"] = "neighborhood"

# rearrange columns
neighborhood_df = neighborhood_df[[":ID", "neighborhood", ":LABEL"]]

# Write out node csv
neighborhood_out_p = node_p / "nodes_neighborhood.csv"
neighborhood_df.to_csv(neighborhood_out_p, index=False)

neighborhood_df.head()

Unnamed: 0,:ID,neighborhood,:LABEL
0,1,clairemont mesa east,neighborhood
1,2,clairemont mesa west,neighborhood
2,3,bay ho,neighborhood
3,4,north clairemont,neighborhood
4,5,university city,neighborhood


In [4]:
# read in crime corpus and write to node file
crime_df = pd.read_csv(crime_p)

# prepare corpus csv
crime_df[":ID"] = crime_df.index + 1
# crime_df[":LABEL"] = # want to add crime type to everything in corpus...

# rearrange columns
crime_df = crime_df[[":ID", "crime"]]

# Write out node csv
crime_out_p = node_p / "nodes_crime.csv"
crime_df.to_csv(crime_out_p, index=False)

crime_df.head()

Unnamed: 0,:ID,crime
0,1,reckless driving
1,2,stolen vehicle log
2,3,ambulance call overdose
3,4,abandoned refrigerator
4,5,calling for help
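
The commented-out `:LABEL` line in the crime cell suggests the crime nodes still need a label. Both corpus files also number their rows from 1, and `neo4j-admin import` expects IDs to be unique within an ID space (a single global space unless one is named in the header). The sketch below shows one way to handle both, assuming that import tool is used; the helper `to_nodes`, the ID-space names, and the `crime` label are illustrative.

```python
import pandas as pd

# toy stand-ins for the two corpus files
neighborhood_df = pd.DataFrame({"neighborhood": ["bay ho", "university city"]})
crime_df = pd.DataFrame({"crime": ["reckless driving", "calling for help"]})


def to_nodes(df: pd.DataFrame, name_col: str, label: str, id_space: str) -> pd.DataFrame:
    """Build a node table with a 1-based ID scoped to its own ID space."""
    nodes = df[[name_col]].copy()
    nodes.insert(0, f":ID({id_space})", list(range(1, len(nodes) + 1)))
    nodes[":LABEL"] = label
    return nodes


neighborhood_nodes = to_nodes(neighborhood_df, "neighborhood", "neighborhood", "Neighborhood")
crime_nodes = to_nodes(crime_df, "crime", "crime", "Crime")
print(crime_nodes)
```

With named ID spaces, relationship files would refer to them in their headers as well, for example `:END_ID(Crime)`.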


## Reddit Prep

### Reddit: Merge NER and Rake results

In [5]:
ner_p = reddit_processed_p / "cleaned_reddit_ner_12-21_to_1115.csv"
keywords_p = reddit_processed_p / "keyword_extraction.csv"

In [6]:
ner_df = pd.read_csv(ner_p)
print(f"Total observations: {ner_df.shape[0]}")

# drop unnamed index
ner_df.drop(columns=['Unnamed: 0'], inplace=True)

ner_df.head()

Total observations: 43421


Unnamed: 0,subreddit,title,post_id,post_author,post_utc,full_link,post_text,post_text_count,ORG,DATE,EVENT,FAC,GPE,LANGUAGE,LAW,LOC,NORP,PERSON,TIME
0,sandiego,going to visit san diego next week any places...,x4nzh2,Fearmkultra,2022-09-03 06:57:58+00:00,https://www.reddit.com/r/sandiego/comments/x4n...,going to visit san diego next week any places ...,12,['san diego'],['next week'],,,,,,,,,
1,sandiego,whaley house picture of ghost,x4ntm7,Open_Construction_31,2022-09-03 06:47:09+00:00,https://www.reddit.com/r/sandiego/comments/x4n...,whaley house picture of ghost as a kid i saw t...,199,"['whaley house', 'the whaley house']","['13', '25 yrs ago']",,,['san diegans'],,,,,,"['a minute later', 'late nightearly morning']"
2,sandiego,language exchange,x4n6xv,Poshorock,2022-09-03 06:07:46+00:00,https://www.reddit.com/r/sandiego/comments/x4n...,language exchange is there someone by there wh...,31,,,,,,['english'],,,['spanish'],['san diego'],
3,SanDiegan,chula vista police stopping cars going east on...,x4n5aj,kaptaincorn,2022-09-03 06:04:54+00:00,https://www.reddit.com/r/SanDiegan/comments/x4...,chula vista police stopping cars going east on...,57,,,,,['chula vista'],,,,,,
4,SanDiegan,todd gloria finalizes plan to change park blvd...,x4n2rv,Lemonade_IceCold,2022-09-03 06:00:38+00:00,https://www.reddit.com/r/SanDiegan/comments/x4...,todd gloria finalizes plan to change park blvd...,666,['gtonly'],,,['balboa park'],,,,,['north american'],"['todd gloria', 'kevin']",


In [7]:
ner_df.columns

Index(['subreddit', 'title', 'post_id', 'post_author', 'post_utc', 'full_link',
       'post_text', 'post_text_count', 'ORG', 'DATE', 'EVENT', 'FAC', 'GPE',
       'LANGUAGE', 'LAW', 'LOC', 'NORP', 'PERSON', 'TIME'],
      dtype='object')

In [8]:
keywords_df = pd.read_csv(keywords_p)
print(f"Total observations: {keywords_df.shape[0]}")
keywords_df.drop(columns=['post_text'], inplace=True)
keywords_df.head()

Total observations: 31415


Unnamed: 0,post_id,keywords
0,x4ntm7,"['suddenly appeared', 'something hard', 'smoke..."
1,x4n6xv,"['language exchange', 'practice spanish', 'pra..."
2,x4n5aj,"['grand ave', 'seen', 'pb', 'holidays', 'end',..."
3,x4n2rv,"['zoo uptown', 'working class', 'traffic elsew..."
4,x4mz7c,"['verbal abuse', 'sell anything', 'extreme win..."


In [9]:
# left merge keeps every NER row, including posts without keywords
merged_df = ner_df.merge(keywords_df, on="post_id", how="left")
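
`ner_df` reports 43421 observations, yet `merged_df` is displayed with row labels running up to 43623, so the left merge appears to add rows. One plausible cause is repeated `post_id` values in the keywords file; this has not been checked here. The toy example below shows the effect and one way to guard against it, using `drop_duplicates` and the `validate` argument of `merge`.

```python
import pandas as pd

# toy stand-ins; "a" appears twice on the keywords side
ner_toy = pd.DataFrame({"post_id": ["a", "b", "c"], "title": ["t1", "t2", "t3"]})
keywords_toy = pd.DataFrame(
    {"post_id": ["a", "a", "b"], "keywords": ["['x']", "['x']", "['y']"]}
)

# a plain left merge repeats the NER row for every duplicate key
naive = ner_toy.merge(keywords_toy, on="post_id", how="left")
print(len(ner_toy), len(naive))  # 3 4

# keep one keywords row per post, then let pandas verify the key is unique
deduped = keywords_toy.drop_duplicates(subset="post_id", keep="first")
merged = ner_toy.merge(deduped, on="post_id", how="left", validate="m:1")
print(len(merged))  # 3
```

`validate="m:1"` raises a `MergeError` if the right-hand keys are not unique, which makes a silent row blow-up harder to miss.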

### Make Reddit nodes based on merged df across ner and keywords

In [10]:
merged_df

Unnamed: 0,subreddit,title,post_id,post_author,post_utc,full_link,post_text,post_text_count,ORG,DATE,EVENT,FAC,GPE,LANGUAGE,LAW,LOC,NORP,PERSON,TIME,keywords
0,sandiego,going to visit san diego next week any places...,x4nzh2,Fearmkultra,2022-09-03 06:57:58+00:00,https://www.reddit.com/r/sandiego/comments/x4n...,going to visit san diego next week any places ...,12,['san diego'],['next week'],,,,,,,,,,
1,sandiego,whaley house picture of ghost,x4ntm7,Open_Construction_31,2022-09-03 06:47:09+00:00,https://www.reddit.com/r/sandiego/comments/x4n...,whaley house picture of ghost as a kid i saw t...,199,"['whaley house', 'the whaley house']","['13', '25 yrs ago']",,,['san diegans'],,,,,,"['a minute later', 'late nightearly morning']","['suddenly appeared', 'something hard', 'smoke..."
2,sandiego,language exchange,x4n6xv,Poshorock,2022-09-03 06:07:46+00:00,https://www.reddit.com/r/sandiego/comments/x4n...,language exchange is there someone by there wh...,31,,,,,,['english'],,,['spanish'],['san diego'],,"['language exchange', 'practice spanish', 'pra..."
3,SanDiegan,chula vista police stopping cars going east on...,x4n5aj,kaptaincorn,2022-09-03 06:04:54+00:00,https://www.reddit.com/r/SanDiegan/comments/x4...,chula vista police stopping cars going east on...,57,,,,,['chula vista'],,,,,,,"['grand ave', 'seen', 'pb', 'holidays', 'end',..."
4,SanDiegan,todd gloria finalizes plan to change park blvd...,x4n2rv,Lemonade_IceCold,2022-09-03 06:00:38+00:00,https://www.reddit.com/r/SanDiegan/comments/x4...,todd gloria finalizes plan to change park blvd...,666,['gtonly'],,,['balboa park'],,,,,['north american'],"['todd gloria', 'kevin']",,"['zoo uptown', 'working class', 'traffic elsew..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43620,UCSD,la jolla donor makes 50m research t that could...,scdqum,Yeezy75024,2022-01-25 13:28:21+00:00,https://www.reddit.com/r/UCSD/comments/scdqum/...,la jolla donor makes 50m research t that could...,74,"['usc the san diego uniontribune i', 'usc']",,,,"['la jolla', 'san diego lmao']",,,,,"['usc', 'usc']",,"['wasnt aware', 'san diego', 'never wondered',..."
43621,UCSD,new covid variant detected in at least 40 diff...,sca7fv,Yeezy75024,2022-01-25 09:58:30+00:00,https://www.reddit.com/r/UCSD/comments/sca7fv/...,new covid variant detected in at least 40 diff...,93,,['every year'],,,,,,,,['wpec idk'],,"['sigma variant', 'new shot', 'like omicron', ..."
43622,sandiego,tmz baltimore maggots leaked video twitter sca...,sc9b5t,EdgeIQ,2022-01-25 08:54:03+00:00,https://www.reddit.com/r/sandiego/comments/sc9...,tmz baltimore maggots leaked video twitter sca...,14,['tmz baltimore'],,,,,,,,,['santosogerio'],,
43623,UCSD,mailing services while school’s online,sc90i4,esppperanza,2022-01-25 08:32:43+00:00,https://www.reddit.com/r/UCSD/comments/sc90i4/...,mailing services while school’s online hey eve...,223,['clownface'],"['a couple weeks ago', 'the quarter', 'last we...",,,['hahaha'],,,,,,,"['thing thankfully', 'theyre forwarding', 'pre..."


### Reddit Nodes and Relationships

## Nextdoor Nodes and Relationships

In [11]:
# source: ../data/processed_nextdoor_data/nd_keywords_ner.csv
a = pd.read_csv(data_p / "processed_nextdoor_data" / "nd_keywords_ner.csv")
a.head()

Unnamed: 0,post_id,ShortLink,Author,post_text,post_text_count,Neighborhood,PERSON,TIME,DATE,ORG,...,GPE,FAC,LOC,LAW,LANGUAGE,EVENT,keywords,crime_score,ethnicity_score,neighborhood_score
0,nd1,https://nextdoor.com/p/--3jc5nsXN58?view=detail,Hannah Lopez,how late can people be working on construction...,131,Corridor,tapebill,,,,...,,,,,,,"['willful violation', 'news trying', 'means ca...",0.005391,0.0,0.0
1,nd2,https://nextdoor.com/p/--mjpdwdS3yx?view=detail,Tim Welch,rain has finally arrived in north park but las...,280,Montclair,"['chad jeremy 1964yeah', 'nicolas cage']",only 3 minutes,"['tomorrow', 'yesterday', 'about two months la...",like.humidity,...,"['china', 'san miguel de allende']",,,,,,"['“ yeah', 'vehicles chance', 'shall rebuild',...",0.0,0.0,0.002079
2,nd3,https://nextdoor.com/p/-3GwdKj4_sMm?view=detail,News,dont we have a water shortage... jennifer that...,1250,,"['jennifer', 'zanyface', 'agendawalter', 'wate...",,"['a day', '2 years ago', '5000 a month', '13',...","['sandags series', 'angelescarol dellangela']",...,"['san francisco', 'san diego', 'differently.go...",,,,,,"['… enough', 'water usage', 'water situation',...",0.008448,0.0,0.0
3,nd4,https://nextdoor.com/p/-4qn3_2yNk_Y?view=detail,Frank Negrete,guess nd didnt like my question about drinking...,82,Hillcrest Northeast,"['ndi’d', 'moderatorselectra hendrickson']",,,,...,,,,,,,"['public facewithtearsofjoy', 'faces bios', 'd...",0.0,0.0,0.0
4,nd5,https://nextdoor.com/p/-5-J-BXgJ84y?view=detail,Dawn Burton,day time robbery marston hillsupdate. update u...,1853,Hillcrest Southeast,"['max', 'insanitylaurie hewitt', 'pam lauri', ...","['530 pm', 'morning', 'night', 'around midnigh...","['a month ago', 'age 2030', 'feb 26', 'about t...","['marston', 'nextdoor wvideo', 'dogood', 'your...",...,"['california', 'california', 'essex st', 'verm...",,,,,,"['yet nothing', 'violent felonies', 'unlawful ...",0.042534,0.0,0.0


In [12]:
a.fillna("", inplace=True)
# strip the list brackets as literal characters; regex=False avoids the regex-default warning
a.ORG = a.ORG.str.replace("[", "", regex=False)
a.ORG = a.ORG.str.replace("]", "", regex=False)
a


Unnamed: 0,post_id,ShortLink,Author,post_text,post_text_count,Neighborhood,PERSON,TIME,DATE,ORG,...,GPE,FAC,LOC,LAW,LANGUAGE,EVENT,keywords,crime_score,ethnicity_score,neighborhood_score
0,nd1,https://nextdoor.com/p/--3jc5nsXN58?view=detail,Hannah Lopez,how late can people be working on construction...,131,Corridor,tapebill,,,,...,,,,,,,"['willful violation', 'news trying', 'means ca...",0.005391,0.0,0.0
1,nd2,https://nextdoor.com/p/--mjpdwdS3yx?view=detail,Tim Welch,rain has finally arrived in north park but las...,280,Montclair,"['chad jeremy 1964yeah', 'nicolas cage']",only 3 minutes,"['tomorrow', 'yesterday', 'about two months la...",like.humidity,...,"['china', 'san miguel de allende']",,,,,,"['“ yeah', 'vehicles chance', 'shall rebuild',...",0.0,0.0,0.002079
2,nd3,https://nextdoor.com/p/-3GwdKj4_sMm?view=detail,News,dont we have a water shortage... jennifer that...,1250,,"['jennifer', 'zanyface', 'agendawalter', 'wate...",,"['a day', '2 years ago', '5000 a month', '13',...","'sandags series', 'angelescarol dellangela'",...,"['san francisco', 'san diego', 'differently.go...",,,,,,"['… enough', 'water usage', 'water situation',...",0.008448,0.0,0.0
3,nd4,https://nextdoor.com/p/-4qn3_2yNk_Y?view=detail,Frank Negrete,guess nd didnt like my question about drinking...,82,Hillcrest Northeast,"['ndi’d', 'moderatorselectra hendrickson']",,,,...,,,,,,,"['public facewithtearsofjoy', 'faces bios', 'd...",0.0,0.0,0.0
4,nd5,https://nextdoor.com/p/-5-J-BXgJ84y?view=detail,Dawn Burton,day time robbery marston hillsupdate. update u...,1853,Hillcrest Southeast,"['max', 'insanitylaurie hewitt', 'pam lauri', ...","['530 pm', 'morning', 'night', 'around midnigh...","['a month ago', 'age 2030', 'feb 26', 'about t...","'marston', 'nextdoor wvideo', 'dogood', 'yours...",...,"['california', 'california', 'essex st', 'verm...",,,,,,"['yet nothing', 'violent felonies', 'unlawful ...",0.042534,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2803,nd2817,https://nextdoor.com/p/zyBKcPsfG8p4?view=detail,Lisa Busalacchi,got this text today… since i’m expecting some ...,1026,Del Cerro Hearst,"['hearthands mediumdarkskintone', 'san dawggot...",,"['today', 'the past couple of years', 'yesterd...","'amazon', 'amazon', 'reader.i'",...,"['florida', 'san dawg', 'ssn', 'alberta', 'ita...",,,,,,"['xxx amount', 'vacation home', 'uspspaula abs...",0.003181,0.0,0.0
2804,nd2818,https://nextdoor.com/p/zzWdg8FDxMw4?view=detail,Eleanor Jacobs,this has to be sketchy scammy don’t click dele...,171,North Park Burlingame/Altadena,['yiunever'],,,'amazon',...,,,,,,,"['sketchy scammy', 'senders email', 'scammicha...",0.001335,0.0,0.0
2805,nd2819,https://nextdoor.com/p/zzYsgLb5T2sb?view=detail,Rosie Hin,hi everyone. this is my baby 3 ish month old c...,63,University Heights Antique Row N,['charlie'],,"['3 ish month old', '25 yrs old']",,...,,,,,,,"['‘ charlie', 'yrs old', 'outcarol thank', 'hi...",0.0,0.0,0.0
2806,nd2820,https://nextdoor.com/p/zzgTmx49yTM4?view=detail,Grace Joseph,ah yes 1148pm. the perfect time to play the ga...,68,North Park Burlingame/Altadena,"['1148pm', 'wonderfulit']",['all a few seconds'],,'gworek',...,,,,,,,"['tongueincheekdarn fireworks', 'seconds apart...",0.002759,0.0,0.0


In [13]:
a.ORG = a.ORG.apply(lambda x: ';'.join(x.split(",")))
a

Unnamed: 0,post_id,ShortLink,Author,post_text,post_text_count,Neighborhood,PERSON,TIME,DATE,ORG,...,GPE,FAC,LOC,LAW,LANGUAGE,EVENT,keywords,crime_score,ethnicity_score,neighborhood_score
0,nd1,https://nextdoor.com/p/--3jc5nsXN58?view=detail,Hannah Lopez,how late can people be working on construction...,131,Corridor,tapebill,,,,...,,,,,,,"['willful violation', 'news trying', 'means ca...",0.005391,0.0,0.0
1,nd2,https://nextdoor.com/p/--mjpdwdS3yx?view=detail,Tim Welch,rain has finally arrived in north park but las...,280,Montclair,"['chad jeremy 1964yeah', 'nicolas cage']",only 3 minutes,"['tomorrow', 'yesterday', 'about two months la...",like.humidity,...,"['china', 'san miguel de allende']",,,,,,"['“ yeah', 'vehicles chance', 'shall rebuild',...",0.0,0.0,0.002079
2,nd3,https://nextdoor.com/p/-3GwdKj4_sMm?view=detail,News,dont we have a water shortage... jennifer that...,1250,,"['jennifer', 'zanyface', 'agendawalter', 'wate...",,"['a day', '2 years ago', '5000 a month', '13',...",'sandags series'; 'angelescarol dellangela',...,"['san francisco', 'san diego', 'differently.go...",,,,,,"['… enough', 'water usage', 'water situation',...",0.008448,0.0,0.0
3,nd4,https://nextdoor.com/p/-4qn3_2yNk_Y?view=detail,Frank Negrete,guess nd didnt like my question about drinking...,82,Hillcrest Northeast,"['ndi’d', 'moderatorselectra hendrickson']",,,,...,,,,,,,"['public facewithtearsofjoy', 'faces bios', 'd...",0.0,0.0,0.0
4,nd5,https://nextdoor.com/p/-5-J-BXgJ84y?view=detail,Dawn Burton,day time robbery marston hillsupdate. update u...,1853,Hillcrest Southeast,"['max', 'insanitylaurie hewitt', 'pam lauri', ...","['530 pm', 'morning', 'night', 'around midnigh...","['a month ago', 'age 2030', 'feb 26', 'about t...",'marston'; 'nextdoor wvideo'; 'dogood'; 'yours...,...,"['california', 'california', 'essex st', 'verm...",,,,,,"['yet nothing', 'violent felonies', 'unlawful ...",0.042534,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2803,nd2817,https://nextdoor.com/p/zyBKcPsfG8p4?view=detail,Lisa Busalacchi,got this text today… since i’m expecting some ...,1026,Del Cerro Hearst,"['hearthands mediumdarkskintone', 'san dawggot...",,"['today', 'the past couple of years', 'yesterd...",'amazon'; 'amazon'; 'reader.i',...,"['florida', 'san dawg', 'ssn', 'alberta', 'ita...",,,,,,"['xxx amount', 'vacation home', 'uspspaula abs...",0.003181,0.0,0.0
2804,nd2818,https://nextdoor.com/p/zzWdg8FDxMw4?view=detail,Eleanor Jacobs,this has to be sketchy scammy don’t click dele...,171,North Park Burlingame/Altadena,['yiunever'],,,'amazon',...,,,,,,,"['sketchy scammy', 'senders email', 'scammicha...",0.001335,0.0,0.0
2805,nd2819,https://nextdoor.com/p/zzYsgLb5T2sb?view=detail,Rosie Hin,hi everyone. this is my baby 3 ish month old c...,63,University Heights Antique Row N,['charlie'],,"['3 ish month old', '25 yrs old']",,...,,,,,,,"['‘ charlie', 'yrs old', 'outcarol thank', 'hi...",0.0,0.0,0.0
2806,nd2820,https://nextdoor.com/p/zzgTmx49yTM4?view=detail,Grace Joseph,ah yes 1148pm. the perfect time to play the ga...,68,North Park Burlingame/Altadena,"['1148pm', 'wonderfulit']",['all a few seconds'],,'gworek',...,,,,,,,"['tongueincheekdarn fireworks', 'seconds apart...",0.002759,0.0,0.0
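
The bracket stripping and comma split above leave the quotes and a leading space in each ORG entry (for example `'amazon'; 'amazon'; 'reader.i'`), and any organization name containing a comma would be split in two. An alternative, sketched here, is to parse each cell as a Python list and join the items. The helper name `list_cell_to_delimited` is hypothetical, and the fallback covers cells that hold a bare string such as `like.humidity` rather than a list.

```python
import ast

import pandas as pd


def list_cell_to_delimited(raw, sep: str = ";") -> str:
    """Convert "['a', 'b']" to "a;b"; pass bare strings through unchanged."""
    if not isinstance(raw, str) or not raw.strip():
        return ""  # NaN or empty cell
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw.strip()  # not a Python literal, e.g. "like.humidity"
    if isinstance(parsed, (list, tuple)):
        return sep.join(str(item).strip() for item in parsed)
    return str(parsed).strip()


# toy ORG column with the cell shapes seen in the Nextdoor data
org = pd.Series(
    ["['sandags series', 'angelescarol dellangela']", "like.humidity", "", "['amazon']"]
)
cleaned = org.apply(list_cell_to_delimited)
print(cleaned.tolist())
```

The same helper could be applied to the other list-valued columns (PERSON, GPE, keywords) before writing the node file.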
