# 4. Data Cleaning Part IV: Adding Location Coordinates for Crimes and Extracting Time Differences Between Crime Time and Email Time
The goal of this notebook is to finish data cleaning of emails, resulting in a final data frame, **certain_crimes** that contains emails with a crime, date, time, and location. These crimes will then be visualized through Folium, giving a clear indication of crimes reported by WarnMe system
- [Section A: Assigning Coordinates For Each Location](#Section-A:-Assigning-Coordinates-For-Each-Location)
- [Section B: Finding Time Differences Between Crime Occurrence and Email Time](#Section-B:-Finding-Time-Differences-Between-Crime-Occurrence-and-Email-Time)

In [1]:
import pandas as pd 
import numpy as np

In [2]:
crimes_df = pd.read_csv('complete_crimes.csv')
crimes_df.head(10)

Unnamed: 0,Subject,Body,date of crime,time of crime,email time,email day of week,email date
0,Burglary at University Village: Albany (UVA),<https://oem.berkeley.edu/sites/default/files...,06-17-2021,02:09,04:02,Thursday,06-17-2021
1,"Arson Reported at 2650 Haste St., Berkeley CA ...",<https://oem.berkeley.edu/sites/default/files...,06-16-2021,05:20,10:10,Wednesday,06-16-2021
2,Violent Crime Reported at 3100 Block of Dwight...,<https://oem.berkeley.edu/sites/default/files...,06-08-2021,13:15,16:51,Tuesday,06-08-2021
3,"Burglary at Botanical Gardens, 200 Centennial ...",<https://oem.berkeley.edu/sites/default/files...,05-10-2021,10:30,12:03,Sunday,05-30-2021
4,Violent Crime Reported at Channing Way/ Colleg...,<https://oem.berkeley.edu/sites/default/files...,05-18-2021,15:50,23:23,Tuesday,05-18-2021
5,Burglary at Clark Kerr Campus building 23,<https://oem.berkeley.edu/sites/default/files...,07-01-2021,18:07,19:32,Thursday,07-01-2021
6,Violent Crime Reported at 1700 block of Spruce...,<https://oem.berkeley.edu/sites/default/files...,07-20-2021,00:14,01:04,Tuesday,07-20-2021
7,UC Berkeley WarnMe: Thank you for doing your p...,<https://oem.berkeley.edu/sites/default/files...,07-12-2021,16:06,16:07,Monday,07-12-2021
8,UC Berkeley WarnMe: Please reduce power usage ...,<https://oem.berkeley.edu/sites/default/files...,07-09-2021,21:40,21:41,Friday,07-09-2021
9,Violent Crime Reported at West Crescent - Plea...,<https://oem.berkeley.edu/sites/default/files...,07-31-2021,15:53,17:02,Saturday,07-31-2021


In [3]:
sum(crimes_df['Body'].str.extract(r'(occurred)').isna()[0].astype(int))
crimes_df['not crime'] = crimes_df['Body'].str.extract(r'(occurred)').isna()

In [4]:
certain_crimes = crimes_df[crimes_df['not crime'] == False].reset_index().drop(columns = 'index')
test_crimes = crimes_df[crimes_df['not crime'] == True].reset_index().drop(columns = 'index')

# creating two data frames based on email body information: if there is no 'occurred' in email body, then put in test_crimes df, where will check if any of the emails
# concern a crime:

# listo = test_crimes['Body'].to_list()
# for i in listo:
#     print(i)

# from parsing through the list, all email contents seem to pertain to Police activity, rather than a clear crime that has occurred..will not use for final observation
# certain_crimes are crimes that have a definite location, date, and time reported by WarnMe

### Section A: Assigning Coordinates For Each Location
- This part of the notebook is essentially just manually inputting coordinates for each crime location (although tedious, I found this to be the best way to go about data entry, as many locations from the email don't have precise locations when using something like geo encoder , eg: Octagon Bridge)

In [5]:
certain_crimes['crime location'] = certain_crimes['Body'].str.extract(r'occurred([\w\s\,\'\/\-]+)')
location_series = certain_crimes['crime location']
certain_crimes['crime location'] = location_series.str.strip('^ at')

# creating an additional column 'crime location' that includes information about where the crime has occurred, some
# locations are cut off because  of the way the regex finds matches..



In [6]:
# Now time to retrieve all coordinates for each crime,, not sure what a shorter method would have been to accurately retrieve these locations..
# tried using Nominatim to automatically produce locations, but some were not accurate and required me to enter an actual "location
# a lot of campus location descriptions aren't reflected on Google Maps, for example, so I found it easiest to manually enter these coordinates based on Google Maps
# and knowledge for where these locations are on campus (e.g octagon bridge,)

loc_0 = (37.885668374889185, -122.30092909940954) #UNIVERSITY VILLAGE

loc_1 = (37.86665552582794, -122.25433591851045)

loc_2 = (37.86610424037988, -122.24930616051559)

loc_3 = (37.87549249607238, -122.23872662769344) #Botanical gardens

loc_4 = (37.86750795125447, -122.25424953285788)

loc_5 = (37.86484677848779, -122.24787866459457)

loc_6 = (37.87609356106545, -122.26538185462118)

loc_7 = (37.87202007014124, -122.26564447956427)

loc_8 = (37.87060293649298, -122.2659985855477)

loc_9 = (37.86781820643444, -122.25899259480302)

In [7]:
loc_10 = (37.865682740415785, -122.2569642176637) # PEP 1 

loc_11 = (37.865573737486194, -122.2552224541243)

loc_12 = (37.867619753562195, -122.26122597056194)

loc_13 = (37.86953204164864, -122.26226534919819)

loc_14 = (37.91363009321955, -122.33410994155717)

loc_15 = (37.86648324936755, -122.25711260284339)

loc_16 = (37.87379032049858, -122.2575404586629)

loc_17 = (37.86304746489007, -122.26039073211933)

loc_18 = (37.87081245854583, -122.26439206816863) # Grinnell Path 1

loc_19 = (37.8709652765156, -122.26406951208767) # Grinnell Path 2

In [8]:
loc_20 = (37.86751753675671, -122.25422736768546)

loc_21 = (37.870701614199135, -122.2634021295405)

loc_22 = (37.866889318488994, -122.25761943167934)

loc_23 = (37.872468217709525, -122.24174801954294) # Coffer Dam, seems like old name for parkng lot near lower fire trail entrance

loc_24 = (37.86968370594719, -122.26033123929126)

loc_25 = (37.86575686025478, -122.26055330099094)

loc_26 = (37.869937593835964, -122.25538785290266)

loc_27 = (37.87090307843082, -122.25696504174218)

loc_28 = (37.88329267537631, -122.30385400870257)

loc_29 = (37.86821755544612, -122.2599386758598)

In [9]:
loc_30 = (37.8682433312342, -122.25964613385412)

loc_31 = (37.869317860271806, -122.26049723048824)

loc_32 = (37.86540767127886, -122.25706234887276)

loc_33 = (37.87344193583443, -122.25766334339795)

loc_34 = (37.88723282268444, -122.30088772723381)

loc_35 = (37.86892845206356, -122.25952183200175)

loc_36 = (37.866020406494194, -122.26264597624498)

loc_37 = (37.87059441451832, -122.26038503770606)

loc_38 = (37.865867865204414, -122.25922951679871) # SECOND CRIME HASTE/ELLSWORTH

loc_39 = (37.86570712975303, -122.2567912111524) # PEP 2

In [10]:
loc_40 = (37.886318416383695, -122.29819547180433)

loc_41 = (37.86765616120074, -122.25841255683427)

loc_42 = (37.864331090761155, -122.2489114626912)

loc_43 = (37.870444, -122.262111)

loc_44 = (37.883228, -122.298581)  # Little league baseball fields 1

loc_45 = (37.867962, -122.259649)

loc_46 = (37.87186380964017, -122.25332804517107) 

loc_47 = (37.87619204598824, -122.25879924517099)

loc_48 = (37.86419416328994, -122.25025548521589)

loc_49 = (37.870711416404475, -122.26344228778417)

In [11]:
loc_50 = (37.867034978722415, -122.2588147740073)

loc_51 = (37.87158379172142, -122.25998021496419) # UC Berkeley Campus.. I believe this was shooting threat.. can keep this

loc_52 = (37.868346615975796, -122.25129814517132)

loc_53 = (37.88524254987419, -122.30029760501414)

loc_54 = (37.87049105782291, -122.25387188375312)

loc_55 = (37.87481150678773, -122.26435791818732)

loc_56 = (37.87004422718286, -122.26180770284333)

loc_57 = (37.88380869602941, -122.29852230946554) # Little league baseball fields 2

loc_58 = (37.86973034940577, -122.25986476345292)

loc_59 = (37.8710081955918, -122.26299873167929)

In [12]:
loc_60 = (37.872331713425964, -122.24649307432887)

loc_61 = (37.866692107439604, -122.25751044517123)

loc_62 = (37.87318023633792, -122.26300068749899)

loc_63 = (37.868201604696544, -122.25875756083802)

loc_64 = (37.8727524847917, -122.25536436051519)

loc_65 = (37.866092439082806, -122.2584790642201) ## 2500 Haste Street

loc_66 = (37.86593749636556, -122.25842481270622) # 2500 Haste Street 

loc_67 = (37.86180331529041, -122.25364233078507)

loc_68 = (37.87273993929719, -122.26521562982681)

loc_69 = (37.86639555739679, -122.2631812107783)

In [13]:
loc_70 = (37.87596514184489, -122.25686640922834)

loc_71 = (37.87566386912146, -122.25923881357075)

loc_72 = (37.884557322206874, -122.30562416083409)

loc_73 = (37.866372111248104, -122.25644555126098)

loc_74 = (37.865976553167116, -122.25737250751233)

loc_75 = (37.86871708273302, -122.25914892188258)

loc_76 = (37.86660751287279, -122.25524021851058)

loc_77 = (37.86554543619386, -122.25519676739857)

loc_78 = (37.86723308823183, -122.26348801154207)

loc_79 = (37.86897885562757, -122.25691015693559)

In [14]:
loc_80 = (37.8752348819203, -122.25885865957139)

loc_81 = (37.8674610569245, -122.25794386051561)

loc_82 = (37.87647011584509, -122.2565509526357)

loc_83 = (37.86802007110562, -122.25761770316596)

loc_84 = (37.86815137986785, -122.26369088730023)

loc_85 = (37.869203798797265, -122.26023414854107)

loc_86 = (-9999, -9999) # UC Berkeley Residential Hall.. unclear where, will keep for analysis, won't plot on map

loc_87 = (37.867561308239985, -122.25730043167943)

loc_88 = (37.8677283085401, -122.26130452441932)

loc_89 = (37.872201101493374, -122.24721644263906)

In [15]:
loc_90 = (37.868252494620954, -122.25878973167934)

loc_91 = (37.87159745345414, -122.25760696404691)

loc_92 = (37.86382725604944, -122.24990127169713)

loc_93 = (37.865744392794255, -122.25392575641257)

loc_94 = (37.86814690102153, -122.2562641107314)

loc_95 = (37.866723356379346, -122.2543788181876)

loc_96 = (37.871516833740145, -122.25916926051542)

loc_97 = (37.8669062580087, -122.25760870284331)

loc_98 = (37.867804279422444, -122.25520381864958)

loc_99 = (37.86605863119275, -122.25484788202175)

In [16]:
loc_100 = (37.87482546503129, -122.26868741461494)

loc_101 = (37.88539741113053, -122.30080035337699)

loc_102 = (37.87040285117511, -122.2595467893515)

loc_103 = (37.86975380523294, -122.26027897432951)

loc_104 = (37.8717167752084, -122.25221748767503)

loc_105 = (37.87551680073064, -122.25628656877228)

loc_106 = (37.871589210384336, -122.25751870284331)

loc_107 = (37.874321424162765, -122.26533890574552)

loc_108 = (37.87169685950271, -122.26382497116457)

loc_109 = (37.86713613181871, -122.25650038230374)

In [17]:
loc_110 = (37.829484379261714, -122.27868284517257)

loc_111 = (37.86804532542768, -122.26688627400733)

loc_112 = (37.870577059828854, -122.26599624314518)

loc_113 = (-9999, -9999) # UC Berkeley Residential hall.. don't include in maps, use for analysis

loc_114 = (37.82929988511739, -122.27877827433916) 

loc_115 = (37.83591719188485, -122.27658376051662)

loc_116 = (37.86213896810392, -122.25349831940463)

loc_117 = (37.869987479703404, -122.2596558473456)

loc_118 = (37.86556471088748, -122.25521722398945)

loc_119 = (37.86690101845268, -122.25893594238079)

In [18]:
loc_120 = (37.86863470252675, -122.26244814517115)

loc_121 = (37.88419188311278, -122.29962623283154)

loc_122 = (37.86613737871155, -122.25047217400734)

loc_123 = (37.86578740735615, -122.25885366083858)

loc_124 = (37.86657603791535, -122.25843200284335)

loc_125 = (37.86573376978658, -122.2539375061918)

loc_126 = (37.86444355325474, -122.2582965893516)

loc_127 = (37.88586716455661, -122.30219850284274)

loc_128 = (37.86896750319744, -122.25692187999896)

loc_129 = (37.88637367723143, -122.30088732240158)

In [19]:
loc_130 = (37.86590119352147, -122.25537478261077)

loc_131 = (37.87506752256203, -122.25988297956407)

loc_132 = (37.87213090944993, -122.24798644702335)

loc_133 = (37.86870120036351, -122.25914307866759)

loc_134 = (37.87293014623572, -122.25752306051535)

loc_135 = (37.867656653482776, -122.2658134372363)

loc_136 = (37.86762874446727, -122.26022452189216)

loc_137 = (37.86754495978886, -122.25424730284334)

loc_138 = (37.86439885116898, -122.24884708967492)

loc_139 = (37.86564859533007, -122.25646991334573) # PEP 3

In [20]:
loc_140 = (37.86592663342498, -122.25646373454175) # PEP 4

loc_141 = (37.86611199157212, -122.25661202583713) # PEP 5

loc_142 = (37.86619491480276, -122.25648844975765) # PEP 6

loc_143 = (37.8655315263447, -122.2574708795896) # PEP 7

loc_144 = (37.872782284726554, -122.2652800028431)

loc_145 = (37.86832745523135, -122.25411890284344)

loc_146 = (37.86543396871479, -122.25732258829422) # PEP 8

loc_147 = (37.86544860236752, -122.25713104537101) # PEP 9

loc_148 = (37.865716885485654, -122.25731023068627) # PEP 10

loc_149 = (37.86582419845942, -122.25741527035386) # PEP 11

In [21]:
loc_150 = (37.866, -122.25712044702367) # PEP 12

loc_151 = (37.866, -122.25800044702367) # PEP 13

loc_152 = (37.86554615997802, -122.25648227095367) # PEP 14

loc_153 = (37.86577541985436, -122.25642048291394) # PEP 15

loc_154 = (37.865604694482336, -122.25671706550472) # PEP 16

loc_155 = (37.86550225906927, -122.25690242962395) # PEP 17

loc_156 = (37.8657412748116, -122.25771803174858) # PEP 18


loc_157 = (37.86558030511118, -122.25764388610091) # PEP 19

loc_158 = (37.86636896613394, -122.25110810501863)

loc_159 = (37.86543396871479, -122.25767478012077) # PEP 20

In [22]:
loc_160 = (37.865980289778676, -122.25673560191663) # PEP 21

loc_161 = (37.86581932060036, -122.25684064158422) # PEP 22

loc_162 = (37.86605345747076, -122.25755738284526) # PEP 23

loc_163 = (37.865643717459385, -122.25718665460677) # PEP 24

loc_164 = (37.913470824651505, -122.33408100284186) # RFS 1

loc_165 = (37.8657559084033, -122.2566738138769) # PEP 25

loc_166 = (37.86599980117032, -122.25729787307831) # PEP 26

loc_167 = (37.87001288791314, -122.25960220316537)

loc_168 = (37.86609735805112, -122.25677885354445) # PEP 27

loc_169 = (37.913624778192556, -122.33388573826372) # RFS 2

In [23]:
loc_170 = (37.87088185444292, -122.25303228179779)

loc_171 = (37.86969261633393, -122.25933176123971)

loc_172 = (37.8864206224644, -122.29813088749863)

loc_173 = (37.859870271208194, -122.25585005171963)

loc_174 = (37.879507515400384, -122.23616119767078) 

loc_175 = (37.913427405968235, -122.33463626099054)

loc_176 = (37.91260830557845, -122.33442986724066)

In [24]:

# remaining_crime_num = np.arange(80, 90)
# for i in remaining_crime_num:
#     print(i, certain_crimes.loc[i, "Subject"])

# certain_crimes.loc[80, "Body"]

# remaining_crime_num = np.arange(150, 160)
# for i in remaining_crime_num:
#     print(i, certain_crimes.loc[i, "Subject"])

# remaining_crime_num = np.arange(110, 120)
# for i in remaining_crime_num:
#     print(i, certain_crimes.loc[i, "Subject"])

# certain_crimes.iloc[140:150,:]

# remaining_crime_num = np.arange(160, 170)
# for i in remaining_crime_num:
#     print(i, certain_crimes.loc[i, "Subject"])

# remaining_crime_num = np.arange(170, 177)
# for i in remaining_crime_num:
#     print(i, certain_crimes.loc[i, "Subject"])

# len(locations)


# remaining_crime_num = np.arange(10, 20)
# for i in remaining_crime_num:
#     print(i, certain_crimes.loc[i, "Subject"])

# remaining_crime_num = np.arange(30, 40)
# for i in remaining_crime_num:
#     print(i, certain_crimes.loc[i, "Subject"])

# remaining_crime_num = np.arange(40, 50)
# for i in remaining_crime_num:
#     print(i, certain_crimes.loc[i, "Subject"])

# remaining_crime_num = np.arange(90, 100)
# for i in remaining_crime_num:
#     print(i, certain_crimes.loc[i, "Subject"])

# remainders  = np.arange(170, 177)
# for i in remainders:
#     print(i, certain_crimes.loc[i, 'Subject'])


In [25]:
all_locations = [loc_0, loc_1, loc_2, loc_3, loc_4, loc_5, loc_6, loc_7, loc_8, loc_9, 
             loc_10, loc_11, loc_12, loc_13, loc_14, loc_15, loc_16, loc_17, loc_18, loc_19,
             loc_20, loc_21, loc_22, loc_23, loc_24, loc_25, loc_26, loc_27, loc_28, loc_29,
             loc_30, loc_31, loc_32, loc_33, loc_34, loc_35, loc_36, loc_37, loc_38, loc_39,
             loc_40, loc_41, loc_42, loc_43, loc_44, loc_45, loc_46, loc_47, loc_48, loc_49, 
             loc_50, loc_51, loc_52, loc_53, loc_54, loc_55, loc_56, loc_57, loc_58, loc_59, loc_60, 
             loc_61, loc_62, loc_63, loc_64, loc_65, loc_66, loc_67, loc_68, loc_69, loc_70,
             loc_71, loc_72, loc_73, loc_74, loc_75, loc_76, loc_77, loc_78, loc_79, loc_80,
             loc_81, loc_82, loc_83, loc_84, loc_85, loc_86, loc_87, loc_88, loc_89, loc_90,
             loc_91, loc_92, loc_93, loc_94, loc_95, loc_96, loc_97, loc_98, loc_99, loc_100,
             loc_101, loc_102, loc_103, loc_104, loc_105, loc_106, loc_107, loc_108, loc_109, loc_110,
             loc_111, loc_112, loc_113, loc_114, loc_115, loc_116, loc_117, loc_118, loc_119, loc_120,
             loc_121, loc_122, loc_123, loc_124, loc_125, loc_126, loc_127, loc_128, loc_129, loc_130,
             loc_131, loc_132, loc_133, loc_134, loc_135, loc_136, loc_137, loc_138, loc_139, loc_140,
             loc_141, loc_142, loc_143, loc_144, loc_145, loc_146, loc_147, loc_148, loc_149, loc_150,
             loc_151, loc_152, loc_153, loc_154, loc_155, loc_156, loc_157, loc_158, loc_159, loc_160,
             loc_161, loc_162, loc_163, loc_164, loc_165, loc_166, loc_167, loc_168, loc_169, loc_170,
             loc_171, loc_172, loc_173, loc_174, loc_175, loc_176]

In [26]:
locations = [list(l) for l in all_locations]
latitude = [lat[0] for lat in all_locations]
longitude = [lon[1] for lon in all_locations]
certain_crimes['crime latitude'] = latitude
certain_crimes['crime longitude'] = longitude
certain_crimes
#location_df = pd.DataFrame({'latitude':latitude, 'longitude':longitude})
len(all_locations)
#location_df['latitude'].value_counts().head(10)

177

In [27]:
certain_crimes = certain_crimes.drop(labels = 'not crime', axis = 1)

certain_crimes

Unnamed: 0,Subject,Body,date of crime,time of crime,email time,email day of week,email date,crime location,crime latitude,crime longitude
0,Burglary at University Village: Albany (UVA),<https://oem.berkeley.edu/sites/default/files...,06-17-2021,02:09,04:02,Thursday,06-17-2021,University Village,37.885668,-122.300929
1,"Arson Reported at 2650 Haste St., Berkeley CA ...",<https://oem.berkeley.edu/sites/default/files...,06-16-2021,05:20,10:10,Wednesday,06-16-2021,2650 Haste S,37.866656,-122.254336
2,Violent Crime Reported at 3100 Block of Dwight...,<https://oem.berkeley.edu/sites/default/files...,06-08-2021,13:15,16:51,Tuesday,06-08-2021,on the 3100 Block of Dwight Way,37.866104,-122.249306
3,"Burglary at Botanical Gardens, 200 Centennial ...",<https://oem.berkeley.edu/sites/default/files...,05-10-2021,10:30,12:03,Sunday,05-30-2021,"Botanical Gardens, 200 Centennial Dr Berkeley CA",37.875492,-122.238727
4,Violent Crime Reported at Channing Way/ Colleg...,<https://oem.berkeley.edu/sites/default/files...,05-18-2021,15:50,23:23,Tuesday,05-18-2021,Channing Way/ College Ave,37.867508,-122.254250
...,...,...,...,...,...,...,...,...,...,...
172,Burglary at 1050 San Pablo Ave,<https://oem.berkeley.edu/sites/default/files...,09-21-2023,11:31,18:11,Thursday,09-21-2023,1050 San Pablo Ave,37.886421,-122.298131
173,Stuart Street & Hillegas Ave - Violent Crime ...,<https://oem.berkeley.edu/sites/default/files...,09-07-2023,12:17,13:49,Thursday,09-07-2023,Stuart Stree,37.859870,-122.255850
174,Singletrack trail east of the Botanical Garden...,<https://oem.berkeley.edu/sites/default/files...,08-20-2023,13:45,16:17,Sunday,08-20-2023,Singletrack trail east of the Botanical Garden...,37.879508,-122.236161
175,Burglary at Richmond Field Station,<https://oem.berkeley.edu/sites/default/files...,06-19-2023,02:00,15:38,Tuesday,06-20-2023,Richmond Field Station,37.913427,-122.334636


In [28]:
# need to edit two dates from emails as they table says that the email occurred before the crime date.. seems like there is an error in 
# the WarnMe email

certain_crimes.at[157, 'date of crime'] = '05-03-2022'
certain_crimes.at[40, 'date of crime'] = '03-20-2022' #[certain_crimes['date of crime'] == '03-21-2022']


## Section B: Finding Time Differences Between Crime Occurrence and Email Time

This portion of the notebook adds columns that describes the difference in time from when the crime happened, and when the WarnMe email for this crime was actually sent. To make this process easier, I split the dataframe certain_crimes into two data frames:

- **same_day_emails**: emails that were sent the same day as when the crime happened
- **diff_day_emails**: emails that were sent on a different day than that of the day the crime occurred, then split into two tables
    - one_day_off_emails
    - more_than_one_day

In [29]:
#Assingning data frames, splitting data based on if the date of the crime is the same as the email date, or not the same

same_day_emails  = certain_crimes[certain_crimes['date of crime'] == certain_crimes['email date']].reset_index(drop=True)
diff_day_emails = certain_crimes[certain_crimes['date of crime'] != certain_crimes['email date']].reset_index(drop=True)

# If the date fof crime and email are the same, then  the day of the week in which the crime happened is the same as the email day of week, make that assignment here:
day_of_weeks = same_day_emails['email day of week']
same_day_emails['crime day of week'] = day_of_weeks

In [30]:
## Converting Email Times into Minutes,, 
# For each email and crime time,, convert them into minutes, by multiplying the hour by 60, and adding the remaining minutes of the time
# Final result is features 'email in minutes' and 'crime in minutes', which describe the eamil time in terms of minutes and crime time in terms of minutes, 
# so it is easier to get the difference in time

email_times = same_day_emails['email time'] 
crime_times = same_day_emails['time of crime']

email_hours = [] 
crime_hours = []


for i in range(len(email_times)):
    email_hours.append(int(email_times[i][0:2])*60 + int(email_times[i][3:5]))
    
    crime_hours.append(int(crime_times[i][0:2])*60 +int(crime_times[i][3:5]))

same_day_emails['email in minutes'] = email_hours 

same_day_emails['crime in minutes'] = crime_hours 


same_day_emails

Unnamed: 0,Subject,Body,date of crime,time of crime,email time,email day of week,email date,crime location,crime latitude,crime longitude,crime day of week,email in minutes,crime in minutes
0,Burglary at University Village: Albany (UVA),<https://oem.berkeley.edu/sites/default/files...,06-17-2021,02:09,04:02,Thursday,06-17-2021,University Village,37.885668,-122.300929,Thursday,242,129
1,"Arson Reported at 2650 Haste St., Berkeley CA ...",<https://oem.berkeley.edu/sites/default/files...,06-16-2021,05:20,10:10,Wednesday,06-16-2021,2650 Haste S,37.866656,-122.254336,Wednesday,610,320
2,Violent Crime Reported at 3100 Block of Dwight...,<https://oem.berkeley.edu/sites/default/files...,06-08-2021,13:15,16:51,Tuesday,06-08-2021,on the 3100 Block of Dwight Way,37.866104,-122.249306,Tuesday,1011,795
3,Violent Crime Reported at Channing Way/ Colleg...,<https://oem.berkeley.edu/sites/default/files...,05-18-2021,15:50,23:23,Tuesday,05-18-2021,Channing Way/ College Ave,37.867508,-122.254250,Tuesday,1403,950
4,Burglary at Clark Kerr Campus building 23,<https://oem.berkeley.edu/sites/default/files...,07-01-2021,18:07,19:32,Thursday,07-01-2021,Clark Kerr Campus building 23,37.864847,-122.247879,Thursday,1172,1087
...,...,...,...,...,...,...,...,...,...,...,...,...,...
141,Sproul Plaza - Violent Crime Reported - Pleas...,<https://oem.berkeley.edu/sites/default/files...,10-25-2023,13:30,14:14,Wednesday,10-25-2023,Sproul Plaz,37.869693,-122.259332,Wednesday,854,810
142,Burglary at 1050 San Pablo Ave,<https://oem.berkeley.edu/sites/default/files...,09-21-2023,11:31,18:11,Thursday,09-21-2023,1050 San Pablo Ave,37.886421,-122.298131,Thursday,1091,691
143,Stuart Street & Hillegas Ave - Violent Crime ...,<https://oem.berkeley.edu/sites/default/files...,09-07-2023,12:17,13:49,Thursday,09-07-2023,Stuart Stree,37.859870,-122.255850,Thursday,829,737
144,Singletrack trail east of the Botanical Garden...,<https://oem.berkeley.edu/sites/default/files...,08-20-2023,13:45,16:17,Sunday,08-20-2023,Singletrack trail east of the Botanical Garden...,37.879508,-122.236161,Sunday,977,825


In [31]:
difference = same_day_emails['email in minutes'] - same_day_emails['crime in minutes']

hours_diff = difference//60 
minutes_diff = difference%60
same_day_emails['difference hours'] = hours_diff
same_day_emails['difference minutes'] = minutes_diff
# same_day_emails.iloc[130] # change crime date to 05-03-2022..
# same_day_emails.iloc[36] # change crime date to 03-20-2022...
hullo = difference.to_list()
np.mean((hullo)), np.std((hullo))**2
#np.percentile(hullo, 80)

(171.5958904109589, 44515.679161193475)

In [32]:
# instead of manually entering values, most email and crimes with different days are usually a day off, and can adjust data entry as needed
# Create new feature that checks for each email that is not the same day, if the date difference is off by one day or more,, 
# Note: does not work with one day off emails, but different month , e.g. 11-30 and 12-01

one_day_more = []
num_rows = diff_day_emails.shape[0]

for i in range(num_rows):
    if (diff_day_emails.iloc[i]['date of crime'][0:2] == diff_day_emails.iloc[i]['email date'][0:2]) & (int(diff_day_emails.iloc[i]['date of crime'][3:5]) == (int(diff_day_emails.iloc[i]['email date'][3:5]) - 1)):
            one_day_more.append(True)
    else: 
        one_day_more.append(False)
  
        
diff_day_emails['one day off?'] = one_day_more


In [33]:
one_day_off_emails = diff_day_emails[diff_day_emails['one day off?'] == True].reset_index(drop = True)
more_than_one_day = diff_day_emails[diff_day_emails['one day off?'] == False].reset_index(drop = True)

In [34]:
# For eamils that are one day off, can map the crime day of the week to one day earlier than that of the email 

def one_day_later_converter(days):
    return({'Tuesday':'Monday', 'Wednesday':'Tuesday', 'Thursday':'Wednesday', 
            'Friday':'Thursday', 'Saturday':'Friday', 'Sunday':'Saturday', 'Monday':'Sunday'}[days])


email_days_of_week = one_day_off_emails['email day of week'].to_list()

crime_days_of_week = [] 
for e in email_days_of_week:
    crime_days_of_week.append(one_day_later_converter(e))

one_day_off_emails['crime day of week'] = crime_days_of_week
one_day_off_emails

Unnamed: 0,Subject,Body,date of crime,time of crime,email time,email day of week,email date,crime location,crime latitude,crime longitude,one day off?,crime day of week
0,Violent Crime Reported at People's Park bathro...,<https://oem.berkeley.edu/sites/default/files...,09-07-2021,23:30,04:24,Wednesday,09-08-2021,People's Park bathroom,37.865683,-122.256964,True,Tuesday
1,Violent Crime Reported at Sather Lane - Please...,<https://oem.berkeley.edu/sites/default/files...,11-13-2021,20:30,03:02,Sunday,11-14-2021,Sather Lane,37.868218,-122.259939,True,Saturday
2,Burglary at Dwinelle Hall,<https://oem.berkeley.edu/sites/default/files...,11-29-2021,14:15,05:06,Tuesday,11-30-2021,Dwinelle Hall,37.870594,-122.260385,True,Monday
3,"Burglary at Gill Tract, University Village Albany",<https://oem.berkeley.edu/sites/default/files...,03-20-2022,20:00,19:06,Monday,03-21-2022,"Gill Tract, University Village Albany",37.886318,-122.298195,True,Sunday
4,Burglary at Enclave Apartments,<https://oem.berkeley.edu/sites/default/files...,03-15-2022,23:16,04:42,Wednesday,03-16-2022,Enclave Apartments,37.867656,-122.258413,True,Tuesday
5,Violent Crime Reported at The old Tolman Hall ...,<https://oem.berkeley.edu/sites/default/files...,08-05-2022,23:56,02:04,Saturday,08-06-2022,The old Tolman Hall are,37.874812,-122.264358,True,Friday
6,Violent Crime Reported at Etcheverry Hall/Soda...,<https://oem.berkeley.edu/sites/default/files...,11-17-2022,15:30,05:48,Friday,11-18-2022,Etcheverry Hall/Soda Hall breezeway,37.875664,-122.259239,True,Thursday
7,Violent Crime Reported at Bowditch St at Bancr...,<https://oem.berkeley.edu/sites/default/files...,12-23-2022,23:20,00:45,Saturday,12-24-2022,Bowditch St at Bancroft Way,37.868979,-122.25691,True,Friday
8,Burglary at Foothill Building 2,<https://oem.berkeley.edu/sites/default/files...,12-08-2022,03:00,22:49,Friday,12-09-2022,Foothill Building 2,37.87647,-122.256551,True,Thursday
9,Violent Crime Reported at Bancroft Way at Ells...,<https://oem.berkeley.edu/sites/default/files...,09-29-2022,23:45,00:07,Friday,09-30-2022,Bancroft Way at Ellsworth S,37.868151,-122.263691,True,Thursday


In [35]:
minutes_in_day = 24*60

crime_time_list = one_day_off_emails['time of crime'].to_list()
crime_time_in_minutes = []
for c in crime_time_list:
    hours_in_min = 60*int(c[0:2])
    minutes = int(c[3:5])

    crime_time_in_minutes.append(hours_in_min + minutes)

remaining_minutes_in_day = []

for t in crime_time_in_minutes:
    remaining_minutes_in_day.append(minutes_in_day - t)


email_time_list = one_day_off_emails['email time']
email_time_in_minutes = []

for e in email_time_list:
    hours_in_min_dos = 60*int(e[0:2])
    minutes_dos = int(e[3:5])
    email_time_in_minutes.append(hours_in_min_dos + minutes_dos)

time_difference = [] 

for i in range(23):
    diff = remaining_minutes_in_day[i] + email_time_in_minutes[i]
    time_difference.append(diff)

time_difference
        

one_day_off_emails['total difference (in minutes)'] = time_difference

In [36]:
same_day_emails = same_day_emails.drop(['crime in minutes', 'difference hours'], axis=1)
same_day_emails.drop(['email in minutes', 'difference minutes'], axis=1)
same_day_emails['total difference (in minutes)'] = difference
same_day_emails = same_day_emails.drop(['difference minutes', 'email in minutes'], axis=1)


In [37]:
one_day_off_emails = one_day_off_emails.drop(['one day off?'], axis=1)

In [38]:
two_combo = pd.concat([same_day_emails, one_day_off_emails], axis = 0).reset_index(drop =True)

same_day_emails.to_csv('same_day_emails.csv')

In [39]:
# Going to manually enter this data...

more_than_one_day_of_week = ['Monday', 'Friday', 'Wednesday', 'Saturday', 'Wednesday', 'Monday', 'Friday', 'Saturday']
more_than_one_day['crime day of week'] = more_than_one_day_of_week
more_than_one_day

Unnamed: 0,Subject,Body,date of crime,time of crime,email time,email day of week,email date,crime location,crime latitude,crime longitude,one day off?,crime day of week
0,"Burglary at Botanical Gardens, 200 Centennial ...",<https://oem.berkeley.edu/sites/default/files...,05-10-2021,10:30,12:03,Sunday,05-30-2021,"Botanical Gardens, 200 Centennial Dr Berkeley CA",37.875492,-122.238727,False,Monday
1,"Burglary at 2521 Channing Way, Berkeley",<https://oem.berkeley.edu/sites/default/files...,12-09-2022,20:00,16:48,Monday,12-12-2022,"2521 Channing Way, Berkeley",37.867461,-122.257944,False,Friday
2,Violent Crime Reported at a UC Berkeley Reside...,<https://oem.berkeley.edu/sites/default/files...,11-30-2022,20:45,10:48,Thursday,12-01-2022,UC Berkeley Residence Hall,-9999.0,-9999.0,False,Wednesday
3,Violent Crime Reported at Memorial Stadium - P...,<https://oem.berkeley.edu/sites/default/files...,04-08-2023,17:40,06:25,Tuesday,04-11-2023,Memorial Stadium,37.871717,-122.252217,False,Saturday
4,Burglary at Banway Building,<https://oem.berkeley.edu/sites/default/files...,11-29-2023,20:05,11:16,Friday,12-01-2023,Banway Building,37.868045,-122.266886,False,Wednesday
5,A Campus Residence Hall - Violent Crime Repor...,<https://oem.berkeley.edu/sites/default/files...,11-27-2023,00:00,13:40,Tuesday,12-05-2023,A Campus Residence Hall,-9999.0,-9999.0,False,Monday
6,"Burglary at Unit 2 - Ehrman Hall, 2650 Haste S...",<https://oem.berkeley.edu/sites/default/files...,07-07-2023,22:00,17:15,Monday,07-10-2023,"Unit 2 - Ehrman Hall, 2650 Haste St, Berkeley",37.865901,-122.255375,False,Friday
7,Burglary at Northgate Hall,<https://oem.berkeley.edu/sites/default/files...,07-22-2023,04:00,05:42,Monday,07-24-2023,Northgate Hall,37.875068,-122.259883,False,Saturday


In [40]:
# 0 -- 93
# 1 -- 240, 1440, 1440, 1008 = 4128
# 2 -- 195, 648 = 843
# 3 -- 380, 1440, 1440, 385 = 3645
# 4 -- 235, 1440, 676 = 2351
# 5 -- 1440*8 + 820 = 12340
# 6 -- 120 + 1440 + 1440 + 1035 = 4035
# 7 -- 1200 + 1440 + 342 = 2982


more_than_one_day['total difference (in minutes)'] = [93, 4128, 843, 3645, 2531, 12340, 4035, 2982]
more_than_one_day = more_than_one_day.drop(['one day off?'], axis=1)


In [41]:
finale = pd.concat([two_combo, more_than_one_day], axis=0).reset_index(drop = True)
finale

Unnamed: 0,Subject,Body,date of crime,time of crime,email time,email day of week,email date,crime location,crime latitude,crime longitude,crime day of week,total difference (in minutes)
0,Burglary at University Village: Albany (UVA),<https://oem.berkeley.edu/sites/default/files...,06-17-2021,02:09,04:02,Thursday,06-17-2021,University Village,37.885668,-122.300929,Thursday,113
1,"Arson Reported at 2650 Haste St., Berkeley CA ...",<https://oem.berkeley.edu/sites/default/files...,06-16-2021,05:20,10:10,Wednesday,06-16-2021,2650 Haste S,37.866656,-122.254336,Wednesday,290
2,Violent Crime Reported at 3100 Block of Dwight...,<https://oem.berkeley.edu/sites/default/files...,06-08-2021,13:15,16:51,Tuesday,06-08-2021,on the 3100 Block of Dwight Way,37.866104,-122.249306,Tuesday,216
3,Violent Crime Reported at Channing Way/ Colleg...,<https://oem.berkeley.edu/sites/default/files...,05-18-2021,15:50,23:23,Tuesday,05-18-2021,Channing Way/ College Ave,37.867508,-122.254250,Tuesday,453
4,Burglary at Clark Kerr Campus building 23,<https://oem.berkeley.edu/sites/default/files...,07-01-2021,18:07,19:32,Thursday,07-01-2021,Clark Kerr Campus building 23,37.864847,-122.247879,Thursday,85
...,...,...,...,...,...,...,...,...,...,...,...,...
172,Violent Crime Reported at Memorial Stadium - P...,<https://oem.berkeley.edu/sites/default/files...,04-08-2023,17:40,06:25,Tuesday,04-11-2023,Memorial Stadium,37.871717,-122.252217,Saturday,3645
173,Burglary at Banway Building,<https://oem.berkeley.edu/sites/default/files...,11-29-2023,20:05,11:16,Friday,12-01-2023,Banway Building,37.868045,-122.266886,Wednesday,2531
174,A Campus Residence Hall - Violent Crime Repor...,<https://oem.berkeley.edu/sites/default/files...,11-27-2023,00:00,13:40,Tuesday,12-05-2023,A Campus Residence Hall,-9999.000000,-9999.000000,Monday,12340
175,"Burglary at Unit 2 - Ehrman Hall, 2650 Haste S...",<https://oem.berkeley.edu/sites/default/files...,07-07-2023,22:00,17:15,Monday,07-10-2023,"Unit 2 - Ehrman Hall, 2650 Haste St, Berkeley",37.865901,-122.255375,Friday,4035


In [42]:
finale.to_csv('crimes_final_dataframe_part1.csv')

# titled part 1,, might expand on this project later to see if there is a trend with crime type and location
# Also will add additional WarnMe Emails, was last updated