# 3. Data Cleaning Part III: Manually Entering Crimes with Duplicate Email Subjects + Removing Emails that are Not Criminal Activity/Crimes

- Import the data frame built in the first two notebooks from **warnme_emails_w_email_dtime.csv** (296 observations/emails), then split it into two data frames:
    - **community_notifications** (66 observations/emails):&emsp;emails whose contents do not pertain to a specific crime; kept aside in case they are useful for later analysis
    - **crimes_complete** (230 observations/emails):&emsp;emails about specific crimes, assembled from two intermediate data frames, unique_subjects (187) and common_subjects (43)

In [1]:
import pandas as pd
import numpy as np

In [2]:
emails = pd.read_csv('warnme_emails_w_email_dtime.csv').drop(labels='Unnamed: 0', axis=1)
# 'Unnamed: 0' is the old index saved with the CSV; it could instead be left out when the data frame is written to CSV
emails.head(10)

Unnamed: 0,Subject,Body,date of crime,time of crime,approx time,email time,email day of week,email date (num)
0,Burglary at University Village: Albany (UVA),<https://oem.berkeley.edu/sites/default/files...,06-17-2021,02:09,,04:02,Thursday,06-17-2021
1,"Arson Reported at 2650 Haste St., Berkeley CA ...",<https://oem.berkeley.edu/sites/default/files...,06-16-2021,05:20,,10:10,Wednesday,06-16-2021
2,UC Berkeley WarnMe:,<https://oem.berkeley.edu/sites/default/files...,6/12/21,,approximately 0322,,,
3,Violent Crime Reported at 3100 Block of Dwight...,<https://oem.berkeley.edu/sites/default/files...,06-08-2021,13:15,,16:51,Tuesday,06-08-2021
4,Violent Crime Reported at People's Park - Plea...,<https://oem.berkeley.edu/sites/default/files...,06-01-2021,15:42,,,,
5,Community Advisory - UCPD Supports LGBTQ+ Prid...,<https://oem.berkeley.edu/sites/default/files...,,,,15:40,Tuesday,06-01-2021
6,"Burglary at Botanical Gardens, 200 Centennial ...",<https://oem.berkeley.edu/sites/default/files...,05-10-2021,10:30,,12:03,Sunday,05-30-2021
7,Community Advisory - Work-Study Internet Scam,<https://oem.berkeley.edu/sites/default/files...,,,,08:37,Wednesday,05-26-2021
8,Community Advisory: Work-Study Internet Scam,<https://oem.berkeley.edu/sites/default/files...,,,,08:31,Wednesday,05-26-2021
9,Violent Crime Reported at People's Park - Plea...,<https://oem.berkeley.edu/sites/default/files...,05-23-2021,17:24,,,,
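
The `Unnamed: 0` column is the index that was saved along with the CSV. Below is a minimal sketch of two common ways to avoid it, using a small made-up frame and in-memory buffers instead of the real file (the names `df`, `clean` and `restored` are illustrative only):

```python
import io
import pandas as pd

# Toy frame standing in for the real emails table (values are illustrative only).
df = pd.DataFrame({"Subject": ["a", "b"], "email time": ["04:02", "10:10"]})

# Option 1: write the CSV without the index, so no extra column is created.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
clean = pd.read_csv(buf)

# Option 2: the CSV already contains the index; read it back as the index.
buf_with_index = io.StringIO()
df.to_csv(buf_with_index)  # first column in the file has an empty header
buf_with_index.seek(0)
restored = pd.read_csv(buf_with_index, index_col=0)

print(list(clean.columns))
print(list(restored.columns))
```

Either option would make the `.drop(labels='Unnamed: 0', axis=1)` call unnecessary.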


In [3]:
# create new col, 'does not have dOFcrime', to flag emails without a crime date in the body
emails['does not have dOFcrime'] = emails['date of crime'].isna()

# no_date holds all emails that have NA values for the date of crime
no_date = emails[emails['does not have dOFcrime']]

# print every email body in no_date to inspect the contents
nd_contents = no_date['Body']

#for i in nd_contents:
#   print(i)

# all of these emails are community advisories or police activity notices with no specific
# crime information, so they are removed from the final data frame for analysis

# keep them in a new data frame, community_notifications, in case they are useful for later analysis
community_notifications = no_date
no_date.shape

(66, 9)
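
The same split can be done with a boolean mask, without adding a helper column to the frame. This is only a sketch on made-up rows; the `*_demo` names are hypothetical and not part of the notebook:

```python
import pandas as pd

# Toy data; None marks an email whose body had no crime date.
emails_demo = pd.DataFrame({
    "Subject": ["Burglary", "Community Advisory", "Arson"],
    "date of crime": ["06-17-2021", None, "06-16-2021"],
})

missing_date = emails_demo["date of crime"].isna()

# .copy() gives independent frames, so later edits do not touch emails_demo.
notifications_demo = emails_demo[missing_date].copy()
crimes_demo = emails_demo[~missing_date].reset_index(drop=True)

print(notifications_demo.shape, crimes_demo.shape)
```

Because the mask lives in its own variable, there is no flag column to drop afterwards.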

In [4]:
# create a new data frame 'crimes' that removes all emails with no crime date
crimes = emails[~emails['does not have dOFcrime']].drop(labels='does not have dOFcrime', axis=1).reset_index(drop=True)

# create new column 'does not have email date' to flag emails that were not assigned an email date when the data frames were merged
# these are the emails with duplicate subjects, so their email dates and times have to be entered manually from my email inbox
crimes['does not have email date'] = crimes['email date (num)'].isna()
crimes.head(10)

Unnamed: 0,Subject,Body,date of crime,time of crime,approx time,email time,email day of week,email date (num),does not have email date
0,Burglary at University Village: Albany (UVA),<https://oem.berkeley.edu/sites/default/files...,06-17-2021,02:09,,04:02,Thursday,06-17-2021,False
1,"Arson Reported at 2650 Haste St., Berkeley CA ...",<https://oem.berkeley.edu/sites/default/files...,06-16-2021,05:20,,10:10,Wednesday,06-16-2021,False
2,UC Berkeley WarnMe:,<https://oem.berkeley.edu/sites/default/files...,6/12/21,,approximately 0322,,,,True
3,Violent Crime Reported at 3100 Block of Dwight...,<https://oem.berkeley.edu/sites/default/files...,06-08-2021,13:15,,16:51,Tuesday,06-08-2021,False
4,Violent Crime Reported at People's Park - Plea...,<https://oem.berkeley.edu/sites/default/files...,06-01-2021,15:42,,,,,True
5,"Burglary at Botanical Gardens, 200 Centennial ...",<https://oem.berkeley.edu/sites/default/files...,05-10-2021,10:30,,12:03,Sunday,05-30-2021,False
6,Violent Crime Reported at People's Park - Plea...,<https://oem.berkeley.edu/sites/default/files...,05-23-2021,17:24,,,,,True
7,Violent Crime Reported at Channing Way/ Colleg...,<https://oem.berkeley.edu/sites/default/files...,05-18-2021,15:50,,23:23,Tuesday,05-18-2021,False
8,Violent Crime Reported at People's Park - Plea...,<https://oem.berkeley.edu/sites/default/files...,07-06-2021,11:39,,,,,True
9,Burglary at Clark Kerr Campus building 23,<https://oem.berkeley.edu/sites/default/files...,07-01-2021,18:07,,19:32,Thursday,07-01-2021,False
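
One way to check that the rows missing an email date really are the ones with shared subject lines is to flag repeated subjects directly. This is a sketch on made-up subjects; whether the flag lines up exactly with the missing dates in the real data is an assumption that would need checking:

```python
import pandas as pd

# Toy subjects; two emails share the same subject line.
subjects_demo = pd.DataFrame({
    "Subject": [
        "Violent Crime Reported at People's Park",
        "Burglary at Clark Kerr Campus",
        "Violent Crime Reported at People's Park",
    ]
})

# keep=False flags every member of a duplicated group, not just the repeats.
subjects_demo["shared subject"] = subjects_demo["Subject"].duplicated(keep=False)

print(subjects_demo)
```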


In [5]:
# unique_subjects holds all emails in crimes that have an email date, meaning they have unique subject lines
unique_subjects = crimes[~crimes['does not have email date']].reset_index(drop=True)
#unique_subjects

unique_subjects.shape

(187, 9)

In [6]:
# common_subjects holds the emails with duplicate subject lines; their email dates and times are entered manually below
# this is a tedious process, and there may be a better approach

common_subjects = crimes[crimes['does not have email date']].reset_index(drop=True)

common_subjects.head(10)

common_subjects.iloc[0]['Body']

common_subjects.loc[0, 'email time'] = '07:13'
common_subjects.loc[0, 'email day of week'] = 'Saturday'
common_subjects.loc[0, 'email date (num)'] = '06-12-2021'
common_subjects.loc[0, 'date of crime'] = '06-12-2021'
common_subjects.loc[0, 'time of crime'] = '03:22'

#common_subjects.iloc[1]['Body']

common_subjects.loc[1, 'email time'] = '16:37'
common_subjects.loc[1, 'email day of week'] = 'Tuesday'
common_subjects.loc[1, 'email date (num)'] = '06-01-2021'
common_subjects.head(10)

#common_subjects.iloc[2]['Body']

common_subjects.loc[2, 'email time'] = '18:37'
common_subjects.loc[2, 'email day of week'] = 'Sunday'
common_subjects.loc[2, 'email date (num)'] = '05-23-2021'

#common_subjects.iloc[3]['Body']

common_subjects.loc[3, 'email time'] = '12:56'
common_subjects.loc[3, 'email day of week'] = 'Tuesday'
common_subjects.loc[3, 'email date (num)'] = '07-06-2021'

#common_subjects.iloc[4]['Body']

common_subjects.loc[4, 'email time'] = '11:31'
common_subjects.loc[4, 'email day of week'] = 'Friday'
common_subjects.loc[4, 'email date (num)'] = '08-27-2021'
common_subjects.head(10)

#common_subjects.loc[5]['Body']

common_subjects.loc[5, 'email time'] = '18:31'
common_subjects.loc[5, 'email day of week'] = 'Monday'
common_subjects.loc[5, 'email date (num)'] = '10-25-2021'

#common_subjects.loc[6]['Body']

common_subjects.loc[6, 'email time'] = '00:53'
common_subjects.loc[6, 'email day of week'] = 'Friday'
common_subjects.loc[6, 'email date (num)'] = '10-22-2021'

#common_subjects.loc[7]['Body']

common_subjects.loc[7, 'email time'] = '23:33'
common_subjects.loc[7, 'email day of week'] = 'Wednesday'
common_subjects.loc[7, 'email date (num)'] = '11-10-2021'

#common_subjects.loc[8]['Body']

common_subjects.loc[8, 'email time'] = '18:47'
common_subjects.loc[8, 'email day of week'] = 'Saturday'
common_subjects.loc[8, 'email date (num)'] = '10-02-2021'

#common_subjects.loc[9]['Body']

common_subjects.loc[9, 'email time'] = '18:04'
common_subjects.loc[9, 'email day of week'] = 'Sunday'
common_subjects.loc[9, 'email date (num)'] = '11-21-2021'

#common_subjects.loc[10]['Body']

common_subjects.loc[10, 'email time'] = '09:25'
common_subjects.loc[10, 'email day of week'] = 'Tuesday'
common_subjects.loc[10, 'email date (num)'] = '01-18-2022'
common_subjects.loc[10, 'date of crime'] = '01-18-2022'


#common_subjects.loc[11]['Body']

common_subjects.loc[11, 'email time'] = '21:49'
common_subjects.loc[11, 'email day of week'] = 'Tuesday'
common_subjects.loc[11, 'email date (num)'] = '12-28-2021'

common_subjects.loc[12]['Body']

common_subjects.loc[12, 'email time'] = '12:50'
common_subjects.loc[12, 'email day of week'] = 'Sunday'
common_subjects.loc[12, 'email date (num)'] = '04-17-2022'

common_subjects.loc[13]['Body']

common_subjects.loc[13, 'email time'] = '03:51'
common_subjects.loc[13, 'email day of week'] = 'Saturday'
common_subjects.loc[13, 'email date (num)'] = '04-16-2022'
common_subjects.loc[13, 'date of crime'] = '04-16-2022'
common_subjects.loc[13, 'time of crime'] = '02:00'

common_subjects.loc[14]['Body']

common_subjects.loc[14, 'email time'] = '10:04'
common_subjects.loc[14, 'email day of week'] = 'Wednesday'
common_subjects.loc[14, 'email date (num)'] = '06-01-2022'

common_subjects

common_subjects.loc[15]['Body']

common_subjects.loc[15, 'email time'] = '07:22'
common_subjects.loc[15, 'email day of week'] = 'Tuesday'
common_subjects.loc[15, 'email date (num)'] = '05-24-2022'

common_subjects.loc[16]['Body']

common_subjects.loc[16, 'email time'] = '05:38'
common_subjects.loc[16, 'email day of week'] = 'Saturday'
common_subjects.loc[16, 'email date (num)'] = '08-06-2022'

common_subjects.loc[17]['Body']

common_subjects.loc[17, 'email time'] = '16:45'
common_subjects.loc[17, 'email day of week'] = 'Thursday'
common_subjects.loc[17, 'email date (num)'] = '07-14-2022'

common_subjects.loc[18]['Body']

common_subjects.loc[18, 'email time'] = '10:02'
common_subjects.loc[18, 'email day of week'] = 'Thursday'
common_subjects.loc[18, 'email date (num)'] = '06-02-2022'

common_subjects.loc[19]['Body']

common_subjects.loc[19, 'email time'] = '13:51'
common_subjects.loc[19, 'email day of week'] = 'Monday'
common_subjects.loc[19, 'email date (num)'] = '06-06-2022'

common_subjects.loc[20]['Body']

common_subjects.loc[20, 'email time'] = '00:13'  # TODO: should this be converted to PST?
common_subjects.loc[20, 'email day of week'] = 'Wednesday'
common_subjects.loc[20, 'email date (num)'] = '05-04-2022'

common_subjects

common_subjects.loc[21]['Body']

common_subjects.loc[21, 'email time'] = '03:24'
common_subjects.loc[21, 'email day of week'] = 'Friday'
common_subjects.loc[21, 'email date (num)'] = '04-29-2022'

common_subjects.loc[22]['Body']

common_subjects.loc[22, 'email time'] = '10:41'
common_subjects.loc[22, 'email day of week'] = 'Sunday'
common_subjects.loc[22, 'email date (num)'] = '04-24-2022'

common_subjects.loc[23]['Body']

common_subjects.loc[23, 'email time'] = '14:43'
common_subjects.loc[23, 'email day of week'] = 'Sunday'
common_subjects.loc[23, 'email date (num)'] = '08-21-2022'

common_subjects.loc[24]['Body']

common_subjects.loc[24, 'email time'] = '18:55'
common_subjects.loc[24, 'email day of week'] = 'Tuesday'
common_subjects.loc[24, 'email date (num)'] = '08-16-2022'

common_subjects.loc[25]['Body']

common_subjects.loc[25, 'email time'] = '21:18'
common_subjects.loc[25, 'email day of week'] = 'Wednesday'
common_subjects.loc[25, 'email date (num)'] = '11-23-2022'

common_subjects.loc[26]['Body']

common_subjects.loc[26, 'email time'] = '23:31'
common_subjects.loc[26, 'email day of week'] = 'Friday'
common_subjects.loc[26, 'email date (num)'] = '09-23-2022'

common_subjects.loc[27]['Body']

common_subjects.loc[27, 'email time'] = '19:16'
common_subjects.loc[27, 'email day of week'] = 'Wednesday'
common_subjects.loc[27, 'email date (num)'] = '09-21-2022'

common_subjects.loc[28]['Body']

common_subjects.loc[28, 'email time'] = '10:34'
common_subjects.loc[28, 'email day of week'] = 'Saturday'
common_subjects.loc[28, 'email date (num)'] = '10-08-2022'
common_subjects.loc[28, 'date of crime'] = '10-08-2022'

common_subjects.loc[29]['Body']

common_subjects.loc[29, 'email time'] = '06:04'
common_subjects.loc[29, 'email day of week'] = 'Saturday'
common_subjects.loc[29, 'email date (num)'] = '10-15-2022'

common_subjects.loc[30]['Body']

common_subjects.loc[30, 'email time'] = '12:04'
common_subjects.loc[30, 'email day of week'] = 'Wednesday'
common_subjects.loc[30, 'email date (num)'] = '01-04-2023'

common_subjects.loc[31]['Body']

common_subjects.loc[31, 'email time'] = '07:56'
common_subjects.loc[31, 'email day of week'] = 'Monday'
common_subjects.loc[31, 'email date (num)'] = '02-27-2023'

common_subjects.loc[32]['Body']

common_subjects.loc[32, 'email time'] = '12:41'
common_subjects.loc[32, 'email day of week'] = 'Wednesday'
common_subjects.loc[32, 'email date (num)'] = '03-01-2023'

common_subjects.loc[33]['Body']

common_subjects.loc[33, 'email time'] = '22:14'
common_subjects.loc[33, 'email day of week'] = 'Sunday'
common_subjects.loc[33, 'email date (num)'] = '03-26-2023'
common_subjects.loc[33, 'date of crime'] = '03-26-2023'

common_subjects.loc[34]['Body']

common_subjects.loc[34, 'email time'] = '18:01'
common_subjects.loc[34, 'email day of week'] = 'Thursday'
common_subjects.loc[34, 'email date (num)'] = '11-16-2023'

common_subjects.loc[35]['Body']

common_subjects.loc[35, 'email time'] = '16:50'
common_subjects.loc[35, 'email day of week'] = 'Monday'
common_subjects.loc[35, 'email date (num)'] = '10-30-2023'

common_subjects.loc[36]['Body']

common_subjects.loc[36, 'email time'] = '14:14'
common_subjects.loc[36, 'email day of week'] = 'Wednesday'
common_subjects.loc[36, 'email date (num)'] = '10-25-2023'

common_subjects.loc[37]['Body']

common_subjects.loc[37, 'email time'] = '18:11'
common_subjects.loc[37, 'email day of week'] = 'Thursday'
common_subjects.loc[37, 'email date (num)'] = '09-21-2023'

common_subjects.loc[38]['Body']

common_subjects.loc[38, 'email time'] = '22:02'
common_subjects.loc[38, 'email day of week'] = 'Sunday'
common_subjects.loc[38, 'email date (num)'] = '09-10-2023'
common_subjects.loc[38, 'date of crime'] = '09-10-2023'
common_subjects.loc[38, 'time of crime'] = '20:40'

common_subjects.loc[39]['Body']

common_subjects.loc[39, 'email time'] = '13:49'
common_subjects.loc[39, 'email day of week'] = 'Thursday'
common_subjects.loc[39, 'email date (num)'] = '09-07-2023'

common_subjects.loc[40]['Body']

common_subjects.loc[40, 'email time'] = '16:17'
common_subjects.loc[40, 'email day of week'] = 'Sunday'
common_subjects.loc[40, 'email date (num)'] = '08-20-2023'

common_subjects.loc[41]['Body']

common_subjects.loc[41, 'email time'] = '15:38'
common_subjects.loc[41, 'email day of week'] = 'Tuesday'
common_subjects.loc[41, 'email date (num)'] = '06-20-2023'

common_subjects.loc[42]['Body']

common_subjects.loc[42, 'email time'] = '18:40'
common_subjects.loc[42, 'email day of week'] = 'Wednesday'
common_subjects.loc[42, 'email date (num)'] = '06-07-2023'

common_subjects

Unnamed: 0,Subject,Body,date of crime,time of crime,approx time,email time,email day of week,email date (num),does not have email date
0,UC Berkeley WarnMe:,<https://oem.berkeley.edu/sites/default/files...,06-12-2021,03:22,approximately 0322,07:13,Saturday,06-12-2021,True
1,Violent Crime Reported at People's Park - Plea...,<https://oem.berkeley.edu/sites/default/files...,06-01-2021,15:42,,16:37,Tuesday,06-01-2021,True
2,Violent Crime Reported at People's Park - Plea...,<https://oem.berkeley.edu/sites/default/files...,05-23-2021,17:24,,18:37,Sunday,05-23-2021,True
3,Violent Crime Reported at People's Park - Plea...,<https://oem.berkeley.edu/sites/default/files...,07-06-2021,11:39,,12:56,Tuesday,07-06-2021,True
4,Violent Crime Reported at People's Park - Plea...,<https://oem.berkeley.edu/sites/default/files...,08-27-2021,11:10,,11:31,Friday,08-27-2021,True
5,Burglary at Li Ka Shing Office 100c,<https://oem.berkeley.edu/sites/default/files...,10-25-2021,08:17,,18:31,Monday,10-25-2021,True
6,Burglary at 2333 College Ave Ida Jackson House,<https://oem.berkeley.edu/sites/default/files...,10-21-2021,10:00,,00:53,Friday,10-22-2021,True
7,Violent Crime Reported at People's Park - Plea...,<https://oem.berkeley.edu/sites/default/files...,11-10-2021,21:30,,23:33,Wednesday,11-10-2021,True
8,Violent Crime Reported at People's Park - Plea...,<https://oem.berkeley.edu/sites/default/files...,10-02-2021,15:41,,18:47,Saturday,10-02-2021,True
9,Violent Crime Reported at People's Park - Plea...,<https://oem.berkeley.edu/sites/default/files...,11-21-2021,13:30,,18:04,Sunday,11-21-2021,True
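
A possibly less tedious version of the manual entry above keeps the hand-copied values in one dictionary and derives the weekday from the date instead of typing it. This is a sketch on a two-row toy frame; `demo` and `manual_entries` are hypothetical names, and the two entries reuse the values entered for rows 0 and 1 above:

```python
import pandas as pd

# Toy frame with the same email columns, initially empty.
demo = pd.DataFrame({
    "Subject": ["UC Berkeley WarnMe:", "Violent Crime Reported at People's Park"],
    "email time": [None, None],
    "email day of week": [None, None],
    "email date (num)": [None, None],
})

# Row label -> (email date, email time), copied by hand from the inbox.
manual_entries = {
    0: ("06-12-2021", "07:13"),
    1: ("06-01-2021", "16:37"),
}

for row, (date, time) in manual_entries.items():
    demo.loc[row, "email date (num)"] = date
    demo.loc[row, "email time"] = time

# Derive the weekday from the date rather than entering it manually.
parsed = pd.to_datetime(demo["email date (num)"], format="%m-%d-%Y")
demo["email day of week"] = parsed.dt.day_name()

print(demo[["email date (num)", "email time", "email day of week"]])
```

Deriving the weekday removes one hand-typed field per row, and with it one source of typos.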


In [7]:
crimes_complete = pd.concat([unique_subjects, common_subjects]).reset_index(drop=True).drop(labels='does not have email date', axis=1)

# concatenate unique_subjects and common_subjects to form complete data frame crimes_complete
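
A quick sanity check after concatenating is that no rows were lost or duplicated (here, 187 + 43 should give 230). This is a sketch with toy frames; `a`, `b` and `combined` are illustrative names:

```python
import pandas as pd

# Toy stand-ins for unique_subjects and common_subjects.
a = pd.DataFrame({"Subject": ["x", "y"], "flag": [False, False]})
b = pd.DataFrame({"Subject": ["z"], "flag": [True]})

# ignore_index=True renumbers the rows, like reset_index(drop=True) after concat.
combined = pd.concat([a, b], ignore_index=True).drop(columns="flag")

# Row counts should add up and the new index should have no repeats.
assert len(combined) == len(a) + len(b)
assert combined.index.is_unique
```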

In [1]:
#pd.set_option('display.max_rows', None)
#crimes_complete

In [11]:
crimes_complete.loc[75]['Body']

crimes_complete.loc[75, 'time of crime'] = '12:25'


# row 75 had no time of crime because the time was written as '1225' in this email, so the time was entered manually
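
Times written without a colon (such as '1225', or 'approximately 0322' in the `approx time` column) could also be normalized with a regular expression rather than by hand. This is a sketch on made-up strings, assuming each string contains only one four-digit time; longer text with other numbers, such as street addresses, would need a stricter pattern:

```python
import pandas as pd

# Toy time strings in the formats seen so far, plus a missing value.
raw = pd.Series(["approximately 0322", "1225", "13:15", None])

# Capture hours and minutes, with or without a colon between them.
parts = raw.str.extract(r"(\d{2}):?(\d{2})")

# Rejoin as HH:MM; rows with no match stay missing.
normalized = parts[0].str.cat(parts[1], sep=":")

print(normalized)
```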

In [15]:
crimes_complete['time of crime'].isna().sum() # number of crimes still missing a time; 0 means every crime has a time

crimes_complete = crimes_complete.drop(labels='approx time', axis=1).rename(columns={'email date (num)' : 'email date'})
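
Counting missing values does not catch times that are present but badly formatted. Below is a sketch of a stricter check on made-up values, assuming the intended format is `HH:MM`:

```python
import pandas as pd

# Toy times: two valid, one without a colon, one missing.
times = pd.Series(["02:09", "12:25", "1225", None])

n_missing = times.isna().sum()

# errors="coerce" turns anything that does not match HH:MM into NaT.
parsed = pd.to_datetime(times, format="%H:%M", errors="coerce")
n_bad_format = (parsed.isna() & times.notna()).sum()

print(n_missing, n_bad_format)
```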



In [113]:
crimes_complete.to_csv('complete_crimes.csv', index = False) 

In [None]:
crimes_complete['Body'].str.extract(r'(occurred at \w+)')[0].isna().sum() # number of email bodies without an 'occurred at ...' phrase

crimes_complete[crimes_complete['email day of week'] == 'Sunday'].shape[0]

crimes_complete['Body'].str.extract(r'(occurred at \w+)')
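
For reference, here is a small sketch of how `str.extract` behaves with this pattern, using two made-up email bodies. With `expand=False` and a single capture group it returns a Series, so there is no need to index column `[0]`:

```python
import pandas as pd

# Toy bodies: one contains the phrase, one does not.
bodies = pd.Series([
    "The incident occurred at approximately 0322 hours.",
    "A burglary was reported this morning.",
])

# \w+ captures only the single word that follows "occurred at".
match = bodies.str.extract(r"(occurred at \w+)", expand=False)

# Bodies without the phrase come back as missing values.
n_unmatched = match.isna().sum()

print(match)
print(n_unmatched)
```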