# 1. Data Cleaning Part I: Extracting Email Subject, Date and Time, and Creating Data Frame

This notebook is split into three main portions that make up the initial data cleaning process:
- [Section A: Extracting Email Subject, Date, and Time](#Section-A:-Extracting-Email-Subject,-Date,-and-Time)
- [Section B: Creating Data Frame of Emails with Unique Subject Lines](#Section-B:-Creating-Data-Frame-of-Emails-with-Unique-Subject-Lines)
- [Section C: Merging the two Data Frames together](#Section-C.-Merging-the-two-Data-Frames-together)

## Section A: Extracting Email Subject, Date, and Time

**WarnmeEmails_Text_PST2.txt** :&emsp; a text file containing all emails (as of 10:29 PST December 29, 2023) that were received from the UC Berkeley WarnMe System

This section cleans the text file containing every WarnMe email found in my inbox. The final result is a data frame with the features Email Subject, Email Date and Time, and the day of the week on which the email was sent, so that the data is in a readable format.

In [1]:
import pandas as pd
import re
import numpy as np

In [2]:
sent_line = r'Sent:\t\w+,\s\w+\s\d+,\s\d+\s\d+:\d+\s\w{2}'

with open('WarnmeEmails_Text_PST2.txt') as d:
    all_email_text_lines = d.readlines()
    
emails_text_string = "".join(all_email_text_lines)

# WarnmeEmails_Text_PST2.txt is a text file with all emails from my inbox that were sent by the UC Berkeley WarnMe System

# The regex 'sent_line' is used to find every instance of an email date and time stamp
# The body of the text file is joined into one long string so that regex can be used
# to filter the necessary data

In [3]:
emails_dates_and_times = (re.findall(sent_line, emails_text_string))

# emails_dates_and_times contains the date and time each email was sent, in raw format
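To illustrate what the `sent_line` pattern matches, here is a minimal sketch run on a made-up excerpt. The sample text is hypothetical and only mimics the tab-separated `Sent:` header format that the regex assumes; it is not taken from the real file.

```python
import re

# Same pattern as in the notebook.
sent_line = r'Sent:\t\w+,\s\w+\s\d+,\s\d+\s\d+:\d+\s\w{2}'

# Hypothetical excerpt that mimics the export format (not real data).
sample_text = (
    "From:\tUC Berkeley WarnMe\n"
    "Sent:\tThursday, December 28, 2023 8:52 AM\n"
    "Subject:\tExample subject one\n"
    "\n"
    "From:\tUC Berkeley WarnMe\n"
    "Sent:\tFriday, December 8, 2023 8:32 AM\n"
    "Subject:\tExample subject two\n"
)

# One match per email header; the 'Sent:\t' label is still attached at this stage.
matches = re.findall(sent_line, sample_text)
print(matches)
```

Each match still carries the `Sent:` label and tab, which is why a later cleaning step is needed.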

In [4]:
emails_text_list = re.split(r'\n', emails_text_string)

# emails_text_list is a list that splits the string of emails at every
# new line, so that it becomes easy to locate the subject of each email

In [5]:
all_subject_lines = [x for x in emails_text_list if 'Subject:' in x]
print(len(all_subject_lines))

##all_subject_lines

# Keep every line that includes 'Subject:'. This extracts all of the subject lines,
# because the end of each subject line is followed by a new line

# Note that the number of emails (observations) at this point is 305

305


In [6]:
warnme_emails_subanddate = pd.DataFrame()

warnme_emails_subanddate['subject'] = all_subject_lines

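# Remove only the leading 'Subject:' label; str.strip('Subject:') would instead trim any of
# those characters from both ends of each subject (e.g. a trailing 'e' or 't')
warnme_emails_subanddate['subject'] = warnme_emails_subanddate['subject'].str.replace(r'^Subject:', '', regex=True)
warnme_emails_subanddate['subject'] = warnme_emails_subanddate['subject'].str.strip('\t')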

##warnme_emails_subanddate.head(40) 

# warnme_emails_subanddate organizes the extracted data into tabular format for readability
# the 'subject' feature is first assigned the subjects extracted in their raw format
# the 'Subject:' label and the surrounding tabs are then removed from the series
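One detail worth knowing when removing a label such as `Subject:`: `str.strip` treats its argument as a set of characters to trim from both ends, not as a literal prefix. The sketch below contrasts it with an anchored `str.replace` on two illustrative subject lines (the sample strings are assumptions, not rows copied from the file).

```python
import pandas as pd

# Illustrative raw lines in the 'Subject:<tab>text' layout.
raw = pd.Series(['Subject:\tHearst Parking Structure',
                 'Subject:\tBurglary at Banway Building'])

# str.strip treats 'Subject:' as a set of characters and trims both ends,
# so the trailing 'e' of the first subject is lost.
stripped = raw.str.strip('Subject:').str.strip('\t')

# An anchored regex removes only the literal prefix at the start of the line.
prefix_removed = raw.str.replace(r'^Subject:\t', '', regex=True)

print(stripped.tolist())        # ['Hearst Parking Structur', 'Burglary at Banway Building']
print(prefix_removed.tolist())  # ['Hearst Parking Structure', 'Burglary at Banway Building']
```

Because the subjects are later used as merge keys, a silently shortened subject would fail to match its counterpart in another table.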

In [7]:
warnme_emails_subanddate['date and time'] = emails_dates_and_times
warnme_emails_subanddate['date and time'] = warnme_emails_subanddate['date and time'].str.replace(r'^Sent:', '', regex=True).str.strip('\t')

# assign 'date and time' feature to dates extracted from text
# Then similarly clean the date and time lines

In [8]:
#warnme_emails_subanddate['subject'].value_counts()

# Observing number of subject duplicates

In [9]:
warnme_emails_subanddate['day or night'] = warnme_emails_subanddate['date and time'].str.extract(r'(AM)').fillna('PM')

# Label each email by whether it was sent in the AM or the PM (the column is named 'day or night')
# This label is needed to convert the time the email was sent into 24-hour (military) time,
# so that the time the crime happened (the WarnMe system reports crime times in military time)
# can be compared side by side with the time the WarnMe email was sent

In [10]:
warnme_emails_subanddate['hour'] = warnme_emails_subanddate['date and time'].str.extract(r'(\d{1,2}:)')[0].str.strip(':').astype(int)

# Extracting the hour in which the email was sent and making a new column containing the hour

In [11]:
##warnme_emails_subanddate.head(60)

In [12]:
time_arr = warnme_emails_subanddate[['day or night','hour']].to_numpy()

##time_arr

# creating 2D array with PM/AM info and hour in order to easily convert the times

In [13]:
for i in time_arr:
    if i[0] == 'PM' and i[1] != 12:
        i[1] = i[1] + 12

# For every pair of AM/PM and hour, if the time is PM and the hour is not 12, add 12 to get the hour in 24-hour (military) time
# e.g. if the email time is ['PM', 3], the hour becomes 3 + 12 = 15, the 15th hour, which replaces the old hour in the 2D array

In [14]:
##time_arr

In [15]:
old_times = time_arr[:,1].astype(str)

# 'old_times' holds the hours converted from int to string, so that the
# time data can later be displayed in hh:mm format

In [16]:
time_index = np.arange(len(old_times))

for t in time_index:
    if int(old_times[t]) < 10:
        old_times[t] = '0' + old_times[t]
    if int(old_times[t]) == 12 and time_arr[t, 0] == 'AM':
        old_times[t] = '00'

# if the hour is less than 10, prepend a 0 to the single-digit hour
# if the time is 12 AM, the hour becomes 00
# otherwise the hour stays the same and nothing needs to be prepended
# 'old_times' is now the hour correctly formatted in 24-hour time, as type string

#old_times
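The 12-hour to 24-hour conversion above can also be written as one formula: `hour_24 = hour % 12 + 12` for PM rows and `hour_24 = hour % 12` for AM rows. The following is a minimal vectorized sketch of that formula, not the notebook's own method; the four rows are made-up values and only the column names follow the notebook.

```python
import pandas as pd

# Illustrative rows only; column names follow the notebook.
df = pd.DataFrame({'day or night': ['AM', 'PM', 'PM', 'AM'],
                   'hour': [8, 12, 3, 12]})

# hour % 12 maps 12 -> 0 and leaves 1..11 unchanged; PM rows then gain 12.
is_pm = df['day or night'].eq('PM')
hour_24 = df['hour'] % 12 + 12 * is_pm.astype(int)

# Zero-pad to two characters, e.g. 8 -> '08' and 0 -> '00'.
df['hour new'] = hour_24.astype(str).str.zfill(2)
print(df['hour new'].tolist())  # ['08', '12', '15', '00']
```

The modulo handles both edge cases at once: 12 PM stays 12 and 12 AM becomes 00.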

In [17]:
warnme_emails_subanddate['hour new'] = old_times
warnme_emails_subanddate.head(10)

# Create new column containing the updated hours that are correctly formatted to 'hour new'

Unnamed: 0,subject,date and time,day or night,hour,hour new
0,Clark Kerr Campus - Violent Crime Reported - ...,"Thursday, December 28, 2023 8:52 AM",AM,8,08
1,Burglary at Intersection Apartments,"Friday, December 8, 2023 8:32 AM",AM,8,08
2,Community Advisory - Test of the UC Berkeley O...,"Wednesday, December 6, 2023 12:01 PM",PM,12,12
3,A Campus Residence Hall - Violent Crime Repor...,"Tuesday, December 5, 2023 1:40 PM",PM,1,13
4,Burglary at Banway Building,"Friday, December 1, 2023 11:16 AM",AM,11,11
5,Burglary at Richmond Field Station,"Thursday, November 16, 2023 6:02 PM",PM,6,18
6,UC Berkeley WarnMe:11-06-202315:52:52ALL CLEAR...,"Monday, November 6, 2023 3:53 PM",PM,3,15
7,UC Berkeley WarnMe:11-06-202314:44:49 Building...,"Monday, November 6, 2023 2:46 PM",PM,2,14
8,Community Advisory - Please note this message ...,"Monday, November 6, 2023 11:11 AM",AM,11,11
9,Robbery occurred Center St and Oxford Way - V...,"Wednesday, November 1, 2023 5:14 PM",PM,5,17


In [18]:
minutes = warnme_emails_subanddate['date and time'].str.extract(r'(:\d+)')[0].tolist()
hours = warnme_emails_subanddate['hour new'].tolist()

# 'minutes' extracts the :mm portion of the time so that it can be concatenated with the new hour format
# 'hours' converts the new-hour feature to a list so that it can be iterated over

In [19]:
index = np.arange(len(hours))

for i in index:
    hours[i] = (hours[i] + minutes[i])

# concatenates hour and minute into correctly formatted times; the list hours then becomes the list of times

warnme_emails_subanddate['email time'] = hours

# Then assign the list of times to a new 'email time' column

warnme_emails_subanddate.head(10)

Unnamed: 0,subject,date and time,day or night,hour,hour new,email time
0,Clark Kerr Campus - Violent Crime Reported - ...,"Thursday, December 28, 2023 8:52 AM",AM,8,08,08:52
1,Burglary at Intersection Apartments,"Friday, December 8, 2023 8:32 AM",AM,8,08,08:32
2,Community Advisory - Test of the UC Berkeley O...,"Wednesday, December 6, 2023 12:01 PM",PM,12,12,12:01
3,A Campus Residence Hall - Violent Crime Repor...,"Tuesday, December 5, 2023 1:40 PM",PM,1,13,13:40
4,Burglary at Banway Building,"Friday, December 1, 2023 11:16 AM",AM,11,11,11:16
5,Burglary at Richmond Field Station,"Thursday, November 16, 2023 6:02 PM",PM,6,18,18:02
6,UC Berkeley WarnMe:11-06-202315:52:52ALL CLEAR...,"Monday, November 6, 2023 3:53 PM",PM,3,15,15:53
7,UC Berkeley WarnMe:11-06-202314:44:49 Building...,"Monday, November 6, 2023 2:46 PM",PM,2,14,14:46
8,Community Advisory - Please note this message ...,"Monday, November 6, 2023 11:11 AM",AM,11,11,11:11
9,Robbery occurred Center St and Oxford Way - V...,"Wednesday, November 1, 2023 5:14 PM",PM,5,17,17:14


In [20]:
warnme_emails_subanddate.drop(labels=['day or night', 'hour', 'hour new'], axis=1)

Unnamed: 0,subject,date and time,email time
0,Clark Kerr Campus - Violent Crime Reported - ...,"Thursday, December 28, 2023 8:52 AM",08:52
1,Burglary at Intersection Apartments,"Friday, December 8, 2023 8:32 AM",08:32
2,Community Advisory - Test of the UC Berkeley O...,"Wednesday, December 6, 2023 12:01 PM",12:01
3,A Campus Residence Hall - Violent Crime Repor...,"Tuesday, December 5, 2023 1:40 PM",13:40
4,Burglary at Banway Building,"Friday, December 1, 2023 11:16 AM",11:16
...,...,...,...
300,"Burglary at Botanical Gardens, 200 Centennial ...","Sunday, May 30, 2021 12:03 PM",12:03
301,Community Advisory - Work-Study Internet Scam,"Wednesday, May 26, 2021 8:37 AM",08:37
302,Community Advisory: Work-Study Internet Scam,"Wednesday, May 26, 2021 8:31 AM",08:31
303,Violent Crime Reported at People's Park - Plea...,"Sunday, May 23, 2021 6:33 PM",18:33


In [21]:
warnme_emails_subanddate['email day of week'] = warnme_emails_subanddate['date and time'].str.extract(r'(\w+)')

#pd.set_option('display.max_rows', None)
warnme_emails_subanddate

# extract the email day of the week; this will be used later to determine whether there is any trend
# between day of the week and criminal activity, e.g. how it relates to the delays and the kind of crime

Unnamed: 0,subject,date and time,day or night,hour,hour new,email time,email day of week
0,Clark Kerr Campus - Violent Crime Reported - ...,"Thursday, December 28, 2023 8:52 AM",AM,8,08,08:52,Thursday
1,Burglary at Intersection Apartments,"Friday, December 8, 2023 8:32 AM",AM,8,08,08:32,Friday
2,Community Advisory - Test of the UC Berkeley O...,"Wednesday, December 6, 2023 12:01 PM",PM,12,12,12:01,Wednesday
3,A Campus Residence Hall - Violent Crime Repor...,"Tuesday, December 5, 2023 1:40 PM",PM,1,13,13:40,Tuesday
4,Burglary at Banway Building,"Friday, December 1, 2023 11:16 AM",AM,11,11,11:16,Friday
...,...,...,...,...,...,...,...
300,"Burglary at Botanical Gardens, 200 Centennial ...","Sunday, May 30, 2021 12:03 PM",PM,12,12,12:03,Sunday
301,Community Advisory - Work-Study Internet Scam,"Wednesday, May 26, 2021 8:37 AM",AM,8,08,08:37,Wednesday
302,Community Advisory: Work-Study Internet Scam,"Wednesday, May 26, 2021 8:31 AM",AM,8,08,08:31,Wednesday
303,Violent Crime Reported at People's Park - Plea...,"Sunday, May 23, 2021 6:33 PM",PM,6,18,18:33,Sunday


In [22]:
warnme_emails_subanddate['email date'] = warnme_emails_subanddate['date and time'].str.extract(r'(\w+\s\d{1,2}, \d{4})')
warnme_emails_subanddate = warnme_emails_subanddate.drop(labels = ['date and time', 'day or night', 'hour', 'hour new'], axis=1)

# create 'email date' feature with extracted information, and drop unnecessary columns that are now redundant data given the newly created columns
warnme_emails_subanddate

Unnamed: 0,subject,email time,email day of week,email date
0,Clark Kerr Campus - Violent Crime Reported - ...,08:52,Thursday,"December 28, 2023"
1,Burglary at Intersection Apartments,08:32,Friday,"December 8, 2023"
2,Community Advisory - Test of the UC Berkeley O...,12:01,Wednesday,"December 6, 2023"
3,A Campus Residence Hall - Violent Crime Repor...,13:40,Tuesday,"December 5, 2023"
4,Burglary at Banway Building,11:16,Friday,"December 1, 2023"
...,...,...,...,...
300,"Burglary at Botanical Gardens, 200 Centennial ...",12:03,Sunday,"May 30, 2021"
301,Community Advisory - Work-Study Internet Scam,08:37,Wednesday,"May 26, 2021"
302,Community Advisory: Work-Study Internet Scam,08:31,Wednesday,"May 26, 2021"
303,Violent Crime Reported at People's Park - Plea...,18:33,Sunday,"May 23, 2021"


In [23]:
months = warnme_emails_subanddate['email date'].str.extract(r'(\w+)')[0].tolist()

In [24]:
month_index = np.arange(len(months))

def month_mapping(monthees):
    return{'January':'01', 'February':'02', 'March':'03', 'April':'04', 'May':'05', 
           'June':'06', 'July':'07', 'August':'08', 'September':'09', 'October':'10', 'November':'11', 'December':'12'}[monthees]

# Maps each month name to its two-digit number; this is then used to traverse all months and convert them to numbered dates

In [25]:
def month_to_number(m_list, indexes):  
    for m in indexes:
        m_list[m] = month_mapping(m_list[m])
    return m_list

month_nums = month_to_number(months, month_index)
warnme_emails_subanddate['email month'] = month_nums

# month_nums is a list containing each month converted to its two-digit number (as a string)
# and is then assigned as a new column in the table as 'email month'
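The same lookup can be expressed without an explicit loop. This is a minimal sketch, assuming an English locale so that the standard library's `calendar.month_name` yields 'January' through 'December'; `month_lookup` and the three sample dates are illustrative names and values, not part of the notebook.

```python
import calendar
import pandas as pd

# Build {'January': '01', ..., 'December': '12'} from the standard library.
# calendar.month_name[0] is an empty string, so it is skipped.
month_lookup = {name: f'{num:02d}'
                for num, name in enumerate(calendar.month_name) if name}

# Illustrative dates in the same 'Month d, yyyy' layout as 'email date'.
dates = pd.Series(['December 28, 2023', 'November 6, 2023', 'May 23, 2021'])

# Take the leading word (the month name) and map it through the lookup.
month_names = dates.str.extract(r'(\w+)')[0]
month_nums = month_names.map(month_lookup)
print(month_nums.tolist())  # ['12', '11', '05']
```

`Series.map` returns NaN for any name missing from the lookup, which makes a misspelled or unexpected month easy to spot.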

In [26]:
warnme_emails_subanddate.head(10)

Unnamed: 0,subject,email time,email day of week,email date,email month
0,Clark Kerr Campus - Violent Crime Reported - ...,08:52,Thursday,"December 28, 2023",12
1,Burglary at Intersection Apartments,08:32,Friday,"December 8, 2023",12
2,Community Advisory - Test of the UC Berkeley O...,12:01,Wednesday,"December 6, 2023",12
3,A Campus Residence Hall - Violent Crime Repor...,13:40,Tuesday,"December 5, 2023",12
4,Burglary at Banway Building,11:16,Friday,"December 1, 2023",12
5,Burglary at Richmond Field Station,18:02,Thursday,"November 16, 2023",11
6,UC Berkeley WarnMe:11-06-202315:52:52ALL CLEAR...,15:53,Monday,"November 6, 2023",11
7,UC Berkeley WarnMe:11-06-202314:44:49 Building...,14:46,Monday,"November 6, 2023",11
8,Community Advisory - Please note this message ...,11:11,Monday,"November 6, 2023",11
9,Robbery occurred Center St and Oxford Way - V...,17:14,Wednesday,"November 1, 2023",11


In [27]:
email_days = warnme_emails_subanddate['email date'].str.extract(r'(\d{1,2})')

days_list = email_days[0].tolist()
index_dl = np.arange(len(days_list))

for i in index_dl:
    if int(days_list[i]) < 10:
        days_list[i] = '0' + days_list[i]

# 'email_days' holds the day of the month (dd portion of mm/dd/yyyy) and 'days_list' is the same data as a list
# the loop converts each day to a more readable format: if the day is less than 10, prepend a 0

warnme_emails_subanddate['email day'] = days_list

In [28]:
email_years = warnme_emails_subanddate['email date'].str.extract(r'(\d{4})')
warnme_emails_subanddate['email year'] = email_years
warnme_emails_subanddate

# similar idea with email year, but simply extracts the years into its own column with no modification

Unnamed: 0,subject,email time,email day of week,email date,email month,email day,email year
0,Clark Kerr Campus - Violent Crime Reported - ...,08:52,Thursday,"December 28, 2023",12,28,2023
1,Burglary at Intersection Apartments,08:32,Friday,"December 8, 2023",12,08,2023
2,Community Advisory - Test of the UC Berkeley O...,12:01,Wednesday,"December 6, 2023",12,06,2023
3,A Campus Residence Hall - Violent Crime Repor...,13:40,Tuesday,"December 5, 2023",12,05,2023
4,Burglary at Banway Building,11:16,Friday,"December 1, 2023",12,01,2023
...,...,...,...,...,...,...,...
300,"Burglary at Botanical Gardens, 200 Centennial ...",12:03,Sunday,"May 30, 2021",05,30,2021
301,Community Advisory - Work-Study Internet Scam,08:37,Wednesday,"May 26, 2021",05,26,2021
302,Community Advisory: Work-Study Internet Scam,08:31,Wednesday,"May 26, 2021",05,26,2021
303,Violent Crime Reported at People's Park - Plea...,18:33,Sunday,"May 23, 2021",05,23,2021


In [29]:
email_days = days_list
email_years = email_years[0].to_list()
indices = np.arange(len(email_years))

for i in indices:
    months[i] = months[i] + '-' + email_days[i] + '-' + email_years[i]

#months

# Concatenate the month, day, and year values into a single mm-dd-yyyy string; the list 'months' then holds the full dates

In [30]:
dates = months

In [31]:
warnme_emails_subanddate['email date (num)'] = dates
warnme_emails_subanddate = warnme_emails_subanddate.drop(labels= ['email month', 'email day', 'email year', 'email date'], axis=1)

warnme_emails_subanddate.head(10)

Unnamed: 0,subject,email time,email day of week,email date (num)
0,Clark Kerr Campus - Violent Crime Reported - ...,08:52,Thursday,12-28-2023
1,Burglary at Intersection Apartments,08:32,Friday,12-08-2023
2,Community Advisory - Test of the UC Berkeley O...,12:01,Wednesday,12-06-2023
3,A Campus Residence Hall - Violent Crime Repor...,13:40,Tuesday,12-05-2023
4,Burglary at Banway Building,11:16,Friday,12-01-2023
5,Burglary at Richmond Field Station,18:02,Thursday,11-16-2023
6,UC Berkeley WarnMe:11-06-202315:52:52ALL CLEAR...,15:53,Monday,11-06-2023
7,UC Berkeley WarnMe:11-06-202314:44:49 Building...,14:46,Monday,11-06-2023
8,Community Advisory - Please note this message ...,11:11,Monday,11-06-2023
9,Robbery occurred Center St and Oxford Way - V...,17:14,Wednesday,11-01-2023
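For comparison, the time, weekday, and numeric date built step by step in this section can also be derived by parsing the raw stamp once. This is a hedged alternative sketch using `pd.to_datetime` with an explicit format string, not the notebook's own method; it assumes English day and month names and the exact layout shown in the 'date and time' column.

```python
import pandas as pd

# Three raw stamps in the layout of the 'date and time' column.
raw = pd.Series(['Thursday, December 28, 2023 8:52 AM',
                 'Wednesday, December 6, 2023 12:01 PM',
                 'Tuesday, December 5, 2023 1:40 PM'])

# %A weekday name, %B month name, %I 12-hour clock, %p AM/PM marker.
parsed = pd.to_datetime(raw, format='%A, %B %d, %Y %I:%M %p')

result = pd.DataFrame({
    'email time': parsed.dt.strftime('%H:%M'),          # 24-hour hh:mm
    'email day of week': parsed.dt.day_name(),
    'email date (num)': parsed.dt.strftime('%m-%d-%Y'),
})
print(result)
```

Keeping `parsed` as a datetime column would also allow direct subtraction from a parsed crime time when delays are computed later.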


## Section B: Creating Data Frame of Emails with Unique Subject Lines

- This section continues the data cleaning process with the same data frame, **warnme_emails_subanddate**
- The next goal is to filter the data frame so that each row/observation has a unique subject. This will help when merging this data frame with the data frame created in Notebook 2, so that the email information created here is not assigned to the wrong email body. This is done in two steps:
- First, removing the duplicate rows created by replies (where the reply has the same subject line as the original email) and by empty template emails
- Then, removing any emails that have repeated subject lines, which is usually due to crimes being similar in activity/location

In [32]:
warnme_emails_subanddate['subject'].value_counts()[0:13]

# these are the most frequent subjects, i.e. the email subjects that are not unique

subject
Community Advisory - Please note this message may contain information that some may find upsetting.                                                   23
Violent Crime Reported at People's Park - Please note this message may contain information that some may find upsetting.                              19
Violent Crime Reported at People's Park Housing Construction Site - Please note this message may contain information that some may find upsetting.     3
Community Advisory - Get Consent and Respect Boundaries                                                                                                3
Arson Reported at People's Park Housing Construction Site - Please note this message may contain information that some may find upsetting.             2
Violent Crime Reported at Sather Gate - Please note this message may contain information that some may find upsetting.                                 2
Burglary at Richmond Field Station (RFS)                                  

In [33]:
# need to remove duplicates of emails as we did in the other dataframe
# ie the ones that have same subject, but one email is a reply to the other

warnme_emails_subanddate[warnme_emails_subanddate['subject'] == 'Burglary at UVA Grounds Shop, 298 Ohlone Ave, Albany']
warnme_emails_subanddate = warnme_emails_subanddate.drop(labels=119) # first_duplicate

warnme_emails_subanddate[warnme_emails_subanddate['subject'] == 'Burglary at UVA Grounds Shop, 298 Ohlone Ave, Albany']

warnme_emails_subanddate[warnme_emails_subanddate['subject'] == 'Violent Crime Reported at Etcheverry Hall/Soda Hall breezeway - Please note this message may contain information that some may find upsetting.']
warnme_emails_subanddate = warnme_emails_subanddate.drop(labels=120) # second_duplicate

warnme_emails_subanddate[warnme_emails_subanddate['subject'] == 'Violent Crime Reported at Etcheverry Hall/Soda Hall breezeway - Please note this message may contain information that some may find upsetting.']

warnme_emails_subanddate[warnme_emails_subanddate['subject'] == 'Violent Crime Reported at Haas School of Business - Please note this message may contain information that some may find upsetting.']
warnme_emails_subanddate = warnme_emails_subanddate.drop(labels=225) # third_duplicate

warnme_emails_subanddate[warnme_emails_subanddate['subject'] == 'Violent Crime Reported at Haas School of Business - Please note this message may contain information that some may find upsetting.']

warnme_emails_subanddate[warnme_emails_subanddate['subject'] == 'UC Berkeley WarnMe: AVOID THE AREA of Latimer Hall']
warnme_emails_subanddate = warnme_emails_subanddate.drop(labels=279) # fourth_duplicate

warnme_emails_subanddate[warnme_emails_subanddate['subject'] == 'UC Berkeley WarnMe: AVOID THE AREA of Latimer Hall']

warnme_emails_subanddate[warnme_emails_subanddate['subject'] == 'Violent Crime Reported at Sather Gate - Please note this message may contain information that some may find upsetting.']

warnme_emails_subanddate = warnme_emails_subanddate.drop(labels=62) # fifth_duplicate
warnme_emails_subanddate[warnme_emails_subanddate['subject'] == 'Violent Crime Reported at Sather Gate - Please note this message may contain information that some may find upsetting.']

#warnme_emails_subanddate

# Nov 5, 2021 is extra WarnMe about Hearst Parking Structure.. has no contents Nov 5, 2021 8:11 PM, sixth duplicate
warnme_emails_subanddate = warnme_emails_subanddate.drop(labels=247)

#warnme_emails_subanddate

# April 10, 2023 4:00 pm extra WarnMe email template about assault Eucalyptus and Stephens Hall, seventh duplicate
warnme_emails_subanddate = warnme_emails_subanddate.drop(labels=69)

warnme_emails_subanddate.shape

# Wed, Jun 21, 2023, 7:33 AM Email is empty template, eighth duplicate
warnme_emails_subanddate = warnme_emails_subanddate.drop(labels=50)

#warnme_emails_subanddate

# Jan 24, 2023 at 9:40 is also an empty template email, ninth duplicate
warnme_emails_subanddate = warnme_emails_subanddate.drop(labels=99)

warnme_emails_subanddate.shape # now no duplicates 

# with 296 emails (all replies removed), now must remove any subject with value_counts > 1 so that the two tables can be merged
#warnme_emails_subanddate['subject'].value_counts()

(296, 4)

In [34]:
emails_subbanddate_no_dup_subjects = warnme_emails_subanddate[(warnme_emails_subanddate['subject'] != 'Community Advisory - Please note this message may contain information that some may find upsetting.')
                                             & (warnme_emails_subanddate['subject'] != "Violent Crime Reported at People's Park - Please note this message may contain information that some may find upsetting.")
                                             & (warnme_emails_subanddate['subject'] != "Community Advisory - Test of the UC Berkeley Outdoor Early Warning System")
                                             & (warnme_emails_subanddate['subject'] != "Community Advisory - Get Consent and Respect Boundaries")
                                             & (warnme_emails_subanddate['subject'] != "Violent Crime Reported at People's Park Housing Construction Site - Please note this message may contain information that some may find upsetting.")
                                             & (warnme_emails_subanddate['subject'] != "Burglary at Richmond Field Station")
                                             & (warnme_emails_subanddate['subject'] != "Arson Reported at People's Park Housing Construction Site - Please note this message may contain information that some may find upsetting.")
                                             & (warnme_emails_subanddate['subject'] != "Burglary at Richmond Field Station (RFS)")]

In [35]:
warnme_subject_and_date = emails_subbanddate_no_dup_subjects.reset_index().drop(labels='index', axis=1)

warnme_subject_and_date.head(10)

# now have a data frame with unique subject lines and with the features subject, email time, day of week, and email date

Unnamed: 0,subject,email time,email day of week,email date (num)
0,Clark Kerr Campus - Violent Crime Reported - ...,08:52,Thursday,12-28-2023
1,Burglary at Intersection Apartments,08:32,Friday,12-08-2023
2,A Campus Residence Hall - Violent Crime Repor...,13:40,Tuesday,12-05-2023
3,Burglary at Banway Building,11:16,Friday,12-01-2023
4,UC Berkeley WarnMe:11-06-202315:52:52ALL CLEAR...,15:53,Monday,11-06-2023
5,UC Berkeley WarnMe:11-06-202314:44:49 Building...,14:46,Monday,11-06-2023
6,Robbery occurred Center St and Oxford Way - V...,17:14,Wednesday,11-01-2023
7,College Ave and Derby St - Violent Crime Repo...,21:17,Monday,10-30-2023
8,Burglary at Golden Bear Cafe on UC Berkeley Ma...,17:36,Monday,10-30-2023
9,Burglary at 2232 Piedmont Av,16:51,Monday,10-30-2023
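The second step above, dropping every email whose subject line is repeated, can be sketched with `Series.duplicated(keep=False)` instead of a hand-written list of subjects. This is an illustration on toy data and an assumption about intent: it removes all rows whose subject occurs more than once, which may not match the hand-picked list in the notebook exactly.

```python
import pandas as pd

# Illustrative subjects only (single letters stand in for real subject lines).
df = pd.DataFrame({'subject': ['A', 'B', 'A', 'C', 'B', 'D'],
                   'email time': ['08:52', '08:32', '12:01', '13:40', '11:16', '18:02']})

# keep=False marks every member of a duplicated group, not just the later repeats.
has_repeated_subject = df['subject'].duplicated(keep=False)

# reset_index(drop=True) renumbers the rows without keeping the old index as a column.
unique_only = df[~has_repeated_subject].reset_index(drop=True)
print(unique_only)
```

Here only the rows with subjects 'C' and 'D' remain, so every surviving subject can serve as an unambiguous merge key.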


## Section C. Merging the two Data Frames together

In [36]:
imported_df = pd.read_csv('CSV files/emailz.csv').drop(labels='Unnamed: 0', axis=1)
imported_df
#imported_df is the data frame from Notebook 2, containing Email Body

Unnamed: 0,Subject,Body,date of crime,time of crime,approx time
0,Burglary at University Village: Albany (UVA),<https://oem.berkeley.edu/sites/default/files...,06-17-2021,02:09,
1,"Arson Reported at 2650 Haste St., Berkeley CA ...",<https://oem.berkeley.edu/sites/default/files...,06-16-2021,05:20,
2,UC Berkeley WarnMe:,<https://oem.berkeley.edu/sites/default/files...,6/12/21,,approximately 0322
3,Violent Crime Reported at 3100 Block of Dwight...,<https://oem.berkeley.edu/sites/default/files...,06-08-2021,13:15,
4,Community Advisory - UCPD Supports LGBTQ+ Prid...,<https://oem.berkeley.edu/sites/default/files...,,,
...,...,...,...,...,...
236,Burglary at Physics North Building,<https://oem.berkeley.edu/sites/default/files...,07-03-2023,21:00,
237,Violent Crime Reported at 2200 Block of Bancro...,<https://oem.berkeley.edu/sites/default/files...,05-07-2023,23:25,
238,Violent Crime Reported at 2400 Block Durant Av...,<https://oem.berkeley.edu/sites/default/files...,06-01-2023,00:45,approximately 0045
239,Violent Crime Reported at 2400 block of Colleg...,<https://oem.berkeley.edu/sites/default/files...,04-29-2023,18:52,


In [37]:
combined_df = imported_df.merge(warnme_subject_and_date, left_on='Subject', right_on='subject', how='left')
combined_df

Unnamed: 0,Subject,Body,date of crime,time of crime,approx time,subject,email time,email day of week,email date (num)
0,Burglary at University Village: Albany (UVA),<https://oem.berkeley.edu/sites/default/files...,06-17-2021,02:09,,Burglary at University Village: Albany (UVA),04:02,Thursday,06-17-2021
1,"Arson Reported at 2650 Haste St., Berkeley CA ...",<https://oem.berkeley.edu/sites/default/files...,06-16-2021,05:20,,"Arson Reported at 2650 Haste St., Berkeley CA ...",10:10,Wednesday,06-16-2021
2,UC Berkeley WarnMe:,<https://oem.berkeley.edu/sites/default/files...,6/12/21,,approximately 0322,,,,
3,Violent Crime Reported at 3100 Block of Dwight...,<https://oem.berkeley.edu/sites/default/files...,06-08-2021,13:15,,Violent Crime Reported at 3100 Block of Dwight...,16:51,Tuesday,06-08-2021
4,Community Advisory - UCPD Supports LGBTQ+ Prid...,<https://oem.berkeley.edu/sites/default/files...,,,,Community Advisory - UCPD Supports LGBTQ+ Prid...,15:40,Tuesday,06-01-2021
...,...,...,...,...,...,...,...,...,...
236,Burglary at Physics North Building,<https://oem.berkeley.edu/sites/default/files...,07-03-2023,21:00,,Burglary at Physics North Building,07:11,Tuesday,07-04-2023
237,Violent Crime Reported at 2200 Block of Bancro...,<https://oem.berkeley.edu/sites/default/files...,05-07-2023,23:25,,Violent Crime Reported at 2200 Block of Bancro...,02:46,Monday,05-08-2023
238,Violent Crime Reported at 2400 Block Durant Av...,<https://oem.berkeley.edu/sites/default/files...,06-01-2023,00:45,approximately 0045,Violent Crime Reported at 2400 Block Durant Av...,02:12,Thursday,06-01-2023
239,Violent Crime Reported at 2400 block of Colleg...,<https://oem.berkeley.edu/sites/default/files...,04-29-2023,18:52,,Violent Crime Reported at 2400 block of Colleg...,20:07,Saturday,04-29-2023


In [38]:
heyy = pd.read_csv('TrueBlue/warnme_info.csv').drop(labels='Unnamed: 0', axis=1)

In [39]:
complete_emails = heyy.merge(warnme_subject_and_date, left_on='Subject', right_on='subject', how='left').drop(labels= 'subject', axis=1)
complete_emails

Unnamed: 0,Subject,Body,date of crime,time of crime,approx time,email time,email day of week,email date (num)
0,Burglary at University Village: Albany (UVA),<https://oem.berkeley.edu/sites/default/files...,06-17-2021,02:09,,04:02,Thursday,06-17-2021
1,"Arson Reported at 2650 Haste St., Berkeley CA ...",<https://oem.berkeley.edu/sites/default/files...,06-16-2021,05:20,,10:10,Wednesday,06-16-2021
2,UC Berkeley WarnMe:,<https://oem.berkeley.edu/sites/default/files...,6/12/21,,approximately 0322,,,
3,Violent Crime Reported at 3100 Block of Dwight...,<https://oem.berkeley.edu/sites/default/files...,06-08-2021,13:15,,16:51,Tuesday,06-08-2021
4,Violent Crime Reported at People's Park - Plea...,<https://oem.berkeley.edu/sites/default/files...,06-01-2021,15:42,,,,
...,...,...,...,...,...,...,...,...
291,Violent Crime Reported at 2200 Block of Bancro...,<https://oem.berkeley.edu/sites/default/files...,05-07-2023,23:25,,02:46,Monday,05-08-2023
292,Violent Crime Reported at 2400 Block Durant Av...,<https://oem.berkeley.edu/sites/default/files...,06-01-2023,00:45,approximately 0045,02:12,Thursday,06-01-2023
293,Violent Crime Reported at 2400 block of Colleg...,<https://oem.berkeley.edu/sites/default/files...,04-29-2023,18:52,,20:07,Saturday,04-29-2023
294,Burglary at Richmond Field Station (RFS),<https://oem.berkeley.edu/sites/default/files...,06-07-2023,00:00,,,,
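Some rows in the merged output have no email time because their subject has no match in the table of unique subjects. A minimal sketch of how such rows could be listed, using the `indicator` and `validate` options of `DataFrame.merge`; the two small frames and their values are hypothetical stand-ins for the real tables.

```python
import pandas as pd

# Hypothetical stand-ins: 'bodies' for the Notebook 2 table, 'headers' for the unique-subject table.
bodies = pd.DataFrame({'Subject': ['A', 'B', 'C'],
                       'Body': ['body a', 'body b', 'body c']})
headers = pd.DataFrame({'subject': ['A', 'C'],
                        'email time': ['04:02', '16:51']})

# validate='many_to_one' raises if a subject is repeated on the right side;
# indicator=True adds a '_merge' column recording where each row came from.
merged = bodies.merge(headers, left_on='Subject', right_on='subject',
                      how='left', validate='many_to_one', indicator=True)

# Rows present only in the left table found no matching subject.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Subject'].tolist()
print(unmatched)  # ['B']
```

The `validate` check guards against the merge silently multiplying rows if a duplicate subject slips through the earlier filtering.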


In [8]:
#complete_emails

In [225]:
complete_emails.to_csv('warnme_emails_w_email_d')