# UTPD Incident Notification Scraper

Data Reporter: José Martínez, martinez307jose@gmail.com

### Data Overview

UTPD has a responsibility to comply with Clery Act requirements, releasing timely warnings about reported crimes to the campus community in a manner that will aid in the prevention of similar crimes, as well as providing emergency notifications within Clery Act geography when the health and safety of the campus community is at risk. The department utilizes one or more of the following methods of communication to post warnings:  text messages, campus wide email, social media (Facebook & Twitter), UT Emergency and home page, campus siren system, desktop pop-up alert system and closed circuit television systems in residence halls and other buildings.

The excerpt above comes from the following website: https://police.utexas.edu/crimefeed

Here, UTPD publishes all the notifications it sends out to students. I decided to scrape each incident using Selenium, geocode it with Geopy, and map it with Folium. The feed spans multiple pages, and each page links to several individual incident pages, so the scraper needed a few for loops.

The locations and coordinates also needed a lot of cleaning. I'll add some analysis soon.

In [36]:
from selenium import webdriver

In [37]:
from bs4 import BeautifulSoup
import pandas as pd

In [38]:
driver = webdriver.Chrome('/Users/josemartinez/Desktop/chromedriver')

In [39]:
driver.get('https://police.utexas.edu/crimefeed')

In [40]:
# Pulling all links that correspond to each page. I noticed that each of the pages has the text 'crimefeed' and 'page=' in the url,
# so the if statement checks for both. Anchors without an href are skipped.
list_pages = []
for text in driver.find_elements_by_tag_name('a'):
    links = text.get_attribute('href')
    if links and 'crimefeed' in links and 'page=' in links:
        list_pages.append(links)
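
One Python detail matters for a filter like this: an expression of the form `'crimefeed' and 'page=' in links` tests only the second substring, because a non-empty string literal is always truthy. Also, `get_attribute('href')` can return `None` for anchors without an `href`. Below is a minimal sketch of the intended check as a standalone function. The name `is_pagination_link` and the sample URLs are made up for illustration.

```python
def is_pagination_link(href):
    """Return True only for crime-feed pagination URLs."""
    if not href:  # anchors without an href can come back as None
        return False
    # Each substring needs its own 'in' test.
    return 'crimefeed' in href and 'page=' in href


# Illustrative hrefs, not scraped data.
sample_hrefs = [
    'https://police.utexas.edu/crimefeed?page=1',
    'https://police.utexas.edu/news?page=2',
    'https://police.utexas.edu/crimefeed',
    None,
]

pagination_links = [h for h in sample_hrefs if is_pagination_link(h)]
print(pagination_links)
```

Only the first sample URL passes, since it is the only one that contains both substrings.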

In [41]:
# the first page is the only one that doesn't include the text above, so I manually added it.
list_pages.insert(0,'https://police.utexas.edu/crimefeed')

In [42]:
# I now have each page that displays multiple incidents, but to retrieve information, I have to open each incident, so I once
# again pulled all links that include only the corresponding url that won't lead to other tabs, social media links, etc.
individual_pages = []
for page in list_pages:
    driver.get(page)
    for text in driver.find_elements_by_tag_name('a'):
        links = text.get_attribute('href')
        # Skip anchors without an href before testing the url.
        if links and 'police.utexas.edu/crimefeed/' in links:
            individual_pages.append(links)

In [43]:
# This is how I avoided duplicate links while keeping them in their original order.
unique_pages = []
for x in individual_pages:
    if x not in unique_pages:
        unique_pages.append(x)
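
The same order-preserving de-duplication can be written in one line with `dict.fromkeys`, since dictionaries keep insertion order in Python 3.7+. A minimal sketch, using made-up URLs in place of the scraped list:

```python
# Illustrative links, not scraped data.
individual_pages = [
    'https://police.utexas.edu/crimefeed/a',
    'https://police.utexas.edu/crimefeed/b',
    'https://police.utexas.edu/crimefeed/a',
    'https://police.utexas.edu/crimefeed/c',
    'https://police.utexas.edu/crimefeed/b',
]

# dict keys are unique and keep first-seen order, so this drops repeats in place.
unique_pages = list(dict.fromkeys(individual_pages))
print(unique_pages)
```

Unlike `set(individual_pages)`, this keeps the pages in the order they appeared on the site.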


In [44]:
# I wasn't interested in links that had the 'tag' text since they aren't applicable here, so I took those off too.
final_pages = []
for page in unique_pages:
    if 'tag' not in page:
        final_pages.append(page)

## Here, I start pulling individual elements.

In [45]:
# Now I'm starting to pull individual elements. In this case, I told the scraper to pull the individual title or return N/A.
# Whatever is returned is added to a list, one entry per page.
incidents = []
for page in final_pages:
    driver.get(page)
    titles = driver.find_elements_by_class_name('page-title')
    # find_elements returns an empty list instead of raising, so check for that explicitly.
    incidents.append(titles[0].text if titles else 'N/A')

In [46]:
# Same as above, but I was pulling the date. As seen though, some dates weren't included in the original position,
# so I had to find a separate xpath.
date_occurred = []
for page in final_pages:
    driver.get(page)
    if 'Date Occurred' in driver.page_source:
        for element in driver.find_elements_by_class_name('date-display-single'):
            date_occurred.append(element.text)
    else:
        for element in driver.find_elements_by_xpath('/html/body/div[2]/div[2]/div/div/div/div/div/div/div/div[2]/div[1]/div'):
            date_occurred.append(element.text)

In [71]:
# Once again, not all pages had location, so I told the scraper to return error if it didn't find it.
location = []
for page in final_pages:
    driver.get(page)
    if "Location Reported" in driver.page_source:
        for element in driver.find_elements_by_class_name('field_location'):
            location.append(element.text)
    else:
        location.append('error')

## Now it's time to turn these lists into a whole bunch of series that I can then bunch together into a dataframe

In [72]:
#Turning to series
incident_name = pd.Series(incidents).to_frame('Incident')

In [73]:
date = pd.Series(date_occurred).to_frame('Date')

In [74]:
location_time = pd.Series(location).to_frame('Location')

In [75]:
# Put them together into a dataframe
incident_notifications = pd.concat([incident_name, date, location_time],axis=1)
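
Because the three lists are built in separate loops, a page that yields zero or several matches for one field shifts every later row, and `pd.concat(..., axis=1)` pairs values purely by position. One alternative is to collect one record per page and fall back to a default when a field is missing. This is only a sketch: the helper `first_or_default` and the `scraped` dictionary are hypothetical stand-ins for the text Selenium returns on each page.

```python
def first_or_default(values, default='N/A'):
    """Return the first item of a possibly empty list, else the default."""
    return values[0] if values else default


# Hypothetical stand-in for the element text found on each incident page.
scraped = {
    'https://police.utexas.edu/crimefeed/a': {
        'title': ['Robbery, Off Campus'],
        'location': ['2200 Guadalupe Street'],
    },
    'https://police.utexas.edu/crimefeed/b': {
        'title': ['Indecent Assault'],
        'location': [],  # this page has no location field
    },
}

records = []
for url, fields in scraped.items():
    # One dictionary per page keeps title, location, and link on the same row.
    records.append({
        'Incident': first_or_default(fields['title']),
        'Location': first_or_default(fields['location'], default='error'),
        'Link': url,
    })

for record in records:
    print(record)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)`, which avoids the separate series and the `concat` step.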

In [76]:
# Some of the locations were not precise for Geopy, so I cleaned the text wherever it was needed.
incident_notifications.at[0,'Location'] = '3501 Lake Austin Boulevard'
incident_notifications.at[10,'Location'] = '2025 Guadalupe St'
incident_notifications.at[22,'Location'] = '507 West 23rd Street'
incident_notifications.at[24,'Location'] = '2300 San Jacinto Boulevard'
incident_notifications.at[25,'Location'] = '2400 Nueces Street'
incident_notifications.at[28,'Location'] = '2501 Speedway'
incident_notifications.at[29,'Location'] = '201 E 21st St'
incident_notifications.at[30,'Location'] = '1624 West 6th Street'
incident_notifications.at[32,'Location'] = '1601 Trinity Street, Austin, TX'
incident_notifications.at[38,'Location'] = '2807 Rio Grande Street'
incident_notifications.at[42,'Location'] = '2500 Pearl Street'
incident_notifications.at[49,'Location'] = '1904 Guadalupe Street'
incident_notifications.at[53,'Location'] = '2025 Guadalupe St'

In [77]:
incident_notifications

| | Incident | Date | Location |
|---|---|---|---|
| 0 | Attempted Burglaries | October 6, 2021 at approximately 8:50 pm | 3501 Lake Austin Boulevard |
| 1 | Sexual Assault, Off Campus | October 1, 2021 at approximately 2:00 am | 2619 Whitis Avenue |
| 2 | Bank Robbery, Off Campus | September 14, 2021 at approximately 12:20 pm | 2500 Guadalupe Street |
| 3 | Robbery Investigation, Off Campus | September 1, 2021 at approximately 10:30 am | 2200 Guadalupe Street |
| 4 | Aggravated Assault with a Deadly Weapon, Off C... | August 14, 2021 at approximately 2:30 am | 2700 Guadalupe Street |
| 5 | Aggravated Robbery, Off Campus **UPDATED** | June 27, 2021 at approximately 8:30 pm | 2001 Guadalupe Street |
| 6 | Burglary / Indecent Assault Arrest | April 29, 2021 | Jester Residence Hall |
| 7 | Indecent Assault | March 28, 2021 | 2600 Whitis Avenue |
| 8 | Robbery, Off Campus | March 8, 2021 at approximately 9:50 pm | 25th and San Gabriel Street |
| 9 | Robbery Arrest, Off Campus | February 24, 2021 at approximately 1:20 pm | CVS Pharmacy, 2222 Guadalupe Street |
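
The Date column holds free text in two shapes in the rows shown: a date with an approximate time, and a date alone. If the dates are needed for sorting or analysis, they could be parsed with the standard library. The sketch below assumes those two shapes cover every row; `parse_incident_date` is a hypothetical helper, and anything that doesn't match comes back as `None`.

```python
from datetime import datetime

# The two text shapes visible in the Date column.
DATE_FORMATS = (
    '%B %d, %Y at approximately %I:%M %p',
    '%B %d, %Y',
)


def parse_incident_date(text):
    """Try each known format in turn; return None if none of them match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


with_time = parse_incident_date('October 6, 2021 at approximately 8:50 pm')
date_only = parse_incident_date('April 29, 2021')
unparsed = parse_incident_date('sometime last week')

print(with_time)
print(date_only)
print(unparsed)
```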


## It's time to find coordinates! I feel like a spy.

In [78]:
from geopy.geocoders import Nominatim

In [79]:
# Created two lists to push coordinates into
latitude = []
longitude = []
geolocator = Nominatim(user_agent='incident_location')
values = incident_notifications["Location"].tolist()
# Finding coordinates for each street address.
for address in values:
    try:
        # Named 'result' so it doesn't overwrite the earlier 'location' list.
        result = geolocator.geocode(address)
    except Exception:
        result = None
    # geocode returns None when it can't find a match.
    if result is None:
        latitude.append('error')
        longitude.append('error')
    else:
        latitude.append(result.latitude)
        longitude.append(result.longitude)
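
Several addresses had to be rewritten by hand before Geopy could resolve them, and one needed ", Austin, TX" appended. A possible way to reduce that manual work is to add the city to every query that lacks it. The sketch below takes the geocoding function as an argument so it can be tried without network access. `geocode_all`, the `city` default, and `fake_geocode` are all assumptions for illustration, not part of Geopy.

```python
from types import SimpleNamespace


def geocode_all(addresses, geocode, city='Austin, TX'):
    """Return one (latitude, longitude) tuple per address, or (None, None) on failure."""
    coords = []
    for address in addresses:
        # Append the city unless the address already names it.
        query = address if city in address else f'{address}, {city}'
        try:
            result = geocode(query)
        except Exception:
            result = None  # treat service errors like a missing match
        if result is None:
            coords.append((None, None))
        else:
            coords.append((result.latitude, result.longitude))
    return coords


def fake_geocode(query):
    """Stand-in geocoder that knows a single address; returns None otherwise."""
    known = {
        '2500 Guadalupe Street, Austin, TX': SimpleNamespace(
            latitude=30.289309, longitude=-97.741857
        ),
    }
    return known.get(query)


coords = geocode_all(
    ['2500 Guadalupe Street', '25th and San Gabriel Street'],
    fake_geocode,
)
print(coords)
```

With Geopy, the callable passed in could be `geolocator.geocode` wrapped in `geopy.extra.rate_limiter.RateLimiter(geolocator.geocode, min_delay_seconds=1)`, which spaces out requests to the public Nominatim service.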

In [80]:
# Like previously, we turn each list into a series.
latitude = pd.Series(latitude).to_frame('Latitude')
longitude = pd.Series(longitude).to_frame('Longitude')
links = pd.Series(final_pages).to_frame('Links')

In [81]:
# Added previous series to original dataframe
incident_notifications['Latitude'] = latitude
incident_notifications['Longitude'] = longitude

In [82]:
# Wow, looks nice!
incident_notifications

| | Incident | Date | Location | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Attempted Burglaries | October 6, 2021 at approximately 8:50 pm | 3501 Lake Austin Boulevard | 30.290251 | -97.782231 |
| 1 | Sexual Assault, Off Campus | October 1, 2021 at approximately 2:00 am | 2619 Whitis Avenue | 30.290126 | -97.740121 |
| 2 | Bank Robbery, Off Campus | September 14, 2021 at approximately 12:20 pm | 2500 Guadalupe Street | 30.289309 | -97.741857 |
| 3 | Robbery Investigation, Off Campus | September 1, 2021 at approximately 10:30 am | 2200 Guadalupe Street | 30.285264 | -97.742177 |
| 4 | Aggravated Assault with a Deadly Weapon, Off C... | August 14, 2021 at approximately 2:30 am | 2700 Guadalupe Street | 30.292234 | -97.741651 |
| 5 | Aggravated Robbery, Off Campus **UPDATED** | June 27, 2021 at approximately 8:30 pm | 2001 Guadalupe Street | 30.282841 | -97.741744 |
| 6 | Burglary / Indecent Assault Arrest | April 29, 2021 | Jester Residence Hall | 30.282492 | -97.734306 |
| 7 | Indecent Assault | March 28, 2021 | 2600 Whitis Avenue | 30.289884 | -97.740334 |
| 8 | Robbery, Off Campus | March 8, 2021 at approximately 9:50 pm | 25th and San Gabriel Street | error | error |
| 9 | Robbery Arrest, Off Campus | February 24, 2021 at approximately 1:20 pm | CVS Pharmacy, 2222 Guadalupe Street | 30.28553 | -97.741998 |


In [83]:
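# Geopy couldn't resolve the intersection in row 8, so I set its coordinates by hand.
# Stored as floats, with a negative longitude like the other rows, so the point lands in Austin.
incident_notifications.at[8,'Latitude'] = 30.289489
incident_notifications.at[8,'Longitude'] = -97.747766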
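
Hand-entered coordinates are easy to get wrong, for example by leaving in an 'error' placeholder or dropping the minus sign on a longitude. A quick sanity check before mapping can catch both. This is a minimal sketch: the bounding box is a rough assumption about the Austin area, and `clean_coordinate` and `in_austin` are hypothetical helpers.

```python
# Rough bounding box around Austin, TX (an assumption for illustration).
LAT_RANGE = (30.0, 30.6)
LON_RANGE = (-98.1, -97.4)


def clean_coordinate(value):
    """Convert a number or numeric string to float; return None for anything else."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def in_austin(lat, lon):
    """Return True when both values are numeric and fall inside the bounding box."""
    lat, lon = clean_coordinate(lat), clean_coordinate(lon)
    if lat is None or lon is None:
        return False
    return (LAT_RANGE[0] <= lat <= LAT_RANGE[1]
            and LON_RANGE[0] <= lon <= LON_RANGE[1])


print(in_austin('30.289489', '-97.747766'))  # True
print(in_austin('30.289489', '97.747766'))   # False: positive longitude is outside the box
print(in_austin('error', 'error'))           # False: placeholder text is not numeric
```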

## It's time to Map! First time trying this, how exciting ;)

In [84]:
import folium
from folium.plugins import MarkerCluster

In [93]:
# Told folium where the center is.
m = folium.Map(location=[30.283262, -97.741359], zoom_start=15, tiles='stamentoner')
marker_cluster = MarkerCluster().add_to(m)
for i, row in incident_notifications.iterrows():
    location = (row['Latitude'], row['Longitude'])
    folium.Marker(location=location, popup=row['Incident'], tooltip=row['Incident']).add_to(marker_cluster)

In [94]:
m # wow, my first ever map. noice