### NYC Arrests - Scraping Police Precincts Data

<hr>

This notebook scrapes the NYC.gov website for data about NYPD police precincts. It uses BeautifulSoup to parse the pages, then exports the data to a CSV file.

https://www1.nyc.gov/site/nypd/bureaus/patrol/precincts-landing.page

<hr>

### Imports

In [1]:
import pandas as pd
import numpy as np
import re
import datetime

import requests
import urllib
from bs4 import BeautifulSoup
from lxml import etree

# pd.set_option('display.max_rows', 200)

### Beautiful Soup Setup

In [2]:
def get_soup_data(url: str):
    """
    Given a URL, this function returns
    a BeautifulSoup object of the page
    parsed with lxml.

    @url: URL to be parsed
    Returns: BeautifulSoup object, or None if the request or parsing fails
    """
    try:
        # timeout keeps the notebook from hanging on an unresponsive server
        response = requests.get(url, timeout=30)
    except requests.exceptions.RequestException as error:
        print('Something went wrong with requests.get (possible bad URL):', error)
        return None

    if response.status_code != 200:
        print("HTTP error", response.status_code)
        return None

    try:
        return BeautifulSoup(response.content, 'lxml')
    except Exception as error:
        print("Something went wrong with BeautifulSoup parsing:", error)
        return None
        
def get_precinct_info(url: str):
    """
    https://www1.nyc.gov/site/nypd/bureaus/patrol/precincts/1st-precinct.page
    
    Given a url to the Police Precincts page,
    returns the text for the commanding officer
    and the precinct description
    
    @url: url of the Police Precinct
    Return: A tuple of the commanding officer and a description
    of the police precinct
    """
    page_data_soup = get_soup_data(url)

    try:
        soup_div = page_data_soup.find('div', attrs={'class': 'about-description'})
        soup_p = soup_div.find_all('p')
        officer = soup_p[0].get_text()
        description = soup_p[1].get_text()
        return (officer, description)
    except (AttributeError, IndexError):
        # return a pair so callers can always unpack the result
        print("Error parsing tag within BeautifulSoup object")
        return (None, None)
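
The parsing step in `get_precinct_info` can be tried offline. The sketch below runs the same `about-description` lookup against a small hand-written HTML fragment. The fragment, the officer name in it, and the helper name `extract_officer_and_description` are illustrative assumptions, not the real page markup. It uses the standard-library `html.parser` backend so that it runs without lxml, and it returns `(None, None)` instead of raising when the expected tags are missing.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the structure the scraper expects;
# the real precinct pages may differ.
SAMPLE_HTML = """
<div class="about-description">
  <p>Commanding Officer: Captain Jane Doe</p>
  <p>The Example Precinct serves a sample area.</p>
</div>
"""


def extract_officer_and_description(html):
    """Return (officer, description), or (None, None) if the layout differs."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", attrs={"class": "about-description"})
    if div is None:
        return (None, None)
    paragraphs = div.find_all("p")
    if len(paragraphs) < 2:
        return (None, None)
    officer = paragraphs[0].get_text(strip=True)
    description = paragraphs[1].get_text(strip=True)
    return (officer, description)


officer, description = extract_officer_and_description(SAMPLE_HTML)
print(officer)
print(description)
```

Checking for `None` and for the number of `<p>` tags explicitly makes it easier to tell a layout change on the site apart from a network failure.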

### Scrape the Police Precincts Page

In [3]:
url = 'https://www1.nyc.gov/site/nypd/bureaus/patrol/precincts-landing.page'

In [4]:
page_data_soup = get_soup_data(url)

In [5]:
# print(page_data_soup.prettify())

In [6]:
# soup_table = page_data_soup.find('table')

In [7]:
list_of_precincts = []

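# find() returns None instead of raising when there is no table,
# so check the result explicitly rather than wrapping it in try/except
soup_table = page_data_soup.find('table') if page_data_soup is not None else None
if soup_table is None:
    raise ValueError("No table found on the precincts landing page")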


borough = None  # set by the most recent borough subheading row

for tag in soup_table.find_all('tr'):

    # find the rows of the boroughs
    soup_th = tag.find('th', class_='subhead')
    if soup_th:
        borough = soup_th.get_text()

    # find all rows of precinct now
    soup_td = tag.find('td', attrs={'data-label': 'Precinct'})
    if soup_td:
        precinct_name = soup_td.get_text()
        # separate name so the landing-page `url` is not overwritten
        precinct_url = 'https://www1.nyc.gov' + str(soup_td.find('a').get('href'))
        cells = tag.find_all('td')
        telephone = cells[1].get_text()  # indexed by position (note: website didn't do this consistently)
        address = cells[2].get_text()
        officer, description = get_precinct_info(precinct_url)

        dict_current_precinct = {
            "Precinct Name": precinct_name,
            "Borough": borough,
            "Address": address,
            "Telephone": telephone,
            "URL": precinct_url,
            'Commanding Officer': officer,
            'Description': description
        }
        list_of_precincts.append(dict_current_precinct)
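
The loop above relies on borough subheading rows appearing before the precinct rows they apply to, and carries the last seen borough forward. A minimal offline sketch of that pattern follows. The `SAMPLE_TABLE` markup, the phone numbers and addresses in it, and the `parse_precinct_rows` helper are made up for illustration and only approximate the landing page's structure.

```python
from bs4 import BeautifulSoup

# Hypothetical table: subheading rows name the borough for the rows below them.
SAMPLE_TABLE = """
<table>
  <tr><th class="subhead" colspan="3">Manhattan</th></tr>
  <tr>
    <td data-label="Precinct"><a href="/site/example-a.page">Example A Precinct</a></td>
    <td>555-0100</td>
    <td>1 Sample Street</td>
  </tr>
  <tr><th class="subhead" colspan="3">Queens</th></tr>
  <tr>
    <td data-label="Precinct"><a href="/site/example-b.page">Example B Precinct</a></td>
    <td>555-0101</td>
    <td>2 Sample Avenue</td>
  </tr>
</table>
"""


def parse_precinct_rows(html, base_url="https://www1.nyc.gov"):
    """Walk the rows in order, carrying the current borough forward."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    borough = None
    for tr in soup.find_all("tr"):
        subhead = tr.find("th", class_="subhead")
        if subhead is not None:
            borough = subhead.get_text(strip=True)
            continue
        name_cell = tr.find("td", attrs={"data-label": "Precinct"})
        cells = tr.find_all("td")
        if name_cell is None or len(cells) < 3:
            continue  # skip rows that do not look like precinct rows
        link = name_cell.find("a")
        rows.append({
            "Precinct Name": name_cell.get_text(strip=True),
            "Borough": borough,
            "Telephone": cells[1].get_text(strip=True),
            "Address": cells[2].get_text(strip=True),
            "URL": (base_url + link.get("href")) if link is not None else None,
        })
    return rows


for row in parse_precinct_rows(SAMPLE_TABLE):
    print(row["Borough"], "|", row["Precinct Name"], "|", row["URL"])
```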

In [8]:
df = pd.DataFrame(list_of_precincts)

In [9]:
df

| Unnamed: 0 | Precinct Name | Borough | Address | Telephone | URL | Commanding Officer | Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1st Precinct | Manhattan | 16 Ericsson Place | 212-334-0611 | https://www1.nyc.gov/site/nypd/bureaus/patrol/... | Commanding Officer: Captain Angel L. Figueroa Jr. | The 1st Precinct serves an area that consists ... |
| 1 | 5th Precinct | Manhattan | 19 Elizabeth Street | 212-334-0711 | https://www1.nyc.gov/site/nypd/bureaus/patrol/... | Commanding Officer: Captain Paul J Zangrilli | The 5th Precinct serves the southeastern edge ... |
| 2 | 6th Precinct | Manhattan | 233 West 10 Street | 212-741-4811 | https://www1.nyc.gov/site/nypd/bureaus/patrol/... | Commanding Officer: Deputy Inspector Robert O'... | The 6th Precinct serves the southwestern Manha... |
| 3 | 7th Precinct | Manhattan | 19 1/2 Pitt Street | 212-477-7311 | https://www1.nyc.gov/site/nypd/bureaus/patrol/... | Commanding Officer: Captain Luis E. Barcia | The 7th Precinct serves Manhattan's Lower East... |
| 4 | 9th Precinct | Manhattan | 321 East 5 Street | 212-477-7811 | https://www1.nyc.gov/site/nypd/bureaus/patrol/... | Commanding Officer: Captain John L. O'Connell | The 9th Precinct serves the area from East Hou... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 72 | 115th Precinct | Queens | 92-15 Northern Boulevard | 718-533-2002 | https://www1.nyc.gov/site/nypd/bureaus/patrol/... | Commanding Officer: Deputy Inspector Juan A. D... | The 115th Precinct serves a northern portion o... |
| 73 | 120th Precinct | Staten Island | 78 Richmond Terrace | 718-876-8500 | https://www1.nyc.gov/site/nypd/bureaus/patrol/... | Commanding Officer: Inspector Isa Abbassi | The 120th Precinct serves the North Shore of S... |
| 74 | 121st Precinct | Staten Island | 970 Richmond Avenue | 718-697-8700 | https://www1.nyc.gov/site/nypd/bureaus/patrol/... | Commanding Officer: Captain Bruce P. Ceparano | The 121st Precinct serves the northwestern sho... |
| 75 | 122nd Precinct | Staten Island | 2320 Hylan Boulevard | 718-667-2211 | https://www1.nyc.gov/site/nypd/bureaus/patrol/... | Commanding Officer: Deputy Inspector Melissa Eger | The 122nd Precinct serves a portion of the Sou... |


### Extract precinct number as an integer

In [10]:
# Manual Mapping
# http://www.nyc.gov/html/nypd/html/precincts/precinct_014.shtml
dict_precincts = {'Midtown South Precinct': 14,
                  'Midtown North Precinct': 18,
                  'Central Park Precinct': 22}

# df['Precinct Number'] = df['Precinct Name'].replace(dict_precincts)

def get_precinct_number(precinct_name: str):
    """
    Given a precinct name, returns the integer
    representation of that precinct
    """
    pattern = r'^(\d{1,3})\D+'
    match = re.search(pattern, precinct_name)
    
    if match:
        return int(match.group(1))
    # .get() returns None for an unmapped name, so the isnull() check can surface it
    return dict_precincts.get(precinct_name)

In [11]:
df['Precinct Number'] = df['Precinct Name'].apply(get_precinct_number)
df[df['Precinct Number'].isnull()]

| Unnamed: 0 | Precinct Name | Borough | Address | Telephone | URL | Commanding Officer | Description | Precinct Number |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
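
The number extraction can be checked on a few names without the DataFrame. This is a minimal sketch using only the standard library: the mapping repeats the three named precincts from the cell above, while `precinct_number` and the unmapped name `"Harbor Unit"` are hypothetical and included only to show the fallback returning `None`.

```python
import re

# Same manual mapping as in the notebook cell above.
NAMED_PRECINCTS = {
    "Midtown South Precinct": 14,
    "Midtown North Precinct": 18,
    "Central Park Precinct": 22,
}


def precinct_number(name):
    """Return the leading 1-3 digit number, else the mapped number, else None."""
    cleaned = name.strip()
    match = re.match(r"(\d{1,3})\D", cleaned)
    if match:
        return int(match.group(1))
    return NAMED_PRECINCTS.get(cleaned)


# "Harbor Unit" is a made-up name that is not in the mapping.
for name in ["1st Precinct", "122nd Precinct", "Midtown South Precinct", "Harbor Unit"]:
    print(name, "->", precinct_number(name))
```

Returning `None` for an unknown name, rather than raising `KeyError`, is what lets a later null check list the rows that still need a manual mapping.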


### Clean precinct commanding officer text

In [12]:
def clean_officer_text(string: str) -> str:
    """
    Cleans the commanding officer text from the web
    and returns just the commanding officer's rank and name
    """
    # leave missing or non-string values untouched
    if not isinstance(string, str):
        return string
    pattern = r'^Commanding Officer:\s*(.+)$'
    match = re.search(pattern, string)
    if match:
        return match.group(1)
    return string
    

In [13]:
df['Commanding Officer'] = df['Commanding Officer'].apply(clean_officer_text)
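
As an alternative to applying a Python function row by row, the prefix can be stripped with pandas' vectorized string methods. The sketch below is one possible approach, not the notebook's method, and the officer names in the sample Series are invented. It behaves slightly differently from `clean_officer_text`: it removes the prefix even when nothing follows it.

```python
import pandas as pd

# Hypothetical sample values; the last one has no prefix and should pass through.
officers = pd.Series([
    "Commanding Officer: Captain Jane Doe",
    "Commanding Officer:Inspector John Roe",
    "Captain Without Prefix",
])

# Remove the leading label and any whitespace after the colon.
cleaned = officers.str.replace(r"^Commanding Officer:\s*", "", regex=True).str.strip()
print(cleaned.tolist())
```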

### Add today's date as date scraped

In [18]:
today = datetime.date.today()
df['Scraped on'] = today

### Output to CSV

In [20]:
# Output as CSV
df.to_csv('Data/police_precincts.csv')
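
Two details of the export are worth knowing. `to_csv` does not create missing directories, and by default it writes the DataFrame index as an unnamed first column, which appears as `Unnamed: 0` when the file is read back. The sketch below shows both points on a small made-up frame written to a temporary directory; `df_demo` and its contents are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row frame standing in for the scraped data.
df_demo = pd.DataFrame({"Precinct Name": ["Example Precinct"], "Precinct Number": [1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "Data")
    os.makedirs(out_dir, exist_ok=True)  # to_csv will not create this for us
    out_path = os.path.join(out_dir, "police_precincts.csv")

    # Default: the index is written as an unnamed first column.
    df_demo.to_csv(out_path)
    with_index = pd.read_csv(out_path)

    # index=False leaves the index out of the file.
    df_demo.to_csv(out_path, index=False)
    without_index = pd.read_csv(out_path)

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to keep the index is a judgment call; dropping it is usually simpler when the index is just the default row counter.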