# Requirements

I'd like to get a better understanding of crime in my neighborhood, but the publicly available information is only published as PDFs. This script takes the PDFs downloaded from the Framingham Police Department and transforms them into usable data points for analysis.

Website: https://www.framinghamma.gov/DocumentCenter/Home/Index/149

1. Functions
    - split_types: helper nested inside police_log_day that splits one log entry into time, call type, note, and address
    - police_log_day:
        - loads one PDF, which covers a single day
        - loops through all the pages in the PDF
        - returns a dataframe with the pages split into individual log entries
1. Load data: loops through all the PDFs in a local folder. Future iterations might scrape the files from the website automatically.
1. lat_lon function: calls the Google Maps Geocoding API to get the latitude and longitude for a given address
1. Apply the lat_lon function to the data set
1. Save the data to a .csv; a secondary script then loads that file into a local Postgres database


In [None]:
import PyPDF2
import re
import pandas as pd
import datetime as dt
import os
import requests

endpoint = 'https://maps.googleapis.com/maps/api/geocode/json?'
api_key = input()


The function below takes a single PDF, combines the text of every page, and splits it into one row per log entry.

It then returns a dataframe with the columns Time, Type, Note, Address, date, and load_date.

In [190]:
def police_log_day(file):
    # helper: parse one call entry into [time, call_type, note, address]
    def split_types(text, time):
        # start from this entry's own timestamp (not always the first one in the chunk)
        start = text.find(time)
        rest = text[start + len(time):]
        # the call type runs up to the first colon after the timestamp
        colon = rest.find(':')
        call_type = rest[:colon].strip()
        rest = rest[colon+2:]
        # the note is in parentheses; the address is whatever precedes them
        paren_open = rest.find('(')
        paren_close = rest.find(')')
        note = rest[paren_open+1:paren_close]
        address = rest[:paren_open].strip()

        return [time, call_type, note, address]
    

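    # the slicing below expects the file name to end in a 10-character date,
    # year first and day last (e.g. YYYY-MM-DD), followed by '.pdf'
    rough_date = file[-14:-4]
    year = rough_date[:4]
    month = rough_date[5:7]
    day = rough_date[-2:]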
    
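    # read and combine all the pages of text into a single string
    # note: PdfFileReader, numPages, getPage and extractText are the legacy PyPDF2
    # names; PyPDF2 3.0 removed them in favor of PdfReader, len(reader.pages),
    # reader.pages[i] and extract_text()
    text = ''
    with open(file, 'rb') as pdf_file:  # close the file once the text is extracted
        pdfReader = PyPDF2.PdfFileReader(pdf_file)
        for page in range(pdfReader.numPages):
            text = text + pdfReader.getPage(page).extractText()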
    time_pattern = r'[0-9][0-9]:[0-9][0-9]'
    num_pattern = r'[0-9]{7}'
    

    split_text = text.split('\n\n')
    calls = []
    for i in split_text:
        # strip 7-digit numbers before looking for timestamps
        i_adj = re.sub(num_pattern, '', i)
        time_match = re.findall(time_pattern, i_adj)
        # skip page headers/footers; otherwise parse one call per timestamp found
        if 'Page:' in i_adj:
            continue
        for time in time_match:
            calls.append(split_types(i_adj, time))

    data = pd.DataFrame(calls, columns= ['Time','Type','Note','Address'])
    data['date'] = '{}/{}/{}'.format(year,month,day)
    data['load_date'] = dt.datetime.now()
    return data
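
The entry parsing inside `police_log_day` can also be tried on its own. The block below is a minimal sketch that uses one regular expression instead of repeated `str.find` calls; it assumes each entry looks like `HH:MM Call Type: Address (Note)`, which is inferred from the slicing above rather than from a published format. `ENTRY_PATTERN`, `parse_entry`, and the sample line are hypothetical names and data introduced for illustration.

```python
import re

# Assumed layout of one entry: "HH:MM Call Type: Address (Note)".
ENTRY_PATTERN = re.compile(
    r'(?P<time>\d{2}:\d{2})\s*'   # timestamp
    r'(?P<call_type>[^:]*):\s*'   # call type, up to the next colon
    r'(?P<address>[^(]*)'         # address, up to the opening parenthesis
    r'\((?P<note>[^)]*)\)'        # note inside the parentheses
)


def parse_entry(text):
    """Return [time, call_type, note, address] for the first entry in text, or None."""
    match = ENTRY_PATTERN.search(text)
    if match is None:
        return None
    return [
        match['time'],
        match['call_type'].strip(),
        match['note'].strip(),
        match['address'].strip(),
    ]


# Invented sample line, not taken from a real log.
sample = '14:05 Traffic Stop: 100 Example St (Citation issued)'
print(parse_entry(sample))
# ['14:05', 'Traffic Stop', 'Citation issued', '100 Example St']
```

Unlike the slicing approach, this version returns `None` when a chunk does not fit the assumed layout, which makes malformed entries easier to spot.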

The code block below loops through all the PDFs in a local folder, then concatenates the outputs into a single dataframe.

In [202]:
folder = '/Users/williamgetman/Dropbox/Python/Police Logs/raw_data/'
# os.listdir also returns non-PDF entries (e.g. .DS_Store), so keep only the PDFs
pdf_files = [f for f in os.listdir(folder) if f.lower().endswith('.pdf')]

# parse every file, then concatenate once rather than growing a dataframe in the loop
# (pd.concat raises a ValueError if the folder contains no PDFs)
frames = [police_log_day(folder + f) for f in pdf_files]
all_data = pd.concat(frames, axis=0, join='outer', ignore_index=True)
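
Because `police_log_day` reads the date from the end of the file name, it can help to check which files in the folder actually carry a usable date before parsing them. The sketch below assumes names that end in a four-digit year, two-digit month, and two-digit day, each separated by a single non-digit character, followed by `.pdf`. `DATE_IN_NAME`, `log_date_from_name`, and the file names are hypothetical and only for illustration.

```python
import datetime as dt
import re

# Assumes the name ends in YYYY?MM?DD.pdf, where ? is any single non-digit
# character; adjust the pattern if the downloaded files are named differently.
DATE_IN_NAME = re.compile(r'(\d{4})\D(\d{2})\D(\d{2})\.pdf$', re.IGNORECASE)


def log_date_from_name(file_name):
    """Return the date encoded at the end of file_name, or None if there is none."""
    match = DATE_IN_NAME.search(file_name)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return dt.date(year, month, day)
    except ValueError:  # e.g. month 13 or day 32
        return None


# Hypothetical file names for illustration.
names = ['police_log_2019-11-03.pdf', '.DS_Store', 'notes.txt']
dated = [name for name in names if log_date_from_name(name) is not None]
print(dated)
# ['police_log_2019-11-03.pdf']
```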

The function below calls the Google Maps Geocoding API and returns the latitude and longitude for a given Framingham address.

In [234]:
def lat_lon(address):
    # every address in the logs is treated as a Framingham address
    address = (address + ' Framingham, MA 01702').lower()

    # let requests build and escape the query string (handles '&', '#', etc.)
    response = requests.get(endpoint, params={'address': address, 'key': api_key})

    results = response.json()['results']
    if not results:
        return []  # no match for this address

    # use the first match only
    location = results[0]['geometry']['location']
    return [location['lat'], location['lng']]
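
The response handling can be exercised without calling the API. The sketch below assumes the usual shape of a Geocoding API response, a `status` string plus a `results` list whose items hold `geometry.location.lat` and `geometry.location.lng`. `extract_lat_lng` is a hypothetical helper, and the payloads are trimmed down with made-up coordinates.

```python
def extract_lat_lng(payload):
    """Return [lat, lng] from a Geocoding API response dict, or [] if unavailable."""
    # Anything other than 'OK' (e.g. 'ZERO_RESULTS', 'REQUEST_DENIED') has no usable result.
    if payload.get('status') != 'OK':
        return []
    results = payload.get('results', [])
    if not results:
        return []
    location = results[0]['geometry']['location']
    return [location['lat'], location['lng']]


# Trimmed-down payloads with made-up coordinates, for illustration only.
ok_payload = {
    'status': 'OK',
    'results': [{'geometry': {'location': {'lat': 42.0, 'lng': -71.0}}}],
}
empty_payload = {'status': 'ZERO_RESULTS', 'results': []}

print(extract_lat_lng(ok_payload))     # [42.0, -71.0]
print(extract_lat_lng(empty_payload))  # []
```

Checking `status` first separates "no such address" from problems such as a rejected API key, which would otherwise both show up as an empty list.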



The code below applies the lat_lon function to the dataset.

In [245]:
data_lat_long = all_data.copy()
data_lat_long['lat_long'] = data_lat_long['Address'].apply(lat_lon)
data_lat_long.head()
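
Applying `lat_lon` row by row sends one request per log entry, even when the same address appears many times. A minimal sketch of one way to reduce that is to look each distinct address up once and reuse the result. `geocode_unique` and `fake_geocoder` are hypothetical names; the stand-in geocoder only records its calls so the example runs without network access.

```python
def geocode_unique(addresses, geocoder):
    """Call geocoder once per distinct address and return {address: result}."""
    cache = {}
    for address in addresses:
        if address not in cache:
            cache[address] = geocoder(address)
    return cache


# Stand-in geocoder that records its calls; in the notebook this would be lat_lon.
calls_made = []


def fake_geocoder(address):
    calls_made.append(address)
    return [0.0, 0.0]  # placeholder coordinates


addresses = ['1 Example St', '2 Example Ave', '1 Example St']
lookup = geocode_unique(addresses, fake_geocoder)
print(len(calls_made))  # 2
```

In the notebook, something like `lookup = geocode_unique(data_lat_long['Address'], lat_lon)` followed by `data_lat_long['Address'].map(lookup)` should give the same column as the `apply` call, assuming `lat_lon` returns the same result for the same address.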

Save the result to a CSV file.

In [247]:
data_lat_long.to_csv('/Users/williamgetman/Dropbox/Python/Police Logs/all_data.w.address.11.3.19.csv')