ECE143, Spring 2018

Prof. Unpingco

Group 14

# Scraping Crime Data

### Overview

- SANDAG and the justice department publish much of their data in PDF format.
- Some tables are available in CSV format, but it is difficult to tell the browser program what data to include or exclude.
- We need to convert the tables in the PDFs into pandas DataFrames or NumPy arrays before we can combine the crime data with the influencing variables (e.g. weather, homelessness, income).
- The relevant functions are being written in the `crimepdf.py` module.

### Step 1: Convert PDF to CSV

- If we can get the files as simple text, we can use regular expressions and similar tools to parse out the data more easily.
- We want to do this step in Python rather than by hand, since there are potentially many files from which we can extract data. Batch processing would be ideal and the most reproducible.
- The first Python module I found is [`tabula-py`](https://github.com/chezou/tabula-py). It has a function `convert_into_by_batch()`, which converts every PDF file in a directory into CSV format.
    - tabula-py is a wrapper for a Java program, so Java must already be installed for it to work.
    - Install from the command line with `pip install tabula-py`.
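
Because tabula-py depends on Java, it can help to confirm that a `java` executable is reachable before running a batch conversion. The following is a minimal sketch, assuming Python 3.3 or later (for `shutil.which`); the helper name `find_java` is hypothetical and not part of tabula-py.

```python
import shutil

def find_java():
    """Return the path of the `java` executable found on the PATH, or None."""
    return shutil.which("java")

java_path = find_java()
if java_path is None:
    # Assumption: tabula-py needs `java` on the PATH in order to run.
    print("Java not found on PATH; install Java before using tabula-py.")
else:
    print("Found Java at: " + java_path)
```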

In [7]:
#Example: Converting my class schedule into a CSV
#Note: Kernel's current directory is 'ECE-143-Group-14', which contains the 'TEST' directory
import tabula as tab

tab.convert_into_by_batch( "TEST", output='csv') #CSV is default, just being explicit
#The one PDF file in TEST is "SP18 sched.pdf"
#Also contains a normal .txt file, which gets ignored

with open('TEST\\SP18 sched.csv', 'r') as sched: #Check the resulting CSV
    for i in range(5):
        print(next(sched)) #see first five lines

"",Monday,Tuesday,Wednesday,Thursday,,Friday,Saturday,Sunday

"",,11:00 - 12:20 Enrolled,,1 1:00 - 12:20 Enrolled,,,,

11am,,CHEM  151,,CHEM  151,,,,

"",,LE / NSB 2303,,LE / NSB 2303,,,,

"",,"Weizma n, Haim",,"Weizma n, Haim",,,,



- Each table from \[insert file name(s)\] is manually saved as a separate PDF in the sub-directory \[insert dir name\].
- For each PDF, a file with the same base name and the extension '.csv' is created. New CSV files that share names with old ones replace the older versions.
- We saved several PDF tables published by the [CJSC](https://oag.ca.gov/cjsc/pubs) covering California statewide data going back to 1952 and San Diego region-specific data going back to 2013. They are stored in the `"crime_data"` directory.

Note: "ARJISPublicCrime041818.txt" is a CSV downloaded directly from the [SANDAG website](http://www.sandag.org/index.asp?classid=14&subclassid=21&projectid=446&fuseaction=projects.detail). It is ignored by the tabula module.
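
Since a batch conversion replaces any CSV that shares a name with a PDF, it may be worth listing the files at risk before converting. This is a minimal sketch using only the standard library; the function name `csvs_to_be_replaced` is hypothetical, and it assumes the output CSV takes the same base name as its source PDF, as the directory listings in this notebook suggest.

```python
import os

def csvs_to_be_replaced(directory):
    """List existing CSV files that share a base name with a PDF in `directory`."""
    names = os.listdir(directory)
    # Base names (without extension) of every PDF in the directory
    pdf_bases = {os.path.splitext(n)[0] for n in names
                 if n.lower().endswith(".pdf")}
    # Existing CSVs whose base name matches one of those PDFs
    return sorted(n for n in names
                  if n.lower().endswith(".csv")
                  and os.path.splitext(n)[0] in pdf_bases)

# Only runs if the directory exists relative to the current working directory
if os.path.isdir("crime_data"):
    for name in csvs_to_be_replaced("crime_data"):
        print("Will be replaced: " + name)
```

Copying the listed files elsewhere first would preserve any hand-edited versions.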

In [9]:
#Converting PDF data tables to CSV
#Current directory is "ECE-143-Group-14"
import tabula as tab
import os

tab.convert_into_by_batch( "crime_data", output='csv')
datafiles = os.listdir("crime_data") #list out file names
for data in datafiles:
    print(data) #View all of the data files saved

ARJISPublicCrime041818.txt
CAcrimeIndex52-96_cjsc.csv
CAcrimeIndex52-96_cjsc.pdf
CAcrimes66-15_cjsc.csv
CAcrimes66-15_cjsc.pdf
SDjurisdiction_2013_cjbulletin.csv
SDjurisdiction_2013_cjbulletin.pdf
SDjurisdiction_2014_cjbulletin.csv
SDjurisdiction_2014_cjbulletin.pdf
SDjurisdiction_2015_cjbulletin.csv
SDjurisdiction_2015_cjbulletin.pdf
SDjurisdiction_2016_cjbulletin.csv
SDjurisdiction_2016_cjbulletin.pdf
SDjurisdiction_2017_cjbulletin.csv
SDjurisdiction_2017_cjbulletin.pdf


- We checked the output in Excel to see clearly how tabula organized the rows and columns.
    - Statewide data: \[image not available\]
    - County regional data: \[image not available\]
    - General observations:
        - Titles above the tables and captions below them were removed.
        - Multi-line column headers were split across multiple rows.
        - Tabula did not split crime rates from total counts; these were separate columns within the same cell in the original PDF, so each pair lands in a single CSV cell.
        - The state data contains a few empty spacer columns.
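
The observations above suggest three cleanup steps: dropping empty spacer columns, merging split header rows, and separating counts from rates. Below is a rough sketch of those steps using only the standard `csv` module under Python 3. All function names are hypothetical, the number of header rows is an assumption that must be checked per file, and the count and rate are assumed to be separated by whitespace. The sample rows are made up to mimic the layout and are not real crime figures.

```python
import csv
import io

def drop_empty_columns(rows):
    """Pad rows to equal length, then remove columns blank in every row."""
    width = max(len(r) for r in rows)
    padded = [r + [""] * (width - len(r)) for r in rows]
    keep = [i for i in range(width) if any(r[i].strip() for r in padded)]
    return [[r[i] for i in keep] for r in padded]

def merge_header_rows(rows, n_header_rows):
    """Join the first n_header_rows column-wise (rows must be equal length)."""
    header = [" ".join(part.strip() for part in col if part.strip())
              for col in zip(*rows[:n_header_rows])]
    return [header] + rows[n_header_rows:]

def split_count_and_rate(cell):
    """Split a cell such as '100 5.0' into ('100', '5.0'); assumed format."""
    parts = cell.split()
    if len(parts) == 2:
        return parts[0], parts[1]
    return cell, ""

# Made-up sample mimicking the described layout (not real crime figures)
sample = io.StringIO(
    "Year,,Violent\n"
    ",,Crimes\n"
    "2000,,100 5.0\n"
    "2001,,110 5.5\n"
)
rows = drop_empty_columns(list(csv.reader(sample)))
rows = merge_header_rows(rows, 2)
print(rows[0])
for year, combined in rows[1:]:
    print(year, split_count_and_rate(combined))
```

The cleaned rows could then be handed to pandas for merging with the other variables.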

- The UCSD daily police logs are provided as one PDF per day, going back over the past several months. I would like a program to download each one rather than doing so by hand.
    - The URLs follow a simple format:
```
http://www.police.ucsd.edu/docs/reports/CallsandArrests/CallsForService/[month]%20[day],%20[year].pdf
```
    - For example, the report for May 5, 2018 would be:
```
http://www.police.ucsd.edu/docs/reports/CallsandArrests/CallsForService/May%205,%202018.pdf
```
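
The URL pattern above can be generated for a range of dates. This is a minimal sketch under two assumptions: months are spelled out in full (the "May" example cannot distinguish full names from abbreviations), and days are not zero-padded. The names `log_url` and `log_urls` are hypothetical, and the sketch only builds URLs; it has not been tested against the live site.

```python
import datetime

BASE_URL = ("http://www.police.ucsd.edu/docs/reports/"
            "CallsandArrests/CallsForService/")

# Hard-coded so the result does not depend on the system locale.
# Assumption: the site uses full month names.
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

def log_url(day):
    """Build the report URL for a datetime.date, following the pattern above."""
    return "{}{}%20{},%20{}.pdf".format(
        BASE_URL, MONTHS[day.month - 1], day.day, day.year)

def log_urls(start, end):
    """Yield (date, url) pairs for every day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current, log_url(current)
        current += datetime.timedelta(days=1)

print(log_url(datetime.date(2018, 5, 5)))
```

Each URL could then be passed to something like `urllib.request.urlretrieve` to save the PDF locally, with error handling for days that have no report.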