### 90803 Data Cleaning and Question Definition
# Data Cleaning: Crime Data

**Team 14**

Chi-Shiun Tsai & Colton Lapp

This notebook cleans the 2020 crime data published by the FBI.

### 0. Importing libraries

In [1]:
import glob
import numpy as np
import pandas as pd
from datetime import datetime
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import os

### 1. Reading datasets

Read all the state-level Excel files and merge them into a single dataframe.

In [2]:

try:
    # Collect the paths of all state-level Excel files
    files = glob.glob("../data/State Tables Offenses by Agency 2020/*.xls")

    apd_dta = []
    for f in files:
        # After skipping the title rows, the first remaining row holds the real header
        df = pd.read_excel(f, skiprows=3)
        col_names = df.iloc[0]
        df = df[1:].copy()
        df.columns = col_names
        df.columns = df.columns.str.replace('\n', '')

        # Forward-fill the agency type and keep only city agencies
        df['Agency Type'] = df['Agency Type'].ffill(axis=0)
        df = df[df['Agency Type'] == 'Cities']
        df = df.drop(['Agency Type'], axis=1)

        # Keep only the first three columns (agency name, population, total offenses)
        df = df.iloc[:, :3].copy()
        df.sort_index(inplace=True)

        # Derive the state name from the filename (the text before '_Offense')
        s = os.path.basename(f).split('_Offense')[0]
        df['State'] = s
        apd_dta.append(df)

    df_merged = pd.concat(apd_dta, ignore_index=True)

except Exception:
    print("Error reading in raw crime data. \
These files are small so they should be on your local machine after pulling the repo. \
If they are not, please download them here and try again \n \
https://drive.google.com/uc?export=download&id=1fkrUbxsr3eGjjsfD3TcUOV8K0041N3FL\
\nOptionally, you may need to use pip install xlrd")
    # Re-raise so later cells do not run against an undefined df_merged
    raise
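
As a rough illustration of the header promotion, forward-fill, and filtering steps in the cell above, the sketch below applies the same operations to a small made-up frame. The agency names and numbers are invented for the example, and the sheet layout (real header in the first row, `Agency Type` filled only on the first row of each group) is an assumption inferred from the code, not something checked against the FBI files.

```python
import pandas as pd

# Toy stand-in for one state sheet (invented values, assumed layout):
# row 0 carries the real header, and 'Agency Type' is only written on
# the first row of each group.
raw = pd.DataFrame(
    [
        ["Agency Type", "Agency Name", "Population", "Total\nOffenses"],
        ["Cities", "Alpha", 100, 5],
        [None, "Beta", 200, 7],
        ["Counties", "Gamma", 300, 9],
        [None, "Delta", 400, 11],
    ]
)

# Promote the first row to the header and strip embedded newlines
toy = raw.iloc[1:].copy()
toy.columns = raw.iloc[0].str.replace("\n", "", regex=False)

# Forward-fill the group label, then keep only the city rows
toy["Agency Type"] = toy["Agency Type"].ffill()
cities = toy.loc[toy["Agency Type"] == "Cities"].drop(columns="Agency Type")

print(cities)
```

Without the forward-fill, only the first agency of each group would match the `'Cities'` filter, which is why the fill has to happen before subsetting.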

In [3]:
df_merged.head(10)

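| | Agency Name | Population1 | TotalOffenses | State | Population |
|---|---|---|---|---|---|
| 0 | Alta Vista | 422 | 0 | Kansas | NaN |
| 1 | Andover | 13592 | 553 | Kansas | NaN |
| 2 | Anthony | 2051 | 115 | Kansas | NaN |
| 3 | Arkansas City | 11589 | 1272 | Kansas | NaN |
| 4 | Arma | 1413 | 87 | Kansas | NaN |
| 5 | Assaria | 407 | 0 | Kansas | NaN |
| 6 | Atchison | 10421 | 573 | Kansas | NaN |
| 7 | Atwood | 1221 | 32 | Kansas | NaN |
| 8 | Augusta | 9339 | 690 | Kansas | NaN |
| 9 | Basehor | 6742 | 169 | Kansas | NaN |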


### 2. Data cleaning

In [4]:
# Check for missing values
df_merged.isnull().sum(axis=0)

0
Agency Name         0
Population1         3
TotalOffenses    4740
State               0
Population       9411
dtype: int64

The missing `TotalOffenses` values come from police agencies that did not file a report with the FBI. We cannot recover these values, so we drop the affected observations.

In [5]:
# Drop agencies with no reported offenses (no report filed with the FBI)
df_merged = df_merged[df_merged['TotalOffenses'].notna()].copy()
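
A minimal sketch of this filter on invented data: the boolean mask with `notna()` behaves the same as `dropna(subset=...)`, and in both cases a reported count of zero is kept while a missing count is dropped. The frame below is a toy example, not a sample of the crime data.

```python
import numpy as np
import pandas as pd

# Invented rows: B did not report, C reported zero offenses
toy = pd.DataFrame(
    {"City": ["A", "B", "C"], "TotalOffenses": [10, np.nan, 0]}
)

kept_mask = toy[toy["TotalOffenses"].notna()]
kept_dropna = toy.dropna(subset=["TotalOffenses"])

print(kept_mask)
```

The distinction matters here because several agencies in the data legitimately report `0` offenses, and those rows should survive the cleaning step.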

Some `Population1` values are still missing. We fill them with the value from the `Population` column, which also appears in the merged data.

In [6]:
# Fill missing Population1 values from the Population column
df_merged['Population1'] = df_merged['Population1'].fillna(df_merged['Population'])
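
To show how this fill behaves, here is a small sketch on an invented frame: `fillna` with another Series aligns on the index and only replaces entries that are missing, so existing `Population1` values are left unchanged. The values and their arrangement are for illustration only.

```python
import numpy as np
import pandas as pd

# Invented frame: each row has the population in exactly one of the two columns
toy = pd.DataFrame(
    {
        "Population1": [422, np.nan, 2051],
        "Population": [np.nan, 13592, np.nan],
    }
)

# Missing Population1 entries are taken from Population, row by row
toy["Population1"] = toy["Population1"].fillna(toy["Population"])

print(toy["Population1"].tolist())
```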

In [7]:
# Keep the first four columns and give them consistent names
df_merged = df_merged.iloc[:, :4]
df_merged.columns = ['City', 'Population', 'TotalOffenses', 'State']

The format of some state names is not consistent with the other datasets: names taken from the filenames use underscores in place of spaces. We fix this by replacing the underscores with spaces.

In [8]:
# Fix State names: replace underscores with spaces
df_merged['State'] = df_merged['State'].str.replace('_', ' ', regex=False)
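
The state label is derived from the filename and then normalized, which can be sketched as a single helper. The function name `state_from_filename` and the example filename below are hypothetical; the only assumption carried over from the notebook is that the state name precedes `_Offense` in each filename.

```python
import os


def state_from_filename(path):
    """Return the state name that precedes '_Offense' in a file name."""
    stem = os.path.basename(path).split("_Offense")[0]
    # Underscores stand in for spaces in multi-word state names
    return stem.replace("_", " ")


# Hypothetical filename, used only to illustrate the pattern
example_path = (
    "../data/State Tables Offenses by Agency 2020/"
    "New_York_Offense_Type_by_Agency_2020.xls"
)
print(state_from_filename(example_path))
```

Using `os.path.basename` instead of splitting on `/` keeps the extraction independent of how the directory part of the path is written.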

In [9]:
# Check for missing values again
df_merged.isnull().sum(axis=0)

City             0
Population       0
TotalOffenses    0
State            0
dtype: int64

In [10]:
df_merged.head(20)

| | City | Population | TotalOffenses | State |
|---|---|---|---|---|
| 0 | Alta Vista | 422 | 0 | Kansas |
| 1 | Andover | 13592 | 553 | Kansas |
| 2 | Anthony | 2051 | 115 | Kansas |
| 3 | Arkansas City | 11589 | 1272 | Kansas |
| 4 | Arma | 1413 | 87 | Kansas |
| 5 | Assaria | 407 | 0 | Kansas |
| 6 | Atchison | 10421 | 573 | Kansas |
| 7 | Atwood | 1221 | 32 | Kansas |
| 8 | Augusta | 9339 | 690 | Kansas |
| 9 | Basehor | 6742 | 169 | Kansas |


No missing values remain in the data, so we save it as a CSV file.

### 3. Saving cleaned dataset

In [11]:
df_merged.to_csv('../data/data_cleaned/crime2020.csv', index=False)
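
As an optional sanity check, the sketch below writes a two-row frame to a temporary directory and reads it back. It creates the output folder first with `os.makedirs`, since `to_csv` does not create missing directories. The temporary path is an assumption made so the example runs anywhere; the notebook itself writes to `../data/data_cleaned/`.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the cleaned output shown above
toy = pd.DataFrame(
    {
        "City": ["Alta Vista", "Andover"],
        "Population": [422, 13592],
        "TotalOffenses": [0, 553],
        "State": ["Kansas", "Kansas"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "data_cleaned")
    os.makedirs(out_dir, exist_ok=True)  # to_csv will not create the folder
    out_path = os.path.join(out_dir, "crime2020.csv")

    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded)
```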

### References

* Data source: https://cde.ucr.cjis.gov/LATEST/webapp/#/pages/home
* `pandas.read_excel` documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
* Importing and concatenating multiple Excel files: https://stackoverflow.com/questions/20908018/import-multiple-excel-files-into-python-pandas-and-concatenate-them-into-one-dat