# Seattle Crime #

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier

### Let's start off by looking at our data ###

In [2]:
df = pd.read_csv("SPD_Crime_Data.csv")
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'SPD_Crime_Data.csv'

In [None]:
df.info()

**Let's check for null values**

In [None]:
df.isnull().sum()

**Let's drop any duplicate rows**

In [None]:
df = df.drop_duplicates()
df.info()

## Preprocessing Our Data ##

### 1) Null Values ###

In [None]:
df.isnull().sum()

So we can see that we have null values in our "Offense Start DateTime", "Offense End DateTime", "Precinct", "Sector", "Beat", and "100 Block Address" variables. Let's take a closer look at each of these variables, and decide whether we want to eliminate the column entirely, or just drop the rows with null values.
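
One way to inform that choice is to look at the share of missing values in each column rather than the raw counts. Here is a minimal sketch on a small made-up frame; only the column names echo our dataset, the values are invented for illustration.

```python
import pandas as pd

# Toy frame with invented values; only the column names echo the dataset.
toy = pd.DataFrame({
    "Offense Start DateTime": ["2020-01-01 10:00", None, "2020-01-03 12:30", "2020-01-04 08:15"],
    "Offense End DateTime": [None, None, None, "2020-01-04 09:00"],
    "Sector": ["B", "C", None, "D"],
})

# Fraction of missing values per column, largest first.
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)
```

A column that is mostly empty is a candidate for dropping entirely, while a column with only a small share of nulls can usually be kept by dropping the affected rows.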

**Offense Start DateTime**

"Offense Start DateTime" is described as "Start date and time the offense(s) occurred". We are interested in seeing when the crime started, so we'll keep the column and simply drop the rows where it is null.

In [None]:
df = df.dropna(subset = ["Offense Start DateTime"])

Let's check the null values for "Offense Start DateTime" again to make sure we eliminated them

In [None]:
df.isnull().sum()

**Offense End DateTime**

The "Offense End DateTime" variable is described as the "end date and time the offense(s) occurred". We are not interested in when the crime ended, so we'll eliminate the column entirely.

In [None]:
df = df.drop(["Offense End DateTime"], axis = 1)
df

**Precinct, Sector and Beat**

Seattle has 5 precincts, or police station areas: North, East, South, West, and Southwest. Within each precinct there are smaller geographical areas called sectors. Finally, each sector is divided into 3 smaller sections called beats, which individual patrol officers are assigned responsibility for. Since sectors and beats pin down the location of a crime more precisely, we'll eliminate the Precinct column. We'll keep the Sector and Beat columns, but drop the rows with null values.

In [None]:
#Dropping Precinct column
df = df.drop(["Precinct"], axis = 1)
df

In [None]:
#Dropping null values in the sector and beat column
df = df.dropna(subset = ["Sector", "Beat"])

Let's check the null values again

In [None]:
df.isnull().sum()

**100 Block Address**

Lastly, we have the "100 Block Address" column. Since the information in this column is censored and doesn't give us very accurate locations, we'll eliminate the entire column.

In [None]:
df = df.drop(["100 Block Address"], axis = 1)
df

Let's take another look at our null values

In [None]:
df.isnull().sum()

We can now move on to cleaning other parts of our data.

### 2) Eliminating Other Variables ###

We can also eliminate the following variables:

* Report Number, since that won't help us with future predictions
* Offense ID, a unique identifier that also won't help with future predictions
* Report DateTime, since the time a crime is reported can differ from when it actually started
* Group A B, since no additional information is given on what the different groups represent 
* Offense Parent Group, since we have a separate column for what the actual offense was
* Offense Code, since we already have the title of the offense via our Offense variable
* Longitude, since the actual longitude is censored to the 100-block level
* Latitude, since latitude is also censored to the 100-block level

In [None]:
#Eliminating above mentioned variables
df = df.drop(columns = ["Report Number", "Offense ID", "Report DateTime", "Group A B", "Offense Parent Group", "Offense Code", "Longitude", "Latitude"])
df

### 3) Data Types ###

We need to make sure our data is in an appropriate format to use in our models. Let's take a look at the different data types, and see which ones we need to convert into usable formats.

In [None]:
df.info()

**Offense Start DateTime**

Let's start by converting our "Offense Start DateTime" column into a datetime type.

In [None]:
df["Offense Start DateTime"] = pd.to_datetime(df["Offense Start DateTime"])

In [None]:
df.info()

Let's split our datetime column into separate columns for the time of day and the day of the week that the crime occurred.

In [None]:
#The column is already a datetime type, so we can use the .dt accessor directly
df["Time"] = df["Offense Start DateTime"].dt.time
#weekday: Monday = 0, Sunday = 6
df["Day"] = df["Offense Start DateTime"].dt.weekday
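
As a side note, here is a small sketch of a related approach: parsing with `errors="coerce"` so that unparseable strings become missing values instead of raising, and pulling out the hour rather than the full time of day. The timestamps and the `Hour` column are made up for illustration and are not part of our dataset or the steps above.

```python
import pandas as pd

# Invented timestamps, including one value that cannot be parsed.
stamps = pd.Series(["2020-02-05 10:10:00", "2020-02-08 23:45:00", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(stamps, errors="coerce")

parts = pd.DataFrame({
    "Hour": parsed.dt.hour,    # hour of day, 0 to 23
    "Day": parsed.dt.weekday,  # Monday = 0, Sunday = 6
})
print(parts)
```

Using the hour gives at most 24 distinct values, whereas the full time of day can produce a separate value for nearly every minute or second in the data.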

In [None]:
df

Let's remove the Offense Start DateTime column, since we split up our data

In [None]:
df = df.drop(["Offense Start DateTime"], axis = 1)

In [None]:
df

In [None]:
df.info()

**Crime Against Category**

Let's look at the unique values of our Crime Against Category variable, and then convert those values into numerical categories.

In [None]:
df["Crime Against Category"].value_counts()

In [None]:
#Converting into categories
df["Crime Against Category"] = df["Crime Against Category"].astype("category").cat.codes

In [None]:
df["Crime Against Category"].unique()
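
Once the labels are replaced by codes, the original names are gone from the column, so it can help to save the mapping first. A minimal sketch with illustrative labels (the variable names here are our own and not part of the notebook):

```python
import pandas as pd

# Illustrative labels only.
labels = pd.Series(["PROPERTY", "PERSON", "SOCIETY", "PROPERTY"])

# Convert to a categorical once, then read both the codes and the categories from it.
as_cat = labels.astype("category")
codes = as_cat.cat.codes

# Codes index into the (sorted) categories, so enumerate recovers the mapping.
code_to_label = dict(enumerate(as_cat.cat.categories))
print(codes.tolist())
print(code_to_label)
```

Keeping `code_to_label` around makes it possible to translate the numeric codes back into readable names later on.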

**Sector, Beat, and MCPP**

Let's do the same thing with our Sector, Beat, and MCPP variables

In [None]:
#Converting values into categories
df["Sector"] = df["Sector"].astype("category").cat.codes
df["Beat"] = df["Beat"].astype("category").cat.codes
df["MCPP"] = df["MCPP"].astype("category").cat.codes

Let's see what our data looks like so far

In [None]:
df

**Time and Day**

Lastly, let's do the exact same thing to our Time and Day variables

In [None]:
#Converting values into categories
df["Time"] = df["Time"].astype("category").cat.codes
df["Day"] = df["Day"].astype("category").cat.codes

Let's take a look at our updated dataset

In [None]:
df

Before creating our initial model, let's look at how many records we have for each offense, and remove three offense types from our data.

In [None]:
df["Offense"].value_counts()

In [None]:
#Removing three offense types from our data
excluded_offenses = [
    "Gambling Equipment Violation",
    "Human Trafficking, Involuntary Servitude",
    "Operating/Promoting/Assisting Gambling",
]
df = df[~df["Offense"].isin(excluded_offenses)]

In [None]:
df["Offense"].value_counts()
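
The notebook does not state why these particular offenses are removed. If the intent is to drop offenses with very few records (an assumption on our part), the same filter can be driven by a count threshold instead of a hand-written list. A minimal sketch on invented data, where `min_count` is a hypothetical threshold:

```python
import pandas as pd

# Invented offense labels: two common classes and one very rare one.
offenses = pd.Series(["Theft"] * 6 + ["Burglary"] * 4 + ["Rare Offense"] * 1)

min_count = 2  # hypothetical threshold, chosen only for this example

# Keep only the labels that appear at least min_count times.
counts = offenses.value_counts()
keep = counts[counts >= min_count].index
filtered = offenses[offenses.isin(keep)]
print(filtered.value_counts())
```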

## Initial Model ##

Let's start off by separating our features from our target variable

In [None]:
#Features
X = df.drop("Offense", axis = 1)

#Target Variable
y = df["Offense"]

Now we'll split our data using Train Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)
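
With imbalanced classes, a plain random split can leave a rare class under-represented in one of the two sets. `train_test_split` accepts a `stratify` argument that keeps class proportions the same in both sets. This is a sketch on toy arrays, not the split used above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 16 of class 0 and 4 of class 1.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 16 + [1] * 4)

# stratify=y_toy preserves the 80/20 class ratio in both the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1
)
print(np.bincount(y_tr), np.bincount(y_te))
```

Note that stratifying requires at least two samples in every class, which is one more reason very rare classes can be awkward to work with.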

We'll scale our data so that all of our features are on a comparable scale. The scaler is fit on the training set only, and then applied to the test set.

In [None]:
SS = StandardScaler()
X_train = SS.fit_transform(X_train)
X_test = SS.transform(X_test)

In [None]:
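#Oversampling the minority classes in our training set with SMOTE
sm = SMOTE(random_state = 1)
#fit_resample replaces the older fit_sample method, which newer imblearn versions no longer provide
X_train, y_train = sm.fit_resample(X_train, y_train)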

#ros = RandomOverSampler(random_state=1)
#X_train, y_train = ros.fit_resample(X_train, y_train)
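
To make the idea of oversampling concrete, here is a hand-rolled sketch of plain random oversampling in pandas. Note that this is not SMOTE: SMOTE creates new synthetic points by interpolating between neighboring minority samples, while this sketch only duplicates existing rows. The data and column names are invented for illustration.

```python
import pandas as pd

# Invented training data: class "a" has 4 rows, class "b" has 2.
train = pd.DataFrame({
    "feature": [0.1, 0.4, 0.35, 0.8, 0.9, 0.7],
    "label": ["a", "a", "a", "a", "b", "b"],
})

# Bring every class up to the size of the largest class.
target = train["label"].value_counts().max()

parts = []
for _, group in train.groupby("label"):
    parts.append(group)  # keep every original row
    n_extra = target - len(group)
    if n_extra > 0:
        # Draw the missing rows at random, with replacement, from the same class.
        parts.append(group.sample(n=n_extra, replace=True, random_state=1))

balanced = pd.concat(parts, ignore_index=True)
print(balanced["label"].value_counts())
```

As in the cell above, resampling is applied to the training data only, so the test set keeps its original class distribution.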

Now it's time to initialize our model

In [None]:
#Initializing our model
InitialModel = LogisticRegression(multi_class='ovr', random_state= 1)

Now we'll fit our model

In [None]:
InitialModel.fit(X_train, y_train)

In [None]:
y_train_pred = InitialModel.predict(X_train)
y_test_pred = InitialModel.predict(X_test)

In [None]:
print(classification_report(y_train, y_train_pred))

In [None]:
print(classification_report(y_test, y_test_pred))

## Random Forest Model ##

In [None]:
rf = RandomForestClassifier(random_state = 1)

rf.fit(X_train, y_train)
rf_train_pred = rf.predict(X_train)
rf_test_pred = rf.predict(X_test)
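
The predictions above can be evaluated the same way as the initial model, with `classification_report`. Here is a self-contained sketch on synthetic data from `make_classification`; the data, variable names, and `n_estimators=50` are assumptions made for this example and not results from our dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the real data.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

forest = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_tr, y_tr)

# zero_division=0 avoids warnings when a class is never predicted.
report = classification_report(y_te, forest.predict(X_te), zero_division=0)
print(report)
```

Comparing the train and test reports side by side, as we did for the logistic regression model, shows how much the random forest overfits.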