# San Francisco Police Department Crime Rate Reports

For this project we will analyze San Francisco Police Department Crime Rate reports from [Kaggle](https://www.kaggle.com/psmavi104/san-francisco-crime-data). The data contains the following columns:

* Incident_Datetime
* Incident_Time
* Incident_Year
* Incident_Day_of_Week
* Report_Datetime
* Incident_ID
* Row_ID
* Incident_Number
* CAD_Number
* Report_Type_Code
* Report_Type_Description
* Filed_Online
* Incident_Code
* Incident_Category
* Incident_Subcategory
* Incident_Description
* Resolution
* Intersection
* CNN
* Police_District
* Analysis_Neighborhood
* Supervisor_District
* Latitude 
* Longitude
* point
* SF_Find_Neighborhoods
* Current_Police_Districts
* Current_Supervisor_Districts
* Analysis_Neighborhoods
* HSOC_Zones_as_of_2018-06-05
* OWED_Public_Spaces
* Central_Market/Tenderloin_Boundary_Polygon_-_Updated
* Parks_Alliance_CPSI_27+TL_sites


## Data and Setup

**Import some libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import sklearn.preprocessing
import warnings
from wordcloud import WordCloud

%matplotlib inline
sns.set_style("darkgrid")
warnings.filterwarnings("ignore")

**Read in the CSV file as a DataFrame called df**

In [None]:
df = pd.read_csv('SFCD_2018.csv')

## Exploring Data

In [None]:
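# regex=False treats '(' and ')' as literal characters rather than regex syntax
df.columns = (df.columns.str.strip()
              .str.replace(' ', '_', regex=False)
              .str.replace('(', '', regex=False)
              .str.replace(')', '', regex=False))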
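
The effect of this column clean-up is easier to see on a small example. The sketch below applies the same chain of string operations to a few made-up headers (the header names are hypothetical and not taken from the dataset), assuming a pandas version that supports the `regex` keyword of `str.replace`.

```python
import pandas as pd

# Hypothetical raw headers, invented for illustration only.
raw = pd.DataFrame(columns=[" Incident Datetime ", "Police District", "Latitude (deg)"])

# Strip outer whitespace, turn inner spaces into underscores, drop parentheses.
cleaned = (
    raw.columns.str.strip()
    .str.replace(" ", "_", regex=False)
    .str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
)
print(list(cleaned))
```

Names without spaces or parentheses pass through unchanged, so the step is safe to rerun.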

In [None]:
df.columns

In [None]:
# plt.rcParams['figure.figsize'] = (10, 8)
# plt.style.use('bmh')
# wc = WordCloud(background_color = 'pink', width = 1500, height = 1500).generate(str(df['Incident_Description']))
# plt.imshow(wc)
# plt.axis('off')
# plt.show()

In [None]:
df.isnull().sum()

In [None]:
# Correlation is only defined for numeric columns, so select those first
sns.heatmap(df.select_dtypes('number').corr(), center=0)

In [None]:
df_2 = df.drop(['CAD_Number', 'Report_Type_Code', 'Filed_Online', 'Incident_Code',
       'Incident_Category', 'Incident_Subcategory', 'Intersection', 'CNN', 'Police_District',
       'Analysis_Neighborhood', 'Supervisor_District',
       'point', 'SF_Find_Neighborhoods', 'Current_Police_Districts',
       'Current_Supervisor_Districts', 'Analysis_Neighborhoods',
       'HSOC_Zones_as_of_2018-06-05', 'OWED_Public_Spaces',
       'Central_Market/Tenderloin_Boundary_Polygon_-_Updated'], axis = 1)

In [None]:
df_2.head()

In [None]:
# dropna returns a new DataFrame, so assign the result to keep it
df_2 = df_2.dropna(how='all')

In [None]:
df_2.shape

In [None]:
#df.Incident_Description.value_counts().plot.bar()

In [None]:
# plt.rcParams['figure.figsize'] = (10, 8)
# plt.style.use('bmh')
# wc = WordCloud(background_color = 'white', width = 1800, height = 1700).generate(str(df_2['Incident_Description']))
# plt.imshow(wc)
# plt.axis('off')
# plt.show()

In [None]:
# Correlation is only defined for numeric columns, so select those first
sns.heatmap(df_2.select_dtypes('number').corr(), center=0)

In [None]:
df_2['Incident_Datetime'] = pd.to_datetime(df_2['Incident_Datetime'])

In [None]:
df_2.columns

In [None]:
df_2['date'] = df_2['Incident_Datetime'].apply(lambda t: t.date())

In [None]:
df_2['month'] = df_2['Incident_Datetime'].apply(lambda time: time.month)

In [None]:
df_2['hour'] = df_2['Incident_Datetime'].apply(lambda time: time.hour)

In [None]:
df_2['day'] = df_2['Incident_Datetime'].apply(lambda time: time.day)
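
The date parts extracted in the cells above can also be read through the vectorized `.dt` accessor instead of `apply`. The sketch below shows this on two made-up timestamps; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical timestamps, invented for illustration only.
toy = pd.DataFrame({"Incident_Datetime": ["2018-03-15 23:45:00", "2018-07-04 08:05:00"]})
toy["Incident_Datetime"] = pd.to_datetime(toy["Incident_Datetime"])

# The .dt accessor exposes each datetime component for the whole column at once.
toy["date"] = toy["Incident_Datetime"].dt.date
toy["month"] = toy["Incident_Datetime"].dt.month
toy["hour"] = toy["Incident_Datetime"].dt.hour
toy["day"] = toy["Incident_Datetime"].dt.day
print(toy[["month", "hour", "day"]])
```

Writing each component next to its column name makes it easier to notice when `day` is accidentally filled from `hour`.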

In [None]:
df_2.drop(['Incident_Date', 'Incident_Time', 'Incident_Year','Row_ID', 'Incident_ID', 'Report_Type_Description', 'Parks_Alliance_CPSI_27+TL_sites','Incident_Number', 'Report_Datetime','Incident_Datetime'], axis  = 1, inplace=True)



In [None]:
df_2.dropna(axis=0, inplace=True)

In [None]:
df_2.head()

## Basic Questions

**What are the top 20 crimes in San Francisco?**

In [None]:
df_2.Incident_Description.value_counts().head(20).plot(kind='bar')

In [None]:
from matplotlib.pyplot import figure
figure(num=None, figsize=(20, 6), dpi=80, facecolor='w', edgecolor='k')

In [None]:
plt.figure(figsize=(20, 6))  # set the figure size before plotting so the plot is drawn on it
df_2.groupby('Incident_Description')['Longitude'].value_counts().plot()
plt.title('Reasons')
#plt.tight_layout()

In [None]:
#sns.countplot(x='month',data=byMonth,hue='Incident_Description',palette='viridis')

In [None]:
byDay = df_2.groupby('Incident_Day_of_Week').count()  # incident counts per day of the week

In [None]:
byDay['Incident_Description'].plot(kind='bar')

In [None]:
df_2.head()

In [None]:
df_2.info()

In [None]:
df_2.Resolution.value_counts(normalize = True)

In [None]:
crimes=list(df_2.Incident_Description.values)
key_crimes = []
for x in crimes:
    crime = x.split()
    for c in crime:
        key_crimes.append(c.strip(','))
    

In [None]:
crimes_df = pd.DataFrame(key_crimes)

In [None]:
crimes_df[0].value_counts(normalize = True).head(75)

In [None]:
key_cr = ['Larceny','Theft', 'Assault', 'Vandalism', 'Burglary',
          'Battery','Rape','Possession', 'Robbery','Fraudulent','Mental','Violation','Person',
         'Suspicious','Shoplifting','Traffic','Vehicle']

In [None]:
def assign_crime(x):
    cr = 'general'
    for crime in key_cr:
        if crime in x:
            cr = crime
    return cr
df_2['Crime'] = df_2.Incident_Description.apply(assign_crime)
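
Because the loop in `assign_crime` keeps overwriting `cr`, a description that contains several keywords is labeled with the last match in `key_cr`, not the first. The sketch below illustrates this on a shortened keyword list and made-up descriptions; `assign_first` is a hypothetical variant added only for comparison.

```python
key_cr = ['Larceny', 'Theft', 'Assault']  # shortened list for the example

def assign_crime(x):
    # Same logic as above: the last keyword found in x wins.
    cr = 'general'
    for crime in key_cr:
        if crime in x:
            cr = crime
    return cr

def assign_first(x, keywords=key_cr, default='general'):
    # Hypothetical variant: return the first keyword found instead.
    return next((k for k in keywords if k in x), default)

# Made-up descriptions, invented for illustration only.
print(assign_crime('Larceny Theft, From Building'))  # last match
print(assign_first('Larceny Theft, From Building'))  # first match
print(assign_crime('Lost Property'))                 # no keyword present
```

Which behavior is preferable depends on how the keywords are ordered, so the order of `key_cr` is effectively part of the labeling rule.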
    

In [None]:
df_3 = df_2[df_2.Crime != 'general']
df_3.info()

In [None]:
crimes_pd = pd.concat([df_3,pd.get_dummies(df_3.Crime)],axis = 1)
crimes_pd = pd.concat([crimes_pd,pd.get_dummies(crimes_pd.Incident_Day_of_Week)],axis = 1)
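
The two `concat` calls above append indicator columns while the source columns stay in place until they are dropped. As a compact alternative, sketched here on made-up rows, `pd.get_dummies` accepts a `columns=` argument that encodes the listed columns, removes the originals, and prefixes each new column with its source name. The frame `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only.
toy = pd.DataFrame({
    "Crime": ["Theft", "Assault", "Theft"],
    "Incident_Day_of_Week": ["Monday", "Friday", "Monday"],
    "hour": [9, 23, 14],
})

# Encode both categorical columns in one call; 'hour' is left untouched.
encoded = pd.get_dummies(toy, columns=["Crime", "Incident_Day_of_Week"])
print(sorted(encoded.columns))
```

The prefixes avoid name clashes if a crime keyword and a weekday ever shared a label.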

In [None]:
crimes_pd.drop(['Incident_Description','date','Incident_Day_of_Week','Crime'], axis = 1, inplace = True)

In [None]:
# reset_index returns a new DataFrame, so assign the result to keep it
crimes_pd = crimes_pd.reset_index(drop=True)

In [None]:
X = crimes_pd.drop('Resolution', axis = 1)
y = crimes_pd.Resolution
X_train, X_test,y_train,y_test = train_test_split(X,y,test_size = .3, random_state = 212)

In [None]:
from sklearn.ensemble import RandomForestClassifier,RandomForestRegressor,AdaBoostClassifier
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

In [None]:
print(X_test.shape)
print(X_train.shape)
print(y_test.shape)
print(y_train.shape)

In [None]:
clf = RandomForestClassifier(n_estimators=100, max_depth=2,random_state=0)

In [None]:
res1 = clf.fit(X_train,y_train)

In [None]:
res1

In [None]:
y_hat = clf.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test,y_hat)
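
When one label dominates, plain accuracy can look high even for a model that always predicts that label. The sketch below, on made-up labels, compares accuracy with balanced accuracy for a majority-class baseline; the 90/10 split is hypothetical and not taken from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels: 90 open cases and 10 arrests, invented for illustration.
y_toy = np.array(["Open or Active"] * 90 + ["Cite or Arrest Adult"] * 10)
X_toy = np.zeros((len(y_toy), 1))  # the baseline ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

baseline_acc = accuracy_score(y_toy, pred)
baseline_bal = balanced_accuracy_score(y_toy, pred)
print(baseline_acc, baseline_bal)
```

A classifier is only informative to the extent that it beats such a baseline, which is why the per-class views in the ROC and confusion-matrix cells are worth reading alongside the accuracy score.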

In [None]:
from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import ConfusionMatrix

In [None]:
visualizer = ROCAUC(clf)

visualizer.score(X_test, y_test)  # Evaluate the model on the test data

visualizer.show()  # show() replaces the deprecated poof() in Yellowbrick 1.0+

In [None]:
cm = ConfusionMatrix(clf)

# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)

# How did we do?
cm.show()  # show() replaces the deprecated poof() in Yellowbrick 1.0+

In [None]:

y_test.value_counts(normalize=True)
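
The class shares printed above are a natural input for class weighting. Note that a `class_weight` dictionary that copies the class shares gives the largest weight to the most common class, while inverse-frequency ("balanced") weights do the opposite. The sketch below computes balanced weights with scikit-learn's `compute_class_weight` on made-up labels; the 8/2 split is hypothetical.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels, invented for illustration only.
labels = np.array(["Open or Active"] * 8 + ["Cite or Arrest Adult"] * 2)
classes = np.unique(labels)

# 'balanced' weight for a class = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
weight_map = dict(zip(classes, weights))
print(weight_map)
```

Passing `class_weight="balanced"` directly to `RandomForestClassifier` applies the same formula without building the dictionary by hand.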

In [None]:
clf_RF_weights = RandomForestClassifier(n_jobs=-1,class_weight={'Open or Active':0.81,
                                                      'Cite or Arrest Adult':0.17,
                                                     'Cite or Arrest Juvenile': .006,
                                                     "Unfounded":0.004648,
                                                     "Exceptional Adult":0.001451,
                                                    "Exceptional Juvenile":0.000249},
                                        max_depth = 1000, random_state = 24,n_estimators=1000)

In [None]:
clf_RF_weights.fit(X_train,y_train)

In [None]:
y_hat = clf_RF_weights.predict(X_test)

acc = accuracy_score(y_test, y_hat) * 100
print(f"Accuracy Score is {acc}")

In [None]:
visualizer = ROCAUC(clf_RF_weights)

visualizer.score(X_test, y_test)  # Evaluate the model on the test data

visualizer.show()  # show() replaces the deprecated poof() in Yellowbrick 1.0+

In [None]:
cm = ConfusionMatrix(clf_RF_weights)

# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)

# How did we do?
cm.show()  # show() replaces the deprecated poof() in Yellowbrick 1.0+

In [None]:
crimes_pd.head(10)

In [None]:
crimes_pd.to_csv("data/crime_data.csv", index=False)

In [None]:
import numpy as np
from sklearn.model_selection import KFold

X = crimes_pd.drop('Resolution', axis = 1)
y = crimes_pd.Resolution




In [None]:
def K_Fold_Classification(X,y,n):
    # Refit the weighted random forest on each of the n folds and record
    # the train and test accuracy (in percent) for every fold.
    kf = KFold(n_splits=n)
    train_score = []
    test_score = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        clf_RF_weights.fit(X_train,y_train)
        y_hat = clf_RF_weights.predict(X_test)
        y_hat_train = clf_RF_weights.predict(X_train)
        acc = accuracy_score(y_test, y_hat) * 100
        acc_train = accuracy_score(y_train, y_hat_train) * 100
        train_score.append(acc_train)
        test_score.append(acc)
    return train_score,test_score


In [None]:
train, test = K_Fold_Classification(X,y,5)
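
The hand-written loop above can also be expressed with scikit-learn's built-in helpers. The sketch below uses `StratifiedKFold`, which keeps the class proportions similar in every fold (a plain `KFold` without shuffling does not guarantee this), together with `cross_validate`. The data is synthetic and the small forest is chosen only to keep the example fast; neither reflects the notebook's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced toy data, invented for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 1.0).astype(int)

# Stratified folds preserve the class ratio in each train/test split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

scores = cross_validate(model, X_toy, y_toy, cv=cv,
                        scoring="accuracy", return_train_score=True)
print(scores["train_score"])
print(scores["test_score"])
```

A large gap between the train and test scores across folds would point to overfitting, which is the comparison the two returned lists are meant to support.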

In [None]:

# xs = list(range(1, 6))
# figure = plt.figure(figsize=(10, 6)) # first element is width, second is height.
# axes = figure.add_subplot(1, 1, 1)

# axes.plot(xs, test, color="firebrick")
# axes.plot(xs, train, color="cornflowerblue")

# axes.set_xticks(xs)
# axes.set_xticklabels([str(x) for x in xs])

# axes.set_title("Validation Curves")
# axes.set_xlabel("k")
# axes.set_ylabel("Accuracy")

# plt.show()
# plt.close()

In [None]:
from sklearn.svm import SVC  
svclassifier = SVC(kernel='linear')  
svclassifier.fit(X_train, y_train) 

In [None]:
y_pred = svclassifier.predict(X_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix , accuracy_score
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred)) 
print(f"The accuracy score is {accuracy_score(y_test, y_pred)}")

In [None]:
clf_GS = RandomForestClassifier(n_jobs=-1,
                                class_weight={'Open or Active':0.81,
                                             'Cite or Arrest Adult':0.17,
                                             'Cite or Arrest Juvenile':0.006,
                                             "Unfounded":0.004648,
                                             "Exceptional Adult":0.001451,
                                             "Exceptional Juvenile":0.000249},
                                random_state = 24)

In [None]:
param_grid = { 
    'n_estimators': [200, 500],
    # 'auto' was an alias for 'sqrt' in classifiers and was removed in scikit-learn 1.3
    'max_features': ['sqrt', 'log2'],
    'max_depth' : [4,5,6,7,8],
    'criterion' :['gini', 'entropy']
}

In [None]:
X_train.shape

In [None]:
y_train.shape

In [None]:
CV_rfc = GridSearchCV(estimator=clf_GS, param_grid=param_grid, cv= 5)
CV_rfc.fit(X_train, y_train)

In [None]:
CV_rfc.best_score_

In [None]:
CV_rfc.best_params_

# Support Vector Machines

## Train Test Split

**Split your data into a training set and a testing set.**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = crimes_pd.drop('Resolution',axis=1)
y = crimes_pd['Resolution']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

In [None]:
crimes_pd['Resolution'].head()

## Train a Model

Now it's time to train a Support Vector Machine classifier.

**Call the SVC() model from sklearn and fit the model to the training data.**

In [None]:
from sklearn.svm import SVC

In [None]:
svc_model = SVC(kernel='linear')  # the default SVC() would use an RBF kernel instead

In [None]:
svc_model.fit(X_train,y_train)

## Model Evaluation

**Now get predictions from the model and create a confusion matrix and a classification report.**

In [None]:
predictions = svc_model.predict(X_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix , accuracy_score

In [None]:
print(confusion_matrix(y_test,predictions))

In [None]:
print(classification_report(y_test,predictions))

## Gridsearch Practice

**Import GridSearchCV from scikit-learn.**

In [None]:
from sklearn.model_selection import GridSearchCV

**Create a dictionary called param_grid and fill out some parameters for C and gamma.**

In [None]:
param_grid = {'C': [0.1,1, 10, 100], 'gamma': [1,0.1,0.01,0.001]} 

**Create a GridSearchCV object and fit it to the training data.**

In [None]:
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=2)
grid.fit(X_train,y_train)
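
Support vector machines with an RBF kernel are sensitive to feature scale, and the features here mix coordinates, hours, and 0/1 indicators. One common arrangement, sketched below on synthetic data, is to put a `StandardScaler` and the `SVC` in a pipeline and run the grid search over the pipeline, so the scaler is refit inside each cross-validation fold. The data, the reduced grid, and all names ending in `_toy` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, invented for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

# make_pipeline names each step after its class in lowercase, hence the 'svc__' prefix.
pipe = make_pipeline(StandardScaler(), SVC())
param_grid_toy = {"svc__C": [0.1, 1, 10], "svc__gamma": [1, 0.1, 0.01]}

grid_toy = GridSearchCV(pipe, param_grid_toy, cv=3, refit=True)
grid_toy.fit(X_tr, y_tr)

test_accuracy = grid_toy.score(X_te, y_te)
print(grid_toy.best_params_)
print(test_accuracy)
```

Scaling inside the pipeline keeps test-fold statistics out of the training step, which scaling the full dataset up front would not.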

**Now use the fitted grid model to make predictions on the test set, then create a classification report and a confusion matrix for them. Were you able to improve?**

In [None]:
grid_predictions = grid.predict(X_test)

In [None]:
print(confusion_matrix(y_test,grid_predictions))

In [None]:
print(classification_report(y_test,grid_predictions))