# MIDS W207 Fall 2017 Final Project
## Baseline Submission
Laura Williams, Kim Vignola, Cyprian Gascoigne  
SF Crime Classification

In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report



Read in data

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train_features = list(train.keys())

Examine data

Description of data from Kaggle:  
Dates - timestamp of the crime incident  
Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict.  
Descript - detailed description of the crime incident (only in train.csv)  
DayOfWeek - the day of the week  
PdDistrict - name of the Police Department District  
Resolution - how the crime incident was resolved (only in train.csv)  
Address - the approximate street address of the crime incident   
X - Longitude  
Y - Latitude  

In [3]:
print('First rows of train data: \n', train.head())
print('\nFirst rows of test data: \n', test.head())
print("\nThe features in the training data are: \n", train_features)
print("\nThe shape of the train data is", train.shape)
print("The shape of the test data is", test.shape)

First rows of train data: 
                  Dates        Category                      Descript  \
0  2015-05-13 23:53:00        WARRANTS                WARRANT ARREST   
1  2015-05-13 23:53:00  OTHER OFFENSES      TRAFFIC VIOLATION ARREST   
2  2015-05-13 23:33:00  OTHER OFFENSES      TRAFFIC VIOLATION ARREST   
3  2015-05-13 23:30:00   LARCENY/THEFT  GRAND THEFT FROM LOCKED AUTO   
4  2015-05-13 23:30:00   LARCENY/THEFT  GRAND THEFT FROM LOCKED AUTO   

   DayOfWeek PdDistrict      Resolution                    Address  \
0  Wednesday   NORTHERN  ARREST, BOOKED         OAK ST / LAGUNA ST   
1  Wednesday   NORTHERN  ARREST, BOOKED         OAK ST / LAGUNA ST   
2  Wednesday   NORTHERN  ARREST, BOOKED  VANNESS AV / GREENWICH ST   
3  Wednesday   NORTHERN            NONE   1500 Block of LOMBARD ST   
4  Wednesday       PARK            NONE  100 Block of BRODERICK ST   

            X          Y  
0 -122.425892  37.774599  
1 -122.425892  37.774599  
2 -122.424363  37.800414  
3 -122.426
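
Because `Category` is the prediction target, a quick look at how often each class occurs can help when reading baseline results. A minimal sketch, assuming a small hand-built frame (`toy`, a hypothetical name) that reuses the five `Category` values from the rows printed above rather than the full `train.csv`:

```python
import pandas as pd

# Hypothetical toy frame: the Category values of the first five training rows.
toy = pd.DataFrame({
    "Category": ["WARRANTS", "OTHER OFFENSES", "OTHER OFFENSES",
                 "LARCENY/THEFT", "LARCENY/THEFT"]
})

# Number of rows per class, and the same counts as fractions of all rows.
class_counts = toy["Category"].value_counts()
class_fractions = toy["Category"].value_counts(normalize=True)

print(class_counts)
print(class_fractions)
```

On the real data, the same two calls would presumably be applied to `train['Category']`.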

Restructure data for modeling

In [4]:
train_data_all = np.column_stack((train['Dates'],
                                 train['DayOfWeek'],
                                 train['PdDistrict'],
                                 train['Address'],
                                 train['X'],
                                 train['Y']))

train_labels_all = np.array(train['Category'])

test_data_all = np.column_stack((test['Dates'],
                                test['DayOfWeek'],
                                test['PdDistrict'],
                                test['Address'],
                                test['X'],
                                test['Y']))

print("Training data shape is", train_data_all.shape)
print("First few rows of training data are", train_data_all[:3])
print()
print("Training labels shape is", train_labels_all.shape)
print("First few labels of training labels are", train_labels_all[:3])
print()
print("Test data shape is", test_data_all.shape)
print("First few rows of test data are", test_data_all[:3])


Training data shape is (878049, 6)
First few rows of training data are [['2015-05-13 23:53:00' 'Wednesday' 'NORTHERN' 'OAK ST / LAGUNA ST'
  -122.425891675136 37.7745985956747]
 ['2015-05-13 23:53:00' 'Wednesday' 'NORTHERN' 'OAK ST / LAGUNA ST'
  -122.425891675136 37.7745985956747]
 ['2015-05-13 23:33:00' 'Wednesday' 'NORTHERN' 'VANNESS AV / GREENWICH ST'
  -122.42436302145 37.8004143219856]]

Training labels shape is (878049,)
First few labels of training labels are ['WARRANTS' 'OTHER OFFENSES' 'OTHER OFFENSES']

Test data shape is (884262, 6)
First few rows of test data are [['2015-05-10 23:59:00' 'Sunday' 'BAYVIEW' '2000 Block of THOMAS AV'
  -122.39958770418998 37.7350510103906]
 ['2015-05-10 23:51:00' 'Sunday' 'BAYVIEW' '3RD ST / REVERE AV'
  -122.391522893042 37.7324323864471]
 ['2015-05-10 23:50:00' 'Sunday' 'NORTHERN' '2000 Block of GOUGH ST'
  -122.426001954961 37.7922124386284]]
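
The printed rows mix strings and floats, which suggests the stacked array carries NumPy's generic `object` dtype. The sketch below checks this on a hypothetical three-row frame (`toy`) built from values shown above, and compares `np.column_stack` with selecting the columns by name. `feature_cols` is an illustrative name, not one used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two string columns and one float column.
toy = pd.DataFrame({
    "DayOfWeek": ["Wednesday", "Wednesday", "Wednesday"],
    "PdDistrict": ["NORTHERN", "NORTHERN", "NORTHERN"],
    "X": [-122.425892, -122.425892, -122.424363],
})

# Same approach as the cell above: stack the columns side by side.
stacked = np.column_stack((toy["DayOfWeek"], toy["PdDistrict"], toy["X"]))

# Alternative: select the columns by name and convert in one step.
feature_cols = ["DayOfWeek", "PdDistrict", "X"]
selected = toy[feature_cols].to_numpy()

print(stacked.dtype, stacked.shape)
print((stacked == selected).all())
```

Distance-based estimators such as `KNeighborsClassifier` expect numeric input, so the string columns would likely need to be encoded before fitting.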


Set aside 20% of training data as development data

In [6]:
n = train_data_all.shape[0]

# Shuffle the rows so the train/dev split is random. The same permutation
# must index both the data and the labels to keep them aligned.
shuffle = np.random.permutation(n)

train_data_all = train_data_all[shuffle]
train_labels_all = train_labels_all[shuffle]

n_train = int(0.8*n)

# Slice data and labels from their own arrays at the same cut point.
train_data = train_data_all[:n_train,:]
train_labels = train_labels_all[:n_train]
dev_data = train_data_all[n_train:,:]
dev_labels = train_labels_all[n_train:]


print("Training data shape is", train_data.shape)
print("Training labels shape is", train_labels.shape)
print()
print("Development data shape is", dev_data.shape)
print("Development labels shape is", dev_labels.shape)


Training data shape is (702439, 6)
Training labels shape is (702439,)

Development data shape is (175610, 6)
Development labels shape is (175610,)
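
The split depends on one permutation indexing both arrays, so that each label stays attached to its row. A minimal sketch on hypothetical toy arrays (`data`, `labels`), with a fixed seed added here only for repeatability (the notebook does not set one):

```python
import numpy as np

# Hypothetical toy data: 10 rows, where each label encodes its row's first value.
data = np.arange(20).reshape(10, 2)
labels = data[:, 0] * 10

# Assumption: the seed is chosen only for this example.
rng = np.random.RandomState(0)
shuffle = rng.permutation(data.shape[0])

# Apply the same permutation to rows and labels.
data, labels = data[shuffle], labels[shuffle]

# Hold out the last 20% of the shuffled rows as development data.
n_train = int(0.8 * data.shape[0])
train_data, dev_data = data[:n_train], data[n_train:]
train_labels, dev_labels = labels[:n_train], labels[n_train:]

print(train_data.shape, train_labels.shape)
print(dev_data.shape, dev_labels.shape)

# Alignment check: every label must still match its own row.
aligned = bool((train_labels == train_data[:, 0] * 10).all()
               and (dev_labels == dev_data[:, 0] * 10).all())
print(aligned)
```

A one-dimensional label shape next to a two-dimensional data shape is a quick sign that labels were sliced from the label array rather than the feature array. `sklearn.model_selection.train_test_split` with `test_size=0.2` is another way to shuffle and split in a single call.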
