# Introduction to the Problem:

From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz.

Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.

From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.

We're also encouraging you to explore the dataset visually. What can we learn about the city through visualizations like this Top Crimes Map? The most up-voted scripts from this competition will receive official Kaggle swag as prizes.

This problem was hosted by [Kaggle](https://www.kaggle.com/c/sf-crime/overview).

# 1. The datasets:

The problem provides two files: a _train_ dataset and a _test_ dataset. As usual, the train dataset will be used to fit our models and to estimate how well they perform. We will then select the best-scoring model and apply it to the test dataset to generate our predictions for submission to the competition.

In [1]:
# import the necessary libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [2]:
# loading the data into the kernel:
df = pd.read_csv('train.csv')
df2 = pd.read_csv('test.csv')

print("The columns in the train dataset are: \n", df.columns)
print("\nThe columns in the test dataset are: \n", df2.columns)

The columns in the train dataset are: 
 Index(['Dates', 'Category', 'Descript', 'DayOfWeek', 'PdDistrict',
       'Resolution', 'Address', 'X', 'Y'],
      dtype='object')

The columns in the test dataset are: 
 Index(['Id', 'Dates', 'DayOfWeek', 'PdDistrict', 'Address', 'X', 'Y'], dtype='object')


The train and test datasets have different columns. The `Descript` column is just a more detailed description of the `Category` column in the training dataset, and neither it nor the `Resolution` column is present in the test dataset. Hence, we will delete the columns `Descript` and `Resolution` from the training dataset and continue to look into the data to derive further insights.

In [3]:
df1 = df.copy()
del df1['Descript']
del df1['Resolution']
df1.head(10)

Unnamed: 0,Dates,Category,DayOfWeek,PdDistrict,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,Wednesday,NORTHERN,OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,Wednesday,NORTHERN,OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,Wednesday,NORTHERN,VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,Wednesday,NORTHERN,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,Wednesday,PARK,100 Block of BRODERICK ST,-122.438738,37.771541
5,2015-05-13 23:30:00,LARCENY/THEFT,Wednesday,INGLESIDE,0 Block of TEDDY AV,-122.403252,37.713431
6,2015-05-13 23:30:00,VEHICLE THEFT,Wednesday,INGLESIDE,AVALON AV / PERU AV,-122.423327,37.725138
7,2015-05-13 23:30:00,VEHICLE THEFT,Wednesday,BAYVIEW,KIRKWOOD AV / DONAHUE ST,-122.371274,37.727564
8,2015-05-13 23:00:00,LARCENY/THEFT,Wednesday,RICHMOND,600 Block of 47TH AV,-122.508194,37.776601
9,2015-05-13 23:00:00,LARCENY/THEFT,Wednesday,CENTRAL,JEFFERSON ST / LEAVENWORTH ST,-122.419088,37.807802


Let's check the number of unique values in each variable in the dataset.

In [4]:
df1.nunique(), df1.shape

(Dates         389257
 Category          39
 DayOfWeek          7
 PdDistrict        10
 Address        23228
 X              34243
 Y              34243
 dtype: int64, (878049, 7))

There are only 389,257 unique dates even though there are 878,049 total rows. This is because multiple incidents can be recorded at the same timestamp, at the same or at different locations. Locations repeat too (`X`, `Y` and `Address` variables), and there are only 10 police districts.
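To make the repetition concrete, here is a minimal sketch that counts rows per timestamp and flags repeated (timestamp, address) pairs. It uses only the first three rows from the preview above as a toy frame, so the names `toy`, `rows_per_timestamp` and `repeat_mask` are illustrative and not part of the original notebook.

```python
import pandas as pd

# First three rows of the preview above, used as a toy frame
toy = pd.DataFrame({
    "Dates": ["2015-05-13 23:53:00", "2015-05-13 23:53:00", "2015-05-13 23:33:00"],
    "Category": ["WARRANTS", "OTHER OFFENSES", "OTHER OFFENSES"],
    "Address": ["OAK ST / LAGUNA ST", "OAK ST / LAGUNA ST", "VANNESS AV / GREENWICH ST"],
})

# Number of rows that share each timestamp
rows_per_timestamp = toy.groupby("Dates").size()

# True for rows whose (timestamp, address) pair already appeared earlier in the frame
repeat_mask = toy.duplicated(subset=["Dates", "Address"])

print(rows_per_timestamp)
print(int(repeat_mask.sum()))
```

On the full data, the same two calls on `df1` would show how many timestamps and locations are shared between reports.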

Let's check whether there are any missing values and see what can be done about them.

In [5]:
# look for missing values:
df1.isnull().sum(), df2.isnull().sum()

(Dates         0
 Category      0
 DayOfWeek     0
 PdDistrict    0
 Address       0
 X             0
 Y             0
 dtype: int64, Id            0
 Dates         0
 DayOfWeek     0
 PdDistrict    0
 Address       0
 X             0
 Y             0
 dtype: int64)

There are no missing values at all in our data. While that can be a good sign, it is also possible that missing values were filled in with placeholders such as zeros. We will see soon enough. Let's visualize the data first.
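One way to look for placeholder values is to check whether every coordinate falls inside a plausible range. The sketch below does this on a toy frame in which the third row is an invented placeholder. The bounding box limits are an assumption chosen for illustration, not official city limits.

```python
import pandas as pd

# Toy coordinates: two points from the preview above plus one invented placeholder row
coords = pd.DataFrame({
    "X": [-122.425892, -122.403252, 0.0],
    "Y": [37.774599, 37.713431, 0.0],
})

# Rough bounding box around San Francisco (assumed limits, for illustration only)
LON_MIN, LON_MAX = -123.0, -122.0
LAT_MIN, LAT_MAX = 37.0, 38.0

# A row is plausible only if both longitude and latitude fall inside the box
in_box = coords["X"].between(LON_MIN, LON_MAX) & coords["Y"].between(LAT_MIN, LAT_MAX)
suspicious = coords[~in_box]

print(len(suspicious))
```

Any rows that land in `suspicious` on the real data would deserve a closer look before modeling.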

# 2. Exploratory Data Analysis:

Since this is a classification problem, we will use boxplots _(seaborn)_ and other tools _(matplotlib)_ to see how each variable affects our target variable, which is `Category`. We will have to choose wisely, though. For example, choosing a histogram will not be useful at all since there are only _two_ numerical variables in `X` and `Y`, both of which are coordinates.

Let's first try to see all the datapoints on the map. We will use the `Basemap` toolkit for matplotlib (more information on Basemap can be found [here](https://matplotlib.org/basemap/index.html)). We will also talk about the arguments we use for the mapping so that it becomes slightly clearer.

In [6]:
# import the basemap feature:
# 1. Fix the 'PROJ_LIB' error (set the path to where the "epsg" file is)
import os
os.environ["PROJ_LIB"] = r"D:\Anaconda\Library\share"  # raw string keeps the backslashes literal

# 2. Import Basemap
import mpl_toolkits.basemap
from mpl_toolkits.basemap import Basemap

# 3. Extract the data we're interested in:
lat = df1['Y'].values
lon = df1['X'].values
crimes = df1['Category'].values

The basemap feature allows us to first project the map on a figure and then plot our datapoints on it. It is a very useful tool. We can see what kinds of projections we can choose from by running the command `print (mpl_toolkits.basemap.supported_projections)` on the kernel. We will delve into this interesting tool now.

In [7]:
# 1. Draw the map background (using projection and other features)
#plt.figure(figsize=(16,12))
#bmap = Basemap(projection = 'lcc', resolution='h',
              #lat_0=37.77,lon_0=-122.42,width=3E4,height=2E4)
#bmap.shadedrelief()
#bmap.drawcoastlines(color='black')
#bmap.drawcountries(color='white')
#bmap.drawstates(color='black')

# ====================================================================================================

# 2. Scatter the crime data on the map:
#bmap.scatter(lon,lat,latlon=True,alpha=0.5)


The arguments used above are as follows (more can be found [here](https://matplotlib.org/basemap/api/basemap_api.html)):

- _projection:_ The type of map projection used

- _resolution:_ The resolution with which the map will be projected. The default setting is "c" (crude), but we can choose between that and "l" (low), "i" (intermediate), "h" (high), "f" (full).

- _lat_0:_ The latitude of the center of the map to be made visible.

- _lon_0:_ The longitude of the center of the map.

- _width, height:_ The width and the height of the map. This is actually the distance in meters.

- _latlon:_ This has been set to `True`. This allows for the interpretation of X and Y inputs as longitude and latitude, respectively (more on this [here](https://matplotlib.org/basemap/api/basemap_api.html#mpl_toolkits.basemap.Basemap.scatter)).
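To get a feel for what `width=3E4` and `height=2E4` mean around the chosen center, the arithmetic below converts those distances into approximate spans in degrees. It is a rough flat-earth estimate: the figure of about 111.32 km per degree of latitude is an approximation we assume here, and the actual `lcc` projection will differ slightly.

```python
import math

lat_0, lon_0 = 37.77, -122.42   # map center used in the Basemap call
width_m, height_m = 3e4, 2e4    # map size in meters

KM_PER_DEG_LAT = 111.32         # approximate length of one degree of latitude (assumption)
# A degree of longitude shrinks with the cosine of the latitude
km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(lat_0))

# Half of the map's extent, expressed in degrees
half_lat_span = (height_m / 1000) / KM_PER_DEG_LAT / 2
half_lon_span = (width_m / 1000) / km_per_deg_lon / 2

lat_range = (lat_0 - half_lat_span, lat_0 + half_lat_span)
lon_range = (lon_0 - half_lon_span, lon_0 + half_lon_span)
print(lat_range, lon_range)
```

The resulting ranges cover the `X` and `Y` values seen in the preview rows, which suggests the chosen center and size frame the city reasonably well.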


The map doesn't reveal much information, as the datapoints are just scattered all over the place. The remaining columns are mostly categorical text rather than numbers, so we'll move on from visualization and encode them instead.

In [8]:
df1.PdDistrict.value_counts()['NORTHERN']

105296

In [9]:
df1['PdDistrict'].value_counts()

SOUTHERN      157182
MISSION       119908
NORTHERN      105296
BAYVIEW        89431
CENTRAL        85460
TENDERLOIN     81809
INGLESIDE      78845
TARAVAL        65596
PARK           49313
RICHMOND       45209
Name: PdDistrict, dtype: int64

We know that `DayOfWeek` has 7 unique values and `PdDistrict` has 10. Let's use mapping on these two variables. The idea is to encode each column (one way or another) into numbers to help us predict the Category in the end.

In [10]:
# define the mapping elements first:
days_mapping = {"Sunday":1, "Monday":2, "Tuesday":3, "Wednesday":4, "Thursday":5, "Friday":6, "Saturday":7}
df1['DayOfWeek'] = df1['DayOfWeek'].map(days_mapping)
df1.head()

Unnamed: 0,Dates,Category,DayOfWeek,PdDistrict,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,4,NORTHERN,OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,4,NORTHERN,OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,4,NORTHERN,VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,4,NORTHERN,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,4,PARK,100 Block of BRODERICK ST,-122.438738,37.771541


In [11]:
# do the same with the PdDistricts:
pd_mapping = {"SOUTHERN":1,"MISSION":2,"NORTHERN":3,"BAYVIEW":4,"CENTRAL":5,
              "TENDERLOIN":6,"INGLESIDE":7,"TARAVAL":8,"PARK":9,"RICHMOND":10}
df1['PdDistrict'] = df1["PdDistrict"].map(pd_mapping)
df1.head()

Unnamed: 0,Dates,Category,DayOfWeek,PdDistrict,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,4,3,OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,4,3,OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,4,3,VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,4,3,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,4,9,100 Block of BRODERICK ST,-122.438738,37.771541


Great, now we will do the same thing for the test dataset to keep the two consistent. We will also drop the `Address` column, since the X and Y coordinates already capture the location.

In [12]:
# use the mapping elements for the TEST dataset:
df2['DayOfWeek'] = df2['DayOfWeek'].map(days_mapping)
df2['PdDistrict'] = df2["PdDistrict"].map(pd_mapping)
del df1['Address']
del df2['Address']
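The integer codes used above imply an ordering (for example, `RICHMOND` = 10 looks "larger" than `SOUTHERN` = 1), which distance-based and linear models may read as a real magnitude. One common alternative is one-hot encoding. The sketch below shows it on a toy column; the `Pd` prefix and the frame `toy` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

toy = pd.DataFrame({"PdDistrict": ["NORTHERN", "PARK", "NORTHERN", "BAYVIEW"]})

# One 0/1 indicator column per district instead of a single integer code
one_hot = pd.get_dummies(toy["PdDistrict"], prefix="Pd", dtype=int)

print(one_hot)
```

Whether this helps depends on the model: tree-based models are usually less sensitive to the arbitrary ordering than KNN or logistic regression.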

In [13]:
df1.head()

Unnamed: 0,Dates,Category,DayOfWeek,PdDistrict,X,Y
0,2015-05-13 23:53:00,WARRANTS,4,3,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,4,3,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,4,3,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,4,3,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,4,9,-122.438738,37.771541


In [14]:
# check out the sample submission:
ss = pd.read_csv("sampleSubmission.csv")
ss.head()

Unnamed: 0,Id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,...,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


It looks like the submission needs one column per crime category, and the columns __have to be sorted alphabetically.__ We'll build that using Nitin Vijay's [method](https://www.kaggle.com/nitinvijay23/predict-the-crime-category-knn-logistic/comments), after we apply our fitted model to the test dataset.

Now that the categorical columns are encoded, let's derive a time-of-day feature from `Dates` before we start our model selection process.

In [15]:
z = df1.copy()

import datetime

# convert each date string into a datetime object
z['DateTime'] = [datetime.datetime.strptime(d, "%Y-%m-%d %H:%M:%S") for d in df["Dates"]]

# extract the time of day from each datetime
z['Time'] = [d.time() for d in z['DateTime']]

In [16]:
z.head()

Unnamed: 0,Dates,Category,DayOfWeek,PdDistrict,X,Y,DateTime,Time
0,2015-05-13 23:53:00,WARRANTS,4,3,-122.425892,37.774599,2015-05-13 23:53:00,23:53:00
1,2015-05-13 23:53:00,OTHER OFFENSES,4,3,-122.425892,37.774599,2015-05-13 23:53:00,23:53:00
2,2015-05-13 23:33:00,OTHER OFFENSES,4,3,-122.424363,37.800414,2015-05-13 23:33:00,23:33:00
3,2015-05-13 23:30:00,LARCENY/THEFT,4,3,-122.426995,37.800873,2015-05-13 23:30:00,23:30:00
4,2015-05-13 23:30:00,LARCENY/THEFT,4,9,-122.438738,37.771541,2015-05-13 23:30:00,23:30:00


In [17]:
from datetime import timedelta

In [18]:
train1 = z.copy()
z3 = []
for i in range(len(train1)):
    x3 = train1.Time[i]
    c3 = [int(datetime.timedelta(hours=x3.hour, minutes = x3.minute, 
                             seconds=x3.second).total_seconds())]
    z3.append(c3)
    
train1['Secs'] = pd.DataFrame(z3)
train1.head()

Unnamed: 0,Dates,Category,DayOfWeek,PdDistrict,X,Y,DateTime,Time,Secs
0,2015-05-13 23:53:00,WARRANTS,4,3,-122.425892,37.774599,2015-05-13 23:53:00,23:53:00,85980
1,2015-05-13 23:53:00,OTHER OFFENSES,4,3,-122.425892,37.774599,2015-05-13 23:53:00,23:53:00,85980
2,2015-05-13 23:33:00,OTHER OFFENSES,4,3,-122.424363,37.800414,2015-05-13 23:33:00,23:33:00,84780
3,2015-05-13 23:30:00,LARCENY/THEFT,4,3,-122.426995,37.800873,2015-05-13 23:30:00,23:30:00,84600
4,2015-05-13 23:30:00,LARCENY/THEFT,4,9,-122.438738,37.771541,2015-05-13 23:30:00,23:30:00,84600
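The row-by-row loop above can be slow on 878,049 rows. A vectorized sketch of the same seconds-since-midnight calculation, using `pd.to_datetime` and the `.dt` accessor, is shown below on a toy frame. The third timestamp is made up to cover a time just after midnight.

```python
import pandas as pd

toy = pd.DataFrame({
    "Dates": ["2015-05-13 23:53:00", "2015-05-13 23:33:00", "2015-05-13 00:01:00"]
})

# Parse the whole column at once instead of one string at a time
ts = pd.to_datetime(toy["Dates"], format="%Y-%m-%d %H:%M:%S")

# Seconds elapsed since midnight, computed column-wise
toy["Secs"] = ts.dt.hour * 3600 + ts.dt.minute * 60 + ts.dt.second

print(toy)
```

The first two values agree with the `Secs` column produced by the loop (85980 and 84780).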


In [19]:
train2 = train1[['Secs','Category','DayOfWeek','PdDistrict','X','Y']].copy()  # copy so later column assignments don't touch a slice of train1
train2.head()

Unnamed: 0,Secs,Category,DayOfWeek,PdDistrict,X,Y
0,85980,WARRANTS,4,3,-122.425892,37.774599
1,85980,OTHER OFFENSES,4,3,-122.425892,37.774599
2,84780,OTHER OFFENSES,4,3,-122.424363,37.800414
3,84600,LARCENY/THEFT,4,3,-122.426995,37.800873
4,84600,LARCENY/THEFT,4,9,-122.438738,37.771541


In [20]:
target = train2["Category"].unique()
data_dict = {}
count = 1
for data in target:
    data_dict[data] = count
    count+=1
train2["Category"] = train2["Category"].replace(data_dict)
train2.head()

Unnamed: 0,Secs,Category,DayOfWeek,PdDistrict,X,Y
0,85980,1,4,3,-122.425892,37.774599
1,85980,2,4,3,-122.425892,37.774599
2,84780,2,4,3,-122.424363,37.800414
3,84600,3,4,3,-122.426995,37.800873
4,84600,3,4,9,-122.438738,37.771541
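The counting loop above numbers each category in order of first appearance. As a sketch of a shorter route to the same codes, `pd.factorize` does this in one call. The lookup `label_of` (code to label) is a hypothetical helper for decoding predictions later, not something defined in the original notebook.

```python
import pandas as pd

cats = pd.Series(["WARRANTS", "OTHER OFFENSES", "OTHER OFFENSES",
                  "LARCENY/THEFT", "LARCENY/THEFT"])

# factorize returns 0-based codes in order of first appearance, plus the unique labels
codes, uniques = pd.factorize(cats)

# Shift by one so the codes start at 1, like the loop above
encoded = codes + 1

# Hypothetical reverse lookup: integer code -> category label
label_of = dict(enumerate(uniques, start=1))

print(encoded.tolist())
print(label_of)
```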


In [21]:
z2 = df2.copy()
# convert each date string into a datetime object
z2['DateTime'] = [datetime.datetime.strptime(d, "%Y-%m-%d %H:%M:%S") for d in df2["Dates"]]

# extract the time of day from each datetime
z2['Time'] = [d.time() for d in z2['DateTime']]

test1 = z2.copy()
z5 = []
for i in range(len(test1)):
    x5 = test1.Time[i]
    c5 = [int(datetime.timedelta(hours=x5.hour, minutes = x5.minute, 
                             seconds=x5.second).total_seconds())]
    z5.append(c5)
    
test1['Secs'] = pd.DataFrame(z5)
test1.head()

Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,X,Y,DateTime,Time,Secs
0,0,2015-05-10 23:59:00,1,4,-122.399588,37.735051,2015-05-10 23:59:00,23:59:00,86340
1,1,2015-05-10 23:51:00,1,4,-122.391523,37.732432,2015-05-10 23:51:00,23:51:00,85860
2,2,2015-05-10 23:50:00,1,3,-122.426002,37.792212,2015-05-10 23:50:00,23:50:00,85800
3,3,2015-05-10 23:45:00,1,7,-122.437394,37.721412,2015-05-10 23:45:00,23:45:00,85500
4,4,2015-05-10 23:45:00,1,7,-122.437394,37.721412,2015-05-10 23:45:00,23:45:00,85500


In [22]:
test2 = test1[['Secs','DayOfWeek','PdDistrict','X','Y']]
test2.head()

Unnamed: 0,Secs,DayOfWeek,PdDistrict,X,Y
0,86340,1,4,-122.399588,37.735051
1,85860,1,4,-122.391523,37.732432
2,85800,1,3,-122.426002,37.792212
3,85500,1,7,-122.437394,37.721412
4,85500,1,7,-122.437394,37.721412


In [23]:
focus = test2.copy()
a=1
b=2
c=3
d=4
e=5
f=6
g=7
h=8
timequadrant = []

for i in range(len(test2)):
    if 0<=focus.Secs[i]<(3*60*60):
        timequadrant.append(int(a))
    elif (3*3600)<=focus.Secs[i]<(6*3600):
        timequadrant.append(int(b))
    elif (6*3600)<=focus.Secs[i]<(9*3600):
        timequadrant.append(int(c))
    elif (9*3600)<=focus.Secs[i]<(12*3600):
        timequadrant.append(int(d))
    elif (12*3600)<=focus.Secs[i]<(15*3600):
        timequadrant.append(int(e))
    elif (15*3600)<=focus.Secs[i]<(18*3600):
        timequadrant.append(int(f))
    elif (18*3600)<=focus.Secs[i]<(21*3600):
        timequadrant.append(int(g))
    elif (21*3600)<=focus.Secs[i]<(24*3600):
        timequadrant.append(int(h))


focus["TimeQuad"] = pd.DataFrame(timequadrant)        
test3 = focus.copy()


In [24]:
test3.head()

Unnamed: 0,Secs,DayOfWeek,PdDistrict,X,Y,TimeQuad
0,86340,1,4,-122.399588,37.735051,8
1,85860,1,4,-122.391523,37.732432,8
2,85800,1,3,-122.426002,37.792212,8
3,85500,1,7,-122.437394,37.721412,8
4,85500,1,7,-122.437394,37.721412,8


In [25]:
### Do the same for training dataset:
focus1 = train2.copy()
timequadrant1 = []
for i in range(len(train2)):
    if 0<=focus1.Secs[i]<(3*60*60):
        timequadrant1.append(int(a))
    elif (3*3600)<=focus1.Secs[i]<(6*3600):
        timequadrant1.append(int(b))
    elif (6*3600)<=focus1.Secs[i]<(9*3600):
        timequadrant1.append(int(c))
    elif (9*3600)<=focus1.Secs[i]<(12*3600):
        timequadrant1.append(int(d))
    elif (12*3600)<=focus1.Secs[i]<(15*3600):
        timequadrant1.append(int(e))
    elif (15*3600)<=focus1.Secs[i]<(18*3600):
        timequadrant1.append(int(f))
    elif (18*3600)<=focus1.Secs[i]<(21*3600):
        timequadrant1.append(int(g))
    elif (21*3600)<=focus1.Secs[i]<(24*3600):
        timequadrant1.append(int(h))


focus1["TimeQuad"] = pd.DataFrame(timequadrant1)        
train3 = focus1.copy()
train3.head()

Unnamed: 0,Secs,Category,DayOfWeek,PdDistrict,X,Y,TimeQuad
0,85980,1,4,3,-122.425892,37.774599,8
1,85980,2,4,3,-122.425892,37.774599,8
2,84780,2,4,3,-122.424363,37.800414,8
3,84600,3,4,3,-122.426995,37.800873,8
4,84600,3,4,9,-122.438738,37.771541,8
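Since the eight bins above are all exactly three hours wide, the `if`/`elif` chain can be expressed as a single integer division. The sketch below applies it to a few made-up values of `Secs`, including the bin boundaries.

```python
import pandas as pd

# Made-up seconds-since-midnight values, including bin edges
secs = pd.Series([0, 10799, 10800, 43200, 85980, 86399])

# Each bin spans 3 hours (10800 seconds); adding 1 makes the bins run from 1 to 8
time_quad = secs // (3 * 3600) + 1

print(time_quad.tolist())
```

Applied to the `Secs` column of either dataset, this gives the same `TimeQuad` values as the loop without iterating over rows.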


In [26]:
train_final = train3[['TimeQuad','Category','DayOfWeek','PdDistrict','X','Y']]
test_final = test3[["TimeQuad","DayOfWeek","PdDistrict","X","Y"]]

train_final.head(), test_final.head()

(   TimeQuad  Category  DayOfWeek  PdDistrict           X          Y
 0         8         1          4           3 -122.425892  37.774599
 1         8         2          4           3 -122.425892  37.774599
 2         8         2          4           3 -122.424363  37.800414
 3         8         3          4           3 -122.426995  37.800873
 4         8         3          4           9 -122.438738  37.771541,
    TimeQuad  DayOfWeek  PdDistrict           X          Y
 0         8          1           4 -122.399588  37.735051
 1         8          1           4 -122.391523  37.732432
 2         8          1           3 -122.426002  37.792212
 3         8          1           7 -122.437394  37.721412
 4         8          1           7 -122.437394  37.721412)

In [27]:
train_final.isnull().sum()

TimeQuad      0
Category      0
DayOfWeek     0
PdDistrict    0
X             0
Y             0
dtype: int64

The final training and testing datasets are `train_final` and `test_final`. Next, we split the training data into features `X` and target `y`, hold out part of it for validation, and keep the test data for generating our submission file.

# 3. Model Selection:

In [28]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X = train_final.drop(['Category'],axis=1)
y = train_final['Category']


X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.3,random_state=0)

In [29]:
# for mapping the "Category" values back to the type of crime:

from collections import OrderedDict
data_dict_new = OrderedDict(sorted(data_dict.items()))
print(data_dict_new)

OrderedDict([('ARSON', 26), ('ASSAULT', 8), ('BAD CHECKS', 36), ('BRIBERY', 29), ('BURGLARY', 10), ('DISORDERLY CONDUCT', 25), ('DRIVING UNDER THE INFLUENCE', 22), ('DRUG/NARCOTIC', 14), ('DRUNKENNESS', 12), ('EMBEZZLEMENT', 30), ('EXTORTION', 34), ('FAMILY OFFENSES', 27), ('FORGERY/COUNTERFEITING', 13), ('FRAUD', 19), ('GAMBLING', 35), ('KIDNAPPING', 20), ('LARCENY/THEFT', 3), ('LIQUOR LAWS', 28), ('LOITERING', 32), ('MISSING PERSON', 18), ('NON-CRIMINAL', 6), ('OTHER OFFENSES', 2), ('PORNOGRAPHY/OBSCENE MAT', 39), ('PROSTITUTION', 24), ('RECOVERED VEHICLE', 38), ('ROBBERY', 7), ('RUNAWAY', 21), ('SECONDARY CODES', 16), ('SEX OFFENSES FORCIBLE', 23), ('SEX OFFENSES NON FORCIBLE', 33), ('STOLEN PROPERTY', 15), ('SUICIDE', 31), ('SUSPICIOUS OCC', 11), ('TREA', 37), ('TRESPASS', 17), ('VANDALISM', 5), ('VEHICLE THEFT', 4), ('WARRANTS', 1), ('WEAPON LAWS', 9)])


#### 1. KNN

In [30]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train,y_train)
y_pred1 = knn.predict(X_val)

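# note: scikit-learn's signature is f1_score(y_true, y_pred); here the two are passed in the opposite order
f1_knn = f1_score(y_pred1, y_val, average='weighted')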
print("F1 score: ",f1_knn)

F1 score:  0.24479851003473196
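A caveat on the score: scikit-learn documents the signature as `f1_score(y_true, y_pred, ...)`, and with `average='weighted'` each class's F1 is weighted by its number of true instances, so the argument order matters. The sketch below is a small hand-rolled version of that definition (not scikit-learn's implementation) on made-up labels, to show that swapping the arguments gives a different number.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1, with support counted on y_true."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        # True positives and predicted count for this class
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += n_true * f1
    return total / len(y_true)

# Made-up labels for illustration
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]

correct_order = weighted_f1(y_true, y_pred)
swapped_order = weighted_f1(y_pred, y_true)
print(correct_order, swapped_order)
```

The per-class F1 values are the same either way; only the weights change, because the class counts are taken from whichever list is passed first.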


In [31]:
# get the predictions:
predictions = knn.predict(test_final)

# print the result as a csv file in the correct format:
result_dataframe = pd.DataFrame({"Id": test1["Id"]})

for key,value in data_dict_new.items():
    result_dataframe[key] = 0
count = 0
for item in predictions:
    for key,value in data_dict.items():
        if(value == item):
            result_dataframe[key][count] = 1
    count+=1

result_dataframe.to_csv("submission_knn.csv", index=False)

In [32]:
result_dataframe.head()

Unnamed: 0,Id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,...,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
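The nested loop above writes one cell at a time through chained indexing (`result_dataframe[key][count] = 1`), which is slow on a large test set and may not update the frame in newer pandas versions. Below is a minimal vectorized sketch of the same idea. The inputs `label_of`, `predictions` and `ids` are hypothetical stand-ins for the notebook's category mapping, model output and `Id` column.

```python
import pandas as pd

# Hypothetical inputs: code -> label lookup, predicted codes, and row ids
label_of = {1: "WARRANTS", 2: "OTHER OFFENSES", 3: "LARCENY/THEFT"}
predictions = [3, 1, 3, 2]
ids = [0, 1, 2, 3]

# Alphabetical column order, as the sample submission requires
all_labels = sorted(label_of.values())

# Translate predicted codes back to category names
pred_labels = pd.Series(predictions).map(label_of)

# One 0/1 column per category; reindex keeps categories that were never predicted
one_hot = (pd.get_dummies(pred_labels, dtype=int)
             .reindex(columns=all_labels, fill_value=0))
one_hot.insert(0, "Id", ids)

print(one_hot)
```

Calling `one_hot.to_csv(..., index=False)` on the result would then write a file in the same layout as the sample submission.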


#### 2. Logistic Regression:

In [36]:
from sklearn.linear_model import LogisticRegression
lgr = LogisticRegression(random_state=0)
lgr.fit(X_train, y_train)
y_pred2 = lgr.predict(X_val)

f1_lgr = f1_score(y_pred2, y_val, average='weighted')
print("F1 score: ",f1_lgr)

predictions = lgr.predict(test_final)

#print(type(predictions))
result_dataframe = pd.DataFrame({
    "Id": test1["Id"]
})
for key,value in data_dict_new.items():
    result_dataframe[key] = 0
count = 0
for item in predictions:
    for key,value in data_dict.items():
        if(value == item):
            result_dataframe[key][count] = 1
    count+=1

result_dataframe.to_csv("submission_logistic.csv", index=False)

F1 score:  0.3081883460778926


#### 3. Gradient Boosting Classifier:

In [37]:
from sklearn.ensemble import GradientBoostingClassifier
gbc= GradientBoostingClassifier(random_state=0)
gbc.fit(X_train,y_train)
y_pred3=gbc.predict(X_val)

f1_gbc = f1_score(y_pred3, y_val, average='weighted')
print("F1 Score: ", f1_gbc)

predictions = gbc.predict(test_final)

#print(type(predictions))
result_dataframe = pd.DataFrame({
    "Id": test1["Id"]})
for key,value in data_dict_new.items():
    result_dataframe[key] = 0
count = 0
for item in predictions:
    for key,value in data_dict.items():
        if(value == item):
            result_dataframe[key][count] = 1
    count+=1

result_dataframe.to_csv("submission_gradient.csv", index=False)

F1 Score:  0.3382265552338507


#### 4. XGBoost Classifier:

In [38]:
from xgboost import XGBClassifier
xgbc = XGBClassifier(random_state=0)
xgbc.fit(X_train,y_train)
y_pred4 = xgbc.predict(X_val)

f1_xgbc = f1_score(y_pred4,y_val, average='weighted')
print("F1 score: ", f1_xgbc)

predictions = xgbc.predict(test_final)

#print(type(predictions))
result_dataframe = pd.DataFrame({
    "Id": test1["Id"]})
for key,value in data_dict_new.items():
    result_dataframe[key] = 0
count = 0
for item in predictions:
    for key,value in data_dict.items():
        if(value == item):
            result_dataframe[key][count] = 1
    count+=1

result_dataframe.to_csv("submission_xgradient.csv", index=False)

F1 score:  0.34020385167943157
