# MIDS W207 Fall 2017 Final Project¶
## Data Set Up - Data Cleaning and Feature Engineering
Laura Williams, Kim Vignola, Cyprian Gascoigne  
SF Crime Classification

This notebook reads raw data (saved in a zip file) from Kaggle, processes and organizes the data for training a variety of machine learning models, and outputs the data as zipped csv files that other notebooks can unzip and use to train different models.

The intention is that data cleaning and/or feature engineering will be added to this file as we progress through the project and look for additional way to process the data to improve our predictions.

For ease of processing this data, exploratory data analysis will be done separately.

Resulting zipped files will include: 

1) train_data.csv and train_labels.csv - includes 80% of the total training data, for training models that are not yet going to be submitted to Kaggle

2) dev_data.csv and dev_labels.csv - includes 20% of the total training data, for testing models before they are submitted to Kaggle

3) train_data_all.csv and train_labels_all.csv - includes all the training data. After testing models with the train and dev data split above, train the model from this full set of data for submission to Kaggle.

4) test_data_all.csv - create predictions on this data for submission to Kaggle.

Weather data for San Franscisco County was added to this analysis.
Source: https://www.ncdc.noaa.gov/cdo-web/search
Report: Daily Summaries, Date Range: 1/1/2003 - 12/31/2015, Search for: Counties/San Francisco, Station: SAN FRANCISCO DOWNTOWN, CA US, Metrics: Precipitation, Maximum Temperature, Minimum Temperature.


#### NOTE: holidays package is not native with anaconda and may need to be installed

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
import zipfile
from datetime import datetime, timedelta, date
import holidays
from sklearn.metrics.pairwise import pairwise_distances

In [2]:
# Unzip raw data into a subdirectory 
unzip_files = zipfile.ZipFile("raw_data.zip", "r")
unzip_files.extractall("raw_data")
unzip_files.close()

In [7]:
# Read CSV files into pandas dataframes
train = pd.read_csv("raw_data/train.csv")
test = pd.read_csv("raw_data/test.csv")
weather = pd.read_csv("raw_data/SF_county.csv")

In [50]:
# Raw data shape and features
print(train.columns.values)
print(train.shape)
print(test.columns.values)
print(test.shape)
print(weather.columns.values)
print(weather.shape)

['Dates' 'Category' 'Descript' 'DayOfWeek' 'PdDistrict' 'Resolution'
 'Address' 'X' 'Y' 'month' 'year' 'hour' 'day' 'holidays' 'first_day'
 'month_year' 'doy' 'spring' 'summer' 'fall' 'winter' 'dayparts']
(878049, 22)
['Id' 'Dates' 'DayOfWeek' 'PdDistrict' 'Address' 'X' 'Y' 'month' 'year'
 'hour' 'day' 'holidays' 'first_day' 'month_year' 'doy' 'spring' 'summer'
 'fall' 'winter' 'dayparts']
(884262, 20)
['DATE' 'PRCP' 'SNOW' 'TMAX' 'TMIN']
(4748, 5)


In [9]:
# extract month, year and hour from both datasets
train["month"] = train["Dates"].map(lambda x: datetime.strptime(x,"%Y-%m-%d %H:%M:%S").month)
train["year"] = train["Dates"].map(lambda x: datetime.strptime(x,"%Y-%m-%d %H:%M:%S").year)
train["hour"] = train["Dates"].map(lambda x: datetime.strptime(x,"%Y-%m-%d %H:%M:%S").hour)
train["day"] = train["Dates"].map(lambda x: datetime.strptime(x,"%Y-%m-%d %H:%M:%S").day)

test["month"] = test["Dates"].map(lambda x: datetime.strptime(x,"%Y-%m-%d %H:%M:%S").month)
test["year"] = test["Dates"].map(lambda x: datetime.strptime(x,"%Y-%m-%d %H:%M:%S").year)
test["hour"] = test["Dates"].map(lambda x: datetime.strptime(x,"%Y-%m-%d %H:%M:%S").hour)
test["day"] = test["Dates"].map(lambda x: datetime.strptime(x,"%Y-%m-%d %H:%M:%S").day)

# map holidays
US_Holidays = holidays.UnitedStates()
train["holidays"] = train["Dates"].map(lambda x: x in US_Holidays)
test["holidays"] = test["Dates"].map(lambda x: x in US_Holidays)


In [11]:
# Pull out first day of the month as it ranks first for crime volume and specific crimes may be associated with the day
train["first_day"] = [1 if x==1 else 0 for x in train["day"]]
test["first_day"] = [1 if x==1 else 0 for x in test["day"]]

In [13]:
# create a bucket variable for month_year

train["month_year"] = train["Dates"].map(lambda x: datetime.strptime(x,"%Y-%m-%d %H:%M:%S"))
train["month_year"] = train["month_year"].map(lambda x: datetime.strftime(x,"%Y-%m"))

test["month_year"] = test["Dates"].map(lambda x: datetime.strptime(x,"%Y-%m-%d %H:%M:%S"))
test["month_year"] = test["month_year"].map(lambda x: datetime.strftime(x,"%Y-%m"))

# would month_day have any value?
#train["month_day"] = train["Dates"].map(lambda x: datetime.strptime(x,"%Y-%m-%d %H:%M:%S"))
#train["month_day"] = train["month_day"].map(lambda x: datetime.strftime(x,"%m-%d"))
#test["month_day"] = test["Dates"].map(lambda x: datetime.strptime(x,"%Y-%m-%d %H:%M:%S"))
#test["month_day"] = test["month_day"].map(lambda x: datetime.strftime(x,"%m-%d"))

In [15]:
# parse out day of year for bucketing seasons

train["doy"] = train["Dates"].map(lambda x: datetime.strptime(x,"%Y-%m-%d %H:%M:%S").timetuple().tm_yday)
test["doy"] = test["Dates"].map(lambda x: datetime.strptime(x,"%Y-%m-%d %H:%M:%S").timetuple().tm_yday)

train["spring"] = [1 if x in range(81,173) else 0 for x in train["doy"]]
train["summer"] = [1 if x in range(173,265) else 0 for x in train["doy"]]
train["fall"] = [1 if x in range(265,356) else 0 for x in train["doy"]]
train["winter"] = [1 if x in range(1,81) or x in range(356,366) else 0 for x in train["doy"]]

test["spring"] = [1 if x in range(81,173) else 0 for x in test["doy"]]
test["summer"] = [1 if x in range(173,265) else 0 for x in test["doy"]]
test["fall"] = [1 if x in range(265,356) else 0 for x in test["doy"]]
test["winter"] = [1 if x in range(1,81) or x in range(356,366) else 0 for x in test["doy"]]

In [17]:
# create a dictionary for bucketing hours
time_periods = {6:"early_morning", 7:"early_morning", 8:"early_morning", 
               9:"late_morning", 10:"late_morning", 11:"late_morning",
              12:"early_afternoon", 13:"early_afternoon", 14:"early_afternoon",
              15:"late_afternoon", 16:"late_afternoon", 17:"late_afternoon",
              18:"early_evening",  19:"early_evening",  20:"early_evening",
              21:"late_evening", 22:"late_evening", 23:"late_evening",
              0:"late_night", 1:"late_night", 2:"late_night",
              3:"late_night", 4:"late_night", 5:"late_night"}

# map time periods to dayparts
train["dayparts"] = train["hour"].map(time_periods)
test["dayparts"] = test["hour"].map(time_periods)

In [24]:
# clean up weather data
del weather['NAME']
weather["SNOW"] = weather["SNOW"].fillna(0)

In [26]:
# drop time from train and test date fields to be able to map Dates against weather data; remove hyphens too.
train["Dates"] = train["Dates"].map(lambda x: datetime.strptime(x,"%Y-%m-%d %H:%M:%S"))
train["Dates"] = train["Dates"].map(lambda x: datetime.strftime(x,"%Y%m%d"))
test["Dates"] = test["Dates"].map(lambda x: datetime.strptime(x,"%Y-%m-%d %H:%M:%S"))
test["Dates"] = test["Dates"].map(lambda x: datetime.strftime(x,"%Y%m%d"))

# Convert Weather DATE to same format as train and test data
weather["DATE"] = weather["DATE"].map(lambda x: datetime.strptime(x,"%m/%d/%y"))
weather["DATE"] = weather["DATE"].map(lambda x: datetime.strftime(x,"%Y%m%d"))

In [28]:
# convert date objects to numeric
train["Dates"] = pd.to_numeric(train["Dates"])
test["Dates"] = pd.to_numeric(test["Dates"])
weather["DATE"] = pd.to_numeric(weather["DATE"])
print(type(train["Dates"][0]))

<class 'numpy.int64'>


In [29]:
# left merge weather data based on dates
weather_train = pd.merge(train, weather, how='left', left_on="Dates", right_on = "DATE")
del weather_train['DATE']
del weather_train["SNOW"]
weather_test = pd.merge(test, weather, how='left', left_on="Dates", right_on = "DATE")
del weather_test['DATE']
del weather_test["SNOW"]

In [30]:
print(weather_train.columns.values)
print(weather_train.shape)
print(weather_test.columns.values)
print(weather_test.shape)

['Dates' 'Category' 'Descript' 'DayOfWeek' 'PdDistrict' 'Resolution'
 'Address' 'X' 'Y' 'month' 'year' 'hour' 'day' 'holidays' 'first_day'
 'month_year' 'doy' 'spring' 'summer' 'fall' 'winter' 'dayparts' 'PRCP'
 'TMAX' 'TMIN']
(878049, 25)
['Id' 'Dates' 'DayOfWeek' 'PdDistrict' 'Address' 'X' 'Y' 'month' 'year'
 'hour' 'day' 'holidays' 'first_day' 'month_year' 'doy' 'spring' 'summer'
 'fall' 'winter' 'dayparts' 'PRCP' 'TMAX' 'TMIN']
(884262, 23)


Next, fix outliers

In [31]:
# Data indicates outliers with latitude = 90. Test data has these same outliers.
# Set latitiude to the median of the district where the crime occured.
districts = set(weather_train["PdDistrict"])
medians = {el:0 for el in districts}
for district in districts:
    medians[district] = weather_train["Y"][weather_train["PdDistrict"] == district].median()
weather_train.loc[weather_train.Y > 38, "Y"] = weather_train[weather_train.Y > 38]["PdDistrict"].map(lambda x: 
                                                                                                     medians[x])
weather_test.loc[weather_test.Y > 38, "Y"] = weather_test[weather_test.Y > 38]["PdDistrict"].map(lambda x : 
                                                                                                 medians[x])

In [14]:
#print new shape
print(weather_train.shape)
print(weather_test.shape)

print("Cases removed from train data =", np.sum(878049 - weather_train.shape[0]))
print("Cases removed from test data =", np.sum(884262 - weather_test.shape[0]))
print("Cases fixed in the train data =", len(train[train.Y>38]))
print("Cases fixed in the test data =", len(test[test.Y > 38]))

(878049, 25)
(884262, 23)
Cases removed from train data = 0
Cases removed from test data = 0
Cases fixed in the train data = 67
Cases fixed in the test data = 76


In [33]:
#Add additional positional arguments. First one is distance to nearest police station. 

from math import sqrt

#We use Euclidean distance since curvature of the earth doesn't matter for small distances.
def dist(p1, p2):
    return sqrt((p2[1]-p1[0])**2 + (p2[0]-p1[1])**2)

PStations = {'Central Station':  (37.798732, -122.409919), 'Southern Station': (37.772380,-122.389412), 
             'Bayview':  (37.729732, -122.397981), 'Mission': (37.762849, -122.422005), 'Northern': (37.780190, -122.432445), 
             'Park': (37.767797, -122.455287), 'Richmond': (37.779928, -122.464467), 'Ingleside': (37.724676, -122.446215), 
             'Taraval': (37.782988, -122.483874), 'Tenderloin': (37.783669, -122.412896)}

def mindist(p1, l):
    return min([dist(p1,x) for x in l])

weather_train['d_police'] = weather_train.apply(lambda x : mindist((x.X, x.Y), PStations.values()), axis = 1)
weather_test['d_police'] = weather_train.apply(lambda x : mindist((x.X, x.Y), PStations.values()), axis = 1)

In [35]:
#rotational data for better spatial understanding
weather_train["rot_45_X"] = .707*weather_train["Y"] + .707*weather_train["X"]
weather_train["rot_45_Y"] = .707* weather_train["Y"] - .707* weather_train["X"]

weather_train["rot_30_X"] = (1.732/2)*weather_train["X"] + (1./2)*weather_train["Y"]
weather_train["rot_30_Y"] = (1.732/2)*weather_train["Y"] - (1./2)*weather_train["X"]

weather_train["rot_60_X"] = (1./2)*weather_train["X"] + (1.732/2)*weather_train["Y"]
weather_train["rot_60_Y"] = (1./2)*weather_train["Y"] - (1.732/2)*weather_train["X"]

weather_train["radial_r"] = np.sqrt(np.power(weather_train["Y"],2) + np.power(weather_train["X"],2))

#Test data
weather_test["rot_45_X"] = .707*weather_test["Y"] + .707*weather_test["X"]
weather_test["rot_45_Y"] = .707* weather_test["Y"] - .707* weather_test["X"]

weather_test["rot_30_X"] = (1.732/2)*weather_test["X"] + (1./2)*weather_test["Y"]
weather_test["rot_30_Y"] = (1.732/2)*weather_test["Y"] - (1./2)*weather_test["X"]

weather_test["rot_60_X"] = (1./2)*weather_test["X"] + (1.732/2)*weather_test["Y"]
weather_test["rot_60_Y"] = (1./2)*weather_test["Y"] - (1.732/2)*weather_test["X"]

weather_test["radial_r"] = np.sqrt(np.power(weather_test["Y"],2) + np.power(weather_test["X"],2))

In [47]:
print(train.columns)
print(weather_train.columns)
print(test.columns)
print(weather_test.columns)

Index(['Dates', 'Category', 'Descript', 'DayOfWeek', 'PdDistrict',
       'Resolution', 'Address', 'X', 'Y', 'month', 'year', 'hour', 'day',
       'holidays', 'first_day', 'month_year', 'doy', 'spring', 'summer',
       'fall', 'winter', 'dayparts'],
      dtype='object')
Index(['Dates', 'Category', 'Descript', 'DayOfWeek', 'PdDistrict',
       'Resolution', 'Address', 'X', 'Y', 'month', 'year', 'hour', 'day',
       'holidays', 'first_day', 'month_year', 'doy', 'spring', 'summer',
       'fall', 'winter', 'dayparts', 'PRCP', 'TMAX', 'TMIN', 'd_police',
       'rot_45_X', 'rot_45_Y', 'rot_30_X', 'rot_30_Y', 'rot_60_X', 'rot_60_Y',
       'radial_r'],
      dtype='object')
Index(['Id', 'Dates', 'DayOfWeek', 'PdDistrict', 'Address', 'X', 'Y', 'month',
       'year', 'hour', 'day', 'holidays', 'first_day', 'month_year', 'doy',
       'spring', 'summer', 'fall', 'winter', 'dayparts'],
      dtype='object')
Index(['Id', 'Dates', 'DayOfWeek', 'PdDistrict', 'Address', 'X', 'Y', 'month',
    

In [37]:
weather_train["holidays"] = weather_train["holidays"].astype(int)
weather_test["holidays"] = weather_test["holidays"].astype(int)
print(type(weather_train["holidays"][0]))
print(type(weather_test["holidays"][0]))


<class 'numpy.int64'>
<class 'numpy.int64'>


In [52]:
print(weather_train.columns.values)
print(weather_train.shape)
print(weather_test.columns.values)
print(weather_test.shape)
print(weather_train.dtypes)

['Dates' 'Category' 'Descript' 'DayOfWeek' 'PdDistrict' 'Resolution'
 'Address' 'X' 'Y' 'month' 'year' 'hour' 'day' 'holidays' 'first_day'
 'month_year' 'doy' 'spring' 'summer' 'fall' 'winter' 'dayparts' 'PRCP'
 'TMAX' 'TMIN' 'd_police' 'rot_45_X' 'rot_45_Y' 'rot_30_X' 'rot_30_Y'
 'rot_60_X' 'rot_60_Y' 'radial_r']
(878049, 33)
['Id' 'Dates' 'DayOfWeek' 'PdDistrict' 'Address' 'X' 'Y' 'month' 'year'
 'hour' 'day' 'holidays' 'first_day' 'month_year' 'doy' 'spring' 'summer'
 'fall' 'winter' 'dayparts' 'PRCP' 'TMAX' 'TMIN' 'd_police' 'rot_45_X'
 'rot_45_Y' 'rot_30_X' 'rot_30_Y' 'rot_60_X' 'rot_60_Y' 'radial_r']
(884262, 31)
Dates           int64
Category       object
Descript       object
DayOfWeek      object
PdDistrict     object
Resolution     object
Address        object
X             float64
Y             float64
month           int64
year            int64
hour            int64
day             int64
holidays        int64
first_day       int64
month_year     object
doy             int64

In [39]:
# Encode string features into numeric features

LE = preprocessing.LabelEncoder()
LE.fit(train_data_all['month_year'])
train_data_all['month_year'] = LE.transform(train_data_all['month_year'])
test_data_all['month_year'] = LE.transform(test_data_all['month_year'])


train_data_all = pd.get_dummies(weather_train, columns = ['DayOfWeek', 'PdDistrict',
       'month', 'year', 'dayparts'])
del train_data_all["Dates"]
del train_data_all["Descript"]
del train_data_all["Resolution"]
del train_data_all["day"]
del train_data_all["doy"]
del train_data_all["Address"]

train_labels_all = np.array(train_data_all['Category'])
del train_data_all["Category"]

train_data_all.reindex()

test_data_all = pd.get_dummies(weather_test, columns = ['DayOfWeek', 'PdDistrict',
       'month', 'year','dayparts'])

del test_data_all["Id"]
del test_data_all["day"]
del test_data_all["doy"]
del test_data_all["Dates"]
del test_data_all["Address"]
                                 
print(test_data_all.columns == train_data_all.columns)

[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True]


In [41]:
print(train_data_all.columns.values)
print(train_data_all.shape)
print(test_data_all.columns.values)
print(test_data_all.shape)
print(train_data_all.dtypes)


['X' 'Y' 'hour' 'holidays' 'first_day' 'month_year' 'spring' 'summer'
 'fall' 'winter' 'PRCP' 'TMAX' 'TMIN' 'd_police' 'rot_45_X' 'rot_45_Y'
 'rot_30_X' 'rot_30_Y' 'rot_60_X' 'rot_60_Y' 'radial_r' 'DayOfWeek_Friday'
 'DayOfWeek_Monday' 'DayOfWeek_Saturday' 'DayOfWeek_Sunday'
 'DayOfWeek_Thursday' 'DayOfWeek_Tuesday' 'DayOfWeek_Wednesday'
 'PdDistrict_BAYVIEW' 'PdDistrict_CENTRAL' 'PdDistrict_INGLESIDE'
 'PdDistrict_MISSION' 'PdDistrict_NORTHERN' 'PdDistrict_PARK'
 'PdDistrict_RICHMOND' 'PdDistrict_SOUTHERN' 'PdDistrict_TARAVAL'
 'PdDistrict_TENDERLOIN' 'month_1' 'month_2' 'month_3' 'month_4' 'month_5'
 'month_6' 'month_7' 'month_8' 'month_9' 'month_10' 'month_11' 'month_12'
 'year_2003' 'year_2004' 'year_2005' 'year_2006' 'year_2007' 'year_2008'
 'year_2009' 'year_2010' 'year_2011' 'year_2012' 'year_2013' 'year_2014'
 'year_2015' 'dayparts_early_afternoon' 'dayparts_early_evening'
 'dayparts_early_morning' 'dayparts_late_afternoon' 'dayparts_late_evening'
 'dayparts_late_morning' 'dayp

In [54]:
print(train_labels_all)
print(train_data_all.columns)

['WARRANTS' 'OTHER OFFENSES' 'OTHER OFFENSES' ..., 'LARCENY/THEFT'
 'VANDALISM' 'FORGERY/COUNTERFEITING']
Index(['X', 'Y', 'hour', 'holidays', 'first_day', 'month_year', 'spring',
       'summer', 'fall', 'winter', 'PRCP', 'TMAX', 'TMIN', 'd_police',
       'rot_45_X', 'rot_45_Y', 'rot_30_X', 'rot_30_Y', 'rot_60_X', 'rot_60_Y',
       'radial_r', 'DayOfWeek_Friday', 'DayOfWeek_Monday',
       'DayOfWeek_Saturday', 'DayOfWeek_Sunday', 'DayOfWeek_Thursday',
       'DayOfWeek_Tuesday', 'DayOfWeek_Wednesday', 'PdDistrict_BAYVIEW',
       'PdDistrict_CENTRAL', 'PdDistrict_INGLESIDE', 'PdDistrict_MISSION',
       'PdDistrict_NORTHERN', 'PdDistrict_PARK', 'PdDistrict_RICHMOND',
       'PdDistrict_SOUTHERN', 'PdDistrict_TARAVAL', 'PdDistrict_TENDERLOIN',
       'month_1', 'month_2', 'month_3', 'month_4', 'month_5', 'month_6',
       'month_7', 'month_8', 'month_9', 'month_10', 'month_11', 'month_12',
       'year_2003', 'year_2004', 'year_2005', 'year_2006', 'year_2007',
       'year_2008', 'y

In [61]:
print(type(train_labels_all))
print(train_labels_all.shape)

<class 'numpy.ndarray'>
(878049,)


In [48]:
# Normalization
train_data_all = (train_data_all - train_data_all.mean())/(train_data_all.std())
test_data_all = (test_data_all - train_data_all.mean())/(train_data_all.std())

In [49]:
print(test_data_all[0:5])

            X          Y  hour      holidays     first_day  month_year  \
0 -122.399588  37.735051  23.0  1.134103e-15  1.221126e-15       148.0   
1 -122.391523  37.732432  23.0  1.134103e-15  1.221126e-15       148.0   
2 -122.426002  37.792212  23.0  1.134103e-15  1.221126e-15       148.0   
3 -122.437394  37.721412  23.0  1.134103e-15  1.221126e-15       148.0   
4 -122.437394  37.721412  23.0  1.134103e-15  1.221126e-15       148.0   

   spring        summer          fall        winter         ...           \
0     1.0  7.105523e-14  5.976261e-14 -2.581381e-14         ...            
1     1.0  7.105523e-14  5.976261e-14 -2.581381e-14         ...            
2     1.0  7.105523e-14  5.976261e-14 -2.581381e-14         ...            
3     1.0  7.105523e-14  5.976261e-14 -2.581381e-14         ...            
4     1.0  7.105523e-14  5.976261e-14 -2.581381e-14         ...            

      year_2013     year_2014  year_2015  dayparts_early_afternoon  \
0  3.143028e-13 -4.016814e-1

In [53]:
# Shuffle data and set aside 20% as development data
train_data_all = train_data_all.values
test_data_all = test_data_all.values
n = train_data_all.shape[0]

np.random.seed(0)

shuffle = np.random.permutation(np.arange(train_data_all.shape[0]))

train_data_all = train_data_all[shuffle]
train_labels_all = train_labels_all[shuffle]

n_train = int(0.8*n)

train_data = train_data_all[:n_train,:]
train_labels = train_labels_all[:n_train]
dev_data = train_data_all[n_train:,:]
dev_labels = train_labels_all[n_train:]

The code below can run two types of Random Forest classifiers: 

The first model has basic sub-optimal parameters to quickly confirm that no errors are returned by running the data through the model.  

The second model is set with optimal parameters and a random seed to accurately compare the results from different changes to the dataset.

In [57]:
# Import packages for the Random Forest classifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn import metrics

In [62]:
# Basic model - to check for errors

# Set up variables
n = 10
depth = 2
features = 'sqrt'

# Train the model and create predictions
RF = RandomForestClassifier(n_estimators=n, max_depth=depth, max_features=features, n_jobs=1)
RF.fit(train_data, train_labels)
pp = RF.predict_proba(dev_data)
logloss = metrics.log_loss(dev_labels, pp)
print(logloss)

2.62062662827


In [65]:
# Optimal hyperparameters with random forest model
# Note that on the 70 feature data set this took about 20-25 minutes

# Also the optimal hyperparameters might change with different types and number of features

# Set random seed so that there will not be random changes when the model is run with different data sets.
np.random.seed(0)

# Set up variables
n = 150
depth = 16
features = 0.40

# Train the model and create predictions
RF = RandomForestClassifier(n_estimators=n, max_depth=depth, max_features=features, n_jobs=1)
RF.fit(train_data, train_labels)
pp = RF.predict_proba(dev_data)
logloss = metrics.log_loss(dev_labels, pp)
print(logloss)

2.3358727151


#### Here's a place to keep track of logloss results with the optimal hyperparameters from different data sets

**2.3358727151** with dataset as of 5pm Saturday 12/2, with 70 features  

In [54]:
# print shapes and some data to compare before and after csv conversion
print("train_data shape is", train_data.shape)
print("train_labels shape is", train_labels.shape)
print("dev_data shape is", dev_data.shape)
print("dev_labels shape is", dev_labels.shape)
print("train_data_all shape is", train_data_all.shape)
print("train_labels_all shape is", train_labels_all.shape)
print("test_data_all shape is", test_data_all.shape)

train_data shape is (702439, 70)
train_labels shape is (702439,)
dev_data shape is (175610, 70)
dev_labels shape is (175610,)
train_data_all shape is (878049, 70)
train_labels_all shape is (878049,)
test_data_all shape is (884262, 70)


In [55]:
# Save data as CSV files in a subdirectory

# NOTE: mkdir will make a "csv" directory in your local repo if there is not already one there.
# It will return an error if the directory already exists in your local repo
# but that will not impact how this code runs

! mkdir csv
np.savetxt("csv/train_data.csv", train_data, delimiter=",")
np.savetxt("csv/train_labels.csv", train_labels, fmt="%s", delimiter=",")
np.savetxt("csv/dev_data.csv", dev_data, delimiter=",")
np.savetxt("csv/dev_labels.csv", dev_labels, fmt="%s", delimiter=",")
np.savetxt("csv/train_data_all.csv", train_data_all, delimiter=",")
np.savetxt("csv/train_labels_all.csv", train_labels_all, fmt="%s", delimiter=",")
np.savetxt("csv/test_data_all.csv", test_data_all, delimiter=",")

mkdir: csv: File exists


In [56]:
# Zip up the CSV files

# **IMPORTANT**  This code will rewrite existing zip files in your local repo
# You will need to push it to the group repo for everyone to have the updated zip file

# Full set of training data and labels --> data.zip
zip_train_all = zipfile.ZipFile("data.zip", "w")
zip_train_all.write("csv/train_data_all.csv", compress_type=zipfile.ZIP_DEFLATED)
zip_train_all.write("csv/train_labels_all.csv", compress_type=zipfile.ZIP_DEFLATED)
zip_train_all.close()

# Subset of training data and labels --> data_subset.zip
zip_train_subset = zipfile.ZipFile("data_subset.zip", "w")
zip_train_subset.write("csv/train_data.csv", compress_type=zipfile.ZIP_DEFLATED)
zip_train_subset.write("csv/train_labels.csv", compress_type=zipfile.ZIP_DEFLATED)
zip_train_subset.close()


# Data used for testing models (test data from Kaggle and our 20% development data) --> testing.zip
zip_testing = zipfile.ZipFile("testing.zip", "w")
zip_testing.write("csv/test_data_all.csv", compress_type=zipfile.ZIP_DEFLATED)
zip_testing.write("csv/dev_data.csv", compress_type=zipfile.ZIP_DEFLATED)
zip_testing.write("csv/dev_labels.csv", compress_type=zipfile.ZIP_DEFLATED)
zip_testing.close()

