# MIDS W207 Fall 2017 Final Project
## Data Set Up - Data Cleaning and Feature Engineering
Laura Williams, Kim Vignola, Cyprian Gascoigne  
SF Crime Classification

This notebook reads raw data (saved in a zip file) from Kaggle, processes and organizes the data for training a variety of machine learning models, and outputs the data as zipped csv files that other notebooks can unzip and use to train different models.

We intend to add data cleaning and feature engineering steps to this notebook as the project progresses and we look for additional ways to process the data to improve our predictions.

For ease of processing this data, exploratory data analysis will be in a separate notebook.

Single zipped output file (called data.zip) includes:  

1) train_data.csv and train_labels.csv - includes 80% of the total training data, for training models that are not yet going to be submitted to Kaggle

2) dev_data.csv and dev_labels.csv - includes 20% of the total training data, for testing models before they are submitted to Kaggle

3) train_data_all.csv and train_labels_all.csv - includes all the training data. After testing a model with the train and dev split above, retrain it on this full set of data for submission to Kaggle.

4) test_data_all.csv - create predictions on this data for submission to Kaggle.
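
As a rough illustration of how a downstream notebook might consume this archive, the sketch below unzips a file and loads one split into NumPy arrays. The helper name `load_split` and the toy archive are hypothetical and exist only for this example; the real data.zip stores its members under a `csv/` prefix, which the sketch assumes. Labels are read as plain text lines because they are category strings rather than numbers.

```python
import os
import tempfile
import zipfile

import numpy as np


def load_split(zip_path, extract_dir, name):
    """Extract the archive and return (data, labels) for one split.

    `name` is a file prefix such as "train" or "dev". This helper is
    hypothetical and not part of the project notebooks.
    """
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(extract_dir)
    csv_dir = os.path.join(extract_dir, "csv")
    data = np.loadtxt(os.path.join(csv_dir, name + "_data.csv"), delimiter=",")
    # Labels are one category string per line, so read them as plain text.
    with open(os.path.join(csv_dir, name + "_labels.csv")) as f:
        labels = np.array([line.rstrip("\n") for line in f])
    return data, labels


# Demonstration on a tiny stand-in archive with made-up values.
work_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(work_dir, "csv"))
toy_data = np.array([[0.0, 1.0, -122.4, 37.7], [1.0, 3.0, -122.5, 37.8]])
toy_labels = np.array(["LARCENY/THEFT", "ASSAULT"])
np.savetxt(os.path.join(work_dir, "csv", "dev_data.csv"), toy_data, delimiter=",")
np.savetxt(os.path.join(work_dir, "csv", "dev_labels.csv"), toy_labels, fmt="%s")

toy_zip = os.path.join(work_dir, "toy_data.zip")
with zipfile.ZipFile(toy_zip, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for member in ("csv/dev_data.csv", "csv/dev_labels.csv"):
        archive.write(os.path.join(work_dir, member), arcname=member)

dev_data, dev_labels = load_split(toy_zip, os.path.join(work_dir, "unzipped"), "dev")
```

With the real archive, the call would presumably look like `load_split("data.zip", "data", "train")`, followed by a separate `np.loadtxt` call for test_data_all.csv, which has no label file.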

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
import zipfile
import os

In [2]:
# Unzip raw data into a subdirectory
with zipfile.ZipFile("raw_data.zip", "r") as unzip_files:
    unzip_files.extractall("raw_data")

In [4]:
# Read CSV files into pandas dataframes
train = pd.read_csv("raw_data/train.csv")
test = pd.read_csv("raw_data/test.csv")

In [5]:
# Encode string features into numeric features.
# Each encoder is fit on the train and test values together, so that a given
# string (for example an address) maps to the same integer in both sets.
string_features = ['Dates', 'DayOfWeek', 'PdDistrict', 'Address']
train_encoded, test_encoded = [], []

for feature in string_features:
    LE = preprocessing.LabelEncoder()
    LE.fit(pd.concat([train[feature], test[feature]]))
    train_encoded.append(LE.transform(train[feature]))
    test_encoded.append(LE.transform(test[feature]))

# Append the numeric longitude (X) and latitude (Y) columns unchanged
train_data_all = np.column_stack(train_encoded + [train['X'], train['Y']])

train_labels_all = np.array(train['Category'])

test_data_all = np.column_stack(test_encoded + [test['X'], test['Y']])

In [6]:
# Shuffle data and set aside 20% as development data
n = train_data_all.shape[0]

shuffle = np.random.permutation(n)

train_data_all = train_data_all[shuffle]
train_labels_all = train_labels_all[shuffle]

n_train = int(0.8*n)

train_data = train_data_all[:n_train,:]
train_labels = train_labels_all[:n_train]
dev_data = train_data_all[n_train:,:]
dev_labels = train_labels_all[n_train:]
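
The shuffle above draws a fresh permutation on every run, so the train/dev membership changes each time the notebook is executed. If a repeatable split is wanted, one option is a seeded random generator. The sketch below is only an illustration on toy arrays: the function name `split_train_dev`, its parameters, and the seed value are assumptions rather than part of the project code, and `np.random.default_rng` requires NumPy 1.17 or later.

```python
import numpy as np


def split_train_dev(data, labels, train_fraction=0.8, seed=0):
    """Shuffle rows with a seeded generator, then split off a dev set.

    Hypothetical helper: the same seed always yields the same split.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(data.shape[0])
    # Apply the same permutation to both arrays so rows and labels stay aligned.
    data, labels = data[order], labels[order]
    n_train = int(train_fraction * data.shape[0])
    return data[:n_train], labels[:n_train], data[n_train:], labels[n_train:]


# Toy data: 10 rows, where row i is [2*i, 2*i + 1] and its label is "label_i".
toy_data = np.arange(20).reshape(10, 2)
toy_labels = np.array(["label_%d" % i for i in range(10)])

first = split_train_dev(toy_data, toy_labels)
second = split_train_dev(toy_data, toy_labels)
```

Here `first` and `second` contain identical arrays because both calls use the same seed; passing a different `seed` would give a different but equally repeatable split.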

In [8]:
# Save arrays as CSV files in a subdirectory

# Create the output directory; exist_ok=True avoids an error if it already exists
os.makedirs("csv", exist_ok=True)
np.savetxt("csv/train_data.csv", train_data, delimiter=",")
np.savetxt("csv/train_labels.csv", train_labels, fmt="%s", delimiter=",")
np.savetxt("csv/dev_data.csv", dev_data, delimiter=",")
np.savetxt("csv/dev_labels.csv", dev_labels, fmt="%s", delimiter=",")
np.savetxt("csv/train_data_all.csv", train_data_all, delimiter=",")
np.savetxt("csv/train_labels_all.csv", train_labels_all, fmt="%s", delimiter=",")
np.savetxt("csv/test_data_all.csv", test_data_all, delimiter=",")

In [9]:
# Zip up the CSV files

# **IMPORTANT**  This code will overwrite the existing data.zip file in your local repo.
# You will need to push it to the group repo for everyone to have the updated zip file.

csv_files = ["train_data.csv", "train_labels.csv",
             "dev_data.csv", "dev_labels.csv",
             "train_data_all.csv", "train_labels_all.csv",
             "test_data_all.csv"]

# The with block closes the archive even if a write fails
with zipfile.ZipFile("data.zip", "w", compression=zipfile.ZIP_DEFLATED) as zip_files:
    for name in csv_files:
        zip_files.write("csv/" + name)
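
Before pushing a new data.zip to the group repo, it may be worth confirming that the archive is readable and contains all seven expected files. The sketch below shows one possible check; the helper name `missing_members` is hypothetical, and the demonstration runs against a deliberately incomplete toy archive rather than the real data.zip.

```python
import os
import tempfile
import zipfile

# Member names as they are stored in data.zip by the cell above.
EXPECTED_MEMBERS = [
    "csv/train_data.csv", "csv/train_labels.csv",
    "csv/dev_data.csv", "csv/dev_labels.csv",
    "csv/train_data_all.csv", "csv/train_labels_all.csv",
    "csv/test_data_all.csv",
]


def missing_members(zip_path, expected=EXPECTED_MEMBERS):
    """Return the expected member names that are absent from the archive.

    Hypothetical helper; raises BadZipFile if any stored member is corrupt.
    """
    with zipfile.ZipFile(zip_path, "r") as archive:
        # testzip() returns the name of the first bad member, or None.
        if archive.testzip() is not None:
            raise zipfile.BadZipFile("corrupt member in " + zip_path)
        present = set(archive.namelist())
    return [name for name in expected if name not in present]


# Demonstration: a toy archive that holds only two of the seven files.
toy_zip = os.path.join(tempfile.mkdtemp(), "toy_data.zip")
with zipfile.ZipFile(toy_zip, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("csv/train_data.csv", "0,1\n")
    archive.writestr("csv/train_labels.csv", "ASSAULT\n")

missing = missing_members(toy_zip)
```

For a complete archive, `missing_members("data.zip")` would return an empty list.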


