# Bagging codealong

We will be looking at a dataset of San Francisco crime incidents. Specifically, we will be predicting the **type of crime based on other information about the crime**.

I munged/cleaned this [from the full Kaggle dataset](https://www.kaggle.com/c/sf-crime).

---

### Crime dataset

There are 4 datasets in the data folder. The `*_mini_*` csv files are small samples, cut down for speed. We will be using these in the codealong. However, if you're interested, the `*_subset_*` csvs are larger samples you could use to test your models.

The `*_cats.csv` files contain 36 predictors.

    ../assets/datasets/sf_crime_mini_cats.csv
    ../assets/datasets/sf_crime_subset_cats.csv
    
The `*_adds.csv` files contain many more predictors. CountVectorizer was used to create columns based on the reported address of the crime, providing more granular, categorical location information.
    
    ../assets/datasets/sf_crime_mini_adds.csv
    ../assets/datasets/sf_crime_subset_adds.csv
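To give a feel for how address text can become columns, here is a minimal sketch using `CountVectorizer` on a few made-up address strings. The addresses and the `binary=True` setting are assumptions for illustration; the exact settings used to build the `adds` files are not documented here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up address strings (illustrative only, not rows from the dataset).
addresses = [
    "800 Block of BRYANT ST",
    "OAK ST / LAGUNA ST",
    "1500 Block of MISSION ST",
]

# binary=True records presence/absence of each token rather than a count.
vectorizer = CountVectorizer(binary=True)
address_matrix = vectorizer.fit_transform(addresses)

# One column per token; the sorted vocabulary gives the column order.
tokens = sorted(vectorizer.vocabulary_)
print(tokens)
print(address_matrix.toarray())
```

Each row is one address and each column is one token, so streets shared across incidents end up sharing a column.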
    
---

The columns (not including the NLP street columns in the `adds` csvs) are:

    crime_category
    crime
    DayOfWeek[Friday]
    DayOfWeek[Monday]
    DayOfWeek[Saturday]
    DayOfWeek[Sunday]
    DayOfWeek[Thursday]
    DayOfWeek[Tuesday]
    DayOfWeek[Wednesday]
    PdDistrict[T.CENTRAL]
    PdDistrict[T.INGLESIDE]
    PdDistrict[T.MISSION]
    PdDistrict[T.NORTHERN]
    PdDistrict[T.PARK]
    PdDistrict[T.RICHMOND]
    PdDistrict[T.SOUTHERN]
    PdDistrict[T.TARAVAL]
    PdDistrict[T.TENDERLOIN]
    month[T.August]
    month[T.December]
    month[T.February]
    month[T.January]
    month[T.July]
    month[T.June]
    month[T.March]
    month[T.May]
    month[T.November]
    month[T.October]
    month[T.September]
    time_of_day[T.early_morning]
    time_of_day[T.evening]
    time_of_day[T.late_morning]
    time_of_day[T.mid_day]
    time_of_day[T.mid_night]
    time_of_day[T.morning]
    time_of_day[T.night]
    longitude_centered
    latitude_centered
    
**NOTES**:
- **crime** is the string label of the crime committed
- **crime_category** is the numeric code associated with the crime (target classification variable)
- **time_of_day** categories are defined as:

```
    early_morning --> 2am to 5am
    morning       --> 5am to 8am
    late_morning  --> 8am to 11am
    mid_day       --> 11am to 2pm
    afternoon     --> 2pm to 5pm [reference category: part of the intercept]
    evening       --> 5pm to 8pm
    night         --> 8pm to 11pm
    mid_night     --> 11pm to 2am
```
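As a rough illustration of the binning above, a helper like the following could map an hour of the day to its category. The function name `time_of_day` and the boundary handling (start hour included, end hour excluded) are assumptions; the original cleaning code is not shown in this codealong.

```python
def time_of_day(hour):
    """Map an hour (0-23) to a time_of_day category.

    Assumes each bin includes its start hour and excludes its end hour.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    # (start_hour, label) pairs in ascending order.
    bins = [
        (2, "early_morning"),
        (5, "morning"),
        (8, "late_morning"),
        (11, "mid_day"),
        (14, "afternoon"),
        (17, "evening"),
        (20, "night"),
        (23, "mid_night"),
    ]
    label = "mid_night"  # covers midnight up to 2am
    for start, name in bins:
        if hour >= start:
            label = name
    return label


print(time_of_day(15))  # afternoon, the reference category
```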

---

### 1. Load and examine data

In [3]:
import pandas as pd
import numpy as np

In [7]:
sf = pd.read_csv('../assets/datasets/sf_crime_mini_cats.csv')

In [8]:
sf.crime.value_counts()

stolen_property                250
weapon_laws                    250
burglary                       250
suspicious_occ                 250
larceny_theft                  250
drug_narcotic                  250
sex_offenses_forcible          250
vandalism                      250
kidnapping                     250
missing_person                 250
forgery_counterfeiting         250
robbery                        250
arson                          250
vehicle_theft                  250
driving_under_the_influence    250
disorderly_conduct             250
warrants                       250
drunkenness                    250
fraud                          250
loitering                      250
prostitution                   250
assault                        250
trespass                       250
Name: crime, dtype: int64

---

### 2.1. Predict crime category with (multinomial!) logistic regression

In [14]:
# 4.3% chance that we can just randomly guess the correct prediction
baseline_acc = 1./len(sf.crime.unique())

print(baseline_acc)

0.0434782608696
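The 1/23 figure works as a chance rate because every class has exactly 250 rows. For a sample that is not balanced, a stricter baseline is the share of the most frequent class. A minimal sketch with made-up label lists (the `majority_baseline` helper is hypothetical, not part of the codealong):

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / float(len(labels))


# Balanced toy labels: the baseline equals 1 / number of classes.
balanced = ["arson", "fraud", "robbery", "assault"] * 5
print(majority_baseline(balanced))

# Imbalanced toy labels: the baseline is higher than 1 / number of classes.
imbalanced = ["arson"] * 7 + ["fraud"] * 2 + ["robbery"]
print(majority_baseline(imbalanced))
```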


In [15]:
sf.head()

Unnamed: 0,crime_category,crime,DayOfWeek[Friday],DayOfWeek[Monday],DayOfWeek[Saturday],DayOfWeek[Sunday],DayOfWeek[Thursday],DayOfWeek[Tuesday],DayOfWeek[Wednesday],PdDistrict[T.CENTRAL],...,month[T.September],time_of_day[T.early_morning],time_of_day[T.evening],time_of_day[T.late_morning],time_of_day[T.mid_day],time_of_day[T.mid_night],time_of_day[T.morning],time_of_day[T.night],longitude_centered,latitude_centered
0,0,arson,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011674,-0.052062
1,0,arson,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.010874,0.014868
2,0,arson,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.005531,0.014195
3,0,arson,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.026991,-0.012642
4,0,arson,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-0.043588,0.00152


In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [18]:
multi_lr = LogisticRegression()

X = sf[[x for x in sf.columns if x not in ['crime', 'crime_category']]]
X = (X - X.mean()) / X.std()
X_cols = X.columns
X = X.values
Y = sf['crime_category']

scores = cross_val_score(multi_lr, X, Y, cv = 5)

# Results in about an 11.3% chance of predicting correctly
print(scores)
print(np.mean(scores))

[ 0.10695652  0.10521739  0.10869565  0.12434783  0.11913043]
0.112869565217


In [27]:
multi_lr.fit(X, Y)

print(len(sf.crime.unique()))

print(multi_lr.coef_.shape)

23
(23L, 36L)


In [None]:
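coefs = multi_lr.coef_

# One plausible completion (an assumption, since this cell was left blank):
# average the absolute coefficients over the 23 classes, one value per predictor.
mean_abs_val_coef = np.abs(coefs).mean(axis=0)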

---

### 2.2. Look at accuracy, precision, and area under precision-recall curve for model

Compare accuracy to the chance rate.
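One way to make this comparison is to collect out-of-fold predictions and score them against the chance rate. The sketch below uses a synthetic 4-class dataset from `make_classification` as a stand-in (an assumption, so the numbers say nothing about the crime data); in the codealong you would pass `X` and `Y` instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for X and Y (assumption: not the crime data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

# Out-of-fold predictions: each row is predicted by a model that never saw it.
y_pred = cross_val_predict(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5
)

accuracy = accuracy_score(y_demo, y_pred)
chance = 1.0 / len(np.unique(y_demo))

print("accuracy: %.3f" % accuracy)
print("chance:   %.3f" % chance)
print("lift over chance: %.2fx" % (accuracy / chance))
```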

---

### 2.3. Look at area under ROC curve metric

[The area measures discrimination, that is, the ability of the test to correctly classify those with and without the disease.](http://gim.unmc.edu/dxtests/roc3.htm)

[BONUS] Plot the curve.
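ROC AUC is defined for two classes, so with 23 crime types it has to be averaged somehow. A common choice is one-vs-rest with a macro average. A minimal sketch, assuming synthetic stand-in data rather than the crime dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for X and Y (assumption: not the crime data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

# Out-of-fold class probabilities, one column per class.
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    cv=5, method="predict_proba",
)

# One-vs-rest: score each class against all the others, then average.
macro_auc = roc_auc_score(y_demo, proba, multi_class="ovr", average="macro")
print("macro one-vs-rest ROC AUC: %.3f" % macro_auc)
```

A value of 0.5 corresponds to random ranking, so that is the number to beat here rather than the 1/23 accuracy baseline.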

---

### 2.4. Look at the precision and area under precision-recall curve

[High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).](http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html) 
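For a multi-class target, precision can be macro-averaged over classes, and the precision-recall curve can be summarized per class with average precision after binarizing the labels. The sketch below assumes synthetic stand-in data; average precision is used as the summary of the curve, which is one of several ways to compute an "area".

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_score
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import label_binarize

# Synthetic stand-in for X and Y (assumption: not the crime data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)
classes = np.unique(y_demo)

proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    cv=5, method="predict_proba",
)
# Hard predictions: the class with the highest out-of-fold probability.
y_pred = classes[proba.argmax(axis=1)]

# Precision computed per class, then averaged with equal class weight.
macro_precision = precision_score(y_demo, y_pred, average="macro")

# One 0/1 indicator column per class, matching the columns of proba.
y_bin = label_binarize(y_demo, classes=classes)
macro_ap = average_precision_score(y_bin, proba, average="macro")

print("macro precision:         %.3f" % macro_precision)
print("macro average precision: %.3f" % macro_ap)
```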

---

### 3.1. Gridsearch best parameters for a logistic regression
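A minimal sketch of this step, assuming synthetic stand-in data and a grid over the regularization strength `C` only. The grid values are illustrative guesses, not tuned choices for the crime data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X and Y (assumption: not the crime data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

# C is the inverse regularization strength: smaller C means a stronger penalty.
lr_params = {"C": [0.01, 0.1, 1.0, 10.0]}

lr_gs = GridSearchCV(LogisticRegression(max_iter=1000), lr_params, cv=5)
lr_gs.fit(X_demo, y_demo)

print(lr_gs.best_params_)
print(lr_gs.best_score_)

# Refit on all the data with the best C, ready for further evaluation.
best_lr = lr_gs.best_estimator_
```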

---

### 3.2. Examine accuracy, area under ROC, and area under precision-recall curve for "optimal" model

---

### 4. Gridsearch and examine metrics for optimal classification trees
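The same search pattern applies to a single decision tree. The sketch below assumes synthetic stand-in data and searches over depth and leaf size, two parameters that control how far a tree can overfit; the specific values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and Y (assumption: not the crime data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

tree_params = {
    "max_depth": [2, 4, 8, None],     # None lets the tree grow fully
    "min_samples_leaf": [1, 5, 20],   # larger leaves give smoother trees
}

tree_gs = GridSearchCV(
    DecisionTreeClassifier(random_state=0), tree_params, cv=5
)
tree_gs.fit(X_demo, y_demo)

print(tree_gs.best_params_)
print(tree_gs.best_score_)
best_tree = tree_gs.best_estimator_
```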

---

### 5.1. Build a bagging classifier with optimal decision trees

[BaggingClassifier documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier)

[BaggingRegressor documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html)
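Before using the class, it can help to see what bagging does by hand: fit each tree on a bootstrap sample (rows drawn with replacement) and combine the trees' predictions. This is a simplified sketch on synthetic stand-in data. It uses a hard majority vote for clarity, which is not necessarily how `BaggingClassifier` combines its estimators.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and Y (assumption: not the crime data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)
X_train, y_train = X_demo[:300], y_demo[:300]
X_test, y_test = X_demo[300:], y_demo[300:]

rng = np.random.RandomState(0)
n_trees = 25
all_preds = []
for _ in range(n_trees):
    # Bootstrap: draw len(X_train) row indices with replacement.
    idx = rng.randint(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

all_preds = np.array(all_preds)  # shape (n_trees, n_test_rows)

# Majority vote: count the votes per class for each test row.
n_classes = len(np.unique(y_train))
votes = np.array(
    [np.bincount(col, minlength=n_classes) for col in all_preds.T]
)
bagged_pred = votes.argmax(axis=1)

print("bagged accuracy: %.3f" % np.mean(bagged_pred == y_test))
```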

In [29]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

In [32]:
X.shape

(5750L, 36L)

In [34]:
from sklearn.model_selection import GridSearchCV

params = {
    'n_estimators': [50, 100, 250, 500],
    'max_samples':  [0.25, 0.5, 0.75, 1.0],
    'max_features': [0.25, 0.5, 0.75, 1.0],
}

In [35]:
dtc = DecisionTreeClassifier(max_depth = None)

bag = BaggingClassifier(dtc)

gs = GridSearchCV(bag, params, cv = 5, verbose = 1)

gs.fit(X, Y)

print(gs.best_params_)
print(gs.best_score_)

best_bag = gs.best_estimator_

# bag_scores = cross_val_score(bag, X, Y, cv = 5)

# print bag_scores
# print np.mean(bag_scores)

Fitting 5 folds for each of 64 candidates, totalling 320 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:  1.0min


KeyboardInterrupt: 
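The full grid means 320 fits, many of them with hundreds of trees, which is why the run above was interrupted. One way to keep the search manageable is to sample a few combinations with `RandomizedSearchCV`. The sketch below assumes synthetic stand-in data and deliberately small parameter values so that it finishes quickly; passing `n_jobs=-1` to the search is another option if several cores are available.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and Y (assumption: not the crime data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

# Small illustrative values; 2 * 2 * 2 = 8 possible combinations.
small_params = {
    "n_estimators": [10, 25],
    "max_samples": [0.5, 1.0],
    "max_features": [0.5, 1.0],
}

bag = BaggingClassifier(DecisionTreeClassifier(), random_state=0)

# n_iter=4 fits 4 sampled combinations instead of all 8.
rs = RandomizedSearchCV(bag, small_params, n_iter=4, cv=3, random_state=0)
rs.fit(X_demo, y_demo)

print(rs.best_params_)
print(rs.best_score_)
```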

---

### 5.2. Examine metrics for bagging classifier
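Bagging offers an extra metric on top of cross-validation: each tree never sees the rows left out of its bootstrap sample, so those rows can be scored "out of bag". A minimal sketch on synthetic stand-in data (an assumption, so the scores are not meaningful for the crime problem):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and Y (assumption: not the crime data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

# oob_score=True scores each row using only the trees that did not train on it.
bag_oob = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    oob_score=True, random_state=0,
)
bag_oob.fit(X_demo, y_demo)
print("out-of-bag accuracy:      %.3f" % bag_oob.oob_score_)

# Cross-validated accuracy for comparison.
cv_scores = cross_val_score(
    BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    X_demo, y_demo, cv=5,
)
print("cross-validated accuracy: %.3f" % np.mean(cv_scores))
```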

---

### 6.1. Do the above with the random forest class

[RandomForestClassifier documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)

[RandomForestRegressor documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor)
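A random forest is bagged trees with one extra twist: each split considers only a random subset of the features. A minimal sketch with cross-validated accuracy, assuming synthetic stand-in data and an illustrative `n_estimators` value:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X and Y (assumption: not the crime data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf_scores = cross_val_score(rf, X_demo, y_demo, cv=5)

print(rf_scores)
print(np.mean(rf_scores))
```

The same `GridSearchCV` pattern used for the bagging classifier would work here too, for example over `n_estimators` and `max_features`.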

---

### 6.2. Examine feature importances from random forest model
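After fitting, a random forest exposes `feature_importances_`, one value per column. The sketch below pairs them with column names and sorts them. It assumes synthetic stand-in data with made-up names like `feature_0`; in the codealong the names would come from `X_cols`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for X and Y (assumption: not the crime data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)
# Placeholder column names (hypothetical).
feature_names = ["feature_%d" % i for i in range(X_demo.shape[1])]

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_demo, y_demo)

# Pair each name with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked[:5]:
    print("%-12s %.3f" % (name, importance))
```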

---

### 7. [BONUS IF TIME] Do the above for bagging classifier but with logistic regression 
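`BaggingClassifier` accepts any classifier as its base estimator, so a logistic regression can be dropped in where the tree was. The sketch below assumes synthetic stand-in data and illustrative parameter values. Bagging tends to help high-variance models such as deep trees the most, so the gain over a single logistic regression may be small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X and Y (assumption: not the crime data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

# 20 logistic regressions, each fit on a bootstrap sample of 75% of the rows.
bag_lr = BaggingClassifier(
    LogisticRegression(max_iter=1000),
    n_estimators=20, max_samples=0.75, random_state=0,
)
bag_lr_scores = cross_val_score(bag_lr, X_demo, y_demo, cv=5)

# A single logistic regression for comparison.
single_lr_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5
)

print("bagged LR: %.3f" % np.mean(bag_lr_scores))
print("single LR: %.3f" % np.mean(single_lr_scores))
```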