# 2D - Random Forest Classifier, Support Vector Classifier, AdaBoost Classifier.

#### Prop 64 - Allows marijuana growth near schools and parks. 

The second point in the problem statement concerns the exposure of minors, schoolchildren, and teenagers to marijuana following the passage of Prop 64.

I will use the 4,156 marijuana-related arrests from 2015 to 2019 to fit a random forest classifier, a support vector classifier, and an AdaBoost classifier, and compare how accurately each one classifies an arrest as related to selling marijuana (target 1) or not selling marijuana (target 0).


#### Arrests Data Dictionary:

| Column Name | Type | Description |
| --- | --- | --- | 
| Report ID | int64 | ID for the Arrest | 
| Arrest Date |  DateTime | YYYY/MM/DD |
| Time | float64 | Time of arrest in 24-hour military time |
| Area ID |  int64 | 21 LAPD stations referred to as Geographic Areas that are sequentially numbered from 1-21 |
| Area Name | object  | Area ID's name designation that references a landmark or the surrounding community that an LAPD station is responsible for|
| Reporting District | int64 |  A four-digit code that represents a sub-area within a Geographic Area | 
| Age | int64 | Age of the arrestee | 
| Sex Code | object| F - Female, M - Male |
| Descent Code | object | Arrestee's descent code |
| Charge Group Code | object | Category of arrest charge |
| Charge Group Description | object | Defines the charge provided | 
| Arrest Type Code | object | A code to indicate the type of charge the individual was arrested for: D - Dependent, F - Felony, I - Infraction, M - Misdemeanor, O - Other |
| Charge | object | The charge the individual was arrested for |
| Charge Description | object | Defines the Charge provided |
| Address | object | Street address of crime incident |
| Location | object  | The location where the crime incident occurred. XY coordinates reflect the nearest 100 block |
| disp_0.5_mile | int64 | Number of dispensaries within 0.5 miles of each arrest | 
| disp_1_mile | int64 | Number of dispensaries within 1 mile of each arrest | 
| school_0.5_mile | int64 | Number of school(s) within 0.5 miles of each arrest | 
| school_1_mile | int64 | Number of school(s) within 1 mile of each arrest | 

In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

DO = '#7D1B7E'

%matplotlib inline

plt.style.use('fivethirtyeight')

In [28]:
arrests = pd.read_csv('../data/all_marijuana_arrests.csv')
arrests.head()

| Unnamed: 0 | Arrest Date | Time | Area Name | Age | Sex Code | Descent Code | Arrest Type Code | Charge Description | lat | long | disp_0.5_mile | disp_1_mile | school_0.5_mile | school_1_mile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2015-01-01 | 1610.0 | Southeast | 19 | M | B | M | possess 28.5 grams or less of marijuana** | 33.9456 | -118.2739 | 1 | 2 | 5 | 16 |
| 1 | 2015-01-01 | 820.0 | Hollywood | 23 | M | B | F | transport/sell/furnish/etc marijuana | 34.1016 | -118.3387 | 0 | 5 | 2 | 4 |
| 2 | 2015-01-02 | 1030.0 | Pacific | 24 | F | W | F | transport/sell/furnish/etc marijuana | 33.992 | -118.4201 | 1 | 2 | 5 | 9 |
| 3 | 2015-01-02 | 1530.0 | Pacific | 30 | M | O | F | possession marijuana for sale | 33.944 | -118.4073 | 0 | 1 | 0 | 0 |
| 4 | 2015-01-03 | 1940.0 | Southwest | 26 | M | H | F | transport/sell/furnish/etc marijuana | 34.026 | -118.3652 | 0 | 1 | 2 | 5 |


### Creating Classes

I am going to create two classes, `sell` and `not sell`. The `sell` class covers charges that involve selling marijuana, which increases children's exposure to it. The `not sell` class covers charges that mainly affect the arrestee, such as possession of marijuana. The charges in the `Charge Description` column will be grouped accordingly.

In [29]:
charge_description_list = list(arrests['Charge Description'].unique())

In [30]:
charge_description_list.sort()
charge_description_list

['attempt - sell/furnish/etc marijuana',
 'cultivate >6 marij plants viol envrnt law',
 'cultivating <6 marijuana plants',
 'furnishing marijuana to minor over 14 yrs',
 'give/transport/etc < 28.5 grams marijuana',
 'induce/etc minor to use/sell marijuana',
 'minor poss 28.5+ grams marijuana/school',
 'minor poss < 28.5 grams marijuana/school',
 'poss for sale of marijuana to a minor',
 'poss marijuana or concentrated cannabis',
 'poss of more than 28.5 grams of marijuana',
 'poss open cont/packg marij drivr/passnger',
 'poss/sale marij ovr 21 employ per 20/belw',
 'poss/smoke/ingest marij school/daycare/yc',
 'possess 28.5 grams or less of marijuana',
 'possess 28.5 grams or less of marijuana**',
 'possess marijuana for sale',
 'possess marijuana for sale under age 18',
 'possess of marijuana while driving veh',
 'possession marijuana for sale',
 'possession of marijuana in school',
 'sale/offer to sell/transport marijuana',
 'sale/trans >28.5g marijuana or >4g',
 'sale/transport mari

In [31]:
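# Any charge description containing 'sale' or 'sell' is labelled 'sell'
z = arrests['Charge Description'].map(lambda x: 'sell' if 'sale' in x else x)
z = z.map(lambda x: 'sell' if 'sell' in x else x)
# Every remaining charge description is labelled 'not sell'
z = z.map(lambda x: 'not sell' if 'sell' not in x else x)
z.value_counts(normalize = True)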

sell        0.632579
not sell    0.367421
Name: Charge Description, dtype: float64

Since the classes are a bit unbalanced (about 63% `sell` and 37% `not sell`), I am going to stratify the train/test split so that the train and test datasets both keep the same proportion of `sell` and `not sell` observations.

In [32]:
arrests['target'] = z 

In [33]:
arrests['target'] = arrests['target'].map(lambda x: 1 if x == 'sell' else 0)
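
The labelling steps above can also be written as a single vectorized step. The following is a minimal sketch, assuming a small hypothetical `charges` Series built from a few of the charge descriptions listed earlier; it is not the notebook's own code, but it applies the same rule (1 if the description contains `sale` or `sell`, otherwise 0).

```python
import pandas as pd

# Hypothetical sample of charge descriptions taken from the list above
charges = pd.Series([
    'possess 28.5 grams or less of marijuana',
    'possession marijuana for sale',
    'sale/offer to sell/transport marijuana',
    'cultivating <6 marijuana plants',
])

# True when the description mentions 'sale' or 'sell' (case-sensitive, like the lambdas)
is_sell = charges.str.contains('sale|sell', regex=True)

# Convert the boolean mask to the 1/0 target
target = is_sell.astype(int)
print(target.tolist())
```

Collapsing the string labels and the 1/0 mapping into one mask avoids the intermediate `sell` / `not sell` strings, at the cost of hiding the class names that `value_counts` reports.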

Creating the predictor variables `X`:

Since `disp_1_mile` includes the dispensaries counted in `disp_0.5_mile`, and `school_1_mile` includes the schools counted in `school_0.5_mile`, each pair is heavily correlated. I will therefore drop both `disp_0.5_mile` and `school_0.5_mile`.

In [34]:
arrests[['disp_0.5_mile','disp_1_mile']].corr()

|  | disp_0.5_mile | disp_1_mile |
| --- | --- | --- |
| disp_0.5_mile | 1.0 | 0.821403 |
| disp_1_mile | 0.821403 | 1.0 |


In [35]:
arrests[['school_0.5_mile','school_1_mile']].corr()

|  | school_0.5_mile | school_1_mile |
| --- | --- | --- |
| school_0.5_mile | 1.0 | 0.771906 |
| school_1_mile | 0.771906 | 1.0 |
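
The two pairwise checks above could be generalized into one pass over the whole correlation matrix. This is only a sketch on synthetic data: the column names `count_0.5_mile`, `count_1_mile`, and `other`, and the 0.7 threshold, are assumptions chosen for illustration, with the wider-radius count built so that it includes the narrower one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical counts: the 1-mile count includes everything in the 0.5-mile count
near = rng.poisson(2, size=500)
far = near + rng.poisson(1, size=500)
other = rng.normal(size=500)  # an unrelated feature

demo = pd.DataFrame({
    'count_0.5_mile': near,
    'count_1_mile': far,
    'other': other,
})

corr = demo.corr().abs()

# Keep only the upper triangle so each pair of columns appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# List the pairs whose absolute correlation exceeds the (arbitrary) threshold
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > 0.7]
print(high_pairs)
```

Any pair that shows up in `high_pairs` is a candidate for dropping one of its two columns, which is the decision made by hand above.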


In [36]:
arrests.head()

| Unnamed: 0 | Arrest Date | Time | Area Name | Age | Sex Code | Descent Code | Arrest Type Code | Charge Description | lat | long | disp_0.5_mile | disp_1_mile | school_0.5_mile | school_1_mile | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2015-01-01 | 1610.0 | Southeast | 19 | M | B | M | possess 28.5 grams or less of marijuana** | 33.9456 | -118.2739 | 1 | 2 | 5 | 16 | 0 |
| 1 | 2015-01-01 | 820.0 | Hollywood | 23 | M | B | F | transport/sell/furnish/etc marijuana | 34.1016 | -118.3387 | 0 | 5 | 2 | 4 | 1 |
| 2 | 2015-01-02 | 1030.0 | Pacific | 24 | F | W | F | transport/sell/furnish/etc marijuana | 33.992 | -118.4201 | 1 | 2 | 5 | 9 | 1 |
| 3 | 2015-01-02 | 1530.0 | Pacific | 30 | M | O | F | possession marijuana for sale | 33.944 | -118.4073 | 0 | 1 | 0 | 0 | 1 |
| 4 | 2015-01-03 | 1940.0 | Southwest | 26 | M | H | F | transport/sell/furnish/etc marijuana | 34.026 | -118.3652 | 0 | 1 | 2 | 5 | 1 |


### Train Test Split

In [37]:
X = arrests.drop(['Arrest Date','Charge Description','target','disp_0.5_mile','school_0.5_mile'], axis=1)

# creating dummy variables for the categorical columns: area name, sex, descent, and arrest type code
X = pd.get_dummies(X)
y = arrests['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)
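
To illustrate what `stratify = y` does in the split above, here is a minimal sketch on synthetic labels. The `X_demo` and `y_demo` names and the generated data are assumptions for illustration only; the class balance is set to roughly the 63/37 split reported earlier.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic labels with roughly the 63/37 class balance reported above
y_demo = pd.Series(rng.choice([1, 0], size=1000, p=[0.63, 0.37]))
X_demo = pd.DataFrame({'feature': rng.normal(size=1000)})

# Same call pattern as the notebook: default 75/25 split, stratified on the target
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

# The share of class 1 should be nearly identical in the two splits
print(f'Train share of class 1: {y_tr.mean():.3f}')
print(f'Test share of class 1:  {y_te.mean():.3f}')
```

Without `stratify`, the two shares can drift apart by chance, which makes train and test accuracy harder to compare on an unbalanced target.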

### Fitting Random Forest Classifier

In [38]:
# Fitting the Random Forest Classifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# train score
train_accuracy = rf.score(X_train, y_train)
print(f'Random Forest Train Accuracy: {round(train_accuracy,2)}')

# test score
test_accuracy = rf.score(X_test, y_test)
print(f'Random Forest Test Accuracy: {round(test_accuracy,2)}')

Random Forest Train Accuracy: 0.99
Random Forest Test Accuracy: 0.92
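
Because about 63% of the arrests are in the `sell` class, accuracy is easier to interpret next to a majority-class baseline and a confusion matrix. The sketch below uses synthetic data from `make_classification` as a stand-in for the arrests features, so the `*_demo` names and every number it prints are illustrative assumptions, not results for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a roughly 37/63 class balance (an assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.37, 0.63], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

# Baseline: always predict the most frequent class in the training data
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
baseline_acc = baseline.score(X_te, y_te)

# Random forest with a fixed seed so the sketch is repeatable
rf_demo = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
rf_acc = rf_demo.score(X_te, y_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, rf_demo.predict(X_te))

# ROC AUC is computed from the predicted probability of class 1
auc = roc_auc_score(y_te, rf_demo.predict_proba(X_te)[:, 1])

print(f'Baseline accuracy: {baseline_acc:.2f}')
print(f'Random forest accuracy: {rf_acc:.2f}')
print(f'ROC AUC: {auc:.2f}')
print(cm)
```

The same pattern could be applied to `rf`, `X_test`, and `y_test` from the cell above to see whether the errors fall mostly on the `sell` or the `not sell` class.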


### Fitting AdaBoost Classifier

In [39]:
ada = AdaBoostClassifier()
ada.fit(X_train, y_train)
print(f'AdaBoost Train Accuracy score: {ada.score(X_train,y_train)}')
print(f'AdaBoost Test Accuracy score: {ada.score(X_test,y_test)}\n')

AdaBoost Train Accuracy score: 0.8845043310875842
AdaBoost Test Accuracy score: 0.9008662175168431



### Fitting Support Vector Classifier

In [40]:
svc = SVC()
svc.fit(X_train, y_train)
print(f'Support Vector Classifier Train Accuracy score: {svc.score(X_train,y_train)}')
print(f'Support Vector Classifier Test Accuracy score: {svc.score(X_test,y_test)}')

Support Vector Classifier Train Accuracy score: 0.8886750080205326
Support Vector Classifier Test Accuracy score: 0.6515880654475458
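
The support vector classifier's test accuracy (about 0.65) is close to the 0.63 share of the `sell` class, so on unseen data it does little better than always predicting `sell`. One plausible cause, offered here as an assumption rather than a diagnosis, is that the features are on very different scales (for example `Time` runs from 0 to 2400 while the dummy variables are 0 or 1), and `StandardScaler` is imported but not applied. A minimal sketch of scaling inside a `Pipeline`, using synthetic data and hypothetical `*_demo` names:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the arrests dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=42)

# Inflate one column to mimic a large-valued feature such as `Time`
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

# The scaler is fit on the training data only, then applied before the SVC
svc_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC()),
])
svc_pipe.fit(X_tr, y_tr)

print(f'Scaled SVC train accuracy: {svc_pipe.score(X_tr, y_tr):.2f}')
print(f'Scaled SVC test accuracy: {svc_pipe.score(X_te, y_te):.2f}')
```

Putting the scaler in the pipeline keeps the test data out of the scaling statistics, and the same pipeline object could be passed to `GridSearchCV` if the SVC parameters are tuned later.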
