# Regression and Feature Importance with the Boston Globe

Based on the super-old but super-relevant [Speed Trap: Who Gets a Ticket, Who Gets a Break?](http://archive.boston.com/globe/metro/packages/tickets/).

For data, we'll be using `tickets-warnings.csv`.

If you'd rather skip my explanation, you can go read [this page](https://investigate.ai/boston-globe-tickets/boston-globe-ticketing-regression/). It assumes some prior knowledge of logistic regression and the `statsmodels` package, though. We're coming at this backwards, since we began with sentiment analysis!

In [3]:
import pandas as pd
import statsmodels.formula.api as smf
import numpy as np

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)

In [5]:
# Unknown data is always coded as 'U', so we'll pass that to na_values
df = pd.read_csv('tickets-warnings.csv', na_values='U')
df.head(2)

Unnamed: 0,TYPE,CITATION,DATE,DOW,AGENCY,AGENCY2,AGENCY3,LOCAL,OFFICER,LICSTATE,CLASS,CDL,RACE,MINORITY,BLACK,ASIAN,HISPANIC,MIDDLE,NATIVE,SEX,FEMALE,SEARCH,SEARCH2,LOCATE2,LOCATION,TIME,AMPM,TIEMDAY,DAYNIGHT,DAY,DESCRIPT,AMOUNT,MPH,ZONE,MPHOVER,MPHPCT,MPHGROUP,YOB,AGE,AGEGROUP,AGEBAND,ZIP,NEIGHBOR,INNEIGH,REGSTATE,V_MAKE,V_TYPE,V_YEAR,V_AGE,V_AGEGRP,COLOR,HOMESTATE,HOMETOWN,INTOWN,INSTATE,INTOWN2,INSTATE2
0,T,K0001506,20010411,Wednesday,State Police Troop A-4,State Police,S,N,8247791000000000.0,MA,D,N,W,W,0.0,0.0,0.0,0.0,0.0,M,0.0,,,Woburn,Woburn,5,PM,b) afternoon,day,1,SPEEDING,125.0,80,65,15.0,23.0,b) 10 to 15,1980,21.0,21-25,16-25,1876.0,,,MA,,,0,,,,MA,Tewksbury,N,Y,0.0,1
1,T,K0001507,20010417,Tuesday,State Police Troop A-4,State Police,S,N,8247791000000000.0,MA,D,N,W,W,0.0,0.0,0.0,0.0,0.0,F,1.0,,,Somerville,Somerville,10,AM,a) morning,day,1,TRAFFIC VIOLATION,50.0,0,0,,,,1965,36.0,36-40,26-39,2135.0,Allston-Brighton,N,MA,DODG,ARIES,1988,13.0,older,WHITE,MA,Boston,N,Y,0.0,1
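
To see what `na_values='U'` does, here is a minimal sketch on a made-up three-row CSV. The rows are hypothetical and not taken from `tickets-warnings.csv`. Any cell that is exactly `U` is read as missing, which is what lets a later `dropna()` remove those rows.

```python
import io
import pandas as pd

# Hypothetical sample rows, only for illustration
csv_text = """TYPE,RACE,SEX
T,W,M
W,U,F
T,B,U
"""

sample = pd.read_csv(io.StringIO(csv_text), na_values='U')

# Count missing values per column: the 'U' cells are now NaN
missing_counts = sample.isna().sum()
print(missing_counts)

# Only rows with no missing values survive dropna()
complete_rows = sample.dropna()
print(complete_rows)
```

Note that only whole cells equal to `U` are affected, so a value that merely contains the letter is left alone.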


In [6]:
df.TYPE.value_counts()

T    84002
W    82366
Name: TYPE, dtype: int64

In [7]:
df = df[['TYPE','AGENCY3','SEX','BLACK','ASIAN','HISPANIC','MINORITY','AGE','MPH','MPHOVER','INTOWN', 'INSTATE', 'DAYNIGHT']].copy()
df.head()

Unnamed: 0,TYPE,AGENCY3,SEX,BLACK,ASIAN,HISPANIC,MINORITY,AGE,MPH,MPHOVER,INTOWN,INSTATE,DAYNIGHT
0,T,S,M,0.0,0.0,0.0,W,21.0,80,15.0,N,Y,day
1,T,S,F,0.0,0.0,0.0,W,36.0,0,,N,Y,day
2,T,S,F,0.0,0.0,0.0,W,61.0,0,,N,N,day
3,T,S,M,0.0,0.0,0.0,W,52.0,0,,N,N,night
4,T,S,M,0.0,0.0,0.0,W,24.0,85,20.0,N,Y,day


In [27]:
print(df.shape)
df = df.dropna().copy()
print(df.shape)

(89691, 16)
(89691, 16)


In [28]:
# TYPE
# T = ticket
# W = warning
df['is_ticketed'] = df.TYPE.replace({
    'T': 1,
    'W': 0
})
df.head()

Unnamed: 0,TYPE,AGENCY3,SEX,BLACK,ASIAN,HISPANIC,MINORITY,AGE,MPH,MPHOVER,INTOWN,INSTATE,DAYNIGHT,is_ticketed,is_male,is_white
0,T,S,M,0.0,0.0,0.0,W,21.0,80,15.0,N,Y,day,1,1.0,1.0
4,T,S,M,0.0,0.0,0.0,W,24.0,85,20.0,N,Y,day,1,1.0,1.0
5,T,S,M,0.0,0.0,0.0,W,37.0,80,30.0,N,Y,day,1,1.0,1.0
6,W,S,M,0.0,0.0,0.0,W,30.0,80,15.0,N,N,night,0,1.0,1.0
7,W,S,F,1.0,0.0,0.0,M,22.0,75,10.0,N,N,night,0,0.0,0.0


In [29]:
df['is_male'] = df.SEX.replace({
    'M': 1,
    'F': 0
})
df.head(2)

Unnamed: 0,TYPE,AGENCY3,SEX,BLACK,ASIAN,HISPANIC,MINORITY,AGE,MPH,MPHOVER,INTOWN,INSTATE,DAYNIGHT,is_ticketed,is_male,is_white
0,T,S,M,0.0,0.0,0.0,W,21.0,80,15.0,N,Y,day,1,1,1.0
4,T,S,M,0.0,0.0,0.0,W,24.0,85,20.0,N,Y,day,1,1,1.0


In [30]:
df['is_white'] = df.MINORITY.replace({
    'W': 1,
    'M': 0
})
df.head(2)

Unnamed: 0,TYPE,AGENCY3,SEX,BLACK,ASIAN,HISPANIC,MINORITY,AGE,MPH,MPHOVER,INTOWN,INSTATE,DAYNIGHT,is_ticketed,is_male,is_white
0,T,S,M,0.0,0.0,0.0,W,21.0,80,15.0,N,Y,day,1,1,1
4,T,S,M,0.0,0.0,0.0,W,24.0,85,20.0,N,Y,day,1,1,1


In [49]:
df['is_intown'] = df.INTOWN.replace({
    'Y': 1,
    'N': 0
})
df.head(2)

Unnamed: 0,TYPE,AGENCY3,SEX,BLACK,ASIAN,HISPANIC,MINORITY,AGE,MPH,MPHOVER,INTOWN,INSTATE,DAYNIGHT,is_ticketed,is_male,is_white,is_intown
0,T,S,M,0.0,0.0,0.0,W,21.0,80,15.0,N,Y,day,1,1,1,0
4,T,S,M,0.0,0.0,0.0,W,24.0,85,20.0,N,Y,day,1,1,1,0
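
One thing to watch with `.replace()` is that a code missing from the dictionary is silently left as-is. As an alternative sketch (not what the cells above do), `.map()` turns unmatched values into `NaN`, so they are easy to count. The sample rows and the `'X'` code below are made up for illustration.

```python
import pandas as pd

# Hypothetical rows, only for illustration
sample = pd.DataFrame({
    'TYPE': ['T', 'W', 'T'],
    'SEX': ['M', 'F', 'X'],   # 'X' is a made-up unexpected code
})

# .map() turns any value missing from the dictionary into NaN
is_ticketed = sample.TYPE.map({'T': 1, 'W': 0})
is_male = sample.SEX.map({'M': 1, 'F': 0})

# Count the values that did not match any key
unmapped_type = int(is_ticketed.isna().sum())
unmapped_sex = int(is_male.isna().sum())
print(unmapped_type, unmapped_sex)
```

If either count is above zero, there is a code in the column that the dictionary does not cover.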


In [50]:
feature_columns = ['is_male', 'is_white', 'is_intown']
X = df[feature_columns]
y = df.is_ticketed

In [51]:
X.head(3)

Unnamed: 0,is_male,is_white,is_intown
0,1,1,0
4,1,1,0
5,1,1,0


In [53]:
y[:3]

0    1
4    1
5    1
Name: is_ticketed, dtype: int64

## Classification

"What is going on in the minds of the police??????"

Let's make a model that reproduces what the police do, but with machine learning!

In [54]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=2)

clf.fit(X, y)

DecisionTreeClassifier(max_depth=2)

In [55]:
import eli5

feature_names = list(X.columns)

eli5.show_weights(clf,
                  feature_names=feature_names,
                  target_names=['warning', 'ticket'])

Weight,Feature
0.6431,is_intown
0.196,is_male
0.1609,is_white
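
For a decision tree, the weights eli5 shows should correspond to scikit-learn's `feature_importances_`: how much each feature helped the tree split the data, normalized to sum to 1. They say nothing about direction (toward ticket or toward warning). The sketch below uses synthetic data with made-up probabilities, so the numbers will not match the table above; `X_demo`, `y_demo`, and `tree` are names invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data, only for illustration: the probabilities below are made up
rng = np.random.default_rng(0)
n = 5000
X_demo = pd.DataFrame({
    'is_male': rng.integers(0, 2, n),
    'is_white': rng.integers(0, 2, n),
    'is_intown': rng.integers(0, 2, n),
})
p_ticket = 0.7 - 0.3 * X_demo.is_intown + 0.1 * X_demo.is_male
y_demo = (rng.random(n) < p_ticket).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_demo, y_demo)

# One importance per feature; they are normalized to sum to 1
importances = pd.Series(tree.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))

# Text view of the actual splits, which shows the direction of each rule
print(export_text(tree, feature_names=list(X_demo.columns)))
```

Reading the printed splits alongside the importances is one way to find out which way a feature pushes the prediction.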


## Let's try another classifier!

In [56]:
from sklearn.linear_model import LogisticRegression

# C is the inverse of regularization strength, so a huge C effectively turns regularization off
clf = LogisticRegression(C=1e9, solver='lbfgs')

clf.fit(X, y)

LogisticRegression(C=1000000000.0)

In [57]:
import eli5

feature_names = list(X.columns)

eli5.show_weights(clf, feature_names=feature_names)

Weight?,Feature
0.39,<BIAS>
0.364,is_male
-0.414,is_white
-0.668,is_intown
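
Logistic regression weights are in log-odds, which are hard to read directly. A minimal sketch of two common conversions, assuming the weights above are for the ticket class (`is_ticketed = 1`): `exp(weight)` gives an odds ratio, and passing the summed log-odds through the logistic function gives a predicted probability. The helper `ticket_probability` is a name made up for this example.

```python
import math

# Weights copied from the eli5 table above
bias = 0.39
weights = {'is_male': 0.364, 'is_white': -0.414, 'is_intown': -0.668}

# exp(weight) is the odds ratio for a one-unit change in that feature
odds_ratios = {name: math.exp(w) for name, w in weights.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.3f}")

def ticket_probability(is_male, is_white, is_intown):
    """Predicted probability of a ticket for one driver (hypothetical helper)."""
    log_odds = (bias
                + weights['is_male'] * is_male
                + weights['is_white'] * is_white
                + weights['is_intown'] * is_intown)
    return 1 / (1 + math.exp(-log_odds))

# A white male in-town driver vs. a non-white male out-of-town driver
p_local = ticket_probability(is_male=1, is_white=1, is_intown=1)
p_visitor = ticket_probability(is_male=1, is_white=0, is_intown=0)
print(f"{p_local:.3f} {p_visitor:.3f}")
```

The odds ratio for `is_intown` comes out at roughly 0.51, meaning that with the other two features held equal, an in-town driver's odds of getting a ticket are about half those of an out-of-town driver. The two example drivers land at about 0.42 and 0.68.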
