## Faulty Takata Airbags using Logistic Regression

This story, reported by The New York Times, investigates complaints made to the National Highway Traffic Safety Administration (NHTSA) by customers who had bad experiences with Takata airbags in their cars. Eventually, car companies had to recall airbags made by Takata, a supplier that had promised a cheaper alternative.

- https://www.nytimes.com/2014/09/12/business/air-bag-flaw-long-known-led-to-recalls.html
- https://www.nytimes.com/2014/11/07/business/airbag-maker-takata-is-said-to-have-conducted-secret-tests.html
- https://www.nytimes.com/interactive/2015/06/22/business/international/takata-airbag-recall-list.html
- https://www.nytimes.com/2016/08/27/business/takata-airbag-recall-crisis.html

You can also see a presentation by Daeil Kim, who did the machine learning side of the story: he used logistic regression to classify whether a comment was relevant to the story or not.

https://www.slideshare.net/mortardata/daeil-kim-at-the-nyc-data-science-meetup

In [None]:
import pandas as pd

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)


### Read in the data and do basic exploration. 

In [None]:
column_names = ['CMPLID', 'ODINO', 'MFR_NAME', 'MAKETXT', 'MODELTXT', 'YEARTXT', 'CRASH', 'FAILDATE', 'FIRE', 'INJURED', 'DEATHS', 'COMPDESC', 'CITY', 'STATE', 'VIN', 'DATEA', 'LDATE', 'MILES', 'OCCURENCES', 'CDESCR', 'CMPL_TYPE', 'POLICE_RPT_YN', 'PURCH_DT', 'ORIG_OWNER_YN', 'ANTI_BRAKES_YN', 'CRUISE_CONT_YN', 'NUM_CYLS', 'DRIVE_TRAIN', 'FUEL_SYS', 'FUEL_TYPE', 'TRANS_TYPE', 'VEH_SPEED', 'DOT', 'TIRE_SIZE', 'LOC_OF_TIRE', 'TIRE_FAIL_TYPE', 'ORIG_EQUIP_YN', 'MANUF_DT', 'SEAT_TYPE', 'RESTRAINT_TYPE', 'DEALER_NAME', 'DEALER_TEL', 'DEALER_CITY', 'DEALER_STATE', 'DEALER_ZIP', 'PROD_TYPE', 'REPAIRED_YN', 'MEDICAL_ATTN', 'VEHICLES_TOWED_YN']

df = pd.read_csv("data/takata/FLAT_CMPL.txt",
                 sep='\t',
                 dtype='str',
                 header=None,
                 on_bad_lines='skip',  # replaces error_bad_lines=False, which was removed in pandas 2.0
                 encoding='latin-1',
                 names=column_names)

# We're only interested in complaints added before 2015.
# DATEA is a YYYYMMDD string, so a plain string comparison works.
df = df[df.DATEA < '2015']

df.head()

In [None]:
# What data do we have in this data set? How many columns do we have? 

There are 49 columns in this dataset, and many of them are not relevant to our analysis. If we keep them all as "features", as we saw earlier with the Titanic dataset, we might get badly skewed results. So, let's keep only the columns we are interested in, and then create our features based on those.

In this context, because we're working with unstructured data, we'll have to create our own columns. What do I mean by that? Well, it's hard for machines to infer _much_ from volumes of free-text comments, so we have to make the machine's job easier.

How can we do that?
- Focus on comments that talk about airbags. We don't really care about the other comments for the purpose of this story, but we also don't want to overfit the model, so the sample should still include some comments that never mention airbags.
- Figure out which words or combinations of words would indicate an airbag issue.
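
Filtering for airbag comments can be sketched with a case-insensitive pattern that tolerates an optional space, since free-text complaints may say "airbag" or "air bag". The example below runs on a few made-up strings, not rows from the NHTSA file. It assumes the complaint text sits in the `CDESCR` column listed in `column_names`; the names `toy` and `mentions_airbag` are hypothetical.

```python
import pandas as pd

# Made-up complaint text for illustration only; not real NHTSA rows.
toy = pd.DataFrame({
    "CDESCR": [
        "DRIVER SIDE AIR BAG DEPLOYED AND SENT METAL FRAGMENTS INTO THE CABIN",
        "THE AIRBAG WARNING LIGHT STAYS ON",
        "BRAKES SQUEAL WHEN STOPPING",
        None,  # a complaint with no description
    ]
})

# "AIR\s?BAG" matches both "AIRBAG" and "AIR BAG".
# case=False ignores upper/lower case; na=False treats missing text as a non-match.
mentions_airbag = toy["CDESCR"].str.contains(r"AIR\s?BAG", case=False, na=False)

print(toy[mentions_airbag])
```

Passing `na=False` matters here: without it, missing descriptions produce missing values in the mask, which cannot be used directly to filter rows.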

### Creating Features

In [None]:
# Take a random sample of rows



# Take a sample of 300 items involving airbags. 
# This is a fun bit to pay attention to: this is unstructured data, so people might type "air bag" or "airbag". 
# Let's factor that in. 




# Combine them so we have all sorts for the machine to learn about




# We don't know whether they're suspicious or not




# We only want a few columns



# Save them





So, once we've pulled out a sample of data, with and without airbags in the comments, we need to find the combinations of words that indicate an airbag isn't behaving as it should. For example, if you see "shrapnel" or "explode" in a comment that mentions airbags, you can probably infer that a faulty airbag is to blame.
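
As a rough sketch of that idea, each keyword can become its own 0/1 column. The comments and the keyword list below are invented for illustration, and `toy_features` is a hypothetical name; the real list should come from reading the sampled complaints.

```python
import pandas as pd

# Invented comments for illustration only.
comments = pd.Series([
    "AIR BAG EXPLODED AND SHRAPNEL HIT THE DRIVER",
    "AIRBAG LIGHT IS ON",
    "AIRBAG DEPLOYED VIOLENTLY, METAL FRAGMENTS IN THE CABIN",
])

# Hypothetical keyword stems ("EXPLOD" catches "EXPLODE", "EXPLODED", "EXPLODING").
keywords = ["SHRAPNEL", "EXPLOD", "METAL", "LIGHT"]

# One column per keyword: 1 if the comment contains it, 0 otherwise.
toy_features = pd.DataFrame({
    kw.lower(): comments.str.contains(kw, case=False, na=False).astype(int)
    for kw in keywords
})

print(toy_features)
```

Each row is now a small vector of numbers that a classifier can work with, which is what "creating our own columns" amounts to.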

In [None]:
## Based on the sample of data, can you see what terms or phrases are common when complaining about airbags?
## Let's build some features out of those common phrases.





In [None]:
## OK, so we are lucky in that someone's actually labelled some of our sample to highlight whether the comment 
## indicates a faulty/suspicious airbag or not. Let's read that in, clean out the NAs, and see what we have. 






In [None]:
## Now, we create our training features matrix





In [None]:
## Let's see what the breakdown is of suspicious vs. not in this dataset. 




### Building your logistic regression model

In [None]:
# Import sklearn, and follow the steps from Titanic. 





In [None]:
# Now, find the most important "features", and their corresponding coefficients. 


In [None]:
# And see what our classifier scores




In [None]:
# Now, let's look at another way to do training vs. test data. 



In [None]:
## As before, build out the confusion matrix. 





In [None]:
# I want to start here next class, so I'll leave this here. It should be familiar to you from linear regression,
# but we need to talk about interpreting coefficients for logistic regression, and about what the odds ratio,
# the very premise of logistic regression, is.

import statsmodels.api as sm

X = features.drop(columns='is_suspicious')
X = sm.add_constant(X)
y = features.is_suspicious

model = sm.Logit(y, X)
results = model.fit(method='lbfgs')
results.summary()
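
As a preview of that discussion: a logistic regression coefficient is a change in log-odds, so exponentiating it gives an odds ratio. The sketch below uses made-up numbers, not output from the model above, to check that claim by hand for a hypothetical 0/1 feature.

```python
import math

# Made-up values for illustration; not taken from the fitted model.
intercept = -2.0
coef = 1.5  # coefficient on a hypothetical 0/1 feature such as "shrapnel"


def probability(log_odds):
    """Convert log-odds to a probability with the logistic function."""
    return 1 / (1 + math.exp(-log_odds))


def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)


p_without = probability(intercept)      # feature = 0
p_with = probability(intercept + coef)  # feature = 1

# The ratio of the two odds equals exp(coef).
odds_ratio = odds(p_with) / odds(p_without)

print(round(odds_ratio, 3), round(math.exp(coef), 3))
```

With the fitted statsmodels results, the same quantity for every feature should be available by exponentiating `results.params`, for example with `np.exp(results.params)` after importing NumPy as `np`.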
