# Exercise 5

The CSV file "classdata/Aircraft Incidents.csv" contains the investigation reports by the National Transportation Safety Board (NTSB) for all airline incidents in 2011.

- Column "Final Narrative" gives the final description of each accident and its probable cause.
- Column "Fatal" indicates whether each accident involved any fatality.

The following code reads the data and shows the frequencies of the class labels.

In [1]:
import pandas as pd
df=pd.read_csv("classdata/Aircraft Incidents.csv")
df.Fatal.value_counts()

Non-Fatal    1554
Fatal         352
Name: Fatal, dtype: int64

1. Partition **df** into training (70%) and testing (30%) sets as new dataframes. Convert column "Final Narrative" in each set to a document-term matrix (DTM) that meets the following requirements:

    - Use the default tokenizer from the sklearn library.
    - Remove the stop words in the nltk English stop word list.
    - Do not stem the terms.
    - Build the DTM with TF-IDF weights using bigrams.
    - Normalize the rows of the DTM so that each row sums to one. (*Hint: set norm="l1"*)
    
 Save your DTMs as **train_x** and **test_x**. Print the shape of train_x and test_x.

In [5]:
#Your answer here:
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
df_train, df_test = train_test_split(df, test_size=0.30, random_state=2021)
df_train.reset_index(drop=True,inplace=True)
df_test.reset_index(drop=True,inplace=True)

# Requires the NLTK stop word corpus; run nltk.download("stopwords") once if it is missing
nltk_stopwords = nltk.corpus.stopwords.words("english")

vectorizer=TfidfVectorizer(stop_words=nltk_stopwords, norm="l1",ngram_range=(2,2))

train_x = vectorizer.fit_transform(df_train['Final Narrative'])
test_x = vectorizer.transform(df_test["Final Narrative"])

#Check your answer
print(train_x.shape)
print(test_x.shape)

(1334, 71144)
(572, 71144)
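
To see what the `norm="l1"` requirement does, the following is a minimal sketch on a toy corpus. The three sentences and the names `toy_docs`, `toy_vectorizer`, and `toy_dtm` are made up for illustration and are not part of the NTSB data. Because TF-IDF weights are non-negative, L1 normalization makes every row of the DTM sum to one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy narratives, not taken from the NTSB data.
toy_docs = [
    "the pilot lost engine power",
    "the pilot made a forced landing",
    "engine power was lost during landing",
]

# Bigrams only; each row is rescaled so that its entries sum to 1 (L1 norm).
toy_vectorizer = TfidfVectorizer(norm="l1", ngram_range=(2, 2))
toy_dtm = toy_vectorizer.fit_transform(toy_docs)

# Sum every row of the sparse matrix and flatten the result to a 1-D array.
row_sums = np.asarray(toy_dtm.sum(axis=1)).ravel()
print(toy_dtm.shape)
print(row_sums)
```

The same check (`train_x.sum(axis=1)`) can be run on the real DTM to confirm the normalization.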


2. Create a sparse logistic regression model with $C=100$ to predict column "Fatal" using the DTM created from column "Final Narrative". You can set the tolerance and the maximum number of iterations to any values, as long as the warning message "failed to converge" does not show up. Set random_state to 2021 for reproducibility. Save your model as **sparselr**. 

In [6]:
#Your answer here:
train_y = df_train['Fatal']
test_y = df_test['Fatal']

from sklearn.linear_model import LogisticRegression

sparselr = LogisticRegression(penalty='l1', 
                              solver='liblinear',
                              random_state=2021,
                              tol=0.0001,
                              max_iter=1000, 
                              C=100)
sparselr.fit(train_x,train_y)

#Check your answer:
sparselr
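
The model is "sparse" because the L1 penalty can set coefficients exactly to zero, and a smaller `C` means a stronger penalty. The sketch below illustrates this on synthetic data, which is an assumption for illustration only; `X_toy`, `y_toy`, and `nnz_by_C` are hypothetical names, and the exact counts depend on the generated data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustration only): 50 features, of which 5 are informative.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=50, n_informative=5,
    n_redundant=0, random_state=2021,
)

# Fit an L1-penalized model with a strong penalty (C=0.01) and a weak one (C=100),
# and count how many coefficients are left non-zero in each case.
nnz_by_C = {}
for C in (0.01, 100):
    model = LogisticRegression(penalty="l1", solver="liblinear",
                               C=C, max_iter=1000, random_state=2021)
    model.fit(X_toy, y_toy)
    nnz_by_C[C] = int(np.count_nonzero(model.coef_[0]))

print(nnz_by_C)
```

With $C=100$ the penalty is weak, so the sparsity seen in the exercise comes mostly from the very large vocabulary relative to the number of documents.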

3. Print the number of non-zero betas in **sparselr** from question 2. Given the vocabulary size shown in question 1, what percentage of the betas are non-zero? You can do the calculation manually and type your answer as a comment in the code.  

In [10]:
#Your answer here:

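# Fraction (not yet a percentage) of coefficients that are non-zero
fraction = sum(sparselr.coef_[0]!=0) / len(sparselr.coef_[0])
print(sum(sparselr.coef_[0]!=0))
print(len(sparselr.coef_[0]))
print(fraction)

#What percentage of betas are non-zero? 
#Your answer here:
# 184 / 71144 = 0.0026, so about 0.26% of the betas are non-zero.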

184
71144
0.0025863038344765546
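
The printed value is a fraction, so it has to be multiplied by 100 to read it as a percentage. A minimal sketch of the same calculation, using a made-up coefficient vector `beta` (hypothetical) and then the two counts printed above:

```python
import numpy as np

# Hypothetical coefficient vector: 3 of the 8 entries are non-zero.
beta = np.array([0.0, -1.2, 0.0, 3.4, 0.0, 0.0, 0.7, 0.0])

n_nonzero = np.count_nonzero(beta)   # counts entries that are not exactly zero
pct = 100 * n_nonzero / beta.size    # convert the fraction to a percentage
print(n_nonzero, beta.size, pct)     # 3 8 37.5

# The same arithmetic with the counts reported in the output above.
reported_pct = 100 * 184 / 71144
print(round(reported_pct, 2))        # 0.26
```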


4. Print the 15 bigrams that have the largest impact on outcome "Fatal". 

*Hint: Which outcome is considered to be the positive class by the model? Non-Fatal or Fatal? Which outcome is considered to be negative by the model? Will you look for positive betas or negative betas? Will you sort betas in ascending or descending order?*

In [13]:
#Your answer here:
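# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
dfbeta = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),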
                       'Beta': sparselr.coef_[0]
                     })

dfbeta.sort_values(by="Beta",inplace=True,ascending=True)
dfbeta.reset_index(inplace=True,drop=True)

#Check your answer:
dfbeta.head(15)



|   | Term | Beta |
|---|------|------|
| 0 | obtained various | -280.171976 |
| 1 | told pilot | -266.410072 |
| 2 | fatally injured | -265.381918 |
| 3 | data obtained | -237.870818 |
| 4 | near vertical | -232.141632 |
| 5 | fatal injuries | -231.345805 |
| 6 | control continuity | -206.399608 |
| 7 | due loss | -196.969635 |
| 8 | revealed evidence | -196.245399 |
| 9 | pilot blood | -180.107276 |
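
The sort direction follows from how scikit-learn orders string labels: `classes_` is sorted, so "Fatal" comes first (the negative class) and "Non-Fatal" second (the positive class). Negative betas therefore push a prediction toward "Fatal". A minimal sketch on made-up one-feature data (`X_toy`, `y_toy`, and `clf` are hypothetical names):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: larger values of the single feature go with "Non-Fatal".
X_toy = np.array([[0.0], [0.2], [0.4], [1.6], [1.8], [2.0]])
y_toy = np.array(["Fatal", "Fatal", "Fatal",
                  "Non-Fatal", "Non-Fatal", "Non-Fatal"])

clf = LogisticRegression().fit(X_toy, y_toy)

# classes_[1] is the positive class that coef_ and predict_proba(...)[:, 1] refer to.
print(clf.classes_)
# The coefficient is positive because the feature raises the odds of "Non-Fatal".
print(clf.coef_[0])
```

Checking `sparselr.classes_` on the fitted model is a quick way to confirm the ordering before interpreting the signs.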


5. Print the 15 bigrams that have the largest impact on outcome "Non-Fatal".

*Hint: Same as question 4*

In [14]:
#Your answer here:
dfbeta.sort_values(by="Beta",inplace=True,ascending=False)
dfbeta.reset_index(inplace=True,drop=True)

#Check your answer:
dfbeta.head(15)

|   | Term | Beta |
|---|------|------|
| 0 | pilot performing | 331.102885 |
| 1 | data provided | 299.683184 |
| 2 | may traveled | 273.634377 |
| 3 | provided various | 262.316015 |
| 4 | forced landing | 185.639204 |
| 5 | landing gear | 175.112586 |
| 6 | lost power | 133.484921 |
| 7 | pattern altitude | 129.624603 |
| 8 | selector position | 126.844666 |
| 9 | student pilot | 106.672004 |


6. Print the prediction accuracy of **sparselr** on training and testing sets.

In [15]:
#Your answer here:
from sklearn.metrics import accuracy_score, roc_auc_score
print("Train:")
print(accuracy_score(train_y,sparselr.predict(train_x)))
print("Test:")
print(accuracy_score(test_y,sparselr.predict(test_x)))


Train:
0.9835082458770614
Test:
0.9423076923076923
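
Because the classes are imbalanced, it can help to compare these accuracies with a majority-class baseline. The sketch below uses the class counts printed at the top of the notebook, which cover the full dataset rather than the test split, so the figure is only a rough reference point; the variable names are hypothetical.

```python
# Class counts from df.Fatal.value_counts() on the full dataset.
n_non_fatal = 1554
n_fatal = 352

# Accuracy of a rule that always predicts the majority class ("Non-Fatal").
baseline_accuracy = n_non_fatal / (n_non_fatal + n_fatal)
print(round(baseline_accuracy, 3))  # 0.815
```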


7. Print the AUC score of **sparselr** on training and testing sets.

In [16]:
#Your answer here:
print("Train:")
print(roc_auc_score(train_y,sparselr.predict_proba(train_x)[:, 1]))
print("Test:")
print(roc_auc_score(test_y,sparselr.predict_proba(test_x)[:, 1]))


Train:
0.9992216874718003
Test:
0.9576149425287356
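
With string labels, `roc_auc_score` treats the label that sorts last ("Non-Fatal") as the positive class, which is why column 1 of `predict_proba` is the matching score. A minimal sketch with made-up labels and scores (all values below are hypothetical) shows that passing the score of the other class gives one minus the AUC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of "Non-Fatal".
y_true = np.array(["Fatal", "Non-Fatal", "Non-Fatal", "Fatal", "Non-Fatal"])
p_non_fatal = np.array([0.2, 0.9, 0.4, 0.6, 0.8])

# Score for the positive class ("Non-Fatal"): 5 of the 6 positive/negative
# pairs are ranked correctly, so the AUC is 5/6.
auc = roc_auc_score(y_true, p_non_fatal)

# Passing the probability of "Fatal" instead reverses the ranking.
auc_flipped = roc_auc_score(y_true, 1 - p_non_fatal)

print(auc, auc_flipped)
```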
