__Title:__ Mini-Lab: Logistic Regression and SVMs  
__Authors:__ Butler, Derner, Holmes, Traxler  
__Date:__ 1/22/23 

## Rubric
You are to perform predictive analysis (classification) upon a data set: model the dataset using 
methods we have discussed in class: logistic regression & support vector machines and making 
conclusions from the analysis. Follow the CRISP-DM framework in your analysis (you are not 
performing all of the CRISP-DM outline, only the portions relevant to the grading rubric outlined 
below). This report is worth 10% of the final grade. You may complete this assignment in teams 
of as many as three people. 
Write a report covering all the steps of the project. The format of the document can be PDF, 
*.ipynb, or HTML. You can write the report in whatever format you like, but it is easiest to turn in 
the rendered Jupyter notebook. The results should be reproducible using your report. Please 
carefully describe every assumption and every step in your report.
A note on grading: A common mistake I see in this lab is not investigating different input 
parameters for each model. Try a number of parameter combinations and discuss how the 
model changed. 
SVM and Logistic Regression Modeling 
 - [50 points] Create a logistic regression model and a support vector machine model for the 
classification task involved with your dataset. Assess how well each model performs (use 
80/20 training/testing split for your data). Adjust parameters of the models to make them 
more accurate. If your dataset size requires the use of stochastic gradient descent, then 
linear kernel only is fine to use. That is, the SGDClassifier is fine to use for optimizing 
logistic regression and linear support vector machines. For many problems, SGD will be 
required in order to train the SVM model in a reasonable timeframe. 
 - [10 points] Discuss the advantages of each model for each classification task. Does one 
type of model offer superior performance over another in terms of prediction accuracy? In 
terms of training time or efficiency? Explain in detail. 
 - [30 points] Use the weights from logistic regression to interpret the importance of different 
features for the classification task. Explain your interpretation in detail. Why do you think 
some variables are more important? 
 - [10 points] Look at the chosen support vectors for the classification task. Do these provide 
any insight into the data? Explain. If you used stochastic gradient descent (and therefore did 
not explicitly solve for support vectors), try subsampling your data to train the SVC model—
then analyze the support vectors from the subsampled dataset.

__CRISP-DM__
 - Business understanding – What does the business need?
 - Data understanding – What data do we have / need? Is it clean?
 - Data preparation – How do we organize the data for modeling?
 - Modeling – What modeling techniques should we apply?
 - Evaluation – Which model best meets the business objectives?
 - Deployment – How do stakeholders access the results?

 Source: [Hotz, 2023](https://www.datascience-pm.com/crisp-dm-2/)



__Business Understanding__  
What features are most important in predicting which flights will be delayed?

__Data Understanding__  
We have a dataset of 200,000 flights operated by US carriers in 2021, with roughly 60 features. Approximately 33% of flights are delayed. Some features contain null values, but most of these belong to columns that will be removed because they would not be knowable prior to the flight. We will prune the dataset to include only knowable features. Some features are highly correlated because they represent very similar things (e.g. scheduled departure time (CRSDepTime) vs. actual departure time (DepTime)). Most of these will be removed because they are not knowable; among the rest, we will keep the feature with the most complete data. The remaining issue to address is multicollinearity. The majority of correlated features will be removed in the steps above, and any that remain will again be chosen by the completeness of their data. Any observations that still have incomplete data will be removed.

__Data Preparation__  
All categorical features will require one-hot encoding. Some will not be usable because they have too many levels relative to the number of observations available.

The continuous features will be normalized to reduce the influence of features with large values.
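The preparation described above can be sketched with scikit-learn's `ColumnTransformer`, which scales the continuous columns and one-hot encodes the categorical ones in one step. This is a minimal sketch, not the pipeline used later in this notebook: the toy rows are invented for illustration, only the column names come from the flight data, and `StandardScaler` is an assumed choice of normalization.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows for illustration only; the values are invented, not taken from the flight data.
toy_df = pd.DataFrame({
    "CRSDepTime": [600, 930, 1545, 2010],
    "Distance": [250, 1200, 640, 2475],
    "DOT_ID_Marketing_Airline": ["A", "B", "A", "C"],
})

cont_cols = ["CRSDepTime", "Distance"]
cat_cols = ["DOT_ID_Marketing_Airline"]

# Scale the continuous columns to zero mean and unit variance,
# and one-hot encode the categorical column (dropping the first level).
preprocessor = ColumnTransformer([
    ("scale", StandardScaler(), cont_cols),
    ("onehot", OneHotEncoder(drop="first"), cat_cols),
])

X_toy = preprocessor.fit_transform(toy_df)
print(X_toy.shape)  # 2 scaled columns + (3 levels - 1) indicator columns
```

Keeping the two column groups separate means large-valued features such as `Distance` are rescaled instead of being expanded into one indicator column per distinct value.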

__Modeling__  
We will be comparing Logistic Regression and Support Vector Machines in this notebook. Each model will use the same training and testing datasets.

__Evaluation__  
The overall performance of each model will be evaluated by its accuracy, sensitivity, and specificity. We have concluded that it is more important to correctly identify flights that are truly delayed. For this reason, we will use sensitivity as the primary metric to compare and evaluate each model.
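For reference, the three metrics can be computed directly from the confusion-matrix counts. The sketch below uses made-up labels, with 1 meaning delayed as in this notebook; `binary_metrics` is a hypothetical helper name, and it assumes both classes appear in `y_true`.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, sensitivity, and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # delayed, predicted delayed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # on time, predicted on time
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # on time, predicted delayed
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # delayed, predicted on time
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # recall for the delayed class
        "specificity": tn / (tn + fp),  # recall for the on-time class
    }


# Invented labels for illustration: 3 delayed flights, 5 on-time flights.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

In a scikit-learn `classification_report`, sensitivity corresponds to the recall of class 1 and specificity to the recall of class 0.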

__Deployment__  
The findings of our study, including important features and their weights, can be found in the conclusion section of this rendered Jupyter notebook.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn import svm
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn import metrics

pd.set_option('display.max_columns', None)

In [2]:
# Dataset
url = 'https://github.com/cdholmes11/MSDS-7331-ML1-Labs/blob/main/Lab-1_Visualization_DataPreprocessing/data/Combined_Flights_2021_sample.csv?raw=true'
flight_data_df = pd.read_csv(url, encoding = "utf-8")

In [3]:
# New Features
flight_data_df['Delayed'] = np.where(flight_data_df['DepDelayMinutes'] > 0, 1, 0)
flight_data_df['Dif_Oper'] = np.where(flight_data_df['DOT_ID_Marketing_Airline'] == flight_data_df['DOT_ID_Operating_Airline'], 0, 1)

The original dataset doesn't have a feature for binary classification of delayed status. We've added one based on the DepDelayMinutes field. Delayed flights are coded as 1 and not delayed as 0.

Marketing Airline and Operating Airline are often the same, so one will likely be dropped from the model due to multicollinearity. To retain the information from both columns, we've added a binary feature that flags flights marketed by one airline and operated by another. Flights whose operating airline differs from the marketing airline are coded as 1; flights where the two match are coded as 0.

In [4]:
# Dataset Shape
flight_data_df.shape

(200000, 64)

In [5]:
# Delayed Frequency
delayed_df = pd.DataFrame(flight_data_df['Delayed'].value_counts()).reset_index()
delayed_df.columns = ['Delayed', 'Count']
delayed_df['Frequency'] = round(delayed_df['Count'] / sum(delayed_df['Count']) * 100, 2)
delayed_df

Unnamed: 0,Delayed,Count,Frequency
0,0,134697,67.35
1,1,65303,32.65


In [6]:
# Knowable features grouped by data type
cat_features = ['Airline', 'Dest', 'DestAirportID', 'DestAirportSeqID', 'DestCityMarketID', 'DestCityName', 'DestState', 'DestStateFips', 'DestStateName', 'DestWac', 'DOT_ID_Marketing_Airline', 'DOT_ID_Operating_Airline', 'Flight_Number_Marketing_Airline', 'Flight_Number_Operating_Airline', 'IATA_Code_Marketing_Airline', 'IATA_Code_Operating_Airline', 'Marketing_Airline_Network', 'Operated_or_Branded_Code_Share_Partners', 'Operating_Airline', 'Origin', 'OriginAirportID', 'OriginAirportSeqID', 'OriginCityMarketID', 'OriginCityName', 'OriginState', 'OriginStateFips', 'OriginStateName', 'OriginWac', 'Tail_Number']
cont_features = ['CRSArrTime', 'CRSDepTime', 'CRSElapsedTime', 'Distance']
ord_features = ['Quarter', 'Month', 'DayofMonth', 'DayOfWeek', 'Delayed', 'Dif_Oper']
date_feature = ['FlightDate']

In [7]:
# Trimmed Dataset to knowable features
flight_delay_df = flight_data_df[cat_features + cont_features + ord_features + date_feature]
flight_delay_df.shape

(200000, 40)

In [8]:
# Correlation Plot using Plotly
flight_corr = flight_delay_df.corr(numeric_only=True)

fig = go.Figure()

fig.add_trace(
    go.Heatmap(
        x = flight_corr.columns,
        y = flight_corr.index,
        z = np.array(flight_corr),
        text=flight_corr.values,
        texttemplate='%{text:.2f}'  # show each correlation rounded to two decimals
    )
)

fig.update_layout(
    title='Airline Feature Correlation',
    autosize=False,
    width=1000,
    height=600
)

fig.show()


In [9]:
# Features converted to corresponding group type
flight_data_df[cat_features] = flight_data_df[cat_features].astype('category')
flight_data_df[ord_features] = flight_data_df[ord_features].astype(np.int64)
flight_data_df[cont_features] = flight_data_df[cont_features].astype(np.int64)
flight_data_df['FlightDate'] = pd.to_datetime(flight_data_df['FlightDate']).dt.date

In [10]:
# Tail Number Evaluation
flight_delay_df['Tail_Number'].describe()

count     199226
unique      5727
top       N491HA
freq         104
Name: Tail_Number, dtype: object

In [11]:
# Tail Number Delay Frequency
tail_df = pd.pivot_table(flight_delay_df, values='Distance', index =['Tail_Number'], aggfunc = 'count', columns='Delayed').reset_index()
tail_df.columns = ['Tail_Number', 'On-Time', 'Delayed']
tail_df['Total'] = tail_df['Delayed'] + tail_df['On-Time']
tail_df['Frequency'] = round((tail_df['Delayed'] / tail_df['Total'] * 100),2)

tail_df[tail_df['Total'].notnull()].sort_values(by=['Delayed'], ascending=False)

Unnamed: 0,Tail_Number,On-Time,Delayed,Total,Frequency
4717,N8658A,41.0,47.0,88.0,53.41
4697,N8642E,22.0,44.0,66.0,66.67
5386,N935WN,15.0,44.0,59.0,74.58
4629,N8602F,19.0,43.0,62.0,69.35
4481,N8501V,28.0,42.0,70.0,60.00
...,...,...,...,...,...
3842,N7718B,3.0,1.0,4.0,25.00
3836,N77066,1.0,1.0,2.0,50.00
2481,N511MJ,7.0,1.0,8.0,12.50
4376,N833AA,3.0,1.0,4.0,25.00


Many categorical features have a corresponding numeric feature. Wherever possible, we've decided to utilize the numeric representation of these features.


In [12]:
# Features to remove
cat_remove = ['Airline', 'Dest', 'DestCityName', 'DestState', 'DestStateName', 
    'DOT_ID_Operating_Airline', 'Flight_Number_Operating_Airline', 'IATA_Code_Marketing_Airline',
    'IATA_Code_Operating_Airline', 'Marketing_Airline_Network', 'Operated_or_Branded_Code_Share_Partners',
    'Operating_Airline', 'Origin', 'OriginCityName', 'OriginState', 'OriginStateName', 'DestAirportSeqID',
    'OriginAirportSeqID', 'Tail_Number', 'DestCityMarketID']
cont_remove = ['CRSArrTime', 'CRSElapsedTime']
ord_remove = ['Quarter']

In [13]:
# Removing features from category groups
for n in cat_remove:
    if n in cat_features:
        cat_features.remove(n)

for n in cont_remove:
    if n in cont_features:
        cont_features.remove(n)

for n in ord_remove:
    if n in ord_features:
        ord_features.remove(n)


In [14]:
# Creating new dataset with updated feature lists
flight_delay_df = flight_data_df[cat_features + cont_features + ord_features]
flight_delay_df.shape

(200000, 16)

Tail Number has over 5,700 unique values. One-hot encoding Tail Number would add too many features to the dataset for the given observation count. Additionally, future datasets will likely contain different tail numbers. Practically, this makes tail number an unrealistic feature to include in the model. For the curious, we've provided a table of tail numbers sorted by the number of times delayed, along with the relative frequency of delay for each. There are clearly some aircraft that are much more likely to be delayed than others.
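One quick way to gauge the cost of a high-cardinality column is to count its distinct levels before encoding. The sketch below uses an invented toy frame (the tail numbers are made up) and assumes the `drop='first'` encoding used later in this notebook, where each column contributes one indicator per level minus one.

```python
import pandas as pd

# Toy frame for illustration only; the tail numbers are invented.
toy_df = pd.DataFrame({
    "Tail_Number": ["N100AA", "N200BB", "N300CC", "N400DD", "N100AA"],
    "DayOfWeek": [1, 2, 1, 2, 3],
})

# Number of distinct levels per column.
levels = toy_df.nunique()

# With drop='first', each column adds (levels - 1) indicator columns.
onehot_cols = int((levels - 1).sum())

print(levels)
print("Indicator columns after encoding:", onehot_cols)
```

Applied to the real data, the same count for `Tail_Number` alone would be in the thousands, which is the concern raised above.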

In [15]:
flight_delay_df[cat_features].describe()

Unnamed: 0,DestAirportID,DestStateFips,DestWac,DOT_ID_Marketing_Airline,Flight_Number_Marketing_Airline,OriginAirportID,OriginCityMarketID,OriginStateFips,OriginWac
count,200000,200000,200000,200000,200000,200000,200000,200000,200000
unique,378,53,53,10,6495,377,353,53,53
top,10397,48,74,19805,678,10397,30977,48,74
freq,10044,21972,21972,52560,95,9822,11169,22065,22065


In [16]:
flight_delay_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 16 columns):
 #   Column                           Non-Null Count   Dtype   
---  ------                           --------------   -----   
 0   DestAirportID                    200000 non-null  category
 1   DestStateFips                    200000 non-null  category
 2   DestWac                          200000 non-null  category
 3   DOT_ID_Marketing_Airline         200000 non-null  category
 4   Flight_Number_Marketing_Airline  200000 non-null  category
 5   OriginAirportID                  200000 non-null  category
 6   OriginCityMarketID               200000 non-null  category
 7   OriginStateFips                  200000 non-null  category
 8   OriginWac                        200000 non-null  category
 9   CRSDepTime                       200000 non-null  int64   
 10  Distance                         200000 non-null  int64   
 11  Month                            200000 non-null  in

In [17]:
# Separate features (X) from the target (Y)
X = flight_delay_df.drop(['Delayed'], axis=1)
Y = flight_delay_df['Delayed']

In [18]:
# One Hot Encoding
onehot_encoder = OneHotEncoder(drop='first', sparse_output=True)
label_encoder = LabelEncoder()
X = onehot_encoder.fit_transform(X)
Y = label_encoder.fit_transform(Y)

In [19]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=110)

## Logistic Regression

In [20]:
# Basic Logistic Regression
logmod_opt = LogisticRegression(random_state=10)

In [21]:
# Basic Logistic Regression
logmod = LogisticRegression(random_state=10, max_iter=5000)
log_fit = logmod.fit(X_train, Y_train)
logmod_pred = logmod.predict(X_test)

# Feature selection
logmod2 = LogisticRegression(random_state=10, max_iter=5000)
logmod2.fit(X_train, Y_train)

model = SelectFromModel(logmod2, max_features=500, prefit=True, threshold='median')
X_train_new = model.transform(X_train)
X_test_new = model.transform(X_test)

log_fit2 = logmod2.fit(X_train_new, Y_train)
logmod_pred2 = logmod2.predict(X_test_new)

# Model scores (accuracy on the training set)
print('LogisticRegression')
print('Basic Score: ', logmod.score(X_train, Y_train))
print('Feature Selection Score: ', logmod2.score(X_train_new, Y_train))

# Classification reports on the test set (basic model, then feature-selected model)
print(classification_report(Y_test, logmod_pred))
print(classification_report(Y_test, logmod_pred2))

LogisticRegression
Basic Score:  0.740075
Feature Selection Score:  0.702675
              precision    recall  f1-score   support

           0       0.73      0.87      0.80     26936
           1       0.56      0.34      0.42     13064

    accuracy                           0.70     40000
   macro avg       0.65      0.61      0.61     40000
weighted avg       0.68      0.70      0.67     40000

              precision    recall  f1-score   support

           0       0.71      0.92      0.80     26936
           1       0.57      0.24      0.33     13064

    accuracy                           0.69     40000
   macro avg       0.64      0.58      0.57     40000
weighted avg       0.67      0.69      0.65     40000



In [86]:
# Grid Search Param
param_grid = [    
    {'penalty' : ['l1', 'l2', 'elasticnet', 'none'],
    'C' : [0.1, 1, 10, 100, 1000],
    'solver' : ['lbfgs','newton-cg','liblinear','sag','saga'],
    'max_iter' : [100, 1000,2500, 5000]
    }
]

In [87]:
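# Optimized Logistic Regression
# Note: scoring='precision' ranks candidates by precision for the delayed class, not by sensitivity (recall)
clf = GridSearchCV(logmod2, param_grid = param_grid, cv=None , verbose=False, n_jobs=-1, scoring='precision')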

In [88]:
best_clf = clf.fit(X_train_new,Y_train)



900 fits failed out of a total of 2000.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
100 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\corey\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\corey\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\sklearn\linear_model\_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\Users\corey\AppData\Local\Packages\Python
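The failed fits reported above are consistent with solver and penalty pairs that `LogisticRegression` does not support: `lbfgs`, `newton-cg`, and `sag` accept only an L2 penalty or no penalty, `liblinear` accepts `l1` and `l2`, and `elasticnet` requires `saga` together with an `l1_ratio`. A sketch of a grid restricted to valid pairs is shown below. The `l1_ratio` values and the `count_candidates` helper are illustrative assumptions, and the unpenalized option is left out for brevity.

```python
from itertools import product

C_values = [0.1, 1, 10, 100, 1000]

# Each dict pairs a penalty only with solvers that accept it.
valid_param_grid = [
    {"penalty": ["l2"],
     "solver": ["lbfgs", "newton-cg", "liblinear", "sag", "saga"],
     "C": C_values},
    {"penalty": ["l1"],
     "solver": ["liblinear", "saga"],
     "C": C_values},
    # elasticnet needs saga and an l1_ratio; these ratios are illustrative.
    {"penalty": ["elasticnet"],
     "solver": ["saga"],
     "l1_ratio": [0.25, 0.5, 0.75],
     "C": C_values},
]


def count_candidates(grid):
    """Count the parameter combinations described by a list-of-dicts grid."""
    return sum(len(list(product(*block.values()))) for block in grid)


n_candidates = count_candidates(valid_param_grid)
print(n_candidates)
```

`valid_param_grid` is in the list-of-dicts form that `GridSearchCV` accepts as `param_grid`, so every candidate it generates can actually be fit. `max_iter` can be set once on the estimator instead of being searched.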

In [89]:
print(best_clf.best_estimator_)
print(best_clf.best_score_)
# print(best_clf.best_estimator_.feature_importance())

LogisticRegression(C=1, random_state=10, solver='liblinear')
0.6062832634878605


In [48]:
logmod.score(X_train, Y_train)

0.72345

In [49]:
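# The search was fit on the selected features, so score on X_test_new; score() returns precision here
best_clf.score(X_test_new, Y_test)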

0.6218238774799861

In [52]:
print(classification_report(Y_test, best_clf.best_estimator_.predict(X_test_new)))
print(classification_report(Y_test, logmod_pred))

              precision    recall  f1-score   support

           0       0.72      0.92      0.81     26936
           1       0.62      0.27      0.38     13064

    accuracy                           0.71     40000
   macro avg       0.67      0.60      0.59     40000
weighted avg       0.69      0.71      0.67     40000

              precision    recall  f1-score   support

           0       0.73      0.90      0.80     26936
           1       0.60      0.32      0.42     13064

    accuracy                           0.71     40000
   macro avg       0.66      0.61      0.61     40000
weighted avg       0.69      0.71      0.68     40000



## Support Vector Machines

In [23]:
svc_model = svm.SVC()

svc_model.fit(X_train, Y_train)

# Predict on the test set with the fitted SVC
y_pred = svc_model.predict(X_test)

In [None]:
print("Accuracy:", metrics.accuracy_score(Y_test, y_pred))
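Kernel `SVC` training scales poorly with the number of rows, and the rubric allows `SGDClassifier` for a linear SVM plus a subsampled `SVC` for inspecting support vectors. The sketch below illustrates that workflow on synthetic data from `make_classification`; the sample sizes, `alpha`, and the subsample of 500 rows are assumptions for illustration, not settings tuned for the flight data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the flight data (assumption: 2,000 rows, 10 features).
X_syn, y_syn = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

# Linear SVM trained with stochastic gradient descent (hinge loss).
sgd_svm = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, tol=1e-3, random_state=0)
sgd_svm.fit(X_tr, y_tr)
sgd_acc = sgd_svm.score(X_te, y_te)

# Subsample the training rows, then fit SVC so the support vectors are available.
rng = np.random.default_rng(0)
idx = rng.choice(X_tr.shape[0], size=500, replace=False)
svc_sub = SVC(kernel="linear").fit(X_tr[idx], y_tr[idx])

print("SGD accuracy:", sgd_acc)
print("Support vectors per class:", svc_sub.n_support_)
print("Support vector matrix shape:", svc_sub.support_vectors_.shape)
```

`support_` holds the row indices of the support vectors within the subsample, so they can be mapped back to the original observations for inspection.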

## Model Advantages

## Feature Importance

## SVC Insights

## Sources

1. Hotz, N. (2023, January 19). What is CRISP DM? Data Science Process Alliance. https://www.datascience-pm.com/crisp-dm-2/