In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("regression.ipynb")

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
SEED = 3383

## 1. Regression

This dataset contains information on ticket prices for flights between six cities in India:

In [None]:
flights = pd.read_csv("flights.csv",index_col=0)
flights.head(5)

1.1. Create a boolean vector or series called `trips` that is `True` only for flights with class *Economy*, source city *Delhi*, and destination city *Mumbai*.
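
As a hedged illustration of combining several conditions into one boolean series, here is a sketch on a made-up frame. The column names `class`, `source_city`, and `destination_city` are assumptions about the dataset; check `flights.columns` for the actual names.

```python
import pandas as pd

# Toy frame with made-up rows; the column names are assumed, not taken from flights.csv.
toy = pd.DataFrame({
    "class": ["Economy", "Business", "Economy", "Economy"],
    "source_city": ["Delhi", "Delhi", "Mumbai", "Delhi"],
    "destination_city": ["Mumbai", "Mumbai", "Delhi", "Mumbai"],
})

# Combine elementwise conditions with & and wrap each comparison in parentheses.
mask = (
    (toy["class"] == "Economy")
    & (toy["source_city"] == "Delhi")
    & (toy["destination_city"] == "Mumbai")
)
print(mask.tolist())
```

The parentheses matter because `&` binds more tightly than `==` in Python.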

In [None]:
trips = ...


In [None]:
grader.check("flights-trips")

1.2. Prepare a feature matrix `X` for the rows in `trips` and the columns *departure_time*, *airline*, *duration*, *days_left*, and *stops*. In the *stops* column, replace `"zero"` with 0, `"one"` with 1, and `"two_or_more"` with 2.
   
Prepare a target vector `y` for *price* on the same rows.

In [None]:
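# Work on an explicit copy of the selected rows and columns
X = flights.loc[trips,["departure_time","airline","duration","days_left","stops"]].copy()
# Recode only the stops column, as the prompt describes
X["stops"] = X["stops"].replace({"zero":0, "one":1, "two_or_more":2})
y = flights.loc[trips,"price"]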

In [None]:
grader.check("flights-features")

1.3. Perform a linear regression for the price with the predictors *days_left* and *duration*. Find the coefficient of determination for the fit.
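
The following is a minimal sketch of fitting a linear regression and reading off its coefficient of determination. It uses synthetic data rather than the flights data, and the names `X_toy`, `y_toy`, and `toy_lm` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic predictors and a noisy linear response (not the flights data).
X_toy = rng.normal(size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

toy_lm = LinearRegression()
toy_lm.fit(X_toy, y_toy)

# score() returns the coefficient of determination on the data passed to it.
r2 = toy_lm.score(X_toy, y_toy)
print(r2, toy_lm.coef_)
```

The sign of each entry in `coef_` indicates whether the fitted price rises or falls as that predictor increases with the other held fixed.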

In [None]:
CofD = ...

In [None]:
grader.check("regress-twovars")

1.4. According to the fitted coefficients, is an increase in *days_left* associated with an increase in the price? Is an increase in *duration*? Answer `True` or `False` for each variable.

In [None]:
# True or False in each case
increasing_days_left_increases_price = False   # SOLUTIION
increasing_duration_increases_price = True   # SOLUTIION

1.5. Create a new frame `Xdum` that replaces the *airline* and *departure_time* features with dummy variables. **Use `drop_first=True` when creating the dummies.** This option replaces a category of $k$ unique values with $k-1$ variables, which removes the redundant column.

Then retrain the linear regressor using `Xdum` and compute the new coefficient of determination.
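
To see what `drop_first=True` does, here is a small sketch on a made-up frame with a hypothetical three-valued `airline` column.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (made-up values).
toy = pd.DataFrame({
    "airline": ["A", "B", "C", "A"],
    "duration": [2.0, 2.5, 3.0, 2.2],
})

# Three categories become two dummy columns; the first category ("A") is dropped.
dummies = pd.get_dummies(toy, columns=["airline"], drop_first=True)
print(dummies.columns.tolist())
```

Because category `A` has no column of its own, it acts as the baseline: a regression coefficient on `airline_B` or `airline_C` is measured relative to `A`.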

In [None]:
Xdum = pd.get_dummies(X,columns=["airline","departure_time"],drop_first=True)  # SOLUTIION 
lm.fit(Xdum,y)  # SOLUTIION NO PROMPT
CofD_dum = lm.score(Xdum,y)  # SOLUTIION

In [None]:
grader.check("regress-dummies")

1.6. Which airline, with all other features held fixed, is associated with the greatest increase in the price? All else being equal, which departure time is associated with the lowest price?

In [None]:
# Use a string value as the answer:
airline_biggest_increase = "Vistara"  # SOLUTIION
time_biggest_decrease = "Night"       # SOLUTIION

In [None]:
grader.check("regress-effects")

1.7. Use LASSO on the feature matrix with dummies and increase the regularization parameter until one of the coefficients is dropped. Which column corresponds to the dropped coefficient?
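
One possible way to search for the first dropped coefficient is to sweep the regularization strength upward and stop at the first exact zero. The sketch below does this on synthetic data. Standardizing the columns first and the particular grid of `alpha` values are assumptions made for the illustration, not requirements of the question.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic data: the third column has only a weak effect on the response.
X_toy = rng.normal(size=(300, 3))
y_toy = (5.0 * X_toy[:, 0] + 2.0 * X_toy[:, 1] + 0.05 * X_toy[:, 2]
         + rng.normal(scale=0.1, size=300))

# Assumption: put the columns on a common scale before penalizing them.
X_std = StandardScaler().fit_transform(X_toy)

dropped = None
for alpha in np.logspace(-3, 1, 40):          # increasing regularization
    coef = Lasso(alpha=alpha).fit(X_std, y_toy).coef_
    zero = np.flatnonzero(coef == 0)          # LASSO sets coefficients exactly to zero
    if zero.size > 0:
        dropped = int(zero[0])
        break

print(dropped, alpha)
```

With a data frame such as `Xdum`, the integer position found this way can be used to index `Xdum.columns`.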

In [None]:
first_dropped = Xdum.columns[idx]  # SOLUTIION

In [None]:
grader.check("regress-lasso")

1.8. Use a decision tree regressor on the feature matrix with dummy variables, and compute its coefficient of determination score. Also, determine which feature is deemed to be most important by the regressor.
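
As a rough sketch of the mechanics, the example below fits a decision tree regressor to synthetic data and finds the position of the most important feature. The depth limit and `random_state` are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)

# Synthetic data in which only the middle column drives the response.
X_toy = rng.uniform(size=(400, 3))
y_toy = 10.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=400)

tree = DecisionTreeRegressor(max_depth=4, random_state=0)
tree.fit(X_toy, y_toy)

r2 = tree.score(X_toy, y_toy)                      # coefficient of determination
top = int(np.argmax(tree.feature_importances_))    # position of the top feature
print(r2, top)
```

Note that a tree with no depth limit can fit its training data almost perfectly, so a training-set score close to 1 is not by itself evidence of a good model.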

In [None]:
CofD_dtr = ...
top_feature = ...

In [None]:
grader.check("flight-dtree")

## 3. Probabilistic classification

This dataset contains the results of passenger satisfaction surveys on U.S. airlines. 

In [None]:
satisfaction = pd.read_csv("passenger_satisfaction.csv",index_col=0).dropna()
satisfaction.head(6)

Some of the columns are features that can be deduced independently of surveys, at least in principle. We will separate the two types of features for the analysis that follows.

In [None]:
objective = ['Gender', 'Customer Type', 'Age', 'Class',
        'Flight Distance','Departure Delay in Minutes', 'Arrival Delay in Minutes',]
subjective = ['Type of Travel','Inflight wifi service',
       'Departure/Arrival time convenient', 'Ease of Online booking',
       'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
       'Inflight entertainment', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Inflight service',
       'Cleanliness']
target = 'satisfaction'

3.1. Create a feature matrix `XO` for just the `objective` features, with dummy variables, again using `drop_first=True`. Also make a label series `y` for the *satisfaction* column. Split off 20% of the data into a test set, using a shuffle with random state `SEED`.
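
The split described above can be sketched on toy arrays as follows. The array contents are made up, and `SEED` repeats the value defined at the top of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 3383  # same value as defined at the top of the notebook

# Toy data: 10 samples with 2 features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Hold out 20% of the rows; a fixed random_state makes the shuffle reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, shuffle=True, random_state=SEED
)
print(X_tr.shape, X_te.shape)
```

Features and labels are passed together so that the same rows end up in the training and test portions of both.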

In [None]:
from sklearn.model_selection import train_test_split  # SOLUTIION NO PROMPT
XO = ...
y = ...
...

In [None]:
grader.check("satisfy-objective")

3.2. Using the objective-only training data, perform a logistic regression with the option `penalty="none"` (written `penalty=None` in scikit-learn 1.2 and later), which disables regularization. Make sure to use a pipeline with column standardization.

Using the test set, find the AUC-ROC score of the regression.
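
Here is a hedged sketch of a standardize-then-classify pipeline and its AUC-ROC score, on synthetic data with made-up labels. To stay independent of the scikit-learn version, it leaves `LogisticRegression` at its default (regularized) settings; the penalty option from the question would be passed to that constructor.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Synthetic features on very different scales, with string class labels.
X_toy = rng.normal(size=(500, 2)) * np.array([1.0, 100.0])
signal = X_toy[:, 0] + X_toy[:, 1] / 100.0 + rng.normal(scale=0.5, size=500)
y_toy = np.where(signal > 0, "satisfied", "neutral or dissatisfied")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# The scaler is fit on the training data only, then reused on the test data.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

# Pick the probability column that belongs to the "satisfied" class.
pos_col = list(pipe.classes_).index("satisfied")
p_hat = pipe.predict_proba(X_te)[:, pos_col]
auc = roc_auc_score(y_te == "satisfied", p_hat)
print(auc)
```

Looking up the column through `classes_` avoids guessing which column of `predict_proba` holds the positive class.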

In [None]:
lr_obj = ...
AUC_obj = ...


In [None]:
grader.check("satisfy-objective-logr")

<!-- BEGIN QUESTION -->

3.3. Plot the ROC curve of the regressor for detection of the state *satisfied* for **the first 1000 members** of the test set. 

In [None]:
# BEGIN SOLUTIION NO PROMPT
from sklearn.metrics import roc_curve
p_hat = lr_obj.predict_proba(XO_test)
# Column 1 of p_hat is assumed to hold the probability of "satisfied"
FP,TP,theta = roc_curve(y_test[:1000],p_hat[:1000,1],pos_label="satisfied")
sns.relplot(x=FP,y=TP,kind="line");
# END SOLUTIION NO PROMPT

<!-- END QUESTION -->

3.4. Now use the objective and subjective features together, converting to dummies and splitting into train/test as before.

Perform logistic regression with $C=0.01$, `penalty="l1"` (i.e., LASSO), and `solver="liblinear"` (since the default will not work with LASSO). Compute the AUC-ROC score.
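
The effect of the L1 penalty can be sketched on synthetic data as below. The data, the 0/1 labels, and the use of a standardizing pipeline are assumptions for the illustration; only the `C`, `penalty`, and `solver` settings come from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)

# Synthetic data: the third column is unrelated to the label.
X_toy = rng.normal(size=(600, 3))
signal = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=600)
y_toy = (signal > 0).astype(int)

# Small C means strong regularization; liblinear supports the L1 penalty.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.01, penalty="l1", solver="liblinear"),
)
pipe.fit(X_toy, y_toy)

# Coefficients driven exactly to zero mark the features dropped by the penalty.
coef = pipe[-1].coef_.ravel()
dropped = [i for i, c in enumerate(coef) if c == 0]
print(coef, dropped)
```

With a data frame of features, the positions in `dropped` can be mapped back to names through the frame's `columns` attribute.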

In [None]:
# SOLUTIION NO PROMPT
# SOLUTIION
X = ...
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=SEED)  

lr_all = ...
AUC_all = ...


In [None]:
grader.check("satisfy-all-auc")

3.5. Use the coefficients of the regression (with all features) to answer the following questions. 

(a) Which feature or features are dropped by the regularization? (Answer with a **list** of strings.)

(b) Which of the subjective features has the greatest positive effect on satisfaction? (Answer with a string.)

(c) Which type of travel tends to produce greater satisfaction? (Answer with string `"Personal"` or `"Business"`.)

In [None]:
dropped_features = [ "Flight Distance", "Departure Delay in Minutes" ] # SOLUTIION
greatest_subjective = "Online boarding"  # SOLUTIION
satisfying_travel = "Business"

In [None]:
grader.check("satisfy-all-factors")

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit.

Select *Kernel/Restart & Run All*, then save, then run this export cell again. Submit by pushing the resulting zip file to your GitHub assignment repo.

In [None]:
grader.export(pdf=False, force_save=True)