In [None]:
# Data and Stats packages
import numpy as np
import pandas as pd
from sklearn import metrics, datasets
from sklearn.model_selection import train_test_split
import statsmodels.api as sm 

# Visualization packages
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
matplotlib.rcParams['figure.figsize'] = (13.0, 6.0)

# Other helpful functions
import warnings
warnings.filterwarnings("ignore")

#Aesthetic settings
from IPython.display import display
pd.set_option('display.max_rows', 999)
pd.set_option('display.width', 500)

## 1. Extending Linear Regression by Transforming Predictors

Linear regression works great when our features are all continuous and all linearly affect the output. But often real data have more interesting characteristics. Here, we'll look at how we can extend linear regression to handle:

* Categorical predictors, like gender, which have one of a few discrete values
* Interactions between predictors, which let us model how one variable changes the effect of another.

For our dataset, we'll be using the passenger list from the Titanic, which famously sank in 1912. Let's have a look at the data. Some descriptions of the data are at https://www.kaggle.com/c/titanic/data, and here's [how seaborn preprocessed it](https://github.com/mwaskom/seaborn-data/blob/master/process/titanic.py).

In [None]:
# you can also load the dataset from seaborn 
titanic = sns.load_dataset("titanic")
titanic.head()

Note:
`sns.load_dataset("titanic")` loads one of the example datasets that seaborn hosts in its online repository. The first call downloads the file (later calls reuse the cached copy) and returns a pandas DataFrame.

`pd.read_csv`, by contrast, reads any CSV file you point it at, local or remote, and gives you control over how the file is parsed. Use `load_dataset` for seaborn's ready-made examples and `read_csv` for your own data.
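To make the contrast concrete, here is a minimal sketch of the `pd.read_csv` route. The CSV text below is made up for illustration and stands in for a file on disk; only the `read_csv` call is the point.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """age,sex,fare
22,male,7.25
38,female,71.2833
26,female,7.925
"""

# read_csv accepts a path, a URL, or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.dtypes)
```

Either way the result is an ordinary pandas DataFrame, so everything that follows works the same regardless of how the data was loaded.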

In [None]:
# Keep only a subset of the predictors; some are redundant, others (like deck) have too many missing values.
titanic = titanic[['age', 'sex', 'class', 'embark_town', 'alone', 'fare']]

**Exercise 1**: Check the dataframe information with `titanic.info()`.

In [None]:
# check the dataframe information through titanic.info()


In [None]:
# Drop missing data through .dropna() and check the dataframe information


### First let's look at the distribution of fares.

<div class="exercise">**Exercise 2**: Show the distribution of fares in at least two ways. You may use functions from matplotlib, Pandas, or Seaborn.</div>

In [None]:
# Your code here


### Exploring predictors

Cabin class is probably going to matter for fare, but we might wonder if age and gender also matter. Let's explore them.

In [None]:
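# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(titanic.age, kde=True)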

In [None]:
sns.lmplot(x="age", y="fare", hue="sex", data=titanic)

Hmm... the slopes seem to be different for males and females.

How about class?
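One way to check is to summarize fare within each class. The sketch below uses a small made-up DataFrame so that it runs on its own; `fare_by_group` is a hypothetical helper, not part of any library, and in the notebook you would pass `titanic` instead of `toy`.

```python
import pandas as pd

def fare_by_group(df, group_col, value_col="fare"):
    """Return count, mean and median of value_col for each level of group_col."""
    return df.groupby(group_col)[value_col].agg(["count", "mean", "median"])

# Toy data (made up) so the sketch is self-contained.
toy = pd.DataFrame({
    "class": ["First", "First", "Second", "Third", "Third", "Third"],
    "fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})
summary = fare_by_group(toy, "class")
print(summary)
```

For a visual version of the same comparison, `sns.boxplot(x="class", y="fare", data=titanic)` is one option.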

So it looks like fare varies with class, age, and maybe gender, and the way that fare depends on class and age may be different for male vs female.

### Let's first do a simple linear regression of fare on age with statsmodels

In [None]:
# your code here


In [None]:
# check the model summary

How do we interpret this?

* How good is this model?
* Does age affect fare? How can we tell? And if so, how?

But class and gender are categorical variables.

### Handling categorical variables

How do we deal with gender, or class? They're categorical variables. We'll need to use dummy variables to encode them.

In [None]:
titanic_orig = titanic.copy()

<div class="exercise">**Exercise 3**: Create a column `sex_male` that is 1 if the passenger is male, 0 otherwise.</div>

In [None]:
# Your code here
titanic['sex_male'] = (titanic.sex == 'male').astype(int)

<div class="exercise">**Exercise 3 (continued)**: Do we need a `sex_female` column, or any others? Why or why not?</div>

<div class="exercise">**Exercise 4**: Create dummy columns for `class`, named `class_Second` and `class_Third`.</div>

In [None]:
# Your code here
titanic['class_Second'] = (titanic['class'] == 'Second').astype(int)
titanic['class_Third'] = 1 * (titanic['class'] == 'Third') # just another way to do it

In [None]:
titanic.info()

In [None]:
titanic.head()

#### You can also use pd.get_dummies to convert categorical variable into dummy/indicator variables.

In [None]:

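# This function automates the above.
# dtype=int keeps the dummies as 0/1 integers; newer pandas versions return booleans by default.
titanic = pd.get_dummies(titanic_orig, columns=['sex', 'class'], drop_first=True, dtype=int)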
titanic.head()

<div class="exercise">**Exercise 5**: Fit a linear regression including the new sex and class variables.</div>

In [None]:
# Your code here


<div class='exercise'>**Exercise 6**: How do we interpret these results?</div>
* All else being equal, what does being male do to the fare?
* What can we say about being *male* and *first-class*?
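It may help to write the fitted model out. This is a sketch of the additive model, assuming the predictors are `age` and the dummy columns `sex_male`, `class_Second`, and `class_Third` (written below as male, second, and third):

$$
\widehat{\mathrm{fare}} = \beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{male} + \beta_3\,\mathrm{second} + \beta_4\,\mathrm{third}
$$

With all dummies equal to 0, the prediction is $\beta_0 + \beta_1\,\mathrm{age}$, which describes a female first-class passenger, the baseline. Switching male to 1 while holding age and class fixed changes the prediction by exactly $\beta_2$, so a male first-class passenger is predicted at $\beta_0 + \beta_2 + \beta_1\,\mathrm{age}$.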


## Interactions

In [None]:
# It seemed like gender interacted with age and class. Can we put that in our model?
# hint: titanic['sex_male_X_age'] = titanic['age'] * titanic['sex_male']

# fit the model with the interaction term

# Hint: fit the model with titanic[['age', 'sex_male', 'class_Second', 'class_Third', 'sex_male_X_age']]


**What happened to the `age` and `sex_male` terms?**
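Writing out the model with the interaction term shows why those coefficients change meaning. A sketch, with male standing for the `sex_male` dummy and the class terms abbreviated:

$$
\widehat{\mathrm{fare}} = \beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{male} + \beta_3\,(\mathrm{age} \times \mathrm{male}) + \dots
$$

For females (male = 0) the slope on age is $\beta_1$. For males (male = 1) the slope is $\beta_1 + \beta_3$ and the intercept shifts by $\beta_2$. So $\beta_1$ is no longer "the effect of age" overall but the age slope for females only, and $\beta_2$ is the male/female gap at age 0 rather than at every age.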

In [None]:
# It seemed like gender also interacted with class. Can we put that in our model?
# titanic['sex_male_X_class_Second'] = titanic['sex_male'] * titanic['class_Second']
# titanic['sex_male_X_class_Third'] = titanic['sex_male'] * titanic['class_Third']

What has happened to the $R^2$ as we added more features? 

In [None]:
# models = [model1, model2, model3, model4]
# plt.plot([model.df_model for model in models], [model.rsquared for model in models], 'x-')