<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Lab 3_01: Statistical Modeling and Model Validation

> Authors: Tim Book, Matt Brems

---

## Objective
The goal of this lab is to guide you through the modeling workflow to produce the best model you can. In this lesson, you will follow all best practices when slicing your data and validating your model. 

## Imports

In [73]:
# Import everything you need here.
# You may want to return to this cell to import more things later in the lab.
# DO NOT COPY AND PASTE FROM OUR CLASS SLIDES!
# Muscle memory is important!

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# New libraries
from sklearn.linear_model import LinearRegression
from sklearn import metrics

## Read Data
The `citibike` dataset consists of Citi Bike ridership data for over 224,000 rides in February 2014.

In [74]:
# Read in the citibike data in the data folder in this repository.
citibike = pd.read_csv('./data/citibike_feb2014.csv')

## Explore the data
Use this space to familiarize yourself with the data.

Convince yourself there are no issues with the data. If you find any issues, clean them here.

In [75]:
# Checking the shape of our dataset
print(citibike.shape)

(224736, 15)


In [76]:
# Snapshot of our dataset
print(citibike.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224736 entries, 0 to 224735
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             224736 non-null  int64  
 1   starttime                224736 non-null  object 
 2   stoptime                 224736 non-null  object 
 3   start station id         224736 non-null  int64  
 4   start station name       224736 non-null  object 
 5   start station latitude   224736 non-null  float64
 6   start station longitude  224736 non-null  float64
 7   end station id           224736 non-null  int64  
 8   end station name         224736 non-null  object 
 9   end station latitude     224736 non-null  float64
 10  end station longitude    224736 non-null  float64
 11  bikeid                   224736 non-null  int64  
 12  usertype                 224736 non-null  object 
 13  birth year               224736 non-null  object 
 14  gend

In [5]:
# Let's check for na/null values
citibike.isna().sum()

tripduration               0
starttime                  0
stoptime                   0
start station id           0
start station name         0
start station latitude     0
start station longitude    0
end station id             0
end station name           0
end station latitude       0
end station longitude      0
bikeid                     0
usertype                   0
birth year                 0
gender                     0
dtype: int64

In [77]:
# Lets check that we have two unique gender types
citibike['gender'].unique()

array([1, 2, 0], dtype=int64)

**Note**: Hmm.. that is weird. Guess we'll need to filter Gender = 0 out from our dataset. That also removed errors for the `birth year` column

In [78]:
citibike_cleaned = citibike.loc[citibike['gender']!=0]
citibike_cleaned['usertype'].unique()

array(['Subscriber'], dtype=object)

**Note**: We can also remove usertype later as there is only *one* unique value

## Is average trip duration different by gender?

Conduct a hypothesis test that checks whether or not the average trip duration is different for `gender=1` and `gender=2`. Be sure to specify your null and alternative hypotheses, and to state your conclusion carefully and correctly!

In [79]:
gender1 = citibike_cleaned.loc[citibike['gender']==1, 'tripduration']
gender2 = citibike_cleaned.loc[citibike['gender']==2, 'tripduration']

gender1_mean = gender1.mean()
gender2_mean = gender2.mean()

print(f'Gender 1 (Mean): {gender1_mean}, Gender 2 (Mean): {gender2_mean}')

Gender 1 (Mean): 814.0324088236293, Gender 2 (Mean): 991.3610742785506


In [22]:
# Quick snapshot of trip duration by gender
print(f'Gender 1:\n {gender1.describe()} \n\n Gender 2:\n{gender2.describe()}')

Gender 1:
 count    176526.000000
mean        814.032409
std        5020.576128
min          60.000000
25%         347.000000
50%         520.000000
75%         794.000000
max      585281.000000
Name: tripduration, dtype: float64 

 Gender 2:
count     41479.000000
mean        991.361074
std        7114.753227
min          60.000000
25%         404.000000
50%         607.000000
75%         938.000000
max      766108.000000
Name: tripduration, dtype: float64


**Our goal**: To decide *whether males or females cycle further distances*.

We set up our null and alternate hypotheses below:

$$ H_0: \mu_\text{gender1} - \mu_\text{gender2} = 0 $$
$$ H_A: \mu_\text{gender1} - \mu_\text{gender2} \ne 0 $$

Our significance level: $\alpha = 0.05$

Since we are testing the difference between two means, we will employ the **two-sample $z$-test**

In [80]:
from statsmodels.stats import weightstats as stests

# Conducting our t-test between treatment and control groups
t_stat, p_value = stests.ztest(gender1, gender2, value = 0, alternative='two-sided')

print(f'T-test statistic: {t_stat}\np-value: {p_value}')

T-test statistic: -5.929304472651931
p-value: 3.0422052695489308e-09


Since our $p\text{-value} < \alpha$ , then there is evidence to reject the null hypothesis (i.e. There is a statiscally significant difference between both genders)

## What numeric columns shouldn't be treated as numeric?

**Answer:**

* start station id
* start station latitude
* start station longitude
* end station id
* end station latitude
* end station longitude
* bikeid
* gender

## Dummify the `start station id` Variable

In [24]:
citibike_dums = pd.get_dummies(data=citibike_cleaned, columns=['start station id'], drop_first=True)

In [82]:
# Have a look at the first few roles
citibike_dums.head()

Unnamed: 0,tripduration,starttime,stoptime,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,...,start station id_2008,start station id_2009,start station id_2010,start station id_2012,start station id_2017,start station id_2021,start station id_2022,start station id_2023,start station id_3002,age
0,382,2014-02-01 00:00:00,2014-02-01 00:06:22,Washington Square E,40.730494,-73.995721,265,Stanton St & Chrystie St,40.722293,-73.991475,...,0,0,0,0,0,0,0,0,0,23
1,372,2014-02-01 00:00:03,2014-02-01 00:06:15,Broadway & E 14 St,40.734546,-73.990741,439,E 4 St & 2 Ave,40.726281,-73.98978,...,0,0,0,0,0,0,0,0,0,35
2,591,2014-02-01 00:00:09,2014-02-01 00:10:00,Perry St & Bleecker St,40.735354,-74.004831,251,Mott St & Prince St,40.72318,-73.9948,...,0,0,0,0,0,0,0,0,0,66
3,583,2014-02-01 00:00:32,2014-02-01 00:10:15,E 11 St & Broadway,40.732618,-73.99158,284,Greenwich Ave & 8 Ave,40.739017,-74.002638,...,0,0,0,0,0,0,0,0,0,33
4,223,2014-02-01 00:00:41,2014-02-01 00:04:24,Allen St & Rivington St,40.720196,-73.989978,439,E 4 St & 2 Ave,40.726281,-73.98978,...,0,0,0,0,0,0,0,0,0,24


## Engineer a feature called `age` that shares how old the person would have been in 2014 (at the time the data was collected).

- Note: you will need to clean the data a bit.

In [25]:
# We've previously stripped out Gender == 0 and erroneous birth year values
citibike_dums['age'] = 2014 - citibike_cleaned['birth year'].astype(int)
citibike_dums[['birth year', 'age']].head()

Unnamed: 0,birth year,age
0,1991,23
1,1979,35
2,1948,66
3,1981,33
4,1990,24


In [84]:
# Check dtype of 'age' column
citibike_dums['age'].dtype

dtype('int32')

## Split your data into train/test data

Look at the size of your data. What is a good proportion for your split? **Justify your answer.**

Use the `tripduration` column as your `y` variable.

For your `X` variables, use `age`, `usertype`, `gender`, and the dummy variables you created from `start station id`. (Hint: You may find the Pandas `.drop()` method helpful here.)

**NOTE:** When doing your train/test split, please use random seed 123.

In [85]:
y = citibike_dums['tripduration'] # This has to be a series
X = citibike_dums.drop(columns=['tripduration', 'starttime', 'stoptime', 
                               'start station name', 'start station latitude', 'start station longitude',
                               'end station id', 'end station name', 'end station latitude',
                               'end station longitude', 'bikeid', 'birth year', 'usertype'])

In [86]:
# Check y variable
print(y.shape, type(y))

(218005,) <class 'pandas.core.series.Series'>


In [87]:
# Check X variable
print(X.shape, type(X))

(218005, 330) <class 'pandas.core.frame.DataFrame'>


In [88]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123) # Set random seed as 123

In [89]:
# Infer size of dataset
print(len(X_train), len(X_test))

163503 54502


## Fit a Linear Regression model in `sklearn` predicting `tripduration`.

In [90]:
# Start the linear regression model
lr = LinearRegression()

In [91]:
# Fit the model - The training begins here
lr.fit(X, y)

LinearRegression()

## Evaluate your model
Look at some evaluation metrics for **both** the training and test data. 
- How did your model do? Is it overfit, underfit, or neither?
- Does this model outperform the baseline? (e.g. setting $\hat{y}$ to be the mean of our training `y` values.)

In [42]:
# Train score
lr.score(X_train, y_train)

0.002393308490055812

**Note**: The coefficient of determination ($R^2$) is close to zero, which indicates that the model we've just built explains approximately none of the variability of the response against the selected features

In [43]:
# Test score
lr.score(X_test, y_test)

0.004559464567627347

**Note**: Same as above

*Let's try to calculate some metrics by hand*

In [95]:
# Calculating mean trip duration
null_prediction = y_train.mean()
print(null_prediction)

849.0701393858217


In [97]:
# Null residuals
null_resid = y_train - null_prediction

# Null sum of squares
null_ss = (null_resid**2).sum()

In [98]:
# Calculate SSE by hand
predictions = lr.predict(X_train)

sse = ((y_train - predictions)**2).sum()
sse

4298798718548.2876

In [99]:
# Check R2 by hand
1 - sse / null_ss # Checking understanding

0.002393308490055812

## Fit a Linear Regression model in `statsmodels` predicting `tripduration`.

In [100]:
import statsmodels.api as sm

X2 = sm.add_constant(X)
y2 = y

In [102]:
ols = sm.OLS(y2, X2).fit()

In [103]:
ols.summary()

0,1,2,3
Dep. Variable:,tripduration,R-squared:,0.003
Model:,OLS,Adj. R-squared:,0.002
Method:,Least Squares,F-statistic:,2.074
Date:,"Fri, 05 Nov 2021",Prob (F-statistic):,6.71e-27
Time:,17:50:40,Log-Likelihood:,-2185800.0
No. Observations:,218005,AIC:,4372000.0
Df Residuals:,217674,BIC:,4376000.0
Df Model:,330,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,639.8305,223.934,2.857,0.004,200.926,1078.735
gender,179.9693,30.185,5.962,0.000,120.808,239.130
start station id_79,-26.2382,308.992,-0.085,0.932,-631.855,579.379
start station id_82,252.6428,396.821,0.637,0.524,-525.117,1030.403
start station id_83,-179.8306,391.846,-0.459,0.646,-947.839,588.177
start station id_116,-371.1932,263.570,-1.408,0.159,-887.784,145.397
start station id_119,-332.6216,769.827,-0.432,0.666,-1841.462,1176.219
start station id_120,761.4627,616.833,1.234,0.217,-447.514,1970.439
start station id_127,-315.3864,278.115,-1.134,0.257,-860.484,229.712

0,1,2,3
Omnibus:,745995.199,Durbin-Watson:,2.0
Prob(Omnibus):,0.0,Jarque-Bera (JB):,292097123371.683
Skew:,63.736,Prob(JB):,0.0
Kurtosis:,5672.259,Cond. No.,13600.0


## Using the `statsmodels` summary, test whether or not `age` has a significant effect when predicting `tripduration`.
- Be sure to specify your null and alternative hypotheses, and to state your conclusion carefully and correctly **in the context of your model**!

$H_0:$ The age of a rider does not have any relationship with trip duration

$H_A:$ The age of a rider has a statiscally significant relationship with trip duration

In [62]:
# Defining our y variable
y3 = citibike_dums['tripduration']
print(y3.shape, type(y3))

(218005,) <class 'pandas.core.series.Series'>


In [63]:
# Defining age as our feature
X3 = citibike_dums[['age']]
print(X3.shape, type(X3))

(218005, 1) <class 'pandas.core.frame.DataFrame'>


In [64]:
X3 = sm.add_constant(X3)
ols = sm.OLS(y3, X3).fit()

In [65]:
ols.summary()

0,1,2,3
Dep. Variable:,tripduration,R-squared:,0.0
Model:,OLS,Adj. R-squared:,0.0
Method:,Least Squares,F-statistic:,18.4
Date:,"Tue, 02 Nov 2021",Prob (F-statistic):,1.79e-05
Time:,21:32:30,Log-Likelihood:,-2186200.0
No. Observations:,218005,AIC:,4372000.0
Df Residuals:,218003,BIC:,4372000.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,678.0398,41.273,16.428,0.000,597.146,758.933
age,4.4084,1.028,4.290,0.000,2.394,6.423

0,1,2,3
Omnibus:,747076.893,Durbin-Watson:,2.0
Prob(Omnibus):,0.0,Jarque-Bera (JB):,296630107255.218
Skew:,64.007,Prob(JB):,0.0
Kurtosis:,5716.089,Cond. No.,141.0


**Note**: Age does not have a statistically significant relationship with trip duration

## Citi Bike is attempting to market to people who they think will ride their bike for a long time. Based on your modeling, what types of individuals should Citi Bike market toward?

**My answer**: The features we tested did not show any significant relationship with trip duration. Other information may need to be gathered