<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Lab 3.02: Statistical Modeling and Model Validation

> Authors: Tim Book, Matt Brems

---

## Objective
The goal of this lab is to guide you through the modeling workflow. In this lesson, you will follow all best practices when slicing your data and validating your model. The goal of this lab is not necessarily to build the best model you can, but to build and evaluate a model, and interpret its results.

## Imports

In [2]:
# Import everything you need here.
# You may want to return to this cell to import more things later in the lab.
# DO NOT COPY AND PASTE FROM OUR CLASS SLIDES!
# Muscle memory is important!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

## Read Data
The `citibike` dataset consists of Citi Bike ridership data for over 224,000 rides in February 2014.

In [5]:
# Read in the citibike data in the data folder in this repository.
citibike = pd.read_csv('./data/citibike_feb2014.csv')

## Explore the data
Use this space to familiarize yourself with the data.

Convince yourself there are no issues with the data. If you find any issues, clean them here.

In [7]:
citibike.head(1)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,382,2014-02-01 00:00:00,2014-02-01 00:06:22,294,Washington Square E,40.730494,-73.995721,265,Stanton St & Chrystie St,40.722293,-73.991475,21101,Subscriber,1991,1


In [15]:
# Birth year should be an integer
citibike.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224736 entries, 0 to 224735
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             224736 non-null  int64  
 1   starttime                224736 non-null  object 
 2   stoptime                 224736 non-null  object 
 3   start station id         224736 non-null  int64  
 4   start station name       224736 non-null  object 
 5   start station latitude   224736 non-null  float64
 6   start station longitude  224736 non-null  float64
 7   end station id           224736 non-null  int64  
 8   end station name         224736 non-null  object 
 9   end station latitude     224736 non-null  float64
 10  end station longitude    224736 non-null  float64
 11  bikeid                   224736 non-null  int64  
 12  usertype                 224736 non-null  object 
 13  birth year               224736 non-null  object 
 14  gend

In [24]:
# Inspect the unique values to see why `birth year` was read in as an object column.
citibike['birth year'].unique()

array(['1991', '1979', '1948', '1981', '1990', '1978', '1944', '1983',
       '1969', '1986', '1962', '1965', '1942', '1989', '1980', '1957',
       '1951', '1992', '1971', '1982', '1968', '1984', '\\N', '1956',
       '1987', '1985', '1996', '1975', '1988', '1974', '1972', '1959',
       '1973', '1977', '1976', '1953', '1993', '1970', '1963', '1967',
       '1966', '1960', '1961', '1994', '1958', '1955', '1946', '1964',
       '1900', '1995', '1954', '1952', '1949', '1947', '1941', '1938',
       '1950', '1945', '1997', '1934', '1940', '1939', '1936', '1943',
       '1935', '1937', '1922', '1932', '1907', '1926', '1899', '1901',
       '1917', '1910', '1933', '1921', '1927', '1913'], dtype=object)

In [54]:
# Mean birth year over the rows with a valid (non-'\N') value.
valid_birth_years = citibike.loc[citibike['birth year'] != '\\N', 'birth year'].astype(int)
print(valid_birth_years.mean())

1975.4975070980051


In [73]:
def clean_by(cell):
    # Replace the '\N' placeholder with 1979; convert every other value to a float.
    if cell == '\\N':
        return 1979
    else:
        return float(cell)

In [74]:
citibike['birth year'] = citibike['birth year'].map(clean_by)
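As an alternative to the element-wise `clean_by` function, the same cleaning can be sketched with vectorized pandas calls. The example below uses a small made-up Series rather than the real data, and it fills the placeholder with the median of the valid years. That fill choice is an assumption; the notebook above uses the fixed value 1979.

```python
import pandas as pd

# Hypothetical sample standing in for citibike['birth year'];
# '\\N' is the missing-value placeholder seen in the real column.
birth_year = pd.Series(['1991', '1979', '\\N', '1948', '1981'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
birth_year_num = pd.to_numeric(birth_year, errors='coerce')

# Fill the gaps with the median of the valid years (an assumed choice).
fill_value = birth_year_num.median()
birth_year_clean = birth_year_num.fillna(fill_value)

print(birth_year_clean.tolist())
```

Coercing first and filling second keeps the two decisions separate: which values count as missing, and what to replace them with.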

In [75]:
citibike.dtypes

tripduration                 int64
starttime                   object
stoptime                    object
start station id             int64
start station name          object
start station latitude     float64
start station longitude    float64
end station id               int64
end station name            object
end station latitude       float64
end station longitude      float64
bikeid                       int64
usertype                    object
birth year                 float64
gender                       int64
dtype: object

In [76]:
# There are no null values
citibike.isna().sum()

tripduration               0
starttime                  0
stoptime                   0
start station id           0
start station name         0
start station latitude     0
start station longitude    0
end station id             0
end station name           0
end station latitude       0
end station longitude      0
bikeid                     0
usertype                   0
birth year                 0
gender                     0
dtype: int64

### Is average trip duration different by gender?

Conduct a hypothesis test that checks whether or not the average trip duration is different for `gender=1` and `gender=2`. Be sure to specify your null and alternative hypotheses, and to state your conclusion carefully and correctly!

In [78]:
citibike['gender'].value_counts()

1    176526
2     41479
0      6731
Name: gender, dtype: int64

**Null Hypothesis:** The mean trip duration is the same for `gender=1` and `gender=2`.  
**Alternative Hypothesis:** The mean trip duration is different for `gender=1` and `gender=2`.

In [181]:
from scipy import stats

In [186]:
trt = citibike[citibike['gender'] == 1]['tripduration']
ctrl = citibike[citibike['gender'] == 2]['tripduration']

In [187]:
# Conduct our t-test.
tt = stats.ttest_ind(trt, ctrl, equal_var=False)
tt

Ttest_indResult(statistic=-4.802922158264667, pvalue=1.5680482053980446e-06)

In [188]:
# Print the mean trip duration for gender=1 (trt) and gender=2 (ctrl).
print(trt.mean())
print(ctrl.mean())

814.0324088236293
991.3610742785506


In [189]:
trt.mean() - ctrl.mean()

-177.3286654549213

In [190]:
tt.pvalue

1.5680482053980446e-06

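**Answer:** 
Since our p-value (about 1.6e-06) is below our significance level of 0.05, we reject the null hypothesis and conclude that the mean trip duration differs between `gender=1` and `gender=2`. In this sample, `gender=2` trips were about 177 seconds longer on average (roughly 991 versus 814 seconds).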
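The decision rule used in this test can be sketched on synthetic data. The two groups below are randomly generated stand-ins (not the Citi Bike rides), and the names `group_1`, `group_2`, and `alpha` are illustrative. `equal_var=False` requests Welch's t-test, which does not assume the two groups share a variance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical trip durations in seconds for two groups of unequal size.
group_1 = rng.normal(loc=800, scale=300, size=5000)
group_2 = rng.normal(loc=900, scale=400, size=2000)

alpha = 0.05  # significance level chosen before looking at the data

# Welch's two-sample t-test (unequal variances allowed).
result = stats.ttest_ind(group_1, group_2, equal_var=False)

reject_null = result.pvalue < alpha
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3g}, reject H0: {reject_null}")
```

A negative statistic here means the first group's mean is below the second group's, which matches the sign convention of the result reported above.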

### What numeric columns shouldn't be treated as numeric?

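**Answer:** The `start station id` column should not be treated as numeric. Its values are labels for stations, so their order and magnitude carry no meaning. The same reasoning applies to `end station id`, `bikeid`, and `gender`, which are also identifiers or category codes.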

### Dummify the `start station id` Variable

In [81]:
citibike['start station id'].value_counts()

293     2920
519     2719
497     2493
435     2403
521     2171
        ... 
431       54
278       45
443       41
2005      36
320        4
Name: start station id, Length: 329, dtype: int64

In [88]:
df = pd.get_dummies(citibike['start station id'], drop_first=True)

In [92]:
citibike = citibike.join(df)
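The dummify-and-join steps can also be done in a single call by passing `columns=` to `pd.get_dummies`. The sketch below runs on a tiny made-up frame, not the real data. One side effect is that the new columns get string names with a prefix, instead of the bare integer labels seen in the output above.

```python
import pandas as pd

# Hypothetical four-ride sample with the same column names as the lab data.
rides = pd.DataFrame({
    'tripduration': [382, 372, 591, 583],
    'start station id': [294, 285, 294, 357],
})

# Encode the station column in place; drop_first=True drops the first
# category (285 here) so the dummies are not perfectly collinear.
encoded = pd.get_dummies(rides, columns=['start station id'], drop_first=True)

print(encoded.columns.tolist())
# ['tripduration', 'start station id_294', 'start station id_357']
```

With this form the original `start station id` column is replaced by its dummies, so it does not need to be dropped separately later.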

In [93]:
citibike

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,...,2006,2008,2009,2010,2012,2017,2021,2022,2023,3002
0,382,2014-02-01 00:00:00,2014-02-01 00:06:22,294,Washington Square E,40.730494,-73.995721,265,Stanton St & Chrystie St,40.722293,...,0,0,0,0,0,0,0,0,0,0
1,372,2014-02-01 00:00:03,2014-02-01 00:06:15,285,Broadway & E 14 St,40.734546,-73.990741,439,E 4 St & 2 Ave,40.726281,...,0,0,0,0,0,0,0,0,0,0
2,591,2014-02-01 00:00:09,2014-02-01 00:10:00,247,Perry St & Bleecker St,40.735354,-74.004831,251,Mott St & Prince St,40.723180,...,0,0,0,0,0,0,0,0,0,0
3,583,2014-02-01 00:00:32,2014-02-01 00:10:15,357,E 11 St & Broadway,40.732618,-73.991580,284,Greenwich Ave & 8 Ave,40.739017,...,0,0,0,0,0,0,0,0,0,0
4,223,2014-02-01 00:00:41,2014-02-01 00:04:24,401,Allen St & Rivington St,40.720196,-73.989978,439,E 4 St & 2 Ave,40.726281,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
224731,848,2014-02-28 23:57:13,2014-03-01 00:11:21,498,Broadway & W 32 St,40.748549,-73.988084,432,E 7 St & Avenue A,40.726218,...,0,0,0,0,0,0,0,0,0,0
224732,1355,2014-02-28 23:57:55,2014-03-01 00:20:30,470,W 20 St & 8 Ave,40.743453,-74.000040,302,Avenue D & E 3 St,40.720828,...,0,0,0,0,0,0,0,0,0,0
224733,304,2014-02-28 23:58:17,2014-03-01 00:03:21,497,E 17 St & Broadway,40.737050,-73.990093,334,W 20 St & 7 Ave,40.742388,...,0,0,0,0,0,0,0,0,0,0
224734,308,2014-02-28 23:59:10,2014-03-01 00:04:18,353,S Portland Ave & Hanson Pl,40.685396,-73.974315,365,Fulton St & Grand Ave,40.682232,...,0,0,0,0,0,0,0,0,0,0


In [94]:
citibike.isna().sum()

tripduration          0
starttime             0
stoptime              0
start station id      0
start station name    0
                     ..
2017                  0
2021                  0
2022                  0
2023                  0
3002                  0
Length: 343, dtype: int64

In [96]:
citibike['birth year']

0         1991.0
1         1979.0
2         1948.0
3         1981.0
4         1990.0
           ...  
224731    1976.0
224732    1985.0
224733    1968.0
224734    1982.0
224735    1960.0
Name: birth year, Length: 224736, dtype: float64

## Feature Engineering
Engineer a feature called `age` that shows how old the person would have been in 2014 (at the time the data was collected).
- Note: you will need to clean the data a bit.

In [95]:
citibike.head(1)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,...,2006,2008,2009,2010,2012,2017,2021,2022,2023,3002
0,382,2014-02-01 00:00:00,2014-02-01 00:06:22,294,Washington Square E,40.730494,-73.995721,265,Stanton St & Chrystie St,40.722293,...,0,0,0,0,0,0,0,0,0,0


In [98]:
citibike['age'] = 2014 - citibike['birth year']
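The unique birth years listed earlier include values such as 1899 and 1900, which would give ages above 110. A quick way to surface such rows is sketched below on made-up values. The 100-year threshold is an assumption for illustration, not something the lab specifies.

```python
import pandas as pd

# Hypothetical sample of cleaned birth years.
rides = pd.DataFrame({'birth year': [1991.0, 1979.0, 1899.0, 1948.0]})
rides['age'] = 2014 - rides['birth year']

# Assumed cutoff for an implausible rider age.
max_plausible_age = 100
implausible = rides['age'] > max_plausible_age

print(rides[implausible])
print(f"{implausible.sum()} of {len(rides)} rows exceed {max_plausible_age} years")
```

Whether to drop, cap, or keep such rows is a judgment call that should be stated alongside the model results.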

## Split your data into train/test sets

Look at the size of your data. What is a good proportion for your split? **Justify your answer, considering the size of your data and the default split size in sklearn.**

Use the `tripduration` column as your `y` variable.

For your `X` variables, use `age`, `usertype`, `gender`, and the dummy variables you created from `start station id`. (Hint: You may find the Pandas `.drop()` method helpful here.) 
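To reason about the split size, it can help to see how many rows each proportion leaves in the test set. The sketch below splits a plain index array of the same length as the dataset (224,736 rows, from the `info()` output above) rather than the data itself. The candidate proportions are illustrative; 0.25 is the `train_test_split` default.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 224736  # number of rides reported by citibike.info()
indices = np.arange(n_rows)

# Compare a few candidate test proportions.
split_sizes = {}
for test_size in [0.25, 0.35, 0.10]:
    train_idx, test_idx = train_test_split(
        indices, test_size=test_size, random_state=42
    )
    split_sizes[test_size] = (len(train_idx), len(test_idx))
    print(f"test_size={test_size}: train={len(train_idx)}, test={len(test_idx)}")
```

With this many rows, even a 10% test set holds more than 22,000 rides, so a smaller test share than the default is defensible.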

In [104]:
citibike.columns

Index([           'tripduration',               'starttime',
                      'stoptime',        'start station id',
            'start station name',  'start station latitude',
       'start station longitude',          'end station id',
              'end station name',    'end station latitude',
       ...
                            2008,                      2009,
                            2010,                      2012,
                            2017,                      2021,
                            2022,                      2023,
                            3002,                     'age'],
      dtype='object', length=344)

In [119]:
dropped_df = citibike.drop(columns=['tripduration', 'starttime', 'stoptime',
                                    'start station id', 'start station name',
                                    'start station latitude', 'start station longitude',
                                    'end station id', 'end station name',
                                    'end station latitude', 'end station longitude',
                                    'birth year'])

In [120]:
dropped_df.head(1)

Unnamed: 0,bikeid,usertype,gender,79,82,83,116,119,120,127,...,2008,2009,2010,2012,2017,2021,2022,2023,3002,age
0,21101,Subscriber,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,23.0


In [123]:
user_dum = pd.get_dummies(dropped_df['usertype'], drop_first=True)

In [125]:
user_dum

Unnamed: 0,Subscriber
0,1
1,1
2,1
3,1
4,1
...,...
224731,1
224732,1
224733,1
224734,1


In [149]:
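# Replace the text `usertype` column with its dummy so that X is fully numeric.
dropped_df = dropped_df.drop(columns='usertype').join(user_dum)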

In [150]:
dropped_df.head(1)

Unnamed: 0,bikeid,gender,79,82,83,116,119,120,127,128,...,2009,2010,2012,2017,2021,2022,2023,3002,age,Subscriber
0,21101,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,23.0,1


In [152]:
X = dropped_df.drop(columns='bikeid')

In [154]:
X.head(1)

Unnamed: 0,gender,79,82,83,116,119,120,127,128,137,...,2009,2010,2012,2017,2021,2022,2023,3002,age,Subscriber
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,23.0,1


In [155]:
y = citibike['tripduration']

## Fit a Linear Regression model in `sklearn` predicting `tripduration`.

In [165]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, train_size = .65)

In [166]:
model = LinearRegression()

In [167]:
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [168]:
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

In [172]:
y_pred_test.mean()

876.5389452686321

In [169]:
model.score(X_train, y_train)

0.0054176987603250515

In [170]:
model.score(X_test, y_test)

-0.0031713904716592634

## Evaluate your model
Look at some evaluation metrics for **both** the training and test data. 
- How did your model do? Is it overfit, underfit, or neither?
- Does this model outperform the baseline? (e.g. setting $\hat{y}$ to be the mean of our training `y` values.)

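The model was underfit. Its $R^2$ is about 0.005 on the training data and about -0.003 on the test data, so it explains almost none of the variance in `tripduration` on either set. On the training data its RMSE (about 5,639, the square root of the MSE below) is only slightly lower than the baseline RMSE (about 5,655), so it barely outperforms predicting the mean.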

In [173]:
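# Baseline: predict the same mean trip duration for every training row.
# Note: this uses the mean of the full dataset; y_train.mean() would
# keep the test rows out of the baseline.
y_null = np.zeros_like(y_train)
y_null = y_null + citibike['tripduration'].mean()
y_null[0:5]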

array([874.51980991, 874.51980991, 874.51980991, 874.51980991,
       874.51980991])

In [175]:
# MSE of the model on the training data; its square root (about 5,639) is the RMSE.
mean_squared_error(y_train, y_pred_train)

31803606.675827023

In [174]:
# RMSE of the baseline on the training data (squared=False returns RMSE, not MSE).
mean_squared_error(y_train, y_null, squared=False)

5654.807480337054
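A baseline comparison is easiest to read when both numbers use the same metric and the same rows. The sketch below does this on synthetic data (not the Citi Bike rides): it computes RMSE for a fitted model and for a mean-only baseline on the test set, with the baseline mean taken from the training targets only. The helper `rmse` is defined here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: y depends on the first feature plus noise.
X = rng.normal(size=(1000, 3))
y = 5.0 + 2.0 * X[:, 0] + rng.normal(scale=1.0, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LinearRegression().fit(X_train, y_train)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as y."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


# Baseline: always predict the mean of the *training* targets.
baseline_pred = np.full(len(y_test), y_train.mean())

model_rmse = rmse(y_test, model.predict(X_test))
baseline_rmse = rmse(y_test, baseline_pred)
print(f"model RMSE: {model_rmse:.3f}, baseline RMSE: {baseline_rmse:.3f}")
```

Computing RMSE by hand avoids depending on the `squared=False` argument, whose availability varies across scikit-learn versions.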

## Fit a Linear Regression model in `statsmodels` predicting `tripduration`.

In [176]:
import statsmodels.api as sm


In [177]:
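# Note: no intercept column is added here (sm.add_constant is not used),
# so the summary below reports uncentered R-squared.
mod = sm.OLS(y_train, X_train).fit()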

In [178]:
mod.summary()

0,1,2,3
Dep. Variable:,tripduration,R-squared (uncentered):,0.028
Model:,OLS,Adj. R-squared (uncentered):,0.026
Method:,Least Squares,F-statistic:,12.88
Date:,"Fri, 07 Aug 2020",Prob (F-statistic):,0.0
Time:,19:49:27,Log-Likelihood:,-1469000.0
No. Observations:,146078,AIC:,2939000.0
Df Residuals:,145747,BIC:,2942000.0
Df Model:,331,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
gender,180.1784,38.551,4.674,0.000,104.620,255.737
79,1661.4124,292.156,5.687,0.000,1088.792,2234.033
82,2010.2658,410.719,4.895,0.000,1205.266,2815.266
83,1331.6633,422.938,3.149,0.002,502.714,2160.613
116,1062.1962,213.880,4.966,0.000,642.995,1481.398
119,1202.8612,958.634,1.255,0.210,-676.043,3081.765
120,1201.2216,729.184,1.647,0.099,-227.965,2630.408
127,1187.0613,239.498,4.956,0.000,717.650,1656.473
128,1184.3530,222.642,5.320,0.000,747.980,1620.726

0,1,2,3
Omnibus:,506356.477,Durbin-Watson:,2.001
Prob(Omnibus):,0.0,Jarque-Bera (JB):,224989563514.894
Skew:,66.176,Prob(JB):,0.0
Kurtosis:,6081.431,Cond. No.,15300.0


## Evaluate your model
Using the `statsmodels` summary, test whether or not `age` has a significant effect when predicting `tripduration`.
- Be sure to specify your null and alternative hypotheses, and to state your conclusion carefully and correctly **in the context of your model**!

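**Null Hypothesis:** In this model, the coefficient on `age` is zero, holding the other features constant.  
**Alternative Hypothesis:** In this model, the coefficient on `age` is not zero, holding the other features constant.

With this specific model, `age` does not have a significant effect when predicting trip duration, so we fail to reject the null hypothesis for `age`. Because we dummified `start station id`, the model has more than 300 columns, which likely contributes to high variance. By contrast, the coefficient on `gender` has a p-value below our significance level of 0.05 (reported as 0.000 in the summary), so we reject the null hypothesis that its coefficient is zero: holding the other features constant, trip duration differs by gender.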

## Citi Bike is attempting to market to people who they think will ride their bike for a long time. Based on your modeling, what types of individuals should Citi Bike market toward?

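Based on our model, it is hard to predict which types of individuals would ride for the longest, because the model explains almost none of the variation in trip duration. However, looking at the raw numbers, gender 2 has a higher average trip duration (about 991 seconds, versus about 814 seconds for gender 1).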
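Looking at the raw numbers can be made more systematic with a grouped summary. The sketch below uses a handful of made-up rides, not the real data. It reports the median alongside the mean because the regression summary above shows extreme skew in the residuals, so a few very long trips can dominate an average.

```python
import pandas as pd

# Hypothetical rides; column names follow the lab data.
rides = pd.DataFrame({
    'usertype': ['Subscriber', 'Subscriber', 'Customer', 'Customer', 'Subscriber'],
    'gender': [1, 2, 0, 0, 1],
    'tripduration': [382, 591, 1355, 848, 372],
})

# Mean, median, and count of trip duration for each usertype/gender group.
summary = (
    rides.groupby(['usertype', 'gender'])['tripduration']
    .agg(['mean', 'median', 'count'])
    .sort_values('median', ascending=False)
)
print(summary)
```

Including the count column guards against drawing conclusions from groups with very few rides.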