<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Lab 3.02: Statistical Modeling and Model Validation

> Authors: Tim Book, Matt Brems

---

## Objective
The goal of this lab is to guide you through the modeling workflow. You will follow best practices when splitting your data and validating your model. The aim is not necessarily to build the best model you can, but to build and evaluate a model and interpret its results.

## Imports

In [1]:
# Import everything you need here.
# You may want to return to this cell to import more things later in the lab.
# DO NOT COPY AND PASTE FROM OUR CLASS SLIDES!
# Muscle memory is important!
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import statsmodels.api as sm

## Read Data
The `citibike` dataset consists of Citi Bike ridership data for over 224,000 rides in February 2014.

In [2]:
# Read in the citibike data in the data folder in this repository.

citibike = pd.read_csv('./data/citibike_feb2014.csv')
citibike

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,382,2014-02-01 00:00:00,2014-02-01 00:06:22,294,Washington Square E,40.730494,-73.995721,265,Stanton St & Chrystie St,40.722293,-73.991475,21101,Subscriber,1991,1
1,372,2014-02-01 00:00:03,2014-02-01 00:06:15,285,Broadway & E 14 St,40.734546,-73.990741,439,E 4 St & 2 Ave,40.726281,-73.989780,15456,Subscriber,1979,2
2,591,2014-02-01 00:00:09,2014-02-01 00:10:00,247,Perry St & Bleecker St,40.735354,-74.004831,251,Mott St & Prince St,40.723180,-73.994800,16281,Subscriber,1948,2
3,583,2014-02-01 00:00:32,2014-02-01 00:10:15,357,E 11 St & Broadway,40.732618,-73.991580,284,Greenwich Ave & 8 Ave,40.739017,-74.002638,17400,Subscriber,1981,1
4,223,2014-02-01 00:00:41,2014-02-01 00:04:24,401,Allen St & Rivington St,40.720196,-73.989978,439,E 4 St & 2 Ave,40.726281,-73.989780,19341,Subscriber,1990,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
224731,848,2014-02-28 23:57:13,2014-03-01 00:11:21,498,Broadway & W 32 St,40.748549,-73.988084,432,E 7 St & Avenue A,40.726218,-73.983799,17413,Subscriber,1976,1
224732,1355,2014-02-28 23:57:55,2014-03-01 00:20:30,470,W 20 St & 8 Ave,40.743453,-74.000040,302,Avenue D & E 3 St,40.720828,-73.977932,15608,Subscriber,1985,2
224733,304,2014-02-28 23:58:17,2014-03-01 00:03:21,497,E 17 St & Broadway,40.737050,-73.990093,334,W 20 St & 7 Ave,40.742388,-73.997262,17112,Subscriber,1968,1
224734,308,2014-02-28 23:59:10,2014-03-01 00:04:18,353,S Portland Ave & Hanson Pl,40.685396,-73.974315,365,Fulton St & Grand Ave,40.682232,-73.961458,14761,Subscriber,1982,1


## Explore the data
Use this space to familiarize yourself with the data.

Convince yourself there are no issues with the data. If you find any issues, clean them here.

In [3]:
citibike.info()
# starttime and stoptime are stored as strings; they should be datetimes.
# birth year is stored as a string; it should be numeric (integer years).

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224736 entries, 0 to 224735
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             224736 non-null  int64  
 1   starttime                224736 non-null  object 
 2   stoptime                 224736 non-null  object 
 3   start station id         224736 non-null  int64  
 4   start station name       224736 non-null  object 
 5   start station latitude   224736 non-null  float64
 6   start station longitude  224736 non-null  float64
 7   end station id           224736 non-null  int64  
 8   end station name         224736 non-null  object 
 9   end station latitude     224736 non-null  float64
 10  end station longitude    224736 non-null  float64
 11  bikeid                   224736 non-null  int64  
 12  usertype                 224736 non-null  object 
 13  birth year               224736 non-null  object 
 14  gend

In [4]:
citibike['starttime'] = pd.to_datetime( citibike['starttime'] )
citibike['stoptime'] = pd.to_datetime( citibike['stoptime'])
citibike.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224736 entries, 0 to 224735
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   tripduration             224736 non-null  int64         
 1   starttime                224736 non-null  datetime64[ns]
 2   stoptime                 224736 non-null  datetime64[ns]
 3   start station id         224736 non-null  int64         
 4   start station name       224736 non-null  object        
 5   start station latitude   224736 non-null  float64       
 6   start station longitude  224736 non-null  float64       
 7   end station id           224736 non-null  int64         
 8   end station name         224736 non-null  object        
 9   end station latitude     224736 non-null  float64       
 10  end station longitude    224736 non-null  float64       
 11  bikeid                   224736 non-null  int64         
 12  usertype        

In [5]:


# Missing birth years appear as the literal string '\N' in this file.
# Replace that marker with NaN so the column can be converted to numbers.
citibike['birth year'] = citibike['birth year'].replace(r'\N', np.nan)


In [6]:
# Convert birth year to a numeric column; anything unparseable becomes NaN.
citibike['birth year'] = pd.to_numeric(citibike['birth year'], errors='coerce')

In [7]:
citibike['birth year']

0         1991.0
1         1979.0
2         1948.0
3         1981.0
4         1990.0
           ...  
224731    1976.0
224732    1985.0
224733    1968.0
224734    1982.0
224735    1960.0
Name: birth year, Length: 224736, dtype: float64

### Is average trip duration different by gender?

Conduct a hypothesis test that checks whether or not the average trip duration is different for `gender=1` and `gender=2`. Be sure to specify your null and alternative hypotheses, and to state your conclusion carefully and correctly!

In [8]:

# tripdur1 = []
# tripdur2 = []

# for i in range(len(citibike)):
#     if citibike['gender'][i] == 1:
#         tripdur1.append(citibike['tripduration'][i])
#     if citibike['gender'][i] == 2:
#         tripdur2.append(citibike['tripduration'][i])
        

In [9]:
citibike.groupby('gender')['tripduration'].agg(['mean'])

gender,mean
0,1740.830932
1,814.032409
2,991.361074


In [10]:
# np.mean(tripdur1)

In [11]:
# np.mean(tripdur2)

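**Answer:** Gender 2 has the longer mean trip duration in this sample (about 991 seconds versus about 814 seconds for gender 1). Comparing the two means is not a hypothesis test by itself, though. The hypotheses are $H_0$: the mean trip durations for `gender=1` and `gender=2` are equal, and $H_A$: the mean trip durations are different. A two-sample t-test on the two groups is still needed before rejecting or failing to reject $H_0$; a null hypothesis is never "accepted".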
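A difference in sample means becomes a hypothesis test only once sampling variability is taken into account. The sketch below runs Welch's two-sample t-test with `scipy.stats.ttest_ind` on synthetic samples. The helper name `compare_means`, the 0.05 threshold, and the simulated numbers are assumptions for illustration, not results from the Citi Bike data.

```python
import numpy as np
from scipy import stats


def compare_means(a, b, alpha=0.05):
    """Welch's t-test. H0: the two group means are equal; HA: they differ."""
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
    return t_stat, p_value, bool(p_value < alpha)


# Synthetic stand-ins for the trip durations (in seconds) of two groups.
rng = np.random.default_rng(0)
group_1 = rng.normal(loc=800, scale=300, size=5000)
group_2 = rng.normal(loc=1000, scale=300, size=5000)

t_stat, p_value, reject_h0 = compare_means(group_1, group_2)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, reject H0: {reject_h0}")
```

On the lab data, the two inputs would be `citibike.loc[citibike['gender'] == 1, 'tripduration']` and its `gender == 2` counterpart. Welch's version is used here because it does not assume the two groups have equal variance.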
    

### What numeric columns shouldn't be treated as numeric?

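**Answer:** The ID columns (`start station id`, `end station id`, `bikeid`) and `gender` are stored as numbers but are really category labels, so they should not be treated as numeric. The coordinates arguably belong in the same group: latitude and longitude are numeric, but using them meaningfully would take a more complicated algorithm, such as measuring the riding distance between two points.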

### Dummify the `start station id` Variable

In [12]:
citibike['start station id'] = citibike['start station id'].astype(str)
citibike['end station id'] = citibike['end station id'].astype(str)

In [13]:
startstationid_df = pd.get_dummies(citibike['start station id'])
citibike2 = citibike.join(startstationid_df)
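One thing to watch when dummifying: keeping a column for every station alongside an intercept makes the columns perfectly collinear, because the dummies sum to one in every row. A minimal sketch on a toy frame, assuming `drop_first=True` and a `station` prefix are acceptable choices (neither appears in the lab code above):

```python
import pandas as pd

# Toy stand-in for the station ID column, already cast to strings.
toy = pd.DataFrame({"start station id": ["72", "79", "82", "72"]})

# One column per category: every row sums to 1, which duplicates an intercept.
full = pd.get_dummies(toy["start station id"], prefix="station", dtype=int)

# Dropping the first level leaves k-1 columns; the dropped station is the baseline.
reduced = pd.get_dummies(
    toy["start station id"], prefix="station", drop_first=True, dtype=int
)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

The prefix also keeps the dummy column names from looking like bare numbers once they are joined back onto the main DataFrame.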

In [14]:
citibike2

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,...,537,538,539,540,545,546,72,79,82,83
0,382,2014-02-01 00:00:00,2014-02-01 00:06:22,294,Washington Square E,40.730494,-73.995721,265,Stanton St & Chrystie St,40.722293,...,0,0,0,0,0,0,0,0,0,0
1,372,2014-02-01 00:00:03,2014-02-01 00:06:15,285,Broadway & E 14 St,40.734546,-73.990741,439,E 4 St & 2 Ave,40.726281,...,0,0,0,0,0,0,0,0,0,0
2,591,2014-02-01 00:00:09,2014-02-01 00:10:00,247,Perry St & Bleecker St,40.735354,-74.004831,251,Mott St & Prince St,40.723180,...,0,0,0,0,0,0,0,0,0,0
3,583,2014-02-01 00:00:32,2014-02-01 00:10:15,357,E 11 St & Broadway,40.732618,-73.991580,284,Greenwich Ave & 8 Ave,40.739017,...,0,0,0,0,0,0,0,0,0,0
4,223,2014-02-01 00:00:41,2014-02-01 00:04:24,401,Allen St & Rivington St,40.720196,-73.989978,439,E 4 St & 2 Ave,40.726281,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
224731,848,2014-02-28 23:57:13,2014-03-01 00:11:21,498,Broadway & W 32 St,40.748549,-73.988084,432,E 7 St & Avenue A,40.726218,...,0,0,0,0,0,0,0,0,0,0
224732,1355,2014-02-28 23:57:55,2014-03-01 00:20:30,470,W 20 St & 8 Ave,40.743453,-74.000040,302,Avenue D & E 3 St,40.720828,...,0,0,0,0,0,0,0,0,0,0
224733,304,2014-02-28 23:58:17,2014-03-01 00:03:21,497,E 17 St & Broadway,40.737050,-73.990093,334,W 20 St & 7 Ave,40.742388,...,0,0,0,0,0,0,0,0,0,0
224734,308,2014-02-28 23:59:10,2014-03-01 00:04:18,353,S Portland Ave & Hanson Pl,40.685396,-73.974315,365,Fulton St & Grand Ave,40.682232,...,0,0,0,0,0,0,0,0,0,0


## Feature Engineering
Engineer a feature called `age` that gives how old the person would have been in 2014 (the year the data was collected).
- Note: you will need to clean the data a bit.

In [15]:
# agelist = [2014 - i for i in citibike2['birth year']]


In [16]:
# citibike2['age']=agelist

In [17]:
citibike2['age'] = 2014 - citibike2['birth year']


In [18]:
# Collect the index labels of the rows whose age is missing.
agedrop = citibike2.index[citibike2['age'].isna()]



In [19]:
citibike2.drop(index=agedrop, inplace=True)

In [20]:
citibike2.reset_index(drop=True, inplace=True)

## Split your data into train/test sets

Look at the size of your data. What is a good proportion for your split? **Justify your answer, considering the size of your data and the default split size in sklearn.**

Use the `tripduration` column as your `y` variable.

For your `X` variables, use `age`, `usertype`, `gender`, and the dummy variables you created from `start station id`. (Hint: You may find the Pandas `.drop()` method helpful here.) 
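To make the split-size question concrete, the sketch below compares sklearn's default hold-out (25% of rows) with an explicit 80/20 split. It uses a synthetic frame of 1,000 rows; the row count, column names, and values are placeholders, not the lab data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 1,000 rows, two features, one target.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "age": rng.integers(16, 80, size=1000),
    "gender": rng.integers(1, 3, size=1000),
})
y_demo = pd.Series(rng.normal(900, 300, size=1000), name="tripduration")

# Default behaviour: 25% of rows are held out when no size is given.
_, X_test_default, _, _ = train_test_split(X_demo, y_demo, random_state=42)

# Explicit 80/20 split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

print(len(X_test_default), len(X_tr), len(X_te))
```

With more than 200,000 rows, even a 20% hold-out leaves over 40,000 test rows, so giving the training set a larger share costs little in test-set precision.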

In [21]:
citibike2['usertype'].unique()
# Encode usertype as 1 for 'Subscriber' and 0 otherwise.
citibike2['usertype'] = (citibike2['usertype'] == 'Subscriber').astype(int)

In [22]:
citibike2.columns

Index(['tripduration', 'starttime', 'stoptime', 'start station id',
       'start station name', 'start station latitude',
       'start station longitude', 'end station id', 'end station name',
       'end station latitude',
       ...
       '538', '539', '540', '545', '546', '72', '79', '82', '83', 'age'],
      dtype='object', length=345)

In [23]:
X = citibike2.drop(columns=['tripduration', 'starttime', 'stoptime',
                       'start station id','start station name', 'start station latitude',
                       'start station longitude', 'end station id', 'end station name',
                       'end station latitude', 'end station longitude', 
                       'bikeid', 'birth year'], axis=1)

y = citibike2['tripduration']

X.head()

# Leave-one-out cross-validation would take far too long on a dataset this size.
# K-fold is cheaper, and the number of folds sets the train/test proportions.

# train_test_split is the simplest and most intuitive option to set up, so start
# with a single split. sklearn holds out 25% by default, but with this many rows
# a 20% test set is still large, so use train_size=0.8.

Unnamed: 0,usertype,gender,116,119,120,127,128,137,143,144,...,538,539,540,545,546,72,79,82,83,age
0,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,23.0
1,1,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,35.0
2,1,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,66.0
3,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,33.0
4,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,24.0


In [24]:
X['age']

0         23.0
1         35.0
2         66.0
3         33.0
4         24.0
          ... 
218014    38.0
218015    29.0
218016    46.0
218017    32.0
218018    54.0
Name: age, Length: 218019, dtype: float64

## Fit a Linear Regression model in `sklearn` predicting `tripduration`.

In [25]:
lr = LinearRegression()

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8 , random_state=42)

In [27]:
lr.fit(X_train, y_train)

LinearRegression()

## Evaluate your model
Look at some evaluation metrics for **both** the training and test data. 
- How did your model do? Is it overfit, underfit, or neither?
- Does this model outperform the baseline? (e.g. setting $\hat{y}$ to be the mean of our training `y` values.)

In [28]:
# Check the R-squared score on the training set, then on the test set.

lr.score(X_train, y_train)

0.004324457579454322

In [29]:
lr.score(X_test, y_test)

-0.0028506794723102136

In [30]:
cross_val_score(lr, X, y, cv = 5)

array([-4.32901138e-03, -9.34710788e-04, -1.49710924e-03, -1.77424175e-02,
       -8.61106551e+14])
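Passing `cv=5` for a regressor uses five unshuffled folds, so the rows are split in file order, which here is chronological. A fold score as extreme as the last one above often points to numerical trouble, for example collinear dummy columns or a station that is absent from the training folds; that diagnosis is a guess, not something the output confirms. A sketch of shuffled folds using `KFold`, on synthetic data with assumed names:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with a known linear signal.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 4))
true_coefs = np.array([2.0, -1.0, 0.5, 0.0])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.5, size=500)

# Shuffle before splitting so each fold is a random sample of the rows.
folds = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=folds)

print(scores.round(3), round(float(scores.mean()), 3))
```

Shuffling does not repair collinearity on its own, but it removes the dependence on row order, which makes the fold scores easier to compare with the single train/test split.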

## Fit a Linear Regression model in `statsmodels` predicting `tripduration`.

In [31]:
X = sm.add_constant(X)
ols = sm.OLS(y, X).fit()

## Evaluate your model
Using the `statsmodels` summary, test whether or not `age` has a significant effect when predicting `tripduration`.
- Be sure to specify your null and alternative hypotheses, and to state your conclusion carefully and correctly **in the context of your model**!

In [32]:
ols.summary()

Dep. Variable:,tripduration,R-squared:,0.003
Model:,OLS,Adj. R-squared:,0.002
Method:,Least Squares,F-statistic:,2.074
Date:,"Sun, 06 Dec 2020",Prob (F-statistic):,6.57e-27
Time:,22:30:32,Log-Likelihood:,-2186000.0
No. Observations:,218019,AIC:,4373000.0
Df Residuals:,217688,BIC:,4376000.0
Df Model:,330,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
usertype,529.9550,57.808,9.168,0.000,416.653,643.257
gender,180.1440,30.175,5.970,0.000,121.002,239.286
116,-261.5534,150.071,-1.743,0.081,-555.688,32.581
119,-222.9842,736.524,-0.303,0.762,-1666.552,1220.584
120,871.0894,575.867,1.513,0.130,-257.595,1999.773
127,-205.7533,174.221,-1.181,0.238,-547.222,135.716
128,-213.0492,164.384,-1.296,0.195,-535.239,109.140
137,-234.3674,248.423,-0.943,0.345,-721.270,252.535
143,-210.1843,407.417,-0.516,0.606,-1008.712,588.343

Omnibus:,746051.188,Durbin-Watson:,2.0
Prob(Omnibus):,0.0,Jarque-Bera (JB):,292152855654.214
Skew:,63.738,Prob(JB):,0.0
Kurtosis:,5672.618,Cond. No.,2250000000000000.0


## Citi Bike is attempting to market to people who they think will ride their bike for a long time. Based on your modeling, what types of individuals should Citi Bike market toward?

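Based on the model's coefficients, Citi Bike could target older riders coded as gender 2, since both age and gender had a positive relationship with trip duration; the `gender` coefficient is about +180 seconds with a p-value below 0.001. Subscribers are also associated with longer trips (the `usertype` coefficient is about +530 seconds). Several start stations have large positive coefficients, so location-based ads may be worth exploring, although none of the station coefficients visible in the summary above is significant at the 0.05 level. These are coefficients from a regression, not correlations, and with an R-squared of about 0.003 the model explains very little of the variation in trip duration, so the recommendations should be treated as tentative.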