<a href="https://colab.research.google.com/github/gauriagarwal18/NYC-Taxi-Trip-Time-Prediction/blob/master/NYC_Taxi_Trip_Time_Prediction_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Taxi trip time Prediction : Predicting total ride duration of taxi trips in New York City</u></b>

## <b> Problem Description </b>

### Your task is to build a model that predicts the total ride duration of taxi trips in New York City. Your primary dataset is one released by the NYC Taxi and Limousine Commission, which includes pickup time, geo-coordinates, number of passengers, and several other variables.

##Data Loading And Description

###We are using the following libraries for analysis:
- NumPy: we use NumPy arrays because they are comparatively faster than Python lists, and DataFrame columns behave like NumPy arrays.

- Pandas: for reading the data from the CSV file, for data cleaning, and for preparing the data for analysis.

- matplotlib, seaborn: for visualisations, for exploratory data analysis, and for drawing conclusions from the data.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

###Data Loading
Loading the CSV file from Google Drive in the Colab environment:
We first load the CSV file that stores the data into the Colab environment as a DataFrame. We then work on a copy, so that the required cleaning and analysis never change the original data.



In [None]:
#Mount Google Drive and load the NYC taxi trip data from the CSV file
from google.colab import drive
drive.mount('/content/drive')
path="/content/drive/My Drive/AlmaBetter_Capstone_projects/Capstone_project2_ml/NYC_TaxiData.csv"
#columns 2 and 3 (pickup_datetime, dropoff_datetime) are parsed as dates
taxi_original=pd.read_csv(path,parse_dates=[2,3])
#work on a copy so that the original data stays unchanged
taxi= taxi_original.copy()

###Data Description

In [None]:
#shape of the data
taxi.shape

In [None]:
taxi.head()



The dataset is based on the 2016 NYC Yellow Cab trip record data made available in Big Query on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC). The data was sampled and cleaned for the purposes of this project. Based on individual trip attributes, you should predict the duration of each trip in the test set.

<b>NYC Taxi Data.csv</b> - the training set (contains 1458644 trip records)


<b>Data fields</b>
* #### id - a unique identifier for each trip
* #### vendor_id - a code indicating the provider associated with the trip record
* #### pickup_datetime - date and time when the meter was engaged
* #### dropoff_datetime - date and time when the meter was disengaged
* #### passenger_count - the number of passengers in the vehicle (driver entered value)
* #### pickup_longitude - the longitude where the meter was engaged
* #### pickup_latitude - the latitude where the meter was engaged
* #### dropoff_longitude - the longitude where the meter was disengaged
* #### dropoff_latitude - the latitude where the meter was disengaged
* #### store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
* #### trip_duration - duration of the trip in seconds

In [None]:
taxi.info()
"""
here we note that there is not any null value in data right now,
we have two date-time columns
"""

In [None]:
taxi.describe(include="all")

In [None]:
taxi.columns

In [None]:
categorical=["vendor_id","passenger_count","store_and_fwd_flag"]
continuous=['pickup_longitude', 'pickup_latitude','dropoff_longitude', 'dropoff_latitude','trip_duration']
for c in categorical:
  print(f"distribution of {c}:\n{taxi[c].value_counts()}\n\n")

Here we note that store_and_fwd_flag is highly imbalanced.
There are several records where passenger_count is zero. These records are of no use, as nobody needs a duration estimate for a taxi that carries no passenger.

##Data Cleaning

In [None]:
def print_null_percent(df):
  #percentage of null values in each column
  null_percent=df.isnull().mean()*100
  print("columns with null values\n",null_percent[null_percent!=0])



In [None]:
#removing outliers using z-score method
from scipy import stats

def remove_outliers(df):
  continuous_col=df.describe().columns
  df.boxplot(rot=90)
  plt.title("before removing outliers")
  plt.show()

  for c in continuous_col:
    #keeps rows whose z-score is below 3, so only the upper tail is trimmed;
    #values far below the mean are not removed by this condition
    df = df[stats.zscore(df[c])<3]

  df.boxplot(rot=90)
  plt.title("after removing outliers")
  plt.show()
  return df
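
The z-score condition used in `remove_outliers` is one-sided. A two-sided variant is sketched below as an illustration only: the helper name `zscore_filter` and the demo data are hypothetical and not part of this notebook. Unlike `remove_outliers`, it computes all z-scores once rather than re-computing them after each column is filtered, so the two functions are not interchangeable row for row.

```python
import numpy as np
import pandas as pd

def zscore_filter(df, cols, threshold=3.0):
    """Keep rows whose absolute z-score is below `threshold` in every listed column."""
    # ddof=0 matches the population standard deviation used by scipy.stats.zscore
    z = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
    return df[(z.abs() < threshold).all(axis=1)]

# hypothetical demo data: 200 well-behaved values plus one extreme value on each side
demo = pd.DataFrame({"value": np.append(np.linspace(-1.0, 1.0, 200), [50.0, -50.0])})
filtered = zscore_filter(demo, ["value"])
print(len(demo), len(filtered))
```

Both extreme values are dropped here, whereas a one-sided `< 3` condition would keep the large negative one.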

In [None]:
def normalization(df,col_list):
  """min-max scale every numeric column in col_list to the range [0, 1]"""
  for c1 in list(col_list):    #iterate over a copy, removing items from a list while looping over it skips elements
    if not pd.api.types.is_numeric_dtype(df[c1]):
      print(f"{c1} is not a numerical column, so it can not be normalized")
      continue
    mx=df[c1].max()
    mn=df[c1].min()
    df[c1]=(df[c1]-mn)/(mx-mn)
  return df

In [None]:
"""
presently the data does not have any null values, but they may be introduced later
"""
def cleaning(df,continuous_col=None,discrete_col=None,print_null=True,th=20.0):
  """
  this function drops sparse columns and rows, removes duplicates
  and fills the remaining null values
  """
  #avoid mutable default arguments
  continuous_col=continuous_col or []

  print(f"before cleaning\n")
  print(f"shape of data: {df.shape}")
  if(print_null):
    print_null_percent(df)
  
  #step1
  #preserving columns having at least th% of not null values
  df.dropna(axis=1,inplace=True,thresh=int(np.ceil((th/100.0)*df.shape[0])))
  #preserving rows having at least th% of not null values
  df.dropna(axis=0,inplace=True,thresh=int(np.ceil((th/100.0)*df.shape[1])))

  #step2
  df.drop_duplicates(inplace=True,ignore_index=True)

  #step3
  #filling all the remaining null values
  for c1 in df.columns:
    if c1 in continuous_col:
      #continuous column: fill with the mean
      df[c1]=df[c1].fillna(df[c1].mean())
    else:
      #categorical column: fill with the most frequent value
      df[c1]=df[c1].fillna(df[c1].value_counts().idxmax())

  print(f"\n\nAfter cleaning the data\n")
  print(f"shape of data: {df.shape}")
  print_null_percent(df)
  return df

In [None]:
taxi=cleaning(taxi,continuous,categorical,th=20)

In [None]:
taxi = remove_outliers(taxi)

In [None]:
taxi.shape  #we have also tried removing outliers from quantile method, but in that case nearly 25% of the total values get removed.

##Feature Engineering


In [None]:
"""
The column dropoff_datetime is a dependent column: trip time = dropoff time - pickup time,
so it gives away the target and is not known when a trip starts. It is of no use as a feature, so we remove it.
"""
taxi.drop(["dropoff_datetime"],inplace=True,axis=1)

In [None]:
"""
from previous analysis we note that some records have passenger count as 0,so those records are of no use so lets remove them.
"""
taxi.index=np.arange(0,taxi.shape[0])
passenger_0=np.where(taxi["passenger_count"]==0)
taxi.drop(passenger_0[0], inplace = True)
taxi.shape
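
The same removal can be written with a boolean mask, which avoids resetting the index by hand and looking up positions with `np.where`. The snippet below is a small sketch on hypothetical demo data, not on the taxi DataFrame itself.

```python
import pandas as pd

# hypothetical demo rows; the column names mirror the taxi data
demo = pd.DataFrame({
    "passenger_count": [1, 0, 2, 0, 3],
    "trip_duration": [455, 120, 663, 90, 2124],
})

# keep only the rows that carry at least one passenger, then renumber the index
demo = demo[demo["passenger_count"] != 0].reset_index(drop=True)
print(demo)
```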

In [None]:
taxi.head(2)

In [None]:
#remove the column id as it is of no use
taxi.drop("id",axis=1,inplace=True)

In [None]:
from datetime import datetime
from datetime import date

In [None]:
def get_weekdays(dates):
  import calendar
  from datetime import date
  week_days=[]
  for i in dates:
    my_date = i.date()
    week_days.append(calendar.day_name[my_date.weekday()])
  return week_days


In [None]:
taxi["pickup_weekday"]=get_weekdays((list(taxi["pickup_datetime"])))

In [None]:
def separate_date(date_time):
  years,months,dates=[],[],[]
  for i in date_time:
    years.append(i.year)
    months.append(i.month)
    dates.append(i.day)
  return years,months,dates
  

In [None]:
def separate_time(date_time):
  hours,minutes,seconds=[],[],[]
  for i in date_time:
    hours.append(i.hour)
    minutes.append(i.minute)
    seconds.append(i.second)
  return hours,minutes,seconds

In [None]:
years,months,dates=separate_date(taxi["pickup_datetime"])
taxi["pickup_year"]=years
taxi["pickup_date"]=dates
taxi["pickup_month"]=months

In [None]:
#for time we will only take hours, as they are important but having minutes and seconds is not required as we just want an idea of time.
hours,minutes,seconds=separate_time(taxi["pickup_datetime"])
taxi["pickup_hour"]=hours
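
The loop-based helpers above work, but on more than a million rows they are slow. A vectorised alternative using the pandas `.dt` accessor is sketched below on hypothetical demo timestamps; it assumes the column already has a datetime dtype, as it does after `parse_dates` in `read_csv`.

```python
import pandas as pd

# hypothetical demo timestamps with a datetime64 dtype
demo = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2016-03-14 17:24:55", "2016-06-12 00:43:35"])
})

dt = demo["pickup_datetime"].dt
demo["pickup_weekday"] = dt.day_name()   # 'Monday', 'Tuesday', ...
demo["pickup_month"] = dt.month          # 1..12
demo["pickup_hour"] = dt.hour            # 0..23
print(demo)
```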

In [None]:
taxi.columns

In [None]:

print(taxi["pickup_year"].value_counts())
#as year is only 2016 so it is of no use


In [None]:
#now we will drop some columns which we do not require
taxi.drop(['pickup_datetime', "pickup_year",'pickup_date'],axis=1,inplace=True)

Calculating distance from latitude and longitude. The haversine formula determines the great-circle distance between two points on a sphere given their longitudes and latitudes. The expression used here is the equivalent spherical law of cosines form, with the Earth's radius taken as 3963.0 miles, so the result is in miles:

$$\text{Distance} = 3963.0 \cdot \arccos\left[\sin(lat_1)\sin(lat_2) + \cos(lat_1)\cos(lat_2)\cos(long_1 - long_2)\right]$$

First we convert latitude and longitude to radians.

In [None]:
import math
def convert_radian(arg):
  ''' This function convert degree to radian
  input is in degree 
  output is in radian'''

  return arg*(math.pi/180)


In [None]:
distance = [ 'pickup_longitude','pickup_latitude', 'dropoff_longitude','dropoff_latitude']
for col in distance:
  taxi[col] = taxi[col].apply(convert_radian)

In [None]:
taxi.head()

In [None]:
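def haversine_formula(lat1,lat2,long1,long2):
  ''' lat1 = pickup latitude(in radian form)
  lat2 = dropoff_latitude(in radian form)
  long1 = pickup longitude(in radian form)
  long2 = dropoff longitude(in radian form)
  returns the great-circle distance in km'''
  # spherical law of cosines: a is the cosine of the central angle between the two points
  a = (np.sin(lat1)*np.sin(lat2))+(np.cos(lat1)*np.cos(lat2)*np.cos(long2-long1))
  b = np.arccos(a)   # central angle in radians
  c = 3963.0*1.609344*b # Earth radius in miles (3963.0) converted into KM

  return c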
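
The function above obtains the central angle through `arccos`, which can lose precision for points that are very close together. For comparison, here is a minimal sketch of the haversine form proper. The name `haversine_km`, the argument order, the degree inputs and the mean Earth radius of 6371.0 km are assumptions made for this example and differ from the function used in this notebook.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; the notebook uses 3963.0 miles

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # clip guards against values slightly outside [0, 1] caused by rounding
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# a quarter of the equator, and a zero-length trip
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
same_point = haversine_km(40.75, -73.98, 40.75, -73.98)
print(quarter, same_point)
```

The function also accepts NumPy arrays or pandas Series, so it can be applied to whole columns at once.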

In [None]:
taxi['total_distance'] = haversine_formula(taxi['pickup_latitude'],taxi['dropoff_latitude'],taxi['pickup_longitude'],taxi['dropoff_longitude'])

In [None]:
taxi.describe()

In [None]:
taxi.boxplot(column = 'total_distance')

In [None]:
len(taxi[taxi['total_distance']>50])

In [None]:
#outliers removal of  total_distance
taxi = taxi[taxi['total_distance']<50]

In [None]:
taxi.shape

In [None]:
def convert_weekday(x):
  if x in ['Monday','Tuesday','Wednesday','Thursday','Friday']:
    x = 0
    return x
  else:
    x = 1
    return x

In [None]:
taxi['pickup_is_weekend'] = taxi['pickup_weekday'].apply(convert_weekday)

In [None]:
def convert_pickup_hour(x):
  if x in [0,1,2,3,4,5,6]:
    x = 'mid_night'
    return x
  elif x in [7,8,9,10,11,12]:
    x = 'office_time'
    return x
  elif x in [13,14,15,16,17,18]:
    x = 'lunch_time'
    return x
  else:
    x = 'Evening_time'
    return x

In [None]:
taxi['pickup_shift'] = taxi['pickup_hour'].apply(convert_pickup_hour)

In [None]:
taxi.head()

In [None]:
numeric_feature = taxi.describe().columns
numeric_feature

In [None]:
# Now check the distribution of dependent variable
# (histplot replaces distplot, which is deprecated in recent seaborn versions)
fig = plt.figure(figsize=(10,7))
sns.histplot(taxi['trip_duration'], kde=True, stat="density", color = 'g')

In [None]:
fig = plt.figure(figsize=(10,7))
sns.histplot(np.sqrt(taxi['trip_duration']), kde=True, stat="density", color = 'g')

In [None]:
for col in numeric_feature[1:]:
  if col not in ['passenger_count','trip_duration']:
    fig = plt.figure(figsize=(10,7))
    sns.histplot(taxi[col], kde=True, stat="density")
    plt.xlabel(col)
    plt.show()

In [None]:
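#let's have a look at correlation of different attributes to spot multicollinearity
plt.figure(figsize=(15,8))
#numeric_only=True leaves the text columns out; recent pandas versions raise an error otherwise
correlation = taxi.corr(numeric_only=True)
sns.heatmap(abs(correlation), annot=True, cmap='coolwarm')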

From here we note that there is a good correlation between trip_duration and total_distance, which suggests that duration depends mainly on distance and not on the path or on the initial and final positions. The four columns that represent the geographic location of the pickup and dropoff points are therefore candidates for removal. The drop in the next cell is commented out, so these columns remain in the data as features.

In [None]:
#taxi.drop(["pickup_latitude","pickup_longitude","dropoff_latitude","dropoff_longitude"],axis=1,inplace=True)

In [None]:
taxi["speed"]=taxi["total_distance"]/taxi["trip_duration"] 

In [None]:
def remove_outliers2(df,continuous_col=[]):

  if len(continuous_col)==0:

    continuous_col=df.describe().columns
  df.boxplot(rot=90)
  plt.title("before removing outliers",)
  plt.show()
  
  for c in continuous_col:
    df.index=np.arange(0,df.shape[0])
    Q1=np.quantile(df[c],0.25)
    Q3=np.quantile(df[c],0.75)
    IQR= Q3 - Q1
    upper=np.where(df[c]>=(Q3+1.5*IQR))[0]
    #print(upper[0])
    lower=np.where(df[c]<=(Q1-1.5*IQR))[0]   #it will be a tuple and we require a numpy array which is at it's first index.
    #print(lower)
    outliers_idx=np.unique(np.append(upper,lower)) 
    df.drop(outliers_idx, inplace = True) 
    
    #df.loc[upper][c]=Q3
    #df.loc[lower][c]=Q1
     
  df.boxplot(rot=90)
  plt.title("after removing outliers",)
  plt.show()
  return df
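
The IQR rule used in `remove_outliers2` can also be expressed with a boolean mask that returns a new DataFrame instead of dropping rows in place. This is a sketch under that assumption; the helper name `iqr_filter` and the demo values are hypothetical.

```python
import pandas as pd

def iqr_filter(df, col, k=1.5):
    """Keep rows strictly inside (Q1 - k*IQR, Q3 + k*IQR) for one column."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = (df[col] > q1 - k * iqr) & (df[col] < q3 + k * iqr)
    return df[keep].reset_index(drop=True)

# hypothetical demo data with one obvious outlier
demo = pd.DataFrame({"speed": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]})
cleaned = iqr_filter(demo, "speed")
print(cleaned)
```

Because the input is left untouched, the rows before and after filtering can be compared without keeping a separate copy.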

In [None]:
taxi=remove_outliers2(taxi,["speed"])

In [None]:
taxi.shape#1397670

In [None]:
taxi.drop(["speed"],axis=1,inplace=True)

In [None]:
#remove multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(x):
  vif = pd.DataFrame()
  vif['columns'] = x.columns
  vif['vif_values'] = [variance_inflation_factor(x.values,i) for i in range(x.shape[1])]

  return vif

In [None]:
numeric_feature1 = taxi.describe().columns
numeric_feature1

In [None]:
calc_vif(taxi[[col for col in numeric_feature1 if col not in ['trip_duration']]])

#we do not need to drop any column as now none of the attribute have a vif value greater than 10
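
One point worth checking when reading VIF values: as far as we know, `variance_inflation_factor` regresses each column on the others exactly as passed in and does not add an intercept by itself, so results can differ from the textbook definition VIF_i = 1 / (1 - R_i^2) when no constant column is included. The sketch below computes that textbook form with NumPy only, so the two can be compared. The helper name `vif_with_intercept` and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd

def vif_with_intercept(x):
    """VIF_i = 1 / (1 - R_i^2), regressing column i on the other columns plus an intercept."""
    values = x.to_numpy(dtype=float)
    vifs = []
    for i in range(values.shape[1]):
        target = values[:, i]
        others = np.delete(values, i, axis=1)
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2 = 1.0 - residual.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return pd.DataFrame({"columns": x.columns, "vif_values": vifs})

# hypothetical demo data: 'a_plus_noise' is almost a copy of 'a', 'b' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
demo = pd.DataFrame({"a": a, "b": b, "a_plus_noise": a + 0.1 * rng.normal(size=500)})
vif_table = vif_with_intercept(demo)
print(vif_table)
```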

In [None]:
taxi.columns

In [None]:
#Now we do some analysis on categorical variable
categorical_feature = list(taxi.describe(include = 'object').columns)
categorical_feature.extend(['pickup_is_weekend'])

In [None]:
categorical_feature

In [None]:
#to have a look at the distribution of various categorical features
for col in categorical_feature:
  fig = plt.figure(figsize= (10,7))
  ax = fig.gca()
  counts = taxi[col].value_counts()
  counts.plot.bar(ax=ax,color='y')
  ax.set_title(col + ' count')
  ax.set_xlabel(col)
  ax.set_ylabel('frequency')

So we will drop the columns pickup_weekday and pickup_hour: pickup_weekday is correlated with pickup_is_weekend, and pickup_hour is correlated with pickup_shift.

In [None]:
#but we do not very much bother about their separate distribution what we care about is how these features affect our dependent variable
features=["vendor_id","pickup_month"]+categorical_feature
for col in features:
  plt.figure(figsize=(15,12),facecolor='white',edgecolor='orange')
  plt.title("trip_duration v/s total_distance".title())
  sns.scatterplot(x="trip_duration",y="total_distance",data=taxi,hue=col, alpha=1, legend="brief")#,y_bins=[10*i for i in range(0,20)])
  plt.show()

From above we note that:
- vendor_id: for a longer trip or a longer path, vendor_id 2 is preferred.
- pickup_month: the data is not properly separated, so for a better result we will do some more analysis on this column.
- store_and_fwd_flag: this is a highly imbalanced column, and there is no particular separation, so we will drop it.
- pickup_weekday: there is no particular separation, so we will drop it.
- pickup_shift and pickup_is_weekend show some boundaries along which the different colours can be separated, so these attributes are important.

In [None]:
taxi.drop(['store_and_fwd_flag', 'pickup_weekday'],axis=1,inplace=True)

Conclusion:
The two dropped features, store_and_fwd_flag and pickup_weekday, do not affect the trip duration much: all the colours are mixed, and there is no particular boundary that separates them, so we drop these two features.

In [None]:
categorical_feature=taxi.describe(include="object").columns   #names of the remaining text columns

In [None]:
for col in categorical_feature:
  fig = plt.figure(figsize=(10,7))
  ax = fig.gca()
  taxi.boxplot(column = 'trip_duration',by = col, ax=ax)
  ax.set_xlabel(col)
  ax.set_ylabel('trip_duration')
  plt.show()


In [None]:
#performing some analysis on pickup_hour
taxi[["pickup_hour","trip_duration"]].corr()
#from here we note that this attribute do not have a good correlation with trip duration so we drop it
taxi.drop(["pickup_hour"],axis=1,inplace=True)

In [None]:
#some analysis on pickup month

taxi["pickup_month"].value_counts()
#there are total six pickup months and other six month  data is not included so what if we want to get a duration for any 
#of the next six month, so we drop this attribute also
taxi.drop(["pickup_month"],axis=1,inplace=True)

In [None]:
taxi["pickup_shift"].unique()

In [None]:
#convert the categorical attribute pickup_shift into integer codes (label encoding, not dummy variables)
shift_codes={'mid_night':0, 'office_time':1, 'lunch_time':2, 'Evening_time':3}
taxi['pickup_shift']=taxi['pickup_shift'].map(shift_codes)
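
The cell above maps the four shifts to the integers 0 to 3, which imposes an order on them. If dummy variables are wanted instead, `pd.get_dummies` is one option. The sketch below uses hypothetical demo data; the `shift` prefix and the choice of `drop_first=True` are assumptions for the example, not what the models in this notebook were trained on.

```python
import pandas as pd

# hypothetical demo rows using the same four labels as pickup_shift
demo = pd.DataFrame({
    "pickup_shift": ["mid_night", "office_time", "lunch_time", "Evening_time", "office_time"]
})

# one column per category; drop_first removes one level to avoid a redundant column
dummies = pd.get_dummies(demo["pickup_shift"], prefix="shift", drop_first=True, dtype=int)
encoded = pd.concat([demo.drop(columns="pickup_shift"), dummies], axis=1)
print(encoded)
```

The dropped level (here the alphabetically first label) is represented by a row of zeros.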

In [None]:
#columns left with us
taxi.columns

In [67]:
(taxi.describe()).columns #here we note that all the columns have numeric data type so data is ready for prediction

Index(['vendor_id', 'passenger_count', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'trip_duration',
       'total_distance', 'pickup_is_weekend', 'pickup_shift'],
      dtype='object')

In [68]:
#building model

independent_var=list((abs((taxi.corr())["trip_duration"]).sort_values(ascending=False)).index)[1:]
dependent_var=["trip_duration"]
independent_var
x=taxi[independent_var]
y=taxi["trip_duration"]
#arranged independent var in proper order

#just to try
x=x.iloc[:50000]
y=y.iloc[:50000]

In [69]:
independent_var

['total_distance',
 'pickup_latitude',
 'dropoff_latitude',
 'pickup_longitude',
 'dropoff_longitude',
 'pickup_is_weekend',
 'pickup_shift',
 'passenger_count',
 'vendor_id']

In [71]:
#train-test split

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test= train_test_split(x, y, test_size=0.30, random_state=324)



##Linear Regression

In [70]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

In [72]:
linear_model=LinearRegression()
linear_model.fit(x_train,y_train)


LinearRegression()

In [73]:
y_train_predicted=linear_model.predict(x_train)
y_test_predicted=linear_model.predict(x_test)




In [74]:
math.sqrt(mean_squared_error(y_train,y_train_predicted))

374.2604327243455

In [75]:
r2_score(y_train, y_train_predicted)

0.6639250939095025

In [76]:
r2_score(y_test, y_test_predicted)

0.645636361190796

##Decision Tree

In [77]:
from sklearn.tree import DecisionTreeRegressor 
  
# create a regressor object
regressor = DecisionTreeRegressor(max_depth = 15, min_impurity_decrease = 0.1, min_samples_split = 600,random_state = 0) 
  
# fit the regressor with X and Y data
regressor.fit(x_train, y_train)

DecisionTreeRegressor(max_depth=15, min_impurity_decrease=0.1,
                      min_samples_split=600, random_state=0)

In [78]:
regressor.score(x_train,y_train)

0.7187486164000061

In [79]:
y_pred_dt = regressor.predict(x_test)

In [80]:
r2 = r2_score((y_test), (y_pred_dt))
r2

0.6824833546623231

In [81]:
MSE  = mean_squared_error((y_test), (y_pred_dt))
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

MSE : 147015.42112379687
RMSE : 383.42590043422587


##Gradient Boosting:

In [116]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
model=GradientBoostingRegressor()
grid_values = {'n_estimators':[50, 100], 'max_depth':[6,8]}
model= GridSearchCV(model, param_grid = grid_values, scoring = 'r2', cv=7)

model.fit(x_train,y_train)

GridSearchCV(cv=7, estimator=GradientBoostingRegressor(),
             param_grid={'max_depth': [6, 8], 'n_estimators': [50, 100]},
             scoring='r2')

In [117]:
y_train_predicted=model.predict(x_train)
y_test_predicted=model.predict(x_test)




In [118]:
math.sqrt(mean_squared_error(y_train,y_train_predicted))

277.7603771991829

In [119]:
r2_score(y_train, y_train_predicted)

0.8148904429002747

In [120]:
r2_score(y_test, y_test_predicted)

0.7158320979485548

In [121]:
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor
xg = XGBRegressor()
learning_rate = [0.5,0.6]
subsample = [0.8,0.9]
max_depth = [7]
grid = dict(learning_rate=learning_rate, subsample=subsample, max_depth=max_depth)
grid_search1 = GridSearchCV(xg,param_grid = grid,scoring = 'r2', cv = 3)
grid_search1.fit(x_train,y_train)



GridSearchCV(cv=3, estimator=XGBRegressor(),
             param_grid={'learning_rate': [0.5, 0.6], 'max_depth': [7],
                         'subsample': [0.8, 0.9]},
             scoring='r2')

In [125]:
print("Best: %f using %s" % (grid_search1.best_score_, grid_search1.best_params_))

Best: 0.699805 using {'learning_rate': 0.5, 'max_depth': 7, 'subsample': 0.9}


In [126]:
from sklearn.ensemble import StackingRegressor
estimators = [ ('rg', regressor), ('lm', model) ]
reg = StackingRegressor( estimators=estimators,final_estimator=grid_search1 )


In [127]:
reg.fit(x_train,y_train)



StackingRegressor(estimators=[('rg',
                               DecisionTreeRegressor(max_depth=15,
                                                     min_impurity_decrease=0.1,
                                                     min_samples_split=600,
                                                     random_state=0)),
                              ('lm',
                               GridSearchCV(cv=7,
                                            estimator=GradientBoostingRegressor(),
                                            param_grid={'max_depth': [6, 8],
                                                        'n_estimators': [50,
                                                                         100]},
                                            scoring='r2'))],
                  final_estimator=GridSearchCV(cv=3, estimator=XGBRegressor(),
                                               param_grid={'learning_rate': [0.5,
                                           

In [128]:
y_train_predicted=reg.predict(x_train)
y_test_predicted=reg.predict(x_test)




In [129]:
math.sqrt(mean_squared_error(y_train,y_train_predicted))

333.89607158637534

In [134]:
r2_train=r2_score(y_train, y_train_predicted)
r2_train

0.7325079531033415

In [135]:
r2_test=r2_score(y_test, y_test_predicted)
r2_test

0.676093478982323

In [132]:
v1= (1-r2_test)
v2= ((x_train.shape[0])-1) / ((x_train.shape[0])-(x_train.shape[1])-1)
adj_rsquared = (1 - (v1 * v2))
adj_rsquared

0.6760101649300463
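
The adjusted R² above combines the test-set R² with the number of training rows. With this many rows the penalty term is tiny either way, but the sample count would normally be that of the set being scored. A small helper, sketched below, makes the choice explicit. The function name `adjusted_r2` is hypothetical; the row counts assume the 50,000-row sample with a 30% test split and the nine features listed earlier.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# 50,000 sampled rows, test_size=0.30, 9 independent variables
n_train, n_test, n_features = 35000, 15000, 9
r2_test = 0.676093478982323   # test R^2 of the stacked model reported above

adj_with_train_rows = adjusted_r2(r2_test, n_train, n_features)  # what the cell above computes
adj_with_test_rows = adjusted_r2(r2_test, n_test, n_features)    # uses the size of the scored set
print(adj_with_train_rows, adj_with_test_rows)
```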

In [133]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
ridge=Ridge()
params={"alpha":[1e-15,1e-10,1e-8,1e-3,1e-2,1,2,3,4,5,10,20,30]}
ridge_regressor=GridSearchCV(ridge,params,scoring="neg_mean_squared_error",cv=5)
ridge_regressor.fit(x_train,y_train)
print(ridge_regressor.best_params_)   #which alpha (lambda) value is most suitable
print(ridge_regressor.best_score_)    #mean cross-validated negative MSE; closer to 0 is better
#the selected alpha is close to 0, so ridge behaves almost like plain linear regression and brings no real improvement in this case.

{'alpha': 1e-08}
-140187.25184010228


In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
lasso=Lasso()
params={"alpha":[1e-15,1e-10,1e-8,1e-3,1e-2,1,2,3,4,5,10,20,30,40,45,50,55,70,90,100]}
lasso_regressor=GridSearchCV(lasso,params,scoring="neg_mean_squared_error",cv=5)
lasso_regressor.fit(x_train,y_train)
print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)    

In [None]:
y_train_predicted=ridge_regressor.predict(x_train)
y_test_predicted=ridge_regressor.predict(x_test)




In [None]:
math.sqrt(mean_squared_error(y_train,y_train_predicted))

In [None]:
r2_score(y_train, y_train_predicted)

In [None]:
r2_score(y_test, y_test_predicted)

In [None]:
y_train_predicted=lasso_regressor.predict(x_train)
y_test_predicted=lasso_regressor.predict(x_test)

In [None]:
math.sqrt(mean_squared_error(y_train,y_train_predicted))

In [None]:
r2_score(y_train, y_train_predicted)

In [None]:
#store the test-set score, it is used for the adjusted R-squared below
r2_test=r2_score(y_test, y_test_predicted)
r2_test

In [None]:
x_train.shape[0]

In [None]:
v1= (1-r2_test)
v2= ((x_train.shape[0])-1) / ((x_train.shape[0])-(x_train.shape[1])-1)
adj_rsquared = (1 - (v1 * v2))
adj_rsquared
