# In this notebook I will use Linear Regression
 with different approaches, such as polynomial feature, feature engineering, scaling, regularizations and others. The goal is to predict total amount a customer will spend for services with this cable TV for their duration there.
***
 We will see why categorical features converted into dummies don't do a good job explaining a linear relationship with a dependant numerical variable. It is a much better use for logistic regression case where we need to classify instead predicting a value. 
 ***
 I also demonstrate the HUGE importance of feature engineering. Not all features are created equal.

In [3]:
## I like to load all the libraries at the top so I can always check what I have or mind need

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split, KFold, cross_val_predict
from sklearn.preprocessing import StandardScaler, PolynomialFeatures



In [4]:
### Upload the csv file there are several ways to do that on Colab, 
### easiest is to uplode it to your files on the side bar or write the below code

from google.colab import files

gdn_file = files.upload()

Saving GDN clients.xlsx to GDN clients.xlsx


In [5]:
df = pd.read_excel("GDN clients.xlsx")

df.head(5)

### If your file is in json format you can
### still upload it as per below's code

#test_json = pd.read_json('FILE NAME.json')
#test_json.head(5)

Unnamed: 0,CustomerID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude,Gender,...,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn Value,Churn Score,CLTV,Churn Reason
0,3668-QPYBK,1,United States,California,Los Angeles,90003,"33.964131, -118.272783",33.964131,-118.272783,Male,...,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1,86,3239,Competitor made better offer
1,9237-HQITU,1,United States,California,Los Angeles,90005,"34.059281, -118.30742",34.059281,-118.30742,Female,...,Month-to-month,Yes,Electronic check,70.7,151.65,Yes,1,67,2701,Moved
2,9305-CDSKC,1,United States,California,Los Angeles,90006,"34.048013, -118.293953",34.048013,-118.293953,Female,...,Month-to-month,Yes,Electronic check,99.65,820.5,Yes,1,86,5372,Moved
3,7892-POOKP,1,United States,California,Los Angeles,90010,"34.062125, -118.315709",34.062125,-118.315709,Female,...,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes,1,84,5003,Moved
4,0280-XJGEX,1,United States,California,Los Angeles,90015,"34.039224, -118.266293",34.039224,-118.266293,Male,...,Month-to-month,Yes,Bank transfer (automatic),103.7,5036.3,Yes,1,89,5340,Competitor had better devices


In [6]:
### N of unique values for each column

df.nunique()

CustomerID           7043
Count                   1
Country                 1
State                   1
City                 1129
Zip Code             1652
Lat Long             1652
Latitude             1652
Longitude            1651
Gender                  2
Senior Citizen          2
Partner                 2
Dependents              2
Tenure Months          73
Phone Service           2
Multiple Lines          3
Internet Service        3
Online Security         3
Online Backup           3
Device Protection       3
Tech Support            3
Streaming TV            3
Streaming Movies        3
Contract                3
Paperless Billing       2
Payment Method          4
Monthly Charges      1585
Total Charges        6531
Churn Label             2
Churn Value             2
Churn Score            85
CLTV                 3438
Churn Reason           20
dtype: int64

In [171]:
### We have to clean our data, Decide which features we need for any give task.
### There are several ways to drop columns we dont need. First we make a copy
### of the dataframe and here I will use the pop() to do it

gdn_pop = df.copy()
pd.set_option('display.max_columns', None) ## to see all columns

list_pop = ['CustomerID', 'Latitude','Longitude','Count', 'Country', 'State', 'City','Lat Long', 
        'Online Security', 'Tech Support','Churn Value', 'Churn Score', 
        'Churn Label', 'Churn Reason', 'CLTV', 'Tenure Months', 'Monthly Charges' ]

for x in list_pop:
  gdn_pop.pop(x)





gdn_pop.head(5)

Unnamed: 0,Zip Code,Gender,Senior Citizen,Partner,Dependents,Phone Service,Multiple Lines,Internet Service,Online Backup,Device Protection,Streaming TV,Streaming Movies,Contract,Paperless Billing,Payment Method,Total Charges
0,90003,Male,No,No,No,Yes,No,DSL,Yes,No,No,No,Month-to-month,Yes,Mailed check,108.15
1,90005,Female,No,No,Yes,Yes,No,Fiber optic,No,No,No,No,Month-to-month,Yes,Electronic check,151.65
2,90006,Female,No,No,Yes,Yes,Yes,Fiber optic,No,Yes,Yes,Yes,Month-to-month,Yes,Electronic check,820.5
3,90010,Female,No,Yes,Yes,Yes,Yes,Fiber optic,No,Yes,Yes,Yes,Month-to-month,Yes,Electronic check,3046.05
4,90015,Male,No,No,Yes,Yes,Yes,Fiber optic,Yes,Yes,Yes,Yes,Month-to-month,Yes,Bank transfer (automatic),5036.3


In [7]:
### Here i use the easiest of all ways to to subcreate a dataframe from 
### the main dataframe - use [[]] to do it. Everything inside the double
### brackets stays in the new frame

gdn = df[['Zip Code', 'Gender', 'Senior Citizen',
             'Partner', 'Dependents', 'Phone Service', 
             'Multiple Lines', 'Internet Service', 'Device Protection',
             'Streaming TV','Streaming Movies', 
            'Contract', 'Paperless Billing', 'Payment Method',
            'Total Charges' ]]

gdn = gdn.replace(['0'],'new value')
gdn.head(5) 



Unnamed: 0,Zip Code,Gender,Senior Citizen,Partner,Dependents,Phone Service,Multiple Lines,Internet Service,Device Protection,Streaming TV,Streaming Movies,Contract,Paperless Billing,Payment Method,Total Charges
0,90003,Male,No,No,No,Yes,No,DSL,No,No,No,Month-to-month,Yes,Mailed check,108.15
1,90005,Female,No,No,Yes,Yes,No,Fiber optic,No,No,No,Month-to-month,Yes,Electronic check,151.65
2,90006,Female,No,No,Yes,Yes,Yes,Fiber optic,Yes,Yes,Yes,Month-to-month,Yes,Electronic check,820.5
3,90010,Female,No,Yes,Yes,Yes,Yes,Fiber optic,Yes,Yes,Yes,Month-to-month,Yes,Electronic check,3046.05
4,90015,Male,No,No,Yes,Yes,Yes,Fiber optic,Yes,Yes,Yes,Month-to-month,Yes,Bank transfer (automatic),5036.3


## NEXT I will bring up the unique values each column that holds a categorical value versus integer value

In [8]:
categorical_column_list = ['Gender', 'Senior Citizen',
             'Partner', 'Dependents', 'Phone Service', 
             'Multiple Lines', 'Internet Service', 'Contract', 
             'Device Protection', 'Streaming TV', 'Streaming Movies',
             'Paperless Billing', 'Payment Method']

unique_values_list = list()

for x in categorical_column_list:
  values = gdn[x].unique() 
  unique_values_list.append(values)


unique_values_list



[array(['Male', 'Female'], dtype=object),
 array(['No', 'Yes'], dtype=object),
 array(['No', 'Yes'], dtype=object),
 array(['No', 'Yes'], dtype=object),
 array(['Yes', 'No'], dtype=object),
 array(['No', 'Yes', 'No phone service'], dtype=object),
 array(['DSL', 'Fiber optic', 'No'], dtype=object),
 array(['Month-to-month', 'Two year', 'One year'], dtype=object),
 array(['No', 'Yes', 'No internet service'], dtype=object),
 array(['No', 'Yes', 'No internet service'], dtype=object),
 array(['No', 'Yes', 'No internet service'], dtype=object),
 array(['Yes', 'No'], dtype=object),
 array(['Mailed check', 'Electronic check', 'Bank transfer (automatic)',
        'Credit card (automatic)'], dtype=object)]

In [9]:
 ### Replacing/cleaning data
 
gdn['Multiple Lines'] = gdn['Multiple Lines'].replace(['No phone service'], 'No')

unique_values_list

[array(['Male', 'Female'], dtype=object),
 array(['No', 'Yes'], dtype=object),
 array(['No', 'Yes'], dtype=object),
 array(['No', 'Yes'], dtype=object),
 array(['Yes', 'No'], dtype=object),
 array(['No', 'Yes', 'No phone service'], dtype=object),
 array(['DSL', 'Fiber optic', 'No'], dtype=object),
 array(['Month-to-month', 'Two year', 'One year'], dtype=object),
 array(['No', 'Yes', 'No internet service'], dtype=object),
 array(['No', 'Yes', 'No internet service'], dtype=object),
 array(['No', 'Yes', 'No internet service'], dtype=object),
 array(['Yes', 'No'], dtype=object),
 array(['Mailed check', 'Electronic check', 'Bank transfer (automatic)',
        'Credit card (automatic)'], dtype=object)]

## I will convert all categorical values to one hot encoded values


<>

In [10]:
## First I will use Label Encoding which i prefer

from sklearn.preprocessing import LabelEncoder


le = LabelEncoder()


for x in categorical_column_list:
  gdn[x+' encoded'] = le.fit_transform(gdn[x])
  

gdn.head(5)

Unnamed: 0,Zip Code,Gender,Senior Citizen,Partner,Dependents,Phone Service,Multiple Lines,Internet Service,Device Protection,Streaming TV,...,Dependents encoded,Phone Service encoded,Multiple Lines encoded,Internet Service encoded,Contract encoded,Device Protection encoded,Streaming TV encoded,Streaming Movies encoded,Paperless Billing encoded,Payment Method encoded
0,90003,Male,No,No,No,Yes,No,DSL,No,No,...,0,1,0,0,0,0,0,0,1,3
1,90005,Female,No,No,Yes,Yes,No,Fiber optic,No,No,...,1,1,0,1,0,0,0,0,1,2
2,90006,Female,No,No,Yes,Yes,Yes,Fiber optic,Yes,Yes,...,1,1,1,1,0,2,2,2,1,2
3,90010,Female,No,Yes,Yes,Yes,Yes,Fiber optic,Yes,Yes,...,1,1,1,1,0,2,2,2,1,2
4,90015,Male,No,No,Yes,Yes,Yes,Fiber optic,Yes,Yes,...,1,1,1,1,0,2,2,2,1,0


In [11]:
### Dropping the columns we already labeled in the cell above, first we make a copy

gdn_labeled = gdn.copy()



In [12]:
for x in categorical_column_list:
  gdn_labeled.pop(x)


gdn_labeled.head(5)

Unnamed: 0,Zip Code,Total Charges,Gender encoded,Senior Citizen encoded,Partner encoded,Dependents encoded,Phone Service encoded,Multiple Lines encoded,Internet Service encoded,Contract encoded,Device Protection encoded,Streaming TV encoded,Streaming Movies encoded,Paperless Billing encoded,Payment Method encoded
0,90003,108.15,1,0,0,0,1,0,0,0,0,0,0,1,3
1,90005,151.65,0,0,0,1,1,0,1,0,0,0,0,1,2
2,90006,820.5,0,0,0,1,1,1,1,0,2,2,2,1,2
3,90010,3046.05,0,0,1,1,1,1,1,0,2,2,2,1,2
4,90015,5036.3,1,0,0,1,1,1,1,0,2,2,2,1,0


In [13]:
## We will use Total Charges as dependent variable and the one I will be predicting 
## Let's move it to the back for visual purposes

total_charges = gdn_labeled['Total Charges']

gdn_labeled.pop('Total Charges')

gdn_labeled['Total Charges'] = total_charges




In [15]:
print(gdn_labeled.shape)

print(gdn_labeled.dtypes)

#gdn_labeled.sort_values(by='Total Charges',ascending=False)

(7043, 15)
Zip Code                      int64
Gender encoded                int64
Senior Citizen encoded        int64
Partner encoded               int64
Dependents encoded            int64
Phone Service encoded         int64
Multiple Lines encoded        int64
Internet Service encoded      int64
Contract encoded              int64
Device Protection encoded     int64
Streaming TV encoded          int64
Streaming Movies encoded      int64
Paperless Billing encoded     int64
Payment Method encoded        int64
Total Charges                object
dtype: object


In [16]:
### Converting total charges from object to float64

gdn_labeled['Total Charges'] = pd.to_numeric(gdn_labeled['Total Charges'], errors='coerce')

#df = df.astype({"Unit_Price": float})



In [17]:
print(gdn_labeled.dtypes)



Zip Code                       int64
Gender encoded                 int64
Senior Citizen encoded         int64
Partner encoded                int64
Dependents encoded             int64
Phone Service encoded          int64
Multiple Lines encoded         int64
Internet Service encoded       int64
Contract encoded               int64
Device Protection encoded      int64
Streaming TV encoded           int64
Streaming Movies encoded       int64
Paperless Billing encoded      int64
Payment Method encoded         int64
Total Charges                float64
dtype: object


In [62]:
### Replace the Nan values with zeros

gdn_labeled['Total Charges'] = gdn_labeled['Total Charges'].fillna(0)


'NaN' in gdn_labeled['Total Charges']

False

Next, I will use the get_dummies() as another good way to get convert categories in numerical value

In [None]:
gdn_with_dummies = gdn_pop.copy() ### this copy of dataset to use to illustrate get_dummies() below



In [None]:
#for x in categorical_column_list:
  #pd.get_dummies(gdn_with_dummies[[x]], drop_first = True)



dummies_gdn = pd.get_dummies(gdn_with_dummies[['Gender', 'Senior Citizen',
             'Partner', 'Dependents', 'Phone Service', 
             'Multiple Lines', 'Internet Service', 'Contract', 
             'Device Protection', 'Streaming TV', 'Streaming Movies',
             'Paperless Billing', 'Payment Method']], drop_first=True)


## Add back the two columns that didn't need dummies

zip_code = gdn_with_dummies['Zip Code']

dummies_gdn['Zip Code'] = zip_code
dummies_gdn['Total Charges'] = total_charges


print(dummies_gdn.shape)
dummies_gdn.head(5)


(7043, 23)


Unnamed: 0,Gender_Male,Senior Citizen_Yes,Partner_Yes,Dependents_Yes,Phone Service_Yes,Multiple Lines_No phone service,Multiple Lines_Yes,Internet Service_Fiber optic,Internet Service_No,Contract_One year,Contract_Two year,Device Protection_No internet service,Device Protection_Yes,Streaming TV_No internet service,Streaming TV_Yes,Streaming Movies_No internet service,Streaming Movies_Yes,Paperless Billing_Yes,Payment Method_Credit card (automatic),Payment Method_Electronic check,Payment Method_Mailed check,Zip Code,Total Charges
0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,90003,108.15
1,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,90005,151.65
2,0,0,0,1,1,0,1,1,0,0,0,0,1,0,1,0,1,1,0,1,0,90006,820.5
3,0,0,1,1,1,0,1,1,0,0,0,0,1,0,1,0,1,1,0,1,0,90010,3046.05
4,1,0,0,1,1,0,1,1,0,0,0,0,1,0,1,0,1,1,0,0,0,90015,5036.3


NEXT step is to identify the Dependent variable (Y), which I will be predicting, and the Indpenedent variable (X) and its features 


In [19]:


y_column = 'Total Charges'

X = gdn_labeled.drop('Total Charges', axis =1 )
Y = gdn_labeled[y_column]

print(Y.shape)
print(X.shape)
X.head(5)



(7043,)
(7043, 14)


Unnamed: 0,Zip Code,Gender encoded,Senior Citizen encoded,Partner encoded,Dependents encoded,Phone Service encoded,Multiple Lines encoded,Internet Service encoded,Contract encoded,Device Protection encoded,Streaming TV encoded,Streaming Movies encoded,Paperless Billing encoded,Payment Method encoded
0,90003,1,0,0,0,1,0,0,0,0,0,0,1,3
1,90005,0,0,0,1,1,0,1,0,0,0,0,1,2
2,90006,0,0,0,1,1,1,1,0,2,2,2,1,2
3,90010,0,0,1,1,1,1,1,0,2,2,2,1,2
4,90015,1,0,0,1,1,1,1,0,2,2,2,1,0


## The next step is very improtant and that is to spil the date into train and test sections

In [38]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)


(4930, 14) (4930,)
(2113, 14) (2113,)


Unnamed: 0,Zip Code,Gender encoded,Senior Citizen encoded,Partner encoded,Dependents encoded,Phone Service encoded,Multiple Lines encoded,Internet Service encoded,Contract encoded,Device Protection encoded,Streaming TV encoded,Streaming Movies encoded,Paperless Billing encoded,Payment Method encoded
5577,90241,0,0,0,0,1,0,1,2,2,2,2,1,2
4053,95511,1,0,0,0,1,0,1,0,2,2,2,1,2
260,94598,1,0,1,0,1,1,1,0,0,2,2,1,1
6821,90630,1,0,1,1,1,0,0,2,2,2,2,1,0
6000,93230,0,0,1,1,1,1,0,2,2,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1024,92808,1,0,0,0,1,0,1,0,0,0,0,0,2
6277,94806,1,0,0,0,1,0,1,0,0,0,0,0,2
6640,95969,1,1,1,0,1,1,1,2,2,2,2,1,0
2240,92508,0,0,1,0,1,0,2,2,1,1,1,0,0


In [87]:
### Linear Regression fit/train/predict without Scaled data


lr_regular = LinearRegression()

lr_regular.fit(x_train, y_train)
lr_regular_predict = lr_regular.predict(x_test)

print(round(r2_score(y_test, lr_regular_predict),2))
print(round(mean_squared_error(y_test, lr_regular_predict),2))

df_reg = pd.DataFrame(lr_regular_predict)

df_reg.columns = ['predicted_value']
df_reg['real_value'] = test_y
df_reg['predicted_difference'] = df_reg['real_value'] - df_reg['predicted_value']


df_reg.head(5)

#predict_single_client_regular = lr_regular.predict(x_test[[0]])

0.58
2168472.99


Unnamed: 0,predicted_value,real_value,predicted_difference
0,-654.642769,108.15,762.792769
1,881.087001,151.65,-729.437001
2,423.58943,820.5,396.91057
3,1901.287094,3046.05,1144.762906
4,2699.04555,,


This next step can be very important especially if you have very large numeric values. In our case we have mostly dummies which are small values. The zip code is large value, so we can chose to just Standardize it or all X train features. Either way works. 

In [37]:
sc = StandardScaler()

x_train_sc =  sc.fit_transform(x_train)
x_test_sc = sc.transform(x_test)


array([[ 0.76815799,  0.98510084, -0.44503471, ..., -1.12298735,
         0.83630975,  0.39046863],
       [ 1.36094196,  0.98510084,  2.24701574, ...,  1.13351465,
         0.83630975,  0.39046863],
       [ 1.03740982,  0.98510084, -0.44503471, ...,  0.00526365,
        -1.19572922,  0.39046863],
       ...,
       [-1.13595224,  0.98510084, -0.44503471, ...,  0.00526365,
        -1.19572922, -0.54764558],
       [-0.77963695, -1.0151245 , -0.44503471, ...,  0.00526365,
        -1.19572922,  1.32858284],
       [ 0.11088257, -1.0151245 , -0.44503471, ..., -1.12298735,
        -1.19572922, -0.54764558]])

In [68]:
### Linear Regression fit/train/predict with Scaled data

lr_sc = LinearRegression()

lr_sc.fit(x_train_sc, y_train)
lr_predict_sc = np.round(lr_sc.predict(x_test_sc),2)

print(round(r2_score(y_test, lr_predict_sc),2))
print(round(mean_squared_error(y_test, lr_predict_sc),2))

predict_single_client = lr_sc.predict(x_test_sc[[0]])

print(f'The projected amount the customer will spend with us in $ is {np.round(predict_single_client,2)} \nusing scaled features')

df_sc = pd.DataFrame(lr_predict_sc)

df_sc.columns = ['predicted_value']
df_sc['real_value'] = test_y
df_sc['predicted_difference'] = df_sc['real_value'] - df_sc['predicted_value']


df_sc.head(5)

-0.0
5181246.01
The projected amount the customer will spend with us in $ is [2398.05] 
using scaled features


Unnamed: 0,predicted_value,real_value,predicted_difference
0,2398.05,108.15,-2289.9
1,2088.05,151.65,-1936.4
2,2210.5,820.5,-1390.0
3,2378.28,3046.05,667.77
4,2369.88,,


Here I will give examples of polynomial features if they need to be used at all. That helps train the algorithm to be more accurate in some cases

In [46]:
pl = PolynomialFeatures(degree = 2, include_bias=False)

X_pl = pl.fit_transform(X)


print(f" Before polynomial the number of X features was {X.shape[1]}, with polynomial the features are {X_pl.shape[1]}!")


 Before polynomial the number of X features was 14, with polynomial the features are 119!


DO StandardScaler() for polynomial features as well to compare results later 

In [51]:
train_x_pl, test_x_pl, train_y, test_y = train_test_split(X_pl, Y, test_size=0.3)
print(train_x_pl.shape, train_y.shape)

sc_pl = StandardScaler()

x_train_pl_sc =  sc_pl.fit_transform(train_x_pl)
x_test_pl_sc = sc_pl.transform(test_x_pl)

x_train_pl_sc.shape

(4930, 119) (4930,)


(4930, 119)

In [65]:
### Linear Regression fit/train/predict with Scaled polynomial data

lr_sc_pl = LinearRegression()

lr_sc_pl.fit(x_train_pl_sc, train_y)
lr_predict_sc_pl = np.round(lr_sc_pl.predict(x_test_pl_sc),2)

print(round(r2_score(test_y, lr_predict_sc_pl),2))
print(round(mean_squared_error(test_y, lr_predict_sc_pl),2))

predict_single_client_sc_pl = lr_sc_pl.predict(x_test_pl_sc[[0]])

print(f'The projected amount the customer will spend with us in $ is {np.round(predict_single_client_sc_pl,2)} \nusing scaled polynomial features')

df_sc_pl = pd.DataFrame(lr_predict_sc_pl)

df_sc_pl.columns = ['predicted_value']
df_sc_pl['real_value'] = test_y
df_sc_pl['predicted_difference'] = df_sc_pl['real_value'] - df_sc_pl['predicted_value']


df_sc_pl.head(5)



0.73
1339480.08
The projected amount the customer will spend with us in $ is [1071.14] 
using scaled polynomial features


Unnamed: 0,predicted_value,real_value,predicted_difference
0,1071.11,108.15,-962.96
1,4758.37,151.65,-4606.72
2,754.32,820.5,66.18
3,767.26,3046.05,2278.79
4,1077.54,,


## Let's look at some regularization with Lasso and Ridge next

In [82]:
alpha = 0.001
ridge = Ridge(alpha=alpha)


rr = ridge.fit(x_train_pl_sc, train_y)
y_pred_ridge = rr.predict(x_test_pl_sc)

print(round(r2_score(test_y, y_pred_ridge),2))
print(round(mean_squared_error(test_y, y_pred_ridge),2))

predict_single_client_sc_pl_ridge = rr.predict(x_test_pl_sc[[0]])

print(f'The projected amount the customer will spend with us in $ is {np.round(predict_single_client_sc_pl_ridge,2)} \nusing scaled polynomial features & ridge')

0.73
1339457.02
The projected amount the customer will spend with us in $ is [1070.72] 
using scaled polynomial features & ridge


In [86]:
### Using lasso regularization


alpha_lasso = 0.01
lasso = Lasso(alpha = alpha_lasso)


l = lasso.fit(x_train_pl_sc, y_train)
y_pred_lasso = l.predict(x_test_pl_sc)

print(round(r2_score(test_y, y_pred_lasso),2))
print(round(mean_squared_error(test_y, y_pred_lasso),2))

predict_single_client_sc_pl_lasso = l.predict(x_test_pl_sc[[0]])

print(f'The projected amount the customer will spend with us in $ is {np.round(predict_single_client_sc_pl_lasso,2)} \nusing scaled polynomial features & lasso')

-0.02
5127876.32
The projected amount the customer will spend with us in $ is [2198.87] 
using scaled polynomial features & lasso


  coef_, l1_reg, l2_reg, X, y, max_iter, tol, rng, random, positive


# Putting it all together in a DataFrame Results below

In [91]:
df_final_results = pd.DataFrame()



df_final_results = pd.DataFrame(lr_regular_predict)

df_final_results.columns = ['predicted_value']
df_final_results['lr_predict_sc_pl'] = lr_predict_sc_pl
df_final_results['y_pred_ridge'] = y_pred_ridge
df_final_results['y_pred_lasso'] = y_pred_lasso

df_final_results['real_value'] = test_y



df_final_results.head(5)

Unnamed: 0,predicted_value,lr_predict_sc_pl,y_pred_ridge,y_pred_lasso,real_value
0,-654.642769,1071.11,1070.72244,2198.865217,108.15
1,881.087001,4758.37,4758.398607,2961.809375,151.65
2,423.58943,754.32,754.281423,2312.174133,820.5
3,1901.287094,767.26,767.450781,1955.184309,3046.05
4,2699.04555,1077.54,1077.390705,2199.994071,


## I learnt from this project that too many catgeorical features converte into dummies don't do very well explaining a linear relationship with a dependant numerical variable. It is a much better use for logistic regression case where we need to classify instead predicting a value.