# Multivariate linear regression with feature engineering

# Introduction

In this study, we fit a multivariate linear regression model, using one-hot encoding and feature engineering to enrich the dataset. The objective is to predict the target variable, the mean cancer mortality rate per 100,000 people (TARGET_deathRate), as closely as possible.

Based on results from stages A and B, the study also includes regularisation techniques to address overfitting and improve the model's precision.

Ideally, the model follows these principles:

* Reduces overfitting: having less redundant data lowers the risk of basing predictions on random variation.

* Improves accuracy: fewer erroneous data points mean less misleading signal, so the model's accuracy improves.

* Accelerates training: with less data to process, algorithms train faster.

# Libraries used

In [3]:
import pandas as pd
import numpy as np

#plotting data
#import seaborn as sns
import altair as alt
#import matplotlib.pyplot as plt 

In [4]:
import warnings
warnings.filterwarnings('ignore')

# Reading Data from a CSV File

In [5]:
df_test = pd.read_csv('cancer_us_county-testing.csv')

In [6]:
df_test

Unnamed: 0,avgAnnCount,avgDeathsPerYear,TARGET_deathRate,incidenceRate,medIncome,popEst2015,povertyPercent,studyPerCap,binnedInc,MedianAge,...,PctEmpPrivCoverage,PctPublicCoverage,PctPublicCoverageAlone,PctWhite,PctBlack,PctAsian,PctOtherRace,PctMarriedHouseholds,BirthRate,Id
0,449.000000,154,159.5,479.800000,51880,104926,18.7,57.183158,"(51046.4, 54545.6]",30.2,...,51.0,24.9,13.1,81.260411,4.154831,10.045737,0.876222,41.071243,4.367123,2553
1,340.000000,140,167.2,438.500000,55472,55423,12.4,0.000000,"(54545.6, 61494.5]",46.9,...,37.6,36.3,16.0,93.660078,0.818115,0.626281,3.116360,57.529142,6.844366,904
2,54.000000,18,131.6,410.800000,49380,10103,11.7,0.000000,"(48021.6, 51046.4]",49.4,...,32.6,40.3,19.4,98.292181,0.041152,0.164609,0.051440,55.928482,1.604585,2192
3,94.000000,46,189.4,403.800000,45979,16708,13.5,598.515681,"(45201, 48021.6]",43.9,...,45.1,33.1,13.3,96.090377,1.555569,0.715680,0.378541,48.409405,8.255410,1326
4,2718.000000,1065,168.9,432.100000,51527,726106,20.7,60.597213,"(51046.4, 54545.6]",33.5,...,41.7,37.7,25.8,57.002148,7.093743,14.785464,11.692122,51.852122,6.148433,2394
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
604,27.000000,14,187.6,415.700000,50155,4769,12.0,0.000000,"(48021.6, 51046.4]",42.6,...,42.0,27.4,12.1,85.768985,0.616890,0.042544,2.105935,51.104816,6.659013,2247
605,30.000000,9,131.6,444.600000,46961,4854,14.0,0.000000,"(45201, 48021.6]",41.3,...,40.7,30.9,15.8,96.122281,0.800164,0.307755,1.292573,49.671883,5.760870,2947
606,583.000000,258,187.5,429.400000,39907,127780,22.1,62.607607,"(37413.8, 40362.7]",36.9,...,39.8,37.9,22.7,81.407683,6.285701,4.388991,2.247924,47.875108,6.387886,1746
607,1962.667684,31,174.2,453.549422,50905,14219,9.3,0.000000,"(48021.6, 51046.4]",39.1,...,52.1,33.9,17.6,93.756201,1.658398,0.836286,0.290574,45.219595,5.893846,1822


In [7]:
df_train = pd.read_csv('cancer_us_county-training.csv')

In [8]:
df_train

Unnamed: 0,avgAnnCount,avgDeathsPerYear,TARGET_deathRate,incidenceRate,medIncome,popEst2015,povertyPercent,studyPerCap,binnedInc,MedianAge,...,PctEmpPrivCoverage,PctPublicCoverage,PctPublicCoverageAlone,PctWhite,PctBlack,PctAsian,PctOtherRace,PctMarriedHouseholds,BirthRate,Id
0,88.000000,40,261.0,561.400000,29090,13352,26.8,2771.120431,"[22640, 34218.1]",39.8,...,32.0,47.5,32.9,99.693045,0.044920,0.000000,0.000000,55.499459,6.838710,0
1,73.000000,35,167.3,345.600000,29782,21903,38.8,0.000000,"[22640, 34218.1]",32.3,...,18.8,45.3,34.1,94.791383,1.649850,0.063631,2.854286,52.818296,4.799131,1
2,292.000000,124,191.0,468.400000,41955,48985,15.5,0.000000,"(40362.7, 42724.4]",42.2,...,44.9,34.5,16.0,95.102348,1.741749,0.376429,0.445611,50.560800,3.996826,2
3,1962.667684,7,165.4,453.549422,55378,3007,11.1,0.000000,"(54545.6, 61494.5]",41.6,...,49.6,30.1,15.2,85.833870,0.933677,0.160979,7.244044,52.565181,3.291536,3
4,43.000000,20,160.6,349.700000,26309,8551,35.3,0.000000,"[22640, 34218.1]",43.9,...,30.4,45.1,24.5,24.535525,73.223736,0.394100,1.396239,33.641208,3.166561,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2433,389.000000,157,185.3,456.600000,65485,83199,7.7,0.000000,"(61494.5, 125635]",40.1,...,55.1,24.2,11.3,91.187608,4.445537,1.405213,0.858126,60.095060,5.175689,3041
2434,286.000000,117,196.7,492.400000,42477,46222,16.9,281.251352,"(40362.7, 42724.4]",40.8,...,46.2,34.9,18.8,90.130702,5.943936,0.472935,0.485833,51.648588,4.651829,3042
2435,103.000000,42,204.1,506.700000,40339,18201,21.3,0.000000,"(37413.8, 40362.7]",38.6,...,34.4,36.9,20.6,65.463178,30.550955,0.645909,0.104891,49.758980,4.344104,3043
2436,1962.667684,23,171.1,453.549422,39764,8856,16.7,0.000000,"(37413.8, 40362.7]",43.8,...,37.3,40.0,21.3,94.625317,0.154508,0.629070,0.684251,49.880605,6.210826,3045


# Case study and data understanding

Hypothesis

After the feature selection experiments conducted in part B of the study, the case study for this stage has the following hypothesis:

Applying techniques such as regularisation and feature engineering will control overfitting and lead to an accurate model for predicting the mean cancer mortality rate per 100,000 people.

# Explore the dataset

In [9]:
df_train.describe().round(2)

Unnamed: 0,avgAnnCount,avgDeathsPerYear,TARGET_deathRate,incidenceRate,medIncome,popEst2015,povertyPercent,studyPerCap,MedianAge,MedianAgeMale,...,PctEmpPrivCoverage,PctPublicCoverage,PctPublicCoverageAlone,PctWhite,PctBlack,PctAsian,PctOtherRace,PctMarriedHouseholds,BirthRate,Id
count,2438.0,2438.0,2438.0,2438.0,2438.0,2438.0,2438.0,2438.0,2438.0,2438.0,...,2438.0,2438.0,2438.0,2438.0,2438.0,2438.0,2438.0,2438.0,2438.0,2438.0
mean,587.17,180.59,178.85,448.31,47028.43,98515.79,16.85,166.02,45.31,39.64,...,41.24,36.28,19.23,83.78,9.04,1.23,1.95,51.22,5.63,1523.79
std,1236.45,438.47,27.54,53.25,11919.39,274527.19,6.39,563.81,45.05,5.22,...,9.37,7.82,6.09,16.28,14.36,2.56,3.54,6.5,1.97,874.91
min,6.0,3.0,66.3,201.3,22640.0,827.0,3.2,0.0,22.3,22.4,...,13.5,11.2,2.6,10.2,0.0,0.0,0.0,22.99,0.0,0.0
25%,76.0,28.0,161.4,420.3,38872.75,11545.75,12.1,0.0,37.8,36.4,...,34.6,30.92,15.0,77.34,0.63,0.25,0.29,47.83,4.52,775.25
50%,172.5,62.0,178.1,453.55,45186.5,26942.5,15.9,0.0,41.0,39.6,...,41.3,36.4,18.8,90.12,2.3,0.55,0.8,51.66,5.37,1512.5
75%,521.5,151.0,195.3,481.98,52492.5,69524.5,20.4,92.56,44.08,42.5,...,47.6,41.5,23.1,95.46,10.45,1.21,2.11,55.33,6.46,2279.75
max,24965.0,9445.0,293.9,1014.2,125635.0,5238216.0,47.4,9762.31,624.0,64.7,...,70.7,65.1,46.6,100.0,84.87,42.62,41.93,78.08,18.56,3046.0


In [10]:
#Print general information about a DataFrame 
df_train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2438 entries, 0 to 2437
Data columns (total 35 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   avgAnnCount              2438 non-null   float64
 1   avgDeathsPerYear         2438 non-null   int64  
 2   TARGET_deathRate         2438 non-null   float64
 3   incidenceRate            2438 non-null   float64
 4   medIncome                2438 non-null   int64  
 5   popEst2015               2438 non-null   int64  
 6   povertyPercent           2438 non-null   float64
 7   studyPerCap              2438 non-null   float64
 8   binnedInc                2438 non-null   object 
 9   MedianAge                2438 non-null   float64
 10  MedianAgeMale            2438 non-null   float64
 11  MedianAgeFemale          2438 non-null   float64
 12  Geography                2438 non-null   object 
 13  AvgHouseholdSize         2438 non-null   float64
 14  PercentMarried          

The dataset's information shows each feature's name, data type and non-null count.
PctSomeCol18_24, PctPrivateCoverageAlone and PctEmployed16_Over have missing values.
These columns are poor candidates for training the model: keeping them would mean dropping every row that lacks a value, which loses data, so the experiments below drop the columns instead.
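
A quick way to see which columns are affected, and how badly, is to count the missing values per column. The sketch below uses a small made-up frame in place of df_train (the values are illustrative only) and then drops the affected columns with dropna(axis=1), the same call used in the experiments that follow.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are made up.
toy = pd.DataFrame({
    "incidenceRate": [479.8, 438.5, 410.8, 403.8],
    "PctSomeCol18_24": [np.nan, 42.1, np.nan, np.nan],
    "PctEmployed16_Over": [51.9, np.nan, 48.3, 55.0],
})

# Count and share of missing values per column, keeping only affected columns.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
summary = pd.DataFrame({"n_missing": missing, "share": missing / len(toy)})
print(summary)

# Dropping the affected columns keeps every row.
cleaned = toy.dropna(axis=1)
print(cleaned.shape)
```

Dropping columns keeps all rows at the cost of losing those features; dropping rows would keep the features but shrink the training set.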

# Experiment 1

## Train a multivariate linear regression before applying any data transformation 

* No standardisation

* No outlier cleaning

* No categorical data

In [11]:
df_train_no_transf = df_train.copy()
df_test_no_transf = df_test.copy()

In [12]:
df_train_no_transf = df_train_no_transf.dropna(axis=1)

In [13]:
df_train_no_transf[['avgAnnCount', 'avgDeathsPerYear', 'incidenceRate',
       'medIncome', 'popEst2015', 'povertyPercent', 'studyPerCap',
       'MedianAge', 'MedianAgeMale', 'MedianAgeFemale',
       'AvgHouseholdSize', 'PercentMarried', 'PctNoHS18_24', 'PctHS18_24',
       'PctBachDeg18_24', 'PctHS25_Over', 'PctBachDeg25_Over',
       'PctUnemployed16_Over', 'PctPrivateCoverage', 'PctEmpPrivCoverage',
       'PctPublicCoverage', 'PctPublicCoverageAlone', 'PctWhite', 'PctBlack',
       'PctAsian', 'PctOtherRace', 'PctMarriedHouseholds', 'BirthRate']]

Unnamed: 0,avgAnnCount,avgDeathsPerYear,incidenceRate,medIncome,popEst2015,povertyPercent,studyPerCap,MedianAge,MedianAgeMale,MedianAgeFemale,...,PctPrivateCoverage,PctEmpPrivCoverage,PctPublicCoverage,PctPublicCoverageAlone,PctWhite,PctBlack,PctAsian,PctOtherRace,PctMarriedHouseholds,BirthRate
0,88.000000,40,561.400000,29090,13352,26.8,2771.120431,39.8,39.2,40.5,...,44.8,32.0,47.5,32.9,99.693045,0.044920,0.000000,0.000000,55.499459,6.838710
1,73.000000,35,345.600000,29782,21903,38.8,0.000000,32.3,30.8,35.2,...,27.2,18.8,45.3,34.1,94.791383,1.649850,0.063631,2.854286,52.818296,4.799131
2,292.000000,124,468.400000,41955,48985,15.5,0.000000,42.2,40.9,43.8,...,67.4,44.9,34.5,16.0,95.102348,1.741749,0.376429,0.445611,50.560800,3.996826
3,1962.667684,7,453.549422,55378,3007,11.1,0.000000,41.6,38.3,46.3,...,70.9,49.6,30.1,15.2,85.833870,0.933677,0.160979,7.244044,52.565181,3.291536
4,43.000000,20,349.700000,26309,8551,35.3,0.000000,43.9,41.2,47.8,...,54.8,30.4,45.1,24.5,24.535525,73.223736,0.394100,1.396239,33.641208,3.166561
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2433,389.000000,157,456.600000,65485,83199,7.7,0.000000,40.1,39.1,41.0,...,75.7,55.1,24.2,11.3,91.187608,4.445537,1.405213,0.858126,60.095060,5.175689
2434,286.000000,117,492.400000,42477,46222,16.9,281.251352,40.8,39.3,42.2,...,65.9,46.2,34.9,18.8,90.130702,5.943936,0.472935,0.485833,51.648588,4.651829
2435,103.000000,42,506.700000,40339,18201,21.3,0.000000,38.6,39.1,37.9,...,58.4,34.4,36.9,20.6,65.463178,30.550955,0.645909,0.104891,49.758980,4.344104
2436,1962.667684,23,453.549422,39764,8856,16.7,0.000000,43.8,42.1,46.1,...,60.1,37.3,40.0,21.3,94.625317,0.154508,0.629070,0.684251,49.880605,6.210826


### Split the data 

In [14]:
X = df_train_no_transf[['avgAnnCount', 'avgDeathsPerYear', 'incidenceRate',
       'medIncome', 'popEst2015', 'povertyPercent', 'studyPerCap',
       'MedianAge', 'MedianAgeMale', 'MedianAgeFemale',
       'AvgHouseholdSize', 'PercentMarried', 'PctNoHS18_24', 'PctHS18_24',
       'PctBachDeg18_24', 'PctHS25_Over', 'PctBachDeg25_Over',
       'PctUnemployed16_Over', 'PctPrivateCoverage', 'PctEmpPrivCoverage',
       'PctPublicCoverage', 'PctPublicCoverageAlone', 'PctWhite', 'PctBlack',
       'PctAsian', 'PctOtherRace', 'PctMarriedHouseholds', 'BirthRate']].values
y = df_train_no_transf['TARGET_deathRate'].values

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

### Baseline Performance

In [16]:
 #average value of the target variable

y_mean = y_train.mean()
y_mean

179.36451282051283

In [17]:
y_base = np.full(y_train.shape, y_mean)

In [18]:
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae

In [19]:
print(mse(y_train, y_base))
print(mae(y_train, y_base))

774.1999714293228
21.443014017094015
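
The baseline always predicts the training mean, so its MSE is simply the variance of the target; a baseline MSE of about 774 corresponds to an RMSE of roughly 27.8 deaths per 100,000. Any fitted model should beat these numbers. A minimal sketch of that identity, using a few made-up target values in place of y_train:

```python
import numpy as np

# Made-up target values standing in for y_train.
y = np.array([159.5, 167.2, 131.6, 189.4, 168.9])

# Constant prediction: the mean of the target, repeated for every sample.
y_base = np.full(y.shape, y.mean())
baseline_mse = np.mean((y - y_base) ** 2)
baseline_mae = np.mean(np.abs(y - y_base))

# The MSE of the mean predictor equals the (population) variance of the target.
print(baseline_mse, y.var())
print(baseline_mae)
```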


Let's train the model

In [20]:
from sklearn.linear_model import LinearRegression 

In [21]:
reg1 = LinearRegression()

In [22]:
reg1.fit(X_train, y_train)

LinearRegression()

In [23]:
#model's performance
y_train_preds = reg1.predict(X_train)

In [24]:
# scores on the training set

print(mse(y_train, y_train_preds))
print(mae(y_train, y_train_preds))

358.13559037036185
14.022738628856397


In [25]:
# scores on the validation set

y_valid_preds= reg1.predict(X_valid)
print(mse(y_valid, y_valid_preds))
print(mae(y_valid, y_valid_preds))

373.99952548253924
15.005231257191411


In [26]:
# predictions vs target line charts on the train set
perfect_test = alt.Chart(pd.DataFrame({'target': y_train, 'preds': y_train})).mark_line(color='green').encode(
    x='target',
    y='preds'
)

pred_chart_test = alt.Chart(pd.DataFrame({'target': y_train, 'preds': y_train_preds})).mark_line().encode(
    x='target',
    y='preds'
  )

pred_chart_test + perfect_test

In [27]:
# predictions versus the target line charts on the validation set
perfect_test = alt.Chart(pd.DataFrame({'target': y_valid, 'preds': y_valid})).mark_line(color='green').encode(
    x='target',
    y='preds'
)

pred_chart_test = alt.Chart(pd.DataFrame({'target': y_valid, 'preds': y_valid_preds})).mark_line().encode(
    x='target',
    y='preds'
  )

pred_chart_test + perfect_test

Let's check on the test set

In [28]:
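# NOTE: this cell selects from df_train_no_transf, so the scores below are
# computed on the full training file rather than on the held-out test file.
# To score the test data, select the same columns from df_test instead.
X_test = df_train_no_transf[['avgAnnCount', 'avgDeathsPerYear', 'incidenceRate',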
       'medIncome', 'popEst2015', 'povertyPercent', 'studyPerCap',
       'MedianAge', 'MedianAgeMale', 'MedianAgeFemale',
       'AvgHouseholdSize', 'PercentMarried', 'PctNoHS18_24', 'PctHS18_24',
       'PctBachDeg18_24', 'PctHS25_Over', 'PctBachDeg25_Over',
       'PctUnemployed16_Over', 'PctPrivateCoverage', 'PctEmpPrivCoverage',
       'PctPublicCoverage', 'PctPublicCoverageAlone', 'PctWhite', 'PctBlack',
       'PctAsian', 'PctOtherRace', 'PctMarriedHouseholds', 'BirthRate']].values
y_test = df_train_no_transf['TARGET_deathRate'].values

In [29]:
#model's performance on the test set
y_test_preds = reg1.predict(X_test)
print(mse(y_test, y_test_preds))
print(mae(y_test, y_test_preds))

361.3109801713226
14.219398351016972


In [30]:
# predictions versus the target line charts on the test set
perfect_test = alt.Chart(pd.DataFrame({'target': y_test, 'preds': y_test})).mark_line(color='green').encode(
    x='target',
    y='preds'
)

pred_chart_test = alt.Chart(pd.DataFrame({'target': y_test, 'preds': y_test_preds})).mark_line().encode(
    x='target',
    y='preds'
  )

pred_chart_test + perfect_test

The current model does not improve on the scores of the models in Part B. Additionally, the validation error is slightly higher than the training error, which indicates a slight degree of overfitting.

Based on the results, it is worth performing the data cleaning and feature engineering process.

# Data cleaning

1. Outliers

After analysing the distributions of the variables, we found some extreme values. We have no extra information to confirm whether they are errors in the data, and replacing them with the mean or another aggregate could teach the model wrong patterns, so the decision is to drop the rows with outliers in the variables used to predict the target (death rate).
Additionally, the rows we lose represent only a small proportion of the data.
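
The cells below apply two fixed thresholds (incidenceRate > 700 and avgAnnCount > 5000). The same step can be sketched as one helper that also reports the share of rows removed, which makes the "small proportion" claim easy to check. The helper name drop_above and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def drop_above(df, limits):
    """Drop rows where any listed column exceeds its limit; report the share dropped."""
    mask = pd.Series(False, index=df.index)
    for column, limit in limits.items():
        mask |= df[column] > limit
    kept = df.loc[~mask]
    return kept, mask.mean()

# Toy frame standing in for df_train.
toy = pd.DataFrame({
    "incidenceRate": [479.8, 1014.2, 410.8, 403.8],
    "avgAnnCount": [449.0, 135.0, 8895.0, 94.0],
})
kept, share_dropped = drop_above(toy, {"incidenceRate": 700, "avgAnnCount": 5000})
print(f"dropped {share_dropped:.0%} of rows, kept {len(kept)}")
```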

In [31]:
df_train[df_train["incidenceRate"] > 700]

Unnamed: 0,avgAnnCount,avgDeathsPerYear,TARGET_deathRate,incidenceRate,medIncome,popEst2015,povertyPercent,studyPerCap,binnedInc,MedianAge,...,PctEmpPrivCoverage,PctPublicCoverage,PctPublicCoverageAlone,PctWhite,PctBlack,PctAsian,PctOtherRace,PctMarriedHouseholds,BirthRate,Id
851,135.0,23,162.1,1014.2,46954,15052,20.1,0.0,"(45201, 48021.6]",24.6,...,52.2,22.0,8.9,74.888166,15.277213,5.889928,0.460892,36.337594,2.181467,1083


In [32]:
df_train = df_train.drop(df_train[df_train.incidenceRate > 700].index)

In [33]:
df_train[df_train["avgAnnCount"] > 5000]

Unnamed: 0,avgAnnCount,avgDeathsPerYear,TARGET_deathRate,incidenceRate,medIncome,popEst2015,povertyPercent,studyPerCap,binnedInc,MedianAge,...,PctEmpPrivCoverage,PctPublicCoverage,PctPublicCoverageAlone,PctWhite,PctBlack,PctAsian,PctOtherRace,PctMarriedHouseholds,BirthRate,Id
124,8895.0,2817,163.8,528.0,85886,1501587,7.7,181.807648,"(61494.5, 125635]",40.6,...,55.7,26.2,12.0,80.948638,7.66232,3.781605,4.950535,57.618624,4.419519,156
274,6894.0,2471,164.3,441.1,58127,1982498,15.2,153.846309,"(54545.6, 61494.5]",34.0,...,47.5,25.1,16.4,70.641872,15.420684,4.963004,5.383735,49.654687,5.97949,353
304,15470.0,5780,146.6,401.4,53929,4167947,17.1,177.545444,"(51046.4, 54545.6]",35.6,...,44.1,31.4,19.8,79.580767,5.221044,3.789014,6.188306,47.036501,5.392191,391
470,10411.0,3927,197.9,528.7,41434,1759335,24.1,470.063973,"(40362.7, 42724.4]",37.8,...,41.0,42.0,26.0,53.342526,39.414346,2.904772,1.72881,37.156645,5.676241,602
473,8236.0,3303,211.7,533.5,39037,1567442,25.8,742.61121,"(37413.8, 40362.7]",33.7,...,38.8,41.3,27.6,41.672154,42.75757,6.864827,5.573247,27.459943,5.282606,606
794,8072.0,2584,145.2,463.9,75459,1644518,17.7,1258.727481,"(61494.5, 125635]",36.6,...,49.9,31.8,19.7,56.426514,15.021108,11.653341,12.381291,26.667902,3.548718,1016
817,14477.0,5108,161.4,433.8,54230,4538028,17.3,391.359419,"(51046.4, 54545.6]",32.8,...,42.8,27.5,19.8,63.121729,18.861747,6.572709,8.734237,46.913495,6.028644,1042
872,5978.0,2528,165.0,430.8,45162,949827,15.2,184.244078,"(42724.4, 45201]",47.1,...,38.5,37.1,20.0,82.642191,10.320491,3.170341,0.998844,40.138531,4.507693,1107
909,7861.0,2722,159.6,473.1,84026,1585139,8.9,49.207041,"(61494.5, 125635]",38.5,...,60.8,27.2,13.9,79.053104,4.882669,10.499603,2.388125,51.208076,4.751206,1146
913,6146.0,2183,183.3,544.1,50134,922578,15.2,121.39895,"(48021.6, 51046.4]",40.4,...,51.7,36.6,19.7,78.654143,13.309476,3.150879,2.151622,41.896742,4.815863,1151


In [34]:
df_train = df_train.drop(df_train[df_train.avgAnnCount > 5000].index)

In [35]:
df_train

Unnamed: 0,avgAnnCount,avgDeathsPerYear,TARGET_deathRate,incidenceRate,medIncome,popEst2015,povertyPercent,studyPerCap,binnedInc,MedianAge,...,PctEmpPrivCoverage,PctPublicCoverage,PctPublicCoverageAlone,PctWhite,PctBlack,PctAsian,PctOtherRace,PctMarriedHouseholds,BirthRate,Id
0,88.000000,40,261.0,561.400000,29090,13352,26.8,2771.120431,"[22640, 34218.1]",39.8,...,32.0,47.5,32.9,99.693045,0.044920,0.000000,0.000000,55.499459,6.838710,0
1,73.000000,35,167.3,345.600000,29782,21903,38.8,0.000000,"[22640, 34218.1]",32.3,...,18.8,45.3,34.1,94.791383,1.649850,0.063631,2.854286,52.818296,4.799131,1
2,292.000000,124,191.0,468.400000,41955,48985,15.5,0.000000,"(40362.7, 42724.4]",42.2,...,44.9,34.5,16.0,95.102348,1.741749,0.376429,0.445611,50.560800,3.996826,2
3,1962.667684,7,165.4,453.549422,55378,3007,11.1,0.000000,"(54545.6, 61494.5]",41.6,...,49.6,30.1,15.2,85.833870,0.933677,0.160979,7.244044,52.565181,3.291536,3
4,43.000000,20,160.6,349.700000,26309,8551,35.3,0.000000,"[22640, 34218.1]",43.9,...,30.4,45.1,24.5,24.535525,73.223736,0.394100,1.396239,33.641208,3.166561,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2433,389.000000,157,185.3,456.600000,65485,83199,7.7,0.000000,"(61494.5, 125635]",40.1,...,55.1,24.2,11.3,91.187608,4.445537,1.405213,0.858126,60.095060,5.175689,3041
2434,286.000000,117,196.7,492.400000,42477,46222,16.9,281.251352,"(40362.7, 42724.4]",40.8,...,46.2,34.9,18.8,90.130702,5.943936,0.472935,0.485833,51.648588,4.651829,3042
2435,103.000000,42,204.1,506.700000,40339,18201,21.3,0.000000,"(37413.8, 40362.7]",38.6,...,34.4,36.9,20.6,65.463178,30.550955,0.645909,0.104891,49.758980,4.344104,3043
2436,1962.667684,23,171.1,453.549422,39764,8856,16.7,0.000000,"(37413.8, 40362.7]",43.8,...,37.3,40.0,21.3,94.625317,0.154508,0.629070,0.684251,49.880605,6.210826,3045


2. Transforming the variables avgDeathsPerYear and avgAnnCount

Standardising avgDeathsPerYear to a mean number per 100,000 people

In [36]:
df_train["avgDeathsPerYear"] = (df_train.avgDeathsPerYear/df_train.popEst2015)*100000

Standardising avgAnnCount to a mean number per 100,000 people

In [37]:
df_train["avgAnnCount"] = (df_train.avgAnnCount/df_train.popEst2015)*100000

Drop the column used in the standardisation

In [38]:
# Remove columns
df_train = df_train.drop(['popEst2015'], axis=1)

In [39]:
#performing the same process for test data as a common pipeline for data predictions

df_test["avgDeathsPerYear"] = (df_test.avgDeathsPerYear/df_test.popEst2015)*100000
df_test["avgAnnCount"] = (df_test.avgAnnCount/df_test.popEst2015)*100000
df_test = df_test.drop(['popEst2015'], axis=1)
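
Because the training and test frames must go through exactly the same transformation, the three steps above can be wrapped in one function and applied to both. The function name to_per_100k is an assumption for illustration. The example reuses the first training row shown earlier (avgAnnCount 88, avgDeathsPerYear 40, popEst2015 13352), which gives the same rates that appear later in the encoded training table.

```python
import pandas as pd

def to_per_100k(df, count_cols, pop_col="popEst2015"):
    """Return a copy with counts rescaled to per-100,000 rates and pop_col dropped."""
    out = df.copy()
    for col in count_cols:
        out[col] = out[col] / out[pop_col] * 100_000
    return out.drop(columns=[pop_col])

# First training row from the tables above.
row = pd.DataFrame({
    "avgAnnCount": [88.0],
    "avgDeathsPerYear": [40],
    "popEst2015": [13352],
})
rates = to_per_100k(row, ["avgAnnCount", "avgDeathsPerYear"])
print(rates.round(2))
```

Rows whose avgAnnCount is the repeated value 1962.667684 produce very large rates after this step (for example in small counties), so it is worth inspecting them before training.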

3. Dropping columns

Based on the stage B results, the categorical income variable (binnedInc) does not add value to the model.
We will also delete other columns that are not part of this analysis.

In [40]:
df_train = df_train.drop(['Id'], axis=1)
df_train = df_train.drop(['binnedInc'], axis=1)

df_test = df_test.drop(['Id'], axis=1)
df_test = df_test.drop(['binnedInc'], axis=1)

In [41]:
#delete 3 columns with missing values
df_train = df_train.dropna(axis=1)
df_test = df_test.dropna(axis=1)

In [42]:
df_train.shape

(2408, 29)

In [43]:
df_test.shape

(609, 29)

# Feature engineering


1. Extract the state from the county

In [44]:
df_train['state'] = df_train['Geography'].str.rsplit(',').str[-1] 
df_test['state'] = df_test['Geography'].str.rsplit(',').str[-1] 

df_train = df_train.drop(['Geography'], axis=1)
df_test = df_test.drop(['Geography'], axis=1)
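
Splitting on the comma alone keeps the space that follows it, which is why the state columns later appear with a leading space (' Alabama', ' Alaska', and so on). This is harmless as long as train and test are processed identically. A variant that strips the space is sketched below as an alternative, not as what the notebook ran; the county strings are illustrative.

```python
import pandas as pd

geo = pd.Series(
    ["Kitsap County, Washington", "Harris County, Texas"],
    name="Geography",
)

# n=1 splits only at the last comma; str.strip() removes the space after it.
state = geo.str.rsplit(",", n=1).str[-1].str.strip()
print(state.tolist())
```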

2. Transforming categorical values

Before performing categorical transformation, let's save a copy of the data

In [45]:
df_train_noenc = df_train.copy()
df_test_noenc = df_test.copy()

In [46]:
df_train_noenc.shape

(2408, 29)

In [47]:
df_test_noenc.shape

(609, 29)

The following process transforms the categorical data using the get_dummies() function.

In [48]:
df_cat = pd.get_dummies(df_train_noenc["state"])
df_cat

Unnamed: 0,Alabama,Alaska,Arizona,Arkansas,California,Colorado,Connecticut,Delaware,District of Columbia,Florida,...,South Dakota,Tennessee,Texas,Utah,Vermont,Virginia,Washington,West Virginia,Wisconsin,Wyoming
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2433,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2434,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2435,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2436,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Combine the one-hot encoded columns contained in df_cat into df_train

In [49]:
df_train_enc = pd.concat([df_train_noenc, df_cat], axis=1)
df_train_enc

Unnamed: 0,avgAnnCount,avgDeathsPerYear,TARGET_deathRate,incidenceRate,medIncome,povertyPercent,studyPerCap,MedianAge,MedianAgeMale,MedianAgeFemale,...,South Dakota,Tennessee,Texas,Utah,Vermont,Virginia,Washington,West Virginia,Wisconsin,Wyoming
0,659.077292,299.580587,261.0,561.400000,29090,26.8,2771.120431,39.8,39.2,40.5,...,0,0,0,0,0,0,0,0,0,0
1,333.287677,159.795462,167.3,345.600000,29782,38.8,0.000000,32.3,30.8,35.2,...,0,0,1,0,0,0,0,0,0,0
2,596.100847,253.138716,191.0,468.400000,41955,15.5,0.000000,42.2,40.9,43.8,...,0,0,0,0,0,0,0,0,0,0
3,65269.959561,232.790156,165.4,453.549422,55378,11.1,0.000000,41.6,38.3,46.3,...,0,0,0,0,0,0,0,0,0,0
4,502.865162,233.890773,160.6,349.700000,26309,35.3,0.000000,43.9,41.2,47.8,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2433,467.553697,188.704191,185.3,456.600000,65485,7.7,0.000000,40.1,39.1,41.0,...,0,0,0,0,0,1,0,0,0,0
2434,618.752975,253.126217,196.7,492.400000,42477,16.9,281.251352,40.8,39.3,42.2,...,0,0,0,0,0,0,0,0,0,0
2435,565.902972,230.756552,204.1,506.700000,40339,21.3,0.000000,38.6,39.1,37.9,...,0,0,0,0,0,0,0,0,0,0
2436,22162.010885,259.710930,171.1,453.549422,39764,16.7,0.000000,43.8,42.1,46.1,...,0,0,0,0,0,0,0,0,0,0


In [50]:
#dropping source column
df_train_enc = df_train_enc.drop(['state'], axis=1)

In [51]:
df_train_enc.shape

(2408, 79)

In [52]:
#performing the same process for test data as a common pipeline for data predictions
#we need to transform the x_test the same way in which x_train was transformed

df_cat2 = pd.get_dummies(df_test["state"])
df_cat2 = df_cat2.reindex(columns = df_cat.columns, fill_value= 0)

df_test_enc = pd.concat([df_test, df_cat2], axis=1)

df_test_enc = df_test_enc.drop(['state'], axis=1)


In [53]:
df_test_enc.shape

(609, 79)
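
The reindex call is what keeps the test matrix compatible with the trained model: get_dummies only creates columns for the categories it sees, so a state missing from the test file, or one that appears only there, would otherwise change the column layout. A small sketch with made-up state lists shows both cases; dtype=int is passed so the dummies are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

train_state = pd.Series(["Texas", "Ohio", "Texas", "Utah"])
test_state = pd.Series(["Ohio", "Maine"])  # "Maine" never appears in training

train_dummies = pd.get_dummies(train_state, dtype=int)
test_dummies = pd.get_dummies(test_state, dtype=int)

# Align the test columns to the training columns: states missing from the
# test data become all-zero columns, and states unseen in training are dropped.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```

A test row from an unseen state ends up with zeros in every state column, so the model treats it as belonging to no known state.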

# Experiment 2

## Train a multivariate linear regression using state categorical variable

In [54]:
df_train_enc.columns

Index(['avgAnnCount', 'avgDeathsPerYear', 'TARGET_deathRate', 'incidenceRate',
       'medIncome', 'povertyPercent', 'studyPerCap', 'MedianAge',
       'MedianAgeMale', 'MedianAgeFemale', 'AvgHouseholdSize',
       'PercentMarried', 'PctNoHS18_24', 'PctHS18_24', 'PctBachDeg18_24',
       'PctHS25_Over', 'PctBachDeg25_Over', 'PctUnemployed16_Over',
       'PctPrivateCoverage', 'PctEmpPrivCoverage', 'PctPublicCoverage',
       'PctPublicCoverageAlone', 'PctWhite', 'PctBlack', 'PctAsian',
       'PctOtherRace', 'PctMarriedHouseholds', 'BirthRate', ' Alabama',
       ' Alaska', ' Arizona', ' Arkansas', ' California', ' Colorado',
       ' Connecticut', ' Delaware', ' District of Columbia', ' Florida',
       ' Georgia', ' Hawaii', ' Idaho', ' Illinois', ' Indiana', ' Iowa',
       ' Kansas', ' Kentucky', ' Louisiana', ' Maine', ' Maryland',
       ' Massachusetts', ' Michigan', ' Minnesota', ' Mississippi',
       ' Missouri', ' Montana', ' Nebraska', ' Nevada', ' New Hampshire',
       ' 

### Split the data 

In [55]:
X_2 = df_train_enc.drop(['TARGET_deathRate'], axis=1).values
y_2 = df_train_enc['TARGET_deathRate'].values

In [56]:
X_train_2, X_valid_2, y_train_2, y_valid_2 = train_test_split(X_2, y_2, test_size=0.2, random_state=42)

### Baseline Performance

In [57]:
 #average value of the target variable

y_mean_2 = y_train_2.mean()
y_mean_2

179.24309449636553

In [58]:
y_base_2 = np.full(y_train_2.shape, y_mean_2)

In [59]:
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae

In [60]:
print(mse(y_train_2, y_base_2))
print(mae(y_train_2, y_base_2))

769.3903027813093
21.40203726887571


Let's train the model

In [61]:
reg2 = LinearRegression()

In [62]:
reg2.fit(X_train_2, y_train_2)

LinearRegression()

In [63]:
#model's performance
y_train_preds_2 = reg2.predict(X_train_2)

In [64]:
# scores on the training set

print(mse(y_train_2, y_train_preds_2))
print(mae(y_train_2, y_train_preds_2))

122.35706855697858
8.222481250450963


In [65]:
# scores on the validation set

y_valid_preds_2= reg2.predict(X_valid_2)
print(mse(y_valid_2, y_valid_preds_2))
print(mae(y_valid_2, y_valid_preds_2))

156.2625790179583
9.300460779327974


In [66]:
# predictions vs target line charts on the train set
perfect_test_2 = alt.Chart(pd.DataFrame({'target': y_train_2, 'preds': y_train_2})).mark_line(color='green').encode(
    x='target',
    y='preds'
)

pred_chart_test_2 = alt.Chart(pd.DataFrame({'target': y_train_2, 'preds': y_train_preds_2})).mark_line().encode(
    x='target',
    y='preds'
  )

pred_chart_test_2 + perfect_test_2

In [67]:
# predictions versus the target line charts on the validation set
perfect_test_2 = alt.Chart(pd.DataFrame({'target': y_valid_2, 'preds': y_valid_2})).mark_line(color='green').encode(
    x='target',
    y='preds'
)

pred_chart_test_2 = alt.Chart(pd.DataFrame({'target': y_valid_2, 'preds': y_valid_preds_2})).mark_line().encode(
    x='target',
    y='preds'
  )

pred_chart_test_2 + perfect_test_2

Let's check on the test set

In [68]:
X_test_2 = df_test_enc.drop(['TARGET_deathRate'], axis=1).values
y_test_2 = df_test_enc['TARGET_deathRate'].values

In [69]:
#model's performance on the test set
y_test_preds_2 = reg2.predict(X_test_2)
print(mse(y_test_2, y_test_preds_2))
print(mae(y_test_2, y_test_preds_2))

143.34793258794429
8.69435567615124


In [70]:
# predictions versus the target line charts on the test set
perfect_test_2 = alt.Chart(pd.DataFrame({'target': y_test_2, 'preds': y_test_2})).mark_line(color='green').encode(
    x='target',
    y='preds'
)

pred_chart_test_2 = alt.Chart(pd.DataFrame({'target': y_test_2, 'preds': y_test_preds_2})).mark_line().encode(
    x='target',
    y='preds'
  )

pred_chart_test_2 + perfect_test_2

# Experiment 3

## Principal component analysis - PCA

PCA is a linear dimensionality reduction technique: it reduces the number of features by projecting the data linearly onto a lower-dimensional subspace. The purpose is to preserve the components that capture the most variation in the data while discarding the components with less variation.

The objective is to uncover relationships between the features, because we suspect that many of the variables are correlated with one another (multicollinearity).

## Split the data 

In [71]:
X_3 = df_train_enc.drop(['TARGET_deathRate'], axis=1).values
y_3 = df_train_enc['TARGET_deathRate'].values

In [72]:
X_train_3, X_valid_3, y_train_3, y_valid_3 = train_test_split(X_3, y_3, test_size=0.2, random_state=42)

In [73]:
X_test_3 = df_test_enc.drop(['TARGET_deathRate'], axis=1).values
y_test_3 = df_test_enc['TARGET_deathRate'].values

## PCA 

Since PCA is influenced by the scale of the features of the data, we will standardise the features.
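
To see why scale matters, consider two unrelated features measured in very different units. Without scaling, the first principal component is almost entirely the large-unit feature; after standardisation, both contribute. The sketch below uses synthetic data (the means and spreads are made up, loosely resembling an income in dollars and a percentage).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent synthetic features on very different scales.
X = np.column_stack([
    rng.normal(47000, 12000, size=500),
    rng.normal(17, 6, size=500),
])

raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("unscaled:", raw_ratio.round(4))
print("scaled:  ", scaled_ratio.round(4))
```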


In [74]:
from sklearn.preprocessing import StandardScaler

In [75]:
#scaling the data
sc = StandardScaler()
 
X_train_3 = sc.fit_transform(X_train_3)
X_valid_3 = sc.transform(X_valid_3)
X_test_3 = sc.transform(X_test_3)

The variance hyperparameter is set to 0.9: PCA keeps as many components as are needed to retain 90% of the variance.

In [76]:
from sklearn.decomposition import PCA

In [77]:
#fit PCA on the training set only, then apply the fitted transform
#to the validation and test sets
pca = PCA(0.9)
 
X_train_3 = pca.fit_transform(X_train_3)
X_valid_3 = pca.transform(X_valid_3)
X_test_3 = pca.transform(X_test_3)
 
explained_variance = pca.explained_variance_ratio_
explained_variance

array([0.10360559, 0.06419359, 0.03447429, 0.02677007, 0.0231003 ,
       0.0204591 , 0.01988457, 0.01671693, 0.01667492, 0.01620678,
       0.01553202, 0.01515641, 0.01506213, 0.01487453, 0.01450164,
       0.01399757, 0.01389731, 0.01376692, 0.013688  , 0.0135465 ,
       0.01354251, 0.01346885, 0.0134218 , 0.01340975, 0.01339441,
       0.01337762, 0.01336347, 0.01335671, 0.01333807, 0.01332008,
       0.01327796, 0.01326672, 0.01325847, 0.01325314, 0.01323109,
       0.01322091, 0.01320014, 0.01318861, 0.0131813 , 0.01314666,
       0.01314307, 0.01312967, 0.01311479, 0.01308441, 0.01307548,
       0.01306023, 0.01305693, 0.01301993, 0.01300322, 0.01223274,
       0.01140956, 0.0106304 ])

In [78]:
pca.n_components_

52

From the above output, we can see that retaining 90% of the variance reduces the data from 78 features to 52 principal components.
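
The number of components follows from the cumulative sum of the explained variance ratios: PCA keeps the smallest number of leading components whose shares add up to the requested fraction. The sketch below reproduces that selection rule with NumPy on synthetic data. It only centres the data (no scaling) so that the components have clearly different shares; it is an illustration of the rule, not a re-run of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 6 features with decreasing spread.
X = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.2])
X = X - X.mean(axis=0)  # centre the data, as PCA does

# Share of variance explained by each component, from the singular values.
singular_values = np.linalg.svd(X, compute_uv=False)
ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative share reaches the target.
target = 0.9
n_components = int(np.searchsorted(cumulative, target) + 1)
print(n_components, cumulative.round(3))
```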

## Train a multivariate linear regression using variables from PCA

In [79]:
X_train_3.shape

(1926, 52)

In [80]:
X_valid_3.shape

(482, 52)

In [81]:
X_test_3.shape

(609, 52)

### Baseline Performance

In [82]:
 #average value of the target variable

y_mean_3 = y_train_3.mean()
y_mean_3

179.24309449636553

In [83]:
y_base_3 = np.full(y_train_3.shape, y_mean_3)

In [84]:
print(mse(y_train_3, y_base_3))
print(mae(y_train_3, y_base_3))

769.3903027813093
21.40203726887571


Let's train the model

In [85]:
reg3 = LinearRegression()

In [86]:
reg3.fit(X_train_3, y_train_3)

LinearRegression()

In [87]:
#model's performance
y_train_preds_3 = reg3.predict(X_train_3)

In [88]:
# scores on the training set

print(mse(y_train_3, y_train_preds_3))
print(mae(y_train_3, y_train_preds_3))

311.2841435921802
13.209344709372685


In [89]:
# scores on the validation set

y_valid_preds_3= reg3.predict(X_valid_3)
print(mse(y_valid_3, y_valid_preds_3))
print(mae(y_valid_3, y_valid_preds_3))

350.6407371014389
13.951274835351299


In [90]:
# predictions vs target line charts on the train set
perfect_test_3 = alt.Chart(pd.DataFrame({'target': y_train_3, 'preds': y_train_3})).mark_line(color='green').encode(
    x='target',
    y='preds'
)

pred_chart_test_3 = alt.Chart(pd.DataFrame({'target': y_train_3, 'preds': y_train_preds_3})).mark_line().encode(
    x='target',
    y='preds'
  )

pred_chart_test_3 + perfect_test_3

In [91]:
# predictions versus the target line charts on the validation set
perfect_test_3 = alt.Chart(pd.DataFrame({'target': y_valid_3, 'preds': y_valid_3})).mark_line(color='green').encode(
    x='target',
    y='preds'
)

pred_chart_test_3 = alt.Chart(pd.DataFrame({'target': y_valid_3, 'preds': y_valid_preds_3})).mark_line().encode(
    x='target',
    y='preds'
  )

pred_chart_test_3 + perfect_test_3

Let's check the model on the test set.

In [92]:
# model's performance on the test set
y_test_preds_3 = reg3.predict(X_test_3)
print(mse(y_test_3, y_test_preds_3))
print(mae(y_test_3, y_test_preds_3))

354.4650527530023
13.438803866188964


In [93]:
# predictions versus the target line charts on the test set
perfect_test_3 = alt.Chart(pd.DataFrame({'target': y_test_3, 'preds': y_test_3})).mark_line(color='green').encode(
    x='target',
    y='preds'
)

pred_chart_test_3 = alt.Chart(pd.DataFrame({'target': y_test_3, 'preds': y_test_preds_3})).mark_line().encode(
    x='target',
    y='preds'
  )

pred_chart_test_3 + perfect_test_3

In conclusion, the PCA-based model does not improve accuracy compared with model experiment 2 (all available variables plus the state categorical feature). It beats the mean baseline (training MSE of 311.28 versus 769.39), but its validation and test MSE stay around 350.

Lowering the explained-variance threshold below 0.9 decreased the accuracy even further.
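
One way to make this comparison repeatable is to sweep the variance threshold in a loop. The sketch below is a minimal example on synthetic data from `make_regression` (an assumption, not the study's dataset); the thresholds and the pipeline layout are illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the study's data (assumption).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=30, n_informative=10, noise=10.0, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

results = []
for threshold in (0.70, 0.80, 0.90, 0.95):
    # Scaling and PCA are fitted on the training split only, inside the pipeline.
    model = make_pipeline(StandardScaler(), PCA(threshold), LinearRegression())
    model.fit(X_tr, y_tr)
    n_kept = model.named_steps["pca"].n_components_
    valid_mse = mean_squared_error(y_va, model.predict(X_va))
    results.append((threshold, n_kept, valid_mse))
    print(f"threshold={threshold:.2f}  components={n_kept}  valid MSE={valid_mse:.2f}")
```

Wrapping the steps in a pipeline keeps the validation data out of the PCA fit, which matches the fit-on-train, transform-on-validation pattern used in this experiment.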

# Applying regularisation

We will apply the regularisation techniques Lasso, Ridge and Elastic Net to model experiment 2 (multivariate linear regression using the state categorical variable) to predict mean per capita (100,000) cancer mortalities.

The objective is to reduce the overfitting observed in model experiment 2.

## Experiment 4

### Split the data 

In [94]:
X_4 = df_train_enc.drop(['TARGET_deathRate'], axis=1).values
y_4 = df_train_enc['TARGET_deathRate'].values

In [95]:
X_train_4, X_valid_4, y_train_4, y_valid_4 = train_test_split(X_4, y_4, test_size=0.2, random_state=42)

In [96]:
X_test_4 = df_test_enc.drop(['TARGET_deathRate'], axis=1).values
y_test_4 = df_test_enc['TARGET_deathRate'].values

## 1. Train Lasso Model (L1)

In [97]:
from sklearn.linear_model import Lasso 

In [98]:
lasso_reg = Lasso()

In [99]:
lasso_reg.fit(X_train_4, y_train_4)

Lasso()
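
The L1 penalty can drive some coefficients to exactly zero, so it is worth checking how many features the fitted Lasso model actually uses. The sketch below illustrates the check on synthetic data (an assumption, not the study's dataset) with the same default `alpha=1.0` used above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: only 5 of the 30 features carry signal (assumption for illustration).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=30, n_informative=5, noise=10.0, random_state=0
)

ols_demo = LinearRegression().fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=1.0).fit(X_demo, y_demo)  # alpha=1.0 is the default

# Count coefficients that are exactly zero in each model.
n_zero_ols = int(np.sum(ols_demo.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))

print("zero coefficients, plain linear regression:", n_zero_ols)
print("zero coefficients, Lasso:                  ", n_zero_lasso)
```

For the model fitted in this experiment, the equivalent check would be `np.sum(lasso_reg.coef_ == 0)`.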

In [100]:
y_train_preds_4 = lasso_reg.predict(X_train_4)
print(mse(y_train_4, y_train_preds_4))
print(mae(y_train_4, y_train_preds_4))

155.4150873332816
9.248904069799838


In [101]:
y_valid_preds_4 = lasso_reg.predict(X_valid_4)
print(mse(y_valid_4, y_valid_preds_4))
print(mae(y_valid_4, y_valid_preds_4))

172.23864101800723
9.938460208784576


In [102]:
y_test_preds_4 = lasso_reg.predict(X_test_4)
print(mse(y_test_4, y_test_preds_4))
print(mae(y_test_4, y_test_preds_4))

175.69329440227858
9.954422631422533


In [103]:
# predictions vs target line charts on the train set
perfect_test_4 = alt.Chart(pd.DataFrame({'target': y_train_4, 'preds': y_train_4})).mark_line(color='green').encode(
    x='target',
    y='preds'
)

pred_chart_test_4 = alt.Chart(pd.DataFrame({'target': y_train_4, 'preds': y_train_preds_4})).mark_line().encode(
    x='target',
    y='preds'
  )

pred_chart_test_4 + perfect_test_4

In [104]:
# predictions versus the target line charts on the validation set
perfect_test_4 = alt.Chart(pd.DataFrame({'target': y_valid_4, 'preds': y_valid_4})).mark_line(color='green').encode(
    x='target',
    y='preds'
)

pred_chart_test_4 = alt.Chart(pd.DataFrame({'target': y_valid_4, 'preds': y_valid_preds_4})).mark_line().encode(
    x='target',
    y='preds'
  )

pred_chart_test_4 + perfect_test_4

In [105]:
# predictions versus the target line charts on the test set
perfect_test_4 = alt.Chart(pd.DataFrame({'target': y_test_4, 'preds': y_test_4})).mark_line(color='green').encode(
    x='target',
    y='preds'
)

pred_chart_test_4 = alt.Chart(pd.DataFrame({'target': y_test_4, 'preds': y_test_preds_4})).mark_line().encode(
    x='target',
    y='preds'
  )

pred_chart_test_4 + perfect_test_4

## 2. Train Ridge Model (L2)

In [106]:
from sklearn.linear_model import Ridge

In [107]:
ridge_reg = Ridge()
ridge_reg.fit(X_train_4, y_train_4)

Ridge()
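
`Ridge()` uses the default penalty strength `alpha=1.0`. Unlike Lasso, the L2 penalty shrinks coefficients towards zero without removing them. The sketch below, on synthetic data (an assumption, not the study's dataset), shows how the size of the coefficient vector changes as `alpha` grows; the list of alphas is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data standing in for the study's feature matrix (assumption).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, noise=10.0, random_state=0
)

alphas = [0.1, 1.0, 10.0, 100.0, 1000.0]
coef_norms = []
for alpha in alphas:
    ridge_demo = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # Euclidean norm of the coefficient vector: a summary of how strongly it is shrunk.
    coef_norms.append(float(np.linalg.norm(ridge_demo.coef_)))

for alpha, norm in zip(alphas, coef_norms):
    print(f"alpha={alpha:>7}  ||coef||={norm:.3f}")
```

If the penalty is small relative to the scale of the data, Ridge behaves much like plain linear regression, so the choice of `alpha` matters when comparing the two.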

In [108]:
y_train_preds_4 = ridge_reg.predict(X_train_4)
print(mse(y_train_4, y_train_preds_4))
print(mae(y_train_4, y_train_preds_4))

122.50609746670989
8.204719943136688


In [109]:
y_valid_preds_4 = ridge_reg.predict(X_valid_4)
print(mse(y_valid_4, y_valid_preds_4))
print(mae(y_valid_4, y_valid_preds_4))

155.43256990671034
9.280055148895716


In [110]:
y_test_preds_4 = ridge_reg.predict(X_test_4)
print(mse(y_test_4, y_test_preds_4))
print(mae(y_test_4, y_test_preds_4))

143.17352822935902
8.679905042186935


In [111]:
# predictions vs target line charts on the train set
perfect_test_4 = alt.Chart(pd.DataFrame({'target': y_train_4, 'preds': y_train_4})).mark_line(color='green').encode(
    x='target',
    y='preds'
)

pred_chart_test_4 = alt.Chart(pd.DataFrame({'target': y_train_4, 'preds': y_train_preds_4})).mark_line().encode(
    x='target',
    y='preds'
  )

pred_chart_test_4 + perfect_test_4

In [112]:
# predictions versus the target line charts on the validation set
perfect_test_4 = alt.Chart(pd.DataFrame({'target': y_valid_4, 'preds': y_valid_4})).mark_line(color='green').encode(
    x='target',
    y='preds'
)

pred_chart_test_4 = alt.Chart(pd.DataFrame({'target': y_valid_4, 'preds': y_valid_preds_4})).mark_line().encode(
    x='target',
    y='preds'
  )

pred_chart_test_4 + perfect_test_4

In [113]:
# predictions versus the target line charts on the test set
perfect_test_4 = alt.Chart(pd.DataFrame({'target': y_test_4, 'preds': y_test_4})).mark_line(color='green').encode(
    x='target',
    y='preds'
)

pred_chart_test_4 = alt.Chart(pd.DataFrame({'target': y_test_4, 'preds': y_test_preds_4})).mark_line().encode(
    x='target',
    y='preds'
  )

pred_chart_test_4 + perfect_test_4

## 3. Train Elastic Net Model

In [114]:
from sklearn.linear_model import ElasticNet 

In [115]:
elasticnet_reg = ElasticNet()
elasticnet_reg.fit(X_train_4, y_train_4)

ElasticNet()
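
`ElasticNet()` combines the L1 and L2 penalties; its defaults are `alpha=1.0` and `l1_ratio=0.5`, where `l1_ratio` sets the share of the L1 part. The sketch below, on synthetic data (an assumption, not the study's dataset), illustrates that setting `l1_ratio=1.0` reduces Elastic Net to Lasso.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data standing in for the study's feature matrix (assumption).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=0
)

enet_default = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_demo, y_demo)   # the defaults
enet_as_lasso = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X_demo, y_demo)  # pure L1 penalty
lasso_demo = Lasso(alpha=1.0).fit(X_demo, y_demo)

# With l1_ratio=1.0 the two estimators should agree on the coefficients.
same_as_lasso = bool(np.allclose(enet_as_lasso.coef_, lasso_demo.coef_))
print("ElasticNet(l1_ratio=1.0) matches Lasso:", same_as_lasso)
print("non-zero coefficients with l1_ratio=0.5:", int(np.sum(enet_default.coef_ != 0)))
```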

In [116]:
y_train_preds_4 = elasticnet_reg.predict(X_train_4)
print(mse(y_train_4, y_train_preds_4))
print(mae(y_train_4, y_train_preds_4))

155.26808693915555
9.229586495475905


In [117]:
y_valid_preds_4 = elasticnet_reg.predict(X_valid_4)
print(mse(y_valid_4, y_valid_preds_4))
print(mae(y_valid_4, y_valid_preds_4))

171.43706990274973
9.94230041706115


In [118]:
y_test_preds_4 = elasticnet_reg.predict(X_test_4)
print(mse(y_test_4, y_test_preds_4))
print(mae(y_test_4, y_test_preds_4))

175.32256014950931
9.943474699264128


In [119]:
# predictions vs target line charts on the train set
perfect_test_4 = alt.Chart(pd.DataFrame({'target': y_train_4, 'preds': y_train_4})).mark_line(color='green').encode(
    x='target',
    y='preds'
)

pred_chart_test_4 = alt.Chart(pd.DataFrame({'target': y_train_4, 'preds': y_train_preds_4})).mark_line().encode(
    x='target',
    y='preds'
  )

pred_chart_test_4 + perfect_test_4

In [120]:
# predictions versus the target line charts on the validation set
perfect_test_4 = alt.Chart(pd.DataFrame({'target': y_valid_4, 'preds': y_valid_4})).mark_line(color='green').encode(
    x='target',
    y='preds'
)

pred_chart_test_4 = alt.Chart(pd.DataFrame({'target': y_valid_4, 'preds': y_valid_preds_4})).mark_line().encode(
    x='target',
    y='preds'
  )

pred_chart_test_4 + perfect_test_4

In [121]:
# predictions versus the target line charts on the test set
perfect_test_4 = alt.Chart(pd.DataFrame({'target': y_test_4, 'preds': y_test_4})).mark_line(color='green').encode(
    x='target',
    y='preds'
)

pred_chart_test_4 = alt.Chart(pd.DataFrame({'target': y_test_4, 'preds': y_test_preds_4})).mark_line().encode(
    x='target',
    y='preds'
  )

pred_chart_test_4 + perfect_test_4

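The regularised models still show a gap between training, validation and test scores. The Ridge model scores an MSE of 122.51 on training, 155.43 on validation and 143.17 on test, while Lasso and Elastic Net score about 155 on training and between 171 and 176 on validation and test.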

The results from the Ridge model and the model without regularisation are similar. This suggests that the gap between the scores may not be due to overfitting but to missing key features, and that more data may be needed to achieve better performance.
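
All three models above were fitted with the default `alpha=1.0`. A possible next step is to select the penalty strength by cross-validation. The sketch below is a minimal example on synthetic data (an assumption, not the study's dataset); the alpha grid and the scaling step are illustrative choices, since this section does not show whether the features in `X_train_4` were standardised.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the encoded training set (assumption).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=30, n_informative=10, noise=20.0, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Candidate penalty strengths on a log scale (illustrative grid).
alpha_grid = np.logspace(-3, 3, 13)

# Standardise first so the penalty treats all features on the same scale.
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alpha_grid))
ridge_cv.fit(X_tr, y_tr)

best_alpha = ridge_cv.named_steps["ridgecv"].alpha_
valid_mse = mean_squared_error(y_va, ridge_cv.predict(X_va))

print("selected alpha:", best_alpha)
print("validation MSE:", valid_mse)
```

If the selected `alpha` gives validation scores close to those of the unregularised model, that would be consistent with the interpretation above that overfitting is not the main source of error.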