# Find an Optimal Model for Predicting the Critical Temperatures of Superconductors

You work as a data scientist for a cable manufacturer. Management has decided to start shipping low-resistance cables to clients around the world. To ensure that the right cables are shipped to the right countries, they would like to predict the critical temperatures of various cables based on certain observed readings.

In this activity, you will train a linear regression model and compute its R² score and mean squared error (MSE). You will then engineer new features using polynomial features of degree 3 (the cells below use `degree=2`) and compare the R² score and MSE of this new model with those of the first model to check for overfitting. Finally, you will use regularization to train a model that generalizes to previously unseen data.
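For reference, the two metrics used throughout can be computed by hand. The sketch below uses made-up numbers (not taken from the dataset) and hypothetical helper names `mse` and `r2`; scikit-learn's `mean_squared_error` and `r2_score` compute the same quantities.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values only (assumption: not real critical temperatures)
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]
print(mse(y_true, y_pred), r2(y_true, y_pred))
```

R² is 1 for a perfect fit, 0 for a model that does no better than predicting the mean of the target, and negative for a model that does worse than that.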

>Note: You will find the dataset required for the activity in the Packt GitHub repository.

>The original dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data).

>Citation:

>Hamidieh, Kam, A data-driven statistical model for predicting the critical temperature of a superconductor, Computational Materials Science, Volume 154, November 2018, pages 346-354.

In [21]:
import pandas as pd 
import numpy as np 
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split, KFold, cross_validate
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler

In [15]:
df = pd.read_csv("../Dataset/superconduct/train.csv")
df

Unnamed: 0,number_of_elements,mean_atomic_mass,wtd_mean_atomic_mass,gmean_atomic_mass,wtd_gmean_atomic_mass,entropy_atomic_mass,wtd_entropy_atomic_mass,range_atomic_mass,wtd_range_atomic_mass,std_atomic_mass,...,wtd_mean_Valence,gmean_Valence,wtd_gmean_Valence,entropy_Valence,wtd_entropy_Valence,range_Valence,wtd_range_Valence,std_Valence,wtd_std_Valence,critical_temp
0,4,88.944468,57.862692,66.361592,36.116612,1.181795,1.062396,122.90607,31.794921,51.968828,...,2.257143,2.213364,2.219783,1.368922,1.066221,1,1.085714,0.433013,0.437059,29.00
1,5,92.729214,58.518416,73.132787,36.396602,1.449309,1.057755,122.90607,36.161939,47.094633,...,2.257143,1.888175,2.210679,1.557113,1.047221,2,1.128571,0.632456,0.468606,26.00
2,4,88.944468,57.885242,66.361592,36.122509,1.181795,0.975980,122.90607,35.741099,51.968828,...,2.271429,2.213364,2.232679,1.368922,1.029175,1,1.114286,0.433013,0.444697,19.00
3,4,88.944468,57.873967,66.361592,36.119560,1.181795,1.022291,122.90607,33.768010,51.968828,...,2.264286,2.213364,2.226222,1.368922,1.048834,1,1.100000,0.433013,0.440952,22.00
4,4,88.944468,57.840143,66.361592,36.110716,1.181795,1.129224,122.90607,27.848743,51.968828,...,2.242857,2.213364,2.206963,1.368922,1.096052,1,1.057143,0.433013,0.428809,23.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21258,4,106.957877,53.095769,82.515384,43.135565,1.177145,1.254119,146.88130,15.504479,65.764081,...,3.555556,3.223710,3.519911,1.377820,0.913658,1,2.168889,0.433013,0.496904,2.44
21259,5,92.266740,49.021367,64.812662,32.867748,1.323287,1.571630,188.38390,7.353333,69.232655,...,2.047619,2.168944,2.038991,1.594167,1.337246,1,0.904762,0.400000,0.212959,122.10
21260,2,99.663190,95.609104,99.433882,95.464320,0.690847,0.530198,13.51362,53.041104,6.756810,...,4.800000,4.472136,4.781762,0.686962,0.450561,1,3.200000,0.500000,0.400000,1.98
21261,2,99.663190,97.095602,99.433882,96.901083,0.690847,0.640883,13.51362,31.115202,6.756810,...,4.690000,4.472136,4.665819,0.686962,0.577601,1,2.210000,0.500000,0.462493,1.84


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21263 entries, 0 to 21262
Data columns (total 82 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   number_of_elements               21263 non-null  int64  
 1   mean_atomic_mass                 21263 non-null  float64
 2   wtd_mean_atomic_mass             21263 non-null  float64
 3   gmean_atomic_mass                21263 non-null  float64
 4   wtd_gmean_atomic_mass            21263 non-null  float64
 5   entropy_atomic_mass              21263 non-null  float64
 6   wtd_entropy_atomic_mass          21263 non-null  float64
 7   range_atomic_mass                21263 non-null  float64
 8   wtd_range_atomic_mass            21263 non-null  float64
 9   std_atomic_mass                  21263 non-null  float64
 10  wtd_std_atomic_mass              21263 non-null  float64
 11  mean_fie                         21263 non-null  float64
 12  wtd_mean_fie      

In [17]:
# Features: every column except the last; target: critical_temp (kept as a one-column DataFrame)
X = df.iloc[:,:-1]
y = df[['critical_temp']]
print(X.shape, y.shape)

(21263, 81) (21263, 1)


In [18]:
# Hold out 20% of the rows as a test set (fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

## Create a baseline linear regression model

In [23]:
# 3-fold cross-validation, scoring each fold with R² and negative MSE
kf = KFold(n_splits=3)
scoring = ['r2', 'neg_mean_squared_error']

In [35]:
steps1 = [
    ('scaler', MinMaxScaler()),
    ('lr', LinearRegression())
]
pipeline1 = Pipeline(steps1)

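# Note: the recorded scores come from cross-validating on the held-out split
# (X_test, y_test). Cross-validating on (X_train, y_train) instead would keep
# the test set untouched until the final evaluation.
result1 = cross_validate(pipeline1, X_test, y_test, cv=kf, scoring=scoring)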

result1

{'fit_time': array([0.07395506, 0.05996394, 0.09194851]),
 'score_time': array([0.02198792, 0.0159905 , 0.02798486]),
 'test_r2': array([0.72165815, 0.71293602, 0.72115141]),
 'test_neg_mean_squared_error': array([-329.00478125, -328.9835369 , -322.66484685])}

In [36]:
print(f"R2: {result1['test_r2'].mean()}, MSE: {-result1['test_neg_mean_squared_error'].mean()}")

R2: 0.7185818632099563, MSE: 326.88438833334527
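The cell above cross-validates on the held-out split. A common alternative is to run cross-validation on the training split only, so that the test split is first touched at the final evaluation. The following is a minimal sketch of that arrangement on synthetic data from `make_regression` (an assumption for illustration, not the superconductor dataset); all `_demo` names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: not the superconductor dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

pipe_demo = Pipeline([("scaler", MinMaxScaler()), ("lr", LinearRegression())])

# Cross-validate on the training split only; X_te and y_te are not used here
cv_demo = cross_validate(
    pipe_demo,
    X_tr,
    y_tr,
    cv=KFold(n_splits=3),
    scoring=["r2", "neg_mean_squared_error"],
)
print(cv_demo["test_r2"].mean(), -cv_demo["test_neg_mean_squared_error"].mean())
```

Because the scaler sits inside the pipeline, it is refitted on each training fold, so no statistics from the validation fold leak into the preprocessing.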


## Create a pipeline to engineer polynomial features and train a linear regression model

In [39]:
# Scale to [0, 1], expand to all terms up to degree 2, then fit ordinary least squares
steps2 = [
    ('scaler', MinMaxScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('lr', LinearRegression())
]
pipeline2 = Pipeline(steps2)

result2 = cross_validate(pipeline2, X_test, y_test, cv=kf, scoring=scoring)

result2

{'fit_time': array([32.79696918, 32.63448501, 35.21093869]),
 'score_time': array([0.13292384, 0.09594512, 0.14291692]),
 'test_r2': array([-1.45682028e+19, -6.27565877e+18, -9.91687265e+18]),
 'test_neg_mean_squared_error': array([-1.72198625e+22, -7.19208468e+21, -1.14751387e+22])}

In [40]:
print(f"R2: {result2['test_r2'].mean()}, MSE: {-result2['test_neg_mean_squared_error'].mean()}")

R2: -1.0253578068999096e+19, MSE: 1.1962361954314085e+22
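One plausible contributor to these scores (an interpretation, not something the notebook verifies) is that the degree-2 expansion produces more columns than each training fold has rows, which leaves ordinary least squares underdetermined. The count below uses the standard formula for the number of polynomial terms up to degree d in n variables, C(n + d, d), and takes the 81 features and 21,263 rows from the shapes printed earlier. The helper name `n_poly_terms` is hypothetical.

```python
from math import ceil, comb

n_rows, n_features = 21263, 81  # from the shapes printed earlier


def n_poly_terms(n, degree):
    """Number of monomials of total degree <= degree in n variables (bias term included)."""
    return comb(n + degree, degree)


n_test = ceil(0.2 * n_rows)               # rows in the 20% hold-out split
n_fold_train = n_test - ceil(n_test / 3)  # smallest training fold in 3-fold CV

print(n_poly_terms(n_features, 2))  # columns after a full degree-2 expansion
print(n_poly_terms(n_features, 3))  # columns a degree-3 expansion would produce
print(n_test, n_fold_train)
```

With 3,403 columns against roughly 2,835 training rows per fold, the unregularized fit has no unique solution, which is consistent with the extreme scores above. A degree-3 expansion would produce 95,284 columns.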


## Train a ridge or lasso model

In [41]:
# Same preprocessing as before, with an L1-regularized (lasso) regressor
steps3 = [
    ('scaler', MinMaxScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('lasso', Lasso(alpha=0.01))
]
pipeline3 = Pipeline(steps3)

result3 = cross_validate(pipeline3, X_test, y_test, cv=kf, scoring=scoring)

result3

{'fit_time': array([14.45138836, 14.75088143, 12.95537543]),
 'score_time': array([0.15790987, 0.12392926, 0.14891315]),
 'test_r2': array([0.78551718, 0.77490672, 0.78465957]),
 'test_neg_mean_squared_error': array([-253.52232644, -257.96334529, -249.17747729])}

In [42]:
print(f"R2: {result3['test_r2'].mean()}, MSE: {-result3['test_neg_mean_squared_error'].mean()}")

R2: 0.7816944898129496, MSE: 253.55438300603456


In [43]:
# Same preprocessing as before, with an L2-regularized (ridge) regressor
steps4 = [
    ('scaler', MinMaxScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('ridge', Ridge(alpha=0.9))
]
pipeline4 = Pipeline(steps4)

result4 = cross_validate(pipeline4, X_test, y_test, cv=kf, scoring=scoring)

result4

{'fit_time': array([2.68798995, 2.15975904, 1.92189693]),
 'score_time': array([0.17989564, 0.11593509, 0.1059413 ]),
 'test_r2': array([0.81006689, 0.79403683, 0.8012763 ]),
 'test_neg_mean_squared_error': array([-224.50415722, -236.03968931, -229.94971716])}

In [44]:
print(f"R2: {result4['test_r2'].mean()}, MSE: {-result4['test_neg_mean_squared_error'].mean()}")

R2: 0.8017933383599504, MSE: 230.1645212291246
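The `alpha` values above (0.01 for lasso, 0.9 for ridge) are fixed by hand. One way to choose the regularization strength more systematically is a grid search over the pipeline. The sketch below assumes synthetic data and an arbitrary candidate grid, and all `_demo` names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in data (assumption: not the superconductor dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=5.0, random_state=1
)

pipe_demo = Pipeline([
    ("scaler", MinMaxScaler()),
    ("poly", PolynomialFeatures(degree=2)),
    ("ridge", Ridge()),
])

# Candidate strengths are arbitrary; "ridge__alpha" targets the step named "ridge"
alpha_grid = {"ridge__alpha": [0.01, 0.1, 0.9, 10.0]}
search_demo = GridSearchCV(pipe_demo, alpha_grid, cv=KFold(n_splits=3), scoring="r2")
search_demo.fit(X_demo, y_demo)
print(search_demo.best_params_, search_demo.best_score_)
```

The `step__parameter` naming lets the search vary a single estimator setting while the scaler and polynomial expansion are refitted inside every fold.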


## Evaluate the ridge model on the test set

In [45]:
# Refit the ridge pipeline on the full training split, then report R² on the test split
pipeline = Pipeline(steps4)
model = pipeline.fit(X_train, y_train)
model.score(X_test, y_test)

0.8223554009470894

In [47]:
mean_squared_error(y_test, model.predict(X_test))

206.4210776548909
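Because MSE is expressed in squared units of the target, its square root (RMSE) is often easier to interpret. The small sketch below uses the test MSE printed above; the variable names are hypothetical.

```python
from math import sqrt

test_mse = 206.4210776548909  # value printed by the cell above
test_rmse = sqrt(test_mse)    # back in the units of critical_temp
print(f"RMSE: {test_rmse:.2f}")
```

This puts the typical prediction error of the ridge model at about 14.4 in the units of `critical_temp` (kelvin, assuming the units given in the cited dataset description).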