# Case Study 1

## Business Understanding
You should always state the objective at the beginning of every case (a guideline you should follow in real life as well) and provide some initial "Business Understanding" statements (i.e., what is trying to be solved for and why might it be important)

build a linear regression model using L1 or L2 regularization (or both) to predict the critical temperature. In addition, include in your write-up which variable carries the most importance.

## Data Evaluation and Engineering
Summarize the data being used in the case using appropriate mediums (charts, graphs, tables); address questions such as: Are there missing values? Which variables are needed (which ones are not)? What assumptions or conclusions are you drawing that need to be relayed to your audience?

In [1]:
import numpy as np
import pandas as pd

# Load Data
data_train = pd.read_csv('./data/train.csv')
data_materials = pd.read_csv('./data/unique_m.csv')

# Drop the duplicate column 'critical_temp' in the

data_train = data_train.drop(['critical_temp'], axis=1)

# Merge the two frames
data = pd.merge(data_train, data_materials, left_index=True, right_index=True)

data.info()
data.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21263 entries, 0 to 21262
Columns: 169 entries, number_of_elements to material
dtypes: float64(156), int64(12), object(1)
memory usage: 27.4+ MB


Unnamed: 0,number_of_elements,mean_atomic_mass,wtd_mean_atomic_mass,gmean_atomic_mass,wtd_gmean_atomic_mass,entropy_atomic_mass,wtd_entropy_atomic_mass,range_atomic_mass,wtd_range_atomic_mass,std_atomic_mass,...,Pt,Au,Hg,Tl,Pb,Bi,Po,At,Rn,critical_temp
count,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0,...,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0,21263.0
mean,4.115224,87.557631,72.98831,71.290627,58.539916,1.165608,1.063884,115.601251,33.225218,44.391893,...,0.034108,0.020535,0.036663,0.047954,0.042461,0.201009,0.0,0.0,0.0,34.421219
std,1.439295,29.676497,33.490406,31.030272,36.651067,0.36493,0.401423,54.626887,26.967752,20.03543,...,0.307888,0.717975,0.205846,0.272298,0.274365,0.655927,0.0,0.0,0.0,34.254362
min,1.0,6.941,6.423452,5.320573,1.960849,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00021
25%,3.0,72.458076,52.143839,58.041225,35.24899,0.966676,0.775363,78.512902,16.824174,32.890369,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.365
50%,4.0,84.92275,60.696571,66.361592,39.918385,1.199541,1.146783,122.90607,26.636008,45.1235,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20.0
75%,5.0,100.40441,86.10354,78.116681,73.113234,1.444537,1.359418,154.11932,38.356908,59.322812,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,63.0
max,9.0,208.9804,208.9804,208.9804,208.9804,1.983797,1.958203,207.97246,205.58991,101.0197,...,5.8,64.0,8.0,7.0,19.0,14.0,0.0,0.0,0.0,185.0


In [2]:
# Columns with missing data?
print(data.columns[data.isnull().any()])

Index([], dtype='object')


In [3]:
# Columns with a Constant value
data.columns[data.nunique() <= 1]

Index(['He', 'Ne', 'Ar', 'Kr', 'Xe', 'Pm', 'Po', 'At', 'Rn'], dtype='object')

In [2]:
# Drop columns with constant values
data.drop(columns=['material', 'He', 'Ne', 'Ar', 'Kr', 'Xe', 'Pm', 'Po', 'At', 'Rn'], inplace=True)
print(data.shape)

(21263, 159)


In [3]:
# Create our standard numpy stuff
X = data.drop(columns=['critical_temp']).values
y = data.loc[:,'critical_temp'].values

X_df = data.drop(columns=['critical_temp']).copy(deep=True)

# from statsmodels.stats.outliers_influence import variance_inflation_factor

# vif_data = pd.DataFrame()
# vif_data["feature"] = X_df.columns

# vif_data["VIF"] = [variance_inflation_factor(X_df.values, i) for i in range(len(X_df.columns))]

# vif_data

In [16]:
# https://stats.stackexchange.com/questions/445259/combining-pca-feature-scaling-and-cross-validation-without-training-test-data
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test =\
    train_test_split(X, y,
    test_size=0.2,
    random_state=1)

lasso_pipe_svc = make_pipeline(RobustScaler(), Lasso(random_state=1))
ridge_pipe_svc = make_pipeline(RobustScaler(), Ridge(random_state=1))
elastic_pipe_svc = make_pipeline(RobustScaler(), ElasticNet(random_state=1))

#pipe_svc = make_pipeline(RobustScaler(), Ridge(random_state=1))
#pipe_svc = make_pipeline(StandardScaler(), Lasso(random_state=1))

#param_range = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
param_range = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 10, 100, 1000, 10000]
param_l1_ratio = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

param_grid_lasso = [{'lasso__alpha': param_range}]
param_grid_ridge = [{'ridge__alpha': param_range}]
param_grid_elastic = [{'elasticnet__alpha': param_range, 'elasticnet__l1_ratio': param_l1_ratio}]

gs_lasso = GridSearchCV(estimator=lasso_pipe_svc, param_grid=param_grid_lasso, scoring='r2', cv=5, n_jobs=-1)
gs_lasso.fit(X_train, y_train)

gs_ridge = GridSearchCV(estimator=ridge_pipe_svc, param_grid=param_grid_ridge, scoring='r2', cv=5, n_jobs=-1)
gs_ridge.fit(X_train, y_train)

gs_elastic = GridSearchCV(estimator=elastic_pipe_svc, param_grid=param_grid_elastic, scoring='r2', cv=5, n_jobs=-1)
gs_elastic.fit(X_train, y_train)

print("Lasso")
print(gs_lasso.best_score_)
print(gs_lasso.best_params_)
print("")

print("Ridge")
print(gs_ridge.best_score_)
print(gs_ridge.best_params_)
print("")

print("Elastic")
print(gs_elastic.best_score_)
print(gs_elastic.best_params_)
print("")

Lasso
0.7129520975572087
{'lasso__alpha': 0.3}

Ridge
0.7065312755243307
{'ridge__alpha': 1000}

Elastic
0.7083853891920866
{'elasticnet__alpha': 0.3, 'elasticnet__l1_ratio': 0.9}



In [18]:
from sklearn import metrics

# Note the X_test gets run through the pipeline above! Very important, it means that the scaler is also run on the test data
y_lasso_pred = gs_lasso.predict(X_test)
y_ridge_pred = gs_ridge.predict(X_test)
y_elastic_pred = gs_elastic.predict(X_test)

print("Lasso")
print("R2 ->", metrics.r2_score(y_test, y_lasso_pred))
print("MAE ->", metrics.mean_absolute_error(y_test, y_lasso_pred))
print("")

print("Ridge")
print("R2 ->", metrics.r2_score(y_test, y_ridge_pred))
print("MAE ->", metrics.mean_absolute_error(y_test, y_ridge_pred))
print("")

print("Elastic")
print("R2 ->", metrics.r2_score(y_test, y_elastic_pred))
print("MAE ->", metrics.mean_absolute_error(y_test, y_elastic_pred))
print("")


Lasso
R2 -> 0.7197254814065868
MAE -> 13.660714491348584

Ridge
R2 -> 0.7264906957630273
MAE -> 13.359929361601825

Elastic
R2 -> 0.7146988481193569
MAE -> 13.811785860925973



In [17]:
#print()
#print(gs.best_estimator_['lasso'].coef_)
idx = 0
cnt = 0

cols_to_drop = []

for x in gs.best_estimator_['lasso'].coef_:
    print(data.columns[idx], x)
    idx += 1
    if x == 0:
        cnt += 1
        cols_to_drop.append(data.columns[idx])

print(cnt)

number_of_elements 0.0
mean_atomic_mass -0.0
wtd_mean_atomic_mass -0.0
gmean_atomic_mass -0.0
wtd_gmean_atomic_mass -0.0
entropy_atomic_mass 0.0
wtd_entropy_atomic_mass 3.9508095199980304
range_atomic_mass 6.955103730407498
wtd_range_atomic_mass -1.2028148462719683
std_atomic_mass 0.0
wtd_std_atomic_mass -0.7363630703449164
mean_fie 1.5836528543493764
wtd_mean_fie 0.0
gmean_fie 0.0
wtd_gmean_fie 0.0
entropy_fie 0.0
wtd_entropy_fie 0.0
range_fie 0.0
wtd_range_fie 0.0
std_fie 0.0
wtd_std_fie 0.0
mean_atomic_radius -0.0
wtd_mean_atomic_radius 0.0
gmean_atomic_radius -0.0
wtd_gmean_atomic_radius -0.0
entropy_atomic_radius 0.0
wtd_entropy_atomic_radius 0.0
range_atomic_radius 1.1880683928528302
wtd_range_atomic_radius -0.0
std_atomic_radius 0.0
wtd_std_atomic_radius 0.0
mean_Density -1.7386569847400537
wtd_mean_Density -0.0
gmean_Density -0.0
wtd_gmean_Density -0.0
entropy_Density -0.0
wtd_entropy_Density 0.0
range_Density 0.0
wtd_range_Density 0.5578557645139761
std_Density 0.0
wtd_std_Den

In [19]:
X_df = data.drop(columns=cols_to_drop).copy(deep=True)

# from statsmodels.stats.outliers_influence import variance_inflation_factor

# vif_data = pd.DataFrame()
# vif_data["feature"] = X_df.columns

# vif_data["VIF"] = [variance_inflation_factor(X_df.values, i) for i in range(len(X_df.columns))]

# vif_data



In [23]:
X = X_df.drop(columns=['critical_temp']).values
y = data.loc[:,'critical_temp'].values

X_train, X_test, y_train, y_test =\
    train_test_split(X, y,
    test_size=0.2,
    random_state=1)

pipe_svc = make_pipeline(RobustScaler(), Ridge(random_state=1))
#pipe_svc = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))


#param_range = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
param_range = [0.001, 0.01, 0.1, .2, .3, .4, .5, .6, .7, .8, .9, 1]

param_grid = [{'ridge__alpha': param_range}]

gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, scoring='r2', cv=5, n_jobs=-1)

gs = gs.fit(X_train, y_train)

print(gs.best_score_)
print(gs.best_params_)

0.6238304226639614
{'ridge__alpha': 1}


In [24]:
y_pred = gs.predict(X_test)

metrics.r2_score(y_test, y_pred)

0.6145652115524546

In [10]:
idx = 0

for x in gs.best_estimator_['lasso'].coef_:
    print(data.columns[idx], x)
    idx += 1

number_of_elements 13.487984851692977
mean_atomic_mass 2.1225100001983446
wtd_mean_atomic_mass 10.782705622339384
gmean_atomic_mass -1.6643535094439528
wtd_gmean_atomic_mass -4.4237126509847595
entropy_atomic_mass 6.813374319059292
wtd_entropy_atomic_mass -3.2098067719137777
range_atomic_mass -6.847940527216241
wtd_range_atomic_mass -9.951626110077674
std_atomic_mass -1.9779067752527328
wtd_std_atomic_mass -0.9980228629792511
mean_fie 9.16635842901707
wtd_mean_fie 10.102564790255299
gmean_fie -12.722713655854113
wtd_gmean_fie 1.7175514479907312
entropy_fie -1.7438329957558216
wtd_entropy_fie -7.783179803536447
range_fie -1.4902019294386373
wtd_range_fie -2.67511739351774
std_fie 0.12378026980299667
wtd_std_fie 0.0
mean_atomic_radius 0.010917606394030163
wtd_mean_atomic_radius 6.913627781183965
gmean_atomic_radius 0.19085865282896503
wtd_gmean_atomic_radius -0.07433270804044217
entropy_atomic_radius -0.3582102473971797
wtd_entropy_atomic_radius -13.022188142309568


In [8]:
## install pandas 1.2.4
## pip install pandas-profiling==2.8.0

from pandas_profiling import ProfileReport

profile = ProfileReport(data, title="Pandas Profiling Report", minimal=True)

profile.to_file(output_file="PandasProfile.html")

Summarize dataset: 100%|██████████| 168/168 [00:01<00:00, 120.42it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:29<00:00, 29.90s/it]
Render HTML: 100%|██████████| 1/1 [00:03<00:00,  3.61s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 62.67it/s]


## Modeling Preparations
Which methods are you proposing to utilize to solve the problem?  Why is this method appropriate given the business objective? How will you determine if your approach is useful (or how will you differentiate which approach is more useful than another)?  More specifically, what evaluation metrics are most useful given that the problem is a regression one (ex., RMSE, logloss, MAE, etc.)?

## Model Building and Evaluation
In this case, your primary task is to build a linear regression model using L1 or L2 regularization (or both) to predict the critical temperature and will involve the following steps:

- Specify your sampling methodology
- Setup your model(s) - specifying the regularization type chosen and including the parameters utilized by the model
- Analyze your model's performance - referencing your chosen evaluation metric (including supplemental visuals and analysis where appropriate)

## Model Interpretability & Explainability
Using at least one of your models above (if multiple were trained):

- Which variable(s) was (were) ""most important"" and why?  How did you come to the conclusion and how should your audience interpret this?

## Case Conclusions
After all of your technical analysis and modeling; what are you proposing to your audience and why?  How should they view your results and what should they consider when moving forward?  Are there other approaches you'd recommend exploring?  This is where you "bring it all home" in language they understand.