# Case Study 1: Predicting Critical Temperature
By: Allen Hoskins

***
# INTRODUCTION

Superconductors are materials that conduct electricity with little or no resistance, and therefore without the onset or buildup of heat. Because of this, a superconductor can sustain a constant flow of electricity and the magnetic field that current creates. Each of these materials has a property called `critical temperature`: the temperature at or below which the material acts as a superconductor. While most materials have an extremely low critical temperature (between 0 and 10 Kelvin), research has been ongoing to find materials with higher critical temperatures.

In this case study, we will utilize Linear Regression with both L1 and L2 regularization to predict the critical temperature of a compound to potentially identify superconductors.
***

# METHODS

### DATA PREPROCESSING:

The original data is composed of two CSV files, `train.csv` and `unique_m.csv`. After importing the packages needed for the case study, I read in the data and checked the shape of both datasets. The `train.csv` dataset consisted of 21,263 rows and 82 columns, and the `unique_m.csv` dataset consisted of 21,263 rows and 88 columns. I merged the two datasets on the index and dropped the duplicate response variable, `critical_temp`, resulting in a final dataframe of 21,263 rows and 168 columns. Before proceeding to Exploratory Data Analysis (EDA), I checked the variable datatypes to determine whether any other preprocessing was needed. Because every column was a float or an integer, I was able to move on to EDA.
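The merge step described above could look roughly like the following sketch. It assumes both files share the same row order and that `critical_temp` is the only column to drop; the helper name `merge_superconductor_data` and the toy frames are illustrative and are not taken from the notebook.

```python
import pandas as pd


def merge_superconductor_data(train: pd.DataFrame, unique_m: pd.DataFrame) -> pd.DataFrame:
    """Join the two frames on their index and keep a single critical_temp column."""
    # Drop the duplicate response from the second frame before joining
    unique_m = unique_m.drop(columns=["critical_temp"])
    return train.merge(unique_m, left_index=True, right_index=True)


# Toy stand-ins for train.csv and unique_m.csv (hypothetical values)
train = pd.DataFrame(
    {"number_of_elements": [4, 5, 4], "critical_temp": [29.0, 26.0, 19.0]}
)
unique_m = pd.DataFrame(
    {"Cu": [1.0, 1.0, 1.0], "Ba": [0.2, 0.1, 0.1], "critical_temp": [29.0, 26.0, 19.0]}
)

merged = merge_superconductor_data(train, unique_m)
print(merged.shape)   # rows are preserved; one critical_temp column remains
print(merged.dtypes)  # confirm every column is numeric before EDA
```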

### EXPLORATORY DATA ANALYSIS (EDA):

With little information about the 168 columns in the data set, I needed to determine whether the data needed to be scaled. I first plotted a histogram of the response variable to see its distribution (Fig. 1). The histogram showed that `critical_temp` was heavily right skewed. To determine whether any other variables in the data were heavily skewed, I ran quick descriptive statistics and plotted histograms of the following variables: `number_of_elements`, `entropy_atomic_mass`, `wtd_mean_atomic_mass`, and the response, `critical_temp`. All of these variables appeared to have large variances and non-standard distributions (Fig. 2).

<figure>
  <figcaption>Fig. 1</figcaption>
  <img
  src="./images/crit_temp_hist.png">
</figure>

<figure>
  <figcaption>Fig. 2</figcaption>
  <img
  src="./images/variable_hist.png">
</figure>
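A quick numeric companion to the histograms is to add sample skewness to the descriptive statistics. The sketch below uses synthetic columns as stand-ins (an exponential sample to mimic a right-skewed variable and a normal sample for contrast); the column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins, not the superconductor data
df = pd.DataFrame(
    {
        "right_skewed": rng.exponential(scale=30.0, size=5000),
        "symmetric": rng.normal(loc=0.0, scale=1.0, size=5000),
    }
)

summary = df.describe().T      # count, mean, std, min, quartiles, max per column
summary["skew"] = df.skew()    # sample skewness; > 0 means a longer right tail
print(summary[["mean", "std", "skew"]])
```

A skewness well above zero supports what the histogram shows visually: most values sit near the low end with a long tail to the right.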

### LINEAR REGRESSION ASSUMPTIONS:

Before fitting a linear regression model, I needed to check its assumptions. The results were:

> Linearity: PASS

> Homoscedasticity: FAIL

> Independence: PASS

> Normality: FAIL

After testing the assumptions and determining that they did not all pass, I needed to prepare the data with the transformations below.
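This write-up does not spell out how each assumption was tested. One rough way to probe homoscedasticity and normality is to fit ordinary least squares and inspect the residuals, as in the sketch below. The helper `residual_diagnostics`, the synthetic data, and the two summary statistics are assumptions chosen for illustration; residual plots or formal tests are equally reasonable choices.

```python
import numpy as np


def residual_diagnostics(X: np.ndarray, y: np.ndarray) -> dict:
    """Fit OLS with an intercept and return two crude residual summaries."""
    A = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ coef
    resid = y - fitted
    # Residual spread that grows with the fitted value hints at heteroscedasticity
    spread_corr = float(np.corrcoef(np.abs(resid), fitted)[0, 1])
    # Residual skewness far from 0 hints at non-normal errors
    z = (resid - resid.mean()) / resid.std()
    skewness = float(np.mean(z ** 3))
    return {"spread_corr": spread_corr, "skewness": skewness}


rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=2000)
y_const = 2.0 * x + rng.normal(0.0, 1.0, size=2000)      # constant noise
y_grow = 2.0 * x + x * rng.normal(0.0, 1.0, size=2000)   # noise grows with x

diag_const = residual_diagnostics(x.reshape(-1, 1), y_const)
diag_grow = residual_diagnostics(x.reshape(-1, 1), y_grow)
print(diag_const)
print(diag_grow)
```

In this toy setup the second data set has a clearly positive `spread_corr`, which is the kind of pattern that would lead to a FAIL on homoscedasticity.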

### PREP DATA TO MODEL:

Before scaling any of the explanatory variables, the data was separated into explanatory variables (X) and response variable (y). Once separated, I was able to scale the data. Sklearn provides multiple scalers to transform the data, but only `StandardScaler` and `PowerTransformer` were considered at this time. According to Sklearn's website, the definitions of the scalers are as follows:

> StandardScaler: Standardize features by removing the mean and scaling to unit variance.

> PowerTransformer: Apply a power transform feature wise to make data more Gaussian-like. Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired. Supports Yeo-Johnson and Box-Cox.

Both `StandardScaler` and `PowerTransformer` were run through the ElasticNet model. Since `PowerTransformer` produced the lower mean RMSE, the power-transformed data set is used moving forward. Note that since the data set contains both positive and negative values, Yeo-Johnson was used instead of Box-Cox, which requires strictly positive input. Fig. 3 contains histograms of the variables in Fig. 2 after applying `PowerTransformer`.

<figure>
  <figcaption>Fig. 3</figcaption>
  <img
  src="./images/variable_hist_post_powertransformer.png">
</figure>
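The difference between the two scalers can be illustrated on a single skewed feature. This is a minimal sketch on synthetic data (a shifted log-normal sample with a hypothetical column name), not the transformation of the actual data set: `StandardScaler` only recenters and rescales, while `PowerTransformer` with Yeo-Johnson also reshapes the distribution.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.default_rng(0)

# Right-skewed feature shifted so that it contains negative values,
# which rules out Box-Cox
X = pd.DataFrame({"feature": rng.lognormal(mean=0.0, sigma=1.0, size=5000) - 1.0})

X_std = StandardScaler().fit_transform(X)                        # same shape, new scale
X_pow = PowerTransformer(method="yeo-johnson").fit_transform(X)  # pulled toward Gaussian

skew_std = pd.Series(X_std[:, 0]).skew()
skew_pow = pd.Series(X_pow[:, 0]).skew()
print(f"skew after StandardScaler:   {skew_std:.2f}")
print(f"skew after PowerTransformer: {skew_pow:.2f}")
```

By default `PowerTransformer` also standardizes its output to zero mean and unit variance, so the transformed features end up on a common scale.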

***
# RESULTS

For modeling, I determined that using Sklearn's `ElasticNetCV` model was appropriate as it combines the `l1` and `l2` regularization of `Lasso` and `Ridge` models.
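For reference, scikit-learn documents the elastic net objective in the following form, where $w$ is the coefficient vector, $n$ is the number of samples, $\alpha$ is the overall penalty strength, and $\rho$ is the `l1_ratio`:

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

With $\rho = 1$ the penalty reduces to the Lasso (pure L1) penalty, and as $\rho$ approaches 0 it approaches the Ridge (pure L2) penalty.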

*Models use: 10-fold cross-validation (`KFold`), `random_state = 0`, and `max_iter = 20000`.*

**Model GridSearchCV:**

After preprocessing, EDA, and scaling the data, modeling was able to begin. Using Sklearn's `GridSearchCV`, I searched over the hyperparameters `l1_ratio`, `tol`, and `eps`, and the combination with the best `neg_root_mean_squared_error` score was used in the final model.

**Grid Search Parameters:**

```
      "l1_ratio": np.arange(0.0, 1.0, 0.1), 
      "tol":      [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
      "eps":      [1e-3, 1e-2, 1e-1, 1, 10, 100]
```

**Best Model Output:**

```
      "l1_ratio": 0.2
      "tol":      0.01
      "eps":      0.001
```
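A search of this kind can be sketched as follows. The data here is synthetic, the grid is a reduced illustration rather than the full grid listed above, and the shuffled 5-fold split is an assumption made to keep the example fast. The sketch also starts `l1_ratio` above 0, since to my knowledge `ElasticNetCV` cannot build its automatic alpha grid when `l1_ratio` is exactly 0.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import GridSearchCV, KFold

# Small synthetic problem so the sketch runs in seconds (not the superconductor data)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Reduced, illustrative grid
param_grid = {
    "l1_ratio": [0.2, 0.5, 0.8],
    "tol": [1e-4, 1e-2],
    "eps": [1e-3, 1e-2],
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    estimator=ElasticNetCV(max_iter=10000, random_state=0),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=cv,
)
search.fit(X, y)

print(search.best_params_)
print(f"best mean RMSE: {-search.best_score_:.4f}")  # the score is negated RMSE
```

Because scikit-learn maximizes scores, RMSE is reported with a negative sign, so the sign is flipped before printing.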

**ElasticNetCV with GridSearchCV Tuned Parameters:**

After performing GridSearchCV to tune the model parameters, Sklearn's `cross_validate` was used to validate the model and determine final performance. The results of all 10 folds are below with a mean RMSE of 16.4637.

<table>
    <tr>
        <td></td>
        <td>fit_time</td>
        <td>score_time</td>
        <td>estimator</td>
        <td>test_score</td>
        <td>train_score</td>
    </tr>
    <tr>
        <td>0</td>
        <td>2.19916</td>
        <td>0.00117993</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>16.7378</td>
        <td>16.3605</td>
    </tr>
    <tr>
        <td>1</td>
        <td>2.15888</td>
        <td>0.00172806</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>16.7811</td>
        <td>16.3389</td>
    </tr>
    <tr>
        <td>2</td>
        <td>2.09171</td>
        <td>0.00159836</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>16.2488</td>
        <td>16.4508</td>
    </tr>
    <tr>
        <td>3</td>
        <td>2.13909</td>
        <td>0.00134301</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>16.4203</td>
        <td>16.4222</td>
    </tr>
    <tr>
        <td>4</td>
        <td>2.04526</td>
        <td>0.00174427</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>16.5608</td>
        <td>16.4041</td>
    </tr>
    <tr>
        <td>5</td>
        <td>2.05884</td>
        <td>0.00163698</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>15.7617</td>
        <td>16.4887</td>
    </tr>
    <tr>
        <td>6</td>
        <td>2.1159</td>
        <td>0.00138688</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>16.2624</td>
        <td>16.446</td>
    </tr>
    <tr>
        <td>7</td>
        <td>2.06758</td>
        <td>0.00157595</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>16.7527</td>
        <td>16.397</td>
    </tr>
    <tr>
        <td>8</td>
        <td>2.10658</td>
        <td>0.00192094</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>16.4909</td>
        <td>16.4127</td>
    </tr>
    <tr>
        <td>9</td>
        <td>2.11754</td>
        <td>0.00175214</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>16.6208</td>
        <td>16.4005</td>
    </tr>
    <tr>
        <td>MEAN</td>
        <td>2.11005</td>
        <td>0.00158665</td>
        <td></td>
        <td>16.4637</td>
        <td>16.4121</td>
    </tr>
</table>
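A table like the one above can be produced with `cross_validate`. The following is a minimal sketch on synthetic data using the tuned parameters reported earlier; the shuffled `KFold` split is an assumption, and the numbers it prints will not match the table.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the scaled explanatory variables and response
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

model = ElasticNetCV(l1_ratio=0.2, tol=0.01, eps=0.001, max_iter=10000, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

results = cross_validate(
    model,
    X,
    y,
    cv=cv,
    scoring="neg_root_mean_squared_error",
    return_train_score=True,
)

test_rmse = -results["test_score"]    # flip the sign to get RMSE per fold
train_rmse = -results["train_score"]
print(f"mean test RMSE:  {test_rmse.mean():.4f}")
print(f"mean train RMSE: {train_rmse.mean():.4f}")
```

Train and test RMSE that sit close together, as they do in the table, suggest the model is not overfitting the training folds.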

**Feature Importance:**

After completing the ElasticNetCV modeling, I wanted to determine which features were the most important in predicting `critical_temp`.

The top 10 features in the model, ranked by the absolute value of their coefficients, were:

<table>
    <tr>
        <td></td>
        <td>Feature</td>
        <td>Coefficient</td>
    </tr>
    <tr>
        <td>19</td>
        <td>Cu</td>
        <td>4.99197</td>
    </tr>
    <tr>
        <td>7</td>
        <td>Ba</td>
        <td>4.88638</td>
    </tr>
    <tr>
        <td>12</td>
        <td>Ca</td>
        <td>4.83552</td>
    </tr>
    <tr>
        <td>38</td>
        <td>La</td>
        <td>-3.48937</td>
    </tr>
    <tr>
        <td>163</td>
        <td>wtd_std_Valence</td>
        <td>-3.24568</td>
    </tr>
    <tr>
        <td>9</td>
        <td>Bi</td>
        <td>2.40671</td>
    </tr>
    <tr>
        <td>162</td>
        <td>wtd_std_ThermalConductivity</td>
        <td>2.29054</td>
    </tr>
    <tr>
        <td>57</td>
        <td>Pr</td>
        <td>-2.27065</td>
    </tr>
    <tr>
        <td>31</td>
        <td>Hg</td>
        <td>2.2417</td>
    </tr>
    <tr>
        <td>146</td>
        <td>wtd_mean_ThermalConductivity</td>
        <td>1.89972</td>
    </tr>
</table>
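Ranking by absolute coefficient can be sketched as below. The helper `top_features` is hypothetical; the four coefficients are copied from the table above and deliberately shuffled to show the sorting.

```python
import pandas as pd


def top_features(names, coefs, k=10):
    """Return the k features with the largest absolute coefficients."""
    table = pd.DataFrame({"Feature": names, "Coefficient": coefs})
    order = table["Coefficient"].abs().sort_values(ascending=False).index
    return table.loc[order].head(k)


names = ["Bi", "La", "Cu", "Pr"]
coefs = [2.40671, -3.48937, 4.99197, -2.27065]

top3 = top_features(names, coefs, k=3)
print(top3)
```

Comparing coefficient magnitudes this way is only meaningful because the features were put on a common scale before fitting.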

***

# CONCLUSION

In conclusion, the best model used a combination of L1 and L2 regularization (`l1_ratio = 0.2`) and produced a mean RMSE of 16.4637 across the 10 folds. Although no model is perfect, the output of this model can serve as a baseline for identifying potential superconductors.
***

# CODE:

The code is attached in the file `CS1_CODE.ipynb`.