# Case Study 2: Predicting Critical Temperature
By: Allen Hoskins and Brittany Lewandowski

***
# 1. INTRODUCTION


Diabetes is a metabolic disease impacting 37.3 million Americans. People with the disease have difficulty producing or responding to insulin, a hormone the body uses to regulate blood glucose and store energy. Although it is uncommon, diabetics can be hospitalized for critically low or high blood glucose levels. These hospitalizations can be life threatening and should be minimized wherever possible.  

 

In this case study, we use a diabetes data set provided by Dr. Slater to identify which factors are most strongly associated with diabetic patients being readmitted to the hospital. To accomplish this, we will build a Logistic Regression model and examine its feature importances. We hope that medical professionals can use this research to help treat hospitalized diabetics and to reduce the chance that these patients are readmitted in the future.  

***

# 2. METHODS


#### DATA UNDERSTANDING: 

The data used in this case study came from `diabetes.csv`, a file provided by Dr. Slater. It contains records for hospitalized diabetic patients, with columns such as “readmitted,” “patient_nbr,” “insulin,” and “time_in_hospital.” After reviewing the contents of the file, we loaded it into a dataframe named `diabetes_data` and began pre-processing. 


#### DATA PREPROCESSING: 


The first step in pre-processing was to review the full data set. We immediately noticed missing values in the following columns:  

> - race
> - weight
> - payer_code
> - medical_specialty
> - diag_1
> - diag_2
> - diag_3

After identifying these missing values, we ran `diabetes_data.dtypes.value_counts()` and saw that our data set contained 37 categorical columns and 13 numeric columns. Because the categorical columns cannot be passed to the model directly, one hot encoding would need to be performed. Full details on our one hot encoding process can be found under the sub-header “One Hot Encoding” below.  
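The checks described above can be sketched as follows. This is a minimal illustration on a small made-up frame rather than the real `diabetes.csv`: the rows, the use of `NaN` to mark missing entries, and the choice of which columns to encode are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for diabetes_data; the values are invented.
diabetes_data = pd.DataFrame(
    {
        "race": ["Caucasian", np.nan, "AfricanAmerican", "Caucasian"],
        "insulin": ["No", "Up", "Steady", "No"],
        "time_in_hospital": [1, 3, 2, 7],
        "readmitted": ["NO", ">30", "NO", "<30"],
    }
)

# Count how many columns there are of each dtype.
dtype_counts = diabetes_data.dtypes.value_counts()
n_numeric = diabetes_data.select_dtypes(include="number").shape[1]

# List the columns that contain at least one missing value.
missing_per_column = diabetes_data.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

# One hot encode the categorical predictors, leaving the target untouched.
categorical_cols = ["race", "insulin"]
encoded = pd.get_dummies(diabetes_data, columns=categorical_cols)

print(dtype_counts)
print("numeric columns:", n_numeric)
print("columns with missing values:", columns_with_missing)
print(encoded.columns.tolist())
```

By default `pd.get_dummies` does not create an indicator column for missing values, so a row with a missing `race` gets zeros in every `race_*` column. Whether that is appropriate depends on how the missing values are handled earlier in pre-processing.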

 

#### ASSUMPTIONS OF LOGISTIC REGRESSION MODELS: 

 

#### EXPLORATORY DATA ANALYSIS: 


***
# 3. RESULTS

For modeling, we chose scikit-learn's `ElasticNetCV` because it combines the `l1` penalty used by `Lasso` with the `l2` penalty used by `Ridge`, with the `l1_ratio` parameter controlling the balance between the two.
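To make the role of `l1_ratio` concrete, here is a small sketch on synthetic data. The data set, `alpha=0.5`, and the variable names are assumptions for illustration only, not the settings used in this study. It shows that an elastic net with `l1_ratio=1.0` reduces to a Lasso fit, while a smaller `l1_ratio` mixes in the `l2` penalty.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data standing in for the real features (an assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# scikit-learn's elastic net objective:
#   1 / (2 * n_samples) * ||y - Xw||^2_2
#   + alpha * l1_ratio * ||w||_1
#   + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2

# l1_ratio=1.0 removes the l2 term, so the fit matches Lasso.
enet_as_lasso = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# l1_ratio=0.2 weights the penalty mostly toward the l2 term.
enet_mixed = ElasticNet(alpha=0.5, l1_ratio=0.2).fit(X, y)

print("matches Lasso:", np.allclose(enet_as_lasso.coef_, lasso.coef_))
print("mixed-penalty coefficients:", np.round(enet_mixed.coef_, 2))
```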

*Models use: 10-fold cross-validation (`KFold`), `random_state = 0`, and `max_iter = 20000`.*

**Model GridSearchCV:**

After preprocessing, EDA, and scaling the data, modeling could begin. Using scikit-learn's `GridSearchCV`, we searched over the hyperparameters `l1_ratio`, `tol`, and `eps`, scored each combination with `neg_root_mean_squared_error`, and carried the best-scoring combination into the final model.

**Grid Search Parameters:**

```
import numpy as np
param_grid = {
      "l1_ratio": np.arange(0.0, 1.0, 0.1),  # 0.0, 0.1, ..., 0.9 (the stop value 1.0 is excluded)
      "tol":      [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
      "eps":      [1e-3, 1e-2, 1e-1, 1, 10, 100]
}
```

**Best Model Output:**

```
      "l1_ratio": 0.2
      "tol":      0.01
      "eps":      0.001
```
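A search of this kind could be set up roughly as follows. This is a sketch under stated assumptions, not the code from the notebook: it uses synthetic data, a much smaller grid so that it runs in a few seconds, and a shuffled `KFold`, none of which are confirmed by the write-up.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data standing in for the scaled features (an assumption).
X, y = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)

# Reduced, illustrative grid; the full grid is listed in the section above.
param_grid = {
    "l1_ratio": [0.2, 0.5],
    "tol": [1e-4, 1e-2],
    "eps": [1e-3, 1e-2],
}

# 10-fold cross-validation; shuffling is an assumption made for this sketch.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

search = GridSearchCV(
    estimator=ElasticNetCV(max_iter=20000, random_state=0),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=cv,
)
search.fit(X, y)

# The scorer returns negated RMSE, so flip the sign to report RMSE.
best_rmse = -search.best_score_
print(search.best_params_)
print("best mean RMSE:", best_rmse)
```

Because scikit-learn scorers follow a "higher is better" convention, `neg_root_mean_squared_error` reports RMSE with its sign flipped, and the sign has to be flipped back before the value is reported.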

**ElasticNetCV with GridSearchCV Tuned Parameters:**

After using `GridSearchCV` to tune the model parameters, we used scikit-learn's `cross_validate` to validate the model and determine its final performance. The results of all 10 folds are shown below, with a mean test RMSE of 16.4637 and a mean train RMSE of 16.4121.

<table>
    <tr>
        <th></th>
        <th>fit_time</th>
        <th>score_time</th>
        <th>estimator</th>
        <th>test_score</th>
        <th>train_score</th>
    </tr>
    <tr>
        <td>0</td>
        <td>2.19916</td>
        <td>0.00117993</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>16.7378</td>
        <td>16.3605</td>
    </tr>
    <tr>
        <td>1</td>
        <td>2.15888</td>
        <td>0.00172806</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>16.7811</td>
        <td>16.3389</td>
    </tr>
    <tr>
        <td>2</td>
        <td>2.09171</td>
        <td>0.00159836</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>16.2488</td>
        <td>16.4508</td>
    </tr>
    <tr>
        <td>3</td>
        <td>2.13909</td>
        <td>0.00134301</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>16.4203</td>
        <td>16.4222</td>
    </tr>
    <tr>
        <td>4</td>
        <td>2.04526</td>
        <td>0.00174427</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>16.5608</td>
        <td>16.4041</td>
    </tr>
    <tr>
        <td>5</td>
        <td>2.05884</td>
        <td>0.00163698</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>15.7617</td>
        <td>16.4887</td>
    </tr>
    <tr>
        <td>6</td>
        <td>2.1159</td>
        <td>0.00138688</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>16.2624</td>
        <td>16.446</td>
    </tr>
    <tr>
        <td>7</td>
        <td>2.06758</td>
        <td>0.00157595</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>16.7527</td>
        <td>16.397</td>
    </tr>
    <tr>
        <td>8</td>
        <td>2.10658</td>
        <td>0.00192094</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>16.4909</td>
        <td>16.4127</td>
    </tr>
    <tr>
        <td>9</td>
        <td>2.11754</td>
        <td>0.00175214</td>
        <td>ElasticNetCV(l1_ratio=0.2, max_iter=10000, random_state=0, tol=0.01)</td>
        <td>16.6208</td>
        <td>16.4005</td>
    </tr>
    <tr>
        <td>MEAN</td>
        <td>2.11005</td>
        <td>0.00158665</td>
        <td></td>
        <td>16.4637</td>
        <td>16.4121</td>
    </tr>
</table>
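A table like the one above can be produced with `cross_validate`. The following is a minimal sketch on synthetic data, assuming a shuffled 10-fold split and the tuned parameters reported earlier. The data and the variable names are illustrative, so the numbers it prints will not match the table.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, cross_validate

# Synthetic data standing in for the scaled features (an assumption).
X, y = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)

model = ElasticNetCV(
    l1_ratio=0.2, tol=0.01, eps=0.001, max_iter=10000, random_state=0
)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

results = cross_validate(
    model,
    X,
    y,
    cv=cv,
    scoring="neg_root_mean_squared_error",
    return_train_score=True,  # adds the train_score column
    return_estimator=True,    # adds the fitted estimator for each fold
)

# Scores come back as negated RMSE; flip the sign before averaging.
test_rmse = -results["test_score"]
train_rmse = -results["train_score"]
print("mean test RMSE:", test_rmse.mean())
print("mean train RMSE:", train_rmse.mean())
```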

**Feature Importance:**

After completing the ElasticNetCV modeling, we wanted to determine which features were the most important in predicting `critical_temp`.

The top 10 features in the model, ranked by the absolute value of their coefficients, were:

<table>
    <tr>
        <th></th>
        <th>Feature</th>
        <th>Coefficient</th>
    </tr>
    <tr>
        <td>19</td>
        <td>Cu</td>
        <td>4.99197</td>
    </tr>
    <tr>
        <td>7</td>
        <td>Ba</td>
        <td>4.88638</td>
    </tr>
    <tr>
        <td>12</td>
        <td>Ca</td>
        <td>4.83552</td>
    </tr>
    <tr>
        <td>38</td>
        <td>La</td>
        <td>-3.48937</td>
    </tr>
    <tr>
        <td>163</td>
        <td>wtd_std_Valence</td>
        <td>-3.24568</td>
    </tr>
    <tr>
        <td>9</td>
        <td>Bi</td>
        <td>2.40671</td>
    </tr>
    <tr>
        <td>162</td>
        <td>wtd_std_ThermalConductivity</td>
        <td>2.29054</td>
    </tr>
    <tr>
        <td>57</td>
        <td>Pr</td>
        <td>-2.27065</td>
    </tr>
    <tr>
        <td>31</td>
        <td>Hg</td>
        <td>2.2417</td>
    </tr>
    <tr>
        <td>146</td>
        <td>wtd_mean_ThermalConductivity</td>
        <td>1.89972</td>
    </tr>
</table>
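The ranking above orders features by the magnitude of their coefficients, regardless of sign. A minimal sketch of that ranking step, using the coefficient values from the table, is shown below. The helper name `top_features` is hypothetical. With a fitted model, the pairs would instead come from the feature column names and the model's `coef_` attribute.

```python
# Coefficients copied from the table above, deliberately listed out of order.
coefficients = {
    "Hg": 2.2417,
    "La": -3.48937,
    "wtd_mean_ThermalConductivity": 1.89972,
    "Cu": 4.99197,
    "Pr": -2.27065,
    "Ca": 4.83552,
    "wtd_std_Valence": -3.24568,
    "Bi": 2.40671,
    "Ba": 4.88638,
    "wtd_std_ThermalConductivity": 2.29054,
}


def top_features(coefs, n=10):
    """Return the n (feature, coefficient) pairs with the largest |coefficient|."""
    ranked = sorted(coefs.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:n]


for name, coef in top_features(coefficients, n=5):
    print(f"{name:<30} {coef:+.5f}")
```

Comparing coefficient magnitudes in this way is only meaningful when the features are on a common scale, which is why the data was scaled before modeling.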

***

# 4. CONCLUSION

***

# 5. CODE:

Attached in the file `CS2_CODE.ipynb`.