# Case Study 2: Predicting Hospital Readmission of Diabetic Patients
By: Allen Hoskins and Brittany Lewandowski

September 19, 2022 
***
# 1. INTRODUCTION 
 
Diabetes is a metabolic disease impacting 37.3 million Americans. Those affected by the disease have complications producing insulin, a chemical messenger that our body uses to store energy. Although it is uncommon, diabetics can be hospitalized for having critically low or high blood glucose levels. These hospitalizations can be life threatening and should be minimized at all costs.  
 
In this case study, we will use a diabetes data set provided by Dr. Slater to identify which factors are most strongly associated with diabetics being readmitted to hospitals. To accomplish this, we will build a Logistic Regression model and extract its feature importances. It is our hope that this research can be leveraged by medical professionals to help treat hospitalized diabetics and to reduce the likelihood that these patients are readmitted in the future.  
 
***
# 2. METHODS 
 
#### DATA UNDERSTANDING: 
 
The data used in this case study was a file, diabetes.csv, provided by Dr. Slater. It contained data on hospitalized diabetic patients, including columns such as “readmitted,” “patient_nbr,” “insulin,” and “time_in_hospital.” After reviewing the contents of the data set, we loaded it into a data frame named “diabetes_data” and began pre-processing. 
 
#### DATA PREPROCESSING: 
 
The first step we performed in pre-processing was reviewing our full data set. Immediately, we recognized that missing values existed in the columns of:  
 
1.	race 
2.	weight 
3.	payer_code 
4.	medical_specialty 
5.	diag_1 
6.	diag_2 
7.	diag_3 

Given that many machine learning models do not handle missing values well, we evaluated how best to treat them. Full details on how these columns were handled can be found in the section of this case study titled “Data Imputation.”

After identifying that missing values existed in our data set, we ran the command “diabetes_data.info()” and noted the following details of our data frame:

>	Our data frame contained 101,766 rows.
>	Our data frame contained 50 columns. 
>	No null values existed in our data frame (missing values were recorded as “?” rather than as nulls). 
>	Our data frame contained 13 numeric columns. 
>	Our data frame contained 37 categorical columns. 

From this output, we recognized that one hot encoding (OHE) would need to be performed on our categorical columns. For additional details on our OHE process, please see the section of this case study titled “One Hot Encoding.”

The next step performed in pre-processing was running the command “diabetes_data.describe()” to view the summary statistics of our data frame. Output from this command showed that several columns contained outliers. This was something we remained cognizant of throughout our analysis. 

Finally, to view the distributions of our categorical columns with missing values, we created count plots. Visualizing these columns was important, as it helped us determine which data imputation method was most appropriate for our data. Output from our count plots showed that all seven of our columns with missing data had uneven, non-normal distributions (Exhibit 1.0). Since all seven columns were of the categorical data type, we noted that imputing them with the mean or median would not be appropriate, and that the mode was the natural candidate. 

<br><center><figure>
    <img
        src="../images/count_plot_payer_code.png" class="center">
    <figcaption>
        <font size="+1"><em><center>
            Exhibit 1.0: Count Plot of Payer_Code
        </center></em></font>
    </figcaption><br>
</figure></center>


The last step performed in pre-processing was calculating the percentage of missing values in our categorical columns. Exhibit 1.1 details our findings:

<center><table border="1">
<thead>
<tr><th>Column Name</th><th>Percentage of Missing Values</th></tr>
</thead>
<tbody>
<tr><td>race</td><td>2.23%</td></tr>
<tr><td>weight</td><td>96.85%</td></tr>
<tr><td>payer_code</td><td>39.55%</td></tr>
<tr><td>medical_specialty</td><td>49.08%</td></tr>
<tr><td>diag_1</td><td>0.02%</td></tr>
<tr><td>diag_2</td><td>0.35%</td></tr>
<tr><td>diag_3</td><td>1.39%</td></tr>
</tbody>
</table><br>
<caption><font size="+1"><em>Exhibit 1.1: Percent Missing Values</em></font></caption></center>


#### DATA IMPUTATION: 

Upon reviewing our full data set and calculating the percentage of missing values that existed in our categorical columns, we proceeded to impute our missing values. 

All the missing values in our data set were denoted by “?”. Since pandas treats “?” as an ordinary string rather than as a missing value, we converted the question marks to “NaN”. Once this was complete, we re-calculated the number of missing values in each column to validate that the counts matched and that no data had been lost in the conversion.  
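The conversion and validation described above can be sketched as follows. This is a minimal illustration on an invented toy frame (the name `toy` and its values are hypothetical and are not taken from diabetes.csv); it shows one way to check that the count of “?” placeholders matches the count of NaN values after conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for diabetes_data; values are invented for illustration.
toy = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "weight": ["?", "?", "?", "[75-100)"],
    "time_in_hospital": [1, 3, 2, 5],
})

# Count the "?" placeholders per column before converting them.
placeholder_counts = (toy == "?").sum()

# Replace the placeholder with a true missing value.
toy_nan = toy.replace("?", np.nan)

# After conversion, NaN counts should equal the earlier placeholder counts.
nan_counts = toy_nan.isna().sum()

# Percentage of missing values per column, as in Exhibit 1.1.
pct_missing = (nan_counts / len(toy_nan) * 100).round(2)
```

Comparing `placeholder_counts` with `nan_counts` is the validation step: if the two differ for any column, something other than “?” was altered during conversion.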

When we considered imputing the columns “race,” “payer_code,” “medical_specialty,” “diag_1,” “diag_2,” and “diag_3,” we tried two different approaches. The first approach was imputing each column with its mode, and the second approach was leaving the missing values in place. We fit our Logistic Regression model using both approaches and found that the difference in performance was negligible. Consequently, we decided to leave the missing values in these columns, as we felt this represented our data the best.  

The “weight” column was handled differently. Given that 96.85% of the data in our “weight” column were missing, we chose to drop the column from our data set rather than impute it. 

Next, we re-coded the columns “diabetesMed,” “change,” and “readmitted” as binary variables with values of 0 and 1. This was done to simplify OHE, as these columns had a maximum of three classes. Please note that although the column “readmitted” contains three classes, we chose to convert it to a binary variable, as we are only concerned with whether a patient has been readmitted or not. Exhibit 1.2 details our conversion process for these columns:

<center><table>
  <tr>
    <th>Column Name</th>
    <th>Original Classes</th>
    <th>Data Dictionary for Converted Classes</th>
  </tr>
  <tr>
    <td>diabetesMed</td>
    <td>No<br>Yes</td>
    <td>0=No<br>1=Yes</td>
  </tr>
  <tr>
    <td>change</td>
    <td>Ch<br>No</td>
    <td>0=No<br>1=Yes</td>
  </tr>
  <tr>
    <td>readmitted</td>
    <td>NO<br>&#60;30<br>&#62;30</td>
    <td>0=No (NO)<br>1=Yes (&#60;30 or &#62;30)</td>
  </tr>
</table><br>
<caption><font size="+1"><em>Exhibit 1.2: Conversion process for the columns “diabetesMed,” “change,” and “readmitted”</em></font></caption></center>
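The mapping in Exhibit 1.2 could be applied along the following lines. This is a sketch on a hypothetical three-row frame, assuming the original class labels shown in the exhibit; the exact code used in the accompanying notebook may differ.

```python
import pandas as pd

# Hypothetical rows using the original class labels from Exhibit 1.2.
toy = pd.DataFrame({
    "diabetesMed": ["Yes", "No", "Yes"],
    "change": ["Ch", "No", "No"],
    "readmitted": ["NO", "<30", ">30"],
})

# Two-class columns map directly onto 0/1.
toy["diabetesMed"] = toy["diabetesMed"].map({"No": 0, "Yes": 1})
toy["change"] = toy["change"].map({"No": 0, "Ch": 1})

# Any readmission, whether within or after 30 days, becomes the positive class.
toy["readmitted"] = (toy["readmitted"] != "NO").astype(int)
```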


#### RE-CODING CATEGORICAL COLUMNS:

When viewing the shape of our data set, we recognized that if we one hot encoded all 37 of our categorical variables, our data set would be extremely wide. As a result, we decided to reduce the number of classes in each categorical variable by specifying a relative-frequency threshold below which infrequent classes were grouped together. Exhibit 1.3 details the threshold chosen for each variable, as well as an explanation of why it was chosen.

<center><table border="1">
<thead>
<tr><th>Column Name</th><th>Selected Threshold</th><th>Explanation</th></tr>
</thead>
<tbody>
<tr><td>payer_code</td><td>0.02</td><td>~90% of our data falling into the top 7 classes</td></tr>
<tr><td>medical_specialty</td><td>0.03</td><td>~85% of our data falling into the top 5 classes</td></tr>
<tr><td>max_glu_serum</td><td>0.02</td><td>~96% of our data falling into the top 2 classes</td></tr>
<tr><td>A1Cresult</td><td>0.08</td><td>~91% of our data falling into the top 2 classes</td></tr>
<tr><td>metformin</td><td>0.1</td><td>~98% of our data falling into the top 2 classes</td></tr>
<tr><td>repaglinide</td><td>0.01</td><td>~99% of our data falling into the top 2 classes</td></tr>
<tr><td>nateglinide</td><td>0.9</td><td>~99% of our data falling into the top class</td></tr>
<tr><td>chlorpropamide</td><td>0.9</td><td>~99% of our data falling into the top class</td></tr>
<tr><td>glimepiride</td><td>0.9</td><td>~95% of our data falling into the top class</td></tr>
<tr><td>glipizide</td><td>0.1</td><td>~97% of our data falling into the top 2 classes</td></tr>
<tr><td>glyburide</td><td>0.09</td><td>~98% of our data falling into the top 2 classes</td></tr>
<tr><td>pioglitazone</td><td>0.06</td><td>~98% of our data falling into the top 2 classes</td></tr>
<tr><td>rosiglitazone</td><td>0.05</td><td>~98% of our data falling into the top 2 classes</td></tr>
<tr><td>acarbose</td><td>0.9</td><td>~99% of our data falling into the top class</td></tr>
<tr><td>miglitol</td><td>0.9</td><td>~99% of our data falling into the top class</td></tr>
<tr><td>tolazamide</td><td>0.0004</td><td>~99% of our data falling into the top class</td></tr>
<tr><td>glyburide_metformin</td><td>0.9</td><td>~99% of our data falling into the class “No”</td></tr>
<tr><td>diag_1</td><td>0.0075</td><td>many of the columns contained values &#60;= 0.000010</td></tr>
<tr><td>diag_2</td><td>0.0075</td><td>many of the columns contained values &#60;= 0.000010</td></tr>
<tr><td>diag_3</td><td>0.0075</td><td>many of the columns contained values &#60;= 0.000010</td></tr>
</tbody>
</table><br>
<caption><font size="+1"><em>Exhibit 1.3: Detailed Recoding Threshold</em></font></caption></center>
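One way to apply a threshold like those in Exhibit 1.3 is to group every class whose relative frequency falls below the threshold into a single catch-all class. The helper below is a hypothetical sketch: the function name `lump_rare_classes`, the label “Other”, and the toy payer codes are assumptions for illustration, since the write-up does not state how infrequent classes were relabeled.

```python
import pandas as pd

def lump_rare_classes(series, threshold, other_label="Other"):
    """Replace classes whose relative frequency is below `threshold`."""
    # Relative frequency of each class (missing values are ignored).
    freqs = series.value_counts(normalize=True)
    rare = freqs[freqs < threshold].index
    # Keep frequent classes as-is; relabel the rare ones.
    return series.where(~series.isin(rare), other_label)

# Invented class counts: 60%, 30%, 9%, and 1% of 100 rows.
payer = pd.Series(["MC"] * 60 + ["HM"] * 30 + ["SP"] * 9 + ["FR"] * 1)
lumped = lump_rare_classes(payer, threshold=0.02)
```

With a threshold of 0.02, only the class covering 1% of rows is relabeled, so the number of one hot encoded columns this variable produces stays at four.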

#### ONE HOT ENCODING: 

Once we had handled our missing values and re-coded our categorical columns, we separated our diabetes data set into two variables, one containing all our numeric columns and the other containing all our categorical columns. Next, using Pandas’ `get_dummies` function, we one hot encoded our categorical columns and joined the one hot encoded data to our numeric columns to arrive at our final full data set. 

Please note that for modeling, we scaled our numeric (non-one-hot-encoded) columns to ensure that our full data set was on the same scale. For additional modeling details, please see the section below titled “Model Building & Results.”
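The split, encode, scale, and join steps above might look like the following sketch. The toy frame is invented, and `StandardScaler` is an assumption: the write-up does not name the scaler that was used.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical rows; column names follow the data set, values are invented.
toy = pd.DataFrame({
    "time_in_hospital": [1, 3, 2, 5],
    "num_medications": [10, 18, 7, 25],
    "insulin": ["No", "Up", "Steady", "No"],
})

# Separate numeric and categorical columns.
numeric_cols = toy.select_dtypes(include="number")
categorical_cols = toy.select_dtypes(exclude="number")

# One hot encode the categorical columns.
encoded = pd.get_dummies(categorical_cols, dtype=int)

# Scale only the numeric columns (StandardScaler is an assumed choice).
scaled = pd.DataFrame(
    StandardScaler().fit_transform(numeric_cols),
    columns=numeric_cols.columns,
    index=numeric_cols.index,
)

# Join the two pieces back into one modeling frame.
full = scaled.join(encoded)
```

Scaling only the numeric columns leaves the 0/1 indicator columns untouched, which keeps them directly interpretable.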


#### EXPLORATORY DATA ANALYSIS: 

For our exploratory data analysis (EDA), we began by viewing histograms and pair plots of our data (Exhibits 1.4 & 1.5). Two takeaways from these visualizations included:

1.	Many of our numeric columns exhibited non-normal distributions. 
2.	Our variables were not on the same scale.

<center><figure>
    <img
        src="../images/encounter_id_density.png" class="center">
    <figcaption>
      <font size="+1"><em>
            Exhibit 1.4: Encounter ID Density Plot</em></font>
    </figcaption><br>
</figure></center>


<center><figure>
    <img
        src="../images/admission_type_id_density.png" class="center">
    <figcaption>
        <font size="+1"><em>
            Exhibit 1.5: Admission Type ID Density Plot</em></font>
    </figcaption><br>
</figure></center>


After reviewing our histograms and pair plots, we assessed multicollinearity in our data by creating a correlation plot. Seeing that no pair of columns had a correlation coefficient of 1.0, we chose not to remove any columns from our data set. At this point we were satisfied with our data and began evaluating whether it met the assumptions of Logistic Regression.  

#### ASSUMPTIONS OF LOGISTIC REGRESSION MODELS: 
 
The three key assumptions of Logistic Regression models include:
1.	Independent variables have a linear relationship to the log odds of the response. 
2.	Absence of multicollinearity. 
3.	Lack of outliers. 

To assess our first assumption, we created two log odds plots of our response variable versus the independent variables “time_in_hospital” and “num_medications” (Exhibits 1.6 and 1.7). 

<center><figure>
    <img
        src="../images/time_in_hosp_log_odds.png" class="center">
    <figcaption>
      <font size="+1"><em>
            Exhibit 1.6: Log Odds for Time in Hospital</em></font>
    </figcaption><br>
</figure></center>


<center><figure>
    <img
        src="../images/num_med_log_odds.png" class="center">
    <figcaption>
        <font size="+1"><em>
            Exhibit 1.7: Log Odds for Num Meds Feature</em></font>
    </figcaption><br>
</figure></center>

<center><figure>
    <img
        src="../images/orig_pair_plot.png" class="center"> 
    <figcaption>
        <font size="+1"><em>
            Exhibit 1.8: Snapshot of several pair plots generated from our data</em></font>
    </figcaption><br>
</figure></center>


Seeing that our log odds plots showed that our independent variables had a linear relationship to the log odds of our response, we deemed that our first assumption was met. 
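For reference, the quantity plotted in a log odds check can be computed as the empirical log odds of the response at each level of a predictor. The sketch below uses invented data with three levels of “time_in_hospital”; it illustrates the calculation only and is not the notebook's plotting code.

```python
import numpy as np
import pandas as pd

# Invented data: readmission rates of 25%, 50%, and 75% across three levels.
toy = pd.DataFrame({
    "time_in_hospital": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "readmitted":       [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1],
})

# Proportion of positive responses at each level of the predictor.
rate = toy.groupby("time_in_hospital")["readmitted"].mean()

# Empirical log odds: log(p / (1 - p)).
log_odds = np.log(rate / (1 - rate))
```

If the assumption holds, plotting `log_odds` against the predictor gives points that fall roughly on a straight line. Levels where the rate is exactly 0 or 1 produce infinite log odds and need to be binned or smoothed first.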

To address our multicollinearity assumption, we created a correlation plot. Given that no columns had a correlation coefficient of 1.0, we proceeded assuming that this assumption was met. 
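A correlation check of this kind can also be done numerically by listing the column pairs whose absolute correlation exceeds a cutoff. The sketch below is illustrative only: the toy columns are synthetic, and the 0.9 cutoff is an assumption chosen for the example (the analysis above only looked for coefficients of 1.0).

```python
import numpy as np
import pandas as pd

# Synthetic columns: "b" is an exact linear function of "a"; "c" is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({"a": x, "b": 2 * x + 1, "c": rng.normal(size=200)})

# Absolute pairwise correlations.
corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs above the (assumed) cutoff.
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]
```

A cutoff below 1.0 is a stricter screen, since strongly but not perfectly correlated columns can still make coefficient estimates unstable.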

The final assumption we addressed was lack of outliers. As illustrated in our pair plots (Exhibits 1.8 and 1.9), we did see that our data contained outliers. Given that the pair plots were built on un-scaled data, while our modeling data was scaled, we proceeded in our analysis assuming that this assumption was met. 

<center><figure>
    <img
        src="../images/trans_log_reg_pair_plot.png" class="center">
    <figcaption>
        <font size="+1"><em>
            Exhibit 1.9: Addressing our Logistic Regression outlier assumption with pair plots</em></font>
    </figcaption><br>
</figure></center>


***
# 3. MODEL BUILDING & RESULTS

Sklearn's `LogisticRegression` was used to model the data for this case study.

*Models use: 10-fold cross-validation (`KFold`), `random_state = 0`, `max_iter = 50000`, and a scoring metric of `F1`.*

**Model HalvingRandomSearchCV:**


After preprocessing, EDA, and scaling the data, modeling was able to begin. To determine the best hyperparameters to use, we needed to iterate through several of `sklearn`'s modules. We began with `GridSearchCV`, but due to the shape of our data and our inability to scale CPU, GPU, and memory resources to the needs of this project, `GridSearchCV` was unable to complete and we needed to try other methods of tuning hyperparameters. Using Sklearn's `experimental` and `model_selection` packages, we were able to utilize `HalvingRandomSearchCV` to obtain good, but potentially not the best, hyperparameters for this model. `HalvingRandomSearchCV` combines the ideas of successive halving (as in `HalvingGridSearchCV`) and `RandomizedSearchCV`.
Successive halving works by fitting all candidates on a small amount of data, keeping only the best-performing candidates, and re-fitting those with additional resources and data until a "best" model is output. `RandomizedSearchCV` randomly picks candidate models from the grid to fit. 

We passed the below parameters into `HalvingRandomSearchCV` and the best model outputs were the following:

**Halving Random Search CV Parameters:**

```
        "C":            np.logspace(-3,3,7), 
        "l1_ratio":     np.arange(0.0,1.0,0.1),
        'solver':       ['saga'],
        'penalty':      ['elasticnet'],
        "tol":          [1e-9,1e-8,1e-7,1e-6,1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
```

**Best Model Output:**

```
        "C":            100.0
        "l1_ratio":     0.8
        "n_jobs":       -1
        "penalty":      'elasticnet'
        "solver":       'saga'
        "tol":          0.001
```
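A search of this shape can be sketched as below. This is a small, hedged illustration rather than the notebook's code: it runs on synthetic data from `make_classification`, uses 5 folds with shuffling, a shortened `tol` list, a lower `max_iter`, and `min_resources=100` so that it finishes quickly. Those settings are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
# This import enables the experimental halving search estimators.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import HalvingRandomSearchCV, KFold

# Synthetic stand-in for the encoded, scaled diabetes data.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Same structure as the grid above, with a shorter tol list for speed.
param_distributions = {
    "C": np.logspace(-3, 3, 7),
    "l1_ratio": np.arange(0.0, 1.0, 0.1),
    "solver": ["saga"],
    "penalty": ["elasticnet"],
    "tol": [1e-4, 1e-3, 1e-2, 1e-1],
}

search = HalvingRandomSearchCV(
    LogisticRegression(max_iter=5000, random_state=0),
    param_distributions,
    scoring="f1",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    min_resources=100,  # samples given to each candidate in the first round
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

Because candidates are sampled at random and pruned early, the result is a good configuration found cheaply, not a guarantee of the best point on the grid.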

**Elastic-Net Logistic Regression with HalvingRandomSearchCV Tuned Parameters:**

After performing `HalvingRandomSearchCV` to tune the model parameters, Sklearn's `cross_validate` was used to validate the model and determine final performance. The results of all 10 folds are below, with a mean F1 score of 0.568746.

<center><table>
    <tr>
        <th></th>
        <th>fit_time</th>
        <th>score_time</th>
        <th>estimator</th>
        <th>test_score</th>
        <th>train_score</th>
    </tr>
    <tr>
        <td>0</td>
        <td>2.43238</td>
        <td>0.00473499</td>
        <td>LogisticRegression(C=100.0, l1_ratio=0.8, n_jobs=-1, penalty='elasticnet',random_state=0, solver='saga', tol=0.001)</td>
        <td>0.568733</td>
        <td>0.568562</td>
    </tr>
    <tr>
        <td>1</td>
        <td>2.60263</td>
        <td>0.00420213</td>
        <td>LogisticRegression(C=100.0, l1_ratio=0.8, n_jobs=-1, penalty='elasticnet',random_state=0, solver='saga', tol=0.001)</td>
        <td>0.56834</td>
        <td>0.568627</td>
    </tr>
    <tr>
        <td>2</td>
        <td>2.9399</td>
        <td>0.00471711</td>
        <td>LogisticRegression(C=100.0, l1_ratio=0.8, n_jobs=-1, penalty='elasticnet',random_state=0, solver='saga', tol=0.001)</td>
        <td>0.567456</td>
        <td>0.568824</td>
    </tr>
    <tr>
        <td>3</td>
        <td>2.86625</td>
        <td>0.00472498</td>
        <td>LogisticRegression(C=100.0, l1_ratio=0.8, n_jobs=-1, penalty='elasticnet',random_state=0, solver='saga', tol=0.001)</td>
        <td>0.566277</td>
        <td>0.56902</td>
    </tr>
    <tr>
        <td>4</td>
        <td>2.24447</td>
        <td>0.00615811</td>
        <td>LogisticRegression(C=100.0, l1_ratio=0.8, n_jobs=-1, penalty='elasticnet',random_state=0, solver='saga', tol=0.001)</td>
        <td>0.56726</td>
        <td>0.568922</td>
    </tr>
    <tr>
        <td>5</td>
        <td>2.248</td>
        <td>0.00481105</td>
        <td>LogisticRegression(C=100.0, l1_ratio=0.8, n_jobs=-1, penalty='elasticnet',random_state=0, solver='saga', tol=0.001)</td>
        <td>0.572271</td>
        <td>0.568409</td>
    </tr>
    <tr>
        <td>6</td>
        <td>2.93157</td>
        <td>0.00560784</td>
        <td>LogisticRegression(C=100.0, l1_ratio=0.8, n_jobs=-1, penalty='elasticnet',random_state=0, solver='saga', tol=0.001)</td>
        <td>0.574194</td>
        <td>0.567999</td>
    </tr>
    <tr>
        <td>7</td>
        <td>2.92202</td>
        <td>0.00460625</td>
        <td>LogisticRegression(C=100.0, l1_ratio=0.8, n_jobs=-1, penalty='elasticnet',random_state=0, solver='saga', tol=0.001)</td>
        <td>0.573015</td>
        <td>0.568162</td>
    </tr>
    <tr>
        <td>8</td>
        <td>2.47317</td>
        <td>0.00449204</td>
        <td>LogisticRegression(C=100.0, l1_ratio=0.8, n_jobs=-1, penalty='elasticnet',random_state=0, solver='saga', tol=0.001)</td>
        <td>0.561812</td>
        <td>0.56956</td>
    </tr>
    <tr>
        <td>9</td>
        <td>2.8959</td>
        <td>0.00579691</td>
        <td>LogisticRegression(C=100.0, l1_ratio=0.8, n_jobs=-1, penalty='elasticnet',random_state=0, solver='saga', tol=0.001)</td>
        <td>0.568101</td>
        <td>0.568796</td>
    </tr>
    <tr>
        <td>MEAN</td>
        <td>2.65563</td>
        <td>0.00498514</td>
        <td></td>
        <td>0.568746</td>
        <td>0.568688</td>
    </tr>
</table></center>
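A table with these columns is the kind of output `cross_validate` produces when train scores and fitted estimators are requested. The sketch below reproduces the structure on synthetic data; the data, the shuffled `KFold`, and the variable names are assumptions for illustration, so the scores it yields are unrelated to those reported above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the encoded, scaled diabetes data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hyperparameters taken from the best model output reported above.
model = LogisticRegression(
    C=100.0, l1_ratio=0.8, penalty="elasticnet", solver="saga",
    tol=0.001, max_iter=50000, random_state=0,
)

results = cross_validate(
    model, X, y,
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
    scoring="f1",
    return_train_score=True,
    return_estimator=True,
)

# One row per fold; the fitted estimators are left out of the numeric summary.
summary = pd.DataFrame({k: v for k, v in results.items() if k != "estimator"})
mean_scores = summary.mean()
```

Comparing the mean `train_score` with the mean `test_score` is a quick check for overfitting: values that are nearly equal, as in the table above, suggest the model is limited by bias rather than variance.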

***

# 4. CONCLUSION

In conclusion, after significant updates to thresholds and hyperparameters, we have determined that logistic regression does not model this data well, due to the inability of the coefficients to converge. Models that may be better suited to this data set include decision trees and gradient boosting methods. 

***

# 5. CODE:

Attached in file CS2_CODE.ipynb