# CPSC 330 hw9

## Instructions
rubric={points:5}

Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md). 

## Exercise 1
rubric={points:50}

Write up your analysis from hw8 in a "blog post" or report format. The target audience for your blog post is someone like yourself right before you took this course. They don't necessarily have ML knowledge, but they have a solid foundation in technical matters. The post should focus on explaining **your results and what you did** in a way that's understandable to such a person, **not** a lesson trying to teach someone about machine learning. Again: focus on the results and why they are interesting; avoid pedagogical content.

Your post must include the following elements (not necessarily in this order):

- Description of the problem/decision.
- Description of the dataset (the raw data and/or some EDA).
- Description of the model.
- Description of your results, both quantitatively and qualitatively. Make sure to refer to the original problem/decision.
- A section on caveats, describing at least 3 reasons why your results might be incorrect, misleading, overconfident, or otherwise problematic. Make reference to your specific dataset, model, approach, etc. To check that your reasons are specific enough, make sure they would not make sense, if left unchanged, to most students' submissions; for example, do not just say "overfitting" without explaining why you might be worried about overfitting in your specific case.
- At least 3 visualizations. These visualizations must be embedded/interwoven into the text, not pasted at the end. The text must refer directly to each visualization. For example "as shown below" or "the figure demonstrates" or "take a look at Figure 1", etc. It is **not** sufficient to put a visualization in without referring to it directly.

A reasonable length for your entire post would be **800 words**. The maximum allowed is **1000 words**.

#### A note for those who didn't do hw8

If you're working with a partner on this assignment and you didn't work together on hw8, you can choose the analysis of either person for your blog post. If you're working alone on hw9 and didn't do hw8, or if you're working with a partner on hw9 and neither of you did hw8, you may choose one of the CPSC 330 homework assignments as the topic of your blog post instead. In that case, please make it very clear that this is what you're doing.

#### Example blog posts

Here are some examples of applied ML blog posts that you may find useful as inspiration. The target audiences of these posts aren't necessarily the same as yours, and these posts are longer than yours, but they are well-structured and engaging. You are **not required to read these** posts as part of this assignment - they are here only as examples if you'd find that useful.

From the UBC Master of Data Science blog, written by a past student:

- https://ubc-mds.github.io/2019-07-26-predicting-customer-probabilities/

This next one uses R instead of Python, but that might be good in a way, as you can see what it's like for a reader that doesn't understand the code itself (the target audience for your post here):

- https://rpubs.com/RosieB/taylorswiftlyricanalysis

Finally, here are a couple of interviews with winners from Kaggle competitions. The format isn't quite the same as a blog post, but you might find them interesting/relevant:

- https://medium.com/kaggle-blog/instacart-market-basket-analysis-feda2700cded
- https://medium.com/kaggle-blog/winner-interview-with-shivam-bansal-data-science-for-good-challenge-city-of-los-angeles-3294c0ed1fb2


#### A note on plagiarism

You may **NOT** include text or visualizations that were not written/created by you. If you are in any doubt as to what constitutes plagiarism, please just ask. For more information see the [UBC Academic Misconduct policies](http://www.calendar.ubc.ca/vancouver/index.cfm?tree=3,54,111,959). Please don't copy this from somewhere 🙏. If you can't do it, just don't turn it in - your lowest two homework scores will be dropped anyway.

<font color = blue> Refer to hw_9.pdf for complete write up

# PREDICTING EARLY ICU ADMISSION BASED ON COVID-19 DIAGNOSIS OF PATIENTS IN BRAZIL

#### Sheetal Shajan, Angelene Leow

## BACKGROUND MOTIVATION
As the world continues to struggle to contain the novel coronavirus disease COVID-19, a major obstacle faced by healthcare institutions is the unpredictable demand for resources such as ventilators and ICU beds to treat critically ill patients. This obstacle can be mitigated if a machine learning model can correctly predict, hours in advance, whether a patient with COVID-19 will be admitted to the ICU. For our project, we chose to look at COVID-19 cases from Hospital Sírio-Libanês in São Paulo and Brasília, Brazil.
    
![](image7.png)




<center> Figure 1: COVID-19 Sírio-Libanês dataset </center>

## THE DATASET:

The features in the dataset included patient demographic information, disease history, blood results, various measurements of vital signs and whether or not they were admitted to the ICU (0 for no admission, 1 for admission). 



## MODELLING:
Our goal is to predict with accuracy and transparency. The following steps were employed for modelling and prediction:

- Exploration
- Data preprocessing
- Modelling
- Visualizing results

### Exploration
We randomly split our dataset into training and test sets with a 75/25 ratio. We did not remove any outliers, since the numeric variables in the source data had already been scaled.
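
A 75/25 random split of this kind can be sketched with scikit-learn's `train_test_split`. This is only a minimal illustration: the data here is synthetic, and the variable names and `random_state` value are assumptions rather than the ones used in our analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the hospital data (illustrative only):
# 200 rows, 4 numeric features, binary ICU label with roughly 25% positives.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 4))
y = (rng.uniform(size=200) < 0.25).astype(int)

# 75/25 random split; random_state is fixed here only for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123
)

print(X_train.shape, X_test.shape)
```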

### Data Preprocessing
We then replaced missing values with the median. We preprocessed the categorical variables by creating a binary variable for each unique category, using scikit-learn's `OneHotEncoder`.
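
The two preprocessing steps could be combined roughly as follows. This is a sketch under assumptions: the toy data frame and its column names are hypothetical, and wiring the steps together with `ColumnTransformer` is one possible arrangement, not necessarily the one in our notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame; column names are illustrative, not the real dataset's.
train_df = pd.DataFrame({
    "heart_rate": [0.1, np.nan, -0.3, 0.5],
    "resp_rate": [-0.2, 0.4, np.nan, 0.0],
    "age_group": ["40th", "60th", "40th", "80th"],
})

numeric_cols = ["heart_rate", "resp_rate"]
categorical_cols = ["age_group"]

preprocessor = ColumnTransformer(
    [
        # Replace missing numeric values with the median learned from the training data.
        ("num", SimpleImputer(strategy="median"), numeric_cols),
        # One binary column per category; unseen categories encode as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train_enc = preprocessor.fit_transform(train_df)
print(X_train_enc.shape)  # 2 numeric columns + 3 one-hot columns
```

Fitting the preprocessor on the training set only, and then calling `transform` on the test set, keeps the test data from influencing the imputed medians.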

### Modelling
We ran a DummyClassifier on our training and test sets to explore the class distribution of the data.


![](image3.png)

<center> Figure 2. Dummy classifier score showing patients admitted to ICU (Value = 1) and not admitted to ICU (Value = 0) in our training vs test set. </center>

Figure 2 shows that the classes are unevenly distributed in both the training and test sets. A dummy classifier predicts the most frequent value of the target variable for every sample, so its accuracy equals the share of the majority class. From this, we can see that the ratio of patients not admitted to the ICU to patients admitted is approximately 75:25.
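
To illustrate why the dummy score reveals the class balance, here is a small sketch with made-up labels in a 75:25 ratio. The data is synthetic, and the explicit `strategy="most_frequent"` setting is an assumption chosen to match the behaviour described above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with a 75:25 imbalance (0 = not admitted, 1 = admitted).
y_train = np.array([0] * 75 + [1] * 25)
X_train = np.zeros((100, 1))  # the dummy model ignores the features

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

# Accuracy equals the share of the majority class.
baseline_accuracy = dummy.score(X_train, y_train)
print(baseline_accuracy)
```

Any real model needs to beat this baseline before its accuracy means much on an unbalanced dataset.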

We ran a random forest model with five-fold cross-validation on the training set, which gave a validation accuracy of 86.01%. Cross-validation splits the training data into five folds and calls fit and predict on our model five times, each time training on four folds and validating on the remaining one. On the test set, the model gave an accuracy of 89.12%.
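
The cross-validation procedure can be sketched as below. The data is a synthetic stand-in generated with `make_classification`, and the default hyperparameters and `random_state` values are assumptions, so the scores printed will not match the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced stand-in data (about 75% negatives); not the hospital data.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.75], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rf = RandomForestClassifier(random_state=0)

# Five-fold CV: each fold is held out once while the model trains on the other four.
cv_scores = cross_val_score(rf, X_train, y_train, cv=5)
print("mean validation accuracy:", cv_scores.mean())

# The test set is touched only once, after refitting on all of the training data.
rf.fit(X_train, y_train)
test_accuracy = rf.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```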


### Data Visualization

To dive deeper into our analysis, we looked at the sensitivity of our model (a measure of how many of the actual positive examples we correctly identified) and its precision (a measure of how many of the examples we predicted as positive were actually positive).


![](image4.png)

<center>Figure 4. Error matrix using the random forest model on training data. </center>

Figure 4 shows our model's prediction counts on the training data. Of the 468 patients who were admitted to the ICU, our model predicted 97% correctly. Of the patients not admitted to the ICU, it predicted 99.8% correctly.

![](image6.png)

<center>Figure 5. Error matrix using the random forest model on testing data. </center>

Figure 5 shows the results from our error matrix on the testing data. The model correctly predicts 99% of the negative samples as negative. However, 38.3% of the patients who were actually admitted to the ICU were falsely predicted as not requiring ICU admission.
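
For readers who want to see how sensitivity (recall) and precision come out of an error matrix, here is a minimal sketch. The labels are hypothetical and chosen only to make the arithmetic easy to follow; they are not our model's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = admitted to ICU, 0 = not admitted.
y_true = [1] * 5 + [0] * 10
y_pred = [1] * 3 + [0] * 2 + [0] * 9 + [1] * 1

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
print(tn, fp, fn, tp, recall, precision)
```

In this toy case two of the five admitted patients are missed, which is the kind of error that matters most when planning ICU capacity.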

We looked at which features contribute most to our model’s prediction to foster transparency in the model. 

![](image8.png)

<center>Figure 6. Feature importances for Random Forest Model</center>

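From Figure 6, we see that respiratory rate and blood pressure measurements have the highest feature importances, meaning the model relies on them most when predicting ICU admission. Random forest importances are unsigned, so they show how strongly a feature influences the predictions rather than the direction of its effect; we expect, although the plot alone does not confirm, that higher respiratory rates and higher blood pressure make the model more likely to predict ICU admission.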

This is a very interesting finding, as it aligns with previous clinical research stating that hypertension (high blood pressure) is associated with severity and mortality in COVID-19 patients. An increase in respiratory rate is also a likely sign of respiratory distress, which aligns with the well-known fact that COVID-19 is caused by a respiratory virus.
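
Feature importances like those plotted in Figure 6 can be read from a fitted random forest through its `feature_importances_` attribute. The sketch below uses synthetic data and hypothetical feature names, so the ranking it prints says nothing about the hospital dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names; not the hospital dataset.
feature_names = ["resp_rate", "blood_pressure", "heart_rate", "temperature"]
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

rf = RandomForestClassifier(random_state=0).fit(X, y)

# Importances are non-negative and sum to 1; they carry no sign.
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

Because these values have no sign, they indicate which features the model leans on but not whether larger values push a prediction toward or away from ICU admission.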

## CAVEATS
     
### Dealing with missing data
Two patients had missing vital sign values, which we replaced with the median computed from the training set. Our results might have differed had the actual values been available in place of the medians.

### Low recall score 
Our model has poor sensitivity. This is concerning because, if our model were deployed in real life, ICU resource needs would be underestimated: about 2 out of 5 patients requiring ICU admission are not predicted to need it. This could be due to our unbalanced dataset, since the model has more training examples of non-admission than of actual ICU admission.

### Disease grouping 1-6 of patients not specified
The dataset does not specify what disease groupings 1-6 represent, so it is not known how they influence the model's ICU admission predictions. The model may assign different weights to disease groupings that are distinct but correlated.


### Window grouping of patients discrepancy
It is unknown how long each patient had COVID-19 before being officially diagnosed. A patient who already had severe symptoms before being tested is highly likely to be admitted to the ICU, so the time windows are not normalized to a common baseline across patients.

## CONCLUSION AND FUTURE WORK 
Our model's poor recall of 63% means it underestimates the number of expected admissions: only about 3 out of 5 patients needing ICU facilities are predicted correctly. An acceptable model should correctly predict at least 9 out of 10 such patients, as it is crucial that healthcare institutions and personnel are well prepared to save lives.

To build a more robust model, we should aim to improve the dataset we feed into it. The current demographic information does not include the geolocation of patients. Our data comes from only a few hospitals in Brazil and therefore does not generalize to other places affected by COVID-19. Future work should collect more samples from geographically and culturally diverse locations.

## REFERENCE
Dataset was obtained from https://www.kaggle.com/S%C3%ADrio-Libanes/covid19


## Exercise 2
rubric={points:5}

Describe one effective communication technique (lecture 20) that you used in your post, or an aspect of the post that you are particularly satisfied with.

Max 3 sentences

<font color = blue >We included a graphical representation of our confusion matrix (Figures 4 and 5) for both the training and testing sets to give a clearer quantitative picture of the recall and precision scores for observed versus predicted targets. With the labels alongside, it is easier to read off the values corresponding to true negatives, false negatives, true positives and false positives.

## Exercise 3
rubric={points:10}

Create a visualization that is different from the ones in your report, which was crafted intentionally to mislead or misrepresent the results of your analysis. Include your visualization, and the code that generated it, here. Then, explain what you did: 

- What is the incorrect interpretation that your visualization is trying to show? 
- What is the correct interpretation?
- How would you fix your visualization?

Max 1 paragraph total.

```
# Assumes X_train_enc, y_train, X_test_enc and y_test from the hw8 preprocessing
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

n_estimators_values = [3, 10, 30, 100, 300]
scores_rf, scores_gb = [], []            # training accuracy
scores_rf_test, scores_gb_test = [], []  # test accuracy

for n in n_estimators_values:
    # Fit each classifier on the training data only, then score it on both sets
    rf = RandomForestClassifier(n_estimators=n)
    rf.fit(X_train_enc, y_train)
    scores_rf.append(rf.score(X_train_enc, y_train))
    scores_rf_test.append(rf.score(X_test_enc, y_test))

    gb = GradientBoostingClassifier(n_estimators=n)
    gb.fit(X_train_enc, y_train)
    scores_gb.append(gb.score(X_train_enc, y_train))
    scores_gb_test.append(gb.score(X_test_enc, y_test))
```

```
# Incorrect representation: only the first nmax values of n_estimators are plotted
import matplotlib.pyplot as plt

nmax = 3
plt.plot(n_estimators_values[:nmax], scores_rf[:nmax], label="rf train")
plt.plot(n_estimators_values[:nmax], scores_rf_test[:nmax], label="rf test")
plt.plot(n_estimators_values[:nmax], scores_gb[:nmax], label="gb train")
plt.plot(n_estimators_values[:nmax], scores_gb_test[:nmax], label="gb test")
plt.xlabel("n estimators");
plt.ylabel("Accuracy");
plt.legend();
plt.title("For most values of the n_estimators hyperparameter, Random Forest Classifier is better");
```



![](image1.png)


<font color = blue > This plot supports an incorrect interpretation: because we only graphed scores for n_estimators up to 30, the random forest appears to be the better model, even though we know our models perform very similarly with n_estimators > 30.

    

```
## Correct training data plot
plt.plot(n_estimators_values, scores_rf, label="rf train")
plt.plot(n_estimators_values, scores_gb, label="gb train")
plt.xlabel("n estimators");
plt.ylabel("Accuracy");
plt.legend();
plt.title("Train Accuracy vs n_estimators for Random Forest and Gradient Boost Classifiers");
```

![](image5.png)

```
## Correct test data plot
plt.plot(n_estimators_values, scores_rf_test,  label="rf test")
plt.plot(n_estimators_values, scores_gb_test,  label="gb test")
plt.xlabel("n estimators");
plt.ylabel("Accuracy");
plt.legend();
plt.title("Test Accuracy vs n_estimators for Random Forest and Gradient Boost Classifiers");

```

![](image2.png)

<font color = blue > The plots above for the training and test data show the scores over the full range of n_estimators values, from 3 to 300.

## (optional, not for marks) Exercise 4

Publish your blog post from Exercise 1 publicly using a tool like Hugo, or somewhere like medium.com, and paste a link here. Be sure to pick a tool in which code and code output look reasonable. This link could be a useful line on your resume!

## Submission to Canvas

**IF YOU ARE WORKING WITH A PARTNER** please form the group before submitting - see instructions [here](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md#partners).

When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`.
2. Save your notebook.
3. Convert your notebook to `.html` format using the `convert_notebook()` function below **or** by `File -> Export Notebook As... -> Export Notebook to HTML`.
4. Run the code `submit()` below to go through an interactive submission process to Canvas.
>For this step, you will need a Canvas *Access Token*. If you haven't already got one, log in to Canvas, click `Account` (top-left of the screen), then `Settings`, then scroll down until you see the `+ New Access Token` button. Click that button, give your token any name you like and set the expiry date to Dec 31, 2020. Then click `Generate token`. Save this token in a safe place on your computer as you'll need it for all assignments. Treat the token with as much care as you would an important password. 

Note: for those having trouble with the Jupyter widgets and the dropdowns: if you add the argument `no_widgets=True` to your `submit` call, it should let you do a text-based entry of your key and avoid the dropdowns altogether. If this doesn't work, you probably need to upgrade to the latest version of `canvasutils` with `pip install canvasutils -U` from your terminal with your environment activated.


```
from canvasutils.submit import submit, convert_notebook

# Note: the canvasutils package should have been installed as part of your environment setup - 
# see https://github.com/UBC-CS/cpsc330/blob/master/docs/setup.md
```

```
# convert_notebook("hw9.ipynb", "html")  # uncomment and run when you want to try convert your notebook to HTML (or you can convert manually from the File menu)
```

```
# submit(course_code=53561, token=False)  # uncomment and run when ready to submit 
```

## And that's it!

And that's it, you're done with homework assignments for CPSC 330! Congratulations!