# Stroke Prediction

Stroke is the second leading cause of death globally and affects millions of people every year (https://en.wikipedia.org/wiki/Stroke). In this project, we will attempt to classify stroke patients using a dataset provided on Kaggle: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset.
The dataset consists of over $5000$ individuals and $10$ different input variables that we will use to predict the risk of stroke. The input variables are both numerical and categorical, and are explained below. Some input variables are known risk factors for stroke, such as hypertension (high blood pressure, https://en.wikipedia.org/wiki/Hypertension) and smoking status. In addition, the dataset includes some variables that are not traditionally considered risk factors: work type and residence type.

The source and collection methods for the dataset are confidential. In particular, we do not know which countries the participants come from, or whether the data originates from medical records or somewhere else. This is problematic as many of the variables are categorical, and it is unclear exactly how the categories were determined or whether the different categories were measured at the same time. Notably, it is not clear what type of stroke the dataset is concerned with. One usually subdivides stroke into two categories: ischemic stroke, in which the blood supply to the brain is interrupted, and hemorrhagic stroke, which is caused in part by rupturing blood vessels. The fact that the source of the data is confidential also makes it difficult to assess the quality of the data. Another problem is that the data is very unbalanced, as there are many more patients without stroke than with stroke. We will attempt to tackle some of these problems in this report.
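
To make the imbalance concrete, the sketch below counts the labels and derives inverse-frequency class weights, which is one common way of compensating for a rare positive class. The label list is invented for illustration and does not reflect the actual proportions in the dataset, and the helper name `class_weights` is ours. The formula `n_samples / (n_classes * count)` is the "balanced" heuristic used by scikit-learn.

```python
from collections import Counter


def class_weights(labels):
    """Return inverse-frequency weights: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Toy labels (NOT the real dataset): 95 patients without stroke, 5 with.
toy_labels = [0] * 95 + [1] * 5

counts = Counter(toy_labels)
weights = class_weights(toy_labels)
print(counts)   # how many patients fall in each class
print(weights)  # the rare class receives the larger weight
```

With weights like these, a misclassified stroke patient costs the model more than a misclassified patient without stroke.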

In this report we will mainly try tree-based methods such as random forests and boosting, as well as simple neural networks. We will also spend some time exploring parameter tuning, where we will attempt both grid search and Bayesian optimization in order to locate optimal hyperparameters.

# Data Exploration and Cleaning
The dataset contains the following data:

- ```id```. Integer.
- ```gender```. Categorical: ```male```, ```female``` or ```other```.
- ```age```. Float.
- ```hypertension```. Categorical: 1 or 0.
- ```heart_disease```. Categorical: 1 or 0.
- ```ever_married```. Categorical: 1 or 0.
- ```work_type```. Categorical: ```Private```, ```Self-employed```, ```Govt-job```, ```Never_worked```, ```children```.
- ```Residence_type```. Categorical: ```Urban```, ```Rural```.
- ```avg_glucose_level```. Float.
- ```bmi```. Float.
- ```smoking_status```. Categorical: ```never smoked```, ```formerly smoked```, ```smoking```, ```unknown```.
- ```stroke```. Categorical: 1 or 0.
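
The variable list above can be written down as a small schema and used to parse the data into typed records. This is only a sketch using the standard library: the two sample rows are invented, the names `COLUMN_TYPES` and `parse_rows` are ours, and the raw Kaggle file may encode some columns differently from the list (in which case the converters would need adjusting).

```python
import csv
import io

# Column name -> converter, following the variable list above.
COLUMN_TYPES = {
    "id": int,
    "gender": str,
    "age": float,
    "hypertension": int,
    "heart_disease": int,
    "ever_married": int,
    "work_type": str,
    "Residence_type": str,
    "avg_glucose_level": float,
    "bmi": float,
    "smoking_status": str,
    "stroke": int,
}


def parse_rows(text):
    """Parse CSV text into a list of dicts with typed values."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {name: COLUMN_TYPES[name](value) for name, value in row.items()}
        for row in reader
    ]


# Two invented rows (NOT taken from the dataset), used only to exercise the parser.
SAMPLE = """id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
1,female,61.0,0,0,1,Self-employed,Rural,120.5,27.3,never smoked,0
2,male,8.0,0,0,0,children,Urban,85.2,17.9,unknown,0
"""

rows = parse_rows(SAMPLE)
print(rows[0]["age"], rows[1]["work_type"])
```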

```hypertension``` is given as a categorical variable instead of a numerical value containing systolic and diastolic blood pressures. What is considered hypertension depends on age, so having hypertension as a categorical variable might remove the dependency between blood pressure and age, which can make it easier to fit useful models. However, we do not know whether hypertension is determined in a consistent way for all patients, or whether having blood pressure as a float would be more useful. For both hypertension and heart disease, we have no information about the severity of the condition, which is a downside of having only two categories.
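
The information lost by a binary encoding can be illustrated with a small sketch. It assumes a fixed threshold of 140/90 mmHg, a commonly used clinical cut-off. We do not know what rule, if any, was applied in the dataset, and the readings below are invented.

```python
def has_hypertension(systolic, diastolic, sys_limit=140, dia_limit=90):
    """Return 1 if either reading reaches its limit, else 0.

    The 140/90 mmHg default is an assumption; the dataset does not
    document how its hypertension flag was derived.
    """
    return int(systolic >= sys_limit or diastolic >= dia_limit)


# Invented readings (systolic, diastolic) in mmHg.
readings = [(118, 76), (142, 88), (185, 115)]
flags = [has_hypertension(s, d) for s, d in readings]
print(flags)  # [0, 1, 1]
```

The second and third readings end up in the same category even though they differ greatly in severity.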

In contrast, ```avg_glucose_level``` and ```bmi``` are given as floats, whereas they could have been encoded as categorical variables such as hyperglycemia (https://en.wikipedia.org/wiki/Hyperglycemia) and different stages of underweight/overweight, which differ based on age and gender.
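
As an illustration of such an encoding, the sketch below bins a BMI value using the standard WHO cut-offs for adults (18.5, 25 and 30). This is a simplification: it ignores the dependence on age and gender, and the function name and sample values are ours.

```python
from bisect import bisect_right

# WHO adult BMI cut-offs; these are not appropriate for children.
BMI_LIMITS = [18.5, 25.0, 30.0]
BMI_LABELS = ["underweight", "normal", "overweight", "obese"]


def bmi_category(bmi):
    """Map a BMI value to a WHO adult weight category."""
    return BMI_LABELS[bisect_right(BMI_LIMITS, bmi)]


print([bmi_category(b) for b in (17.0, 22.4, 25.0, 31.8)])
# ['underweight', 'normal', 'overweight', 'obese']
```

Keeping the float and deriving categories only when needed preserves more information than storing the category alone.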

As for ```stroke```, we have no information about what type of stroke the patient suffered from, as we have already mentioned. In addition, we do not have information about the severity of the stroke, or if the patient died directly or indirectly from the stroke.

For the categorical variables ```work_type```, ```ever_married```, ```Residence_type``` and ```smoking_status```, it is uncertain whether the chosen categories yield anything more than noise, and whether they are useful for predicting stroke. In particular, the categories are not particularly sharp. For example, we do not know the "boundary" between urban and rural residence, whether the patient is currently employed, or how long they have been in their current line of work. Smoking status is the only one of these variables that is known to be an important risk factor for stroke. It is therefore problematic that one of the ```smoking_status``` categories is ```unknown```. Keeping the data where smoking status is unknown adds more noise to our data, which can make fitted models less effective. We have a few options for handling this:

- We could remove all the data where smoking status is unknown, which should not be problematic if relatively few patients have unknown smoking status. 
- We could remove the smoking status variable altogether from all patients, which might be helpful if smoking status proves to be noisy.
- We could keep the data as is, and hope the added information of smoking status for some patients outweighs the noise of the patients where smoking status is unknown.

In this report we choose the last approach. 
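
The three options listed above can be sketched as follows, together with a helper for checking how large the unknown share is. The rows are invented for illustration, the function names are ours, and the label is spelled as in the variable list (the raw file may spell it differently).

```python
UNKNOWN = "unknown"  # label as listed above; the raw file may spell it differently


def unknown_share(rows):
    """Fraction of rows whose smoking status is unknown."""
    return sum(r["smoking_status"] == UNKNOWN for r in rows) / len(rows)


def drop_unknown_rows(rows):
    """Option 1: discard patients with unknown smoking status."""
    return [r for r in rows if r["smoking_status"] != UNKNOWN]


def drop_smoking_column(rows):
    """Option 2: remove the smoking_status variable for all patients."""
    return [{k: v for k, v in r.items() if k != "smoking_status"} for r in rows]


def keep_as_is(rows):
    """Option 3 (used in this report): keep `unknown` as its own category."""
    return rows


# Invented rows, for illustration only.
patients = [
    {"id": 1, "smoking_status": "never smoked", "stroke": 0},
    {"id": 2, "smoking_status": "unknown", "stroke": 0},
    {"id": 3, "smoking_status": "formerly smoked", "stroke": 1},
    {"id": 4, "smoking_status": "unknown", "stroke": 0},
]

print(unknown_share(patients))                   # 0.5
print(len(drop_unknown_rows(patients)))          # 2
print(sorted(drop_smoking_column(patients)[0]))  # ['id', 'stroke']
```

Option 1 shrinks the dataset in proportion to the unknown share, which is why that share is worth checking before choosing between the approaches.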

If anything, the problems with the data illustrate the importance of high-quality data collection.