# Exercise 1: Development Basics

Version 5.0, Summer Semester 2022

## Part 3: Pandas

## Your Name

Replace `raise NotImplementedError()` with `myname = ""` and put your name between the quotes:

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert myname != "", "myname should not be empty"

## 3.1: Imports and Data File

* Import the NumPy library under the alias `np`.
* Import pandas under the alias `pd`.

In [None]:
# Import numpy (np) and pandas (pd)
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Simply execute this cell for the remaining imports
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
%matplotlib inline

Next, use Pandas (`pd`) to load the CSV file we're going to use in this exercise. The Pandas function you need is [read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). Pass the URL of the dataset as its argument and store the resulting Pandas DataFrame in a variable called `df`.

The source URL is: `https://github.com/andijakl/MachineLearning/raw/main/lab%201%20-%20python%20numpy%20pandas/heart_disease_health_indicators_BRFSS2015.csv`

The file is around 22 MB, so downloading could take some time.
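For illustration only, here is how `read_csv()` behaves on a tiny in-memory CSV. The CSV text and the name `toy_df` are made up for this sketch; in the exercise, you pass the dataset URL instead of the `StringIO` object.

```python
import io

import pandas as pd

# Hypothetical CSV content, standing in for a file or URL
csv_text = "HighBP,BMI\n1.0,40.0\n0.0,25.0\n1.0,28.0\n"

# read_csv() accepts a path, a URL, or a file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple; the header line becomes the column names
print(toy_df.shape)
```

Checking `shape` right after loading is a cheap way to confirm that the whole file arrived.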

### Dataset info

Heart disease is a serious health risk and a leading cause of death. Several risk factors have been identified, including high blood pressure, high blood cholesterol and smoking. The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone survey conducted annually by the CDC. Each year, it collects responses from over 400,000 Americans.

This dataset contains the data from the year 2015. It was cleaned up from the original responses of 441,455 individuals with 330 features: incomplete responses have been dropped, and the features have been reduced to those considered most relevant to heart disease risk. The cleaned dataset contains 253,680 responses. Note: there is a strong class imbalance. 229,787 respondents have not had heart disease, while 23,893 have.

We want to answer the question: to what extent can a subset of survey responses from BRFSS be used to predict heart disease risk?

* Source: Alex Teboul, CC0 Public Domain: https://www.kaggle.com/alexteboul/heart-disease-health-indicators-dataset/metadata
* Cleaned up version of dataset from Centers for Disease Control and Prevention (CDC), CC0 Public Domain: https://www.kaggle.com/cdc/behavioral-risk-factor-surveillance-system
* Original dataset at CDC (including older and newer data): https://www.cdc.gov/brfss/annual_data/annual_data.htm

In [None]:
# Load dataset into variable df
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert df.shape == (253680, 22), "Imported data should have 253680 rows and 22 columns (including the target column)"

## 3.2: Explore the Dataset

Print the first five rows of the dataset using the `head()` function to get a quick glance at what the data looks like.
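As a small sketch on made-up data (the frame `toy_df` and its values are hypothetical), `head()` returns the first rows as a new DataFrame:

```python
import pandas as pd

# Hypothetical toy frame with ten rows
toy_df = pd.DataFrame(
    {"BMI": [22.0, 31.0, 27.0, 40.0, 25.0, 29.0, 24.0, 33.0, 26.0, 28.0]}
)

first_rows = toy_df.head()     # without an argument: the first 5 rows
first_three = toy_df.head(3)   # an explicit row count also works

print(first_rows)
```

In a notebook, leaving the call as the last expression of a cell renders the result as a table without `print()`.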

In [None]:
# Use the head() function to print the first 5 rows of the dataset
# YOUR CODE HERE
raise NotImplementedError()

Use the `describe()` function of the Pandas DataFrame to find out the count, mean, standard deviation and more for each column of the dataset.

In [None]:
# Call method to describe the dataset
# YOUR CODE HERE
raise NotImplementedError()

Based on the printed information about the dataset, answer a few questions and assign the numbers to the corresponding variables. Round your answers to the nearest integer.
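The following sketch shows how such a summary can be read, using a hypothetical four-row frame (the values are invented and unrelated to the real dataset). For a column that only contains 0 and 1, the mean equals the share of rows with the value 1.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame(
    {"BMI": [20.0, 25.0, 30.0, 45.0], "Smoker": [0.0, 1.0, 0.0, 0.0]}
)

summary = toy_df.describe()
print(summary)

# Individual statistics can be looked up by row label and column name
toy_mean_bmi = summary.loc["mean", "BMI"]
toy_max_bmi = summary.loc["max", "BMI"]
toy_smoker_share = summary.loc["mean", "Smoker"]  # share of 1s in a 0/1 column
```

A smoker share below 0.5 means that the rows with value 0 are in the majority.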

In [None]:
# What is the mean BMI?
mean_bmi = -1
# Are there more smokers than non-smokers in the dataset (1), based on the mean? Or are there more non-smokers than smokers (0)?
more_smokers = -1
# What is the maximum BMI recorded in the dataset?
max_bmi = -1
# Remove the exception below after you entered the rounded numbers for the variables
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert mean_bmi > -1
assert more_smokers > -1
assert max_bmi > -1

## 3.3: Plotting

Next, plot the histogram of the `BMI` column of the dataframe. Specify that the histogram function should split the data into `30` bins.
*Hint:* call the `hist()` function on the column you retrieve from the dataframe. [Read more about it](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html) in the documentation.
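A minimal sketch with an invented Series (`toy_bmi` is hypothetical, and the bin count of 4 is chosen only to keep the example small): calling `hist()` on a single column returns the matplotlib Axes object, which you can use to label the plot.

```python
import pandas as pd

# Hypothetical BMI-like values
toy_bmi = pd.Series([18.0, 22.0, 23.0, 27.0, 31.0, 35.0, 41.0, 52.0], name="BMI")

# bins controls into how many equal-width intervals the value range is split
ax = toy_bmi.hist(bins=4)
ax.set_xlabel("BMI")
ax.set_ylabel("Count")
```

More bins show finer detail but make each bar noisier, which matters less on a dataset as large as the one in this exercise.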

In [None]:
# Call method to plot histogram with 30 bins
# YOUR CODE HERE
raise NotImplementedError()

## 3.4: Counting Values

Next, use the `value_counts()` function of the `Stroke` column to see how many people have already had a stroke. Store the results in a new variable called `stroke_count`.
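To illustrate what `value_counts()` returns, here is a sketch on an invented Series (`toy_stroke` is hypothetical). The result is itself a Series whose index holds the distinct values and whose entries are the counts.

```python
import pandas as pd

# Hypothetical 0/1 answers
toy_stroke = pd.Series([0.0, 0.0, 1.0, 0.0, 1.0, 0.0], name="Stroke")

toy_counts = toy_stroke.value_counts()
print(toy_counts)

# Look up a count by the value it belongs to
no_stroke = toy_counts[0.0]
had_stroke = toy_counts[1.0]
```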

In [None]:
# Define variable stroke_count and store the count of how many persons already had a stroke or not
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert stroke_count.shape == (2,)
assert stroke_count[0.0] > 0
assert stroke_count[1.0] > 0

The dataset contains far more people who have not had a stroke than people who have. Now imagine that some of this data was missing (e.g., due to a data storage failure or because people didn't want to answer that question), and that you could not afford to simply drop all these people from your dataset. Instead, you'd like to fill the missing values in a way that stays in line with expectations and doesn't affect your outcome too much.

First, let's say we do not have the stroke answer of the first 100 patients. Use a slice to select the first 100 patients of the `Stroke` column, and again use the `value_counts()` function to see how many people are in each category for that part. Store the results in a variable `stroke_first_100`.
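A sketch of the same idea on an invented Series, using the first 3 rows instead of the first 100. It uses `.iloc` for an explicitly positional slice, which is one of several ways to select leading rows; `toy_stroke` is hypothetical.

```python
import pandas as pd

# Hypothetical 0/1 answers
toy_stroke = pd.Series([0.0, 1.0, 0.0, 0.0, 1.0, 0.0], name="Stroke")

# Positional slice: rows 0, 1 and 2 (the end position is exclusive)
first_three = toy_stroke.iloc[:3]
counts_first_three = first_three.value_counts()

print(counts_first_three)
```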

In [None]:
# Define variable stroke_first_100 and assign the value counts of the first 100 rows of the Stroke column
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert stroke_first_100.shape == (2,)
assert stroke_first_100[0.0] > 0
assert stroke_first_100[1.0] > 0

## 3.5: Missing Data

Now, let's delete the data from our dataset to explore how to fill in missing data as a next step. Assign `np.nan` to the first 100 rows of the `Stroke` column.
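One possible way to blank out leading rows is shown below on a hypothetical frame (`toy_df` and its values are invented). Note that label-based `.loc` slices include the end label, unlike positional slices, so it is worth counting the missing values afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the default 0..4 index
toy_df = pd.DataFrame(
    {"Stroke": [0.0, 1.0, 0.0, 0.0, 1.0], "BMI": [22.0, 31.0, 27.0, 40.0, 25.0]}
)

# .loc is label-based and inclusive: 0:1 selects the rows labelled 0 AND 1
toy_df.loc[0:1, "Stroke"] = np.nan

n_missing = toy_df["Stroke"].isna().sum()
print(n_missing)
```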

In [None]:
# Assign np.nan to the first 100 items in the Stroke column of the dataframe
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert np.isnan(df['Stroke'][0])
assert df['Stroke'][0:100].count() == 0.0

Next, let's see how deleting the data is reflected in our `value_counts`. Count the values in the `Stroke` column (this time in the whole column), and assign the result to a variable `stroke_count_nan`.

In [None]:
# Count values in the Stroke column of the dataframe
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert stroke_count_nan[0.0] == 243297
assert stroke_count_nan[1.0] == 10283

As you can see, the counts were reduced accordingly. However, `value_counts()` doesn't inform us that some values are missing. It simply ignores them. For training a classifier, this is important information, as we need to fix the data beforehand.
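The effect can be reproduced on a small invented Series (`toy_stroke` is hypothetical): the default counts do not add up to the number of rows, and the difference is exactly the number of missing entries reported by `isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical answers with two missing entries
toy_stroke = pd.Series([np.nan, 0.0, 1.0, 0.0, np.nan, 0.0])

counted = toy_stroke.value_counts().sum()  # NaN entries are skipped by default
missing = toy_stroke.isna().sum()          # number of NaN entries
total = len(toy_stroke)

print(counted, missing, total)
```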

There is a simple parameter you can add to the `value_counts()` function. Check the [Pandas documentation on value_counts()](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) for the parameter. Run the function again and assign the results to a variable `stroke_count_nan2`.

In [None]:
# Count number of strokes including nan values. Store the results in stroke_count_nan2
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert stroke_count_nan2[0.0] == 243297
assert stroke_count_nan2[1.0] == 10283
assert stroke_count_nan2[np.nan] == 100

## 3.6: Fill Missing Data

How do we solve the issue with the 100 missing data items about a previous stroke? According to the distribution, the safest bet is to simply set `0` for all missing values. This has by far the greatest chance of being the correct answer.

Use the `fillna()` function to set the value 0 for all missing values of the `Stroke` column. As we're now dealing with a very large dataframe, we don't want to create a copy. Instead, the filling function should directly modify the original data. Set the parameter `inplace=True` for the function to do that.
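A minimal sketch of in-place filling on a hypothetical frame (`toy_df` is invented). It calls `fillna()` on the DataFrame with a dict that names the column, which is an assumption about style rather than the required solution; newer pandas versions may warn when an `inplace` method is called on a column selected via `df['...']`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing Stroke answers
toy_df = pd.DataFrame(
    {"Stroke": [np.nan, 0.0, 1.0, np.nan], "BMI": [22.0, 31.0, 27.0, 40.0]}
)

# The dict restricts the fill to the named column;
# inplace=True modifies toy_df itself instead of returning a copy
toy_df.fillna({"Stroke": 0}, inplace=True)

print(toy_df["Stroke"].value_counts())
```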

In [None]:
# Fill the missing stroke values with 0 and use inplace filling
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert df['Stroke'].value_counts()[0.0] == 243397
assert df['Stroke'].value_counts()[1.0] == 10283

## 3.7: Missing Data: Median

In other situations, a good strategy could be to replace missing data with the mean or median of the other samples. Let's test this approach with the body mass index (BMI).

First, replace the values in rows 500 to 599 (100 rows in total) of the `BMI` column with `np.nan`.

In [None]:
# Replace values in rows 500 to 599 (100 rows) of column BMI with np.nan
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert np.isnan(df['BMI'][500])
assert df['BMI'][500:600].count() == 0.0
assert df['BMI'].isna().sum() == 100

Next, use the `median()` function of the dataframe column to calculate the median of the remaining BMI values. Store the result in a new variable called `bmi_median`.

In [None]:
# Store the median of the BMI column in a variable called bmi_median
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert bmi_median == 27.0

Finally, similar to before, replace all missing values in the BMI column with the median from the remaining rows. Again, use the `inplace` variant of the function to directly modify the data we work with.
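The median-based fill can be sketched on invented numbers as follows (`toy_df` and `toy_median` are hypothetical names). `median()` skips missing values by default, so it can be computed before the gaps are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values with two gaps
toy_df = pd.DataFrame({"BMI": [20.0, np.nan, 30.0, 26.0, np.nan, 40.0]})

# Median of the remaining values 20, 26, 30, 40
toy_median = toy_df["BMI"].median()

# Fill only the BMI column, modifying toy_df directly
toy_df.fillna({"BMI": toy_median}, inplace=True)

print(toy_median)
print(toy_df["BMI"].tolist())
```

Compared to the mean, the median is less sensitive to a few extreme values, which is why it is often preferred for skewed measurements.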

In [None]:
# Replace the missing BMI values with the median
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert not(np.isnan(df['BMI'][500]))
assert df['BMI'][500:600].count() > 0.0
assert df['BMI'].isna().sum() == 0

## 3.8: First Classification: Data Preparation

Now that we have the data loaded and prepared, let's run a quick test of how well a decision tree classifier can predict heart disease based on the data we have about the patients.

First, we need a quick additional preparation step: most machine learning libraries need the data and the labels (target classes) in two separate arrays. Simply execute the following line. It will take the `HeartDiseaseorAttack` column (which contains the labels) out of the dataframe and convert it to a numpy array for further use.
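To see what `pop()` does before running it on the real data, here is a sketch on a hypothetical two-column frame (`toy_df`, `Label` and `toy_y` are invented names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one feature column and one label column
toy_df = pd.DataFrame({"HighBP": [1.0, 0.0, 1.0], "Label": [0.0, 0.0, 1.0]})

# pop() removes the column from the frame and returns it as a Series
toy_y = toy_df.pop("Label").to_numpy()

print(list(toy_df.columns))  # only the feature column is left
print(toy_y)
```

Because the column is removed, a second call with the same name raises a `KeyError`.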

In [None]:
# Simply execute this line to extract the target label from the dataframe
# Note: you can only execute this line once. If you execute this twice, it will fail
# as the column has already been removed from the dataframe. Restart the kernel and run
# it again from the top if needed.
y = df.pop('HeartDiseaseorAttack').to_numpy()

In [None]:
assert df.shape == (253680, 21)
assert y.shape == (253680,)
assert type(y) == np.ndarray

The second step: as you know, it's important to train the classifier only on a part of the available data. The rest should be withheld to get a more reliable estimate of the quality of the classifier when you test it on previously unseen data.

As such, you need to split our large dataset into two parts: training data and test data. As the labels are in a separate array, the same split needs to be applied to the labels, to ensure these are still in the same order and correspond.

The `train_test_split()` function of scikit-learn can do all these tasks in one step. Take a look at the [example from the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). Set up the function accordingly.

Pass both the dataframe and the `y` label array to the function. Use a test size of `0.2`. Also supply a `random_state` of `10`. The function will then return 4 separate arrays. Follow the example in the documentation to see how to assign them to four variables.
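The sketch below shows the mechanics on ten invented rows, with a different test size and random state than the exercise asks for. All names prefixed with `toy_` or `t` are hypothetical. The check at the end confirms that features and labels stay aligned after the shuffle.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: the label is 1 for odd values of f1
toy_X = pd.DataFrame({"f1": range(10), "f2": range(10, 20)})
toy_y = np.array([0, 1] * 5)

# Return order: train features, test features, train labels, test labels
tX_train, tX_test, ty_train, ty_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0
)

print(tX_train.shape, tX_test.shape, ty_train.shape, ty_test.shape)

# Rows and labels were shuffled together, so they still correspond
still_aligned = bool((tX_test["f1"].to_numpy() % 2 == ty_test).all())
```

A fixed `random_state` makes the shuffle reproducible, which is what allows the exercise to assert on specific rows.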

In [None]:
# Split the data and labels into train and test data, using a test size of 0.2 and a random state of 10
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert X_train.shape == (202944, 21)
assert X_test.shape == (50736, 21)
assert y_train.shape == (202944, )
assert y_test.shape == (50736, )
assert len(X_train['HighBP']) == 202944
assert y_test[0] == 1.0
assert y_train[100] == 0.0
assert X_train['HighBP'][94025] == 1.0

## 3.9: Decision Tree Classifier

Now, the data is fully prepared. The remaining steps are quite short. We need to create the classifier and fit it on the training data.

For this, we will use the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). Create the object and assign it to a new variable called `clf`. Supply `min_samples_split=30` as parameter to limit the complexity of the classifier.

In [None]:
# Create a decision tree classifier with a min sample split of 30. Store it in a variable called clf
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert type(clf) == DecisionTreeClassifier
assert clf.min_samples_split == 30

The most complex part of this notebook is actually training the classifier. But for you, that's probably the easiest line. Use the `fit` function of the `clf`. Supply both the training data (`X_train`) as well as the training labels (`y_train`). Depending on your computer speed, this step might take a few seconds. You do not need to assign the result to a variable; the classifier will simply train itself and keep the machine learning model it built for further use.
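As a sketch of the create-then-fit pattern, the following trains a tree on ten invented samples with a single feature (`toy_X`, `toy_y` and `toy_clf` are hypothetical, and the parameters differ from the ones the exercise asks for):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the label is 1 when the feature is larger than 4
toy_X = np.arange(10).reshape(-1, 1)
toy_y = (toy_X.ravel() > 4).astype(int)

toy_clf = DecisionTreeClassifier(min_samples_split=2, random_state=0)

# fit() trains in place; the learned tree is stored in toy_clf.tree_
toy_clf.fit(toy_X, toy_y)

print(toy_clf.tree_.node_count)   # one split node plus two leaves
print(toy_clf.tree_.threshold[0]) # threshold of the root split
toy_pred = toy_clf.predict([[2], [7]])
```

A larger `min_samples_split` stops the tree from splitting small groups of samples, which keeps it smaller and less prone to overfitting.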

In [None]:
# Let the classifier build its model based on the training data and labels, using the fit() function.
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert clf.tree_.node_count > 1000
assert clf.tree_.children_left.shape[0] > 1000
assert clf.tree_.feature[0] > -1
assert clf.tree_.threshold[0] > 0

Let's visualize the beginning of the tree. This code is pre-defined; simply execute the following cell to draw the top levels of the tree (limited by `max_depth=3`), using the real feature names from the dataframe columns.

In [None]:
# Simply execute this block to plot the top of the decision tree
plt.figure(figsize=(15,12))  # set plot size (denoted in inches)
tree.plot_tree(clf, fontsize=10, max_depth=3, feature_names=list(df.columns))  # plain list works across scikit-learn versions
plt.show()

When looking at the tree, you will see that, for example, Age is usually quite an important criterion, with comparison values like `Age <= 9.5`. To understand more about the data, take a look at the dataset description linked above, as well as the [notebook that was used to clean](https://www.kaggle.com/alexteboul/heart-disease-health-indicators-dataset-notebook) the original questionnaire data. You will then see that age starts at `1` (indicating a range of 18-24) and then goes up in 5-year increments until `13` (80 years or older). Think about it: what does the decision `Age <= 9.5` mean in that case?

The last task: let's see how well our classifier performs on previously unseen data. We have split our complete dataset, so we can now use the test data for an independent evaluation.

To do this, call the [score()](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decisiontreeclassifier#sklearn.tree.DecisionTreeClassifier.score) function of the classifier. Supply the test data and the corresponding labels. Store the resulting mean accuracy in a variable called `clf_accuracy` and print it.
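What `score()` computes can be checked by hand on a tiny invented example (all `toy_` names and values are hypothetical): for a classifier, it is the fraction of test samples whose predicted label matches the true label.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data with a gap between the two classes
toy_X_train = np.array([[0], [1], [2], [3], [6], [7], [8], [9]])
toy_y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
toy_X_test = np.array([[1], [4], [5], [8]])
toy_y_test = np.array([0, 1, 1, 1])

toy_clf = DecisionTreeClassifier(random_state=0).fit(toy_X_train, toy_y_train)

# Mean accuracy as reported by the classifier
toy_accuracy = toy_clf.score(toy_X_test, toy_y_test)

# The same number computed manually from the predictions
manual_accuracy = np.mean(toy_clf.predict(toy_X_test) == toy_y_test)

print(toy_accuracy, manual_accuracy)
```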

In [None]:
# Calculate the mean accuracy of the classifier model with the test data.
# Store the results in a new variable clf_accuracy
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert clf_accuracy > 0.8

To really judge the quality of the results, you'd need to read more about the dataset and calculate additional metrics. But as a first shot using a rather simple decision tree classifier, getting almost 90% accuracy based on questionnaire data is definitely a nice first result!
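One simple reference point for such additional metrics can be derived from the class counts stated in the dataset info alone: the accuracy of a hypothetical model that always predicts the majority class. The variable names below are made up for this sketch.

```python
# Class counts as stated in the dataset info
no_disease = 229_787
disease = 23_893
total = no_disease + disease

# Accuracy of always predicting "no heart disease"
majority_baseline = no_disease / total

print(total)
print(f"{majority_baseline:.4f}")
```

Comparing `clf_accuracy` against this baseline gives a first impression of how much the classifier has learned beyond the class distribution.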