# Lab 2 Classification and Regression

Before we get started, we need to load the relevant Python modules that are used in the code later. Do this by running the next code cell.

In [None]:
import mkitlabs.lab2
%matplotlib inline

To complete this lab:
1. Follow the instructions, running the code when asked.
2. Discuss each question in your group.
3. Keep notes for your answers to the questions in a separate MS Word document (you can use [this template](Lab2_answers_template.docx)).
4. When completed, briefly discuss your answers with the Lecturer/Teaching Assistant attending your lab. You **do not** need to submit your answers to Studentportalen.

In this lab, we will work with classification and regression. 

## Linear Regression

If we have a dataset with two variables that depend on each other, then with the help of linear regression we can make a predictive model. We try to model the relationship between a dependent variable and one or more independent variables. We will use a dataset that describes the heights and weights of men and women.

To read the dataset, run the following code cell:

In [None]:
mkitlabs.lab2.view_body_stats_table()

This dataset is actually in Imperial units, with height measured in inches and weight measured in pounds. Let's first change this to metric values! Run the following cell, which will do just that:

In [None]:
mkitlabs.lab2.fix_body_stats_units()
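The conversion itself is simple arithmetic. As a minimal sketch, assuming the metric columns are in centimetres and kilograms (the helper names below are hypothetical and not part of `mkitlabs`):

```python
# Conversion factors (both are exact by definition).
CM_PER_INCH = 2.54
KG_PER_POUND = 0.45359237


def inches_to_cm(inches):
    """Convert a length in inches to centimetres."""
    return inches * CM_PER_INCH


def pounds_to_kg(pounds):
    """Convert a weight in pounds to kilograms."""
    return pounds * KG_PER_POUND


# Illustrative values only, not taken from the dataset.
height_cm = inches_to_cm(70.0)
weight_kg = pounds_to_kg(180.0)
print(f"{height_cm:.1f} cm, {weight_kg:.1f} kg")
```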

**Question 1.** How many rows and columns are in the dataset?

Now that we can see the shape of the dataset and what the data points look like, let's get a feel for how the data looks by visualizing it. Run the next code cell to create a scatterplot of the data:

In [None]:
mkitlabs.lab2.plot_body_stats_scatter()

**Question 2.** Can you distinguish between the male and female groups in the dataset? Explain your observations.

One way that we can distinguish between distinct groups that we know about is to colour them differently in our plots. Run the next cell to create the same scatterplot, but with the male and female data shown as separate subsets. Use the dropdown selection box to select an appropriate colour for each subset of data to view it more clearly.

In [None]:
mkitlabs.lab2.plot_body_stats_scatter_by_colour()

Now let's look at each category of data in isolation.

Adjust the values with the sliders to try and fit a line that best fits the relationship between heights and weights:

In [None]:
mkitlabs.lab2.plot_interactive_body_stats_scatter(category='Male', colour='blue')

In [None]:
mkitlabs.lab2.plot_interactive_body_stats_scatter(category='Female', colour='pink')

**Question 3.** How easy was it to manually fit the linear relationship to the data? Why was it difficult to do?

Actually we do not need to fit a line manually! There are statistical methods that can fit a line to some data. We can use *linear regression*.

The reason we use linear regression is to find the line that lies as close as possible to all the points. This allows us to enter, for example, a weight and then infer the height (make a prediction). As you can see in the scatterplots, the data is quite spread out, and it is actually quite difficult to manually fit the right line $y = ax + b$ to represent the linear relationship.

Run the following code cell to automatically calculate the slope and intercept using a linear regression algorithm:

In [None]:
mkitlabs.lab2.linear_regression(category='Male')

You can take these values and go back to the interactive plot to see how well you can fit the line again, based on the linear regression.
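For reference, the calculation behind such a fit can be sketched with ordinary least squares, which for a single input variable has a closed form: the slope is the sum of cross-deviations divided by the sum of squared deviations in $x$, and the intercept follows from the means. The data below is made up, `fit_line` is a hypothetical helper, and whether `mkitlabs` uses exactly this method is an assumption.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up weights (kg) and heights (cm), for illustration only.
weights = [60.0, 70.0, 80.0, 90.0]
heights = [165.0, 172.0, 178.0, 185.0]
slope, intercept = fit_line(weights, heights)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```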

To see how well the linear model we trained on the male weight and height data fits, we can plot the linear relationship over the scatterplot we created earlier, using the slope and intercept we got from our linear model. Run the following code cell to plot the scatterplot with our linear relationship overlaid on top:

In [None]:
mkitlabs.lab2.plot_linear_regression_on_scatter(category='Male', colour='blue')

Run the next code cell to do the same for the female weight and height data:

In [None]:
mkitlabs.lab2.plot_linear_regression_on_scatter(category='Female', colour='pink')

**Question 4.** Compare the slope and intercept values for the male and female data. What might the slope (gradient) of the line tell us about the relationship between the height and weight variables? What might the intercept tell us about the data?

Apart from plotting the fitted linear relationship over the scatterplot, our linear regression model also provides us with the possibility to make predictions. We can take an input variable and output a prediction. For example, assuming a fitted linear model that calculates the slope as 0.7 and intercept as 120:

$y = ax + b$

becomes

$y = 0.7x + 120$

where $x$ is our weight. Then we can calculate the predicted height value, $y$.
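As a small sketch of this calculation, using the example slope and intercept from the text (the function name `predict_height` and the input weight of 70 are arbitrary choices for illustration):

```python
def predict_height(weight, slope, intercept):
    """Apply the fitted line y = a*x + b to a single weight."""
    return slope * weight + intercept


# Slope 0.7 and intercept 120 are the example values from the text above.
predicted = predict_height(70.0, slope=0.7, intercept=120.0)
print(round(predicted, 2))
```

The same function works for any fitted line: only the slope and intercept change.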

**Question 5.** Based on your linear models for male and female data (i.e. the slopes and intercepts calculated above), what are the predicted heights for a male weighing 80.6 kg (the average weight of a man in Sweden) and a female weighing 64.7 kg (the average weight of a woman in Sweden)?

Provide your answers to a minimum of 2 decimal places and document how you calculated each answer.

## Classification

Next, we will explore classification by looking at a dataset relating to the demographics of the survivors of the Titanic disaster. If you are not familiar with the Titanic, you could watch the film but it is more than 3 hours long, so do not watch it during this lab. In this lab we will use a decision tree as a predictive model to predict if a person with particular features might have lived or died on the Titanic.

First, we load the dataset:

In [None]:
mkitlabs.lab2.view_titanic_table()

### Data dictionary

- `pclass` - Passenger class
- `survived` - Whether the person survived or not (0 = No, 1 = Yes)
- `name` - Name of the passenger
- `sex` - Gender of the person (male or female)
- `age` - Age in years
- `sibsp` - # of siblings/spouses aboard the Titanic
- `parch` - # of parents/children aboard the Titanic
- `ticket` - Ticket ID number
- `fare` - Passenger fare
- `cabin` - Cabin number
- `embarked` - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)


**Question 5.** According to this dataset, how many passengers were on board the Titanic?

### Preliminary exploration

Before we build our classifier, let us explore the individual features that might affect the survival outcome. We can extract the number of survivors according to this data by looking at the `survived` column. Note that for each record a survivor is represented with a 1 and a non-survivor is represented with a 0 (zero). This means we can simply retrieve the count of each category from the `survived` column using the `value_counts()` method, as follows:

In [None]:
mkitlabs.lab2.titanic_value_counts(column='survived')
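To illustrate what `value_counts()` does, here is a minimal sketch on a tiny made-up table. It assumes the helper above wraps the pandas method of the same name; the `toy` table is not the lab dataset.

```python
import pandas as pd

# Tiny made-up table; the real dataset is loaded by mkitlabs.
toy = pd.DataFrame({"survived": [0, 1, 0, 0, 1]})

# Count how many rows fall into each category of the column.
counts = toy["survived"].value_counts()
print(counts)

# Because survivors are coded as 1, summing the column also counts them.
n_survivors = int(toy["survived"].sum())
print(n_survivors)
```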

We can additionally plot this quite easily:

In [None]:
mkitlabs.lab2.plot_titanic_value_counts(column='survived')

**Question 6.** According to this dataset, what percentage of passengers perished when the Titanic sank? Give your answer to the nearest whole percentage.

**Question 7.** If we were passengers on the Titanic, would we have been more likely than not to have died? Explain your reasoning.

### Analysis based on variables

In this section we will look more specifically at how the outcome relates to a single column at a time.

#### Gender

Let's look at the distribution based on gender:

In [None]:
mkitlabs.lab2.titanic_value_counts(column='sex')

In [None]:
mkitlabs.lab2.plot_titanic_value_counts(column='sex')

We can see that in general there were almost twice as many men as women. Let's check the distribution of survival rates based on gender:

In [None]:
mkitlabs.lab2.plot_titanic_survived_by_gender()

**Question 8.** What do you observe in the data, and why do you think the distribution looks the way it does? Is it possible to reason about why this is the case from the data alone?

#### Age

Let's look at the distribution based on age:

In [None]:
mkitlabs.lab2.plot_titanic_survived_by_age()

**Question 9.** As you can see, there are quite a few values that make the plot impossible to read. Why might plotting this distribution be problematic?

To be able to do a better analysis, we can apply some age categorisations to put the ages into "bins". For example, we can add a column in the dataset that categorises a passenger as a child or not.

Run the next code cell to add a column to the original dataset, where we define a child as being a person under a certain age (in years). You can explore how changing the age limit affects the labelling of children and adults in the dataset:

In [None]:
mkitlabs.lab2.view_interactive_titanic_table()
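The labelling step amounts to comparing each age against a threshold. A minimal sketch, assuming a pandas table and a hypothetical column name `is_child` (the age limit of 18 is also just an example value):

```python
import pandas as pd

# Made-up ages for illustration only.
toy = pd.DataFrame({"age": [4.0, 17.0, 18.0, 35.0, 62.0]})

AGE_LIMIT = 18  # assumed threshold; the interactive cell lets you vary it

# True for passengers strictly under the age limit, False otherwise.
toy["is_child"] = toy["age"] < AGE_LIMIT
print(toy)
```

Note that a passenger exactly at the limit counts as an adult here; moving the limit changes which rows flip from one bin to the other.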

Now we can plot the proportion of adults and children who survived. You can explore how changing the age limit affects the labelling of children and adults and the survival rates:

In [None]:
mkitlabs.lab2.view_interactive_titanic_bar()

**Question 10.** What does the plot showing the proportion of adults and children who survived indicate?

We could continue exploring all of the different features in the dataset like this, but instead we will do something a bit more interesting by building a classifier using decision trees.

### Decision Trees

So far we have tried to identify what might have been crucial for survival. However, there are other ways to identify these features. What we are going to do is build a decision tree model, which we can then use to make predictions.

First of all, we need to do some data cleaning in order to build our model correctly. We can see which columns have missing (null) cells:

In [None]:
mkitlabs.lab2.view_titanic_table().info()

From the output, we can see that there is actually quite a lot of data missing. However, we will ignore some of the incomplete columns in our analysis and fix the ones we need to keep.

**Question 11.** Which columns are identified as having missing (null) values?
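Besides reading the non-null counts from `info()`, missing values can be counted per column directly. A minimal sketch on a made-up table, assuming the data is held in a pandas `DataFrame`:

```python
import pandas as pd

# Made-up table with a few gaps; None becomes a missing value.
toy = pd.DataFrame({
    "age": [22.0, None, 30.0],
    "fare": [7.25, 71.28, None],
    "sex": ["male", "female", "female"],
})

# isna() marks each missing cell; summing per column counts them.
missing = toy.isna().sum()
print(missing)
```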

In this case, we need to fix our `age` and `fare` columns. We can do that by filling the null values with the mean (or median) of the non-null values in the column. This is called **mean or median imputation**.

> "*Mean or median imputation* is the replacement of a missing observation with the mean or median of the non-missing observations for that variable."

Imputing the mean preserves the mean of the original data. If the data is missing completely at random, this ensures that the estimate of the mean remains unbiased. Imputing the mean also preserves the full sample size; otherwise, one may have to drop the rows with missing data. To understand this, let's compare the data in the `age` column before and after imputation by looking at the summary statistics:

In [None]:
mkitlabs.lab2.titanic_impute_numeric(column='age')

**Question 12.** What do you notice about the total record count (indicated by `count`), the mean (indicated by `mean`) and standard deviation (indicated by `std`)?

**Question 13.** In what cases would we want to impute the median instead of imputing the mean?
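For reference, mean imputation can be sketched in a few lines of pandas. This is an illustration on a made-up series, not the code behind `titanic_impute_numeric`, whose implementation is not shown in this lab.

```python
import pandas as pd

# Made-up ages with two missing entries.
ages = pd.Series([22.0, None, 30.0, None, 38.0], name="age")

# mean() skips missing values by default.
mean_age = ages.mean()

# Replace every missing entry with the mean of the observed values.
imputed = ages.fillna(mean_age)

# Summary statistics before and after imputation.
print(ages.describe())
print(imputed.describe())
```

Swapping `ages.mean()` for `ages.median()` gives median imputation instead.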

For a decision tree classifier to deal with categorical data, such as `sex` (male or female), we need to encode it as numerical data. We can create a new column, `sex_male`, that is encoded as 0s and 1s: 0 for "not male" (i.e. female) and 1 for male.

Run the following code cell to see how this new column looks:

In [None]:
mkitlabs.lab2.view_titanic_sex_male()
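One way such an encoding can be produced is by comparing the column against the string `"male"` and converting the result to integers. A minimal sketch on made-up rows, assuming a pandas table (how `mkitlabs` builds the column internally is not shown here):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

# The comparison yields True/False; astype(int) turns that into 1/0.
toy["sex_male"] = (toy["sex"] == "male").astype(int)
print(toy)
```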

We also impute the `fare` column, since there is one value missing there.

Now we can drop the parts of the table we are no longer interested in.

Run the following code cell to see the final table that we will use to train our decision tree classifier model:

In [None]:
mkitlabs.lab2.view_titanic_final_table()

In [None]:
mkitlabs.lab2.view_titanic_final_table().info()

**Question 14.** What do you notice about the numbers of non-null values in the final table? What do you notice about the data types of the values in all of the columns of the table?

In order to train our model, we need to split the data into a training dataset and a test dataset. This means that we can train our classifier on the training dataset and then validate it after training using the test dataset, which was kept separate during the training process.

We first separate the data into two parts: (1) the full set of features and (2) the target labels (survived or not). We then split both 75% (training) / 25% (test), since we wish to use as much data as possible to train the classifier but preserve enough data to validate the classifier after training.

In [None]:
mkitlabs.lab2.interact_split_datasets()
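A 75% / 25% split like this is commonly done with scikit-learn's `train_test_split`. The sketch below uses a made-up table and assumes scikit-learn is available; whether `mkitlabs` uses this function internally is an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up table with two features and the target label.
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 4.0, 58.0, 30.0, 27.0],
    "sex_male": [1, 0, 0, 1, 1, 0, 1, 0],
    "survived": [0, 1, 1, 0, 1, 1, 0, 1],
})

X = toy[["age", "sex_male"]]  # features
y = toy["survived"]           # target labels

# Hold back 25% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(len(X_train), len(X_test))
```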

We then train our decision tree classifier model on the training data to fit the features found in the training dataset. We can vary the depth of the generated tree as well.

Run the next code cell to create a decision tree classifier and to view its performance against the test dataset:

In [None]:
mkitlabs.lab2.titantic_decision_tree()

**Question 15.** Inspect the table created above that contains our predictions output by the decision tree classifier. How well do you think it has performed?
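For reference, training and scoring a decision tree can be sketched with scikit-learn as follows. The data is made up, the `max_depth` value is an arbitrary example of limiting the tree depth, and the use of scikit-learn inside `mkitlabs` is an assumption.

```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Made-up training rows: each row is [age, sex_male].
X_train = [[22, 1], [38, 0], [26, 0], [35, 1], [4, 1], [58, 0]]
y_train = [0, 1, 1, 0, 1, 1]

# Made-up test rows that were kept out of training.
X_test = [[30, 1], [27, 0]]
y_test = [0, 1]

# max_depth limits how many levels of questions the tree may ask.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

# Predict labels for the unseen rows and compare against the true labels.
predictions = clf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(predictions, accuracy)
```

Accuracy is the fraction of test rows whose predicted label matches the true label, so it is one simple way to judge how well a classifier has performed.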

Even though we can see the effects by running some data through the classifier to view the predicted outcomes, we cannot see how the decision tree model makes its decisions. To understand this, we can visualize the model we trained.

Run the following cell to visualize our decision tree classifier. Use the slider to adjust the training dataset and test dataset sizes (try pushing them to extreme values to see bigger effects on the decision tree):

In [None]:
mkitlabs.lab2.plot_decision_tree()

**Question 16.** Take a look at the decision tree that was created. What do you think about how the decision tree has logically rationalised the selection of survivors? How do you think varying the split between training and test data will affect the performance of the decision tree?