# S05 - In-Class/After-Class Exercises: Data Preprocessing and Transformation
---
## Instructions
For each exercise, you have a code cell for the response underneath it, where you should write your answer between the lines containing `### start your code here ###` and `### end your code here ###`. Your code can contain one or more lines and you can execute this cell in order to complete the exercise. To execute the cell, you can type `Shift+Enter` or press the play button in the toolbar above. Your results will appear right below this response cell.

### Importing data
In this exercise, we will explore an adapted data set that provides information on the number of patients waiting, and the length of time they have been waiting at the end of each quarter, for Inpatient and Day Case admissions at Health and Social Care (HSC) Trusts in Northern Ireland. Data are presented by HSC Trust, specialty, programme of care and time band. The original data can be accessed at [this page](https://data.world/datagov-uk/a593a0b3-29ef-48f2-b2b2-ceb83d841a3c).

This is a description of the columns of our adapted data in the file `day-case-waiting-times.csv`.

| VARIABLE NAME | DESCRIPTION | 
|:----|:----|
|quarter_ending| report date for each quarter|
|HSC_trust| Health and Social Care (HSC) Trusts|
|specialty| specialty of the HSC (e.g., Urology, General Surgery, Plastic Surgery, etc.) |
|program| program of care (e.g., mental health, acute services)|
|0-6_weeks|number of patients who wait for a period between (0, 6] weeks in the corresponding quarter|
|>6-13_weeks|number of patients who wait for a period between (6, 13] weeks in the corresponding quarter|
|>13-21_weeks|number of patients who wait for a period between (13, 21] weeks in the corresponding quarter|
|>21-26_weeks|number of patients who wait for a period between (21, 26] weeks in the corresponding quarter|
|>26-52_weeks|number of patients who wait for a period between (26, 52] weeks in the corresponding quarter|
|>52_weeks|number of patients who wait for a period greater than 52 weeks in the corresponding quarter|
|>26-30_weeks|number of patients who wait for a period between (26, 30] weeks in the corresponding quarter|

Import the data file `day-case-waiting-times.csv` into a `DataFrame` named `df_WT`. Display the first 5 rows of your `DataFrame`.

**Hint**: you can use the `pandas.read_csv()` function.

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/acedesci/scanalytics/master/S05_Data_Preprocessing/day-case-waiting-times.csv'
df_WT = pd.read_csv(url)  # reading data file into a DataFrame
df_WT.head()

---

## Preprocessing Data and Missing Values
### Exercise 1: Changing the data types of columns
Take a look at the data types in your `DataFrame`. The column `quarter_ending` should be of type `datetime64`, and all the columns from `0-6_weeks` to `>26-30_weeks` should be numeric (float or int), as they represent the number of patients whose waiting times fall within the corresponding intervals. Note that a column containing `NaN` values must be of a float type, because NumPy integer types cannot represent `NaN`.

Are your columns of the correct type? If not, convert the data into the correct format. 

**Hint1:** You can use the functions `pandas.to_numeric()`, `pandas.to_datetime()`, and/or `DataFrame.astype()`.

**Hint2:** We can make sure that (i) non-numeric values are set to `NaN` by using the argument `errors='coerce'`, and (ii) the numeric type is float by using the argument `downcast='float'`, as in `pandas.to_numeric(column_series, errors='coerce', downcast='float')`.

More details can be found in the [`pandas.to_numeric` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html).
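As an illustration of the conversions described in the hints, here is a minimal sketch on a small toy `DataFrame`. The column names `report_date` and `patients` and all values are made up for this example and do not come from the exercise file.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the exercise file)
toy = pd.DataFrame({
    "report_date": ["2015-03-31", "2015-06-30", "2015-09-30"],
    "patients": ["12", "unknown", "7"],
})

# Parse the date strings into datetime64 values
toy["report_date"] = pd.to_datetime(toy["report_date"])

# Convert strings to numbers; anything unparseable (here "unknown") becomes NaN,
# and downcast='float' yields a float dtype
toy["patients"] = pd.to_numeric(toy["patients"], errors="coerce", downcast="float")

print(toy.dtypes)
```

The same pattern can be applied column by column, or to several columns at once with `DataFrame.apply()`.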

In [None]:
df_WT.dtypes

In [None]:
### start your code here ###

### end your code here ###
df_WT.dtypes

### Exercise 2:  Missing values
Take a look at the missing values in your `DataFrame`. Implement a line of code which shows the total number of missing values in each column.

In [None]:
df_WT.isna().sum()

Let's assume that a value is missing when no data were available, which we interpret as no patients waiting. Under this assumption, a missing value in any of the waiting-time interval columns of our `DataFrame` corresponds to 0 patients, so we can replace the missing numeric values with 0.

**Hint:** you can use the function `DataFrame.fillna()` to replace missing values of the last 7 columns. Make sure the changes are applied/saved to your `DataFrame`.
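The idea can be sketched on toy data as follows. The column names `short_wait` and `long_wait` are hypothetical placeholders; the point is that the result of `fillna()` must be assigned back for the change to persist.

```python
import numpy as np
import pandas as pd

# Toy data with missing counts (hypothetical values)
toy = pd.DataFrame({
    "specialty": ["A", "B", "C"],
    "short_wait": [5.0, np.nan, 2.0],
    "long_wait": [np.nan, 1.0, np.nan],
})

# Replace NaN with 0 only in the selected numeric columns,
# and assign the result back so the change is saved
count_cols = ["short_wait", "long_wait"]
toy[count_cols] = toy[count_cols].fillna(0)

print(toy.isna().sum())
```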

In [None]:
### start your code here ###

###  end your code here ###
df_WT

### Exercise 3: Aggregating data 1
Execute the code block below. As you can see, the `DataFrame` has a column named `>26-52_weeks` and another column named `>26-30_weeks`. Because a given record may have been entered under either of these columns, we will:

1.   take the maximum of the values in these two columns,
2.   put the resulting value in the column `>26-52_weeks`, and
3.   remove the column `>26-30_weeks`.


**Hint:** you can use the function `DataFrame.drop(columns='...')` to remove the selected column. See this [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)
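The three steps above can be sketched on toy data like this. The column names `wide_band` and `narrow_band` are hypothetical stand-ins for the two overlapping columns.

```python
import pandas as pd

# Toy data: each row has its count recorded in one column or the other
toy = pd.DataFrame({
    "wide_band": [0.0, 4.0, 3.0],
    "narrow_band": [2.0, 0.0, 3.0],
})

# Row-wise maximum of the two columns, stored in the column we keep
toy["wide_band"] = toy[["wide_band", "narrow_band"]].max(axis=1)

# Remove the column that is no longer needed
toy = toy.drop(columns="narrow_band")

print(toy)
```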

In [None]:
### start your code here ###
  
### end your code here ###
df_WT

### Exercise 4: Aggregating data 2
Compute the **total number of patients**, add it to the `DataFrame` as a new column named `'total_patients'`, and **plot a histogram** of this column. You can use the function `df.sum(axis=1)` to sum across the columns of each row. Please make sure you sum only the last six columns, which contain the numbers of patients.
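A minimal sketch of a row-wise sum over a subset of columns, using a toy `DataFrame` with hypothetical column names (`band_1` to `band_3`), might look like this:

```python
import pandas as pd

# Toy data: one label column followed by three count columns (hypothetical)
toy = pd.DataFrame({
    "trust": ["X", "Y"],
    "band_1": [3, 1],
    "band_2": [2, 0],
    "band_3": [0, 6],
})

# Select the last three columns by position and sum across each row
toy["total"] = toy.iloc[:, -3:].sum(axis=1)

print(toy)
```

A histogram of the new column can then be drawn with `toy["total"].hist()`, which relies on matplotlib being installed.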

In [None]:
### start your code here ###

### end your code here ###

## Data Transformation

### Exercise 5: Skewness
Please calculate the skewness of this new `total_patients` column where:
$Skewness = \frac{3(X_{mean}-X_{median})}{\sigma_X}$

**Hint:** you can make use of the `DataFrame` functions `DataFrame.mean()`, `DataFrame.median()` and `DataFrame.std()` for your calculations.
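To show how the formula translates into code, here is a small sketch on a toy `Series` with made-up values. Note that pandas' `std()` computes the sample standard deviation (`ddof=1`) by default, which is assumed here to be the intended $\sigma_X$.

```python
import pandas as pd

# Toy data (hypothetical values) with one large value pulling the mean upward
x = pd.Series([1, 2, 3, 4, 10])

# Skewness = 3 * (mean - median) / standard deviation
toy_skew = 3 * (x.mean() - x.median()) / x.std()

print("toy skewness:", toy_skew)
```

A positive value, as in this toy case, indicates that the mean lies above the median, i.e. a longer right tail.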



In [None]:
### start your code here ###

### end your code here ###
print("skewness:",skew)

### Exercise 6: log transformation
Please (i) transform this `total_patients` column using a log transformation `log(x+1)` and assign it to a new column `log_total_patients`, and (ii) plot a histogram of this new column.
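One way to compute `log(x+1)` is NumPy's `np.log1p()`, sketched below on toy data with a hypothetical column name `total`. Adding 1 before taking the log keeps zero counts well defined, since `log(0 + 1) = 0`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), including a zero
toy = pd.DataFrame({"total": [0, 9, 99, 999]})

# np.log1p(x) computes the natural log of (x + 1) element-wise
toy["log_total"] = np.log1p(toy["total"])

print(toy)
```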

In [None]:
### start your code here ###

### end your code here ###

### Exercise 7: Dummy Variables
A categorical variable should generally be encoded as **dummy variables** (a.k.a. indicator variables), each taking only one of two values (0 or 1; False or True) prior to being used in the predictive analysis.

Please use **dummy encoding** and create k-1 dummy variables.

This implies you will get $k-1$ = 3 dummy variables corresponding to the values in the variable `program`. You can check the values of the original categorical value by executing the cell below.

In [None]:
df_WT['program'].unique()

First, create a new set of columns using `pd.get_dummies(...)` and assign it to a new object named `df_dummies`.

**Hint:** you can use the pandas function `pd.get_dummies()`, which automatically converts a categorical variable into dummy/indicator variables. To perform dummy encoding with k-1 variables, set the option `drop_first=True`. You can check [this page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) for more information.
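As a sketch of what `drop_first=True` does, consider a toy column `color` (a hypothetical example, unrelated to the exercise data) with k = 3 distinct values. The first category in sorted order is dropped, leaving k-1 = 2 dummy columns.

```python
import pandas as pd

# Toy categorical column with three distinct values (hypothetical)
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# drop_first=True drops the first category ("blue"), which becomes the
# baseline encoded by all remaining dummies being 0/False
toy_dummies = pd.get_dummies(toy["color"], prefix="color", drop_first=True)

print(toy_dummies)
```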


In [None]:
### start your code here ###

### end your code here ###
df_dummies.describe()

To add the set of new columns, we can use the function `df1 = df1.join(df2)`. See this [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html) for more detail.
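By default, `join()` aligns rows on the index, so it works directly when both objects share the same index. A minimal sketch with two toy `DataFrame`s (hypothetical names and values):

```python
import pandas as pd

# Two toy DataFrames sharing the same default index 0..2 (hypothetical)
left = pd.DataFrame({"id": [1, 2, 3]})
right = pd.DataFrame({"flag_a": [True, False, True]})

# join() appends the columns of `right` to `left`, matching rows by index;
# the result must be assigned back to keep it
left = left.join(right)

print(left)
```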

In [None]:
### start your code here ###

### end your code here ###
df_WT