# Tutorial 5 Python Notebook 

In this tutorial we will return to the [Public Health Scotland Stroke Activity](https://www.opendata.nhs.scot/dataset/scottish-stroke-statistics/resource/47656572-e196-40c8-83e8-08b0b223b2e6) dataset that we used in last week's tutorial. We will also be using the [Health Board Labels](https://www.opendata.nhs.scot/dataset/geography-codes-and-labels/resource/652ff726-e676-4a20-abda-435b98dd7bdc) dataset.

The aim of this tutorial is to give you some (guided) hands-on experience joining and reshaping data frames, as well as to reinforce some of the learning we have done across the course. There are 8 tasks. 

In [None]:
# load packages and modules 

import pandas as pd 
import numpy as np
from IPython.display import Image

In [None]:
# read in the data sets 

stroke_raw = pd.read_csv("https://www.opendata.nhs.scot/dataset/f5dcf382-e6ca-49f6-b807-4f9cc29555bc/resource/47656572-e196-40c8-83e8-08b0b223b2e6/download/stroke_activitybyhbr.csv")

hb = pd.read_csv("https://www.opendata.nhs.scot/dataset/9f942fdb-e59e-44f5-b534-d6e17229cc7b/resource/652ff726-e676-4a20-abda-435b98dd7bdc/download/hb14_hb19.csv")

It is always a good idea to quickly double check your data has been read in as expected. 

In [None]:
stroke_raw

In [None]:
hb

Below is an image showing where the different Health Boards are in Scotland on a map.

In [None]:
Image(filename = "../figures/Map_of_Health_Boards.png")

## Question to solve 

The question we are trying to answer with the data is: 

> What is the average number of discharges with a stroke diagnosis by age group in the East region of Scotland for all admissions in the financial years 2019/20 and 2020/21?

### Task 1

Looking at these two data frames, what columns do you think are the linkage keys? 

In [None]:
## your answer


#### Task 1 Solution 

In the `stroke_raw` dataset there is `HBR` (health board of residence), which links to the `HB` column in the `hb` dataset. 
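
A candidate linkage key can be sanity-checked before joining by comparing the values on each side. The sketch below uses small invented frames (`left`, `right`, and their codes are hypothetical stand-ins, not the real `stroke_raw` and `hb` data) to show one way of spotting codes that would have no match.

```python
import pandas as pd

# Hypothetical toy stand-ins for stroke_raw and hb (codes invented for illustration)
left = pd.DataFrame({"HBR": ["S01", "S02", "S99", "S01"]})
right = pd.DataFrame({"HB": ["S01", "S02", "S03"], "HBName": ["A", "B", "C"]})

# Codes that appear in the left key but have no partner in the right key
unmatched = set(left["HBR"]) - set(right["HB"])
print(unmatched)

# A lookup table should normally have exactly one row per key value
print(right["HB"].is_unique)
```

Running the same two checks on `stroke_raw["HBR"]` and `hb["HB"]` should give an early hint about rows that will not find a match.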

### Task 2

Join the Stroke activity dataset with the [Health Board Labels](https://www.opendata.nhs.scot/dataset/geography-codes-and-labels/resource/652ff726-e676-4a20-abda-435b98dd7bdc) dataset into a new data frame called `stroke_join`. 

In the task above we identified the linkage key variable(s), which is the first step when wanting to complete a join. Next, you need to decide on the type of join you want to use and then implement this in code.

In [None]:
## your answer 


#### Task 2 Solution 

Either a left join or a full join would be appropriate. An inner join is not appropriate here as it removes 2880 observations from the `stroke_raw` data set. In this example solution, I have used a left join, with the `stroke_raw` data set on the left and the `hb` data set on the right (keeping all data from `stroke_raw` and only the matching data from `hb`).

In [None]:
stroke_join = pd.merge(stroke_raw, hb, 
                       how = "left", 
                       left_on = "HBR", 
                       right_on = "HB")

stroke_join # has 43200 rows and 20 columns - this solution keeps both HBR and HB 

In [None]:
## another solution keeping only 1 linkage key column in the resulting df 
## is to rename HBR in stroke_raw first before the merge 

stroke_raw.rename({"HBR": "HB"}, axis = 1).merge(hb, how = "left", on = "HB")
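
The difference between join types described in the Task 2 solution can be checked directly with the `indicator` argument of `pd.merge`, which adds a `_merge` column recording where each row was found. This is only a sketch on invented data: the frames `left` and `right` and the code `"S99"` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: "S99" stands in for a code with no match in the lookup table
left = pd.DataFrame({"HBR": ["S01", "S02", "S99"], "n": [10, 20, 30]})
right = pd.DataFrame({"HB": ["S01", "S02"], "HBName": ["Board A", "Board B"]})

checked = pd.merge(left, right, how="left",
                   left_on="HBR", right_on="HB", indicator=True)

# "_merge" is "both" for matched rows and "left_only" for unmatched ones
merge_counts = checked["_merge"].value_counts()
print(merge_counts)

# An inner join silently drops the unmatched row
inner_rows = len(pd.merge(left, right, how="inner", left_on="HBR", right_on="HB"))
print(len(checked), inner_rows)
```

Applied to the real data, the number of `left_only` rows would be expected to equal the number of rows an inner join removes.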

### Task 3

To answer our question outlined above, we do not need all of the columns currently in the `stroke_join` dataset. Process the data to include only the variables needed to answer the question and save this processed dataset into an object called `stroke`.

Check the dtypes of the remaining columns and cast them if not appropriate. 

<details><summary style='color:darkblue'>HINT: Beware of surprise summary or aggregate data! CLICK HERE TO SEE MORE.</summary>

Beware of aggregate or summary level data, even in variables not needed to directly answer the question. Consulting the data dictionary (if provided) or doing data checks is crucial at this stage. 

</details>

In [None]:
## your answer


#### Task 3 solution 

The variables we need to answer the question are `FinancialYear`, `AdmissionType`, `AgeGroup`, `Diagnosis`, `NumberOfDischarges`, and `HBName`. We also need `Sex` as we need to filter the data to include only the aggregate level `All` to avoid duplicate data, although `Sex` is not directly related to our question of interest. 

In [None]:
stroke = stroke_join.loc[:,["FinancialYear", "AdmissionType", "AgeGroup", 
                            "Sex", "Diagnosis", "NumberOfDischarges", "HBName"]]

stroke # excellent - now we have only 7 columns but retain all 43200 rows of interest 

In [None]:
stroke.dtypes

The variables which should be categorical are currently of dtype object, so let's convert them. We can do this elegantly by passing a list of column names, instead of writing a repeated line of code for each variable.

In [None]:
cat_cols = ["FinancialYear", "AdmissionType", "AgeGroup", "Sex", 
            "Diagnosis", "HBName"]

stroke[cat_cols] = stroke[cat_cols].astype("category")

# you could put the list directly into the code instead of creating an object
## but the code is a bit easier to read if you create an object first

In [None]:
stroke.dtypes # all ready to go! 

### Task 4

What is the shape of the `stroke` data currently? Is it in a suitable shape?

In [None]:
## your answer 


#### Task 4 Solution 

The `stroke` data frame is currently in long format as we have a single column for each variable, which is indeed what we want. Long format makes it easier to manipulate and wrangle data. So we will keep it that way. 
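
As a small illustration of why long format is easier to wrangle, the sketch below writes the same invented values in both shapes. The column names and numbers are hypothetical and are not taken from the stroke data.

```python
import pandas as pd

# Hypothetical toy data: long format, one row per (board, year) observation
long_df = pd.DataFrame({
    "board": ["A", "A", "B", "B"],
    "year": ["2019/20", "2020/21", "2019/20", "2020/21"],
    "discharges": [10, 12, 20, 18],
})

# The same values in wide format, written out by hand: one column per year
wide_df = pd.DataFrame({
    "board": ["A", "B"],
    "2019/20": [10, 20],
    "2020/21": [12, 18],
})

# In long format, a group-by refers to each variable by its column name
mean_by_year = long_df.groupby("year")["discharges"].mean()
print(mean_by_year)

# In wide format, the year is hidden in the column labels
print(wide_df[["2019/20", "2020/21"]].mean())
```

Both calls give the same means, but only the long version lets you filter or group on `year` as an ordinary variable.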

### Task 5 

Now that we have our joined data set, it is important to inspect the data for any missing or aggregate values. We know from last week that this data set has many aggregate level responses! Check for the unique values of all 7 variables in `stroke`. Are there any unexpected findings? 

In [None]:
## your answer 


#### Task 5 Solution 

We know from last week that `AgeGroup` and `AdmissionType` each include aggregate level responses. For completeness, I have included the code exploring these variables again here as well. 

In [None]:
stroke.describe(include = "all") # by default only numeric columns are included unless you specify include = "all" 

The describe output does not give us details about what the categories include, so we will need to take a closer look at each variable. It also looks like there are some missing values in the `NumberOfDischarges` column based on the count. 

In [None]:
# check FinancialYear 

print(stroke.FinancialYear.unique()) # no aggregate responses 

print(stroke.FinancialYear.isna().sum()) # no missing data 

In [None]:
# check AdmissionType 

print(stroke.AdmissionType.unique()) # aggregate response All 

print(stroke.AdmissionType.isna().sum()) # no missing data 

In [None]:
## check AgeGroup  

print(stroke.AgeGroup.unique()) # aggregate responses under75 years & All 

print(stroke.AgeGroup.isna().sum()) # no missing data 

In [None]:
## check Sex 

print(stroke.Sex.unique()) # aggregate responses All - which is what we want to keep 

print(stroke.Sex.isna().sum()) # no missing data 

In [None]:
# check Diagnosis 

print(stroke.Diagnosis.unique()) # no aggregate responses 

print(stroke.Diagnosis.isna().sum()) # no missing data 

In [None]:
# check NumberOfDischarges 

print(stroke.NumberOfDischarges.describe()) # values range from 0 to 35404 
# count only 39503 of the total 43200 rows suggesting some NA values 

## we can look for the total number of NaN values using isna and then sum (Truthy and Falsey are helpful here!)
print(stroke.NumberOfDischarges.isna().sum())

## or we can confirm that there are ANY NaN values at all in the column 
print(stroke.NumberOfDischarges.isna().values.any())
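
The comments in the cell above rely on `True` counting as 1 and `False` as 0 when summed. Here is a minimal sketch of that idea on an invented Series (`s` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing values
s = pd.Series([3.0, np.nan, 8.0, np.nan, 1.0])

n_missing = s.isna().sum()   # True counts as 1, False as 0
n_present = s.count()        # count() skips NaN, just like the count in describe()
print(n_missing, n_present, len(s))

# any() answers the yes/no question: is at least one value missing?
print(s.isna().any())
```

The missing and present counts add up to the number of rows, which is the same gap we spotted between the `describe` count and the total number of rows.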

In [None]:
# check HBName 

print(stroke.HBName.unique()) # we have some nan values! There are 14 HBs as expected but what could this nan be?
# well, we know from last week that HBR in the stroke dataset included all of Scotland with the country code S92000003

# let's see how many missing data points there are 
print(stroke.HBName.isna().sum()) # 2880 

Because we still have the unedited `stroke_raw` data, we can check if these missing values match up to the aggregate country level. Having a raw data frame version is super useful! 

In [None]:
stroke_raw.loc[stroke_raw["HBR"] == "S92000003"] 

# ah ha! Indeed there are 2880 observations at the country code level in the raw dataset. Mystery solved!! 

### Task 6

We now know there are both aggregate level responses in our data frame as well as missing data. Before we deal with any missing data unnecessarily, let's filter out the responses we are not interested in (i.e., remove the rows we do not need to answer the question) and then check again for any missing data. It is likely that in doing so, the missing data may not be a problem anymore. 

Save your filtered data into a data frame called `stroke_q`.

<details><summary style='color:darkblue'>HINT: Breaking down the question. CLICK HERE TO SEE</summary>

First write down what responses you want to keep for each variable in order to answer the question. Then write the code to do so. 

</details>

In [None]:
## your answer 


#### Task 6 Solution

To answer the question we need: 

* Diagnosis only of stroke 
* All non-aggregate level responses of AgeGroup
* Aggregate response All from Sex
* Health Boards in the East Region of Scotland
* The aggregate level response All from AdmissionType
* the Financial Years 2019/20 & 2020/21

In [None]:
stroke_q = stroke.loc[(stroke["AdmissionType"] == "All") & 
                    (stroke["Diagnosis"] == "Stroke") & 
                    (stroke["Sex"] == "All") & 
                    (stroke["FinancialYear"].isin(["2019/20", "2020/21"])) & 
                    (~stroke["AgeGroup"].isin(["All", "under75 years"])) & 
                    (stroke["HBName"].isin(["NHS Lothian", "NHS Fife", "NHS Borders"]))]

stroke_q # now 24 rows of 7 columns 
## if we did not filter for "All" in Sex, we would have 72 rows (duplicate data as there are 3 categories in Sex)
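
To see why the aggregate rows matter, the sketch below builds a tiny invented frame in which the `All` row of `Sex` is the sum of the other two categories. The frame `toy` and all its values are hypothetical. Without filtering, every discharge is counted twice.

```python
import pandas as pd

# Hypothetical toy data: the "All" row is the sum of the "Male" and "Female" rows
toy = pd.DataFrame({
    "Sex": ["All", "Male", "Female", "All", "Male", "Female"],
    "AgeGroup": ["group 1"] * 3 + ["group 2"] * 3,
    "NumberOfDischarges": [10, 6, 4, 30, 18, 12],
})

# Summing every row counts each discharge twice
naive_total = toy["NumberOfDischarges"].sum()

# Keeping only the aggregate rows gives the true total
filtered_total = toy.loc[toy["Sex"] == "All", "NumberOfDischarges"].sum()

print(naive_total, filtered_total)
```

Keeping only `Male` and `Female` would give the same total as keeping only `All`. The problem arises when both levels are left in together.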

In [None]:
## let's check for missing values in HBName and NumberOfDischarges in our now filtered df 

print(stroke_q.HBName.isna().sum())

print(stroke_q.NumberOfDischarges.isna().sum())

# happy days, no more missing data to worry about in order to answer the question 

### Task 7 

Now that we have our data prepared and checked, answer the question posed at the start of this notebook:

> What is the average number of discharges with a stroke diagnosis by age group in the East region of Scotland for all admissions in the financial years 2019/20 and 2020/21?

In [None]:
## your answer 


#### Task 7 Solution 

In [None]:
stroke_q.groupby(["AgeGroup", "FinancialYear", "HBName"])["NumberOfDischarges"].mean().dropna()
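
The `dropna()` in the solution above is most likely needed because the grouping columns are categorical. Depending on your pandas version, `groupby` may return a row for every category level, including levels that no longer occur after filtering. The sketch below illustrates this on an invented frame (`toy` and its category names are hypothetical), using the `observed` argument of `groupby` as an alternative.

```python
import pandas as pd

# Hypothetical toy data: "unused" is a category level with no rows
toy = pd.DataFrame({
    "AgeGroup": pd.Categorical(["young", "old"],
                               categories=["young", "old", "unused"]),
    "NumberOfDischarges": [5.0, 7.0],
})

# observed=False keeps every category level, so "unused" gets a NaN mean
all_levels = toy.groupby("AgeGroup", observed=False)["NumberOfDischarges"].mean()

# observed=True keeps only the levels that actually occur in the data
seen_only = toy.groupby("AgeGroup", observed=True)["NumberOfDischarges"].mean()

print(all_levels)
print(seen_only)
```

Either approach leaves the same non-missing values. Passing `observed = True` avoids creating the empty groups in the first place.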

### Task 8 

As I mentioned in this week's content, wide data is often more human readable than long data. Take your solution to Task 7 and make the presentation a bit nicer by reshaping the data! 

<details><summary style='color:darkblue'>HINT 1: Remember there are multiple functions in Python to reshape data! CLICK HERE TO SEE</summary>

Remember that we learned about 4 functions this week to reshape data in Python. 
    
* `melt` to make data longer and its counterpart `stack` for MultiIndex data frames or Series
* `pivot` to make data wider and its counterpart `unstack` for MultiIndex data frames or Series

</details>
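
The four functions listed above can be tried out on a small invented frame before applying them to the real data. In this sketch the names `long_df`, `board`, `year`, and `discharges` are hypothetical and the values are made up.

```python
import pandas as pd

# Hypothetical toy data in long format
long_df = pd.DataFrame({
    "board": ["A", "A", "B", "B"],
    "year": ["2019/20", "2020/21", "2019/20", "2020/21"],
    "discharges": [10, 12, 20, 18],
})

# pivot: long -> wide (one column per year)
wide_df = long_df.pivot(index="board", columns="year", values="discharges")

# melt: wide -> long again (reset_index turns "board" back into a column first)
back_to_long = wide_df.reset_index().melt(
    id_vars="board", var_name="year", value_name="discharges"
)

# unstack and stack do the same jobs using a MultiIndex instead of columns
series = long_df.set_index(["board", "year"])["discharges"]
unstacked = series.unstack()   # wider: innermost index level becomes the columns
restacked = unstacked.stack()  # longer: the columns move back into the index

print(wide_df)
print(unstacked)
```

`pivot` and `unstack` produce the same wide table here. The difference is whether the identifying variables start out as ordinary columns or as index levels.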

<details><summary style='color:darkblue'>HINT 2: What is the data structure of your solution to Task 7? CLICK HERE TO SEE. </summary>
    
If you save your solution to task 7 into an object and then run the code `type(object_name)` you will see that the output is not a dataframe but rather a `pandas.core.series.Series`

</details>

In [None]:
## your answer 


#### Task 8 Solution 

In [None]:
stroke_q.groupby(["AgeGroup", "FinancialYear", "HBName"])["NumberOfDischarges"].mean().dropna().unstack()

Play around with the `level` argument in `unstack` in the solution above to get a better understanding of how the function works. The default is `level = -1`. For example, you can pass a list to the `level` argument! 

In [None]:
stroke_q.groupby(["AgeGroup", "FinancialYear", "HBName"])["NumberOfDischarges"].mean().dropna().unstack(level = [-1, 1])

---
## Well done! 🎉 

Well done! You have completed all of the tasks for the Python notebook for this tutorial. If you have not done so yet, now move to the R notebook.

Do not forget your 3 stars, a wish, and a step mini-diaries for this week once you have completed the tutorial notebooks and content for the week. 


---
*Dr. Brittany Blankinship (2024)*