# Tutorial 5 Python Notebook 

In this tutorial we will return to the [Public Health Scotland Stroke Activity](https://www.opendata.nhs.scot/dataset/scottish-stroke-statistics/resource/47656572-e196-40c8-83e8-08b0b223b2e6) dataset that we used in last week's tutorial. We will also be using the [Health Board Labels](https://www.opendata.nhs.scot/dataset/geography-codes-and-labels/resource/652ff726-e676-4a20-abda-435b98dd7bdc) dataset. 

The aim of this tutorial is to give you some (guided) hands-on experience joining and reshaping data frames, as well as to reinforce some of the learning we have done across the course. There are 8 tasks. 

In [None]:
# load packages and modules 

import pandas as pd 
import numpy as np
from IPython.display import Image

In [None]:
# read in the data sets 

stroke_raw = pd.read_csv("https://www.opendata.nhs.scot/dataset/f5dcf382-e6ca-49f6-b807-4f9cc29555bc/resource/47656572-e196-40c8-83e8-08b0b223b2e6/download/stroke_activitybyhbr.csv")

hb = pd.read_csv("https://www.opendata.nhs.scot/dataset/9f942fdb-e59e-44f5-b534-d6e17229cc7b/resource/652ff726-e676-4a20-abda-435b98dd7bdc/download/hb14_hb19.csv")

It is always a good idea to quickly double-check that your data have been read in as expected. 

In [None]:
stroke_raw

In [None]:
hb

The map below shows where the different Health Boards are located in Scotland.

In [None]:
Image(filename="../figures/Map_of_Health_Boards.png")

## Question to solve 

The question we are trying to answer with the data is: 

> What is the average number of discharges with a stroke diagnosis by age group in the East region of Scotland for all admissions in the financial years 2019/20 and 2020/21?

### Task 1

Looking at these two data frames, what columns do you think are the linkage keys? 

In [None]:
## your answer


### Task 2

Join the Stroke activity dataset with the [Health Board Labels](https://www.opendata.nhs.scot/dataset/geography-codes-and-labels/resource/652ff726-e676-4a20-abda-435b98dd7bdc) dataset into a new data frame called `stroke_join`. 

In the previous task we identified the linkage key variable(s), which is the first step in completing a join. Next, you need to decide on the type of join you want to use and then implement it in code.
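As a reminder of the mechanics, here is a minimal sketch of a join on two toy tables. The data and the column names (`site_code`, `code`, `site_name`, `discharges`) are hypothetical and are not the columns of the tutorial datasets. A left join is used purely for illustration; which join type suits this task is for you to decide.

```python
import pandas as pd

# Hypothetical toy tables; the column names are illustrative only
activity = pd.DataFrame({
    "site_code": ["A1", "A1", "B2", "C3"],
    "discharges": [10, 12, 7, 5],
})
labels = pd.DataFrame({
    "code": ["A1", "B2"],
    "site_name": ["Alpha", "Beta"],
})

# Left join: keep every row of `activity` and add the label columns
# wherever the key values match
joined = activity.merge(
    labels,
    how="left",
    left_on="site_code",
    right_on="code",
    indicator=True,
)

# The `_merge` column records whether each row found a match ("both")
# or not ("left_only"), which is a quick way to sanity-check a join
print(joined)
print(joined["_merge"].value_counts())
```

Here the key columns have different names in the two tables, so `left_on` and `right_on` are used. When the names match, a single `on=` argument is enough.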

In [None]:
## your answer 


### Task 3

To answer our question outlined above, we do not need all of the columns currently in the `stroke_join` dataset. Process the data to include only the variables needed to answer the question and save this processed dataset into an object called `stroke`.

Check the dtypes of the remaining columns and cast any that are not appropriate. 

<details><summary style='color:darkblue'>HINT: Beware of surprise summary or aggregate data! CLICK HERE TO SEE MORE.</summary>

Beware of aggregate or summary level data, even in variables not needed to directly answer the question. Consulting the data dictionary (if provided) or doing data checks is crucial at this stage. 

</details>
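Checking and casting dtypes can be sketched on a small toy frame. The column names and values below are hypothetical, and the target dtypes are just one reasonable choice rather than the required answer for `stroke`.

```python
import pandas as pd

# Hypothetical toy data in which a numeric column was read in as text
toy = pd.DataFrame({
    "year": ["2019/20", "2020/21"],
    "count": ["12", "15"],
})

# Inspect the current dtype of every column
print(toy.dtypes)

# Cast the text column holding numbers to a numeric dtype
toy["count"] = pd.to_numeric(toy["count"])

# Cast a column with a small set of repeated labels to category
toy["year"] = toy["year"].astype("category")

print(toy.dtypes)
```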

In [None]:
## your answer


### Task 4

What is the shape of the `stroke` data currently? Is it in a suitable shape?

In [None]:
## your answer 


### Task 5 

Now that we have our joined data set, it is important to inspect the data for any missing or aggregate values. We know from last week that this data set has many aggregate level responses! Check for the unique values of all 7 variables in `stroke`. Are there any unexpected findings? 
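One possible way to run this kind of check is sketched below on a toy frame. The column names and the `"All"` aggregate label are assumptions made for illustration, so the labels in `stroke` may look different.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with an aggregate label ("All") and missing values
toy = pd.DataFrame({
    "region": ["North", "South", "All", "North"],
    "sex": ["Female", "Male", "All", None],
    "count": [3, 4, 7, np.nan],
})

# Print the unique values of every column, one column at a time
for col in toy.columns:
    print(col, toy[col].unique())

# Count the missing values per column
missing = toy.isna().sum()
print(missing)
```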

In [None]:
## your answer 


### Task 6

We now know that our data frame contains both aggregate level responses and missing data. Before spending effort on the missing data, let's first filter out the responses we are not interested in (i.e., remove the rows we do not need to answer the question) and then check again for missing data. After filtering, the missing data may no longer be a problem. 

Save your filtered data into a data frame called `stroke_q`.

<details><summary style='color:darkblue'>HINT: Breaking down the question. CLICK HERE TO SEE</summary>

First write down which responses you want to keep for each variable in order to answer the question. Then write the code to do so. 

</details>
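The "write down what to keep, then filter" approach can be sketched as follows. The toy data, column names, and kept values are all hypothetical; the point is only the pattern of combining `isin` conditions and re-checking for missing values afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the aggregate row ("All") carries the only missing value
toy = pd.DataFrame({
    "region": ["North", "South", "All", "North", "South"],
    "year": ["2018/19", "2019/20", "2019/20", "2019/20", "2020/21"],
    "count": [3.0, 4.0, np.nan, 5.0, 6.0],
})

# Step 1: write down the responses to keep for each variable
keep_regions = ["North", "South"]
keep_years = ["2019/20", "2020/21"]

# Step 2: combine the conditions into one boolean mask and filter
mask = toy["region"].isin(keep_regions) & toy["year"].isin(keep_years)
filtered = toy.loc[mask].reset_index(drop=True)

# Step 3: compare missing values before and after filtering
print(toy.isna().sum())
print(filtered.isna().sum())
```

In this toy case the missing value sat in a row that was filtered out, so nothing further needs to be done about it.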

In [None]:
## your answer 


### Task 7 

Now that we have our data prepared and checked, answer the question posed at the start of this notebook:

> What is the average number of discharges with a stroke diagnosis by age group in the East region of Scotland for all admissions in the financial years 2019/20 and 2020/21?

In [None]:
## your answer 


### Task 8 

As I mentioned in this week's content, wide data is often more human readable than long data. Take your solution to Task 7 and make the presentation a bit nicer by reshaping the data!

<details><summary style='color:darkblue'>HINT 1: Remember there are multiple functions in Python to reshape data! CLICK HERE TO SEE</summary>

Remember that we learned about 4 functions this week to reshape data in Python. 

* `melt` to make data longer and its counterpart `stack` for MultiIndex data frames or Series
* `pivot` to make data wider and its counterpart `unstack` for MultiIndex data frames or Series

</details>

<details><summary style='color:darkblue'>HINT 2: What is the data structure of your solution to Task 7? CLICK HERE TO SEE. </summary>

If you save your solution to Task 7 into an object and then run the code `type(object_name)`, you will see that the output is not a data frame but rather a `pandas.core.series.Series`.

</details>
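To see how the four reshaping functions relate, here is a small sketch on toy data. The column names (`group`, `year`, `value`) are hypothetical and the grouping is only an illustration, not the solution to Task 7.

```python
import pandas as pd

# Hypothetical toy data; the names are illustrative only
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "year": ["2019", "2020", "2019", "2020"],
    "value": [1.0, 3.0, 2.0, 4.0],
})

# Grouping by two keys and aggregating one column returns a Series
# with a MultiIndex, not a data frame
means = toy.groupby(["group", "year"])["value"].mean()
print(type(means))

# unstack moves an index level ("year") into the columns: long -> wide
wide = means.unstack("year")
print(wide)

# stack reverses this and returns a MultiIndex Series: wide -> long
stacked = wide.stack()

# For ordinary (flat) data frames, pivot and melt play the same roles
wide_flat = toy.pivot(index="group", columns="year", values="value").reset_index()
long_flat = wide_flat.melt(id_vars="group", var_name="year", value_name="value")
print(long_flat)
```

Because a grouped result is already indexed by the grouping keys, `unstack` is usually the shortest route to a wide table, while `pivot` expects the keys to be regular columns.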

In [None]:
## your answer 


---
## Well done! 🎉 

Well done! You have completed all of the tasks in the Python notebook for this tutorial. If you have not done so yet, now move on to the R notebook.

Do not forget your 3 stars, a wish, and a step mini-diaries for this week once you have completed the tutorial notebooks and content for the week. 


---
*Dr. Brittany Blankinship (2024)*