# Tutorial 4 Python Notebook 

In this tutorial there are tasks for working with lists, arrays, dictionaries, and data frames.

* Lists = Tasks 1-3
* Arrays = Tasks 4-7
* Dictionaries = Tasks 8-10
* Data frames = Tasks 11-16

The aim of this tutorial notebook is to give you some (guided) hands-on experience working with different data structures in Python.

In [None]:
import numpy as np 
import pandas as pd

## Lists 

### Task 1

Create a list containing strings, numbers, a list, and Boolean values. 


In [None]:
## your answer 


### Task 2

How would you index the first item of the nested list inside the list you created in Task 1?

In [None]:
## your answer 



### Task 3

Convert the list below into a 1-dimensional array. 

In [None]:
list1 = [12.23, 13.32, 100.1, 36.45]

print(list1)

<details><summary style='color:darkblue'>HINT: Where do we get the array data structure? CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

Remember, arrays are **not** a built-in Python data structure. We access them through the `NumPy` package. 
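As a minimal sketch of the idea, `np.array()` builds a NumPy array from a Python list. The list `temperatures` below is made-up illustration data, not the tutorial's `list1`.

```python
import numpy as np

# Hypothetical example list (not the tutorial's list1)
temperatures = [18.5, 21.0, 19.75]

# np.array() builds a NumPy array from the list
temperature_array = np.array(temperatures)

print(type(temperature_array))  # the array's type
print(temperature_array.ndim)   # number of dimensions
```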
  

## Arrays

### Task 4

Use the keyboard variable below to spell your name by referencing the letters in the array. 

<details><summary style='color:darkblue'>HINT: Example solution. CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

For example, "brittany" is:
    
`print(keyboard[2,4], keyboard[0,3], keyboard[0,7], keyboard[0,4], keyboard[0,4], keyboard[1,0], keyboard[2,5], keyboard[0,5])`

In [None]:
keyboard = np.array([ ['q','w','e','r','t','y','u','i','o','p'],
                      ['a','s','d','f','g','h','j','k','l',';'],
                      ['z','x','c','v','b','n','m','<','>','?']
                   ])

In [None]:
## your answer here 



### Task 5 

Create a 1-dimensional array with values ranging from 20 to 80.

In [None]:
## your answer 


### Task 6

Reverse the array created in Task 5.

<details><summary style='color:darkblue'>HINT: A reminder. CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>
    
Slicing arrays (and lists and data frames) uses the syntax `my_array[start_index:stop_index:step]`, where `step` is the jump between selected positions.
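To make the slicing syntax concrete, here is a small sketch on a made-up array (the names `example`, `first_three`, `every_second`, and `middle_by_three` are only for illustration). Note that the stop position is excluded, and any part you leave out falls back to a default.

```python
import numpy as np

# Hypothetical example array holding 0, 1, ..., 9
example = np.arange(10)

first_three = example[0:3]        # positions 0, 1, 2 (stop is excluded)
every_second = example[::2]       # omitted start/stop cover the whole array
middle_by_three = example[1:8:3]  # positions 1, 4, 7

print(first_three, every_second, middle_by_three)
```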

<details><summary style='color:darkblue'>HINT: Another reminder. CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

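Remember that reverse (negative) indexing works very differently in Python and R! In Python, a negative index counts back from the end; in R, a negative index drops the element at that position.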
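A quick sketch of how negative indices behave in Python, using a made-up array called `letters`:

```python
import numpy as np

# Hypothetical example array
letters = np.array(["a", "b", "c", "d"])

last_item = letters[-1]    # -1 is the last element
second_last = letters[-2]  # -2 is the one before it

# In R, by contrast, letters[-1] would return everything except the first element
print(last_item, second_last)
```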


In [None]:
## your answer 


### Task 7

Create a 2D array with 1s around the border and 0s in the interior.

<details><summary style='color:darkblue'>HINT: Breaking down the problem. CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

This may sound easy enough, but is actually quite tricky as we are working in 2 dimensions! As a first step, create a 2-D array filled with ones (there is a specific function for this). Then replace the inside or middle values of the array with 0.
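One possible way to follow these two steps is sketched below. The 5x5 size and the name `border` are arbitrary choices for illustration; other approaches work too.

```python
import numpy as np

# Step 1: a 2-D array filled with ones (5x5 is an arbitrary size)
border = np.ones((5, 5))

# Step 2: select everything except the first/last row and column, and set it to 0
border[1:-1, 1:-1] = 0

print(border)
```

The slice `1:-1` means "from position 1 up to, but not including, the last position", so it works for any array size.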

In [None]:
## your answer 


## Dictionaries

### Task 8

Create a dictionary comprised of colors blue, red, green; animals dog, cat, horse; and age 33, 56, 24.

In [None]:
## your answer 


### Task 9

Add a new key-value pair to the dictionary created in Task 8: a key called flower whose values are daisy, rose, and lily.

In [None]:
## your answer 


### Task 10

Convert your dictionary into a data frame. 

<details><summary style='color:darkblue'>HINT: Where do we get the data frame data structure? CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

Remember, data frames are **not** a built-in Python data structure. We access them through the `pandas` package. 
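As a minimal sketch, `pd.DataFrame()` accepts a dictionary whose values are equal-length lists; each key becomes a column. The dictionary `pets` below is made up for illustration.

```python
import pandas as pd

# Hypothetical dictionary: each key becomes a column, each list its values
pets = {"name": ["Rex", "Tom"], "legs": [4, 4]}

pets_df = pd.DataFrame(pets)
print(pets_df)
```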
  

In [None]:
## your answer 


## Data frames 

For this series of tasks we will be using a publicly available dataset from [Public Health Scotland around Stroke Activity](https://www.opendata.nhs.scot/dataset/scottish-stroke-statistics/resource/47656572-e196-40c8-83e8-08b0b223b2e6). This dataset provides "Information on hospital activity related to cerebrovascular disease (including stroke and subarachnoid haemorrhage)." Look through the link and read the data dictionary at the bottom to familiarise yourself with the variables.  

We will first read in the data. You can read in data from a URL with the `read_csv()` function by passing the URL as a character string - how helpful! I have copied the download URL, which appears at the top of the dataset page linked above. It is good practice when first reading in a data set to name it with `_raw` or some other marker that it is the raw data. As you process the data for your analytic purposes, you can then save the processed data in an object without this marker. This allows you to maintain an object with a version of the raw data that you can refer to later if needed.

In [None]:
stroke_raw = pd.read_csv("https://www.opendata.nhs.scot/dataset/f5dcf382-e6ca-49f6-b807-4f9cc29555bc/resource/47656572-e196-40c8-83e8-08b0b223b2e6/download/stroke_activitybyhbr.csv")

In [None]:
# it is always a good idea to do a quick visual check of the data once you read it in, to spot any obvious or blatant parsing issues
stroke_raw # looks good! 

In [None]:
# using pd.read_csv() will import the data for us as a pandas data frame 
type(stroke_raw)

### Task 11 

Look through the imported data to check it looks as it should based on the data dictionary. Are all the expected variables included? What dimensions does the dataframe have? Do the dtypes of these variables look to be correct? 

<details><summary style='color:darkblue'>HINT: A useful new function! CLICK HERE TO SEE</summary>
    
There is a useful function called `head()` which by default prints the first 5 rows of a dataframe. The counterpart is `tail()`, which prints the last 5 rows by default. Both functions take the argument `n` if you wish to specify a number of rows other than 5. There are counterpart functions in R with the same name and functionality!
    
This will help in solving the task, but you will need to use some other summary/description functions as well.
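The sketch below shows `head()` and `tail()` next to a few other common inspection tools (`shape`, `dtypes`, `info()`), on a made-up data frame called `toy`. Which ones you need for the task is for you to decide.

```python
import pandas as pd

# Hypothetical toy data frame
toy = pd.DataFrame({"year": ["2021/22", "2022/23", "2023/24"], "count": [10, 12, 9]})

print(toy.head(n=2))  # first 2 rows
print(toy.tail(n=1))  # last row
print(toy.shape)      # (number of rows, number of columns)
print(toy.dtypes)     # data type of each column
toy.info()            # column names, non-null counts, and dtypes together
```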

In [None]:
## your answer 


### Task 12 

We do not need all of the columns in the dataset. The only variables we need for the next tasks are `FinancialYear`, `AdmissionType`, `AgeGroup`, `Diagnosis`, and `NumberOfDischarges`. Within the `HBR` variable, "S92000003" is the country code for Scotland, which is an aggregate level. We do not need the `Sex` variable for the next tasks, but it too includes an aggregate level, "All". Filter the data so that only these aggregate levels of `HBR` and `Sex` are included, keep only the 5 columns listed above, and save the processed dataset into an object called `stroke`.

<details><summary style='color:darkblue'>HINT 1: Think about your data structures! CLICK HERE TO SEE</summary>
Remember to think about your input data structures. For example, you may want to use a list to hold the names of the columns we want to keep.

<details><summary style='color:darkblue'>HINT 2: Indexing operators CLICK HERE TO SEE</summary>

We learned about `loc` and `iloc` this week: `loc` selects by label (name or Boolean condition), while `iloc` selects by integer position.
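A small sketch of both operators on a made-up data frame (`toy`, `keep_columns`, and the column names are hypothetical, not the stroke data):

```python
import pandas as pd

# Hypothetical toy data frame
toy = pd.DataFrame({
    "region": ["North", "South", "North"],
    "group": ["All", "All", "Female"],
    "value": [5, 7, 3],
})

keep_columns = ["region", "value"]  # a list of column names to keep

# loc: a Boolean row condition plus column names
by_label = toy.loc[(toy["region"] == "North") & (toy["group"] == "All"), keep_columns]

# iloc: integer positions (first two rows; first and third columns)
by_position = toy.iloc[0:2, [0, 2]]

print(by_label)
print(by_position)
```

Each condition is wrapped in parentheses because `&` binds more tightly than `==`.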

<details><summary style='color:darkblue'>HINT 3: Resulting data frame dimensions. CLICK HERE TO SEE</summary>

The object `stroke` should contain 960 rows and 5 columns.


### Task 13 

Check the data types of the remaining 5 variables and convert them to a better data type if needed. 


<details><summary style='color:darkblue'>HINT: Editing the dataframe. CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

In order to actually change the data types in the original dataframe, make sure to assign the result back to an object, since the `astype()` function returns a copy.
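The difference is sketched below on a made-up data frame (`toy` and its columns are hypothetical): calling `astype()` on its own leaves the data frame unchanged, while assigning the result back stores the new type.

```python
import pandas as pd

# Hypothetical toy data frame
toy = pd.DataFrame({"kind": ["a", "b", "a"], "count": [1, 2, 3]})

toy["kind"].astype("category")  # returns a converted copy; toy is unchanged
dtype_before = str(toy["kind"].dtype)

toy["kind"] = toy["kind"].astype("category")  # assigning back stores the change
dtype_after = str(toy["kind"].dtype)

print(dtype_before, "->", dtype_after)
```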

In [None]:
## your answer 


### Task 14

Look at the categories within the categorical variables - is there anything unexpected?


<details><summary style='color:darkblue'>HINT: A warning about the data not to miss. CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

It looks like some of our variables include aggregate level responses! Good thing we checked our data. Aggregate data is very common in health and social care data. It is crucial to check your data to ensure you are aware of any aggregate categories. Depending on your specific use case, you may wish to use only the aggregate levels or perhaps remove the aggregate levels and only work with the finer-grained categories.
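One way to spot aggregate levels is to list the distinct values of a column. The sketch below uses `unique()` and `value_counts()` on a made-up column (`age_band` is hypothetical, not a column of the stroke data).

```python
import pandas as pd

# Hypothetical toy column containing an aggregate level "All"
toy = pd.DataFrame({"age_band": ["0-44", "45-64", "All", "0-44"]})

levels = toy["age_band"].unique()        # distinct values, in order of appearance
counts = toy["age_band"].value_counts()  # number of rows for each value

print(levels)
print(counts)
```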

In [None]:
## your answer



### Task 15

`AgeGroup` is a bit messy. It should be ordered and includes 2 aggregate categories. Remove the aggregate categories and order the remaining categories.


<details><summary style='color:darkblue'>HINT: Resulting data frame dimensions. CLICK HERE TO SEE</summary>

Your `stroke` data frame should contain 640 rows and 5 columns 


<details><summary style='color:darkblue'>HINT: A new "not" operator. CLICK HERE TO SEE</summary>

In Python, `~` is the bitwise NOT operator. Applied to a pandas Boolean Series, it flips every `True` to `False` and vice versa.
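The sketch below shows `~` together with `isin()` to drop unwanted rows, followed by one possible way to declare an order with `pd.Categorical`. The data frame `toy` and its categories are made up; the stroke data will need its own category list.

```python
import pandas as pd

# Hypothetical toy data frame with one aggregate row ("All")
toy = pd.DataFrame({"size": ["large", "All", "small", "medium"], "n": [3, 9, 2, 4]})

is_aggregate = toy["size"].isin(["All"])  # True for the aggregate rows
detail = toy[~is_aggregate].copy()        # ~ flips True/False, keeping the other rows

# Declare an explicit order for the remaining categories
detail["size"] = pd.Categorical(
    detail["size"], categories=["small", "medium", "large"], ordered=True
)

print(detail.sort_values("size"))
```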

In [None]:
## your answer 


### Task 16 

Create a summary table with the average number of discharges with a stroke diagnosis by age group for all admissions in the financial years 2021/22 and 2022/23.

<details><summary style='color:darkblue'>HINT: Breaking down the task. CLICK HERE TO SEE</summary>

Further filtering of the data is needed for this task. Only then should you group the data to compute the values of interest.
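The filter-then-group pattern can be sketched on made-up data as follows (`toy`, `recent`, `summary`, and the column names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data frame
toy = pd.DataFrame({
    "year": ["2020", "2021", "2021", "2022", "2022"],
    "group": ["a", "a", "b", "a", "b"],
    "value": [10, 20, 30, 40, 50],
})

# Step 1: filter to the rows of interest
recent = toy[toy["year"].isin(["2021", "2022"])]

# Step 2: group, then summarise each group
summary = recent.groupby("group")["value"].mean()

print(summary)
```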


In [None]:
## your answer 


---
## Well done! 🎉 

Well done! You have completed all of the tasks for the Python notebook for this tutorial. If you have not done so yet, now move to the R notebook.

Do not forget your 3 stars, a wish, and a step mini-diaries for this week once you have completed the tutorial notebooks and content for the week. 


---
*Dr. Brittany Blankinship (2024)*