# Data Crunching for Health Day 2 - Plotting Data

## What is a graph?

![2020_google_trends.png](attachment:2020_google_trends.png)

## Why do we draw graphs? 

* They are much easier to read than long lists of numbers
* They are a great way to share our data with others
* They can help us discover interesting things about our data

## What kind of graphs can we make?

There is an almost endless range of graph types that can be used to display different types of data. We are just going to look at two today: the line graph and the scatter plot.

![example_graphs-3.png](attachment:example_graphs-3.png)

**Scatter plots** - When two numerical variables are connected, we say they are correlated, and we can show this on a scatter plot. Each observation is represented as a point on a graph with two axes. The value that we want to understand the change in is always placed on the vertical axis (y axis). We put the value that we think might be causing this change on the horizontal axis (x axis).

**Line graphs** - A line graph is similar to a scatter plot except that the points are joined with straight lines. A line graph is often used to see changes in data over time, so the line is usually drawn chronologically, with the earliest time at the left-hand side and the latest at the right.


## Parts of a graph

It's important that we make our graph easy for other people to understand!

* Always include a title for your graph that explains what the graph shows.
* Always label your axes and include units if they are needed so that it is clear what you have measured.

![what_is_graph.png](attachment:what_is_graph.png)

## First let's set up our notebook

We're going to import the same pandas package that we used last time. We will also import the seaborn package and the matplotlib package, which are very useful tools for drawing graphs.

We also need to include this line to make sure our graphs display nicely on the page:

```%matplotlib inline```

You can run this in the code cell below!

In [None]:
# In this cell, import the packages that we need for plotting

%matplotlib inline 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Let's plot some of our cancelled operations data!

We are going to read in the file **cancellations_scotland_day_2.csv**. This file contains the same data that we looked at on Wednesday, with the NAs removed and the PercentCancelled column added (just as your dataframe would have looked after we cleaned it up).

Remember we can use pd.read_csv() to read in a file and .head() to look at the first few rows.
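As a reminder of how these two functions fit together, here is a small sketch. It does not use the real file: the text in `example_csv` is made up for illustration, and the name `example` is just a placeholder.

```python
import io
import pandas as pd

# Made-up stand-in for a CSV file; these values are NOT from the real dataset
example_csv = io.StringIO(
    "Month,PercentCancelled\n"
    "2021-01-01,9.5\n"
    "2021-02-01,8.0\n"
    "2021-03-01,7.5\n"
)

# pd.read_csv() accepts a file name or a file-like object such as this one
example = pd.read_csv(example_csv)

# .head() returns the first five rows by default (here there are only three)
print(example.head())
```

With a real file you would pass the file name in quotes to `pd.read_csv()` instead of `example_csv`.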

<br>

<details>
    
<summary>Click me for a hint!</summary>
    
operations = pd.read_csv("cancellations_scotland_day_2.csv")</details>
    
**Read in the cancellations_scotland_day_2.csv in the code cell below and call your dataframe operations.**

In [None]:
#Read in the dataframe here

operations = 

In [None]:
# Have a look at the head of the dataframe



### Let's format our date column

Again, we need to tell Python that the Month column in our operations dataframe is a date by converting it to the datetime type. We fixed the format to a more understandable one last time, so we don't have to do that again.

So we do just as we did before:

```operations['Month'] = pd.to_datetime(operations['Month'])```
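To see what this conversion changes, here is a minimal sketch on a tiny made-up dataframe (the name `example` and its values are invented for illustration). It prints the type of the column before and after the conversion.

```python
import pandas as pd

# Made-up example values, not taken from the real dataset
example = pd.DataFrame({"Month": ["2021-01-01", "2021-02-01", "2021-03-01"]})

# Before converting, pandas stores the dates as plain text
print(example["Month"].dtype)

# After converting, the column has a datetime type
example["Month"] = pd.to_datetime(example["Month"])
print(example["Month"].dtype)
```

Once the column is a datetime type, plotting functions can place the dates in the right order and spacing along the x axis.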

Try this yourself in the code cell below.

In [None]:
# Convert the Month column to datetime format



### Drawing a line graph

We might want to look at how the number of total operations scheduled changed over time. To do this we would like to use a line graph.


We will use seaborn, a Python package, to draw our graph using the **lineplot** function:

```sns.lineplot(data = our_dataframe, x = "our_time_variable", y = "our_numeric_variable")```

We should also add a title to the graph:

```plt.title("This is the title of this graph")```

And label our x and y axes:

```plt.xlabel("Our time variable")```

```plt.ylabel("Our numeric variable")```

We might also want to show the individual data points on our line. We can do this by adding ```marker = 'o'``` to our lineplot call.

```sns.lineplot(data = our_dataframe, x = "our_time_variable", y = "our_numeric_variable", marker = 'o')```

### Let's plot total scheduled operations against time

Can you think of some reasons for the change in number of total scheduled operations over time?

## Your turn!

### Plot a line graph showing percentage of scheduled operations that have been cancelled over time

**In the cell below, plot a labelled line graph of the percentage of operations cancelled against time.**


<br>

<details>
    
<summary>Click me for a hint!</summary>
The columns that you want to use are Month and PercentCancelled.</details>

In [None]:
#Use sns.lineplot to plot a graph of percentage operations cancelled against time



## Let's investigate a potential reason behind these cancellations

You also have access to a file called **cancellations_scotland_covid.csv**. 

It has all the same columns as the cancellations dataset, but this time we have included a column called ReportedCases with the total number of reported Covid-19 cases in Scotland for each month. 

This dataset is smaller and only runs from July 2020 to May 2022. 

In the code cell below, read in this file and have a look at it.

In [None]:
# Read in the new file using pd.read_csv() and call it operations_covid
operations_covid =


In [None]:
#Have a look at the head of the file


### Let's format our date column again 

Every time we use a date, we need to make sure Python knows to treat it as one, so the column must be converted to the datetime type.

We will do just as we did before:

```operations_covid['Month'] = pd.to_datetime(operations_covid['Month'])```

Try this yourself in the cell below.

In [None]:
# Convert the Month column to datetime format



### Plot a line graph showing Covid-19 cases over time

In the code cell below, try plotting a line graph that shows the number of reported Covid-19 cases over time.

In [None]:
#Use sns.lineplot to plot a graph of reported covid-19 cases against time




# Advanced! Are cancelled operations and Covid-19 cases correlated?

### What is correlation?

If there is a correlation between two sets of data, it means they are connected in some way.

![correlation.png](attachment:correlation.png)

We can check for correlation using a scatter plot.

Remember, our dependent variable - the value that we think is being affected - always goes on the vertical axis. The variable that we think is causing the effect goes on the horizontal axis. 

### Plot percentage of operations cancelled against number of recorded Covid-19 cases.

We might hypothesise that an increase in Covid-19 cases could increase the percentage of operations that are cancelled. As we discussed earlier, if we want to draw a plot to look for a correlation we can use a scatter plot.

```sns.scatterplot(data = your_dataframe, x = "explanatory_column_name", y = "dependent_column_name")```

In this case our explanatory variable, the measurement that we think is causing the effect, is the number of Covid-19 cases (ReportedCases). Our dependent variable, the measurement whose change we are interested in, is the percentage of operations that were cancelled (PercentCancelled).
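Here is a minimal sketch of a labelled scatter plot using a completely different, made-up dataset, so you can see the pattern without it giving away the answer. The dataframe `example` and the columns `HoursOfSunshine` and `IceCreamsSold` are invented for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up numbers for illustration only
example = pd.DataFrame({
    "HoursOfSunshine": [2, 4, 5, 7, 9],   # explanatory variable (x axis)
    "IceCreamsSold": [20, 35, 50, 65, 90],  # dependent variable (y axis)
})

# Each row of the dataframe becomes one point on the plot
ax = sns.scatterplot(data=example, x="HoursOfSunshine", y="IceCreamsSold")

# Add a title and label both axes, including units where they help
plt.title("Example: ice creams sold against hours of sunshine")
plt.xlabel("Sunshine (hours per day)")
plt.ylabel("Ice creams sold")
```

In this made-up example the points rise from left to right, which is the kind of pattern to look for when deciding whether two variables might be correlated.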

**Draw an appropriately labelled scatterplot in the code cell below.**


Do you think reported cases and percentage of cancelled operations are correlated?  
  
If so, in what way? 
  
Why might this be the case?