# Empirical Project 1 - Working in Python

## Getting started in Python

## Preliminary Settings

Let's import the packages we'll need and also configure the settings we want:

In [None]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import pingouin as pg
import warnings
from lets_plot import *

LetsPlot.setup_html(no_js=True)


### You don't need to use these settings yourself
### — they are just here to make the book look nicer!
# Set the plot style for prettier charts:
plt.style.use(
    "https://raw.githubusercontent.com/aeturrell/core_python/main/plot_style.txt"
)
# Ignore warnings to make nicer output
warnings.simplefilter("ignore")

## Part 1.1 The behaviour of average surface temperature over time

Let's now grab the data directly and load it in Python.

### Python Walkthrough 1.1

**Importing the datafile into Python**

We want to import the datafile called ‘NH.Ts+dSST.csv’ from NASA's website into Python using Visual Studio Code.

We start by opening Visual Studio Code in the folder we'll be working in. Use File -> Open Folder to do this. We also need to ensure that the interactive Python window that we'll be using to run Python code opens in this folder. In Visual Studio Code, you can ensure that the interactive window starts in the folder that you have open by setting “Jupyter: Notebook File Root” to `${workspaceFolder}` in the Settings menu.

Now create a new file (File -> New File in the menu) and name it `exercise_1.py`. In the new and empty file, right-click and select "Run file in interactive window". This will launch the interactive Python window.

You will need to install the following packages: **pandas**, **numpy**, **lets-plot**, and **matplotlib**. **pandas**, **numpy**, and **matplotlib** provide extra functionality for data analysis, numbers, and plotting respectively. **lets-plot** provides some additional plotting functionality.

We will download the data directly from the internet into our Python session using the **pandas** library. We imported **pandas** as `pd` at the start of the script, so any direct command from that package is going to start with `pd.`.

We're going to use the `pd.read_csv` function to read in some data from a URL.

In [None]:
df = pd.read_csv(
    "https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv",
    skiprows=1,
    na_values="***",
)

When using the `read_csv` function, we added two options. If you open the spreadsheet in Excel, you will see that the real data only starts in Row 2, so we use the `skiprows=1` option to skip the first row when importing the data. When looking at the spreadsheet, you can also see that missing temperature data is coded as `***`. To ensure that columns containing missing data are still read in as numbers, we tell **pandas** that `na_values="***"`, which imports those missing values as `NaN` (meaning the number is missing).

Note that we ended each line with a comma. Arguments that get passed to functions like `read_csv` need to be separated by commas. You don't have to include the comma after the last argument, but some people prefer this as a matter of style.

You can also use `pd.read_csv` to open files that are stored locally on your computer. Instead of a URL, enter a file path to wherever you saved your data (enclosed in quote marks).
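
To see what the two options do without downloading anything, here is a minimal sketch that reads a small made-up CSV held in memory. The names `toy_csv`, `toy_clean`, and `toy_raw`, and the numbers themselves, are illustrative assumptions rather than part of the NASA data.

```python
import io

import pandas as pd

# A made-up CSV in the same shape as the NASA file: a title row, then a header row,
# with one missing value coded as ***
toy_csv = """Toy title row that is not data
Year,Jan,Feb
1880,-0.36,-0.51
1881,-0.31,***
"""

# With both options: the title row is skipped and *** becomes NaN
toy_clean = pd.read_csv(io.StringIO(toy_csv), skiprows=1, na_values="***")

# Without na_values: the *** entry means the Feb column is read as text, not numbers
toy_raw = pd.read_csv(io.StringIO(toy_csv), skiprows=1)

print(toy_clean.dtypes)
print(toy_raw.dtypes)
```

Comparing the two printed summaries shows why the `na_values` option matters: a single unrecognised entry is enough to stop a whole column being treated as numeric.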

To check that the data have been imported correctly, you can use the `.head()` function to view the first five rows of the dataset, and confirm that they correspond to the columns in the csv file.

In [None]:
df.head()

Before working with this data, we use the `.info()` function to check that the data were read in as numbers (either real numbers, `float64`, or integers, `int`) rather than strings.

In [None]:
df.info()

You can see that all variables are formatted as either `float64` or `int`, so Python correctly recognises that these data are numbers.

Try importing the data again without using the keyword argument option `na_values="***"` at all and see what difference it makes.

### Python Walkthrough 1.2

**Drawing a line chart of temperature and time**

We will now set the year as the *index* of the dataset. (This will make plotting the time series of temperature easier because the index can automatically take on the role of the x-axis.)

In [None]:
df = df.set_index("Year")
df.head()

In [None]:
df.tail()

The way we've set up our dataframe, the year variable is special because it forms the index. This has a lot of consequences for the operations we can do on the dataframe; for example, if we do whole-dataframe operations like `df+20`, 20 will be added to every value except the values (here years) in the index. 
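
As a quick check of this behaviour, the sketch below uses a tiny made-up dataframe (the name `toy` and its values are assumptions for illustration only) and adds 20 to it.

```python
import pandas as pd

# Made-up anomalies for three years (illustrative values only)
toy = pd.DataFrame({"Year": [1880, 1881, 1882], "Jan": [-0.4, -0.3, 0.1]})
toy = toy.set_index("Year")

# Whole-dataframe arithmetic changes the values but leaves the index alone
shifted = toy + 20

print(shifted)
```

The years in the index are unchanged, while every entry in the `Jan` column has gone up by 20.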

Next we'll make our chart. We will draw a line chart using data for January, `df["Jan"]` for the years 1880—2016. 


We'll use the **matplotlib** package for this (we imported its `pyplot` module at the start as `plt`). A typical **matplotlib** line chart has the following elements:

```python
# create a figure (a bit like a canvas) and an axis (ax)
# onto which to put chart elements
fig, ax = plt.subplots()
# select the column to use 'plot' on, and pass the ax object
# note that the x-axis is given by the index of the dataframe
df["column_name"].plot(ax=ax)
# set the labels and title
ax.set_ylabel("y label")
ax.set_xlabel("x label")
ax.set_title("title")
# show the plot
plt.show()
```

Let's see an example of this with the data for January.

In [None]:
fig, ax = plt.subplots()
df["Jan"].plot(ax=ax)
ax.set_ylabel("y label")
ax.set_xlabel("x label")
ax.set_title("title")
plt.show()

You'll see that we didn't have to say to use years as the x-axis. That's because the years are in the index, and, when we selected a single column of values representing January, **pandas** inferred that we wanted to use the index as the x-axis. Try out plotting

```python
fig, ax = plt.subplots()
ax.plot(df.index, df["Jan"])
ax.set_ylabel("y label")
ax.set_xlabel("x label")
ax.set_title("title")
plt.show()
```

you should find that you get exactly the same chart! But this time, we made it explicit that we wanted the x-axis to be drawn from the index.

Note that a lot of what we can do on the chart can be achieved by calling `ax.[something]` and that we finish by showing the chart with `plt.show()`. (You can also save charts to file with `plt.savefig("name-of-chart.pdf")`.)

Now, we'll do a nicer version of the chart. Rather than "hard code" the month in, we'll abstract the specific month into a "month" variable. That way, the code can easily be re-used for any month.

We'll add a horizontal line to make the chart easier to read using `ax.axhline`. As this zero is actually the average over the period we're looking at, we'll add some text annotation with `ax.annotate`. We specify this by passing in `x` and `y` values for where the text appears. For the `x` position, it's convenient to place it two thirds of the way across the figure regardless of how zoomed in the chart is so we pass "figure fraction" for the units of the x-coordinate; we just use the data for the y-coordinate of the text.

The next line plots January's data—but note we don't specify the x-axis for the data because that's in our dataframe's index.

Then we're onto labeling and titling the chart. We can make use of something here called an "f-string". These allow us to pass Python variables into a string. In this case, we made `month` a variable but we'd like to put it in the title. The simplest f-string that does this would be `f"Average temperature anomaly in {month}..."` which would result in a title of "Average temperature anomaly in Jan..." on the chart. You can pass as many variables into the string as you like, and we can use this to automatically retrieve the max year in the data via `df.index.max()`—this way, if we update our data, the chart title is automatically updated too!

In [None]:
month = "Jan"
fig, ax = plt.subplots()
ax.axhline(0, color="orange")
ax.annotate("1951—1980 average", xy=(0.66, -0.2), xycoords=("figure fraction", "data"))
df[month].plot(ax=ax)
ax.set_title(
    f"Average temperature anomaly in {month} \n in the northern hemisphere (1880—{df.index.max()})"
)
ax.set_ylabel("Annual temperature anomalies");

Try different values for `color`, `xy`, and the first argument of `ax.axhline` in the code above to figure out what these options do. (Note that `xycoords` sets the behaviour of `xy`.)

It is important to remember that all axis and chart titles should be enclosed in quotation marks (`""`), as well as any words that are not options (for example, colour names or filenames).
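
The f-string idea used for the chart title can be tried on its own, without any plotting. In this small sketch, `max_year` is a made-up placeholder standing in for whatever `df.index.max()` returns.

```python
# Stand-ins for the values used in the chart; max_year is a made-up placeholder
month = "Jan"
max_year = 2023

# Variables inside curly brackets are substituted into the string
title = f"Average temperature anomaly in {month} (1880 to {max_year})"
print(title)

# Expressions work inside the curly brackets too
print(f"Ten years after {max_year} is {max_year + 10}")
```

Changing `month` or `max_year` and re-running the lines updates the text automatically, which is exactly what makes the chart code reusable.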

### Python Walkthrough 1.3

**Producing a line chart for the annual temperature anomalies**

This is where the power of programming languages becomes evident: to produce the same line chart for a different variable, we simply take the code used in Python Walkthrough 1.2 and replace `"Jan"` with the name of the annual variable (`"J-D"`). We don't need to change the rest of the chart because we built it to be flexible: the code selects whichever column `month` names, so changing the value of `month` from `"Jan"` to `"J-D"` is enough.

In [None]:
month = "J-D"
fig, ax = plt.subplots()
ax.axhline(0, color="orange")
ax.annotate("1951—1980 average", xy=(0.68, -0.2), xycoords=("figure fraction", "data"))
df[month].plot(ax=ax)
ax.set_title(
    f"Average annual temperature anomaly \n in the northern hemisphere (1880–{df.index.max()})"
)
ax.set_ylabel("Annual temperature anomalies");

## Part 1.2 Variation in temperature over time


1. Using the monthly data for June, July, and August, create two frequency tables similar to Figure 1.5 for the years 1951–1980 and 1981–2010 respectively. The values in the first column should range from −0.3 to 1.05, in intervals of 0.05. See Python Walkthrough 1.4 for how to do this.

### Python Walkthrough 1.4

**Creating frequency tables and histograms**

Since we will be looking at data from different subperiods (year intervals) separately, we will create a categorical variable (a variable that has two or more categories) that indicates the subperiod for each observation (row). In Python this type of variable is called a ‘category’ or categorical. When we create a categorical column, we need to define the categories that this variable can take.

We'll achieve this using the `pd.cut` function, which arranges input data into a series of bins that can have labels. We'll give the data labels that reflect what period they correspond to here, and we'll also specify that there is an order for these categories.

In [None]:
df["Period"] = pd.cut(
    df.index,
    # pd.cut bins exclude their left edge by default, so start at 1920 to include 1921
    bins=[1920, 1950, 1980, 2010],
    labels=["1921—1950", "1951—1980", "1981—2010"],
    ordered=True,
)

We created a new variable called `"Period"` and defined the possible categories using the `labels=` keyword argument. Since we will not be using data for some years (before 1921 and after 2010), we want `"Period"` to take the value `NaN` (not a number) for these observations (rows), and the appropriate category for all the other observations. The `pd.cut` function does this automatically.

Let's take a look at the last 20 entries of the new column of data using `.tail`:

In [None]:
df["Period"].tail(20)
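
To see exactly which years land in which category, here is a small sketch on a handful of hand-picked years. The names `toy_years` and `toy_periods` and the hyphenated labels are assumptions for illustration. By default, `pd.cut` treats each bin as excluding its left edge and including its right edge, which is why the first bin edge here is 1920 rather than 1921.

```python
import pandas as pd

# A few hand-picked years around the bin edges
toy_years = pd.Index([1920, 1921, 1950, 1951, 2010, 2011], name="Year")

# By default each bin excludes its left edge and includes its right edge,
# so the first bin, (1920, 1950], covers the years 1921 to 1950
toy_periods = pd.cut(
    toy_years,
    bins=[1920, 1950, 1980, 2010],
    labels=["1921-1950", "1951-1980", "1981-2010"],
    ordered=True,
)

# Years outside every bin (1920 and 2011 here) come out as NaN
for year, period in zip(toy_years, toy_periods):
    print(year, period)
```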

We'd really like to combine the data from the three summer months. This is easy to do using the `.stack` function. Let's look at the first few rows of the data once stacked, using `.head()`:

In [None]:
list_of_months = ["Jun", "Jul", "Aug"]
df[list_of_months].stack().head()
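
If it isn't obvious what stacking does, this sketch on a tiny made-up table may help (the name `toy` and its values are illustrative assumptions).

```python
import pandas as pd

# Made-up summer anomalies for two years (illustrative values only)
toy = pd.DataFrame(
    {"Jun": [0.1, 0.2], "Jul": [0.3, 0.4], "Aug": [0.5, 0.6]},
    index=pd.Index([1951, 1952], name="Year"),
)

# Stacking turns the 2 x 3 table into a single column of 6 values,
# each labelled by a (Year, month) pair
toy_stacked = toy.stack()

print(toy_stacked)

# Individual values can be looked up with the (Year, month) pair
print(toy_stacked.loc[(1952, "Jul")])
```

Because all the values now sit in one column, they can be passed to a single histogram regardless of which month they came from.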

Now we need to think about how we can plot the three different periods. **matplotlib** has plenty of ways to do this; one of the easiest is to ask for more than one axis object to put plots on. So, in the below, we ask for `ncols=3`, and this returns multiple `axes` instead of just one axis called `ax`. `axes` is actually an array (which behaves much like a list here) whose entries are the individual plots. To iterate over both axes and periods, we use the `zip` function, which works exactly like a zipper: it brings together one entry from each list in turn, so here one axis and one period. We can use this to plot the histogram data one axis at a time in the zipped `for` loop. Within the loop the data are filtered just to the period, using `==period`, and months, using `list_of_months`, that we want.

Finally we set an overall title and a single x-axis label (on the middle chart).

In [None]:
fig, axes = plt.subplots(ncols=3, figsize=(9, 4), sharex=True, sharey=True)
for ax, period in zip(axes, df["Period"].dropna().unique()):
    df.loc[df["Period"] == period, list_of_months].stack().hist(ax=ax)
    ax.set_title(period)
plt.suptitle("Histogram of temperature anomalies")
axes[1].set_xlabel("Summer temperature distribution")
plt.tight_layout();

To explain what a histogram displays, let's take a look at the histogram for the period from 1921–1950. On the x-axis we have a number of bins, for example 0 to 0.1, 0.1 to 0.2, and so on. The height of the bar over each interval represents the number of anomalies that fall in that interval. The bar with the greatest height indicates the most frequently encountered temperature interval.

As you can see from the data for this earlier period, there are virtually no temperature anomalies larger than 0.3. The height of these bars gives a useful overview of the *distribution* of the temperature anomalies.

Now consider how this distribution changes as we move through the three distinct time periods. The distribution is clearly moving to the right for the period 1981–2010, which is an indication that the temperature is increasing; in other words, an indication of global warming.
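
A histogram is a picture of a frequency table, so the counts behind it can also be written out as a table. The sketch below is one possible way to do that, using **numpy**'s `np.histogram` with explicit bin edges from −0.3 to 1.05 in steps of 0.05; this is an assumption about how you might approach it, not the method used in the walkthrough above. The values in `toy_values` are made up.

```python
import numpy as np
import pandas as pd

# Made-up anomalies purely for illustration (not NASA data)
toy_values = pd.Series([-0.12, -0.03, 0.02, 0.04, 0.11, 0.13, 0.27])

# 28 bin edges from -0.30 to 1.05 give 27 intervals of width 0.05;
# rounding tidies up tiny floating-point errors in the edges
bin_edges = np.round(np.linspace(-0.30, 1.05, 28), 2)

# Count how many values fall in each interval
counts, _ = np.histogram(toy_values, bins=bin_edges)

freq_table = pd.DataFrame(
    {"bin_start": bin_edges[:-1], "bin_end": bin_edges[1:], "count": counts}
)

# Show only the intervals that contain at least one value
print(freq_table[freq_table["count"] > 0])
```

To use this on the real data, you could replace `toy_values` with the stacked summer values for one period at a time.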

1. The New York Times article considers the bottom third (the lowest or coldest one-third) of temperature anomalies in 1951–1980 as ‘cold’ and the top third (the highest or hottest one-third) of anomalies as ‘hot’. In decile terms, temperatures in the 1st to 3rd decile are ‘cold’ and temperatures in the 7th to 10th decile or above are ‘hot’. Use Python and **numpy**’s `np.quantile` function to determine what values correspond to the 3rd and 7th decile across all months in 1951–1980. (See Python Walkthrough 1.5 for an example.)


### Python Walkthrough 1.5

**Using the `np.quantile` function**

First, we need to create a variable that contains all monthly anomalies in the years 1951—1980. Then, we'll use **numpy**'s `np.quantile` function to find the required percentiles (0.3 and 0.7 refer to the 3rd and 7th deciles, respectively).

To help us create a variable that encodes this information, we're going to filter our **pandas** dataframe. We want to filter it both by the index and by only the single month columns. We can do both of these at once using the `.loc` method. The `.loc` method works like this: `df.loc[[rows you would like], [columns you would like]]`. For rows, we're going to ask for everything in between two year limits. For columns, we can do something called "slicing", which selects every column within a given range, in the format `"first column you want":"last column you want"`.
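
As a small illustration of this pattern, here is a sketch on made-up data (the name `toy` and its values are assumptions for illustration only).

```python
import pandas as pd

# Made-up data: years in the index, three monthly columns plus an annual one
toy = pd.DataFrame(
    {
        "Jan": [0.1, 0.2, 0.3, 0.4],
        "Feb": [0.0, 0.1, 0.2, 0.3],
        "Mar": [0.2, 0.3, 0.4, 0.5],
        "J-D": [0.1, 0.2, 0.3, 0.4],
    },
    index=pd.Index([1950, 1951, 1952, 1953], name="Year"),
)

# Rows: a True/False mask built from the index; columns: a label slice.
# Unlike ordinary Python slices, label slices include the end label ("Mar")
toy_subset = toy.loc[(toy.index >= 1951) & (toy.index <= 1952), "Jan":"Mar"]

print(toy_subset)
```

Only the rows for 1951 and 1952 and the columns from `Jan` to `Mar` survive; the `J-D` column is dropped because it lies outside the slice.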

Note: You may get slightly different values to those shown here if you are using the latest data.

In [None]:
# Create a variable that has years 1951 to 1980, and months Jan to Dec (inclusive)
temp_all_months = df.loc[(df.index >= 1951) & (df.index <= 1980), "Jan":"Dec"]
# Put all the data in stacked format and give the new columns sensible names
temp_all_months = (
    temp_all_months.stack()
    .reset_index()
    .rename(columns={"level_1": "month", 0: "values"})
)
# Take a look at this data:
temp_all_months

In [None]:
quantiles = [0.3, 0.7]
list_of_percentiles = np.quantile(temp_all_months["values"], q=quantiles)

print(f"The cold threshold of {quantiles[0]*100}% is {list_of_percentiles[0]}")
print(f"The hot threshold of {quantiles[1]*100}% is {list_of_percentiles[1]}")

After we have filled it, the variable `list_of_percentiles` (which is a **numpy** array that behaves much like a list) contains two numbers: the 30th percentile (1st value) and the 70th percentile (2nd value). When we print out these values using f-strings, we want to access the 1st and 2nd value for the cold and hot thresholds respectively. Python indexes lists and arrays from zero (not from 1!), so, to access the 1st entry, we use `list_of_percentiles[0]`.

It may seem odd to index from zero but many programming languages do this and, well, [some programmers think it's better](https://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/EWD831.html) (while some think it's worse—they're more similar to economists than you'd think.)
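
Both ideas, quantiles and zero-based indexing, can be checked on numbers that are easy to reason about. This sketch uses the whole numbers 0 to 10 instead of temperature data; `toy_values` and `toy_thresholds` are names made up for the example.

```python
import numpy as np

# The whole numbers 0 to 10, so the deciles are easy to check by eye
toy_values = np.arange(11)

# Ask for the 3rd and 7th deciles in one call; the result is a numpy array
toy_thresholds = np.quantile(toy_values, q=[0.3, 0.7])

# Indexing starts at zero: [0] is the first entry, [1] is the second
print(f"3rd decile: {toy_thresholds[0]:.1f}")
print(f"7th decile: {toy_thresholds[1]:.1f}")
```

With eleven evenly spaced values, the 3rd and 7th deciles fall on the values 3 and 7 themselves, which makes it easy to confirm which entry is which.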

### Python Walkthrough 1.6

**Computing the proportion of anomalies at a given quantile using the `.mean()` function**

*Note*: You may get slightly different values to those shown here if you are using the latest data.

We repeat the steps used in Python Walkthrough 1.5, now looking at monthly anomalies in the years 1981—2010. We can simply change the year values in the code from Python Walkthrough 1.5.


In [None]:
# Create a variable that has years 1981 to 2010, and months Jan to Dec (inclusive)
temp_all_months = df.loc[(df.index >= 1981) & (df.index <= 2010), "Jan":"Dec"]
# Put all the data in stacked format and give the new columns sensible names
temp_all_months = (
    temp_all_months.stack()
    .reset_index()
    .rename(columns={"level_1": "month", 0: "values"})
)
# Take a look at the start of this data:
temp_all_months.head()

Now that we have all the monthly data for 1981–2010, we want to count the proportion of observations that are smaller than –0.1. We'll first create a *binary indicator* (i.e. it's True or False) that says, for each row (observation) in `temp_all_months`, whether the number is lower than the 0.3 quantile or not (given by `list_of_percentiles[0]`). Then we'll take the mean of this list of True and False values; when you take the mean of binary variables, each True evaluates to 1 and each False to 0, so the mean gives us the proportion of entries in `temp_all_months` that are lower than the 0.3 quantile:

In [None]:
entries_less_than_q30 = temp_all_months["values"] < list_of_percentiles[0]
proportion_under_q30 = entries_less_than_q30.mean()
print(
    f"The proportion under {list_of_percentiles[0]} is {proportion_under_q30*100:.2f}%"
)

When we printed out the answer, we used some *number formatting*. This is written as `:.2f` within the curly brackets part of an f-string—this tells Python to display the number with two decimal places. You should also note that, as well as the mean given by `.mean()`, there are various other built-in functions like `.std()` for the standard deviation and `.var()` for the variance.

We can now see that between 1951 and 1980, 30% of observations for the temperature anomaly were smaller than –0.10, but between 1981 and 2010 only about two per cent of months are considered cold. That is a large change.

Let’s check whether we get a similar result for the number of observations that are larger than 0.11.

In [None]:
proportion_over_q70 = (temp_all_months["values"] > list_of_percentiles[1]).mean()
print(f"The proportion over {list_of_percentiles[1]} is {proportion_over_q70*100:.2f}%")
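
The trick of taking the mean of True and False values, used in both cells above, is easier to follow on a handful of numbers. In this sketch the values and the threshold are made up for illustration.

```python
import pandas as pd

# Made-up anomalies and a made-up threshold (illustrative only)
toy_values = pd.Series([-0.3, -0.1, 0.2, 0.4])
toy_threshold = 0.0

# Comparing gives one True/False per entry
is_below = toy_values < toy_threshold

# True counts as 1 and False as 0, so the mean is the share of True values
share_below = is_below.mean()

# :.2f formats the number with two decimal places
print(f"{share_below * 100:.2f}% of entries are below {toy_threshold}")
```

Two of the four values are below zero, so the share is one half.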

### Python Walkthrough 1.7

**Calculating and understanding mean and variance**

The process of computing the mean and variance separately for each period and season would be quite tedious. We would prefer a way to cover all of them at once. Let's re-stack the data in a form where `season` is one of the columns and could take the values `DJF`, `MAM`, `JJA`, or `SON`. Let's also have a peek at the structure of the data while we're at it:

In [None]:
temp_all_months = (
    df.loc[:, "DJF":"SON"]
    .stack()
    .reset_index()
    .rename(columns={"level_1": "season", 0: "values"})
)
temp_all_months["Period"] = pd.cut(
    temp_all_months["Year"],
    # pd.cut bins exclude their left edge by default, so start at 1920 to include 1921
    bins=[1920, 1950, 1980, 2010],
    labels=["1921—1950", "1951—1980", "1981—2010"],
    ordered=True,
)
# Take a look at a slice of the data using `.iloc`, which selects rows by position
temp_all_months.iloc[-135:-125]

Now we'll take the mean and variance at once.

The following line of code does several things at once. First, it allocates each observation (a combination of year, season, period, and temperature anomaly) to one of 12 groups defined by the different possible combinations of season and period, e.g. "MAM; 1981–2010". Think of these twelve buckets as each having a label like "MAM; 1981–2010" or "JJA; 1921–1950". This is grouping our data, and we do it with a `groupby` operation, to which we pass a list of the variables we'd like to group by; here that is `"season"` and `"Period"`.

The variable we'd like to apply the grouping to is `"values"` so we then filter down to just the `"values"` column with `["values"]`.

Then, once we have allocated each observation in the "values" column to one of these buckets (groups), we ask **pandas** to calculate the mean and variance of the observations in each group. That is what `agg(["mean", "var"])` does; passing the names as strings asks for **pandas**' own mean and (sample) variance.

In [None]:
grp_mean_var = temp_all_months.groupby(["season", "Period"])["values"].agg(
    ["mean", "var"]
)
grp_mean_var

We can see that the variances seem to remain fairly constant across the first two periods, but they do increase markedly for the 1981–2010 period.
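
To make the grouping step concrete, here is a minimal sketch on made-up tidy data with two seasons and two periods. The period labels `early` and `late` and all the values are assumptions for illustration. The aggregations are requested by their string names, `"mean"` and `"var"`, which **pandas** accepts; its variance is the sample variance (dividing by n − 1).

```python
import pandas as pd

# Made-up tidy data: one row per observation (illustrative values only)
toy = pd.DataFrame(
    {
        "season": ["DJF", "DJF", "DJF", "DJF", "JJA", "JJA", "JJA", "JJA"],
        "Period": ["early", "early", "late", "late", "early", "early", "late", "late"],
        "values": [1.0, 3.0, 2.0, 2.0, 0.0, 4.0, 5.0, 7.0],
    }
)

# Group by both columns, keep the "values" column, then summarise each group
toy_summary = toy.groupby(["season", "Period"])["values"].agg(["mean", "var"])

print(toy_summary)
```

Each of the four season and period combinations gets its own row, with one column for the mean and one for the variance.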

We can plot a line chart to see these changes graphically—to do that in the most convenient way, we're going to make use of a different plotting library that works really well with data that is arranged in this format where *each row is an observation, each column is a variable, and each entry is a value*. This is called tidy data.

There are broadly two categories of approach to using code to create data visualisations: imperative, where you build what you want, and declarative, where you say what you want. Choosing which to use involves a trade-off: imperative libraries (like **matplotlib**) offer you flexibility but at the cost of some verbosity; declarative libraries offer you a quick way to plot your data, but only if it’s in the right format to begin with (the *tidy* format!), and customisation to special chart types is more difficult. Python has many excellent plotting packages, but for this book we recommend a tidy data-friendly declarative library called **lets-plot** for the vast majority of charts you might want to make and then **matplotlib** when you need something much more custom.

The **lets-plot** plotting library follows what is called a *grammar of graphics* approach. You don't need to worry too much about what that means for now, only that it is a coherent declarative system for describing and building graphs. There is some background information that you might find useful in getting to grips with **lets-plot**. All plots are composed of the data, the information you want to visualise, and a mapping: the description of how the data’s variables are mapped to aesthetic attributes. There are five mapping components:

- A *layer* is a collection of geometric elements and statistical transformations. Geometric elements, *geoms* for short, represent what you actually see in the plot: points, lines, polygons, etc. Statistical transformations, stats for short, summarise the data: for example, binning and counting observations to create a histogram, or fitting a linear model.

- *Scales* map values in the data space to values in the aesthetic space. This includes the use of colour, shape or size. Scales also draw the legend and axes.

- A *coord*, or coordinate system, describes how data coordinates are mapped to the plane of the graphic. It also provides axes and gridlines to help read the graph. We normally use the Cartesian coordinate system, but a number of others are available, including polar coordinates and map projections.

- A *facet* specifies how to break up and display subsets of data as small multiples.

- A *theme* controls the finer points of display, like the font size and background colour. While the defaults have been chosen with care, you may need to consult other references to create an attractive plot.

At the start of this chapter, we imported **lets-plot** and set it up using `LetsPlot.setup_html(no_js=True)`, so we should be ready to use it!

In [None]:
min_year = 1880
(
    ggplot(temp_all_months, aes(x="Year", y="values", color="season"))
    + geom_abline(slope=0, color="black", size=1)
    + geom_text(
        x=min_year, y=0.1, label="1951—1980 average", hjust="left", color="black"
    )
    + geom_line(size=1)
    + labs(
        title=f"Average annual temperature anomaly \n in the northern hemisphere ({min_year}–{temp_all_months['Year'].max()})",
        y="Annual temperature anomalies",
    )
    + scale_x_continuous(format="")
)

Let's talk through what each part did here, in case it's not clear:

- `ggplot` took the data and a second argument, the `aes` (short for aesthetic mappings) function.
- in `aes`, we passed the mappings we wanted: year along the x-axis, the values column on the y, and colour to distinguish between different seasons (via `color="season"`)
- `geom_abline` adds a straight line defined by its slope and intercept; with `slope=0` it is a horizontal line, in this case along the x-axis at y=0.
- `geom_text` adds the text annotation.
- `geom_line` adds a line for each season. Because we already said `color="season"` earlier, we actually get as many lines as there are seasons in our data
- `labs` sets the labels
- `scale_x_continuous(format="")` tells **lets-plot** that the x-axis is a continuous (rather than discrete) scale, and that the format should just be a number (the default is to have a comma for every thousand, which isn't what we want for displaying years).

## Part 1.3 Carbon emissions and the environment

**Learning objectives for this part**

- use scatterplots and the correlation coefficient to assess the degree of association between two variables
- explain what correlation measures and the limitations of correlation.

The government has heard that carbon emissions could be responsible for climate change, and has asked you to investigate whether this is the case. To do so, we are now going to look at carbon emissions over time, and use another type of chart, the scatterplot, to show their relationship to temperature anomalies. One way to measure the relationship between two variables is correlation. Python Walkthrough 1.8 explains what correlation is and how to calculate it.

In the questions below, we will make charts using the CO2 data from the US National Oceanic and Atmospheric Administration. Download the [Excel spreadsheet](https://tinyco.re/3763425) containing this data. Save the data as a csv file in a sub-directory named "data" inside the folder you have open in Visual Studio Code. Import the csv that's now in "data/1_CO2-data.csv" into Python using the **pandas** `read_csv` function; the code might look like `df_co2 = pd.read_csv("data/1_CO2-data.csv")`.

### Python Walkthrough 1.8

**Scatterplots and the correlation coefficient**

First we will use the `pd.read_csv` function to import the CO2 datafile into Python, and call it `df_co2`.

In [None]:
df_co2 = pd.read_csv("data/1_CO2-data.csv")
df_co2.head()

This file has monthly data, but in contrast to the data in `df` from earlier, the data is all in so-called tidy format (one observation per row, one column per variable). To make this task easier, we will pick only the June data from the CO2 emissions and add them as an additional variable to the `df` dataset.

Python's **pandas** package has a convenient function called `merge` to do this. First we create a new dataset that contains only the June emissions data (`df_co2_june`).

In [None]:
df_co2_june = df_co2.loc[df_co2["Month"] == 6]
df_co2_june.head()

Then we use this data in the `pd.merge` function. The merge function takes the original `df` and the `df_co2_june` and merges (combines) them together. To merge, we need some commonality between the columns (or indexes) in the dataframe. In this case, the common variable is `"Year"` so we will use that to do the merging on.

(*Extension*: Hover your cursor over `pd.merge` in Visual Studio Code, type `help(pd.merge)` into the interactive window, or Google ‘pandas merge’ to see the many other options that `pd.merge` allows.)

In [None]:
df_temp_co2 = pd.merge(df_co2_june, df, on="Year")
df_temp_co2[["Year", "Jun", "Trend"]].head()

It looks like it worked! We now have some extra columns from the carbon dioxide data in addition to the temperature anomaly data from before.
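
If you'd like to see what a merge does row by row, here is a minimal sketch with two tiny made-up tables. The names `toy_temp` and `toy_co2` and all the numbers are assumptions for illustration; to keep things simple, both tables hold `Year` as an ordinary column.

```python
import pandas as pd

# Made-up temperature and CO2 tables that share a "Year" column
toy_temp = pd.DataFrame({"Year": [1958, 1959, 1960], "Jun": [0.05, 0.10, 0.02]})
toy_co2 = pd.DataFrame({"Year": [1959, 1960, 1961], "Trend": [315.9, 316.6, 317.3]})

# The default is an inner join: only years present in both tables survive
toy_merged = pd.merge(toy_co2, toy_temp, on="Year")

print(toy_merged)
```

The years 1958 and 1961 each appear in only one table, so they are dropped; this is worth remembering when the two datasets you merge cover different time spans.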

To make a scatterplot, we use **lets-plot** again:

In [None]:
(
    ggplot(df_temp_co2, aes(x="Jun", y="Trend"))
    + geom_point(color="black", size=3)
    + labs(
        title="Scatterplot of temperature anomalies vs carbon dioxide emissions",
        y="Carbon dioxide levels (trend, mole fraction)",
        x="Temperature anomaly (degrees Celsius)",
    )
)

To calculate the correlation coefficient, we can use the `.corr()` function. *Note*: you may get slightly different results if you are using the latest data.

In [None]:
df_temp_co2[["Jun", "Trend"]].corr(method="pearson")

In this case, the correlation coefficient tells us that an upward-sloping straight line is quite a good fit to the data (as seen on the scatterplot). There is a strong positive association between the two variables (higher temperature anomalies are associated with higher CO2 levels).


One limitation of this correlation measure is that it only tells us about the strength of the upward- or downward-sloping linear relationship between two variables, in other words how closely the scatterplot aligns along an upward- or downward-sloping straight line. The correlation coefficient cannot tell us if the two variables have a different kind of relationship (such as that represented by a wavy line).
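
This limitation is easy to demonstrate with made-up numbers. In the sketch below (all names and values are assumptions for illustration), one variable is an exact straight-line function of `x` and the other is an exact U-shaped function of `x`, yet only the first produces a correlation near 1.

```python
import pandas as pd

# Made-up values chosen so the answers are easy to check
x = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])
y_line = 3 * x + 1  # an exact upward-sloping straight line
y_curve = x**2  # an exact U-shaped relationship

# Series.corr computes the Pearson correlation by default
corr_line = x.corr(y_line)
corr_curve = x.corr(y_curve)

print(f"Straight line: {corr_line:.2f}")
print(f"U-shape: {corr_curve:.2f}")
```

The U-shaped variable is perfectly determined by `x`, but its correlation with `x` is essentially zero because the relationship is not a straight line.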

*Note*: The word ‘strong’ is often used for coefficients that are close to 1 or −1, and ‘weak’ is often used for coefficients that are close to 0, though there is no precise range of values that are considered ‘strong’ or ‘weak’.

If you need more insight into correlation coefficients, you may find it helpful to watch online tutorials such as [‘Correlation coefficient intuition’](https://tinyco.re/4363520) from the Khan Academy.

As we are dealing with time-series data, it is often more instructive to look at a line plot, as a scatterplot cannot convey how the observations relate to each other in the time dimension.

Let’s start by plotting the June temperature anomalies.


In [None]:
(
    ggplot(df_temp_co2, aes(x="Year", y="Jun"))
    + geom_line(size=1)
    + labs(
        title="June temperature anomalies",
    )
)

Typically, we would now only need to add a line for the second variable to the same chart. The issue, however, is that the CO2 emissions variable (`Trend`) is on a different scale, and the automatic vertical axis scale (from –0.2 to about 1.2) would not allow `Trend` to be displayed. To resolve this issue, we can introduce a second panel and put the two charts side by side. You can think of the new plotting space as being like a table, with an overall title and two columns.

We'll do this using **lets-plot** again. We're going to create a base plot and save it to a variable called `base_plot`, then we'll create two panels of an overall figure by adding different elements (one for June, one for the trend) onto this base plot. Finally, we'll bring them both together using the `gggrid` function.

In [None]:
base_plot = ggplot(df_temp_co2) + scale_x_continuous(format="")
plot_p = (
    base_plot
    + geom_line(aes(x="Year", y="Jun"), size=1)
    + labs(title="June temperature anomalies")
)
plot_q = (
    base_plot
    + geom_line(aes(x="Year", y="Trend"), size=1)
    + labs(title="Carbon dioxide emissions")
)
gggrid([plot_p, plot_q], ncol=2)

This chapter used the following packages where *sys* is the Python version:

In [None]:
%load_ext watermark
%watermark --iversions