# Empirical Project 1 - Working in Python

## Getting started in Python

TODO

## Part 1.1 The behaviour of average surface temperature over time

**Learning objectives for this part**

- use line charts to describe the behaviour of real-world variables over time.

In the questions below, we look at data from NASA about land–ocean temperature anomalies in the northern hemisphere. Figure 1.1 is constructed using this data, and shows temperatures in the northern hemisphere over the period 1880–2016, expressed as differences from the average temperature from 1951 to 1980. We start by creating charts similar to Figure 1.1, in order to visualize the data and spot patterns more easily.

![](https://www.core-econ.org/doing-economics/book/images/web/figure-01-01.jpg)

Before plotting any charts, we need to get hold of the data. We can use Python to download it directly from the web straight into our code session. But you can also download it separately to take a look in a spreadsheet program such as Microsoft Excel, or in a text editor such as Visual Studio Code. Here's how to get hold of it:

- Go to NASA’s [Goddard Institute for Space Studies website](https://tinyco.re/2515719).
- Under the subheading ‘Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies’, select the CSV version of ‘Northern Hemisphere-mean monthly, seasonal, and annual means’ (right-click and select ‘Save Link As…’).
- The default name of this file is NH.Ts+dSST.csv. Give it a suitable name and save it in an easily accessible location, such as a folder in your documents directory.

1. In this dataset, temperature is measured as ‘anomalies’ rather than as absolute temperature. Using NASA’s Frequently Asked Questions section as a reference, explain in your own words what temperature ‘anomalies’ means. Why have researchers chosen this particular measure over other measures (such as absolute temperature)?

Let's now grab the data directly and load it in Python.

### Python Walkthrough 1.1

**Importing the datafile into Python**

We want to import the datafile called ‘NH.Ts+dSST.csv’ from NASA's website into Python using Visual Studio Code.

We start by opening Visual Studio Code in the folder we'll be working in. Use File -> Open Folder to do this. We also need to ensure that the interactive Python window that we'll be using to run Python code opens in this folder. In Visual Studio Code, you can ensure that the interactive window starts in the folder that you have open by setting “Jupyter: Notebook File Root” to `${workspaceFolder}` in the Settings menu.

Now create a new file (File -> New File in the menu) and name it `exercise_1.py`. In the new and empty file, right-click and select "Run file in interactive window". This will launch the interactive Python window.

You will need to install the following packages: **pandas**, **numpy** and **matplotlib**. Packages are add-ons that extend the functionality of the Python language. To install packages, first open a terminal: use the <kbd>⌘</kbd> + <kbd>\`</kbd> keyboard shortcut (Mac) or <kbd>ctrl</kbd> + <kbd>\`</kbd> (Windows/Linux), or click "View > Terminal". A new box, called the terminal, should appear at the bottom of the screen. Then type `pip install packagename` into this box, replacing `packagename` with the name of the package you want.

**pandas**, **numpy** and **matplotlib** provide extra functionality for data analysis, numbers, and plotting respectively.

We will download the data directly from the internet into our Python session using the **pandas** library (which provides data manipulation in Python).

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.read_csv(
    "https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv",
    skiprows=1,
    na_values="***",
)

When using the `read_csv` function, we added two options. If you open the file in Excel, you will see that the real data only starts in Row 2, so we use the `skiprows=1` option to skip the first row when importing the data. You can also see that missing temperature data is coded as `***`. To ensure that the temperature columns are still read in as numbers, we tell **pandas** that `na_values="***"`, which imports those entries as `NaN` (‘not a number’, the marker **pandas** uses for a missing value).

You can also use `pd.read_csv` to open files that are stored locally on your computer. Instead of a URL, enter a file path to wherever you saved your data (enclosed in quote marks).
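As a rough illustration of reading a local file with the same two options, the sketch below writes a tiny made-up file (the file name `toy_anomalies.csv` and all of its values are invented for this example) that is laid out like the NASA one, and then reads it back in.

```python
import tempfile
from pathlib import Path

import pandas as pd

# A tiny made-up file laid out like the NASA one:
# a title row first, then the column names, then the data
toy_csv = (
    "Toy temperature anomalies (illustrative values only)\n"
    "Year,Jan,Feb\n"
    "2000,0.25,0.40\n"
    "2001,***,0.35\n"
)

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "toy_anomalies.csv"  # hypothetical file name
    path.write_text(toy_csv)
    # skiprows=1 skips the title row; na_values turns "***" into NaN
    toy = pd.read_csv(path, skiprows=1, na_values="***")

print(toy)
print(toy.dtypes)
```

Without `na_values="***"`, the `Jan` column would be read in as text rather than as numbers, because one of its entries is not a number.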

To check that the data have been imported correctly, you can use the `.head()` function to view the first five rows of the dataset, and confirm that they correspond to the columns in the csv file.

In [None]:
df.head()

Before working with this data, we use the `.info()` function to check that the data were read in as numbers (either real numbers, `float64`, or integers, `int`) rather than strings.

In [None]:
df.info()

You can see that all variables are formatted as either `float64` or `int`, so Python correctly recognises that these data are numbers.

Now create some line charts using monthly, seasonal, and annual data, which help us look for general patterns over time.

2. Choose one month and plot a line chart with average temperature anomaly on the vertical axis and time (from 1880 to the latest year available) on the horizontal axis. Label each axis appropriately and give your chart a suitable title (Refer to Figure 1.1 as an example.)

### Python Walkthrough 1.2

**Drawing a line chart of temperature and time**

We will now set the year as the *index* of the dataset. This will make plotting the time series of temperature easier.

In [None]:
df = df.set_index("Year")
df.head()

In [None]:
df.tail()

Next we'll make our chart. We'll use the **matplotlib** package for this. The built-in style isn't very attractive so we'll first switch to a prettier one:

In [None]:
plt.style.use(
    "https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt"
)
plt.rcParams["figure.figsize"] = [6, 3]
plt.rcParams["figure.dpi"] = 150

We can now use these variables to draw line charts using the plot function. As an example, we will draw a line chart using data for January, `df["Jan"]`, for the years 1880–2016.

The first line creates a variable for the month (in case we'd like to replot a different month later), the second creates an empty chart, and the next two create a horizontal line and annotate it. Then `df[month].plot(ax=ax)` adds the temperature data to the chart. Finally, the title and y-axis label are set. We ensure that the month and year used in the title are always up to date and consistent with the data that we're pulling in by using the variables `month` and `df.index.max()` to write the month and maximum year into the title respectively. (This trick is called an f-string, short for formatted string literal: the string begins with an `f` before the opening quotation mark, and any expression inside curly brackets is replaced by its value.)

In [None]:
month = "Jan"
fig, ax = plt.subplots()
ax.axhline(0, color="orange")
ax.annotate("1951—1980 average", xy=(0.66, -0.2), xycoords=("figure fraction", "data"))
df[month].plot(ax=ax)
ax.set_title(
    f"Average temperature anomaly in {month} \n in the northern hemisphere (1880—{df.index.max()})"
)
ax.set_ylabel("Annual temperature anomalies");

Try different values for `color`, `xy`, and the first argument of `ax.axhline` in the plot function to figure out what these options do. (Note that `xycoords` sets the behaviour of `xy`.)

It is important to remember that all axis and chart titles should be enclosed in quotation marks (`""`), as well as any words that are not options (for example, colour names or filenames).
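To see how the f-string in the title works on its own, here is a minimal sketch. The variable `last_year` and its value are stand-ins chosen for this example; in the walkthrough the year comes from `df.index.max()`.

```python
month = "Jan"
last_year = 2016  # hypothetical value standing in for df.index.max()

# Expressions inside curly brackets are replaced by their values
title = f"Average temperature anomaly in {month} (1880 to {last_year})"
print(title)

# A format specification after a colon controls how a number is displayed
rounded = f"{0.12345:.2f}"
print(rounded)
```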

3. *Extra practice*: The columns labelled `"DJF"`, `"MAM"`, `"JJA"`, and `"SON"` contain seasonal averages (means). For example, the `"MAM"` column contains the average of the March, April, and May columns for each year. Plot a separate line chart for each season, using average temperature anomaly for that season on the vertical axis and time (from 1880 to the latest year available) on the horizontal axis.

4. The column labelled `"J-D"` contains the average temperature anomaly for each year.
   - a) Plot a line chart with annual average temperature anomaly on the vertical axis and time (from 1880 to the latest year available) on the horizontal axis. Your chart should look like Figure 1.1. *Extension*: Add a horizontal line that intersects the vertical axis at 0, and label it ‘1951–1980 average’.
   - b) What do your charts from Questions 2 to 4(a) suggest about the relationship between temperature and time?


### Python Walkthrough 1.3

**Producing a line chart for the annual temperature anomalies**

This is where the power of programming languages becomes evident: to produce the same line chart for a different variable, we simply take the code used in Python Walkthrough 1.2 and replace `"Jan"` with the name of the annual variable (`"J-D"`).

In [None]:
month = "J-D"
fig, ax = plt.subplots()
ax.axhline(0, color="orange")
ax.annotate("1951—1980 average", xy=(0.68, -0.2), xycoords=("figure fraction", "data"))
df[month].plot(ax=ax)
ax.set_title(
    f"Average annual temperature anomaly \n in the northern hemisphere (1880–{df.index.max()})"
)
ax.set_ylabel("Annual temperature anomalies");
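One way to avoid copying and pasting the plotting code for each new column is to wrap it in a function. The sketch below is only one possible approach: the function name `plot_anomaly` and the toy numbers are invented for illustration and are not part of the NASA data.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_anomaly(data, column, label):
    """Draw a line chart of one column of `data` against its index."""
    fig, ax = plt.subplots()
    ax.axhline(0, color="orange")  # reference line at zero
    data[column].plot(ax=ax)
    ax.set_title(
        f"{label} temperature anomaly ({data.index.min()}-{data.index.max()})"
    )
    ax.set_ylabel("Temperature anomaly")
    return ax


# Made-up numbers standing in for the real data, for illustration only
toy = pd.DataFrame(
    {"DJF": [0.1, 0.2, 0.15, 0.3, 0.4], "JJA": [0.2, 0.25, 0.3, 0.35, 0.5]},
    index=pd.Index(range(2000, 2005), name="Year"),
)

# One chart per column, using the same function each time
axes = [plot_anomaly(toy, season, season) for season in ["DJF", "JJA"]]
```

With the real data, the same function could be called as `plot_anomaly(df, "J-D", "Annual")`, or once for each seasonal column.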

5. You now have charts for three different time intervals: month (Question 2), season (Question 3), and year (Question 4). For each time interval, discuss what we can learn about patterns in temperature over time that we might not be able to learn from the charts of other time intervals.
6. Compare your chart from Question 4 to Figure 1.4, which also shows the behaviour of temperature over time using data taken from the National Academy of Sciences.
    - a) Discuss the similarities and differences between the charts. (For example, are the horizontal and vertical axes variables the same, or do the lines have the same shape?)
    - b) Looking at the behaviour of temperature over time from 1000 to 1900 in Figure 1.4, are the observed patterns in your chart unusual?
    - c) Based on your answers to Questions 4 and 5, do you think the government should be concerned about climate change?


![](https://www.core-econ.org/doing-economics/book/images/web/r-figure-01-04.jpg)

## Part 1.2 Variation in temperature over time

**Learning objectives for this part**

- summarize data in a frequency table, and visualize distributions with column charts
- describe a distribution using mean and variance

Aside from changes in the mean temperature, the government is also worried that climate change will result in more frequent extreme weather events. The island has experienced a few major storms and severe heat waves in the past, both of which caused serious damage and disruption to economic activity.

Will weather become more extreme and vary more as a result of climate change? A [New York Times article](https://tinyco.re/8697554) uses the same temperature dataset you have been using to investigate the distribution of temperatures and temperature variability over time. Read through the article, paying close attention to the descriptions of the temperature distributions.

We can use the mean and median to describe distributions, and we can use deciles to describe parts of distributions. To visualize distributions, we can use column charts (sometimes referred to as frequency histograms). We are now going to create similar charts of temperature distributions to the ones in the New York Times article, and look at different ways of summarizing distributions.

In order to create a column chart using the temperature data we have, we first need to summarize the data using a frequency table (a record of how many observations in a dataset have a particular value, range of values, or belong to a particular category). Instead of using deciles to group the data, we use intervals of 0.05, so that temperature anomalies with a value from −0.3 to −0.25 will be in one group, a value greater than −0.25 and up to −0.2 in another group, and so on. The frequency table shows us how many values belong to a particular group.

1. Using the monthly data for June, July, and August, create two frequency tables similar to Figure 1.5 for the years 1951–1980 and 1981–2010 respectively. The values in the first column should range from −0.3 to 1.05, in intervals of 0.05. See Python Walkthrough 1.4 for how to do this.


| Range of temperature anomaly (T) 	| Frequency 	|
|----------------------------------	|-----------	|
|               −0.30              	|           	|
|               −0.25              	|           	|
|                 …                	|           	|
|                1.00              	|           	|
|                1.05              	|           	|

2. Using the frequency tables from Question 1:
   - a) Plot two separate column charts (frequency histograms) for 1951–1980 and 1981–2010 to show the distribution of temperatures, with frequency on the vertical axis and the range of temperature anomaly on the horizontal axis. Your charts should look similar to those in the New York Times article.
   - b) Using your charts, describe the similarities and differences (if any) between the distributions of temperature anomalies in 1951–1980 and 1981–2010.

### Python Walkthrough 1.4

**Creating frequency tables and histograms**

Since we will be looking at data from different subperiods (year intervals) separately, we will create a categorical variable (a variable that has two or more categories) that indicates the subperiod for each observation (row). In Python this type of variable is called a ‘category’ or categorical. When we create a categorical column, we need to define the categories that this variable can take.

We'll achieve this using the **pandas** function `pd.cut`, which sorts each value into one of a set of bins and gives it the label of that bin:

In [None]:
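df["Period"] = pd.cut(
    df.index,
    # Each bin excludes its lower edge and includes its upper edge,
    # so a first edge of 1920 makes the first bin run from 1921 to 1950
    bins=[1920, 1950, 1980, 2010],
    labels=["1921–1950", "1951–1980", "1981–2010"],
    ordered=True,
)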

We created a new variable called `"Period"` and defined the possible categories using the `labels=` keyword argument. Since we will not be using data for some years (before 1921 and after 2010), we want `"Period"` to take the value `NaN` (not a number) for these observations (rows), and the appropriate category for all the other observations. The `pd.cut` function does this automatically.

Let's take a look at the last 20 entries using `.tail`:

In [None]:
df["Period"].tail(20)
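To check how `pd.cut` treats years that sit exactly on a bin edge, it can help to try it on a handful of values. In this sketch the list of years is made up; by default each bin excludes its lower edge and includes its upper edge.

```python
import pandas as pd

# A few made-up years on and around the bin edges
years = pd.Series([1920, 1921, 1950, 1951, 2010, 2011])

periods = pd.cut(
    years,
    bins=[1920, 1950, 1980, 2010],
    labels=["1921-1950", "1951-1980", "1981-2010"],
)

# Years outside every bin (here 1920 and 2011) become NaN
print(pd.DataFrame({"Year": years, "Period": periods}))
```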

We need to use all monthly anomalies from June, July, and August, but they are currently in three separate columns, so we'd like to combine them into one. This is easy to do using the `.stack` function. Let's look at the first few rows of the stacked data using `.head()`:

In [None]:
list_of_months = ["Jun", "Jul", "Aug"]
df[list_of_months].stack().head()
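If it is not obvious what stacking does, a toy example may help. The two years of values below are invented for illustration; stacking turns the three month columns into a single column with one row per year and month.

```python
import pandas as pd

# Two made-up years of summer anomalies
toy = pd.DataFrame(
    {"Jun": [0.1, 0.2], "Jul": [0.3, 0.4], "Aug": [0.5, 0.6]},
    index=pd.Index([1951, 1952], name="Year"),
)

# The result is a single series indexed by (Year, month)
stacked = toy.stack()
print(stacked)
```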

We can now draw a histogram for each period. For each one, we select the rows that belong to that period, stack the three summer months into a single series, and plot its distribution:

In [None]:
fig, axes = plt.subplots(ncols=3, figsize=(9, 4), sharex=True, sharey=True)
for ax, period in zip(axes, df["Period"].dropna().unique()):
    df.loc[df["Period"] == period, list_of_months].stack().hist(ax=ax)
    ax.set_title(period)
plt.suptitle("Histogram of temperature anomalies")
axes[1].set_xlabel("Summer temperature distribution")
plt.tight_layout();

To explain what a histogram displays, let's take a look at the histogram for the period 1921–1950. We will look first at the highest of the bars, which is centred at –0.15. This bar represents values of the temperature anomalies that fall in the interval from –0.2 to –0.1. The height of this bar represents how many values fall into this interval (23 observations, in this case). As it is the highest bar, this is the interval in which the largest proportion of temperature anomalies fell for the period from 1921 to 1950. As you can see, there are virtually no temperature anomalies larger than 0.3. The height of these bars gives a useful overview of the distribution of the temperature anomalies.

Now consider how this distribution changes as we move through the three distinct time periods. The distribution is clearly moving to the right for the period 1981–2010, which is an indication that the temperature is increasing; in other words, an indication of global warming.
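A histogram is a picture of a frequency table, so the table asked for in Question 1 can be built with the same ingredients. The following is a sketch under stated assumptions rather than the only way to do it: the seven anomaly values are made up, and the bin edges run from −0.3 to 1.05 in steps of 0.05.

```python
import numpy as np
import pandas as pd

# Made-up anomalies, for illustration only
anomalies = pd.Series([-0.28, -0.12, -0.11, 0.02, 0.03, 0.04, 0.31])

# 28 edges from -0.30 to 1.05 give 27 intervals of width 0.05
bin_edges = np.round(np.linspace(-0.30, 1.05, 28), 2)

# Assign each anomaly to an interval, then count the values in each one
binned = pd.cut(anomalies, bins=bin_edges)
frequency_table = binned.value_counts(sort=False)
print(frequency_table)
```

With the real data, `anomalies` could be replaced by the stacked summer months for one period, for example `df.loc[df["Period"] == period, list_of_months].stack()`.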

Now we will use our data to look at different aspects of distributions. Firstly, we will learn how to use deciles to determine which observations are ‘normal’ and ‘abnormal’, and then learn how to use variance (a measure of dispersion in a frequency distribution, equal to the mean of the squares of the deviations from the arithmetic mean of the distribution) to describe the shape of a distribution. The variance indicates how ‘spread out’ the data is: a higher variance means that the data is more spread out. For example, the set of numbers 1, 1, 1 has zero variance (no variation), while the set of numbers 1, 1, 999 has a high variance of 221,334 (large spread).

3. The New York Times article considers the bottom third (the lowest or coldest one-third) of temperature anomalies in 1951–1980 as ‘cold’ and the top third (the highest or hottest one-third) of anomalies as ‘hot’. In decile terms, temperatures in the 1st to 3rd decile are ‘cold’ and temperatures in the 7th to 10th decile or above are ‘hot’ (rounded to the nearest decile). Use Python and **numpy**’s `np.quantile` function to determine what values correspond to the 3rd and 7th decile across all months in 1951–1980. (See Python Walkthrough 1.5 for an example.)


### Python Walkthrough 1.5

**Using the `quantile` function**

First, we need to create a variable that contains all monthly anomalies in the years 1951–1980. Then, we'll use the `np.quantile` function from **numpy** to find the required percentiles (0.3 and 0.7 refer to the 3rd and 7th deciles, respectively).

Note: You may get slightly different values to those shown here if you are using the latest data.

In [None]:
# Create a variable that has years 1951 to 1980, and months Jan to Dec (inclusive)
temp_all_months = df.loc[(df.index >= 1951) & (df.index <= 1980), "Jan":"Dec"]
# Put all the data in stacked format and give the new columns sensible names
temp_all_months = (
    temp_all_months.stack()
    .reset_index()
    .rename(columns={"level_1": "month", 0: "values"})
)
# Take a look at this data:
temp_all_months

In [None]:
quantiles = [0.3, 0.7]
list_of_percentiles = np.quantile(temp_all_months["values"], q=quantiles)

print(f"The cold threshold of {quantiles[0]*100}% is {list_of_percentiles[0]}")
print(f"The hot threshold of {quantiles[1]*100}% is {list_of_percentiles[1]}")
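To get a feel for what `np.quantile` returns, it can be tried on a simple set of numbers where the answer is easy to check by eye. The values below are made up for this purpose.

```python
import numpy as np

values = np.arange(11)  # the whole numbers 0, 1, ..., 10

# The 0.3 quantile is the value below which roughly 30% of the data lie
cold_threshold, hot_threshold = np.quantile(values, [0.3, 0.7])
print(cold_threshold, hot_threshold)
```

For these evenly spaced numbers, the two thresholds come out at 3 and 7.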

4. Based on the values you found in Question 3, count the number of anomalies that are considered ‘hot’ in 1981–2010, and express this as a percentage of all the temperature observations in that period. Does your answer suggest that we are experiencing hotter weather more frequently in 1981–2010? (Remember that each decile represents 10% of observations, so 30% of temperatures were considered ‘hot’ in 1951–1980.)

### Python Walkthrough 1.6

**Computing the proportion of anomalies at a given quantile using the `.mean()` function**

*Note*: You may get slightly different values to those shown here if you are using the latest data.

We repeat the steps used in Python Walkthrough 1.5, now looking at monthly anomalies in the years 1981–2010. We can simply change the year values in the code from Walkthrough 1.5.


In [None]:
# Create a variable that has years 1981 to 2010, and months Jan to Dec (inclusive)
temp_all_months = df.loc[(df.index >= 1981) & (df.index <= 2010), "Jan":"Dec"]
# Put all the data in stacked format and give the new columns sensible names
temp_all_months = (
    temp_all_months.stack()
    .reset_index()
    .rename(columns={"level_1": "month", 0: "values"})
)
# Take a look at the start of this data:
temp_all_months.head()

Now that we have all the monthly data for 1981–2010, we want to find the proportion of observations that are smaller than –0.1. We'll first create a *binary indicator* (i.e. it's True or False) that says, for each row (observation) in `temp_all_months`, whether the number is lower than the 0.3 quantile or not (given by `list_of_percentiles[0]`). Then we'll take the mean of this list of True and False values; when you take the mean of binary variables, each True evaluates to 1 and each False to 0, so the mean gives us the proportion of entries in `temp_all_months` that are lower than the 0.3 quantile:

In [None]:
entries_less_than_q30 = temp_all_months["values"] < list_of_percentiles[0]
proportion_under_q30 = entries_less_than_q30.mean()
print(f"The proportion under {list_of_percentiles[0]} is {proportion_under_q30*100:.2f}%")

When we printed out the answer, we used some *number formatting*. This is written as `:.2f` within the curly brackets part of an f-string—this tells Python to display the number with two decimal places. You should also note that, as well as the mean given by `.mean()`, there are various other built-in functions like `.std()` for the standard deviation and `.var()` for the variance.
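The idea that the mean of a True/False column is a proportion can be checked on a few made-up numbers. In this sketch both the values and the threshold are chosen purely for illustration.

```python
import pandas as pd

# Five made-up anomalies and a made-up 'cold' threshold
values = pd.Series([-0.3, -0.2, 0.0, 0.1, 0.4])
threshold = -0.1

is_cold = values < threshold  # True, True, False, False, False
proportion_cold = is_cold.mean()  # 2 of the 5 values are True

print(f"{proportion_cold*100:.2f}% of the toy values are below {threshold}")
```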

Now we can assess that between 1951 and 1980, 30% of observations for the temperature anomaly were smaller than –0.10, but between 1981 and 2010 only about two per cent of months are considered cold. That is a large change.

Let’s check whether we get a similar result for the number of observations that are larger than 0.11.

In [None]:
proportion_over_q70 = (temp_all_months["values"] > list_of_percentiles[1]).mean()
print(f"The proportion over {list_of_percentiles[1]} is {proportion_over_q70*100:.2f}%")

5. The New York Times article discusses whether temperatures have become more variable over time. One way to measure temperature variability is by calculating the variance of the temperature distribution. For each season (DJF, MAM, JJA, and SON):
    - a) Calculate the mean (average) and variance separately for the following time periods: 1921–1950, 1951–1980, and 1981–2010.
    - b) For each season, compare the variances in different periods, and explain whether or not temperature appears to be more variable in later periods.

### Python Walkthrough 1.7

**Calculating and understanding mean and variance**

Computing the mean and variance separately for each period and season would be quite tedious. We would prefer a way to cover all of them at once. Let's re-stack the data in a form where `season` is one of the columns and can take the values `DJF`, `MAM`, `JJA`, or `SON`. Let's also have a peek at the structure of the data while we're at it:

In [None]:
temp_all_months = df.loc[:, "DJF":"SON"].stack().reset_index().rename(columns={"level_1": "season", 0: "values"})
temp_all_months["Period"] = pd.cut(
    temp_all_months["Year"],
    bins=[1920, 1950, 1980, 2010],  # bins exclude their lower edge, so 1920 keeps 1921 in
    labels=["1921—1950", "1951—1980", "1981—2010"],
    ordered=True,
)
temp_all_months.iloc[-135:-125]

Now we'll take the mean and variance at once. We're going to *group* our data using a `groupby` operation, to which we pass a list of the variables we'd like to group by; here that will be `"season"` and `"Period"`. The variable we'd like to summarise is `"values"`, so we then select just the `"values"` column. Finally we're going to do an *aggregation* using the `.agg` function, passing it a list of the functions we'd like to apply. We'll use `"mean"` and `"var"`, the names of the built-in **pandas** functions that take the mean and (sample) variance of each group.

In [None]:
grp_mean_var = temp_all_months.groupby(["season", "Period"], observed=True)["values"].agg(["mean", "var"])
grp_mean_var

We recognise that the variances seem to remain fairly constant across the first two periods, but they do increase markedly for the 1981—2010 period.
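The grouping and aggregation steps are easier to follow on a small example where the answers can be worked out by hand. The toy values below are invented for illustration. The sketch also shows one detail worth knowing: the **pandas** variance divides by n − 1 (the sample variance), while `np.var` divides by n by default.

```python
import numpy as np
import pandas as pd

# Made-up values: three observations for each of two seasons
toy = pd.DataFrame(
    {
        "season": ["DJF", "DJF", "DJF", "JJA", "JJA", "JJA"],
        "values": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
    }
)

# One row per season, with the mean and sample variance of "values"
summary = toy.groupby("season")["values"].agg(["mean", "var"])
print(summary)

# np.var on the same DJF numbers divides by n rather than n - 1
djf_values = toy.loc[toy["season"] == "DJF", "values"].to_numpy()
population_var = np.var(djf_values)
print(population_var)
```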

We can plot a line chart to see these changes graphically. (This type of chart is formally known as a ‘time-series plot’). One trick we can use here is that calling `.plot` on wide-format data (with many columns) is interpreted by **pandas** as you wanting to plot all of the columns you've selected. So, we can do a quick transform to our data and have **pandas** plot it in the way we'd like, with all of the different seasons represented.

To do this, we'll set `"Year"` and `"season"` as the index but then unstack them so that season forms the columns of our data. We'll then only select the `"values"` entries as we're not using `"Period"` right now:

In [None]:
wide_format_data = temp_all_months.set_index(["Year", "season"]).unstack()["values"]
wide_format_data.head()
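The move from long to wide format can be seen more clearly on a handful of rows. This sketch uses made-up values and selects the `"values"` column before unstacking, which is a slightly different order of steps from the code above but gives the same kind of result.

```python
import pandas as pd

# Made-up long-format data: one row per year and season
long_format = pd.DataFrame(
    {
        "Year": [2000, 2000, 2001, 2001],
        "season": ["DJF", "JJA", "DJF", "JJA"],
        "values": [0.1, 0.3, 0.2, 0.4],
    }
)

# Unstacking moves the "season" level of the index into the columns
wide_format = long_format.set_index(["Year", "season"])["values"].unstack()
print(wide_format)
```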

Now we can plot the data by calling `.plot` and passing that function the axis, here called `ax`, we would like it to use. We'll also add a horizontal line at zero and some labels.

In [None]:
fig, ax = plt.subplots()
ax.axhline(0, color="black", alpha=0.3)
ax.annotate("1951—1980 average", xy=(0.68, -0.2), xycoords=("figure fraction", "data"))
wide_format_data.plot(linewidth=1, ax=ax)
ax.set_title(
    f"Average seasonal temperature anomaly \n in the northern hemisphere (1880–{wide_format_data.index.max()})"
)
ax.set_ylabel("Seasonal temperature anomalies");

6. Using the findings of the New York Times article and your answers to Questions 1 to 5, discuss whether temperature appears to be more variable over time. Would you advise the government to spend more money on mitigating the effects of extreme weather events?

## Part 1.3 Carbon emissions and the environment

**Learning objectives for this part**

- use scatterplots and the correlation coefficient to assess the degree of association between two variables
- explain what correlation measures and the limitations of correlation.

The government has heard that carbon emissions could be responsible for climate change, and has asked you to investigate whether this is the case. To do so, we are now going to look at carbon emissions over time, and use another type of chart, the scatterplot, to show their relationship to temperature anomalies. One way to measure the relationship between two variables is correlation. Python Walkthrough 1.8 explains what correlation is and how to calculate it.

In the questions below, we will make charts using the CO$_2$ data from the US National Oceanic and Atmospheric Administration. Download the [Excel spreadsheet](https://tinyco.re/3763425) containing this data. Save the data as a csv file in a sub-directory named "data" of the folder you have open in Visual Studio Code. Import the csv that's now in "data/1_CO2-data.csv" into Python using the **pandas** `read_csv` function; the code might look like `df_co2 = pd.read_csv("data/1_CO2-data.csv")`.

1. The CO$_2$ data were recorded from one observatory in Mauna Loa. Using [an Earth System Research Laboratory article](https://tinyco.re/8193893) as a reference, explain whether or not you think this data is a reliable representation of the global atmosphere.

2. The variables trend and interpolated are similar, but not identical. In your own words, explain the difference between these two measures of CO$_2$ levels. Why might there be seasonal variation in CO$_2$ levels?

Now we will use a line chart to look for general patterns over time.

3. Plot a line chart with interpolated and trend CO$_2$ levels on the vertical axis and time (starting from January 1960) on the horizontal axis. Label the axes and the chart legend, and give your chart an appropriate title. What does this chart suggest about the relationship between CO$_2$ and time?

We will now combine the CO$_2$ data with the temperature data from Part 1.1, and then examine the relationship between these two variables visually using scatterplots, and statistically using the correlation coefficient. If you have not yet downloaded the temperature data, follow the instructions in Part 1.1.

4. Choose one month and add the CO$_2$ trend data to the temperature dataset from Part 1.1, making sure that the data corresponds to the correct year.
    - a) Make a scatterplot of CO$_2$ level on the vertical axis and temperature anomaly on the horizontal axis.
    - b) Calculate and interpret the (Pearson) correlation coefficient between these two variables.
    - c) Discuss the shortcomings of using this coefficient to summarize the relationship between variables.

### Python Walkthrough 1.8

**Scatterplots and the correlation coefficient**

First we will use the `pd.read_csv` function to import the CO$_2$ datafile into Python, and call it `df_co2`.

In [None]:
df_co2 = pd.read_csv("data/1_CO2-data.csv")
df_co2.head()

This file has monthly data, but in contrast to the data in `df` from earlier, the data is all in so-called tidy format (one observation per row, one column per variable). To make this task easier, we will pick only the June data from the CO$_2$ emissions and add them as an additional variable to the `df` dataset.

Python's **pandas** package has a convenient function called merge to do this. First we create a new dataset that contains only the June emissions data (`df_co2_june`).

In [None]:
df_co2_june = df_co2.loc[df_co2["Month"]==6]
df_co2_june.head()

Then we use this data in the `pd.merge` function. The merge function takes the original `df` and the `df_co2_june` and merges (combines) them together. Both dataframes contain `"Year"` (as a column in `df_co2_june` and as the index of `df`), so passing `on="Year"` tells **pandas** to match the rows by year.

(*Extension*: Hover your cursor over `pd.merge` in Visual Studio Code, type `help(pd.merge)` into the interactive window, or Google ‘pandas merge’ to see the many other options that `pd.merge` allows.)

In [None]:
df_temp_co2 = pd.merge(df_co2_june, df, on="Year")
df_temp_co2[["Year", "Jun", "Trend"]].head()

It looks like it worked! We now have some extra columns from the CO$_2$ data in addition to the temperature anomaly data from before.
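To see what a merge does with years that appear in only one of the two datasets, here is a small sketch with made-up numbers. By default `pd.merge` keeps only the years found in both dataframes; passing `how="outer"` keeps every year and fills the gaps with `NaN`.

```python
import pandas as pd

# Made-up data: the two dataframes only partly overlap in time
temps = pd.DataFrame({"Year": [1958, 1959, 1960], "Jun": [0.05, 0.10, 0.02]})
co2 = pd.DataFrame({"Year": [1959, 1960, 1961], "Trend": [316.0, 317.0, 318.0]})

# Default (inner) merge: only years present in both are kept
merged = pd.merge(co2, temps, on="Year")
print(merged)

# Outer merge: all years are kept, with NaN where data is missing
merged_outer = pd.merge(co2, temps, on="Year", how="outer")
print(merged_outer)
```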

To make a scatterplot, we use the `.plot.scatter()` function.

In [None]:
fig, ax = plt.subplots()
df_temp_co2.plot.scatter(x="Jun", y="Trend", ax=ax)
ax.set_title(r"Scatterplot of temperature anomalies vs CO$_2$ emissions")
ax.set_ylabel(r"CO$_2$ levels (trend, mole fraction)")
ax.set_xlabel("Temperature anomaly (degrees Celsius)");

To calculate the correlation coefficient, we can use the `.corr()` function. *Note*: you may get slightly different results if you are using the latest data.

In [None]:
df_temp_co2[["Jun", "Trend"]].corr(method="pearson")

In this case, the correlation coefficient tells us that an upward-sloping straight line is quite a good fit to the data (as seen on the scatterplot). There is a strong positive association between the two variables (higher temperature anomalies are associated with higher CO$_2$ levels).


One limitation of this correlation measure is that it only tells us about the strength of the upward- or downward-sloping linear relationship between two variables, in other words how closely the scatterplot aligns along an upward- or downward-sloping straight line. The correlation coefficient cannot tell us if the two variables have a different kind of relationship (such as that represented by a wavy line).

*Note*: The word ‘strong’ is often used for coefficients that are close to 1 or −1, and ‘weak’ is often used for coefficients that are close to 0, though there is no precise range of values that are considered ‘strong’ or ‘weak’.
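The limitation described above can be illustrated with made-up numbers. In this sketch, one variable is an exact straight-line function of `x` and the other is an exact curve (`x` squared); the names and values are invented for the example.

```python
import pandas as pd

x = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])
linear = 2 * x + 1  # points lie exactly on an upward-sloping line
curved = x**2  # points lie exactly on a U-shaped curve

# Series.corr computes the Pearson correlation coefficient by default
r_linear = x.corr(linear)
r_curved = x.corr(curved)

print(f"Straight line: {r_linear:.2f}")
print(f"U-shaped curve: {r_curved:.2f}")
```

The second coefficient is zero even though `curved` is completely determined by `x`, because the relationship is not a straight line.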

If you need more insight into correlation coefficients, you may find it helpful to watch online tutorials such as [‘Correlation coefficient intuition’](https://tinyco.re/4363520) from the Khan Academy.

As we are dealing with time-series data, it is often more instructive to look at a line plot, as a scatterplot cannot convey how the observations relate to each other in the time dimension.

Let’s start by plotting the June temperature anomalies. To make the plotting easier, we will set the `"Year"` column as the index of the dataframe.


In [None]:
df_temp_co2 = df_temp_co2.set_index("Year")

In [None]:
fig, ax = plt.subplots()
df_temp_co2["Jun"].plot(ax=ax)
ax.set_title(r"June temperature anomalies")
ax.set_ylabel("June temperature anomalies");


With one variable plotted, we would normally just add the second variable as another line on the same axes. The issue, however, is that the CO$_2$ variable (`Trend`) is on a very different scale, and the automatic vertical axis scale for the temperature anomalies (from –0.2 to about 1.2) would not allow `Trend` to be displayed. To resolve this issue, we can add a second vertical axis that shares the same horizontal (year) axis, using `twinx()`. The temperature anomalies are then read off the left-hand axis and the CO$_2$ series off the right-hand axis.

In [None]:
fig, ax0 = plt.subplots(figsize=(6, 4))
# create a second vertical axis that shares the same horizontal axis
ax1 = ax0.twinx()
df_temp_co2["Jun"].plot(ax=ax0, label="June anomaly")
df_temp_co2["Trend"].plot(ax=ax1, color="k", label=r"CO$_2$ levels (trend)")
ax0.set_ylabel("June temperature anomalies")
ax1.set_ylabel(r"CO$_2$ levels (trend)")
# set up combined legend
lines0, labels0 = ax0.get_legend_handles_labels()
lines1, labels1 = ax1.get_legend_handles_labels()
ax0.legend(lines0 + lines1, labels0 + labels1)
plt.suptitle(r"June temperature anomalies and CO$_2$ levels");

This line graph shows not only how the two variables move together in general, but also that both display a clear upward trend over the sample period. Trending behaviour is a feature of many (though not all) time-series variables, and it matters for how we interpret the correlation (see the ‘Find out more’ box on spurious correlations that follows).

5. *Extra practice*: Choose two months and add the CO$_2$ trend data to the temperature dataset from Part 1.1, making sure that the data corresponds to the correct year. Create a separate chart for each month. What do your charts and the correlation coefficients suggest about the relationship between CO$_2$ levels and temperature anomalies?


Even if two variables are strongly correlated with each other, this is not necessarily because one variable’s behaviour is the result of the other (a relationship known as causation). The two variables could be spuriously correlated. The following example illustrates spurious correlation:

> A child’s academic performance may be positively correlated with the size of the child’s house or the number of rooms in the house, but could we conclude that building an extra room would make a child smarter, or doing well at school would make someone’s house bigger? It is more plausible that income or wealth, which determines the size of home that a family can afford and the resources available for studying, is the ‘unseen factor’ in this relationship. We could also determine whether income is the reason for this spurious correlation by comparing exam scores for children whose parents have similar incomes but different house sizes. If there is no correlation between exam scores and house size, then we can deduce that house size was not ‘causing’ exam scores (or vice versa).
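
The idea of comparing children whose parents have similar incomes can be illustrated with simulated data. The sketch below is not based on real data: the variable names and all the numbers are assumptions chosen for illustration. Income drives both house size and exam scores, and house size has no direct effect on scores.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the simulation is repeatable
n = 20_000

# Hypothetical simulated data: income affects both of the other variables
income = rng.normal(50, 15, n)
house_size = 2.0 * income + rng.normal(0, 10, n)
exam_score = 0.5 * income + rng.normal(0, 5, n)

# Correlation across all children
corr_all = np.corrcoef(house_size, exam_score)[0, 1]

# Correlation among children whose parents have similar incomes
similar = np.abs(income - 50) < 1
corr_similar = np.corrcoef(house_size[similar], exam_score[similar])[0, 1]

print(f"All children:               {corr_all:.2f}")
print(f"Similar-income children only: {corr_similar:.2f}")
```

Across all children, the correlation is strongly positive, but within the group with similar incomes it is close to zero. This is the pattern we would expect when income is the ‘unseen factor’ behind the correlation.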

6. Consider the example of spurious correlation described above.
    - a) In your own words, explain spurious correlation and the difference between correlation and causation.
    - b) Give an example of spurious correlation, similar to the one above, for either CO$_2$ levels or temperature anomalies.
    - c) Choose an example of spurious correlation from [Tyler Vigen’s website](https://tinyco.re/8861803). Explain whether you think it is a coincidence, or whether this correlation could be due to one or more other variables.

**Find out more**

*What makes some correlations spurious?*

On the spurious correlations website given in Question 6(c), most of the examples you will see involve data series (variables) that are trending (meaning that they tend to increase or decrease over time). If you calculate a correlation between two variables that are trending, you are very likely to find a large positive or negative correlation coefficient, even if there is no plausible explanation for a relationship between the two variables. For example, ‘per capita cheese consumption’ (which increases over time due to increased disposable incomes or greater availability of cheese) has a correlation coefficient of 0.95 with the ‘number of people who die from becoming tangled in their bedsheets’ (which also increases over time due to a growing population and a growing availability of bedsheets).

The case for our example (the relationship between temperature and CO$_2$ emissions) is slightly different. There is a well-known scientific link between the two (the greenhouse effect), so we understand how CO$_2$ emissions could potentially cause changes in temperature. But in general, do not be tempted to conclude that there is a causal link just because a high correlation coefficient can be seen. Be very cautious about attaching too much meaning to high correlation coefficients when the data displays trending behaviour.
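
To see how trends alone can produce a high correlation coefficient, here is a minimal sketch with two simulated series that are generated independently of each other. The series names, trend slopes, and noise levels are all assumptions made for illustration. Comparing year-to-year changes (first differences) is one common way to remove a trend, though it is not the only one.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the simulation is repeatable
t = np.arange(500)              # time periods

# Two unrelated series: each is its own upward trend plus independent noise
series_a = 0.05 * t + rng.normal(0, 1, t.size)
series_b = 0.03 * t + rng.normal(0, 1, t.size)

# Correlation of the levels (dominated by the shared upward trend)
corr_levels = np.corrcoef(series_a, series_b)[0, 1]

# Correlation of the period-to-period changes (trend removed)
corr_changes = np.corrcoef(np.diff(series_a), np.diff(series_b))[0, 1]

print(f"Levels:  {corr_levels:.2f}")
print(f"Changes: {corr_changes:.2f}")
```

The levels are strongly correlated only because both series rise over time. Once the trend is removed, the correlation is close to zero, which reflects the fact that the two series were generated independently.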

This part shows that summary statistics, such as the correlation coefficient, can help identify possible patterns or relationships between variables, but we cannot draw conclusions about causation from them alone. It is also important to think about other explanations for what we see in the data, and whether we would expect there to be a relationship between the two variables.

However, there are ways to determine whether there is a causal relationship between two variables, for example, by looking at the scientific processes that connect the variables (as with CO$_2$ and temperature anomalies), or by using a natural experiment. To read more about how natural experiments help evaluate whether one variable causes another, see [Section 1.8](https://tinyco.re/9384675) of *Economy, Society, and Public Policy*. In Empirical Project 3, we will take a closer look at natural experiments and how we can use them to identify causal links between variables.