# AI UK Challenge Instructions

Thank you for taking part in the Alan Turing Institute's national AI showcase event and the Data Study Group challenge. This notebook explains how to access the data we've prepared for the event, shares some initial analysis we've conducted, and suggests questions you may wish to dig into during the challenge. Please feel free to come up with your own ideas for what you'd like to work on: if you want to add secondary data, or even drop our suggested datasets and work with other public sources, go ahead!

We would like each group to come up with a few interesting observations that your facilitator can discuss at the end of the event. If there's something you'd like them to talk about, whether that's a quirk of the data, some predictions based on modelling, or even an exciting approach or research field you've learned about from your group members, please let them know so they can take a note of it.

**IMPORTANT**: Please check the end of this notebook for our suggestions if you're at a loss for what to start looking at, or want more guidance on what your facilitator will be talking about.

Please above all else strive to remain professional and respectful during your group work, and don't worry about solving all of the world's issues in one afternoon. Remember, we're here to learn from each other, not conquer climate change at a single stroke.

Good luck, and please have fun.

# Initial setup

## Install required packages

In [None]:
import sys
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install matplotlib
!{sys.executable} -m pip install scikit-learn
!{sys.executable} -m pip install requests

## Run setup

In [None]:
"""
To run this notebook in Google Colab, you need to do the following first:
1. Open this link: https://drive.google.com/drive/folders/1adprVKMxSlXTn-S3ZAbOx545cxv5CzHl?usp=sharing
2. Then go to "Shared with me" in your Google Drive, right-click the "AIUK" folder
and select "Add shortcut to Drive"

Optionally, if you don't have a Google Drive account, you can set colab to False,
and download the data.
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as dates
from pathlib import Path

%matplotlib inline
%config InlineBackend.figure_format='retina'

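# The 'seaborn' style was renamed 'seaborn-v0_8' in newer matplotlib releases
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')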
plt.rcParams["figure.figsize"] = (15,10)

colab = False

if colab:
    from google.colab import drive
    drive.flush_and_unmount()
    drive.mount('/content/drive')
    root_dir = Path('/content/drive/MyDrive/AIUK')
else:
    from data import download_data
    download_data()
    root_dir = Path('./')

# Climate change indicators dataset

Anthropogenic climate change is one of the most important social issues facing humanity, and one in which big data can play a role in determining the most effective policy interventions. Today, we'll be taking a look at some of the key indicators of warming, including CO2 levels, change in sea levels, and so on.

Dataset source: https://datahub.io/collections/climate-change

## CO2

In the next few cells we'll explore the datasets. Let's first take a look at the dataset which gives us the average monthly concentration of CO2 at the Mauna Loa observatory, measured in parts per million (PPM).



In [None]:
co2_df = pd.read_csv(root_dir.joinpath('co2_monthly.csv'), parse_dates=['Date'])
co2_df.head()

As you can see, each row gives the date in YYYY-MM-DD format, the monthly average, and a seasonally corrected trend. Number of days is the number of daily readings averaged; a value of -1 denotes that the monthly figure is interpolated from other sources.

Let's plot the monthly average since 1990, and generate a best fit line showing the trend.

In [None]:
since_90 = co2_df[co2_df.Date>'1990']
date_num = dates.date2num(since_90.Date)
params = np.polyfit(date_num, since_90.Average, 1)
fit = np.poly1d(params)
x_fit = np.linspace(date_num.min(), date_num.max())

In [None]:
plt.scatter(pd.to_datetime(since_90.Date), since_90.Average, marker='.')
plt.plot(dates.num2date(x_fit), fit(x_fit), 'r')
plt.show()
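
The fitted line also carries a number worth reading off: its slope. Because `date2num` counts days, `np.polyfit` returns a slope in PPM per day, which is easier to interpret once rescaled to PPM per year. The sketch below illustrates this on synthetic data (a made-up series rising by exactly 2 PPM per year, not the Mauna Loa record); the variable names are illustrative only.

```python
import numpy as np
import pandas as pd
import matplotlib.dates as dates

# Synthetic stand-in for the CO2 data: monthly values rising by exactly 2 PPM per year.
sample_dates = pd.date_range("1990-01-01", periods=120, freq="MS")
days = dates.date2num(sample_dates.to_numpy())
sample_ppm = 350.0 + 2.0 * (days - days[0]) / 365.25

# Same kind of fit as in the cell above: a degree-1 polynomial against date numbers.
slope_per_day, intercept = np.polyfit(days, sample_ppm, 1)

# date2num counts days, so the slope is in PPM per day; rescale to PPM per year.
slope_per_year = slope_per_day * 365.25
print(f"Trend: {slope_per_year:.2f} PPM per year")
```

With the real data, the same rescaling can be applied to `params[0]` from the fit above.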

## Temperature changes

What about the correlation between temperature change and CO2? Let's add our global temperature anomaly dataset to our plot as well.

In [None]:
temp_df = pd.read_csv(root_dir.joinpath('temp_monthly.csv'), parse_dates=['Date'])
temp_df.head()

In [None]:
temp_since_90 = temp_df.loc[(temp_df.Date > '1990') & (temp_df.Source == 'GISTEMP')]

In [None]:
fig, ax = plt.subplots()
ax.bar(pd.to_datetime(temp_since_90.Date), temp_since_90.Mean, width=15.0)
ax.set_ylabel('Anomalous temperature change (degrees C)')

ax2 = ax.twinx()
ax2.set_ylabel('CO2 concentration (PPM)')

ax2.plot(dates.num2date(x_fit), fit(x_fit), 'r')
plt.show()
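
The plot suggests a relationship, but it does not quantify one. One possible way to do so is to align the two series by month and compute Pearson's correlation coefficient. The sketch below uses synthetic data that borrows the column names seen above (`Date`, `Average`, `Mean`, `Source`); it assumes the two files may stamp each month on a different day, so it merges on a year-month key rather than on the exact date. Keep in mind that two series that both trend upwards will correlate strongly whether or not one drives the other.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins that reuse the column names from the two CSVs above.
months = pd.date_range("1990-01-01", periods=24, freq="MS")
co2_sample = pd.DataFrame({"Date": months, "Average": np.linspace(354.0, 357.0, 24)})
temp_sample = pd.DataFrame({
    "Date": months + pd.Timedelta(days=14),  # deliberately stamped mid-month
    "Source": "GISTEMP",
    "Mean": np.linspace(0.30, 0.45, 24) + 0.05 * np.sin(np.arange(24)),
})

# Build a year-month key so rows match even when the day of the month differs.
co2_sample["Month"] = co2_sample["Date"].dt.to_period("M")
temp_sample["Month"] = temp_sample["Date"].dt.to_period("M")

# Align the two series on the shared month, then compute Pearson's r.
merged = co2_sample.merge(temp_sample, on="Month", suffixes=("_co2", "_temp"))
pearson_r = merged["Average"].corr(merged["Mean"])
print(f"Pearson r between CO2 and temperature anomaly: {pearson_r:.2f}")
```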

## Sea level

For millions of people around the world, the most immediate impact of climate change is sea level rise, which threatens not only homes and businesses, but also vast areas of agricultural land, as well as access to clean fresh water.

<img src="https://cloudfront-us-east-2.images.arcpublishing.com/reuters/3XK6LJVMXBMXZMBV45JCDUZN6Y.jpg">


Let's take a look at the sea level dataset.


In [None]:
s_df = pd.read_csv(root_dir.joinpath('sea_level.csv'), parse_dates=['Year'])
s_df.head()

This set provides two measurements of the cumulative change in average sea level, in inches, since 1880: one from CSIRO (Commonwealth Scientific and Industrial Research Organization) and one from NOAA (the United States National Oceanic and Atmospheric Administration). We can see where these measurements overlap by plotting each of them since 1990.

In [None]:
s_since_90 = s_df[s_df.Year > '1990']

In [None]:
plt.plot(pd.to_datetime(s_since_90.Year), s_since_90['CSIRO Adjusted Sea Level'], label='CSIRO sea level')
plt.plot(pd.to_datetime(s_since_90.Year), s_since_90['NOAA Adjusted Sea Level'], label='NOAA sea level')
plt.legend(loc=(0.7, 0.2), frameon=True, facecolor='white')
plt.show()

## Predicting sea level rise

The rate of increase is roughly linear, with average sea levels now ~9 inches higher than the 1880 figure, but let's try to predict what the increase might look like over the next 30 years. To do that, we'll train a simple regression model.

We'll first collect our data and resample our time scale to the year.

In [None]:
s = s_df[['Year', 'CSIRO Adjusted Sea Level']].groupby(pd.Grouper(key='Year', freq='Y')).mean().reset_index()
s.rename(columns = {'Year':'Date', 'CSIRO Adjusted Sea Level': 'Sea Level'}, inplace = True)
s = s[s.Date < '2014']
s.head()

Then, we'll train a linear regressor to predict our outcome (Sea Level) from the date. First, we'll split our available data into a training and test set, then train our regression with the labels from the training set. Then, we'll make some predictions on the test set, and see how well our model fits.

**NOTE**: this is obviously not a practical solution, since the rate of sea level rise is probably determined by multiple climate-related factors rather than by the calendar year alone, but it will work as a simple demonstration of the principle.

In [None]:
from sklearn import linear_model, model_selection
import datetime

X = np.array(s['Date'].apply(lambda x:x.toordinal())).reshape(-1, 1)
y = s['Sea Level']

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.33, random_state=42)
regressor = linear_model.LinearRegression()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

x_dt = [datetime.date.fromordinal(x[0]) for x in X_test]

plt.scatter(x_dt, y_test, color="black")
plt.plot(x_dt, y_pred, color="blue", linewidth=3)

plt.show()
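
A plot gives a visual impression of the fit; a couple of scores on the held-out data can back it up. The sketch below is one way to do that, using `r2_score` and `mean_absolute_error` from scikit-learn. It runs on synthetic data (a noisy linear rise of about 0.07 inches per year, chosen only to resemble the ~9 inch rise mentioned above), so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn import linear_model, model_selection
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic stand-in: a noisy linear rise of about 0.07 inches per year since 1880.
rng = np.random.default_rng(42)
years = np.arange(1880, 2014).reshape(-1, 1)
level = 0.07 * (years.ravel() - 1880) + rng.normal(0.0, 0.3, size=len(years))

# Same split settings as the cell above.
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    years, level, test_size=0.33, random_state=42
)
model = linear_model.LinearRegression().fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# R^2 is the share of variance explained; MAE is the typical error in inches.
r2 = r2_score(y_te, y_hat)
mae = mean_absolute_error(y_te, y_hat)
print(f"R^2 on held-out years: {r2:.3f}")
print(f"Mean absolute error: {mae:.3f} inches")
```

In the notebook itself, the equivalent calls would be `r2_score(y_test, y_pred)` and `mean_absolute_error(y_test, y_pred)`. Note that a random split lets the model see years on both sides of each test point; holding out the most recent years instead would be a stricter test of a forecast.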

The regression seems to fit quite well with the data that we've got so far, so let's make some predictions for the future.

In [None]:
start = '2014-12-31'
end = '2050-12-31'

new_dates = pd.date_range(start, end, freq='1Y')
x_fut = np.array(new_dates.to_series().apply(lambda x:x.toordinal())).reshape(-1, 1)

y_pred = regressor.predict(x_fut)
fut_dt = [datetime.date.fromordinal(x[0]) for x in x_fut]

plt.plot(fut_dt, y_pred, color="blue", linewidth=3)

plt.show()

From this simple prediction, we can see that if the rate of increase stays stable, we can expect greater than 10 inches of sea level rise by 2050: [enough to put large parts of England, as well as virtually the entire Netherlands](https://coastal.climatecentral.org/map/8/1.1818/51.4022/?theme=sea_level_rise&map_type=year&basemap=roadmap&contiguous=true&elevation_model=best_available&forecast_year=2050&pathway=rcp45&percentile=p50&refresh=true&return_level=return_level_1&rl_model=gtsr&slr_model=kopp_2014), under water.

![London flooding](https://2.bp.blogspot.com/_Fv90J7eTcis/TFs8VbyP1fI/AAAAAAAAGKc/tF6rECNfLbE/s1600/Flood.jpg)

## Other sets

We have also provided the 'fossil_by_nation' dataset, which we will not explore in this notebook. If you're interested in analysing the distribution of fossil fuel usage by nation states over time, we encourage you to explore this dataset.

Here are the column headers to get you started.

| Column Name | Contents |
|---|---|
| Year | 4-digit year |
| Country | Country as an uppercased string (e.g. UNITED KINGDOM) |
| Total | Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C) |
| Solid Fuel | Carbon emissions from solid fuel consumption |
| Liquid Fuel | Carbon emissions from liquid fuel consumption |
| Gas Fuel | Carbon emissions from gas fuel consumption |
| Cement | Carbon emissions from cement production |
| Gas Flaring | Carbon emissions from gas flaring |
| Per Capita | Per capita carbon emissions (metric tons of carbon; after 1949 only) |
| Bunker Fuels | Carbon emissions from bunker fuels (not included in total) |
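
As a possible starting point, the sketch below shows two common ways of reshaping data with these columns: cumulative totals per country, and a year-by-country table suitable for plotting. The rows are made up for illustration and the variable names are hypothetical; only the column names come from the table above.

```python
import pandas as pd

# Made-up rows using column names from the table above; the values are illustrative only.
fossil_sample = pd.DataFrame({
    "Year": [2000, 2001, 2000, 2001],
    "Country": ["UNITED KINGDOM", "UNITED KINGDOM", "FRANCE", "FRANCE"],
    "Total": [150, 152, 100, 104],
})

# Cumulative emissions per country over the period covered, largest first.
cumulative = fossil_sample.groupby("Country")["Total"].sum().sort_values(ascending=False)
print(cumulative)

# One column per country, indexed by year, ready for by_year.plot().
by_year = fossil_sample.pivot(index="Year", columns="Country", values="Total")
print(by_year)
```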

# Climate change sentiment dataset

Beyond the bare numbers, combating climate change will require buy-in from people in all walks of life. Part of persuading people about the need to combat the deadly outcomes of global warming is identifying which parts of the population are already persuaded, and what evidence each side is deploying to support their point of view.

We've provided a dataset that includes a large group of social media posts related to climate change, annotated by human experts as referring to either factual news sources (2), or as opinion indicating that the person is promoting either belief or disbelief in anthropogenic warming (1 or -1). Messages without an obvious valence are marked with a 0.

You can find that dataset in your Google Drive at /content/drive/MyDrive/AIUK/cc_sentiment.csv

Dataset source: https://www.kaggle.com/edqian/twitter-climate-change-sentiment-dataset

In [None]:
data = pd.read_csv(root_dir.joinpath('cc_sentiment.csv'))
data.head()

Let's see how many examples of each class we have in our dataset. 

In [None]:
counts = data['sentiment'].value_counts().sort_index().values

plt.bar(range(-1, 3), counts, tick_label=range(-1, 3))

for index, value in enumerate(counts):
    plt.text(index - 1.05, value + 300, str(value))

plt.show()

It looks like our dataset is quite imbalanced: the vast majority of the examples we have express belief in anthropogenic climate change, while a very small minority hold the opposite position, or are neutral. This might have some severe effects on machine learning models we might want to train with it.

## Machine learning with text

Let's explore that idea by training a simple natural language processing (NLP) model to predict whether a message text holds either belief. 

We'll stick with some straightforward principles: first, we'll treat each message as a bag of words, that is we will ignore the order and structure of words, and just count how many times each appears. Second, we'll use a Decision Tree model to make decisions about what belief each message demonstrates, given the words that appear in it.

Let's go ahead and prepare our training and test data by including only messages annotated with either 1 or -1, then converting each message text to a list of word counts.

In [None]:
df = data.loc[(data.sentiment == 1) | (data.sentiment == -1)]
X = df.message
Y = df.sentiment

print(X.shape)
print(Y.shape)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(strip_accents='unicode', stop_words='english')
X_counts = count_vect.fit_transform(X)

# Feature matrix shape
X_counts.shape
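
To make the bag-of-words idea concrete, here is a small sketch that runs the same kind of vectorizer on two made-up messages (not taken from the dataset). Each row of the resulting matrix is one message, each column is one vocabulary word, and each entry is how often that word appears. Word order is discarded, and common English stop words such as "is" and "the" are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages, not taken from the dataset.
toy_messages = [
    "Climate change is real and the climate is warming",
    "Warming is a hoax",
]

toy_vect = CountVectorizer(strip_accents="unicode", stop_words="english")
toy_counts = toy_vect.fit_transform(toy_messages)

# vocabulary_ maps each kept word to its column index in the count matrix.
toy_vocab = toy_vect.vocabulary_
print(sorted(toy_vocab.items(), key=lambda item: item[1]))

# Dense view of the (sparse) count matrix: one row per message.
print(toy_counts.toarray())
```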

Now that we have a set of feature vectors and labels, we can go ahead and split that into training and test sets, and train our model.

In [None]:
from sklearn import tree

X_train, X_test, y_train, y_test = model_selection.train_test_split(X_counts, Y, test_size=0.33, random_state=42)

clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)

preds = clf.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, preds)

An accuracy of nearly 90% looks quite good, right? Well, that number may be giving us the wrong impression. Let's look at some more scoring metrics.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

print(classification_report(y_test, preds))

ConfusionMatrixDisplay(confusion_matrix(y_test, preds)).plot()
plt.show()

You can see that the classifier almost always predicts that a message shows belief in anthropogenic climate change, regardless of its content. Because our dataset is so imbalanced, with the large majority of examples in the positive class, most examples will be classified correctly by default, so the accuracy score will be high regardless of what our model is actually learning.

Using another metric which takes account of the ratio of false predictions as well as true predictions, like F1 or the Matthews Correlation Coefficient, gives us a much clearer understanding of what is going on.
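
The effect is easy to reproduce in isolation. The sketch below scores a deliberately useless "model" that always predicts the majority class, on made-up labels with a 90/10 split (the split is an assumption for illustration, not the ratio in our dataset). Accuracy looks respectable, while macro-averaged F1 and the Matthews Correlation Coefficient expose the problem.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Made-up labels: 90 "belief" (1) and 10 "disbelief" (-1) messages.
y_true = np.array([1] * 90 + [-1] * 10)

# A degenerate "model" that ignores its input and always predicts the majority class.
y_majority = np.ones_like(y_true)

acc = accuracy_score(y_true, y_majority)
# Macro F1 averages the per-class F1 scores, so the ignored minority class drags it down.
macro_f1 = f1_score(y_true, y_majority, average="macro", zero_division=0)
# MCC is 0 when predictions carry no information about the true labels.
mcc = matthews_corrcoef(y_true, y_majority)

print("accuracy:", acc)
print("macro F1:", macro_f1)
print("MCC:", mcc)
```

The same three calls can be applied to `y_test` and `preds` from the cells above.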

# Ideas for discussion

Here's a set of questions that you might like to think about answering during your DSG challenge. If you don't see anything you're interested in answering, don't worry - feel free to come up with your own problem to solve or interesting piece of data to dig up.

*  Which countries have historically emitted a lot of CO2, and what has changed over time?
*  Can you extrapolate from the climate change data to see how potential changes might affect temperature or atmospheric CO2?
*  What topics do people who believe in human-caused climate change talk about, versus those who don't?
*  Do believers/non-believers form networks - do they retweet each other a lot? You might have to think about linking accounts through matching message texts, or using retweets.
*  Can you predict from a message text whether a person is going to believe in anthropogenic climate change better than the simple model we created?
*  What about if you disregard named entities, like the UN, IPCC, and so on? What does that do to your models' performance?
*  The class imbalance in the text dataset is quite bad. Can you find a way to address it and make machine learning easier to do?

# Presenting your work

The final outcome of your DSG challenge will be a very short presentation delivered by your facilitator. We're looking for something that can provide the following:

* An interesting finding for discussion, which you can illustrate with a single slide
* A five-minute slot for talking about your experiences during the challenge

That last point is important: we want this to be a collaborative experience, so be prepared that your facilitator might talk about some of the things you come up with during the event. Remember, this event is for the benefit of all participants, so remain polite, collegial, and constructive at all times.