# Data and (machine) learning

In this session we are going to think a bit about how computers use data to 'learn'.

Remember yesterday we built a rough street map of Belfast using the trees dataset?


## Import some libraries

In [66]:
# Pandas for analysing tabular data
import pandas as pd
# Seaborn for plots
import seaborn as sns
# Import numpy for sums
import numpy as np
# Statsmodels for statistical tables
import statsmodels
import statsmodels.api as sm

In [None]:
belfast_trees = pd.read_csv("https://www.belfastcity.gov.uk/getmedia/262a1f01-f219-4780-835e-7a833bdd1e1c/odTrees.csv")

In [None]:
sns.set(rc={"figure.figsize": (18, 14)})
sns.scatterplot(x="LONGITUDE",
                y="LATITUDE",
                hue="TYPEOFTREE",
                alpha=0.2,
                data=belfast_trees)

## Today

Today we are going to use subsets of three datasets: 

1. The [Varieties of Democracy](https://www.v-dem.net/) dataset
2. The [Modern Slavery Index](https://www.globalslaveryindex.org/)
3. The [Gapminder](https://www.gapminder.org/) dataset

Specifically we will use data from 2000/2002.

Some things to note about how Pandas represents data:

1. The index
2. Variables in columns; observations in rows
3. Single datatypes in each column (categorical and numerical)
    * Though we could add ordinal by, say, rank ordering by GDP
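
Here is a rough sketch of those three points on a tiny made-up dataframe. The country labels, the values and the column names (`country`, `gdp_per_capita`, `gdp_rank`) are placeholders for illustration, not taken from the datasets above:

```python
import pandas as pd

# A tiny stand-in dataframe; the values are invented purely for illustration
toy = pd.DataFrame({
    "country": ["A", "B", "C"],                   # categorical (text) column
    "gdp_per_capita": [500.0, 5000.0, 50000.0],   # numerical column
})

# 1. The index: row labels, by default 0, 1, 2, ...
print(toy.index)

# 2. Variables in columns; observations in rows
print(toy.columns)
print(toy.shape)  # (number of rows, number of columns)

# 3. One datatype per column
print(toy.dtypes)

# An ordinal column: rank the countries by GDP (1 = highest)
toy["gdp_rank"] = toy["gdp_per_capita"].rank(ascending=False).astype(int)
print(toy)
```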

## Loading and examining the data

In [8]:
data = pd.read_csv("gapminder_modern_slavery_and_vdem.csv")

What does the dataframe look like?
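
One way to take a first look, sketched as a small helper. The name `peek` is made up for this example; `shape`, `dtypes` and `head` are standard Pandas:

```python
import pandas as pd

def peek(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Print the size and column datatypes, then return the first n rows."""
    print(df.shape)    # (number of rows, number of columns)
    print(df.dtypes)   # the datatype of each column
    return df.head(n)  # the first n observations

# In the notebook you would call: peek(data)
```

`data.describe()` is also worth a try: it gives summary statistics (mean, min, max and so on) for each numerical column.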

### Scales

Let's place GDP per capita on a 'log' scale.

This means that each equal step along the axis multiplies the value by ten, rather than adding a fixed amount.

So instead of 10 $\rightarrow$ 20 $\rightarrow$ 30 etc, we will get 10 $\rightarrow$ 100 $\rightarrow$ 1000 etc.

You used to have to calculate this using 'log tables': now we can use the 'Numerical Python' [numpy](https://numpy.org/) library to do it for us. A library is simply a package containing functions that someone else has written already.

The process is to use a function in Pandas (`apply`) that performs a calculation on each cell in a column:
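
A minimal sketch of that step, assuming a base-10 log (to match the 'multiple of ten' description above). The column names `gdp_per_capita` and `log_gdp_per_capita` are guesses, so check them against the real dataframe; the three values are stand-ins:

```python
import numpy as np
import pandas as pd

# Stand-in values; in the session this would be the GDP column of `data`.
# The column name "gdp_per_capita" is an assumption.
toy = pd.DataFrame({"gdp_per_capita": [10.0, 100.0, 1000.0]})

# apply() runs np.log10 over the values in the column
toy["log_gdp_per_capita"] = toy["gdp_per_capita"].apply(np.log10)
print(toy)
```

On this scale 10, 100 and 1000 become 1, 2 and 3: each step of one is a multiple of ten.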

# A question

Which, if any, of these factors determine the modern slavery score?

We can 'model' some of this by looking at data on a plot:

We are now going to try a very simple plotting device to trace the relationship between the two:
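
One candidate for such a device is seaborn's `regplot`, which draws the points and a fitted straight line through them. This is a sketch only: the data below are random numbers generated so the example runs on its own (they say nothing about real countries), and the column name `log_gdp_per_capita` is an assumption:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data: random numbers, NOT real country data
rng = np.random.default_rng(0)
x = rng.uniform(2, 5, size=50)
toy = pd.DataFrame({
    "log_gdp_per_capita": x,  # assumed column name
    "modern_slavery_score": 60 - 8 * x + rng.normal(0, 3, size=50),
})

# regplot draws a scatter plot plus a fitted straight line
# (the shaded band around the line is a confidence interval)
ax = sns.regplot(x="log_gdp_per_capita", y="modern_slavery_score", data=toy)
```

With the real dataframe you would pass `data=data` and the real column names instead.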

What about gender equality?

What does this all mean?

In essence, the model reflected in the plot says the following: as you go right by 1 in 'power distributed by gender', 'modern_slavery_score' goes up by a certain amount on average. The more it goes up (the steeper the slope of the line), the bigger the change in one variable that goes with each step in the other.

We can represent the line with a relatively simple equation for calculating `y` from:

1. The predicted value of `y` when `x` is zero: a constant, $\alpha$, that we'd have to figure out
2. How much on average `y` goes up or down when `x` goes up by 1: the slope, $\beta$
3. An error term, $\epsilon$ (how widely the points are scattered around the line)

This is a 'linear' equation used in 'ordinary least squares' regression (roughly speaking):

$$y_i = \alpha + \beta{}x_i + \epsilon_i$$
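
To see what 'least squares' gives us in the simplest case, here is a sketch with four made-up points that sit exactly on the line $y = 2 + 3x$. The formulas are the standard ones for a single predictor; the numbers are invented:

```python
import numpy as np

# Made-up points that lie exactly on y = 2 + 3x (so the error term is zero)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x

# Ordinary least squares estimates for a single predictor
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()

# Residuals: the part of y that the line does not explain (the epsilon_i)
residuals = y - (alpha + beta * x)
print(alpha, beta)
```

Because the points are exactly on a line, the estimates recover the intercept 2 and the slope 3, and the residuals are all zero. Real data will not be so tidy.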

# Let's make a 'regression table' and talk about what it all means

# What has this got to do with AI?!

Actually what we did there was build a rules-based model for _predicting_ one thing based on data we had about something else. 

If we didn't have a country's data point on slavery, we could use this model to make a prediction based on a gender equality measure plus (the log of) its GDP per capita.

### **ML and AI do the same thing**

(except its rules are less explicit: we feed in the data but then use different methods to help the computer 'learn' how best to interpret the data so as to predict the 'dependent' variable)

# Methods (and jargon)

1. Linear regression
2. Random walks/random forest
3. Monte Carlo simulations
4. Neural networks (a bit different)
5. Deep learning

Back to the presentation...