# Foundations

## Dimensions

The first concept we're going to tackle is that of a dimension. You're probably used to thinking about dimensions in space. Space, as you've probably heard, is 3D. 

What does that mean? It means that I can describe where you are using 3 numbers. By convention, these are often called `x, y, z`. I can make those numbers really precise - adding lots of decimal places - but there's no need for more than 3 numbers.

In [3]:
#HIDDEN
from ipywidgets import interactive
import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
import numpy as np

%matplotlib inline

def plot_3d(x, y, z):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(x, y, z, color="g", s=200)
    plt.title('With these three sliders, you can specify where the ball is in space')
    ax.set_xlim([0, 50])
    ax.set_ylim([0, 50])
    ax.set_zlim([0, 50])


interactive_plot = interactive(plot_3d, x=(0, 50, 1), y=(0, 50, 1), z=(0, 50, 1))
output = interactive_plot.children[-1]
output.layout.width = '2500px'

interactive_plot


What happens if I try to describe where you are using just 2 numbers? It's ambiguous. I can specify where you are in terms of longitude and latitude, but you could be at any height. Or I can choose to specify your height and longitude, but then you could be at any latitude.

In [44]:
#HIDDEN

def plot_3d(x, y):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(111, projection='3d')
        
    plt.title('If I only describe position with 2 numbers, I could be anywhere on this line')
    
    zz = list(range(100))
    xx = np.ones_like(zz)*x
    yy = np.ones_like(zz)*y

    ax.plot(xx, yy, zz, alpha=0.5, color='g', linewidth=10)

    ax.set_xlim([0, 50])
    ax.set_ylim([0, 50])
    ax.set_zlim([0, 50])


interactive_plot = interactive(plot_3d, x=(0, 50, 1), y=(0, 50, 1))
output = interactive_plot.children[-1]
output.layout.width = '2500px'

interactive_plot





And how about if I use a single number? Then things are *really* tricky, because you could be anywhere in the other two dimensions we aren't specifying. The expanse of green you see below is typically referred to as a 'plane': a flat, two-dimensional surface, here sitting inside our 3D space.

In [48]:
#HIDDEN

def plot_3d(z):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(111, projection='3d')
    xx, yy = np.meshgrid(range(50), range(50))
    z = np.ones_like(xx)*z
    ax.plot_surface(xx, yy, z, color='g', alpha=0.2)
    ax.set_xlim([0, 50])
    ax.set_ylim([0, 50])
    ax.set_zlim([0, 50])
    
    plt.title('If I only describe spatial position with 1 number, I could be anywhere on the plane')

    
interactive_plot = interactive(plot_3d, z =(0, 50, 1))
output = interactive_plot.children[-1]
output.layout.width = '2500px'

interactive_plot



So as you can see, if we're operating in a 3D space, describing where we are using fewer than 3 numbers leaves our position ambiguous.

What if I want to specify your position in time? How many numbers do I need? Just 1 - your position in time can be represented just by a single number. That's why it's called a timeline:

In [46]:
#HIDDEN

def plot_1d(x):
    fig = plt.figure(figsize=(10,2))
    ax = fig.add_subplot(111)

    ax.scatter(x,0, color="g", s=200)
    ax.axhline()
    ax.set_xlim([0, 50])
    ax.set_xlabel('Year')


# plot_1d only takes x, so only an x slider is needed
interactive_plot = interactive(plot_1d, x=(0, 50, 1))
output = interactive_plot.children[-1]
output.layout.width = '2500px'

interactive_plot



And so if we want to specify somebody's position in space and time, we're going to need 4 dimensions - 3 for space, and 1 for time. That's hard to depict - here I've used colour as the 4th dimension:


In [49]:
#HIDDEN

def plot_3d(x, y, z, year):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(111, projection='3d')
    MAX_YEAR = 2019  # top of the year slider range, so year / MAX_YEAR stays within [0, 1]
    cm = plt.get_cmap('jet')
    color = cm(year / MAX_YEAR)  # color is an RGBA tuple picked from the colormap
    ax.scatter(x, y, z, color=color, s=200)
    ax.set_xlim([0, 50])
    ax.set_ylim([0, 50])
    ax.set_zlim([0, 50])



interactive_plot = interactive(plot_3d, x=(0, 50, 1), y=(0, 50, 1), z=(0, 50, 1), year=(0, 2019, 10))
output = interactive_plot.children[-1]
output.layout.width = '2500px'

interactive_plot


So that's really hard to draw. Unfortunately, so are most of the spaces we're going to tackle in this piece, because once you start looking for them, high-dimensional spaces are everywhere. Let's look at the stats for one of my childhood idols, diminutive rugby player Jason Robinson:

In [14]:
#HIDDEN

pd.DataFrame(np.array([56, 52, 4, 150, 30, 0, 0, 0, 39, 17, 0]).reshape(1,-1),
            columns=['Matches','Start', 'Sub', 'Pts', 'Tries', 'Conv', 'Pens', 'Drop','Won', 'Lost', 'Draw'],
            index=['Jason Robinson'])

| | Matches | Start | Sub | Pts | Tries | Conv | Pens | Drop | Won | Lost | Draw |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Jason Robinson | 56 | 52 | 4 | 150 | 30 | 0 | 0 | 0 | 39 | 17 | 0 |


So we have numbers corresponding to the number of matches he played, started, and appeared in as a substitute; how many points, tries, conversions, penalties, and drop goals he scored; and how many games he won, lost, and drew.

So what do we have? That's right, we have an 11-dimensional space, one dimension per column.

Every rugby player in the `ESPN` database can be represented as a point in this space. Here are a couple more:

In [21]:
#HIDDEN

df = pd.DataFrame(np.array([56, 52, 4, 150, 30, 0, 0, 0, 39, 17, 0,
                      63, 54, 9, 185, 37, 0, 0, 0, 44, 17, 2,
                      112, 106, 6, 1598, 29, 293, 281, 8, 99, 12, 1]).reshape(3,-1),
            columns=['Matches','Start', 'Sub', 'Pts', 'Tries', 'Conv', 'Pens', 'Drop','Won', 'Lost', 'Draw'],
            index=['Jason Robinson', 'Jonah Lomu', 'Dan Carter'])
df

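| | Matches | Start | Sub | Pts | Tries | Conv | Pens | Drop | Won | Lost | Draw |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Jason Robinson | 56 | 52 | 4 | 150 | 30 | 0 | 0 | 0 | 39 | 17 | 0 |
| Jonah Lomu | 63 | 54 | 9 | 185 | 37 | 0 | 0 | 0 | 44 | 17 | 2 |
| Dan Carter | 112 | 106 | 6 | 1598 | 29 | 293 | 281 | 8 | 99 | 12 | 1 |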


To reiterate the link to the dimensions we discussed earlier, here's what these three players look like plotted in a 3D subspace of the columns. You can choose which columns to use:

In [31]:
#HIDDEN

def plot_3d(x_col, y_col, z_col):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(111, projection='3d')
    
    x = df[x_col].values
    y = df[y_col].values
    z = df[z_col].values  # use the column chosen for the z axis
    
    scatter = ax.scatter(x, y, z, color=['r', 'g', 'b'], s=200)
    
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.set_zlabel(z_col)
    
    labels = ['Jason Robinson', 'Jonah Lomu', 'Dan Carter']
    for i, txt in enumerate(labels):
        ax.text(x[i], y[i], z[i], txt)


interactive_plot = interactive(plot_3d, 
                               x_col=(df.columns.values), 
                               y_col=df.columns.values, 
                               z_col=df.columns.values)
output = interactive_plot.children[-1]
output.layout.width = '2500px'
interactive_plot


If you're used to working with spreadsheets or databases, you're probably thinking: oh I get it, *dimensions are basically like columns in my spreadsheet*. And that's exactly right - you can think of each row in your database as a point in a high-dimensional space defined by the columns.
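To make the row-as-point idea concrete, here is a minimal sketch that rebuilds the three-player table from above and reads it as an array of points. The variable names (`stats`, `points`, `subspace`) and the choice of three columns are mine, picked purely for illustration.

```python
import numpy as np
import pandas as pd

columns = ['Matches', 'Start', 'Sub', 'Pts', 'Tries', 'Conv',
           'Pens', 'Drop', 'Won', 'Lost', 'Draw']
stats = pd.DataFrame(
    [[56, 52, 4, 150, 30, 0, 0, 0, 39, 17, 0],
     [63, 54, 9, 185, 37, 0, 0, 0, 44, 17, 2],
     [112, 106, 6, 1598, 29, 293, 281, 8, 99, 12, 1]],
    columns=columns,
    index=['Jason Robinson', 'Jonah Lomu', 'Dan Carter'],
)

# Each row is one point; each column is one dimension.
points = stats.to_numpy()
n_points, n_dims = points.shape
print(n_points, n_dims)  # 3 11

# Keeping only three columns places every player in a 3D subspace,
# which is what the interactive plot above draws.
subspace = stats[['Matches', 'Tries', 'Won']].to_numpy()
print(subspace.shape)  # (3, 3)
```

Adding a player adds a row (another point); adding a statistic adds a column (another dimension).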


### Why does this matter?
**Machine learning is (more or less) the business of predicting some dimensions given some others.**

Let's dig into this using the examples we have so far. Some prediction tasks might be:

1. Predicting longitude based upon latitude
2. Predicting height above the earth based upon longitude and latitude
3. Predicting how many points you've scored based upon how many tries, conversions, penalties, and drop goals you've scored

These range in difficulty, from very hard to very easy. You can probably see this intuitively: 

1. If I know your latitude, I can draw a line upon which your position must lie. But there are lots of different possible longitudes for a given latitude. The fact that you're unlikely to be in the sea probably helps, but we probably can't be super precise.


2. If I know your latitude and longitude, I might actually be able to say quite a lot about how high you are above the earth. For instance, if you're in New York, then you're likely to be higher above the earth than if you're in rural Zimbabwe.


3. This one's actually trivial, because the number of points you scored is a weighted sum of the tries, conversions, penalties, and drop goals you've scored! So there's a very simple mathematical rule we could write down to describe this relationship. But actually, we could learn it from the data too, as we'll show.
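As a quick check of item 3, the sketch below applies rugby union's scoring values (5 for a try, 2 for a conversion, 3 for a penalty, 3 for a drop goal) to the three players above, then recovers the same weights with ordinary least squares. Two assumptions to flag: the scoring values aren't stated anywhere in this post, and the `fake_stats` rows are randomly generated, because three real players are too few to pin down four weights.

```python
import numpy as np

# Columns: tries, conversions, penalties, drop goals
real_stats = np.array([
    [30, 0, 0, 0],      # Jason Robinson
    [37, 0, 0, 0],      # Jonah Lomu
    [29, 293, 281, 8],  # Dan Carter
])
real_points = np.array([150, 185, 1598])

# The hand-written rule: points are a weighted sum of the four columns.
scoring = np.array([5, 2, 3, 3])
print(np.array_equal(real_stats @ scoring, real_points))  # True

# Learning the same rule from data. Assumption: synthetic, noise-free
# stat lines, generated only so there are more rows than unknowns.
rng = np.random.default_rng(0)
fake_stats = rng.integers(0, 50, size=(20, 4))
fake_points = fake_stats @ scoring

# Least squares finds the weights that best map the stats to the points.
learned, *_ = np.linalg.lstsq(fake_stats, fake_points, rcond=None)
print(np.round(learned, 6))
```

Because the synthetic points follow the rule exactly, the learned weights should match the scoring values up to floating-point error. Real data is rarely this clean.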


Is life really this simple? Is all machine learning predicting one column of a database from a bunch of the others? Almost. So let's think about Go, the ancient Chinese game that DeepMind cracked using something called Deep Reinforcement Learning.

Can we build this kind of 'database-style' representation? It's kind of tricky. We want to be able to pick the next move we make. Let's think about the ingredients we need:

- For every position on the board, 
    - whether there is a white piece there, a black piece there, or neither
- Whether that move was good or not

The ingredients are simple, but actually getting them is rather hard:

- A Go board is 19 x 19, so there are 361 positions that we need to specify. That's ok - we can have 361 columns. Unfortunately, we're going to need a lot of examples to understand what each column means. Imagine a sport you've never heard of (like, I don't know, rugby), where somebody gives you 361 numbers to describe each player, along with an answer of how good that player actually is. You're going to need *a lot* of examples of players to figure out the significance of each of those columns.

- We don't actually have access to that information! That's a real nuisance. Nobody is telling us whether a given move was good or bad - all we know is whether the game in which that move occurred eventually worked out well for the player or not. You can see that this is a pretty poor measure of whether an individual move was good or bad. This problem is known as **credit assignment** - how do I know whether I did a good thing but lost anyway, or won despite some bad moves?
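Here is a minimal sketch of both ingredients, under assumptions of my own: the encoding (0 for empty, 1 for black, -1 for white), the helper `board_to_row`, and the three-move game are all hypothetical, and none of this is meant to describe how DeepMind represents a board. It flattens each position into 361 columns, then labels every row with the final result, since that is the only feedback we have.

```python
import numpy as np

BOARD_SIZE = 19
EMPTY, BLACK, WHITE = 0, 1, -1  # hypothetical encoding of a board point


def board_to_row(board):
    """Flatten a 19 x 19 board into one row of 361 columns."""
    return np.asarray(board).reshape(-1)


# A made-up game fragment: one snapshot after each of Black's three moves.
board = np.full((BOARD_SIZE, BOARD_SIZE), EMPTY)
rows = []
for move in [(3, 3), (15, 15), (3, 15)]:
    board[move] = BLACK
    rows.append(board_to_row(board).copy())  # copy, since board keeps changing

X = np.stack(rows)  # shape (3, 361): one row per position
black_won = 1       # the only feedback we get: the final result

# Credit assignment at its crudest: every position inherits the final
# result, whether or not the move that produced it was any good.
y = np.full(len(rows), black_won)
print(X.shape, y)
```

Every row gets the same label, which is exactly why the final result is such a noisy signal for any individual move.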

## Next time: learning
In the next post, we're going to dig into the different kinds of things we can learn about these dimensions, and some of the common problems we encounter. We'll also talk about how life can get complicated when we get to higher-dimensional spaces, and start tying some of this friendly database/dimension stuff to unfriendly matrix algebra (hopefully, in the process, friendlifying it).
