# Practical session 1

In this session we will explore the general concept of creating a mathematical model.

A mathematical model is an equation that captures the relationship between one set of variables (the inputs) and another (the outputs). These variables are typically things that we can observe, measure and/or categorise, and that we express as numbers. We normally create models because we want to know the value of the outputs for a given set of inputs.



## Practical exercise 1: a simple linear correlation

In this session we will explore the simple case of one input and one output to get used to expressing things as equations. In this first exercise, we will use the heights and weights dataset that can be found on Kaggle (https://www.kaggle.com/datasets/burnoutminer/heights-and-weights-dataset).

Imagine the following question: given a person's height, what would you say is their weight?

To answer it, we can use the historical data to create a mathematical model that describes the relationship between height and weight. Then, we can use this model to predict a person's weight given their height.

The following cell will show you what this data looks like.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv("heights_weights.csv")
height = data["Height(Inches)"].to_numpy()
weight = data["Weight(Pounds)"].to_numpy()

plt.figure(figsize=(10,10))
plt.scatter(height, weight, s=1)
plt.xlabel("Height [Inches]")
plt.ylabel("Weight [Pounds]")

We are going to use a very simple mathematical model to try to explain the relationship between height and weight: a linear model.

A linear model is defined by two parameters:
- a_1: tells the model "how fast" the output grows with respect to the input. This is called the slope.
- a_0: tells the model the value of the output when the input is 0. This is called the cutoff point (more commonly known as the intercept).

The whole family of linear models is:

y = a_0 + a_1*x

Each specific pair of values for a_0 and a_1 gives you one linear model in that family.

A linear model between height and weight is defined by saying x=Height and y=Weight:

Weight = SLOPE * Height + CUTOFF_POINT
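
As a quick illustration of the equation above, the sketch below evaluates one member of the family for a single height. The parameter values 2.0 and 10.0 are placeholders chosen only to keep the arithmetic easy to follow. They are not a fit to the dataset, and `predict_weight` is a hypothetical helper name that does not appear elsewhere in this notebook.

```python
def predict_weight(height_in: float, slope: float, cutoff: float) -> float:
    """Evaluate Weight = slope * Height + cutoff for one height."""
    return slope * height_in + cutoff


# Placeholder parameters (assumed for illustration, not fitted to the data).
example_slope = 2.0
example_cutoff = 10.0

# 2.0 * 70 + 10.0 = 150.0
print(predict_weight(70.0, example_slope, example_cutoff))
```

Changing either parameter gives a different line, that is, a different model from the same family.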

In the cell below you can set a value for the slope and another for the cutoff point. The cell will overlay your model (a yellow line) on top of the data shown before. Try to find the best-fitting model and record the slope and cutoff point you settle on.
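
Judging the fit by eye is the point of this exercise, but it can help to see one way "best fitting" might be turned into a number. The sketch below uses the mean squared error, which is one common choice and an assumption on our part rather than something this notebook prescribes. The helper name `mean_squared_error` and the synthetic data (generated from an arbitrary line plus noise) are also illustrative only.

```python
import numpy as np


def mean_squared_error(x, y, slope, cutoff):
    """Average squared gap between the data y and the line slope * x + cutoff."""
    predicted = slope * x + cutoff
    return float(np.mean((y - predicted) ** 2))


# Synthetic stand-in data: an arbitrary line plus Gaussian noise (not the Kaggle data).
rng = np.random.default_rng(0)
x_demo = rng.uniform(60.0, 75.0, size=200)
y_demo = 3.0 * x_demo - 80.0 + rng.normal(0.0, 5.0, size=200)

# A guess close to the generating line scores lower (better) than a poor guess.
good = mean_squared_error(x_demo, y_demo, slope=3.0, cutoff=-80.0)
poor = mean_squared_error(x_demo, y_demo, slope=1.0, cutoff=0.0)
print(f"close guess: {good:.1f}, poor guess: {poor:.1f}")
```

In the notebook, the same function could be called as `mean_squared_error(height, weight, SLOPE, CUTOFF)` to compare two candidate parameter choices.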

In [None]:
## -- Your parameters here -- ##

SLOPE = 
CUTOFF = 


## --- NO EDIT START --- ##

import numpy as np
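import numpy.typing as npt


def linear_model(
    a: float,
    b: float,
    x: npt.NDArray[np.float64],  # np.float_ was removed in NumPy 2.0
) -> npt.NDArray[np.float64]:
    """Return a * x + b, where a is the slope and b is the cutoff point."""
    return a * x + b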

x_vals = np.linspace(height.min(), height.max(), height.size)
y_vals = linear_model(
    a=SLOPE,
    b=CUTOFF,
    x=x_vals,
)
plt.figure(figsize=(10,10))
plt.scatter(height, weight, s=1, label="data")
plt.plot(x_vals, y_vals, 'y', label="model")
plt.xlabel("Height [Inches]")
plt.ylabel("Weight [Pounds]")
plt.legend()

## --- NO EDIT END --- ##
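
Once you have settled on values by hand, you may want to compare them against an automatic fit. The sketch below uses `np.polyfit` with degree 1, which performs a least-squares fit of a straight line. It runs on synthetic data generated from an arbitrary line (an assumption for illustration), so the numbers it prints say nothing about the heights and weights dataset.

```python
import numpy as np

# Synthetic stand-in data: an arbitrary line plus Gaussian noise (not the Kaggle data).
rng = np.random.default_rng(1)
x_demo = rng.uniform(60.0, 75.0, size=500)
y_demo = 3.0 * x_demo - 80.0 + rng.normal(0.0, 5.0, size=500)

# np.polyfit returns coefficients from highest degree to lowest,
# so for degree 1 the order is [slope, cutoff].
fitted_slope, fitted_cutoff = np.polyfit(x_demo, y_demo, deg=1)
print(f"slope = {fitted_slope:.2f}, cutoff = {fitted_cutoff:.2f}")
```

Replacing `x_demo` and `y_demo` with `height` and `weight` would give least-squares values to compare against your own `SLOPE` and `CUTOFF`.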

## Practical exercise 2: an intergalactic experiment

Imagine that you are a scientist measuring the distance that an object travels in free fall over a given amount of time. You would like to gather some empirical evidence, so you let a ball drop for different amounts of time. However, you have limited time for your experiments, your instruments are noisy and the weather conditions change a lot. In total you take 25000 data points, each recording the time elapsed (between 30 and 60 seconds) and the distance travelled by the ball.

Armed with this data, you want to create a model that lets you predict the distance travelled by the ball when it is left to fall for a given amount of time.

By running the cell below you can observe the data that you have collected.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv("free_fall.csv")
time = data["time"].to_numpy()
distance_travelled = data["distance"].to_numpy()

plt.figure(figsize=(10,10))
plt.scatter(time, distance_travelled, s=1)
plt.xlabel("Time [seconds]")
plt.ylabel("Distance travelled [meters]")

Unlike in the previous exercise, we can see that the relationship between distance and time is not linear. Instead, the data appears to follow a curve. Luckily, we also know a family of models that can describe that curvature. This is the family of quadratic models:

y = a_0 + a_1*x + a_2 * x^2

As you can see from the formula above, one model in that family is now defined by three parameters: a_0, a_1 and a_2. In general, models with more parameters allow us to represent more complex relationships between inputs and outputs.

In the following cell, repeat the exercise above, this time trying to find a good quadratic model that describes the relationship between time and distance travelled in free fall!

In [None]:
## -- Your parameters here -- ##

A_0 = 
A_1 = 
A_2 = 


## --- NO EDIT START --- ##

import numpy as np
import numpy.typing as npt


def quadratic_model(
    a_0: float,
    a_1: float,
    a_2: float,
    x: npt.NDArray[np.float64],  # np.float_ was removed in NumPy 2.0
) -> npt.NDArray[np.float64]:
    """Return a_0 + a_1 * x + a_2 * x**2."""
    return a_0 + a_1 * x + a_2 * x**2

x_vals = np.linspace(time.min(), time.max(), time.size)
y_vals = quadratic_model(
    a_0=A_0,
    a_1=A_1,
    a_2=A_2,
    x=x_vals,
)
plt.figure(figsize=(10,10))
plt.scatter(time, distance_travelled, s=1, label="data")
plt.plot(x_vals, y_vals, 'y', label="model")
plt.xlabel("Time [seconds]")
plt.ylabel("Distance Travelled [meters]")
plt.legend()

## --- NO EDIT END --- ##
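
As with the linear case, a least-squares fit can serve as a sanity check on hand-tuned parameters. The sketch below uses `polyfit` from `numpy.polynomial.polynomial`, which returns coefficients from lowest degree to highest and therefore matches the a_0, a_1, a_2 ordering used above. The data is synthetic and the generating coefficients (5.0, 2.0, 0.5) are arbitrary placeholders with no physical meaning, so the output does not reveal anything about the free fall measurements.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Synthetic stand-in data from an arbitrary quadratic plus noise (not the free fall data).
rng = np.random.default_rng(2)
t_demo = rng.uniform(30.0, 60.0, size=500)
d_demo = 5.0 + 2.0 * t_demo + 0.5 * t_demo**2 + rng.normal(0.0, 10.0, size=500)

# Coefficients come back lowest degree first: a_0, a_1, a_2.
a_0_fit, a_1_fit, a_2_fit = P.polyfit(t_demo, d_demo, deg=2)
print(f"a_0 = {a_0_fit:.2f}, a_1 = {a_1_fit:.2f}, a_2 = {a_2_fit:.3f}")

# Evaluate the fitted model at one point to compare against the generating curve.
t_check = 45.0
fitted_value = a_0_fit + a_1_fit * t_check + a_2_fit * t_check**2
true_value = 5.0 + 2.0 * t_check + 0.5 * t_check**2
print(f"fitted: {fitted_value:.1f}, generating curve: {true_value:.1f}")
```

With noisy data the individual coefficients, especially a_0 and a_1, can differ noticeably from the generating values even when the fitted curve tracks the data closely over the measured range.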

### Bonus question 1

On what planet were these measurements made?

### Bonus question 2

What would happen if I used a quadratic model for the first exercise?