# Project - Projectile 1

In this project, we aim to apply supervised machine learning techniques to a classic problem in physics: predicting the horizontal distance traveled by a projectile. The objective is to build predictive models for two related but distinct tasks. 

- For Task 1, the goal is to predict the horizontal distance based on the initial velocity components along the x and y axes ($v_x$ and $v_y$). 
- For Task 2, the model will predict the horizontal distance using the magnitude of the initial velocity ($v$) and the launch angle ($\alpha$). 

Both tasks involve training regression models on simulated projectile data, with the ultimate aim of accurately capturing the underlying physical relationships from example data. We will see that a single learning system, trained on different problems, is flexible enough to adapt to each of them.
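
The two tasks describe the same launch in two coordinate systems. As a minimal sketch, the conversion between $(v_x, v_y)$ and $(v, \alpha)$ might look like the following. The helper names `to_polar` and `to_cartesian` are illustrative and not part of the project files, and the angle is assumed to be in radians.

```python
import numpy as np

def to_polar(vx, vy):
    """Convert velocity components (vx, vy) to speed v and launch angle alpha (radians)."""
    v = np.hypot(vx, vy)
    alpha = np.arctan2(vy, vx)
    return v, alpha

def to_cartesian(v, alpha):
    """Convert speed v and launch angle alpha (radians) back to components (vx, vy)."""
    return v * np.cos(alpha), v * np.sin(alpha)

# Hypothetical launch: 30 horizontally and 40 vertically
v, alpha = to_polar(30.0, 40.0)
vx_back, vy_back = to_cartesian(v, alpha)

print("speed:", v)
print("angle (degrees):", np.degrees(alpha))
print("recovered components:", vx_back, vy_back)
```

Since the conversion is invertible, both feature sets carry the same information about the launch; only the form in which a model receives it differs.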


## 1. Overview

Recall the key ingredients of supervised machine learning:

- Task (T)
- Experience (E)
- Performance measure (P)
- Hypothesis Space (Machine learning model)
- Learning Algorithm 
- Generalisation

## 2. Task-1: horizontal distance from the initial velocity components ($v_x$ & $v_y$)

### 2.1 Formulate the task for Machine Learning

In supervised machine learning, the goal is always to infer from data ("experience") the relation between two sets of variables, called "**features**" and "**labels**" (also called "**targets**"), of some subject. Both the **feature** and the **label** can be composed of multiple quantities or variables, where each variable represents some property of the subject. The task of a supervised learning system is to return the **label** as accurately as possible when a **feature** comprising a set of variables agreed in advance is provided. Thus, from the machine's perspective, the **features** are also called the "**input**" and the **label** is also called the "**output**".


In the current project, 

- the "subject" under investigation is the projectile launched under the effect of gravity, 
- the "features" are the pair of launch velocity components $(v_x,\; v_y)$, and
- the "label" is the horizontal distance (i.e. along $x$) of the projectile landing position from the launching point, denoted $d$.

The mapping from the **feature** to the **label** for a subject in the real world is the **target function**. It is this _true association of a label to some features_ that the machine is meant to learn.

> A **target function** is the real / true function that associates a specific value of the **label** with a given **feature**.
> In the current project, the target function is the function $f_T(\cdot)$ that takes a set of features $(v_x, v_y)$ and returns the distance $d$ in a real-world projectile experiment. Formally, $$ f_T(\cdot):\{(v_x, v_y)\}\rightarrow \{ d \} \quad \text{equivalently}\quad f_T(v_x, v_y) = d\quad.$$


The **target function**, denoted $f_T(\cdot)$, is specified by 

- the form of the "feature" (or "input") -- the _domain of definition_ and the meaning of each of its variables.
- the form of the "label" (or "output") -- the _domain of definition_ and the meaning of each of its variables.

> In the current project, the target function $f_T(\cdot)$ is specified by 
> - the feature domain $D_F\hat{=}\{ (v_x, v_y) | v_x \in \mathbb{R}^+, \; v_y \in \mathbb{R}^+ \}$ where $v_x$ and $v_y$ are respectively the horizontal and vertical components of the launching velocity, and 
> - the label domain $D_T\hat{=}\{d|d\in \mathbb{R}^+\}$ where $d$ represents the landing distance of the projectile.
> 
> The target function is formally  $$ f_T:D_F\rightarrow D_T $$
>
> In most situations, unlike the current projectile problem where the target function can be derived from physics, the target function is too complex to derive, and the task of supervised machine learning is to infer that unknown **target function** from the available data.
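
For the projectile, the physics derivation is short enough to write down. This sketch assumes flat ground, no air resistance, and a constant gravitational acceleration $g = 9.81$; the function names `flight_time` and `target_function` are hypothetical. Under these assumptions the height $y(t) = v_y t - \tfrac{1}{2} g t^2$ returns to zero at the flight time $t = 2 v_y / g$, so that $d = v_x t = 2 v_x v_y / g$:

```python
G = 9.81  # assumed gravitational acceleration (m/s^2)

def flight_time(vy, g=G):
    """Positive time at which the height vy*t - g*t**2/2 returns to zero."""
    return 2.0 * vy / g

def target_function(vx, vy, g=G):
    """Noise-free landing distance d = vx * flight_time(vy) = 2*vx*vy/g."""
    return vx * flight_time(vy, g)

# Hypothetical launch with vx = vy = 80
vx, vy = 80.0, 80.0
t = flight_time(vy)
height_at_landing = vy * t - 0.5 * G * t**2  # should vanish up to rounding
d = target_function(vx, vy)

print(f"flight time: {t:.3f}")
print(f"height at landing: {height_at_landing:.2e}")
print(f"distance: {d:.1f}")
```

A learned model has no access to this formula; it only sees samples of $(v_x, v_y, d)$ and has to recover the relationship from them.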

### 2.2 Data exploration

- Load the following data files with `numpy.loadtxt`:
  - "training_set_1.dat"
  - "test_set_1.dat"
  - "training_set_2.dat"
  - "test_set_2.dat"

- Explore the header of the data file, and determine which columns are the inputs and which columns are the output.

- Explore the loaded data structure. How many entries (samples) are in each dataset?

- Use the plotting library `matplotlib` to explore the following aspects of the datasets:

  - For each quantity, what is its distribution in the sample? Is the distribution in the training set aligned with that in the test set?

  - How are the different quantities correlated with each other?

  - How does the target value depend on the input features?
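
The exploration steps above can be sketched as follows. Because the location of the data files is not fixed here, the sketch writes a small synthetic stand-in file (random values, with a `vx vy dx` header written by `numpy.savetxt`) to a temporary directory and explores that instead. The file name `demo_set.dat` and the helper `read_header` are hypothetical.

```python
import os
import tempfile
import numpy as np

def read_header(path):
    """Return the column names from the leading '# ...' comment line."""
    with open(path) as fh:
        first = fh.readline()
    return first.lstrip("#").split()

rng = np.random.default_rng(0)
# Synthetic stand-in data: three columns playing the role of vx, vy, dx
vx = rng.normal(80, 15, 500)
vy = rng.normal(80, 15, 500)
dx = 2 * vx * vy / 9.81
demo = np.column_stack([vx, vy, dx])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_set.dat")
    np.savetxt(path, demo, header="vx vy dx")
    columns = read_header(path)
    data = np.loadtxt(path)  # lines starting with '#' are skipped

print("columns:", columns)
print("shape (samples, columns):", data.shape)

# Per-column summary statistics
for name, col in zip(columns, data.T):
    print(f"{name}: mean={col.mean():.2f}, std={col.std():.2f}, "
          f"min={col.min():.2f}, max={col.max():.2f}")

# Pairwise Pearson correlations between the columns
corr = np.corrcoef(data, rowvar=False)
print(np.round(corr, 2))
```

Computing the same statistics on a training set and its test set gives a first check of whether their distributions are aligned; histograms and scatter plots from `matplotlib` give the visual version.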

### 2.3 Mini linear regression with "scikit-learn"

#### Ordinary Least Squares Regression (without feature engineering)

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

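# train1 and test1 are the arrays loaded from training_set_1.dat and test_set_1.dat
# in Section 2.2 (columns: vx, vy, dx)
# Use training data (vx, vy) to predict dx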
X_train = train1[:, :2]
y_train = train1[:, 2]
X_test = test1[:, :2]
y_test = test1[:, 2]

# Initialize OLS regression model
ols = LinearRegression()

# Train OLS regression model
ols.fit(X_train, y_train)

# Predict on test set
y_pred = ols.predict(X_test)

# Compute performance measure (mean squared error)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (on test set):", mse)

# Optionally, show model coefficients
print("Learned coefficients:", ols.coef_)
print("Learned intercept:", ols.intercept_)


#### Ordinary Least Squares Regression with feature engineering

For a more complex model, one needs to construct more tailored features from the raw input features, a process known as _feature engineering_.

We will explore a specific type of feature engineering -- polynomial feature expansion by _taking powers and cross-products of the original features, allowing models to capture nonlinear relationships within the data_.

We will perform the polynomial expansion up to order 3.
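
As a minimal sketch of this idea, scikit-learn's `PolynomialFeatures` can build the expanded feature matrix. The data below is synthetic and noise-free (an assumption of this example, not the project data), generated from $d = 2 v_x v_y / g$, so the cross-product term $v_x v_y$ is exactly what a linear model is missing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Expansion of a single sample (x0, x1) = (2, 3) up to order 3.
# Columns: x0, x1, x0^2, x0*x1, x1^2, x0^3, x0^2*x1, x0*x1^2, x1^3
poly = PolynomialFeatures(degree=3, include_bias=False)
expanded = poly.fit_transform(np.array([[2.0, 3.0]]))
print("expanded sample:", expanded)

# Synthetic, noise-free stand-in for (vx, vy) -> d
rng = np.random.default_rng(0)
X = rng.normal(80, 15, size=(500, 2))
y = 2 * X[:, 0] * X[:, 1] / 9.81

# Linear model on the raw features
linear = LinearRegression().fit(X, y)
mse_linear = mean_squared_error(y, linear.predict(X))

# Same linear model on the polynomial features
X_poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)
linear_poly = LinearRegression().fit(X_poly, y)
mse_poly = mean_squared_error(y, linear_poly.predict(X_poly))

print("MSE with raw features:       ", mse_linear)
print("MSE with polynomial features:", mse_poly)
```

On this synthetic data the raw-feature fit leaves a large error, while the expanded fit reproduces the labels almost exactly because the product $v_x v_y$ is now one of the input columns. The model itself is still linear in its coefficients; only the features changed.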

In [None]:
# Create a feature matrix in analogy with X_train before, but this time with polynomial features up to order 3 constructed from the raw input features

In [None]:
# Repeat the training and testing process as before, and print the results

In [None]:
from sklearn.linear_model import Ridge

# Use training data (vx, vy) to predict dx, same as above
X_train = train1[:, :2]
y_train = train1[:, 2]
X_test = test1[:, :2]
y_test = test1[:, 2]

# Initialize Ridge regression model with penalty coefficient alpha=1.0
ridge = Ridge(alpha=1.0)

# Train Ridge regression model
ridge.fit(X_train, y_train)

# Predict on test set
y_pred_ridge = ridge.predict(X_test)

# Compute performance measure (mean squared error)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
print("Ridge Regression Mean Squared Error (on test set):", mse_ridge)

# Optionally, show model coefficients
print("Ridge Regression learned coefficients:", ridge.coef_)
print("Ridge Regression learned intercept:", ridge.intercept_)


In [None]:
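# Code for generating the training and test sets in ./data/projectile_1
import numpy as np

g = 9.81  # gravitational acceleration in m/s^2

def distance_vx_vy(vx, vy):
    # Horizontal speed times the time of flight 2*vy/g
    return vx*(vy/g)*2

def distance_v_alpha(v, alpha):
    # Same distance expressed with speed v and launch angle alpha (radians)
    return v*np.sin(2*alpha)*(v/g)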

n_train = 10000
n_test = 2000

vx = np.random.normal(80, 15, n_train + n_test)    
vy = np.random.normal(80, 15, n_train + n_test)

v = np.sqrt(vx**2 + vy**2)
alpha = np.arctan2(vy, vx)

# Label: noise-free distance plus Gaussian noise with standard deviation dx_std
dx_std = 10
dx = distance_vx_vy(vx, vy) + np.random.normal(0, dx_std, n_train + n_test)


train1 = np.zeros((n_train, 3))
train1[:,0] = vx[:n_train]
train1[:,1] = vy[:n_train]
train1[:,2] = dx[:n_train]

test1 = np.zeros((n_test, 3))
test1[:,0] = vx[n_train:n_train+n_test]
test1[:,1] = vy[n_train:n_train+n_test]
test1[:,2] = dx[n_train:n_train+n_test]

train2 = np.zeros((n_train, 3))
train2[:,0] = v[:n_train]
train2[:,1] = alpha[:n_train]
train2[:,2] = dx[:n_train]

test2 = np.zeros((n_test, 3))
test2[:,0] = v[n_train:n_train+n_test]
test2[:,1] = alpha[n_train:n_train+n_test]
test2[:,2] = dx[n_train:n_train+n_test]


# np.savetxt('./data/projectile_1/training_set_1.dat', train1, header='vx vy dx')
# np.savetxt('./data/projectile_1/training_set_2.dat', train2, header='v alpha dx')
# np.savetxt('./data/projectile_1/test_set_1.dat', test1, header='vx vy dx')
# np.savetxt('./data/projectile_1/test_set_2.dat', test2, header='v alpha dx')

In [None]:
import matplotlib.pyplot as plt

sc = plt.scatter(vx[:1000], vy[:1000], c=dx[:1000], marker='o')
plt.colorbar(sc, label='dx')
plt.xlabel('vx')
plt.ylabel('vy')
plt.title('vy vs vx colored by dx (first 1000 points)')
plt.show()



In [None]:
plt.plot(vx, dx, 'o', alpha=0.2, mec='none')
plt.xscale('log')
plt.yscale('log')
plt.xlabel('vx')
plt.ylabel('dx')
plt.title('dx vs vx')
plt.show()
