# Database of Core Functions and ML Concepts

This notebook is intended as a place to gather non-topic-specific ML concepts for easy reference, along with function implementations that can be easily copied to other workbooks.

**Contents**:
- Validation Sets
- R-squared (`print_score`)
- Reading and writing files
- Axis = 0/1
- ggplot

## Validation Sets

Perhaps the most crucial part of building a Machine Learning model is properly splitting the data into training and validation (and test) sets, so we can accurately evaluate the performance of our model.

- Train set: used to train our model
- Validation set: used to evaluate the performance of our model at each stage
- Test (hold-out) set: used to determine final performance of the model, having completed all training

In general, we actually want to construct non-random validation sets to prove that our model generalises to a new type of data. For example, if the data is ordered in time, split it by date so that the validation set comes after the training set; this shows that our model is capable of extrapolation.
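A minimal sketch of a date-based split. The toy DataFrame, the `date` column name and the helper `split_by_date` are hypothetical and only illustrate the idea of holding out the most recent rows.

```python
import pandas as pd

# Hypothetical toy data: one row per day, purely for illustration.
df_dates = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "value": range(10),
})

def split_by_date(frame, date_col, n_valid):
    """Hold out the most recent n_valid rows as the validation set."""
    ordered = frame.sort_values(date_col).reset_index(drop=True)
    n_train = len(ordered) - n_valid
    return ordered.iloc[:n_train].copy(), ordered.iloc[n_train:].copy()

train, valid = split_by_date(df_dates, "date", n_valid=3)

# Every training row is older than every validation row.
print(len(train), len(valid))
print(train["date"].max() < valid["date"].min())
```

Sorting before slicing matters: if the rows are not already in date order, taking the last rows would not give the most recent ones.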

Ultimately, we want our validation set to be representative of the test set. To check this, fit a number of different ML models, score each one on both the validation set and the test set, and confirm that the two scores are positively correlated. We want a model that performs better on our validation set to also perform better on our test set.

**Cross-Validation:** divide the data into n sets (folds). The model is trained n times, each time holding out one fold as the validation set and training on the remainder.

Pros:
- Can use all of the data, which is good when we have a small dataset

Cons:
- Time: must re-train model over and over
- Validation sets are constructed at random

Because of this, think carefully before using cross-validation.
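A rough sketch of how the folds can be generated, using only NumPy. The helper `kfold_indices` and its arguments are hypothetical names for illustration; model training is left out.

```python
import numpy as np

def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, valid_idx) pairs; each fold is the validation set once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)        # random assignment to folds
    folds = np.array_split(shuffled, n_folds)
    for i in range(n_folds):
        valid_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, valid_idx

splits = list(kfold_indices(n_samples=10, n_folds=5))
for train_idx, valid_idx in splits:
    # A model would be re-trained on train_idx and scored on valid_idx here.
    print(sorted(valid_idx.tolist()))
```

The shuffle is what makes the validation sets random, which is the drawback noted above for time-ordered data. scikit-learn's `KFold` in `sklearn.model_selection` provides the same idea ready-made.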

**Constructing basic validation set:**

In [3]:
def split_vals(a,n): return a[:n].copy(), a[n:].copy() # Note use .copy() to create a new object in memory
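A short usage sketch for `split_vals`, assuming the rows are already in the order we want to split on. The array `data` and the size `n_valid` are hypothetical stand-ins.

```python
import numpy as np

def split_vals(a, n):
    # Same helper as in the notebook: first n rows, then the rest, each as a copy.
    return a[:n].copy(), a[n:].copy()

data = np.arange(10)           # hypothetical stand-in for a feature matrix
n_valid = 3                    # hypothetical validation set size
n_train = len(data) - n_valid

train, valid = split_vals(data, n_train)
train[0] = 99                  # modifying the copy leaves `data` untouched
print(train, valid, data[0])
```

Without `.copy()`, slicing a NumPy array returns a view, so the assignment to `train[0]` would also change `data`.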

Consider ensembling different models to get a better result.

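Parameters: values learned by the model from the training data (e.g. the weights of a linear model).  
Hyper-Parameters: settings of the model or training procedure that we choose and tune ourselves (e.g. the number of trees in a random forest).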
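To make the distinction concrete, here is a small sketch using closed-form ridge regression in NumPy. The function `fit_ridge` and the toy data are assumptions for illustration: the penalty `alpha` is a hyper-parameter we pick, while the returned weights are parameters learned from the data.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression without intercept: w = (X^T X + alpha I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Hypothetical toy data: y = 2 * x exactly.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

w_unregularised = fit_ridge(X, y, alpha=0.0)   # hyper-parameter alpha chosen by us
w_regularised = fit_ridge(X, y, alpha=10.0)    # a different choice gives different learned weights
print(w_unregularised, w_regularised)
```

Changing `alpha` changes which weights are learned, which is why hyper-parameters are tuned against the validation set rather than the training set.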



## $R^2$

We often use the $R^2$ metric to evaluate our model accuracy. In short, the metric measures the proportion of variance about the mean captured by our model.

$$ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} $$

$$ R^2 \in (-\infty, 1] $$
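A minimal sketch of the metric, using the standard definition $R^2 = 1 - SS_{res} / SS_{tot}$ (residual sum of squares over total sum of squares about the mean). The function name `r2` and the toy values are illustrative only.

```python
import numpy as np

def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r2(y, y)                              # exact predictions
baseline = r2(y, np.full_like(y, y.mean()))     # always predict the mean
worse = r2(y, y[::-1])                          # predictions worse than the mean
print(perfect, baseline, worse)
```

A perfect model scores 1, always predicting the mean scores 0, and anything worse than the mean goes negative, which is why the range has no lower bound.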

Print score function:

In [28]:
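import math

def rmse(x,y): return math.sqrt(((x-y)**2).mean()) # Root mean squared error; loss function as per Kaggle

def print_score(m):
    # Prints [train RMSE, valid RMSE, train score, valid score, OOB score (if the model has one)]
    # Assumes X_train, y_train, X_valid, y_valid are defined in the notebook;
    # for scikit-learn regressors, m.score returns R^2
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
                m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)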

## Reading and Writing Files

**CSV:**

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('/Users/alexhoward/data/insurance_kaggle/train.csv')

In [15]:
df_new = df[:5][['id','target']]

In [17]:
df_new.to_csv('/Users/alexhoward/data/insurance_kaggle/new_train.csv')
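One thing to watch with `to_csv`: by default it also writes the index, which comes back as an extra `Unnamed: 0` column on the next read. A small sketch with a hypothetical frame and temporary files:

```python
import os
import tempfile
import pandas as pd

# Hypothetical small frame, for illustration only.
df_small = pd.DataFrame({"id": [7, 9, 13], "target": [0, 0, 0]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    df_small.to_csv(with_index)                   # default: index written as the first column
    df_small.to_csv(without_index, index=False)   # index omitted

    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

Passing `index=False` is usually what we want when the index is just the default row number.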

**Feather** files are an extremely quick way to read DFs into pandas:

In [18]:
df_new.to_feather('/Users/alexhoward/data/insurance_kaggle/feather')

In [19]:
new_df = pd.read_feather('/Users/alexhoward/data/insurance_kaggle/feather')

## Axis = 0/1

When performing an aggregation function (e.g. mean) over a DF or array, we need to specify an axis to aggregate over:
- Axis = 0 : aggregate down the rows, giving one result per column
- Axis = 1 : aggregate across the columns, giving one result per row

e.g.

In [20]:
import numpy as np

In [23]:
new_df

   id  target
0   7       0
1   9       0
2  13       0
3  16       0
4  17       0


In [22]:
np.mean(new_df, axis = 1)

0    3.5
1    4.5
2    6.5
3    8.0
4    8.5
dtype: float64

In [24]:
np.mean(new_df, axis = 0)

id        12.4
target     0.0
dtype: float64

In [26]:
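np.mean(new_df) # no axis given: per-column means here, as with axis = 0 (newer pandas versions may return a single overall mean, so pass axis explicitly)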

id        12.4
target     0.0
dtype: float64
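The same convention applies to plain NumPy arrays. A small sketch with a hypothetical 2 x 3 array:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])    # shape (2, 3): 2 rows, 3 columns

col_means = arr.mean(axis=0)   # collapses the rows: one value per column, shape (3,)
row_means = arr.mean(axis=1)   # collapses the columns: one value per row, shape (2,)
overall = arr.mean()           # no axis on an ndarray: mean of all elements
print(col_means, row_means, overall)
```

A useful way to remember it: the axis you pass is the one that disappears from the shape of the result.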

## ggplot

We can plot Data Frames in Python with ggplot in much the same way as in R.

.... insert ....