# Cross Validation
**January 15, 2025**

When evaluating a model, don't test it on the data used to train it! If you do, you can't really tell how good your model is, because it has already seen the answers.  

- Separate data into testing and training data.  
    - 20% could be a good size for the testing data. But which 20% do you take?
    - Could take just the last chunk if the data is already in random order. 
    - Or take out random chunks of data for the testing data. 
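
The two ways of picking the 20% can be sketched as follows. This is only an illustration on made-up toy data (the arrays `X` and `y` here are hypothetical, not the exercise dataset used later), using plain NumPy indexing.

```python
import numpy as np

# Hypothetical toy data: 20 rows, 2 features (not the dataset from these notes)
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

n_test = len(X) // 5  # 20% of 20 rows -> 4 rows

# Option 1: take the last chunk (only reasonable if the row order is already random)
X_train_last, X_test_last = X[:-n_test], X[-n_test:]
y_train_last, y_test_last = y[:-n_test], y[-n_test:]

# Option 2: take out random rows for the testing data
rng = np.random.default_rng(seed=0)
idx = rng.permutation(len(X))          # shuffled row indices
test_idx, train_idx = idx[:n_test], idx[n_test:]
X_train_rand, X_test_rand = X[train_idx], X[test_idx]
y_train_rand, y_test_rand = y[train_idx], y[test_idx]

print(X_test_last.shape, X_test_rand.shape)
```

Either way, every row ends up in exactly one of the two sets, so the model is never tested on rows it was trained on.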

Sometimes data is split into *three* sets:
1. Training (70%)
2. Validation (20%)
3. Testing (10%)
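
One possible way to get a 70/20/10 split is to call scikit-learn's `train_test_split` twice. This is a sketch on hypothetical toy data (100 made-up rows), and passing integer `test_size` values (absolute row counts) is just one way to hit the percentages exactly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100)
n = len(X)

# Step 1: set aside 10% of the rows as the final testing set
# (an integer test_size means "this many rows")
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=n // 10, random_state=0)

# Step 2: from the remaining 90 rows, set aside 20 rows (20% of the total) for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n // 5, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

The validation set is typically used to compare models or settings while you work, and the testing set is kept for one final check at the end.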

## 1. k-fold Cross Validation  
Separates data into k groups (called 'folds')  

- If you want 20% for testing, then you want 5 folds  

Trains your model k times, setting aside a different fold for testing each time and training on the other k-1 folds.
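
To see which rows land in which fold, here is a small sketch using scikit-learn's `KFold` on hypothetical toy data (10 made-up rows). The shuffling and the `random_state` value are choices made for this illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 rows, 2 features
X = np.arange(20).reshape(10, 2)

# 5 folds -> each fold holds 20% of the rows
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
all_test = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # One fold is set aside for testing, the other 4 are used for training
    fold_sizes.append((len(train_idx), len(test_idx)))
    all_test.extend(test_idx.tolist())
    print(f"fold {fold}: train rows {train_idx}, test rows {test_idx}")
```

Across the 5 rounds, each row is used for testing exactly once and for training 4 times.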

## 2. Leave-one-out Cross Validation (LOOCV)  
A special case of k-fold where k equals the number of data points: each time, it trains your model on all the data except 1 data point and tests on that point. (1000 data points means it will fit the model 1000 times.)
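
As a quick check of the "k-fold with one point per fold" idea, the sketch below compares scikit-learn's `LeaveOneOut` splitter with an unshuffled `KFold` whose number of folds equals the number of rows. The 6-row array is hypothetical toy data.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Hypothetical toy data: 6 rows, 2 features
X = np.arange(12).reshape(6, 2)

loo = LeaveOneOut()
n_fits = loo.get_n_splits(X)  # one model fit per data point

# Collect the (train rows, test rows) pairs from both splitters
loo_splits = [(tr.tolist(), te.tolist()) for tr, te in loo.split(X)]
kf_splits = [(tr.tolist(), te.tolist())
             for tr, te in KFold(n_splits=len(X)).split(X)]

print(n_fits)
print(loo_splits == kf_splits)
```

Because the number of fits grows with the number of rows, LOOCV can get slow on large datasets.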



In [9]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer

In [10]:


exercise = pd.read_csv('Data/exercise.csv')
y = np.array(exercise['Weight Lost'])
X = exercise.drop(['Date','Weight Lost'], axis=1).values

# OrdinalEncoder won't like NaN values here. Change them to 'None'.
# This fits the data since there was 0 activity on those days
X[:,3] = ['None' if pd.isna(x) else x for x in X[:,3]]

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# When putting in the columns in each imputer/encoder, indicate the column
# of the original matrix
  # [0]: Calories - fill missing values
  # [1]: Exercise Type - One-hot encoding
  # [3]: Quality of Exercise - Ordinal encoding

ct = ColumnTransformer(transformers=[
      ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean'), [0]),  # This is placed first in X
      ('onehot', OneHotEncoder(), [1]),                                         # This is placed second in X
      ('oe', OrdinalEncoder(categories=[['None','Low','Medium','High']]), [3])  # This is placed third in X
    ], remainder='passthrough')                     # Remaining columns placed in order after the last encoder



X = np.array(ct.fit_transform(X))
X

array([[2520.0, 1.0, 0.0, 0.0, 1.0, 10, 194.5],
       [1850.0, 0.0, 1.0, 0.0, 3.0, 10, 193.0],
       [1925.0, 0.0, 0.0, 1.0, 2.0, 30, 191.8],
       [1790.0, 1.0, 0.0, 0.0, 3.0, 20, 187.0],
       [2120.0, 0.0, 1.0, 0.0, 2.0, 10, 189.0],
       [1910.0, 0.0, 0.0, 1.0, 2.0, 35, 186.0],
       [1845.0, 1.0, 0.0, 0.0, 1.0, 20, 186.0],
       [2343.0, 0.0, 1.0, 0.0, 1.0, 15, 189.0],
       [1886.0, 0.0, 0.0, 1.0, 3.0, 30, 188.0],
       [2149.0, 1.0, 0.0, 0.0, 3.0, 15, 190.0],
       [1797.0, 0.0, 1.0, 0.0, 3.0, 10, 187.0],
       [1990.6666666666667, 0.0, 0.0, 1.0, 2.0, 25, 186.0],
       [1934.0, 1.0, 0.0, 0.0, 3.0, 10, 184.0],
       [2129.0, 0.0, 1.0, 0.0, 1.0, 5, 186.0],
       [1872.0, 0.0, 0.0, 1.0, 0.0, 0, 185.0],
       [1957.0, 1.0, 0.0, 0.0, 2.0, 15, 183.5],
       [1790.0, 0.0, 1.0, 0.0, 2.0, 10, 181.0],
       [1990.6666666666667, 0.0, 0.0, 1.0, 3.0, 30, 180.0],
       [1842.0, 1.0, 0.0, 0.0, 3.0, 25, 178.0],
       [2173.0, 0.0, 1.0, 0.0, 2.0, 15, 178.3]], dtype=object)

# Train

In [None]:
# Set aside 17% of the data for the testing set (with 20 rows, this rounds up to 4 test rows)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.17, random_state=22) # , shuffle=True


In [None]:
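# k-fold cross validation
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Assumption: tree_regression is never defined in these notes; a decision tree regressor is used here
tree_regression = DecisionTreeRegressor(random_state=22)

n = 5
# The scorer name is "neg_mean_squared_error": scores are negated MSE, so closer to 0 is better
score = cross_val_score(tree_regression, X, y, scoring="neg_mean_squared_error", cv=n)

In [None]:
# leave one out: one fold per data point
from sklearn.model_selection import cross_val_score

score = cross_val_score(tree_regression, X, y, scoring="neg_mean_squared_error", cv=len(X))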