# <center>Validation</center>

<center> <img src = 'https://scikit-learn.org/stable/_images/grid_search_workflow.png' width=80%> </center>

* [`Scikit-Learn` Validation Docs](https://scikit-learn.org/stable/modules/cross_validation.html)

## 1. Types of Validation Techniques

**Main Validation Techniques**
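* `KFold` - splits the data into K consecutive folds; each fold serves once as the validation set.
* `StratifiedKFold` - the same, but each fold preserves the class proportions of the target.
* `GroupKFold` - a K-fold iterator with non-overlapping groups: the same group never appears in both train and test.
* `RepeatedKFold` - K-fold validation repeated several times, with a different randomization in each repetition.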
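
To make the difference between these iterators concrete, here is a minimal sketch on made-up toy data (the arrays `X`, `y`, and `groups` are illustrative assumptions, not part of the car dataset used later). It only inspects which indices land in each test fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, StratifiedKFold

# Toy data (made up for illustration): 8 samples, imbalanced labels, 4 groups.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])

# KFold ignores labels and groups; without shuffling it cuts contiguous blocks.
kfold_tests = [test.tolist() for _, test in KFold(n_splits=2).split(X)]

# StratifiedKFold spreads the rare class evenly across the test folds.
skf = StratifiedKFold(n_splits=2)
positives_per_fold = [int(y[test].sum()) for _, test in skf.split(X, y)]

# GroupKFold never puts the same group into both train and test.
gkf = GroupKFold(n_splits=2)
shared_groups = [
    set(groups[train]) & set(groups[test])
    for train, test in gkf.split(X, y, groups)
]

print(kfold_tests)         # test indices of each KFold split
print(positives_per_fold)  # number of class-1 samples per stratified test fold
print(shared_groups)       # groups present on both sides of each split
```

With plain `KFold` on this data, both positive samples fall into the same test fold, which is exactly the situation stratification is meant to avoid.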

<center> <img src = 'https://scikit-learn.ru/wp-content/uploads/2021/10/image-161.png' width=90%> </center>

**Also useful**
* `StratifiedGroupKFold` - like `GroupKFold`, but also stratified (tries to preserve the class proportions in each fold).
* `RepeatedStratifiedKFold` - `StratifiedKFold` repeated several times with a different randomization each time.

* `ShuffleSplit` - generates a set number of independent random train/test splits; the samples are shuffled before each split.
* `TimeSeriesSplit` - used when the data is ordered by time; the training indices always precede the test indices.
* `LeaveOneOut` (LOO) - holds out exactly 1 sample as the test set in each split, so there are as many splits as samples.
* `LeaveOneGroupOut` - the same, but holds out 1 group of samples each time.
* `LeavePOut` - holds out P samples and enumerates all possible test sets of that size; for P > 1 the test sets overlap.
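
The behaviour of these iterators can be checked with a short sketch. The five-sample array below is an illustrative assumption; the code only counts splits and checks index ordering, it does not train a model.

```python
import numpy as np
from sklearn.model_selection import (
    LeaveOneOut,
    LeavePOut,
    ShuffleSplit,
    TimeSeriesSplit,
)

# 5 toy samples, assumed to be already sorted by time.
X = np.arange(5).reshape(5, 1)

# LeaveOneOut: one split per sample.
n_loo = LeaveOneOut().get_n_splits(X)

# LeavePOut(p=2): one split per pair of samples, i.e. C(5, 2) splits.
n_lpo = LeavePOut(p=2).get_n_splits(X)

# TimeSeriesSplit: every training index comes before every test index.
tss = TimeSeriesSplit(n_splits=3)
time_ordered = all(train.max() < test.min() for train, test in tss.split(X))

# ShuffleSplit: independent random splits, each with a test set of fixed size.
ss = ShuffleSplit(n_splits=4, test_size=2, random_state=0)
test_sizes = [len(test) for _, test in ss.split(X)]

print(n_loo, n_lpo, time_ordered, test_sizes)
```

The number of `LeavePOut` splits grows combinatorially with the dataset size, which is why it is rarely practical beyond very small datasets.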

**Which one to use and when:**
- Is there a temporal dependence?  
=>> `TimeSeriesSplit`
- Little data and fast training?  
=>> `LeaveOneOut`
- A lot of data, but fast training?  
=>> `KFold`
- A lot of data and slow training?  
=>> `train_test_split()`
- Is there a class imbalance?  
=>> Any iterator with `Stratified` in its name
- Are there groups that must not appear in both `train` and `test` at the same time?  
=>> Any iterator with `Group` in its name

## 2. Import Libraries

In [1]:
import numpy as np
import pandas as pd

pd.set_option("display.float_format", "{:.2f}".format)
pd.set_option("display.max_columns", None)

from classes import Paths

## 3. Load Datasets

In [2]:
paths = Paths()
train = pd.read_csv(paths.car_train)
test = pd.read_csv(paths.car_test)
display("train", train.sample(4))
display("test", test.sample(4))

'train'

Unnamed: 0,car_id,model,car_type,fuel_type,car_rating,year_to_start,riders,year_to_work,target_reg,target_class
586,A-4064758I,VW Polo VI,economy,petrol,4.76,2014,69687,2015,23.88,wheel_shake
2107,Q19307898c,Kia Rio X,economy,petrol,3.04,2013,43207,2017,48.69,engine_check
2247,W15477386C,Mercedes-Benz E200,business,petrol,4.24,2015,77482,2021,74.58,another_bug
1417,t11302884D,Hyundai Solaris,economy,petrol,2.5,2013,53632,2019,27.42,engine_fuel


'test'

Unnamed: 0,car_id,model,car_type,fuel_type,car_rating,year_to_start,riders,year_to_work
126,Q-1114187O,Kia Rio,economy,petrol,5.0,2012,25439,2020
1532,p22351271i,Smart Coupe,economy,petrol,2.8,2013,42239,2017
1557,w-2100178z,BMW 320i,business,petrol,5.18,2015,81031,2018
1625,T-9875021z,VW Tiguan,economy,petrol,4.78,2013,45777,2019


## 4. Basic `Feature Engineering` - generate and add new features

In [3]:
rides = pd.read_csv(paths.rides_info)
rides.sample(4)

Unnamed: 0,user_id,car_id,ride_id,ride_date,rating,ride_duration,ride_cost,speed_avg,speed_max,stop_times,distance,refueling,user_ride_quality,deviation_normal
604717,m70438466c,q73965014U,e1R,2020-02-05,0.43,13,165,39,108.0,1,513.9,0,9.31,-23.51
88629,Q17536496W,G-7187279a,N1Z,2020-02-03,3.73,50,645,38,62.0,2,1580.68,0,0.65,27.47
175126,t32070265e,M14830267a,E1v,2020-02-13,5.47,42,623,53,70.0,2,603.52,0,-8.61,-72.96
223822,L20659841V,P56763603O,O1e,2020-01-31,4.32,40,554,44,78.0,0,1797.49,0,3.21,-39.11


In [4]:
# Aggregate ride statistics per car; "nunique" counts distinct users per car.
rides_df_gr = rides.groupby("car_id", as_index=False).agg(
    mean_rating=("rating", "mean"),
    distance_sum=("distance", "sum"),
    rating_min=("rating", "min"),
    speed_msx=("speed_max", "max"),
    user_ride_quality_median=("user_ride_quality", "median"),
    deviation_normal_count=("deviation_normal", "count"),
    user_uniq=("user_id", "nunique"),
)

rides_df_gr.head(4)

Unnamed: 0,car_id,mean_rating,distance_sum,rating_min,speed_msx,user_ride_quality_median,deviation_normal_count,user_uniq
0,A-1049127W,4.26,11257529.31,0.1,179.73,-0.29,174,172
1,A-1079539w,4.09,19127650.5,0.1,184.51,2.51,174,173
2,A-1162143G,4.66,2995193.85,0.1,180.0,0.64,174,172
3,A-1228282M,4.23,17936850.54,0.1,182.45,-15.66,174,174


In [5]:
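def add_features(df):
    # Left-join the per-car ride aggregates onto df.
    # The column check makes repeated calls a no-op, so re-running the cell is safe.
    if "mean_rating" not in df.columns:
        df = pd.merge(df, rides_df_gr, on="car_id", how="left")
    return df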

train = add_features(train)
test = add_features(test)
display(train.sample(4))
display(test.sample(4))

Unnamed: 0,car_id,model,car_type,fuel_type,car_rating,year_to_start,riders,year_to_work,target_reg,target_class,mean_rating,distance_sum,rating_min,speed_msx,user_ride_quality_median,deviation_normal_count,user_uniq
410,c50744646N,VW Polo,economy,petrol,3.5,2015,86909,2017,70.74,another_bug,4.77,12535862.83,0.1,155.41,-6.31,174,170
2217,B-6462073D,Smart ForTwo,economy,petrol,7.56,2015,81857,2019,38.41,gear_stick,4.45,8953549.06,0.1,108.95,8.59,174,171
1806,O-1758049J,VW Polo VI,economy,petrol,5.6,2014,58615,2017,50.21,engine_overheat,4.33,27257716.71,0.1,196.91,-7.57,174,173
2086,V-1493273B,Renault Sandero,standart,petrol,6.02,2015,74211,2019,37.61,engine_fuel,4.45,12596567.27,0.1,190.85,-1.25,174,170


Unnamed: 0,car_id,model,car_type,fuel_type,car_rating,year_to_start,riders,year_to_work,mean_rating,distance_sum,rating_min,speed_msx,user_ride_quality_median,deviation_normal_count,user_uniq
875,P-1572930P,VW Polo,economy,petrol,5.36,2015,91048,2020,4.67,14112543.27,0.1,187.28,13.77,174,173
1191,C-2095125S,Renault Sandero,standart,petrol,5.5,2012,20056,2020,4.03,18428426.89,0.1,181.46,-9.39,174,173
72,L-2155385D,Renault Sandero,standart,petrol,5.36,2013,48763,2020,4.52,3986406.54,0.1,163.24,11.96,174,172
196,u79013135M,Renault Kaptur,standart,petrol,3.8,2013,40382,2020,4.14,13244055.41,0.1,178.62,-16.57,174,171
