# <center>Validation</center>

<center> <img src = 'https://scikit-learn.org/stable/_images/grid_search_workflow.png' width=80%> </center>

* [`Scikit-Learn` Validation Docs](https://scikit-learn.org/stable/modules/cross_validation.html)

## 1. Types of Validation Techniques

**Main Validation Techniques**
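* `KFold` - splits the data into K folds; each fold serves as the validation set exactly once.
* `StratifiedKFold` - the same, but each fold preserves the class proportions of the target.
* `GroupKFold` - K-fold splitting with non-overlapping groups: a group never appears in both the training and validation parts of a split.
* `RepeatedKFold` - K-fold validation repeated several times, with a different randomization in each repetition.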

<center> <img src = 'https://scikit-learn.ru/wp-content/uploads/2021/10/image-161.png' width=90%> </center>

**Also useful**
* `StratifiedGroupKFold` - like `GroupKFold`, but also stratified (tries to preserve class proportions in each fold).
* `RepeatedStratifiedKFold` - like `StratifiedKFold`, but repeated several times with different randomization.

* `ShuffleSplit` - shuffles the samples and draws an independent random train/test split on each iteration, for a set number of iterations.
* `TimeSeriesSplit` - used when the data is ordered by time; the training part always precedes the validation part.
* `LeaveOneOut` (LOO) - holds out exactly 1 sample for validation in each split, so there are as many splits as samples.
* `LeaveOneGroupOut` - the same, but holds out 1 whole group of samples each time.
* `LeavePOut` - holds out P samples and enumerates all possible combinations, so the validation sets overlap.
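
To make the difference between the main splitters concrete, here is a minimal sketch on toy data. The arrays `X`, `y`, `groups` and the `fold_summaries` dictionary are hypothetical and are not part of the car dataset used later.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, StratifiedKFold

# Hypothetical toy data: 12 samples, imbalanced labels, 4 groups of 3 samples each
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 9 + [1] * 3)
groups = np.repeat([0, 1, 2, 3], 3)

splitters = {
    "KFold": KFold(n_splits=3, shuffle=True, random_state=0),
    "StratifiedKFold": StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    "GroupKFold": GroupKFold(n_splits=2),
}

fold_summaries = {}
for name, splitter in splitters.items():
    rows = []
    # y and groups are passed to every splitter; those that do not need them ignore them
    for train_idx, val_idx in splitter.split(X, y, groups):
        rows.append(
            {
                "val_size": len(val_idx),
                "val_positives": int(y[val_idx].sum()),
                "shared_groups": len(set(groups[train_idx]) & set(groups[val_idx])),
            }
        )
    fold_summaries[name] = rows
    print(name, rows)
```

In this sketch `StratifiedKFold` places one positive sample in every validation fold, while `GroupKFold` never shares a group between the training and validation parts.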

**Which one to use and when:**
- Is there a temporal dependence?  
=>> `TimeSeriesSplit`
- Little data and fast training?  
=>> `LeaveOneOut`
- A lot of data, but fast training?  
=>> `KFold`
- A lot of data and slow training?  
=>> `train_test_split()`
- Is there a class imbalance?  
=>> Any splitter with `Stratified` in its name
- Are there groups that must not appear in both `train` and `test` at the same time?  
=>> Any splitter with `Group` in its name
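
As an illustration of the first rule, the sketch below shows how `TimeSeriesSplit` keeps every training index earlier than every validation index. The array `X_time` is a hypothetical stand-in for time-ordered data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical time-ordered data: row i was observed before row i + 1
X_time = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X_time))

for fold, (train_idx, val_idx) in enumerate(splits):
    # The training window grows with each fold and never looks into the future
    print(f"fold {fold}: train={train_idx.tolist()} val={val_idx.tolist()}")
```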

## 2. Import Libraries

In [1]:
import numpy as np
import pandas as pd

pd.set_option("display.float_format", "{:.2f}".format)
pd.set_option("display.max_columns", None)

from classes import Paths

## 3. Load Datasets

In [2]:
paths = Paths()
train = pd.read_csv(paths.car_train)
test = pd.read_csv(paths.car_test)
display("train", train.sample(4))
display("test", test.sample(4))

'train'

Unnamed: 0,car_id,model,car_type,fuel_type,car_rating,year_to_start,riders,year_to_work,target_reg,target_class
2305,q-1778461b,Smart ForTwo,economy,petrol,6.42,2017,108795,2017,44.95,engine_overheat
163,q25090977a,Skoda Rapid,economy,petrol,3.02,2017,130192,2014,50.66,engine_ignition
791,m-2205063P,Smart ForFour,economy,petrol,5.1,2013,46222,2016,74.88,another_bug
1466,v-1780358c,BMW 320i,business,petrol,4.92,2016,101821,2019,62.98,engine_ignition


'test'

Unnamed: 0,car_id,model,car_type,fuel_type,car_rating,year_to_start,riders,year_to_work
1798,q-1324803Z,Nissan Qashqai,standart,petrol,4.6,2012,19521,2016
1540,a-1327017v,Smart ForFour,economy,petrol,5.76,2013,40551,2016
1156,E20487110v,Smart ForTwo,economy,petrol,3.88,2017,126887,2015
1147,y-2096513h,VW Polo VI,economy,petrol,5.18,2015,81060,2019


## 4. Basic `Feature Engineering` - generate and add new features

In [3]:
rides = pd.read_csv(paths.rides_info)
rides.sample(4)

Unnamed: 0,user_id,car_id,ride_id,ride_date,rating,ride_duration,ride_cost,speed_avg,speed_max,stop_times,distance,refueling,user_ride_quality,deviation_normal
525924,n83007455Q,l-1270744J,v1I,2020-02-20,2.61,34,539,34,40.0,1,378.5,0,23.52,-44.24
193808,p14550354Y,N21592745W,G1h,2020-03-18,1.75,46,731,27,96.0,1,1103.68,0,4.18,-5.1
287461,P13982772E,U-1298358e,w1y,2020-01-08,4.61,62,738,51,61.0,2,2258.63,0,3.9,6.95
703151,b24238972z,x13343580w,e1z,2020-01-10,5.1,48,762,38,61.0,16,1013.03,0,-7.4,7.37


In [4]:
# per-car aggregates computed from the ride-level data
rides_df_gr = rides.groupby("car_id", as_index=False).agg(
    mean_rating=("rating", "mean"),
    distance_sum=("distance", "sum"),
    rating_min=("rating", "min"),
    speed_msx=("speed_max", "max"),
    user_ride_quality_median=("user_ride_quality", "median"),
    deviation_normal_count=("deviation_normal", "count"),
    user_uniq=("user_id", "nunique"),  # number of distinct users per car
)

rides_df_gr.head(4)

Unnamed: 0,car_id,mean_rating,distance_sum,rating_min,speed_msx,user_ride_quality_median,deviation_normal_count,user_uniq
0,A-1049127W,4.26,11257529.31,0.1,179.73,-0.29,174,172
1,A-1079539w,4.09,19127650.5,0.1,184.51,2.51,174,173
2,A-1162143G,4.66,2995193.85,0.1,180.0,0.64,174,172
3,A-1228282M,4.23,17936850.54,0.1,182.45,-15.66,174,174


In [5]:
def add_features(df):
    """Left-join the per-car ride aggregates onto df; skip if already merged."""
    if "mean_rating" not in df.columns:
        df = pd.merge(df, rides_df_gr, on="car_id", how="left")
    return df

train = add_features(train)
test = add_features(test)
display(train.sample(4))
display(test.sample(4))

Unnamed: 0,car_id,model,car_type,fuel_type,car_rating,year_to_start,riders,year_to_work,target_reg,target_class,mean_rating,distance_sum,rating_min,speed_msx,user_ride_quality_median,deviation_normal_count,user_uniq
1416,N76275408I,Smart ForFour,economy,petrol,4.86,2017,114970,2019,47.5,engine_check,4.54,10422587.75,0.1,165.91,4.57,174,171
1819,J22037391Y,Renault Kaptur,standart,petrol,3.76,2014,61805,2018,26.54,break_bug,5.54,12222384.43,0.1,119.88,1.2,174,172
1605,p60146435a,Smart ForTwo,economy,petrol,3.38,2016,110875,2015,42.37,engine_check,4.5,15858021.63,0.1,164.04,-8.06,174,172
595,H-1213278K,Skoda Rapid,economy,petrol,6.4,2016,90556,2016,63.2,another_bug,4.92,15826062.18,0.1,199.24,-9.99,174,172


Unnamed: 0,car_id,model,car_type,fuel_type,car_rating,year_to_start,riders,year_to_work,mean_rating,distance_sum,rating_min,speed_msx,user_ride_quality_median,deviation_normal_count,user_uniq
914,H11201074S,Renault Kaptur,standart,petrol,4.92,2015,79640,2017,3.88,12334087.67,0.0,194.28,1.95,174,172
1234,n12252517h,Kia Rio X-line,economy,petrol,2.98,2014,57684,2020,5.25,16228842.75,0.1,110.84,18.47,174,174
391,V-4730687q,VW Tiguan,economy,petrol,4.1,2016,95799,2018,5.18,13619709.95,0.36,110.11,-5.73,174,172
152,y-2178923H,Kia Rio,economy,petrol,3.94,2013,38668,2020,4.14,13269553.02,0.0,200.87,5.26,174,173
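
A left merge silently adds rows if the right-hand table has duplicate keys, and leaves `NaN` where a key is missing. The sketch below shows one way to guard against both using the `validate` and `indicator` parameters of `pd.merge`. The frames `cars` and `ride_stats` are hypothetical toy data, not the real tables.

```python
import pandas as pd

# Hypothetical toy frames standing in for the car table and the ride aggregates
cars = pd.DataFrame({"car_id": ["a", "b", "c"], "car_rating": [4.1, 3.9, 5.0]})
ride_stats = pd.DataFrame({"car_id": ["a", "b"], "mean_rating": [4.5, 4.0]})

merged = pd.merge(
    cars,
    ride_stats,
    on="car_id",
    how="left",
    validate="one_to_one",  # raises MergeError if car_id is duplicated on either side
    indicator=True,  # adds a "_merge" column describing where each row came from
)

# Rows marked "left_only" had no matching aggregates and carry NaN features
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(f"rows: {len(merged)}, cars without ride aggregates: {n_unmatched}")
```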


## 5. Encode categorical features (one-hot encoding)

In [6]:
# cat_features = ["car_type", "fuel_type", "model"]
cat_features = list(test.select_dtypes("O").columns)
cat_features.pop(0)  # drop "car_id": it is an identifier, not a categorical feature

train = pd.get_dummies(train, columns=cat_features)
test = pd.get_dummies(test, columns=cat_features)

train.head(4)

Unnamed: 0,car_id,car_rating,year_to_start,riders,year_to_work,target_reg,target_class,mean_rating,distance_sum,rating_min,speed_msx,user_ride_quality_median,deviation_normal_count,user_uniq,model_Audi A3,model_Audi A4,model_Audi Q3,model_BMW 320i,model_Fiat 500,model_Hyundai Solaris,model_Kia Rio,model_Kia Rio X,model_Kia Rio X-line,model_Kia Sportage,model_MINI CooperSE,model_Mercedes-Benz E200,model_Mercedes-Benz GLC,model_Mini Cooper,model_Nissan Qashqai,model_Renault Kaptur,model_Renault Sandero,model_Skoda Rapid,model_Smart Coupe,model_Smart ForFour,model_Smart ForTwo,model_Tesla Model 3,model_VW Polo,model_VW Polo VI,model_VW Tiguan,model_Volkswagen ID.4,car_type_business,car_type_economy,car_type_premium,car_type_standart,fuel_type_electro,fuel_type_petrol
0,y13744087j,3.78,2015,76163,2021,108.53,another_bug,4.74,12141310.41,0.1,180.86,0.02,174,170,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True
1,O41613818T,3.9,2015,78218,2021,35.2,electro_bug,4.48,18039092.84,0.0,187.86,12.31,174,174,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,True
2,d-2109686j,6.3,2012,23340,2017,38.62,gear_stick,4.77,15883659.43,0.1,102.38,2.51,174,173,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True
3,u29695600e,4.04,2011,1263,2020,30.34,engine_fuel,3.88,16518828.77,0.1,172.79,-5.03,174,170,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True
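
Calling `pd.get_dummies` on `train` and `test` separately yields different columns whenever a category is present in only one of them. A minimal sketch of aligning the two, assuming hypothetical toy frames `train_toy` and `test_toy` that contain feature columns only (no targets):

```python
import pandas as pd

# Hypothetical toy frames: "business" and "premium" never occur in the test part
train_toy = pd.DataFrame(
    {"car_type": ["economy", "business", "premium"], "riders": [10, 20, 30]}
)
test_toy = pd.DataFrame({"car_type": ["economy", "economy"], "riders": [5, 15]})

train_enc = pd.get_dummies(train_toy, columns=["car_type"])
test_enc = pd.get_dummies(test_toy, columns=["car_type"])

# Reindex test to the training columns: dummy columns missing from test are
# created and filled with False, and the column order matches the training frame
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=False)

print(list(test_enc.columns))
```

With real data the target columns would have to be excluded from `train_enc.columns` before reindexing, since the test set does not contain them.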


## 6. Classify features

In [7]:
features2drop = ["car_id", "target_reg"]  # columns that can be dropped
targets = ["target_class", "target_reg"]  # targets
cat_features = ["car_type", "fuel_type", "model"]  # original categorical columns (already one-hot encoded above)

filtered_features = [i for i in train.columns if (i not in targets and i not in features2drop)]
num_features = [i for i in filtered_features if i not in cat_features]


print("cat_features :", len(cat_features), cat_features)
print("num_features :", len(num_features), num_features)
print("targets", targets)

cat_features : 3 ['car_type', 'fuel_type', 'model']
num_features : 43 ['car_rating', 'year_to_start', 'riders', 'year_to_work', 'mean_rating', 'distance_sum', 'rating_min', 'speed_msx', 'user_ride_quality_median', 'deviation_normal_count', 'user_uniq', 'model_Audi A3', 'model_Audi A4', 'model_Audi Q3', 'model_BMW 320i', 'model_Fiat 500', 'model_Hyundai Solaris', 'model_Kia Rio', 'model_Kia Rio X', 'model_Kia Rio X-line', 'model_Kia Sportage', 'model_MINI CooperSE', 'model_Mercedes-Benz E200', 'model_Mercedes-Benz GLC', 'model_Mini Cooper', 'model_Nissan Qashqai', 'model_Renault Kaptur', 'model_Renault Sandero', 'model_Skoda Rapid', 'model_Smart Coupe', 'model_Smart ForFour', 'model_Smart ForTwo', 'model_Tesla Model 3', 'model_VW Polo', 'model_VW Polo VI', 'model_VW Tiguan', 'model_Volkswagen ID.4 ', 'car_type_business', 'car_type_economy', 'car_type_premium', 'car_type_standart', 'fuel_type_electro', 'fuel_type_petrol']
targets ['target_class', 'target_reg']


## 7. Train Random Forest with K-Fold validation

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, StratifiedKFold

In [9]:
X = train[filtered_features].drop(targets, axis=1, errors="ignore")
y = train[["target_class"]]



In [10]:
n_splits = 5
clfs = []
scores = []

# validation setup: one model is trained and evaluated per fold (n_splits folds in total)
kf = KFold(n_splits=n_splits, shuffle=True, random_state=21)

for num, (train_index, test_index) in enumerate(kf.split(X)):

    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    clf = RandomForestClassifier(
        n_estimators=2_000,
        min_samples_leaf=4,
        n_jobs=-1,
        max_features=0.60,
        # class_weight = 'balanced',
        random_state=7575,
        max_depth=6,
    )

    clf.fit(X_train, y_train["target_class"])
    clfs.append(clf)  # save model for further usage (predictions)

    y_pred = clf.predict(X_test)
    score = np.mean(np.array(y_pred == y_test["target_class"]))
    scores.append(score)
    print(f"fold: {num} acc: {score}")

print(f"Number of classifiers: {len(clfs)}")
print(f"Number of splits: {n_splits}")
assert len(clfs) == n_splits  # check that we have all models

# calculate the mean accuracy and its standard deviation across folds
print("mean accuracy score --", np.mean(scores).round(4), np.std(scores).round(4))

fold: 0 acc: 0.8376068376068376
fold: 1 acc: 0.8034188034188035
fold: 2 acc: 0.7944325481798715
fold: 3 acc: 0.7708779443254818
fold: 4 acc: 0.8115631691648822
Number of classifiers: 5
Number of splits: 5
mean accuracy score -- 0.8036 0.0218
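
The models saved for each fold can later be combined for prediction. One common option, sketched below, is soft voting: average the class probabilities of all fold models and take the most probable class. This is a sketch under assumptions rather than the notebook's own method. It uses synthetic data from `make_classification`, `StratifiedKFold` instead of `KFold`, and a small forest to keep it fast; the names `X_dev`, `X_new`, `fold_models` and `avg_proba` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic 3-class data standing in for the real features (assumption)
X_all, y_all = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_dev, y_dev = X_all[:240], y_all[:240]  # used for cross-validation
X_new = X_all[240:]  # plays the role of the unlabeled test set

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=21)
fold_models, fold_scores = [], []

for train_idx, val_idx in skf.split(X_dev, y_dev):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_dev[train_idx], y_dev[train_idx])
    fold_models.append(model)
    fold_scores.append(model.score(X_dev[val_idx], y_dev[val_idx]))  # accuracy

# Soft voting: average predicted probabilities over the fold models.
# Stratification keeps every class in every training part, so classes_ match.
avg_proba = np.mean([m.predict_proba(X_new) for m in fold_models], axis=0)
y_new_pred = fold_models[0].classes_[avg_proba.argmax(axis=1)]

print("mean CV accuracy:", np.mean(fold_scores).round(4))
print("ensemble predictions shape:", y_new_pred.shape)
```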
