# Kaggle advice

## Features and values

1. **Feature scale has an impact on non-tree-based models**. It doesn’t affect tree-based models such as Decision Trees, Random Forests, CART, or boosting algorithms on decision trees. So when we work with linear models, KNN, or NNs, we should scale all numeric features (MinMaxScaler, StandardScaler, etc.). We can also deliberately change the scale of some features if we think they are important to us.

2. **MinMax scaling squeezes values into a fixed range, which usually decreases variance**, while **standard scaling gives zero mean and unit variance**. So use standard scaling when you are interested in the components that maximize the variance (PCA) or if you have a small multilayer NN with tanh activation and weights initialized near zero. If in doubt, use standard scaling, it won’t hurt! However, this doesn’t mean that MinMax scaling is not useful at all: a popular application is image processing, where pixel intensities have to be normalized to fit within a certain range (e.g. 0 to 255 for the RGB color range). Also, **typical neural network algorithms require data on a 0-1 scale**.

3. To cope with **outliers**, use clipping (known as **winsorization** in the financial sector).
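A minimal sketch of percentile clipping, assuming NumPy. The helper name `winsorize` and the percentile defaults are illustrative choices, not something prescribed by the tip above. In practice the bounds would be computed on the train set and reused for the test set.

```python
import numpy as np

def winsorize(values, lower_pct=1.0, upper_pct=99.0):
    """Clip values to the [lower_pct, upper_pct] percentile range (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    # Return the bounds too, so the same clipping can be applied to test data.
    return np.clip(values, lo, hi), (lo, hi)

x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])
clipped, bounds = winsorize(x, lower_pct=0, upper_pct=75)
print(clipped)  # the outlier 1000 is pulled down to the upper bound
```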

4. Another solution to the **outliers** problem is rank transformation, which replaces each value with its rank in the sorted data. It copes with outliers because values like [-100, 1, 5] are transformed to [1, 2, 3]. Use **scipy.stats.rankdata** for ranking. To apply it to test data, either store the mapping between train values and their ranks, or concatenate train and test data before ranking.
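The two ways of ranking test data mentioned above could look roughly like this. Using linear interpolation (`np.interp`) to map test values onto the stored train ranking is an assumption made for this sketch; other mappings are possible.

```python
import numpy as np
from scipy.stats import rankdata

train = np.array([-100.0, 1.0, 5.0])
test = np.array([3.0, 1000.0])

train_ranks = rankdata(train)

# Option 1: rank train and test together, then split the ranks back.
all_ranks = rankdata(np.concatenate([train, test]))
train_ranks_joint = all_ranks[: len(train)]
test_ranks_joint = all_ranks[len(train):]

# Option 2: keep the train value -> rank mapping and interpolate test values into it
# (values outside the train range are clamped to the min/max train rank).
order = np.argsort(train)
test_ranks_mapped = np.interp(test, train[order], train_ranks[order])
```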

5. You can also use the **log transformation** and **square root** against outliers: these transformations pull large feature values closer together and reduce their scale. They can significantly improve the results of NNs.

6. **MinMaxScaler** and **StandardScaler** are not useful against outliers, because they don’t change the relative distances between outliers and the other values.
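A small check of this claim, assuming scikit-learn. The helper `outlier_gap_ratio` is a hypothetical measure introduced only for this sketch: the gap between the outlier and the largest normal value, relative to the spread of the normal values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # the last value is an outlier

minmax = MinMaxScaler().fit_transform(x).ravel()
standard = StandardScaler().fit_transform(x).ravel()
logged = np.log1p(x).ravel()

def outlier_gap_ratio(v):
    """Gap between the outlier and the largest inlier, relative to the inlier spread."""
    return (v[4] - v[3]) / (v[3] - v[0])

# Linear scalers keep the ratio unchanged; the log transform shrinks it.
print(outlier_gap_ratio(x.ravel()), outlier_gap_ratio(minmax),
      outlier_gap_ratio(standard), outlier_gap_ratio(logged))
```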

7. **Decision Tree based models have difficulties with multiplication and division dependencies**, so, if you can, multiply and divide some necessary features to help DT models find these dependencies. E.g. divide the price of the house by its area and get the price for 1 m^2.

8. If you work with prices, use **fractional feature generation**: add a new feature that holds the fractional part of the price, e.g. 2.49 => 0.49. This feature can help the model exploit differences in how people perceive such prices.
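Both the ratio feature and the fractional-part feature described above are one-liners in pandas. The data and column names below are made up for illustration.

```python
import pandas as pd

# Ratio feature: price per square metre (toy data, hypothetical column names).
houses = pd.DataFrame({"price": [300000.0, 450000.0], "area_m2": [60.0, 100.0]})
houses["price_per_m2"] = houses["price"] / houses["area_m2"]

# Fractional part of a shelf price; rounding hides floating-point noise.
shelf = pd.DataFrame({"price": [2.49, 10.00, 199.99]})
shelf["price_frac"] = (shelf["price"] % 1).round(2)
```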

9. Use **sklearn.preprocessing.LabelEncoder** to encode categorical values as numbers in alphabetical (sorted) order, starting from 0, e.g.: [S, A, Q] => [2, 0, 1].

10. Use pd.factorize to encode categorical values as numbers in order of their appearance, starting from 0: e.g. if S is the first value met in the data, it is encoded as 0, while A, which is met second, is encoded as 1.
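A short comparison of the two encoders on the same toy series. Both produce codes that start at 0.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

values = pd.Series(["S", "A", "Q", "S"])

# Sorted order: A -> 0, Q -> 1, S -> 2
sorted_codes = LabelEncoder().fit_transform(values)

# Order of appearance: S -> 0, A -> 1, Q -> 2
appearance_codes, uniques = pd.factorize(values)

print(list(sorted_codes), list(appearance_codes), list(uniques))
```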

11. Frequency encoding transforms values to their relative frequencies in the data. If several categories have the same frequency, they become indistinguishable after encoding, so we may apply a ranking operation to them.
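A minimal frequency-encoding sketch in pandas. Filling categories unseen in train with 0.0 is an assumption of this example, not part of the tip above.

```python
import pandas as pd

train = pd.DataFrame({"city": ["a", "a", "b", "c", "a", "b"]})
test = pd.DataFrame({"city": ["a", "d"]})

# Relative frequency of each category, estimated on train.
freq = train["city"].value_counts(normalize=True)

train["city_freq"] = train["city"].map(freq)
# "d" never occurs in train, so it gets frequency 0.0 here (assumed convention).
test["city_freq"] = test["city"].map(freq).fillna(0.0)
```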

12. We can concatenate categorical/ordinal features and one-hot encode them. E.g class 1, class 2 and male/female may be transformed to 1male, 2male, 1female, 2female features with one-hot encoding.

13. Workflow for handling missing values:
    - Build histograms, distributions and boxplots to detect outliers, then treat them like missing data
    - Check for errors in data cleaning/transformation (duplicated data, amount of null data, different representations of the same data)
    - Use data from additional sources to fill missing values
    - Drop the row/column
    - Fill missing values with reasonable estimates computed from the available data (mean, median). This is good for non-tree models, but not for trees
    - We can also calculate the mean/median for each category and replace missing values with the estimate for the corresponding category
    - Fill missing values with out-of-range values (-999, -1, etc.)
    - Of course, you have to ignore missing values while calculating the mean, median and other statistics, and during other feature generation
    - We can also map values that occur in the test data but not in the train data to a separate category that is absent from train. Keep in mind that a model that never saw this category during training will treat it randomly
    - We can also replace missing values with their frequencies
    - We can also add a new isnull feature for each feature with missing values
    - XGBoost and LAMA can handle NaNs
    - Add a new feature that counts the number of NaNs in each row
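A few of the steps above sketched in pandas: the NaN counter, the isnull indicators, per-category median filling and filling with an out-of-range value. The data and column names are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "x": [1.0, np.nan, 10.0, 30.0, np.nan],
    "y": [np.nan, 2.0, 3.0, np.nan, np.nan],
})
num_cols = ["x", "y"]

# Number of NaNs per row, computed before any filling.
df["nan_count"] = df[num_cols].isna().sum(axis=1)

# One isnull indicator per feature with missing values.
for col in num_cols:
    df[f"{col}_isnull"] = df[col].isna().astype(int)

# Per-category median (NaNs are skipped by pandas when computing it).
df["x_filled"] = df["x"].fillna(df.groupby("group")["x"].transform("median"))

# Out-of-range placeholder, usually more suitable for tree models.
df["y_filled"] = df["y"].fillna(-999)
```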

14. https://scikit-learn.org/stable/modules/preprocessing.html - **all scikit-learn tools for data preprocessing**, including normalization, outlier handling, mapping to a Gaussian distribution, discretization, binarization, etc.

15. If you **find mistakes** in data, don’t hurry to correct or drop them. Maybe there is some pattern in those mistakes. Maybe you should add a feature **“is_correct”** which indicates if the data column has a mistake and maybe this would help your model to make a better prediction.

16. **Feature engineering**:
![image-3.png](attachment:image-3.png)

17. Use the **Box-Cox transformation** (i.e. try the logarithm and various powers of the feature values and pick the best one) to find non-linear dependencies among features or to remove them (in the case of the logarithm).

18. **Feature selection** is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested. Feature selection dataflow:
- determine which column is the result column;
- drop all columns with only one non-null value;
- check which numerical columns hold ordinal values and transform them to object values;
- determine which numerical columns have less than 5% (the threshold may differ) of null values and fill them with the mean; drop the numerical columns with more than 5% of null values;
- determine which object columns have less than 1% (the threshold may differ) of null values and fill them with the mode; drop the object columns with more than 1% of null values;
- normalize all numerical columns;
- some numerical columns that don’t have a linear relation with the target can be binned and analyzed per bin against the binned values of the target column;
- analyze which new useful columns can be created;
- delete all columns that are useless for ML, e.g. columns that are not suitable for ML or columns that leak data about the result;
- build a correlation matrix between all columns and the result column, drop all columns with a correlation of less than 0.4 (the threshold may differ);
- determine object columns that have too many different values (e.g. more than 10, the threshold may differ) and drop them;
- determine object columns that have low variance (e.g. more than 90% of rows hold one value, the threshold may differ) and drop them;
- determine numerical columns that have low variance (e.g. less than 1%, the threshold may differ) and drop them;
- use additional feature selection tools, you can read about them [here](https://machinelearningmastery.com/feature-selection-machine-learning-python) or [here](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection);
- **Very Important!** Reset the DF index before model fitting!!! (but that’s not precise);
- after the model is fitted, analyze which features have the lowest coefficients by accessing the .coef_ attribute of the estimator.




## Validation

1. You should understand how train and test sets are generated. If they have different generation algorithms => they have different data balance => **you should validate only on the train set and don’t use the test set**.

2. When using **MAE**, it’s better to aggregate data **by median**; when using **MSE**, **by mean**.

3. Data can be split by:
    - **row** (one row - one object, a random train/test split can be used);
    - **time** (e.g. when you need to predict sales for the next day/week/month, a **time series split** can be used);
    - **ID** (when object IDs are hidden and multiple rows in each dataset may belong to one object, **prior clustering is needed, then a random train/test split can be used**).
    
4. Your validation scheme must mimic the competition’s train/test split, i.e. **the validation set must resemble the competition test set**.

5. When submitting your results to a competition, it’s recommended to select both the result with the **best LB score** and the result **with the best validation score**. The reason is that the distribution of the test set may differ from that of the train set, in which case the result with the best LB score wins. But if it is similar to the train set, then the result with the best validation score is better. Also look critically at submissions with a **high LB score but low correlation with the validation score** - maybe you were lucky and hit a high LB score by accident.

6. What to do if **validation scores differ between splits**?
    - increase the number of folds K in K-fold (but 5-10 is usually enough)
    - use a couple of different K-fold splits with different random seeds
    - tune the model on one split, evaluate on another
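Repeating K-fold with several seeds could be sketched as follows, assuming scikit-learn. The synthetic data, the `Ridge` model and the seed values are placeholders for this example. The spread of the per-seed means gives a rough idea of how much of a score difference is just split noise.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, used only to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

seed_means = []
for seed in (1, 2, 3):
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    seed_means.append(-scores.mean())  # back to a positive MSE

mean_mse = float(np.mean(seed_means))
spread = float(np.std(seed_means))
print(f"MSE = {mean_mse:.3f} +/- {spread:.3f} over {len(seed_means)} seeds")
```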
    
## Leaderboard

1. What to do if **LB score doesn’t match validation score**
    - check if we have too little data in the test LB
    - check if we overfitted
    - check if we chose correct splitting strategy
    - check if train/test have different distributions
    
2. **Leaderboard probing** types:
    - simply extracting all ground truth from the public part of the LB (read Alek Trott’s post);
    - find categories that definitely have the same target value and then try to guess this value;
    - a lifehack for LB probing for binary classification with logloss, assuming the same distribution for the public and private parts:
    
![image.png](attachment:image.png)
    where C is the constant (0 or 1) used in our probe prediction, L is the leaderboard score, N is the number of rows in the test set, and N1 is the real number of ones for the target value in the test set; the last equation gives the real fraction of ones in the test set, so if we rebalance our predictions to have the same fraction of ones, we may significantly improve our score.
    
3. **Data leakage** can come from:
    - features from the test set, which works especially well for time series (e.g. prices from the next week, or the number of cargo tracks; they are definitely connected with the number of sales for the next week)
    - metadata (e.g. the resolution of photos with dogs and cats may differ, or they may have been taken with different cameras, which can be used to classify the photos almost perfectly)
    - IDs, since some information can be stored in them (e.g. they may be a hash of the target value)
    - row order (e.g. duplicate rows may have the same target value)
    
## Metrics

1. **Optimize MSE instead of RMSE**. It’s simpler to compute, and minimizing MSE also minimizes RMSE. Or use **R squared (1 - MSE(pred)/MSE(constant))**.

2. Optimal **MSE** constant is **mean**, optimal **MAE** constant is **median**, optimal **Logloss** constant prediction is **probability of each class**.
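A brute-force numerical check of these optimal constants on toy data. The grid search over candidate constants is only for illustration; the closed-form answers are the mean, the median and the class frequency.

```python
import numpy as np

y = np.array([0.0, 1.0, 2.0, 3.0, 100.0])
candidates = np.linspace(0.0, 100.0, 10001)  # grid with step 0.01

# Loss of every candidate constant against all targets.
mse = ((y[None, :] - candidates[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - candidates[:, None]).mean(axis=1)

best_mse_const = candidates[mse.argmin()]  # close to y.mean()
best_mae_const = candidates[mae.argmin()]  # close to np.median(y)

# Binary logloss: the best constant is the fraction of ones.
labels = np.array([1, 0, 0, 0])
probs = np.linspace(0.01, 0.99, 99)
logloss = -(labels[None, :] * np.log(probs[:, None])
            + (1 - labels[None, :]) * np.log(1 - probs[:, None])).mean(axis=1)
best_logloss_const = probs[logloss.argmin()]
```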

3. **When to use MAE or MSE (RMSE, R^2) in regression tasks?**
    - if you have a lot of outliers in the data and you are sure that they are outliers, use MAE: it’s more robust than MSE and handles outliers better
    - if you think that the unusual objects are normal and not outliers, use MSE

4. **MSPE, MAPE and RMSLE** are sensitive to **relative errors**. **MSPE and MAPE** are sensitive to **small targets**, while **RMSLE** is less biased towards **small targets**.

5. If you need to optimize **MAE** loss you can also use Huber loss instead.

6. If you want to use **MSPE/MAPE** in your model, use **MSE/MAE** with the **‘sample_weights’** option. Not every model accepts this option, so you can also resample the dataset with **df.sample(weights=sample_weights)** and then use MSE/MAE. **You don’t need to do the same with the test set.** To make the model’s score more stable, resample the dataset multiple times.
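A sketch of the weights behind this trick, assuming the usual choice of weights proportional to 1/|y| for MAPE and 1/y^2 for MSPE (the tip above does not spell them out). The toy data and variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(1, 7, dtype=float),
                   "y": [1.0, 2.0, 4.0, 10.0, 50.0, 100.0]})

# Normalized sample weights (assumed forms: 1/|y| for MAPE, 1/y^2 for MSPE).
mape_w = 1.0 / df["y"].abs()
mape_w = mape_w / mape_w.sum()
mspe_w = 1.0 / df["y"] ** 2
mspe_w = mspe_w / mspe_w.sum()

# Sanity check: MAE weighted by 1/|y| equals MAPE (as a fraction, not percent).
pred = df["y"] * 1.1
mape = ((df["y"] - pred).abs() / df["y"].abs()).mean()
weighted_mae = ((1.0 / df["y"].abs()) * (df["y"] - pred).abs()).mean()

# If the model has no sample weight option: resample rows, then fit with plain MAE.
resampled = df.sample(n=10 * len(df), replace=True, weights=mape_w, random_state=0)
```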

7. To optimize **RMSLE**:
    - transform target on the train set: z = log(y + 1)
    - transform predictions back on the test set: y = exp(z) - 1
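The two transforms above map directly to `np.log1p` and `np.expm1`. In this sketch the model itself is left out: `z_pred` is just a stand-in for the predictions of a model trained with MSE on the transformed target.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

y_train = np.array([0.0, 9.0, 99.0])

z_train = np.log1p(y_train)  # z = log(y + 1); fit the model on z with MSE
z_pred = z_train.copy()      # placeholder for model predictions in log space
y_pred = np.expm1(z_pred)    # y = exp(z) - 1; back to the original scale
```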

## Feature encoding

1. **Mean encoding** - powerful technique, which encodes each value of features by some statistic for corresponding target:
    - mean = ones/(ones + zeroes)
    - weight of evidence = ln(ones/zeroes)*100 (for binary target)
    - count = sum(target)
    - diff = ones - zeroes
    - std, percentiles, distribution bins, etc for regression
    - for time series we can count statistics for previous period: count, sum, cumsum, mean, median, etc.

2. Mean encoding regularization:
    - **CV (KFold scheme)** - may work poorly for small subset of data
    - **smoothing - (mean(target) \* n_rows + global_mean \* alpha) / (n_rows + alpha)** - if the category is large (n_rows is a big number), we can trust the estimated encoding; if the category is rare, we can’t, so the estimate is pulled towards the **global mean**; the parameter **alpha** controls the regularization - if it’s zero, we get the plain category mean, and as it goes to infinity, the encoding turns into the global mean. Usually alpha is equal to the category size we can trust
    - **add some noise** - this method is unstable - if we add too much noise, we turn the feature into garbage, but if we add too little noise, the quality of the regularization becomes worse; this method is usually used together with LOO regularization
    - **expanding mean** - fix some order for the category and use the first N-1 rows to encode the Nth row (e.g. cumsum/cumcount); requires no hyperparameter tuning, but leads to irregular encoding quality, which can be solved by averaging models fitted on encodings calculated from different data permutations
    - best regularization methods are **CV loop** and **expanding mean** - they are robust enough and easy to tune
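The smoothing formula and the expanding mean from the list above, sketched in pandas on toy data. The value `alpha = 2` and filling the first occurrence of each category with the global mean are choices made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "cat":    ["a", "a", "a", "b", "b", "c"],
    "target": [1,   0,   1,   0,   0,   1],
})
global_mean = df["target"].mean()
alpha = 2  # assumed value: the category size we start to trust

# Smoothing: (mean(target) * n_rows + global_mean * alpha) / (n_rows + alpha)
stats = df.groupby("cat")["target"].agg(["mean", "count"])
smoothed = (stats["mean"] * stats["count"] + global_mean * alpha) / (stats["count"] + alpha)
df["cat_smooth"] = df["cat"].map(smoothed)

# Expanding mean: each row is encoded only from the rows before it in its category.
prev_sum = df.groupby("cat")["target"].cumsum() - df["target"]
prev_cnt = df.groupby("cat").cumcount()
# The first row of a category has no history (0/0 -> NaN) and gets the global mean.
df["cat_expanding"] = (prev_sum / prev_cnt).fillna(global_mean)
```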

3. We can also use **mean encoding of feature interactions**. To do that, we must find which feature combinations occur most frequently in the decision tree splits and select those combinations. Two features interact in a tree if they are in neighbouring nodes. So we iterate through the model and count how often each pair of features appears. The most frequent interactions are probably worth encoding.
![image-2.png](attachment:image-2.png)

4. **Correct mean encoding validation order**:
    - Local experiments:
        - estimate encodings on X_tr
        - map them to X_tr and X_val
        - regularize on X_tr
        - validate model on X_tr and X_val split
    - Submission:
        - estimate encodings on whole Train data
        - map them to train and test
        - regularize on train
        - fit on train
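One way to follow this order with the CV (K-fold) regularization, assuming scikit-learn and pandas. The helper `kfold_mean_encode` and the toy data are hypothetical. Train rows receive out-of-fold encodings, while test rows are mapped with encodings estimated on the whole train set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_mean_encode(train, col, target, n_splits=5, seed=0):
    """Out-of-fold mean encoding for the train set (hypothetical helper)."""
    global_mean = train[target].mean()
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(train):
        # Category means are estimated without the rows being encoded.
        means = train.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = train.iloc[enc_idx][col].map(means).to_numpy()
    # Categories missing from a fold fall back to the global mean.
    return encoded.fillna(global_mean)

train = pd.DataFrame({"cat": list("aabbaabbab"),
                      "target": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]})
test = pd.DataFrame({"cat": ["a", "b", "z"]})

train["cat_enc"] = kfold_mean_encode(train, "cat", "target")

# Test rows: plain means estimated on the whole train set.
full_means = train.groupby("cat")["target"].mean()
test["cat_enc"] = test["cat"].map(full_means).fillna(train["target"].mean())
```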

5. Mean encodings give a significant improvement only on specific datasets. But when they do, the improvement is really worth it.

## Hyperparameters

1. XGBoost/LightGBM/CatBoost hyperparameters
- **num_round/num_iterations** - how many trees we want to build; there is a nice trick to improve the model score: for a fixed learning rate, find the number of trees at which the validation score is best, then divide the learning rate by some constant alpha and multiply the number of trees by the same constant. That lets us find the best point more precisely
- **max_depth/max_depth or num_leaves** - max depth of a decision tree (num_leaves limits the number of leaves instead). If increasing the max depth doesn’t lead to overfitting, the data has a lot of deep interactions. So we have to stop tuning and try to create some additional features
- **subsample/bagging_fraction** - fraction of rows sampled for fitting the next tree (if the model is overfitting, try to decrease this parameter)
- **colsample_bytree or colsample_bylevel/feature_fraction** - fraction of features used for fitting the next tree (if the model is overfitting, try to decrease this parameter)
- **min_child_weight, lambda, alpha/min_data_in_leaf, lambda_l1, lambda_l2** - settings that tell the algorithm to stop splitting when the sample size in a node goes below a given threshold (for regression) or when a node has reached a certain degree of purity - **among the most important hyperparameters for boosting models**; good values can be very different: 0, 1, 20, 300, etc., so don’t hesitate to experiment with them
- **eta/learning_rate** - learning rate
- **seed/\*_seed** - there is no point in fixing the random seed of the model, because every hyperparameter change leads to a completely different model anyway, but you can set and change it to make sure that changing the seed doesn’t affect the model and that it’s stable

2. Random Forest hyperparameters:
- **n_estimators** - number of trees
- **max_depth** - depth of each tree, can be set to None, which means unlimited tree depth; recommended depth is 7
- **max_features** - max number of features to consider for each split
- **min_samples_leaf** - similar to min_child_weight/min_data_in_leaf for XGBoost/LightGBM
- **criterion** - ‘gini’, ‘entropy’, 'chisq', 'mse'... gini is usually better and faster
- **min_samples_split** - minimum number of samples in a node required to split it

3. Neural Nets. See **kaggle_tips/Models/NNs/NN_main_tips/NN_main_tips.ipynb**

4. A good practice is to average the results of models with different random seeds and hyperparameters (**e.g. instead of a single LGBM model with optimal depth 5, average three LGBM models with max_depth 3, 4 and 5**).

5. More hyperparameter optimization info:

    - https://scikit-learn.org/stable/modules/grid_search.html

    - http://fastml.com/optimizing-hyperparams-with-hyperopt/

    - https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/

6. If you **hunt for a medal** in a competition, look for competitions with a small number of entries at the top of the leaderboard and with top teams that predominantly consist of one member.

7. Hyperparameter optimization preparation. Sort all parameters by these principles:
    - importance (from very important to not important at all);
    - feasibility (from easy to tune to not tunable at all (tuning may take a very long time))
    - understanding (from “I understand what this parameter is doing” to “I don’t understand at all”)

## Data

1. **Data loading**. You can do basic preprocessing of the data (fill NaNs, remove wrong data, encode categorical values, etc.) and then save it in hdf5/npy format (for numpy or pandas) for much faster loading. You can also preprocess the train set, split it into train and validation parts, save them to files and later use them to train your model.

2. By default data is stored in 64-bit arrays. Most of the time you can **safely downcast to 32-bit**.
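A minimal downcasting sketch for a pandas DataFrame. The helper name `downcast` is hypothetical. Integers are shrunk with `pd.to_numeric(..., downcast="integer")`, which picks the smallest integer type that still holds the values; floats are cast to 32-bit, which loses some precision, so "safely" depends on the data.

```python
import numpy as np
import pandas as pd

def downcast(df):
    """Return a copy with 64-bit numeric columns stored in smaller dtypes (hypothetical helper)."""
    out = df.copy()
    for col in out.select_dtypes(include=["float64"]).columns:
        out[col] = out[col].astype(np.float32)
    for col in out.select_dtypes(include=["int64"]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out

df = pd.DataFrame({"a": np.arange(1000, dtype=np.int64),
                   "b": np.linspace(0.0, 1.0, 1000)})
small = downcast(df)
print(df.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```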

3. Pandas supports reading data **in chunks**, which can be used to save memory.
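Chunked reading with `pd.read_csv(..., chunksize=...)`, shown on an in-memory CSV so the sketch is self-contained. With a real file you would pass its path instead of the `io.StringIO` object.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a large file on disk.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

total = 0
n_rows = 0
# Each chunk is an ordinary DataFrame with at most `chunksize` rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += int(chunk["value"].sum())
    n_rows += len(chunk)
```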

4. Start with fastest models - e.g. LightGBM.

## Other

1. **Do not use cross-validation until the late stages of your work**. In the early stages, a train/test split is enough.

2. Switch to hyperparameter tuning and ensembling only when you are satisfied by feature engineering.

3. **Start competition from a simple baseline with a simple model** (e.g Random Forest), then switch to forums and discussion reading. After a week or so of lone work, explore the kernels of the other participants. Get some ideas from there, use them for your model or apply the ideas from your model to these kernels and then ensemble your models. **Never join the competition in the beginning.**

4. Try to create all possible features and then find the most useful among them.

5. **Use macros for frequently used code.**

6. **Modeling**:
![image-4.png](attachment:image-4.png)

7. **Ensembling**. You should save all best test predictions (best on validation and best on LB) from your different models to combine them in the end. Combining can be made in different ways: from averaging to multilayer stacking. Small data requires simpler ensemble techniques (like averaging). For bigger data use stacking.

8. **Create github repository with useful methods and update it!**

9. **Matrix factorization** is the decomposition of an MxN matrix into Mxd and dxN matrices, which is often used for text processing, recommendation systems, ranking tasks, etc. The value of d is usually in the range from 5 to 100.

10. Use **PCA** and **NMF** methods for matrix factorization. NMF is more suitable for decision trees. You can also use PCA and NMF for logarithmic values.

11. Boosting can be **weight based** and **residual based**. 
- **weight based** models calculate which objects have a bigger impact on the overall error and which have a smaller one, and assign weights to the objects according to their impact. There may be other methods of calculating weights, e.g. MAE + 1. Nevertheless, their purpose doesn't change: the next model fits on these new weights and then produces its own. Examples of weight based models: AdaBoost and LogitBoost.
- **residual based** models calculate the error of the previous model and use it as the new target for the next model. Residual based models also exclude some trees when fitting, which is a form of regularization. Examples of residual based models: XGBoost, LightGBM, CatBoost, H2OGBM, Sklearn GBM.

12. Stacking algorithm:
    - split the dataset into train, validation and test sets;
    - fit models M1, M2 and M3 (for example) on the train set;
    - predict targets on the validation and test sets with each model;
    - fit a new model on the M1, M2 and M3 predictions for the validation set to properly combine those predictions;
    - use this new model to predict the target for the test set.
It actually can have any number of stages.
Remember that your models must be diverse. This diversity may come from:
- different algorithms
- different input features (e.g. a categorical feature for the first model and a one-hot encoding of that feature for the second model).
The meta model shouldn’t be too complicated.
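The stacking steps above, sketched with scikit-learn on synthetic data. The three base models and the linear meta model are arbitrary picks for illustration; any diverse set of models would fit the same scheme.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, only to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = X[:, 0] * 2 + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=600)

# Split into train, validation and test sets.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_rest, y_rest, test_size=0.3, random_state=0)

# Fit diverse base models (M1, M2, M3) on the train set.
base_models = [
    Ridge(alpha=1.0),
    KNeighborsRegressor(n_neighbors=5),
    RandomForestRegressor(n_estimators=50, random_state=0),
]
for model in base_models:
    model.fit(X_tr, y_tr)

# Base model predictions become the features of the meta model.
val_meta = np.column_stack([m.predict(X_val) for m in base_models])
test_meta = np.column_stack([m.predict(X_test) for m in base_models])

# Simple meta model fitted on validation predictions, then applied to test.
meta_model = LinearRegression().fit(val_meta, y_val)
stacked_pred = meta_model.predict(test_meta)
stacked_mse = float(np.mean((stacked_pred - y_test) ** 2))
```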

13. Replacing missing counts values with **-1** is a good practice, because you can't estimate std for missing values and this will eventually give you -1. And normally -1 has a negative relationship with the target.

14. **Kaggle solutions:**
    - http://ndres.me/kaggle-past-solutions/
    - http://www.chioka.in/kaggle-competition-solutions/
    - https://github.com/ShuaiW/kaggle-classification/

15. If you see on the plot that your NN begins to overfit, **don't stop the process**; let it train for at least 10x the time at which you first saw overfitting. Very frequently it continues to train correctly after some period of overfitting (**double descent**).

16. If your train/val loss moves up and down, this may be caused by an unequal distribution of object features/classes across your batches.

17. Class probabilities produced by an NN are not the real probabilities that objects belong to that class. You should calibrate these probabilities with **sklearn.calibration.CalibratedClassifierCV** for all models, **Calibrated Trees** for trees, or **Softmax with temperature** for NNs.

18. Try **lightning.ai** or **wandb.ai**. They simplify training, validation and inference of the models.

19. When you finish your model, **try to cluster the test/val data and check the model on each cluster**. This may help to find the weak sides of your model, or to split your model into a couple of models, each of which works better on its own cluster.

20. Don't use 0 or 42 as a seed. **Use the date YYYYMMDD as the seed**; it will protect you from seed cherry-picking.
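For the softmax-with-temperature calibration mentioned in the tip on NN probabilities above, a minimal NumPy sketch follows. The function name and the example logits are illustrative. In practice the temperature is a single scalar that is usually tuned on a validation set, which is not shown here.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over the last axis; temperature > 1 softens, < 1 sharpens the distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
p_default = softmax_with_temperature(logits, temperature=1.0)
p_soft = softmax_with_temperature(logits, temperature=2.0)
# The predicted class stays the same, only the confidence changes.
```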