## Useful links

### Overview of algorithms and parameters in [H2O documentation](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science.html)

### [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit) repository

### [Arbitrary order factorization machines](https://github.com/geffy/tffm)

### [AWS Cloud Computing](https://aws.amazon.com/)

### [Python tSNE package](https://github.com/danielfrg/tsne)

### Libraries to work with sparse CTR-like data: [LibFM](http://www.libfm.org/), [LibFFM](https://www.csie.ntu.edu.tw/~cjlin/libffm/)

### Another tree-based method: RGF ([implementation](https://github.com/baidu/fast_rgf), [paper](https://arxiv.org/pdf/1109.0887.pdf))

### [Effective use of pandas](https://tomaugspurger.github.io/)

###  [Feature Scaling and the effect of standardization for machine learning algorithms](http://sebastianraschka.com/Articles/2014_about_feature_scaling.html)



### [Discover Feature Engineering, How to Engineer Features and How to Get Good at It](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)

### [Discussion of feature engineering on Quora](https://www.quora.com/What-are-some-best-practices-in-Feature-Engineering)

### [Feature extraction from text with Sklearn](http://scikit-learn.org/stable/modules/feature_extraction.html)

### [Text feature extraction examples](https://andhint.github.io/machine-learning/nlp/Feature-Extraction-From-Text/)

### [Tutorial to Word2vec](https://www.tensorflow.org/tutorials/word2vec)

### [Tutorial to word2vec usage](https://rare-technologies.com/word2vec-tutorial/)

### [Text Classification With Word2Vec](http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/)

### [Introduction to Word Embedding Models with Word2Vec](https://taylorwhitten.github.io/blog/word2vec)



### [TextBlob](https://github.com/sloria/TextBlob)

### [Using pretrained models in Keras](https://keras.io/applications/)

### [Image classification with a pre-trained deep neural network](https://www.kernix.com/blog/image-classification-with-a-pre-trained-deep-neural-network_p11)

### [How to Retrain Inception's Final Layer for New Categories in Tensorflow](https://www.tensorflow.org/tutorials/image_retraining)

### [Fine-tuning Deep Learning Models in Keras](https://flyyufelix.github.io/2016/10/08/fine-tuning-in-keras-part2.html)

### [How to select final model](http://www.chioka.in/how-to-select-your-final-models-in-a-kaggle-competitio/)

### [Decision Trees: “Gini” vs. “Entropy” criteria](https://www.garysieling.com/blog/sklearn-gini-vs-entropy-criteria)

### [Understanding ROC curves](http://www.navan.name/roc/)


### [Learning to Rank using Gradient Descent](http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf)

### [RankLib](https://sourceforge.net/p/lemur/wiki/RankLib/)

### [Learning to Rank Overview](https://wellecks.wordpress.com/2015/01/15/learning-to-rank-overview)

### [Tuning the hyper-parameters of an estimator sklearn](http://scikit-learn.org/stable/modules/grid_search.html)

### [Optimizing hyperparameters with hyperopt](http://fastml.com/optimizing-hyperparams-with-hyperopt/)

### [Guide to Parameter Tuning in Gradient Boosting (GBM)](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/)

### [Far0n's framework for Kaggle competitions kaggletils](https://github.com/Far0n/kaggletils)

### Overview of Matrix Decomposition methods (sklearn)

### Multicore t-SNE implementation

### [Comparison of Manifold Learning methods (sklearn)](http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html)

### [How to Use t-SNE Effectively (distill.pub blog)](https://distill.pub/2016/misread-tsne/)

### [tSNE homepage (Laurens van der Maaten)](https://lvdmaaten.github.io/tsne/)

### [Example: tSNE with different perplexities (sklearn)](http://scikit-learn.org/stable/auto_examples/manifold/plot_t_sne_perplexity.html#sphx-glr-auto-examples-manifold-plot-t-sne-perplexity-py)

### [Facebook Research's paper about extracting categorical features from trees](https://research.fb.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/)

### [Example: Feature transformations with ensembles of trees (sklearn)](http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html)

### [Kaggle ensembling guide at MLWave.com (overview of approaches)](https://mlwave.com/kaggle-ensembling-guide/)

### [StackNet](https://github.com/kaz-Anova/StackNet)

### [Heamy — a set of useful tools for competitive data science (including ensembling)](https://github.com/rushter/heamy)

## Tips and Techniques

### Data Loading

1. Do basic preprocessing and convert csv/txt files into hdf5 (for pandas DataFrames) or npy (for NumPy arrays) for faster loading.
2. By default pandas stores data in 64-bit arrays; most of the time it can be safely downcast to 32-bit.
3. Large datasets can be processed in chunks.
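
A minimal sketch of points 2 and 3, assuming a CSV with made-up `id` and `price` columns. The `downcast` helper is hypothetical, and the in-memory string stands in for a large file on disk.

```python
import io

import numpy as np
import pandas as pd


def downcast(df):
    """Convert 64-bit numeric columns to 32-bit (values must fit in 32 bits)."""
    dtypes = {}
    for col, dtype in df.dtypes.items():
        if dtype == np.float64:
            dtypes[col] = np.float32
        elif dtype == np.int64:
            dtypes[col] = np.int32
    return df.astype(dtypes)


# Hypothetical CSV content standing in for a large file on disk.
csv_text = "id,price\n" + "\n".join(f"{i},{i * 1.5}" for i in range(10))

# Read 4 rows at a time instead of loading everything at once.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=4)
df = pd.concat([downcast(chunk) for chunk in chunks], ignore_index=True)

# A NumPy array can then be stored for fast reloading (file name is illustrative):
# np.save("prices.npy", df["price"].to_numpy())
```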

### Preprocessing

### Numerical Features:

#### Transformation
1. Do scaling for non-tree based models.
2. For outliers, try clipping by value or by percentile, also known as winsorization.
3. Rank transformation, scipy.stats.rankdata()
4. np.log(1+x), np.sqrt(x+2/3) etc., useful for non-tree models, especially neural nets.
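
The transformations above can be sketched on a toy feature with one outlier. The 1st/99th percentile bounds are an illustrative choice, not a recommendation from these notes.

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # toy feature with one outlier

# Winsorization: clip to the 1st and 99th percentiles.
lower, upper = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lower, upper)

# Rank transformation: sorted values become evenly spaced,
# so the outlier ends up next to the other values.
x_rank = rankdata(x)

# Log and power transforms compress large values.
x_log = np.log1p(x)           # same as np.log(1 + x)
x_sqrt = np.sqrt(x + 2 / 3)
```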

#### Feature generation
1. ratio, e.g. price per square foot.
2. $+-*/$ features can be helpful. For example, even GBM has difficulty approximating these simple operations.
3. take fractional part, e.g. $2.49 --> 0.49$

### Categorical Features

#### Transformation

1. Label Encoding: sklearn.preprocessing.LabelEncoder or pandas.factorize, mainly for tree-based model
2. Frequency Encoding: map values to their frequencies, mainly for tree-based model
3. 1-hot Encoding: often used for non-tree based model
4. May consider replacing levels by the mean of certain numerical features

#### Feature generation (before imputation of missing data)

1. Add feature interaction: for example PClass + Gender --> 1Male, 2Female etc.

###  Mean encoding (given a categorical variable, for each 
Goods: number of 1s in a level; Bads: number of 0s in a level.
1. Likelihood = mean(target)
2. Weight of evidence = ln(Goods/Bads) * 100
3. Count = Goods
4. Diff = Goods - Bads
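
As a rough illustration, the four statistics can be computed per level with a pandas groupby. The `city` column and the toy data are assumptions; every level here has both 0s and 1s, so the logarithm is defined.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B", "B"],
    "target": [1, 1, 0, 1, 0, 0, 0],
})

grouped = df.groupby("city")["target"]
goods = grouped.sum()               # number of 1s per level
bads = grouped.count() - goods      # number of 0s per level

stats = pd.DataFrame({
    "likelihood": grouped.mean(),
    "woe": np.log(goods / bads) * 100,  # undefined if a level has no 0s or no 1s
    "count": goods,
    "diff": goods - bads,
})
```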

Example:
means = X_tr.groupby(col).target.mean()

train_new[col+'_mean_target'] = train_new[col].map(means)

val_new[col+'_mean_target'] = val_new[col].map(means)

### Mean encoding can cause overfitting, needs regularization
####  CV loop inside training data (enough data, 4 to 5 folds)

from sklearn.model_selection import StratifiedKFold

y_tr = df_tr['target'].values

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for tr_ind, val_ind in skf.split(df_tr, y_tr):

    X_tr, X_val = df_tr.iloc[tr_ind], df_tr.iloc[val_ind].copy()

    for col in cols:

        means = X_val[col].map(X_tr.groupby(col).target.mean())

        X_val[col + '_mean_target'] = means

    train_new.iloc[val_ind] = X_val

prior = df_tr['target'].mean()

train_new.fillna(prior, inplace=True)
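
A self-contained sketch of the same CV loop on toy data. The `city` column and the separate `encoded` frame are assumptions made to keep the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy data; the column names are illustrative.
df_tr = pd.DataFrame({
    "city": list("AABBBCCCAB") * 3,
    "target": [1, 0, 1, 1, 0, 0, 0, 1, 1, 0] * 3,
})
cols = ["city"]

encoded = pd.DataFrame(index=df_tr.index)
for col in cols:
    encoded[col + "_mean_target"] = np.nan

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for tr_ind, val_ind in skf.split(df_tr, df_tr["target"]):
    X_tr, X_val = df_tr.iloc[tr_ind], df_tr.iloc[val_ind]
    for col in cols:
        # Means come from the other folds only, so a row never sees its own target.
        means = X_tr.groupby(col)["target"].mean()
        col_pos = encoded.columns.get_loc(col + "_mean_target")
        encoded.iloc[val_ind, col_pos] = X_val[col].map(means).to_numpy()

# Levels that were absent from a training fold fall back to the global mean.
prior = df_tr["target"].mean()
encoded = encoded.fillna(prior)
```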

####  Smoothing:
$
\frac{\mathrm{mean}(\mathrm{target}) \cdot \mathrm{nrows} + \mathrm{globalmean} \cdot \alpha}{\mathrm{nrows} + \alpha}
$
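
A small worked example of the smoothing formula, assuming an illustrative `alpha = 10` and a toy `city` column. In practice the statistics would be computed on training folds only.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "A", "A", "B"],
    "target": [1, 1, 0, 0, 1],
})
alpha = 10  # regularization strength; an illustrative value

global_mean = df["target"].mean()
agg = df.groupby("city")["target"].agg(["mean", "count"])

# Levels with few rows are pulled towards the global mean.
smoothed = (agg["mean"] * agg["count"] + global_mean * alpha) / (agg["count"] + alpha)
df["city_mean_target"] = df["city"].map(smoothed)
```

Level `B` has a raw mean of 1.0 from a single row; smoothing pulls it most of the way back to the global mean of 0.6.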

#### Adding random noise (hard to make it work)

####  Sorting and calculating expanding mean (used in CatBoost, check it out)

cumsum = df_tr.groupby(col)['target'].cumsum() - df_tr['target']

cumcnt = df_tr.groupby(col).cumcount()

train_new[col + '_mean_target'] = cumsum/cumcnt
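
A runnable sketch of the expanding mean on toy data. It assumes the rows are already in the desired order, and filling the first occurrence of each level with the global mean is one possible choice, not something the notes prescribe.

```python
import pandas as pd

df_tr = pd.DataFrame({
    "city": ["A", "B", "A", "A", "B"],
    "target": [1, 0, 0, 1, 1],
})
col = "city"

# Sum and count of targets over earlier rows of the same level (current row excluded).
cumsum = df_tr.groupby(col)["target"].cumsum() - df_tr["target"]
cumcnt = df_tr.groupby(col).cumcount()

# The first row of each level has no history (0 / 0 gives NaN): use the global mean.
df_tr[col + "_mean_target"] = (cumsum / cumcnt).fillna(df_tr["target"].mean())
```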

### DateTime Features

1. Periodicity: day number in week, month, season, year; second, minute, hour, etc.
2. Time since: (a) a fixed date such as 1/1/2000, or (b) a row-dependent event, e.g. time since the last holiday or time to the next holiday.
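
Both kinds of features can be sketched with pandas. The dates and the holiday list are made up, and the holiday list is assumed to extend beyond the last date in the data.

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2015-10-31", "2015-12-24", "2016-01-02"])})

# Periodicity features.
df["day_of_week"] = df["date"].dt.dayofweek    # Monday = 0
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year

# Time since a fixed date.
df["days_since_2000"] = (df["date"] - pd.Timestamp("2000-01-01")).dt.days

# Time to the next holiday (sorted, made-up holiday list).
holidays = pd.to_datetime(["2015-12-25", "2016-01-01", "2016-12-25"])
next_idx = holidays.searchsorted(df["date"].to_numpy())  # first holiday on or after each date
next_holiday = pd.Series(holidays[next_idx], index=df.index)
df["days_to_next_holiday"] = (next_holiday - df["date"]).dt.days
```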

### Coordinates

1. Distance to the nearest interesting place
2. Calculate aggregate statistics for objects in the surrounding area.
3. Do clustering first, then use the distance to the cluster center.
4. If training a decision tree on coordinates, slightly rotated coordinates can be added as new features.
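
A minimal sketch of points 1, 3 and 4 on random coordinates. The "interesting places", the number of clusters and the 45-degree angle are all invented for illustration, and `rotate` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
coords = rng.uniform(0, 10, size=(200, 2))    # toy (x, y) coordinates

# Distance to the nearest "interesting place" (invented locations).
places = np.array([[2.0, 2.0], [8.0, 5.0]])
dists = np.linalg.norm(coords[:, None, :] - places[None, :, :], axis=2)
dist_to_nearest_place = dists.min(axis=1)

# Cluster first, then use the distance to the closest cluster center.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(coords)
dist_to_center = kmeans.transform(coords).min(axis=1)


def rotate(xy, degrees):
    """Rotate 2-D points counter-clockwise around the origin."""
    theta = np.deg2rad(degrees)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta), np.cos(theta)]])
    return xy @ rot.T


# Rotated copies let axis-aligned tree splits follow diagonal boundaries.
rotated_45 = rotate(coords, 45)
```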

### Feature Extraction from Text

#### Bag of words

1. Preprocessing: lowercase, stemming, lemmatization, stopwords
2. Ngram
3. Postprocessing: TFIDF

#### Word2vec, Doc2vec

### Query comparison features
1. number of matching words
2. cosine distance between tf-idf representations
3. distance between average word2vec vectors
4. Levenshtein distance
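
The first two features can be sketched with scikit-learn. The query/title pairs are invented, and whitespace tokenization is a simplifying assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

queries = ["cheap red running shoes", "python pandas tutorial"]
titles = ["red running shoes on sale", "introduction to gardening"]

# Number of matching words between each query and its paired title.
n_matching = [len(set(q.split()) & set(t.split())) for q, t in zip(queries, titles)]

# Cosine distance between tf-idf representations (1 - cosine similarity).
vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(1, 2)).fit(queries + titles)
q_vec = vectorizer.transform(queries)
t_vec = vectorizer.transform(titles)
cosine_dist = [1 - cosine_similarity(q_vec[i], t_vec[i])[0, 0] for i in range(len(queries))]
```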

### Statistics and distance based features
1. E.g., given data with columns uid, page_id, ad_price, ad_position, we can add max/min price per user/page and/or min_price_position per user/page.


gb = df.groupby(['user_id', 'page_id'], as_index=False).agg(max_price=('ad_price', 'max'), min_price=('ad_price', 'min'))
df = pd.merge(df, gb, how='left', on=['user_id', 'page_id'])


Can also try 
1. How many pages user visited
2. Standard deviation of prices
3. Most visited page
4. many more
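
A sketch of these extra statistics on a toy frame. The column names follow the example above; the data is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "page_id": [10, 10, 20, 10, 30],
    "ad_price": [5.0, 7.0, 3.0, 4.0, 6.0],
    "ad_position": [0, 1, 0, 2, 1],
})

# Number of distinct pages per user, and price spread per user.
df["n_pages"] = df.groupby("user_id")["page_id"].transform("nunique")
df["price_std"] = df.groupby("user_id")["ad_price"].transform("std")

# Most visited page per user (ties resolved by the smallest page_id).
most_visited = df.groupby("user_id")["page_id"].agg(lambda s: s.mode().iloc[0])
df["most_visited_page"] = df["user_id"].map(most_visited)

# Position of the cheapest ad per user/page pair.
idx = df.groupby(["user_id", "page_id"])["ad_price"].idxmin()
min_pos = df.loc[idx.to_numpy(), ["user_id", "page_id", "ad_position"]].rename(
    columns={"ad_position": "min_price_position"})
df = df.merge(min_pos, how="left", on=["user_id", "page_id"])
```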

2. Can also use neighbors:
    - Explicit group is not needed
    - More flexible
    - Much harder to implement
    
    
    Example:
    - Number of houses in 500m, 1000m etc.
    - Average price per square meter in 500m, 1000m etc.
    - Number of schools/supermarkets/parking lots in 500m, 1000m etc.
    - Distance to the closest subway station
    
    Concrete example:
    - mean encoding all variables
    - for every point, find 2000 nearest neighbors using Bray-Curtis metric
    - Calculate various features:
    - eg: mean target of nearest 5, 10, 15, 500, 2000 neighbors
    - eg: mean distance to the 10 closest neighbors
    - eg: mean distance to the 10 closest neighbors with target 0
    - eg: mean distance to the 10 closest neighbors with target 1
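
A scaled-down sketch of the concrete example, using scikit-learn's `NearestNeighbors` with the Bray-Curtis metric. The random data stands in for mean-encoded features, and 20 neighbors are used instead of 2000. Target-based neighbor features computed on the training rows themselves leak the target, so in practice they would be built out-of-fold.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(300, 5))      # stand-in for mean-encoded features
y = rng.randint(0, 2, size=300)           # binary target

k_max = 20
nn = NearestNeighbors(n_neighbors=k_max + 1, metric="braycurtis").fit(X)
dist, ind = nn.kneighbors(X)
dist, ind = dist[:, 1:], ind[:, 1:]       # drop each point itself (distance 0)

features = {}
for k in (5, 10, 20):
    features[f"mean_target_{k}"] = y[ind[:, :k]].mean(axis=1)
features["mean_dist_10"] = dist[:, :10].mean(axis=1)

# Mean distance to the 10 closest neighbors of each class.
for cls in (0, 1):
    nn_cls = NearestNeighbors(n_neighbors=11, metric="braycurtis").fit(X[y == cls])
    d, _ = nn_cls.kneighbors(X)
    # A point of this class finds itself first; shift its row left to skip it.
    d = np.where((y == cls)[:, None], np.roll(d, -1, axis=1), d)[:, :10]
    features[f"mean_dist_10_target_{cls}"] = d.mean(axis=1)
```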

### Matrix Factorization for Feature Extraction
1. Can be applied to only some columns
2. Can provide additional diversity, good for ensembles
3. It's a lossy transformation, usually choose 5 - 100 latent factors
4. Check out sklearn: SVD, PCA, truncated SVD (works with sparse matrices), non-negative matrix factorization (good for count-like data).
5. Can often be applied to transformed data: log(x), log(x+1) etc.
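
A minimal sketch with scikit-learn on toy count data. Fitting the decomposition on train and test stacked together, so that both share one latent space, is an assumption of this example rather than something stated in the notes.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF, TruncatedSVD

rng = np.random.RandomState(0)
counts = rng.poisson(1.0, size=(100, 30)).astype(float)   # toy count-like data
X_train, X_test = counts[:80], counts[80:]
X_all = np.vstack([X_train, X_test])

# Truncated SVD works directly on sparse matrices.
svd = TruncatedSVD(n_components=5, random_state=0).fit(csr_matrix(X_all))
train_svd = svd.transform(csr_matrix(X_train))

# NMF on log(x + 1)-transformed counts; the input must be non-negative.
nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
nmf.fit(np.log1p(X_all))
train_nmf = nmf.transform(np.log1p(X_train))
test_nmf = nmf.transform(np.log1p(X_test))
```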

### Feature Interaction
E.g. Feature1, Feature2 ---> Feature1_Feature2 (can use concatenation, or operations such as sum, division and other functions).

### Extract Features from Decision Tree
Mark each leaf into a binary feature

1. sklearn: tree_model.apply()
2. xgboost: booster.predict(pred_leaf = True)
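
A sketch of the sklearn route: `apply()` gives the leaf index per tree, and one-hot encoding turns each leaf into a binary feature. The random forest and its settings are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0).fit(X, y)

# For each sample, the index of the leaf it lands in, one column per tree.
leaves = rf.apply(X)                        # shape: (n_samples, n_estimators)

# One-hot encode leaf indices so that each leaf becomes a binary feature.
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_features = encoder.fit_transform(leaves)   # sparse matrix
```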

### t-SNE method for feature extraction
1. sklearn
2. tsne package is better
3. Perplexity: values from 5 to 100 are typical

### Technical tricks

#### Plot variable importance
plt.plot(rf.feature_importances_)

plt.xticks(np.arange(X.shape[1]), X.columns.tolist(), rotation=90);

#### show progress bar
from tqdm import tqdm_notebook

### Visualization
#### Single Variable
1. histogram: plt.hist(x)
2. plt.plot(x, '.')
3. scatter plot with label as color: plt.scatter(range(len(x)), x, c=y)

#### Feature Pairs
1. plt.scatter(x1, x2)
2. pd.plotting.scatter_matrix(df)
3. feature groups, e.g. plot sorted mean value: df.mean().sort_values().plot(style='.')

#### Tools
1. Seaborn
2. Plotly
3. Bokeh
4. ggplot
5. networkx


### Data Cleaning and Checking

#### Duplicated and Constant Features
1. Constant features: train.nunique(axis=0) == 1
2. train.T.drop_duplicates()
3. Categorical duplicated features: F1, F2 just have different levels, e.g.,  F1: A B C, F2: C D E.

 for f in categorical_features:
 
     train[f] = train[f].factorize()[0]
 
 train.T.drop_duplicates()
 


#### Duplicated Rows

#### Check if dataset is shuffled

#### Remove constant features

`dropna = False` makes nunique treat NaNs as a distinct value

feats_counts = train.nunique(dropna = False)

feats_counts.sort_values()[:10]

constant_features = feats_counts.loc[feats_counts==1].index.tolist()

print (constant_features)


traintest.drop(constant_features,axis = 1,inplace=True)

#### Remove duplicated features

Fill NaNs with a placeholder we can find later if needed:

traintest.fillna('NaN', inplace=True)

Encode each feature:

train_enc = pd.DataFrame(index=train.index)

for col in tqdm_notebook(traintest.columns):

    train_enc[col] = train[col].factorize()[0]

Find duplicated columns:

dup_cols = {}

for i, c1 in enumerate(tqdm_notebook(train_enc.columns)):

    for c2 in train_enc.columns[i + 1:]:

        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):

            dup_cols[c2] = c1
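
The two cleaning steps can be tried end to end on a toy frame. The data is made up; `factorize()` encodes NaN as -1 here, so no placeholder fill is needed in this sketch.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": ["x", "y", "z", "w"],      # same pattern as "a", different labels
    "c": [7, 7, 7, 7],              # constant
    "d": [1.0, np.nan, 3.0, 4.0],
})

# Constant columns: a single distinct value, counting NaN as a value.
feats_counts = train.nunique(dropna=False)
constant_features = feats_counts[feats_counts == 1].index.tolist()
train = train.drop(columns=constant_features)

# Encode values by order of first appearance so relabeled copies compare equal.
train_enc = pd.DataFrame(index=train.index)
for col in train.columns:
    train_enc[col] = train[col].factorize()[0]

dup_cols = {}
for i, c1 in enumerate(train_enc.columns):
    for c2 in train_enc.columns[i + 1:]:
        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
            dup_cols[c2] = c1

train = train.drop(columns=list(dup_cols))
```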

#### get cat columns and num columns
cat_cols = list(train.select_dtypes(include=['object']).columns)

num_cols = list(train.select_dtypes(exclude=['object']).columns)

### Missing value:
Check the number of NaNs in each row; it may be a good feature.

train.isnull().sum(axis=1).head(15)

## Optimize metrics

### Probability calibration
1. Platt scaling: just fit Logistic Regression to your predictions (like stacking)
2. Isotonic regression: Just fit Isotonic regression to your predictions (like stacking)
3. Stacking: Just fit XGboost or neural net to your predictions.
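
A sketch of the first two options with scikit-learn. The raw scores are synthetic stand-ins for a model's predictions on a held-out set; in practice the calibrator is fitted on held-out predictions and then applied to test predictions.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
y_val = rng.randint(0, 2, size=500)
# Hypothetical uncalibrated scores from some model on a held-out set.
raw_scores = np.clip(0.3 * y_val + rng.normal(0.4, 0.2, size=500), 0, 1)

# Platt scaling: logistic regression on the raw score.
platt = LogisticRegression().fit(raw_scores.reshape(-1, 1), y_val)
platt_probs = platt.predict_proba(raw_scores.reshape(-1, 1))[:, 1]

# Isotonic regression: a monotone, piecewise-constant map from score to probability.
iso = IsotonicRegression(out_of_bounds="clip").fit(raw_scores, y_val)
iso_probs = iso.predict(raw_scores)
```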

## Hyperparameter tuning
### Tools:
1. Hyperopt
2. scikit-optimize
3. Spearmint
4. GPyOpt

### Color coding parameters
1. Red: parameters that constrain the model. Increasing them impedes fitting but reduces overfitting; decreasing them lets the model fit more easily.
2. Green: the opposite of red.


### Tree based models

#### GBDT: xgboost, lightgbm, catboost


#### RandomForest, ExtraTrees: scikit-learn

#### Others: RGF (baidu/fast_rgf), regularized greedy forest (slow, can be used for small datasets)


### Xgboost and lightGBM

1. max_depth (xgb, lgb), num_leaves (lgb): start with 7. If increasing it never leads to overfitting, there may be many interactions in the data; stop tuning and look for new features.
2. subsample (xgb), bagging_fraction (lgb)
3. colsample_bytree, colsample_bylevel (xgb), feature_fraction (lgb)
4. min_child_weight (xgb), min_data_in_leaf (lgb): increasing it makes the model more conservative. __Important parameters; try values 0, 5, 15, 300.__
5. eta (xgb), learning_rate (lgb), num_round (xgb), num_iterations (lgb): fix eta to a small value such as 0.1 or 0.001, then find how many rounds it takes to overfit. After finding the number of rounds with early stopping, there is a __trick__: multiply the number of rounds by alpha, then divide the learning rate by alpha.

### Random Forest and Extra Trees
1. Number of trees: start with 10 to see how fast it trains, then set it to a big number and plot the curve of validation error vs. number of trees.
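
One way to get that curve without retraining from scratch is scikit-learn's `warm_start`, which keeps existing trees and only adds new ones. This is a sketch on synthetic data; the tree counts are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# warm_start=True keeps the trees already built and only adds new ones on each fit.
rf = RandomForestRegressor(n_estimators=10, warm_start=True, random_state=0)

tree_counts = [10, 25, 50, 100]
val_errors = []
for n in tree_counts:
    rf.set_params(n_estimators=n)
    rf.fit(X_tr, y_tr)
    val_errors.append(mean_squared_error(y_val, rf.predict(X_val)))

# Plot tree_counts against val_errors to see where the curve flattens.
```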

### Neural Nets (FNN)
1. Number of neurons per layers
2. Number of layers
3. Optimization methods: advanced methods such as Adam are faster but often lead to overfitting, while SGD + momentum is slower but often generalizes better.
4. Large batch sizes often lead to overfitting; 32 or 64 is better.
5. Learning rate: usually start with 0.1, then reduce it.
6. Rule of thumb: if you increase the batch size by a factor of $\alpha$, you can also increase the learning rate by the same factor.

### Linear models
1. Scikit Learn
      - SVC/SVR: sklearn wraps libLinear and libSVM; compile them yourself for multicore support
      - LogisticRegression / LinearRegression + regularizers
      - SGDClassifier / SGDRegressor
2. For datasets that don't fit in memory, we can use Vowpal Wabbit:
   - FTRL

## Ensembling

### Bagging example

model = RandomForestRegressor()
bags = 10
seed = 1

bagged_prediction = np.zeros(X_test.shape[0])

for n in range(0, bags):
    model.set_params(random_state = seed + n)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    bagged_prediction += preds
bagged_prediction /= bags


### Boosting
1. Adaboost in sklearn is very good
2. LogitBoost in Weka


- XGboost
- LightGBM
- H2O's GBM (can handle categorical variables directly)
- Catboost

### Stacking tips

#### 1st level tips
* Diversity based on algorithms:
    + 2-3 gradient boosted trees (lightgbm, xgboost, h2o, catboost)
    + 2-3 neural nets (keras, pytorch) maybe one a bit deep with 3 hidden layers, one with 2 hidden layers, one with 1 hidden layer.
    + 1-2 extra tree / random forest
    + 1-2 linear models such as logistic regression, ridge regression, linear svm etc.
    + 1-2 knn
    + 1 Factorization machine (libfm)
    + if data permit, 1 svm with nonlinear kernel
* Diversity based on input data:
    + Categorical features: try 1-hot, label-encoding, target-encoding
    + Numerical features: outliers, binning, derivatives, percentiles, scaling
    + Interactions: col1 */+- col2, unsupervised (such as k-means etc.), groupby
    

#### Subsequent level tips:
   * Simpler (or shallower) Algorithm
       + gradient boosted trees with small depth (2, 3)
       + Linear models with high regularization
       + Extra trees
       + Shallow network (1 hidden layer)
       + Knn with BrayCurtis Distance
       + Brute force search for best linear weights based on cv
   * Feature engineering:
       + pairwise difference between meta features
       + row-wise statistics such as averages or stds
       + standard feature selection techniques
   * For every 7.5 models in the previous level, add 1 model at the meta level
   * Be mindful of target leakage

#### Software for stacking

* StackNet
* Stacked ensembles from H2O
* Xcessiv


#### Tips about StackNet

* It supports many prominent tools (xgboost, lightgbm, h2o, keras)
* Can run classifiers in regression and vice versa
* [parameters section for other tools](https://github.com/kaz-Anova/StackNet/blob/master/parameters/PARAMETERS.MD) 

## Stacking validation

There are a number of ways to validate second level models (meta-models). In this reading material you will find a description for the most popular ones. If not specified, we assume that the data does not have a time component. We also assume we already validated and fixed hyperparameters for the first level models (models).


a) Simple holdout scheme

1. Split train data into three parts: partA, partB and partC.
2. Fit N diverse models on partA, predict for partB, partC and test_data, getting meta-features partB_meta, partC_meta and test_meta respectively.
3. Fit a meta-model to partB_meta while validating its hyperparameters on partC_meta.
4. When the meta-model is validated, fit it to [partB_meta, partC_meta] and predict for test_meta.


b) Meta holdout scheme with OOF meta-features

1. Split train data into K folds. Iterate through each fold: retrain N diverse models on all folds except the current fold, predict for the current fold. After this step, for each object in train_data we will have N meta-features (also known as out-of-fold predictions, OOF). Let's call them train_meta.
2. Fit models to whole train data and predict for test data. Let's call these features test_meta.
3. Split train_meta into two parts: train_metaA and train_metaB. Fit a meta-model to train_metaA while validating its hyperparameters on train_metaB.
4. When the meta-model is validated, fit it to train_meta and predict for test_meta.
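
Scheme b) can be sketched as follows. The two first-level models, the Ridge meta-model and the synthetic data are illustrative choices, not the ones the notes prescribe.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, train_test_split

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = [Ridge(alpha=1.0), RandomForestRegressor(n_estimators=30, random_state=0)]

# Step 1: out-of-fold predictions for every training row.
train_meta = np.zeros((X_train.shape[0], len(models)))
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for tr_ind, val_ind in kf.split(X_train):
    for j, model in enumerate(models):
        model.fit(X_train[tr_ind], y_train[tr_ind])
        train_meta[val_ind, j] = model.predict(X_train[val_ind])

# Step 2: refit on all training data and predict the test set.
test_meta = np.column_stack([m.fit(X_train, y_train).predict(X_test) for m in models])

# Step 3: validate the meta-model on a holdout split of train_meta.
meta_A, meta_B, y_A, y_B = train_test_split(train_meta, y_train, test_size=0.3, random_state=0)
meta_model = Ridge(alpha=1.0).fit(meta_A, y_A)
holdout_r2 = meta_model.score(meta_B, y_B)

# Step 4: refit on all of train_meta and predict for test_meta.
final_pred = meta_model.fit(train_meta, y_train).predict(test_meta)
```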


c) Meta KFold scheme with OOF meta-features

1. Obtain OOF predictions train_meta and test metafeatures test_meta using b.1 and b.2.
2. Use a KFold scheme on train_meta to validate hyperparameters for the meta-model. A common practice is to fix the seed for this KFold to be the same as the seed for the KFold used to get OOF predictions.
3. When the meta-model is validated, fit it to train_meta and predict for test_meta.


d) Holdout scheme with OOF meta-features

1. Split train data into two parts: partA and partB.
2. Split partA into K folds. Iterate through each fold: retrain N diverse models on all folds except the current fold, predict for the current fold. After this step, for each object in partA we will have N meta-features (also known as out-of-fold predictions, OOF). Let's call them partA_meta.
3. Fit models to whole partA and predict for partB and test_data, getting partB_meta and test_meta respectively.
4. Fit a meta-model to a partA_meta, using partB_meta to validate its hyperparameters.
5. When the meta-model is validated, repeat steps 2 and 3 without dividing train_data into parts, and then train the meta-model. That is, first get out-of-fold predictions train_meta for train_data using the models. Then train the models on train_data and predict for test_data, getting test_meta. Finally, train the meta-model on train_meta and predict for test_meta.


e) KFold scheme with OOF meta-features

1. To validate the model we basically do d.1 -- d.4 but we divide train data into parts partA and partB M times using KFold strategy with M folds.
2. When the meta-model is validated do d.5.

### Validation in presence of time component


f) KFold scheme in time series

In a time-series task we usually have a fixed period of time we are asked to predict, such as a day, week, month or an arbitrary period of duration T.

1. Split the train data into chunks of duration T. Select first M chunks.
2. Fit N diverse models on those M chunks and predict for the chunk M+1. Then fit those models on first M+1 chunks and predict for chunk M+2 and so on, until you hit the end. After that use all train data to fit models and get predictions for test. Now we will have meta-features for the chunks starting from number M+1 as well as meta-features for the test.
3. Now we can use meta-features from the first K chunks [M+1, M+2, ..., M+K] to fit level 2 models and validate them on chunk M+K+1. Essentially we are back at step 1, with fewer chunks and with meta-features instead of features.
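
A sketch of scheme f) on synthetic data. Chunks are 0-indexed here (the text counts from 1), and the Ridge models and chunk sizes are placeholders.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
n_chunks, chunk_size = 8, 50                    # 8 chunks, each of duration T
X = rng.normal(size=(n_chunks * chunk_size, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=len(X))
chunk_id = np.repeat(np.arange(n_chunks), chunk_size)

models = [Ridge(alpha=0.1), Ridge(alpha=100.0)]
M = 3                                           # chunks used for the first fit

# Step 2: fit on all chunks before m, predict chunk m, moving forward in time.
meta = np.full((len(X), len(models)), np.nan)
for m in range(M, n_chunks):
    past, current = chunk_id < m, chunk_id == m
    for j, model in enumerate(models):
        meta[current, j] = model.fit(X[past], y[past]).predict(X[current])

# Step 3: fit the level 2 model on all meta chunks but the last, validate on the last.
fit_mask = (chunk_id >= M) & (chunk_id < n_chunks - 1)
val_mask = chunk_id == n_chunks - 1
meta_model = Ridge(alpha=1.0).fit(meta[fit_mask], y[fit_mask])
val_r2 = meta_model.score(meta[val_mask], y[val_mask])
```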


g) KFold scheme in time series with limited amount of data

We may often encounter a situation where scheme f) is not applicable, especially with a limited amount of data. For example, when we have only years 2014, 2015, 2016 in train and we need to predict for the whole year 2017 in test. In such cases scheme c) could be of help, but with one constraint: the KFold split should be done with respect to the time component. For example, in the case of data with several years we would treat each year as a fold.

## From the project

### Suggestion1.

A good exercise is to reproduce previous_value_benchmark. As the name suggests, in this benchmark the prediction for each shop/item pair is just the monthly sales from the previous month, i.e. October 2015.

The most important step in reproducing this score is correctly aggregating daily data and constructing the monthly sales data frame. You need to get lagged values, fill NaNs with zeros and clip the values into the [0,20] range. If you do it correctly, you'll get precisely 1.16777 on the public leaderboard.
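
The aggregation, lag, fill and clip steps can be sketched on toy data. The column names and the month numbers (33 as the last training month) are assumptions made for this example.

```python
import pandas as pd

# Toy daily sales; column names and month numbering are assumed for illustration.
daily = pd.DataFrame({
    "date_block_num": [32, 32, 33, 33, 33, 33],
    "shop_id":        [1, 1, 1, 1, 2, 2],
    "item_id":        [100, 100, 100, 100, 200, 200],
    "item_cnt_day":   [2, 1, 15, 10, -1, 0],
})

# Aggregate daily data into monthly sales per shop/item pair.
monthly = (
    daily.groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)

# Pairs to predict for the next month; the pair (3, 300) has no sales history.
test = pd.DataFrame({"shop_id": [1, 2, 3], "item_id": [100, 200, 300]})

# Lagged value: sales of the previous month, NaN filled with 0, clipped into [0, 20].
prev = monthly.loc[monthly["date_block_num"] == 33, ["shop_id", "item_id", "item_cnt_month"]]
test = test.merge(prev, how="left", on=["shop_id", "item_id"])
test["prediction"] = test["item_cnt_month"].fillna(0).clip(0, 20)
```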

Generating features like this is a necessary basis for more complex models. Also, if you decide to fit some model, don't forget to clip the target into [0,20] range, it makes a big difference.

### Suggestion2.

You can get a rather good score after creating some lag-based features, as in the advice from the previous week, and feeding them into a gradient boosted trees model.

Apart from item/shop pair lags you can try adding lagged values of total shop or total item sales (which are essentially mean-encodings). All of that is going to add some new information.

### Suggestion3.

If you successfully made use of the previous advice, it's time to move forward and incorporate some new knowledge from week 4. Here are several things you can do:

1. Try to carefully tune hyper parameters of your models, maybe there is a better set of parameters for your model out there. But don't spend too much time on it.
2. Try ensembling. Start with simple averaging of a linear model and gradient boosted trees, as in the programming assignment notebook. Then try stacking.
3. Explore new features! There is a lot of useful information in the data: text descriptions, item categories, seasonal trends.

