## Useful links

### Overview of algorithms and parameters in [H2O documentation](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science.html)

### [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit) repository

### [Arbitrary order factorization machines](https://github.com/geffy/tffm)

### [AWS Cloud Computing](https://aws.amazon.com/)

### [Python tSNE package](https://github.com/danielfrg/tsne)

### Libraries to work with sparse CTR-like data: [LibFM](http://www.libfm.org/), [LibFFM](https://www.csie.ntu.edu.tw/~cjlin/libffm/)

### Another tree-based method: RGF ([implementation](https://github.com/baidu/fast_rgf), [paper](https://arxiv.org/pdf/1109.0887.pdf))

### [Effective use of pandas](https://tomaugspurger.github.io/)

### [Feature Scaling and the effect of standardization for machine learning algorithms](http://sebastianraschka.com/Articles/2014_about_feature_scaling.html)



### [Discover Feature Engineering, How to Engineer Features and How to Get Good at It](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)

### [Discussion of feature engineering on Quora](https://www.quora.com/What-are-some-best-practices-in-Feature-Engineering)

### [Feature extraction from text with Sklearn](http://scikit-learn.org/stable/modules/feature_extraction.html)

### [Text feature extraction examples](https://andhint.github.io/machine-learning/nlp/Feature-Extraction-From-Text/)

### [Tutorial to Word2vec](https://www.tensorflow.org/tutorials/word2vec)

### [Tutorial to word2vec usage](https://rare-technologies.com/word2vec-tutorial/)

### [Text Classification With Word2Vec](http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/)

### [Introduction to Word Embedding Models with Word2Vec](https://taylorwhitten.github.io/blog/word2vec)



### [TextBlob](https://github.com/sloria/TextBlob)

### [Using pretrained models in Keras](https://keras.io/applications/)

### [Image classification with a pre-trained deep neural network](https://www.kernix.com/blog/image-classification-with-a-pre-trained-deep-neural-network_p11)

### [How to Retrain Inception's Final Layer for New Categories in Tensorflow](https://www.tensorflow.org/tutorials/image_retraining)

### [Fine-tuning Deep Learning Models in Keras](https://flyyufelix.github.io/2016/10/08/fine-tuning-in-keras-part2.html)

### [How to select final model](http://www.chioka.in/how-to-select-your-final-models-in-a-kaggle-competitio/)

### [Decision Trees: “Gini” vs. “Entropy” criteria](https://www.garysieling.com/blog/sklearn-gini-vs-entropy-criteria)

### [Understanding ROC curves](http://www.navan.name/roc/)


### [Learning to Rank using Gradient Descent](http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf)

### [RankLib](https://sourceforge.net/p/lemur/wiki/RankLib/)

### [Learning to Rank Overview](https://wellecks.wordpress.com/2015/01/15/learning-to-rank-overview)

## Tips and Techniques

### Preprocessing

### Numerical Features:

#### Transformation
1. Scale features for non-tree-based models.
2. For outliers, try clipping by value or by percentile, also known as winsorization.
3. Rank transformation: scipy.stats.rankdata()
4. np.log(1+x), np.sqrt(x+2/3), etc. are useful for non-tree models, especially neural nets.
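
A minimal sketch of items 2 to 4 on a toy feature. The series `x` and the 1st/99th percentile cut-offs are assumptions for illustration; pandas `Series.rank()` stands in for scipy.stats.rankdata().

```python
import numpy as np
import pandas as pd

# Hypothetical toy feature with one extreme outlier.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0])

# Winsorization: clip to the 1st and 99th percentiles of the data.
lower, upper = np.percentile(x, [1, 99])
x_clipped = x.clip(lower, upper)

# Rank transformation: the outlier ends up one step from its neighbour.
x_rank = x.rank()

# log(1 + x) compresses large values; np.log1p computes it directly.
x_log = np.log1p(x)
```

In practice the clipping bounds would be computed on the training data only and then reused on validation and test data.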

#### Feature generation
1. Ratios, e.g. price per square foot.
2. Features built with $+, -, *, /$ can be helpful, because even GBM has difficulty approximating these simple operations.
3. Take the fractional part, e.g. $2.49 \to 0.49$
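
A small sketch of items 1 and 3. The column names `price`, `area_sqft` and `item_price` are made up for the example.

```python
import pandas as pd

# Hypothetical data: house price and area, plus a retail item price.
df = pd.DataFrame({
    "price": [250000.0, 180000.0],
    "area_sqft": [1250.0, 900.0],
    "item_price": [2.49, 9.99],
})

# Ratio feature: price per square foot.
df["price_per_sqft"] = df["price"] / df["area_sqft"]

# Fractional part of a price; rounding removes floating-point noise.
df["price_frac"] = (df["item_price"] % 1).round(2)
```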

### Categorical Features

#### Transformation

1. Label Encoding: sklearn.preprocessing.LabelEncoder or pandas.factorize, mainly for tree-based models
2. Frequency Encoding: map values to their frequencies, mainly for tree-based models
3. One-hot Encoding: often used for non-tree-based models
4. Consider replacing each level with the mean of a numerical feature within that level
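
The first three encodings can be sketched with pandas alone. The `city` column is a hypothetical example; `pd.get_dummies` is used here for one-hot encoding, which is one option among several.

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "NY", "LA"]})

# Label encoding: integer codes in order of first appearance.
codes, uniques = pd.factorize(df["city"])
df["city_label"] = codes

# Frequency encoding: share of rows that fall in each level.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# One-hot encoding: one indicator column per level.
one_hot = pd.get_dummies(df["city"], prefix="city")
```

Note that two levels with the same frequency become indistinguishable after frequency encoding.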

#### Feature generation (before imputation of missing data)

1. Add feature interaction: for example PClass + Gender --> 1Male, 2Female etc.

### Mean encoding (given a categorical variable, compute a target statistic for each level)
Goods: number of 1s in a level. Bads: number of 0s in a level.
1. Likelihood = mean(target)
2. Weight of evidence = ln(Goods/Bads) * 100
3. Count = Goods
4. Diff = Goods - Bads

Example:
means = X_tr.groupby(col).target.mean()

train_new[col + '_mean_target'] = train_new[col].map(means)

val_new[col + '_mean_target'] = val_new[col].map(means)
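
The four statistics listed above can be computed per level with one groupby. This is a sketch on a made-up frame with a binary `target`; the column names are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cat":    ["a", "a", "a", "b", "b", "b", "b"],
    "target": [1,   1,   0,   1,   0,   0,   0],
})

grouped = df.groupby("cat")["target"]
goods = grouped.sum()             # number of 1s per level
bads = grouped.count() - goods    # number of 0s per level

stats = pd.DataFrame({
    "likelihood": grouped.mean(),
    "woe": np.log(goods / bads) * 100,
    "count": goods,
    "diff": goods - bads,
})
```

Weight of evidence is undefined when a level has no Goods or no Bads, so rare levels need special handling before taking the logarithm.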

### Mean encoding can cause overfitting, needs regularization
#### CV loop inside training data (enough data, 4 to 5 folds)

from sklearn.model_selection import StratifiedKFold

y_tr = df_tr['target'].values

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

train_new = df_tr.copy()  # assumed: train_new starts as a copy of df_tr with a unique index

for tr_ind, val_ind in skf.split(df_tr, y_tr):

    X_tr, X_val = df_tr.iloc[tr_ind], df_tr.iloc[val_ind]

    for col in cols:

        means = X_val[col].map(X_tr.groupby(col).target.mean())

        train_new.loc[X_val.index, col + '_mean_target'] = means

prior = df_tr['target'].mean()

train_new.fillna(prior, inplace=True)

####  Smoothing:
$
\frac{\mathrm{mean}(\mathrm{target}) \cdot \mathrm{nrows} + \mathrm{globalmean} \cdot \alpha}{\mathrm{nrows} + \alpha}
$
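
A sketch of the smoothing formula in pandas. The helper name `smoothed_mean_encoding` and the toy data are assumptions; `alpha` controls how strongly small levels are pulled toward the global mean.

```python
import pandas as pd

def smoothed_mean_encoding(df, col, target, alpha):
    """Blend each level's target mean with the global mean, weighted by level size."""
    global_mean = df[target].mean()
    agg = df.groupby(col)[target].agg(["mean", "count"])
    smooth = (agg["mean"] * agg["count"] + global_mean * alpha) / (agg["count"] + alpha)
    return df[col].map(smooth)

df = pd.DataFrame({"cat": ["a", "a", "a", "a", "b"], "target": [1, 1, 1, 0, 0]})
encoded = smoothed_mean_encoding(df, "cat", "target", alpha=2)
```

Here the global mean is 0.6, so level `a` (mean 0.75, 4 rows) becomes 0.7 and level `b` (mean 0, 1 row) becomes 0.4. Smoothing still uses each row's own target, so it is usually combined with another regularization such as the CV loop.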

#### Adding random noise (hard to make it work)

#### Sorting and calculating expanding mean (used in CatBoost, check it out)

cumsum = df_tr.groupby(col)['target'].cumsum() - df_tr['target']

cumcnt = df_tr.groupby(col).cumcount()

train_new[col + '_mean_target'] = cumsum / cumcnt
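
A tiny worked example of the expanding mean, on a made-up frame. Each row is encoded using only the targets of earlier rows in the same level, so the first row of every level gets NaN (0/0); filling those with the global mean is an assumption here.

```python
import pandas as pd

df = pd.DataFrame({
    "cat":    ["a", "b", "a", "a", "b"],
    "target": [1,   0,   0,   1,   1],
})

# Sum of targets seen so far in the level, excluding the current row.
cumsum = df.groupby("cat")["target"].cumsum() - df["target"]
# Number of earlier rows in the same level.
cumcnt = df.groupby("cat").cumcount()

expanding = cumsum / cumcnt

# First occurrence of each level has no history; fall back to the global mean.
prior = df["target"].mean()
filled = expanding.fillna(prior)
```

The result depends on row order, so the data should be in the intended order (or shuffled) before this is applied.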

### DateTime Features

1. Periodicity: day number in week, month, season, or year; second, minute, hour, etc.
2. Time since or until an event: (a) since a fixed date such as 1/1/2000; (b) since the last holiday or until the next holiday, etc.
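
Both kinds of feature can be sketched with the pandas `.dt` accessor. The timestamp column `ts` and the single holiday date are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(["2017-12-24 18:30:00", "2018-01-02 09:05:00"]),
})

# Periodicity features.
df["dayofweek"] = df["ts"].dt.dayofweek   # Monday = 0
df["month"] = df["ts"].dt.month
df["hour"] = df["ts"].dt.hour

# Time since a fixed date.
df["days_since_2000"] = (df["ts"] - pd.Timestamp("2000-01-01")).dt.days

# Days until a holiday (negative once it has passed).
holiday = pd.Timestamp("2018-01-01")
df["days_to_holiday"] = (holiday - df["ts"].dt.normalize()).dt.days
```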

### Coordinates

1. Distance to the nearest interesting place.
2. Calculate aggregate statistics for objects in the surrounding area.
3. Do clustering first, then use the distance to the cluster center.
4. If training a decision tree on coordinates, add slightly rotated coordinates as new features.
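
A sketch of item 4, assuming planar coordinates in arrays `x` and `y`. The helper name `rotate` and the 45 degree angle are illustrative; in practice a few different angles can be tried.

```python
import numpy as np

def rotate(x, y, degrees):
    """Rotate points counter-clockwise around the origin by the given angle."""
    theta = np.deg2rad(degrees)
    x_rot = x * np.cos(theta) - y * np.sin(theta)
    y_rot = x * np.sin(theta) + y * np.cos(theta)
    return x_rot, y_rot

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
x45, y45 = rotate(x, y, 45)
```

A tree splits only on axis-aligned thresholds, so a boundary that is diagonal in the original coordinates can become a single split in the rotated ones.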

### Feature Extraction from Text

#### Bag of words

1. Preprocessing: lowercase, stemming, lemmatization, stopwords
2. Ngram
3. Postprocessing: TFIDF
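
A pure-Python sketch of the pipeline: lowercase, unigrams and bigrams, then a plain tf × idf weight. The documents and helper names are made up; stemming, lemmatization and stopword removal are left out, and sklearn's TfidfVectorizer uses a smoothed, normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

docs = ["The cat sat", "The dog sat down", "A cat and a dog"]

def tokenize(text, ngram_range=(1, 2)):
    """Lowercase, split on whitespace, and emit all n-grams in the range."""
    words = text.lower().split()
    lo, hi = ngram_range
    return [
        " ".join(words[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(words) - n + 1)
    ]

# Bag of words: term counts per document.
counts = [Counter(tokenize(d)) for d in docs]

# Document frequency: number of documents that contain each term.
n_docs = len(docs)
doc_freq = Counter(term for c in counts for term in c)

def tfidf(counter):
    """Term frequency times log inverse document frequency."""
    total = sum(counter.values())
    return {t: (k / total) * math.log(n_docs / doc_freq[t]) for t, k in counter.items()}

weights = [tfidf(c) for c in counts]
```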

#### Word2vec, Doc2vec

### Technical tricks

#### Plot variable importance
plt.plot(rf.feature_importances_)

plt.xticks(np.arange(X.shape[1]), X.columns.tolist(), rotation=90);

#### show progress bar
from tqdm import tqdm_notebook

### Visualization
#### Single Variable
1. histogram: plt.hist(x)
2. plt.plot(x, '.')
3. scatter plot with label as color: plt.scatter(range(len(x)), x, c=y)

#### Feature Pairs
1. plt.scatter(x1, x2)
2. pd.plotting.scatter_matrix(df)
3. feature groups, e.g. plot sorted mean value: df.mean().sort_values().plot(style='.')

#### Tools
1. Seaborn
2. Plotly
3. Bokeh
4. ggplot
5. networkx


### Data Cleaning and Checking

#### Duplicated and Constant Features
1. Constant features: train.nunique(axis=0) == 1
2. Duplicated features: train.T.drop_duplicates()
3. Categorical duplicated features: F1 and F2 differ only in the names of their levels, e.g. F1: A B C, F2: C D E.

for f in categorical_features:

    train[f] = train[f].factorize()[0]

train.T.drop_duplicates()
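
A small check of the idea on made-up data: after factorizing, `F1` and `F2` get identical integer codes, so the transpose trick drops one of them. The frame and column names are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({
    "F1": ["A", "B", "C", "A"],
    "F2": ["C", "D", "E", "C"],   # same pattern as F1, different level names
    "F3": ["x", "x", "y", "y"],
})

# Encode every column by order of first appearance.
encoded = pd.DataFrame(
    {f: pd.factorize(train[f])[0] for f in train.columns},
    index=train.index,
)

# Identical columns become identical rows after transposing.
deduped = encoded.T.drop_duplicates().T
```

Transposing is fine for small frames; for wide data a pairwise column comparison is usually cheaper in memory.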
 


#### Duplicated Rows

#### Check if dataset is shuffled

#### Remove constant features

`dropna = False` makes nunique treat NaNs as a distinct value

feats_counts = train.nunique(dropna = False)

feats_counts.sort_values()[:10]

constant_features = feats_counts.loc[feats_counts==1].index.tolist()

print (constant_features)


traintest.drop(constant_features,axis = 1,inplace=True)

#### Remove duplicated features

Fill NaNs with something we can find later if needed:

traintest.fillna('NaN', inplace=True)

Encode each feature:

train_enc = pd.DataFrame(index=train.index)

for col in tqdm_notebook(traintest.columns):

    train_enc[col] = train[col].factorize()[0]

dup_cols = {}

for i, c1 in enumerate(tqdm_notebook(train_enc.columns)):

    for c2 in train_enc.columns[i + 1:]:

        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
        
            dup_cols[c2] = c1

#### Get categorical and numerical columns
cat_cols = list(train.select_dtypes(include=['object']).columns)

num_cols = list(train.select_dtypes(exclude=['object']).columns)

### Missing values
Check the number of NaNs in each row; it may be a good feature.

train.isnull().sum(axis=1).head(15)
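
A minimal sketch of turning that count into a column. The frame and the column name `nan_count` are made up.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, "x"],
})

# Count missing values per row before any imputation.
train["nan_count"] = train.isnull().sum(axis=1)
```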

## Optimize metrics

### Probability calibration
1. Platt scaling: just fit Logistic Regression to your predictions (like stacking)
2. Isotonic regression: Just fit Isotonic regression to your predictions (like stacking)
3. Stacking: Just fit XGBoost or a neural net to your predictions.
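
A sketch of items 1 and 2 with scikit-learn, on synthetic scores. The arrays `y_true` and `raw` are made up, and for brevity the calibrators are fit and applied on the same data; in practice they would be fit on held-out or out-of-fold predictions.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Hypothetical labels and uncalibrated model scores in (0, 1).
y_true = rng.randint(0, 2, size=200)
raw = np.clip(0.5 + 0.2 * (y_true - 0.5) + rng.normal(0, 0.2, size=200), 0.01, 0.99)

# Platt scaling: logistic regression on the raw score.
platt = LogisticRegression()
platt.fit(raw.reshape(-1, 1), y_true)
p_platt = platt.predict_proba(raw.reshape(-1, 1))[:, 1]

# Isotonic regression: monotone, piecewise-constant mapping of the score.
iso = IsotonicRegression(out_of_bounds="clip")
p_iso = iso.fit_transform(raw, y_true)
```

Platt scaling assumes a sigmoid-shaped distortion, while isotonic regression only assumes monotonicity and so tends to need more data.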

K-fold mean encoding of `item_id`, computed out of fold:

result = []
for r_ind, f_ind in kf.split(all_data):
    print('f_ind', len(f_ind), 'r_ind', len(r_ind))
    X_r, X_f = all_data.iloc[r_ind], all_data.iloc[f_ind].copy()
    means = X_f['item_id'].map(X_r.groupby('item_id').target.mean())
    X_f['item_target_enc'] = means
    result.append(X_f)
#all_data['item_target_enc'].fillna(0.3343, inplace=True)

Mean encoding of `item_id` on the full data, without folds:

all_data['item_target_enc'] = all_data.groupby('item_id')['target'].transform('mean')