# How to Win a Data Science/Kaggle Competition

* Yandex

#### A real-world ML pipeline is a complicated process that includes:
* Understanding the business problem
* Problem formalization
* Data collection
* Data preprocessing
* Modeling
* Choosing how to evaluate the model in real life
* Choosing how to deploy the model

#### It's not only about algorithms
* It is all about data and making things work, not about algorithms themselves
    * Everyone can and will tune classic approaches
    * We need some insights to win
* Sometimes the solution involves no ML at all

#### Do not limit yourself
* It is OK to use heuristics and manual data analysis
* Do not be afraid of
    * Complex solutions
    * Advanced feature engineering
    * Heavy computation
    
#### Be creative
* It is OK to modify or hack existing algorithms, or even to design a completely new algorithm
* Do not be afraid of reading source code and changing it

* __Vowpal Wabbit__: a linear-model library used for large datasets
* PyTorch

[Playground](http://playground.tensorflow.org)

[Classifier Comparison](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)

## Preprocessing data

* feature preprocessing is often necessary
* feature generation (define a new feature that might provide better insight)
* preprocessing and generation pipelines depend on model type

### Numeric features
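* Scaling: brings features to a common range, which matters for non-tree-based models
* Rank transform: replaces values with their ranks, which pulls outliers closer to the other values
* Log transform and raising to a power < 1 can help linear models and neural networks by pulling large values closer to the mean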
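
The three transforms above can be sketched in plain Python. This is an illustrative sketch only: the function names and the sample data are made up for this example, and in practice one would more likely reach for `sklearn.preprocessing.MinMaxScaler`, `scipy.stats.rankdata`, and `numpy.log1p`.

```python
import math

def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def rank_transform(values):
    """Replace each value by its 1-based rank; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def log_transform(values):
    """Apply log(1 + x); assumes non-negative inputs."""
    return [math.log1p(v) for v in values]

feature = [1.0, 2.0, 3.0, 4.0, 1000.0]  # made-up data with one outlier
scaled = min_max_scale(feature)   # the outlier squeezes the other values near 0
ranked = rank_transform(feature)  # the outlier is just one rank above its neighbour
logged = log_transform(feature)   # the outlier is compressed to about 6.9
```

Note how min-max scaling alone leaves the outlier dominating the range, while the rank and log transforms reduce its distance from the rest.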

### Categorical and ordinal features
* Frequency encoding maps each category to its frequency in the data
* Label and frequency encodings are often used for tree-based models
* One-hot encoding is often used for non-tree-based models
* Interactions of categorical features can help linear models and KNN
* Concatenating two categorical features into a single feature before label encoding can also be very helpful
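
A minimal sketch of the encodings listed above, written in plain Python so it has no dependencies. The function names and the two sample columns are hypothetical; scikit-learn's `LabelEncoder` and `OneHotEncoder` cover the first and third cases in real projects.

```python
from collections import Counter

def label_encode(values):
    """Map each category to an integer id, assigned in sorted order."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]

def frequency_encode(values):
    """Map each category to the fraction of rows in which it occurs."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return categories, rows

def interaction(a, b):
    """Concatenate two categorical columns into one combined feature."""
    return [f"{x}_{y}" for x, y in zip(a, b)]

embarked = ["S", "C", "S", "Q"]  # made-up categorical columns
pclass = ["3", "1", "1", "3"]

labels = label_encode(embarked)           # [2, 0, 2, 1]
freqs = frequency_encode(embarked)        # [0.5, 0.25, 0.5, 0.25]
cats, onehot = one_hot_encode(embarked)   # cats == ['C', 'Q', 'S']
combined = interaction(pclass, embarked)  # ['3_S', '1_C', '1_S', '3_Q']
```

The combined column can then be label-encoded or one-hot encoded like any other categorical feature.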

### Datetime and coordinates
* Date and time
    1. Periodicity: position within a cycle, e.g. day of week, month, season, year, second, minute, and hour
    2. Time since a row-independent moment (e.g. a fixed reference date) or a row-dependent moment (e.g. the previous event for that row)
    3. Difference between dates
* Coordinates
    * Interesting places from train/test data or additional data
    * Centers of clusters
    * Aggregated statistics
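
The date and time features above can be sketched with the standard `datetime` module. The function name, the feature names, and the sample timestamps are assumptions made for illustration; coordinates are not covered here.

```python
from datetime import datetime

def datetime_features(ts, reference, previous=None):
    """Derive periodicity and time-since features from one timestamp."""
    features = {
        # Periodicity: position of the timestamp within repeating cycles.
        "hour": ts.hour,
        "day_of_week": ts.weekday(),  # Monday = 0
        "month": ts.month,
        # Time since a row-independent moment (the same reference for all rows).
        "days_since_reference": (ts - reference).days,
    }
    if previous is not None:
        # Time since a row-dependent moment, here a previous event for this row.
        features["seconds_since_previous"] = (ts - previous).total_seconds()
    return features

reference = datetime(2000, 1, 1)          # hypothetical fixed reference date
purchase = datetime(2017, 3, 15, 14, 30)  # made-up event timestamps
last_visit = datetime(2017, 3, 14, 9, 0)

feats = datetime_features(purchase, reference, previous=last_visit)
```

The difference between two date columns follows the same pattern: subtract the two `datetime` values and keep `.days` or `.total_seconds()` as the feature.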
    
### Missing data
* Fillna approaches
    * -999, -1, etc
    * mean, median
    * Reconstruct value
* The choice of method to fill NaN depends on the situation
* Usual way to deal with missing values is to replace them with -999, mean or median
* Binary feature **isnull** can be beneficial
* In general, avoid filling NaNs before feature generation
* XGBoost can handle NaN natively
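
As a rough sketch of the fill strategies above combined with an **isnull** indicator, assuming missing values arrive as `None` or `NaN` (the helper names and sample data are made up for this example):

```python
import math
from statistics import median

def is_missing(v):
    """Treat None and float NaN as missing."""
    return v is None or (isinstance(v, float) and math.isnan(v))

def fill_missing(values, strategy="median", constant=-999):
    """Return (filled_values, isnull_flags) for one numeric column."""
    flags = [1 if is_missing(v) else 0 for v in values]
    observed = [v for v in values if not is_missing(v)]
    if strategy == "median":
        fill = median(observed)
    elif strategy == "mean":
        fill = sum(observed) / len(observed)
    elif strategy == "constant":
        fill = constant
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    filled = [fill if flag else v for v, flag in zip(values, flags)]
    return filled, flags

age = [22.0, None, 35.0, float("nan"), 58.0]  # made-up column with two gaps

filled, isnull = fill_missing(age, strategy="median")
sentinel, _ = fill_missing(age, strategy="constant", constant=-999)
```

An out-of-range constant such as -999 lets a tree-based model isolate the missing rows with a single split, whereas a mean or median fill tends to be less harmful for linear models; keeping the `isnull` flags preserves the information either way.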

### Feature extraction from texts and images

#### Text to vector
1. Bag of words: builds a matrix in which each column corresponds to a word and each row to a document (e.g. a sentence). Each cell holds the number of times that word occurs in that document, and 0 if it does not occur.
    * `sklearn.feature_extraction.text.CountVectorizer`
    * TF-IDF
        * Term frequency: `tf = 1 / x.sum(axis=1)[:, None]; x = x * tf`
        * Inverse document frequency: `idf = np.log(x.shape[0] / (x > 0).sum(0)); x = x * idf`
        * `sklearn.feature_extraction.text.TfidfVectorizer`
    * N-grams: `sklearn.feature_extraction.text.CountVectorizer`; **ngram_range** sets the range of n-gram sizes, while **analyzer** switches to character n-grams
    * Preprocess text
        * Lowercase
        * Lemmatization/stemming: stemming chops words down to a common stem, while lemmatization uses vocabulary and morphology to return the dictionary form of a word
        * Stopwords: prepositions and other very common words
    * Very large vectors
    * Meaning of each value in the vector is known
2. Embeddings (~word2vec): convert each word to a vector in a learned vector space. Addition and subtraction can be applied to these vectors and the result should be interpretable, e.g. king + woman - man ≈ queen
    * Words: Word2vec, GloVe, FastText, etc
    * Sentences: Doc2vec, etc
    * Relatively small vectors
    * Values in vector can be interpreted only in some cases
    * The words with similar meaning often have similar embeddings
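
To make the bag-of-words and TF-IDF steps concrete, here is a dependency-free sketch that follows the two NumPy expressions quoted in the list above (row-normalised term frequency, then `log(N / document frequency)`). The tokenizer is deliberately naive and all names and sample documents are invented for this example.

```python
import math

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def bag_of_words(docs):
    """Return (vocabulary, count matrix) with one row per document."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {tok: j for j, tok in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for tok in tokenize(doc):
            matrix[i][index[tok]] += 1
    return vocab, matrix

def tf_idf(matrix):
    """Weight a count matrix; assumes no empty rows and no all-zero columns."""
    n_docs, n_terms = len(matrix), len(matrix[0])
    # Term frequency: divide every row by its sum.
    tf = []
    for row in matrix:
        total = sum(row)
        tf.append([c / total for c in row])
    # Inverse document frequency: log(N / number of documents containing the term).
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df) for df in doc_freq]
    return [[tf[i][j] * idf[j] for j in range(n_terms)] for i in range(n_docs)]

def char_ngrams(text, n):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

docs = ["The cat sat", "The dog sat down", "A cat"]  # made-up corpus
vocab, counts = bag_of_words(docs)  # vocab == ['a', 'cat', 'dog', 'down', 'sat', 'the']
weights = tf_idf(counts)
bigrams = char_ngrams("cat", 2)     # ['ca', 'at']
```

`TfidfVectorizer` applies a smoothed IDF and L2 row normalisation by default, so its numbers will differ from this sketch even though the idea is the same.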

#### Image to vector

1. Descriptors
2. Training network from scratch
3. Finetuning
4. Augmentation: generating extra training samples by rotating/stretching the original data
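
A toy sketch of the augmentation idea, treating an image as a list of pixel rows and restricting itself to 90-degree rotations and mirroring (stretching is not shown). The helper names are hypothetical; real pipelines typically use an image library instead.

```python
def rotate90(image):
    """Rotate a 2-D image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def flip_horizontal(image):
    """Mirror a 2-D image left to right."""
    return [row[::-1] for row in image]

def augment(image):
    """Return the image in its four rotations, followed by their mirrored copies."""
    rotations = [image]
    for _ in range(3):
        rotations.append(rotate90(rotations[-1]))
    return rotations + [flip_horizontal(r) for r in rotations]

image = [[1, 2],
         [3, 4]]  # made-up 2x2 "image"

variants = augment(image)  # 8 training samples derived from one original
```

This only helps when the label does not depend on orientation; for tasks such as digit recognition, a rotated or mirrored image may no longer match its label.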