# EDA

Things to check for:  
- Inconsistent column names (capitalized, spaces, symbols)
- Duplicate rows, untidy data  
- Wrong variable types  
- Cardinality of categorical variables  
- Distribution of numerical features / **outliers**  
- Distribution / correlation of missing values  
- Potential relationships  

Types of Plots:
- **Quantitative variables (univariate)**: histograms, density plots (independent of bin size), boxplots (good for outliers)
- **Categorical variables (univariate)**: frequency count and countplot.
- **Quantitative v. Quantitative**: correlation matrix/heatmap, scatterplot, sns.jointplot (fancy scatter with histogram), scatterplot matrix
- **Quantitative v. Categorical**: scatterplots with different colors, compare boxplots/violinplots side by side. Very useful: see example with catplot with two categorical dimensions.
- **Categorical v. Categorical**: countplot + hue, contingency tables

Some techniques:
- df.head(), df.tail(), df.shape, df.columns
- df.dtypes, df.info(), df.describe()
- Frequency Count (good for categorical): .value_counts(dropna=False)
- Summary Statistics (good for outliers): .describe()
- bar plots for discrete data:  (df[col].value_counts().plot.bar())
- t-SNE
    
Some tips/suggestions:
- sns.set() makes the plots look better (it applies the default seaborn theme)
- Seaborn plots return an Axes object. We can use matplotlib syntax to add titles, change axis, etc.
- use masks for filtering (more organized)
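
A minimal sketch of a few of the checks above, using a small made-up DataFrame (the column names and values are hypothetical). It normalizes column names, counts duplicates, inspects frequencies and missing values, and filters with named boolean masks:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: messy column names, one duplicate row, missing values.
df = pd.DataFrame({
    "Customer ID": [1, 2, 2, 3, 4],
    "Age (years)": [25, 40, 40, np.nan, 95],
    "City": ["Paris", "Rome", "Rome", None, "Paris"],
})

# Normalize column names: lowercase, runs of symbols/spaces become "_".
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(r"[^0-9a-z]+", "_", regex=True)
    .str.strip("_")
)

# Duplicate rows: count them, then drop them.
n_duplicates = df.duplicated().sum()
df = df.drop_duplicates()

# Frequency count (including missing values) and share of missing values.
city_counts = df["city"].value_counts(dropna=False)
missing_share = df.isna().mean()

# Named boolean masks keep filters readable and reusable.
is_adult = df["age_years"] >= 18
is_paris = df["city"] == "Paris"
paris_adults = df[is_adult & is_paris]
```

Naming each mask separately makes it easy to combine or negate conditions later without rewriting them.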

# Data Cleaning

Tidy Data Rules:  
1) Columns represent separate variables  
2) Rows represent individual observations  
3) Observational units form tables   


- Check for data types and unique values: any categorical column "disguised" as numeric? Numerical as strings?  

Some useful techniques:  
- **pd.melt(...)**: useful when columns represent similar information (imagine one column per country for GDP information, for example). pd.pivot_table(...) does the opposite transformation.  
- **.str methods** and other methods: select/process strings in a vectorized fashion. Powerful: **re + .str.contains(pattern)**  
- Use **.astype('float32')** to change data types. Very useful for constructing feature interactions.   
- **.apply + custom function** is very flexible, but slow. Using pandas vectorized methods is faster (like .str methods). Check examples.
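
As a rough illustration of the techniques above, the sketch below reshapes a wide table with pd.melt, cleans strings with vectorized .str methods, casts with .astype, and pivots back. The table and its values are made up for the example:

```python
import pandas as pd

# Hypothetical wide table: one column per country, numbers stored as strings.
wide = pd.DataFrame({
    "year": [2019, 2020],
    "Brazil": ["1,000", "1,100"],
    "France": ["2,000", "2,500"],
})

# Wide -> long: one row per (year, country) observation.
tidy = pd.melt(wide, id_vars=["year"], var_name="country", value_name="gdp")

# Vectorized string cleanup, then a cast to a numeric type.
tidy["gdp"] = tidy["gdp"].str.replace(",", "", regex=False).astype("float32")

# Regex filtering with .str.contains (here: countries starting with "B").
starts_with_b = tidy["country"].str.contains(r"^B", regex=True)

# Long -> wide again (the opposite transformation).
back_to_wide = tidy.pivot_table(index="year", columns="country", values="gdp")
```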

# Preprocessing

## Missing Values

- Be careful with **data leakage** for imputation (use the train set or cross validation). 
- **Mean or Median imputation**: usually median is better. Both can severely distort the distribution and artificially increase the number of outliers (by imputing median/mean, we reduce the variance). Good practice with imputation: creating **flag** variables that mark which values were imputed.  


Some interesting options on sklearn:  
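- **IterativeImputer**: uses a model, like BayesianRidge, to estimate the missing values from the other features. Very flexible. Still experimental: it requires `from sklearn.experimental import enable_iterative_imputer` before the import.
- **KNNImputer**: similar idea, but the estimate always comes from the k nearest neighbors.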
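
A small sketch of leakage-safe imputation, assuming a toy train/test split (the arrays are made up). The imputer is fit on the training data only, and `add_indicator=True` appends the flag columns mentioned above:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Made-up data: both columns have a missing value in the training set.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [np.nan, 30.0], [100.0, 40.0]])
X_test = np.array([[np.nan, 20.0], [4.0, np.nan]])

# Median imputation plus one missing-value flag per column with NaNs in train.
median_imputer = SimpleImputer(strategy="median", add_indicator=True)
X_train_imp = median_imputer.fit_transform(X_train)  # statistics come from train only
X_test_imp = median_imputer.transform(X_test)        # test reuses the train medians

# KNN-based alternative, again fit on the training data only.
knn_imputer = KNNImputer(n_neighbors=2).fit(X_train)
X_test_knn = knn_imputer.transform(X_test)
```

Inside cross-validation, the same effect is obtained by placing the imputer in a pipeline so it is refit on each training fold.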

## Categorical Features

- In a **production** environment, be careful about **unseen categories**. The encoding needs to be integrated into the pipeline. Alternatives: dropping data with unseen categories, or encoding them as a rare category or as a missing value.  
- **Caution**: one-hot encoding usually hurts tree-based models, especially with high cardinality. Avoid if possible.
- **Frequency encoding**: be careful with leakage and labels with the same frequency.  
- **Target encoding**: you can try **mean, median, std**. Be extra careful with leakage here. Use cross validation and out-of-fold (OOF) predictions.
- **Probability Ratio**: P(1)/P(0) for each label.  
- **High cardinality**: besides using the techniques above, you can try to bin into smaller groups. Optimal way: order by target mean and break into groups following that order (LightGBM does this). Option: bundle of **rare** categories.
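
One possible way to do out-of-fold mean target encoding is sketched below. The helper `oof_target_encode` and the toy data are hypothetical, not a library API. Each row is encoded using only targets from the other folds, and unseen categories fall back to a prior:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy data, made up for illustration.
df = pd.DataFrame({
    "city": ["a", "a", "a", "b", "b", "b", "c", "c"],
    "target": [1, 0, 1, 0, 0, 1, 1, 1],
})


def oof_target_encode(frame, col, target, n_splits=4, seed=0):
    """Mean target encoding where each row only sees targets from other folds."""
    encoded = np.full(len(frame), np.nan)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(frame):
        fit_part = frame.iloc[fit_idx]
        fold_means = fit_part.groupby(col)[target].mean()
        fold_prior = fit_part[target].mean()  # fallback for categories missing in the fit part
        values = frame.iloc[enc_idx][col].map(fold_means).fillna(fold_prior)
        encoded[enc_idx] = values.to_numpy()
    return pd.Series(encoded, index=frame.index)


df["city_te"] = oof_target_encode(df, "city", "target")

# At inference time, encode with means from the full training set;
# unseen categories ("z" here) fall back to the global training mean.
full_means = df.groupby("city")["target"].mean()
prior = df["target"].mean()
new_data = pd.Series(["a", "z"])
new_encoded = new_data.map(full_means).fillna(prior)
```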


## Continuous Transformations

- For **non-Gaussian/Skewed** variables: log, reciprocal (1/x), square root, exponential, Box-Cox / Yeo-Johnson (see below)
- **MaxAbsScaler**: robust to very small standard deviation and preserves zero entries on sparse data.  
- **RobustScaler**: removes the median and scales according to the interquartile range (1st to 3rd quartile).  
- Mapping to uniform distribution (QuantileTransformer)
- Mapping to Gaussian distribution: stabilizes variance and reduces skewness. PowerTransformer (Yeo-Johnson or **Box-Cox**). Box-Cox only works for strictly positive data. 

**Sklearn**  
- from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler, QuantileTransformer, PowerTransformer  
- QuantileTransformer(random_state=0): maps to uniform distribution between 0 and 1. We can map to the normal distribution by setting output_distribution='normal'.
- pt = PowerTransformer(method=**'box-cox'**, standardize=False)
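
A short sketch of these transformers on synthetic right-skewed data (a lognormal sample generated here just for the example). The `skewness` helper is a hypothetical convenience function, not part of sklearn:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
# Synthetic, strictly positive, right-skewed feature (Box-Cox needs x > 0).
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))


def skewness(a):
    """Sample skewness (third standardized moment)."""
    a = np.asarray(a).ravel()
    return float(((a - a.mean()) ** 3).mean() / a.std() ** 3)


# Box-Cox without the final standardization step.
box_cox = PowerTransformer(method="box-cox", standardize=False)
X_bc = box_cox.fit_transform(X)

# Yeo-Johnson also handles zero and negative values; standardize=True is the default.
yeo_johnson = PowerTransformer(method="yeo-johnson")
X_yj = yeo_johnson.fit_transform(X)

# Quantile mapping to a uniform distribution on [0, 1] or to a normal distribution.
qt_uniform = QuantileTransformer(n_quantiles=100, random_state=0)
X_uniform = qt_uniform.fit_transform(X)
qt_normal = QuantileTransformer(
    n_quantiles=100, output_distribution="normal", random_state=0
)
X_normal = qt_normal.fit_transform(X)
```

As with imputation, these should be fit on the training data only and then applied to validation and test data.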

## Text

- Bag of words ignores the order in which words appear in a document. The resulting count vectors are mostly zeros, so they need to be stored as sparse matrices.  
- **CountVectorizer** provides raw term frequencies. For some estimators (Bernoulli Naive Bayes) it might be better to set binary=True option.    
- Words that occur in many documents are not very useful for classification. We can adjust for this effect by using term frequency-inverse document frequency (**tf-idf**).  
- **n-grams** instead of 1-grams can help tell apart similar phrases: 'This is fun', 'is this fun?'. The bigram 'is this' appears only in the second phrase (maybe we should include the question mark?)  
- Choosing which **stop words** to use is not a trivial problem. Words that are uninformative in one problem can be very important in a different problem. CountVectorizer has a built-in stop_words='english' option. Removing stop words is more useful when using raw counts (tf-idf downplays frequent words anyway).  
- Check the documentation for useful tips about **encoding**.  
- For word **stemming** (transforming a word into its root form), use the NLTK package. Stemming can result in weird words (like thu from thus). **Lemmatization** provides more natural forms (lemmas), but does not seem to improve predictions (and is also more computationally expensive).

- CountVectorizer needs to hold the complete vocabulary in memory, and tf-idf also needs the inverse document frequencies. When doing **online learning**, we can use a **HashingVectorizer** instead.

**Sklearn**  
- from sklearn.feature_extraction.text import **CountVectorizer**. ngram_range=(1, 1) is the default.  
- Setting **different analyzers** with ngram_range argument. CountVectorizer(analyzer='char_wb', ngram_range=(2,2)) will create n-grams from characters inside word boundaries padded with spaces. The 'char' analyzer creates n-grams that span across words.
- from sklearn.feature_extraction.text import **TfidfTransformer** 
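
The sketch below runs the vectorizers on the two phrases from the notes plus one extra made-up document. It is only an illustration of the options discussed above, with default tokenization (lowercasing, tokens of two or more characters):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["This is fun", "is this fun?", "fun fun fun"]

# Raw term frequencies (stored as a sparse matrix).
unigrams = CountVectorizer()
counts = unigrams.fit_transform(docs)
vocab = sorted(unigrams.vocabulary_)

# Binary occurrence instead of counts (e.g. for Bernoulli Naive Bayes).
binary_counts = CountVectorizer(binary=True).fit_transform(docs)

# Unigrams + bigrams: word order now matters ("this is" vs "is this").
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)

# tf-idf downweights "fun", which occurs in every document.
tfidf = TfidfTransformer().fit_transform(counts).toarray()

# Character trigrams: 'char' spans across words, 'char_wb' stays inside word boundaries.
char_vec = CountVectorizer(analyzer="char", ngram_range=(3, 3)).fit(["is fun"])
char_wb_vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit(["is fun"])
```

With the default tokenizer the question mark is dropped, so the two phrases have identical unigram counts and only the bigrams distinguish them.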

# Feature Engineering

## Feature Generation

Things to try:
- Ratios, total distance, difference, sum, interaction, fractional part of prices
- Concatenation of string features
- Target encoding (be careful with leakage)
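
A few of these ideas in pandas, as a sketch on a made-up table (all column names are hypothetical):

```python
import pandas as pd

# Hypothetical product data.
df = pd.DataFrame({
    "price": [9.99, 20.00, 4.50],
    "list_price": [12.00, 20.00, 9.00],
    "brand": ["x", "y", "x"],
    "color": ["red", "red", "blue"],
})

df["discount"] = df["list_price"] - df["price"]       # difference
df["price_ratio"] = df["price"] / df["list_price"]    # ratio
df["price_frac"] = df["price"] % 1                    # fractional part of the price
df["brand_color"] = df["brand"] + "_" + df["color"]   # concatenated string features
```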

## Feature Selection

- Useful tools from sklearn: **VarianceThreshold** and **SelectFromModel**. The latter works with models that have a coef_ or feature_importances_ attribute.

**Permutation Feature Importance**  
- We can evaluate permutation feature importance on the training set or the validation set (for generalization error).  
- Features that are important on the training set but not on the validation set could be causing the model to overfit.  
- Features deemed unimportant by a model with low predictive performance could be highly predictive for a model that generalizes better -> evaluate how good a model is with cross-validation before evaluating feature importance.  
- Impurity-based feature importance from tree-based models can give importance to features that are not predictive on unseen data.   
- We can use different scoring methods with permutation feature importance.
- Tip: cluster features that are correlated and keep one feature from each cluster.
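
A minimal sketch of permutation importance on a held-out validation set, using synthetic data in which only the first two features are informative. The dataset, model, and settings are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the 2 informative features are columns 0 and 1.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=2, n_redundant=0,
    class_sep=2.0, shuffle=False, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Importance = drop in the chosen score when one feature's values are shuffled.
result = permutation_importance(
    model, X_val, y_val, scoring="accuracy", n_repeats=5, random_state=0
)
ranking = result.importances_mean.argsort()[::-1]  # most important first
```

Running the same call on `(X_train, y_train)` and comparing the two rankings is one way to spot features the model relies on only in training.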

# Model Selection / Tuning

**Cross-validation**
- Even when using a validation set, our model choice can depend on the random choice for the pair of (train, validation). It might be better to do cross-validation instead.  


- For **large imbalance** in the distribution of classes, use stratification. If the order of the samples is not arbitrary (for example, sorted by class), use **shuffling** (KFold has a shuffle option that shuffles only the indices).


- **Important**: consider doing cross validation on some preprocessing steps as well (use pipelines)


- from sklearn.model_selection import **cross_val_score** (we can give an integer to cv or a cross-validation generator)   
  scores = cross_val_score(clf, X, y, cv=5, scoring=...)  
  See https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter  
  We can use different cross validation strategies (below) or a custom iterator (see code section):  
  from sklearn.model_selection import ShuffleSplit  
  cv = ShuffleSplit(...)  
  cross_val_score(clf, X, y, cv=cv)  
  
  
- **cross_validate** returns more data and accepts a list of metrics;  
  **cross_val_predict** returns cross-validated predictions
  
  
- **Cross validation iterators**: KFold, RepeatedKFold, Leave One Out (LOO), Leave P Out (LPO), ShuffleSplit, StratifiedKFold, RepeatedStratifiedKFold, StratifiedShuffleSplit. More advanced: iterators for groups.
- **Custom fold iterators**: check notebook with example.


- **Time Series Split**: in split k, the first k folds form the training set and fold k+1 is the test set (successive training sets are supersets of the earlier ones, and the data is never shuffled).
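
The points above can be sketched as follows, on synthetic imbalanced data (the model and metrics are arbitrary choices). Preprocessing sits inside the pipeline so it is refit on every training fold, and the last lines show what TimeSeriesSplit produces for eight ordered samples:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    StratifiedKFold, TimeSeriesSplit, cross_val_score, cross_validate,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 90% / 10%).
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# Scaling is part of the pipeline, so it is cross-validated too.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified, shuffled folds passed as a cross-validation generator.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")

# cross_validate accepts several metrics and also returns timings.
results = cross_validate(clf, X, y, cv=cv, scoring=["accuracy", "roc_auc"])

# TimeSeriesSplit: training sets grow, the test fold always comes later in time.
tscv = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(np.arange(8))]
# splits == [([0, 1], [2, 3]), ([0, 1, 2, 3], [4, 5]), ([0, 1, 2, 3, 4, 5], [6, 7])]
```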

**Parameter Search**
- Parameter grids are very flexible: you can vary models and preprocessing methods as well, not just parameters. Check notebook for example.
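
One way such a grid could look, as a sketch on synthetic data: the step names `scaler` and `clf` are arbitrary, and the list of dictionaries lets each model carry its own preprocessing choices and hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# The initial steps are placeholders; the grid swaps them out.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = [
    {   # linear model: try two scalers and two regularization strengths
        "scaler": [StandardScaler(), MinMaxScaler()],
        "clf": [LogisticRegression(max_iter=1000)],
        "clf__C": [0.1, 1.0],
    },
    {   # tree: skip scaling entirely and tune the depth
        "scaler": ["passthrough"],
        "clf": [DecisionTreeClassifier(random_state=0)],
        "clf__max_depth": [2, 4],
    },
]

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
n_candidates = len(search.cv_results_["params"])  # 2*2 + 2 = 6 combinations
```

After fitting, `search.best_params_` reports which scaler, model, and hyperparameters won.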

# Pipelines

- Pipeline for chaining multiple transformers (and a final estimator) in sequence


- Applying **transformations in parallel** and concatenating the results (useful to combine several feature extraction methods):  
  from sklearn.pipeline import **FeatureUnion**  
  transformer = FeatureUnion(transformer_list = [('name', SimpleImputer(...)), ...])
  
  
  
- Quick way to construct a pipeline from estimators:  
  from sklearn.pipeline import **make_pipeline**  
  make_pipeline(transformer, DecisionTreeClassifier())  
  note: make_pipeline does not let us choose the step names; they are set automatically to the lowercased class names  
  
  
- Extremely important: **custom sklearn transformers** (check code section). Use map + dictionaries extensively.   


- Very useful command: pipeline.get_params().keys()



- Excellent tool/tips: **make_column_selector**, using a transformer on the remainder argument, **make_column_transformer**


- Should we select features before or during the pipeline? Trade-off between flexibility and more code. If the data does not change a lot, use pre-selected features.
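
A compact sketch combining several of the tools above. `MapColumn` is a hypothetical custom transformer written for this example (map + dictionary), and the data is made up. Numeric columns are picked with make_column_selector, and the custom transformer is applied to the remainder:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


class MapColumn(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: map one DataFrame column through a dictionary."""

    def __init__(self, column, mapping, default=-1):
        self.column = column
        self.mapping = mapping
        self.default = default

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        X = X.copy()
        # Values missing from the dictionary (unseen categories) get the default.
        X[self.column] = X[self.column].map(self.mapping).fillna(self.default)
        return X


df = pd.DataFrame({
    "size": ["S", "M", "L", "M", "XL", "S"],
    "weight": [1.0, np.nan, 3.0, 2.5, 4.0, 1.2],
})
y = [0, 0, 1, 1, 1, 0]

preprocess = make_column_transformer(
    # numeric columns: impute, then scale
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     make_column_selector(dtype_include=np.number)),
    # everything else goes through the custom transformer
    remainder=MapColumn("size", {"S": 0, "M": 1, "L": 2}),
)

model = make_pipeline(preprocess, DecisionTreeClassifier(random_state=0))
model.fit(df, y)

step_names = list(model.named_steps)        # names generated by make_pipeline
param_names = sorted(model.get_params())    # everything tunable in a grid search
Xt = model.named_steps["columntransformer"].transform(df)
```

The names in `param_names` (for example `decisiontreeclassifier__max_depth`) are the keys a parameter grid would use.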