<h1 align="center"  style='color:blue;font-size:30px'> My Personal Data Science Notes </h1>
<h2> 👋 Hi Everyone, </h2>

Here are my personal notes on **fundamental** and **unique** methods and functions related to **Data Science**, gathered from multiple notebooks and sources (mostly from a <a href="https://www.coursera.org/learn/competitive-data-science/">Coursera course</a>).

I wrote this notebook to support my own journey in learning Data Science, and I hope it will be useful to fellow learners too. Thank you!

<a id="toc"></a>
<h2>🗨️ List of contents:</h2>
<div style="background: #f9f9f9 none repeat scroll 0 0;border: 1px solid #aaa;display: table;font-size: 95%;margin-bottom: 1em;padding: 20px;width: 400px;">
<ul style="font-weight: 700;text-align: left;list-style: outside none none !important;">
    <li style="list-style: outside none none !important;"><a href="#common">Common Library and Utilities</a>
    <li style="list-style: outside none none !important;"><a href="#tod">Data Types & Preprocessing</a>
    <li style="list-style: outside none none !important;"><a href="#eda">Exploratory Data Analysis</a>
    <li style="list-style: outside none none !important;"><a href="#validation">Validation</a>
    <li style="list-style: outside none none !important;"><a href="#dl">Data Leakages</a>
</ul>
</div>

# Common Library and Utilities <a id="common"></a>

<a href="#toc"><img src= "https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Circle-icons-arrow-up.svg/1200px-Circle-icons-arrow-up.svg.png" style="width:20px;height:20px;float:left;margin-right:10px;" >  Back to the list of contents</a>

In [None]:
# Data wrangling
import pandas as pd
import numpy as np

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 100)

# OS
import os
import warnings
warnings.filterwarnings('ignore')

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from tqdm.notebook import tqdm # Progress bar

# Scalers
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

# Models
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn.linear_model import Perceptron
from sklearn import svm #support vector Machine
from sklearn.ensemble import RandomForestClassifier #Random Forest
from sklearn.neighbors import KNeighborsClassifier #KNN
from sklearn.naive_bayes import GaussianNB #Naive bayes
from sklearn.tree import DecisionTreeClassifier #Decision Tree
from sklearn.model_selection import train_test_split #training and testing data split
from sklearn import metrics #accuracy measure
from sklearn.metrics import confusion_matrix #for confusion matrix
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

# Cross-validation
from sklearn.model_selection import KFold #for K-fold cross validation
from sklearn.model_selection import cross_val_score #score evaluation
from sklearn.model_selection import cross_val_predict #prediction
from sklearn.model_selection import cross_validate

# GridSearchCV
from sklearn.model_selection import GridSearchCV

#Common Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process

#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics

# Boost
import lightgbm as lgb
import xgboost as xgb
import catboost as cat

# Deep learning
import tensorflow as tf
import keras
import torch

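In [None]:
import random
import time
from contextlib import contextmanager
from typing import Iterator


def set_seed(seed: int = 42) -> None:
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)                # type: ignore
    # Optional, for deterministic cuDNN behaviour (may slow training):
    #torch.backends.cudnn.deterministic = True  # type: ignore
    #torch.backends.cudnn.benchmark = False     # type: ignore


@contextmanager
def timer(name: str) -> Iterator[None]:
    """Timer util: print the elapsed time of the enclosed block."""
    t0 = time.time()
    print("[{}] start".format(name))
    yield
    print("[{}] done in {:.0f} s".format(name, time.time() - t0))

set_seed(42)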

# Data Types & Preprocessing <a id="tod"></a>

<a href="#toc"><img src= "https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Circle-icons-arrow-up.svg/1200px-Circle-icons-arrow-up.svg.png" style="width:20px;height:20px;float:left;margin-right:10px;" >  Back to the list of contents</a>

### Numeric features

In [None]:
# 1. Scaling
# MinMaxScaler : To [0,1]
from sklearn.preprocessing import MinMaxScaler

# StandardScaler : Mean=0, std=1
from sklearn.preprocessing import StandardScaler

# 2. Outliers
# Winsorization : Limits the effect of outliers by clipping a feature's values (e.g. to chosen percentiles).

# 3. Rank
from scipy.stats import rankdata

# 4. Transformation
# Log transform : np.log(1+x)
# Raising to the power < 1 : np.sqrt(x + 2/3)

# FEATURE GENERATION
# Ex : Generating decimal feature of a sale
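
The numeric transformations listed above can be sketched on a toy feature. This is only an illustration: the series `x`, the 1st/99th percentile bounds, and the `prices` column are assumptions made for the example, not values taken from the course.

```python
import numpy as np
import pandas as pd
from scipy.stats import rankdata

# Hypothetical toy feature with one obvious outlier
x = pd.Series([1.0, 2.0, 2.5, 3.0, 1000.0])

# Winsorization: clip values to the 1st and 99th percentiles (assumed bounds)
lower, upper = np.percentile(x, [1, 99])
x_clipped = x.clip(lower, upper)

# Rank transform: the outlier ends up one step away from its neighbour
x_rank = rankdata(x)

# Log and square-root transforms compress large values
x_log = np.log1p(x)          # same as np.log(1 + x)
x_sqrt = np.sqrt(x + 2 / 3)

# Feature generation: fractional (decimal) part of a price
prices = pd.Series([2.49, 7.00, 0.99])
frac = (prices % 1).round(2)
```

Rank, log, and square-root transforms mainly help non-tree models such as linear models, KNN, and neural networks; tree-based models are largely insensitive to monotonic rescaling.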

### Ordinal features
Example:
Ticket class, driver's license type, education

In [None]:
# Label encoding
# Alphabetical (sorted) : [S,C,Q]->[2,0,1]
from sklearn.preprocessing import LabelEncoder
LabelEncoder().fit_transform(['S', 'C', 'Q'])

# Order of appearance : [S,C,Q]->[0,1,2]
pd.factorize(['S', 'C', 'Q'])[0]

### Categorical features

In [None]:
# One-hot encoding
pd.get_dummies  # call as pd.get_dummies(df[cols])
from sklearn.preprocessing import OneHotEncoder

# Combine two or more categorical features into one feature
# Example: pclass + sex = pclass_sex
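
A minimal sketch of the `pclass + sex = pclass_sex` idea, assuming a small Titanic-like DataFrame (the rows below are made up for illustration). The two columns are concatenated as strings and the result is one-hot encoded.

```python
import pandas as pd

# Hypothetical Titanic-like columns
df = pd.DataFrame({
    "pclass": [3, 1, 3, 2],
    "sex": ["male", "female", "female", "male"],
})

# Interaction feature: concatenate the two categories as strings
df["pclass_sex"] = df["pclass"].astype(str) + "_" + df["sex"]

# One-hot encode the combined feature
onehot = pd.get_dummies(df["pclass_sex"], prefix="pclass_sex")
```

Such interaction features are mostly useful for linear models and KNN, which cannot learn the combination on their own.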

### Datetime features

In [None]:
# Format example: 25.01.2009
# Parse once with an explicit day-first format, then use the .dt accessor
df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y')
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day

### Additional external sources
* <a href="https://scikit-learn.org/stable/modules/preprocessing.html">Preprocessing in Sklearn</a> <br>
* <a href="https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/">Discover Feature Engineering</a>


# Exploratory Data Analysis (EDA) <a id="eda"></a>

<a href="#toc"><img src= "https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Circle-icons-arrow-up.svg/1200px-Circle-icons-arrow-up.svg.png" style="width:20px;height:20px;float:left;margin-right:10px;" >  Back to the list of contents</a>

### Transpose viewing

In [None]:
# Useful to show all the columns
train.head().T

### Finding out feature importance

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create a copy to work with
X = train.copy()

# Save and drop labels
y = train.y
X = X.drop('y', axis=1)

# fill NANs 
X = X.fillna(-999)

# Label-encode object columns (iterate over X, which no longer contains 'y')
for c in X.columns[X.dtypes == 'object']:
    X[c] = X[c].factorize()[0]
    
rf = RandomForestClassifier()
rf.fit(X,y)

In [None]:
plt.plot(rf.feature_importances_)
plt.xticks(np.arange(X.shape[1]), X.columns.tolist(), rotation=90);

### Feature encoding using factorize

In [None]:
train_enc = pd.DataFrame(index=train.index)

# tqdm is the progress bar imported from tqdm.notebook
for col in tqdm(train.columns):
    train_enc[col] = train[col].factorize()[0]

### Using mask in .loc

In [None]:
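# Assumed definition: number of unique values per column (NaN counted as a value)
nunique = train.nunique(dropna=False)

# Select columns whose share of unique values lies between 0.4 and 0.8
mask = (nunique.astype(float)/train.shape[0] < 0.8) & (nunique.astype(float)/train.shape[0] > 0.4)
train.loc[:25, mask]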

### Retrieving categorical & numerical columns

In [None]:
cat_cols = list(train.select_dtypes(include=['object']).columns)
num_cols = list(train.select_dtypes(exclude=['object']).columns)

### Functions

In [None]:
def autolabel(arrayA):
    ''' Label each colored square with the corresponding data value.
    The text is always drawn in white.
    '''
    arrayA = np.array(arrayA)
    for i in range(arrayA.shape[0]):
        for j in range(arrayA.shape[1]):
            plt.text(j, i, "%.2f" % arrayA[i, j], ha='center', va='bottom', color='w')

In [None]:
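# Y is assumed to be the binary target, aligned with the feature's index.
# `density=True` replaces the `normed=True` argument removed from recent matplotlib.
def hist_it(feat):
    ''' Make per-class histograms of a feature with unit-width bins
    '''
    plt.figure(figsize=(16,4))
    feat[Y==0].hist(bins=range(int(feat.min()),int(feat.max()+2)),density=True,alpha=0.8)
    feat[Y==1].hist(bins=range(int(feat.min()),int(feat.max()+2)),density=True,alpha=0.5)
    plt.ylim((0,1))
    
def hist_it1(feat):
    ''' Make per-class histograms of a feature with 100 bins
    '''
    plt.figure(figsize=(16,4))
    feat[Y==0].hist(bins=100,range=(feat.min(),feat.max()),density=True,alpha=0.5)
    feat[Y==1].hist(bins=100,range=(feat.min(),feat.max()),density=True,alpha=0.5)
    plt.ylim((0,1))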

In [None]:
def gt_matrix(feats,sz=16):
    '''Make a > (greater than) matrix to observe patterns in features
    '''
    a = []
    for i,c1 in enumerate(feats):
        b = [] 
        for j,c2 in enumerate(feats):
            mask = (~train[c1].isnull()) & (~train[c2].isnull())
            if i>=j:
                b.append((train.loc[mask,c1].values>=train.loc[mask,c2].values).mean())
            else:
                b.append((train.loc[mask,c1].values>train.loc[mask,c2].values).mean())

        a.append(b)

    plt.figure(figsize = (sz,sz))
    plt.imshow(a, interpolation = 'none')
    _ = plt.xticks(range(len(feats)),feats,rotation = 90)
    _ = plt.yticks(range(len(feats)),feats,rotation = 0)
    autolabel(a)

# Validation <a id="validation"></a>

<a href="#toc"><img src= "https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Circle-icons-arrow-up.svg/1200px-Circle-icons-arrow-up.svg.png" style="width:20px;height:20px;float:left;margin-right:10px;" >  Back to the list of contents</a>

### Common validation strategies
**Holdout scheme:**
1. Split train data into two parts: partA and partB.
2. Fit the model on partA, predict for partB.
3. Use the predictions for partB to estimate model quality. Choose the hyper-parameters that maximize quality on partB.

**K-Fold scheme:**
1. Split train data into K folds.
2. Iterate through each fold: retrain the model on all folds except the current one, then predict for the current fold.
3. Use the predictions to calculate quality on each fold. Choose the hyper-parameters that maximize quality across folds. You can also estimate the mean and variance of the loss, which helps in judging whether an improvement is significant.

**LOO (Leave-One-Out) scheme:**
1. Iterate over samples: retrain the model on all samples except current sample, predict for the current sample. You will need to retrain the model N times (if N is the number of samples in the dataset).
2. In the end you will get LOO predictions for every sample in the trainset and can calculate loss.

In [None]:
# Holdout validation : n-groups = 1
# Enough data, and scores and optimal parameters are similar across splits
from sklearn.model_selection import ShuffleSplit

# K-fold : n-groups = k
# Enough data, but scores and optimal parameters differ across splits
from sklearn.model_selection import KFold

# Leave-one-out : n-groups = len(train)
# Small amount of data
from sklearn.model_selection import LeaveOneOut
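
The three schemes can be compared side by side with `cross_val_score`. This is a sketch under assumptions: the synthetic dataset, the logistic regression model, the 25% holdout size, and `n_splits=5` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit, cross_val_score

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

schemes = {
    "holdout": ShuffleSplit(n_splits=1, test_size=0.25, random_state=0),
    "kfold": KFold(n_splits=5, shuffle=True, random_state=0),
    "loo": LeaveOneOut(),
}

results = {}
for name, cv in schemes.items():
    # One accuracy score per validation split
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = (len(scores), scores.mean(), scores.std())
    print(f"{name}: {len(scores)} split(s), mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The spread of the K-fold scores gives a rough sense of how large a score difference must be before it counts as a real improvement.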

**Stratification** : Preserve the same target distribution over different folds

Stratification is useful for: <br>
• Small datasets <br>
• Unbalanced datasets <br>
• Multiclass classification
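
As a small illustration of stratification, the sketch below uses `StratifiedKFold` on a made-up imbalanced target (90 negatives, 10 positives) and checks that every validation fold keeps the same share of positives.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy target: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Share of positives in each validation fold
fold_pos_rates = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]
print(fold_pos_rates)
```

With a plain `KFold`, some folds of such a dataset could contain very few positives or none at all, which makes the per-fold metric unstable.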

### Data splitting strategies
Validation should be set up to mimic the train/test split of a competition

1. Random, rowwise
2. Timewise (Ex: Predicting sales in a shop)
3. By ID
4. Combined
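
The time-wise and by-ID strategies map onto scikit-learn's `TimeSeriesSplit` and `GroupKFold`. The sketch below assumes rows already sorted by time and a hypothetical `groups` array of user IDs; both are illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

n = 12
X = np.arange(n).reshape(-1, 1)        # rows assumed to be sorted by time
groups = np.repeat(np.arange(4), 3)    # hypothetical user IDs, 3 rows each

# Time-wise: every validation fold lies strictly after its training fold
tss = TimeSeriesSplit(n_splits=3)
time_ok = all(tr.max() < va.min() for tr, va in tss.split(X))

# By ID: no ID appears on both sides of a split
gkf = GroupKFold(n_splits=4)
group_ok = all(
    set(groups[tr]).isdisjoint(groups[va])
    for tr, va in gkf.split(X, groups=groups)
)
print(time_ok, group_ok)
```
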

**Example:** <br>
• Train set: Ratio of Men >> Ratio of Women <br>
• Test set: Ratio of Women >> Ratio of Men

**What to do?** <br>
*Adjust/force the validation data so that it mimics the test set (Ratio of Women >> Ratio of Men)*

### Submission problems

On Kaggle, you can usually select two final submissions, which are scored against the private LB and determine the competitor's final position. A common practice is to select one submission with the **best validation score** and another that **scored best on the Public LB**.


**Causes of LB shuffle** <br>
• Randomness <br>
• Small amount of data <br>
• Different public/private distributions

### Additional external sources
* <a href="http://www.chioka.in/how-to-select-your-final-models-in-a-kaggle-competitio/"> Advice on validation in a competition </a>
* <a href="https://scikit-learn.org/stable/modules/cross_validation.html"> Validation in Sklearn </a>


# Data Leakages <a id="dl"></a>

<a href="#toc"><img src= "https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Circle-icons-arrow-up.svg/1200px-Circle-icons-arrow-up.svg.png" style="width:20px;height:20px;float:left;margin-right:10px;" >  Back to the list of contents</a>

The most common types of leaks:
1. Date / Time
2. Metadata
3. Information in IDs -> Might be a hash of something
4. Row order
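
One quick, rough check for a row-order leak is to correlate the row position with the target. The sketch below uses made-up data and a hypothetical helper `row_order_corr`; it is only one possible probe, and a low correlation does not rule out other kinds of leaks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical train set whose rows were saved sorted by target (a leak)
leaky = pd.DataFrame({"y": np.sort(rng.integers(0, 2, size=200))})

# The same target after shuffling the rows (no row-order leak)
clean = leaky.sample(frac=1.0, random_state=0).reset_index(drop=True)


def row_order_corr(df: pd.DataFrame, target: str = "y") -> float:
    """Absolute correlation between row position and the target."""
    position = np.arange(len(df))
    return float(abs(np.corrcoef(position, df[target])[0, 1]))


print(row_order_corr(leaky), row_order_corr(clean))
```
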

### Additional external source
* <a href="https://www.kaggle.com/olegtrott/the-perfect-score-script">The "Perfect Score" Script</a>

<span style='color:blue;font-size:18px' >Work in progress...</span>