## Contents
- Summary Statistics
- Data Cleaning
- Dataframe Operations
- Preprocessing
- EDA
  - Univariate Analysis
  - Bivariate Analysis
- Library Imports    
- ML Algos
    - Regression
    - Classification
    - Clustering
- Pipeline  
- Evaluation Metrics
- Neural Networks
    - Regression
    - Classification
- *GPU Support*

### Summary statistics
- df.head()
- df.info()
- Numerical column summary
```
df.describe().transpose()
```
- Categorical column summary
```
df.describe(include=['O']).transpose()
```
- df.shape
- Get columns with missing values
```
missing_value_df = round(df.isnull().sum() / len(df)  * 100,2)
missing_value_df[missing_value_df > 0].sort_values(ascending=False)
```
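
A minimal sketch of the missing-value summary on a toy frame. The column names and values are hypothetical and only illustrate the shape of the output; `isnull().mean()` is an equivalent shortcut for `isnull().sum() / len(df)`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) with gaps in two of its three columns
toy_df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", None, None, "Delhi"],
    "income": [50.0, 62.5, 48.0, 75.0],
})

# Share of missing values per column, in percent
missing_pct = (toy_df.isnull().mean() * 100).round(2)

# Keep only the columns that actually have gaps, worst first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```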

### Data Cleaning

#### 1. Duplicate rows
df[df.duplicated()]

#### 2. Drop columns where all values are the same
df = df.drop(columns=df.columns[df.nunique() <= 1])
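
Both cleaning steps can be tried end to end on a small frame. This is a sketch on hypothetical data, assuming that exact duplicate rows should be dropped and that a column with at most one distinct value carries no information.

```python
import pandas as pd

# Toy frame (hypothetical data): row 2 repeats row 0, and "country" never varies
toy_df = pd.DataFrame({
    "id": [1, 2, 1, 3],
    "plan": ["a", "b", "a", "c"],
    "country": ["IN", "IN", "IN", "IN"],
})

# Drop exact duplicate rows, keeping the first occurrence
deduped = toy_df.drop_duplicates().reset_index(drop=True)

# Columns with at most one distinct value
constant_cols = deduped.columns[deduped.nunique() <= 1]
cleaned = deduped.drop(columns=constant_cols)
print(cleaned)
```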


### Dataframe Ops

####  1. Drop columns
df.drop(columns=column_list)

####  2. Get columns whose names start with a prefix
```
fb_user_cols = [col for col in df.columns if col.startswith('fb_user')]
df[fb_user_cols]
```
####  3. Categorical columns
```
# get categorical columns
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)
```
####  4. Numerical columns
```
# get numerical columns
numeric_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

```
#### 5. Filter rows
```
# first 5 rows where arpu_3g_6 is missing, restricted to the related 3G columns
df.loc[df['arpu_3g_6'].isnull(), ['count_rech_3g_6', 'arpu_3g_6', 'monthly_3g_6', 'sachet_3g_6']].head(5)
```
#### 6. Row wise sum 
```
df['churn'] = df[['total_ic_mou_9', 'total_og_mou_9', 'vol_2g_mb_9', 'vol_3g_mb_9']].sum(axis = 1) == 0
```
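#### 7. Column wise sum 
```
# axis=0 sums down each column and returns one value per column,
# so keep it in its own Series instead of assigning it as a new column of df
col_totals = df[['total_ic_mou_9', 'total_og_mou_9', 'vol_2g_mb_9', 'vol_3g_mb_9']].sum(axis=0)
col_totals == 0  # True for columns whose total is 0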
```
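
Row-wise and column-wise sums differ in the shape of their result, which decides where the result can be stored. A small sketch on hypothetical usage data (the column names are made up for illustration):

```python
import pandas as pd

# Hypothetical usage data: one row per customer
usage = pd.DataFrame({
    "calls": [0, 5, 0],
    "data_mb": [0, 20, 3],
})

row_totals = usage.sum(axis=1)  # one value per row, indexed like the frame
col_totals = usage.sum(axis=0)  # one value per column, indexed by column name

# A per-row result aligns with the frame, so it can become a new column
usage["inactive"] = row_totals == 0
print(usage)
print(col_totals)
```

The per-column result has a different index (the column names), so it is kept as a separate Series rather than written back into the frame.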
#### 8. Value Counts %
```
df.churn.value_counts(normalize=True) * 100
```
#### 9. Dataframe get columns whose name contains the string
```
def get_col(df, col_str):
    '''
    returns the column names of df that contain the string col_str
    '''
    return np.array([col for col in df.columns if col_str in col])
```

### Preprocessing

- Missing Value Percentage
```
missing_value_df = round(df.isnull().sum() / len(df)  * 100,2)
missing_value_df[missing_value_df > 0].sort_values(ascending=False)
```
- Missing Value Imputation
```
import pandas as pd
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer(strategy='median')
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
```
- Fill null with 0
```
df[rech_cols] = df[rech_cols].fillna(0)
```
- Categorical columns

    **ordinal encoding**
```
def label_encode(val, mapping):
    return mapping[val]
```
```
mapping_utilities = {'ELO' : 1, 'NoSeWa' : 2, 'NoSewr': 3, 'AllPub' : 4}
df['Utilities'] = df['Utilities'].apply(lambda val: label_encode(val, mapping_utilities))
```

    *Same as above using sklearn*
```
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
label_X_train[good_label_cols] = enc.fit_transform(label_X_train[good_label_cols])
label_X_valid[good_label_cols] = enc.transform(label_X_valid[good_label_cols])
```

    *If the validation data contains categories that do not appear in the training data, the encoder raises an error. The snippet below finds the columns that can be encoded safely.*

```
# All categorical columns
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely ordinal encoded
good_label_cols = [col for col in object_cols if 
                   set(X_valid[col]).issubset(set(X_train[col]))]
        
# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
        
print('Categorical columns that will be ordinal encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)
```
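
An alternative to dropping the problematic columns is to let the encoder map unseen categories to a sentinel value. The sketch below assumes scikit-learn 0.24 or newer, where `OrdinalEncoder` accepts `handle_unknown='use_encoded_value'`; the tiny train/validation frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical split: "ELO" appears only in the validation data
train = pd.DataFrame({"Utilities": ["AllPub", "NoSewr", "AllPub"]})
valid = pd.DataFrame({"Utilities": ["AllPub", "ELO"]})

# Unseen categories are encoded as -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = enc.fit_transform(train)
valid_enc = enc.transform(valid)

print(enc.categories_)
print(valid_enc)
```

Note that this encoder orders categories alphabetically by default, so pass `categories=[...]` explicitly when the order carries meaning.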

   **nominal encoding**
    
```
cat_cols = ["MSZoning","Street","Alley"]
df_cat = df[cat_cols]
df_cat_dummies = pd.get_dummies(df_cat, drop_first=True)
df = df.drop(cat_cols, axis=1)
df = pd.concat([df, df_cat_dummies], axis=1)
```

*The above runs into problems if some categories are present in train but not in test, so use the approach below*
```
from sklearn.preprocessing import OneHotEncoder

# scikit-learn 1.2+ renames `sparse` to `sparse_output`; use sparse=False on older versions
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_X_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_X_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

OH_X_train.index = X_train.index
OH_X_valid.index = X_valid.index

num_X_train = X_train.drop(low_cardinality_cols + high_cardinality_cols, axis=1)
num_X_valid = X_valid.drop(low_cardinality_cols + high_cardinality_cols, axis=1)

OH_X_train = pd.concat([num_X_train, OH_X_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_X_valid], axis=1)
```
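
The difference between the two nominal-encoding approaches shows up as soon as the validation data contains a category the training data lacks. A minimal sketch on hypothetical data (the `"Dirt"` value is invented for illustration); `.toarray()` is used so the example does not depend on the `sparse`/`sparse_output` parameter name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
valid = pd.DataFrame({"Street": ["Pave", "Dirt"]})  # "Dirt" is unseen in train

# get_dummies builds columns from whatever values each frame happens to contain
train_dummies = pd.get_dummies(train)
valid_dummies = pd.get_dummies(valid)
same_columns = list(train_dummies.columns) == list(valid_dummies.columns)

# OneHotEncoder learns the categories once and reuses them;
# an unseen category becomes an all-zero row
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)
valid_oh = encoder.transform(valid).toarray()

print(same_columns)
print(valid_oh)
```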

### EDA

#### Univariate analysis

- Plots - distplot for numerical & countplot for categorical
```
def uni(col):
    '''
        distplot for numerical
        countplot for categorical
    '''
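    dtype = str(col.dtype)
    if 'int' in dtype or 'float' in dtype:
        # distplot is deprecated since seaborn 0.11; sns.histplot(col, kde=True) is the replacement
        sns.distplot(col)
    elif dtype in ('object', 'category', 'bool'):
        sns.countplot(x=col)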
```
- Sub plots
```
plt.figure(figsize=(16, 12))
uni_cols = list(get_col(high_value_cust_df, 'net'))
for i, col in enumerate(uni_cols):
    plt.subplot(3, 3, i + 1)  # 3x3 grid, so at most 9 columns
    uni(high_value_cust_df[col])
```

#### Bivariate analysis
- Categorical vs numerical relationship
```
sns.boxplot(data=capped_df, x='churn', y='aon')
```
- Correlations
```
corr = df.corr(numeric_only=True)  # numeric_only needs pandas >= 1.5
sns.heatmap(corr, annot=True, cmap="YlGnBu")
```

### Library Imports
- Pandas and numpy
```
import pandas as pd
import numpy as np
```
- Plots
```
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
```
- Scaling
```
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
```
- Split
```
from sklearn.model_selection import train_test_split
```
- Pipeline
```
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
```
- Linear Regression
```
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
```
- Logistic Regression
```
from sklearn.linear_model import LogisticRegression
```
- Decision Trees
- RandomForestRegressor
```
from sklearn.ensemble import RandomForestRegressor
```
- Random Forest Classifier
```
from sklearn.ensemble import RandomForestClassifier
```
- XGBRegressor
```
from xgboost import XGBRegressor
```
- PCA
```
from sklearn.decomposition import PCA
```
- Clustering
- Cross Validation
```
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
```
- Evaluation Metrics
    - Regression
    ```
    from sklearn import metrics
    from sklearn.metrics import mean_squared_error, r2_score
    ```
    - Classification
    ```
    from sklearn.metrics import classification_report
    ```
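
A short sketch of how the imported metrics are typically called. The target and prediction values are made-up numbers, used only to show the call signatures.

```python
from sklearn.metrics import classification_report, mean_squared_error, r2_score

# Regression: hypothetical true values and predictions
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
rmse = mse ** 0.5  # root mean squared error, in the units of the target
r2 = r2_score(y_true, y_pred)
print(f"MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")

# Classification: hypothetical labels and predictions
labels_true = [0, 1, 1, 0, 1]
labels_pred = [0, 1, 0, 0, 1]
report = classification_report(labels_true, labels_pred)
print(report)
```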

### ML Algos

#### 1. Regression
- RandomForestRegressor
```
RandomForestRegressor(random_state=42, n_jobs=-1, max_depth=5, min_samples_leaf=10,
                      n_estimators=100)
```
- XGBRegressor
```
my_model_2 = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=-1) 
my_model_2.fit(X_train, y_train)
predictions_2 = my_model_2.predict(X_valid)
```
#### 2. Classification
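
No classifier is listed here yet. As a placeholder, the sketch below fits the `LogisticRegression` model imported earlier on synthetic data; the dataset, split sizes and `max_iter` value are assumptions chosen only to make the example run.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption: stands in for a real dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

valid_accuracy = clf.score(X_valid, y_valid)  # mean accuracy on the held-out split
valid_probs = clf.predict_proba(X_valid)      # one probability column per class
print(f"validation accuracy: {valid_accuracy:.3f}")
```
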
#### 3. Clustering
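
No clustering model is listed here yet. The sketch below uses `KMeans` on synthetic blobs as one possible starting point; the data, the number of clusters and the scaling step are assumptions, not part of the original notes.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic 2-D data with three groups (assumption: stands in for real features)
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# K-means is distance based, so features are scaled first
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

print(kmeans.cluster_centers_)
print(kmeans.inertia_)  # within-cluster sum of squares, useful for elbow plots
```
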


### Pipeline
- makes preprocessing and modelling steps easier to apply consistently to train and test data
- reduces human error, such as forgetting a preprocessing step on the validation or test set

```
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='median')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0)

my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
preds_test = my_pipeline.predict(X_test)
```
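
The pipeline above relies on `numerical_cols`, `categorical_cols` and the train/validation frames being defined elsewhere. A self-contained sketch with hypothetical toy data shows the same structure running end to end, including a missing value in each column and an unseen category in the validation frame.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one numerical and one categorical feature
X_train = pd.DataFrame({
    "area": [50.0, np.nan, 80.0, 120.0, 65.0, 95.0],
    "zone": ["RL", "RM", np.nan, "RL", "RM", "RL"],
})
y_train = pd.Series([100.0, 90.0, 150.0, 210.0, 120.0, 170.0])
X_valid = pd.DataFrame({"area": [70.0, np.nan], "zone": ["RL", "FV"]})

numerical_cols = ["area"]
categorical_cols = ["zone"]

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), numerical_cols),
    ("cat", categorical_transformer, categorical_cols),
])
toy_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(n_estimators=100, random_state=0)),
])

# Imputation and encoding are fitted on the training data only,
# then reused unchanged when predicting on the validation data
toy_pipeline.fit(X_train, y_train)
preds = toy_pipeline.predict(X_valid)
print(preds)
```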





### Cross-Validation
- K Fold Cross Validation with cross_val_score
```
from sklearn.model_selection import cross_val_score
# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')
print("MAE scores:\n", scores)
```

- GridSearchCV for hyperparameter tuning
```
rf_2 = RandomForestClassifier(random_state=42, n_jobs=2)
params = {
    'max_depth': [10,20,40, 50],
    'min_samples_leaf': [15,20,50,100,150],
    'n_estimators': [100, 125, 150]
}
folds = StratifiedKFold(n_splits = 4, shuffle = True, random_state = 4)
grid_search_2 = GridSearchCV(estimator=rf_2,
                           param_grid=params,
                           cv = folds,
                           n_jobs=2, verbose=1, scoring="roc_auc")
grid_search_2.fit(X_train, y_train)
rf_best_interpret = grid_search_2.best_estimator_
pred_probs_test = rf_best_interpret.predict_proba(X_test)
print(classification_report(y_test, rf_best_interpret.predict(X_test)))
```
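
After fitting, the search object exposes the winning configuration and the per-candidate results. The sketch below uses synthetic data and a deliberately small grid (both assumptions) so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data and a small grid, chosen only to keep the example fast
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
param_grid = {"max_depth": [3, 5], "n_estimators": [20, 40]}
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=folds,
    scoring="roc_auc",
)
search.fit(X, y)

print(search.best_params_)            # best hyperparameter combination
print(round(search.best_score_, 3))   # mean cross-validated ROC AUC of that combination
n_candidates = len(search.cv_results_["params"])
```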

### Neural Networks
- ANN
    - Regression
    - Classification
- CNN
    - Read a digital Image
    ```
    from keras.datasets import mnist
    import numpy as np
    import cv2
    import matplotlib.pyplot as plt
    
    (x_train, _), (x_test, _) = mnist.load_data()
    print("The shape of x_train dataset is", x_train.shape)
    
    # selecting the first sample
    x = x_train[0]
    print("The dimension of x is 2D matrix as ", x.shape)
    # Resizing the image
    x = cv2.resize(x, (18,18))
    
    plt.imshow(x, cmap='gray')
    
    # Reading color image
    cat = cv2.imread('cat.jpg')
    plt.imshow(cv2.cvtColor(cat, cv2.COLOR_BGR2RGB))
    ```
    - CIFAR Dataset (Batch Normalization + DropOut)
    ```
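    import tensorflow as tf
    from tensorflow.keras.datasets import cifar10
    from tensorflow.keras.utils import to_categorical
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, BatchNormalization, Conv2D, MaxPooling2D

    # Load CIFAR-10 (10 classes) and one-hot encode the labels for categorical_crossentropy
    (x_train, y_train), (x_test, y_test) = cifar10.load_data()
    num_classes = 10
    y_train = to_categorical(y_train, num_classes)
    y_test = to_categorical(y_test, num_classes)

    # Assumed training settings; adjust to your hardware and time budget
    batch_size = 64
    epochs = 10
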
    # Build Model
    model = Sequential()
    model.add(Conv2D(32, (3, 3), padding='same',
                 input_shape=x_train.shape[1:]))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Conv2D(32, (3, 3)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))

    model.add(Conv2D(64, (3, 3), padding='same'))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Conv2D(64, (3, 3)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))

    model.add(Flatten())
    model.add(Dense(512))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes))
    model.add(Activation('softmax'))
        
    # summary of the model
    model.summary()

    # compile
    model.compile(loss='categorical_crossentropy',
                  optimizer='sgd',
                  metrics=['accuracy'])

    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')

    # Scale pixel values to the range [0, 1]
    x_train /= 255
    x_test /= 255

    # Training the model
    model.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(x_test, y_test),
              shuffle=True)
                  
    ```
    - Transfer Learning
    ```
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers, optimizers
    from tensorflow.keras.layers import Input, Add, Dropout, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D, AveragePooling2D, MaxPooling2D, GlobalAveragePooling2D
    from tensorflow.keras.models import Model, load_model
    from tensorflow.keras.preprocessing import image
    from tensorflow.keras.utils import plot_model
    from tensorflow.keras.initializers import glorot_uniform
    from tensorflow.keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array
    from tensorflow.keras.applications import ResNet50
    # ResNet-specific preprocessing (replaces the generic imagenet_utils import it used to shadow)
    from tensorflow.keras.applications.resnet50 import preprocess_input
    ```
        - Method 1 (use the pretrained model as is and replace only the last few layers)
        ```
        base_model = ResNet50(weights='imagenet', include_top=False)
        # ResNet is used only for feature extraction and its weights are not adjusted,
        # so freeze every layer in the base model
        for layer in base_model.layers:
            layer.trainable = False

        # Get base model output
        base_model_output = base_model.output

        # Add our own classification head
        x = GlobalAveragePooling2D()(base_model_output)
        # Fully connected layer
        x = Dense(512, activation='relu')(x)
        x = Dense(num_classes, activation='softmax', name='fcnew')(x)

        model = Model(inputs=base_model.input, outputs=x)
        model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

        image_size = 224
        batch_size = 64

        train_data_gen = ImageDataGenerator(preprocessing_function=preprocess_input,
            shear_range=0.2, zoom_range=0.2, horizontal_flip=True)
        valid_data_gen = ImageDataGenerator(preprocessing_function=preprocess_input)
        train_generator = train_data_gen.flow_from_directory(train_dir, (image_size, image_size), batch_size=batch_size, class_mode='categorical')
        valid_generator = valid_data_gen.flow_from_directory(test_dir, (image_size, image_size), batch_size=batch_size, class_mode='categorical')

        model.fit(
            train_generator,
            steps_per_epoch=train_generator.n // batch_size,
            validation_data=valid_generator,
            validation_steps=valid_generator.n // batch_size,
            epochs=epochs)
        ```
        - Method 2 (freeze the first 140 layers and retrain the rest; often better performance)
        ```
        epochs = 10
        split_at = 140

        # freeze the first 140 layers and train the layers after them
        for layer in model.layers[:split_at]:
            layer.trainable = False
        for layer in model.layers[split_at:]:
            layer.trainable = True

        # Choose a lower learning rate for fine-tuning: generally 10-1000 times
        # lower than the normal learning rate when the initial layers are fine-tuned.
        # Older Keras versions spell the argument `lr`; the original decay=1e-6 is
        # omitted because recent Keras optimizers no longer accept `decay`.
        sgd = optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
        model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

        # model.fit accepts generators directly; fit_generator is deprecated
        model.fit(
            train_generator,
            steps_per_epoch=train_generator.n // batch_size,
            validation_data=valid_generator,
            validation_steps=valid_generator.n // batch_size,
            epochs=epochs,
            verbose=1)
        ```