## Contents
- Summary Statistics
- Data Cleaning
- Dataframe Operations
- Preprocessing
- EDA
  - Univariate Analysis
  - Bivariate Analysis
- ML Algos
  - Regression
  - Classification
  - Clustering
- Evaluation Metrics
- Neural Networks
  - Regression
  - Classification
- Library Imports
- *GPU Support*

### Summary Statistics
- `df.head()`
- `df.info()`
- Numerical column summary
```
df.describe().transpose()
```
- Categorical column summary
```
df.describe(include=['O']).transpose()
```
- `df.shape`
- Get columns with missing values
```
missing_value_df = round(df.isnull().sum() / len(df)  * 100,2)
missing_value_df[missing_value_df > 0].sort_values(ascending=False)
```
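The missing-value check above can also be wrapped in a small helper that reports counts alongside percentages. This is a minimal sketch; `missing_summary` and the toy frame are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    n_missing = df.isnull().sum()
    pct_missing = (df.isnull().mean() * 100).round(2)
    summary = pd.DataFrame({"n_missing": n_missing, "pct_missing": pct_missing})
    # Keep only the columns that actually have gaps
    return summary[summary["n_missing"] > 0].sort_values("pct_missing", ascending=False)

# Toy data, purely illustrative
toy_df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
missing_report = missing_summary(toy_df)
print(missing_report)
```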

### Data Cleaning

#### 1. Duplicate rows
`df[df.duplicated()]`
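Once duplicates are located, they usually need to be counted and removed. A minimal sketch on a toy frame (the data is made up for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({"id": [1, 2, 2, 3], "plan": ["a", "b", "b", "c"]})

# duplicated() marks the second and later copies of a row by default
n_duplicates = toy_df.duplicated().sum()

# keep="first" retains one copy of each duplicated row
deduped_df = toy_df.drop_duplicates(keep="first").reset_index(drop=True)
print(n_duplicates, len(deduped_df))
```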

#### 2. Drop columns where all values are the same
`df.drop(columns=df.columns[df.nunique() <= 1])`
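A short sketch of the constant-column check on toy data (column names are illustrative). Note that `nunique()` ignores missing values by default, so a column that is entirely null is also caught by the `<= 1` test.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "circle_id": [109, 109, 109],          # same value everywhere
    "arpu": [10.5, 20.0, 31.2],
    "all_missing": [None, None, None],     # 0 unique non-null values
})

constant_cols = toy_df.columns[toy_df.nunique() <= 1].tolist()
reduced_df = toy_df.drop(columns=constant_cols)
print(constant_cols)
```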


### Dataframe Operations

#### 1. Drop columns
`df.drop(columns=column_list)`

#### 2. Get columns whose names start with a prefix
```
fb_user_cols = [col for col in df.columns if col.startswith('fb_user')]
df[fb_user_cols]
```
#### 3. Categorical columns
```
# get categorical (object dtype) columns
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)
```
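#### 4. Filter rows and select columns
```
# first 5 rows where arpu_3g_6 is null, restricted to the listed columns
cols = ['count_rech_3g_6', 'arpu_3g_6', 'monthly_3g_6', 'sachet_3g_6']
df.loc[df['arpu_3g_6'].isnull(), cols].head()
```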
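#### 5. Row wise sum
```
# axis=1 sums across columns, giving one value per row
df['churn'] = df[['total_ic_mou_9', 'total_og_mou_9', 'vol_2g_mb_9', 'vol_3g_mb_9']].sum(axis=1) == 0
```
#### 6. Column wise sum
```
# axis=0 sums down the rows, giving one value per column (indexed by column name),
# so the result is kept as its own Series rather than assigned as a new column
col_totals = df[['total_ic_mou_9', 'total_og_mou_9', 'vol_2g_mb_9', 'vol_3g_mb_9']].sum(axis=0)
```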
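The difference between the two axes is easiest to see from the index of the result. A minimal sketch on a toy frame (names are illustrative):

```python
import pandas as pd

usage_df = pd.DataFrame({
    "calls": [0, 5, 0],
    "data_mb": [0, 0, 12],
})

row_totals = usage_df.sum(axis=1)   # one value per row, index matches usage_df.index
col_totals = usage_df.sum(axis=0)   # one value per column, index is the column names

# A row-wise result aligns with the frame, so it can become a new column
usage_df["inactive"] = row_totals == 0
print(usage_df)
```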

#### 7. Value Counts %
```
(df.churn.value_counts() / len(df) * 100).sort_values(ascending=False)
```
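The same percentages can be obtained with `normalize=True`. One caveat: `value_counts` drops nulls by default, so the proportions are relative to the non-null rows, whereas dividing by `len(df)` counts null rows in the denominator. A sketch with toy data:

```python
import pandas as pd

churn = pd.Series([0, 0, 0, 1], name="churn")

# normalize=True returns proportions; results come back sorted in descending order
churn_pct = (churn.value_counts(normalize=True) * 100).round(2)
print(churn_pct)
```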

#### 8. Get columns whose names contain a string
```
def get_col(df, col_str):
    '''
    returns the column names of df that contain the string col_str
    '''
    return np.array([col for col in df.columns if col_str in col])
```

### Preprocessing

- Missing Value Percentage
```
missing_value_df = round(df.isnull().sum() / len(df)  * 100,2)
missing_value_df[missing_value_df > 0].sort_values(ascending=False)
```
- Missing Value Imputation
```
import pandas as pd
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer(strategy='median')
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
```
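The column names can also be restored in the `pd.DataFrame` call itself, and passing the original index keeps the imputed rows aligned with the source frame. A minimal sketch on toy data (values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X_train = pd.DataFrame({"age": [20.0, np.nan, 40.0], "income": [1.0, 2.0, np.nan]},
                       index=[10, 11, 12])
X_valid = pd.DataFrame({"age": [np.nan], "income": [5.0]}, index=[20])

imputer = SimpleImputer(strategy="median")
# Fit on training data only, then reuse the learned medians on validation data
imputed_X_train = pd.DataFrame(imputer.fit_transform(X_train),
                               columns=X_train.columns, index=X_train.index)
imputed_X_valid = pd.DataFrame(imputer.transform(X_valid),
                               columns=X_valid.columns, index=X_valid.index)
print(imputed_X_valid)
```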
- Fill null with 0
```
df[rech_cols] = df[rech_cols].fillna(0)
```
- Categorical columns

    **ordinal encoding**
```
def label_encode(val, mapping):
    return mapping[val]
```
```
mapping_utilities = {'ELO' : 1, 'NoSeWa' : 2, 'NoSewr': 3, 'AllPub' : 4}
df['Utilities'] = df['Utilities'].apply(lambda val: label_encode(val, mapping_utilities))
```

    *Same as above using sklearn*
```
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
label_X_train[good_label_cols] = enc.fit_transform(label_X_train[good_label_cols])
label_X_valid[good_label_cols] = enc.transform(label_X_valid[good_label_cols])
```

    *If the validation data contains values that do not appear in the training data, the encoder raises an error. The snippet below finds the columns that can be encoded safely.*

```
# All categorical columns
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely ordinal encoded
good_label_cols = [col for col in object_cols if 
                   set(X_valid[col]).issubset(set(X_train[col]))]
        
# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
        
print('Categorical columns that will be ordinal encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)
```
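An alternative to dropping the problematic columns, assuming scikit-learn 0.24 or later: `OrdinalEncoder` can map unseen categories to a sentinel value instead of raising. The choice of `-1` as the sentinel and the toy data are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X_train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
X_valid = pd.DataFrame({"Street": ["Grvl", "Dirt"]})  # "Dirt" never appears in training

# Unseen categories are encoded as -1 rather than causing an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(X_train[["Street"]])
encoded_valid = encoder.transform(X_valid[["Street"]])
print(encoded_valid)
```

Whether a sentinel code is acceptable depends on the model; tree-based models usually tolerate it, while linear models may read it as an ordered value.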

    **nominal encoding**

```
cat_cols = ["MSZoning","Street","Alley"]
df_cat = df[cat_cols]
df_cat_dummies = pd.get_dummies(df_cat, drop_first=True)
df = df.drop(cat_cols, axis=1)
df = pd.concat([df, df_cat_dummies], axis=1)
```

*The approach above runs into problems if the train and test sets do not contain the same categories, so use the encoder below instead*
```
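from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' encodes unseen categories as all zeros.
# sparse_output requires scikit-learn >= 1.2; older versions use sparse=False instead.
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_X_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_X_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed the index; put it back
OH_X_train.index = X_train.index
OH_X_valid.index = X_valid.index

# Remove the categorical columns (they are replaced by the one-hot columns)
num_X_train = X_train.drop(low_cardinality_cols + high_cardinality_cols, axis=1)
num_X_valid = X_valid.drop(low_cardinality_cols + high_cardinality_cols, axis=1)

OH_X_train = pd.concat([num_X_train, OH_X_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_X_valid], axis=1)

# Recent scikit-learn versions expect all column names to be strings
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)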
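If staying with `pd.get_dummies` is preferred, one possible workaround is to reindex the validation dummies to the training columns. This is a pandas-only sketch, not the encoder approach above; the toy data is made up.

```python
import pandas as pd

X_train = pd.DataFrame({"MSZoning": ["RL", "RM", "FV"]})
X_valid = pd.DataFrame({"MSZoning": ["RL", "C"]})  # "C" is unseen; "RM" and "FV" are absent

train_dummies = pd.get_dummies(X_train, columns=["MSZoning"])
valid_dummies = pd.get_dummies(X_valid, columns=["MSZoning"])

# Force validation columns to match training: unseen categories are dropped,
# categories missing from validation become all-zero columns
valid_dummies = valid_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(valid_dummies)
```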

### EDA

#### Univariate analysis

- Plots - distplot for numerical & countplot for categorical
```
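def uni(col):
    '''
        distplot for numerical
        countplot for categorical
    '''
    dtype = str(col.dtype)
    if 'int' in dtype or 'float' in dtype:
        # distplot is deprecated since seaborn 0.11; sns.histplot(col, kde=True) is the replacement
        sns.distplot(col)
    elif dtype == 'object':
        sns.countplot(x=col)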
- Sub plots
```
plt.figure(figsize=(16, 12))
uni_cols = list(get_col(high_value_cust_df, 'net'))
# a 3x3 grid holds at most 9 plots
for i, col in enumerate(uni_cols, start=1):
    plt.subplot(3, 3, i)
    uni(high_value_cust_df[col])

#### Bivariate analysis
- Categorical vs numerical relationship
```
sns.boxplot(data=capped_df, x='churn', y='aon')
```
- Correlations
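```
# numeric_only=True (pandas >= 1.5) skips non-numeric columns, which otherwise raise in pandas >= 2.0
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="YlGnBu")
```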
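When the heatmap is too dense to read, the strongest pairs can be listed directly. A minimal sketch on toy data: it takes absolute correlations, keeps the upper triangle so each pair appears once, and sorts.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],      # perfectly correlated with x
    "z": [5, 3, 4, 1, 2],
    "label": list("abcde"),     # non-numeric, excluded below
})

corr = toy_df.select_dtypes(include="number").corr().abs()
# Mask the diagonal and lower triangle so each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
top_pairs = upper.stack().dropna().sort_values(ascending=False)
print(top_pairs)
```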

### Library Imports
- Pandas and numpy
    ```
    import pandas as pd
    import numpy as np
    ```
- Plots
    ```
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    ```
- Scaling
    ```
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import MinMaxScaler
    ```
- Split
    ```
    from sklearn.model_selection import train_test_split
    ```
- Linear Regression
    ```
    from sklearn.linear_model import LinearRegression
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    ```
- Logistic Regression
    ```
    from sklearn.linear_model import LogisticRegression
    ```
- Decision Trees
- RandomForestRegressor
    ```
    from sklearn.ensemble import RandomForestRegressor
    ```
- Random Forest Classifier
    ```
    from sklearn.ensemble import RandomForestClassifier
    ```
- PCA
    ```
    from sklearn.decomposition import PCA
    ```
- Clustering
- Cross Validation
    ```
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import StratifiedKFold
    ```
- Evaluation Metrics
    - Regression
        ```
        from sklearn import metrics
        from sklearn.metrics import mean_squared_error, r2_score
        ```
    - Classification
        ```
        from sklearn.metrics import classification_report
        ```

### ML Algos

#### 1. Regression
- RandomForestRegressor
    ```
    RandomForestRegressor(random_state=42, n_jobs=-1, max_depth=5, min_samples_leaf=10,
                          n_estimators=100)
    ```
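A sketch of how the regressor above might be fitted and scored end to end. The synthetic data, the 80/20 split and the choice of RMSE and R² as metrics are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y depends linearly on the first feature, plus a little noise
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(random_state=42, n_jobs=-1, max_depth=5,
                              min_samples_leaf=10, n_estimators=100)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}, R2: {r2:.3f}")
```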
#### 2. Classification
#### 3. Clustering

