# Data Science ODL Project: Assessment 2

ID:

## Case study

Refer to the brief

## 1. Aims, objectives and plan (4 marks)

### a) Aims and objectives
A company has recorded various attributes of the steel it anneals, along with the kind of annealing that was previously performed on each instance.
The company wants to use these attributes to predict the type of annealing that should be performed for a new instance of steel attributes.
The aim is to develop an unbiased model that correctly predicts the annealing class 98% of the time, regardless of the class. In other words, if there are 100 instances of each class, the model
should correctly detect at least 98 instances of each class. This means the client wants a model with a very high true positive rate (recall) for every class.
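
As a rough illustration of how the 98% per-class target could be checked, the sketch below computes recall separately for each class with scikit-learn. The labels are hypothetical values made up for the example, and the names `TARGET_RECALL` and `meets_target` are assumptions, not part of the project code.

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical labels for illustration only (not the project's data).
y_true = np.array(["2", "2", "3", "3", "3", "3", "5", "5", "U", "U"])
y_pred = np.array(["2", "3", "3", "3", "3", "3", "5", "5", "U", "U"])

labels = ["2", "3", "5", "U"]
# average=None returns one recall (true positive rate) per class, in `labels` order.
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)

TARGET_RECALL = 0.98  # the per-class target stated in the aims
for label, recall in zip(labels, per_class_recall):
    status = "meets" if recall >= TARGET_RECALL else "misses"
    print(f"class {label}: recall = {recall:.2f} ({status} the target)")

# The aim is only satisfied if every class reaches the target.
meets_target = bool((per_class_recall >= TARGET_RECALL).all())
```

Checking the minimum over classes, rather than overall accuracy, keeps the dominant class from hiding poor recall on the rare ones.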

###  b) Plan
Please demonstrate how you have conducted the project with a simple Gantt chart.

## 2. Understanding the case study (4 marks)

###  Case study analysis
State the key points that you found in the case and how you intend to deal with them appropriately to address the client's needs. (You can include more than four points.)


1. Overview of the Data
- The data set is medium-sized, with 798 training examples and 100 examples in the test set.
- There are 38 attributes, of which 6 are real-valued, 3 are ordinal and 29 are categorical.
- The data documentation states that '-' represents not_applicable values and '?' represents missing values.
- Of the 29 categorical variables, 19 have binary values.
- There are 6 documented annealing (target) types: '1', '2', '3', '4', '5' and 'U'.

2. Class Imbalance
    - The target distribution is heavily imbalanced.
        - Class '3' dominates with 76% of the instances.
        - Class '2' is present in 11.4% of the instances.
        - Class '5' is present in 7% of the instances.
        - Class 'U' is present in 4% of the instances.
        - Class '1' is present in 1% of the instances (only 8 examples).
        - Class '4' is not present in the data.
        - We will use the synthetic minority over-sampling technique (SMOTE) on the non-dominant classes.
            - N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
        - Class '1' makes up only 1% of the training data and does not appear in the test set.
            - We will therefore not predict this class.
            - With only 8 points, performance metrics for this class would have little to no significance.
        - Class '4' has no instances in either the training or the test set, so it is implicitly ignored.
        - Both of these classes will require more data.
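
To make the over-sampling idea concrete, here is a simplified NumPy sketch of the interpolation step at the heart of SMOTE. It is an illustration under stated assumptions, not the reference algorithm or the code used in this project: the function name `smote_like_oversample` is hypothetical, it handles numeric features only, and categorical attributes would first need an encoded representation or a categorical-aware variant. In practice a maintained implementation, such as the one in the imbalanced-learn package, would likely be preferable.

```python
import numpy as np


def smote_like_oversample(X_minority, n_synthetic, k=3, seed=0):
    """Create synthetic points by interpolating between a minority sample
    and one of its k nearest minority-class neighbours (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point and one of its neighbours for each synthetic sample.
    base_idx = rng.integers(0, n, size=n_synthetic)
    neigh_idx = neighbours[base_idx, rng.integers(0, k, size=n_synthetic)]
    gap = rng.random((n_synthetic, 1))  # interpolation factor in [0, 1)
    return X_minority[base_idx] + gap * (X_minority[neigh_idx] - X_minority[base_idx])


# Four hypothetical minority-class points on the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like_oversample(X_min, n_synthetic=6)
print(X_new.shape)
```

Each synthetic point lies on the line segment between two real minority points, so the new samples stay inside the region already occupied by that class.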


3. Missing Values
    - 9 attributes in the training set have <b> NO MISSING VALUES </b>.
    - Continuous attributes [carbon, hardness, strength, thick, width, len] have no missing values in the training set.
        - However, our pipeline will still include an "impute with mean" step for these attributes to handle missing values during inference.
    - Of the 38 attributes, 29 have missing values, all of which are categorical or ordinal.
    - The amount of missingness ranges from 8% to 100%.
    - The data is also skewed in terms of missingness: only 4 variables have moderate missingness,
        - ['steel', 'surface_quality', 'condition', 'formability'] with [8%, 27%, 33%, 35.4%] missingness, respectively.
        - The remaining attributes with missing values have a median missingness of 98% and a minimum of 76%.
    - To deal with this missingness, we ran several experiments in the background.
        - Experiment 1
            - We drop all attributes/columns with more than 35% missingness.
            - This leaves us with only 13 attributes, which is 34% of the original number of attributes.
            - The remaining missing values are imputed with the mode of the training data, as all affected attributes are categorical/ordinal.
        - Experiment 2
            - This experiment was inspired by 2 facts:
                1. In experiment 1 we dropped 25 attributes, 66% of the total, which is far too many.
                    - There is a high chance that some of these attributes have high discriminatory power with respect to the target.
                2. As mentioned above in the overview of the data:
                    - '-' represents not_applicable values.
                    - Of the 29 categorical variables, 19 have binary values.
                    - Except for "shape", the other 18 binary attributes have a very high proportion of missing values.
                    - Combining these facts, it makes sense to impute missing values with "not_applicable".
            - However, we still drop attributes with more than 99% missing values, which amounts to only 10 attributes.
            - We continue to impute ['steel', 'surface_quality', 'condition', 'formability'] with the training data's mode in this experiment.
        - Experiment 3
            - Same as experiment 2, but we also impute ['steel', 'surface_quality', 'condition', 'formability'] with "not_applicable".
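
The difference between the imputation strategies can be sketched on a toy frame. The values and the column name `sparse_attr` are hypothetical; only the 35% threshold and the "steel" attribute come from the text above.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one mildly and one heavily missing column.
toy = pd.DataFrame({
    "steel": ["A", "A", np.nan, "R", "A"],
    "sparse_attr": [np.nan, np.nan, np.nan, "Y", np.nan],
})

# Percentage of missing values per column.
missing_pct = toy.isnull().mean().mul(100)

# Experiment 1 style: drop columns above the 35% threshold, mode-impute the rest.
kept = toy.loc[:, missing_pct <= 35.0]
exp1 = kept.fillna(kept.mode().iloc[0])

# Experiment 3 style: treat every missing value as its own "NA" category.
exp3 = toy.fillna("NA")

print(missing_pct.to_dict())
print(exp1)
print(exp3)
```

The first strategy discards the sparse column entirely, while the second keeps it and lets the model decide whether "value present versus absent" carries any signal.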
      
    
4. Since the client is interested in a high true positive rate, we perform the grid search using the F_beta score with more weight on recall, setting beta to 1.28. According to the scikit-learn documentation, a beta value greater than 1.0 favours recall over precision.
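
For reference, the F_beta score combines precision $P$ and recall $R$ as $F_\beta = (1+\beta^2)\,PR / (\beta^2 P + R)$. The sketch below, on hypothetical binary labels, computes this by hand for beta = 1.28 and compares it with scikit-learn's `fbeta_score`; the variable names are illustrative only.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical binary labels: every positive is found, at the cost of two false alarms.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0]

beta = 1.28
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# F_beta written out from its definition.
manual_f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

f1 = fbeta_score(y_true, y_pred, beta=1.0)
f_beta = fbeta_score(y_true, y_pred, beta=beta)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"F1={f1:.3f} F_beta={f_beta:.3f} manual={manual_f_beta:.3f}")
```

Because recall exceeds precision in this example, weighting recall more heavily (beta > 1) gives a higher score than plain F1.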



In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from statistics import variance, mean
from sklearn.ensemble import RandomForestClassifier
# Scoring helpers used by the grid search and the performance metric cells
from sklearn.metrics import check_scoring, classification_report, fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, cross_validate


## 3. Pre-processing applied (20 marks)
Enter the code in the cells below to execute each of the stated sub-tasks. 


#### Since the algorithms used in sklearn provide the functionality to internally convert string based categorical target to appropriate label encoding, we skip this step
#### Instead we present here the code to read the data file and get the train, val and test set.

In [None]:

def get_train_val_data():

    df = pd.read_csv("dataset/anneal.data")

    # "?" represents a missing value, so replace it with NaN.
    df.replace("?", np.nan, inplace=True)

    # Only 8 data points have target "1", so our metrics
    # would not be reliable for this class; drop those rows.
    df = df[df.target != "1"].copy()

    # Some values in this column are stored as strings although they are numeric.
    df.enamelability = df.enamelability.astype(float)

    df_y = df.target
    df_X = df.drop(labels=["target"], axis=1)

    # Reorganise the data so that the continuous columns come last.
    numerical_features = ["carbon", "hardness", "strength", "thick", "width", "len"]
    all_categorical_features = list(set(df_X.columns.to_list()) - set(numerical_features))
    df_X = pd.concat([df_X[all_categorical_features], df_X[numerical_features]], axis=1)

    return df_X, df_y

def get_test_data():
    
    df_test = pd.read_csv("dataset/anneal.test")
    df_test.replace("?", np.nan, inplace=True)
    
    df_test.enamelability = df_test.enamelability.astype(float)


    y_test = df_test.target
    X_test = df_test.drop(labels=["target"], axis=1)

    return X_test, y_test

###  b) Removing synonymous and noisy attributes if necessary 

##### All the attributes to remove will be added to a drop_features list
##### We will maintain different lists of drop attributes for various experiments as mentioned in the Case Study Section



In [None]:
X_train, y_train = get_train_val_data()

In [None]:
# The product_type attribute has the same value throughout the dataset, hence it can be removed
X_train.product_type.value_counts()

In [None]:
# Display amount of missingness in attributes
missing_means = X_train.isnull().mean().multiply(100).sort_values(ascending=False)
missing_means.head(10)

In [None]:
# Experiment 1: drop all attributes with more than 35% missing values
drop_attributes_exp1 = ["product_type"] + missing_means[missing_means > 35.0].index.to_list()

# Experiment 2: drop all attributes with more than 99% missing values
drop_attributes_exp2 = ["product_type"] + missing_means[missing_means > 99.0].index.to_list()

# Experiment 3: same as experiment 2
drop_attributes_exp3 = ["product_type"] + missing_means[missing_means > 99.0].index.to_list()

###### Column Dropper Transformer for Pipeline

In [None]:
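from sklearn.base import BaseEstimator, TransformerMixin


class ColumnDropperTransformer(BaseEstimator, TransformerMixin):
    """Drop the given columns from a DataFrame inside a Pipeline."""

    def __init__(self, columns):
        # The attribute name must match the parameter name so that
        # scikit-learn's get_params/clone work during grid search.
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; the columns to drop are fixed up front.
        return self

    def transform(self, X, y=None):
        return X.drop(self.columns, axis=1)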

###  c) Dealing with missing values if necessary 
##### We are dealing with missing values in three different ways as mentioned in Experiment 1, 2 and 3.

In [None]:
# Experiment 1: Impute attributes with less than 35% missingness with the mode.
mode_imputer_list_exp1 = ["steel", "shape", "bore", "surface_quality", "formability"]

# Experiment 2: Impute attributes with between 70% and 99% missingness with a not_applicable category.
# Impute attributes with less than 35% missingness with the mode.
mode_imputer_list_exp2 = ["steel", "shape", "bore", "surface_quality", "formability"]
na_imputer_list_exp2 = missing_means[(missing_means > 70.0) & (missing_means < 99.0)].index.to_list()

# Experiment 3: Impute all missing values with the NA category.
na_imputer_list_exp3 = missing_means[(missing_means > 0.0) & (missing_means < 99.0)].index.to_list()

###### NA imputer Transformer for Pipeline

In [None]:
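from sklearn.base import BaseEstimator, TransformerMixin


class NAColumnTransformer(BaseEstimator, TransformerMixin):
    """Fill missing values in the given columns with an explicit NA category."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()  # avoid mutating the caller's DataFrame
        without_formability = list(set(self.columns) - {'formability'})

        # Impute string-based features with the string "NA"
        if without_formability:
            X[without_formability] = X[without_formability].fillna("NA")

        # formability is integer-coded, so missing values become a new 0 category
        if "formability" in self.columns:
            X['formability'] = X['formability'].fillna(0).astype(int)
        return X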


###  d) Rescaling if necessary
##### Standard Scaler Pipeline for numerical attributes


In [None]:
def get_numerical_pipeline(scale=False):
    if scale:
        numerical_pipeline = Pipeline(steps=[('ss', StandardScaler())])
        return numerical_pipeline

    return 'passthrough'

### e) Categorical Feature Pipeline

In [None]:
# In experiment 3 the simple imputer has no effect,
# as all missing values are already filled by na_imputer.

def get_categorical_pipeline(cat_features_to_drop, na_imputer_cols):

    categorical_pipeline = Pipeline(steps=[
        ('drop_column', ColumnDropperTransformer(cat_features_to_drop)),
        ('na_imputer', NAColumnTransformer(na_imputer_cols)),
        ('mode', SimpleImputer(strategy='most_frequent')),
        # Note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`.
        ('one-hot', OneHotEncoder(handle_unknown='ignore', sparse=False))
    ])
    return categorical_pipeline
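
The `handle_unknown='ignore'` setting matters at inference time, when a category may appear that was absent from the training data. A minimal sketch with hypothetical category values:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["A"], ["R"], ["A"]], dtype=object))

# "W" was never seen during fit, so its row is encoded as all zeros
# instead of raising an error.
encoded = encoder.transform(np.array([["R"], ["W"]], dtype=object)).toarray()

print(encoder.categories_[0])
print(encoded)
```

An all-zero row means the model treats the unseen value as "none of the known categories", which is a reasonable fallback but worth monitoring if it happens often.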

### f)  Full Preprocessor Pipeline

In [None]:
def get_full_processeror(all_categorical_features, cat_features_to_drop, numerical_features, na_imputer_cols, scale_numerical):
    """
    
    :param all_categorical_features: List of all the categorical features.
    :param cat_features_to_drop: List of categorical features to drop.
    :param numerical_features: List of numerical features.
    :param na_imputer_cols: List of features to impute with a new "not_applicable" category.
    :param scale_numerical: Boolean which controls scaling of numerical features.
    :return: 
    """
    
    categorical_pipeline = get_categorical_pipeline(cat_features_to_drop, na_imputer_cols)
    numerical_pipeline = get_numerical_pipeline(scale=scale_numerical)
    full_processor = ColumnTransformer(transformers=[
        ('category', categorical_pipeline, all_categorical_features),
        ('numerical', numerical_pipeline, numerical_features)
    ])
    return full_processor

#### Experiment 1 Data Pipeline
- We drop all attributes/columns with more than 35% missingness.
- We impute the remaining categorical values with the mode.

In [None]:
all_features = X_train.columns.to_list()
numerical_features = ["carbon", "hardness", "strength", "thick", "width", "len"]
all_categorical_features = list(set(all_features) - set(numerical_features))
cat_features_to_drop = drop_attributes_exp1
na_imputer_cols = []
scale_numerical = False

experiment1_data_pipeline = get_full_processeror(
    all_categorical_features=all_categorical_features, 
    cat_features_to_drop=cat_features_to_drop, 
    numerical_features=numerical_features, 
    na_imputer_cols=na_imputer_cols, 
    scale_numerical=scale_numerical)

## 4. Technique 1 (20 marks)

### a) Discuss your motivation for choosing the technique and provide a schematic figure of the process

We use Random Forest as the first technique for the following reasons:

1. The algorithm reduces overfitting and variance through bagging and ensembling.
    - It builds many decision trees, each trained on a bootstrap sample of the training data (bagging) and restricted to a random subset of features at each split.
    - It then takes the majority vote of the outputs of all trees to give the final output.
2. It requires no feature scaling.
3. It is robust to outliers.
4. The grid search is easy to perform.
    - Unlike parametric models, it does not require careful tuning of a regularisation parameter, as regularisation is built in through ensembling and bagging.
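
The aggregation step can be inspected directly. scikit-learn's random forest combines trees by averaging their predicted class probabilities rather than counting hard votes; with fully grown trees each tree's probabilities are effectively one-hot, so the two coincide. The sketch below uses synthetic data, and the tree count of 25 is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem, for illustration only.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Each fitted tree is available in forest.estimators_.
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Average the per-tree class probabilities and pick the most likely class.
mean_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

agrees = np.array_equal(manual_pred, forest.predict(X))
print("manual aggregation matches forest.predict:", agrees)
```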

Enter the correct code in the cells below to execute each of the stated sub-tasks.
### b) Setting hyper parameters with rationale


###### We perform a grid search over the number of estimators (decision trees) and the max depth of those decision trees.
###### Increasing the number of estimators reduces variance, but at the same time increases the running time of the algorithm.
###### We tune the max depth parameter because very deep decision trees can overfit, although the literature also suggests that the trees can be allowed to grow as deep as possible as long as the ensemble is large enough.


In [None]:
rf_param_grid = {
        "model__n_estimators": [20, 40, 60, 80, 100, 120, 150, 200],
        "model__max_depth": [5, 10, 20, 30, 40, None],

    }

### c) Optimising hyper parameters
We perform nested cross-validation with grid search and record the best estimators.
Instead of running nested CV only once, we run several trials and record the best estimator on each fold. This is because
the best estimator may differ from fold to fold, so we keep track of the best estimators across many runs and pick the
top 3 estimators by frequency.

These top 3 estimators will then be ensembled again to get the final classifier.
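
One possible way to ensemble the three selected configurations is scikit-learn's `VotingClassifier`. This is a sketch under assumptions: the hyper parameter pairs in `top_configs` are hypothetical placeholders for whatever the nested CV runs select, soft voting is an assumed choice, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

# Hypothetical (max_depth, n_estimators) settings standing in for the three
# most frequently selected configurations from the nested CV runs.
top_configs = [(10, 60), (20, 100), (None, 150)]

members = [
    (f"rf_{i}", RandomForestClassifier(max_depth=depth, n_estimators=n_trees, random_state=i))
    for i, (depth, n_trees) in enumerate(top_configs)
]

# Soft voting averages the members' predicted class probabilities.
ensemble = VotingClassifier(estimators=members, voting="soft")

# Synthetic three-class data, for illustration only.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
ensemble.fit(X, y)
predictions = ensemble.predict(X[:5])
print(predictions)
```

In the real project each member would be the full preprocessing-plus-model pipeline rather than a bare classifier, so that every member sees identically prepared data.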

In [None]:
from collections import Counter

mean_train_Fbeta_scores = []
mean_test_Fbeta_score = []
# Counts how often each (max_depth, n_estimators) pair wins the inner grid search.
estimator_frequency = Counter()

def nestes_cross_validation(X_train, y_train, param_grid, classifier, datapipeline):
    n_jobs = 40

    pipeline = Pipeline(steps=[
        ('preprocess', datapipeline),
        ('model', classifier)
    ])

    # Inner loop: grid search over the hyper parameters.
    clf = GridSearchCV(estimator=pipeline, param_grid=param_grid,
                       scoring=make_scorer(recall_biased_Fbeta), cv=3, n_jobs=n_jobs, return_train_score=True)

    # Outer loop: cross-validate the whole grid search.
    scorer = check_scoring(clf, scoring=make_scorer(recall_biased_Fbeta))
    cv_results = cross_validate(
        estimator=clf,
        X=X_train,
        y=y_train,
        scoring={"score": scorer},
        cv=3,
        n_jobs=n_jobs,
        pre_dispatch="2*n_jobs",
        error_score=np.nan,
        return_estimator=True,
        return_train_score=True
    )

    run_test_Fbeta_score_mean = mean(cv_results['test_score'])
    run_train_Fbeta_score_mean = mean(cv_results['train_score'])

    # run_var = variance(cv_results['test_score'])
    mean_train_Fbeta_scores.append(run_train_Fbeta_score_mean)
    mean_test_Fbeta_score.append(run_test_Fbeta_score_mean)

    # Record the winning hyper parameters of each outer fold.
    for estimator in cv_results['estimator']:
        best_model = estimator.best_estimator_.named_steps.model
        estimator_frequency[(best_model.max_depth, best_model.n_estimators)] += 1

### Experiment 1

In [None]:
n_jobs = 40  # same degree of parallelism as inside nestes_cross_validation
rf = RandomForestClassifier(n_jobs=n_jobs, oob_score=False)
for i in range(5):
    nestes_cross_validation(X_train.copy(), y_train.copy(), rf_param_grid, rf, experiment1_data_pipeline)
    print("OVERALL: ", mean(mean_train_Fbeta_scores), mean(mean_test_Fbeta_score))
    print(estimator_frequency)
    print("*"*10)

### d) Performance metrics for training

In [None]:
def recall_biased_Fbeta(y_true, y_pred):
    fbs = fbeta_score(y_true, y_pred, average='macro', beta=1.28)
    return fbs

def print_recall_biased_Fbeta(y_true, y_pred):
    print(classification_report(y_true, y_pred, zero_division=0))  # print classification report
    fbs = fbeta_score(y_true, y_pred, average='macro', beta=1.28)
    print("F_beta: ", fbs)
    print("\n***---***\n")
    return fbs

## 5. Technique 2 (20 marks)

### a) Discuss your motivation for choosing the technique and  provide a schematic figure of the process

100-200 words


Enter the correct code in the cells below to execute each of the stated sub-tasks.
### b) Setting hyper parameters with rationale


### c) Optimising hyper parameters


### d) Performance metrics for training

## 6. Comparison of metrics performance for testing (16 marks)
Enter the correct code in the cells below to execute each of the stated sub-tasks. 


### a) Use of cross validation for both techniques to deal with over-fitting

### b) Comparison with appropriate metrics for testing

### c) Model selection (ROC or other charts)

## 7. Final recommendation of best model (8 marks)

### a) Discuss the results from a technical perspective, for example, overfitting discussion, complexity and efficiency

100-200 words


### b) Discuss the results from a business perspective, for example, results interpretation, relevance and balance with technical perspective

100-200 words

## 8. Conclusion (8 marks)

### a) What has been successfully accomplished and what has not been successful?
100-300 words

### b) Reflecting back on the analysis, what could you have done differently if you were to do the project again?

100-300 words

### c) Provide a wish list of future work that you would like to do

100-200 words