# Part 1: Classification

Please include any imports (allowed by Ed) you require throughout your notebook in the first cell.

In [1]:
# Import all libraries
# to make this notebook's output stable across runs
import numpy as np
import pandas as pd 
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

np.random.seed(0)

## Data loading and preprocessing

Load the cardio_diseases_.csv dataset and set the X and y variables to the data and class respectively.

You will need to load this file into numpy arrays for the attribute data and the labels. So that we can test your code more effectively, please complete this task inside the given function scaffold, and have your function return these arrays.

Begin by looking at the file and its format. Notice any missing values and how they are encoded in the file. In the returned X array, missing values should be encoded as np.nan. All numerical values should be positive numbers, and any invalid numbers should be encoded as np.nan.

This dataset includes categorical values which should be changed to numerical values for some methods. Binary categorical values [No,Yes] or [F,M] should be replaced to [0,1]. Other categorical values should be changed from [Normal, Above normal, Well above normal] to [0,0.5,1]

While there are multiple ways to load the file correctly, a suggested function to use is pd.read_csv. Look through the documentation to check which arguments you will need to pass to the function to load the file correctly. If you choose to use this approach, you will need to extract the appropriate numpy arrays from the pandas dataframe, and exclude any headers.

The X array returned by your function should have shape (number of examples, number of attributes), and the y array returned by your function should have shape (number of examples,). We will also test your function with some different datasets with the same data types, delimiters, and encoding of missing values. However, these files may have a different filename, number of examples and/or attributes, so you should not hard code these values in your solution. There will not be any missing target values, and the targets will always be in the final column.


In [2]:
### TEST FUNCTION: test_data_loading
# DO NOT REMOVE THE LINE ABOVE
import pandas as pd

def load_heart_disease_data(filename):
    """Load the dataset located at the filname string as described above."""
    
    # TODO 
    # return X, y
    # 读取数据
    file_new = pd.read_csv(filename)
    
    # 填充缺失值
    file_new.fillna(np.nan, inplace=True)

    #分类字典
    bio_class = {'No':0, 'Yes':1, 'F':0, 'M':1, 'nan':'NAN'}
    other_class = {'Normal':0, 'Above normal':0.5, 'Well above normal':1}
    
    # 处理数据
    for column_name, column_value in file_new.items():
        try:
            numeric_value = pd.to_numeric(column_value)
            file_new[column_name] = np.where(file_new[column_name] < 0, np.nan, file_new[column_name])
        except ValueError:
            if file_new[column_name].nunique() == 3:
                file_new[column_name] = file_new[column_name].replace(other_class)
            else:
                file_new[column_name] = file_new[column_name].replace(bio_class)
    
    # 分离特征和标签
    X = file_new.iloc[:, :-1].values
    y = file_new.iloc[:, -1].values

    print("X shape:", X.shape)
    print("y shape:", y.shape)
    return X,y

filename = 'cardio_diseases.csv'
X, y = load_heart_disease_data(filename)

X shape: (6878, 11)
y shape: (6878,)


Next, you should investigate the attributes of the dataset and find any invalid values for any attributes of the data. 

Replace any missing or invalid values with the mean of that attribute for numeric data and the most frequent value for categorical data. Take a look at the following documentation to see a suitable function to perform this using sklearn: [Documentation](https://scikit-learn.org/stable/modules/impute.html#impute).

Many of the classification algorithms we will use will benefit from normalisation, so you should perform min-max normalisation on appropriate attributes.

In [3]:
### TEST FUNCTION: test_preprocessing
# DO NOT REMOVE THE LINE ABOVE

def process_data(X):
    """Fill missing (np.nan) values in the input array as described above."""
    """Scale data using MinMaxScaler as described above."""
    # TODO
    # return X

    #数据转换
    df = pd.DataFrame(X)

    #填补空值
    proc_1 = SimpleImputer(missing_values=np.nan, strategy='mean')
    proc_2 = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

    for column_name,column_value in df.items():
        if df[column_name].nunique() > 3:
            df[column_name] = proc_1.fit_transform(df[[column_name]])
        else:
            df[column_name] = proc_2.fit_transform(df[[column_name]])
        
        
    #正则化 with sklearn
    scaler = MinMaxScaler()

    for column_name,column_value in df.items():
        if df[column_name].nunique() > 3:
            df[column_name] = scaler.fit_transform(df[[column_name]])


    X = df.values

    return X

X_scaled = process_data(X)

In [19]:
### SKIP
# This cell won't be marked. Use it to try out your code.


## Implementing classification algorithms and comparing with 10-fold stratified cross-validation.

For the following tasks, you are required to implement functions which create algorithms and evaluate them with 10 fold cross validation. 

In order to make this reproducible, it is important that the folds are kept consistent across runs. You can utilise `sklearn.model_selection.StratifiedKFolds` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)) to provide consistency in the generation of the 10 folds. This object can be passed into the `cv` argument of `cross_val_score`. Make sure to shuffle the data in this process, and set `random_state` to be 0. Use the variable name `cvKFold`.

In [4]:
### TEST FUNCTION: test_cvkfold
# DO NOT REMOVE THE LINE ABOVE
from sklearn.model_selection import cross_val_score, StratifiedKFold

n_splits = 10
random_state = 0

cvKFold=StratifiedKFold(n_splits=n_splits,random_state=random_state,shuffle=True)

In [21]:
### SKIP
# This cell won't be marked. Use it to try out your code.


**K-Nearest Neighbors**

We have seen how to implement a KNN classifier in the lab. Your task is to implement a KNN for classification using sklearn.neighbors.KNeighborsClassification ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html))

Use the function definition below, so that any appropriate hyperparameters can optionally be passed in and accessed as a dictionary. Note you also can pass arguments as a dictionary to functions (such as sklearn constructors) using the ** syntax.

Fill in the function to perform K-nearest neighbors. Test the function with K=11 and Manhattan distance.

The format of your output should be:

Cross validation score: x.xx

In [22]:
### TEST FUNCTION: test_k_nearest_neighbors
# DO NOT REMOVE THE LINE ABOVE

#K-Nearest Neighbors

#构建函数
def knnClassifier(X, y,**hyperparams):

    #建立knn模型
    knn = KNeighborsClassifier(**hyperparams)

    # 使用 cross_val_score 执行交叉验证，并传递分层折叠策略 cv
    # 计算平均得分
    scores = cross_val_score(knn, X, y, cv=cvKFold)
    mean_score = scores.mean()
    

    return knn,mean_score

# 调用函数并传递分层折叠策略 cvKFold
hyperparams={'n_neighbors':11,'p':1}
knn_class_model, mean_score = knnClassifier(X_scaled, y, **hyperparams)

print("Cross validation score: {:.2f}".format(mean_score))

Cross validation score: 0.71


In [23]:
### SKIP
# This cell won't be marked. Use it to try out your code.

**Naive Bayes**

Fill in the function to use the Gaussian Naive Bayes function on all attributes. Use the sklearn implementation in [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html). 

The format of your output should be:

Cross validation score: x.xx

In [24]:
### TEST FUNCTION: test_naive_bayes
# DO NOT REMOVE THE LINE ABOVE

#Naïve Bayes

def nbClassifier(X, y, **hyperparams):
    # TODO
    # return model, score
    
    #建立nb模型
    nb = GaussianNB()

    # 使用 cross_val_score 执行交叉验证，并传递分层折叠策略 cv
    scores = cross_val_score(nb, X, y, cv=cvKFold)
    mean_score = scores.mean()

    return nb, mean_score

nb_model, mean_score = nbClassifier(X_scaled, y)

print("Cross validation score: {:.2f}".format(mean_score))

Cross validation score: 0.71


In [25]:
### SKIP
# This cell won't be marked. Use it to try out your code.


There is an issue with our current implementation of Naive Bayes using GaussianNB above. GaussianNB can still work effectively with these attributes, but a slight improvement can be made by removing these attributes. 

Answer the Part 1 Question on Gaussian Naive Bayes.

**Decision Tree** 

As shown in the tutorials, decision trees can often perform well in classification tasks. Fill in the function to perform classifier with a decision tree classifier. Test the function with entropy, max depth of 5 and sqrt max features. Read through [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

The format of your output should be:

Cross validation score: x.xx

In [26]:
### TEST FUNCTION: test_decision_tree_classifier
# DO NOT REMOVE THE LINE ABOVE

#Decision Trees

def dtClassifier(X, y, **hyperparams):
    # TODO
    # return model, score   
    tree = DecisionTreeClassifier(**hyperparams,random_state=0)
    
    scores = cross_val_score(tree, X, y, cv=cvKFold)
    mean_score = scores.mean()

    return tree, mean_score

hyperparams = {'criterion':'entropy', 'max_depth':5,'max_features':'sqrt'}
tree_model, mean_score = dtClassifier(X_scaled, y,**hyperparams)

print("Cross validation score: {:.2f}".format(mean_score))

Cross validation score: 0.73


In [27]:

### SKIP
# This cell won't be marked. Use it to try out your code.


**Support Vector Machine**

Fill in the function to perform a linear support vector machine classifier. Try utilising [`sklearn.svm.LinearSVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC). Test the function with lasso regularization, C = 0.005 and set dual to "auto".

The format of your output should be:

Cross validation score: x.xx

In [7]:
### TEST FUNCTION: test_svm
# DO NOT REMOVE THE LINE ABOVE

#Support Vector Machine
from sklearn.svm import LinearSVC

def svmClassifier(X, y, **hyperparams):
    # TODO
    # return model, score
    las_svm = LinearSVC(**hyperparams)

    scores = cross_val_score(las_svm, X, y, cv=cvKFold)
    mean_score = scores.mean()

    return las_svm,mean_score

hyperparams={'penalty':'l1','C':0.005,'dual':'auto','random_state':0}
svm_model, mean_score = svmClassifier(X_scaled, y, **hyperparams)

print("Cross validation score: {:.2f}".format(mean_score))



Cross validation score: 0.72


In [29]:
### SKIP
# This cell won't be marked. Use it to try out your code.


## Ensemble Methods

Ensembles are powerful tools in machine learning that seek to improve predictive performance by combining predictions from multiple models.

**Bagging with logistic regression**

Fill in the function to perform bagging with logistic regression as the base classifier. Test the bagging with 20 estimators and a maximum of half of the samples. The logistic regression should be set with an C of 2 and using ridge regularisation.

*Hint:* The hyperparams dict should be split to only pass the relevant hyperparameters to bagging and the logistic regression.

The format of your output should be:

Cross validation score: x.xx

In [30]:
### TEST FUNCTION: test_bagging_lr
# DO NOT REMOVE THE LINE ABOVE

#Bagging

def baggingClassifier(X, y,**hyperparams):
    # TODO
    # return model, score
    logistic_regression_params = {k: v for k, v in hyperparams.items() if k in LogisticRegression().get_params()}
    bagging_params = {k: v for k, v in hyperparams.items() if k in BaggingClassifier().get_params()}

    logistic_regression_model = LogisticRegression(**logistic_regression_params,random_state=0)
    bagging_model = BaggingClassifier(estimator=logistic_regression_model,**bagging_params,random_state=0)

    scores = cross_val_score(bagging_model, X, y, cv=cvKFold)
    mean_score = scores.mean()

    

    return bagging_model,mean_score

hyperparams = {'C':2,'penalty':'l2','n_estimators':20, 'max_samples':0.5}
bagging_model, mean_score = baggingClassifier(X_scaled, y, **hyperparams)

print("Cross validation score: {:.2f}".format(mean_score))

Cross validation score: 0.73


In [31]:
### SKIP
# This cell won't be marked. Use it to try out your code.

**Adaboost on DT**

Fill in the function to perform boosting with Adaboost on a decision tree as the base classifier. Test your function with the Adaboost with 25 estimators and a learning rate of 0.1. The decision tree should have a max depth of 3 with a log loss criterion.

*Hint:* The hyperparams dict should only pass the relevant hyperparameters to Adaboost and the decision tree.

The format of your output should be:

Cross validation  score: x.xx


In [32]:
### TEST FUNCTION: test_adaboost_dt
# DO NOT REMOVE THE LINE ABOVE

#Adaboost
from sklearn.ensemble import AdaBoostClassifier


def adaBoostClassifier(X, y,**hyperparams):
    # TODO
    # return model, score
    decision_tree_params = {k: v for k, v in hyperparams.items() if k in DecisionTreeClassifier().get_params()}
    ada_params = {k: v for k, v in hyperparams.items() if k in AdaBoostClassifier().get_params()}


    decision_tree = DecisionTreeClassifier(**decision_tree_params,random_state=0)
    ada_clf = AdaBoostClassifier(estimator=decision_tree, **ada_params, random_state=0)

    scores = cross_val_score(ada_clf, X, y, cv=cvKFold)
    mean_score = scores.mean()

    return ada_clf,mean_score

hyperparams = {'criterion':'entropy', 'max_depth':3,'n_estimators':25, 'learning_rate':0.1}
ada_clf, mean_score = adaBoostClassifier(X_scaled, y, **hyperparams)

print("Cross validation score: {:.2f}".format(mean_score))


Cross validation score: 0.73


In [33]:

### SKIP
# This cell won't be marked. Use it to try out your code.



## Hyperparameter tuning

Cross-validation is an excellent tool for determining the best generalisation performance and determining the best hyperparameters. However, when we are looking at slower algorithms or large data, we may choose to use a validation set instead of cross-validation. Here you will perform hyperparameter tuning using a test set consisting of 25% of the data, and a validation set consisting of 10% of the training set. Set any random_state parameters to 0. 

In [6]:
### TEST FUNCTION: test_assert_splitting
# Create this function to use on any subset of this data

# TODO: uncomment this code to create the initial train test split

from sklearn.model_selection import train_test_split

def train_val_test_split(X,y):
    # TODO
    # return X_train_all, X_train, X_val, X_test, y_train_all, y_train, y_val, y_test
    X_train_all, X_test, y_train_all, y_test = train_test_split(X_scaled, y, test_size=0.25,stratify=y,random_state=0)
    X_train,X_val,y_train,y_val = train_test_split(X_train_all,y_train_all,test_size=0.1,stratify=y_train_all,random_state=0)

    return X_train_all, X_train, X_val, X_test, y_train_all, y_train, y_val, y_test

X_train_all, X_train, X_val, X_test, y_train_all, y_train, y_val, y_test = train_val_test_split(X_scaled,y)

In [35]:
### SKIP
# This cell won't be marked. Use it to try out your code.

Perform a grid search with Adaboost on linear SVM. You should select  and handle appropriate hyperparameters for both methods, and return the best set of hyperparameters found.

In the following cell, you should define a grid with three parameters for the AdaBoost and/or the linear SVM.

Use the variable names provided in the scaffold. We will ensure your results are reproduced and that you set reasonable parameter values to achieve a threshold score.

The hyperparameter tuning will also be run on a different subset of the data, so try and make sure your hyperparameter ranges have good coverage.


Output format:

Best hyperparameter combination: {'param1': value, 'param2': value, ...}
Best model's test set score: x.xx

In [11]:
### TEST FUNCTION: test_parameter_tuning
# DO NOT REMOVE THE LINE ABOVE

from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import LinearSVC

def adaBoostGrid(X, y,**hyperparams):
    # TODO
    # return model, score
    best_score1 = 0
    best_params1 = {}

    X_train_all, X_train, X_val, X_test, y_train_all, y_train, y_val, y_test = train_val_test_split(X,y)
    lin_svm_params = {k: v for k, v in hyperparams.items() if k in LinearSVC().get_params()}
    ada_params = {k: v for k, v in hyperparams.items() if k in AdaBoostClassifier().get_params()}

    #训练LinearSVM的参数
    for C in lin_svm_params.get('C',[0.1]):
        # 创建Linear SVM分类器
        lin_svm_classifier = LinearSVC(C=C,random_state=0)

        # 训练模型
        lin_svm_classifier.fit(X_train, y_train)

        # 在验证集上评估性能
        y_pred = lin_svm_classifier.predict(X_val)
        accuracy = accuracy_score(y_val, y_pred)

        # 如果性能更好，更新最佳超参数
        if accuracy > best_score1:
            best_score1 = accuracy
            best_params1 = {'C': C}
    
    #得到的linearSVC 模型

    LinSVC_mod = LinearSVC(dual = 'auto',**best_params1,random_state=0)
    
    best_score = 0
    best_params = {}
    
    for n_estimators in ada_params.get('n_estimators',[100]):
        for learning_rate in ada_params.get('learning_rate',[0.1]):
            # 创建Adaboost分类器
            adaboost_classifier = AdaBoostClassifier(estimator=LinSVC_mod, algorithm='SAMME',n_estimators=n_estimators, learning_rate=learning_rate, random_state=0)

            # 训练模型
            adaboost_classifier.fit(X_train, y_train)

            # 在验证集上评估性能
            y_pred = adaboost_classifier.predict(X_val)
            accuracy = accuracy_score(y_val, y_pred)

            # 如果性能更好，更新最佳超参数
            if accuracy > best_score:
                best_score = accuracy
                best_params = {'n_estimators': n_estimators, 'learning_rate': learning_rate}



        ada_clf = AdaBoostClassifier(algorithm='SAMME',**best_params,random_state=0)
        ada_clf_fit = ada_clf.fit(X_train_all, y_train_all)
        
        y_pred = ada_clf.predict(X_test)
        test_set_score = accuracy_score(y_test, y_pred)
        
        param_grid['C'] = best_params1['C']
        param_grid['n_estimators'] = best_params['n_estimators']
        param_grid['learning_rate'] = best_params['learning_rate']
        best_params_new=param_grid

    return ada_clf,best_params_new,best_score,test_set_score



param_grid = {
     'algorithm':'SAMME',
     'dual':'auto',
     'C':[0.1,1,10],
     'n_estimators':[1,10,100],
     'learning_rate':[0.1,1,10],
}



adaboost_mod = adaBoostGrid(X_scaled,y,**param_grid)
best_model, best_params, best_val_score = adaboost_mod[0],adaboost_mod[1],adaboost_mod[2]
test_score = adaboost_mod[3]
# Print your result as above


print("Best parameters combination: {}".format(best_params))
print("Best model's test set score: {:.2f}".format(test_score))



Best parameters combination: {'algorithm': ['SAMME'], 'dual': ['auto'], 'C': 0.1, 'n_estimators': 100, 'learning_rate': 0.1}
Best model's test set score: 0.74


In [17]:
### TEST FUNCTION: test_adaboost_grid
# DO NOT REMOVE THE LINE ABOVE


In [8]:
### SKIP
# This cell won't be marked. Use it to try out your code.
