### Coursework 2

In this coursework you will complete two text classification tasks.

One task must be solved using Support Vector Machines (SVMs), the other using boosting.

The specific tasks and their marking are given in the notebook. Each task must be accompanied by a concise lab report of at most one A4 page. Submit your Jupyter Notebook and all lab reports as a single zip file. You may implement any additional functions you need for each task.


#### Task 1

In this task you will perform sentiment analysis on the provided dataset. The dataset consists of movie reviews, each labelled with a sentiment that is either positive or negative. Train an SVM-based classifier on the training set and check it on the sample test set provided. Your method will also be evaluated against an external test set, so do not hardcode any dimensions or numbers of samples in your code: testing is automated, and hardcoded values break it.

You are allowed to use existing library functions from scikit-learn and numpy to implement the SVM. However, you are expected to write your own kernels, including custom text kernels: the main idea is to analyse the dataset using different kinds of kernels. Refer to the scikit-learn documentation [here](https://scikit-learn.org/stable/modules/svm.html) (section 1.4.6.2) and the example [here](https://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html) for writing your own kernels.
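
As a minimal sketch of what "writing your own kernel" can look like, the snippet below passes a Python callable to `svm.SVC`. The cosine kernel and the toy data are illustrative assumptions, not part of the coursework; the only requirement scikit-learn places on the callable is that it returns a matrix of shape `(n_samples_X, n_samples_Y)`.

```python
import numpy as np
from sklearn import svm

def cosine_kernel(X, Y):
    """Hypothetical custom kernel: cosine similarity between the rows of X and Y."""
    X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
    Y_norm = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    # Shape (n_samples_X, n_samples_Y), as SVC expects from a kernel callable
    return X_norm @ Y_norm.T

# Made-up data: two groups of 2-D points pointing in different directions
X_toy = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
y_toy = np.array([0, 0, 1, 1])

clf = svm.SVC(kernel=cosine_kernel)  # SVC accepts a callable as its kernel
clf.fit(X_toy, y_toy)
toy_pred = clf.predict(np.array([[1.0, 0.0], [0.0, 1.0]]))
print(toy_pred)
```

With a callable kernel, scikit-learn calls the function itself during both `fit` and `predict`, so no training data has to be stored by hand.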

Details regarding the marking have been provided in the coursework specification file. Ensure that the code can be run with different test files. 

#### Process the text and obtain bag-of-words features

In [6]:
def extract_bag_of_words_train_test(train_file, test_file):
    # Read the CSV file and extract Bag of Words Features
    import re
    import pandas as pd
    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer

    stemmer = PorterStemmer()  # Create the stemmer once and reuse it for every document

    def stem_removal(text):
        # Tokenise on whitespace and reduce every token to its stem
        return [stemmer.stem(token) for token in text.split()]

    def preprocessor(text):
        text = re.sub(r'<[^>]*>', '', text)  # Remove HTML markup
        text = re.sub(r'\W+', ' ', text.lower())  # Lowercase; replace runs of non-word characters with a space
        return text
    
    tfidf = TfidfVectorizer(tokenizer=stem_removal,
                            preprocessor=preprocessor) # Performs better than CountVectorizer()
    
    df_train = pd.read_csv(train_file)
    df_train = df_train.sample(frac=1).reset_index(drop=True) # Shuffling to remove any bias
    df_test = pd.read_csv(test_file)
    
    X_train = tfidf.fit_transform(df_train['review'])
    y_train = df_train['sentiment'].apply(lambda sent: 1 if sent == "positive" else 0).values
    
    X_test = tfidf.transform(df_test['review'])
    y_test = df_test['sentiment'].apply(lambda sent: 1 if sent == "positive" else 0).values
    
    return (X_train,y_train,X_test,y_test)
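
The function above fits the vectoriser on the training reviews only and then reuses it for the test reviews. A small sketch of why that matters, using made-up documents and a default `TfidfVectorizer` (both are assumptions for illustration): the test matrix gets the same columns as the training matrix, and words never seen in training are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, for illustration only
train_docs = ["a great film", "a dull film"]
test_docs = ["great acting", "unseen words only"]

vectorizer = TfidfVectorizer()
X_train_toy = vectorizer.fit_transform(train_docs)  # Learns the vocabulary and IDF weights
X_test_toy = vectorizer.transform(test_docs)        # Reuses them without refitting

# Dimensions are read from the data rather than hardcoded
n_features = X_train_toy.shape[1]
print(sorted(vectorizer.vocabulary_), n_features, X_test_toy.shape)
```

Because the feature count is taken from `X_train_toy.shape`, the same code runs unchanged on a training file of any size.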

In [7]:
class SVMClassifier:
    def __init__(self, kernel='custom_poly', degree=3, coef=0.4444444444444444):
        import numpy as np
        from sklearn import svm
        #implement initialisation
        self.kernel_lst = ['linear', 'rbf', 'poly', 'sigmoid']
        self.kernel = kernel
        self.classifier = None
        self.train = None
        
        # PARAMETER INITIALISATION
        self.coef = coef
        self.degree = degree
        
    # define your own kernel here
    # Refer to the documentation here: https://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html
    
    # CROSS VALIDATION FOR HYPER PARAMETER TUNING LOCATED IN SCRAP NOTEBOOK
    # best_param_ -> degree=3, coef=0.4444444444444444
    
    # CUSTOM KERNELS
    def log(self, X, y):
        import numpy as np
        from sklearn.metrics.pairwise import euclidean_distances
        
        # Fall back to degree 1 when no degree is set
        degree = 1 if self.degree is None else self.degree
        return -np.log(euclidean_distances(X, y)**degree + 1)
    
    def laplacian(self, X, y):
        import numpy as np
        from sklearn.metrics.pairwise import euclidean_distances
        from scipy import sparse
        
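        # Estimate the scale from y, which is the training data in both fit and predict,
        # so that the same kernel is applied at train and test time
        ref = y.toarray() if sparse.issparse(y) else np.asarray(y)
        sigma = np.sqrt(ref.shape[1] * np.var(ref))
        return -euclidean_distances(X, y)/sigma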
    
    def poly(self, X, y):
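        import numpy as np
        from scipy import sparse

        # Fall back to degree 1 and coef 0 when the parameters are unset
        degree = 1 if self.degree is None else self.degree
        coef = 0 if self.coef is None else self.coef

        # Densify sparse input so that np.var and the dot product work for both input types
        X = X.toarray() if sparse.issparse(X) else np.asarray(X)
        y = y.toarray() if sparse.issparse(y) else np.asarray(y)
        # Scale by the variance of y, which is the training data in both fit and predict
        gamma = 1.0/(y.shape[1] * np.var(y))
        return (gamma*X.dot(y.T) + coef)**degree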
    
    # ESTIMATOR FUNCTIONS
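    def get_params(self, deep=True):
        # Include every constructor argument so that cloning during cross-validation keeps the kernel
        return {'kernel': self.kernel, 'degree': self.degree, 'coef': self.coef}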
    
    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

    def fit(self, X, y):
        # training of the SVM
        from sklearn import svm
        custom_kernels = {'log': self.log, 'laplacian': self.laplacian, 'custom_poly': self.poly}
        if self.kernel in self.kernel_lst:
            # Built-in scikit-learn kernel
            svc = svm.SVC(kernel=self.kernel)
            svc.fit(X, y)
        elif self.kernel in custom_kernels:
            # Custom kernel: pass the (n_train, n_train) Gram matrix to a precomputed SVC
            svc = svm.SVC(kernel='precomputed')
            svc.fit(custom_kernels[self.kernel](X, X), y)
        else:
            raise ValueError(f"Unknown kernel: {self.kernel!r}")

        self.train = X  # Kept to build the test-versus-train Gram matrix in predict
        self.classifier = svc
        return self

    def predict(self, X):
        # prediction routine for the SVM
        custom_kernels = {'log': self.log, 'laplacian': self.laplacian, 'custom_poly': self.poly}
        if self.kernel in self.kernel_lst:
            return self.classifier.predict(X)
        if self.kernel in custom_kernels:
            # Gram matrix between test rows and training rows: shape (n_test, n_train)
            return self.classifier.predict(custom_kernels[self.kernel](X, self.train))
        raise ValueError(f"Unknown kernel: {self.kernel!r}")
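
The class above uses `kernel='precomputed'` for its custom kernels, which is why it stores the training data. The sketch below shows the shapes involved on made-up random data with a hypothetical `linear_gram` helper: `fit` receives a square train-versus-train Gram matrix, and `predict` receives a test-versus-train matrix.

```python
import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)
y_tr = np.arange(20) % 2          # Made-up alternating labels 0, 1, 0, 1, ...
X_tr = rng.normal(size=(20, 5))
X_tr[:, 0] += 3.0 * y_tr          # Shift class 1 so the toy problem is learnable
X_te = rng.normal(size=(7, 5))

def linear_gram(A, B):
    """Hypothetical helper: entry (i, j) is the dot product of A[i] and B[j]."""
    return A @ B.T

svc = svm.SVC(kernel='precomputed')
gram_train = linear_gram(X_tr, X_tr)  # Shape (n_train, n_train)
svc.fit(gram_train, y_tr)
gram_test = linear_gram(X_te, X_tr)   # Shape (n_test, n_train)
pred = svc.predict(gram_test)
print(gram_train.shape, gram_test.shape, pred.shape)
```

The second argument of the kernel is always the training data, so any data-dependent scale should be estimated from that argument; the kernel is then the same at train and test time.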


### Test function that will be called to evaluate your code. Separate test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. 

In [8]:
def test_func_svm(dataset_train, dataset_test):
    from sklearn.metrics import accuracy_score  
    (X_train, Y_train, X_test, Y_test) = extract_bag_of_words_train_test(dataset_train, dataset_test)
    sc = SVMClassifier()
    sc.fit(X_train, Y_train)
    Y_Pred = sc.predict(X_test)
    acc = accuracy_score(Y_test, Y_Pred)
    print("Accuracy:",acc)
    return acc

In [9]:
acc = test_func_svm("movie_review_train.csv", "movie_review_test.csv")

Accuracy: 0.8826666666666667


#### Task 2

In this task you need to implement a boosting-based classifier that can be used to classify the reviews.

Details regarding the marking for the coursework are provided in the coursework specification file. Please ensure that your code will work with a different test file than the one provided with the coursework.

Note that the boosting classifier you implement can include decision trees from scikit-learn or your own decision trees. Use the same sentiment analysis dataset for evaluation.
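
To make the boosting idea concrete, here is a sketch of a single AdaBoost round on made-up labels, assuming the standard two-class weight update; the weak learner's predictions are hardcoded for illustration rather than produced by a real decision tree.

```python
import numpy as np

y_true = np.array([1, 1, 0, 0, 1])
stump_pred = np.array([1, 0, 0, 0, 1])  # Hypothetical weak learner output with one mistake

w = np.full(y_true.shape[0], 1 / y_true.shape[0])  # Start from uniform sample weights
miss = stump_pred != y_true                        # Boolean mask of misclassified samples
error = np.sum(w * miss) / np.sum(w)               # Weighted error rate
alpha = np.log((1 - error) / error)                # Learner weight: larger when the error is smaller
w_new = w * np.exp(alpha * miss)                   # Up-weight only the misclassified samples
w_new /= np.sum(w_new)                             # Renormalise so the weights sum to 1
print(error, alpha, w_new)
```

After the update the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.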

In [1]:

class BoostingClassifier:
    # You need to implement this classifier. 
    def __init__(self):
        #implement initialisation
        self.num_clfs = 50
        self.trained_clfs = []
    
    def samme(self, X, y):
        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        # SAMME algorithm (multi-class AdaBoost) with decision stumps as weak learners
        clfs = []
        N = X.shape[0]
        W = np.full(N, (1 / N), dtype=np.float64)

        for _ in range(self.num_clfs):
            clf = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=W)
            miss = clf.predict(X) != y  # Predict once and reuse the misclassification mask
            # Weighted error, clipped away from 0 and 1 so that alpha stays finite
            error = np.clip(np.sum(W * miss) / np.sum(W), 1e-10, 1 - 1e-10)
            alpha = np.log((1 - error) / error)  # The SAMME term log(K - 1) is 0 for K = 2 classes
            W = W * np.exp(alpha * miss)  # Up-weight the misclassified samples
            W /= np.sum(W)  # Renormalise so the weights stay bounded
            clfs.append((clf, alpha))
        return clfs

    def fit(self, X, y):
        self.trained_clfs = self.samme(X, y)

    def predict(self, X):
        import numpy as np

        # Weighted vote: every stump adds its alpha to the class it predicts
        classes = np.array([0, 1])
        votes = np.zeros((X.shape[0], classes.shape[0]))

        for clf, alpha in self.trained_clfs:
            stump_pred = clf.predict(X)
            for k, cls in enumerate(classes):
                votes[:, k] += alpha * (stump_pred == cls)

        # Pick the class with the largest total weight for each sample
        y_pred = classes[np.argmax(votes, axis=1)]

        return y_pred
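
The prediction step is a weighted vote over the weak learners. A small sketch with made-up learner weights and predictions (all values are illustrative assumptions) shows how one confident learner can outvote two weaker ones:

```python
import numpy as np

classes = np.array([0, 1])
alphas = np.array([1.5, 0.4, 0.7])    # Hypothetical learner weights
stump_preds = np.array([[1, 0],       # One row per learner, one column per sample
                        [0, 0],
                        [0, 1]])

votes = np.zeros((stump_preds.shape[1], classes.shape[0]))
for alpha, pred in zip(alphas, stump_preds):
    for k, cls in enumerate(classes):
        votes[:, k] += alpha * (pred == cls)  # Add alpha to the class each learner chose

y_vote = classes[np.argmax(votes, axis=1)]
print(votes, y_vote)
```

For the first sample, two learners vote for class 0 with a combined weight of 1.1, but the single learner with weight 1.5 voting for class 1 wins.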

### Test function that will be called to evaluate your code. Separate test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. 

In [3]:
def test_func_boosting(dataset_train, dataset_test):
    from sklearn.metrics import accuracy_score    
    (X_train, Y_train, X_test, Y_test) = extract_bag_of_words_train_test(dataset_train, dataset_test)
    bc = BoostingClassifier()
    bc.fit(X_train, Y_train)
    Y_Pred = bc.predict(X_test)    
    acc = accuracy_score(Y_test, Y_Pred)
    return acc

In [4]:
acc = test_func_boosting("movie_review_train.csv", "movie_review_test.csv")