### Coursework 2

In this coursework you will complete two classification tasks. Both are text classification tasks.

One task is to be solved using Support Vector Machines. The other has to be solved using Boosting.

The specific tasks and their marking are provided in the notebook. Each task must be accompanied by a concise lab report of at most one A4 page. Submit your Jupyter Notebook and all lab reports as a single zip file. You may implement any additional functions you need to carry out each task.


#### Task 1

In this task, you need to perform sentiment analysis on the provided dataset. The dataset consists of movie reviews, each labelled with a sentiment that is either positive or negative. You need to train an SVM-based classifier on the training set and check it on the sample test dataset provided. The method will also be evaluated against an external test set. Please do not hardcode any dimensions or number of samples in the code: hardcoded values prevent automated testing.

You are allowed to use existing libraries such as scikit-learn and numpy to implement the SVM. However, you are expected to write your own kernels, including custom text kernels. The main idea is to analyse the dataset using different kinds of kernels.

Refer to section 1.4.6.2 of the documentation provided [here](https://scikit-learn.org/stable/modules/svm.html) and the example [here](https://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html) for writing your own kernels.
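As a minimal sketch of the two ways scikit-learn accepts a user-defined kernel, the example below passes a kernel as a callable and, alternatively, as a precomputed Gram matrix. The function name `linear_gram` and the toy arrays are assumptions made for illustration; they are not part of the coursework data or template.

```python
import numpy as np
from sklearn.svm import SVC


def linear_gram(X, Y):
    """Hypothetical custom kernel: K[i, j] = <X[i], Y[j]>."""
    return X @ Y.T


# Toy, linearly separable data (illustration only)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]])
y_toy = np.array([0, 0, 1, 1])
X_new = np.array([[0.0, 0.5], [3.0, 3.5]])

# Option 1: pass the kernel as a callable; SVC evaluates it internally
clf_callable = SVC(kernel=linear_gram)
clf_callable.fit(X_toy, y_toy)
pred_callable = clf_callable.predict(X_new)

# Option 2: compute the Gram matrices yourself.
# fit() takes train-vs-train, predict() takes new-vs-train.
clf_precomputed = SVC(kernel="precomputed")
clf_precomputed.fit(linear_gram(X_toy, X_toy), y_toy)
pred_precomputed = clf_precomputed.predict(linear_gram(X_new, X_toy))

print(pred_callable, pred_precomputed)
```

Both options should give the same predictions for the same kernel function. The callable form is usually more convenient when the classifier has to be run on unseen test files.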

Details regarding the marking have been provided in the coursework specification file. Ensure that the code can be run with different test files. 

#### Process the text and obtain bag-of-words features

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer


def extract_bag_of_words_train_test(train_file, test_file):
    # Read the CSV files passed in by the caller (paths or file-like objects)
    train_df = pd.read_csv(train_file)
    test_df = pd.read_csv(test_file)

    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    # Default CountVectorizer analyzer: lowercases and splits text into word tokens
    analyzer = CountVectorizer().build_analyzer()

    def clean(text):
        # Strip HTML markup, then drop English stop words
        text = BeautifulSoup(text, "lxml").text
        return ' '.join(word for word in text.split() if word.lower() not in stop_words)

    def stem(doc):
        # Stem every token produced by the default analyzer
        return (stemmer.stem(w) for w in analyzer(doc))

    for df in (train_df, test_df):
        df['review'] = df['review'].apply(clean)
        # Encode labels: negative -> 1, anything else -> 0
        df['sentiment'] = df['sentiment'].apply(lambda x: 1 if x == 'negative' else 0)

    # A callable analyzer replaces the tokenizer, n-gram and stop-word options,
    # so those options are not passed here
    vectorizer = CountVectorizer(analyzer=stem)

    # Fit the vocabulary on the training reviews only, then reuse it for the test reviews
    X_train = vectorizer.fit_transform(train_df['review'].values)
    X_test = vectorizer.transform(test_df['review'].values)
    return X_train, train_df['sentiment'], X_test, test_df['sentiment']


# Paths to the provided sample data; the function reads the files itself
train_file = "movie_review_train.csv"
test_file = "movie_review_test.csv"


In [2]:

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import TweetTokenizer
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score, f1_score
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import confusion_matrix, roc_auc_score, recall_score, precision_score







In [6]:
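import numpy as np


def linear_kernel(x_i, x_j):
    # Linear kernel: inner products between the rows of x_i and x_j
    return x_i.dot(x_j.T)

def my_kernel(X, Y):
    # Same linear kernel written with @, which also accepts scipy sparse matrices
    return X @ Y.T

def hyperbolic_tangent_kernel(x_i, x_j):
    # Sigmoid (tanh) kernel with unit scale and zero offset
    return np.tanh(x_i.dot(x_j.T))

def get_gram(x1, x2, kernel):
    # Gram matrix with entry [i, j] = kernel(x1[i], x2[j]); expects dense arrays
    return np.array([[kernel(_x1, _x2) for _x2 in x2] for _x1 in x1])

def RBF(x1, x2, gamma=1):
    # Gaussian RBF kernel exp(-gamma * ||x1 - x2||^2) for two dense vectors
    return np.exp(-gamma * np.linalg.norm(x1 - x2) ** 2)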



"""
# Earlier experiments, kept for reference (disabled).
# With kernel='precomputed', fit() takes the train-vs-train Gram matrix and
# score() takes the test-vs-train Gram matrix.
X_train_dense, X_test_dense = X_train.toarray(), X_test.toarray()
clf_svm = SVC(C=0.01, probability=True, kernel='precomputed', class_weight="balanced")

print("Learning")
clf_svm.fit(get_gram(X_train_dense, X_train_dense, RBF), y_train)
print("Predicting")
clf_svm.score(get_gram(X_test_dense, X_train_dense, RBF), y_test)


for kernel in ("linear", "poly", "rbf"):
    clf = SVC(C=0.01, kernel=kernel, class_weight="balanced", probability=True)
    clf.fit(X_train, y_train)
    print("Kernel: ", kernel)
    print(clf.score(X_test, y_test))
"""
X_train, y_train, X_test, y_test = extract_bag_of_words_train_test(train_file, test_file)
kfolds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)


for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, class_weight="balanced", probability=True)
    # gamma is ignored by the linear kernel, so its grid values repeat the same fit there
    grid_svm = GridSearchCV(estimator=clf,
                            param_grid={'C': [0.01, 0.05, 0.1, 1, 10], 'gamma': [1, 0.1, 0.01, 0.001]},
                            cv=kfolds,
                            scoring="roc_auc",
                            verbose=1,
                            n_jobs=-1)
    print("Kernel:", kernel)
    print("Learning")
    grid_svm.fit(X_train, y_train)
    print("Predicting")
    print(grid_svm.score(X_test, y_test))

Kernel: linear
Learning
Fitting 5 folds for each of 20 candidates, totalling 100 fits


In [6]:
print(grid_svm.best_estimator_)

SVC(C=10, class_weight='balanced', gamma=0.001, probability=True)


In [3]:
from sklearn.svm import SVC


class SVMClassifier:
    def __init__(self):
        # Initialisation: the kernel is selected here. A custom kernel can be
        # passed to SVC as a callable; refer to the documentation here:
        # https://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html
        self.clf = SVC(C=0.01, class_weight='balanced', probability=True, kernel='linear')

    def fit(self, X, y):
        # Train the SVM with the kernel configured in __init__
        self.clf.fit(X, y)
        return None

    def predict(self, X):
        # prediction routine for the SVM
        y_pred = self.clf.predict(X)
        return y_pred



### Test function that will be called to evaluate your code. Separate test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. 

In [4]:
def test_func_svm(dataset_train, dataset_test):
    from sklearn.metrics import accuracy_score  
    (X_train, Y_train, X_test, Y_test) = extract_bag_of_words_train_test(dataset_train, dataset_test)
    sc = SVMClassifier()
    sc.fit(X_train, Y_train)
    Y_Pred = sc.predict(X_test)
    acc = accuracy_score(Y_test, Y_Pred)
    print("Accuracy:",acc)
    return acc

In [5]:
acc = test_func_svm("movie_review_train.csv", "movie_review_test.csv")



Accuracy: 0.8613333333333333


#### Task 2

In this task you need to implement a boosting-based classifier that can be used to classify the movie reviews.

Details regarding the marking for the coursework are provided in the coursework specification file. Please ensure that your code will work with a different test file than the one provided with the coursework.

Note that the boosting classifier you implement can include decision trees from scikit-learn or your own decision trees. Use the same sentiment analysis dataset for evaluation.
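To make the boosting idea concrete, here is a minimal sketch of a single AdaBoost round for two classes: compute the weighted error of a weak learner, turn it into a learner weight, and increase the sample weights of the misclassified points. The arrays `y_true` and `y_weak` are made-up values for illustration, and the update shown is the common textbook form, which is only one of several variants.

```python
import numpy as np

# Hypothetical labels and weak-learner predictions (one mistake, at index 2)
y_true = np.array([0, 1, 1, 0, 1])
y_weak = np.array([0, 1, 0, 0, 1])

# Start from uniform sample weights
weights = np.full(len(y_true), 1 / len(y_true))

# Weighted error of the weak learner
miss = y_weak != y_true
err = np.average(miss, weights=weights)

# Learner weight: larger when the weak learner makes fewer weighted mistakes
alpha = np.log((1 - err) / err)

# Up-weight the misclassified samples, then renormalise to sum to 1
weights = weights * np.exp(alpha * miss)
weights /= weights.sum()

print(err, alpha, weights)
```

In the next round, a new weak learner would be trained with these updated weights, so it concentrates on the sample that was misclassified.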

In [3]:
X_train, y_train, X_test, y_test = extract_bag_of_words_train_test(train_file, test_file)



In [7]:
import numpy as np

from sklearn import base
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier


class AdaBoostClassifier:
    """AdaBoost (SAMME) for binary labels in {0, 1}, built on decision trees."""

    n_classes = 2

    def __init__(self, n_estimators=50, max_depth=1):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.base_estimator = DecisionTreeClassifier(max_depth=self.max_depth)
        self.estimator_errors_ = np.zeros(shape=n_estimators)
        self.trees = []  # list of (learner_weight, fitted tree) pairs

    def fit(self, X, y):
        y = np.asarray(y)
        n = len(y)
        weights = np.ones(shape=n) / n  # start from uniform sample weights
        self.trees = []

        for idx in range(self.n_estimators):
            learner_tree = base.clone(self.base_estimator)
            learner_tree.fit(X, y, sample_weight=weights)
            miss = learner_tree.predict(X) != y
            # Weighted error, clipped so the logarithm below stays finite
            err = np.clip(np.average(miss, weights=weights), 1e-10, 1 - 1e-10)
            self.estimator_errors_[idx] = err
            learner_weight = np.log((1 - err) / err) + np.log(self.n_classes - 1)
            # Up-weight the misclassified samples, then renormalise
            weights *= np.exp(learner_weight * miss)
            weights /= weights.sum()
            self.trees.append((learner_weight, learner_tree))
        return self

    def vector(self, y):
        # Encode each label as a vote vector: +1 for the predicted class,
        # -1 / (n_classes - 1) for every other class
        y_vec = np.full((len(y), self.n_classes), -1 / (self.n_classes - 1))
        y_vec[np.arange(len(y)), y] = 1
        return y_vec

    def predict(self, X):
        # Weighted vote over all weak learners
        votes = sum(learner_weight * self.vector(tree.predict(X))
                    for learner_weight, tree in self.trees)
        return np.argmax(votes, axis=1)


# The test function below instantiates BoostingClassifier, so expose the class under that name too
BoostingClassifier = AdaBoostClassifier

dt = AdaBoostClassifier(n_estimators=500, max_depth=4)
dt.fit(X_train, y_train)

y_pred = dt.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(acc)



0.8513333333333334


In [5]:

parameters = {'iteration': [10, 20, 30, 40, 50, 100, 500], 'max_depth': [1, 2, 3, 4, 5, 10, 15, 20]}
acc_best = 0
best_parameter = None
# Try every (n_estimators, max_depth) pair once and report each improvement.
# Accuracy is measured on the provided test split, so the best score is optimistic.
for n_estimators in parameters['iteration']:
    for max_depth in parameters['max_depth']:
        dt = AdaBoostClassifier(n_estimators=n_estimators, max_depth=max_depth)
        dt.fit(X_train, y_train)
        y_pred = dt.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        if acc > acc_best:
            acc_best = acc
            best_parameter = ['Best N', n_estimators, 'Best depth', max_depth]
            print(best_parameter, acc)

print(best_parameter, acc_best)

['Best N', 10, 'Best depth', 1] 0.7073333333333334
['Best N', 10, 'Best depth', 2] 0.7513333333333333
['Best N', 10, 'Best depth', 3] 0.768
['Best N', 10, 'Best depth', 4] 0.7813333333333333
['Best N', 20, 'Best depth', 3] 0.7933333333333333
['Best N', 20, 'Best depth', 4] 0.802
['Best N', 20, 'Best depth', 5] 0.8066666666666666
['Best N', 30, 'Best depth', 3] 0.8126666666666666
['Best N', 30, 'Best depth', 5] 0.8186666666666667
['Best N', 40, 'Best depth', 3] 0.822
['Best N', 40, 'Best depth', 5] 0.8306666666666667
['Best N', 100, 'Best depth', 5] 0.8326666666666667
['Best N', 500, 'Best depth', 1] 0.8366666666666667
['Best N', 500, 'Best depth', 2] 0.8393333333333334
['Best N', 500, 'Best depth', 3] 0.8453333333333334
['Best N', 500, 'Best depth', 4] 0.8533333333333334
['Best N', 500, 'Best depth', 4] 0.8533333333333334


In [None]:
from sklearn.tree import DecisionTreeClassifier
kfolds = StratifiedKFold(n_splits=5, shuffle=True)

parameters={"splitter":["best","random"],
            "max_depth" : [1,2,5,7,9,10,12],
           "min_samples_leaf":[2,3,4,5],
           "max_leaf_nodes":[None,10,20,30,40,50], 'criterion':['entropy', 'gini'],'max_features':[None, 'auto', 'sqrt','log2']}
reg_decision_model = DecisionTreeClassifier(class_weight = 'balanced')
tuning_model = GridSearchCV(reg_decision_model,param_grid=parameters,scoring="roc_auc",cv=kfolds,verbose=1, n_jobs= -1)
tuning_model.fit(X_train,y_train)
print(tuning_model.score(X_test, y_test))



Fitting 5 folds for each of 2688 candidates, totalling 13440 fits
0.7866639745685675


In [147]:
print(tuning_model.best_estimator_)

DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=12, max_features='sqrt', min_samples_leaf=5)


### Test function that will be called to evaluate your code. Separate test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. 

In [158]:
def test_func_boosting(dataset_train, dataset_test):
    from sklearn.metrics import accuracy_score    
    (X_train, Y_train, X_test, Y_test) = extract_bag_of_words_train_test(dataset_train, dataset_test)
    bc = BoostingClassifier()
    bc.fit(X_train, Y_train)
    Y_Pred = bc.predict(X_test)    
    acc = accuracy_score(Y_test, Y_Pred)
    return acc

In [159]:
acc = test_func_boosting("movie_review_train.csv", "movie_review_test.csv")

In [160]:
print(acc)

0.75
