### Coursework 2

In this coursework you will complete two classification tasks. 
Both are text classification tasks. 

One task is to be solved using Support Vector Machines. The other has to be solved using Boosting.

The specific tasks and their marking are described in the notebook. Each task must be accompanied by a concise lab report of at most one A4 page. Submit your Jupyter Notebook and all lab reports as a single zip file. You may implement any additional functions you need for each task.


#### Task 1

In this task, you will perform sentiment analysis on the provided dataset. The dataset consists of movie reviews, each labelled with a sentiment that is either positive or negative. Train an SVM-based classifier on the training set and check it on the sample test set provided. The method will also be evaluated against an external test set. Please do not hardcode any dimensions or numbers of samples in your code: testing is automated, and hardcoded values prevent automated testing.

You may use existing libraries such as scikit-learn and numpy to implement the SVM. However, you are expected to write your own kernels.

The main idea is to analyse the dataset using different kinds of kernels, including your own custom text kernels. Refer to section 1.4.6.2 of the documentation [here](https://scikit-learn.org/stable/modules/svm.html) and the example [here](https://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html) for writing your own kernels.
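
As a minimal sketch of the callable-kernel mechanism described in the scikit-learn documentation, the block below passes a hand-written cosine-similarity kernel to `SVC`. The choice of kernel, the function name `cosine_kernel` and the toy word-count vectors are illustrative assumptions, not part of the coursework specification. Sparse input is densified for simplicity, which may not scale to large vocabularies.

```python
import numpy as np
from scipy import sparse
from sklearn.svm import SVC


def cosine_kernel(X, Y):
    """Return the Gram matrix of cosine similarities between rows of X and Y."""
    # Densify sparse bag-of-words matrices (acceptable for small vocabularies)
    X = X.toarray() if sparse.issparse(X) else np.asarray(X)
    Y = Y.toarray() if sparse.issparse(Y) else np.asarray(Y)
    X = X.astype(float)
    Y = Y.astype(float)
    # Row norms; empty documents get norm 1 to avoid division by zero
    x_norm = np.linalg.norm(X, axis=1, keepdims=True)
    y_norm = np.linalg.norm(Y, axis=1, keepdims=True)
    x_norm[x_norm == 0] = 1.0
    y_norm[y_norm == 0] = 1.0
    return (X / x_norm) @ (Y / y_norm).T


# Toy word counts: the first two rows mostly use word 0, the last two word 1
toy_X = np.array([[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 0]])
toy_y = np.array([1, 1, 0, 0])

# During fit the kernel is evaluated on (X_train, X_train);
# during predict it is evaluated on (X_new, X_train).
toy_svm = SVC(kernel=cosine_kernel).fit(toy_X, toy_y)
toy_pred = toy_svm.predict(np.array([[1, 0, 0], [0, 1, 0]]))
print(toy_pred)
```

The kernel function must return an array of shape `(n_samples_X, n_samples_Y)`, so nothing about the number of samples or features needs to be hardcoded.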

Details regarding the marking have been provided in the coursework specification file. Ensure that the code can be run with different test files. 

#### Process the text and obtain bag-of-words features

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer
from bs4 import BeautifulSoup
import re
import nltk


import xgboost as xgb
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
import os

In [3]:
train_file = pd.read_csv('movie_review_train.csv')
test_file = pd.read_csv('movie_review_test.csv')

In [4]:
train_file.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
def normalize_data(train_file, test_file):
    # Tokenizer and English stopword list shared by the helpers below
    tokenizer = ToktokTokenizer()
    stopword_list = nltk.corpus.stopwords.words('english')

    def strip_html(text):
        soup = BeautifulSoup(text, "html.parser")
        return soup.get_text()

    # Remove any text enclosed in square brackets (raw string avoids invalid escapes)
    def remove_between_square_brackets(text):
        return re.sub(r'\[[^]]*\]', '', text)

    #Removing the noisy text
    def denoise_text(text):
        text = strip_html(text)
        text = remove_between_square_brackets(text)
        return text

    train_file['review']= train_file['review'].apply(denoise_text)
    test_file['review']= test_file['review'].apply(denoise_text)


    # Remove everything except letters, digits and whitespace
    def remove_special_characters(text):
        # The range must be A-Z: 'A-z' would also keep [ \ ] ^ _ and `
        pattern = r'[^a-zA-Z0-9\s]'
        return re.sub(pattern, '', text)
    # Apply function on review column
    train_file['review'] = train_file['review'].apply(remove_special_characters)
    test_file['review'] = test_file['review'].apply(remove_special_characters)

    # remove whitespace
    train_file['review'] = train_file['review'].apply(lambda x: x.strip())
    test_file['review'] = test_file['review'].apply(lambda x: x.strip())

        
    # convert to lower
    train_file['review'] = train_file['review'].apply(lambda x: x.lower())
    test_file['review'] = test_file['review'].apply(lambda x: x.lower())

    # Stem every word with the Porter stemmer (created once and reused)
    ps = PorterStemmer()
    def simple_stemmer(text):
        return ' '.join(ps.stem(word) for word in text.split())
    # Apply function on review column
    train_file['review'] = train_file['review'].apply(simple_stemmer)
    test_file['review'] = test_file['review'].apply(simple_stemmer)

    stop=set(stopwords.words('english'))
    print(stop)

    # Remove the stopwords
    def remove_stopwords(text, is_lower_case=False):
        tokens = tokenizer.tokenize(text)
        tokens = [token.strip() for token in tokens]
        if is_lower_case:
            filtered_tokens = [token for token in tokens if token not in stopword_list]
        else:
            filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
        filtered_text = ' '.join(filtered_tokens)
        return filtered_text

    # Apply function on review column
    train_file['review'] = train_file['review'].apply(remove_stopwords)
    test_file['review'] = test_file['review'].apply(remove_stopwords)


    X_train = train_file['review']
    X_test = test_file['review']
    
    return X_train, X_test

In [43]:
def extract_bag_of_words_train_test(train_file, test_file):
    # Accept either CSV paths (as the test functions pass) or loaded DataFrames
    if isinstance(train_file, str):
        train_file = pd.read_csv(train_file)
    if isinstance(test_file, str):
        test_file = pd.read_csv(test_file)

    # Clean the raw reviews (note: this modifies the DataFrames in place)
    X_train, X_test = normalize_data(train_file, test_file)

    # Count vectorizer for bag of words, capped at the 5000 most frequent terms
    cv = CountVectorizer(analyzer="word", tokenizer=None, max_features=5000)
    # Learn the vocabulary on the training reviews only
    X_train = cv.fit_transform(X_train)
    # Reuse the training vocabulary for the test reviews
    X_test = cv.transform(X_test)

    # Encode the sentiment labels as 0/1
    lb = LabelBinarizer()
    y_train = lb.fit_transform(train_file['sentiment'])
    # Reuse the fitted encoder so both splits share the same label mapping
    y_test = lb.transform(test_file['sentiment'])

    return X_train, y_train, X_test, y_test
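
The fit-on-train, transform-on-test pattern that the vectorizer follows above can be illustrated on a few made-up reviews. This is a sketch with hypothetical toy data (`toy_train`, `toy_test`); it shows that the test matrix inherits the training vocabulary, so no dimensions need to be hardcoded and words unseen during training are simply ignored. Reusing a fitted `LabelBinarizer` in the same way keeps the label encoding consistent across splits.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy reviews, for illustration only
toy_train = ["great film great cast", "dull plot", "great plot"]
toy_test = ["great unseen word", "dull dull film"]

toy_cv = CountVectorizer()
toy_train_bow = toy_cv.fit_transform(toy_train)  # learns the vocabulary
toy_test_bow = toy_cv.transform(toy_test)        # reuses that vocabulary

toy_lb = LabelBinarizer()
toy_y_train = toy_lb.fit_transform(["positive", "negative", "positive"])
toy_y_test = toy_lb.transform(["positive", "negative"])  # same 0/1 mapping

print(sorted(toy_cv.vocabulary_))   # vocabulary learned from the training text
print(toy_test_bow.toarray())       # one column per training-vocabulary word
print(toy_y_test.ravel())
```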

In [44]:
X_train, y_train, X_test, y_test = extract_bag_of_words_train_test(train_file, test_file)

{'ma', 'wouldn', 'their', 'off', 'its', 'not', 'or', 'between', 'mustn', 'have', 't', "didn't", 'myself', 'are', 'll', "you'd", 'about', 'yourself', 'now', "weren't", 'does', 'her', 'down', 'he', "that'll", 'me', 'to', 'more', 'being', 'how', 'shan', 'our', 'through', 'when', "mightn't", 'shouldn', "wouldn't", 'again', 'those', 'any', 'no', 'should', 've', "it's", 'just', 'ourselves', 'only', 'can', 'further', 'd', 'doesn', 'all', 'in', 'by', 're', 'than', "couldn't", 'few', "she's", 'doing', 'who', "hasn't", 'isn', 'where', 'each', "don't", 'whom', "aren't", 'itself', 'needn', 'ours', 'the', 'why', 'won', "haven't", 'such', 'before', 's', 'too', 'had', 'there', 'own', 'above', 'mightn', 'out', 'himself', 'up', 'some', 'were', 'until', 'that', 'under', "shouldn't", 'with', 'be', 'most', 'very', 'these', "you'll", 'a', "wasn't", 'yourselves', 'after', 'both', 'other', 'will', "you're", 'my', 'ain', 'having', 'hasn', 'during', 'at', 'your', 'which', 'of', 'and', "hadn't", 'we', 'them', '

In [63]:
from sklearn import tree
from sklearn.metrics import zero_one_loss
import numpy as np

# Shallow decision tree as a simple baseline
decision_tree = tree.DecisionTreeClassifier(max_depth=2, random_state=456).fit(X_train, y_train)
# Bias estimate: 0-1 loss on the training set
predictions = decision_tree.predict(X_train)
print("Algorithm: Decision Tree")
print("0-1 Loss: " + str(np.round(zero_one_loss(y_train.ravel(), predictions), 4)))

In [50]:
# Let's try with the GRADIENT BOOSTING CLASSIFIER

from sklearn.ensemble import GradientBoostingClassifier
clf=GradientBoostingClassifier(learning_rate = 0.5, max_depth = 2, max_features = 5000, n_estimators=300,random_state= 0)
clf.fit(X_train,y_train.ravel())
clf.score(X_test ,y_test.ravel())

0.8493333333333334

In [51]:
from sklearn.model_selection import GridSearchCV
mod=GridSearchCV(clf,param_grid={'n_estimators': [80,100]})
mod.fit(X_train,y_train.ravel())

GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=0.5,
                                                  max_depth=2,
                                                  max_features=5000,
                                                  n_estimators=300,
                                                  random_state=0),
             param_grid={'n_estimators': [80, 100]})
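
After a grid search, the selected settings do not have to be copied by hand, because the fitted search object exposes them. The sketch below uses synthetic data from `make_classification` and a small hypothetical grid purely for illustration; `best_params_`, `best_score_` and `best_estimator_` are attributes of scikit-learn's `GridSearchCV`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the bag-of-words features (assumption for illustration)
toy_X, toy_y = make_classification(n_samples=200, n_features=10, random_state=0)

toy_search = GridSearchCV(
    GradientBoostingClassifier(max_depth=2, random_state=0),
    param_grid={"n_estimators": [10, 20], "learning_rate": [0.25, 0.5]},
    cv=3,
)
toy_search.fit(toy_X, toy_y)

print(toy_search.best_params_)         # the winning parameter combination
print(toy_search.best_score_)          # its mean cross-validated accuracy
best_clf = toy_search.best_estimator_  # already refit on all of toy_X
print(best_clf.score(toy_X, toy_y))
```

Using `best_estimator_` directly avoids retyping the parameters and accidentally evaluating a different model.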

In [52]:
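# Note: this constructor call is not assigned to a variable, so `clf` below is
# still the model from In [50] and the score is unchanged. Write
# `clf = GradientBoostingClassifier(...)` to evaluate these settings instead.
GradientBoostingClassifier(learning_rate=0.25, max_depth=2, max_features=5000,
                           n_estimators=80, random_state=0)
clf.fit(X_train,y_train.ravel())
clf.score(X_test,y_test.ravel())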

0.8493333333333334

In [78]:
# AdaBoost for classification
from sklearn.ensemble import AdaBoostClassifier

# define the model
model = AdaBoostClassifier(n_estimators=400, learning_rate=0.5, random_state=0)
# fit the model on the training set
model.fit(X_train, y_train.ravel())
# report test accuracy, per-class metrics and the confusion matrix
print(model.score(X_test, y_test))
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

0.8513333333333334
              precision    recall  f1-score   support

           0       0.85      0.85      0.85       731
           1       0.85      0.86      0.86       769

    accuracy                           0.85      1500
   macro avg       0.85      0.85      0.85      1500
weighted avg       0.85      0.85      0.85      1500

Confusion Matrix:
[[618 113]
 [110 659]]


In [109]:
# cross-validated and held-out evaluation of an XGBoost model
from numpy import mean, std
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.metrics import accuracy_score

# repeated stratified 10-fold cross-validation with depth-1 trees
model = XGBClassifier(use_label_encoder= False, max_depth = 1, verbosity = 0, tree_method = 'hist')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X_train, y_train.ravel(), scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# fit a deeper model on the full training set
model = XGBClassifier(n_estimators = 260, use_label_encoder= False,  max_depth = 10, verbosity = 0, tree_method = 'hist')
model.fit(X_train, y_train.ravel())

# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 0.807 (0.016)
Accuracy: 84.07%


In [65]:
# gradient boosting for classification in scikit-learn
from numpy import mean
from numpy import std
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
# evaluate the model with repeated stratified 10-fold cross-validation
model = GradientBoostingClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X_train, y_train.ravel(), scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# fit the model on the full training set
model = GradientBoostingClassifier()
model.fit(X_train, y_train.ravel())
# evaluate predictions on the test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 0.802 (0.017)
Accuracy: 81.33%


In [2]:
class SVMClassifier:
    def __init__(self):
        import numpy as np
        from sklearn import svm
        #implement initialisation
        self.some_paramter=1
    # define your own kernel here
    # Refer to the documentation here: https://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html
    def fit(self, X,y):
        # training of the SVM
        # Ensure you call your own defined kernel here
        return
    def predict(self, X):
        # prediction routine for the SVM
        return    

### Test function that will be called to evaluate your code. A separate test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. 

In [3]:
def test_func_svm(dataset_train, dataset_test):
    from sklearn.metrics import accuracy_score  
    (X_train, Y_train, X_test, Y_test) = extract_bag_of_words_train_test(dataset_train, dataset_test)
    sc = SVMClassifier()
    sc.fit(X_train, Y_train)
    Y_Pred = sc.predict(X_test)
    acc = accuracy_score(Y_test, Y_Pred)
    print("Accuracy:",acc)
    return acc

In [None]:
acc = test_func_svm("movie_review_train.csv", "movie_review_test.csv")

#### Task 2

In this task you need to implement a boosting-based classifier that can be used to classify the movie reviews. 

Details regarding the marking for the coursework are provided in the coursework specification file. Please ensure that your code will work with a different test file than the one provided with the coursework.

Note that the boosting classifier you implement can include decision trees from scikit-learn or your own decision trees. Use the same sentiment analysis dataset for evaluation.
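
One common way to build such a classifier is discrete AdaBoost over depth-1 scikit-learn trees. The block below is a minimal sketch under that assumption, not the required solution: the class name `ToyAdaBoost`, its `n_rounds` parameter and the synthetic data are hypothetical, and other boosting schemes are equally valid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


class ToyAdaBoost:
    """Discrete AdaBoost with decision stumps for labels in {0, 1}."""

    def __init__(self, n_rounds=20):
        self.n_rounds = n_rounds
        self.stumps = []
        self.alphas = []

    def fit(self, X, y):
        self.stumps, self.alphas = [], []
        y_signed = 2 * np.asarray(y).ravel() - 1   # map {0, 1} to {-1, +1}
        n = y_signed.shape[0]                      # sample count read from the data
        weights = np.full(n, 1.0 / n)              # start with uniform weights
        for _ in range(self.n_rounds):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y_signed, sample_weight=weights)
            pred = stump.predict(X)
            # Weighted error, clipped so the logarithm below stays finite
            err = np.clip(np.sum(weights[pred != y_signed]), 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)
            # Increase the weight of misclassified samples, then renormalise
            weights *= np.exp(-alpha * y_signed * pred)
            weights /= weights.sum()
            self.stumps.append(stump)
            self.alphas.append(alpha)
        return self

    def predict(self, X):
        # Weighted vote of all stumps, mapped back to {0, 1}
        votes = sum(a * s.predict(X) for a, s in zip(self.alphas, self.stumps))
        return (votes > 0).astype(int)


# Synthetic data standing in for the bag-of-words features (illustration only)
toy_X, toy_y = make_classification(n_samples=300, n_features=8, random_state=0)
toy_model = ToyAdaBoost(n_rounds=20).fit(toy_X[:200], toy_y[:200])
toy_acc = np.mean(toy_model.predict(toy_X[200:]) == toy_y[200:])
print(toy_acc)
```

Because `fit` flattens `y` and reads the sample count from the data, the sketch accepts the column-vector labels produced by `LabelBinarizer` without any hardcoded sizes.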

In [4]:
class BoostingClassifier:
    # You need to implement this classifier. 
    def __init__(self):
        import numpy as np
        #implement initialisation
        self.some_paramter=1
    def fit(self, X,y):
        from sklearn.tree import DecisionTreeClassifier
        import numpy as np
        #implement training of the boosting classifier
        return 
    def predict(self, X):
        # implement prediction of the boosting classifier
        return

### Test function that will be called to evaluate your code. A separate test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. 

In [5]:
def test_func_boosting(dataset_train, dataset_test):
    from sklearn.metrics import accuracy_score    
    (X_train, Y_train, X_test, Y_test) = extract_bag_of_words_train_test(dataset_train, dataset_test)
    bc = BoostingClassifier()
    bc.fit(X_train, Y_train)
    Y_Pred = bc.predict(X_test)    
    acc = accuracy_score(Y_test, Y_Pred)
    return acc

In [None]:
acc = test_func_boosting("movie_review_train.csv", "movie_review_test.csv")