# Theory (15%)

## Question 1

*What is the difference between a rule-based system and a machine learning
system? (5%)*

A rule-based system applies a set of pre-defined rules to data, and the outputs it produces are known as the answers. The system does not learn from the input provided; it simply executes the rules as defined. For example: if a person's height is above 180 cm, label them 'tall'; otherwise label them 'small'.

A machine learning system is trained, rather than explicitly programmed, by providing it with data and, in the supervised case, the corresponding answers. The system looks for statistical structure in this input and outputs the rules to be used. For example, predict the height of school children in Cardiff by learning from data that provides the age and height of school children in Swansea.
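The contrast can be sketched in a few lines of Python. This is only an illustration: the ages and heights are invented, and the names `label_height` and `predict_height` are hypothetical. The first function is a rule written by hand; the second is a straight line whose slope and intercept are estimated from examples by least squares.

```python
# Rule-based: the rule is written by hand and never changes.
def label_height(height_cm):
    return "tall" if height_cm > 180 else "small"

# Machine learning: the rule (here a straight line) is estimated from examples.
# Hypothetical (age in years, height in cm) pairs, invented for illustration.
ages = [5, 7, 9, 11, 13]
heights = [110, 122, 134, 146, 158]

n = len(ages)
mean_age = sum(ages) / n
mean_height = sum(heights) / n

# Closed-form least squares estimates for a single predictor.
slope = (
    sum((a - mean_age) * (h - mean_height) for a, h in zip(ages, heights))
    / sum((a - mean_age) ** 2 for a in ages)
)
intercept = mean_height - slope * mean_age

def predict_height(age):
    return intercept + slope * age

print(label_height(185))
print(predict_height(10))
```

Changing the training pairs changes `slope` and `intercept` without touching the code, whereas changing the behaviour of `label_height` requires editing the rule itself.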

## Question 2

*What is the difference between unsupervised and supervised learning? (5%)*

Supervised learning is when a machine learning system is provided with the output variable as well as the input variables. By providing the system with the 'answer', it is able to learn the statistical characteristics of an observation that lead to a particular output (continuous or categorical). The resulting model can then be used to predict future, unknown outputs. For example, predicting height given age by providing the height and age of school children.

Unsupervised learning is when a machine learning system is provided only with input variables; no output variables are supplied. The system can't predict an outcome, as no 'answer' is supplied. Instead, it has to find structure in the data itself, for example by assigning each observation to one of a pre-defined number of groups. To do this it looks for statistical similarities between the observations. For example, grouping animals that are similar given information about their size and speed.
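As a rough sketch of the animal example, the code below runs a very small k-means clustering with two groups. The measurements are invented, the helper names `closest` and `k_means` are hypothetical, and the starting centres are simply the first and last points; no labels are given to the algorithm at any stage.

```python
import math

# Hypothetical (size in kg, top speed in km/h) measurements, invented for illustration.
animals = {
    "mouse": (0.02, 13), "rabbit": (2, 48), "cat": (4, 48),
    "horse": (500, 70), "cow": (700, 40), "bison": (800, 55),
}

def closest(point, centres):
    """Index of the centre nearest to the point (Euclidean distance)."""
    return min(range(len(centres)), key=lambda i: math.dist(point, centres[i]))

def k_means(points, centres, iterations=10):
    for _ in range(iterations):
        # Assignment step: put each point in the group of its nearest centre.
        groups = [[] for _ in centres]
        for p in points:
            groups[closest(p, centres)].append(p)
        # Update step: move each centre to the mean of its group.
        centres = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centre
            for g, centre in zip(groups, centres)
        ]
    return centres

points = list(animals.values())
centres = k_means(points, centres=[points[0], points[-1]])
labels = {name: closest(p, centres) for name, p in animals.items()}
print(labels)
```

Because the features are not rescaled here, size dominates the distance; in practice the inputs would usually be standardised first.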

## Question 3

*What do we mean when we say that a machine learning system is overfitting? (5%)*

Overfitting occurs when a machine learning system produces a model that fits itself too specifically to the training data provided. This will often result in high accuracy on that one dataset, but low accuracy when applied to new, unseen data. A system that is overfitting overcomplicates the model: rather than capturing only the underlying pattern, it also fits the random noise that occurs in each observation.
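A small simulation can illustrate this. The sketch below assumes a made-up data-generating process (a straight line plus Gaussian noise) and compares a degree-1 polynomial with a degree-8 polynomial fitted by `numpy.polyfit`. The flexible model always fits the training points at least as closely, but its error on fresh data is typically worse; the exact numbers depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical process: y = 2x plus random noise."""
    x = rng.uniform(0, 1, n)
    y = 2 * x + rng.normal(0, 0.2, n)
    return x, y

x_train, y_train = make_data(12)    # small training set
x_test, y_test = make_data(200)     # unseen data from the same process

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 8):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE {results[degree][0]:.4f}, "
          f"test MSE {results[degree][1]:.4f}")
```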

# Practical (85%)

## Question 1

*Your algorithm gets the following results in a classification experiment. Please compute the precision, recall, f-measure and accuracy manually (without the help of your computer/Python, please provide all steps and formulas). Include the process to get to the final result. (20%)*

Id | Prediction | Gold
- | - | -
1 | True | True
2 | True | True
3 | False | True
4 | True | True
5 | False | True
6 | False | True
7 | True | True
8 | True | True
9 | True | True
10 | False | False
11 | False | False
12 | False | False
13 | True | False
14 | False | False
15 | False | False
16 | False | False
17 | False | False
18 | True | False
19 | True | False
20 | False | False

To compute the evaluation measures, we first need to construct the confusion matrix. This can be done by manually counting each combination of prediction and gold standard label, i.e. True/True, True/False, False/True and False/False.

Confusion Matrix | Predicted: True | Predicted: False | Total
- | - | - | -
Actual: True | 6 (TP) | 3 (FN) | 9
Actual: False | 3 (FP) | 8 (TN) | 11
Total | 9 | 11 | 20

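Where TP is the number of True Positives (predicted True, gold True), FP is the number of False Positives (predicted True, gold False), FN is the number of False Negatives (predicted False, gold True), and TN is the number of True Negatives (predicted False, gold False). Now we can compute the following measures: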

### Accuracy

Defined as the proportion of observations that are correctly classified.

$Accuracy = \dfrac{TP + TN}{TP + TN + FP + FN}$

$Accuracy = \dfrac{6 + 8}{6 + 8 + 3 + 3}$

$Accuracy = \dfrac{14}{20}$

$Accuracy = 0.7$

### Precision

Defined as the proportion of positive predictions that are correct.

$Precision = \dfrac{TP}{TP + FP}$

$Precision = \dfrac{6}{6 + 3}$

$Precision = \dfrac{2}{3}$

$Precision = 0.667$ $(3dp)$

### Recall

Defined as the proportion of positive cases that were predicted as positive.

$Recall = \dfrac{TP}{TP + FN}$

$Recall = \dfrac{6}{6 + 3}$

$Recall = \dfrac{2}{3}$

$Recall = 0.667$ $(3dp)$

### F-Measure

Defined as the harmonic mean of precision and recall. It is used to evaluate the performance of an algorithm when neither precision alone nor recall alone is the priority, since it is high only when both are high.

$F_1 = 2 \times \dfrac{precision \times recall}{precision + recall}$

$F_1 = 2 \times \dfrac{\dfrac{2}{3} \times \dfrac{2}{3}}{\dfrac{2}{3} + \dfrac{2}{3}}$

$F_1 = 2 \times \dfrac{\dfrac{4}{9}}{\dfrac{4}{3}}$

$F_1 = 2 \times \dfrac{1}{3}$

$F_1 = \dfrac{2}{3}$

$F_1 = 0.667$ $(3dp)$
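The question asks for manual working, so the following is only a cross-check of the arithmetic above. It re-enters the Prediction and Gold columns of the table by hand, counts the four outcomes, and applies the same formulas.

```python
# Prediction and Gold columns copied from the table, ids 1 to 20.
predictions = [True, True, False, True, False, False, True, True, True,
               False, False, False, True, False, False, False, False, True, True, False]
gold = [True] * 9 + [False] * 11

pairs = list(zip(predictions, gold))
tp = sum(p and g for p, g in pairs)            # predicted True, gold True
fp = sum(p and not g for p, g in pairs)        # predicted True, gold False
fn = sum(not p and g for p, g in pairs)        # predicted False, gold True
tn = sum(not p and not g for p, g in pairs)    # predicted False, gold False

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)
print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```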

## Question 2

*You are given a dataset (named Wine dataset) with different measured properties of different wines (dataset available in Learning Central). Your goal is to develop a machine learning model to predict the quality of an unseen wine given these properties. Train two machine learning regression models and check their performance. Write, for each of the models, the main Python instructions to train and predict the labels (one line each, no need to include any data preprocessing) and the performance in the test set in terms of Root Mean Squared Error (RMSE) (30%)* 

### Load necessary modules 

In [4]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from math import sqrt

### Read in training and test data

In [29]:
wine_train = pd.read_csv("Data/Wine/wine_train.csv", sep=';')
wine_test = pd.read_csv("Data/Wine/wine_test.csv", sep=';')

### Model 1

Least squares linear regression will be used to predict wine quality for the first model. It will use only one predictor variable: residual sugar. To begin, split the training set into the input variable (residual sugar) and the output variable (wine quality), then do the same for the test set.

In [22]:
wine_train_x_m1 = wine_train[['residual sugar']]
wine_train_y = wine_train['quality']
wine_test_x_m1 = wine_test[['residual sugar']]
wine_test_y = wine_test['quality']

Start by fitting a linear regression model to the training data. Least squares regression chooses the coefficients that minimise the sum of squared differences between the model's predictions and the observed values. As the model is given the output variable (quality), this is a supervised machine learning model. Once trained, the `.predict` method can be called to predict the quality of a new batch of wines given their residual sugar levels. Finally, the root mean square error (RMSE) provides a measure of the model's performance and can be used to compare its accuracy with that of other models.

In [23]:
lin_reg_m1 = LinearRegression()
lin_reg_m1.fit(wine_train_x_m1, wine_train_y)
predictions_m1 = lin_reg_m1.predict(wine_test_x_m1)
rmse_m1 = sqrt(mean_squared_error(wine_test_y, predictions_m1))
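For reference, the quantity computed in the last line of the cell is $RMSE = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, where $y_i$ is the observed quality and $\hat{y}_i$ the predicted quality. A minimal sketch of the same calculation in plain Python follows; the function name `rmse` and the four quality values are invented for illustration and are not taken from the wine data.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical observed and predicted quality scores.
true_quality = [5, 6, 7, 5]
predicted_quality = [5.5, 6.0, 6.0, 5.5]

example_rmse = rmse(true_quality, predicted_quality)
print(round(example_rmse, 3))
```

Because the errors are squared before averaging, one large miss (the wine of quality 7 predicted as 6) contributes more than several small ones.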

### Model 2

Least squares linear regression will be used to predict wine quality for the second model. It will use two predictor variables: alcohol and volatile acidity.

In [24]:
wine_train_x_m2 = wine_train[['alcohol', 'volatile acidity']]
wine_test_x_m2 = wine_test[['alcohol', 'volatile acidity']]

In [25]:
lin_reg_m2 = LinearRegression()
lin_reg_m2.fit(wine_train_x_m2, wine_train_y)
predictions_m2 = lin_reg_m2.predict(wine_test_x_m2)
rmse_m2 = sqrt(mean_squared_error(wine_test_y, predictions_m2))

### Compare models

In [27]:
print('Root Mean Square Error for Model 1 \n----------')
print(round(rmse_m1, 3))
print('\n Root Mean Square Error for Model 2 \n----------')
print(round(rmse_m2, 3))

Root Mean Square Error for Model 1 
----------
0.886

 Root Mean Square Error for Model 2 
----------
0.788


### Conclusion

Comparing the root mean square error (RMSE) of the two models indicates that model 2 (RMSE 0.788) outperformed model 1 (RMSE 0.886) on the test set. RMSE measures the typical size of the difference between the predictions and the 'true' values, so a smaller RMSE indicates a more accurate model. That is, using alcohol and volatile acidity as predictors gives better predictions of wine quality than using residual sugar alone.

## Question 3

*Train an SVM binary classifier using the Hateval dataset (available in Learning
Central). The task consists of predicting whether a tweet represents hate speech
or not. You can preprocess and choose the features freely. Evaluate the
performance of your classifier in terms of accuracy using 10-fold cross-validation.
Write a table with the results of the classifier (accuracy, precision, recall and
F-measure) in each of the folds and write a small summary (up to 500 words) of
how you preprocessed the data, chose the feature/s, and trained and evaluated
your model (35%)*

### Load modules

In [161]:
import nltk
import sklearn.svm  # load the svm submodule explicitly so sklearn.svm.SVC resolves
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

### Inspect data

In [135]:
hateval = pd.read_csv("Data/Hateval/hateval.tsv", delimiter='\t')
hateval.head()

Unnamed: 0,id,text,label
0,201,"Hurray, saving us $$$ in so many ways @potus @...",1
1,202,Why would young fighting age men be the vast m...,1
2,203,@KamalaHarris Illegals Dump their Kids at the ...,1
3,204,NY Times: 'Nearly All White' States Pose 'an A...,0
4,205,Orban in Brussels: European leaders are ignori...,0


In [136]:
print("Total no of tweets \n------")
print(hateval.label.count())
print("\nOf which considered hate speech \n------")
print(hateval.label.sum())

Total no of tweets 
------
9000

Of which considered hate speech 
------
3783


### Split data into training and test set

In [137]:
hateval_train, hateval_test = train_test_split(hateval, train_size=0.8, random_state = 42)

In [138]:
print(hateval_train.shape)
print(hateval_test.shape)

(7200, 3)
(1800, 3)


### Check proportions are similar between training and test

In [169]:
hateval_train['label'].value_counts() / len(hateval_train['label'])

0    0.577361
1    0.422639
Name: label, dtype: float64

In [171]:
hateval_test['label'].value_counts() / len(hateval_test['label'])

0    0.588889
1    0.411111
Name: label, dtype: float64
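The proportions above happen to be close, but a plain random split does not guarantee this. As a hedged aside, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below uses invented labels (40 positive, 60 negative) rather than the Hateval data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 40 positive and 60 negative examples.
labels = np.array([1] * 40 + [0] * 60)
rows = np.arange(len(labels))          # stand-in for the tweet rows

rows_train, rows_test, y_train_s, y_test_s = train_test_split(
    rows, labels, train_size=0.8, random_state=42, stratify=labels
)

# Share of positive labels in each part.
print(y_train_s.mean(), y_test_s.mean())
```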

### Create a list of tokens that have been lemmatized and made lower case

In [139]:
lemmatizer = nltk.stem.WordNetLemmatizer()

In [140]:
def get_tokens_vocab(df):
    """Return every lemmatized, lower-cased token found in df['text']."""
    list_tokens = []
    for index, row in df.iterrows():
        # split each tweet into sentences, then each sentence into word tokens
        for sentence in nltk.tokenize.sent_tokenize(row['text']):
            for token in nltk.tokenize.word_tokenize(sentence):
                list_tokens.append(lemmatizer.lemmatize(token).lower())
    return list_tokens

In [141]:
def get_tokens(string):
    """Return the lemmatized, lower-cased tokens of a single string."""
    list_tokens = []
    # split the string into sentences, then each sentence into word tokens
    for sentence in nltk.tokenize.sent_tokenize(string):
        for token in nltk.tokenize.word_tokenize(sentence):
            list_tokens.append(lemmatizer.lemmatize(token).lower())
    return list_tokens

### Create set of stopwords that will be removed later

In [142]:
# take set of stopwords from nltk
stopwords = set(nltk.corpus.stopwords.words('english'))
# manually add punctuation and clitic tokens produced by the tokenizer
stopwords.update([".", ",", "--", "``", "#", "@", ":", "'s", "’", "...", "n't", "'", "-", ";"])

### Sort tokens in dictionary to include the top n most used words

In [143]:
def sort_tokens(tokens, n):
    dict_word_freq={}
    for token in tokens:
        if token in stopwords: continue
        elif token not in dict_word_freq: dict_word_freq[token]=1
        else: dict_word_freq[token]+=1
    sorted_tokens = sorted(dict_word_freq.items(), key=lambda x: x[1], reverse=True)[:n]
    return(sorted_tokens)
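The same counting can be written more compactly with `collections.Counter` from the standard library. This is an alternative sketch rather than the code used for the results here; the function name `top_tokens`, the example tokens and the one-word stopword set are hypothetical.

```python
from collections import Counter

def top_tokens(tokens, stopwords, n):
    """Return the n most frequent non-stopword tokens as (token, count) pairs."""
    counts = Counter(token for token in tokens if token not in stopwords)
    return counts.most_common(n)

# Hypothetical tokens and stopwords, invented for illustration.
example_tokens = ["the", "wall", "build", "wall", "the", "wall", "build", "!"]
example_stopwords = {"the"}

top_two = top_tokens(example_tokens, example_stopwords, 2)
print(top_two)
```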

### Return the vocabulary to be used as features in SVM

In [144]:
def get_vocabulary(sorted_tokens):
    vocabulary=[]
    for word,frequency in sorted_tokens:
        vocabulary.append(word)
    return(np.asarray(vocabulary))

### Run `hateval` data through functions

In [145]:
hateval_tokens = get_tokens_vocab(hateval_train)
hateval_tokens_sorted = sort_tokens(hateval_tokens, 5)
hateval_vocab = get_vocabulary(hateval_tokens_sorted)

### Return features as an array of counts

In [146]:
def get_features(df, vocabulary):
    features_array=[]
    for index, row in df.iterrows():
        tokens=get_tokens(row['text'])
        features=np.zeros(len(vocabulary))
        for i, word in enumerate(vocabulary):
            if word in tokens:
                features[i]=tokens.count(word)
        features_array.append(features)
    return np.asarray(features_array)
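To make the feature representation concrete, the sketch below builds the count vector for one tokenised tweet. The vocabulary and tokens are invented, and `count_features` is a hypothetical helper that counts each tweet once with `Counter` instead of calling `tokens.count` for every vocabulary word.

```python
from collections import Counter

def count_features(tokens, vocabulary):
    """Number of times each vocabulary word occurs in the token list."""
    counts = Counter(tokens)                    # words not seen count as 0
    return [counts[word] for word in vocabulary]

vocabulary = ["wall", "build", "border"]            # hypothetical vocabulary
tokens = ["build", "the", "wall", "wall", "now"]    # hypothetical tokenised tweet

features = count_features(tokens, vocabulary)
print(features)
```

Each tweet becomes one row of this form, so a vocabulary of `n` words gives a feature matrix with `n` columns.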

In [150]:
x_train = get_features(hateval_train, hateval_vocab)
x_test = get_features(hateval_test, hateval_vocab)
y_train = np.asarray(hateval_train['label'])
y_test = np.asarray(hateval_test['label'])

In [148]:
svm_clf = sklearn.svm.SVC(kernel="linear", gamma='auto')  # gamma has no effect with a linear kernel
svm_clf.fit(x_train,y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
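The cell above fits the classifier on a single 80/20 split, whereas the task statement also asks for 10-fold cross-validation with per-fold accuracy, precision, recall and F-measure. One possible way to obtain such a table is scikit-learn's `cross_validate`, sketched below on synthetic data from `make_classification`. The data here is a stand-in; in the notebook the feature matrix and labels for the full dataset would be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the tweet features and labels.
X, y = make_classification(n_samples=300, n_features=5, random_state=42)

metrics = ["accuracy", "precision", "recall", "f1"]
# With an integer cv and a classifier, the folds are stratified by label.
scores = cross_validate(SVC(kernel="linear"), X, y, cv=10, scoring=metrics)

print("fold  " + "  ".join(metrics))
for fold in range(10):
    row = [f"{scores['test_' + m][fold]:.3f}" for m in metrics]
    print(fold + 1, *row)
```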

In [158]:
y_test_predictions = svm_clf.predict(x_test)

In [162]:
print(confusion_matrix(y_test, y_test_predictions))

[[618 442]
 [199 541]]
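scikit-learn's `confusion_matrix` puts the true labels in rows and the predicted labels in columns, ordered 0 then 1, so the output above reads as `[[TN, FP], [FN, TP]]`. As a short sketch, the four evaluation measures for the hate speech class (label 1) can be derived from those counts with the same formulas used in Practical Question 1.

```python
# Counts copied from the confusion matrix printed above.
matrix = [[618, 442],
          [199, 541]]
(tn, fp), (fn, tp) = matrix

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

On this single test split the classifier is correct for about 64% of tweets, with recall (about 0.73) noticeably higher than precision (about 0.55).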
