# Assignment Week 2

#### Arianna Bisazza and Prajit Dhar 
#### Introduction to Neural Networks
#### BSc Information Science, University of Groningen, September 2021


Please enter your name and student ID below -

Name - Felix Zailskas

Student ID - S3918270

In this week's Assignment, you will -

- Load and analyse an NLP dataset on Spam classification

- Create three features of your choice to solve the task

- Use these three features in the *LogRegClassifier* from the Lab, and try to improve upon the majority baseline

**IMPORTANT: You must complete the Lab Exercise of Week 2 before starting this Assignment**

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.__version__

'0.11.2'

### LOADING DATA

Load the dataset *youtube-comments.csv*.

The dataset is available on Nestor > Week2. Download it and place it in the same folder as this notebook.

If loaded correctly, you will see a dataset of 2 columns, *COMMENT* and *CLASS*, where

*COMMENT* consists of comments from the YouTube videos of 5 musical artists, and

*CLASS* denotes whether the comment is spam (1) or not (0).

For more information on the dataset, see this [link](https://archive-beta.ics.uci.edu/ml/datasets/YouTube%20Spam%20Collection).

### Task 1: Load the dataset

Also provide basic statistics on the dataset, such as the number of rows and columns and whether it is a *balanced* dataset or not.

In [2]:
# Load dataset from .csv using pandas dataframe
df = pd.read_csv("youtube-comments.csv", sep=",")
# Visualize
df
# Check proportion of spam to non spam comments
amt_spam = df["CLASS"].sum()
amt_no_spam = len(df) - amt_spam
print("#Spam: ", amt_spam,", #No Spam: ", amt_no_spam," Proportion: ", amt_spam / len(df))

#Spam:  841 , #No Spam:  919  Proportion:  0.4778409090909091
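
The task also asks for the number of rows and columns and for the class balance. A minimal sketch of how these could be read off with pandas follows; the small `toy` frame is a made-up stand-in for `df`, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe (same column names as the dataset)
toy = pd.DataFrame({
    "COMMENT": ["check out my channel", "nice song", "subscribe plz", "love it", "great video"],
    "CLASS": [1, 0, 1, 0, 0],
})

# Number of rows and columns
n_rows, n_cols = toy.shape

# Share of each class; values near 0.5 indicate a balanced dataset
class_share = toy["CLASS"].value_counts(normalize=True)

print("rows:", n_rows, "columns:", n_cols)
print(class_share)
```

Applied to `df`, the spam proportion of about 0.478 printed above would point to a roughly balanced dataset.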


### Task 2: Create 3 Features

Based on the dataset and on what you already know about spam comments, create 3 features from these comments.

Example features are: length of the comment (in characters), length of the comments (in words), presence of the word "Hey". 
These features are implemented in the code below, and are **not** acceptable choices for the assignment.

In general, your features can be boolean (binary), categorical or numerical, but for this exercise we recommend using binary values as that makes it easier to manually set the feature weights.

Possible ideas include detecting the presence of a URL in the comment, or counting the number of pronouns in the text (possible binary feature: are there more than N pronouns?), or the number of capitalized words etc.

Tip: Rather than using a for loop, you can use **str** methods, as seen [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html) and [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html).
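
As a rough illustration of this tip, the sketch below builds two of the suggested features with vectorised `str` methods instead of a loop. The series `comments` and the names `has_url`, `n_caps_words` and `many_caps` are made up for this example, and the threshold of one capitalised word is an arbitrary choice.

```python
import pandas as pd

comments = pd.Series([
    "CHECK OUT my channel http://example.com",
    "I love this song",
    "SUBSCRIBE NOW please",
])

# Binary feature: does the comment contain a URL prefix?
has_url = comments.str.contains(r"https?://", regex=True)

# Numerical feature: number of words written fully in capitals (two or more letters)
n_caps_words = comments.str.count(r"\b[A-Z]{2,}\b")

# Binary version of the numerical feature: more than one capitalised word?
many_caps = n_caps_words > 1

print(pd.DataFrame({"has_url": has_url, "n_caps_words": n_caps_words, "many_caps": many_caps}))
```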

Should you have difficulties working with Pandas, please reach out to us during the lab sessions.

In [3]:
# EXAMPLE FEATURES

df['LEN']=df.COMMENT.str.len()
df['NWORDS']=df.COMMENT.str.count(' ') + 1
df['HAS_WORD_HEY']=df.COMMENT.str.contains('Hey')
df

Unnamed: 0,COMMENT,CLASS,LEN,NWORDS,HAS_WORD_HEY
0,"Huh, anyway check out this you[tube] channel: ...",1,56,8,False
1,Hey guys check out my new channel and our firs...,1,166,32,True
2,just for test I have to say murdev.com,1,38,8,False
3,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1,48,11,False
4,watch?v=vtaRGgvGtWQ Check this out .﻿,1,39,7,False
...,...,...,...,...,...
1755,well done shakira,0,17,3,False
1756,I love this song because we sing it at Camp al...,0,58,13,False
1757,I love this song for two reasons: 1.it is abou...,0,93,18,False
1758,Shakira u are so wiredo,0,23,5,False


In [4]:
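# IMPLEMENT YOUR FEATURES HERE:
# Raw strings keep the regex escape '\?' from being read as a Python string escape.
# Link: the comment contains a URL prefix or a YouTube video reference
df["Link"]=df.COMMENT.str.contains(r'http://|https://|watch\?v=', regex=True)
# Promote: "check ... out" or "my ... channel" (str.contains searches anywhere, so no leading '.*?' is needed)
df["Promote"]=df.COMMENT.str.contains(r'check.*?out|my.*?channel',regex=True)
# Like: asks for a subscription, like or comment (matching is case-sensitive)
df["Like"]=df.COMMENT.str.contains(r'subscribe|like|comment', regex=True)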
df

Unnamed: 0,COMMENT,CLASS,LEN,NWORDS,HAS_WORD_HEY,Link,Promote,Like
0,"Huh, anyway check out this you[tube] channel: ...",1,56,8,False,False,True,False
1,Hey guys check out my new channel and our firs...,1,166,32,True,False,True,True
2,just for test I have to say murdev.com,1,38,8,False,False,False,False
3,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1,48,11,False,False,True,False
4,watch?v=vtaRGgvGtWQ Check this out .﻿,1,39,7,False,True,False,False
...,...,...,...,...,...,...,...,...
1755,well done shakira,0,17,3,False,False,False,False
1756,I love this song because we sing it at Camp al...,0,58,13,False,False,False,False
1757,I love this song for two reasons: 1.it is abou...,0,93,18,False,False,False,False
1758,Shakira u are so wiredo,0,23,5,False,False,False,False


### Task 3: Justify your choice of features


Why did you choose these features for this task? Explain below.
There are no right or wrong answers here.

The features I used are the following:
1. The presence of a link in the comment.<br>
Links are usually an indication of spam, and many platforms do not allow sharing them. To detect links I search each comment for the substrings "http://", "https://" and "watch?v=".
2. Promotion of a channel.<br>
Promoting a different channel is usually considered spam. To detect this I search each comment for "check" followed later by "out", or "my" followed later by "channel", so that phrases such as "check out" and "my new channel" are both matched.
3. Asking for likes.<br>
Asking for likes, comments or subscriptions is generally considered spam. I check each comment for occurrences of "like", "comment" and "subscribe", since those are the verbs used in such requests.
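
One property of these patterns is that `str.contains` is case-sensitive by default. The sketch below, using two comments from the dataset, suggests that passing `case=False` would also catch capitalised variants. The `_ci` names are hypothetical, and this variant was not used for the results in this notebook.

```python
import pandas as pd

comments = pd.Series([
    "watch?v=vtaRGgvGtWQ Check this out .",
    "Subscribe to me plz plz plz plz plz plZ",
])

promote_pat = r".*?check.*?out|.*?my.*?channel"
like_pat = r"subscribe|like|comment"

# Case-sensitive matching (the default) misses "Check" and "Subscribe"
promote = comments.str.contains(promote_pat, regex=True)
like = comments.str.contains(like_pat, regex=True)

# Case-insensitive matching catches both
promote_ci = comments.str.contains(promote_pat, case=False, regex=True)
like_ci = comments.str.contains(like_pat, case=False, regex=True)

print(promote.tolist(), promote_ci.tolist())
print(like.tolist(), like_ci.tolist())
```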

### LogRegClassifier definition

The LogRegClassifier and the sigmoid function from the lab are declared below.

In [5]:
def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

class LogRegClassifier:
    
    def __init__(self, n):
        self.weights = np.zeros(n)
        self.bias = 0.0

        
    def set_weights(self, vals):
        self.weights = vals
    
    #The new setter function for setting the bias value
    def set_bias(self, val):
        self.bias = val
        

    def print_parameters(self):
        print("weights:", self.weights)
        print("bias: ", self.bias)

    # predict_prob takes the argument:
    # - 'x': vector of features representing a data sample (here: a single comment)
    # and returns the predicted probability of the positive class (spam).
    def predict_prob(self, x):

        assert len(x)==self.weights.size

        z = 0
    
        # Add multiplication of each feature with its weight
        for weight, feature in zip(self.weights, x):
            z += weight * feature
        
        # Add bias
        z += self.bias
        
        # Compute sigmoid
        prob = sigmoid(z)
        return prob

    def train(self):
        print("not implemented yet!")
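
The loop in `predict_prob` adds up each weight times its feature, adds the bias and passes the result through the sigmoid. As a sketch, the same value can be obtained with a dot product; `predict_prob_vectorised` is a hypothetical helper, not part of the lab code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def predict_prob_vectorised(weights, bias, x):
    # Dot product of weights and features, plus bias, squashed to (0, 1)
    return sigmoid(np.dot(weights, x) + bias)

# Example: only the second of three binary features is active
weights = np.array([0.5, 0.4, 0.3])
prob = predict_prob_vectorised(weights, -0.15, np.array([0, 1, 0]))
print(round(prob, 6))
```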

### Task 4: Create your LogRegClassifier

Note that with our code we can only do prediction, not training yet, so we'll have to set our feature weights manually.

Initialize a LogRegClassifier object with three features.  
Then, set its feature weights and bias term manually. (There are no right or wrong values for these. Just try out some numbers according to your intuition, or if you have no intuition, pick uniform values and zero bias.)

In [6]:
# Creating the classifier with proper size
lg_classifier = LogRegClassifier(3)
# Setting weights
# I am assuming that all three features make a comment more likely to be spam.
weights = np.array([0.5, 0.4, 0.3])
lg_classifier.set_weights(weights)
# Setting bias
# Since there are fewer spam comments than non-spam comments, I will set the bias to be negative
bias = -0.15
lg_classifier.set_bias(bias)
# Inspect new weights and bias
lg_classifier.print_parameters()

weights: [0.5 0.4 0.3]
bias:  -0.15
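
To see what these values imply, the following sketch lists the predicted probability for every combination of the three binary features under the weights and bias chosen above. It reuses the sigmoid from the lab; the loop itself is only an illustration.

```python
from itertools import product
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

weights = np.array([0.5, 0.4, 0.3])  # Link, Promote, Like
bias = -0.15

probs = {}
for combo in product([0, 1], repeat=3):
    probs[combo] = sigmoid(np.dot(weights, combo) + bias)
    print(combo, round(probs[combo], 6))
```

With these values only a comment without any feature stays below 0.5; any single feature is enough to push the probability above it.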


### Select a section of the dataset

We extract here a subset of the first 200 rows (observations) to be used as our test set for spam classification.

In [7]:
df_test = df.head(200).copy()

### Task 5: Feature extraction and analysis

Starting from the test dataframe, extract the features you would like to use for the classification task.

In [8]:
# Defining the features
features = ["Link", "Promote", "Like"]

Finally, perform data analysis on this subset. Do you see any interesting patterns? 

You could use Pandas functions such as *describe()*, *value_counts()*, etc for the same.

In [9]:
# Examining the number of occurrences, since we are only using binary features
amt_link = df_test["Link"].sum()
amt_promo = df_test["Promote"].sum()
amt_like = df_test["Like"].sum()
amt_link_promo = np.where((df_test["Link"]) & (df_test["Promote"]), 1, 0).sum()
amt_like_promo = np.where(df_test["Like"] & df_test["Promote"], 1, 0).sum()
amt_link_like = np.where(df_test["Link"] & df_test["Like"], 1, 0).sum()
amt_all = np.where(df_test["Link"] & df_test["Promote"] & df_test["Like"], 1, 0).sum()
amt_any = np.where(df_test["Link"] | df_test["Promote"] | df_test["Like"], 1, 0).sum()
print("Amounts:")
print("Link:", amt_link)
print("Promote:", amt_promo)
print("Like:", amt_like)
print("Link AND Promote:", amt_link_promo)
print("Like AND Promote:", amt_like_promo)
print("Link AND Like:", amt_link_like)
print("Link AND Promote AND Like:", amt_all)
print("Link OR Promote OR Like:", amt_any)
print("Amount of Spam comments:", df_test["CLASS"].sum())

Amounts:
Link: 48
Promote: 38
Like: 39
Link AND Promote: 2
Like AND Promote: 10
Link AND Like: 3
Link AND Promote AND Like: 0
Link OR Promote OR Like: 110
Amount of Spam comments: 125


Explain here: 



By counting how often the features appear in the comments, on their own and together, we get a rough idea of whether they could help classify spam comments.<br>
In the test set a comment rarely has more than one of the three features, and never all three. If these features do characterize spam comments, this suggests that there are different types of spam comments, which agrees with the assumption I made when choosing the three features.<br>
All three features also occur a similar number of times in the test set (between 38 and 48). If they are useful features, they are probably similarly likely to identify a spam comment.<br>
Finally, the number of comments with at least one of the features (110) is close to the number of spam comments (125). Assuming that the comments with a feature are spam comments, detecting any single feature should then be enough to classify a comment as spam.<br>
In the following section we will see whether these assumptions can be confirmed.<br><br>
Note that the test set contains more spam than non-spam comments (125 of 200). This differs from the full data set, so the bias might need to be changed for the test set.
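
A complementary check is to compare each feature directly with the label. The sketch below uses `pd.crosstab` on a small made-up frame (`toy` is hypothetical; on the real data one would pass columns of `df_test`) to count how often a feature co-occurs with each class.

```python
import pandas as pd

# Hypothetical mini test set with one binary feature and the label
toy = pd.DataFrame({
    "Link":  [True, True, False, False, True, False],
    "CLASS": [1,    1,    0,     1,     0,    0],
})

# Rows: feature value, columns: class label, cells: number of comments
table = pd.crosstab(toy["Link"], toy["CLASS"])
print(table)

# Share of the comments with the feature that are spam
spam_share_with_link = toy.loc[toy["Link"], "CLASS"].mean()
print(spam_share_with_link)
```

A feature whose comments are almost all spam would support a large positive weight, while a share near the overall spam rate would suggest the feature carries little information.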

### Task 6: Run the LogRegClassifier classifier 

For each item in the test set, predict if a comment was a spam or not.

Also report on the accuracy of the model. Is it better than the majority baseline?

In [10]:
# Creating the majority choice comparison
# For this I will use functions from the first Tutorial
def mostFreqClass(df_test, col):
    return df_test[col].mode()[0]

def mostFreqPred(df_test, col):
    return np.full((len(df_test), 1), mostFreqClass(df_test, col))

In [11]:
df_test["mostFreqPred"] = mostFreqPred(df_test, "CLASS")
df_test

Unnamed: 0,COMMENT,CLASS,LEN,NWORDS,HAS_WORD_HEY,Link,Promote,Like,mostFreqPred
0,"Huh, anyway check out this you[tube] channel: ...",1,56,8,False,False,True,False,1
1,Hey guys check out my new channel and our firs...,1,166,32,True,False,True,True,1
2,just for test I have to say murdev.com,1,38,8,False,False,False,False,1
3,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1,48,11,False,False,True,False,1
4,watch?v=vtaRGgvGtWQ Check this out .﻿,1,39,7,False,True,False,False,1
...,...,...,...,...,...,...,...,...,...
195,What is he saying?!?!?!?!?!?!?!?$? ﻿,0,36,5,False,False,False,False,1
196,this has so many views﻿,0,23,5,False,False,False,False,1
197,OMG over 2 billion views!﻿,0,26,5,False,False,False,False,1
198,Subscribe to me plz plz plz plz plz plZ ﻿,1,41,10,False,False,False,False,1


In [12]:
# Finding probability for all rows
df_test['prob']=df_test[features].apply(lg_classifier.predict_prob,axis=1)
df_test

Unnamed: 0,COMMENT,CLASS,LEN,NWORDS,HAS_WORD_HEY,Link,Promote,Like,mostFreqPred,prob
0,"Huh, anyway check out this you[tube] channel: ...",1,56,8,False,False,True,False,1,0.562177
1,Hey guys check out my new channel and our firs...,1,166,32,True,False,True,True,1,0.634136
2,just for test I have to say murdev.com,1,38,8,False,False,False,False,1,0.462570
3,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1,48,11,False,False,True,False,1,0.562177
4,watch?v=vtaRGgvGtWQ Check this out .﻿,1,39,7,False,True,False,False,1,0.586618
...,...,...,...,...,...,...,...,...,...,...
195,What is he saying?!?!?!?!?!?!?!?$? ﻿,0,36,5,False,False,False,False,1,0.462570
196,this has so many views﻿,0,23,5,False,False,False,False,1,0.462570
197,OMG over 2 billion views!﻿,0,26,5,False,False,False,False,1,0.462570
198,Subscribe to me plz plz plz plz plz plZ ﻿,1,41,10,False,False,False,False,1,0.462570


In [13]:
# adding the class with a threshold of 0.5
df_test["pred_class"] = np.where(df_test["prob"] >= 0.5, 1, 0)
df_test

Unnamed: 0,COMMENT,CLASS,LEN,NWORDS,HAS_WORD_HEY,Link,Promote,Like,mostFreqPred,prob,pred_class
0,"Huh, anyway check out this you[tube] channel: ...",1,56,8,False,False,True,False,1,0.562177,1
1,Hey guys check out my new channel and our firs...,1,166,32,True,False,True,True,1,0.634136,1
2,just for test I have to say murdev.com,1,38,8,False,False,False,False,1,0.462570,0
3,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1,48,11,False,False,True,False,1,0.562177,1
4,watch?v=vtaRGgvGtWQ Check this out .﻿,1,39,7,False,True,False,False,1,0.586618,1
...,...,...,...,...,...,...,...,...,...,...,...
195,What is he saying?!?!?!?!?!?!?!?$? ﻿,0,36,5,False,False,False,False,1,0.462570,0
196,this has so many views﻿,0,23,5,False,False,False,False,1,0.462570,0
197,OMG over 2 billion views!﻿,0,26,5,False,False,False,False,1,0.462570,0
198,Subscribe to me plz plz plz plz plz plZ ﻿,1,41,10,False,False,False,False,1,0.462570,0


In [14]:
# checking the accuracy and comparing to the most frequent prediction
correct_lg = (df_test["pred_class"] == df_test["CLASS"])
nb_correct_lg = correct_lg.sum()
accuracy_lg = nb_correct_lg/correct_lg.count()

correct_mfp = (df_test["mostFreqPred"] == df_test["CLASS"])
nb_correct_mfp = correct_mfp.sum()
accuracy_mfp = nb_correct_mfp/correct_mfp.count()

lg_classifier.print_parameters()
print("Accuracy Classifier:", accuracy_lg)
print("Accuracy Most Frequent Prediction:", accuracy_mfp)


weights: [0.5 0.4 0.3]
bias:  -0.15
Accuracy Classifier: 0.835
Accuracy Most Frequent Prediction: 0.625


We can see that our classifier already has a higher accuracy than the trivial most frequent prediction classifier.
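
The accuracy computation above can be wrapped in a small helper so that different weights and biases are easy to compare. The function below is a sketch under that assumption; `accuracy_for` and the toy arrays are hypothetical, and it uses a vectorised prediction rather than the `LogRegClassifier` class.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def accuracy_for(weights, bias, X, y, threshold=0.5):
    """Share of rows in X whose thresholded prediction equals the label in y."""
    probs = sigmoid(X @ weights + bias)
    preds = (probs >= threshold).astype(int)
    return (preds == y).mean()

# Hypothetical data: rows are comments, columns are three binary features
X = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0], [0, 0, 0], [0, 0, 1]])
y = np.array([1, 1, 0, 0, 1])
w = np.array([0.5, 0.4, 0.3])

for bias in (0.1, -0.15):
    print(bias, accuracy_for(w, bias, X, y))
```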

### Task 7: Tuning the classifier's weights

Try out various feature weights and bias values. Are you able to increase your accuracy?

First I will try to change the bias to a positive value as the test set has more spam comments than non spam comments.

In [15]:
# make bias positive
lg_classifier.set_bias(0.1)

In [16]:
# re-evaluate
df_test['prob']=df_test[features].apply(lg_classifier.predict_prob,axis=1)
df_test["pred_class"] = np.where(df_test["prob"] >= 0.5, 1, 0)

correct_lg = (df_test["pred_class"] == df_test["CLASS"])
nb_correct_lg = correct_lg.sum()
accuracy_lg = nb_correct_lg/correct_lg.count()

lg_classifier.print_parameters()
print("Accuracy Classifier:", accuracy_lg)
print("Accuracy Most Frequent Prediction:", accuracy_mfp)

weights: [0.5 0.4 0.3]
bias:  0.1
Accuracy Classifier: 0.625
Accuracy Most Frequent Prediction: 0.625


This actually decreased the accuracy to the baseline value of 0.625, so I will first try a larger bias value and then a bias smaller than the original one.

In [17]:
# make bias even greater
lg_classifier.set_bias(0.2)

In [18]:
# re-evaluate
df_test['prob']=df_test[features].apply(lg_classifier.predict_prob,axis=1)
df_test["pred_class"] = np.where(df_test["prob"] >= 0.5, 1, 0)

correct_lg = (df_test["pred_class"] == df_test["CLASS"])
nb_correct_lg = correct_lg.sum()
accuracy_lg = nb_correct_lg/correct_lg.count()

lg_classifier.print_parameters()
print("Accuracy Classifier:", accuracy_lg)
print("Accuracy Most Frequent Prediction:", accuracy_mfp)

weights: [0.5 0.4 0.3]
bias:  0.2
Accuracy Classifier: 0.625
Accuracy Most Frequent Prediction: 0.625


In [19]:
# make bias even smaller
lg_classifier.set_bias(-0.2)

In [20]:
# re-evaluate
df_test['prob']=df_test[features].apply(lg_classifier.predict_prob,axis=1)
df_test["pred_class"] = np.where(df_test["prob"] >= 0.5, 1, 0)

correct_lg = (df_test["pred_class"] == df_test["CLASS"])
nb_correct_lg = correct_lg.sum()
accuracy_lg = nb_correct_lg/correct_lg.count()

lg_classifier.print_parameters()
print("Accuracy Classifier:", accuracy_lg)
print("Accuracy Most Frequent Prediction:", accuracy_mfp)

weights: [0.5 0.4 0.3]
bias:  -0.2
Accuracy Classifier: 0.835
Accuracy Most Frequent Prediction: 0.625


In [21]:
# make bias even smaller
lg_classifier.set_bias(-0.4)

In [22]:
# re-evaluate
df_test['prob']=df_test[features].apply(lg_classifier.predict_prob,axis=1)
df_test["pred_class"] = np.where(df_test["prob"] >= 0.5, 1, 0)

correct_lg = (df_test["pred_class"] == df_test["CLASS"])
nb_correct_lg = correct_lg.sum()
accuracy_lg = nb_correct_lg/correct_lg.count()

lg_classifier.print_parameters()
print("Accuracy Classifier:", accuracy_lg)
print("Accuracy Most Frequent Prediction:", accuracy_mfp)

weights: [0.5 0.4 0.3]
bias:  -0.4
Accuracy Classifier: 0.785
Accuracy Most Frequent Prediction: 0.625


In [23]:
# make bias even smaller
lg_classifier.set_bias(-0.5)

In [24]:
# re-evaluate
df_test['prob']=df_test[features].apply(lg_classifier.predict_prob,axis=1)
df_test["pred_class"] = np.where(df_test["prob"] >= 0.5, 1, 0)

correct_lg = (df_test["pred_class"] == df_test["CLASS"])
nb_correct_lg = correct_lg.sum()
accuracy_lg = nb_correct_lg/correct_lg.count()

lg_classifier.print_parameters()
print("Accuracy Classifier:", accuracy_lg)
print("Accuracy Most Frequent Prediction:", accuracy_mfp)

weights: [0.5 0.4 0.3]
bias:  -0.5
Accuracy Classifier: 0.655
Accuracy Most Frequent Prediction: 0.625


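A slightly negative bias seems to work best: -0.15 and -0.2 both give an accuracy of 0.835, while -0.4 and -0.5 lower it to 0.785 and 0.655. I will continue with a value of -0.3.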
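
To see why the accuracy changes in steps, the sketch below evaluates the classifier's decision for a comment with no feature and for comments with exactly one feature, at several of the bias values tried above. It only illustrates the decision rule with the weights [0.5, 0.4, 0.3]; it is not a re-run on the dataset.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

weights = np.array([0.5, 0.4, 0.3])  # Link, Promote, Like
cases = {
    "none": np.array([0, 0, 0]),
    "Link only": np.array([1, 0, 0]),
    "Promote only": np.array([0, 1, 0]),
    "Like only": np.array([0, 0, 1]),
}

decisions = {}
for bias in (0.1, -0.15, -0.4, -0.5):
    for name, x in cases.items():
        prob = sigmoid(np.dot(weights, x) + bias)
        decisions[(bias, name)] = int(prob >= 0.5)  # 1 = spam, 0 = not spam
    print(bias, {name: decisions[(bias, name)] for name in cases})
```

Under this rule a positive bias labels every comment as spam, which would explain why the accuracy then equals the majority baseline, and a bias of -0.4 no longer flags comments whose only feature is Like.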

In [25]:
# set bias
lg_classifier.set_bias(-0.3)

Let us now play with the weights.

In [26]:
# set weights
lg_classifier.set_weights(np.array([0.5, 0.5, 0.5]))

In [27]:
# re-evaluate
df_test['prob']=df_test[features].apply(lg_classifier.predict_prob,axis=1)
df_test["pred_class"] = np.where(df_test["prob"] >= 0.5, 1, 0)

correct_lg = (df_test["pred_class"] == df_test["CLASS"])
nb_correct_lg = correct_lg.sum()
accuracy_lg = nb_correct_lg/correct_lg.count()

lg_classifier.print_parameters()
print("Accuracy Classifier:", accuracy_lg)
print("Accuracy Most Frequent Prediction:", accuracy_mfp)

weights: [0.5 0.5 0.5]
bias:  -0.3
Accuracy Classifier: 0.835
Accuracy Most Frequent Prediction: 0.625


In [28]:
# set weights
lg_classifier.set_weights(np.array([0.2, 0.5, 0.8]))

In [29]:
# re-evaluate
df_test['prob']=df_test[features].apply(lg_classifier.predict_prob,axis=1)
df_test["pred_class"] = np.where(df_test["prob"] >= 0.5, 1, 0)

correct_lg = (df_test["pred_class"] == df_test["CLASS"])
nb_correct_lg = correct_lg.sum()
accuracy_lg = nb_correct_lg/correct_lg.count()

lg_classifier.print_parameters()
print("Accuracy Classifier:", accuracy_lg)
print("Accuracy Most Frequent Prediction:", accuracy_mfp)

weights: [0.2 0.5 0.8]
bias:  -0.3
Accuracy Classifier: 0.63
Accuracy Most Frequent Prediction: 0.625


In [30]:
# set weights
lg_classifier.set_weights(np.array([0.8, 0.3, 0.8]))

In [31]:
# re-evaluate
df_test['prob']=df_test[features].apply(lg_classifier.predict_prob,axis=1)
df_test["pred_class"] = np.where(df_test["prob"] >= 0.5, 1, 0)

correct_lg = (df_test["pred_class"] == df_test["CLASS"])
nb_correct_lg = correct_lg.sum()
accuracy_lg = nb_correct_lg/correct_lg.count()

lg_classifier.print_parameters()
print("Accuracy Classifier:", accuracy_lg)
print("Accuracy Most Frequent Prediction:", accuracy_mfp)

weights: [0.8 0.3 0.8]
bias:  -0.3
Accuracy Classifier: 0.835
Accuracy Most Frequent Prediction: 0.625


In [32]:
# set weights
lg_classifier.set_weights(np.array([0.8, 0.8, 0.8]))

In [33]:
# re-evaluate
df_test['prob']=df_test[features].apply(lg_classifier.predict_prob,axis=1)
df_test["pred_class"] = np.where(df_test["prob"] >= 0.5, 1, 0)

correct_lg = (df_test["pred_class"] == df_test["CLASS"])
nb_correct_lg = correct_lg.sum()
accuracy_lg = nb_correct_lg/correct_lg.count()

lg_classifier.print_parameters()
print("Accuracy Classifier:", accuracy_lg)
print("Accuracy Most Frequent Prediction:", accuracy_mfp)

weights: [0.8 0.8 0.8]
bias:  -0.3
Accuracy Classifier: 0.835
Accuracy Most Frequent Prediction: 0.625


In [34]:
# set weights
lg_classifier.set_weights(np.array([-0.8, -0.8, -0.1]))

In [35]:
# re-evaluate
df_test['prob']=df_test[features].apply(lg_classifier.predict_prob,axis=1)
df_test["pred_class"] = np.where(df_test["prob"] >= 0.5, 1, 0)

correct_lg = (df_test["pred_class"] == df_test["CLASS"])
nb_correct_lg = correct_lg.sum()
accuracy_lg = nb_correct_lg/correct_lg.count()

lg_classifier.print_parameters()
print("Accuracy Classifier:", accuracy_lg)
print("Accuracy Most Frequent Prediction:", accuracy_mfp)

weights: [-0.8 -0.8 -0.1]
bias:  -0.3
Accuracy Classifier: 0.375
Accuracy Most Frequent Prediction: 0.625


In [36]:
# set weights
lg_classifier.set_weights(np.array([0.2, 0.2, 0.1]))

In [37]:
# re-evaluate
df_test['prob']=df_test[features].apply(lg_classifier.predict_prob,axis=1)
df_test["pred_class"] = np.where(df_test["prob"] >= 0.5, 1, 0)

correct_lg = (df_test["pred_class"] == df_test["CLASS"])
nb_correct_lg = correct_lg.sum()
accuracy_lg = nb_correct_lg/correct_lg.count()

lg_classifier.print_parameters()
print("Accuracy Classifier:", accuracy_lg)
print("Accuracy Most Frequent Prediction:", accuracy_mfp)

weights: [0.2 0.2 0.1]
bias:  -0.3
Accuracy Classifier: 0.45
Accuracy Most Frequent Prediction: 0.625


In [38]:
# set weights
lg_classifier.set_weights(np.array([0.99, 0.99, 0.99]))

In [39]:
# re-evaluate
df_test['prob']=df_test[features].apply(lg_classifier.predict_prob,axis=1)
df_test["pred_class"] = np.where(df_test["prob"] >= 0.5, 1, 0)

correct_lg = (df_test["pred_class"] == df_test["CLASS"])
nb_correct_lg = correct_lg.sum()
accuracy_lg = nb_correct_lg/correct_lg.count()

lg_classifier.print_parameters()
print("Accuracy Classifier:", accuracy_lg)
print("Accuracy Most Frequent Prediction:", accuracy_mfp)

weights: [0.99 0.99 0.99]
bias:  -0.3
Accuracy Classifier: 0.835
Accuracy Most Frequent Prediction: 0.625


It seems that the prediction power of the classifier does not change as long as all weights are positive and large enough (in the runs above, at least 0.3 each with a bias of -0.3). Negative weights reduce the accuracy to 0.375, well below the baseline.
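
This can be made precise with the decision rule itself. As a short derivation, using $w_i$ for the weights, $x_i \in \{0, 1\}$ for the binary features and $b$ for the bias (symbols introduced here for illustration):

$$
\hat{y} = 1 \iff \sigma(z) \geq 0.5 \iff z \geq 0 \iff \sum_{i=1}^{3} w_i x_i \geq -b
$$

For a comment with exactly one active feature $j$ this reduces to $w_j \geq -b$. With $b = -0.3$, every weight of at least 0.3 therefore yields the same prediction for such a comment; larger weights change the probabilities but not the predicted classes.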

Explain:

I was not able to increase the accuracy of the classifier above 0.835. This value is higher than the accuracy of the most frequent prediction model (0.625), so I conclude that the classifier succeeds at its task of classifying spam comments.<br>
Since the accuracy decreased when the bias was set to a positive value, I conclude that my initial choice of a negative bias was correct. To reiterate, I chose a negative bias because the full data set contains more non-spam than spam comments. The hypothesis that a positive bias might work better for the test set, because it contains more spam comments, was not supported by my analysis.<br>
I also observed that fairly large positive weights are good choices for the chosen features, which agrees with my assumption. Comments that contain the defined substrings are classified as spam fairly reliably.<br>
If I were to improve this classifier further by hand, I would investigate whether it is more likely to falsely classify a normal comment as spam or to miss a spam comment. This could give further insight into the effectiveness of the defined features. I would also look for other patterns in the spam comments that were missed and might define a new feature to catch them.
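
The error analysis suggested above could start from a confusion matrix. The sketch below builds one with `pd.crosstab`; the `toy` frame and its values are made up, and on the real data the columns `CLASS` and `pred_class` of `df_test` would take their place.

```python
import pandas as pd

# Hypothetical labels and predictions
toy = pd.DataFrame({
    "CLASS":      [1, 1, 1, 0, 0, 1, 0, 1],
    "pred_class": [1, 0, 1, 0, 1, 0, 0, 1],
})

# Rows: true class, columns: predicted class
confusion = pd.crosstab(toy["CLASS"], toy["pred_class"])
print(confusion)

# Normal comments flagged as spam, and spam comments that were missed
false_positives = ((toy["CLASS"] == 0) & (toy["pred_class"] == 1)).sum()
false_negatives = ((toy["CLASS"] == 1) & (toy["pred_class"] == 0)).sum()
print("false positives:", false_positives, "false negatives:", false_negatives)
```

Filtering the real test set for the false negatives would then show which spam comments none of the three features catches.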