# Task 1: Text Classification

#### 1. Download the BBC dataset provided on Moodle. The dataset, created by Greene and Cunningham (2006), is a collection of 2225 documents from the BBC news website, already categorized into 5 classes: business, entertainment, politics, sport, and tech.

In [1]:
import os

# The dataset can be found in the /data/BBC folder
print(os.listdir("../data/BBC"))

['business', 'entertainment', 'politics', 'README.TXT', 'sport', 'tech']


In [2]:
# Dictionary to hold the number of instances of each class
# business/entertainment/politics/sport/tech
category_dict = dict()

category_dict["business"] = len(os.listdir("../data/BBC/business"))
category_dict["entertainment"] = len(os.listdir("../data/BBC/entertainment"))
category_dict["politics"] = len(os.listdir("../data/BBC/politics"))
category_dict["sport"] = len(os.listdir("../data/BBC/sport"))
category_dict["tech"] = len(os.listdir("../data/BBC/tech"))

print("The number of instances in each class", category_dict)

The number of instances in each class {'business': 510, 'entertainment': 386, 'politics': 417, 'sport': 511, 'tech': 401}


In [3]:
num_texts = 0

for value in category_dict.values():
    num_texts += value
print('Total Number of Texts: {num_texts}'.format(num_texts=num_texts))

Total Number of Texts: 2225


#### 2. Plot the distribution of the instances in each class and save the graphic in a file called BBC-distribution.pdf. You may want to use matplotlib.pyplot and savefig to do this. This pre-analysis of the data set will allow you to determine if the classes are balanced, and which metric is more appropriate to use to evaluate the performance of your classifier.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
 
categories = list(category_dict.keys())
instances = list(category_dict.values())
  
fig = plt.figure(figsize = (10, 5))
 
# Creating the bar plot
plt.bar(categories, instances, color ='powderblue', width = 0.6)
 
plt.xlabel("Text Categories")
plt.ylabel("No. of Instances")
plt.title("Distribution of Instances in Each Class")
plt.savefig("../output/BBC-distribution.pdf")
plt.show()
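
The bar chart can be complemented with a numeric check of the class balance. The following is a minimal sketch that assumes the class counts printed earlier in this notebook; `proportions` and `imbalance_ratio` are illustrative names, not part of the assignment.

```python
# Class counts as printed earlier in this notebook.
category_counts = {
    "business": 510,
    "entertainment": 386,
    "politics": 417,
    "sport": 511,
    "tech": 401,
}

total = sum(category_counts.values())

# Share of the corpus that each class represents.
proportions = {name: count / total for name, count in category_counts.items()}

# Ratio of the largest class to the smallest class (1.0 means perfectly balanced).
imbalance_ratio = max(category_counts.values()) / min(category_counts.values())

for name, share in proportions.items():
    print(f"{name}: {share:.1%}")
print(f"Largest/smallest class ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1 suggests that accuracy is not badly distorted by class frequency; a much larger ratio would be a reason to rely more on per-class metrics.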

#### 3. Load the corpus using `load_files` and make sure you set the encoding to latin1. This will read the file structure and assign the category name to each file from their parent directory name.

In [None]:
from sklearn.datasets import load_files

# Reads the file structure and assigns the category name to each file from its parent directory name
# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html
res = load_files("../data/BBC", encoding="latin-1")

In [None]:
# The raw text data to learn (list of str)
X = res.data

# The target labels (e.g. business/entertainment/politics/sport/tech) but as an integer index! (e.g. 0/1/2/3/4)
y = res.target

target_names = res.target_names
print("The names of target classes: ", target_names)

print("\nSome examples below: ")
for i in range(0,10):
    text = res.data[i]
    target = res.target[i]
    category = target_names[target]
    print('{text}...{category}'.format(text=text[0:30], category=category))

#### 5. Split the dataset into 80% for training and 20% for testing. For this, you must use `train_test_split` with the parameter `random_state` set to None.

In [None]:
from sklearn.model_selection import train_test_split

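# test_size=0.2 gives the required 80/20 split (the default would hold out 25% for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=None)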
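
When `test_size` is left unset, `train_test_split` holds out 25% of the samples, so an 80/20 split needs `test_size=0.2`. The sketch below illustrates the difference on hypothetical toy data (`docs` and `labels` are stand-ins, not the BBC corpus).

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the documents and their integer labels.
docs = [f"doc {i}" for i in range(100)]
labels = [i % 5 for i in range(100)]

# Default behaviour: with test_size unset, 25% of the samples go to the test set.
_, default_test, _, _ = train_test_split(docs, labels, random_state=None)

# Explicit 80/20 split.
train_docs, test_docs, train_labels, test_labels = train_test_split(
    docs, labels, test_size=0.2, random_state=None
)

print("Default test set size:", len(default_test))
print("80/20 split sizes:", len(train_docs), len(test_docs))
```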

#### 4. Pre-process the dataset to have the features ready to be used by a multinomial Naive Bayes classifier. This means that the frequency of each word in each class must be computed and stored in a term-document matrix. For this, you can use `feature_extraction.text.CountVectorizer`.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Convert a collection of text documents to a matrix of token counts. ("Tokenization")
count_vect = CountVectorizer()

# fit_transform returns a sparse document-term matrix of shape (n_samples, n_features)
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit_transform
X_train_counts = count_vect.fit_transform(X_train)

print("Shape of X_train_counts: ", X_train_counts.shape)
print("y_train length: ", len(y_train))

print("\nVocabulary (A Mapping of Terms to Feature Indices)")
vocabulary = count_vect.vocabulary_
names = list(vocabulary.keys())
index = list(vocabulary.values())
for i in range(0,10):
    print(names[i] + ":",index[i])
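
To make the document-term matrix concrete, here is a small sketch on two hypothetical sentences (`toy_docs` is not part of the dataset). It uses the default `CountVectorizer` settings, which lowercase the text and keep tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents.
toy_docs = ["the cat sat", "the cat saw the dog"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index (terms are indexed alphabetically).
vocab = vectorizer.vocabulary_
print(sorted(vocab.items(), key=lambda item: item[1]))

# Each row is a document, each column a term, each cell a raw count.
print(counts.toarray())
```

The word "the" appears twice in the second document, so its cell holds 2, while the vocabulary counts it only once.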

#### 6. Train a multinomial Naive Bayes Classifier (`naive_bayes.MultinomialNB`) on the training set using the default parameters and evaluate it on the test set.

In [None]:
from sklearn.naive_bayes import MultinomialNB

multinomialNB = MultinomialNB()
clf = multinomialNB.fit(X_train_counts, y_train)

In [None]:
print("Let's test with a few examples to see if the model makes sense...\n")

docs_new = ['MSFT stock hit $300', 'Intel core processor with 16GB RAM', "GO HABS GO!"]
X_new_counts = count_vect.transform(docs_new)

predicted = clf.predict(X_new_counts)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, target_names[category]))

In [None]:
# Now let's actually test the model using X_test, y_test

X_test_counts = count_vect.transform(X_test)

y_predicted = clf.predict(X_test_counts)

#### 7. In a file called bbc-performance.txt, save the following information: (to make it easier for the TAs, make sure that your output for each sub-question below is clearly marked in your output file, using the headings (a), (b) . . .)

##### (a) a clear separator (a sequence of hyphens or stars) and string clearly describing the model (e.g. “MultinomialNB default values, try 1”)

In [None]:
# Open the output files in write mode, which creates them or clears any previous contents
file_performance = open("../output/bbc-performance.txt", "w")
file_discussion = open("../output/bbc-discussion.txt", "w")

def write_model_name_to_file(model_name):
    file_performance.write("\n(a) **** {model_name} ****\n\n".format(model_name=model_name))

write_model_name_to_file("MultinomialNB default values, try 1")
print("Writing to file...")

##### (b) the confusion matrix (you can use `confusion_matrix`)

In [None]:
from sklearn.metrics import confusion_matrix

cf_matrix = confusion_matrix(y_test, y_predicted)

def print_confusion_matrix(cf_matrix):
    print('Confusion Matrix:\n {cf_matrix}'.format(cf_matrix=cf_matrix))
    
def write_confusion_matrix_to_file(cf_matrix):
    file_performance.write("(b) Confusion Matrix" + "\n\n")
    np.savetxt(file_performance, X=cf_matrix.astype(int), fmt ='%i\t')
    file_performance.write("\n")
    
print_confusion_matrix(cf_matrix)
write_confusion_matrix_to_file(cf_matrix)

##### (c) the precision, recall, and F1-measure for each class (you can use `classification_report`)

In [None]:
from sklearn import metrics

def print_classification_report(y_test, y_predicted, target_names):
    report = metrics.classification_report(y_test, y_predicted, target_names=target_names)
    print(report)
    
def write_classification_report_to_file(y_test, y_predicted, target_names):
    report = metrics.classification_report(y_test, y_predicted, target_names=target_names)
    file_performance.write("(c) Classification Report\n\n" + report)
    file_performance.write("\n")
    
print_classification_report(y_test, y_predicted, target_names)
write_classification_report_to_file(y_test, y_predicted, target_names)

##### (d) the accuracy, macro-average F1 and weighted-average F1 of the model (you can use `accuracy_score` and `f1_score`)

In [None]:
from sklearn.metrics import accuracy_score, f1_score

def print_scores(y_test,y_predicted):
    print("Accuracy Score: ", accuracy_score(y_test, y_predicted))
    print("Macro-Average F1: ", f1_score(y_test, y_predicted, average="macro"))
    print("Weighted-Average F1: ", f1_score(y_test, y_predicted, average="weighted"))
    
def write_scores_to_file(y_test,y_predicted):
    file_performance.write("(d) Accuracy, Macro-Average F1 and Weighted-Average F1\n\n")
    file_performance.write("Accuracy Score: " + str(accuracy_score(y_test, y_predicted)) + "\n")
    file_performance.write("Macro-Average F1: " + str(f1_score(y_test, y_predicted, average="macro")) + "\n")
    file_performance.write("Weighted-Average F1: " + str(f1_score(y_test, y_predicted, average="weighted")) + "\n")
    file_performance.write("\n")

print_scores(y_test, y_predicted)
write_scores_to_file(y_test,y_predicted)
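
The three scores respond differently to class imbalance. The sketch below uses hypothetical labels (8 samples of class 0, 2 of class 1, one minority sample misclassified) to show how macro-average F1 weights every class equally while weighted-average F1 weights each class by its support.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: one of the two minority samples is misclassified.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]

accuracy = accuracy_score(y_true, y_pred)
per_class_f1 = f1_score(y_true, y_pred, average=None)
macro_f1 = f1_score(y_true, y_pred, average="macro")
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print("Accuracy:", accuracy)
print("Per-class F1:", per_class_f1)
print("Macro-average F1:", macro_f1)
print("Weighted-average F1:", weighted_f1)
```

Here the minority class has the lower F1, so the macro average drops further than the weighted average or the accuracy. On a nearly balanced dataset the three values stay close together.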

##### (e) the prior probability of each class

In [None]:
# Returns a dictionary that holds the prior probabilities of each class
def compute_prior_probabilities(multinomialNB):
    class_count = multinomialNB.class_count_
    total = sum(class_count)
    prior_dict = dict()
    
    for i in range(len(class_count)):
        target = target_names[i]
        probability = class_count[i] / total
        prior_dict[target] = round(probability, 4)
        
    return prior_dict

def print_class_count(multinomialNB):
    class_count = multinomialNB.class_count_
    total = sum(class_count)
    print("The number of samples encountered for each class: {class_count}".format(class_count=class_count))
    print("The total number of samples: {total}".format(total=total))
    
def print_prior_probabilities(multinomialNB):
    prior_dict = compute_prior_probabilities(multinomialNB)
    print("The prior probabilities of each class: {prior_dict}".format(prior_dict=prior_dict))
    
def write_prior_probabilities_to_file(multinomialNB):
    file_performance.write("(e) Prior Probability of Each Class\n\n")
    prior_dict = compute_prior_probabilities(multinomialNB)
    for key in prior_dict:
        category = key
        probability = prior_dict[key]
        file_performance.write("{category}: {probability:.4f}".format(category=category, probability=probability))
        file_performance.write("\n")
    file_performance.write("\n")
    
print_class_count(multinomialNB)
print_prior_probabilities(multinomialNB)
write_prior_probabilities_to_file(multinomialNB)
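
As a cross-check, the priors derived from `class_count_` should match the exponentiated `class_log_prior_` of a model fitted with the default `fit_prior=True`. This is a sketch on a hypothetical toy count matrix (`X_toy`, `y_toy`), not on the BBC data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: 4 documents, 3 vocabulary terms, 2 classes.
X_toy = np.array([[2, 0, 1], [1, 1, 0], [0, 3, 1], [0, 2, 2]])
y_toy = np.array([0, 0, 0, 1])

model = MultinomialNB().fit(X_toy, y_toy)

# Priors computed by hand from the number of training samples per class.
priors_from_counts = model.class_count_ / model.class_count_.sum()

# Priors stored by the model as log-probabilities.
priors_from_model = np.exp(model.class_log_prior_)

print("From class counts:", priors_from_counts)
print("From class_log_prior_:", priors_from_model)
```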

##### (f) the size of the vocabulary (i.e. the number of different words). For example, if the word potato appears 3 times, you only count it once.

In [None]:
vocabulary_size = len(vocabulary.keys())

def print_vocabulary_size(multinomialNB):
    print("The size of the vocabulary: {vocabulary_size}\n".format(vocabulary_size = vocabulary_size))
    print("Note that this can also be deduced by the number of columns (features) in the document-term matrix: ", X_train_counts.shape[1])
    print("Or this can be computed from the feature counts of the MultinomialNB: ", multinomialNB.feature_count_.shape[1])

def write_vocabulary_size_to_file(multinomialNB):
    file_performance.write("(f) The Size of the Vocabulary\n")
    # The number of columns in the term document matrix
    vocabulary_size = multinomialNB.feature_count_.shape[1]
    file_performance.write("The size of the vocabulary: {vocabulary_size}\n".format(vocabulary_size = vocabulary_size))
    file_performance.write("\n")

print_vocabulary_size(multinomialNB)
write_vocabulary_size_to_file(multinomialNB)

##### (g) the number of word-tokens in each class (i.e. the number of words in total). For example, if the word potato appears 3 times, you count it 3 times.

In [None]:
# generates a dictionary containing the number of word-tokens in each class
def compute_word_token_dict(multinomialNB):
    word_token_dict = dict()
    class_count = multinomialNB.class_count_
    feature_count = multinomialNB.feature_count_
    
    for i in range(len(class_count)):
        target = target_names[i]
        # Take the sum of all the words for that class
        word_token_dict[target] = sum(feature_count[i, :])
        
    return word_token_dict
    
def print_word_tokens_by_class(multinomialNB):
    word_token_dict = compute_word_token_dict(multinomialNB)
    print("The number of word tokens by class: ", word_token_dict)
    
def write_word_tokens_by_class_to_file(multinomialNB):
    file_performance.write("(g) The number of word-tokens in each class (i.e. the number of words in total)\n")
    word_token_dict = compute_word_token_dict(multinomialNB)
    file_performance.write("The number of word tokens by class: " +  str(word_token_dict))
    file_performance.write("\n\n")

print_word_tokens_by_class(multinomialNB)
write_word_tokens_by_class_to_file(multinomialNB)

##### (h) the number of word-tokens in the entire corpus

In [None]:
def compute_total_word_tokens(multinomialNB):
    feature_count = multinomialNB.feature_count_
    return feature_count.sum()
    
def print_word_tokens_total(multinomialNB):
    total = compute_total_word_tokens(multinomialNB)
    print("Total word-tokens in the corpus: ", total)

def write_word_tokens_total_to_file(multinomialNB):
    file_performance.write("(h) The number of word-tokens in the entire corpus\n")
    total = compute_total_word_tokens(multinomialNB)
    file_performance.write("Total word-tokens in the corpus: " + str(total))
    file_performance.write("\n\n")
    
print_word_tokens_total(multinomialNB)
write_word_tokens_total_to_file(multinomialNB)

##### (i) the number and percentage of words with a frequency of zero in each class

In [None]:
def print_count_and_percentage(x, y, class_name):
    print("Count for {class_name}: {count}".format(class_name=class_name, count=x))
    print("Percentage for {class_name}: {percentage:.2%}".format(class_name=class_name, percentage=x/y))
    print("\n")

def write_count_and_percentage_to_file(x, y, class_name):
    file_performance.write("Count for {class_name}: {count}".format(class_name=class_name, count=x))
    file_performance.write("\n")
    file_performance.write("Percentage for {class_name}: {percentage:.2%}".format(class_name=class_name, percentage=x/y))
    file_performance.write("\n")
    
# returns a dictionary containing the number of words with a frequency zero in each class
def compute_frequency_zero_words(multinomialNB):
    frequency_zero_dict = dict()
    feature_count = multinomialNB.feature_count_
    num_features = feature_count.shape[1]
    
    for i in range(len(target_names)):
        target = target_names[i]
        count_zero = num_features - np.count_nonzero(feature_count[i,:])
        frequency_zero_dict[target] = count_zero
        
    return frequency_zero_dict
    
def print_words_with_frequency_zero(multinomialNB):
    feature_count = multinomialNB.feature_count_
    num_features = feature_count.shape[1]
    frequency_zero_dict = compute_frequency_zero_words(multinomialNB)
    print("Recall that the number of features is: {num_features}".format(num_features = num_features))
    print("The number and percentage of words with a frequency of ZERO in each class is outlined below\n")
    
    for key in frequency_zero_dict:
        print_count_and_percentage(frequency_zero_dict[key], num_features, key)
    
def write_words_with_frequency_zero_to_file(multinomialNB):
    file_performance.write("(i) the number and percentage of words with a frequency of zero in each class\n")
    feature_count = multinomialNB.feature_count_
    num_features = feature_count.shape[1]
    frequency_zero_dict = compute_frequency_zero_words(multinomialNB)
    
    for key in frequency_zero_dict:
        write_count_and_percentage_to_file(frequency_zero_dict[key], num_features, key)
        
    file_performance.write("\n")
    
print_words_with_frequency_zero(multinomialNB)
write_words_with_frequency_zero_to_file(multinomialNB)

##### (j) the number and percentage of words with a frequency of one in the entire corpus

In [None]:
def get_num_words_with_frequency_one(multinomialNB):
    feature_count = multinomialNB.feature_count_
    num_features = feature_count.shape[1]
    total = 0
    
    for j in range(num_features):
        col = feature_count[:, j]
        col_sum = int(col.sum())
        
        if (col_sum == 1):
            total += 1
        
    return total

def print_num_words_with_frequency_one(multinomialNB):
    total = get_num_words_with_frequency_one(multinomialNB)
    vocabulary_size = multinomialNB.feature_count_.shape[1]
    print("The number of words with a frequency of one in the entire corpus: ", total)
    print("The percentage of words with a frequency of one in the entire corpus: {percentage:.2%}".format(percentage=(total/vocabulary_size)))
    print("(Note that the number of features is: " + str(vocabulary_size) + ")")
    
def write_num_words_with_frequency_one_to_file(multinomialNB):
    total = get_num_words_with_frequency_one(multinomialNB)
    vocabulary_size = multinomialNB.feature_count_.shape[1]
    file_performance.write("(j) the number and percentage of words with a frequency of one in the entire corpus\n")
    file_performance.write("The number of words: " + str(total))
    file_performance.write("\n")
    file_performance.write("The percentage of words: {percentage:.2%}".format(percentage=(total/vocabulary_size)))
    file_performance.write("\n\n")
    
get_num_words_with_frequency_one(multinomialNB)
print_num_words_with_frequency_one(multinomialNB)
write_num_words_with_frequency_one_to_file(multinomialNB)
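
The counts for (i) and (j) can also be obtained without explicit loops. The following sketch uses a small hypothetical matrix shaped like `feature_count_` (rows are classes, columns are vocabulary terms); the variable names are illustrative.

```python
import numpy as np

# Hypothetical per-class term counts: 2 classes, 4 vocabulary terms.
feature_count = np.array([
    [3.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
])
vocab_size = feature_count.shape[1]

# (i) Number and share of terms that never occur in each class.
zero_per_class = (feature_count == 0).sum(axis=1)
zero_share = zero_per_class / vocab_size

# (j) Terms that occur exactly once across all classes combined.
corpus_totals = feature_count.sum(axis=0)
singletons = int((corpus_totals == 1).sum())
singleton_share = singletons / vocab_size

print("Zero-frequency terms per class:", zero_per_class, zero_share)
print("Terms with corpus frequency one:", singletons, f"{singleton_share:.2%}")
```

Note that these totals describe whatever data the model was fitted on, so with a train/test split they cover the training portion only.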

##### (k) your 2 favorite words (that are present in the vocabulary) and their log-prob

In [None]:
print("Index of word 'executive': {index}".format(index = vocabulary["executive"]))
print("Index of word 'the': {index}".format(index = vocabulary["the"]))

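def compute_favorite_words_log_prob(multinomialNB):
    feature_index_1 = vocabulary["executive"]
    feature_index_2 = vocabulary["the"]
    # feature_log_prob_ has shape (n_classes, n_features) and holds log P(word | class)
    p1 = multinomialNB.feature_log_prob_
    # Note: each value below is the sum of the word's per-class log-probabilities,
    # not the log-probability of the word within any single class
    log_prob_1 = sum(p1[:, feature_index_1])
    log_prob_2 = sum(p1[:, feature_index_2])
    return log_prob_1, log_prob_2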

def print_log_prob(multinomialNB):
    log_prob_1, log_prob_2 = compute_favorite_words_log_prob(multinomialNB)
    print("\n")
    print("log_prob of word 'executive': ", str(log_prob_1))
    print("log_prob of word 'the': ", str(log_prob_2))
    
def write_log_prob_to_file(multinomialNB):
    log_prob_1, log_prob_2 = compute_favorite_words_log_prob(multinomialNB)
    file_performance.write("(k) your 2 favorite words (that are present in the vocabulary) and their log-prob\n")
    file_performance.write("log_prob of word 'executive': " + str(log_prob_1))
    file_performance.write("\n")
    file_performance.write("log_prob of word 'the': " + str(log_prob_2))
    file_performance.write("\n")

print_log_prob(multinomialNB)
write_log_prob_to_file(multinomialNB)

# Makes sense that the log_prob of "the" is higher than the log_prob of "executive"!
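
Since `feature_log_prob_` stores one log-probability per class for every word, it can be useful to look at the per-class values directly. The sketch below, on a hypothetical toy count matrix, also checks them against the additive smoothing formula log((N_ci + alpha) / (N_c + alpha * n_features)); `manual` and `term_index` are illustrative names.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: 4 documents, 3 vocabulary terms, 2 classes.
X_toy = np.array([[2, 0, 1], [1, 1, 0], [0, 3, 1], [0, 2, 2]])
y_toy = np.array([0, 0, 1, 1])
alpha = 1.0

model = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Smoothed estimate computed by hand from the per-class term counts.
counts = model.feature_count_
class_totals = counts.sum(axis=1, keepdims=True)
manual = np.log((counts + alpha) / (class_totals + alpha * counts.shape[1]))

# Per-class log-probability of a single term (column 0 here).
term_index = 0
per_class = model.feature_log_prob_[:, term_index]

print("log P(term 0 | class):", per_class)
print("Matches the manual formula:", np.allclose(manual, model.feature_log_prob_))
```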

#### 8. Redo steps 6 and 7 without changing anything (do not redo step 5, the dataset split). Change the model name to something like “MultinomialNB default values, try 2” and append the results to the file bbc-performance.txt.

In [None]:
def write_separator_to_file():
    file_performance.write("\n---------------------------------------------------------------------\n")

def write_model_to_file(model_name, cf_matrix, y_test, y_predicted, target_names, multinomialNB):
    write_separator_to_file()
    write_model_name_to_file(model_name)
    write_confusion_matrix_to_file(cf_matrix)
    write_classification_report_to_file(y_test, y_predicted, target_names)
    write_scores_to_file(y_test, y_predicted)
    write_prior_probabilities_to_file(multinomialNB)
    write_vocabulary_size_to_file(multinomialNB)
    write_word_tokens_by_class_to_file(multinomialNB)
    write_word_tokens_total_to_file(multinomialNB)
    write_words_with_frequency_zero_to_file(multinomialNB)
    write_num_words_with_frequency_one_to_file(multinomialNB)
    write_log_prob_to_file(multinomialNB)    

In [None]:
def create_multinomialNB(smoothing, X_train_counts, X_test, y_train, y_test, y_predicted, target_names, title):
    multinomialNB = MultinomialNB(alpha = smoothing)
    clf = multinomialNB.fit(X_train_counts, y_train)
    y_predicted = clf.predict(count_vect.transform(X_test))

    cf_matrix = confusion_matrix(y_test, y_predicted)

    print(str(cf_matrix) + "\n")
    print_classification_report(y_test, y_predicted, target_names)
    print_scores(y_test, y_predicted)
    write_model_to_file(title, cf_matrix, y_test, y_predicted, target_names, multinomialNB)

In [None]:
# Multinomial NB, default values, take #2

create_multinomialNB(title = "MultinomialNB default values, try 2",
                     smoothing = 1.0, # default smoothing  = 1.0 as per sklearn documentation 
                     X_train_counts = X_train_counts, 
                     X_test = X_test, 
                     y_train = y_train,
                     y_test = y_test,
                     y_predicted = y_predicted, 
                     target_names = target_names)

#### 9. Redo steps 6 and 7 again, but this time, change the smoothing value to 0.0001. Append the results at the end of bbc-performance.txt.

In [None]:
create_multinomialNB(title = "MultinomialNB (smoothing = 0.0001)",
                     smoothing = 0.0001,
                     X_train_counts = X_train_counts, 
                     X_test = X_test, 
                     y_train = y_train,
                     y_test = y_test,
                     y_predicted = y_predicted, 
                     target_names = target_names)

#### 10. Redo steps 6 and 7, but this time, change the smoothing value to 0.9. Append the results at the end of bbc-performance.txt.

In [None]:
create_multinomialNB(title = "MultinomialNB (smoothing = 0.9)",
                     smoothing = 0.9,
                     X_train_counts = X_train_counts, 
                     X_test = X_test, 
                     y_train = y_train,
                     y_test = y_test,
                     y_predicted = y_predicted, 
                     target_names = target_names)

#### 11. In a separate plain text file called bbc-discussion.txt, explain in 1 to 2 paragraphs:

##### (a) what metric is best suited to this dataset/task and why (see step (2))

In [None]:
question11_a = """ 
(a) 

First and foremost, it is necessary to discuss the class balance of business/entertainment/politics/sport/tech.
As illustrated in the first section, the classes are MOSTLY balanced (with some minor discrepancies).

Thus, accuracy along with the F1 measure (macro and weighted) are suitable metrics for this dataset.

This is highlighted by the fact that the accuracy score, macro-average F1, and weighted-average F1 all
have a similar result of 0.97-0.98!

If there were a major class imbalance, accuracy alone could be misleading, and macro-average F1 would be
more suitable because it gives every class equal weight, including the classes that are represented less.

-----------------------------------------------------------------------------------------------------------------------
"""

def print_discussion(text):
    print(text)
    
def write_discussion_to_file(text):
    file_discussion.write(text)
    

print_discussion(question11_a)
write_discussion_to_file("Question 11\n")
write_discussion_to_file(question11_a)

##### (b) why the performance of steps (8-10) are the same or are different than those of step (7) above.

In [None]:
question11_b = """
(b)

There are many interesting observations that can be drawn from steps 8-10. First and foremost, it is important
to note that step 8 (MultinomialNB with default values) yielded the same results as steps 6 and 7.
--> This is expected behavior given that we kept the same values and did not redo the train/test split!

Moreover, the performance of the MultinomialNB models was very similar for steps (8-10). All the models achieved
a similar performance in terms of accuracy, macro-average F1, and weighted-average F1.
(There are only very small differences in the confusion matrices.)

Lastly, a small change that can be observed is in the feature log probabilities of the words (Question #7 (k))
when changing the smoothing values.

Note that this is expected, since the smoothing value is added to every word count before the
log probabilities are computed.

"""

print_discussion(question11_b)
write_discussion_to_file(question11_b)
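
The effect of the smoothing value on the log-probabilities can be illustrated directly with the additive smoothing formula. This is a sketch with hypothetical numbers (a class with 1000 word-tokens and a vocabulary of 500 terms); `smoothed_log_prob` is a helper defined here for illustration, not a scikit-learn function.

```python
import numpy as np

def smoothed_log_prob(term_count, class_total, vocab_size, alpha):
    """Log of the additively smoothed estimate of P(term | class)."""
    return np.log((term_count + alpha) / (class_total + alpha * vocab_size))

# Hypothetical class: 1000 word-tokens in total, vocabulary of 500 terms.
class_total, vocab_size = 1000, 500

for alpha in (1.0, 0.9, 0.0001):
    unseen = smoothed_log_prob(0, class_total, vocab_size, alpha)
    frequent = smoothed_log_prob(50, class_total, vocab_size, alpha)
    print(f"alpha={alpha}: unseen term {unseen:.3f}, frequent term {frequent:.3f}")

# Shift in log-probability when moving from alpha=1.0 to alpha=0.0001.
unseen_shift = abs(
    smoothed_log_prob(0, class_total, vocab_size, 1.0)
    - smoothed_log_prob(0, class_total, vocab_size, 0.0001)
)
frequent_shift = abs(
    smoothed_log_prob(50, class_total, vocab_size, 1.0)
    - smoothed_log_prob(50, class_total, vocab_size, 0.0001)
)
print("Shift for unseen term:", unseen_shift)
print("Shift for frequent term:", frequent_shift)
```

In this toy setting the smoothing value mostly changes the log-probability of words that were never seen in a class, which is one plausible reason the overall scores move only slightly between runs.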

In [None]:
file_performance.close()
file_discussion.close()