# Classification on Unseen Data (total 8 points)
In this final task, you should read the feather file 'TestQuestionsDF.feather.zstd' into a pandas dataframe. Hereafter this will be referred to as the test_set. <br>You can assume that the test_set is a random sample from the same dataset as 'TrainQuestionsDF.feather.zstd' (hereafter train_set).
Your goal is to classify the data in the test_set and achieve the best **average f1-score** using the train_set.
You are allowed to utilize any technique and model available in the scikit-learn library or the standard python libraries to do so.
Pay particular attention to the lessons learned from your experiments in the Classification notebook -- any of these approaches can be used to construct the model you use for prediction.
You can additionally choose to generate and/or construct any features from the available data. Remember that the test_set should be represented with the same feature space as the train_set. <br>For example, features based on text should be constructed with the same vocabulary on the test_set as the train_set.<br>
To achieve a high f1 score on unseen data, remember to utilize all the techniques you've learned in the lectures, lectorials and practicals.
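
The shared-vocabulary requirement above can be sketched as follows. This is a minimal illustration on made-up documents, not the assignment data; the names `train_docs`, `test_docs`, `X_train_demo` and `X_test_demo` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for the train_set and test_set text columns (hypothetical data)
train_docs = ["how to merge two dataframes", "segmentation fault in pointer code"]
test_docs = ["merge dataframes on index", "unknownword pointer"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and idf weights from the training documents only
X_train_demo = vectorizer.fit_transform(train_docs)
# transform reuses that vocabulary; words never seen in training are ignored
X_test_demo = vectorizer.transform(test_docs)

# Both matrices have one column per training-vocabulary term
print(X_train_demo.shape[1] == X_test_demo.shape[1])
```

Calling `fit` or `fit_transform` on the test documents instead would produce a different set of columns, which a model trained on the train_set features could not use.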

For this task, you are expected to submit the following:
1. This notebook with your code. The code should be well documented and must run without errors.
    There is no time limit, but it is good practice to save the parameters of the best model and add an option to generate a model with those parameters, without re-running the full hyper-parameter tuning. <br>
2. Up to 4 prediction files. Each prediction file must have exactly two columns, "Id" and "Label", with these headers and no other columns (e.g. index).<br>
 The file names should be SXXXXXXX-A2-predictions-\<n\>.csv - where n is a running integer {1,2,3,4}.
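
One possible way to save the best parameters and skip the tuning on later runs is sketched below. The file name `best_params.json`, the flag `RUN_TUNING` and the helper `load_or_tune_params` are assumptions made for illustration; none of them comes from the assignment.

```python
import json
from pathlib import Path

# Hypothetical file name and flag; adjust to your own notebook
PARAMS_FILE = Path("best_params.json")
RUN_TUNING = False


def load_or_tune_params(run_tuning, tune_fn, params_file=PARAMS_FILE):
    """Return hyper-parameters, running the search only when requested.

    tune_fn is any zero-argument function that returns a JSON-serialisable
    dict of parameters (for example, the best_params_ of a grid search).
    """
    if run_tuning or not params_file.exists():
        # Run the (possibly slow) search and store its result on disk
        params = tune_fn()
        params_file.write_text(json.dumps(params))
    else:
        # Reuse the stored parameters without tuning again
        params = json.loads(params_file.read_text())
    return params
```

A call such as `load_or_tune_params(RUN_TUNING, my_search)` would then return the stored dictionary, which can be unpacked into a model constructor with `**params`; `my_search` is a placeholder for your own tuning function.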

Your mark in this task will depend on the following:
1. The code is well documented, and the entire notebook runs without errors (1 point).
2. The submitted solutions are reproducible, i.e. the submitted code can generate the submitted prediction files (2 points).
3. The highest (out of the 4 prediction files) achieved average f1-score is in the following range:
 * (0.8, 1] (5 points)
 * (0.7, 0.8] (4 points)
 * (0.65, 0.7] (3 points)
 * (0, 0.65] (1 point)
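
The text above names an "average f1-score" without stating the averaging mode. The sketch below compares two modes that scikit-learn's `f1_score` offers, on made-up labels; which mode is used for marking is not stated here, so treat the choice as an assumption.

```python
from sklearn.metrics import f1_score

# Toy labels (hypothetical) with an imbalanced class distribution
y_true_demo = [0, 0, 0, 0, 1, 2]
y_pred_demo = [0, 0, 0, 0, 2, 2]

# "macro": unweighted mean of the per-class F1 scores
macro_f1 = f1_score(y_true_demo, y_pred_demo, average="macro", zero_division=0)
# "weighted": mean of the per-class F1 scores, weighted by class support
weighted_f1 = f1_score(y_true_demo, y_pred_demo, average="weighted", zero_division=0)

print("macro:", macro_f1, "weighted:", weighted_f1)
```

On imbalanced data the two values can differ noticeably, because the macro average gives a rarely predicted class the same weight as a frequent one.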

To support the reproducibility of your solution, use the random seed wherever the solution involves a random process.
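
As a minimal sketch of what seeding a random process looks like, the example below fits the same random forest twice with an explicit `random_state` on synthetic data. `SEED_DEMO` is a placeholder value, and the data is generated only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

SEED_DEMO = 0  # placeholder; the notebook uses the student id instead

# Synthetic features and labels (hypothetical data)
rng = np.random.RandomState(SEED_DEMO)
X_demo = rng.rand(40, 5)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

# Two forests built with the same seed on the same data
model_a = RandomForestClassifier(n_estimators=10, random_state=SEED_DEMO).fit(X_demo, y_demo)
model_b = RandomForestClassifier(n_estimators=10, random_state=SEED_DEMO).fit(X_demo, y_demo)

# With a fixed random_state the two fits give identical predicted probabilities
same = np.array_equal(model_a.predict_proba(X_demo), model_b.predict_proba(X_demo))
print(same)
```

Passing the seed explicitly keeps a model reproducible even when cells are re-run out of order, which a single global `np.random.seed` call does not guarantee.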

In [1]:
import numpy as np
import pandas as pd

In [2]:
# TODO: Any additional (if needed) import statements should be in this cell
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils import shuffle
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
import scipy.sparse
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.preprocessing import normalize

In [3]:
# TODO: Set the random seed as your student id (only numbers)
RANDOM_SEED = 3891013
np.random.seed(RANDOM_SEED)

In [4]:
def read_feather_to_df(feather_file_name):
    """
    The function expects to receive a path to feather file,
    it will read the file from the disk into a pandas dataframe
    """
    return pd.read_feather(feather_file_name)

In [5]:
# Converted test and training files from feather to dataframes
train_df = read_feather_to_df('TrainQuestionsDF.feather.zstd')
test_df = read_feather_to_df('TestQuestionsDF.feather.zstd')

In [6]:
# TODO: Write your code below.
# You can split it into as many cells and functions as you see fit to make it readable and well documented.

def series_to_tfidf(sr, **tfidfvectorizer_kwargs):
    """
    The function receives an array or a pandas Series that contains text strings (a.k.a documents).
    It then converts the documents into a matrix of TF-IDF features
    The function should return two objects:
    TfidfVectorizer object after it learned (fitted) the vocabulary and idf from the training set,
    and a document-term matrix (the original documents array transformed into a TF-IDF features matrix).
    :param sr: pd.Series, contains text strings
    :param tfidfvectorizer_kwargs: key-word arguments that will be passed to TfidfVectorizer class
    :return: two objects, the fitted TfidfVectorizer object and the tf-idf document-term sparse matrix
    """
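    # Default settings; the token pattern keeps hyphenated words (e.g. "scikit-learn") as a single token
    params = dict(analyzer='word', token_pattern=r"(?u)\b\w[\w-]*\w\b|\b\w+\b", stop_words='english')
    # Key-word arguments supplied by the caller override the defaults above
    params.update(tfidfvectorizer_kwargs)
    vectorizer = TfidfVectorizer(**params)
    # fit_transform learns the vocabulary and idf, and returns the document-term matrix, in one pass
    sparse_matrix = vectorizer.fit_transform(sr)
    return vectorizer, sparse_matrix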


def linear_svc(X, y, **linearsvc_kwargs):
    """
    Fits a LinearSVC model on the training data.
    :param X: training feature matrix
    :param y: array or pd.Series of the corresponding labels
    :param linearsvc_kwargs: key-word arguments that will be passed to LinearSVC class
    :return: the fitted LinearSVC model
    """
    svc = LinearSVC(**linearsvc_kwargs)
    # fit() returns the estimator itself, so the fitted model can be returned directly
    return svc.fit(X, y)

def random_forest(X, y, **random_forest_kwargs):
    """
    Fits a Random Forest model on the training data.
    :param X: training feature matrix
    :param y: array or pd.Series of the corresponding labels
    :param random_forest_kwargs: key-word arguments that will be passed to RandomForestClassifier class
    :return: the fitted RandomForestClassifier model
    """
    # 200 estimators by default; key-word arguments supplied by the caller override this
    params = dict(n_estimators=200)
    params.update(random_forest_kwargs)
    forest = RandomForestClassifier(**params)
    return forest.fit(X, y)

def create_csv(predictions, filename):
    """
    Writes the predictions to a CSV file with exactly two columns, 'Id' and 'Label'.
    :param predictions: array or list of predictions generated by a model
    :param filename: name of the file to be saved on the disk
    """
    # Ids are the running positions 0..n-1, one per prediction in the input array
    index = range(len(predictions))
    # Build the dataframe from a dictionary that maps each header (Id and Label) to its values
    df = pd.DataFrame(data={'Id': index, 'Label': predictions})
    # Write the dataframe to a CSV file with the intended name, leaving out the pandas index
    df.to_csv(filename, index=False)


In [7]:
# Built TF-IDF features from two text columns, 'Title' and 'Body', each with its own vocabulary learned from train_df
title_vectorizer, title_matrix = series_to_tfidf(train_df["Title"])
body_vectorizer, body_matrix = series_to_tfidf(train_df["Body"])
# Split both feature matrices with the same seed, so the same rows (and labels) end up in each split
X_train, X_test, y_train, y_test = train_test_split(title_matrix, train_df['Label'], test_size=0.2, random_state=RANDOM_SEED)
X_train_body, X_test_body, y_train, y_test = train_test_split(body_matrix, train_df['Label'], test_size=0.2, random_state=RANDOM_SEED)
# Represented the test_set with the same vocabularies as the Title and Body columns of train_df
test_set_body = body_vectorizer.transform(test_df["Body"])
test_set_title = title_vectorizer.transform(test_df["Title"])

In [8]:
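# L2-normalised each row of the feature matrices so that all documents are on one scale.
# Note: TfidfVectorizer already L2-normalises rows by default (norm='l2'), so this step is expected to leave the values unchanged.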
normalised_train = normalize(X_train)
normalised_test = normalize(X_test)
normalised_test_body = normalize(X_test_body)
normalised_test_set_body = normalize(test_set_body)
normalised_test_set_title = normalize(test_set_title)
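
The following check, on made-up documents, illustrates why an extra `normalize` call is expected to have no effect on default TF-IDF output. It is a sketch under the assumption that `TfidfVectorizer` is used with its default `norm='l2'`, as in this notebook; `docs_demo` and `tfidf_demo` are hypothetical names.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Toy documents (hypothetical data)
docs_demo = ["python list index error", "java null pointer exception", "python pointer"]
tfidf_demo = TfidfVectorizer().fit_transform(docs_demo)

# Euclidean length of every row of the TF-IDF matrix
row_norms = np.asarray(np.sqrt(tfidf_demo.multiply(tfidf_demo).sum(axis=1))).ravel()
# Compare the matrix before and after an additional L2 normalisation
unchanged = np.allclose(tfidf_demo.toarray(), normalize(tfidf_demo).toarray())

print(row_norms, unchanged)
```

If a different `norm` (or `norm=None`) were passed to the vectorizer, the explicit normalisation would start to matter.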

In [9]:
# Fitted each model on the appropriate training features.
# Each variable is named after the model and the column it was trained on.
fitted_random_forest_with_title = random_forest(normalize(X_train), y_train)
fitted_random_forest_with_body = random_forest(normalize(X_train_body), y_train)
fitted_linear_svc = linear_svc((X_train_body), y_train)
fitted_linear_svc_with_title = linear_svc((X_train), y_train)

In [10]:
# Tested and printed the F1 score of predictions made on test data of train_df by Random Forest algorithm based on 'Body' column
pred22 = fitted_random_forest_with_body.predict(normalize(X_test_body))
print("Average F1 for Random Forest with Column Body: ", f1_score(y_test, pred22, average = "weighted"))

Average F1 for Random Forest with Column Body:  0.7290954694880798


In [11]:
# Tested and printed the F1 score of predictions made on test data of train_df by Random Forest algorithm based on 'Title' column
pred21 = fitted_random_forest_with_title.predict(normalize(X_test))
print("Average F1 for Random Forest with Column Title: ", f1_score(y_test, pred21, average = "weighted"))

Average F1 for Random Forest with Column Title:  0.6931656726287834


In [12]:
# Tested and printed the F1 score of predictions made on test data of train_df by LinearSVC algorithm based on 'Body' column
pred20 = fitted_linear_svc.predict(X_test_body)
print("Average F1 for Linear SVC with Column Body: ", f1_score(y_test, pred20, average = "weighted"))

Average F1 for Linear SVC with Column Body:  0.7395513207878044


In [13]:
# Tested and printed the F1 score of predictions made on test data of train_df by LinearSVC algorithm based on 'Title' column
pred19 = fitted_linear_svc_with_title.predict(X_test)
print("Average F1 for Linear SVC with Column Title: ", f1_score(y_test, pred19, average = "weighted"))

Average F1 for Linear SVC with Column Title:  0.6866223635754027
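
The scores above evaluate 'Title' and 'Body' separately. A possible variation, not used in this notebook, is to place the two TF-IDF matrices side by side so that one model sees both columns. The sketch below uses made-up rows; all `*_demo` names and the two vectorizer names are hypothetical, and no claim is made about the score this would reach on the real data.

```python
import scipy.sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy rows standing in for train_df (hypothetical titles, bodies and labels)
titles_demo = ["pandas merge error", "null pointer in java",
               "numpy reshape question", "java thread deadlock"]
bodies_demo = ["merge two dataframes on a key", "exception thrown when object is null",
               "reshape an array to two dimensions", "two threads wait on each other"]
labels_demo = [0, 1, 0, 1]

# One vectorizer per column, each fitted on the training text only
title_vec = TfidfVectorizer().fit(titles_demo)
body_vec = TfidfVectorizer().fit(bodies_demo)

# Stack the two sparse matrices column-wise: one row per question, title features then body features
X_combined = scipy.sparse.hstack(
    [title_vec.transform(titles_demo), body_vec.transform(bodies_demo)]
).tocsr()

clf_demo = LinearSVC(random_state=0).fit(X_combined, labels_demo)
print(X_combined.shape)
```

The test_set would need the same treatment: transform each column with its own fitted vectorizer and stack the results in the same order.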


In [14]:
# Predicted labels of testing set(unseen data) with trained Random Forest model taking 'Body' as its features for text based prediction
pred_with_random_forest1 = fitted_random_forest_with_body.predict(normalize(test_set_body))
# Printing the predictions to a CSV file
create_csv(pred_with_random_forest1, "S3891013-A2-predictions-1.csv")

In [15]:
# Predicted labels of testing set(unseen data) with trained Random Forest model taking 'Title' as its features for text based prediction
pred_with_random_forest2 = fitted_random_forest_with_title.predict(normalize(test_set_title))
# Printing the predictions to a CSV file
create_csv(pred_with_random_forest2, "S3891013-A2-predictions-2.csv")

In [16]:
# Predicted labels of testing set(unseen data) with trained LinearSVC model taking 'Body' as its features for text based prediction
pred_with_linear_svc1 = fitted_linear_svc.predict(test_set_body)
# Printing the predictions to a CSV file
create_csv(pred_with_linear_svc1, "S3891013-A2-predictions-3.csv")

In [17]:
# Predicted labels of testing set(unseen data) with trained LinearSVC model taking 'Title' as its features for text based prediction
pred_with_linear_svc2 = fitted_linear_svc_with_title.predict(test_set_title)
# Printing the predictions to a CSV file
create_csv(pred_with_linear_svc2, "S3891013-A2-predictions-4.csv")