# Milestone 2

By December 15 you shall have implemented multiple baseline solutions to
your main text classification task. These should include both deep learning (DL) based methods,
such as those introduced in Weeks 5-6, and non-DL models, such as those shown in Week 3.
Baselines can also include simple rule-based methods (e.g. keyword matching or regular
expressions). Each baseline should be evaluated both quantitatively and qualitatively; more
details will be provided in the lecture on text classification (Week 3).
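As an illustration of the rule-based option mentioned above, here is a minimal keyword-matching sketch. The keyword list, the label convention (1 = flagged, 0 = not flagged) and the name `keyword_baseline` are assumptions made for this example, not part of the project data.

```python
import re

# Hypothetical keyword list; replace with terms relevant to the actual task.
KEYWORDS = ["coward", "lying", "idiot"]

# One case-insensitive pattern with word boundaries around each keyword.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, KEYWORDS)) + r")\b", re.IGNORECASE
)

def keyword_baseline(texts):
    """Return 1 for each text that contains at least one keyword, else 0."""
    return [1 if PATTERN.search(text) else 0 for text in texts]

sample = ["Trudeau is a coward.", "I live in hope."]
baseline_preds = keyword_baseline(sample)
print(baseline_preds)
```

Such a baseline needs no training, so it can be scored on the same test split as the learned models.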

## Load modules

In [56]:
import os
import pandas as pd
import conllu
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score

## Load files

In [50]:
# Load the entire .conllu file into a Python variable
conllu_data = []
with open('Data/preprocessed_dataset.conllu', 'r', encoding='utf-8') as f:
    for line in f:
        conllu_data.append(line.strip())

# Display a sample of the loaded data (e.g., first 5 lines)
print("\n".join(conllu_data[:5]))

anns = pd.read_table('Data/annotations.txt', header=None)

# sent_id = 0_0
# text = And this Conservative strategy has produced the angry and desperate wing-nuts like the fellow who called reporters 'lying pieces of Sh*t' this week.
1	And	and	CCONJ	CC	_	None	None	_	_
2	this	this	DET	DT	Number=Sing|PronType=Dem	None	None	_	_
3	Conservative	Conservative	ADJ	JJ	Degree=Pos	None	None	_	_


Format of `sent_id`: X_Y

where X is the id of the comment and Y is the id of the sentence within that comment.

Example: 2_3 refers to the sentence with id 3 in the comment with id 2 (ids start at 0, as in `0_0` above).
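The id format can be unpacked with a small helper. This is only a sketch: the name `split_sent_id` is made up here, and it assumes both parts are always integers.

```python
def split_sent_id(sent_id):
    """Split 'X_Y' into (comment_id, sentence_id) as integers."""
    comment_id, sentence_id = sent_id.split("_")
    return int(comment_id), int(sentence_id)

print(split_sent_id("2_3"))
```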

## First experiment: use of Naive Bayes classification

1. Extraction of the comment only
2. Vectorization
3. Splitting into training and testing data
4. Training the model using different parameters
5. Evaluating the model

In [36]:
def load_and_group_conllu_to_list(file_path):
    """
    Load the .conllu file and group sentences by comment ID, returning a list of concatenated texts.
    """
    comments_list = []
    current_comment_id = None
    current_text = []

    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            
            if line.startswith('# sent_id ='):
                # Extract the comment ID (X) from 'sent_id = X_Y'
                sent_id = line.split('=', 1)[1].strip()
                comment_id = sent_id.split('_')[0]

                # Check if we've moved to a new comment
                if current_comment_id is not None and comment_id != current_comment_id:
                    # Store the completed text for the previous comment
                    if current_text:
                        comments_list.append(" ".join(current_text))
                    current_text = []

                # Update the current comment ID
                current_comment_id = comment_id

            elif line.startswith('# text ='):
                # Extract the text for the current sentence
                # (split only once: the text itself may contain '=')
                sentence_text = line.split('=', 1)[1].strip()
                current_text.append(sentence_text)

        # Add the last comment if any
        if current_comment_id is not None and current_text:
            comments_list.append(" ".join(current_text))

    return comments_list


1. Extraction of the comments

In [39]:
file_path = 'Data/preprocessed_dataset.conllu'
comments_list = load_and_group_conllu_to_list(file_path)
comments_list[:3]

["And this Conservative strategy has produced the angry and desperate wing-nuts like the fellow who called reporters 'lying pieces of Sh*t' this week. The fortunate thing is that reporters were able to report it and broadcast it - which may shake up a few folks who recognize a bit of themselves somewhere in there and do some reflecting. I live in hope.",
 "I commend Harper for holding the debates outside of a left-wing forum as this will help prevent the left from manipulating the debates to try to make Harper look bad. Indeed, we’ll finally have some fair debates. Trudeau is a coward and the only one who’s opposing this as he’s terrified about losing left-wing protection during the debates if the debates are held elsewhere. If Trudeau doesn’t have Chretien or Martin speaking for him or isn't currently in training to learn how to handle himself in a debate, he has May attending the debates to hold his little hand. If Trudeau can’t speak for himself or handle debates, how does he expect

2. Vectorization (using TF-IDF)

In [52]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(comments_list)
# Flatten the single-column DataFrame to a 1d array, as scikit-learn expects for y
y = anns.values.ravel()
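To see what the vectorizer produces, it can help to run it on a tiny input. The two toy comments below are made up for illustration; the `toy_` names are chosen so as not to overwrite the notebook's `X` and `vectorizer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy comments, made up for illustration only.
toy_comments = ["good debate", "bad debate"]
toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_comments)

# One column per vocabulary word, one row per comment.
print(sorted(toy_vectorizer.vocabulary_))
print(toy_X.shape)

# "debate" occurs in both comments, so it is weighted lower than the
# word that distinguishes each comment.
print(toy_X.toarray().round(2))
```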

3. Splitting into training and testing data

In [53]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
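Note that the vectorizer above is fitted on all comments before the split, so the vocabulary and IDF statistics also reflect the test comments. One possible alternative, sketched below under assumptions, is to split the raw texts first and wrap the vectorizer and the classifier in a scikit-learn `Pipeline`, so that TF-IDF is fitted on training data only (including inside each cross-validation fold). The toy texts and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration only.
toy_texts = [
    "great point well argued", "thoughtful and fair comment",
    "I agree with this analysis", "useful context thank you",
    "nice summary of the debate", "balanced and informative piece",
    "what a stupid idea", "this is garbage", "you are a liar",
    "total nonsense as usual", "pathetic excuse for journalism",
    "shut up already",
]
toy_labels = [0] * 6 + [1] * 6

# Split the raw texts first; vectorization happens inside the pipeline.
toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Step parameters are addressed as <step name>__<parameter>.
toy_grid = {"nb__alpha": [0.1, 0.5, 1.0]}
toy_search = GridSearchCV(pipeline, toy_grid, cv=3, scoring="accuracy")
toy_search.fit(toy_train, toy_y_train)

print("Best alpha:", toy_search.best_params_["nb__alpha"])
print("Test accuracy:", toy_search.score(toy_test, toy_y_test))
```

With the real data, `comments_list` and the flattened labels would take the place of `toy_texts` and `toy_labels`; the scores may then differ from the ones reported in this notebook.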

4. Training the Naive Bayes classifier

In [57]:
param_grid = {'alpha': [0.1, 0.5, 1.0, 1.5, 2.0]}  # Adjust the range as needed
model = MultinomialNB()
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit GridSearchCV on the training data
grid_search.fit(X_train, y_train)

# Get the best model from GridSearchCV
best_model = grid_search.best_estimator_
print(f"Best alpha: {grid_search.best_params_['alpha']}")
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")

Best alpha: 0.1
Best cross-validation accuracy: 0.6761




5. Evaluate the model

In [58]:
y_pred = best_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.6875

Classification Report:
               precision    recall  f1-score   support

           0       0.74      0.48      0.58      1086
           1       0.67      0.86      0.75      1314

    accuracy                           0.69      2400
   macro avg       0.70      0.67      0.67      2400
weighted avg       0.70      0.69      0.67      2400



## Second experiment: use of feature-based models

Idea: use the features from the CoNLL-U format (e.g. lemmas, POS tags, morphological features) as input to machine learning models (SVM, Random Forest, KNN and others) to classify the text.
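As a first step towards this idea, the sketch below counts universal POS tags (the fourth tab-separated CoNLL-U column) for a group of token lines. It is one possible feature set among many; the function name `upos_counts` and the `upos=` feature prefix are assumptions made for this example.

```python
from collections import Counter

# Three token lines copied from the sample output earlier in this notebook.
sample_lines = [
    "1\tAnd\tand\tCCONJ\tCC\t_\tNone\tNone\t_\t_",
    "2\tthis\tthis\tDET\tDT\tNumber=Sing|PronType=Dem\tNone\tNone\t_\t_",
    "3\tConservative\tConservative\tADJ\tJJ\tDegree=Pos\tNone\tNone\t_\t_",
]

def upos_counts(token_lines):
    """Count universal POS tags (4th CoNLL-U column) over token lines."""
    counts = Counter()
    for line in token_lines:
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        columns = line.split("\t")
        if len(columns) >= 4:
            counts["upos=" + columns[3]] += 1
    return dict(counts)

print(upos_counts(sample_lines))
```

One such dictionary per comment can be turned into a feature matrix with scikit-learn's `DictVectorizer` and then passed to any of the classifiers listed above.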