# Milestone 2

By December 15 you shall have implemented multiple baseline solutions to
your main text classification task. These should include both deep learning (DL) based methods
such as those introduced in Weeks 5-6 but also non-DL models such as those shown in Week 3.
Baselines can also include simple rule-based methods (e.g. keyword matching or regular expres-
sions). Each baseline should be evaluated both quantitatively and qualitatively, more details will
be provided in the lecture on text classification (Week 3)

## Load modules

In [1]:
import os
import pandas as pd
import conllu
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

## Load files

In [2]:
# Load the entire .conllu file into a Python variable
conllu_data = []
with open('Data/preprocessed_dataset.conllu', 'r', encoding='utf-8') as f:
    for line in f:
        conllu_data.append(line.strip())

# Display a sample of the loaded data (e.g., first 5 lines)
print("\n".join(conllu_data[:10]))

anns = pd.read_table('Data/annotations.txt', header=None)

# sent_id = 0_0
# text = And this Conservative strategy has produced the angry and desperate wing-nuts like the fellow who called reporters 'lying pieces of Sh*t' this week.
1	And	and	CCONJ	CC	_	None	None	_	_
2	this	this	DET	DT	Number=Sing|PronType=Dem	None	None	_	_
3	Conservative	Conservative	ADJ	JJ	Degree=Pos	None	None	_	_
4	strategy	strategy	NOUN	NN	Number=Sing	None	None	_	_
5	has	have	AUX	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	None	None	_	_
6	produced	produce	VERB	VBN	Tense=Past|VerbForm=Part	None	None	_	_
7	the	the	DET	DT	Definite=Def|PronType=Art	None	None	_	_
8	angry	angry	ADJ	JJ	Degree=Pos	None	None	_	_


Format of sent_id : X_Y

Where X is the id of the comment, while Y is the id of the sentence.

Example : 2_3 means it's comment with id 2 and its 3rd sentence.

## First experiment: use of Naive Bayes classification (with original and pre-processed text)

1. Extraction of the comment only
2. Vectorization
3. Splitting into training and testing data
4. Training the model using different parameters
5. Evaluating the model

In [3]:
def extract_original_text(file_path):
    """
    Load the .conllu file and group sentences by comment ID, returning a list of concatenated texts.
    """
    comments_list = []
    current_comment_id = None
    current_text = []

    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            
            if line.startswith('# sent_id ='):
                # Extract the comment ID (X) from 'sent_id = X_Y'
                sent_id = line.split('=')[1].strip()
                comment_id = sent_id.split('_')[0]

                # Check if we've moved to a new comment
                if current_comment_id is not None and comment_id != current_comment_id:
                    # Store the completed text for the previous comment
                    if current_text:
                        comments_list.append(" ".join(current_text))
                    current_text = []

                # Update the current comment ID
                current_comment_id = comment_id

            elif line.startswith('# text ='):
                # Extract the text for the current sentence
                sentence_text = line.split('=')[1].strip()
                current_text.append(sentence_text)

        # Add the last comment if any
        if current_comment_id is not None and current_text:
            comments_list.append(" ".join(current_text))

    return comments_list


In [4]:
def extract_preprocessed_text(file_path):
    """
    Load the .conllu file and group sentences by comment ID, returning a list of concatenated cleaned texts using lemmas.
    """
    comments_list = []
    current_comment_id = None
    current_text = []

    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            
            if line.startswith('# sent_id ='):
                # Extract the comment ID (X) from 'sent_id = X_Y'
                sent_id = line.split('=')[1].strip()
                comment_id = sent_id.split('_')[0]

                # Check if we've moved to a new comment
                if current_comment_id is not None and comment_id != current_comment_id:
                    # Store the completed text for the previous comment
                    if current_text:
                        comments_list.append(" ".join(current_text))
                    current_text = []

                current_comment_id = comment_id

            elif not line.startswith('#') and line:
                # Extract the lemma (3rd column)
                columns = line.split('\t')
                if len(columns) > 2:
                    lemma = columns[2].lower()  # Use the lemma column in lowercase
                    current_text.append(lemma)

        # Add the last comment if any
        if current_comment_id is not None and current_text:
            comments_list.append(" ".join(current_text))

    return comments_list


1. Extraction of the comments

In [5]:
file_path = 'Data/preprocessed_dataset.conllu'
original_list = extract_original_text(file_path)
original_list[:3]

["And this Conservative strategy has produced the angry and desperate wing-nuts like the fellow who called reporters 'lying pieces of Sh*t' this week. The fortunate thing is that reporters were able to report it and broadcast it - which may shake up a few folks who recognize a bit of themselves somewhere in there and do some reflecting. I live in hope.",
 "I commend Harper for holding the debates outside of a left-wing forum as this will help prevent the left from manipulating the debates to try to make Harper look bad. Indeed, we’ll finally have some fair debates. Trudeau is a coward and the only one who’s opposing this as he’s terrified about losing left-wing protection during the debates if the debates are held elsewhere. If Trudeau doesn’t have Chretien or Martin speaking for him or isn't currently in training to learn how to handle himself in a debate, he has May attending the debates to hold his little hand. If Trudeau can’t speak for himself or handle debates, how does he expect

In [6]:
preprocessed_list = extract_preprocessed_text(file_path)
preprocessed_list[:3]

["and this conservative strategy have produce the angry and desperate wing - nut like the fellow who call reporter 's lying piece of sh*tember ' this week . the fortunate thing be that reporter be able to report it and broadcast it - which may shake up a few folk who recognize a bit of themselves somewhere in there and do some reflect . i live in hope .",
 'i commend harper for hold the debate outside of a left - wing forum as this will help prevent the left from manipulate the debate to try to make harper look bad . indeed , we will finally have some fair debate . trudeau be a coward and the only one who be oppose this as he be terrified about lose left - wing protection during the debate if the debate be hold elsewhere . if trudeau do not have chretien or martin speak for he or be not currently in training to learn how to handle himself in a debate , he have may attend the debate to hold his little hand . if trudeau can not speak for himself or handle debate , how do he expect to run

### For original text

2. Vectorization (using TF-IDF)
3. Splitting into training and testing data
4. Training the Naive Bayes classifier
5. Evaluate the model

In [7]:
#2 Vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(original_list)
y = anns

#3 Splitting into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#4 Training the Naive Bayes classifier
param_grid = {'alpha': [0.1, 0.5, 1.0, 1.5, 2.0]}  # Adjust the range as needed
model = MultinomialNB()
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit GridSearchCV on the training data
grid_search.fit(X_train, y_train)

# Get the best model from GridSearchCV
best_model = grid_search.best_estimator_
print(f"Best alpha: {grid_search.best_params_['alpha']}")
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")

#5 Evaluate the model
y_pred = best_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Best alpha: 0.1
Best cross-validation accuracy: 0.6761
Accuracy: 0.6875

Classification Report:
               precision    recall  f1-score   support

           0       0.74      0.48      0.58      1086
           1       0.67      0.86      0.75      1314

    accuracy                           0.69      2400
   macro avg       0.70      0.67      0.67      2400
weighted avg       0.70      0.69      0.67      2400



  y = column_or_1d(y, warn=True)


### For preprocessed text

In [8]:
#2 Vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(preprocessed_list)
y = anns

#3 Splitting into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#4 Training the Naive Bayes classifier
param_grid = {'alpha': [0.1, 0.5, 1.0, 1.5, 2.0]}  # Adjust the range as needed
model = MultinomialNB()
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit GridSearchCV on the training data
grid_search.fit(X_train, y_train)

# Get the best model from GridSearchCV
best_model = grid_search.best_estimator_
print(f"Best alpha: {grid_search.best_params_['alpha']}")
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")

#5 Evaluate the model
y_pred = best_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Best alpha: 0.1
Best cross-validation accuracy: 0.6760
Accuracy: 0.6808333333333333

Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.47      0.57      1086
           1       0.66      0.85      0.75      1314

    accuracy                           0.68      2400
   macro avg       0.69      0.66      0.66      2400
weighted avg       0.69      0.68      0.67      2400



  y = column_or_1d(y, warn=True)


# Second experiment: Use Feature based models 

Idea: Use the features from the CONLL-U format to feed Machine Learning models (SVM, Random Forest, KNN and else) in order to classify the text

1. Extraction of the features
2. Splitting into training and testing data
3. Training the models using different parameters
4. Evaluating the models

In [20]:
def load_conllu_data(file_path):
    comments = {}
    
    with open(file_path, 'r', encoding='utf-8') as f:
        current_comment_id = None
        pos_counts = Counter()
        num_tokens = 0
        total_word_length = 0
        num_sentences = 0
        
        for line in f:
            line = line.strip()
            
            # Check if the line is a sentence ID line
            if line.startswith("# sent_id"):
                sent_id = line.split("= ")[1]
                current_comment_id = sent_id.split('_')[0]
                
                # Initialize a new comment entry if not already present
                if current_comment_id not in comments:
                    comments[current_comment_id] = {
                        'num_tokens': 0,
                        'total_word_length': 0,
                        'pos_counts': Counter(),
                        'num_sentences': 0
                    }
                    pos_counts = Counter()
                    num_tokens = 0
                    total_word_length = 0
                    num_sentences = 0

            # Check if the line is a text line
            elif line.startswith("# text"):
                text = line.split("= ")[1]

            # Process token lines
            elif line and not line.startswith("#"):
                columns = line.split("\t")
                if len(columns) >= 4:
                    token = columns[1]
                    pos_tag = columns[3]
                    
                    # Extract token-level features
                    word_length = len(token)
                    total_word_length += word_length
                    num_tokens += 1
                    pos_counts[pos_tag] += 1
            
            # End of a sentence block
            if line == "" and current_comment_id is not None:
                comments[current_comment_id]['num_tokens'] += num_tokens
                comments[current_comment_id]['total_word_length'] += total_word_length
                comments[current_comment_id]['pos_counts'].update(pos_counts)
                comments[current_comment_id]['num_sentences'] += 1

    # Convert the aggregated features to a DataFrame
    features = []
    for comment_id, data in comments.items():
        pos_counts = data['pos_counts']
        avg_word_length = data['total_word_length'] / data['num_tokens'] if data['num_tokens'] > 0 else 0
        
        feature_dict = {
            'comment_id': comment_id,
            'num_tokens': data['num_tokens'],
            'avg_word_length': avg_word_length,
            'num_sentences': data['num_sentences'],
            'num_nouns': pos_counts.get('NOUN', 0),
            'num_verbs': pos_counts.get('VERB', 0),
            'num_adjectives': pos_counts.get('ADJ', 0),
            'num_adverbs': pos_counts.get('ADV', 0),
            'num_pronouns': pos_counts.get('PRON', 0),
            'num_conjunctions': pos_counts.get('CCONJ', 0),
            'num_determiners': pos_counts.get('DET', 0),
        }
        features.append(feature_dict)
    
    df = pd.DataFrame(features)
    return df

1. Extraction of the features

In [24]:
file_path = 'Data/preprocessed_dataset.conllu'
features_conllu = load_conllu_data(file_path)

2. Splitting into training and testing data

In [31]:
X = features_conllu.drop(['comment_id'], axis=1)
y = anns 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

3. Training the models using different parameters
4. Evaluating the best models

In [40]:
# Define hyperparameter grids
param_grids = {
    'knn': {
        'n_neighbors': [3, 5, 10, 15,20,50,100],
        'weights': ['uniform', 'distance'],
        'metric': ['euclidean', 'manhattan']
    },
    'rf': {
        'n_estimators': [10,25, 50, 100, 150],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'logisticregression': {
        'penalty': ['l2'],
        'C': [0.01, 0.1, 1, 10, 100],
        'solver': ['lbfgs', 'liblinear']
    }
}

# Train and tune models
models = {
    'knn': KNeighborsClassifier(),
    'rf': RandomForestClassifier(random_state=42),
    'logisticregression': LogisticRegression()
}

best_models = {}

for model_name, model in models.items():
    print(f"\nTraining {model_name}...")
    grid_search = GridSearchCV(model, param_grids[model_name.lower()], cv=5, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    print(f"Best Parameters for {model_name}: {grid_search.best_params_}")
    best_model = grid_search.best_estimator_
    best_models[model_name] = best_model

    # Evaluate on the test set
    y_pred = best_model.predict(X_test)
    print(f"\n{model_name} Accuracy:", accuracy_score(y_test, y_pred))
    print(f"\n{model_name} Classification Report:\n", classification_report(y_test, y_pred))


Training knn...


  return self._fit(X, y)


Best Parameters for knn: {'metric': 'manhattan', 'n_neighbors': 50, 'weights': 'uniform'}

knn Accuracy: 0.9075

knn Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.89      0.90      1086
           1       0.91      0.92      0.92      1314

    accuracy                           0.91      2400
   macro avg       0.91      0.91      0.91      2400
weighted avg       0.91      0.91      0.91      2400


Training rf...


  return fit_method(estimator, *args, **kwargs)


Best Parameters for rf: {'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 50}

rf Accuracy: 0.9233333333333333

rf Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.92      0.92      1086
           1       0.93      0.93      0.93      1314

    accuracy                           0.92      2400
   macro avg       0.92      0.92      0.92      2400
weighted avg       0.92      0.92      0.92      2400


Training logisticregression...
Best Parameters for logisticregression: {'C': 100, 'penalty': 'l2', 'solver': 'lbfgs'}

logisticregression Accuracy: 0.91

logisticregression Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.94      0.90      1086
           1       0.95      0.89      0.92      1314

    accuracy                           0.91      2400
   macro avg       0.91      0.91      0.91      2400
weighted avg       0.91      0.91      0.91      2400



  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Model Performance Comparison

| Classifier              | Accuracy | Precision (0) | Recall (0) | F1-score (0) | Precision (1) | Recall (1) | F1-score (1) | Macro Avg Precision | Macro Avg Recall | Macro Avg F1-score |
|-------------------------|----------|----------------|-------------|---------------|----------------|-------------|---------------|---------------------|-------------------|---------------------|
| **Naive Bayes on original data**         | 0.6875   | 0.74           | 0.48        | 0.58          | 0.67           | 0.86        | 0.75          | 0.70                | 0.67              | 0.67                |
| **Naive Bayes on preprocessed data**         | 0.6808   | 0.73           | 0.47        | 0.57         | 0.66           | 0.85        | 0.75          | 0.69                | 0.66              | 0.66                |
| **K-Nearest Neighbors** | 0.9075   | 0.90           | 0.89        | 0.90          | 0.91           | 0.92        | 0.92          | 0.91                | 0.91              | 0.91                |
| **Random Forest**       | 0.9233   | 0.91           | 0.92        | 0.92          | 0.93           | 0.93        | 0.93          | 0.92                | 0.92              | 0.92                |
| **Logistic Regression** | 0.91     | 0.87           | 0.94        | 0.90          | 0.95           | 0.89        | 0.92          | 0.91                | 0.91              | 0.91                |

---

### Explanation of the Columns:
- **Accuracy**: Overall accuracy of the model.
- **Precision (0)**, **Recall (0)**, **F1-score (0)**: Metrics for class **0**.
- **Precision (1)**, **Recall (1)**, **F1-score (1)**: Metrics for class **1**.
- **Macro Avg Precision**, **Macro Avg Recall**, **Macro Avg F1-score**: Averages of precision, recall, and F1-score for both classes

# Now Onto Deep Learning solutions!