This notebook puts classifier accuracy in context by comparing it against a majority class baseline, then inspects the features that matter most for classification.

### 1. Importing Necessary Libraries

First, we import the essential libraries for our task.
* `collections.Counter` is a handy tool for counting hashable objects, which we'll use to find the most common class label.
* `sklearn` (Scikit-learn) provides powerful and easy-to-use tools for machine learning. We import modules for feature extraction (`CountVectorizer`), data preprocessing (`preprocessing`), and our classification model (`linear_model`).
* `numpy` is a fundamental package for numerical computation in Python, which we'll use for array manipulation.

In [None]:
# Import the Counter class from the collections module for counting items in a list.
from collections import Counter
# Import CountVectorizer to convert text data into a matrix of token counts.
from sklearn.feature_extraction.text import CountVectorizer
# Import the preprocessing module to encode categorical labels into numerical format.
from sklearn import preprocessing
# Import the linear_model module which contains the Logistic Regression classifier.
from sklearn import linear_model
# Import the numpy library for efficient numerical operations, especially for handling arrays.
import numpy as np

### 2. Reading the Data

`read_data` opens a tab-separated values (`.tsv`) file and parses it line by line. Each line is split into a label (first column) and a text body (second column), which are appended to two parallel lists: `X` for the texts (features) and `Y` for the labels. The function returns both lists.

In [None]:
# Define a function to read data from a file.
def read_data(filename):
    # Initialize an empty list to store the text samples (features).
    X=[]
    # Initialize an empty list to store the corresponding labels.
    Y=[]
    # Open the specified file with UTF-8 encoding to handle a wide range of characters.
    with open(filename, encoding="utf-8") as file:
        # Iterate over each line in the file.
        for line in file:
            # Remove any trailing whitespace (like newline characters) and split the line by the tab character.
            cols=line.rstrip().split("\t")
            # The first column is the label.
            label=cols[0]
            # The second column is the text.
            # The sample text is already tokenized; if yours is not, do so here.
            text=cols[1]
            # Add the text to our feature list.
            X.append(text)
            # Add the label to our label list.
            Y.append(label)
    # Return the lists of features (X) and labels (Y).
    return X, Y
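
As a quick check of the expected file layout, the sketch below applies the same per-line parsing to an in-memory sample instead of a file on disk. The two sample lines and the helper name `parse_tsv_lines` are made up for illustration; they are not part of the dataset or of this notebook's code.

```python
import io

def parse_tsv_lines(handle):
    """Hypothetical helper: same per-line logic as read_data, for any open text handle."""
    X, Y = [], []
    for line in handle:
        cols = line.rstrip().split("\t")
        Y.append(cols[0])  # first column: label
        X.append(cols[1])  # second column: pre-tokenized text
    return X, Y

# Two made-up lines in "label<TAB>text" format.
sample = io.StringIO("pos\tthis movie was great\nneg\tnot worth watching\n")
X, Y = parse_tsv_lines(sample)
print(X)
print(Y)
```

If a line in your own file has no tab, `cols[1]` raises an `IndexError`, which is a useful signal that the file is not in the expected two-column format.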

### 3. Setting the Data Directory

This cell sets `directory`, the path to your dataset. Make sure the path is correct and that the directory contains `train.tsv` and `dev.tsv`, the two files this notebook reads.

In [None]:
# Change this to the directory with your data (from the CheckData_TODO.ipynb exercise).  
# The directory should contain train.tsv, dev.tsv and test.tsv
# This variable holds the path to the folder containing the data files.
directory="../data/text_classification_sample_data"

### 4. Loading Training and Development Sets

Here, we use our `read_data` function to load the training and development (validation) datasets from their respective `.tsv` files. The training data (`trainX`, `trainY`) will be used to train our model, and the development data (`devX`, `devY`) will be used to evaluate its performance and tune hyperparameters.

In [None]:
# Call the read_data function to load the training data.
trainX, trainY=read_data("%s/train.tsv" % directory)
# Call the read_data function to load the development (validation) data.
devX, devY=read_data("%s/dev.tsv" % directory)

Q1: Implement the majority class baseline for your data that we went over in `Hyperparameters.ipynb`

### 5. Implementing the Majority Class Baseline

A **majority class baseline** is a simple model that always predicts the most frequent label from the training set, regardless of the input. It is the basic point of reference for accuracy: on imbalanced data it can already score well, so a classifier's accuracy only means something relative to it. The function below finds the majority label in the training set, then computes the accuracy of predicting that label for every item in the development set.

In [None]:
# Define a function to calculate the majority class baseline accuracy.
def majority_class(trainY, devY):
    # Use Counter to count the occurrences of each unique label in the training set.
    label_counts = Counter(trainY)
    # Find the most common label and its count, and select just the label.
    majority_label = label_counts.most_common(1)[0][0]
    
    # Create a list of predictions where every prediction is the majority label.
    # The length of this list is the same as the number of items in the development set.
    predictions = [majority_label] * len(devY)
    
    # Calculate accuracy by comparing the predictions to the true development set labels.
    # sum(p == y for p, y in zip(predictions, devY)) counts how many predictions were correct.
    correct = sum(p == y for p, y in zip(predictions, devY))
    # Accuracy is the number of correct predictions divided by the total number of predictions.
    accuracy = correct / len(devY)
    
    # Return the list of predictions and the calculated accuracy.
    return predictions, accuracy
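
To see the baseline at work, here is a small worked example on made-up labels. It relies only on `collections.Counter`; the label lists are invented for illustration. The majority label comes from the training labels alone, so the baseline can score poorly when the development set has a different class balance.

```python
from collections import Counter

# Made-up labels: "neg" is the majority class in training.
toy_trainY = ["neg", "neg", "neg", "pos", "pos"]
toy_devY = ["pos", "neg", "pos", "neg"]

# Most frequent training label.
toy_majority = Counter(toy_trainY).most_common(1)[0][0]

# Fraction of development labels that match the majority label.
toy_accuracy = sum(y == toy_majority for y in toy_devY) / len(toy_devY)

print(toy_majority, toy_accuracy)
```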

### 6. Evaluating the Baseline Model

This cell executes the `majority_class` function to get the predictions and accuracy for our baseline model. The resulting accuracy `a` shows the score our more sophisticated model needs to beat.

In [None]:
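# Call the majority_class function with the training and development labels.
# 'p' will store the predictions, and 'a' will store the accuracy.
p, a = majority_class(trainY, devY)
# Print the baseline accuracy so it can be compared with the model's accuracy below.
print("Majority class baseline accuracy: %.3f" % a)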

Q2: After experimenting with hyperparameter choices in class, what is the best accuracy that you uncovered on your development data?  Which hyperparameter choices led to that accuracy?  Plug in the values here and execute the cell to yield the accuracy. 

### 7. Training and Evaluating a Logistic Regression Model

This cell contains the complete pipeline for training a Logistic Regression classifier.

1.  **Label Encoding**: Machine learning models require numerical input. `LabelEncoder` converts our text-based labels (e.g., 'positive', 'negative') into integers (e.g., 1, 0).
2.  **Vectorization**: `CountVectorizer` converts the text documents into a numerical matrix. Each row represents a document, and each column represents a unique word from the vocabulary. A `1` in the matrix indicates the presence of a word in a document (since `binary=True`). We limit the vocabulary to the 10,000 most frequent words to keep the model manageable.
3.  **Model Training**: We initialize a `LogisticRegression` model with specific hyperparameters (`C=0.1`, `penalty='l2'`) and train it on the vectorized training data (`X_train`) and encoded labels (`Y_train`).
4.  **Evaluation**: Finally, we evaluate the trained model's accuracy on the held-out development set.
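
The first two steps can be pictured with a plain-Python sketch. This is a conceptual illustration of what label encoding and binary bag-of-words vectorization produce, not scikit-learn's implementation: the toy documents and helper names are hypothetical, the result is a dense list of lists rather than a sparse matrix, and the `max_features` cutoff is omitted.

```python
toy_labels = ["pos", "neg", "pos"]
toy_docs = ["good good film", "bad film", "good plot"]

# Label encoding: the sorted unique labels get integer ids 0..k-1.
classes = sorted(set(toy_labels))
label_to_id = {label: i for i, label in enumerate(classes)}
encoded = [label_to_id[label] for label in toy_labels]

# Vocabulary: each whitespace token seen in training gets a column index.
vocab = sorted({tok for doc in toy_docs for tok in doc.split()})
col = {tok: j for j, tok in enumerate(vocab)}

def binary_vector(doc):
    """Return a 0/1 row marking which vocabulary tokens occur in doc."""
    row = [0] * len(vocab)
    for tok in doc.split():
        if tok in col:  # tokens outside the training vocabulary are ignored
            row[col[tok]] = 1
    return row

matrix = [binary_vector(doc) for doc in toy_docs]
print(classes, encoded)
print(vocab)
print(matrix)
```

The token "good" appears twice in the first toy document but still contributes a single `1`, which is the effect `binary=True` has in the real vectorizer.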

In [None]:
# Create an instance of the LabelEncoder.
le = preprocessing.LabelEncoder()
# Fit the encoder on the training labels to learn the mapping from text labels to numbers.
le.fit(trainY)
# Transform the training labels into their numerical representation.
Y_train=le.transform(trainY)
# Transform the development labels using the same learned mapping.
Y_dev=le.transform(devY)

# Initialize CountVectorizer with chosen hyperparameters.
# max_features=10000: Use the 10,000 most frequent words as features.
# analyzer=str.split: Split text on whitespace (since it's pre-tokenized).
# lowercase=False, strip_accents=None: Leave case and accents untouched.
# (With a callable analyzer, scikit-learn skips its own preprocessing, so these two settings only document intent.)
# binary=True: Use 1 for presence and 0 for absence of a word, rather than its frequency.
vectorizer = CountVectorizer(max_features=10000, analyzer=str.split, lowercase=False, strip_accents=None, binary=True)

# Fit the vectorizer on the training text to learn the vocabulary and then transform the text into a feature matrix.
X_train = vectorizer.fit_transform(trainX)
# Transform the development text into a feature matrix using the vocabulary learned from the training data.
X_dev = vectorizer.transform(devX)

# Initialize the Logistic Regression model.
# C=0.1: Sets the inverse of regularization strength. Smaller C means stronger regularization.
# solver='lbfgs': The optimization algorithm used to fit the weights (L-BFGS, a quasi-Newton method).
# penalty='l2': Use L2 regularization to prevent overfitting.
logreg = linear_model.LogisticRegression(C=0.1, solver='lbfgs', penalty='l2')

# Train the model using the training data features (X_train) and labels (Y_train).
model=logreg.fit(X_train, Y_train)

# Calculate and print the accuracy of the trained model on the development set.
print("Accuracy: %.3f" % logreg.score(X_dev, Y_dev))

Q3: For binary classification using logistic regression, the parameters of the learned model are given in `model.coef_[0]`. Print out the 25 features that are most associated with each class (i.e., the 25 parameters that have the largest positive values and the 25 parameters with the largest negative values). For reference, consider the `inverse_transform` function in [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder.transform) to get the class labels that correspond to positive(=1) and negative(=0), and the `vocabulary_` attribute of [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), a dictionary that maps each vocabulary term to its column index.


### 8. Analyzing Model Feature Weights

To understand *why* our model makes its predictions, we can inspect its learned weights (coefficients). For a binary logistic regression model, each feature (word) has a single weight.

* A large **positive** weight means the presence of that word strongly suggests the positive class (label `1`).
* A large **negative** weight means the presence of that word strongly suggests the negative class (label `0`).

This function extracts these weights, identifies the top 25 words for each class, and prints them out. This provides valuable insight into the model's decision-making process.

In [None]:
# Define a function to analyze and display the most important feature weights.
def analyze_weights(learned_model, label_encoder, count_vectorizer):
    # Get the array of learned coefficients (weights) from the model for the positive class (class 1).
    coefs = learned_model.coef_[0]
    
    # Get the vocabulary from the vectorizer. The .items() gives (term, index) pairs.
    # We sort by index to create an array of terms where the index matches the column in the feature matrix.
    vocab = np.array([term for term, idx in sorted(count_vectorizer.vocabulary_.items(), key=lambda x: x[1])])
    
    # Use the label encoder to find the original string labels for the numerical classes 0 and 1.
    class_labels = label_encoder.inverse_transform([0, 1])
    
    # Get the indices of the coefficients sorted from smallest to largest.
    # We take the last 25 and reverse them to get the indices of the 25 largest positive weights.
    top_pos_idx = np.argsort(coefs)[-25:][::-1]
    # Use these indices to get the corresponding terms from the vocabulary array.
    top_pos_terms = vocab[top_pos_idx]
    
    # Get the indices of the 25 smallest (most negative) coefficients.
    top_neg_idx = np.argsort(coefs)[:25]
    # Use these indices to get the corresponding terms from the vocabulary array.
    top_neg_terms = vocab[top_neg_idx]
    
    # Print the header for the positive class features.
    print(f"\nTop 25 features for class '{class_labels[1]}' (positive weights):")
    # Loop through the top positive terms and their corresponding weights.
    for term, weight in zip(top_pos_terms, coefs[top_pos_idx]):
        # Print the term and its weight, formatted for alignment and readability.
        print(f"{term:20s} {weight:.4f}")
    
    # Print the header for the negative class features.
    print(f"\nTop 25 features for class '{class_labels[0]}' (negative weights):")
    # Loop through the top negative terms and their corresponding weights.
    for term, weight in zip(top_neg_terms, coefs[top_neg_idx]):
        # Print the term and its weight, formatted for alignment and readability.
        print(f"{term:20s} {weight:.4f}")
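
The sign convention can be checked on a toy model. The sketch below uses made-up weights and a hypothetical bias, not values learned from the data. It ranks the terms by weight in the same spirit as `analyze_weights`, then shows how the presence of a positively or negatively weighted word moves the predicted probability of class `1` through the logistic function.

```python
import math

# Made-up weights and bias, for illustration only.
toy_weights = {"great": 1.2, "fine": 0.1, "boring": -0.9, "awful": -1.5}
toy_bias = 0.0

def prob_class_1(tokens):
    """P(class 1) = sigmoid(bias + sum of weights of the distinct known tokens)."""
    z = toy_bias + sum(toy_weights.get(tok, 0.0) for tok in set(tokens))
    return 1.0 / (1.0 + math.exp(-z))

ranked = sorted(toy_weights, key=toy_weights.get)  # most negative weight first
top_negative = ranked[:2]
top_positive = ranked[::-1][:2]

print(top_positive, top_negative)
print(prob_class_1(["great"]), prob_class_1(["awful"]))
```

With a zero bias, a document containing only "great" lands above 0.5 and one containing only "awful" lands below it, which is why the largest positive and most negative weights are read as the strongest indicators of classes `1` and `0` respectively.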

### 9. Running the Weight Analysis

This final cell calls the `analyze_weights` function, passing it the trained model, the label encoder, and the count vectorizer. This will print the lists of the most influential words for each class, completing our analysis.

In [None]:
# Call the analysis function with the trained model and the fitted encoder and vectorizer.
analyze_weights(model, le, vectorizer)