This notebook uses a permutation test to assess the significance of the coefficients learned by logistic regression (testing each one against the null hypothesis that $\beta = 0$).

#### Cell 1: Imports

This first code block imports all the necessary libraries for the script.

In [None]:
# Import the sys module to interact with the Python interpreter.
import sys
# Import preprocessing from scikit-learn, specifically for LabelEncoder to convert text labels to integers.
from sklearn import preprocessing
# Import linear_model from scikit-learn, which contains the LogisticRegression classifier.
from sklearn import linear_model
# Import CountVectorizer to convert a collection of text documents to a matrix of token counts.
from sklearn.feature_extraction.text import CountVectorizer
# Import the shuffle function to randomly permute the labels for the permutation test.
from random import shuffle
# Import the NumPy library, which is essential for numerical operations, especially for handling arrays.
import numpy as np
# Import the copy module to create a deep copy of the labels list, ensuring the original list is not modified.
import copy

--- 

#### Cell 2: Data Reading Function

This cell defines a function `read_data` to load the dataset from a tab-separated values (`.tsv`) file.

In [None]:
# Define a function named 'read_data' that takes a filename as input.
def read_data(filename):
    # Initialize an empty list 'X' to store the text data (features).
    X=[]
    # Initialize an empty list 'Y' to store the corresponding labels.
    Y=[]
    # Open the specified file with UTF-8 encoding to handle a wide range of characters.
    with open(filename, encoding="utf-8") as file:
        # Iterate over each line in the file.
        for line in file:
            # Remove trailing whitespace and split the line into columns by the tab character.
            cols=line.rstrip().split("\t")
            # The first column is the label.
            label=cols[0]
            # The second column is the text.
            text=cols[1]
            # Assumes the text is already tokenized (words are separated by spaces).
            # Append the text to the feature list X.
            X.append(text)
            # Append the label to the label list Y.
            Y.append(label)
    # Return the lists of features (X) and labels (Y).
    return X, Y
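
To make the expected file layout concrete, the following sketch writes a two-line TSV to a temporary file and reads it back. The rows, labels, and file name are invented for illustration; the reader is a condensed copy of `read_data` so the example runs on its own.

```python
import os
import tempfile

def read_data(filename):
    """Read a label<TAB>text file into parallel lists (condensed copy of the cell above)."""
    X, Y = [], []
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols = line.rstrip().split("\t")
            X.append(cols[1])   # second column: whitespace-tokenized text
            Y.append(cols[0])   # first column: label
    return X, Y

# Hypothetical sample rows: label, a tab, then the text.
rows = ["pos\ta great movie\n", "neg\ta dull plot\n"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.tsv")
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(rows)
    sample_X, sample_Y = read_data(path)

print(sample_X)
print(sample_Y)
```

A line with fewer than two tab-separated columns would raise an `IndexError` at `cols[1]`, so the input files are assumed to be well formed.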

--- 

#### Cell 3: Set Data Directory

Here, we define the path to the folder containing our dataset. You should change this string to match the location on your own machine.

In [None]:
# Change this to the directory with your data (from the CheckData_TODO.ipynb exercise).  
# The directory should contain train.tsv, dev.tsv and test.tsv
# Set a variable 'directory' to the path of the data folder.
directory="../data/text_classification_sample_data"

--- 

#### Cell 4: Load Training and Development Data

This code calls the `read_data` function to load the training and development (validation) sets into memory.

In [None]:
# Call the read_data function to load the training data and store it in trainX and trainY.
trainX, trainY=read_data("%s/train.tsv" % directory)
# Call the read_data function to load the development data and store it in devX and devY.
devX, devY=read_data("%s/dev.tsv" % directory)

--- 

#### Cell 5: Featurization Function

The `featurize` function converts the raw text data into a numerical format that the machine learning model can understand. It uses a "bag-of-words" approach with binary values (presence or absence of a word).

In [None]:
# Define a function 'featurize' that takes the raw text of the training and development sets as input.
def featurize(trainX, devX):
    # Initialize CountVectorizer.
    # max_features=10000: Keep only the 10,000 most frequent words.
    # analyzer=str.split: Split text into words using whitespace.
    # lowercase=False: Do not convert words to lowercase.
    # strip_accents=None: Do not remove accents.
    # binary=True: Use 1 if a word is present in a document, 0 otherwise (instead of word counts).
    vectorizer = CountVectorizer(max_features=10000, analyzer=str.split, lowercase=False, strip_accents=None, binary=True)

    # Fit the vectorizer to the training data to learn the vocabulary and then transform trainX into a feature matrix.
    X_train = vectorizer.fit_transform(trainX)
    # Transform the development data using the same vocabulary learned from the training data.
    X_dev = vectorizer.transform(devX)

    # Return the transformed training and development data, and the fitted vectorizer object.
    return X_train, X_dev, vectorizer
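
As a rough illustration of what binary bag-of-words features look like, the sketch below runs a `CountVectorizer` with the same `analyzer` and `binary` settings on three invented documents. The toy texts and the `toy_` variable names are assumptions made for this example only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (made up); tokens are already separated by spaces.
toy_train = ["good film", "bad film film"]
toy_dev = ["good unseen"]

toy_vectorizer = CountVectorizer(analyzer=str.split, binary=True)
# Learn the vocabulary from the training texts only.
toy_X_train = toy_vectorizer.fit_transform(toy_train).toarray()
# Reuse that vocabulary for the dev texts; unknown words are dropped.
toy_X_dev = toy_vectorizer.transform(toy_dev).toarray()

print(toy_vectorizer.vocabulary_)   # word -> column index
print(toy_X_train)
print(toy_X_dev)
```

The repeated "film" in the second document still yields a 1 because of `binary=True`, and "unseen" contributes nothing to the dev matrix because it was never seen during fitting.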

--- 

#### Cell 6: Model Training Function

This function, `train`, takes the featurized training data and labels, and trains a logistic regression classifier.

In [None]:
# Define a function 'train' that takes the training features, training labels, and a LabelEncoder object.
def train(X_train, trainY, le):
    # Use the LabelEncoder to transform the string labels (e.g., "positive", "negative") into integers (e.g., 1, 0).
    Y_train=le.transform(trainY)
    # Initialize the Logistic Regression model.
    # C=100: A smaller C specifies stronger regularization. Here, C=100 means relatively weak regularization.
    # solver='lbfgs': An efficient optimization algorithm.
    # penalty='l2': Use L2 regularization to prevent overfitting.
    # max_iter=10000: Allow up to 10,000 iterations for the solver to converge.
    logreg = linear_model.LogisticRegression(C=100, solver='lbfgs', penalty='l2', max_iter=10000)
    # Train the model using the feature matrix (X_train) and the integer labels (Y_train).
    logreg.fit(X_train, Y_train)
    # Return the trained logistic regression model object.
    return logreg
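
The label encoding matters when reading the coefficients later, so here is a small sketch of how `LabelEncoder` assigns integers. The labels "pos" and "neg" are hypothetical stand-ins; the actual label strings depend on the dataset.

```python
from sklearn import preprocessing

toy_le = preprocessing.LabelEncoder()
toy_le.fit(["pos", "neg", "pos"])   # hypothetical labels

# Classes are stored in sorted order, so "neg" becomes 0 and "pos" becomes 1.
print(list(toy_le.classes_))
print(toy_le.transform(["neg", "pos"]))
print(toy_le.inverse_transform([0, 1]))
```

For a two-class problem, `logreg.coef_[0]` holds the weights for the class encoded as 1, so positive weights push a prediction toward that class and negative weights toward the class encoded as 0.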

--- 

#### Cell 7: Model Testing Function

The `test` function evaluates the trained model on the development set and prints its accuracy.

In [None]:
# Define a function 'test' to evaluate the model's performance.
def test(logreg, devX_feat, devY, le):
    # Transform the string labels of the development set into integers.
    Y_dev=le.transform(devY)
    # Use the model's 'score' method to calculate accuracy on the development set and print it, formatted to 3 decimal places.
    print("Accuracy: %.3f" % logreg.score(devX_feat, Y_dev))

--- 

#### Cell 8: Weight Analysis Function

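This function, `analyze_weights`, is for interpreting the model. It prints the 25 words with the most negative coefficients (which favor the class encoded as 0) and the 25 words with the most positive coefficients (which favor the class encoded as 1), along with each coefficient and its p-value.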

In [None]:
# Define a function to analyze and display the model's coefficients (weights).
def analyze_weights(coefs, label_encoder, vocab, p_values):
    # Create a 'reverse_vocab' dictionary to map from feature index back to the word.
    reverse_vocab = {v: k for k, v in vocab.items()}

    # Get the indices that would sort the coefficients array in ascending order.
    sort_index = np.argsort(coefs)

    # Print the name of the first class (corresponding to negative coefficients).
    print(label_encoder.inverse_transform([0])[0])
    # Loop through the 25 smallest (most negative) coefficients.
    for k in sort_index[:25]:
        # Print the coefficient, the corresponding word, and its p-value.
        print ("%.5f\t%s\t%.4f" % (coefs[k], reverse_vocab[k], p_values[k] ))

    # Print a newline for spacing.
    print()
    # Print the name of the second class (corresponding to positive coefficients).
    print(label_encoder.inverse_transform([1])[0])

    # Loop through the 25 largest (most positive) coefficients in descending order.
    for k in reversed(sort_index[-25:]):
        # Print the coefficient, the corresponding word, and its p-value.
        print ("%.5f\t%s\t%.4f" % (coefs[k], reverse_vocab[k], p_values[k] ))
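
The index slicing in `analyze_weights` can be checked on a tiny array. This sketch uses five invented coefficients and picks the two most extreme on each side instead of 25.

```python
import numpy as np

toy_coefs = np.array([0.5, -2.0, 1.5, -0.1, 3.0])   # made-up weights
order = np.argsort(toy_coefs)        # indices from most negative to most positive

most_negative = order[:2]            # two smallest coefficients
most_positive = order[-2:][::-1]     # two largest, biggest first

print(most_negative, toy_coefs[most_negative])
print(most_positive, toy_coefs[most_positive])
```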

--- 

#### Cell 9: Main Execution and Permutation Test

This is the main block where the script executes the entire pipeline: featurization, training, and the permutation test itself.

The **permutation test** works by shuffling the labels (`trainY`) and re-training the model many times. Shuffling breaks any real association between features and labels, so the coefficients of the shuffled-label models show how large a weight can get by chance alone. For each feature, we count how often the coefficient from a shuffled-label model is more extreme (larger in absolute value) than the coefficient from the original, unshuffled model. This count, divided by the number of permutations, is the estimated p-value. A small p-value (e.g., < 0.05) means a coefficient that large rarely arises when the labels carry no information, which is evidence against the null hypothesis that the coefficient is 0.
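
The same logic can be seen in miniature without training any model. The sketch below is not the notebook's procedure: it swaps the logistic regression coefficient for a simpler statistic (the difference in a binary feature's mean between the two classes) and uses synthetic data, purely to show the shuffle, recompute, and count steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a binary feature that is more common in class 1 than in class 0.
labels = np.array([0] * 50 + [1] * 50)
feature = np.concatenate([rng.random(50) < 0.2, rng.random(50) < 0.7]).astype(float)

def statistic(x, y):
    """Difference in the feature's mean between class 1 and class 0."""
    return x[y == 1].mean() - x[y == 0].mean()

observed = statistic(feature, labels)

num_permutations = 1000
exceed = 0
for _ in range(num_permutations):
    permuted = rng.permutation(labels)   # break the feature/label link
    if abs(statistic(feature, permuted)) > abs(observed):
        exceed += 1                      # chance alone beat the observed value

p_value = exceed / num_permutations
print("observed difference: %.3f, p-value: %.4f" % (observed, p_value))
```

Because the feature really is associated with the label here, shuffled labels should almost never reproduce a difference as large as the observed one, and the estimated p-value comes out small.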

In [None]:
# Featurize the training and development text data.
X_train, X_dev, vectorizer=featurize(trainX, devX)
# Initialize the LabelEncoder.
le = preprocessing.LabelEncoder()
# Fit the encoder on the training labels to learn the mapping from strings to integers.
le.fit(trainY)

# Train the logistic regression model on the original (unshuffled) data.
logreg=train(X_train, trainY, le)
# Test the model on the development set and print its accuracy.
test(logreg, X_dev, devY, le)

# Extract the learned coefficients from the trained model. This is our "true" set of weights.
true_coefficients=logreg.coef_[0]

# Set the number of permutations (P). With P=100 every p-value is a multiple of 0.01, so this is only a coarse estimate.
# For a real analysis, P should be much higher (e.g., 1000 or 10000).
P=100

# Initialize a NumPy array of zeros to store the p-value for each coefficient.
p_values=np.zeros(len(true_coefficients))
# Create a deep copy of the training labels to be used for shuffling.
permutedY=copy.deepcopy(trainY)

# Start the permutation loop, which will run P times.
for i in range(P):
    # Print the progress every 10 iterations.
    if i % 10 == 0:
        print(i)
    
    # The core of the permutation test: shuffle the labels randomly.
    # This breaks the true relationship between the features (X) and the labels (Y).
    shuffle(permutedY)
    
    # Train a new logistic regression model on the original features but with the newly shuffled labels.
    permuted_logreg=train(X_train, permutedY, le)
    # Extract the coefficients from this new model.
    coefficients=permuted_logreg.coef_[0]
    
    # Compare the coefficients from the permuted model to the true coefficients.
    # Iterate through each coefficient index.
    for idx, coef in enumerate(coefficients):
        # Check if the absolute value of the coefficient from the permuted model is greater than
        # the absolute value of the original, true coefficient.
        if abs(true_coefficients[idx]) < abs(coef):
            # If it is, we count this as an instance where a random permutation produced a more "extreme" result.
            # Increment the p-value counter for this feature by 1/P.
            p_values[idx]+=1./P
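
The inner loop over coefficients can also be expressed as one NumPy comparison. The sketch below assumes the coefficients of all shuffled-label models have been collected into a 2-D array, one row per permutation, which the notebook does not do; the numbers are synthetic stand-ins for real model weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the notebook's arrays: 5 features, 200 permutations (synthetic values).
true_coefs = np.array([2.5, -0.1, 0.0, -3.0, 0.4])
permuted_coefs = rng.normal(size=(200, 5))   # one row per shuffled-label model
num_permutations = permuted_coefs.shape[0]

# Fraction of permutations whose |coefficient| exceeds the original |coefficient|.
exceed_counts = (np.abs(permuted_coefs) > np.abs(true_coefs)).sum(axis=0)
p_vectorized = exceed_counts / num_permutations

# The same computation written as an explicit loop, for comparison.
p_loop = np.zeros(len(true_coefs))
for row in permuted_coefs:
    for idx, coef in enumerate(row):
        if abs(true_coefs[idx]) < abs(coef):
            p_loop[idx] += 1
p_loop /= num_permutations

print(p_vectorized)
print(np.allclose(p_vectorized, p_loop))
```

Storing every permuted coefficient vector costs memory proportional to the number of permutations times the number of features, so accumulating counts inside the loop, as the notebook does, remains a reasonable choice for large vocabularies.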

--- 

#### Cell 10: Save Results to File

This block saves the final results to a text file named `weights.txt`, writing one line per feature with its true coefficient, the corresponding word, and its calculated p-value.

In [None]:
# Create a reverse vocabulary to map feature indices back to words.
inverse_vocab = {v: k for k, v in vectorizer.vocabulary_.items()}
# Open "weights.txt" in write mode; the with-block closes the file even if an error occurs.
# UTF-8 matches the encoding used to read the data, so non-ASCII words are written safely.
with open("weights.txt", "w", encoding="utf-8") as out:
    # Iterate through each of the true coefficients with its index.
    for idx, coef in enumerate(true_coefficients):
        # Write the coefficient, the corresponding word, and its p-value to the file, separated by tabs.
        out.write("%.3f\t%s\t%.5f\n" % (coef, inverse_vocab[idx], p_values[idx]))

--- 

#### Cell 11: Display Final Analysis

Finally, this cell calls the `analyze_weights` function to print the features with the largest positive and negative coefficients, together with their p-values, directly to the notebook's output for immediate review.

In [None]:
# Call the analysis function to display the top 25 words for each class, along with their weights and p-values.
analyze_weights(true_coefficients, le, vectorizer.vocabulary_, p_values)