This notebook explores the use of the bootstrap to create confidence intervals for any statistic of interest that is estimated from data.

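This cell imports the Python libraries used in the notebook, plus a few that are imported but not used in the cells below.
* `sys` for system-specific parameters and functions (imported but not used in the cells below).
* `Counter` for counting hashable objects (imported but not used in the cells below).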
* `sklearn.preprocessing` for encoding categorical labels into numbers.
* `sklearn.linear_model` for the Logistic Regression model.
* `pandas` for data manipulation (imported but not used in the cells below).
* `scipy.sparse` to handle sparse matrices, which are memory-efficient for feature sets with many zero values.
* `numpy` for numerical operations, especially with arrays.
* `math.sqrt` for calculating the square root.
* `scipy.stats.norm` to access properties of the normal distribution, like the Percent Point Function (`ppf`), which is the inverse of the cumulative distribution function.
* `random.choices` for random sampling with replacement, which is the core of the bootstrap method.

In [None]:
# Import necessary libraries
import sys
from collections import Counter
from sklearn import preprocessing
from sklearn import linear_model
import pandas as pd
from scipy import sparse
import numpy as np
from math import sqrt 
from scipy.stats import norm
from random import choices

This cell defines a function `read_data` to load text data from a tab-separated file (`.tsv`). It reads the file line by line, splitting each line into a label and a text body. It assumes the text has already been tokenized (split into words).

In [None]:
# Defines a function to read a tab-separated file.
def read_data(filename):
    # Initialize empty lists to store text data (X) and labels (Y).
    X=[]
    Y=[]
    # Open the specified file with UTF-8 encoding.
    with open(filename, encoding="utf-8") as file:
        # Iterate over each line in the file.
        for line in file:
            # Remove trailing whitespace and split the line by tabs.
            cols=line.rstrip().split("\t")
            # The first column is the label.
            label=cols[0]
            # The second column is the text.
            text=cols[1]
            # Assumes text is already tokenized.
            # Append the text to the X list.
            X.append(text)
            # Append the label to the Y list.
            Y.append(label)
    # Return the lists of texts and labels.
    return X, Y
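
As a rough illustration of the expected file layout (one `label<TAB>text` pair per line), the sketch below writes a tiny made-up file and reads it back. `read_data_checked` is a hypothetical variant, not part of the notebook: it assumes that lines without a second column should be skipped and counted rather than raising an `IndexError`.

```python
import os
import tempfile

def read_data_checked(filename):
    """Hypothetical variant of read_data that skips lines lacking a text column."""
    X, Y = [], []
    skipped = 0
    with open(filename, encoding="utf-8") as file:
        for line in file:
            # Strip only the newline so an empty trailing column is still visible.
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 2 or not cols[1].strip():
                skipped += 1
                continue
            Y.append(cols[0])
            X.append(cols[1])
    return X, Y, skipped

# Write a small, invented example file (the labels and texts are made up).
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False, encoding="utf-8") as tmp:
    tmp.write("Democrat\twe oppose the cut\n")
    tmp.write("malformed line without a tab\n")
    tmp.write("Republican\tgrowth in the economy\n")
    path = tmp.name

X, Y, skipped = read_data_checked(path)
os.remove(path)
print(X, Y, skipped)
```

Whether to skip or to fail loudly on malformed lines is a design choice; failing loudly, as `read_data` does, can be preferable when the data is expected to be clean.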

This cell sets the path to the directory containing the dataset. You should change this string to match the location of your data files (`train.tsv`, `dev.tsv`, `test.tsv`).

In [None]:
# Change this to the directory with your data.
# The directory should contain train.tsv, dev.tsv and test.tsv
directory="../data/text_classification_sample_data"

Here, the `read_data` function is called to load the training and development (validation) datasets from the specified directory.

In [None]:
# Load the training data using the read_data function.
trainX, trainY=read_data("%s/train.tsv" % directory)
# Load the development (validation) data.
devX, devY=read_data("%s/dev.tsv" % directory)

This cell defines a simple feature engineering function. It uses two small, predefined dictionaries of words associated with Democrats and Republicans. The function `political_dictionary_feature` checks if any tokens from an input text are present in these dictionaries and creates binary features (`word_in_dem_dictionary`, `word_in_repub_dictionary`) if they are.

In [None]:
# Create a set of words associated with the Democratic party.
dem_dictionary=set(["republican","cut", "opposition"])
# Create a set of words associated with the Republican party.
repub_dictionary=set(["growth","economy"])

# Defines a function to extract features based on the political dictionaries.
def political_dictionary_feature(tokens):
    # Initialize an empty dictionary to store features for a document.
    feats={}
    # Iterate through each word (token) in the document.
    for word in tokens:
        # If the word is in the Democratic dictionary...
        if word in dem_dictionary:
            # ...set the corresponding feature to 1.
            feats["word_in_dem_dictionary"]=1
        # If the word is in the Republican dictionary...
        if word in repub_dictionary:
            # ...set the corresponding feature to 1.
            feats["word_in_repub_dictionary"]=1
    # Return the dictionary of features.
    return feats
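
Note that the membership test above is case-sensitive, so a token such as `Republican` would not match the dictionary entry `republican`. If the data is not already lowercased (an assumption worth checking against the actual files), a case-insensitive variant could look like the sketch below. The name `political_dictionary_feature_ci` and the constants `DEM_WORDS` and `REPUB_WORDS` are hypothetical and only mirror the dictionaries defined above.

```python
# Copies of the two dictionaries from the cell above, as set literals.
DEM_WORDS = {"republican", "cut", "opposition"}
REPUB_WORDS = {"growth", "economy"}

def political_dictionary_feature_ci(tokens):
    """Hypothetical case-insensitive variant of political_dictionary_feature."""
    # Lowercase every token once, then test for any overlap with each dictionary.
    lowered = {tok.lower() for tok in tokens}
    feats = {}
    if lowered & DEM_WORDS:
        feats["word_in_dem_dictionary"] = 1
    if lowered & REPUB_WORDS:
        feats["word_in_repub_dictionary"] = 1
    return feats

example = political_dictionary_feature_ci("The Republican plan hurts Growth".split(" "))
print(example)
```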

The `build_features` function processes a list of documents (`trainX`). For each document, it splits the text into tokens and then applies a list of provided feature functions (like the `political_dictionary_feature` function defined above) to generate a set of features for that document.

In [None]:
# Defines a function to apply feature extraction to an entire dataset.
def build_features(trainX, feature_functions):
    # Initialize an empty list to hold the feature dictionaries for all documents.
    data=[]
    # Iterate through each document in the input data.
    for doc in trainX:
        # Initialize an empty dictionary for the current document's features.
        feats={}

        # The sample text is already tokenized; if not, you would tokenize here.
        tokens=doc.split(" ")
        
        # Iterate through the list of feature functions to apply.
        for function in feature_functions:
            # Update the features dictionary with the results from the current function.
            feats.update(function(tokens))

        # Append the completed feature dictionary to the data list.
        data.append(feats)
    # Return the list of feature dictionaries.
    return data
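
Because each feature function just maps a token list to a dictionary, several functions can be combined by merging their outputs, which is what `feats.update(...)` does. The sketch below illustrates this with two hypothetical feature functions, `unigram_feature` and `doc_length_feature`; neither is part of the notebook, and the length threshold is arbitrary.

```python
def unigram_feature(tokens):
    """Hypothetical feature function: one binary feature per distinct token."""
    return {"unigram_%s" % tok: 1 for tok in tokens}

def doc_length_feature(tokens):
    """Hypothetical feature function: flag documents longer than three tokens."""
    return {"long_document": 1} if len(tokens) > 3 else {}

feature_functions = [unigram_feature, doc_length_feature]

# Mirror the inner loop of build_features for a single document.
tokens = "we oppose the tax cut".split(" ")
feats = {}
for function in feature_functions:
    feats.update(function(tokens))

print(feats)
```

If two functions emit the same feature name, the later one overwrites the earlier one, so distinct name prefixes (such as `unigram_`) help avoid collisions.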

This helper function, `create_vocab`, builds a vocabulary from the training data. It iterates through all the features generated for each document and assigns a unique integer ID to each unique feature name. This is a crucial step to convert text features into a numerical format that machine learning models can understand.

In [None]:
# This helper function converts a dictionary of feature names to unique numerical ids.
def create_vocab(data):
    # Initialize an empty dictionary to store the feature-to-ID mapping.
    feature_vocab={}
    # Initialize a counter for the unique feature IDs.
    idx=0
    # Iterate through each document's feature dictionary.
    for doc in data:
        # Iterate through each feature name in the dictionary.
        for feat in doc:
            # If the feature has not been seen before...
            if feat not in feature_vocab:
                # ...add it to the vocabulary with a new unique ID.
                feature_vocab[feat]=idx
                # Increment the ID for the next new feature.
                idx+=1
                
    # Return the completed feature vocabulary.
    return feature_vocab

The `features_to_ids` function converts the list of feature dictionaries into a sparse matrix. A sparse matrix is used because most documents will not contain most features, resulting in many zero values. Storing only the non-zero values is highly memory-efficient. This function uses the vocabulary created by `create_vocab` to map feature names to the correct column index in the matrix.

In [None]:
# This helper function converts a dictionary of feature names to a sparse representation.
def features_to_ids(data, feature_vocab):
    # Create a sparse matrix in "List of Lists" format with dimensions (num_docs, num_features).
    new_data=sparse.lil_matrix((len(data), len(feature_vocab)))
    # Iterate through each document and its index.
    for idx,doc in enumerate(data):
        # Iterate through each feature in the document's feature dictionary.
        for f in doc:
            # If the feature exists in our training vocabulary...
            if f in feature_vocab:
                # ...set the value in the sparse matrix at (row=doc_idx, col=feature_id).
                new_data[idx,feature_vocab[f]]=doc[f]
    # Return the final sparse matrix.
    return new_data
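
For comparison, scikit-learn's `DictVectorizer` covers roughly the same ground as `create_vocab` plus `features_to_ids`: it learns a feature-to-column mapping from the training dictionaries and ignores features it has not seen when transforming new data. The sketch below uses made-up feature dictionaries; it is an alternative, not what this notebook does. One difference is that `DictVectorizer` sorts feature names by default, whereas `create_vocab` assigns IDs in order of first appearance.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up feature dictionaries standing in for the output of build_features.
train_feats = [{"word_in_dem_dictionary": 1}, {"word_in_repub_dictionary": 1}, {}]
dev_feats = [{"word_in_dem_dictionary": 1, "unseen_feature": 1}]

vectorizer = DictVectorizer(sparse=True)
# Learn the vocabulary from the training data only.
train_matrix = vectorizer.fit_transform(train_feats)
# Features absent from the training vocabulary are dropped here.
dev_matrix = vectorizer.transform(dev_feats)

print(vectorizer.vocabulary_)
print(train_matrix.toarray())
print(dev_matrix.toarray())
```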

The `evaluate` function orchestrates the entire model training and evaluation pipeline. It takes training and development data, along with feature functions, and performs the following steps:
1.  Builds features for both training and development sets.
2.  Creates a vocabulary based *only* on the training data to prevent data leakage.
3.  Converts feature dictionaries into sparse matrices.
4.  Encodes string labels (e.g., 'Democrat') into numerical labels (e.g., 0, 1).
5.  Trains a logistic regression model.
6.  Evaluates the model's accuracy on the development set.
7.  Returns the model's predictions and the true labels for further analysis.

In [None]:
# This function trains a model and returns the predicted and true labels for the dev data.
def evaluate(trainX, devX, trainY, devY, feature_functions):
    # Generate feature dictionaries for the training data.
    trainX_feat=build_features(trainX, feature_functions)
    # Generate feature dictionaries for the development data.
    devX_feat=build_features(devX, feature_functions)

    # Create a vocabulary from features found *only* in the training data.
    feature_vocab=create_vocab(trainX_feat)

    # Convert the training features into a sparse matrix of IDs.
    trainX_ids=features_to_ids(trainX_feat, feature_vocab)
    # Convert the development features into a sparse matrix of IDs using the same vocabulary.
    devX_ids=features_to_ids(devX_feat, feature_vocab)
    
    # Initialize a LabelEncoder to convert string labels to integers.
    le=preprocessing.LabelEncoder()
    # Fit the encoder on the training labels to learn the mapping.
    le.fit(trainY)

    # Transform both training and development labels to integers.
    trainY=le.transform(trainY)
    devY=le.transform(devY)
    
    # Print which class corresponds to the encoded label 1.
    print("Class 1 is %s" % le.inverse_transform([1])[0])
    
    # Initialize a Logistic Regression model with specified parameters.
    logreg = linear_model.LogisticRegression(C=1.0, solver='lbfgs', penalty='l2', max_iter=10000)
    # Train the model on the training data.
    logreg.fit(trainX_ids, trainY)
    # Print the accuracy of the model on the development data.
    print ("Accuracy: %.3f"  % logreg.score(devX_ids, devY))
    # Get the model's predictions for the development data.
    predictions=logreg.predict(devX_ids)
    
    # Return the predictions and the true labels for the dev set.
    return (predictions, devY)
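
Which class ends up as label 1 matters later, because the F1 score is computed for the positive class. `LabelEncoder` assigns integers according to the sorted order of the distinct labels, so the mapping can be checked directly. The label strings in this sketch are assumed for illustration; the actual labels in the data files may differ.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
# Assumed example labels; classes_ holds the distinct labels in sorted order.
le.fit(["Republican", "Democrat", "Democrat"])

classes = list(le.classes_)
encoded = le.transform(["Democrat", "Republican"]).tolist()

print(classes)
print(encoded)
```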

The `binomial_confidence_intervals` function calculates a parametric confidence interval for accuracy, assuming the number of correct predictions follows a binomial distribution. It uses the normal approximation to the binomial distribution, which is less reliable when the sample is small or the accuracy is close to 0 or 1.
1. It calculates the mean accuracy (`success_rate`).
2. It finds the Z-score (`z_alpha`) corresponding to the desired confidence level.
3. It calculates the standard error of the mean.
4. It computes the lower and upper bounds of the confidence interval.

In [None]:
# Defines a function to calculate a parametric confidence interval for a binomial outcome (e.g., accuracy).
def binomial_confidence_intervals(predictions, truth, confidence_level=0.95):
    # Initialize a list to store correctness (1 if correct, 0 if incorrect).
    correct=[]
    # Iterate through predictions and true labels simultaneously.
    for pred, gold in zip(predictions, truth):
        # Append 1 if the prediction matches the true label, otherwise 0.
        correct.append(int(pred==gold))
        
    # Calculate the success rate (accuracy) as the mean of the 'correct' list.
    success_rate=np.mean(correct)

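    # For a two-sided interval, put half of (1 - confidence_level) in each tail.
    # Despite its name, critical_value holds that tail probability, not a z-score.
    critical_value=(1-confidence_level)/2
    # norm.ppf (the inverse CDF) returns the negative z-score cutting off the lower tail, so negate it.
    z_alpha=-1*norm.ppf(critical_value)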
    
    # The standard error for a binomial distribution is sqrt(p*(1-p)/n).
    standard_error=sqrt((success_rate*(1-success_rate))/len(correct))

    # Calculate the lower bound of the confidence interval.
    lower=success_rate-z_alpha*standard_error
    # Calculate the upper bound of the confidence interval.
    upper=success_rate+z_alpha*standard_error
    # Print the formatted result.
    print("%.3f, %s%% Confidence interval: [%.3f,%.3f]" % (success_rate, confidence_level*100, lower, upper))
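
To make the arithmetic concrete, here is a small worked sketch of the same normal-approximation interval for a made-up case of 80 correct predictions out of 100. `binomial_interval` is a hypothetical helper that returns the bounds instead of printing them, and it uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm` purely to stay dependency-free.

```python
from math import sqrt
from statistics import NormalDist

def binomial_interval(n_correct, n_total, confidence_level=0.95):
    """Hypothetical helper: normal-approximation interval for a proportion."""
    p_hat = n_correct / n_total
    # Probability mass left in each tail of a two-sided interval.
    tail = (1 - confidence_level) / 2
    # z-score with (1 - tail) of the standard normal mass below it.
    z = NormalDist().inv_cdf(1 - tail)
    standard_error = sqrt(p_hat * (1 - p_hat) / n_total)
    return p_hat - z * standard_error, p_hat + z * standard_error

# Made-up numbers: 80 correct out of 100, so the standard error is 0.04.
lower, upper = binomial_interval(80, 100)
print("[%.3f, %.3f]" % (lower, upper))
```

With a standard error of 0.04 and a z-score of about 1.96, the interval is roughly 0.80 plus or minus 0.078.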

This is a simple helper function to calculate accuracy by comparing predicted labels to true labels.

In [None]:
# Defines a function to calculate accuracy.
def accuracy(truth, predictions):
    # Initialize a counter for correct predictions.
    correct=0.
    # Iterate through the indices of the labels.
    for idx in range(len(truth)):
        # Get the true label.
        g=truth[idx]
        # Get the predicted label.
        p=predictions[idx]
        # If they match...
        if g == p:
            # ...increment the counter.
            correct+=1
    # Return the total correct divided by the total number of items.
    return correct/len(truth)
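
The same quantity can also be computed without an explicit loop. The sketch below is an equivalent NumPy formulation under the assumption that both inputs have the same length; `accuracy_vectorized` is a hypothetical name, and it has the same `metric(truth, predictions)` signature as `accuracy`.

```python
import numpy as np

def accuracy_vectorized(truth, predictions):
    """Hypothetical NumPy version of accuracy: fraction of matching positions."""
    truth = np.asarray(truth)
    predictions = np.asarray(predictions)
    # The elementwise comparison gives booleans; their mean is the accuracy.
    return float(np.mean(truth == predictions))

# Two of the four made-up predictions match the true labels.
score = accuracy_vectorized([1, 0, 1, 1], [1, 1, 1, 0])
print(score)
```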

### **Specify features for model and train logistic regression**

This cell specifies which feature function(s) to use and then calls the `evaluate` function to train the model and get the predictions and true labels for the development set.

In [None]:
# Create a list containing the feature function(s) to be used.
features=[political_dictionary_feature]
# Call the evaluate function to train the model and get results.
predictions, truth=evaluate(trainX, devX, trainY, devY, features)

This cell calls the `binomial_confidence_intervals` function to calculate and print the 95% parametric confidence interval for the accuracy of the model's predictions.

In [None]:
# Calculate and print the binomial confidence interval for the model's accuracy.
binomial_confidence_intervals(predictions, truth, confidence_level=0.95)

### **Q1: Implement the bootstrap**
Implement the bootstrap to create confidence intervals at a specified confidence level for any function `metric(truth, predictions)` where *truth* is an array of true labels for a set of data points, and *predictions* is an array of predicted labels for those same points. See `accuracy(truth, predictions)` above for an example of a metric that should be supported. `bootstrap` should return a tuple of (lower, median, upper), where *lower* is the lower confidence bound, *upper* is the upper confidence bound, and *median* is the median value of the metric among the bootstrap resamples. Hint: see `np.percentile`.

This is the implementation of the bootstrap function. It works by:
1.  Repeatedly resampling (with replacement) from the set of (true label, prediction) pairs, drawing as many pairs as the original set contains.
2.  Calculating the desired metric (e.g., accuracy) on each resample.
3.  Collecting the `B` resulting scores, which together approximate the sampling distribution of the metric.
4.  Reading the confidence interval off the percentiles of that distribution. For a 95% confidence interval, these are the 2.5th and 97.5th percentiles.

In [None]:
# Defines the bootstrap function to create non-parametric confidence intervals.
def bootstrap(gold, predictions, metric, B=10000, confidence_level=0.95):
    # Probability mass left in each tail (named critical_value here, but it is a probability).
    critical_value=(1-confidence_level)/2
    # Percentile for the lower bound, e.g. 2.5 for a 95% interval.
    lower_sig=100*critical_value
    # Percentile for the upper bound, e.g. 97.5 for a 95% interval.
    upper_sig=100*(1-critical_value)
    # Create a list of [true_label, predicted_label] pairs so each pair is resampled together.
    data=[[g,p] for g, p in zip(gold, predictions)]

    # Initialize a list to store the metric scores from each bootstrap sample.
    metric_scores=[]
    
    # Loop B times (B is the number of bootstrap resamples).
    for b in range(B):
        # Create a new sample by choosing with replacement from the original data.
        choice=choices(data, k=len(data))
        # Convert the chosen sample to a NumPy array for easier slicing.
        choice=np.array(choice)
        # Calculate the desired metric on the resampled data.
        score=metric(choice[:,0], choice[:,1])
        
        # Append the calculated score to our list of scores.
        metric_scores.append(score)
    
    # Calculate the percentiles corresponding to the lower bound, median, and upper bound.
    percentiles=np.percentile(metric_scores, [lower_sig, 50, upper_sig])
    
    # Extract the lower bound from the percentiles array.
    lower=percentiles[0]
    # Extract the median from the percentiles array.
    median=percentiles[1]
    # Extract the upper bound from the percentiles array.
    upper=percentiles[2]
    
    # Return the lower bound, median, and upper bound.
    return lower, median, upper
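
One possible variation, sketched below, resamples row indices with a seeded NumPy generator instead of resampling the pairs themselves. This keeps labels and predictions aligned, avoids stacking them into one array, and makes runs reproducible. `bootstrap_indices`, `toy_accuracy`, and the `seed` parameter are hypothetical additions, and the data is made up.

```python
import numpy as np

def bootstrap_indices(gold, predictions, metric, B=1000, confidence_level=0.95, seed=0):
    """Hypothetical index-based percentile bootstrap with a fixed random seed."""
    gold = np.asarray(gold)
    predictions = np.asarray(predictions)
    rng = np.random.default_rng(seed)
    n = len(gold)
    # Percentile of the lower tail, e.g. 2.5 for a 95% interval.
    tail = 100 * (1 - confidence_level) / 2
    scores = np.empty(B)
    for b in range(B):
        # Draw n row indices with replacement and score the resampled pairs.
        idx = rng.integers(0, n, size=n)
        scores[b] = metric(gold[idx], predictions[idx])
    lower, median, upper = np.percentile(scores, [tail, 50, 100 - tail])
    return lower, median, upper

def toy_accuracy(truth, predictions):
    """Stand-in metric with the same signature as accuracy."""
    return float(np.mean(truth == predictions))

# Made-up labels and predictions with 85 of 100 correct.
gold = np.array([1] * 50 + [0] * 50)
preds = np.array([1] * 40 + [0] * 10 + [0] * 45 + [1] * 5)
result = bootstrap_indices(gold, preds, toy_accuracy, B=500, seed=0)
print(result)
```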

### **Q2: Use your bootstrap implementation for accuracy**
Use your bootstrap implementation to generate confidence intervals for accuracy. How do these compare to the parametric intervals above?

This cell calls the newly implemented `bootstrap` function, passing the `accuracy` function as the metric, and prints the bootstrap median together with the resulting confidence interval. Compare this interval to the one from the parametric `binomial_confidence_intervals` function: when the development set is reasonably large and accuracy is not close to 0 or 1, the two should be very similar, which serves as a sanity check on the bootstrap implementation.

In [None]:
# Set the desired confidence level.
confidence_level=0.95
# Call the bootstrap function with the accuracy metric.
lower, median,upper=bootstrap(truth, predictions, accuracy, B=10000,confidence_level=confidence_level)
# Print the results in a formatted string.
print("%.3f, %s%% Bootstrap confidence interval: [%.3f, %.3f]" % (median, confidence_level*100, lower, upper))
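
The expected agreement between the two kinds of interval can be illustrated on synthetic data, independently of the model above. The sketch below assumes a made-up correctness vector with 400 correct and 100 incorrect predictions and computes both a normal-approximation interval and a percentile bootstrap interval for its mean; all names and numbers are illustrative.

```python
from math import sqrt
from statistics import NormalDist

import numpy as np

# Made-up correctness vector: 1 = correct prediction, 0 = incorrect.
correct = np.array([1] * 400 + [0] * 100)
n = len(correct)
p_hat = correct.mean()

# Parametric interval from the normal approximation to the binomial.
z = NormalDist().inv_cdf(0.975)
standard_error = sqrt(p_hat * (1 - p_hat) / n)
normal_interval = (p_hat - z * standard_error, p_hat + z * standard_error)

# Percentile bootstrap interval for the same statistic, with a fixed seed.
rng = np.random.default_rng(0)
resampled_means = np.array(
    [correct[rng.integers(0, n, size=n)].mean() for _ in range(2000)]
)
bootstrap_interval = tuple(np.percentile(resampled_means, [2.5, 97.5]))

print("normal:    [%.3f, %.3f]" % normal_interval)
print("bootstrap: [%.3f, %.3f]" % bootstrap_interval)
```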

### **Q3: Implement the F1 score**
Implement the F1 score for binary data. Calculate F1 as the harmonic mean of precision and recall for the positive class (i.e., y=1).

This cell defines a function to calculate the F1 score for the positive class (label `1`).
- **Precision** is the ratio of true positives to all predicted positives (`correct / trials`).
- **Recall** is the ratio of true positives to all actual positives (`correct / trues`).
- **F1 Score** is the harmonic mean of precision and recall: `2 * (precision * recall) / (precision + recall)`.

The code includes checks to prevent division by zero: precision, recall, or F1 is set to 0 whenever its denominator would be zero (no predicted positives, no actual positives, or no true positives).

In [None]:
# Defines a function to calculate the F1 score for the positive class (label=1).
def F1(truth, predictions):
    # Initialize counter for true positives (predicted=1, truth=1).
    correct=0.
    # Initialize counter for predicted positives (predicted=1).
    trials=0.
    # Initialize counter for actual positives (truth=1).
    trues=0.
    # Iterate through the indices of the labels.
    for idx in range(len(truth)):
        # Get the true label.
        g=truth[idx]
        # Get the predicted label.
        p=predictions[idx]
        # If it's a true positive...
        if g == p and g == 1:
            # ...increment the true positive counter.
            correct+=1
        # If it's an actual positive...
        if g == 1:
            # ...increment the actual positive counter.
            trues+=1
        # If it's a predicted positive...
        if p == 1:
            # ...increment the predicted positive counter.
            trials+=1
            
    # Calculate precision, handling division by zero.
    precision=correct/trials if trials > 0 else 0
    # Calculate recall, handling division by zero.
    recall=correct/trues if trues > 0 else 0
    # Calculate F1 score, handling division by zero.
    f=(2*precision*recall)/(precision+recall) if (precision+recall) > 0 else 0
    # Return the F1 score.
    return f
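
As a cross-check, F1 can also be written directly in terms of counts, since the harmonic mean of precision and recall simplifies to `2*TP / (2*TP + FP + FN)`. The sketch below applies that identity to a small made-up example; `f1_from_counts` and its `positive` parameter are hypothetical names, not part of the notebook.

```python
def f1_from_counts(truth, predictions, positive=1):
    """Hypothetical F1 for the positive class via 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(truth, predictions))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    denominator = 2 * tp + fp + fn
    # With no true, predicted, or actual positives, F1 is defined as 0 here.
    return 2 * tp / denominator if denominator > 0 else 0.0

# Made-up example: TP=2, FP=1, FN=1, so precision = recall = 2/3 and F1 = 2/3.
score = f1_from_counts([1, 1, 1, 0, 0], [1, 0, 1, 1, 0])
print(round(score, 3))
```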

### **Q4: Use your bootstrap implementation for the F1 score**
Use your bootstrap implementation to generate confidence intervals for the F1 score.

This cell shows the main practical benefit of the bootstrap: the same `bootstrap` function yields a confidence interval for the F1 score simply by passing the `F1` function as the metric. Parametric methods, by contrast, often require complex, metric-specific formulas.

In [None]:
# Set the desired confidence level.
confidence_level=0.95
# Call the bootstrap function with the F1 metric.
lower, median,upper=bootstrap(truth, predictions, F1, B=10000,confidence_level=confidence_level)
# Print the results in a formatted string.
print("%.3f, %s%% Bootstrap confidence interval: [%.3f, %.3f]" % (median, confidence_level*100, lower, upper))
