### **Commented and Explained IPython Notebook**

This notebook explores linear regression with L2 (Ridge) and L1 (Lasso) regularization, using the movie box office prediction data. Before running it, install Beautiful Soup, a Python library for parsing HTML and XML.

```sh 
conda install beautifulsoup4=4.7.1
```

--- 
## **1. Importing Necessary Libraries**
This first code block imports all the essential Python libraries for our task. We'll need `nltk` for text processing, `numpy` for numerical operations, `sklearn` for machine learning models and metrics, and `BeautifulSoup` to parse the XML data file.

In [None]:
# Import the Natural Language Toolkit for text processing tasks like tokenization.
import nltk
# Import NumPy for efficient numerical operations, especially on arrays.
import numpy as np
# Import scikit-learn's linear_model module, which contains Ridge and Lasso regression.
from sklearn import linear_model
# Import scikit-learn's metrics module to evaluate model performance (e.g., Mean Absolute Error).
import sklearn.metrics
# Import the BeautifulSoup library to easily parse XML and HTML files.
from bs4 import BeautifulSoup
# Import the CountVectorizer to convert text documents into a matrix of token counts.
from sklearn.feature_extraction.text import CountVectorizer

--- 
## **2. Defining the Data Reading Function**
The following function, `read_movie_data`, reads and parses the movie data from the specified XML file. For each movie it extracts one review (the first) and the corresponding box office revenue (the target variable), then assigns the pair to the training or test set according to the split recorded in the file itself.

In [None]:
# Define a function to read and process the movie data from a given XML file.
def read_movie_data(filename):
    
    # Initialize an empty list to store the review texts for the training set.
    trainX=[]
    # Initialize an empty list to store the box office revenues for the training set.
    Y_train=[]
    
    # Initialize an empty list to store the review texts for the test set.
    testX=[]
    # Initialize an empty list to store the box office revenues for the test set.
    Y_test=[]
    
    # Open and read the specified file.
    with open(filename) as file:
        # Create a BeautifulSoup object to parse the XML content of the file.
        soup=BeautifulSoup(file)
        # Find all XML tags named 'instance', where each tag represents a single movie.
        # find_all is the current name of the method; findAll is a legacy alias.
        movies=soup.find_all('instance')
        # Iterate through each movie found in the file.
        for movie in movies:
            # Get the value of the 'subpop' attribute to determine if it's for training or testing.
            split=movie["subpop"]
            # Find the 'regy' tag and get the box office value from its 'yvalue' attribute, converting it to a float.
            y=float(movie.find('regy')["yvalue"])

            # For simplicity, use only the first review (each movie has multiple reviews);
            # find() returns the first matching 'text' tag.
            review=movie.find('text')
            
            # Use NLTK's word_tokenize to split the review text into a list of words (tokens).
            tokens=nltk.word_tokenize(review.text)
            # Join the tokens back into a single string, separated by spaces.
            words=' '.join(tokens)
            # Check if this movie belongs to the training set.
            if split == "train":
                # If so, add the review text to the training feature list.
                trainX.append(words)
                # And add the box office value to the training target list.
                Y_train.append(y)
            # Check if this movie belongs to the test set.
            elif split == "test":
                # If so, add the review text to the test feature list.
                testX.append(words)
                # And add the box office value to the test target list.
                Y_test.append(y)
   
    # Return the four populated lists.
    return trainX, Y_train, testX, Y_test
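
The parsing logic above implies a particular file layout: each `instance` tag carries a `subpop` attribute, contains a `regy` tag with a `yvalue` attribute, and holds one or more `text` tags. The following is a minimal sketch of that layout on a made-up two-movie document. It uses the standard library's `xml.etree.ElementTree` in place of BeautifulSoup and skips the NLTK tokenization step so that it runs without extra dependencies. The sample XML, its values, the `dataset` root tag, and the helper name `split_instances` are illustrative assumptions, not taken from the real data file.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature file; the real dataset's exact layout may differ.
SAMPLE_XML = """
<dataset>
  <instance subpop="train">
    <regy yvalue="12.5"/>
    <text>A charming and funny film .</text>
    <text>A second review that is ignored .</text>
  </instance>
  <instance subpop="test">
    <regy yvalue="3.0"/>
    <text>Dull from start to finish .</text>
  </instance>
</dataset>
"""

def split_instances(xml_string):
    """Return (train_texts, train_targets, test_texts, test_targets)."""
    train_texts, train_targets, test_texts, test_targets = [], [], [], []
    root = ET.fromstring(xml_string.strip())
    for instance in root.iter("instance"):
        target = float(instance.find("regy").get("yvalue"))
        # find() returns the first matching child, mirroring the first-review choice above.
        first_review = instance.find("text").text.strip()
        if instance.get("subpop") == "train":
            train_texts.append(first_review)
            train_targets.append(target)
        elif instance.get("subpop") == "test":
            test_texts.append(first_review)
            test_targets.append(target)
    return train_texts, train_targets, test_texts, test_targets

train_texts, train_targets, test_texts, test_targets = split_instances(SAMPLE_XML)
print(train_texts, train_targets)
print(test_texts, test_targets)
```

Note that `ElementTree` is case-sensitive about tag and attribute names, so this sketch only works when the names in the file match the lowercase spellings used here.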

--- 
## **3. Defining the Weight Analysis Function**
This helper function, `analyze_weights`, is created to inspect the coefficients (or "weights") that our trained model learns for each feature (word or n-gram). It identifies and prints the features that most strongly predict high box office revenue (large positive weights) and those that most strongly predict low revenue (large negative weights).

In [None]:
# Define a function to analyze and display the most influential feature weights from a trained model.
def analyze_weights(learned_model, vocab, num_to_print, printZero=True):
    # Create a reverse vocabulary to map feature indices back to their corresponding words/n-grams.
    reverse_vocab = {v: k for k, v in vocab.items()}

    # Get the indices that would sort the model's coefficients in ascending order.
    sort_index = np.argsort(learned_model.coef_)
    
    # Print the features with the highest positive coefficients.
    # We iterate backwards through the last 'num_to_print' elements of the sorted indices.
    for k in reversed(sort_index[-num_to_print:]):
        # Check if the coefficient is non-zero or if we are instructed to print zero-value coefficients.
        if learned_model.coef_[k] != 0 or printZero:
            # Print the coefficient value and the corresponding feature name.
            print("%.5f\t%s" % (learned_model.coef_[k], reverse_vocab[k]))
        
    # Print a blank line for better readability.
    print()

    # Print the features with the most negative coefficients.
    # We iterate through the first 'num_to_print' elements of the sorted indices.
    for k in sort_index[:num_to_print]:
        # Check if the coefficient is non-zero or if we are instructed to print zero-value coefficients.
        if learned_model.coef_[k] != 0 or printZero:
            # Print the coefficient value and the corresponding feature name.
            print("%.5f\t%s" % (learned_model.coef_[k], reverse_vocab[k]))
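
To see how the `np.argsort` slicing in `analyze_weights` picks out the extremes, here is a small sketch on a hand-made coefficient vector. The vocabulary and weights are hypothetical values chosen for illustration; no model is trained.

```python
import numpy as np

# Hypothetical vocabulary and coefficients, for illustration only.
toy_vocab = {"awful": 0, "fine": 1, "great": 2, "boring": 3}
toy_coef = np.array([-2.0, 0.0, 3.5, -0.5])

# Map column indices back to words, as analyze_weights does.
index_to_word = {index: word for word, index in toy_vocab.items()}

# Indices ordered from the most negative to the most positive coefficient.
ascending = np.argsort(toy_coef)

# The last two indices, reversed, give the largest weights first.
top_positive = [index_to_word[i] for i in reversed(ascending[-2:])]
# The first two indices give the most negative weights first.
top_negative = [index_to_word[i] for i in ascending[:2]]

print(top_positive)  # ['great', 'fine']
print(top_negative)  # ['awful', 'boring']
```

The zero-weight word `fine` appears among the "top positive" features here simply because only one coefficient is positive. This is the situation the `printZero` flag is meant to handle.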

--- 
## **4. Loading and Splitting the Data**
Here, we call the `read_movie_data` function to load the dataset. The function returns the texts and box office values, neatly separated into training and testing sets.

In [None]:
# Call the data reading function with the path to the dataset file.
# Unpack the returned values into separate variables for training and testing data.
trainX, Y_train, testX, Y_test=read_movie_data("../data/7domains-train-dev.tl.xml")

--- 
## **5. Vectorizing the Text Data**
Machine learning models can't work with raw text. We need to convert the movie reviews into a numerical format. We use `CountVectorizer` to transform our text data into a "bag-of-words" representation. Each review becomes a vector where each element represents the presence or absence of a specific word or n-gram from our vocabulary.

In [None]:
# Initialize the CountVectorizer with specific parameters.
# max_features=10000: Limit the vocabulary to the 10,000 most frequent words/n-grams.
# ngram_range=(1,2): Include both single words (unigrams) and pairs of adjacent words (bigrams) as features.
# lowercase=True: Convert all text to lowercase before tokenizing.
# strip_accents=None: Do not remove accents from characters.
# binary=True: Use 1 for presence and 0 for absence of a feature, rather than its frequency count.
vectorizer = CountVectorizer(max_features=10000, ngram_range=(1,2), lowercase=True, strip_accents=None, binary=True)

# Learn the vocabulary from the training data and transform the training text into a numerical matrix.
X_train = vectorizer.fit_transform(trainX)
# Transform the test data using the same vocabulary learned from the training data.
X_test = vectorizer.transform(testX)
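
As a rough illustration of what the binary bag-of-words matrix looks like, the sketch below vectorizes two made-up sentences with the same `ngram_range` and `binary` settings. The documents and the `toy_` names are assumptions for this example only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical "reviews"; the second repeats the word "movie".
toy_docs = ["great fun movie", "dull movie movie"]

toy_vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
dense = toy_vectorizer.fit_transform(toy_docs).toarray()

# vocabulary_ maps each unigram or bigram to its column index.
for feature, column in sorted(toy_vectorizer.vocabulary_.items(), key=lambda item: item[1]):
    print(column, repr(feature), dense[:, column])
```

Because `binary=True`, the repeated word `movie` in the second document is recorded as 1 rather than 2. The bigram `movie movie` gets its own column.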

--- 
## **6. Ridge Regression (L2 Regularization)**
Ridge regression is linear regression with L2 regularization. How does varying the regularization strength affect the accuracy (Mean Absolute Error)? How does it affect the rank order of the most informative coefficients? Try changing the parameters of the `CountVectorizer` above as well, for example by varying `max_features` or by widening the `ngram_range` beyond unigrams and bigrams to include trigrams.

In [None]:
# Higher values of alpha mean stronger regularization.
# Initialize a Ridge Regression model with an alpha (regularization strength) of 100.
# fit_intercept=True tells the model to calculate an intercept term.
ridge_regression = linear_model.Ridge(alpha=100, fit_intercept=True)
# Train (fit) the Ridge model using the training feature matrix (X_train) and training target values (Y_train).
ridge_regression.fit(X_train, Y_train)
# Use the trained model to make predictions on the test set.
preds=ridge_regression.predict(X_test)
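# Calculate the Mean Absolute Error (MAE) between the actual test values and the model's predictions.
# scikit-learn's argument order is (y_true, y_pred); MAE is symmetric, so the value is the same either way.
mae=sklearn.metrics.mean_absolute_error(Y_test, preds)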
# Print the calculated MAE, formatted to three decimal places.
print("MAE: %.3f" % mae)
# Call the analyze_weights function to see the 5 most positive and 5 most negative features.
analyze_weights(ridge_regression, vectorizer.vocabulary_, 5)
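
One way to build intuition for the effect of `alpha` is to watch the coefficient vector shrink as the penalty grows. The sketch below uses synthetic data and the closed-form ridge solution, `w = solve(X.T @ X + alpha * I, X.T @ y)`, written with NumPy. Unlike the `Ridge` call above, it omits the intercept. The data, the seed, and the helper name `ridge_closed_form` are assumptions made for illustration.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic regression problem: 50 samples, 5 features, known weights plus noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 5))
true_w = np.array([4.0, -3.0, 0.0, 2.0, 0.0])
y_toy = X_toy @ true_w + rng.normal(scale=0.5, size=50)

alphas = [0.01, 1, 100, 10000]
norms = []
for alpha in alphas:
    w = ridge_closed_form(X_toy, y_toy, alpha)
    norms.append(np.linalg.norm(w))
    print("alpha=%g\t||w||=%.4f" % (alpha, norms[-1]))
```

The norm of the weight vector falls as `alpha` rises, but individual coefficients are generally shrunk toward zero rather than set to exactly zero.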

--- 
## **7. Lasso Regression (L1 Regularization)**
Lasso is linear regression with L1 regularization, which pushes coefficients not merely toward zero but to **exactly zero**. As a result, Lasso performs feature selection: a feature whose coefficient is 0 is effectively removed from the model. How does varying the regularization strength here affect the number of non-zero coefficients? How does it affect the rank order of the most informative coefficients?

In [None]:
# Initialize a Lasso Regression model with an alpha of 100.
# max_iter=10000 is set to ensure the optimization algorithm has enough iterations to converge.
lasso = linear_model.Lasso(alpha=100, fit_intercept=True, max_iter=10000)
# Train the Lasso model on the training data.
lasso.fit(X_train, Y_train)
# Make predictions on the test data.
preds=lasso.predict(X_test)
# Calculate the Mean Absolute Error; scikit-learn's argument order is (y_true, y_pred).
mae=sklearn.metrics.mean_absolute_error(Y_test, preds)
# Print the MAE.
print("MAE: %.3f" % mae)

# Count the coefficients that Lasso did not shrink to exactly zero.
count=np.count_nonzero(lasso.coef_)

# Print the total number of features that were not eliminated by the L1 regularization.
print("Nonzero features: %s\n" % count)
# Analyze the weights, but this time set printZero=False since we are only interested in the features Lasso kept.
analyze_weights(lasso, vectorizer.vocabulary_, 5, printZero=False)
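
To explore the question about non-zero coefficients without the movie data, the sketch below fits `Lasso` at several `alpha` values on synthetic data in which only three of twenty features carry signal. The data, seed, `alpha` grid, and `toy_` names are assumptions chosen for illustration; the pattern on the real review features may look different.

```python
import numpy as np
from sklearn import linear_model

# Synthetic problem: 20 features, only the first three influence the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [5.0, -3.0, 2.0]
y_toy = X_toy @ true_w + rng.normal(scale=0.1, size=100)

nonzero_counts = {}
for alpha in [0.001, 0.1, 1.0, 100.0]:
    toy_lasso = linear_model.Lasso(alpha=alpha, fit_intercept=True, max_iter=10000)
    toy_lasso.fit(X_toy, y_toy)
    # Record how many coefficients survived the L1 penalty at this strength.
    nonzero_counts[alpha] = int(np.count_nonzero(toy_lasso.coef_))
    print("alpha=%g\tnonzero coefficients: %d" % (alpha, nonzero_counts[alpha]))
```

At a sufficiently large `alpha` every coefficient is driven to zero and the model predicts only the intercept. It is worth checking the non-zero count whenever a strongly regularized Lasso reports a suspiciously stable MAE.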