## Tutorial: Preparing Movie Review Text for Binary Classification

This week's focus is on learning how to work with real-world text data — specifically, full movie reviews pulled from websites like RogerEbert.com. Before we can build machine learning models next week, we need to understand how to clean, organize, and extract meaning from raw text.

Last week, we worked with Covid testing data: test results were the target variable (what we're predicting), and symptoms or other predictive values (such as high-risk occupation or contacts) were the features (what we use to train a model to predict the target variable).

We can take a similar approach to document (text) classification by converting each document into a numerical representation of its words and phrases: a vector whose entries record how often particular tokens (words or phrases) appear in the document.
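
As a rough illustration of that idea, the sketch below turns two made-up one-line documents into count vectors over a shared vocabulary, using only the standard library. The sample sentences and the helper name `bag_of_words` are assumptions for illustration, not part of the workshop data.

```python
from collections import Counter

# Two made-up documents (illustrative only, not real reviews)
docs = [
    "the dog ran and the dog barked",
    "the cat slept",
]

def bag_of_words(documents):
    """Return (vocabulary, vectors): one count vector per document."""
    # Naive tokenization: lowercase, then split on whitespace
    tokenized = [doc.lower().split() for doc in documents]
    # Shared vocabulary: every distinct token, sorted for a stable column order
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for tokens that do not occur in this document
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each column corresponds to one token and each row to one document. Libraries such as scikit-learn offer the same idea through `CountVectorizer`, with more careful tokenization.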

This workshop will provide background for working with text data. 


### Learning Objectives

1. Understand and extract review content (feature data)  
   We'll locate and extract the main body of the review from an HTML page. This is the core descriptive text we'll use later as input for a classification model.

2. Assign a simple binary label (target data)  
   For each review, we’ll decide if it is positive (1) or negative (0), using clues like star ratings or the overall sentiment of the text. These labels won’t be used for training this week, but it’s important to understand where they come from.
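
One simple way to turn a star rating into a binary label is a threshold. The sketch below assumes ratings on a 0-to-4 star scale and treats 3 stars or more as positive. Both the scale and the cutoff are assumptions to adjust for your data, and `label_from_stars` is a hypothetical helper name.

```python
def label_from_stars(stars, threshold=3.0):
    """Map a star rating to 1 (positive) or 0 (negative)."""
    return 1 if stars >= threshold else 0

# Made-up ratings for illustration
ratings = [4.0, 3.0, 2.5, 1.0]
labels = [label_from_stars(s) for s in ratings]
print(labels)
```

Reviews near the cutoff are the ambiguous cases, so it is worth reading a few of them before settling on a threshold.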


### Why Text Data Presents Additional Challenges

- Text doesn't naturally come in a tabular, feature-based format like the Covid testing data. We need to represent text as a vector, which may not be as immediately intuitive.
- Text from sources on the web often includes noise like navigation bars, ads, and unrelated metadata  
- Opinion and sentiment are expressed through nuance and word choice  
- Raw text can't be fed to models as-is; it must first be cleaned and transformed

In this workshop we will learn techniques for:

- Extracting useful content from raw HTML pages  
- Cleaning and normalizing unstructured text (removing tags, lowering case, stripping whitespace)  
- Thinking critically about what is signal versus what is noise in a document  
- Preparing the data so it’s ready for modeling next week

By the end of this exercise, you will have a cleaned dataset of movie reviews and labels that can be used for training a binary classifier — just not yet. First, let’s get the text ready.


## Fetching the Raw HTML of a Web Page

In the code below, we use the `requests` library to fetch the raw HTML content of a web page. Specifically, we retrieve a review of the movie *Marmaduke* (2010) from RogerEbert.com. The code does the following:

1. **Sends an HTTP GET request** to the specified URL using `requests.get()`.
2. **Checks the response status** to ensure the page was successfully retrieved (status code `200` means success).
3. **Extracts the raw HTML** of the page and stores it as a string in `page_source`.
4. **Prints the first 100 characters** of the HTML content for inspection, which can help us see the structure or start of the page.
5. **Handles errors** by printing a failure message if the page cannot be retrieved.

This is useful for fetching web pages to process their content further, such as parsing the text of a review or extracting specific elements from the HTML.


In [1]:
import requests

url = 'https://www.rogerebert.com/reviews/marmaduke-2010'

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    page_source = response.text
    print(page_source[:100])
else:
    print(f"Failed to retrieve page. Status code: {response.status_code}")

<!doctype html>
<html lang="en-US" prefix="og: https://ogp.me/ns#">
<head>
	<meta charset="UTF-8">
	


## Extracting Clean Text from a Web Page

The code above provides the raw page source, but it contains a lot of extraneous tags, symbols, and characters. How can we get a cleaner version?

The code below uses the `requests` and `BeautifulSoup` libraries to fetch and parse the HTML content of a web page. Specifically, it:

1. Sends an HTTP GET request to the specified URL using `requests`.
2. Parses the HTML content using `BeautifulSoup` with the `'html.parser'` parser.
3. Extracts all visible text from the page (removing all HTML tags) using `.get_text()`.
   - The `separator='\n'` option adds line breaks between blocks of text for readability.
   - The `strip=True` option removes leading and trailing whitespace from each block.
4. Prints the resulting plain-text content to the console.

> This is useful for extracting readable text from web pages, such as news articles or reviews, while ignoring HTML, JavaScript, and styling content.


In [2]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.rogerebert.com/reviews/marmaduke-2010'

# Send GET request
response = requests.get(url)

# Check for success
if response.status_code == 200:
    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract visible text
    movie_review_text = soup.get_text(separator='\n', strip=True)
    print(movie_review_text[:1000])  # Or write to file if needed
else:
    print(f"Failed to retrieve page. Status code: {response.status_code}")

Great Danes drool when they speak movie review (2010) | Roger Ebert
Signup
Search in https://www.rogerebert.com
Movie Reviews
Great Movies
TV/Streaming
Interviews
Collections
Roger Ebert
Contributors
Reviews
Great Danes drool when they speak
Comedy
87 minutes
‧
PG
‧
2010
Roger Ebert
June 2, 2010
3 min read
20th Century-Fox presents  a film directed by
Tom Dey
. Written by Tim Rasmussen and Vince DiMeglio, based on the comic strip created by
Brad Anderson
and Phil Leeming. Running time: 87 minutes.
Rated PG
(for rude humor and language).
Dogs cannot talk. This we know. Dogs can talk in the movies. This we also know. But when we see them lip-synching with their dialogue, it’s just plain grotesque. The best approach is the one used by “Garfield” in which we saw the cat and heard
Bill Murray
, but there was no nonsense about Garfield’s mouth moving.
The moment I saw Marmaduke’s big drooling lips moving, I knew I was in trouble. There is nothing discreet about a Great Dane with a lot on his
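
The extracted text above mixes navigation labels and metadata ("Signup", "Comedy", "87 minutes") with the review itself. One crude way to separate signal from noise, sketched here as an assumption rather than the workshop's method, is to keep only lines with at least a minimum number of words. The helper name `keep_long_lines` and the cutoff of 8 words are hypothetical, and short fragments of real review text (such as a linked name on its own line) are also dropped by this heuristic.

```python
def keep_long_lines(text, min_words=8):
    """Keep lines that contain at least min_words whitespace-separated words."""
    kept = []
    for line in text.split('\n'):
        if len(line.split()) >= min_words:
            kept.append(line.strip())
    return '\n'.join(kept)

# A few lines adapted from the extracted page text above
sample = '\n'.join([
    "Signup",
    "Movie Reviews",
    "Comedy",
    "87 minutes",
    "Dogs cannot talk. This we know. Dogs can talk in the movies. This we also know.",
    "Bill Murray",
    ", but there was no nonsense about Garfield’s mouth moving.",
])

filtered = keep_long_lines(sample)
print(filtered)
```

A length filter is only a starting point. Locating the HTML element that wraps the review body is usually more reliable, but it depends on the structure of the specific site.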

## Text Preprocessing Steps

Text data is inherently messy and unstructured, often containing noise, inconsistencies, and irrelevant information. Raw text can be challenging for machine learning algorithms to interpret without some cleaning and preparation. The goal of preprocessing is to:
- **Ensure consistency**: For example, converting text to lowercase so that words like "Dog" and "dog" are treated as the same.
- **Reduce noise**: Removing unwanted characters like punctuation and stop words helps the model focus on the more meaningful parts of the text.
- **Extract useful features**: By tokenizing the text and removing common, unimportant words, we help the model focus on the content that carries more significance.
- **Normalize text**: Steps like stemming or lemmatization reduce different forms of words to their root form, improving the model's ability to compare words and make accurate predictions.

### What We’ll Do:
1. Lowercase the Text: To ensure consistency and avoid treating the same word differently because of capitalization.
2. Remove Punctuation and Special Characters: To clean up unnecessary characters that don’t contribute to meaningful analysis.
3. Tokenize the Text: Split the text into individual words or tokens, which are the basic units for analysis.
4. Remove Stop Words: Filter out common words like "the," "is," "and," which don’t carry significant meaning in most contexts.
5. Stemming or Lemmatization: Reduce words to their root form to group similar words together and improve model performance.
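6. Optional Noise Removal: Clean up numbers, non-alphabetic characters, and extra whitespace to make the data even cleaner.

The cells below apply these steps in a slightly different order: they split on whitespace and remove stop words before stripping punctuation, then lemmatize and stem. Step 6 is not implemented here.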

In [3]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
import re

In [4]:
# Download required NLTK data
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/boushey/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/boushey/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [5]:
# Original sentence
text = "Dogs are running in the park, and they ran fast! The dog was excited; it's their favorite spot."

In [9]:
# Step 1: Normalize Capitalization
normalized_text = text.lower()
print(normalized_text)

dogs are running in the park, and they ran fast! the dog was excited; it's their favorite spot.


In [10]:
# Step 2: Remove Stop Words
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in normalized_text.split() if word not in stop_words]
print(filtered_tokens)

['dogs', 'running', 'park,', 'ran', 'fast!', 'dog', 'excited;', 'favorite', 'spot.']


In [17]:
# Step 3: Remove Punctuation (after stop word removal)
text_without_punctuation = [re.sub(r'[^\w\s]', '', word) for word in filtered_tokens]
print(text_without_punctuation)

['dogs', 'running', 'park', 'ran', 'fast', 'dog', 'excited', 'favorite', 'spot']


In [19]:
# Step 4: Lemmatization (pos='v' treats every token as a verb)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word, pos='v') for word in text_without_punctuation]
print(lemmatized_tokens)

['dog', 'run', 'park', 'run', 'fast', 'dog', 'excite', 'favorite', 'spot']


In [21]:
# Step 5: Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in lemmatized_tokens]
print(stemmed_tokens)

['dog', 'run', 'park', 'run', 'fast', 'dog', 'excit', 'favorit', 'spot']


In [23]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

def preprocess_text(text):
    # Step 1: Normalize Capitalization
    normalized_text = text.lower()

    # Step 2: Remove Stop Words
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in normalized_text.split() if word not in stop_words]

    # Step 3: Remove Punctuation
    text_without_punctuation = [re.sub(r'[^\w\s]', '', word) for word in filtered_tokens]

    # Step 4: Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(word, pos='v') for word in text_without_punctuation]

    # Step 5: Stemming (computed for comparison; not returned below)
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(word) for word in lemmatized_tokens]

    # Return the lemmatized text; swap in the commented line to return stems instead
    #return ' '.join(stemmed_tokens)
    return ' '.join(lemmatized_tokens)


In [24]:
preprocess_text(movie_review_text)

'great danes drool speak movie review 2010  roger ebert signup search httpswwwrogerebertcom movie review great movies tvstreaming interview collections roger ebert contributors review great danes drool speak comedy 87 minutes  pg  2010 roger ebert june 2 2010 3 min read 20th centuryfox present film direct tom dey  write tim rasmussen vince dimeglio base comic strip create brad anderson phil leeming run time 87 minutes rat pg for rude humor language dog cannot talk know dog talk movies also know see lipsynching dialogue its plain grotesque best approach one use garfield saw cat hear bill murray  nonsense garfields mouth move moment saw marmadukes big drool lips move know trouble nothing discreet great dane lot mind especially hes narrator film never shut up master phil move winslow family kansas orange county join crowd dog park vegetarian pet food company well say movie speak part dog and cat humans congenial pgrated animal comedy like comic strip 56th year maybe youll like it maybe no
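
The output above still contains stop words such as "up", "for", and "and", plus some double spaces. That follows from the order of operations: `split()` leaves punctuation attached (for example `up.` or `(for`), so those tokens do not match the stop word list, and tokens that consist only of punctuation become empty strings. A minimal sketch of one possible fix strips punctuation first and drops empty tokens. To stay self-contained it uses a tiny hand-written stop word set, which is an assumption standing in for `stopwords.words('english')`, and `clean_tokens` is a hypothetical helper name.

```python
import re

# Tiny illustrative stop word set (assumption); the notebook uses NLTK's list
STOP_WORDS = {"the", "he", "up", "for", "and", "is", "a"}

def clean_tokens(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, then remove stop words and empty tokens."""
    tokens = []
    for raw in text.lower().split():
        token = re.sub(r'[^\w\s]', '', raw)  # strip punctuation first
        if token and token not in stop_words:  # skip empties and stop words
            tokens.append(token)
    return tokens

cleaned = clean_tokens("He never shuts up. (For rude humor) | The end")
print(cleaned)
```

Contractions such as "he's" turn into "hes" once the apostrophe is stripped, so they still need separate handling if they matter for your analysis.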