# NLP Fundamentals Lab

* * * 

<div class="alert alert-success">  
    
### Lab Objectives 
    
Apply the NLP fundamentals you learned in the lesson to your own text dataset:
* Preprocess your text data using appropriate cleaning techniques
* Compare different tokenization approaches on your data
* Create numerical representations using Bag-of-Words and TF-IDF
* Build a text classifier if you have labeled data
* Analyze and interpret your results
</div>

### Icons Used in This Lab
🔔 **Question**: A quick question to help you understand what's going on.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
📝 **Your Task**: Something for you to implement or analyze.<br>

### Sections
1. [Data Import and Exploration](#section1)
2. [Text Preprocessing](#section2)
3. [Tokenization Comparison](#section3)
4. [Numerical Representation](#section4)
5. [Text Classification (Optional)](#section5)
   - [Simple Sentiment Analysis with VADER](#section5b)
6. [Analysis and Reflection](#section6)

## Setup

First, let's install and import the necessary packages for text analysis.

In [None]:
# Uncomment the following lines to install packages if needed
# %pip install nltk
# %pip install transformers
# %pip install spacy
# %pip install scikit-learn
# !python -m spacy download en_core_web_sm

In [None]:
# Import necessary packages
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from string import punctuation
%matplotlib inline

<a id='section1'></a>

# 1. Data Import and Exploration

📝 **Your Task**: Import your text dataset and explore its structure.

**Questions to consider:**
- What format is your data in? (CSV, JSON, plain text files, etc.)
- Which column(s) contain the main text you want to analyze?
- What other information do you have (labels, categories, metadata)?
- How many documents/texts do you have?
- What does a typical text look like in your dataset?
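
A minimal sketch of a first look at a text dataset, assuming it is already loaded into a pandas DataFrame. The toy rows and the column names `text` and `label` are placeholders invented for illustration; substitute your own data and column names.

```python
import pandas as pd

# Toy stand-in for your dataset; the column names are hypothetical
toy_df = pd.DataFrame({
    "text": ["Great product, works well!", "Terrible. Broke after a day.", None, "ok"],
    "label": ["pos", "neg", "neg", "pos"],
})

# Size and column types
print(toy_df.shape)
print(toy_df.dtypes)

# Missing values in the text column
n_missing = toy_df["text"].isna().sum()
print(f"Missing texts: {n_missing}")

# Length of each text in whitespace-separated words (missing counts as 0)
word_counts = toy_df["text"].fillna("").str.split().str.len()
print(word_counts.describe())

# Label balance
print(toy_df["label"].value_counts())
```

Very short or missing texts, like the last two rows here, are often the first thing to decide how to handle in Section 2.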

In [None]:
# Load your dataset
# Example: df = pd.read_csv('your_data.csv')


In [None]:
# Explore the structure of your data
# Show first few rows, column info, basic statistics


In [None]:
# Look at a few example texts
# Display 2-3 representative examples from your dataset


🔔 **Question**: What patterns do you notice in your text data? What preprocessing challenges might you encounter?

<a id='section2'></a>

# 2. Text Preprocessing

📝 **Your Task**: Clean and preprocess your text data based on its specific characteristics.

**Common preprocessing steps to consider:**
- Handle missing values (NaN, empty strings)
- Remove or replace unwanted content (URLs, email addresses, special characters)
- Lowercase text
- Remove extra whitespace
- Handle punctuation
- Create placeholders for specific patterns in your data
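
Several of the steps above can be sketched with the standard `re` module. The function name `clean_text_sketch` and the placeholder tokens `URL` and `EMAIL` are illustrative choices, and the patterns are deliberately simple; your data may need stricter or additional rules.

```python
import re

def clean_text_sketch(text):
    """Illustrative cleaner: lowercase, replace URLs and emails, collapse whitespace."""
    # Only None is handled here; drop or fill pandas NaN values before calling this
    if text is None:
        return ""
    text = str(text).lower()
    # Replace URLs and email addresses with placeholder tokens
    text = re.sub(r"https?://\S+", " URL ", text)
    text = re.sub(r"\S+@\S+\.\S+", " EMAIL ", text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_text_sketch("Visit   https://example.com  or mail Me@Example.org NOW"))
# visit URL or mail EMAIL now
```

Lowercasing happens before the placeholders are inserted, so the uppercase tokens stay distinguishable from ordinary words.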

In [None]:
# Check for and handle missing values
# Count NaN values, empty strings, or other missing data indicators


In [None]:
# Remove or filter out unusable rows
# Remove rows with insufficient content or placeholder text


In [None]:
# Create a basic preprocessing function
def basic_preprocess(text):
    """
    Apply basic preprocessing steps to text.
    Adapt this function based on your data's characteristics.
    """
    # Convert to string and lowercase
    
    # Remove extra whitespace using regex
    
    # Handle other data-specific patterns
    
    return text

# Test your function on an example


In [None]:
# Create an advanced preprocessing function with regex patterns
def advanced_preprocess(text):
    """
    Apply domain-specific preprocessing.
    Replace patterns specific to your dataset with placeholders.
    """
    
    # Example patterns you might want to replace:
    # URLs, email addresses, phone numbers, dates, etc.
    
    # Replace URLs with placeholder
    # pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    # text = re.sub(pattern, ' URL ', text)
    
    # Replace numbers with placeholder
    # pattern = r'\d+'
    # text = re.sub(pattern, ' NUMBER ', text)
    
    # Add your own domain-specific patterns here
    
    return text

# Test your advanced function


In [None]:
# Apply preprocessing to your dataset
# Create a new column with cleaned text


<a id='section3'></a>

# 3. Tokenization Comparison

📝 **Your Task**: Compare different tokenization approaches on your data and understand their trade-offs.

**Tokenizers to compare:**
- NLTK (rule-based)
- spaCy (linguistic model-based)
- Modern transformer tokenizer (subword-based)
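
To make "subword-based" concrete, here is a toy sketch of the greedy longest-match-first strategy used by WordPiece-style tokenizers. The function name and the tiny vocabulary are hypothetical; a real tokenizer such as the one for `bert-base-uncased` uses a much larger learned vocabulary and extra normalization steps.

```python
def greedy_subword_tokenize(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary
toy_vocab = {"token", "##ization", "##ize", "##s", "un", "##related"}

for w in ["tokenization", "tokens", "unrelated", "xyz"]:
    print(w, "->", greedy_subword_tokenize(w, toy_vocab))
```

Rare or unseen words are broken into known pieces instead of being dropped, which is the main trade-off against word-level tokenizers like NLTK's.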

In [None]:
# Download NLTK resources if needed
import nltk
# nltk.download('punkt')      # newer NLTK releases may need 'punkt_tab' instead
# nltk.download('stopwords')
# nltk.download('wordnet')

In [None]:
# Set up tokenizers
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import spacy
from transformers import AutoTokenizer

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Load transformer tokenizer
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Get stop words
stop_words = set(stopwords.words('english'))

In [None]:
# Create traditional preprocessing function (NLTK-based)
def traditional_tokenize(text):
    """
    Traditional NLP preprocessing: lowercase, remove punctuation, 
    tokenize, remove stopwords.
    """
    
    # Lowercase
    
    # Remove punctuation
    
    # Tokenize with NLTK
    
    # Remove stop words
    
    return ' '.join(tokens)

In [None]:
# Create spaCy preprocessing function
def spacy_tokenize(text):
    """
    spaCy-based preprocessing with linguistic features.
    """
    
    # Process with spaCy
    
    # Extract tokens, filter stop words, punctuation, spaces
    
    # Optional: use lemmatization instead of original text
    
    return ' '.join(tokens)

In [None]:
# Create modern tokenizer function
def modern_tokenize(text):
    """
    Modern subword tokenization.
    """
    
    # Tokenize with transformer tokenizer
    
    return ' '.join(tokens)

In [None]:
# Compare tokenization approaches on example text
# Pick a representative example from your dataset
example_text = ""  # Replace with your example

print("Original text:")
print(example_text)
print("\n" + "="*50 + "\n")

print("Traditional (NLTK):")
# Apply traditional_tokenize

print("\nspaCy:")
# Apply spacy_tokenize

print("\nModern (Transformer):")
# Apply modern_tokenize

🔔 **Question**: What differences do you notice between the tokenization approaches? Which seems most appropriate for your data and task?

In [None]:
# Apply your chosen preprocessing approach to the full dataset
# Create a column with the processed text


<a id='section4'></a>

# 4. Numerical Representation

📝 **Your Task**: Convert your preprocessed text into numerical representations using Bag-of-Words and TF-IDF.

**Key considerations:**
- What vocabulary size makes sense for your dataset?
- Should you filter out very rare or very common words?
- How do the representations differ between Bag-of-Words and TF-IDF?

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
# Create Bag-of-Words representation
# Initialize CountVectorizer with appropriate parameters
count_vectorizer = None  # Your implementation

# Fit and transform your text data
bow_matrix = None  # Your implementation

# Create DataFrame for easier inspection
bow_df = None  # Your implementation

In [None]:
# Explore the Bag-of-Words representation
# Print matrix shape, vocabulary size, most frequent words


In [None]:
# Create TF-IDF representation
# Initialize TfidfVectorizer with appropriate parameters
tfidf_vectorizer = None  # Your implementation

# Fit and transform your text data
tfidf_matrix = None  # Your implementation

# Create DataFrame for easier inspection
tfidf_df = None  # Your implementation

In [None]:
# Compare a single document in both representations
# Pick a document index to examine
doc_index = 0

# Show the original text

# Show top words by Bag-of-Words count

# Show top words by TF-IDF score


In [None]:
# Visualize the most important words across your corpus
# Create bar plots for top words in both representations


🔔 **Question**: How do the Bag-of-Words and TF-IDF representations differ for your data? Which seems more informative?

<a id='section5'></a>

# 5. Text Classification (Optional)

📝 **Your Task**: If your dataset has labels or categories, build a text classifier.

**Requirements:**
- Your data must have a target variable (labels, categories, ratings, etc.)
- You need at least 2 classes with reasonable representation
- If you don't have labeled data, skip this section or create labels based on some text characteristics

💡 **Tip**: If you don't have natural labels, you could create binary labels based on text length, presence of certain words, or other measurable characteristics.
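
For orientation, here is a minimal end-to-end sketch on made-up data, assuming a TF-IDF plus logistic regression pipeline. The texts, labels, and variable names are invented for illustration; with so few examples the accuracy itself is not meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy labeled data; replace with your own texts and labels
toy_texts = [
    "loved it", "great movie", "wonderful acting", "really enjoyable",
    "fantastic plot", "loved the cast", "hated it", "terrible movie",
    "awful acting", "really boring", "dreadful plot", "hated the cast",
]
toy_labels = ["pos"] * 6 + ["neg"] * 6

# Stratify so both classes appear in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=0
)

# The vectorizer is fitted inside the pipeline, on training texts only
toy_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
toy_clf.fit(X_tr, y_tr)

preds = toy_clf.predict(X_te)
print(accuracy_score(y_te, preds))
```

Keeping the vectorizer inside the `Pipeline` means the test set never influences the vocabulary or the IDF weights.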

In [None]:
# Check if you have suitable labels for classification
# Explore your target variable


In [None]:
# Prepare data for classification
# Filter to classes with sufficient samples, handle class imbalance if needed


In [None]:
# Split data into features and target
from sklearn.model_selection import train_test_split

# Define X (text) and y (labels)
X = None  # Your text data
y = None  # Your labels

# Train-test split
X_train, X_test, y_train, y_test = None, None, None, None  # Your implementation

In [None]:
# Build classification pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Create pipeline with TF-IDF and classifier
classifier = None  # Your implementation

# Train the model

# Make predictions

# Evaluate performance


In [None]:
# Analyze feature importance
# Extract and visualize the most important features for each class


In [None]:
# Test on new examples
# Create some test examples or use holdout data
new_examples = []  # Add your test cases

# Make predictions and show probabilities


<a id='section5b'></a>

# 5b. Simple Sentiment Analysis with VADER

📝 **Your Task**: Analyze the emotional tone of your text data using VADER (Valence Aware Dictionary and sEntiment Reasoner).

**When to use VADER:**
- ✅ Social media text (Twitter, Reddit, Facebook)
- ✅ Text with emojis, slang, or informal language
- ✅ Need fast results on large datasets
- ✅ Want interpretable sentiment scores

**When to skip this section:**
- ❌ Your text doesn't express emotions or opinions
- ❌ You have highly domain-specific sentiment (medical, legal)

**What VADER gives you:**
- **compound**: Overall normalized sentiment score (-1 to +1) ← Use this for most analyses
- **pos**: Proportion of the text rated positive (0 to 1)
- **neg**: Proportion of the text rated negative (0 to 1)
- **neu**: Proportion of the text rated neutral (0 to 1)

In [None]:
# Install VADER if needed
# %pip install vaderSentiment

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Initialize VADER
analyzer = SentimentIntensityAnalyzer()

### Test VADER on Examples

First, test VADER on a few example texts to see how it works:

In [None]:
# Test on example texts from your domain
# Replace these with representative examples from your dataset
test_examples = [
    "Your positive example text here",
    "Your negative example text here",
    "Your neutral example text here"
]

print("Testing VADER on examples:\n" + "="*60)
for text in test_examples:
    scores = analyzer.polarity_scores(text)
    print(f"\nText: {text}")
    print(f"Scores: {scores}")
    print(f"Overall sentiment (compound): {scores['compound']:.3f}")

### Apply VADER to Your Dataset

Now apply sentiment analysis to all your texts:

In [None]:
# Apply VADER to your text column
# Replace 'text_column' with your actual column name
# df['sentiment_scores'] = df['text_column'].apply(
#     lambda x: analyzer.polarity_scores(str(x))
# )

# Extract compound score (overall sentiment)
# df['sentiment_compound'] = df['sentiment_scores'].apply(lambda x: x['compound'])

# Optional: Extract individual components
# df['sentiment_pos'] = df['sentiment_scores'].apply(lambda x: x['pos'])
# df['sentiment_neg'] = df['sentiment_scores'].apply(lambda x: x['neg'])
# df['sentiment_neu'] = df['sentiment_scores'].apply(lambda x: x['neu'])

# Show first few results


### Categorize Sentiment (Optional)

Convert compound scores into simple categories for easier analysis:

In [None]:
def categorize_sentiment(compound_score):
    """
    Categorize sentiment based on compound score.
    Adjust thresholds based on your data if needed.
    """
    if compound_score >= 0.05:
        return 'positive'
    elif compound_score <= -0.05:
        return 'negative'
    else:
        return 'neutral'

# Apply categorization
# df['sentiment_category'] = df['sentiment_compound'].apply(categorize_sentiment)

# Show distribution
# print("Sentiment distribution:")
# print(df['sentiment_category'].value_counts())
# print("\nPercentages:")
# print(df['sentiment_category'].value_counts(normalize=True).round(3) * 100)

### Visualize Sentiment Distribution

In [None]:
# Visualize sentiment distribution
# fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# # Histogram of compound scores
# axes[0].hist(df['sentiment_compound'], bins=30, edgecolor='black', alpha=0.7)
# axes[0].axvline(x=0, color='red', linestyle='--', alpha=0.5, label='Neutral')
# axes[0].set_xlabel('Sentiment Compound Score')
# axes[0].set_ylabel('Number of Texts')
# axes[0].set_title('Distribution of Sentiment Scores')
# axes[0].legend()
# axes[0].grid(True, alpha=0.3)

# # Bar chart of categories
# sentiment_counts = df['sentiment_category'].value_counts()
# colors = {'positive': 'green', 'neutral': 'gray', 'negative': 'red'}
# bar_colors = [colors[cat] for cat in sentiment_counts.index]

# axes[1].bar(sentiment_counts.index, sentiment_counts.values, 
#             color=bar_colors, alpha=0.7, edgecolor='black')
# axes[1].set_xlabel('Sentiment Category')
# axes[1].set_ylabel('Count')
# axes[1].set_title('Sentiment Categories')
# axes[1].grid(True, alpha=0.3, axis='y')

# plt.tight_layout()
# plt.show()

### Analyze Sentiment Patterns

Explore how sentiment varies across different categories or groups in your data:

In [None]:
# If you have categories or labels, compare sentiment across them
# Example: Compare sentiment by category
# sentiment_by_category = df.groupby('your_category_column')['sentiment_compound'].agg(['mean', 'std', 'count'])
# print(sentiment_by_category)

# Visualize sentiment by category
# df.boxplot(column='sentiment_compound', by='your_category_column', figsize=(10, 6))
# plt.ylabel('Sentiment Compound Score')
# plt.title('Sentiment Distribution by Category')
# plt.suptitle('')  # Remove default title
# plt.axhline(y=0, color='red', linestyle='--', alpha=0.5)
# plt.tight_layout()
# plt.show()

### Examine Extreme Examples

Look at the most positive and negative examples to understand what VADER is detecting:

In [None]:
# Find most positive text
# most_positive_idx = df['sentiment_compound'].idxmax()
# print("Most POSITIVE text:")
# print(f"Score: {df.loc[most_positive_idx, 'sentiment_compound']:.3f}")
# print(f"Text: {df.loc[most_positive_idx, 'your_text_column'][:300]}...")
# print("\n" + "="*80 + "\n")

# Find most negative text
# most_negative_idx = df['sentiment_compound'].idxmin()
# print("Most NEGATIVE text:")
# print(f"Score: {df.loc[most_negative_idx, 'sentiment_compound']:.3f}")
# print(f"Text: {df.loc[most_negative_idx, 'your_text_column'][:300]}...")

### Sentiment Insights

🔔 **Questions to consider:**
- What is the overall sentiment distribution in your dataset?
- Are there differences in sentiment across categories/groups?
- Do the most extreme examples make sense?
- What patterns or insights does sentiment reveal about your data?

💡 **Tip**: VADER measures the *emotional tone* of the writing, not necessarily the "truth" or "correctness" of the content. Consider what sentiment means in the context of your specific dataset.

⚠️ **Warning**: If your results seem off, check:
- Is your text type appropriate for VADER? (social media works best)
- Did you handle NaN values properly?
- Are the extreme examples what you'd expect?
- Might you need domain-specific sentiment analysis instead?

<a id='section6'></a>

# 6. Analysis and Reflection

📝 **Your Task**: Reflect on your analysis and findings.

**Questions to address:**
- What insights did you gain about your text data?
- Which preprocessing steps were most important for your dataset?
- How well did the different approaches work?
- What challenges did you encounter?
- What would you do differently or explore further?

## Summary of Findings

**Dataset characteristics:**
- [Describe your dataset: size, source, type of text]
- [Key patterns or challenges you observed]

**Preprocessing insights:**
- [Which preprocessing steps were most important]
- [Domain-specific challenges you addressed]

**Tokenization comparison:**
- [How different tokenizers performed on your data]
- [Which approach you chose and why]

**Numerical representation:**
- [Differences between Bag-of-Words and TF-IDF for your data]
- [Most informative features/words]

**Classification results (if applicable):**
- [Model performance and key predictive features]
- [Insights about what the model learned]

## Next Steps

**Potential extensions:**
- Experiment with different preprocessing approaches
- Try other classification algorithms
- Explore n-grams (bigrams, trigrams) in your vectorizers
- Investigate misclassified examples
- Compare with word embeddings or modern language models

**Questions for further exploration:**
- How would your results change with different preprocessing choices?
- What other features could improve classification performance?
- How do traditional methods compare to modern approaches on your data?

<div class="alert alert-success">

## ❗ Key Takeaways

* **Preprocessing is crucial**: The quality of your analysis depends heavily on appropriate text cleaning and preprocessing.
* **Different tokenizers have trade-offs**: Rule-based, linguistic, and subword approaches each have strengths for different types of text.
* **TF-IDF often outperforms simple counts**: By weighting each term's frequency against the number of documents that contain it, TF-IDF down-weights words that appear everywhere and typically provides more informative representations.
* **Domain knowledge matters**: Understanding your specific text domain helps guide preprocessing decisions and interpret results.
* **Feature analysis is valuable**: Examining which words are most predictive can provide insights about your data and task.
* **Traditional methods are still useful**: While modern language models are powerful, classical NLP techniques remain valuable for many tasks.

</div>

## 🚀 Stretch Goals

For students who complete the lab early and want to explore further:

### 1. Experiment with N-grams

Try using bigrams (two-word phrases) instead of just single words to capture more context:

```python
# Use ngram_range parameter in your vectorizer
tfidf_bigram = TfidfVectorizer(ngram_range=(1,2), max_features=5000)
X_bigram = tfidf_bigram.fit_transform(your_texts)

# Find the most informative bigrams in your dataset
```

**Question**: Do bigrams improve your classification accuracy? Which bigrams are most meaningful for your data?

### 2. Named Entity Recognition (NER)

**Challenge**: Extract structured information from unstructured text.

**Tasks:**
- Use spaCy's pre-trained NER model to identify entities (people, organizations, locations)
- Visualize entity distributions in your corpus
- Create a knowledge graph from extracted entities
- Build custom entity patterns for domain-specific entities
- Compare entity extraction across different text sources

**Analysis ideas:**
- Which entities appear most frequently?
- How are different entity types distributed across your documents?
- Can you find interesting co-occurrence patterns between entities?
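
One simple way to look for co-occurrence patterns is to count how often two entities appear in the same document. This sketch assumes the entities have already been extracted into one list per document; the `doc_entities` data is made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical entities already extracted per document (e.g. from doc.ents)
doc_entities = [
    ["Alice", "Acme Corp", "Paris"],
    ["Alice", "Acme Corp"],
    ["Bob", "Paris"],
]

pair_counts = Counter()
for ents in doc_entities:
    # Count each unordered pair once per document
    for pair in combinations(sorted(set(ents)), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))
```

Sorting the entities before pairing keeps `("Acme Corp", "Alice")` and `("Alice", "Acme Corp")` from being counted as different pairs.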

In [None]:
# Example: Extract and analyze named entities
# import spacy
# from collections import Counter

# nlp = spacy.load("en_core_web_sm")

# # Extract entities from your texts
# all_entities = []
# for text in processed_texts[:100]:  # Sample for speed; raw (not lowercased) text usually works better for NER
#     doc = nlp(text)
#     for ent in doc.ents:
#         all_entities.append((ent.text, ent.label_))

# # Analyze entity distribution (most_common() sorts by count, highest first)
# entity_counts = Counter(all_entities).most_common()
# person_entities = [(ent, count) for (ent, label), count in entity_counts if label == 'PERSON']
# org_entities = [(ent, count) for (ent, label), count in entity_counts if label == 'ORG']

# print(f"Top people mentioned: {person_entities[:5]}")
# print(f"Top organizations: {org_entities[:5]}")

### 3. Sentiment and Emotion Analysis

**Challenge**: Go beyond classification to analyze sentiment and emotional content.

**Tasks:**
- Implement rule-based sentiment analysis using VADER or TextBlob
- Create custom sentiment lexicons for your domain
- Analyze sentiment distribution across different categories/labels
- Build a sentiment classifier using your labeled data
- Extract emotion-bearing phrases and patterns

**Visualization ideas:**
- Sentiment over time (if temporal data available)
- Sentiment distribution by category
- Word clouds colored by sentiment
- Correlation between sentiment and other variables

In [None]:
# Example: Basic sentiment analysis
# from textblob import TextBlob
# # or: from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# # Analyze sentiment for each text
# sentiments = []
# for text in your_texts[:100]:  # Sample for speed
#     blob = TextBlob(text)
#     sentiments.append(blob.sentiment.polarity)

# # Visualize sentiment distribution
# plt.hist(sentiments, bins=20, edgecolor='black')
# plt.xlabel('Sentiment Polarity')
# plt.ylabel('Frequency')
# plt.title('Sentiment Distribution in Your Dataset')
# plt.show()