# 📝 Notebook 2: Python Essentials for Text Analysis
## Processing Political & Legal Documents with Python

**Time to complete:** 90-120 minutes  
**Prerequisites:** Completed Notebook 1 (R-to-Python Translation)  
**What you'll build:** A complete text processing pipeline for your dissertation data!

### 🎯 Learning Objectives
1. Load and explore text data from various sources
2. Clean and preprocess political/legal documents
3. Extract basic features and patterns
4. Save processed data for analysis

---

## 🚀 Part 1: Setup and Installation (One Click!)

Run this cell once at the beginning of each session:

In [None]:
# Install necessary packages (only needed once per session in Colab)
import sys
if 'google.colab' in sys.modules:
    print("🔧 Installing packages for Google Colab...")
    !pip install -q wordcloud
    print("✅ All packages installed!")
else:
    print("📍 Not in Colab - assuming packages are installed")

# Import everything we need
import pandas as pd
import numpy as np
import re
import json
from collections import Counter
from datetime import datetime
import textwrap

print("✅ Setup complete! You're ready to analyze text.")

## 📂 Part 2: Loading Your Text Data

We'll practice with two sample datasets (political speeches and legal documents), then load your own CSV files:

In [None]:
# Example 1: Create sample political speech data
speeches_data = {
    'speaker': ['Biden', 'Trump', 'Sanders', 'Harris', 'DeSantis'],
    'date': ['2024-01-15', '2024-02-20', '2024-01-28', '2024-03-10', '2024-02-05'],
    'text': [
        """My fellow Americans, we stand at an inflection point in history. 
        Together, we can build a better future for our children. The work ahead 
        will not be easy, but I know the American people are up to the task.""",
        
        """This administration is a disaster, folks. They're destroying our 
        country. We had the greatest economy in history, and they ruined it. 
        We need to make America great again, and we will!""",
        
        """The billionaire class has rigged our economy. Working families 
        deserve a living wage, healthcare as a human right, and free public 
        college. We need a political revolution in this country.""",
        
        """We must protect our democracy and our fundamental rights. Women's 
        rights are human rights. We will not go back. Together, we will 
        move forward and build a more just society.""",
        
        """The woke ideology is poisoning our schools and destroying our values. 
        We must fight back against this radical agenda. Florida is where woke 
        goes to die, and we're just getting started."""
    ]
}

speeches_df = pd.DataFrame(speeches_data)
print("📊 Loaded political speeches dataset:")
print(speeches_df[['speaker', 'date']].head())
print(f"\n📝 Total speeches: {len(speeches_df)}")

In [None]:
# Example 2: Create sample legal document data
legal_data = {
    'document_id': ['RON_001', 'RON_002', 'RON_003'],
    'jurisdiction': ['Boulder, CO', 'Toledo, OH', 'Orange County, FL'],
    'year': [2021, 2022, 2023],
    'text': [
        """Whereas natural ecosystems have inherent rights to exist and flourish, 
        the Boulder Creek watershed shall possess legal standing. Any citizen may 
        bring action to enforce these rights in court.""",
        
        """Lake Erie Bill of Rights: Lake Erie and its watershed possess the right 
        to exist, flourish, and naturally evolve. The people of Toledo have the 
        right to a clean and healthy environment.""",
        
        """The Wekiva River and Econlockhatchee River possess fundamental rights 
        to flow, to be free from pollution, and to maintain their natural water levels. 
        These rights shall be enforced by county residents."""
    ]
}

legal_df = pd.DataFrame(legal_data)
print("\n📜 Loaded legal documents dataset:")
print(legal_df[['document_id', 'jurisdiction']].head())

### 💡 Loading Your Own Data

**For Fairooz (Political Speeches):**
Upload your CSV with speeches to Colab, then uncomment and run:

In [None]:
# === UNCOMMENT AND MODIFY FOR YOUR DATA ===
# from google.colab import files
# uploaded = files.upload()  # Click to upload your file

# # Load your political speeches CSV
# your_speeches = pd.read_csv('your_speeches.csv')  # Change filename
# print(your_speeches.head())

**For Brisa (Legal Documents):**
Upload your CSV with RoN documents to Colab, then uncomment and run:

In [None]:
# === UNCOMMENT AND MODIFY FOR YOUR DATA ===
# from google.colab import files
# uploaded = files.upload()  # Click to upload your file

# # Load your legal documents CSV
# your_legal_docs = pd.read_csv('your_legal_docs.csv')  # Change filename
# print(your_legal_docs.head())
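
Before running the pipeline on an uploaded file, it can help to confirm that the CSV has the columns you expect. The sketch below is one possible way to do that, not a required step. `load_corpus`, its parameters, and the column names are illustrative assumptions, and an in-memory string stands in for your uploaded file.

```python
import io
import pandas as pd

def load_corpus(source, required_columns=("speaker", "date", "text"), text_column="text"):
    """Read a CSV and fail early if an expected column is missing.

    `load_corpus` and the default column names are hypothetical;
    change them to match your own file.
    """
    df = pd.read_csv(source)
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    # Drop rows with no text so later steps do not process empty documents
    df = df.dropna(subset=[text_column]).reset_index(drop=True)
    return df

# An in-memory CSV stands in for 'your_speeches.csv' (made-up rows)
sample_csv = io.StringIO(
    "speaker,date,text\n"
    "A,2024-01-15,First speech text.\n"
    "B,2024-02-20,\n"
    "C,2024-03-10,Third speech text.\n"
)
corpus = load_corpus(sample_csv)
print(corpus)
```

In Colab you would pass the uploaded filename instead of `sample_csv`. The row with an empty `text` field is dropped before any cleaning happens.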

## 🧹 Part 3: Text Cleaning Pipeline

Let's build a reusable cleaning function for political/legal text:

In [None]:
def clean_text_basic(text):
    """
    Basic text cleaning for political/legal documents
    Keep this simple - we'll add complexity later!
    """
    if pd.isna(text):
        return ""
    
    # Convert to string and lowercase
    text = str(text).lower()
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    # Remove special characters but keep sentence-ending punctuation (. ! ?)
    # Note: this also strips apostrophes and hyphens ("they're" -> "theyre")
    text = re.sub(r'[^\w\s.!?]', '', text)
    
    return text

# Test the function
sample_text = speeches_df['text'][0]
cleaned = clean_text_basic(sample_text)

print("🔤 Original text (first 100 chars):")
print(textwrap.fill(sample_text[:100], width=60))
print("\n🧹 Cleaned text (first 100 chars):")
print(textwrap.fill(cleaned[:100], width=60))

In [None]:
# Apply cleaning to all speeches
speeches_df['cleaned_text'] = speeches_df['text'].apply(clean_text_basic)
legal_df['cleaned_text'] = legal_df['text'].apply(clean_text_basic)

print("✅ Text cleaning complete!")
print(f"Speeches cleaned: {len(speeches_df)}")
print(f"Legal docs cleaned: {len(legal_df)}")
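
The character class in `clean_text_basic` removes apostrophes and hyphens, so "they're" becomes "theyre" and "well-being" becomes "wellbeing". If contractions or hyphenated terms matter for your analysis, a variant along these lines keeps them. This is a sketch only: `clean_text_keep_joiners` is a hypothetical name and is not used elsewhere in this notebook.

```python
import re

def clean_text_keep_joiners(text):
    """Lowercase and strip punctuation, but keep apostrophes and hyphens."""
    text = str(text).lower()
    # Collapse runs of whitespace (including newlines) into single spaces
    text = " ".join(text.split())
    # Normalize curly apostrophes (U+2019) to straight ones
    text = text.replace("\u2019", "'")
    # Same character class as clean_text_basic, plus ' and -
    text = re.sub(r"[^\w\s.!?'-]", "", text)
    return text

print(clean_text_keep_joiners("They're destroying our well-being, folks!"))
# they're destroying our well-being folks!
```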

## 📊 Part 4: Basic Text Analysis

Let's extract useful features from our text:

In [None]:
def analyze_text(text):
    """
    Extract basic statistics from text
    Similar to what you might do with stringr in R
    """
    # Word count
    words = text.split()
    word_count = len(words)
    
    # Sentence count (approximate)
    sentences = re.findall(r'[.!?]+', text)
    sentence_count = len(sentences) if sentences else 1
    
    # Average words per sentence
    avg_words_per_sentence = word_count / sentence_count if sentence_count > 0 else 0
    
    # Unique words (vocabulary richness)
    unique_words = len(set(words))
    
    # Lexical diversity
    lexical_diversity = unique_words / word_count if word_count > 0 else 0
    
    return {
        'word_count': word_count,
        'sentence_count': sentence_count,
        'avg_words_per_sentence': round(avg_words_per_sentence, 1),
        'unique_words': unique_words,
        'lexical_diversity': round(lexical_diversity, 3)
    }

# Test on one speech
sample_analysis = analyze_text(speeches_df['cleaned_text'][0])
print("📈 Text analysis for Biden's speech:")
for key, value in sample_analysis.items():
    print(f"  {key}: {value}")
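
To see what these statistics measure, here is a hand-checkable example on a short, made-up string. It repeats the same arithmetic as `analyze_text` without calling it, so it runs on its own.

```python
import re

text = "we will not go back. we will move forward."

words = text.split()                           # 9 tokens; punctuation stays attached
sentence_marks = re.findall(r"[.!?]+", text)   # each run of . ! ? counts once
unique_words = set(words)                      # 7 distinct tokens

word_count = len(words)
sentence_count = len(sentence_marks)
lexical_diversity = len(unique_words) / word_count

print(word_count, sentence_count, round(lexical_diversity, 3))
# 9 2 0.778
```

Because tokens keep their trailing punctuation, `back.` and `back` would count as different words. Lexical diversity also tends to fall as texts get longer, so compare documents of very different lengths with care.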

In [None]:
# Apply analysis to all documents
# This is like using mutate() in R to add multiple columns

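# For speeches: analyze_text returns one dict per document; expand the dicts into columns
stats_df = pd.DataFrame(speeches_df['cleaned_text'].apply(analyze_text).tolist(),
                        index=speeches_df.index)
for column in stats_df.columns:
    speeches_df[column] = stats_df[column]  # counts stay integers instead of becoming floats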

print("📊 Speech Statistics:")
print(speeches_df[['speaker', 'word_count', 'avg_words_per_sentence', 'lexical_diversity']])

In [None]:
# Quick visualization of results
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Word count by speaker
axes[0].bar(speeches_df['speaker'], speeches_df['word_count'], color='steelblue')
axes[0].set_title('Word Count by Speaker')
axes[0].set_xlabel('Speaker')
axes[0].set_ylabel('Words')
axes[0].tick_params(axis='x', rotation=45)

# Lexical diversity
axes[1].bar(speeches_df['speaker'], speeches_df['lexical_diversity'], color='coral')
axes[1].set_title('Lexical Diversity by Speaker')
axes[1].set_xlabel('Speaker')
axes[1].set_ylabel('Diversity Score')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## 🔍 Part 5: Pattern Detection for Political/Legal Text

In [None]:
# Define patterns relevant to political science research

# For Fairooz: Detecting rhetorical patterns
political_patterns = {
    'unity_language': ['together', 'united', 'unity', 'we', 'us', 'our'],
    'crisis_language': ['crisis', 'disaster', 'emergency', 'threat', 'danger'],
    'populist_language': ['elite', 'rigged', 'establishment', 'people', 'corrupt'],
    'rights_language': ['rights', 'freedom', 'liberty', 'justice', 'democracy']
}

# For Brisa: Legal document patterns  
legal_patterns = {
    'rights_granted': ['right to', 'shall possess', 'entitled to', 'authority to'],
    'enforcement': ['enforce', 'action', 'court', 'standing', 'bring suit'],
    'environmental': ['ecosystem', 'watershed', 'natural', 'pollution', 'environment'],
    'procedural': ['whereas', 'shall', 'pursuant', 'herein', 'thereof']
}

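def count_patterns(text, patterns_dict):
    """
    Count occurrences of pattern categories in text.
    Keywords match as whole words or phrases, so 'us' does not match 'must'.
    Inflected forms (e.g. 'enforced') must be listed explicitly.
    """
    text_lower = text.lower()
    results = {}
    
    for category, patterns in patterns_dict.items():
        # Sum every whole-word occurrence of every keyword in the category
        count = sum(
            len(re.findall(r'\b' + re.escape(pattern) + r'\b', text_lower))
            for pattern in patterns
        )
        results[category] = count
    
    return results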

# Test on a speech
biden_patterns = count_patterns(speeches_df['text'][0], political_patterns)
print("🎯 Patterns in Biden's speech:")
for pattern, count in biden_patterns.items():
    print(f"  {pattern}: {count}")
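
Short keywords such as 'we' and 'us' are sensitive to how matching is done: a plain substring test also fires inside longer words. The sketch below contrasts the two approaches on a made-up sentence. `count_whole_word` is a hypothetical helper introduced only for this illustration.

```python
import re

def count_whole_word(keyword, text):
    """Count occurrences of keyword as a whole word or phrase (case-insensitive)."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

sentence = "We must trust the process, and we answer to the people."

# Substring test: 'us' is found inside 'must' and 'trust'
print("us" in sentence.lower())            # True
# Whole-word count: 'us' never appears as its own word here
print(count_whole_word("us", sentence))    # 0
# 'we' appears twice as a word; the 'we' inside 'answer' is ignored
print(count_whole_word("we", sentence))    # 2
```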

In [None]:
# Apply pattern detection to all speeches
for pattern_type in political_patterns.keys():
    speeches_df[pattern_type] = speeches_df['text'].apply(
        lambda x: count_patterns(x, political_patterns)[pattern_type]
    )

print("\n📊 Pattern Analysis Results:")
print(speeches_df[['speaker'] + list(political_patterns.keys())])
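
Raw counts favor longer documents, so when comparing speakers it is common to express counts as a rate. A minimal sketch, assuming a rate per 100 words; `rate_per_100_words` is an illustrative name and the numbers are made up.

```python
def rate_per_100_words(count, word_count):
    """Convert a raw keyword count into occurrences per 100 words."""
    if word_count == 0:
        return 0.0
    return round(100 * count / word_count, 2)

# Same raw count, very different rates once length is taken into account
print(rate_per_100_words(3, 40))    # 7.5
print(rate_per_100_words(3, 120))   # 2.5
```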

## 🏷️ Part 6: Keyword Extraction

Extract the most important words from each document:

In [None]:
def extract_keywords(text, num_keywords=10, stop_words=None):
    """
    Extract most common meaningful words from text
    Similar to what you might do with the tm package in R
    """
    # Default stop words (common words to ignore)
    if stop_words is None:
        stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
                     'of', 'with', 'by', 'from', 'is', 'was', 'are', 'were', 'be', 'been',
                     'being', 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would',
                     'could', 'should', 'may', 'might', 'must', 'can', 'this', 'that',
                     'these', 'those', 'i', 'you', 'we', 'they', 'it', 'he', 'she'}
    
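    # Tokenize: pull out alphabetic words so that words followed by
    # punctuation (e.g. "history.") are kept rather than silently dropped
    words = re.findall(r'[a-z]+', text.lower())
    words = [w for w in words if w not in stop_words and len(w) > 3]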
    
    # Count frequencies
    word_freq = Counter(words)
    
    # Get top keywords
    top_keywords = word_freq.most_common(num_keywords)
    
    return top_keywords

# Extract keywords for each speaker
print("🔑 Top Keywords by Speaker:\n")
for idx, row in speeches_df.iterrows():
    keywords = extract_keywords(row['text'], num_keywords=5)
    print(f"{row['speaker']}:")
    for word, freq in keywords:
        print(f"  • {word} ({freq})")
    print()
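
Single words can hide multi-word expressions such as "human rights" or "living wage". As an optional extension, the sketch below counts adjacent word pairs (bigrams) with `Counter`. The function name `top_bigrams` and the sample text are assumptions for illustration.

```python
import re
from collections import Counter

def top_bigrams(text, num_bigrams=3):
    """Return the most common adjacent word pairs in text."""
    # Split into sentences first so pairs never span a sentence boundary
    sentences = re.split(r"[.!?]+", text.lower())
    pairs = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence)
        # zip pairs each token with the one that follows it
        pairs.update(zip(tokens, tokens[1:]))
    return pairs.most_common(num_bigrams)

sample = "Women's rights are human rights. Human rights are not optional."
for (first, second), freq in top_bigrams(sample, num_bigrams=2):
    print(f"{first} {second} ({freq})")
```

No stop words are removed here, so frequent pairs may include function words; filtering them is a design choice that depends on your research question.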

## 💾 Part 7: Saving Your Processed Data

In [None]:
# Prepare final dataset with all our extracted features
final_speeches = speeches_df[['speaker', 'date', 'cleaned_text', 
                              'word_count', 'sentence_count',
                              'avg_words_per_sentence', 'lexical_diversity',
                              'unity_language', 'crisis_language', 
                              'populist_language', 'rights_language']]

# Save to CSV (can open in R!)
output_filename = 'processed_speeches.csv'
final_speeches.to_csv(output_filename, index=False)
print(f"✅ Saved processed data to '{output_filename}'")
print("\n📊 Data preview:")
print(final_speeches.head())

In [None]:
# Also save summary statistics
summary_stats = speeches_df.groupby('speaker').agg({
    'word_count': 'mean',
    'lexical_diversity': 'mean',
    'unity_language': 'sum',
    'crisis_language': 'sum',
    'populist_language': 'sum',
    'rights_language': 'sum'
}).round(2)

summary_stats.to_csv('speech_summary_stats.csv')
print("\n📈 Summary statistics by speaker:")
print(summary_stats)
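
Before handing a CSV to R, it can be worth checking that it reads back with the same shape and columns. A small sketch using an in-memory buffer instead of a file on disk; the tiny DataFrame is made-up data.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "speaker": ["A", "B"],
    "word_count": [40, 52],
    "lexical_diversity": [0.85, 0.79],
})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False avoids an extra unnamed column
buffer.seek(0)                   # rewind so read_csv starts at the beginning

restored = pd.read_csv(buffer)
print(restored.shape == df.shape)
print(list(restored.columns))
```

With a real file, you would pass the same filename to `to_csv` and `read_csv` instead of the buffer.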

## 🎯 Part 8: Your Turn - Process Your Own Data!

Now apply everything to your dissertation data:

In [None]:
# === TEMPLATE FOR YOUR DATA ===
# Modify this code block for your specific needs

def process_my_documents(df, text_column='text'):
    """
    Complete pipeline for processing political/legal documents
    
    Parameters:
    df: Your dataframe
    text_column: Name of the column containing text
    """
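    # Work on a copy so the dataframe you pass in is not modified in place
    df = df.copy()
    
    # 1. Clean text
    df['cleaned_text'] = df[text_column].apply(clean_text_basic)
    
    # 2. Basic statistics (one new column per statistic)
    stats_df = pd.DataFrame(df['cleaned_text'].apply(analyze_text).tolist(),
                            index=df.index)
    for column in stats_df.columns:
        df[column] = stats_df[column]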
    
    # 3. Pattern detection (customize patterns for your research!)
    # === MODIFY THIS SECTION ===
    my_patterns = {
        'pattern1': ['word1', 'word2'],  # Add your keywords
        'pattern2': ['word3', 'word4'],  # Add more categories
    }
    # === END MODIFICATION ===
    
    for pattern_type in my_patterns.keys():
        df[pattern_type] = df['cleaned_text'].apply(
            lambda x: count_patterns(x, my_patterns)[pattern_type]
        )
    
    return df

# Example usage (uncomment when you have your data):
# my_processed_data = process_my_documents(your_data, text_column='your_text_column')
# my_processed_data.to_csv('my_processed_data.csv', index=False)

## 📚 Part 9: Quick Reference Functions

Here are all the functions we created, ready to copy and use:

In [None]:
# === SAVE THIS CELL FOR FUTURE USE ===

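def text_processing_toolkit():
    """
    All your text processing functions in one place.
    This only collects functions that are already defined, so in a new
    notebook copy the function definitions from the cells above as well.
    """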
    
    functions = {
        'clean_text': clean_text_basic,
        'analyze_text': analyze_text,
        'count_patterns': count_patterns,
        'extract_keywords': extract_keywords,
        'process_documents': process_my_documents
    }
    
    return functions

# Get your toolkit
toolkit = text_processing_toolkit()
print("🛠️ Your text processing toolkit contains:")
for name in toolkit.keys():
    print(f"  • {name}()")

## 🏁 Part 10: Next Steps and Resources

In [None]:
# Create a word cloud visualization (fun bonus!)
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Combine all speeches
all_text = ' '.join(speeches_df['cleaned_text'])

# Generate word cloud
wordcloud = WordCloud(width=800, height=400, 
                      background_color='white',
                      colormap='viridis').generate(all_text)

# Display
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Most Common Words in Political Speeches')
plt.show()

print("🎨 Word cloud generated from your corpus!")

## ✅ Congratulations! You've Completed Text Analysis Essentials!

### 📊 What You've Accomplished:
- ✅ Loaded and explored text data
- ✅ Built a complete text cleaning pipeline
- ✅ Extracted meaningful features and patterns
- ✅ Analyzed political/legal language patterns
- ✅ Created reusable functions for your research
- ✅ Saved processed data for further analysis

### 🎯 Your Homework:
1. **Upload your dissertation data** to Colab
2. **Process at least 20 documents** using the pipeline
3. **Identify 3 interesting patterns** in your data
4. **Save your processed dataset** for next week

### 📚 Resources for This Week:

**Documentation:**
- [Pandas Text Methods](https://pandas.pydata.org/docs/user_guide/text.html)
- [Python Regular Expressions](https://docs.python.org/3/howto/regex.html)
- [Google Colab Tips](https://colab.research.google.com/notebooks/snippets/importing_libraries.ipynb)

**For R Users:**
- Your processed CSV can be loaded directly into R!
- Use `to_csv()` in Python → `read.csv()` in R
- All numeric features we created work perfectly in R's statistical functions

### 💬 Getting Help:
```python
# If you get stuck, try these:

# 1. Check data types
df.dtypes

# 2. Check for missing values  
df.isna().sum()

# 3. Preview your data
df.head()

# 4. Check shape
df.shape
```

### 🚀 Ready for Week 1?
Next week we'll use spaCy for advanced NLP - but you now have all the Python basics you need!

**Remember:**
- Your R knowledge is valuable - you're just learning new syntax
- Focus on your research questions, not perfect code
- These notebooks are yours to modify and reuse
- Help is always available in Slack!

---

**📝 Note:** Save this notebook! You'll reference these functions throughout October.