
In today's class we will focus on processing text documents.

Next time we will work on data scraping.

# Environment setup

In [1]:
import os
import re
import string
from collections import Counter

# Required libraries for text processing
# !pip install nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# File Handling

## Reading Text Files

In political science research, we often work with various text documents like:
- Policy documents
- Speech transcripts
- Legislative texts
- Social media data


Let's learn how to handle these files in Python.

In [3]:
# Example 1: Basic file reading
def read_simple_file(filename):
    """
    Basic function to read a text file
    """
    with open(filename, 'r', encoding='utf-8') as file:
        return file.read()

EXERCISE 1:

Create a text file named `speech.txt` with any political speech. Try reading it using the function above.


In [6]:
# YOUR CODE HERE
read_simple_file('/content/speech.txt')

'Księga pierwsza\n\n\n\nGospodarstwo\n\nPowrót panicza — Spotkanie się pierwsze w pokoiku, drugie u stołu — Ważna Sędziego nauka o grzeczności — Podkomorzego uwagi polityczne nad modami — Początek sporu o Kusego i Sokoła — Żale Wojskiego — Ostatni Woźny Trybunału — Rzut oka na ówczesny stan polityczny Litwy i Europy\n\n    Litwo! Ojczyzno moja! ty jesteś jak zdrowie:\nIle cię trzeba cenić, ten tylko się dowie,\nKto cię stracił. Dziś piękność twą w całej ozdobie\nWidzę i opisuję, bo tęsknię po tobie.\n\n    Panno święta, co Jasnej bronisz Częstochowy\nI w Ostrej świecisz Bramie! Ty, co gród zamkowy\nNowogródzki ochraniasz z jego wiernym ludem!\nJak mnie dziecko do zdrowia powróciłaś cudem\n(Gdy od płaczącej matki, pod Twoją opiekę\nOfiarowany, martwą podniosłem powiekę;\nI zaraz mogłem pieszo, do Twych świątyń progu\nIść za wrócone życie podziękować Bogu),\nTak nas powrócisz cudem na Ojczyzny łono.\nTymczasem przenoś moją duszę utęsknioną\nDo tych pagórków leśnych, do tych łąk zielonych

Different ways to read files:
1. `read()` - entire file as a single string
2. `readlines()` - list of lines
3. `readline()` - one line at a time
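
The three methods listed above can be compared side by side. The following is a minimal sketch that writes a small temporary file (the file contents and variable names are made up for illustration) and reads it back in each of the three ways:

```python
import os
import tempfile

# Create a small throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt', delete=False) as tmp:
    tmp.write("First line\nSecond line\nThird line\n")
    demo_path = tmp.name

with open(demo_path, 'r', encoding='utf-8') as file:
    whole_text = file.read()        # one string containing the entire file

with open(demo_path, 'r', encoding='utf-8') as file:
    all_lines = file.readlines()    # list of lines, each still ending with '\n'

with open(demo_path, 'r', encoding='utf-8') as file:
    first_line = file.readline()    # only the first line
    second_line = file.readline()   # the next call continues where the previous one stopped

os.remove(demo_path)                # clean up the temporary file

print(repr(whole_text))
print(all_lines)
print(repr(first_line), repr(second_line))
```

`read()` and `readlines()` both load the whole file into memory, so for large files it is usually better to call `readline()` repeatedly or to iterate over the file object directly.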

In [8]:
# Example 2: Reading line by line (memory efficient for large files)
def read_large_file(filename):
    """
    Memory-efficient function to process large files line by line
    """
    with open(filename, 'r', encoding='utf-8') as file:
        for line in file:
            # Process each line
            yield line.strip()

In [12]:
large_file_lines = read_large_file('/content/speech.txt')
for line in large_file_lines:
    print(line)
    break

Księga pierwsza


EXERCISE 2:

Create a function that:
1. Reads a file
2. Counts the number of lines
3. Counts the total number of words
4. Returns both counts

In [None]:
# Hint: use word_tokenize to count the number of words

In [16]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [18]:
# YOUR CODE HERE

def file_statistics(filename):
    """
    Count the lines and words in a text file
    """
    number_of_lines = 0
    number_of_words = 0

    # Read lazily so large files are not loaded into memory at once
    for line in read_large_file(filename):
        number_of_lines += 1  # same as: number_of_lines = number_of_lines + 1
        # Note: word_tokenize also returns punctuation marks as tokens
        number_of_words += len(word_tokenize(line))

    return number_of_lines, number_of_words

file_statistics('/content/speech.txt')

(114, 888)

# Text Cleaning and Normalization

In [19]:
def clean_text(text):
    """
    Basic text cleaning function

    Parameters:
    text (str): Input text

    Returns:
    str: Cleaned text
    """
    # Convert to lowercase
    text = text.lower()

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Remove extra whitespace
    text = ' '.join(text.split())

    return text

In [20]:
# Example 3: Working with real political text
sample_text = """
The United Nations (UN) was established in 1945, after World War II.
Its primary purpose is to maintain international peace & security.
The UN has 193 Member States as of 2024!
"""

print("Original text:")
print(sample_text)
print("\nCleaned text:")
print(clean_text(sample_text))

Original text:

The United Nations (UN) was established in 1945, after World War II. 
Its primary purpose is to maintain international peace & security.
The UN has 193 Member States as of 2024!


Cleaned text:
the united nations un was established in after world war ii its primary purpose is to maintain international peace security the un has member states as of


EXERCISE 3:

Enhance the `clean_text` function to:
1. Remove specific words (like 'the', 'and', 'or')
2. Handle special characters
3. Remove specific patterns (like dates, URLs)


Write test cases for your enhanced function.

In [26]:
# YOUR CODE HERE

def clean_text_extended(text):
    """
    Extended text cleaning function: also removes URLs,
    selected words, and special characters

    Parameters:
    text (str): Input text

    Returns:
    str: Cleaned text
    """
    # Convert to lowercase
    text = text.lower()

    # Remove URLs (must run before punctuation removal, which would break them up)
    url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    text = url_pattern.sub('', text)

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Remove specific words; \b matches whole words only, so 'or' is not
    # stripped out of 'world' the way str.replace would do
    words_to_remove = ['the', 'and', 'or']
    for word in words_to_remove:
        text = re.sub(r'\b' + re.escape(word) + r'\b', '', text)

    # Handle special characters ('&' is already covered by string.punctuation;
    # add any other symbols you want to drop to this list)
    special_characters = ['&']
    for char in special_characters:
        text = text.replace(char, '')

    # Remove extra whitespace
    text = ' '.join(text.split())

    return text

In [27]:
test1 = 'THIS is a class of the the the or and & http://www.google.com python.'
clean_text_extended(test1)

'this is a class of python'
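
Exercise 3 also asks for removing dates and for test cases. The following is a minimal sketch of one way to do that. The helper name `remove_dates` and the two date patterns are assumptions chosen for illustration; real documents may use other date formats (for example, dates with month names) that these patterns do not cover.

```python
import re

# Hypothetical patterns, chosen for illustration only.
DATE_PATTERNS = [
    re.compile(r'\b\d{4}-\d{2}-\d{2}\b'),              # ISO style: 2024-11-14
    re.compile(r'\b\d{1,2}[/.]\d{1,2}[/.]\d{2,4}\b'),  # 14/11/2024 or 14.11.2024
]

def remove_dates(text):
    """Remove numeric dates and collapse the whitespace left behind."""
    for pattern in DATE_PATTERNS:
        text = pattern.sub('', text)
    return ' '.join(text.split())

# Simple test cases written as assertions
assert remove_dates("Signed on 2024-11-14 in Warsaw") == "Signed on in Warsaw"
assert remove_dates("Amended 3/12/2024 and 14.11.24") == "Amended and"
assert remove_dates("No dates here") == "No dates here"

print(remove_dates("The bill passed on 2024-11-14 and was amended on 3/12/2024."))
```

Date removal of this kind has to run before punctuation and digits are stripped, because the patterns rely on the separators and digits still being present.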

# Parsing and Information Extraction

In [None]:
def extract_entities(text):
    """
    Compute basic counts for a text: sentences, tokens,
    non-stopword tokens, and unique non-stopword tokens
    """
    # Tokenize into sentences
    sentences = sent_tokenize(text)

    # Tokenize into words
    words = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word.lower() not in stop_words]

    return {
        'sentence_count': len(sentences),
        'word_count': len(words),
        'filtered_word_count': len(filtered_words),
        'unique_words': len(set(filtered_words))
    }


In [None]:
# Example 4: Analyzing political text
policy_text = """
The Green New Deal is a proposed package of United States legislation
that aims to address climate change and economic inequality.
The proposal calls for meeting 100% of the power demand in the United States
through clean, renewable, and zero-emission energy sources.
"""

analysis = extract_entities(policy_text)
print("\nText Analysis:")
for key, value in analysis.items():
    print(f"{key}: {value}")

EXERCISE 4:

Create a function that:
1. Takes a political text as input
2. Identifies and counts key policy-related terms
3. Finds the most frequent phrases (2-3 words)
4. Returns a summary of the findings

In [None]:
# YOUR CODE HERE
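
One possible starting point for Exercise 4 is sketched below. It is not the only solution. To keep the example runnable without any downloads it uses a simple regex tokenizer instead of `word_tokenize`, and the names `POLICY_TERMS`, `tokenize_simple`, `ngrams`, and `summarize_policy_text` as well as the term list itself are assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical term list; replace it with terms relevant to your research question.
POLICY_TERMS = ['climate', 'energy', 'legislation', 'inequality']

def tokenize_simple(text):
    """Lowercase the text and return runs of ASCII letters (a rough stand-in for word_tokenize)."""
    return re.findall(r'[a-z]+', text.lower())

def ngrams(tokens, n):
    """Return all phrases made of n consecutive tokens."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def summarize_policy_text(text, terms=None, top_n=3):
    """Count policy terms and find the most frequent 2- and 3-word phrases."""
    terms = POLICY_TERMS if terms is None else terms
    tokens = tokenize_simple(text)
    token_counts = Counter(tokens)
    return {
        'term_counts': {term: token_counts[term.lower()] for term in terms},
        'top_bigrams': Counter(ngrams(tokens, 2)).most_common(top_n),
        'top_trigrams': Counter(ngrams(tokens, 3)).most_common(top_n),
    }

demo_text = (
    "Clean energy policy matters. Clean energy policy creates jobs. "
    "Climate policy needs clean energy."
)
summary = summarize_policy_text(demo_text)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Because the tokenizer ignores sentence boundaries, some phrases will span two sentences. Running `ngrams` per sentence would avoid that at the cost of a little more code.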

FINAL PROJECT IDEAS:

1. Policy Document Analyzer
   - Read a policy document
   - Clean and normalize the text
   - Extract key points
   - Generate a summary
   - Count specific policy-related terms

2. Speech Comparison Tool
   - Read two political speeches
   - Compare vocabulary usage
   - Analyze sentiment
   - Find common themes
   - Visualize differences

3. Legislative Text Processor
   - Extract sections and subsections
   - Find definitions
   - Track amendments
   - Create a searchable index

In [None]:
# Additional Helper Functions

def count_word_frequencies(text):
    """
    Count word frequencies in cleaned text
    """
    words = word_tokenize(clean_text(text))
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word not in stop_words]
    return Counter(filtered_words)

def find_keywords(text, keywords):
    """
    Find instances of specific keywords in text
    """
    cleaned_text = clean_text(text)
    found_keywords = {}
    for keyword in keywords:
        # clean_text lowercases the text, so the keyword must be lowercased
        # too; otherwise 'UN' would never match
        pattern = r'\b' + re.escape(keyword.lower()) + r'\b'
        found_keywords[keyword] = len(re.findall(pattern, cleaned_text))
    return found_keywords


EXERCISE 5:

Final Integration Exercise:

Create a complete text analysis pipeline that:
1. Reads a political document
2. Cleans and normalizes the text
3. Extracts key information
4. Generates a structured report

Use all the concepts we've covered in this module!
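
The four steps of Exercise 5 can be sketched as a single function. This is a bare skeleton to build on, not a complete solution: it relies only on the standard library, and the function name `analyze_document`, the tiny `STOP_WORDS` set, and the demo file contents are all assumptions made up for illustration. In your own pipeline you would likely plug in the NLTK stopword list and the cleaning functions defined earlier.

```python
import string
import tempfile
from collections import Counter
from pathlib import Path

# Small illustrative stopword set; swap in a full list for real work.
STOP_WORDS = {'the', 'a', 'an', 'of', 'and', 'or', 'to', 'in', 'is', 'for'}

def analyze_document(path, keywords, top_n=5):
    """Read, clean, and summarize one text file; return a report dictionary."""
    # 1. Read the document
    raw = Path(path).read_text(encoding='utf-8')

    # 2. Clean and normalize: lowercase, drop ASCII punctuation, collapse whitespace
    cleaned = raw.lower().translate(str.maketrans('', '', string.punctuation))
    all_words = cleaned.split()

    # 3. Extract key information
    content_words = [word for word in all_words if word not in STOP_WORDS]
    frequencies = Counter(content_words)

    # 4. Build a structured report
    return {
        'file': Path(path).name,
        'line_count': len(raw.splitlines()),
        'word_count': len(all_words),
        'top_words': frequencies.most_common(top_n),
        'keyword_counts': {keyword: frequencies[keyword.lower()] for keyword in keywords},
    }

# Demo on a temporary file so the example runs anywhere
with tempfile.TemporaryDirectory() as folder:
    demo_file = Path(folder) / 'demo_speech.txt'
    demo_file.write_text(
        "The UN promotes peace.\nPeace and security are goals of the UN.\n",
        encoding='utf-8',
    )
    report = analyze_document(demo_file, keywords=['UN', 'peace', 'security'])

for key, value in report.items():
    print(f"{key}: {value}")
```

Returning a dictionary keeps the report easy to print, save as JSON, or collect into a table when several documents are analyzed.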

# 💡 TIPS & BEST PRACTICES 💡

1. File Handling Tips:
   - ALWAYS use 'with' statements when working with files (auto-closes files)
   - ALWAYS specify encoding (usually 'utf-8') to handle special characters
   - For large files, use generators and process line by line
   - Keep backup copies of original files before modifying
   - Use meaningful file names and organize by project/date

2. Error Handling Tips:
   - Wrap file operations in try-except blocks
   - Check if file exists before operations
   - Validate file formats and encodings
   - Log errors for debugging

3. Text Cleaning Tips:
   - Clean text in stages, saving intermediate results
   - Create custom cleaning functions for specific needs
   - Document all cleaning steps for reproducibility
   - Keep original text separate from cleaned versions
   - Consider domain-specific cleaning needs

4. Performance Tips:
   - Use sets for fast lookups
   - Compile regex patterns if used multiple times
   - Use list comprehensions instead of loops where possible
   - Process large files in chunks
   - Use appropriate data structures (e.g., Counter for frequencies)

5. Political Science-Specific Tips:
   - Preserve proper nouns and acronyms
   - Handle special cases like bill numbers
   - Consider geographical references
   - Maintain chronological markers
   - Pay attention to political terminology

6. Common Pitfalls to Avoid:
   - Don't modify original files directly
   - Don't assume all files are in English
   - Don't remove all numbers (might be important)
   - Don't forget to handle edge cases
   - Don't clean text more than necessary
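
Several of the tips above (checking that a file exists, wrapping file operations in try-except, logging errors, compiling regex patterns once, and using sets for lookups) can be combined in a short sketch. The names `read_text_safely`, `content_words`, `WORD_PATTERN`, and `STOP_WORDS` are hypothetical and chosen for illustration; treat this as one possible arrangement, not a required structure.

```python
import logging
import re
from pathlib import Path

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

WORD_PATTERN = re.compile(r'\w+')              # compiled once, reused on every call
STOP_WORDS = frozenset({'the', 'and', 'or'})   # set membership tests are fast

def read_text_safely(filename, encoding='utf-8'):
    """Return the file contents, or None if the file is missing or unreadable."""
    path = Path(filename)
    if not path.is_file():
        logger.error("File not found: %s", path)
        return None
    try:
        return path.read_text(encoding=encoding)
    except UnicodeDecodeError:
        logger.error("Could not decode %s as %s", path, encoding)
        return None
    except OSError as error:
        logger.error("Could not read %s: %s", path, error)
        return None

def content_words(text):
    """Lowercase the text and return its words, skipping stopwords."""
    return [word for word in WORD_PATTERN.findall(text.lower()) if word not in STOP_WORDS]

print(read_text_safely('no_such_file_for_demo.txt'))   # logs an error and prints None
print(content_words("The UN and the EU"))
```

Returning `None` on failure is one design choice. Letting the exception propagate is equally reasonable when the calling code should stop as soon as a file cannot be read.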