<a href="https://colab.research.google.com/github/emiliawisnios/Social-and-Public-Policy-python/blob/main/Notebooks/Social_and_Public_Policy_Coding_Python_31_10_24.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In today's class we will focus on the fundamentals of text processing and natural language processing (NLP).

Next time we will work on data scraping.

In [None]:
import nltk
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
import matplotlib.pyplot as plt

# Download required NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that recent NLTK releases (3.9+) look for
nltk.download('stopwords')
nltk.download('wordnet')

In [None]:
speeches = {
    'kennedy': """Ask not what your country can do for you – ask what you can do for your country.
                 Let both sides seek to invoke the wonders of science instead of its terrors.""",
    'mlk': """I have a dream that one day this nation will rise up and live out the true meaning of its creed:
              We hold these truths to be self-evident, that all men are created equal."""
}

# Text preprocessing

In [None]:
def display_steps(text, step_name):
    """Helper function to display results of each preprocessing step"""
    print(f"\n{step_name}:")
    print(text[:150], "..." if len(text) > 150 else "")
    print("-" * 30)

## Tokenization

Tokenization splits text into smaller units called tokens: words and punctuation marks (word tokenization) or whole sentences (sentence tokenization).

In [None]:
# Word Tokenization
def tokenize_text(text):
    return word_tokenize(text)

# Sentence Tokenization
def sentence_tokenize(text):
    return sent_tokenize(text)

# Example with Kennedy's speech
kennedy_speech = speeches['kennedy']
print("\nOriginal text:")
print(kennedy_speech)

word_tokens = tokenize_text(kennedy_speech)
print("\nWord tokens:")
print(word_tokens)

sent_tokens = sentence_tokenize(kennedy_speech)
print("\nSentence tokens:")
print(sent_tokens)
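
NLTK's tokenizers handle many edge cases (abbreviations, contractions, quotes). To show only the basic idea, here is a minimal regex-based sketch. `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical helpers written for this illustration, not what NLTK does internally.

```python
import re

def simple_word_tokenize(text):
    # A token is either a word (optionally with internal hyphens/apostrophes)
    # or a single punctuation character.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split after ., ! or ? when followed by whitespace.
    # This naive rule breaks on abbreviations such as "Dr. King".
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample = "Ask not what your country can do for you. Ask what you can do for your country."
print(simple_word_tokenize(sample))
print(simple_sent_tokenize(sample))
```

Comparing this output with `word_tokenize` and `sent_tokenize` on the same text is a quick way to see where the simple rules fall short.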

## Stemming

Stemming strips suffixes using fixed rules to reduce related words to a common stem. The stem is not always a real word: the Porter stemmer turns "country" into "countri".

In [None]:
stemmer = PorterStemmer()
def stem_text(tokens):
    return [stemmer.stem(word) for word in tokens]

stemmed_words = stem_text(word_tokens)
print("\nOriginal vs Stemmed words:")
for orig, stemmed in zip(word_tokens[:10], stemmed_words[:10]):
    print(f"{orig:15} -> {stemmed}")
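
To see why rule-based suffix stripping can produce non-words, here is a deliberately crude sketch. `toy_stem` is a hypothetical helper with four suffix rules; it is not the Porter algorithm, which applies a much longer, ordered rule set.

```python
def toy_stem(word):
    """Strip one common English suffix. Illustration only, not Porter."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if at least three characters of stem remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["wonders", "terrors", "created", "truths", "rise"]:
    print(f"{w:10} -> {toy_stem(w)}")
```

The rule that removes "ed" leaves "creat" for "created": the stemmer has no dictionary, so it cannot know that the base form ends in "e".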

## Lemmatization

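Lemmatization maps words to their dictionary form (lemma) by looking them up in a vocabulary, so the result is a real word. Note that NLTK's `WordNetLemmatizer` treats every word as a noun unless a part-of-speech tag is passed via `pos`, and it returns the word unchanged when it finds no lemma.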

In [None]:
lemmatizer = WordNetLemmatizer()
def lemmatize_text(tokens):
    return [lemmatizer.lemmatize(word) for word in tokens]

lemmatized_words = lemmatize_text(word_tokens)
print("\nOriginal vs Lemmatized words:")
for orig, lemma in zip(word_tokens[:10], lemmatized_words[:10]):
    print(f"{orig:15} -> {lemma}")
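
The role of the part-of-speech tag can be sketched with a tiny lookup table. `TOY_LEXICON` and `toy_lemmatize` are hypothetical names made up for this example; WordNet is far larger and also applies morphological rules.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech)
TOY_LEXICON = {
    ("truths", "n"): "truth",
    ("men", "n"): "man",
    ("created", "v"): "create",
    ("are", "v"): "be",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEXICON.get((word.lower(), pos), word)

for word in ["truths", "men", "created", "are"]:
    print(f"{word:10} noun: {toy_lemmatize(word):10} verb: {toy_lemmatize(word, pos='v')}")
```

With the default noun tag, "created" and "are" come back unchanged. The same happens with the real lemmatizer, where the equivalent call is `lemmatizer.lemmatize(word, pos='v')`.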

## Stop word removal

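Stop word removal drops very common words (such as "the", "of", "to") that carry little topical meaning on their own. NLTK's English list contains only words, so punctuation tokens remain after filtering.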

In [None]:
stop_words = set(stopwords.words('english'))
def remove_stopwords(tokens):
    return [word for word in tokens if word.lower() not in stop_words]

filtered_words = remove_stopwords(word_tokens)
print("\nOriginal tokens:", word_tokens)
print("\nAfter stopword removal:", filtered_words)
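
Since punctuation survives the filter above, a common next step is to drop tokens that contain no letters or digits. The sketch below assumes a small hand-picked stop list (`TOY_STOP_WORDS`, hypothetical) so that it runs without NLTK data; in the notebook you would pass the NLTK `stop_words` set instead.

```python
# Hypothetical mini stop list for illustration; NLTK's English list is much longer
TOY_STOP_WORDS = {"what", "your", "can", "do", "for", "you", "not", "the", "of", "to"}

def remove_stopwords_and_punct(tokens, stop_words=TOY_STOP_WORDS):
    kept = []
    for tok in tokens:
        if tok.lower() in stop_words:
            continue  # common function word
        if not any(ch.isalnum() for ch in tok):
            continue  # pure punctuation such as "." or an en dash
        kept.append(tok)
    return kept

tokens = ["Ask", "not", "what", "your", "country", "can", "do", "for", "you", "–", "ask", "."]
print(remove_stopwords_and_punct(tokens))
```

Whether to remove a word like "not" depends on the task: for topic analysis it adds little, but for sentiment analysis it can reverse the meaning of a sentence.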

# Feature extraction

## Bag of words

A bag-of-words (BoW) model represents each document as a vector of word counts over a shared vocabulary, ignoring word order.

In [None]:
# Initialize CountVectorizer
count_vectorizer = CountVectorizer()

# Create corpus from both speeches
corpus = list(speeches.values())

# Generate BoW representation
bow_matrix = count_vectorizer.fit_transform(corpus)

# Create DataFrame for better visualization
bow_df = pd.DataFrame(
    bow_matrix.toarray(),
    columns=count_vectorizer.get_feature_names_out(),
    index=['Kennedy Speech', 'MLK Speech']
)

print("\nBag of Words representation:")
print(bow_df)
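
To make the counting explicit, here is a sketch that builds the same kind of matrix by hand with `collections.Counter`. The token pattern mirrors `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters); the helper `bag_of_words` and the two sample documents are assumptions for this illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercased tokens of two or more word characters."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

docs = {
    "doc_a": "Ask what you can do for your country",
    "doc_b": "I have a dream that one day this nation will rise up",
}

counts = {name: bag_of_words(text) for name, text in docs.items()}
# Shared vocabulary: every term seen in any document, in sorted order
vocabulary = sorted(set().union(*counts.values()))
# One row per document, one column per vocabulary term
matrix = {name: [c[term] for term in vocabulary] for name, c in counts.items()}

print(vocabulary)
for name, row in matrix.items():
    print(name, row)
```

Single-character tokens such as "I" and "a" never enter the vocabulary, which is also why they are missing from the `CountVectorizer` columns.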

## TF-IDF (Term Frequency-Inverse Document Frequency)

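TF-IDF weights each word by how often it occurs in a document (term frequency), scaled down for words that appear in many documents (inverse document frequency). Words shared by all documents therefore get lower weights than words specific to one document.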

In [None]:
# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Generate TF-IDF representation
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Create DataFrame for better visualization
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=tfidf_vectorizer.get_feature_names_out(),
    index=['Kennedy Speech', 'MLK Speech']
)

print("\nTF-IDF representation:")
print(tfidf_df)
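
The weighting can be sketched in plain Python. The version below follows scikit-learn's default settings as I understand them (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each document vector); the toy documents and the helper `tfidf` are assumptions for this example.

```python
import math
from collections import Counter

# Toy pre-tokenized documents (hypothetical)
docs = [
    ["ask", "country", "country", "science"],
    ["dream", "nation", "country"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    # Term count times smoothed inverse document frequency
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    # L2-normalise so every document vector has length 1
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

for doc in docs:
    print({t: round(w, 3) for t, w in tfidf(doc).items()})
```

In the second document "country" appears in both texts, so its weight is lower than that of "dream" or "nation", even though all three occur once.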

# Word embeddings

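Word embeddings are dense vector representations of words, learned so that words used in similar contexts get similar vectors. With only two short excerpts as training data, the similarities computed below are essentially noise; meaningful embeddings need a much larger corpus.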

In [None]:
# Prepare data for Word2Vec
tokenized_corpus = [word_tokenize(speech.lower()) for speech in corpus]

# Train Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1)

# Function to find similar words
def find_similar_words(word, model):
    try:
        similar_words = model.wv.most_similar(word)
        print(f"\nWords most similar to '{word}':")
        for w, score in similar_words:
            print(f"{w}: {score:.4f}")
    except KeyError:
        print(f"'{word}' not in vocabulary")

# Example similar words
print("\nExploring word similarities:")
find_similar_words("country", word2vec_model)
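
`most_similar` ranks words by the cosine similarity between their vectors. A minimal sketch of that computation follows; the three-dimensional `toy_vectors` are invented for illustration and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, chosen by hand
toy_vectors = {
    "country": [0.9, 0.1, 0.3],
    "nation": [0.8, 0.2, 0.4],
    "science": [0.1, 0.9, 0.2],
}

def most_similar(word, vectors):
    """Rank all other words by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scores = [(w, cosine_similarity(target, v)) for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

for w, score in most_similar("country", toy_vectors):
    print(f"{w}: {score:.4f}")
```

Cosine similarity depends only on the angle between vectors, not their length, so a frequent word and a rare word can still be close neighbours.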

# Homework

1. Text Analysis Exercise:
   - Choose a political speech or document of your interest
   - Apply the preprocessing steps learned (tokenization, stemming, lemmatization)
   - Compare the results of different preprocessing techniques
   
2. Comparative Analysis Exercise:
   - Select two political texts from different periods or contexts
   - Create and compare their BoW and TF-IDF representations
   - What insights can you draw about their language and themes?
   
3. Word Embedding Exercise:
   - Using the Word2Vec model, explore relationships between political concepts
   - Find similar words for terms like 'democracy', 'freedom', 'justice'
   - What patterns do you observe in the semantic relationships?