<a href="https://colab.research.google.com/github/dustoff06/EcoMod/blob/main/NLP_Example_Set.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing

This lecture covers advanced text processing techniques in NLP: text preparation, classification, summarization, topic modeling, similarity models, clustering, semantic analysis, sentiment analysis, and transformer models such as BERT.

# Common Text Preparation Steps

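Strip HTML tags.

Convert emoticons and emojis to text.

Lowercasing: converting all characters to lowercase.

Stop word removal: dropping very common words (e.g. "the", "of") that carry little meaning on their own.

Strip punctuation.

Convert numbers to text.

Stemming/Lemmatization: reducing words to their root form.

Tokenization: splitting text into words or tokens, which are then typically mapped to integer IDs (e.g. 1 = cat, 2 = dog, ...).

Padding: extending shorter sequences so that all sequences have the same length.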
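Several of the steps above can be sketched with the standard library alone. The following is a minimal illustration, not the pipeline used later in this lecture: the stop-word list, the function names (`prepare`, `build_vocab`, `encode_and_pad`), and the sample sentences are all assumptions made for the example. Emoji conversion, number-to-text conversion, and stemming/lemmatization are left out because they normally rely on third-party libraries.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; NLTK's list is much longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "in", "on"}

def prepare(text):
    """Strip HTML tags, lowercase, remove punctuation and stop words."""
    text = re.sub(r"<[^>]+>", " ", text)  # strip HTML tags
    text = text.lower()                   # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def build_vocab(token_lists):
    """Map each distinct token to an integer ID; 0 is reserved for padding."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab

def encode_and_pad(token_lists, vocab, pad_id=0):
    """Replace tokens with IDs and right-pad every sequence to the same length."""
    encoded = [[vocab[tok] for tok in tokens] for tokens in token_lists]
    max_len = max(len(seq) for seq in encoded)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in encoded]

docs = ["<p>The cat sat on the mat.</p>", "A dog barked!"]
token_lists = [prepare(d) for d in docs]
vocab = build_vocab(token_lists)
padded = encode_and_pad(token_lists, vocab)

print(token_lists)  # [['cat', 'sat', 'mat'], ['dog', 'barked']]
print(padded)       # [[1, 2, 3], [4, 5, 0]]
```

The second sequence is one token shorter than the first, so it receives a single padding ID (0) at the end.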

# Libraries (Subset)


In [65]:
# Import necessary libraries
import gutenbergpy.textget
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
import spacy
from textblob import TextBlob
from transformers import pipeline

# Download necessary resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Example book
book = gutenbergpy.textget.get_text_by_id(1001)
text = gutenbergpy.textget.strip_headers(book).decode('utf-8')
text

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


'\n\n\n\nThe Divine Comedy\n\nof Dante Alighieri\n\nTranslated by\nHENRY WADSWORTH LONGFELLOW\nINFERNO\n\n\nContents\n\nCanto I. The Dark Forest. The Hill of Difficulty. The Panther, the Lion, and the Wolf. Virgil.\nCanto II. The Descent. Dante’s Protest and Virgil’s Appeal. The Intercession of the Three Ladies Benedight.\nCanto III. The Gate of Hell. The Inefficient or Indifferent. Pope Celestine V. The Shores of Acheron. Charon. The Earthquake and the Swoon.\nCanto IV. The First Circle, Limbo: Virtuous Pagans and the Unbaptized. The Four Poets, Homer, Horace, Ovid, and Lucan. The Noble Castle of Philosophy.\nCanto V. The Second Circle: The Wanton. Minos. The Infernal Hurricane. Francesca da Rimini.\nCanto VI. The Third Circle: The Gluttonous. Cerberus. The Eternal Rain. Ciacco. Florence.\nCanto VII. The Fourth Circle: The Avaricious and the Prodigal. Plutus. Fortune and her Wheel. The Fifth Circle: The Irascible and the Sullen. Styx.\nCanto VIII. Phlegyas. Philippo Argenti. The Gate 

# Text Preparation

In [66]:
# 1. Text Preparation
# Lowercase and tokenize, drop English stop words, then lemmatize.
# Punctuation tokens are not removed here, so they still appear in the output.
tokens = word_tokenize(text.lower())
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

print("Text Preparation:")
print(lemmatized_tokens)

Text Preparation:
['divine', 'comedy', 'dante', 'alighieri', 'translated', 'henry', 'wadsworth', 'longfellow', 'inferno', 'content', 'canto', 'i.', 'dark', 'forest', '.', 'hill', 'difficulty', '.', 'panther', ',', 'lion', ',', 'wolf', '.', 'virgil', '.', 'canto', 'ii', '.', 'descent', '.', 'dante', '’', 'protest', 'virgil', '’', 'appeal', '.', 'intercession', 'three', 'lady', 'benedight', '.', 'canto', 'iii', '.', 'gate', 'hell', '.', 'inefficient', 'indifferent', '.', 'pope', 'celestine', 'v.', 'shore', 'acheron', '.', 'charon', '.', 'earthquake', 'swoon', '.', 'canto', 'iv', '.', 'first', 'circle', ',', 'limbo', ':', 'virtuous', 'pagan', 'unbaptized', '.', 'four', 'poet', ',', 'homer', ',', 'horace', ',', 'ovid', ',', 'lucan', '.', 'noble', 'castle', 'philosophy', '.', 'canto', 'v.', 'second', 'circle', ':', 'wanton', '.', 'minos', '.', 'infernal', 'hurricane', '.', 'francesca', 'da', 'rimini', '.', 'canto', 'vi', '.', 'third', 'circle', ':', 'gluttonous', '.', 'cerberus', '.', 'eter

# Classification

In [67]:
# 2. Text Classification
# Sample data from the first canto of Dante's Inferno
texts = [
    "Midway upon the journey of our life",
    "I found myself within a forest dark",
    "For the straightforward pathway had been lost",
    "Ah me! how hard a thing it is to say",
    "What was this forest savage, rough, and stern",
    "The thought of it renews my fear"
]
labels = ['narrative', 'narrative', 'narrative', 'reflection', 'description', 'emotion']

# Create a pipeline that vectorizes the text and then applies a classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Train the model
model.fit(texts, labels)

# Predict the category of a new text
new_text = "It is so bitter, death is hardly more."
predicted_label = model.predict([new_text])
print("\nText Classification:")
print(predicted_label)


Text Classification:
['narrative']


# Summarization

In [68]:
# 3. Text Summarization
# Load the summarization pipeline
summarizer = pipeline("summarization")

summary = summarizer(text[:500], max_length=130, min_length=30, do_sample=False)
print("\nText Summarization:")
print(summary[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



Text Summarization:
 The Divine Comedy is the work of Dante Alighieri . The work was published in Dante's "Divine Comedy" and "The Divine Comedy" The work has been translated by Richard Longfeathers and translated by Henry Wadsworth .


# Topic Modeling

## Discussion of Latent Dirichlet Allocation (LDA)

### Introduction to Topic Modeling

Topic modeling is a family of statistical methods for discovering the abstract topics that occur in a collection of documents. One of the most popular methods is Latent Dirichlet Allocation (LDA).

### What is LDA?

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. It posits that each document is a mixture of a small number of topics and that each word in the document is attributable to one of the document's topics. Fitting LDA yields a distribution over topics for every document, which can be used to associate the document with its dominant topics.

### How LDA Works

Assumptions:

- Documents are represented as random mixtures over latent topics.
- Each topic is characterized by a distribution over words.

Generative process:

For each document in the corpus:

1. Randomly choose a distribution over topics.
2. For each word in the document:
   - Randomly choose a topic from the document's distribution over topics.
   - Randomly choose a word from that topic's distribution over words.

The goal of LDA is to infer, from the observed words alone, the per-document topic distributions and the per-topic word distributions that best explain the corpus.
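The generative process described above can be simulated directly. The sketch below is illustrative only: the two topics and their word probabilities are invented for the example, and `sample_dirichlet` and `generate_document` are hypothetical helper names, not part of any library. It draws a document's topic mixture from a Dirichlet distribution, then draws a topic and a word for each position.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # this document's distribution over topics
    words = []
    for _ in range(length):
        # Choose a topic from the document's topic distribution.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Choose a word from that topic's distribution over words.
        topic = topics[z]
        words.append(rng.choices(list(topic), weights=list(topic.values()))[0])
    return theta, words

# Two hypothetical topics, each a distribution over words (made up for illustration).
topics = [
    {"forest": 0.5, "path": 0.3, "dark": 0.2},
    {"fear": 0.6, "death": 0.3, "bitter": 0.1},
]

rng = random.Random(42)
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
print("Topic mixture:", [round(p, 3) for p in theta])
print("Document:", " ".join(words))
```

Fitting LDA runs this story in reverse: only the words are observed, and the topic mixture and the topics themselves have to be inferred.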

### Steps in LDA

Initialize:

- Randomly assign each word in each document to one of the K topics.

Update:

- For each document, re-estimate its distribution over topics from the current word-topic assignments.
- For each topic, re-estimate its distribution over words from the words currently assigned to it.
- Reassign each word in each document to a topic based on these current distributions.

Convergence:

- Repeat the update step until the topic assignments stabilize.

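The initialize/update/repeat loop above corresponds closely to collapsed Gibbs sampling, one common way to fit LDA. The following is a minimal sketch under stated assumptions: the function name `lda_gibbs`, the hyperparameter values, and the toy corpus are all chosen for illustration, and the loop runs for a fixed number of iterations instead of testing for convergence. Note that scikit-learn's `LatentDirichletAllocation`, used later in this section, fits the model with variational Bayes instead, so this is not a description of what that class does internally.

```python
import random

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on pre-tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]                        # words in doc d assigned to topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]   # times word w is assigned to topic k
    topic_total = [0] * n_topics                                      # total words assigned to topic k
    assignments = []

    # Initialize: assign every word to a random topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    # Update: resample each word's topic given all other assignments.
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight of each topic: (document's affinity) * (topic's affinity for w).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# Toy corpus (invented for illustration).
docs = [
    "forest dark path forest lost".split(),
    "path forest dark dark".split(),
    "fear death bitter fear".split(),
    "death fear bitter death".split(),
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print(f"Topic {k}: {' '.join(top)}")
```

The returned counts play the role of the two distributions described earlier: each row of `doc_topic` approximates a document's topic mixture, and each entry of `topic_word` approximates a topic's distribution over words.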
### Applications of LDA

- Document Classification: classify documents based on their dominant topics.
- Recommendation Systems: recommend documents with similar topic mixtures.
- Information Retrieval: retrieve documents whose topics match a query.
- Sentiment Analysis: relate sentiment to the topics discussed in the text.
- Topic Discovery: discover the main themes in large text corpora.

### Advantages of LDA

- Scalability: handles large corpora efficiently.
- Interpretability: provides interpretable topics, each represented by a distribution over words.
- Unsupervised Learning: does not require labeled data.

### Limitations of LDA

- Bag-of-Words Assumption: ignores word order, which can be critical for understanding context.
- Number of Topics: requires the number of topics to be specified in advance.
- Parameter Sensitivity: results can be sensitive to the choice of hyperparameters.
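The bag-of-words limitation is easy to demonstrate. In this small sketch (the helper name `bag_of_words` and the two sentences are made up for illustration), two sentences with opposite meanings produce identical word counts, so a model that sees only counts cannot tell them apart.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count word occurrences, discarding word order."""
    return Counter(sentence.lower().split())

a = "the dog bites the man"
b = "the man bites the dog"

print(bag_of_words(a))
print(bag_of_words(a) == bag_of_words(b))  # True: word order is discarded
```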

In [69]:
# 4. Topic Modeling
# Split the full text into sentences; each sentence is treated as one document
documents = sent_tokenize(text)

# Drop very short sentences (two words or fewer)
documents = [doc for doc in documents if len(doc.split()) > 2]

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(X)

# Display the five highest-weighted words of each topic
feature_names = vectorizer.get_feature_names_out()
print("\nTopic Modeling:")
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx}:")
    print(" ".join(feature_names[i] for i in topic.argsort()[:-6:-1]))


Topic Modeling:
Topic 0:
thou unto said did thee
Topic 1:
said thou great did unto
Topic 2:
thou thee thy said art
Topic 3:
said saw er thou unto
Topic 4:
unto doth way like head


# Similarity Models

In [70]:
# 5. Similarity Models
# Split the full text into sentences; each sentence is treated as one document
documents = sent_tokenize(text)

# Drop very short sentences (two words or fewer)
documents = [doc for doc in documents if len(doc.split()) > 2]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(documents)

# Compute similarity matrix
similarity_matrix = cosine_similarity(tfidf_matrix)
print("\nSimilarity Models:")
print(similarity_matrix)



Similarity Models:
[[1.         0.         0.         ... 0.         0.         0.        ]
 [0.         1.         0.         ... 0.         0.         0.        ]
 [0.         0.         1.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.         0.         ... 0.         1.         0.04017027]
 [0.         0.         0.         ... 0.         0.04017027 1.        ]]
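To make the matrix above easier to read, here is a minimal sketch of cosine similarity computed by hand. It is an illustration under simplifying assumptions: it uses raw word counts where the cell above uses TF-IDF weights, and the helper name `cosine` and the three short word lists are invented for the example. Two texts that share no words score 0, which is why a matrix over short sentences is mostly zeros, and each sentence compared with itself gives the 1s on the diagonal.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

s1 = Counter("forest dark path".split())
s2 = Counter("dark forest savage".split())
s3 = Counter("fear death bitter".split())

print(round(cosine(s1, s2), 3))  # 0.667: two of three words are shared
print(cosine(s1, s3))            # 0.0: no words in common
```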


# Text Clustering

In [71]:
# 6. Text Clustering
# Split the full text into sentences; each sentence is treated as one document
documents = sent_tokenize(text)

# Drop very short sentences (two words or fewer)
documents = [doc for doc in documents if len(doc.split()) > 2]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)

# Print cluster assignments
print("\nText Clustering:")
print(kmeans.labels_)


Text Clustering:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 1 1 0 1 1 0 0 1 1 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 1 1 1 0 1
 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 1 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0
 0 1 0 0 0 0 1 0 0 1 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1
 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 1 0 0 0 1 0
 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0
 0 1 1 



# Semantic Analysis

In [72]:
# 7. Semantic Analysis
# Load spaCy's small English model
nlp = spacy.load("en_core_web_sm")

# Process the full text of the book
doc = nlp(text)

# Extract named entities and their labels (named entity recognition)
print("\nSemantic Analysis:")
for ent in doc.ents:
    print(ent.text, ent.label_)



Semantic Analysis:
Dante Alighieri

 PERSON
The Hill of Difficulty ORG
Panther PERSON
Lion ORG
Wolf PERSON
Canto II PRODUCT
Dante’s Protest PERSON
Virgil’s Appeal PERSON
The Gate of Hell FAC
The Inefficient or Indifferent ORG
Celestine V. PERSON
Acheron ORG
Charon PERSON
Swoon FAC
Virtuous Pagans PERSON
Four CARDINAL
Horace PERSON
Ovid PERSON
Lucan NORP
The Noble Castle of Philosophy FAC
Canto V. PERSON
The Infernal Hurricane ORG
Francesca da Rimini PERSON
Canto VI PERSON
Cerberus PERSON
Canto VII PRODUCT
The Fourth Circle: The Avaricious LOC
Wheel PRODUCT
Fifth ORDINAL
Sullen GPE
Styx PERSON
Philippo Argenti PERSON
The Gate of the City of Dis ORG
Canto IX PERSON
Medusa PERSON
Angel PERSON
Dis GPE
Sixth ORDINAL
Canto X. Farinata PERSON
Cavalcante de’ Cavalcanti PERSON
Damned ORG
XI ORG
The Broken Rocks ORG
Pope Anastasius PERSON
Description of PERSON
Inferno LOC
XII ORG
The River Phlegethon LOC
Neighbours FAC
The Wood of Thorns ORG
Harpies LOC
Lano PERSON
Jacopo da Sant’ Andrea PERSON

# Sentiment Analysis

In [73]:
# 8. Sentiment Analysis
# Text from the first canto of Dante's Inferno
# Note: this rebinds `text`, replacing the full book loaded earlier
text = """
The thought of it renews my fear! It is so bitter, death is hardly more.
"""

blob = TextBlob(text)

# Get sentiment polarity and subjectivity
print("\nSentiment Analysis:")
print(f"Polarity: {blob.sentiment.polarity}, Subjectivity: {blob.sentiment.subjectivity}")


Sentiment Analysis:
Polarity: 0.2, Subjectivity: 0.5


# Transformers

In [74]:
# 9. BERT (Bidirectional Encoder Representations from Transformers)
# Load the default sentiment-analysis pipeline; as the log below shows,
# it resolves to a DistilBERT model fine-tuned on SST-2
classifier = pipeline('sentiment-analysis')



result = classifier(text)

print("\nBERT:")
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



BERT:
[{'label': 'NEGATIVE', 'score': 0.911802351474762}]
