# Combined NLP Course Notebook (Student Version)

This notebook combines a general introduction to NLP with a case study on sentiment analysis. It includes explanations, code demonstrations, and reflective questions. Some sections require you to complete the code as part of the learning exercises.

**Sources:**
* Based on `natural_language_processing.ipynb` and `sentiment_analysis_demo.ipynb`.

## 1. Introduction to Natural Language Processing (NLP)

**Definition:** 
NLP is the field that makes human language accessible to computers. It enables machines to read, interpret, and generate text, which is essential for applications such as intelligent search engines, machine translation, and dialogue systems.

**Business Application:** 
Consider how a customer service chatbot uses NLP to understand and respond to client inquiries in real time.

**Before the Demo Question:** 
- What are some business applications where understanding and generating human language could provide a competitive advantage?

In [None]:
# A simple demonstration of text processing: tokenizing a sentence into words.

sample_text_intro = "Welcome to the world of Natural Language Processing for business applications!"

# Tokenization: splitting the text into words
tokens_intro = sample_text_intro.split()

print("Original Text:")
print(sample_text_intro)
print("\nTokenized Words:")
print(tokens_intro)

**Reflection:** 
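This demo shows how we can break a sentence into its individual words (tokens), a fundamental step in NLP. Note that `split()` separates on whitespace only, so punctuation stays attached to the neighbouring word (the last token is `applications!`). In real-world applications, tokenization is typically one of the first steps in tasks like search, sentiment analysis, and automated customer support.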
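
To make the punctuation point concrete, here is a minimal sketch that compares whitespace splitting with a simple regular-expression tokenizer. The regex pattern is an illustrative assumption, not the tokenizer used later in the notebook.

```python
import re

sample_text = "Welcome to the world of Natural Language Processing for business applications!"

# Whitespace splitting keeps punctuation attached to the neighbouring word.
whitespace_tokens = sample_text.split()

# A simple regex alternative: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample_text)

print(whitespace_tokens[-1])  # applications!
print(regex_tokens[-2:])      # ['applications', '!']
```

Library tokenizers such as NLTK's `word_tokenize` (used in Section 2) handle many more cases, for example contractions.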

**Discussion Questions:** 
- Why is tokenization important in processing natural language data? 
- Can you think of a scenario in your business where extracting key words from text might be useful?

## 2. Tokenization and Text Preprocessing

**Definition:** 
Text preprocessing involves cleaning and preparing text data for analysis. This includes steps like tokenization (breaking text into words or sentences), lowercasing, removing punctuation and stop words (common words like 'the', 'is', 'in'), and stemming/lemmatization (reducing words to their root form).

**Business Application:** 
Preprocessing ensures consistency in analyzing customer feedback forms, removing noise to focus on meaningful content.

**Before the Demo Question:** 
- Why is it important to clean text data before analyzing it? What kind of 'noise' might exist in raw text?

In [None]:
# Demo using NLTK for more advanced preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
import string

# Download the NLTK resources this notebook needs (only if they are missing).
# nltk.data.find raises LookupError when a resource is not installed.
# Note: newer NLTK releases load 'punkt_tab' for word_tokenize; older ones use 'punkt'.
nltk_resources = {
    'wordnet': 'corpora/wordnet',
    'stopwords': 'corpora/stopwords',
    'punkt': 'tokenizers/punkt',
    'punkt_tab': 'tokenizers/punkt_tab',
}
for name, path in nltk_resources.items():
    try:
        nltk.data.find(path)
    except LookupError:
        print(f"Downloading NLTK '{name}' data...")
        nltk.download(name)
        print("Download complete.")

sample_text_proc = "This is an example sentence, showing off the stop words filtration and stemming! It's quite amazing."

# 1. Tokenization
tokens = word_tokenize(sample_text_proc)

# 2. Lowercasing
tokens = [word.lower() for word in tokens]

# 3. Removing Punctuation
tokens = [word for word in tokens if word.isalnum()] # Keep only alphanumeric tokens

# 4. Removing Stop Words
stop_words = set(stopwords.words('english'))
tokens = [word for word in tokens if word not in stop_words]

# 5. Stemming (reducing words to root form)
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in tokens]

# 6. Lemmatization (reducing words to base/dictionary form)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]

print("Original Text:", sample_text_proc)
print("\nTokenized & Lowercased:", word_tokenize(sample_text_proc.lower()))
print("\nAfter Removing Punctuation & Stopwords:", tokens)
print("\nStemmed Tokens:", stemmed_tokens)
print("\nLemmatized Tokens:", lemmatized_tokens)

**Reflection:** 
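This demo illustrates several common preprocessing steps. Stemming is faster but can produce non-dictionary words (e.g., `amaz` for "amazing"), while lemmatization is slower but yields dictionary words. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless you pass a part-of-speech tag, so verb forms such as "showing" are left unchanged here. The choice depends on the specific NLP task and desired outcome.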

**Discussion Questions:** 
- When might you prefer stemming over lemmatization, or vice versa?
- How could removing stop words impact the analysis of certain types of text (e.g., legal documents vs. social media posts)?
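
As a starting point for the stop-word question, the sketch below shows how removing negations can change what a review appears to say. The tiny stop list is a hypothetical stand-in; NLTK's English list is much longer and also contains negations such as "not" and "no".

```python
# Hypothetical mini stop list for illustration only.
stop_words = {"the", "is", "this", "was", "a", "not", "no"}
# Negations we may want to keep because they can flip the sentiment.
negations = {"not", "no"}

def remove_stop_words(text, stops):
    """Lowercase, split on whitespace, and drop any word found in `stops`."""
    return [word for word in text.lower().split() if word not in stops]

review = "This product is not good"
print(remove_stop_words(review, stop_words))              # ['product', 'good']
print(remove_stop_words(review, stop_words - negations))  # ['product', 'not', 'good']
```

With the full stop list the review reads as positive; keeping the negations preserves the original meaning.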

## 3. Text Representation: Vectorization

**Definition:** 
Machine learning models operate on numbers, not raw text. Vectorization converts text data into numerical vectors. Common techniques include Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and word embeddings (like Word2Vec, GloVe).
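
The simplest of these, Bag-of-Words, can be sketched in a few lines of plain Python. The two example documents are made up for illustration.

```python
from collections import Counter

docs = ["the cat sat", "the cat sat on the mat"]

# Vocabulary: every distinct word across the documents, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [bow_vector(doc) for doc in docs]
print(vocabulary)  # ['cat', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Each document becomes a vector with one position per vocabulary word; word order is discarded.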

**Business Application:** 
Vectorization allows machine learning models to process text for tasks like document classification (e.g., sorting emails into folders) or sentiment analysis.

**Before the Demo Question:** 
- How can we represent the meaning or content of text using only numbers?

In [None]:
# Demo using TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer to the corpus and transform the data
tfidf_matrix = vectorizer.fit_transform(corpus)

# Get the feature names (words in the vocabulary)
feature_names = vectorizer.get_feature_names_out()

# Display the TF-IDF matrix as a DataFrame for better readability
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names, index=['Doc1', 'Doc2', 'Doc3', 'Doc4'])

print("Corpus:")
for i, doc in enumerate(corpus):
    print(f"Doc{i+1}: {doc}")

print("\nTF-IDF Matrix:")
print(tfidf_df)

**Reflection:** 
The TF-IDF matrix represents each document as a vector where each dimension corresponds to a word in the vocabulary. The values indicate the importance of each word in the document relative to the entire corpus. Words that are frequent in a specific document but rare across all documents get higher scores.
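
To see where these scores come from, here is a hand-rolled sketch of the calculation. It assumes the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, and the L2 normalization that scikit-learn's `TfidfVectorizer` applies by default; the whitespace tokenization is a simplification.

```python
import math
from collections import Counter

docs = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
df = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(tokens)
    # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {w: tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = tfidf(tokenized[1])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {weight:.3f}")
```

In the second document, "document" (used twice) and "second" (found in no other document) outrank words such as "this" that occur in every document.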

**Discussion Questions:** 
- What does a high TF-IDF score for a word in a document signify?
- How does TF-IDF differ from a simple Bag-of-Words count? What are the advantages of TF-IDF?

## 4. Case Study: Sentiment Analysis for Product Reviews

This section provides a practical, in-depth implementation of sentiment analysis, drawing heavily from the `sentiment_analysis_demo.ipynb` structure and content.

### 4.1 Business Context

Large online retailers such as Amazon receive far more customer reviews than anyone could read manually. Automatically analyzing the sentiment of these reviews allows them to:
- Identify products with quality issues
- Highlight highly-rated products
- Track customer satisfaction trends
- Feed data into recommendation systems
- Respond proactively to negative feedback

Understanding customer sentiment at scale is crucial for product development, marketing, and customer relationship management. By the end of this section, you'll understand how sentiment analysis works and how to implement a basic version.

### 4.2 Setup and Data Loading

First, let's import the necessary libraries and load a sample dataset of product reviews. For this demo, we'll create a small, illustrative dataset.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re # Regular expressions for cleaning

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC # Support Vector Classifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# NLTK for preprocessing (ensure downloads from section 2)
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Settings for plots
sns.set_style('darkgrid')
plt.style.use('seaborn-v0_8-talk') # Using a visually appealing style

# --- Sample Data Creation ---
# In a real scenario, you would load data from a CSV or database
data = {
    'review_text': [
        'This product is amazing! Highly recommend.',
        'Absolutely terrible quality. Broke after one use.',
        'It works okay, nothing special.',
        'Love it! Best purchase ever.',
        'Waste of money. Very disappointed.',
        'Decent product for the price.',
        'I am satisfied with this item.',
        'Poor customer service and faulty product.',
        'Excellent value, works perfectly.',
        'Not bad, but could be better.',
        'The product arrived damaged and unusable.',
        'Fantastic! Exceeded my expectations.',
        'Mediocre performance, would not buy again.',
        'Just what I needed, great quality.',
        'The instructions were unclear, making it hard to use.'
    ],
    'sentiment': [
        'positive',
        'negative',
        'neutral',
        'positive',
        'negative',
        'neutral',
        'positive',
        'negative',
        'positive',
        'neutral',
        'negative',
        'positive',
        'negative',
        'positive',
        'negative' # Subjective, but leaning negative due to usability issue
    ]
}
df_reviews = pd.DataFrame(data)

print("Sample Review Data:")
print(df_reviews.head())
print("\nData Shape:", df_reviews.shape)
print("\nSentiment Distribution:")
print(df_reviews['sentiment'].value_counts())

### 4.3 Data Exploration and Preprocessing

Before building the model, we need to clean the text data. This involves steps similar to those in Section 2, but tailored for sentiment analysis.

In [None]:
# --- Text Preprocessing Function ---

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # 1. Lowercase
    text = text.lower()
    # 2. Remove punctuation and numbers (keep spaces)
    text = re.sub(r'[^a-z\s]', '', text)
    # 3. Tokenize
    tokens = word_tokenize(text)
    # 4. Remove Stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # 5. Lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # 6. Join back into string
    return ' '.join(tokens)

# Apply preprocessing to the review text
df_reviews['processed_text'] = df_reviews['review_text'].apply(preprocess_text)

print("\nReviews after Preprocessing:")
print(df_reviews[['review_text', 'processed_text']].head())

### 4.4 Feature Extraction (Vectorization)

We'll use TF-IDF to convert the processed text into numerical features suitable for machine learning models.

In [None]:
# Define features (X) and target (y)
X = df_reviews['processed_text']
y = df_reviews['sentiment']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

# Initialize TF-IDF Vectorizer (within a pipeline later)
tfidf_vectorizer = TfidfVectorizer(max_features=1000) # Limit features for efficiency

# Fit and transform the training data (Example - will be done in pipeline)
# X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
# X_test_tfidf = tfidf_vectorizer.transform(X_test)
# print(f"\nTF-IDF Matrix Shape (Train): {X_train_tfidf.shape}")

### 4.5 Model Building and Training

We will train two common text classification models: Multinomial Naive Bayes and Logistic Regression. We use Scikit-learn's `Pipeline` to chain the vectorization and classification steps.

In [None]:
# --- Model 1: Multinomial Naive Bayes --- 
nb_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, ngram_range=(1, 2))), # Use uni- and bi-grams
    ('clf', MultinomialNB())
])

print("\nTraining Naive Bayes Model...")
nb_pipeline.fit(X_train, y_train)
print("Naive Bayes Training Complete.")

# --- Model 2: Logistic Regression --- 
lr_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, ngram_range=(1, 2))),
    ('clf', LogisticRegression(solver='liblinear', random_state=42)) # liblinear suits small datasets; multi_class is left at its default because the argument is deprecated in recent scikit-learn releases
])

print("\nTraining Logistic Regression Model...")
lr_pipeline.fit(X_train, y_train)
print("Logistic Regression Training Complete.")

### 4.6 Model Evaluation

Let's evaluate the performance of our trained models on the unseen test data.

In [None]:
# --- Evaluate Naive Bayes --- 
y_pred_nb = nb_pipeline.predict(X_test)
print("\n--- Naive Bayes Evaluation ---")
print("Accuracy:", accuracy_score(y_test, y_pred_nb))
print("\nClassification Report:\n", classification_report(y_test, y_pred_nb))
print("\nConfusion Matrix:")
cm_nb = confusion_matrix(y_test, y_pred_nb, labels=nb_pipeline.classes_)
sns.heatmap(cm_nb, annot=True, fmt='d', cmap='Blues', xticklabels=nb_pipeline.classes_, yticklabels=nb_pipeline.classes_)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Naive Bayes Confusion Matrix')
plt.show()

# --- Evaluate Logistic Regression --- 
y_pred_lr = lr_pipeline.predict(X_test)
print("\n--- Logistic Regression Evaluation ---")
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print("\nClassification Report:\n", classification_report(y_test, y_pred_lr))
print("\nConfusion Matrix:")
cm_lr = confusion_matrix(y_test, y_pred_lr, labels=lr_pipeline.classes_)
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Greens', xticklabels=lr_pipeline.classes_, yticklabels=lr_pipeline.classes_)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Logistic Regression Confusion Matrix')
plt.show()

**Reflection:**
The classification report provides precision, recall, and F1-score for each class (positive, negative, neutral). The confusion matrix shows how many instances of each true class were predicted as each possible class. Accuracy gives the overall percentage of correct predictions.

*Note: With a very small dataset like this example, performance metrics might be volatile and not fully representative. A larger dataset is needed for robust evaluation.*
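
The metrics in the classification report can be derived directly from a confusion matrix. The sketch below uses a made-up 3x3 matrix (rows are true labels, columns are predicted labels, the same orientation as `confusion_matrix`); the numbers are not results from the models above.

```python
# Hypothetical confusion matrix: rows are true labels, columns are predicted labels.
labels = ["negative", "neutral", "positive"]
cm = [
    [5, 1, 0],  # true label: negative
    [1, 2, 1],  # true label: neutral
    [0, 1, 5],  # true label: positive
]

def per_class_metrics(cm, index):
    """Precision, recall, and F1 for the class at position `index`."""
    tp = cm[index][index]
    predicted = sum(row[index] for row in cm)  # column total
    actual = sum(cm[index])                    # row total
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

for i, label in enumerate(labels):
    p, r, f = per_class_metrics(cm, i)
    print(f"{label:9s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")

# Accuracy: correct predictions (the diagonal) divided by all predictions.
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
print(f"accuracy={accuracy:.2f}")  # accuracy=0.75
```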

### 4.7 Applying the Model to New Reviews

Let's use the better-performing model (based on evaluation, though results may vary with this small dataset) to predict the sentiment of new, unseen reviews.

In [None]:
# Assuming Logistic Regression performed slightly better or is preferred
chosen_model = lr_pipeline 

new_reviews = [
    "This is the best product I have ever bought! So happy!",
    "Completely useless, broke within a week.",
    "It's alright, does the job but nothing fancy.",
    "Terrible experience, would not recommend to anyone."
]

# Preprocess the new reviews exactly as the training data was preprocessed
processed_new_reviews = [preprocess_text(review) for review in new_reviews]

# Predict sentiment on the preprocessed text
# (the pipeline only vectorizes and classifies; it does not call preprocess_text)
predicted_sentiments = chosen_model.predict(processed_new_reviews)
predicted_probabilities = chosen_model.predict_proba(processed_new_reviews)

print("\n--- Predictions on New Reviews ---")
for review, sentiment, probs in zip(new_reviews, predicted_sentiments, predicted_probabilities):
    print(f"Review: {review}")
    print(f"Predicted Sentiment: {sentiment}")
    # Show probabilities for each class
    prob_dict = {label: f"{prob:.2%}" for label, prob in zip(chosen_model.classes_, probs)}
    print(f"Probabilities: {prob_dict}\n")
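
The pipelines in this section contain only a vectorizer and a classifier, so any text cleaning has to happen before `fit` and `predict` are called. One option, sketched here under assumptions, is to pass a cleaning function to the vectorizer's `preprocessor` argument so that the pipeline accepts raw reviews. `simple_clean` is a hypothetical, NLTK-free stand-in for `preprocess_text`, and the four training reviews are a toy subset.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def simple_clean(text):
    """Hypothetical stand-in for preprocess_text: lowercase, keep letters and spaces."""
    return re.sub(r"[^a-z\s]", "", text.lower())

raw_pipeline = Pipeline([
    # `preprocessor` replaces the vectorizer's built-in lowercasing step,
    # so the same cleaning runs at fit time and at predict time.
    ("tfidf", TfidfVectorizer(preprocessor=simple_clean)),
    ("clf", LogisticRegression(max_iter=1000)),
])

train_texts = [
    "Love it! Best purchase ever.",
    "Waste of money. Very disappointed.",
    "Excellent value, works perfectly.",
    "Absolutely terrible quality.",
]
train_labels = ["positive", "negative", "positive", "negative"]

raw_pipeline.fit(train_texts, train_labels)
print(raw_pipeline.predict(["Terrible, a waste of money!"]))
```

Keeping the cleaning inside the pipeline makes it harder to forget the step when the model is applied to new data.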

### 4.8 Business Application: Extracting Insights

How can a business use these predictions?

1.  **Track Sentiment Trends:** Aggregate sentiment scores over time (e.g., daily, weekly) to monitor overall customer satisfaction for a product or service.
2.  **Identify Urgent Issues:** Filter for highly negative reviews. Analyze the text of these reviews (using techniques like topic modeling or keyword extraction) to pinpoint specific problems (e.g., 'shipping damage', 'poor battery life', 'confusing instructions').
3.  **Highlight Positive Feedback:** Identify highly positive reviews. Use excerpts in marketing materials or testimonials (with permission). Analyze positive themes to understand what customers value most.
4.  **Prioritize Product Improvements:** Correlate sentiment with specific product features mentioned in reviews (requires more advanced Aspect-Based Sentiment Analysis) to guide development efforts.
5.  **Improve Customer Support:** Route negative reviews to support teams for follow-up.
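
As an illustration of the first point, the sketch below aggregates predicted sentiments by ISO week and reports the share of negative reviews. The dates and labels are invented; in practice they would come from review timestamps and model predictions.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (review date, predicted sentiment) pairs.
predictions = [
    (date(2024, 1, 1), "positive"), (date(2024, 1, 2), "negative"),
    (date(2024, 1, 3), "positive"), (date(2024, 1, 8), "negative"),
    (date(2024, 1, 9), "negative"), (date(2024, 1, 10), "neutral"),
]

weekly = defaultdict(lambda: {"total": 0, "negative": 0})
for day, sentiment in predictions:
    week = day.isocalendar()[1]  # ISO week number
    weekly[week]["total"] += 1
    weekly[week]["negative"] += sentiment == "negative"

# Share of negative reviews per week.
negative_share = {week: c["negative"] / c["total"] for week, c in sorted(weekly.items())}
for week, share in negative_share.items():
    print(f"Week {week}: {share:.0%} negative")
```

A rising negative share from one week to the next is a signal to look at the underlying reviews.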

In [None]:
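# Example: Simulate analyzing the original dataframe with predictions
# (We predict on the full dataset for demonstration; it includes the training rows,
# so these predictions look better than they would on truly unseen reviews.)
df_reviews['predicted_sentiment'] = chosen_model.predict(df_reviews['processed_text'])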

print("\n--- Analysis Example: Identifying Negative Reviews ---")
negative_reviews = df_reviews[df_reviews['predicted_sentiment'] == 'negative']
print(f"Found {len(negative_reviews)} potentially negative reviews:")
print(negative_reviews[['review_text', 'predicted_sentiment']])

# Further analysis could involve finding common words in negative reviews
from collections import Counter

negative_texts = ' '.join(negative_reviews['processed_text'])
word_counts = Counter(word_tokenize(negative_texts))

print("\nMost common words in predicted negative reviews:")
print(word_counts.most_common(10))

### 4.9 Learning Challenge

Your task is to try and improve the sentiment analysis model's performance. **Complete the code in the next cell.**

**Tasks:**

1.  **Experiment with Vectorization:** 
    * Try using `CountVectorizer` instead of `TfidfVectorizer` in one of the pipelines.
    * Adjust parameters within `TfidfVectorizer` (e.g., `min_df`, `max_df`, `ngram_range=(1, 3)`). Does changing the n-gram range help capture more context?
2.  **Try a Different Model:** 
    * Implement and evaluate a `LinearSVC` (Support Vector Classifier) model. Use a `Pipeline` similar to the other models.

**Goal:** Achieve a higher accuracy score or better F1-scores on the test set compared to the initial Naive Bayes and Logistic Regression models.

**Report:** Briefly document which changes you made and what impact they had on the evaluation metrics (Accuracy, Classification Report) in a markdown cell or code comments.

In [None]:
# --- Challenge: Improve Sentiment Analysis Model --- 

print("--- Original Model Scores (for comparison) ---")
print(f"Naive Bayes Accuracy: {accuracy_score(y_test, y_pred_nb):.4f}")
print(f"Logistic Regression Accuracy: {accuracy_score(y_test, y_pred_lr):.4f}")

# --- YOUR CODE HERE --- 

# Task 1: Experiment with Vectorization (Example: Modify LR pipeline)
# Define a new pipeline, perhaps lr_pipeline_v2
# Hint: Change TfidfVectorizer to CountVectorizer OR adjust TfidfVectorizer parameters
# lr_pipeline_v2 = Pipeline([
#    ('vectorizer', CountVectorizer(...)), # Or TfidfVectorizer with different params
#    ('clf', LogisticRegression(...))
# ])
# print("\nTraining Model with different vectorizer...")
# lr_pipeline_v2.fit(X_train, y_train)
# y_pred_lr_v2 = lr_pipeline_v2.predict(X_test)
# print("\n--- Evaluation (Vectorizer Change) ---")
# print("Accuracy:", accuracy_score(y_test, y_pred_lr_v2))
# print("Classification Report:\n", classification_report(y_test, y_pred_lr_v2))


# Task 2: Implement and Evaluate LinearSVC
# Define a new pipeline, perhaps svc_pipeline
# svc_pipeline = Pipeline([
#    ('tfidf', TfidfVectorizer(max_features=1000, ngram_range=(1, 2))), # Use consistent vectorizer for comparison
#    ('clf', LinearSVC(random_state=42, dual=False, max_iter=1000)) 
# ])
# print("\nTraining LinearSVC Model...")
# svc_pipeline.fit(X_train, y_train)
# y_pred_svc = svc_pipeline.predict(X_test)
# print("\n--- LinearSVC Evaluation ---")
# print("Accuracy:", accuracy_score(y_test, y_pred_svc))
# print("Classification Report:\n", classification_report(y_test, y_pred_svc))


# --- Report Your Findings --- 
# Add comments here or in a new markdown cell explaining what you tried and the results.
# For example:
# "Tried CountVectorizer with Logistic Regression. Accuracy was X.XXX. 
# LinearSVC with TF-IDF (ngram 1,2) achieved accuracy Y.YYY, which was the best." 

print("\nChallenge section complete. Remember to fill in the code above and report your findings.")

### 4.10 Business Implications

This case study demonstrates that even simple NLP techniques can provide valuable business insights from unstructured text data like customer reviews. 

Key Takeaways:
- **Scalability:** Sentiment analysis automates the processing of vast amounts of text that would be impossible to handle manually.
- **Actionable Insights:** It transforms raw feedback into quantifiable metrics and identifiable themes, enabling data-driven decisions.
- **Competitive Advantage:** Businesses that effectively leverage customer feedback gain an edge in product quality, customer satisfaction, and market responsiveness.

Our model is deliberately basic. Production systems often use more sophisticated techniques, such as Transformer-based deep learning models (BERT, RoBERTa), for higher accuracy and finer-grained analysis (e.g., aspect-based sentiment). The core principles of preprocessing, vectorization, modeling, and evaluation remain the same.

## 5. Named Entity Recognition (NER)

**Definition:** 
NER identifies and categorizes key entities in text, such as names of people, organizations, locations, dates, and monetary values.

**Business Application:** 
Extracting company names and locations from news articles for market intelligence, or identifying product names in customer support tickets.

**Before the Demo Question:** 
- If you could automatically extract all mentions of competitors or product names from online articles, how could your business use that information?

In [None]:
# Demo using spaCy for NER
import spacy

# Load a pre-trained spaCy model (download if needed)
# You might need to run this command in your terminal or a code cell:
# python -m spacy download en_core_web_sm 
try:
    nlp_ner = spacy.load('en_core_web_sm')
except OSError:
    print('Downloading language model for spaCy NER...')
    print('Please run `python -m spacy download en_core_web_sm` in your terminal or environment.')
    # Attempt the download with the current interpreter (might require permissions)
    import subprocess
    import sys
    try:
        subprocess.run([sys.executable, '-m', 'spacy', 'download', 'en_core_web_sm'], check=True)
        nlp_ner = spacy.load('en_core_web_sm')
        print("Model downloaded successfully.")
    except Exception as e:
        print(f"Could not automatically download model. Error: {e}")
        nlp_ner = None

text_ner = "Apple Inc. is planning to open a new store in Stockholm, Sweden next month, costing over $5 million."

if nlp_ner:
    doc_ner = nlp_ner(text_ner)
    print(f"Text: {text_ner}\n")
    print("Named Entities Found:")
    if not doc_ner.ents:
        print("No entities found by this model.")
    else:
        for ent in doc_ner.ents:
            print(f"- {ent.text} ({ent.label_})")
else:
    print("Skipping NER demo as spaCy model couldn't be loaded.")

**Reflection:** 
The spaCy library quickly identifies common entities like organizations (ORG), locations (GPE - Geopolitical Entity), dates (DATE), and money (MONEY). This structured information is much easier for computer systems to work with than raw text.

**Discussion Questions:** 
- What are the limitations of pre-trained NER models? Might they miss domain-specific entities (e.g., specific financial instrument names)?
- How could NER be combined with sentiment analysis for more nuanced insights (e.g., finding sentiment towards specific organizations mentioned in news)?

## 6. Topic Modeling

**Definition:** 
Topic modeling is an unsupervised technique used to discover abstract "topics" that occur in a collection of documents. Algorithms like Latent Dirichlet Allocation (LDA) identify groups of words that frequently appear together, representing underlying themes.

**Business Application:** 
Analyzing large volumes of customer survey responses or support emails to automatically identify the main themes or issues being discussed.

**Before the Demo Question:** 
- Imagine you have thousands of open-ended survey responses. How could you quickly get a sense of the main topics customers are talking about without reading every single response?

In [None]:
# Demo using LDA with Scikit-learn
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents (replace with real data)
topic_corpus = [
    "The new software update has improved performance significantly.",
    "Customer support was very helpful resolving the issue.",
    "I found a bug in the latest software version.",
    "The price is reasonable for the features offered.",
    "Login issues need to be fixed in the next update.",
    "Excellent support team, quick response time.",
    "Feature request: add integration with other tools.",
    "Billing problem took too long to resolve.",
    "Performance is slow after the recent software patch."
]

# Vectorize the text data using CountVectorizer (suitable for LDA)
# Apply similar preprocessing as before (lowercase, remove stops, etc.)
count_vectorizer = CountVectorizer(stop_words='english', lowercase=True, max_df=0.9, min_df=2) # Ignore terms too frequent or too rare
X_counts = count_vectorizer.fit_transform(topic_corpus)

# Define the number of topics to find
num_topics = 3 

# Initialize and fit the LDA model
lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
lda.fit(X_counts)

# Display the topics
feature_names_topic = count_vectorizer.get_feature_names_out()
print(f"\n--- Top Words per Topic (Found by LDA with {num_topics} topics) ---")
for topic_idx, topic in enumerate(lda.components_):
    # Get the indices of the top words for this topic
    top_word_indices = topic.argsort()[:-6:-1] # Get top 5 words
    top_words = [feature_names_topic[i] for i in top_word_indices]
    print(f"Topic #{topic_idx+1}: {', '.join(top_words)}")

# Assign topics to documents (optional)
# topic_distribution = lda.transform(X_counts)
# print("\nTopic distribution for first document:", topic_distribution[0])

**Reflection:** 
LDA identifies clusters of co-occurring words, representing latent topics. The interpretation of these topics requires human judgment: you might, for example, label one topic 'software/updates/performance' and another 'support/issues'. Topic numbers are arbitrary, and on a corpus this small the word groupings can change with the random seed. The number of topics is a hyperparameter that often needs tuning.
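
One rough way to compare topic counts is to fit several models and look at their perplexity (lower is better). This is only a sketch on a made-up corpus: perplexity measured on the training data tends to favor more topics, so held-out documents and a human reading of the top words are better guides in practice.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus for illustration.
docs = [
    "software update improved performance",
    "support team resolved the billing issue",
    "bug in the latest software update",
    "support was quick and helpful",
    "performance is slow after the software patch",
    "billing issue took long to resolve",
]
X = CountVectorizer(stop_words="english").fit_transform(docs)

perplexities = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    perplexities[k] = lda.perplexity(X)  # scored on the same data it was fitted on
    print(f"{k} topics: perplexity = {perplexities[k]:.1f}")

# Each row of transform() is one document's topic mixture and sums to 1.
doc_topics = lda.transform(X)  # uses the last fitted model (4 topics)
print(doc_topics[0].round(2))
```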

**Discussion Questions:** 
- How would you determine the optimal number of topics for a given dataset?
- What are the challenges in interpreting the topics generated by LDA? How can the results be made more actionable for business decisions?

## 7. Text Summarization

**Definition:** 
Text summarization automatically creates a shorter version of a text document while retaining the most important information. There are two main types: extractive (selecting key sentences from the original) and abstractive (generating new sentences that capture the essence).
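
The extractive approach can be sketched with nothing more than word frequencies: score each sentence by how many frequent words it contains and keep the top scorers. This is a deliberately naive illustration (the sentence splitter, the length-based word filter, and the example text are all assumptions), not a production method.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split: ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # Crude filter: ignore short words, which are mostly function words.
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in top)

text = ("NLP helps computers process language. "
        "Businesses use NLP to process customer reviews and support tickets. "
        "It is useful.")
print(extractive_summary(text))
# Businesses use NLP to process customer reviews and support tickets.
```

Because scores are summed, this sketch favors longer sentences; real extractive methods usually normalize for length.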

**Business Application:** 
Generating concise summaries of long reports, news articles, or meeting transcripts to save time and quickly grasp key points.

**Before the Demo Question:** 
- How much time could be saved in your organization if long documents could be reliably summarized automatically?

In [None]:
# Demo using Hugging Face Transformers for Abstractive Summarization
# Note: Requires internet connection and installing transformers & pytorch/tensorflow
# You might need to run: pip install transformers torch # or tensorflow
try:
    from transformers import pipeline
    # Initialize the summarization pipeline (using a smaller model for faster demo)
    # Using distilbart-cnn-6-6 which is smaller than the default bart-large-cnn
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-6-6")
except ImportError:
    print("Could not import transformers. Please install it: pip install transformers torch")
    summarizer = None
except Exception as e:
    print(f"Could not load summarization model. Error: {e}")
    print("Skipping summarization demo. Ensure 'transformers' and a backend (torch/tensorflow) are installed.")
    summarizer = None

long_text = """
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) focused on enabling computers 
to understand, interpret, and generate human language. It combines computational linguistics—rule-based modeling 
of human language—with statistical, machine learning, and deep learning models. Together, these technologies 
enable computers to process human language in the form of text or voice data and to ‘understand’ its full 
meaning, complete with the speaker’s or writer’s intent and sentiment. NLP drives computer programs that 
translate text from one language to another, respond to spoken commands, and summarize large volumes of text 
rapidly—even in real time. There's a high likelihood you’ve interacted with NLP in the form of voice-operated 
GPS systems, digital assistants, speech-to-text dictation software, customer service chatbots, and other 
consumer conveniences. But NLP also plays a growing role in enterprise solutions that help streamline business 
operations, increase employee productivity, and simplify mission-critical business processes.
"""

if summarizer:
    print("Original Text Length (characters):", len(long_text))
    # Generate summary (max_length and min_length are measured in tokens, not characters)
    try:
        summary = summarizer(long_text, max_length=60, min_length=20, do_sample=False)
        print("\nGenerated Summary:")
        print(summary[0]['summary_text'])
        print("Summary Length:", len(summary[0]['summary_text']))
    except Exception as e:
        print(f"\nError during summarization: {e}")
        print("This might be due to model download issues or resource constraints.")
else:
    print("Skipping summarization execution as the pipeline could not be initialized.")

**Reflection:** 
Abstractive summarization models, often based on Transformers, can generate fluent, concise summaries that are not just copied sentences. The quality depends heavily on the model and the input text complexity.

**Discussion Questions:** 
- What are the potential risks of relying on automated summaries (e.g., loss of nuance, factual inaccuracies)?
- In which business scenarios would extractive summarization be preferred over abstractive, and vice versa?

## 8. Machine Translation

**Definition:** 
Machine translation automatically converts text from one language to another.

**Business Application:** 
Translating product documentation for global markets, providing multilingual customer support, analyzing foreign language market reports.

**Before the Demo Question:** 
- How does language act as a barrier in your business operations or expansion plans?

In [None]:
# Demo using Hugging Face Transformers for Translation
# Note: Requires internet connection and installing transformers & pytorch/tensorflow
try:
    from transformers import pipeline
    # Initialize translation pipeline (e.g., English to French)
    # Using a smaller T5 model for efficiency
    translator = pipeline("translation_en_to_fr", model="t5-small")
except ImportError:
    print("Could not import transformers. Please install it: pip install transformers torch")
    translator = None
except Exception as e:
    print(f"Could not load translation model. Error: {e}")
    print("Skipping translation demo. Ensure 'transformers' and a backend (torch/tensorflow) are installed.")
    translator = None

text_to_translate = "Natural Language Processing is a fascinating field with many business applications."

if translator:
    print("Original English Text:")
    print(text_to_translate)

    # Perform translation
    try:
        translation = translator(text_to_translate, max_length=100)
        print("\nTranslated French Text:")
        print(translation[0]['translation_text'])
    except Exception as e:
        print(f"\nError during translation: {e}")
        print("This might be due to model download issues or resource constraints.")
else:
    print("Skipping translation execution as the pipeline could not be initialized.")

**Reflection:** 
Modern machine translation models (often sequence-to-sequence Transformers) can produce remarkably fluent translations for common language pairs. However, quality can vary for less common languages or highly technical/idiomatic text.
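
One rough way to put a number on this variation in quality is to compare a machine translation against a trusted human reference. The sketch below computes clipped unigram precision, which is one ingredient of BLEU-style metrics but not a full BLEU score. The function name `unigram_precision` and both French strings are hypothetical; the strings are not real model output.

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Share of candidate words that also appear in the reference (counts clipped)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Clip each word's count so repeating a correct word is not rewarded twice
    matched = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return matched / total

# Hypothetical example strings, invented for illustration
reference_fr = "le chat est sur le tapis"
candidate_fr = "le chat est sur la table"
overlap = unigram_precision(candidate_fr, reference_fr)
print(f"Unigram precision: {overlap:.2f}")
```

A word-overlap score like this ignores word order and meaning, so it is best treated as a quick screening signal alongside human review rather than a verdict on quality.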

**Discussion Questions:** 
- When is machine translation 'good enough' for business use, and when is human translation still essential?
- How can businesses evaluate the quality of machine translation output?

## 9. Introduction to Transformers and Advanced Models

**Definition:** 
Transformers are a type of deep learning architecture that has revolutionized NLP. Models like BERT, GPT, RoBERTa, and T5 use attention mechanisms to weigh the importance of different words in a sequence, leading to state-of-the-art performance on many tasks.

**Business Application:** 
Powering sophisticated chatbots, advanced search engines that understand query intent, highly accurate sentiment analysis, and realistic text generation.

**Before the Demo Question:** 
- Have you interacted with AI systems recently (like ChatGPT or advanced search engines) that seem to understand language much better than older systems? What makes them feel different?

In [None]:
# Demo: Using a pre-trained BERT model for Fill-Mask task (predicting masked words)
# Note: Requires an internet connection and the transformers package with a PyTorch or TensorFlow backend
try:
    from transformers import pipeline
    unmasker = pipeline('fill-mask', model='bert-base-uncased')
except ImportError:
    print("Could not import transformers. Please install it: pip install transformers torch")
    unmasker = None
except Exception as e:
    print(f"Could not load fill-mask model. Error: {e}")
    print("Skipping fill-mask demo. Ensure 'transformers' and a backend (torch/tensorflow) are installed.")
    unmasker = None

masked_sentence = "Stockholm is the capital of [MASK]."

if unmasker:
    print(f"Sentence: {masked_sentence}\n")
    print("BERT's Top Predictions for [MASK]:")
    try:
        predictions = unmasker(masked_sentence)
        # The pipeline can return a list of lists; if so, use the first inner list
        if predictions and isinstance(predictions[0], list):
            predictions = predictions[0]

        for i, pred in enumerate(predictions):
            if isinstance(pred, dict):
                print(f" {i+1}. {pred['token_str']} (Score: {pred['score']:.4f})")
            else:
                print(f"Unexpected prediction format item: {pred}")
    except Exception as e:
        print(f"\nError during fill-mask prediction: {e}")
        print("This might be due to model download issues or resource constraints.")
else:
    print("Skipping fill-mask execution as the pipeline could not be initialized.")


**Reflection:** 
This demo shows how BERT uses context. Drawing on the surrounding words ('Stockholm', 'capital'), it typically ranks 'sweden' as its top prediction with a high score; the output is lowercase because `bert-base-uncased` lowercases all text. This contextual understanding is a key strength of Transformer models.
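
The attention mechanism behind this contextual behaviour can be sketched in a few lines. The example below implements scaled dot-product attention for a single query in plain Python. All vectors are toy values invented for illustration and are not taken from BERT; real Transformers learn the query, key, and value projections and run many attention heads in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    # Similarity of the query to every key, scaled by sqrt(dimension)
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Toy 2-dimensional vectors, invented for illustration (not taken from BERT)
words = ["stockholm", "capital", "of"]
keys = [[1.0, 0.0], [0.8, 0.6], [0.0, 0.1]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [1.0, 0.2]  # stands in for the [MASK] position

attn_weights, attn_output = attention(query, keys, values)
for word, weight in zip(words, attn_weights):
    print(f"{word:>10}: {weight:.3f}")
```

With these toy values, the query gives the most weight to the key that points in the most similar direction, which is the sense in which attention "weighs the importance" of each word.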

**Discussion Questions:** 
- How does the 'attention mechanism' in Transformers help them understand context better than older models like RNNs or LSTMs?
- What are the computational costs and data requirements associated with training large Transformer models?

## 10. Fine-tuning Pre-trained Models

**Definition:** 
Fine-tuning involves taking a large, pre-trained language model (like BERT or GPT) that has learned general language patterns from massive datasets, and further training it on a smaller, task-specific dataset. This adapts the model to a particular domain or task (like classifying legal documents or analyzing medical notes).

**Business Application:** 
Adapting a general sentiment analysis model to understand industry-specific jargon in financial news, or training a chatbot on company-specific FAQs.

**Before the Demo Question:** 
- Why might a general-purpose language model struggle with highly specialized text (e.g., scientific papers, legal contracts)? How could fine-tuning help?

In [None]:
# Conceptual Demo: Fine-tuning for Sentiment Classification (Illustrative)
# Note: Actual fine-tuning requires significant setup, data, and compute resources.
# This cell provides a conceptual overview using the structure of the Hugging Face Trainer API.

# --- This is PSEUDOCODE / CONCEPTUAL --- 
'''
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset # Example library to handle datasets

# 1. Load a pre-trained model and tokenizer
model_name = "bert-base-uncased" 
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3) # e.g., positive, negative, neutral
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. Load your task-specific dataset (e.g., our product reviews)
# Assume 'dataset' has 'train' and 'test' splits with 'text' and 'label' columns
# dataset = load_dataset('your_sentiment_dataset_script.py') 
# Example using our previous DataFrame (needs conversion to Hugging Face Dataset object)
# from datasets import Dataset
# hf_dataset = Dataset.from_pandas(df_reviews[['processed_text', 'sentiment']]) 
# # Map labels to integers
# label_map = {'negative': 0, 'neutral': 1, 'positive': 2}
# def map_labels(example):
#     example['label'] = label_map[example['sentiment']]
#     return example
# hf_dataset = hf_dataset.map(map_labels)
# # Tokenize
# def tokenize_function(examples):
#     return tokenizer(examples["processed_text"], padding="max_length", truncation=True)
# tokenized_datasets = hf_dataset.map(tokenize_function, batched=True)
# # Split (if not already done)
# tokenized_datasets = tokenized_datasets.train_test_split(test_size=0.2)

# 3. Define Training Arguments
training_args = TrainingArguments(
    output_dir='./results',          # Output directory
    num_train_epochs=1,              # Number of training epochs (usually small for fine-tuning)
    per_device_train_batch_size=8,   # Batch size for training
    per_device_eval_batch_size=8,    # Batch size for evaluation
    warmup_steps=10,                 # Number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # Strength of weight decay
    logging_dir='./logs',            # Directory for storing logs
    logging_steps=5,
    evaluation_strategy="epoch"      # Evaluate each epoch (newer transformers releases call this eval_strategy)
)

# 4. Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"], # Placeholder
    eval_dataset=tokenized_datasets["test"],   # Placeholder
    # compute_metrics=compute_accuracy_metric # Function to compute metrics
)

# 5. Start Fine-tuning
# trainer.train()

# 6. Evaluate the fine-tuned model
# trainer.evaluate()

# 7. Save the fine-tuned model
# trainer.save_model("./fine_tuned_sentiment_model")
'''

print("This cell contains conceptual code for fine-tuning.")
print("Actual execution requires a specific dataset, environment setup, and significant compute time.")


**Reflection:** 
Fine-tuning allows leveraging the power of large pre-trained models without the need to train them from scratch, making state-of-the-art NLP accessible for specific business problems with moderate amounts of labeled data.
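
The conceptual cell above mentions a `compute_metrics=compute_accuracy_metric` argument but never defines that function. The sketch below is one possible version, not the course's official one. It assumes the `Trainer` passes a pair that unpacks into `(logits, labels)`, as in the Hugging Face examples, and the toy logits and labels are invented for illustration.

```python
def compute_accuracy_metric(eval_pred):
    """Accuracy from a (logits, labels) pair, the shape assumed for compute_metrics."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row
    predictions = [max(range(len(row)), key=lambda j: row[j]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Toy logits for three examples and three classes (negative, neutral, positive)
toy_logits = [[2.0, 0.1, -1.0], [0.2, 0.3, 1.5], [0.9, 1.1, 0.4]]
toy_labels = [0, 2, 0]
metrics = compute_accuracy_metric((toy_logits, toy_labels))
print(metrics)
```

Returning a dictionary lets the same function report additional metrics (for example per-class precision) later without changing how it is wired into the trainer.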

**Discussion Questions:** 
- What are the key steps involved in preparing a dataset for fine-tuning a Transformer model?
- How does fine-tuning compare to traditional machine learning approaches (like those used in the sentiment analysis case study) in terms of performance, data needs, and complexity?

## 11. Prompting Language Models - Assignment

**Definition:** 
Prompting involves crafting specific text inputs (prompts) to guide large language models (LLMs like GPT-3/4) to perform desired tasks without explicit fine-tuning. The way a prompt is phrased significantly influences the quality and relevance of the model's output.

**Business Application:** 
Using LLMs for rapid prototyping of NLP tasks like text generation (writing marketing copy), summarization, question answering, and simple classification by designing effective prompts.

**Assignment Task:**
Your final assignment is to practice prompt engineering for common business tasks. In the code cell below, you will:
1.  **Define Prompts:** Write clear and effective prompts for the following three scenarios:
    * **Scenario A (Email Drafting):** Generate a short follow-up email to a potential client after a meeting. Mention that you enjoyed the conversation about [mention a specific topic, e.g., 'their supply chain needs'] and suggest scheduling a brief call next week to discuss next steps.
    * **Scenario B (Summarization):** Summarize the key decisions made in a hypothetical meeting described in the `meeting_notes` variable (provided in the code cell). The summary should be 2-3 bullet points.
    * **Scenario C (Classification):** Classify customer feedback messages (provided in the `feedback_list` variable) into one of three categories: 'Bug Report', 'Feature Request', or 'General Inquiry'. Use a few-shot prompting approach (provide 1-2 examples within the prompt).
2.  **Use the LLM Function:** Use the provided conceptual function `llm_generate(prompt, max_tokens)` to generate text based on your prompts.
3.  **Print Results:** Print the generated text for each scenario clearly.

**Goal:** Successfully generate relevant and coherent text for each business scenario by crafting effective prompts. This exercise focuses on the prompt design, not the underlying LLM technology (which is simulated here).

In [None]:
# Conceptual Demo & Assignment: Prompting Language Models

# --- Provided Conceptual LLM Function (Simulated) --- 
# This function simulates calling a large language model.
# In a real scenario, this would involve API calls to services like OpenAI, Anthropic, etc.
# For this assignment, it returns predefined placeholder text based on keywords in the prompt.
def llm_generate(prompt, max_tokens=100):
    """Simulate generating text from a prompt (keyword matching only, no real model)."""
    prompt_lower = prompt.lower()
    print(f"\n--- Simulating LLM Call with max_tokens={max_tokens} ---")
    # Simple keyword-based simulation
    if "follow-up email" in prompt_lower and "client" in prompt_lower:
        return "Subject: Following Up\n\nHi [Client Name],\nIt was great speaking with you about [Specific Topic Mentioned]. Would you be available for a quick call next week to discuss next steps?\n\nBest regards,\n[Your Name]"
    elif "summarize" in prompt_lower and "meeting notes" in prompt_lower:
        return "- Decision 1: Approved the Q3 budget.\n- Decision 2: Agreed to postpone the product launch to October.\n- Action Item: Marketing team to revise launch plan."
    elif "classify" in prompt_lower and "customer feedback" in prompt_lower:
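        # Simulate classification using only the text after the LAST "Feedback:" label,
        # so few-shot examples earlier in the prompt cannot trigger a match.
        # If the prompt has no "Feedback:" label, the whole prompt is checked.
        target = prompt_lower.rsplit("feedback:", 1)[-1]
        if "doesn't load" in target:
            return "Bug Report"
        elif "integrate with" in target:
            return "Feature Request"
        else:
            return "General Inquiry" # Default fallback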
    else:
        return f"[Simulated LLM Response for prompt starting with: '{prompt[:50]}...']"

# --- Provided Data for Scenarios --- 
meeting_notes = """
Meeting Minutes - Project Phoenix - April 1, 2025
Attendees: Alice, Bob, Charlie
Discussion: Reviewed Q3 budget proposal. Alice presented revised figures. Bob raised concerns about marketing spend. After discussion, the budget was approved with minor adjustments.
Product Launch: Discussed timeline for the new 'Phoenix' software. Charlie indicated development is slightly behind schedule. Agreed to postpone the official launch from September to October 15th.
Action Items: Marketing team needs to update the launch communications plan based on the new date.
"""

feedback_list = [
    "The login page doesn't load on Firefox.",
    "Could you please add an option to integrate with Google Calendar?",
    "How do I reset my password?"
]

# --- Assignment: Write Your Prompts and Generate Text --- 

# Scenario A: Email Drafting
print("\n--- Scenario A: Email Drafting ---")
prompt_email = """ 
# --- YOUR PROMPT A HERE --- 
# Write a prompt to generate a follow-up email to a potential client 
# after a meeting about their 'supply chain needs'. Suggest a call next week.
"""
# generated_email = llm_generate(prompt_email, max_tokens=150)
# print("Generated Email:")
# print(generated_email)

# Scenario B: Summarization
print("\n--- Scenario B: Summarization ---")
prompt_summary = f""" 
# --- YOUR PROMPT B HERE --- 
# Write a prompt to summarize the key decisions from the meeting_notes below 
# into 2-3 bullet points.
# Meeting Notes:
# {meeting_notes}
"""
# generated_summary = llm_generate(prompt_summary, max_tokens=75)
# print("Generated Summary:")
# print(generated_summary)

# Scenario C: Classification (Few-Shot)
print("\n--- Scenario C: Classification ---")
# You need to classify the items in 'feedback_list'. 
# Craft ONE prompt that includes examples (few-shot) to classify the LAST item.
prompt_classify = """
# --- YOUR PROMPT C HERE --- 
# Write a prompt to classify customer feedback into 'Bug Report', 'Feature Request', 
# or 'General Inquiry'. Include examples for the first two items in feedback_list 
# and ask the model to classify the third item.
# Example Format:
# Classify the following customer feedback:
# Feedback: '[Feedback 1 Text]' -> Category: [Correct Category 1]
# Feedback: '[Feedback 2 Text]' -> Category: [Correct Category 2]
# Feedback: '[Feedback 3 Text]' -> Category: 
"""
# generated_classification = llm_generate(prompt_classify, max_tokens=10)
# print(f"Feedback to Classify: '{feedback_list[2]}'")
# print("Generated Classification:")
# print(generated_classification)

print("\nAssignment section complete. Make sure you have written prompts and uncommented the function calls.")


**Reflection:** 
Prompt engineering is becoming a crucial skill. By providing clear instructions, context, and sometimes examples (few-shot prompting), users can elicit complex behaviors from LLMs without traditional coding or model training. The effectiveness depends heavily on the LLM's capabilities and the quality of the prompt.
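
As an illustration of the few-shot pattern described here, the sketch below assembles a prompt from an instruction, a handful of labelled examples, and one unlabelled query. The helper `build_few_shot_prompt` and the ticket-urgency data are hypothetical and deliberately use a different task from the assignment scenarios.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble instruction, labelled examples, and the unlabelled query into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Text: '{text}' -> Label: {label}")
    # Leave the final label blank so the model completes it
    lines.append(f"Text: '{query}' -> Label:")
    return "\n".join(lines)

# Hypothetical examples from a different task (ticket urgency), invented for illustration
urgency_examples = [
    ("Our production server is down.", "Urgent"),
    ("Please update my billing address when you get a chance.", "Routine"),
]
urgency_prompt = build_few_shot_prompt(
    "Label each support ticket as 'Urgent' or 'Routine'.",
    urgency_examples,
    "Customers cannot complete checkout right now.",
)
print(urgency_prompt)
```

Keeping the examples and the query in exactly the same format gives the model a pattern to continue, which is usually what makes few-shot prompts work.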

**Discussion Questions:** 
- What makes a 'good' prompt? What are some common pitfalls in prompt design?
- How does prompting compare to fine-tuning in terms of cost, effort, performance, and control over the output?
- What are the ethical considerations when using LLMs for tasks like text generation or automated decision-making?

## 12. Conclusion

This notebook provided an overview of key Natural Language Processing concepts and techniques, from fundamental text preprocessing to advanced Transformer models and prompting.

We explored:
* **Core NLP Tasks:** Tokenization, preprocessing, vectorization (TF-IDF), Named Entity Recognition, Topic Modeling, Summarization, and Translation.
* **Sentiment Analysis Case Study:** A practical walkthrough of building, evaluating, and applying a sentiment classifier for business insights, including a hands-on challenge.
* **Modern NLP:** Introduction to Transformer architectures (like BERT) and their capabilities, the concept of fine-tuning pre-trained models, and the power of prompting Large Language Models.

Throughout the notebook, we emphasized the **business applications** of these techniques, demonstrating how NLP can drive value by extracting insights from text, automating tasks, and improving customer experiences.

The field of NLP is rapidly evolving, particularly with the advent of large language models. Understanding these fundamental concepts and practical implementations provides a strong foundation for leveraging NLP in various business contexts.