# Day-56: NLP Project

Today, on Day 56, we combine the preprocessing, feature engineering, and classification skills from earlier days in a practical NLP project built around a common business use case: review mining.

Review mining is a major commercial application of NLP: analyzing customer feedback (reviews, tweets, surveys) to extract actionable insights. The goal of this project is a functional system for understanding what customers like and dislike.

## Topics Covered:

Classification, Keyword Extraction, Review Mining

## Review Mining: The Business Problem 

Review Mining is the process of extracting meaningful information, patterns, and insights from text-based customer feedback. It enables businesses to quickly:

1. Quantify Sentiment: Determine the percentage of positive vs. negative feedback.

2. Identify Pain Points: Automatically flag recurring negative themes (e.g., "slow delivery," "poor quality").

3. Track Trends: Monitor changes in sentiment over time or after a product update.

- `Analogy`: The Customer Service Supervisor. Instead of a supervisor reading 10,000 emails, the system provides a dashboard that shows "52% of reviews mention Topic X, and 90% of those are negative."
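
The dashboard numbers in the analogy come down to two simple aggregations. Below is a minimal sketch with pandas, assuming a hypothetical toy DataFrame with `Comment` and `Sentiment` columns and a hypothetical topic keyword (`delivery`). The helper names `sentiment_share` and `topic_summary` are illustrative only and are not part of the project code.

```python
import pandas as pd

# Hypothetical toy feedback, written by hand for illustration only.
reviews = pd.DataFrame({
    "Comment": [
        "slow delivery and the box was damaged",
        "love the product, works great",
        "delivery took three weeks",
        "poor quality, broke after a day",
        "fast delivery, very happy",
    ],
    "Sentiment": ["negative", "positive", "negative", "negative", "positive"],
})


def sentiment_share(df: pd.DataFrame) -> pd.Series:
    """Percentage of comments per sentiment label (Quantify Sentiment)."""
    return df["Sentiment"].value_counts(normalize=True).mul(100).round(1)


def topic_summary(df: pd.DataFrame, topic: str) -> dict:
    """Share of comments mentioning `topic`, and share of those that are negative."""
    mentions = df["Comment"].str.contains(topic, case=False, regex=False)
    mention_pct = float(round(mentions.mean() * 100, 1))
    negative_pct = 0.0
    if mentions.any():
        is_negative = df.loc[mentions, "Sentiment"] == "negative"
        negative_pct = float(round(is_negative.mean() * 100, 1))
    return {"mention_pct": mention_pct, "negative_pct": negative_pct}


print(sentiment_share(reviews))
print(topic_summary(reviews, "delivery"))
```

A plain substring match stands in for topic detection here. Tracking trends would repeat the same aggregation per time period, which requires a timestamp column that this toy data does not have.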

## Sentiment Classification: The End Goal

We use the labeled `Sentiment` column (negative, neutral, or positive) to train a supervised model that can predict the polarity of future, unseen comments.

- `Algorithm`: We stick with Logistic Regression (Day 54) for its simplicity and interpretability, paired with TF-IDF features (Day 51).

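- **Why not VADER?**
    - VADER is fast (Day 54), but it scores text with a fixed, general-purpose lexicon. A Logistic Regression model trained on labeled comments can learn the slang and non-standard language of the YouTube comment environment, which typically makes it a more accurate domain-specific classifier. Note that the preprocessing step in this project strips every non-letter character, so emojis do not reach the model.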

## Keyword Extraction: Diagnosing the Problem

Once we identify a group of negative comments, we need to know why they are negative. We use TF-IDF scores to surface the words that carry the most weight within that group of comments.

- `Process`: Filter all comments labeled 'Negative' (or those predicted as negative by the model), re-run the TF-IDF calculation on just that subset, and sum each word's weight across the subset. Words with the highest total weight represent the core keywords of the complaint (e.g., "broken," "slow," "spam").

In [2]:
!pip install tabulate "kagglehub[pandas-datasets]" nltk scikit-learn pandas



In [22]:
import nltk
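# Ensure necessary NLTK downloads are complete (run these once if needed)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Newer NLTK releases load the tokenizer from 'punkt_tab'; if word_tokenize raises a LookupError, also run:
# nltk.download('punkt_tab')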


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\amey9\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\amey9\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\amey9\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [52]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import kagglehub
from sklearn.preprocessing import LabelEncoder
from tabulate import tabulate

import os

# Download the latest version of the dataset (kagglehub returns the local folder)
file_path = kagglehub.dataset_download("atifaliak/youtube-comments-dataset")

print("Path to dataset files:", file_path)
# Load the CSV; os.path.join keeps the path portable and avoids the invalid "\Y" escape
df = pd.read_csv(os.path.join(file_path, "YoutubeCommentsDataSet.csv"))
print("First 5 records:", df.head())

# Basic info
print("Dataset Info:")
print(df.info())
print("Missing values in each column:")
print(df.isnull().sum()) # Check for missing values
# Drop rows with missing values in 'Comment' or 'Sentiment'
df.dropna(subset=['Comment', 'Sentiment'], inplace=True)
print("After dropping missing values, dataset shape:", df.shape)


# Convert Sentiment to numerical labels
le = LabelEncoder()
df['label'] = le.fit_transform(df['Sentiment'])
# Map each encoded integer back to its sentiment name for readability
label_map = dict(enumerate(le.classes_))
print(f"Label Mapping: {le.classes_}")
print(f"Label Distribution:\n{df['Sentiment'].value_counts().to_markdown()}")


# --- 1. Preprocessing Pipeline (Day 50) ---

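# Define aggressive custom stop words based on initial output (like, people, dont, etc.)
# Note: the list also drops sentiment-bearing words such as 'good', 'great', 'best', and 'never'.
# That keeps the keyword table focused on topics, but it may cost the classifier some signal.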
CUSTOM_NOISE_WORDS = {
    'like', 'people', 'dont', 'im', 'get', 'one', 'time', 'video', 'would', 'really', 'u', 'could', 'say', 'me', 
    'know', 'even', 'make', 'year', 'need', 'never', 'much', 'also', 'go', 'see', 'think', 'want', 'way', 'good', 
    'great', 'still', 'got', 'cant', 'us', 'look', 'back', 'thing', 'things', 'lot', 'lots', 'another', 'new', 
    'first', 'last', 'two', 'three', 'years', 'day', 'days', 'los', 'always', 'game', 'games', 'tan', 'anh', 
    'feel', 'man', 'says', 'said', 'sayin', 'guy', 'hermosa', 'mucho', 'mas', 'muy', 'bien', 'buenas', 'didnt', 
    'take', 'going', 'mejores', 'actually', 'better', 'best', 'work', 'works', 'working', 'watched', 'watching', 
    'watch', 'seen', 'seein', 'thank', 'thanks', 'thankyou', 'thankx', 'plz', 'plzz', 'plzzz', 'someone', 'he', 
    'right', 'many', 'she', 'they', 'we', 'ou'}
# Build the lemmatizer and the stop-word set once, not on every call:
# preprocess_text runs on every comment in the dataset.
lemmatizer = WordNetLemmatizer()
try:
    # Combined stop words for English and French
    combined_stop_words = set(stopwords.words('english') + stopwords.words('french'))
except (LookupError, OSError):
    # Fallback if French stop words are not available
    combined_stop_words = set(stopwords.words('english'))
    print("Warning: Could not load French stopwords.")
final_stop_words = combined_stop_words.union(CUSTOM_NOISE_WORDS)


def preprocess_text(text):
    """Cleans and tokenizes text using lemmatization and aggressive stopword removal."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)  # Remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove punctuation, digits, and special chars
    tokens = word_tokenize(text)

    # Filter stopwords and short noise tokens (length <= 2), then lemmatize
    cleaned_tokens = [
        lemmatizer.lemmatize(w)
        for w in tokens
        if w not in final_stop_words and len(w) > 2
    ]
    return " ".join(cleaned_tokens)

# Apply the preprocessing pipeline
df['Clean_Comment'] = df['Comment'].apply(preprocess_text)
print("\n--- 1. Preprocessing Complete ---")
print(f"Original (Example): {df['Comment'].iloc[0]}")
print(f"Cleaned (Example): {df['Clean_Comment'].iloc[0]}\n")


# --- 2. Sentiment Classification (Day 51 & Day 54) ---
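# Feature Extraction (TF-IDF)
# Split the text first, then fit TF-IDF on the training comments only, so that
# vocabulary and IDF statistics from the test set do not leak into training.
y = df['label']
train_text, test_text, y_train, y_test = train_test_split(
    df['Clean_Comment'], y, test_size=0.3, random_state=42
)
tfidf_vectorizer = TfidfVectorizer()
X_train = tfidf_vectorizer.fit_transform(train_text)
X_test = tfidf_vectorizer.transform(test_text)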

# Training Logistic Regression Model
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)

# Evaluation
y_pred = lr_model.predict(X_test)
print("--- 2. SENTIMENT CLASSIFICATION REPORT ---")
print(classification_report(y_test, y_pred, target_names=le.classes_, zero_division=0))


# --- 3. Keyword Extraction by Sentiment (Review Mining) ---
print("\n--- 3. KEYWORD EXTRACTION BY SENTIMENT (Top 5) ---")
print("Using TF-IDF with Bigrams (1,2) to capture meaningful phrases.")

# List to store results for a clean table
keyword_results = []

for sentiment_class in le.classes_:
    # Filter the original DataFrame for this specific sentiment class
    filtered_df = df[df['Sentiment'] == sentiment_class]

    if not filtered_df.empty:
        # Key Improvement: Use ngram_range=(1, 2) to capture bigrams (e.g., 'apple pay')
        sub_tfidf = TfidfVectorizer(ngram_range=(1, 2))
        sub_X = sub_tfidf.fit_transform(filtered_df['Clean_Comment'])
        
        # Sum the TF-IDF weights across all documents in this class
        feature_names = sub_tfidf.get_feature_names_out()
        total_weights = np.asarray(sub_X.sum(axis=0)).ravel()
        keyword_scores = pd.Series(total_weights, index=feature_names)
        
        # Select the top 5 keywords
        top_keywords = keyword_scores.nlargest(5)
        
        # Store results
        keyword_results.append([sentiment_class] + list(top_keywords.index))

# Print final keyword table
headers = ["Sentiment"] + [f"Keyword {i+1}" for i in range(5)]
print(tabulate(keyword_results, headers=headers, tablefmt="github"))

Path to dataset files: C:\Users\amey9\.cache\kagglehub\datasets\atifaliak\youtube-comments-dataset\versions\1
First 5 records:                                              Comment Sentiment
0  lets not forget that apple pay in 2014 require...   neutral
1  here in nz 50 of retailers don’t even have con...  negative
2  i will forever acknowledge this channel with t...  positive
3  whenever i go to a place that doesn’t take app...  negative
4  apple pay is so convenient secure and easy to ...  positive
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18408 entries, 0 to 18407
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Comment    18364 non-null  object
 1   Sentiment  18408 non-null  object
dtypes: object(2)
memory usage: 287.8+ KB
None
Missing values in each column:
Comment      44
Sentiment     0
dtype: int64
After dropping missing values, dataset shape: (18364, 2)
Label Mapping: ['negative' 'neutral' 