# Day 2: Advanced Feature Engineering and Model Prototyping

**Goal:** To build a comprehensive feature set and develop a robust classification pipeline, adapting to technical challenges by implementing both a primary (LLM) and a fallback (Scikit-learn) strategy.

This notebook documents the full workflow for Day 2, including:
1.  **Base Feature Engineering:** Creating foundational features from text and metadata.
2.  **Advanced NLP Features:** Enriching the data with TF-IDF keywords and LDA topics.
3.  **Time-based Analysis:** Calculating time deltas to detect anomalous user behavior.
4.  **Policy Modules:** Implementing both rule-based and ML-based classifiers.
5.  **Final Output Generation:** Saving the fully augmented dataset for Day 3's evaluation.

In [1]:
# ===================================================================
# Data Loading and Sentiment Analysis
# ===================================================================

import pandas as pd
import re
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk

# Download necessary resources for VADER
nltk.download('vader_lexicon')

# Load processed and combined data from UCSD source
df = pd.read_csv('ucsd_delaware_reviews_combined.csv')

# Initialize VADER
analyzer = SentimentIntensityAnalyzer()

# Calculate sentiment scores (ensure 'text' column has no NaN values)
df.dropna(subset=['text'], inplace=True)
df['sentiment_score'] = df['text'].apply(lambda text: analyzer.polarity_scores(str(text))['compound'])

print("Successfully loaded upgraded data and calculated sentiment scores!")
print(df[['text', 'sentiment_score', 'category']].head())

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Successfully loaded upgraded data and calculated sentiment scores!
                                                text  sentiment_score  \
0  Lived here for 3 years and enjoyed it. Locatio...           0.8176   
1  Lived here for 3 years and enjoyed it. Locatio...           0.8176   
2  Nice complex and awesome staff.  Maintenance i...           0.9685   
3  Nice complex and awesome staff.  Maintenance i...           0.9685   
4  Good places for people to live my friend lives...           0.7269   

                category  
0  ['Apartment complex']  
1  ['Apartment complex']  
2  ['Apartment complex']  
3  ['Apartment complex']  
4  ['Apartment complex']  


### Task 1.1: Metadata and Heuristic Features

We begin by extracting simple but powerful features from the existing data, such as review length, user activity, and signals indicating a real visit.

In [2]:
# ===================================================================
## Feature Engineering
# ===================================================================

## 1. Metadata Features
# Review length (character count)
df['review_length'] = df['text'].str.len()

# Count number of reviews per user
# .transform('count') assigns the count to each row of that user
df['user_review_count'] = df.groupby('user_id')['user_id'].transform('count')

# Deviation between review rating and average location rating
# fillna(0) to handle cases without avg_rating
df['rating_deviation'] = (df['rating'] - df['avg_rating']).fillna(0)


## 2. "Has Visited" Signal Inference
# Keywords indicating the user actually visited the location
visit_keywords = [
    'visited', 'went to', 'ate here', 'dined here', 'was there',
    'stayed at', 'my visit', 'our visit', 'ordered', 'tried the'
]
# Create flag if text contains any keyword from the list above
df['has_visit_keyword'] = df['text'].str.contains('|'.join(visit_keywords), case=False, na=False)

print("Successfully created new features!")
# Display new columns for verification
print(df[['user_name', 'review_length', 'user_review_count', 'rating_deviation', 'has_visit_keyword']].head())

Successfully created new features!
        user_name  review_length  user_review_count  rating_deviation  \
0  Heather Carper            112                  4               0.5   
1  Heather Carper            112                  4               0.5   
2  STACY CLAVETTE            283                  5               0.5   
3  STACY CLAVETTE            283                  5               0.5   
4       Zion Hood             74                  5               0.5   

   has_visit_keyword  
0              False  
1              False  
2              False  
3              False  
4              False  
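
Because `str.contains` performs substring matching, a keyword such as `'ordered'` would also fire inside a word like "disordered". The following is a minimal sketch of a stricter matcher, assuming whole-word matching is acceptable for this keyword list. The helper name `mentions_visit` is hypothetical and is not part of the notebook.

```python
import re

visit_keywords = [
    'visited', 'went to', 'ate here', 'dined here', 'was there',
    'stayed at', 'my visit', 'our visit', 'ordered', 'tried the'
]

# Escape each phrase and anchor the alternation on word boundaries
visit_pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(k) for k in visit_keywords) + r')\b',
    flags=re.IGNORECASE,
)

def mentions_visit(text):
    """Return True if the text contains a visit phrase as whole words (hypothetical helper)."""
    return bool(visit_pattern.search(str(text)))

print(mentions_visit("We ORDERED the pasta"))  # phrase appears as a whole word
print(mentions_visit("A disordered mess"))     # 'ordered' appears only inside another word
```

In pandas, the same expression could be applied with `df['text'].str.contains(visit_pattern.pattern, case=False, na=False)`.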


### Task 2.1: Rule-Based Policy Enforcement Module

Next, we build a simple rule-based module. It serves as a baseline and supplies the initial labels that our machine learning models are later trained on.

In [3]:
# ===================================================================
## Rule-based Violation Detection
# ===================================================================

def detect_violations_rules(row):
    flags = {}

    # Policy 1: Rant without visit
    # Condition: very negative sentiment AND no visit keywords
    is_rant_no_visit = row['sentiment_score'] < -0.5 and not row['has_visit_keyword']
    flags['is_rant_without_visit'] = is_rant_no_visit

    # Policy 2: Irrelevant content (based on simple rules)
    # Condition: Review is too short (could be spam or valueless)
    is_irrelevant = row['review_length'] < 20
    flags['is_irrelevant'] = is_irrelevant

    # Add 'is_clean' flag if no violations detected
    flags['is_clean_by_rules'] = not any([is_rant_no_visit, is_irrelevant])

    return pd.Series(flags)

# Apply function to the entire DataFrame
rule_based_flags = df.apply(detect_violations_rules, axis=1)
df = pd.concat([df, rule_based_flags], axis=1)

print("Successfully flagged violations based on rules!")
# Check reviews flagged as "rant without visit"
print("\nReviews that are potentially 'Rant without visit':")
print(df[df['is_rant_without_visit'] == True][['text', 'sentiment_score', 'has_visit_keyword']].head())

Successfully flagged violations based on rules!

Reviews that are potentially 'Rant without visit':
                                                 text  sentiment_score  \
30  I love this place I have severe fibromyalgia m...          -0.5568   
31  I love this place I have severe fibromyalgia m...          -0.5568   
42  I run a law firm. They deposit client trust mo...          -0.8070   
43  I run a law firm. They deposit client trust mo...          -0.8070   
84  Took my Miter saw in to replace the handle bec...          -0.5990   

    has_visit_keyword  
30              False  
31              False  
42              False  
43              False  
84              False  
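
`df.apply(..., axis=1)` calls the Python function once per row, which can become slow on larger datasets. The same two rules can be written as column-wise boolean operations. This is a sketch on invented toy data; the function name `flag_violations` and its keyword arguments are assumptions for illustration, with defaults copied from the thresholds used in the cell above.

```python
import pandas as pd

def flag_violations(frame, rant_threshold=-0.5, min_length=20):
    """Vectorized version of the two rules (hypothetical helper)."""
    out = pd.DataFrame(index=frame.index)
    # Policy 1: very negative sentiment AND no visit keyword
    out["is_rant_without_visit"] = (
        (frame["sentiment_score"] < rant_threshold)
        & ~frame["has_visit_keyword"].astype(bool)
    )
    # Policy 2: review too short to be informative
    out["is_irrelevant"] = frame["review_length"] < min_length
    # Clean only if neither rule fired
    out["is_clean_by_rules"] = ~(out["is_rant_without_visit"] | out["is_irrelevant"])
    return out

# Toy rows, invented purely for illustration
demo = pd.DataFrame({
    "sentiment_score": [-0.8, -0.8, 0.9, 0.4],
    "has_visit_keyword": [False, True, False, False],
    "review_length": [120, 120, 13, 80],
})
flags = flag_violations(demo)
print(flags)
```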


## Task 3: Model Development - Prototyping a Classifier

The primary goal was to use a Large Language Model (LLM). However, due to API limitations, we pivoted to our robust fallback plan: a Scikit-learn based machine learning model.

In [4]:
# ===================================================================
## Hugging Face Authentication
# ===================================================================

from google.colab import userdata
import huggingface_hub

try:
    hf_token = userdata.get('HF_TOKEN')
    huggingface_hub.login(token=hf_token)
    print("Successfully logged in to Hugging Face!")
except Exception as e:
    print(f"Error: {e}. Make sure you've saved 'HF_TOKEN' in Colab Secrets.")

Successfully logged in to Hugging Face!


In [5]:
# ===================================================================
## Machine Learning Classification with Scikit-learn
# ===================================================================

# --- PLAN B: FALLBACK USING SCIKIT-LEARN (FIXED) ---
print("LLM API is not available, switching to Plan B: Using Scikit-learn.")

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split # Additional import to split data

# 1. Prepare training data
def create_label(row):
    if row['is_rant_without_visit']:
        return 'rant_no_visit'
    if row['is_irrelevant']:
        return 'irrelevant'
    return 'clean'

df['rule_based_label'] = df.apply(create_label, axis=1)

# Handle potential NaN values in text_clean column from the beginning
df['text_clean'] = df['text_clean'].fillna('')

# Split data into training and test sets (for evaluation in Day 3)
# We'll train the model on 80% of the data
X_train, X_test, y_train, y_test = train_test_split(
    df['text_clean'],
    df['rule_based_label'],
    test_size=0.2,
    random_state=42,
    stratify=df['rule_based_label'] # Maintain label proportions
)

# 2. Build Pipeline
text_clf_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced')),
])

# 3. Train the model
print("\nTraining Logistic Regression model...")
text_clf_pipeline.fit(X_train, y_train)
print("Training complete!")

# 4. Make predictions on the entire dataset (this includes the training rows, so it is not a held-out evaluation)
df['sklearn_classification'] = text_clf_pipeline.predict(df['text_clean'])

# 5. Display results
print("\nClassification results using Scikit-learn (first 10 rows):")
print(df[['text', 'rule_based_label', 'sklearn_classification']].head(10))

# Check if it correctly detects "rant" cases
print("\nChecking 'rant' cases predicted by the model:")
print(df[df['sklearn_classification'] == 'rant_no_visit'][['text', 'sentiment_score']].head())

LLM API is not available, switching to Plan B: Using Scikit-learn.

Training Logistic Regression model...
Training complete!

Classification results using Scikit-learn (first 10 rows):
                                                text rule_based_label  \
0  Lived here for 3 years and enjoyed it. Locatio...            clean   
1  Lived here for 3 years and enjoyed it. Locatio...            clean   
2  Nice complex and awesome staff.  Maintenance i...            clean   
3  Nice complex and awesome staff.  Maintenance i...            clean   
4  Good places for people to live my friend lives...            clean   
5  Good places for people to live my friend lives...            clean   
6                                      great layout.       irrelevant   
7                                      great layout.       irrelevant   
8  Positives: Elevator, dog park and maintenance....            clean   
9  Positives: Elevator, dog park and maintenance....            clean   

  sklearn_c
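
Predicting on the full DataFrame mixes training rows with unseen rows, so agreement with `rule_based_label` there says little about generalization. A minimal sketch of scoring only the held-out split is given below. The toy reviews and labels are invented for illustration, and the pipeline settings are simplified relative to the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy reviews and labels, invented purely for illustration
texts = [
    "terrible scam awful people", "worst company ever avoid", "horrible rude liars",
    "awful terrible never again", "ok", "nice", "good", "yes",
    "friendly staff and tasty food", "clean rooms and great location",
    "lovely park with a playground", "helpful service and fair prices",
]
labels = ["rant_no_visit"] * 4 + ["irrelevant"] * 4 + ["clean"] * 4

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
pipeline.fit(X_train, y_train)

# Score only on rows the model has not seen during training
held_out_predictions = pipeline.predict(X_test)
print(classification_report(y_test, held_out_predictions, zero_division=0))
```

In the notebook, the equivalent call would use the existing `text_clf_pipeline`, `X_test`, and `y_test`.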

## End of Day 2: Next Steps

We have successfully created a feature-rich dataset and a functional baseline classifier. The next steps will involve:
-   **Advanced NLP:** Generating more sophisticated features like topics and keywords.
-   **Model Evaluation:** Rigorously evaluating the performance of our Scikit-learn model.
-   **Final Output:** Saving the fully augmented dataset for the final evaluation phase.

In [6]:
# ===================================================================
## Install Required Packages
# ===================================================================

%pip install scikit-learn gensim



### Task 1.2: Advanced NLP - Keyword Extraction and Topic Modeling

To gain a deeper understanding of the review content, we apply two advanced NLP techniques:
-   **TF-IDF:** To extract the most important keywords from each review.
-   **LDA (Latent Dirichlet Allocation):** To automatically discover the main underlying topics discussed across all reviews. This is crucial for identifying irrelevant content.

In [7]:
# ===================================================================
## Keyword Extraction with TF-IDF
# ===================================================================
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Initialize TF-IDF Vectorizer
# We'll only consider the 2000 most common words to speed up processing
tfidf_vectorizer = TfidfVectorizer(max_features=2000, stop_words='english')

# Fit on the entire text_clean column to learn vocabulary and IDF weights
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text_clean'].fillna(''))
# Get the list of words (features)
feature_names = np.array(tfidf_vectorizer.get_feature_names_out())

def extract_top_keywords(doc, top_n=5):
    """Extract top N keywords from a text based on fitted TF-IDF."""
    # Transform only this document and flatten to a 1-D array of scores
    scores = tfidf_vectorizer.transform([doc]).toarray().flatten()
    # Sort word indices by TF-IDF score, highest first
    sorted_indices = np.argsort(scores)[::-1][:top_n]
    # Keep only words that actually occur in the document (score > 0),
    # so short or empty reviews do not receive arbitrary vocabulary words
    top_keywords = [feature_names[i] for i in sorted_indices if scores[i] > 0]
    return ', '.join(top_keywords)

# Apply this function to create a new column
# Note: This step might be slow if the dataset is large
df['keywords'] = df['text_clean'].fillna('').apply(extract_top_keywords)

print("Keywords extracted using TF-IDF:")
print(df[['text', 'keywords']].head())

Keywords extracted using TF-IDF:
                                                text  \
0  Lived here for 3 years and enjoyed it. Locatio...   
1  Lived here for 3 years and enjoyed it. Locatio...   
2  Nice complex and awesome staff.  Maintenance i...   
3  Nice complex and awesome staff.  Maintenance i...   
4  Good places for people to live my friend lives...   

                                           keywords  
0    convenience, views, apartments, lived, enjoyed  
1    convenience, views, apartments, lived, enjoyed  
2  close, complex, apartments, maintenance, located  
3  close, complex, apartments, maintenance, located  
4                 awsome, lives, says, friend, live  


In [8]:
# ===================================================================
## Topic Modeling with LDA - Data Preparation
# ===================================================================

from gensim import corpora
from gensim.models import LdaModel

# Tokenize text (split into words)
tokenized_data = [text.split() for text in df['text_clean'].fillna('')]

# Create dictionary and corpus
dictionary = corpora.Dictionary(tokenized_data)
# Filter out extremely rare or common words
dictionary.filter_extremes(no_below=5, no_above=0.5)
corpus = [dictionary.doc2bow(text) for text in tokenized_data]

print("Data prepared for LDA topic modeling.")

Data prepared for LDA topic modeling.


In [9]:
# ===================================================================
## Topic Modeling with LDA - Model Training and Topic Assignment
# ===================================================================

# Train LDA model to identify 5 topics
# This step may take several minutes
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=10, random_state=42)

print("LDA model training complete. Below are the 5 main topics:")
for idx, topic in lda_model.print_topics(-1):
    print(f'Topic: {idx} \nWords: {topic}\n')

# Assign the main topic to each review
def get_dominant_topic(doc):
    bow = dictionary.doc2bow(doc.split())
    topics = lda_model.get_document_topics(bow)
    # Choose the topic with highest probability
    dominant_topic = sorted(topics, key=lambda x: x[1], reverse=True)[0][0]
    return dominant_topic

df['topic'] = df['text_clean'].fillna('').apply(get_dominant_topic)

print("\nTopics assigned to each review:")
print(df[['text', 'topic']].head())

LDA model training complete. Below are the 5 main topics:
Topic: 0 
Words: 0.048*"the" + 0.035*"to" + 0.024*"you" + 0.023*"i" + 0.023*"they" + 0.023*"and" + 0.023*"a" + 0.019*"is" + 0.018*"have" + 0.014*"of"

Topic: 1 
Words: 0.051*"a" + 0.034*"to" + 0.031*"of" + 0.031*"place" + 0.030*"great" + 0.029*"for" + 0.025*"and" + 0.025*"nice" + 0.016*"good" + 0.013*"very"

Topic: 2 
Words: 0.093*"the" + 0.039*"and" + 0.032*"was" + 0.029*"food" + 0.024*"is" + 0.021*"it" + 0.019*"best" + 0.017*"good" + 0.015*"love" + 0.014*"a"

Topic: 3 
Words: 0.097*"and" + 0.090*"great" + 0.061*"very" + 0.055*"service" + 0.044*"friendly" + 0.043*"staff" + 0.042*"good" + 0.026*"food" + 0.021*"helpful" + 0.021*"always"

Topic: 4 
Words: 0.046*"i" + 0.041*"and" + 0.038*"the" + 0.037*"was" + 0.033*"to" + 0.029*"my" + 0.025*"a" + 0.016*"me" + 0.016*"for" + 0.014*"it"


Topics assigned to each review:
                                                text  topic
0  Lived here for 3 years and enjoyed it. Locatio...    
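
The topics printed above are dominated by function words such as "the", "to", and "and", which suggests that `text_clean` still contains stop words when it is tokenized with `split()`. One possible remedy, sketched here under the assumption that scikit-learn's built-in English stop-word list is suitable for these reviews, is to filter tokens before building the dictionary. The helper `tokenize_for_lda` and its `min_length` parameter are hypothetical.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def tokenize_for_lda(text, min_length=3):
    """Lowercase, split on whitespace, and drop stop words and very short tokens."""
    tokens = str(text).lower().split()
    return [t for t in tokens if t not in ENGLISH_STOP_WORDS and len(t) >= min_length]

sample_tokens = tokenize_for_lda("the food was great and the staff friendly")
print(sample_tokens)
```

In the notebook this would replace the plain `text.split()` call, for example `tokenized_data = [tokenize_for_lda(t) for t in df['text_clean'].fillna('')]`, after which the LDA model would need to be retrained.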

### Task 1.3: Time-based Features and Ad Detection

To identify potential bot-like behavior, we analyze the time between consecutive reviews from the same user. We also implement a rule to detect advertisements by searching for URLs.

In [10]:
# ===================================================================
## Time Delta Analysis
# ===================================================================

# Convert 'time' column (milliseconds) to datetime format
df['datetime'] = pd.to_datetime(df['time'], unit='ms')

# Sort DataFrame by user and time
df = df.sort_values(by=['user_id', 'datetime'])

# Calculate time difference (in seconds) compared to the previous review by the same user
df['time_delta_seconds'] = df.groupby('user_id')['datetime'].diff().dt.total_seconds().fillna(0)

print("Calculated time differences between reviews from the same user:")
print(df[['user_name', 'datetime', 'time_delta_seconds']].head())

Calculated time differences between reviews from the same user:
          user_name                datetime  time_delta_seconds
648     Ronald Keys 2018-03-01 15:56:48.891        0.000000e+00
44104  Don Kelleher 2020-05-21 21:43:37.803        0.000000e+00
14627   Tom Carroll 2020-02-08 01:04:21.369        0.000000e+00
23922   Tom Carroll 2020-10-11 14:24:50.518        2.130243e+07
35287  Jeff Peacock 2017-07-23 02:04:50.769        0.000000e+00
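
Filling the first review of each user with `0` makes it indistinguishable from a review posted zero seconds after the previous one. A sketch of one way to keep the two cases apart is to leave the missing delta as `NaN` and derive an explicit flag. The column name `is_rapid_repeat` and the 60-second threshold are assumptions chosen for illustration, and the timestamps are invented.

```python
import pandas as pd

# Toy timestamps in milliseconds, invented purely for illustration
reviews = pd.DataFrame({
    "user_id": ["a", "a", "a", "b"],
    "time": [1_600_000_000_000, 1_600_000_030_000, 1_600_086_400_000, 1_600_000_000_000],
})
reviews["datetime"] = pd.to_datetime(reviews["time"], unit="ms")
reviews = reviews.sort_values(["user_id", "datetime"])

# Leave NaN for each user's first review instead of filling with 0
reviews["time_delta_seconds"] = (
    reviews.groupby("user_id")["datetime"].diff().dt.total_seconds()
)

BURST_THRESHOLD_SECONDS = 60  # hypothetical threshold
# Comparisons against NaN evaluate to False, so first reviews are never flagged
reviews["is_rapid_repeat"] = reviews["time_delta_seconds"].lt(BURST_THRESHOLD_SECONDS)
print(reviews[["user_id", "time_delta_seconds", "is_rapid_repeat"]])
```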


In [11]:
# ===================================================================
## URL Detection for Advertisement Identification
# ===================================================================

# Create a few fake reviews containing URLs for testing
promo_text_1 = "Great place! visit www.mypromo.com for a 10% discount!"
promo_text_2 = "I loved it, check out my blog at http://myblog.net"
df.loc[len(df)] = df.iloc[0] # Copy a row as a template
df.loc[len(df)-1, 'text'] = promo_text_1
df.loc[len(df)] = df.iloc[1]
df.loc[len(df)-1, 'text'] = promo_text_2

# Define pattern to catch URLs (non-capturing group, so pandas does not warn about match groups)
url_pattern = r'(?:https|http|www)[^\s]+'
df['has_url'] = df['text'].str.contains(url_pattern, case=False, na=False)

print(f"Number of reviews with URLs after adding samples: {df['has_url'].sum()}")
print("Reviews containing URLs:")
print(df[df['has_url']][['text', 'has_url']])

Number of reviews with URLs after adding samples: 7
Reviews containing URLs:
                                                    text  has_url
9783   Incredibly delicious!!\nAs always.\nLuv the su...     True
31808  Very disorganized staff unfriendly and unhelpf...     True
58287  I loved it, check out my blog at http://myblog...     True
27067  I live by this place. I don't eat Chinese but ...     True
23844  (Translated by Google) Awwweeeessssooooomeeee ...     True
52771  (Translated by Google) Slowwwww at the pharmac...     True
37895  Time well spent and well worth it.\nEvery wher...     True
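
The output above lists only one of the two synthetic reviews, and it flags stretched words such as "Awwweeeessssooooomeeee" and "Slowwwww", which contain "www" but no link. Both effects appear consistent with how the cell is written: after `dropna` and `sort_values` the index is no longer a contiguous 0..N-1 range, so `df.loc[len(df)] = ...` may overwrite an existing label instead of appending, and the pattern accepts a bare "www" prefix. The sketch below shows one possible alternative on invented toy data; the name `strict_url_pattern` is hypothetical.

```python
import pandas as pd

# Toy frame with a non-contiguous index, as left behind by dropna/sort_values
frame = pd.DataFrame(
    {"text": ["Slowwwww service", "See https://example.com/menu", "Nice spot"]},
    index=[5, 2, 3],
)

# Label 3 already exists here, so frame.loc[len(frame)] = ... would overwrite it.
# Appending with concat and ignore_index=True always adds new rows.
synthetic = pd.DataFrame({"text": [
    "Great place! visit www.mypromo.com for a 10% discount!",
    "I loved it, check out my blog at http://myblog.net",
]})
frame = pd.concat([frame, synthetic], ignore_index=True)

# Require a scheme ("http://", "https://") or a literal "www." prefix
strict_url_pattern = r"(?:https?://|www\.)\S+"
frame["has_url"] = frame["text"].str.contains(strict_url_pattern, case=False, na=False)
print(frame)
```

Note that `ignore_index=True` renumbers all rows, so any code relying on the original index labels would need to be adjusted.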


## Task 2 & 3: Policy Modules - Rules and Scikit-learn Fallback

Now we combine all our engineered features into policy modules. We create a multi-label rule-based system and then train a Scikit-learn `LogisticRegression` model to learn from these rules, providing a robust fallback classifier.

In [12]:
# ===================================================================
## Multi-label Classification
# ===================================================================

# Modify function to create multiple label columns (multi-label)
def create_multilabels(row):
    labels = []
    # Policy 1: Rant without visit
    if row['sentiment_score'] < -0.5 and not row['has_visit_keyword']:
        labels.append('rant_no_visit')
    # Policy 2: Advertisement
    if row['has_url']:
        labels.append('ad')
    # Policy 3: Irrelevant (e.g., topic 4 is considered irrelevant)
    # Assuming after reviewing the topics above, you find topic 4 to be irrelevant
    if row['topic'] == 4: # Replace 4 with the topic index you consider irrelevant
        labels.append('irrelevant')

    # If no labels, it's 'clean'
    if not labels:
        labels.append('clean')

    return labels

df['multilabels'] = df.apply(create_multilabels, axis=1)

print("\nCreated multi-label classifications:")
print(df[['text', 'multilabels']].tail()) # Inspect the last rows; the synthetic ad reviews appear here only if they were appended at the end


Created multi-label classifications:
                                                    text multilabels
42827  A nice beach!  Fairly busy, arrive early to ge...     [clean]
22297  Good food, better if you eat seafood.  Friendl...     [clean]
22299  Good food, better if you eat seafood.  Friendl...     [clean]
19157                 Food was good, staff was friendly.     [clean]
36516  Always great meats, great prices,  great servi...     [clean]
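
A column of Python lists is convenient to read but cannot be fed directly to most classifiers. One common option, sketched here with scikit-learn's `MultiLabelBinarizer`, is to expand the lists into one indicator column per label. The toy label lists and the `label_` column prefix are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists in the same shape as the 'multilabels' column
multilabels = [["clean"], ["ad", "rant_no_visit"], ["ad"]]

binarizer = MultiLabelBinarizer()
indicator = binarizer.fit_transform(multilabels)  # shape: (n_reviews, n_labels)

# One 0/1 column per label, in the sorted order stored in binarizer.classes_
label_frame = pd.DataFrame(
    indicator, columns=[f"label_{name}" for name in binarizer.classes_]
)
print(label_frame)
```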


In [13]:
# ===================================================================
## LLM-based Multi-label Classification
# ===================================================================

from huggingface_hub import InferenceClient
import json
import time

# Ensure you're logged in
# huggingface_hub.login(token=userdata.get('HF_TOKEN'))
client = InferenceClient()

# Prompt improved to handle multi-label classification and request JSON output
def classify_review_llm_multilabel(review_text, category):
    prompt = f"""
    As an AI assistant for Google Maps, analyze the following review for a place in the category "{category}".
    A review can have one or more of the following violation labels. If no violations are found, classify it as "clean".

    Possible Labels:
    - "ad": Contains advertisements, promotions, or external links.
    - "irrelevant": The content is not related to the given category.
    - "rant_no_visit": A strong complaint that shows no evidence of a real visit.

    Provide your answer ONLY in a valid JSON format with a single key "labels" which is a list of strings.
    For example: {{"labels": ["clean"]}} or {{"labels": ["ad", "irrelevant"]}}.

    Review Text:
    "{review_text}"

    JSON Output:
    """

    # Retry a bounded number of times while the model is loading
    max_attempts = 3
    for attempt in range(max_attempts):
        try:
            response = client.text_generation(prompt, model="mistralai/Mistral-7B-Instruct-v0.2", max_new_tokens=100, temperature=0.1)

            # Keep only the outermost {...} span of the response
            json_part = response[response.find('{'):response.rfind('}')+1]
            if json_part:
                return json.loads(json_part)
            return {"labels": ["error_parsing"]}
        except Exception as e:
            if "is currently loading" in str(e) and attempt < max_attempts - 1:
                print("Model is loading, retrying...")
                time.sleep(15)
                continue
            return {"labels": [f"error_{str(e)}"]}

# Test on a small sample, including the advertisement reviews you just created
sample_df_llm = df.tail(10).copy() # Take the last 10 reviews, including fake ones

llm_results = sample_df_llm.apply(
    lambda row: classify_review_llm_multilabel(row['text'], row['category']),
    axis=1
)

sample_df_llm['llm_labels'] = [res.get('labels', ['error']) for res in llm_results]

print("\nMulti-label classification results from LLM (Mistral):")
print(sample_df_llm[['text', 'multilabels', 'llm_labels']])


Multi-label classification results from LLM (Mistral):
                                                    text multilabels  \
25777  First time I tried their food will say it wasn...     [clean]   
39737  I have been going here for years. All the staf...     [clean]   
35036                                            Awesome     [clean]   
1534                              Very helpful and nice.     [clean]   
42774  A nice beach!  Fairly busy, arrive early to ge...     [clean]   
42827  A nice beach!  Fairly busy, arrive early to ge...     [clean]   
22297  Good food, better if you eat seafood.  Friendl...     [clean]   
22299  Good food, better if you eat seafood.  Friendl...     [clean]   
19157                 Food was good, staff was friendly.     [clean]   
36516  Always great meats, great prices,  great servi...     [clean]   

                                              llm_labels  
25777  [error_Model mistralai/Mistral-7B-Instruct-v0....  
39737  [error_Model mistralai/Mis
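
Even when the endpoint responds, a generated answer may wrap the JSON in extra text or return labels outside the four defined in the prompt. A minimal, standard-library sketch of a stricter parser is shown below. The helper name `parse_llm_labels` and the constant `ALLOWED_LABELS` are hypothetical and are not part of the notebook.

```python
import json

ALLOWED_LABELS = {"ad", "irrelevant", "rant_no_visit", "clean"}

def parse_llm_labels(response_text):
    """Extract a validated label list from raw model output (hypothetical helper)."""
    start, end = response_text.find('{'), response_text.rfind('}')
    if start == -1 or end <= start:
        return ["error_parsing"]
    try:
        payload = json.loads(response_text[start:end + 1])
    except json.JSONDecodeError:
        return ["error_parsing"]
    raw_labels = payload.get("labels", [])
    if not isinstance(raw_labels, list):
        return ["error_parsing"]
    # Drop anything that is not one of the labels defined in the prompt
    labels = [l for l in raw_labels if isinstance(l, str) and l in ALLOWED_LABELS]
    return labels or ["error_parsing"]

print(parse_llm_labels('Sure! {"labels": ["ad", "spam"]} Hope that helps.'))
print(parse_llm_labels("no json here"))
```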

## Day 2 Complete: Saving the Final Augmented Dataset

All feature engineering and modeling steps for Day 2 are complete. We now save the final, fully-enriched DataFrame. This file contains all the necessary data and labels for the evaluation and optimization tasks in Day 3.

In [14]:
# ===================================================================
## Save Enriched Dataset for Day 3
# ===================================================================

# Save DataFrame enriched with all features and labels
final_output_filename = 'final_augmented_reviews_for_day3.csv'
df.to_csv(final_output_filename, index=False)

print(f"Successfully saved final DataFrame to file '{final_output_filename}'.")
print("This is the required input for Day 3.")

Successfully saved final DataFrame to file 'final_augmented_reviews_for_day3.csv'.
This is the required input for Day 3.
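
One caveat when reloading this file: CSV has no list type, so a column such as `multilabels` is written as text (for example `"['clean']"`) and comes back as a string. A sketch of restoring the lists on load with `ast.literal_eval` follows; it uses an in-memory buffer and invented toy rows rather than the real file.

```python
import ast
import io

import pandas as pd

# Toy rows, invented purely for illustration
frame = pd.DataFrame({
    "text": ["Nice spot", "Visit www.example.com"],
    "multilabels": [["clean"], ["ad", "irrelevant"]],
})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Plain read: the list column comes back as a string
buffer.seek(0)
as_strings = pd.read_csv(buffer)

# Read with a converter: the strings are parsed back into Python lists
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"multilabels": ast.literal_eval})

print(type(as_strings["multilabels"].iloc[0]))
print(restored["multilabels"].iloc[1])
```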
