## Detection Tasks

Detect the following in user reviews:
- Spam
- Advertisements
- Irrelevant content
- Rants from users who have likely never visited the location

In [8]:
%pip install pandas numpy matplotlib seaborn huggingface_hub

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 25.0.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


# Setup

**Task:** Build a Python script using Hugging Face InferenceClient to classify reviews with Gemma 3 12B and Qwen3 8B. Implement multitask prompt.

- Input: CSV 
- Output: CSV with predictions (gemma_pred, qwen_pred)
- Models: Gemma 3 12B, Qwen3 8B
- Time: 3 hours
- Dependencies: huggingface_hub, P1’s data, P2’s prompts

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from huggingface_hub import login

login(token='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')

  from .autonotebook import tqdm as notebook_tqdm


In [3]:
# Set up Hugging Face InferenceClient for Gemma 3 12B and Qwen3 8B
from huggingface_hub import InferenceClient


gemma_client = InferenceClient(model='google/gemma-3-12b')
qwen_client = InferenceClient(model='qwen/Qwen3-8B')

In [4]:
# Load dataset (replace 'data.csv' with your actual file)
df = pd.read_csv('cleaned_reviews_noempty.csv')
df.head()

Unnamed: 0,store_name,rating,review,reviewer_name
0,49 SEATS,5,wowowow great vibes and food!! super eccentric...,Hannah Eva
1,49 SEATS,4,We had the classic pasta and fish n chips with...,S dssp
2,49 SEATS,5,Its an amazing restaurant with good vibes,Sanjith
3,49 SEATS,5,great atmosphere,Vivian L
4,49 SEATS,5,Great atmosphere and vibes!,Jayden


In [5]:
# Extract reviews from dataframe
reviews = df['review'].tolist()

In [6]:
# Basic exploration of the dataset
print('Shape:', df.shape)
print('Columns:', df.columns.tolist())
print('Info:')
df.info()
print('Missing values:')
print(df.isnull().sum())

Shape: (1782, 4)
Columns: ['store_name', 'rating', 'review', 'reviewer_name']
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1782 entries, 0 to 1781
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   store_name     1782 non-null   object
 1   rating         1782 non-null   int64 
 2   review         1782 non-null   object
 3   reviewer_name  1782 non-null   object
dtypes: int64(1), object(3)
memory usage: 55.8+ KB
Missing values:
store_name       0
rating           0
review           0
reviewer_name    0
dtype: int64


## Multitask Inference Pipeline

All tasks are handled by a single model and prompt.

In [8]:
# Read single-task prompts from .txt files using a helper function
def read_prompt(filename):
    # Input: filename (str) - path to the .txt file containing the prompt
    # Output: prompt (str) - the prompt text read from the file
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read()

In [9]:
return_type_prompt = "After clarifying, return a dictionary with key are Ad, Irr, Rant, or Val, value : 1 for detected, 0 for not detected."

In [10]:
# Read multitask prompt from .txt file using the helper function
# Input: filename (str) - path to the .txt file containing the prompt
# Output: prompt (str) - the prompt text read from the file
multitask_prompt = read_prompt('few_shot_prompt.txt')  # Should cover spam, advertisements, irrelevant content, and rants

In [25]:
chosen_multitask_client = gemma_client  # choose your multitask model

In [30]:
# Multitask pipeline: one model and prompt for all four aspects
multitask_model = chosen_multitask_client

spam_preds = []
ad_preds = []
irrelevant_preds = []
rant_preds = []
for review in reviews:
    # Input: review (str) - a single review text
    # Output: pred (str or dict) - prediction output from model
    # Append return_type_prompt to multitask_prompt for each review
    prompt = multitask_prompt + "\n" + return_type_prompt
    pred = multitask_model.text_generation(
        prompt.format(
            clean_text=review,
            rating=df.loc[reviews.index(review), 'rating'],
            store_name=df.loc[reviews.index(review), 'store_name'],
            reviewer_name=df.loc[reviews.index(review), 'reviewer_name']
        )
    )
    # pred is expected to be a dictionary with keys: 'Ad', 'Irr', 'Rant', 'Val'
    spam_preds.append(pred.get('Ad', 0))
    ad_preds.append(pred.get('Ad', 0))
    irrelevant_preds.append(pred.get('Irr', 0))
    rant_preds.append(pred.get('Rant', 0))

# Save multitask predictions in separate binary columns
# Input: reviews (list of str), spam_preds/ad_preds/irrelevant_preds/rant_preds (list of int)
# Output: multitask_df (pd.DataFrame), CSV file
multitask_df = pd.DataFrame({
    'review': reviews,
    'spam_pred': spam_preds,
    'ad_pred': ad_preds,
    'irrelevant_pred': irrelevant_preds,
    'rant_pred': rant_preds
})
multitask_df.to_csv('multitask_predictions.csv', index=False)

RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-68af0f55-035ffe49181eb12d5d38b7fa;b06ecbb8-d809-4ca0-8489-1f2979b79cfd)

Repository Not Found for url: https://huggingface.co/api/models/google/gemma-3-12b?expand=inferenceProviderMapping.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated. For more details, see https://huggingface.co/docs/huggingface_hub/authentication

# Data structure

- Input: csv file
- Output: prediction dataframe

In [18]:
%pip install pandas numpy matplotlib seaborn huggingface_hub

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 25.0.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [17]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [20]:
from huggingface_hub import InferenceClient

gemma_client = InferenceClient(model='google/gemma-3-12b')
qwen_client = InferenceClient(model='qwen/Qwen3-8B')

def read_prompt(filename):
    # Input: filename (str) - path to the .txt file containing the prompt
    # Output: prompt (str) - the prompt text read from the file
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read()
    
multi_prompt = read_prompt('few_shot_prompt.txt')
return_type_prompt = "After clarifying, return a dictionary with keys:Ad, Irr, Rant, Val ; and value : 1 for detected, 0 for not detected."
chosen_multitask_client = gemma_client

In [21]:
import pandas as pd

def classify_reviews(csv_file):
    # Read the CSV file
    df = pd.read_csv(csv_file)

    # Ensure the necessary columns are present
    if 'review' not in df.columns:
        raise ValueError("CSV must contain a 'review' column.")

    # Get the reviews
    reviews = df['review'].tolist()

    # Call the multitask model for predictions
    spam_preds = []
    ad_preds = []
    irrelevant_preds = []
    rant_preds = []

    for review in reviews:
        # Use multitask_prompt for each review
        prompt = multi_prompt + "\n" + return_type_prompt
        response = chosen_multitask_client.text_generation(
            prompt.format(
                clean_text=review,
                rating=df.loc[reviews.index(review), 'rating'] if 'rating' in df.columns else '',
                store_name=df.loc[reviews.index(review), 'store_name'] if 'store_name' in df.columns else '',
                reviewer_name=df.loc[reviews.index(review), 'reviewer_name'] if 'reviewer_name' in df.columns else ''
            )
        )

        # Parse the response (pseudo-code, replace with actual parsing logic)
        parsed = parse_model_response(response)
        spam_preds.append(parsed.get('Ad', 0))
        ad_preds.append(parsed.get('Ad', 0))
        irrelevant_preds.append(parsed.get('Irr', 0))
        rant_preds.append(parsed.get('Rant', 0))

    # Create a DataFrame with the predictions
    predictions_df = pd.DataFrame({
        'review': reviews,
        'spam_pred': spam_preds,
        'ad_pred': ad_preds,
        'irrelevant_pred': irrelevant_preds,
        'rant_pred': rant_preds
    })

    return predictions_df


In [23]:
pred_df = classify_reviews('cleaned_reviews_noempty.csv')

RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-68af1a97-2208f5117907aafb7013c2eb;6f5062a8-f696-4fed-afbf-0da0313b85fe)

Repository Not Found for url: https://huggingface.co/api/models/google/gemma-3-12b?expand=inferenceProviderMapping.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated. For more details, see https://huggingface.co/docs/huggingface_hub/authentication

In [None]:
# Download pred_df as a CSV file
pred_df.to_csv('file_name.csv', index=False)

NameError: name 'pred_df' is not defined