## Policy F: Classify the intent of the reviews

Idea - we want a quantitative score that answers “how likely is this review to be genuine?”

**So How?**

__Zero-shot classifier__
- Model: facebook/bart-large-mnli (or MoritzLaurer/deberta-v3-large-zeroshot-v2)
- Provide the candidate labels as plain English strings:
  “genuine”, “spam”, “advertising”, “competitor attack”, “irrelevant”

__Model returns a probability for each label__
- Score: S(intent) = P("genuine"), which is already on a [0, 1] scale.
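
To make the score concrete: with `multi_label=False`, the Hugging Face zero-shot pipeline takes one entailment logit per candidate label and applies a softmax across the labels, which is why the scores sum to roughly 1. A minimal sketch of that last step in plain Python follows. The logit values are made up for illustration and do not come from any model, and `softmax_over_labels` is a hypothetical helper name.

```python
import math
from typing import Dict


def softmax_over_labels(entail_logits: Dict[str, float]) -> Dict[str, float]:
    """Turn one entailment logit per label into a distribution over labels."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(entail_logits.values())
    exps = {lbl: math.exp(v - peak) for lbl, v in entail_logits.items()}
    total = sum(exps.values())
    return {lbl: e / total for lbl, e in exps.items()}


# Hypothetical logits, chosen only to illustrate the arithmetic.
example_logits = {
    "genuine": 2.0,
    "spam": -1.0,
    "advertising": 0.5,
    "competitor attack": 0.0,
    "irrelevant": -0.5,
}

probs = softmax_over_labels(example_logits)
s_intent = probs["genuine"]  # S(intent) = P("genuine")
print(round(sum(probs.values()), 6), round(s_intent, 3))
```

Because the labels compete for the same probability mass, adding, removing, or rewording a label changes P("genuine") even for the same review.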

__“Irrelevancy” - How do we judge that?__

Idea - does the text actually talk about this place?

Ex: let's say the review is about “Baskin-Robbins ice cream” but the location is “Domino's Pizza”.

Cosine similarity between the review embedding and the embedding of each business field, where cos01 is the cosine similarity rescaled from [-1, 1] to [0, 1]:
- sim_name = cos01(emb_text, emb_name)
- sim_desc = cos01(emb_text, emb_desc)
- sim_cat = cos01(emb_text, emb_cat)
- S(relevancy) = max(sim_name, sim_desc, sim_cat)

In [1]:
from typing import Dict, List, Optional, Tuple
import numpy as np
import pandas as pd
from transformers import pipeline
from sentence_transformers import SentenceTransformer



In [2]:
import transformers
transformers.__version__

'4.55.4'

In [3]:
# configuration
INTENT_LABELS = ['genuine', 'spam', 'advertising', 'competitor attack', "incentivize", "mistaken identity"]

# zero-shot classifier (ZSC) model
ZSC_MODEL_NAME = "facebook/bart-large-mnli"

# Model holders for initial state
_ZSC_PIPELINE = None



In [4]:
# utility functions

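def get_zero_shot_pipeline(model_name: str = ZSC_MODEL_NAME):
    """Return a cached zero-shot pipeline, reloading it if the model changes."""
    global _ZSC_PIPELINE
    # Build on first use, or rebuild when a different model is requested;
    # otherwise the cached pipeline would silently ignore model_name.
    if _ZSC_PIPELINE is None or _ZSC_PIPELINE.model.name_or_path != model_name:
        _ZSC_PIPELINE = pipeline(
            task="zero-shot-classification",
            model=model_name,
        )
    return _ZSC_PIPELINE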


In [5]:
# intent classification
def score_intent(
    text: str,
    labels: Optional[List[str]] = None,
    model_name: str = ZSC_MODEL_NAME,
) -> Dict[str, float]:
    """Return {label: probability} plus "S_intent" = P("genuine")."""
    use_labels = labels or INTENT_LABELS

    # Edge case: missing (e.g. NaN) or blank text gets probability 0 for every
    # label, so S_intent is 0 as well.
    if not isinstance(text, str) or not text.strip():
        base = {lbl: 0.0 for lbl in use_labels}
        base["S_intent"] = 0.0
        return base

    zsc = get_zero_shot_pipeline(model_name)
    res = zsc(
        sequences=text,
        candidate_labels=use_labels,
        multi_label=False,  # softmax across labels: one distribution that sums to ~1
    )

    # Map each returned label to its score (the pipeline sorts them by score).
    scores = dict(zip(res["labels"], res["scores"]))

    # Make sure every requested label has an entry, in case one is missing.
    for lbl in use_labels:
        scores.setdefault(lbl, 0.0)

    # S(intent) = P("genuine")
    scores["S_intent"] = float(scores.get("genuine", 0.0))
    return scores


def batch_score_intent(
    texts: List[str],
    labels: Optional[List[str]] = None,
    model_name: str = ZSC_MODEL_NAME,
    batch_size: int = 16,
) -> List[Dict[str, float]]:
    """
    Batched intent scoring for throughput. Returns a list of per-text dicts,
    in the same order as `texts`.

    Note: unlike score_intent, blank strings are not special-cased here.
    """
    zsc = get_zero_shot_pipeline(model_name)
    use_labels = labels or INTENT_LABELS
    outputs: List[Dict[str, float]] = []

    for i in range(0, len(texts), batch_size):
        chunk = texts[i:i + batch_size]
        # Pass batch_size through so the pipeline batches its forward passes;
        # without it the pipeline defaults to a batch size of 1.
        res_list = zsc(
            sequences=chunk,
            candidate_labels=use_labels,
            multi_label=False,
            batch_size=batch_size,
        )
        # A single-item chunk can come back as one dict instead of a list.
        if isinstance(res_list, dict):
            res_list = [res_list]
        for res in res_list:
            scores = dict(zip(res["labels"], res["scores"]))
            for lbl in use_labels:
                scores.setdefault(lbl, 0.0)
            scores["S_intent"] = float(scores.get("genuine", 0.0))
            outputs.append(scores)
    return outputs

In [6]:
df = pd.read_csv('/Users/evan/Documents/Projects/TikTok-TechJam-2025/data_gpt_labeler/final_data_2.csv')

In [None]:
df.head()


Unnamed: 0.1,Unnamed: 0,rating,text,business_name,business_category,business_description,_id
0,848694,5,Excellent beach for family activities great su...,'Ehukai Beach Park,"['Park', 'Public beach', 'Tourist attraction']",Popular surfing beach offering massive wintert...,1.1730942640485394e+20_1605375558437
1,848706,5,My favorite Beach for surfing on Oahu North Sh...,'Ehukai Beach Park,"['Park', 'Public beach', 'Tourist attraction']",Popular surfing beach offering massive wintert...,1.1249899958787118e+20_1570685676722
2,848685,5,Usually a parking spot available and a nice sp...,'Ehukai Beach Park,"['Park', 'Public beach', 'Tourist attraction']",Popular surfing beach offering massive wintert...,1.1677373083828122e+20_1618554513347
3,848711,5,Nice small beach. Great place to watch surfers,'Ehukai Beach Park,"['Park', 'Public beach', 'Tourist attraction']",Popular surfing beach offering massive wintert...,1.0664503467931671e+20_1541146996259
4,848700,5,Awesome spot for surfing!,'Ehukai Beach Park,"['Park', 'Public beach', 'Tourist attraction']",Popular surfing beach offering massive wintert...,1.1425088661032362e+20_1612418675718


In [7]:
review = df["text"].iloc[0]
print(f"Review: {review}")


# Intent
intent_scores = score_intent(review)
print("Intent scores:", intent_scores)
print("S(intent) =", intent_scores["S_intent"])


Review: Very clean place and great customer service. Poke is pretty good but a little pricey.  Had a 2 choice bowl with sushi rice.  I would return because of the service and location but i would go to other spots that are cheaper if I have more time.


Device set to use mps:0


Intent scores: {'genuine': 0.4300074875354767, 'incentivize': 0.1709793210029602, 'advertising': 0.13390693068504333, 'mistaken identity': 0.12189310789108276, 'competitor attack': 0.10178375244140625, 'spam': 0.04142928123474121, 'S_intent': 0.4300074875354767}
S(intent) = 0.4300074875354767


In [10]:
sample_df = df.sample(n=10, random_state=42).reset_index(drop=True).copy()
reviews = sample_df["text"].fillna("").tolist()

intent_scores = batch_score_intent(reviews)
intent_scores_df = pd.DataFrame(intent_scores)
result = pd.concat([sample_df, intent_scores_df], axis=1)
result


Unnamed: 0.1,Unnamed: 0,rating,text,business_name,business_category,business_description,_id,genuine,incentivize,advertising,mistaken identity,competitor attack,spam,S_intent
0,16252,5,"Great place, one of the highlights of The Isla...",Hali'imaile General Store,"['Hawaiian restaurant', 'Restaurant']",Acclaimed destination for Hawaiian regional cu...,1.0236590088831161e+20_1610222520503,0.849657,0.094062,0.022382,0.016693,0.012094,0.005111,0.849657
1,14684,5,Best miso soup in a long time,Sushi Bushido,"['Sushi restaurant', 'Japanese restaurant']",Specialty sushi rolls join hot dishes on the m...,1.179536790737905e+20_1597777204209,0.683318,0.175009,0.062317,0.029739,0.03732,0.012298,0.683318
2,11731,5,Great location! Staff is always friendly and q...,Starbucks,"['Coffee shop', 'Breakfast restaurant', 'Cafe'...",Seattle-based coffeehouse chain known for its ...,1.1798840919845287e+20_1613631789054,0.819729,0.08673,0.053284,0.017627,0.015356,0.007274,0.819729
3,14742,5,Always great food and service!,Moe's,"['Mexican restaurant', 'Burrito restaurant', '...",Counter-serve chain dishing up Southwestern st...,1.0742434675805764e+20_1594022832344,0.888305,0.057469,0.018242,0.016087,0.013208,0.00669,0.888305
4,14521,5,Coconut milk latte sooo good! As everyone know...,Starbucks,"['Coffee shop', 'Breakfast restaurant', 'Cafe'...",Seattle-based coffeehouse chain known for its ...,1.1706393437520925e+20_1554143116159,0.148371,0.347174,0.329135,0.046267,0.10094,0.028113,0.148371
5,16340,4,Good food but really slow service,Hali'imaile General Store,"['Hawaiian restaurant', 'Restaurant']",Acclaimed destination for Hawaiian regional cu...,1.0484460410969786e+20_1551420331386,0.230962,0.221878,0.107548,0.142736,0.236232,0.060644,0.230962
6,10576,5,The service in here is great and the waiting t...,Starbucks,"['Coffee shop', 'Breakfast restaurant', 'Cafe'...",Seattle-based coffeehouse chain known for its ...,1.043580730626666e+20_1607959928323,0.917829,0.04782,0.011148,0.009085,0.009527,0.004591,0.917829
7,15202,5,Wonderful atmosphere picture windows of the oc...,Bull Shed Restaurant,"['Steak house', 'American restaurant', 'Bar', ...",Steakhouse serving upscale surf 'n' turf in a ...,1.0848833206673172e+20_1523836567279,0.087585,0.822015,0.045865,0.019967,0.015035,0.009534,0.087585
8,16363,5,Great food but a little pricey!,Hali'imaile General Store,"['Hawaiian restaurant', 'Restaurant']",Acclaimed destination for Hawaiian regional cu...,1.010571907774781e+20_1528509334392,0.114429,0.241146,0.165781,0.161524,0.281029,0.036091,0.114429
9,10439,1,Seriously avoid this place like the plague. Ba...,Starbucks,"['Coffee shop', 'Breakfast restaurant', 'Cafe'...",Seattle-based coffeehouse chain known for its ...,1.020784104639533e+20_1575927435525,0.775335,0.028989,0.021258,0.014963,0.048451,0.111003,0.775335


#### Findings:

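The classifier only sees the review text, so it lacks context about the place being reviewed and may label a review as "genuine" even when it is not about this business. The business name, category, and description need to be passed to the model as well.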
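
One way to give the classifier that context is to prepend the business fields to the review before scoring. The sketch below is an assumption, not something evaluated in this notebook: `build_contextual_input`, `_as_text`, and the template wording are hypothetical, and the example row is made up. The column names are the ones in `df`. Since `business_category` is stored as a stringified list, it is parsed with `ast.literal_eval`.

```python
import ast
from typing import Mapping


def _as_text(value: object) -> str:
    """Return value as stripped text; treat None and NaN as empty."""
    if value is None or (isinstance(value, float) and value != value):
        return ""
    return str(value).strip()


def build_contextual_input(row: Mapping[str, object]) -> str:
    """Hypothetical helper: combine business metadata with the review text."""
    name = _as_text(row.get("business_name"))
    desc = _as_text(row.get("business_description"))
    review = _as_text(row.get("text"))

    # business_category looks like "['Park', 'Public beach']"; fall back to
    # the raw string if it cannot be parsed as a Python literal.
    raw_cat = _as_text(row.get("business_category"))
    try:
        parsed = ast.literal_eval(raw_cat)
    except (ValueError, SyntaxError):
        parsed = raw_cat
    if isinstance(parsed, (list, tuple)):
        category = ", ".join(str(c) for c in parsed)
    else:
        category = str(parsed)

    return "\n".join([
        f"Business: {name}",
        f"Category: {category}",
        f"Description: {desc}",
        f"Review: {review}",
    ])


# Made-up row for illustration; a pandas row (Series) also supports .get().
example_row = {
    "text": "Best ice cream in town!",
    "business_name": "Example Pizza Place",
    "business_category": "['Pizza restaurant', 'Restaurant']",
    "business_description": "Hypothetical pizzeria",
}
contextual_text = build_contextual_input(example_row)
print(contextual_text)
```

The resulting string could be passed to `score_intent` in place of the raw review. Whether this actually improves the scores would need to be checked against labeled data.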

### Irrelevancy/Relevancy using Cosine Similarity

“Irrelevancy” - How do we judge that?

Idea - does the text actually talk about this place?

Ex: let's say the review is about “Baskin-Robbins ice cream” but the location is “Domino's Pizza”.

Cosine similarity between the review embedding and the embedding of each business field, where cos01 is the cosine similarity rescaled from [-1, 1] to [0, 1]:
- sim_name = cos01(emb_text, emb_name)
- sim_desc = cos01(emb_text, emb_desc)
- sim_cat = cos01(emb_text, emb_cat)
- S(relevancy) = max(sim_name, sim_desc, sim_cat)
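
The formulas above can be sketched end to end in plain Python. The vectors here are tiny made-up stand-ins for real sentence embeddings, and `relevancy_score` is a hypothetical name; in the notebook the embeddings would come from the sentence-transformer model instead.

```python
import math
from typing import Sequence


def cos_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity in [-1, 1]; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / denom if denom else 0.0


def cos01(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity rescaled from [-1, 1] to [0, 1]."""
    return (cos_sim(a, b) + 1.0) / 2.0


def relevancy_score(
    emb_text: Sequence[float],
    emb_name: Sequence[float],
    emb_desc: Sequence[float],
    emb_cat: Sequence[float],
) -> float:
    """S(relevancy) = max(sim_name, sim_desc, sim_cat)."""
    sim_name = cos01(emb_text, emb_name)
    sim_desc = cos01(emb_text, emb_desc)
    sim_cat = cos01(emb_text, emb_cat)
    return max(sim_name, sim_desc, sim_cat)


# Toy 3-dimensional "embeddings", made up for illustration.
emb_text = [1.0, 0.0, 0.0]
emb_name = [0.0, 1.0, 0.0]   # orthogonal to the text: cosine 0
emb_desc = [-1.0, 0.0, 0.0]  # opposite to the text: cosine -1
emb_cat = [1.0, 1.0, 0.0]    # 45 degrees from the text

s_relevancy = relevancy_score(emb_text, emb_name, emb_desc, emb_cat)
print(s_relevancy)
```

Because of the `max`, a review that matches any one of the three fields counts as relevant. Note also that orthogonal vectors map to 0.5 rather than 0 after rescaling, so any threshold on S(relevancy) would need to take that into account.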

In [None]:
# lightweight general-purpose embedder
EMBED_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

# Model holders for initial state
_EMBED_MODEL = None

In [None]:
# utility functions

def _cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; returns 0.0 if either vector has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    return float(np.dot(a, b) / denom)


def _cos01(x: float) -> float:
    """Map a cosine similarity from [-1, 1] to [0, 1]."""
    return (x + 1.0) / 2.0


def get_embedder(model_name: str = EMBED_MODEL_NAME) -> SentenceTransformer:
    """Load the sentence embedder on first use and cache it."""
    global _EMBED_MODEL
    # Note: once a model is cached, a different model_name is not reloaded.
    if _EMBED_MODEL is None:
        _EMBED_MODEL = SentenceTransformer(model_name)
    return _EMBED_MODEL