# App Review Trend Analysis AI Agent

## Objective
Build an AI agent that processes daily app reviews, identifies recurring
issues and requests as topics, deduplicates semantically similar topics,
and generates a rolling trend analysis table.


In [None]:
!pip install pandas numpy scikit-learn sentence-transformers




In [None]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity




In [None]:
data = {
    "date": [
        "2024-06-01","2024-06-01",
        "2024-06-02","2024-06-02",
        "2024-06-03","2024-06-03","2024-06-03"
    ],
    "review": [
        "Delivery partner was rude",
        "Food arrived cold and stale",
        "Delivery guy behaved badly",
        "App map is not working properly",
        "Food quality was poor",
        "Bring back 10 minute delivery",
        "Delivery person was impolite"
    ]
}

df = pd.DataFrame(data)
df["date"] = pd.to_datetime(df["date"])
df


,date,review
0,2024-06-01,Delivery partner was rude
1,2024-06-01,Food arrived cold and stale
2,2024-06-02,Delivery guy behaved badly
3,2024-06-02,App map is not working properly
4,2024-06-03,Food quality was poor
5,2024-06-03,Bring back 10 minute delivery
6,2024-06-03,Delivery person was impolite


In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")



In [None]:
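def deduplicate_topics(reviews, threshold=0.75):
    """Greedily group reviews whose cosine similarity to a seed review is >= threshold."""
    embeddings = model.encode(reviews)
    similarity = cosine_similarity(embeddings)  # pairwise matrix, shape (n, n)

    topics = []
    used = set()  # indices already assigned to a cluster

    for i, review in enumerate(reviews):
        if i in used:
            continue

        # Start a new cluster seeded by the first unassigned review.
        cluster = [review]
        used.add(i)

        # Attach every later, still-unassigned review that is close to the seed.
        for j in range(i + 1, len(reviews)):
            if j not in used and similarity[i][j] >= threshold:
                cluster.append(reviews[j])
                used.add(j)

        topics.append(cluster)

    return topics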
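
The grouping above compares each candidate only with the seed review of a cluster, not with every member. The sketch below reproduces that loop on hand-made 2-D vectors so the behaviour can be inspected without downloading a model. The function name `greedy_groups` and the toy vectors are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def greedy_groups(vectors, threshold=0.75):
    """Group row indices whose cosine similarity to a seed row meets the threshold."""
    vectors = np.asarray(vectors, dtype=float)
    # Normalise rows so that a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarity = unit @ unit.T

    groups, used = [], set()
    for i in range(len(unit)):
        if i in used:
            continue
        group = [i]
        used.add(i)
        for j in range(i + 1, len(unit)):
            if j not in used and similarity[i, j] >= threshold:
                group.append(j)
                used.add(j)
        groups.append(group)
    return groups

# Toy vectors at 0, 30 and 60 degrees (made up for illustration).
# Neighbours are similar (cos 30 deg ~ 0.87) but the two ends are not (cos 60 deg = 0.5).
angles = np.deg2rad([0, 30, 60])
toy_vectors = np.column_stack([np.cos(angles), np.sin(angles)])

groups = greedy_groups(toy_vectors)
print(groups)
```

Row 2 is close to row 1, but row 1 has already joined the cluster seeded by row 0, and row 2 is not close enough to that seed. It therefore ends up in its own group, which shows that the result depends on review order.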


In [None]:
def get_topic_name(cluster):
    return cluster[0]
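
Naming a topic after the first review in its cluster makes the label depend on arrival order. One possible alternative, sketched here as an assumption rather than the notebook's method, is to pick the medoid: the member with the highest mean cosine similarity to the rest of the cluster. The helper `pick_medoid`, the placeholder labels, and the toy vectors are hypothetical.

```python
import numpy as np

def pick_medoid(cluster, embeddings):
    """Return the cluster member with the highest mean cosine similarity to all members."""
    embeddings = np.asarray(embeddings, dtype=float)
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    mean_similarity = (unit @ unit.T).mean(axis=1)
    return cluster[int(np.argmax(mean_similarity))]

# Placeholder labels with made-up vectors at 0, 40 and 50 degrees.
labels = ["review A", "review B", "review C"]
angles = np.deg2rad([0, 40, 50])
toy_embeddings = np.column_stack([np.cos(angles), np.sin(angles)])

medoid = pick_medoid(labels, toy_embeddings)
print(medoid)
```

With real data, the embeddings passed in would be the rows of `model.encode(cluster)`.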


In [None]:
records = []

# Deduplicate within each day's batch and record one row per topic per day.
for date, group in df.groupby("date"):
    clusters = deduplicate_topics(group["review"].tolist())
    for cluster in clusters:
        records.append({
            "date": date,
            "topic": get_topic_name(cluster)
        })

topic_df = pd.DataFrame(records)
topic_df


,date,topic
0,2024-06-01,Delivery partner was rude
1,2024-06-01,Food arrived cold and stale
2,2024-06-02,Delivery guy behaved badly
3,2024-06-02,App map is not working properly
4,2024-06-03,Food quality was poor
5,2024-06-03,Bring back 10 minute delivery
6,2024-06-03,Delivery person was impolite


In [None]:
trend_table = topic_df.pivot_table(
    index="topic",
    columns="date",
    aggfunc="size",
    fill_value=0
)

trend_table


topic,2024-06-01,2024-06-02,2024-06-03
App map is not working properly,0,1,0
Bring back 10 minute delivery,0,0,1
Delivery guy behaved badly,0,1,0
Delivery partner was rude,1,0,0
Delivery person was impolite,0,0,1
Food arrived cold and stale,1,0,0
Food quality was poor,0,0,1
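
The pivot above spans every date in the data. For the rolling view named in the objective, one option is to keep only a trailing window of days ending at a target date and to include days that had no reviews as zero columns. This is a minimal sketch under those assumptions. The function `rolling_trend_table` and its parameters `end_date` and `window_days` are hypothetical names, and `pd.crosstab` is used here in place of `pivot_table`.

```python
import pandas as pd

def rolling_trend_table(topic_df, end_date, window_days=30):
    """Count topics per day over the trailing window that ends at end_date."""
    end = pd.Timestamp(end_date)
    days = pd.date_range(end=end, periods=window_days, freq="D")
    in_window = topic_df[(topic_df["date"] >= days[0]) & (topic_df["date"] <= end)]
    counts = pd.crosstab(in_window["topic"], in_window["date"])
    # Add columns for days inside the window that had no reviews.
    return counts.reindex(columns=days, fill_value=0)

# Same rows as the topic table above.
sample = pd.DataFrame({
    "date": pd.to_datetime([
        "2024-06-01", "2024-06-01",
        "2024-06-02", "2024-06-02",
        "2024-06-03", "2024-06-03", "2024-06-03",
    ]),
    "topic": [
        "Delivery partner was rude",
        "Food arrived cold and stale",
        "Delivery guy behaved badly",
        "App map is not working properly",
        "Food quality was poor",
        "Bring back 10 minute delivery",
        "Delivery person was impolite",
    ],
})

window = rolling_trend_table(sample, end_date="2024-06-03", window_days=2)
print(window)
```

Topics that only occurred before the window start (here, the two from 2024-06-01) drop out of the table.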


In [None]:
trend_table.to_csv("trend_analysis_output.csv")
print("Trend analysis output saved.")


Trend analysis output saved.


## Approach

Daily app reviews are processed as batches. Each batch is embedded with the
`all-MiniLM-L6-v2` sentence-embedding model, and reviews whose cosine
similarity reaches a threshold (0.75 by default) are greedily grouped into a
single topic, named after the first review in the group. Per-day topic counts
are then pivoted into the trend table.

## Assumptions
- Reviews arrive as daily batches.
- Semantically similar complaints should map to a single topic.
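
For similar complaints to map to a single topic across days, topic identities need to persist between batches. One way to sketch this is a small registry that stores one embedding per canonical topic and assigns each new review to the closest existing topic when the similarity meets the threshold. The class `TopicRegistry` and the keyword-based `toy_embed` below are hypothetical. `toy_embed` is a crude stand-in so the example runs without a model, and in the notebook `model.encode` would be passed instead.

```python
import numpy as np

class TopicRegistry:
    """Keeps one unit vector per canonical topic and maps new reviews onto them."""

    def __init__(self, embed, threshold=0.75):
        self.embed = embed          # callable: list of str -> 2-D array-like
        self.threshold = threshold
        self.names = []             # canonical topic names
        self.vectors = []           # matching unit-length vectors

    def assign(self, review):
        vec = np.asarray(self.embed([review]), dtype=float)[0]
        norm = np.linalg.norm(vec)
        if norm > 0:
            vec = vec / norm
        if self.vectors:
            sims = np.stack(self.vectors) @ vec
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return self.names[best]
        # No existing topic is close enough: register a new one.
        self.names.append(review)
        self.vectors.append(vec)
        return review

# Stand-in embedding (assumption): one dimension per keyword.
KEYWORDS = ["delivery", "food", "map"]

def toy_embed(texts):
    return [[float(word in text.lower()) for word in KEYWORDS] for text in texts]

registry = TopicRegistry(embed=toy_embed)
reviews_in_order = [
    "Delivery partner was rude",      # 2024-06-01
    "Food arrived cold and stale",    # 2024-06-01
    "Delivery guy behaved badly",     # 2024-06-02
    "Food quality was poor",          # 2024-06-03
    "Delivery person was impolite",   # 2024-06-03
]
assigned = [registry.assign(text) for text in reviews_in_order]
print(assigned)
```

Because the registry outlives a single day's batch, the later delivery complaints resolve to the topic created on the first day instead of creating new rows in the trend table.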

## Limitations
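- Review data is simulated.
- Deduplication runs within each day's batch only, so similar complaints from
  different days (for example, the three reviews about rude delivery staff)
  appear as separate topics in the trend table.
- Live ingestion and large-scale deployment are out of scope.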


In [None]:
!ls


output	sample_data  trend_analysis_output.csv


In [None]:
!ls sample_data


anscombe.json		      mnist_test.csv
california_housing_test.csv   mnist_train_small.csv
california_housing_train.csv  README.md


In [None]:
import os

# Save the trend table into the output folder, creating it if needed.
os.makedirs("output", exist_ok=True)
trend_table.to_csv("output/trend_analysis_output.csv")

print("CSV saved directly inside output folder.")


CSV saved directly inside output folder.
