# Ebuss: Sentiment-based Product Recommendation System

This notebook covers:
- EDA + Cleaning
- Sentiment model training (LogReg / Naive Bayes / Random Forest code)
- User-based vs Item-based recommender (we deploy Item-based)
- Sentiment-filtered top-5 recommendations
- Deployment steps (Flask + Heroku)

**Deployment link (replace after you deploy):** `https://<your-heroku-app>.herokuapp.com/`


In [None]:
import pandas as pd
import numpy as np

DATA_PATH = 'sample30.csv'
ATTR_PATH = 'Data+Attribute+Description.csv'
df = pd.read_csv(DATA_PATH)
attr = pd.read_csv(ATTR_PATH, encoding='latin1')
print(df.shape)
df.head()

In [None]:
# Attribute descriptions
attr

## 1) EDA and Cleaning

In [None]:
df.isna().mean().sort_values(ascending=False).head(10)

In [None]:
df['reviews_rating'] = pd.to_numeric(df['reviews_rating'], errors='coerce')
print(df['reviews_rating'].describe())

## 2) Text preprocessing and feature extraction
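We apply TF-IDF weighting on top of a hashing-based vectorizer. Hashing maps tokens to a fixed number of columns without storing a vocabulary, which keeps feature extraction fast and the saved pipeline easy to deploy.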
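
As a rough illustration of the hashing-then-TF-IDF idea, the sketch below runs the two steps on three invented reviews. The texts and the small `n_features` value are assumptions chosen for readability, not values from the dataset.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Toy reviews, invented for illustration only
docs = [
    "great product, works great",
    "terrible product, broke quickly",
    "great value",
]

# HashingVectorizer is stateless: tokens are hashed into a fixed number of
# columns, so there is no vocabulary to fit or store.
# alternate_sign=False keeps all feature values non-negative.
featurizer = make_pipeline(
    HashingVectorizer(n_features=2**10, alternate_sign=False),
    TfidfTransformer(),  # learns IDF weights from the hashed counts
)
X_demo = featurizer.fit_transform(docs)

print(X_demo.shape)  # (3, 1024): the column count is fixed by n_features
```

The trade-off is that hashed columns cannot be mapped back to words, so per-feature interpretation is harder than with a vocabulary-based vectorizer.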

In [None]:
df['review_full'] = (df['reviews_title'].fillna('').astype(str) + ' ' + df['reviews_text'].fillna('').astype(str)).str.strip()

df['user_sentiment'] = df['user_sentiment'].astype(str).str.lower()
sent_df = df[df['user_sentiment'].isin(['positive','negative'])].copy().reset_index(drop=True)
y = (sent_df['user_sentiment'] == 'positive').astype(int)
X = sent_df['review_full']
print('Class balance:', y.value_counts(normalize=True))

## 3) Train at least 3 models
Below we include code for:
- Logistic Regression
- Naive Bayes
- Random Forest (heavier; run in Colab for faster compute)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

def make_features():
    # Build fresh transformer instances so pipelines never share fitted state
    return [
        ('hash', HashingVectorizer(stop_words='english', n_features=2**14, alternate_sign=False, ngram_range=(1,2))),
        ('tfidf', TfidfTransformer()),
    ]

def evaluate(pipe):
    # Fit on the training split, then score on the held-out test split
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    prob = pipe.predict_proba(X_test)[:,1]  # probability of the positive class
    return {
        'accuracy': accuracy_score(y_test, pred),
        'f1': f1_score(y_test, pred),
        'roc_auc': roc_auc_score(y_test, prob)
    }

pipelines = {
    'log_reg': Pipeline(make_features() + [('clf', LogisticRegression(max_iter=1000, class_weight='balanced', solver='liblinear'))]),
    'naive_bayes': Pipeline(make_features() + [('clf', MultinomialNB())]),
    'random_forest': Pipeline(make_features() + [('clf', RandomForestClassifier(n_estimators=300, random_state=42, class_weight='balanced_subsample', n_jobs=-1))])
}

for name, pipe in pipelines.items():
    print(name, evaluate(pipe))


Pick the best-performing classifier on the held-out test set (often Logistic Regression) and save the fitted pipeline as a pickle for deployment.
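
One possible way to save and reload the chosen pipeline is sketched below. The toy training data, the file name `sentiment_model.pkl`, and the temporary directory are assumptions made so the example runs on its own. In the notebook you would pickle the pipeline fitted in section 3 into your project folder instead.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny stand-in for the real model, trained on invented examples
toy_texts = ["love it", "great quality", "awful", "waste of money"]
toy_labels = [1, 1, 0, 0]
best_sentiment_model = Pipeline([
    ("hash", HashingVectorizer(n_features=2**10, alternate_sign=False)),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
best_sentiment_model.fit(toy_texts, toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "sentiment_model.pkl")  # hypothetical name

    # Save the whole fitted pipeline (vectorizer + classifier) in one file
    with open(model_path, "wb") as f:
        pickle.dump(best_sentiment_model, f)

    # Reload it the way the deployed app would
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

# The reloaded pipeline should reproduce the original predictions
same_predictions = np.allclose(
    loaded_model.predict_proba(toy_texts),
    best_sentiment_model.predict_proba(toy_texts),
)
print(same_predictions)
```

Pickles are tied to the library versions that produced them, so the deployment environment should pin the same scikit-learn version used for training.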

## 4) Recommendation system
We build and compare:
- User-based CF
- Item-based CF

We deploy **Item-based CF** because the catalog has far fewer products than users, so the item-item similarity matrix is much smaller to compute and store than a user-user one.
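
The size argument can be seen on a toy matrix. The sketch below uses an invented 6-user, 3-item ratings matrix and a small hand-written cosine helper (`cosine_sim` is a hypothetical name, not part of the notebook) to compare the two similarity matrices.

```python
import numpy as np

# Invented ratings: rows = users, columns = items, 0 = not rated
R_toy = np.array([
    [5, 0, 3],
    [4, 0, 0],
    [0, 2, 4],
    [1, 5, 0],
    [0, 4, 5],
    [3, 0, 4],
], dtype=float)

def cosine_sim(M):
    """Cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = M / norms
    return unit @ unit.T

user_sim = cosine_sim(R_toy)    # one row/column per user
item_sim = cosine_sim(R_toy.T)  # one row/column per item

print(user_sim.shape, item_sim.shape)  # (6, 6) (3, 3)
```

Both matrices grow quadratically, so when users greatly outnumber products the item-item matrix is the cheaper one to hold in memory at serving time.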

In [None]:
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

ratings_df = df.dropna(subset=['reviews_rating','reviews_username','name']).copy()
ratings_df['reviews_username'] = ratings_df['reviews_username'].astype(str)
ratings_df['name'] = ratings_df['name'].astype(str)

user_item = ratings_df.pivot_table(index='reviews_username', columns='name', values='reviews_rating', aggfunc='mean')
user_item_filled = user_item.fillna(0.0)
R = sparse.csr_matrix(user_item_filled.values)

users = user_item_filled.index.tolist()
items = user_item_filled.columns.tolist()

# Item-item cosine similarity: R.T is items x users, so S is items x items
S = cosine_similarity(R.T, dense_output=True)
np.fill_diagonal(S, 0.0)
print('Users:', len(users), 'Items:', len(items), 'S shape:', S.shape)

In [None]:
import numpy as np

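def recommend_20(username, n=20):
    """Return up to n unrated items for a user, ranked by item-based CF score."""
    if username not in users:
        return []  # unknown user: no rating history to build on
    uidx = users.index(username)
    user_r = R.getrow(uidx).toarray().ravel()
    rated = user_r > 0  # mask of items this user has already rated
    sims = S[:, rated]  # similarity of every item to each rated item
    ratings = user_r[rated]
    if ratings.size == 0:
        return []
    # Predicted score: similarity-weighted average of the user's own ratings
    scores = sims.dot(ratings) / (np.abs(sims).sum(axis=1) + 1e-9)
    scores[rated] = -np.inf  # never recommend items the user already rated
    top_idx = np.argsort(-scores)[:n]
    return [items[i] for i in top_idx if np.isfinite(scores[i])]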

# Example
example_user = users[0]
recommend_20(example_user, 10)
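
To make the scoring step concrete, here is a small worked example with the same formula on invented numbers. The four items, the similarity matrix `S_toy`, and the user's ratings are all assumptions for illustration.

```python
import numpy as np

items_toy = ["A", "B", "C", "D"]

# Invented item-item similarities with a zeroed diagonal
S_toy = np.array([
    [0.0, 0.9, 0.1, 0.2],
    [0.9, 0.0, 0.3, 0.1],
    [0.1, 0.3, 0.0, 0.8],
    [0.2, 0.1, 0.8, 0.0],
])

# The user rated A with 5 and C with 2; B and D are unrated
user_r = np.array([5.0, 0.0, 2.0, 0.0])

rated = user_r > 0
sims = S_toy[:, rated]  # columns for the rated items A and C
scores = sims @ user_r[rated] / (np.abs(sims).sum(axis=1) + 1e-9)
scores[rated] = -np.inf  # exclude items already rated

ranked = [items_toy[i] for i in np.argsort(-scores) if np.isfinite(scores[i])]
print(ranked)  # ['B', 'D']
print(round(scores[1], 2), round(scores[3], 2))  # 4.25 2.6
```

B scores (0.9·5 + 0.3·2) / 1.2 = 4.25 because it is close to the highly rated A, while D scores (0.2·5 + 0.8·2) / 1.0 = 2.6 because it mostly resembles the poorly rated C.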

## 5) Sentiment-filtered Top-5
Take the top 20 products from CF and keep the 5 whose reviews have the highest average predicted probability of positive sentiment.

In [None]:
# Assume `best_sentiment_model` is your chosen pipeline from section 3.
# For each product, compute average positive probability across reviews.

def avg_positive_sentiment(product_name, model):
    texts = df.loc[df['name'] == product_name, 'review_full'].dropna().astype(str).tolist()
    if not texts:
        return 0.0
    probs = model.predict_proba(texts)[:,1]
    return float(np.mean(probs))

def recommend_5(username, model):
    top20 = recommend_20(username, 20)
    scored = [(p, avg_positive_sentiment(p, model)) for p in top20]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [p for p, s in scored[:5]]
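
The re-ranking step on its own can be sketched without a trained model. In the example below the product names and mean positive probabilities are invented, and `rerank_by_sentiment` is a hypothetical helper that mirrors the sort in `recommend_5`.

```python
# Hypothetical CF candidates, listed in CF rank order
cf_candidates = ["P1", "P2", "P3", "P4", "P5", "P6", "P7"]

# Invented mean positive-sentiment probabilities per product
mean_positive = {
    "P1": 0.62, "P2": 0.91, "P3": 0.40, "P4": 0.88,
    "P5": 0.75, "P6": 0.95, "P7": 0.70,
}

def rerank_by_sentiment(candidates, sentiment, k=5):
    """Keep the k candidates with the highest mean positive probability."""
    # sorted() is stable, so products with equal sentiment keep their CF order;
    # products with no sentiment score default to 0.0
    ordered = sorted(candidates, key=lambda p: sentiment.get(p, 0.0), reverse=True)
    return ordered[:k]

top5 = rerank_by_sentiment(cf_candidates, mean_positive)
print(top5)  # ['P6', 'P2', 'P4', 'P5', 'P7']
```

Note that the final order depends only on sentiment. The CF score decides which products enter the candidate pool but not how they are ranked within it.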


## 6) Deployment
You will submit:
- Notebook
- `model.py`, `app.py`, `templates/index.html`
- Pickle files
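
A minimal sketch of what `app.py` might look like is shown below. It is an assumption about the app's shape, not the submitted code: `recommend_5_stub` and the inline `PAGE` template are hypothetical stand-ins so the example runs without the pickle files or `templates/index.html`.

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

def recommend_5_stub(username):
    """Stand-in for the real recommender; a real app would load the pickles."""
    catalog = {"alice": ["Product A", "Product B"]}  # invented data
    return catalog.get(username, [])

# Inline template for the sketch; a real app would render templates/index.html
PAGE = """
<form method="post">
  <input name="username" placeholder="username">
  <button type="submit">Recommend</button>
</form>
{% if recs is not none %}
  {% if recs %}
    <ul>{% for r in recs %}<li>{{ r }}</li>{% endfor %}</ul>
  {% else %}
    <p>No recommendations found for this user.</p>
  {% endif %}
{% endif %}
"""

@app.route("/", methods=["GET", "POST"])
def index():
    recs = None  # None means the form has not been submitted yet
    if request.method == "POST":
        username = request.form.get("username", "").strip()
        recs = recommend_5_stub(username)
    return render_template_string(PAGE, recs=recs)
```

Locally this could be started with `flask --app app run`. On a hosting platform a production WSGI server would normally serve the `app` object instead of Flask's development server.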

**Heroku link (replace after you deploy):** `https://<your-heroku-app>.herokuapp.com/`