***Problem Statement***

E-commerce is widespread today: instead of taking orders from each customer in person, a company sells through its website, where customers order the products they need. Well-known examples of such e-commerce companies are Amazon, Flipkart, Myntra, Paytm and Snapdeal.

Suppose you are working as a Machine Learning Engineer at an e-commerce company named 'Ebuss'. Ebuss has captured a large market share in many fields and sells products in categories such as household essentials, books, personal care products, medicines, cosmetic items, beauty products, electrical appliances, kitchen and dining products and health care products.

As technology advances, Ebuss must grow quickly to become a major player, because it competes with established market leaders such as Amazon and Flipkart.

As a senior ML Engineer, you are asked to build a model that improves the recommendations shown to users, based on their past reviews and ratings.

To do this, you plan to build a sentiment-based product recommendation system, which involves the following tasks:

- Data sourcing and sentiment analysis
- Building a recommendation system
- Improving the recommendations using the sentiment analysis model
- Deploying the end-to-end project with a user interface

In [53]:
import pandas as pd
from dateutil import parser
from tqdm import tqdm

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
import nltk
import re

# Ensure all required resources are present
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/amit.kumar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/amit.kumar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/amit.kumar/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/amit.kumar/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [54]:
# Read the CSV file into a pandas DataFrame
df = pd.read_csv('sample30.csv')

# Display the first few rows of the DataFrame to get an overview of the data
print("Original DataFrame head:")
display(df.head())

# Display information about the DataFrame, including data types and non-null values
print("\nDataFrame info:")
display(df.info())

# Display the number of missing values in each column
print("\nMissing values per column:")
display(df.isnull().sum())

Original DataFrame head:


| | id | brand | categories | manufacturer | name | reviews_date | reviews_didPurchase | reviews_doRecommend | reviews_rating | reviews_text | reviews_title | reviews_userCity | reviews_userProvince | reviews_username | user_sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AV13O1A8GV-KLJ3akUyj | Universal Music | Movies, Music & Books,Music,R&b,Movies & TV,Mo... | Universal Music Group / Cash Money | Pink Friday: Roman Reloaded Re-Up (w/dvd) | 2012-11-30T06:21:45.000Z | | | 5 | i love this album. it's very good. more to the... | Just Awesome | Los Angeles | | joshua | Positive |
| 1 | AV14LG0R-jtxr-f38QfS | Lundberg | Food,Packaged Foods,Snacks,Crackers,Snacks, Co... | Lundberg | Lundberg Organic Cinnamon Toast Rice Cakes | 2017-07-09T00:00:00.000Z | True | | 5 | Good flavor. This review was collected as part... | Good | | | dorothy w | Positive |
| 2 | AV14LG0R-jtxr-f38QfS | Lundberg | Food,Packaged Foods,Snacks,Crackers,Snacks, Co... | Lundberg | Lundberg Organic Cinnamon Toast Rice Cakes | 2017-07-09T00:00:00.000Z | True | | 5 | Good flavor. | Good | | | dorothy w | Positive |
| 3 | AV16khLE-jtxr-f38VFn | K-Y | Personal Care,Medicine Cabinet,Lubricant/Sperm... | K-Y | K-Y Love Sensuality Pleasure Gel | 2016-01-06T00:00:00.000Z | False | False | 1 | I read through the reviews on here before look... | Disappointed | | | rebecca | Negative |
| 4 | AV16khLE-jtxr-f38VFn | K-Y | Personal Care,Medicine Cabinet,Lubricant/Sperm... | K-Y | K-Y Love Sensuality Pleasure Gel | 2016-12-21T00:00:00.000Z | False | False | 1 | My husband bought this gel for us. The gel cau... | Irritation | | | walker557 | Negative |



DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   id                    30000 non-null  object
 1   brand                 30000 non-null  object
 2   categories            30000 non-null  object
 3   manufacturer          29859 non-null  object
 4   name                  30000 non-null  object
 5   reviews_date          29954 non-null  object
 6   reviews_didPurchase   15932 non-null  object
 7   reviews_doRecommend   27430 non-null  object
 8   reviews_rating        30000 non-null  int64 
 9   reviews_text          30000 non-null  object
 10  reviews_title         29810 non-null  object
 11  reviews_userCity      1929 non-null   object
 12  reviews_userProvince  170 non-null    object
 13  reviews_username      29937 non-null  object
 14  user_sentiment        29999 non-null  object
dtypes: int64(1), object

None


Missing values per column:


id                          0
brand                       0
categories                  0
manufacturer              141
name                        0
reviews_date               46
reviews_didPurchase     14068
reviews_doRecommend      2570
reviews_rating              0
reviews_text                0
reviews_title             190
reviews_userCity        28071
reviews_userProvince    29830
reviews_username           63
user_sentiment              1
dtype: int64

### Data Cleaning and Pre-processing

Based on the missing value analysis, we will decide on the appropriate strategy for handling missing values. This might involve imputation (replacing missing values with a calculated value like the mean, median, or mode) or removal (dropping rows or columns with missing values), depending on the extent and nature of the missing data.

We will also drop columns that are not relevant for our analysis to simplify the dataset and improve performance. Finally, we will ensure all columns have the correct data types for subsequent analysis.
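
The drop-versus-impute decision described above can be made explicit by looking at the fraction of missing values per column. The sketch below is only an illustration: the counts are copied from the notebook output, while the two thresholds and the `triage` helper are hypothetical choices, not part of the original analysis.

```python
# Missing-value counts copied from the notebook output above.
TOTAL_ROWS = 30000
missing_counts = {
    "manufacturer": 141,
    "reviews_date": 46,
    "reviews_didPurchase": 14068,
    "reviews_doRecommend": 2570,
    "reviews_title": 190,
    "reviews_userCity": 28071,
    "reviews_userProvince": 29830,
    "reviews_username": 63,
    "user_sentiment": 1,
}

# Hypothetical thresholds, chosen only for illustration.
DROP_COLUMN_ABOVE = 0.90  # column is almost entirely empty
DROP_ROWS_BELOW = 0.01    # so few rows are affected that dropping them is cheap


def triage(n_missing, total=TOTAL_ROWS):
    """Return (missing fraction, suggested action) for one column."""
    frac = n_missing / total
    if frac > DROP_COLUMN_ABOVE:
        action = "drop column"
    elif frac < DROP_ROWS_BELOW:
        action = "drop rows or fill with a placeholder"
    else:
        action = "impute"
    return frac, action


plan = {col: triage(n) for col, n in missing_counts.items()}
for col, (frac, action) in sorted(plan.items(), key=lambda kv: -kv[1][0]):
    print(f"{col:<22}{frac:>8.2%}  {action}")
```

A rule like this is only a starting point. For example, the notebook drops `manufacturer` as redundant even though very few of its values are missing.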

In [55]:
def clean_reviews_date(df, col='reviews_date'):
    """
    Cleans and standardizes a reviews_date column (modifies `df` in place).

    Steps:
    1. Replace junk values (N/A, null, etc.) with NA
    2. Parse valid dates with pandas (fast)
    3. For still-missing values, try dateutil parser (slower but flexible)
    4. Assign the parsed column back to the dataframe
    5. Print a summary and return the dataframe
    """
    # Step 1: Replace common junk values with NA
    junk_values = ['N/A', 'NA', 'na', 'null', 'None', 'NONE', 'Unknown', '', ' ']
    df[col] = df[col].replace(junk_values, pd.NA)

    # Step 2: First attempt with pandas (fast, flexible)
    parsed = pd.to_datetime(df[col], errors='coerce')

    # Step 3: For rows still missing, try dateutil parser (slower but more powerful)
    mask_missing = parsed.isna() & df[col].notna()
    if mask_missing.sum() > 0:
        def try_parse_date(x):
            try:
                return parser.parse(x, dayfirst=False, fuzzy=True)
            except (ValueError, OverflowError, TypeError):
                return pd.NaT
        parsed.loc[mask_missing] = df.loc[mask_missing, col].apply(try_parse_date)

    # Step 4: Assign cleaned column back
    df[col] = parsed

    # Step 5: Log summary
    total = len(df)
    valid = df[col].notna().sum()
    missing = df[col].isna().sum()
    print(f"[INFO] Cleaned '{col}': {valid}/{total} valid dates, {missing} missing ({missing/total:.2%})")

    return df
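
The important property of the function above is that unparseable values become missing instead of raising an error. That is how the missing count can end up higher than the number of nulls in the raw column. A minimal standard-library sketch of the same idea, assuming every valid value uses the ISO-style format seen in the dataset head (`parse_review_timestamp` is a hypothetical helper, not part of the notebook):

```python
from datetime import datetime, timezone

# Junk markers mirror the list used in clean_reviews_date above.
JUNK_VALUES = {"N/A", "NA", "na", "null", "None", "NONE", "Unknown", "", " "}


def parse_review_timestamp(raw):
    """Return a UTC datetime, or None when the value is missing or unparseable."""
    if raw is None or raw in JUNK_VALUES:
        return None
    try:
        # Matches values such as '2012-11-30T06:21:45.000Z' from the dataset head.
        parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")
    except (TypeError, ValueError):
        return None
    return parsed.replace(tzinfo=timezone.utc)


samples = ["2012-11-30T06:21:45.000Z", "N/A", "not a date", None]
parsed_samples = [parse_review_timestamp(s) for s in samples]
print(parsed_samples)
```

Only the first sample survives; the junk marker, the free-text value and the null all map to `None`.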


In [56]:
# Example of handling missing values (replace with appropriate strategy based on analysis)

print("\nDropping irrelevant columns...")
df = df.drop(columns=['manufacturer'])   # redundant

columns_to_drop = ['reviews_userCity', 'reviews_userProvince']
df.drop(columns=columns_to_drop, inplace=True)
print(f"Dropped columns: {columns_to_drop}")

# Impute missing reviews_doRecommend values from reviews_rating:
#   rating >= 4 -> "Yes"
#   rating <= 2 -> "No"
#   rating == 3 -> "Unknown"
# Existing answers are kept unchanged, so the column ends up mixing the
# original True/False values with the imputed "Yes"/"No"/"Unknown" labels.
def impute_recommend(row):
    current = row['reviews_doRecommend']
    if pd.notna(current):
        return current  # keep the reviewer's own answer
    if row['reviews_rating'] >= 4:
        return 'Yes'
    if row['reviews_rating'] <= 2:
        return 'No'
    return 'Unknown'

df['reviews_doRecommend'] = df.apply(impute_recommend, axis=1)

# fill missing with empty string
df['reviews_title'] = df['reviews_title'].fillna("")

# fill missing with Anonymous string
df['reviews_username'] = df['reviews_username'].fillna("Anonymous")


#convert data type
df = clean_reviews_date(df, 'reviews_date')  # convert to datetime
df['reviews_rating'] = df['reviews_rating'].astype('int')  # ensure integer
df['reviews_didPurchase'] = df['reviews_didPurchase'].fillna('Unknown').astype('category')
df['reviews_doRecommend'] = df['reviews_doRecommend'].astype('category')
df['user_sentiment'] = df['user_sentiment'].astype('category')

# Display the first few rows of the cleaned and pre-processed DataFrame
print("\nCleaned and Pre-processed DataFrame head:")
display(df.head())


Dropping irrelevant columns...
Dropped columns: ['reviews_userCity', 'reviews_userProvince']
[INFO] Cleaned 'reviews_date': 29946/30000 valid dates, 54 missing (0.18%)

Cleaned and Pre-processed DataFrame head:


| | id | brand | categories | name | reviews_date | reviews_didPurchase | reviews_doRecommend | reviews_rating | reviews_text | reviews_title | reviews_username | user_sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AV13O1A8GV-KLJ3akUyj | Universal Music | Movies, Music & Books,Music,R&b,Movies & TV,Mo... | Pink Friday: Roman Reloaded Re-Up (w/dvd) | 2012-11-30 06:21:45+00:00 | Unknown | Yes | 5 | i love this album. it's very good. more to the... | Just Awesome | joshua | Positive |
| 1 | AV14LG0R-jtxr-f38QfS | Lundberg | Food,Packaged Foods,Snacks,Crackers,Snacks, Co... | Lundberg Organic Cinnamon Toast Rice Cakes | 2017-07-09 00:00:00+00:00 | True | Yes | 5 | Good flavor. This review was collected as part... | Good | dorothy w | Positive |
| 2 | AV14LG0R-jtxr-f38QfS | Lundberg | Food,Packaged Foods,Snacks,Crackers,Snacks, Co... | Lundberg Organic Cinnamon Toast Rice Cakes | 2017-07-09 00:00:00+00:00 | True | Yes | 5 | Good flavor. | Good | dorothy w | Positive |
| 3 | AV16khLE-jtxr-f38VFn | K-Y | Personal Care,Medicine Cabinet,Lubricant/Sperm... | K-Y Love Sensuality Pleasure Gel | 2016-01-06 00:00:00+00:00 | False | False | 1 | I read through the reviews on here before look... | Disappointed | rebecca | Negative |
| 4 | AV16khLE-jtxr-f38VFn | K-Y | Personal Care,Medicine Cabinet,Lubricant/Sperm... | K-Y Love Sensuality Pleasure Gel | 2016-12-21 00:00:00+00:00 | False | False | 1 | My husband bought this gel for us. The gel cau... | Irritation | walker557 | Negative |
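
In the head above, `reviews_doRecommend` contains both `Yes` (imputed) and `False` (original), so the same answer is encoded in two ways. One possible way to get a single label set is sketched below. `normalize_recommend` is a hypothetical helper, and treating `True`/`False` as equivalent to `Yes`/`No` is an assumption about what the original column means.

```python
def normalize_recommend(value, rating):
    """Map a raw reviews_doRecommend value to 'Yes', 'No' or 'Unknown'."""
    # Existing answers: accept booleans and their string forms.
    if value in (True, "True", "Yes"):
        return "Yes"
    if value in (False, "False", "No"):
        return "No"
    # Missing answer (None or NaN): fall back to the star rating,
    # using the same cut-offs as the imputation cell.
    if rating >= 4:
        return "Yes"
    if rating <= 2:
        return "No"
    return "Unknown"


rows = [(None, 5), (True, 5), (False, 1), (None, 3), (float("nan"), 2)]
labels = [normalize_recommend(value, rating) for value, rating in rows]
print(labels)
```

With pandas this could be applied row by row, for example through `df.apply(..., axis=1)`, before the column is converted to `category`.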


In [57]:
# reviews_date: 54 missing values (<0.2% of rows).
# The date matters for time-based analysis but not for the core sentiment and
# recommendation tasks, so dropping these few rows loses very little.
df = df[df['reviews_date'].notna()]

# user_sentiment is the target label; drop the single row where it is missing.
df = df[df['user_sentiment'].notna()]

In [58]:
# Display the DataFrame info after dropping columns
print("\nDataFrame info after dropping columns:")
display(df.info())

# After handling missing values, display the updated missing value count
print("Missing values per column after handling:")
display(df.isnull().sum())


DataFrame info after dropping columns:
<class 'pandas.core.frame.DataFrame'>
Index: 29945 entries, 0 to 29999
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype              
---  ------               --------------  -----              
 0   id                   29945 non-null  object             
 1   brand                29945 non-null  object             
 2   categories           29945 non-null  object             
 3   name                 29945 non-null  object             
 4   reviews_date         29945 non-null  datetime64[ns, UTC]
 5   reviews_didPurchase  29945 non-null  category           
 6   reviews_doRecommend  29945 non-null  category           
 7   reviews_rating       29945 non-null  int64              
 8   reviews_text         29945 non-null  object             
 9   reviews_title        29945 non-null  object             
 10  reviews_username     29945 non-null  object             
 11  user_sentiment       29945 non-null  category

None

Missing values per column after handling:


id                     0
brand                  0
categories             0
name                   0
reviews_date           0
reviews_didPurchase    0
reviews_doRecommend    0
reviews_rating         0
reviews_text           0
reviews_title          0
reviews_username       0
user_sentiment         0
dtype: int64

In [59]:


# Reuse the `stop_words` set built in the first cell (the NLTK resources were
# downloaded there). Avoid rebinding the name `stopwords`: that would shadow
# the imported NLTK module and break the cell when it is re-run.

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """
    Cleans and preprocesses text data.

    Steps:
    1. Convert text to lowercase.
    2. Remove punctuation.
    3. Remove stop words.
    4. Apply lemmatization (stemming is kept as a commented-out alternative).
    """
    # 1. Convert text to lowercase
    text = text.lower()

    # 2. Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)

    # 3. Remove stop words (set membership is faster than scanning a list)
    text = ' '.join([word for word in text.split() if word not in stop_words])

    # 4. Apply lemmatization (you can switch to stemming if preferred)
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
    # text = ' '.join([stemmer.stem(word) for word in text.split()]) # Uncomment for stemming

    return text

# Apply preprocessing to the reviews_text and reviews_title columns
print("Applying text preprocessing to 'reviews_text' and 'reviews_title'...")
df['reviews_text_preprocessed'] = df['reviews_text'].apply(preprocess_text)
df['reviews_title_preprocessed'] = df['reviews_title'].apply(preprocess_text)

# Display the first few rows with the new preprocessed columns
print("\nDataFrame head with preprocessed text:")
display(df[['reviews_text', 'reviews_text_preprocessed', 'reviews_title', 'reviews_title_preprocessed']].head())

Applying text preprocessing to 'reviews_text' and 'reviews_title'...

DataFrame head with preprocessed text:


| | reviews_text | reviews_text_preprocessed | reviews_title | reviews_title_preprocessed |
|---|---|---|---|---|
| 0 | i love this album. it's very good. more to the... | love album good hip hop side current pop sound... | Just Awesome | awesome |
| 1 | Good flavor. This review was collected as part... | good flavor review collected part promotion | Good | good |
| 2 | Good flavor. | good flavor | Good | good |
| 3 | I read through the reviews on here before look... | read review looking buying one couple lubrican... | Disappointed | disappointed |
| 4 | My husband bought this gel for us. The gel cau... | husband bought gel u gel caused irritation fel... | Irritation | irritation |
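
The next cell turns this preprocessed text into TF-IDF features. As a rough illustration of what that weighting computes, here is a dependency-free sketch on a toy corpus adapted from the rows above (the third document is shortened by hand). It assumes the smoothed formula used by scikit-learn's default `TfidfVectorizer` settings, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation. It does not reproduce the vectorizer's tokenisation or `max_features` handling.

```python
import math
from collections import Counter

# Toy corpus adapted from the preprocessed reviews shown above.
docs = [
    "good flavor",
    "good flavor review collected part promotion",
    "love album good",
]


def tfidf(docs):
    """Return (idf per term, one L2-normalised {term: weight} dict per document)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: counts[t] * idf[t] for t in counts}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return idf, vectors


idf, vectors = tfidf(docs)
print(round(idf["good"], 4), round(idf["flavor"], 4))
print(vectors[0])
```

Because "good" appears in every toy document, it gets the minimum idf of 1.0, so "flavor" outweighs it in the first review.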


In [60]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE

# Step 1: Split data into training and testing parts
# We will use 'reviews_text_preprocessed' and 'user_sentiment' for our model
X = df['reviews_text_preprocessed']
y = df['user_sentiment']

# Split data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Data split into training and testing sets.")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

# Step 2: Convert the text to features using TF-IDF vectorizer
# Initialize TF-IDF vectorizer
# max_features can be adjusted to limit the number of features
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

# Fit the vectorizer on the training data and transform both training and testing data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print("\nText data vectorized using TF-IDF.")
print(f"X_train_tfidf shape: {X_train_tfidf.shape}")
print(f"X_test_tfidf shape: {X_test_tfidf.shape}")

# Step 3: Handle class imbalance using SMOTE
print("\nHandling class imbalance using SMOTE...")
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_tfidf, y_train)

print("Class imbalance handled using SMOTE.")
print(f"Resampled X_train shape: {X_train_resampled.shape}")
print(f"Resampled y_train shape: {y_train_resampled.shape}")
print("Resampled y_train distribution:")
display(y_train_resampled.value_counts())

Data split into training and testing sets.
X_train shape: (23956,)
X_test shape: (5989,)
y_train shape: (23956,)
y_test shape: (5989,)

Text data vectorized using TF-IDF.
X_train_tfidf shape: (23956, 5000)
X_test_tfidf shape: (5989, 5000)

Handling class imbalance using SMOTE...
Class imbalance handled using SMOTE.
Resampled X_train shape: (42540, 5000)
Resampled y_train shape: (42540,)
Resampled y_train distribution:


user_sentiment
Negative    21270
Positive    21270
Name: count, dtype: int64
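
The resampled counts follow from simple arithmetic: the training split has 23956 rows and the larger class has 21270 of them, so SMOTE had to synthesise enough minority rows to reach 21270 as well. The sketch below reproduces that arithmetic and shows the core interpolation idea behind SMOTE on toy 2-D points. It is a simplified illustration, not the imbalanced-learn implementation. `smote_like_samples` is a hypothetical helper, and `k=2` is chosen for the toy data (the library default is 5 neighbours).

```python
import random

# Arithmetic behind the resampled shapes printed above.
train_rows, majority_rows = 23956, 21270
minority_rows = train_rows - majority_rows      # 2686 original minority rows
synthetic_rows = majority_rows - minority_rows  # 18584 rows added by SMOTE
print(minority_rows, synthetic_rows, train_rows + synthetic_rows)


def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_samples(minority, n_new=6)
print(new_points)
```

Every synthetic point lies on a segment between two real minority points. This is why resampling is applied to the training split only, after the train/test split, as the notebook does.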

In [62]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from tqdm import tqdm
# Build and train models

# Initialize the models
log_reg = LogisticRegression(random_state=42, solver='liblinear') # Using liblinear solver for smaller datasets
rf_clf = RandomForestClassifier(random_state=42, verbose=1)
nb_clf = MultinomialNB() # Suitable for text data with TF-IDF or Count Vectorizer

models = {
    "Logistic Regression": log_reg,
    "Random Forest": rf_clf,
    "Naive Bayes": nb_clf
}


# Train each model
print("Training models...")
for name, model in tqdm(models.items(), desc="Models"):
    model.fit(X_train_resampled, y_train_resampled)

print("\nAll models trained.")

Training models...


Models:  33%|███▎      | 1/3 [00:00<00:00,  8.07it/s]
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    4.8s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    9.8s finished
Models: 100%|██████████| 3/3 [00:09<00:00,  3.32s/it]


All models trained.





In [63]:
# Evaluate models on the untouched (non-resampled) test set

print("Evaluating models...")
results = {}

for name, model in models.items():
    print(f"Evaluating {name}...")
    y_pred = model.predict(X_test_tfidf)
    y_pred_proba = model.predict_proba(X_test_tfidf)[:, 1] if hasattr(model, 'predict_proba') else None

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, pos_label='Positive')
    recall = recall_score(y_test, y_pred, pos_label='Positive')
    f1 = f1_score(y_test, y_pred, pos_label='Positive')
    roc_auc = roc_auc_score(y_test, y_pred_proba) if y_pred_proba is not None else 'N/A'

    results[name] = {
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1-Score": f1,
        "ROC-AUC": roc_auc
    }
    print(f"{name} evaluated.")

# Display the results
print("\nModel Performance Comparison:")
results_df = pd.DataFrame(results).T
display(results_df)

[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.1s finished


Evaluating models...
Evaluating Logistic Regression...
Logistic Regression evaluated.
Evaluating Random Forest...


[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.1s finished


Random Forest evaluated.
Evaluating Naive Bayes...
Naive Bayes evaluated.

Model Performance Comparison:


| | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.920521 | 0.980353 | 0.929095 | 0.954036 | 0.960981 |
| Random Forest | 0.927701 | 0.950387 | 0.969156 | 0.95968 | 0.944603 |
| Naive Bayes | 0.845717 | 0.955611 | 0.866466 | 0.908858 | 0.878445 |


**Naive Bayes**

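Lowest accuracy, recall, F1-score and ROC-AUC of the three models. Its precision (0.956) is slightly above Random Forest's (0.950), but overall it is clearly the weakest.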

**Logistic Regression vs Random Forest**

**Accuracy:** Random Forest is slightly better (0.928 vs 0.921).

**Precision:** Logistic Regression wins (0.980 vs 0.950).

**Recall:** Random Forest wins clearly (0.969 vs 0.929).

**F1-Score:** Random Forest wins (0.960 vs 0.954).

**ROC-AUC:** Logistic Regression wins (0.961 vs 0.945).

**So it’s a trade-off:**

Logistic Regression → More conservative, higher precision (fewer false positives).

Random Forest → More aggressive, higher recall & F1 (fewer false negatives).

**Best Model Choice**

If the priority is catching as many positive reviews as possible (high recall, balanced F1), choose Random Forest.

If precision matters more (recommend only when very confident), choose Logistic Regression.

For a sentiment-based product recommendation system, recall and F1 usually matter more, because missed positive sentiments mean missed recommendations.

**Best Overall Model = Random Forest**
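
The precision/recall trade-off discussed above can be reproduced by hand from confusion-matrix counts. The sketch below uses a small made-up label set, not the notebook's predictions, and a hypothetical `binary_metrics` helper. Both toy classifiers reach the same accuracy, yet one favours precision and the other recall.

```python
def binary_metrics(y_true, y_pred, positive="Positive"):
    """Compute accuracy, precision, recall and F1 for the chosen positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 8 positive and 2 negative reviews.
y_true = ["Positive"] * 8 + ["Negative"] * 2
# A "conservative" classifier misses two positives but raises no false alarms.
conservative = binary_metrics(y_true, ["Positive"] * 6 + ["Negative"] * 4)
# An "aggressive" classifier labels everything as positive.
aggressive = binary_metrics(y_true, ["Positive"] * 10)

print("conservative:", conservative)
print("aggressive:  ", aggressive)
```

In this toy case the aggressive classifier has the higher F1 even though it never identifies a negative review. On an imbalanced dataset it can therefore be worth checking the metrics for the `Negative` class as well.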

# Build System

Build and evaluate user-based and item-based recommendation systems on the dataset from "/content/sample_data/sample30.csv", select the best performing model, and provide reasons for the selection.

## Prepare data for recommendation system

Select the necessary columns and potentially preprocess them for building the recommendation system.


**Reasoning**:
Create a new DataFrame with only the necessary columns for the recommendation system and display its head and info.



In [64]:
reviews_df = df[['reviews_username', 'name', 'reviews_rating']]
print("Reviews DataFrame head:")
display(reviews_df.head())
print("\nReviews DataFrame info:")
display(reviews_df.info())

Reviews DataFrame head:


| | reviews_username | name | reviews_rating |
|---|---|---|---|
| 0 | joshua | Pink Friday: Roman Reloaded Re-Up (w/dvd) | 5 |
| 1 | dorothy w | Lundberg Organic Cinnamon Toast Rice Cakes | 5 |
| 2 | dorothy w | Lundberg Organic Cinnamon Toast Rice Cakes | 5 |
| 3 | rebecca | K-Y Love Sensuality Pleasure Gel | 1 |
| 4 | walker557 | K-Y Love Sensuality Pleasure Gel | 1 |



Reviews DataFrame info:
<class 'pandas.core.frame.DataFrame'>
Index: 29945 entries, 0 to 29999
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   reviews_username  29945 non-null  object
 1   name              29945 non-null  object
 2   reviews_rating    29945 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 935.8+ KB


None

## Split data

Divide the data into training and testing sets for evaluating the recommendation system.


**Reasoning**:
The goal is to split the data into training and testing sets for the recommendation system. The previous subtask created the `reviews_df` DataFrame with relevant columns. This step will perform the split using `train_test_split`.



In [65]:
from sklearn.model_selection import train_test_split

# Split the reviews_df DataFrame into training and testing sets
train_df, test_df = train_test_split(reviews_df, test_size=0.2, random_state=42)

# Print the shapes of the resulting DataFrames
print("Training set shape:", train_df.shape)
print("Testing set shape:", test_df.shape)

Training set shape: (23956, 3)
Testing set shape: (5989, 3)
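
A random row-level split can leave test rows whose user or product never occurs in the training set, and a collaborative filter cannot score those rows. A small sketch for measuring this is shown below. The rows are toy data (partly borrowed from the head above, partly invented), and `cold_start_report` is a hypothetical helper.

```python
def cold_start_report(train_rows, test_rows):
    """Count test rows whose user or product never appears in the training rows."""
    train_users = {user for user, _, _ in train_rows}
    train_items = {item for _, item, _ in train_rows}
    return {
        "test_rows": len(test_rows),
        "unseen_user_rows": sum(1 for user, _, _ in test_rows if user not in train_users),
        "unseen_item_rows": sum(1 for _, item, _ in test_rows if item not in train_items),
    }


# Toy (username, product name, rating) rows.
train_rows = [
    ("joshua", "Pink Friday: Roman Reloaded Re-Up (w/dvd)", 5),
    ("dorothy w", "Lundberg Organic Cinnamon Toast Rice Cakes", 5),
    ("rebecca", "K-Y Love Sensuality Pleasure Gel", 1),
]
test_rows = [
    ("dorothy w", "Lundberg Organic Cinnamon Toast Rice Cakes", 5),  # both seen
    ("walker557", "K-Y Love Sensuality Pleasure Gel", 1),            # unseen user
    ("joshua", "hypothetical new product", 4),                       # unseen product
]

report = cold_start_report(train_rows, test_rows)
print(report)
```

On the real split, the same counts could be taken from `train_df` and `test_df` to see how many test rows can actually be evaluated.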


## Build user-based collaborative filtering model

Implement a user-based collaborative filtering recommendation system.


**Reasoning**:
Create a pivot table from the training data, calculate user similarity, and define a function for user-based recommendations.



In [66]:
from sklearn.metrics.pairwise import cosine_similarity

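# 1. Create a pivot table from the training data.
#    pivot_table averages duplicate (user, product) ratings by default (aggfunc='mean');
#    fillna(0) then encodes "not rated" as 0.
user_item_matrix = train_df.pivot_table(index='reviews_username', columns='name', values='reviews_rating').fillna(0)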

print("User-Item Matrix head:")
display(user_item_matrix.head())
print("\nUser-Item Matrix shape:", user_item_matrix.shape)

# 2. Calculate the pairwise cosine similarity between users
user_similarity = cosine_similarity(user_item_matrix)
user_similarity_df = pd.DataFrame(user_similarity, index=user_item_matrix.index, columns=user_item_matrix.index)

print("\nUser Similarity Matrix head:")
display(user_similarity_df.head())

# 3. Define a function for user-based recommendations
def user_based_recommendations(user_id, user_item_matrix, user_similarity_df, n_recommendations=5):
    """
    Generates user-based recommendations for a given user.

    Args:
        user_id (str): The username (reviews_username) of the target user.
        user_item_matrix (pd.DataFrame): The user-item rating matrix (0 = not rated).
        user_similarity_df (pd.DataFrame): The user-user similarity matrix.
        n_recommendations (int): The number of recommendations to generate.

    Returns:
        list: Product names, ordered by similarity-weighted rating score.
    """
    if user_id not in user_similarity_df.index:
        print(f"User '{user_id}' not found in the similarity matrix.")
        return []

    # Get the similarity scores for the target user
    user_similarities = user_similarity_df.loc[user_id]

    # Remove the user's own similarity score
    user_similarities = user_similarities.drop(user_id)

    # Sort similar users by similarity in descending order
    similar_users = user_similarities.sort_values(ascending=False)

    # Get items rated by the target user
    items_rated_by_user = user_item_matrix.loc[user_id][user_item_matrix.loc[user_id] > 0].index

    # Initialize a dictionary to store recommended item scores
    item_scores = {}

    # Iterate through similar users
    for similar_user, similarity_score in similar_users.items():
        if similarity_score <= 0: # Consider only users with positive similarity
            continue

        # Get items rated by the similar user
        items_rated_by_similar_user = user_item_matrix.loc[similar_user][user_item_matrix.loc[similar_user] > 0].index

        # Identify items rated by the similar user but not by the target user
        items_to_consider = items_rated_by_similar_user.difference(items_rated_by_user)

        # For each item, add the similar user's rating weighted by similarity
        for item in items_to_consider:
            if item not in item_scores:
                item_scores[item] = 0
            item_scores[item] += user_item_matrix.loc[similar_user, item] * similarity_score

    # Sort items by their recommendation score in descending order
    recommended_items = sorted(item_scores.items(), key=lambda x: x[1], reverse=True)

    # Return the top N recommended item IDs
    return [item for item, score in recommended_items[:n_recommendations]]

print("\nUser-based recommendation function defined.")

User-Item Matrix head:


name,0.6 Cu. Ft. Letter A4 Size Waterproof 30 Min. Fire File Chest,100:Complete First Season (blu-Ray),2017-2018 Brownline174 Duraflex 14-Month Planner 8 1/2 X 11 Black,"2x Ultra Era with Oxi Booster, 50fl oz","42 Dual Drop Leaf Table with 2 Madrid Chairs""",4C Grated Parmesan Cheese 100% Natural 8oz Shaker,Africa's Best No-Lye Dual Conditioning Relaxer System Super,Alberto VO5 Salon Series Smooth Plus Sleek Shampoo,Alex Cross (dvdvideo),"All,bran Complete Wheat Flakes, 18 Oz.",...,Walkers Stem Ginger Shortbread,"Wallmount Server Cabinet (450mm, 9 RU)","Way Basics 3-Shelf Eco Narrow Bookcase Storage Shelf, Espresso - Formaldehyde Free - Lifetime Guarantee","WeatherTech 40647 14-15 Outlander Cargo Liners Behind 2nd Row, Black",Wedding Wishes Wedding Guest Book,Weleda Everon Lip Balm,Wilton Black Dots Standard Baking Cups,Windex Original Glass Cleaner Refill 67.6oz (2 Liter),Yes To Carrots Nourishing Body Wash,Yes To Grapefruit Rejuvenating Body Wash
reviews_username,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
00sab00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
01impala,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
02dakota,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
02deuce,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0325home,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0



User-Item Matrix shape: (20527, 263)

User Similarity Matrix head:


reviews_username,00sab00,01impala,02dakota,02deuce,0325home,1.11E+24,1085,10ten,1234,1234567,...,zsazsa,zt313,zubb,zulaa118,zwithanx,zxcsdfd,zxjki,zyiah4,zzdiane,zzz1127
00sab00,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
01impala,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
02dakota,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
02deuce,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0325home,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0



User-based recommendation function defined.
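
The similarity matrix head above contains many entries that are exactly 0.0 or 1.0. One likely explanation is that most users have rated a single product: with unrated products stored as 0, two users who rated only the same product get a cosine similarity of 1.0 whatever ratings they gave, and users with no product in common get 0.0. The sketch below demonstrates this with a plain-Python cosine on made-up rating vectors. Note that this sketch returns 0.0 for an all-zero vector.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length rating vectors (0 = not rated)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up ratings over three products.
user_a = [5, 0, 0]  # loved product 1
user_b = [1, 0, 0]  # hated product 1
user_c = [0, 4, 0]  # rated only product 2
user_d = [5, 3, 0]  # rated products 1 and 2

print(cosine(user_a, user_b))  # same single product, opposite opinions
print(cosine(user_a, user_c))  # no product in common
print(cosine(user_a, user_d))  # partial overlap
```

Here `user_a` and `user_b` look identical to the model although one gave 5 stars and the other 1. Subtracting each user's mean rating before computing similarities is a common way to reduce this effect, though it is not what the notebook does.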


## Build item-based collaborative filtering model

Implement an item-based collaborative filtering recommendation system.


**Reasoning**:
Calculate item similarity and define the item-based recommendation function.
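
The function in the next cell starts from a product name and returns similar products. A common complementary use of item-item similarity starts from a user instead: each product the user has not rated is scored by the similarity-weighted sum of the ratings that user gave to other products. The sketch below shows that variant on made-up data. It is an alternative formulation for illustration, not the notebook's method, and all names in it are hypothetical.

```python
import math

# Made-up ratings: user -> {product: rating}.
ratings = {
    "u1": {"A": 5, "B": 4},
    "u2": {"A": 4, "B": 5, "C": 2},
    "u3": {"B": 1, "C": 5},
}


def item_vectors(ratings):
    """Return {product: [rating by each user, 0 if unrated]} with users in sorted order."""
    users = sorted(ratings)
    items = sorted({item for user_ratings in ratings.values() for item in user_ratings})
    return {item: [ratings[u].get(item, 0) for u in users] for item in items}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend_for_user(user, ratings, n=2):
    """Rank products the user has not rated by similarity-weighted ratings."""
    vectors = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate, candidate_vec in vectors.items():
        if candidate in seen:
            continue
        score = sum(cosine(candidate_vec, vectors[item]) * r for item, r in seen.items())
        if score > 0:
            scores[candidate] = score
    return sorted(scores, key=scores.get, reverse=True)[:n]


print(recommend_for_user("u1", ratings))
print(recommend_for_user("u3", ratings))
```

A user who has already rated every product, such as `u2` here, gets an empty list.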



In [67]:
from sklearn.metrics.pairwise import cosine_similarity

# 1. Calculate the pairwise cosine similarity between items (using the transposed user-item matrix)
item_similarity = cosine_similarity(user_item_matrix.T)

# 2. Convert the resulting similarity matrix into a pandas DataFrame
item_similarity_df = pd.DataFrame(item_similarity, index=user_item_matrix.columns, columns=user_item_matrix.columns)

print("Item Similarity Matrix head:")
display(item_similarity_df.head())
print("\nItem Similarity Matrix shape:", item_similarity_df.shape)

# 3. Define a function item_based_recommendations
def item_based_recommendations(item_name, user_item_matrix, item_similarity_df, n_recommendations=5):
    """
    Generates item-based recommendations for a given item.

    Args:
        item_name (str): The name of the target item.
        user_item_matrix (pd.DataFrame): The user-item matrix.
        item_similarity_df (pd.DataFrame): The item similarity matrix.
        n_recommendations (int): The number of recommendations to generate.

    Returns:
        list: A list of recommended item names.
    """
    if item_name not in item_similarity_df.index:
        print(f"Item '{item_name}' not found in the similarity matrix.")
        return []

    # 4. Get the similarity scores for the target item
    item_similarities = item_similarity_df.loc[item_name]

    # 5. Remove the item's own similarity score
    item_similarities = item_similarities.drop(item_name, errors='ignore')

    # 6. Sort similar items by similarity in descending order
    similar_items = item_similarities.sort_values(ascending=False)

    # 7. Get users who rated the target item
    users_who_rated_item = user_item_matrix.index[user_item_matrix[item_name] > 0]

    # 8. Initialize a dictionary to store recommended item scores
    item_scores = {}

    # 9. Iterate through similar items and users
    for similar_item, similarity_score in similar_items.items():
        if similarity_score <= 0: # Consider only items with positive similarity
            continue

        # 10. For each user, if they rated the similar item, add their rating weighted by the item similarity
        for user in users_who_rated_item:
            if user_item_matrix.loc[user, similar_item] > 0:
                if similar_item not in item_scores:
                    item_scores[similar_item] = 0
                item_scores[similar_item] += user_item_matrix.loc[user, similar_item] * similarity_score

    # 11. Sort items in the item_scores dictionary by their recommendation score
    recommended_items = sorted(item_scores.items(), key=lambda x: x[1], reverse=True)

    # 12. Return the top N recommended item names
    return [item for item, score in recommended_items[:n_recommendations]]

print("\nItem-based recommendation function defined.")

Item Similarity Matrix head:


name,0.6 Cu. Ft. Letter A4 Size Waterproof 30 Min. Fire File Chest,100:Complete First Season (blu-Ray),2017-2018 Brownline174 Duraflex 14-Month Planner 8 1/2 X 11 Black,"2x Ultra Era with Oxi Booster, 50fl oz","42 Dual Drop Leaf Table with 2 Madrid Chairs""",4C Grated Parmesan Cheese 100% Natural 8oz Shaker,Africa's Best No-Lye Dual Conditioning Relaxer System Super,Alberto VO5 Salon Series Smooth Plus Sleek Shampoo,Alex Cross (dvdvideo),"All,bran Complete Wheat Flakes, 18 Oz.",...,Walkers Stem Ginger Shortbread,"Wallmount Server Cabinet (450mm, 9 RU)","Way Basics 3-Shelf Eco Narrow Bookcase Storage Shelf, Espresso - Formaldehyde Free - Lifetime Guarantee","WeatherTech 40647 14-15 Outlander Cargo Liners Behind 2nd Row, Black",Wedding Wishes Wedding Guest Book,Weleda Everon Lip Balm,Wilton Black Dots Standard Baking Cups,Windex Original Glass Cleaner Refill 67.6oz (2 Liter),Yes To Carrots Nourishing Body Wash,Yes To Grapefruit Rejuvenating Body Wash
0.6 Cu. Ft. Letter A4 Size Waterproof 30 Min. Fire File Chest,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100:Complete First Season (blu-Ray),0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002118,0.0,0.0
2017-2018 Brownline174 Duraflex 14-Month Planner 8 1/2 X 11 Black,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"2x Ultra Era with Oxi Booster, 50fl oz",0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"42 Dual Drop Leaf Table with 2 Madrid Chairs""",0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.146114,0.0



Item Similarity Matrix shape: (263, 263)

Item-based recommendation function defined.
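
The nested loops in `item_based_recommendations` look up one cell at a time with `.loc`, which may explain part of the long evaluation time. The sketch below shows one way the same scoring rule (sum of the target item's raters' ratings for each similar item, weighted by item similarity) could be expressed with column-wise pandas operations. The function name `item_based_recommendations_vectorized` and the toy matrix are assumptions made for illustration, not part of the notebook, and tie-breaking between equal scores may differ from the loop version.

```python
import numpy as np
import pandas as pd

# Toy ratings (0 = not rated); users, items and values are made up for illustration.
toy_matrix = pd.DataFrame(
    [[5, 4, 0, 0],
     [4, 0, 3, 0],
     [0, 5, 0, 2],
     [3, 2, 0, 0]],
    index=["u1", "u2", "u3", "u4"],
    columns=["A", "B", "C", "D"],
    dtype=float,
)

# Item-item cosine similarity computed with NumPy (same idea as cosine_similarity(matrix.T)).
norms = np.linalg.norm(toy_matrix.values, axis=0)
toy_similarity = pd.DataFrame(
    (toy_matrix.values.T @ toy_matrix.values) / np.outer(norms, norms),
    index=toy_matrix.columns,
    columns=toy_matrix.columns,
)


def item_based_recommendations_vectorized(item_name, user_item_matrix,
                                          item_similarity_df, n_recommendations=5):
    """Hypothetical vectorized variant of the scoring loop (illustrative only)."""
    if item_name not in item_similarity_df.index:
        return []

    # Similarities to the target item, without the item itself, positive values only
    sims = item_similarity_df.loc[item_name].drop(item_name, errors="ignore")
    sims = sims[sims > 0]

    # Rows: users who rated the target item; columns: the positively similar items
    raters = user_item_matrix.loc[user_item_matrix[item_name] > 0, sims.index]

    # Sum of those users' positive ratings per similar item, weighted by similarity
    rating_sums = raters.where(raters > 0, 0).sum(axis=0)
    scores = rating_sums * sims

    # Keep only items that received a score, highest first
    scores = scores[scores > 0].sort_values(ascending=False)
    return scores.head(n_recommendations).index.tolist()


toy_recs = item_based_recommendations_vectorized("A", toy_matrix, toy_similarity)
print(toy_recs)  # ['B', 'C']
```

On the toy data, item D is never co-rated with A, so its similarity is zero and it is left out, just as the `similarity_score <= 0` check does in the loop version.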


## Evaluate recommendation systems

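Evaluate the performance of both the user-based and item-based recommendation systems on the test set. Common choices include RMSE, precision, and recall; this notebook uses hit rate, the fraction of test interactions whose item appears in the top-N recommendations.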


**Reasoning**:
Define the evaluation function for the recommendation systems and then call it for both user-based and item-based models.



In [68]:
def evaluate_recommendation_system(test_df, recommendation_function, user_item_matrix, similarity_matrix, n_recommendations=5):
    """
    Evaluates the performance of a recommendation system.

    Args:
        test_df (pd.DataFrame): DataFrame containing the test set (user, item, rating).
        recommendation_function (function): The function to generate recommendations.
        user_item_matrix (pd.DataFrame): The user-item matrix from the training data.
        similarity_matrix (pd.DataFrame): The user or item similarity matrix.
        n_recommendations (int): The number of recommendations generated by the system.

    Returns:
        dict: A dictionary containing evaluation metrics (e.g., hit rate).
    """
    hits = 0
    total_test_interactions = len(test_df)
    # Sets give constant-time membership checks inside the loop below
    users_in_train = set(user_item_matrix.index)
    items_in_train = set(user_item_matrix.columns)

    for index, row in tqdm(test_df.iterrows(), total=total_test_interactions, desc=f"Evaluating {recommendation_function.__name__}"):
        user = row['reviews_username']
        actual_item = row['name']

        # Only evaluate if the user and item are in the training data's matrix
        # This is a limitation of collaborative filtering - it cannot recommend for new users/items
        if user in users_in_train and actual_item in items_in_train:
            # Generate recommendations for the user
            if recommendation_function.__name__ == 'user_based_recommendations':
                recommended_items = recommendation_function(user, user_item_matrix, similarity_matrix, n_recommendations)
            elif recommendation_function.__name__ == 'item_based_recommendations':
                # For item-based, we need to provide an item from the user's history in the training set
                # This is a simplification; a real system would use all items rated by the user
                user_rated_items_in_train = user_item_matrix.loc[user][user_item_matrix.loc[user] > 0].index.tolist()
                if not user_rated_items_in_train: # Skip if user has no rated items in training (should be handled by user_in_train check, but double-checking)
                    continue
                # Use the first item the user rated in the training set as a basis for item-based recommendation
                # A more sophisticated approach would aggregate recommendations from all rated items
                seed_item = user_rated_items_in_train[0]
                recommended_items = recommendation_function(seed_item, user_item_matrix, similarity_matrix, n_recommendations)
            else:
                print(f"Unknown recommendation function: {recommendation_function.__name__}")
                continue


            # Check if the actual item is in the recommendations
            if actual_item in recommended_items:
                hits += 1

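    # Calculate metrics
    # Note: the denominator is every test interaction, so rows skipped above
    # (user or item missing from the training matrix) count as misses.
    hit_rate = hits / total_test_interactions if total_test_interactions > 0 else 0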

    return {"Hit Rate": hit_rate}



In [70]:
# Evaluate User-Based Recommendation System
print("Evaluating User-Based Recommendation System...")
user_based_metrics = evaluate_recommendation_system(test_df, user_based_recommendations, user_item_matrix, user_similarity_df)
print("User-Based Metrics:", user_based_metrics)

# Evaluate Item-Based Recommendation System
print("\nEvaluating Item-Based Recommendation System...")
item_based_metrics = evaluate_recommendation_system(test_df, item_based_recommendations, user_item_matrix, item_similarity_df)
print("Item-Based Metrics:", item_based_metrics)

Evaluating User-Based Recommendation System...


Evaluating user_based_recommendations: 100%|██████████| 5989/5989 [04:35<00:00, 21.75it/s]


User-Based Metrics: {'Hit Rate': 0.07697445316413425}

Evaluating Item-Based Recommendation System...


Evaluating item_based_recommendations: 100%|██████████| 5989/5989 [12:14<00:00,  8.16it/s]

Item-Based Metrics: {'Hit Rate': 0.08281850058440474}
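
The hit rate above divides by every row of the test set, including rows that were skipped because the user or item is absent from the training matrix. As a minimal sketch of how the two effects could be separated, the function below reports the hit rate over all rows, the hit rate over evaluated rows only, and the share of rows that could be evaluated. The name `hit_rate_report`, its arguments, and the toy data are hypothetical and not part of the notebook.

```python
def hit_rate_report(test_rows, recommend, known_users, known_items):
    """Hypothetical helper: hit rate with and without skipped (cold-start) rows.

    test_rows   -- list of (user, actual_item) pairs
    recommend   -- callable mapping a user to a list of recommended items
    known_users -- set of users present in the training matrix
    known_items -- set of items present in the training matrix
    """
    hits = 0
    evaluated = 0
    for user, actual_item in test_rows:
        # Collaborative filtering cannot score unseen users or items
        if user not in known_users or actual_item not in known_items:
            continue
        evaluated += 1
        if actual_item in recommend(user):
            hits += 1

    total = len(test_rows)
    return {
        "hit_rate_all": hits / total if total else 0.0,
        "hit_rate_evaluated": hits / evaluated if evaluated else 0.0,
        "coverage": evaluated / total if total else 0.0,
    }


# Toy data, made up for illustration
toy_recommendations = {"u1": ["A", "D"], "u2": ["B"]}
toy_test_rows = [("u1", "A"), ("u1", "B"), ("u2", "C"), ("new_user", "A")]

report = hit_rate_report(
    toy_test_rows,
    recommend=lambda user: toy_recommendations.get(user, []),
    known_users={"u1", "u2"},
    known_items={"A", "B", "C"},
)
print(report)
```

Here one of four rows is a hit, three rows are evaluable, so the hit rate is 0.25 over all rows and one third over evaluated rows, with a coverage of 0.75.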





**Reasoning**:
The evaluation of the item-based recommendation system is complete; print the metrics.



In [71]:
print("Item-Based Metrics:", item_based_metrics)

Item-Based Metrics: {'Hit Rate': 0.08281850058440474}


## Compare and select the best model

Compare the performance of the two recommendation systems and select the best one based on the evaluation results, providing reasons for the selection.


**Reasoning**:
Compare the evaluation metrics and select the best performing model.



In [72]:
print("User-Based Metrics:", user_based_metrics)
print("Item-Based Metrics:", item_based_metrics)

# Compare the Hit Rates and select the best model
if item_based_metrics['Hit Rate'] > user_based_metrics['Hit Rate']:
    best_model = "Item-Based Collaborative Filtering"
    best_metric_value = item_based_metrics['Hit Rate']
    reason = f"The Item-Based model achieved a higher Hit Rate ({best_metric_value:.4f}) compared to the User-Based model ({user_based_metrics['Hit Rate']:.4f})."
elif user_based_metrics['Hit Rate'] > item_based_metrics['Hit Rate']:
    best_model = "User-Based Collaborative Filtering"
    best_metric_value = user_based_metrics['Hit Rate']
    reason = f"The User-Based model achieved a higher Hit Rate ({best_metric_value:.4f}) compared to the Item-Based model ({item_based_metrics['Hit Rate']:.4f})."
else:
    best_model = "Both models performed equally"
    best_metric_value = user_based_metrics['Hit Rate'] # or item_based_metrics['Hit Rate']
    reason = f"Both models achieved the same Hit Rate ({best_metric_value:.4f})."

print(f"\nBest Performing Recommendation Model: {best_model}")
print(f"Reason for Selection: {reason}")

User-Based Metrics: {'Hit Rate': 0.07697445316413425}
Item-Based Metrics: {'Hit Rate': 0.08281850058440474}

Best Performing Recommendation Model: Item-Based Collaborative Filtering
Reason for Selection: The Item-Based model achieved a higher Hit Rate (0.0828) compared to the User-Based model (0.0770).
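
To put the gap between the two hit rates in concrete terms, the reported rates can be converted back into hit counts using the 5989 test interactions shown in the progress bars. This is simple arithmetic on the printed values, not an additional experiment.

```python
n_test = 5989  # number of test interactions, taken from the tqdm output above

user_rate = 0.07697445316413425  # reported user-based hit rate
item_rate = 0.08281850058440474  # reported item-based hit rate

# hit_rate = hits / n_test, so hits = hit_rate * n_test (rounded to undo float error)
user_hits = round(user_rate * n_test)
item_hits = round(item_rate * n_test)

print(user_hits, item_hits, item_hits - user_hits)  # 461 496 35
```

The item-based model is ahead by 35 hits out of 5989 interactions, so the ranking of the two models may be sensitive to the particular train/test split.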


## Summary:

### Data Analysis Key Findings

*   The user-item matrix created from the training data contained 20527 unique users and 263 unique items.
*   The user-based recommendation system achieved a Hit Rate of approximately 0.0770 on the test set.
*   The item-based recommendation system achieved a Hit Rate of approximately 0.0828 on the test set, slightly outperforming the user-based model.

### Insights or Next Steps

*   The item-based collaborative filtering model is selected as the best-performing model because of its higher Hit Rate on the test data, although the margin over the user-based model is small (0.0828 vs. 0.0770).
*   Further improvements could involve exploring different similarity metrics, incorporating regularization techniques, or utilizing hybrid approaches combining content-based or model-based methods to address the limitations of collaborative filtering (e.g., cold start problem).


In [73]:
# Step 8: Generate recommendations for a user using the best model

# Specify the username for whom you want recommendations
target_username = 'rebecca'  # Example username; replace with a user from your dataset

# Check if the target user exists in the user-item matrix
if target_username not in user_item_matrix.index:
    print(f"User '{target_username}' not found in the training data.")
else:
    # To use the item-based recommendation function, we need an item the user has rated
    # We can pick one of the items the user rated in the training set as a seed
    user_rated_items_in_train = user_item_matrix.loc[target_username][user_item_matrix.loc[target_username] > 0].index.tolist()

    if not user_rated_items_in_train:
        print(f"User '{target_username}' has not rated any items in the training data. Cannot generate item-based recommendations.")
    else:
        # Use the first item the user rated in the training set as the seed item
        seed_item_for_recommendation = user_rated_items_in_train[0]
        print(f"Using '{seed_item_for_recommendation}' as the seed item for recommendations for user '{target_username}'.")

        # Generate top N recommendations using the item-based model
        n_recommendations = 20
        recommended_items = item_based_recommendations(seed_item_for_recommendation, user_item_matrix, item_similarity_df, n_recommendations)

        if recommended_items:
            print(f"\nTop {n_recommendations} Recommended Items for user '{target_username}':")
            for i, item in enumerate(recommended_items, start=1):
                print(f"{i}. {item}")
        else:
            print(f"Could not generate recommendations for user '{target_username}'.")

Using 'Clorox Disinfecting Wipes Value Pack Scented 150 Ct Total' as the seed item for recommendations for user 'rebecca'.

Top 20 Recommended Items for user 'rebecca':
1. Clorox Disinfecting Bathroom Cleaner
2. Godzilla 3d Includes Digital Copy Ultraviolet 3d/2d Blu-Ray/dvd
3. Tostitos Bite Size Tortilla Chips
4. Burt's Bees Lip Shimmer, Raisin
5. My Big Fat Greek Wedding 2 (blu-Ray + Dvd + Digital)
6. Mike Dave Need Wedding Dates (dvd + Digital)
7. Planes: Fire Rescue (2 Discs) (includes Digital Copy) (blu-Ray/dvd)
8. Coty Airspun Face Powder, Translucent Extra Coverage
9. Chex Muddy Buddies Brownie Supreme Snack Mix
10. Pendaflex174 Divide It Up File Folder, Multi Section, Letter, Assorted, 12/pack
11. Chips Deluxe Soft 'n Chewy Cookies
12. The Resident Evil Collection 5 Discs (blu-Ray)
13. The Script - No Sound Without Silence (cd)
14. Vicks Vaporub, Regular, 3.53oz
15. Hormel Chili, No Beans
16. Chester's Cheese Flavored Puffcorn Snacks
17. Vaseline Intensive Care Lip Therapy Coco
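
The cell above seeds the recommendations with only the first item the user rated, and its comments note that a fuller approach would aggregate over all rated items. The sketch below shows one common way to do that: score each unrated item by the sum of its similarities to the user's rated items, weighted by the user's ratings. This is a different scoring rule from `item_based_recommendations`, and the function name `recommend_from_history` and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy ratings (0 = not rated); users, items and values are made up for illustration.
toy_matrix = pd.DataFrame(
    [[5, 4, 0, 0],
     [4, 0, 3, 0],
     [0, 5, 0, 2],
     [3, 2, 0, 0]],
    index=["u1", "u2", "u3", "u4"],
    columns=["A", "B", "C", "D"],
    dtype=float,
)

# Item-item cosine similarity, as in the notebook but computed with NumPy
norms = np.linalg.norm(toy_matrix.values, axis=0)
toy_similarity = pd.DataFrame(
    (toy_matrix.values.T @ toy_matrix.values) / np.outer(norms, norms),
    index=toy_matrix.columns,
    columns=toy_matrix.columns,
)


def recommend_from_history(user, user_item_matrix, item_similarity_df, n_recommendations=5):
    """Hypothetical helper: aggregate item similarities over everything the user rated."""
    if user not in user_item_matrix.index:
        return []

    ratings = user_item_matrix.loc[user]
    rated = ratings[ratings > 0]
    if rated.empty:
        return []

    # Rows: items the user rated; columns: all candidate items
    sims = item_similarity_df.loc[rated.index]

    # Weight each rated item's similarity row by the user's rating, then sum per candidate
    scores = sims.mul(rated, axis=0).sum(axis=0)

    # Do not recommend items the user already rated; drop candidates with no signal
    scores = scores.drop(rated.index, errors="ignore")
    scores = scores[scores > 0].sort_values(ascending=False)
    return scores.head(n_recommendations).index.tolist()


recs_u1 = recommend_from_history("u1", toy_matrix, toy_similarity)
recs_u2 = recommend_from_history("u2", toy_matrix, toy_similarity)
print(recs_u1)  # ['D', 'C']
print(recs_u2)  # ['B']
```

With this rule the result no longer depends on which rated item happens to come first in the matrix columns.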


Analyze the reviews of the top 20 recommended products for a user, predict the sentiment of these reviews using the best performing sentiment analysis model, calculate the percentage of positive sentiments for each of the 20 products, and identify the top 5 products with the highest percentage of positive reviews.

## Filter reviews for recommended products

Create a DataFrame containing only the reviews for the top 20 recommended products.


**Reasoning**:
Create a DataFrame containing only the reviews for the top 20 recommended products.



In [74]:
# 1. Create a list named top_20_recommended_items
top_20_recommended_items = recommended_items

# 2. Filter the original DataFrame df (.copy() so a column can be added later without a SettingWithCopyWarning)
df_recommended_reviews = df[df['name'].isin(top_20_recommended_items)].copy()

# 3. Display the head and the shape of the df_recommended_reviews DataFrame
print("DataFrame head with reviews for top 20 recommended products:")
display(df_recommended_reviews.head())
print("\nShape of the filtered DataFrame:", df_recommended_reviews.shape)

DataFrame head with reviews for top 20 recommended products:


Unnamed: 0,id,brand,categories,name,reviews_date,reviews_didPurchase,reviews_doRecommend,reviews_rating,reviews_text,reviews_title,reviews_username,user_sentiment,reviews_text_preprocessed,reviews_title_preprocessed
1796,AVpe41TqilAPnD_xQH3d,FOX,"Movies & TV Shows,Movies,Romance,Romantic Come...",Mike Dave Need Wedding Dates (dvd + Digital),2016-10-02 00:00:00+00:00,Unknown,False,1,I expected more from this movie and more from ...,Not as funny as I thought,elite,Positive,expected movie zac enron would give movie try ...,funny thought
1797,AVpe41TqilAPnD_xQH3d,FOX,"Movies & TV Shows,Movies,Romance,Romantic Come...",Mike Dave Need Wedding Dates (dvd + Digital),2016-11-13 00:00:00+00:00,Unknown,False,1,Would ABSOLUTELY NOT recommend. We could only ...,Mike & Dave Need Wedding Dates,tampa,Negative,would absolutely recommend could take 15 minut...,mike dave need wedding date
1798,AVpe41TqilAPnD_xQH3d,FOX,"Movies & TV Shows,Movies,Romance,Romantic Come...",Mike Dave Need Wedding Dates (dvd + Digital),2016-12-03 00:00:00+00:00,Unknown,False,1,Terrible movie with good actors. Can't believe...,Horrible movie,johnny,Negative,terrible movie good actor cant believe spent 2...,horrible movie
1799,AVpe41TqilAPnD_xQH3d,FOX,"Movies & TV Shows,Movies,Romance,Romantic Come...",Mike Dave Need Wedding Dates (dvd + Digital),2016-12-23 00:00:00+00:00,Unknown,False,1,This movie is terrible with only a few funny s...,Terrible,raiderfan1,Negative,movie terrible funny scene,terrible
1800,AVpe41TqilAPnD_xQH3d,FOX,"Movies & TV Shows,Movies,Romance,Romantic Come...",Mike Dave Need Wedding Dates (dvd + Digital),2017-01-06 00:00:00+00:00,Unknown,False,1,This was a boring movie i couldnt watch 20 min...,Mike,viktoorhdz,Negative,boring movie couldnt watch 20 minute,mike



Shape of the filtered DataFrame: (11924, 14)


## Predict sentiment for recommended products' reviews

Use the best sentiment analysis model (Random Forest, based on previous evaluation) to predict the sentiment (positive or negative) for each review of the recommended products.


**Reasoning**:
Use the best sentiment analysis model to predict the sentiment of the reviews for the recommended products.



In [75]:
# 1. Select the 'reviews_text_preprocessed' column from the df_recommended_reviews DataFrame
X_recommended = df_recommended_reviews['reviews_text_preprocessed']

# 2. Use the fitted TF-IDF vectorizer to transform the preprocessed text data
X_recommended_tfidf = tfidf_vectorizer.transform(X_recommended)

# 3. Use the best performing sentiment analysis model (Random Forest) to predict the sentiment labels
# The best model was identified as Random Forest (rf_clf) in the previous sentiment analysis task.
predicted_sentiment = rf_clf.predict(X_recommended_tfidf)

# 4. Add the predicted sentiment labels as a new column named 'predicted_sentiment' to the df_recommended_reviews DataFrame
df_recommended_reviews['predicted_sentiment'] = predicted_sentiment

# 5. Display the head of the df_recommended_reviews DataFrame
print("\nDataFrame head with predicted sentiment:")
display(df_recommended_reviews[['reviews_text', 'reviews_text_preprocessed', 'predicted_sentiment']].head())


DataFrame head with predicted sentiment:




Unnamed: 0,reviews_text,reviews_text_preprocessed,predicted_sentiment
1796,I expected more from this movie and more from ...,expected movie zac enron would give movie try ...,Positive
1797,Would ABSOLUTELY NOT recommend. We could only ...,would absolutely recommend could take 15 minut...,Negative
1798,Terrible movie with good actors. Can't believe...,terrible movie good actor cant believe spent 2...,Negative
1799,This movie is terrible with only a few funny s...,movie terrible funny scene,Negative
1800,This was a boring movie i couldnt watch 20 min...,boring movie couldnt watch 20 minute,Negative


**Reasoning**:
Calculate the percentage of positive sentiments for each recommended product and identify the top 5 products with the highest percentage.



In [76]:
# Calculate the percentage of positive sentiments for each product
positive_sentiment_percentage = df_recommended_reviews.groupby('name')['predicted_sentiment'].apply(lambda x: (x == 'Positive').sum() / len(x) * 100)

# Sort the products by the percentage of positive sentiment in descending order
sorted_positive_sentiment = positive_sentiment_percentage.sort_values(ascending=False)

# Identify the top 5 products with the highest percentage of positive reviews
top_5_positive_products = sorted_positive_sentiment.head(5)

# Display the percentage of positive sentiment for all recommended products (sorted)
print("\nPercentage of Positive Sentiment for Recommended Products (Sorted):")
display(sorted_positive_sentiment)

# Display the top 5 products with the highest percentage of positive reviews
print("\nTop 5 Products with Highest Percentage of Positive Reviews:")
display(top_5_positive_products)


Percentage of Positive Sentiment for Recommended Products (Sorted):


name
Chips Deluxe Soft 'n Chewy Cookies                                                 100.000000
My Big Fat Greek Wedding 2 (blu-Ray + Dvd + Digital)                                95.808383
100:Complete First Season (blu-Ray)                                                 94.244604
Planes: Fire Rescue (2 Discs) (includes Digital Copy) (blu-Ray/dvd)                 93.432574
Clorox Disinfecting Bathroom Cleaner                                                92.447278
Godzilla 3d Includes Digital Copy Ultraviolet 3d/2d Blu-Ray/dvd                     92.090226
Coty Airspun Face Powder, Translucent Extra Coverage                                89.240506
Vaseline Intensive Care Lip Therapy Cocoa Butter                                    88.607595
Burt's Bees Lip Shimmer, Raisin                                                     88.087056
The Script - No Sound Without Silence (cd)                                          87.179487
Chex Muddy Buddies Brownie Supreme Snack Mix           


Top 5 Products with Highest Percentage of Positive Reviews:


name
Chips Deluxe Soft 'n Chewy Cookies                                     100.000000
My Big Fat Greek Wedding 2 (blu-Ray + Dvd + Digital)                    95.808383
100:Complete First Season (blu-Ray)                                     94.244604
Planes: Fire Rescue (2 Discs) (includes Digital Copy) (blu-Ray/dvd)     93.432574
Clorox Disinfecting Bathroom Cleaner                                    92.447278
Name: predicted_sentiment, dtype: float64

## Present the top 5 products

Present the names of the top 5 products with the highest percentage of positive reviews.


**Reasoning**:
Print the heading and iterate through the top 5 positive products to display their rank, name, and positive sentiment percentage.



In [77]:
# Print a clear heading
print("Top 5 Products with Highest Percentage of Positive Reviews:")

# Iterate through the top_5_positive_products Series and print the results
for rank, (product_name, percentage) in enumerate(top_5_positive_products.items(), start=1):
    print(f"{rank}. {product_name}: {percentage:.2f}% Positive")

Top 5 Products with Highest Percentage of Positive Reviews:
1. Chips Deluxe Soft 'n Chewy Cookies: 100.00% Positive
2. My Big Fat Greek Wedding 2 (blu-Ray + Dvd + Digital): 95.81% Positive
3. 100:Complete First Season (blu-Ray): 94.24% Positive
4. Planes: Fire Rescue (2 Discs) (includes Digital Copy) (blu-Ray/dvd): 93.43% Positive
5. Clorox Disinfecting Bathroom Cleaner: 92.45% Positive
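
A product can reach 100% positive on the strength of very few reviews, so it may help to report the review count next to the percentage. The sketch below computes both per product and optionally filters by a minimum number of reviews. The function name `positive_share`, the `min_reviews` threshold, and the toy reviews are hypothetical; only the column names `name` and `predicted_sentiment` come from the notebook.

```python
import pandas as pd


def positive_share(reviews, min_reviews=1):
    """Hypothetical helper: percentage of 'Positive' predictions and review count per product."""
    flagged = reviews.assign(
        is_positive=reviews["predicted_sentiment"].eq("Positive").astype(float)
    )
    summary = flagged.groupby("name")["is_positive"].agg(["mean", "count"])
    summary = summary.rename(columns={"mean": "pct_positive", "count": "n_reviews"})
    summary["pct_positive"] = summary["pct_positive"] * 100

    # Drop products with too few reviews to trust the percentage
    summary = summary[summary["n_reviews"] >= min_reviews]
    return summary.sort_values("pct_positive", ascending=False)


# Toy predictions, made up for illustration
toy_reviews = pd.DataFrame({
    "name": ["X", "X", "Y", "Y", "Y", "Y", "Z"],
    "predicted_sentiment": ["Positive", "Positive",
                            "Positive", "Negative", "Positive", "Positive",
                            "Positive"],
})

toy_summary = positive_share(toy_reviews, min_reviews=2)
print(toy_summary)
```

In the toy data, product Z has a single positive review and is excluded by `min_reviews=2`, leaving X at 100% (2 reviews) and Y at 75% (4 reviews).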


## Summary:

### Data Analysis Key Findings

*   There are 11924 reviews available for the top 20 recommended products.
*   The Random Forest model was used to predict the sentiment of these reviews, categorizing them as either 'Positive' or 'Negative'.
*   The percentage of positive sentiment varies among the recommended products.
*   "Chips Deluxe Soft 'n Chewy Cookies" has the highest percentage of positive reviews at 100%.
*   The top 5 products with the highest percentage of positive reviews are:
    1.  Chips Deluxe Soft 'n Chewy Cookies: 100.00% Positive
    2.  My Big Fat Greek Wedding 2 (blu-Ray + Dvd + Digital): 95.81% Positive
    3.  100:Complete First Season (blu-Ray): 94.24% Positive
    4.  Planes: Fire Rescue (2 Discs) (includes Digital Copy) (blu-Ray/dvd): 93.43% Positive
    5.  Clorox Disinfecting Bathroom Cleaner: 92.45% Positive

### Insights or Next Steps

*   The high percentage of positive reviews for the top 5 products suggests these items are likely to be well-received by users. Consider highlighting these products in marketing or promotional efforts.
*   Further analysis could involve examining the content of the negative reviews for the other recommended products to identify areas for potential improvement or address common complaints.



Let's generate the pickle files.

In [78]:
import pickle
import os

# Create a directory to save the pickle files
pickle_dir = 'recommendation_app/pickles'
os.makedirs(pickle_dir, exist_ok=True)

# Define the file paths
user_item_matrix_path = os.path.join(pickle_dir, 'user_item_matrix.pkl')
item_similarity_df_path = os.path.join(pickle_dir, 'item_similarity_df.pkl')
rf_clf_path = os.path.join(pickle_dir, 'rf_clf.pkl')
tfidf_vectorizer_path = os.path.join(pickle_dir, 'tfidf_vectorizer.pkl')
df_path = os.path.join(pickle_dir, 'df.pkl')

# Save the variables using pickle
try:
    with open(user_item_matrix_path, 'wb') as f:
        pickle.dump(user_item_matrix, f)
    print(f"Saved user_item_matrix to {user_item_matrix_path}")

    with open(item_similarity_df_path, 'wb') as f:
        pickle.dump(item_similarity_df, f)
    print(f"Saved item_similarity_df to {item_similarity_df_path}")

    with open(rf_clf_path, 'wb') as f:
        pickle.dump(rf_clf, f)
    print(f"Saved rf_clf to {rf_clf_path}")

    with open(tfidf_vectorizer_path, 'wb') as f:
        pickle.dump(tfidf_vectorizer, f)
    print(f"Saved tfidf_vectorizer to {tfidf_vectorizer_path}")

    with open(df_path, 'wb') as f:
        pickle.dump(df, f)
    print(f"Saved df to {df_path}")

except Exception as e:
    print(f"Error saving files: {e}")

Saved user_item_matrix to recommendation_app/pickles/user_item_matrix.pkl
Saved item_similarity_df to recommendation_app/pickles/item_similarity_df.pkl
Saved rf_clf to recommendation_app/pickles/rf_clf.pkl
Saved tfidf_vectorizer to recommendation_app/pickles/tfidf_vectorizer.pkl
Saved df to recommendation_app/pickles/df.pkl
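
On the application side, the saved artifacts have to be read back before recommendations can be served. The sketch below shows one way this could look; the function name `load_artifacts` is an assumption, and the demo writes stand-in objects to a temporary directory instead of touching `recommendation_app/pickles`. Pickle files should only be loaded from a trusted source, because unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

# File stems used when the artifacts were saved above
ARTIFACT_NAMES = [
    "user_item_matrix",
    "item_similarity_df",
    "rf_clf",
    "tfidf_vectorizer",
    "df",
]


def load_artifacts(pickle_dir, names=ARTIFACT_NAMES):
    """Hypothetical helper: load each '<name>.pkl' in pickle_dir into a dict keyed by name."""
    artifacts = {}
    for name in names:
        path = os.path.join(pickle_dir, f"{name}.pkl")
        with open(path, "rb") as f:
            artifacts[name] = pickle.load(f)
    return artifacts


# Round-trip demo with placeholder objects in a temporary directory
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ARTIFACT_NAMES:
        with open(os.path.join(demo_dir, f"{name}.pkl"), "wb") as f:
            pickle.dump({"placeholder_for": name}, f)
    loaded = load_artifacts(demo_dir)

print(sorted(loaded))
```

Loading scikit-learn objects such as `rf_clf` and `tfidf_vectorizer` generally works best with the same library versions that were used to save them.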


**GitHub repository:** https://github.com/RohitKini/CapstoneProject

**App deployed on Heroku:** https://product-recommendation-project-8a76b78d452d.herokuapp.com/