# **BlushBot: Personalized Skincare Recommendation System**

---

### **Introduction**

In a world where skincare is deeply personal, finding the right product tailored to your specific needs can be a daunting task. BlushBot is here to revolutionize the way we shop for skincare by providing personalized product recommendations designed specifically for Indian customers.

BlushBot uses **content-based filtering**, **TF-IDF vectorization**, and **cosine similarity** to match users' stated preferences with suitable skincare products. Whether you have oily skin, dry skin, or a specific concern like hydration or anti-aging, BlushBot ranks products by how closely they match your skin type, concern, and budget.
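
As a rough illustration of how TF-IDF vectorization and cosine similarity fit together, the sketch below scores a user's concern against a handful of product concerns. The concern strings are toy values chosen for the example, not rows from the BlushBot dataset, and the variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical product concerns (toy data, not from the BlushBot dataset)
product_concerns = ["hydration", "acne", "antiaging", "hydration dryness"]

# Learn a vocabulary and IDF weights from the product concerns
vectorizer = TfidfVectorizer()
product_vectors = vectorizer.fit_transform(product_concerns)

# Project the user's concern into the same vector space
user_vector = vectorizer.transform(["hydration"])

# One similarity score per product, then rank from most to least similar
scores = cosine_similarity(user_vector, product_vectors).flatten()
ranked = sorted(zip(product_concerns, scores), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

A product whose concern is exactly the user's concern scores 1, a product with no shared terms scores 0, and partial overlaps fall in between.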

---

### **Why BlushBot?**

- **Personalized Recommendations**: Designed for Indian customers, BlushBot understands your unique skincare needs.
- **Smart Filtering**: Enter your **skin type**, **concern**, and **price range**, and let the system do the magic!
- **Trusted by Science**: Uses TF-IDF vectorization and cosine similarity to score how closely each product matches your inputs.
- **Simplifying Choices**: With BlushBot, say goodbye to overwhelming product lists and hello to tailored recommendations.

---

### **Project Highlights**

- **Categories Covered**:
  - Moisturizers, Serums, Sunscreens, Toners, and Cleansers.
- **Customer-Centric Features**:
  - Recommendations based on **skin type**, **price**, and **concerns** like hydration or anti-aging.
- **Local Relevance**:
  - Focuses on Indian brands to meet the preferences of Indian customers.
- **Technology Stack**:
  - Built using Python, pandas, scikit-learn, and Streamlit for deployment.

---

BlushBot doesn’t just recommend; it helps users make informed skincare choices tailored to their needs. Let’s dive into how it works!


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
data = pd.read_excel('/content/drive/MyDrive/BlushBot/datasets/combined_skincare_dataset.xlsx')

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 663 entries, 0 to 662
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   product_brand     663 non-null    object 
 1   product_name      663 non-null    object 
 2   product_category  663 non-null    object 
 3   quantity          663 non-null    object 
 4   price             663 non-null    int64  
 5   overall_rating    663 non-null    float64
 6   no_of_ratings     663 non-null    int64  
 7   concern           663 non-null    object 
 8   ingredient        663 non-null    object 
 9   skin_type         663 non-null    object 
 10  star_5            663 non-null    int64  
 11  star_4            660 non-null    float64
 12  star_3            628 non-null    float64
 13  star_2            591 non-null    float64
 14  star_1            629 non-null    float64
dtypes: float64(5), int64(3), object(7)
memory usage: 77.8+ KB


The dataset used in BlushBot contains information about various skincare products, categorized into the following columns:

### **Columns in the Dataset**
1. **product_brand**: The name of the skincare brand.
2. **product_name**: The name of the skincare product.
3. **product_category**: Type of product (e.g., Moisturizer, Serum, Cleanser).
4. **skin_type**: Targeted skin type (e.g., Oily, Dry, Sensitive).
5. **price**: The product's price in Indian Rupees (INR).
6. **overall_rating**: Average customer rating for the product.
7. **Star Ratings**:
   - **star_5**: Number of 5-star ratings.
   - **star_4**, **star_3**, **star_2**, **star_1**: Number of 4-, 3-, 2-, and 1-star ratings.
8. **concern**: The primary skin concern addressed by the product (e.g., Hydration, Anti-aging).
9. **ingredient**: Key active ingredients in the product.
10. **quantity**: Quantity of the skincare product in mL.
11. **no_of_ratings**: Total number of ratings across all star levels (star_5 through star_1).

In [7]:
data.shape

(663, 15)

In [8]:
# Create a mapping of product names to indices
product_indices = pd.Series(data.index, index=data['product_name']).to_dict()

# Create a reverse mapping of indices to product names
index_to_product = pd.Series(data['product_name'].values, index=data.index).to_dict()
index_to_product


{0: 'Gentle Skin Cleanser',
 1: 'Set of 6 Gulabari Rose Glow Face Cleanser Spray',
 2: 'Gulabari Set Of 3 Rose Glow Face Cleanser For Balanced & Hydrated Skin',
 3: '2% Niacinamide Non-Irritant & Soap Free Oily Skin Cleanser',
 4: '2% Niacinamide Gentle Skin Cleanser for Sensitive, Dry, Normal Skin',
 5: 'Watermelon SuperGlow Facial Gel Cleanser with Vitamin C & Cucumber',
 6: 'The Daily Duet Gentle Hydrating Cleanser',
 7: 'Salicylic Acid Jamun Face Wash Cleanser Gel For Active Acne',
 8: 'Purify & Glow Cleanser + Mask For Deep Cleanses Pores',
 9: 'Deep Pore Clean Milky Foam Face Wash Cleanser with Seaweed',
 10: 'Coffee Face Wash for Fresh & Glowing Skin,Hydrate Cleanser for Oil Removal',
 11: 'Luminous Brightening Foaming Cleanser & Face Wash For Clear & Even Skin',
 12: 'Regenerist Cleanser & Face Wash for Plump & Bouncy Skin With Salicylic Acid',
 13: 'Creamy Cleanser',
 14: 'Red Vine Face Wash Cleanser for Anti Ageing & De-Pigmentation',
 15: 'Acne Control Face Cleanser for Oily

In [9]:
# Checking missing values
print(data.isnull().sum())

product_brand        0
product_name         0
product_category     0
quantity             0
price                0
overall_rating       0
no_of_ratings        0
concern              0
ingredient           0
skin_type            0
star_5               0
star_4               3
star_3              35
star_2              72
star_1              34
dtype: int64


In [10]:
# Fill missing star ratings with 0
star_columns = ['star_5', 'star_4', 'star_3', 'star_2', 'star_1']
data[star_columns] = data[star_columns].fillna(0)

# Verify changes
print("Missing values after filling:")
print(data[star_columns].isnull().sum())

Missing values after filling:
star_5    0
star_4    0
star_3    0
star_2    0
star_1    0
dtype: int64


In [11]:
# Checking for duplicate rows
print("\nDuplicate Rows Count:")
print(data.duplicated().sum())


Duplicate Rows Count:
0


In [12]:
# Calculate percentage of ratings for each star category
for col in star_columns:
    data[f"{col}_percentage"] = (data[col] / (data['no_of_ratings'])).fillna(0)

# Verify
print(data[[f"{col}_percentage" for col in star_columns]].head())

   star_5_percentage  star_4_percentage  star_3_percentage  star_2_percentage  \
0           0.720541           0.186809           0.048944           0.014918   
1           0.770073           0.116788           0.041363           0.018248   
2           0.693084           0.177204           0.059068           0.021075   
3           0.647934           0.216529           0.080992           0.014876   
4           0.665768           0.210243           0.063342           0.016173   

   star_1_percentage  
0           0.028789  
1           0.053528  
2           0.049570  
3           0.039669  
4           0.044474  


* Converting counts to shares of `no_of_ratings` allows comparison across products with very different numbers of ratings. Despite the `_percentage` suffix, the values are fractions between 0 and 1.


* Adding the five columns `star_5_percentage` through `star_1_percentage` increases the DataFrame from 15 to 20 columns.
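
To make the comparison point concrete, here is a minimal sketch with two hypothetical products whose rating volumes differ by orders of magnitude. The product names and counts are invented for illustration.

```python
import pandas as pd

# Toy ratings for two hypothetical products with very different rating volumes
toy = pd.DataFrame({
    "product_name": ["Product A", "Product B"],
    "no_of_ratings": [10000, 50],
    "star_5": [7000, 40],
})

# Share of 5-star ratings; the value is a fraction in [0, 1], not a percentage
toy["star_5_percentage"] = toy["star_5"] / toy["no_of_ratings"]
print(toy)
```

Product A has far more 5-star ratings in absolute terms, yet Product B has the higher 5-star share, which is the quantity the ranking step uses.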

In [13]:
data.shape

(663, 20)

In [14]:
# Summary statistics
print("\nSummary Statistics:")
print(data.describe())


Summary Statistics:
             price  overall_rating  no_of_ratings        star_5        star_4  \
count   663.000000      663.000000     663.000000    663.000000    663.000000   
mean    471.286576        4.438763    4643.446456   3057.067873    969.389140   
std     352.134616        0.173091    8952.397853   5901.334426   1887.197458   
min      36.000000        2.900000       5.000000      3.000000      0.000000   
25%     269.000000        4.400000     141.000000     98.000000     25.000000   
50%     376.000000        4.400000     695.000000    472.000000    141.000000   
75%     539.000000        4.500000    4040.000000   2724.500000    824.500000   
max    3712.000000        5.000000   55261.000000  37018.000000  11520.000000   

            star_3       star_2       star_1  star_5_percentage  \
count   663.000000   663.000000   663.000000         663.000000   
mean    333.013575   105.571644   178.404223           0.675871   
std     654.793622   210.645685   337.739884    

In [15]:
categorical_columns = ['ingredient', 'concern', 'skin_type','product_category']

In [16]:
for col in categorical_columns:
    print(f"Unique values in {col}:")
    print(data[col].unique())
    print("\n")

Unique values in ingredient:
['Niacinamide' 'Rose' 'Gel' 'Amino Acids' 'Salicylic Acid' 'Seaweed'
 'Coffee' 'Vitamin C' 'Vitamin c' 'Aloe Vera' 'Retinol' 'Tea Tree'
 'Ceramides' 'Shea Butter' 'Hyaluronic Acid' 'Turmeric' 'Sandalwood'
 'Glycerin' 'Centella Asiatica' 'Cruelty-Free' 'Vitamin E' 'Hydration'
 'Vitamin B3' 'Green Tea' 'Witch Hazel' 'AHA' 'Papaya' 'Ceramide'
 'Daisy Flower Extract' 'Saffron' 'Avocado' 'Kiwi Gel' 'Honey'
 'Lemon Grass' 'Mint and Eucalyptus' 'Lemon' 'Amla' 'Dull Skin'
 'kumkumadi' 'Glycolic Acid' 'Green Apple' 'Grapefruit' 'Orange'
 'Liquorice Extract' 'Carrot' 'Ceramides ' 'Squalene' 'CICA Niacinamide'
 'Almond' 'Vitamin C and ECream' 'Beetroot' 'Alpha Arbutin' 'Watermelon'
 'Pomegranate' 'Cream' 'Milk' 'Peach' 'Snail Mucin' 'Cucumber' '24k Gold'
 'Caffeine' 'Kojic Acid' 'Goat Milk' 'Vitamin C and E' 'Collagen'
 'Ginseng' 'Peptides' 'Argan Oil' 'Azelaic Acid' 'Zinc Oxide'
 'Cherry tomato' 'Titanium Dioxide' 'Liquid' 'Anti-Oxidants' 'Blueberry'
 'Cica calming' 

In [17]:
#remove extra spaces
for col in categorical_columns:
    data[col] = data[col].str.strip()

#make case uniform
for col in categorical_columns:
    data[col] = data[col].str.lower()  # Convert to lowercase
# Alternatively, for title case:
# data[col] = data[col].str.title()

In [18]:
#mapping
skin_type_mapping = {
    'liquid': 'combination',  # Correct typo
    'lotion': 'combination',
    'normal skin': 'normal',
}
data['skin_type'] = data['skin_type'].replace(skin_type_mapping)

ingredient_mapping = {
    # Standard Ingredients
    'niacinamide': 'niacinamide',
    'rose water': 'rosewater',
    'amino acids': 'aminoacids',
    'salicylic acid': 'salicylicacid',
    'seaweed': 'seaweed',
    'coffee': 'coffee',
    'vitamin c': 'vitaminc',
    'vitamin c and e': 'vitamincande',
    'vitamin c and ecream': 'vitamincande',
    'aloe vera': 'aloevera',
    'tea tree': 'teatree',
    'ceramides': 'ceramides',
    'ceramides and hyaluronic acid': 'ceramides',
    'shea butter': 'sheabutter',
    'hyaluronic acid': 'hyaluronicacid',
    'turmeric': 'turmeric',
    'sandalwood': 'sandalwood',
    'glycerin': 'glycerin',
    'centella asiatica': 'centellaasiatica',
    'centella asiatic': 'centellaasiatica',  # Fix typo
    'centella astiatica': 'centellaasiatica',  # Fix typo
    'centella water': 'centellaasiatica',
    'vitamin e': 'vitamine',
    'vitamin b3': 'vitaminb3',
    'green tea': 'greentea',
    'green tea face': 'greentea',
    'witch hazel': 'witchhazel',
    'witch hasel': 'witchhazel',  # Fix typo
    'aha': 'aha',
    'bha': 'bha',
    'aha-bha': 'ahabha',
    'pha': 'pha',
    'glycolic acid': 'glycolicacid',
    'alpha arbutin': 'alphaarbutin',
    'kojic acid': 'kojicacid',
    'azelaic acid': 'azelaicacid',
    'zinc oxide': 'zincoxide',
    'titanium dioxide': 'titaniumdioxide',
    'papaya': 'papaya',
    'daisy flower extract': 'daisyflowerextract',
    'saffron': 'saffron',
    'avocado': 'avocado',
    'kiwi gel': 'kiwi',
    'honey': 'honey',
    'lemon grass': 'lemongrass',
    'mint and eucalyptus': 'eucalyptus',
    'lemon': 'lemon',
    'amla': 'amla',
    'kumkumadi': 'kumkumadi',
    'green apple': 'greenapple',
    'grapefruit': 'grapefruit',
    'orange': 'orange',
    'liquorice extract': 'licoriceextract',
    'licorice extract': 'licoriceextract',
    'carrot': 'carrot',
    'carrot juice': 'carrot',
    'squalene': 'squalane',
    'squalane': 'squalane',
    'retinol': 'retinol',
    'almond': 'almond',
    'beetroot': 'beetroot',
    'watermelon': 'watermelon',
    'pomegranate': 'pomegranate',
    'pomegranate extract': 'pomegranate',
    'pomegranate and collagen': 'pomegranate',
    'milk': 'milk',
    'goat milk': 'milk',
    'milk peptide': 'peptides',
    'peach': 'peach',
    'snail mucin': 'snailmucin',
    'cucumber': 'cucumber',
    'cucumber and vitamin c': 'cucumber',
    '24k gold': 'gold',
    'caffeine': 'coffee',
    'collagen': 'collagen',
    'ginseng': 'ginseng',
    'ginseng and niacinamide': 'ginseng',
    'peptides': 'peptides',
    'argan oil': 'arganoil',
    'cherry tomato': 'cherrytomato',
    'blueberry': 'blueberry',
    'coconut': 'coconut',
    'neem': 'neem',
    'strawberry': 'strawberry',
    'olives': 'olives',
    'jojoba': 'jojoba',
    'chamomile white tea': 'chamomile',
    'melanostain': 'melanostain',
    'heartleaf': 'heartleaf',
    'rice water': 'ricewater',
    'rice water and niacinamide': 'niacinamide',
    'cerapeptide': 'peptides',
    'mandarin-vitamin': 'mandarin',
    'black tea': 'blacktea',
    'green plum': 'greenplum',
    'rose and mint': 'rosewater',
    'rose oil and tea': 'rosewater',
    'apricot': 'apricot',
    'white seed': 'whiteseed',
    'pineapple': 'pineapple',
    'jamun': 'jamun',
    'white tea': 'whitetea',
    'camelia': 'camellia',
    'houttuynia cordata': 'houttuyniacordata',
    'hydrium watery': 'hydriumwatery',
    'dewy rose': 'rosewater',
    'chamomile and white tea':'chamomile',
    'classic rose' :'rosewater',

    # Non-Specific Entries to Remove or Replace
    'dull skin': None,
    'gel': None,
    'hydration': None,
    'cruelty-free': None,
    'organic': None,
    'cream': None,
    'liquid': None,
    'cica calming': None,
    'non-comedogenic': None,
    'sand 03': None,
    'anti-oxidants': None
}
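data['ingredient'] = data['ingredient'].replace(ingredient_mapping)
# Drop rows whose ingredient was mapped to None; .copy() makes the result an
# independent DataFrame so later column assignments do not act on a slice
data = data[data['ingredient'].notnull()].copy()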

concern_mapping = {
    'deep nourishment': 'nourishment',
    'dull skin': 'dullness',
    'acne or blemishes': 'acne',
    'excess oil': 'oilcontrol',
    'pore care': 'porecare',
    'deep pore clean': 'deepcleansing',
    'anti-ageing': 'antiaging',
    'deep cleansing': 'deepcleansing',
    'anti-pollution': 'antipollution',
    'retinol': None,
    'pigmentation': 'pigmentation',
    'tan removal': 'tanremoval',
    'hydrating': 'hydration',
    'brightening': 'brightening',
    'oil control': 'oilcontrol',
    'dark spots': 'darkspots',
    'skin inflammation': 'inflammation',
    'dryness': 'dryness',
    'hydration': 'hydration',
    'uneven skin tone': 'unevenskintone',
    'dullness': 'dullness',
    'softening': 'hydration',
    'vitamin c and e': None,
    'sun protection': 'sunprotection',
    'softening and smoothening': 'softening',
    'darks spots & pigmentation': 'darkspots',
    'skin brightening': 'brightening',
    'ageing skin': 'antiaging',
    'uneven texture': 'uneventexture',
    'acne prone skin': 'acne',
    'darks spots pigmentation': 'darkspots',
    'glow': 'glow',
    'detan': 'tanremoval',
    'pore minimizing and blurring': 'porecare',
    'firming': 'antiaging',
    'blackheads and whiteheads': 'blackheads',
    'redness': 'redness'
}
data['concern'] = data['concern'].replace(concern_mapping)

# Drop rows whose concern was mapped to None; .copy() keeps later assignments off a slice
data = data[data['concern'].notnull()].copy()


In [19]:
data.shape

(646, 20)

In [20]:
for col in categorical_columns:
    print(f"Unique values in {col}:")
    print(data[col].unique())
    print("\n")

Unique values in ingredient:
['niacinamide' 'rose' 'aminoacids' 'salicylicacid' 'seaweed' 'coffee'
 'vitaminc' 'aloevera' 'teatree' 'ceramides' 'sheabutter' 'hyaluronicacid'
 'turmeric' 'sandalwood' 'glycerin' 'centellaasiatica' 'vitamine'
 'vitaminb3' 'greentea' 'witchhazel' 'aha' 'papaya' 'ceramide'
 'daisyflowerextract' 'saffron' 'avocado' 'kiwi' 'honey' 'lemongrass'
 'eucalyptus' 'lemon' 'amla' 'kumkumadi' 'glycolicacid' 'greenapple'
 'grapefruit' 'orange' 'licoriceextract' 'carrot' 'squalane' 'retinol'
 'cica niacinamide' 'almond' 'vitamincande' 'beetroot' 'alphaarbutin'
 'watermelon' 'pomegranate' 'milk' 'peach' 'snailmucin' 'cucumber' 'gold'
 'kojicacid' 'collagen' 'ginseng' 'peptides' 'arganoil' 'azelaicacid'
 'zincoxide' 'cherrytomato' 'titaniumdioxide' 'blueberry' 'coconut' 'neem'
 'strawberry' 'olives' 'jojoba' 'chamomile' 'melanostain' 'rosewater'
 'heartleaf' 'ahabha' 'pha' 'ricewater' 'nectar' 'blacktea' 'greenplum'
 'apricot' 'whiteseed' 'pineapple' 'jamun' 'whitetea' 'b
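
The cleaned output above still lists values such as 'rose', 'ceramide', and 'cica niacinamide', none of which appears as a key in `ingredient_mapping`, so `replace` leaves them unchanged. A minimal sketch for listing values that a mapping does not cover follows; `find_unmapped` is a hypothetical helper, and the mapping and values are a toy subset.

```python
def find_unmapped(values, mapping):
    """Return the sorted values that have no key in `mapping`."""
    return sorted(set(values) - set(mapping))

# Toy subset of the ingredient mapping and of the observed column values
toy_mapping = {"rose water": "rosewater", "ceramides": "ceramides", "gel": None}
observed = ["rose", "rose water", "ceramide", "ceramides", "gel"]

unmapped = find_unmapped(observed, toy_mapping)
print(unmapped)
```

In the notebook, the equivalent call would presumably be `find_unmapped(data['ingredient'].unique(), ingredient_mapping)`, run before applying the mapping.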

In [21]:
data.columns

Index(['product_brand', 'product_name', 'product_category', 'quantity',
       'price', 'overall_rating', 'no_of_ratings', 'concern', 'ingredient',
       'skin_type', 'star_5', 'star_4', 'star_3', 'star_2', 'star_1',
       'star_5_percentage', 'star_4_percentage', 'star_3_percentage',
       'star_2_percentage', 'star_1_percentage'],
      dtype='object')

Label Encoding for 'product_category' and 'skin_type'

In [22]:
import pickle
with open('/content/drive/MyDrive/BlushBot/datasets/data.pkl', 'wb') as file:
    pickle.dump(data, file)

print("Cleaned DataFrame saved to data.pkl")

Cleaned DataFrame saved to data.pkl


In [23]:
from sklearn.preprocessing import LabelEncoder

# Initialize label encoders
label_encoder_category = LabelEncoder()
label_encoder_skinType = LabelEncoder()

# Apply encoding to Product Category
data['Product Category Encoded'] = label_encoder_category.fit_transform(data['product_category'])

# Apply encoding to Skin Type
data['Skin Type Encoded'] = label_encoder_skinType.fit_transform(data['skin_type'])

# Check the mappings
print("Product Category Mapping:")
print(dict(zip(label_encoder_category.classes_, label_encoder_category.transform(label_encoder_category.classes_))))

print("\nSkin Type Mapping:")
print(dict(zip(label_encoder_skinType.classes_, label_encoder_skinType.transform(label_encoder_skinType.classes_))))

Product Category Mapping:
{'cleanser': 0, 'moisturizer': 1, 'serum': 2, 'sunscreen': 3, 'toner': 4}

Skin Type Mapping:
{'combination': 0, 'dry': 1, 'normal': 2, 'oily': 3, 'sensitive': 4}


TF-IDF Vectorization for 'concern' and 'ingredient'

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_concern = TfidfVectorizer(max_features=50, stop_words='english')
tfidf_ingredient = TfidfVectorizer(max_features=50, stop_words='english')

concern_tfidf = tfidf_concern.fit_transform(data['concern']).toarray()
ingredient_tfidf = tfidf_ingredient.fit_transform(data['ingredient']).toarray()

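# Combine TF-IDF features into DataFrames, reusing data's index so that a later
# pd.concat(axis=1) aligns each row with its product (rows were dropped during cleaning)
concern_df = pd.DataFrame(concern_tfidf, columns=tfidf_concern.get_feature_names_out(), index=data.index)
ingredient_df = pd.DataFrame(ingredient_tfidf, columns=tfidf_ingredient.get_feature_names_out(), index=data.index)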


In [31]:
import pickle

# Save both the TF-IDF matrix and the fitted vectorizer
with open('/content/drive/MyDrive/BlushBot/datasets/concern_tfidf.pkl', 'wb') as file:
    pickle.dump({
        'concern_tfidf': concern_tfidf,
        'tfidf_concern': tfidf_concern
    }, file)


In [None]:
print(concern_df.shape)
print(ingredient_df.shape)

In [None]:
# Normalize price
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data['Price Normalized'] = scaler.fit_transform(data[['price']])
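
With its default settings, `MinMaxScaler` rescales each value as (x - min) / (max - min). The sketch below checks that a hand-written version of that formula agrees with the fitted scaler for a single user price. The three prices are toy values and `user_price` is a hypothetical input.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy prices in INR; the first and last act as the minimum and maximum
prices = np.array([[36.0], [376.0], [3712.0]])

scaler = MinMaxScaler().fit(prices)

user_price = 250.0  # hypothetical user budget
via_scaler = scaler.transform([[user_price]])[0, 0]
via_formula = (user_price - prices.min()) / (prices.max() - prices.min())
print(via_scaler, via_formula)
```

A user price outside the fitted range produces a value below 0 or above 1, since the default scaler does not clip.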

In [None]:
print(data['Price Normalized'])

In [None]:
data.head()


In [None]:
for col in ['product_brand', 'product_name']:
    freq = data[col].value_counts() / len(data)
    data[f"{col}_encoded"] = data[col].map(freq)

print(data)
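
The loop above is a form of frequency encoding: each category is replaced by the fraction of rows in which it occurs. A minimal sketch on a toy column, using invented brand labels:

```python
import pandas as pd

toy = pd.DataFrame({"product_brand": ["A", "A", "B", "A"]})  # hypothetical brands

# Relative frequency of each brand: A appears in 3 of 4 rows, B in 1 of 4
freq = toy["product_brand"].value_counts(normalize=True)
toy["product_brand_encoded"] = toy["product_brand"].map(freq)
print(toy)
```

One caveat worth keeping in mind: if most product names are unique, nearly every `product_name_encoded` value would equal 1 / len(data), so that column would carry little information.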

In [None]:
# Build a DataFrame containing only the columns to save

# Select specific columns for the new DataFrame
columns_to_save = ['product_brand_encoded', 'product_name_encoded', 'product_category', 'skin_type', 'price', 'overall_rating', 'concern', 'ingredient', 'Product Category Encoded', 'Skin Type Encoded', 'Price Normalized']
data_to_save = data[columns_to_save]

# Display the new DataFrame (optional)
print(data_to_save.head())

In [None]:
columns_to_combine = [
    'star_5_percentage', 'star_4_percentage', 'star_3_percentage', 'star_2_percentage', 'star_1_percentage',
    'Price Normalized', 'Product Category Encoded', 'Skin Type Encoded',
    'product_brand_encoded', 'product_name_encoded'
]

In [None]:
final_combined = data[columns_to_combine]
final_combined.head()

In [None]:
model_df = pd.concat([final_combined, concern_df, ingredient_df], axis=1)
model_df.head()
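
`pd.concat(..., axis=1)` aligns rows by index label, not by position. A frame built from a bare array gets a fresh 0..n-1 index, whereas a DataFrame that has had rows dropped keeps gaps in its labels, so concatenating the two can introduce NaN rows. The toy frames below are hypothetical and only illustrate the effect.

```python
import pandas as pd

# A filtered frame keeps its original labels (the row labelled 1 was dropped)
features = pd.DataFrame({"price": [100, 300]}, index=[0, 2])

# Built without an explicit index, this frame gets a fresh RangeIndex: 0, 1
tfidf_default = pd.DataFrame({"hydration": [1.0, 0.5]})
misaligned = pd.concat([features, tfidf_default], axis=1)  # 3 rows, with NaN

# Reusing the filtered frame's index keeps every row paired with its product
tfidf_aligned = pd.DataFrame({"hydration": [1.0, 0.5]}, index=features.index)
aligned = pd.concat([features, tfidf_aligned], axis=1)     # 2 rows, no NaN

print(misaligned)
print(aligned)
```

Checking `model_df.shape` against `data.shape` is a quick way to spot this kind of misalignment.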

In [None]:
model_df.shape

In [None]:
# Filling all the Nan values in model_df with 0
model_df.fillna(0, inplace=True)

In [None]:
# Save the combined DataFrame to a CSV file
model_df.to_csv('/content/drive/MyDrive/BlushBot/datasets/combined_encoded_data.csv', index=False)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(model_df)

In [None]:
# Function to encode user input based on the mappings
def encode_user_inputs(user_skin_type, user_category, user_price, data):
    # Encoding for Skin Type and Product Category
    skin_type_mapping = {'combination': 0, 'dry': 1, 'normal': 2, 'oily': 3, 'sensitive': 4}
    category_mapping = {'cleanser': 0, 'moisturizer': 1, 'serum': 2, 'sunscreen': 3, 'toner': 4}
    skin_type_encoded = skin_type_mapping.get(user_skin_type.lower(), None)
    category_encoded = category_mapping.get(user_category.lower(), None)

    if skin_type_encoded is None or category_encoded is None:
        return None  # Invalid input

    # Normalize user price based on the data's normalized price range
    price_normalized = (user_price - data['price'].min()) / (data['price'].max() - data['price'].min())

    return skin_type_encoded, category_encoded, price_normalized
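
The hard-coded dictionaries in `encode_user_inputs` mirror what `LabelEncoder` produced earlier. One alternative, sketched here under the assumption that the fitted encoder is still available, is to derive the mapping from `classes_` so the two cannot drift apart. The skin types below are toy values.

```python
from sklearn.preprocessing import LabelEncoder

# Toy skin types; in the notebook these would come from data['skin_type']
skin_types = ["oily", "dry", "combination", "sensitive", "normal", "dry"]

encoder = LabelEncoder().fit(skin_types)

# classes_ is sorted, and each class's position is its integer code
skin_type_mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(skin_type_mapping)
```

For these five labels the derived mapping matches the skin type mapping printed by the label-encoding cell.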


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Predefined concerns in the data (example, adjust based on your dataset)
all_concerns = data['concern'].unique()  # All unique concerns in the dataset

# Fit TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
concern_tfidf = tfidf_vectorizer.fit_transform(all_concerns)

# Function to vectorize the user's concern with the fitted vectorizer
def vectorize_user_concern(user_concern, tfidf_vectorizer):
    user_concern = user_concern.strip().lower()  # str methods return new strings, so reassign
    user_concern_vec = tfidf_vectorizer.transform([user_concern])  # Vectorize user input
    return user_concern_vec

In [None]:
import pickle
with open('/content/drive/MyDrive/BlushBot/datasets/tfidf_vectorizer.pkl', 'wb') as file:
    pickle.dump(tfidf_vectorizer, file)  # save the fitted vectorizer itself, not the TF-IDF matrix

print("TF-IDF Vectorizer saved to tfidf_vectorizer.pkl")
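
The recommendation function later calls `tfidf_vectorizer.transform(...)`, so whatever is loaded back from `tfidf_vectorizer.pkl` has to be the fitted vectorizer object rather than a TF-IDF matrix. The sketch below round-trips a fitted vectorizer through pickle, using an in-memory buffer in place of a file and a toy corpus.

```python
import io
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer().fit(["hydration", "acne", "antiaging"])  # toy corpus

buffer = io.BytesIO()            # stands in for a file on disk
pickle.dump(vectorizer, buffer)  # persist the fitted vectorizer
buffer.seek(0)
restored = pickle.load(buffer)

# The restored object transforms new text exactly like the original
original_vec = vectorizer.transform(["acne"]).toarray()
restored_vec = restored.transform(["acne"]).toarray()
print((original_vec == restored_vec).all())
```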

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def recommend_filtered_products(user_skin_type, user_category, user_price, user_concern, data, tfidf_vectorizer, top_n=5, weight1=0.5, weight2=0.3, weight3=0.2, sigma=100):
    """
    Recommend products based on user inputs, star percentages, and price conditions, with price balance.

    Parameters:
    - user_skin_type (str): Skin type (non-encoded).
    - user_category (str): Product category (non-encoded).
    - user_price (float): User's price range.
    - user_concern (str): Skin concern (already preprocessed).
    - data (DataFrame): Original dataset.
    - tfidf_vectorizer: Pre-trained TF-IDF vectorizer for 'concern'.
    - top_n (int): Number of top recommendations to return.
    - weight1 (float): Weight for cosine similarity in final ranking.
    - weight2 (float): Weight for 5-star percentage in final ranking.
    - weight3 (float): Weight for price score in final ranking.
    - sigma (float): Standard deviation for price proximity score.

    Returns:
    - DataFrame with top recommended products.
    """
    # Step 1: Encode user inputs
    encoded_inputs = encode_user_inputs(user_skin_type, user_category, user_price, data)

    if encoded_inputs is None:
        return "Invalid input provided."

    skin_type_encoded, category_encoded, price_normalized = encoded_inputs

    # Step 2: Filter products based on user inputs (skin type, category, and price conditions)
    filtered_data = data[
        (data['Skin Type Encoded'] == skin_type_encoded) &
        (data['Product Category Encoded'] == category_encoded) &
        # Keep prices within 150 INR of the target; both bounds must hold, hence & rather than |
        (data['price'] <= user_price + 150) & (data['price'] >= user_price - 150)
    ].copy()  # Explicitly create a copy to avoid chained assignment warnings

    if filtered_data.empty:
        return "No products match your filters"

    # Step 3: Vectorize the user's concern and compute cosine similarity with filtered products
    user_concern_vec = vectorize_user_concern(user_concern, tfidf_vectorizer)
    product_concerns = tfidf_vectorizer.transform(filtered_data['concern'])  # Vectorize product concerns
    similarity_scores = cosine_similarity(user_concern_vec, product_concerns).flatten()
    filtered_data['similarity_score'] = similarity_scores

    # Step 4: Compute price proximity score
    filtered_data['price_score'] = np.exp(-((filtered_data['price'] - user_price) ** 2) / (2 * sigma ** 2))

    # Step 5: Compute final ranking score
    filtered_data['final_score'] = (
        (filtered_data['similarity_score'] * weight1) +
        (filtered_data['star_5_percentage'] * weight2) +
        (filtered_data['price_score'] * weight3)
    )

    # Step 6: Sort by final score and select top N
    filtered_data = filtered_data.sort_values(by='final_score', ascending=False).head(top_n)

    # Return top N recommended products with details
    return filtered_data[['product_name', 'product_brand','product_category', 'price', 'skin_type', 'concern', 'star_5_percentage', 'final_score']]
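
The ranking in Steps 4 and 5 combines three terms: concern similarity, 5-star share, and a Gaussian price-proximity score exp(-(price - user_price)^2 / (2 * sigma^2)). The standalone sketch below reproduces that arithmetic for single values, using the function's default weights and sigma. The helper names and input numbers are hypothetical.

```python
import math

def price_score(price, user_price, sigma=100):
    """Gaussian proximity: 1.0 at the target price, decaying with distance."""
    return math.exp(-((price - user_price) ** 2) / (2 * sigma ** 2))

def final_score(similarity, star_5_share, price, user_price,
                weight1=0.5, weight2=0.3, weight3=0.2, sigma=100):
    """Weighted sum of similarity, 5-star share, and price proximity."""
    return (similarity * weight1
            + star_5_share * weight2
            + price_score(price, user_price, sigma) * weight3)

# Same concern match and rating share, different distance from a 250 INR budget
exact_match = final_score(1.0, 0.72, price=250, user_price=250)
far_price = final_score(1.0, 0.72, price=550, user_price=250)
print(exact_match, far_price)
```

With sigma = 100, a product priced 100 INR away from the target keeps about 61% of the price term, and one priced 300 INR away keeps about 1%.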


In [None]:
# Example user inputs
user_skin_type = "combination"
user_category = "cleanser"
user_price = 250
user_concern = "hydration"

# Get recommendations
recommended_products = recommend_filtered_products(
    user_skin_type=user_skin_type,
    user_category=user_category,
    user_price=user_price,
    user_concern=user_concern,
    data=data,
    tfidf_vectorizer=tfidf_vectorizer,
    top_n=5
)

# Display recommendations
print("Top Recommended Products:")
print(recommended_products)

In [None]:
test_case_1 = {
    'user_skin_type': 'oily',
    'user_category': 'moisturizer',
    'user_price': 400,
    'user_concern': 'oilcontrol',
    'relevant_products': [
        'Sustainable Naked & Raw Coffee Face Moisturizer',
        'Oil-Free Gel Face Moisturizer',
        '100% Natural Salicylic Acid Liquid Moisturiser with Aloevera & Zinc PCA',
        '5% Nia-Ceramide Mattifying Moisturizer for Oil Control',
        'Strawberry Dew Do-It-All Moisturizer'
    ]
}

test_case_2 = {
    'user_skin_type': 'dry',
    'user_category': 'serum',
    'user_price': 800,
    'user_concern': 'antiaging',
    'relevant_products': [
        'Regenerist Micro Sculpting Serum with Hyaluronic Acid',
        'Retinol24 Max Night Serum To Visibly Reduce Fine Lines',
        '24K Gold Serum with Niacinamide & Hyaluronic Acid',
        'Dandelion Youth Anti-Ageing Serum',
        '2% Salicylic Acid Serum With Witch Hazel'
    ]
}

test_case_3 = {
    'user_skin_type': 'sensitive',
    'user_category': 'sunscreen',
    'user_price': 600,
    'user_concern': 'sunprotection',
    'relevant_products': [
        'C-Cinamide Sunscreen SPF 50 PA+ Aqua Gel With Vitamin C & Niacinamide',
        'SPF 50++ Relief Sun Rice Probiotics',
        'Relief Sun',
        'Sun Sensitive UV Face Sunscreen',
        'Sun SPF30 Light Gel with Vitamin E'
    ]
}

test_case_4 = {
    'user_skin_type': 'combination',
    'user_category': 'cleanser',
    'user_price': 250,
    'user_concern': 'hydration',
    'relevant_products': [
        'Restore Hydrating Cream Cleanser',
        '2% Vitamin C Face Wash Foaming Cleanser with Brush',
        'The Daily Duet Gentle Hydrating Cleanser',
        'Clarifying Acne Cleanser with Zinc PCA & Salicylic Acid',
        'Total Effects Foaming Cleanser & Face Wash To Fight 7 Signs of Ageing'
    ]
}

test_case_5 = {
    'user_skin_type': 'normal',
    'user_category': 'toner',
    'user_price': 1500,
    'user_concern': 'darkspots',
    'relevant_products': [
        'Houttuynia Cordata Extract',
        '1.5% Vitamin C Alcohol-Free Spray Toner With Moisture',
        'Heartleaf 77% Soothing Toner',
        'Wonder Ceramide Mochi Toner',
        'White Seed Range Brightening Toner'
    ]
}


test_cases = [test_case_1, test_case_2,test_case_3,test_case_4,test_case_5]

In [None]:
# Precision at K
def calculate_precision_at_k(recommended_products, relevant_products, k):
    # Get the top k recommendations (assuming recommended_products is a DataFrame)
    top_k_recommendations = recommended_products['product_name'][:k].tolist()
    relevant_count = len(set(top_k_recommendations) & set(relevant_products))  # Find common products
    return relevant_count / k

# Recall at K
def calculate_recall_at_k(recommended_products, relevant_products, k):
    # Get the top k recommendations (assuming recommended_products is a DataFrame)
    top_k_recommendations = recommended_products['product_name'][:k].tolist()
    relevant_count = len(set(top_k_recommendations) & set(relevant_products))  # Find common products
    return relevant_count / len(relevant_products)  # Recall is based on the relevant products
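
As a worked example of the two metrics, the sketch below uses plain lists instead of a DataFrame. The function names, item labels, and relevance judgments are invented for illustration.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    top_k = recommended[:k]
    return len(set(top_k) & set(relevant)) / k

def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    top_k = recommended[:k]
    return len(set(top_k) & set(relevant)) / len(relevant)

recommended = ["A", "B", "C", "D", "E"]  # hypothetical ranked output
relevant = ["B", "D", "F", "G"]          # hypothetical ground truth

p_at_5 = precision_at_k(recommended, relevant, k=5)
r_at_5 = recall_at_k(recommended, relevant, k=5)
print(p_at_5, r_at_5)
```

Two of the five recommendations are relevant, and two of the four relevant items are retrieved. When the relevant list has exactly k items, as in the five test cases above with k = 5, the two denominators coincide and Precision@5 equals Recall@5.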


In [None]:
# Evaluate the model
results = []

for test in test_cases:
    # Generate recommendations
    recommended_products = recommend_filtered_products(
        user_skin_type=test['user_skin_type'],
        user_category=test['user_category'],
        user_price=test['user_price'],
        user_concern=test['user_concern'],
        data=data,
        tfidf_vectorizer=tfidf_vectorizer,
        top_n=5
    )

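    # Calculate Precision and Recall at K; a string return means no recommendations were produced
    if isinstance(recommended_products, str):
        precision, recall = 0.0, 0.0
    else:
        precision = calculate_precision_at_k(recommended_products, test['relevant_products'], k=5)
        recall = calculate_recall_at_k(recommended_products, test['relevant_products'], k=5)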

    results.append({'Test Case': test, 'Precision@5': precision, 'Recall@5': recall})

# Display results
for idx, result in enumerate(results, start=1):
    print(f"Test Case {idx}:")
    print(f"User Input: {result['Test Case']}")
    print(f"Precision@5: {result['Precision@5']:.2f}")
    print(f"Recall@5: {result['Recall@5']:.2f}\n")

In [None]:
# Calculate average Precision and Recall
avg_precision = sum(result['Precision@5'] for result in results) / len(results)
avg_recall = sum(result['Recall@5'] for result in results) / len(results)

print(f"Average Precision@5: {avg_precision:.2f}")
print(f"Average Recall@5: {avg_recall:.2f}")


In [None]:
model_df['Product Category Encoded'].values

In [None]:
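# Save the cleaned dataset as a plain dictionary; the with-block closes the file after writing
with open('/content/drive/MyDrive/BlushBot/datasets/data_df.pkl', 'wb') as file:
    pickle.dump(data.to_dict(), file)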

In [None]:
data['Skin Type Encoded'].values

### **Frontend**

In [None]:
!pip install streamlit

In [None]:
import streamlit as st
import pandas as pd
import pickle
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Save the data and fitted objects to local pickle files for the app to load
with open('data.pkl', 'wb') as file:
    pickle.dump(data, file)
with open('model_df.pkl', 'wb') as file:
    pickle.dump(model_df,file)
with open('tfidf_vectorizer.pkl', 'wb') as file:
    pickle.dump(tfidf_vectorizer,file)

In [None]:
# Load Pickle files
with open('data.pkl', 'rb') as file:
    data = pickle.load(file)

with open('model_df.pkl', 'rb') as file:
    model_df = pickle.load(file)

with open('tfidf_vectorizer.pkl', 'rb') as file:
    tfidf_vectorizer = pickle.load(file)

# encode_user_inputs and recommend_filtered_products must be defined or imported
# before this point; the versions defined earlier in this notebook are the ones used below.

# Streamlit App
st.title("Personalized Skincare Recommendation System")

# User Input Section
skin_type = st.selectbox("Select your skin type", ['oily', 'dry', 'sensitive', 'combination', 'normal'])
category = st.selectbox("Select product category", ['moisturizer', 'serum', 'sunscreen', 'cleanser', 'toner'])
concern = st.text_input("Enter your skin concern", "hydration")
price = st.number_input("Enter your price range (INR)", min_value=0, max_value=5000, value=500)

# Recommendation Button
if st.button("Get Recommendations"):
    recommendations = recommend_filtered_products(
        user_skin_type=skin_type,
        user_category=category,
        user_price=price,
        user_concern=concern,
        data=data,
        tfidf_vectorizer=tfidf_vectorizer,  # the function defined earlier takes the fitted vectorizer, not model_df
        top_n=5
    )

    if isinstance(recommendations, str):
        st.write(recommendations)
    else:
        st.write("Top Recommended Products:")
        for _, row in recommendations.iterrows():
            st.write(f"**{row['product_name']}** by {row['product_brand']} - ₹{row['price']} ({row['concern']})")
