<center>
<h1><b>Perfumedia</b></h1>
</center>

# **Abstract**
In the world of fragrance, each scent tells its own story, revealing facets of your personality and the impression you wish to leave on others. Tastes differ from person to person, so no single scent suits everyone. By learning your preferences, Perfumedia recommends perfumes that match your individual taste.

# **Problem Statement**
Build a recommendation system that helps users find perfumes that correspond to their preferences.

### Details of the Data
I could not find an existing dataset with the features I wanted, so the best option was to scrape the data from a fragrance website. The website I chose is https://en.parfumdreams.de/Fragrances.

The data I scraped from this website includes:
* Brand of the perfume
* Perfume name
* Category (EDT, EDP, Perfume etc.)
* Gender
* Base price (price in Euro per 1000ml)
* Notes of the perfume
* Fragrance of the perfume
* Character of the perfume
* Customer ratings
* Review counts
* URL of the perfume
* Image of the perfume


### Design Flow

The project follows these steps:

* Web scraping
* Data preprocessing
* EDA (Exploratory Data Analysis)
* Data visualization (in Python, SQL & Tableau)
* Data processing - feature selection & extraction
* Data modeling & model evaluation
* Perfume apps


### Import Libraries

In [3]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import re
from ydata_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML, Image, Markdown
import ipywidgets as widgets

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import warnings

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
warnings.filterwarnings('ignore')

# **Web Scraping with BeautifulSoup**

In [4]:
# url = 'https://en.parfumdreams.de/Fragrances'

In [5]:
# Selector 
# Men page 1-27
# url = https://en.parfumdreams.de/Fragrances/Mens-fragrances/Mens-perfumes?att_ziel=1&p={}

# Unisex page 1-37
# url = https://en.parfumdreams.de/Fragrances/Womens-fragrances/Womens-perfumes?att_ziel=0&p={}

# Women page 1-45
# url = https://en.parfumdreams.de/Fragrances/Womens-fragrances/Womens-perfumes?att_ziel=2&p={}

# url
#CategoryBoxItem > div.cw.clearfix > div:nth-child(2) > div.product-image.c.d-12 > a
#CategoryBoxItem > div.cw.clearfix > div:nth-child(96) > div.product-image.c.d-12 > a

# brand
#CategoryBoxItem > div.cw.clearfix > div:nth-child(2) > div:nth-child(3) > div.sale > a > p > span:nth-child(1)
#CategoryBoxItem > div.cw.clearfix > div:nth-child(2) > div:nth-child(3) > div:nth-child(2) > a > p > span:nth-child(1)

# name
#CategoryBoxItem > div.cw.clearfix > div:nth-child(2) > div:nth-child(3) > div.sale > a > p > span:nth-child(2)
#CategoryBoxItem > div.cw.clearfix > div:nth-child(2) > div:nth-child(3) > div:nth-child(2) > a > p > span:nth-child(2)

# category
#CategoryBoxItem > div.cw.clearfix > div:nth-child(2) > div:nth-child(3) > div.sale > a > p > span:nth-child(4)
#CategoryBoxItem > div.cw.clearfix > div:nth-child(2) > div:nth-child(3) > div:nth-child(2) > a > p > span:nth-child(4)

# base price
#CategoryBoxItem > div.cw.clearfix > div:nth-child(2) > div:nth-child(3) > div.sale > div.item-base-price.font-description.premium-subscription-inactive
#CategoryBoxItem > div.cw.clearfix > div:nth-child(2) > div:nth-child(3) > div:nth-child(2) > div.item-base-price.font-description.premium-subscription-inactive

# image
#CategoryBoxItem > div.cw.clearfix > div:nth-child(2) > div.product-image.c.d-12 > a > img
#CategoryBoxItem > div.cw.clearfix > div:nth-child(96) > div.product-image.c.d-12 > a > img
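
The listing pages are paginated through the `p` query parameter. The following is a minimal sketch of how the page URLs could be generated per gender, assuming the `att_ziel` values and page counts used in this notebook's scraping cell. The names `LISTINGS` and `listing_urls` are hypothetical and not part of the original code.

```python
# URL patterns and page counts as used in this notebook's scraper.
LISTINGS = {
    "Men": ("https://en.parfumdreams.de/Fragrances/Mens-fragrances/Mens-perfumes?att_ziel=1&p={}", 27),
    "Women": ("https://en.parfumdreams.de/Fragrances/Womens-fragrances/Womens-perfumes?att_ziel=2&p={}", 45),
    "Unisex": ("https://en.parfumdreams.de/Fragrances/Womens-fragrances/Womens-perfumes?att_ziel=0&p={}", 37),
}


def listing_urls(gender):
    """Return every listing page URL for one gender category."""
    pattern, max_pages = LISTINGS[gender]
    return [pattern.format(page) for page in range(1, max_pages + 1)]


men_urls = listing_urls("Men")
print(len(men_urls), men_urls[0])
```

Keeping the pattern and the page count together in one mapping avoids a separate `if`/`elif` chain for the page limits.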

In [6]:
# # Define an empty list to store the extracted information
# data = []

# # Define the base URLs for different gender categories
# urls = {
#     'Men': 'https://en.parfumdreams.de/Fragrances/Mens-fragrances/Mens-perfumes?att_ziel=1&p={}',
#     'Women': 'https://en.parfumdreams.de/Fragrances/Womens-fragrances/Womens-perfumes?att_ziel=2&p={}',
#     'Unisex': 'https://en.parfumdreams.de/Fragrances/Womens-fragrances/Womens-perfumes?att_ziel=0&p={}'
# }

# # Define the CSS selector for product information
# base_selector = '#CategoryBoxItem > div.cw.clearfix > div:nth-child({}) > div.product-image.c.d-12 > a'

# # Define the CSS selectors for brand, name, category, base price, and image
# selector_brand = '#CategoryBoxItem > div.cw.clearfix > div:nth-child({}) > div:nth-child(3) > div.sale > a > p > span:nth-child(1)'
# selector_brand_alternative = '#CategoryBoxItem > div.cw.clearfix > div:nth-child({}) > div:nth-child(3) > div:nth-child(2) > a > p > span:nth-child(1)'

# selector_name_1 = '#CategoryBoxItem > div.cw.clearfix > div:nth-child({}) > div:nth-child(3) > div.sale > a > p > span:nth-child(2)'
# selector_name_2 = '#CategoryBoxItem > div.cw.clearfix > div:nth-child({}) > div:nth-child(3) > div.sale > a > p > span:nth-child(3)'
# selector_name_alternative_1 = '#CategoryBoxItem > div.cw.clearfix > div:nth-child({}) > div:nth-child(3) > div:nth-child(2) > a > p > span:nth-child(2)'
# selector_name_alternative_2 = '#CategoryBoxItem > div.cw.clearfix > div:nth-child({}) > div:nth-child(3) > div:nth-child(2) > a > p > span:nth-child(3)'

# selector_category = '#CategoryBoxItem > div.cw.clearfix > div:nth-child({}) > div:nth-child(3) > div.sale > a > p > span:nth-child(4)'
# selector_category_alternative = '#CategoryBoxItem > div.cw.clearfix > div:nth-child({}) > div:nth-child(3) > div:nth-child(2) > a > p > span:nth-child(4)'

# selector_base_price = '#CategoryBoxItem > div.cw.clearfix > div:nth-child({}) > div:nth-child(3) > div.sale > div.item-base-price.font-description.premium-subscription-inactive'
# selector_base_price_alternative = '#CategoryBoxItem > div.cw.clearfix > div:nth-child({}) > div:nth-child(3) > div:nth-child(2) > div.item-base-price.font-description.premium-subscription-inactive'

# selector_image = '#CategoryBoxItem > div.cw.clearfix > div:nth-child({}) > div.product-image.c.d-12 > a > img'

# # Iterate over different gender categories and corresponding URLs
# for gender, url_pattern in urls.items():
#     # Determine the maximum page number based on the gender category
#     max_pages = 0
#     if gender == 'Men':
#         max_pages = 27
#     elif gender == 'Women':
#         max_pages = 45
#     elif gender == 'Unisex':
#         max_pages = 37

#     for page in range(1, max_pages + 1):
#         # Construct the URL by substituting the page number into the base URL pattern
#         url = url_pattern.format(page)

#         # Send a GET request to the URL
#         r = requests.get(url)

#         # Parsing the HTML code
#         soup = BeautifulSoup(r.content, 'html.parser')

#         # Iterate over the range of child indices
#         for index in range(2, 97):
#             # Construct the CSS selectors by substituting the index into the base selectors
#             selector_product = base_selector.format(index)
#             selector_brand_product = selector_brand.format(index)
#             selector_name_product_1 = selector_name_1.format(index)
#             selector_name_product_2 = selector_name_2.format(index)
#             selector_category_product = selector_category.format(index)
#             selector_base_price_product = selector_base_price.format(index)
#             selector_image_product = selector_image.format(index)

#             # Find the element matching the selector for product information
#             element_product = soup.select_one(selector_product)

#             # Check if the element exists
#             if element_product:
#                 # Extract the href attribute (product URL)
#                 href = element_product['href']

#                 # Find the elements matching the selectors for brand, name, category, base price, and image
#                 brand_element = soup.select_one(selector_brand_product)
#                 name_element_1 = soup.select_one(selector_name_product_1)
#                 name_element_2 = soup.select_one(selector_name_product_2)
#                 category_element = soup.select_one(selector_category_product)
#                 base_price_element = soup.select_one(selector_base_price_product)
#                 image_element = soup.select_one(selector_image_product)

#                 # Check if the elements exist before extracting their text
#                 brand = brand_element.text.strip() if brand_element else soup.select_one(selector_brand_alternative.format(index)).text.strip()
#                 name_1 = name_element_1.text.strip() if name_element_1 else soup.select_one(selector_name_alternative_1.format(index)).text.strip()
#                 name_2 = name_element_2.text.strip() if name_element_2 else (soup.select_one(selector_name_alternative_2.format(index)).text.strip() if soup.select_one(selector_name_alternative_2.format(index)) else None)
#                 category = category_element.text.strip() if category_element else soup.select_one(selector_category_alternative.format(index)).text.strip()
#                 base_price = base_price_element.text.strip() if base_price_element else (soup.select_one(selector_base_price_alternative.format(index)).text.strip() if soup.select_one(selector_base_price_alternative.format(index)) else None)
#                 image = image_element['src'] if image_element else None

#                 # Append the extracted information to the list along with the gender information
#                 data.append({'brand': brand, 'name_1': name_1, 'name_2': name_2, 'category': category, 'gender': gender, 'base_price': base_price, 'url': href, 'image': image})

# # Create a DataFrame from the extracted data
# df = pd.DataFrame(data, columns=['brand', 'name_1', 'name_2', 'category', 'gender', 'base_price', 'url', 'image'])

# # Save the DataFrame to a CSV file
# df.to_csv('perfume_basic.csv', index=False)
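
The scraper above tries a primary selector (the `div.sale` variant) and falls back to an alternative when it finds nothing. A minimal sketch of that fallback as a reusable helper is shown here. `first_text` is a hypothetical name, and the stand-in page object exists only so the example runs without a network request.

```python
from types import SimpleNamespace


def first_text(select_one, selectors, default=None):
    """Return the stripped text of the first selector that matches, else default."""
    for selector in selectors:
        element = select_one(selector)
        if element is not None:
            return element.text.strip()
    return default


# Stand-in for soup.select_one: maps a selector string to an element-like object.
fake_page = {
    "div:nth-child(2) > a > p > span:nth-child(1)": SimpleNamespace(text="  Hugo Boss \n"),
}

brand = first_text(
    fake_page.get,
    ["div.sale > a > p > span:nth-child(1)",          # primary (sale layout)
     "div:nth-child(2) > a > p > span:nth-child(1)"],  # alternative layout
)
print(brand)  # Hugo Boss
```

With a real page, `soup.select_one` can be passed in place of `fake_page.get`. Unlike calling `.text` directly on the fallback result, this returns `default` instead of raising `AttributeError` when neither selector matches.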

In [7]:
# df = pd.read_csv('perfume_basic.csv')
# df.head()

In [8]:
# Selectors for more information

# Top note
#ProductDetailDescription > div > div.left-content > div.kopfnote.smell-items > p.subline

# Heart note
#ProductDetailDescription > div > div.left-content > div.herznote.smell-items > p.subline

# Base note
#ProductDetailDescription > div > div.left-content > div.basisnote.smell-items > p.subline

# Fragrance
#ProductDetailDescription > div > div.right-content > div.duftrichtung > p:nth-child(2)

# Character
#ProductDetailDescription > div > div.right-content > div.charakter > p:nth-child(2)

# Customer rating
#ratingsContent > div > div.p-l-0.left > div > div > span:nth-child(2)

# Review count
#ratingsContent > div > div.p-l-0.left > div > div > span:nth-child(3)

In [9]:
# # Load the URLs from the CSV file
# urls = df['url']

# # Iterate over the URLs
# for index, url in urls.items():
#     # Send a GET request to the URL
#     response = requests.get('https://en.parfumdreams.de' + url)
#     response.raise_for_status()

#     # Parse the HTML content
#     soup = BeautifulSoup(response.content, 'html.parser')

#     # Extract information using the selectors
#     top_note_elements = soup.select('#ProductDetailDescription > div > div.left-content > div.kopfnote.smell-items > p.subline')
#     top_note = ' '.join([element.text.strip().replace('\r\n\r\n', ', ').replace('\r\n', ', ') for element in top_note_elements]) if top_note_elements else None

#     heart_note_elements = soup.select('#ProductDetailDescription > div > div.left-content > div.herznote.smell-items > p.subline')
#     heart_note = ' '.join([element.text.strip().replace('\r\n\r\n', ', ').replace('\r\n', ', ') for element in heart_note_elements]) if heart_note_elements else None

#     base_note_elements = soup.select('#ProductDetailDescription > div > div.left-content > div.basisnote.smell-items > p.subline')
#     base_note = ' '.join([element.text.strip().replace('\r\n\r\n', ', ').replace('\r\n', ', ') for element in base_note_elements]) if base_note_elements else None

#     fragrance_element = soup.select_one('#ProductDetailDescription > div > div.right-content > div.duftrichtung > p:nth-child(2)')
#     fragrance = fragrance_element.text.strip() if fragrance_element else None

#     character_element = soup.select_one('#ProductDetailDescription > div > div.right-content > div.charakter > p:nth-child(2)')
#     character = character_element.text.strip() if character_element else None

#     customer_rating_element = soup.select_one('#ratingsContent > div > div.p-l-0.left > div > div > span:nth-child(2)')
#     customer_rating = customer_rating_element.text.strip() if customer_rating_element else None

#     review_count_element = soup.select_one('#ratingsContent > div > div.p-l-0.left > div > div > span:nth-child(3)')
#     review_count = review_count_element.text.strip() if review_count_element else None

#     # Assign the extracted information to the corresponding URL index in the DataFrame
#     df.loc[index, ['top_note', 'heart_note', 'base_note', 'fragrance', 'character', 'customer_rating', 'review_count']] = top_note, heart_note, base_note, fragrance, character, customer_rating, review_count

# # Specify the desired order of columns
# column_order = ['brand', 'name_1', 'name_2', 'category', 'gender', 'base_price', 'top_note', 'heart_note', 'base_note', 'fragrance', 'character', 'customer_rating', 'review_count', 'url', 'image']

# # Reindex the DataFrame with the new column order
# df = df.reindex(columns=column_order)

# # Save the DataFrame to a CSV file
# df.to_csv('perfume_data.csv', index=False)
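
The note text on the product pages contains line breaks between individual notes, which the cell above replaces with `', '`. The sketch below shows the same idea as a single regular expression. It is a slight generalization (it also handles bare `\n` breaks, which is an assumption about the page markup), and `clean_note_text` is a hypothetical helper name.

```python
import re


def clean_note_text(raw):
    """Turn line-break separated notes into a comma-separated string."""
    # Collapse each run of line breaks (and surrounding whitespace) into ", ".
    return re.sub(r"\s*(?:\r?\n)+\s*", ", ", raw.strip())


print(clean_note_text("apple\r\n\r\nbergamot\r\nlemon "))
```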

# **Data Preprocessing**

##### Dataset
* Import the CSV file produced by the web scraping step.
* The table consists of perfume brand, name, category, gender, base price, top notes, heart notes, base notes, fragrance, character, customer ratings, review counts, url, and image source.

In [10]:
# Load data
df = pd.read_csv('perfume_data.csv')
df.head()

Unnamed: 0,brand,name_1,name_2,category,gender,base_price,top_note,heart_note,base_note,fragrance,character,customer_rating,review_count,url,image
0,Hugo Boss,BOSS Bottled,,Eau de Toilette Spray,Men,BP: €739.00* / 1000 ml,"apple, bergamot, lemon","geranium, carnation, cinnamon","sandalwood, vetiver, cedar",,"elegant, masculine, oriental",49,144.0,/Hugo-Boss/Boss-Black-Mens-fragrances/BOSS-Bottled/Eau-de-Toilette-Spray/index_12855.aspx,https://cdn.parfumdreams.de/Img/Art/13/Hugo-Boss-BOSS-Bottled-Eau-de-Toilette-Spray-12855_26.jpg
1,Yves Saint Laurent,Y,,Eau de Parfum Spray,Men,BP: €915.83* / 1000 ml,"apple, bergamot, ginger","geranium, mint, sage, juniper berry","amber, tonka bean, incense",,,49,137.0,/Yves-Saint-Laurent/Mens-fragrances/Y/Eau-de-Parfum-Spray/index_79931.aspx,https://cdn.parfumdreams.de/Img/Art/13/Yves-Saint-Laurent-Y-Eau-de-Parfum-Spray-79931_22.jpg
2,Abercrombie & Fitch,Away Weekend Men,,Eau de Toilette Spray,Men,"BP: €1,398.33* / 1000 ml","bergamot, cardamom, mandarin","lavender, rosemary, sage","cocoa, patchouli, cedar wood",,,0,0.0,/Abercrombie-Fitch/Mens-fragrances/Away-Weekend-Men/Eau-de-Toilette-Spray/index_123613.aspx,https://cdn.parfumdreams.de/Img/Art/13/Abercrombie-Fitch-Away-Weekend-Men-Eau-de-Toilette-Spray-123613_4.jpg
3,Jean Paul Gaultier,Le Mâle,,Eau de Toilette Spray,Men,BP: €923.75* / 1000 ml,"bergamot, lavender, mint","spices, orange, cinnamon","amber, sandalwood, tonka bean",,"masculine, sensual, oriental",49,275.0,/Jean-Paul-Gaultier/Mens-fragrances/Le-Male/Eau-de-Toilette-Spray/index_17304.aspx,https://cdn.parfumdreams.de/Img/Art/13/Jean-Paul-Gaultier-Le-Male-Eau-de-Toilette-Spray-17304_31.jpg
4,DIOR,Sauvage,Refillable - Citrus and Vanilla Notes,Eau de Parfum Spray,Men,"BP: €2,231.67* / 1000 ml","bergamot, lavender",pepper,"amber, patchouli, vetiver",wooden,"arromatic, vitalising, oriental",49,156.0,/DIOR/Mens-fragrances/Sauvage/Eau-de-Parfum-Spray/index_74836.aspx,https://cdn.parfumdreams.de/Img/Art/13/DIOR-Sauvage-Eau-de-Parfum-Spray-74836x8_81.jpg


* There are 3230 perfumes in the dataset.

In [11]:
df.shape

(3230, 15)

* Understanding the data

In [12]:
df.describe(include='all')

Unnamed: 0,brand,name_1,name_2,category,gender,base_price,top_note,heart_note,base_note,fragrance,character,customer_rating,review_count,url,image
count,3230,3230,1320,3230,3230,3217,2118,2108,2116,665,395,3229.0,3229.0,3230,3229
unique,292,1614,1177,166,3,862,1547,1581,1253,18,146,24.0,,3230,3229
top,Montale,Oud,Intense,Eau de Parfum Spray,Women,"BP: €1,250.00* / 1000 ml",bergamot,jasmine,musk,floral,oriental,0.0,,/Hugo-Boss/Boss-Black-Mens-fragrances/BOSS-Bottled/Eau-de-Toilette-Spray/index_12855.aspx,https://cdn.parfumdreams.de/Img/Art/13/Hugo-Boss-BOSS-Bottled-Eau-de-Toilette-Spray-12855_26.jpg
freq,105,40,13,1742,1338,42,48,45,50,258,52,1459.0,,1,1
mean,,,,,,,,,,,,,12.26262,,
std,,,,,,,,,,,,,77.789371,,
min,,,,,,,,,,,,,0.0,,
25%,,,,,,,,,,,,,0.0,,
50%,,,,,,,,,,,,,1.0,,
75%,,,,,,,,,,,,,4.0,,


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3230 entries, 0 to 3229
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   brand            3230 non-null   object 
 1   name_1           3230 non-null   object 
 2   name_2           1320 non-null   object 
 3   category         3230 non-null   object 
 4   gender           3230 non-null   object 
 5   base_price       3217 non-null   object 
 6   top_note         2118 non-null   object 
 7   heart_note       2108 non-null   object 
 8   base_note        2116 non-null   object 
 9   fragrance        665 non-null    object 
 10  character        395 non-null    object 
 11  customer_rating  3229 non-null   object 
 12  review_count     3229 non-null   float64
 13  url              3230 non-null   object 
 14  image            3229 non-null   object 
dtypes: float64(1), object(14)
memory usage: 378.6+ KB


* Cleaning the data

In [14]:
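# Check for fully duplicated rows.
duplicates = df[df.duplicated(keep=False)]

# Note: this adds an 'is_duplicate' column to df itself and sets it to True for
# every row (not only duplicates); the column is discarded by a later reindex.
df['is_duplicate'] = True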

if duplicates.empty:
    print("No duplicates found")
else:
    print("Duplicates found:\n", duplicates)

No duplicates found


In [15]:
# df_notes_null = df[df['top_note'].isnull() & df['heart_note'].isnull() & df['base_note'].isnull()]

In [16]:
# Select the rows in the df where the "top_note", "heart_note", or "base_note" column is not null
df = df[df['top_note'].notnull() | df['heart_note'].notnull() | df['base_note'].notnull()]

In [17]:
# Check name columns
# df.name_1.value_counts()
df.name_2.value_counts()

Intense                                         10
Black                                            5
Rose                                             5
Eau de Parfum Spray                              4
Noir                                             3
Eau Fraîche                                      3
Limited Edition                                  3
Pink                                             3
Royal                                            3
Gold                                             3
Elixir                                           3
Hypnotic Poison                                  2
Overtake 320                                     2
Magnetic                                         2
Blossom Delight                                  2
Costa Azzurra                                    2
Vétiver                                          2
Aqua Gold                                        2
Eau Intense                                      2
White Musk                     

In [18]:
df[df.name_2 == "Eau de Parfum Spray"]

Unnamed: 0,brand,name_1,name_2,category,gender,base_price,top_note,heart_note,base_note,fragrance,character,customer_rating,review_count,url,image,is_duplicate
310,Eisenberg,Les Orientaux Latins,Eau de Parfum Spray,Secret N°VI Cuir d'Orient Homme,Men,"BP: €3,731.67* / 1000 ml","blackcurrant, pepper, saffron","jasmine, leather, rose, violet, cedar","leather, rosewood, sandalwood, incense",leathery,,0,0.0,/Eisenberg/Mens-fragrances/Les-Orientaux-Latins/Secret-NVI-Cuir-dOrient-Homme-/index_84721.aspx,https://cdn.parfumdreams.de/Img/Art/13/Eisenberg-Les-Orientaux-Latins-Eau-de-Parfum-Spray-Secret-NVI-Cuir-dOrient-Homme--84721.jpg,True
335,Eisenberg,Les Orientaux Latins,Eau de Parfum Spray,Secret N°III Patchouli Nobile Homme,Men,"BP: €3,731.67* / 1000 ml","bergamot, cardamom, carnation, pepper, cinnamon, lemon, nutmeg","iris, lavender, patchouli, vetiver","iris, ladan resin, musk, rosewood, tonka bean, vanilla, myrrh, benzoin",oriental,,5,1.0,/Eisenberg/Mens-fragrances/Les-Orientaux-Latins/Secret-NIII-Patchouli-Nobile-Homme/index_84718.aspx,https://cdn.parfumdreams.de/Img/Art/13/Eisenberg-Les-Orientaux-Latins-Eau-de-Parfum-Spray-Secret-NIII-Patchouli-Nobile-Homme-84718.jpg,True
521,Eisenberg,Les Orientaux Latins,Eau de Parfum Spray,Secret N°V Ambre d'Orient Homme,Men,"BP: €3,731.67* / 1000 ml","bergamot, jasmine, pepper, rose, cinnamon","nagarmotha, patchouli, vetiver, cedar","musk, tonka bean, vanilla, incense, myrrh, ambra, benzoin",oriental,,0,0.0,/Eisenberg/Mens-fragrances/Les-Orientaux-Latins/Secret-NV-Ambre-dOrient-Homme-/index_84720.aspx,https://cdn.parfumdreams.de/Img/Art/13/Eisenberg-Les-Orientaux-Latins-Eau-de-Parfum-Spray-Secret-NV-Ambre-dOrient-Homme--84720.jpg,True
544,Eisenberg,Les Orientaux Latins,Eau de Parfum Spray,Secret N°IV Rituel d'Orient Homme,Men,"BP: €3,731.67* / 1000 ml",lemon,flowers,"amber, musk, patchouli, vanilla, guaiac wood",,,0,0.0,/Eisenberg/Mens-fragrances/Les-Orientaux-Latins/Secret-NIV-Rituel-dOrient-Homme/index_84719.aspx,https://cdn.parfumdreams.de/Img/Art/13/Eisenberg-Les-Orientaux-Latins-Eau-de-Parfum-Spray-Secret-NIV-Rituel-dOrient-Homme-84719.jpg,True


In [19]:
# Create a boolean mask to filter rows where name_2 column is "Eau de Parfum Spray"
mask = df['name_2'] == "Eau de Parfum Spray"

# Store the values from 'name_2' column in a temporary variable
temp_values = df.loc[mask, 'name_2'].copy()

# Update the 'name_2' column with the values from the 'category' column
df.loc[mask, 'name_2'] = df.loc[mask, 'category']

# Update the 'category' column with the values from the temporary variable
df.loc[mask, 'category'] = temp_values

In [20]:
#df[df.name_2 == "Refillable - Citrus and Vanilla Notes"]
df[df.name_2 == "Citrus and Woody Notes - Refillable Bottle"]

Unnamed: 0,brand,name_1,name_2,category,gender,base_price,top_note,heart_note,base_note,fragrance,character,customer_rating,review_count,url,image,is_duplicate
40,DIOR,Sauvage,Citrus and Woody Notes - Refillable Bottle,Parfum Men's Fragrance,Men,"BP: €2,565.00* / 1000 ml","bergamot, lavender, pepper","geranium, pepper","patchouli, vetiver",arromatic,,46,21.0,/DIOR/Mens-fragrances/Sauvage/Parfum-Mens-Fragrance/index_87043.aspx,https://cdn.parfumdreams.de/Img/Art/13/DIOR-Sauvage-Le-Parfum-87043x8_48.jpg,True


In [21]:
# More cleaning in 'name_2' column
df.name_2 = df.name_2.replace("Refillable - Citrus and Vanilla Notes", "Citrus and Vanilla Notes")
df.name_2 = df.name_2.replace("Citrus and Woody Notes - Refillable Bottle", "Citrus and Woody Notes")

In [22]:
df.loc[40, 'name_2']

'Citrus and Woody Notes'

In [23]:
# Create a new column "name" to combine "name_1" and "name_2" columns
df['name'] = df.apply(lambda row: f"{row['name_1']} {row['name_2']}" if pd.notnull(row['name_1']) and pd.notnull(row['name_2']) else row['name_1'], axis=1)

# Drop the 'name_1' and 'name_2' columns
df.drop(columns=['name_1', 'name_2'], inplace=True)

# Rearrange the order of columns
column_order = ['brand', 'name', 'category', 'gender', 'base_price', 'top_note', 'heart_note', 'base_note', 'fragrance', 'character', 'customer_rating', 'review_count', 'url', 'image']
df = df.reindex(columns=column_order)
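
As a standalone illustration of the name-combining rule above, here is a minimal sketch in plain Python. `combine_name` is a hypothetical helper, and treating empty strings as missing is an added assumption that the original lambda does not make.

```python
def combine_name(name_1, name_2):
    """Join the two name parts, skipping parts that are missing."""
    # Non-string values (None, NaN) and blank strings count as missing.
    parts = [p.strip() for p in (name_1, name_2) if isinstance(p, str) and p.strip()]
    return " ".join(parts) if parts else None


print(combine_name("Sauvage", "Citrus and Woody Notes"))
print(combine_name("Y", float("nan")))
```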

In [24]:
# Check "category" column
df.category.unique()

array(['Eau de Toilette Spray', 'Eau de Parfum Spray', 'Parfum Spray',
       'Eau de Parfum Spray Intense', 'Le Parfum Spray', 'Gift set',
       'Parfum- refillable', 'Parfum', "Parfum Men's Fragrance",
       'Eau de Parfum Spray- refillable', 'Eau de Toilette Spray Intense',
       'Eau de Toilette Spray Refillable', 'Le Parfum', 'Cologne Spray',
       'Eau de Cologne Spray', 'Eau de Toilette Spray- refillable',
       'Perfume', 'Parfum Intense', 'Parfume Spray',
       'Extrait de Parfum Spray', 'Eau de Parfum Spray Essentiel',
       'Gift Set', 'Eau de Toilette Spray Tuscany',
       'Eau de Parfum Spray Extreme', 'Eau de Cologne Spray Concentré',
       'Eau de Toilete Spray', 'Eau de Parfum Splash Bottle',
       'Perfume Spray', 'Extrait de Parfum',
       'Eau de Toilette Splash Bottle', 'Travellers Set Masculine',
       'Eau de Parfum Spray refillable', 'Eau de Parfum Spray Florale',
       'Eau de Parfum Spray Refillable', 'Eau De Toilette Spray',
       'Eau de Parfum 

In [None]:
# Create a new column 'category_mod'
df['category_mod'] = df['category'].str.lower()

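# Map values to the new column based on keywords. Order matters: the canonical labels are
# capitalized, so the later lowercase (case-sensitive) patterns skip rows that are already mapped.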
df.loc[df['category_mod'].str.contains('eau de parfum'), 'category_mod'] = 'Eau de Parfum'
df.loc[df['category_mod'].str.contains('eau de toilette|toilet'), 'category_mod'] = 'Eau de Toilette'
df.loc[df['category_mod'].str.contains('cologne'), 'category_mod'] = 'Eau de Cologne'
df.loc[df['category_mod'].str.contains('parfum|perfume|extrait'), 'category_mod'] = 'Parfum'
df.loc[df['category_mod'].str.contains('fraiche|fraîche'), 'category_mod'] = 'Eau Fraiche'
df.loc[~df['category_mod'].isin(['Eau de Parfum', 'Eau de Toilette', 'Parfum', 'Eau de Cologne', 'Eau Fraiche']), 'category_mod'] = 'Other'
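
The keyword mapping above amounts to a first-match-wins rule list. A minimal sketch of the same logic as a plain function may make the ordering easier to see. `CATEGORY_RULES` and `normalize_category` are hypothetical names, and plain substring checks stand in for the `str.contains` patterns.

```python
# Rules are checked in order; the first rule with a matching keyword wins.
CATEGORY_RULES = [
    (("eau de parfum",), "Eau de Parfum"),
    (("eau de toilette", "toilet"), "Eau de Toilette"),
    (("cologne",), "Eau de Cologne"),
    (("parfum", "perfume", "extrait"), "Parfum"),
    (("fraiche", "fraîche"), "Eau Fraiche"),
]


def normalize_category(raw):
    """Map a raw category string to one of the canonical labels or 'Other'."""
    text = raw.lower()
    for keywords, label in CATEGORY_RULES:
        if any(keyword in text for keyword in keywords):
            return label
    return "Other"


for raw in ["Eau de Parfum Spray Intense", "Eau de Toilete Spray",
            "Extrait de Parfum", "Cologne Spray", "Gift set"]:
    print(raw, "->", normalize_category(raw))
```

Placing the `eau de parfum` rule before the generic `parfum` rule is what keeps Eau de Parfum products from being labelled as Parfum.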

In [None]:
# df[df['category_mod'].isin(['Eau de Parfum'])][['name', 'category', 'category_mod', 'url']]
# df[df['category_mod'].isin(['Eau de Toilette'])][['name', 'category', 'category_mod', 'url']]
# df[df['category_mod'].isin(['Eau de Cologne'])][['name', 'category', 'category_mod', 'url']]
# df[df['category_mod'].isin(['Parfum'])][['name', 'category', 'category_mod', 'url']]
# df[df['category_mod'].isin(['Eau Fraiche'])][['name', 'category', 'category_mod', 'url']]
df[df['category_mod'].isin(['Other'])][['name', 'category', 'category_mod', 'url']]

In [None]:
# df[df.category == "Eau de Parfum Spray Intense"]
# df[df.name == "Brioni"]
# df[df.name == "L'Homme Idéal"]
# df[df.name == "Himalaya"]
# df[df.name == "Viking"]
# df[df.name == "La vie est belle"]
# df[df.name == "My Way"]
# df[df.name == "Black Opium"]
# df[df.name == "La Panthère"]
# df[df.name == "Nomade"]
# df[df.name == "Annicke Collection"]
# df[df.name == "Cloud Collection"]
# df[df.category == "Parfum d’eau"]
# df[df.name == "Eau Sauvage"]
# df[df.name == "Les Créations de Monsieur Dior"]
# df[df.name == "Girl"]
# df[df.name == "Les Légendaires"]
# df[df.name == "Classic"]
# df[df.name == "Black Collection"]
# df[df.category == "Roller-Pearl"]
# df[df.category == "Roller-Pearl Rose N Roses"]
# df[df.category == "Eau du Coq Spray"]
# df[df.category == "Eau Imperiale Spray"]
# df[df.category == "Eau Cedrat Spray"]
# df[df.category == "Eau Guerlain Spray"]
df[df.name == "Original Eau de Cologne"]

In [None]:
# More cleaning in 'category' column

# Eau de Parfum 
df.loc[df['category'] == "Eau de Parfum Spray Intense", 'name'] = df.loc[df['category'] == "Eau de Parfum Spray Intense", 'name'].apply(lambda x: x + " Intense" if "Intense" not in x else x)
df.loc[153, 'name'] = "Brioni Essentiel"
df.loc[219, 'name'] = "L'Homme Idéal Extreme"
df.loc[600, 'name'] = "Himalaya Splash Bottle"
df.loc[648, 'name'] = "Viking Splash Bottle"
df.loc[858, 'name'] = "La vie est belle Intensément"
df.loc[1017, 'name'] = "My Way Intense"
df.loc[1100, 'name'] = "Black Opium Extreme"
df.loc[1229, 'name'] = "La Panthère Limited Edition"
df.loc[1256, 'name'] = "Nomade Absolu"
df.loc[1497, 'name'] = "Annicke Collection 6"
df.loc[1700, 'name'] = "Annicke Collection 5"
df.loc[1701, 'name'] = "Annicke Collection 4"
df.loc[1702, 'name'] = "Annicke Collection 3"
df.loc[1735, 'name'] = "Annicke Collection 1"
df.loc[1940, 'name'] = "Annicke Collection 2"
df.loc[2486, 'name'] = "Cloud Collection No.1"
df.loc[2711, 'name'] = "Cloud Collection No.3"
df.loc[3022, 'name'] = "Cloud Collection No.2"
df.loc[df['category'] == 'Parfum d’eau', 'category_mod'] = 'Eau de Parfum'

# Eau de Toilette
df.loc[737, 'name'] = "Eau Sauvage Splash Bottle"
df.loc[972, 'name'] = "Les Créations de Monsieur Dior Forever and Ever"
df.loc[1059, 'name'] = "Les Créations de Monsieur Dior Dioressence"
df.loc[1137, 'name'] = "Les Créations de Monsieur Dior Diorella"
df.loc[1989, 'name'] = "Girl Blooming Edition"
df.loc[2081, 'name'] = "Les Légendaires Bee Bottle"

# Eau de Cologne
df.loc[270, 'name'] = "Classic Concentré"
df.loc[1265, 'name'] = "Original Eau de Cologne Splash Bottle"

# Parfum
df.loc[3151, 'name'] = "Black Collection Fresh Oud"
df.loc[3152, 'name'] = "Black Collection Sweet Rose"

# Other
df.loc[df['category'] == 'Roller-Pearl', 'category_mod'] = 'Eau de Toilette'
df.loc[df['category'] == 'Roller-Pearl Rose N Roses', ['category_mod', 'name']] = ['Eau de Toilette', 'Miss Dior Rose N Roses']
df.loc[df['category'] == 'Eau du Coq Spray', 'category_mod'] = 'Eau de Cologne'
df.loc[df['category'] == 'Eau Imperiale Spray', 'category_mod'] = 'Eau de Cologne'
df.loc[df['category'] == 'Eau Cedrat Spray', 'category_mod'] = 'Eau Fraiche'
df.loc[df['category'] == 'Eau Guerlain Spray', 'category_mod'] = 'Eau de Cologne'
df = df[df['category_mod'] != 'Other']

In [None]:
# Drop the 'category' column
df.drop(columns=['category'], inplace=True)

# Change the column name 'category_mod' to 'category'
df.rename(columns={'category_mod': 'category'}, inplace=True)

# Rearrange the order of columns
column_order = ['brand', 'name', 'category', 'gender', 'base_price', 'top_note', 'heart_note', 'base_note', 'fragrance', 'character', 'customer_rating', 'review_count', 'url', 'image']
df = df.reindex(columns=column_order)

In [None]:
# Check "base_price" column
df.base_price.value_counts()

In [None]:
# Clean "base_price" column
def extract_numeric_value(text):
    # Expected format: "BP: €1,398.33* / 1000 ml"; capture the number between "€" and "*"
    if isinstance(text, str):
        match = re.search(r"€([\d,\.]+)\*", text)
        if match:
            price_str = match.group(1).replace(",", "")  # Remove thousands separators
            return float(price_str)
    return None  # Missing or unparseable values stay null

df["base_price"] = df["base_price"].apply(extract_numeric_value)
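
To make the behaviour of the regular expression concrete, here is a standalone sketch that applies the same pattern to two `base_price` strings taken from the data preview earlier in this notebook. `parse_base_price` is a hypothetical restatement of the function above, used only for illustration.

```python
import re

# Capture the digits, commas and dots between the euro sign and the asterisk.
PRICE_PATTERN = re.compile(r"€([\d,\.]+)\*")


def parse_base_price(text):
    """Return the price per 1000 ml as a float, or None if it cannot be parsed."""
    if not isinstance(text, str):
        return None
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))  # drop thousands separators


for sample in ["BP: €739.00* / 1000 ml", "BP: €1,398.33* / 1000 ml", None]:
    print(repr(sample), "->", parse_base_price(sample))
```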

In [None]:
# Check the range of the cleaned "base_price" column
min_price = df["base_price"].min()
max_price = df["base_price"].max()
min_price, max_price

In [None]:
# Check NaN values in "base_price" column
df[df.base_price.isna()]

In [None]:
# Update the base price for this item
df.loc[1578, 'base_price'] = 2660.00

In [None]:
# Check notes column
print(df.top_note.unique())
print(df.heart_note.unique())
print(df.base_note.unique())

In [None]:
# Clean notes column
df.top_note = df.top_note.str.lower().str.replace(r'\s*,\s*', ', ', regex=True)
df.heart_note = df.heart_note.str.lower().str.replace(r'\s*,\s*', ', ', regex=True)
df.base_note = df.base_note.str.lower().str.replace(r'\s*,\s*', ', ', regex=True)

df.top_note.unique()
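
The cleaning step above lowercases each notes string and normalizes the spacing around commas. A minimal sketch of the same transformation on a single string follows. `normalize_notes` is a hypothetical name, and the sample input is made up to show irregular spacing.

```python
import re


def normalize_notes(text):
    """Lowercase the notes and force exactly ', ' between them."""
    return re.sub(r"\s*,\s*", ", ", text.lower())


print(normalize_notes("Bergamot ,Lavender,  Pepper"))
```

Consistent separators matter for the later step that splits on `", "` to remove duplicate notes. Leading and trailing whitespace is left untouched by this pattern.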

In [None]:
# Function to concatenate top_note, heart_note, and base_note into "notes" column and remove duplicate notes
def concatenate_notes(row):
    notes_list = [note for note in [row["top_note"], row["heart_note"], row["base_note"]] if not pd.isnull(note)]
    notes_combined = ", ".join(notes_list)
    
    # Remove duplicate notes while keeping their first-seen order
    unique_notes = []
    for note in notes_combined.split(", "):
        if note not in unique_notes:
            unique_notes.append(note)
    
    return ", ".join(unique_notes)

# Concatenate the notes columns into a new "notes" column
df["notes"] = df.apply(concatenate_notes, axis=1)

# Rearrange the order of columns
column_order = ['brand', 'name', 'category', 'gender', 'base_price', 'notes', 'top_note', 'heart_note', 'base_note', 'fragrance', 'character', 'customer_rating', 'review_count', 'url', 'image']
df = df.reindex(columns=column_order)
df.head()

In [None]:
# Check "character" column
# df.character.value_counts()
df.character.isna().sum()

In [None]:
# Drop "character" column
df.drop("character", axis=1, inplace=True)

In [None]:
# Clean "customer_rating" column
df.customer_rating = df.customer_rating.str.replace(",", ".").astype(float)

In [None]:
# Convert 'review_count' from float to integer
df.review_count = df.review_count.astype(int)

In [None]:
# Add the "id" column starting from 1
df['id'] = range(1, len(df) + 1)

# Set "id" column as the index
df.set_index(df.id, inplace=True)

# **General EDA**

* EDA report

In [None]:
df_eda = df.copy()
df_eda.drop('id', axis=1, inplace=True)
eda_report = ProfileReport(df_eda, title = "EDA Report - Perfume")
eda_report

In [None]:
print("Unique brands: ", df['brand'].nunique())
print("Perfumes: ", df['id'].nunique())

* More EDA

In [None]:
# Split the "category" column into separate columns
Category = df['category'].apply(pd.Series)

# Get dummies for each category and stack them to get a binary representation of categories
category_matrix = pd.get_dummies(Category.apply(pd.Series).stack())

# Sum the dummies at level 0 to combine them for each row
category_matrix = category_matrix.groupby(level=0).sum()

In [None]:
# Check the sum of each category
category_matrix.sum(axis=0).sort_values(ascending=False)

In [None]:
# Split the "gender" column into separate columns
Gender = df['gender'].apply(pd.Series)

# Stack the values and get dummies for a binary representation of each gender
gender_matrix = pd.get_dummies(Gender.apply(pd.Series).stack())

# Sum the dummies at level 0 to combine them for each row
gender_matrix = gender_matrix.groupby(level=0).sum()

In [None]:
# Check the sum of each gender
gender_matrix.sum(axis=0).sort_values(ascending=False)

In [None]:
# Split the "notes" column and create a DataFrame
Notes = df['notes'].apply(lambda x: pd.Series(x.split(', ')) if isinstance(x, str) else pd.Series(dtype=object))

# Stack the notes into one long Series and get dummies for a binary representation of each note
note_matrix = pd.get_dummies(Notes.apply(pd.Series).stack())

# Sum the dummies at level 0 to combine them for each row
note_matrix = note_matrix.groupby(level=0).sum()

In [None]:
# Check the sum of each note
note_matrix.sum(axis=0).sort_values(ascending=False)
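
A more compact way to build a note matrix is `Series.str.get_dummies`, which splits each string on a separator and one-hot encodes the pieces in one step. The sketch below is an alternative under stated assumptions, not what the notebook runs: the sample data and the name `note_matrix_alt` are made up for illustration. Unlike the stack-based approach, rows without notes are kept as all-zero rows.

```python
import pandas as pd

# Hypothetical sample of the "notes" column, indexed by perfume id
notes = pd.Series(
    ["bergamot, rose, musk", "rose, vanilla", None],
    index=[1, 2, 3],
    name="notes",
)

# Split on ", " and one-hot encode every note in a single call
note_matrix_alt = notes.str.get_dummies(sep=", ")

# Count how many perfumes contain each note
note_counts = note_matrix_alt.sum(axis=0).sort_values(ascending=False)
print(note_counts)
```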

* EDA with SQL

# **Data Visualization**

### Data Visualization with Python

* Top 10 Most Frequently Used Notes

In [None]:
# Sum the occurrences of each note across all perfumes (axis=0) and select the top 10 notes
top_notes = note_matrix.sum(axis=0).sort_values(ascending=False)
top_10_notes = top_notes.head(10)

# Use the note names as the x-axis labels
top_10_notes_labels = top_10_notes.index

# Create a bar plot using seaborn
plt.figure(figsize=(15, 6))
sns.barplot(x=top_10_notes_labels, y=top_10_notes.values, palette='viridis')

# Add labels and title
plt.xlabel('Notes')
plt.ylabel('Occurrences')
plt.title('Top 10 Most Frequently Used Notes')

plt.show()

* Top 10 popular women's perfumes

In [None]:
# Filter the DataFrame for women's perfumes
women_perfumes = df[df['gender'] == 'Women']

# Sort the DataFrame by review_count in descending order and take the top 10 rows
top_women_perfumes = women_perfumes.sort_values(by='review_count', ascending=False).head(10)

# Reset the index of the top_women_perfumes DataFrame
top_women_perfumes.reset_index(drop=True, inplace=True)

# Combine brand, name and category into a three-line label for the x-axis
top_women_perfumes['perfume_label'] = top_women_perfumes['brand'] + '\n' + top_women_perfumes['name'] + '\n' + top_women_perfumes['category']

# Define different colors for the bars
colors = sns.color_palette("spring", 10)

# Plot the top 10 popular women's perfumes with different bar colors
plt.figure(figsize=(15, 6))
bar_plot = plt.bar(top_women_perfumes['perfume_label'], top_women_perfumes['review_count'], color=colors, alpha=0.7)
plt.title("Top 10 Popular Women's Perfumes")
plt.xlabel("Perfume")
plt.ylabel("Review Count")
plt.xticks(rotation=45)

# Set the maximum limit for the y-axis (review count) to 3500
plt.ylim(0, 3500)

# Add customer rating on each column
for index, row in top_women_perfumes.iterrows():
    plt.text(index, row['review_count'], f"Customer\nRating: {row['customer_rating']}", ha='center', va='bottom', color='black', fontsize=9)

plt.show()

* Top 10 popular men's perfumes

In [None]:
men_perfumes = df[df['gender'] == 'Men']
top_men_perfumes = men_perfumes.sort_values(by='review_count', ascending=False).head(10)
top_men_perfumes.reset_index(drop=True, inplace=True)
top_men_perfumes['perfume_label'] = top_men_perfumes['brand'] + '\n' + top_men_perfumes['name'] + '\n' + top_men_perfumes['category']
colors = sns.color_palette("winter", 10)
plt.figure(figsize=(15, 6))
bar_plot = plt.bar(top_men_perfumes['perfume_label'], top_men_perfumes['review_count'], color=colors, alpha=0.7)
plt.title("Top 10 Popular Men's Perfumes")
plt.xlabel("Perfume")
plt.ylabel("Review Count")
plt.xticks(rotation=45)
plt.ylim(0, 600)
for index, row in top_men_perfumes.iterrows():
    plt.text(index, row['review_count'], f"Customer\nRating: {row['customer_rating']}", ha='center', va='bottom', color='black', fontsize=9)
plt.show()

* Top 10 popular unisex perfumes

In [None]:
unisex_perfumes = df[df['gender'] == 'Unisex']
top_unisex_perfumes = unisex_perfumes.sort_values(by='review_count', ascending=False).head(10)
top_unisex_perfumes.reset_index(drop=True, inplace=True)
top_unisex_perfumes['perfume_label'] = top_unisex_perfumes['brand'] + '\n' + top_unisex_perfumes['name'] + '\n' + top_unisex_perfumes['category']
colors = sns.color_palette("summer", 10)
plt.figure(figsize=(15, 6))
bar_plot = plt.bar(top_unisex_perfumes['perfume_label'], top_unisex_perfumes['review_count'], color=colors, alpha=0.7)
plt.title("Top 10 Popular Unisex Perfumes")
plt.xlabel("Perfume")
plt.ylabel("Review Count")
plt.xticks(rotation=45)
plt.ylim(0, 80)
for index, row in top_unisex_perfumes.iterrows():
    plt.text(index, row['review_count'], f"Customer\nRating: {row['customer_rating']}", ha='center', va='bottom', color='black', fontsize=9)
plt.show()

* Top 10 popular EDP

In [None]:
edp_perfumes = df[(df['category'] == 'Eau de Parfum')]
top_edp_perfumes = edp_perfumes.sort_values(by='review_count', ascending=False).head(10)
top_edp_perfumes.reset_index(drop=True, inplace=True)
top_edp_perfumes['perfume_label'] = top_edp_perfumes['brand'] + '\n' + top_edp_perfumes['name']
colors = sns.color_palette('rocket', 10)
plt.figure(figsize=(15, 6))
sns.barplot(x=top_edp_perfumes['perfume_label'], y=top_edp_perfumes['review_count'], palette=colors)
plt.title("Top 10 Popular Eau de Parfum")
plt.xlabel("Perfume")
plt.ylabel("Review Count")
plt.ylim(0, 3500)
for index, row in top_edp_perfumes.iterrows():
    plt.text(index, row['review_count'], f"Customer\nRating: {row['customer_rating']}", ha='center', va='bottom', color='black', fontsize=9)
plt.xticks(rotation=45)
plt.show()

* Top 10 popular Parfum

In [None]:
parfum_perfumes = df[(df['category'] == 'Parfum')]
top_parfum_perfumes = parfum_perfumes.sort_values(by='review_count', ascending=False).head(10)
top_parfum_perfumes.reset_index(drop=True, inplace=True)
top_parfum_perfumes['perfume_label'] = top_parfum_perfumes['brand'] + '\n' + top_parfum_perfumes['name']
colors = sns.color_palette("mako", 10)
plt.figure(figsize=(15, 6))
sns.barplot(x=top_parfum_perfumes['perfume_label'], y=top_parfum_perfumes['review_count'], palette=colors)
plt.title("Top 10 Popular Parfum")
plt.xlabel("Perfume")
plt.ylabel("Review Count")
plt.ylim(0, 600)
for index, row in top_parfum_perfumes.iterrows():
    plt.text(index, row['review_count'], f"Customer\nRating: {row['customer_rating']}", ha='center', va='bottom', color='black', fontsize=9)
plt.xticks(rotation=45)
plt.show()

* Top 10 Popular EDT

In [None]:
edt_perfumes = df[(df['category'] == 'Eau de Toilette')]
top_edt_perfumes = edt_perfumes.sort_values(by='review_count', ascending=False).head(10)
top_edt_perfumes.reset_index(drop=True, inplace=True)
top_edt_perfumes['perfume_label'] = top_edt_perfumes['brand'] + '\n' + top_edt_perfumes['name']
colors = sns.color_palette("plasma", 10)
plt.figure(figsize=(15, 6))
sns.barplot(x=top_edt_perfumes['perfume_label'], y=top_edt_perfumes['review_count'], palette=colors)
plt.title("Top 10 Popular Eau de Toilette")
plt.xlabel("Perfume")
plt.ylabel("Review Count")
plt.ylim(0, 350)
for index, row in top_edt_perfumes.iterrows():
    plt.text(index, row['review_count'], f"Customer\nRating: {row['customer_rating']}", ha='center', va='bottom', color='black', fontsize=9)
plt.xticks(rotation=45)
plt.show()

* Brand vs. number of perfumes

In [None]:
# Group by brand and count the number of perfumes for each brand
perfumes_by_brand = df.groupby('brand')['id'].count().sort_values(ascending=False)

# Select the top 30 perfume brands
top_30_brands = perfumes_by_brand.head(30)

# Plot the top 30 perfume brands with a bar plot using seaborn
plt.figure(figsize=(15, 6))
sns.barplot(x=top_30_brands.index, y=top_30_brands.values, palette='Blues_r')
plt.xticks(rotation=45, ha="right")

plt.title("Top 30 Perfume Brands by Number of Perfumes")
plt.xlabel("Brand")
plt.ylabel("Number of Perfumes")
plt.show()

### Data Visualization with Tableau

# **Data Processing**

##### `Fragrance` - fill the missing values with more web scraping

In [None]:
url = 'https://en.parfumdreams.de/Fragrances'

# Send a GET request to the URL and parse the HTML content
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# Find the HTML code containing the fragrance values
selector = '#attrFilterDuftrichtung > div.dropdown-items-content'
html_code = soup.select_one(selector)

# Initialize an empty list to store the information
fragrance_list = []

# Find all the item values; name and number are extracted in the loop below, while max_pages (pages per fragrance) is hard-coded
item_values = html_code.find_all('span', class_='item-value')
max_pages = [3, 2, 27, 6, 1, 12, 12, 2, 23, 1, 2, 1, 1, 3, 9, 7, 8, 1, 1, 1, 1, 2, 12, 2, 1, 1]

for i, item in enumerate(item_values):
    # Extract the name of the fragrance
    name = item.text
    # Extract the number of the fragrance from the input value attribute
    number = re.search(r'/Fragrances\?att_duftrichtung=(\d+)', str(item.find_next_sibling('input'))).group(1)
    # Append the fragrance information to the list
    fragrance_list.append((name, number, max_pages[i]))

print(fragrance_list)

In [None]:
# Initialize an empty list to store the perfume fragrance information
fragrance_links = []

In [None]:
### Since it is too slow to web scrape the whole set in one go, I decided to web scrape each fragrance individually

In [None]:
# # Fragrance - amber/3206/3
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=3206&p={}'
# for page in range(1, 4):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'amber') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - aquatic/3212/2
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=3212&p={}'
# for page in range(1, 3):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'aquatic') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - floral/2159/27
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=2159&p={}'
# for page in range(1, 28):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'floral') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - floral/3221/6
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=3221&p={}'
# for page in range(1, 7):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'floral') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - fougère/3222/1
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=3222&p={}'
# for page in range(1, 2):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'fougère') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - fresh/2166/12
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=2166&p={}'
# for page in range(1, 13):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'fresh') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - fruity/2167/12
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=2167&p={}'
# for page in range(1, 13):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'fruity') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - gourmand/3228/2
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=3228&p={}'
# for page in range(1, 3):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'gourmand') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - wooden/2169/23
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=2169&p={}'
# for page in range(1, 24):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'wooden') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - intensive/251/1
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=251&p={}'
# for page in range(1, 2):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'intensive') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - leathery/3191/2
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=3191&p={}'
# for page in range(1, 3):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'leathery') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - mossy/2173/1
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=2173&p={}'
# for page in range(1, 2):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'mossy') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - heavy/262/1
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=262&p={}'
# for page in range(1, 2):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'heavy') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - sweet/2180/3
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=2180&p={}'
# for page in range(1, 4):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'sweet') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - spicy/2182/9
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=2182&p={}'
# for page in range(1, 10):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'spicy') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - citrusy/2184/7
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=2184&p={}'
# for page in range(1, 8):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'citrusy') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - arromatic/3343/8
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=3343&p={}'
# for page in range(1, 9):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'arromatic') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - classic/3348/1
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=3348&p={}'
# for page in range(1, 2):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'classic') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - creamy/3344/1
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=3344&p={}'
# for page in range(1, 2):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'creamy') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - earthy/3345/1
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=3345&p={}'
# for page in range(1, 2):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'earthy') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - exotic/3346/1
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=3346&p={}'
# for page in range(1, 2):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'exotic') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - green/3347/2
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=3347&p={}'
# for page in range(1, 3):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'green') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - oriental/3349/12
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=3349&p={}'
# for page in range(1, 13):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'oriental') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - powdery/3350/2
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=3350&p={}'
# for page in range(1, 3):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'powdery') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - smoky/3351/2
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=3351&p={}'
# for page in range(1, 3):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'smoky') for element in elements]
#         fragrance_links.extend(links)

In [None]:
# # Fragrance - sporty/3352/1
# base_url = 'https://en.parfumdreams.de/Fragrances?att_duftrichtung=3352&p={}'
# for page in range(1, 2):
#     url = base_url.format(page)
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content, 'html.parser')
#     for child_num in range(1, 96):
#         selector = f'#CategoryBoxItem > div.cw.clearfix > div:nth-child({child_num}) > div.product-image.c.d-12 > a'
#         elements = soup.select(selector)
#         links = [(element['href'], 'sporty') for element in elements]
#         fragrance_links.extend(links)
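
The per-fragrance cells above differ only in the fragrance name, its number and its page count, which are exactly the three values stored in `fragrance_list`. A minimal sketch of how the page URLs could be generated from such tuples is shown below. The helper `iter_listing_pages` and the constant `BASE_URL` are hypothetical names introduced here, and no request is sent, so the sketch can be run without touching the website.

```python
# URL template taken from the cells above, with the number and page as placeholders
BASE_URL = "https://en.parfumdreams.de/Fragrances?att_duftrichtung={number}&p={page}"

def iter_listing_pages(fragrance_list):
    """Yield (url, fragrance_name) for every listing page of every fragrance."""
    for name, number, max_pages in fragrance_list:
        for page in range(1, max_pages + 1):
            yield BASE_URL.format(number=number, page=page), name

# Two entries in the same (name, number, max_pages) shape as fragrance_list
sample_list = [("amber", "3206", 3), ("aquatic", "3212", 2)]
pages = list(iter_listing_pages(sample_list))

for url, name in pages:
    # Each url would then be fetched and parsed exactly as in the cells above
    print(name, url)
```

Scraping one fragrance at a time is still possible with this approach by passing a single-entry list.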

In [None]:
# df_fragrance_links = pd.DataFrame(fragrance_links, columns=['url', 'fragrance'])
# df_fragrance_links.to_csv('fragrance_links.csv', index=False)

In [None]:
df_fragrance = pd.read_csv('fragrance_links.csv')
df_fragrance.shape

In [None]:
# Checking for duplicates in the 'url' column of df_fragrance
duplicates = df_fragrance[df_fragrance.duplicated(subset='url', keep=False)]
len(duplicates)

In [None]:
df_fragrance.fragrance.unique()

In [None]:
df.fragrance.value_counts()

In [None]:
# Clean and consolidate the fragrance categories
df.loc[df['fragrance']=='wooden' , 'fragrance'] = 'woody'
df_fragrance.loc[df_fragrance['fragrance']=='wooden' , 'fragrance'] = 'woody'
df.loc[df['fragrance']=='amber' , 'fragrance'] = 'oriental'
df.loc[df['fragrance']=='arromatic' , 'fragrance'] = 'aromatic'
df_fragrance.loc[df_fragrance['fragrance']=='arromatic' , 'fragrance'] = 'aromatic'
df.loc[df['fragrance']=='fougère' , 'fragrance'] = 'aromatic'

In [None]:
# To make the fragrance classes more balanced, rank the fragrances by their value counts in df (ascending), so that rarer fragrances get a higher priority

# Calculate fragrance value counts in ascending order
fragrance_counts = df['fragrance'].value_counts().sort_values(ascending=True)

# List of fragrances to keep
fragrance_values_to_keep = fragrance_counts.index.tolist()

# Create a dictionary to store the priority value for each fragrance
fragrance_priority = {fragrance: index for index, fragrance in enumerate(fragrance_values_to_keep)}

# Function to determine the priority value of a fragrance
def get_fragrance_priority(fragrance):
    return fragrance_priority.get(fragrance, len(fragrance_values_to_keep))

# Keep only the rows of df_fragrance whose fragrance value also appears in df
mask = df_fragrance['fragrance'].isin(fragrance_values_to_keep)

# Work on a copy so that the in-place operations below do not act on a slice of df_fragrance
df_fragrance_filtered = df_fragrance[mask].copy()

# Sort so that the rarest fragrance comes first, then drop duplicate URLs and keep that first (highest-priority) row
df_fragrance_filtered.sort_values(by='fragrance', key=lambda col: col.map(get_fragrance_priority), inplace=True)
df_fragrance_filtered.drop_duplicates(subset='url', keep='first', inplace=True)
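
The effect of this priority rule is easiest to see on a toy example. In the sketch below all data is made up for illustration: `labelled` plays the role of `df['fragrance']`, and `links` plays the role of `df_fragrance`. When a URL is listed under several fragrances, the one that is rarest in the labelled data wins.

```python
import pandas as pd

# Hypothetical labelled data: floral is common, woody less so, spicy is rare
labelled = pd.Series(["floral", "floral", "floral", "woody", "woody", "spicy"])
counts = labelled.value_counts().sort_values(ascending=True)

# Rarest fragrance gets rank 0, which is the highest priority
priority = {name: rank for rank, name in enumerate(counts.index)}

# Hypothetical scraped links, where /a and /b appear under two fragrances each
links = pd.DataFrame({
    "url": ["/a", "/a", "/b", "/b", "/c"],
    "fragrance": ["floral", "spicy", "woody", "floral", "floral"],
})

deduped = (
    links.assign(priority=links["fragrance"].map(priority))
         .sort_values("priority", kind="stable")
         .drop_duplicates(subset="url", keep="first")
         .drop(columns="priority")
         .sort_values("url")
         .reset_index(drop=True)
)
print(deduped)
```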

In [None]:
df_fragrance_filtered.shape

In [None]:
# Merge the DataFrames on the 'url' column and perform a left join
merged_df = df.merge(df_fragrance_filtered[['url', 'fragrance']], on='url', how='left', indicator=True)
merged_df.info()

In [None]:
# Replace the missing values in the 'fragrance_x' column with the values from the 'fragrance_y' column
# (assigning the result back avoids calling fillna in place on a column selection)
merged_df['fragrance_x'] = merged_df['fragrance_x'].fillna(merged_df['fragrance_y'])

# Drop the unnecessary columns 'url', 'fragrance_y', and '_merge' from the merged DataFrame
merged_df.drop(columns=['url', 'fragrance_y', '_merge'], inplace=True)

# Rename the 'fragrance_x' column to 'fragrance'
merged_df.rename(columns={'fragrance_x': 'fragrance'}, inplace=True)

# Set "id" column as the index
merged_df.set_index('id', inplace=True)
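
The merge-and-fill step can be checked on a small example. The frames below are made up for illustration (`perfumes` stands in for `df`, `scraped` for the deduplicated link table). The sketch shows that an existing label is never overwritten, a missing label is filled when the URL was scraped, and a missing label stays missing otherwise.

```python
import pandas as pd

# Hypothetical perfume table: only /a already has a fragrance label
perfumes = pd.DataFrame({
    "url": ["/a", "/b", "/c"],
    "fragrance": ["floral", None, None],
})

# Hypothetical scraped labels: /c was not found on any fragrance page
scraped = pd.DataFrame({"url": ["/a", "/b"], "fragrance": ["woody", "spicy"]})

# Left join keeps every perfume; overlapping columns get the _x / _y suffixes
merged = perfumes.merge(scraped, on="url", how="left", suffixes=("_x", "_y"))

# Fill only the gaps in the original label with the scraped label
merged["fragrance"] = merged["fragrance_x"].fillna(merged["fragrance_y"])
merged = merged.drop(columns=["fragrance_x", "fragrance_y"])
print(merged)
```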

In [None]:
merged_df.fragrance.value_counts()

# **Data Modeling and Model Evaluation**

##### `Fragrance` - fill the remaining missing values with model prediction

In [None]:
# Use data from df with non-NaN "fragrance" values for training and testing
df_train = df.dropna(subset=['fragrance'])

In [None]:
# X-y split for training set
X_train = df_train['notes']
y_train = df_train['fragrance']

In [None]:
# Encode 'fragrance' class labels to integers using LabelEncoder
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)

In [None]:
# Convert 'notes' text data into numerical features using CountVectorizer
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
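
One thing to keep in mind is that `CountVectorizer` with default settings tokenizes on word boundaries, so a multi-word note such as "pink pepper" becomes two separate features. If whole notes should be the features instead, a custom tokenizer that splits on commas is one option. The sketch below only illustrates the difference: the sample strings and the helper `split_notes` are made up, and this is not the configuration used in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

def split_notes(text):
    """Split a comma-separated notes string into whole notes."""
    return [note.strip() for note in text.split(",") if note.strip()]

# Hypothetical notes strings in the same "a, b, c" format as the notes column
sample_notes = ["pink pepper, rose, white musk", "rose, vanilla"]

# Default behaviour: one feature per word
default_vec = CountVectorizer().fit(sample_notes)

# Custom tokenizer: one feature per note (token_pattern is unused, so it is set to None)
phrase_vec = CountVectorizer(tokenizer=split_notes, token_pattern=None).fit(sample_notes)

default_vocab = sorted(default_vec.vocabulary_)
phrase_vocab = sorted(phrase_vec.vocabulary_)
print(default_vocab)
print(phrase_vocab)
```

Whether word-level or note-level features work better for predicting the fragrance is an empirical question, so both variants would need to be compared on the test set.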

In [None]:
# Train test split for training set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_train_vectorized, y_train_encoded, test_size=0.2)

* KNN model

In [None]:
# Train the KNN classifier
knn_classifier = KNeighborsClassifier()
knn_classifier.fit(X_train, y_train)

In [None]:
# Define a function to evaluate different models
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_model(model, X_test, y_test, label_encoder):
    # Make predictions on the test set
    y_test_pred_encoded = model.predict(X_test)
    # Convert the true and predicted numeric labels back to their original class labels for evaluation
    y_test_true = label_encoder.inverse_transform(y_test)
    y_test_pred = label_encoder.inverse_transform(y_test_pred_encoded)

    # Calculate evaluation metrics for the test set (weighted by class support)
    accuracy_test = accuracy_score(y_test_true, y_test_pred)
    precision_test = precision_score(y_test_true, y_test_pred, average='weighted')
    recall_test = recall_score(y_test_true, y_test_pred, average='weighted')
    f1_test = f1_score(y_test_true, y_test_pred, average='weighted')

    # Print evaluation metrics
    print("Test Set Metrics:")
    print("Accuracy:", accuracy_test)
    print("Precision:", precision_test)
    print("Recall:", recall_test)
    print("F1-Score:", f1_test)

    return accuracy_test, precision_test, recall_test, f1_test

In [None]:
# Evaluate KNN model
accuracy_knn, precision_knn, recall_knn, f1_knn = evaluate_model(knn_classifier, X_test, y_test, label_encoder)

* Random Forest model

In [None]:
# Train the RandomForest classifier
rf_classifier = RandomForestClassifier()
rf_classifier.fit(X_train, y_train)

# Evaluate RandomForest model
accuracy_rf, precision_rf, recall_rf, f1_rf = evaluate_model(rf_classifier, X_test, y_test, label_encoder)

* XGBoost model

In [None]:
# Train the XGBoost classifier
xgb_classifier = XGBClassifier(objective='multi:softmax', num_class=len(df_train['fragrance'].unique()))
xgb_classifier.fit(X_train, y_train)

# Evaluate XGBoost model
accuracy_xgb, precision_xgb, recall_xgb, f1_xgb = evaluate_model(xgb_classifier, X_test, y_test, label_encoder)

* Random Forest performed best of the three models, so use it to predict the remaining missing values in `fragrance`
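
The comparison behind this choice can be made explicit by collecting the metrics returned by `evaluate_model` and picking the highest score. The following is only a sketch: `pick_best_model` is a hypothetical helper, and the numbers in the usage example are placeholders, not results from this notebook.

```python
def pick_best_model(scores, metric="f1"):
    """Return the name of the model with the highest value for `metric`.

    `scores` maps a model name to a dict of metric values, for example
    the values returned by evaluate_model stored under named keys.
    """
    if not scores:
        raise ValueError("scores must contain at least one model")
    return max(scores, key=lambda name: scores[name][metric])


# Placeholder numbers for illustration only; they are NOT this notebook's results.
example_scores = {
    "knn": {"accuracy": 0.50, "f1": 0.48},
    "random_forest": {"accuracy": 0.60, "f1": 0.59},
    "xgboost": {"accuracy": 0.58, "f1": 0.57},
}
best = pick_best_model(example_scores)
```

With the real metrics filled in, the same call would name the model to tune and use for imputation.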

In [None]:
# # Improve the model with Hyperparameter tuning

# # Define the parameter grid to search
# param_grid = {
#     'n_estimators': [50, 100, 150],
#     'max_depth': [None, 10, 20],
#     'min_samples_split': [2, 5, 10],
#     'min_samples_leaf': [1, 2, 4],
#     'max_features': ['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3
#     'bootstrap': [True, False]
# }

# # Create the Random Forest classifier
# rf_classifier = RandomForestClassifier()

# # Perform Grid Search with 5-fold cross-validation
# grid_search = GridSearchCV(rf_classifier, param_grid, cv=5, n_jobs=-1)
# grid_search.fit(X_train, y_train)

# # Get the best hyperparameter values
# best_params = grid_search.best_params_

# # Train the model with the best hyperparameters
# best_rf_classifier = RandomForestClassifier(**best_params)
# best_rf_classifier.fit(X_train, y_train)

# # Evaluate the model on the test set
# accuracy_test, precision_test, recall_test, f1_test = evaluate_model(best_rf_classifier, X_test, y_test, label_encoder)

# print("Best Hyperparameters:", best_params)

In [None]:
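# Best parameters found by the grid search above, hard-coded so the search does not have to be rerun
best_params = {
    'bootstrap': False,
    'max_depth': 20,
    'max_features': 'sqrt',  # the search reported 'auto', which meant 'sqrt' for classifiers and was removed in scikit-learn 1.3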
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'n_estimators': 50
}
best_rf_classifier = RandomForestClassifier(**best_params)
best_rf_classifier.fit(X_train, y_train)

In [None]:
# Predict missing values for 'fragrance' in merged_df
df_predict = merged_df[merged_df['fragrance'].isnull()]
X_predict = df_predict['notes']
X_predict_vectorized = vectorizer.transform(X_predict)
y_predict_encoded = best_rf_classifier.predict(X_predict_vectorized)
y_predict = label_encoder.inverse_transform(y_predict_encoded)

# Fill missing values in the merged_df
merged_df.loc[merged_df['fragrance'].isnull(), 'fragrance'] = y_predict

# Save the complete DataFrame to a CSV file
merged_df.to_csv('perfume_final.csv', index=False)

# **Perfume Apps**

### Perfume Finder

In [None]:
df = pd.read_csv('perfume_final.csv')
df.info()

* Perfume finder based on brand and name of perfume

In [None]:
def search_perfume_info():
    user_input = input('Enter the brand or name of the perfume: ')
    # Convert the user input to lowercase
    input_str = user_input.lower()
  
    # Combine 'brand' and 'name' into a lowercase Series for matching, without modifying df
    brand_name = (df['brand'].fillna('') + ' ' + df['name'].fillna('')).str.lower()

    # Keep only the rows whose brand and name contain every word of the input
    mask = brand_name.apply(lambda x: all(word in x for word in input_str.split()))

    # Filter the DataFrame based on the mask; copy so the formatting below leaves df unchanged
    results = df[mask].copy()

    if results.empty:
        print('No perfumes found matching the input.')
    else:
        # Sort the results by customer_rating and review_count
        results.sort_values(by=['customer_rating', 'review_count'], ascending=[False, False], inplace=True)

        # Convert the image URLs to image tags for display
        results['image'] = results['image'].apply(lambda url: f'<img src="{url}" width="100" height="100">')

        # Split the multiple words in top_note, heart_note, and base_note and display each word on a separate line
        for col in ['top_note', 'heart_note', 'base_note']:
            results[col] = results[col].str.replace(', ', '<br>').fillna('')
        
        # Convert the values in the 'base_price' column to 1/10 of the original values
        results['base_price'] = round(results['base_price'] / 10, 2).apply(lambda price: f'{price} €')      

        # Rename the columns
        results.rename(columns={'top_note': 'top note', 'heart_note': 'heart note',
                                'base_note': 'base note', 'base_price': 'price per 100ml',
                                'customer_rating': 'customer rating', 'review_count': 'review count'}, inplace=True)
        
        # Show the desired information for the matching perfumes
        columns_to_show = ['image', 'brand', 'name', 'category', 'gender', 'fragrance', 'top note', 'heart note', 'base note', 'price per 100ml', 'customer rating', 'review count']
        
        # Label the index as 'id' so the perfume ID is visible in the table
        results.index.name = 'id'

        # Display the table
        display(HTML(results[columns_to_show].to_html(escape=False)))

        return

In [None]:
# # Example usage:
# search_perfume_info()
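
The matching rule used in `search_perfume_info` is that every word the user types must appear somewhere in the lowercased brand and name. A minimal standalone sketch of that rule follows; `matches_all_words` is a hypothetical helper and the sample strings are made up for illustration.

```python
def matches_all_words(text, query):
    """Return True when every whitespace-separated word of `query`
    occurs as a substring of `text`, ignoring case."""
    text = text.lower()
    return all(word in text for word in query.lower().split())


# Word order does not matter, and partial words match because this is a substring test.
example_hit = matches_all_words("Some Brand Blue Night", "night brand")
example_miss = matches_all_words("Some Brand Blue Night", "brand red")
```

Note that an empty query matches every perfume, because `all()` over an empty sequence is `True`.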

* Perfume finder based on similarity to a given perfume, using cosine similarity

In [None]:
# Find the perfumes most similar to the user's input perfume, based on perfume notes.
# The default top_n=6 shows the input perfume itself (normally ranked first) plus 5 similar ones.

def find_similar_perfumes(top_n=6):
    # Ask the user to input the perfume ID (the DataFrame index, displayed as 'id')
    input_perfume_id = int(input("Enter the perfume ID: "))
    if input_perfume_id not in df.index:
        print('No perfume found with that ID.')
        return

    # Find the notes of the input perfume using the given ID
    input_perfume_notes = df.loc[input_perfume_id, 'notes']

    # Vectorize the notes using TF-IDF
    vectorizer = TfidfVectorizer()
    notes_vectors = vectorizer.fit_transform(df['notes'])

    # Vectorize the notes of the input perfume
    input_vector = vectorizer.transform([input_perfume_notes])

    # Calculate cosine similarity between the input perfume and all perfumes
    cosine_similarities = cosine_similarity(input_vector, notes_vectors).flatten()

    # Add cosine similarity as a column on a copy, so the shared df is left unchanged
    scored = df.assign(cosine_similarity=cosine_similarities)

    # Sort by cosine similarity in descending order
    recommended_perfumes = scored.sort_values(by='cosine_similarity', ascending=False)
    
    # Convert the image URLs to image tags for display
    recommended_perfumes['image'] = recommended_perfumes['image'].apply(lambda url: f'<img src="{url}" width="100" height="100">')

    # Split the multiple words in top_note, heart_note, and base_note and display each word on a separate line
    for col in ['top_note', 'heart_note', 'base_note']:
        recommended_perfumes[col] = recommended_perfumes[col].str.replace(', ', '<br>').fillna('')
    
    # Convert the values in the 'base_price' column to 1/10 of the original values
    recommended_perfumes['base_price'] = round(recommended_perfumes['base_price'] / 10, 2).apply(lambda price: f'{price} €')      

    # Rename the columns for better display
    recommended_perfumes.rename(columns={'top_note': 'top note', 'heart_note': 'heart note',
                            'base_note': 'base note', 'base_price': 'price per 100ml', 'customer_rating': 'customer rating', 
                            'review_count': 'review count', 'cosine_similarity': 'cosine similarity'}, inplace=True)
    
    # Define the columns to show in the output table
    columns_to_show = ['image', 'brand', 'name', 'category', 'gender', 'fragrance', 'top note', 'heart note', 'base note', 'price per 100ml', 'customer rating', 'review count', 'cosine similarity']
    
    # Label the index as 'id' so the perfume ID is visible in the table
    recommended_perfumes.index.name = 'id'

    # Display the table with the top N similar perfumes
    display(HTML(recommended_perfumes.head(top_n)[columns_to_show].to_html(escape=False)))

    return

In [None]:
# # Example usage:
# find_similar_perfumes()
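
To make the similarity score concrete, here is a small standard-library sketch of cosine similarity between two note lists. It is a simplification and an assumption on my part: it uses raw counts of comma-separated notes, whereas the function above uses TF-IDF weights over word tokens, so the numbers will differ from the notebook's. `cosine_similarity_counts` is a hypothetical name.

```python
import math
from collections import Counter


def cosine_similarity_counts(notes_a, notes_b):
    """Cosine similarity between two comma-separated note strings,
    using raw note counts (not TF-IDF weights)."""
    a = Counter(n.strip().lower() for n in notes_a.split(",") if n.strip())
    b = Counter(n.strip().lower() for n in notes_b.split(",") if n.strip())

    # Dot product over the notes the two perfumes share
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))

    # A perfume without notes has no direction, so treat it as dissimilar
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two of three notes shared: 2 / (sqrt(3) * sqrt(3)) = 2/3
score = cosine_similarity_counts("rose, jasmine, musk", "rose, musk, vanilla")
```

Identical note lists score 1 and lists with no shared note score 0, which is why the input perfume normally appears at the top of its own results.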

### Perfume Recommender

* Recommender based on input criteria

In [None]:
# Function to get the desired brand from the user
def get_brand():
    # Get user input for brand and convert to lowercase
    brand_input = input('Enter brand (or press Enter to skip): ')
    brand = brand_input.lower()

    # Get unique brands from the DataFrame and convert them to lowercase for comparison
    all_brands = df['brand'].str.lower().unique()

    if brand_input:  # Check if there is a user input
        # Check if the lowercase brand input exists in the lowercase brands list
        if brand in all_brands:
            return [brand]  # Return the brand as a list
        else:
            print("Sorry, we don't have the brand you want. Please enter another brand or skip.")
            return get_brand()  # Ask user to input again
    else:
        print("No brand chosen. Getting all brands.")
        return all_brands.tolist()  # Return all the unique brands in the DataFrame as a list


In [None]:
# # Example usage:
# input_brand = get_brand()
# input_brand

In [None]:
# Function to get the desired categories from the user
def get_categories():
    all_categories = df['category'].unique()
    print("Available categories:")
    for i, category in enumerate(all_categories, start=1):
        print(f"{i}. {category}")

    categories = []
    max_choices = 5

    while len(categories) < max_choices:
        choice = input("Choose a category number (or press Enter to skip): ")

        if not choice:  # Check if the input is empty (None or empty string)
            if len(categories) == 0:
                print("No category chosen. Getting all categories.")
                return all_categories.tolist()
            else:
                break

        try:
            choice = int(choice)
            if 1 <= choice <= len(all_categories):
                category = all_categories[choice - 1]
                categories.append(category)
                print(f"{category} added to the chosen categories.")
            else:
                print("Invalid category number. Please choose a valid number.")
        except ValueError:
            print("Invalid input. Please enter a valid category number.")

    return categories

In [None]:
# # Example usage:
# input_categories = get_categories()
# input_categories

In [None]:
# Function to get the desired gender from the user
def get_gender():
    all_genders = df['gender'].unique()
    print("Available genders:")
    for i, gender in enumerate(all_genders, start=1):
        print(f"{i}. {gender}")

    genders = []
    max_choices = 3

    while len(genders) < max_choices:
        choice = input("Choose a gender number (or press Enter to skip): ")

        if not choice:  # Check if the input is empty (None or empty string)
            if len(genders) == 0:
                print("No gender chosen. Getting all genders.")
                return all_genders.tolist()
            else:
                break

        try:
            choice = int(choice)
            if 1 <= choice <= len(all_genders):
                gender = all_genders[choice - 1]
                genders.append(gender)
                print(f"{gender} added to the chosen genders.")
            else:
                print("Invalid gender number. Please choose a valid number.")
        except ValueError:
            print("Invalid input. Please enter a valid gender number.")

    return genders

In [None]:
# # Example usage:
# input_genders = get_gender()
# input_genders

In [None]:
# Function to get the desired fragrance from the user
def get_fragrance():
    all_fragrances = df['fragrance'].unique()
    print("Available fragrances:")
    for i, fragrance in enumerate(all_fragrances, start=1):
        print(f"{i}. {fragrance}")

    fragrances = []
    max_choices = 16

    while len(fragrances) < max_choices:
        choice = input("Choose a fragrance number (or press Enter to skip): ")

        if not choice:  # Check if the input is empty (None or empty string)
            if len(fragrances) == 0:
                print("No fragrance chosen. Getting all fragrances.")
                return all_fragrances.tolist()
            else:
                break

        try:
            choice = int(choice)
            if 1 <= choice <= len(all_fragrances):
                fragrance = all_fragrances[choice - 1]
                fragrances.append(fragrance)
                print(f"{fragrance} added to the chosen fragrances.")
            else:
                print("Invalid fragrance number. Please choose a valid number.")
        except ValueError:
            print("Invalid input. Please enter a valid fragrance number.")

    return fragrances

In [None]:
# # Example usage:
# input_fragrances = get_fragrance()
# input_fragrances
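
`get_categories`, `get_gender`, and `get_fragrance` share the same selection loop. One possible way to factor it out is sketched below; `choose_from_options` and its `read` parameter are hypothetical and not part of the notebook. Unlike the originals, this sketch ignores a repeated choice instead of adding it twice.

```python
def choose_from_options(options, label, max_choices, read=input):
    """Let the user pick up to `max_choices` entries of `options` by number.

    Returns every option when the user skips without choosing anything.
    `read` defaults to input(); passing another callable makes the loop testable.
    """
    options = list(options)
    print(f"Available {label} options:")
    for i, option in enumerate(options, start=1):
        print(f"{i}. {option}")

    chosen = []
    while len(chosen) < max_choices:
        choice = read(f"Choose a {label} number (or press Enter to skip): ")
        if not choice:  # empty input ends the selection
            break
        try:
            index = int(choice)
        except ValueError:
            print("Invalid input. Please enter a number.")
            continue
        if not 1 <= index <= len(options):
            print("Invalid number. Please choose a number from the list.")
        elif options[index - 1] not in chosen:
            chosen.append(options[index - 1])
            print(f"{options[index - 1]} added.")

    # No selection means "no filter": return all options
    return chosen or options


# Scripted answers stand in for keyboard input in this example
answers = iter(["2", "x", "9", "2", ""])
picked = choose_from_options(["EDT", "EDP", "Perfume"], "category", 5,
                             read=lambda prompt: next(answers))
```

With such a helper, each of the three functions could reduce to a single call with the matching column's unique values.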

In [None]:
# Function to get the desired notes from the user
def get_notes():
    notes = []
    max_choices = 5

    print("Enter notes (one at a time) and press Enter after each note.")
    print("Press Enter without typing anything to skip.")

    for i in range(max_choices):
        choice = input(f"Enter note {i+1}: ")
        
        if not choice:  # Check if the input is empty (None or empty string)
            if len(notes) == 0:
                print("No notes chosen.")
                return []
            else:
                break

        notes.append(choice)
        print(f"{choice} added to the chosen notes.")

    return notes

In [None]:
# # Example usage:
# input_notes = get_notes()
# input_notes

In [None]:
# Function to get the desired price range from the user
def get_price_range():
    # Get the lowest and highest prices from the DataFrame's "base_price" column (in 1000ml)
    lowest_price_1000ml = df["base_price"].min()
    highest_price_1000ml = df["base_price"].max()

    # Adjust the lowest and highest prices for 100ml comparison
    lowest_price = lowest_price_1000ml / 10
    highest_price = highest_price_1000ml / 10

    while True:
        price_from = input('Insert price range for 100ml from: ')
        price_to = input('to: ')

        try:
            price_from = float(price_from) if price_from else lowest_price
            price_to = float(price_to) if price_to else highest_price

            if price_to < price_from:
                print("Invalid price range. The price upper limit cannot be lower than the price lower limit.")
            else:
                break

        except ValueError:
            print("Invalid price range, please input numbers.")
            
    return price_from, price_to

In [None]:
# # Example usage:
# input_price_from, input_price_to = get_price_range()
# input_price_from, input_price_to
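
Because `base_price` is stored per 1000ml while the user enters a range per 100ml, any comparison has to put both sides on the same unit. A minimal sketch of that conversion, with hypothetical helper names and made-up prices:

```python
def price_per_100ml(base_price_per_1000ml):
    """Convert a price per 1000ml into a price per 100ml, rounded to cents."""
    return round(base_price_per_1000ml / 10, 2)


def in_price_range(base_price_per_1000ml, price_from, price_to):
    """Check a per-1000ml price against a range given per 100ml."""
    return price_from <= base_price_per_1000ml / 10 <= price_to


# A perfume costing 1200 € per 1000ml costs 120 € per 100ml
example_price = price_per_100ml(1200.0)
example_match = in_price_range(1200.0, 100, 150)
```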

In [None]:
def perfume_recommender():
    # Get user inputs for brand, categories, gender, price range, fragrance, and notes
    brand = get_brand()
    categories = get_categories()
    genders = get_gender()
    price_from, price_to = get_price_range()
    fragrances = get_fragrance()
    notes = get_notes()

    # Apply the filters to the DataFrame
    mask_brand = df['brand'].str.lower().isin(brand) if brand and len(brand) > 0 else True
    mask_categories = df['category'].isin(categories) if categories and len(categories) > 0 else True
    mask_genders = df['gender'].isin(genders) if genders and len(genders) > 0 else True
    mask_fragrances = df['fragrance'].isin(fragrances) if fragrances and len(fragrances) > 0 else True
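    # base_price is per 1000ml but the user's range is per 100ml, so compare on base_price / 10
    price_per_100ml = df['base_price'] / 10
    mask_price = (price_per_100ml >= price_from) & (price_per_100ml <= price_to)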

    # Apply the notes filter to the DataFrame if notes are provided
    if notes and len(notes) > 0:
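        # Plain-text, case-insensitive match; rows with missing notes count as no match
        note_masks = [df['notes'].str.contains(note, case=False, regex=False, na=False) for note in notes]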
        mask_notes = pd.DataFrame(note_masks).T.any(axis=1)
    else:
        mask_notes = True

    # Combine all the masks using the logical AND operator (&)
    combined_mask = mask_brand & mask_categories & mask_genders & mask_fragrances & mask_price & mask_notes

    # Filter the DataFrame based on the combined mask; copy so the formatting below leaves df unchanged
    results = df[combined_mask].copy()

    if results.empty:
        print('No perfumes found matching the selected criteria.')
    else:
        # Sort the results by the number of matched notes (case-insensitive, like the notes filter), customer_rating, and review_count
        results['num_matched_notes'] = results['notes'].apply(lambda x: sum(note.lower() in x.lower() for note in notes))
        results.sort_values(by=['num_matched_notes', 'customer_rating', 'review_count'], ascending=[False, False, False], inplace=True)
        results.drop(columns=['num_matched_notes'], inplace=True)

        # Convert the image URLs to image tags for display
        results['image'] = results['image'].apply(lambda url: f'<img src="{url}" width="100" height="100">')

        # Split the multiple words in top_note, heart_note, and base_note and display each word on a separate line
        for col in ['top_note', 'heart_note', 'base_note']:
            results[col] = results[col].str.replace(', ', '<br>').fillna('')

        # Convert the values in the 'base_price' column to 1/10 of the original values
        results['base_price'] = round(results['base_price'] / 10, 2).apply(lambda price: f'{price} €')

        # Rename the columns
        results.rename(columns={'top_note': 'top note', 'heart_note': 'heart note',
                                'base_note': 'base note', 'base_price': 'price per 100ml',
                                'customer_rating': 'customer rating', 'review_count': 'review count'}, inplace=True)

        # Show the desired information for the matching perfumes
        columns_to_show = ['image', 'brand', 'name', 'category', 'gender', 'fragrance', 'top note', 'heart note', 'base note', 'price per 100ml', 'customer rating', 'review count']

        # Label the index as 'id' so the perfume ID is visible in the table
        results.index.name = 'id'

        # Display the table
        display(HTML(results[columns_to_show].to_html(escape=False)))

In [None]:
# # Call the perfume recommender function
# perfume_recommender()
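
The ranking step of the recommender orders matches by number of matched notes, then customer rating, then review count, all descending. A standard-library sketch of that ordering on plain dictionaries follows; the helper names are hypothetical and the sample perfumes are made up for illustration.

```python
def count_matched_notes(perfume_notes, wanted_notes):
    """Count how many wanted notes occur in the perfume's notes text, ignoring case."""
    text = perfume_notes.lower()
    return sum(note.lower() in text for note in wanted_notes)


def rank_perfumes(perfumes, wanted_notes):
    """Sort perfume dicts by matched notes, then rating, then review count (all descending)."""
    return sorted(
        perfumes,
        key=lambda p: (
            count_matched_notes(p["notes"], wanted_notes),
            p["customer_rating"],
            p["review_count"],
        ),
        reverse=True,
    )


# Made-up sample data, not taken from the scraped dataset
sample = [
    {"name": "A", "notes": "Rose, Musk", "customer_rating": 4.0, "review_count": 10},
    {"name": "B", "notes": "rose, vanilla, musk", "customer_rating": 4.5, "review_count": 3},
    {"name": "C", "notes": "Vanilla", "customer_rating": 5.0, "review_count": 50},
]
ranked_names = [p["name"] for p in rank_perfumes(sample, ["rose", "vanilla"])]
```

Here B comes first because it matches both requested notes; C outranks A on rating, since each matches one note.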

# **Summary**

* Project Title: Perfumedia
* Name: Ying

---

![alt text](https://cdn.shopify.com/s/files/1/0021/7713/8775/files/How_to_Select_the_Right_Perfume_for_You_600x600.jpg?v=1653001992)


* The recommendation system I built targets users who love perfume but struggle to decide what to wear.
* This is a content-based recommendation system.
* Perfume Finder 1: A brand- and name-based search that lists the perfumes matching the entered words, sorted by customer rating and review count.
* Perfume Finder 2: Uses cosine similarity on TF-IDF vectors of the perfume notes to recommend perfumes similar to a user's favorite perfume.
* Perfume Recommender: Lets users enter criteria such as brand, perfume category, gender, price range, fragrances, and notes, and returns the matching perfumes ranked by matched notes, customer rating, and review count.

# **Future Work**
* In the future, I would like to build a website with this system deployed in the backend, add more functionality, and enhance the recommendations with more features and algorithms.