In this notebook we are going to cluster the product descriptions into categories using several different NLP models, then run a comparative price analysis across gender within each category. This differs from the previous approach, which encoded the product descriptions with a simple bag-of-words method and ran linear regression / random forest regression to see the weights associated with each word.

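The new clustering method is an effort to account for Simpson's paradox: a price gap that shows up when all products are pooled can shrink, vanish, or even reverse once we compare like with like within each category.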
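To make the paradox concrete, here is a minimal sketch with made-up prices (these numbers are not from the Nike data). Women's items cost more within each category, yet the pooled average says the opposite, simply because the men's list is dominated by the expensive category.

```python
from statistics import mean

# Made-up prices, chosen only to illustrate the effect.
prices = {
    "t-shirts": {"women": [32, 32, 32], "men": [30]},
    "jackets": {"women": [110], "men": [100, 100, 100]},
}

# Pooled comparison: ignore the categories and average everything per gender.
pooled = {}
for gender in ("women", "men"):
    all_prices = []
    for category in prices.values():
        all_prices.extend(category[gender])
    pooled[gender] = mean(all_prices)

# Per-category comparison: average within each category separately.
per_category = {}
for name, category in prices.items():
    per_category[name] = {gender: mean(vals) for gender, vals in category.items()}

print("pooled:", pooled)
print("per category:", per_category)
```

This is why the plan below compares prices inside each cluster instead of across the whole catalog.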

## Model 1: BERTopic
We will start with BERTopic because it has many advantages for this type of problem, as outlined in this research article: 
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9120935/

The main ones are that it works well with smaller sets of data and that it doesn't require any pre-processing. Yippee!

General plan: 
1. Clean the data a little: at least make sure that there are no weird symbols or slashes.
2. Remove duplicates in the data. This also matters for avoiding Simpson's paradox.
3. Keep track of which product descriptions are women's and which are men's.
4. Cluster all the product descriptions using BERTopic.
5. Within each cluster, find the average price of the women's items and of the men's items.
6. Analyze and discuss results. 

I will be using this tutorial to help me: 
https://www.youtube.com/watch?v=v3SePt3fr9g&ab_channel=PythonTutorialsforDigitalHumanities 

Let's go ahead and get started!

In [3]:
from bertopic import BERTopic
import json 
import pandas as pd

In [6]:
# now let's preprocess the data just a little. 

w_df = pd.read_csv("womens_nike_data.csv")
w_df.head()

Unnamed: 0,Product Title,Product Subtitle,Price
0,Nike Pro,Women's Dri-FIT Cropped Tank Top,$40
1,Nike Pro Sculpt,"Women's High-Waisted 3"" Biker Shorts",$38
2,Nike Sportswear Phoenix Fleece,Women's Over-Oversized Pullover Hoodie,$75
3,Nike Sportswear Phoenix Fleece,Women's High-Waisted Oversized Sweatpants,$70
4,Nike Sportswear Women's Artist Collection,Bomber Jacket,$160


In [3]:
w_df.shape

(3054, 3)

This dataset is a great one to start BERTopic with because each product has a fairly long description, consisting of both a product title and a subtitle. Let's remove the words "Women's" and "Nike" from each product to be consistent with the other datasets. We also want to remove any duplicates.

In [7]:
w_df['Product Title'] = w_df['Product Title'].str.replace("Women's", "")
w_df['Product Subtitle'] = w_df['Product Subtitle'].str.replace("Women's", "")
w_df['Product Title'] = w_df['Product Title'].str.replace("Nike", "")
w_df['Product Subtitle'] = w_df['Product Subtitle'].str.replace("Nike", "")
# regex=False: "$" is a regex anchor, so make sure it is matched literally
w_df['Price'] = w_df['Price'].str.replace("$", "", regex=False)
w_df['Price'] = w_df['Price'].astype(float)
w_df = w_df.drop_duplicates()
w_df.head()

Unnamed: 0,Product Title,Product Subtitle,Price
0,Pro,Dri-FIT Cropped Tank Top,40.0
1,Pro Sculpt,"High-Waisted 3"" Biker Shorts",38.0
2,Sportswear Phoenix Fleece,Over-Oversized Pullover Hoodie,75.0
3,Sportswear Phoenix Fleece,High-Waisted Oversized Sweatpants,70.0
4,Sportswear Artist Collection,Bomber Jacket,160.0


In [5]:
w_df.shape

(2152, 3)

Removing the duplicates got rid of about 900 products (3054 down to 2152)!

Now, let's just combine these product descriptions into one. I don't know why I didn't do this before. Lol. 

In [8]:
w_df['Product Description'] = w_df['Product Title'].astype(str) + " " + w_df['Product Subtitle'].astype(str)
w_df = w_df.drop('Product Title', axis=1)
w_df = w_df.drop('Product Subtitle', axis=1)
# rearrange order of columns 
w_df = w_df[['Product Description', 'Price']]
w_df.head()

Unnamed: 0,Product Description,Price
0,Pro Dri-FIT Cropped Tank Top,40.0
1,"Pro Sculpt High-Waisted 3"" Biker Shorts",38.0
2,Sportswear Phoenix Fleece Over-Oversized Pul...,75.0
3,Sportswear Phoenix Fleece High-Waisted Overs...,70.0
4,Sportswear Artist Collection Bomber Jacket,160.0


In [9]:
# now repeat for mens 
m_df = pd.read_csv("mens_nike_data.csv")

m_df['Product Title'] = m_df['Product Title'].str.replace("Men's", "")
m_df['Product Subtitle'] = m_df['Product Subtitle'].str.replace("Men's", "")
m_df['Product Title'] = m_df['Product Title'].str.replace("Nike", "")
m_df['Product Subtitle'] = m_df['Product Subtitle'].str.replace("Nike", "")
# regex=False: "$" is a regex anchor, so make sure it is matched literally
m_df['Price'] = m_df['Price'].str.replace("$", "", regex=False)
m_df['Price'] = m_df['Price'].astype(float)
m_df = m_df.drop_duplicates()
m_df['Product Description'] = m_df['Product Title'].astype(str) + " " + m_df['Product Subtitle'].astype(str)
m_df = m_df.drop('Product Title', axis=1)
m_df = m_df.drop('Product Subtitle', axis=1)
m_df = m_df[['Product Description', 'Price']]
m_df.head()

Unnamed: 0,Product Description,Price
0,Club Woven Flow Shorts,50.0
1,Sportswear Club Fleece Graphic Crew,65.0
2,Sportswear Club Graphic Shorts,30.97
3,Sportswear T-Shirt,35.0
4,Club Chino Shorts,70.0


In [8]:
m_df.shape

(5227, 2)

Hmmm. There are still more than twice as many men's products as women's (5227 vs. 2152), even after removing duplicates. Oh well, I'm just going to ignore that for now. Maybe later I'll add a balancing step here and run the analysis again. But right now I kind of just want to see what will happen.
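Since the women's and men's cleaning cells are identical apart from the gender word, they could be folded into one helper. This is only a sketch: `clean_products` is a hypothetical name, and the whitespace collapsing and stripping are extra steps that the cells above do not perform.

```python
import pandas as pd


def clean_products(df: pd.DataFrame, gender_word: str) -> pd.DataFrame:
    """Strip the gender word and "Nike", parse prices, drop duplicates."""
    out = df.copy()
    for col in ["Product Title", "Product Subtitle"]:
        out[col] = (
            out[col]
            .astype(str)
            .str.replace(gender_word, "", regex=False)
            .str.replace("Nike", "", regex=False)
            .str.replace(r"\s+", " ", regex=True)  # assumption: collapse leftover double spaces
            .str.strip()
        )
    # "$" is matched literally, then the column is converted to floats.
    out["Price"] = out["Price"].str.replace("$", "", regex=False).astype(float)
    out = out.drop_duplicates()
    out["Product Description"] = out["Product Title"] + " " + out["Product Subtitle"]
    return out[["Product Description", "Price"]]


# Tiny demo frame with one duplicated row, mimicking the raw CSV columns.
demo = pd.DataFrame({
    "Product Title": ["Nike Pro", "Nike Pro"],
    "Product Subtitle": ["Women's Dri-FIT Cropped Tank Top"] * 2,
    "Price": ["$40", "$40"],
})
cleaned = clean_products(demo, "Women's")
print(cleaned)
```

With a helper like this, the two cells would reduce to `clean_products(pd.read_csv("womens_nike_data.csv"), "Women's")` and the matching call for the men's file.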

In [10]:
descriptions = (w_df['Product Description'].to_list()) + (m_df['Product Description'].to_list())
print("All product descriptions among men's and women's data: ",len(descriptions))
descriptions = set(descriptions)
descriptions = list(descriptions)
print("Set of UNIQUE product descriptions among men's and women's data: ", len(descriptions))


All product descriptions among men's and women's data:  7379
Set of UNIQUE product descriptions among men's and women's data:  6602


Oh wait, there are actually about 6600 unique product descriptions. That means we can also try other topic modeling approaches such as Top2Vec.

For the embedding model we're just going to start with all-MiniLM-L6-v2, which is a good all-purpose model according to the [Sentence-Transformers (SBERT) documentation](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html). Later we can experiment with different embedding models.

In [11]:
file_path = "descriptions.json"
with open(file_path, 'w') as json_file:
    json.dump(descriptions, json_file, indent=4)

In [11]:
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")

In [12]:
with open('descriptions.json', 'r') as json_file:
    descriptions = json.load(json_file)

In [13]:
small_descriptions = descriptions[:100]

In [14]:
topics, probs = topic_model.fit_transform(small_descriptions)


OK, this is being weird on my machine, so I'm going to try it in Google Colab... please work! Be right back.

Okay it worked. Here is what I ran in Google Colab: 
```
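import json
import pickle

from bertopic import BERTopic

# load the descriptions saved from this notebook
with open('descriptions.json', 'r') as json_file:
    descriptions = json.load(json_file)

topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
topics, probs = topic_model.fit_transform(descriptions)
info = topic_model.get_topic_info()

# save the results so they can be loaded back into this notebook
with open('topics.pkl', 'wb') as f:
    pickle.dump(topics, f)
with open('probs.pkl', 'wb') as f:
    pickle.dump(probs, f)
with open('info.pkl', 'wb') as f:
    pickle.dump(info, f)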

```

In [18]:
# now we load everything up again 
import pickle
with open('topics.pkl', 'rb') as f:
    topics = pickle.load(f)

with open('probs.pkl', 'rb') as f:
    probs = pickle.load(f)

with open('info.pkl', 'rb') as f: 
    info = pickle.load(f)

In [19]:
info.head()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,760,-1_courtside_am_e9rica_usa,"[courtside, am, e9rica, usa, nba, yoga, wolf, ...",[Los Angeles Lakers Courtside Statement Editio...
1,0,284,0_large_fuse_stack_logo,"[large, fuse, stack, logo, athletic, coopersto...",[Chicago White Sox Large Logo Back Stack MLB...
2,1,187,1_limited_adv_elite_jersey,"[limited, adv, elite, jersey, league, star, al...",[New York Yankees Cooperstown Dri-FIT ADV ML...
3,2,145,2_blitz_essential_nfl_primetime,"[blitz, essential, nfl, primetime, wordmark, r...",[Pittsburgh Steelers Blitz Essential NFL T-S...
4,3,124,3_break_fast_long_sleeve,"[break, fast, long, sleeve, college, max90, st...",[Michigan Fast Break College Long-Sleeve T-S...


In [20]:
print(len(list(set(topics))))

179


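Wow! This is so sick. BERTopic just grouped these roughly 6600 product descriptions into 179 topics: 178 clusters plus the outlier topic -1, which holds the 760 descriptions it could not assign to any cluster. Let's learn a little bit more about a few of them.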
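Because `set(topics)` also counts the outlier label, it helps to separate the two when reporting cluster counts. A small sketch on a made-up label list (`toy_topics` is a stand-in, not the real `topics`):

```python
from collections import Counter

# Hypothetical topic labels, one per document; -1 is BERTopic's outlier label.
toy_topics = [0, 0, 1, -1, 2, 1, -1, 0]

sizes = Counter(toy_topics)
n_outliers = sizes.pop(-1, 0)   # documents left unassigned
n_clusters = len(sizes)         # real clusters, outlier topic excluded
outlier_share = n_outliers / len(toy_topics)

print(f"{n_clusters} clusters, {n_outliers} outliers ({outlier_share:.0%} of documents)")
```

Running the same few lines on the real `topics` list would show how much of the catalog ends up unassigned.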

In [36]:
print("Some items representative of the cluster named ", info.iloc[3]['Name'], "are: ")
for item in info.iloc[3]['Representative_Docs']: 
    print("        ", item)

print("\n Some items representative of the cluster named ", info.iloc[60]['Name'], "are: ")
for item in info.iloc[60]['Representative_Docs']: 
    print("        ", item)

print("\n Some items representative of the cluster named ", info.iloc[150]['Name'], "are: ")
for item in info.iloc[150]['Representative_Docs']: 
    print("        ", item)

print("\n Some items representative of the cluster named ", info.iloc[170]['Name'], "are: ")
for item in info.iloc[170]['Representative_Docs']: 
    print("        ", item)

Some items representative of the cluster named  2_blitz_essential_nfl_primetime are: 
         Pittsburgh Steelers Blitz Essential   NFL T-Shirt
         San Francisco 49ers Blitz Essential   NFL T-Shirt
         New York Giants Blitz Essential   NFL T-Shirt

 Some items representative of the cluster named  59_coach_sideline_top_nfl are: 
         Miami Dolphins Sideline Coach   Dri-FIT NFL Top
         Miami Dolphins Sideline Coach   Dri-FIT NFL Long-Sleeve Top
         New York Giants Sideline Coach   Dri-FIT NFL Top

 Some items representative of the cluster named  149_bomber_time_full_game are: 
         Houston Astros Authentic Collection City Connect Game Time   MLB Full-Zip Bomber Jacket
         New York Mets Authentic Collection City Connect Game Time   MLB Full-Zip Bomber Jacket
         Chicago Cubs Authentic Collection City Connect Game Time   MLB Full-Zip Bomber Jacket

 Some items representative of the cluster named  169_swingman_icon_edition_2020 are: 
         Golden St

As we can see, BERTopic appears to have done a fairly accurate job of clustering the product descriptions. However, remember how in the beginning we combined the product title and the product subtitle into one long product description? I am afraid these descriptions are too specific, so that within each cluster there won't be much overlap between women's items and men's items. In other words, one cluster will contain only men's items and another only women's, which leaves nothing to compare.
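That worry can be checked directly by counting how many women's and men's descriptions land in each cluster. A minimal sketch with invented inputs: the descriptions are loosely modeled on the ones above, but the topic assignments and the `womens` / `mens` sets are hypothetical.

```python
from collections import defaultdict

# Hypothetical inputs: a topic label per description, plus the sets of
# descriptions found in the women's and men's data.
doc_topics = {
    "Sideline Coach Dri-FIT NFL Top": 59,
    "Blitz Essential NFL T-Shirt": 2,
    "Phoenix Fleece Oversized Pullover Hoodie": 7,
    "Club Fleece Pullover Hoodie": 7,
}
womens = {"Phoenix Fleece Oversized Pullover Hoodie", "Blitz Essential NFL T-Shirt"}
mens = {
    "Sideline Coach Dri-FIT NFL Top",
    "Blitz Essential NFL T-Shirt",
    "Club Fleece Pullover Hoodie",
}

# Count, per topic, how many of its descriptions appear for each gender.
mix = defaultdict(lambda: {"women": 0, "men": 0})
for doc, topic in doc_topics.items():
    mix[topic]["women"] += doc in womens
    mix[topic]["men"] += doc in mens

# A cluster is only useful for the comparison if both genders appear in it.
comparable = sorted(t for t, counts in mix.items() if counts["women"] and counts["men"])
print("comparable topics:", comparable)
```

If only a small fraction of the clusters turn out to be comparable, the descriptions are probably too specific.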

Let's redo this using only the product subtitles. Don't worry, we'll be quick. 

In [48]:
w_df = pd.read_csv("womens_nike_data.csv")
w_df = w_df.drop('Product Title', axis=1)  # drop() returns a new frame, so assign it back
w_df['Product Subtitle'] = w_df['Product Subtitle'].str.replace("Women's", "")
w_df['Product Subtitle'] = w_df['Product Subtitle'].str.replace("Nike", "")
w_df['Price'] = w_df['Price'].str.replace("$", "", regex=False)
w_df['Price'] = w_df['Price'].astype(float)
w_df['Product Description'] = w_df['Product Subtitle'].astype(str)
w_df = w_df.drop('Product Subtitle', axis=1)
w_df = w_df[['Product Description', 'Price']]
w_df = w_df.drop_duplicates()
w_df.head()


Unnamed: 0,Product Description,Price
0,Dri-FIT Cropped Tank Top,40.0
1,"High-Waisted 3"" Biker Shorts",38.0
2,Over-Oversized Pullover Hoodie,75.0
3,High-Waisted Oversized Sweatpants,70.0
4,Bomber Jacket,160.0


In [46]:
m_df = pd.read_csv("mens_nike_data.csv")
m_df = m_df.drop('Product Title', axis=1)  # drop() returns a new frame, so assign it back
m_df['Product Subtitle'] = m_df['Product Subtitle'].str.replace("Men's", "")
m_df['Product Subtitle'] = m_df['Product Subtitle'].str.replace("Nike", "")
m_df['Price'] = m_df['Price'].str.replace("$", "", regex=False)
m_df['Price'] = m_df['Price'].astype(float)
m_df['Product Description'] = m_df['Product Subtitle'].astype(str)
m_df = m_df.drop('Product Subtitle', axis=1)
m_df = m_df[['Product Description', 'Price']]
m_df = m_df.drop_duplicates()
m_df.head()

Unnamed: 0,Product Description,Price
0,Woven Flow Shorts,50.0
1,Graphic Crew,65.0
2,Graphic Shorts,30.97
3,T-Shirt,35.0
4,Chino Shorts,70.0


In [49]:
print("No. women's unique items: ", w_df.shape[0], "\nNo. men's unique items: ", m_df.shape[0])

No. women's unique items:  1466 
No. men's unique items:  1894
     Product Description  Price
71               T-Shirt   35.0
6358             T-Shirt   60.0


Okay, now this is even better than the first time because the numbers of women's and men's products are much closer (1466 vs. 1894)! Once again I'm not going to do anything about the remaining imbalance just yet, lol, because I need clusters based on all the products, I guess. I can add a step to come back to this later.

In [51]:
descriptions = (w_df['Product Description'].to_list()) + (m_df['Product Description'].to_list())
print("All product descriptions among men's and women's data: ",len(descriptions))
descriptions = set(descriptions)
descriptions = list(descriptions)
print("Set of UNIQUE product descriptions among men's and women's data: ", len(descriptions))

# the _s stands for subtitle (or simple, whatever you want to remember it by, since we are simplifying the product descriptions)
file_path = "descriptions_s.json"
with open(file_path, 'w') as json_file:
    json.dump(descriptions, json_file, indent=4)

All product descriptions among men's and women's data:  3360
Set of UNIQUE product descriptions among men's and women's data:  2043


Now we go back into Google Colab and run the same code once again, this time on the subtitle-only file.

```
import json
import pickle

from bertopic import BERTopic

topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")

with open('descriptions_s.json', 'r') as json_file:
    descriptions_s = json.load(json_file)

topics, probs = topic_model.fit_transform(descriptions_s)
info = topic_model.get_topic_info()

# save the results so they can be loaded back into this notebook
with open('topics_s.pkl', 'wb') as f:
    pickle.dump(topics, f)
with open('probs_s.pkl', 'wb') as f:
    pickle.dump(probs, f)
with open('info_s.pkl', 'wb') as f:
    pickle.dump(info, f)
```
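One caveat with this round trip: the topic labels are positional, so `topics[i]` only makes sense next to the exact list order that was written to the JSON file. A list built from a bare `set()` can come out in a different order in a new Python session, so re-running the export cell without re-fitting would silently misalign the two. A hedged sketch of a more reproducible export, using a toy list and a temporary directory (both assumptions for illustration):

```python
import json
import os
import tempfile

# Toy stand-in for the combined women's + men's descriptions.
all_descriptions = ["T-Shirt", "Bomber Jacket", "T-Shirt", "Ankle Socks"]

# sorted() makes the order reproducible across sessions.
unique_descriptions = sorted(set(all_descriptions))

path = os.path.join(tempfile.mkdtemp(), "descriptions_s.json")
with open(path, "w") as f:
    json.dump(unique_descriptions, f, indent=4)

# Later (for example after fitting in Colab), reload the list from the same
# file so that topics[i] still refers to reloaded[i].
with open(path) as f:
    reloaded = json.load(f)

print(reloaded)
```

Reloading from the file, instead of reusing whatever list is in memory, keeps the labels and the documents in step.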

In [53]:
# load it back up: 
import pickle
with open('topics_s.pkl', 'rb') as f:
    topics = pickle.load(f)

with open('probs_s.pkl', 'rb') as f:
    probs = pickle.load(f)

with open('info_s.pkl', 'rb') as f: 
    info = pickle.load(f)

print(len(list(set(topics))))

60


AMAZING!!! We went from about 2000 unique descriptions to 60 topics (59 clusters plus the outlier topic -1)! SIXTY!!!! Wahoo! Okay, now let's see if these are worth anything though...

In [54]:
print("Some items representative of the cluster named ", info.iloc[3]['Name'], "are: ")
for item in info.iloc[3]['Representative_Docs']: 
    print("        ", item)

print("\n Some items representative of the cluster named ", info.iloc[10]['Name'], "are: ")
for item in info.iloc[10]['Representative_Docs']: 
    print("        ", item)

print("\n Some items representative of the cluster named ", info.iloc[30]['Name'], "are: ")
for item in info.iloc[30]['Representative_Docs']: 
    print("        ", item)

print("\n Some items representative of the cluster named ", info.iloc[50]['Name'], "are: ")
for item in info.iloc[50]['Representative_Docs']: 
    print("        ", item)

Some items representative of the cluster named  2_pants_dri_fit_bodysuit are: 
          Dri-FIT Soccer Pants
           Dri-FIT Pants
          Dri-FIT Pants

 Some items representative of the cluster named  9_jersey_replica_football_hockey are: 
           College Basketball Replica Jersey
          College Basketball Replica Jersey
           College Football Replica Jersey

 Some items representative of the cluster named  29_skirt_midi_tennis_flouncy are: 
          Dri-FIT Tennis Skirt
          Tennis Skirt
          Skirt

 Some items representative of the cluster named  49_sweatpants_wide_leg_rise are: 
          High-Waisted Wide-Leg Sweatpants (Plus Size)
          Mid-Rise Wide-Leg Sweatpants
          Mid-Rise Wide-Leg Sweatpants (Plus Size)


Looking good, looking good, looking good. Once again, these might be just a bit too specific. For example, I'd prefer a cluster like "Sweatpants" over one that pins down a rise and leg width, because those descriptors are usually women's only and leave nothing on the men's side to compare against. But let's just see where this takes us.

In [57]:
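# topics[i] is the label for descriptions[i], so `descriptions` must be in the
# same order as the list that was saved to descriptions_s.json
df = pd.DataFrame({"topic": topics, "document": descriptions})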
df.head()

Unnamed: 0,topic,document
0,16,Bikini Swim Bottom
1,51,Cropped Ribbed Tank
2,30,Jordan NBA Swingman Shorts
3,-1,College Fleece Shorts
4,11,Ankle Socks


In [58]:
print(df.shape)

(2043, 2)


So we just made a dataframe that shows which description belongs to which topic (cluster). The final step is to match these descriptions back to the women's and men's data, find the prices in each, and compare. If the topics are too specific, this once again might not give us anything useful. Not to fret, however, because then we can experiment with even more models and methods of semantic clustering!!!
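One way to do the matching is a merge followed by a groupby. This is a sketch on toy frames that only mimic the column names used in this notebook, not the method used in the cell below. Unlike a first-match lookup, it averages every listed price, so a description such as "T-Shirt" that appears at two prices contributes both.

```python
import pandas as pd

# Toy stand-ins for df, w_df and m_df (same column names, invented rows).
toy_df = pd.DataFrame({
    "topic": [0, 0, 1],
    "document": ["T-Shirt", "Graphic T-Shirt", "Tennis Skirt"],
})
toy_w = pd.DataFrame({
    "Product Description": ["T-Shirt", "Tennis Skirt"],
    "Price": [30.0, 60.0],
})
toy_m = pd.DataFrame({
    "Product Description": ["T-Shirt", "T-Shirt", "Graphic T-Shirt"],
    "Price": [35.0, 60.0, 40.0],
})


def topic_means(price_df: pd.DataFrame, topic_df: pd.DataFrame) -> pd.Series:
    """Attach a topic to every priced item, then average price per topic."""
    merged = price_df.merge(
        topic_df, left_on="Product Description", right_on="document", how="inner"
    )
    return merged.groupby("topic")["Price"].mean()


# Topics missing on one side come out as NaN after index alignment.
toy_results = pd.DataFrame({
    "womens average price": topic_means(toy_w, toy_df),
    "mens average price": topic_means(toy_m, toy_df),
})
print(toy_results)
```

Here topic 1 has no men's items, so its men's average is NaN, which is the same signal the loop below encodes as `None`.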

In [69]:
results_list = []

for topic in df['topic'].unique():
    topic_docs = df[df['topic'] == topic]['document']
    womens_prices = []
    mens_prices = []
    for doc in topic_docs:
        # note: .values[0] keeps only the first listed price when a
        # description appears at more than one price
        if doc in w_df['Product Description'].values:
            womens_price = w_df[w_df['Product Description'] == doc]['Price'].values[0]
            womens_prices.append(womens_price)
        if doc in m_df['Product Description'].values:
            mens_price = m_df[m_df['Product Description'] == doc]['Price'].values[0]
            mens_prices.append(mens_price)
    if womens_prices: 
        womens_avg = sum(womens_prices) / len(womens_prices)
    else: 
        womens_avg = None
    if mens_prices: 
        mens_avg = sum(mens_prices) / len(mens_prices) 
    else: 
        mens_avg = None 
    results_list.append({
        "topic": topic,
        "womens average price": womens_avg,
        "mens average price": mens_avg
    })
# convert back to df 
results_df = pd.DataFrame(results_list)

In [70]:
results_df.head()

Unnamed: 0,topic,womens average price,mens average price
0,16,65.494375,57.0
1,51,41.554286,
2,30,47.54,56.893333
3,-1,77.800062,89.792947
4,11,21.184048,21.090244


In [71]:
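# initialize counts
count1 = 0     # women's average is higher by more than $1
count2 = 0     # men's average is higher by more than $1
count3 = 0     # averages are within $1 of each other
count_nan = 0  # one side has no items, so no comparison is possible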

for index, row in results_df.iterrows():
    womens_price = row['womens average price']
    mens_price = row['mens average price']
    if pd.isna(womens_price) or pd.isna(mens_price):
        count_nan += 1
    else:
        if womens_price - mens_price > 1:
            count1 += 1
        elif mens_price - womens_price > 1:
            count2 += 1
        else:
            count3 += 1
no_nan = len(results_df) - count_nan

# Print counts and total
print(f"Number of comparable categories: {no_nan}")
print(f"No. of categories where women's price is higher by more than $1: {count1}")
print(f"No. of categories where men's price is higher by more than $1: {count2}")
print(f"No. of categories where prices are equal (at most $1 difference): {count3}")

Number of comparable categories: 51
No. of categories where women's price is higher by more than $1: 13
No. of categories where men's price is higher by more than $1: 36
No. of categories where prices are equal (at most $1 difference): 2


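Very cool: men's items are pricier in 36 of the 51 comparable categories, and women's in only 13. Nice work, Nike. Keep in mind that this count includes the outlier topic -1 and treats every category equally regardless of its size. Let's run the new one tomorrow.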

I also must run this on Gymshark before I jump to any conclusions, because with Gymshark I know what the results should be. It's sort of like supervised learning, in a way. Not really. Lol.