### Using NLP methods to create clothing clusters 

This approach experiments with NLP clustering methods to automatically group products into cohesive clothing categories, so that we don't need a manual analysis. The ultimate goal is to take these clusters and compare prices across gender within each one.

In [1]:
from bertopic import BERTopic
import json 
import pandas as pd

path = "" # your path here 

# Light preprocessing for this task

w_df = pd.read_csv("womens_df_clean.csv")
# Drop the manual categories I created earlier; they're not useful here
w_df = w_df.drop(columns=['Specific Category', 'General Category'])
# regex=False so "$" is treated as a literal character, not a regex end-of-string anchor
w_df['Price'] = w_df['Price'].str.replace("$", "", regex=False).astype(float)
w_df = w_df.drop_duplicates()

m_df = pd.read_csv("mens_df_clean.csv")
m_df = m_df.drop(columns=['Specific Category', 'General Category'])
m_df['Price'] = m_df['Price'].str.replace("$", "", regex=False).astype(float)
m_df = m_df.drop_duplicates()

print("Number of women's items: ", w_df.shape[0], "\nNumber of men's items: ", m_df.shape[0])

Number of women's items:  42 
Number of men's items:  36
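Since the women's and men's files go through exactly the same cleaning steps, one option is to wrap them in a small helper. This is just a sketch: `clean_products` is a name I made up, the demo prices are invented to exercise the function, and it resets the index, which the cells above don't do.

```python
import pandas as pd

def clean_products(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the manual category columns, parse prices, and de-duplicate rows."""
    # errors="ignore" lets this run even if the category columns are absent
    out = df.drop(columns=["Specific Category", "General Category"], errors="ignore")
    # regex=False treats "$" as a literal character rather than a regex anchor
    out["Price"] = out["Price"].str.replace("$", "", regex=False).astype(float)
    return out.drop_duplicates().reset_index(drop=True)

# Toy rows, made up purely to exercise the helper
demo = pd.DataFrame({
    "Product": ["Sport Shorts", "Sport Shorts", "Pump Cover T-Shirt"],
    "Price": ["$30", "$30", "$28.50"],
    "General Category": ["bottoms", "bottoms", "tops"],
})
cleaned = clean_products(demo)
print(cleaned)
```

With the real files this would be something like `w_df = clean_products(pd.read_csv("womens_df_clean.csv"))`, assuming the column names match the ones used above.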


Hmmm... very few items. Lol. We honestly don't even need topic modeling for a dataset this small, but I just want to see what happens.
It also means there's no point in using Top2Vec here, since Top2Vec only works well with large datasets (over 1,000 items).

In [18]:
descriptions = (w_df['Product'].to_list()) + (m_df['Product'].to_list())
print("All product descriptions among men's and women's data: ",len(descriptions))
descriptions = set(descriptions)
descriptions = list(descriptions)
print("Set of UNIQUE product descriptions among men's and women's data: ", len(descriptions))

All product descriptions among men's and women's data:  78
Set of UNIQUE product descriptions among men's and women's data:  76
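One thing to watch: `list(set(...))` doesn't guarantee a stable order, and for strings the order can change between Python sessions because string hashing is randomized by default. Since the topic labels get matched back to `descriptions` by position later on, an order-preserving de-duplication might be safer. A minimal sketch (the helper name `unique_in_order` is mine):

```python
def unique_in_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(items))

# Small demo list using a few product names from the data
demo = ["Sport Shorts", "Arrival T-Shirt", "Sport Shorts", "Ruched Sports Bra"]
unique_descriptions = unique_in_order(demo)
print(unique_descriptions)
```

Reloading the saved JSON file instead of rebuilding the list would also keep the order fixed.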


In [19]:
file_path = f"{path}gymshark_descriptions.json"
with open(file_path, 'w') as json_file:
    json.dump(descriptions, json_file, indent=4)

I went into Google Colab to run the next step because my IDE has issues with BERTopic. This is what I ran:

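```
import json
import pickle

from bertopic import BERTopic

# Load the descriptions exported above
# (assumes gymshark_descriptions.json was uploaded to the Colab working directory)
with open('gymshark_descriptions.json') as f:
    descriptions = json.load(f)

topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
topics, probs = topic_model.fit_transform(descriptions)
info = topic_model.get_topic_info()

# Save the outputs so they can be loaded back into the local notebook
with open('topics.pkl', 'wb') as f:
    pickle.dump(topics, f)
with open('probs.pkl', 'wb') as f:
    pickle.dump(probs, f)
with open('info.pkl', 'wb') as f:
    pickle.dump(info, f)
```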
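Because the topic labels only make sense when paired with the exact list of descriptions they were fit on, it could be handy to save the two together in one file rather than relying on list order across environments. Here's a rough sketch using plain JSON; `save_assignments`, `load_assignments`, and the demo file name are all hypothetical, and the three demo rows just stand in for `descriptions` and `topics`.

```python
import json

def save_assignments(documents, topics, file_path):
    """Write one {"document", "topic"} record per description."""
    if len(documents) != len(topics):
        raise ValueError("documents and topics must have the same length")
    # int() so NumPy integers (if any) serialize cleanly
    records = [{"document": d, "topic": int(t)} for d, t in zip(documents, topics)]
    with open(file_path, "w") as f:
        json.dump(records, f, indent=4)
    return records

def load_assignments(file_path):
    """Read the records back as a list of dicts."""
    with open(file_path) as f:
        return json.load(f)

# Stand-ins for `descriptions` and `topics`
demo_docs = ["Sport Shorts", "Pump Cover T-Shirt", "Peek A Boo Sports Bra"]
demo_topics = [0, 1, -1]
save_assignments(demo_docs, demo_topics, "topic_assignments_demo.json")
print(load_assignments("topic_assignments_demo.json"))
```

The loaded records can go straight into `pd.DataFrame(...)` to get a topic/document table.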

In [23]:
# now we load everything up again 
import pickle
with open('topics.pkl', 'rb') as f:
    topics = pickle.load(f)

with open('probs.pkl', 'rb') as f:
    probs = pickle.load(f)

with open('info.pkl', 'rb') as f: 
    info = pickle.load(f)

print(len(set(topics)))  # number of distinct topic labels, including -1 (outliers)

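BERTopic went from 76 product descriptions to 3 topic labels?? And one of those is -1, which is BERTopic's label for outliers, so it's really only 2 clusters. That almost feels like too FEW now! Well, let's see what they are then.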
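Note that BERTopic uses the label -1 for documents it treats as outliers, so the number of distinct labels overstates the number of actual clusters by one whenever outliers exist. A quick way to separate the two counts, sketched with a made-up label list (`summarize_topics` is my own helper name):

```python
from collections import Counter

def summarize_topics(topics):
    """Return (number of real clusters, number of outlier docs, per-label counts)."""
    counts = Counter(topics)
    n_outliers = counts.get(-1, 0)        # documents labeled -1
    n_clusters = len(set(topics) - {-1})  # distinct labels excluding -1
    return n_clusters, n_outliers, counts

# Made-up labels for illustration; in the notebook this would be `topics`
demo_topics = [0, 1, 1, 0, -1, 0]
n_clusters, n_outliers, counts = summarize_topics(demo_topics)
print(n_clusters, n_outliers, dict(counts))
```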

In [25]:
# Print representative documents for three rows of `info`
# (same order as before: the last row first, then rows 0 and 1)
for i, row_idx in enumerate([-1, 0, 1]):
    row = info.iloc[row_idx]
    prefix = "" if i == 0 else "\n "
    print(prefix + "Some items representative of the cluster named ", row['Name'], "are: ")
    for item in row['Representative_Docs']:
        print("        ", item)

Some items representative of the cluster named  1_shirt_oversized_graphic_joggers are: 
         Arrival T-Shirt
         Essential Oversized T-Shirt
         Arrival T-Shirt T-Shirt

 Some items representative of the cluster named  -1_bra_sports_ruched_react are: 
         Ruched Sports Bra
         Everyday Seamless Sports Bra
         Everyday Sports Bra

 Some items representative of the cluster named  0_shorts_seamless_leggings_everyday are: 
         Vital Seamless Shorts
         Vital Seamless 2.0 Shorts
         Everyday Seamless Shorts


In [26]:
info.head()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 10 | -1_bra_sports_ruched_react | [bra, sports, ruched, react, everyday, strappy... | [Ruched Sports Bra, Everyday Seamless Sports B... |
| 1 | 0 | 34 | 0_shorts_seamless_leggings_everyday | [shorts, seamless, leggings, everyday, vital, ... | [Vital Seamless Shorts, Vital Seamless 2.0 Sho... |
| 2 | 1 | 32 | 1_shirt_oversized_graphic_joggers | [shirt, oversized, graphic, joggers, arrival, ... | [Arrival T-Shirt, Essential Oversized T-Shirt,... |


In [27]:
df = pd.DataFrame({"topic": topics, "document": descriptions})
df.head()

| | topic | document |
|---|---|---|
| 0 | 0 | Sport Shorts |
| 1 | 1 | Strong Girl Lifting Club Oversized Graphic Crew |
| 2 | 1 | Pump Cover T-Shirt |
| 3 | 0 | Sport 5" Shorts |
| 4 | -1 | Peek A Boo Sports Bra |


In [30]:
results_list = []

for topic in df['topic'].unique():
    topic_docs = df[df['topic'] == topic]['document']
    womens_prices = []
    mens_prices = []
    for doc in topic_docs:
        # If a product name appears more than once, only its first listed price is used
        if doc in w_df['Product'].values:
            womens_prices.append(w_df[w_df['Product'] == doc]['Price'].values[0])
        if doc in m_df['Product'].values:
            mens_prices.append(m_df[m_df['Product'] == doc]['Price'].values[0])
    # Average price per gender; None when a topic has no items for that gender
    womens_avg = sum(womens_prices) / len(womens_prices) if womens_prices else None
    mens_avg = sum(mens_prices) / len(mens_prices) if mens_prices else None
    results_list.append({
        "topic": topic,
        "womens average price": womens_avg,
        "mens average price": mens_avg
    })

# Convert back to a DataFrame
results_df = pd.DataFrame(results_list)

results_df.head()

| | topic | womens average price | mens average price |
|---|---|---|---|
| 0 | 0 | 37.481818 | 31.883333 |
| 1 | 1 | 37.166667 | 33.11 |
| 2 | -1 | 36.0 | 24.0 |
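The same per-topic comparison could probably be done with a merge and a groupby instead of nested loops. This is only a sketch under a few assumptions: `average_price_by_topic` is a hypothetical helper, the demo frames are made up, and it averages over every listed row rather than one price per product name, so on the real data the numbers could differ slightly from the table above. It also assumes each document appears only once in the topic table.

```python
import pandas as pd

def average_price_by_topic(w_df, m_df, topic_df):
    """Mean price per topic and gender, using every listed row."""
    items = pd.concat(
        [w_df.assign(gender="womens"), m_df.assign(gender="mens")],
        ignore_index=True,
    )
    # Attach each item's topic by matching product name to document text
    merged = items.merge(topic_df, left_on="Product", right_on="document", how="inner")
    # Rows are topics, columns are genders; missing combinations become NaN
    return merged.groupby(["topic", "gender"])["Price"].mean().unstack("gender")

# Toy data, made up for illustration only
w_demo = pd.DataFrame({"Product": ["A Shorts", "B Bra"], "Price": [40.0, 36.0]})
m_demo = pd.DataFrame({"Product": ["A Shorts", "C Tee"], "Price": [30.0, 25.0]})
topic_demo = pd.DataFrame({"topic": [0, -1, 1], "document": ["A Shorts", "B Bra", "C Tee"]})

table = average_price_by_topic(w_demo, m_demo, topic_demo)
print(table)
```

In the notebook the call would be `average_price_by_topic(w_df, m_df, df)`.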


Okay. So BERTopic does show that women's clothing is priced higher than men's clothing, on average, in the two categories that it found (and in the outlier group, -1, as well). This is better than both RFR and Lasso, which found that there was no effect.

Regardless, I think 2 categories is still far too few. Furthermore, the overall average price of women's items is higher anyway, so I feel like this result wasn't that hard to come to.
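With this few items per group, a gap between two averages could easily be noise. One rough way to check would be a permutation test on the difference in means within a cluster. The sketch below is not part of the analysis above: `permutation_p_value` is a hypothetical helper and the price lists are invented, so it only shows the mechanics.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_resamples=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Shuffle the pooled prices and split them into two groups of the original sizes
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero
    return (hits + 1) / (n_resamples + 1)

# Invented prices, for illustration only
group_a = [50, 52, 54, 56, 58, 60]
group_b = [20, 22, 24, 26, 28, 30]
p_separated = permutation_p_value(group_a, group_b)
p_identical = permutation_p_value([30, 30, 30], [30, 30, 30])
print(p_separated, p_identical)
```

A small p-value would suggest the gap is unlikely to come from random assignment of prices to groups; a large one would mean the data can't rule that out.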

Let's experiment with a different method. 