# K-means Modeling

### Contents

- [Import Packages & Data](#Import-Packages-&-Data)
- [Preprocess Text](#Preprocess-Text)
- [Train Test Split](#Train-Test-Split)
- [Limitations](#Limitations)
- [Conclusion & Recommendations](#Conclusion-&-Recommendations)
- [Sources](#Sources) 

### Import Packages & Data

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# reading in data
data = pd.read_csv('./data/processed_data.csv')

In [3]:
# calling .head
data.head()

| Unnamed: 0 | reviewerID | asin | reviewText | int_reviewerID | int_asin | reviewText_processed |
|---|---|---|---|---|---|---|
| 0 | A1D4G1SNUZWQOT | 7106116521 | exactly what i needed | 1705385300413037995 | -2353527134216546931 | ['exactly', 'needed'] |
| 1 | A3DDWDH9PX2YX2 | 7106116521 | i agree with the other review the opening is t... | 5561379548298860475 | -2353527134216546931 | ['agree', 'review', 'opening', 'small', 'almos... |
| 2 | A2MWC41EW7XL15 | 7106116521 | love these i am going to order another pack to... | 3310565570865106017 | -2353527134216546931 | ['love', 'going', 'order', 'another', 'pack', ... |
| 3 | A2UH2QQ275NV45 | 7106116521 | too tiny an opening | -5293977400703050919 | -2353527134216546931 | ['tiny', 'opening'] |
| 4 | A89F3LQADZBS5 | 7106116521 | okay | -5397713386084954674 | -2353527134216546931 | ['okay'] |


In [4]:
# checking for nulls 
data.isnull().sum()

reviewerID                0
asin                      0
reviewText              249
int_reviewerID            0
int_asin                  0
reviewText_processed      0
dtype: int64

In [5]:
# dropping any rows with no reviewText
data.dropna(subset = ['reviewText'], inplace = True)

### Preprocess Text

To create a baseline K-means clustering model, we need to vectorize the `reviewText` column with a TF-IDF vectorizer and feed the resulting matrix into the clustering algorithm.
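As a rough illustration of that idea, the sketch below vectorizes a handful of made-up reviews with `TfidfVectorizer` and passes the sparse result straight to scikit-learn's `KMeans`. The `toy_reviews` list and the choice of `n_clusters=2` are assumptions for demonstration only; they are not taken from the notebook's data or results.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reviews standing in for data['reviewText']
toy_reviews = [
    "exactly what i needed",
    "too tiny an opening",
    "the opening is too small",
    "love these going to order another pack",
    "love it will order again",
    "okay",
]

# fit_transform returns a scipy sparse matrix (one row per review)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(toy_reviews)

# n_clusters=2 is an arbitrary illustrative choice, not a tuned value
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(tfidf_matrix)

print(tfidf_matrix.shape)
print(cluster_labels)
```

`KMeans` accepts the sparse matrix directly, so no conversion to a dense array is needed for this step.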

In [6]:
tfdif = TfidfVectorizer()

text_df = tfdif.fit_transform(data['reviewText'])

In [7]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_text = pd.DataFrame(text_df.toarray(), columns = tfdif.get_feature_names_out())

In [8]:
df_text.shape

(782165, 128626)
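A matrix of this shape is expensive to hold as a dense array. The sketch below works out the arithmetic, assuming `float64` values (the `TfidfVectorizer` default dtype), and compares dense and sparse storage on a small random matrix. The `sparse_nbytes` helper and the toy matrix are hypothetical and only illustrate the gap.

```python
import numpy as np
from scipy import sparse

# Shape reported by df_text.shape above
n_rows, n_cols = 782165, 128626

# Bytes needed to store every cell as a float64
dense_bytes = n_rows * n_cols * np.dtype(np.float64).itemsize
print(f"dense float64: {dense_bytes / 1e9:.0f} GB")


def sparse_nbytes(matrix):
    """Bytes held by the three arrays that back a CSR matrix."""
    return matrix.data.nbytes + matrix.indices.nbytes + matrix.indptr.nbytes


# Hypothetical small random matrix with low density, for comparison only
toy = sparse.random(1000, 5000, density=0.001, format="csr", random_state=0)
toy_dense_bytes = toy.shape[0] * toy.shape[1] * 8

print(f"toy sparse: {sparse_nbytes(toy)} bytes, toy dense: {toy_dense_bytes} bytes")
```

If memory is a constraint, keeping `text_df` in its sparse form rather than calling `.toarray()` is one option worth considering.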

In [None]:
# resetting the index so rows line up with df_text after the dropna above
combined_df = pd.concat([data.reset_index(drop = True), df_text], axis = 1)

### Train Test Split

In [None]:
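from sklearn.model_selection import train_test_split

# dropping the raw text and reviewer ID columns
features = combined_df.drop(columns = ['reviewText', 'reviewerID', 'reviewText_processed'])
X = features
y = combined_df['asin']
# stratify = y requires every asin to appear at least twice
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size =.70, random_state = 42, stratify = y)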
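The notebook stops at the split, so the following is only a sketch of how a split like this could feed a K-means model; it is not the author's model. It uses synthetic data from `make_blobs` in place of `X`, and the names `X_toy`, `X_train_toy`, and `X_test_toy`, along with `n_clusters=3`, are assumptions. Because K-means is unsupervised, no `y` is needed to fit it.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split

# Hypothetical numeric features standing in for X
X_toy, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=42)

# Same 70/30 split proportion as the notebook, without labels
X_train_toy, X_test_toy = train_test_split(X_toy, train_size=0.70, random_state=42)

# Fit cluster centers on the training rows only
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X_train_toy)

# Assign held-out rows to the nearest learned center
test_labels = kmeans.predict(X_test_toy)

# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
score = silhouette_score(X_test_toy, test_labels)
print(f"held-out silhouette: {score:.3f}")
```

Scoring the held-out rows gives a rough check on whether the learned centers generalize beyond the rows they were fit on.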

### Limitations 

There are a few limitations to consider with this model. First, it was trained specifically on the Amazon Fashion dataset, so it will not perform as well on any other Amazon product group. Because the words and topics in reviews differ between product groups, the model would need to be re-trained for each one.


### Conclusion & Recommendations

Using LDA modeling, we were able to find the ideal number of topics for segmenting the data according to the themes represented in the reviews. Not all of these topics were clear indicators of customer types, though: some represented different categories of products, while others represented how customers felt about the products they'd purchased.

### Sources 

Justifying recommendations using distantly-labeled reviews and fine-grained aspects
Jianmo Ni, Jiacheng Li, Julian McAuley
Empirical Methods in Natural Language Processing (EMNLP), 2019

https://colab.research.google.com/drive/1Zv6MARGQcrBbLHyjPVVMZVnRWsRnVMpV#scrollTo=LgWrDtZ94w89

https://stylecaster.com/amazon-fashion-the-drop-by-you-february-2020/ Bella Gerard 

https://www.latimes.com/entertainment-arts/business/story/2020-02-22/amazon-making-the-cut-reality-tv-heidi-klum-prime
Wendy Lee, Feb 22 2020

https://github.com/marcotav/unsupervised-learning/tree/master/topic-modeling

https://books.google.com/books?id=i8-PDwAAQBAJ&pg=PA164&lpg=PA164&dq=using+nlp+data+and+also+customer+ids+for+clustering&source=bl&ots=J8auw-oehF&sig=ACfU3U3RKK_rHPXi0dH6bQ-le4A9BXSUFw&hl=en&ppis=_c&sa=X&ved=2ahUKEwjg7fLK0ojoAhUtj3IEHSorC7gQ6AEwCXoECA8QAQ#v=onepage&q&f=false


https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0

https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0

https://towardsdatascience.com/unsupervised-nlp-topic-models-as-a-supervised-learning-input-cf8ee9e5cf28

Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, pages 63–70, Baltimore, Maryland, USA, June 27, 2014. © 2014 Association for Computational Linguistics
https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf