# DTSA 5510 Unsupervised Algorithms in Machine Learning Final Project

## Project Overview

This project uses unsupervised methods to group textual product descriptions from e-commerce websites. The data come from this Kaggle dataset: https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification . Every description is labeled with one of 4 classes: "Electronics", "Household", "Books" and "Clothing & Accessories", so the dataset is suitable for a supervised multiclass classification task. However, we apply unsupervised clustering methods to see how well we can group the text descriptions into the appropriate product categories, and then compare these results to supervised methods on the same data set.



## Project Organization

The GitHub repository for this project is located here:
    
https://github.com/albert-kepner/ECommerce_Text_Unsupervised

This project is organized as 4 separate notebooks. We tried two unsupervised approaches and two supervised approaches to separate product descriptions using the same dataset. This is an NLP problem because the data are unstructured English text product descriptions.

The four notebooks are in the GitHub project at these direct links:

* <b>Model 1 ECommerce_Text_Unsupervised_NMF</b>

https://github.com/albert-kepner/ECommerce_Text_Unsupervised/blob/master/ECommerce_Text_Unsupervised_NMF.ipynb

* <b>Model 2 ECommerce_SVD_KMeans_Unsupervised</b>

https://github.com/albert-kepner/ECommerce_Text_Unsupervised/blob/master/ECommerce_SVD_KMeans_Unsupervised.ipynb

* <b>Model 3 ECommerce_TFIDF_XGBoost_Supervised</b>

https://github.com/albert-kepner/ECommerce_Text_Unsupervised/blob/master/Ecommerce_TFIDF_XGBoost_Supervised.ipynb

* <b>Model 4 ECommerce_GloVe_LSTM_Supervised</b>

https://github.com/albert-kepner/ECommerce_Text_Unsupervised/blob/master/ECommerce_GLOVE_LSTM_Supervised.ipynb

## Exploratory Data Analysis and Data Cleaning

The EDA and data cleaning are repeated in the first section of each of the 4 notebooks, since it was convenient to have each notebook be self-contained. The data consist of 50425 e-commerce product descriptions, each labeled with one of 4 classes: "Electronics", "Household", "Books" and "Clothing & Accessories". The data cleaning step included removing English stop words, removing all non-alphabetic characters, and changing all text to lower case.
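The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: the function name `clean_text` and the tiny `STOP_WORDS` set are assumptions made for the example, and the notebooks may use a fuller stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real pipeline would use a fuller English stop-word list.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "with", "is", "in", "to"}


def clean_text(text: str) -> str:
    """Lowercase, keep alphabetic characters only, and drop stop words."""
    text = text.lower()
    # Replace every run of non-alphabetic characters with a single space.
    text = re.sub(r"[^a-z]+", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


example = clean_text("The 2-Pack USB-C Cable (1.5m) for Phones & Tablets!")
print(example)
```

Note that a description made up only of digits, punctuation, or stop words becomes an empty string after this kind of cleaning.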

## Model Development

The first 3 models use scikit-learn's TF-IDF (Term Frequency - Inverse Document Frequency) implementation to extract features from the English text of the product descriptions. We used TfidfVectorizer to extract n-grams of length 1 or 2, and kept the 10,000 most frequently occurring n-grams across the corpus of all product descriptions as features. The TF-IDF weight of each n-gram measures how important that term is in a given document, with the goal of differentiating between topics in different documents. A few (about 6) of the documents were removed because they had no useful terms after the data cleaning step. This resulted in a TF-IDF weighting matrix of 50419 documents by 10000 features, stored as a SciPy sparse matrix.

Model 1 used Non-negative Matrix Factorization (NMF) to try to identify the topic, or product category, of each document. We know that there are 4 product categories, so we used NMF to factor the 50419 by 10000 matrix into:

* 50419 by 4 (documents by topic weighting) and

* 4 by 10000 (topics by term weighting per topic)

We can then take the first matrix, which gives 4 topic weightings per document, and pick the strongest topic weighting to decide which topic the document belongs to.
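The factorization and the "strongest topic" rule can be sketched as follows, on a toy corpus rather than the real 50419 by 10000 matrix. The `init`, `max_iter`, and `random_state` values are assumptions chosen for the example and are not taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: two short descriptions per product category (illustrative only).
docs = [
    "wireless bluetooth headphones", "bluetooth speaker wireless charger",
    "cotton bed sheet", "kitchen storage bed cover",
    "mystery novel paperback", "paperback history novel",
    "men cotton shirt", "women shirt slim fit",
]
tfidf = TfidfVectorizer().fit_transform(docs)

n_topics = 4
nmf = NMF(n_components=n_topics, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(tfidf)  # documents x topics
H = nmf.components_           # topics x terms

# Assign each document to the topic with the largest weight in its row of W.
topic_per_doc = np.argmax(W, axis=1)
print(W.shape, H.shape, topic_per_doc)
```

The topic indices produced this way are arbitrary, so they still have to be matched to the named product categories before any accuracy can be computed.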

For Model 2, we wanted to use KMeans as a clustering method. KMeans does not work well with a high-dimensional feature set, so we used scikit-learn TruncatedSVD to perform Singular Value Decomposition as a dimension reduction technique. We then used KMeans to cluster the documents using the reduced set of features from TruncatedSVD. We varied the number of SVD components from 1 to 100 and picked the best result based on how well our clusters matched the labeled product categories.
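A minimal sketch of the SVD plus KMeans pipeline is shown below for a single choice of component count. The toy corpus, `n_components=3`, `n_init=10`, and the `random_state` values are assumptions for illustration; the notebook sweeps the number of components from 1 to 100 on the full TF-IDF matrix.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: two short descriptions per product category (illustrative only).
docs = [
    "wireless bluetooth headphones", "bluetooth speaker wireless charger",
    "cotton bed sheet", "kitchen storage bed cover",
    "mystery novel paperback", "paperback history novel",
    "men cotton shirt", "women shirt slim fit",
]
tfidf = TfidfVectorizer().fit_transform(docs)

# Reduce the sparse TF-IDF matrix to a small dense matrix.
svd = TruncatedSVD(n_components=3, random_state=0)
reduced = svd.fit_transform(tfidf)  # documents x components

# Cluster the reduced features into 4 groups, one per expected category.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(reduced)

print(reduced.shape, cluster_ids)
```

TruncatedSVD is used here because it accepts sparse input directly, which avoids converting the TF-IDF matrix to a dense array.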

For Models 3 and 4 we used supervised classification models for comparison with the unsupervised models on the same data set.

Model 3 uses the XGBoost library to do supervised classification on the 50419 by 10000 TF-IDF sparse matrix. XGBoost is a highly scalable gradient boosting library that can work with large datasets. XGBoost accepts SciPy sparse matrices directly, which is a particular advantage when working with a TF-IDF matrix.

Model 4 uses a different approach to extract features from the English text product descriptions. This time we used GloVe word embeddings (Global Vectors for Word Representation, https://nlp.stanford.edu/projects/glove/). Instead of using TF-IDF to assign a vector of features to each document, this approach assigns a fixed-length vector (in our case 100 floats) to each word of each document (after data cleaning and stop word removal). The GloVe word embedding vectors are pre-trained on a large corpus of text, using the contexts in which words frequently appear to assign similar vectors to words with similar meanings. We used the 100-dimensional vectors from this particular GloVe dataset: <b>Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download): glove.6B.zip</b>
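The GloVe text files store one word per line followed by its vector components, separated by spaces. A minimal sketch of reading that layout into a dictionary is shown below; the helper name `load_glove` is hypothetical, and the in-memory sample uses 3 values per word only to keep the example short, where the vectors used in this project have 100.

```python
import io

import numpy as np


def load_glove(lines):
    """Parse GloVe text lines ('word v1 v2 ... vN') into a dict of vectors."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings


# Toy 3-dimensional sample in the same text layout as a GloVe file.
sample = io.StringIO("shirt 0.1 0.2 0.3\nbook 0.4 0.5 0.6\n")
vectors = load_glove(sample)
print(vectors["shirt"])
```

With a real file, the same function could be called on an open file handle instead of the `io.StringIO` sample.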

The GloVe vectors are used to encode the sequence of words in each document as a sequence of GloVe vectors. We then train a sequence-based supervised classification model to recognize product categories based on similar sequences of word meanings. Specifically, we used an LSTM (Long Short-Term Memory) neural network built with Keras for this model. The LSTM model needs sequences of uniform length, so we either pad or truncate every sequence of word vectors to the same length, in our case 150 word vectors. The features of each document are therefore a matrix of 150 words by 100 floats encoding each word.
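The padding and truncation step can be sketched with NumPy as follows. The 150 by 100 shape comes from the description above; the function name `encode_document`, the choice to skip out-of-vocabulary words, and the choice to pad with zeros at the end are assumptions for this example, and the notebook may handle these details differently (for instance inside a Keras embedding layer).

```python
import numpy as np

EMBED_DIM = 100  # length of each GloVe vector used in this project
MAX_LEN = 150    # fixed number of word vectors per document


def encode_document(tokens, embeddings, max_len=MAX_LEN, dim=EMBED_DIM):
    """Return a (max_len, dim) float32 matrix for one tokenized document."""
    out = np.zeros((max_len, dim), dtype=np.float32)
    # Assumption: unknown words are skipped, long documents are truncated,
    # and short documents are zero-padded at the end.
    vectors = [embeddings[t] for t in tokens if t in embeddings][:max_len]
    if vectors:
        out[: len(vectors)] = np.stack(vectors)
    return out


# Hypothetical tiny embedding table standing in for the real GloVe vectors.
rng = np.random.default_rng(0)
toy_embeddings = {
    word: rng.normal(size=EMBED_DIM).astype(np.float32)
    for word in ["cotton", "shirt", "slim", "fit"]
}

features = encode_document(["men", "cotton", "shirt", "slim", "fit"], toy_embeddings)
print(features.shape)
```

Stacking these per-document matrices gives a 3-dimensional array of documents by 150 by 100, which is the input shape a sequence model such as an LSTM expects.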

## Model Results

For the first 2 models we did a clustering operation by unsupervised means. We then matched the clusters produced to the 4 product categories using the labels in the original data set. We can then compute a classification accuracy score by measuring how well the unsupervised clusters match the corresponding true labels for the product categories.
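One simple way to do this matching, sketched below, is to try every one-to-one assignment of cluster ids to class ids and keep the assignment with the highest accuracy. With 4 classes there are only 24 permutations, so brute force is cheap. The function name `best_cluster_mapping` and the toy labels are assumptions for the example; the notebooks may match clusters to labels differently.

```python
from itertools import permutations

import numpy as np


def best_cluster_mapping(true_labels, cluster_ids, n_classes=4):
    """Try every cluster-to-class assignment and keep the most accurate one."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    best_perm, best_acc = None, -1.0
    for perm in permutations(range(n_classes)):
        # perm[c] is the class id proposed for cluster id c.
        mapped = np.array(perm)[cluster_ids]
        acc = float(np.mean(mapped == true_labels))
        if acc > best_acc:
            best_perm, best_acc = perm, acc
    return best_perm, best_acc


# Toy example: integer-encoded true categories and arbitrary cluster ids.
true_labels = [0, 0, 1, 1, 2, 2, 3, 3]
cluster_ids = [2, 2, 0, 0, 3, 3, 1, 0]
mapping, accuracy = best_cluster_mapping(true_labels, cluster_ids)
print(mapping, accuracy)
```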

Then, for comparison, we ran 2 distinctly different supervised classification algorithms on the same data set in Models 3 and 4.

## Conclusions

## References

Kaggle dataset: https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification

Text Clustering (TFIDF, PCA...) Beginner Tutorial https://www.kaggle.com/code/albeffe/text-clustering-tfidf-pca-beginner-tutorial

XGBoost Documentation https://xgboost.readthedocs.io/en/latest/index.html

Using XGBoost with Scikit-learn https://www.kaggle.com/code/stuarthallows/using-xgboost-with-scikit-learn/notebook

Using Sparse Matrices in XGBoost https://towardsdatascience.com/using-sparse-matrices-in-xgboost-2c2112f362f8

Recurrent Neural Networks (RNN) for Language Modeling in Python / The Embedding Layer https://campus.datacamp.com/courses/recurrent-neural-networks-rnn-for-language-modeling-in-python/rnn-architecture?ex=7

GloVe: Global Vectors for Word Representation https://nlp.stanford.edu/projects/glove/

Word Embeddings https://en.wikipedia.org/wiki/Word_embedding