Below is a mock example of our project, from data cleaning through model evaluation. To avoid confusion, focus mainly on the **feature engineering** steps, which show how the NLP method we're using, [TF-IDF](https://medium.com/analytics-vidhya/tf-idf-term-frequency-technique-easiest-explanation-for-text-classification-in-nlp-with-code-8ca3912e58c3#:~:text=TF%2DIDF%20or%20(%20Term%20Frequency,machine%20read%20words%20in%20numbers.), will be implemented.

pca

**revise and add to paper**

In this project, the **TF-IDF (Term Frequency-Inverse Document Frequency)** method is utilized to represent and analyze the product names in order to classify them into their respective categories. The TF-IDF approach provides a numerical representation of the text data, allowing for effective analysis and prediction.

First, the product names are preprocessed by tokenizing them into individual words using ```nltk.word_tokenize```. Then, the words are lemmatized using ```nltk.WordNetLemmatizer``` to obtain their dictionary (lemma) forms. This preprocessing step makes the text more consistent and reduces the size of the vocabulary.

TF-IDF is calculated based on two main components: Term Frequency (TF) and Inverse Document Frequency (IDF).

The TF component measures the frequency of a term (word) within a specific product name. It is computed as the number of occurrences of the term in the product name divided by the total number of terms in the product name. Mathematically, the TF formula can be represented as:

$$TF(t, d) = \frac{{\text{Number of occurrences of term t} \text{ in document d}}}{{\text{Total number of terms in document d}}}$$
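
As a worked instance of this formula, consider a cleaned product name such as "tie pin for men". It has four terms and the term "tie" occurs once, so:

$$TF(\text{tie}, d) = \frac{1}{4} = 0.25$$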

The IDF component measures the importance of a term across the entire dataset of product names. It considers how many documents (product names) contain the term. IDF is calculated as the logarithm of the total number of documents divided by the number of documents containing the term. The IDF formula can be expressed as:

$$IDF(t, D) = \log \left( \frac{{\text{Total number of documents in the corpus D}}}{{\text{Number of documents containing term t}}} \right)$$
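
As an illustration with made-up numbers (not taken from the dataset), assume a natural logarithm, a corpus of 1,000 product names, and a term that appears in 10 of them:

$$IDF(t, D) = \log \left( \frac{1000}{10} \right) = \log(100) \approx 4.61$$

A term that appears in every product name gets $\log(1) = 0$, so it carries no weight.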

The TF-IDF value is obtained by multiplying the TF and IDF values for each term in a product name: $\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$. It assigns higher weights to terms that are frequent within the product name but rare across the entire dataset. This captures the relative importance of words in distinguishing between different product categories.
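
The formulas above can be sketched in a few lines of plain Python. This is only an illustration of the textbook definitions: the three product names are hypothetical, and the helper names `tf`, `idf`, and `tf_idf` are not part of the project code.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # Occurrences of the term divided by the total number of terms in the document
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus_tokens):
    # Log of (number of documents / number of documents containing the term).
    # Assumes the term occurs in at least one document.
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing)

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Hypothetical mini-corpus of cleaned product names
corpus = [name.split() for name in [
    "tie pin for men",
    "leather wallet for men",
    "running shoes for women",
]]

doc = corpus[0]
scores = {term: tf_idf(term, doc, corpus) for term in doc}
print(scores)
```

Here "for" appears in every name and scores zero, while "tie" and "pin" score highest because they appear only in the first name. Note that `TfidfVectorizer` with default settings will not reproduce these exact numbers: it uses raw term counts for TF, a smoothed IDF of the form $\ln\frac{1 + n}{1 + df(t)} + 1$, and L2-normalizes each row.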

The probabilities of each category can be computed using a Multinomial Naive Bayes classifier, where the likelihood of a product belonging to a specific category is estimated based on the TF-IDF values. The probability of a product belonging to a category can be calculated using **Bayes' theorem**:

$$P(Category | Product) = \frac{{P(Product | Category) \cdot P(Category)}}{{P(Product)}}$$

In this equation:

- $P(Category | Product)$: represents the probability of a product belonging to a specific category given its TF-IDF representation.
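- $P(Product | Category)$: represents the likelihood of a product's TF-IDF values given a specific category. Under the "naive" assumption that terms are conditionally independent given the category, it factorizes into a product of per-term likelihoods, which the Multinomial Naive Bayes classifier estimates from the training data.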
- $P(Category)$: represents the prior probability of a category, which can be calculated based on the distribution of categories in the training data.
- $P(Product)$: is the evidence probability, which serves as a normalization factor and can be calculated by summing the probabilities over all possible categories.
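
To make the role of each term concrete, here is a small numeric sketch of Bayes' theorem. The priors and likelihoods are invented for illustration and the function name `posterior` is hypothetical; in the project these quantities come from the fitted classifier.

```python
def posterior(priors, likelihoods):
    # Numerator of Bayes' theorem for each category: P(Product | Category) * P(Category)
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    # Evidence P(Product): sum of the numerators over all categories
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

# Hypothetical values for one product and two categories
priors = {"accessories": 0.6, "men's shoes": 0.4}
likelihoods = {"accessories": 0.02, "men's shoes": 0.09}

post = posterior(priors, likelihoods)
print(post)
```

Even though "accessories" has the larger prior here, the higher likelihood under "men's shoes" gives that category the larger posterior (about 0.75 versus 0.25), and the posteriors sum to 1 because of the normalization by $P(Product)$.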

By leveraging the TF-IDF methodology and probability-based classification, the project aims to accurately categorize Amazon products based on their names, facilitating improved organization and searchability on the platform.

In [1]:
import pandas as pd

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('words')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# notebook configurations
pd.options.display.max_colwidth = 1000

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /home/bzekeria/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/bzekeria/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/bzekeria/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package words to /home/bzekeria/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [2]:
df = pd.read_csv('data/amazon_products_sampled_raw.csv')

## Data Cleaning

In [3]:
# remove punctuation
df["cleaned_name"] = df["name"].apply(lambda x: re.sub(r"[^a-zA-Z0-9\s]", "", str(x)))
# convert to lowercase
df["cleaned_name"] = df["cleaned_name"].apply(lambda x: x.lower())
# Remove words containing a number (brand numbers, product id, etc.)
df["cleaned_name"] = df["cleaned_name"].apply(lambda x: ' '.join([word for word in x.split() if not re.search(r'\d', word)]))
# Filter out non-English words
english_words = set(nltk.corpus.words.words())
df["cleaned_name"] = df["cleaned_name"].apply(lambda x: ' '.join([word for word in x.split() if word in english_words]))

**Removing non-English words may be a poor choice, since the NLTK ```words``` corpus doesn't contain all English words (for example, the brand name "Leonardi" is dropped below). There are other problems as well, which can be explored later.**

In [4]:
df["name"][0], df["cleaned_name"][0]

('Leonardi Tie Pin for Men', 'tie pin for men')

In [5]:
# Lemmatize each token of the cleaned names (requires the cleaned_name column created above)
lemmatizer = WordNetLemmatizer()
df["cleaned_name"] = df["cleaned_name"].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in word_tokenize(x)]))
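
The cleaning steps in this section can also be sketched as a single function, which makes the effect of each filter easier to test on individual names. This is a standalone illustration: `clean_name` is a hypothetical helper, and the tiny `vocabulary` set stands in for the NLTK word list.

```python
import re

def clean_name(name, vocabulary):
    # Remove punctuation and convert to lowercase
    text = re.sub(r"[^a-zA-Z0-9\s]", "", str(name)).lower()
    # Drop words containing a digit (brand numbers, product ids, etc.)
    words = [w for w in text.split() if not re.search(r"\d", w)]
    # Keep only words found in the given vocabulary
    return " ".join(w for w in words if w in vocabulary)

# Hypothetical stand-in for set(nltk.corpus.words.words())
vocabulary = {"tie", "pin", "for", "men"}

cleaned = clean_name("Leonardi Tie Pin for Men", vocabulary)
print(cleaned)

# A name made of words outside the vocabulary is reduced to an empty string
emptied = clean_name("Men's 2XL T-Shirt", vocabulary)
print(repr(emptied))
```

The second call shows the risk of the vocabulary filter: "mens" and "tshirt" are not in the word list, so the whole name is lost.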

In [6]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df["cleaned_name"], df["main_category"], test_size = 0.2, random_state = 42)

In [7]:
# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
vectorizer

TfidfVectorizer()

In [8]:
# Fit the vectorizer on the training data
train_features = vectorizer.fit_transform(X_train)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
train_words_numerical = pd.DataFrame(train_features.toarray(), columns=vectorizer.get_feature_names_out())
train_words_numerical

Unnamed: 0,aa,aam,abacus,abb,abbey,abdomen,abdominal,aberration,ability,able,...,zircon,zirconia,zirconium,zodiac,zombie,zone,zoo,zoom,zooter,zyme
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
73287,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
73288,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
73289,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Get the list of words (vocabulary)
vocabulary = vectorizer.get_feature_names_out()
vocabulary[0:10]

In [9]:
# Transform the testing data using the fitted vectorizer
test_features = vectorizer.transform(X_test)
test_words_numerical = pd.DataFrame(test_features.toarray(), columns = vectorizer.get_feature_names_out())
test_words_numerical

Unnamed: 0,aa,aam,abacus,abb,abbey,abdomen,abdominal,aberration,ability,able,...,zircon,zirconia,zirconium,zodiac,zombie,zone,zoo,zoom,zooter,zyme
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18318,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18319,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18320,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18321,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Create a dictionary to store the vocabulary for each category
category_vocabulary = {}

# Build the English word set once, outside the loop
english_words = set(nltk.corpus.words.words())

categories = y_train.unique()
for category in categories:

    # Select the training names that belong to this category
    category_data = X_train[y_train == category]

    # Use a separate vectorizer so the one fitted on all training data is not overwritten
    category_vectorizer = TfidfVectorizer()
    category_vectorizer.fit(category_data)

    words = category_vectorizer.get_feature_names_out()
    words = [word for word in words if not re.search(r'\d', word)]
    words = [word for word in words if word in english_words]

    category_vocabulary[category] = words

category_vocabulary["sports & fitness"]

## Model Training

### Multinomial Naive Bayes classifier

In [10]:
classifier = MultinomialNB()
classifier.fit(train_features, y_train)

MultinomialNB()

In [11]:
# Make predictions on the testing data
predictions = classifier.predict(test_features)

In [12]:
# Evaluate the model
report = classification_report(y_test, predictions)
print(report)

                         precision    recall  f1-score   support

            accessories       0.62      0.77      0.69      1062
             appliances       0.84      0.91      0.88      1124
         bags & luggage       0.68      0.77      0.73      1096
        beauty & health       0.79      0.78      0.78      1088
        car & motorbike       0.79      0.85      0.82      1096
grocery & gourmet foods       0.88      0.86      0.87       653
         home & kitchen       0.69      0.75      0.71      1103
    home, kitchen, pets       0.00      0.00      0.00         3
    industrial supplies       0.85      0.66      0.74       815
          kids' fashion       0.71      0.58      0.64      1084
         men's clothing       0.71      0.91      0.80      1092
            men's shoes       0.70      0.80      0.75      1132
                  music       0.96      0.30      0.45       235
           pet supplies       0.99      0.65      0.78       310
       sports & fitness 
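
The columns of this report can be reproduced by hand for a single class. The sketch below uses a handful of invented labels and a hypothetical helper, `per_class_metrics`; it returns 0.0 whenever a denominator is zero.

```python
def per_class_metrics(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support

# Invented labels: the model rarely predicts "music", but is right when it does
y_true_demo = ["music", "music", "music", "music", "appliances", "appliances"]
y_pred_demo = ["music", "appliances", "appliances", "appliances", "appliances", "appliances"]

precision, recall, f1, support = per_class_metrics(y_true_demo, y_pred_demo, "music")
print(precision, recall, f1, support)
```

This mirrors the pattern of the `music` row above: precision is high because the few "music" predictions are correct, while recall is low because most true "music" items are assigned to other categories.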

### Linear Support Vector Classifier (LinearSVC)

from sklearn.svm import LinearSVC

# Initialize and train the classifier
classifier = LinearSVC()
classifier.fit(train_features, y_train)

# Predict labels for the test set
predictions = classifier.predict(test_features)

# Evaluate the performance of the classifier
print(classification_report(y_test, predictions))

### K-Nearest Neighbors (KNN)

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Create a KNN classifier
k = 5  # Number of neighbors
knn = KNeighborsClassifier(n_neighbors=k)

# Train the KNN classifier (the sparse TF-IDF matrix can be passed directly)
knn.fit(train_features, y_train)

# Predict the categories for test data
y_pred = knn.predict(test_features)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
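
To show what the KNN classifier does with TF-IDF vectors, here is a minimal from-scratch sketch. It is an illustration under stated assumptions, not the scikit-learn implementation: the vectors and labels are invented, `cosine` and `knn_predict` are hypothetical helpers, and cosine similarity is swapped in for the Euclidean distance that `KNeighborsClassifier` uses by default. For L2-normalized vectors the two produce the same neighbor ranking.

```python
import math
from collections import Counter

def cosine(u, v):
    # u and v are sparse vectors stored as {term: weight} dictionaries
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_predict(query, train_vectors, train_labels, k=3):
    # Rank training examples by similarity to the query, most similar first
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    # Majority vote among the k nearest neighbors
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented TF-IDF-like vectors and labels
demo_vectors = [
    {"tie": 0.7, "pin": 0.7},
    {"tie": 0.6, "clip": 0.8},
    {"running": 0.7, "shoes": 0.7},
    {"leather": 0.6, "shoes": 0.8},
]
demo_labels = ["accessories", "accessories", "men's shoes", "men's shoes"]

pred_tie = knn_predict({"tie": 1.0}, demo_vectors, demo_labels, k=3)
pred_shoes = knn_predict({"shoes": 1.0}, demo_vectors, demo_labels, k=3)
print(pred_tie, pred_shoes)
```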