# Feature Engineering

**revise and add to paper**

> This project uses **TF-IDF (Term Frequency-Inverse Document Frequency)** to turn product names into numerical feature vectors, which are then used to classify each product into its category.

First, the product names are tokenized into individual words using `nltk.word_tokenize`. Each word is then lemmatized with `nltk.WordNetLemmatizer` to obtain its dictionary form (lemma), so that inflected variants such as "shoes" and "shoe" map to the same vocabulary entry. This keeps the vocabulary consistent and reduces the number of distinct features.
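The preprocessing described above can be sketched as a small function. To keep the example runnable without the NLTK corpora, `simple_tokenize`, `TOY_LEMMAS`, and `toy_lemmatize` are hypothetical stand-ins introduced here for illustration; in the notebook they would be replaced by `word_tokenize` and the `lemmatize` method of a `WordNetLemmatizer` instance.

```python
import re


def preprocess_name(name, tokenize, lemmatize):
    """Lowercase a product name, tokenize it, and lemmatize each token."""
    tokens = tokenize(str(name).lower())
    return " ".join(lemmatize(token) for token in tokens)


# Hypothetical stand-in tokenizer: keeps runs of letters and digits only.
def simple_tokenize(text):
    return re.findall(r"[a-z0-9]+", text)


# Illustrative lookup table, not WordNet.
TOY_LEMMAS = {"shoes": "shoe", "bags": "bag"}


def toy_lemmatize(token):
    return TOY_LEMMAS.get(token, token)


cleaned = preprocess_name("Running Shoes & Bags", simple_tokenize, toy_lemmatize)
print(cleaned)
```

With NLTK available, the same function could be applied to the whole column by creating `lemmatizer = WordNetLemmatizer()` once and calling `df["name"].fillna("").apply(preprocess_name, args=(word_tokenize, lemmatizer.lemmatize))`.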

> TF-IDF is calculated based on two main components: Term Frequency (TF) and Inverse Document Frequency (IDF).

> The TF component measures the frequency of a term (word) within a specific product name. It is computed as the number of occurrences of the term in the product name divided by the total number of terms in the product name. Mathematically, the TF formula can be represented as:

$$TF(t, d) = \frac{{\text{Number of occurrences of term t} \text{ in document d}}}{{\text{Total number of terms in document d}}}$$

> The IDF component measures the importance of a term across the entire dataset of product names. It considers how many documents (product names) contain the term. IDF is calculated as the logarithm of the total number of documents divided by the number of documents containing the term. The IDF formula can be expressed as:

$$IDF(t, D) = \log \left( \frac{{\text{Total number of documents in the corpus D}}}{{\text{Number of documents containing term t}}} \right)$$

> The TF-IDF value is obtained by multiplying the TF and IDF values for each term in a product name: $\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$. It assigns higher weights to terms that are frequent within the product name but rare across the entire dataset. This captures the relative importance of words in distinguishing between different product categories.
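The formulas above can be checked by hand on a tiny corpus. This is a minimal sketch that follows the textbook definitions given here; the three product names are made up for illustration, and `idf` assumes the term occurs in at least one document.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Occurrences of `term` divided by the number of tokens in the document."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus_tokens):
    """Log of (number of documents / number of documents containing `term`)."""
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing)


def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)


# Made-up product names, already tokenized.
corpus = [
    "black running shoes men".split(),
    "black leather wallet men".split(),
    "steel kitchen knife set".split(),
]

shoes_score = tf_idf("shoes", corpus[0], corpus)  # 0.25 * ln(3 / 1)
black_score = tf_idf("black", corpus[0], corpus)  # 0.25 * ln(3 / 2)
print(shoes_score, black_score)
```

Both terms appear once in the first name, so their TF is equal; "shoes" receives the higher weight only because it occurs in fewer documents than "black".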

In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('words')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_curve
from sklearn.multiclass import OneVsRestClassifier

# notebook configurations
pd.options.display.max_colwidth = 1000

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /home/bzekeria/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/bzekeria/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/bzekeria/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package words to /home/bzekeria/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [2]:
df = pd.read_csv("data/amazon_products_sampled_cleaned.csv")
df.sample()

Unnamed: 0,name,main_category
44304,Impakto Ajanta Dancing Flash Running Shoes Men Black,men's shoes


## Text Features

In [3]:
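# Count whitespace-separated words; a missing name counts as 0 rather than 1 (the string "nan")
df["word_count"] = df["name"].fillna("").astype(str).str.split().str.len()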

In [4]:
# Count characters; a missing name counts as 0 rather than 3 (the string "nan")
df["char_count"] = df["name"].fillna("").astype(str).str.len()

In [5]:
# think of other stuff?

## Methods

### TF-IDF

In [6]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df["name"], df["main_category"], test_size = 0.2, random_state = 42)

In [7]:
# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
vectorizer

TfidfVectorizer()
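With its default settings, scikit-learn's `TfidfVectorizer` does not compute exactly the textbook formulas given earlier: it uses raw term counts for TF, a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and then scales every row to unit L2 length. The sketch below checks this on a made-up three-name corpus; the `toy_*` names are introduced here for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up product names for illustration.
toy_names = [
    "black running shoes men",
    "black leather wallet men",
    "steel kitchen knife set",
]

toy_vectorizer = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
toy_dense = toy_vectorizer.fit_transform(toy_names).toarray()

# Recompute the smoothed IDF from document frequencies.
n_docs = len(toy_names)
doc_freq = (toy_dense > 0).sum(axis=0)
smoothed_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_matches = np.allclose(toy_vectorizer.idf_, smoothed_idf)
row_norms = np.linalg.norm(toy_dense, axis=1)
print(idf_matches, row_norms)
```

The default token pattern also lowercases the text and drops single-character tokens, so the fitted vocabulary can differ from one produced by `word_tokenize`.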

In [8]:
X_train_null_counts = X_train.isnull().sum()
X_train_null_counts

4

In [9]:
X_train_null_rows = X_train[X_train.isnull()]
X_train_null_rows

72857    NaN
75908    NaN
24632    NaN
59339    NaN
Name: name, dtype: object

In [10]:
X_train = X_train.fillna('')

In [11]:
X_train.isnull().sum()

0

In [12]:
X_train.dtype

dtype('O')

In [13]:
# Fit the vectorizer on the training data
train_features = vectorizer.fit_transform(X_train)
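# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
train_words_numerical = pd.DataFrame(train_features.toarray(), columns=vectorizer.get_feature_names_out())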
train_words_numerical

Unnamed: 0,aa,aaa,aaaaa,aaaaacd,aaber,aabha,aac,aachi,aada,aadab,...,zyla,zym,zyme,zync,zyozique,zysk,zyzta,zz,zziggler,zzowin
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61742,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
61743,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
61744,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
61745,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
X_test_null_counts = X_test.isnull().sum()
X_test_null_counts

3

In [15]:
X_test_null_rows = X_test[X_test.isnull()]
X_test_null_rows

45767    NaN
3978     NaN
23775    NaN
Name: name, dtype: object

In [16]:
X_test = X_test.fillna('')

In [17]:
X_test.isnull().sum()

0

In [18]:
X_test.dtype

dtype('O')

In [19]:
# Transform the testing data using the fitted vectorizer
test_features = vectorizer.transform(X_test)
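# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
test_words_numerical = pd.DataFrame(test_features.toarray(), columns=vectorizer.get_feature_names_out())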
test_words_numerical

Unnamed: 0,aa,aaa,aaaaa,aaaaacd,aaber,aabha,aac,aachi,aada,aadab,...,zyla,zym,zyme,zync,zyozique,zysk,zyzta,zz,zziggler,zzowin
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15432,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15433,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15434,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15435,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Models
- originally we had naive bayes (intuitive)
- professor suggested KNN (K-Nearest Neighbors)
- we can compare the two and stick with 1 model?
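One way to run the comparison noted above is to fit both models on the same TF-IDF features and score them with a single metric. This is only a sketch under assumptions: `compare_models` is a hypothetical helper, `k=3` with cosine distance and macro-averaged F1 are illustrative choices rather than settings taken from this project, and the toy data is made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier


def compare_models(train_features, y_train, test_features, y_test):
    """Fit each candidate on the same features and return macro-F1 per model."""
    candidates = {
        "MultinomialNB": MultinomialNB(),
        "KNN (k=3, cosine)": KNeighborsClassifier(n_neighbors=3, metric="cosine"),
    }
    scores = {}
    for label, model in candidates.items():
        model.fit(train_features, y_train)
        predictions = model.predict(test_features)
        scores[label] = f1_score(y_test, predictions, average="macro")
    return scores


# Made-up data for illustration.
toy_train = [
    "black running shoes men", "leather formal shoes men", "canvas sneakers shoes men",
    "steel kitchen knife set", "nonstick kitchen frying pan", "glass kitchen storage jar",
]
toy_train_labels = ["men's shoes"] * 3 + ["home & kitchen"] * 3
toy_test = ["white running shoes", "kitchen knife"]
toy_test_labels = ["men's shoes", "home & kitchen"]

toy_vectorizer = TfidfVectorizer()
toy_scores = compare_models(
    toy_vectorizer.fit_transform(toy_train), toy_train_labels,
    toy_vectorizer.transform(toy_test), toy_test_labels,
)
print(toy_scores)
```

In the notebook, the call would be `compare_models(train_features, y_train, test_features, y_test)` once those variables exist.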

**revise and add to paper**

### Multinomial Naive Bayes Classifier

> A Multinomial Naive Bayes classifier estimates the probability of each category for a product from the TF-IDF values of its name. The posterior probability of a category given a product follows from **Bayes' theorem**:

$$P(\text{Category} \mid \text{Product}) = \frac{P(\text{Product} \mid \text{Category}) \cdot P(\text{Category})}{P(\text{Product})}$$

> In this equation:

- $P(\text{Category} \mid \text{Product})$: the posterior probability that a product belongs to a category, given its TF-IDF representation.
- $P(\text{Product} \mid \text{Category})$: the likelihood of the product's TF-IDF values given a category. Multinomial Naive Bayes estimates it by treating the terms as conditionally independent given the category.
- $P(\text{Category})$: the prior probability of a category, estimated from the distribution of categories in the training data.
- $P(\text{Product})$: the evidence, which acts as a normalization factor and is obtained by summing $P(\text{Product} \mid \text{Category}) \cdot P(\text{Category})$ over all categories.
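The terms of Bayes' theorem listed above map onto attributes of a fitted `MultinomialNB`. The sketch below rebuilds the posterior by hand from `class_log_prior_` (the log prior) and `feature_log_prob_` (the per-term log likelihoods) and compares it with `predict_proba`. The toy names and labels are made up for illustration. `MultinomialNB` is formulated for counts, so using TF-IDF weights as fractional counts is a practical shortcut rather than a strict application of the model.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data for illustration.
toy_names = [
    "black running shoes men",
    "leather formal shoes men",
    "steel kitchen knife set",
    "nonstick kitchen frying pan",
]
toy_labels = ["men's shoes", "men's shoes", "home & kitchen", "home & kitchen"]

toy_vectorizer = TfidfVectorizer()
toy_clf = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_names), toy_labels)

query = toy_vectorizer.transform(["white running shoes"])

# log P(Product | Category) + log P(Category), one column per category
joint_log = query.toarray() @ toy_clf.feature_log_prob_.T + toy_clf.class_log_prior_

# Dividing by P(Product) is the same as normalizing over the categories.
posterior = np.exp(joint_log - joint_log.max())
posterior /= posterior.sum()

predicted_category = toy_clf.predict(query)[0]
print(dict(zip(toy_clf.classes_, posterior[0])), predicted_category)
```

The word "white" is not in the fitted vocabulary, so it contributes nothing to the query vector; the prediction rests on "running" and "shoes" alone.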

> By combining TF-IDF features with probability-based classification, the project aims to accurately categorize Amazon products based on their names, facilitating improved organization and searchability on the platform.

In [20]:
classifier = MultinomialNB()
classifier.fit(train_features, y_train)

MultinomialNB()

In [21]:
# Make predictions on the testing data
predictions = classifier.predict(test_features)

In [22]:
# Evaluate the model
report = classification_report(y_test, predictions)
print(report)

                         precision    recall  f1-score   support

            accessories       0.78      0.79      0.78       894
             appliances       0.85      0.93      0.89       913
         bags & luggage       0.69      0.82      0.75       902
        beauty & health       0.81      0.82      0.82       852
        car & motorbike       0.86      0.87      0.87       869
grocery & gourmet foods       0.90      0.91      0.90       610
         home & kitchen       0.75      0.78      0.76       910
    home, kitchen, pets       0.00      0.00      0.00         6
    industrial supplies       0.87      0.75      0.81       807
          kids' fashion       0.78      0.72      0.75       897
         men's clothing       0.78      0.93      0.85       910
            men's shoes       0.79      0.91      0.84       911
                  music       0.97      0.31      0.47       232
           pet supplies       0.98      0.66      0.79       271
       sports & fitness 

#### Visualizations
???