# Feature Engineering

**revise and add to paper**

> We initially considered the \textbf{Bag-of-Words (BoW) model} as a potential feature selection method for our task. The BoW model represents text data by creating a set of unique words, disregarding their order, and counting their occurrences in each document. While the BoW model is a simple and widely used approach, it has certain limitations that led us to explore alternative methods.

The BoW model formula calculates the term frequency (TF) for each word in a document. It can be expressed as follows:

$$\text{BOW}(d) = (t_1, \text{Count}(t_1, d)), (t_2, \text{Count}(t_2, d)), \ldots, (t_n, \text{Count}(t_n, d))$$

One limitation of the BoW model is that it does not consider the importance of words within a document or in the entire corpus. It treats all words equally, which may not be ideal for capturing the semantic meaning of the products. Additionally, the BoW model suffers from the "curse of dimensionality" since the resulting feature space can be very large, making it computationally expensive and potentially leading to overfitting.

To address these limitations, we decided to use the \textbf{Term Frequency-Inverse Document Frequency (TF-IDF)} as our preferred feature extraction method to numerically represent and analyze the product names in order to classify them into their respective categories. 

First, the product names are preprocessed by \emph{tokenizing} them into individual words using \texttt{nltk.word\_tokenize}. Then, the words are lemmatized using \texttt{nltk.WordNetLemmatizer} to obtain their root forms. This preprocessing 
step ensures consistency and reduces the complexity of the data.

TF-IDF is calculated based on two main components: Term Frequency ($TF$) and Inverse Document Frequency ($IDF$).

The TF component measures the frequency of a term (word) within a specific product name. It is computed as the number of occurrences of the term in the product name divided by the total number of terms in the product name. Mathematically, the TF formula can be represented as:

$$\text{TF}(\text{t},~\text{d}) = \frac{{\text{Number~of~occurrences~of~term~}\text{t}~\text{in~document~}\text{d}}}{{\text{Total~number~of~terms~in~document~}\text{d}}}$$

The IDF component measures the importance of a term across the entire dataset of product names. It considers how many documents (product names) contain the term. IDF is calculated as the logarithm of the total number of documents divided by the number of documents containing the term. The IDF formula can be expressed as:

$$\text{IDF}(\text{t},~\text{D}) = \log \left( \frac{{\text{Total~number~of~documents~in~the~corpus~D}}}{{\text{Number~of~documents~containing~term~t}}} \right)$$

The $TF-IDF$ value is obtained by multiplying the $TF$ and $IDF$ values for each term in a product name: $\text{TF-IDF}(\text{t}, ~\text{d}, ~\text{D}) = \text{TF}(\text{t}, ~\text{d}) \times \text{IDF}(\text{t}, ~\text{D})$. It assigns higher weights to terms that are frequent within the product name but rare across the entire dataset. This captures the

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('words')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_curve
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier


# notebook configurations
pd.options.display.max_colwidth = 1000

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /home/nsharma/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/nsharma/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/nsharma/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package words to /home/nsharma/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [4]:
df = pd.read_csv("data/amazon_products_sampled_cleaned.csv")
df.sample()

Unnamed: 0,name,main_category
47875,ahuja portable pa system bluetooth usb echo effect rechargeable battery black,music


## Text Features

In [5]:
df["word_count"] = df["name"].apply(lambda x: len(str(x).split()))

In [6]:
df["char_count"] = df["name"].apply(lambda x: len(str(x)))

In [7]:
# think of other stuff?

## Methods

### TF-IDF

In [8]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df["name"], df["main_category"], test_size = 0.2, random_state = 42)

In [9]:
# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
vectorizer

TfidfVectorizer()

In [10]:
X_train_null_counts = X_train.isnull().sum()
X_train_null_counts

4

In [11]:
X_train_null_rows = X_train[X_train.isnull()]
X_train_null_rows

72857    NaN
75908    NaN
24632    NaN
59339    NaN
Name: name, dtype: object

In [12]:
X_train = X_train.fillna('')

In [13]:
X_train.isnull().sum()

0

In [14]:
X_train.dtype

dtype('O')

In [None]:
# Fit the vectorizer on the training data
train_features = vectorizer.fit_transform(X_train)
train_words_numerical = pd.DataFrame(train_features.toarray(), columns=vectorizer.get_feature_names())
train_words_numerical

In [22]:
X_test_null_counts = X_test.isnull().sum()
X_test_null_counts

0

In [23]:
X_test_null_rows = X_test[X_test.isnull()]
X_test_null_rows

Series([], Name: name, dtype: object)

In [24]:
X_test = X_test.fillna('')

In [25]:
X_test.isnull().sum()

0

In [26]:
X_test.dtype

dtype('O')

In [None]:
# Transform the testing data using the fitted vectorizer
test_features = vectorizer.transform(X_test)
test_words_numerical = pd.DataFrame(test_features.toarray(), columns = vectorizer.get_feature_names())
test_words_numerical

## Discuss

- Both the ```train_features``` and ```test_features``` have a **high-dimensional data**  which can be **computationally expensive** for the ML algorithm (see KNN below)
- Professor discussed with us some options
    - Feature selection?
        - Why?
            - Our TF-IDF bectorizer above is wayyyy too large for our model to handle
            - Allows us to identify the most relevant features (words) and eliminate noise -> reduce overfitting
                - Also eliminates irrelevent features ```aaaaacd``` is garbage lol
    - Mean TF-IDF by category?
        - confused on how to implement it

In [None]:
from sklearn.decomposition import PCA

n_components = 100  # Adjust the number of components as per your needs
pca = PCA(n_components=n_components)
train_features_reduced = pca.fit_transform(train_words_numerical)

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data_rescaled = scaler.fit_transform(train_features_reduced)

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

# Create a SelectKBest object with the chi2 scoring function
k = 600  # Number of top features to select
selector = SelectKBest(score_func=chi2, k=k)

# Fit the selector to the TF-IDF training data
X_train_selected = selector.fit_transform(train_features, y_train)

# Feature names
features = vectorizer.get_feature_names()

selected_feature_indices = selector.get_support(indices=True)

selected_features = [features[i] for i in selected_feature_indices]

# Transform the TF-IDF test data using the selected features
X_test_selected = test_features[:, selected_feature_indices]

In [None]:
X_train_selected

In [None]:
# Convert X_train_selected to a DataFrame
df_train_selected = pd.DataFrame(X_train_selected.toarray(), columns=selected_features)
df_train_selected

In [None]:
X_test_selected

In [None]:
# Convert X_test_selected to a DataFrame
df_test_selected = pd.DataFrame(X_test_selected.toarray(), columns=selected_features)
df_test_selected

## Models
- originally we had naive bayes (intuitive)
- professor suggested KNN (K-Nearest Neighbors)
- we can compare the two and stick with 1 model?

**revise and add to paper**

### Multinomial Naive Bayes Classifier

To find the most probable category $c$ given a document $d$ (product name), we need to calculate $P(c|d)$, the probability of category $c$ given the document $d$. In the **Multinomial Naive Bayes Classifier (MNB)**, this can be done by estimating the conditional probability $P(c|d)$ using **Bayes' theorem**:

$$P(c|d) = \frac{{P(d|c) \cdot P(c)}}{{P(d)}}$$

where $P(d|c)$ is the probability of product $d$ given category $c$, $P(c)$ is the prior probability of category $c$, and $P(d)$ is the probability of product $d$.

To estimate $P(d|c)$, we consider the features present in product $d$ and calculate their probabilities given category $c$:

$$P(d|c) = P(f_1|c) \cdot P(f_2|c) \cdot \ldots \cdot P(f_n|c)$$

where $f_1, f_2, \ldots, f_n$ represent the features in product $d$. In the Multinomial Naive Bayes Classifier, we assume that the features follow a multinomial distribution.

To calculate $P(f_i|c)$, the probability of feature $f_i$ given category $c$, we can use the following formula:

$$P(f_i|c) = \frac{{\text{tf-idf}(f_i, d) + \alpha}}{{\sum_{f \in F}(\text{tf-idf}(f, d) + \alpha \cdot |F|)}}$$

where $\text{tf-idf}(f_i, d)$ represents the TF-IDF value of feature $f_i$ in product $d$. The smoothing parameter $\alpha$ is added to the numerator, ensuring non-zero probabilities for features with zero count. The denominator includes the term $\alpha \cdot |F|$ to normalize the probabilities, where $|F|$ is the cardinality of the set of all features. This normalization guarantees that the probabilities sum up to 1 for each category $c$ and balances the probabilities across all features. By incorporating $\alpha$ and $|F|$, we handle unseen features and achieve well-distributed and normalized probabilities, enhancing the robustness and reliability of the classification results.

By employing the TF-IDF methodology and a probability-based classification approach, the project aims to accurately categorize Amazon product sales based on their names. Through calculating probabilities for each category and comparing them, the most likely category for a given product can be determined. This categorization process will enhance organization and searchability on the platform, facilitating improved browsing and product discovery experiences for users.

In [28]:
classifier = MultinomialNB()
classifier.fit(train_features, y_train)

MultinomialNB()

In [29]:
# Make predictions on the testing data
predictions = classifier.predict(test_features)

In [30]:
# Evaluate the model
report = classification_report(y_test, predictions)
print(report)

                         precision    recall  f1-score   support

            accessories       0.78      0.79      0.78       894
             appliances       0.85      0.93      0.89       913
         bags & luggage       0.69      0.82      0.75       902
        beauty & health       0.81      0.82      0.82       852
        car & motorbike       0.86      0.87      0.87       869
grocery & gourmet foods       0.90      0.91      0.90       610
         home & kitchen       0.75      0.78      0.76       910
    home, kitchen, pets       0.00      0.00      0.00         6
    industrial supplies       0.87      0.75      0.81       807
          kids' fashion       0.78      0.72      0.75       897
         men's clothing       0.78      0.93      0.85       910
            men's shoes       0.79      0.91      0.84       911
                  music       0.97      0.31      0.47       232
           pet supplies       0.98      0.66      0.79       271
       sports & fitness 

In [31]:
### feature selection implementation
classifier = MultinomialNB()
classifier.fit(X_train_selected, y_train)
predictions = classifier.predict(X_test_selected)
report = classification_report(y_test, predictions)
print(report)

                         precision    recall  f1-score   support

            accessories       0.73      0.73      0.73       894
             appliances       0.77      0.82      0.80       913
         bags & luggage       0.65      0.75      0.70       902
        beauty & health       0.42      0.75      0.53       852
        car & motorbike       0.78      0.76      0.77       869
grocery & gourmet foods       0.83      0.71      0.77       610
         home & kitchen       0.64      0.53      0.58       910
    home, kitchen, pets       0.00      0.00      0.00         6
    industrial supplies       0.80      0.52      0.63       807
          kids' fashion       0.70      0.61      0.65       897
         men's clothing       0.71      0.93      0.81       910
            men's shoes       0.73      0.90      0.80       911
                  music       0.98      0.44      0.61       232
           pet supplies       0.97      0.76      0.85       271
       sports & fitness 

**revise and add to paper**

### K-Nearest Neighbors (KNN)

> The probabilities of each category can be 

In [36]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_features_numerical, y_train)

ValueError: could not convert string to float: 'nyx professional makeup jumbo eye pencil black bean'

In [35]:
#Make predictions on the testing data
predictions = knn.predict(test_features)
#predictions = knn.predict(X_test_selected)

NotFittedError: This KNeighborsClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

#Evaluate the model
report = classification_report(y_test, predictions)
print(report)

#### Visualizations
???