<a href="https://colab.research.google.com/github/astrapi69/DroidBallet/blob/master/NLP_D3_4_E3_Predicting_E_Commerce_Product_Recommendation_from_Reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><a target="_blank" href="https://learning.constructor.org/"><img src="https://drive.google.com/uc?id=1wxkbM60NlBlkbGK1JqUypKL24RrTiiYk" width="200" style="background:none; border:none; box-shadow:none;" /></a> </center>

_____

<center>Constructor Learning, 2023</center>

# Predicting E-Commerce Product Recommendations from Reviews


![](https://github.com/dipanjanS/feature_engineering_session_dhs18/blob/master/ecommerce_product_ratings_prediction/clothing_banner.jpg?raw=1)

This is a classic NLP problem dealing with data from an e-commerce store focusing on women's clothing. Each record in the dataset is a customer review consisting of the review title, a text description, and a recommendation flag (0 or 1) for a product, among other features.


__Main Objective:__ Use the review text attributes to build deep learning models that predict the recommendation (a binary classification task).

# Load up basic dependencies

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
import tensorflow as tf
import tensorflow_hub as hub
import nltk
import matplotlib.pyplot as plt

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

In [None]:
!pip install contractions
!pip install textsearch
!pip install tqdm

# Load and View the Dataset

The data is available at https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews from where you can download it.

You can also access it from my [__GitHub Repo__](https://github.com/dipanjanS/text-analytics-with-python/blob/master/media) if needed.

The following code loads it directly from the web.

In [None]:
df = pd.read_csv('https://github.com/dipanjanS/text-analytics-with-python/raw/master/media/Womens%20Clothing%20E-Commerce%20Reviews%20-%20NLP.csv', keep_default_na=False)
df.head()

# Basic Data Processing

- Merge all review text attributes (title, text description) into one attribute
- Subset out columns of interest

In [None]:
df['Review'] = (df['Title'].map(str) +' '+ df['Review Text']).apply(lambda row: row.strip())
df['Recommended'] = df['Recommended IND']
df = df[['Review', 'Recommended']]
df.head()

## Remove all records with no review text

In [None]:
df = df[df['Review'] != '']
df.info()

## There is some imbalance in the data based on product recommendations

In [None]:
df['Recommended'].value_counts()
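
One common way to compensate for this kind of imbalance is to weight each class inversely to its frequency during training. The sketch below is an illustration, not part of the original exercise: `balanced_class_weights` and the toy labels are hypothetical names and data, and the weighting formula `n_samples / (n_classes * class_count)` is an assumed choice.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical toy labels (not the real dataset): 8 recommended, 2 not recommended
toy_labels = [1] * 8 + [0] * 2
weights = balanced_class_weights(toy_labels)
print(weights)
```

In the notebook, the same function could be applied to `y_train`, and the resulting dictionary could be passed to the `class_weight` argument of Keras `model.fit`, so that errors on the rarer class count more heavily in the loss.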

# Build train and test datasets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['Recommended']), df['Recommended'], test_size=0.3, random_state=42)
X_train.shape, X_test.shape

In [None]:
from collections import Counter
Counter(y_train), Counter(y_test)

In [None]:
X_train.head(3)

In [None]:
y_train[:3]

# Text Pre-processing and Wrangling

We do minimal text pre-processing here given we will be building deep learning models.

In [None]:
import nltk
import contractions
import re
import tqdm


def normalize_document(doc):
    doc = doc.translate(doc.maketrans("\n\t\r", "   "))
    doc = doc.lower()
    doc = contractions.fix(doc)
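    # replace every character that is not a letter or whitespace with a space
    # (flags must be passed by keyword; the fourth positional argument of re.sub is `count`)
    doc = re.sub(r'[^a-zA-Z\s]', ' ', doc, flags=re.I|re.A)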
    doc = re.sub(' +', ' ', doc)
    doc = doc.strip()

    return doc

def normalize_corpus(docs):
    norm_docs = []
    for doc in tqdm.tqdm(docs):
        norm_doc = normalize_document(doc)
        norm_docs.append(norm_doc)

    return norm_docs


In [None]:
X_train['Clean Review'] = normalize_corpus(X_train['Review'].values)
X_test['Clean Review'] = normalize_corpus(X_test['Review'].values)
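
Before the cleaned reviews can be fed to a CNN or LSTM, each review has to be turned into a fixed-length sequence of integer word ids. The sketch below shows one way to do this in plain Python. The helper names (`build_vocab`, `encode_and_pad`), the reserved ids and the toy sentences are all assumptions for illustration. A library tokenizer such as `tf.keras.layers.TextVectorization` could be used instead.

```python
from collections import Counter

PAD, OOV = 0, 1  # reserved ids: padding and out-of-vocabulary words


def build_vocab(docs, max_words=None):
    """Map each word to an integer id (most frequent words get the lowest ids)."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    most_common = counts.most_common(max_words)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=2)}


def encode_and_pad(docs, vocab, max_len):
    """Convert documents to id lists, truncated or right-padded to max_len."""
    encoded = []
    for doc in docs:
        ids = [vocab.get(tok, OOV) for tok in doc.split()][:max_len]
        ids += [PAD] * (max_len - len(ids))
        encoded.append(ids)
    return encoded


# Hypothetical toy data standing in for the 'Clean Review' columns
toy_train = ["love this dress", "this top runs small"]
toy_test = ["love this skirt"]

vocab = build_vocab(toy_train)  # fit the vocabulary on training text only
train_ids = encode_and_pad(toy_train, vocab, max_len=5)
test_ids = encode_and_pad(toy_test, vocab, max_len=5)
print(train_ids, test_ids)
```

The vocabulary is built from the training split only, so words that appear only in the test split (such as "skirt" here) map to the out-of-vocabulary id instead of leaking test information into training.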

# Experiment 1: Train Classifier with CNN + FastText Embeddings & Evaluate Performance on Test Data

__Note:__ Skip the FastText embeddings part if downloading or loading them takes too long; the pre-trained embeddings need a good amount of memory to load.

If you want to load pre-trained embeddings, use a slightly smaller file than the one we used in the live-coding session, which had over 2 million words. Here is the link to the embeddings from Facebook's pre-trained FastText model:

https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip

__Hint:__ Use the code from the live-coding session to download and load the relevant embeddings from the above file.
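
As a rough illustration of the loading step (not the live-coding code itself), the sketch below reads the FastText `.vec` text format, where the first line holds the word count and dimension and every following line holds a word and its vector values. It fills an embedding matrix for a given vocabulary. The function name `load_vec_embeddings`, the word-to-index dictionary and the tiny in-memory file are assumptions for illustration.

```python
import io

import numpy as np


def load_vec_embeddings(handle, vocab, dim):
    """Build a (max_index + 1, dim) matrix from a FastText .vec text stream.

    `vocab` maps word -> row index; rows for words that are missing
    from the file stay at zero.
    """
    matrix = np.zeros((max(vocab.values()) + 1, dim), dtype=np.float32)
    handle.readline()  # skip the header line: "<num_words> <dim>"
    for line in handle:
        parts = line.rstrip().split(' ')
        word, values = parts[0], parts[1:]
        if word in vocab and len(values) == dim:
            matrix[vocab[word]] = np.asarray(values, dtype=np.float32)
    return matrix


# Hypothetical 4-dimensional stand-in for the real 300-dimensional file
toy_vec = io.StringIO(
    "3 4\n"
    "love 0.1 0.2 0.3 0.4\n"
    "dress 0.5 0.6 0.7 0.8\n"
    "shoes 1 1 1 1\n"
)
toy_vocab = {"love": 2, "dress": 3, "skirt": 4}  # ids 0 and 1 assumed reserved
embedding_matrix = load_vec_embeddings(toy_vec, toy_vocab, dim=4)
print(embedding_matrix.shape)
```

With the real file, the stream would come from opening the unzipped `.vec` file in text mode with `dim=300`. Only words present in the vocabulary are kept, which keeps memory use far below loading all one million vectors.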

# Experiment 2: Train Classifier with LSTM + FastText Embeddings & Evaluate Performance on Test Data

__Note:__ Skip the FastText embeddings part if downloading or loading them takes too long; the pre-trained embeddings need a good amount of memory to load.
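
For the evaluation step, a model with a single sigmoid output returns probabilities that must be thresholded into 0/1 labels before computing metrics. The sketch below assumes such probabilities are available and computes per-class precision and recall with NumPy. The function name `evaluate_binary`, the 0.5 threshold and the toy values are assumptions for illustration.

```python
import numpy as np


def evaluate_binary(y_true, y_prob, threshold=0.5):
    """Threshold probabilities and return (predictions, per-class precision/recall)."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    report = {}
    for cls in (0, 1):
        tp = int(np.sum((y_pred == cls) & (y_true == cls)))  # correct hits for cls
        predicted = int(np.sum(y_pred == cls))               # everything predicted as cls
        actual = int(np.sum(y_true == cls))                  # everything truly cls
        report[cls] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    return y_pred, report


# Hypothetical toy labels and predicted probabilities
toy_true = [1, 1, 1, 0, 0]
toy_prob = [0.9, 0.8, 0.3, 0.6, 0.1]
toy_pred, toy_report = evaluate_binary(toy_true, toy_prob)
print(toy_pred, toy_report)
```

Because the classes are imbalanced, the metrics for class 0 (not recommended) are usually more informative than overall accuracy. In the notebook, the `classification_report` and `confusion_matrix` functions imported at the top can be applied to `y_test` and the thresholded predictions to get the same per-class view.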

# Experiment 3: Train Classifier with NNLM Universal Embedding Model

__Hint:__ This model should accept the pre-processed text directly (as shown in the live-coding session).


# Experiment 4: Train Classifier with BERT

##### Note: You might need to restart the notebook environment on Colab after installing the library below


In [None]:
!pip install transformers --ignore-installed

##### Note: Run the cell below to recreate all the pre-processed data in case you had to restart the notebook after the installation above

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
import tensorflow as tf
import tensorflow_hub as hub
import nltk
import matplotlib.pyplot as plt

df = pd.read_csv('https://github.com/dipanjanS/text-analytics-with-python/raw/master/media/Womens%20Clothing%20E-Commerce%20Reviews%20-%20NLP.csv', keep_default_na=False)
df['Review'] = (df['Title'].map(str) +' '+ df['Review Text']).apply(lambda row: row.strip())
df['Recommended'] = df['Recommended IND']
df = df[['Review', 'Recommended']]
df = df[df['Review'] != '']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['Recommended']), df['Recommended'], test_size=0.3, random_state=42)

import nltk
import contractions
import re
import tqdm


def normalize_document(doc):
    doc = doc.translate(doc.maketrans("\n\t\r", "   "))
    doc = doc.lower()
    doc = contractions.fix(doc)
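    # replace every character that is not a letter or whitespace with a space
    # (flags must be passed by keyword; the fourth positional argument of re.sub is `count`)
    doc = re.sub(r'[^a-zA-Z\s]', ' ', doc, flags=re.I|re.A)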
    doc = re.sub(' +', ' ', doc)
    doc = doc.strip()

    return doc

def normalize_corpus(docs):
    norm_docs = []
    for doc in tqdm.tqdm(docs):
        norm_doc = normalize_document(doc)
        norm_docs.append(norm_doc)

    return norm_docs

X_train['Clean Review'] = normalize_corpus(X_train['Review'].values)
X_test['Clean Review'] = normalize_corpus(X_test['Review'].values)

train_clean_text = X_train['Clean Review']
test_clean_text = X_test['Clean Review']


#### Train and Evaluate your BERT model using `transformers`

In [None]:
import transformers