# **Part 2 - Price prediction using Natural Language Processing (NLP)**

In Part 1, I used attributes of Les Paul models to predict the retail price. This approach worked in principle, but the model error was too large to be useful. A better approach might be to use the product description to identify key pieces of text that hint at the price. For example, a basic guitar might have a very simple product description, whereas an expensive guitar could have distinctive keywords that point to a higher price tag. With this in mind, let's try NLP to predict guitar prices!

## Extract, transform, load (ETL) pipeline
Data engineering relies on (1) extraction of data from different sources, (2) transformation of the raw data into useful features, and (3) loading the data into a database. For step 1, I made a `Scrapy` spider to extract guitar product descriptions, model names, and prices from the *Thomann* website. I have already run the spider and saved the data as a CSV file in the `guitar_scraper` directory. In step 2, we will convert the price data from string to integer format and apply some NLP techniques to the product descriptions. In step 3, the transformed data will be saved in a local database.
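
As a rough illustration of step 3, here is a minimal sketch of loading transformed records into a local database using Python's built-in `sqlite3` module. The table name `guitars`, its columns, and the two records are made up for the example and are not taken from the scraped data. An in-memory database is used so the sketch leaves nothing on disk.

```python
import sqlite3

# Hypothetical transformed records: (name, price in whole euros, space-joined tokens)
records = [
    ("Guitar A", 199, "basswood body maple neck"),
    ("Guitar B", 26590, "custom shop signature relic"),
]

# ":memory:" keeps the sketch self-contained; a file path would persist the data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE guitars (name TEXT, price INTEGER, tokens TEXT)")
conn.executemany("INSERT INTO guitars VALUES (?, ?, ?)", records)
conn.commit()

# Read the rows back, most expensive first, to confirm the load worked
rows = conn.execute("SELECT name, price FROM guitars ORDER BY price DESC").fetchall()
print(rows)
conn.close()
```

Swapping `":memory:"` for a file name would give a database that survives between notebook sessions.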

In [29]:
import re
import nltk
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download(["punkt", "wordnet", "stopwords"])

raw = pd.read_csv("../guitar_scraper/guitar_info.csv")
unique_urls = raw['url'].unique().shape[0]
unique_desc = raw['description'].unique().shape[0]

print(f"{unique_urls} unique urls")
print(f"{unique_desc} unique descriptions")
print(f"{unique_urls - unique_desc} duplicate descriptions")
print("\nRaw data:")
print(raw.iloc[445, :])

4317 unique urls
3973 unique descriptions
344 duplicate descriptions

Raw data:
description    Custom Shop George Harrison "Rocky" Signature ...
name                         Fender George Harrison "Rocky" MBPW
price                                                    €26,590
url            https://www.thomann.de/ie/fender_george_harris...
Name: 445, dtype: object


Listings for the same guitar model often share the same product description, so some duplicate descriptions are expected in the raw data. These guitars should differ only in colour, so we'll keep them in the dataset for now. To begin formatting our data, let's first convert the price column to integer format.

In [28]:
# copy the raw data to separate dataframe
df = raw.copy()

# transform the price column
df["price"] = df["price"].apply(lambda x: int(x[1:].replace(",", "")))
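
The one-liner above assumes every price is a single currency symbol followed by digits and comma thousands separators, as in `€26,590`. If some rows deviate from that, a slightly more defensive parser might help. The sketch below is only an assumption about what such rows could look like; `parse_price` is a hypothetical helper and the extra input formats are invented for illustration.

```python
import re

def parse_price(raw_price):
    """Return the price in whole euros as an int, or None if no number is found."""
    # Grab the first run of digits and comma thousands separators
    match = re.search(r"\d[\d,]*", raw_price)
    if match is None:
        return None
    # Remove the separators before converting; any decimal part is ignored
    return int(match.group().replace(",", ""))

for example in ["€26,590", "€ 199", "€1,299.50", "on request"]:
    print(repr(example), "->", parse_price(example))
```

With a helper like this, rows that return `None` could be inspected or dropped before modelling rather than raising an exception halfway through the column.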

To process the text data, I'm borrowing a function I used for a different NLP project.

In [37]:
def tokenize(text):
    """
    Process raw text into tokenized data for training (feature extraction)

    Parameters
    ----------
    text : string
        product description in string format

    Returns
    -------
    cleaned_tokens : list of tokenized strings

    """
    # convert to lower case and only keep alphanumeric characters
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # split string into word tokens
    tokens = word_tokenize(text)
    # set up the lemmatizer (reduces inflected forms to a base form) and the stop-word set
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))

    cleaned_tokens = []
    for token in tokens:
        if token not in stop_words:
            clean = lemmatizer.lemmatize(token).lower().strip()
            cleaned_tokens.append(clean)
    return cleaned_tokens

tokens = tokenize(df["description"][445])
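
To get a feel for what the cleaning steps do, here is a simplified stand-in that needs no NLTK downloads. It is not the function above: `str.split()` replaces `word_tokenize`, the stop-word set is a tiny hypothetical one, and there is no lemmatization. The sample sentence is made up.

```python
import re

# Tiny hypothetical stop-word list; the notebook uses NLTK's full English list
STOP_WORDS = {"a", "an", "the", "with", "and", "of"}

def simple_tokenize(text):
    # Lower-case and replace every non-alphanumeric character with a space
    cleaned = re.sub(r"[^a-z0-9]", " ", text.lower())
    # str.split() stands in for word_tokenize here
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

sample = 'Custom Shop "Rocky" Signature, with a 9.5" radius'
print(simple_tokenize(sample))
# ['custom', 'shop', 'rocky', 'signature', '9', '5', 'radius']
```

Note that the regex splits `9.5` into `9` and `5`, so decimal measurements in the descriptions lose their meaning as tokens. That is a trade-off of keeping only alphanumeric characters.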

In [44]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(tokenizer=tokenize)
X = vectorizer.fit_transform(df["description"])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(len(vectorizer.get_feature_names_out()))
print(X.toarray())

5083
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [2 0 0 ... 0 0 0]]
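
Each row of this matrix is one product description and each column is one term in the vocabulary, with the cell holding how often that term appears. The pure-Python sketch below mimics the idea on two made-up, already-tokenized descriptions. It illustrates the bag-of-words layout only and is not how `CountVectorizer` is implemented.

```python
from collections import Counter

# Hypothetical, already-tokenized descriptions (not taken from the scraped data)
docs = [
    ["mahogany", "body", "maple", "top"],
    ["basswood", "body", "maple", "neck", "maple", "fretboard"],
]

# Vocabulary: sorted unique tokens, one column per token
vocab = sorted({token for doc in docs for token in doc})

# Count matrix: one row per document, one count per vocabulary term
matrix = []
for doc in docs:
    counts = Counter(doc)
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Most entries are zero even in this toy case, which is why the real matrix, with 5083 columns, prints as almost all zeros.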
