# Segment 2 Lab 2

## A real-world case study

We will look at the prices of actual products scraped from Amazon.

We have details of the products, along with key features.

We'll first examine the data, then we'll run regression models to predict the prices.

## DO YOU HAVE PEN & PAPER HANDY??

In [None]:
# imports

import os
import random
from dotenv import load_dotenv
from huggingface_hub import login
from datasets import load_dataset, Dataset, DatasetDict
from items import Item
from loaders import ItemLoader
import matplotlib.pyplot as plt
from collections import Counter, defaultdict
import numpy as np
from tqdm import tqdm
import pickle
import json

# Downloading the Pickle files

I've made convenient pkl files with the training and test data for the remaining labs.

Sadly, they are a bit too large to go in git. I've uploaded them to Google Drive and you can fetch them here:  
https://drive.google.com/drive/folders/1Imh1NNSsVDXkUWpkeape0hTYL1QROCvj?usp=sharing

Please download them and place them in the project root directory (i.e. the `tech2ai` directory, the parent of this current directory).

If these files are too large for you, please message me and I will make you a smaller dataset!

In [None]:
# Once the pickle files are in the tech2ai directory (the parent of this one), you can load the datasets

with open('../training_data.pkl', 'rb') as file:
    train = pickle.load(file)

with open('../test_data.pkl', 'rb') as file:
    test = pickle.load(file)

In [None]:
items = train + test
print(f"There are {len(items):,} items, split into {len(train):,} training and {len(test):,} test points")

In [None]:
print(train[10000].text)

## An essential first step to all types of Data Science:

# Investigate the data!

Each item in our dataset has a category and 3 features: weight, rank (its best-seller rank) and timestamp (when it was released).

In [None]:
# Count the items in each category in a single pass
category_counts = Counter(item.category for item in items)
categories = list(category_counts)
counts = [category_counts[category] for category in categories]

In [None]:

# Bar chart by category
plt.figure(figsize=(15, 6))
plt.bar(categories, counts, color="goldenrod")
plt.title('How many in each category')
plt.xlabel('Categories')
plt.ylabel('Count')

plt.xticks(rotation=30, ha='right')

# Add value labels on top of each bar
for i, v in enumerate(counts):
    plt.text(i, v, f"{v:,}", ha='center', va='bottom')

# Display the chart
plt.show()

In [None]:
# Plot the distribution of prices

prices = [item.price for item in items]
plt.figure(figsize=(15, 6))
plt.title(f"Prices: Avg {sum(prices)/len(prices):,.1f} and highest {max(prices):,}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="purple", bins=range(0, 1000, 10))
plt.show()
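
A histogram is easier to read alongside a few summary numbers. Here is a minimal sketch using Python's `statistics` module; `sample_prices` is a hypothetical stand-in for the `prices` list, and the values are made up for illustration.

```python
import statistics

# Hypothetical stand-in prices; in the notebook this would be the `prices` list
sample_prices = [9.99, 14.5, 20.0, 25.0, 32.0, 48.0, 75.0, 120.0, 310.0, 899.0]

mean_price = statistics.mean(sample_prices)
median_price = statistics.median(sample_prices)

# quantiles(n=4) returns the three cut points that split the data into quarters
q1, q2, q3 = statistics.quantiles(sample_prices, n=4)

print(f"Mean ${mean_price:,.2f}, median ${median_price:,.2f}")
print(f"The middle half of prices lies between ${q1:,.2f} and ${q3:,.2f}")
```

When the mean sits well above the median, as in this toy data, a small number of expensive items is pulling the average up.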

In [None]:
# Plot the distribution of weights

weights = [item.weight for item in items]
plt.figure(figsize=(15, 6))
plt.title("Weight (ounces)")
plt.xlabel('Weight (ounces)')
plt.ylabel('Count')
plt.hist(weights, rwidth=0.7, color="skyblue", bins=range(0, 2000, 20))
plt.show()

In [None]:
print(max(weights))

In [None]:
# Look up the heaviest item directly rather than matching a hard-coded weight
heavy = max(items, key=lambda item: item.weight)
heavy.text

In [None]:
# How does the price vary with the weight

weights = [item.weight for item in items]
prices = [item.price for item in items]

# Create the scatter plot
plt.figure(figsize=(15, 8))
plt.scatter(weights, prices, s=0.1, color="red")
plt.xlim(0, 3000)
plt.ylim(0, 1000)

# Add labels and title
plt.xlabel('Weight')
plt.ylabel('Price')
plt.title('Investigate correlations')

# Display the plot
plt.show()

In [None]:
# How does the price vary with how high the product ranks in Amazon best-seller lists?
ranks = [item.rank for item in items]
prices = [item.price for item in items]

# Create the scatter plot
plt.figure(figsize=(15, 8))
plt.scatter(ranks, prices, s=0.1, color="green")
plt.xlim(0, 20000)
plt.ylim(0, 1000)

# Add labels and title
plt.xlabel('Rank')
plt.ylabel('Price')
plt.title('Investigate correlations')

# Display the plot
plt.show()

In [None]:
# How does the price vary with the timestamp - when it was first released

when = [item.timestamp for item in items]
prices = [item.price for item in items]

# Create the scatter plot
plt.figure(figsize=(15, 8))
plt.scatter(when, prices, s=0.1, color="orange")
plt.ylim(0, 1000)
plt.xlim(0, 2e9)

# Add labels and title
plt.xlabel('When')
plt.ylabel('Price')
plt.title('Investigate correlations')

# Display the plot
plt.show()
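
The scatter plots above give a visual impression of each relationship. One way to put a number on it is the Pearson correlation coefficient, which ranges from -1 to 1. This is a minimal sketch: the `pearson` helper and the sample lists are hypothetical, and the values are made up rather than taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, without dividing by n since the factor cancels
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical stand-ins for the `weights` and `prices` lists
sample_weights = [2.0, 8.0, 16.0, 40.0, 64.0, 120.0]
sample_prices = [15.0, 12.0, 40.0, 35.0, 90.0, 110.0]

r = pearson(sample_weights, sample_prices)
print(f"Correlation between weight and price: {r:.2f}")
```

A value near 0 suggests that a feature has little linear relationship with price, so a linear model is unlikely to gain much from it on its own.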

In [None]:
# Imports for machine learning

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestRegressor
from testing import Tester

In [None]:
# Before we start our Linear Regression, let's have some fun
# Let's make a terrible model that simply guesses the answer!!

def guess(item):
    return random.randrange(1,1000)

In [None]:
# Set random seed so that our results can be reproduced

random.seed(42)

In [None]:
# This is a useful function I wrote that takes a function to test, and a dataset

Tester.test(guess, test)

## Write this down!

## The error from the random model: $359

We will be comparing a few models.
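
To compare models we need a single error number for each one. The internals of `Tester` are not shown in this notebook, so the sketch below is only an assumption about the kind of metric involved: the average absolute difference, in dollars, between the predicted and true price. The `average_absolute_error` helper, `constant_guess` and `sample_items` are all hypothetical.

```python
from types import SimpleNamespace

def average_absolute_error(predictor, items):
    """Average of |prediction - true price| over the items (assumed metric)."""
    errors = [abs(predictor(item) - item.price) for item in items]
    return sum(errors) / len(errors)

# Hypothetical stand-ins for Item objects; only the price attribute is needed here
sample_items = [SimpleNamespace(price=p) for p in [20.0, 50.0, 100.0, 400.0]]

def constant_guess(item):
    # A trivial pricer that always predicts $100
    return 100.0

error = average_absolute_error(constant_guess, sample_items)
print(f"Average error: ${error:,.2f}")
```

Any function that takes an item and returns a price can be scored this way, which is what lets us compare very different models on the same footing.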

In [None]:
# Another amusingly basic model, but perhaps a bit better than the last one!

train_prices = [t.price for t in train]
train_average = sum(train_prices)/len(train_prices)

def guess2(item):
    return train_average

In [None]:
Tester.test(guess2, test)

In [None]:
# Now let's do linear regression with our features

def get_features(item):
    return {
        "weight": item.weight,
        "rank": item.rank,
        "timestamp": item.timestamp,
        "is_top_tech": 1 if item.is_top_tech else 0,
        "is_top_toys": 1 if item.is_top_toys else 0,
        "price": item.price
    }

def list_to_dataframe(items):
    # get_features already includes the target column, "price"
    features = [get_features(item) for item in items]
    return pd.DataFrame(features)

train_df = list_to_dataframe(train)
test_df = list_to_dataframe(test[:250])

In [None]:
# Traditional Linear Regression!

np.random.seed(42)

# Separate features and target
feature_columns = ['weight', 'rank', 'timestamp', "is_top_tech", "is_top_toys"]

X_train = train_df[feature_columns]
y_train = train_df['price']
X_test = test_df[feature_columns]
y_test = test_df['price']

# Train a Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)

In [None]:
# What were the model parameters for our features?

for feature, coef in zip(feature_columns, model.coef_):
    print(f"{feature}: {coef:.7f}")
print(f"Intercept: {model.intercept_}")
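
The coefficients and the intercept fully describe a linear regression model: a prediction is the intercept plus each feature value multiplied by its coefficient. The sketch below checks this on hypothetical toy data with two made-up features; `toy_model` is not the notebook's model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data generated from: target = 5 + 2*x1 + 3*x2
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 2.0]])
y = 5.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

toy_model = LinearRegression().fit(X, y)

# Rebuild the predictions by hand from the learned parameters
manual = toy_model.intercept_ + X @ toy_model.coef_

print("Coefficients:", toy_model.coef_)
print("Intercept:", toy_model.intercept_)
print("Manual predictions match:", np.allclose(manual, toy_model.predict(X)))
```

This is also why the size of a coefficient depends on the units of its feature: a feature measured in large numbers, such as a timestamp, tends to get a very small coefficient.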

In [None]:
# Function to predict price for a new item

def linear_regression_pricer(item):
    features = get_features(item)
    # Keep only the columns the model was trained on, in the same order
    features_df = pd.DataFrame([features])[feature_columns]
    return model.predict(features_df)[0]

In [None]:
# test it

Tester.test(linear_regression_pricer, test)

In [None]:
# Here is a short description of each item - perhaps we would do better to train a model on this text?
# This is the start of "natural language processing" or NLP

train[0].text

In [None]:
# For the next few models, we prepare our documents and prices

prices = np.array([float(item.price) for item in train])
documents = [item.text for item in train]

In [None]:
documents[0]

In [None]:
# Use the CountVectorizer
# This changes a paragraph of text into a list of numbers, i.e. a vector
# How does it do that? It just counts the number of times words appear!

np.random.seed(42)
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X = vectorizer.fit_transform(documents)

In [None]:
# Here are the 1,000 most common words that it picked, not including "stop words":

selected_words = vectorizer.get_feature_names_out()
print(f"Number of selected words: {len(selected_words)}")
print("Selected words:", selected_words)

In [None]:
regressor = LinearRegression()
regressor.fit(X, prices)

In [None]:
# Now we create a model to use this for prediction

def bag_of_words(item):
    x = vectorizer.transform([item.text])
    return max(regressor.predict(x)[0], 0)

In [None]:
Tester.test(bag_of_words, test)

In [None]:
# And the powerful Random Forest regression
# We train on a subset of the data to keep the training time manageable

subset = 15_000
rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=4)
rf_model.fit(X[:subset], prices[:subset])

## Random Forest model

The Random Forest is a type of "**ensemble**" algorithm, meaning that it combines many smaller algorithms to make better predictions.

It uses a very simple kind of machine learning algorithm called a **decision tree**. A decision tree makes predictions by examining the values of features in the input, like a flow chart with IF statements. Decision trees are very quick and simple, but they tend to overfit.

In our case, the "features" are the elements of the vector. In other words, each feature is the number of times that a particular word appears in the product description.

So you can think of it something like this:

**Decision Tree**  
\- IF the word "TV" appears more than 3 times THEN  
-- IF the word "LED" appears more than 2 times THEN  
--- IF the word "HD" appears at least once THEN  
---- Price = $500


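With Random Forest, multiple decision trees are created. Each one is trained on a different random sample of the data, and the algorithm can also limit each split to a random subset of the features (in scikit-learn this is controlled by `max_features`, and recent versions of `RandomForestRegressor` consider all features by default). You can see above that we specify 100 trees, which is also the scikit-learn default.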

Then the Random Forest model simply takes the average of the predictions from all its trees to produce the final result.
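
The averaging step can be checked directly. This is a small sketch on hypothetical random data, where `forest`, `X` and `y` are made up and unrelated to `rf_model`. It compares the forest's prediction with the mean of its individual trees, which scikit-learn exposes as `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: 200 "documents" with 6 word-count features each
rng = np.random.default_rng(42)
X = rng.integers(0, 5, size=(200, 6)).astype(float)
y = 100.0 * X[:, 0] + 20.0 * X[:, 1] + rng.normal(0, 5, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=42)
forest.fit(X, y)

# Ask every tree for its own prediction, then average across trees
tree_predictions = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = tree_predictions.mean(axis=0)

print("Matches forest.predict:", np.allclose(manual_average, forest.predict(X[:5])))
```

Averaging many trees smooths out the mistakes of any single tree, which is how the ensemble counters the overfitting that individual decision trees are prone to.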

In [None]:
def random_forest(item):
    x = vectorizer.transform([item.text])
    return max(0, rf_model.predict(x)[0])

In [None]:
Tester.test(random_forest, test)

## Introducing XGBoost

Like Random Forest, XGBoost is also an ensemble model that combines multiple decision trees.

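But unlike Random Forest, XGBoost builds its trees one after another, with each new tree correcting the errors of the trees before it. This technique is called gradient boosting, because each new tree is fit to the gradient of the loss, a form of gradient descent.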

It usually trains much faster than Random Forest, so we can run it on the full dataset, and it often generalizes better.
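
The idea of each tree correcting the errors of the previous ones can be sketched by hand. The loop below is a simplified illustration of boosting with squared error, built from scikit-learn decision trees on hypothetical toy data. It is not how XGBoost is implemented internally, since XGBoost adds regularization and other refinements.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one feature and a curved target
X = np.arange(20, dtype=float).reshape(-1, 1)
y = (X[:, 0] - 10.0) ** 2

learning_rate = 0.4                      # same value the notebook passes to XGBRegressor
prediction = np.full_like(y, y.mean())   # start by predicting the average
errors = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    # For squared error, the residuals are the negative gradient of the loss
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residuals)               # the next tree learns the remaining errors
    prediction = prediction + learning_rate * tree.predict(X)
    errors.append(np.mean((y - prediction) ** 2))

print(f"MSE before boosting: {errors[0]:.1f}, after 20 trees: {errors[-1]:.1f}")
```

The learning rate scales down each tree's contribution, so the ensemble improves in many small steps rather than trying to fix everything with one tree.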

In [None]:
import xgboost as xgb

np.random.seed(42)

xgb_model = xgb.XGBRegressor(n_estimators=100, random_state=42, n_jobs=4, learning_rate=0.4)
xgb_model.fit(X, prices)

In [None]:
def xg_boost(item):
    x = vectorizer.transform([item.text])
    return max(0, xgb_model.predict(x)[0])

In [None]:
Tester.test(xg_boost, test)

# Exercises

Try engineering more features.

Try different models from traditional machine learning, such as Support Vector Machines.