# Predicting Resale Value of Knives from a Texas Government Surplus Store

## Using Machine Learning to Support an eBay Store's Financial Success




### Model and Interpret Notebook


**Author:** Dylan Dey
***

# Overview
[Texas State Surplus Store](https://www.tfc.texas.gov/divisions/supportserv/prog/statesurplus/)

[What happens to all those items that get confiscated by the TSA? Some end up in a Texas store.](https://www.wfaa.com/article/news/local/what-happens-to-all-those-items-that-get-confiscated-by-the-tsa-some-end-up-in-a-texas-store/287-ba80dac3-d91a-4b28-952a-0aaf4f69ff95)

[Texas Surplus Store PDF](https://www.tfc.texas.gov/divisions/supportserv/prog/statesurplus/State%20Surplus%20Brochure-one%20bar_rev%201-10-2022.pdf)

![Texas State Surplus Store](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRYkwyu20VBuQ52PrXdVRaGRIIg9OPXJg86lA&usqp=CAU)



Thousands of people make a living selling pre-owned items on sites like eBay. A good place to source items for resale is the Texas State Surplus Store, where the Texas Facilities Commission collects left-behind possessions, salvage, and surplus from Texas state agencies such as DPS, TXDOT, TCEQ, and Texas Parks & Wildlife. Examples of commonly available items include vehicles, furniture, office equipment and supplies, small electronics, and heavy equipment. The goal of this project is to build a predictive model that estimates the eBay resale value of knives from the Texas State Surplus Store. Descriptive analysis of over 70K knives sold on eBay in the last 2 years will also be used to examine the profitability of investing in knives from the surplus store.


# Business Problem


![Father's Ebay Account Since 1999](attachment:texas_dave.jpg)

[Texas Dave's Knives](https://www.ebay.com/str/texasdave3/Knives/_i.html?store_cat=3393246519)

My family has been running a resale shop and selling on eBay and other sites for years, and lately business has picked up. We are interested in exploring whether pocket knives, the most common item sold at the Texas Surplus Store, would be a safe investment. On the surface they seem great for reselling, as they are often collectible and small enough to ship easily.

I have been experimenting with low cost used knives for resale but have not risked a large capital investment in the higher end items. Analyzing past listings on eBay for the top brands available at the Surplus Store could prove useful for gaining insight on whether a larger investment would pay off. Understanding the risks involved in investing capital into different brands of knives and their potential returns will help narrow down what brands to invest in and help reduce excess inventory.

Finding the right price to list an item for on eBay has been time consuming and inaccurate. Currently, when listing, we try to identify the specific knife with a Google search and then look for the same or similar items sold on eBay or other sites. This “guess and check” method often results in inventory not moving because it is overpriced, or in items selling for less than they could have. A model that predicts the value of a pocket knife on eBay could help determine the right price before a listing goes live.



# Data Understanding

> There are <mark>eight buckets of presorted brand knives</mark> that I was interested in exploring from the Texas Surplus Store. The eight pocketknife brands and their cost per knife at the Texas Surplus Store:

<ul>
  <li>Benchmade: \$45.00</li>
  <li>Buck: \$20.00</li>
  <li>Case/Casexx: \$20.00</li>
  <li>CRKT: \$15.00</li>
  <li>Kershaw: \$15.00</li>
  <li>SOG: \$15.00</li>
  <li>Spyderco: \$30.00</li>
  <li>Victorinox: \$20.00</li>
</ul>

### Domain Understanding: Cost Breakdown
- padded envelopes: \$0.50 per knife
- flat-rate shipping: \$4.45 per knife
- brand knife at surplus store: \$15, \$20, \$30, or \$45 per knife
- overhead expenses (gas, cleaning supplies, sharpening supplies, etc.): \$3.00 per knife
- eBay's commission, with 13\% being a reasonable approximation
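
Taken together, these costs give a simple per-knife profit estimate. The sketch below is a minimal illustration, assuming the 13\% commission approximation and the \$3.00 overhead listed above; the function name `estimate_profit` and its parameter names are hypothetical and not part of the project code.

```python
def estimate_profit(sale_price, knife_cost, overhead=3.00,
                    envelope=0.50, shipping=4.45, commission=0.13):
    """Return (profit, roi_percent) for one knife sold at sale_price (shipping included)."""
    # everything paid out per knife before it sells
    total_cost = knife_cost + overhead + envelope + shipping
    # what is left of the sale after the assumed eBay commission
    net_revenue = sale_price * (1 - commission)
    profit = net_revenue - total_cost
    roi_percent = profit / total_cost * 100
    return profit, roi_percent


# example: a Benchmade bought for $45 that sells for $100 including shipping
profit, roi = estimate_profit(sale_price=100.00, knife_cost=45.00)
print(f"profit: ${profit:.2f}, ROI: {roi:.1f}%")
```

The envelope and shipping costs add up to the \$4.95 constant used in `prepare_tera_df`.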

>A majority of the data was scraped from eBay's proprietary Terapeak web app, since its sold data goes back 2 years, whereas the listing data available through the API only goes back 90 days. It is assumed that a large enough amount of listed data approximates sold data well enough to be useful for this project.

> The target feature for the model to predict is the total price (shipping included) at which a knife should be listed on eBay. One model will use titles and images to find listings that are undervalued and could be worth investing in. Another model will accept only images as input, since an image is easy to obtain in person at the store. This model will use past eBay sold data to predict, within an acceptable amount of error, the price a knife will resell for on eBay (shipping included) using only an image.

In [1]:
from sklearn.model_selection import train_test_split
import os
from collections import Counter

import pandas as pd 
import json
import requests
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import ast
import re

import nltk
from nltk.corpus import stopwords
import string
from nltk import word_tokenize, FreqDist
from sklearn.feature_extraction.text import TfidfVectorizer


from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, GlobalMaxPooling1D
from tensorflow.keras.layers import LSTM, Embedding, Flatten, GRU
from tensorflow.keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling2D
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, BatchNormalization
from tensorflow.keras.layers import SimpleRNN
from tensorflow.keras.models import Model
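#import everything from tensorflow.keras so layers and models come from one Keras build
from tensorflow.keras import models
from tensorflow.keras import layers
import tensorflow as tf
from tensorflow.keras.utils import plot_model
from sklearn.metrics import mean_absolute_error
from tensorflow.keras.preprocessing.image import ImageDataGenerator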

In [2]:
#helps see plots in readme
plt.style.use('dark_background')

### Function Definition

Define functions to import and clean data for modeling.

In [3]:
def apply_iqr_filter(df):
    
    price_Q1 = df['converted_price'].quantile(0.25)
    price_Q3 = df['converted_price'].quantile(0.75)
    price_iqr = price_Q3 - price_Q1

    profit_Q1 = df['profit'].quantile(0.25)
    profit_Q3 = df['profit'].quantile(0.75)
    profit_iqr = profit_Q3 - profit_Q1

    ROI_Q1 = df['ROI'].quantile(0.25)
    ROI_Q3 = df['ROI'].quantile(0.75)
    ROI_iqr = ROI_Q3 - ROI_Q1

    price_upper_limit = price_Q3 + (1.5 * price_iqr)
    price_lower_limit = price_Q1 - (1.5 * price_iqr)

    profit_upper_limit = profit_Q3 + (1.5 * profit_iqr)
    profit_lower_limit = profit_Q1 - (1.5 * profit_iqr)

    ROI_upper_limit = ROI_Q3 + (1.5 * ROI_iqr)
    ROI_lower_limit = ROI_Q1 - (1.5 * ROI_iqr)
    
#     print(f'Brand: {df.brand[0]}')
#     print(f'price upper limit: ${np.round(price_upper_limit,2)}')
#     print(f'price lower limit: ${np.round(price_lower_limit,2)}')
#     print('-----------------------------------')
#     print(f'profit upper limit: ${np.round(profit_upper_limit,2)}')
#     print(f'profit lower limit: ${np.round(profit_lower_limit,2)}')
#     print('-----------------------------------')
#     print(f'ROI upper limit: {np.round(ROI_upper_limit,2)}%')
#     print(f'ROI lower limit: {np.round(ROI_lower_limit,2)}%')
#     print('-----------------------------------')

    
    #keep rows that fall inside the IQR fences for price, profit, and ROI
    new_df = df[(df['converted_price'] < price_upper_limit) &
                (df['converted_price'] > price_lower_limit) &
                (df['profit'] < profit_upper_limit) &
                (df['profit'] > profit_lower_limit) &
                (df['ROI'] < ROI_upper_limit) &
                (df['ROI'] > ROI_lower_limit)]
    
    return new_df
#download jpg urls from dataFrame
def download(row):
    filename = os.path.join(root_folder, str(row.name) + im_extension)

# create folder if it doesn't exist
    os.makedirs(os.path.dirname(filename), exist_ok=True)

    url = row.Image
#     print(f"Downloading {url} to {filename}")
    
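    try:
        r = requests.get(url, allow_redirects=True, timeout=30)
        r.raise_for_status()  # skip error pages instead of saving them as images
        with open(filename, 'wb') as f:
            f.write(r.content)
    except (requests.RequestException, OSError):
        print(f'{filename} error')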



# This function removes noisy data
#lots/sets/groups of knives can
#confuse the model from predicting
#the appropriate value of individual knives
def data_cleaner(df):
    #raw strings keep the regex escapes (\S, \s) intact
    lot = re.compile(r'(?<!-\S)lot(?![^\s.,:?!])')
    group = re.compile(r'(group)')
    is_set = re.compile(r'(?<!-\S)set(?![^\s.,?!])')
    df['title'] = df['title'].str.lower()
    trim_list = [lot,group,is_set]
    #start with no rows flagged so the column exists even if nothing matches
    df['trim'] = 0
    for item in trim_list:
        df.loc[df['title'].apply(lambda x: re.search(item, x)).notnull(), 'trim'] = 1 
    to_drop = df.loc[df['trim'] == 1].index
    df.drop(to_drop, inplace=True)
    df.drop('trim', axis=1, inplace=True)
    
    return df


#take raw data and prepare it for modeling
def prepare_listed(listed_data_df):
    listed_used_knives = listed_data_df.loc[listed_data_df['condition'] != 1000.0]
    listed_used_knives = data_cleaner(listed_used_knives.copy())
    listed_used_knives.reset_index(drop=True, inplace=True)
    
    return listed_used_knives

#take raw data and prepare it for modeling
def prepare_tera_df(df, x, overhead_cost=3):
    #regex=False treats "$" as a literal character rather than an end-of-string anchor
    df['price_in_US'] = df['price_in_US'].str.replace("$", "", regex=False)
    df['price_in_US'] = df['price_in_US'].str.replace(",", "", regex=False)
    df['price_in_US'] = df['price_in_US'].apply(float)
    
    df['shipping_cost'] = df['shipping_cost'].str.replace("$", "", regex=False)
    df['shipping_cost'] = df['shipping_cost'].str.replace(",", "", regex=False)
    df['shipping_cost'] = df['shipping_cost'].apply(float)
    
    df['brand'] = list(bucket_dict.keys())[x]
    df['converted_price'] = (df['price_in_US'] + df['shipping_cost'])
    #4.95 = 0.50 padded envelope + 4.45 flat-rate shipping
    df['cost'] = list(bucket_dict.values())[x] + overhead_cost + 4.95
    #0.87 keeps what is left after the ~13% eBay commission
    df['profit'] = ((df['converted_price']*.87) -  df['cost'])
    df['ROI'] = (df['profit']/ df['cost'])*100.0
    
    return df   


def avg_word_len(x):
    words = x.split()
    word_len = 0
    for word in words:
        word_len += len(word)
        
    #an empty title has no words, so return 0 instead of dividing by zero
    return word_len / len(words) if words else 0.0
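
The IQR filter above repeats the same quartile arithmetic for price, profit, and ROI. As a minimal sketch of the same idea, assuming the usual 1.5 × IQR fences, the bounds can be computed by one helper and applied to any list of columns. The names `iqr_bounds` and `filter_by_iqr` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def filter_by_iqr(df, columns, k=1.5):
    """Keep rows that lie strictly inside the fences for every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        lower, upper = iqr_bounds(df[col], k)
        mask &= (df[col] > lower) & (df[col] < upper)
    return df[mask]


# toy prices with one obvious outlier
toy = pd.DataFrame({'converted_price': [20, 22, 25, 27, 30, 500]})
filtered = filter_by_iqr(toy, ['converted_price'])
print(filtered)
```

Writing each comparison once per column makes it harder to pair a column with another column's limits.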

### Load Data

In [4]:
cd ..

/Users/dylandey/Documents/GitHub/Neural_Network_Predicting_Reseller_Success_Ebay


In [5]:
#load Finding API data
df_bench = pd.read_csv("listed_data/df_bench.csv")
df_buck = pd.read_csv("listed_data/df_buck.csv")
df_case = pd.read_csv("listed_data/df_case.csv")
df_caseXX = pd.read_csv("listed_data/df_CaseXX.csv")
df_crkt = pd.read_csv("listed_data/df_crkt.csv")
df_kersh = pd.read_csv("listed_data/df_kershaw.csv")
df_sog = pd.read_csv("listed_data/df_sog.csv")
df_spyd = pd.read_csv("listed_data/df_spyderco.csv")
df_vict = pd.read_csv("listed_data/df_victorinox.csv")


#Load scraped terapeak sold data
sold_bench = pd.read_csv("terapeak_data/bench_scraped2.csv")
sold_buck1 = pd.read_csv("terapeak_data/buck_scraped2.csv")
sold_buck2 = pd.read_csv("terapeak_data/buck_scraped2_reversed.csv")
sold_case = pd.read_csv("terapeak_data/case_scraped2.csv")
sold_caseXX1 = pd.read_csv("terapeak_data/caseXX_scraped2.csv")
sold_caseXX2 = pd.read_csv("terapeak_data/caseXX2_reversed.csv")
sold_crkt = pd.read_csv("terapeak_data/crkt_scraped.csv")
sold_kershaw1 = pd.read_csv("terapeak_data/kershaw_scraped2.csv")
sold_kershaw2 = pd.read_csv("terapeak_data/kershaw_scraped2_reversed.csv")
sold_sog = pd.read_csv("terapeak_data/SOG_scraped2.csv")
sold_spyd = pd.read_csv("terapeak_data/spyd_scraped2.csv")
sold_vict1 = pd.read_csv("terapeak_data/vict_scraped.csv")
sold_vict2 = pd.read_csv("terapeak_data/vict_reversed.csv")

sold_list = [sold_bench,sold_buck1,
             sold_buck2,sold_case,
             sold_caseXX1,sold_caseXX2,
             sold_crkt,sold_kershaw1,
             sold_kershaw2,sold_sog, 
             sold_spyd, sold_vict1,
             sold_vict2]


listed_df = pd.concat([df_bench,df_buck,
                       df_case,df_caseXX,
                       df_crkt,df_kersh,
                       df_sog,df_spyd,
                       df_vict])

used_listed = prepare_listed(listed_df)

bucket_dict = {'benchmade': 45.0,
               'buck': 20.0,
               'case': 20.0,
               'crkt': 15.0,
               'kershaw': 15.0,
               'sog': 15.0,
               'spyderco': 30.0,
               'victorinox': 20.0
               }

In [6]:
used_listed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12159 entries, 0 to 12158
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   itemId                 12159 non-null  int64  
 1   title                  12159 non-null  object 
 2   galleryURL             12158 non-null  object 
 3   viewItemURL            12159 non-null  object 
 4   autoPay                12159 non-null  bool   
 5   postalCode             11833 non-null  object 
 6   sellingStatus          12159 non-null  object 
 7   shippingInfo           12159 non-null  object 
 8   listingInfo            12159 non-null  object 
 9   returnsAccepted        12159 non-null  bool   
 10  condition              12158 non-null  float64
 11  topRatedListing        12159 non-null  bool   
 12  galleryPlusPictureURL  1011 non-null   object 
 13  pictureURLLarge        11541 non-null  object 
 14  pictureURLSuperSize    11491 non-null  object 
 15  sh

### Prepare Data

In [7]:
for dataframe in sold_list:
    dataframe.rename({'Text': 'title',
                      'shipping_': 'shipping_cost'},
                     axis=1, inplace=True)

    dataframe['date_sold'] = pd.to_datetime(dataframe['date_sold'])

#scraping was capped at 10K rows per query. Combine the dataframes for brands that went over 10K.
sold_buck = pd.concat([sold_buck1,sold_buck2])
sold_caseXX = pd.concat([sold_caseXX1,sold_caseXX2])
sold_kershaw = pd.concat([sold_kershaw1,sold_kershaw2])
sold_vict = pd.concat([sold_vict1,sold_vict2])

#apply function to remove characters from price
#and create profit/ROI features
sold_bench = prepare_tera_df(sold_bench, 0)
sold_buck = prepare_tera_df(sold_buck, 1)
sold_case = prepare_tera_df(sold_case, 2)
sold_caseXX = prepare_tera_df(sold_caseXX, 2)
sold_crkt = prepare_tera_df(sold_crkt, 3)
sold_kershaw = prepare_tera_df(sold_kershaw, 4)
sold_sog = prepare_tera_df(sold_sog, 5)
sold_spyd = prepare_tera_df(sold_spyd, 6)
sold_vict = prepare_tera_df(sold_vict, 7)

In [8]:
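#lowercase and strip titles and remove duplicates
#sold_list still holds the pre-concat frames, so loop over the prepared ones
prepared_list = [sold_bench, sold_buck, sold_case,
                 sold_caseXX, sold_crkt, sold_kershaw,
                 sold_sog, sold_spyd, sold_vict]
for dataframe in prepared_list:
    dataframe['title'] = dataframe['title'].str.lower()
    dataframe['title'] = dataframe['title'].str.strip()
    dataframe.drop_duplicates(
        subset = ['date_sold','price_in_US', 
                  'shipping_cost'],
        keep = 'last', inplace=True)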

In [9]:
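#ignore_index gives every row a unique label, so dropping flagged rows
#by index in data_cleaner cannot remove unrelated rows that share a label
sold_df = pd.concat([sold_bench, sold_buck,
                     sold_case, sold_caseXX, 
                     sold_crkt, sold_kershaw,
                     sold_sog, sold_spyd,
                     sold_vict], ignore_index=True) 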
#remove lots
sold_knives = data_cleaner(sold_df).copy()

#combine data
df = pd.concat([sold_knives,used_listed], ignore_index=True).copy()
#assign the result back instead of filling in place on a column selection
df['Image'] = df['Image'].fillna(df['pictureURLLarge'])

#apply IQR filtering
df = apply_iqr_filter(df).copy()
df.reset_index(drop=True, inplace=True)
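
To see what the "remove lots" step actually drops, the three patterns from `data_cleaner` can be tried on a few made-up titles. This is a small sketch for inspection only: the helper `is_multi_item` and the example titles are hypothetical, while the regular expressions are copied from `data_cleaner`.

```python
import re

# same patterns as data_cleaner, written as raw strings
LOT = re.compile(r'(?<!-\S)lot(?![^\s.,:?!])')
GROUP = re.compile(r'(group)')
IS_SET = re.compile(r'(?<!-\S)set(?![^\s.,?!])')


def is_multi_item(title):
    """Return True if a title would be dropped as a lot, group, or set."""
    title = title.lower()
    return any(pattern.search(title) for pattern in (LOT, GROUP, IS_SET))


examples = [
    "Lot of 5 Buck knives",
    "Case XX trapper set",
    "Benchmade 556 Griptilian",
    "Spyderco pilot knife",
]
flags = {title: is_multi_item(title) for title in examples}
for title, flagged in flags.items():
    print(f"{flagged!s:<5} {title}")
```

The `lot` and `set` patterns only check what follows the word, not what precedes it, so a title containing a word such as "pilot" is flagged as well. That trade-off may be acceptable for a coarse filter, but it is worth keeping in mind when counting how many rows are removed.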

In [10]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
#remove any special characters
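def remove_special_char(x):
    #A-Z (not A-z) so that characters such as [ ] ^ _ are also removed
    pattern = r'[^a-zA-Z0-9\s]'
    text = re.sub(pattern, '', x)
    return text

def remove_punctuations(x):
    #str.translate returns a new string, so return its result
    return x.translate(str.maketrans('', '', string.punctuation))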

def apply_text_prep(df):

    df['title'] = df['title'].apply(remove_punctuations)
    df['title'] = df['title'].apply(remove_special_char)
    #A lot of the strings had duplicate phrases
    #create a set on split strings in order to
    #only get unique words in each title
    df['title'] = df['title'].apply(lambda s: ' '.join(list(set(s.split()))))


    df['title_len'] = df['title'].apply(lambda x: len(x))
    df['word_count'] = df['title'].apply(lambda x: len(x.split()))
    df['avg_word_len'] = df['title'].apply(lambda x: avg_word_len(x))

    stop = stopwords.words('english')

    df['title_nostop'] = df['title'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
    
    return df

In [11]:
df = apply_text_prep(df)
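
`apply_text_prep` removes repeated words by passing the split title through `set`, which does not guarantee that the remaining words keep their original order. Because the sequence models below read tokens in order, an order-preserving variant may be worth trying. The following is a minimal sketch of that alternative, not the preprocessing used for the reported results; the function name `unique_words_in_order` is hypothetical.

```python
def unique_words_in_order(title):
    """Drop repeated words while keeping the first occurrence of each in place."""
    # dict keys are unique and remember insertion order
    return ' '.join(dict.fromkeys(title.split()))


deduped = unique_words_in_order("buck 110 folding knife buck 110")
print(deduped)
```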

# Model

## Neural network with "title" column as input

In [None]:
df['word_count'].plot(kind = 'hist', title = 'Word Count Distribution')

In [None]:
df['avg_word_len'].plot(kind='hist', bins = 50, title = 'Avg_Word_len Distribution')

In [None]:
df['title_len'].plot(kind='hist', bins= 100,title = 'Title Length Distribution');

### Neural network with "title" column as input

In [None]:
df_title = df.loc[:, ['title_nostop', 'converted_price']]


df_title.rename({'title_nostop': 'data',
                 'converted_price': 'labels'},
                axis=1, inplace=True)

In [None]:
df_title.info()

In [None]:
# df_title['labels'] = (df_title['labels']/mean_price)
Y = df_title['labels'].values
df_train, df_test, Ytrain, Ytest = train_test_split(df_title['data'],
                                                    Y, 
                                                    test_size=0.3, 
                                                    random_state=42)

In [None]:
X_val, X_test, Y_val, Y_test = train_test_split(df_test, 
                                                Ytest, 
                                                test_size=0.5, 
                                                random_state=42)
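
The two calls above first hold out 30% of the data and then split that holdout in half, which yields roughly a 70/15/15 train/validation/test split. A quick check on 100 placeholder items, assuming the same `test_size` values (the variable names here are illustrative only):

```python
from sklearn.model_selection import train_test_split

items = list(range(100))  # stand-in for the titles

# step 1: 70% train, 30% holdout
train, holdout = train_test_split(items, test_size=0.3, random_state=42)
# step 2: split the holdout evenly into validation and test
val, test = train_test_split(holdout, test_size=0.5, random_state=42)

print(len(train), len(val), len(test))
```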

### GRU

In [None]:
#Vectorize vocab 
voc_size = 30000
max_len = 11
embedding_features = 100
tokenizer = Tokenizer(num_words=voc_size, oov_token = '<OOV>')
tokenizer.fit_on_texts(df_train)
sequences_train = tokenizer.texts_to_sequences(df_train) 
sequences_val = tokenizer.texts_to_sequences(X_val)
sequences_test = tokenizer.texts_to_sequences(X_test)

In [None]:
#add padding to ensure all inputs are the same size
data_train = pad_sequences(sequences_train, maxlen=max_len, padding= 'post', truncating = 'post')
data_val = pad_sequences(sequences_val, maxlen=max_len, padding= 'post', truncating = 'post')
data_test = pad_sequences(sequences_test, maxlen=max_len, padding= 'post', truncating = 'post')

In [None]:
data_train.shape
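
With `padding='post'` and `truncating='post'`, every title ends up exactly `max_len` tokens long: short titles get zeros appended, and long titles keep only their first `max_len` tokens. The pure-Python sketch below mimics that behavior for a single sequence to make it concrete. It is an illustration, not the Keras implementation, and `pad_post` is a hypothetical name.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Mimic padding='post', truncating='post' for one list of token ids."""
    truncated = sequence[:maxlen]  # drop tokens past maxlen from the end
    return truncated + [pad_value] * (maxlen - len(truncated))


short_title = pad_post([5, 12, 7], maxlen=5)
long_title = pad_post(list(range(1, 14)), maxlen=11)
print(short_title)
print(long_title)
```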

In [None]:
model = models.Sequential()
model.add(Embedding(voc_size, embedding_features, input_length = max_len)) 
model.add(GRU(300, dropout=0.5))
model.add(Dense(1, activation = 'linear'))
model.summary()

In [None]:
# Compile and fit
model.compile(
  loss='MSE',
  optimizer='adam',
  metrics=['mae']
)


print('Training model...')
r = model.fit(
  data_train,
  Ytrain,
  epochs=5,
  validation_data=(data_val, Y_val)
)

In [None]:
s1 = "Spyderco Mantra 3 Liner Lock Knife Black Carbon Fiber & G-10 S30V Steel C233CFP"
s1_p = 136.1
s2 = "Benchmade 556 Green 154cm Combo Blade Pardue Design"
s2_p = 71.95
s3 = "Case XX 6207 SS Mini Trapper Brown Peachseed Bone Pocket Knife Made in Usa"
s3_p = 51.45

In [None]:
def test_single_string(s):
    #mirror apply_text_prep so the input matches what the model was trained on
    s = remove_punctuations(s.lower())
    s = remove_special_char(s)
    s = ' '.join(list(set(s.split())))
    s = ' '.join([word for word in s.split() if word not in stop])
    test = tokenizer.texts_to_sequences([s])
    test2 = pad_sequences(test, maxlen=max_len, padding= 'post', truncating = 'post')
    pred=model.predict(test2)
    return pred

In [None]:
pred1 = test_single_string(s1)[0][0]
pred2 = test_single_string(s2)[0][0]
pred3 = test_single_string(s3)[0][0]

In [None]:
ls

![sample1](images/RNN/randomSpyd.jpeg)
![sample2](images/RNN/randomBench.jpeg)
![sample3](images/RNN/randomCase.jpeg)

In [None]:
print(f'True value: ${s1_p}, Predicted Value: ${pred1:.2f}, difference: ${pred1 - s1_p:.2f}')
print(f'True value: ${s2_p}, Predicted Value: ${pred2:.2f}, difference: ${pred2 - s2_p:.2f}')
print(f'True value: ${s3_p}, Predicted Value: ${pred3:.2f}, difference: ${pred3 - s3_p:.2f}')

In [None]:
preds =model.predict(data_test)

In [None]:
preds = preds.reshape(len(preds))

In [None]:
test_results = model.evaluate(data_test, Y_test)

In [None]:
fig = plt.subplots(figsize=(12,8))
plt.plot(r.history['loss'], label='loss')
plt.plot(r.history['val_loss'], label='val_loss')
plt.title("Loss vs val Loss for RNN model on titles (MSE)", fontsize=15)
plt.xlabel("epochs", fontsize=15)
plt.ylabel("loss (mean squared error)", fontsize=15)
plt.legend();
plt.savefig('images/RNN_GRU_MSE1.png')

In [None]:
fig = plt.subplots(figsize=(12,8))
plt.plot(r.history['mae'], label='mae')
plt.plot(r.history['val_mae'], label='val_mae')
plt.title("Loss vs val Loss for RNN model on titles (MAE)", fontsize=15)
plt.xlabel("epochs", fontsize=15)
plt.ylabel("loss (mean absolute error)", fontsize=15)
plt.legend();
plt.savefig('images/RNN_GRU_MAE1.png')

In [None]:
plot_model(model,show_shapes=True, to_file='images/RNN_GRU1_arc.png')

In [None]:
test_mae = mean_absolute_error(Y_test, preds)

In [None]:
RMSE = np.sqrt(test_results[0])
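
`model.evaluate` returns the loss first, followed by the compiled metrics, so `test_results[0]` is the MSE and its square root is the RMSE. The sketch below, using made-up prices, shows how MAE and RMSE are computed from the same errors and why RMSE is never smaller than MAE: squaring weights large misses more heavily.

```python
import numpy as np

# made-up true and predicted prices in dollars
y_true = np.array([50.0, 60.0, 70.0])
y_pred = np.array([40.0, 60.0, 90.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))   # average size of a miss
mse = np.mean(errors ** 2)      # what the model minimizes as its loss
rmse = np.sqrt(mse)             # back in dollars, penalizes large misses

print(f"MAE: ${mae:.2f}  RMSE: ${rmse:.2f}")
```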

In [None]:
string_score = f'\nMAE on test set: ${test_mae:.2f}'
string_score += f'\nRMSE on test set: ${RMSE:.2f}'
fig, ax = plt.subplots(figsize=(12, 8))
plt.scatter(Y_test, preds)
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c="red")
plt.text(3, 150, string_score)
plt.title('RNN Model for Predicting Resale Value')
plt.ylabel('Model predictions for Resale Value($US)')
plt.xlabel('True Values for Resale Value($US)')
plt.savefig('images/regression_GRU_relu1.png');

In [None]:
df_title['labels'].describe()

In [None]:
df_title = df.loc[:, ['title', 'converted_price']]


df_title.rename({'title': 'data',
                 'converted_price': 'labels'},
                axis=1, inplace=True)

In [None]:
# df_title['labels'] = (df_title['labels']/mean_price)
Y = df_title['labels'].values

In [None]:
df_train, df_test, Ytrain, Ytest = train_test_split(df_title['data'],
                                                    Y, 
                                                    test_size=0.3, 
                                                    random_state=42)

In [None]:
X_val, X_test, Y_val, Y_test = train_test_split(df_test, 
                                                Ytest, 
                                                test_size=0.5, 
                                                random_state=42)

### LSTM

In [None]:
# Convert sentences to sequences
MAX_VOCAB_SIZE = 30000
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE) 
tokenizer.fit_on_texts(df_train)
sequences_train = tokenizer.texts_to_sequences(df_train) 
sequences_val = tokenizer.texts_to_sequences(X_val) 
sequences_test = tokenizer.texts_to_sequences(X_test)

In [None]:
# get word -> integer mapping
word2idx = tokenizer.word_index
V = len(word2idx)
print('Found %s unique tokens.' % V)

In [None]:
# pad sequences so that we get a N x T matrix
data_train = pad_sequences(sequences_train)
print('Shape of data train tensor:', data_train.shape)

# get sequence length
T = data_train.shape[1]

In [None]:
data_val = pad_sequences(sequences_val, maxlen=T)
print('Shape of data val tensor:', data_val.shape)

In [None]:
data_test = pad_sequences(sequences_test, maxlen=T)
print('Shape of data test tensor:', data_test.shape)

In [None]:
# Create the RNN model
# We get to choose embedding dimensionality
D = 12
# Hidden state dimensionality
M = 100
i = Input(shape=(T,))
x = Embedding(V + 1, D)(i)
x = LSTM(M, return_sequences=True)(x) 
x = GlobalMaxPooling1D()(x)
x = Dense(62, activation='relu')(x)
x = Dense(32, activation='relu')(x) 
x = Dropout(0.3)(x)
x = Dense(1)(x)
model = Model(i, x)

In [None]:
# Compile and fit
model.compile(
  loss='MSE',
  optimizer='adam',
  metrics=['mae']
)


print('Training model...')
r = model.fit(
  data_train,
  Ytrain,
  epochs=5,
  validation_data=(data_val, Y_val)
)

In [None]:
model.summary()

In [None]:
plot_model(model,show_shapes=True, to_file='images/RNN_LSTM_arc.png')

In [None]:
pred=model.predict(data_test)

In [None]:
pred.shape

In [None]:
#flatten to 1-D without hard-coding the number of test rows
preds = pred.reshape(-1)

In [None]:
test_results = model.evaluate(data_test, Y_test)

In [None]:
RMSE = np.sqrt(test_results[0])

In [None]:
fig = plt.subplots(figsize=(12,8))
plt.plot(r.history['loss'], label='loss')
plt.plot(r.history['val_loss'], label='val_loss')
plt.title("Loss vs val Loss for RNN model on titles (MSE)", fontsize=15)
plt.xlabel("epochs", fontsize=15)
plt.ylabel("loss (mean squared error)", fontsize=15)
plt.legend()
plt.savefig('images/MSE_LSTM_relu.png');

In [None]:
fig = plt.subplots(figsize=(12,8))
plt.plot(r.history['mae'], label='mae')
plt.plot(r.history['val_mae'], label='val_mae')
plt.title("Loss vs val Loss for RNN model on titles (MAE)", fontsize=15)
plt.xlabel("epochs", fontsize=15)
plt.ylabel("loss (mean absolute error)", fontsize=15)
plt.legend()
plt.savefig('images/MAE_LSTM_relu.png');

In [None]:
test_mae = mean_absolute_error(Y_test, preds)

In [None]:
string_score = f'\nMAE on test set: ${test_mae:.2f}'
string_score += f'\nRMSE on test set: ${RMSE:.2f}'
fig, ax = plt.subplots(figsize=(12, 8))
plt.scatter(Y_test, preds)
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c="red")
plt.text(5, 135, string_score)
plt.title('Regression Model for Predicting Resale Value')
plt.ylabel('Model predictions for Resale Value($US)')
plt.xlabel('True Values for Resale Value($US)')
plt.savefig("images/regression_LSTM_relu.png")

### CNN Titles

In [None]:
# Create the CNN model

# We get to choose embedding dimensionality
D = 256



i = Input(shape=(T,))
x = Embedding(V + 1, D)(i)
x = Conv1D(32, 3, activation='relu')(x)
x = MaxPooling1D(3)(x)
x = Conv1D(64, 3, activation='relu')(x)
x = MaxPooling1D(3)(x)
x = Conv1D(128, 3, activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = Dense(1, activation='relu')(x)

model = Model(i, x)

In [None]:
# Compile and fit
model.compile(
  loss='MSE',
  optimizer='adam',
  metrics=['mae']
)


print('Training model...')
r = model.fit(
  data_train,
  Ytrain,
  epochs=5,
  validation_data=(data_val, Y_val)
)


In [None]:
# Plot loss per iteration
import matplotlib.pyplot as plt
plt.plot(r.history['loss'], label='loss')
plt.plot(r.history['val_loss'], label='val_loss')
plt.legend();

In [None]:
# Plot MAE per epoch (this is a regression model, so there is no accuracy)
plt.plot(r.history['mae'], label='mae')
plt.plot(r.history['val_mae'], label='val_mae')
plt.legend();

### CNN using images as input

In [None]:
df_imgs = df.drop(['title', 'url', 
                   'date_sold', 'profit',
                   'ROI', 'brand', 'cost',
                   'pictureURLLarge'],
                     axis=1).copy()

In [None]:
df_imgs.dropna(subset=['Image'], inplace=True)

In [None]:
df_imgs.reset_index(drop=True, inplace=True)

In [None]:
df_imgs['file_index'] = df_imgs.index.values
df_imgs['file_index'] = df_imgs['file_index'].astype(str)

In [None]:
df_imgs['filename'] = df_imgs['file_index'] + '.jpg'

In [None]:
def download(row):
    filename = row.filepath

# create folder if it doesn't exist
#     os.makedirs(os.path.dirname(filename), exist_ok=True)

    url = row.Image
#     print(f"Downloading {url} to {filename}")
    
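    try:
        r = requests.get(url, allow_redirects=True, timeout=30)
        r.raise_for_status()  # skip error pages instead of saving them as images
        with open(filename, 'wb') as f:
            f.write(r.content)
    except (requests.RequestException, OSError):
        print(f'{filename} error')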

In [None]:
root_folder = 'C:/Users/12108/Documents/GitHub/Neural_Network_Predicting_Reseller_Success_Ebay/nn_images/'
df_imgs['filepath'] = root_folder + df_imgs['filename']

In [None]:
df_imgs['filepath'].sample(2).apply(print)

In [None]:
# df_imgs.apply(download, axis=1)

#### All image files are stored locally for this project. The below markdown code is for reference.

```
img_list = os.listdir('C:/Users/12108/Documents/GitHub/Neural_Network_Predicting_Reseller_Success_Ebay/nn_images/')

img_df = df_imgs.loc[df_imgs['filename'].isin(img_list)].copy()

img_df.reset_index(drop=True, inplace=True)
```

```
img_df.rename({'Image': 'data',
               'converted_price': 'labels'},
                axis=1, inplace=True)
```

```
df_train, df_test, Ytrain, Ytest = train_test_split(img_df, img_df['labels'].values, test_size=0.20)
datagen=ImageDataGenerator(rescale=1./255.,validation_split=0.20)

train_generator=datagen.flow_from_dataframe(
dataframe=df_train,
directory= None,
x_col="filepath",
y_col="labels",
subset="training",
batch_size=100,
seed=55,
shuffle=True,
class_mode="raw")
    
valid_generator=datagen.flow_from_dataframe(
dataframe=df_train,
directory=None,
x_col="filepath",
y_col="labels",
subset="validation",
batch_size=100,
seed=55,
shuffle=True,
class_mode="raw")

test_datagen=ImageDataGenerator(rescale=1./255.)
test_generator=test_datagen.flow_from_dataframe(
dataframe=df_test,
directory=None,
x_col="filepath",
y_col="labels",
batch_size=100,
seed=55,
shuffle=False,
class_mode="raw")
```

In [None]:
# model = models.Sequential()

# model.add(layers.Conv2D(16, (3, 3), padding='same', activation='relu',
#                         input_shape=(256 ,256,  3)))
# model.add(layers.BatchNormalization())
# model.add(layers.Conv2D(16, (3, 3), activation='relu', padding='same'))
# model.add(layers.BatchNormalization())
# model.add(layers.MaxPooling2D((2, 2)))

# model.add(layers.Conv2D(32, (3, 3), padding='same', activation='relu',
#                         input_shape=(256 ,256,  3)))
# model.add(layers.BatchNormalization())
# model.add(layers.Conv2D(32, (3, 3), activation='relu', padding='same'))
# model.add(layers.BatchNormalization())
# model.add(layers.MaxPooling2D((2, 2)))

# model.add(layers.Conv2D(64, (3, 3), activation='relu', padding='same'))
# model.add(layers.BatchNormalization())
# model.add(layers.Conv2D(64, (3, 3), activation='relu', padding='same'))
# model.add(layers.BatchNormalization())
# model.add(layers.MaxPooling2D((2, 2)))

# model.add(layers.Conv2D(128, (3, 3), activation='relu', padding='same'))
# model.add(layers.BatchNormalization())
# model.add(layers.Conv2D(128, (3, 3), activation='relu', padding='same'))
# model.add(layers.BatchNormalization())
# model.add(layers.MaxPooling2D((2, 2)))

# model.add(layers.Flatten())

# model.add(Dense(512, activation='relu'))
# model.add(Dropout(0.1))
# model.add(Dense(256, activation='relu'))
# model.add(Dropout(0.1))
# model.add(Dense(128, activation='relu'))
# model.add(Dense(1, activation='linear'))

# model.compile(loss='MSE',
#               optimizer='Adam',
#                metrics=['mae', 'mse'])

# summary = model.fit(train_generator, epochs=3, validation_data=valid_generator)

In [None]:
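model = tf.keras.models.load_model('cnn_grayscale_relu1.h5',  compile=False)
#a model loaded with compile=False must be compiled before evaluate();
#these settings are assumed to mirror the commented-out training cell above
model.compile(loss='MSE', optimizer='Adam', metrics=['mae', 'mse'])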

In [None]:
plot_model(model, show_shapes=True, to_file="images/CNN_architecture.png")

In [None]:
model.summary()

In [None]:
model.evaluate(valid_generator)

In [None]:
test_generator.reset()
pred=model.predict(test_generator,verbose=1)

In [None]:
test_results = model.evaluate(test_generator)

In [None]:
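#summary is the History object returned by model.fit in the commented-out
#training cell above; it is not available when only the saved model is loaded
fig = plt.figure(figsize=(12,8))
plt.plot(summary.history['loss'])
plt.plot(summary.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss (mean absolute error)')
plt.xlabel('epoch')
plt.legend(['train_loss', 'val_loss'], loc='upper right')
plt.show();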

# Results

### Recurrent Neural Network (Long Short Term Memory)

![RNN LSTM Arc](images/RNN_LSTM_arc.png)
![RNN LSTM MAE](images/MAE_LSTM_relu.png)
![regression_plot](images/regression_LSTM_relu.png)

- The mean price of the 8 brands of knives sold on eBay is around \$50.00. 
- Against that mean, the LSTM's mean absolute error of about \$13.80 is acceptable. 

### Convolutional Neural Network on Grayscale Images

![CNN_Architecture](images/CNN_architecture.png)
![CNN Regression Plot](images/Regression_CNN_relu1.png)
![CNN_MAE](images/CNN_MAE_relu1.png)

- The MAE when testing the CNN was roughly \$25.00, an error of about 50\% of the mean price of knives sold. That is not yet acceptable compared to the RNN on titles and will be addressed in future work.
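
To compare the two models on the same scale, each MAE can be expressed as a percentage of the mean sale price. The short calculation below uses the rounded figures quoted in this section (a mean price of about \$50.00 and MAEs of about \$13.80 and \$25.00); the variable names are illustrative only.

```python
mean_price = 50.00  # approximate mean sale price across the 8 brands
mae_by_model = {
    'LSTM on titles': 13.80,
    'CNN on images': 25.00,
}

# MAE as a percentage of the mean price
relative_error = {name: mae / mean_price * 100
                  for name, mae in mae_by_model.items()}

for name, pct in relative_error.items():
    print(f"{name}: {pct:.1f}% of mean price")
```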

## Future Work
- Expand data to include other products readily purchasable at the Surplus Store. 

- Attempt data augmentation on the CNN image network.

- Attempt to obtain more aspect data for sold knives. Access to some important aspect data is limited to sellers who average a certain amount of money per month.
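
As a rough illustration of the data augmentation item above, the sketch below applies a random horizontal flip and a small brightness change to a grayscale image array. It is a minimal NumPy example, not the pipeline used in this notebook: the function name `augment_grayscale`, the 64x64 image size, and the jitter range are all assumptions chosen for illustration. Both transforms leave the label (sale price) unchanged, which is why they are candidates for a regression task like this one.

```python
import numpy as np


def augment_grayscale(image, rng):
    """Return a randomly flipped, brightness-jittered copy of an (H, W, 1) image in [0, 1]."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]  # horizontal flip (mirror left to right)
    scale = rng.uniform(0.9, 1.1)  # hypothetical +/-10% brightness jitter
    return np.clip(out * scale, 0.0, 1.0)


rng = np.random.default_rng(0)
image = rng.random((64, 64, 1))  # stand-in for one rescaled grayscale knife photo
augmented = augment_grayscale(image, rng)
print(augmented.shape, float(augmented.min()), float(augmented.max()))
```

In practice the same idea would be applied on the fly to each training batch, so the network sees a slightly different version of every photo in each epoch.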

# Appendix

### Random Forest with TF-IDF vectorization and feature importance

In [None]:
# df_title['labels'] = (df_title['labels']/mean_price)
Y = df_title['labels'].values

In [None]:
df_title['data'].sample(10).apply(print)

In [None]:
df_train, df_test, Ytrain, Ytest = train_test_split(df_title['data'],
                                                    Y, 
                                                    test_size=0.3, 
                                                    random_state=51)




In [None]:
# X_val, X_test, Y_val, Y_test = train_test_split(df_test, 
#                                                 Ytest, 
#                                                 test_size=0.5, 
#                                                 random_state=51)

In [None]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(df_train)
X_train_vec = tfidf_vectorizer.transform(df_train)
x_test_vec = tfidf_vectorizer.transform(df_test)

In [None]:
X_train_vec.get_shape()

In [None]:
tfidf_vectorizer.get_feature_names_out()

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(verbose=3, n_jobs=-1, random_state=42)

In [None]:
rf_model.fit(X_train_vec,Ytrain)

In [None]:
from sklearn import metrics

y_true = Ytest
y_pred = rf_model.predict(x_test_vec)

print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(y_true, y_pred))
print('Mean Squared Error (MSE):', metrics.mean_squared_error(y_true, y_pred))
print('Root Mean Squared Error (RMSE):', metrics.mean_squared_error(y_true, y_pred) ** 0.5)
print('Explained Variance Score:', metrics.explained_variance_score(y_true, y_pred))
print('Max Error:', metrics.max_error(y_true, y_pred))
print('Mean Squared Log Error:', metrics.mean_squared_log_error(y_true, y_pred))
print('Median Absolute Error:', metrics.median_absolute_error(y_true, y_pred))
print('R^2:', metrics.r2_score(y_true, y_pred))
print('Mean Poisson Deviance:', metrics.mean_poisson_deviance(y_true, y_pred))
print('Mean Gamma Deviance:', metrics.mean_gamma_deviance(y_true, y_pred))

In [None]:
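features = tfidf_vectorizer.get_feature_names_out()
fi = rf_model.feature_importances_
# Pair every term with its importance and sort descending, so the slice below shows the top terms
importance = sorted(zip(features, fi), key=lambda pair: pair[1], reverse=True)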

In [None]:
importance[:50]
