# Project 1 - Module 6

## Dynamic pricing - e-commerce

### Mercari Price Suggestion Challenge - Kaggle

Mercari is an online resale marketplace. One of the challenges for this kind of platform is helping users, who often have little sales experience, set a price for their products that maximizes the chance of a sale.

### About this project

This project aims to develop an algorithm that identifies similar products that have already been sold and suggests an optimal price to the user for newly listed products.


### Environment setup

For this project, see the competition page: https://www.kaggle.com/competitions/mercari-price-suggestion-challenge/overview


In [1]:
import pandas as pd
import numpy as np

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers
from tensorflow.keras.models import Model

!pip install tensorflow_addons
import tensorflow_addons as tfa
from sklearn import metrics

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tensorflow_addons
  Downloading tensorflow_addons-0.18.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 31.8 MB/s
Installing collected packages: tensorflow-addons
Successfully installed tensorflow-addons-0.18.0


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
base_train = pd.read_csv('/content/drive/MyDrive/Blue Edtech/data/processed/train_data.csv')
base_test = pd.read_csv('/content/drive/MyDrive/Blue Edtech/data/processed/test_data.csv')

In [4]:
# base_train['item_description'] = base_train['item_description'].fillna("No description")
# base_train['name'] = base_train['name'].fillna("No name")

# base_test['item_description'] = base_test['item_description'].fillna("No description")
# base_test['name'] = base_test['name'].fillna("No name")

In [5]:
stop_words = stopwords.words('english')

# tokenization and padding function

def text_vectorizer(feature):
  
  # # REMOVING STOPWORDS FROM THE TRAIN SET
  # base_train[feature] = base_train[feature].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
  # # REMOVING STOPWORDS FROM THE TEST SET
  # base_test[feature] = base_test[feature].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

  # TOKENIZER
  tk = Tokenizer()
  # FIT ON TRAIN 
  tk.fit_on_texts(base_train[feature].apply(str))
  # TOKENIZES THE TRAIN DATASET
  tk_train = tk.texts_to_sequences(base_train[feature].apply(str))
  # TOKENIZES THE TEST DATASET
  tk_test = tk.texts_to_sequences(base_test[feature].apply(str))
    
  # COMPUTES THE MAX LENGTH
  max_length = base_train[feature].apply(lambda x :len(str(x).split())).max()
    
  # COMPUTE THE VOCAB SIZE
  vocab_size = len(tk.word_index) + 1
    
  # PADDING THE TRAIN SEQUENCES
  train_pad= pad_sequences(tk_train,padding="post",maxlen = max_length)
  # PADDING THE TEST SEQUENCES
  test_pad = pad_sequences(tk_test,padding = "post", maxlen = max_length)
    
  # RETURN THE TOKENIZER, MAX LENGTH, VOCAB SIZE, PADDED TRAIN SEQUENCES, PADDED TEST SEQUENCES
  return tk , max_length, vocab_size, train_pad , test_pad
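
To make the steps inside `text_vectorizer` concrete, here is a minimal pure-Python sketch of the same idea on a toy corpus. It is a simplified stand-in, not the Keras implementation: the helper names `fit_word_index`, `texts_to_sequences` and `pad_post` and the sample texts are hypothetical, and punctuation filtering is ignored. It mirrors the Keras defaults used above: ids are assigned by descending frequency starting at 1, words unseen during fitting are dropped (no `oov_token`), and sequences are zero-padded at the end.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign integer ids by descending frequency; id 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its id; words unseen during fitting are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_post(sequences, maxlen):
    """Keep at most the last `maxlen` ids, then right-pad with zeros to `maxlen`."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]
        padded.append(kept + [0] * (maxlen - len(kept)))
    return padded

# Toy data (hypothetical), standing in for one text column of base_train / base_test
train_texts = ["Nike running shoes", "Nike shirt"]
test_texts = ["Adidas running shoes size 9"]

word_index = fit_word_index(train_texts)                # fitted on train only
max_length = max(len(t.split()) for t in train_texts)   # longest train text, in words
vocab_size = len(word_index) + 1                        # +1 for the padding id 0

train_pad = pad_post(texts_to_sequences(train_texts, word_index), max_length)
test_pad = pad_post(texts_to_sequences(test_texts, word_index), max_length)
print(word_index, train_pad, test_pad)
```

Fitting the vocabulary on the training split only means that test words such as "Adidas" carry no signal here; passing an `oov_token` to the Keras `Tokenizer` would be one way to keep a placeholder for them.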

In [6]:
# splitting the item_condition_id feature into train and test

train_item_cond = base_train.item_condition_id
test_item_cond = base_test.item_condition_id

In [7]:
# splitting the shipping feature into train and test

train_shipping = base_train.shipping
test_shipping = base_test.shipping

In [8]:
# running text_vectorizer on every text feature

tk_name , max_length_name, vocab_size_name, train_name_pad , test_name_pad = text_vectorizer('name')
tk_category_1 , max_length_category_1, vocab_size_category_1, train_category_1_pad , test_category_1_pad = text_vectorizer('category_1')
tk_category_2 , max_length_category_2, vocab_size_category_2, train_category_2_pad , test_category_2_pad = text_vectorizer('category_2')
tk_category_3 , max_length_category_3, vocab_size_category_3, train_category_3_pad , test_category_3_pad = text_vectorizer('category_3')
tk_brand_name , max_length_brand_name, vocab_size_brand_name, train_brand_name_pad , test_brand_name_pad = text_vectorizer('brand_name')
tk_item_description , max_length_item_description, vocab_size_item_description, train_item_description_pad , test_item_description_pad = text_vectorizer('item_description')

In [10]:
# checking the maximum sequence length of each feature

print(max_length_name, 'name')
print(max_length_category_1, 'category_1')
print(max_length_category_2, 'category_2')
print(max_length_category_3, 'category_3')
print(max_length_brand_name, 'brand_name')
print(max_length_item_description, 'item_description')

17 name
3 category_1
5 category_2
7 category_3
6 brand_name


In [11]:
# storing the inputs in a list, in the same order as the eight model inputs

x_train = [train_item_cond, train_shipping, train_brand_name_pad, train_category_1_pad, train_category_2_pad, train_category_3_pad, train_name_pad, train_item_description_pad]
x_test = [test_item_cond, test_shipping, test_brand_name_pad, test_category_1_pad, test_category_2_pad, test_category_3_pad, test_name_pad, test_item_description_pad]

In [12]:
# converting the price target to log scale

base_train['log_price'] = np.log(base_train['price'])
base_test['log_price'] = np.log(base_test['price'])

y_train = base_train.log_price
y_test = base_test.log_price
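
The log transform compresses the long right tail of prices, and predictions are mapped back to currency with the exponential. The sketch below illustrates the round trip with the standard library only; the sample prices are hypothetical. It also shows the `log1p`/`expm1` pair as an assumption-labelled alternative: a plain logarithm is undefined at a price of 0, so if the processed data still contained zero prices, `np.log` would yield `-inf`.

```python
import math

# Hypothetical sample prices, standing in for base_train['price']
prices = [10.0, 25.0, 120.0]

# Forward transform used for the target, and its inverse used on predictions
log_prices = [math.log(p) for p in prices]
recovered = [math.exp(v) for v in log_prices]

# Assumption: a variant that tolerates a zero price, log(1 + p) and exp(v) - 1
zero_safe = math.log1p(0.0)
zero_back = math.expm1(zero_safe)

print(log_prices)
print(recovered)
print(zero_safe, zero_back)
```

With NumPy, the equivalent calls would be `np.log1p` and `np.expm1`; whichever pair is chosen for the target must also be used when converting predictions back to prices.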

In [17]:
# deep learning architecture

tf.keras.backend.clear_session()
# ITEM CONDITION ID
inp1 = layers.Input(shape=(1,)) # INPUT 1
emb1 = layers.Embedding(6, 10, input_length=1)(inp1) # EMBEDDING 1
flat1 = layers.Flatten()(emb1) # FLATTEN
# SHIPPING
inp2 = layers.Input(shape=(1,)) # INPUT 2
d2 = layers.Dense(10, activation="relu")(inp2) # DENSE LAYER 2
# BRAND NAME
inp3 = layers.Input(shape=(6,)) # INPUT 3
emb3 = layers.Embedding(vocab_size_brand_name, 16, input_length=6)(inp3) # EMBEDDING 3
flat3 = layers.Flatten()(emb3) # FLATTEN
# CATEGORY_1
inp4 = layers.Input(shape=(3,)) # INPUT 4
emb4 = layers.Embedding(vocab_size_category_1, 16, input_length=3)(inp4) # EMBEDDING 4
flat4 = layers.Flatten()(emb4) # FLATTEN
# CATEGORY_2
inp5 = layers.Input(shape=(5,)) # INPUT 5
emb5 = layers.Embedding(vocab_size_category_2, 16, input_length=5)(inp5) # EMBEDDING 5
flat5 = layers.Flatten()(emb5) # FLATTEN
# CATEGORY_3
inp6 = layers.Input(shape=(7,)) # INPUT 6
emb6 = layers.Embedding(vocab_size_category_3, 16, input_length=7)(inp6) # EMBEDDING 6
flat6 = layers.Flatten()(emb6) # FLATTEN
# NAME
inp7 = layers.Input(shape=(17,)) # INPUT 7
emb7 = layers.Embedding(vocab_size_name, 20, input_length=17)(inp7) # EMBEDDING 7
gru7 = layers.GRU(64, return_sequences=True)(emb7) # GRU
flat7 = layers.Flatten()(gru7) # FLATTEN
# ITEM DESCRIPTION
inp8 = layers.Input(shape=(169,)) # INPUT 8
emb8 = layers.Embedding(vocab_size_item_description, 40, input_length=169)(inp8) # EMBEDDING 8
gru8 = layers.GRU(64, return_sequences=True)(emb8) # GRU
flat8 = layers.Flatten()(gru8) # FLATTEN
# CONCATENATION
concat = layers.Concatenate()([flat1, d2, flat3, flat4, flat5, flat6, flat7, flat8])
# DENSE LAYER
dense1 = layers.Dense(512, activation="relu")(concat)
# DROPOUT LAYER
drop1 = layers.Dropout(0.2)(dense1)
# DENSE LAYER
dense2 = layers.Dense(256, activation="relu")(drop1)
# DROPOUT LAYER
drop2 = layers.Dropout(0.3)(dense2)
# DENSE LAYER
dense3 = layers.Dense(128, activation="relu")(drop2)
# DROPOUT LAYER
drop3 = layers.Dropout(0.4)(dense3)
# BATCHNORM LAYER
bn = layers.BatchNormalization()(drop3)
# OUTPUT LAYER (predicts log price)
dense4 = layers.Dense(1, activation="linear")(bn)
# MODEL
model = Model(inputs=[inp1, inp2, inp3, inp4, inp5, inp6, inp7, inp8], outputs=dense4)

# SCHEDULE: keep the learning rate for epochs 0-2, then multiply it by 0.1 at every later epoch
def schedule(epoch, lr):
    if epoch <= 2:
        return lr
    else:
        return lr * 0.1
# CALLBACKS
lr = tf.keras.callbacks.LearningRateScheduler(schedule, verbose=1)
save = tf.keras.callbacks.ModelCheckpoint("content/drive/MyDrive/Blue Edtech/notebooks", monitor="val_root_mean_squared_error", mode="min", save_best_only=True, save_weights_only=True, verbose=1)
earlystop = tf.keras.callbacks.EarlyStopping(monitor="val_root_mean_squared_error", min_delta=0.01, patience=2, mode="min")

model.compile(
    optimizer="adam",
    loss="mse",
    metrics=[
        tf.keras.metrics.MeanAbsoluteError(),
        tfa.metrics.r_square.RSquare(),
        tf.keras.metrics.RootMeanSquaredError(),
        tf.keras.metrics.mean_absolute_percentage_error,
        tf.keras.metrics.mean_squared_logarithmic_error,
    ],
)
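
As a sanity check on the architecture, the width of the vector entering the first dense layer can be worked out by hand. Flattening an embedding of sequence length L and dimension D yields L × D values, and flattening a GRU with `return_sequences=True` yields L × units values. The sketch below only does this arithmetic, using the lengths and sizes hard-coded in the cell above; the dictionary name `branch_widths` is hypothetical.

```python
# Flattened width of each of the eight branches feeding the Concatenate layer
branch_widths = {
    "item_condition_id": 1 * 10,    # Embedding(6, 10) over 1 step
    "shipping": 10,                 # Dense(10)
    "brand_name": 6 * 16,           # Embedding(..., 16) over 6 steps
    "category_1": 3 * 16,           # Embedding(..., 16) over 3 steps
    "category_2": 5 * 16,           # Embedding(..., 16) over 5 steps
    "category_3": 7 * 16,           # Embedding(..., 16) over 7 steps
    "name": 17 * 64,                # GRU(64, return_sequences=True) over 17 steps
    "item_description": 169 * 64,   # GRU(64, return_sequences=True) over 169 steps
}

concat_width = sum(branch_widths.values())
description_share = branch_widths["item_description"] / concat_width

print(concat_width)
print(round(description_share, 3))
```

Under these numbers the item description branch supplies most of the concatenated features, which is one reason a pooling layer or `return_sequences=False` is sometimes preferred over flattening every GRU step.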

In [18]:
# FITTING THE MODEL
model.fit(x_train, y_train, validation_data= (x_test, y_test), epochs=10, batch_size = 1024, callbacks=[save,lr,earlystop])


Epoch 1: LearningRateScheduler setting learning rate to 0.0010000000474974513.
Epoch 1/10
Epoch 1: val_root_mean_squared_error improved from inf to 0.50970, saving model to content/drive/MyDrive/Blue Edtech/notebooks

Epoch 2: LearningRateScheduler setting learning rate to 0.0010000000474974513.
Epoch 2/10
Epoch 2: val_root_mean_squared_error improved from 0.50970 to 0.49554, saving model to content/drive/MyDrive/Blue Edtech/notebooks

Epoch 3: LearningRateScheduler setting learning rate to 0.0010000000474974513.
Epoch 3/10
Epoch 3: val_root_mean_squared_error improved from 0.49554 to 0.48941, saving model to content/drive/MyDrive/Blue Edtech/notebooks

Epoch 4: LearningRateScheduler setting learning rate to 0.00010000000474974513.
Epoch 4/10
Epoch 4: val_root_mean_squared_error improved from 0.48941 to 0.48494, saving model to content/drive/MyDrive/Blue Edtech/notebooks

Epoch 5: LearningRateScheduler setting learning rate to 1.0000000474974514e-05.
Epoch 5/10
Epoch 5: val_root_mean_

<keras.callbacks.History at 0x7fa3a2e67090>
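
The learning rates printed in the log follow from how `LearningRateScheduler` calls the schedule: at the start of each epoch it passes the zero-based epoch index together with the current learning rate, so the factor of 0.1 compounds from the fourth epoch onward. A minimal sketch of that behaviour, re-declaring the same rule locally and assuming Adam's default starting rate of 0.001:

```python
def schedule(epoch, lr):
    """Same rule as the notebook: unchanged for epochs 0-2, then 10x smaller each epoch."""
    if epoch <= 2:
        return lr
    return lr * 0.1

lr = 1e-3  # assumed starting value (the Adam default)
history = []
for epoch in range(5):        # zero-based, as Keras passes it
    lr = schedule(epoch, lr)  # the new rate becomes the input for the next epoch
    history.append(lr)

for epoch, value in enumerate(history, start=1):
    print(f"Epoch {epoch}: learning rate {value:.0e}")
```

If a single one-off drop were intended instead, the schedule would need to return a fixed value for later epochs rather than scaling the rate it receives.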

In [19]:
y_pred = np.exp(model.predict(x_test))

In [20]:
def print_avaliacao(obs, pred):
    print('R² = %.3f' % metrics.r2_score(obs, pred))
    print('MAPE = %.3f %%' % (100 * metrics.mean_absolute_percentage_error(obs, pred)))
    print('MAE = U$S %.2f' % (metrics.mean_absolute_error(obs, pred)))
    print('RMSE = U$S %.2f' % metrics.mean_squared_error(obs, pred)**0.5)
    print('MSLE = %.3f' % metrics.mean_squared_log_error(obs, pred))

In [21]:
print_avaliacao(base_test.price,y_pred)

R² = 0.477
MAPE = 39.278 %
MAE = U$S 10.12
RMSE = U$S 27.75
MSLE = 0.208
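
The competition ranks submissions by RMSLE, which is the square root of the MSLE printed above. scikit-learn's `mean_squared_log_error` is computed on `log(1 + value)`, so the conversion is a single square root. The sketch below re-implements both metrics with the standard library for illustration (the function names `msle` and `rmsle` are hypothetical helpers) and converts the reported value.

```python
import math

def msle(obs, pred):
    """Mean squared error between log(1 + obs) and log(1 + pred)."""
    return sum((math.log1p(o) - math.log1p(p)) ** 2 for o, p in zip(obs, pred)) / len(obs)

def rmsle(obs, pred):
    """Root mean squared logarithmic error."""
    return math.sqrt(msle(obs, pred))

# Converting the MSLE reported above into RMSLE
reported_msle = 0.208
reported_rmsle = math.sqrt(reported_msle)
print(round(reported_rmsle, 3))

# Toy check (hypothetical values): a perfect prediction scores 0
print(rmsle([10.0, 20.0], [10.0, 20.0]))
```

Because the error is measured on the log scale, predicting US$ 20 for a US$ 10 item is penalised about as much as predicting US$ 200 for a US$ 100 item.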


### Reference

https://github.com/pushapgandhi/Mercari_Price_Prediction/blob/main/Deep_Learning_Model.ipynb