# Car Price prediction

<img src="https://whatcar.vn/media/2018/09/car-lot-940x470.jpg"/>

## Predicting car price from its characteristics
*This notebook is a template (baseline) for the current competition and is not a finished solution!*   
You can use it as a starting point for building your own solution.


> A **baseline** is built mostly as a template that shows how the input data is handled and what output is expected. The ML part itself can be fairly simple. This helps you get to the ML work sooner instead of spending valuable time on engineering tasks. 
The baseline is also a good reference point for the metric. If our solution is worse than the baseline, we are clearly doing something wrong and should try another approach.

In [2]:
!pip install -q tensorflow==2.3

In [3]:
# Augmentation
!pip install albumentations -q

In [4]:
!pip install pymorphy2

In [5]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import random
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import sys
import PIL
import cv2
import re
import pymorphy2
import nltk

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler,RobustScaler

# # keras
import tensorflow as tf
import tensorflow.keras.layers as L
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
import albumentations

# plt
import matplotlib.pyplot as plt
import seaborn as sns
# set default picture size
from pylab import rcParams
rcParams['figure.figsize'] = 10, 5
%config InlineBackend.figure_format = 'svg' 
%matplotlib inline

# Profiling
import pandas_profiling
# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [6]:
print('Python       :', sys.version.split('\n')[0])
print('Numpy        :', np.__version__)
print('Tensorflow   :', tf.__version__)

In [7]:
def mape(y_true, y_pred):
    return np.mean(np.abs((y_pred-y_true)/y_true))

In [8]:
# Random seed
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

In [9]:
!pip freeze > requirements.txt

# DATA

Feature types:

* bodyType - category
* brand - category
* color - category
* description - text
* engineDisplacement - numeric as text
* enginePower - numeric as text
* fuelType - category
* mileage - numeric
* modelDate - numeric
* model_info - category
* name - category, high dimension
* numberOfDoors - category
* price - numeric
* productionDate - numeric
* sell_id - picture (address based on sell_id)
* vehicleConfiguration - not in use
* vehicleTransmission - category
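* Владельцы (number of owners) - category
* Владение (ownership duration) - numeric as text
* ПТС (vehicle title document) - category
* Привод (drivetrain) - category
* Руль (steering wheel side) - category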

In [10]:
DATA_DIR = '../input/sf-dst-car-price-prediction-part2/'
train = pd.read_csv(DATA_DIR + 'train.csv')
test = pd.read_csv(DATA_DIR + 'test.csv')
sample_submission = pd.read_csv(DATA_DIR + 'sample_submission.csv')

In [11]:
train.info()

In [12]:
train.nunique()

# Model 1: Create base model 
Median price by model and production year.
We will compare future models with it.
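
The same idea can also be written with a `groupby` instead of a row-by-row query. This is a minimal sketch on toy data: the frames `toy_train` and `toy_test` and the helper `median_baseline` are made up for illustration, and only the column names come from the dataset. Groups unseen in train fall back to the overall train median here, which is an assumption and differs slightly from the notebook's own fallback.

```python
import pandas as pd

def median_baseline(train_df, test_df):
    """Predict the median train price of the same (model_info, productionDate) group."""
    keys = ['model_info', 'productionDate']
    group_median = (train_df.groupby(keys)['price']
                    .median()
                    .rename('pred')
                    .reset_index())
    # A left merge keeps the row order of test_df.
    merged = test_df[keys].merge(group_median, on=keys, how='left')
    # Assumption: groups unseen in train fall back to the overall train median.
    return merged['pred'].fillna(train_df['price'].median()).to_numpy()

toy_train = pd.DataFrame({
    'model_info': ['X5', 'X5', 'X5', 'A4'],
    'productionDate': [2015, 2015, 2015, 2010],
    'price': [2_000_000, 2_200_000, 2_600_000, 700_000],
})
toy_test = pd.DataFrame({
    'model_info': ['X5', 'Q7'],
    'productionDate': [2015, 2018],
})
toy_pred = median_baseline(toy_train, toy_test)
print(toy_pred)
```

The first toy car gets the median of its own group; the second has no matching group and receives the fallback value.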



In [13]:
# split data
data_train, data_test = train_test_split(train, test_size=0.15, shuffle=True, random_state=RANDOM_SEED)

In [14]:
# Base model MEDIAN on model_info and productionDate
predicts = []
for index, row in data_test[['model_info', 'productionDate']].iterrows():
    model_info, production_date = row['model_info'], row['productionDate']
    # compare productionDate as a number, not as a quoted string
    query = "model_info == @model_info and productionDate == @production_date"
    predicts.append(data_train.query(query)['price'].median())

# fill na
predicts = pd.DataFrame(predicts)
predicts = predicts.fillna(predicts.median())

# round
predicts = (predicts // 1000) * 1000

# metric MAPE
print(f"MAPE: {(mape(data_test['price'], predicts.values[:, 0]))*100:0.2f}%")

# EDA

Numeric distributions:

In [15]:
# numeric features distribution
def visualize_distributions(titles_values_dict):
  columns = min(3, len(titles_values_dict))
  rows = (len(titles_values_dict) - 1) // columns + 1
  fig = plt.figure(figsize = (columns * 6, rows * 4))
  for i, (title, values) in enumerate(titles_values_dict.items()):
    hist, bins = np.histogram(values, bins = 20)
    ax = fig.add_subplot(rows, columns, i + 1)
    ax.bar(bins[:-1], hist, width = (bins[1] - bins[0]) * 0.7)
    ax.set_title(title)
  plt.show()

visualize_distributions({
    'mileage': train['mileage'].dropna(),
    'modelDate': train['modelDate'].dropna(),
    'productionDate': train['productionDate'].dropna()
})

Summary:
* CatBoost can work with the features as they are, but they must be normalized before being fed to a neural network.
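
To make the normalization point concrete, here is a small NumPy sketch of robust scaling: center on the median and divide by the interquartile range. To my understanding this mirrors what `RobustScaler` does with its default settings, but the function `robust_scale` and the toy mileage values are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    q25, q75 = np.percentile(values, [25, 75])
    return (values - median) / (q75 - q25)

# Toy mileage column with one extreme outlier (made-up numbers).
toy_mileage = np.array([10_000, 50_000, 90_000, 130_000, 1_000_000])
scaled = robust_scale(toy_mileage)
print(scaled)
```

The outlier stays large after scaling, but it does not distort the scale of the typical values, because the median and IQR ignore the extremes.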

# PreProcessing

In [16]:
# categorical features
categorical_features = ['bodyType', 'brand', 'color', 'engineDisplacement', 'enginePower', 'fuelType', 'model_info', 'name',
  'numberOfDoors', 'vehicleTransmission', 'Владельцы', 'Владение', 'ПТС', 'Привод', 'Руль']

# numeric features
numerical_features = ['mileage', 'modelDate', 'productionDate']

In [17]:
# concat train and test in one dataset
train['sample'] = 1 # train
test['sample'] = 0 # test
test['price'] = 0 # fill price in test with "0"

data = pd.concat([test, train], sort=False).reset_index(drop=True)  # DataFrame.append was removed in pandas 2.0
print(train.shape, test.shape, data.shape)

In [18]:
data.info()

In [19]:
# Create a temporary df to explore the features; the preprocessing is collected into one function later
df = data.copy()

In [20]:
# Use fast profiling package to explore data
pandas_profiling.ProfileReport(df)

In [21]:
# BodyType
df.bodyType.value_counts()

In [22]:
# Keep first word only
df['bodyType'] = df.bodyType.apply(lambda x: x.split(' ')[0].lower())

In [23]:
df.bodyType.value_counts()

In [24]:
# EngineDisplacement
# Here we have undefined LTr but let's keep it as separate category
df.engineDisplacement.unique()

In [25]:
# EnginePower
# Strip the trailing ' N12' suffix and convert to int
df['enginePower'] = df['enginePower'].apply(lambda x: x[:-4]).astype('int')

In [26]:
# Distribution plot
df.enginePower.hist()

In [27]:
# Name
df.name.head()

In [28]:
# Name mostly repeats information from other columns. Create a boolean feature xdrive: 1 if the name contains 'xDrive', otherwise 0
df['xdrive'] = df['name'].apply(lambda x: 1 if 'xDrive' in x else 0)

In [29]:
df['xdrive'].value_counts()

In [30]:
# Владельцы keep only number and remove words
df['Владельцы'].fillna('3 или более', inplace=True)
df['Владельцы'] = df['Владельцы'].apply(
    lambda x: int(x[0])).astype('int')

In [31]:
sns.countplot(x = 'Владельцы', data = df)  # plot the cleaned column from df

In [32]:
# New feature age 
df['age'] = 2021 - df['productionDate']

In [33]:
# New feature usage
df['usage'] = df['mileage']/df['age']

In [34]:
# Correlation heatmap
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
ax = sns.heatmap(df.corr(),fmt='.1g',annot=True)

##### High correlation between age, modelDate and productionDate. The preprocessing function drops modelDate and age.
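
The redundancy is easy to check on toy numbers: `age` is computed as `2021 - productionDate`, so it is a linear function of the production year and carries no extra information. A minimal sketch (the array `production_year` is made up for illustration):

```python
import numpy as np

# Made-up production years; age is derived exactly as in the notebook.
production_year = np.array([2005, 2010, 2014, 2018, 2020])
age = 2021 - production_year

# Pearson correlation between the two columns.
corr = np.corrcoef(production_year, age)[0, 1]
print(round(corr, 6))
```

A correlation of magnitude 1 means one of the two columns can be dropped without losing information.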

In [55]:
#  UPDATE CAT and numeric features
# categorical features
categorical_features = ['bodyType', 'brand', 'color', 'engineDisplacement', 'enginePower','fuelType',
                        'model_info', 'numberOfDoors', 'vehicleTransmission', 'Владельцы', 'ПТС', 
                        'Привод', 'Руль']

# numeric features
numerical_features = ['mileage', 'modelDate', 'productionDate', 'age', 'usage', 'enginePower']

# bool
bool_features = ['xdrive']

In [58]:
def preproc_data(df_input):
    '''includes several functions to pre-process the predictor data.'''
    
    df_output = df_input.copy()
    
    # ################### 1. Preprocessing ############################################################## 
    # Keep first word only for bodyType
    df_output['bodyType'] = df_output.bodyType.apply(lambda x: x.split(' ')[0].lower())
    # Changes in enginePower
    df_output['enginePower'] = df_output['enginePower'].apply(lambda x: x[:-4]).astype('int')
    # xdrive based on name
    df_output['xdrive'] = df_output['name'].apply(lambda x: 1 if 'xDrive' in x else 0)
    # Владельцы keep only number and remove words
    df_output['Владельцы'].fillna('3 или более', inplace=True)
    df_output['Владельцы'] = df_output['Владельцы'].apply(
        lambda x: int(x[0])).astype('int')
    
    
    # ################### Feature Engineering ####################################################
    # New features age and usage
    df_output['age'] = 2021 - df_output['productionDate']
    df_output['usage'] = df_output['mileage']/df_output['age']
    
    
    # ################### Numerical Features ############################################################## 
    # Fill nan 
    for column in numerical_features:
        df_output[column].fillna(df_output[column].median(), inplace=True)
    
    # Logarithm option for some columns (better to use mileage only)
    df_output['mileage'] = np.log(df_output['mileage'])
#     df_output['age'] = np.log(df_output['age'])
#     df_output['usage'] = np.log(df_output['usage'])
#     df_output['enginePower'] = np.log(df_output['enginePower'])
    
    # data normalization
    scaler = RobustScaler() # handles outliers well, and our data has them
    for column in numerical_features:
        df_output[column] = scaler.fit_transform(df_output[[column]])[:,0]
    
    
    # ################### Categorical Features ############################################################## 
    # Label Encoding
    for column in categorical_features:
        df_output[column] = df_output[column].astype('category').cat.codes
        
    # One-Hot Encoding: get_dummies.
    df_output = pd.get_dummies(df_output, columns=categorical_features, dummy_na=False)
    
    
    # ################### Drop columns #################################################### 
    df_output.drop(['vehicleConfiguration','description','sell_id','modelDate','age', 'name', 'Владение'], axis = 1, inplace=True)
    
    return df_output

In [59]:
# Check what we have
df_preproc = preproc_data(data)
df_preproc.sample(10)

In [60]:
df_preproc.info()

In [61]:
# Check xdrive
df_preproc['xdrive']

## Split data

In [62]:
# Return test part back
train_data = df_preproc.query('sample == 1').drop(['sample'], axis=1)
test_data = df_preproc.query('sample == 0').drop(['sample'], axis=1)

y = train_data.price.values     # target
X = train_data.drop(['price'], axis=1)
X_sub = test_data.drop(['price'], axis=1)

In [63]:
test_data.info()

# Model 2: CatBoostRegressor

In [64]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, shuffle=True, random_state=RANDOM_SEED)

In [65]:
model = CatBoostRegressor(iterations = 5000,
                          #depth=10,
                          #learning_rate = 0.5,
                          random_seed = RANDOM_SEED,
                          eval_metric='MAPE',
                          custom_metric=['RMSE', 'MAE'],
                          od_wait=500,
                          #task_type='GPU',
                         )
model.fit(X_train, np.log(y_train),
         eval_set=(X_test, np.log(y_test)),
         verbose_eval=100,
         use_best_model=True,
         plot=True
         )

In [66]:
test_predict_catboost = np.exp(model.predict(X_test))
print(f"TEST mape: {(mape(y_test, test_predict_catboost))*100:0.2f}%")

#### Good MAPE 11.69% after data preprocessing
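
The CatBoost model above is fitted on `np.log(price)` and its predictions are mapped back with `np.exp`. One way to see why this suits MAPE: an additive error `d` in log space becomes a multiplicative factor `exp(d)` in price, so the relative error is `|exp(d) - 1|` regardless of how expensive the car is. The sketch below uses made-up prices and an assumed constant log error.

```python
import numpy as np

def mape(y_true, y_pred):
    return np.mean(np.abs((y_pred - y_true) / y_true))

toy_price = np.array([300_000.0, 1_500_000.0, 6_000_000.0])
log_error = 0.1  # assumed: the same additive error in log space for every car
toy_pred = np.exp(np.log(toy_price) + log_error)

# Relative error per car and the resulting MAPE.
relative_errors = np.abs(toy_pred - toy_price) / toy_price
print(relative_errors)
print(mape(toy_price, toy_pred))
```

Every car ends up with the same relative error, so cheap and expensive cars contribute equally to the metric.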

### Submission

In [67]:
sub_predict_catboost = np.exp(model.predict(X_sub))
sample_submission['price'] = sub_predict_catboost
sample_submission.to_csv('catboost_submission.csv', index=False)

# Model 3: Tabular NN

Build simple network:

In [68]:
X_train.head(5)

## Simple Dense NN

In [69]:
model = Sequential()
model.add(L.Dense(512, input_dim=X_train.shape[1], activation="relu"))
model.add(L.Dropout(0.5))
model.add(L.Dense(256, activation="relu"))
model.add(L.Dropout(0.5))
model.add(L.Dense(1, activation="linear"))

In [70]:
model.summary()

In [71]:
# Compile model
optimizer = tf.keras.optimizers.Adam(0.01)
model.compile(loss='MAPE',optimizer=optimizer, metrics=['MAPE'])

In [72]:
checkpoint = ModelCheckpoint('../working/best_model.hdf5', monitor='val_MAPE',
                             verbose=0, mode='min', save_best_only=True)  # keep only the best epoch
earlystop = EarlyStopping(monitor='val_MAPE', patience=50, restore_best_weights=True,)
callbacks_list = [checkpoint, earlystop]

### Fit

In [73]:
history = model.fit(X_train, y_train,
                    batch_size=512,
                    epochs=500, 
                    validation_data=(X_test, y_test),
                    callbacks=callbacks_list,
                    verbose=0,
                   )

In [74]:
plt.title('Loss')
plt.plot(history.history['MAPE'], label='train')
plt.plot(history.history['val_MAPE'], label='test')
plt.legend()  # show the train/test labels
plt.show();

In [75]:
model.load_weights('../working/best_model.hdf5')
model.save('../working/nn_1.hdf5')

In [76]:
test_predict_nn1 = model.predict(X_test)
print(f"TEST mape: {(mape(y_test, test_predict_nn1[:,0]))*100:0.2f}%")

In [77]:
sub_predict_nn1 = model.predict(X_sub)
sample_submission['price'] = sub_predict_nn1[:,0]
sample_submission.to_csv('nn1_submission.csv', index=False)

#### MAPE 11.25%: slightly better when only mileage is log-transformed rather than all numeric features

# Model 4: NLP + Multiple Inputs

#### Use the package for working with the Russian language (pymorphy2, installed above)
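
As a small illustration of the cleaning step used in this section, the sketch below applies the notebook's character class with the standard `re` module only. The helper name `strip_latin_digits_punct` and the sample sentence are made up; lemmatization itself needs pymorphy2 and is not shown here.

```python
import re

# Same character class as in the notebook: Latin letters, digits and punctuation.
patterns = r"[A-Za-z0-9!#$%&'()*+,./:;<=>?@[\]^_`{|}~—\"\-]+"

def strip_latin_digits_punct(doc):
    """Replace every run of matched characters with a space, then re-join the remaining tokens."""
    return ' '.join(re.sub(patterns, ' ', doc).split())

cleaned = strip_latin_digits_punct('Продаю BMW X5, 2015 год!')
print(cleaned)
```

Only the Cyrillic words survive, which is what the morphological analyzer is then asked to normalize.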

In [78]:
data.description

In [79]:
# Create lemmatization function
# Characters to strip from the text: Latin letters, digits and punctuation
patterns = r"[A-Za-z0-9!#$%&'()*+,./:;<=>?@[\]^_`{|}~—\"\-]+"  # raw string avoids invalid-escape warnings

def lemmatize(doc):
    doc = re.sub(patterns, ' ', doc) # replace every match with a space
    tokens = []
    for token in doc.split():
        token = token.strip()
        token = morph.normal_forms(token)[0]
        tokens.append(token)
    return ' '.join(tokens)

In [80]:
# Word-parsing helper: return the normal (dictionary) form of a word
def pos(word, morph):
    return morph.parse(word)[0].normal_form

In [81]:
morph = pymorphy2.MorphAnalyzer()

In [82]:
data.description = data.description.apply(lambda x: " ".join([pos(word, morph) for word in x.split()]))

In [83]:
df_NLP = data.copy()
data['description'] = df_NLP['description'].apply(lemmatize)

In [84]:
# TOKENIZER
# The maximum number of words to be used. (most frequent)
MAX_WORDS = 100000
# Max number of words in each description.
MAX_SEQUENCE_LENGTH = 256

In [85]:
# data split 
text_train = data.description.iloc[X_train.index]
text_test = data.description.iloc[X_test.index]
text_sub = data.description.iloc[X_sub.index]

### Tokenizer
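
Before the Keras calls, here is a pure-Python sketch of what the tokenizing and padding steps do conceptually: words get integer ids by frequency (starting at 1), and each text becomes a fixed-length sequence, left-padded with zeros. This is an illustration only, not the Keras implementation: the helpers `fit_word_index` and `to_padded_sequence` are made up, and the sketch ignores `num_words` and punctuation filtering.

```python
from collections import Counter

def fit_word_index(texts):
    """Give each word an integer id, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequence(text, word_index, maxlen):
    """Map words to ids, drop unknown words, keep the last `maxlen` ids, left-pad with 0."""
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids

# Made-up lemmatized descriptions.
toy_texts = ['машина в отличный состояние', 'отличный машина один владелец']
toy_index = fit_word_index(toy_texts)
toy_seq = to_padded_sequence('машина один владелец', toy_index, maxlen=5)
print(toy_index)
print(toy_seq)
```

Id 0 is reserved for padding, which is why the embedding layers later use `len(tokenize.word_index)+1` as the vocabulary size.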

In [86]:
%%time
tokenize = Tokenizer(num_words=MAX_WORDS)
tokenize.fit_on_texts(data.description)

In [87]:
tokenize.word_index

In [88]:
%%time
text_train_sequences = sequence.pad_sequences(tokenize.texts_to_sequences(text_train), maxlen=MAX_SEQUENCE_LENGTH)
text_test_sequences = sequence.pad_sequences(tokenize.texts_to_sequences(text_test), maxlen=MAX_SEQUENCE_LENGTH)
text_sub_sequences = sequence.pad_sequences(tokenize.texts_to_sequences(text_sub), maxlen=MAX_SEQUENCE_LENGTH)

print(text_train_sequences.shape, text_test_sequences.shape, text_sub_sequences.shape, )

In [89]:
# this is what the text looks like now
print(text_train.iloc[6])
print(text_train_sequences[6])

### RNN NLP

LSTM model

In [90]:
model_nlp = Sequential()
model_nlp.add(L.Input(shape=MAX_SEQUENCE_LENGTH, name="seq_description"))
model_nlp.add(L.Embedding(len(tokenize.word_index)+1, MAX_SEQUENCE_LENGTH,))
model_nlp.add(L.LSTM(256, return_sequences=True))
model_nlp.add(L.Dropout(0.5))
model_nlp.add(L.LSTM(128,))
model_nlp.add(L.Dropout(0.25))
model_nlp.add(L.Dense(64, activation="relu"))
model_nlp.add(L.Dropout(0.25))

### MLP

In [91]:
model_mlp = Sequential()
model_mlp.add(L.Dense(512, input_dim=X_train.shape[1], activation="relu"))
model_mlp.add(L.Dropout(0.5))
model_mlp.add(L.Dense(256, activation="relu"))
model_mlp.add(L.Dropout(0.5))

### Multiple Inputs NN

In [92]:
combinedInput = L.concatenate([model_nlp.output, model_mlp.output])
# build our regression head
head = L.Dense(64, activation="relu")(combinedInput)
head = L.Dense(1, activation="linear")(head)

model = Model(inputs=[model_nlp.input, model_mlp.input], outputs=head)

In [93]:
model.summary()

### Fit

In [94]:
optimizer = tf.keras.optimizers.Adam(0.01)
model.compile(loss='MAPE',optimizer=optimizer, metrics=['MAPE'])

In [95]:
checkpoint = ModelCheckpoint('../working/best_model.hdf5', monitor='val_MAPE', verbose=0, mode='min', save_best_only=True)  # keep only the best epoch
earlystop = EarlyStopping(monitor='val_MAPE', patience=10, restore_best_weights=True,)
callbacks_list = [checkpoint, earlystop]

In [96]:
history = model.fit([text_train_sequences, X_train], y_train,
                    batch_size=512,
                    epochs=500,
                    validation_data=([text_test_sequences, X_test], y_test),
                    callbacks=callbacks_list
                   )

In [97]:
plt.title('Loss')
plt.plot(history.history['MAPE'], label='train')
plt.plot(history.history['val_MAPE'], label='test')
plt.legend()  # show the train/test labels
plt.show();

In [98]:
model.load_weights('../working/best_model.hdf5')
model.save('../working/nn_mlp_nlp.hdf5')

In [99]:
test_predict_nn2 = model.predict([text_test_sequences, X_test])
print(f"TEST mape: {(mape(y_test, test_predict_nn2[:,0]))*100:0.2f}%")

In [100]:
sub_predict_nn2 = model.predict([text_sub_sequences, X_sub])
sample_submission['price'] = sub_predict_nn2[:,0]
sample_submission.to_csv('nn2_submission.csv', index=False)

# Model 5: Add pictures

### Data

In [101]:
# check prices and pictures
plt.figure(figsize = (12,8))

random_image = train.sample(n = 9)
random_image_paths = random_image['sell_id'].values
random_image_cat = random_image['price'].values

for index, path in enumerate(random_image_paths):
    im = PIL.Image.open(DATA_DIR+'img/img/' + str(path) + '.jpg')
    plt.subplot(3, 3, index + 1)
    plt.imshow(im)
    plt.title('price: ' + str(random_image_cat[index]))
    plt.axis('off')
plt.show()

In [102]:
size = (320, 240)

def get_image_array(index):
    images = []
    for sell_id in data['sell_id'].iloc[index].values:
        image = cv2.imread(DATA_DIR + 'img/img/' + str(sell_id) + '.jpg')
        assert image is not None, f'image for sell_id {sell_id} not found'
        # cv2 loads images as BGR; matplotlib and the pretrained network expect RGB
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image = cv2.resize(image, size)
        images.append(image)
    images = np.array(images)
    print('images shape', images.shape, 'dtype', images.dtype)
    return images

images_train = get_image_array(X_train.index)
images_test = get_image_array(X_test.index)
images_sub = get_image_array(X_sub.index)

## Albumentations (from my previous project)

In [103]:
import albumentations as a

# Example from here: https://github.com/VictorKovatsenko/portfolio_ds/blob/master/Project_7_Deep_learning_car_classification/Kovatsenko_car_classification_keras.ipynb
augmentations = a.Compose([
    a.GaussianBlur(p=0.05), # add Gaussian blur and noise with 5% probability each
    a.GaussNoise(p=0.05),
    a.ShiftScaleRotate(shift_limit=0.0625, 
                       scale_limit=0.01, 
                       interpolation=1, 
                       border_mode=4, 
                       rotate_limit=20, 
                       p=.75), # shift, scale and rotate with a higher-than-default probability to add variety to the images
    a.RandomBrightness(limit=0.2, p=0.5),
    
    # Add some more augmentations with default parameters
    
    a.HorizontalFlip(), # a mirrored car is still a plausible photo, so horizontal flips are enabled;
                        # no vertical flip, because cars are always photographed upright
    a.HueSaturationValue(), # random hue and saturation
    a.RGBShift(),
    a.FancyPCA(alpha=0.1, 
               always_apply=False, 
               p=0.5),
    
    #  add OneOfs with default 50% probability for brightness contrast
    a.OneOf([
        a.RandomBrightnessContrast(brightness_limit=0.3, 
                                                contrast_limit=0.3),
        a.RandomBrightnessContrast(brightness_limit=0.1, 
                                                contrast_limit=0.1)],
        p=0.5)
])

#example
plt.figure(figsize = (12,8))
for i in range(9):
    img = augmentations(image = images_train[0])['image']
    plt.subplot(3, 3, i + 1)
    plt.imshow(img)
    plt.axis('off')
plt.show()

In [104]:
# FUNCTION FROM BASELINE NOT IN USE

# def make_augmentations(images):
#     print('applying augmentations', end = '')
#     augmented_images = np.empty(images.shape)
#     for i in range(images.shape[0]):
#         if i % 200 == 0:
#               print('.', end = '')
#     augment_dict = augmentation(image = images[i])
#     augmented_image = augment_dict['image']
#     augmented_images[i] = augmented_image
#     print('')
#     return augmented_images

In [105]:
# NLP part
tokenize = Tokenizer(num_words=MAX_WORDS)
tokenize.fit_on_texts(data.description)

In [106]:
def process_image(image):
    return augmentations(image = image.numpy())['image']

def tokenize_(descriptions):
    return sequence.pad_sequences(tokenize.texts_to_sequences(descriptions), maxlen = MAX_SEQUENCE_LENGTH)

def tokenize_text(text):
    return tokenize_([text.numpy().decode('utf-8')])[0]

def tf_process_train_dataset_element(image, table_data, text, price):
    im_shape = image.shape
    [image,] = tf.py_function(process_image, [image], [tf.uint8])
    image.set_shape(im_shape)
    [text,] = tf.py_function(tokenize_text, [text], [tf.int32])
    return (image, table_data, text), price

def tf_process_val_dataset_element(image, table_data, text, price):
    [text,] = tf.py_function(tokenize_text, [text], [tf.int32])
    return (image, table_data, text), price

train_dataset = tf.data.Dataset.from_tensor_slices((
    images_train, X_train, data.description.iloc[X_train.index], y_train
    )).map(tf_process_train_dataset_element)

test_dataset = tf.data.Dataset.from_tensor_slices((
    images_test, X_test, data.description.iloc[X_test.index], y_test
    )).map(tf_process_val_dataset_element)

y_sub = np.zeros(len(X_sub))
sub_dataset = tf.data.Dataset.from_tensor_slices((
    images_sub, X_sub, data.description.iloc[X_sub.index], y_sub
    )).map(tf_process_val_dataset_element)

#Check for errors:
train_dataset.__iter__().__next__();
test_dataset.__iter__().__next__();
sub_dataset.__iter__().__next__();

### Build convolutional network without "head"

In [107]:
efficientnet_model = tf.keras.applications.efficientnet.EfficientNetB3(weights = 'imagenet', include_top = False, input_shape = (size[1], size[0], 3))

#### Add fine-tuning (one attempt: unfreeze 50% of the layers)

In [108]:
efficientnet_model.trainable = True

# Fine-tune starting point
start_point = len(efficientnet_model.layers)//2

# Freeze the first half of the layers; the second half stays trainable
for layer in efficientnet_model.layers[:start_point]:
    layer.trainable =  False

In [110]:
# Check
for layer in efficientnet_model.layers:
    print(layer, layer.trainable)

In [111]:
efficientnet_output = L.GlobalAveragePooling2D()(efficientnet_model.output)

In [112]:
#tabular model NN
tabular_model = Sequential([
    L.Input(shape = X.shape[1]),
    L.Dense(512, activation = 'relu'),
    L.Dropout(0.5),
    L.Dense(256, activation = 'relu'),
    L.Dropout(0.5),
    ])

In [113]:
# NLP
nlp_model = Sequential([
    L.Input(shape=MAX_SEQUENCE_LENGTH, name="seq_description"),
    L.Embedding(len(tokenize.word_index)+1, MAX_SEQUENCE_LENGTH,),
    L.LSTM(256, return_sequences=True),
    L.Dropout(0.5),
    L.LSTM(128),
    L.Dropout(0.25),
    L.Dense(64),
    ])

In [114]:
#concatenate 3 NN
combinedInput = L.concatenate([efficientnet_output, tabular_model.output, nlp_model.output])

# building our regression head
head = L.Dense(256, activation="relu")(combinedInput)
head = L.Dense(1,)(head)

model = Model(inputs=[efficientnet_model.input, tabular_model.input, nlp_model.input], outputs=head)
model.summary()

In [115]:
optimizer = tf.keras.optimizers.Adam(0.005)
model.compile(loss='MAPE',optimizer=optimizer, metrics=['MAPE'])

## Add learning-rate scheduling
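
The plateau rule can be simulated without TensorFlow. The sketch below is a simplified model of the idea (it ignores the cooldown and minimum-delta options of the real callback): if the validation loss has not improved for `patience` epochs, the learning rate is multiplied by `factor`, but never drops below `min_lr`. The function name and the toy loss values are assumptions for illustration.

```python
def simulate_reduce_on_plateau(val_losses, lr, factor=0.2, patience=2, min_lr=1e-7):
    """Return the learning rate in effect after each epoch (simplified: no cooldown, no min_delta)."""
    best = float('inf')
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            # Improvement: remember it and reset the counter.
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                # Plateau: shrink the rate, but not below min_lr.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

toy_val_losses = [10.0, 9.0, 9.0, 9.0, 9.0, 9.0]
lr_history = simulate_reduce_on_plateau(toy_val_losses, lr=0.005)
print(lr_history)
```

With these toy losses the rate is cut fivefold twice: once after two stagnant epochs, and again after two more.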

In [116]:
# LR scheduler: reduce the rate after 2 epochs without improvement
lr_scheduler = ReduceLROnPlateau(monitor='val_loss',
                              factor=0.2, #let's reduce LR 5 times
                              patience=2, # if no improvement after 2 epoch - reduce LR
                              min_lr=0.0000001,
                              verbose=1,
                              mode='auto')

In [117]:
checkpoint = ModelCheckpoint('../working/best_model.hdf5', monitor='val_MAPE', verbose=0, mode='min', save_best_only=True)  # keep only the best epoch
earlystop = EarlyStopping(monitor='val_MAPE', patience=10, restore_best_weights=True,)
callbacks_list = [checkpoint, earlystop, lr_scheduler]

In [118]:
history = model.fit(train_dataset.batch(30),
                    epochs=100,
                    validation_data = test_dataset.batch(30),
                    callbacks=callbacks_list
                   )

In [119]:
plt.title('Loss')
plt.plot(history.history['MAPE'], label='train')
plt.plot(history.history['val_MAPE'], label='test')
plt.legend()  # show the train/test labels
plt.show();

In [120]:
model.load_weights('../working/best_model.hdf5')
model.save('../working/nn_final.hdf5')

In [121]:
test_predict_nn3 = model.predict(test_dataset.batch(30))
print(f"TEST mape: {(mape(y_test, test_predict_nn3[:,0]))*100:0.2f}%")

In [122]:
sub_predict_nn3 = model.predict(sub_dataset.batch(30))
sample_submission['price'] = sub_predict_nn3[:,0]
sample_submission.to_csv('nn3_submission.csv', index=False)

# Blend
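
Averaging helps when the models err in different directions. The toy sketch below uses made-up predictions chosen so that the errors cancel exactly; real models rarely cancel this cleanly, so treat it as an illustration of the mechanism rather than an expected gain.

```python
import numpy as np

def mape(y_true, y_pred):
    return np.mean(np.abs((y_pred - y_true) / y_true))

toy_true = np.array([100.0, 200.0])
pred_a = np.array([110.0, 180.0])   # overshoots the first car, undershoots the second
pred_b = np.array([90.0, 220.0])    # errs in the opposite direction
blend = (pred_a + pred_b) / 2       # simple unweighted average

print(mape(toy_true, pred_a), mape(toy_true, pred_b), mape(toy_true, blend))
```

Each toy model alone has a 10% error, while their average has none.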

In [128]:
blend_predict = (test_predict_catboost +
                 test_predict_nn1[:, 0] + test_predict_nn3[:, 0]) / 3
print(f"TEST mape: {(mape(y_test, blend_predict))*100:0.2f}%")

In [129]:
blend_sub_predict = (sub_predict_catboost +
                     sub_predict_nn1[:, 0] + sub_predict_nn3[:, 0]) / 3
sample_submission['price'] = blend_sub_predict
sample_submission.to_csv('blend_submission.csv', index=False)

# Conclusion:
- Cleaned the data (the bodyType, owners and name columns), generated 3 new features, and experimented with log-transforming some variables.
- Processed the descriptions with NLP: reduced each word to its normal form and tokenized the text by the most frequent words.
- Applied several image augmentation techniques; the Albumentations library worked best.
- Used ReduceLROnPlateau to manage the learning rate.
- My main takeaway is learning how to build a multiple-input NN and how to combine tabular data, NLP and images in one model.

The final MAPE with blending is **10.8%**, which beats the baseline and places me in the middle of the leaderboard.

What could be better? 
Because the convolutional network is very time-consuming to train, I did not try other fine-tuning options for EfficientNetB3. For the same reason I did not try other networks pretrained on ImageNet.

## Model Bonus: feature forwarding (no changes)

In [130]:
# MLP
model_mlp = Sequential()
model_mlp.add(L.Dense(512, input_dim=X_train.shape[1], activation="relu"))
model_mlp.add(L.Dropout(0.5))
model_mlp.add(L.Dense(256, activation="relu"))
model_mlp.add(L.Dropout(0.5))

In [132]:
# FEATURE Input
# Input
productiondate = L.Input(shape=[1], name="productionDate")
# Embeddings layers
emb_productiondate = L.Embedding(len(X.productionDate.unique().tolist())+1, 20)(productiondate)
f_productiondate = L.Flatten()(emb_productiondate)

In [133]:
combinedInput = L.concatenate([model_mlp.output, f_productiondate,])
# build our regression head
head = L.Dense(64, activation="relu")(combinedInput)
head = L.Dense(1, activation="linear")(head)

model = Model(inputs=[model_mlp.input, productiondate], outputs=head)

In [134]:
model.summary()

In [135]:
optimizer = tf.keras.optimizers.Adam(0.01)
model.compile(loss='MAPE',optimizer=optimizer, metrics=['MAPE'])