## Using self-trained embeddings from train_active on the description

This kernel shows how to use the Word2Vec model created in [this kernel](https://www.kaggle.com/christofhenkel/using-train-active-for-training-word-embeddings) on the description. To compare the performance with [the pre-trained embedding model](https://www.kaggle.com/christofhenkel/fasttext-starter-description-only) we use exactly the same model structure.

In [1]:
import pandas as pd
from gensim.models import word2vec
from keras.preprocessing import text, sequence
import numpy as np
from tqdm import tqdm
from keras.layers import Input, SpatialDropout1D,Dropout, GlobalAveragePooling1D, CuDNNGRU, Bidirectional, Dense, Embedding
from keras.models import Model
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
import keras.backend as K
import math
from sklearn import metrics
from sklearn.model_selection import train_test_split
import os


EMBEDDING = 'avito.vec'
TRAIN_CSV = 'train.csv'
TEST_CSV = 'test.csv'


Using TensorFlow backend.


In [2]:
text_cols = ['param_1','param_2','param_3','title','description']
# Non-text columns shared by train and test; text_cols already covers param_1..param_3.
other_cols = ['item_id','image_top_1','city','region','category_name','parent_category_name','user_type']
print('loading train...')
train = pd.read_csv(TRAIN_CSV, index_col = 'item_id', usecols = text_cols + other_cols + ['deal_probability'])
train_indices = train.index
print('loading test')
test = pd.read_csv(TEST_CSV, index_col = 'item_id', usecols = text_cols + other_cols)
test_indices = test.index
print('concat dfs')

loading train...
loading test
concat dfs


In [3]:
for txt in text_cols:
    train[txt]  = train[txt].astype(str)
    test[txt]  = test[txt].astype(str)

In [4]:
train['text'] = train[text_cols].apply(lambda x: ' '.join(x), axis=1)
train.drop(text_cols,axis = 1, inplace = True)

In [5]:
test['text'] = test[text_cols].apply(lambda x: ' '.join(x), axis=1)
test.drop(text_cols,axis = 1, inplace = True)

In [6]:
max_features = 100000
maxlen = 200
embed_size = 300
labels = train[['deal_probability']].copy()
train = train[['text']].copy()

tokenizer = text.Tokenizer(num_words=max_features)

print('fitting tokenizer...',end='')
tokenizer.fit_on_texts(list(train['text'].fillna('NA').values))
print('done.')

fitting tokenizer...done.


In [7]:
# Keep the gensim model under its own name; `model` is reused for the Keras network later.
w2v_model = word2vec.Word2Vec.load(EMBEDDING)
word_index = tokenizer.word_index
# Tokenizer indices start at 1, so the matrix needs one extra row for index 0.
nb_words = min(max_features, len(word_index) + 1)
embedding_matrix = np.zeros((nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    try:
        embedding_matrix[i] = w2v_model.wv[word]
    except KeyError:
        # Words missing from the Word2Vec vocabulary keep an all-zero row.
        pass

  


In [8]:
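EMBEDDING_FILE = 'wiki.ru.vec'
embeddings_index = {}
with open(EMBEDDING_FILE, encoding='utf8') as f:
    for line in f:
        # Split from the right so that tokens containing spaces stay intact.
        values = line.rstrip().rsplit(' ', embed_size)
        if len(values) != embed_size + 1:
            continue  # skip the "<vocab size> <dim>" header line of .vec files
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float16')
        embeddings_index[word] = coefs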
word_index = tokenizer.word_index
#prepare embedding matrix
num_words = min(max_features, len(word_index) + 1)
embedding_matrix_image = np.zeros((num_words, embed_size))
for word, i in word_index.items():
    if i >= max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix_image[i] = embedding_vector

In [27]:
translated_train = pd.read_csv('translated_train.csv')
translated_test = pd.read_csv('translated_test.csv')

In [28]:
translated_train.head()

Unnamed: 0,item_id,user_id,region,city,parent_category_name,category_name,param_1,param_2,param_3,title,description,price,item_seq_number,activation_date,user_type,image,image_top_1,deal_probability,en_desc,en_title
0,b912c3c6a6ad,e00f8ff2eaf9,Свердловская область,Екатеринбург,Личные вещи,Товары для детей и игрушки,Постельные принадлежности,,,Кокоби(кокон для сна),"Кокон для сна малыша,пользовались меньше месяц...",400.0,2,2017-03-28,Private,d10c7e016e03247a3bf2d13348fe959fe6f436c1caf64c...,1008.0,0.12789,"Cocoon for sleeping baby, enjoyed less than a ...",Kokobi (cocoon for sleep)
1,2dac0150717d,39aeb48f0017,Самарская область,Самара,Для дома и дачи,Мебель и интерьер,Другое,,,Стойка для Одежды,"Стойка для одежды, под вешалки. С бутика.",3000.0,19,2017-03-26,Private,79c9392cc51a9c81c6eb91eceb8e552171db39d7142700...,692.0,0.0,"Rack for clothes, under hangers. From the bout...",Rack for Clothes
2,ba83aefab5dc,91e2f88dd6e3,Ростовская область,Ростов-на-Дону,Бытовая электроника,Аудио и видео,"Видео, DVD и Blu-ray плееры",,,Philips bluray,"В хорошем состоянии, домашний кинотеатр с blu ...",4000.0,9,2017-03-20,Private,b7f250ee3f39e1fedd77c141f273703f4a9be59db4b48a...,3032.0,0.43177,"In good condition, home theater with blu ray, ...",Philips bluray
3,02996f1dd2ea,bf5cccea572d,Татарстан,Набережные Челны,Личные вещи,Товары для детей и игрушки,Автомобильные кресла,,,Автокресло,Продам кресло от0-25кг,2200.0,286,2017-03-25,Company,e6ef97e0725637ea84e3d203e82dadb43ed3cc0a1c8413...,796.0,0.80323,Selling an armchair from 0-25kg,Car seat
4,7c90be56d2ab,ef50846afc0b,Волгоградская область,Волгоград,Транспорт,Автомобили,С пробегом,ВАЗ (LADA),2110.0,"ВАЗ 2110, 2003",Все вопросы по телефону.,40000.0,3,2017-03-16,Private,54a687a3a0fc1d68aed99bdaaf551c5c70b761b16fd0a2...,2264.0,0.20797,All questions on the phone.,"VAZ 2110, 2003"


In [29]:
translated_train['text'] = translated_train['en_desc'] +' '+ translated_train['en_title']
translated_test['text'] = translated_test['en_desc'] +' '+ translated_test['en_title']

In [None]:
import re

def normalize(s):
    """
    Given a text, cleans and normalizes it. Feel free to add your own stuff.
    """
    s = s.lower()
    # Replace ips
    s = re.sub(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', ' _ip_ ', s)
    # Isolate punctuation
    s = re.sub(r'([\'\"\.\(\)\!\?\-\\\/\,])', r' \1 ', s)
    # Remove some special characters
    s = re.sub(r'([\;\:\|•«\n])', ' ', s)
    # Replace numbers and symbols with language
    s = s.replace('&', ' and ')
    s = s.replace('@', ' at ')
    s = s.replace('0', ' zero ')
    s = s.replace('1', ' one ')
    s = s.replace('2', ' two ')
    s = s.replace('3', ' three ')
    s = s.replace('4', ' four ')
    s = s.replace('5', ' five ')
    s = s.replace('6', ' six ')
    s = s.replace('7', ' seven ')
    s = s.replace('8', ' eight ')
    s = s.replace('9', ' nine ')
    return s


In [30]:
max_features = 100000
maxlen = 200
embed_size = 300
translated_train = translated_train[['text']].copy()
translated_test = translated_test[['text']].copy()


tokenizer_en = text.Tokenizer(num_words=max_features)

print('fitting tokenizer...',end='')
tokenizer_en.fit_on_texts(list(translated_train['text'].fillna('unknown').values) + list(translated_test['text'].fillna('unknown').values))
print('done.')

fitting tokenizer...done.


In [31]:


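EMBEDDING_FILE = 'glove.840B.300d.txt'
embeddings_index = {}
with open(EMBEDDING_FILE, encoding='utf8') as f:
    for line in f:
        # Some GloVe tokens contain spaces; splitting from the right keeps them in one piece.
        values = line.rstrip().rsplit(' ', embed_size)
        if len(values) != embed_size + 1:
            continue  # skip malformed lines
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float16')
        embeddings_index[word] = coefs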
word_index = tokenizer_en.word_index
#prepare embedding matrix
num_words = min(max_features, len(word_index) + 1)
embedding_matrix_image_two = np.zeros((num_words, embed_size))
for word, i in word_index.items():
    if i >= max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix_image_two[i] = embedding_vector

In [55]:
from fastText import load_model

In [66]:
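# EMBEDDING_FILE1 is not defined anywhere in this notebook; set it to the path of a fastText binary model first.
lang_model = load_model(EMBEDDING_FILE1)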
words_in_model = set(lang_model.get_words())
words_seen = set()
words_seen_in_model = set()
word_index = tokenizer.word_index
embedding_matrix_image_two = np.zeros((num_words, embed_size))
for word, i in word_index.items():
    #nonchars.update(set(word).difference( chars))
    if i >= max_features:
        continue
    embedding_vector = lang_model.get_word_vector(word)[:embed_size]
    words_seen.add(word)
    if word in words_in_model:
        words_seen_in_model.add(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix_image_two[i] = embedding_vector

In [32]:
X_eng_train = tokenizer_en.texts_to_sequences(translated_train['text'].values)
print('done.')
print('padding...',end='')
X_eng_train = sequence.pad_sequences(X_eng_train, maxlen=maxlen)
print('done.')
X_eng_test = translated_test['text'].values
X_eng_test = tokenizer_en.texts_to_sequences(X_eng_test)

print('padding')
X_eng_test = sequence.pad_sequences(X_eng_test, maxlen=maxlen)

done.
padding...done.
padding


In [9]:
X_train = tokenizer.texts_to_sequences(train['text'].values)
print('done.')
print('padding...',end='')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
print('done.')

#del train

done.
padding...done.


In [10]:
test = test[['text']].copy()
X_test = test['text'].values
X_test = tokenizer.texts_to_sequences(X_test)

print('padding')
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)

padding


In [11]:
maxlen

200

In [12]:
from sklearn.preprocessing import LabelEncoder,StandardScaler

gp = pd.read_csv('aggregated_features.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train = train.merge(gp, on='user_id', how='left')
test = test.merge(gp, on='user_id', how='left')
train['avg_days_up_user'] = np.log1p(train['avg_days_up_user'])
train['avg_times_up_user'] = np.log1p(train['avg_times_up_user'])
train['n_user_items'] = np.log1p(train['n_user_items'])
test['avg_days_up_user'] = np.log1p(test['avg_days_up_user'])
test['avg_times_up_user'] = np.log1p(test['avg_times_up_user'])
test['n_user_items'] = np.log1p(test['n_user_items'])


temp_all = pd.concat([train[['price','avg_days_up_user','avg_times_up_user','n_user_items']],
                      test[['price','avg_days_up_user','avg_times_up_user','n_user_items']]])
temp_all["price"] = np.log(temp_all["price"]+0.001)
train["price"] = np.log(train["price"]+0.001)
train["price"].fillna(temp_all['price'].mean(),inplace=True)
test["price"] = np.log(test["price"]+0.001)
test["price"].fillna(temp_all['price'].mean(),inplace=True)

# Users without aggregated history get the largest observed value of each feature.
extra_feat = ['avg_days_up_user','avg_times_up_user','n_user_items']
for ef in extra_feat:
    train[ef].fillna(temp_all[ef].max(),inplace=True)
    test[ef].fillna(temp_all[ef].max(),inplace=True)
features = train[['price','avg_days_up_user','avg_times_up_user','n_user_items']]
test_features = test[['price','avg_days_up_user','avg_times_up_user','n_user_items']]

ss = StandardScaler()
ss.fit(np.vstack([features, test_features]))
features = ss.transform(features)
test_features = ss.transform(test_features)

In [15]:
image_top_train = pd.read_csv("train_image_top_1_features.csv") 
image_top_test = pd.read_csv("test_image_top_1_features.csv") 
train['image_top_1'] = image_top_train['image_top_1']
test['image_top_1'] = image_top_test['image_top_1']


train['category_name'] = train['category_name'].astype('category')
train['parent_category_name'] = train['parent_category_name'].astype('category')
train['region'] = train['region'].astype('category')
train['city'] = train['city'].astype('category')
train['image_top_1'] = train['image_top_1'].fillna('missing')

test['category_name'] = test['category_name'].astype('category')
test['parent_category_name'] = test['parent_category_name'].astype('category')
test['region'] = test['region'].astype('category')
test['city'] = test['city'].astype('category')
test['image_top_1'] = test['image_top_1'].fillna('missing')

In [16]:
categorical = [
    'category_name','parent_category_name','region','city','image_top_1'
]
for feature in categorical:
    print(f'Transforming {feature}...')
    encoder = LabelEncoder()
    # Series.append was removed in pandas 2.0; pd.concat works across versions.
    encoder.fit(pd.concat([train[feature], test[feature]]).astype(str))
    train[feature] = encoder.transform(train[feature].astype(str))
    test[feature] = encoder.transform(test[feature].astype(str))

Transforming category_name...
Transforming parent_category_name...
Transforming region...
Transforming city...
Transforming image_top_1...


In [17]:
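# The encoders were fit on train and test together, so size the embeddings from both.
max_region = max(train.region.max(), test.region.max())+2
max_city= max(train.city.max(), test.city.max())+2
max_category_name = max(train.category_name.max(), test.category_name.max())+2
max_parent_category_name = max(train.parent_category_name.max(), test.parent_category_name.max())+2
max_image_top_1 = max(train.image_top_1.max(), test.image_top_1.max())+2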

In [17]:
print("DONE")

DONE


In [31]:
from keras.preprocessing.image import load_img, img_to_array

def data_gen(text_seqs, tabular, image_paths, label, batch_size=50):
    """Yield ([text, tabular, images], label) batches forever, wrapping around at the end."""
    image_paths = np.asarray(image_paths)
    size = len(text_seqs)
    current_id = 0
    while True:
        # Wrap around so that every batch holds exactly batch_size samples.
        batch_idx = np.arange(current_id, current_id + batch_size) % size
        # Advance the cursor; without this the generator would repeat the first batch.
        current_id = (current_id + batch_size) % size
        total_image = []
        for path in image_paths[batch_idx]:
            if str(path) != 'nan':
                # load_img expects target_size as (height, width).
                img = load_img(path, target_size=(224, 224))
                im = img_to_array(img) / 255.
            else:
                # Rows without a usable image get an all-zero placeholder.
                im = np.zeros(shape=(224, 224, 3))
            total_image.append(im)
        # Stack to shape (batch_size, 224, 224, 3).
        total_image = np.stack(total_image)
        yield [text_seqs[batch_idx], tabular[batch_idx], total_image], label[batch_idx]

In [29]:
import keras.backend as K
K.clear_session()

In [14]:


model = build_model()




In [15]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            (None, 200)          0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 200, 300)     30000000    input_2[0][0]                    
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 200, 300)     30000000    input_2[0][0]                    
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 200, 600)     0           embedding_1[0][0]                
                                                                 embedding_2[0][0]                
__________

In [21]:
test = pd.read_csv('test.csv', usecols = ['item_id', 'image'])

In [22]:
import os
from PIL import Image
from tqdm import tqdm_notebook as tqdm
test_image_path = []
test_dir = 'data/competition_files/test/test_jpg/'
for _ , c_row in tqdm(test.iterrows(), total=test.shape[0]):
    path = test_dir + str(c_row['image']) + '.jpg'
    # Keep the path only if the row has an image, the file exists and PIL can open it.
    if str(c_row['image']) != 'nan' and os.path.exists(path):
        try:
            Image.open(path).close()
            test_image_path.append(path)
        except IOError:
            test_image_path.append('nan')
    else:
        test_image_path.append('nan')






In [23]:
train = pd.read_csv('train.csv', usecols = ['item_id', 'image','deal_probability'])


In [24]:
import os
from PIL import Image
from tqdm import tqdm_notebook as tqdm
image_path = []
base_dir = 'data/competition_files/train/'
for _ , c_row in tqdm(train.iterrows(), total=train.shape[0]):
    found = 'nan'
    if str(c_row['image']) != 'nan':
        rel = str(c_row['deal_probability']) + '/' + str(c_row['image']) + '.jpg'
        # Images are split between train_jpg/ and val_jpg/; use the first location that exists.
        for sub in ('train_jpg/', 'val_jpg/'):
            path = base_dir + sub + rel
            if os.path.exists(path):
                try:
                    Image.open(path).close()
                    found = path
                except IOError:
                    pass
                break
    image_path.append(found)
# A NumPy array allows indexing with the fold index arrays later on.
image_path = np.array(image_path)






In [None]:
print('hi')

hi


In [None]:
from keras.preprocessing import image

test_images = []
for tim in tqdm(test_image_path):
    if (tim!="nan"):
        # load_img expects target_size as (height, width).
        img = image.load_img(tim, target_size=(224, 224))
        x = image.img_to_array(img)
        x = np.expand_dims(x, axis=0)
        x = x/255.
        test_images.append(x)
    else:
        # The placeholder needs the same leading batch axis as the real images for np.vstack.
        test_images.append(np.zeros(shape=(1,224,224,3)))
test_images = np.vstack(test_images)



In [None]:
len(X_train)

In [None]:
from sklearn.model_selection import KFold

RS = 20180601
folds = KFold(n_splits=10, shuffle=True, random_state=546789)
oof_preds = np.zeros(X_train.shape[0])

test_predicts_list = []
np.random.seed(RS)
for n_fold, (trn_idx, val_idx) in enumerate(folds.split(X_train)):
    batch_size = 50
    trn_x, trn_y, features_trn, img_path_trn = X_train[trn_idx], labels['deal_probability'].values[trn_idx],features[trn_idx],image_path[trn_idx]
    val_x, val_y, features_val, img_path_val = X_train[val_idx], labels['deal_probability'].values[val_idx], features[val_idx], image_path[val_idx]
    trn_length = len(trn_x)
    val_length = len(val_x)
    model = build_model()
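    model_GPU = multi_gpu_models(model, 4)
    # The multi-GPU wrapper is a new Model and must be compiled before training.
    model_GPU.compile(optimizer = Adam(lr=0.001), loss = 'mean_squared_error',
                      metrics =[root_mean_squared_error])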
    check_point = ModelCheckpoint('nlp.hdf5', monitor = "val_root_mean_squared_error", mode = "min", save_best_only = True, verbose = 1)
    early_stop = EarlyStopping(monitor="val_root_mean_squared_error", mode="min", patience=5)
    rlrop = ReduceLROnPlateau(monitor='val_root_mean_squared_error',mode='auto',patience=2,verbose=1,factor=0.5,cooldown=0,min_lr=1e-6)
    callbacks= [check_point, early_stop, rlrop]
    model_GPU.fit_generator(generator=data_gen(trn_x, features_trn,img_path_trn, trn_y),
                    steps_per_epoch=math.ceil(trn_length / batch_size),
                    verbose=1,
                    callbacks=callbacks,
                    validation_data=data_gen(val_x, features_val,img_path_val, val_y),
                    initial_epoch=0,
                    epochs=17,
                    use_multiprocessing=True,
                    max_queue_size=10,
                    workers = 20,
                    validation_steps=math.ceil(val_length / batch_size))
    
    val_images = []
    for tim in tqdm(img_path_val):
        if (tim!="nan"):
            img = image.load_img(tim, target_size=(224, 224))
            x = image.img_to_array(img)
            x = np.expand_dims(x, axis=0)
            x = x/255.
            val_images.append(x)
        else:
            # The placeholder needs the same leading batch axis as the real images for np.vstack.
            val_images.append(np.zeros(shape=(1,224,224,3)))
    val_images = np.vstack(val_images)

    # predict returns shape (n, 1); flatten it to fit the 1-D out-of-fold array.
    oof_preds[val_idx]  = model.predict([val_x,features_val, val_images]).reshape(-1)
    
    pred =  model.predict([X_test, test_features, test_images])
    test_predicts_list.append(pred)

In [18]:
from keras.layers import Lambda, concatenate
from keras import Model

import tensorflow as tf

def multi_gpu_models(model, gpus):
    if isinstance(gpus, (list, tuple)):
        num_gpus = len(gpus)
        target_gpu_ids = gpus
    else:
        num_gpus = gpus
        target_gpu_ids = range(num_gpus)

    def get_slice(data, i, parts):
        shape = tf.shape(data)
        batch_size = shape[:1]
        input_shape = shape[1:]
        step = batch_size // parts
        if i == num_gpus - 1:
            size = batch_size - step * i
        else:
            size = step
        size = tf.concat([size, input_shape], axis=0)
        stride = tf.concat([step, input_shape * 0], axis=0)
        start = stride * i
        return tf.slice(data, start, size)

    all_outputs = []
    for i in range(len(model.outputs)):
        all_outputs.append([])

    # Place a copy of the model on each GPU,
    # each getting a slice of the inputs.
    for i, gpu_id in enumerate(target_gpu_ids):
        with tf.device('/gpu:%d' % gpu_id):
            with tf.name_scope('replica_%d' % gpu_id):
                inputs = []
                # Retrieve a slice of the input.
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_i = Lambda(get_slice,
                                     output_shape=input_shape,
                                     arguments={'i': i,
                                                'parts': num_gpus})(x)
                    inputs.append(slice_i)

                # Apply model on slice
                # (creating a model replica on the target device).
                outputs = model(inputs)
                if not isinstance(outputs, list):
                    outputs = [outputs]

                # Save the outputs for merging back together later.
                for o in range(len(outputs)):
                    all_outputs[o].append(outputs[o])

    # Merge outputs on CPU.
    with tf.device('/cpu:0'):
        merged = []
        for name, outputs in zip(model.output_names, all_outputs):
            merged.append(concatenate(outputs,
                                       axis=0, name=name))
        return Model(model.inputs, merged)

In [67]:
from keras.layers import Concatenate, Flatten, Bidirectional, CuDNNLSTM, GlobalMaxPooling1D, concatenate,BatchNormalization
from keras.layers import PReLU, merge,GlobalAveragePooling2D, Conv2D, Conv1D
from keras.applications.resnet50 import ResNet50

def root_mean_squared_error(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_pred - y_true)))

def build_model():
    
    features_input = Input(shape=(features.shape[1],))
    feat_x = Dense(30, activation='sigmoid')(features_input)
    feat_x2 = Dense(30)(features_input)
    feat_x2 = PReLU()(feat_x2)
    x1 = concatenate([feat_x, feat_x2])
    x1 = Dropout(0.2)(x1)
    
    region = Input(shape=[1])
    city = Input(shape=[1])
    category_name = Input(shape=[1])
    parent_category_name = Input(shape=[1])
    image_top_1 = Input(shape=[1])
    
    emb_region = Embedding(max_region, 10)(region)
    emb_city = Embedding(max_city, 10)(city)
    emb_category_name = Embedding(max_category_name, 10)(category_name)
    emb_parent_category_name = Embedding(max_parent_category_name, 10)(parent_category_name)
    emb_image_top_1 =  Embedding(max_image_top_1, 10)(image_top_1)
    region_f = Flatten() (emb_region)
    city_f = Flatten() (emb_city)
    cat_name_f = Flatten() (emb_category_name)
    par_cat_name_f = Flatten() (emb_parent_category_name)
    ito_f = Flatten() (emb_image_top_1)
    categorical_feat = Concatenate()([region_f, city_f, cat_name_f, par_cat_name_f,ito_f])
    cat_x = Dense(30, activation='sigmoid')(categorical_feat)
    cat_x = Dropout(0.2)(cat_x)
    cat_x2 = Dense(30)(categorical_feat)
    cat_x2 = PReLU()(cat_x2)
    cat_x2 = Dropout(0.2)(cat_x2)
    cat_m = concatenate([cat_x, cat_x2])
    cat_m = Dropout(0.2)(cat_m)
    
    
    inp = Input(shape = (maxlen, ))
    inp_eng = Input(shape = (maxlen, ))
    emb = Embedding(nb_words, embed_size, weights = [embedding_matrix],
                    input_length = maxlen, trainable = False)(inp)
    
    embedding = Embedding(nb_words, embed_size, weights = [embedding_matrix_image],
                    input_length = maxlen, trainable = False)(inp)
    
    embedding_eng = Embedding(nb_words, embed_size, weights = [embedding_matrix_image_two],
                    input_length = maxlen, trainable = False)(inp_eng)
    
    #eng = Bidirectional(CuDNNGRU(256,return_sequences = True))(embedding_eng)
    #x_eng, x_h_eng, x_c_eng = CuDNNLSTM(256,return_sequences=True, return_state=True)(eng)
    #avg_pool_eng = GlobalAveragePooling1D()(x_eng)
    #max_pool_eng = GlobalMaxPooling1D()(x_eng)
    
    final_emb = Concatenate()([emb, embedding, embedding_eng])
    
    main = SpatialDropout1D(0.2)(final_emb)
    #main = Conv1D(256, kernel_size=3, padding='valid')(main)
    
    main = Bidirectional(CuDNNGRU(256,return_sequences = True))(main)
    
    main = Bidirectional(CuDNNGRU(256,return_sequences = True))(main)
    
    x, x_h, x_c = CuDNNLSTM(256,return_sequences=True, return_state=True)(main)
    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    x = concatenate([avg_pool, max_pool, x_h, x1, cat_m]) 
    x = BatchNormalization()(x)
    x3 = Dense(100, activation='sigmoid')(x)
    x3 = Dropout(0.2)(x3)
    x2  = Dense(100)(x)
    x2 = PReLU()(x2)
    x2 = Dropout(0.2)(x2)
    x = concatenate([x2,x3])
    x2 = Dense(30, activation='sigmoid')(x)
    x2 = Dropout(0.2)(x2)
    x3 = Dense(30, activation = 'linear')(x)
    x3 = Dropout(0.2)(x3)
    x4 = Dense(30)(x)
    x4 = PReLU()(x4)
    x4 = Dropout(0.2)(x4)
    x = concatenate([x2,x3,x4])
    main = Dropout(0.2)(x)
    out = Dense(1, activation = "sigmoid")(main)
    
    
    model = Model(inputs = [inp, inp_eng,features_input, region, city, category_name,
                            parent_category_name,image_top_1], outputs = out)

    model.compile(optimizer = Adam(lr=0.001), loss = 'mean_squared_error',
                  metrics =[root_mean_squared_error])
    return model

In [None]:
from sklearn.model_selection import KFold

RS = 20180601
folds = KFold(n_splits=10, shuffle=True, random_state=546789)
oof_preds = np.zeros(X_train.shape[0])
catt_feat = ['region','city','category_name','parent_category_name','image_top_1']
test_predicts_list = []
np.random.seed(RS)
for n_fold, (trn_idx, val_idx) in enumerate(folds.split(X_train)):
    # build_model() also takes the English sequences, so slice them per fold as well.
    trn_x, trn_eng_x, trn_y, features_trn, categ_trn = X_train[trn_idx], X_eng_train[trn_idx], labels['deal_probability'].values[trn_idx], features[trn_idx], train[catt_feat].loc[trn_idx].values
    val_x, val_eng_x, val_y, features_val, categ_val = X_train[val_idx], X_eng_train[val_idx], labels['deal_probability'].values[val_idx], features[val_idx], train[catt_feat].loc[val_idx].values
    
    model = build_model()
    model_gpu = multi_gpu_models(model,4)
    #model_gpu.compile(optimizer = Adam(lr=0.005), loss = 'mean_squared_error',
    model_gpu.compile(optimizer = Adam(lr=0.005), loss = root_mean_squared_error,
                  metrics =[root_mean_squared_error])
    
    check_point = ModelCheckpoint('nlp.hdf5', monitor = "val_root_mean_squared_error", mode = "min",
                                  save_best_only = True, verbose = 1)
    early_stop = EarlyStopping(monitor="val_root_mean_squared_error", mode="min", patience=5)
    rlrop = ReduceLROnPlateau(monitor='val_root_mean_squared_error',mode='auto',patience=1,verbose=1,
                              factor=0.5,cooldown=0,min_lr=1e-6)
    history = model_gpu.fit([trn_x,trn_eng_x, features_trn, categ_trn[:,0],categ_trn[:,1],categ_trn[:,2],categ_trn[:,3],
                             categ_trn[:,4]], trn_y, batch_size = 2000, epochs = 1000,
                            validation_data = ([val_x,val_eng_x, features_val,categ_val[:,0],categ_val[:,1],
                                                categ_val[:,2],categ_val[:,3],categ_val[:,4]], val_y),verbose = 1,
                            callbacks = [check_point, early_stop, rlrop])
    # The model has eight inputs, so the English sequences must be passed at prediction time too.
    oof_preds[val_idx]  = model_gpu.predict([val_x,val_eng_x,features_val,categ_val[:,0],categ_val[:,1],
                                                categ_val[:,2],categ_val[:,3],categ_val[:,4]]).reshape(-1)
    pred =  model_gpu.predict([X_test, X_eng_test,  test_features, test[catt_feat[0]],test[catt_feat[1]],test[catt_feat[2]],
                          test[catt_feat[3]],test[catt_feat[4]]])
    test_predicts_list.append(pred)

  


Train on 1353081 samples, validate on 150343 samples
Epoch 1/1000

## Train with English text

In [41]:
from sklearn.model_selection import KFold

RS = 20180601
folds = KFold(n_splits=10, shuffle=True, random_state=546789)
oof_preds = np.zeros(X_train.shape[0])
catt_feat = ['region','city','category_name','parent_category_name','image_top_1']
test_predicts_list = []
np.random.seed(RS)
for n_fold, (trn_idx, val_idx) in enumerate(folds.split(X_train)):
    trn_x, trn_eng_x, trn_y, features_trn, categ_trn = X_train[trn_idx], X_eng_train[trn_idx], labels['deal_probability'].values[trn_idx], features[trn_idx], train[catt_feat].loc[trn_idx].values
    val_x, val_eng_x, val_y, features_val, categ_val = X_train[val_idx], X_eng_train[val_idx], labels['deal_probability'].values[val_idx], features[val_idx], train[catt_feat].loc[val_idx].values
    
    model = build_model()
    model_gpu = multi_gpu_models(model,4)
    #model_gpu.compile(optimizer = Adam(lr=0.005), loss = 'mean_squared_error',
    model_gpu.compile(optimizer = Adam(lr=0.01), loss = root_mean_squared_error,
                  metrics =[root_mean_squared_error])
    
    check_point = ModelCheckpoint('nlp.hdf5', monitor = "val_root_mean_squared_error", mode = "min",
                                  save_best_only = True, verbose = 1)
    early_stop = EarlyStopping(monitor="val_root_mean_squared_error", mode="min", patience=5)
    rlrop = ReduceLROnPlateau(monitor='val_root_mean_squared_error',mode='auto',patience=1,verbose=1,
                              factor=0.5,cooldown=0,min_lr=1e-6)
    history = model_gpu.fit([trn_x,trn_eng_x, features_trn, categ_trn[:,0],categ_trn[:,1],categ_trn[:,2],categ_trn[:,3],
                             categ_trn[:,4]], trn_y, batch_size = 2000, epochs = 1000,
                            validation_data = ([val_x,val_eng_x, features_val,categ_val[:,0],categ_val[:,1],
                                                categ_val[:,2],categ_val[:,3],categ_val[:,4]], val_y),verbose = 1,
                            callbacks = [check_point, early_stop, rlrop])
    oof_preds[val_idx]  = model_gpu.predict([val_x,val_eng_x,features_val,categ_val[:,0],categ_val[:,1],
                                                categ_val[:,2],categ_val[:,3],categ_val[:,4]]).reshape(-1)
    pred =  model_gpu.predict([X_test, X_eng_test,  test_features, test[catt_feat[0]],test[catt_feat[1]],test[catt_feat[2]],
                          test[catt_feat[3]],test[catt_feat[4]]])
    test_predicts_list.append(pred)

  


Train on 1353081 samples, validate on 150343 samples
Epoch 1/1000

Epoch 00001: val_root_mean_squared_error improved from inf to 0.29576, saving model to nlp.hdf5
Epoch 2/1000

Epoch 00002: val_root_mean_squared_error did not improve from 0.29576

Epoch 00002: ReduceLROnPlateau reducing learning rate to 0.003999999910593033.
Epoch 3/1000

KeyboardInterrupt: 

In [37]:
ss = np.zeros(shape=(1000,1))

In [39]:
ss.reshape(-1).shape

(1000,)

In [None]:
fit_generator(generator=my_generator(train_generator),
                    steps_per_epoch=math.ceil(1104367 / batch_size),
                    verbose=1,
                    callbacks=callbacks,
                    validation_data=my_generator(validation_generator),
                    initial_epoch=16,
                    epochs=17,
                    use_multiprocessing=True,
                    max_queue_size=10,
                    workers = 20,
                    validation_steps=math.ceil(114799 / batch_size))

In [41]:
test_predicts = np.ones(test_predicts_list[0].shape)
for fold_predict in test_predicts_list:
    test_predicts *= fold_predict

test_predicts **= (1. / len(test_predicts_list))   
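
The cell above blends the fold predictions with a geometric mean: a running product followed by the 1/n power. As a minimal sketch of the same idea, the helper below does the computation in log space, which avoids underflow when many small probabilities are multiplied. The name `geometric_mean_blend` and the `eps` clipping value are illustrative assumptions, not part of the original notebook. Because of the clipping, an exact zero prediction yields a tiny positive value instead of 0.

```python
import math

def geometric_mean_blend(fold_predictions, eps=1e-12):
    """Blend per-fold predictions with a geometric mean computed in log space.

    fold_predictions: list of equally long lists, one list per fold.
    eps: hypothetical lower clip so that log(0) is never evaluated.
    """
    n_folds = len(fold_predictions)
    blended = []
    # zip(*...) walks over the folds' predictions for one test row at a time.
    for values in zip(*fold_predictions):
        log_sum = sum(math.log(max(v, eps)) for v in values)
        blended.append(math.exp(log_sum / n_folds))
    return blended

# Toy predictions from three folds for two test rows.
folds = [[0.10, 0.40], [0.20, 0.40], [0.40, 0.40]]
print(geometric_mean_blend(folds))
```

Compared with an arithmetic mean, the geometric mean pulls the blend towards the lower fold predictions, so a single fold that predicts close to zero has a strong effect on the result.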

In [43]:
np.save('predictions/four/rnn_mlp_oof_preds.npy',oof_preds)
np.save('predictions/four/rnn_mlp_test_predicts.npy',test_predicts)

In [50]:
sample_submission = pd.read_csv('sample_submission.csv', index_col = 0)
submission = sample_submission.copy()
submission['deal_probability'] = test_predicts
submission.to_csv('mark_rnn_1.csv')

In [49]:
# deal_probability is numeric, so test for missing values rather than comparing with ''.
submission[submission['deal_probability'].isnull()]

Let's define the model and illustrate the architecture.

Let's train our model for four epochs and save the best epoch.

In [8]:
EPOCHS = 4
file_path = "model.hdf5"

check_point = ModelCheckpoint(file_path, monitor = "val_loss", mode = "min", save_best_only = True, verbose = 1)
history = model.fit(X_train, y_train, batch_size = 256, epochs = EPOCHS, validation_data = (X_valid, y_valid),
                verbose = 1, callbacks = [check_point])

In [None]:
test = pd.read_csv(TEST_CSV, index_col = 0)
test = test[['description']].copy()

test['description'] = test['description'].astype(str)
X_test = test['description'].values
X_test = tokenizer.texts_to_sequences(X_test)

print('padding')
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
prediction = model.predict(X_test,batch_size = 128, verbose = 1)

sample_submission = pd.read_csv('../input/avito-demand-prediction/sample_submission.csv', index_col = 0)
submission = sample_submission.copy()
submission['deal_probability'] = prediction
submission.to_csv('submission.csv')

In [9]:
model.load_weights(file_path)
prediction = model.predict(X_valid)
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_valid, prediction)))

That's some improvement over [the pre-trained embedding model](https://www.kaggle.com/christofhenkel/fasttext-starter-description-only), which scored 0.2370. In addition, the embeddings here were also trained on param_1, param_2, param_3 and title, which contain many more out-of-vocabulary words when using fastText. Hence the self-trained embeddings perform better in this comparison.
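
One way to check the out-of-vocabulary argument is to measure how much of the text each embedding vocabulary covers. The sketch below is illustrative only: `embedding_coverage`, the whitespace tokenisation, and the toy texts and vocabularies are assumptions, not data from this competition.

```python
from collections import Counter

def embedding_coverage(texts, known_words):
    """Return (vocabulary coverage, token coverage) of known_words over texts.

    Texts are lower-cased and split on whitespace, which is a simplification
    of what a real tokenizer does.
    """
    counts = Counter(token for t in texts for token in t.lower().split())
    if not counts:
        return 0.0, 0.0
    covered = [w for w in counts if w in known_words]
    # Share of distinct words that have an embedding.
    vocab_cov = len(covered) / len(counts)
    # Share of all token occurrences that have an embedding.
    token_cov = sum(counts[w] for w in covered) / sum(counts.values())
    return vocab_cov, token_cov

# Toy data, purely illustrative.
texts = ["продам айфон", "айфон 6s 64gb", "продам велосипед"]
pretrained_vocab = {"продам", "велосипед"}
self_trained_vocab = {"продам", "айфон", "велосипед", "6s"}

print(embedding_coverage(texts, pretrained_vocab))
print(embedding_coverage(texts, self_trained_vocab))
```

With real data, `known_words` would be the set of words known to each embedding (for example the keys of `embeddings_index`), and `texts` the concatenated text column.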

Ok, now we are ready to do a submission and compare the LB score with [the pre-trained embedding model](https://www.kaggle.com/christofhenkel/fasttext-starter-description-only).