<h2 align=center>Recommendation model with collaborative filtering.</h2>

**Import libraries**

In [28]:

import sqlalchemy as sql
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
import numpy as np

ModuleNotFoundError: No module named 'tensorflow.python'

**1.- Prepare the data for the model**

Define the connection to the database.

In [38]:
engine = sql.create_engine(
    "mysql+pymysql://root:password@localhost:3307/data_warehouse_olist?charset=utf8mb4"
)
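
The connection string above embeds the password directly in the notebook. As a minimal sketch of one alternative, the URL can be assembled from environment variables. The `OLIST_DB_*` variable names and the `build_mysql_url` helper are hypothetical, introduced only for illustration; the defaults mirror the host, port, and database used above.

```python
import os
from urllib.parse import quote


def build_mysql_url(env=os.environ):
    """Assemble a SQLAlchemy-style MySQL URL from environment variables.

    The OLIST_DB_* names are hypothetical; adapt them to your own setup.
    """
    user = env.get("OLIST_DB_USER", "root")
    # Percent-encode the password so characters such as '@' do not break the URL.
    password = quote(env.get("OLIST_DB_PASSWORD", ""), safe="")
    host = env.get("OLIST_DB_HOST", "localhost")
    port = env.get("OLIST_DB_PORT", "3307")
    database = env.get("OLIST_DB_NAME", "data_warehouse_olist")
    return f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}?charset=utf8mb4"


# Example with an explicit mapping instead of the real environment.
url = build_mysql_url({"OLIST_DB_USER": "analyst", "OLIST_DB_PASSWORD": "p@ss word"})
print(url)
```

The resulting string can be passed to `sql.create_engine(...)` exactly as in the cell above.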

Create a dataframe with the required information from MySQL: customer ids, product ids, and review scores.

In [46]:
combined = pd.read_sql(sql="""
    SELECT customers.unique_id, order_items.product_id,  order_reviews.score
    FROM order_items
    LEFT JOIN orders ON (order_items.order_id = orders.order_id)
    LEFT JOIN customers ON (orders.customer_id = customers.customer_id)
    RIGHT JOIN order_reviews ON (orders.order_id = order_reviews.order_id);""", con=engine)


Drop duplicates.

In [47]:
combined.drop_duplicates(['unique_id','product_id'],inplace= True)

Next we prepare the data to build the user-product matrix.

The dataset contains a large number of customers and products but very few interactions between them. For that reason we keep only customers who submitted at least three ratings and products that were rated at least five times.
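
The filtering idea can be sketched on a tiny made-up dataframe. The `filter_sparse` helper and the toy data are assumptions for illustration only; the notebook itself implements the same thresholds with separate count dataframes in the cells that follow.

```python
import pandas as pd

# Made-up interactions, only to illustrate the thresholds described above.
toy = pd.DataFrame({
    "unique_id":  ["u1", "u1", "u1", "u2", "u2", "u3"],
    "product_id": ["p1", "p2", "p3", "p1", "p2", "p1"],
    "score":      [5, 4, 3, 2, 5, 1],
})


def filter_sparse(df, min_customer_ratings=3, min_product_ratings=5):
    """Keep rows whose customer and product both reach a minimum rating count."""
    # Both counts are computed on the unfiltered data, in a single pass.
    customer_count = df.groupby("unique_id")["score"].transform("count")
    product_count = df.groupby("product_id")["score"].transform("count")
    keep = (customer_count >= min_customer_ratings) & (product_count >= min_product_ratings)
    return df[keep]


# Lower product threshold so the toy data is not filtered out entirely.
filtered = filter_sparse(toy, min_customer_ratings=3, min_product_ratings=2)
print(filtered)
```

Only the rows of `u1` for `p1` and `p2` survive here: `u1` is the only customer with three ratings, and `p3` was rated just once.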

Create two dataframes.

In [48]:
df_products = combined[['product_id', 'score']]
df_customer = combined[['unique_id', 'score']]


Apply the filter to each of them.

In [49]:
df_products= df_products.groupby("product_id").aggregate({"score":"count"})
df_products.reset_index(inplace=True)
df_products.rename(columns={"score":"product_count"},inplace=True)
df_products = df_products[df_products["product_count"] > 4]

In [50]:
df_customer= df_customer.groupby("unique_id").aggregate({"score":"count"})
df_customer.reset_index(inplace=True)
df_customer.rename(columns={"score":"customer_count"}, inplace= True)
df_customer = df_customer[df_customer["customer_count"] > 2]

Use them to filter the main dataframe.

In [51]:
filter_ = combined["product_id"].isin(df_products["product_id"]) & combined["unique_id"].isin(df_customer["unique_id"])
combined = combined[filter_]


In [52]:
combined = pd.merge(left= combined, right = df_customer, on = 'unique_id',  how = 'left')
combined = pd.merge(left= combined, right = df_products, on = 'product_id',  how = 'left')

In [53]:
combined

Unnamed: 0,unique_id,product_id,score,customer_count,product_count
0,432aa6200ee9673be90863a912dc91dc,bed9b7934576c9ba61b6ba6f3babc698,5,3,6
1,fd8ccc89be43894d2553494c71a61fd8,8ae935cab2de3f74f4960de6ee604f90,4,3,5
2,a1a374f4c131638dc698c76bebd11769,8a443635fdf9759915c9be5be2e3b862,5,3,27
3,a1a374f4c131638dc698c76bebd11769,87d780fa7d2cf3710aa02dc4ca8db985,5,3,23
4,d75acd4c5b7b4dfd32b9d9172b195419,d0fe4295267f15ccaceac4fb233d8c9a,5,5,13
...,...,...,...,...,...
1014,dd8c09f1b309c9ffc302c745550a9ff3,372645c7439f9661fbbacfd129aa92ec,2,4,23
1015,dd8c09f1b309c9ffc302c745550a9ff3,525947dbe3304ac32bf51602f9557c12,2,4,10
1016,0aef107040099c08391f73e81821bbac,88c20c5a22f2ca169af8cfc2df00a7a2,1,3,12
1017,0aef107040099c08391f73e81821bbac,3625fbaf8284047185fb0351f2f84ae3,1,3,13


In [55]:
print('Number of unique products', combined['product_id'].nunique())
print('Number of unique users', combined['unique_id'].nunique())


Number of unique products 704
Number of unique users 528


Normalize the review score.

In [134]:
scaler = MinMaxScaler()
# fit_transform expects a 2-D input; ravel() flattens the result back into a single column
combined['score'] = scaler.fit_transform(combined[['score']].astype(float)).ravel()
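
As a worked example of what min-max scaling does to the scores, here is a plain-Python sketch. It assumes every value from 1 to 5 occurs in the data; `MinMaxScaler` uses the observed minimum and maximum, so the mapping would differ otherwise.

```python
def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max([1, 2, 3, 4, 5])
print(scaled)
```

One consequence worth keeping in mind: a score of 1 is mapped to 0, which is the same value later used to fill unrated entries of the user-product matrix.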

Create the user-product matrix.

In [135]:
combined = combined.drop_duplicates(['unique_id', 'product_id'])
user_products_matrix = combined.pivot(index='unique_id', columns='product_id', values='score')
user_products_matrix.fillna(0, inplace=True)
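
To make the shape of this matrix concrete, the same `pivot` and `fillna` calls can be tried on a small made-up dataframe (the ids and scores here are illustrative, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "unique_id":  ["u1", "u1", "u2"],
    "product_id": ["p1", "p2", "p2"],
    "score":      [1.0, 0.5, 0.75],
})

# One row per user, one column per product; missing pairs become 0.
matrix = toy.pivot(index="unique_id", columns="product_id", values="score").fillna(0)
print(matrix)
```

User `u2` never rated `p1`, so that cell is filled with 0.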

In [136]:
#probable error
users = user_products_matrix.index.tolist()
products = user_products_matrix.columns.tolist()

user_products_matrix = user_products_matrix.values

`tf.placeholder` only exists in the TensorFlow 1.x API, so we import the v1 compatibility module.

In [138]:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

Set some parameters.

In [139]:
num_input = combined['product_id'].nunique()
num_hidden_1 = 10
num_hidden_2 = 5

X = tf.placeholder(tf.float64, [None, num_input])

weights = {
    'encoder_h1': tf.Variable(tf.random_normal([num_input, num_hidden_1], dtype=tf.float64)),
    'encoder_h2': tf.Variable(tf.random_normal([num_hidden_1, num_hidden_2], dtype=tf.float64)),
    'decoder_h1': tf.Variable(tf.random_normal([num_hidden_2, num_hidden_1], dtype=tf.float64)),
    'decoder_h2': tf.Variable(tf.random_normal([num_hidden_1, num_input], dtype=tf.float64)),
}

biases = {
    'encoder_b1': tf.Variable(tf.random_normal([num_hidden_1], dtype=tf.float64)),
    'encoder_b2': tf.Variable(tf.random_normal([num_hidden_2], dtype=tf.float64)),
    'decoder_b1': tf.Variable(tf.random_normal([num_hidden_1], dtype=tf.float64)),
    'decoder_b2': tf.Variable(tf.random_normal([num_input], dtype=tf.float64)),
}

In [140]:
def encoder(x):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['encoder_h1']), biases['encoder_b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['encoder_h2']), biases['encoder_b2']))
    return layer_2

def decoder(x):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['decoder_h1']), biases['decoder_b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['decoder_h2']), biases['decoder_b2']))
    return layer_2
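
As a rough shape check of the encoder and decoder above, the forward pass can be reproduced with NumPy. This is a stand-in sketch, not the TensorFlow graph itself: the weights are random, the `forward` helper is hypothetical, and `num_input = 704` is taken from the product count printed earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
num_input, num_hidden_1, num_hidden_2 = 704, 10, 5


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# (weights, bias) per layer: two encoder layers followed by two decoder layers.
shapes = [(num_input, num_hidden_1), (num_hidden_1, num_hidden_2),
          (num_hidden_2, num_hidden_1), (num_hidden_1, num_input)]
layers = [(rng.standard_normal(s), rng.standard_normal(s[1])) for s in shapes]


def forward(x):
    """Apply sigmoid(x @ W + b) layer by layer and return every activation."""
    activations = []
    for w, b in layers:
        x = sigmoid(x @ w + b)
        activations.append(x)
    return activations


batch = np.zeros((35, num_input))  # one batch of 35 users
acts = forward(batch)
print([a.shape for a in acts])
```

The input is compressed from 704 columns to 10 and then 5, and expanded back to 704. Because the last layer is a sigmoid, every reconstructed value lies strictly between 0 and 1.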

Build the model and the predictions.

In [141]:
encoder_op = encoder(X)
decoder_op = decoder(encoder_op)

y_pred = decoder_op

y_true = X

Define the loss, the optimizer, and the evaluation metric.

In [142]:
loss = tf.losses.mean_squared_error(y_true, y_pred)
optimizer = tf.train.RMSPropOptimizer(0.03).minimize(loss)
eval_x = tf.placeholder(tf.int32)
eval_y = tf.placeholder(tf.int32)
pre, pre_op = tf.metrics.precision(labels=eval_x, predictions=eval_y)
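
To see what the mean squared error measures here, the computation can be written out in NumPy on a small made-up matrix. The `mse` helper is an illustrative stand-in for `tf.losses.mean_squared_error`, not a replacement for it.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean of the squared element-wise differences."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


# Two users, four products; zeros stand for unrated products.
y_true = np.array([[0.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 0.5, 0.0]])
all_zero = np.zeros_like(y_true)

loss_zero = mse(y_true, all_zero)
print(loss_zero)
```

Because the unrated entries were filled with 0, they count as targets in the average. On a very sparse matrix, a model that predicts values near 0 everywhere can therefore already reach a low loss, which is worth remembering when reading the training log.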

Initialize the variables and create an empty dataframe.

In [143]:
init = tf.global_variables_initializer()
local_init = tf.local_variables_initializer()
pred_data = pd.DataFrame()

Train for 50 epochs.
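
The training cell that follows splits the matrix into batches with `np.array_split`. As a quick sketch of how that split behaves, assuming the 528 users printed earlier and the batch size of 35 used in the cell (the four columns are arbitrary):

```python
import numpy as np

n_users, batch_size = 528, 35
num_batches = int(n_users / batch_size)

# array_split accepts uneven splits: the first batches get one extra row.
batches = np.array_split(np.zeros((n_users, 4)), num_batches)
sizes = [len(b) for b in batches]
print(num_batches, sizes)
```

Since 528 is not a multiple of 35, the batches are not all the same size: `array_split` spreads the remainder over the first batches rather than dropping rows.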

In [144]:
with tf.compat.v1.Session() as session:
    epochs = 50
    batch_size = 35

    session.run(init)
    session.run(local_init)

    num_batches = int(user_products_matrix.shape[0] / batch_size)
    user_products_matrix = np.array_split(user_products_matrix, num_batches)
    
    for i in range(epochs):

        avg_cost = 0
        for batch in user_products_matrix:
            _, l = session.run([optimizer, loss], feed_dict={X: batch})
            avg_cost += l

        avg_cost /= num_batches

        print("epoch: {} Loss: {}".format(i + 1, avg_cost))

    user_products_matrix = np.concatenate(user_products_matrix, axis=0)

    preds = session.run(decoder_op, feed_dict={X: user_products_matrix})

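    # DataFrame.append was deprecated in pandas 1.4 and removed in 2.0.
    # pred_data is still empty at this point, so build it directly from preds.
    pred_data = pd.DataFrame(preds)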

    pred_data = pred_data.stack().reset_index(name='score')
    pred_data.columns = ['unique_id', 'product_id', 'score']
    pred_data['unique_id'] = pred_data['unique_id'].map(lambda value: users[value])
    pred_data['product_id'] = pred_data['product_id'].map(lambda value: products[value])
    
    keys = ['unique_id', 'product_id']
    index_1 = pred_data.set_index(keys).index
    index_2 = combined.set_index(keys).index
    
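    # Drop the (user, product) pairs that already have a rating, then rank the rest.
    top_five_ranked = pred_data[~index_1.isin(index_2)]
    top_five_ranked = top_five_ranked.sort_values(['unique_id', 'score'], ascending=[True, False])
    # Note: despite the variable name, head(10) keeps the ten best products per user.
    top_five_ranked = top_five_ranked.groupby('unique_id').head(10)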

epoch: 1 Loss: 0.36914319296677905
epoch: 2 Loss: 0.36901794870694477
epoch: 3 Loss: 0.36884625256061554
epoch: 4 Loss: 0.36861097315947217
epoch: 5 Loss: 0.36828866104284924
epoch: 6 Loss: 0.3678473581870397
epoch: 7 Loss: 0.36724352339903515
epoch: 8 Loss: 0.3664180636405945
epoch: 9 Loss: 0.3652910590171814
epoch: 10 Loss: 0.3637548585732778
epoch: 11 Loss: 0.3616654922564824
epoch: 12 Loss: 0.3588319420814514
epoch: 13 Loss: 0.35500313341617584
epoch: 14 Loss: 0.3498541017373403
epoch: 15 Loss: 0.3429752637942632
epoch: 16 Loss: 0.33387985825538635
epoch: 17 Loss: 0.32205556829770404
epoch: 18 Loss: 0.30721646547317505
epoch: 19 Loss: 0.2902962913115819
epoch: 20 Loss: 0.2732970863580704
epoch: 21 Loss: 0.256226879854997
epoch: 22 Loss: 0.23706882695357004
epoch: 23 Loss: 0.21355329205592474
epoch: 24 Loss: 0.18378194669882456
epoch: 25 Loss: 0.14623686422904333
epoch: 26 Loss: 0.1016488845149676
epoch: 27 Loss: 0.05983076182504495
epoch: 28 Loss: 0.033700027503073215
epoch: 29 Los



Select a user and view the recommendations.

In [1]:
top_five_ranked.loc[top_five_ranked['unique_id'] == 'f0911a59fdcd8b103c3c87226d8769c5']


NameError: name 'top_five_ranked' is not defined
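
The lookup depends on `top_five_ranked`, which is built inside the training cell. The execution counter `In [1]` suggests the kernel had been restarted, which would explain the NameError; re-running the training cell first should define it. As a self-contained sketch of the ranking logic on made-up data (user ids, product ids, and scores are all illustrative, and only the single best product per user is kept):

```python
import pandas as pd

users = ["u1", "u2"]
products = ["p1", "p2", "p3"]
preds = [[0.9, 0.2, 0.6],   # predicted scores for u1
         [0.1, 0.8, 0.3]]   # predicted scores for u2
rated = pd.DataFrame({"unique_id": ["u1", "u2"], "product_id": ["p1", "p2"]})

# Long format: one row per (user, product) pair with its predicted score.
long = pd.DataFrame(
    [(u, p, preds[i][j]) for i, u in enumerate(users) for j, p in enumerate(products)],
    columns=["unique_id", "product_id", "score"],
)

keys = ["unique_id", "product_id"]
already_rated = long.set_index(keys).index.isin(rated.set_index(keys).index)

ranked = (long[~already_rated]
          .sort_values(["unique_id", "score"], ascending=[True, False])
          .groupby("unique_id")
          .head(1))
print(ranked)
```

For `u1` the highest prediction is `p1`, but that product was already rated, so the recommendation falls to `p3`.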