## Loading libraries

In [16]:
!pip install tensorflow==2.3.0

time: 2.53 s (started: 2022-01-11 20:13:39 +00:00)


In [17]:
!pip install --no-warn-conflicts -q deepctr

time: 724 ms (started: 2022-01-11 20:13:42 +00:00)


In [18]:
!pip install ipython-autotime

time: 2.64 s (started: 2022-01-11 20:13:43 +00:00)


In [19]:
%load_ext autotime

The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 5.9 ms (started: 2022-01-11 20:13:45 +00:00)


In [20]:
import tensorflow as tf
import pandas as pd
import numpy as np

time: 796 µs (started: 2022-01-11 20:13:45 +00:00)


In [21]:
tf.__version__

'2.3.0'

time: 8.03 ms (started: 2022-01-11 20:13:45 +00:00)


## Loading data

In [22]:
# Download the dataset from Google Drive
!gdown "https://drive.google.com/uc?export=download&id=1EI9nUbHp0g-hkAg7Dx1nmjDJovhDYdfL"

Downloading...
From: https://drive.google.com/uc?export=download&id=1EI9nUbHp0g-hkAg7Dx1nmjDJovhDYdfL
To: /content/winemag-data-130k-v2.csv.zip
100% 17.2M/17.2M [00:00<00:00, 151MB/s]
time: 3.04 s (started: 2022-01-11 20:13:45 +00:00)


In [23]:
# Unzip the archive
!unzip -qq -o winemag-data-130k-v2.csv.zip -d winemag

time: 532 ms (started: 2022-01-11 20:13:48 +00:00)


In [24]:
def get_data():
    df = pd.read_csv('/content/winemag/winemag-data-130k-v2.csv')
    df.drop(columns='Unnamed: 0', inplace=True)

    print(f"Loaded dataset shape: {df.shape}")
    print(f"Column names: {df.columns}")

    return df

time: 2.47 ms (started: 2022-01-11 20:13:49 +00:00)


In [25]:
def preprocess_data(input_df):
    df = input_df.copy()

    # Replace the hyphenated variety name to simplify later text processing
    df.loc[df.variety == 'G-S-M', 'variety'] = 'GSM'
    # Extract the vintage year from the review title (first four-digit run)
    df['year'] = df['title'].str.extract(r'(\d{4})')

    return df

time: 2.11 ms (started: 2022-01-11 20:13:49 +00:00)


In [26]:
data = preprocess_data(get_data())
data.head()

Loaded dataset shape: (129971, 13)
Column names: Index(['country', 'description', 'designation', 'points', 'price', 'province',
       'region_1', 'region_2', 'taster_name', 'taster_twitter_handle', 'title',
       'variety', 'winery'],
      dtype='object')


Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,year
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,2013
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,2011
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,2013
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,2013
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,2012


time: 1.16 s (started: 2022-01-11 20:13:49 +00:00)
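
The year extraction in `preprocess_data` keeps the first run of four digits found in the title, as a string. A minimal sketch of that behaviour on three titles (the first is from the preview above, the other two are hypothetical examples) shows what happens when a title has no four-digit run and when the first four-digit run is not a vintage. The stricter pattern at the end is an assumption about plausible vintages (1900 to 2099), not something the notebook uses.

```python
import pandas as pd

titles = pd.Series([
    "Nicosia 2013 Vulkà Bianco (Etna)",          # from the preview above
    "Example Winery NV Brut (Champagne)",        # hypothetical: no four-digit run
    "1000 Stories 2014 Zinfandel (California)",  # hypothetical: first match is not a year
])

# expand=False returns a Series instead of a one-column DataFrame.
years = titles.str.extract(r"(\d{4})", expand=False)
print(years.tolist())

# Stricter variant (assumption): only accept 19xx or 20xx as a whole word.
strict = titles.str.extract(r"\b((?:19|20)\d{2})\b", expand=False)
print(strict.tolist())
```

With the original pattern the third title yields `'1000'`; the stricter pattern yields `'2014'`. Titles without a match produce a missing value in both cases.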


## DeepFM approach

### Imports

In [27]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder
from deepctr.feature_column import SparseFeat, get_feature_names

from sklearn.model_selection import train_test_split

from deepctr.models import DeepFM

from sklearn.metrics import mean_squared_error

time: 469 ms (started: 2022-01-11 20:13:50 +00:00)


### Data

#### Working dataset (modified for the model)

In [28]:
df = data.copy()
df.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,year
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,2013
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,2011
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,2013
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,2013
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,2012


time: 39.5 ms (started: 2022-01-11 20:13:50 +00:00)


In [29]:
df['id'] = np.arange(df.shape[0])

time: 4.11 ms (started: 2022-01-11 20:13:51 +00:00)


#### Wine database

In [30]:
database = df.copy()
database.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,year,id
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,2013,0
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,2011,1
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,2013,2
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,2013,3
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,2012,4


time: 38 ms (started: 2022-01-11 20:13:51 +00:00)


### Training the model

###### Feature embedding

In [31]:
sparse_features = ["id", "taster_name", "country", "price"]
target = ['points']

time: 1.26 ms (started: 2022-01-11 20:13:51 +00:00)


In [32]:
for feat in sparse_features:
    lbe = LabelEncoder()
    df[feat] = lbe.fit_transform(df[feat])

time: 106 ms (started: 2022-01-11 20:13:51 +00:00)
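
The loop reuses a single `lbe` variable, so the mapping from integer codes back to the original values (for example, which taster has code 14) is not kept. The sketch below keeps one vocabulary per column. It uses `pandas.factorize` with `sort=True` as a stand-in for `LabelEncoder`, which is an assumption made to keep the example dependency-light: both assign codes in sorted order of the unique values, but they may treat missing values differently. The toy values are illustrative.

```python
import pandas as pd

# Toy frame; column names mirror the notebook, values are illustrative.
toy = pd.DataFrame({
    "taster_name": ["Roger Voss", "Paul Gregutt", "Roger Voss"],
    "country": ["Portugal", "US", "France"],
})

vocabularies = {}
encoded = toy.copy()
for feat in ["taster_name", "country"]:
    codes, uniques = pd.factorize(toy[feat], sort=True)
    encoded[feat] = codes
    vocabularies[feat] = uniques  # position i holds the original value for code i

print(encoded)

# Decode code 1 of taster_name back to the original string.
print(vocabularies["taster_name"][1])
```

With `LabelEncoder` itself, the same effect can be had by storing each fitted encoder in a dictionary and calling its `inverse_transform` method later.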


In [33]:
fixlen_feature_columns = [SparseFeat(feat, df[feat].max() + 1, embedding_dim=4)
                              for feat in sparse_features]
linear_feature_columns = fixlen_feature_columns
dnn_feature_columns = fixlen_feature_columns
feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)   

time: 19 ms (started: 2022-01-11 20:13:51 +00:00)
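
`SparseFeat` receives `df[feat].max() + 1` as the vocabulary size because the encoded values run from 0 to the maximum code, and each code selects one row of an embedding table. A NumPy sketch of that lookup follows; the random table stands in for trained weights, and this illustrates the idea rather than DeepCTR's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

codes = np.array([0, 3, 1, 3])          # label-encoded feature values
vocabulary_size = int(codes.max()) + 1  # slots 0..3, so 4 rows
embedding_dim = 4                       # same value as in the notebook

# One row per code; random values stand in for trained weights.
table = rng.normal(size=(vocabulary_size, embedding_dim))

# Lookup is plain row indexing: one embedding_dim vector per input value.
vectors = table[codes]
print(vectors.shape)
```

Equal codes always map to the same vector, which is why a feature such as `id`, unique for every row, gets one embedding row per wine review.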


###### Generate train/test data for the model

In [34]:
train, test = train_test_split(df, test_size=0.2, random_state=2020)
train_model_input = {name: train[name].values for name in feature_names}
test_model_input = {name: test[name].values for name in feature_names}

time: 87.5 ms (started: 2022-01-11 20:13:51 +00:00)
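
The model input is a dictionary that maps each feature name to a one-dimensional array, with all arrays aligned row by row. The sketch below builds the same structure on a toy frame. A NumPy permutation stands in for `train_test_split` (an assumption for self-containment), so the rows selected differ from what scikit-learn would pick for the same seed.

```python
import numpy as np
import pandas as pd

# Toy, already label-encoded frame (values are illustrative).
toy = pd.DataFrame({
    "id": np.arange(10),
    "taster_name": [0, 1, 0, 2, 1, 0, 2, 1, 0, 2],
    "points": [87, 90, 85, 92, 88, 86, 91, 89, 84, 93],
})
feature_names = ["id", "taster_name"]

# Shuffle row positions, then cut off 20% for the test set.
rng = np.random.default_rng(2020)
order = rng.permutation(len(toy))
n_test = int(round(0.2 * len(toy)))
test = toy.iloc[order[:n_test]]
train = toy.iloc[order[n_test:]]

train_model_input = {name: train[name].values for name in feature_names}
test_model_input = {name: test[name].values for name in feature_names}

print({name: arr.shape for name, arr in train_model_input.items()})
```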


###### Train the model

In [35]:
model_deep_fm = DeepFM(linear_feature_columns, dnn_feature_columns, task='regression')
model_deep_fm.compile("adam", "mse", metrics=['mse'], )

time: 2.34 s (started: 2022-01-11 20:13:51 +00:00)


In [36]:
history = model_deep_fm.fit(
    train_model_input,
    train[target].values,
    batch_size=256,
    epochs=30,
    verbose=2,
    validation_split=0.2,
    )


Epoch 1/30


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


325/325 - 3s - loss: 1199.0714 - mse: 1199.0713 - val_loss: 7.0303 - val_mse: 7.0297
Epoch 2/30
325/325 - 3s - loss: 4.2347 - mse: 4.2337 - val_loss: 6.3079 - val_mse: 6.3065
Epoch 3/30
325/325 - 3s - loss: 1.4917 - mse: 1.4902 - val_loss: 6.1279 - val_mse: 6.1262
Epoch 4/30
325/325 - 3s - loss: 0.5058 - mse: 0.5040 - val_loss: 6.0693 - val_mse: 6.0674
Epoch 5/30
325/325 - 3s - loss: 0.4852 - mse: 0.4831 - val_loss: 6.0134 - val_mse: 6.0111
Epoch 6/30
325/325 - 3s - loss: 0.2776 - mse: 0.2751 - val_loss: 5.9933 - val_mse: 5.9906
Epoch 7/30
325/325 - 3s - loss: 0.2420 - mse: 0.2392 - val_loss: 6.0173 - val_mse: 6.0145
Epoch 8/30
325/325 - 3s - loss: 0.2630 - mse: 0.2600 - val_loss: 5.9825 - val_mse: 5.9794
Epoch 9/30
325/325 - 3s - loss: 0.2041 - mse: 0.2009 - val_loss: 5.9631 - val_mse: 5.9598
Epoch 10/30
325/325 - 3s - loss: 0.2148 - mse: 0.2113 - val_loss: 5.9414 - val_mse: 5.9378
Epoch 11/30
325/325 - 3s - loss: 0.2585 - mse: 0.2547 - val_loss: 5.9296 - val_mse: 5.9256
Epoch 12/30
3
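
In the logged epochs, training loss falls below 0.3 while validation loss levels off near 5.9, which suggests overfitting. A likely contributor is the `id` feature, which is unique per row and lets the model memorise training examples. The pure-Python sketch below applies patience-based early stopping to the eleven logged `val_loss` values; `patience` and `min_delta` are arbitrary values chosen for illustration. In Keras the corresponding mechanism is the `tf.keras.callbacks.EarlyStopping` callback.

```python
# val_loss values for epochs 1-11, copied from the training log.
val_losses = [7.0303, 6.3079, 6.1279, 6.0693, 6.0134, 5.9933,
              6.0173, 5.9825, 5.9631, 5.9414, 5.9296]


def early_stopping_epoch(losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based.

    Stops once `patience` consecutive epochs fail to improve on the
    best loss by more than `min_delta`; otherwise runs to the end.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(losses), best_epoch


no_threshold = early_stopping_epoch(val_losses, patience=2)
with_threshold = early_stopping_epoch(val_losses, patience=2, min_delta=0.05)
print(no_threshold)    # (11, 11)
print(with_threshold)  # (7, 5)
```

Without a threshold the validation loss is still improving slightly at epoch 11; requiring an improvement of at least 0.05 would stop at epoch 7 and keep the weights from epoch 5.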

In [37]:
pred_ans = model_deep_fm.predict(test_model_input, batch_size=256)

time: 381 ms (started: 2022-01-11 20:15:16 +00:00)


###### Compute the error on the test set

In [38]:
# squared=False returns the root of the mean squared error
rmse = mean_squared_error(test[target].values, pred_ans, squared=False)
mse = mean_squared_error(test[target].values, pred_ans)

print(f"test MSE: {round(mse, 4)}")
print(f"test RMSE: {round(rmse, 4)}")

test MSE: 5.7222
test RMSE: 2.3921
time: 7.33 ms (started: 2022-01-11 20:15:17 +00:00)
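
MSE is the mean of the squared errors and RMSE is its square root, so RMSE is in the same units as the target (points). In `mean_squared_error`, `squared=False` selects the RMSE; newer scikit-learn releases provide `root_mean_squared_error` for the same purpose. A NumPy sketch with made-up values, followed by a check on the two numbers reported for the test set: 2.3921 is the square root of 5.7222, so 5.7222 is the MSE and 2.3921 the RMSE.

```python
import numpy as np

# Made-up targets and predictions for illustration.
y_true = np.array([87.0, 90.0, 85.0, 92.0])
y_pred = np.array([86.0, 91.5, 85.0, 89.0])

mse = np.mean((y_true - y_pred) ** 2)  # mean of squared errors
rmse = np.sqrt(mse)                    # same units as the target

print(f"MSE:  {mse:.4f}")   # 3.0625
print(f"RMSE: {rmse:.4f}")  # 1.7500

# The two test-set numbers are related in the same way.
reported_check = round(float(np.sqrt(5.7222)), 4)
print(reported_check)       # 2.3921
```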


### Recommendations

#### Predict the rating for one user and one wine

(To see what input the model expects and what output it returns)

In [39]:
user_id = 14

time: 1.39 ms (started: 2022-01-11 20:15:17 +00:00)


In [40]:
single_item = df [df['taster_name'] == user_id][:1]
single_item

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,year,id
2,40,"Tart and snappy, the flavors of lime flesh and...",,87,10,Oregon,Willamette Valley,Willamette Valley,14,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,2013,2


time: 46.9 ms (started: 2022-01-11 20:15:17 +00:00)


In [41]:
single_input = single_item[['taster_name', 'id', 'country', 'price']]
single_input

Unnamed: 0,taster_name,id,country,price
2,14,2,40,10


time: 13.7 ms (started: 2022-01-11 20:15:17 +00:00)


In [42]:
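one_prediction_request = {
    'taster_name': single_input['taster_name'].values,
    # Note: this passes user_id (14) as the wine id, not the id of the
    # selected row (2); single_input['id'].values would pass the row's own id.
    'id': np.array([user_id]),
    'country': single_input['country'].values,
    'price': single_input['price'].values,
    }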
one_prediction_request

{'country': array([40]),
 'id': array([14]),
 'price': array([10]),
 'taster_name': array([14])}

time: 7.15 ms (started: 2022-01-11 20:15:17 +00:00)


In [43]:
single_rating_prediction = model_deep_fm.predict(one_prediction_request)
single_rating_prediction

array([[86.7459]], dtype=float32)

time: 47.1 ms (started: 2022-01-11 20:15:17 +00:00)


#### Match the predicted rating to the wine in the original dataset

In [44]:
database[database['id'] == one_prediction_request['id'][0]]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,year,id
14,US,Building on 150 years and six generations of w...,,87,12.0,California,Central Coast,Central Coast,Matt Kettmann,@mattkettmann,Mirassou 2012 Chardonnay (Central Coast),Chardonnay,Mirassou,2012,14


time: 65.3 ms (started: 2022-01-11 20:15:17 +00:00)


#### Predict ratings for one user and every wine in the dataset

In [45]:
def prepare_input_data(df, user_id, item_ids=None):
    # Use every row unless a collection of item ids is given
    rows = df[df['id'].isin(item_ids)] if item_ids is not None else df
    feature_columns = rows[['id', 'country', 'price']]

    # Repeat the encoded user id once per wine
    # (np.int was removed in recent NumPy releases, so use np.int64)
    user_id_multiplied = np.full(
        shape=len(rows),
        fill_value=user_id,
        dtype=np.int64
        )

    prediction_request = {
        'taster_name': user_id_multiplied,
        'id': feature_columns['id'].values,
        'country': feature_columns['country'].values,
        'price': feature_columns['price'].values,
    }

    return prediction_request


time: 4.39 ms (started: 2022-01-11 20:15:17 +00:00)
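
A usage sketch for the `item_ids` argument on a toy frame. The function is restated in compact form so the block runs on its own; the `item_ids is not None` check and the `np.int64` dtype are choices made for this sketch, and the toy values are illustrative.

```python
import numpy as np
import pandas as pd


def prepare_input_data(df, user_id, item_ids=None):
    # Keep only the requested wines, or every row when item_ids is None.
    rows = df[df["id"].isin(item_ids)] if item_ids is not None else df
    return {
        "taster_name": np.full(len(rows), user_id, dtype=np.int64),
        "id": rows["id"].values,
        "country": rows["country"].values,
        "price": rows["price"].values,
    }


# Toy, already label-encoded frame.
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "country": [22, 31, 40, 40],
    "price": [390, 11, 10, 9],
})

subset_request = prepare_input_data(toy, user_id=14, item_ids=[1, 3])
full_request = prepare_input_data(toy, user_id=14)
print(subset_request)
print(len(full_request["id"]))
```

Every array in the returned dictionary has the same length, one entry per candidate wine, with the user code repeated in `taster_name`.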


In [46]:
all_dataset_for_one_user = prepare_input_data(df, user_id)
all_dataset_for_one_user

{'country': array([22, 31, 40, ..., 15, 15, 15]),
 'id': array([     0,      1,      2, ..., 129968, 129969, 129970]),
 'price': array([390,  11,  10, ...,  26,  28,  17]),
 'taster_name': array([14, 14, 14, ..., 14, 14, 14])}

time: 8.84 ms (started: 2022-01-11 20:15:17 +00:00)


In [47]:
multiple_ratings_prediction = model_deep_fm.predict(all_dataset_for_one_user)
multiple_ratings_prediction

array([[89.536354],
       [87.69852 ],
       [86.61835 ],
       ...,
       [89.7418  ],
       [89.96265 ],
       [88.985985]], dtype=float32)

time: 6.97 s (started: 2022-01-11 20:15:17 +00:00)


#### Match the predicted ratings to wines in the original dataset

In [48]:
ids = all_dataset_for_one_user['id']
ids

array([     0,      1,      2, ..., 129968, 129969, 129970])

time: 4.83 ms (started: 2022-01-11 20:15:24 +00:00)


In [49]:
ratings = np.concatenate(multiple_ratings_prediction)
ratings

array([89.536354, 87.69852 , 86.61835 , ..., 89.7418  , 89.96265 ,
       88.985985], dtype=float32)

time: 97.2 ms (started: 2022-01-11 20:15:24 +00:00)


In [50]:
assert len(ids) == len(ratings)

time: 1.33 ms (started: 2022-01-11 20:15:24 +00:00)


In [51]:
id_rating_predictions_df = pd.DataFrame({'id': ids, 'rating': ratings})
id_rating_predictions_df

Unnamed: 0,id,rating
0,0,89.536354
1,1,87.698517
2,2,86.618347
3,3,89.654991
4,4,90.911331
...,...,...
129966,129966,90.433212
129967,129967,91.555176
129968,129968,89.741798
129969,129969,89.962646


time: 15.5 ms (started: 2022-01-11 20:15:24 +00:00)


In [52]:
top_ten_wines = id_rating_predictions_df.sort_values('rating', ascending=False)[:10].reset_index(drop=True)
top_ten_wines

Unnamed: 0,id,rating
0,111755,102.731369
1,7335,101.929695
2,111753,101.557327
3,60881,101.322952
4,118058,101.132088
5,45781,100.994263
6,58352,100.774719
7,111754,100.744438
8,39287,100.741348
9,111756,100.538147


time: 32.1 ms (started: 2022-01-11 20:15:24 +00:00)
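
All ten predicted ratings exceed 100 because a regression output is unbounded. The sketch below selects the top rows with `DataFrame.nlargest` and optionally clips the predictions for display. The 80 to 100 bounds are an assumption about the rating scale, not something the notebook states, and the toy values are illustrative.

```python
import pandas as pd

predictions = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "rating": [89.5, 102.7, 86.6, 100.9, 90.9],
})

# Same rows as sort_values('rating', ascending=False)[:3] when ratings are distinct.
top_three = predictions.nlargest(3, "rating").reset_index(drop=True)

# Clipping creates ties at the upper bound, so rank on the raw rating
# and clip only the value that is shown to the user.
top_three["rating_clipped"] = top_three["rating"].clip(lower=80, upper=100)
print(top_three)
```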


In [53]:
original_wines_unordered = database[database['id'].isin(top_ten_wines['id'])]
original_wines_unordered

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,year,id
7335,Italy,Thick as molasses and dark as caramelized brow...,Occhio di Pernice,100,210.0,Tuscany,Vin Santo di Montepulciano,,,,Avignonesi 1995 Occhio di Pernice (Vin Santo ...,Prugnolo Gentile,Avignonesi,1995,7335
39287,Italy,Here's a “wow” wine you won't easily forget. M...,Messorio,99,320.0,Tuscany,Toscana,,,,Le Macchiole 2007 Messorio Merlot (Toscana),Merlot,Le Macchiole,2007,39287
45781,Italy,"This gorgeous, fragrant wine opens with classi...",Riserva,100,550.0,Tuscany,Brunello di Montalcino,,Kerin O’Keefe,@kerinokeefe,Biondi Santi 2010 Riserva (Brunello di Montal...,Sangiovese,Biondi Santi,2010,45781
58352,France,"This is a magnificently solid wine, initially ...",,100,150.0,Bordeaux,Saint-Julien,,Roger Voss,@vossroger,Château Léoville Barton 2010 Saint-Julien,Bordeaux-style Red Blend,Château Léoville Barton,2010,58352
60881,Italy,Always a standout among Gaja's five single-vin...,Sorì San Lorenzo,99,440.0,Piedmont,Langhe,,,,Gaja 2007 Sorì San Lorenzo Nebbiolo (Langhe),Nebbiolo,Gaja,2007,60881
111753,France,"Almost black in color, this stunning wine is g...",,100,1500.0,Bordeaux,Pauillac,,Roger Voss,@vossroger,Château Lafite Rothschild 2010 Pauillac,Bordeaux-style Red Blend,Château Lafite Rothschild,2010,111753
111754,Italy,It takes only a few moments before you appreci...,Cerretalto,100,270.0,Tuscany,Brunello di Montalcino,,,,Casanova di Neri 2007 Cerretalto (Brunello di...,Sangiovese Grosso,Casanova di Neri,2007,111754
111755,France,This is the finest Cheval Blanc for many years...,,100,1500.0,Bordeaux,Saint-Émilion,,Roger Voss,@vossroger,Château Cheval Blanc 2010 Saint-Émilion,Bordeaux-style Red Blend,Château Cheval Blanc,2010,111755
111756,France,"A hugely powerful wine, full of dark, brooding...",,100,359.0,Bordeaux,Saint-Julien,,Roger Voss,@vossroger,Château Léoville Las Cases 2010 Saint-Julien,Bordeaux-style Red Blend,Château Léoville Las Cases,2010,111756
118058,US,This wine dazzles with perfection. Sourced fro...,La Muse,100,450.0,California,Sonoma County,Sonoma,,,Verité 2007 La Muse Red (Sonoma County),Bordeaux-style Red Blend,Verité,2007,118058


time: 30.4 ms (started: 2022-01-11 20:15:24 +00:00)


In [54]:
final_top_ten = original_wines_unordered.merge(top_ten_wines, on = 'id').sort_values('rating', ascending=False)
final_top_ten

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,year,id,rating
7,France,This is the finest Cheval Blanc for many years...,,100,1500.0,Bordeaux,Saint-Émilion,,Roger Voss,@vossroger,Château Cheval Blanc 2010 Saint-Émilion,Bordeaux-style Red Blend,Château Cheval Blanc,2010,111755,102.731369
0,Italy,Thick as molasses and dark as caramelized brow...,Occhio di Pernice,100,210.0,Tuscany,Vin Santo di Montepulciano,,,,Avignonesi 1995 Occhio di Pernice (Vin Santo ...,Prugnolo Gentile,Avignonesi,1995,7335,101.929695
5,France,"Almost black in color, this stunning wine is g...",,100,1500.0,Bordeaux,Pauillac,,Roger Voss,@vossroger,Château Lafite Rothschild 2010 Pauillac,Bordeaux-style Red Blend,Château Lafite Rothschild,2010,111753,101.557327
4,Italy,Always a standout among Gaja's five single-vin...,Sorì San Lorenzo,99,440.0,Piedmont,Langhe,,,,Gaja 2007 Sorì San Lorenzo Nebbiolo (Langhe),Nebbiolo,Gaja,2007,60881,101.322952
9,US,This wine dazzles with perfection. Sourced fro...,La Muse,100,450.0,California,Sonoma County,Sonoma,,,Verité 2007 La Muse Red (Sonoma County),Bordeaux-style Red Blend,Verité,2007,118058,101.132088
2,Italy,"This gorgeous, fragrant wine opens with classi...",Riserva,100,550.0,Tuscany,Brunello di Montalcino,,Kerin O’Keefe,@kerinokeefe,Biondi Santi 2010 Riserva (Brunello di Montal...,Sangiovese,Biondi Santi,2010,45781,100.994263
3,France,"This is a magnificently solid wine, initially ...",,100,150.0,Bordeaux,Saint-Julien,,Roger Voss,@vossroger,Château Léoville Barton 2010 Saint-Julien,Bordeaux-style Red Blend,Château Léoville Barton,2010,58352,100.774719
6,Italy,It takes only a few moments before you appreci...,Cerretalto,100,270.0,Tuscany,Brunello di Montalcino,,,,Casanova di Neri 2007 Cerretalto (Brunello di...,Sangiovese Grosso,Casanova di Neri,2007,111754,100.744438
1,Italy,Here's a “wow” wine you won't easily forget. M...,Messorio,99,320.0,Tuscany,Toscana,,,,Le Macchiole 2007 Messorio Merlot (Toscana),Merlot,Le Macchiole,2007,39287,100.741348
8,France,"A hugely powerful wine, full of dark, brooding...",,100,359.0,Bordeaux,Saint-Julien,,Roger Voss,@vossroger,Château Léoville Las Cases 2010 Saint-Julien,Bordeaux-style Red Blend,Château Léoville Las Cases,2010,111756,100.538147


time: 40.3 ms (started: 2022-01-11 20:15:24 +00:00)
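
The `isin` filter returns rows in database order, which is why the merged result has to be sorted again. A sketch of an alternative: pandas' default inner merge preserves the order of the left frame's keys, so merging from the ranked frame keeps the ranking without a second sort. The frames and titles below are placeholders.

```python
import pandas as pd

database = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "title": ["Wine A", "Wine B", "Wine C", "Wine D"],  # placeholder titles
})

# Already ranked by predicted rating, best first.
top = pd.DataFrame({"id": [3, 0, 2], "rating": [101.2, 100.5, 99.8]})

# The left frame drives the row order of the result.
ranked = top.merge(database, on="id")
print(ranked)
```

This also removes the intermediate `isin` step, since the inner merge keeps only the ids present in the ranked frame.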
