## Bauman Moscow State Technical University
## Faculty: Informatics, Artificial Intelligence and Control Systems
## Department: Information Processing and Control Systems
## Course: Machine Learning Methods
## Lab 4: "Building a Recommender Model"
## Author: D. A. Bogdanov, group ИУ5-24М

Assignment:

1. Choose any dataset suitable for building recommender models.
2. Using the lecture materials, generate recommendations for a single user (object) in two different ways of your choice.
3. Compare the resulting recommendations (using metrics, where possible).

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import datetime
import matplotlib.pyplot as plt
from sklearn.svm import SVR
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.impute import MissingIndicator
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.datasets import load_wine
from sklearn.datasets import load_linnerud
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances, manhattan_distances
from IPython.display import Image
import scipy.stats as stats
%matplotlib inline 
sns.set(style="ticks")

In [76]:
data = pd.read_csv('all_games.csv')
data.head()

| | name | platform | release_date | summary | meta_score | user_review |
|---|---|---|---|---|---|---|
| 0 | The Legend of Zelda: Ocarina of Time | Nintendo 64 | November 23, 1998 | As a young boy, Link is tricked by Ganondorf, ... | 99 | 9.1 |
| 1 | Tony Hawk's Pro Skater 2 | PlayStation | September 20, 2000 | As most major publishers' development efforts ... | 98 | 7.4 |
| 2 | Grand Theft Auto IV | PlayStation 3 | April 29, 2008 | [Metacritic's 2008 PS3 Game of the Year; Also ... | 98 | 7.7 |
| 3 | SoulCalibur | Dreamcast | September 8, 1999 | This is a tale of souls and swords, transcendi... | 98 | 8.4 |
| 4 | Grand Theft Auto IV | Xbox 360 | April 29, 2008 | [Metacritic's 2008 Xbox 360 Game of the Year; ... | 98 | 7.9 |


In [77]:
data.shape

(18800, 6)

In [78]:
df_md_with_summary = data[data['summary'].notnull()]
df_md_with_summary = df_md_with_summary[~df_md_with_summary['summary'].str.isspace()]
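The filter above drops missing and whitespace-only summaries. The sketch below runs the same two steps on a made-up frame to show one detail: `str.isspace()` returns `False` for an empty string, so an empty summary would survive. The toy data and the helper name `drop_blank_summaries` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd


def drop_blank_summaries(df: pd.DataFrame, column: str = "summary") -> pd.DataFrame:
    """Keep rows whose text column is present and not blank (hypothetical helper)."""
    text = df[column]
    keep = text.notna() & text.fillna("").str.strip().ne("")
    return df[keep]


# Made-up rows: a real description, a missing value, spaces only, an empty string.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "summary": ["A real description", None, "   ", ""],
})

# The notebook's two-step filter: the empty string in row D is kept.
notebook_style = toy[toy["summary"].notnull()]
notebook_style = notebook_style[~notebook_style["summary"].str.isspace()]

# The stricter variant also drops the empty string.
stricter = drop_blank_summaries(toy)

print(list(notebook_style["name"]))
print(list(stricter["name"]))
```

Whether the stricter variant changes anything for `all_games.csv` depends on whether the file contains empty summaries, which the notebook does not show.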

In [79]:
name = df_md_with_summary['name'].values
name[0:6]


array(['The Legend of Zelda: Ocarina of Time', "Tony Hawk's Pro Skater 2",
       'Grand Theft Auto IV', 'SoulCalibur', 'Grand Theft Auto IV',
       'Super Mario Galaxy'], dtype=object)

In [80]:
platform = df_md_with_summary['platform'].values
platform[0:6]

array([' Nintendo 64', ' PlayStation', ' PlayStation 3', ' Dreamcast',
       ' Xbox 360', ' Wii'], dtype=object)

In [81]:
metaScore = df_md_with_summary['meta_score'].values
metaScore[0:6]

array([99, 98, 98, 98, 98, 97])

In [82]:
summary = df_md_with_summary['summary'].values
summary[0:6]

array(['As a young boy, Link is tricked by Ganondorf, the King of the Gerudo Thieves. The evil human uses Link to gain access to the Sacred Realm, where he places his tainted hands on Triforce and transforms the beautiful Hyrulean landscape into a barren wasteland. Link is determined to fix the problems he helped to create, so with the help of Rauru he travels through time gathering the powers of the Seven Sages.',
       "As most major publishers' development efforts shift to any number of next-generation platforms, Tony Hawk 2 will likely stand as one of the last truly fantastic games to be released on the PlayStation.",
       '[Metacritic\'s 2008 PS3 Game of the Year; Also known as "GTA IV"] What does the American Dream mean today? For Niko Belic, fresh off the boat from Europe. It\'s the hope he can escape his past. For his cousin, Roman, it is the vision that together they can find fortune in Liberty City, gateway to the land of opportunity. As they slip into debt and are dragged

Vectorize the descriptions with TfidfVectorizer:

In [83]:
tfidfv = TfidfVectorizer()
summary_matrix = tfidfv.fit_transform(summary)
summary_matrix


<18686x38922 sparse matrix of type '<class 'numpy.float64'>'
	with 1265978 stored elements in Compressed Sparse Row format>
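To make the TF-IDF step more concrete, here is a minimal sketch on a made-up three-document corpus (the corpus and the `toy_*` names are assumptions). It illustrates two properties of `TfidfVectorizer` with default settings: every row is scaled to unit Euclidean length, and a word that occurs in every document gets a lower weight than a rarer word in the same document.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration): "mario" occurs in every document.
toy_docs = [
    "mario jumps through space",
    "mario races karts",
    "mario plays tennis",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# With the default norm="l2", each row has unit Euclidean length.
squared = toy_matrix.multiply(toy_matrix).sum(axis=1)
row_norms = np.sqrt(np.asarray(squared).ravel())

# Within the first document, the rarer word gets the larger weight.
vocab = toy_vectorizer.vocabulary_
w_mario = toy_matrix[0, vocab["mario"]]
w_space = toy_matrix[0, vocab["space"]]

print(toy_matrix.shape)
print(row_norms)
print(w_mario < w_space)
```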

And with CountVectorizer:

In [84]:
countv = CountVectorizer()
summary_matrix_co = countv.fit_transform(summary)
summary_matrix_co

<18686x38922 sparse matrix of type '<class 'numpy.int64'>'
	with 1265978 stored elements in Compressed Sparse Row format>
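Both matrices report the same shape and the same number of stored elements. A small sketch on a made-up corpus (all names here are assumptions) suggests why: with default settings the two vectorizers tokenize identically, so they build the same vocabulary and the same sparsity pattern, and differ only in the stored values (raw integer counts versus weighted, normalized floats).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (made up for illustration).
toy_docs = [
    "wii sports wii remote",
    "space adventure",
    "sports adventure on wii",
]

toy_count_vectorizer = CountVectorizer()
toy_counts = toy_count_vectorizer.fit_transform(toy_docs)
toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

# Same vocabulary size and same non-zero pattern in both matrices.
same_shape = toy_counts.shape == toy_tfidf.shape
same_nnz = toy_counts.nnz == toy_tfidf.nnz

# CountVectorizer stores raw term counts: "wii" appears twice in the first document.
wii_count = toy_counts[0, toy_count_vectorizer.vocabulary_["wii"]]

print(same_shape, same_nnz, wii_count)
```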

In [85]:
class SimpleKNNRecommender:

    def __init__(self, X_matrix, X_name, X_platform, X_summary):
        """
        Parameters:
        X_matrix - training set (object-feature matrix)
        X_name - array of object names
        X_platform - array of object platforms
        X_summary - array of object descriptions
        """
        # Store the parameters on the instance
        self._X_matrix = X_matrix
        self.df = pd.DataFrame(
            {'name': pd.Series(X_name, dtype='str'),
             'platform': pd.Series(X_platform, dtype='str'),
             'summary': pd.Series(X_summary, dtype='str'),
             'dist': pd.Series([], dtype='float')})

    def recommend_for_single_object(self, K: int,
                X_matrix_object, cos_flag=True, manh_flag=False):
        """
        Build recommendations for a single object.
        Parameters:
        K - number of neighbours to recommend
        X_matrix_object - row of the object-feature matrix for the query object
        cos_flag - rank by cosine similarity
        manh_flag - rank by Manhattan distance (used only when cos_flag is False;
                    otherwise Euclidean distance is used)
        Returns: the K nearest neighbours found
        """

        scale = 1000000
        if cos_flag:
            # Cosine similarity: larger values mean closer objects
            dist = cosine_similarity(self._X_matrix, X_matrix_object)
            # Flatten the (n, 1) result so it can be stored as a column
            self.df['dist'] = dist.ravel() * scale
            res = self.df.sort_values(by='dist', ascending=False)
            # Drop rows with similarity 1: the query object itself
            # and any object with an identical description
            res = res[res['dist'] < scale]

        else:
            if manh_flag:
                dist = manhattan_distances(self._X_matrix, X_matrix_object)
            else:
                dist = euclidean_distances(self._X_matrix, X_matrix_object)
            # Distances: smaller values mean closer objects
            self.df['dist'] = dist.ravel() * scale
            res = self.df.sort_values(by='dist', ascending=True)
            # Drop rows with zero distance: the query object itself
            # and any object with an identical description
            res = res[res['dist'] > 0.0]

        # Keep the first K recommendations
        res = res.head(K)
        return res
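The two branches of `recommend_for_single_object` can rank the same objects very differently. The sketch below uses made-up count vectors (the vectors and variable names are assumptions) to illustrate one known effect: Euclidean distance on raw counts is sensitive to vector length, so short documents can end up closer than a long document on the same topic, while cosine similarity depends only on direction. After L2 normalization the Euclidean ranking agrees with the cosine ranking, since for unit vectors the squared distance equals 2 - 2 * cosine.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.preprocessing import normalize

# Made-up term-count vectors over a 4-word vocabulary.
query = np.array([[2.0, 2.0, 0.0, 0.0]])
docs = np.array([
    [6.0, 5.0, 0.0, 0.0],  # 0: long document on the same topic
    [1.0, 0.0, 1.0, 0.0],  # 1: short, partly related document
    [0.0, 0.0, 1.0, 0.0],  # 2: short, unrelated document
])

# Cosine similarity: sort descending (most similar first).
cos_order = np.argsort(-cosine_similarity(docs, query).ravel())

# Euclidean distance on raw counts: sort ascending (closest first).
euc_order = np.argsort(euclidean_distances(docs, query).ravel())

# Euclidean distance after scaling every vector to unit length.
euc_norm_order = np.argsort(
    euclidean_distances(normalize(docs), normalize(query)).ravel()
)

print(cos_order, euc_order, euc_norm_order)
```

In this toy case the raw Euclidean ranking puts the long same-topic document last, which is the opposite of the cosine ranking.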

In [100]:
test_game_name = 5
name[test_game_name]

'Super Mario Galaxy'

Take its row from the TF-IDF matrix:

In [101]:
test_game_matrix = summary_matrix[test_game_name]
test_game_matrix

<1x38922 sparse matrix of type '<class 'numpy.float64'>'
	with 105 stored elements in Compressed Sparse Row format>

In [102]:
skr1 = SimpleKNNRecommender(summary_matrix, name, platform, summary)

In [103]:
test = df_md_with_summary.iloc[test_game_name]
test

name                                           Super Mario Galaxy
platform                                                      Wii
release_date                                    November 12, 2007
summary         [Metacritic's 2007 Wii Game of the Year] The u...
meta_score                                                     97
user_review                                                   9.1
Name: 5, dtype: object

In [104]:
rec1 = skr1.recommend_for_single_object(10, test_game_matrix)
rec1

| | name | platform | summary | dist |
|---|---|---|---|---|
| 3188 | Mario Kart Wii | Wii | Mario Kart Wii comes with the intuitive Wii Wh... | 395605.935463 |
| 1075 | New Super Mario Bros. Wii | Wii | New Super Mario Bros. Wii offers a combination... | 390494.857117 |
| 3064 | Super Mario 3D All-Stars | Switch | Play three of Mario’s greatest 3D platform adv... | 368606.459438 |
| 16084 | Wii Play | Wii | In Wii Play, the star of the game is you. Crea... | 352888.817182 |
| 135 | Super Mario 3D World | Wii U | Work together with your friends or compete for... | 351778.776477 |
| 15206 | Wii Play: Motion | Wii | With the June 13 launch of Wii Play: Motion, W... | 331220.622109 |
| 13247 | NHL 2K9 | Wii | Wii Remote Shooting – Execute wrist shots with... | 321918.343124 |
| 11570 | Mario Super Sluggers | Wii | In the same spirit as other sports-themed Mari... | 321211.97192 |
| 18171 | Furu Furu Park | PC | Furu Furu Park is a collection of 30 arcade-st... | 306305.767881 |
| 223 | Super Mario World: Super Mario Advance 2 | Game Boy Advance | Mario hasn't even had a chance to savor his vi... | 303054.238122 |


In [105]:
test_game_matrix_co = summary_matrix_co[test_game_name]
test_game_matrix_co

<1x38922 sparse matrix of type '<class 'numpy.int64'>'
	with 105 stored elements in Compressed Sparse Row format>

In [106]:
skr2 = SimpleKNNRecommender(summary_matrix_co, name, platform, summary)


Now generate recommendations from the descriptions vectorized with CountVectorizer, using Euclidean distance:

In [107]:
rec2 = skr2.recommend_for_single_object(10, test_game_matrix_co, cos_flag = False)
rec2


| | name | platform | summary | dist |
|---|---|---|---|---|
| 18658 | Jenga World Tour | Wii | Jenga is based on the world famous wooden bloc... | 20174240.0 |
| 18587 | Jenga World Tour | DS | Jenga is based on the world famous wooden bloc... | 20174240.0 |
| 17066 | Transformers: Revenge of the Fallen | Switch | The fate of the earth and the survival of the ... | 20322400.0 |
| 3207 | Europa 1400: The Guild | PC | It's the year 1400. Europe sees the dawn of th... | 20856650.0 |
| 12600 | Ride 2 | PC | Welcome to the temple of motorcycles, the only... | 20880610.0 |
| 13973 | The Council - Episode 5: Checkmate | PlayStation 4 | Checkmate is the fifth and final episode of Th... | 20880610.0 |
| 16355 | The Council - Episode 5: Checkmate | PSP | Checkmate is the fifth and final episode of Th... | 20880610.0 |
| 479 | Okami | Wii | Okami tells the story of a mythical sun god wh... | 21000000.0 |
| 3329 | The Lord of the Rings: The Two Towers | PlayStation 2 | Join the fellowship and experience all of the ... | 21071310.0 |
| 12439 | Cursed Mountain | Wii | Set in the late 1980s, gamers take on the role... | 21071310.0 |
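One simple way to compare the two lists, in the spirit of item 3 of the assignment, is to measure their overlap. The sketch below hard-codes the names and platforms from the two outputs above. The helper names `jaccard` and `platform_share` are assumptions, and the share of recommendations on the query game's platform (Wii) is only a crude proxy for relevance, not a metric taken from the notebook.

```python
def jaccard(a, b):
    """Jaccard index of two collections, treated as sets (hypothetical helper)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def platform_share(platforms, target):
    """Fraction of recommendations released on the target platform."""
    return sum(p == target for p in platforms) / len(platforms)


# Names and platforms copied from the TF-IDF + cosine output (rec1).
rec1_names = [
    "Mario Kart Wii", "New Super Mario Bros. Wii", "Super Mario 3D All-Stars",
    "Wii Play", "Super Mario 3D World", "Wii Play: Motion", "NHL 2K9",
    "Mario Super Sluggers", "Furu Furu Park",
    "Super Mario World: Super Mario Advance 2",
]
rec1_platforms = ["Wii", "Wii", "Switch", "Wii", "Wii U", "Wii", "Wii", "Wii",
                  "PC", "Game Boy Advance"]

# Names and platforms copied from the CountVectorizer + Euclidean output (rec2).
rec2_names = [
    "Jenga World Tour", "Jenga World Tour", "Transformers: Revenge of the Fallen",
    "Europa 1400: The Guild", "Ride 2", "The Council - Episode 5: Checkmate",
    "The Council - Episode 5: Checkmate", "Okami",
    "The Lord of the Rings: The Two Towers", "Cursed Mountain",
]
rec2_platforms = ["Wii", "DS", "Switch", "PC", "PC", "PlayStation 4", "PSP",
                  "Wii", "PlayStation 2", "Wii"]

overlap = jaccard(rec1_names, rec2_names)
wii_share_rec1 = platform_share(rec1_platforms, "Wii")
wii_share_rec2 = platform_share(rec2_platforms, "Wii")

print(overlap, wii_share_rec1, wii_share_rec2)
```

The two lists share no titles, and the cosine-based list contains twice as many Wii games as the Euclidean-based one (6 of 10 versus 3 of 10).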
