RecSys 기초 대회 강의에서는Book Crossing 데이터를 사용하여, 모든 실습 및 미션, 대회를 진행합니다. [Kaggle Book-Crossing](https://www.kaggle.com/datasets/ruchi798/bookcrossing-dataset) 을 출처로 하며, 데이터는 재구성되어 제공되었습니다. 해당 데이터는 CC0: Public Domain 라이센스임을 밝힙니다.


# [1] 데이터 불러오기



본 대회에 활용되는 데이터는 총 3개의 파일입니다.

- user : 사용자 정보를 담고 있는 데이터 파일

- ratings: 책의 등급을 1-10으로 표현한 데이터 파일

- books: 책과 관련된 정보를 표현한 데이터 파일


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re

In [2]:
path='./data/'

users = pd.read_csv(path+'users.csv')
books = pd.read_csv(path+'books.csv')
ratings = pd.read_csv(path+'train_ratings.csv')

print('users shape: ', users.shape)
print('books shape: ', books.shape)
print('ratings shape: ', ratings.shape)

users shape:  (68092, 3)
books shape:  (149570, 10)
ratings shape:  (306795, 3)


In [3]:
test_ratings = pd.read_csv(path+'test_ratings.csv')

# [2] users

사용자의 정보를 담고 있는 파일입니다.

총 70753 명의 사용자 정보를 담고 있습니다.

`user_id`, `location`, `age` 컬럼으로 구성되어있습니다.

`user_id`는 unique한 값을 나타냅니다.


In [4]:
users.head()

Unnamed: 0,user_id,location,age
0,8,"timmins, ontario, canada",
1,11400,"ottawa, ontario, canada",49.0
2,11676,"n/a, n/a, n/a",
3,67544,"toronto, ontario, canada",30.0
4,85526,"victoria, british columbia, canada",36.0


In [5]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68092 entries, 0 to 68091
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   user_id   68092 non-null  int64  
 1   location  68092 non-null  object 
 2   age       40259 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.6+ MB


In [6]:
users.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
user_id,68092.0,139381.329539,80523.969862,8.0,69008.75,138845.5,209388.25,278854.0
age,40259.0,36.069873,13.842571,5.0,25.0,34.0,45.0,99.0


In [7]:
users['user_id'].nunique()

68092

In [8]:
users.isna().sum()/len(users)

user_id     0.000000
location    0.000000
age         0.408756
dtype: float64

age에 많은 결측값이 있는 것을 확인할 수 있습니다.

## (2-1) Preprocessing & Feature Engineering

현재 `location` 칼럼에는 지역, 주, 국가 순서로 모든 정보가 다 들어와있습니다.

이를 전처리를 통해 각각 컬럼으로 분리하는 과정을 거치겠습니다.

In [9]:
users['location'] = users['location'].apply(lambda x: re.sub(r'[^0-9a-zA-Z:,]', '',x)) # 특수문자 제거
users['location_city'] = users['location'].apply(lambda x: x.split(',')[0].strip())
users['location_state'] = users['location'].apply(lambda x: x.split(',')[1].strip())
users['location_country'] = users['location'].apply(lambda x: x.split(',')[2].strip())

users = users.replace('na', np.nan) #특수문자 제거로 n/a가 na로 바뀌게 되었습니다. 따라서 이를 컴퓨터가 인식할 수 있는 결측값으로 변환합니다.
users = users.replace('', np.nan) # 일부 경우 , , ,으로 입력된 경우가 있었으므로 이런 경우에도 결측값으로 변환합니다.

In [10]:
users.head()

Unnamed: 0,user_id,location,age,location_city,location_state,location_country
0,8,"timmins,ontario,canada",,timmins,ontario,canada
1,11400,"ottawa,ontario,canada",49.0,ottawa,ontario,canada
2,11676,"na,na,na",,,,
3,67544,"toronto,ontario,canada",30.0,toronto,ontario,canada
4,85526,"victoria,britishcolumbia,canada",36.0,victoria,britishcolumbia,canada


In [11]:
users.isna().sum()

user_id                 0
location                0
age                 27833
location_city         122
location_state       3254
location_country     2124
dtype: int64

In [12]:
users[users['location_country'].isna()]

Unnamed: 0,user_id,location,age,location_city,location_state,location_country
2,11676,"na,na,na",,,,
6,116866,"ottawa,,",,ottawa,,
32,115097,"seattle,,",27.0,seattle,,
49,245827,"albuquerque,,",,albuquerque,,
72,226745,"humble,,",38.0,humble,,
...,...,...,...,...,...,...
67797,257311,"lisbon,maine,",36.0,lisbon,maine,
67929,267240,"houston,,",,houston,,
67930,267276,"sammamish,,",,sammamish,,
68058,276221,"calgary,,",,calgary,,


country가 결측값인 일부 행을 살펴보면 city값이 존재하는데, country 정보가 없는 경우가 있습니다.

따라서 이런 경우를 처리해주도록 하겠습니다.

In [13]:
modify_location = users[(users['location_country'].isna())&(users['location_city'].notnull())]['location_city'].values # city는 존재하는데 country가 없는 목록
location_list = []
for location in modify_location: # country가 없는 각 city들에 대해서 
    try:
        right_location = users[(users['location'].str.contains(location))&(users['location_country'].notnull())]['location'].value_counts().idxmax() # 올바른 Location 정보
        location_list.append(right_location)
    except:
        pass

In [14]:
for location in location_list:
    users.loc[users[users['location_city']==location.split(',')[0]].index,'location_state'] = location.split(',')[1].strip()
    users.loc[users[users['location_city']==location.split(',')[0]].index,'location_country'] = location.split(',')[2].strip()

In [15]:
users.isna().sum()

user_id                 0
location                0
age                 27833
location_city         122
location_state       1132
location_country      271
dtype: int64

location_state와 location_country의 결측값이 줄어든 것을 확인할 수 있습니다.

## (2-2) 시각화
데이터 시각화를 통해 사용자들의 나이 분포, 거주 국가 등을 알아보겠습니다.

age의 결측값을 살펴보겠습니다.

In [139]:
users[users['age'].isna()]['location_country'].value_counts()

location_country
usa                  20214
canada                3139
germany               1105
unitedkingdom          958
australia              477
                     ...  
unitedstaes              1
missouri                 1
unknown                  1
dominicanrepublic        1
macedonia                1
Name: count, Length: 177, dtype: int64

위의 그래프와 큰 차이가 없는 점을 고려했을때, 특정국가에 몰려있다기 보단 골고루 결측치가 퍼져 있다고 판단할 수 있습니다.

국가별 평균, 중앙값, 최빈값 등을 넣어볼수도 있고, 국가별로 큰 나이분포에 큰 차이가 없다고 판단한 경우 전체의 통계치로 채울 수도 있습니다.

또는 결측값을 평균, 중앙값, 최빈값 등의 값으로 채우지 않고 결측값 자체로 가져가는 방법도 있습니다. 

다양한 방법을 시도해보시기 바랍니다.

### 사용자들의 나이 결측값을 어떻게 채울 수 있을 것인가?
**1. 국가별 평균, 중앙, 최빈값**  
**2. 전체 평균, 중앙, 최빈값**  
**3. 결측치 자체로 가져가기**  

In [16]:
to_replace = users.groupby('location_city').agg({'age':'median'}).reset_index() # 국가별 중앙값으로 대체할 리스트 
age_dict = {country : age for _,(country,age) in to_replace.iterrows()}
for country in age_dict:
    users.loc[(users['age'].isna())&(users['location_country']== country),'age'] = age_dict[country]

In [17]:
users['location_country'] = users['location_country'].fillna('unknown')
users['location_city'] = users['location_city'].fillna('unknown')
users['location_state'] = users['location_state'].fillna('unknown')

In [18]:
users['age'] = users['age'].fillna(-1)

In [19]:
users.isna().sum()

user_id             0
location            0
age                 0
location_city       0
location_state      0
location_country    0
dtype: int64

In [20]:
users = users.drop(columns='location')

# [3] books
이제 책에 대한 정보를 살펴보겠습니다.

`isbn`, `book_title`, `book_author`, `year_of_publication`, `publisher`, `img_s', `img_m`, `img_l`, `language`, `category` 칼럼으로 이뤄져있습니다.

`isbn`은 책의 고유 코드를 나타냅니다

같은 title을 가진 책이더라도 발행 년도, 출판사, 언어 등에 따라 책 코드가 다를 수 있습니다.


In [21]:
books.shape

(149570, 10)

In [22]:
books['isbn'].nunique()

149570

In [23]:
books['book_title'].nunique()

135436

In [24]:
books.head()

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher,img_url,language,category,summary,img_path
0,2005018,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,en,['Actresses'],"In a small town in Canada, Clara Callan reluct...",images/0002005018.01.THUMBZZZ.jpg
1,60973129,Decision in Normandy,Carlo D'Este,1991.0,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,en,['1940-1949'],"Here, for the first time in paperback, is an o...",images/0060973129.01.THUMBZZZ.jpg
2,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999.0,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,en,['Medical'],"Describes the great flu epidemic of 1918, an o...",images/0374157065.01.THUMBZZZ.jpg
3,399135782,The Kitchen God's Wife,Amy Tan,1991.0,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,en,['Fiction'],A Chinese immigrant who is convinced she is dy...,images/0399135782.01.THUMBZZZ.jpg
4,425176428,What If?: The World's Foremost Military Histor...,Robert Cowley,2000.0,Berkley Publishing Group,http://images.amazon.com/images/P/0425176428.0...,en,['History'],"Essays by respected military historians, inclu...",images/0425176428.01.THUMBZZZ.jpg


In [25]:
books.isna().sum() #language, category, summary에 결측값이 있습니다

isbn                       0
book_title                 0
book_author                1
year_of_publication        0
publisher                  0
img_url                    0
language               67227
category               68851
summary                67227
img_path                   0
dtype: int64

In [26]:
books.fillna('None',inplace = True)

## (3-1) Preprocessing & Feature Engineering

books의 경우 데이터를 다양한 방법으로 수정하여 여러 테스트를 해볼 수 있을 것으로 예상됩니다.

본 미션의 내용 이외에 다양한 시도를 해보고 성능을 측정해보시기 바랍니다.



### (3-1-1) isbn

In [27]:
books.head()

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher,img_url,language,category,summary,img_path
0,2005018,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,en,['Actresses'],"In a small town in Canada, Clara Callan reluct...",images/0002005018.01.THUMBZZZ.jpg
1,60973129,Decision in Normandy,Carlo D'Este,1991.0,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,en,['1940-1949'],"Here, for the first time in paperback, is an o...",images/0060973129.01.THUMBZZZ.jpg
2,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999.0,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,en,['Medical'],"Describes the great flu epidemic of 1918, an o...",images/0374157065.01.THUMBZZZ.jpg
3,399135782,The Kitchen God's Wife,Amy Tan,1991.0,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,en,['Fiction'],A Chinese immigrant who is convinced she is dy...,images/0399135782.01.THUMBZZZ.jpg
4,425176428,What If?: The World's Foremost Military Histor...,Robert Cowley,2000.0,Berkley Publishing Group,http://images.amazon.com/images/P/0425176428.0...,en,['History'],"Essays by respected military historians, inclu...",images/0425176428.01.THUMBZZZ.jpg



isbn은 책의 개별적인 고유번호를 나타내는 10자리 숫자입니다.

출판 국가, 출판사 번호, 항목 번호, 확인 숫자 순서로 구성되며 확인 숫자의 경우 10을 나타내는 X가 오기도 합니다.

이 번호를 활용하여 출판사의 항목 갯수를 줄여보도록 하겠습니다.

#### (1) isbn 활용하여 출판사의 항목 갯수 줄이기

In [28]:
publisher_dict=(books['publisher'].value_counts()).to_dict()
publisher_count_df= pd.DataFrame(list(publisher_dict.items()),columns = ['publisher','count'])

publisher_count_df = publisher_count_df.sort_values(by=['count'], ascending = False)

In [29]:
publisher_count_df.head()

Unnamed: 0,publisher,count
0,Harlequin,3005
1,Ballantine Books,2322
2,Pocket,2274
3,Penguin Books,1943
4,Bantam Books,1938


In [30]:
books['publisher'].nunique() # 수정전 항목 수를 확인합니다.

11571

In [31]:
modify_list = publisher_count_df[publisher_count_df['count']>1].publisher.values

In [32]:
for publisher in modify_list:
    try:
        number = books[books['publisher']==publisher]['isbn'].apply(lambda x: x[:4]).value_counts().index[0]
        right_publisher = books[books['isbn'].apply(lambda x: x[:4])==number]['publisher'].value_counts().index[0]
        books.loc[books[books['isbn'].apply(lambda x: x[:4])==number].index,'publisher'] = right_publisher
    except: 
        pass

In [33]:
books['publisher'].nunique() #수정 후 출판사 갯수입니다

1523

#### (2) isbn 활용하여 국가코드 만들기
[국가코드 참고 1](https://en.wikipedia.org/wiki/List_of_ISBN_registration_groups)    
[국가코드 참고 2](https://everything2.com/title/ISBN+Country+codes)

**시도한 것: isbn의 첫자리가 국가 코드라고 해서, 해당 국가에서는 그 나라의 언어만을 이용한다고 생각, 그래서 isbn의 첫자리를 따와서 language의 nan값을 채울 수 있을 것이라 생각**  
**결과: 나라를 기준으로 나눠놔서, 해당 나라에서 다른 나라의 언어로 출판된 출간물이 존재해서 채울 수 없을 것 같음**  
**그렇다면 나라 코드 자체로 의미가 있는 건 아닐까? -> isbn의 앞자리를 딴 feature 하나 만들어둠**  

In [34]:
def country_code(isbn:str)->str:
    prefix_1 = ('0','1','2','3','4','5','7')
    prefix_2 = tuple(map(str,range(80,94)))
    prefix_3 = tuple(list(map(str,range(950,960)))+list(map(str,range(961,969)))+list(map(str,range(970,985)))+['986','987'])
    if isbn.startswith(prefix_1):
        return isbn[0]
    elif isbn.startswith(prefix_2):
        return isbn[:2]
    elif isbn.startswith(prefix_3):
        return isbn[:3]
    else:
        return '-1'
books['country_code'] = books['isbn'].map(country_code)

### (3-1-2) Category

category를 칼럼의 항목을 대괄호 밖으로 빼는 과정을 거친 뒤 어떤 category가 있는지 살펴보겠습니다

In [35]:
import re
def category_preprocessing(category:str) -> str:
    category = re.sub("[^0-9a-zA-Z\\s]", " ", category) # 0-9,알파벳, 공백이 아닌것 제거 
    category = re.sub("\s+", " ", category)
    category = category.lower().strip() # 소문자/대문자 전처리
    return category

In [36]:
books.loc[books[books['category'].notnull()].index, 'category'] = books[books['category'].notnull()]['category'].map(category_preprocessing)

In [37]:
books['category'].unique()

array(['actresses', '1940 1949', 'medical', ..., 'deafness',
       'alternative histories',
       'authors canadian english 20th century biography'], dtype=object)

### 제목과 작가가 같은데 ISBN이 다르다면,category를 통해 카테고리 항목 결측치 채울수도!

### category 처리할때 다르게 고려하는 방법
```python
categories = ['garden','crafts','physics','adventure','music','fiction','nonfiction','science','science fiction','social','homicide',
 'sociology','disease','religion','christian','philosophy','psycholog','mathemat','agricult','environmental',
 'business','poetry','drama','literary','travel','motion picture','children','cook','literature','electronic',
 'humor','animal','bird','photograph','computer','house','ecology','family','architect','camp','criminal','language','india']

for category in categories:
    books.loc[books[books['category'].str.contains(category,na=False)].index,'category_high'] = category

books['category'].value_counts()
```

In [38]:
books['category_high'] = books['category'].copy()
categories = ['garden','crafts','physics','adventure','music','fiction','nonfiction','science','science fiction','social','homicide',
 'sociology','disease','religion','christian','philosophy','psycholog','mathemat','agricult','environmental',
 'business','poetry','drama','literary','travel','motion picture','children','cook','literature','electronic',
 'humor','animal','bird','photograph','computer','house','ecology','family','architect','camp','criminal','language','india']

for category in categories:
    books.loc[books[books['category'].str.contains(category,na=False)].index,'category_high'] = category

In [39]:
category_high_df = pd.DataFrame(books['category_high'].value_counts()).reset_index()
category_high_df.columns = ['category','count']

# 5개 이하인 항목은 others로 묶어주도록 하겠습니다.
others_list = category_high_df[category_high_df['count']<5]['category'].values
books.loc[books[books['category_high'].isin(others_list)].index, 'category_high']='others'

#### word2vec 시도

In [40]:
import gensim

# 1. unique한 카테고리에 대해 임베딩 형성
# 2. 해당 임베딩을 거기다가 매칭 
def word2vec(unique_category:np.array) -> dict:
    word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
    embedding_dict = dict()
    k = 300
    for category in unique_category:
        embedding = np.zeros(k,)
        try:
            words = category.split()
            for word in words:
                embedding += word2vec_model[category]
            embedding_dict[category] = embedding
        except:
            embedding_dict[category] = np.zeros(k,)
    return embedding_dict

In [41]:
embedding_dict = word2vec(books['category'].unique())
books['category_embedding'] = books['category'].map(embedding_dict)

In [42]:
books.isna().sum()

isbn                   0
book_title             0
book_author            0
year_of_publication    0
publisher              0
img_url                0
language               0
category               0
summary                0
img_path               0
country_code           0
category_high          0
category_embedding     0
dtype: int64

In [None]:
from typing import Union, Tuple, List

import os
import numpy as np
import random
import pandas as pd
from datetime import datetime, date
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss, accuracy_score, precision_score, recall_score
from tqdm import tqdm

import matplotlib.pyplot as plt
# import seaborn as sns
%matplotlib inline

# from IPython.display import Image

import torch
import torch.nn as nn
from torch.nn.init import normal_
from torch.utils.data import TensorDataset, DataLoader

In [184]:
books

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher,img_url,language,category,summary,img_path,country_code,final_category,category_embedding,final_country_code
0,0002005018,Clara Callan,Richard Bruce Wright,2001.0,Collins,http://images.amazon.com/images/P/0002005018.0...,en,actresses,"In a small town in Canada, Clara Callan reluct...",images/0002005018.01.THUMBZZZ.jpg,0,brothers,"[-0.1640625, -0.06298828125, -0.03125, 0.07910...",0
1,0060973129,Decision in Normandy,Carlo D'Este,1991.0,Perennial,http://images.amazon.com/images/P/0060973129.0...,en,1940 1949,"Here, for the first time in paperback, is an o...",images/0060973129.01.THUMBZZZ.jpg,0,1940 1949,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0
2,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999.0,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,en,medical,"Describes the great flu epidemic of 1918, an o...",images/0374157065.01.THUMBZZZ.jpg,0,geschichte,"[-0.1376953125, 0.1484375, -0.01544189453125, ...",0
3,0399135782,The Kitchen God's Wife,Amy Tan,1991.0,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,en,fiction,A Chinese immigrant who is convinced she is dy...,images/0399135782.01.THUMBZZZ.jpg,0,nonfiction,"[-0.007568359375, -0.265625, 0.00885009765625,...",0
4,0425176428,What If?: The World's Foremost Military Histor...,Robert Cowley,2000.0,Berkley Publishing Group,http://images.amazon.com/images/P/0425176428.0...,en,history,"Essays by respected military historians, inclu...",images/0425176428.01.THUMBZZZ.jpg,0,histoire,"[0.09619140625, 0.1357421875, 0.1357421875, 0....",0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149565,067161746X,The Bachelor Home Companion: A Practical Guide...,P.J. O'Rourke,1987.0,Pocket,http://images.amazon.com/images/P/067161746X.0...,en,humor,A tongue-in-cheek survival guide for single pe...,images/067161746X.01.THUMBZZZ.jpg,0,fathers,"[0.43359375, -0.08544921875, 0.09912109375, 0....",0
149566,0767907566,All Elevations Unknown: An Adventure in the He...,Sam Lightner,2001.0,Broadway Books,http://images.amazon.com/images/P/0767907566.0...,en,nature,A daring twist on the travel-adventure genre t...,images/0767907566.01.THUMBZZZ.jpg,0,service,"[0.138671875, 0.2041015625, 0.0289306640625, 0...",0
149567,0884159221,Why stop?: A guide to Texas historical roadsid...,Claude Dooley,1985.0,Bridge Publications,http://images.amazon.com/images/P/0884159221.0...,,,,images/0884159221.01.THUMBZZZ.jpg,0,,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0
149568,0912333022,The Are You Being Served? Stories: 'Camping In...,Jeremy Lloyd,1997.0,Pub Group West,http://images.amazon.com/images/P/0912333022.0...,en,fiction,These hilarious stories by the creator of publ...,images/0912333022.01.THUMBZZZ.jpg,0,nonfiction,"[-0.007568359375, -0.265625, 0.00885009765625,...",0


In [185]:
ratings.head()

Unnamed: 0,user_id,isbn,rating
0,8,2005018,4
1,67544,2005018,7
2,123629,2005018,8
3,200273,2005018,8
4,210926,2005018,9


In [189]:
# 유저가 평가한 카테고리를 문장으로, 개별 카테고리를 단어로 가정하여 word2vec 사용
positive_samples = ratings.copy()
# positive_samples = positive_samples[positive_samples['ratings'] == 1]
# positive_samples = positive_samples[['user_id', 'movie_id', 'liked']]

In [190]:
# 단어가 최소 3회 이상 등장해야 학습되고 모델에 저장됨
min_count = 3
# 1개의 positive sample당 negative sample의 개수
negative = 5

In [None]:
isbns_for_training = list()
for isbn in ratings['isbn'].unique():
    if positive_samples[positive_samples['isbn'] == isbn].shape[0] >= min_count:
        isbns_for_training.append(isbn)

In [None]:
new_positive_samples = dict()
new_positive_samples['user_id'] = list()
new_positive_samples['w_isbn_id'] = list()
new_positive_samples['c_isbn_id'] = list()

user_negative_samples = dict()

isbn_ids = ratings['isbn'].unique()

for user_id in tqdm(ratings['user_id'].unique()):
    user_positive_samples = positive_samples[positive_samples['user_id'] == user_id]
    user_isbns = user_positive_samples['isbn'].tolist()
    # sampling을 위해 각 user id의 negative ratings를 저장
    user_neg_isbns = [isbn_id for isbn_id in isbn_ids if isbn not in user_isbns]
    user_negative_samples[user_id] = np.array(user_neg_isbns)
    for w_isbn_id in user_isbns:
        # 단어가 최소 등장 횟수를 만족하지 않음
        if w_isbn_id not in isbns_for_training:
            continue
        for c_isbn_id in user_isbns:
            if c_isbn_id == w_isbn_id:
                continue
            new_positive_samples['user_id'].append(user_id)
            new_positive_samples['w_isbn_id'].append(w_isbn_id)
            new_positive_samples['c_isbn_id'].append(c_isbn_id)

new_positive_samples = pd.DataFrame(new_positive_samples)

In [None]:
for user_id in user_negative_samples:
    user_negative_samples[user_id] = np.array(user_negative_samples[user_id])

In [None]:
new_positive_samples['user_id'] = new_positive_samples['user_id'].astype("category")
new_positive_samples['w_isbn_id'] = new_positive_samples['w_isbn_id'].astype("category")
new_positive_samples['c_isbn_id'] = new_positive_samples['c_isbn_id'].astype("category")
train_df, test_df = train_test_split(
    new_positive_samples, stratify=new_positive_samples['user_id'], random_state=seed, test_size=0.20
)
print('학습 데이터 크기:', train_df.shape)
print('테스트 데이터 크기:', test_df.shape)

In [None]:
# PyTorch의 DataLoader에서 사용할 수 있도록 변환 
train_dataset = TensorDataset(torch.LongTensor(np.array(train_df)))
test_dataset = TensorDataset(torch.LongTensor(np.array(test_df)))

In [None]:
class Negative_Sampler(nn.Module):
    """
    Negative Sampler
    
    Args:
        - user_negative_samples: (Dict) keys: user id, items: list of negative samples
        - n_negs: (int) negative sample의 수
    Shape:
        - Input: (torch.Tensor) user id들, (user id, 중심 item id, 주변 item id). Shape: (batch size,)
        - Output: (torch.Tensor) sampling된 negative samples. Shape: (batch size, n_negs)
    """
    def __init__(self, user_negative_samples, n_negs):
        super(Negative_Sampler, self).__init__()
        self.user_negative_samples = user_negative_samples
        self.n_negs = n_negs
    
    def forward(self, user_ids):
        user_ids = user_ids.to('cpu').numpy()
        negative_samples = np.array([
            np.random.choice(self.user_negative_samples[user_id],self.n_negs,replace=False)
            for user_id in user_ids
        ])
        return torch.from_numpy(negative_samples)
    

In [None]:
class SGNS(nn.Module):
    """
    Skip-Gram with Negative Sampling
    
    Args:
        - n_items: (int) 전체 아이템의 수
        - emb_dim: (int) Embedding의 Dimension
        - user_negative_samples: (Dict) user id 별 전체 negative sample
        - n_negs: (int) negative sample의 수
    Shape:
        - Input: (torch.Tensor) input features, (user id, 중심 item id, 주변 item id). Shape: (batch size, 3)
        - Output: (torch.Tensor) sampling된 negative samples와 positive sample의 Loss 합. Shape: ()
    """
    def __init__(self, n_items, emb_dim, user_negative_samples, n_negs):
        super(SGNS, self).__init__()
        
        # initialize Class attributes
        self.n_items = n_items
        self.emb_dim = emb_dim
        self.user_negative_samples = user_negative_samples
        self.n_negs = n_negs
        self.negative_sampler = Negative_Sampler(self.user_negative_samples, self.n_negs)
        
        # define embeddings
        # 중심 아이템
        self.w_item_embedding =  nn.Embedding(self.n_items,self.emb_dim)
        # 주변 아이템
        self.c_item_embedding =  nn.Embedding(self.n_items,self.emb_dim)
        self.sigmoid = nn.Sigmoid()
        
        self.loss_fn = nn.BCELoss()
        
        self.apply(self._init_weights)
        
    # initialize weights
    def _init_weights(self, module):
        if isinstance(module, nn.Embedding):
            normal_(module.weight.data, mean=0.0, std=0.01)
    
    def forward(self, input_feature):
        batch_size = input_feature.size()[0]
        
        user_ids, w_item, c_item = torch.split(input_feature, [1, 1, 1], -1)
        # 유저 id
        user_ids = user_ids.squeeze(-1)
        # 중심 아이템
        w_item = w_item.squeeze(-1)
        # 주변 아이템 (positive sample)
        c_item = c_item.squeeze(-1)
        # 주변 아이템 negative sampling
        neg_c_items = self.negative_sampler(user_ids)
        
        # 중심 아이템 embedding
        w_item_e =  self.w_item_embedding(w_item) 
        # 주변 아이템 (positive sample) embedding
        c_item_e =  self.c_item_embedding(c_item)
        # 주변 아이템 (negative sample) embedding
        neg_c_items_e =  self.c_item_embedding(neg_c_items) 
        # HINT: neg_c_items_e.shape == (batch_size, self.n_negs, self.emb_dim)
        
        w_item_e = w_item_e.view(batch_size, 1, self.emb_dim)
        c_item_e = c_item_e.view(batch_size, self.emb_dim, 1)
        neg_c_items_e = neg_c_items_e.permute(0, 2, 1)
        
        pos_output = torch.bmm(w_item_e,c_item_e) 
        pos_output = pos_output.squeeze(-1)
        pos_output = self.sigmoid(pos_output).squeeze(-1)

        
        pos_y = torch.ones(pos_output.size())
        pos_loss = self.loss_fn(pos_output, pos_y)
        
        neg_output = torch.bmm(w_item_e,neg_c_items_e) 
        neg_output = neg_output.squeeze(-1)
        neg_output = self.sigmoid(neg_output)
        
        neg_y = torch.zeros(neg_output.size())
        neg_loss = self.loss_fn(neg_output, neg_y)

        
        
        return pos_loss + neg_loss


In [None]:
# Reference - https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html

def train_loop(dataloader, model, optimizer):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    train_loss = 0
    
    for batch, (X,) in enumerate(dataloader):
        X = X.to(device)
        # Compute prediction and loss
        loss = model(X)
        train_loss += loss.item()

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


        if (batch+1) % 100 == 0:
            loss, current = loss.item(), (batch+1) * len(X)
            print(f"Loss: {loss:>7f} | [{current:>5d}/{size:>5d}]")
    train_loss /= num_batches
    
    return train_loss


def test_loop(dataloader, model):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss= 0
    
    with torch.no_grad():
        for X, in dataloader:
            X= X.to(device)
            loss = model(X)
            test_loss += loss.item()
    test_loss /= num_batches
    print(f"Test Error:\n\tAvg Loss: {test_loss:>8f}")
    return test_loss


In [None]:
def train_and_test(train_dataloader, test_dataloader, model, optimizer, epochs):
    train_loss, test_loss = list(), list()

    for t in range(epochs):
        print(f"Epoch {t+1}\n-------------------------------")
        train_result= train_loop(train_dataloader, model, optimizer)
        train_loss.append(train_result)
        test_result = test_loop(test_dataloader, model)
        test_loss.append(test_result)
        print("-------------------------------\n")
    print("Done!")

    return train_loss, test_loss


In [None]:
######## Hyperparameter ########

batch_size = 2048
data_shuffle = True
emb_dim = 512
epochs = 5
learning_rate = 0.001
gpu_idx = 0

n_items = ratings_df['movie_id'].nunique()

################################
# torch.cuda.empty_cache() # if necessary
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

device = torch.device("cuda:{}".format(gpu_idx) if torch.cuda.is_available() else "cpu")

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=data_shuffle)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=data_shuffle)

model = SGNS(n_items, emb_dim, user_negative_samples, negative).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, amsgrad=True)

In [None]:
train_loss, test_loss = train_and_test(train_dataloader, test_dataloader, model, optimizer, epochs)

In [122]:
categories = ['garden','crafts','physics','adventure','music','fiction','nonfiction','science','science fiction','social','homicide',
 'sociology','disease','religion','christian','philosophy','psycholog','mathemat','agricult','environmental',
 'business','poetry','drama','literary','travel','motion picture','children','cook','literature','electronic',
 'humor','animal','bird','photograph','computer','house','ecology','family','architect','camp','criminal','language','india']

for category in categories:
    books.loc[books[books['category'].str.contains(category,na=False)].index,'category_high'] = category

In [123]:
category_high_df = pd.DataFrame(books['category_high'].value_counts()).reset_index()
category_high_df.columns = ['category','count']

# 5개 이하인 항목은 others로 묶어주도록 하겠습니다.
others_list = category_high_df[category_high_df['count']<5]['category'].values
books.loc[books[books['category_high'].isin(others_list)].index, 'category_high']='others'

print(books['category'].nunique())
print(books['category_high'].nunique())

Unnamed: 0,category,count
0,fiction,39678
1,religion,1824
2,nonfiction,1427
3,humor,1291
4,social,1271
5,business,1146
6,cook,1125
7,science,1063
8,family,988
9,literary,848


데이터를 살펴보다 보면 아래처럼 같은 책으로 여겨지는 항목이 있는 경우가 있습니다.

그러나 책의 고유 번호인 isbn이 다르고, 출판사가 다르므로 설령 같은 책이여도, 다른 국가에서 출판됐을 가능성을 배재할 수 없으므로 language를 채우는 것은 어려워보입니다.

물론, 0446365505 책의 category_high를 fiction으로 수정하는 작업은 해볼 수 있습니다.

뒤의 작업들을 해보면서 category 항목이 중요하다고 여겨지면, 좀더 시간을 들여서 상위 카테고리 지정하여 데이터 전처리 수준을 올릴 수 있을 것입니다.

In [129]:
books[books['book_title'].str.contains("Pleading Guilty")]

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher,img_url,language,category,summary,img_path,category_high
5041,446365505,Pleading Guilty,Scott Turow,1994.0,Warner Books,http://images.amazon.com/images/P/0446365505.0...,,,,images/0446365505.01.THUMBZZZ.jpg,
22680,816157464,Pleading Guilty (G K Hall Large Print Book Ser...,Scott Turow,1993.0,Troll Communications,http://images.amazon.com/images/P/0816157464.0...,,,,images/0816157464.01.THUMBZZZ.jpg,
37056,374234574,Pleading Guilty,Scott Turow,1993.0,Farrar Straus Giroux,http://images.amazon.com/images/P/0374234574.0...,en,fiction,Immediately. Turow&#39;s third novel takes us ...,images/0374234574.01.THUMBZZZ.jpg,fiction


# [4] ratings

ratings 파일은 사용자가 특정 책을 읽고 점수를 매긴 데이터 입니다.

 `user_id`, `isbn`, `rating` 으로 이뤄져있습니다.

 한 사용자가 다른 책을 여러번 읽기 때문에 중복된 값이 나타나게 되지만,
 
 users에 있는 모든 사용자가 포함된 데이터입니다.

In [43]:
ratings.head()

Unnamed: 0,user_id,isbn,rating
0,8,2005018,4
1,67544,2005018,7
2,123629,2005018,8
3,200273,2005018,8
4,210926,2005018,9


In [44]:
ratings['user_id'].nunique()

59803

In [45]:
ratings.shape

(306795, 3)

In [46]:
ratings['rating'].value_counts(True)

rating
8     0.239877
7     0.172519
9     0.158650
10    0.139422
6     0.082501
5     0.045995
1     0.043185
2     0.042142
4     0.041419
3     0.034290
Name: proportion, dtype: float64

# [5] 파일 merge

이제 세 파일을 모두 합쳐서 컬럼간의 관계를 살펴보도록 하겠습니다.

In [None]:
merge1 = ratings.merge(books, how='left', on='isbn') # ratings 목록에 없는 책들 삭제
data = merge1.merge(users, how='inner', on='user_id') # 평점에도 존재하는 user_id 기준으로 병합
print('merge 결과 shape: ', data.shape)

In [None]:
ratings.shape
# ratings 기록 갯수만큼 결과가 나온것을 확인 할 수 있습니다.

In [None]:
set(ratings['isbn']) - set(books['isbn']) 

## (5-1) EDA 이후 진행방향

In [None]:
train_user_id = set(ratings['user_id'].unique())
train_isbn = set(ratings['isbn'].unique())

test_ratings = pd.read_csv(path+'test_ratings.csv')
test_user_id = set(test_ratings['user_id'].unique())
test_isbn = set(test_ratings['isbn'].unique())

total_user_id = set(users['user_id'].unique())
total_isbn = set(books['isbn'].unique())

In [None]:
# case 1.test에는 있는데 Train에는 없는 user_id: 해당 책이 다른 사용자에게 평점을 받은 case로 해결 
# 책 정보만을 고려한 평점 예측 
# 해당 유저가 users에는 존재한다면 users의 feature를 사용해서 예측?
# 해당 책에 대해 다른 유저가 내린 평점 이용 (산술평균)?
test_ratings[~test_ratings['user_id'].isin(train_user_id) & test_ratings['isbn'].isin(train_isbn)]

In [None]:
# case 2. test에는 있는데 Train에는 없는 isbn
# 유저의 정보만을 고려한 평점 예측
# 해당 책이 books에는 존재한다면 books의 feature 사용해서 예측
# 해당 유저가 다른 책에 내린 평점 이용
test_ratings[test_ratings['user_id'].isin(train_user_id) & ~test_ratings['isbn'].isin(train_isbn)]

In [None]:
# case 3. 둘다 없어? 어쩌지..?
# Books, Users만 가지고 모델링 
test_ratings[~test_ratings['user_id'].isin(train_user_id) & ~test_ratings['isbn'].isin(train_isbn)]

# !시도해볼것!

# EDA
## users.df
1. country로 바꿔도 다양한 형태로 결측치 존재
    - 정규표현식 수정
    - city만 존재한다면 같은 city를 지닌 데이터 중에 state와 country가 존재하는 데이터로 결측치 메울 수 있다!
2. 뉴질랜드, 네덜란드, 오스트레일리아 나이분포 형태 다름 -> 어떻게 해볼수 없을까? 나눠서? : 결측치 40퍼 넘어가는데 버려버릴까? 
3. 나이 결측치가 너무 많음 -> 어떻게 처리하는 게 좋을까?  

## books.df
1. cateogory 관련 부분
    - 처음에 우리 데이터로 word2vec 학습해서 사용하려고 했으나 문장 단위가 아니여서 학습이 제대로 되지 않는 문제 -> fiction과 제일 유사한 걸 nonfiction으로 매칭시킴
    - 구글이 제공하는 사전 훈련된 Word2Vec 모델을 사용하는 방법(사전 훈련된 3백만 개의 Word2Vec 단어 벡터들을 제공): 단어 단위로 되어 있어서 우리꺼는 chunk이기 때문에 해결할 방법이 필요
    - 단어별로 임베딩을 더해서 사용
    - 구글 사전 훈련 모델에도 없으면 0 벡터로 매칭
    - **임베딩이 dim=300인데 그냥 category high로 해서 43차원으로 줄이는 게 나을까? 고민되는 부분**
    - item2vec로 시도해보는건 어떨까? -> 한 유저가 읽은 카테고리(Category high)를 item2vec으로 임베딩?
2. Summary 관련 부분
    - 미션 5에서 제공한 DeepCoNN 모델을 살펴보니 summary를 이용해서 모델에 사용
    - 본 미션에서는 Bert 기반 사전 학습 모델 활용한 부분 참고하면 좋을 듯!
    - 리뷰가 아니어서 sentimental 분석이 맞을까?
3. year_of_publication 분포가 왜도 > 0 -> 어떻게 처리할 것인가? 
    - categorical로 만든다고 하면, 어떻게 묶어야할까?
    - 데이터가 얼마 없는 구간 하나로 묶고, 2000년대 이후는 1년 단위로 처리하는 게 괜찮은 처리일까? 
4. isbn 가지고 language 채울 수 있을 거 같아요! 
    - 다시 찾아보니 isbn으로 language를 채우는 건 어려운 문제
    - isbn의 의미를 뜯어보니 isbn 자체로 유의미한 피처를 따로 생성할수도 있을 것 같음

## ratings.df
1. test에는 있는데 train에 없는 user_id 존재: 8266개
    - 해당 책이 다른 사용자에게 평점을 받은 case로 해결 
    - 책 정보만을 고려한 평점 예측 
    - 해당 유저가 users에는 존재한다면 users의 feature를 사용해서 예측?
    - 해당 책에 대해 다른 유저가 내린 평점 이용 (산술평균)?
2. test에는 있는데 train에는 없는 isbn 존재: 19793개
    - 유저의 정보만을 고려한 평점 예측
    - 해당 책이 books에는 존재한다면 books의 feature 사용해서 예측
    - 해당 유저가 다른 책에 내린 평점 이용
3. train에 user_id,item_id도 없는 경우: 1734개
    - Books, Users만 가지고 모델링
4. 둘다 존재하는 경우: 나머지
    - 모든 모델 사용 가능
    - ensemble 하는게 BEST 아닐까?

# Modeling
Try 해볼 모델 List
- catboost
- deepCoNN
- TabNet

In [47]:
pred = pd.DataFrame()

In [48]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold

In [49]:
def stratified_k_fold(k:int,df:pd.DataFrame) -> list:
    skf = StratifiedKFold(n_splits = k) # n_splits 는 나눌 fold의 갯수
    X = df.drop('rating',axis = 1)
    y = df['rating']
    fold_idx = list()
    for train_idx,valid_idx in skf.split(X,y):
        fold_idx.append([train_idx,valid_idx])
    return fold_idx

### catboost

In [50]:
from catboost import CatBoostRegressor

In [51]:
books.fillna(-1,inplace=True)
users.fillna(-1,inplace=True)
ratings.fillna(-1,inplace=True)

In [52]:
user2idx = {v:k for k,v in enumerate(users['user_id'].unique())}
book2idx = {v:k for k,v in enumerate(books['isbn'].unique())}

ratings['iid'] = ratings['isbn'].map(book2idx)
ratings['uid'] = ratings['user_id'].map(user2idx)

test_ratings['iid'] = test_ratings['isbn'].map(book2idx)
test_ratings['uid'] = test_ratings['user_id'].map(user2idx)

In [53]:
ratings_df = ratings.drop(['user_id','isbn'],axis=1)
ratings_test_df = test_ratings.drop(['user_id','isbn'],axis=1)

In [54]:
# Train/Test Split
users_df = users.copy()
users_df['uid'] = users_df['user_id'].map(user2idx) 

In [55]:
users_df['age'] = users_df['age'].map(lambda x:x//10 if type(x)!=str else -1)

u_label = ['location_country', 'location_state', 'location_city']
u_encoder = dict()
for l in u_label:
    u_encoder[l] = LabelEncoder()
    u_encoder[l].fit(users_df[l])
    users_df[l] = u_encoder[l].transform(users_df[l])

In [56]:
books_df = books.copy()
books_df['iid'] = books_df['isbn'].map(book2idx)
books_df['category'] = books_df['category'].astype(str)
books_df['language'] = books_df['language'].astype(str)
books_df['category_high'] = books_df['category_high'].astype(str)

In [57]:
i_label = ['category', 'language','category_high']
i_encoder = dict()
for l in i_label:
    i_encoder[l] = LabelEncoder()
    i_encoder[l].fit(books_df[l])
    books_df[l] = i_encoder[l].transform(books_df[l])

In [58]:
context_df = ratings_df.merge(users_df, on='uid', how='left').merge(books_df, on='iid', how='left')
test_df = ratings_test_df.merge(users_df, on='uid', how='left').merge(books_df, on='iid', how='left')

In [59]:
context_df.dtypes

rating                   int64
iid                      int64
uid                      int64
user_id                  int64
age                    float64
location_city            int64
location_state           int64
location_country         int64
isbn                    object
book_title              object
book_author             object
year_of_publication    float64
publisher               object
img_url                 object
language                 int64
category                 int64
summary                 object
img_path                object
country_code            object
category_high            int64
category_embedding      object
dtype: object

In [60]:
def modify_range(rating:int) -> int:
    if rating < 0:
        return 0
    elif rating > 10:
        return 10
    else:
        return rating
def rmse(real, predict):
    pred = list(map(modify_range, predict))  
    pred = np.array(pred)
    return np.sqrt(np.mean((real-pred) ** 2))

In [64]:
cat_list = ['user_id','isbn','book_author','publisher','language','category_high','location_country','location_state','location_city','country_code']                     
drop_lst = ['book_title','img_url','img_path','summary','category_embedding','category']

X_train,y_train = context_df.drop(['iid','uid','rating'],axis=1),context_df['rating']
X_train = X_train.drop(drop_lst,axis=1)
X_test = test_df.drop(['iid','uid','rating'],axis=1)
X_test = X_test.drop(drop_lst,axis=1)


In [66]:
params = {}
params['iterations'] = 200
params['learning_rate']=0.1
params['depth']=10
# Fit
catboost_r = CatBoostRegressor(**params, verbose=True, random_state=42,cat_features=cat_list)
catboost_r.fit(X_train, y_train, early_stopping_rounds=100)
# Predict
y_pred = catboost_r.predict(X_test)

0:	learn: 2.3997445	total: 205ms	remaining: 40.8s
1:	learn: 2.3721190	total: 432ms	remaining: 42.8s
2:	learn: 2.3490951	total: 586ms	remaining: 38.5s
3:	learn: 2.3305318	total: 685ms	remaining: 33.6s
4:	learn: 2.3152854	total: 801ms	remaining: 31.2s
5:	learn: 2.2994952	total: 968ms	remaining: 31.3s
6:	learn: 2.2859739	total: 1.07s	remaining: 29.6s
7:	learn: 2.2745128	total: 1.31s	remaining: 31.5s
8:	learn: 2.2647981	total: 1.42s	remaining: 30.1s
9:	learn: 2.2566086	total: 1.53s	remaining: 29s
10:	learn: 2.2498102	total: 1.61s	remaining: 27.8s
11:	learn: 2.2441794	total: 1.73s	remaining: 27s
12:	learn: 2.2393838	total: 1.93s	remaining: 27.7s
13:	learn: 2.2334558	total: 2.17s	remaining: 28.9s
14:	learn: 2.2283454	total: 2.38s	remaining: 29.3s
15:	learn: 2.2239772	total: 2.73s	remaining: 31.4s
16:	learn: 2.2203220	total: 2.99s	remaining: 32.2s
17:	learn: 2.2170881	total: 3.25s	remaining: 32.8s
18:	learn: 2.2141302	total: 3.62s	remaining: 34.5s
19:	learn: 2.2117197	total: 3.87s	remaining: 

In [67]:
pred['catboost'] = y_pred

In [107]:
X_test['rating'] = y_pred
X_test = X_test[['user_id','isbn','rating']]

In [110]:
X_test.head()

Unnamed: 0,user_id,isbn,rating
0,11676,2005018,6.618495
1,116866,2005018,6.967624
2,152827,60973129,7.772792
3,157969,374157065,7.936844
4,67958,399135782,7.892855


In [112]:
X_test.to_csv('./catboost.csv')

In [69]:
fold_idx = stratified_k_fold(10,context_df)
drop_lst = ['book_title','img_url','img_path','summary','category_embedding','category']
params = {}
params['iterations'] = 200
params['learning_rate']=0.1
params['depth']=10
rmse_score = 0
score = dict()

for train_idx,valid_idx in fold_idx:
    train_context = context_df.iloc[train_idx]
    valid_context = context_df.iloc[valid_idx]
    
    X_train = train_context.drop(['iid','uid', 'rating'], axis=1)
    y_train = train_context['rating']
    rating_train = train_context[['uid','iid','rating']] 
    
    X_valid = valid_context.drop(['iid','uid', 'rating'], axis=1)
    y_valid = valid_context['rating']
    rating_valid = valid_context[['uid','iid','rating']] 
    
    X_train.drop(drop_lst,axis=1,inplace=True)
    X_valid.drop(drop_lst,axis=1,inplace=True)
   
    cat_list = ['user_id','isbn','book_author','publisher','language','category_high','location_country','location_state','location_city','country_code']
    catboost_r = CatBoostRegressor(**params, verbose=True, random_state=42,cat_features=cat_list)
    catboost_r.fit(X_train, y_train, early_stopping_rounds=100)
    catboost_pred_r = catboost_r.predict(X_valid)
    rmse_score += rmse(y_valid,catboost_pred_r)
    
score['catboost_r'] = rmse_score/10
    

0:	learn: 2.3992925	total: 176ms	remaining: 35.1s
1:	learn: 2.3711839	total: 326ms	remaining: 32.3s
2:	learn: 2.3478657	total: 447ms	remaining: 29.3s
3:	learn: 2.3285810	total: 535ms	remaining: 26.2s
4:	learn: 2.3136889	total: 608ms	remaining: 23.7s
5:	learn: 2.2977408	total: 742ms	remaining: 24s
6:	learn: 2.2846053	total: 834ms	remaining: 23s
7:	learn: 2.2732685	total: 932ms	remaining: 22.4s
8:	learn: 2.2637168	total: 1.04s	remaining: 22.1s
9:	learn: 2.2561065	total: 1.12s	remaining: 21.3s
10:	learn: 2.2496713	total: 1.22s	remaining: 21s
11:	learn: 2.2443243	total: 1.34s	remaining: 21s
12:	learn: 2.2393886	total: 1.46s	remaining: 21s
13:	learn: 2.2353647	total: 1.74s	remaining: 23.2s
14:	learn: 2.2321688	total: 1.96s	remaining: 24.1s
15:	learn: 2.2274046	total: 2.18s	remaining: 25.1s
16:	learn: 2.2233100	total: 2.38s	remaining: 25.6s
17:	learn: 2.2200167	total: 2.65s	remaining: 26.8s
18:	learn: 2.2170907	total: 2.88s	remaining: 27.4s
19:	learn: 2.2143870	total: 3.03s	remaining: 27.3s


In [70]:
score

{'catboost_r': 2.303560794312064}

### DeepCoNN

- try: train에 존재하는 id에 대해서만 값 뽑아서 Ensemble

In [71]:
import os
import re
import warnings
import pandas as pd
import numpy as np
import requests
import tqdm
from tqdm import tqdm_notebook


import matplotlib.pyplot as plt


from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset


# !pip install transformers
import nltk
from nltk import tokenize
from transformers import BertModel, BertTokenizer
nltk.download('punkt')
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /opt/ml/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# !pip install ipywidgets
# jupyter nbextension enable --py widgetsnbextension

In [72]:
books_text_df = books[['isbn', 'summary','category_embedding']].copy()
books_text_df.shape
books_text_df.head()

Unnamed: 0,isbn,summary,category_embedding
0,2005018,"In a small town in Canada, Clara Callan reluct...","[-0.1640625, -0.06298828125, -0.03125, 0.07910..."
1,60973129,"Here, for the first time in paperback, is an o...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,374157065,"Describes the great flu epidemic of 1918, an o...","[-0.1376953125, 0.1484375, -0.01544189453125, ..."
3,399135782,A Chinese immigrant who is convinced she is dy...,"[-0.007568359375, -0.265625, 0.00885009765625,..."
4,425176428,"Essays by respected military historians, inclu...","[0.09619140625, 0.1357421875, 0.1357421875, 0...."


In [73]:
# 기본적인 전처리
# 필요에 따라 전처리 함수 추가 가능
def text_preprocessing(summary:str):
    summary = re.sub("[.,\'\"''""!?]", "", summary) # 특수문자 제외
    summary = re.sub("[^0-9a-zA-Z\\s]", " ", summary) # 0-9,알파벳, 공백이 아닌것 제거 
    summary = re.sub("\s+", " ", summary)
    summary = summary.lower() # 소문자/대문자 전처리
    return summary

# 적용
books_text_df['summary'] = books_text_df['summary'].apply(lambda x:text_preprocessing(x))
books_text_df['summary'].replace({'':'None', ' ':'None'}, inplace=True)

In [74]:
df = ratings.copy()
df_fe = pd.merge(df, books_text_df[['isbn', 'summary']], how='inner', on='isbn')
df_fe['summary_length'] = df_fe['summary'].apply(lambda x:len(x))
print(df_fe.shape)
df_fe.head()

(306795, 7)


Unnamed: 0,user_id,isbn,rating,iid,uid,summary,summary_length
0,8,2005018,4,0,0,in a small town in canada clara callan relucta...,107
1,67544,2005018,7,0,3,in a small town in canada clara callan relucta...,107
2,123629,2005018,8,0,7,in a small town in canada clara callan relucta...,107
3,200273,2005018,8,0,9,in a small town in canada clara callan relucta...,107
4,210926,2005018,9,0,10,in a small town in canada clara callan relucta...,107


In [75]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased').to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [76]:
# 유저 별 하나의 요약 문서 만들기
def summary_merge(df, user_id, max_summary):
    return " ".join(df[df['user_id'] == user_id].sort_values(by='summary_length', ascending=False)['summary'].values[:max_summary])

# 텍스트 벡터화
def text_to_vector(text, device):
    for sent in tokenize.sent_tokenize(text):
        text_ = "[CLS] " + sent + " [SEP]"
        tokenized = tokenizer.tokenize(text_) # text 토크나이즈
        indexed = tokenizer.convert_tokens_to_ids(tokenized) # 토큰화 한 거 아이디로 바꾸기
        segments_idx = [1] * len(tokenized)
        token_tensor = torch.tensor([indexed])
        sgments_tensor = torch.tensor([segments_idx])
        with torch.no_grad():
            outputs = model(token_tensor.to(device), sgments_tensor.to(device))
            encode_layers = outputs[0]
            sentence_embedding = torch.mean(encode_layers[0], dim=0)
    return sentence_embedding.cpu().detach().numpy()

In [77]:
user_summary_merge_vector = list(map(lambda x:text_to_vector(summary_merge(df_fe, x, 5), device), df_fe['user_id'].unique()))
user_review_text_df = pd.DataFrame(df_fe['user_id'].unique(), columns=['user_id'])
user_review_text_df['user_summary_merge_vector'] = user_summary_merge_vector
user_review_text_df.head()

Unnamed: 0,user_id,user_summary_merge_vector
0,8,"[-0.10631599, 0.07719624, 0.3148368, -0.046547..."
1,67544,"[-0.26861295, -0.05863912, 0.47870007, 0.00170..."
2,123629,"[-0.21812886, -0.20392, 0.064398155, -0.135915..."
3,200273,"[-0.21812886, -0.20392, 0.064398155, -0.135915..."
4,210926,"[-0.38687736, 0.074735105, 0.17002048, -0.2698..."


In [None]:
item_summary_vector = list(map(lambda x:text_to_vector(x, device), books_text_df['summary'].tolist()))
books_text_df['item_summary_vector'] = item_summary_vector
books_text_df.head()

In [None]:
# encoding
user2idx = {v:k for k,v in enumerate(users['user_id'].unique())}
book2idx = {v:k for k,v in enumerate(books['isbn'].unique())}

# df['uid'] = df['user_id'].map(user2idx)
# df['iid'] = df['isbn'].map(book2idx)


# User Review Text Vector 결합
df_fe_join = pd.merge(df_fe, user_review_text_df, on='user_id', how='left')

#  Item Review Text Vector 결합
df_fe_join = pd.merge(df_fe_join, books_text_df[['isbn', 'item_summary_vector','category_embedding']], on='isbn', how='left')

# print('unique user_id :', len(user2idx))
# print('unique isbn :', len(book2idx))
df_fe_join['isbn'] = df_fe_join['isbn'].map(book2idx)
df_fe_join['user_id'] = df_fe_join['user_id'].map(user2idx)
df_fe_join.rename(columns ={'category_embedding':'item_category_vector'},inplace=True)
print(df_fe_join.shape)
df_fe_join.head()

> **학습/평가 데이터 분할**
- 학습 80% 평가 20% 데이터 분할
- 무작위 표본 추출 방법 사용

In [None]:
test_df = test_ratings.merge(user_review_text_df, on='user_id', how='left').merge(books_text_df, on='isbn', how='left')
test_df['isbn'] = test_df['isbn'].map(book2idx)
test_df['user_id'] = test_df['user_id'].map(user2idx)
test_df = test_df.astype({'user_id':int})
to_fill = np.zeros(len(test_df.loc[test_df['user_summary_merge_vector'].notnull(),'user_summary_merge_vector'][0]),)

In [None]:
test_df['user_summary_merge_vector'] = test_df['user_summary_merge_vector'].map(lambda x: x if type(x)!= float else to_fill)

In [None]:
test_df.rename(columns ={'category_embedding':'item_category_vector'},inplace=True)

In [None]:
# size = 0.2
# seed = 42
# X_train, X_test, y_train, y_test = train_test_split(
#                                                     df_fe_join[['user_id', 'isbn', 'user_summary_merge_vector', 'item_summary_vector']], 
#                                                     df_fe_join['rating'], 
#                                                     test_size=size, 
#                                                     random_state=seed,
#                                                     )
X_train,y_train = df_fe_join[['user_id', 'isbn', 'user_summary_merge_vector', 'item_summary_vector','item_category_vector']],df_fe_join['rating']
X_test,y_test = test_df[['user_id', 'isbn', 'user_summary_merge_vector', 'item_summary_vector','item_category_vector']],test_df['rating']
print(X_train.shape)
print(X_test.shape)

In [None]:
X_train

## (3-2) DeepCoNN Model(Isbn+User+Text)
- Feature Space : 사용자(user_id) + 아이템(isbn) + 사용자 리뷰(user reivew text) + 아이템 리뷰(item review text) + category embedding

In [None]:
class CNN(nn.Module):
    def __init__(self, word_dim, out_dim, kernel_size, conv_1d_out_dim):
        super(CNN, self).__init__()
        self.conv = nn.Sequential(
                                nn.Conv1d(
                                        in_channels=word_dim,
                                        out_channels=out_dim,
                                        kernel_size=kernel_size,
                                        padding=(kernel_size - 1) // 2), 
                                nn.ReLU(),
                                nn.MaxPool2d(kernel_size=(kernel_size, 1)),  
                                nn.Dropout(p=0.5)
                                )
        self.linear = nn.Sequential(
                                    nn.Linear(int(out_dim/kernel_size), conv_1d_out_dim),
                                    nn.ReLU(),
                                    nn.Dropout(p=0.5))

    def forward(self, vec): 
        output = self.conv(vec)  
        output = self.linear(output.reshape(-1, output.size(1)))
        return output 

class FactorizationMachine(nn.Module):
    def __init__(self, input_dim, latent_dim):  
        super().__init__()
        self.v = nn.Parameter(torch.rand(input_dim, latent_dim), requires_grad = True)
        self.linear = nn.Linear(input_dim, 1, bias=True)
    def forward(self, x):
        linear = self.linear(x) 
        square_of_sum = torch.mm(x, self.v) ** 2
        sum_of_square = torch.mm(x ** 2, self.v ** 2)
        pair_interactions = torch.sum(square_of_sum - sum_of_square, dim=1, keepdim=True)
        output = linear + (0.5 * pair_interactions)
        return output  
    
class RMSELoss(torch.nn.Module):
    def __init__(self):
        super(RMSELoss,self).__init__()
        self.eps = 1e-6

    def forward(self, x, y):
        criterion = nn.MSELoss()
        loss = torch.sqrt(criterion(x, y)+self.eps)
        return loss

In [None]:
class FeaturesEmbedding(torch.nn.Module):
    def __init__(self, field_dims, embed_dim):
        super().__init__()
        self.embedding = torch.nn.Embedding(sum(field_dims), embed_dim)
        self.offsets = np.array((0, *np.cumsum(field_dims)[:-1]), dtype=np.int_)
        torch.nn.init.xavier_uniform_(self.embedding.weight.data)
    def forward(self, x):
        x = x + x.new_tensor(self.offsets).unsqueeze(0)
        x = self.embedding(x)
        return x.view(-1, x.size(1) * x.size(2))

class DeepCoNN(nn.Module):
    def __init__(self, field_dims, embed_dim, category_dim ,word_dim, out_dim, kernel_size, conv_1d_out_dim, latent_dim):
        super(DeepCoNN, self).__init__()
        self.embedding = FeaturesEmbedding(field_dims, embed_dim)
        self.cnn_u = CNN(
                         word_dim=word_dim, 
                         out_dim=out_dim, 
                         kernel_size=kernel_size, 
                         conv_1d_out_dim=conv_1d_out_dim,
                        )
        self.cnn_i_summary = CNN(
                         word_dim=word_dim, 
                         out_dim=out_dim, 
                         kernel_size=kernel_size, 
                         conv_1d_out_dim=conv_1d_out_dim,
                        )
        self.cnn_i_category = CNN(
                         word_dim=category_dim, 
                         out_dim=out_dim, 
                         kernel_size=kernel_size, 
                         conv_1d_out_dim=conv_1d_out_dim,
                        )
        self.fm = FactorizationMachine(
                                       input_dim=(conv_1d_out_dim * 3) + (embed_dim*len(field_dims)), 
                                       latent_dim=latent_dim,
                                       )
    def forward(self, x):
        user_isbn_vector, user_text_vector, item_summary_vector, item_category_vector = x[0], x[1], x[2], x[3]
        user_isbn_feature = self.embedding(user_isbn_vector)                    
        user_text_feature = self.cnn_u(user_text_vector)
        item_summary_feature = self.cnn_i_summary(item_summary_vector)
        item_category_feature = self.cnn_i_category(item_category_vector)
        feature_vector = torch.cat([user_isbn_feature, user_text_feature, item_summary_feature,item_category_feature], dim=1)
        output = self.fm(feature_vector)
        return output.squeeze(1)

In [None]:
def train(model, optimizer, data_loader, criterion, device, epochs, model_name):
    minimum_loss = 999999999
    model.train()
    loss_list = []
    tk0 = tqdm.tqdm(range(epochs), smoothing=0, mininterval=1.0)
    for epoch in tk0:
        total_loss = 0
        n = 0
        for i, data in enumerate(data_loader):
            if len(data)==3:
                fields, target = [data['user_summary_merge_vector'].to(device), data['item_summary_vector'].to(device)], data['label'].to(device)
            elif len(data)==4:
                fields, target = [data['user_isbn_vector'].to(device), data['user_summary_merge_vector'].to(device), data['item_summary_vector'].to(device)], data['label'].to(device)
            elif len(data)==5:
                fields, target = [data['user_isbn_vector'].to(device), data['user_summary_merge_vector'].to(device), data['item_summary_vector'].to(device),data['item_category_vector'].to(device)], data['label'].to(device)
            y = model(fields)
            loss = criterion(y, target.float())
            model.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            n += 1
        if minimum_loss > (total_loss/n):
            minimum_loss = (total_loss/n)
            torch.save(model.state_dict(), './{}.pt'.format(model_name))
            loss_list.append([epoch, total_loss/n, 'Model saved'])
        else:
            loss_list.append([epoch, total_loss/n, 'None'])
        tk0.set_postfix(loss=total_loss/n)
    loss_df = pd.DataFrame(loss_list, columns=['epoch', 'loss', 'check'])
    plt.figure(figsize=(21, 5))
    plt.title("Epoch vs Loss Plot")
    plt.plot(loss_df['epoch'], loss_df['loss'], label='loss', marker='o')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.ylim(0, 10)
    plt.legend()
    plt.show()

def test(model, data_loader, device, model_name):
    model.eval()
    model.load_state_dict(torch.load('./{}.pt'.format(model_name)))
    targets, predicts = list(), list()
    with torch.no_grad():
        for data in tqdm.tqdm(data_loader, smoothing=0, mininterval=1.0):
            if len(data)==3:
                fields, target = [data['user_summary_merge_vector'].to(device), data['item_summary_vector'].to(device)], data['label'].to(device)
            elif len(data)==4:
                fields, target = [data['user_isbn_vector'].to(device), data['user_summary_merge_vector'].to(device), data['item_summary_vector'].to(device)], data['label'].to(device)
            elif len(data)==5:
                fields, target = [data['user_isbn_vector'].to(device), data['user_summary_merge_vector'].to(device), data['item_summary_vector'].to(device),data['item_category_vector'].to(device)], data['label'].to(device)
            y = model(fields)
            targets.extend(target.tolist())
            predicts.extend(y.tolist())
    return targets, predicts

In [None]:
class Dataset(Dataset):
    def __init__(self, user_isbn_vector, user_summary_merge_vector, item_summary_vector,item_category_vector,label):
        self.user_isbn_vector = user_isbn_vector
        self.user_summary_merge_vector = user_summary_merge_vector
        self.item_summary_vector = item_summary_vector
        self.item_category_vector = item_category_vector
        self.label = label
        
    def __len__(self):
        return self.user_isbn_vector.shape[0]
    
    def __getitem__(self, i):
        return {
                'user_isbn_vector' : torch.tensor(self.user_isbn_vector[i], dtype=torch.long),
                'user_summary_merge_vector' : torch.tensor(self.user_summary_merge_vector[i].reshape(-1, 1), dtype=torch.float32),
                'item_summary_vector' : torch.tensor(self.item_summary_vector[i].reshape(-1, 1), dtype=torch.float32),
                'item_category_vector' : torch.tensor(self.item_category_vector[i].reshape(-1, 1), dtype=torch.float32),
                'label' : torch.tensor(self.label[i], dtype=torch.float32),
                }

In [None]:
# dataloader define
train_dataset = Dataset(
                        X_train[['user_id', 'isbn']].values,
                        X_train['user_summary_merge_vector'].values, 
                        X_train['item_summary_vector'].values, 
                        X_train['item_category_vector'].values,
                        y_train.values
                        )
test_dataset = Dataset(
    
                        X_test[['user_id', 'isbn']].values,
                        X_test['user_summary_merge_vector'].values, 
                        X_test['item_summary_vector'].values,
                        X_test['item_category_vector'].values,
                        y_test.values
                        )

user_isbn_text_train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=64, num_workers=0, shuffle=True)
user_isbn_text_test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=64, num_workers=0, shuffle=False)

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
user_isbn_text_deepconn = DeepCoNN(
                                    field_dims=np.array([len(user2idx), len(book2idx)], dtype=np.uint32), 
                                    embed_dim=128,
                                    word_dim=768, 
                                    category_dim = 300,
                                    out_dim=100, 
                                    kernel_size=3, 
                                    conv_1d_out_dim=50, 
                                    latent_dim=10,
                                    ).to(device)
opt = torch.optim.Adam(user_isbn_text_deepconn.parameters(), lr=1e-3)
loss = RMSELoss()

In [None]:
# train
train(
    model=user_isbn_text_deepconn,
    optimizer=opt,
    data_loader=user_isbn_text_train_dataloader,
    criterion=loss,
    device=device,
    epochs=10,
    model_name='user_isbn_text_deepconn_rmse',
)

In [None]:
# User Isbn Text DeepCoNN model(rmse) test
targets, predicts = test(
                        model=user_isbn_text_deepconn, 
                        data_loader=user_isbn_text_test_dataloader, 
                        device=device, 
                        model_name='user_isbn_text_deepconn_rmse',
                        )
pred['user_isbn_text_deepconn'] = predicts

In [None]:
pred_df = pd.DataFrame(pred)
test_df['rating'] = pred_df.sum(axis=1)

# End of Document

<font color='red'><b>**WARNING**</b></font> : **본 교육 콘텐츠의 지식재산권은 재단법인 네이버커넥트에 귀속됩니다. 본 콘텐츠를 어떠한 경로로든 외부로 유출 및 수정하는 행위를 엄격히 금합니다.** 다만, 비영리적 교육 및 연구활동에 한정되어 사용할 수 있으나 재단의 허락을 받아야 합니다. 이를 위반하는 경우, 관련 법률에 따라 책임을 질 수 있습니다.