# 0. load data
- Content
    - Data: animes (16,214 rows) = anime.csv (17,562 rows) + anime_with_synopsis.csv (16,214 rows)
- Users
    - Data: animelist.csv (325,770 users, 17,562 anime, 62M ratings)

In [1]:
INPUT_DIR = '/kaggle/input/anime-recommendation-database-2020'
#!ls {INPUT_DIR}

import pandas as pd
import warnings; warnings.filterwarnings('ignore')

""" 1) anime dataset """

anime = pd.read_csv(INPUT_DIR + '/anime.csv')
anime_with_synopsis = pd.read_csv(INPUT_DIR + '/anime_with_synopsis.csv', usecols=["MAL_ID", "sypnopsis"])
animes = pd.merge(anime, anime_with_synopsis, on='MAL_ID') # merge anime metadata with synopsis
print('number of animes: ', len(animes.MAL_ID.unique()))

""" 2) user rating dataset """ # rating is 0 if the user didn't assign a score

animelist = pd.read_csv(INPUT_DIR + '/animelist.csv', usecols=["user_id", "anime_id", "rating"])
# Keep users with at least 2000 list entries: 2463 users (a threshold of 3000 would leave 762 users)
n_ratings = animelist['user_id'].value_counts()
rating_df = animelist[animelist['user_id'].isin(n_ratings[n_ratings >= 2000].index)]
print('number of users: ', len(rating_df.user_id.unique()))

""" Handle missing values """

animes = animes.dropna()
rating_df = rating_df.dropna()

""" Handle duplicates """

def remove_duplicated_rows(df):
    # Drop exact duplicate rows and return the result (the input frame is not modified in place)
    print('dataframe')
    duplicates = df.duplicated()
    if duplicates.sum() > 0:
        print('> {} duplicates'.format(duplicates.sum()))
        df = df[~duplicates]
    print('> {} duplicates'.format(df.duplicated().sum()))
    return df

animes = remove_duplicated_rows(animes)
rating_df = remove_duplicated_rows(rating_df)

number of animes:  16214
number of users:  2463
dataframe
> 0 duplicates
dataframe
> 0 duplicates


In [2]:
""" A rating of 0 means the user did not score the anime, so those rows must be removed (handled later) """
rating_df['rating'].value_counts()

0     4042050
7      847086
8      650730
6      608142
5      411998
9      319090
10     270191
4      207876
3      123912
2       80155
1       79187
Name: rating, dtype: int64

# 1. Build the datasets

## 1-1. feature matrix (ani_df)
- metadata and/or embeddings (one-hot-encoded genres, TF-IDF genres, TF-IDF synopsis)
- PCA, autoencoder
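
The PCA option listed above could look roughly like the sketch below. It is only an illustration on a hypothetical random matrix (`toy_features` stands in for the one-hot or TF-IDF columns); the number of components is an arbitrary assumption, and for a sparse TF-IDF matrix `TruncatedSVD` would be the usual substitute because `PCA` expects dense input.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical dense feature matrix: 50 anime x 20 features.
toy_features = rng.random((50, 20))

# Project onto 5 principal components (5 is an arbitrary choice for this sketch).
pca = PCA(n_components=5, random_state=0)
reduced = pca.fit_transform(toy_features)

kept_variance = pca.explained_variance_ratio_.sum()
print(reduced.shape)      # rows stay the same, columns shrink to n_components
print(kept_variance)      # share of the total variance kept by the 5 components
```

Checking `explained_variance_ratio_` is one way to decide how many components are enough before the reduced features are used for similarity.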

In [3]:
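""" Select the features to use """

# .copy() so that the column assignments below modify ani_df itself rather than a view of animes
ani_df = animes[['MAL_ID', 'Name', 'Score', 'Genders', 'Type', 
                 'Episodes', 'Aired','Source', 'Duration', 'Rating', 'Ranked', 'sypnopsis']].copy()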
print(ani_df.shape)
ani_df.sample(3)

(16206, 12)


Unnamed: 0,MAL_ID,Name,Score,Genders,Type,Episodes,Aired,Source,Duration,Rating,Ranked,sypnopsis
13349,37929,Mandamgangho,Unknown,Comedy,Movie,1,"Mar 22, 2017",Unknown,1 hr. 13 min.,Unknown,15403.0,No synopsis information has been added to this...
8164,24627,Yamada-kun to 7-nin no Majo: Mou Hitotsu no Su...,7.46,"Comedy, Romance, School, Shounen",OVA,2,"Dec 17, 2014 to May 15, 2015",Manga,30 min. per ep.,PG-13 - Teens 13 or older,1722.0,Shiraishi Urara is the top student in her scho...
9462,30401,Rule,4.97,Dementia,Movie,1,2009,Unknown,2 min.,G - All Ages,10622.0,short animation by Taku Furukawa.


### 1-1-1. Preprocessing (numeric)
- feature: Episodes, Aired, Duration, Ranked
- For now, Unknown values are treated as 0
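
As a sketch of the Unknown-as-0 convention, `pd.to_numeric` with `errors="coerce"` converts string columns in one step. The toy frame below is hypothetical; it is shown for `Ranked` as well, since that column is listed above but is not converted by the cell that follows.

```python
import pandas as pd

# Hypothetical rows; the real columns are strings that may contain "Unknown".
toy = pd.DataFrame({"Ranked": ["28.0", "Unknown", "159.0"],
                    "Episodes": ["26", "1", "Unknown"]})

# errors="coerce" turns anything non-numeric (here "Unknown") into NaN,
# which is then filled with 0 to follow the convention above.
for col in ["Ranked", "Episodes"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce").fillna(0).astype(int)

print(toy)
```

Keeping the intermediate NaN instead of filling with 0 is also an option if a later step should distinguish "unknown" from a true zero.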

In [4]:
""" 1) Episodes: MinMaxScaler """

from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer, MinMaxScaler

ani_df['Episodes'] = ani_df['Episodes'].replace("Unknown", 0).astype(float)
ani_df[["Episodes"]] = MinMaxScaler().fit_transform(ani_df[["Episodes"]])

""" 2) Aired: unify the date format (e.g., Sep 1, 2001 => 2001.9) """

years  = []
months = []
for val in ani_df['Aired']:
    vr = val.split()
    y = 'Unknown'
    m = 'Unknown'
    for v in vr:
        if v.isdigit() and len(v) == 4 :
            y = v
            break
    for v in vr:
        if not v.isdigit() and len(v) >= 3 and v[0].isupper() and v != 'Unknown' :
            m = v[:3]
            break
        
    years += [ y ]
    months += [ m ]

ani_df['Year'] = years
ani_df['Month'] = months

month_to_number = {
'Jan' : 1,         
'Feb' : 2,         
'Mar' : 3,           
'Apr' : 4,              
'May' : 5, 
'Jun' : 6,
'Jul' : 7, 
'Aug' : 8, 
'Sep' : 9, 
'Oct' : 10, 
'Nov' : 11, 
'Dec' : 12}

ani_df['Month'] = ani_df['Month'].replace(month_to_number)

ani_df['Year'] = ani_df['Year'].replace("Unknown", 0).astype(float)
ani_df['Month'] = ani_df['Month'].replace("Unknown", 0).astype(float)

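# NOTE: Month/10 maps Oct, Nov, Dec to +1.0, +1.1, +1.2, so those months land in the following year (e.g., Oct 2002 => 2003.0)
ani_df['date'] = ani_df['Year'] + (ani_df['Month']/10)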

""" 3) Duration: unify the unit to minutes (e.g., 1 hr. 55 min. => 115) """

hrs  = []
mins = []
for val in ani_df['Duration']:
    split_list = val.split() # ['24', 'min.', 'per', 'ep.']
    h = 'Unknown'
    m = 'Unknown'
    for i in split_list:
        if i == 'hr.':
            h = split_list[split_list.index(i)-1]
        elif i == 'min.':
            m = split_list[split_list.index(i)-1]
        
    hrs += [ h ]
    mins += [ m ]

ani_df['hours'] = hrs
ani_df['mins'] = mins

ani_df['hours'] = ani_df['hours'].replace("Unknown", 0).astype(float)
ani_df['mins'] = ani_df['mins'].replace("Unknown", 0).astype(float)

ani_df['duration'] = (ani_df['hours']*60) + ani_df['mins']

""" 4) Ranked: TODO handle Unknown, then convert str -> int """

""" Final """
ani_df = ani_df.drop(['Aired', 'Duration','Year', 'Month','hours', 'mins' ], axis = 1)
ani_df.head()

Unnamed: 0,MAL_ID,Name,Score,Genders,Type,Episodes,Source,Rating,Ranked,sypnopsis,date,duration
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",TV,0.008505,Original,R - 17+ (violence & profanity),28.0,"In the year 2071, humanity has colonized sever...",1998.4,24.0
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space",Movie,0.000327,Original,R - 17+ (violence & profanity),159.0,"other day, another bounty—such is the life of ...",2001.9,115.0
2,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen",TV,0.008505,Manga,PG-13 - Teens 13 or older,266.0,"Vash the Stampede is the man with a $$60,000,0...",1998.4,24.0
3,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",TV,0.008505,Original,PG-13 - Teens 13 or older,2481.0,ches are individuals with special powers like ...,2002.7,25.0
4,8,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",TV,0.01701,Manga,PG - Children,3710.0,It is the dark century and the people are suff...,2004.9,23.0
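
The `date` column above adds `Month/10` to the year, so month values of 10 to 12 spill over into the next year. The sketch below contrasts that encoding with one possible alternative, `(Month - 1) / 12`, which always stays below 1. The alternative is an assumption for illustration, not what the notebook uses, and the toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows; Month 0 stands for Unknown, as in the cell above.
toy = pd.DataFrame({"Year": [1998.0, 2002.0, 2020.0, 0.0],
                    "Month": [4.0, 10.0, 12.0, 0.0]})

# Encoding used in the notebook: month / 10.
toy["date_div10"] = toy["Year"] + toy["Month"] / 10

# Alternative (assumption): zero-based month index / 12, so the fraction is < 1.
month_index = (toy["Month"] - 1).clip(lower=0)
toy["date_frac"] = toy["Year"] + month_index / 12

print(toy)
```

With the alternative, sorting by `date_frac` preserves chronological order, at the cost of values that no longer read as "year.month".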


### 1-1-2. Preprocessing (text)
- BoW: Genders
- TF-IDF: sypnopsis
- OneHotEncoding: Type, Source, Rating

In [5]:
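""" Genders => list of genres ([] if Unknown) """

def process_multilabel(series):
    # NOTE: splitting on "," keeps the leading space of every label after the first,
    # so "Comedy" and " Comedy" end up as different classes in the encoders below.
    series = series.split(",")
    if "Unknown" in series:
        series.remove("Unknown")
    return series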

ani_df["Genders"] = ani_df["Genders"].map(process_multilabel)

""" Build the genre vectors (one-hot and TF-IDF) """

from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

def preprocessing_category(df, column, is_multilabel=False):
    # Binarise labels
    lb = LabelBinarizer()
    if is_multilabel:
        lb = MultiLabelBinarizer()
        
    expandedLabelData = lb.fit_transform(df[column])
    labelClasses = lb.classes_

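    # Create a pandas.DataFrame from our output.
    # NOTE: category_df gets a fresh RangeIndex (0..n-1). If df's index has gaps
    # (e.g., after dropna), the concat below aligns on the union of both indexes
    # and returns more rows than df.
    category_df = pd.DataFrame(expandedLabelData, columns=labelClasses)
    del df[column]
    return pd.concat([df, category_df], axis=1)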

from sklearn.feature_extraction.text import TfidfVectorizer

tfv = TfidfVectorizer(min_df=3,  max_features=None, 
            strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3),
            stop_words = 'english')

# Fill NaNs with an empty string; astype(str) turns each genre list into its printed form
genres_original = ani_df['Genders'].fillna('').astype(str)
genres_vector_tf_idf = tfv.fit_transform(genres_original)

Genders = ani_df["Genders"]
genres_vector_one_hot = preprocessing_category(pd.DataFrame(Genders), "Genders", True).values

print("genres_vector_tf_idf.shape:", genres_vector_tf_idf.shape)
print("genres_vector_one_hot.shape:", genres_vector_one_hot.shape)

genres_vector_tf_idf.shape: (16206, 2211)
genres_vector_one_hot.shape: (16214, 80)
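
The two printed shapes disagree on the number of rows (16206 vs 16214). One plausible cause is that the binarized frame is created with a fresh index and then concatenated with a frame whose index has gaps. The sketch below reproduces that effect on hypothetical toy data and shows one way around it (reusing the original index, plus stripping whitespace from the labels); it is an illustration under those assumptions, not a confirmed diagnosis of the real data.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy frame whose index has gaps, as a frame does after dropna().
toy = pd.DataFrame({"Genders": ["Action, Comedy", "Comedy", "Drama, Action"]},
                   index=[0, 2, 5])

# Strip whitespace so " Comedy" and "Comedy" become the same class.
labels = toy["Genders"].map(
    lambda s: [g.strip() for g in s.split(",") if g.strip() != "Unknown"])

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(labels)

# Reusing toy.index keeps exactly one row per anime.
aligned = pd.DataFrame(encoded, columns=mlb.classes_, index=toy.index)

# A fresh RangeIndex (0, 1, 2) makes concat pad to the union of both indexes.
misaligned = pd.concat([toy, pd.DataFrame(encoded, columns=mlb.classes_)], axis=1)

print(aligned.shape, misaligned.shape)
```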


## 1-2. score matrix (score_df)

In [6]:
""" Data: 2,463 users rated 17,558 anime """

print('animes: ', len(rating_df.anime_id.unique()))
print('users: ', len(rating_df.user_id.unique()))

animes:  17558
users:  2463


In [7]:
import numpy as np

top_users = rating_df.groupby('user_id')['rating'].count()
top_r = rating_df.join(top_users, rsuffix='_r', how='inner', on='user_id')

top_animes = rating_df.groupby('anime_id')['rating'].count()
top_r = top_r.join(top_animes, rsuffix='_r', how='inner', on='anime_id')

score_df = pd.crosstab(top_r.user_id, top_r.anime_id, top_r.rating, aggfunc='sum')
score_df

user_id \ anime_id,1,5,6,7,8,15,16,17,18,19,...,48442,48456,48466,48470,48471,48481,48483,48488,48491,48492
146,,,,,,,0.0,,,,...,,,,,,,,,,
240,0.0,0.0,0.0,0.0,,0.0,0.0,,0.0,0.0,...,,,,,,,,,,
436,0.0,,0.0,,,0.0,0.0,,,,...,,,,,,,,,,
446,7.0,7.0,7.0,,,7.0,10.0,,,,...,,,,,,,,,,
781,7.0,,10.0,10.0,,,7.0,,8.0,10.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
352811,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
352832,0.0,0.0,10.0,,,0.0,0.0,,10.0,10.0,...,,,,,,,,,,
352887,,,,,,,,,,,...,,,,,,,,,,
352922,,,,,,,0.0,0.0,,7.0,...,,,,,,,,,,
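
The matrix above still contains 0.0 entries, which mean "on the list but not scored" rather than a real score. A minimal sketch of building the same kind of user-by-anime matrix with the zeros removed first is shown below; the toy frame is hypothetical and `pivot_table` is used here as an assumed alternative to `crosstab`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for rating_df.
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "anime_id": [10, 20, 10, 30, 20],
    "rating":   [0, 8, 7, 0, 10],
})

# Drop unscored entries before pivoting so 0 cannot be mistaken for a rating.
scored = toy_ratings[toy_ratings["rating"] > 0]
score_matrix = scored.pivot_table(index="user_id", columns="anime_id",
                                  values="rating", aggfunc="mean")

# Share of (user, anime) cells that hold an explicit score.
density = score_matrix.notna().to_numpy().mean()
print(score_matrix)
print(density)
```

Anime that nobody scored (id 30 in the toy data) disappear from the columns, which is usually what a rating-prediction step wants.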


## 1-3. user matrix (user_df)
- ani_df + score_df

In [8]:
all_data = []

user_id_list = list(score_df.index)

for user_id in user_id_list:
    """ per-user score vector """
    score_vector = score_df.loc[user_id] 
    score_vector = score_vector.dropna() 
    score_vector = score_vector[score_vector != 0] # 0 means not rated (many entries seem to be dropped here)

    anime_id_list = list(score_vector.index)
    user_score_list = list(score_vector.values)

    """ keep only the anime this user rated """
    ani_df_user = ani_df.loc[ani_df['MAL_ID'].isin(anime_id_list)] 

    """ build the per-user matrix """
    score_vector_df = pd.DataFrame(score_vector)
    score_vector_df['MAL_ID'] = score_vector_df.index
    score_vector_df.columns = ['score_by_user_{}'.format(user_id), 'MAL_ID']

    user_df = pd.merge(ani_df_user,score_vector_df, how='inner',on='MAL_ID')
    
    """ append user_df """
    all_data.append(user_df)

print(len(all_data))
all_data[0]

2463


Unnamed: 0,MAL_ID,Name,Score,Genders,Type,Episodes,Source,Rating,Ranked,sypnopsis,date,duration,score_by_user_146
0,20,Naruto,7.91,"[Action, Adventure, Comedy, Super Power, M...",TV,0.071966,Manga,PG-13 - Teens 13 or older,660.0,"oments prior to Naruto Uzumaki's birth, a huge...",2003.0,23.0,6.0
1,22,Tennis no Ouji-sama,7.9,"[Action, Comedy, Sports, School, Shounen]",TV,0.058227,Manga,PG-13 - Teens 13 or older,675.0,The world of tennis is harsh and highly compet...,2002.0,22.0,8.0
2,50,Aa! Megami-sama! (TV),7.35,"[Comedy, Supernatural, Magic, Romance, Sei...",TV,0.007851,Manga,PG-13 - Teens 13 or older,2116.0,In a world where humans can have their wish gr...,2005.1,24.0,5.0
3,61,D.N.Angel,7.2,"[Action, Comedy, Magic, Romance, Fantasy, ...",TV,0.008505,Manga,PG-13 - Teens 13 or older,2761.0,"Daisuke Niwa is a clumsy, block-headed, and wi...",2003.4,23.0,7.0
4,74,Gakuen Alice,7.65,"[Comedy, School, Shoujo, Super Power]",TV,0.008505,Manga,G - All Ages,1158.0,kan Sakura is a normal 10-year-old girl. Optim...,2005.0,25.0,9.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
489,41466,Kyojinzoku no Hanayome,5.19,"[Fantasy, Shounen Ai]",TV,0.002944,Manga,R+ - Mild Nudity,10345.0,Kouichi Mizuki ends his high school basketball...,2020.7,6.0,6.0
490,41619,Munou na Nana,7.33,"[Psychological, Shounen, Super Power, Super...",TV,0.004253,Manga,R - 17+ (violence & profanity),2231.0,"Fifty years ago, horrific creatures dubbed as ...",2021.0,23.0,7.0
491,41930,Kamisama ni Natta Hi,6.86,"[Drama, Fantasy]",TV,0.003925,Original,PG-13 - Teens 13 or older,4209.0,Dressed in a conspicuous outfit and armed with...,2021.0,24.0,7.0
492,42517,Ookami-san wa Taberaretai,5.61,"[Romance, Ecchi]",TV,0.000981,Manga,R+ - Mild Nudity,9395.0,"""I want Akagashira-sensei to be my first!"" Hav...",2020.9,5.0,7.0


In [9]:
print('number of users: ', len(all_data))

num = 0
for i in all_data:
    num += len(i)
    
print('average number of ratings per user: ', round(num/len(all_data),1))

number of users:  2463
average number of ratings per user:  1382.7
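
The per-user loop above can also be expressed as a single merge followed by a group-by. The sketch below shows the idea on hypothetical toy frames (`toy_ani` and `toy_ratings` are stand-ins for `ani_df` and `rating_df`); it is an assumed alternative, not a drop-in replacement for the cell above.

```python
import pandas as pd

# Hypothetical stand-ins for ani_df and rating_df.
toy_ani = pd.DataFrame({"MAL_ID": [10, 20, 30], "duration": [24.0, 115.0, 23.0]})
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "anime_id": [10, 20, 10, 30, 20],
    "rating":   [0, 8, 7, 9, 10],
})

# Drop unscored entries, then attach the anime features in one merge.
scored = toy_ratings[toy_ratings["rating"] > 0]
long_df = scored.merge(toy_ani, left_on="anime_id", right_on="MAL_ID", how="inner")

# One frame per user, comparable in spirit to the all_data list above.
per_user = {int(uid): g.drop(columns="user_id") for uid, g in long_df.groupby("user_id")}
counts = {uid: len(g) for uid, g in per_user.items()}
print(counts)
```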


# 2. modeling

## 2-1. vector space model

In [10]:
def cal_weight(df, target, k, weight=None):
    '''
    Predict ratings for one user with item-item cosine similarity.
    Returns two dataframes (plain-mean and similarity-weighted-mean predictions)
    and the RMSE of each on the target anime.
    --------
    df     : dataframe of one user; first column is anime_id, last column is rating,
             and every column in between is a numeric feature
    target : list of anime_id values whose ratings are predicted
    k      : number of most similar anime used for each prediction
    weight : optional per-feature weights (defaults to all ones)
    ----
    * every feature column should be scaled to [0,1] before weights are applied
    '''
    # Called once per user matrix. Needs pandas (pd), numpy (np) and the imports below.
    from sklearn.metrics import mean_squared_error
    from sklearn.metrics.pairwise import cosine_similarity

    if weight is None:
        weight = np.ones(df.shape[1]-2)

    # If k exceeds the number of non-target rows, clamp it to that number.
    if k > df.shape[0] - len(target):
        print('k is larger than the number of available vectors; changing k to', df.shape[0] - len(target))
        k = df.shape[0] - len(target)

    # Work on positional indices so that .iloc is valid whatever the index of df is.
    df_unseen = df.reset_index(drop=True)
    is_target = df_unseen.iloc[:, 0].isin(target).to_numpy()
    df_unseen.iloc[is_target, -1] = np.nan  # hide the ratings to be predicted

    df2 = df_unseen.copy()
    df3 = df_unseen.copy()

    # cosine similarity between the weighted feature vectors
    features = weight * df_unseen.iloc[:, 1:-1]
    cosim = np.round(cosine_similarity(features, features), 4)

    # map anime_id -> row position
    id2row = {anime_id: row for row, anime_id in enumerate(df_unseen.iloc[:, 0])}
    idx = [id2row[anime_id] for anime_id in target]

    for x in range(len(target)):
        sim_scores = [(i, c) for i, c in enumerate(cosim[idx[x]]) if i not in idx]  # exclude targets
        sim_scores = sorted(sim_scores, key=lambda s: s[1], reverse=True)  # most similar first

        sim_scores_df = pd.DataFrame(sim_scores)

        # ksim holds the row positions and similarities of the k most similar anime
        ksim = sim_scores_df.iloc[:k, :]
        # simidx holds the row positions only
        simidx = ksim.iloc[:, 0].to_numpy()

        print('top', k, 'similarity and index for target ', target[x], ':')
        print(ksim)

        # df2: plain mean of the k neighbours' ratings
        df2.iloc[idx[x], -1] = np.mean(df2.iloc[simidx, -1])

        # df3: mean of the k neighbours' ratings, weighted by similarity
        df3.iloc[idx[x], -1] = np.round(np.dot(df3.iloc[simidx, -1], ksim.iloc[:, 1]) / np.sum(ksim.iloc[:, 1]), 2)

    # Evaluation: RMSE on the target anime
    actual = df.iloc[idx, -1]
    RMSE2 = np.sqrt(mean_squared_error(actual, df2.iloc[idx, -1]))
    RMSE3 = np.sqrt(mean_squared_error(actual, df3.iloc[idx, -1]))

    return df2, df3, RMSE2, RMSE3
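
To make the two estimators in `cal_weight` concrete, here is a small self-contained sketch on hypothetical toy data. For a target anime, the plain estimate is the mean rating of its k nearest neighbours, and the weighted estimate is sum(s_i * r_i) / sum(s_i), where s_i is the cosine similarity to neighbour i and r_i is the user's rating of it. The feature values, ratings, and k below are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy feature vectors for five anime and one user's ratings; the last rating is held out.
features = np.array([[1.0, 0.0, 0.2],
                     [0.9, 0.1, 0.3],
                     [0.0, 1.0, 0.8],
                     [0.1, 0.9, 0.7],
                     [1.0, 0.1, 0.2]])
ratings = np.array([9.0, 8.0, 4.0, 5.0, np.nan])
target, k = 4, 2

# Similarity of the target row to every row, then the k most similar non-target rows.
sims = cosine_similarity(features[[target]], features)[0]
candidates = [i for i in range(len(ratings)) if i != target]
top_k = sorted(candidates, key=lambda i: sims[i], reverse=True)[:k]

mean_pred = ratings[top_k].mean()
weighted_pred = np.dot(ratings[top_k], sims[top_k]) / sims[top_k].sum()
print(top_k, mean_pred, weighted_pred)
```

When the k neighbours have nearly equal similarity, as here, the two estimates are close; they diverge more when similarities within the top k differ widely.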

#### How to choose the weights
- Empirically
- Or learn the optimal values by training

### About Weight

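- For the synopsis, compute a cosine similarity on its own; for the remaining features, compute a similarity with per-feature weights. The final similarity is then a weighted combination of the two.
- Alternatively, assign a weight to every feature and compute a single similarity. (The TF-IDF values are not on the same [0, 1] scale as the min-max-scaled features; consider whether that is acceptable.)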
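
The first option above can be sketched as a convex combination of two similarity matrices. Everything in the block is hypothetical: the random matrices stand in for the TF-IDF synopsis vectors and the scaled metadata features, and `alpha` and `feature_weight` are placeholder values that would have to be tuned.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
synopsis_vec = rng.random((4, 6))            # stand-in for TF-IDF synopsis vectors
meta_vec = rng.random((4, 3))                # stand-in for scaled metadata features
feature_weight = np.array([2.0, 1.0, 0.5])   # hypothetical per-feature weights
alpha = 0.3                                  # hypothetical weight of the synopsis similarity

sim_synopsis = cosine_similarity(synopsis_vec)
sim_meta = cosine_similarity(meta_vec * feature_weight)

# Convex combination: the result stays within the range spanned by the two inputs.
sim_final = alpha * sim_synopsis + (1 - alpha) * sim_meta
print(np.round(sim_final, 3))
```

Because each similarity is computed separately, the differing scales of TF-IDF and metadata features no longer interact; only `alpha` balances the two.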

#### Procedure

0-1. Feature selection (if needed) and feature normalization to [0, 1].

0-2. Split off a test set of anime (used for evaluation).

- Simply hold out 20% of the anime.

0-3. Select a validation set of anime for finding the weights.

* It has to be decided whether the weights are shared by all users or optimized separately for every user. In the former case, it also has to be decided whether the validation set comes from one user or from several users.

- Select 100 anime (10 for each of the 10 rating values).
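
Steps 0-2 and 0-3 could be sketched as below for a single user. The toy frame is hypothetical, `n_per_rating` is reduced to 2 so the example stays small (it would be 10 for the 100-anime set), and stratifying the test split by score is an added assumption so that every rating value remains available for the validation set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy user matrix: 50 rated anime, five for each rating from 1 to 10.
user_df = pd.DataFrame({"MAL_ID": range(1, 51),
                        "score": [(i % 10) + 1 for i in range(50)]})

# 0-2: hold out 20% of the user's anime for the final evaluation.
# stratify is an assumption; the notes above only ask for a plain 20% split.
train_df, test_df = train_test_split(user_df, test_size=0.2, random_state=0,
                                     stratify=user_df["score"])

# 0-3: validation set with the same number of anime for every rating value.
n_per_rating = 2
val_df = train_df.groupby("score").sample(n=n_per_rating, random_state=0)

print(len(train_df), len(test_df), len(val_df))
```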

1. Predict with the synopsis only, as a baseline for comparison.

2. Find the weights by optimization.

* The criterion for "optimal" weights still has to be decided.

-> Build a validation set from one user with the same number of anime for each rating
   (e.g., 10 anime for each rating from 1 to 10)
   and find the w that maximizes the similarity between anime with the same rating.

-> Or use the validation set to find the weights that give the best prediction performance on it.

- Searching with random weights is probably not a good idea: domain knowledge suggests, with some certainty, that some features matter more than others.

-> Or simply choose empirically among a small number of candidate weight vectors.

3. Predict with the weighted features.

4. Predict without weights.

5. Compare the results.

* e.g., test ratio 0.2 (for a user who rated 2,500 anime, predict the ratings of 500 anime)
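
The "choose among a few candidates" variant of step 2 could be sketched as follows. The helper `knn_rmse`, the random features, the synthetic ratings, and the candidate weight vectors are all hypothetical and exist only to show the shape of the search; the procedure above does not prescribe any of them.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def knn_rmse(features, ratings, held_out, weight, k=3):
    """RMSE of similarity-weighted k-NN predictions for the held-out rows."""
    sims = cosine_similarity(features * weight)
    known = np.array([i for i in range(len(ratings)) if i not in held_out])
    errors = []
    for t in held_out:
        # k most similar rows among those whose rating is treated as known
        top = known[np.argsort(sims[t, known])[::-1][:k]]
        pred = np.dot(ratings[top], sims[t, top]) / sims[t, top].sum()
        errors.append(pred - ratings[t])
    return float(np.sqrt(np.mean(np.square(errors))))

rng = np.random.default_rng(0)
features = rng.random((30, 3))                               # toy scaled features for 30 anime
ratings = np.clip(np.round(1 + 9 * features[:, 0]), 1, 10)   # toy ratings driven by feature 0
held_out = [0, 5, 10, 15, 20, 25]                            # toy validation rows

# Hypothetical candidate weight vectors; the first is the unweighted baseline.
candidates = [(1, 1, 1), (3, 1, 1), (1, 3, 1), (1, 1, 3)]
scores = {w: knn_rmse(features, ratings, held_out, np.array(w)) for w in candidates}
best = min(scores, key=scores.get)

print(scores)
print("best weights:", best)
```

The same loop covers steps 3 to 5: the `(1, 1, 1)` entry is the prediction without weights, and the remaining entries can be compared against it on the test split.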


## 2-2. prediction model