# Drama Recommender Systems

This is a continuation of the drama recommender systems that I have been building. Here I focus on collaborative filtering.

These are the recommenders that I have built:

1. Simple recommenders
2. Content-based recommender
   - Plot-based recommender
   - Main actors, genres and keywords recommender
3. Collaborative filtering
   - k-Nearest Neighbors
   - Matrix factorization
 

## 3. Collaborative Filtering

### k-Nearest Neighbors

The k-Nearest Neighbors algorithm works by finding the k items closest to a given item according to a similarity metric, such as cosine similarity.
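
To make the metric concrete, here is a minimal sketch of cosine similarity between two rating vectors. The ratings are made up for illustration and the helper name `cosine_similarity` is my own. scikit-learn's 'cosine' metric reports the corresponding distance, which is 1 minus the similarity, so a smaller distance means a closer neighbor.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical ratings given to two dramas by the same four users (0 = not rated)
drama_a = [10, 9, 0, 8]
drama_b = [9, 10, 0, 7]

similarity = cosine_similarity(drama_a, drama_b)
distance = 1 - similarity  # cosine distance, as reported by the 'cosine' metric
print('similarity: {0:.4f}, distance: {1:.4f}'.format(similarity, distance))
```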

The drama DataFrame is first converted into an m x n array, where m is the number of dramas and n is the number of users. To do this, the pivot method reshapes the DataFrame into a wide format with dramas as rows and users as columns. Missing observations are filled with 0.
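
The reshaping step can be sketched on a tiny, made-up DataFrame that uses the same column names as the dataset. The titles, users and ratings here are hypothetical.

```python
import pandas as pd

# Hypothetical reviews in the same long format as the dataset
toy = pd.DataFrame({
    'drama_title': ['Drama A', 'Drama A', 'Drama B', 'Drama C'],
    'user_name': ['user1', 'user2', 'user1', 'user3'],
    'overall_rating': [10.0, 8.0, 9.0, 7.5],
})

# one row per drama, one column per user; unrated cells become 0
toy_pivot = toy.pivot(index='drama_title', columns='user_name', values='overall_rating').fillna(0)
print(toy_pivot)
```

Note that pivot raises a ValueError if the same user has reviewed the same drama more than once. In that case pivot_table, which aggregates duplicates, may be a better fit.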

In [2]:
import pandas as pd

df = pd.read_csv('drama_reviews_cf.csv')
df.head()

Unnamed: 0,drama_title,user_name,overall_rating
0,Dear My Friends (2016),iamgeralddd,10.0
1,Dear My Friends (2016),Dounie,10.0
2,Dear My Friends (2016),Pelin,10.0
3,Dear My Friends (2016),silent_whispers,9.0
4,Dear My Friends (2016),Dana,9.0


In [3]:
# create a new DataFrame that has the drama title and its corresponding number of ratings 
rating_count = (df.
     groupby(by = ['drama_title'])['overall_rating'].
     count().
     reset_index().
     rename(columns = {'overall_rating': 'rating_count'})
     [['drama_title', 'rating_count']]
    )
rating_count.head()

Unnamed: 0,drama_title,rating_count
0,'Cause You're My Boy (2018),24
1,0 Goshitsu no Kyaku (2009),1
2,1 Litre no Namida (2005),90
3,1 Pound no Fukuin (2008),6
4,100 Days My Prince (2018),24


In [4]:
# merge new DataFrame with original DataFrame 
df_rating_count = df.merge(rating_count, on = 'drama_title', how = 'left')
df_rating_count.head()

Unnamed: 0,drama_title,user_name,overall_rating,rating_count
0,Dear My Friends (2016),iamgeralddd,10.0,12
1,Dear My Friends (2016),Dounie,10.0,12
2,Dear My Friends (2016),Pelin,10.0,12
3,Dear My Friends (2016),silent_whispers,9.0,12
4,Dear My Friends (2016),Dana,9.0,12


In [5]:
# get the summary statistics of the rating counts
pd.set_option('display.float_format', lambda x: '%.3f' % x) 
print(df_rating_count['rating_count'].describe())

count   20075.000
mean       47.641
std        55.052
min         1.000
25%         8.000
50%        23.000
75%        85.000
max       231.000
Name: rating_count, dtype: float64


In [6]:
popularity_threshold = 23 # median rating_count across review rows (the 50% value from describe() above)
popular_drama = df_rating_count.query('rating_count >= @popularity_threshold') # keep reviews of dramas with at least that many ratings
popular_drama.head()

Unnamed: 0,drama_title,user_name,overall_rating,rating_count
1247,The Untamed (2019),Ai-chan,10.0,108
1248,The Untamed (2019),angelVSdevil,10.0,108
1249,The Untamed (2019),philll,10.0,108
1250,The Untamed (2019),ChineseDramaFan,10.0,108
1251,The Untamed (2019),AmyLovesDramas,9.5,108
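
The threshold of 23 is hard-coded from the describe() output. As a hedged sketch on made-up data, it could instead be derived with quantile. The example also shows that the median over review rows (what describe() reports on the merged frame) can differ from the median over dramas, because popular dramas contribute more rows.

```python
import pandas as pd

# Hypothetical merged frame with one row per review, like df_rating_count
toy = pd.DataFrame({
    'drama_title': ['A', 'A', 'A', 'B', 'C', 'C'],
    'rating_count': [3, 3, 3, 1, 2, 2],
})

# median over review rows
row_threshold = toy['rating_count'].quantile(0.5)

# median over dramas, counting each drama once
drama_threshold = toy.drop_duplicates('drama_title')['rating_count'].quantile(0.5)

popular = toy[toy['rating_count'] >= row_threshold]
print(row_threshold, drama_threshold, popular['drama_title'].unique())
```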


In [9]:
# create a sparse matrix of drama title and user name, with overall rating as the values
from scipy.sparse import csr_matrix

popular_drama_pivot = popular_drama.pivot(index = 'drama_title', columns = 'user_name', values = 'overall_rating').fillna(0)
popular_drama_matrix = csr_matrix(popular_drama_pivot.values)

from sklearn.neighbors import NearestNeighbors

knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
knn.fit(popular_drama_matrix)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [31]:
query_index = 179
distances, indices = knn.kneighbors(popular_drama_pivot.iloc[query_index, :].values.reshape(1, -1), n_neighbors = 6)

# print the 5 nearest dramas (smallest cosine distance); the first neighbor is the query drama itself
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(popular_drama_pivot.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}'.format(i, popular_drama_pivot.index[indices.flatten()[i]], distances.flatten()[i]))

Recommendations for Story of Yanxi Palace (2018):

1: The Story of Ming Lan (2018), with distance of 0.7376371435874067
2: Search: WWW (2019), with distance of 0.8642142802384017
3: Bloody Romance (2018), with distance of 0.8646716088864472
4: 100 Days My Prince (2018), with distance of 0.868704193619679
5: Ashes of Love (2018), with distance of 0.8721470193781997


In [36]:
query_index = 111
distances, indices = knn.kneighbors(popular_drama_pivot.iloc[query_index, :].values.reshape(1, -1), n_neighbors = 6)

# print the 5 nearest dramas (smallest cosine distance); the first neighbor is the query drama itself
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(popular_drama_pivot.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}'.format(i, popular_drama_pivot.index[indices.flatten()[i]], distances.flatten()[i]))

Recommendations for Love O2O (2016):

1: Put Your Head on My Shoulder (2019), with distance of 0.8803582766261174
2: A Love So Beautiful (2017), with distance of 0.8948562492439815
3: W (2016), with distance of 0.9091192461559305
4: Mad Dog (2017), with distance of 0.9095781438625755
5: Another Miss Oh (2016), with distance of 0.9109215103470132
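
The two lookup cells above repeat the same logic with a hard-coded row index. A possible refactoring, shown here as a sketch on a made-up pivot table, is a small helper that looks a drama up by title. The function name `recommend` and the toy data are my own assumptions, not part of the original notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def recommend(title, pivot, model, n=5):
    """Return (title, cosine distance) pairs for the n dramas nearest to `title`."""
    query = pivot.loc[[title]].values  # shape (1, n_users)
    k = min(n + 1, len(pivot))         # ask for one extra: the query matches itself
    distances, indices = model.kneighbors(query, n_neighbors=k)
    results = []
    for dist, idx in zip(distances.flatten(), indices.flatten()):
        neighbor = pivot.index[idx]
        if neighbor != title:          # drop the query drama by name, not by position
            results.append((neighbor, float(dist)))
    return results[:n]

# Hypothetical drama-by-user ratings (0 = not rated)
toy_pivot = pd.DataFrame(
    [[10, 9, 0, 0], [9, 10, 0, 0], [0, 0, 8, 9], [10, 0, 0, 8]],
    index=['Drama A', 'Drama B', 'Drama C', 'Drama D'],
    columns=['user1', 'user2', 'user3', 'user4'],
    dtype=float,
)
toy_knn = NearestNeighbors(metric='cosine', algorithm='brute')
toy_knn.fit(csr_matrix(toy_pivot.values))

recs = recommend('Drama A', toy_pivot, toy_knn, n=2)
for rank, (name, dist) in enumerate(recs, start=1):
    print('{0}: {1}, with distance of {2:.4f}'.format(rank, name, dist))
```

Filtering the query drama out by title avoids relying on it always being the first neighbor returned.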


I do not quite agree with these recommendations. In my opinion, there are other dramas that are more similar. One likely reason is the shortage of data. I could only retrieve user data from written reviews, but many users give ratings without writing a review, and that data is not accessible through web scraping.

To enlarge the data pool slightly, I decided to use all dramas instead of only those that meet the 50% rating count threshold.

#### All dramas inclusive

In [39]:
# create a sparse matrix of drama title and user name, with overall rating as the values
from scipy.sparse import csr_matrix

df_pivot = df.pivot(index = 'drama_title', columns = 'user_name', values = 'overall_rating').fillna(0)
df_matrix = csr_matrix(df_pivot.values)

# fit a nearest-neighbors model that uses cosine distance between dramas
from sklearn.neighbors import NearestNeighbors

knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
knn.fit(df_matrix)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [47]:
query_index = 1853
distances, indices = knn.kneighbors(df_pivot.iloc[query_index, :].values.reshape(1, -1), n_neighbors = 6)

# print the 5 nearest dramas (smallest cosine distance); the first neighbor is the query drama itself
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(df_pivot.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, df_pivot.index[indices.flatten()[i]], distances.flatten()[i]))

Recommendations for Story of Yanxi Palace (2018):

1: The Legend of Hao Lan (2019), with distance of 0.6804742115494322:
2: Our Glamorous Time (2018), with distance of 0.7359146481821208:
3: The Story of Ming Lan (2018), with distance of 0.7376371435874067:
4: Scarlet Heart 2 (2014), with distance of 0.7408084469827373:
5: Nirvana in Fire 2: The Wind Blows in Chang Lin (2017), with distance of 0.7560108381816948:


In [55]:
query_index = 1117
distances, indices = knn.kneighbors(df_pivot.iloc[query_index, :].values.reshape(1, -1), n_neighbors = 6)

# print the 5 nearest dramas (smallest cosine distance); the first neighbor is the query drama itself
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(df_pivot.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, df_pivot.index[indices.flatten()[i]], distances.flatten()[i]))

Recommendations for Love O2O (2016):

1: Put Your Head on My Shoulder (2019), with distance of 0.8803582766261174:
2: A Love So Beautiful (2017), with distance of 0.8948562492439815:
3: The Demon Master (2018), with distance of 0.8963664762836093:
4: U-Prince The Series: The Absolute Economist (2016), with distance of 0.907740008516381:
5: Looking For Dorothy (2006), with distance of 0.907740008516381:


The recommendations for Story of Yanxi Palace (2018) have improved. The Legend of Hao Lan (2019) is very similar to Yanxi Palace in setting (historical + romance + scheming palace drama). Moreover, both dramas share the same lead actor and actress. The popularity of Yanxi Palace could also have led its viewers to review Hao Lan.

### Matrix Factorization

Matrix factorization is a technique that decomposes the user-item interaction matrix into the product of two lower-dimensional rectangular matrices (Wikipedia). Here, each drama is first reduced to a low-dimensional latent vector. Pearson’s R correlation coefficient is then calculated for every pair of dramas and stored in a matrix. Finally, we pick the dramas that have high correlation coefficient values (between 0.9 and 1.0) with the drama of interest.
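
As a small illustration of this pipeline, the sketch below runs the same two steps (truncated SVD, then Pearson correlation between the latent vectors) on a made-up rating matrix. The ratings and the choice of 3 components are assumptions for the example only. Note that at least 3 components are needed for the correlation to be informative, since the Pearson correlation of two 2-element vectors is always +1 or -1.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical drama-by-user ratings (rows: dramas, columns: users; 0 = not rated)
ratings = np.array([
    [10, 9, 8, 0, 0, 1],
    [9, 10, 7, 0, 1, 0],
    [0, 0, 1, 8, 9, 10],
    [1, 0, 0, 9, 8, 9],
    [5, 6, 0, 7, 0, 4],
], dtype=float)

svd = TruncatedSVD(n_components=3, random_state=17)
latent = svd.fit_transform(ratings)  # shape (5, 3): one latent vector per drama

corr = np.corrcoef(latent)           # shape (5, 5): correlation between every drama pair

# indices of dramas whose latent vector correlates strongly with drama 0
similar_to_first = np.where(corr[0] >= 0.9)[0]
print(latent.shape, corr.shape, similar_to_first)
```

The drama itself always appears in its own result, because its correlation with itself is 1.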

The drama DataFrame is first converted into a pivot table with users as rows and dramas as columns. Empty observations are filled with 0.

In [73]:
df_pivot_2 = df.pivot(index = 'user_name', columns = 'drama_title', values = 'overall_rating').fillna(0)
df_pivot_2.shape

(6918, 2382)

In [74]:
X = df_pivot_2.values.T

We then reduce the dimensionality of the matrix using truncated Singular Value Decomposition.

In [61]:
import sklearn
from sklearn.decomposition import TruncatedSVD

SVD = TruncatedSVD(n_components=12, random_state=17)
matrix = SVD.fit_transform(X)
matrix.shape

(2382, 12)

As we can see, TruncatedSVD has reduced the dimensions significantly, from 2382x6918 (the transposed matrix X, with dramas as rows) to 2382x12.

In [62]:
import warnings
import numpy as np

warnings.filterwarnings("ignore", category = RuntimeWarning)
# Pearson correlation between the latent vectors of every pair of dramas
corr = np.corrcoef(matrix)
corr.shape

(2382, 2382)

In [70]:
drama_title = df_pivot_2.columns
drama_list = list(drama_title)

In [69]:
# get list of dramas that have high correlation coefficient scores
yanxi = drama_list.index("Story of Yanxi Palace (2018)")
corr_yanxi = corr[yanxi]
list(drama_title[(corr_yanxi >= 0.9)])

['Doctor Prisoner (2019)',
 'Secrets of Three Kingdoms (2018)',
 'Story of Yanxi Palace (2018)']

In [75]:
# get list of dramas that have high correlation coefficient scores
loveo2o = drama_list.index("Love O2O (2016)")
corr_loveo2o  = corr[loveo2o]
list(drama_title[(corr_loveo2o >= 0.9)])

["Angel 'N' Devil (2014)",
 'Late Night Hospital (2011)',
 'Looking For Dorothy (2006)',
 'Love O2O (2016)',
 'Proud of Love (2016)',
 'U-Prince The Series: The Absolute Economist (2016)',
 'Yasuko to Kenji (2008)',
 'Youth (2018)']

Although these recommendations are derived from how other users rated the dramas, I cannot judge them myself, as I have not watched the recommended dramas.

Besides the explicit interactions that I have used, such as ratings, other important information could be included to improve the results. Such implicit user interactions include:

- Browsed dramas (whether users read the drama reviews/ratings)
- Favourite actors (which actors/actresses do the users like)
- Watching status (whether users have finished/stopped watching the drama they had started)

These factors can be assigned different weights, depending on their importance. For example, if a user has browsed a related drama and subsequently rated the new drama highly, it reinforces the view that the new drama is closely related to the initial one.
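
One way such a weighting could look is sketched below. I have not collected any of these implicit signals, so the signal names, the weights and the function `interaction_score` are purely hypothetical.

```python
# Hypothetical weights that sum to 1; the values are illustrative, not tuned on real data
WEIGHTS = {'rating': 0.6, 'browsed': 0.1, 'favourite_actor': 0.1, 'finished': 0.2}

def interaction_score(rating, browsed, favourite_actor, finished, weights=WEIGHTS):
    """Blend an explicit rating (0-10) with binary implicit signals into a 0-1 score."""
    return (
        weights['rating'] * (rating / 10.0)
        + weights['browsed'] * float(browsed)
        + weights['favourite_actor'] * float(favourite_actor)
        + weights['finished'] * float(finished)
    )

# a user who browsed the drama, finished it and rated it 9 out of 10
score = interaction_score(rating=9.0, browsed=True, favourite_actor=False, finished=True)
print(round(score, 2))
```

The blended score could then replace the raw rating as the value in the pivot table.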

## Conclusion

It is interesting to find that the results given by collaborative filtering differ from those of the content-based recommender systems. While a drama may look good on paper (e.g. its synopsis), its quality still depends on other factors such as actors, music and editing. This is especially the case with remakes or serial dramas. While the first series may be overwhelmingly popular, its remakes or follow-ups may prove otherwise. No wonder collaborative filtering is such an important tool today, and organizations that hold such personal data are able to triumph in today's competitive markets.

Source: https://datascienceplus.com/building-a-book-recommender-system-the-basics-knn-and-matrix-factorization/