# Drama Recommender Systems

Here, I tried several recommender systems:

1. Simple recommenders
2. Content-based recommender
   - Plot-based recommender
   - Main actors, genres and keywords recommender
3. Collaborative filtering
   - k-Nearest Neighbors
   - Matrix factorization
   
These systems produce different recommendations. Since they each have their disadvantages, perhaps we can combine them at different stages to give better recommendations to our viewers. For example, collaborative filtering does not work well on new users when there is little or no information on them. We would not be able to find similar users or similar content that they like based on their drama watching history. As a start, the simple recommender can be used to recommend dramas that are popular amongst a large number of users.
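
One way to picture this staged combination is a small dispatcher keyed on how many dramas a user has rated. This is only a sketch: the function name `choose_recommender`, the stage labels and the `min_ratings` cut-off of 5 are assumptions for illustration, not values from this notebook, and using the content-based recommender as the middle stage is likewise an assumption.

```python
def choose_recommender(n_user_ratings, min_ratings=5):
    """Pick a recommender stage from how much is known about the user.

    `min_ratings` is a hypothetical cut-off, not a value from this notebook.
    """
    if n_user_ratings == 0:
        # Cold start: nothing to compare against, so fall back to popular dramas
        return "simple"
    if n_user_ratings < min_ratings:
        # A few rated dramas can seed content-based similarity
        return "content_based"
    # Enough history to look for similar users or items
    return "collaborative"


for n in (0, 2, 40):
    print(n, choose_recommender(n))
```

In practice the cut-off would have to be tuned against how the collaborative filtering models behave on sparse users.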

Note: The collaborative filtering recommender systems are in a separate Jupyter notebook.

## 1. Simple Recommenders

Simple recommenders provide recommendations based on the highest weighted rating score. This new score is calculated based on IMDB's formula, which takes into account the mean rating score and the number of users that rated it.
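
Written out with the symbols used by the `weighted_rating` function in this section, the score of a drama is:

$$
\text{score} = \frac{v}{v + m}\,R + \frac{m}{v + m}\,C
$$

where $v$ is the number of ratings the drama received, $R$ is its mean rating, $m$ is the rating-count threshold and $C$ is the mean rating across all dramas. As $v$ grows the score approaches $R$; with few ratings it stays close to $C$.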

In [1]:
import pandas as pd

In [17]:
df = pd.read_csv('drama_reviews_cf.csv')
df.head()

Unnamed: 0,drama_title,user_name,overall_rating
0,Dear My Friends (2016),iamgeralddd,10.0
1,Dear My Friends (2016),Dounie,10.0
2,Dear My Friends (2016),Pelin,10.0
3,Dear My Friends (2016),silent_whispers,9.0
4,Dear My Friends (2016),Dana,9.0


In [18]:
from pandas import DataFrame

In [19]:
# create new DataFrame that gives the mean rating of each drama and the number of ratings given to each drama
drama_df = DataFrame({'no_ratings': df.groupby('drama_title')['overall_rating'].count(),
                      'mean_ratings': df.groupby('drama_title')['overall_rating'].mean()}).reset_index()

In [24]:
drama_df = drama_df.sort_values('mean_ratings', ascending=False)
drama_df["mean_ratings_rank"] = drama_df["mean_ratings"].rank(method='min',ascending=False) # give rank based on mean rating
drama_df.head(15)

Unnamed: 0,drama_title,no_ratings,mean_ratings,mean_ratings_rank
1726,Secret (2000),1,10.0,1.0
1291,More Beautiful Than a Flower (2004),1,10.0,1.0
2162,Toshiie and Matsu (2002),1,10.0,1.0
113,Ando Natsu (2008),1,10.0,1.0
1132,Love Story in Harvard (2004),3,10.0,1.0
1200,Man‧Boy (2012),1,10.0,1.0
916,Kaeru no Oujo-sama (2012),1,10.0,1.0
1934,Tears of Heaven (2014),1,10.0,1.0
815,I'm Pregnant With Your Baby (2013),1,10.0,1.0
1067,Life Special Investigation Team (2008),1,10.0,1.0


In [25]:
overall_mean = drama_df['mean_ratings'].mean()
print(overall_mean)

7.90283473904441


In [26]:
m = drama_df['no_ratings'].quantile(0.90)
print(m) # 90th percentile of no_ratings; dramas at or above this are in the top 10% most-reviewed

21.0


The top 10% most-reviewed dramas have at least 21 ratings each. Clearly, the dramas with the highest mean rating (10.0) are not representative of the viewers' consensus: 14 of the top 15 dramas by mean rating have only 1 user rating. They cannot be fairly compared with dramas that have a mean rating above 9.0 from more than 100 users.

In [27]:
q_dramas = drama_df.copy().loc[drama_df['no_ratings'] >= m]
q_dramas.shape # no. of dramas that are in the top 10% most highly reviewed dramas

(245, 4)

There are 245 dramas in the top 10% of most highly reviewed dramas.

In [28]:
def weighted_rating(x, m=m, C=overall_mean): # function that computes the weighted rating of each drama
    v = x['no_ratings']
    R = x['mean_ratings']
    return (v/(v+m) * R) + (m/(m+v) * C) # calculation based on the IMDB formula

In [29]:
drama_df['score'] = drama_df.apply(weighted_rating, axis=1)

In [34]:
drama_df = drama_df.sort_values('score', ascending=False) # sort dramas based on score calculated above
drama_df["score_rank"] = drama_df["score"].rank(method='min',ascending=False) # give rank based on new score
# drama_df[['drama_title', 'no_ratings', 'mean_ratings', 'score']].head(15) # top 15 dramas
drama_df.head(15)

Unnamed: 0,drama_title,no_ratings,mean_ratings,mean_ratings_rank,score,score_rank
1794,Signal (2016),94,9.712766,138.0,9.382257,1.0
2099,The Untamed (2019),108,9.486111,220.0,9.228368,2.0
1407,Nirvana in Fire (2015),75,9.54,162.0,9.18187,3.0
395,Descendants of the Sun (2016),149,9.35906,238.0,9.179174,4.0
599,Goblin (2016),204,9.272059,265.0,9.144265,5.0
467,Eternal Love (2017),149,9.241611,287.0,9.076233,6.0
1349,My Mister (2018),81,9.358025,239.0,9.058427,7.0
1617,Reply 1997 (2012),112,9.196429,293.0,8.992177,8.0
2273,Weightlifting Fairy Kim Bok Joo (2016),132,9.132576,319.0,8.963788,9.0
1139,Love by Chance (2018),78,9.237179,288.0,8.954137,10.0


At first glance it seems that the new score is just a lower variation of the mean rating. On closer inspection, we see that it takes into account the number of viewers that rated it.

Now, let's see what happened to 15 of the dramas that have a mean rating of 10.0. Despite ranking first by mean rating, they rank poorly by the new score. Among them, a drama with more ratings ranks higher (SPEC: Birth (2010), with 6 ratings, ranks 140th) than one with the same mean rating but fewer ratings (You're So Beautiful (2011), with 2 ratings, ranks a meagre 413th).

In [36]:
drama_df[drama_df['mean_ratings_rank'] == 1.0].head(15)

Unnamed: 0,drama_title,no_ratings,mean_ratings,mean_ratings_rank,score,score_rank
1683,SPEC: Birth (2010),6,10.0,1.0,8.368871,140.0
538,Friend Zone (2018),4,10.0,1.0,8.238381,230.0
2379,dele (2018),4,10.0,1.0,8.238381,230.0
2018,The Judgement (2018),3,10.0,1.0,8.16498,298.0
1132,Love Story in Harvard (2004),3,10.0,1.0,8.16498,298.0
2128,Tiger & Dragon (2005),3,10.0,1.0,8.16498,298.0
203,Beautiful Rain (2012),3,10.0,1.0,8.16498,298.0
101,All In (2003),3,10.0,1.0,8.16498,298.0
2357,You're So Beautiful (2011),2,10.0,1.0,8.085197,413.0
218,Best Mistake (2019),2,10.0,1.0,8.085197,413.0
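
As a quick sanity check, the scores in these tables can be reproduced by hand from the printed values of `m` and `overall_mean`. The sketch below copies three rows from the outputs above into a small DataFrame (the name `sample` is made up for this example) and applies the same formula column-wise instead of with `apply`.

```python
import pandas as pd

# Values copied from the outputs above
m = 21.0                 # 90th percentile of no_ratings
C = 7.90283473904441     # mean of the per-drama mean ratings

sample = pd.DataFrame({
    "drama_title": ["Signal (2016)", "SPEC: Birth (2010)", "You're So Beautiful (2011)"],
    "no_ratings": [94, 6, 2],
    "mean_ratings": [9.712766, 10.0, 10.0],
})

v = sample["no_ratings"]
R = sample["mean_ratings"]

# Same weighted rating as weighted_rating(), computed on whole columns at once
sample["score"] = (v / (v + m)) * R + (m / (v + m)) * C

print(sample.round(6))
```

With only 2 or 6 ratings, the weight `m / (v + m)` on the global mean dominates, which is why a perfect 10.0 shrinks to roughly 8.1 to 8.4.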


## 2. Content-Based Recommender

### Plot Description Based Recommender

Here, we give recommendations based on how similar the plots are. The similarity metric used here is cosine similarity. As we cannot use the synopsis text directly, we first convert it into a numerical representation using TF-IDF.

In [2]:
df = pd.read_csv('drama_list.csv')
df.head()

Unnamed: 0,drama_title,year,main_actors,genres,tags,synopsis
0,The Untamed (2019),2019,"Sean Xiao, Wang Yi Bo","Action, Adventure, Friendship, Historical, ...","Censored Romance, Tragic Past, Adapted From A ...",Wei Wuxian and Lan Wangji are two completely d...
1,My Mister (2018),2018,"Lee Sun Kyun, IU","Business, Psychological, Life, Drama, Fami...","Nice Male Lead, Strong Female Lead, Healing, I...",Park Dong Hoon is a middle-aged engineer who i...
2,Signal (2016),2016,"Lee Je Hoon, Kim Hye Soo, Jo Jin Woong","Suspense, Mystery, Crime, Drama, Supernatu...","Different Timelines, Criminal Profiler, Serial...","Fifteen years ago, a young girl was kidnapped ..."
3,Nirvana in Fire (2015),2015,"Hu Ge, Tamia Liu, Wang Kai, Chen Long, Leo Wu,...","Friendship, Historical, Wuxia, Drama, Fant...","Bromance, Smart Male Lead, Political Intrigue,...",During the Datong era of the Southern Liang Dy...
4,Prison Playbook (2017),2017,"Park Hae Soo, Jung Kyung Ho","Friendship, Comedy, Life, Drama","Prison, Bromance, Slight Romance, Black Comedy...","Kim Je Hyuk, a famous baseball player, is conv..."


In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
df['synopsis'] = df['synopsis'].fillna('') # some dramas might not have synopsis

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df['synopsis'])

tfidf_matrix.shape

(4999, 23363)

The synopses of the 4,999 dramas contain 23,363 distinct terms after English stop words are removed.

### Calculate cosine similarity

A cosine similarity score is calculated between each pair of dramas. The dramas with the highest similarity scores will be returned.

In [4]:
from sklearn.metrics.pairwise import linear_kernel

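# Compute the cosine similarity matrix. TfidfVectorizer L2-normalises each row by default,
# so the plain dot product from linear_kernel equals the cosine similarity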
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
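
Using `linear_kernel` here relies on `TfidfVectorizer` producing unit-length rows (its default `norm='l2'`), in which case the dot product and the cosine similarity coincide. The sketch below checks this on three made-up one-line synopses; the texts and the `toy_*` names are illustrative only and are not taken from `drama_list.csv`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up synopses, for illustration only
toy_synopses = [
    "a palace maid schemes against the empress",
    "the empress schemes inside the palace",
    "two students fall in love through an online game",
]

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_synopses)

# Each TF-IDF row has Euclidean length 1 under the default norm='l2'
row_norms = np.sqrt(np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1)).ravel())

dot = linear_kernel(toy_matrix, toy_matrix)      # plain dot products
cos = cosine_similarity(toy_matrix, toy_matrix)  # normalises, then dot products
same = np.allclose(dot, cos)

print(row_norms)
print(same)
```

The two palace synopses share terms and so score above zero against each other, while the palace and campus synopses share none and score zero.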

In [5]:
# Construct a reverse map from drama titles to row indices
indices = pd.Series(df.index, index=df['drama_title']).drop_duplicates()

In [6]:
# Function that takes in drama title as input and outputs most similar dramas
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the drama that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all dramas with that drama
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the dramas based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar dramas (position 0 is skipped because it is expected to be the drama itself)
    sim_scores = sim_scores[1:11]

    # Get the drama indices
    drama_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar dramas
    return df['drama_title'].iloc[drama_indices]
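
The slice `sim_scores[1:11]` assumes the queried drama always sorts into first place. That may not hold when another drama has an identical synopsis, or when the synopsis is empty and the row is all zeros. A minimal sketch of an alternative is shown below on a hand-written similarity matrix; `toy_titles`, `toy_sim` and `top_k_similar` are hypothetical names, and this is not the function used for the results in this notebook.

```python
import numpy as np
import pandas as pd

# Hand-written symmetric similarity matrix for four placeholder dramas
toy_titles = pd.Series(["A", "B", "C", "D"])
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.7],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.7, 0.3, 1.0],
])


def top_k_similar(idx, sim, titles, k=2):
    """Return the k titles most similar to row `idx`, excluding `idx` itself."""
    order = np.argsort(-sim[idx], kind="stable")  # highest score first
    order = order[order != idx][:k]               # drop the query drama by index, not by position
    return titles.iloc[order].tolist()


print(top_k_similar(0, toy_sim, toy_titles))
```

Filtering by index rather than by position keeps the query drama out of its own recommendations regardless of ties.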

In [7]:
get_recommendations('Story of Yanxi Palace (2018)')

4306                                 The Switch II (2002)
3197                           The Emperor's Harem (2011)
1782                               Emperors and Me (2019)
1068                 Granting You a Dreamlike Life (2018)
1308                          Suddenly This Summer (2018)
4939                                     Tiger Mom (2015)
3128                          The Rippling Blossom (2011)
2886                           Empress Myeongseong (2001)
4178    The Voyage of Emperor Qian Long to Jiang Nan (...
1192                         Legend of the Phoenix (2019)
Name: drama_title, dtype: object

In [8]:
get_recommendations('Love O2O (2016)')

2135                  The Magicians of Love (2006)
2696    Rent a Girlfriend Home for New Year (2010)
59                        The King’s Avatar (2019)
2654                   Happy & Love Forever (2010)
3424                           Painted Skin (2011)
708                              I Hear You (2019)
1456                           Stay With Me (2016)
808              Oh! My Emperor: Season Two (2018)
4435                       Waiting to Bloom (2013)
1033                   Murphy's Law of Love (2015)
Name: drama_title, dtype: object

There are some good recommendations here. Love O2O (2016) and The King’s Avatar (2019) are both based on gaming. In both, the main leads excel at gaming and go on to succeed in their field.

The recommendations for Story of Yanxi Palace (2018) are mostly historical palace dramas, with leads scheming against each other.

### Main Actors, Genres and Keywords Based Recommender

A similar concept is used. However, this time we focus on the main actors, tags and genres instead of the synopsis.

In [9]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        # If the value is a string, clean it. Otherwise (e.g. NaN), return an empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [10]:
# Apply clean_data function to your features.
features = ['main_actors','tags','genres']

for feature in features:
    df[feature] = df[feature].apply(clean_data)

In [12]:
# Create function that combines the text of main actors, tags and genres
def create_soup(x):
    return x['main_actors']+' '+x['tags']+' '+x['genres']

In [13]:
# Create a new soup feature
df['soup'] = df.apply(create_soup, axis=1)

In [14]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df['soup'])
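
Stripping the spaces inside names matters because `CountVectorizer` splits text into word tokens. The sketch below illustrates the effect using the main actors of My Mister (2018) and Signal (2016) from the table above; the helper `squash` is a simplified stand-in for `clean_data`, written only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer


def squash(names):
    """Lower-case a comma-separated string and remove the spaces inside each name."""
    return names.replace(" ", "").lower()


# main_actors values for My Mister (2018) and Signal (2016)
raw = ["Lee Sun Kyun, IU", "Lee Je Hoon, Kim Hye Soo"]

# Without squashing, the shared family name "lee" becomes a token common to both dramas
plain_vocab = set(CountVectorizer().fit(raw).vocabulary_)

# With squashing, each actor is a single token, so these two dramas share none
squashed_vocab = set(CountVectorizer().fit([squash(r) for r in raw]).vocabulary_)

print(sorted(plain_vocab))
print(sorted(squashed_vocab))
```

Without this step, two dramas would look similar merely because their actors share a common family name.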

In [15]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [16]:
# Reset index of your main DataFrame and construct reverse mapping as before
df = df.reset_index()
indices = pd.Series(df.index, index=df['drama_title'])

In [17]:
get_recommendations('Story of Yanxi Palace (2018)', cosine_sim2)

4836                    Jao Sao Ban Rai (2006)
774                The Flame's Daughter (2018)
4812                           Prissana (2000)
962                          Ngao Asoke (2016)
4941                     Kehas See Dang (2011)
3992                       Maiden's Vow (2006)
2477                      Rosy Business (2009)
1585                       Plerng Naree (2016)
4598                       I'm So Smart (2013)
555     Ruyi's Royal Love in the Palace (2018)
Name: drama_title, dtype: object

In [18]:
get_recommendations('Love O2O (2016)', cosine_sim2)

2098     Ugly Duckling Series: Pity Girl (2015)
591         U-Prince: The Badly Politics (2017)
834       U-Prince: The Lovely Geologist (2016)
1597    U-Prince: The Absolute Economist (2016)
4044                    Dream of Colours (2004)
647                    The Sand Princess (2019)
4598                        I'm So Smart (2013)
1496     U-Prince: The Playful Comm-Arts (2017)
96          Put Your Head on My Shoulder (2019)
1153                         My Dear Boy (2017)
Name: drama_title, dtype: object

Good recommendations are given here as well. Story of Yanxi Palace (2018) and The Flame's Daughter (2018) both feature a strong female lead in a historical setting. Yanxi Palace is set in the Qing dynasty, as is Ruyi's Royal Love in the Palace (2018).

Both Love O2O (2016) and Put Your Head on My Shoulder (2019) have a youth/university setting, with romance and an adorable female lead as the commonality.

Source: https://www.datacamp.com/community/tutorials/recommender-systems-python