In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import math

sb.set()

pd.options.mode.chained_assignment = None

In [3]:
data = pd.read_csv("csvs/clean.csv").drop(columns='Unnamed: 0')
for c in ['keywords','directors','cast','genres','networks','production_companies']:
    data[c] = data[c].str.split(';')
data['good_or_not'] = (data.average_rating>=7.5)

# Use of Natural Language Processing (NLP) on Dramas

NLP is a branch of Machine Learning concerned with enabling computers to process text and extract meaning from it.

In this notebook, we explore the use of NLP with classification to predict whether a K-Drama is "good" (an average rating of at least 7.5) based on its plot synopsis.

In addition, we apply NLP to build a simple recommender system based on genres, cast names and keywords extracted from the plot.

## Classification Using NLP Metrics

With NLP, text can be converted into numerical features and hence used for classification. We shall explore this approach for predicting whether a drama is good.

## Preparation

To prepare the information for analysis, we shall clean up the synopsis. However, the counts below show that there are some NA values in synopsis, so we first remove the dramas without one.

In [4]:
data.isna().sum()

tmdb_id                   0
name                      0
original_name             0
keywords                261
airing_date               0
directors               800
cast                      0
genres                    0
number_of_seasons         0
number_of_episodes        0
episode_run_time          0
synopsis                  8
popularity                0
average_rating            0
rating_count              0
networks                  0
production_companies    300
cast_popularity           0
good_or_not               0
dtype: int64

In [5]:
data.dropna(subset='synopsis',inplace=True)
data.isna().sum()

tmdb_id                   0
name                      0
original_name             0
keywords                256
airing_date               0
directors               792
cast                      0
genres                    0
number_of_seasons         0
number_of_episodes        0
episode_run_time          0
synopsis                  0
popularity                0
average_rating            0
rating_count              0
networks                  0
production_companies    296
cast_popularity           0
good_or_not               0
dtype: int64

### Cleaning Description Field

To clean the synopsis, we break each sentence down into a list of words. Then, we remove "stop words": common English function words (such as "the", "of" and "with") that carry little meaning on their own.

Thankfully, there is a Python library for this. We shall use the Rake class from rake_nltk to "rake" out the keywords from the synopses.

Before that, however, we remove all hyphens. Korean names in the synopses are often hyphenated (for example "Young-woo"), and removing the hyphen keeps each name as a single word instead of letting it be split apart during raking.
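To illustrate the cleaning idea described above, here is a minimal, standard-library-only sketch of hyphen removal followed by stop-word filtering. It is not how rake_nltk works internally: the stop-word list is a tiny hypothetical subset, the function name `simple_keywords` and the sample sentence are made up for illustration, and RAKE additionally groups and scores candidate phrases.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "of", "for", "with", "and", "is", "to", "in"}

def simple_keywords(text, stop_words=STOP_WORDS):
    """Lowercase, drop hyphens, split into words and remove stop words."""
    text = text.lower().replace("-", "")    # "Young-woo" -> "youngwoo"
    words = re.findall(r"[a-z0-9]+", text)  # keep alphanumeric runs only
    return [w for w in words if w not in stop_words]

# Made-up sentence, not taken from the dataset.
print(simple_keywords("A brilliant lawyer named Young-woo joins the firm."))
# ['brilliant', 'lawyer', 'named', 'youngwoo', 'joins', 'firm']
```

Note how the hyphenated name survives as the single token "youngwoo", which is the effect the hyphen removal is meant to achieve.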

In [6]:
from rake_nltk import Rake

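def gen_plot_keywords(data, desc):
    """Return, for each row, the RAKE keywords of column `desc` joined by spaces."""
    plot_keywords = []
    r = Rake()
    for i, row in data.iterrows():
        # Remove hyphens so that hyphenated names such as "Young-woo" stay as one word.
        # This also removes the hyphen in ' - ', so no separate replacement is needed.
        text = row[desc].replace('-', '')
        r.extract_keywords_from_text(text)
        key_words_dict_scores = r.get_word_degrees()   # Maps each keyword to its degree score.
        plot_keywords.append(' '.join(key_words_dict_scores.keys()))
    return plot_keywords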

In [7]:
clean_nlp = data.copy()
clean_nlp.synopsis = gen_plot_keywords(data, 'synopsis')
clean_nlp.head()

Unnamed: 0,tmdb_id,name,original_name,keywords,airing_date,directors,cast,genres,number_of_seasons,number_of_episodes,episode_run_time,synopsis,popularity,average_rating,rating_count,networks,production_companies,cast_popularity,good_or_not
0,99966,All of Us Are Dead,지금 우리 학교는,"[high school, bullying, based on comic, horror...",2022-01-28,[JQ Lee],"[Park Ji-hu, Yoon Chan-young, Cho Yi-hyun, Lom...","[Action & Adventure, Drama, Sci-Fi & Fantasy]",2.0,12.0,65.0,high school becomes ground zero zombie virus o...,398.111,8.421,2750.0,[Netflix],"[Kim Jong-hak Production, SLL, Film Monster]",82.507,True
1,93405,Squid Game,오징어 게임,"[secret organization, challenge, survival, fic...",2021-09-17,[Hwang Dong-hyuk],"[Lee Jung-jae, Park Hae-soo, Jung Ho-yeon, Wi ...","[Action & Adventure, Mystery, Drama]",2.0,9.0,54.0,hundreds cashstrapped players accept strange i...,323.945,7.835,11835.0,[Netflix],[Siren Pictures],68.428,True
2,136699,Glitch,글리치,"[friendship, investigation, ufo, miniseries, s...",2022-10-07,[Roh Deok],"[Jeon Yeo-been, Nana, Lee Dong-hwi, Ryu Kyung-...","[Drama, Comedy, Mystery, Sci-Fi & Fantasy]",1.0,10.0,54.0,young woman joins forces ufo enthusiast invest...,223.442,7.674,43.0,[Netflix],[Studio 329],81.095,True
3,197067,Extraordinary Attorney Woo,이상한 변호사 우영우,"[court case, court, autism, lawyer, courtroom,...",2022-06-29,[Yoo In-sik],"[Park Eun-bin, Kang Tae-oh, Kang Ki-young, Jeo...","[Drama, Comedy]",1.0,16.0,70.0,brilliant attorney woo youngwoo tackles challe...,147.054,8.31,381.0,"[Netflix, ENA]","[AStory, KT Studio Genie]",131.091,True
4,129473,Young Lady and Gentleman,신사와 아가씨,"[tutor, family, single father, healing, rich m...",2021-09-25,,"[Ji Hyun-woo, Lee Se-hee, Park Ha-na, Oh Hyun-...","[Comedy, Drama, Family]",1.0,52.0,70.0,lee young kook widower three children still ’ ...,128.825,8.125,8.0,[KBS2],,71.909,True


For experimentation, we shall also conduct lemmatisation on the synopses. This involves reducing words to their base form. For example, "studies", "studying", "studied" etc. will be reduced to "study" after lemmatisation.

Lemmatisation depends on each word's part of speech (POS), so we shall use spaCy, which performs the POS tagging automatically.

In [8]:
import spacy

def gen_lemmatised_keywords(data, desc):
    """Return the text in column `desc` with every token replaced by its lemma."""
    lemmatised = []
    # The parser and named-entity recogniser are not needed for lemmatisation.
    parse = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    for i, row in data.iterrows():
        text = parse(row[desc])
        new_text = ' '.join([token.lemma_ for token in text])
        lemmatised.append(new_text)
    return lemmatised

In [9]:
clean_nlp['lemmatised_synopsis'] = gen_lemmatised_keywords(data, 'synopsis')
clean_nlp.head()

Unnamed: 0,tmdb_id,name,original_name,keywords,airing_date,directors,cast,genres,number_of_seasons,number_of_episodes,episode_run_time,synopsis,popularity,average_rating,rating_count,networks,production_companies,cast_popularity,good_or_not,lemmatised_synopsis
0,99966,All of Us Are Dead,지금 우리 학교는,"[high school, bullying, based on comic, horror...",2022-01-28,[JQ Lee],"[Park Ji-hu, Yoon Chan-young, Cho Yi-hyun, Lom...","[Action & Adventure, Drama, Sci-Fi & Fantasy]",2.0,12.0,65.0,high school becomes ground zero zombie virus o...,398.111,8.421,2750.0,[Netflix],"[Kim Jong-hak Production, SLL, Film Monster]",82.507,True,a high school become ground zero for a zombie ...
1,93405,Squid Game,오징어 게임,"[secret organization, challenge, survival, fic...",2021-09-17,[Hwang Dong-hyuk],"[Lee Jung-jae, Park Hae-soo, Jung Ho-yeon, Wi ...","[Action & Adventure, Mystery, Drama]",2.0,9.0,54.0,hundreds cashstrapped players accept strange i...,323.945,7.835,11835.0,[Netflix],[Siren Pictures],68.428,True,hundred of cash - strap player accept a strang...
2,136699,Glitch,글리치,"[friendship, investigation, ufo, miniseries, s...",2022-10-07,[Roh Deok],"[Jeon Yeo-been, Nana, Lee Dong-hwi, Ryu Kyung-...","[Drama, Comedy, Mystery, Sci-Fi & Fantasy]",1.0,10.0,54.0,young woman joins forces ufo enthusiast invest...,223.442,7.674,43.0,[Netflix],[Studio 329],81.095,True,a young woman join force with a UFO enthusiast...
3,197067,Extraordinary Attorney Woo,이상한 변호사 우영우,"[court case, court, autism, lawyer, courtroom,...",2022-06-29,[Yoo In-sik],"[Park Eun-bin, Kang Tae-oh, Kang Ki-young, Jeo...","[Drama, Comedy]",1.0,16.0,70.0,brilliant attorney woo youngwoo tackles challe...,147.054,8.31,381.0,"[Netflix, ENA]","[AStory, KT Studio Genie]",131.091,True,brilliant attorney Woo Young - woo tackle chal...
4,129473,Young Lady and Gentleman,신사와 아가씨,"[tutor, family, single father, healing, rich m...",2021-09-25,,"[Ji Hyun-woo, Lee Se-hee, Park Ha-na, Oh Hyun-...","[Comedy, Drama, Family]",1.0,52.0,70.0,lee young kook widower three children still ’ ...,128.825,8.125,8.0,[KBS2],,71.909,True,Lee Young Kook be a widower with three child ....


After lemmatisation, the text still contains stop words and stray punctuation, such as "of" and the detached hyphen in "cash - strap". Hence, let us rake the lemmatised synopsis again.

In [10]:
clean_nlp['lemmatised_synopsis'] = gen_plot_keywords(clean_nlp, 'lemmatised_synopsis')
clean_nlp.head()

Unnamed: 0,tmdb_id,name,original_name,keywords,airing_date,directors,cast,genres,number_of_seasons,number_of_episodes,episode_run_time,synopsis,popularity,average_rating,rating_count,networks,production_companies,cast_popularity,good_or_not,lemmatised_synopsis
0,99966,All of Us Are Dead,지금 우리 학교는,"[high school, bullying, based on comic, horror...",2022-01-28,[JQ Lee],"[Park Ji-hu, Yoon Chan-young, Cho Yi-hyun, Lom...","[Action & Adventure, Drama, Sci-Fi & Fantasy]",2.0,12.0,65.0,high school becomes ground zero zombie virus o...,398.111,8.421,2750.0,[Netflix],"[Kim Jong-hak Production, SLL, Film Monster]",82.507,True,high school become ground zero zombie virus ou...
1,93405,Squid Game,오징어 게임,"[secret organization, challenge, survival, fic...",2021-09-17,[Hwang Dong-hyuk],"[Lee Jung-jae, Park Hae-soo, Jung Ho-yeon, Wi ...","[Action & Adventure, Mystery, Drama]",2.0,9.0,54.0,hundreds cashstrapped players accept strange i...,323.945,7.835,11835.0,[Netflix],[Siren Pictures],68.428,True,hundred cash strap player accept strange invit...
2,136699,Glitch,글리치,"[friendship, investigation, ufo, miniseries, s...",2022-10-07,[Roh Deok],"[Jeon Yeo-been, Nana, Lee Dong-hwi, Ryu Kyung-...","[Drama, Comedy, Mystery, Sci-Fi & Fantasy]",1.0,10.0,54.0,young woman joins forces ufo enthusiast invest...,223.442,7.674,43.0,[Netflix],[Studio 329],81.095,True,young woman join force ufo enthusiast investig...
3,197067,Extraordinary Attorney Woo,이상한 변호사 우영우,"[court case, court, autism, lawyer, courtroom,...",2022-06-29,[Yoo In-sik],"[Park Eun-bin, Kang Tae-oh, Kang Ki-young, Jeo...","[Drama, Comedy]",1.0,16.0,70.0,brilliant attorney woo youngwoo tackles challe...,147.054,8.31,381.0,"[Netflix, ENA]","[AStory, KT Studio Genie]",131.091,True,brilliant attorney woo young tackle challenge ...
4,129473,Young Lady and Gentleman,신사와 아가씨,"[tutor, family, single father, healing, rich m...",2021-09-25,,"[Ji Hyun-woo, Lee Se-hee, Park Ha-na, Oh Hyun-...","[Comedy, Drama, Family]",1.0,52.0,70.0,lee young kook widower three children still ’ ...,128.825,8.125,8.0,[KBS2],,71.909,True,lee young kook widower three child still get d...


Much better :D

## Classification using Term Frequency - Inverse Document Frequency (TF-IDF)

Words alone cannot be used directly by the model. Word vectorisation is needed to replace the words with numerical values that can be stored in a matrix.

One such vectorisation is TF-IDF. TF-IDF measures how important a term is to a text relative to the whole corpus (dataset). A term's weight increases with how often it appears in the text itself, but decreases with the number of texts in the corpus that contain it.

By using TF-IDF on the plots, we get a measure of how distinctive the wording of each plot is.
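To make the weighting described above concrete, here is a minimal sketch of the textbook formula, tf × log(N / df), using only the standard library. The function name `tf_idf` and the mini-corpus are hypothetical. scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalises each row by default, so its numbers will differ from this sketch, although the ranking idea is the same.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical mini-corpus of already-cleaned plot keywords.
corpus = [
    "zombie virus school",
    "school romance teacher",
    "lawyer court romance",
]
for doc_weights in tf_idf(corpus):
    print({term: round(weight, 3) for term, weight in doc_weights.items()})
```

In the first document, "zombie" appears in only one text and therefore receives a higher weight than "school", which appears in two.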

With the vectorisation of the plot, we can then perform classification using a Random Forest Classifier.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

'''
The following functions follow the following flow:
gen_tfidf_forest_results(args) takes in the data and splits the predictor and response sets for training and testing.
It initialises a TF-IDF vectoriser as well as a Random Forest classifier.
The vectoriser will set its max features according to the argument as well as the n-gram range.
The classifier weights the classes inversely to their frequencies (class_weight="balanced").
Then, it uses a pipeline to allow text to be transformed using the vectoriser before being used for classification in the classifier.
The results of the prediction will be printed according to the max features set in the vectoriser using the pipeline.
'''

def accuracy_summary(pipeline, X_train, y_train, X_test, y_test):
    pipe_fit = pipeline.fit(X_train, y_train)              # Vectorises the data then trains the classifier in the pipeline.
    y_pred = pipe_fit.predict(X_test)                      # Generates the prediction.
    accuracy = accuracy_score(y_test, y_pred)              # Saves the prediction accuracy.
    print("accuracy score: {0:.2f}%".format(accuracy*100))
    return accuracy

def gen_tfidf_forest_results(data=clean_nlp, X_name = 'synopsis', y_name = 'good_or_not', n_features=[5000], stop_words=None, ngram_range=(1, 3)):
    X = data[X_name]                                     # Use the data passed in, not the global clean_nlp.
    y = data[y_name]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)  # No random_state is set, so every call draws a different split.
    
    tfidf = TfidfVectorizer()                            # Initialise the TF-IDF Vectoriser.
    rf = RandomForestClassifier(class_weight="balanced") # Initialise the Random Forest Classifier.
    result = []                                          # Used to store the results of each max features provided as an iterable in the argument.
    for n in n_features:
        tfidf.set_params(stop_words=stop_words, max_features=n, ngram_range=ngram_range)         # Sets the stop words (if provided), max features and n-gram limit.
        checker_pipeline = Pipeline([('vectorizer', tfidf), ('classifier', rf)])                 # Sets up the pipeline.
        print("Test result for {} features".format(n))
        nfeature_accuracy = accuracy_summary(checker_pipeline, X_train, y_train, X_test, y_test) # Check above function.
        result.append((n,nfeature_accuracy))                                                     # Stores results as a tuple and appends to the list.
    return checker_pipeline, result, X_train, X_test, y_train, y_test

In [12]:
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer

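'''
Prints the classification report for a pipeline.
The pipeline (vectoriser followed by classifier) must be passed in.
Note that the pipeline is refitted here, so the accuracy in the report can differ slightly
from the one printed earlier because the Random Forest is randomised.
'''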

def gen_classification_report(X_train, X_test, y_train, y_test, pipeline):
    pipe_fit = pipeline.fit(X_train, y_train)
    y_pred = pipe_fit.predict(X_test)
    print(classification_report(y_test, y_pred, target_names=['negative','positive']))

In [13]:
pipeline, tfidf_forest_res, X_train, X_test, y_train, y_test = gen_tfidf_forest_results(n_features=np.arange(2000,6001,2000))

Test result for 2000 features
accuracy score: 54.87%
Test result for 4000 features
accuracy score: 56.64%
Test result for 6000 features
accuracy score: 56.19%


In [14]:
gen_classification_report(X_train, X_test, y_train, y_test, pipeline)

              precision    recall  f1-score   support

    negative       0.45      0.23      0.31        95
    positive       0.59      0.79      0.68       131

    accuracy                           0.56       226
   macro avg       0.52      0.51      0.49       226
weighted avg       0.53      0.56      0.52       226



### Results of Prediction 1

Using this method on the unlemmatised plot, the accuracy is rather mediocre. The precision and recall for the negative class are low: with a negative recall of 0.23, most of the not-good dramas were predicted as good, meaning that the pipeline predicted the positive class far too often.
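To put the accuracy in context, it helps to compare it with a trivial baseline that always predicts the majority class. The sketch below does this with the support counts and accuracy copied from the classification report above; the variable names are only illustrative.

```python
# Support counts and accuracy copied from the classification report above.
support = {"negative": 95, "positive": 131}
reported_accuracy = 0.56

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(f"majority class:    {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.2%}")
print(f"model accuracy:    {reported_accuracy:.2%}")
```

On this particular test split, always predicting "positive" would be correct for 131 of the 226 dramas, which is slightly above the accuracy reported for the pipeline.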

In [15]:
pipeline2, tfidf_forest_res_2, X_train, X_test, y_train, y_test = gen_tfidf_forest_results(X_name='lemmatised_synopsis',n_features=np.arange(2000,6001,2000))
gen_classification_report(X_train, X_test, y_train, y_test, pipeline2)

Test result for 2000 features
accuracy score: 59.29%
Test result for 4000 features
accuracy score: 60.62%
Test result for 6000 features
accuracy score: 61.06%
              precision    recall  f1-score   support

    negative       0.30      0.22      0.25        73
    positive       0.67      0.76      0.71       153

    accuracy                           0.58       226
   macro avg       0.49      0.49      0.48       226
weighted avg       0.55      0.58      0.56       226



### Results of Prediction 2 and Conclusion

Predicting using the lemmatised plot keywords seems to be slightly better. The improvement is small, and since each run uses a different random train/test split (note the different support counts in the two reports), it should be read with caution. Still, it suggests that lemmatisation allows words to be compared more directly, which may improve the TF-IDF representation of each drama.

However, this does not necessarily mean that using NLP with classification to determine whether a drama will be rated well is a fruitless method. Likely causes for the poor performance of our model include the following:

1. Too little data. With a much larger dataset, the model would likely perform better.
2. The response may be ill-suited. A more distinctive plot does not necessarily mean that the drama will be good. As such, regression on popularity might be a better choice.
3. Other vectorisation methods could be used. More informed vectorisation methods, including those relying on Large Language Models (LLMs), may perform much better.

Despite the small scale and limitations, this method reached roughly 55% to 61% prediction accuracy across the runs above, which may improve with more informed and better-trained models.

# Simple K-Drama Recommender

Other than data analysis, we can also use vectorisation and a similarity measure to create a simple recommender system.

## Preparation

In order to prepare the data for NLP, we will have to first identify the text variables. The following are strings in lists:

1. Keywords
2. Cast
3. Genres
4. Networks
5. Production Companies

Among these 5, keywords, cast and genres are the most relevant for the classification and recommender system.

However, many dramas do not have keywords (256 of the remaining rows are missing them). As such, we shall use the keywords previously extracted from the synopsis instead.

As for the synopsis, it is free text that has to be broken down into words for NLP use, which was done above.

Finally, all the fields will be combined into a single "bag of words": one string per drama in which word order no longer matters, stored in a separate column called "bag_of_words".

In [16]:
data.head()

Unnamed: 0,tmdb_id,name,original_name,keywords,airing_date,directors,cast,genres,number_of_seasons,number_of_episodes,episode_run_time,synopsis,popularity,average_rating,rating_count,networks,production_companies,cast_popularity,good_or_not
0,99966,All of Us Are Dead,지금 우리 학교는,"[high school, bullying, based on comic, horror...",2022-01-28,[JQ Lee],"[Park Ji-hu, Yoon Chan-young, Cho Yi-hyun, Lom...","[Action & Adventure, Drama, Sci-Fi & Fantasy]",2.0,12.0,65.0,A high school becomes ground zero for a zombie...,398.111,8.421,2750.0,[Netflix],"[Kim Jong-hak Production, SLL, Film Monster]",82.507,True
1,93405,Squid Game,오징어 게임,"[secret organization, challenge, survival, fic...",2021-09-17,[Hwang Dong-hyuk],"[Lee Jung-jae, Park Hae-soo, Jung Ho-yeon, Wi ...","[Action & Adventure, Mystery, Drama]",2.0,9.0,54.0,Hundreds of cash-strapped players accept a str...,323.945,7.835,11835.0,[Netflix],[Siren Pictures],68.428,True
2,136699,Glitch,글리치,"[friendship, investigation, ufo, miniseries, s...",2022-10-07,[Roh Deok],"[Jeon Yeo-been, Nana, Lee Dong-hwi, Ryu Kyung-...","[Drama, Comedy, Mystery, Sci-Fi & Fantasy]",1.0,10.0,54.0,A young woman joins forces with a UFO enthusia...,223.442,7.674,43.0,[Netflix],[Studio 329],81.095,True
3,197067,Extraordinary Attorney Woo,이상한 변호사 우영우,"[court case, court, autism, lawyer, courtroom,...",2022-06-29,[Yoo In-sik],"[Park Eun-bin, Kang Tae-oh, Kang Ki-young, Jeo...","[Drama, Comedy]",1.0,16.0,70.0,Brilliant attorney Woo Young-woo tackles chall...,147.054,8.31,381.0,"[Netflix, ENA]","[AStory, KT Studio Genie]",131.091,True
4,129473,Young Lady and Gentleman,신사와 아가씨,"[tutor, family, single father, healing, rich m...",2021-09-25,,"[Ji Hyun-woo, Lee Se-hee, Park Ha-na, Oh Hyun-...","[Comedy, Drama, Family]",1.0,52.0,70.0,Lee Young Kook is a widower with three childre...,128.825,8.125,8.0,[KBS2],,71.909,True


### Cleaning Lists of Strings

The list fields have already been split into individual words or phrases. We shall clean the 3 fields as follows:

1. Keywords
> Join list with space.

2. Cast
> Lowercase all words.
> 
> Join names into one word.
> 
> Replace '-' with ''.
> 
> Join list with space.

3. Genres
> Lowercase all words.
> 
> Replace ' & ' with ' '.
> 
> Replace '-' with ''.
> 
> Join list with space.

In [17]:
clean_nlp.keywords = data.keywords.str.join(' ')
clean_nlp.head()

Unnamed: 0,tmdb_id,name,original_name,keywords,airing_date,directors,cast,genres,number_of_seasons,number_of_episodes,episode_run_time,synopsis,popularity,average_rating,rating_count,networks,production_companies,cast_popularity,good_or_not,lemmatised_synopsis
0,99966,All of Us Are Dead,지금 우리 학교는,high school bullying based on comic horror com...,2022-01-28,[JQ Lee],"[Park Ji-hu, Yoon Chan-young, Cho Yi-hyun, Lom...","[Action & Adventure, Drama, Sci-Fi & Fantasy]",2.0,12.0,65.0,high school becomes ground zero zombie virus o...,398.111,8.421,2750.0,[Netflix],"[Kim Jong-hak Production, SLL, Film Monster]",82.507,True,high school become ground zero zombie virus ou...
1,93405,Squid Game,오징어 게임,secret organization challenge survival fiction...,2021-09-17,[Hwang Dong-hyuk],"[Lee Jung-jae, Park Hae-soo, Jung Ho-yeon, Wi ...","[Action & Adventure, Mystery, Drama]",2.0,9.0,54.0,hundreds cashstrapped players accept strange i...,323.945,7.835,11835.0,[Netflix],[Siren Pictures],68.428,True,hundred cash strap player accept strange invit...
2,136699,Glitch,글리치,friendship investigation ufo miniseries suspense,2022-10-07,[Roh Deok],"[Jeon Yeo-been, Nana, Lee Dong-hwi, Ryu Kyung-...","[Drama, Comedy, Mystery, Sci-Fi & Fantasy]",1.0,10.0,54.0,young woman joins forces ufo enthusiast invest...,223.442,7.674,43.0,[Netflix],[Studio 329],81.095,True,young woman join force ufo enthusiast investig...
3,197067,Extraordinary Attorney Woo,이상한 변호사 우영우,court case court autism lawyer courtroom law f...,2022-06-29,[Yoo In-sik],"[Park Eun-bin, Kang Tae-oh, Kang Ki-young, Jeo...","[Drama, Comedy]",1.0,16.0,70.0,brilliant attorney woo youngwoo tackles challe...,147.054,8.31,381.0,"[Netflix, ENA]","[AStory, KT Studio Genie]",131.091,True,brilliant attorney woo young tackle challenge ...
4,129473,Young Lady and Gentleman,신사와 아가씨,tutor family single father healing rich man po...,2021-09-25,,"[Ji Hyun-woo, Lee Se-hee, Park Ha-na, Oh Hyun-...","[Comedy, Drama, Family]",1.0,52.0,70.0,lee young kook widower three children still ’ ...,128.825,8.125,8.0,[KBS2],,71.909,True,lee young kook widower three child still get d...


In [18]:
new_cast = []
for i, row in data.iterrows():
    temp = []
    for n in row.cast:
        temp.append(str(n.lower().replace('-','').replace(' ','')))
    new_cast.append(temp)
clean_nlp.cast = new_cast
clean_nlp.cast = clean_nlp.cast.str.join(' ')
clean_nlp.head()

Unnamed: 0,tmdb_id,name,original_name,keywords,airing_date,directors,cast,genres,number_of_seasons,number_of_episodes,episode_run_time,synopsis,popularity,average_rating,rating_count,networks,production_companies,cast_popularity,good_or_not,lemmatised_synopsis
0,99966,All of Us Are Dead,지금 우리 학교는,high school bullying based on comic horror com...,2022-01-28,[JQ Lee],parkjihu yoonchanyoung choyihyun lomon yooinso...,"[Action & Adventure, Drama, Sci-Fi & Fantasy]",2.0,12.0,65.0,high school becomes ground zero zombie virus o...,398.111,8.421,2750.0,[Netflix],"[Kim Jong-hak Production, SLL, Film Monster]",82.507,True,high school become ground zero zombie virus ou...
1,93405,Squid Game,오징어 게임,secret organization challenge survival fiction...,2021-09-17,[Hwang Dong-hyuk],leejungjae parkhaesoo junghoyeon wihajun ohyou...,"[Action & Adventure, Mystery, Drama]",2.0,9.0,54.0,hundreds cashstrapped players accept strange i...,323.945,7.835,11835.0,[Netflix],[Siren Pictures],68.428,True,hundred cash strap player accept strange invit...
2,136699,Glitch,글리치,friendship investigation ufo miniseries suspense,2022-10-07,[Roh Deok],jeonyeobeen nana leedonghwi ryukyungsoo kochan...,"[Drama, Comedy, Mystery, Sci-Fi & Fantasy]",1.0,10.0,54.0,young woman joins forces ufo enthusiast invest...,223.442,7.674,43.0,[Netflix],[Studio 329],81.095,True,young woman join force ufo enthusiast investig...
3,197067,Extraordinary Attorney Woo,이상한 변호사 우영우,court case court autism lawyer courtroom law f...,2022-06-29,[Yoo In-sik],parkeunbin kangtaeoh kangkiyoung jeonbaesoo ba...,"[Drama, Comedy]",1.0,16.0,70.0,brilliant attorney woo youngwoo tackles challe...,147.054,8.31,381.0,"[Netflix, ENA]","[AStory, KT Studio Genie]",131.091,True,brilliant attorney woo young tackle challenge ...
4,129473,Young Lady and Gentleman,신사와 아가씨,tutor family single father healing rich man po...,2021-09-25,,jihyunwoo leesehee parkhana ohhyunkyung chahwa...,"[Comedy, Drama, Family]",1.0,52.0,70.0,lee young kook widower three children still ’ ...,128.825,8.125,8.0,[KBS2],,71.909,True,lee young kook widower three child still get d...


In [19]:
new_genres = []
for i, row in data.iterrows():
    temp = []
    for n in row.genres:
        temp.append(str(n.lower().replace('-','').replace(' & ',' ')))
    new_genres.append(temp)
clean_nlp.genres = new_genres
clean_nlp.genres = clean_nlp.genres.str.join(' ')
clean_nlp.head()

Unnamed: 0,tmdb_id,name,original_name,keywords,airing_date,directors,cast,genres,number_of_seasons,number_of_episodes,episode_run_time,synopsis,popularity,average_rating,rating_count,networks,production_companies,cast_popularity,good_or_not,lemmatised_synopsis
0,99966,All of Us Are Dead,지금 우리 학교는,high school bullying based on comic horror com...,2022-01-28,[JQ Lee],parkjihu yoonchanyoung choyihyun lomon yooinso...,action adventure drama scifi fantasy,2.0,12.0,65.0,high school becomes ground zero zombie virus o...,398.111,8.421,2750.0,[Netflix],"[Kim Jong-hak Production, SLL, Film Monster]",82.507,True,high school become ground zero zombie virus ou...
1,93405,Squid Game,오징어 게임,secret organization challenge survival fiction...,2021-09-17,[Hwang Dong-hyuk],leejungjae parkhaesoo junghoyeon wihajun ohyou...,action adventure mystery drama,2.0,9.0,54.0,hundreds cashstrapped players accept strange i...,323.945,7.835,11835.0,[Netflix],[Siren Pictures],68.428,True,hundred cash strap player accept strange invit...
2,136699,Glitch,글리치,friendship investigation ufo miniseries suspense,2022-10-07,[Roh Deok],jeonyeobeen nana leedonghwi ryukyungsoo kochan...,drama comedy mystery scifi fantasy,1.0,10.0,54.0,young woman joins forces ufo enthusiast invest...,223.442,7.674,43.0,[Netflix],[Studio 329],81.095,True,young woman join force ufo enthusiast investig...
3,197067,Extraordinary Attorney Woo,이상한 변호사 우영우,court case court autism lawyer courtroom law f...,2022-06-29,[Yoo In-sik],parkeunbin kangtaeoh kangkiyoung jeonbaesoo ba...,drama comedy,1.0,16.0,70.0,brilliant attorney woo youngwoo tackles challe...,147.054,8.31,381.0,"[Netflix, ENA]","[AStory, KT Studio Genie]",131.091,True,brilliant attorney woo young tackle challenge ...
4,129473,Young Lady and Gentleman,신사와 아가씨,tutor family single father healing rich man po...,2021-09-25,,jihyunwoo leesehee parkhana ohhyunkyung chahwa...,comedy drama family,1.0,52.0,70.0,lee young kook widower three children still ’ ...,128.825,8.125,8.0,[KBS2],,71.909,True,lee young kook widower three child still get d...


In [20]:
clean_nlp['bag_of_words'] = clean_nlp.genres + ' ' + clean_nlp.cast + ' ' + clean_nlp.lemmatised_synopsis
clean_nlp.head()

Unnamed: 0,tmdb_id,name,original_name,keywords,airing_date,directors,cast,genres,number_of_seasons,number_of_episodes,...,synopsis,popularity,average_rating,rating_count,networks,production_companies,cast_popularity,good_or_not,lemmatised_synopsis,bag_of_words
0,99966,All of Us Are Dead,지금 우리 학교는,high school bullying based on comic horror com...,2022-01-28,[JQ Lee],parkjihu yoonchanyoung choyihyun lomon yooinso...,action adventure drama scifi fantasy,2.0,12.0,...,high school becomes ground zero zombie virus o...,398.111,8.421,2750.0,[Netflix],"[Kim Jong-hak Production, SLL, Film Monster]",82.507,True,high school become ground zero zombie virus ou...,action adventure drama scifi fantasy parkjihu ...
1,93405,Squid Game,오징어 게임,secret organization challenge survival fiction...,2021-09-17,[Hwang Dong-hyuk],leejungjae parkhaesoo junghoyeon wihajun ohyou...,action adventure mystery drama,2.0,9.0,...,hundreds cashstrapped players accept strange i...,323.945,7.835,11835.0,[Netflix],[Siren Pictures],68.428,True,hundred cash strap player accept strange invit...,action adventure mystery drama leejungjae park...
2,136699,Glitch,글리치,friendship investigation ufo miniseries suspense,2022-10-07,[Roh Deok],jeonyeobeen nana leedonghwi ryukyungsoo kochan...,drama comedy mystery scifi fantasy,1.0,10.0,...,young woman joins forces ufo enthusiast invest...,223.442,7.674,43.0,[Netflix],[Studio 329],81.095,True,young woman join force ufo enthusiast investig...,drama comedy mystery scifi fantasy jeonyeobeen...
3,197067,Extraordinary Attorney Woo,이상한 변호사 우영우,court case court autism lawyer courtroom law f...,2022-06-29,[Yoo In-sik],parkeunbin kangtaeoh kangkiyoung jeonbaesoo ba...,drama comedy,1.0,16.0,...,brilliant attorney woo youngwoo tackles challe...,147.054,8.31,381.0,"[Netflix, ENA]","[AStory, KT Studio Genie]",131.091,True,brilliant attorney woo young tackle challenge ...,drama comedy parkeunbin kangtaeoh kangkiyoung ...
4,129473,Young Lady and Gentleman,신사와 아가씨,tutor family single father healing rich man po...,2021-09-25,,jihyunwoo leesehee parkhana ohhyunkyung chahwa...,comedy drama family,1.0,52.0,...,lee young kook widower three children still ’ ...,128.825,8.125,8.0,[KBS2],,71.909,True,lee young kook widower three child still get d...,comedy drama family jihyunwoo leesehee parkhan...


## Recommender Function

After generating the bag of words, we shall use the bag of words to generate a matrix of vocabulary frequency using a Count Vectoriser.

Then, we shall use cosine similarity to measure the similarity between rows of the matrix. This compares each drama with every other drama to find how similar they are based on their vocabulary.

The recommender finally generates 10 dramas that have the highest similarity index with the title in question.
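The two steps described above, counting words and comparing the resulting vectors, can be sketched with the standard library alone. The drama titles and bags of words below are hypothetical and not taken from the dataset, and `cosine` is a hand-written helper rather than scikit-learn's cosine_similarity.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical bags of words (illustration only).
bags = {
    "Drama A": "action horror zombie school virus",
    "Drama B": "horror monster virus apartment",
    "Drama C": "comedy romance lawyer court",
}
vectors = {title: Counter(text.split()) for title, text in bags.items()}

query = "Drama A"
scores = {
    title: cosine(vectors[query], vec)
    for title, vec in vectors.items() if title != query
}
# Most similar dramas first.
for title, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{title}: {score:.3f}")
```

"Drama B" shares two words with the query and ranks first, while "Drama C" shares none and scores zero. The recommender applies the same idea to every pair of dramas at once.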

In [23]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

def recommend(title, data=clean_nlp, bag_of_words='bag_of_words', name='name'):
    count = CountVectorizer()
    count_matrix = count.fit_transform(pd.Series(data[bag_of_words]))        # Vectorises all the bag of words fields in the data.
    cosine_sim = cosine_similarity(count_matrix, count_matrix)               # Generates the pairwise cosine similarity matrix between all dramas.
    recommended_dramas = []
    indices = pd.Series(data[name]).reset_index(drop=True)                   # Positional index, so it lines up with the rows of cosine_sim even after rows were dropped.
    idx = indices[indices == title].index[0]                                 # Gets the row position of the queried drama.
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False) # Sorts the similarity scores of the queried drama in descending order.
    top_10_indices = list(score_series.iloc[1:11].index)                     # Skips the first entry (normally the drama itself) and takes the next 10.
    
    for i in top_10_indices:
        recommended_dramas.append(indices.iloc[i])
        
    return recommended_dramas

In [24]:
recommend('All of Us Are Dead')

['Ultimate Weapon Alice',
 'Mimicus',
 'Invincible Parachute Agent',
 'Hero',
 'One More Time',
 'Dokgo Rewind',
 'Another Parting',
 'Joseon Exorcist',
 'Just Dance',
 'Sweet Home']

While not perfect, the function is able to return recommended dramas based on similarity with the queried drama.

We personally do not know most of the dramas recommended. However, when querying 'All of Us Are Dead', a drama about a zombie apocalypse, we did get 'Sweet Home' recommended, which we know is related as it also deals with monsters and infection.

Hence, the recommender gets the job done. Go ahead and try it if you are bored of your current dramas! (Data is limited, so do note that not all dramas are included :/ )