# Korean Drama Plot Text Data Exploration

In my earlier [notebook](google.com), I explored Korean drama metadata that I scraped from Wikipedia. While Wikipedia is rich in other areas, it did not provide much information on the plots of the dramas, so I found another website to get the plot data I wanted. Luckily, the plot data was in English, as the site is a drama review and curation site for foreign fans of Korean dramas.

## Now let's see what we are working with and what it looks like

In [6]:
import pickle
import numpy as np
from gensim.models import LdaModel
from gensim import corpora, models

In [2]:
# Load plot data (the context manager closes the file after reading)
file_path = 'data/english_texts.pkl'
with open(file_path, 'rb') as pkl:
    eng_texts = pickle.load(pkl)

In [3]:
# Load titles (the context manager closes the file after reading)
file_path = 'data/titles.pkl'
with open(file_path, 'rb') as pkl:
    titles = pickle.load(pkl)

In [4]:
print("Total {} drama plot data loaded".format(len(eng_texts)))

Total 139 drama plot data loaded


![drama](data/drama_collage.png)

## The plots!

In [16]:
# Draw 5 random indices (with replacement) instead of hard-coding the corpus size
random_draw = np.random.randint(0, len(eng_texts), 5)
for r in random_draw:
    print("Drama title:{}".format(titles[r]))
    print("-----------------------------")
    print(eng_texts[r][:500])
    print("=============================")
    print(" ")

Drama title:records-of-a-night-watchman
-----------------------------
 MBCs new supernatural fusion sageuk Records of a Night Watchman premiered today and I dont know if its because I set my expectations really low or because the really terrible trailers made me but the show is off to a decent start Granted its a world filled with ghosts magic and dragonsyou have to be prepared for a certain amount of cheese when whats written on the page and the CG outcome isnt always as seamless as youd wish But theres a hefty mythology in play that blends a familiar elementJose
 
Drama title:warm-and-cozy
-----------------------------
 After wading through the arduous portion of denial with a side of noble idiocy we finally get to the good stuff fluttery anticipation of first dates how to seduce your boyfriend without seeming too eager Answer Theres no such thing as too eager Too eagers for people in Episode  Get on with it and what it takes to hang onto your happiness once youve found it Does it in

## Topic modeling on the plot data (*Do not hold your breath yet*)

Okay, I am going to come clean first. In my first attempt at topic modeling with the gensim library in Python, the result was not very good (actually, it was bad). I have a few ideas on how to improve it, but let me show you the current result first. This LDA model uses a TF-IDF corpus.
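
For readers unfamiliar with the term, a TF-IDF corpus re-weights each word count by how rare the word is across documents. The sketch below illustrates the idea with a simplified weighting (raw count times `log(N / document frequency)`), which is not necessarily the exact variant that gensim's `TfidfModel` applies. The `tfidf_weights` helper and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {token: weight} dict per tokenized document.

    Simplified weighting: raw term count * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears
    doc_freq = Counter(token for doc in docs for token in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            token: count * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return weighted

# Toy tokenized "plots" (made up for illustration)
toy_docs = [
    ["ghost", "love", "king", "ghost"],
    ["love", "doctor", "hospital"],
    ["love", "lawyer", "court"],
]
weights = tfidf_weights(toy_docs)
print(weights[0])
```

Under this weighting, a word that appears in every document ("love") gets a weight of zero, while a word confined to a single document ("ghost") gets the largest boost.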

In [18]:
file_path = 'data/ldamodel_2'
ldamodel_2 = LdaModel.load(file_path)

In [19]:
print(*ldamodel_2.print_topics(num_topics = 8, num_words = 10), sep = '\n')

(0, '0.002*"yeonjae" + 0.002*"dowoo" + 0.001*"jiwook" + 0.001*"eunseok" + 0.001*"kyungah" + 0.001*"dowoos" + 0.000*"kyungtae" + 0.000*"eunsoo" + 0.000*"yeonjaes" + 0.000*"jaemyung"')
(17, '0.002*"roo" + 0.002*"chashik" + 0.002*"yooseul" + 0.001*"mi" + 0.001*"jinmok" + 0.001*"sofia" + 0.000*"tan" + 0.000*"yooseuls" + 0.000*"michelle" + 0.000*"piano"')
(83, '0.002*"yeojin" + 0.002*"taehyun" + 0.001*"dojoon" + 0.000*"chaeyoung" + 0.000*"yeojins" + 0.000*"hanshin" + 0.000*"taehyuns" + 0.000*"dojoons" + 0.000*"yongpal" + 0.000*"scarface"')
(61, '0.002*"gyun" + 0.001*"hyemyeong" + 0.000*"woo" + 0.000*"dayeon" + 0.000*"hyemyeongs" + 0.000*"wolmyung" + 0.000*"youngshin" + 0.000*"chuseong" + 0.000*"poong" + 0.000*"seho"')
(39, '0.002*"muryong" + 0.002*"wanseung" + 0.002*"seolok" + 0.001*"yoo" + 0.001*"joon" + 0.001*"johnny" + 0.000*"seung" + 0.000*"joonoh" + 0.000*"inspector" + 0.000*"mi"')
(34, '0.000*"symbolswhy" + 0.000*"hyuns" + 0.000*"dongyoons" + 0.000*"pursesnatcher" + 0.000*"aboutsometh

What does it all mean? For each topic (the first number in each tuple is the topic ID), the listed words are the ones with the highest probability of appearing in that topic, and the number in front of each word is its weight.

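
To make the format concrete, here is a minimal sketch that splits one printed topic string into (word, weight) pairs. The `parse_topic` helper is hypothetical and only handles the `weight*"word"` format shown in the output above.

```python
import re

def parse_topic(topic_string):
    """Split a printed topic string into (word, weight) pairs."""
    # Each term looks like: 0.002*"yeonjae"
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# First three terms of topic 0, copied from the output above
topic_0 = '0.002*"yeonjae" + 0.002*"dowoo" + 0.001*"jiwook"'
for word, weight in parse_topic(topic_0):
    print("{:<10} {:.3f}".format(word, weight))
```

Note that the printed weights are rounded to three decimals, so a weight shown as `0.000` is very small rather than exactly zero.
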
## The problem is that Korean names are clouding my results!

Almost every word is a Korean name: yeonjae, dowoo, jiwook, eunseok, kyungah, wolmyung, poong, seho...

### Getting rid of the names will be the first thing on my list in my next iteration
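
One possible approach, sketched here as an assumption rather than a tested fix: character names tend to be specific to a single drama, so dropping tokens that occur in fewer than some minimum number of documents should remove many of them. gensim's `Dictionary.filter_extremes` has a `no_below` parameter for this kind of filter; the plain-Python version below uses a hypothetical `drop_rare_tokens` helper and toy documents.

```python
from collections import Counter

def drop_rare_tokens(docs, min_docs=2):
    """Remove tokens that appear in fewer than `min_docs` documents."""
    # Document frequency: in how many documents each token appears
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return [[t for t in doc if doc_freq[t] >= min_docs] for doc in docs]

# Toy tokenized plots; the names are taken from the topics above,
# the surrounding words are made up for illustration
toy_docs = [
    ["yeonjae", "falls", "in", "love", "with", "dowoo"],
    ["yeojin", "and", "taehyun", "fall", "in", "love"],
    ["muryong", "is", "in", "love", "with", "seolok"],
]
filtered = drop_rare_tokens(toy_docs, min_docs=2)
print(filtered[0])
```

The trade-off is that this filter also drops rare words that are not names (here, "falls"), and it may not catch name syllables that recur across dramas, so an explicit list of names would likely still be needed.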

pyLDAvis is a great tool for visualizing topic models. Please check out and play with the interactive version at /data/vis/vis.html.

![vis_topic](data/vis/vis_topic.png)