#### Start of the Idea
Korean dramas, or K-dramas, have evolved considerably in recent years. Over the last ten years in particular, the Korean entertainment sector has attracted a much larger worldwide audience. More platforms and options for watching K-dramas have expanded their appeal and viewership, particularly through social media. Korean dramas are renowned for their intriguing storylines and compelling plot turns, and the variety on offer is exactly what keeps viewers coming back for more.

My interest in K-dramas has grown over the past few years, which made me want to analyse and visualize the evolution of K-dramas and Korean actors over time.

#### The code in this notebook scrapes the Wikipedia page listing Korean dramas released from 1995 to 2022. IMDb is then scraped for the same dramas to get related data such as ratings and runtime. The data is not taken from IMDb alone because the IMDb results include titles other than South Korean series; the Wikipedia list is used to filter out those rogue entries. The two datasets are merged, and the merged data is stored in a CSV file that is used for analysis and visualization in Power BI.

#### 'requests' is the HTTP library used for accessing the web pages

In [1]:
pip install requests

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [1]:
import requests

#### 'Beautiful Soup' is the Python library used for extracting data out of html and xml files.

In [2]:
from bs4 import BeautifulSoup

#### Import Necessary Libraries

In [3]:
import numpy as np
import pandas as pd
from time import sleep
from random import randint
import re

##### As the main wiki page only gives the drama name and year, both are scraped through the following code.

In [4]:
wiki_page = requests.get("https://en.wikipedia.org/wiki/List_of_Korean_dramas")
print(wiki_page.status_code)    # status code should be 200 for success

200


In [5]:
soup = BeautifulSoup(wiki_page.content,'html.parser')
print(soup.prettify())

<!DOCTYPE html>

<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of Korean dramas - Wikipedia</title>
<script>document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled";(function(){var cookie=document.c

In [6]:
allLinks = soup.find(id='bodyContent').find_all('li')[28:] # collect the list items (<li>) in the page body, skipping the first 28, which precede the drama entries
allLinks

[<li><i><a href="/wiki/100_Days_My_Prince" title="100 Days My Prince">100 Days My Prince</a></i> (2018)</li>,
 <li><i><a href="/wiki/12_Signs_of_Love" title="12 Signs of Love">12 Signs of Love</a></i> (2012)</li>,
 <li><i><a href="/wiki/12_Years_Promise" title="12 Years Promise">12 Years Promise</a></i> (2014)</li>,
 <li><i><a href="/wiki/18_Again" title="18 Again">18 Again</a></i> (2020)</li>,
 <li><i><a href="/wiki/365:_Repeat_the_Year" title="365: Repeat the Year">365: Repeat the Year</a></i> (2020)</li>,
 <li><i><a href="/wiki/4_Legendary_Witches" title="4 Legendary Witches">4 Legendary Witches</a></i> (2014–15)</li>,
 <li><i><a href="/wiki/49_Days" title="49 Days">49 Days</a></i> (2011)</li>,
 <li><i><a href="/wiki/5th_Republic_(TV_series)" title="5th Republic (TV series)">5th Republic</a></i> (2005)</li>,
 <li><i><a href="/wiki/7_Escape" title="7 Escape">7 Escape</a></i> (2023)</li>,
 <li><i><a href="/wiki/7_First_Kisses" title="7 First Kisses">7 First Kisses</a></i> (2016–17)</l

#### After scraping all the links corresponding to the dramas, the name and year of each drama are extracted from the links.

In [62]:
# splitting the links gives the drama name and the year
s=allLinks[0].get_text().split(" ")[-1]
print(s)
print(allLinks[0].find_next('a').get('title'))
print(allLinks[0].get_text()[0:-1])

(2018)
100 Days My Prince
100 Days My Prince (2018
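Splitting on spaces works for the entries shown above, but a regular expression makes the expected pattern explicit and also handles year ranges. The following is a minimal sketch, assuming each list item's text looks like `Title (YYYY)` or `Title (YYYY–YY)`; `parse_entry` and `ENTRY_PATTERN` are hypothetical names that do not appear in the notebook.

```python
import re

# Title, then a parenthesised 4-digit start year, optionally followed by an
# en dash (or hyphen) and an end year, e.g. "(2014–15)".
ENTRY_PATTERN = re.compile(
    r"^(?P<title>.+?)\s*\((?P<year>\d{4})(?:[–-][^)]*)?\)\s*$"
)


def parse_entry(text):
    """Return (title, start_year) for a list-item string, or None if it does not match."""
    match = ENTRY_PATTERN.match(text)
    if match is None:
        return None
    return match.group("title"), int(match.group("year"))


print(parse_entry("100 Days My Prince (2018)"))      # ('100 Days My Prince', 2018)
print(parse_entry("4 Legendary Witches (2014–15)"))  # ('4 Legendary Witches', 2014)
print(parse_entry("Not a drama entry"))              # None
```

Returning `None` for non-matching text gives a simple way to skip list items that are not drama entries.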


In [64]:
# data scraped from wiki is converted to dataframe
kdrama_wiki=pd.DataFrame([[link.get_text().split(" ")[-1],
                 link.find_next('a').get('title')]
                 for link in allLinks if link.find_next('i')],
               columns=("year","drama"))
kdrama_wiki

| | year | drama |
|---|---|---|
| 0 | (2018) | 100 Days My Prince |
| 1 | (2012) | 12 Signs of Love |
| 2 | (2014) | 12 Years Promise |
| 3 | (2020) | 18 Again |
| 4 | (2020) | 365: Repeat the Year |
| ... | ... | ... |
| 1567 | (2007) | Your Scene |
| 1568 | (2021) | Youth (TV series) |
| 1569 | (2021) | Youth of May |
| 1570 | (2021–22) | Yumi's Cells |


In [14]:
kdrama_wiki['year'] = kdrama_wiki['year'].str.split('–').str[0] # removed the characters after '–'
kdrama_wiki['year'] = kdrama_wiki['year'].str.replace(r'[()]',"",regex=True) # regex=True so '[()]' is treated as a pattern, not literal text
kdrama_wiki['drama'] = [str(x).split('(')[0].strip() for x in kdrama_wiki['drama']] # drop disambiguation suffixes such as '(TV series)'
kdrama_wiki


| | year | drama |
|---|---|---|
| 0 | 2018 | 100 Days My Prince |
| 1 | 2012 | 12 Signs of Love |
| 2 | 2014 | 12 Years Promise |
| 3 | 2020 | 18 Again |
| 4 | 2020 | 365: Repeat the Year |
| ... | ... | ... |
| 1567 | 2007 | Your Scene |
| 1568 | 2021 | Youth |
| 1569 | 2021 | Youth of May |
| 1570 | 2021 | Yumi's Cells |


### Scraping IMDB Data

In [26]:
# The 'start' parameter in the IMDb URL is set dynamically; there are around 48 pages of data (50 titles per page) with 2366 drama titles
pages = np.arange(1,2350,50)

# initializing variables to store the scraped values
drama_titles = []
#years = []
runtimes = []
imdb_ratings = []
votes = []
images = []
genres = []
casts = []


for page in pages:

    # get response for k-dramas
    imdb_page = requests.get('https://www.imdb.com/search/title/?title_type=tv_series&countries=kr&start='+str(page)+'&ref_=adv_nxt')

    #sleep(randint(8,15))

    imdb_soup = BeautifulSoup(imdb_page.content,'html.parser')

    # the 50 dramas listed in the page are stored in 'frames'
    frames = imdb_soup.find_all('div',class_='lister-item mode-advanced')

    # extracting the data of the 50 dramas listed in the page
    for frame in frames:

        cast = []

        # title
        d_title = frame.h3.a.text
        drama_titles.append(d_title)

        if frame.find('img',class_='loadlate') is not None:

            # images
            img = frame.find('img',class_='loadlate').get('loadlate')
            images.append(img)
        else:
            images.append(None)

        if frame.p.find('span',class_='runtime') is not None:

            # runtime of the drama
            runtime = int(frame.p.find('span',class_='runtime').text.replace("min",""))
            runtimes.append(runtime)
        else:
            runtimes.append(None)

        if frame.p.find('span',class_='genre') is not None:

            # genre
            genre = frame.p.find('span',class_='genre').text.replace("\n","").rstrip().split(',')
            genres.append(genre)
        else:
            genres.append(None)

        # cast: keep the first two names linked in the credits paragraph
        cast_para = frame.find('p',class_="")
        if cast_para is not None:
            for link in cast_para.find_all('a')[:2]:
                cast.append(link.get_text().lstrip())

        # append exactly one entry per drama so that all lists stay the same length
        casts.append(cast if len(cast)>0 else None)

        if frame.strong is not None:

            # imdb ratings
            imdb_rating = float(frame.strong.text)
            imdb_ratings.append(imdb_rating)
        else:
            imdb_ratings.append(None)

        # no of votes; .get() returns None instead of raising if 'data-value' is missing
        vote_span = frame.find('span',attrs={'name':'nv'})
        if vote_span is not None and vote_span.get('data-value') is not None:
            vote = int(vote_span.get('data-value'))
            votes.append(vote)
        else:
            votes.append(None)
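The loop above requests one results page per `start` offset. As a small sketch of how those offsets could be derived from the total number of titles, the helper below assumes 50 titles per page and the 2366 titles mentioned in the comment; `page_starts` is a hypothetical name, not part of the notebook.

```python
import math


def page_starts(total_titles, per_page=50):
    """Return the 1-based `start` offset of every results page."""
    n_pages = math.ceil(total_titles / per_page)
    return [1 + i * per_page for i in range(n_pages)]


starts = page_starts(2366)
print(len(starts), starts[0], starts[-1])   # 48 1 2351
```

For comparison, `np.arange(1,2350,50)` stops at 2301 and so yields 47 offsets; if the result count really is 2366, the final page (starting at 2351) would not be requested.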

In [44]:
# data scraped is converted to dataframe
kdrama_imdb = pd.DataFrame({'drama':drama_titles,
                            'genre':genres,
                            'cast':casts,
                            'runtime':runtimes,
                            'imdb_rating':imdb_ratings,
                            'vote':votes,
                            'image':images})
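Building the dataframe from parallel lists only works if every list receives exactly one value per drama. One way to make that automatic is to collect one dictionary per drama instead. The sketch below uses plain dictionaries in place of parsed HTML; `FIELDS`, `to_record` and the sample `raw_items` are hypothetical and only illustrate the idea.

```python
FIELDS = ("drama", "genre", "cast", "runtime", "imdb_rating", "vote", "image")


def to_record(item):
    """Copy the known fields from `item`, filling any missing field with None."""
    return {field: item.get(field) for field in FIELDS}


# Hypothetical scraped items; the second one lacks most fields.
raw_items = [
    {"drama": "Squid Game", "runtime": 55, "imdb_rating": 8.0},
    {"drama": "Your Lady"},
]

records = [to_record(item) for item in raw_items]
print(records[1]["runtime"])                 # None
print(len(records[0]) == len(records[1]))    # True
```

Passing such a list to `pd.DataFrame(records)` would then give one row per dictionary, with missing values already aligned.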

In [45]:
kdrama_imdb.head(5)

| | drama | genre | cast | runtime | imdb_rating | vote | image |
|---|---|---|---|---|---|---|---|
| 0 | Bloodhounds | [Action, Crime, Drama] | [Woo Do-Hwan, Sang-yi Lee] | 60.0 | 8.1 | 2731.0 | https://m.media-amazon.com/images/M/MV5BZTlhZG... |
| 1 | Miraculous: Tales of Ladybug & Cat Noir | [Animation, Action, Adventure] | [Cristina Valenzuela, Bryce Papenbrook] | 20.0 | 7.6 | 13096.0 | https://m.media-amazon.com/images/M/MV5BZWExMT... |
| 2 | Squid Game | [Action, Drama, Mystery] | [Lee Jung-jae, Park Hae-soo] | 55.0 | 8.0 | 481468.0 | https://m.media-amazon.com/images/M/MV5BYWE3MD... |
| 3 | The Big Door Prize | [Comedy, Drama, Sci-Fi] | [Chris O'Dowd, Gabrielle Dennis] | 98.0 | 6.3 | 4260.0 | https://m.media-amazon.com/images/M/MV5BNjFjNz... |
| 4 | The Good Bad Mother | [Comedy, Crime, Drama] | [Ra Mi-ran, Lee Do-Hyun] | 70.0 | 8.4 | 1381.0 | https://m.media-amazon.com/images/M/MV5BMjI1Nj... |


#### Unique genres

In [52]:
listed_genre = []
for genre in kdrama_imdb['genre']:
  if genre is not None:
    for g in genre:
      listed_genre.append(g.lstrip())

genre_unique = set(listed_genre)
genre_unique

{'Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'War',
 'Western'}
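The set above shows which genres occur; counting them shows how often each one appears. A minimal sketch using `collections.Counter` on a small sample in the same shape as `kdrama_imdb['genre']` (the sample lists are copied from the rows shown earlier, and `sample_genres` is a hypothetical stand-in for the real column):

```python
from collections import Counter

# Sample genre lists; entries after the first keep the leading space left by split(','),
# and None marks a drama without genre information.
sample_genres = [
    ["Action", " Crime", " Drama"],
    ["Action", " Drama", " Mystery"],
    None,
    ["Comedy", " Crime", " Drama"],
]

genre_counts = Counter(
    g.strip()
    for genre_list in sample_genres
    if genre_list is not None
    for g in genre_list
)
print(genre_counts.most_common(1))   # [('Drama', 3)]
```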

#### Unique Actors

In [53]:
listed_cast = []
for cast in kdrama_imdb['cast']:
  if cast is not None:
    for c in cast:
      listed_cast.append(c)

cast_unique = set(listed_cast)
cast_unique

{'Jung Yu-mi',
 'Dong-Hyuk Cho',
 'Jeon Do-yeon',
 'Tae-gon Lee',
 'Se-na Lee',
 'No-shik Park',
 'Hong Yoon Hwa',
 'Yu-jin Ahn',
 'Lee Bong-Won',
 'Kim Won-Hae',
 'Na Young-hee',
 'Min Ah Kang',
 'Min-Jung Kim',
 'Kim Ji-Hyeon',
 'So-hye Kim',
 'Miles Meili',
 'Bo-yeon Kim',
 'Luci Christian',
 'Mi-kyeong Yang',
 'Mi-yeon Lee',
 'Seo Do Young',
 'Seong-jin Kang',
 'Madison Brunoehler',
 'Oh Man-seok',
 'Super Junior',
 'Ye-ryeon Cha',
 'Soo-Hyun Hong',
 'Tran Vu',
 'Ahn Jung-Hoon',
 'Ji-yoon Park',
 'Kim Min-Suk',
 'Hong Bum-ki',
 'Christina Sherman',
 'Hee-jin Lee',
 'Jun-Yeong Seo',
 'Wayv',
 'Garrison Michael Farquharson-Keener',
 'Hudson Loverro',
 'Eun-jeong Han',
 'Rie Kugimiya',
 'Soo Hyun Seo',
 'Jin-Ah Im',
 'Hong-Chul Ro',
 'Hee-soon Park',
 'Hee-chul Kim',
 'Park Sung-woong',
 'Roger Carel',
 'Nam Seung Woo',
 'Jin-Yeop Kim',
 'Ho-jin Chun',
 'Do-hee Min',
 'Kendell Byrd',
 'Lee Moo-saeng',
 'Geon Yu',
 'Lee Won-jung',
 'Seo Hyo-Rim',
 'Kazumi Evans',
 'Yoo-jin Kim',
 'Geu-

In [54]:
len(cast_unique)

1872

#### The two dataframes are merged on the 'drama' column, i.e. on the name of the drama.

In [55]:
final_df = pd.merge(kdrama_wiki,kdrama_imdb,how="inner",on="drama")
final_df

| | year | drama | genre | cast | runtime | imdb_rating | vote | image |
|---|---|---|---|---|---|---|---|---|
| 0 | 2018 | 100 Days My Prince | [Action, Comedy, History] | [Kyung-soo Do, Nam Ji-hyun] | 75.0 | 7.7 | 2775.0 | https://m.media-amazon.com/images/M/MV5BYjljZG... |
| 1 | 2014 | 12 Years Promise | [Comedy, Drama, Romance] | [So-yeon Lee, Min Namkoong] | 70.0 | 7.2 | 186.0 | https://m.media-amazon.com/images/M/MV5BMGE5YT... |
| 2 | 2020 | 18 Again | [Comedy, Drama, Fantasy] | [Ha-neul Kim, Yoon Sang-Hyun] | 70.0 | 8.2 | 2427.0 | https://m.media-amazon.com/images/M/MV5BN2JhYj... |
| 3 | 2020 | 365: Repeat the Year | [Crime, Drama, Fantasy] | [Lee Jun-hyuk, Nam Ji-hyun] | 30.0 | 7.9 | 769.0 | https://m.media-amazon.com/images/M/MV5BNTNlZW... |
| 4 | 2013 | 7th Grade Civil Servant | [Action, Romance] | [Joo Won, Hye Eun Lee] | 65.0 | 6.2 | 149.0 | https://m.media-amazon.com/images/M/MV5BOWIxMD... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 903 | 2013 | Your Lady | [Drama, Romance] | [Yu-ri Lee, Ho Lim] | 35.0 | 8.6 | 7.0 | https://m.media-amazon.com/images/M/MV5BNTQxYz... |
| 904 | 2013 | Your Neighbor's Wife | [Romance] | [Yum Jung-ah, Yu-seok Kim] | 59.0 | 6.9 | 8.0 | https://m.media-amazon.com/images/M/MV5BYTUzY2... |
| 905 | 2021 | Youth of May | [Drama, History, Romance] | [Lee Do-Hyun, Go Min-Si] | 65.0 | 8.4 | 1743.0 | https://m.media-amazon.com/images/M/MV5BYTYzMT... |
| 906 | 2021 | Yumi's Cells | [Comedy, Drama, Romance] | [Kim Go-eun, Park Jin-young] | 70.0 | 8.2 | 2426.0 | https://m.media-amazon.com/images/M/MV5BYjFmNT... |
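To see which titles an inner merge keeps and which it drops, an outer merge with `indicator=True` can be inspected first. This is a sketch on toy data, not the notebook's actual frames: `wiki` and `imdb` are hypothetical stand-ins for `kdrama_wiki` and `kdrama_imdb`, with values copied from the outputs above.

```python
import pandas as pd

# Toy stand-ins for kdrama_wiki and kdrama_imdb.
wiki = pd.DataFrame({"year": ["2018", "2021"],
                     "drama": ["100 Days My Prince", "Yumi's Cells"]})
imdb = pd.DataFrame({"drama": ["100 Days My Prince", "The Big Door Prize", "Yumi's Cells"],
                     "imdb_rating": [7.7, 6.3, 8.2]})

# An outer merge with indicator=True adds a '_merge' column saying where each row was found:
# 'both', 'left_only' (wiki only) or 'right_only' (IMDb only).
check = pd.merge(wiki, imdb, how="outer", on="drama", indicator=True)
print(check[["drama", "_merge"]])

# Keeping only the rows found in both frames gives the same titles as how="inner".
inner = check[check["_merge"] == "both"].drop(columns="_merge")
print(len(inner))   # 2
```

Because the match is an exact string comparison, a drama whose title is spelled differently on the two sites would also be dropped; the `left_only` rows are one place to look for such cases.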


#### Unique Actors in the final dataframe

In [56]:
listed_cast = []
for cast in final_df['cast']:
  if cast is not None:
    for c in cast:
      listed_cast.append(c)

cast_unique = set(listed_cast)
cast_unique

{'AOA',
 'Adam McArthur',
 'Ae Yun Jung',
 'Ae-ra Shin',
 'Ah Jung Yoon',
 'Ahn Bo-Hyun',
 'Ahn Hyo-Seop',
 'Ahn Jae-Hyun',
 'Ahn Jong Sun',
 'Ahn Moon Sook',
 'Ahn Nae-sang',
 'Amber Liu',
 'Amy Aleha',
 'Anastasia Kim',
 'Anzu Lawson',
 'Bae Doona',
 'Bae Suzy',
 'Baek Ji-won',
 'Baro',
 'Beom-su Lee',
 'Bit-na Wang',
 'Bo-ra Kim',
 'Bo-yeon Kim',
 'Boom',
 'Bruno Bruni Jr.',
 'Byeol Kang',
 'Byeong-eun Park',
 'Byeong-gyu Jo',
 'Byun Hee-Bong',
 'Byung-ho Son',
 'Cha Eun-Woo',
 'Chae Soo-bin',
 'Chae-Ah Han',
 'Chae-Young Han',
 'Chae-yeong Lee',
 'Chan-Young Yoon',
 'Chang-Suk Oh',
 'Chang-min Shim',
 'Chang-ui Song',
 'Changjo',
 'Chase Kim',
 'Cheol-ho Choi',
 'Cheol-min Lee',
 'Cheol-min Park',
 'Cho Seung-woo',
 'Cho Yeo-jeong',
 'Choi Chul Ho',
 'Choi Ji-woo',
 'Choi Jin-Hyuk',
 'Choi Min-ho',
 'Choi Min-sik',
 'Choi Tae-Joon',
 'Choi Wonyoung',
 'Choi Woo-sik',
 'Chong-ok Bae',
 'Chu Ja-hyeon',
 'Chunji',
 'Claudia Kim',
 'Da-In Lee',
 'Da-bin Jung',
 'Dae-Myung Kim',
 'Dae-c

In [57]:
len(cast_unique)

747

#### The final dataframe is saved as a CSV file.

In [58]:
final_df.to_csv('KDramaList.csv', index=False)  # index=False avoids writing the row index as an extra unnamed column