# ***K-Drama Dataset Transformation***

## **Project Goals**
For Korean Dramas viewed across streaming platforms in 2023, this project explores:
1. the correlation between viewership and audience ratings for Korean dramas
2. the relationship of both variables with factors such as number of episodes, release year, and genre
3. whether certain directors, creators, or actors were associated with significantly higher viewership/ratings

## **Findings**
The project produced the following insights:
1. Viewership and audience ratings show little correlation, as seen in the scatterplot of ratings against viewership.
2. Although many dramas have 16 episodes, those with fewer episodes (namely 10 and 12) tended to have higher viewership.
3. The genres attracting the highest viewership are Family, Military, Action and Adventure.
4. Grittier genres such as Military, Crime and Psychological had higher average audience ratings.
5. Director An Gil-Ho stood out in terms of viewership, averaging more than 300 million views across the two shows he directed.

## **Project Steps**
1. Scrape Data from Multiple Sources
2. Clean and Transform Data
3. Analyse and Visualise Data

## **Step 1: Scrape Data from Multiple Sources**


To answer the research questions, I scraped data from two sources, taking viewership data from the first and audience ratings from the second:
1. FlixPatrol, a site that gathers streaming data for movies and TV shows.
2. MyDramaList, an extensive database of Korean dramas.

The code used to scrape these sites can be found **here**

## **Step 2: Clean and Transform Data**
### 2.1: Setup

In [642]:
# importing data transformation libraries
import pandas as pd
import numpy as np
import re
from collections import Counter

import seaborn as sns
import matplotlib.pyplot as plt

# reading in both datasets
ratings_df = pd.read_csv("mydramalist.csv")
views_df = pd.read_csv("flixpatrol.csv")


### 2.2: Exploratory Data Analysis
Next, I use quick summary tools to explore the dataset briefly and understand what kind of cleaning is required.

In [643]:
ratings_df.head()

Unnamed: 0,title,rating,release_info,genre,actors
0,Twinkling Watermelon,9.2,"Korean Drama - 2023, 16 episodes","Genres: Romance, Youth, Drama, Fantasy","['Ryeoun', 'Choi Hyun Wook', 'Seol In Ah', 'Sh..."
1,Move to Heaven,9.1,"Korean Drama - 2021, 10 episodes","Genres: Life, Drama","['Lee Je Hoon', 'Tang Jun Sang', 'Hong Seung H..."
2,Weak Hero Class 1,9.1,"Korean Drama - 2022, 8 episodes","Genres: Action, Youth, Drama","['Park Ji Hoon', 'Choi Hyun Wook', 'Hong Kyung..."
3,Hospital Playlist Season 2,9.1,"Korean Drama - 2021, 12 episodes","Genres: Romance, Life, Drama, Medical","['Jo Jung Suk', 'Yoo Yeon Seok', 'Jung Kyung H..."
4,Lovely Runner,9.1,"Korean Drama - 2024, 16 episodes","Genres: Music, Comedy, Romance, Fantasy","['Byeon Woo Seok', 'Kim Hye Yoon', 'Song Geon ..."


In [644]:
pd.set_option('display.max_columns', 1000)
views_df.head()

Unnamed: 0,title,season,hours,runtime,views,cast,director,producers,creators
0,Doona!,season 1,88700000,7:00,12.7M,"Bae Suzy,Yang Se-jong,Lee Yu-bi,Ha Yeong,Park ...",Lee Jeong-hyo,Min Song-a,"Min Song-a,Jang Yu-ha,Lee Jeong-hyo"
1,Daily Dose of Sunshine,season 1,115900000,12:57,8.9M,"Park Bo-young,Yeon Woo-jin,Jang Dong-yoon,Lee ...",JQ Lee,Lee Ra-ha,Park Cheol-su
2,Sweet Home,season 2,93400000,9:31,9.8M,"Song Kang,Go Min-si,Lee Jin-wook,Lee Si-young,...",Lee Eung-bok,,
3,Black Knight,season 1,110300000,4:51,22.7M,"Kim Woo-bin,Kang Yoo-seok,Esom,Song Seung-heon...",Cho Ui-seok,Cho Ui-seok,
4,Young Lady and Gentleman,season 1,92900000,58:30,1.6M,"Ji Hyun-woo,Lee Se-hee,Park Ha-na,Oh Hyun-kyun...",,,


### 2.3 Data Cleaning and Transformation

First, I reformat the 'title' and 'season' columns of both datasets, as they will later serve as the join keys.

In [645]:
# transforming ratings dataset
# define fn to create 'season' column
def extract_season(title):
    match = re.search(r'Season (\d+)', title)
    if match:
        return int(match.group(1))
    else:
        return 1
# create the new 'season' column
ratings_df['season'] = ratings_df['title'].apply(extract_season)

# define fn to remove season from 'title' column
def adjust_title(title):
    match = re.search(r'(.+) Season \d+', title)
    if match:
        return match.group(1)
    else:
        return title

# adjust 'title' column
ratings_df['title'] = ratings_df['title'].apply(adjust_title)
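
As a quick sanity check, the two helpers can be exercised on a couple of titles. This is a minimal sketch: the helper bodies mirror the ones above, and the two-row `sample` frame is a hypothetical stand-in for `ratings_df`.

```python
import re

import pandas as pd


def extract_season(title):
    """Return the season number embedded in a title, defaulting to 1."""
    match = re.search(r'Season (\d+)', title)
    return int(match.group(1)) if match else 1


def adjust_title(title):
    """Return the part of the title before ' Season N', or the title unchanged."""
    match = re.search(r'(.+) Season \d+', title)
    return match.group(1) if match else title


# hypothetical sample; only the title format matters here
sample = pd.DataFrame({'title': ['Hospital Playlist Season 2', 'Move to Heaven']})
sample['season'] = sample['title'].apply(extract_season)
sample['title'] = sample['title'].apply(adjust_title)
print(sample)
```

Note that the season must be extracted before the title is adjusted, since the second step removes the text the first one relies on.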

In [646]:
# transforming views dataset
# reformatting 'season' column
views_df['season'] = views_df['season'].str.replace(r'season (\d+)', r'\1', regex=True)
# converting 'season' column to numeric datatype
views_df['season'] = pd.to_numeric(views_df['season'])

Next, both datasets are merged before further cleaning.

In [647]:
# merging both datasets
df = ratings_df.merge(views_df, on=['title', 'season'], how='inner')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174 entries, 0 to 173
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         174 non-null    object 
 1   rating        174 non-null    float64
 2   release_info  174 non-null    object 
 3   genre         174 non-null    object 
 4   actors        174 non-null    object 
 5   season        174 non-null    int64  
 6   hours         174 non-null    object 
 7   runtime       174 non-null    object 
 8   views         162 non-null    object 
 9   cast          173 non-null    object 
 10  director      150 non-null    object 
 11  producers     97 non-null     object 
 12  creators      24 non-null     object 
dtypes: float64(1), int64(1), object(11)
memory usage: 17.8+ KB
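
An inner join silently drops any drama whose title or season does not match exactly in both sources, for example because of spelling differences. One way to see what was lost is an outer join with `indicator=True`. The sketch below uses toy frames with made-up values rather than the scraped data, and the variable names are illustrative only.

```python
import pandas as pd

# toy stand-ins for ratings_df and views_df (values are made up for illustration)
ratings = pd.DataFrame({
    'title': ['Drama A', 'Drama B', 'Drama C'],
    'season': [1, 1, 2],
    'rating': [9.0, 8.5, 8.0],
})
views = pd.DataFrame({
    'title': ['Drama A', 'Drama C', 'Drama D'],
    'season': [1, 2, 1],
    'views': ['1.2M', '3.4M', '5.6M'],
})

# an outer join with indicator=True adds a '_merge' column recording
# whether each row came from the left frame, the right frame, or both
check = ratings.merge(views, on=['title', 'season'], how='outer', indicator=True)

# rows that an inner join would have discarded
unmatched = check[check['_merge'] != 'both']
print(unmatched[['title', 'season', '_merge']])
```

Inspecting the unmatched rows can reveal titles that only differ by punctuation or romanisation and could be reconciled before merging.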


After combining the data, I proceed with further data cleaning.

In [648]:
# split 'release_info' column into 'release_year' and 'episode_count'
def extract_year(column):
    pattern =  re.match(r'Korean Drama - (\d{4}),.*', column)
    if pattern:
        return pattern.group(1)
    return None

df['release_year'] = df['release_info'].apply(extract_year) 

def extract_episodes(column):
    pattern = re.match(r'.*\d{4}, (\d+).*', column)
    if pattern:
        return pattern.group(1)
    return None

df['episode_count'] = df['release_info'].apply(extract_episodes) 

# drop 'release_info'
df = df.drop(columns='release_info')
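
The two helper functions above return strings. As an alternative, both fields could be pulled out in one pass with `Series.str.extract` and converted to integers. This is a sketch on sample strings, not the approach used in the notebook; the variable names `info` and `parts` are assumptions for illustration.

```python
import pandas as pd

# sample strings in the same format as the 'release_info' column
info = pd.Series([
    'Korean Drama - 2023, 16 episodes',
    'Korean Drama - 2021, 10 episodes',
    'unexpected format',
])

# named groups become column names; rows that do not match yield NaN
parts = info.str.extract(r'(?P<release_year>\d{4}), (?P<episode_count>\d+) episode')

# nullable integer dtype keeps missing values without falling back to floats
parts = parts.apply(pd.to_numeric).astype('Int64')
print(parts)
```

Numeric columns make later filtering and grouping (for instance, by episode count) simpler than working with strings.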

To simplify analysis, I create new tables for columns with multiple values for each drama, which can later be joined to the main table.

In [649]:
# create separate tables for genre, cast, etc.
# create genre table (copy first so the assignments do not act on a slice of df)
genre_df = df[['title', 'season', 'genre']].copy()
genre_df['genre'] = genre_df['genre'].str.replace('Genres: ', '', regex=False)
genre_df['genre'] = genre_df['genre'].str.split(', ')
genre_df = genre_df.explode('genre')
genre_df.head()

# define fn to create new table from original dataset
def create_table(df, colname):
    # copy first so the assignment does not act on a slice of df
    col_df = df[['title', 'season', colname]].copy()
    col_df[colname] = col_df[colname].str.split(',')
    return col_df.explode(colname)

# create cast table
cast_df = create_table(df, 'cast')
cast_df.head()

# create producers table
producers_df = create_table(df, 'producers')
producers_df = producers_df.dropna()
producers_df.head()

# the 'creators' column is mostly null (24 non-null values out of 174), so it is dropped without creating a table

# drop the multi-value columns from the original df
df = df.drop(columns=['genre', 'actors', 'producers', 'creators', 'cast'])
df.head()

Unnamed: 0,title,rating,season,hours,runtime,views,director,release_year,episode_count
0,Move to Heaven,9.1,1,33500000,8:40,3.7M,Kim Sung-ho,2021,10
1,Alchemy of Souls,9.1,1,217300000,24:52,8.7M,Park Joon-hwa,2022,20
2,Flower of Evil,9.1,1,42200000,18:56,2.3M,Kim Cheol-kyu,2020,16
3,Hospital Playlist,9.1,1,57200000,17:09,3.3M,Shin Won-ho,2020,12
4,Reply 1988,9.1,1,46800000,31:37,1.5M,Shin Won-ho,2015,20
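
The split-and-explode pattern behind the genre, cast and producers tables can be seen on a toy frame. This is a minimal sketch; the titles and names below are placeholders, not scraped data.

```python
import pandas as pd

# toy frame; the titles and names are placeholders, not scraped data
toy = pd.DataFrame({
    'title': ['Drama A', 'Drama B'],
    'season': [1, 1],
    'cast': ['Actor One,Actor Two', 'Actor Three'],
})

# split the comma-separated string into a list, then emit one row per list item
long_cast = toy.assign(cast=toy['cast'].str.split(',')).explode('cast')
print(long_cast)
```

Each row of the resulting table keeps its 'title' and 'season', so it can be joined back to the main table on those two columns.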


In [650]:
# reformat 'hours' and 'views' columns
# strip thousands separators, convert to numeric, then express in millions
df['hours'] = pd.to_numeric(df['hours'].str.replace(',', '')) / 1_000_000
# strip the 'M' suffix; the values are still strings at this point
df['views'] = df['views'].str.replace('M', '')
df = df.rename(columns={'hours': 'hours_millions', 'views': 'views_millions'})
df.head()

Unnamed: 0,title,rating,season,hours_millions,runtime,views_millions,director,release_year,episode_count
0,Move to Heaven,9.1,1,33.5,8:40,3.7,Kim Sung-ho,2021,10
1,Alchemy of Souls,9.1,1,217.3,24:52,8.7,Park Joon-hwa,2022,20
2,Flower of Evil,9.1,1,42.2,18:56,2.3,Kim Cheol-kyu,2020,16
3,Hospital Playlist,9.1,1,57.2,17:09,3.3,Shin Won-ho,2020,12
4,Reply 1988,9.1,1,46.8,31:37,1.5,Shin Won-ho,2015,20
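
Removing the 'M' suffix leaves the view counts stored as text. If a numeric column is needed in pandas, the suffix can be parsed explicitly. The sketch below is one possible approach under stated assumptions: the 'K' suffix and the bare-number case are hypothetical and may not occur in the FlixPatrol data.

```python
import pandas as pd

# sample values; the 'K' suffix is an assumption and may not occur in the data
views = pd.Series(['12.7M', '8.9M', '900K', None])

# split each value into its numeric part and its optional suffix
parts = views.str.extract(r'^(?P<number>[\d.]+)(?P<suffix>[KM]?)$')

# factor that expresses each value in millions; missing values stay NaN
scale = parts['suffix'].map({'M': 1.0, 'K': 0.001, '': 1e-6})
views_millions = pd.to_numeric(parts['number']) * scale
print(views_millions)
```

Handling the suffix explicitly avoids treating a value such as '900K' as 900 million once the letter is stripped.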


In [651]:
# convert runtime from 'hours:minutes' to decimal hours, rounded to 1 decimal place
df['runtime'] = df['runtime'].str.strip().replace('', np.nan)
def convert_runtime(df):
    df['runtime'] = df['runtime'].apply(lambda x: int(x.split(':')[0]) + int(x.split(':')[1])/60 if pd.notnull(x) else np.nan).round(1)
    return df

df = convert_runtime(df)
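
The same conversion can also be written without a row-wise lambda. The following is a sketch on sample values, assuming every non-blank entry follows the 'hours:minutes' pattern; `runtime_hours` is an illustrative name.

```python
import pandas as pd

# sample values in the 'hours:minutes' format, plus blank and missing entries
runtime = pd.Series(['7:00', '12:57', ' 58:30 ', '', None])

# split into two columns; blank and missing entries end up without usable parts
parts = runtime.str.strip().str.split(':', expand=True)

# errors='coerce' turns anything that is not a number into NaN
hours = pd.to_numeric(parts[0], errors='coerce')
minutes = pd.to_numeric(parts[1], errors='coerce')

runtime_hours = (hours + minutes / 60).round(1)
print(runtime_hours)
```

Because blank strings are coerced to NaN during the numeric conversion, no separate replacement step is needed in this version.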

Finally, I export the cleaned data to CSV files, which will be analysed in Tableau.

In [652]:
genre_df.to_csv('genre.csv', index = False)
cast_df.to_csv('cast.csv', index = False)
producers_df.to_csv('producers.csv', index = False)
df.to_csv('top_dramas.csv', index = False)

## **Step 3: Analyse and Visualise Data**
Click here to view Tableau dashboard
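
The relationship between ratings and viewership can also be quantified in pandas alongside the scatterplot. This is a rough sketch using made-up numbers purely to show the calls; it does not reproduce the project's results, and it assumes 'views_millions' has been converted to a numeric column.

```python
import pandas as pd

# made-up numbers purely to show the calls; they are not project results
toy = pd.DataFrame({
    'rating': [9.1, 8.7, 8.2, 7.9, 8.5],
    'views_millions': [3.7, 12.7, 22.7, 1.6, 8.9],
})

# Pearson measures linear association between the raw values
pearson = toy['rating'].corr(toy['views_millions'])

# Spearman is Pearson applied to ranks, so it is less sensitive to
# a few dramas with very large view counts
spearman = toy['rating'].rank().corr(toy['views_millions'].rank())

print(f'pearson={pearson:.2f}, spearman={spearman:.2f}')
```

Coefficients near zero would be consistent with the first finding, while values near 1 or -1 would indicate a strong relationship.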