# Thai X Japanese Drama: Can they challenge the global dominance of Korean Drama?

Midterm Project: Comprehensive Data Analysis and Visualization

Duration: 2 weeks
Submission Date: Sep. 28, 2023

### Requirements:
1. Select a Dataset: 
> - Choose a dataset that contains at least 500 entries and at least five different
variables.
2. Data Exploration:
> - Perform summary statistics to understand the basic metrics of each variable (mean, median, mode,
variance, standard deviation).
> - Identify any outliers and clean the dataset if necessary.
3. Statistical Analysis:
> - Conduct hypothesis tests or other statistical methods to answer at least two questions you have
about the dataset.
> - Use measures of similarity, probability, and distributions to draw inferences from the data.
4. Data Visualization:
> - Create at least four different types of visualizations using the dataset.
> - These can be bar charts, line charts, scatter plots, pie charts, etc., as appropriate for your data.
5. Interpretation:
> - Prepare a presentation that walks through your exploratory data analysis, statistical findings, and
visualizations.
> - The presentation should be both factually accurate and easily understandable, targeted at an
audience unfamiliar with your dataset.
6. Documentation:
> - Alongside the presentation, prepare a report documenting your methodology, the statistical tests
performed, the visualizations created, and your interpretations.


### Deliverables:

1. Cleaned and processed dataset in CSV format.
2. A 7-minute presentation (PPT or equivalent) summarizing your findings.
3. A detailed report (Word, PDF, or equivalent).

### Evaluation Criteria:
- Quality of data exploration and statistical analysis.
- Effectiveness and appropriateness of data visualizations.
- Coherence and clarity in the presentation and report.
- Ability to interpret the results and draw meaningful conclusions.

### Submission:
Submit all the project files via Google Classroom.

In [1]:
import pandas as pd
import numpy as np
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/tha_tha_actors.csv
/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/jap_drama.csv
/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/kor_user_reviews.csv
/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/kor_kor_actors.csv
/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/kor_drama.csv
/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/tha_user_reviews.csv
/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/jap_user_reviews.csv
/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/jap_jap_actors.csv
/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/tha_drama.csv


# Drama

In [2]:
# Load the drama tables. iloc[:, 1:-2] drops the first column (drama_id) and the last two
# (rank and pop[ularity]), because we won't be using the website's ranking system.
dtha_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/tha_drama.csv').iloc[:,1:-2]
dkor_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/kor_drama.csv').iloc[:,1:-2]
djap_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/jap_drama.csv').iloc[:,1:-2]

In [3]:
dtha_df.head()

Unnamed: 0,drama_name,native_name,year,synopsis,genres,tags,director,sc_writer,country,type,tot_eps,ep_duration,start_dt,end_dt,aired_on,org_net,tot_user_score,tot_num_user,tot_watched,content_rt
0,6ixtynin9,เรื่องตลก 69 เดอะซีรีส์,2023,"Because of the economic crisis, Toom has sudde...","Thriller, Comedy, Crime","Prolonged Nudity, MDL Remake, Miniseries, Nudi...",,,Thailand,Drama,6,2820.0,2023-09-06,2023-09-06,Wednesday,Netflix,7.7,22.0,208.0,Not Yet Rated
1,Her,HER วิชานี้จบที่เธอ,2023,After making a bet with her brother to find a ...,Romance,"LGBTQ+, Bisexual Female Lead, Short Length Ser...",['Earth Wachirawit Kanthayom'],,Thailand,Drama,4,480.0,2023-08-10,2023-08-10,,,7.6,15.0,98.0,Not Yet Rated
2,Put Our Heart on the Paper,Put Our Heart on the Paper,2023,"Due to a severe case of writer's block, succes...",Romance,"Fan Female Lead, Roommates' Relationship, Writ...",,,Thailand,Drama,4,480.0,2023-08-05,2023-08-26,Saturday,,7.2,20.0,96.0,Not Yet Rated
3,Love in a Cage,กรงดอกสร้อย,2023,"Soiinthanin is a beautiful, smart, well-educat...","Historical, Romance, Drama","Illegitimate Female Lead, Adapted From A Novel...",['Adul Boonboot'],,Thailand,Drama,16,,2023-07-21,2023-08-25,"Friday, Saturday, Sunday",Channel 3,7.1,17.0,414.0,Not Yet Rated
4,Wedding Plan,แผนการ(รัก)ร้ายของนายเจ้าบ่าว,2023,Namnuea and Sailom are wedding planner and gro...,"Comedy, Romance","LGBTQ+, Wedding Planner Male Lead, Co-workers'...",['Ne Neti Suwanjinda'],,Thailand,Drama,7,2880.0,2023-07-19,2023-08-30,Wednesday,GMM 25 iQiyi,7.6,4168.0,10705.0,18+ Restricted (violence & profanity)


# Reviews

In [4]:
rtha_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/tha_user_reviews.csv').iloc[:,1:]
rkor_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/kor_user_reviews.csv').iloc[:,1:]
rjap_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/jap_user_reviews.csv').iloc[:,1:]

In [5]:
rtha_df.head()

Unnamed: 0,title,story,acting_cast,music,rewatch_value,overall,text,ep_watched,n_helpful
0,Her,6.0,8.0,9.0,5.0,7.0,"Contrived plot aside, the actors are terribly ...",4 of 4 episodes seen,3
1,Love in a Cage,10.0,10.0,7.0,10.0,10.0,Captivating from beginning to the end - the sy...,8 of 16 episodes seen,1
2,The Cupid Coach,1.0,6.0,6.0,1.0,3.0,Content warning: youll want to claw your eyes ...,12 of 12 episodes seen,4
3,La Pluie,9.5,8.5,9.0,8.0,9.0,There are many Fantasy dramas out there but wh...,12 of 12 episodes seen,13
4,Love by Chance,4.0,8.5,9.0,1.0,5.5,This drama was a bit of a mixed bag for me. Th...,14 of 14 episodes seen,0
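
To compare the three countries later on, it may help to stack the review tables into one frame with a country label. The following is a minimal sketch, not part of the original notebook: it uses small stand-in frames in place of `rtha_df`, `rkor_df`, and `rjap_df`, and both the `combine_reviews` helper and the `country` column name are hypothetical.

```python
import pandas as pd

def combine_reviews(frames: dict) -> pd.DataFrame:
    """Stack per-country review frames and tag each row with its country."""
    tagged = [df.assign(country=name) for name, df in frames.items()]
    return pd.concat(tagged, ignore_index=True)

# Tiny stand-in frames; in the notebook these would be rtha_df, rkor_df, rjap_df.
# 'Drama A' and 'Drama B' are placeholder titles, not rows from the dataset.
tha = pd.DataFrame({'title': ['Her', 'La Pluie'], 'overall': [7.0, 9.0]})
kor = pd.DataFrame({'title': ['Drama A'], 'overall': [8.0]})
jap = pd.DataFrame({'title': ['Drama B'], 'overall': [6.5]})

reviews = combine_reviews({'Thailand': tha, 'South Korea': kor, 'Japan': jap})

# Mean overall score per country, as a first comparison
mean_overall = reviews.groupby('country')['overall'].mean()
print(mean_overall)
```

Keeping the country as a column rather than as three separate frames should make grouped statistics and grouped plots simpler to write.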


# TODO: Data Visualization and Analysis

In [6]:
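def review_score_pipeline(
    df: pd.DataFrame,
    features: list = None,
) -> pd.DataFrame:
    """Parse `ep_watched` into episode counts and a watched ratio.

    Note: adds the new columns to `df` in place. If `features` is given,
    only those columns are returned.
    """
    # ep_watched looks like '8 of 16 episodes seen'; treat missing values as 0 of 0
    df['ep_watched'] = df['ep_watched'].fillna('0 of 0 episodes seen')
    df['tot_watched'] = df['ep_watched'].apply(lambda x: int(str(x).split(' ')[0]))
    df['tot_ep'] = df['ep_watched'].apply(lambda x: int(str(x).split(' ')[2]))
    # 0 / 0 yields NaN, which is replaced with 0.0 before rounding
    df['watched_ratio'] = df['tot_watched'] / df['tot_ep']
    df['watched_ratio'] = df['watched_ratio'].fillna(0.0).apply(lambda x: round(x,3))

    # Return every column unless specific features were requested
    return df if not features else df[features]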

def process_review_text(text: str) -> list:
    pass
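
The string parsing in `review_score_pipeline` could also be done without `apply`. The following is a sketch under assumptions, not the notebook's method: `parse_ep_watched` is a hypothetical helper, and unlike the pipeline above it maps any review with a zero episode total to a ratio of 0.0 instead of letting a division by zero through.

```python
import numpy as np
import pandas as pd

def parse_ep_watched(ep_watched: pd.Series) -> pd.DataFrame:
    """Extract watched and total episode counts from strings like '8 of 16 episodes seen'."""
    parts = ep_watched.str.extract(r'(?P<tot_watched>\d+) of (?P<tot_ep>\d+)')
    # Missing or unparseable entries become 0 of 0
    parts = parts.fillna('0').astype(int)
    # Swap zero totals for NaN so the division never hits 0, then fill the gaps with 0.0
    ratio = parts['tot_watched'] / parts['tot_ep'].replace(0, np.nan)
    parts['watched_ratio'] = ratio.fillna(0.0).round(3)
    return parts

sample = pd.Series(['4 of 4 episodes seen', '8 of 16 episodes seen', None])
parsed = parse_ep_watched(sample)
print(parsed)
```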

In [7]:
features = ['title','story','acting_cast','music','rewatch_value','overall','text','n_helpful','tot_watched','tot_ep','watched_ratio']

test = review_score_pipeline(rtha_df, features)
test.head()

Unnamed: 0,title,story,acting_cast,music,rewatch_value,overall,text,n_helpful,tot_watched,tot_ep,watched_ratio
0,Her,6.0,8.0,9.0,5.0,7.0,"Contrived plot aside, the actors are terribly ...",3,4,4,1.0
1,Love in a Cage,10.0,10.0,7.0,10.0,10.0,Captivating from beginning to the end - the sy...,1,8,16,0.5
2,The Cupid Coach,1.0,6.0,6.0,1.0,3.0,Content warning: youll want to claw your eyes ...,4,12,12,1.0
3,La Pluie,9.5,8.5,9.0,8.0,9.0,There are many Fantasy dramas out there but wh...,13,12,12,1.0
4,Love by Chance,4.0,8.5,9.0,1.0,5.5,This drama was a bit of a mixed bag for me. Th...,0,14,14,1.0
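
The project requirements ask for the mean, median, mode, variance, and standard deviation of each variable. A minimal sketch of how that could look for the score columns, using the five rows shown above as sample data; `summarize_scores` is a hypothetical helper, and picking the smallest value when several modes tie is an assumption made only to get one figure per column.

```python
import pandas as pd

def summarize_scores(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return mean, median, mode, variance, and standard deviation per column."""
    stats = df[columns].agg(['mean', 'median', 'var', 'std']).T
    # mode() can return several values; keep the smallest one as a single summary figure
    stats['mode'] = [df[col].mode().iloc[0] for col in columns]
    return stats[['mean', 'median', 'mode', 'var', 'std']]

# The first five Thai reviews from the output above
sample = pd.DataFrame({
    'story':   [6.0, 10.0, 1.0, 9.5, 4.0],
    'overall': [7.0, 10.0, 3.0, 9.0, 5.5],
})
summary = summarize_scores(sample, ['story', 'overall'])
print(summary.round(3))
```

Note that `var` and `std` here are the sample versions (pandas defaults to `ddof=1`).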


In [8]:
# TODO: Write a function to detect outliers, and display the result with a boxplot or a normal-distribution (bell-curve) plot

# Reuse the pipeline from In [6]; with no feature list it returns every column
rtha_df = review_score_pipeline(rtha_df)

rtha_df.head()

Unnamed: 0,title,story,acting_cast,music,rewatch_value,overall,text,ep_watched,n_helpful,tot_watched,tot_ep,watched_ratio
0,Her,6.0,8.0,9.0,5.0,7.0,"Contrived plot aside, the actors are terribly ...",4 of 4 episodes seen,3,4,4,1.0
1,Love in a Cage,10.0,10.0,7.0,10.0,10.0,Captivating from beginning to the end - the sy...,8 of 16 episodes seen,1,8,16,0.5
2,The Cupid Coach,1.0,6.0,6.0,1.0,3.0,Content warning: youll want to claw your eyes ...,12 of 12 episodes seen,4,12,12,1.0
3,La Pluie,9.5,8.5,9.0,8.0,9.0,There are many Fantasy dramas out there but wh...,12 of 12 episodes seen,13,12,12,1.0
4,Love by Chance,4.0,8.5,9.0,1.0,5.5,This drama was a bit of a mixed bag for me. Th...,14 of 14 episodes seen,0,14,14,1.0
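
For the outlier TODO, one common option is the 1.5 × IQR rule, which is the same rule a boxplot uses to draw its whiskers. The following is a sketch under that assumption, not a finished implementation: `iqr_outlier_mask` is a hypothetical helper, and the vote counts are made-up values for illustration.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot whisker rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical helpful-vote counts with one extreme value
n_helpful = pd.Series([3, 1, 4, 13, 0, 2, 5, 120])
mask = iqr_outlier_mask(n_helpful)
print(n_helpful[mask])
```

In the notebook, the same mask could be applied to a column such as `rtha_df['n_helpful']`, and `rtha_df.boxplot(column='n_helpful')` would show the flagged points as fliers.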


In [9]:
rtha_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6755 entries, 0 to 6754
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   title          6755 non-null   object 
 1   story          6755 non-null   float64
 2   acting_cast    6755 non-null   float64
 3   music          6755 non-null   float64
 4   rewatch_value  6755 non-null   float64
 5   overall        6755 non-null   float64
 6   text           6734 non-null   object 
 7   ep_watched     6755 non-null   object 
 8   n_helpful      6755 non-null   int64  
 9   tot_watched    6755 non-null   int64  
 10  tot_ep         6755 non-null   int64  
 11  watched_ratio  6755 non-null   float64
dtypes: float64(6), int64(3), object(3)
memory usage: 633.4+ KB


In [10]:
# There are some weird cases where tot_watched is greater than tot_ep
# It is possible that the reviewers rewatched the drama multiple times
# set(rtha_df['watched_ratio'].tolist())
rtha_df[rtha_df['watched_ratio'] > 1.0]

Unnamed: 0,title,story,acting_cast,music,rewatch_value,overall,text,ep_watched,n_helpful,tot_watched,tot_ep,watched_ratio
379,Midnight Museum,10.0,10.0,10.0,10.0,10.0,This is my first time writing a review for a s...,15 of 10 episodes seen,14,15,10,1.5
380,Midnight Museum,10.0,10.0,9.0,10.0,10.0,this series has honestly blown me away and we ...,15 of 10 episodes seen,10,15,10,1.5
384,Midnight Museum,10.0,10.0,10.0,10.0,10.0,I Love this series. Love the mystery and the c...,15 of 10 episodes seen,2,15,10,1.5
481,Tin Tem Jai,5.5,5.0,6.5,4.5,5.0,I swear if Tin doesnt realise how good the sen...,12 of 10 episodes seen,4,12,10,1.2
2270,Are We Alright?,10.0,9.5,9.5,10.0,10.0,Overall 10 Story 10 Acting/Cast 9.5 Music 9...,20 of 15 episodes seen,5,20,15,1.333
3153,Why R U?,9.5,9.5,8.0,9.0,10.0,Overall 10 Story 9.5 Acting/Cast 9.5 Music ...,14 of 13 episodes seen,39,14,13,1.077
3347,Praomook,10.0,10.0,8.5,10.0,10.0,"I love this series, I love the actors, there i...",17 of 15 episodes seen,25,17,15,1.133
4013,Club Friday to Be Continued: She Changed,4.5,5.0,5.0,1.0,6.0,Overall 6.0 Story 4.5 Acting/Cast 5.0 Music...,22 of 13 episodes seen,6,22,13,1.692
4401,My Name Is Busaba,9.0,9.0,6.0,9.5,8.5,I started watching this show for the food. It ...,24 of 16 episodes seen,7,24,16,1.5
4419,Unlucky Ploy,1.0,1.0,2.5,1.0,1.0,worst quality and story. It is time to think ...,20 of 16 episodes seen,1,20,16,1.25
