# Thai X Japanese Drama: Can they challenge the global dominance of Korean Drama?

Midterm Project: Comprehensive Data Analysis and Visualization

Duration: 2 weeks
Submission Date: Sep. 28, 2023

### Requirements:
1. Select a Dataset: 
> - Choose a dataset that contains at least 500 entries and at least five different
variables.
2. Data Exploration:
> - Perform summary statistics to understand the basic metrics of each variable (mean, median, mode,
variance, standard deviation).
> - Identify any outliers and clean the dataset if necessary.
3. Statistical Analysis:
> - Conduct hypothesis tests or other statistical methods to answer at least two questions you have
about the dataset.
> - Use measures of similarity, probability, and distributions to draw inferences from the data.
4. Data Visualization:
> - Create at least four different types of visualizations using the dataset.
> - These can be bar charts, line charts, scatter plots, pie charts, etc., as appropriate for your data.
5. Interpretation:
> - Prepare a presentation that walks through your exploratory data analysis, statistical findings, and
visualizations.
> - The presentation should be both factually accurate and easily understandable, targeted at an
audience unfamiliar with your dataset.
6. Documentation:
> - Alongside the presentation, prepare a report documenting your methodology, the statistical tests
performed, the visualizations created, and your interpretations.


### Deliverables:

1. Cleaned and processed dataset in CSV format.
2. A presentation (PPT or equivalent) summarizing your findings, prepared for a 7-minute talk.
3. A detailed report (Word, PDF, or equivalent).

### Evaluation Criteria:
- Quality of data exploration and statistical analysis.
- Effectiveness and appropriateness of data visualizations.
- Coherence and clarity in the presentation and report.
- Ability to interpret the results and draw meaningful conclusions.

<br>
Submission:
Submit all the project files via Google Classroom.

# Statistical Analysis

### TODO:
> * Statistical Analysis:
>> * Conduct hypothesis tests or other statistical methods to answer at least two questions you have about the dataset.
>> * Use measures of similarity, probability, and distributions to draw inferences from the data.

### Questions
1. Is there statistically significant evidence that the average overall score given to Korean dramas by contributors on the MyDramaList website is higher than that given to Thai and Japanese dramas?
* Dive into more specific genres such as Romance, Horror, and others (focusing on what each region is known for).
> * For example, Thailand is known for horror, so we might investigate whether viewers rate Thai horror dramas higher than Korean (or Japanese) ones.
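
Because Question 1 is directional (is the Korean mean *higher*?), one possible way to test it is a one-sided Welch t-test per comparison. The sketch below is only an illustration: it assumes SciPy 1.6 or later (for the `alternative` argument), the score arrays are made up rather than taken from the dataset, and `kor_scores`, `tha_scores`, `jap_scores`, `ALPHA`, and `korean_higher` are hypothetical names.

```python
import numpy as np
from scipy import stats

# Illustrative, made-up overall scores (NOT taken from the dataset).
kor_scores = np.array([9.0, 8.5, 9.5, 10.0, 8.0, 9.0, 8.5, 9.5])
tha_scores = np.array([7.0, 8.0, 6.5, 9.0, 7.5, 8.0, 6.0, 7.0])
jap_scores = np.array([8.0, 7.5, 8.5, 7.0, 9.0, 8.0, 7.5, 8.5])

ALPHA = 0.05  # assumed significance level


def korean_higher(kor, other):
    """One-sided Welch t-test: H0 mean(kor) <= mean(other), Ha mean(kor) > mean(other)."""
    result = stats.ttest_ind(kor, other, equal_var=False, alternative="greater")
    return result.statistic, result.pvalue


for name, other in [("Thai", tha_scores), ("Japanese", jap_scores)]:
    t_stat, p_value = korean_higher(kor_scores, other)
    # Bonferroni correction: two comparisons share the overall ALPHA.
    verdict = "reject H0" if p_value < ALPHA / 2 else "fail to reject H0"
    print(f"Korean vs {name}: t={t_stat:.3f}, p={p_value:.4f} -> {verdict}")
```

With the real data, the arrays would be replaced by the `overall` columns of the three review DataFrames after dropping missing values.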

In [1]:
import pandas as pd
import numpy as np
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/tha_tha_actors.csv
/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/jap_drama.csv
/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/kor_user_reviews.csv
/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/kor_kor_actors.csv
/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/kor_drama.csv
/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/tha_user_reviews.csv
/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/jap_user_reviews.csv
/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/jap_jap_actors.csv
/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/tha_drama.csv


# Drama

In [2]:
# Load the data: ignoring drama_id, rank, and pop(ularity) because we won't be using the website ranking system
dtha_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/tha_drama.csv').iloc[:,1:-2]
dkor_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/kor_drama.csv').iloc[:,1:-2]
djap_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/jap_drama.csv').iloc[:,1:-2]

# Reviews

In [3]:
rtha_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/tha_user_reviews.csv').iloc[:,1:]
rkor_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/kor_user_reviews.csv').iloc[:,1:]
rjap_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/jap_user_reviews.csv').iloc[:,1:]

In [4]:
rtha_df.head()

Unnamed: 0,title,story,acting_cast,music,rewatch_value,overall,text,ep_watched,n_helpful
0,Her,6.0,8.0,9.0,5.0,7.0,"Contrived plot aside, the actors are terribly ...",4 of 4 episodes seen,3
1,Love in a Cage,10.0,10.0,7.0,10.0,10.0,Captivating from beginning to the end - the sy...,8 of 16 episodes seen,1
2,The Cupid Coach,1.0,6.0,6.0,1.0,3.0,Content warning: youll want to claw your eyes ...,12 of 12 episodes seen,4
3,La Pluie,9.5,8.5,9.0,8.0,9.0,There are many Fantasy dramas out there but wh...,12 of 12 episodes seen,13
4,Love by Chance,4.0,8.5,9.0,1.0,5.5,This drama was a bit of a mixed bag for me. Th...,14 of 14 episodes seen,0


# Utility Functions

2. Data Exploration:
> - Perform summary statistics to understand the basic metrics of each variable (mean, median, mode,
variance, standard deviation).
> - Identify any outliers and clean the dataset if necessary.

In [5]:
from collections import Counter
import pandas as pd
import numpy as np

from scipy import stats
from typing import Tuple
import json


class DataAnalysisUtility:
    
    def calculate_statistics(
        self, 
        df: pd.DataFrame, 
        column_name: str
    ) -> dict:
        
        mean = df[column_name].mean()
        median = df[column_name].median()
        
        # Mode
        value_counter = Counter(df[column_name])
        nmax_occurance = value_counter.most_common(n=1)[0][1]
        mode = [key for key,val in value_counter.items() if val == nmax_occurance]
        
        variance = df[column_name].var()
        std_deviation = df[column_name].std()
        
        minimum = df[column_name].min()
        maximum = df[column_name].max()
        data_range = abs(maximum - minimum)
        
        statistics_dict = {
            'mean': mean,
            'median': median,
            'mode': mode,
            'variance': variance,
            'standard_deviation': std_deviation,
            'min': minimum,
            'max': maximum,
            'range': data_range
        }
        
        return statistics_dict
    
    def iqr_outlier_detector(
        self,
        df: pd.DataFrame, 
        column_name: str
    ) -> Tuple[pd.Series, pd.DataFrame]:
        q1 = df[column_name].quantile(0.25)
        q3 = df[column_name].quantile(0.75)
        iqr = q3 - q1

        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        # print(f'Boundaries: {lower_bound=} and {upper_bound=}')

        outlier_mask = ((df[column_name] < lower_bound) | (df[column_name] > upper_bound))
        df_no_outliers = df[~outlier_mask]

        return outlier_mask, df_no_outliers

    def zscore_outlier_detector(
        self,
        df: pd.DataFrame, 
        column_name: str, 
        threshold=3
    ) -> Tuple[pd.Series, pd.DataFrame]:
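        # Population z-score (ddof=0, as in scipy.stats.zscore). pandas skips NaN values,
        # so one missing rating does not turn every z-score into NaN.
        col = df[column_name]
        z_scores = ((col - col.mean()) / col.std(ddof=0)).abs()
        outlier_mask = z_scores > threshold  # rows with a NaN score are never flagged
        df_no_outliers = df[~outlier_mask]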

        return outlier_mask, df_no_outliers

In [6]:
# Testing the utility functions for processing the data
util = DataAnalysisUtility()  

result_a = util.calculate_statistics(rtha_df,'overall')
print(json.dumps(result_a,indent=3),end='\n\n')

# Test Outlier Detection Function
outlier_mask, no_outlier_df = util.zscore_outlier_detector(rtha_df, column_name='overall')
print(no_outlier_df.shape, rtha_df.shape)

a,b = util.iqr_outlier_detector(rtha_df, 'overall')
print(b.shape, rtha_df.shape)

{
   "mean": 7.999333826794967,
   "median": 9.0,
   "mode": [
      10.0
   ],
   "variance": 5.754848534530734,
   "standard_deviation": 2.3989265379604134,
   "min": 1.0,
   "max": 10.0,
   "range": 9.0
}

(6755, 9) (6755, 9)
(6448, 9) (6755, 9)
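
The z-score detector removes no rows while the IQR detector removes 307. A quick check with the summary statistics printed above suggests why: even the lowest possible rating lies within three standard deviations of the mean. The snippet uses the sample standard deviation from the output, whereas `scipy.stats.zscore` uses the population form, a difference that should be negligible with 6,755 rows.

```python
# Summary statistics copied from the output above (overall score, Thai reviews).
mean = 7.999333826794967
std = 2.3989265379604134
minimum, maximum = 1.0, 10.0
threshold = 3

# z-scores of the two most extreme ratings in the column.
z_min = (minimum - mean) / std
z_max = (maximum - mean) / std
print(f"z(min)={z_min:.3f}, z(max)={z_max:.3f}")

# True only if at least one extreme would be flagged as an outlier.
any_flagged = abs(z_min) > threshold or abs(z_max) > threshold
print(any_flagged)
```

The minimum sits roughly 2.9 standard deviations below the mean, so no rating can exceed the threshold of 3.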


# TODO: Drama Observation

In [7]:
# If we're putting the emphasis on Drama only, then we have to do filtering
dkor_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/kor_drama.csv').iloc[:,1:-2]
dkor_df = dkor_df[dkor_df['type'] == 'Drama']
# dkor_df['genres'][0].split(',  ')

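def split_geners_tags(df: pd.DataFrame) -> pd.DataFrame:
    # NOTE: str.split() with no argument splits on whitespace, so multi-word labels are
    # broken into separate tokens (the output below lists 'Short', 'Length', 'Series' as
    # separate tags) and missing values become the string 'nan'.
    df['genres'] = df['genres'].apply(lambda x: [token.replace(',','').strip() for token in str(x).split()])
    df['tags'] = df['tags'].apply(lambda x: [token.replace(',','').strip() for token in str(x).split()])
    return df

def custom_rating_generator(df: pd.DataFrame) -> pd.DataFrame:
    # Despite its name, watched_ratio is tot_num_user / tot_watched
    # (presumably the share of watchers who also submitted a score)
    df['watched_ratio'] = df['tot_num_user'] / df['tot_watched']
    df['watched_ratio'] = df['watched_ratio'].apply(lambda x: round(x,3))
    # Average user score scaled by that share
    df['scored_signif'] = df['tot_user_score'] * df['watched_ratio']
    return df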

dkor_df = split_geners_tags(dkor_df)
dkor_df = custom_rating_generator(dkor_df)
dkor_df.head()


Unnamed: 0,drama_name,native_name,year,synopsis,genres,tags,director,sc_writer,country,type,...,start_dt,end_dt,aired_on,org_net,tot_user_score,tot_num_user,tot_watched,content_rt,watched_ratio,scored_signif
0,Mask Girl,마스크걸,2023,Kim Mo Mi is an ordinary office woman with a s...,"[Thriller, Mystery, Comedy, Drama]","[Noir, Revenge, Suspense, Inferiority, Complex...",,,South Korea,Drama,...,2023-08-18,2023-08-18,Friday,Netflix,7.9,5814.0,13098,18+ Restricted (violence & profanity),0.444,3.5076
1,Better Things,Better Things,2023,"Comprised of 3 episodes, the sitcom follows th...",[Sitcom],"[Short, Length, Series, Miniseries, Web, Series]",,,South Korea,Drama,...,2023-08-07,2023-08-11,"Monday, Wednesday, Friday",,8.0,16.0,44,Not Yet Rated,0.364,2.912
2,My Dearest,연인,2023,A love-story between a noblewoman and a myster...,"[Historical, Romance, Drama, Melodrama]","[Joseon, Dynasty, Qing, Invasion, Of, Joseon, ...","['Kim Sung Yong', 'Lee Han Joon']",['Hwang Jin Young'],South Korea,Drama,...,2023-08-04,2023-09-02,"Friday, Saturday",MBC,8.7,1680.0,8523,15+ - Teens 15 or older,0.197,1.7139
3,Sing My Crush,따라바람,2023,After a disastrous first encounter between Ba ...,"[Music, Comedy, Romance, Youth]","[LGBTQ+, Band, High, School, To, Working, Life...",['So Joon Moon'],,South Korea,Drama,...,2023-08-02,2023-08-02,Wednesday,,8.1,3613.0,8224,Not Yet Rated,0.439,3.5559
4,The Uncanny Counter Season 2: Counter Punch,카운터 펀치,2023,Evil spirits from the afterlife arrive on Eart...,"[Action, Mystery, Comedy, Supernatural]","[Dark, Fantasy, Fantasy, Evil, Spirit, Drama, ...",['Yoo Seon Dong'],"['Kim Sae Bom', 'Yeo Ji Na']",South Korea,Drama,...,2023-07-29,2023-09-03,"Saturday, Sunday",Netflix OCN tvN,8.5,5542.0,27408,15+ - Teens 15 or older,0.202,1.717
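
The tags column above shows multi-word labels split into single words. A minimal alternative sketch, assuming the raw strings are comma-separated (as in `'Comedy,  Romance'`), splits on commas instead of whitespace. `split_on_commas` and the demo rows are hypothetical and not part of the original notebook.

```python
import pandas as pd


def split_on_commas(value) -> list:
    """Split a comma-separated genre/tag string, keeping multi-word labels intact."""
    if pd.isna(value):
        return []  # missing value -> empty list instead of the string 'nan'
    return [label.strip() for label in str(value).split(",") if label.strip()]


# Hypothetical demo rows mimicking the raw 'genres' format ('A,  B,  C').
demo = pd.DataFrame({"genres": ["Action,  Martial Arts,  Drama", "Romance", None]})
demo["genres"] = demo["genres"].apply(split_on_commas)
print(demo["genres"].tolist())
```

Switching to this split would change the genre and tag counts, so any output computed with the whitespace-based version would need to be regenerated.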


# TODO: Data Visualization and Analysis

In [8]:
# TODO: Add Weighted Rating based on the number of people that found the review useful
def review_score_pipeline(
    df: pd.DataFrame,
    features: list = None,  # None avoids sharing a mutable default list across calls
) -> pd.DataFrame:
    
    # ep_watched looks like '8 of 16 episodes seen'; missing values are treated as '0 of 0'
    df['ep_watched'] = df['ep_watched'].fillna('0 of 0 episodes seen')
    df['tot_watched'] = df['ep_watched'].apply(lambda x: int(str(x).split(' ')[0]))
    df['tot_ep'] = df['ep_watched'].apply(lambda x: int(str(x).split(' ')[2]))
    df['watched_ratio'] = df['tot_watched'] / df['tot_ep']
    # 0 / 0 yields NaN, which is mapped to a ratio of 0.0
    df['watched_ratio'] = df['watched_ratio'].fillna(0.0).apply(lambda x: round(x,3))
    
    # TODO: Weighted Score Rating --> df['weighted_signif'] 
    
    return df[features] if features else df

def process_review_text(text: str) -> list:
    # TODO: Tokenization, StopwordRemovers, etc.
    pass

def outlier_detection_marker():
    pass
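
One possible reading of the weighted-rating TODO is to let each review count once plus once per helpful vote. The sketch below is an assumption about what such a weighting could look like, not the notebook's chosen method; `helpful_weighted_mean` is a hypothetical helper, and the sample rows are the first five Thai reviews shown earlier.

```python
import numpy as np
import pandas as pd


def helpful_weighted_mean(df: pd.DataFrame, score_col: str = "overall") -> float:
    """Average score where each review counts (1 + n_helpful) times."""
    valid = df.dropna(subset=[score_col])
    # +1 keeps reviews with zero helpful votes in the average.
    weights = valid["n_helpful"].fillna(0) + 1
    return float(np.average(valid[score_col], weights=weights))


# First five Thai reviews shown earlier (overall score, helpful votes).
sample = pd.DataFrame({
    "overall": [7.0, 10.0, 3.0, 9.0, 5.5],
    "n_helpful": [3, 1, 4, 13, 0],
})
print(sample["overall"].mean())       # plain mean
print(helpful_weighted_mean(sample))  # helpful-weighted mean
```

The weighted value is pulled toward the review with 13 helpful votes, which is the intended effect but also means a few popular reviews can dominate the result.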

In [9]:
features = ['title','story','acting_cast','music','rewatch_value','overall','text','n_helpful','tot_watched','tot_ep','watched_ratio']

rtha_df = review_score_pipeline(rtha_df, features)
rkor_df = review_score_pipeline(rkor_df, features)
rjap_df = review_score_pipeline(rjap_df, features)

In [10]:
rtha_df.head()

Unnamed: 0,title,story,acting_cast,music,rewatch_value,overall,text,n_helpful,tot_watched,tot_ep,watched_ratio
0,Her,6.0,8.0,9.0,5.0,7.0,"Contrived plot aside, the actors are terribly ...",3,4,4,1.0
1,Love in a Cage,10.0,10.0,7.0,10.0,10.0,Captivating from beginning to the end - the sy...,1,8,16,0.5
2,The Cupid Coach,1.0,6.0,6.0,1.0,3.0,Content warning: youll want to claw your eyes ...,4,12,12,1.0
3,La Pluie,9.5,8.5,9.0,8.0,9.0,There are many Fantasy dramas out there but wh...,13,12,12,1.0
4,Love by Chance,4.0,8.5,9.0,1.0,5.5,This drama was a bit of a mixed bag for me. Th...,0,14,14,1.0


In [11]:
rkor_df.head()

Unnamed: 0,title,story,acting_cast,music,rewatch_value,overall,text,n_helpful,tot_watched,tot_ep,watched_ratio
0,The Good Bad Mother,10.0,10.0,8.0,8.0,10.0,Watching this drama was an incredibly fulfilli...,27,14,14,1.0
1,Mask Girl,9.0,10.0,8.5,8.0,9.0,Warning ahead: This is no show for softies. An...,27,7,7,1.0
2,Lies Hidden in My Garden,9.5,10.0,9.5,9.0,9.5,Absolutely fantastic slow burn dark psychologi...,10,8,8,1.0
3,Mask Girl,10.0,10.0,10.0,9.0,10.0,The retro style of the beginning and the calmi...,10,7,7,1.0
4,W,8.5,8.0,5.0,8.0,8.5,Overall 8.5 Story 8.5 Acting/Cast 8.0 Music...,1,16,16,1.0


In [12]:
rjap_df.head()

Unnamed: 0,title,story,acting_cast,music,rewatch_value,overall,text,n_helpful,tot_watched,tot_ep,watched_ratio
0,Nihon Boro Yado Kiko,5.0,9.0,9.0,1.0,6.5,"It was cute, it was charming, it had its spunk...",3,12,12,1.0
1,One Piece,10.0,10.0,8.5,9.5,10.0,Wow! This live-action series really justifies ...,14,8,8,1.0
2,Kieta Hatsukoi,10.0,10.0,10.0,10.0,10.0,The Kieta Hatsukoi live action changed the tra...,0,10,10,1.0
3,One Piece,10.0,10.0,9.0,10.0,10.0,I am a One Piece fan and also have also grown ...,7,8,8,1.0
4,One Piece,9.0,8.0,8.0,8.5,8.5,Finished this and still not sure if it was the...,3,8,8,1.0


In [13]:
# There are some weird cases where tot_watched is greater than tot_ep
# It is possible that the reviewers rewatched the drama multiple times
# set(rtha_df['watched_ratio'].tolist())
rtha_df[rtha_df['watched_ratio'] > 1.0]

Unnamed: 0,title,story,acting_cast,music,rewatch_value,overall,text,n_helpful,tot_watched,tot_ep,watched_ratio
379,Midnight Museum,10.0,10.0,10.0,10.0,10.0,This is my first time writing a review for a s...,14,15,10,1.5
380,Midnight Museum,10.0,10.0,9.0,10.0,10.0,this series has honestly blown me away and we ...,10,15,10,1.5
384,Midnight Museum,10.0,10.0,10.0,10.0,10.0,I Love this series. Love the mystery and the c...,2,15,10,1.5
481,Tin Tem Jai,5.5,5.0,6.5,4.5,5.0,I swear if Tin doesnt realise how good the sen...,4,12,10,1.2
2270,Are We Alright?,10.0,9.5,9.5,10.0,10.0,Overall 10 Story 10 Acting/Cast 9.5 Music 9...,5,20,15,1.333
3153,Why R U?,9.5,9.5,8.0,9.0,10.0,Overall 10 Story 9.5 Acting/Cast 9.5 Music ...,39,14,13,1.077
3347,Praomook,10.0,10.0,8.5,10.0,10.0,"I love this series, I love the actors, there i...",25,17,15,1.133
4013,Club Friday to Be Continued: She Changed,4.5,5.0,5.0,1.0,6.0,Overall 6.0 Story 4.5 Acting/Cast 5.0 Music...,6,22,13,1.692
4401,My Name Is Busaba,9.0,9.0,6.0,9.5,8.5,I started watching this show for the food. It ...,7,24,16,1.5
4419,Unlucky Ploy,1.0,1.0,2.5,1.0,1.0,worst quality and story. It is time to think ...,1,20,16,1.25
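
Ratios above 1.0 would distort any analysis that treats `watched_ratio` as a completion rate. One hedged option is to keep the anomaly as a flag and cap the ratio at 1.0. Whether to cap, drop, or keep these rows is a judgment call, and the `over_watched` column name is hypothetical.

```python
import pandas as pd

# Three (tot_watched, tot_ep) pairs taken from the rows above plus one regular row.
demo = pd.DataFrame({
    "title": ["Midnight Museum", "Tin Tem Jai", "Why R U?", "Her"],
    "tot_watched": [15, 12, 14, 4],
    "tot_ep": [10, 10, 13, 4],
})

ratio = demo["tot_watched"] / demo["tot_ep"]
demo["over_watched"] = ratio > 1.0                      # keep the anomaly visible as a flag
demo["watched_ratio"] = ratio.clip(upper=1.0).round(3)  # cap completion at 100%
print(demo)
```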


# Statistical Analysis

3. Statistical Analysis:
> - Conduct hypothesis tests or other statistical methods to answer at least two questions you have
about the dataset.
> - Use measures of similarity, probability, and distributions to draw inferences from the data.

<br>

----

### Hypothesis:
#### 1. Is there a significant difference in the average viewer ratings of drama TV shows among these three countries?
* Statistical Method: Analysis of Variance (ANOVA)

* Hypothesis:

> * Null Hypothesis (H0): 
>> The average viewer ratings of drama TV shows are equal across all three countries (Thailand, Japan, and Korea).
> * Alternative Hypothesis (Ha): 
>> At least one country's average viewer rating differs from the others.

*  Procedure:
Collect viewer rating data for drama TV shows from the three countries. Perform an ANOVA test to compare the means of viewer ratings among the three groups (countries). If the ANOVA test results in a significant p-value, conduct post-hoc tests (e.g., Tukey's HSD) to identify which specific pairs of countries have significantly different viewer ratings.
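
The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the ratings are synthetic (not the MyDramaList data), the group means and `ALPHA` are arbitrary choices, and `scipy.stats.tukey_hsd` requires SciPy 1.8 or later.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Synthetic ratings on a 1-10 scale (NOT the real data); the group means are assumptions.
ratings = {
    "Thailand": np.clip(rng.normal(8.0, 1.5, 200), 1, 10),
    "Japan": np.clip(rng.normal(7.8, 1.5, 200), 1, 10),
    "Korea": np.clip(rng.normal(8.6, 1.5, 200), 1, 10),
}
countries = list(ratings)
ALPHA = 0.05  # assumed significance level

# Step 1: one-way ANOVA across the three groups.
f_stat, p_value = stats.f_oneway(*ratings.values())
print(f"ANOVA: F={f_stat:.3f}, p={p_value:.4g}")

# Step 2: post-hoc Tukey HSD only if the omnibus test is significant.
if p_value < ALPHA:
    tukey = stats.tukey_hsd(*ratings.values())
    for i in range(len(countries)):
        for j in range(i + 1, len(countries)):
            print(f"{countries[i]} vs {countries[j]}: p={tukey.pvalue[i, j]:.4g}")
```

ANOVA assumes roughly normal residuals and similar variances. Since the review scores are skewed toward 10, a rank-based test such as `scipy.stats.kruskal` may be worth running alongside it as a check.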


#### 2. What are the most popular drama genres in each country (Thailand, Japan, and Korea), and how do they compare across countries?
* Ex. Thailand is known for Horror, so do Thai Horror dramas receive better ratings than Japanese and Korean dramas in the Horror genre?
* How do we determine which genre is the most popular in each country?
> * Metrics or measurements for genre popularity?
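
As a starting point for the metrics question, a genre's popularity could be measured by how many dramas carry it, by how many users watched those dramas, or by their average score. The sketch below computes all three on a hypothetical mini-sample. The column names follow the drama DataFrame, but the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical mini-sample in the post-split format (genres already a list per drama).
demo = pd.DataFrame({
    "drama_name": ["A", "B", "C", "D"],
    "genres": [["Romance", "Comedy"], ["Romance"], ["Horror", "Thriller"], ["Horror"]],
    "tot_user_score": [7.5, 8.0, 8.4, 7.0],
    "tot_watched": [1000, 3000, 500, 1500],
})

# One row per (drama, genre) pair, then aggregate per genre.
per_genre = (
    demo.explode("genres")
        .groupby("genres")
        .agg(
            n_dramas=("drama_name", "count"),       # popularity by supply
            total_watched=("tot_watched", "sum"),   # popularity by audience
            mean_score=("tot_user_score", "mean"),  # perceived quality
        )
        .sort_values("total_watched", ascending=False)
)
print(per_genre)
```

The three columns can rank genres differently, so the choice of metric should be stated before comparing countries.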

In [14]:
kor_genres_dict = {}
dkor_df['tot_user_score'] = dkor_df['tot_user_score'].fillna(dkor_df['tot_user_score'].mean())

for index, row in dkor_df.iterrows():
    genres = row['genres']
    for genre in genres:
        if genre not in kor_genres_dict.keys():
            kor_genres_dict[genre] = {
                'num': 1,
                'score': [row['tot_user_score']]
            }
        else:
            kor_genres_dict[genre]['num'] += 1
            kor_genres_dict[genre]['score'].append(row['tot_user_score'])

for key, val in kor_genres_dict.items():
    genre_average = sum(val['score']) / len(val['score'])
    kor_genres_dict[key]['score'] = genre_average
        
kor_genres_dict

{'Thriller': {'num': 231, 'score': 7.970019958883202},
 'Mystery': {'num': 312, 'score': 7.903444264429552},
 'Comedy': {'num': 677, 'score': 7.5985182924906445},
 'Drama': {'num': 875, 'score': 7.637054026873312},
 'Sitcom': {'num': 26, 'score': 7.544200807847665},
 'Historical': {'num': 79, 'score': 7.975949367088609},
 'Romance': {'num': 1135, 'score': 7.511207678874213},
 'Melodrama': {'num': 234, 'score': 7.622113720094097},
 'Music': {'num': 57, 'score': 7.496491228070172},
 'Youth': {'num': 351, 'score': 7.461746588347859},
 'Action': {'num': 119, 'score': 8.058610172285876},
 'Supernatural': {'num': 113, 'score': 7.548672566371682},
 'Military': {'num': 8, 'score': 8.4},
 'Fantasy': {'num': 162, 'score': 7.653703703703702},
 'Horror': {'num': 42, 'score': 7.752380952380954},
 'Psychological': {'num': 53, 'score': 7.967924528301884},
 'Life': {'num': 304, 'score': 7.72006066450026},
 'Crime': {'num': 55, 'score': 8.024992918218539},
 'Food': {'num': 52, 'score': 7.34423076923077

In [15]:
# dkor_df is the preprocessed data
kor_genres_dict = Counter(genre for genres in dkor_df['genres'] for genre in genres)
skor_genres_dict = sorted(kor_genres_dict.items(), key=lambda x: x[1], reverse=True)
top_five_genres = dict(skor_genres_dict[:5])

skor_genres_dict = dict(skor_genres_dict)

print(f'{top_five_genres=}',end='\n\n')

print(f'Complete Genres list: {json.dumps(skor_genres_dict,indent=4)}')

top_five_genres={'Romance': 1135, 'Drama': 875, 'Comedy': 677, 'Youth': 351, 'Mystery': 312}

Complete Genres list: {
    "Romance": 1135,
    "Drama": 875,
    "Comedy": 677,
    "Youth": 351,
    "Mystery": 312,
    "Life": 304,
    "Melodrama": 234,
    "Thriller": 231,
    "Fantasy": 162,
    "Action": 119,
    "Supernatural": 113,
    "Business": 97,
    "Family": 84,
    "Historical": 79,
    "Law": 69,
    "Music": 57,
    "Sci-Fi": 56,
    "Crime": 55,
    "Psychological": 53,
    "Food": 52,
    "Political": 51,
    "Medical": 44,
    "Horror": 42,
    "Sitcom": 26,
    "Sports": 24,
    "Adventure": 18,
    "Military": 8,
    "nan": 5,
    "Documentary": 5,
    "Tokusatsu": 5,
    "Mature": 2,
    "Martial": 2,
    "Arts": 2,
    "War": 1
}


In [16]:
# Explore a different grouping: count full genre combinations instead of single genres?
dkor_df_test = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/kor_drama.csv').iloc[:,1:-2]
Counter(dkor_df_test['genres'])

Counter({'Romance': 81,
         'Comedy,  Romance': 52,
         'Romance,  Youth': 52,
         'Comedy,  Romance,  Drama': 51,
         'Romance,  Drama': 50,
         'Romance,  Drama,  Melodrama': 38,
         'Romance,  Youth,  Drama': 36,
         'Comedy,  Romance,  Life,  Drama': 35,
         'Comedy,  Romance,  Youth': 28,
         'Drama': 27,
         'Comedy,  Romance,  Youth,  Drama': 24,
         'Romance,  Drama,  Family,  Melodrama': 22,
         'Comedy': 20,
         'Romance,  Life,  Youth,  Drama': 19,
         'Comedy,  Romance,  Drama,  Fantasy': 18,
         'Comedy,  Romance,  Life': 17,
         'Life,  Youth': 17,
         'Comedy,  Romance,  Fantasy': 16,
         'Romance,  Life,  Drama': 16,
         'Business,  Comedy,  Romance,  Drama': 16,
         'Thriller,  Mystery,  Drama': 15,
         'Romance,  Life,  Drama,  Melodrama': 15,
         'Action,  Thriller,  Mystery,  Drama': 14,
         'Youth,  Drama': 14,
         'Comedy,  Romance,  Life,  Youth