# Thai X Japanese Drama: Can They Challenge the Global Dominance of Korean Drama?

Midterm Project: Comprehensive Data Analysis and Visualization

Duration: 2 weeks
Submission Date: Sep. 28, 2023

### Requirements:
1. Select a Dataset: 
> - Choose a dataset that contains at least 500 entries and at least five different
variables.
2. Data Exploration:
> - Perform summary statistics to understand the basic metrics of each variable (mean, median, mode,
variance, standard deviation).
> - Identify any outliers and clean the dataset if necessary.
3. Statistical Analysis:
> - Conduct hypothesis tests or other statistical methods to answer at least two questions you have
about the dataset.
> - Use measures of similarity, probability, and distributions to draw inferences from the data.
4. Data Visualization:
> - Create at least four different types of visualizations using the dataset.
> - These can be bar charts, line charts, scatter plots, pie charts, etc., as appropriate for your data.
5. Interpretation:
> - Prepare a presentation that walks through your exploratory data analysis, statistical findings, and
visualizations.
> - The presentation should be both factually accurate and easily understandable, targeted at an
audience unfamiliar with your dataset.
6. Documentation:
> - Alongside the presentation, prepare a report documenting your methodology, the statistical tests
performed, the visualizations created, and your interpretations.


### Deliverables:

1. Cleaned and processed dataset in CSV format.
2. A 7-minute presentation (PPT or equivalent) summarizing your findings.
3. A detailed report (Word, PDF, or equivalent).

### Evaluation Criteria:
- Quality of data exploration and statistical analysis.
- Effectiveness and appropriateness of data visualizations.
- Coherence and clarity in the presentation and report.
- Ability to interpret the results and draw meaningful conclusions.

<br>
### Submission:
Submit all the project files via Google Classroom.

# Statistical Analysis

### TODO:
> * Statistical Analysis:
>> * Conduct hypothesis tests or other statistical methods to answer at least two questions you have about the dataset.
>> * Use measures of similarity, probability, and distributions to draw inferences from the data.

### Questions
1. Is there statistically significant evidence that the average overall score given to Korean dramas by contributors on the MyDramaList website is higher than that of Thai and Japanese dramas?
* Dive into more specific genres such as Romance, Horror, and others (focusing on what each region is known for).
> * For example, Thailand is known for horror movies, so we might investigate whether viewers rate Thai horror dramas higher than Korean (and possibly Japanese) ones.
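
Because Question 1 asks whether one mean is *higher* than another, a one-sided two-sample test is a natural fit. The sketch below runs Welch's t-test (no equal-variance assumption) on synthetic scores. The arrays `korea_scores` and `thai_scores` are made-up stand-ins rather than the real `overall` column, and the `alternative` argument assumes SciPy 1.6 or newer.

```python
import numpy as np
from scipy import stats

# Synthetic review scores (assumption: NOT the real MyDramaList data)
rng = np.random.default_rng(42)
korea_scores = rng.normal(loc=8.3, scale=1.0, size=200)
thai_scores = rng.normal(loc=7.5, scale=1.2, size=200)

# H0: mean(Korea) <= mean(Thailand); Ha: mean(Korea) > mean(Thailand)
# equal_var=False selects Welch's t-test
result = stats.ttest_ind(
    korea_scores,
    thai_scores,
    equal_var=False,
    alternative="greater",
)

print(f"t = {result.statistic:.3f}, one-sided p = {result.pvalue:.3g}")
```

With the real data, the same call could be repeated for Korea vs. Japan. Running several pairwise tests calls for a multiple-comparison correction, or for the ANOVA route described later in this notebook.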

In [None]:
import pandas as pd
import numpy as np
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Drama

In [None]:
# Load the data: ignoring drama_id, rank, and pop(ularity) because we won't be using the website ranking system
dtha_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/tha_drama.csv').iloc[:,1:-2]
dkor_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/kor_drama.csv').iloc[:,1:-2]
djap_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/jap_drama.csv').iloc[:,1:-2]

# Reviews

In [None]:
rtha_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/tha_user_reviews.csv').iloc[:,1:]
rkor_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/kor_user_reviews.csv').iloc[:,1:]
rjap_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/jap_user_reviews.csv').iloc[:,1:]

In [None]:
rtha_df.head()

# Utility Functions

2. Data Exploration:
> - Perform summary statistics to understand the basic metrics of each variable (mean, median, mode,
variance, standard deviation).
> - Identify any outliers and clean the dataset if necessary.

In [None]:
from collections import Counter
import pandas as pd
import numpy as np

from scipy import stats
from typing import Tuple
import json


class DataAnalysisUtility:
    
    def calculate_statistics(
        self, 
        df: pd.DataFrame, 
        column_name: str
    ) -> dict:
        
        mean = df[column_name].mean()
        median = df[column_name].median()
        
        # Mode
        value_counter = Counter(df[column_name])
        nmax_occurance = value_counter.most_common(n=1)[0][1]
        mode = [key for key,val in value_counter.items() if val == nmax_occurance]
        
        variance = df[column_name].var()
        std_deviation = df[column_name].std()
        
        minimum = df[column_name].min()
        maximum = df[column_name].max()
        data_range = abs(maximum - minimum)
        
        statistics_dict = {
            'mean': mean,
            'median': median,
            'mode': mode,
            'variance': variance,
            'standard_deviation': std_deviation,
            'min': minimum,
            'max': maximum,
            'range': data_range
        }
        
        return statistics_dict
    
    def iqr_outlier_detector(
        self,
        df: pd.DataFrame, 
        column_name: str
    ) -> Tuple[pd.Series, pd.DataFrame]:
        q1 = df[column_name].quantile(0.25)
        q3 = df[column_name].quantile(0.75)
        iqr = q3 - q1

        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        # print(f'Boundaries: {lower_bound=} and {upper_bound=}')

        outlier_mask = ((df[column_name] < lower_bound) | (df[column_name] > upper_bound))
        df_no_outliers = df[~outlier_mask]

        return outlier_mask, df_no_outliers

    def zscore_outlier_detector(
        self,
        df: pd.DataFrame, 
        column_name: str, 
        threshold=3
    ) -> Tuple[pd.Series, pd.DataFrame]:
        # nan_policy='omit' keeps missing values from turning every z-score into NaN
        z_scores = np.abs(stats.zscore(df[column_name], nan_policy='omit'))
        outlier_mask = z_scores > threshold
        df_no_outliers = df[~outlier_mask]

        return outlier_mask, df_no_outliers

In [None]:
# Testing the utility functions for processing the data
util = DataAnalysisUtility()  

result_a = util.calculate_statistics(rtha_df,'overall')
print(json.dumps(result_a,indent=3),end='\n\n')

# Test Outlier Detection Function
outlier_mask, no_outlier_df = util.zscore_outlier_detector(rtha_df, column_name='overall')
print(no_outlier_df.shape, rtha_df.shape)

a,b = util.iqr_outlier_detector(rtha_df, 'overall')
print(b.shape, rtha_df.shape)

# TODO: Drama Observation


The distribution of Drama vs. Movie in our dataset is highly unbalanced, so we will focus on Drama only. TODO: plot a graph/visualization to support this point.
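
One possible way to show the imbalance is a bar chart of the number of titles per `type`. The sketch below uses a tiny made-up frame (`toy_df`, an assumption for illustration) rather than the real drama CSV files. Only the `type` column name is taken from this notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (assumption): 8 dramas and 2 movies
toy_df = pd.DataFrame({"type": ["Drama"] * 8 + ["Movie"] * 2})

# Absolute counts and relative shares per type
type_counts = toy_df["type"].value_counts()
type_share = toy_df["type"].value_counts(normalize=True)
print(type_share.round(2))

# Bar chart of the counts
fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(type_counts.index, type_counts.values)
ax.set_xlabel("type")
ax.set_ylabel("number of titles")
ax.set_title("Titles per type (toy data)")
fig.tight_layout()
```

On the real data, `toy_df` would be replaced by the unfiltered drama DataFrame, read before the `type == 'Drama'` filter is applied.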

In [None]:
# If we're putting the emphasis on Drama only, then we have to do filtering
dkor_df = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/kor_drama.csv').iloc[:,1:-2]
# .copy() avoids pandas' SettingWithCopyWarning when new columns are added later
dkor_df = dkor_df[dkor_df['type'] == 'Drama'].copy()
# dkor_df['genres'][0].split(',  ')

def split_geners_tags(df: pd.DataFrame) -> pd.DataFrame:
    # Split on commas rather than whitespace so that any multi-word genre or tag stays intact
    df['genres'] = df['genres'].apply(lambda x: [item.strip() for item in str(x).split(',') if item.strip()])
    df['tags'] = df['tags'].apply(lambda x: [item.strip() for item in str(x).split(',') if item.strip()])
    return df

def custom_rating_generator(df: pd.DataFrame) -> pd.DataFrame:
    df['watched_ratio'] = df['tot_num_user'] / df['tot_watched']
    df['watched_ratio'] = df['watched_ratio'].apply(lambda x: round(x,3))
    df['scored_signif'] = df['tot_user_score'] * df['watched_ratio']
    return df

dkor_df = split_geners_tags(dkor_df)
dkor_df = custom_rating_generator(dkor_df)
dkor_df.head()


# TODO: Data Visualization and Analysis

In [None]:
# TODO: Add a weighted rating based on the number of people who found the review helpful
def review_score_pipeline(
    df: pd.DataFrame,
    features: list = None,  # None (or an empty list) keeps every column
) -> pd.DataFrame:
    
    # 'ep_watched' looks like '<watched> of <total> episodes seen'; missing values become '0 of 0'
    df['ep_watched'] = df['ep_watched'].fillna('0 of 0 episodes seen')
    df['tot_watched'] = df['ep_watched'].apply(lambda x: int(str(x).split(' ')[0]))
    df['tot_ep'] = df['ep_watched'].apply(lambda x: int(str(x).split(' ')[2]))
    df['watched_ratio'] = df['tot_watched'] / df['tot_ep']
    # 0 / 0 yields NaN, which is treated as a ratio of 0.0
    df['watched_ratio'] = df['watched_ratio'].fillna(0.0).apply(lambda x: round(x,3))
    
    # TODO: Weighted Score Rating --> df['weighted_signif'] 
    
    return df if not features else df[features]

def process_review_text(text: str) -> list:
    # TODO: Tokenization, StopwordRemovers, etc.
    pass

def outlier_detection_marker():
    pass
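
For the weighted-rating TODO above, one option is an average in which each review counts in proportion to its helpful votes. This is only a sketch: the helper `helpful_weighted_mean` and the `n_helpful + 1` weighting are assumptions for illustration, not a method already chosen in this project. The `+ 1` keeps reviews with zero helpful votes from being ignored entirely.

```python
import numpy as np
import pandas as pd


def helpful_weighted_mean(
    df: pd.DataFrame,
    score_col: str = "overall",
    weight_col: str = "n_helpful",
) -> float:
    """Average of score_col where each review is weighted by (helpful votes + 1)."""
    # Drop reviews without a score so NaN does not propagate into the average
    valid = df.dropna(subset=[score_col])
    weights = valid[weight_col].fillna(0) + 1
    return float(np.average(valid[score_col], weights=weights))


# Toy reviews (assumption): one helpful high score, one unvoted low score
toy_reviews = pd.DataFrame({
    "title": ["Drama A", "Drama A"],
    "overall": [8.0, 6.0],
    "n_helpful": [3, 0],
})

weighted = helpful_weighted_mean(toy_reviews)
plain = toy_reviews["overall"].mean()
print(f"plain mean = {plain}, helpful-weighted mean = {weighted}")
```

Here the weights are 4 and 1, so the weighted mean is pulled toward the review that more readers found helpful.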

In [None]:
features = ['title','story','acting_cast','music','rewatch_value','overall','text','n_helpful','tot_watched','tot_ep','watched_ratio']

rtha_df = review_score_pipeline(rtha_df, features)
rkor_df = review_score_pipeline(rkor_df, features)
rjap_df = review_score_pipeline(rjap_df, features)

In [None]:
rtha_df.head()

In [None]:
rkor_df.head()

In [None]:
rjap_df.head()

In [None]:
# There are some weird cases where tot_watched is greater than tot_ep
# It is possible that the reviewers rewatched the drama multiple times
# set(rtha_df['watched_ratio'].tolist())
rtha_df[rtha_df['watched_ratio'] > 1.0]

# Statistical Analysis

3. Statistical Analysis:
> - Conduct hypothesis tests or other statistical methods to answer at least two questions you have
about the dataset.
> - Use measures of similarity, probability, and distributions to draw inferences from the data.

<br>

----

### Hypothesis:
#### 1. Is there a significant difference in the average viewer ratings of drama TV shows among these three countries?
* Statistical Method: Analysis of Variance (ANOVA)

* Hypothesis:

> * Null Hypothesis (H0): 
>> The average viewer ratings of drama TV shows are equal across all three countries (Thailand, Japan, and Korea).
> * Alternative Hypothesis (Ha): 
>> The average viewer ratings of drama TV shows differ for at least one pair of countries.

*  Procedure:
Collect viewer rating data for drama TV shows from the three countries. Perform an ANOVA test to compare the means of viewer ratings among the three groups (countries). If the ANOVA test results in a significant p-value, conduct post-hoc tests (e.g., Tukey's HSD) to identify which specific pairs of countries have significantly different viewer ratings.
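
The procedure above can be sketched with SciPy as follows. The three arrays are synthetic ratings generated only for illustration (an assumption, not the review data), the 0.05 significance level is likewise an assumed choice, and `stats.tukey_hsd` requires SciPy 1.8 or newer.

```python
import numpy as np
from scipy import stats

# Synthetic ratings per country (assumption: NOT the real review scores)
rng = np.random.default_rng(0)
thai = rng.normal(loc=7.5, scale=1.0, size=200)
japan = rng.normal(loc=7.6, scale=1.0, size=200)
korea = rng.normal(loc=8.3, scale=1.0, size=200)

alpha = 0.05  # assumed significance level

# One-way ANOVA: H0 says all three group means are equal
anova = stats.f_oneway(thai, japan, korea)
print(f"F = {anova.statistic:.3f}, p = {anova.pvalue:.3g}")

# Post-hoc Tukey HSD only if the ANOVA rejects H0
if anova.pvalue < alpha:
    tukey = stats.tukey_hsd(thai, japan, korea)
    # tukey.pvalue[i, j] is the p-value for group i vs. group j
    # (order: 0 = thai, 1 = japan, 2 = korea)
    print(tukey)
```

ANOVA assumes roughly normal groups with similar variances. If the real ratings violate this badly, a rank-based alternative such as `stats.kruskal` may be worth comparing.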


#### 2. What are the most popular drama genres in each country (Thailand, Japan, and Korea)? How do they compare with the other countries?
* Ex. Thailand is known for Horror, so do Thai Horror dramas receive better ratings than Japanese and Korean dramas in the Horror genre?
* How do we determine which genre is the most popular for each country?
> * Metrics or measurements for genre popularity?
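
Two candidate popularity metrics are the number of titles per genre and the mean user score per genre. The sketch below computes both with `explode` and `groupby` on a toy frame (`toy_dramas`, made up for illustration). It assumes `genres` already holds a list per row and that `tot_user_score` is the rating column, as in the Korean drama frame used in this notebook.

```python
import pandas as pd

# Toy data (assumption): genres already split into lists
toy_dramas = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": [["Romance", "Comedy"], ["Romance"], ["Horror", "Thriller"], ["Horror"]],
    "tot_user_score": [8.0, 7.0, 6.5, 7.5],
})

# One row per (title, genre) pair, then count and average per genre
genre_stats = (
    toy_dramas.explode("genres")
    .groupby("genres")["tot_user_score"]
    .agg(num="count", score="mean")
    .sort_values("num", ascending=False)
)
print(genre_stats)
```

Running the same aggregation on each country's frame would give comparable tables, for example to check whether Horror ranks higher by count or by score in Thailand than in Korea or Japan.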



#### 3. Rewatch Value from the User Reviews?


#### 4. Top 5 Actors from three regions (weighted rating?) (Main Role and Supporting Role)


#### 5. Do Tags give a better interpretation than Genres?


#### 6. Yearly Performance (2021 - 2023)


In [None]:
kor_genres_dict = {}
dkor_df['tot_user_score'] = dkor_df['tot_user_score'].fillna(dkor_df['tot_user_score'].mean())

for index, row in dkor_df.iterrows():
    genres = row['genres']
    for genre in genres:
        if genre not in kor_genres_dict.keys():
            kor_genres_dict[genre] = {
                'num': 1,
                'score': [row['tot_user_score']]
            }
        else:
            kor_genres_dict[genre]['num'] += 1
            kor_genres_dict[genre]['score'].append(row['tot_user_score'])

for key, val in kor_genres_dict.items():
    genre_average = sum(val['score']) / len(val['score'])
    kor_genres_dict[key]['score'] = genre_average
        
kor_genres_dict

In [None]:
# dkor_df is the preprocessed data
kor_genres_dict = Counter(genre for genres in dkor_df['genres'] for genre in genres)
skor_genres_dict = sorted(kor_genres_dict.items(), key=lambda x: x[1], reverse=True)
top_five_genres = dict(skor_genres_dict[:5])

skor_genres_dict = dict(skor_genres_dict)

print(f'{top_five_genres=}',end='\n\n')

print(f'Complete Genres list: {json.dumps(skor_genres_dict,indent=4)}')

In [None]:
# Explore with different group style?
dkor_df_test = pd.read_csv('/kaggle/input/thai-and-japanese-drama-vs-korean-drama-dominance/kor_drama.csv').iloc[:,1:-2]
Counter(dkor_df_test['genres'])