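Many prior studies cite 'securing diverse content' as a key source of competitiveness in the future OTT industry.

- Source: 국내외 OTT(Over the Top) 서비스 현황 및 콘텐츠 확보 전략 분석 (Analysis of Domestic and International OTT Service Status and Content Acquisition Strategies), 정보통신산업진흥원 (National IT Industry Promotion Agency, NIPA)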


----------



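To explore how a diverse content lineup might be put together, this notebook examines a dataset that carries information for each platform.
- Dataset: Movies on Netflix, Prime Video, Hulu and Disney+
https://www.kaggle.com/ruchi798/movies-on-netflix-prime-video-hulu-and-disney/code?datasetId=669193
- Questions: the relationship between IMDb score and viewership, and viewership by genre
- Hypothesis: the popular genres are the same regardless of platform.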



## 1. Loading the data

In [None]:
import pandas as pd
import numpy as np

# for visualization
import matplotlib.pyplot as plt
import seaborn as sns
from plotly.offline import iplot
import cufflinks as cf
cf.go_offline()

# for data overview
from pandas_profiling import ProfileReport
import plotly.graph_objects as go
import re

fig = go.Figure()

In [None]:
# Streaming-platform movie data (Kaggle)
movie_data = pd.read_csv('MoviesOnStreamingPlatforms.csv', encoding = 'unicode_escape')
# Kaggle data combined with FlixPatrol values
use_df = pd.read_csv('use_df.csv', encoding = 'unicode_escape')
# Top-ranking lists: worldwide, then one file per region
top_ww = pd.read_csv('Values_01worldwide.csv', encoding = 'unicode_escape')
top_ko = pd.read_csv('Values_02SKorea.csv', encoding = 'unicode_escape')
top_hk = pd.read_csv('Values_03Hongkong.csv', encoding = 'unicode_escape')
top_jp = pd.read_csv('Values_04Japan.csv', encoding = 'unicode_escape')
top_th = pd.read_csv('Values_05Taiwan.csv', encoding = 'unicode_escape')  # note: `top_th` holds the Taiwan file
top_us = pd.read_csv('Values_06USA.csv', encoding = 'unicode_escape')
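
If the ranking files are UTF-8 with a byte-order mark (an assumption here, not something the notebook states), decoding them with `unicode_escape` would leave stray characters in the first header and after some titles. A minimal sketch on an in-memory sample, with bytes made up for illustration, shows how `utf-8-sig` plus a whitespace strip behaves:

```python
import io
import pandas as pd

# Hypothetical sample: a UTF-8 BOM, then a title followed by a non-breaking space.
raw = "\ufeffRanking_WW,Title,Points_Worldwide\n1,Enola Holmes\u00a0,16432\n".encode("utf-8")

# 'utf-8-sig' consumes the BOM, so the first header is read cleanly.
sample = pd.read_csv(io.BytesIO(raw), encoding="utf-8-sig")

# str.strip() treats U+00A0 as whitespace, so the trailing NBSP is removed.
sample["Title"] = sample["Title"].str.strip()

print(sample.columns.tolist())
print(sample["Title"].tolist())
```

Whether the same encoding suits `MoviesOnStreamingPlatforms.csv` and `use_df.csv` would need to be checked against the files themselves.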

In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

## 2. Inspecting the data

#### 1. Inspecting movie_data (https://www.kaggle.com/nikhileshkos/recommended-ott-movies-shows-analysis)

In [None]:
movie_data.head(2)

Unnamed: 0.1,Unnamed: 0,ID,Title,Year,Age,IMDb,RottenTomatoes,Netflix,Hulu,PrimeVideo,Disney,Type,Directors,Genres,Country,Language,Runtime
0,0,1,Inception,2010,13+,8.8,0.87,1,0,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0
1,1,2,The Matrix,1999,18+,8.7,0.87,1,0,0,0,0,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0


**Drop the columns that will not be used and define the result as global_data**

In [None]:
global_data = movie_data.drop(['Unnamed: 0','ID','Year','Age','RottenTomatoes','Type','Language','Directors'], axis = 1)
global_data

**Extract only the movies on Netflix (rows where Netflix = 1)** - **3559 in total**

In [None]:
netflix_data = global_data.drop(['Hulu','PrimeVideo','Disney'], axis = 1)

In [None]:
netflix_data = netflix_data.loc[(netflix_data.Netflix >= 1)]
netflix_data

**Intersection of the content on Netflix and the content on the other OTT platforms**: 371 in total

In [None]:
#global_data.loc[(global_data.Hulu >= 1) | (global_data.PrimeVideo >= 1) | (global_data.Disney >= 1)]
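
The commented-out filter above selects titles on any of the other platforms, whether or not they are also on Netflix. A sketch of the intersection, assuming the 0/1 platform flags shown in the `movie_data` preview and using a small made-up frame in place of `global_data`:

```python
import pandas as pd

# Made-up stand-in for global_data with the same 0/1 platform flags.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Netflix": [1, 1, 0, 1],
    "Hulu": [0, 1, 1, 0],
    "PrimeVideo": [0, 0, 1, 1],
    "Disney": [0, 0, 0, 0],
})

on_netflix = toy["Netflix"] == 1
# True when the title is on at least one of the other three platforms.
on_other = toy[["Hulu", "PrimeVideo", "Disney"]].eq(1).any(axis=1)

# Titles that are on Netflix and on some other platform as well.
shared = toy.loc[on_netflix & on_other]
print(shared["Title"].tolist())
```

Applied to `global_data`, `len(shared)` would give the size of the overlap.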

**Check for missing data in the Netflix data**


In [None]:
netflix_data.isnull().sum() 

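- Check the null values in the IMDb and Genres columns, which are the ones to be compared.

- If needed, these null values will be filled in by comparing against the shared data below.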
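
Filling nulls from a second source could be sketched as follows. Both frames are made-up stand-ins, and matching on `Title` alone is an assumption that will mis-assign values when different films share a title.

```python
import pandas as pd

# Made-up stand-ins for netflix_data and the second source.
primary = pd.DataFrame({"Title": ["A", "B", "C"], "IMDb": [7.1, None, None]})
other = pd.DataFrame({"Title": ["B", "C", "D"], "IMDb Score": [6.4, None, 8.0]})

# Build a Title -> score lookup from the second source (first row per title).
lookup = other.drop_duplicates("Title").set_index("Title")["IMDb Score"]

# Fill only the missing IMDb values; existing values are left untouched.
primary["IMDb"] = primary["IMDb"].fillna(primary["Title"].map(lookup))

print(primary)
```

Title "C" stays null because the second source has no score for it either.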


#### 2. Inspecting use_df (Kaggle data + FlixPatrol)


In [None]:
use_df.info()

In [None]:
use_df.head(2)

**Drop every column other than the ones to be used from this data**

In [None]:
IMDB_base = use_df.drop(['Unnamed: 0','COUNTRY','Genre','View Rating','Tags','Country Availability','Languages','Hidden Gem Score', 'Runtime', 'Director', 'Writer', 'Actors','Metacritic Score','Awards Received','Awards Nominated For','Boxoffice','Release Date','Production House','Summary','Rotten Tomatoes Score'], axis = 1)
IMDB_base.head(2)

Unnamed: 0,Title,VALUE,Type,IMDb Score,Netflix Release Date,IMDb Votes
0,365 Days,42149,Movie,3.2,2020-04-02,50125.0
1,Emily in Paris,27138,Series,7.1,2020-10-02,45000.0


**Keep only the rows whose Type is Movie**

In [None]:
# Filter on IMDB_base itself (the original referenced an undefined name `base`)
movie_base = IMDB_base.loc[(IMDB_base.Type == 'Movie')]
movie_base

In [None]:
movie_base.isnull().sum() 

**Merge the Netflix data with movie_base (from use_df)**

In [None]:
df = pd.merge(movie_base, netflix_data, on='Title')
df
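
`pd.merge` defaults to an inner join, so titles present in only one frame are dropped without notice. A sketch of checking how many rows match, using made-up frames and the `indicator` parameter of `pd.merge`:

```python
import pandas as pd

# Made-up stand-ins for movie_base and netflix_data.
left = pd.DataFrame({"Title": ["A", "B", "Only Left"], "VALUE": [3, 2, 1]})
right = pd.DataFrame({"Title": ["A", "B", "Only Right"], "IMDb": [3.2, 6.0, 5.0]})

# An outer join with indicator=True adds a `_merge` column naming each row's origin.
check = pd.merge(left, right, on="Title", how="outer", indicator=True)
print(check["_merge"].value_counts())

# The rows an inner join would keep.
matched = check.loc[check["_merge"] == "both"].drop(columns="_merge")
print(len(matched))
```

A large `left_only` or `right_only` count would suggest that titles are spelled or encoded differently in the two sources.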

In [None]:
df.info()

#### 3. Inspecting the Top Ranking data

In [None]:
top_ww.head(5)

Unnamed: 0,Ranking_WW,Title,Points_Worldwide
0,1,365 Days,42329
1,2,Enola Holmes,16432
2,3,The Christmas Chronicles: Part Two,15916
3,4,Holidate,14849
4,5,The Old Guard,13918


In [None]:
top_ko.head(5)

Unnamed: 0,Ranking_KO,Title,Points_Ko
0,1,365 Days,1215
1,2,Honest Candidate,455
2,3,Howls Moving Castle,406
3,4,Extraction,330
4,5,The Call,310


In [None]:
top_hk.head(5)

Unnamed: 0,Ranking_HK,Title,Points_HK
0,1,365 Days,476
1,2,Extraction,360
2,3,Holidate,268
3,4,Holidate,268
4,5,The Call,258


In [None]:
top_jp.head(5)

Unnamed: 0,Ranking_Jp,Title,Points_JP
0,1,Contagion,350
1,2,Joker,281
2,3,Nihontouitsu Series,271
3,4,A Whisker Away,255
4,5,Akira,225


In [None]:
top_th.head(5)

Unnamed: 0,Ranking_TH,Title,Points_TH
0,1,365 Days,603
1,2,Extraction,456
2,3,6 Underground,357
3,4,Holidate,320
4,5,The Call,291


In [None]:
top_us.head(5)

Unnamed: 0,Ranking_US,Title,Points_US
0,1,The Grinch,567
1,2,365 Days,359
2,3,The Christmas Chronicles: Part Two,260
3,4,The Christmas Chronicles: Part Two,260
4,5,Extraction,223


#### 4. Merging the Top Ranking data

**Merge multiple DataFrames**

In [None]:
import pandas as pd
from functools import reduce

top_merged = [top_ww, top_ko, top_hk, top_jp, top_th, top_us]
df_top = reduce(lambda left, right: pd.merge(left,right,on=['Title'], how='outer'), top_merged)
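
The previews in the previous subsection show some titles listed twice within a single region's ranking (for example in `top_hk` and `top_us`). In a key-based merge, repeated keys multiply rows, so one option is to de-duplicate each frame on `Title` first. A sketch on made-up frames follows; keeping the first occurrence is an assumption about which row is wanted, and `merge_on_title` is a hypothetical helper.

```python
from functools import reduce
import pandas as pd

# Made-up frames in which the title "Y" is repeated.
a = pd.DataFrame({"Title": ["X", "Y", "Y"], "Points_A": [10, 8, 8]})
b = pd.DataFrame({"Title": ["Y", "Y", "Z"], "Points_B": [5, 5, 4]})

def merge_on_title(frames):
    """Hypothetical helper: de-duplicate each frame on Title, then chain outer merges."""
    deduped = [f.drop_duplicates(subset="Title", keep="first") for f in frames]
    return reduce(lambda l, r: pd.merge(l, r, on="Title", how="outer"), deduped)

# Same reduce pattern as the notebook, without de-duplication, for comparison.
naive = reduce(lambda l, r: pd.merge(l, r, on="Title", how="outer"), [a, b])
clean = merge_on_title([a, b])

print(len(naive), len(clean))
```

Comparing the two row counts on the real frames would show whether duplicates inflate `df_top`.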

**The Top lists only go up to rank 101, so keep only the first 102 rows (`iloc[:102]`)**

In [None]:
df_top100 = df_top.iloc[:102]
df_top100

In [None]:
df_top100.info()

## 3. Profiling 

In [None]:
import pandas as pd
from pandas_profiling import ProfileReport

## 4. Visualization
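
One way to start on the hypothesis from the introduction, that popular genres are the same across platforms, is to split the comma-separated `Genres` strings and count titles per genre on each platform. This is a sketch on a made-up frame shaped like `movie_data`; title counts are only a proxy and say nothing about viewership.

```python
import pandas as pd

# Made-up stand-in with the same Genres format and 0/1 platform flags.
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Genres": ["Action,Sci-Fi", "Action", "Comedy,Action"],
    "Netflix": [1, 1, 0],
    "Hulu": [0, 1, 1],
})

# One row per (title, genre) pair.
long_df = toy.assign(Genres=toy["Genres"].str.split(",")).explode("Genres")

# Summing the 0/1 flags per genre gives the number of titles on each platform.
genre_counts = long_df.groupby("Genres")[["Netflix", "Hulu"]].sum()
print(genre_counts)
```

The resulting table can be passed to `genre_counts.plot(kind="bar")` to compare platforms side by side.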