## Data
- YouTube Trending Video Dataset (updated daily), snapshot as of 2022-04-14
- https://www.kaggle.com/datasets/rsrishav/youtube-trending-video-dataset

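## Analysis Goals

- Scope limited to the South Korea (KR) region
- Analyze the factors and criteria by which videos are selected as YouTube trending videos
- Analyze how fields such as title, like/dislike counts, category, and upload date relate to view count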

In [1]:
# Basics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
from datetime import timedelta

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Plot settings (Malgun Gothic is a Windows font; use AppleGothic on macOS)
plt.rcParams['font.family'] = 'Malgun Gothic'
# plt.rcParams['font.family'] = 'AppleGothic'
plt.rcParams['font.size'] = 16
plt.rcParams['figure.figsize'] = 20, 10
plt.rcParams['axes.unicode_minus'] = False

# Data preprocessing utilities
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

# Correlation analysis
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

## Table of Contents

### 1. Data Preprocessing
    (1) video_id
    (2) categoryId & category
    (3) publishedAt & trending_date
    (4) trendTime-publiTime
    (5) Ratio 
    (6) title_length

## 1. Data Preprocessing

In [2]:
df = pd.read_csv("data/KR_youtube_trending_data.csv")
df

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description
0,uq5LClQN3cE,안녕하세요 보겸입니다,2020-08-09T09:32:48Z,UCu9BCtGIEr73LXZsKmoujKw,보겸 BK,24,2020-08-12T00:00:00Z,보겸|bokyem,5947503,53326,105756,139946,https://i.ytimg.com/vi/uq5LClQN3cE/default.jpg,False,False,
1,I-ZbZCHsHD0,부락토스의 계획 [총몇명 프리퀄],2020-08-12T09:00:08Z,UCRuSxVu4iqTK5kCh90ntAgA,총몇명,1,2020-08-12T00:00:00Z,총몇명|재밌는 만화|부락토스|루시퍼|총몇명 프리퀄|총몇명 스토리,963384,28244,494,3339,https://i.ytimg.com/vi/I-ZbZCHsHD0/default.jpg,False,False,"오늘도 정말 감사드립니다!!총몇명 스튜디오 - 총몇명, 십제곱, 5G민, MOVE혁..."
2,9d7jNUjBoss,평생 반성하면서 살겠습니다.,2020-08-10T09:54:13Z,UCMVC92EOs9yDJG5JS-CMesQ,양팡 YangPang,22,2020-08-12T00:00:00Z,양팡|양팡유튜브|팡튜브|가족시트콤|양팡가족|양팡가족시트콤|양팡언니|현실남매|현실자매...,2950885,17974,68898,50688,https://i.ytimg.com/vi/9d7jNUjBoss/default.jpg,False,False,
3,3pI_L3-sMVg,안녕하세요 꽈뚜룹입니다.,2020-08-11T15:00:58Z,UCkQCwnkQfgSuPTTnw_Y7v7w,꽈뚜룹 Quaddurup,24,2020-08-12T00:00:00Z,꽈뚜룹|한국여행기|quaddurup|뚜룹이|korea|southkorea|vlog|...,1743374,36893,1798,8751,https://i.ytimg.com/vi/3pI_L3-sMVg/default.jpg,False,False,앞으로 좀 더 깔끔한 영상제작 약속 드리겠습니다.늘 감사드립니다
4,zrsBjYukE8s,박진영 (J.Y. Park) When We Disco (Duet with 선미) M/V,2020-08-11T09:00:13Z,UCaO6TYtlC8U5ttz62hTrZgg,JYP Entertainment,10,2020-08-12T00:00:00Z,JYP Entertainment|JYP|J.Y.Park|JYPark|박진영|선미|S...,3433885,353337,9763,23405,https://i.ytimg.com/vi/zrsBjYukE8s/default.jpg,False,False,MelOn http://kko.to/TWyXd7zYjSpotify https://s...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119949,33hhNb4mBIc,"🚙 우리 지금 어디가?｜모닝루틴, 스킨케어, 비원츠 아이세럼스틱, 해방촌, OEAT",2022-04-01T09:00:01Z,UC2ukAHT9BdMuyD3iOyDrGDA,문세훈 (Moon Sehoon),22,2022-04-14T00:00:00Z,문세훈|sehoon|세훈|신지연|솔로지옥|지연|데이트|해방촌|이태원|스킨케어|남자스...,1371881,48319,0,3552,https://i.ytimg.com/vi/33hhNb4mBIc/default.jpg,False,False,"안녕하세요 문세훈입니다오늘은 비원츠와 함께하는 저의 스킨케어 루틴,그리고 커피를 한..."
119950,AeeBm4ulKj0,53만원🤑입생로랑🆚️46만원💸아르마니 어드벤트 캘린더 비교 언빡싱,2022-04-03T10:00:09Z,UCnekLiljel-Px4ClMC7b3mg,회사원A,26,2022-04-14T00:00:00Z,[None],337342,5622,0,211,https://i.ytimg.com/vi/AeeBm4ulKj0/default.jpg,False,False,"이 영상은 광고 계약사항 없는 오리지널 컨텐츠입니다.아르마니 뷰티 어드벤트캘린더, ..."
119951,6kaHgbQLbDo,구멍난 벽에 미로 만들기,2022-04-02T03:24:06Z,UCd4FmcWIVdWAy0-Q8OJBloQ,사나고 Sanago,26,2022-04-14T00:00:00Z,사나고|3D펜|3Dpen|만들기|making|3d프린터|3Dprinting,472724,13307,0,792,https://i.ytimg.com/vi/6kaHgbQLbDo/default.jpg,False,False,● 3D펜 사나고 샵https://smartstore.naver.com/sanago...
119952,VCeMOtwdDps,[4K] NMIXX(엔믹스) - “Kill This Love(by 블랙핑크)” Ba...,2022-04-01T08:00:10Z,UCB9e3pof1o83aa0kkaoeJGA,it's Live,24,2022-04-14T00:00:00Z,NMIXX|엔믹스|블랙핑크|BLACKPINK|Kill This Love|Kill T...,2602749,230222,0,12793,https://i.ytimg.com/vi/VCeMOtwdDps/default.jpg,False,False,JYP 신인 걸그룹 NMIXX(엔믹스)가 잇츠라이브에 떴다!!😋카리스마 넘치는 랩과...


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119954 entries, 0 to 119953
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   video_id           119954 non-null  object
 1   title              119954 non-null  object
 2   publishedAt        119954 non-null  object
 3   channelId          119954 non-null  object
 4   channelTitle       119954 non-null  object
 5   categoryId         119954 non-null  int64 
 6   trending_date      119954 non-null  object
 7   tags               119954 non-null  object
 8   view_count         119954 non-null  int64 
 9   likes              119954 non-null  int64 
 10  dislikes           119954 non-null  int64 
 11  comment_count      119954 non-null  int64 
 12  thumbnail_link     119954 non-null  object
 13  comments_disabled  119954 non-null  bool  
 14  ratings_disabled   119954 non-null  bool  
 15  description        116558 non-null  object
dtypes: bool(2), int64(5), object(9)

- The description column contains missing values
- description is one of the columns not used in the analysis, so it is dropped

In [4]:
# Drop the columns that are not used in the analysis
df = df.drop(columns=["description", "thumbnail_link", "comments_disabled", "ratings_disabled"])

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119954 entries, 0 to 119953
Data columns (total 12 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   video_id       119954 non-null  object
 1   title          119954 non-null  object
 2   publishedAt    119954 non-null  object
 3   channelId      119954 non-null  object
 4   channelTitle   119954 non-null  object
 5   categoryId     119954 non-null  int64 
 6   trending_date  119954 non-null  object
 7   tags           119954 non-null  object
 8   view_count     119954 non-null  int64 
 9   likes          119954 non-null  int64 
 10  dislikes       119954 non-null  int64 
 11  comment_count  119954 non-null  int64 
dtypes: int64(5), object(7)
memory usage: 11.0+ MB


### (1) video_id



In [6]:
df["video_id"].duplicated().value_counts()

True     104809
False     15145
Name: video_id, dtype: int64

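- Many video_id values are duplicated
- This happens when the same video stays on the trending list for several consecutive days
- The dataset records one row per video for each trending_date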

In [7]:
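# Keep the first row per video_id; .copy() avoids chained-assignment issues later
df = df.drop_duplicates(subset="video_id", keep="first").copy()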

In [8]:
df["video_id"].duplicated().value_counts()

False    15145
Name: video_id, dtype: int64

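- Only one row per video_id remains
- drop_duplicates keeps the first occurrence, so each video retains only the date it was first selected as trending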
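Keeping the first occurrence yields the first trending date only if the rows are already ordered by trending_date. The following is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical) that sorts explicitly before dropping duplicates, so the result does not depend on the order of the CSV file.

```python
import pandas as pd

# Toy data (hypothetical values): video "a" trends on two days, listed out of order.
toy = pd.DataFrame({
    "video_id": ["a", "b", "a"],
    "trending_date": [
        "2020-08-13T00:00:00Z",
        "2020-08-12T00:00:00Z",
        "2020-08-12T00:00:00Z",
    ],
    "view_count": [200, 50, 100],
})

# ISO 8601 strings sort chronologically, so sorting the raw strings is enough here.
# A stable sort preserves the original order of rows that share a trending_date.
first_trend = (
    toy.sort_values("trending_date", kind="stable")
       .drop_duplicates(subset="video_id", keep="first")
)
print(first_trend)
```

With this ordering, the row kept for video "a" is the one from its earliest trending day.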

### (2) categoryId & category

- categoryId values are numeric codes
- A category column with the text label is added using a JSON file
- Source: https://www.kaggle.com/code/yontodd/us-youtube-eda-when-how-often-which-category/data

In [9]:
id_to_category = {}

with open("data/category_id.json","r") as f:
    id_data = json.load(f)
    for category in id_data["items"]:
        id_to_category[category["id"]] = category["snippet"]["title"]

id_to_category

{'1': 'Film & Animation',
 '2': 'Autos & Vehicles',
 '10': 'Music',
 '15': 'Pets & Animals',
 '17': 'Sports',
 '18': 'Short Movies',
 '19': 'Travel & Events',
 '20': 'Gaming',
 '21': 'Videoblogging',
 '22': 'People & Blogs',
 '23': 'Comedy',
 '24': 'Entertainment',
 '25': 'News & Politics',
 '26': 'Howto & Style',
 '27': 'Education',
 '28': 'Science & Technology',
 '30': 'Movies',
 '31': 'Anime/Animation',
 '32': 'Action/Adventure',
 '33': 'Classics',
 '34': 'Comedy',
 '35': 'Documentary',
 '36': 'Drama',
 '37': 'Family',
 '38': 'Foreign',
 '39': 'Horror',
 '40': 'Sci-Fi/Fantasy',
 '41': 'Thriller',
 '42': 'Shorts',
 '43': 'Shows',
 '44': 'Trailers'}

In [10]:
type(df["categoryId"][0])

numpy.int64

In [11]:
df["categoryId"] = df["categoryId"].astype(str)

In [12]:
df.insert(4, "category",df["categoryId"].map(id_to_category))

In [13]:
df.isnull().sum()

video_id          0
title             0
publishedAt       0
channelId         0
category         27
channelTitle      0
categoryId        0
trending_date     0
tags              0
view_count        0
likes             0
dislikes          0
comment_count     0
dtype: int64

In [14]:
df.loc[df["category"].isnull(), "categoryId"]

17        29
1792      29
2066      29
5359      29
8363      29
15226     29
16769     29
21422     29
23171     29
23368     29
24557     29
26607     29
27193     29
28205     29
31585     29
52586     29
79258     29
79364     29
80156     29
80358     29
80775     29
87762     29
96580     29
98961     29
102786    29
103383    29
104165    29
Name: categoryId, dtype: object

- The missing values arise because the JSON file has no entry for categoryId 29
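The way unmapped ids turn into missing values can be reproduced on a small scale. This is a sketch only: `id_to_category_demo` is a hypothetical two-entry subset of the mapping, and the ids are sample values.

```python
import pandas as pd

# Hypothetical subset of the id -> title mapping loaded from the JSON file.
id_to_category_demo = {"10": "Music", "24": "Entertainment"}

category_ids = pd.Series([10, 24, 29])

# Cast to str so the values match the JSON keys, then map.
# Ids without an entry in the mapping become NaN.
mapped = category_ids.astype(str).map(id_to_category_demo)

# List the ids that have no entry in the mapping.
missing_ids = sorted(set(category_ids.astype(str)) - set(id_to_category_demo))
print(mapped.tolist())
print(missing_ids)
```

Checking `missing_ids` before mapping makes it clear which codes need a manual label.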

In [15]:
# Check which channels posted the trending videos with categoryId 29
df.loc[df["categoryId"] == "29", "channelTitle"]

17                    행정안전부
1792            서울시 · Seoul
2066                   조선일보
5359          법륜스님의 희망세상만들기
8363                    충주시
15226                 국가보훈처
16769              대한민국 병무청
21422                   환경부
23171     어슬렁 어슬렁 아프리카 벌써5년
23368     어슬렁 어슬렁 아프리카 벌써5년
24557           서울시 · Seoul
26607           서울시 · Seoul
27193            법륜스님의 즉문즉설
28205     CalBap-캘리포니아 건강밥상
31585                   안깨남
52586            법륜스님의 즉문즉설
79258        United Nations
79364        United Nations
80156        United Nations
80358        Global Citizen
80775        Global Citizen
87762               농림축산식품부
96580     CalBap-캘리포니아 건강밥상
98961                  부산일보
102786                  충주시
103383                  안깨남
104165                  충주시
Name: channelTitle, dtype: object

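- The channels with categoryId 29 turn out to be mainly nonprofit organizations and public institutions (in the YouTube Data API, category 29 is "Nonprofits & Activism")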

In [16]:
# Fill the missing category values with "Nonprofits"
df.loc[df["category"].isnull(), "category"] = "Nonprofits"

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15145 entries, 0 to 119807
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   video_id       15145 non-null  object
 1   title          15145 non-null  object
 2   publishedAt    15145 non-null  object
 3   channelId      15145 non-null  object
 4   category       15145 non-null  object
 5   channelTitle   15145 non-null  object
 6   categoryId     15145 non-null  object
 7   trending_date  15145 non-null  object
 8   tags           15145 non-null  object
 9   view_count     15145 non-null  int64 
 10  likes          15145 non-null  int64 
 11  dislikes       15145 non-null  int64 
 12  comment_count  15145 non-null  int64 
dtypes: int64(4), object(9)
memory usage: 2.1+ MB


### (3) publishedAt & trending_date

- To analyze the data by upload date (publishedAt) and trending date (trending_date), both columns must be converted to datetime

In [18]:
type(df["trending_date"][0])

str

In [19]:
type(df["publishedAt"][0])

str

In [20]:
df["trending_date"]

0         2020-08-12T00:00:00Z
1         2020-08-12T00:00:00Z
2         2020-08-12T00:00:00Z
3         2020-08-12T00:00:00Z
4         2020-08-12T00:00:00Z
                  ...         
119783    2022-04-14T00:00:00Z
119784    2022-04-14T00:00:00Z
119786    2022-04-14T00:00:00Z
119787    2022-04-14T00:00:00Z
119807    2022-04-14T00:00:00Z
Name: trending_date, Length: 15145, dtype: object

In [21]:
df["publishedAt"]

0         2020-08-09T09:32:48Z
1         2020-08-12T09:00:08Z
2         2020-08-10T09:54:13Z
3         2020-08-11T15:00:58Z
4         2020-08-11T09:00:13Z
                  ...         
119783    2022-04-13T03:00:14Z
119784    2022-04-13T09:00:14Z
119786    2022-04-13T09:00:09Z
119787    2022-04-13T08:00:11Z
119807    2022-04-10T14:33:50Z
Name: publishedAt, Length: 15145, dtype: object

In [22]:
df["trending_date"] = pd.to_datetime(df["trending_date"],format="%Y-%m-%dT%H:%M:%SZ")

In [23]:
df["publishedAt"] = pd.to_datetime(df["publishedAt"],format="%Y-%m-%dT%H:%M:%SZ")
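The trailing `Z` in these timestamps denotes UTC, and parsing with an explicit format string that treats `Z` as a literal produces timezone-naive values. Since the analysis targets Korea, local upload hours may be of interest. The sketch below is an optional variant rather than the notebook's method: it parses two sample publishedAt values as UTC and converts them to Korea Standard Time.

```python
import pandas as pd

# Two publishedAt values taken from the sample rows of the dataset.
stamps = pd.Series(["2020-08-09T09:32:48Z", "2020-08-11T15:00:58Z"])

# utc=True yields timezone-aware timestamps in UTC.
utc_times = pd.to_datetime(stamps, utc=True)

# Convert to Korea Standard Time (UTC+9) to read upload hours in local time.
kst_times = utc_times.dt.tz_convert("Asia/Seoul")
print(kst_times.dt.hour.tolist())
```

If this variant were used, both date columns would need the same treatment before one is subtracted from the other.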

### (4) trendTime-publiTime

- Time elapsed from upload until the video was selected as trending (trending_date - publishedAt, plus a one-day offset)

In [24]:
df["trendTime-publiTime"] = df["trending_date"] - df["publishedAt"] + timedelta(days = 1)

In [25]:
# pd.set_option('display.max_rows', None)  # uncomment to display all rows
df["trendTime-publiTime"].value_counts()

1 days 14:59:48    93
1 days 14:59:50    85
1 days 14:59:59    81
1 days 14:59:49    72
1 days 14:59:52    68
                   ..
2 days 12:57:07     1
1 days 20:30:55     1
2 days 08:11:54     1
1 days 12:08:59     1
4 days 09:26:10     1
Name: trendTime-publiTime, Length: 8982, dtype: int64
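Timedelta values are hard to plot or correlate directly, so a numeric version is often more convenient. The following sketch converts a few of the values shown above into hours; the name `delay_hours` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Sample values taken from the value_counts() output.
delays = pd.Series(
    pd.to_timedelta(["1 days 14:59:48", "2 days 12:57:07", "4 days 09:26:10"])
)

# total_seconds() includes the day component, unlike .dt.seconds,
# which returns only the seconds within the last partial day.
delay_hours = delays.dt.total_seconds() / 3600
print(delay_hours.round(2).tolist())
```

A float column like this can be passed directly to histogram or correlation functions.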

### (5) Ratio 

- likes/view_count - ratio of likes to views
- dislikes/view_count - ratio of dislikes to views
- comment_count/view_count - ratio of comments to views
- dislikes/likes - ratio of dislikes to likes

In [26]:
df["likes/view_count"] = df["likes"] / df["view_count"]
df["dislikes/view_count"] = df["dislikes"] / df["view_count"]
df["comment_count/view_count"] = df["comment_count"] / df["view_count"]
df["dislikes/likes"] = df["dislikes"] / df["likes"]
# Videos with dislikes but zero likes give inf; set those ratios to 0
df.loc[df["dislikes/likes"] == np.inf, "dislikes/likes"] = 0

In [27]:
df.isnull().sum()

video_id                      0
title                         0
publishedAt                   0
channelId                     0
category                      0
channelTitle                  0
categoryId                    0
trending_date                 0
tags                          0
view_count                    0
likes                         0
dislikes                      0
comment_count                 0
trendTime-publiTime           0
likes/view_count              1
dislikes/view_count           1
comment_count/view_count      1
dislikes/likes              234
dtype: int64

In [28]:
# Fill the remaining NaN ratios (0/0 divisions) with 0
ratio_cols = ["likes/view_count", "dislikes/view_count", "comment_count/view_count", "dislikes/likes"]
df[ratio_cols] = df[ratio_cols].fillna(0)

In [29]:
df.isnull().sum()

video_id                    0
title                       0
publishedAt                 0
channelId                   0
category                    0
channelTitle                0
categoryId                  0
trending_date               0
tags                        0
view_count                  0
likes                       0
dislikes                    0
comment_count               0
trendTime-publiTime         0
likes/view_count            0
dislikes/view_count         0
comment_count/view_count    0
dislikes/likes              0
dtype: int64

### (6) title_length

- Title length, measured in characters

In [30]:
# Title length in characters; a missing title counts as 0
df["title_length"] = df["title"].fillna("").astype(str).str.len()

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15145 entries, 0 to 119807
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype          
---  ------                    --------------  -----          
 0   video_id                  15145 non-null  object         
 1   title                     15145 non-null  object         
 2   publishedAt               15145 non-null  datetime64[ns] 
 3   channelId                 15145 non-null  object         
 4   category                  15145 non-null  object         
 5   channelTitle              15145 non-null  object         
 6   categoryId                15145 non-null  object         
 7   trending_date             15145 non-null  datetime64[ns] 
 8   tags                      15145 non-null  object         
 9   view_count                15145 non-null  int64          
 10  likes                     15145 non-null  int64          
 11  dislikes                  15145 non-null  int64          
 12  com

In [32]:
df.describe()

Unnamed: 0,view_count,likes,dislikes,comment_count,trendTime-publiTime,likes/view_count,dislikes/view_count,comment_count/view_count,dislikes/likes,title_length
count,15145.0,15145.0,15145.0,15145.0,15145,15145.0,15145.0,15145.0,15145.0,15145.0
mean,779572.3,48991.22,564.087091,5484.379,1 days 20:39:58.162826015,inf,inf,inf,0.029541,44.403103
std,2214909.0,237060.9,5419.912591,54408.0,0 days 22:27:51.001315065,,,,0.169203,22.556269
min,0.0,0.0,0.0,0.0,0 days 11:47:11,0.0,0.0,0.0,0.0,1.0
25%,176926.0,3710.0,37.0,400.0,1 days 12:00:18,0.01385256,0.0001907997,0.001454364,0.004536,27.0
50%,350025.0,7586.0,106.0,944.0,1 days 14:59:43,0.02288198,0.0003446598,0.002848722,0.015363,41.0
75%,723371.0,17905.0,247.0,2299.0,1 days 20:29:58,0.04078821,0.0005973853,0.00525006,0.029262,58.0
max,76805030.0,7110450.0,405428.0,3400571.0,22 days 13:55:07,inf,inf,inf,7.535216,100.0
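The `inf` entries in the mean and max rows of the three view_count ratios come from rows where view_count is 0 (the min row shows 0.0) while the numerator is positive, and a single `inf` also makes the standard deviation undefined. The sketch below, on made-up numbers, shows one way to treat such divisions as missing before filling them. Filling with 0 mirrors what the notebook does for NaN; extending it to `inf` is an assumption, not the author's stated choice.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one normal row, one x/0 row, one 0/0 row.
toy = pd.DataFrame({"likes": [10, 5, 0], "view_count": [100, 0, 0]})

# Plain division: 5/0 gives inf and 0/0 gives NaN.
raw = toy["likes"] / toy["view_count"]

# Treat both cases as missing, then fill with 0.
safe = raw.replace([np.inf, -np.inf], np.nan).fillna(0)
print(raw.tolist())
print(safe.tolist())
```

After this cleanup, summary statistics such as the mean and standard deviation are finite again.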
