# BTS MV Trends

BTS is a K-pop group that has gained international recognition over the years. Since their debut in 2013, they have released music videos across a variety of genres. This project extracts data on the group's official music videos from YouTube using the YouTube API and examines the trends in that data.

## Data Cleaning

Starting from the JSON file extracted from the YouTube API, we build a dataframe of BTS's official music videos and their relevant video details.

In [1]:
import json
import pandas as pd
import re

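# load the saved API response; the with block closes the file automatically
with open('bts_official_mv_data.json', encoding='utf-8') as file:
    bts_official_mv_data = json.load(file)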

raw_df = pd.json_normalize(bts_official_mv_data['items'])

raw_df.columns

Index(['kind', 'etag', 'id', 'snippet.publishedAt', 'snippet.channelId',
       'snippet.title', 'snippet.description',
       'snippet.thumbnails.default.url', 'snippet.thumbnails.default.width',
       'snippet.thumbnails.default.height', 'snippet.thumbnails.medium.url',
       'snippet.thumbnails.medium.width', 'snippet.thumbnails.medium.height',
       'snippet.thumbnails.high.url', 'snippet.thumbnails.high.width',
       'snippet.thumbnails.high.height', 'snippet.thumbnails.standard.url',
       'snippet.thumbnails.standard.width',
       'snippet.thumbnails.standard.height', 'snippet.thumbnails.maxres.url',
       'snippet.thumbnails.maxres.width', 'snippet.thumbnails.maxres.height',
       'snippet.channelTitle', 'snippet.tags', 'snippet.categoryId',
       'snippet.liveBroadcastContent', 'snippet.defaultLanguage',
       'snippet.localized.title', 'snippet.localized.description',
       'snippet.defaultAudioLanguage', 'contentDetails.duration',
       'contentDetails.dimension'
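
The dotted column names above come from how `pd.json_normalize` flattens nested records. The following is a minimal sketch of that behaviour; the record is a made-up stand-in for one entry of `items` (its values are placeholders, not real API output), and `sample_items` and `sample_df` are hypothetical names.

```python
import pandas as pd

# Hypothetical record shaped like one entry of the 'items' list;
# the values are placeholders, not real data.
sample_items = [
    {
        'id': 'abc123',
        'snippet': {'title': 'Sample title', 'publishedAt': '2020-01-01T00:00:00Z'},
        'statistics': {'viewCount': '100'},
    }
]

sample_df = pd.json_normalize(sample_items)

# Nested keys are joined with '.', so snippet -> title becomes 'snippet.title'
print(list(sample_df.columns))

# Values that arrive as strings stay strings after flattening
print(type(sample_df.loc[0, 'statistics.viewCount']))
```

Because the counts in this sketch are strings, pandas stores them in `object` columns, which is consistent with the `object` dtypes reported by `df.info()` in this notebook.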

In [2]:
# filter dataframe to reflect relevant columns only
relevant_details = ['id', 'snippet.publishedAt', 'snippet.title', 'snippet.description', 'contentDetails.duration', 'statistics.viewCount', 'statistics.likeCount', 'statistics.commentCount']
# .copy() makes df independent of raw_df, so later column assignments are unambiguous
df = raw_df.loc[:, relevant_details].copy()

# rename columns
column_names = ['Video ID', 'Date', 'Title', 'Description', 'Duration', 'Number of Views', 'Number of Likes', 'Number of Comments']
df.columns = column_names

# check for null values and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Video ID            55 non-null     object
 1   Date                55 non-null     object
 2   Title               55 non-null     object
 3   Description         55 non-null     object
 4   Duration            55 non-null     object
 5   Number of Views     55 non-null     object
 6   Number of Likes     51 non-null     object
 7   Number of Comments  55 non-null     object
dtypes: object(8)
memory usage: 3.6+ KB


In [3]:
# check rows containing null values
df[df.isna().any(axis = 1)]

| | Video ID | Date | Title | Description | Duration | Number of Views | Number of Likes | Number of Comments |
|---|---|---|---|---|---|---|---|---|
| 38 | a16gTN7kOWU | 2016-03-11T09:00:01Z | BTS (防弾少年団) 'RUN -Japanese Ver.-' Official MV | 防弾少年団、3月15日発売の日本6thシングル RUN -Japanese Ver.- のM... | PT3M56S | 29981629 | NaN | 27571 |
| 39 | LYAcYSmaLoc | 2015-12-02T06:00:01Z | BTS (防弾少年団) 'I NEED U (Japanese Ver.)' Officia... | 防弾少年団、12月8日発売の日本5thシングル I NEED U (Japanese Ver... | PT3M40S | 48225529 | NaN | 24660 |
| 42 | ULStzgQYrqk | 2015-06-17T03:00:01Z | BTS (防弾少年団) 'FOR YOU' Official MV (Dance Ver.) | 防弾少年団、6月17日発売の日本4thシングル「FOR YOU」のミュージックビデオのダンス... | PT4M46S | 28138590 | NaN | 16225 |
| 43 | TTG6nxwdhyA | 2015-06-05T03:00:00Z | BTS (防弾少年団) 'FOR YOU' Official MV | 防弾少年団、6月17日発売の日本4thシングル「FOR YOU」のミュージックビデオ。\nオ... | PT5M7S | 84566047 | NaN | 70264 |


Some entries have null values because the number of likes for those videos was not made publicly available. Since the 'Number of Likes' column cannot be derived from the other columns, we cannot fill in the missing values. We therefore drop the rows containing null values.

In [4]:
df = df.dropna()

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 54
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Video ID            51 non-null     object
 1   Date                51 non-null     object
 2   Title               51 non-null     object
 3   Description         51 non-null     object
 4   Duration            51 non-null     object
 5   Number of Views     51 non-null     object
 6   Number of Likes     51 non-null     object
 7   Number of Comments  51 non-null     object
dtypes: object(8)
memory usage: 3.6+ KB
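
In this dataset only 'Number of Likes' contains nulls, so a plain `dropna()` is enough. As a hedged aside, the sketch below shows how `dropna(subset=...)` restricts the check to named columns, which can make the intent explicit if other columns ever contain nulls. The toy dataframe and its values are invented for illustration.

```python
import pandas as pd

# Toy dataframe with invented values, not taken from the BTS dataset
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C'],
    'Description': ['x', None, 'z'],
    'Number of Likes': ['10', '20', None],
})

# dropna() removes every row that has a null in any column
any_null_dropped = toy.dropna()

# subset limits the check to the listed columns only
likes_only = toy.dropna(subset=['Number of Likes'])

print(list(any_null_dropped['Title']))
print(list(likes_only['Title']))
```

Row B survives the second call because its only null is in 'Description', which is outside the subset.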


Currently, all the columns have the object data type. We must convert the 'Date' column to a datetime data type, the 'Duration' column to a timedelta data type, and the 'Number of Views', 'Number of Likes', and 'Number of Comments' columns to integer data types.

In [5]:
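# parse ISO 8601 timestamps (e.g. 2016-03-11T09:00:01Z), then sort oldest first
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Date']).reset_index(drop = True)

# parse ISO 8601 durations (e.g. PT3M56S)
df['Duration'] = pd.to_timedelta(df['Duration'])

# counts arrive as strings; errors = 'coerce' turns unparsable values into NaN
for column in ['Number of Views', 'Number of Likes', 'Number of Comments']:
    df[column] = pd.to_numeric(df[column], errors = 'coerce')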

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   Video ID            51 non-null     object             
 1   Date                51 non-null     datetime64[ns, UTC]
 2   Title               51 non-null     object             
 3   Description         51 non-null     object             
 4   Duration            51 non-null     timedelta64[ns]    
 5   Number of Views     51 non-null     int64              
 6   Number of Likes     51 non-null     int64              
 7   Number of Comments  51 non-null     int64              
dtypes: datetime64[ns, UTC](1), int64(3), object(3), timedelta64[ns](1)
memory usage: 3.3+ KB
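
The 'Duration' strings such as `PT3M56S` are ISO 8601 durations. The notebook relies on `pd.to_timedelta` to parse them; the sketch below only illustrates what the strings encode, using the standard library. `parse_iso_duration` is a hypothetical helper and assumes durations contain at most hours, minutes, and whole seconds, which matches the values visible in this dataset.

```python
import re
from datetime import timedelta

# Hypothetical helper: covers only the time part of an ISO 8601 duration
# (hours, minutes, whole seconds), e.g. 'PT3M56S'.
DURATION_PATTERN = re.compile(
    r'^PT(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?$'
)

def parse_iso_duration(text):
    match = DURATION_PATTERN.match(text)
    if match is None:
        raise ValueError(f'unsupported duration: {text!r}')
    # Components that are absent (e.g. no hours) count as zero
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return timedelta(**parts)

print(parse_iso_duration('PT3M56S'))                  # 0:03:56
print(parse_iso_duration('PT5M7S').total_seconds())   # 307.0
```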


Lastly, we remove the artist name and the phrase 'Official MV' from the video titles, since every row is already known to be an official music video by BTS.
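
As a rough sketch of this clean-up, the helper below pulls out the quoted song name and any version suffix with a single regular expression. `extract_title` and `TITLE_PATTERN` are hypothetical names, and the pattern assumes titles shaped like `BTS (...) '<song>' Official MV <optional suffix>`; titles that do not match are returned unchanged rather than raising an error.

```python
import re

# Hypothetical pattern: a quoted song name (straight or curly single quotes),
# followed by 'Official MV' and an optional suffix such as ' (Choreography Version)'.
TITLE_PATTERN = re.compile(r"['‘](?P<song>.+)['’]\s+Official MV(?P<suffix>.*)$")

def extract_title(raw_title):
    match = TITLE_PATTERN.search(raw_title)
    if match is None:
        # Leave titles with an unexpected shape untouched
        return raw_title
    return f"{match.group('song')}{match.group('suffix')}"

print(extract_title("BTS (방탄소년단) 'No More Dream' Official MV"))
print(extract_title("BTS (방탄소년단) 'No More Dream' Official MV (Choreography Version)"))
```

Returning unmatched titles as-is is a design choice for this sketch; a stricter version could raise an error so that unexpected title formats are noticed early.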

In [6]:
df['Title'].values

array(["BTS (방탄소년단) 'No More Dream' Official MV",
       "BTS (방탄소년단) 'No More Dream' Official MV (Choreography Version)",
       "BTS (방탄소년단) 'We Are Bulletproof Pt.2' Official MV",
       "BTS (방탄소년단) 'N.O' Official MV",
       "BTS (방탄소년단) '상남자 (Boy In Luv)' Official MV",
       "BTS (방탄소년단) '상남자 (Boy In Luv)' Official MV (Choreography Version)",
       "BTS (방탄소년단) '하루만 (Just one day)' Official MV",
       "BTS (방탄소년단) 'Danger' Official MV",
       "BTS (방탄소년단) '호르몬전쟁' Official MV",
       "BTS (방탄소년단) 'I NEED U' Official MV",
       "BTS (방탄소년단) 'I NEED U' Official MV (Original ver.)",
       "BTS (방탄소년단) '쩔어' Official MV", "BTS (방탄소년단) 'RUN' Official MV",
       "BTS (방탄소년단) 'EPILOGUE : Young Forever' Official MV",
       "BTS (방탄소년단) '불타오르네 (FIRE)' Official MV",
       "BTS (방탄소년단) '불타오르네 (FIRE)' Official MV (Choreography Version)",
       "BTS (방탄소년단) 'Save ME' Official MV",
       "BTS (방탄소년단) '피 땀 눈물 (Blood Sweat & Tears)' Official MV",
       "BTS (방탄소년단) '봄날 (Spring Day)' O

In [7]:
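for index, raw_title in enumerate(df['Title']):
    # split around 'Official MV'; the part after it holds any version suffix
    split_title = re.split('Official MV', raw_title)
    # the song name sits between straight ('...') or curly (‘...’) single quotes
    title = re.search(r'(?<= \')(.*)(?=\' )|(?<= ‘)(.*)(?=’ )', split_title[0]).group()
    clean_title = f'{title}{split_title[1]}'
    # the index was reset earlier, so the position doubles as the row label
    df.loc[index, 'Title'] = clean_title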

df['Title'].values

array(['No More Dream', 'No More Dream (Choreography Version)',
       'We Are Bulletproof Pt.2', 'N.O', '상남자 (Boy In Luv)',
       '상남자 (Boy In Luv) (Choreography Version)', '하루만 (Just one day)',
       'Danger', '호르몬전쟁', 'I NEED U', 'I NEED U (Original ver.)', '쩔어',
       'RUN', 'EPILOGUE : Young Forever', '불타오르네 (FIRE)',
       '불타오르네 (FIRE) (Choreography Version)', 'Save ME',
       '피 땀 눈물 (Blood Sweat & Tears)', '봄날 (Spring Day)', 'Not Today',
       'Not Today (Choreography Version)', '血、汗、涙 -Japanese Ver.-', 'DNA',
       'MIC Drop -Japanese ver.-', 'MIC Drop (Steve Aoki Remix)',
       'MIC Drop -Japanese ver.-', 'FAKE LOVE',
       'FAKE LOVE (Extended ver.)', 'IDOL', 'IDOL (Feat. Nicki Minaj)',
       'Airplane pt.2 -Japanese ver.-',
       '작은 것들을 위한 시 (Boy With Luv) (feat. Halsey)',
       "작은 것들을 위한 시 (Boy With Luv) (feat. Halsey) ('ARMY With Luv' ver.)",
       'Lights', 'Make It Right (feat. Lauv)',
       'Make It Right (Vertical ver.)', 'ON', 'Black Swan', 'Stay Gold