# 1. Introduction


#### The purpose of this section is to develop the data processing steps that prepare the dataset for subsequent analysis.

#### The data processing steps are developed with the following goals:
- Each column has the correct data type
- The dataset contains newly created features
- The dataset is cleaned and indexed by a unique id
- Dataframe details can be retrieved through an API


# 2. Data Collection

#### The CSV and JSON files for 9 countries were downloaded from the following link and imported into dataframes:
#### https://www.kaggle.com/datasnaek/youtube-new


#### The CSV files contain the video records.
#### The JSON files contain the video category information.

In [1]:
import pandas as pd
import numpy as np
import json

#### 2.1 Importing the CSV files of the 9 countries into dataframes

In [2]:
ca_yet_df = pd.read_csv('D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/CAvideos.csv')
de_yet_df= pd.read_csv('D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/DEvideos.csv')
fr_yet_df= pd.read_csv('D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/FRvideos.csv')
gb_yet_df= pd.read_csv('D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/GBvideos.csv')
in_yet_df= pd.read_csv('D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/INvideos.csv')
us_yet_df= pd.read_csv('D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/USvideos.csv')
jp_yet_df= pd.read_csv('D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/JPvideos.csv', encoding= "ISO-8859-1")
kr_yet_df= pd.read_csv('D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/KRvideos.csv', encoding= "ISO-8859-1")
mx_yet_df= pd.read_csv('D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/MXvideos.csv', encoding= "ISO-8859-1")
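The nine `read_csv` calls differ only in the country code and the encoding, so they can also be written as a loop. The following is a minimal sketch, not the notebook's own code: `DATA_DIR`, `ENCODINGS` and `load_country_frames` are hypothetical names, and the per-country encodings simply mirror the calls above.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location of the downloaded Kaggle files; adjust to your machine.
DATA_DIR = Path("youtube-new")

# Encoding per country, mirroring the calls above (None = pandas default).
ENCODINGS = {
    "CA": None, "DE": None, "FR": None, "GB": None, "IN": None, "US": None,
    "JP": "ISO-8859-1", "KR": "ISO-8859-1", "MX": "ISO-8859-1",
}


def load_country_frames(data_dir, encodings):
    """Read '<CODE>videos.csv' for every country code and return {code: DataFrame}."""
    frames = {}
    for code, encoding in encodings.items():
        path = Path(data_dir) / f"{code}videos.csv"
        frames[code] = pd.read_csv(path, encoding=encoding)
    return frames
```

With the files in place, `frames = load_country_frames(DATA_DIR, ENCODINGS)` would give `frames["CA"]`, `frames["DE"]` and so on in place of the nine separate variables.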


# 3. Initial Dataset Exploration

#### 3.1 First, explore the data set:


In [3]:
ca_yet_df.info()
# description has 1296 null values

ca_yet_df.shape
# (40881, 16)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40881 entries, 0 to 40880
Data columns (total 16 columns):
video_id                  40881 non-null object
trending_date             40881 non-null object
title                     40881 non-null object
channel_title             40881 non-null object
category_id               40881 non-null int64
publish_time              40881 non-null object
tags                      40881 non-null object
views                     40881 non-null int64
likes                     40881 non-null int64
dislikes                  40881 non-null int64
comment_count             40881 non-null int64
thumbnail_link            40881 non-null object
comments_disabled         40881 non-null bool
ratings_disabled          40881 non-null bool
video_error_or_removed    40881 non-null bool
description               39585 non-null object
dtypes: bool(3), int64(5), object(8)
memory usage: 4.2+ MB


(40881, 16)

#### There are 40881 rows and 16 features in the Canada dataframe

In [4]:
ca_yet_df.head(3)

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,n1WpP7iowLc,17.14.11,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,10,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579,787425,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé i...
1,0dBIkQ4Mz1M,17.14.11,PLUSH - Bad Unboxing Fan Mail,iDubbbzTV,23,2017-11-13T17:00:00.000Z,"plush|""bad unboxing""|""unboxing""|""fan mail""|""id...",1014651,127794,1688,13030,https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg,False,False,False,STill got a lot of packages. Probably will las...
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146035,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...


#### 3.2 Inserting country column in each dataframe
#### The "country" column is added and filled with the corresponding country code to help identify the origin of each video record.

In [5]:
ca_yet_df['country']='CA'
de_yet_df['country']='DE'
fr_yet_df['country']='FR'
gb_yet_df['country']='GB'
in_yet_df['country']='IN'
us_yet_df['country']='US'
jp_yet_df['country']='JP'
kr_yet_df['country']='KR'
mx_yet_df['country']='MX'

#### 3.3 Identifying the category by mapping the JSON file:

#### The category id of each dataframe is mapped to the category name stored in the corresponding JSON file. Each country has its own JSON file containing the category information.

#### Getting category of CANADA dataframe

In [6]:
ca_yet_df['category_id']= ca_yet_df['category_id'].astype(str)
ca_category_id ={}
with open('D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/CA_category_id.json', 'r') as f:
    data = json.load(f)
    for ca_category in data['items']:
        ca_category_id[ca_category['id']] = ca_category['snippet']['title']

ca_yet_df.insert(4, 'category', ca_yet_df['category_id'].map(ca_category_id))
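Since the same mapping step is repeated for every country, it can be factored into two small helpers. This is a sketch under assumptions: `load_category_map` and `add_category_column` are hypothetical names, and the JSON layout (`items` → `id`, `snippet.title`) is taken from the cell above. Ids that are absent from the JSON file map to NaN, which may be one reason the "category" column ends up with missing values.

```python
import json


def load_category_map(json_path):
    """Return {category_id (str): category title} from a *_category_id.json file."""
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return {item["id"]: item["snippet"]["title"] for item in data["items"]}


def add_category_column(df, category_map, position=4):
    """Return a copy of df with a 'category' column inserted at `position`.

    'category_id' is cast to str first because the JSON ids are strings.
    """
    out = df.copy()
    out["category_id"] = out["category_id"].astype(str)
    out.insert(position, "category", out["category_id"].map(category_map))
    return out
```

For example, `ca_yet_df = add_category_column(ca_yet_df, load_category_map(path_to_ca_json))`, where `path_to_ca_json` stands for the path used above.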

In [7]:
ca_yet_df.head(4)

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country
0,n1WpP7iowLc,17.14.11,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,Music,10,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579,787425,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé i...,CA
1,0dBIkQ4Mz1M,17.14.11,PLUSH - Bad Unboxing Fan Mail,iDubbbzTV,Comedy,23,2017-11-13T17:00:00.000Z,"plush|""bad unboxing""|""unboxing""|""fan mail""|""id...",1014651,127794,1688,13030,https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg,False,False,False,STill got a lot of packages. Probably will las...,CA
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,Comedy,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146035,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...,CA
3,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,Entertainment,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095828,132239,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...,CA


#### Getting category of US dataframe

In [8]:
us_yet_df['category_id']= us_yet_df['category_id'].astype(str)
us_category_id ={}
with open('D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/US_category_id.json', 'r') as f:
    data = json.load(f)
    for us_category in data['items']:
        us_category_id[us_category['id']] = us_category['snippet']['title']

us_yet_df.insert(4, 'category', us_yet_df['category_id'].map(us_category_id))

In [9]:
us_yet_df.head(3)

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,People & Blogs,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...,US
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,Entertainment,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John...",US
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,Comedy,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...,US


#### Getting category of GB dataframe

In [10]:
gb_yet_df['category_id']= gb_yet_df['category_id'].astype(str)
gb_category_id ={}
with open('D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/GB_category_id.json', 'r') as f:
    data = json.load(f)
    for gb_category in data['items']:
        gb_category_id[gb_category['id']] = gb_category['snippet']['title']

gb_yet_df.insert(4, 'category', gb_yet_df['category_id'].map(gb_category_id))


In [11]:
gb_yet_df.head(3)

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country
0,Jw1Y-zhQURU,17.14.11,John Lewis Christmas Ad 2017 - #MozTheMonster,John Lewis,Howto & Style,26,2017-11-10T07:38:29.000Z,"christmas|""john lewis christmas""|""john lewis""|...",7224515,55681,10247,9479,https://i.ytimg.com/vi/Jw1Y-zhQURU/default.jpg,False,False,False,Click here to continue the story and make your...,GB
1,3s1rvMFUweQ,17.14.11,Taylor Swift: …Ready for It? (Live) - SNL,Saturday Night Live,Entertainment,24,2017-11-12T06:24:44.000Z,"SNL|""Saturday Night Live""|""SNL Season 43""|""Epi...",1053632,25561,2294,2757,https://i.ytimg.com/vi/3s1rvMFUweQ/default.jpg,False,False,False,Musical guest Taylor Swift performs …Ready for...,GB
2,n1WpP7iowLc,17.14.11,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,Music,10,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579,787420,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé i...,GB


#### Getting category of GERMANY dataframe

In [12]:
de_yet_df['category_id']= de_yet_df['category_id'].astype(str)
de_category_id ={}
with open('D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/DE_category_id.json', 'r') as f:
    data = json.load(f)
    for de_category in data['items']:
        de_category_id[de_category['id']] = de_category['snippet']['title']

de_yet_df.insert(4, 'category', de_yet_df['category_id'].map(de_category_id))


In [13]:
de_yet_df.head(3)

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country
0,LgVi6y5QIjM,17.14.11,Sing zu Ende! | Gesangseinlagen vom Feinsten |...,inscope21,Entertainment,24,2017-11-13T17:08:49.000Z,"inscope21|""sing zu ende""|""gesangseinlagen""|""ge...",252786,35885,230,1539,https://i.ytimg.com/vi/LgVi6y5QIjM/default.jpg,False,False,False,Heute gibt es mal wieder ein neues Format... w...,DE
1,Bayt7uQith4,17.14.11,Kinder ferngesteuert im Kiosk! Erwachsene abzo...,LUKE! Die Woche und ich,Comedy,23,2017-11-12T22:30:01.000Z,"Kinder|""ferngesteuert""|""Kinder ferngesteuert""|...",797196,53576,302,1278,https://i.ytimg.com/vi/Bayt7uQith4/default.jpg,False,False,False,Kinder ferngesteuert! Kinder lassen sich sooo ...,DE
2,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,Entertainment,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97190,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John...",DE


#### Getting category of FRANCE dataframe

In [14]:
fr_yet_df['category_id']= fr_yet_df['category_id'].astype(str)
fr_category_id ={}
with open('D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/FR_category_id.json', 'r') as f:
    data = json.load(f)
    for fr_category in data['items']:
        fr_category_id[fr_category['id']] = fr_category['snippet']['title']

fr_yet_df.insert(4, 'category', fr_yet_df['category_id'].map(fr_category_id))


In [15]:
fr_yet_df.head(3)

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country
0,Ro6eob0LrCY,17.14.11,Malika LePen : Femme de Gauche - Trailer,Le Raptor Dissident,Entertainment,24,2017-11-13T17:32:55.000Z,"Raptor""|""Dissident""|""Expliquez""|""moi""|""cette""|...",212702,29282,1108,3817,https://i.ytimg.com/vi/Ro6eob0LrCY/default.jpg,False,False,False,Dimanche.\n18h30.\nSoyez présents pour la vidé...,FR
1,Yo84eqYwP98,17.14.11,"LA PIRE PARTIE ft Le Rire Jaune, Pierre Croce,...",Le Labo,Entertainment,24,2017-11-12T15:00:02.000Z,[none],432721,14053,576,1161,https://i.ytimg.com/vi/Yo84eqYwP98/default.jpg,False,False,False,Le jeu de société: https://goo.gl/hhG1Ta\n\nGa...,FR
2,ceqntSXE-10,17.14.11,DESSINS ANIMÉS FRANÇAIS VS RUSSES 2 - Daniil...,Daniil le Russe,Comedy,23,2017-11-13T17:00:38.000Z,"cartoon""|""pokémon""|""école""|""ours""|""мультфильм",482153,76203,477,9580,https://i.ytimg.com/vi/ceqntSXE-10/default.jpg,False,False,False,Une nouvelle dose de dessins animés français e...,FR


#### Getting category of INDIA dataframe

In [16]:
in_yet_df['category_id']= in_yet_df['category_id'].astype(str)
in_category_id ={}
with open('D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/IN_category_id.json', 'r') as f:
    data = json.load(f)
    for in_category in data['items']:
        in_category_id[in_category['id']] = in_category['snippet']['title']

in_yet_df.insert(4, 'category', in_yet_df['category_id'].map(in_category_id))


In [17]:
in_yet_df.head()

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country
0,kzwfHumJyYc,17.14.11,Sharry Mann: Cute Munda ( Song Teaser) | Parmi...,Lokdhun Punjabi,Film & Animation,1,2017-11-12T12:20:39.000Z,"sharry mann|""sharry mann new song""|""sharry man...",1096327,33966,798,882,https://i.ytimg.com/vi/kzwfHumJyYc/default.jpg,False,False,False,Presenting Sharry Mann latest Punjabi Song Cu...,IN
1,zUZ1z7FwLc8,17.14.11,"पीरियड्स के समय, पेट पर पति करता ऐसा, देखकर दं...",HJ NEWS,News & Politics,25,2017-11-13T05:43:56.000Z,"पीरियड्स के समय|""पेट पर पति करता ऐसा""|""देखकर द...",590101,735,904,0,https://i.ytimg.com/vi/zUZ1z7FwLc8/default.jpg,True,False,False,"पीरियड्स के समय, पेट पर पति करता ऐसा, देखकर दं...",IN
2,10L1hZ9qa58,17.14.11,Stylish Star Allu Arjun @ ChaySam Wedding Rece...,TFPC,Entertainment,24,2017-11-12T15:48:08.000Z,Stylish Star Allu Arjun @ ChaySam Wedding Rece...,473988,2011,243,149,https://i.ytimg.com/vi/10L1hZ9qa58/default.jpg,False,False,False,Watch Stylish Star Allu Arjun @ ChaySam Weddin...,IN
3,N1vE8iiEg64,17.14.11,Eruma Saani | Tamil vs English,Eruma Saani,Comedy,23,2017-11-12T07:08:48.000Z,"Eruma Saani|""Tamil Comedy Videos""|""Films""|""Mov...",1242680,70353,1624,2684,https://i.ytimg.com/vi/N1vE8iiEg64/default.jpg,False,False,False,This video showcases the difference between pe...,IN
4,kJzGH0PVQHQ,17.14.11,why Samantha became EMOTIONAL @ Samantha naga ...,Filmylooks,Entertainment,24,2017-11-13T01:14:16.000Z,"Filmylooks|""latest news""|""telugu movies""|""telu...",464015,492,293,66,https://i.ytimg.com/vi/kJzGH0PVQHQ/default.jpg,False,False,False,why Samantha became EMOTIONAL @ Samantha naga ...,IN


#### Getting category of JAPAN dataframe

In [18]:
jp_yet_df['category_id']= jp_yet_df['category_id'].astype(str)
jp_category_id ={}
with open('D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/JP_category_id.json', 'r') as f:
    data = json.load(f)
    for jp_category in data['items']:
        jp_category_id[jp_category['id']] = jp_category['snippet']['title']

jp_yet_df.insert(4, 'category', jp_yet_df['category_id'].map(jp_category_id))
jp_yet_df.head()

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country
0,5ugKfHgsmYw,18.07.02,é¸èªããªãåç´ã«è½ä¸ï¼è·¯ä¸ã®è»ã...,æäºéä¿¡æ åã»ã³ã¿ã¼,News & Politics,25,2018-02-06T03:04:37.000Z,"äºæ|""ä½è³""|""ä½è³ç""|""ããªã³ãã¿ã...",188085,591,189,0,https://i.ytimg.com/vi/5ugKfHgsmYw/default.jpg,True,False,False,ä½è³çç¥å¼å¸ã®æ°å®¶ã«å¢è½ããé¸ä...,JP
1,ohObafdd34Y,18.07.02,ã¤ããQ ãç¥­ãç·å®®å·Ãæè¶ å·¨å¤§ã...,ç¥è°·ãããª Kamiya Erina 2,Film & Animation,1,2018-02-06T04:01:56.000Z,[none],90929,442,88,174,https://i.ytimg.com/vi/ohObafdd34Y/default.jpg,False,False,False,,JP
2,aBr2kKAHN6M,18.07.02,Live Views of Starman,SpaceX,Science & Technology,28,2018-02-06T21:38:22.000Z,[none],6408303,165892,2331,3006,https://i.ytimg.com/vi/aBr2kKAHN6M/default.jpg,False,False,False,,JP
3,5wNnwChvmsQ,18.07.02,æ±äº¬ãã£ãºãã¼ãªã¾ã¼ãã®åã­ã£ã...,ã¢ã·ã¿ãã¯ãã¤,News & Politics,25,2018-02-06T06:08:49.000Z,ã¢ã·ã¿ãã¯ãã¤,96255,1165,277,545,https://i.ytimg.com/vi/5wNnwChvmsQ/default.jpg,False,False,False,æ±äº¬ãã£ãºãã¼ãªã¾ã¼ãã®åã­ã£ã...,JP
4,B7J47qFvdsk,18.07.02,æ¦®åå¥ããè¡æã®æ­»ãã ãµãï¼æ ç...,ã·ãããã¥ãã¤,Film & Animation,1,2018-02-06T02:30:00.000Z,[none],108408,1336,74,201,https://i.ytimg.com/vi/B7J47qFvdsk/default.jpg,False,False,False,å®¶ã«å¸°ã£ã¦ãããµã©ãªã¼ãã³ã®ãã...,JP


#### Getting category of South Korea dataframe

In [19]:
kr_yet_df['category_id']= kr_yet_df['category_id'].astype(str)
kr_category_id ={}
with open('D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/KR_category_id.json', 'r') as f:
    data = json.load(f)
    for kr_category in data['items']:
        kr_category_id[kr_category['id']] = kr_category['snippet']['title']

kr_yet_df.insert(4, 'category', kr_yet_df['category_id'].map(kr_category_id))
kr_yet_df.head(4)

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country
0,RxGQe4EeEpA,17.14.11,ì¢ì by ë¯¼ì_ì¤ì¢ì _ì¢ë ëµê°,ë¼í¸ë§ì½ë¦¬ì,People & Blogs,22,2017-11-13T07:07:36.000Z,"ë¼í¸ë§|""ì¤ì¢ì ""|""ì¢ë""|""ì¢ì""|""ì¬ë ...",156130,1422,40,272,https://i.ytimg.com/vi/RxGQe4EeEpA/default.jpg,False,False,False,ì¤ì¢ì 'ì¢ë'ì ëµê° 'ì¢ì' ìµì´ ê...,KR
1,hH7wVE8OlQ0,17.14.11,JSA ê·ì ë¶íêµ° ì´ê²© ë¶ì,Edward,News & Politics,25,2017-11-13T10:59:16.000Z,"JSA|""ê·ì""|""ë¶íêµ°""|""ì´ê²©""|""ë¶ì""|""JS...",76533,211,28,113,https://i.ytimg.com/vi/hH7wVE8OlQ0/default.jpg,False,False,False,[ì±ëAë¨ë]å ë³ì¬ íì¬ 'ììë¶ëª...,KR
2,9V8bnWUmE9U,17.14.11,ëëª°ë¼í¨ë°ë¦¬ ì´ëí ìì 2í (ë¹¼ë...,ëëª°ë¼í¨ë°ë¦¬ í«ì¼,People & Blogs,22,2017-11-11T07:16:08.000Z,"ìëë¤ì¤|""ë¹¼ë¹¼ë¡""|""í«ì¼""|""ëëª°ë¼í...",421409,5112,166,459,https://i.ytimg.com/vi/9V8bnWUmE9U/default.jpg,False,False,False,í¼ê°ì¤ë ê¼­ ì¶ì² ë¶íëë ¤ì,KR
3,0_8py-t5R80,17.14.11,"ááµáá§á¼áá¡á¨ ì¶êµ­ íì¥, ëì¹...",ë¯¸ëì´ëª½êµ¬,News & Politics,25,2017-11-12T11:19:52.000Z,"ì´ëªë°|""ì´ëªë° ì¶êµ­ê¸ì§""|""ì´ëªë° ...",222850,2093,173,1219,https://i.ytimg.com/vi/0_8py-t5R80/default.jpg,False,False,False,ë¤ì¤ë ëêµ¬ê²ëê¹ ë£ê³ ë í íì ,KR


#### Getting category of Mexico dataframe

In [20]:
mx_yet_df['category_id']= mx_yet_df['category_id'].astype(str)
mx_category_id ={}
with open('D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/MX_category_id.json', 'r') as f:
    data = json.load(f)
    for mx_category in data['items']:
        mx_category_id[mx_category['id']] = mx_category['snippet']['title']

mx_yet_df.insert(4, 'category', mx_yet_df['category_id'].map(mx_category_id))

mx_yet_df.head(3)

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country
0,SbOwzAl9ZfQ,17.14.11,CapÃ­tulo 12 | MasterChef 2017,MasterChef 2017,Entertainment,24,2017-11-13T06:06:22.000Z,"MasterChef Junior 2017|""TV Azteca""|""recetas""|""...",310130,4182,361,1836,https://i.ytimg.com/vi/SbOwzAl9ZfQ/default.jpg,False,False,False,Disfruta la presencia del Chef Torreblanca en ...,MX
1,klOV6Xh-DnI,17.14.11,ALEXA EX-INTEGRANTE DEL GRUPO TIMBIRICHE RENUN...,Micky Contreras Martinez,People & Blogs,22,2017-11-13T05:11:58.000Z,La Voz Mexico 7,104972,271,174,369,https://i.ytimg.com/vi/klOV6Xh-DnI/default.jpg,False,False,False,ALEXA EX-INTEGRANTE DEL GRUPO TIMBIRICHE RENUN...,MX
2,6L2ZF7Qzsbk,17.14.11,LOUIS CKAGÃ - EL PULSO DE LA REPÃBLICA,El Pulso De La RepÃºblica,News & Politics,25,2017-11-13T17:00:02.000Z,"Chumel Torres|""El Pulso de la Republica""|""noti...",136064,10105,266,607,https://i.ytimg.com/vi/6L2ZF7Qzsbk/default.jpg,False,False,False,La canciÃ³n del principio se llama âEste esp...,MX
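The titles in the JP, KR and MX previews look garbled (for example "CapÃ­tulo"), which is the typical result of UTF-8 bytes being decoded as ISO-8859-1. Assuming that is what happened here, the text can often be restored by reversing the wrong decoding. The helper below is a hypothetical sketch, not part of the original notebook.

```python
def repair_mojibake(text):
    """Undo 'UTF-8 bytes decoded as ISO-8859-1'; return the input unchanged if that fails."""
    if not isinstance(text, str):
        return text  # leave NaN / non-string values alone
    try:
        return text.encode("ISO-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not mojibake of this kind, keep as-is


# 'Capítulo 12' as it appears after the wrong decoding; escapes are used so the
# invisible soft hyphen (U+00AD) stays visible in the source.
garbled = "Cap\u00c3\u00adtulo 12"
print(repair_mojibake(garbled))
```

It could be applied per column, e.g. `mx_yet_df["title"].map(repair_mojibake)`. Whether every affected value can be recovered this way depends on the source files, so the result should be spot-checked.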


# 4. Merge dataframes

#### The dataframes are merged into the df_videos dataframe, and the non-null count of each column is then checked. The "description" and "category" columns have missing values.

In [21]:
df_videos = pd.concat([ca_yet_df, de_yet_df, fr_yet_df, gb_yet_df, in_yet_df, us_yet_df, mx_yet_df, jp_yet_df, kr_yet_df])
df_videos_value = df_videos  
df_videos_Ids = df_videos
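Note that `df_videos_value = df_videos` and `df_videos_Ids = df_videos` bind additional names to the same DataFrame object; no data is copied. In-place column assignments made later through `df_videos` are therefore also visible through the other two names. A small illustration with toy data:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2]})
alias = df            # same object, no data is copied
snapshot = df.copy()  # independent copy

df["y"] = df["x"] * 10  # in-place column assignment

print(alias is df)              # True
print("y" in alias.columns)     # True: the alias sees the new column
print("y" in snapshot.columns)  # False: the copy does not
```

If independent working copies are wanted, `df_videos.copy()` would be the usual choice.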

#### Unique value of the category of the dataframe

In [22]:
df_videos['category'].unique()

array(['Music', 'Comedy', 'Entertainment', 'News & Politics',
       'People & Blogs', 'Howto & Style', 'Film & Animation',
       'Science & Technology', 'Gaming', 'Sports', nan, 'Pets & Animals',
       'Travel & Events', 'Autos & Vehicles', 'Education', 'Shows',
       'Movies', 'Trailers', 'Nonprofits & Activism'], dtype=object)

In [23]:
df_videos.count()

video_id                  335203
trending_date             335203
title                     335203
channel_title             335203
category                  334006
category_id               335203
publish_time              335203
tags                      335203
views                     335203
likes                     335203
dislikes                  335203
comment_count             335203
thumbnail_link            335203
comments_disabled         335203
ratings_disabled          335203
video_error_or_removed    335203
description               318189
country                   335203
dtype: int64

# 5. Data Type Correction

#### The purpose of this section is to ensure that the data type of each column is correct.

#### The "trending_date" and "publish_time" columns, which are of str type, are converted to datetime type and reformatted for data analysis.

In [24]:
df_videos['trending_date'] = pd.to_datetime(df_videos['trending_date'],errors='coerce', format='%y.%d.%m')

In [25]:
df_videos['publish_time'] = pd.to_datetime(df_videos['publish_time'], errors='coerce', format='%Y-%m-%dT%H:%M:%S.%fZ')
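The "trending_date" strings use a year.day.month layout, which is why the format string is `%y.%d.%m` rather than the more common `%y.%m.%d`. The toy example below illustrates this, and also shows that `errors='coerce'` turns unparseable values into `NaT` instead of raising an error:

```python
import pandas as pd

raw = pd.Series(["17.14.11", "18.07.02", "not a date"])

# %y = two-digit year, %d = day, %m = month, matching the 'YY.DD.MM' layout.
parsed = pd.to_datetime(raw, format="%y.%d.%m", errors="coerce")

print(parsed)
print("unparseable values:", int(parsed.isna().sum()))
```

Counting `NaT` values after the conversion is a quick way to check that no real dates were silently lost.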

In [26]:
df_videos.head(3)

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country
0,n1WpP7iowLc,2017-11-14,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,Music,10,2017-11-10 17:00:03,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579,787425,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé i...,CA
1,0dBIkQ4Mz1M,2017-11-14,PLUSH - Bad Unboxing Fan Mail,iDubbbzTV,Comedy,23,2017-11-13 17:00:00,"plush|""bad unboxing""|""unboxing""|""fan mail""|""id...",1014651,127794,1688,13030,https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg,False,False,False,STill got a lot of packages. Probably will las...,CA
2,5qpjK5DgCt4,2017-11-14,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,Comedy,23,2017-11-12 19:05:24,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146035,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...,CA


### Set dataframe index:

In [27]:
df_videos = df_videos.reset_index().set_index('video_id')

#### The dataframe contains more records from the Americas (US, CA, MX) than from Asia (IN, KR, JP)

In [28]:
df_videos_grp_cntry = df_videos.groupby(['country']).count()['title'].sort_values(ascending=False)
df_videos_grp_cntry


country
US    40949
CA    40881
DE    40840
FR    40724
MX    40451
GB    38916
IN    37352
KR    34567
JP    20523
Name: title, dtype: int64

In [29]:
video_id_value_str = df_videos_value['video_id'].str

In [30]:
video_id_value_str


<pandas.core.strings.StringMethods at 0x233b5778748>

#### The missing values in the description and category columns are filled with the value "Unavailable".

In [31]:
df_videos_value = df_videos_value.fillna("Unavailable")

#### The cell below shows that this dataframe contains multiple records for the same video_id, with different "trending_date" values (and countries).

In [32]:
df_videos_value[df_videos_value['video_id'] =="1h7KV2sjUWY"].head()

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country
35312,1h7KV2sjUWY,2018-05-18,True Facts : Ant Mutualism,zefrank1,People & Blogs,22,2018-05-18 01:00:06,[none],233136,25781,104,2195,https://i.ytimg.com/vi/1h7KV2sjUWY/default.jpg,False,False,False,Unavailable,CA
35509,1h7KV2sjUWY,2018-05-19,True Facts : Ant Mutualism,zefrank1,People & Blogs,22,2018-05-18 01:00:06,[none],554139,44071,228,3018,https://i.ytimg.com/vi/1h7KV2sjUWY/default.jpg,False,False,False,Unavailable,CA
35715,1h7KV2sjUWY,2018-05-20,True Facts : Ant Mutualism,zefrank1,People & Blogs,22,2018-05-18 01:00:06,[none],656399,48457,259,3239,https://i.ytimg.com/vi/1h7KV2sjUWY/default.jpg,False,False,False,Unavailable,CA
36065,1h7KV2sjUWY,2018-05-21,True Facts : Ant Mutualism,zefrank1,People & Blogs,22,2018-05-18 01:00:06,[none],739895,51050,297,3326,https://i.ytimg.com/vi/1h7KV2sjUWY/default.jpg,False,False,False,Unavailable,CA
35628,1h7KV2sjUWY,2018-05-19,True Facts : Ant Mutualism,zefrank1,People & Blogs,22,2018-05-18 01:00:06,[none],554139,44071,228,3018,https://i.ytimg.com/vi/1h7KV2sjUWY/default.jpg,False,False,False,Unavailable,DE


#### In order to set the "video_id" column as the index of the dataframe, the column first needs to be cleaned and validated.


In [33]:
df_videos_Ids_grid = df_videos_Ids.groupby("video_id").size()

In [34]:
df_videos_Ids.groupby("video_id").size().head()

video_id
#NAME?         1752
#VALUE!           7
--1skHapGUc       1
--2K8l6BWfw       1
--45ws7CEN0       2
dtype: int64

#### A group-by query over the video_ids is executed to identify any invalid ids. The index labels of the records with those ids are then collected in order to drop the invalid records.

#### The invalid values in the video_id column are "#NAME?" and "#VALUE!".

In [35]:
df_videos_Ids.shape

(335203, 18)

#### Dropping the records with "video_id" = "#NAME?":

In [36]:
df_videos_Ids[df_videos_Ids["video_id"]=="#NAME?"].head()

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country
134,#NAME?,2017-11-14,స‌మంత కంట‌త‌డి | Samantha became EMOTIONAL @ S...,Friday Poster,Entertainment,24,2017-11-13 08:59:27,"స‌మంత కంట‌త‌డి|""Samantha became EMOTIONAL @ Sa...",31052,36,11,2,https://i.ytimg.com/vi/-b0ww7L2MGU/default.jpg,False,False,False,స‌మంత కంట‌త‌డి | Samantha became EMOTIONAL @ S...,IN
173,#NAME?,2017-11-14,कुंभ राशि वालों के लिए 12 नवंबर - 18 नवंबर का ...,Jansatta,News & Politics,25,2017-11-11 09:09:06,"कुंभ राशि|""Astro""|""rashi""|""कुंभ""|""jansatta""",30659,180,36,3,https://i.ytimg.com/vi/-BcG_jN6DgE/default.jpg,False,False,False,,IN
189,#NAME?,2017-11-14,"घर में चुपचाप यहाँ रख दे एक लौंग , इतना बरसेगा...",Health Tips for You,Howto & Style,26,2017-11-08 12:27:17,"tona totka|""tone""|""laal kitaab""|""lal kitaab""|""...",743321,2570,1154,294,https://i.ytimg.com/vi/-kj6W27Jj-8/default.jpg,False,False,False,"घर में चुपचाप यहाँ रख दे एक लौंग , इतना बरसेगा...",IN
298,#NAME?,2017-11-15,18 नवम्बर 2017शनि अमावस्या को जरा से काले तिल ...,AstroMitram,People & Blogs,22,2017-11-14 05:41:47,"Tiger Zinda Hai Trailer|""Tiger Zinda Hai Offic...",28816,376,31,29,https://i.ytimg.com/vi/-X33hZ1oTXI/default.jpg,False,False,False,शनि अमावस्या 18 नवम्बर 2017 को जरा से काले तिल...,IN
360,#NAME?,2017-11-15,BEST MOM EVER- Things you would love to hear f...,Old Delhi Films,Entertainment,24,2017-11-14 06:52:06,"Mother|""mom""|""best mom""|""best dad ever""|""best ...",14529,1018,131,83,https://i.ytimg.com/vi/-x9Bp5lFyM0/default.jpg,False,False,False,"Things your MOTHER will never say, still you r...",IN
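`#NAME?` and `#VALUE!` look like spreadsheet error strings, and in the rows previewed in this section the thumbnail links still carry ids that start with "-" (for example `-b0ww7L2MGU`). As an alternative to dropping these rows, the id could in principle be recovered from "thumbnail_link". This is not what the notebook does; the sketch below uses a hypothetical helper and assumes every link follows the `/vi/<video_id>/` pattern seen in the previews.

```python
import re

# Thumbnail URLs in this dataset look like https://i.ytimg.com/vi/<video_id>/default.jpg
THUMBNAIL_ID = re.compile(r"/vi/([^/]+)/")


def video_id_from_thumbnail(url):
    """Return the id embedded in a thumbnail URL, or None if there is no /vi/<id>/ part."""
    match = THUMBNAIL_ID.search(url) if isinstance(url, str) else None
    return match.group(1) if match else None


print(video_id_from_thumbnail("https://i.ytimg.com/vi/-b0ww7L2MGU/default.jpg"))
```

Applied to the affected rows only, e.g. via `df["thumbnail_link"].map(video_id_from_thumbnail)`, this would keep their view and like counts in the analysis.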


In [37]:
badValueNameIndex=df_videos_Ids[df_videos_Ids['video_id']=="#NAME?"].index

In [38]:
df_video_good_name = df_videos_Ids.drop(badValueNameIndex)

In [39]:
df_video_good_name[df_video_good_name["video_id"]=="#NAME?"]

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country


In [40]:
df_videos_Ids= df_video_good_name

In [41]:
df_videos_Ids[df_videos_Ids["video_id"]=="#NAME?"].shape[0]

0

In [42]:
df_videos_Ids.groupby("video_id").size().head()

video_id
#VALUE!        7
--1skHapGUc    1
--2K8l6BWfw    1
--45ws7CEN0    2
--6vcer7XYQ    3
dtype: int64

#### The dataframe still contains the invalid video_id value "#VALUE!", which should be dropped from the dataframe.

In [43]:
df_videos_Ids[df_videos_Ids["video_id"]=="#VALUE!"].head(3)

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country
20698,#VALUE!,2018-03-09,ऐसे झूठ जिनको देख कर आप तुरंत विश्वास कर लेते ...,Rahasya,Education,27,2018-03-08 07:03:27,"myth|""facts""|""top interesting facts""|""most int...",68027,4963,169,477,https://i.ytimg.com/vi/-L2-A0Saw30/default.jpg,False,False,False,FOLLOW US ON FACEBOOK : https://goo.gl/UhDWrn\...,IN
20894,#VALUE!,2018-03-10,ऐसे झूठ जिनको देख कर आप तुरंत विश्वास कर लेते ...,Rahasya,Education,27,2018-03-08 07:03:27,"myth|""facts""|""top interesting facts""|""most int...",143872,6914,297,540,https://i.ytimg.com/vi/-L2-A0Saw30/default.jpg,False,False,False,FOLLOW US ON FACEBOOK : https://goo.gl/UhDWrn\...,IN
21112,#VALUE!,2018-03-11,ऐसे झूठ जिनको देख कर आप तुरंत विश्वास कर लेते ...,Rahasya,Education,27,2018-03-08 07:03:27,"myth|""facts""|""top interesting facts""|""most int...",233118,8381,400,624,https://i.ytimg.com/vi/-L2-A0Saw30/default.jpg,False,False,False,FOLLOW US ON FACEBOOK : https://goo.gl/UhDWrn\...,IN


In [44]:
bad_video_index_list=df_videos_Ids[df_videos_Ids["video_id"]=="#VALUE!"].index

In [45]:
df_videos_Ids.shape

(320588, 18)

In [46]:
df_videos_Ids=df_videos_Ids.drop(bad_video_index_list)

In [47]:
df_videos_Ids.shape

(320532, 18)
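The shapes above suggest that more rows are removed than intended: the frame shrinks from 335203 to 320588 rows (14615 removed) although only 1752 rows hold `#NAME?`, and from 320588 to 320532 rows (56 removed) for 7 `#VALUE!` rows. A likely cause is that the country frames were concatenated without resetting the index, so row labels repeat across countries and `drop(labels)` removes every row that shares a label. The toy example below reproduces the effect and shows a boolean mask as one way to avoid it:

```python
import pandas as pd

# Two toy 'country' frames whose default row labels (0, 1) overlap after concat.
a = pd.DataFrame({"video_id": ["#NAME?", "abc"], "country": "IN"})
b = pd.DataFrame({"video_id": ["def", "ghi"], "country": "US"})
df = pd.concat([a, b])

# Label-based drop: label 0 also belongs to the valid US row 'def'.
bad_labels = df[df["video_id"] == "#NAME?"].index
dropped_by_label = df.drop(bad_labels)

# Boolean mask: removes only the rows whose id is invalid.
invalid_ids = ["#NAME?", "#VALUE!"]
kept_by_mask = df[~df["video_id"].isin(invalid_ids)]

print(len(dropped_by_label), len(kept_by_mask))
```

Passing `ignore_index=True` to `pd.concat` would be another way to make the row labels unique before dropping.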

In [48]:
df_videos_Ids[df_videos_Ids["video_id"]=="#VALUE!"]

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country


In [49]:
df_videos_Ids.groupby("video_id").size().head()

video_id
--1skHapGUc    1
--2K8l6BWfw    1
--45ws7CEN0    2
--6vcer7XYQ    3
--728h8mnDY    2
dtype: int64

#### Since video_id is not unique in this dataframe, we keep only the most recent record of each video, based on "trending_date".

#### There are multiple records for each video_id, each with a different "trending_date". To make video_id unique, the dataframe is sorted in descending order of "trending_date" and then grouped by "video_id"; the first record of each group, which has the latest "trending_date", is kept.

In [50]:
df_videos_Ids[df_videos_Ids["video_id"]=="-06RYo6s6qQ"]

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country
21882,-06RYo6s6qQ,2018-03-05,Wie du eine KAPUTTE Beziehung REPARIERST,Christian Bischoff,People & Blogs,22,2018-03-04 09:00:00,"Beziehung|""Kaputte Beziehung""|""WIe du eine kap...",11839,935,21,125,https://i.ytimg.com/vi/-06RYo6s6qQ/default.jpg,False,False,False,In diesem Video zeige ich Dir wie du eine kapu...,DE


In [51]:
df_unique_video_id = df_videos_Ids.sort_values("trending_date",ascending=False).groupby("video_id").head(1)

In [52]:
df_unique_video_id.groupby("video_id").size().head()

video_id
--1skHapGUc    1
--2K8l6BWfw    1
--45ws7CEN0    1
--6vcer7XYQ    1
--728h8mnDY    1
dtype: int64
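
As a cross-check of the step above, the sketch below applies the same "sort descending, keep the first row per video_id" idea to a small invented frame and compares it with `drop_duplicates`, which should select the same rows. The toy data and variable names are assumptions made for illustration only.

```python
import pandas as pd

# Invented records: video "a" trended on three days, video "b" on two.
videos = pd.DataFrame({
    "video_id": ["a", "a", "b", "a", "b"],
    "trending_date": pd.to_datetime(
        ["2018-03-01", "2018-03-03", "2018-03-02", "2018-03-02", "2018-03-05"]
    ),
    "views": [100, 300, 50, 200, 80],
})

# Newest trending_date first, so the first row of each group is the latest one.
latest_first = videos.sort_values("trending_date", ascending=False)

via_groupby = latest_first.groupby("video_id").head(1)
via_drop = latest_first.drop_duplicates(subset="video_id", keep="first")

# Both approaches select the same rows.
assert via_groupby.equals(via_drop)

# Index the result by the now-unique video_id.
unique_videos = via_drop.set_index("video_id").sort_index()
print(unique_videos)
```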

In [53]:
df_unique_video_id.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 148716 entries, 34566 to 179
Data columns (total 18 columns):
video_id                  148716 non-null object
trending_date             148716 non-null datetime64[ns]
title                     148716 non-null object
channel_title             148716 non-null object
category                  147994 non-null object
category_id               148716 non-null object
publish_time              148716 non-null datetime64[ns]
tags                      148716 non-null object
views                     148716 non-null int64
likes                     148716 non-null int64
dislikes                  148716 non-null int64
comment_count             148716 non-null int64
thumbnail_link            148716 non-null object
comments_disabled         148716 non-null bool
ratings_disabled          148716 non-null bool
video_error_or_removed    148716 non-null bool
description               137919 non-null object
country                   148716 non-null object


### YouTube API

#### The YouTube API was used to get more details about the YouTube videos.
#### An API key was generated through the Google developer console, and the requests to the YouTube API were submitted with that key.
#### The items of each response are returned as a list of dictionaries.

In [311]:
import os

# A separate API key was used for each of the four sets of videos.
# The key is read from an environment variable so that it is not stored in the
# notebook (the variable name YOUTUBE_API_KEY is an assumption).
api_key = os.environ["YOUTUBE_API_KEY"]



In [312]:

from googleapiclient.discovery import build
youtube = build('youtube', 'v3', developerKey = api_key)

In [313]:
type(youtube)

googleapiclient.discovery.Resource

#### The dataframe created in the previous step, which contains unique video_id values, is used to extract the list of video ids.
#### A request is then submitted through the YouTube API for each video_id in the list.

In [77]:
df_videos_Ids_unique_api = df_unique_video_id

In [78]:
df_videos_Ids_unique_api.shape

(148716, 18)

#### Since the YouTube API limits the number of requests, the dataframe is split into an array of 30 smaller dataframes.

In [79]:
df_videos_Ids_split = np.array_split(df_videos_Ids_unique_api, 30)


In [80]:
df_videos_Ids_split[0].shape

(4958, 18)
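
The split sizes follow from the row count: 148716 rows in 30 parts gives six parts of 4958 rows and 24 parts of 4957 rows. The sketch below shows one way to get the same kind of split by positions, using a hypothetical helper `split_frame` and a toy frame; it is an illustration, not the notebook's own code.

```python
import numpy as np
import pandas as pd


def split_frame(df, n_parts):
    """Split df row-wise into n_parts chunks whose sizes differ by at most one."""
    # Split the row positions, then select each group of positions from df.
    positions = np.array_split(np.arange(len(df)), n_parts)
    return [df.iloc[idx] for idx in positions]


# Toy frame standing in for df_videos_Ids_unique_api.
demo = pd.DataFrame({"video_id": [f"id{n}" for n in range(10)]})
parts = split_frame(demo, 3)

print([len(part) for part in parts])  # [4, 3, 3]
```

Splitting the positions rather than the dataframe itself keeps each chunk a regular dataframe with its original index labels.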

In [314]:
#videoIds1 = df_videos_Ids_split[0]['video_id']

videoIds1 = df_videos_Ids_split[3]['video_id']

In [315]:
len(videoIds1)

4958

In [316]:
counter_video_dict = 0

In [317]:
video_dict1 = []


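for vid in videoIds1:
    # One request per video; execute() raises an HttpError once the quota is used up.
    req = youtube.videos().list(id = vid,  part='snippet').execute()
    counter_video_dict += 1 
    # 'items' is an empty list when no data is returned for the video.
    video_dict1.append(req['items'])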

HttpError: <HttpError 403 when requesting https://www.googleapis.com/youtube/v3/videos?id=ZOPT6E-UzwA&part=snippet&key=<REDACTED>&alt=json returned "The request cannot be completed because you have exceeded your <a href="/youtube/v3/getting-started#quota">quota</a>.">

In [318]:
counter_video_dict

3275
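
The loop above sends one request per video and stops with an exception when the quota runs out. The `id` parameter of `videos().list` also accepts a comma-separated list of ids, so a possible alternative is to request the videos in batches and to stop cleanly on the first failed request. The sketch below illustrates that idea; `chunked` and `fetch_video_items` are hypothetical helpers, the batch size of 50 is an assumption, and the broad `except` would normally be narrowed to `googleapiclient.errors.HttpError`.

```python
def chunked(values, size):
    """Yield consecutive slices of `values` holding at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def fetch_video_items(client, video_ids, batch_size=50):
    """Request video snippets in batches and stop at the first failed request.

    `client` is assumed to behave like the object returned by build().
    Returns the items collected so far and the number of ids requested.
    """
    ids = list(video_ids)
    items, requested = [], 0
    for batch in chunked(ids, batch_size):
        try:
            request = client.videos().list(id=",".join(batch), part="snippet")
            response = request.execute()
        except Exception as error:  # real client: googleapiclient.errors.HttpError
            print(f"Stopped after {requested} ids: {error}")
            break
        requested += len(batch)
        items.extend(response.get("items", []))
    return items, requested


# The batching helper on its own:
print([len(batch) for batch in chunked(list("abcdefg"), 3)])  # [3, 3, 1]
```

Unlike `video_dict1`, which holds one list per video, this sketch returns a single flat list of items, so later steps would read `item['snippet']['channelId']` directly.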

In [319]:
with open("D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/api/videos_4.json", "w") as data:
    data.write(json.dumps(video_dict1))

In [320]:
len(video_dict1)

3275

In [321]:
channelId = []

# Collect the channel id of every video for which the API returned data;
# responses with an empty 'items' list are skipped.
for items in video_dict1:
    if len(items) > 0:
        channelId.append(items[0]['snippet']['channelId'])

In [322]:
len(channelId)

2613

In [323]:
channelId[0:5]

['UCANAGvMwJW8cpYH-OXE2Ryg',
 'UCfkM3u-0uSKADDitZLpXcfA',
 'UClFSU9_bUb4Rc6OYfTt5SPw',
 'UC1Lp8HbCqOPBMT1W_mjb21A',
 'UCaminwG9MTO4sLYeC3s6udA']
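
Several videos can belong to the same channel, so the list of channel ids may contain repeats. A possible refinement, sketched below with invented ids and a hypothetical helper `unique_in_order`, is to remove the repeats before querying the API so that each channel is requested only once.

```python
def unique_in_order(values):
    """Drop repeated values while keeping the order of first appearance."""
    # dict keys are unique and preserve insertion order.
    return list(dict.fromkeys(values))


# Invented channel ids, for illustration only.
channel_ids = ["UC_a", "UC_b", "UC_a", "UC_c", "UC_b"]
unique_channel_ids = unique_in_order(channel_ids)

print(unique_channel_ids)  # ['UC_a', 'UC_b', 'UC_c']
```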

In [324]:
import os

# A separate API key was used for each of the four sets of channels.
# The key is read from an environment variable so that it is not stored in the
# notebook (the variable name YOUTUBE_CHANNEL_API_KEY is an assumption).
channel_api_key = os.environ["YOUTUBE_CHANNEL_API_KEY"]

In [325]:
ytRq = build('youtube', 'v3', developerKey = channel_api_key)
type(ytRq)

googleapiclient.discovery.Resource

In [307]:
req = ytRq.channels().list( part='snippet, statistics', id ='UCSse-lNI1DQ4w-8lh7vfPUw', maxResults= 50)


In [326]:
video_chan_dict1 = []

for chid in channelId:
    # One request per channel id; fall back to an empty list if no items are returned.
    req = ytRq.channels().list(id = chid,  part='snippet').execute()
    video_chan_dict1.append(req.get('items', []))

In [327]:
len(video_chan_dict1)

2613

#### Since the video and channel dataframes were split into arrays of dataframes, the results for each part are written to files.

In [328]:
with open("D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/api/channel_4.json", "w") as data:
    data.write(json.dumps(video_chan_dict1))

In [363]:
datastore1 = []
datastore2 = []
datastore3 = []
datastore4 = []

In [364]:
with open("D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/api/channel_1.json", 'r') as f:
    datastore1 = json.load(f)
    
with open("D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/api/channel_2.json", 'r') as f:
    datastore2 = json.load(f)
    
with open("D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/api/channel_3.json", 'r') as f:
    datastore3 = json.load(f)
    
with open("D:/DataScienceFoundation/SpringBoard/YouTube Project/youtube-new/api/channel_4.json", 'r') as f:
    datastore4 = json.load(f)

In [368]:
channel_dict_all = []

In [421]:
channel_dict_all= datastore1 + datastore2 + datastore3 + datastore4

In [424]:
len(channel_dict_all)

10473
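
The four files are read and concatenated one by one above. The same step could be written as a loop over the file paths, which scales to any number of parts. The sketch below uses a hypothetical helper `load_json_parts` and throwaway files in a temporary directory; the file contents are invented for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_json_parts(paths):
    """Load several JSON files that each hold a list and concatenate them."""
    combined = []
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            combined.extend(json.load(f))
    return combined


# Demo with throwaway files that mimic channel_1.json, channel_2.json, ...
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for n, part in enumerate([[["a"]], [["b"], []], [["c"]]], start=1):
        path = Path(tmp) / f"channel_{n}.json"
        path.write_text(json.dumps(part), encoding="utf-8")
        paths.append(path)
    channel_parts = load_json_parts(paths)

print(len(channel_parts))  # 4
```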

In [431]:
channel_dict_all[2697][0]

{'etag': '"0UM_wBUsFuT6ekiIlwaHvyqc80M/vlpfYKjwo0sBsHvLrgcLSUfS7ic"',
 'id': 'UCGFjTDFI3t5Puy2jovEZsvw',
 'kind': 'youtube#channel',
 'snippet': {'country': 'FR',
  'description': "Salut à tous et bienvenue sur ma chaîne youtube!\nJe viens du sud de la France, \n\nN'hésitez pas à vous abonner, partager et liker !  Ça me ferai extrêmement plaisir.\n\nContact Pro ► lidealpro@gmail.com",
  'localized': {'description': "Salut à tous et bienvenue sur ma chaîne youtube!\nJe viens du sud de la France, \n\nN'hésitez pas à vous abonner, partager et liker !  Ça me ferai extrêmement plaisir.\n\nContact Pro ► lidealpro@gmail.com",
   'title': 'LIDEAL'},
  'publishedAt': '2017-04-03T18:24:53.000Z',
  'thumbnails': {'default': {'height': 88,
    'url': 'https://yt3.ggpht.com/a/AGF-l79X_XyNbpXdj4FiMf2bBgx-CQhMz_ovc8bNhA=s88-c-k-c0xffffffff-no-rj-mo',
    'width': 88},
   'high': {'height': 800,
    'url': 'https://yt3.ggpht.com/a/AGF-l79X_XyNbpXdj4FiMf2bBgx-CQhMz_ovc8bNhA=s800-c-k-c0xffffffff-no-rj-m

In [440]:
video_chan_dict1_cleaned = []

# Flatten every channel response into a dictionary with the fields of interest.
for items in datastore1:
    if len(items) == 0:
        continue  # no channel data was returned for this id
    tmp = {}
    tmp['channel_id']= items[0]['id']
    tmp['channel_title']= items[0]['snippet']['title']
    #tmp['channel_publishedAt']= items[0]['snippet']['publishedAt']
    tmp['channel_etag']= items[0]['etag']
    tmp['channel_description']= items[0]['snippet']['description']
    video_chan_dict1_cleaned.append(tmp)

In [434]:
len(video_chan_dict1_cleaned)

2695

In [435]:
video_chan_dict1_cleaned[0]

{'channel_description': '법륜스님의 즉문즉설',
 'channel_etag': '"0UM_wBUsFuT6ekiIlwaHvyqc80M/zxS0H6GV9QtqVTLKqyivMgeQxF4"',
 'channel_id': 'UCSsWdUwr4UonSCAVH-k0_Lg',
 'channel_title': '법륜스님의 즉문즉설'}
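
The flattening step can also be wrapped in a small function that goes straight from the stored responses to a dataframe. The sketch below assumes each response is a list holding at most one channel dictionary with the keys seen in the output above (`id`, `etag`, `snippet`); `flatten_channel_responses` is a hypothetical helper and the sample data is invented.

```python
import pandas as pd


def flatten_channel_responses(responses):
    """Turn stored channel responses into one row per channel."""
    rows = []
    for items in responses:
        if not items:
            continue  # the API returned no channel for this id
        channel = items[0]
        snippet = channel["snippet"]
        rows.append({
            "channel_id": channel["id"],
            "channel_title": snippet["title"],
            "channel_etag": channel["etag"],
            "channel_description": snippet.get("description", ""),
        })
    return pd.DataFrame(rows)


# Invented responses: two channels and one empty result.
sample = [
    [{"id": "UC_1", "etag": "e1", "snippet": {"title": "One", "description": "first"}}],
    [],
    [{"id": "UC_2", "etag": "e2", "snippet": {"title": "Two"}}],
]
channel_df = flatten_channel_responses(sample)
print(channel_df)
```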

In [161]:
len(video_chan_dict1_cleaned)

2694

In [436]:
chnnel_df_1 = pd.DataFrame(video_chan_dict1_cleaned)

In [437]:
chnnel_df_1.head(3)

Unnamed: 0,channel_description,channel_etag,channel_id,channel_title
0,법륜스님의 즉문즉설,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/zxS0H6GV9QtqVTLKq...",UCSsWdUwr4UonSCAVH-k0_Lg,법륜스님의 즉문즉설
1,BB Ki Vines is about BB and some funny instanc...,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/5yXczLj4QG1sbTSVO...",UCqwUrj10mAEsqezcItqvwEw,BB Ki Vines
2,"It's not on TV, it's on TVF. Subscribe to The ...","""0UM_wBUsFuT6ekiIlwaHvyqc80M/wLK8JkSfE_7TdCka8...",UCNJcSUSzUeFm8W9P7UUlSeQ,The Viral Fever


chnnel_df_1['channel_publishedAt'] = pd.to_datetime(chnnel_df_1['channel_publishedAt'], errors='coerce', format='%Y-%m-%dT%H:%M:%S.%fZ')

In [165]:
chnnel_df_1.head()

Unnamed: 0,channel_description,channel_etag,channel_id,channel_publishedAt,channel_title,etag,id,kind,snippet
0,법륜스님의 즉문즉설,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/zxS0H6GV9QtqVTLKq...",UCSsWdUwr4UonSCAVH-k0_Lg,2011-08-11 06:12:03,법륜스님의 즉문즉설,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/zxS0H6GV9QtqVTLKq...",UCSsWdUwr4UonSCAVH-k0_Lg,youtube#channel,"{'title': '법륜스님의 즉문즉설', 'description': '법륜스님의 ..."
1,BB Ki Vines is about BB and some funny instanc...,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/5yXczLj4QG1sbTSVO...",UCqwUrj10mAEsqezcItqvwEw,2015-06-20 08:40:00,BB Ki Vines,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/5yXczLj4QG1sbTSVO...",UCqwUrj10mAEsqezcItqvwEw,youtube#channel,"{'title': 'BB Ki Vines', 'description': 'BB Ki..."
2,"It's not on TV, it's on TVF. Subscribe to The ...","""0UM_wBUsFuT6ekiIlwaHvyqc80M/wLK8JkSfE_7TdCka8...",UCNJcSUSzUeFm8W9P7UUlSeQ,2011-03-14 19:57:13,The Viral Fever,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/wLK8JkSfE_7TdCka8...",UCNJcSUSzUeFm8W9P7UUlSeQ,youtube#channel,"{'title': 'The Viral Fever', 'description': 'I..."
3,"ZEE24Taas, a 24x7 Marathi news channel activel...","""0UM_wBUsFuT6ekiIlwaHvyqc80M/tf051B1A5UwlIza-1...",UCVbsFo8aCgvIRIO9RYwsQMA,2010-04-06 14:09:58,ZEE 24 TAAS,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/tf051B1A5UwlIza-1...",UCVbsFo8aCgvIRIO9RYwsQMA,youtube#channel,"{'title': 'ZEE 24 TAAS', 'description': 'ZEE24..."
4,Press the bell icon and Subscribe for daily do...,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/gw-dt0kgn3oRnFgp5...",UCgM1AQcoM5TRlzsm1e66vrQ,2015-09-24 21:08:31,Hasley India,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/gw-dt0kgn3oRnFgp5...",UCgM1AQcoM5TRlzsm1e66vrQ,youtube#channel,"{'title': 'Hasley India', 'description': 'Pres..."


In [168]:
chnnel_df_1.shape

(2694, 9)

In [441]:
chnnel_df_1.groupby("channel_id").size().head(7)

channel_id
UC-0gtGds9hpUVkumxj6hN0A    2
UC-2EkisRV8h9KsHpslQ1gXA    1
UC-2Y8dQb0S6DtpxNgAKoJKA    4
UC-47_UfWjes2kH69QF-ASJg    1
UC-4M8AN08hw39nn2v91VuMQ    1
UC-4Tzd5lIRU7lcv7_5OO01Q    1
UC-5OCj6jVpzwE2XxXwFeMsw    2
dtype: int64

chnnel_df_unique__1 = chnnel_df_1.sort_values("channel_publishedAt",ascending=False).groupby("channel_id").head(1)
chnnel_df_unique__1.groupby("channel_id").size().head()

In [444]:
chnnel_df_unique1 = chnnel_df_1.groupby("channel_id").head(1) 
chnnel_df_unique1.head()

Unnamed: 0,channel_description,channel_etag,channel_id,channel_title
0,법륜스님의 즉문즉설,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/zxS0H6GV9QtqVTLKq...",UCSsWdUwr4UonSCAVH-k0_Lg,법륜스님의 즉문즉설
1,BB Ki Vines is about BB and some funny instanc...,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/5yXczLj4QG1sbTSVO...",UCqwUrj10mAEsqezcItqvwEw,BB Ki Vines
2,"It's not on TV, it's on TVF. Subscribe to The ...","""0UM_wBUsFuT6ekiIlwaHvyqc80M/wLK8JkSfE_7TdCka8...",UCNJcSUSzUeFm8W9P7UUlSeQ,The Viral Fever
3,"ZEE24Taas, a 24x7 Marathi news channel activel...","""0UM_wBUsFuT6ekiIlwaHvyqc80M/tf051B1A5UwlIza-1...",UCVbsFo8aCgvIRIO9RYwsQMA,ZEE 24 TAAS
4,Press the bell icon and Subscribe for daily do...,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/gw-dt0kgn3oRnFgp5...",UCgM1AQcoM5TRlzsm1e66vrQ,Hasley India


In [446]:
chnnel_df_unique__1= chnnel_df_unique1.set_index("channel_id")

In [447]:
chnnel_df_unique__1.head()

Unnamed: 0_level_0,channel_description,channel_etag,channel_title
channel_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
UCSsWdUwr4UonSCAVH-k0_Lg,법륜스님의 즉문즉설,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/zxS0H6GV9QtqVTLKq...",법륜스님의 즉문즉설
UCqwUrj10mAEsqezcItqvwEw,BB Ki Vines is about BB and some funny instanc...,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/5yXczLj4QG1sbTSVO...",BB Ki Vines
UCNJcSUSzUeFm8W9P7UUlSeQ,"It's not on TV, it's on TVF. Subscribe to The ...","""0UM_wBUsFuT6ekiIlwaHvyqc80M/wLK8JkSfE_7TdCka8...",The Viral Fever
UCVbsFo8aCgvIRIO9RYwsQMA,"ZEE24Taas, a 24x7 Marathi news channel activel...","""0UM_wBUsFuT6ekiIlwaHvyqc80M/tf051B1A5UwlIza-1...",ZEE 24 TAAS
UCgM1AQcoM5TRlzsm1e66vrQ,Press the bell icon and Subscribe for daily do...,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/gw-dt0kgn3oRnFgp5...",Hasley India


In [448]:
df_unique_video_id[df_unique_video_id['channel_title']=="nigahiga"].head()

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country
39945,UfKmSfgFxi8,2018-06-09,FORTNITE The Movie (Official Fake Trailer),nigahiga,Entertainment,24,2018-05-11 21:11:16,"ryan|""higa""|""higatv""|""nigahiga""|""fortnite""|""th...",15346180,551373,16115,27094,https://i.ytimg.com/vi/UfKmSfgFxi8/default.jpg,False,False,False,Play Fortnite for FREE here: https://pixly.go2...,US
39354,2s4GMLkTNv0,2018-06-07,Dancing Without Moving!?,nigahiga,Entertainment,24,2018-06-02 22:28:48,"ryan|""higa""|""higatv""|""nigahiga""|""dancing witho...",5708085,487038,3979,40614,https://i.ytimg.com/vi/2s4GMLkTNv0/default.jpg,False,False,False,Nearly 1 whole week of standing still and over...,CA
34663,CbWP9nvnRhw,2018-05-14,The Pun Challenge!?,nigahiga,Entertainment,24,2018-04-27 18:55:53,"ryan|""higa""|""higatv""|""nigahiga""|""pun challenge...",3673170,212212,4165,18529,https://i.ytimg.com/vi/CbWP9nvnRhw/default.jpg,False,False,False,Go and bother the youtubers that you think wou...,US
27343,ZCPwpcurYns,2018-04-01,How To Make Mumble Rap,nigahiga,Entertainment,24,2018-03-17 20:58:36,"ryan|""higa""|""higatv""|""nigahiga""|""david choi""|""...",6187534,355336,6260,30425,https://i.ytimg.com/vi/ZCPwpcurYns/default.jpg,False,False,False,"When David and I are bored, we sometimes make ...",US
24147,tugFFhML7VY,2018-03-16,The Nintendoe Paper!,nigahiga,Entertainment,24,2018-03-02 20:08:20,"ryan|""higa""|""higatv""|""nigahiga""|""nintendo""|""la...",4159065,337616,4811,32303,https://i.ytimg.com/vi/tugFFhML7VY/default.jpg,False,False,False,Sorry it's been so long since we posted a vide...,US


In [449]:
video_channel_join = pd.merge(df_unique_video_id, chnnel_df_unique__1, on = 'channel_title', how='inner')

In [452]:
video_channel_join.head()

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country,channel_description,channel_etag
0,Zqv5CBWt9yA,2018-06-14,Bhuvan Bam- Safar | Official Music Video |,BB Ki Vines,Entertainment,24,2018-06-13 07:13:43,"safar|""travel""|""bhuvan bam""|""music""|""journey""|...",3854712,524135,14650,55735,https://i.ytimg.com/vi/Zqv5CBWt9yA/default.jpg,False,False,False,Bhuvan Bam releases his 3rd single 'Safar' in ...,IN,BB Ki Vines is about BB and some funny instanc...,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/5yXczLj4QG1sbTSVO..."
1,g2orJgNOpnU,2018-06-13,BB Ki Vines- | Alvida Dost |,BB Ki Vines,Entertainment,24,2018-06-06 12:02:47,"friends|""fight""|""humour""|""funny""|""comedy""|""bhu...",10433510,674784,17155,68078,https://i.ytimg.com/vi/g2orJgNOpnU/default.jpg,False,False,False,Is this the last time BB and Bancho talk? A he...,IN,BB Ki Vines is about BB and some funny instanc...,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/5yXczLj4QG1sbTSVO..."
2,T9WN2_ikz6Q,2018-04-15,BB Ki Vines- | The Sacrifice |,BB Ki Vines,Entertainment,24,2018-04-09 10:21:36,"sacrifice|""parents""|""laptop""|""office""|""bonus""|...",7769007,1084173,9373,125740,https://i.ytimg.com/vi/T9WN2_ikz6Q/default.jpg,False,False,False,Dad's got a bonus from office. How will he spe...,IN,BB Ki Vines is about BB and some funny instanc...,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/5yXczLj4QG1sbTSVO..."
3,t-My3vpDBvA,2018-04-06,BB Ki Vines- | Likhe Jo Khat Tujhe |,BB Ki Vines,Entertainment,24,2018-03-31 13:46:47,"bhuvan bam|""bb""|""funny""|""humour""|""letter""|""hil...",10294388,660730,17079,56644,https://i.ytimg.com/vi/t-My3vpDBvA/default.jpg,False,False,False,A random letter for Babloo ji turns his world ...,IN,BB Ki Vines is about BB and some funny instanc...,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/5yXczLj4QG1sbTSVO..."
4,FPm7xM849-E,2018-03-14,BB Ki Vines- | Maun Vrat |,BB Ki Vines,Entertainment,24,2018-03-05 13:09:25,"BB|""Bhuvan Bam""|""humour""|""comedy""|""mute""|""dumb...",10415360,612718,16544,40230,https://i.ytimg.com/vi/FPm7xM849-E/default.jpg,False,False,False,BB has to be mute for a day in order to achiev...,IN,BB Ki Vines is about BB and some funny instanc...,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/5yXczLj4QG1sbTSVO..."


In [453]:
video_channel_join[video_channel_join['channel_title']== "Elhiwar Ettounsi"].head()

Unnamed: 0,video_id,trending_date,title,channel_title,category,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,country,channel_description,channel_etag
8499,8T22866ElAc,2018-06-14,Eli Lik Lik Episode 13 Partie 02,Elhiwar Ettounsi,Entertainment,24,2018-06-13 19:12:53,"hkayet tounsia|""elhiwar ettounsi""|""denya okhra...",83317,301,23,25,https://i.ytimg.com/vi/8T22866ElAc/default.jpg,False,False,False,► Retrouvez vos programmes préférés : https://...,DE,Retrouvez tous les replays de vos emissions pr...,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/m8Z8Gh_douetpBEJG..."
8500,8HNuRNi8t70,2018-06-14,Eli Lik Lik Episode 13 Partie 01,Elhiwar Ettounsi,Entertainment,24,2018-06-13 19:01:18,"hkayet tounsia|""elhiwar ettounsi""|""denya okhra...",103339,460,66,51,https://i.ytimg.com/vi/8HNuRNi8t70/default.jpg,False,False,False,► Retrouvez vos programmes préférés : https://...,CA,Retrouvez tous les replays de vos emissions pr...,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/m8Z8Gh_douetpBEJG..."
8501,HI7Xbt25vC4,2018-06-13,Eli Lik Lik Episode 12 Partie 03,Elhiwar Ettounsi,Entertainment,24,2018-06-12 19:21:34,"hkayet tounsia|""elhiwar ettounsi""|""denya okhra...",83394,294,35,46,https://i.ytimg.com/vi/HI7Xbt25vC4/default.jpg,False,False,False,► Retrouvez vos programmes préférés : https://...,FR,Retrouvez tous les replays de vos emissions pr...,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/m8Z8Gh_douetpBEJG..."
8502,XQtDDJBWzfs,2018-06-13,Eli Lik Lik Episode 12 Partie 01,Elhiwar Ettounsi,Entertainment,24,2018-06-12 19:02:29,"hkayet tounsia|""elhiwar ettounsi""|""denya okhra...",108008,501,73,89,https://i.ytimg.com/vi/XQtDDJBWzfs/default.jpg,False,False,False,► Retrouvez vos programmes préférés : https://...,CA,Retrouvez tous les replays de vos emissions pr...,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/m8Z8Gh_douetpBEJG..."
8503,8jvfcBriDk0,2018-06-12,Eli Lik Lik Episode 11 Partie 02,Elhiwar Ettounsi,Entertainment,24,2018-06-11 19:15:06,"hkayet tounsia|""elhiwar ettounsi""|""denya okhra...",102929,347,41,80,https://i.ytimg.com/vi/8jvfcBriDk0/default.jpg,False,False,False,► Retrouvez vos programmes préférés : https://...,DE,Retrouvez tous les replays de vos emissions pr...,"""0UM_wBUsFuT6ekiIlwaHvyqc80M/m8Z8Gh_douetpBEJG..."
