# DSCI 521: Data Analysis and Interpretation <br> Term Project Phase 2: YouTube trending page analysis

## Group members 
- Group member 
    - Name: Amira Bendjama
    - Email: ab4745@drexel.edu
- Group member 
    - Name: Thuy Hong Doan
    - Email: td688@drexel.edu
- Group member 
    - Name: Alsulami Meznah
    - Email: mha54@drexel.edu

## Cleaning the YouTube trending dataset

Before starting the analysis, it is important to clean the dataset that the project relies on. The cleaning consisted of:
- Replacing NaN values in the description column, where most of the missing values occur, with a space.
- Deleting any remaining rows with missing values.
- Deleting rows with comments_disabled=True or ratings_disabled=True.
- Fixing the tags column by replacing "[None]" with a space, splitting the tags on '|', and joining each list into one string.
- Dropping duplicate rows that share the same title and video id, keeping the last occurrence.
- Converting the date columns 'publishedAt' and 'trending_date' to datetime type.
- Resetting the index of the dataframe.
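
The steps listed above can be sketched as a single function. This is a minimal illustration on a made-up four-row frame, not the notebook's actual cells: the function name `clean_trending` and the toy data are assumptions, and the column names are taken from the dataset described in this section.

```python
import pandas as pd

def clean_trending(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps to a copy of df (hypothetical helper)."""
    out = df.copy()
    # Fill missing descriptions with a space, then drop rows missing anything else.
    out["description"] = out["description"].fillna(" ")
    out = out.dropna()
    # Keep only videos with both comments and ratings enabled.
    out = out[~out["comments_disabled"] & ~out["ratings_disabled"]].copy()
    # "[None]" marks a video without tags; "|" separates individual tags.
    out["tags"] = out["tags"].replace("[None]", " ").str.replace("|", " ", regex=False)
    # Keep the last row recorded for each (title, video_id) pair.
    out = out.drop_duplicates(subset=["title", "video_id"], keep="last")
    # Parse the timestamp strings into datetime values.
    for col in ("publishedAt", "trending_date"):
        out[col] = pd.to_datetime(out[col], format="%Y-%m-%dT%H:%M:%SZ")
    return out.reset_index(drop=True)

# Toy data (made up for illustration only).
toy = pd.DataFrame({
    "video_id": ["a", "a", "b", "c"],
    "title": ["A", "A", "B", "C"],
    "publishedAt": ["2020-08-11T19:00:10Z"] * 4,
    "trending_date": ["2020-08-12T00:00:00Z", "2020-08-13T00:00:00Z",
                      "2020-08-12T00:00:00Z", "2020-08-12T00:00:00Z"],
    "tags": ["x|y", "x|y", "[None]", "z"],
    "comments_disabled": [False, False, False, True],
    "ratings_disabled": [False, False, False, False],
    "description": ["first", "second", None, "third"],
})
cleaned = clean_trending(toy)
print(cleaned)
```

In the toy frame, video "c" is removed because its comments are disabled, and only the later of the two rows for video "a" survives.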


In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set(style='darkgrid')

US_category_id = pd.read_json('data/US_category_id.json')
trending_youtube = pd.read_csv('data/US_youtube_trending_data.csv')


In [7]:
# getting information about the dataset
trending_youtube.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180390 entries, 0 to 180389
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   video_id           180390 non-null  object
 1   title              180390 non-null  object
 2   publishedAt        180390 non-null  object
 3   channelId          180390 non-null  object
 4   channelTitle       180390 non-null  object
 5   categoryId         180390 non-null  int64 
 6   trending_date      180390 non-null  object
 7   tags               180390 non-null  object
 8   view_count         180390 non-null  int64 
 9   likes              180390 non-null  int64 
 10  dislikes           180390 non-null  int64 
 11  comment_count      180390 non-null  int64 
 12  thumbnail_link     180390 non-null  object
 13  comments_disabled  180390 non-null  bool  
 14  ratings_disabled   180390 non-null  bool  
 15  description        180390 non-null  object
dtypes: bool(2), int64(5), object(9)

In [9]:
# cleaning dataset 
# number of null values in the dataset
trending_youtube.isnull().sum()

video_id             0
title                0
publishedAt          0
channelId            0
channelTitle         0
categoryId           0
trending_date        0
tags                 0
view_count           0
likes                0
dislikes             0
comment_count        0
thumbnail_link       0
comments_disabled    0
ratings_disabled     0
description          0
dtype: int64

Any null values in the dataset are concentrated in the description column. The following cell selects some of the rows whose description is null.

In [11]:
trending_youtube[trending_youtube["description"].isna()].head(3)

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description


In [13]:
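# Replace NaN in description with a space
# (assign the result back instead of calling fillna in place on a column selection)
trending_youtube["description"] = trending_youtube["description"].fillna(" ")
# Delete any remaining rows with missing values
trending_youtube.dropna(inplace=True)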

In [15]:
# check that no NaN values remain
trending_youtube.isnull().sum().sum()

0

In [17]:
trending_youtube.shape

(177001, 16)

Since our main focus is to find the factors that affect whether a video trends, any video with disabled comments or ratings is excluded from our project.

In [19]:
# keep only rows where both comments and ratings are enabled
# (.copy() avoids a SettingWithCopyWarning when columns are assigned later)
trending_youtube = trending_youtube[~trending_youtube['comments_disabled'] &
                                    ~trending_youtube['ratings_disabled']].copy()
trending_youtube.shape

(177001, 16)

In [21]:
trending_youtube["tags"].head(10)

0    brawadis prank basketball skits ghost funny vi...
1    Apex Legends Apex Legends characters new Apex ...
2    jacksepticeye funny funny meme memes jacksepti...
3    xxl freshman xxl freshmen 2020 xxl freshman 20...
4    The LaBrant Family DIY Interior Design Makeove...
5    Professor injury professor achilles professor ...
6                                                     
7                     cgpgrey education hello internet
8    surprising dad father papa with dream car truc...
9    Vengo De Nada Aleman Ovi Big Soto Trap Ovi Nat...
Name: tags, dtype: object

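The tags column is not in a convenient format: videos without tags hold the placeholder "[None]", and individual tags are separated by "|". We therefore replace the placeholder with a space and join the tags into one space-separated string for each row.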

In [23]:
# Replace the "[None]" placeholder in tags with a space
trending_youtube.loc[trending_youtube['tags'] == '[None]', 'tags'] = ' '
# Split tags on '|' and join each list back into one space-separated string
trending_youtube['tags'] = [' '.join(tag)
                            for tag in trending_youtube['tags'].str.split('|')]

trending_youtube["tags"].head(10)

0    brawadis prank basketball skits ghost funny vi...
1    Apex Legends Apex Legends characters new Apex ...
2    jacksepticeye funny funny meme memes jacksepti...
3    xxl freshman xxl freshmen 2020 xxl freshman 20...
4    The LaBrant Family DIY Interior Design Makeove...
5    Professor injury professor achilles professor ...
6                                                     
7                     cgpgrey education hello internet
8    surprising dad father papa with dream car truc...
9    Vengo De Nada Aleman Ovi Big Soto Trap Ovi Nat...
Name: tags, dtype: object
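
Because the displayed rows look the same before and after the cell, a small made-up example may show the transformation more clearly. The sketch below assumes raw tags arrive as "|"-separated strings with "[None]" as the no-tags placeholder, as described in this section. The names `raw_tags`, `flat_tags`, and `joined` are illustrative, and the vectorized `str.replace` variant is an alternative to the notebook's split-and-join, not the cell the notebook ran.

```python
import pandas as pd

# Made-up raw values in the format described above.
raw_tags = pd.Series(["brawadis|prank|basketball", "[None]", "cgpgrey|education"])

# Vectorized variant: swap the placeholder for a space, then the separator for a space.
flat_tags = raw_tags.replace("[None]", " ").str.replace("|", " ", regex=False)

# Split-and-join variant, mirroring the notebook cell.
joined = [" ".join(parts) for parts in raw_tags.replace("[None]", " ").str.split("|")]

print(flat_tags.tolist())
# ['brawadis prank basketball', ' ', 'cgpgrey education']
```

Both variants give the same strings on this input; passing `regex=False` makes sure "|" is treated as a literal character rather than as regex alternation.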

The dataset contains many duplicate rows. We can verify this by comparing the number of unique values in the video id column, obtained with unique(), against the total number of rows. Any row that duplicates the title and video id of another row will be dropped from the dataset.

In [25]:
len(trending_youtube['video_id'].unique())

32449

In [27]:
len(trending_youtube['video_id'])

177001
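
The relationship between these counts can be checked on a small made-up frame. This is only a sketch: `toy` and the variable names are illustrative. It relies on `duplicated()` flagging every occurrence of a value after the first one, so the total row count equals the number of unique ids plus the number of flagged rows.

```python
import pandas as pd

# Made-up ids: "a" appears three times, "b" once, "c" twice.
toy = pd.DataFrame({"video_id": ["a", "a", "a", "b", "c", "c"]})

total_rows = len(toy)
unique_videos = toy["video_id"].nunique()
# duplicated() marks each repeat of an id that was already seen.
repeat_rows = int(toy["video_id"].duplicated().sum())
# value_counts() shows how many times each video appears.
appearances = toy["video_id"].value_counts()

print(total_rows, unique_videos, repeat_rows)  # 6 3 3
```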

In [29]:
# rows whose video id already appeared in an earlier row
duplicates = trending_youtube[trending_youtube['video_id'].duplicated()]
duplicates

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description
751,fITRL6oXWao,Dame Drops 61 As Portland's Playoff Quest Cont...,2020-08-12T04:13:09Z,UCU7iRrk3xfpUk0R6VdyC1Ow,NBA on TNT,17,2020-08-15T00:00:00Z,NBA on TNT NBA Inside the NBA Charles Barkley ...,466256,5636,182,1130,https://i.ytimg.com/vi/fITRL6oXWao/default.jpg,False,False,The Tuesday crew discusses Damian Lillard's cl...
981,JJk1ji23Ey8,Veggie & Fish Tank,2020-08-11T11:08:04Z,UCRxAgfYexGLlu1WHGIMUDqw,JunsKitchen,26,2020-08-16T00:00:00Z,taking cats walk jun junskitchen juns kitchen ...,616413,82601,236,4272,https://i.ytimg.com/vi/JJk1ji23Ey8/default.jpg,False,False,I've been wanting to do aquaponics (growing ve...
1161,W7VK4DUHvKU,"Lil Yachty - Pardon Me ft. Future, Mike WiLL M...",2020-08-11T19:00:10Z,UC1X3TRsCt36QPjF1p5f3HTg,LilYachtyVEVO,10,2020-08-17T00:00:00Z,Lil Yachty Lil Boat 3 Future Lil Yachty Pardon...,1390669,70722,1051,3228,https://i.ytimg.com/vi/W7VK4DUHvKU/default.jpg,False,False,Watch the official video for Lil Yachty & Futu...
1379,rYErvhVhlzU,Did Late Night TV Change in 2020?,2020-08-13T16:00:13Z,UCuo9VyowIT-ljA5G2ZuC6Yw,Eddy Burback,23,2020-08-18T00:00:00Z,eddy burback,378916,44584,439,2686,https://i.ytimg.com/vi/rYErvhVhlzU/default.jpg,False,False,Go to https://buyraycon.com/eddy for 15% off y...
1593,dANJlolAYyA,Machine Gun Kelly - concert for aliens,2020-08-13T17:00:11Z,UC2a9zmrdjvLsrRgi4G4blWw,MGKVEVO,10,2020-08-19T00:00:00Z,Machine Gun Kelly concert for aliens Bad Boy/I...,1834901,127127,4211,8808,https://i.ytimg.com/vi/dANJlolAYyA/default.jpg,False,False,Machine Gun Kelly - concert for aliens is avai...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180245,CwaKooHmJOM,Dallas Cowboys vs. San Francisco 49ers | 2022 ...,2023-01-23T02:58:23Z,UCDVYQ4Zhbm3S2dlz7P1GBDg,NFL,17,2023-01-26T00:00:00Z,,4016354,41846,0,9508,https://i.ytimg.com/vi/CwaKooHmJOM/default.jpg,False,False,Check out our other channels:NFL Mundo https:/...
180257,A1Gy5nJ6GGE,The US Soldiers Leaking Nuclear Secrets | Inve...,2023-01-23T17:00:06Z,UCZaT_X_mc0BI-djXOlfhqWQ,VICE News,25,2023-01-26T00:00:00Z,VICE News VICE News Tonight VICE on HBO news v...,985371,25717,0,3115,https://i.ytimg.com/vi/A1Gy5nJ6GGE/default.jpg,False,False,US nuclear weapons are stored across Europe bu...
180283,UkeGQotnsDU,"Why Fuel Injectors are AWESOME (28,000 fps Slo...",2023-01-22T14:59:32Z,UC6107grRI4m0o2-emgoDnAA,SmarterEveryDay,28,2023-01-26T00:00:00Z,Smarter Every Day Science Physics Destin Sandl...,1264730,57293,0,3638,https://i.ytimg.com/vi/UkeGQotnsDU/default.jpg,False,False,Smarter Every Day on Patreon: http://www.patre...
180291,PKvKOJZcT-s,Realities of Winter In A Ghost Town,2023-01-22T21:00:07Z,UCEjBDKfrqQI4TgzT9YLNT8g,Ghost Town Living,22,2023-01-26T00:00:00Z,,464446,28176,0,2455,https://i.ytimg.com/vi/PKvKOJZcT-s/default.jpg,False,False,To get 5 free travel packs of AG1 and a year s...


A single video can appear almost 7 times in the dataset. A video can stay on the trending page for a while, so each time the data is collected the same video is recorded again, with a different view count, comment count, and other variables. When we drop the duplicates, we therefore keep the last row recorded for each video, which corresponds to the last time it was trending.
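
One point worth noting is that `keep='last'` keeps the last row in the current row order, not necessarily the row with the latest date. The sketch below, on made-up data, sorts by `trending_date` first so that the two coincide. The frame `toy` and the name `latest` are illustrative, and the sort is a precaution assumed here, not a step taken from the notebook.

```python
import pandas as pd

# Made-up rows, deliberately out of date order.
toy = pd.DataFrame({
    "video_id": ["a", "b", "a", "a"],
    "title": ["A", "B", "A", "A"],
    "trending_date": pd.to_datetime(
        ["2020-08-14", "2020-08-12", "2020-08-12", "2020-08-13"]),
    "view_count": [300, 50, 100, 200],
})

# Sort by date so that "last" means "most recent trending day".
latest = (toy.sort_values("trending_date")
             .drop_duplicates(subset=["title", "video_id"], keep="last")
             .reset_index(drop=True))

print(latest[["video_id", "view_count"]])
```

If the source file is already ordered by trending date, the sort changes nothing and the result matches a plain `drop_duplicates(..., keep='last')`.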

In [31]:
trending_youtube[trending_youtube['video_id'] == "J78aPJ3VyNs"]

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description
1399,J78aPJ3VyNs,I left youtube for a month and THIS is what ha...,2020-08-11T16:34:06Z,UCYzPXprvl5Y-Sf0g4vX-m6g,jacksepticeye,24,2020-08-18T00:00:00Z,jacksepticeye funny funny meme memes jacksepti...,3490530,457130,4269,47291,https://i.ytimg.com/vi/J78aPJ3VyNs/default.jpg,False,False,I left youtube for a month and this is what ha...


In [33]:
len(duplicates['video_id'])

1012

In [35]:
trending_youtube.drop_duplicates(subset=['title', 'video_id'], keep='last', inplace=True)

In [37]:
# the last row recorded for this video before it left the trending page
trending_youtube[trending_youtube['video_id'] == "J78aPJ3VyNs"]

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description
1399,J78aPJ3VyNs,I left youtube for a month and THIS is what ha...,2020-08-11T16:34:06Z,UCYzPXprvl5Y-Sf0g4vX-m6g,jacksepticeye,24,2020-08-18T00:00:00Z,jacksepticeye funny funny meme memes jacksepti...,3490530,457130,4269,47291,https://i.ytimg.com/vi/J78aPJ3VyNs/default.jpg,False,False,I left youtube for a month and this is what ha...


In [39]:
len(trending_youtube)

33461

We convert the date columns from object to datetime type so that they are easier to manipulate in the analysis.

In [41]:
trending_youtube.dtypes

video_id                     object
title                        object
publishedAt          datetime64[ns]
channelId                    object
channelTitle                 object
categoryId                    int64
trending_date        datetime64[ns]
tags                         object
view_count                    int64
likes                         int64
dislikes                      int64
comment_count                 int64
thumbnail_link               object
comments_disabled              bool
ratings_disabled               bool
description                  object
dtype: object

In [43]:
trending_youtube[['publishedAt', 'trending_date']].head()

|     | publishedAt | trending_date |
| --- | --- | --- |
| 13 | 2020-08-11 19:00:10 | 2020-08-12 |
| 58 | 2020-08-11 11:08:04 | 2020-08-12 |
| 172 | 2020-08-07 18:30:06 | 2020-08-12 |
| 173 | 2020-08-07 09:30:04 | 2020-08-12 |
| 174 | 2020-08-06 19:47:12 | 2020-08-12 |


In [45]:
# trending_date has object dtype and needs to be converted to datetime
trending_youtube['trending_date'] = pd.to_datetime(trending_youtube['trending_date'], format = "%Y-%m-%dT%H:%M:%SZ")
# publishedAt is converted with the astype function
trending_youtube['publishedAt'] = trending_youtube['publishedAt'].astype('datetime64[ns]')
trending_youtube[['publishedAt', 'trending_date']].head()

|     | publishedAt | trending_date |
| --- | --- | --- |
| 0 | 2020-08-11 19:00:10 | 2020-08-12 |
| 1 | 2020-08-11 11:08:04 | 2020-08-12 |
| 2 | 2020-08-07 18:30:06 | 2020-08-12 |
| 3 | 2020-08-07 09:30:04 | 2020-08-12 |
| 4 | 2020-08-06 19:47:12 | 2020-08-12 |


In [47]:
trending_youtube[['trending_date','publishedAt']].dtypes

trending_date    datetime64[ns]
publishedAt      datetime64[ns]
dtype: object
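
Once both columns hold datetime values, date arithmetic and the `.dt` accessor become available. The sketch below uses two timestamp pairs taken from the rows shown above. The derived columns `days_to_trend` and `publish_hour` are hypothetical names chosen for illustration and are not part of the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "publishedAt": ["2020-08-11T19:00:10Z", "2020-08-07T18:30:06Z"],
    "trending_date": ["2020-08-12T00:00:00Z", "2020-08-12T00:00:00Z"],
})

# Parse both columns with the same explicit format string.
for col in ("publishedAt", "trending_date"):
    toy[col] = pd.to_datetime(toy[col], format="%Y-%m-%dT%H:%M:%SZ")

# Whole days between publication and the trending date (hypothetical column).
toy["days_to_trend"] = (toy["trending_date"] - toy["publishedAt"]).dt.days
# Hour of day at which the video was published (hypothetical column).
toy["publish_hour"] = toy["publishedAt"].dt.hour

print(toy)
```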

Since we dropped rows and changed parts of the dataset, the index of the dataframe is no longer contiguous, so we reset it.

In [48]:
trending_youtube.index

RangeIndex(start=0, stop=33461, step=1)

In [49]:
trending_youtube.reset_index(drop=True, inplace=True)

In [50]:
trending_youtube.index

RangeIndex(start=0, stop=33461, step=1)
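
The effect of `reset_index(drop=True)` is easier to see on a small made-up frame in which filtering leaves gaps in the index. This is a sketch: `toy`, `filtered`, and `renumbered` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({"view_count": [10, 20, 30, 40]})

# Filtering keeps the original labels, so the index no longer starts at 0.
filtered = toy[toy["view_count"] > 15]
print(filtered.index.tolist())    # [1, 2, 3]

# drop=True discards the old labels instead of adding them as a new column.
renumbered = filtered.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```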