In [1]:
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("videosUS.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41415 entries, 0 to 41414
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   video_id                41415 non-null  object 
 1   trending_date           40949 non-null  object 
 2   title                   40949 non-null  object 
 3   channel_title           40949 non-null  object 
 4   category_id             40949 non-null  float64
 5   publish_time            40949 non-null  object 
 6   tags                    40949 non-null  object 
 7   views                   40949 non-null  float64
 8   likes                   40949 non-null  float64
 9   dislikes                40949 non-null  float64
 10  comment_count           40949 non-null  float64
 11  thumbnail_link          40949 non-null  object 
 12  comments_disabled       40949 non-null  object 
 13  ratings_disabled        40949 non-null  object 
 14  video_error_or_removed  40949 non-null

In [4]:
print("Missing values per column:")
df.isna().sum()

Missing values per column:


video_id                     0
trending_date              466
title                      466
channel_title              466
category_id                466
publish_time               466
tags                       466
views                      466
likes                      466
dislikes                   466
comment_count              466
thumbnail_link             466
comments_disabled          466
ratings_disabled           466
video_error_or_removed     466
description               1036
dtype: int64

  I believe these values are missing because the videos were probably made private or hidden. Another possibility is that a channel containing these 466 videos was included when the data was collected but deleted afterwards, so its data was removed from the dataset for privacy or legal reasons while the video IDs were kept.
This hypothesis is based on https://stackoverflow.com/questions/25716081/can-youtube-ids-be-reissued-after-a-video-is-deleted,
where an answer explains that video IDs are unique (even after the video is deleted) and cannot be reassigned to other videos:
"If YouTube were to ever reuse IDs, it would cause problems such as old links now pointing to new (possibly unlisted) videos. 
There is no advantage in reusing IDs, only problems including privacy problems. It would be an ugly bug."
 
  As for the missing descriptions, I guess this is normal: some videos (entertainment ones, for example) are described well enough by their title, and uploaders who are not official or professional publishers (amateur videos) may simply leave the description empty.
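
One way to probe the hypothesis above is to check whether the missing values all fall on the same rows, i.e. rows where everything except the ID is empty. The sketch below uses a small made-up frame rather than the real file; the helper name `rows_missing_everything` and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "video_id": ["a1", "b2", "c3", "d4"],
    "title": ["Video A", np.nan, "Video C", np.nan],
    "views": [100.0, np.nan, 250.0, np.nan],
    "description": ["text", np.nan, np.nan, np.nan],
})

def rows_missing_everything(frame, id_col="video_id"):
    """Return a boolean mask of rows where every column except id_col is NaN."""
    return frame.drop(columns=[id_col]).isna().all(axis=1)

mask = rows_missing_everything(toy)

# Rows that carry nothing but an ID.
n_empty = int(mask.sum())

# Rows that have data but lack a description.
n_desc_only = int((toy["description"].isna() & ~mask).sum())

print(n_empty, n_desc_only)
```

On the real data, the same mask could be applied to `df`; if its sum matched the per-column missing count, that would suggest the gaps come from one shared set of rows rather than from scattered entries.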

In [5]:
print("Mean, median, and quartiles of each numerical variable in this dataset:")
df.describe()

Mean, median, and quartiles of each numerical variable in this dataset:


       category_id        views      likes   dislikes  comment_count
count      40949.0      40949.0    40949.0    40949.0        40949.0
mean     19.972429    2360785.0    74266.7   3711.401       8446.804
std       7.568327    7394114.0   228885.3   29029.71       37430.49
min            1.0        549.0        0.0        0.0            0.0
25%           17.0     242329.0     5424.0      202.0          614.0
50%           24.0     681861.0    18091.0      631.0         1856.0
75%           25.0    1823157.0    55417.0     1938.0         5755.0
max           43.0  225211900.0  5613827.0  1674420.0      1361580.0


In [6]:
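# Remove views outliers: keep rows whose views z-score is within 3 standard deviations.
# Rows with missing views are dropped too, because a comparison with NaN evaluates to False.
df["views_zscore"] = (df.views - df.views.mean()) / df.views.std()
df = df[abs(df["views_zscore"]) < 3].copy()  # .copy() avoids SettingWithCopyWarning on later column assignments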

In [7]:
# Remove likes outliers from the dataset
df["likes_zscore"] = (df.likes - df.likes.mean()) / df.likes.std()
df = df[abs(df["likes_zscore"]) < 3].copy()

In [8]:
# Remove dislikes outliers from the dataset
df["dislikes_zscore"] = (df.dislikes - df.dislikes.mean()) / df.dislikes.std()
df = df[abs(df["dislikes_zscore"]) < 3].copy()

In [9]:
# Remove comment_count outliers from the dataset
df["comments_zscore"] = (df.comment_count - df.comment_count.mean()) / df.comment_count.std()
df = df[abs(df["comments_zscore"]) < 3].copy()
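
The four outlier cells repeat the same pattern, so they could be folded into one helper. This is only a sketch: `drop_zscore_outliers` is a hypothetical name, the toy data is made up, and unlike the cells above it does not add the `*_zscore` columns to the frame. It filters one column at a time and recomputes the mean and standard deviation on the surviving rows, which is the same order of operations as the cells above.

```python
import pandas as pd

def drop_zscore_outliers(frame, columns, threshold=3.0):
    """Drop rows whose z-score exceeds the threshold, one column at a time.

    Mean and std are recomputed on the rows that survived the previous
    columns, mirroring a cell-by-cell sequential filter.
    """
    out = frame
    for col in columns:
        z = (out[col] - out[col].mean()) / out[col].std()
        # NaN z-scores compare as False, so rows with missing values are dropped.
        out = out[z.abs() < threshold]
    return out.copy()

# Toy data: one extreme views value among otherwise identical rows.
toy = pd.DataFrame({
    "views": [10.0] * 19 + [10_000.0],
    "likes": [float(i) for i in range(20)],
})

cleaned = drop_zscore_outliers(toy, ["views", "likes"])
print(len(toy), "->", len(cleaned))
```

For the real dataset the call would be along the lines of `drop_zscore_outliers(df, ["views", "likes", "dislikes", "comment_count"])`. Because each step recomputes the statistics, the column order can change which rows are removed.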

In [10]:
df.info()
# to compare the count of values before and after removing the outliers (Check cell 3 for the 'before' count).

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38782 entries, 0 to 41413
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   video_id                38782 non-null  object 
 1   trending_date           38782 non-null  object 
 2   title                   38782 non-null  object 
 3   channel_title           38782 non-null  object 
 4   category_id             38782 non-null  float64
 5   publish_time            38782 non-null  object 
 6   tags                    38782 non-null  object 
 7   views                   38782 non-null  float64
 8   likes                   38782 non-null  float64
 9   dislikes                38782 non-null  float64
 10  comment_count           38782 non-null  float64
 11  thumbnail_link          38782 non-null  object 
 12  comments_disabled       38782 non-null  object 
 13  ratings_disabled        38782 non-null  object 
 14  video_error_or_removed  38782 non-null

In [11]:
print("Unique values present in the category_id column are: ", df['category_id'].unique())
print("So there are", df['category_id'].nunique(), "unique categories.")

Unique values present in the category_id column are:  [22. 24. 23. 28.  1. 25. 17. 10. 15. 27. 26.  2. 19. 20. 29. 43.]
So there are 16 unique categories.


I think it would be handier to convert comments_disabled, ratings_disabled, and video_error_or_removed from object
to int, i.e. 0 for 'False' and 1 for 'True'.
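
A minimal sketch of that conversion follows. It assumes the three columns hold either real booleans or the strings 'True'/'False' (the stored form is not shown in this notebook), and it uses pandas' nullable `Int64` dtype so that missing entries stay missing instead of forcing the column to float. The helper name `flags_to_int` and the toy frame are made up for illustration.

```python
import numpy as np
import pandas as pd

FLAG_COLUMNS = ["comments_disabled", "ratings_disabled", "video_error_or_removed"]

def flags_to_int(frame, columns=FLAG_COLUMNS):
    """Map True/'True' to 1 and False/'False' to 0, keeping missing values as <NA>."""
    out = frame.copy()
    for col in columns:
        # astype(str) normalises booleans and strings to 'True'/'False';
        # anything else (including NaN, which becomes 'nan') maps to missing.
        as_text = out[col].astype(str)
        out[col] = as_text.map({"True": 1, "False": 0}).astype("Int64")
    return out

# Toy frame mixing the forms the columns might take.
toy = pd.DataFrame({
    "comments_disabled": [True, False, np.nan],
    "ratings_disabled": ["False", "True", "False"],
    "video_error_or_removed": [False, False, False],
})

converted = flags_to_int(toy)
print(converted.dtypes)
```

A plain `astype(int)` would fail on columns that still contain missing values, which is why the nullable dtype is used here.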

In [12]:
print("There are", df['tags'].nunique(), "unique tag strings.")
print("The 5 most common tag strings are: ")
df['tags'].value_counts().head(5)  # top 5 tag strings
# value_counts() returns the count of each unique value in the column, in descending order.
# Each value here is a video's full pipe-separated tag list, not an individual tag.

There are 5935 unique tag strings.
The 5 most common tag strings are: 


[none]                                                                                                                                                                                                                                                                                                                                                                                                                                                       1500
ABC|"americanidol"|"idol"|"american idol"|"ryan"|"seacrest"|"ryan seacrest"|"katy"|"perry"|"katy perry"|"luke"|"bryan"|"luke bryan"|"lionel"|"richie"|"lionel richie"|"season 16"|"american idol XVI"|"television"|"ad"|"spring"|"2018"|"music"|"reality"|"competition"|"song"|"sing"|"audition"|"auditions"|"performance"|"live"|"fox"|"AI"|"hollywood"|"contestant"|"official"|"american"|"official american idol"|"hollywood week"|"hometown audition"      87
James Corden|"The Late Late Show"|"Colbert"|"late night"|"late night show"|"Stephen Colbert"|"Comedy