### Importing necessary libraries

In [1]:
!pip install opendatasets





In [2]:
pip install --upgrade google-api-python-client

Note: you may need to restart the kernel to use updated packages.




In [3]:
pip install --upgrade google-auth-oauthlib google-auth-httplib2

Note: you may need to restart the kernel to use updated packages.




In [4]:
import opendatasets as od
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
import warnings
import numpy as np
import statistics as st
import plotly.express as px
import google_auth_oauthlib.flow
import googleapiclient.discovery
import googleapiclient.errors
from IPython.display import JSON
#import isodate


warnings.filterwarnings('ignore')

### Loading the dataset from the source

In [5]:
dataset = 'https://www.kaggle.com/datasets/rsrishav/youtube-trending-video-dataset?select=US_youtube_trending_data.csv'

final_dir = '\\'.join(os.getcwd().split('\\')[:-1])

In [6]:
od.download(dataset, data_dir=final_dir)

final_dir += '\\youtube-trending-video-dataset\\US_youtube_trending_data.csv'

Skipping, found downloaded files in "C:\Users\pooja\OneDrive\Desktop\Semester 3\Data 606 Capstone in DataScience\youtube-trending-video-dataset" (use force=True to force download)


In [7]:
data = pd.read_csv(final_dir)

data.head(3)

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description
0,3C66w5Z0ixs,I ASKED HER TO BE MY GIRLFRIEND...,2020-08-11T19:20:14Z,UCvtRTOMP2TqYqu51xNrqAzg,Brawadis,22,2020-08-12T00:00:00Z,brawadis|prank|basketball|skits|ghost|funny vi...,1514614,156908,5855,35313,https://i.ytimg.com/vi/3C66w5Z0ixs/default.jpg,False,False,SUBSCRIBE to BRAWADIS ▶ http://bit.ly/Subscrib...
1,M9Pmf9AB4Mo,Apex Legends | Stories from the Outlands – “Th...,2020-08-11T17:00:10Z,UC0ZV6M2THA81QT9hrVWJG3A,Apex Legends,20,2020-08-12T00:00:00Z,Apex Legends|Apex Legends characters|new Apex ...,2381688,146739,2794,16549,https://i.ytimg.com/vi/M9Pmf9AB4Mo/default.jpg,False,False,"While running her own modding shop, Ramya Pare..."
2,J78aPJ3VyNs,I left youtube for a month and THIS is what ha...,2020-08-11T16:34:06Z,UCYzPXprvl5Y-Sf0g4vX-m6g,jacksepticeye,24,2020-08-12T00:00:00Z,jacksepticeye|funny|funny meme|memes|jacksepti...,2038853,353787,2628,40221,https://i.ytimg.com/vi/J78aPJ3VyNs/default.jpg,False,False,I left youtube for a month and this is what ha...


### Understanding the data

In [8]:
# Checking the shape of the DataFrame
data.shape

(187790, 16)

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 187790 entries, 0 to 187789
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   video_id           187790 non-null  object
 1   title              187790 non-null  object
 2   publishedAt        187790 non-null  object
 3   channelId          187790 non-null  object
 4   channelTitle       187790 non-null  object
 5   categoryId         187790 non-null  int64 
 6   trending_date      187790 non-null  object
 7   tags               187790 non-null  object
 8   view_count         187790 non-null  int64 
 9   likes              187790 non-null  int64 
 10  dislikes           187790 non-null  int64 
 11  comment_count      187790 non-null  int64 
 12  thumbnail_link     187790 non-null  object
 13  comments_disabled  187790 non-null  bool  
 14  ratings_disabled   187790 non-null  bool  
 15  description        183729 non-null  object
dtypes: bool(2), int64(5), object(9)

In [10]:
# Checking null values

def null_values(df):
    temp = df.isna().sum()
    temp_1 = round(temp * 100 / df.shape[0], 2)
    
    return pd.DataFrame((temp, temp_1), index = ['Count', 'Percentage']).T.sort_values('Count', ascending = False)


null_df = null_values(data).reset_index().rename({'index':'Column_name'}, axis =1)
null_df

Unnamed: 0,Column_name,Count,Percentage
0,description,4061.0,2.16
1,video_id,0.0,0.0
2,title,0.0,0.0
3,publishedAt,0.0,0.0
4,channelId,0.0,0.0
5,channelTitle,0.0,0.0
6,categoryId,0.0,0.0
7,trending_date,0.0,0.0
8,tags,0.0,0.0
9,view_count,0.0,0.0


About 2.16% of the values in the `description` column are missing (4061 of 187790 rows).
- We keep the `description` column as it is during analysis and treat its null values during model building.

- The data also lacks some columns that would be very useful for further analysis, such as the `VideoDuration` and `Comments` of each video.
- We used `video_id` to extract this data from the YouTube API.
- The link for the YouTube API is: https://developers.google.com/youtube/v3/quickstart/python

- While extracting the data, we found that video IDs repeat, which produces several rows per video in the dataset.
- There are only 34391 unique videos among the 187790 rows.
#### Reason:
- Videos repeat because they trend for multiple days.
- The combination of `video_id` and `trending_date` tells us how many days a specific video has been trending.
- Consider the example below.


In [11]:
# Let's check the unique video_ids in the `video_id` column
data['video_id'].nunique()

34391

In [15]:
# Example: pick one of the most frequently trending video_ids at random (ranks 11-20) and list its trending_dates
example_id = data['video_id'].value_counts().sort_values(ascending = False).index[np.random.randint(10,20)]

df=data[data['video_id'] == example_id]
df

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description
7401,pvPsJFRGleA,Justin Bieber - Holy ft. Chance The Rapper,2020-09-18T04:00:10Z,UCHkj014U2CQ2Nv0UZeYpE_A,JustinBieberVEVO,10,2020-09-18T00:00:00Z,Justin|Bieber|Holy|RBMG/Def|Jam|Pop,6217404,936304,13331,91366,https://i.ytimg.com/vi/pvPsJFRGleA/default.jpg,False,False,Holy out now: https://JustinBieber.lnk.to/Holy...
7602,pvPsJFRGleA,Justin Bieber - Holy ft. Chance The Rapper,2020-09-18T04:00:10Z,UCHkj014U2CQ2Nv0UZeYpE_A,JustinBieberVEVO,10,2020-09-19T00:00:00Z,justin bieber|bieber|justin|hailey|biebs|biebe...,16239716,1555576,29634,134954,https://i.ytimg.com/vi/pvPsJFRGleA/default.jpg,False,False,Holy out now: https://JustinBieber.lnk.to/Holy...
7806,pvPsJFRGleA,Justin Bieber - Holy ft. Chance The Rapper,2020-09-18T04:00:10Z,UCHkj014U2CQ2Nv0UZeYpE_A,JustinBieberVEVO,10,2020-09-20T00:00:00Z,justin bieber|bieber|justin|hailey|biebs|biebe...,22349477,1851026,39955,149395,https://i.ytimg.com/vi/pvPsJFRGleA/default.jpg,False,False,Holy out now: https://JustinBieber.lnk.to/Holy...
8273,pvPsJFRGleA,Justin Bieber - Holy ft. Chance The Rapper,2020-09-18T04:00:10Z,UCHkj014U2CQ2Nv0UZeYpE_A,JustinBieberVEVO,10,2020-09-22T00:00:00Z,justin bieber|bieber|justin|hailey|biebs|biebe...,28381347,2077535,48096,162536,https://i.ytimg.com/vi/pvPsJFRGleA/default.jpg,False,False,Holy out now: https://JustinBieber.lnk.to/Holy...
8722,pvPsJFRGleA,Justin Bieber - Holy ft. Chance The Rapper,2020-09-18T04:00:10Z,UCHkj014U2CQ2Nv0UZeYpE_A,JustinBieberVEVO,10,2020-09-24T00:00:00Z,justin bieber|bieber|justin|hailey|biebs|biebe...,32772710,2229874,53151,170185,https://i.ytimg.com/vi/pvPsJFRGleA/default.jpg,False,False,Holy out now: https://JustinBieber.lnk.to/Holy...
8949,pvPsJFRGleA,Justin Bieber - Holy ft. Chance The Rapper,2020-09-18T04:00:10Z,UCHkj014U2CQ2Nv0UZeYpE_A,JustinBieberVEVO,10,2020-09-25T00:00:00Z,justin bieber|bieber|justin|hailey|biebs|biebe...,34763995,2288738,55087,173256,https://i.ytimg.com/vi/pvPsJFRGleA/default.jpg,False,False,Holy out now: https://JustinBieber.lnk.to/Holy...
9132,pvPsJFRGleA,Justin Bieber - Holy ft. Chance The Rapper,2020-09-18T04:00:10Z,UCHkj014U2CQ2Nv0UZeYpE_A,JustinBieberVEVO,10,2020-09-26T00:00:00Z,justin bieber|bieber|justin|hailey|biebs|biebe...,36350151,2331823,56544,175811,https://i.ytimg.com/vi/pvPsJFRGleA/default.jpg,False,False,Holy out now: https://JustinBieber.lnk.to/Holy...
9335,pvPsJFRGleA,Justin Bieber - Holy ft. Chance The Rapper,2020-09-18T04:00:10Z,UCHkj014U2CQ2Nv0UZeYpE_A,JustinBieberVEVO,10,2020-09-27T00:00:00Z,justin bieber|bieber|justin|hailey|biebs|biebe...,37731225,2364416,57780,177132,https://i.ytimg.com/vi/pvPsJFRGleA/default.jpg,False,False,Holy out now: https://JustinBieber.lnk.to/Holy...
9528,pvPsJFRGleA,Justin Bieber - Holy ft. Chance The Rapper,2020-09-18T04:00:10Z,UCHkj014U2CQ2Nv0UZeYpE_A,JustinBieberVEVO,10,2020-09-28T00:00:00Z,justin bieber|bieber|justin|hailey|biebs|biebe...,38942701,2391684,58762,179139,https://i.ytimg.com/vi/pvPsJFRGleA/default.jpg,False,False,Holy out now: https://JustinBieber.lnk.to/Holy...
9745,pvPsJFRGleA,Justin Bieber - Holy ft. Chance The Rapper,2020-09-18T04:00:10Z,UCHkj014U2CQ2Nv0UZeYpE_A,JustinBieberVEVO,10,2020-09-29T00:00:00Z,justin bieber|bieber|justin|hailey|biebs|biebe...,40157565,2419281,60017,180869,https://i.ytimg.com/vi/pvPsJFRGleA/default.jpg,False,False,Holy out now: https://JustinBieber.lnk.to/Holy...


- In the example above, the same video appears once for each day it was trending.
- That means we have to modify the data so that each `video_id` appears only once in the `video_id` column.
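As a quick check of how often each video repeats, the number of distinct trending dates per `video_id` can be counted with a groupby. This is a minimal sketch on a made-up `sample` frame (a stand-in for `data`); the helper name `trending_day_counts` is hypothetical.

```python
import pandas as pd

# Tiny stand-in for `data`: one row per (video_id, trending_date) pair.
sample = pd.DataFrame({
    "video_id": ["a", "a", "a", "b"],
    "trending_date": pd.to_datetime(
        ["2020-09-18", "2020-09-19", "2020-09-20", "2020-09-18"], utc=True
    ),
})


def trending_day_counts(df: pd.DataFrame) -> pd.Series:
    """Return the number of distinct trending dates recorded for each video_id."""
    return df.groupby("video_id")["trending_date"].nunique()


counts = trending_day_counts(sample)
print(counts)
```

Applied to the real dataframe, the length of this result should equal `data['video_id'].nunique()`.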

### Extracting `Duration` and `Comments` from Youtube API

In [None]:
# First let's get unique video ids from the video_id column in dataframe

videos_list = data['video_id'].unique()

In [None]:
# The API_KEY is not displayed for security purposes

api_key = "*********************************"

In [None]:
# This function extracts the 'Duration', the 'madeForKids' flag and the top 'Comments' of a given video_id

def get_video_info(vid_id, api_key):
    api_service_name = "youtube"
    api_version = "v3"

    # Build the API client once and reuse it for both requests
    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey=api_key)

    request = youtube.videos().list(
        part="snippet,contentDetails,statistics,status",
        id=vid_id
    )
    
    response = request.execute()
    
    info_temp = []
    info_temp.append(vid_id)
    info_temp.append(response['items'][0]['contentDetails']['duration'])
    info_temp.append(response['items'][0]['status']['madeForKids'])

    request = youtube.commentThreads().list(
        part="snippet,replies",
        videoId=vid_id,
        textFormat = "plainText",
        order = "relevance",
        maxResults = 20
    )
    response_comments = request.execute()
    
    # Iterate over the returned items so that videos with fewer than 20 comments do not raise an IndexError
    info_temp.append([item['snippet']['topLevelComment']['snippet']['textOriginal'] for item in response_comments['items']])
    
    
    return info_temp



# This function is used to create a dataframe from the list of video_ids in video_list and save it in the repository.
# Start and end number is the range of video_ids you want the data for.

def data_to_csv(videos_list, start_number, end_number):
    print("Code Running")
    video_info = []
    deleted_videos = []
    
    for vid_id in videos_list[start_number : end_number]:
        try:
            video_info.append(get_video_info(vid_id, api_key))
        except Exception:
            # Deleted or private videos, and videos with comments disabled, end up here
            deleted_videos.append(vid_id)
    
    print("Number of videos extracted =", len(video_info))
    print("Number of videos deleted or unavailable =", len(deleted_videos))
    
    df = pd.DataFrame(video_info, columns = ['video_id', 'Duration', 'madeForKids', 'Comments'])
    df.to_csv('./YoutubeDataFiles/Data_' + str(start_number) + '_' + str(end_number) + '.csv')
    
    
# Saving the data to a .csv file by calling the function

#data_to_csv(videos_list, 0, 5000)

- Although the data for all video_ids could be requested in one run, extracting more than 5000 videos at once would cost money.
- Hence, we used different API keys with different start and end numbers, and saved the resulting files in the repository.
- The files are stored in the `./YoutubeDataFiles` directory.
- The .csv files follow the naming convention `Data_startnumber_endnumber.csv`.
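One way to make fewer requests per key, offered as an assumption rather than what this notebook does: the `videos().list` endpoint accepts a comma-separated list of IDs (the API reference documents up to 50 per call), so durations could be fetched in batches. The helper names `chunked` and `fetch_durations` are hypothetical, and comment threads would still need one request per video.

```python
from typing import Dict, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_durations(youtube, ids: Sequence[str]) -> Dict[str, str]:
    """Map video_id -> ISO 8601 duration, using one videos().list call per batch.

    `youtube` is assumed to be a client built with googleapiclient.discovery.build.
    IDs missing from the response (for example deleted videos) are simply absent.
    """
    durations: Dict[str, str] = {}
    for batch in chunked(ids):
        response = youtube.videos().list(
            part="contentDetails", id=",".join(batch)
        ).execute()
        for item in response.get("items", []):
            durations[item["id"]] = item["contentDetails"]["duration"]
    return durations
```

A call such as `fetch_durations(youtube, videos_list[0:5000])` would then issue 100 requests instead of 5000 for the duration field.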

In [None]:
# Let's read in the extracted files and combine them so that they can be merged with the original DataFrame

# extracted_df = 
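The cell above is left unfinished. A minimal sketch of how the chunk files might be combined, assuming they all sit in `./YoutubeDataFiles` and follow the `Data_startnumber_endnumber.csv` convention; the function name `load_extracted` is hypothetical.

```python
import glob
import os

import pandas as pd


def load_extracted(directory: str = "./YoutubeDataFiles") -> pd.DataFrame:
    """Concatenate every Data_<start>_<end>.csv file found in `directory`."""
    paths = sorted(glob.glob(os.path.join(directory, "Data_*_*.csv")))
    if not paths:
        raise FileNotFoundError(f"No Data_*_*.csv files found in {directory!r}")
    # index_col=0 drops the unnamed index column written by DataFrame.to_csv.
    frames = [pd.read_csv(path, index_col=0) for path in paths]
    combined = pd.concat(frames, ignore_index=True)
    # Overlapping ranges would repeat a video, so keep only the first copy.
    return combined.drop_duplicates(subset="video_id").reset_index(drop=True)
```

Note that a list stored in the `Comments` column is read back from CSV as a string, so it may need to be parsed again (for example with `ast.literal_eval`) before use.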

### Modifying the Original Dataframe

In [16]:
# Checking column name and values in the dataframe

data.head(1)

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description
0,3C66w5Z0ixs,I ASKED HER TO BE MY GIRLFRIEND...,2020-08-11T19:20:14Z,UCvtRTOMP2TqYqu51xNrqAzg,Brawadis,22,2020-08-12T00:00:00Z,brawadis|prank|basketball|skits|ghost|funny vi...,1514614,156908,5855,35313,https://i.ytimg.com/vi/3C66w5Z0ixs/default.jpg,False,False,SUBSCRIBE to BRAWADIS ▶ http://bit.ly/Subscrib...


In [None]:
# Checking the datatypes of each column

data.dtypes

Let's make the following changes to the Original Dataframe in 4 major steps:

1. Drop `channelId` column as it is not necessary for the analysis and prediction.
<br>

2. Convert `publishedAt` and `trending_date` to pandas datetime objects. Map each ID in the `categoryId` column to its category name, using the `US_category_id.json` file from the source dataset.
<br>

3. Groupby `video_id` column and apply the following functions to the rest of the columns:
- `title` - <b>Function : MODE</b> - because title is same for each video_id.
- `publishedAt` - <b>Function : MIN</b> - the published date is the same for each video_id, so we simply take the minimum value.
- `channelTitle` - <b>Function : MODE</b> - because channel title is same for each video_id.
- `categoryId` - <b>Function : MODE</b> - because categoryId is same for each video_id.
- `trending_date` - <b>Function : [min(trending_date), max(trending_date)]</b> - this lets us extract features such as `How many days the video took to get into the trending list` and `How many days the video has been trending`.
- `tags` - <b>Function : MODE</b> - the tags are usually the same for each video_id, and the mode keeps the most frequent value.
- `likes` - <b>Function : [min(likes), max(likes)]</b> - this lets us extract features such as `How much the likes increased during the trending period`.
- `comments_disabled` - <b>Function : MODE</b> - because comments_disabled_flag is same for each video_id.
- `ratings_disabled` - <b>Function : MODE</b> - because ratings_disabled_flag is same for each video_id.
<br>
<br>
       
4. Drop the `dislikes` column, as YouTube stopped showing public dislike counts in November 2021.
Reference: https://www.google.com/search?q=when+did+youtube+remove+dislikes&rlz=1C1UEAD_enUS1037US1037&oq=when+did+youtube+remove+&aqs=chrome.0.0i512j69i57j0i512l6j0i22i30l2.6954j1j4&sourceid=chrome&ie=UTF-8
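The per-video aggregation described in step 3 can also be written with pandas named aggregation. This is a sketch on a made-up `sample` frame, not the notebook's own cell: it uses `first` in place of the mode (an assumption that is only safe when the value really is identical across a video's rows) and writes the start and end values straight into separate columns.

```python
import pandas as pd

# Made-up rows: video "a" trends on two days, video "b" on one.
sample = pd.DataFrame({
    "video_id": ["a", "a", "b"],
    "title": ["T1", "T1", "T2"],
    "publishedAt": pd.to_datetime(
        ["2020-09-18T04:00:10Z", "2020-09-18T04:00:10Z", "2020-08-11T19:20:14Z"]
    ),
    "trending_date": pd.to_datetime(
        ["2020-09-18", "2020-09-20", "2020-08-12"], utc=True
    ),
    "likes": [100, 250, 40],
})

summary = (
    sample.groupby("video_id")
    .agg(
        title=("title", "first"),               # stand-in for the mode
        publishedAt=("publishedAt", "min"),
        trending_date_start=("trending_date", "min"),
        trending_date_end=("trending_date", "max"),
        likes_start=("likes", "min"),
        likes_end=("likes", "max"),
    )
    .reset_index()
)
print(summary)
```

Built-in aggregation names such as `"min"` and `"max"` run vectorized, which tends to be noticeably faster than applying a Python function to each group.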



#### Step - 1

In [None]:
# Dropping channelId

data.drop(['channelId'], axis = 1, inplace = True)

# Validating the above code

data.head(3)

#### Step - 2

- `publishedAt` and `trending_date` are of object type. Let's convert them to datetime format.

In [None]:
# Converting 'publishedAt' and 'trending_date' to datetime objects

data[['publishedAt', 'trending_date']] = data[['publishedAt', 'trending_date']].apply(pd.to_datetime)

# Validating the above changes

data.dtypes[['publishedAt', 'trending_date']]

We can see that the `categoryId` column holds the IDs of the respective categories. The category names for these IDs are available in the `US_category_id.json` file.

In [None]:
# Let's import US_category_id.json file and map the category id's respectively

category_path = '\\'.join(final_dir.split('\\')[:-1]) + '\\US_category_id.json'


# Creating a dictionary object which stores the category id and its respective category
category_dict = {}

with open(category_path, 'r') as file:
    json_data = json.load(file)
    for item in json_data['items']:
        category_dict[int(item['id'])] = item['snippet']['title']
    
data['categoryId'] = data['categoryId'].apply(lambda x: category_dict[x])

# Validating the above code
data['categoryId'].head()

#### Step - 3 and 4

In [None]:
# This function converts list to min_max_list as per conditions specified in the above cell

def column_start_end(x: list) -> list:
    return([min(x), max(x)])


# Grouping by 'video_id' and aggregating using functions as specified in the above cell

modified_df = data.groupby('video_id').agg({'title':st.mode, 'publishedAt':np.min, 'channelTitle':st.mode, 'categoryId':st.mode,
              'trending_date': column_start_end, 'tags': st.mode, 'likes': column_start_end,
                'comments_disabled': st.mode, 'ratings_disabled': st.mode}).reset_index()

modified_df.head(2)

In [None]:
modified_df.shape

The resulting dataframe has one row per unique `video_id` (34391 according to the `nunique()` output above) and 10 columns.

Let's make the following changes to the resulting dataframe in 7 steps:
1. Create new columns `trending_date_start` & `trending_date_end` from `trending_date` column.
2. Create new columns `likes_start` & `likes_end` from `likes` column.
3. `tags` contains `[None]` placeholder values. Convert them to null values.
4. Create new column `tagCount` from the `tags` column, which indicates the number of tags used in a particular video.
5. Create new column `hoursTakenToTrend` from `trending_date_start` & `publishedAt` columns.
6. Create new column `trendingDaysDuration` from `trending_date_end` & `trending_date_start` columns.
7. Drop columns `trending_date`, `likes`, `publishedAt` as they are no longer required.
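For steps 5 and 6, it helps to keep in mind how pandas exposes timedelta components: `.dt.seconds` returns only the seconds part of each value (between 0 and 86399) and ignores whole days, while `.dt.total_seconds()` returns the full elapsed time. A small illustration on made-up timestamps:

```python
import pandas as pd

published = pd.Series(pd.to_datetime(["2020-09-16T04:00:00Z"]))
first_trending = pd.Series(pd.to_datetime(["2020-09-18T00:00:00Z"]))

delta = first_trending - published  # 1 day and 20 hours

hours_component_only = delta.dt.seconds / 3600   # drops the whole day
hours_total = delta.dt.total_seconds() / 3600    # full elapsed time

print(hours_component_only.iloc[0])  # 20.0
print(hours_total.iloc[0])           # 44.0
```

For a video that takes more than a day to trend, only the second form gives the real number of hours.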

#### Step - 1

In [None]:
# Create new columns trending_date_start & trending_date_end from trending_date column.

modified_df['trending_date_start'] = modified_df['trending_date'].apply(lambda x: min(x))
modified_df['trending_date_end'] = modified_df['trending_date'].apply(lambda x: max(x))

#### Step - 2

In [None]:
# Create new columns likes_start & likes_end from likes column.

modified_df['likes_start'] = modified_df['likes'].apply(lambda x: min(x))
modified_df['likes_end'] = modified_df['likes'].apply(lambda x: max(x))
# df.drop('likes', axis = 1, inplace = True)

#### Step - 3

In [None]:
# tags contains '[None]' placeholder strings. Converting them to null values.

modified_df['tags'] = modified_df['tags'].apply(lambda x: x if x!= '[None]' else np.nan)

#### Step - 4

In [None]:
# Create new column tagCount from tags column, which indicates the number of tags used in a particular video.

modified_df['tagCount'] = modified_df['tags'].apply(lambda x: 0 if pd.isna(x) else len(x.split('|')))

#### Step - 5

In [None]:
# Create new column hoursTakenToTrend from trending_date_start & publishedAt columns.
# total_seconds() is used because .dt.seconds drops whole days from the difference.

modified_df['hoursTakenToTrend'] = round((modified_df['trending_date_start'] - modified_df['publishedAt']).dt.total_seconds()/(60*60), 1)

#### Step - 6

In [None]:
# Create new column trendingDaysDuration from trending_date_end & trending_date_start columns.

modified_df['trendingDaysDuration'] = (modified_df['trending_date_end'] - modified_df['trending_date_start']).dt.days

#### Step - 7

In [None]:
# Drop columns trending_date, likes, publishedAt as they are no longer required.

modified_df.drop(['trending_date', 'likes', 'publishedAt'], axis = 1, inplace = True)

In [None]:
# Displaying the dataframe after all the changes

modified_df.head(3)

In [None]:
# Displaying shape of Dataframe after all the changes

modified_df.shape

In [None]:
# Displaying the datatypes of the resultant Dataframe

modified_df.dtypes

### Let's merge `modified_df` and `extracted_df`

In [None]:
#final_df = pd.merge(modified_df, extracted_df, on = 'video_id', how = 'inner')

In [None]:
# Displaying shape of final dataframe

#final_df.shape

In [None]:
# Displaying final dataframe

#final_df.head()

The `Duration` column is in ISO 8601 duration format (for example `PT4M13S`). Let's convert it into seconds.
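The conversion can also be done without the `isodate` package. The sketch below is a hand-written parser using only the standard library; it assumes durations made of days, hours, minutes and whole seconds (the shape of the `PT...` strings described above) and rejects anything else. The name `duration_to_seconds` is hypothetical.

```python
import re

# Matches strings such as PT15S, PT4M13S, PT1H2M3S or P1DT2H.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(value: str) -> int:
    """Convert a simple ISO 8601 duration string to a number of seconds."""
    match = _DURATION_RE.match(value)
    if match is None:
        raise ValueError(f"Unsupported duration format: {value!r}")
    parts = {name: int(num) if num else 0 for name, num in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


print(duration_to_seconds("PT4M13S"))   # 253
print(duration_to_seconds("PT1H2M3S"))  # 3723
```

It could be applied to a column with `final_df['Duration'].apply(duration_to_seconds)`.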

In [None]:
# Converting time in 'Duration' column to Seconds

#final_df['Duration'] = final_df['Duration'].apply(lambda x: isodate.parse_duration(x).total_seconds())

# Validating the above changes

#final_df.head(2)

## Exploratory Data Analysis

In [None]:
# Let's check the correlation between the numerical columns

plt.figure(figsize = (10, 8))
sns.heatmap(data[['view_count', 'likes', 'dislikes', 'comment_count', 'daysTakenToTrend', 'tagCount']].corr(), linewidths=.5, annot=True, cmap='coolwarm')
plt.show()

- `view_count` and `likes` are highly correlated: videos with more views tend to have more likes.
- `comment_count` is more strongly correlated with `likes` than with `view_count`.
- `daysTakenToTrend` is not correlated with any other feature. This is interesting, because it suggests that the number of days a video takes to trend cannot be inferred from its comment_count, dislikes, likes or view_count.
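The heatmap cell reads `daysTakenToTrend` and `tagCount` from `data`, but no earlier cell creates these columns on `data`. A minimal sketch of how they might be derived row by row, assuming `publishedAt` and `trending_date` are already datetimes and that `daysTakenToTrend` means the gap between publishing and the first trending date; the function name `add_trend_features` and the `sample` frame are hypothetical.

```python
import pandas as pd


def add_trend_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `df` with daysTakenToTrend and tagCount columns added."""
    out = df.copy()
    # First trending date of each video, repeated on every row of that video.
    first_trending = out.groupby("video_id")["trending_date"].transform("min")
    out["daysTakenToTrend"] = (
        (first_trending - out["publishedAt"]).dt.total_seconds() / 86400
    )
    # '[None]' and missing values both count as zero tags.
    has_tags = out["tags"].notna() & (out["tags"] != "[None]")
    out["tagCount"] = (
        out["tags"].str.count(r"\|").add(1).where(has_tags, 0).astype(int)
    )
    return out


sample = pd.DataFrame({
    "video_id": ["a", "a", "b", "c"],
    "publishedAt": pd.to_datetime([
        "2020-09-16T00:00:00Z", "2020-09-16T00:00:00Z",
        "2020-08-11T12:00:00Z", "2020-08-01T00:00:00Z",
    ]),
    "trending_date": pd.to_datetime(
        ["2020-09-18", "2020-09-19", "2020-08-12", "2020-08-02"], utc=True
    ),
    "tags": ["x|y|z", "x|y|z", "[None]", None],
})

features = add_trend_features(sample)
print(features[["video_id", "daysTakenToTrend", "tagCount"]])
```

Because `trending_date` is stamped at midnight, a video published later on its first trending day gets a slightly negative value under this definition.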

In [None]:
# TODO: likes per view

In [None]:
# Checking number of videos based on each Category

plt.figure(figsize = (10, 4))
sns.countplot(x = data['categoryId'], order = data['categoryId'].value_counts().sort_values(ascending = False).index)
plt.xticks(rotation = 90)
plt.show()

`Entertainment` has the most videos and `Nonprofits & Activism` has the fewest.

In [None]:
plt.figure(figsize = (10, 6))
sns.barplot(data = data, x = 'categoryId', y = 'likes',
            order = data.groupby('categoryId')['likes'].mean().sort_values(ascending = False).index, ci = 0)

plt.xticks(rotation = 90)
plt.show()

- `Pets & Animals` videos have the highest average likes and `News & Politics` videos have the lowest.

In [None]:
plt.figure(figsize = (10, 6))
sns.barplot(data = data, x = 'categoryId', y = 'comment_count',
            order = data.groupby('categoryId')['comment_count'].mean().sort_values(ascending = False).index, ci = 0)

plt.xticks(rotation = 90)
plt.show()

`Music` videos have the highest average comment count and `Nonprofits & Activism` videos have the lowest.

In [None]:
plt.figure(figsize = (10, 6))
sns.barplot(data = data, x = 'categoryId', y = 'daysTakenToTrend',
            order = data.groupby('categoryId')['daysTakenToTrend'].mean().sort_values(ascending = False).index, ci = 0)

plt.xticks(rotation = 90)
plt.show()

It is interesting to note that `News & Politics` videos take less time to trend and `Music`, `Comedy` and `Pets & Animals` videos take more time to trend.

In [None]:
sns.countplot(x = data['comments_disabled'])
plt.show()

In [None]:
plt.figure(figsize = (10, 5))
sns.countplot(x = data[data['ratings_disabled'] == True]['categoryId'], 
              order = data[data['ratings_disabled'] == True].groupby('categoryId')['ratings_disabled'].count().sort_values(ascending = False).index)
plt.xticks(rotation = 90)
plt.show()

In [None]:
plt.figure(figsize = (10, 5))
sns.countplot(x = data[data['tags'] == '[None]']['categoryId'], 
              order = data[data['tags'] == '[None]'].groupby('categoryId')['ratings_disabled'].count().sort_values(ascending = False).index)
plt.xticks(rotation = 90)
plt.show()

In [None]:
# Let's extract weekday from the trendingdate
data['day'] = data['trending_date'].dt.day_name()

In [None]:
# Let's plot number of trending videos for each day of the week
sns.countplot(x = data['day'])
plt.show()

In [None]:
# Several videos trend every day; the question is which videos trend for the largest number of days.

In [None]:
# Modify
plt.figure(figsize = (15,4))
sns.countplot(data = data, x = 'categoryId', hue = 'day')

In [None]:
# Variance of daysTakenToTrend for each category, sorted in ascending order
(data.groupby('categoryId')['daysTakenToTrend'].var().sort_values()).plot.bar()

In [None]:

# Treat zero dislikes as missing so that the ratio does not become infinite
data['like/dislike ratio'] = round(data['likes']/data['dislikes'].replace(0, np.nan), 2)

In [None]:
# Modify
plt.figure(figsize = (8, 10))
sns.histplot(data = data, x = 'like/dislike ratio', y = 'categoryId')