# PSTAT 100 Project Report 
---

# YouTube Top Trending Video Analysis

Harleen Kaur, Amelia Meyer

#### Author contributions

Harleen Kaur contributed to tidying, PCA analysis, correlation matrix analysis, Abstract, Methods 

Amelia Meyer contributed to tidying, exploratory analysis (scatterplots, barplots, wordcloud), Background, Abstract, Data Description


<center><img src='https://cdn.searchenginejournal.com/wp-content/uploads/2020/12/2ff5499e-0bf8-4ffc-b205-e496aca01204-5fe397b4e68ca-1520x800.jpeg' style= 'width:100:px, height:100:px'></center>

## Report Outline
[Introduction](#intro)   
[Data Description](#data_description)   
[Methods](#methods)  
[Results](#results)  
[Discussion](#discussion)  
[Initial explorations](#init_explore)  
[Planned work](#planned_work)  

---
<a name="intro"/>

# Introduction

## Background: Trending Videos on Youtube (US) 

According to Cloudflare, YouTube was the 8th most visited site and the 3rd most visited social media site in 2021 ([see this CBS article for more rankings](https://www.cbsnews.com/news/tiktok-google-facebook-social-media-internet/)). With millions of users in the US and around the world, YouTube has developed its own gravitational force, pulling people in to watch content ranging from makeup tutorials to [scuba divers solving cold cases](https://www.washingtonpost.com/nation/2021/12/10/youtube-scuba-diver-cold-case/). 

The most popular content on YouTube is referred to as 'trending', a term that gained popularity through Twitter's hashtags. Luckily for us, YouTube tracks its [top trending videos](https://www.youtube.com/feed/trending), which, according to Variety magazine, are ranked by user interactions: "number of views, shares, comments and likes". 

## Abstract

The purpose of our project is to analyze the top trending videos on YouTube in the US. Our dataset only contains videos that appeared on the US Top Trending list between August 2020 and February 2022, so our analysis does not extend to YouTube videos in general. Our goal is to get a better understanding of the types of videos that appear on the Top Trending list in the US. In particular, we focus on identifying patterns among the top trending videos. We explore whether certain genres receive more views, likes, or dislikes; whether there are any consistent top-trenders (for example, particular music artists that appear on the top trending list often); and whether a high number of dislikes prompts top-trenders to disable comments on their videos. Through our graphical and statistical analysis of the dataset we found... **FINDINGS**

[More on how YouTube influences the topics of today](https://www.forbes.com/sites/under30network/2017/06/20/why-youtube-stars-influence-millennials-more-than-traditional-celebrities/?sh=7c13084148c6)

---
<a name="data_description"/>

## Dataset Description

The dataset we use for our analysis is the [YouTube Trending Video Dataset](https://www.kaggle.com/rsrishav/youtube-trending-video-dataset), a daily record of the top trending videos on YouTube (updated daily). The dataset was collected by Rishav Sharma and made available through Kaggle for the purpose of data analysis. Sharma lists possible projects such as sentiment analysis, categorization based on video comments and statistics, machine learning to generate comments, predicting the popularity of new videos, and general analysis over time. The data was collected using the [YouTube API](https://developers.google.com/youtube/v3) provided by Google. We focus specifically on the dataset for top trending YouTube videos in the US, which includes up to 200 of the top trending videos per day, spanning from August 12th, 2020 to February 28th, 2022.  

Our dataset consists of 16 columns and 113991 rows. Each row is an observation representing one video that appeared on the top trending YouTube videos list in the US. Each column represents one of the following variables: title, date_published, channel_name, category, trending_date, tags, view_count, likes, dislikes, comment_count, comments_disabled, ratings_disabled, year_published, month_published, year_trending, month_trending. There are no missing values in our dataset. 
 
As we explored our dataset, we noticed a few interesting things. The most viewed and liked video is "Butter" by BTS (an iconic K-pop song). In fact, K-pop music appears in the top charts quite often. Curious as to which genres are most popular, we found that Music, Entertainment, and Gaming tend to have the highest view counts and likes. 

##### Below are tabular representations of the variables in our dataset that we used for analysis and interpretation.

The two boolean variables in our dataset have the following allocations:  

variable name | number of observations with True | number of observations with False   
---|---|--- 
comments_disabled | 1739 | 112252  
ratings_disabled | 779 | 113212  
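
Tallies like the ones in this table can be produced directly from the boolean columns. The following is a minimal sketch on a made-up four-row frame (the `toy` data and the `n_true`/`n_false` column names are illustrative assumptions, not the report's actual code):

```python
import pandas as pd

# Toy stand-in for the trending data (values are made up for illustration)
toy = pd.DataFrame({
    "comments_disabled": [False, False, True, False],
    "ratings_disabled": [False, True, True, False],
})

# Summing a boolean column counts its True values (True is treated as 1)
n_true = toy[["comments_disabled", "ratings_disabled"]].sum()
n_false = len(toy) - n_true

# One row per variable, mirroring the layout of the table above
summary = pd.DataFrame({"n_true": n_true, "n_false": n_false})
print(summary)
```

Because the two counts for each variable must add up to the number of rows, this kind of summary also doubles as a quick consistency check on the dataset size.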

The 16 variables in our dataset are described in the following table:

Name | Variable description | Type | Units of measurement
---|---|---|---
title | video title | object or string | ...
date_published | date video was published | object or string | ...
channel_name | name of channel video was published on | object or string | ...
category | category video falls under; genre of video | object or string | ...
trending_date | date video was trending | object or string | ...
tags | tags attached to video; video identifiers added by video creator | object or string | ...
view_count | number of views video received | int64 or integers | ...
likes | number of likes video received | int64 or integers | ...
dislikes | number of dislikes video received | int64 or integers | ...
comment_count | number of comments on video | int64 or integers | ...
comments_disabled | whether the comments were disabled | bool or true/false | ...
ratings_disabled | whether the ratings were disabled | bool or true/false | ...
year_published | what year the video was published | int64 or integers | 4-digit year value
month_published | what month the video was published | int64 or integers | 1 or 2-digit month value
year_trending | what year the video was trending | int64 or integers | 4-digit year value
month_trending | what month the video was trending | int64 or integers | 1 or 2-digit month value
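
The four year and month variables are derived from the two date columns. The cleaning code is not part of this report, so the following is only a sketch of one way such columns could be built, assuming the dates are stored as `YYYY-MM-DD` strings (the `toy` frame is made up):

```python
import pandas as pd

# Toy stand-in with the two date columns stored as strings (made-up rows)
toy = pd.DataFrame({
    "date_published": ["2020-08-11", "2021-12-03"],
    "trending_date": ["2020-08-12", "2022-01-02"],
})

# Parse each date column and pull out integer year and month parts
for source, suffix in [("date_published", "published"), ("trending_date", "trending")]:
    parsed = pd.to_datetime(toy[source], format="%Y-%m-%d")
    toy[f"year_{suffix}"] = parsed.dt.year
    toy[f"month_{suffix}"] = parsed.dt.month

print(toy)
```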

#### Here are the first four rows and twelve columns of our dataset after performing initial cleaning:

In [6]:
import pandas as pd

# load the cleaned csv
trending = pd.read_csv(r'C:\Users\candy\Documents\PSTAT100FinalProject\trending.csv')

# drop the leftover index column that was saved with the csv
trending.drop(columns=['Unnamed: 0'], inplace=True)

# show the first four rows and twelve columns of the tidy dataset
trending.iloc[:4, :12]

Unnamed: 0,title,date_published,channel_name,category,trending_date,tags,view_count,likes,dislikes,comment_count,comments_disabled,ratings_disabled
0,I ASKED HER TO BE MY GIRLFRIEND...,2020-08-11,Brawadis,People & Blogs,2020-08-12,brawadis|prank|basketball|skits|ghost|funny vi...,1514614,156908,5855,35313,False,False
1,Apex Legends | Stories from the Outlands – “Th...,2020-08-11,Apex Legends,Gaming,2020-08-12,Apex Legends|Apex Legends characters|new Apex ...,2381688,146739,2794,16549,False,False
2,I left youtube for a month and THIS is what ha...,2020-08-11,jacksepticeye,Entertainment,2020-08-12,jacksepticeye|funny|funny meme|memes|jacksepti...,2038853,353787,2628,40221,False,False
3,XXL 2020 Freshman Class Revealed - Official An...,2020-08-11,XXL,Music,2020-08-12,xxl freshman|xxl freshmen|2020 xxl freshman|20...,496771,23251,1856,7647,False,False


## Methods

To better understand the relationship between different types of video interactions, we analyzed the correlations between the variables in the dataset. We hypothesized that there may be a positive relationship between the number of dislikes a video receives and whether its interactions, such as comments, are disabled. In addition, we analyzed the covariation in the trending dataset through Principal Component Analysis (PCA) in order to summarize the correlated variables with a small number of uncorrelated linear combinations and to visualize their loadings. 

---
## Results

### Category Popularity Exploration

![Category Histograms](category_hists.png)

The figures above are bar plots of observation count, views, and likes by category, respectively. Plot one (left) shows the number of videos in each category. Plot two (middle) shows the sum of the view counts for all videos within each category. Plot three (right) shows the sum of the likes for each category. As we can see, Music has the most views and likes in total. This agrees with our analysis of the videos with the most likes (the first 10 of which belong to the Music category) and the most views (the first 4 of which belong to the Music category). We can conclude that music videos are the most popular of the top trending videos in the US during our dataset's timeframe, followed by Entertainment. 
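
The three per-category summaries described above (number of videos, total views, total likes) can be computed in a single grouped aggregation. This is a minimal sketch on made-up numbers; the output column names `n_videos`, `total_views`, and `total_likes` are illustrative assumptions:

```python
import pandas as pd

# Toy stand-in for the trending data (categories and counts are made up)
toy = pd.DataFrame({
    "category": ["Music", "Gaming", "Music", "Entertainment"],
    "view_count": [100, 40, 250, 90],
    "likes": [10, 4, 30, 9],
})

# One row per category: how many videos, and summed views and likes
by_category = (
    toy.groupby("category")
    .agg(
        n_videos=("view_count", "size"),
        total_views=("view_count", "sum"),
        total_likes=("likes", "sum"),
    )
    .sort_values("total_views", ascending=False)
)
print(by_category)
```

Each column of `by_category` corresponds to one of the three panels, so the same table can feed all three plots.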

### Word Cloud

![Wordcloud of Tags](wordcloud_tags.png)

YouTube allows creators to add tags that identify the types of content featured in their videos. We therefore created a word cloud (above) to showcase the most popular tags in our dataset. The popularity of a tag is determined solely by how often it is used within our dataset, and is displayed by the font size within the word cloud, i.e. words in bigger fonts are more popular. We can see that 'among us', 'music video', and 'hip hop' are the three most popular tags. This is consistent with the popularity of the Music and Gaming categories noted earlier. 
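
The usage counts behind a word cloud like this one can be tallied by splitting each video's tag string on the `|` separator seen in the `tags` column. The sketch below uses made-up tag strings; the `count_tags` helper and the `[None]` placeholder for untagged videos are assumptions, and the word cloud library used for the figure may tokenize the text differently:

```python
from collections import Counter

# Made-up tag strings in the same pipe-separated style as the tags column
toy_tags = [
    "among us|gaming|funny",
    "music video|hip hop",
    "Among Us|music video",
    "[None]",  # assumed placeholder for a video without tags
]

def count_tags(tag_strings, placeholder="[None]"):
    """Count how often each lowercased tag appears across all videos."""
    counts = Counter()
    for tags in tag_strings:
        if tags == placeholder:
            continue  # skip videos that carry no tags
        counts.update(t.strip().lower() for t in tags.split("|") if t.strip())
    return counts

tag_counts = count_tags(toy_tags)
print(tag_counts.most_common(3))
```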

### Correlation Matrix

![Correlation Matrix](corrmatrix.png)

The correlation matrix tells us that there is a positive relationship between `likes`, `dislikes`, `view_count`, and `comment_count`. This means that when one type of video interaction is high, the others tend to be high as well. For example, trending videos with more likes also tend to have more dislikes, despite the opposite sentiment. Furthermore, `likes` is slightly negatively correlated with `ratings_disabled` and `comments_disabled`. In other words, the more positive interactions (likes) a video has, the slightly less likely it is that the video's comments or ratings have been disabled by the channel. Another notable relationship is the positive correlation between `view_count` and both `ratings_disabled` and `comments_disabled`. 
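
A correlation matrix of this kind can be computed by treating the boolean flags as 0/1 values and taking pairwise Pearson correlations. The following is a minimal sketch on made-up data, so its numbers do not match the figure above:

```python
import pandas as pd

# Toy stand-in for the numeric and boolean columns (values are made up)
toy = pd.DataFrame({
    "view_count": [100, 400, 250, 900, 50],
    "likes": [10, 35, 30, 80, 2],
    "dislikes": [1, 5, 2, 9, 1],
    "comment_count": [3, 20, 12, 40, 0],
    "comments_disabled": [False, False, False, False, True],
    "ratings_disabled": [False, False, False, False, True],
})

# Cast booleans to 0/1 so every column enters the Pearson correlation
corr = toy.astype(float).corr(method="pearson")
print(corr.round(2))
```

Correlations that involve a 0/1 flag compare the average of the other variable between the two groups, which is why a flag that is rarely `True` can only produce modest correlations in a large dataset.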

### Principal Component Analysis

![PCA](loading_plot.png)

Interpretation of the PCs

PC1: `comment_count`, `view_count`, `dislikes`, and `likes` have the largest positive loadings. Therefore, increasing any of these variables increases a video's PC1 score. This is consistent with the correlation matrix we constructed earlier. We can conclude that PC1 represents counts of video interactions. 

PC2: High values come from videos with comments or ratings disabled. This may reflect the positive relationship between the two disabled-interaction variables.

PC3: A contrast between the two disabled-interaction variables. High values of `comments_disabled` lower this PC, while high values of `ratings_disabled` raise it, even though we observed in the correlation matrix that these two variables are positively correlated.

PC4: `comment_count` has a high positive loading, while `dislikes` and `view_count` have low loadings. This PC may represent a different relationship between video interactions.
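
To make the idea of loadings concrete, here is a minimal sketch of PCA on standardized variables using a singular value decomposition. It runs on synthetic data in which one shared factor drives four count variables, and it leaves out the two disabled flags, so it illustrates only the PC1 pattern described above and is not the analysis behind the loading plot:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
names = ["view_count", "likes", "dislikes", "comment_count"]

# Synthetic stand-in: a single "engagement" factor plus noise per variable
engagement = rng.normal(size=n)
X = np.column_stack([engagement + 0.3 * rng.normal(size=n) for _ in names])

# Standardize each column so PCA works on the correlation structure
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Rows of Vt are the principal directions; columns of `loadings` are the PCs
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
loadings = Vt.T
explained = s**2 / np.sum(s**2)  # share of total variance per PC

# Signs are arbitrary, so flip each PC to make its largest loading positive
for j in range(loadings.shape[1]):
    if loadings[np.argmax(np.abs(loadings[:, j])), j] < 0:
        loadings[:, j] *= -1

for name, value in zip(names, loadings[:, 0]):
    print(f"{name}: PC1 loading = {value:.2f}")
print("variance explained:", explained.round(3))
```

In this synthetic setup all four variables load on PC1 with the same sign, which is the pattern one expects when a set of variables is mutually positively correlated.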

---
## Discussion

**FURTHER: Q1 - here is tricky because you *only* have information on the trending videos, so you can't exactly say that relative to non-trending videos, trending videos have such-and-such properties. but you *can* look for patterns *among* the trending videos, as you've started to do already with the exploratory plots. do they exhibit some relationship between two or more variables? that kind of thing. the limitation is simply that any pattern you *do* find you can't know for sure that it's *distinctive of* trending videos. Q2 - i like this; could you translate the question into more qualitative terms? like i can see you thinking about whether negative feedback might prompt someone to turn off comments and whether that's reflected in the data. so i think the question would be more interesting if it's phrased according to the motivating thought rather than in a stats-y way. could you say something more specific than 'look at'? any particular visualizations you want to make or summaries you want to calculate?**  

