In [14]:
# libraries
import numpy as np
import pandas as pd
# !pip install altair;
import altair as alt
import datetime
# !pip install scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import add_dummy_feature

alt.data_transformers.disable_max_rows();

# PSTAT 100 Project Plan Report

0. Background
1. Data description
2. Initial explorations
3. Planned work

## Group Information

**Group members**: Harleen Kaur (Member 1), Amelia Meyer (Member 2)

**Contributions**:
1. Member 1 ...
2. Member 2 ...

## Background: Trending Videos on YouTube (US)

<center><img src='https://fs-prod-cdn.nintendo-europe.com/media/images/10_share_images/games_15/nintendo_switch_download_software_1/H2x1_NSwitchDS_YouTube.jpg' style= 'width:100:px, height:100:px'></center>


According to Cloudflare, YouTube was the 8th most visited site and the 3rd most visited social media site in 2021 ([see this CBS article for more rankings](https://www.cbsnews.com/news/tiktok-google-facebook-social-media-internet/)). With millions of users in the US and around the world, YouTube has developed its own gravitational force, pulling people in to watch content ranging from makeup tutorials to [scuba divers solving cold cases](https://www.washingtonpost.com/nation/2021/12/10/youtube-scuba-diver-cold-case/).

The most popular content on YouTube is referred to as 'trending', a term that gained popularity due to Twitter's hashtags. Luckily for us, YouTube tracks its [top trending videos](https://www.macmillandictionaryblog.com/trending), which, according to Variety magazine, are measured by user interactions "(number of views, shares, comments and likes)".

The purpose of our project is to analyze the top trending videos on YouTube in the US using a dataset that records the top trending videos each day. Our goal is to better understand the types of videos users are interested in **...come back and add other things we want to do with the data**.

[View YouTube's Trending Feed](https://www.youtube.com/feed/trending)  
[Check out the influence of YouTube on today's topics](https://www.forbes.com/sites/under30network/2017/06/20/why-youtube-stars-influence-millennials-more-than-traditional-celebrities/?sh=7c13084148c6)


# CHECK
*Introduce the topic of your project.*
* What area or areas of study are you in dialogue with for your project?
* What is your data about, broadly? 
* What is the motivation for collecting the kind of data you're working with, and what sorts of things could you potentially learn?

You may find it useful to write up the data description first, think about what the reader should know before they peek at your dataset, and then come back to the background section. I often write the background sections of your assignments last, once I have a sense of what kind of information would be most useful going into the assignment.

# Data Description

## Basic Information

#### General Description

The dataset we will be using for our analysis is the [YouTube Trending Video Dataset](https://www.kaggle.com/rsrishav/youtube-trending-video-dataset), which documents the daily record of the top trending videos on YouTube (updated daily).

#### Source
Our dataset was collected by Rishav Sharma and made available through Kaggle for the purpose of user analysis. Rishav lists possible projects such as sentiment analysis, categorization based on video comments and statistics, machine learning to generate comments, predicting the popularity of new videos, and general analysis over time.

#### Collection Methods
The data was collected using the [YouTube API](https://developers.google.com/youtube/v3) provided by Google. The API returns video metadata in response to structured requests, so this is API-based collection rather than web scraping.
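
As a rough illustration of what an API-based request can look like, the sketch below builds a request URL for the `videos` endpoint of the YouTube Data API v3 with `chart=mostPopular`. This is an assumption about how such data could be requested, not the dataset author's actual collection script; the helper `build_trending_url` and the placeholder key are hypothetical, and no request is sent.

```python
from urllib.parse import urlencode

# Base endpoint of the YouTube Data API v3 "videos" resource.
API_ENDPOINT = "https://www.googleapis.com/youtube/v3/videos"


def build_trending_url(api_key, region_code="US", max_results=50):
    """Build a request URL for the most popular videos in a region.

    Hypothetical helper for illustration only; it is not taken from the
    dataset author's code.
    """
    params = {
        "part": "snippet,statistics",   # titles, tags, and view/like counts
        "chart": "mostPopular",         # the most popular videos chart
        "regionCode": region_code,      # restrict to one country
        "maxResults": max_results,      # the API allows at most 50 per page
        "key": api_key,                 # placeholder; a real key is required
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


url = build_trending_url("YOUR_API_KEY")
print(url)
```

Because each page holds at most 50 results, a daily list of up to 200 videos would need several paged requests.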

#### Sampling Design and Scope of Inference
We'll be focusing specifically on the dataset for top trending YouTube videos in the US. The dataset includes several months of data, with up to 200 trending videos listed per day. The population is all YouTube videos available in the US. The sampling frame is the set of videos that appeared on the US top trending list. Our scope of inference will include videos that are available to users in the US.
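
The date coverage and the "up to 200 per day" figure can be checked directly from the `trending_date` column. The sketch below shows one way to do that on a small made-up table; `toy` is a hypothetical stand-in for `trending`, and its values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the trending table (values are made up).
toy = pd.DataFrame({
    "title": ["a", "b", "c", "a", "d"],
    "trending_date": ["2020-08-12", "2020-08-12", "2020-08-12",
                      "2020-08-13", "2020-08-13"],
})

# Number of trending records listed on each day.
per_day = toy.groupby("trending_date").size()

# First and last trending date, and the largest daily count.
# ISO-formatted date strings sort correctly as plain strings.
date_range = (per_day.index.min(), per_day.index.max())
max_per_day = int(per_day.max())

print(per_day)
print(date_range, max_per_day)
```

On the real data, the same three lines applied to `trending` would show how many days are covered and whether any day exceeds 200 rows.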



# CHECK 
Indicate the relevant population. If identifiable from data documentation, state the sampling frame and sampling mechanism and indicate the scope of inference. If no information is available about the sampling design, indicate this instead, and discuss the extent to which having no scope of inference is a limitation for the particular topic you're investigating.

## Data Semantics and Structure

#### Units and Observations
The observational units are the trending videos. One unit is one video that appeared on the top trending YouTube videos list in the US. 

#### Variable descriptions

If your dataset is large and you'll only work with a subset of the total available variables, limit your attention to the variables that you'll work with. Here's a template you can work with:

Name | Variable description | Type | Units of measurement
---|---|---|---
title | video title | object or string | ...
date_published | date video was published | object or string | ...
channel_name | name of channel video was published on | object or string | ...
category | category video falls under; genre of video | object or string | ...
trending_date | date video was trending | object or string | ...
tags | tags attached to video; video identifiers added by video creator | object or string | ...
view_count | number of views video received | int64 or integers | ...
likes | number of likes video received | int64 or integers | ...
dislikes | number of dislikes video received | int64 or integers | ...
comment_count | number of comments on video | int64 or integers | ...
comments_disabled | whether the comments were disabled | bool or true/false | ...
ratings_disabled | whether the ratings were disabled | bool or true/false | ...
year_published | what year the video was published | int64 or integers | 4-digit year values
month_published | what month the video was published | int64 or integers | 1 or 2-digit month value
year_trending | what year the video was trending | int64 or integers | 4-digit year value
month_trending | what month the video was trending | int64 or integers | 1 or 2-digit month value
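
The year and month columns in the table can be derived from the two date columns. The sketch below is one possible way to do this with `pd.to_datetime`, assuming the dates are stored as ISO-formatted strings as in the example rows; `toy` is a hypothetical stand-in for the real table.

```python
import pandas as pd

# Toy stand-in with the two string-typed date columns (values are made up).
toy = pd.DataFrame({
    "date_published": ["2020-08-11", "2021-05-21"],
    "trending_date": ["2020-08-12", "2021-05-22"],
})

# Parse the strings into datetime values.
published = pd.to_datetime(toy["date_published"])
trending_on = pd.to_datetime(toy["trending_date"])

# Extract integer year and month parts for each date.
toy["year_published"] = published.dt.year
toy["month_published"] = published.dt.month
toy["year_trending"] = trending_on.dt.year
toy["month_trending"] = trending_on.dt.month

print(toy)
```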


#### Example rows

In [10]:
# load new csv
trending = pd.read_csv(r'C:\Users\candy\Documents\PSTAT100FinalProject\trending.csv')

# drop the unnamed index column that was read in from the csv
trending.drop(columns=['Unnamed: 0'], inplace=True)

# print a few example rows of dataset in tidy format
trending.head(4)

Unnamed: 0,title,date_published,channel_name,category,trending_date,tags,view_count,likes,dislikes,comment_count,comments_disabled,ratings_disabled,year_published,month_published,year_trending,month_trending
0,I ASKED HER TO BE MY GIRLFRIEND...,2020-08-11,Brawadis,People & Blogs,2020-08-12,brawadis|prank|basketball|skits|ghost|funny vi...,1514614,156908,5855,35313,False,False,2020,8,2020,8
1,Apex Legends | Stories from the Outlands – “Th...,2020-08-11,Apex Legends,Gaming,2020-08-12,Apex Legends|Apex Legends characters|new Apex ...,2381688,146739,2794,16549,False,False,2020,8,2020,8
2,I left youtube for a month and THIS is what ha...,2020-08-11,jacksepticeye,Entertainment,2020-08-12,jacksepticeye|funny|funny meme|memes|jacksepti...,2038853,353787,2628,40221,False,False,2020,8,2020,8
3,XXL 2020 Freshman Class Revealed - Official An...,2020-08-11,XXL,Music,2020-08-12,xxl freshman|xxl freshmen|2020 xxl freshman|20...,496771,23251,1856,7647,False,False,2020,8,2020,8


---
## 2. Initial explorations

At this stage, you may spend most of your effort on the computing side tidying up the data. You're not expected to complete a thorough exploratory analysis, and if your dataset was especially messy to start with, you may not even begin your exploratory analysis by the time you prepare this report. You have the option to leave exploration for the next stage of work and simply report basic properties of the dataset, but you should at minimum address the items in the 'basic properties' section below.

### Basic properties of the dataset

Help the reader get acquainted with your dataset on a simple level by identifying characteristics of the dataset and variable summaries. Some amount of code is fine here, but try to use code cells sparingly.

**Dimensions**: state the dimensions of the data (in tidy format, of course).

**Missing values**: Are there missing values? If so, why are they missing?

**Variable summaries**: Provide simple variable summaries for the most important variables in your dataset. Preferably, you'll do this for all variables, but if you have a large number, you might need to prioritize and focus on the ones most of interest. What exactly you do is a little case-specific, but think of things like means and variances, min/max, number of levels and observation counts for categorical variables, etc.

*Start your draft here.*
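
A minimal sketch of how the three items above (dimensions, missing values, variable summaries) could be computed with pandas. The frame `toy` is a hypothetical stand-in for `trending` with invented values, including one deliberately missing entry.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the trending table (values are made up).
toy = pd.DataFrame({
    "category": ["Music", "Gaming", "Music", "Entertainment"],
    "view_count": [400, 250, 100, 50],
    "likes": [40, 10, np.nan, 5],
    "comments_disabled": [False, False, True, False],
})

# Dimensions: number of rows (observations) and columns (variables).
n_rows, n_cols = toy.shape

# Missing values: count of NaN entries in each column.
missing_per_column = toy.isna().sum()

# Numeric summaries: count, mean, std, min, quartiles, max.
numeric_summary = toy[["view_count", "likes"]].describe()

# Categorical summary: number of observations per level.
category_counts = toy["category"].value_counts()

print(n_rows, n_cols)
print(missing_per_column)
print(numeric_summary)
print(category_counts)
```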

In [16]:
# Most viewed video (idxmax returns an index label, so look it up with .loc)
trending.loc[trending.view_count.idxmax(), 'title']

"BTS (방탄소년단) 'Butter' Official MV"

In [17]:
# Most liked video (idxmax returns an index label, so look it up with .loc)
trending.loc[trending.likes.idxmax(), 'title']

"BTS (방탄소년단) 'Butter' Official MV"

In [18]:
# Most disliked video (idxmax returns an index label, so look it up with .loc)
trending.loc[trending.dislikes.idxmax(), 'title']

"BLACKPINK - 'Ice Cream (with Selena Gomez)' M/V"

### Exploratory analysis

If you were lucky and your dataset was neat, you should aim to include a few exploratory plots or tables here -- they don't need to be polished at this stage, but you should select plots that are informative (rather than including all plots you may have looked at). 

If you do include exploratory graphics or tables, please explain in a sentence or two what each one shows. Try to include a minimum of code. Consider [saving your plots as images](https://altair-viz.github.io/user_guide/saving_charts.html#png-svg-and-pdf-format) and inputting images into markdown cells instead of generating them anew via code cells.
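
One simple exploratory table that fits here is a per-category summary. The sketch below groups a made-up frame by `category` and reports the number of videos and the median view and like counts; `toy` is a hypothetical stand-in for `trending`, and the choice of the median is an assumption made because view counts tend to be heavily skewed.

```python
import pandas as pd

# Toy stand-in for the trending table (values are made up).
toy = pd.DataFrame({
    "category": ["Music", "Music", "Gaming", "Gaming", "Entertainment"],
    "view_count": [400, 100, 250, 350, 50],
    "likes": [40, 20, 10, 30, 5],
})

# One row per category: how many videos, and typical views and likes.
by_category = (
    toy.groupby("category")
       .agg(n_videos=("view_count", "size"),
            median_views=("view_count", "median"),
            median_likes=("likes", "median"))
       .sort_values("median_views", ascending=False)
)

print(by_category)
```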

---
## 3. Planned work

Here you should indicate your tentative ideas for your analysis. Don't worry, these aren't final -- you can always change your mind later or shift gears if they don't pan out. The objective is to have you start thinking ahead about what you'll do.

### Questions

Please propose two focused questions that you plan to explore.

1. Is there a pattern to which videos make it onto the top trending video list? 
2. Can we predict whether a video will make it onto the top trending video list?

### Proposed approaches

For each question, please describe an idea or two about how you might approach the question.

1. Look at the variables associated with each video, such as its tags, category, and the words used in its title.
2. *Approach 2 here*
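
Approach 1 could start from simple token counts. The sketch below splits the pipe-separated `tags` strings (the format visible in the example rows) and the words in `title`, then counts how often each appears. `toy` is a hypothetical stand-in for `trending` with invented values, and lowercasing titles before counting is an assumption.

```python
import pandas as pd

# Toy stand-in for the trending table (values are made up).
toy = pd.DataFrame({
    "title": ["Funny prank video", "New game trailer", "Funny game moments"],
    "category": ["Comedy", "Gaming", "Gaming"],
    "tags": ["funny|prank", "game|trailer", "funny|game"],
})

# Tags are stored as one pipe-separated string per video. Split each
# string into a list, give every tag its own row, then count occurrences.
tag_counts = toy["tags"].str.split("|").explode().value_counts()

# Title words: lowercase, split on whitespace, one word per row, count.
word_counts = toy["title"].str.lower().str.split().explode().value_counts()

print(tag_counts)
print(word_counts)
```

The same counts could be computed within each category by grouping before counting, which would show whether certain tags or title words are concentrated in particular genres.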

---
## Submission Checklist
1. Save file to confirm all changes are on disk
2. Run *Kernel > Restart & Run All* to execute all code from top to bottom
3. Save file again to write any new output to disk
4. Select *File > Download as > HTML*.
5. Open in Google Chrome and print to PDF on A3 paper in portrait orientation.
6. Submit to Gradescope