## Lab 2: Data Cleaning and Exploratory Data Analysis


The data we will be working with today contains statistics about viral YouTube videos in the US. The full dataset can be found on [Kaggle](https://www.kaggle.com/datasnaek/youtube-new). Because the full dataset is large, we will work with a smaller subset of it in this lab.

If you would like to scrape your own data on trending videos for other projects or purposes, [here](https://github.com/mitchelljy/Trending-YouTube-Scraper) are some resources and directions to do just that.

Our goal in this lab is to identify whether there are patterns between a video's metadata (channel title, category, release time, tags, etc.) and whether it trends in the US.

In [3]:
import matplotlib.pyplot as plt
#import seaborn as sns
import pandas as pd
import numpy as np


In [4]:
trending_vids = pd.read_csv("USvideos_small.csv")
trending_vids.head()

## Data Cleaning

Before we get into data analysis, we need to inspect the quality of our dataset and identify whether the columns are consistent and in the right format.

Specifically, we will check our dataset to see:

1. Whether there are missing values in our dataset.
2. Whether the types of the columns in our dataset make sense.
3. Whether there are any rows/columns that we want to drop.

**Question 1.1:** Return a series containing the number of missing entries for each column in our dataset, sorted from most to least missing values. 

***Hint***: You may find the `df.isna()` method useful.
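
As a rough illustration of how the hint works, the sketch below counts missing entries per column on a small made-up table. The dataframe `toy` and its column names are hypothetical and are not part of the lab dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table, not the lab dataset.
toy = pd.DataFrame({
    "title": ["a", "b", None, "d"],
    "views": [10, np.nan, np.nan, 40],
    "likes": [1, 2, 3, 4],
})

# isna() returns a boolean frame of the same shape; True marks a missing entry.
mask = toy.isna()

# Summing booleans column by column counts the True values in each column.
missing_per_column = mask.sum()
print(missing_per_column)
```

The result is a series indexed by column name, so the usual series methods can be applied to it afterwards.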

<!--
BEGIN QUESTION
name: q1_1
-->

In [8]:
missing = trending_vids.isna().....
missing

In [None]:
grader.check("q1_1")

**Question 1.2:** For simplicity, we will drop all of the rows from our dataframe that contain missing values. It is also possible to impute missing values by filling them in with the column's mean, median, or mode, or with more complicated methods.

Remove all of the rows from `trending_vids` that contain missing values in any column. Reassign `trending_vids`.

Check that the number of rows in your new dataset is what you expect based on the results of Question 1.1. 
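
To make the difference between dropping and imputing concrete, here is a minimal sketch on a made-up table. The dataframe `toy` and its columns are hypothetical, and mean imputation is just one of the options mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table, not the lab dataset.
toy = pd.DataFrame({
    "views": [10.0, np.nan, 30.0, 50.0],
    "likes": [1.0, 2.0, np.nan, 4.0],
})

# Option A: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option B: impute, here by filling each column's gaps with that column's mean.
imputed = toy.fillna(toy.mean())

print(len(toy), len(dropped), len(imputed))
```

Dropping shrinks the table, while imputing keeps every row but replaces the gaps with estimated values.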

<!--
BEGIN QUESTION
name: q1_2
-->

In [11]:
trending_vids = trending_vids....
trending_vids


In [None]:
grader.check("q1_2")

Next, we will take a look at the types of our columns to check whether all of our column types make sense. 

We can see all of the column types in our dataframe by calling `df.dtypes`.

In [17]:
trending_vids.dtypes

Let's take a closer look at the `object` data types for the 'trending_date' and 'publish_time' columns.

In [18]:
trending_vids.loc[:, ['trending_date', 'publish_time']]

**Question 1.3:** Curiously, the `trending_date` and `publish_time` columns are represented as `object` series instead of `datetime` series, which is what they appear to be at first glance. 

Reassign the `trending_date` and `publish_time` columns to be datetime series in our original dataframe. 
Make sure that both columns are in the format `YYYY-MM-DD`.

You will have to add `20` to the beginning of each date in the `trending_date` column to make the year correct.

***Hint***: You may find the `dayfirst` and `yearfirst` parameters of `pd.to_datetime` useful.
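
The sketch below shows two ways `pd.to_datetime` can handle dates whose day comes before the month. The strings in `raw` use an assumed two-digit-year layout chosen for illustration; check the actual strings in your columns before reusing any of this.

```python
import pandas as pd

# Hypothetical strings in a year.day.month layout with a two-digit year.
raw = pd.Series(["17.14.11", "18.02.01"])

# Prepend the century so the year is unambiguous, then spell out the layout.
parsed = pd.to_datetime("20" + raw, format="%Y.%d.%m")

# dayfirst=True tells the parser to read "14/11/2017" as 14 November 2017.
single = pd.to_datetime("14/11/2017", dayfirst=True)

print(parsed)
print(single)
```

Passing an explicit `format` is the stricter choice; `dayfirst` and `yearfirst` only guide the parser when the layout is ambiguous.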

<!--
BEGIN QUESTION
name: q1_3
-->

In [21]:
trending_vids['trending_date'] = pd.to_datetime(..., dayfirst=..., yearfirst=...)
trending_vids['publish_time'] = pd.to_datetime(...)
trending_vids

In [None]:
grader.check("q1_3")

**Question 1.4:** Lastly, remove the `thumbnail_link` column from our dataframe. We will not be analyzing images in this lab.

<!--
BEGIN QUESTION
name: q1_4
-->

In [27]:
trending_vids = trending_vids.drop(labels=..., axis=...)

In [None]:
grader.check("q1_4")

## Exploratory Data Analysis (EDA)

In this section we provide a set of exercises that will help you get to know the dataset a bit better.
When looking at the viral videos dataset, many questions naturally come to mind:

- Do trending topics change day to day?
- What channels are thriving off of the viral videos?
- Is there a specific time associated with the highest number of viral videos?

Let's find out.

As an aspiring influencer, your friend Alan would like to find patterns associated with viral videos.
This would help him optimize the release time and content of his videos to gain more subscribers.
He knows that you are looking into a dataset of viral videos in the US and decides to ask you for help.

**Question 2.1:** What topics are trending each day? 
Assign `topics` to a table with columns `trending_date`, `category_id`, `count`

**_Hint:_** you can group a table by multiple columns.
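
As a sketch of grouping by more than one column, the example below counts rows for every combination of two keys. The table `toy` and the column names `day` and `genre` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy table, not the lab dataset.
toy = pd.DataFrame({
    "day": ["mon", "mon", "mon", "tue"],
    "genre": ["music", "music", "news", "music"],
    "title": ["a", "b", "c", "d"],
})

# Passing a list of column names creates one group per (day, genre) pair.
# size() counts the rows in each group; reset_index turns the group keys
# back into ordinary columns and names the count column.
counts = (
    toy.groupby(["day", "genre"])
       .size()
       .reset_index(name="count")
)
print(counts)
```

Without `reset_index`, the group keys stay in a two-level index rather than appearing as regular columns.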

<!--
BEGIN QUESTION
name: q2_1
-->

In [31]:
topics = trending_vids.groupby([...]).agg({"title": 'count'})
topics


**Question 2.2:** What are some of the most controversial videos?
Here we will measure the level of controversy associated with each video by
`dislikes` / (`dislikes` + `likes`).
Assign `controversial_vids` to a dataframe containing the
`trending_date`, `title`, `channel_title`, `category_id`, and `publish_time`
of the 5 most controversial videos by this metric.

To help you get started, we already set up another dataframe, `controversy_helper`, 
that includes all of the relevant columns from `trending_vids`.

**_Hint:_** Since the metric we are using isn't a part of the original dataframe, 
you can add the metric as a column first to your table to simplify the process.
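
The general pattern in the hint (derive a column, then rank by it) might look like the following on a made-up table. The names `toy`, `up`, `down`, and `share_down` are hypothetical.

```python
import pandas as pd

# Hypothetical toy table, not the lab dataset.
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "up": [90, 10, 50],
    "down": [10, 30, 50],
})

# Derived column: share of negative reactions among all reactions.
toy["share_down"] = toy["down"] / (toy["down"] + toy["up"])

# Sort by the derived column, largest first, and keep the top two rows.
top_two = toy.sort_values("share_down", ascending=False).head(2)
print(top_two)
```

Once the ranking is done, the helper column can be dropped if it is not wanted in the final table.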


<!--
BEGIN QUESTION
name: q2_2
-->

In [29]:
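relevant_cols = ["trending_date", "title", "channel_title", "category_id", "publish_time", "likes", "dislikes"]
# .copy() makes an independent dataframe, so adding a column below does not trigger a SettingWithCopyWarning.
controversy_helper = trending_vids[relevant_cols].copy()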

controversy_helper["metric"] = .....
controversial_vids = controversy_helper.sort_values(...).drop(...)[...]
controversial_vids


In [None]:
grader.check("q2_2")

**Question 2.3:** Which 3 channels have the largest number of trending videos in this dataset?
Maybe Alan can learn a thing or two from their videos in the future.
Assign `viral_channels` to a list containing the names of the 3 channels with the most trending videos.
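
One possible way to go from grouped counts to a plain Python list is sketched below on a made-up table. The names `toy`, `author`, and `per_author` are hypothetical.

```python
import pandas as pd

# Hypothetical toy table, not the lab dataset.
toy = pd.DataFrame({
    "author": ["x", "y", "x", "z", "x", "y"],
    "title": ["a", "b", "c", "d", "e", "f"],
})

# Count rows per author, then order the groups from most to fewest rows.
per_author = (
    toy.groupby("author")
       .agg({"title": "count"})
       .sort_values("title", ascending=False)
)

# After a groupby the group keys live in the index; slice it and convert to a list.
top_two_authors = per_author.index[:2].tolist()
print(top_two_authors)
```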


<!--
BEGIN QUESTION
name: q2_3
-->


In [34]:
channels_grouped = trending_vids.groupby(....).agg(....).sort_values(..., ascending = False)[:3]
viral_channels = ....
viral_channels


In [None]:
grader.check("q2_3")

**Question 2.4:** What is the distribution of publish times of trending videos? Are viral videos posted at a particular time of day, or is the
distribution relatively uniform?

Assign `viral_times` to a series which contains the hour of every video's publishing time in the dataset. 
You can create a visualization for this, but the autograder tests will only check for the series. 

**Hint:** Try using a lambda function to isolate the hour values.
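
As a small sketch of the idea in the hint, the example below pulls the hour out of a datetime series in two ways. The series `stamps` is made up, and the `.dt` accessor is shown only as an alternative to the lambda approach.

```python
import pandas as pd

# Hypothetical timestamps, not taken from the lab dataset.
stamps = pd.Series(pd.to_datetime(["2017-11-13 17:13:01", "2017-11-14 07:30:00"]))

# Each element is a Timestamp, so a lambda can read its .hour attribute.
hours_apply = stamps.apply(lambda ts: ts.hour)

# The vectorized .dt accessor yields the same values without a Python-level loop.
hours_dt = stamps.dt.hour

print(hours_apply.tolist(), hours_dt.tolist())
```

Both approaches require the column to already hold datetime values rather than strings.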
<!--
BEGIN QUESTION
name: q2_4
-->

In [38]:
viral_times = trending_vids["publish_time"].apply(.....)
viral_times


In [None]:
grader.check("q2_4")


We previously looked at the "controversy" rate for individual videos. Does the like/dislike ratio stay consistent between channels? For every
channel, go ahead and calculate the _overall_ `number of likes / (number of likes + number of dislikes)` across _all_ of its videos.


In [43]:
# Select the numeric columns before summing so that text and date columns are not aggregated.
grouped_channels = trending_vids.groupby("channel_title")[["likes", "dislikes"]].sum()
grouped_channels.head()

**Question 2.5:** Add a column `controversy_metric` to the provided dataframe `grouped_channels`.

We used a very simple formula for this question, but feel free to dig deeper if that interests you! For example, you could explore the average of the per-video ratios, or
whether a few videos drag down a channel's overall ratio.
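
To illustrate why an overall ratio and an average of per-video ratios can disagree, here is a minimal sketch on made-up numbers. The table `toy` and every value in it are hypothetical.

```python
import pandas as pd

# Hypothetical toy table: channel "x" has one large video and one small one.
toy = pd.DataFrame({
    "channel": ["x", "x", "y"],
    "likes": [900, 10, 50],
    "dislikes": [100, 90, 50],
})

# Overall (pooled) ratio: sum the counts per channel first, then divide.
totals = toy.groupby("channel")[["likes", "dislikes"]].sum()
pooled = totals["likes"] / (totals["likes"] + totals["dislikes"])

# Average ratio: divide per video first, then average within each channel.
toy["ratio"] = toy["likes"] / (toy["likes"] + toy["dislikes"])
mean_ratio = toy.groupby("channel")["ratio"].mean()

print(pooled)
print(mean_ratio)
```

The pooled ratio weights each video by how many reactions it received, whereas the averaged ratio treats every video equally, so a single heavily watched video can dominate the former.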

<!--
BEGIN QUESTION
name: q2_5
-->

In [57]:
# Add the controversy_metric column to the dataframe.
grouped_channels["controversy_metric"] = .....
grouped_channels.sort_values("controversy_metric", ascending = False)


In [None]:
grader.check("q2_5")

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)