# YouTube Analysis

### Aaditya Bhat

## Introduction

The goal of this analysis is to check whether there is any correlation between controversy and view count for YouTube videos.

Controversy is quantified by a parameter called the Controversy Index.
For a given video, the Controversy Index is calculated as follows:



$$
controversyIndex = \left\{ \begin{array}{rl}
 \frac{dislikeCount}{likeCount} &\mbox{ if $likeCount>dislikeCount$} \\[1em]
  \frac{likeCount}{dislikeCount} &\mbox{ otherwise}
       \end{array} \right.
$$

Hence, 

$$0 \leq controversyIndex \leq 1$$

$$controversyIndex = 0 \implies \mbox{video isn't controversial}$$
$$controversyIndex = 1 \implies \mbox{video is highly controversial}$$
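
The definition above amounts to dividing the smaller of the two counts by the larger one. The following is a minimal sketch of that rule, assuming a hypothetical helper named `controversy_index` (it is not part of the original notebook) that returns `None` when both counts are zero, where the index is undefined:

```python
def controversy_index(like_count, dislike_count):
    """Ratio of the smaller count to the larger one, always in [0, 1]."""
    larger = max(like_count, dislike_count)
    if larger == 0:
        # No likes and no dislikes: the index is undefined.
        return None
    return min(like_count, dislike_count) / float(larger)


print(controversy_index(1000, 10))   # mostly liked, so close to 0
print(controversy_index(500, 500))   # evenly split, so highly controversial
```

Because the ratio is symmetric, a video with 10 likes and 1000 dislikes gets the same index as one with 1000 likes and 10 dislikes.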

## Data

The data for this analysis was retrieved using the YouTube Data API. It contains the Category, Title, View Count, Like Count, Dislike Count and Comment Count for 426 videos.

The script used to retrieve the data from the YouTube Data API can be found at

https://github.com/aadityaubhat/youtubeAnalysis

In [13]:
#Importing required libraries
import pandas as pd
from bokeh.plotting import figure, output_file, show, ColumnDataSource, gridplot
from bokeh.models import HoverTool
from bokeh.io import output_notebook
from scipy.stats import pearsonr

#Reading data
dataFrame = pd.read_csv("data.csv")

### Data Cleaning

The raw data is tidy and does not require extensive cleaning. However, a few rows have zero likeCount and zero dislikeCount, so the Controversy Index cannot be calculated for these videos. These rows are removed from the DataFrame and controversyIndex is calculated for the rest.

In [14]:
#Removing rows with 0 likeCount and 0 dislikeCount
dataFrame = dataFrame.query('not (likeCount == 0 and dislikeCount == 0)')

#Calculating controversyIndex for each row.
controversyIndex = []
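for index, row in dataFrame.iterrows():
    #Dividing the smaller count by the larger one keeps the index between 0 and 1
    if row['likeCount'] >= row['dislikeCount']:
        controversyIndex.append(round(float(row['dislikeCount'])/float(row['likeCount']),2))
    else:
        controversyIndex.append(round(float(row['likeCount'])/float(row['dislikeCount']),2))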
        
#Appending controversyIndex column to the original DataFrame
dataFrame['controversyIndex'] = controversyIndex

#Looking at the top 10 rows
dataFrame[:10]

| index | category | title | viewCount | likeCount | dislikeCount | commentCount | controversyIndex |
|---|---|---|---|---|---|---|---|
| 0 | Entertainment | WE MADE IT!! | 1454174 | 78572 | 572 | 10871 | 0.01 |
| 1 | Entertainment | THE SUBSTITUTE TEACHER EXPERIMENT! | 2712496 | 133693 | 16506 | 9627 | 0.12 |
| 2 | Entertainment | Teens React to Fuller House | 2851335 | 53896 | 251527 | 21648 | 0.21 |
| 3 | Entertainment | HUGE SURPRISE!! | 463538 | 36952 | 488 | 4795 | 0.01 |
| 4 | Entertainment | What happened in January? | 393709 | 20856 | 1273 | 1461 | 0.06 |
| 5 | Entertainment | WINTER IS COMING (Bonus) | 252811 | 10806 | 74 | 571 | 0.01 |
| 6 | Entertainment | Bad Nanny | 600307 | 18078 | 1051 | 645 | 0.06 |
| 7 | Entertainment | PRANK ON MY MUM! | 1692506 | 110594 | 728 | 7774 | 0.01 |
| 8 | Entertainment | @FINE BROS SUB COUNT RIP LIVE !!! ROAD TO 13 M... | 1364050 | 37008 | 1255 | 1212 | 0.03 |
| 9 | Entertainment | Rob Kardashian Swipes at Sisters! | 89032 | 1225 | 118 | 119 | 0.1 |
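
The row-by-row loop above can also be expressed with column-wise operations. The following is a minimal sketch and not the notebook's original code: it assumes the column names `likeCount` and `dislikeCount` shown in the table, and it uses a small stand-in DataFrame named `df`, built from the like and dislike counts of the first three rows of that table plus one hypothetical row with no likes or dislikes.

```python
import pandas as pd

# Stand-in data: three rows taken from the table above, plus one
# hypothetical row with zero likes and zero dislikes.
df = pd.DataFrame({
    "likeCount": [78572, 133693, 53896, 0],
    "dislikeCount": [572, 16506, 251527, 0],
})

# Drop rows where both counts are zero (the index is undefined there).
counts = df[["likeCount", "dislikeCount"]]
df = df[counts.max(axis=1) > 0].copy()

# Smaller count divided by larger count, rounded to two decimals.
counts = df[["likeCount", "dislikeCount"]]
df["controversyIndex"] = (counts.min(axis=1) / counts.max(axis=1)).round(2)

print(df)
```

Taking the row-wise minimum and maximum avoids the explicit `if`/`else` branch, since the index is always the smaller count over the larger one.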


## Analysis

### Overall Correlation

Overall correlation between controversyIndex and viewCount is examined by plotting a scatter plot for all the videos. 

The correlation is quantified by calculating Pearson's correlation coefficient and the two-tailed p-value.
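
For reference, Pearson's coefficient is the covariance of the two variables divided by the product of their standard deviations, so it ranges from -1 to 1. The sketch below computes it from that definition in plain Python. The helper name `pearson_r` and the toy lists are illustrative only; the notebook itself relies on `pearsonr` from SciPy, which also returns the p-value.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data: a perfectly linear increasing relationship.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
# Toy data: no linear relationship.
print(pearson_r([1, 2, 3, 4], [1, 2, 2, 1]))
```

A value near zero only rules out a linear relationship; it says nothing about non-linear patterns in the scatter plot.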

In [15]:
# Initializing the bokeh plot

output_notebook(hide_banner=True) 
source = ColumnDataSource(dataFrame)
hover = HoverTool()
hover.tooltips = [('Controversy Index', '@controversyIndex'),('View Count', '@viewCount'), ('Index', '$index'), 
                  ('Title', '@title'), ('Category', '@category')]
p = figure(plot_width = 700, plot_height = 400, title = "controversyIndex vs viewCount", tools = [hover,])


# adding a circle renderer with a size, color, and alpha
p.circle('controversyIndex', 'viewCount', size=5, color="blue", alpha=0.5, source = source )

# adding annotations
p.xaxis.axis_label = "controversyIndex"
p.yaxis.axis_label = "viewCount"
p.left[0].formatter.use_scientific = False

# show the results
show(p)


In [16]:
correlationTuple = pearsonr(dataFrame["viewCount"].tolist(), dataFrame["controversyIndex"].tolist())
print('Correlation Coefficient =  {}'.format(round(correlationTuple[0],2)))
print('P-value for testing non-correlation =  {}'.format(round(correlationTuple[1],2)))

Correlation Coefficient =  0.02
P-value for testing non-correlation =  0.62


The correlation coefficient is very close to zero and the p-value is well above 0.05, indicating no significant linear correlation between controversyIndex and viewCount. This result may be influenced by the outlier, so the Pearson correlation coefficient and p-value are recalculated after eliminating it.

In [17]:
#The outlier's position can be read from the 'Index' field of the hover tooltip.
withoutOutlier = dataFrame.loc[dataFrame.index.delete(123)]
correlationTuple = pearsonr(withoutOutlier["viewCount"].tolist(), 
                            withoutOutlier["controversyIndex"].tolist())
print('After eliminating the outlier,')
print('Correlation Coefficient =  {}'.format(round(correlationTuple[0],2)))
print('P-value for testing non-correlation =  {}'.format(round(correlationTuple[1],2)))

After eliminating the outlier,
Correlation Coefficient =  0.06
P-value for testing non-correlation =  0.25


Even after eliminating the outlier, the correlation remains extremely weak (coefficient 0.06) and is not statistically significant (p-value 0.25).
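
The outlier's position above was read off the plot by hand. If the outlier is simply the video with the most views, which is an assumption here and not something the notebook states, it could instead be located programmatically. The sketch below uses a hypothetical toy DataFrame `df` whose index has a gap, as can happen after rows are filtered out, to show that `idxmax` returns an index label (suitable for `drop`) and not a position.

```python
import pandas as pd

# Hypothetical toy data; label 2 is missing, as if that row had been filtered out.
df = pd.DataFrame(
    {
        "viewCount": [1200, 950, 5000000, 1800],
        "controversyIndex": [0.05, 0.10, 0.02, 0.30],
    },
    index=[0, 1, 3, 4],
)

# Assumption: the outlier is the row with the largest viewCount.
outlier_label = df["viewCount"].idxmax()

# drop() works on labels, so the label returned by idxmax() can be used directly.
without_outlier = df.drop(outlier_label)

print(outlier_label)
print(without_outlier)
```

In this toy data the outlier sits at position 2 but carries label 3, which is why mixing up labels and positions can silently remove the wrong row.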

### Correlation within various categories

Correlation between controversyIndex and viewCount within various categories is examined by plotting the scatter plots and calculating Pearson's correlation coefficient and p-value for each category.

In [18]:
#Creating an empty list to store plots
plotList = []

#Iterating through all the categories to create individual scatter plots
for i in set(dataFrame['category'].tolist()):
    #Setting up the plot properties
    hover = HoverTool()
    hover.tooltips = [('Controversy Index', '@controversyIndex'),('View Count', '@viewCount'), ('Index', '$index'), 
                  ('Title', '@title')]
    source = ColumnDataSource(dataFrame[dataFrame['category'] == i])
    fig = figure(plot_width = 300, plot_height = 300, tools = [hover,'reset'], title = i)
    #Creating the scatter plot
    fig.circle('controversyIndex', 'viewCount', size=5, color="blue", alpha=0.5, source = source )
    fig.xaxis.axis_label = "controversyIndex"
    fig.yaxis.axis_label = "viewCount"
    fig.left[0].formatter.use_scientific = False
    #Adding the plot to plotList
    plotList.append(fig)

#Arranging the plots to form a grid with up to three plots per row
gridList = [plotList[i:i+3] for i in range(0, len(plotList), 3)]
        
#Calculating correlation coefficient and p-value for each category
corellationDict = {}
for i in set(dataFrame['category'].tolist()):
    #Selecting the rows of the current category once and reusing them
    categoryData = dataFrame[dataFrame['category'] == i]
    coefficient, pValue = pearsonr(categoryData["viewCount"].tolist(), 
                                   categoryData["controversyIndex"].tolist())
    corellationDict[i] = {'correlationCoefficient' : coefficient, 'pValue' : pValue}
categoryCorrDataFrame = pd.DataFrame(corellationDict).transpose()

#Showing the plots
show(gridplot(children = gridList))

#Viewing correlation and p-value for each category
categoryCorrDataFrame


| category | correlationCoefficient | pValue |
|---|---|---|
| Autos & Vehicles | 0.442053 | 0.018509 |
| Education | 0.188563 | 0.346229 |
| Entertainment | 0.436196 | 0.015965 |
| Film & Animation | 0.046535 | 0.814099 |
| Gaming | 0.152073 | 0.42242 |
| Howto & Style | 0.087152 | 0.646994 |
| Movies | -0.056342 | 0.803332 |
| Music | 0.101032 | 0.595264 |
| News & Politics | 0.340241 | 0.088992 |
| Nonprofits & Activism | -0.209134 | 0.295143 |


The categories with a statistically significant correlation between controversyIndex and viewCount (p-value below 0.05) are selected below.

In [19]:
categoryCorrDataFrame.query('pValue <0.05')

| category | correlationCoefficient | pValue |
|---|---|---|
| Autos & Vehicles | 0.442053 | 0.018509 |
| Entertainment | 0.436196 | 0.015965 |
| Pets & Animals | 0.404894 | 0.026453 |
| Sports | 0.498895 | 0.006883 |
| Travel & Events | 0.500906 | 0.005644 |


The above categories have a positive correlation between controversyIndex and viewCount, with coefficients between roughly 0.40 and 0.50. Their p-values are all below 0.05, indicating that the correlation is statistically significant at the 5% level.
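
For reference, the two-tailed p-value attached to a Pearson coefficient corresponds to the standard test of the null hypothesis of zero correlation. As a sketch, assuming approximately normally distributed variables and writing $r$ for the sample coefficient and $n$ for the number of videos in a category (symbols introduced here for illustration), the test statistic is

$$
t = r \sqrt{\frac{n - 2}{1 - r^{2}}}
$$

which follows a Student's t-distribution with $n - 2$ degrees of freedom under the null hypothesis. For a fixed $r$, the magnitude of $t$ grows with $n$, so the same coefficient is more likely to be significant in a category with more videos.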

## Conclusion

This analysis found the following:
* There is no statistically significant overall correlation between controversyIndex and viewCount (coefficient 0.02, p-value 0.62).

* 'Autos & Vehicles', 'Entertainment', 'Pets & Animals', 'Sports' and 'Travel & Events' have a weak-to-moderate positive correlation between controversyIndex and viewCount (coefficients of roughly 0.40 to 0.50), which is statistically significant.

However, this analysis is limited by the size and bias of the underlying data. For more reliable results, a larger, randomly sampled dataset should be used.