# 3. Analyze and Visualize Data

Now that all our data is ready to go and formatted properly, we can explore it. 

## Table of Contents:
* [Quick Look](#ql)
* [Significance of Difference Between the Datasets](#sd)
* [View Counts Across Time](#vc)
* [Conclusion](#conc)




Matplotlib is a popular plotting library in Python that lets us visualize data easily. SciPy lets us run statistical tests on that data.

In [112]:
%matplotlib notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats  # import the stats submodule explicitly; a bare "import scipy" may not load it
from IPython.display import display

Let's begin by reading our data into dataframes. We'll also add a few columns with more data that might be interesting. The Like/Dislike ratio might give us an idea of whether a video is controversial, and the Like/View ratio and Comment/View ratio will give us an idea of engagement on the video. 

In [113]:
mainstream_rows = pd.read_csv("../data/video_metadata_mainstream_media.csv")
other_rows = pd.read_csv("../data/video_metadata_not_mainstream_media.csv")

def add_ratio_column(name_of_column, numerator, denominator):
    mainstream_rows[name_of_column] = mainstream_rows[numerator]/mainstream_rows[denominator]
    other_rows[name_of_column] = other_rows[numerator]/other_rows[denominator]

add_ratio_column('like/dislike ratio', 'video_like_count', 'video_dislike_count')
add_ratio_column('like/view ratio', 'video_like_count', 'video_view_count')
add_ratio_column('dislike/view ratio', 'video_dislike_count', 'video_view_count')
add_ratio_column('comment/view ratio', 'video_comment_count', 'video_view_count')
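One caveat with these ratio columns: a video with zero dislikes or zero views makes the denominator zero, and pandas float division then yields `inf` (or `NaN` for 0/0) rather than raising an error, which can distort the summary statistics later. The sketch below shows the idea in plain Python. `safe_ratio` is a hypothetical helper and the sample counts are made up for illustration; neither is part of the original notebook.

```python
import math

def safe_ratio(numerator, denominator):
    """Return numerator/denominator, or NaN when the ratio is undefined."""
    # Treat missing values (None or NaN) as undefined.
    for value in (numerator, denominator):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return float("nan")
    # A zero denominator is also undefined, instead of producing inf.
    if denominator == 0:
        return float("nan")
    return float(numerator) / float(denominator)

# Hypothetical (likes, dislikes) pairs, made up for illustration only.
sample_counts = [(803, 32), (120, 0), (None, 5)]
ratios = [safe_ratio(likes, dislikes) for likes, dislikes in sample_counts]
print(ratios)
```

In the dataframes themselves, a comparable effect could probably be had by replacing infinite values with `NaN` (for example with `replace([np.inf, -np.inf], np.nan)`) before summarizing.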



# Quick Look <a class="anchor" id="ql"></a>

The `describe()` method in pandas (similar in spirit to `summary()` in R) will give us a quick look at the statistics for our data.

In [114]:
print("Data for Videos Mentioning the Mainstream Media")
display(mainstream_rows.describe(include=[np.number]))
print("Data for all Other Videos")
display(other_rows.describe(include=[np.number]))

Data for Videos Mentioning the Mainstream Media


| | video_view_count | video_like_count | video_dislike_count | video_comment_count | like/dislike ratio | like/view ratio | dislike/view ratio | comment/view ratio |
|---|---|---|---|---|---|---|---|---|
| count | 810.0 | 810.0 | 810.0 | 810.0 | 810.0 | 810.0 | 810.0 | 810.0 |
| mean | 42655.538272 | 1276.022222 | 57.267901 | 355.633333 | 30.783841 | 0.035468 | 0.001612 | 0.010004 |
| std | 65326.972988 | 1546.81196 | 108.488576 | 440.762939 | 19.977975 | 0.013904 | 0.001396 | 0.005512 |
| min | 2093.0 | 83.0 | 3.0 | 8.0 | 0.909502 | 0.002267 | 0.000239 | 0.001118 |
| 25% | 13321.5 | 434.0 | 18.0 | 116.0 | 17.269156 | 0.025367 | 0.000849 | 0.006252 |
| 50% | 23793.0 | 803.5 | 32.0 | 228.5 | 26.47551 | 0.034997 | 0.001267 | 0.008769 |
| 75% | 45632.5 | 1447.5 | 57.0 | 405.0 | 39.352679 | 0.044527 | 0.001882 | 0.01226 |
| max | 910656.0 | 15967.0 | 1486.0 | 4601.0 | 157.315789 | 0.092573 | 0.016096 | 0.051813 |


Data for all Other Videos


| | video_view_count | video_like_count | video_dislike_count | video_comment_count | like/dislike ratio | like/view ratio | dislike/view ratio | comment/view ratio |
|---|---|---|---|---|---|---|---|---|
| count | 18232.0 | 18231.0 | 18231.0 | 18231.0 | 18231.0 | 18231.0 | 18231.0 | 18231.0 |
| mean | 55183.49 | 1127.060776 | 98.282925 | 444.482365 | 22.32155 | 0.02746 | 0.001939 | 0.009934 |
| std | 156024.4 | 2069.857101 | 606.143874 | 1186.709077 | 16.925344 | 0.013487 | 0.002655 | 0.006004 |
| min | 590.0 | 24.0 | 1.0 | 0.0 | 0.31913 | 0.000554 | 6e-05 | 0.0 |
| 25% | 13450.5 | 329.0 | 19.0 | 114.0 | 11.0 | 0.01682 | 0.000895 | 0.005852 |
| 50% | 24725.5 | 645.0 | 35.0 | 230.0 | 18.269231 | 0.025002 | 0.001358 | 0.00857 |
| 75% | 49508.5 | 1243.0 | 73.0 | 462.0 | 29.109275 | 0.036446 | 0.002109 | 0.012446 |
| max | 9827591.0 | 122814.0 | 57228.0 | 92465.0 | 262.0 | 0.096477 | 0.071981 | 0.077675 |


At first glance, we can see that:

1. The mean and maximum view counts differ considerably between the two categories (a mean of about 42,700 vs. 55,200 views, and a maximum of about 0.9 million vs. 9.8 million). It would appear that videos mentioning the mainstream media actually garner less attention than Alex Jones' other videos.

2. The like/dislike ratio also differs markedly across the two datasets (a mean of about 30.8 for videos mentioning the mainstream media vs. 22.3 for the rest). This might serve as a rough measure of "controversy", and might also point towards an echo chamber in which videos are not viewed by people who disagree with the premise in the first place.
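Because view counts are heavily right-skewed, the mean alone can overstate the gap in attention. A rough check, using only the means and medians (the 50% rows) copied from the two tables above, compares the relative gap in each. The variable and function names here are illustrative.

```python
# Mean and median (50% row) of video_view_count, copied from the tables above.
view_counts = {
    "mainstream": {"mean": 42655.538272, "median": 23793.0},
    "other": {"mean": 55183.49, "median": 24725.5},
}

def relative_gap(smaller, larger):
    """Fraction by which `smaller` falls short of `larger`."""
    return (larger - smaller) / larger

mean_gap = relative_gap(view_counts["mainstream"]["mean"],
                        view_counts["other"]["mean"])
median_gap = relative_gap(view_counts["mainstream"]["median"],
                          view_counts["other"]["median"])

print("Gap in means:   {:.1%}".format(mean_gap))
print("Gap in medians: {:.1%}".format(median_gap))
```

The means differ by roughly a fifth, while the medians differ by only a few percent, which suggests that a small number of very popular videos in the larger dataset drives much of the gap in means.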

# Significance of Difference for Each Statistic <a class="anchor" id="sd"></a>

Let's investigate each statistic and see whether or not there is a statistically significant difference. 

In [115]:
headers = ["Variable", "P-Value", "Mainstream Media Videos Mean", "Other Videos Mean"]
t_test_table = []

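# Compare each numeric column across the two datasets with an independent two-sample t-test.
# Rows with missing values are dropped for the test; the reported means use all available rows.
for col in other_rows.select_dtypes(include=[np.number]):
    stat, pvalue = scipy.stats.ttest_ind(mainstream_rows.dropna()[col], other_rows.dropna()[col])
    t_test_table.append([col, '{0:f}'.format(pvalue), round(mainstream_rows[col].mean(), 4), round(other_rows[col].mean(), 4)])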

display(pd.DataFrame(t_test_table, columns=headers))

| | Variable | P-Value | Mainstream Media Videos Mean | Other Videos Mean |
|---|---|---|---|---|
| 0 | video_view_count | 0.022985 | 42655.5383 | 55183.4871 |
| 1 | video_like_count | 0.043062 | 1276.0222 | 1127.0608 |
| 2 | video_dislike_count | 0.054322 | 57.2679 | 98.2829 |
| 3 | video_comment_count | 0.033658 | 355.6333 | 444.4824 |
| 4 | like/dislike ratio | 0.0 | 30.7838 | 22.3215 |
| 5 | like/view ratio | 0.0 | 0.0355 | 0.0275 |
| 6 | dislike/view ratio | 0.000511 | 0.0016 | 0.0019 |
| 7 | comment/view ratio | 0.742869 | 0.01 | 0.0099 |


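Overall, the differences between the two sets of videos are statistically significant at the 0.05 level for every statistic except the dislike count (p ≈ 0.054) and the comment/view ratio (p ≈ 0.74). The p-values shown as 0.0 are simply smaller than the six decimal places printed.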
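Running eight tests at once raises the chance that at least one p-value falls below 0.05 by luck. One common, conservative adjustment is the Bonferroni correction, which compares each p-value against 0.05 divided by the number of tests. The sketch below applies it to the p-values copied from the table above. This is an additional check, not part of the original analysis.

```python
# P-values copied from the t-test table above (0.0 means "smaller than 0.000001").
p_values = {
    "video_view_count": 0.022985,
    "video_like_count": 0.043062,
    "video_dislike_count": 0.054322,
    "video_comment_count": 0.033658,
    "like/dislike ratio": 0.0,
    "like/view ratio": 0.0,
    "dislike/view ratio": 0.000511,
    "comment/view ratio": 0.742869,
}

alpha = 0.05
bonferroni_alpha = alpha / len(p_values)  # 0.05 / 8 = 0.00625

# Variables that pass the unadjusted and the adjusted threshold.
significant_raw = sorted(k for k, p in p_values.items() if p < alpha)
significant_adj = sorted(k for k, p in p_values.items() if p < bonferroni_alpha)

print("Significant at 0.05:", significant_raw)
print("Significant after Bonferroni:", significant_adj)
```

Under this stricter threshold, only the like/dislike, like/view, and dislike/view ratios remain significant; the raw counts do not.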

We can see that videos mentioning the mainstream media get fewer views but more engagement (lower view count, higher like/view ratio). Jones' other videos are more controversial, with a lower like/dislike ratio. This may indicate that videos about the mainstream media are seen mostly by people who agree with his viewpoints, and who are therefore more likely to "like" the video and less likely to "dislike" it.
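Statistical significance says little about how large a difference is, and `ttest_ind` assumes equal variances by default, which the standard deviations in the summary tables call into question. As a rough robustness check, the sketch below computes Cohen's d (a standardized effect size) and Welch's t statistic (which does not assume equal variances) directly from the means, standard deviations, and counts in the tables above. The helper names are illustrative, the count of 18231 assumes the one row with missing values is dropped, and the results are approximate because the inputs are rounded.

```python
import math

# (mean, std, n) copied from the describe() tables above.
stats = {
    "video_view_count": {
        "mainstream": (42655.538272, 65326.972988, 810),
        "other": (55183.49, 156024.4, 18231),
    },
    "like/view ratio": {
        "mainstream": (0.035468, 0.013904, 810),
        "other": (0.02746, 0.013487, 18231),
    },
}

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    mean_a, std_a, n_a = group_a
    mean_b, std_b, n_b = group_b
    pooled_var = ((n_a - 1) * std_a ** 2 + (n_b - 1) * std_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)

def welch_t(group_a, group_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    mean_a, std_a, n_a = group_a
    mean_b, std_b, n_b = group_b
    var_a, var_b = std_a ** 2 / n_a, std_b ** 2 / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t_stat, dof

results = {}
for name, groups in stats.items():
    d = cohens_d(groups["mainstream"], groups["other"])
    t_stat, dof = welch_t(groups["mainstream"], groups["other"])
    results[name] = (d, t_stat, dof)
    print("{}: d = {:.2f}, Welch t = {:.2f}, df = {:.0f}".format(name, d, t_stat, dof))
```

By the usual rule of thumb, the view-count difference is very small in standardized terms (|d| below 0.1), whereas the like/view difference is moderate (d around 0.6). That is consistent with reading the engagement gap as the more substantial of the two findings.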

This points to the possibility that a stronger echo chamber exists around his videos discrediting the mainstream media (MSM). That is, fewer of the people who watch these videos would disagree with his point of view.

# View Counts Across Time for Videos Mentioning MSM <a class="anchor" id="vc"></a>

Finally, we can do a time series analysis to see how view counts have changed over time. 

In [116]:
mainstream_rows.video_publish_date = pd.to_datetime(mainstream_rows['video_publish_date'],
                                                    format='%Y-%m-%d %H:%M:%S.%f')

print("Video View Count Across Time for Videos Mentioning MSM")
mainstream_rows.plot(x='video_publish_date', y='video_view_count',
                    figsize=(8,3), title="Videos Mentioning Mainstream Media",
                     fontsize=10, ylim=(0, 1000000))
plt.show()


Video View Count Across Time for Videos Mentioning MSM


[Figure: line plot of video_view_count against video_publish_date, titled "Videos Mentioning Mainstream Media"]

Videos mentioning the mainstream media started appearing, and being viewed, mainly during and after the US presidential election in November 2016. Indeed, it would seem that attacking the MSM was not nearly as prevalent before the election.
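One way to put a number on this observation is to count how many videos were published in each month and what share appeared on or after election day (November 8, 2016). The sketch below uses the same timestamp format as the cell above. The `publish_dates` list is made up for illustration and would be replaced by the values in the `video_publish_date` column.

```python
from collections import Counter
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S.%f"  # same format passed to pd.to_datetime above
ELECTION_DAY = datetime(2016, 11, 8)

# Hypothetical timestamps, made up for illustration only.
publish_dates = [
    "2016-03-14 18:02:11.000",
    "2016-11-09 02:45:30.000",
    "2016-11-21 16:10:05.000",
    "2017-01-20 20:30:00.000",
]

parsed = [datetime.strptime(text, DATE_FORMAT) for text in publish_dates]

# Number of videos published in each (year, month).
videos_per_month = Counter((d.year, d.month) for d in parsed)

# Share of videos published on or after election day.
share_after = sum(d >= ELECTION_DAY for d in parsed) / float(len(parsed))

for (year, month), count in sorted(videos_per_month.items()):
    print("{}-{:02d}: {}".format(year, month, count))
print("Share on or after election day: {:.0%}".format(share_after))
```

Monthly counts separate "more videos were made" from "each video was watched more", two effects that a single line plot of view counts tends to blur together.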

# Conclusion <a class="anchor" id="conc"></a>

In conclusion, a very cursory overview of these two datasets (videos mentioning MSM vs. videos not mentioning MSM) reveals an interesting pattern of engagement.

While discrediting the mainstream media does not lead to an increased number of views on the videos, it does attract viewers who are more likely to "like" the video and less likely to "dislike" it. In that sense, the audience of these videos may be smaller and more targeted, pointing towards a higher possibility of an echo chamber within this smaller community. As people who agree with Alex Jones watch these videos, the messages in them may be reinforced in the viewer. This fragmentation of viewers, where people's realities are informed by silos of information, may be an interesting phenomenon to investigate further when exploring media manipulation.

Furthermore, the 2016 US Presidential Election seems to have sparked interest in the mistrust of mainstream media. There are more videos being made concerning this as well as more viewers seeking these videos out. 

Some questions that could be investigated with more time and a thorough study include: 
1) Do videos mentioning the mainstream media get shared at higher rates, therefore corroborating the idea that while there are fewer viewers, there are tighter networks of people who view these videos?
2) Is there a way of analyzing the caption track on the video to quantify or qualify what kind of attack on the MSM is being made and how strong it is?
3) Would a longer time-series analysis show us a pattern between an uptick of these kinds of videos and US election cycles? 