In this notebook, we study the features of the NELA dataset, using a subset of the dataset (4 news sources).

In [1]:
# Some features can be calculated using: https://github.com/RyleeThompson/unbiasMe/blob/master/helpers.py

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans

df = pd.read_csv("subset.csv")  # Or just use the full csv file
sources = ["CNN", "Fox News", "BBC", "Xinhua"]
# Some column names are referenced with a leading space (e.g. " TTR", " source")
readibility = [" TTR", "SMOG", "FKE", "wordlen"]

FileNotFoundError: [Errno 2] No such file or directory: 'subset.csv'

Below is the list of features in the dataset. We will choose a few to study.

In [None]:
for c in df.columns: 
  print(c)

## Study of Readability
First we will look at several measures of the difficulty of a text:

1) TTR (type-token ratio)

2) SMOG (SMOG grade)

3) FKE (Flesch-Kincaid grade level)

4) wordlen (average word length)

In [None]:
sns.displot(df, x= " TTR", kind="kde",hue = " source",multiple="stack")

The plot above shows the distribution of the TTR feature for each news source. The TTR (type-token ratio) of a text is defined as $$\frac{\#\,\text{unique words}}{\#\,\text{words}}$$ TTR measures lexical diversity, which can help gauge the difficulty of a text: the higher the TTR, the harder the text can be for non-native or new speakers to read.
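To make the definition concrete, here is a minimal sketch of a TTR computation. It assumes lowercased, whitespace-split tokens; the tokenizer used to build the dataset may differ, and `type_token_ratio` is a name chosen here for illustration.

```python
def type_token_ratio(text):
    """Return (# unique tokens) / (# tokens) for a lowercased, whitespace-split text."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    return len(set(tokens)) / len(tokens)


# "the" appears twice, so there are 5 unique tokens out of 6.
ttr = type_token_ratio("the cat sat on the mat")
print(ttr)
```

A text that keeps repeating the same words scores close to 0, while a text in which every word is new scores 1.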

In [None]:
res = {}
for source in sources:
    res[source] = df[df[" source"] == source][" TTR"].mean()
res = {k: v for k, v in sorted(res.items(), key=lambda item: item[1])}
sns.barplot(x=list(res.keys()), y=list(res.values())).set_title("Mean of TTR")

In [None]:
f = sns.displot(df, x="SMOG", kind="kde", hue=" source", multiple="stack")
f.set_titles("SMOG Distribution")

In [None]:
res = {}
for source in sources:
    res[source] = df[df[" source"] == source]["SMOG"].mean()
res = {k: v for k, v in sorted(res.items(), key=lambda item: item[1])}
sns.barplot(x=list(res.keys()), y=list(res.values())).set_title("Mean of SMOG")


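Description of SMOG: the SMOG grade estimates the years of education a reader needs to understand a text. It is based on the number of polysyllabic words (three or more syllables) relative to the number of sentences, so a higher SMOG value indicates a harder text.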
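As a rough illustration, the sketch below applies the standard SMOG formula, $1.0430\sqrt{\text{polysyllables} \times 30 / \text{sentences}} + 3.1291$. The syllable counter is a crude vowel-group heuristic assumed here for brevity, and the function names are hypothetical. The values in the dataset may have been computed with a different syllable counter and sentence splitter.

```python
import math
import re


def count_syllables(word):
    """Crude heuristic (assumption): one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def smog_grade(text):
    """SMOG grade from the number of polysyllabic words (3+ syllables) and sentences."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    words = re.findall(r"[A-Za-z]+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291


simple = smog_grade("The cat sat. The dog ran.")
dense = smog_grade("International organizations negotiate.")
print(simple, dense)
```

With no polysyllabic words the score falls back to the constant 3.1291, and it grows as polysyllabic words become more frequent per sentence.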

In [None]:
f = sns.displot(df, x= "FKE", kind="kde",hue = " source",multiple="stack")
f.set_titles("FKE Distribution")

In [None]:
res = {}
for source in sources:
    res[source] = df[df[" source"] == source]["FKE"].mean()
res = {k: v for k, v in sorted(res.items(), key=lambda item: item[1])}
sns.barplot(x=list(res.keys()), y=list(res.values())).set_title("Mean of FKE")


In [None]:
f= sns.displot(df, x= "wordlen", kind="kde",hue = " source",multiple="stack")
f.set_titles("Avg WordLen Distribution")

In [None]:
res = {}
for source in sources:
    res[source] = df[df[" source"] == source]["wordlen"].mean()
res = {k: v for k, v in sorted(res.items(), key=lambda item: item[1])}
sns.barplot(x=list(res.keys()), y=list(res.values())).set_title("Mean of Avg WordLen")


The plots above show the average word length of the articles for each news source.

**Some Takeaways**


1.   Xinhua is the hardest to read: it has the longest average word length per article and the highest FKE, SMOG, and TTR scores.
2.   Fox News is harder to read than CNN on all four metrics.
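Comparisons like these can be checked in one step instead of one bar plot per feature. The sketch below uses a toy frame with made-up values and placeholder source names ("A", "B"); only the column names are taken from the notebook. On the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy data with made-up values and placeholder source names, for illustration only.
toy = pd.DataFrame({
    " source": ["A", "A", "B", "B"],
    " TTR": [0.40, 0.50, 0.60, 0.70],
    "SMOG": [10.0, 12.0, 13.0, 15.0],
    "FKE": [9.0, 11.0, 12.0, 14.0],
    "wordlen": [4.5, 4.7, 5.0, 5.2],
})
readibility = [" TTR", "SMOG", "FKE", "wordlen"]

# Mean of every readability feature per source, in a single table.
means = toy.groupby(" source")[readibility].mean()
print(means)

# For each metric, whether source B has a higher mean than source A.
b_harder = means.loc["B"] > means.loc["A"]
print(b_harder.all())
```

The resulting table has one row per source and one column per metric, which makes a claim of the form "X is higher than Y on all metrics" easy to verify.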



In [None]:
r_frame = df[readibility]

In [None]:
sns.jointplot(data=r_frame, x="wordlen", y="FKE", kind="hist")

In [None]:
sns.jointplot(data=r_frame, x="wordlen", y="SMOG", kind="hist")

In [None]:
sns.jointplot(data=r_frame, x="FKE", y="SMOG", kind="hist")

In [None]:
sns.jointplot(data=r_frame, x="SMOG", y=" TTR", kind="hist")

In [None]:
sns.jointplot(data=r_frame, x="wordlen", y=" TTR", kind="hist")

**Notable correlations for readability features**


1.   Longer avg wordlen --> Higher SMOG value
2.   Longer avg wordlen --> Higher FKE value
3.   Higher FKE --> Higher SMOG


Interestingly, SMOG and wordlen are largely uncorrelated with the TTR score.
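The joint plots give a visual impression. A Pearson correlation matrix is one way to put numbers on it. The sketch below runs on a toy frame with made-up values, so its output says nothing about the real dataset; in the notebook the same call would be `r_frame.corr()`.

```python
import pandas as pd

# Toy values, made up for illustration; column names follow the notebook.
toy = pd.DataFrame({
    "wordlen": [4.0, 4.5, 5.0, 5.5, 6.0],
    "FKE": [8.0, 9.5, 10.0, 12.0, 13.0],
    "SMOG": [9.0, 10.0, 11.5, 12.0, 14.0],
    " TTR": [0.50, 0.40, 0.60, 0.45, 0.55],
})

# Pairwise Pearson correlation between all readability features.
corr = toy.corr(method="pearson")
print(corr.round(2))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.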




Next, we will look at the sentiment distributions.



In [None]:
f= sns.displot(df, x= " wneg_count", kind="kde",hue = " source",multiple="stack")

wneg_count can be calculated using: `wneg_count = float(sum([tokens.count(n) for n in wneg]))/len(tokens)`

In other words, count how many tokens in the article belong to the negative word list (`wneg`) and divide that by the total number of tokens, so the value is a proportion rather than a raw count. The idea is the same for wpos_count and wneu_count.
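The formula above can be wrapped in a small function. This is a sketch: the word list below is a tiny hypothetical stand-in for the real negative lexicon, and `lexicon_ratio` is a name chosen here for illustration.

```python
def lexicon_ratio(tokens, lexicon):
    """Share of tokens that appear in the given word list (same formula as wneg_count)."""
    if not tokens:
        return 0.0  # avoid dividing by zero on empty articles
    return float(sum(tokens.count(word) for word in lexicon)) / len(tokens)


# Hypothetical mini-lexicon; the real negative word list is much larger.
wneg = ["bad", "terrible", "worst"]
tokens = "the worst storm caused terrible damage and bad flooding".split()

# 3 of the 9 tokens are in the list.
wneg_count = lexicon_ratio(tokens, wneg)
print(wneg_count)
```

Passing a positive or neutral word list instead gives the analogues of wpos_count and wneu_count.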

In [None]:
res = {}
for source in sources:
    res[source] = df[df[" source"] == source][" wneg_count"].mean()
res = {k: v for k, v in sorted(res.items(), key=lambda item: item[1])}
sns.barplot(x=list(res.keys()), y=list(res.values())).set_title("Mean of Average Negative Words per Article")


In [None]:
f= sns.displot(df, x= " wpos_count", kind="kde",hue = " source",multiple="stack")

In [None]:
res = {}
for source in sources:
    res[source] = df[df[" source"] == source][" wpos_count"].mean()
res = {k: v for k, v in sorted(res.items(), key=lambda item: item[1])}
sns.barplot(x=list(res.keys()), y=list(res.values())).set_title("Mean of Average Positive Words per Article")


In [None]:
f= sns.displot(df, x= "vad_neg", kind="kde",hue = " source",multiple="stack")

In [None]:
res = {}
for source in sources:
    res[source] = df[df[" source"] == source]["vad_neg"].mean()
res = {k: v for k, v in sorted(res.items(), key=lambda item: item[1])}
sns.barplot(x=list(res.keys()), y=list(res.values())).set_title("Mean of sentiment calculated using Vader")


**Takeaways**

First, using basic positive and negative word counts:

1.   CNN uses, on average, more positive words per article than Fox News.
2.   Fox News uses, on average, more negative words per article than CNN.

Note that # of positive + # of negative != # of tokens, since there are neutral tokens/words as well.


Second, using VADER:
Using VADER to calculate the sentiment of the articles, we observe a similar pattern: Fox News articles are more negative than CNN articles on average.