# NELA Indicator Analysis

Created: 2019.9.22  
Notebook sequence: 1

Explores which indicators could be used for which prediction tasks.

---

More information about these label sources is available from:

* Media Bias / Fact Check: https://mediabiasfactcheck.com/
* Pew Research Center: https://www.journalism.org/2014/10/21/political-polarization-media-habits/



A quick note on the limitations of the ML prediction already done on these labels in http://homepages.rpi.edu/~horneb/WWW18_Horne_Demo.pdf:

They used only one label source (Open Sources) and reported results only for a Random Forest classifier.

They also rely on a number of additional computed features (Twitter counts, Wikipedia information, etc.), not just raw content.

Prediction tasks refined:
* Fake news (~2)
    * Open Sources, fake
    * Wikipedia, is_fake
    * PROBLEM: How do I balance this out with non-fake news? I don't know which sources each indicator has "considered". Maybe for non-fake use Media Bias / Fact Check least-biased?
* Reliable news (3)
    * Open Sources, unreliable | Open Sources, reliable
    * Media Bias / Fact Check, factual_reporting 1-3 | 4-5 (PROBLEM: justify split)
    * NewsGuard, overall_class 1.0 | 0.0
* Biased (extreme, slight) (~3)
    * Media Bias / Fact Check (left-biased, left-center-biased, right-center-biased, right-biased) | (least-biased)
    * Allsides, bias_rating (left, lean-left, lean-right, right) | (center)
    * Media Bias / Fact Check, (extreme_left, extreme_right) | (least-biased)
* Which direction biased? (extreme, slight) (~3+)
    * Media Bias / Fact Check (left-biased, left-center-biased) | (least-biased) | (right-center-biased, right-biased)
    * Media Bias / Fact Check (left-biased) | (least-biased) | (right-biased)
    * Media Bias / Fact Check, extreme_left | (least-biased) | extreme_right
    * Allsides, bias_rating (left, lean-left) | (center) | (lean-right, right)
    * Allsides, bias_rating (left) | (center) | (right)
    * BuzzFeed (left) | | (right)
* How much biased? (2)
    * Media Bias / Fact Check (left-biased) | (left-center-biased) | (least-biased) | (right-center-biased) | (right-biased)
    * Allsides, bias_rating (left) | (lean-left) | (center) | (lean-right) | (right)
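
As a rough sketch of the reliable/unreliable split on `Media Bias / Fact Check, factual_reporting` listed above, the snippet below thresholds the 1-5 score at 4. The helper name `binarize_factual` and the default threshold are assumptions for illustration only; the 1-3 | 4-5 cut itself still needs to be justified.

```python
import pandas as pd


def binarize_factual(scores: pd.Series, threshold: int = 4) -> pd.Series:
    """Map 1-5 factual_reporting scores to 1.0 (reliable) / 0.0 (not), keeping NaN."""
    labels = (scores >= threshold).astype("float")
    # Comparisons with NaN evaluate to False, so restore the missing values explicitly.
    return labels.where(scores.notna())


# Illustrative scores only; the real column comes from labels_df.
scores = pd.Series(
    [4.0, 3.0, None, 5.0, 1.0],
    name="Media Bias / Fact Check, factual_reporting",
)
binary = binarize_factual(scores)
print(binary.tolist())
```

Keeping unlabeled sources as NaN (rather than 0) avoids silently treating "no rating" as "unreliable".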


NOTE: may want to leave fake and reliable merged. The concern is that, for Open Sources, I'm not sure what the difference between "fake" and "unreliable" is. Maybe I should just leave this out entirely?

I don't think PolitiFact has any useful labels for what I'm doing.

Pew Research Center probably isn't very useful either; its values measure readers' trust/distrust of sources.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import util
import pandas as pd

In [4]:
labels_df = util.nela_load_labels()

In [5]:
labels_df

Unnamed: 0.1,Unnamed: 0,"NewsGuard, Does not repeatedly publish false content","NewsGuard, Gathers and presents information responsibly","NewsGuard, Regularly corrects or clarifies errors","NewsGuard, Handles the difference between news and opinion responsibly","NewsGuard, Avoids deceptive headlines","NewsGuard, Website discloses ownership and financing","NewsGuard, Clearly labels advertising","NewsGuard, Reveals who's in charge, including any possible conflicts of interest","NewsGuard, Provides information about content creators",...,"Allsides, community_agree","Allsides, community_disagree","Allsides, community_label","BuzzFeed, leaning","PolitiFact, Pants on Fire!","PolitiFact, False","PolitiFact, Mostly False","PolitiFact, Half-True","PolitiFact, Mostly True","PolitiFact, True"
0,21stCenturyWire,,,,,,,,,,...,,,,left,,,,,,
1,ABC News,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,8964.0,6949.0,somewhat agree,,,,,,,
2,AMERICAblog News,,,,,,,,,,...,,,,left,,,,,,
3,Activist Post,,,,,,,,,,...,,,,left,,,,,,
4,Addicting Info,,,,,,,,,,...,,,,left,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
189,iPolitics,,,,,,,,,,...,,,,,,,,,,
190,oann,,,,,,,,,,...,,,,right,,,,,,
191,rferl,,,,,,,,,,...,,,,,,,,,,
192,sott.net,,,,,,,,,,...,,,,,,,,,,


In total there are 194 sources, labeled by 8 different indicator providers, with 57 distinct indicator labels.

# Level 1 - Reliable or not?

Relevant indicators:

* `NewsGuard, overall_class` (1 good, 0 bad)
* `Pew Research Center, total` (1 trusted, 0 undecided, -1 not trusted)
* `Wikipedia, is_fake` (1 marked)
* `Open Sources, reliable` (# tag)
* `Open Sources, fake` (# tag)
* `Open Sources, unreliable` (# tag)
* `Media Bias / Fact Check, factual_reporting` (bad 1 - 5 good)

In [23]:
reliable_df = labels_df[[
    "Unnamed: 0",
    "NewsGuard, overall_class", 
    "Pew Research Center, total", 
    "Wikipedia, is_fake", 
    "Open Sources, reliable", 
    "Open Sources, fake",
    "Open Sources, unreliable",
    "Media Bias / Fact Check, factual_reporting"
]]
#with pd.option_context('display.max_rows', None, 'display.max_columns', None): 
    #display(reliable_df)

In [24]:
reliable_df.shape

(194, 8)

In [27]:
# Require at least two indicators per source: fewer than 6 nulls among the 7 indicator columns
at_least_two = reliable_df.isnull().sum(axis=1) < 6
print(reliable_df[at_least_two].shape)
reliable_df[at_least_two]

(82, 8)


Unnamed: 0.1,Unnamed: 0,"NewsGuard, overall_class","Pew Research Center, total","Wikipedia, is_fake","Open Sources, reliable","Open Sources, fake","Open Sources, unreliable","Media Bias / Fact Check, factual_reporting"
1,ABC News,1.0,1.0,,,,,4.0
4,Addicting Info,,,,,,2.0,3.0
5,Al Jazeera,0.0,-1.0,,,,,4.0
6,Alternet,1.0,,,2.0,,,3.0
8,BBC,1.0,1.0,,,,,4.0
...,...,...,...,...,...,...,...,...
179,Vox,1.0,,,,,,4.0
182,Washington Monthly,1.0,,,,,,4.0
183,Washington Post,1.0,1.0,,,,,4.0
184,Western Journal,0.0,,,,,,3.0


When looking at the NewsGuard, Pew Research Center, and factual reporting scores, and requiring that none of them be null, we have about 24 sources.
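
A minimal sketch of that "none of them null" filter, using `DataFrame.dropna` with a `subset`. The five rows are copied from the filtered output above purely so the snippet runs on its own; `toy_df` and `complete_df` are illustrative names, and on the real data the same call would be applied to `reliable_df`.

```python
import pandas as pd

core = [
    "NewsGuard, overall_class",
    "Pew Research Center, total",
    "Media Bias / Fact Check, factual_reporting",
]

# Stand-in for reliable_df, restricted to a few rows shown in the output above.
toy_df = pd.DataFrame(
    [
        ["ABC News", 1.0, 1.0, 4.0],
        ["Al Jazeera", 0.0, -1.0, 4.0],
        ["Alternet", 1.0, None, 3.0],
        ["BBC", 1.0, 1.0, 4.0],
        ["Vox", 1.0, None, 4.0],
    ],
    columns=["Unnamed: 0"] + core,
)

# Keep only sources that have all three indicators present.
complete_df = toy_df.dropna(subset=core)
print(complete_df.shape)
```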

In [26]:
for column in reliable_df.columns[1:]:
    print("------",column,"-------")
    print(reliable_df[column].value_counts())

------ NewsGuard, overall_class -------
1.0    60
0.0    25
Name: NewsGuard, overall_class, dtype: int64
------ Pew Research Center, total -------
 1.0    16
 0.0     6
-1.0     3
Name: Pew Research Center, total, dtype: int64
------ Wikipedia, is_fake -------
1.0    5
Name: Wikipedia, is_fake, dtype: int64
------ Open Sources, reliable -------
2.0    3
Name: Open Sources, reliable, dtype: int64
------ Open Sources, fake -------
1.0    5
3.0    1
Name: Open Sources, fake, dtype: int64
------ Open Sources, unreliable -------
2.0    9
1.0    4
3.0    2
Name: Open Sources, unreliable, dtype: int64
------ Media Bias / Fact Check, factual_reporting -------
4.0    54
3.0    48
2.0     7
5.0     2
1.0     1
Name: Media Bias / Fact Check, factual_reporting, dtype: int64


# Level 2 - Biased or not?

Relevant indicators:

* `Allsides, bias_rating` rating label (left, right, center, lean left, lean right)
* `Media Bias / Fact Check, extreme_left` (1 true, 0 false)
* `Media Bias / Fact Check, extreme_right` (1 true, 0 false)
* `Media Bias / Fact Check, label` label (left_center_bias, left_bias, conspiracy_pseudoscience, right_bias, questionable_source, right_center_bias, satire, least_biased)
* `Open Sources, bias` # tag
* `BuzzFeed, leaning` (left, right)
* **Pew Research Center??**

In [9]:
bias_df = labels_df[[
    "Unnamed: 0",
    "Allsides, bias_rating",
    "Media Bias / Fact Check, extreme_left",
    "Media Bias / Fact Check, extreme_right",
    "Media Bias / Fact Check, label",
    "Open Sources, bias",
    "BuzzFeed, leaning"
]]
bias_df

Unnamed: 0.1,Unnamed: 0,"Allsides, bias_rating","Media Bias / Fact Check, extreme_left","Media Bias / Fact Check, extreme_right","Media Bias / Fact Check, label","Open Sources, bias","BuzzFeed, leaning"
0,21stCenturyWire,,0.0,0.0,conspiracy_pseudoscience,,left
1,ABC News,Lean Left,0.0,0.0,left_center_bias,,
2,AMERICAblog News,,,,,1.0,left
3,Activist Post,,0.0,0.0,conspiracy_pseudoscience,,left
4,Addicting Info,,0.0,0.0,left_bias,,left
...,...,...,...,...,...,...,...
189,iPolitics,,0.0,0.0,left_center_bias,,
190,oann,,,,,,right
191,rferl,,,,,,
192,sott.net,,0.0,0.0,conspiracy_pseudoscience,,


In [10]:
bias_df[bias_df.isnull().sum(axis=1) < 2]

Unnamed: 0.1,Unnamed: 0,"Allsides, bias_rating","Media Bias / Fact Check, extreme_left","Media Bias / Fact Check, extreme_right","Media Bias / Fact Check, label","Open Sources, bias","BuzzFeed, leaning"
6,Alternet,Left,0.0,0.0,left_bias,,left
11,Bipartisan Report,,1.0,0.0,questionable_source,2.0,left
13,Breitbart,Right,0.0,1.0,questionable_source,3.0,right
19,CNS News,Right,0.0,1.0,questionable_source,2.0,right
30,Daily Kos,Left,0.0,0.0,left_bias,,left
32,Daily Signal,Right,0.0,0.0,right_bias,1.0,right
37,Drudge Report,Lean Right,0.0,0.0,right_bias,2.0,right
47,Fox News,Lean Right,0.0,0.0,right_bias,,right
52,FrontPage Magazine,Right,0.0,1.0,questionable_source,1.0,right
57,Hot Air,Lean Right,0.0,0.0,right_bias,,right


Interestingly, the indicators rarely disagree with each other, and almost no sources are actually labeled "center".

A potential approach here: train a model on all of the sources labeled by one indicator (even if they don't overlap with another indicator's sources), and then test it on the sources labeled by the other indicator.
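
A hedged sketch of how such a cross-indicator split could be built at the source level. The helper `cross_indicator_split` and its `drop_overlap` option are assumptions, not something already in `util`; dropping sources that appear under both indicators from the test side is one way to avoid leaking training sources into evaluation. The rows are taken from the `bias_df` outputs above.

```python
import pandas as pd


def cross_indicator_split(df, train_col, test_col, name_col="Unnamed: 0", drop_overlap=True):
    """Return (train_sources, test_sources) based on which indicator labels each source."""
    train_mask = df[train_col].notna()
    test_mask = df[test_col].notna()
    if drop_overlap:
        # Sources labeled by both indicators stay on the training side only.
        test_mask = test_mask & ~train_mask
    return df.loc[train_mask, name_col].tolist(), df.loc[test_mask, name_col].tolist()


# Stand-in for bias_df with a few rows from the tables above.
toy_df = pd.DataFrame(
    {
        "Unnamed: 0": ["21stCenturyWire", "ABC News", "Alternet", "Fox News", "oann"],
        "Allsides, bias_rating": [None, "Lean Left", "Left", "Lean Right", None],
        "BuzzFeed, leaning": ["left", None, "left", "right", "right"],
    }
)

train_sources, test_sources = cross_indicator_split(
    toy_df, "Allsides, bias_rating", "BuzzFeed, leaning"
)
print(train_sources, test_sources)
```

With `drop_overlap=False` the test set would instead contain every source the second indicator labels, which matches the looser reading of the idea.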

In [21]:
for column in bias_df.columns[1:]:
    print("------",column,"-------")
    print(bias_df[column].value_counts())

------ Allsides, bias_rating -------
Left          17
Right         14
Lean Left     13
Center        11
Lean Right    10
Name: Allsides, bias_rating, dtype: int64
------ Media Bias / Fact Check, extreme_left -------
0.0    133
1.0      2
Name: Media Bias / Fact Check, extreme_left, dtype: int64
------ Media Bias / Fact Check, extreme_right -------
0.0    123
1.0     12
Name: Media Bias / Fact Check, extreme_right, dtype: int64
------ Media Bias / Fact Check, label -------
left_center_bias            36
left_bias                   30
conspiracy_pseudoscience    17
right_bias                  16
questionable_source         15
right_center_bias           10
satire                       8
least_biased                 3
Name: Media Bias / Fact Check, label, dtype: int64
------ Open Sources, bias -------
1.0    14
2.0     7
3.0     4
Name: Open Sources, bias, dtype: int64
------ BuzzFeed, leaning -------
right    32
left     24
Name: BuzzFeed, leaning, dtype: int64


Allsides labels a total of 65 sources, 11 of which are "center".

The Media Bias / Fact Check extreme flags are set for only 14 sources total (2 `extreme_left`, 12 `extreme_right`).

The Media Bias / Fact Check label might be okay, but only 3 sources have the label "least_biased". A more interesting approach might be to look only for "extreme bias" (and save direction alone for Level 3).

BuzzFeed is not a good source for this because it does not have a "center" option, so save it for Level 3.
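
One possible way to derive a single "extreme bias" target from the two Media Bias / Fact Check flags is a row-wise maximum, sketched below. The variable names are illustrative, and the four rows are taken from the `bias_df` outputs above.

```python
import pandas as pd

left_col = "Media Bias / Fact Check, extreme_left"
right_col = "Media Bias / Fact Check, extreme_right"

# Stand-in for bias_df with a few rows from the tables above.
toy_df = pd.DataFrame(
    {
        "Unnamed: 0": ["Bipartisan Report", "Breitbart", "Alternet", "AMERICAblog News"],
        left_col: [1.0, 0.0, 0.0, None],
        right_col: [0.0, 1.0, 0.0, None],
    }
)

# 1.0 if either flag is set, 0.0 if both are 0, NaN if the source has no flags at all.
is_extreme = toy_df[[left_col, right_col]].max(axis=1)
print(is_extreme.tolist())
```

With only 14 positive sources, this target would be heavily imbalanced, so any model trained on it would need to account for that.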


# Level 3 - Biased in what direction?

Relevant indicators:

* `Allsides, bias_rating` rating label (left, right, center, lean left, lean right)
* `Media Bias / Fact Check, extreme_left` (1 true, 0 false)
* `Media Bias / Fact Check, extreme_right` (1 true, 0 false)
* `Media Bias / Fact Check, label` label (left_center_bias, left_bias, conspiracy_pseudoscience, right_bias, questionable_source, right_center_bias, satire, least_biased)
* `Open Sources, bias` # tag
* `BuzzFeed, leaning` (left, right)
* **Pew Research Center??**
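
For the direction task, the Allsides ratings and Media Bias / Fact Check labels would need to be collapsed into left / center / right. The mapping below is a sketch of the grouping from the task list at the top of this notebook; the dictionary names are made up here, and labels with no direction (e.g. `questionable_source`, `satire`) are deliberately left unmapped so they become NaN.

```python
import pandas as pd

ALLSIDES_DIRECTION = {
    "Left": "left", "Lean Left": "left",
    "Center": "center",
    "Lean Right": "right", "Right": "right",
}
MBFC_DIRECTION = {
    "left_bias": "left", "left_center_bias": "left",
    "least_biased": "center",
    "right_center_bias": "right", "right_bias": "right",
}

# Values for ABC News, Alternet, Breitbart, Fox News, and 21stCenturyWire from the tables above.
allsides = pd.Series(["Lean Left", "Left", "Right", "Lean Right", None])
mbfc = pd.Series(
    ["left_center_bias", "left_bias", "questionable_source", "right_bias", "conspiracy_pseudoscience"]
)

# Series.map with a dict yields NaN for any value that is not a key.
allsides_direction = allsides.map(ALLSIDES_DIRECTION)
mbfc_direction = mbfc.map(MBFC_DIRECTION)
print(allsides_direction.tolist())
print(mbfc_direction.tolist())
```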

# Level 4 - Biased to what extent?

Relevant indicators:

* `Allsides, bias_rating` rating label (left, right, center, lean left, lean right)
* `Media Bias / Fact Check, extreme_left` (1 true, 0 false)
* `Media Bias / Fact Check, extreme_right` (1 true, 0 false)
* `Media Bias / Fact Check, label` label (left_center_bias, left_bias, conspiracy_pseudoscience, right_bias, questionable_source, right_center_bias, satire, least_biased)
* **Pew Research Center??**

# Level 5 - Misinformation classification

# NELA-GT-2018 (dataset description from README.md)

## Dataset Articles

The articles gathered in this dataset are stored in an SQLite database. The database has one table named `articles`. This table has 4 textual columns:

1. `date`: Date of article in `yyyy-mm-dd` format.
1. `source`: Source of article.
1. `name`: Title of article.
1. `content`: Clean text content of article.

The rows of the table are sorted first by `date` and then by `source`.
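
A small sketch of querying the `articles` table with `sqlite3` and pandas. The database file name is not given here, so the snippet builds an in-memory stand-in with the four columns described above and placeholder rows; for the real data, the connection would point at the dataset's database file instead.

```python
import sqlite3

import pandas as pd

# In-memory stand-in so the sketch runs anywhere (assumption: real path is supplied by the user).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (date TEXT, source TEXT, name TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?, ?)",
    [
        ("2018-02-01", "BBC", "placeholder title 1", "placeholder text"),
        ("2018-02-01", "Vox", "placeholder title 2", "placeholder text"),
        ("2018-02-02", "BBC", "placeholder title 3", "placeholder text"),
    ],
)

# Pull all articles for one source; the parameter is bound rather than string-formatted.
articles_df = pd.read_sql_query(
    "SELECT date, source, name, content FROM articles WHERE source = ? ORDER BY date",
    conn,
    params=("BBC",),
)
conn.close()
print(len(articles_df))
```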

The dataset's articles are also provided as plain-text files, with the following file structure and file-naming convention:
```
date/
	source/
		<source>--<date>--<title>.txt
```
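
The naming convention above can be unpacked with a small helper like the following sketch. `parse_article_path` is a hypothetical name, the example path is a placeholder, and the code assumes source names never contain `--` (titles may, which is why the split is limited to two separators).

```python
from pathlib import Path


def parse_article_path(path):
    """Split '<source>--<date>--<title>.txt' into its three parts."""
    stem = Path(path).stem  # file name without the trailing .txt
    # maxsplit=2 keeps any further '--' inside the title intact.
    source, date, title = stem.split("--", 2)
    return {"source": source, "date": date, "title": title}


info = parse_article_path("2018-02-01/BBC/BBC--2018-02-01--placeholder title.txt")
print(info)
```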

## Dataset Labels

The labels of sources are stored in a comma-separated file *labels.csv* and in a human-readable format in *labels.txt*. Each row in the files contains information about a source. 
The column names use the naming convention `<site_name>,<label_name>`, where `<site_name>` is the name of the site providing the label and `<label_name>` is the name of the particular label. The following lists all columns in the labels files. The columns use different value encodings, which are described below. Note that any column can also have missing values (no data for that particular source).

1.  Source names
  	**NewsGuard labels:**
1. NewsGuard, Does not repeatedly publish false content `1 true, 0 false`
2. NewsGuard, Gathers and presents information responsibly `1 true, 0 false`
3. NewsGuard, Regularly corrects or clarifies errors `1 true, 0 false`
4. NewsGuard, Handles the difference between news and opinion responsibly `1 true, 0 false`
5. NewsGuard, Avoids deceptive headlines `1 true, 0 false`
6. NewsGuard, Website discloses ownership and financing `1 true, 0 false`
7. NewsGuard, Clearly labels advertising `1 true, 0 false`
8. NewsGuard, Reveals who's in charge, including any possible conflicts of interest `1 true, 0 false`
9. NewsGuard, Provides information about content creators `1 true, 0 false`
10. NewsGuard, score `0-100`
11. NewsGuard, overall_class `1 good, 0 bad`
	**Pew Research Center**
12. Pew Research Center, known_by_40% `1 true, 0 false`
13. Pew Research Center, total `1 trusted, 0 undecided, -1 not trusted`
14. Pew Research Center, consistently_liberal `1 trusted, 0 undecided, -1 not trusted`
15. Pew Research Center, mostly_liberal `1 trusted, 0 undecided, -1 not trusted`
16. Pew Research Center, mixed `1 trusted, 0 undecided, -1 not trusted`
17. Pew Research Center, mostly conservative `1 trusted, 0 undecided, -1 not trusted`
18. Pew Research Center, consistently conservative `1 trusted, 0 undecided, -1 not trusted`
	**Wikipedia**
19. Wikipedia, is_fake `1 marked`
	**Open Sources**
20. Open Sources, reliable `# tag`
21. Open Sources, fake `# tag`
22. Open Sources, unreliable `# tag`
23. Open Sources, bias `# tag`
24. Open Sources, conspiracy `# tag`
25. Open Sources, hate `# tag`
26. Open Sources, junksci `# tag`
27. Open Sources, rumor `# tag`
28. Open Sources, blog `# tag`
29. Open Sources, clickbait `# tag`
30. Open Sources, political `# tag`
31. Open Sources, satire `# tag`
32. Open Sources, state `# tag`
	**Media Bias / Fact Check**
33. Media Bias / Fact Check, label `label`
34. Media Bias / Fact Check, factual_reporting `bad 1 - 5 good`
35. Media Bias / Fact Check, extreme_left `1 true, 0 false`
36. Media Bias / Fact Check, right `1 true, 0 false`
37. Media Bias / Fact Check, extreme_right `1 true, 0 false`
38. Media Bias / Fact Check, propaganda `1 true, 0 false`
39. Media Bias / Fact Check, fake_news `1 true, 0 false`
40. Media Bias / Fact Check, some_fake_news `1 true, 0 false`
41. Media Bias / Fact Check, failed_fact_checks `1 true, 0 false`
42. Media Bias / Fact Check, conspiracy `1 true, 0 false`
43. Media Bias / Fact Check, pseudoscience `1 true, 0 false`
44. Media Bias / Fact Check, hate_group `1 true, 0 false`
45. Media Bias / Fact Check, anti_islam `1 true, 0 false`
46. Media Bias / Fact Check, nationalism `1 true, 0 false`
	**Allsides**
47. Allsides, bias_rating `rating label`
48. Allsides, community_agree `# votes agreeing`
49. Allsides, community_disagree `# votes disagreeing`
50. Allsides, community_label `agreement label`
	**BuzzFeed**
51. BuzzFeed, leaning `left, right`
	**PolitiFact**
52. PolitiFact, Pants on Fire! `# counts`
53. PolitiFact, False `# counts`
54. PolitiFact, Mostly False `# counts`
55. PolitiFact, Half-True `# counts`
56. PolitiFact, Mostly True `# counts`
57. PolitiFact, True `# counts`
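
Given this naming convention, column names can be split back into provider and label, for example to count labels per provider. The sketch below assumes the separator is a comma followed by a space, as in the column names loaded earlier in this notebook; `split_column` is an illustrative helper, and only a handful of the 57 columns are listed.

```python
from collections import Counter

columns = [
    "NewsGuard, score",
    "NewsGuard, Reveals who's in charge, including any possible conflicts of interest",
    "Open Sources, fake",
    "Media Bias / Fact Check, label",
    "Allsides, bias_rating",
    "BuzzFeed, leaning",
]


def split_column(column):
    """Return (site_name, label_name); only the first ', ' separates the two."""
    site, label = column.split(", ", 1)
    return site, label


labels_per_site = Counter(split_column(c)[0] for c in columns)
print(labels_per_site)
```

Splitting on only the first separator matters because some NewsGuard label names contain commas themselves.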

