# Analysis of articles on Wikipedia and Infogalactic
This notebook compares how Wikipedia and Infogalactic cover alt-right media outlets.

### The following code scrapes the list of alt-right websites from the following Wikipedia page: https://en.wikipedia.org/wiki/Alternative_media_(U.S._political_right)#Television


In [None]:
from bs4 import BeautifulSoup
import requests

link_list = "https://en.wikipedia.org/wiki/Alternative_media_(U.S._political_right)#Television"
user_agent = "Chrome/86.0.4240.22"
headers = {'User-Agent': user_agent}
html = requests.get(link_list,headers=headers).text

soup = BeautifulSoup(html, "lxml")
body = soup('body')[0]
headers_h2 = body.find_all('h2')
websites = []
for header in headers_h2:
  header_text = header.get_text()
  if 'Alternative media outlets' in header_text:
    for sibling in header.next_siblings:
      break_loop = False
      if sibling.name == 'div':
        table = sibling.table
        # walk down to the first row of the table; its cells hold the <ul> lists
        tbody = table.tbody
        tr = tbody.tr
        td_array = tr.find_all('td')
        for td in td_array:
          ul_tags = td.find_all('ul')
          for ul_tag in ul_tags:
            for li_tag in ul_tag.find_all('li'):
              href = li_tag.find('a', href=True)
              # skip list items that contain no link
              if href is not None:
                websites.append(href['href'].replace('/wiki/', ''))
        break_loop = True
            
      if break_loop:
        break
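
The scraping step above boils down to collecting the `/wiki/...` links that sit inside list items. The sketch below shows the same idea on an inline HTML snippet, using only the standard library's `html.parser` instead of BeautifulSoup. The class name `WikiLinkCollector` and the sample markup are illustrative assumptions, not part of the notebook, and unlike the cell above it collects every link inside a list item rather than only the first.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class WikiLinkCollector(HTMLParser):
    """Collect article titles from /wiki/ links found inside <li> items."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0   # how many <li> elements are currently open
        self.titles = []    # collected article titles, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "a" and self.li_depth > 0:
            href = dict(attrs).get("href") or ""
            if href.startswith("/wiki/"):
                # drop the /wiki/ prefix and decode any percent-encoding
                self.titles.append(unquote(href[len("/wiki/"):]))

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1


# Hypothetical markup standing in for the Wikipedia navigation table
sample = (
    '<ul>'
    '<li><a href="/wiki/Fox_News">Fox News</a></li>'
    '<li><a href="/wiki/National_Review">National Review</a></li>'
    '<li>No link here</li>'
    '</ul>'
    '<p><a href="/wiki/Ignored">outside the list</a></p>'
)

collector = WikiLinkCollector()
collector.feed(sample)
print(collector.titles)
```

Links outside list items are ignored, and list items without a link do not raise an error.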
      

### The next step is to generate Wikipedia and Infogalactic URLs for the above websites
Only websites whose Infogalactic page has more than one edit in its history are selected.

In [None]:
wiki_urls = []
infog_urls = []
for website in websites:
  # replace spaces with _
  name = website.replace(" ", "_")
  # Infogalactic uses a different title for CNSNews
  infog_name = name.replace('CNSNews', 'Cybercast_News_Service')
  wiki_url = 'https://en.wikipedia.org/wiki/' + name
  infog_url = 'https://infogalactic.com/info/' + infog_name
  # evaluate page history
  # Many Infogalactic pages have only a single edit in their history. We won't select them for comparison
  infog_page_history = 'https://infogalactic.com/w/index.php?title=' + infog_name + '&action=history'
  infog_history_html = requests.get(infog_page_history,headers=headers).text
  soup = BeautifulSoup(infog_history_html, 'lxml')
  page_history_tag = soup.find(id='pagehistory')
  if page_history_tag is not None:
    li_tags = page_history_tag.find_all('li')
    # keep only pages whose history lists more than one revision
    if len(li_tags) > 1:
      infog_urls.append(infog_url)
      wiki_urls.append(wiki_url)
print(wiki_urls)
print(infog_urls)
  

['https://en.wikipedia.org/wiki/Fox_News', 'https://en.wikipedia.org/wiki/National_Review', 'https://en.wikipedia.org/wiki/The_American_Conservative', 'https://en.wikipedia.org/wiki/The_Weekly_Standard', 'https://en.wikipedia.org/wiki/The_Daily_Caller', 'https://en.wikipedia.org/wiki/Drudge_Report', 'https://en.wikipedia.org/wiki/InfoWars', 'https://en.wikipedia.org/wiki/The_Rebel_Media', 'https://en.wikipedia.org/wiki/Instapundit', 'https://en.wikipedia.org/wiki/Michelle_Malkin']
['https://infogalactic.com/info/Fox_News', 'https://infogalactic.com/info/National_Review', 'https://infogalactic.com/info/The_American_Conservative', 'https://infogalactic.com/info/The_Weekly_Standard', 'https://infogalactic.com/info/The_Daily_Caller', 'https://infogalactic.com/info/Drudge_Report', 'https://infogalactic.com/info/InfoWars', 'https://infogalactic.com/info/The_Rebel_Media', 'https://infogalactic.com/info/Instapundit', 'https://infogalactic.com/info/Michelle_Malkin']
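
The URL construction above can also be isolated in a small helper, which makes the title mapping easier to check without hitting the network. This is a minimal sketch: `build_urls` and `TITLE_OVERRIDES` are hypothetical names introduced here, and the only override included is the CNSNews one used in the notebook.

```python
from urllib.parse import urlencode

WIKI_BASE = "https://en.wikipedia.org/wiki/"
INFOG_BASE = "https://infogalactic.com/info/"
INFOG_INDEX = "https://infogalactic.com/w/index.php"

# Hypothetical lookup of titles that differ between the two sites;
# only the CNSNews entry comes from the notebook.
TITLE_OVERRIDES = {"CNSNews": "Cybercast_News_Service"}


def build_urls(title):
    """Return (wikipedia_url, infogalactic_url, infogalactic_history_url)."""
    name = title.replace(" ", "_")
    infog_name = name
    for old, new in TITLE_OVERRIDES.items():
        infog_name = infog_name.replace(old, new)
    # urlencode builds the query string for the page-history view
    query = urlencode({"title": infog_name, "action": "history"})
    return WIKI_BASE + name, INFOG_BASE + infog_name, INFOG_INDEX + "?" + query


for url in build_urls("Fox News"):
    print(url)
```

The history URL returned here is the one the loop above requests to count revisions.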


### Adding a utility function to remove references. 
E.g., if the text reads 'Fox News is a popular news outlet[8]', we need to remove [8] because it is not relevant to our comparison.
Instances of '\n' are removed from the text as well, in the extraction loop further down.

In [None]:
# Removes bracketed references such as [8] from the text
def remove_references(text):
  ret = ''
  depth = 0  # number of currently open '[' brackets
  for ch in text:
    if ch == '[':
      depth += 1
    elif ch == ']' and depth > 0:
      depth -= 1
    elif depth == 0:
      # keep only characters that are outside any bracket pair
      ret += ch
  return ret
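
For comparison, the same clean-up can be sketched with a regular expression. This is an alternative technique, not the notebook's method: `strip_citations` is a hypothetical name, and the loop repeats the substitution so that nested brackets are removed from the inside out.

```python
import re

# Matches an innermost bracket pair: '[' ... ']' with no brackets in between
BRACKETED = re.compile(r"\[[^\[\]]*\]")


def strip_citations(text):
    """Remove bracketed spans such as [8] or [citation needed]."""
    while True:
        text, count = BRACKETED.subn("", text)
        if count == 0:
            return text


print(strip_citations("Fox News is a popular news outlet[8]"))
```

Both versions also remove editorial markers such as `[citation needed]`, since anything between square brackets is dropped.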

### Next we feed these URLs into a loop and extract the introduction of each outlet from Infogalactic and Wikipedia
We take the paragraphs that precede the table of contents as the introduction.

In [None]:
import diff_match_patch as dmp_module

index = 0
user_agent = "Chrome/77.0.3865.90"
headers = {'User-Agent': user_agent}
# dictionary to store the differences in text for each website.
# key -> Infogalactic URL of the website, value -> the difference between the Wikipedia and Infogalactic text,
# as returned by diff_main
differences = {}
while index < len(wiki_urls):
  link_wiki = wiki_urls[index]
  link_infog = infog_urls[index]
  html = requests.get(link_infog,headers=headers).text
  soup = BeautifulSoup(html, "lxml")
  infogalactic_text = ""
  text = soup.find_all(id='toc')[0]
  for element in text.previous_siblings:
    if element.name == 'p':
      infogalactic_text = element.get_text() + infogalactic_text
  infogalactic_text = remove_references(infogalactic_text)
  infogalactic_text = infogalactic_text.replace('\n', '')
  
  # getting introductions from the Wikipedia links
  html = requests.get(link_wiki,headers=headers).text
  soup = BeautifulSoup(html, "lxml")
  wiki_text = ""
  text = soup.find_all(id='toc')[0]
  for element in text.previous_siblings:
    if element.name == 'p':
      wiki_text = element.get_text() + wiki_text
  # clean the text once, after all introduction paragraphs are collected
  wiki_text = remove_references(wiki_text)
  wiki_text = wiki_text.replace('\n', '')
  
  # getting the difference in text for each of the websites
  dmp = dmp_module.diff_match_patch()
  diff = dmp.diff_main(wiki_text, infogalactic_text)
  dmp.diff_cleanupSemantic(diff)
  differences[link_infog] = diff
  index = index + 1
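
The diff stored in `differences` is a list of `(operation, text)` pairs, where -1 marks text found only in the first input (Wikipedia), 1 marks text found only in the second (Infogalactic), and 0 marks shared text. The sketch below reproduces that shape with the standard library's `difflib` as a stand-in for diff_match_patch, so the segments will not match `diff_main` exactly; `simple_diff` and the two sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def simple_diff(first, second):
    """Return (op, text) pairs: -1 only in first, 1 only in second, 0 shared."""
    diffs = []
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            diffs.append((0, first[i1:i2]))
        else:
            # 'replace' contributes both a deletion and an insertion
            if i1 < i2:
                diffs.append((-1, first[i1:i2]))
            if j1 < j2:
                diffs.append((1, second[j1:j2]))
    return diffs


# Made-up sentences, used only to show the structure of the result
wiki_sample = "Example Outlet is a far-right news website."
infog_sample = "Example Outlet is a news website."
sample_diff = simple_diff(wiki_sample, infog_sample)
for op, text in sample_diff:
    print(op, repr(text))
```

Joining the segments tagged 0 and -1 gives back the first text, and joining those tagged 0 and 1 gives back the second, which is a quick sanity check on any diff of this form.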


  

### We need to display the differences in text in HTML format
Taking the differences in text from the above cell, we can construct the HTML and show them in a legible format.
Wikipedia content is highlighted in yellow; Infogalactic content is italicized and highlighted in red.

diff_match_patch's `diff_prettyHtml` method can be used instead of the hand-built HTML.

In [None]:
from html import escape
from IPython.display import display, HTML

# Map a diff operation to a CSS class:
# -1 -> text only in Wikipedia, 1 -> text only in Infogalactic, 0 -> shared text
def compute_class(number):
  class_name = 'other'
  if number == -1:
    class_name = 'first'
  elif number == 1:
    class_name = 'second'
  return class_name

for (key, diff_array) in differences.items():
  print('%s%s'% (key, '\n'))
  # construct HTML for the difference
  html = '<head><style>.first{background-color: yellow;}\n .second{background-color: #FF4500; font-style: italic; color: white;}</style></head><body><div>'
  for diff in diff_array:
    # escape the text so characters such as < and & are not read as markup
    div_content = '<span class="' + compute_class(diff[0]) + '">' + escape(diff[1]) + '</span>'
    html = html + div_content
  html = html + '</div><br></body>'
  display(HTML(html))

https://infogalactic.com/info/Fox_News



https://infogalactic.com/info/National_Review



https://infogalactic.com/info/The_American_Conservative



https://infogalactic.com/info/The_Weekly_Standard



https://infogalactic.com/info/The_Daily_Caller



https://infogalactic.com/info/Drudge_Report



https://infogalactic.com/info/InfoWars



https://infogalactic.com/info/The_Rebel_Media



https://infogalactic.com/info/Instapundit



https://infogalactic.com/info/Michelle_Malkin
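
The rendering step can be checked outside a notebook by building the markup as a plain string. The following is a minimal sketch under that assumption: `diff_to_html` is a hypothetical helper, the class names mirror the ones used above, and the input is a hand-written diff rather than real page text.

```python
from html import escape

# Class names follow the notebook: 'first' = Wikipedia only, 'second' = Infogalactic only
CSS_CLASS = {-1: "first", 1: "second"}


def diff_to_html(diff):
    """Turn (op, text) pairs into a string of <span> elements."""
    parts = []
    for op, text in diff:
        css = CSS_CLASS.get(op, "other")
        # escape() keeps <, > and & in the article text from being read as markup
        parts.append('<span class="{}">{}</span>'.format(css, escape(text)))
    return "<div>" + "".join(parts) + "</div>"


# Hand-written diff used only for illustration
rendered = diff_to_html([(0, "Fox News is "), (-1, "a <b>"), (1, "an")])
print(rendered)
```

In a notebook, the returned string could be passed to `display(HTML(...))` together with the style rules from the cell above.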



### Findings
Infogalactic edits usually removed words like 'far-right'.

- In the description of **Fox News**, Infogalactic omits the statements that its reporting is biased towards Republican candidates and that it has slandered Democratic candidates. Instead it says that Fox News supports *the Republican Party, Donald Trump and conservative causes*.

- In the case of **InfoWars**, both Wikipedia and Infogalactic describe it as a fake news website.

- For **The Daily Caller**, Infogalactic did not use the term 'conservative' in the introduction. The main difference, however, is evident in the part where the Wikipedia entry brings up an NYT accusation that Daily Caller *'published articles espousing white nationalist, racist anti-black and antisemitic views under a pseudonym in white supremacist publications.'* This part is entirely absent from the Infogalactic entry.

- Wikipedia mentions that **The Rebel Media** is/was the Canadian version of Breitbart News, a title they rejected after the Charlottesville rally. Infogalactic does not mention this at all.

- The differences in the descriptions of **The American Conservative** are also interesting. While Wikipedia mentions that the publication states that *'it exists to promote a conservatism that opposes unchecked power in government and business,'* Infogalactic says it reflects *'traditional American conservatism that has argued vigorously against American interventionism, against a debt-based fiscal policy used to finance adventurism abroad and government growth at home, and against the intrusions on Americans’ private lives.'*
