<a href="https://colab.research.google.com/github/adityaj12/Fake-News-Project/blob/main/FakeNews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 1: Introduction to Fake News Classification


In this notebook we'll be:
1.   Exploring HTML and our Data
2.   Understanding Word Frequencies and Fake vs. Real Fractions



In [None]:
#@title Run this to load your data { display-mode: "form" }
import os
from bs4 import BeautifulSoup as bs
import pickle
  
import requests
import zipfile
import io

# Download class resources...
!wget -O data.zip 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Fake%20News%20Detection/inspirit_fake_news_resources%20(1).zip'
!unzip data.zip

basepath = '.'

--2021-05-29 13:34:37--  https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Fake%20News%20Detection/inspirit_fake_news_resources%20(1).zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.148.128, 209.85.200.128, 108.177.112.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.148.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 109422100 (104M) [application/zip]
Saving to: ‘data.zip’


2021-05-29 13:34:39 (65.8 MB/s) - ‘data.zip’ saved [109422100/109422100]

Archive:  data.zip
  inflating: train_val_data.pkl      
  inflating: test_data.pkl           


## Anatomy of a (Fake) News Website

Have you ever wondered how websites like *google.com* and *nytimes.com* work under the hood? Using the internet every day, it is easy to forget how magical even the most mundane web browsing experiences are. Consider, for example, this article on the New York Times:

![NYTimes Article](https://www.niemanlab.org/images/ochs-nytimes-article-page.png)


How does the browser know to show the title of the article near the top of the page? How does it know that the words "Art & Design" should be left-centered and gray-colored? How does it know where to find the image to display?

All of these questions can be answered by looking through the HTML of a webpage. HTML is a simple markup language that augments text with the structure you'd expect from a webpage, and it provides that structure for every webpage you see. Here's an example of an HTML document for a simple webpage.

![HTML Example](https://miro.medium.com/max/498/1*5gJzummAqpBDGATo0fjU6Q.jpeg)

### HTML in a Nutshell

HTML is the standard markup language for creating Web pages.
* HTML stands for HyperText Markup Language
* HTML describes the structure of Web pages using markup
* HTML elements are the building blocks of HTML pages
* HTML elements are represented by tags
* HTML tags label pieces of content such as "heading", "paragraph", "table", and so on
* Browsers do not display the HTML tags, but use them to render the content of the page
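To make the idea of tags labeling content concrete, here is a minimal sketch that uses Python's standard-library `html.parser` module to list the tags and the visible text in a tiny page. The `TagCollector` class and the `sample_html` string are made up for illustration and are not part of the course resources.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every opening tag and every non-empty piece of text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag, e.g. <h1> or <p>.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only chunks.
        if data.strip():
            self.text.append(data.strip())


# Hypothetical page, similar in spirit to the example image above.
sample_html = "<html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"

collector = TagCollector()
collector.feed(sample_html)
collector.close()

print("Tags:", collector.tags)
print("Text:", collector.text)
```

The tags never show up in the text list, which mirrors how a browser uses them for structure without displaying them.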




## Exercise 1 | HTML Warmup




The best way to learn HTML is to type some of your own. 
1. Start by opening [this interactive environment](https://www.w3schools.com/html/tryit.asp?filename=tryhtml_default). 
2. Now change the text in the header tag and the paragraph tag to make the screen on the right look like this. (*Hint: make sure to hit the “Run” button.*)
 
 [![Screen-Shot-2019-06-10-at-8-49-53-AM.png](https://i.postimg.cc/DzKrZbL9/Screen-Shot-2019-06-10-at-8-49-53-AM.png)](https://postimg.cc/JsPDY0Tx)
 
 
3. Now add another header tag underneath the paragraph tag to make the screen on the right look like this: 
 
 [![Screen-Shot-2019-06-10-at-8-53-50-AM.png](https://i.postimg.cc/rwpm9ycD/Screen-Shot-2019-06-10-at-8-53-50-AM.png)](https://postimg.cc/KRdxG21b)
 
 
Great! Now that you understand the fundamental structure of HTML, let’s explore how real-life web pages use it. 

Since we are going to be analyzing news websites, let’s take a look at what their HTML looks like. 

1. Head over to the [New York Times webpage](https://www.nytimes.com/).
2. Next, right-click on the page and select “View Selection” or “View Source” (depending on your web browser). You should now see all the HTML source code for that page. You will see a lot of tags and text that you don’t know yet, and that’s OK. 
3. Just by looking at the HTML, can you anticipate some challenges that might arise when trying to analyze this code? For example, how will we be able to write a program that can differentiate between a link and text?  
4. Discuss 1-2 potential challenges that come to mind. 
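As one possible starting point for the link-versus-text question, the sketch below separates link targets from visible text using the standard-library `html.parser` module. The `LinkAndTextParser` class and the `snippet` string are hypothetical, and real news pages are far messier than this example.

```python
from html.parser import HTMLParser


class LinkAndTextParser(HTMLParser):
    """Collects link targets (href values) separately from visible text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        # Links are <a> tags; the destination lives in the href attribute.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Everything between tags is text; ignore whitespace-only pieces.
        stripped = data.strip()
        if stripped:
            self.text_chunks.append(stripped)


# Hypothetical snippet, not taken from any real page.
snippet = '<p>Read the <a href="https://www.example.com/story">full story</a> here.</p>'

parser = LinkAndTextParser()
parser.feed(snippet)
parser.close()

print("Links:", parser.links)
print("Text:", parser.text_chunks)
```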





## Problem Statement

**Given the URL of a news website and its HTML, can we classify the news website as either fake or real?** 

## Exercise 2 | Exploring the Data 

### Dataset 

Load the training and validation data in the cell below:


In [None]:
with open(os.path.join(basepath, 'train_val_data.pkl'), 'rb') as f:
  train_data, val_data = pickle.load(f)

print('Number of train examples:', len(train_data))
print('Number of val examples:', len(val_data))

# Label 0 means real, so these two lines report the fraction of real examples.
print('Fraction of train examples that are real:', len([datapoint for datapoint in train_data if datapoint[2] == 0]) / float(len(train_data)))
print('Fraction of val examples that are real:', len([datapoint for datapoint in val_data if datapoint[2] == 0]) / float(len(val_data)))

Number of train examples: 2002
Number of val examples: 309
Fraction of train examples that are real: 0.5224775224775224
Fraction of val examples that are real: 0.5436893203883495


We can see that the training set is much larger than the validation set, and that each portion is split roughly evenly between real and fake news websites. Now let's explore what each data point looks like. 

### Changing The Example Index

Spend ~15 minutes browsing through the data by changing example_idx below. You are able to see the URL, label (0 is real, 1 is fake), and part of the HTML for an example.

Observe that each data point has three values: the URL, the HTML, and the binary (0 or 1) label. A label of "1" indicates that the website is a fake news website, and a label of "0" indicates that the website does not have fake news. See if you can spot some differences between examples with label 0 and examples with label 1, especially in their URLs! The HTML may be a bit difficult to read, since it is much longer, so don't worry about this.
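If it helps to compare URLs side by side, one option is to group them by label. The sketch below does this on a made-up list, `toy_train_data`, that only mimics the (URL, HTML, label) layout described above; the URLs in it are invented, and you would pass the real `train_data` instead.

```python
from collections import defaultdict

# Hypothetical stand-in for train_data: (URL, HTML, label), 0 = real, 1 = fake.
toy_train_data = [
    ("www.example-news.com", "<html>...</html>", 0),
    ("www.city-paper.org", "<html>...</html>", 0),
    ("www.shocking-truth.info", "<html>...</html>", 1),
    ("www.daily-secrets.co", "<html>...</html>", 1),
]

# Group URLs by their label so the two classes can be compared directly.
urls_by_label = defaultdict(list)
for url, html, label in toy_train_data:
    urls_by_label[label].append(url)

for label in sorted(urls_by_label):
    name = "fake" if label == 1 else "real"
    # Show only the first few URLs per class to keep the output short.
    print(name, urls_by_label[label][:5])
```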

In [None]:
### YOUR CODE HERE ###
example_idx = 5
### END CODE HERE ###

print('Number of values per data point: %d\n' % len(train_data[0]))

print('URL for chosen example:', train_data[example_idx][0])
print('Label for chosen example:', train_data[example_idx][2])
print('HTML for chosen example (first 1000 chars):\n\n', bs(train_data[example_idx][1]).prettify()[:1000])

Number of values per data point: 3

URL for chosen example: www.reuters.com
Label for chosen example: 0
HTML for chosen example (first 1000 chars):

 <!--[if !IE]> This has been served from cache <![endif]-->
<!--[if !IE]> Request served from apache server: produs--i-0e8ab7f4ed17eb0b0 <![endif]-->
<!--[if !IE]> Cached on Tue, 21 May 2019 21:09:30 GMT and will expire on Tue, 21 May 2019 21:14:14 GMT <![endif]-->
<!--[if !IE]> token: b50c6d65-2cbe-4693-8c4c-3e8242510ee5 <![endif]-->
<!--[if !IE]> Prepopulated from the cache-server <![endif]-->
<!--[if !IE]> App Server /produs--i-0718b85b30ff2bfdf/ <![endif]-->
<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   Business &amp; Financial News, U.S &amp; International Breaking News  | Reuters
  </title>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8"/>
  <meta content="on" http-equiv="x-dns-prefetch-control"/>
  <link href="//s1.reutersmedia.net" rel="dns-prefetch"/>
  <link href="//s2.reutersmedia.net" re

## Exercise 3 | Fake vs Real Fraction


### Probing Hypotheses

Browsing through the examples above, you might have gotten a few ideas for differences between real and fake news websites. For instance, you might have noticed that many fake news websites use domain name extensions other than ".com", whereas this is less common for real news websites. So a possible hypothesis could be: 
 
#### Websites with .com extensions are more likely to be real news. 




### Real Fraction 

One simple way to quantify our observation is to ask what fraction of all real news websites use a certain extension (.com, .org, etc.). We can call this number the Real Fraction. 


### Fake Fraction

Likewise, we can find what fraction of all fake news websites use that extension. We can call this number the Fake Fraction. 

### Fake/Real Ratio

How do we use the Fake Fraction and Real Fraction to test our hypothesis? We can divide the Fake Fraction by the Real Fraction to form a ratio, which we can call the Fake vs Real Ratio. 
 
For the .com extension, the Fake vs Real Ratio would be as follows. 
 
#### (.com) Fake vs Real Ratio = Fraction of Fake Sites with (.com) / Fraction of Real Sites with (.com) 



### Interpreting Ratios 

* If the ratio is less than 1, then we have reason to believe that real news websites disproportionately use ".com" extensions.
* If the ratio is greater than 1, then we have reason to believe that fake news websites disproportionately use ".com" extensions.
* If the ratio is 1, then both fake and real news websites use the .com extension about the same. This means that our hypothesis isn't very useful for separating out real and fake news websites, at least not by itself.
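To see the arithmetic on a small scale, here is a worked sketch on five invented data points. The URLs, the `toy_data` list, and the `fraction_with_extension` helper are all hypothetical and exist only for this illustration.

```python
# Hypothetical toy data: (URL, HTML, label), with 0 = real and 1 = fake.
toy_data = [
    ("www.example-news.com", "<html></html>", 0),
    ("www.daily-report.com", "<html></html>", 0),
    ("www.city-paper.org", "<html></html>", 0),
    ("www.shocking-truth.com", "<html></html>", 1),
    ("www.real-facts.info", "<html></html>", 1),
]


def fraction_with_extension(data, label, extension):
    """Among sites with the given label, the fraction whose URL ends with extension."""
    urls = [url for url, _html, site_label in data if site_label == label]
    return sum(url.endswith(extension) for url in urls) / len(urls)


real_fraction = fraction_with_extension(toy_data, 0, ".com")  # 2 of 3 real sites
fake_fraction = fraction_with_extension(toy_data, 1, ".com")  # 1 of 2 fake sites
ratio = fake_fraction / real_fraction

print("Real fraction:", real_fraction)
print("Fake fraction:", fake_fraction)
print("Fake vs Real Ratio:", ratio)
```

Here the ratio comes out to about 0.75, which is less than 1, so in this toy sample the real sites use ".com" disproportionately.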


### Test in Code

We define a function below that returns the real and fake fractions of the training data that satisfy a hypothesis. In our code, our hypotheses will just be simple functions that take in a single data point and return "True" or "False". 


Finish the function below, which computes the real and fake fractions as described above. For each datapoint, compute whether the hypothesis is true, and use this along with the label to update *real_true*, *real_total*, *fake_true*, and *fake_total*.

In [None]:
def get_real_and_fake_fractions(train_data, hypothesis):
    # Label 0, hypothesis true
    real_true = 0.0
    # Label 0 total
    real_total = 0.0
    # Label 1, hypothesis true
    fake_true = 0.0
    # Label 1 total
    fake_total = 0.0
    
    for datapoint in train_data:
        # Each datapoint has URL, HTML, label in that order.
        label = datapoint[2]
        ### YOUR CODE HERE ###
        hypothesis_truth = int(hypothesis(datapoint))
        
        if label: # Fake
            fake_total += 1
            fake_true += hypothesis_truth

        else: # Real
            real_total += 1
            real_true += hypothesis_truth
            
        ### END CODE HERE ###
            
    return real_true / real_total, fake_true / fake_total

Now, play around with this demonstration, which asks you for a domain name extension and prints out the real fraction, the fake fraction, and the ratio of fake fraction to real fraction. Make sure you understand what the code is doing! After running it once, try other values, like ".org", ".co.uk", and ".edu"! The printed values will update automatically. Note that the ratio may be "Infinity" if no real websites in the training data use that extension, and that it is reported as 1 whenever the real and fake fractions are equal (including when both are 0).

In [None]:
#@title Run this cell with your hypothesis domain name extension { run: "auto" }

def domain_extension_hypothesis(datapoint):
  extension = ".io" #@param {type:"string"}
  url = datapoint[0]
  return url.endswith(extension)
  
real_fraction, fake_fraction = get_real_and_fake_fractions(train_data, 
                                                           domain_extension_hypothesis)

print('Real fraction:', real_fraction)
print('Fake fraction:', fake_fraction)

# Simple logic for making the printed ratio more interpretable.
def pretty_ratio(fake_fraction, real_fraction):
    ratio = (fake_fraction / real_fraction) if real_fraction > 0 else 'Infinity'
    if fake_fraction == real_fraction:
      ratio = 1
    return ratio
  
print('Ratio fraction:', pretty_ratio(fake_fraction, real_fraction))

Real fraction: 0.0
Fake fraction: 0.0
Ratio fraction: 1


## Exercise 4 | Ratio Fraction Infinity 

Can you find a domain name extension that produces a ratio fraction of Infinity? Can you find one that produces a ratio fraction of 0? Fill them in below (~3 minutes).

In [None]:
### YOUR CODE HERE ###
domain_name_extension_with_ratio_infinity = '.uk'
domain_name_extension_with_ratio_zero = ''
### END CODE HERE ###

Now, answer the following questions in your worksheet. 

How do we interpret ratio fractions of 0?

How do we interpret ratio fractions of Infinity? 

What might this tell us about our data?

## Exercise 5 | Word Frequency Method 


One natural idea is to check whether the number of times a word appears in the HTML of a webpage is above a certain threshold. For example, given the word "Clinton" and a threshold of 3, does nytimes.com mention "Clinton" more than 3 times? Does infowars.com? This may tell us something about how useful the word "Clinton" is for telling us whether a website is fake or not.
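As a small illustration of the thresholding idea, the sketch below counts a word in an invented HTML string. The `count_exceeds_threshold` helper and `toy_html` are hypothetical. Note that `str.count` counts substrings, so "Clinton's" also matches "Clinton".

```python
def count_exceeds_threshold(html, word, threshold):
    """Return True if word appears in html more than threshold times (case-insensitive)."""
    # Lowercase both sides so "Clinton" and "clinton" are counted together.
    return html.lower().count(word.lower()) > threshold


# Hypothetical HTML fragment that mentions the word three times.
toy_html = "<p>Clinton spoke today. Clinton's aides said Clinton would return.</p>"

print(count_exceeds_threshold(toy_html, "Clinton", 2))  # three mentions vs. threshold 2
print(count_exceeds_threshold(toy_html, "Clinton", 3))  # three mentions vs. threshold 3
```

With a strict "greater than" comparison, three mentions pass a threshold of 2 but not a threshold of 3.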


### Test in Code

Now, code up the hypothesis function below, which tests whether the count of a provided word is above a threshold, and play with the resulting demo (~15 minutes). We have provided some starter code for you.

In [None]:
#@title Run this cell with a word and a threshold { run: "auto" }

def get_count_from_html(html, hypothesis_word):
    # Transform word to lowercase for consistent results.
    return html.count(hypothesis_word.lower())

def word_threshold_hypothesis(datapoint):
  hypothesis_word = "COVID" #@param {type:"string"}
  threshold = 10 #@param {type:"integer"}
  # Transform HTML to lowercase for consistent results.
  html = datapoint[1].lower() 
    
  ### YOUR CODE HERE ### (Use get_count_from_html!)
  count = get_count_from_html(html, hypothesis_word)
  return count > threshold
  ### END CODE HERE ###
  
real_fraction, fake_fraction = get_real_and_fake_fractions(train_data, 
                                                           word_threshold_hypothesis)

print('Real fraction:', real_fraction)
print('Fake fraction:', fake_fraction)
  
print('Ratio fraction:', pretty_ratio(fake_fraction, real_fraction))

Real fraction: 0.0
Fake fraction: 0.0
Ratio fraction: 1


## Exercise 6 | Hypothesize



Once you have "Clinton" working with a threshold of 3, try other words, like "Trump", "Obama", "Sports", "Finance", and "Opinion". 

Discuss three interesting combinations of hypothesis word and threshold, and explain why you think each one produces the result it does.

Be prepared to share with the class! 

## Exercise 7 | Custom Hypothesis


Now, create your own custom hypotheses! All you should change is the hypothesis function (~20 minutes). 

Some ideas: 
* check whether websites contain certain HTML tags (e.g. "\<table>", "\<section>"), 
* check whether websites contain certain words or phrases in the URL, 
* check whether websites are WordPress blogs (hint: check whether they contain "wp-content" frequently).
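As one possible shape for a custom hypothesis, here is a sketch of the WordPress idea. The function name, the `min_mentions` cutoff of 5, and the two toy data points are assumptions chosen for illustration, so tune them against the real data.

```python
def wordpress_hypothesis(datapoint, min_mentions=5):
    """Guess that a site is a WordPress blog if "wp-content" appears often in its HTML."""
    # Each datapoint is (URL, HTML, label); the HTML is at index 1.
    html = datapoint[1].lower()
    return html.count("wp-content") > min_mentions


# Hypothetical data points, not drawn from the course dataset.
toy_blog = ("www.example-blog.com", '<img src="/wp-content/a.png">' * 6, 1)
toy_plain = ("www.example-news.com", "<html><body><p>Hello</p></body></html>", 0)

print(wordpress_hypothesis(toy_blog))
print(wordpress_hypothesis(toy_plain))
```

Because it takes a single data point and returns True or False, a function like this could be passed to `get_real_and_fake_fractions` in place of the earlier hypotheses.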

In [None]:
### YOUR CODE HERE ###


### END CODE HERE ###

Once you are done, list your most interesting hypotheses below and prepare to discuss with the class!

Congratulations on completing this notebook! Tomorrow, we'll use the insights you just built up to build our baseline model.