# FAKE NEWS CLASSIFIER
## A machine learning approach to fake news detection

DATA COLLECTION - WEB SCRAPING OF NEWS ARTICLES
- Legitimate newspaper websites: NYTimes, CNN, MarketWatch, CNBC, The Guardian, Fox News, etc.
- Fake news websites: The Daily Mash, The Onion, NewsThump, ClickHole, 100percentfedup, PolitiFact, etc.

In [2]:
pip install feedparser

Collecting feedparser
  Downloading https://files.pythonhosted.org/packages/91/d8/7d37fec71ff7c9dbcdd80d2b48bcdd86d6af502156fc93846fb0102cb2c4/feedparser-5.2.1.tar.bz2 (192kB)
Building wheels for collected packages: feedparser
  Building wheel for feedparser (setup.py) ... done
  Created wheel for feedparser: filename=feedparser-5.2.1-cp37-none-any.whl size=44940 sha256=fd8a4023962cfb1162cd7593a20d07054242d395d04fb953104bb5fbf4fef86c
  Stored in directory: /Users/Liuyang/Library/Caches/pip/wheels/8c/69/b7/f52763c41c5471df57703a0ef718a32a5e81ee35dcf6d4f97f
Successfully built feedparser
Installing collected packages: feedparser
Successfully installed feedparser-5.2.1
Note: you may need to restart the kernel to use updated packages.


In [9]:
pip install newspaper3k

Collecting newspaper3k
  Downloading https://files.pythonhosted.org/packages/d7/b9/51afecb35bb61b188a4b44868001de348a0e8134b4dfa00ffc191567c4b9/newspaper3k-0.2.8-py3-none-any.whl (211kB)
Collecting feedfinder2>=0.0.4
  Downloading https://files.pythonhosted.org/packages/35/82/1251fefec3bb4b03fd966c7e7f7a41c9fc2bb00d823a34c13f847fd61406/feedfinder2-0.0.4.tar.gz
Collecting jieba3k>=0.35.1
  Downloading https://files.pythonhosted.org/packages/a9/cb/2c8332bcdc14d33b0bedd18ae0a4981a069c3513e445120da3c3f23a8aaa/jieba3k-0.35.1.zip (7.4MB)
Collecting cssselect>=0.9.2
  Downloading https://files.pythonhosted.org/packages/3b/d4/3b5c17f00cce85b9a1e6f91096e1cc8e8ede2e1be8e96b87ce1ed09e92c5/cssselect-1.1.0-py2.py3-none-any.whl
Collecting tldextract>=2.0.1
  Downloading https://files.pythonhosted.org/packages/fd/0e/9ab599d6e78f0340bb1d1e28d

In [11]:
import feedparser as fp
import json
import newspaper
from newspaper import Article
from time import mktime
from datetime import datetime

In [12]:
# Maximum number of articles to download per news source (the counter is reset for each source)
LIMIT = 20000

In [13]:
data = {}
data['newspapers'] = {}

In [16]:
# Load the JSON file that lists the news sites
with open('data/NewsPapers.json') as data_file:
    companies = json.load(data_file)
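
The scraping loop expects each entry in `data/NewsPapers.json` to map a source name to a dictionary with a `link` key and, optionally, an `rss` key. The file itself is not shown here, so the following is only a minimal sketch of what such a structure might look like. The source names and URLs are hypothetical placeholders, not the entries actually used in this notebook.

```python
import json

# Hypothetical stand-in for the contents of data/NewsPapers.json.
# The names and URLs below are placeholders chosen for illustration only.
example_sources = {
    "example_with_rss": {
        "rss": "https://example.com/rss/business.xml",
        "link": "https://example.com/",
    },
    "example_without_rss": {
        "link": "https://example.org/",
    },
}

# Round-trip through JSON to mimic what json.load returns for such a file.
companies_example = json.loads(json.dumps(example_sources))

# Decide which scraping route each source would take.
routes = {}
for name, value in companies_example.items():
    if value.get("rss"):
        routes[name] = "rss"        # parsed with feedparser
    else:
        routes[name] = "newspaper"  # crawled with newspaper.build
    print(name, "->", routes[name])
```

Keeping the feed URL optional lets a single file mix sources that publish an RSS feed with sources that have to be crawled from their front page.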

In [17]:
count = 1

# Iterate through each news company.
# `company` is the source name; `value` is its dictionary of links.
for company, value in companies.items():
    # If an RSS link is provided in the JSON file, it is the first choice,
    # because RSS feeds often give more consistent and correct data.
    # To skip the RSS feed, leave the 'rss' attribute empty (or omit it) in the JSON file.
    if value.get('rss'):
        d = fp.parse(value['rss'])
        print("Downloading articles from ", company)
        newsPaper = {
            "rss": value['rss'],
            "link": value['link'],
            "articles": []
        }
        for entry in d.entries:
            # Check that a parseable publish date is provided; if not, the article is skipped.
            # This keeps the data consistent and keeps the script from crashing on a missing date.
            if entry.get('published_parsed'):
                if count > LIMIT:
                    break
                article = {}
                article['link'] = entry.link
                date = entry.published_parsed
                article['published'] = datetime.fromtimestamp(mktime(date)).isoformat()
                try:
                    content = Article(entry.link)
                    content.download()
                    content.parse()
                except Exception as e:
                    # If the download fails for some reason (e.g. a 404), the script continues
                    # with the next article.
                    print(e)
                    print("continuing...")
                    continue
                article['title'] = content.title
                article['text'] = content.text
                article['author'] = content.authors
                newsPaper['articles'].append(article)
                if count % 10 == 0:
                    print(count, "articles downloaded from", company, ", url: ", entry.link)
                count = count + 1
    else:
        # This is the fallback method when no RSS feed link is provided.
        # It uses the Python newspaper library to extract articles.
        print("Building site for ", company)
        paper = newspaper.build(value['link'], memoize_articles=False)
        newsPaper = {
            "link": value['link'],
            "articles": []
        }
        noneTypeCount = 0
        for content in paper.articles:
            if count > LIMIT:
                break
            try:
                content.download()
                content.parse()
            except Exception as e:
                print(e)
                print("continuing...")
                continue
            # Again, for consistency, an article with no publish date is skipped.
            # After more than 10 consecutive articles without a publish date, the rest of this company is skipped.
            if content.publish_date is None:
                print(count, " Article has date of type None...")
                noneTypeCount = noneTypeCount + 1
                if noneTypeCount > 10:
                    print("Too many noneType dates, aborting...")
                    noneTypeCount = 0
                    break
                count = count + 1
                continue
            article = {}
            article['title'] = content.title
            article['text'] = content.text
            article['link'] = content.url
            article['published'] = content.publish_date.isoformat()
            article['author'] = content.authors
            newsPaper['articles'].append(article)
            if count % 10 == 0: 
                print(count, "articles downloaded from", company, " using newspaper, url: ", content.url)
            count = count + 1
            noneTypeCount = 0
    count = 1
    data['newspapers'][company] = newsPaper


# Finally, save the articles as a JSON file.
fname = 'scraped_articles.json'
print('saving articles . . . in {}'.format(fname))
try:
    with open(fname, 'w') as outfile:
        json.dump(data, outfile)
except Exception as e:
    print(e)

Downloading articles from  newyorktimes_business
10 articles downloaded from newyorktimes_business , url:  https://www.nytimes.com/2020/01/18/business/davos-china.html?emc=rss&partner=rss
20 articles downloaded from newyorktimes_business , url:  https://www.nytimes.com/2020/01/17/business/stock-market-rally.html?emc=rss&partner=rss
30 articles downloaded from newyorktimes_business , url:  https://www.nytimes.com/2020/01/17/business/books-debt-personal-finance.html?emc=rss&partner=rss
40 articles downloaded from newyorktimes_business , url:  https://www.nytimes.com/2020/01/17/business/bond-market-investments.html?emc=rss&partner=rss
50 articles downloaded from newyorktimes_business , url:  https://www.nytimes.com/2020/01/16/business/student-loans-forgiveness-taxes.html?emc=rss&partner=rss
Downloading articles from  newyorktimes_science
10 articles downloaded from newyorktimes_science , url:  https://www.nytimes.com/2020/01/19/us/space-force-uniform-camo.html?emc=rss&partner=rss
20 artic

Building prefix dict from /Users/Liuyang/opt/anaconda3/envs/withnewpkg/lib/python3.7/site-packages/jieba/dict.txt ...
Dumping model to file cache /var/folders/d4/ym508p7j2sx7n8kq2d37gzxm0000gn/T/jieba.cache
Loading model cost 1.7148289680480957 seconds.
Prefix dict has been built succesfully.


70 articles downloaded from newyorktimes_politics  using newspaper, url:  https://cn.nytimes.com/china/20200117/china-iran-us-weibo/
80 articles downloaded from newyorktimes_politics  using newspaper, url:  https://cn.nytimes.com/opinion/20200117/ai-weiwei-germany-china/
89  Article has date of type None...
90 articles downloaded from newyorktimes_politics  using newspaper, url:  https://cn.nytimes.com/culture/20200115/star-wars-china/
100  Article has date of type None...
101  Article has date of type None...
102  Article has date of type None...
103  Article has date of type None...
104  Article has date of type None...
105  Article has date of type None...
106  Article has date of type None...
107  Article has date of type None...
108  Article has date of type None...
109  Article has date of type None...
110  Article has date of type None...
Too many noneType dates, aborting...
Building site for  newyorktimes_world
10 articles downloaded from newyorktimes_world  using newspaper, ur