# Advanced Data Science Capstone -- True vs. Fake News 
## Data Exploration

News of current events is a potent tool for directing the reader's attention.  In a fragmented media landscape with low publishing costs, fake news can flourish and distract from or dilute the attention that genuine issues receive.  Attention is a finite mental resource, and it is better spent on real news than on fiction.

This project will attempt to use a recurrent neural network to classify news stories as real or fake.

Our sources of data:

### Clément Bisaillon fake news dataset on Kaggle

Initially we considered using a single source for both fake and real news, the dataset provided by Clément Bisaillon on Kaggle (Classifying the news):
https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset

Examining the fake news portion of this dataset showed that it includes biased and opinionated, but not necessarily false, stories, so we will not use the fake news from there.  The true dataset seems usable and largely comes from Reuters.  We will have to take care that our model is not biased by this reliance on Reuters data.
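
One way to reduce the reliance on Reuters is to strip the wire-service dateline before training, so the model cannot simply learn that the word "Reuters" means true. The following is a minimal sketch, assuming that many true articles open with a dateline of the form `WASHINGTON (Reuters) - `. That assumption should be checked against the actual data. The function name `strip_reuters_dateline` and the regular expression are illustrative, not part of the dataset or any library.

```python
import re

# Assumed dateline shape: optional location, "(Reuters)", then a dash.
# The location part is limited to 60 characters so that a mention of
# "(Reuters)" deep inside the first sentence is less likely to match.
DATELINE = re.compile(
    r"^\s*(?:[A-Z][A-Za-z.,/ ]{0,60}?\s*)?\(Reuters\)\s*[-\u2013\u2014]\s*"
)


def strip_reuters_dateline(text: str) -> str:
    """Remove a leading 'CITY (Reuters) - ' dateline if one is present."""
    return DATELINE.sub("", text, count=1)


if __name__ == "__main__":
    sample = "WASHINGTON (Reuters) - The Senate voted on Tuesday."
    print(strip_reuters_dateline(sample))
```

Removing the dateline addresses only the most obvious cue. Differences in house style between Reuters and other publishers may still leak into the model, so a held-out set from other sources remains worthwhile.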

### Konspiratori.sk fake news dataset on Kaggle
The fake news dataset at https://www.kaggle.com/mrisdal/fake-news contains text and metadata scraped from 244 websites tagged as "bullshit" by the BS Detector Chrome extension by Daniel Sieradski.  The extension uses the opinions of a commission at Konspiratori.sk to designate articles as disreputable and aims to warn against false reports, conspiracy theories, hoaxes, hate speech, and fascist ideologies.  The website states:
```
We create a list of websites, with respect to which the members of our Review Board have doubts regarding their credibility and content quality.
  
The choice is yours whether you agree with our list and whether you use it. The list of websites was compiled by the Review Board members based on their professional assessment of these websites under clearly defined criteria. The Review Board members do this work on a voluntary basis since they are convinced that it is of value to the whole society.

  In our database you will find websites that meet, in the opinion of the Review Board, at least one of the following points:

1. A website contains materials of fraudulent or charlatan nature, such as miraculous healing, magic preparations and the like. The reviewed criterion is the conflict with objective, scientific knowledge, especially if such published information could lead to neglecting any necessary treatment or directly damage one’s health. This does not apply to traditional alternative treatment methods, promotion of healthy lifestyle or healing procedures based on nature etc.
    
2. A website contains misleading news, misinformation or false propaganda, i.e. any claims that are in conflict with the facts, e.g. photos or videos used in some misleading context, made up or severely misinterpreted events, etc. This does not apply to clearly presented opinion articles.
    
3. A website contains conspiracy theories and “delusions” that could have more serious political, economic or health consequences, stirring up passionate or hateful feelings without any critical assessment. This does not apply to curiosities, mysteries, or clearly marked speculations.
    
4. A website contains vulgarisms, calls to violence, extremist content, spreading false alarms, aggressive personal attacks, such as “eye-for-eye”, defamation of minorities, races, nationalities, religious groups, etc.
    
5. A website does not respect fundamental principles of journalistic ethics: it does not publish disclaimers (dementi), it leaves published and uncorrected any news that has been objectively proven untrue. A website does not have a clear owner or authors (protected sources of information and pseudonyms are respected), the site does not publish any responses by concerned parties, it grossly mixes the actual news with commentaries, repeatedly publishes shocking and false claims aimed at increasing visitor traffic, which it however quickly corrects, etc.

We shall gradually supplement and adapt these criteria based on our practical experience. Our primary goal is to protect the advertisers and their interests. The activities of the Review Board and the rules both follow this goal and we reserve the right to amend them as necessary.

Our database is purely of advisory nature and it is the responsibility of each advertiser to consider how to use it. It represents the opinion of the Review Board, which opinion we by no means present as a fact.
```
This dataset labels an item as fake based on the website that published it, which is a less specific signal than I would like for a training dataset.
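
To make the concern concrete, here is a minimal sketch of site-level labeling. The domain list `FLAGGED_SITES` and the helper `site_label` are hypothetical and stand in for whatever site list the dataset actually uses. The point is that every article inherits its label from the host name alone, whatever the article itself says.

```python
from urllib.parse import urlparse

# Hypothetical flagged-site list, for illustration only.
FLAGGED_SITES = {"example-hoax.com": "conspiracy"}


def site_label(article_url, flagged=FLAGGED_SITES):
    """Return the label of the publishing site, or None if it is not listed."""
    host = (urlparse(article_url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.example.com and example.com alike
    return flagged.get(host)


if __name__ == "__main__":
    # An accurate story on a flagged site still receives the site's label.
    print(site_label("https://www.example-hoax.com/accurate-weather-report"))
    print(site_label("https://example.org/some-story"))
```

A classifier trained on labels produced this way may learn the style of the flagged sites rather than whether a given story is false.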

### FakeNewsNet dataset and scripts

Finally, the best option seems to be downloading a fresh set of news identified as fake by PolitiFact.  Scripts available at https://github.com/KaiDMML/FakeNewsNet use PolitiFact and GossipCop (gossipcop.com) data to download the latest real or fake news.  These articles are the most likely to be useful because they have been classified as true or false (rather than simply biased or opinionated) based on the actual article content.  An older sample of data, already downloaded, is available at https://www.kaggle.com/antmarakis/fake-news-data

### Rada Mihalcea dataset 
Rada Mihalcea (http://web.eecs.umich.edu/~mihalcea) has made a dataset available at http://web.eecs.umich.edu/~mihalcea/downloads.html#FakeNews  It has two sets of files, one about news and another about celebrity gossip.  News stories are found in six topic areas: technology, education, business, sports, politics, and entertainment.  They are sourced from mainstream US news sites such as ABCNews, CNN, USAToday, NewYorkTimes, FoxNews, Bloomberg, and CNET, among others.  Each item in the group of real stories has a matching fake news story written through Amazon Mechanical Turk, a service that pays people to do small piecework tasks.  The fake news is useful and occasionally humorous, but the amateur writing betrays itself and is fairly easy for a reader to spot.  I am not sure it would present much of a challenge to a machine learning algorithm.

Here's an example of a real news item and its corresponding fake item:
#### Sample true story
> Laptop cabin ban 'ineffective' says IATA 

> The US and UK ban on laptops in cabin baggage on certain flights will not be an effective security measure  the International Air Transport Association has said. In a strongly worded speech  IATA chief executive Alexandre de Juniac said the ban also caused commercial distortions. The US ban was brought in as an anti-terrorist precaution. It covers inbound flights on airlines operating out of 10 airports in the Middle East  North Africa and Turkey. The British ban is similar but applies to different airlines. Airline passengers on 14 carriers are subject to the ban on inbound direct flights from Turkey  Lebanon  Jordan  Egypt  Tunisia and Saudi Arabia.

#### Sample fake story written using Amazon Mechanical Turk

> Chief executive Alexandre de Juniac, of the International Air Transport Association has said that after surveying and polling passengers, everyone seems to be in agreement that the ban on laptops for in cabin use is awesome. Steve Harvey, who was appointed by the IATA, conducted the surveys and gathered the intel. Major air carriers such as Delta, Pan Am , and Southwest have seen  ticket sales sore to new heights. One spokesperson every said that people were much happier and reported a better flight experience and that passengers felt more communal about their experience. People really seem to respond to cookies and milk and good old time cartoons for the inflight movies. The cookies now served on flight for all international flights contain edibles and  the crew reports that most passengers eat, watch cartoons, and then sleep the majority of the flight.

This dataset may be useful for training on real news and as a test set for fake news, but I am less inclined to use its fake news as a training set.

The gossip set of stories comes from mainstream news websites and entertainment magazine websites such as Entertainment Weekly, People Magazine, RadarOnline, and other tabloid and entertainment-oriented publications.  Again, each real item is paired with a fake item that was manually verified using gossip-checking sites such as GossipCop.com and cross-referenced with information from other entertainment news sources on the web.  This set of stories also looks attractive as a set of test stories, both fake and real, but because of its focus on gossip, I will not use it as a training set; I want to focus on topics that have more real-world use.

