# Project Proposal - Fake News Detection

<h3>By Aviv Farag</h3>

## Discussion

Fake news consists of news articles that contain false claims intended to manipulate people's perception of a given subject. Another definition is "low quality news with intentionally false information" [[1]](#def). Fake news spreads easily across social media because it is cheap to produce, which makes it profitable, and because social media delivers news in small pieces of information [[2]](#spread). Moreover, bots deployed on social media have a great impact on spreading fake news: they amplify it in the very early moments after a post is published, and they target influential users through replies and mentions [[3]](#bots). This strategy is often used by interest groups to influence a country's election [[4]](#beliefs). It can also be used by other groups or individuals in the business sector to damage the reputation of others [[4]](#beliefs). Finally, anyone can post, reply, and share on social media. Combined with the idea that "everyone now has their own truth, which is based on their personal knowledge and experience and not much else" [[4]](#beliefs), this is another reason for the wide spread of fake news across social media.


## Data Exhibition

`spark_df.show(10)`:
<br>
<center><img src= "http://drive.google.com/uc?export=view&id=1lm_c40U92urrQ4lOmqHwlThLNG8CmYNH" width="70%" />
<br>
<b><u>Fig 1.</u></b>: First 10 rows of the dataset </center>

In the figure above, both the text and the author are truncated because they are much longer than the width of the output columns.

### Attributes:
There are 3 attributes in this dataset:

1. ***Title*** - The title of the news article
1. ***Text*** - The text of the news article
1. ***Author*** - The author of the news article. There are 4194 unique authors in the dataset:
```python
(spark_df
    .filter(spark_df["author"] != "NaN")    # Remove NaN values (missing values)
    .select("author")                       # Select author attribute
    .rdd                                    # Convert to RDD
    .map(lambda x: x[0].lower())            # Convert to lower case letters
    .distinct()                             # Get unique author values
    .count()                                # Count distinct authors
)
```


***The target*** column is called "label" and is categorical:
- ***0*** - Reliable
- ***1*** - Fake





## Exploratory Data Analysis (EDA)

### Target Column Distribution
The dataset is balanced, as shown in the figure below:
<br>
<center><img src= "http://drive.google.com/uc?export=view&id=1t6zuT-C5FklIb7mpSJ2U_8-Wqa0wCsc6" width="80%" />

<br>
<b><u>Fig 2.</u></b>: Target Column Distribution </center>


### Missing Values

<h4><b><i>Missing Author</i></b></h4>

1. There are 1957 rows missing an Author.
1. This attribute has the highest number of instances missing a value.
1. 1931 out of the 1957 rows are labeled as fake news!

<h4><b><i>Missing Title</i></b></h4>

1. There are 558 rows missing a title. 
1. All of them have both an Author and the Text.
1. All of them are fake!

<h4><b><i>Missing Text</i></b></h4>

1. There are 39 instances missing the text attribute. 
1. All of them are also missing the Author attribute, but have a title. 
1. They are all labeled as fake news. 

***Conclusion:*** More than 98\% of the rows that are missing one or more attributes are labeled as fake news!
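
The percentage in the conclusion can be checked from the counts listed above. The short sketch below only redoes that arithmetic; it assumes (as the lists state) that the rows missing text are a subset of the rows missing an author, and that the rows missing a title do not overlap with them. The variable names are illustrative.

```python
# Counts reported in the Missing Values section.
missing_author = 1957       # rows missing an author (includes the 39 rows also missing text)
missing_author_fake = 1931  # of those, rows labeled as fake
missing_title = 558         # rows missing a title; all have an author and text
missing_title_fake = 558    # all of them are labeled as fake

# Rows missing text are already counted among the rows missing an author,
# so they are not added a second time.
rows_with_missing = missing_author + missing_title
fake_with_missing = missing_author_fake + missing_title_fake

share = fake_with_missing / rows_with_missing
print(f"{fake_with_missing} / {rows_with_missing} = {share:.2%}")
```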




## Machine Learning Algorithms

The target column is categorical (0 - reliable, 1 - fake), so this is a classification problem, and I therefore chose to implement a Naive Bayes classifier.
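
To illustrate the idea, here is a minimal pure-Python sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is not the project's implementation: the whitespace tokenizer, the function names `train_naive_bayes` and `predict`, and the four toy sentences are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Collect the counts needed by a multinomial Naive Bayes model."""
    class_counts = Counter(labels)          # label -> number of documents
    word_counts = defaultdict(Counter)      # label -> word -> occurrences
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()       # naive whitespace tokenizer (assumption)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior for the given text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)               # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue                                    # skip words unseen in training
            # Add-one smoothing keeps unseen (word, label) pairs from zeroing the score.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up examples (0 = reliable, 1 = fake); not taken from the dataset.
toy_docs = [
    "senate passes budget bill after debate",
    "city council approves new transit plan",
    "shocking miracle cure doctors hate",
    "you will not believe this shocking secret",
]
toy_labels = [0, 0, 1, 1]

model = train_naive_bayes(toy_docs, toy_labels)
print(predict(model, "shocking secret cure"))
print(predict(model, "council passes transit bill"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for long article texts.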


## Source


For this project I use the train.csv file found on [Kaggle](https://www.kaggle.com/c/fake-news). 

***Note:*** The train.csv file is part of a data science competition on Kaggle. For the purpose of this project I will only use train.csv, so I will split it into train and test sets in the machine learning part. The test.csv file on the competition's webpage will not be used.
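
As a rough illustration of the split mentioned in the note, the sketch below shuffles rows with a fixed seed and holds out a fraction for testing. The 80/20 ratio, the seed, and the helper name `train_test_split` are assumptions, not choices stated in this proposal. On a Spark DataFrame, `DataFrame.randomSplit` would likely be the more natural way to do the same thing.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and hold out test_fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # local RNG, so global random state is untouched
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the rows of train.csv: 100 dummy row ids.
train_rows, test_rows = train_test_split(range(100))
print(len(train_rows), len(test_rows))
```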

# References

1. <a name = "def"></a> Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, Huan Liu, *Fake News Detection on Social Media: A Data Mining Perspective*. ACM SIGKDD Explorations Newsletter, Volume 19, Issue 1, June 2017, pp. 22–36.
DOI: [https://doi.org/10.1145/3137597.3137600](https://doi.org/10.1145/3137597.3137600)

1. <a name = "spread"></a> Allcott, Hunt, and Matthew Gentzkow. 2017. *"Social Media and Fake News in the 2016 Election."* Journal of Economic Perspectives, 31: 211-36.
DOI: [https://doi.org/10.1257/jep.31.2.211](https://doi.org/10.1257/jep.31.2.211)

1. <a name = "bots"></a> Shao, Chengcheng & Ciampaglia, Giovanni & Varol, Onur & Flammini, Alessandro & Menczer, Filippo. (2017). *The spread of fake news by social bots*. 

1. <a name = "beliefs"></a> Del Vicario, Michela, Alessandro Bessi, Fabiana Zollo, Fabio Petroni, Antonio Scala, Guido Caldarelli, H. Eugene Stanley, and Walter Quattrociocchi. 2016. *The spreading of misinformation online*. Proceedings of the National Academy of Sciences, Volume 113, Number 3, pp. 554-559.
DOI: [https://doi.org/10.1073/pnas.1517441113](https://doi.org/10.1073/pnas.1517441113)