# Natural Language Processing
## Fake news identification

by Daniel Russotto and Christine Utendorf

In this assignment, our goal is to determine whether a given article, consisting of a title and a text, contains real or fake news. Fake news is disinformation, and it poses a serious threat to today's society. Not knowing what to believe or, even worse, not recognizing that a piece of information is untrue can truly harm a reader. With the rise of the internet and social media, news can be accessed at any time, from any place and from many different sources. However, this also allows fake news to spread faster, more widely and more successfully.

Social media giant Facebook has set up a unit to identify such fake news on its platform. After being criticized more than once for doing too little against the spread of false information, Facebook is now "working to stop misinformation and false news". The company not only works with third-party fact-checking organizations but also applies machine learning techniques to identify posts that contain fake news (see this [link](https://www.facebook.com/facebookmedia/blog/working-to-stop-misinformation-and-false-news)). It is likely that Facebook uses natural language processing and classification algorithms to decide whether a post is fraudulent.

In this assignment we (Dan and Christine) work on a problem that Facebook (as well as Twitter, YouTube and many other platforms) faces every day: identifying whether an article contains real or fake news. We do not use fact-checking to verify whether a piece of information is accurate; instead we train a machine learning algorithm to classify articles as fake or real. To do so we use several natural language processing concepts, such as tokenization and lemmatization from the NLTK Python library, as well as machine learning methods, including logistic regression and naive Bayes from the sklearn Python library. For this purpose we were provided with a training data set of articles that are already labeled as real or fake, and a test set without such labels. The goal is to train a model that learns a general pattern and can identify fake news among articles it has never seen before (here, our "blind" test data set).

### Library loading

In [78]:
import pandas as pd 
import numpy as np
import nltk
nltk.download('wordnet')
nltk.download('punkt')
import sklearn

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\cutendorf\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\cutendorf\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1. Data sets

<font color=red>??? Do we want the actual ID as index column or extra column ???</font>

In [61]:
train_data = pd.read_csv("data/fake_or_real_news_training.csv", quotechar='"', header=0, sep=",",
                    index_col="ID", encoding='utf-8')
test_data = pd.read_csv("data/fake_or_real_news_test.csv", quotechar='"', header=0, sep=",",
                   index_col="ID", encoding='utf-8')

#### Train data set
The training data set has 3,999 rows, each representing an article. Every article is identified by a unique ID and has a title, a text and a label stating whether it is fake or real. In addition there are the columns X1 and X2. These two columns should be entirely NaN (empty). However, 33 rows have values in X1, and 2 of these 33 also have values in X2. This shows that the fields were not split correctly in these cases: the separator used to parse the CSV into a dataframe is ",", and for some rows this shifted the values into the wrong columns. We take a closer look at this in the data cleaning part.

In [62]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3999 entries, 8476 to 9673
Data columns (total 5 columns):
title    3999 non-null object
text     3999 non-null object
label    3999 non-null object
X1       33 non-null object
X2       2 non-null object
dtypes: object(5)
memory usage: 187.5+ KB


In [63]:
train_data.head(5)

ID,title,text,label,X1,X2
8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,,
10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,,
3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,,
10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,,
875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,,


#### Test data
The test data set consists of 2,321 unlabeled rows. Since the data frame only contains the unique ID, the title and the text, the split seems to have worked well here (no X1 or X2). The training data contains fewer than twice as many articles as the test data (3,999 vs. 2,321). Because data is this limited, it is important to retrain the final model on the complete training data set at the end.

In [64]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2321 entries, 10498 to 4330
Data columns (total 2 columns):
title    2321 non-null object
text     2321 non-null object
dtypes: object(2)
memory usage: 54.4+ KB


In [65]:
test_data.head(5)

ID,title,text
10498,September New Homes Sales Rise——-Back To 1992 ...,September New Homes Sales Rise Back To 1992 Le...
2439,Why The Obamacare Doomsday Cult Can't Admit It...,But when Congress debated and passed the Patie...
864,"Sanders, Cruz resist pressure after NY losses,...",The Bernie Sanders and Ted Cruz campaigns vowe...
4128,Surviving escaped prisoner likely fatigued and...,Police searching for the second of two escaped...
662,Clinton and Sanders neck and neck in Californi...,No matter who wins California's 475 delegates ...


### 1.1 Data cleaning
As seen in the output of train_data.info(), several rows were not read into the dataframe correctly. We are now going to fix these rows. The CSV is split on commas, so a title or text that contains a comma and is not enclosed in quotes is split incorrectly. A first step is to filter out the displaced rows and take a closer look at them. Overall, we have 33 rows with displaced values, since all rows that have a value in X2 also have a value in X1. It is important to fix these rows rather than drop them because the number of training rows is limited.
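
To illustrate the mechanism, here is a minimal sketch that uses Python's built-in `csv` module instead of pandas and a made-up two-row file (the titles and texts are hypothetical, not taken from our data). It shows how the same title yields four fields when it is quoted and five when it is not.

```python
import csv
import io

# Hypothetical CSV: row 1 quotes its title, row 2 does not.
raw = (
    'ID,title,text,label\n'
    '1,"Storm hits coast, thousands evacuated","Heavy rain, strong winds.",REAL\n'
    '2,Storm hits coast, thousands evacuated,"Heavy rain, strong winds.",FAKE\n'
)

rows = list(csv.reader(io.StringIO(raw), quotechar='"'))
header, quoted_row, unquoted_row = rows

# The unquoted comma in the second title creates one extra field,
# pushing text and label one column to the right.
print(len(header), len(quoted_row), len(unquoted_row))  # prints: 4 4 5

# Joining the two title pieces with "," restores the original title,
# because the parser keeps the space that followed the comma.
print(unquoted_row[1] + "," + unquoted_row[2])
```

In this sketch the second half of the split title keeps its leading space, so joining the pieces with a bare "," reproduces the title exactly.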

In [66]:
displaced_rows = train_data[train_data.X1.notnull()]
displaced_rows

ID,title,text,label,X1,X2
599,Election Day: No Legal Pot In Ohio,Democrats Lose In The South,Election Day: No Legal Pot In Ohio; Democrats ...,REAL,
10194,Who rode it best? Jesse Jackson mounts up to f...,Leonardo DiCaprio to the rescue?,Who rode it best? Jesse Jackson mounts up to f...,FAKE,
356,Black Hawk crashes off Florida,human remains found,(CNN) Thick fog forced authorities to suspend ...,REAL,
2786,Afghanistan: 19 die in air attacks on hospital,U.S. investigating,(CNN) Aerial bombardments blew apart a Doctors...,REAL,
3622,Al Qaeda rep says group directed Paris magazin...,US issues travel warning,A member of Al Qaeda's branch in Yemen said Fr...,REAL,
7375,Shallow 5.4 magnitude earthquake rattles centr...,shakes buildings in Rome,00 UTC © USGS Map of the earthquake's epicent...,FAKE,
9097,ICE Agent Commits Suicide in NYC,Leaves Note Revealing Gov’t Plans to Round-up...,Email Print After writing a lengthy suicide no...,FAKE,
9203,Political Correctness for Yuengling Brewery,What About Our Opioid Epidemic?,We Are Change \n\nIn today’s political climate...,FAKE,
1602,Poll gives Biden edge over Clinton against GOP...,VP meets with Trumka,A new national poll shows Vice President Biden...,REAL,
4562,Russia begins airstrikes in Syria,U.S. warns of new concerns in conflict,Russian warplanes began airstrikes in Syria on...,REAL,


Overall, it seems that the problem lies within the title. For the 31 cases with only one misplacement (X2 = NaN), the title was split in two, so that the actual label ended up in the X1 column and the article text in the label column. For the doubly split row with ID 9, the title apparently contained two commas, leading to a double split. The row with ID 6268, however, merely repeats the wrongly split title in the label and X1 columns. There is no sign of further text, so this row should be excluded since it does not provide an actual article text.
However, let's start with the rows that have one wrong column break: we join the title and text fields back together into the full title and then replace the text and label columns with the actual values.

<font color=red>??? Should we include the comma here in the title ???</font>

In [67]:
# Rows whose title was split at a comma: every value sits one column too far to the right
displaced = train_data.X1.notnull()
train_data.loc[displaced, 'title'] = train_data.loc[displaced, 'title'] + "," + train_data.loc[displaced, 'text']
train_data.loc[displaced, 'text'] = train_data.loc[displaced, 'label']
train_data.loc[displaced, 'label'] = train_data.loc[displaced, 'X1']
train_data.loc[displaced, 'X1'] = train_data.loc[displaced, 'X2']

After fixing the column break for the first 31 rows, we now look at the last two displaced rows. While the row with ID 9 seems easy to fix, the row with ID 6268 does not seem to contain any actual text. Thus we fix row 9 and exclude row 6268.
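
As an alternative to handling the remaining rows by hand, the shift could be repeated until the label column holds a valid class. The following is only a sketch under assumptions: `realign_rows`, `VALID_LABELS`, `max_shifts` and the toy frame are hypothetical names and values introduced for illustration, and the function expects the raw, unrepaired column layout.

```python
import numpy as np
import pandas as pd

VALID_LABELS = {"REAL", "FAKE"}  # assumption: the only two legitimate labels


def realign_rows(df, max_shifts=2):
    """Shift mis-split rows one column to the left until `label` is valid."""
    fixed = df.copy()
    for _ in range(max_shifts):
        shifted = ~fixed["label"].isin(VALID_LABELS)
        if not shifted.any():
            break
        # Snapshot the affected rows so the assignment order does not matter.
        before = fixed.loc[shifted].copy()
        fixed.loc[shifted, "title"] = before["title"] + "," + before["text"]
        fixed.loc[shifted, "text"] = before["label"]
        fixed.loc[shifted, "label"] = before["X1"]
        fixed.loc[shifted, "X1"] = before["X2"]
        fixed.loc[shifted, "X2"] = np.nan
    return fixed


# Toy frame (made-up values): one clean row, one single split, one double split.
toy = pd.DataFrame(
    {
        "title": ["Clean title", "Part one", "Part one"],
        "text": ["Article A", " part two", " part two"],
        "label": ["REAL", "Article B", " part three"],
        "X1": [np.nan, "FAKE", "Article C"],
        "X2": [np.nan, np.nan, "REAL"],
    },
    index=pd.Index([1, 2, 3], name="ID"),
)

repaired = realign_rows(toy)
print(repaired[["title", "text", "label"]])
```

A row without a real article body, such as the one with ID 6268, would still have to be excluded separately, since realigning the columns cannot recover text that is not there.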

In [68]:
displaced_rows = train_data[train_data.X1.notnull()]
displaced_rows

ID,title,text,label,X1,X2
9,"Planned Parenthood’s lobbying effort, pay rais...",and the future Fed rates,PLANNED PARENTHOOD’S LOBBYING GETS AGGRESSIVE....,REAL,REAL
6268,Chart Of The Day: Since 2009—–Recovery For The...,Chart Of The Day: Since 2009 Recovery For The 5%,Stagnation for the 95%,FAKE,FAKE


#### Fix row 9

In [69]:
train_data.loc[9, 'title'] = train_data.loc[9, 'title'] + "," + train_data.loc[9, 'text']
train_data.loc[9, 'text'] = train_data.loc[9, 'label']
train_data.loc[9, 'label'] = train_data.loc[9, 'X1']
train_data.loc[9, 'X1'] = np.nan
train_data.loc[9, 'X2'] = np.nan

#### Exclude row 6268

In [70]:
# Keep only rows with an empty X2; after fixing row 9 this drops just row 6268
train_data = train_data[train_data.X2.isnull()]

In [59]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 8476 to 10142
Data columns (total 5 columns):
title    4 non-null object
text     4 non-null object
label    4 non-null object
X1       0 non-null object
X2       0 non-null object
dtypes: object(5)
memory usage: 192.0+ bytes


After fixing the displaced rows we can now delete the columns X1 and X2.

In [73]:
train_data = train_data[['title', 'text','label']]
train_data.head()

ID,title,text,label
8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


### 1.2 Data exploration

#### Target distribution
Since we are working on a classification problem, it is important to look at the target distribution. Highly imbalanced targets usually call for resampling methods in order to train a well-performing machine learning model. Thus our first step in data exploration is to check the number of fake and real labels in our data set:

In [75]:
train_data[train_data.label == "REAL"].count()

title    2008
text     2008
label    2008
dtype: int64

In [74]:
train_data[train_data.label == "FAKE"].count()

title    1990
text     1990
label    1990
dtype: int64

The target is almost evenly distributed, with 2,008 real and 1,990 fake articles, so no resampling is needed.
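
The same check can be written more compactly with `value_counts`, which also returns relative shares. The sketch below runs on a made-up toy series rather than our data; for the counts reported above, the shares would be roughly 50.2% real and 49.8% fake.

```python
import pandas as pd

# Toy stand-in for the label column (values are made up).
labels = pd.Series(["REAL", "FAKE", "REAL", "FAKE", "REAL"], name="label")

counts = labels.value_counts()                # absolute count per class
shares = labels.value_counts(normalize=True)  # relative frequency per class

print(counts.to_dict())
print(shares.round(2).to_dict())
```

On the actual data this would be `train_data.label.value_counts(normalize=True)`.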

#### Real news vs. fake news

<font color=red>We could explore common words; title and text length!</font>
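
One possible way to carry out this exploration is sketched below on a made-up toy frame; the helper `top_words`, the simple regular-expression word split and all sample texts are assumptions for illustration, not the method used later in this notebook.

```python
import re
from collections import Counter

import pandas as pd

# Toy articles (made-up titles and texts).
toy = pd.DataFrame(
    {
        "title": [
            "Shocking secret revealed",
            "Senate passes budget bill",
            "Secret cure they hide",
            "Court hears budget case",
        ],
        "text": [
            "you will not believe this shocking secret",
            "the senate passed the budget bill on monday",
            "this secret cure is hidden from you",
            "the court heard arguments on the budget",
        ],
        "label": ["FAKE", "REAL", "FAKE", "REAL"],
    }
)

# Title and text length in words, averaged per class.
toy["title_words"] = toy["title"].str.split().str.len()
toy["text_words"] = toy["text"].str.split().str.len()
length_by_label = toy.groupby("label")[["title_words", "text_words"]].mean()


def top_words(texts, n=3):
    """Return the n most frequent lower-cased words across the given texts."""
    counter = Counter()
    for text in texts:
        counter.update(re.findall(r"[a-z']+", text.lower()))
    return counter.most_common(n)


common = {label: top_words(group["text"]) for label, group in toy.groupby("label")}

print(length_by_label)
print(common)
```

Even in this tiny example the most frequent word in the real articles is "the", which suggests that raw word counts are dominated by very common words unless the text is prepared first.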

## 2. Text preparation
<font color=red>Tokenization, Stemming, Lemmatization</font>