# Fake news challenge

In this task, we build an ML model that assigns an article a confidence score between 0 and 1 (is this article reliable or not?). The requirements are:

- train a model on the [Kaggle fake news dataset](https://www.kaggle.com/c/fake-news/data).
- provide the metrics to assess our model on the testing set (accuracy, etc.).
- build a final function `give_score(link)` that takes a link to an article as input and outputs the confidence score.
- compare 3 different ML techniques.


<br>


**Required libraries to run this notebook:**

*Included in Anaconda distribution:*
- numpy
- pandas
- scikit-learn
- requests
- bs4

*Not included in Anaconda distribution:*
- [newspaper](https://github.com/codelucas/newspaper) *( $\rightarrow$ pip3 install newspaper3k)*

<br>

<br>

### **1) Importing useful libraries:**
<br>

In [2]:
import numpy as np
import pandas as pd

<br>

### **2) Loading phase:**
<br>

In [3]:
FN_DATA_FOLDER = 'fake_news_data'

In [4]:
train_data = pd.read_csv(FN_DATA_FOLDER+"/train.csv")
test_data = pd.read_csv(FN_DATA_FOLDER+"/test.csv")

The data we will use in this notebook is a set of press articles. On each of them, we have the following information:

- **`id`**: unique id for a news article
- **`title`**: the title of a news article
- **`author`**: author of the news article
- **`text`**: the text of the article; could be incomplete
- **`label`**: a label that marks the article as potentially unreliable
    - `1`: unreliable
    - `0`: reliable
    
Let's take a closer look at our dataset:

In [5]:
print("Train data :")
print('--------------------------------------------------')
print('Total number of articles: ',len(train_data))
print('Number of fake news: ',len(train_data[train_data.label==1]))
train_data.head(3)

Train data :
--------------------------------------------------
Total number of articles:  20800
Number of fake news:  10413


| | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |


In [6]:
print("Test data :")
print('--------------------------------------------------')
print('Number of articles: ',len(test_data))
test_data.head(3)

Test data :
--------------------------------------------------
Number of articles:  5200


| | id | title | author | text |
|---|---|---|---|---|
| 0 | 20800 | Specter of Trump Loosens Tongues, if Not Purse... | David Streitfeld | PALO ALTO, Calif. — After years of scorning... |
| 1 | 20801 | Russian warships ready to strike terrorists ne... | NaN | Russian warships ready to strike terrorists ne... |
| 2 | 20802 | #NoDAPL: Native American Leaders Vow to Stay A... | Common Dreams | Videos #NoDAPL: Native American Leaders Vow to... |


<br>

### **3) Cleaning phase:**
<br>

We can see that there are some empty cells in our CSV files, resulting in **`NaN`** values in our dataset.

Let's start by counting them:

In [7]:
print('--------------------------------------------------')
print('Number of NaN values in x_train by column:')
print('\t',train_data.isna().sum().tolist())
print('--------------------------------------------------')
print('Number of NaN values in x_test by column:')
print('\t',test_data.isna().sum().tolist())
print('--------------------------------------------------')

--------------------------------------------------
Number of NaN values in x_train by column:
	 [0, 558, 1957, 39, 0]
--------------------------------------------------
Number of NaN values in x_test by column:
	 [0, 122, 503, 7]
--------------------------------------------------


<br>

To correct these values, we simply replace the empty cells with a single space to mark them as empty.

> We can see that for some articles, the entire text of the article is missing. Very little information is therefore available, but this data can still be interesting. We can indeed have information on the reliability of an author for example.

In [8]:
train_data = train_data.fillna(' ')
test_data = test_data.fillna(' ')
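Why the replacement matters can be checked on a toy frame. The sketch below is not part of the original pipeline: the two-row `toy` DataFrame and its values are made up for illustration. It suggests that concatenating string columns propagates missing values, while the space-filled version keeps whatever fields are available:

```python
import pandas as pd

# Hypothetical two-article frame: the second article has no author.
toy = pd.DataFrame({
    "title": ["Budget vote delayed", "Miracle cure found"],
    "author": ["Jane Doe", None],
})

# Concatenating with a missing value yields a missing result for that row.
raw_concat = toy["title"] + " " + toy["author"]

# After filling the gaps with a space, every row produces usable text.
filled = toy.fillna(" ")
clean_concat = filled["title"] + " " + filled["author"]

print(raw_concat.isna().sum())   # rows that would be lost without fillna
print(clean_concat.tolist())
```

An article without an author therefore still contributes its title (and text, when present) to the features.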

In [9]:
print('--------------------------------------------------')
print('Number of NaN values in x_train by column:')
print('\t',train_data.isna().sum().tolist())
print('--------------------------------------------------')
print('Number of NaN values in x_test by column:')
print('\t',test_data.isna().sum().tolist())
print('--------------------------------------------------')

--------------------------------------------------
Number of NaN values in x_train by column:
	 [0, 0, 0, 0, 0]
--------------------------------------------------
Number of NaN values in x_test by column:
	 [0, 0, 0, 0]
--------------------------------------------------


Our dataset is now clean!

<br>

### **4) Articles preprocessing:**
<br>

<br>

**Our objective is to classify articles based on their content using Scikit Learn.**

We have two datasets of texts. Unfortunately, machine learning algorithms work on numbers, not on raw text, so we need to find a way to **convert our texts to numbers without losing information**. There are several ways to do that, some of which are more effective than others.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

<br>

A first preprocessing step is to concatenate the `title`, `author` and `text` of each article into a single column that we will process in the next steps.

In [11]:
train_data['full_article']=train_data['title']+' '+train_data['author']+' '+train_data['text']
test_data['full_article']=test_data['title']+' '+test_data['author']+' '+test_data['text']

train_data.head(3)

| | id | title | author | text | label | full_article |
|---|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 | House Dem Aide: We Didn’t Even See Comey’s Let... |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 | FLYNN: Hillary Clinton, Big Woman on Campus - ... |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 | Why the Truth Might Get You Fired Consortiumne... |


<br>

####  **count_vectorizer()**

<br>

In [12]:
count_vectorizer = CountVectorizer()

In [13]:
x_train = count_vectorizer.fit_transform(train_data['full_article'])

>`count_vectorizer.fit_transform()` extracts every unique feature (word) in our texts and converts the texts to a matrix of token counts $\rightarrow$ each cell holds the number of times the word appears in the article (`0` if it is absent).
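A small sketch of what the resulting matrix contains, using two toy sentences invented here (they are not taken from the dataset). `binary=True` is an existing `CountVectorizer` option, shown only for contrast with the default counting behaviour:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["fake news about fake cures", "real news"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs).toarray()

# Recover the column order from the learned vocabulary (term -> column index).
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# With binary=True, every non-zero count is clipped to 1 (presence/absence).
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()

print(vocab)
print(counts)   # "fake" appears twice in the first sentence
print(binary)
```

With the default settings, a repeated word gets a count above 1, so the matrix records how often each word occurs and not only whether it occurs.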

Let's take a look at our resulting data in more details:

In [14]:
print('\n--------------------------------------------------')
print('Number of different features in all articles:')
print('\t',x_train.shape[1])
print('--------------------------------------------------')


--------------------------------------------------
Number of different features in all articles:
	 182061
--------------------------------------------------


The number of features is the number of unique words across all our articles: **`182,061`**

This number is obviously huge, but more importantly, **it is not optimized**. Words in a language are not evenly distributed across a corpus: a few words are very common, and a very large number of words are rare. They follow a [Zipf distribution](https://www.youtube.com/watch?v=fCn8zs912OE).
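The skew in word frequencies can be inspected with a few lines of standard-library Python. This is only a sketch: the helper name `rank_frequency` and the two sample sentences are made up for illustration, and a toy sample this small can only hint at the long-tailed shape that a real corpus such as `train_data['full_article']` would show.

```python
import re
from collections import Counter

def rank_frequency(texts):
    """Return (word, count) pairs sorted from most to least frequent."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common()

sample = [
    "the senate passed the bill and the president signed the bill",
    "the market fell after the report",
]
ranked = rank_frequency(sample)

# Print the three most frequent words with their rank.
for rank, (word, count) in enumerate(ranked[:3], start=1):
    print(rank, word, count)
```

Even here, a single function word dominates while most words appear only once.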

<br>

#### **tfidf_vectorizer()**

>`tfidf` = short for **term frequency–inverse document frequency**

The idea behind the `tfidf_vectorizer` is to reflect how important a word is to a document, so in fact it will allow us to extract **key words** from a document.

<br>

To do so, it starts from the **frequency** of appearance of a word in a text $\rightarrow$ **`tf`**. We could think that the most used words are the most important and relevant for our article. Unfortunately, as we said earlier, most of the words used are very common words, not specific to our article at all!

So what we really want are words that appear a lot **in one article**, BUT **not in many other documents of the corpus** $\rightarrow$ **`idf`**.
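For reference, with its default settings (`smooth_idf=True`, `norm='l2'`), scikit-learn's `TfidfVectorizer` weights a term $t$ in a document $d$ as shown below. This is the library's documented default and not a formula stated in this notebook. Here $n$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing $t$, and $\mathrm{tf}(t, d)$ the number of times $t$ appears in $d$:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
$$

Each document vector is then divided by its Euclidean norm, so every non-empty article ends up with a vector of length 1.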

<br>

Let's take a look at `tfidf_vectorizer` in action, using short sentences that stand in for our articles:

In [15]:
articles = np.array(['the Future of Mobile',
        'Soon on IOS & Android',
        'Gen Z and millennials',
        'everaging ML and DL',
        'personalized, high-quality'])

<br>

Let's begin by using `TfidfVectorizer` with all parameters set to default

In [16]:
tfidf_1 = TfidfVectorizer()

In [17]:
print('Matrix of word occurrences in our articles:\n')
matrix = tfidf_1.fit_transform(articles).toarray()
print(np.vectorize(lambda x:round(x,1))(matrix))

Matrix of word occurrences in our articles:

[[0.  0.  0.  0.  0.5 0.  0.  0.  0.  0.  0.5 0.5 0.  0.  0.  0.  0.5]
 [0.  0.5 0.  0.  0.  0.  0.  0.5 0.  0.  0.  0.  0.5 0.  0.  0.5 0. ]
 [0.5 0.  0.  0.  0.  0.6 0.  0.  0.6 0.  0.  0.  0.  0.  0.  0.  0. ]
 [0.4 0.  0.5 0.5 0.  0.  0.  0.  0.  0.5 0.  0.  0.  0.  0.  0.  0. ]
 [0.  0.  0.  0.  0.  0.  0.6 0.  0.  0.  0.  0.  0.  0.6 0.6 0.  0. ]]


As we can see, we get a **matrix of word occurrences** in our articles, with one row per article and one column per word. Moreover, each word present in an article is given a **score** that represents how relevant it is to that article.

*(The score above is rounded for readability)*
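The scores in the matrix above can be reproduced by hand, which may help in seeing where they come from. The following is a minimal pure-Python sketch, assuming scikit-learn's documented defaults (lowercasing, tokens of at least two word characters, smoothed idf, L2 normalization). The helper name `tfidf_rows` is hypothetical and this is not how the library is implemented internally.

```python
import math
import re
from collections import Counter

def tfidf_rows(documents):
    """Per-document {term: weight} dicts using smoothed idf and L2 normalization."""
    # Lowercase and keep tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

articles = [
    "the Future of Mobile",
    "Soon on IOS & Android",
    "Gen Z and millennials",
    "everaging ML and DL",
    "personalized, high-quality",
]
rows = tfidf_rows(articles)
print({term: round(w, 1) for term, w in rows[2].items()})
# {'gen': 0.6, 'and': 0.5, 'millennials': 0.6}
```

The word `and` appears in two of the five sentences, so its idf is lower and it receives a smaller score than `gen` or `millennials`, in line with the third row of the matrix.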

<br>

In [18]:
print('Keywords in our articles:\n--------------------------')
keywords = tfidf_1.inverse_transform(matrix)
for array in keywords:
    print(' - '.join(keyword for keyword in array))

Keywords in our articles:
--------------------------
future - mobile - of - the
android - ios - on - soon
and - gen - millennials
and - dl - everaging - ml
high - personalized - quality


<br>

As we can see above, words are scored to show their importance, but we still keep the most common words, which are not specific to our articles at all.

These words are called **stop words**. In English, for example, they include words like `the`, `and`, etc.

We can simply add a parameter to our `TfidfVectorizer` to remove them automatically.

In [19]:
tfidf_2 = TfidfVectorizer(stop_words='english')

In [20]:
print('Keywords in our articles:\n--------------------------')
matrix = tfidf_2.fit_transform(articles).toarray()
matrix = np.vectorize(lambda x:round(x,2))(matrix)
keywords = tfidf_2.inverse_transform(matrix)
for array in keywords:
    print(' - '.join(keyword for keyword in array))

Keywords in our articles:
--------------------------
future - mobile
android - ios - soon
gen - millennials
dl - everaging - ml
high - personalized - quality


<br>

As we can see, by removing stop words, **only key words are extracted** from our (very short) articles.
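To see which words are affected, the built-in list behind `stop_words='english'` can be inspected directly. A short sketch follows; the sample words are chosen here to match the toy sentences above. The scikit-learn documentation itself notes that this built-in list has known limitations, so it is best treated as a convenient default.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Check a few words from the toy sentences against the built-in list.
for word in ["the", "of", "and", "on", "future", "android"]:
    print(word, word in ENGLISH_STOP_WORDS)
```

This explains why `on` disappears from the second sentence's keywords along with `the`, `of` and `and`.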

<br>

We can now use `tfidf.fit_transform()` to convert them to a matrix of TF-IDF weights with one column per word, just as `count_vectorizer` gave us a matrix of token counts:

In [21]:
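# get_feature_names_out() requires scikit-learn >= 1.0 (older versions used get_feature_names())
df = pd.DataFrame(matrix, index=articles, columns=tfidf_2.get_feature_names_out())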
df

| | android | dl | everaging | future | gen | high | ios | millennials | ml | mobile | personalized | quality | soon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| the Future of Mobile | 0.0 | 0.0 | 0.0 | 0.71 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.71 | 0.0 | 0.0 | 0.0 |
| Soon on IOS & Android | 0.58 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.58 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.58 |
| Gen Z and millennials | 0.0 | 0.0 | 0.0 | 0.0 | 0.71 | 0.0 | 0.0 | 0.71 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| everaging ML and DL | 0.0 | 0.58 | 0.58 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.58 | 0.0 | 0.0 | 0.0 | 0.0 |
| personalized, high-quality | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.58 | 0.0 | 0.0 | 0.0 | 0.0 | 0.58 | 0.58 | 0.0 |


<br>

This is pretty good: we've managed to remove a lot of useless data from our articles while keeping the words that carry their meaning.



<br>

<br>
<br>

**Let's get back to our dataset and apply the `tf-idf` transformation to our articles:**



In [22]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))

In [23]:
x_train = tfidf_vectorizer.fit_transform(train_data['full_article'].values)
x_test = tfidf_vectorizer.transform(test_data['full_article'].values)

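>Using `transform` instead of `fit_transform` on the test set reuses the vocabulary learned from the training set in the previous line, and ensures both matrices have identical columns. `ngram_range=(1, 2)` makes the vectorizer use pairs of consecutive words as features in addition to single words.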
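The difference between the two calls can be illustrated on a toy example. The sentences below are invented, and the variable names `x_tr` and `x_new` are chosen so as not to clash with the notebook's own variables. The sketch also shows what `ngram_range=(1, 2)` adds to the vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["president signs budget bill", "miracle cure shocks doctors"]
new_texts = ["president shocks doctors with unicorn"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
x_tr = vectorizer.fit_transform(train_texts)   # learns the vocabulary and idf weights
x_new = vectorizer.transform(new_texts)        # reuses them unchanged

print(x_tr.shape, x_new.shape)                     # same number of columns
print("shocks doctors" in vectorizer.vocabulary_)  # bigram learned during fit
print("unicorn" in vectorizer.vocabulary_)         # unseen word, silently ignored
```

Words and word pairs that never appeared during `fit_transform` contribute nothing to the new article's vector, which is one reason predictions on articles from unfamiliar sources can be less reliable.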

<br>

### **5) Learning phase:**
<br>

Now that we have extracted the features, we can train a classifier to try to predict the reliability of an article:

In [24]:
y_train = train_data['label']

In [25]:
from sklearn.model_selection import cross_val_score #to get accuracies and compare results

<br>

Model 1: **MultinomialNB**

Let’s start with a naïve Bayes classifier, which provides a nice baseline for this task. scikit-learn includes several variants of this classifier; the multinomial variant is the one most suitable for word counts, and it also works in practice with fractional values such as TF-IDF weights:

In [26]:
from sklearn.naive_bayes import MultinomialNB
model1 = MultinomialNB()
model1.fit(x_train, y_train)

accuracies = cross_val_score(estimator=model1, X=x_train, y=y_train, cv=10)
mean_acc = accuracies.mean()
std_acc = accuracies.std()
print('accuracy: ',mean_acc)
print('std: ',std_acc)

accuracy:  0.8773560756901098
std:  0.006281584150964021


<br>

Model 2: **SGDClassifier**

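Let’s see if we can do better with a linear support vector machine (SVM), which is widely regarded as one of the best text classification algorithms (although it’s also a bit slower than naïve Bayes). `SGDClassifier` with `loss="hinge"` trains such a linear SVM using stochastic gradient descent.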

In [27]:
from sklearn.linear_model import SGDClassifier
model2 = SGDClassifier(loss="hinge", penalty="l2", max_iter=15)
model2.fit(x_train, y_train)

accuracies = cross_val_score(estimator=model2, X=x_train, y=y_train, cv=10)
mean_acc = accuracies.mean()
std_acc = accuracies.std()
print('accuracy: ',mean_acc)
print('std: ',std_acc)

accuracy:  0.9751917494691401
std:  0.0030616149951694857


<br>

Model 3: **LogisticRegression**

In [28]:
from sklearn.linear_model import LogisticRegression
model3 = LogisticRegression(C=1e5)
model3.fit(x_train, y_train)

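# Evaluate model3 here (this cell originally re-scored model2, which is what the figures recorded below reflect)
accuracies = cross_val_score(estimator=model3, X=x_train, y=y_train, cv=10)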
mean_acc = accuracies.mean()
std_acc = accuracies.std()
print('accuracy: ',mean_acc)
print('std: ',std_acc)

accuracy:  0.9750957111592022
std:  0.002805510410011197
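Accuracy is only one of the metrics that can be reported. As a sketch of how the three models could be compared on several metrics at once, the block below uses `cross_validate` with a `Pipeline`, so that the vectorizer is refit inside each fold. The toy corpus is made up and trivially separable, so the scores it produces say nothing about the real dataset; on the notebook's data one would pass `train_data['full_article']` and `train_data['label']` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (invented): label 0 = reliable, label 1 = unreliable.
texts = [f"officials confirmed the budget report number {i}" for i in range(10)]
texts += [f"shocking secret miracle cure they hide number {i}" for i in range(10)]
labels = [0] * 10 + [1] * 10

candidates = {
    "MultinomialNB": MultinomialNB(),
    "SGDClassifier": SGDClassifier(loss="hinge", penalty="l2", random_state=0),
    "LogisticRegression": LogisticRegression(C=1e5, max_iter=1000),
}
metrics = ["accuracy", "precision", "recall", "f1"]

results = {}
for name, model in candidates.items():
    # The pipeline refits TF-IDF on the training part of each fold only.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores = cross_validate(pipeline, texts, labels, cv=5, scoring=metrics)
    results[name] = {m: scores[f"test_{m}"].mean() for m in metrics}
    print(name, results[name])
```

Precision and recall are computed here for label `1` (unreliable), which is scikit-learn's default positive class for binary labels.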


<br>

### **6) Making predictions:**
<br>

Now that our models have been trained, we can make predictions using the features from the test dataset. We use `model2`, the linear SVM:

In [29]:
predictions = model2.predict(x_test)

# Display our predictions: they are either 0 or 1 for each test instance,
# depending on whether our algorithm believes the article is reliable (0) or fake (1).
predictions

array([0, 1, 1, ..., 0, 1, 0])

<br>

Since the data comes from a Kaggle competition, we will make a submission to get our score and see the performance of our model.

To do that, we create a DataFrame with the article ids and our predictions of whether they are fake or not:

In [30]:
submission = pd.DataFrame({'id':test_data['id'],'label':predictions})

submission.head()

| | id | label |
|---|---|---|
| 0 | 20800 | 0 |
| 1 | 20801 | 1 |
| 2 | 20802 | 1 |
| 3 | 20803 | 0 |
| 4 | 20804 | 1 |


In [31]:
#Convert DataFrame to a csv file that can be uploaded
filename = 'articles_prediction_4.csv'

submission.to_csv(filename,index=False)
print('\n--------------------------------------')
print('Saved file: ' + filename)
print('--------------------------------------')


--------------------------------------
Saved file: articles_prediction_4.csv
--------------------------------------


<br>

**Kaggle score : `0.98241`**

<br>

### **7) Testing the reliability of a given article:**
<br>

In this last part, we will try to use our model to predict the reliability of an article found on the web, giving it a score:

<br>

Let's start by choosing a web article:

In [32]:
url = input('Enter a link to an article: ')

Enter a link to an article: https://www.nytimes.com/2018/11/09/us/politics/matthew-whitaker-donald-trump.html?action=click&module=Top%20Stories&pgtype=Homepage


<br>
<br>

**First approach:** using [`BeautifulSoup`](https://pypi.org/project/beautifulsoup4/) library:



The first thing we need to do is to read the HTML of this article. We do that using the `requests` library:

In [33]:
import requests

article = requests.get(url)
print('---------------------------------')
print(article.text[0:300])
print('---------------------------------')

---------------------------------
<!DOCTYPE html>
<html lang="en" itemId="https://www.nytimes.com/2018/11/09/us/politics/matthew-whitaker-donald-trump.html" itemType="http://schema.org/NewsArticle" itemScope="true" class="story" xmlns:og="http://opengraphprotocol.org/schema/">
  <head>
    <title data-rh="true">Trump Says ‘I Don’t K
---------------------------------


Now that we have the HTML content of the web article, we need to extract useful information from it.

We will do that by using `BeautifulSoup`, which is a popular Python library for web scraping.

In [34]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(article.text, 'html.parser')

Getting all page classes:

In [35]:
classes = [value
           for element in soup.find_all(class_=True)
           for value in element["class"]]

In [36]:
print('classes containing "author":\n--------------------------')
for class_name in classes:
    if 'author' in class_name:
        print(class_name)

classes containing "author":
--------------------------



**`title`**

In [37]:
title = soup.find('title')
#print(title.text)

**`Author`**

In [38]:
author = soup.find(attrs={'class': 'author'})  # None here: no class on this page contains "author"
#print(author.text)

<br>

As we can understand from the code above, this approach can work, but it will require a lot of work to be fully generic and able to parse any article on the web.

Indeed, HTML tags can vary a lot from one site to another, so it might be difficult to think about all possible implementations...
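To give an idea of what a slightly more generic extractor could look like, here is a minimal sketch that tries a couple of common conventions in turn. The sample HTML, the `byline` class and the helper name `extract_title_and_author` are assumptions made up for illustration; real sites differ widely, which is exactly the difficulty described above.

```python
from bs4 import BeautifulSoup

# Hypothetical page; real sites use many different layouts.
html = """
<html><head>
  <title>Example headline</title>
  <meta name="author" content="Jane Doe">
</head><body><p class="byline">By Jane Doe</p></body></html>
"""

def extract_title_and_author(markup):
    """Return (title, author), with empty strings for anything not found."""
    soup = BeautifulSoup(markup, "html.parser")
    title_tag = soup.find("title")
    title = title_tag.get_text(strip=True) if title_tag else ""
    # Try a few conventions in turn, stopping at the first hit.
    meta = soup.find("meta", attrs={"name": "author"})
    if meta and meta.get("content"):
        return title, meta["content"].strip()
    byline = soup.find(class_="byline")
    if byline:
        return title, byline.get_text(strip=True)
    return title, ""

print(extract_title_and_author(html))
```

Each additional site layout would need another fallback rule, so the list of special cases grows quickly.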

<br>
<br>

**Second approach:** using [`newspaper`](https://github.com/codelucas/newspaper) library:

In [39]:
from newspaper import Article

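try:
    article = Article(url)
    article.download()
    article.parse()
except Exception as error:
    # Download or parsing failed (network error, blocked request, unsupported page...)
    print("Problem downloading the user article:", error)
else:
    # Only read and display the fields if the article was parsed successfully
    title = article.title
    authors = article.authors
    text = article.text
    print('Author :\n--------------------------\n',authors,'\n')
    print('Title :\n--------------------------\n',title,'\n')
    print('Text :\n--------------------------\n',text[0:300],'\n')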

Author :
--------------------------
 ['Eileen Sullivan'] 

Title :
--------------------------
 Trump Says ‘I Don’t Know Matt Whitaker,’ the Acting Attorney General He Chose 

Text :
--------------------------
 WASHINGTON — President Trump said on Friday that he has not yet spoken to the new acting attorney general, Matthew G. Whitaker, about the special counsel investigation, and he distanced himself from Mr. Whitaker by suggesting that he did not know him.

Mr. Whitaker, who now oversees the investigatio 



As we can see, the `newspaper` library is pretty powerful and allows us to **parse an article really easily**! It allows us to directly separate distinct information from the article. 

<br>

Let's apply our `tf-idf` transformation to the selected article:

In [40]:
test_article = pd.DataFrame(columns=['full_article'])

full_article=article.title
for author in article.authors:
    full_article+=' '+author
full_article+=' '+article.text

test_article.loc[0]= full_article

In [41]:
x_article = tfidf_vectorizer.transform(test_article['full_article']).toarray()

<br>

Before making a prediction on the reliability of the article, let's simply take a look at the keywords present in the article that were used:

In [42]:
tfidf_vectorizer.inverse_transform(x_article)

[array(['according', 'according people', 'acting', 'acting attorney',
        'attorney', 'attorney general', 'chemistry', 'chose', 'counsel',
        'counsel investigation', 'did', 'did know', 'distanced',
        'distanced mr', 'don', 'don know', 'easy', 'eileen', 'familiar',
        'friday', 'general', 'highly', 'highly respected', 'investigation',
        'know', 'know mr', 'left', 'left washington', 'man', 'matt',
        'matt whitaker', 'matthew', 'matthew whitaker', 'mr', 'mr trump',
        'mr whitaker', 'new', 'new acting', 'office', 'office times',
        'oval', 'oval office', 'oversees', 'paris', 'people',
        'people familiar', 'president', 'president according',
        'president trump', 'relationship', 'relationship don', 'reporters',
        'reporters left', 'respected', 'respected man', 'said',
        'said easy', 'said friday', 'says', 'says don', 'special',
        'special counsel', 'spoken', 'spoken new', 'suggesting',
        'suggesting did', 'sulliv

<br>

<br>

Let's now see what our model thinks about it:

In [43]:
prediction = model3.predict_proba(x_article)
# Column 0 is the probability of label 0 (reliable); scale to a percentage before rounding
print('\nReliability : ',round(prediction[0,0]*100,4),'%')


Reliability :  99.2373 %
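The index `[0, 0]` used above deserves a word of explanation: the columns of `predict_proba` follow the order of the model's `classes_` attribute, so with labels `0` (reliable) and `1` (unreliable), column 0 holds the probability of "reliable". A small sketch on made-up one-dimensional data, unrelated to the notebook's features, illustrates this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (invented): small values are labelled 0 (reliable), large values 1 (unreliable).
X = np.array([[0.0], [0.2], [0.4], [1.6], [1.8], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(np.array([[0.1]]))

print(clf.classes_)                          # column order of predict_proba
reliable_col = list(clf.classes_).index(0)   # look the column up by label
print(proba[0, reliable_col])                # probability of "reliable"
```

Looking the column up through `classes_` avoids relying on the label order. Note also that `SGDClassifier` with `loss="hinge"` does not provide probability estimates, which is presumably why the logistic regression model is used for the score here.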


<br>

*Same code as above, regrouped inside a function that takes a link of an article as an input and outputs the confidence score*

In [46]:
def give_score(link):
    """Return the estimated probability (between 0 and 1) that the article at `link` is reliable."""
    # Parsing the article
    article = Article(link)
    article.download()
    article.parse()
    # Building the same 'title author text' string as for the training data
    full_article = ' '.join([article.title, *article.authors, article.text])
    # Applying our tf-idf vectorizer
    x_article = tfidf_vectorizer.transform([full_article])
    # Column 0 of predict_proba is the probability of label 0 (reliable)
    return model3.predict_proba(x_article)[0, 0]

In [47]:
print('\nReliability : ',round(give_score(url)*100,4),'%')


Reliability :  99.2373 %


<br>

<br>

After testing the model on many articles found on the web, I realized that **the performance of our model is far from consistent**.

I started by testing it on articles on which it had not been trained (obviously), but from **sites it knew**. In these cases, the performance was very (too?) good!

For example, a random article from *The New York Times* gave a reliability of `95%`, whereas one from *Liberal America* only gave `4%`.

On the other hand, when the articles came from **less common sources**, on which our model had probably never been trained, the performance was not as good, and the predictions were sometimes completely wrong.

<br>

**Some ideas to improve our current model:**

- Using many more articles to train our model. Most of the time, increasing the size of the training data is easier than tuning every parameter over and over.
- Of course, better tuning of the model's parameters could help improve our results a little.
- Changing the overly simplistic approach we have taken. In the end, we only looked for keywords in the articles. However, it is quite clear that this will not be enough if we want an effective tool for a large number of articles from a much wider spectrum of sources. For example, it might be useful to check the source and the author of the article separately. We could also count the numbers, percentages, dates and comparisons that appear in the article.
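As a rough illustration of the last idea, the sketch below extracts a few surface features from an article and its URL using only the standard library. The function name `simple_style_features`, the feature names and the regular expressions are all assumptions made up for this example. They have not been evaluated on the dataset, and whether such features actually help would have to be tested.

```python
import re
from urllib.parse import urlparse

def simple_style_features(text, url=""):
    """Count a few surface cues and extract the source domain (illustrative only)."""
    domain = urlparse(url).netloc.lower()
    if domain.startswith("www."):
        domain = domain[4:]
    return {
        # Integers or decimals such as 42, 3.7 or 1,000
        "n_numbers": len(re.findall(r"\d+(?:[.,]\d+)*", text)),
        # Numbers immediately followed by a percent sign
        "n_percentages": len(re.findall(r"\d+(?:[.,]\d+)?\s?%", text)),
        # Four-digit years from 1900 to 2099, a crude stand-in for dates
        "n_years": len(re.findall(r"\b(?:19|20)\d{2}\b", text)),
        "source": domain,
    }

features = simple_style_features(
    "Unemployment fell to 3.7% in 2018, down from 4.1% in 2017.",
    url="https://www.example.com/news/story.html",
)
print(features)
```

Numeric features like these could be appended to the TF-IDF matrix, while the source domain could be encoded as a separate categorical feature.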


<br>