# Welcome to the Fake News Analyzer!

## First steps

Here we import the libraries needed for the project and load the data from a single CSV file named "fake.csv". The rows and columns of this file are what we evaluate in this notebook.

In [9]:
import pandas as pd
import sklearn as skl
import nltk
import numpy as np


In [10]:
fake = pd.read_csv("./archive/fake.csv", na_values=[""])
print(fake.head(n=5))

                                       uuid  ord_in_thread  \
0  6a175f46bcd24d39b3e962ad0f29936721db70db              0   
1  2bdc29d12605ef9cf3f09f9875040a7113be5d5b              0   
2  c70e149fdd53de5e61c29281100b9de0ed268bc3              0   
3  7cf7c15731ac2a116dd7f629bd57ea468ed70284              0   
4  0206b54719c7e241ffe0ad4315b808290dbe6c0f              0   

                 author                      published  \
0     Barracuda Brigade  2016-10-26T21:41:00.000+03:00   
1  reasoning with facts  2016-10-29T08:47:11.259+03:00   
2     Barracuda Brigade  2016-10-31T01:41:49.479+02:00   
3                Fed Up  2016-11-01T05:22:00.000+02:00   
4                Fed Up  2016-11-01T21:56:00.000+02:00   

                                               title  \
0  Muslims BUSTED: They Stole Millions In Gov’t B...   
1  Re: Why Did Attorney General Loretta Lynch Ple...   
2  BREAKING: Weiner Cooperating With FBI On Hilla...   
3  PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...  

As we can see, it looks like a pretty big dataset. Let's talk numbers!

In [11]:
print("Number of rows: ", len(fake.index))

print("Number of columns: ", len(fake.columns))

Number of rows:  12999
Number of columns:  20


There are 12999 rows and 20 columns, which gives us a large sample of fake news to analyse. Let's now see which data types we have.

In [12]:
print(fake.dtypes)


uuid                   object
ord_in_thread           int64
author                 object
published              object
title                  object
text                   object
language               object
crawled                object
site_url               object
country                object
domain_rank           float64
thread_title           object
spam_score            float64
main_img_url           object
replies_count           int64
participants_count      int64
likes                   int64
comments                int64
shares                  int64
type                   object
dtype: object
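Note that `published` and `crawled` are loaded as `object`, meaning plain strings. If date arithmetic is needed later, they can be parsed into real timestamps. The following is only a minimal sketch: `sample` is a stand-in frame holding two `published` values copied from the preview above, so the cell runs without the CSV, and converting everything to UTC is an assumption about how the mixed `+03:00` / `+02:00` offsets should be handled.

```python
import pandas as pd

# Stand-in for `fake`: two timestamps copied from the head() output above.
sample = pd.DataFrame({
    "published": [
        "2016-10-26T21:41:00.000+03:00",
        "2016-10-31T01:41:49.479+02:00",
    ]
})

# utc=True converts the mixed UTC offsets to a single UTC timeline, so the
# column gets one timezone-aware datetime dtype instead of staying object.
sample["published"] = pd.to_datetime(sample["published"], utc=True)

print(sample.dtypes)
print(sample["published"])
```

The same call could be applied to `fake["published"]` and `fake["crawled"]`, provided every value in those columns follows the format seen in the preview.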


# Data Cleaning

Let's start by removing any duplicate rows, since they add nothing to the dataset. Comparing the number of rows before and after tells us how many were removed.

In [13]:
print("Number of rows before: ", len(fake.index))

fake = fake.drop_duplicates()

print("Number of rows after: ", len(fake.index))


Number of rows before:  12999
Number of rows after:  12999


So we can see that there are no duplicate rows. Moving on to missing data.
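One caveat on the duplicate check: `drop_duplicates()` with no arguments only removes rows that match in every column. Two copies of the same article with, say, a different crawl time would not count as duplicates. The sketch below shows the difference on made-up rows. The `sample` frame and its values are hypothetical, and comparing on `title` and `text` is an assumption about what "the same article" means here.

```python
import pandas as pd

# Made-up rows: the same (hypothetical) article appears twice with a
# different `crawled` value, so the full rows differ.
sample = pd.DataFrame({
    "title":   ["Story A", "Story A", "Story B"],
    "text":    ["body a", "body a", "body b"],
    "crawled": ["2016-10-27", "2016-10-28", "2016-10-27"],
})

# duplicated() marks every repeat after the first occurrence.
full_row_dupes = sample.duplicated().sum()
content_dupes = sample.duplicated(subset=["title", "text"]).sum()

print("Full-row duplicates:", full_row_dupes)
print("Duplicates on title + text:", content_dupes)

# Keep the first occurrence of each (title, text) pair.
deduped = sample.drop_duplicates(subset=["title", "text"], keep="first")
print("Rows after content-based dedup:", len(deduped))
```

Whether such a content-based check is appropriate for `fake` depends on whether reposted articles should be treated as separate samples.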

We should find all the rows with missing data and account for the missing information column by column. For each column, we check how much is missing and decide whether to remove the column entirely or find a viable substitute for the missing values.

In [14]:
print(fake.isnull().sum().sort_values(ascending=False))


domain_rank           4223
main_img_url          3643
author                2424
title                  680
country                176
text                    46
thread_title            12
shares                   0
comments                 0
likes                    0
participants_count       0
replies_count            0
uuid                     0
spam_score               0
ord_in_thread            0
site_url                 0
crawled                  0
language                 0
published                0
type                     0
dtype: int64


Here we can see the number of missing values per column, in descending order. We will now analyse each column and decide whether it is worth "fixing" or substituting the missing values, or whether to delete the column altogether.
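Raw counts are easier to judge next to the share of rows they represent. For example, 4223 missing values out of 12999 rows is roughly 32% of `domain_rank`. A minimal sketch of building such a summary follows. The `sample` frame is made up so the cell runs on its own, and the `missing` / `share` column names are just illustrative.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for `fake`, with a few missing values.
sample = pd.DataFrame({
    "domain_rank": [1.0, np.nan, np.nan, 4.0],
    "author": ["a", None, "c", "d"],
    "type": ["x", "x", "y", "z"],
})

# isnull().sum() counts missing values per column; isnull().mean() gives the
# fraction of rows that are missing, because True counts as 1 and False as 0.
missing = pd.DataFrame({
    "missing": sample.isnull().sum(),
    "share": sample.isnull().mean().round(3),
}).sort_values("missing", ascending=False)

print(missing)
```

Running the same two expressions on `fake` would give the per-column shares for the real data.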

As we can see, we have a column named "main_img_url" that doesn't provide any useful data for the study of this dataset. Because of that, we decided to remove it altogether.

In [15]:
fake.drop(['main_img_url'], axis=1, inplace=True)

In the column "author", two different cases caught our attention: there are "anonymous" authors and authors that are simply missing. These cases are comparable, because neither of them gives any information about the author, so we decided to treat them the same and write "Anonymous" in the rows where the "author" value is missing.
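Before filling, it is worth checking how the existing placeholder is spelled, because `fillna("Anonymous")` only merges the two cases if the spelling matches exactly. The sketch below normalises the variants on made-up data. The `sample` frame and the spelling variants in it are hypothetical, not values confirmed to occur in `fake`.

```python
import pandas as pd

# Made-up author values; the spelling variants are hypothetical.
sample = pd.DataFrame(
    {"author": ["Fed Up", "anonymous", " Anonymous ", None, "ANONYMOUS"]}
)

# Fill missing authors first, then trim surrounding whitespace.
author = sample["author"].fillna("Anonymous").str.strip()

# Map every case variant of "anonymous" to one canonical label,
# leaving all other author names untouched.
is_anon = author.str.lower() == "anonymous"
sample["author"] = author.mask(is_anon, "Anonymous")

print(sample["author"].value_counts())
```

If `fake["author"].value_counts()` shows only one spelling of the placeholder, the plain `fillna` is enough.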

In [19]:
# Treat missing authors the same as the existing anonymous ones
fake["author"] = fake["author"].fillna("Anonymous")
fake.to_csv("fake_2.csv", index=False)
fake.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12999 entries, 0 to 12998
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   uuid                12999 non-null  object 
 1   ord_in_thread       12999 non-null  int64  
 2   author              12999 non-null  object 
 3   published           12999 non-null  object 
 4   title               12319 non-null  object 
 5   text                12953 non-null  object 
 6   language            12999 non-null  object 
 7   crawled             12999 non-null  object 
 8   site_url            12999 non-null  object 
 9   country             12823 non-null  object 
 10  domain_rank         8776 non-null   float64
 11  thread_title        12987 non-null  object 
 12  spam_score          12999 non-null  float64
 13  replies_count       12999 non-null  int64  
 14  participants_count  12999 non-null  int64  
 15  likes               12999 non-null  int64  
 16  comm