# Welcome to the Fake News Analyzer!

## First steps

Here we import all the libraries needed for the project and load the data from the CSV. All the data is in a single CSV file named "fake.csv". This file has several rows and columns that will be evaluated in this notebook.

In [None]:
#imports and reading dataset
import pandas as pd
import sklearn as skl
import nltk
import numpy as np

print("-> FAKE NEWS ANALYZER <-")

fake = pd.read_csv("./archive/fake.csv", na_values=[""])
print(fake.head(n=5))

As we can see it is a pretty big dataset. Let's talk numbers!

In [None]:
print("Number of rows: ", len(fake.index))

print("Number of columns: ", len(fake.columns))

There are 12999 rows and 20 different columns. This constitutes a big sample of fake news to analyse. Let's now see which type of data we have.

In [None]:
print("-> Printing columns' data types <-")

print(fake.dtypes)


# Data Cleaning

## Duplicates

Let's start by removing any data duplicates that add nothing to the dataset. We should compare the number of rows before and after removing the duplicates.

In [None]:
print("-> DATA CLEANING <-")

print("-> Dropping duplicate rows <-")

#cleaning duplicates
print("Number of rows before: ", len(fake.index))

fake = fake.drop_duplicates()

print("Number of rows after: ", len(fake.index))


So we can see that there are no duplicate rows. Moving on to missing data.

## Missing data

We should find all the rows with missing data and account for the missing information column by column. For each column, we will check how much information is missing and decide whether to remove the column entirely or find a viable substitute for the missing values.

In [None]:
print("-> Showing null values per column - before cleaning <-")

print(fake.isnull().sum().sort_values(ascending=False))


Here we can see the number of missing values per column, in descending order. We will now analyse each column and decide whether it is worth "fixing" or substituting the missing values, or whether to delete the column altogether.
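
To support that decision, the missing counts can also be read as a share of the dataset. The following is a minimal sketch on a small made-up DataFrame; the `toy` frame and its values are hypothetical and are not taken from fake.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the idea
toy = pd.DataFrame({
    "author": ["a", np.nan, "c", np.nan],
    "title": ["t1", "t2", np.nan, "t4"],
    "text": ["x", "y", "z", "w"],
})

# isnull() yields booleans, so mean() is the fraction of missing values per column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

A column that is mostly empty is a candidate for removal, while a column with only a few gaps is usually easier to fill or to filter row by row.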

As we can see, we have a column named "main_img_url" that doesn't provide any useful data for the study of this dataset. Because of that, we decided to remove it altogether.

In [None]:
print("-> Dropping 'main_img_url' column <-")

fake.drop(['main_img_url'], axis=1, inplace=True)


In the column "author" there are two different cases that caught our attention: "anonymous" authors and missing authors. These cases are comparable, because there is no information about the author in either of them, so we decided to treat them the same and set "Anonymous" in the rows where the "author" value is missing.

In [None]:
print("-> Fixing null values in 'author' <-")

fake["author"]=fake["author"].fillna("Anonymous")


As we can also see, there are 12 rows without a thread title. This prevents us from grouping these news items into threads, because we don't know where they belong. Since it is such a small amount of news (12), we decided that they should be removed, as they don't represent a big sample.

In [None]:
print("-> Fixing null values in 'thread_title' <-")

print("Number of rows before: ", len(fake.index))

fake = fake[fake['thread_title'].notna()]

print("Number of rows after: ", len(fake.index))


Since we are going to evaluate and perform text search based on the text of the news, news with no text become irrelevant to this dataset. Therefore, we are going to remove those rows as well.

In [None]:
print("-> Fixing null values in 'text' <-")

print("Number of rows before: ", len(fake.index))

fake = fake[fake['text'].notna()]

print("Number of rows after: ", len(fake.index))


Some news items have no title. We found that these belong to threads where only the first item (ord_in_thread=0) has a title. Let's check whether all the existing titles match the associated thread_title.

In [None]:
# get unique values from each column, convert every value to string and sort
titles = sorted(str(t) for t in fake['title'].unique())
thread_titles = sorted(str(t) for t in fake['thread_title'].unique())

# check if size is the same
print("Number of unique titles:", len(titles))
print("Number of unique thread_titles:", len(thread_titles), "\n")

# build a set of titles for fast membership tests
known_titles = set(titles)

# print every different title - compare to bigger list: thread_titles
for element in thread_titles:
    if element not in known_titles:
        print("Different thread_title:", element)
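
The same comparison can also be written as a set difference. This is a minimal sketch on a hypothetical `toy` frame (the values are invented for illustration), not a replacement for the check on the real data.

```python
import pandas as pd

# Hypothetical toy data: one thread title never appears as a title
toy = pd.DataFrame({
    "title": ["A", None, "B"],
    "thread_title": ["A", "A", "C"],
})

# Missing values are dropped before building the sets
titles = set(toy["title"].dropna())
thread_titles = set(toy["thread_title"].dropna())

# thread titles that are not used as a title anywhere
only_in_threads = sorted(thread_titles - titles)
print(only_in_threads)
```

If the result is non-empty, the two columns carry different information and neither can simply be dropped.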



As we can see, titles and thread_titles differ, so we cannot eliminate either of those columns. We can, however, substitute the data and eliminate rows that have neither value, since it would be impossible to associate those news items with any theme or keyword.

In [None]:
print("Number of rows with no title and no thread_title:", len(fake.loc[fake['title'].isna() & fake['thread_title'].isna()]))

Since there are no rows missing both values, we should turn our attention to how to correct the missing data in these columns. For every row with no title, we will substitute the thread_title instead, since the thread_title will be in some way connected to the theme of the news.

In [None]:
print("-> Fixing null values in 'title' <-")

# assign the result back instead of filling in place on a selected column
fake["title"] = fake["title"].fillna(fake["thread_title"])
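
As a small illustration of this substitution, the sketch below fills missing titles from the thread title on a hypothetical `toy` frame (values invented for the example). Passing a Series to `fillna` aligns on the index, so each gap takes the value from its own row.

```python
import pandas as pd

# Hypothetical toy data: only the first item of each thread has a title
toy = pd.DataFrame({
    "title": ["First post", None, None],
    "thread_title": ["Thread A", "Thread A", "Thread B"],
})

# Fill each missing title with the thread_title of the same row
toy["title"] = toy["title"].fillna(toy["thread_title"])
print(toy["title"].tolist())
```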


Moreover, there are news items with no 'country' value, 176 to be precise. To make the data easier to analyze and avoid missing values, we will drop these rows, since we can't track the origin of the news and, given the size of the dataset, the amount of data discarded (176 rows) is rather small.

In [None]:
print("-> Fixing null values in 'country' <-")

print("Number of rows before: ", len(fake.index))

fake = fake[fake['country'].notna()]

print("Number of rows after: ", len(fake.index))


Domain rank is a metric that goes from 0 to 100, the latter being the strongest, and is evaluated based on many factors that depend on user searches and the authority of a given domain. If a domain has no ranking available, we assume it is 0, since it hasn't been evaluated yet, which can be due to a lack of information about, or visits to, that domain. We will therefore replace every missing value in the column "domain_rank" with 0.

In [None]:
print("-> Fixing null values in 'domain_rank' <-")

fake["domain_rank"] = fake["domain_rank"].fillna(0)


Let's take a new look at the null values per column and see the final dataset, with no missing data.

In [None]:
print("-> Showing null values per column - after cleaning <-")

print(fake.isnull().sum().sort_values(ascending=False))


Lastly, we have observed that the uuid column doesn't add anything of value to our dataset, so we will drop the column altogether before moving on.

In [None]:
print("-> Dropping 'uuid' column <-")

fake.drop(['uuid'], axis=1, inplace=True)


## Broken Data & Data Types

As in most crawled datasets, there are rows with broken data, that is, rows where crawling failed and the fields were filled with broken values. Let's find these rows in our dataset and correct or drop them.

The first case we noticed was some rows filled with 0 and 1, namely in the title and text columns. Let's locate and print them, to analyze the data and see if they should be dropped. If we can't find any, it means they were already dropped in the previous cleaning steps.


In [None]:
print("-> Cleaning crawling errors <-")

print(fake.loc[fake['title'] == "0"])
print(fake.loc[fake['title'] == "1"])

print(fake.loc[fake['text'] == "0"])
print(fake.loc[fake['text'] == "1"])
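
The four lookups can also be combined into a single mask with `isin`. This is only a sketch on a hypothetical `toy` frame with invented values; it assumes the broken rows hold the strings "0" or "1".

```python
import pandas as pd

# Hypothetical toy data with two broken rows
toy = pd.DataFrame({
    "title": ["0", "Real headline", "Another headline"],
    "text": ["some body", "1", "more body"],
})

placeholders = ["0", "1"]

# A row is flagged if either column holds one of the placeholder strings
broken = toy[toy["title"].isin(placeholders) | toy["text"].isin(placeholders)]
print(len(broken))
```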


As an additional verification that the above statement was correct, we opened the dataset in Excel and verified that every case of broken data was indeed gone from the dataset at this point of the cleaning.

The other major case of broken data we found was news items that weren't crawled correctly, so that the text of the news spilled across several columns. There are several ways of correcting this issue, but in order to catch all the cases and clean the data as thoroughly as possible, we will verify column by column whether the data types and formats are correct, dropping the rows in which they aren't.

Let's start with the 'ord_in_thread' column. It must be a number, so every row that contains anything other than a number should be eliminated. If any row contained alphabetic characters, the type of the column would not be int64, so let's first verify the type; non-numeric rows only need to be eliminated if it isn't 'int64'.

In [None]:
print("-> Cleaning 'ord_in_thread' column <-")

print("Number of rows before: ", len(fake.index))

print("Type of 'ord_in_thread':", fake['ord_in_thread'].dtypes)
print("Is it int64?", fake['ord_in_thread'].dtypes == 'int64')

print("Number of rows after: ", len(fake.index))


As it returns 'int64', we can be sure that the whole column contains only numeric values. Moving on to 'author', there are two potential issues: wrong data types and the "-NO AUTHOR-" rows. It is very hard to define what a "correct" author name is, and the usual rules for identifying names would exclude valid usernames that contain digits, so we will not validate the names themselves. Instead, we will correct the "-NO AUTHOR-" cases by replacing them with "Anonymous", as we did before with null values. We will also remove rows that contain only whitespace.

In [None]:
print("-> Cleaning 'author' column <-")

print("Number of rows before: ", len(fake.index))

fake["author"] = fake["author"].replace("-NO AUTHOR-","Anonymous")

# clean string rows with just whitespace
fake = fake[~fake['author'].str.contains(r'^\s*$')]

print("Number of rows after: ", len(fake.index))
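
Put together, the author handling amounts to three steps: fill missing values, replace the placeholder, and drop whitespace-only entries. The sketch below shows them on a hypothetical `toy` frame (invented values); it uses `str.strip()` rather than a regular expression, which is an alternative and not the approach used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data covering each case
toy = pd.DataFrame({"author": ["Jane42", np.nan, "-NO AUTHOR-", "   "]})

# Missing values and the placeholder both become "Anonymous"
toy["author"] = toy["author"].fillna("Anonymous").replace("-NO AUTHOR-", "Anonymous")

# Drop rows whose author is empty once surrounding whitespace is removed
toy = toy[toy["author"].str.strip() != ""]
print(toy["author"].tolist())
```

Note that "Jane42" survives: usernames with digits are treated as valid, in line with the reasoning above.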


Next up we have the column "published", which should only contain dates. We can verify this with a regular expression that checks the format of the dates, as well as the data types present.

In [None]:
print("-> Cleaning 'published' column <-")

print("Number of rows before: ", len(fake.index))

# raw string so that \d is passed to the regex engine unchanged
fake = fake[fake['published'].str.contains(r'\d{4}-\d{2}-\d{2}')]

print("Number of rows after: ", len(fake.index))
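
One detail worth keeping in mind is that `str.contains` searches anywhere in the string, while `str.match` only matches from the start. The sketch below contrasts the two on a hypothetical Series (invented values); which one is appropriate depends on how strict the check should be.

```python
import pandas as pd

# Hypothetical values: a timestamp, a sentence containing a date, and junk
toy = pd.Series(["2016-10-26T21:41:00", "see 2016-10-26", "not a date"])

date_pattern = r"\d{4}-\d{2}-\d{2}"

# contains: the date may appear anywhere in the string
loose = toy.str.contains(date_pattern)

# match: the string has to start with the date
strict = toy.str.match(date_pattern)

print(loose.tolist())
print(strict.tolist())
```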


The next column to verify is the column "title". We have cleaned this column before, so some work was already done. We should just verify that the titles aren't mere strings of whitespace, as it is extremely difficult to define a "correct" title or text in any way.

In [None]:
print("-> Cleaning 'title' column <-")

print("Number of rows before: ", len(fake.index))

# clean string rows with just whitespace
fake = fake[~fake['title'].str.contains(r'^\s*$')]


print("Number of rows after: ", len(fake.index))


Moving on to the "text" column, we must do the same verification as in the "title" column.

In [None]:
print("-> Cleaning 'text' column <-")

print("Number of rows before: ", len(fake.index))

fake = fake[~fake['text'].str.contains(r'^\s*$')]  #clean string rows with just whitespace

print("Number of rows after: ", len(fake.index))


Next up we have the "language" column. This column must contain only one word and no digits, so the function isalpha() will do the job. There are also some rows, 7 to be exact, that have the value 'ignore', which is not a language. These will be removed.

In [None]:
print("-> Cleaning 'language' column <-")

print("Number of rows before: ", len(fake.index))

fake['language'] = fake['language'].apply(str)
fake = fake[fake.language.str.isalpha()] 

# clean string rows with just whitespace or 'ignore'
fake = fake[~fake['language'].str.contains(r'^\s*$')]
fake = fake[fake.language != "ignore"]

print("Number of rows after: ", len(fake.index))
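
The language rules can be expressed as one combined mask. The following sketch uses a hypothetical Series with invented values to show which entries survive; an empty string fails `isalpha()`, so it is excluded without a separate whitespace check.

```python
import pandas as pd

# Hypothetical values: valid names, the 'ignore' marker, digits and an empty string
toy = pd.Series(["english", "ignore", "en1", "", "german"])

# Keep purely alphabetic values that are not the 'ignore' marker
mask = toy.str.isalpha() & (toy != "ignore")
kept = toy[mask].tolist()
print(kept)
```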


Next up is the column "crawled", which should only contain the date on which the data was crawled.

In [None]:
print("-> Cleaning 'crawled' column <-")

print("Number of rows before: ", len(fake.index))

fake = fake[fake['crawled'].str.contains(r'\d{4}-\d{2}-\d{2}')]  # dates

print("Number of rows after: ", len(fake.index))


The following column is "site_url". Every site follows the same structure of domain.xxx, so we will apply a regular expression to filter out unwanted formats.

In [None]:
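print("-> Cleaning 'site_url' column <-")

print("Number of rows before: ", len(fake.index))

# optional "www.", then at least two dot-separated labels, then an optional trailing slash
fake = fake[fake['site_url'].str.contains(r'^(?:www\.)?[\w-]+(?:\.[\w-]+)+/?$')]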

print("Number of rows after: ", len(fake.index))


Moving on to the 'country' column, the countries are given as a two-letter uppercase code. Let's confirm every row follows that format.

In [None]:
print("-> Cleaning 'country' column <-")

print("Number of rows before: ", len(fake.index))

# anchored so that the whole value must be exactly two uppercase letters
fake = fake[fake['country'].str.contains(r'^[A-Z]{2}$')]

print("Number of rows after: ", len(fake.index))
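
To check that a value is exactly a two-letter code, the whole string has to match rather than just a part of it. Below is a minimal sketch using `str.fullmatch` (available in pandas 1.1 and later) on a hypothetical Series with invented values.

```python
import pandas as pd

# Hypothetical values: two valid codes and three that should be rejected
toy = pd.Series(["US", "GB", "USA", "us", "xxUSxx"])

# fullmatch requires the entire string to be two uppercase letters
valid = toy.str.fullmatch(r"[A-Z]{2}")
print(toy[valid].tolist())
```

An unanchored pattern would accept "USA" and "xxUSxx" as well, because both contain two consecutive uppercase letters.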


Next up is the column 'domain_rank', which should be a number. Let's verify that as well.

In [None]:
print("-> Cleaning 'domain_rank' column <-")

print("Number of rows before: ", len(fake.index))

print("Type of 'domain_rank':", fake['domain_rank'].dtypes)
print("Is it int64?", fake['domain_rank'].dtypes=='int64')

#cast to int64
fake['domain_rank'] = fake['domain_rank'].astype(np.int64)

#check if all values are greater than or equal to 0
fake = fake[fake['domain_rank'] >= 0]

print("Number of rows after: ", len(fake.index))


Next up is "thread_title". This column follows the same rules as the 'title' column: it is very hard to define what a correct title is. The only verifiable condition is the same as the one used before for 'title'.

In [None]:
print("-> Cleaning 'thread_title' column <-")

print("Number of rows before: ", len(fake.index))

# clean string rows with just whitespace
fake = fake[~fake['thread_title'].str.contains(r'^\s*$')]

print("Number of rows after: ", len(fake.index))


The next column is "spam_score". This column must only contain float values between 0 and 1. Let's verify this.

In [None]:
print("-> Cleaning 'spam_score' column <-")

print("Number of rows before: ", len(fake.index))

print("Type of 'spam_score':", fake['spam_score'].dtypes)
print("Is it float64?", fake['spam_score'].dtypes == 'float64')

#check if all values are greater than or equal to 0 and lower than or equal to 1
fake = fake[fake['spam_score'] >= 0]
fake = fake[fake['spam_score'] <= 1]

print("Number of rows after: ", len(fake.index))
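
The two range comparisons can also be written as one call to `Series.between`, which is inclusive on both ends by default. This is a sketch on a hypothetical `toy` frame with invented scores; missing values fail the check and are dropped along with the out-of-range ones.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, including out-of-range and missing values
toy = pd.DataFrame({"spam_score": [0.0, 0.5, 1.0, 1.2, -0.1, np.nan]})

# between(0, 1) keeps values with 0 <= score <= 1
in_range = toy["spam_score"].between(0, 1)
cleaned = toy[in_range]
print(cleaned["spam_score"].tolist())
```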


The next column is "replies_count" which, as a counter, should be a positive integer or zero. Let's check those conditions.

In [None]:
print("-> Cleaning 'replies_count' column <-")

print("Number of rows before: ", len(fake.index))

print("Type of 'replies_count':", fake['replies_count'].dtypes)
print("Is it int64?", fake['replies_count'].dtypes == 'int64')

#check if all values are greater than or equal to 0
fake = fake[fake['replies_count'] >= 0]

print("Number of rows after: ", len(fake.index))


The same verification as for "replies_count" will be applied to the columns "participants_count", "likes", "comments" and "shares".

In [None]:
print("-> Cleaning 'participants_count' column <-")

print("Number of rows before: ", len(fake.index))

print("Type of 'participants_count':", fake['participants_count'].dtypes)
print("Is it int64?", fake['participants_count'].dtypes == 'int64')

#check if all values are greater than or equal to 0
fake = fake[fake['participants_count'] >= 0]

print("Number of rows after: ", len(fake.index))


In [None]:
print("-> Cleaning 'likes' column <-")

print("Number of rows before: ", len(fake.index))

print("Type of 'likes':", fake['likes'].dtypes)
print("Is it int64?", fake['likes'].dtypes == 'int64')

#check if all values are greater than or equal to 0
fake = fake[fake['likes'] >= 0]

print("Number of rows after: ", len(fake.index))


In [None]:
print("-> Cleaning 'comments' column <-")

print("Number of rows before: ", len(fake.index))

print("Type of 'comments':", fake['comments'].dtypes)
print("Is it int64?", fake['comments'].dtypes == 'int64')

#check if all values are greater than or equal to 0
fake = fake[fake['comments'] >= 0]

print("Number of rows after: ", len(fake.index))


In [None]:
print("-> Cleaning 'shares' column <-")

print("Number of rows before: ", len(fake.index))

print("Type of 'shares':", fake['shares'].dtypes)
print("Is it int64?", fake['shares'].dtypes == 'int64')

#check if all values are greater than or equal to 0
fake = fake[fake['shares'] >= 0]

print("Number of rows after: ", len(fake.index))
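
Since the counter columns all go through the same check, the repeated cells could be folded into a small helper. The sketch below is one possible way to do that; `keep_non_negative` and the `toy` frame are hypothetical names and values introduced only for this example.

```python
import pandas as pd

def keep_non_negative(df, columns):
    """Print each column's dtype and keep only rows where every listed column is >= 0."""
    for col in columns:
        print(f"Type of '{col}':", df[col].dtype)
        df = df[df[col] >= 0]
    return df

# Hypothetical toy data: rows 2 and 3 each contain one negative counter
toy = pd.DataFrame({
    "likes": [0, 3, -1, 2],
    "comments": [1, 0, 4, -2],
    "shares": [5, 0, 1, 1],
})

cleaned = keep_non_negative(toy, ["likes", "comments", "shares"])
print("Number of rows after: ", len(cleaned.index))
```

Keeping the cells separate, as above, makes the per-column row counts easier to follow; a helper like this trades that visibility for less repetition.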


Last but not least we have the "type" column, which contains a lowercase string classifying the news.

In [None]:
print("-> Cleaning 'type' column <-")

print("Number of rows before: ", len(fake.index))

fake = fake[fake['type'].str.contains(r'^[a-z]+$')]  # lowercase letters

# clean string rows with just whitespace
fake = fake[~fake['type'].str.contains(r'^\s*$')]

print("Number of rows after: ", len(fake.index))


And now, since the data cleaning is finished, we can save the clean data to a new CSV file, to be used later.

In [None]:
fake.to_csv("fake_clean.csv", index=False)