# Data Cleaning

Let's start by removing duplicate rows, which add nothing to the dataset. To see whether any were removed, we compare the number of rows before and after the operation.

In [None]:
print("Number of rows before: ", len(fake.index))

fake = fake.drop_duplicates()

print("Number of rows after: ", len(fake.index))


Number of rows before:  12999
Number of rows after:  12999


The row count is unchanged, so the dataset contains no duplicate rows. We now move on to missing data.
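As a side note on the duplicate check, comparing row counts is not the only option: `duplicated()` reports how many rows repeat an earlier one without modifying the frame. The sketch below uses a small made-up frame (the `toy` data and its values are assumptions for illustration, not rows from the real dataset) and also shows the `subset=` argument, which limits the comparison to chosen columns.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up for illustration).
toy = pd.DataFrame(
    {
        "uuid": ["a1", "a2", "a2", "a3"],
        "title": ["first", "second", "second", "third"],
    }
)

# duplicated() flags every row that exactly repeats an earlier row.
n_exact_duplicates = int(toy.duplicated().sum())

# subset= restricts the comparison to the listed columns, here the identifier.
n_duplicate_ids = int(toy.duplicated(subset=["uuid"]).sum())

# drop_duplicates() keeps the first occurrence of each repeated row.
deduplicated = toy.drop_duplicates()

print("Exact duplicate rows:", n_exact_duplicates)
print("Repeated uuid values:", n_duplicate_ids)
print("Rows after drop_duplicates():", len(deduplicated))
```

On the real frame, `fake.duplicated().sum()` should report zero, in line with the unchanged row count above.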

We should find the rows with missing data and account for the missing values column by column. For each column, we check how many values are missing and decide whether to remove the column entirely or find a viable substitute for the missing values.

In [None]:
print(fake.isnull().sum().sort_values(ascending=False))


domain_rank           4223
main_img_url          3643
author                2424
title                  680
country                176
text                    46
thread_title            12
shares                   0
comments                 0
likes                    0
participants_count       0
replies_count            0
uuid                     0
spam_score               0
ord_in_thread            0
site_url                 0
crawled                  0
language                 0
published                0
type                     0
dtype: int64


The output lists the number of missing values per column in descending order. We will now analyse each affected column and decide whether it is worth filling in or substituting the missing values, or whether the column should be deleted altogether.
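Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of one way to build such a summary. It runs on a small made-up frame (the `toy` data is an assumption for illustration), and the `summary` table is a hypothetical helper, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are made up for illustration).
toy = pd.DataFrame(
    {
        "author": ["Ann", None, "Bob", None],
        "domain_rank": [10.0, np.nan, np.nan, np.nan],
        "likes": [0, 3, 1, 2],
    }
)

# isnull() gives a boolean frame; sum() counts the True values per column,
# and mean() gives the fraction of rows that are missing.
missing_count = toy.isnull().sum()
missing_share = toy.isnull().mean()

# Combine both views and put the most incomplete columns first.
summary = (
    pd.DataFrame({"missing": missing_count, "share": missing_share})
    .sort_values("missing", ascending=False)
)

print(summary)
```

Applied to the counts printed above, "domain_rank" would come out at roughly 32% missing (4223 of 12999 rows), which is useful context when choosing between dropping a column and filling it in.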

The column "main_img_url" doesn't provide any useful data for the study of this dataset, so we decided to remove it altogether.

In [None]:
fake.drop(['main_img_url'], axis=1, inplace=True)

In the column "author", two cases caught our attention: authors labelled "anonymous" and authors that are simply missing. The two cases are comparable, since neither gives any information about the author, so we decided to treat them the same way and write "Anonymous" in the rows where the "author" value is missing.
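Before filling, it can help to check how the two cases are actually spelled in the data. The sketch below is one possible approach on a made-up frame. That the real column contains case or whitespace variants of "anonymous" is an assumption here, and the `toy` data is for illustration only.

```python
import pandas as pd

# Toy data standing in for the real column (values are made up for illustration).
toy = pd.DataFrame(
    {"author": ["Jane Doe", "Anonymous", "anonymous", None, " Anonymous "]}
)

# Normalise for comparison only: strip whitespace and lowercase.
# Missing entries stay missing and compare as False below.
normalised = toy["author"].str.strip().str.lower()

is_missing = toy["author"].isnull()
is_anonymous = normalised.eq("anonymous")

print("Missing authors:", int(is_missing.sum()))
print("Anonymous-like labels:", int(is_anonymous.sum()))

# Map both cases to a single label; where() keeps values where the condition holds.
toy["author"] = toy["author"].where(~(is_missing | is_anonymous), "Anonymous")

print(toy["author"].value_counts())
```

If the real column only ever uses one spelling, the plain `fillna("Anonymous")` is enough and the normalisation step can be skipped.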

In [None]:
fake["author"] = fake["author"].fillna("Anonymous")
fake.to_csv("fake_2.csv", index=False)
fake.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12999 entries, 0 to 12998
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   uuid                12999 non-null  object 
 1   ord_in_thread       12999 non-null  int64  
 2   author              12999 non-null  object 
 3   published           12999 non-null  object 
 4   title               12319 non-null  object 
 5   text                12953 non-null  object 
 6   language            12999 non-null  object 
 7   crawled             12999 non-null  object 
 8   site_url            12999 non-null  object 
 9   country             12823 non-null  object 
 10  domain_rank         8776 non-null   float64
 11  thread_title        12987 non-null  object 
 12  spam_score          12999 non-null  float64
 13  replies_count       12999 non-null  int64  
 14  participants_count  12999 non-null  int64  
 15  likes               12999 non-null  int64  
 16  comm