# Data Cleaning and EDA

This notebook aggregates the news items that belong to the same day into a single row, and runs some EDA to check whether data is missing for any day.

**Notebook Contents**
- [Imports](#Imports)
- [EDA](#EDA)
- [Aggregating Data](#Aggregating-Data)

## Imports

In [31]:
import pandas as pd
import numpy as np

In [32]:
data = pd.read_csv('../Datasets/backup_data.csv')

In [33]:
data.drop_duplicates(subset='article',inplace=True)
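A minimal sketch of what the `drop_duplicates(subset='article')` call above does, using a made-up three-row frame (`toy` and its contents are illustrative, not taken from the dataset). Rows count as duplicates when their `article` text matches, even if the title or date differs, and pandas keeps the first occurrence by default.

```python
import pandas as pd

# Hypothetical toy data: the first two rows share the same article text.
toy = pd.DataFrame({
    'date': ['2020-02-20', '2020-02-21', '2020-02-21'],
    'title': ['First headline', 'First headline (updated)', 'Second headline'],
    'article': ['same summary', 'same summary', 'another summary'],
})

# keep='first' is the default, so the earliest row of each duplicate group survives.
deduped = toy.drop_duplicates(subset='article')
print(deduped)
```

The surviving rows keep their original index labels, so the index has gaps afterwards. This is also consistent with the `describe()` output in the EDA section, where `article` has as many unique values as rows in every category while `title` does not.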

## EDA

In [34]:
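# There are 1153 days from 2017-01-01 through 2020-02-27 (inclusive),
# so 1153 distinct dates here means no day in that range is missing
# (assuming every date in the data falls inside that range).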
# pd.set_option('display.max_rows', 10) # I used this to check which dates are 
data.groupby('date').count().count()

category    1153
title       1153
article     1153
author      1153
dtype: int64
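Counting distinct dates only confirms coverage indirectly. A more direct check, sketched below under the assumption that the `date` column holds ISO strings such as `2020-02-17`, compares the dates present against a full daily calendar. The helper `missing_days` and the `toy` frame are hypothetical names introduced for illustration only.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; 2017-01-03 and 2017-01-05 are absent.
toy = pd.DataFrame({
    'date': ['2017-01-01', '2017-01-02', '2017-01-04', '2017-01-04'],
    'title': ['a', 'b', 'c', 'd'],
})

def missing_days(frame, start, end, date_col='date'):
    """Return the calendar days in [start, end] that have no row in `frame`."""
    expected = pd.date_range(start=start, end=end, freq='D')
    present = pd.DatetimeIndex(pd.to_datetime(frame[date_col]).unique())
    return expected.difference(present)

gaps = missing_days(toy, '2017-01-01', '2017-01-05')
print(gaps)

# Sanity check on the day count quoted in the cell above (inclusive range).
n_days = len(pd.date_range('2017-01-01', '2020-02-27', freq='D'))
print(n_days)
```

On the real data, `missing_days(data, '2017-01-01', '2020-02-27')` should come back empty if every day is covered.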

In [35]:
data.groupby('category').describe()

| category | date count | date unique | date top | date freq | title count | title unique | title top | title freq | article count | article unique | article top | article freq | author count | author unique | author top | author freq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AFRICA | 1022 | 643 | 2020-02-17 | 18 | 1022 | 1019 | Nigerian Women Say ‘MeToo.’ Critics Say ‘Prove... | 2 | 1022 | 1022 | The police said the assault involved rocket-pr... | 1 | 1022 | 364 | By Reuters | 65 |
| AMERICAS | 1523 | 825 | 2020-02-20 | 19 | 1523 | 1523 | In Colombia, Two Rebel Groups Take Different P... | 1 | 1523 | 1523 | There are dozens of communities that continue ... | 1 | 1523 | 490 | By Reuters | 81 |
| ASIA PACIFIC | 5669 | 1144 | 2020-02-21 | 91 | 5669 | 5658 | Factbox: Latest on Coronavirus Spreading in Ch... | 6 | 5669 | 5669 | The girl’s marriage to a man 30 years her seni... | 1 | 5669 | 1131 | By Reuters | 466 |
| AUSTRALIA | 741 | 510 | 2018-11-29 | 4 | 741 | 741 | Girls as Young as 12 Were Strip-Searched in Au... | 1 | 741 | 741 | A government-appointed commission called for t... | 1 | 741 | 263 | By Damien Cave | 85 |
| CANADA | 545 | 427 | 2018-07-06 | 4 | 545 | 545 | 100 Years Later, Battle of Vimy Ridge Remains ... | 1 | 545 | 545 | In a small Manitoba town, the Canadian police ... | 1 | 545 | 202 | By Ian Austen | 113 |
| EUROPE | 6148 | 1147 | 2020-02-20 | 88 | 6148 | 6134 | Countries Evacuating Nationals From Coronaviru... | 2 | 6148 | 6148 | Two migrants were killed in a car crash and se... | 1 | 6148 | 1127 | By Reuters | 400 |
| MIDDLE EAST | 2741 | 1016 | 2020-02-20 | 42 | 2741 | 2736 | In ‘Cave-In,’ Trump Cease-Fire Cements Turkey’... | 2 | 2741 | 2741 | Three countries called on Iran to return to fu... | 1 | 2741 | 746 | By Reuters | 162 |
| WHAT IN THE WORLD | 9 | 9 | 2017-01-18 | 1 | 9 | 9 | China’s Poplar Trees: A Spring Nuisance That S... | 1 | 9 | 9 | The European Union spends about $1 billion a y... | 1 | 9 | 9 | By Felipe Villamor | 1 |
| WORLD | 279 | 238 | 2020-02-18 | 4 | 279 | 279 | It’s Not Just You: 2017 Was Rough for Humanity... | 1 | 279 | 279 | Twenty-three Marines were rescued after an Osp... | 1 | 279 | 173 | By International Herald Tribune | 19 |


## Aggregating Data

In [36]:
rows = []

for date in data['date'].unique():

    # Select the rows for this date once instead of filtering three times
    of_date = data[data['date'] == date]

    # Add a space to the end of every title, article and category
    # so they stay separated when concatenated.
    titles_of_date = of_date['title'] + ' '
    articles_of_date = of_date['article'] + ' '
    categories_of_date = of_date['category'] + ' '

    # Concatenate all of the titles/articles/categories that belong
    # to that unique date (summing a Series of strings joins them)
    aggreg_titles = titles_of_date.sum(axis=0)
    aggreg_articles = articles_of_date.sum(axis=0)
    aggreg_categories = categories_of_date.sum(axis=0)

    # Collect one row dictionary per date
    rows.append({
        'date': date,
        'category': aggreg_categories,
        'title': aggreg_titles,
        'article': aggreg_articles
    })

# Build the dataframe in one go; DataFrame.append was removed in pandas 2.0
agg_df = pd.DataFrame(rows, columns=['date', 'category', 'title', 'article'])

In [37]:
agg_df.to_csv('../Datasets/aggregated_news.csv', index=False)
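The per-date loop above can also be written as a single `groupby`. The sketch below is an alternative, not the approach used in this notebook, and it runs on a made-up `toy` frame. It assumes the text columns contain no missing values, and it joins with a single space between items, so unlike the loop it leaves no trailing space.

```python
import pandas as pd

# Hypothetical toy frame with the same columns the loop uses.
toy = pd.DataFrame({
    'date': ['2020-02-20', '2020-02-20', '2020-02-21'],
    'category': ['EUROPE', 'ASIA PACIFIC', 'EUROPE'],
    'title': ['T1', 'T2', 'T3'],
    'article': ['A1', 'A2', 'A3'],
})

text_cols = ['category', 'title', 'article']

# sort=False keeps dates in order of first appearance, like data['date'].unique().
agg_toy = (
    toy.groupby('date', sort=False)[text_cols]
       .agg(lambda s: ' '.join(s))  # join each group's strings with one space
       .reset_index()
)
print(agg_toy)
```

This avoids re-filtering the full frame once per date, which matters more as the number of distinct dates grows.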