# Data Cleaning

In [1]:
import pandas as pd

In [2]:
articles = pd.read_csv('./data/articles.csv')
articles.head()

Unnamed: 0,headline,url,description,category
0,"A Sad Bulldog, A Happy Prince And More Things ...",https://www.huffpost.com/entry/coronavirus-dis...,"A sad bulldog and a happy, paint-covered princ...",good
1,John Krasinski Shocks 9-Year-Old 'Hamilton' Fa...,https://www.huffpost.com/entry/john-krasinski-...,"""The Office"" star struck gold again in his You...",good
2,I Was Struggling As A Single Mom. Then A Stran...,https://www.huffpost.com/entry/struggling-sing...,"""With this gift, I was finally able to get out...",good
3,Pink's Advice To Fans: 'Change The F**king Wor...,https://www.huffpost.com/entry/pink-peoples-ch...,“I care about decency and humanity and kindnes...,good
4,10 Books For Parents Who Want To Raise Kind Kids,https://www.huffpost.com/entry/parenting-books...,These parenting books emphasize emotional inte...,good


First, let's drop any duplicate entries that I may have added by mistake while collecting the data.

In [3]:
pre_drop = len(articles)
articles.drop_duplicates(inplace = True)
post_drop = len(articles)
print(f'{pre_drop - post_drop} duplicate articles removed!')

0 duplicate articles removed!
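
Note that `drop_duplicates` with no arguments only removes rows that match in every column. As a minimal sketch, assuming the URL is a reasonable identity for an article, a stricter check could key on `url` alone. The toy rows below are invented for illustration and are not from the dataset:

```python
import pandas as pd

# Toy data, invented for illustration: rows 0 and 2 share a URL
# but differ in description, so they are not exact duplicates.
toy = pd.DataFrame({
    'headline': ['A', 'B', 'A'],
    'url': [
        'https://example.com/a',
        'https://example.com/b',
        'https://example.com/a',
    ],
    'description': ['first', 'second', 'first (updated)'],
})

exact_dupes = toy.duplicated().sum()            # rows identical in every column
url_dupes = toy.duplicated(subset='url').sum()  # rows repeating an earlier URL

# Keep the first row seen for each URL.
deduped = toy.drop_duplicates(subset='url', keep='first')
print(exact_dupes, url_dupes, len(deduped))
```

Whether a repeated URL really means a duplicate article depends on how the data was collected, so this is worth checking by eye before dropping anything.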


To make the dataframe easier to sort by the source of each article, let's add a source column.

In [4]:
for i, article in articles.iterrows():
    source = ''
    if 'newser' in article['url']:
        source = 'newser'
    elif 'huffpost' in article['url']:
        source = 'huffpost'
    elif 'goodnewsnetwork' in article['url']:
        source = 'goodnewsnetwork'
    articles.loc[i, 'source'] = source
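
The loop above works, but `iterrows` gets slow on larger frames. A possible vectorized alternative, sketched on a toy frame with made-up URLs, uses `Series.str.contains` to build one boolean mask per source. The `known_sources` list is an assumption that mirrors the order of the `if`/`elif` chain:

```python
import pandas as pd

# Toy URLs, invented for illustration.
toy = pd.DataFrame({'url': [
    'https://www.huffpost.com/entry/x',
    'https://www.goodnewsnetwork.org/y',
    'https://www.newser.com/story/z',
    'https://example.com/other',
]})

# Same precedence as the if/elif chain: the first matching name wins.
known_sources = ['newser', 'huffpost', 'goodnewsnetwork']

toy['source'] = ''  # default for URLs that match none of the names
for name in known_sources:
    # Plain substring match; only fill rows that have no source yet.
    matches = toy['url'].str.contains(name, regex=False, na=False)
    toy.loc[matches & (toy['source'] == ''), 'source'] = name

print(toy['source'].tolist())
```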

Let's examine some articles from each source to see what we're working with.

In [5]:
def pretty_display(articles):
    for source in articles['source'].unique():
        print(f'Example articles from {source}:')
        condition = articles['source'] == source
        columns = ['headline', 'description', 'category']
        selected_articles = articles.loc[condition, columns][:5]
        with pd.option_context('max_colwidth', None):
            display(selected_articles)
pretty_display(articles)

Example articles from huffpost:


Unnamed: 0,headline,description,category
0,"A Sad Bulldog, A Happy Prince And More Things To Make You Smile This Weekend","A sad bulldog and a happy, paint-covered prince top our list of distractions.",good
1,John Krasinski Shocks 9-Year-Old 'Hamilton' Fan In New 'Good News',"""The Office"" star struck gold again in his YouTube antidote to the coronavirus pandemic.",good
2,I Was Struggling As A Single Mom. Then A Stranger's Kind Act Changed Everything.,"""With this gift, I was finally able to get out of the hole I had basically been in for two years since getting separated.""",good
3,Pink's Advice To Fans: 'Change The F**king World!',"“I care about decency and humanity and kindness. Kindness today is an act of rebellion,"" the singer said at the E! People's Choice Awards.",good
4,10 Books For Parents Who Want To Raise Kind Kids,"These parenting books emphasize emotional intelligence, empathy and respect for others.",good


Example articles from goodnewsnetwork:


Unnamed: 0,headline,description,category
861,82 Year-old Grandpa Dances With a Broom Because He Isn’t Going to Let COVID-19 Ruin This Senior’s Fun,"A Michigan grandpa, Frank Cicorelli, is seen dancing with a broom during the COVID-19 pandemic, because he’s not letting the virus spoil his fun.",good
862,Guy Fieri Giving $22 Million He Raised to Restaurant Workers; Plans Nacho Battle With Bill Murray to Give More,,good
863,"Hero Teen is Rewarded for Returning $135,000 He Found Next to ATM Machine",Albuquerque teen Jose Nuñez Romaniz found a huge bag of cash sitting next to an ATM and called police who honored his honesty along with local businesses.,good
864,"From Hong Kong to New Zealand, Hawaii and Montana, Officials Celebrate No New Cases of COVID-19",,good
865,Family-Owned Greenhouse Donates $1Million in Orchids to Healthcare Workers in 7 Hardest Hit U.S. Cities,Green Circle Growers of Ohio have donated $1 million in orchids to frontline workers in seven cities hardest hit by the COVID-19 pandemic.,good


Example articles from newser:


Unnamed: 0,headline,description,category
5849,Autopsy: 13 Shotgun Pellets Exited Arbery&#39;s Back,"\r\n 13 shotgun pellets exited his back, 11 remained embedded in his chest, medical examiner finds\r\n",good
5850,Woman Spat on at Work Dies From COVID-19,\r\n Man told British railway worker he had the virus\r\n,good
5851,Doctors Without Borders Offering Rare Help in US,\r\n Navajo Nation in Southwest\r\n,good
5852,"At Afghan Maternity Clinic, a &#39;Particularly Despicable&#39; Act","\r\n Armed militants storm hospital in Dashti Barchi, Afghanistan; at least 15 injured\r\n",good
5853,"Crash Kills 5 Family Members, Including Pregnant Mom and Her 2 Babies","\r\n Their pregnant mom, and a teen also among victims\r\n",good


It looks like I accidentally labeled all the newser articles as good instead of bad. The first step is to fix that.

In [6]:
for i, row in articles.iterrows():
    if 'https://www.newser.com' in row['url']:
        articles.loc[i, 'category'] = 'bad'
        
pretty_display(articles)

Example articles from huffpost:


Unnamed: 0,headline,description,category
0,"A Sad Bulldog, A Happy Prince And More Things To Make You Smile This Weekend","A sad bulldog and a happy, paint-covered prince top our list of distractions.",good
1,John Krasinski Shocks 9-Year-Old 'Hamilton' Fan In New 'Good News',"""The Office"" star struck gold again in his YouTube antidote to the coronavirus pandemic.",good
2,I Was Struggling As A Single Mom. Then A Stranger's Kind Act Changed Everything.,"""With this gift, I was finally able to get out of the hole I had basically been in for two years since getting separated.""",good
3,Pink's Advice To Fans: 'Change The F**king World!',"“I care about decency and humanity and kindness. Kindness today is an act of rebellion,"" the singer said at the E! People's Choice Awards.",good
4,10 Books For Parents Who Want To Raise Kind Kids,"These parenting books emphasize emotional intelligence, empathy and respect for others.",good


Example articles from goodnewsnetwork:


Unnamed: 0,headline,description,category
861,82 Year-old Grandpa Dances With a Broom Because He Isn’t Going to Let COVID-19 Ruin This Senior’s Fun,"A Michigan grandpa, Frank Cicorelli, is seen dancing with a broom during the COVID-19 pandemic, because he’s not letting the virus spoil his fun.",good
862,Guy Fieri Giving $22 Million He Raised to Restaurant Workers; Plans Nacho Battle With Bill Murray to Give More,,good
863,"Hero Teen is Rewarded for Returning $135,000 He Found Next to ATM Machine",Albuquerque teen Jose Nuñez Romaniz found a huge bag of cash sitting next to an ATM and called police who honored his honesty along with local businesses.,good
864,"From Hong Kong to New Zealand, Hawaii and Montana, Officials Celebrate No New Cases of COVID-19",,good
865,Family-Owned Greenhouse Donates $1Million in Orchids to Healthcare Workers in 7 Hardest Hit U.S. Cities,Green Circle Growers of Ohio have donated $1 million in orchids to frontline workers in seven cities hardest hit by the COVID-19 pandemic.,good


Example articles from newser:


Unnamed: 0,headline,description,category
5849,Autopsy: 13 Shotgun Pellets Exited Arbery&#39;s Back,"\r\n 13 shotgun pellets exited his back, 11 remained embedded in his chest, medical examiner finds\r\n",bad
5850,Woman Spat on at Work Dies From COVID-19,\r\n Man told British railway worker he had the virus\r\n,bad
5851,Doctors Without Borders Offering Rare Help in US,\r\n Navajo Nation in Southwest\r\n,bad
5852,"At Afghan Maternity Clinic, a &#39;Particularly Despicable&#39; Act","\r\n Armed militants storm hospital in Dashti Barchi, Afghanistan; at least 15 injured\r\n",bad
5853,"Crash Kills 5 Family Members, Including Pregnant Mom and Her 2 Babies","\r\n Their pregnant mom, and a teen also among victims\r\n",bad
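
The same relabeling can also be written without a row loop. This is only a sketch on a toy frame (the two rows are invented), using a boolean mask with `.loc`:

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    'url': [
        'https://www.newser.com/story/1',
        'https://www.huffpost.com/entry/2',
    ],
    'category': ['good', 'good'],
})

# True for every row whose URL contains the newser address.
is_newser = toy['url'].str.contains('https://www.newser.com', regex=False, na=False)

# Overwrite the category for those rows only.
toy.loc[is_newser, 'category'] = 'bad'
print(toy['category'].tolist())
```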


## Cleaning up Headlines and Descriptions

There are some NaN values in the description column, so let's define a helper to make our lives easier (and the code cleaner!). This helper applies a given function to every non-NaN entry in the headline and description columns.

In [7]:
def apply(articles, function):
    '''Helper function to apply a given function to the articles dataframe, ignoring NaN values.'''
    # Only call the function on strings; NaN (a float) passes through unchanged.
    decorated_fn = lambda entry: function(entry) if isinstance(entry, str) else entry

    articles['headline'] = articles['headline'].apply(decorated_fn)
    articles['description'] = articles['description'].apply(decorated_fn)
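
To see why the string check matters, here is a small sketch on a toy frame (invented values). The helper is restated so the block runs on its own. Without the check, calling `str.strip` on a NaN entry would raise a `TypeError`, because NaN is a float:

```python
import numpy as np
import pandas as pd

def apply(articles, function):
    '''Copy of the helper: apply function to string entries only.'''
    decorated_fn = lambda entry: function(entry) if isinstance(entry, str) else entry
    articles['headline'] = articles['headline'].apply(decorated_fn)
    articles['description'] = articles['description'].apply(decorated_fn)

# Toy data, invented for illustration: one description is missing.
toy = pd.DataFrame({
    'headline': ['  Hello ', 'World'],
    'description': ['\r\n text \r\n', np.nan],
})

apply(toy, str.strip)

print(toy['headline'].tolist())            # strings are stripped
print(toy['description'].isna().tolist())  # the missing entry is still missing
```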

The descriptions of the newser articles are padded with whitespace ("\r\n" and spaces), so let's strip that away.

In [8]:
apply(articles, str.strip)
pretty_display(articles)

Example articles from huffpost:


Unnamed: 0,headline,description,category
0,"A Sad Bulldog, A Happy Prince And More Things To Make You Smile This Weekend","A sad bulldog and a happy, paint-covered prince top our list of distractions.",good
1,John Krasinski Shocks 9-Year-Old 'Hamilton' Fan In New 'Good News',"""The Office"" star struck gold again in his YouTube antidote to the coronavirus pandemic.",good
2,I Was Struggling As A Single Mom. Then A Stranger's Kind Act Changed Everything.,"""With this gift, I was finally able to get out of the hole I had basically been in for two years since getting separated.""",good
3,Pink's Advice To Fans: 'Change The F**king World!',"“I care about decency and humanity and kindness. Kindness today is an act of rebellion,"" the singer said at the E! People's Choice Awards.",good
4,10 Books For Parents Who Want To Raise Kind Kids,"These parenting books emphasize emotional intelligence, empathy and respect for others.",good


Example articles from goodnewsnetwork:


Unnamed: 0,headline,description,category
861,82 Year-old Grandpa Dances With a Broom Because He Isn’t Going to Let COVID-19 Ruin This Senior’s Fun,"A Michigan grandpa, Frank Cicorelli, is seen dancing with a broom during the COVID-19 pandemic, because he’s not letting the virus spoil his fun.",good
862,Guy Fieri Giving $22 Million He Raised to Restaurant Workers; Plans Nacho Battle With Bill Murray to Give More,,good
863,"Hero Teen is Rewarded for Returning $135,000 He Found Next to ATM Machine",Albuquerque teen Jose Nuñez Romaniz found a huge bag of cash sitting next to an ATM and called police who honored his honesty along with local businesses.,good
864,"From Hong Kong to New Zealand, Hawaii and Montana, Officials Celebrate No New Cases of COVID-19",,good
865,Family-Owned Greenhouse Donates $1Million in Orchids to Healthcare Workers in 7 Hardest Hit U.S. Cities,Green Circle Growers of Ohio have donated $1 million in orchids to frontline workers in seven cities hardest hit by the COVID-19 pandemic.,good


Example articles from newser:


Unnamed: 0,headline,description,category
5849,Autopsy: 13 Shotgun Pellets Exited Arbery&#39;s Back,"13 shotgun pellets exited his back, 11 remained embedded in his chest, medical examiner finds",bad
5850,Woman Spat on at Work Dies From COVID-19,Man told British railway worker he had the virus,bad
5851,Doctors Without Borders Offering Rare Help in US,Navajo Nation in Southwest,bad
5852,"At Afghan Maternity Clinic, a &#39;Particularly Despicable&#39; Act","Armed militants storm hospital in Dashti Barchi, Afghanistan; at least 15 injured",bad
5853,"Crash Kills 5 Family Members, Including Pregnant Mom and Her 2 Babies","Their pregnant mom, and a teen also among victims",bad


Nice! Now it looks like the headlines (and possibly the descriptions too) contain HTML character references such as `&#39;`. Let's decode those back into plain characters.

In [9]:
import html

apply(articles, html.unescape)

pretty_display(articles)

Example articles from huffpost:


Unnamed: 0,headline,description,category
0,"A Sad Bulldog, A Happy Prince And More Things To Make You Smile This Weekend","A sad bulldog and a happy, paint-covered prince top our list of distractions.",good
1,John Krasinski Shocks 9-Year-Old 'Hamilton' Fan In New 'Good News',"""The Office"" star struck gold again in his YouTube antidote to the coronavirus pandemic.",good
2,I Was Struggling As A Single Mom. Then A Stranger's Kind Act Changed Everything.,"""With this gift, I was finally able to get out of the hole I had basically been in for two years since getting separated.""",good
3,Pink's Advice To Fans: 'Change The F**king World!',"“I care about decency and humanity and kindness. Kindness today is an act of rebellion,"" the singer said at the E! People's Choice Awards.",good
4,10 Books For Parents Who Want To Raise Kind Kids,"These parenting books emphasize emotional intelligence, empathy and respect for others.",good


Example articles from goodnewsnetwork:


Unnamed: 0,headline,description,category
861,82 Year-old Grandpa Dances With a Broom Because He Isn’t Going to Let COVID-19 Ruin This Senior’s Fun,"A Michigan grandpa, Frank Cicorelli, is seen dancing with a broom during the COVID-19 pandemic, because he’s not letting the virus spoil his fun.",good
862,Guy Fieri Giving $22 Million He Raised to Restaurant Workers; Plans Nacho Battle With Bill Murray to Give More,,good
863,"Hero Teen is Rewarded for Returning $135,000 He Found Next to ATM Machine",Albuquerque teen Jose Nuñez Romaniz found a huge bag of cash sitting next to an ATM and called police who honored his honesty along with local businesses.,good
864,"From Hong Kong to New Zealand, Hawaii and Montana, Officials Celebrate No New Cases of COVID-19",,good
865,Family-Owned Greenhouse Donates $1Million in Orchids to Healthcare Workers in 7 Hardest Hit U.S. Cities,Green Circle Growers of Ohio have donated $1 million in orchids to frontline workers in seven cities hardest hit by the COVID-19 pandemic.,good


Example articles from newser:


Unnamed: 0,headline,description,category
5849,Autopsy: 13 Shotgun Pellets Exited Arbery's Back,"13 shotgun pellets exited his back, 11 remained embedded in his chest, medical examiner finds",bad
5850,Woman Spat on at Work Dies From COVID-19,Man told British railway worker he had the virus,bad
5851,Doctors Without Borders Offering Rare Help in US,Navajo Nation in Southwest,bad
5852,"At Afghan Maternity Clinic, a 'Particularly Despicable' Act","Armed militants storm hospital in Dashti Barchi, Afghanistan; at least 15 injured",bad
5853,"Crash Kills 5 Family Members, Including Pregnant Mom and Her 2 Babies","Their pregnant mom, and a teen also among victims",bad
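
For reference, this is what `html.unescape` does to a single string. The first input is one of the newser headlines shown earlier; the second is a made-up example of a named entity:

```python
import html

raw = 'At Afghan Maternity Clinic, a &#39;Particularly Despicable&#39; Act'
clean = html.unescape(raw)
print(clean)  # At Afghan Maternity Clinic, a 'Particularly Despicable' Act

# Named entities are decoded as well as numeric ones.
named = html.unescape('Q&amp;A')
print(named)  # Q&A
```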


Great! The data is clean, so let's save it and move on.

In [10]:
articles.to_csv('./data/cleaned_articles.csv', index = False)
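
Since only a few rows per source were inspected by eye, a quick programmatic check can back up the claim that the data is clean. The sketch below is one possible approach, not part of the original notebook: `count_unclean` is a hypothetical helper that counts entries which would still change under `str.strip` or `html.unescape`. It is shown on toy data invented for illustration:

```python
import html
import pandas as pd

def count_unclean(frame, columns=('headline', 'description')):
    '''Hypothetical helper: count entries that still change under strip or unescape.'''
    counts = {}
    for column in columns:
        text = frame[column].dropna().astype(str)  # skip missing entries
        needs_strip = (text != text.str.strip()).sum()
        needs_unescape = (text != text.map(html.unescape)).sum()
        counts[column] = {
            'whitespace': int(needs_strip),
            'entities': int(needs_unescape),
        }
    return counts

# Toy data, invented for illustration.
toy = pd.DataFrame({
    'headline': ['Clean headline', 'Arbery&#39;s Back'],
    'description': [' padded ', None],
})

result = count_unclean(toy)
print(result)
```

On a fully cleaned frame, every count should be zero.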