In [1]:
# Import modules

import numpy as np
import pandas as pd
import re
from scraper import Subreddit

# Set pandas to display a maximum of 400 characters in the column
pd.options.display.max_colwidth = 400

---

# 01 - Web Scraping

We will use my custom class, `scraper.Subreddit`, to scrape posts from Reddit.  

The class contains methods to scrape a particular subreddit (each instance represents one subreddit), build DataFrames from the resulting JSON, and load previously saved JSON or CSV files directly.  

The scraping method will ignore any posts which have been deleted from the subreddit, as these are likely spam or irrelevant posts.  
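
As a rough illustration of this filtering step, the sketch below walks a Reddit-style listing payload and skips removed posts. It is not the `Subreddit.scrape` implementation: the helper name `extract_posts`, the sample payload, and the use of the `removed_by_category` field to detect removed posts are all assumptions made for the example.

```python
# Hypothetical helper, not the author's Subreddit.scrape implementation.
def extract_posts(listing):
    """Return subreddit/title/url dicts for posts that have not been removed."""
    posts = []
    for child in listing["data"]["children"]:
        post = child["data"]
        # Assumption: a non-null `removed_by_category` marks a removed post.
        if post.get("removed_by_category") is not None:
            continue
        posts.append({
            "subreddit": post["subreddit"],
            "title": post["title"],
            "url": post["url"],
        })
    return posts


# Made-up payload shaped like a Reddit listing response.
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": None,
        "children": [
            {"kind": "t3", "data": {"subreddit": "worldnews",
                                    "title": "Kept post",
                                    "url": "https://example.com/a",
                                    "removed_by_category": None}},
            {"kind": "t3", "data": {"subreddit": "worldnews",
                                    "title": "Removed post",
                                    "url": "https://example.com/b",
                                    "removed_by_category": "moderator"}},
        ],
    },
}

posts = extract_posts(sample_listing)
```

Keeping the parsing separate from the HTTP request makes the filter easy to check offline against a saved JSON file.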

It also includes methods for retrieving each news article's text from its URL using the `newspaper3k` Python package, which can be installed by following the instructions in its [documentation](https://newspaper.readthedocs.io/en/latest/).  


Further, there are methods to remove duplicate posts, and sort the posts by `article_text` length, which will be used to explore and clean the gathered data.

## Scraping Subreddit 01 - World News

First, we will create an instance of the class for the `worldnews` subreddit and scrape at least 1200 of the most recent posts.  

Next, we will use `newspaper3k` to attempt to pull the article text from each post.

As the posts have already been scraped, the scraping code has been commented out.  

The DataFrame will be created from the saved .csv file instead.

In [2]:
# Instantiate an instance of the class for the 'worldnews' subreddit

world_news = Subreddit('worldnews')

In [3]:
# Scrape the most recent 1200 posts

# world_news.scrape(1200)

In [4]:
# Pull the article text using newspaper3k

# world_news.news_text_pull()

In [5]:
# Create a DataFrame including article text

world_news.df_from_csv('full')

## Scraping Subreddit 02 - TheOnion  

We create an instance for our second subreddit, `theonion`, and repeat the steps.  

Once again, the scraping code has been commented out, as the posts have already been scraped.

In [6]:
# Instantiate an instance of the class for the 'theonion' subreddit

theonion = Subreddit('theonion')

In [7]:
# Scrape the most recent 1200 posts

# theonion.scrape(1200)

In [8]:
# Pull the article text using newspaper3k

# theonion.news_text_pull()

In [9]:
# Create a DataFrame including article text

theonion.df_from_csv()

---

# 02 - Data cleaning


## Cleaning Subreddit 01 - World News

Now that we have created the DataFrames, we will check for duplicated and null values.  
These rows will be removed so that they do not affect our model.

We will use the `remove_duplicates` method from our custom class to remove duplicate entries from all DataFrames loaded for the instance. By default it compares rows on the `title` column.
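
A minimal sketch of title-based de-duplication is shown below. The function `drop_duplicate_rows` is a hypothetical stand-in and the toy data is made up; the real `remove_duplicates` method may behave differently, for example in which copy it keeps.

```python
import pandas as pd


# Hypothetical stand-in for Subreddit.remove_duplicates; the real method may differ.
def drop_duplicate_rows(df, column="title"):
    """Keep the first row for each distinct value in `column` and renumber the index."""
    return df.drop_duplicates(subset=column, keep="first").reset_index(drop=True)


# Toy data: the same story posted under two different URLs.
toy = pd.DataFrame({
    "title": ["Fire in Cape Town", "Fire in Cape Town", "Mask mandate ends"],
    "url": ["https://www.cnn.com/x", "https://edition.cnn.com/x", "https://nypost.com/y"],
})

deduped = drop_duplicate_rows(toy)
```

Comparing on `title` rather than `url` catches reposts whose links differ only by subdomain or tracking parameters.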

In [10]:
# Mask the dataframe to identify duplicate results

world_news.full_df[world_news.full_df.title.duplicated(keep=False)].sort_values(by='title').head(5)

Unnamed: 0,subreddit,title,url,article_text
14,worldnews,'Out of control' fire breaks out in Cape Town's Table Mountain National Park,https://www.cnn.com/2021/04/18/africa/table-mountain-south-africa-fire-intl-afr/index.html,"(CNN) An ""out of control"" fire has broken out in Cape Town's Table Mountain National Park on Sunday, according to South African officials, prompting the evacuation of hikers from the city's most famous landmark.\n\nMore than 120 firefighters are battling the massive blaze, and four helicopters have been deployed to help with the efforts, according to a media release shared by a Twitter account..."
106,worldnews,'Out of control' fire breaks out in Cape Town's Table Mountain National Park,https://edition.cnn.com/2021/04/18/africa/table-mountain-south-africa-fire-intl-afr/index.html,"(CNN) An ""out of control"" fire has broken out in Cape Town's Table Mountain National Park on Sunday, according to South African officials, prompting the evacuation of hikers from the city's most famous landmark.\n\nMore than 120 firefighters are battling the massive blaze, and four helicopters have been deployed to help with the efforts, according to a media release shared by a Twitter account..."
81,worldnews,'Out of control' fire breaks out in Cape Town's Table Mountain National Park,https://edition.cnn.com/2021/04/18/africa/table-mountain-south-africa-fire-intl-afr/index.html,"(CNN) An ""out of control"" fire has broken out in Cape Town's Table Mountain National Park on Sunday, according to South African officials, prompting the evacuation of hikers from the city's most famous landmark.\n\nMore than 120 firefighters are battling the massive blaze, and four helicopters have been deployed to help with the efforts, according to a media release shared by a Twitter account..."
873,worldnews,'You can't clone us': Polish doctors cry for help as COVID deaths spike,https://www.reuters.com/world/europe/you-cant-clone-us-polish-doctors-cry-help-covid-deaths-spike-2021-04-16/,"When the pandemic began last year, Kinga Szlachcic-Wyroba, an anaesthesiologist in the Stefan Zeromski Specialist Hospital in Krakow, Poland, had to manage one COVID-19 patient and 10 others in intensive care with three other doctors.\n\nNow the third wave has hit Poland and the number of COVID-19 patients in intensive care stands at 17, with just four non-COVID sufferers. Around 80% of the CO..."
730,worldnews,'You can't clone us': Polish doctors cry for help as COVID deaths spike,https://www.reuters.com/world/europe/you-cant-clone-us-polish-doctors-cry-help-covid-deaths-spike-2021-04-16/?utm_source=reddit.com,"When the pandemic began last year, Kinga Szlachcic-Wyroba, an anaesthesiologist in the Stefan Zeromski Specialist Hospital in Krakow, Poland, had to manage one COVID-19 patient and 10 others in intensive care with three other doctors.\n\nNow the third wave has hit Poland and the number of COVID-19 patients in intensive care stands at 17, with just four non-COVID sufferers. Around 80% of the CO..."


In [11]:
# Call the method to remove duplicate results by the title column.

world_news.remove_duplicates()

In [12]:
# Check for duplicate entries once again

world_news.full_df[world_news.full_df.title.duplicated(keep=False)].sort_values(by='title')

Unnamed: 0,subreddit,title,url,article_text


In [13]:
# Check for null values

world_news.full_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1127 entries, 0 to 1126
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   subreddit     1127 non-null   object
 1   title         1127 non-null   object
 2   url           1127 non-null   object
 3   article_text  1078 non-null   object
dtypes: object(4)
memory usage: 35.3+ KB


In [14]:
# Remove null values

world_news.full_df.dropna(inplace=True)

Next, we will check for errors in the article text pulled with `newspaper3k`, as we do not want erroneous data to affect our model.  

To do this, we created a method, `sort_length`, which adds a column `article_text_length` containing the number of characters in `article_text` and sorts the DataFrame by that length in ascending order.  

The method also adds a column `article_text_num_words` containing the number of words in `article_text`.

This is based on the assumption that failed or partial extractions tend to produce very short text, so erroneous entries should cluster at the top of the sorted DataFrame.
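
The sketch below shows one way such a method could be written with pandas string accessors. The function name `add_length_columns` is hypothetical, and counting words by splitting on whitespace is an assumption; only the resulting column names are taken from the notebook output.

```python
import pandas as pd


# Hypothetical stand-in for Subreddit.sort_length; word counting by whitespace is an assumption.
def add_length_columns(df, column="article_text"):
    """Add `<column>_num_words` and `<column>_length`, then sort shortest first."""
    out = df.copy()
    out[f"{column}_num_words"] = out[column].str.split().str.len()
    out[f"{column}_length"] = out[column].str.len()
    return out.sort_values(by=f"{column}_length")


toy = pd.DataFrame({
    "article_text": ["a much longer piece of article text", "short text"],
})

sized = add_length_columns(toy)
```

Passing `column="title"` would produce `title_num_words` and `title_length` in the same way.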

In [15]:
# Call the method to sort the dataframe by article_text length

world_news.sort_length()

In [16]:
# Examine the statistics of article_texts

world_news.full_df.article_text_length.describe()

count     1078.000000
mean      3585.028757
std       3064.539678
min         51.000000
25%       1687.250000
50%       2857.500000
75%       4617.000000
max      35888.000000
Name: article_text_length, dtype: float64

In [17]:
# Calculate the value one standard deviation below the mean, which we will use to identify "short" articles.

world_news.full_df.article_text_length.mean() - world_news.full_df.article_text_length.std()

520.489079359782

Given these statistics, we will treat any article whose length is more than one standard deviation below the mean as too short. This works out to any article with fewer than 520 characters.
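
The cutoff rule can be sketched on a toy series as follows. The lengths are made up for illustration; note that pandas' `Series.std` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Made-up character counts, for illustration only.
lengths = pd.Series([100, 2000, 3000, 4000, 9000], name="article_text_length")

# Cutoff: one (sample) standard deviation below the mean.
threshold = lengths.mean() - lengths.std()

# Boolean mask marking the articles that fall below the cutoff.
is_short = lengths < threshold
```

Here only the 100-character entry falls below the cutoff, so it is the only one flagged as short.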

In [18]:
# Check the entries with shorter article lengths

world_news.full_df.article_text[world_news.full_df.article_text_length < 520].head(50)

112                                                                                                                                                                                   Could not find the requested document in the cache.
419                                                                                                                                                                  Some of the birds that call Vedanthangal home. Photo: Arun and Shyam
192                                                                                                                                                              The Lamu seaport in Kenya will open for business in June. Photo: Twitter
331                                                                                                                                                     5,000 character limit. Use the arrows to translate more.\n\nDone\n\nPrevious Next
115                                                             

Looking at the 50 shortest article texts, we can see that many could not be extracted properly, and some appear to be photo captions.  

We also identified other news articles that may be duplicates.  

Hence we will drop rows with duplicated `article_text` and rows with fewer than 520 characters.


In [19]:
# Remove duplicates by article_text column

world_news.remove_duplicates('article_text')

In [20]:
# Remove short articles (keep only those with more than 520 characters)

world_news.full_df = world_news.full_df[world_news.full_df.article_text_length > 520]

We will also create two new columns for `title_length` and `title_num_words`.

In [21]:
# Call the method to create the columns

world_news.sort_length('title')

In [22]:
# Check the first 5 entries in the DataFrame

world_news.full_df.head()

Unnamed: 0,subreddit,title,url,article_text,article_text_num_words,article_text_length,title_num_words,title_length
125,worldnews,Please help against Russia,https://www.ruetir.com/2021/04/17/czech-republic-russia-involved-in-deadly-explosion-expel-diplomats/,"According to the Czech Republic, the Russian secret service GRU was involved in a deadly explosion in an ammunition depot in the Czech town of Vrbetice in 2014. The Czech security services have reached this conclusion, Prime Minister Andrej Babis and Minister of Foreign Affairs Jan Hamacek announced on Saturday.\n\nIn response to the investigation results, the country has decided to expel 18 R...",156,1007,4,26
90,worldnews,Arabs attack Jews in Ramla,https://www.israelnationalnews.com/News/News.aspx/304587,"A Jew was lightly injured and two minors were sprayed with pepper spray in a brawl with Arabs in Ramla on Sunday evening.\n\nThe man was taken to Assaf Harofeh Hospital for treatment.\n\nPolice are investigating the circumstances of the incident.\n\nElsewhere, hundreds of Arabs rioted in Jerusalem Sunday night, with disturbances reported around the Old City's Damascus Gate.\n\nArab rioters cla...",98,636,5,26
910,worldnews,A Gun Bill | No Police State,https://nopolicestate.net/2007/12/21/a-gun-bill-3/,"There was a headline in the news yesterday on the Yahoo homepage that read “Gun bill inspired by “A place where gun violence occurred”, and the article went on to say something about the mentally ill will be targeted and asked for documentation when they go to buy guns. Hugh? Is not the problem with guns that guns are sold everywhere to anyone at all, and not the people.\n\nAnd there was a sal...",1328,7770,7,28
736,worldnews,University of Cape Town on fire,https://www.news24.com/amp/news24/southafrica/news/uct-evacuates-students-as-runaway-veldfire-rages-on-20210418,A City of Cape Town firefighter sustained burn wounds while battling a blaze at Rhodes Memorial.\n\nTable Mountain National Park (TMNP) said an initial investigation showed a vagrant's fire may have sparked the blaze.\n\nUCT has evacuated its students and announced it was cancelling campus activity on Monday and Tuesday.\n\nA firefighter sustained burn wounds while battling the massive fire th...,750,4565,6,31
238,worldnews,Israel ends outdoor mask mandate,https://nypost.com/2021/04/18/israel-ends-outdoor-mask-mandate/,"Israel ended its outdoor mask mandate Sunday, now that about 80 percent of its adult population has received both doses of Pfizer’s COVID-19 vaccine, reports said.\n\n“Being without a mask for the first time in a long time feels weird. But it’s a very good weird,” Amitai Hallgarten, 19, told Reuters.\n\n“If I need to be masked indoors to finish with this, I’ll do everything I can.”\n\nAt the s...",277,1682,5,32


As we are trying to flag satirical news among real headlines, we will encode `worldnews` in the `subreddit` column as 0.

In [23]:
world_news.full_df.subreddit = 0
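
An equivalent encoding can also be written as an explicit mapping, which keeps the label definitions in one place. This is only a sketch: the `LABELS` dictionary and the toy DataFrame are illustrative and not part of the notebook.

```python
import pandas as pd

# Illustrative mapping: real news -> 0, satire -> 1.
LABELS = {"worldnews": 0, "TheOnion": 1}

toy = pd.DataFrame({"subreddit": ["worldnews", "TheOnion", "worldnews"]})

# Replace each subreddit name with its numeric label.
toy["subreddit"] = toy["subreddit"].map(LABELS)
```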

In [25]:
# Save the cleaned DataFrame

world_news.full_df.to_csv('../data/worldnews/worldnews_clean.csv', index=False)

## Cleaning Subreddit 02 - The Onion  

We will repeat the cleaning process for the second subreddit by removing duplicates, nulls, and rows whose `article_text_length` is more than one standard deviation below the mean.

In [26]:
# Mask the dataframe to identify duplicate results

theonion.full_df[theonion.full_df.title.duplicated(keep=False)].sort_values(by='title')

Unnamed: 0,subreddit,title,url,article_text
385,TheOnion,"A shocking new study that asked teen boys about their sexual habits reveals that they are all having sex all the time and are really, really good at having it.",https://youtu.be/q8NDCJY5DW4,
384,TheOnion,"A shocking new study that asked teen boys about their sexual habits reveals that they are all having sex all the time and are really, really good at having it.",https://youtu.be/q8NDCJY5DW4,
478,TheOnion,"After Obama Victory, Shrieking White-Hot Sphere Of Pure Rage Early GOP Front-Runner For 2016",https://www.youtube.com/watch?v=jjonGtrCyVE,
270,TheOnion,"After Obama Victory, Shrieking White-Hot Sphere Of Pure Rage Early GOP Front-Runner For 2016",https://www.youtube.com/watch?v=jjonGtrCyVE,
488,TheOnion,"After Obama Victory, Shrieking White-Hot Sphere Of Pure Rage Early GOP Front-Runner For 2016",https://www.youtube.com/watch?v=jjonGtrCyVE,
...,...,...,...,...
868,TheOnion,‘Damn You’ Shouts Contact Tracer Losing Track Of Coronavirus After It Catches Hold Of Helicopter’s Ladder,http://www.theonion.com/damn-you-shouts-contact-tracer-losing-track-of-corona-1844674700,"LOS ANGELES— Shaking his first from the roof of an office building, contact tracer Calvin Rosen reportedly shouted “Damn you” Friday after losing track of the coronavirus after it caught hold of a passing helicopter’s ladder. “Son of a bitch, I’ll track you down one day!” said Rosen, who threw his mask to the ground in anger and pulled out a walkie-talkie while watching the virus he thought he..."
872,TheOnion,‘Damn You’ Shouts Contact Tracer Losing Track Of Coronavirus After It Catches Hold Of Helicopter’s Ladder,https://theonion.com/1844674700,"LOS ANGELES— Shaking his first from the roof of an office building, contact tracer Calvin Rosen reportedly shouted “Damn you” Friday after losing track of the coronavirus after it caught hold of a passing helicopter’s ladder. “Son of a bitch, I’ll track you down one day!” said Rosen, who threw his mask to the ground in anger and pulled out a walkie-talkie while watching the virus he thought he..."
68,TheOnion,"‘No Way To Prevent This,’ Says Only Nation Where This Regularly Happens",https://www.theonion.com/no-way-to-prevent-this-says-only-nation-where-this-r-1846494525,"ATLANTA—In the hours following a violent rampage in Georgia in which a lone attacker killed eight individuals and injured one other, citizens living in the only country where this kind of mass killing routinely occurs reportedly concluded Wednesday that there was no way to prevent the massacre from taking place. “This was a terrible tragedy, but sometimes these things just happen and there’s n..."
673,TheOnion,"‘No Way To Prevent This,’ Says Only Nation Where This Regularly Happens",https://www.theonion.com/no-way-to-prevent-this-says-only-nation-where-this-r-1841942413,"MILWAUKEE—In the hours following a violent rampage in Wisconsin in which a lone attacker killed five individuals, including himself, citizens living in the only country where this kind of mass killing routinely occurs reportedly concluded Wednesday that there was no way to prevent the massacre from taking place. “This was a terrible tragedy, but sometimes these things just happen and there’s n..."


In [27]:
# Call the method to remove duplicate results by the title column.

theonion.remove_duplicates()

In [28]:
# Check for duplicate entries once again

theonion.full_df[theonion.full_df.title.duplicated(keep=False)].sort_values(by='title')

Unnamed: 0,subreddit,title,url,article_text


In [29]:
# Check for null values

theonion.full_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1123 entries, 0 to 1122
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   subreddit     1123 non-null   object
 1   title         1123 non-null   object
 2   url           1123 non-null   object
 3   article_text  863 non-null    object
dtypes: object(4)
memory usage: 35.2+ KB


In [30]:
# Drop the null values

theonion.full_df.dropna(inplace=True)

In [31]:
# Call the method to sort the dataframe by article_text length

theonion.sort_length()

In [32]:
# Examine the statistics of article_texts

theonion.full_df.article_text_length.describe()

count      863.000000
mean      1395.078795
std       1168.948631
min         24.000000
25%        933.500000
50%       1147.000000
75%       1386.500000
max      16413.000000
Name: article_text_length, dtype: float64

In [33]:
theonion.full_df.article_text_length.mean() - theonion.full_df.article_text_length.std()

226.13016430359335

In [34]:
# Check the entries with shorter article lengths

theonion.full_df.article_text[theonion.full_df.article_text_length < 226].head(50)

582                                                                                                                                                                                                           The Topical\n\nThe Topical
386                                                                                                                                                                                                           The Topical\n\nThe Topical
998                                                                                                                                                                                                           The Topical\n\nThe Topical
978                                                                                                                                                                                 Here’s a list of organizations where you can donate.
1119                                                                

This time we did not identify further duplicates, but we can see that many `article_text` entries were not extracted properly.  

Hence we will drop rows with fewer than 226 characters in the `article_text` column.

In [35]:
# Remove short articles (keep only those with more than 226 characters)

theonion.full_df = theonion.full_df[theonion.full_df.article_text_length > 226]

We will also create two new columns for `title_length` and `title_num_words`.

In [36]:
# Call the method to create the columns

theonion.sort_length('title')

In [37]:
# Check the first 5 entries of the DataFrame.

theonion.full_df.head()

Unnamed: 0,subreddit,title,url,article_text,article_text_num_words,article_text_length,title_num_words,title_length
981,TheOnion,Ask A Bee,https://www.theonion.com/ask-a-bee-1819583411,"Worker Bee #7438-F87904\n\nAdvertisement\n\nDear Worker Bee #7438-F87904,\n\nMy husband and I split last year after 11 years of marriage. We’re still good friends, though, and we even go out for coffee once a week. Problem is, lately, he’s been seeing a new person, someone I feel is definitely not right for him. Should I say anything? I’m not jealous—I know I wasn’t right for him, either. What...",503,3661,3,9
152,TheOnion,Seems sensible,https://www.theonion.com/woman-quick-to-clarify-that-child-in-dating-profile-pic-1846174572,"SKOKIE, IL—In an effort to ensure that potential mates wouldn’t get the wrong idea, local woman Karen Dugas told reporters Monday she was always quick to clarify that the child in her dating profile picture wasn’t actually alive anymore. “It’s such a cute snapshot of the two of us at her second birthday party that I wanted to share it, but I always make sure people know I’m totally unattached ...",191,1043,2,14
977,TheOnion,Best Recipe Blogs,https://entertainment.theonion.com/best-recipe-blogs-1844216522,"Blogs for recipes have exploded in popularity over the past several years, with everyone from professional chefs to self-trained cooks providing recipes, inspiration, and tips to the foodie community. The Onion takes a look at the best recipe blogs on the internet.\n\n\n\nAdvertisement\n\nSerious Eats: This soberly written publication, which eschews the frivolity of pictures, is not for those ...",238,1566,3,17
506,TheOnion,This aged very well,https://local.theonion.com/ghost-of-christmas-future-taunts-children-with-visions-1819566694,"SOUTHFIELD, MI—Bored with scaring elderly misers, the Ghost of Christmas Future is spending the holiday season taunting modern children with visions of Christmas 2016's hottest toy: the Sony PlayStation 5, a 2,048-bit console featuring a 45-Ghz trinary processor, CineReal graphics booster with 2-gig biotexturing, and an RSP connector for 360-degree online-immersion play.\n\nThe Ghost of Christ...",845,5059,4,19
835,TheOnion,Bad News: Toad Died,https://ogn.theonion.com/bad-news-toad-died-1844673221,"Hello everyone, it pains us to do this, but we have some really bad news. Mario’s longtime sidekick Toad died last night surrounded by his friends and family after a long battle with pancreatic cancer.\n\n\n\nAdvertisement\n\nWe figured it would be best if you heard it from us first.\n\nObviously, this is a huge blow to the gaming community. Living in a world without Toad around is going to be...",97,547,4,19


As we are trying to flag satirical news among real headlines, we will encode `TheOnion` in the `subreddit` column as 1.

In [39]:
theonion.full_df.subreddit = 1

In [40]:
# Save the cleaned DataFrame

theonion.full_df.to_csv('../data/theonion/theonion_clean.csv', index=False)

Now that we have cleaned the DataFrames from both subreddits, we'll combine them into a single DataFrame and split it into train and test sets.

In [41]:
# Combine DataFrames
full_data = world_news.full_df.merge(theonion.full_df, how='outer')

# Save the full data
full_data.to_csv('../data/full_data.csv', index=False)
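
Since both DataFrames share the same columns, the rows could also be stacked with `pd.concat`, which is sketched below on toy data. This is an alternative, not the approach used above. An outer merge may order rows differently from a plain concatenation, so swapping one for the other could change which rows land in the train and test sets for the same `random_state`.

```python
import pandas as pd

# Toy stand-ins for the two cleaned DataFrames.
real = pd.DataFrame({"subreddit": [0, 0], "title": ["b real", "a real"]})
satire = pd.DataFrame({"subreddit": [1], "title": ["satire"]})

# Stack the rows in their existing order and renumber the index.
combined = pd.concat([real, satire], ignore_index=True)
```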

In [42]:
# Shuffle the dataset before splitting; random_state makes the shuffle reproducible
shuffle_df = full_data.sample(frac=1, random_state = 7)

# Define a size for the train set 
train_size = int(0.7 * len(full_data))

# Split the shuffled dataset into train and test sets
train = shuffle_df[:train_size]
test = shuffle_df[train_size:]
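
The same shuffle-and-slice logic can be wrapped in a small function, which makes it easy to check the class balance of the resulting sets. The sketch below uses a hypothetical helper `shuffle_split` and made-up data. A plain shuffle does not guarantee that both sets keep the original proportion of satire to real news, so inspecting `value_counts` afterwards is a cheap sanity check.

```python
import pandas as pd


# Hypothetical helper mirroring the shuffle-and-slice split used in this notebook.
def shuffle_split(df, train_frac=0.7, seed=7):
    """Shuffle all rows reproducibly, then slice into train and test parts."""
    shuffled = df.sample(frac=1, random_state=seed)
    cut = int(train_frac * len(shuffled))
    return shuffled.iloc[:cut], shuffled.iloc[cut:]


# Made-up data: 6 real posts (0) and 4 satire posts (1).
toy = pd.DataFrame({
    "subreddit": [0] * 6 + [1] * 4,
    "title": [f"post {i}" for i in range(10)],
})

train, test = shuffle_split(toy)

# Share of each class in the train set.
train_balance = train["subreddit"].value_counts(normalize=True)
```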

In [43]:
# Save the split datasets
train.to_csv('../data/train.csv', index=False)
test.to_csv('../data/test.csv', index=False)