# Deep Learning for Classifying News Articles by Publication Using Kaggle Data

The data is available at https://www.kaggle.com/snapcrack/all-the-news (courtesy of Andrew Thompson).

------------

**We will build a text classifier with Keras and 1-D convolutional neural networks for a Kaggle dataset of news articles, which records each article's title, author, publication, and content. The dataset covers several publications, but we focus on the five with the most articles (all US publishers) and use them as labels, with the title and content as features. The labels are described below.**

    Label	Description
    0	    Breitbart
    1	    NPR (National Public Radio)
    2	    New York Post
    3	    Reuters
    4	    Washington Post

## The Data

**The code below mounts Google Drive so the notebook can access the dataset.**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Importing some libraries

In [2]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

## For Cleaning and Decoding
import re
import html
import unicodedata
import string
import glob

from tqdm import tqdm

In [3]:
## For Preparing Features to Model
import nltk

from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


## For Bonus Section
try:
    from gensim.models import word2vec
except ImportError:
    !pip install gensim
    from gensim.models import word2vec

## For Data Exploration
try:
    from wordcloud import WordCloud
except ImportError:
    !pip install wordcloud
    from wordcloud import WordCloud

In [4]:
from textblob import Word
from string import punctuation
from gensim.parsing.preprocessing import STOPWORDS

In [5]:
nltk.download("stopwords")
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


**Point the path below at the Google Drive folder where you stored the files, or place the files directly in your working directory.**

In [15]:
articles_loc = glob.glob("./drive/MyDrive/UofT/Deep NN Projects/Keras-ArticlePublishersData-1D-CNN-MC/articles/*.csv")

In [16]:
articles_loc

['./drive/MyDrive/UofT/Deep NN Projects/Keras-ArticlePublishersData-1D-CNN-MC/articles/articles2.csv',
 './drive/MyDrive/UofT/Deep NN Projects/Keras-ArticlePublishersData-1D-CNN-MC/articles/articles1.csv',
 './drive/MyDrive/UofT/Deep NN Projects/Keras-ArticlePublishersData-1D-CNN-MC/articles/articles3.csv']
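The listing above comes back with `articles2.csv` first, because `glob.glob` makes no promise about ordering. If a stable order matters, wrapping the call in `sorted()` is one option. The sketch below uses a temporary directory with empty placeholder files (hypothetical stand-ins for the real CSVs) to show the effect.

```python
import glob
import os
import tempfile

# Hypothetical stand-in for the articles directory: three empty CSV files.
demo_dir = tempfile.mkdtemp()
for name in ["articles2.csv", "articles1.csv", "articles3.csv"]:
    open(os.path.join(demo_dir, name), "w").close()

# Sorting the glob result gives a deterministic, lexicographic order.
demo_paths = sorted(glob.glob(os.path.join(demo_dir, "*.csv")))
demo_names = [os.path.basename(path) for path in demo_paths]
print(demo_names)
```

With sorted paths, index 0 always refers to `articles1.csv`, regardless of the file system.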

In [17]:
df1 = pd.read_csv(articles_loc[0])
df1.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,53293,73471,Patriots Day Is Best When It Digs Past the Her...,Atlantic,David Sims,2017-01-11,2017.0,1.0,,"Patriots Day, Peter Berg’s new thriller that r..."
1,53294,73472,A Break in the Search for the Origin of Comple...,Atlantic,Ed Yong,2017-01-11,2017.0,1.0,,"In Norse mythology, humans and our world were ..."
2,53295,73474,Obama’s Ingenious Mention of Atticus Finch,Atlantic,Spencer Kornhaber,2017-01-11,2017.0,1.0,,“If our democracy is to work in this increasin...
3,53296,73475,"Donald Trump Meets, and Assails, the Press",Atlantic,David A. Graham,2017-01-11,2017.0,1.0,,Updated on January 11 at 5:05 p. m. In his fir...
4,53297,73476,Trump: ’I Think’ Hacking Was Russian,Atlantic,Kaveh Waddell,2017-01-11,2017.0,1.0,,Updated at 12:25 p. m. After months of equivoc...


In [18]:
df2 = pd.read_csv(articles_loc[1])
df2.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [19]:
df3 = pd.read_csv(articles_loc[2])
df3.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,103459,151908,Alton Sterling’s son: ’Everyone needs to prote...,Guardian,Jessica Glenza,2016-07-13,2016.0,7.0,https://www.theguardian.com/us-news/2016/jul/1...,The son of a Louisiana man whose father was sh...
1,103460,151909,Shakespeare’s first four folios sell at auctio...,Guardian,,2016-05-25,2016.0,5.0,https://www.theguardian.com/culture/2016/may/2...,Copies of William Shakespeare’s first four boo...
2,103461,151910,My grandmother’s death saved me from a life of...,Guardian,Robert Pendry,2016-10-31,2016.0,10.0,https://www.theguardian.com/commentisfree/2016...,"Debt: $20, 000, Source: College, credit cards,..."
3,103462,151911,I feared my life lacked meaning. Cancer pushed...,Guardian,Bradford Frost,2016-11-26,2016.0,11.0,https://www.theguardian.com/commentisfree/2016...,"It was late. I was drunk, nearing my 35th birt..."
4,103463,151912,Texas man serving life sentence innocent of do...,Guardian,,2016-08-20,2016.0,8.0,https://www.theguardian.com/us-news/2016/aug/2...,A central Texas man serving a life sentence fo...


In [20]:
df1.columns==df2.columns, df2.columns==df3.columns #Checking all columns are same

(array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True]),
 array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True]))
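The element-wise comparison above returns one boolean per column. When a single yes/no answer is enough, `Index.equals` is a compact alternative. The sketch below uses small toy frames (hypothetical data, not the article files).

```python
import pandas as pd

# Toy frames standing in for df1, df2, df3 (hypothetical data).
a = pd.DataFrame({"id": [1], "title": ["x"]})
b = pd.DataFrame({"id": [2], "title": ["y"]})
c = pd.DataFrame({"id": [3], "headline": ["z"]})

same_ab = a.columns.equals(b.columns)  # same labels in the same order
same_ac = a.columns.equals(c.columns)  # 'title' vs 'headline' differ
print(same_ab, same_ac)
```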

In [21]:
# Merging to a single dataset
df = pd.concat([df1, df2, df3], axis=0)
df.reset_index(inplace=True, drop=True)

In [22]:
# Dropping columns not required
df.drop(['Unnamed: 0', 'date', 'year', 'month', 'url'], axis=1, inplace=True)

## Data Cleaning

In [23]:
# Checking for null values
df.isnull().sum()

id                 0
title              2
publication        0
author         15876
content            0
dtype: int64

In [24]:
# Dropping rows with any null value (this also drops the 15,876 rows with no author)
df.dropna(inplace=True)

In [25]:
df.isnull().sum()

id             0
title          0
publication    0
author         0
content        0
dtype: int64
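Because `dropna()` without arguments removes a row when any column is null, most of the rows dropped above are articles that only lack an author. If those articles should be kept, one option is to restrict the check with `subset`. The sketch below shows the difference on a toy frame (hypothetical data); it is an alternative, not what this notebook does.

```python
import pandas as pd

# Toy frame: one row is complete, one lacks a title, two lack an author.
toy = pd.DataFrame({
    "title": ["A", None, "C", "D"],
    "author": ["x", "y", None, None],
    "content": ["aa", "bb", "cc", "dd"],
})

strict = toy.dropna()                               # drops any row with a null
lenient = toy.dropna(subset=["title", "content"])   # ignores missing authors
print(len(strict), len(lenient))
```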

In [26]:
df.head()

Unnamed: 0,id,title,publication,author,content
0,73471,Patriots Day Is Best When It Digs Past the Her...,Atlantic,David Sims,"Patriots Day, Peter Berg’s new thriller that r..."
1,73472,A Break in the Search for the Origin of Comple...,Atlantic,Ed Yong,"In Norse mythology, humans and our world were ..."
2,73474,Obama’s Ingenious Mention of Atticus Finch,Atlantic,Spencer Kornhaber,“If our democracy is to work in this increasin...
3,73475,"Donald Trump Meets, and Assails, the Press",Atlantic,David A. Graham,Updated on January 11 at 5:05 p. m. In his fir...
4,73476,Trump: ’I Think’ Hacking Was Russian,Atlantic,Kaveh Waddell,Updated at 12:25 p. m. After months of equivoc...


In [27]:
df.tail()

Unnamed: 0,id,title,publication,author,content
142565,218078,An eavesdropping Uber driver saved his 16-year...,Washington Post,Avi Selk,Uber driver Keith Avila picked up a p...
142566,218079,Plane carrying six people returning from a Cav...,Washington Post,Sarah Larimer,Crews on Friday continued to search L...
142567,218080,After helping a fraction of homeowners expecte...,Washington Post,Renae Merle,When the Obama administration announced a...
142568,218081,"Yes, this is real: Michigan just banned bannin...",Washington Post,Chelsea Harvey,This story has been updated. A new law in...
142569,218082,What happened in Washington state after voters...,Washington Post,Christopher Ingraham,The nation’s first recreational marijuana...


In [28]:
# Getting count of articles for each publication
df['publication'].value_counts()

Breitbart              23781
New York Post          17485
NPR                    11654
Washington Post        11077
Reuters                10709
New York Times          7767
Guardian                7250
CNN                     7025
National Review         6203
Atlantic                6199
Business Insider        4950
Vox                     4947
Buzzfeed News           4853
Talking Points Memo     1676
Fox News                1117
Name: publication, dtype: int64

In [29]:
# Identifying length of title
df['len_title'] = [len(x) for x in df['title']]

In [30]:
# Identifying length of content
df['len_content'] = [len(x) for x in df['content']]
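The two length columns can also be computed with the vectorized `.str.len()` accessor, which gives the same character counts as the list comprehensions. A minimal sketch on a toy frame (hypothetical data):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Short", "A longer title"],
    "content": ["abc", "abcdef"],
})

# Character counts per row, equivalent to [len(x) for x in ...].
toy["len_title"] = toy["title"].str.len()
toy["len_content"] = toy["content"].str.len()
print(toy[["len_title", "len_content"]])
```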

In [31]:
df.head()

Unnamed: 0,id,title,publication,author,content,len_title,len_content
0,73471,Patriots Day Is Best When It Digs Past the Her...,Atlantic,David Sims,"Patriots Day, Peter Berg’s new thriller that r...",50,5357
1,73472,A Break in the Search for the Origin of Comple...,Atlantic,Ed Yong,"In Norse mythology, humans and our world were ...",52,8015
2,73474,Obama’s Ingenious Mention of Atticus Finch,Atlantic,Spencer Kornhaber,“If our democracy is to work in this increasin...,42,5096
3,73475,"Donald Trump Meets, and Assails, the Press",Atlantic,David A. Graham,Updated on January 11 at 5:05 p. m. In his fir...,42,5773
4,73476,Trump: ’I Think’ Hacking Was Russian,Atlantic,Kaveh Waddell,Updated at 12:25 p. m. After months of equivoc...,36,2445


### Filtering for Top-5 Publications

In [32]:
top_publications = df.publication.value_counts()[:5] 
sum(top_publications.values)

74706

In [33]:
top_publications # Printing the count and names of publications

Breitbart          23781
New York Post      17485
NPR                11654
Washington Post    11077
Reuters            10709
Name: publication, dtype: int64

In [34]:
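# Keep only rows from the top-5 publications, grouped in order of article count
# Note: DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat instead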
df_top = pd.DataFrame(columns=df.columns)
for i in top_publications.keys():
    df_top = df_top.append(df[df.publication == i], ignore_index=True)
df = df_top
del df_top
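On pandas 2.0 and later, where `DataFrame.append` no longer exists, the same filtering can be sketched with `pd.concat` or a boolean mask. The example below uses a toy frame and keeps the top 2 publications instead of 5 (both hypothetical choices for illustration).

```python
import pandas as pd

toy = pd.DataFrame({
    "publication": ["B", "A", "C", "A", "B", "A", "D"],
    "title": list("abcdefg"),
})
top = toy["publication"].value_counts().head(2)  # A (3 rows), B (2 rows)

# Same grouping as the loop above: one block per publication, most frequent first.
grouped = pd.concat(
    [toy[toy["publication"] == name] for name in top.index],
    ignore_index=True,
)

# If row order does not matter, a boolean mask is shorter.
masked = toy[toy["publication"].isin(top.index)].reset_index(drop=True)
print(grouped["publication"].tolist())
```

Unlike a loop seeded with an empty frame, both variants keep the original column dtypes, so `describe()` on the result may report numeric statistics for the length columns.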

In [35]:
df

Unnamed: 0,id,title,publication,author,content,len_title,len_content
0,26539,CNN’s Zeleny: ’Hard to Imagine’ Obama Would Ha...,Breitbart,Ian Hanchett,On Tuesday’s broadcast of CNN’s “Situation Roo...,109,1067
1,26540,American Students on Spring Break Chant ’Build...,Breitbart,Katherine Rodriguez,A group of American spring break revelers repo...,73,1514
2,26541,Surge in ’Honour Crimes’ and Forced Marriages ...,Breitbart,Liam Deacon,“honour crimes” have risen by 40 per cent in...,55,2195
3,26542,MILO Announces New Media Venture - Breitbart,Breitbart,Lucas Nolan,Former Breitbart Senior Editor MILO has announ...,44,2113
4,26543,Jared Kushner at Center of Media Spotlight on ...,Breitbart,Penny Starr,The focus of the continuous media reports of a...,88,3866
...,...,...,...,...,...,...,...
74701,195402,Speculators raise net long dollar bets in fina...,Reuters,Dion Rabouin,Net long bets on the dollar fell last week for...,60,4670
74702,195403,Wall St. thinks stocks will rise in 2017 - Wha...,Reuters,Caroline Valetkevitch and Noel Randewich,Wall Street’s rally could be derailed by rene...,63,6011
74703,195404,Disney buying Netflix could be practical magic,Reuters,Jennifer Saba,(Reuters Breakingviews) Walt Disney may be ...,46,3243
74704,195410,Actors seek posthumous protections after big-s...,Reuters,Lisa Richwine and Jill Serjeant,(Story refiled to correct date in paragraph 1...,65,6482


### Filtering out articles with too little text to process

In [36]:
df = df[df['len_content']>150]

In [37]:
df = df[df['len_title']>5]

In [38]:
df.describe() # The filters removed only 158 rows (74,706 -> 74,548)

Unnamed: 0,id,title,publication,author,content,len_title,len_content
count,74548,74548,74548,74548,74548,74548,74548
unique,74548,74540,5,8982,74513,160,10913
top,131071,The best upcoming sample sales,Breitbart,Breitbart News,Caption Businessman Donald Trum...,63,2878
freq,1,3,23664,1483,3,2341,36


In [39]:
df['publication'].value_counts() # Final count of articles for publications

Breitbart          23664
New York Post      17481
NPR                11646
Washington Post    11048
Reuters            10709
Name: publication, dtype: int64

## Data Processing

In [40]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [41]:
def process_row(row):
    """Clean one article: strip e-mail-like tokens, some punctuation and digits, then normalize words."""
    # E-mail addresses, including ones whose "com" suffix was split off by whitespace
    row = re.sub(r'(\S+@\S+)(com|\s+com)', ' ', row)
    # Remaining tokens that contain "@" (usernames / handles)
    row = re.sub(r'(\S+@\S+)', ' ', row)
    # Typographic punctuation and digits, removed character by character.
    # Named drop_chars so it does not shadow the imported string.punctuation.
    drop_chars = '—“,”‘-’' + '0123456789'
    row = ''.join(ch for ch in row if ch not in drop_chars)
    # Lower case
    row = row.lower()
    # Stopwords (gensim's list)
    row = ' '.join(word for word in row.split() if word not in STOPWORDS)
    # Lemmatize (TextBlob), then stem (Snowball)
    row = " ".join(Word(word).lemmatize() for word in row.split())
    stemmer = SnowballStemmer(language='english')
    row = " ".join(stemmer.stem(word) for word in row.split())
    # Collapse runs of whitespace
    row = re.sub(r'\s+', ' ', row)
    # Drop tokens of one or two characters
    row = " ".join(word for word in row.split() if len(word) > 2)

    return row
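To see what the regex and character-filter steps do without installing gensim, TextBlob, or NLTK, here is a dependency-free sketch. `strip_noise` and the sample sentence are hypothetical; the function covers only a subset of `process_row` (no stopword removal, lemmatization, or stemming).

```python
import re

def strip_noise(text):
    """Subset of process_row: e-mail removal, character filter, lowercasing, short-token filter."""
    text = re.sub(r'\S+@\S+', ' ', text)            # drop e-mail-like tokens
    drop_chars = '—“,”‘-’0123456789'                # same characters as process_row
    text = ''.join(ch for ch in text if ch not in drop_chars)
    text = text.lower()
    return ' '.join(word for word in text.split() if len(word) > 2)

sample = "Contact tips@example.com — Obama’s 2017 speech, in “full”"
cleaned = strip_noise(sample)
print(cleaned)  # contact obamas speech full
```

Note that only the listed characters are removed, so other punctuation such as parentheses or periods stays attached to words.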

In [42]:
df.shape

(74548, 7)

In [43]:
df['content_clean'] = df['content'].apply(process_row) # Processing on Content

In [45]:
df

Unnamed: 0,id,title,publication,author,content,len_title,len_content,content_clean
0,26539,CNN’s Zeleny: ’Hard to Imagine’ Obama Would Ha...,Breitbart,Ian Hanchett,On Tuesday’s broadcast of CNN’s “Situation Roo...,109,1067,tuesday broadcast cnns situat room cnn senior ...
1,26540,American Students on Spring Break Chant ’Build...,Breitbart,Katherine Rodriguez,A group of American spring break revelers repo...,73,1514,group american spring break revel report chant...
2,26541,Surge in ’Honour Crimes’ and Forced Marriages ...,Breitbart,Liam Deacon,“honour crimes” have risen by 40 per cent in...,55,2195,honour crime risen cent year london number for...
3,26542,MILO Announces New Media Venture - Breitbart,Breitbart,Lucas Nolan,Former Breitbart Senior Editor MILO has announ...,44,2113,breitbart senior editor milo announc found new...
4,26543,Jared Kushner at Center of Media Spotlight on ...,Breitbart,Penny Starr,The focus of the continuous media reports of a...,88,3866,focus continu medium report alleg collus russi...
...,...,...,...,...,...,...,...,...
74701,195402,Speculators raise net long dollar bets in fina...,Reuters,Dion Rabouin,Net long bets on the dollar fell last week for...,60,4670,net long bet dollar fell week time octob rebou...
74702,195403,Wall St. thinks stocks will rise in 2017 - Wha...,Reuters,Caroline Valetkevitch and Noel Randewich,Wall Street’s rally could be derailed by rene...,63,6011,wall street ralli derail renew worri donald tr...
74703,195404,Disney buying Netflix could be practical magic,Reuters,Jennifer Saba,(Reuters Breakingviews) Walt Disney may be ...,46,3243,(reuter breakingviews) walt disney look bit ma...
74704,195410,Actors seek posthumous protections after big-s...,Reuters,Lisa Richwine and Jill Serjeant,(Story refiled to correct date in paragraph 1...,65,6482,(stori refil correct date paragraph lisa richw...


**Saving the cleaned dataset to Google Drive. Change the directory if needed.**

In [51]:
df.to_pickle("./drive/MyDrive/UofT/Deep NN Projects/Keras-ArticlePublishersData-1D-CNN-MC/dataset.pkl")

## Preparing Data for Modeling


In [52]:
from sklearn import preprocessing
from tensorflow.keras.utils import to_categorical

**Encoding the label - publications**

In [53]:
le = preprocessing.LabelEncoder()

In [54]:
df['publication_label'] = le.fit_transform(df.publication.values)
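`LabelEncoder` assigns integers in sorted order of the class names, not in order of article count. The plain-Python sketch below reproduces that ordering for the five publication names used here; `label_of` is a hypothetical helper name, and the sketch only imitates the ordering rather than calling scikit-learn.

```python
# The five publication names as they appear in the 'publication' column.
publications = ["Breitbart", "New York Post", "NPR", "Washington Post", "Reuters"]

# Sorted order decides the integer label ("NPR" sorts before "New York Post"
# because uppercase "P" precedes lowercase "e").
label_of = {name: idx for idx, name in enumerate(sorted(publications))}
for name, idx in label_of.items():
    print(idx, name)
```

In the notebook itself, `le.classes_` lists the fitted classes in label order, and `le.inverse_transform` maps integers back to names.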

**Setting Targets & Features**

In [56]:
X = df['content_clean'].values

In [57]:
y = df['publication_label'].values

In [58]:
y

array([0, 0, 0, ..., 3, 3, 3])

**Converting the labels to a one-hot encoded matrix**

In [59]:
to_categorical(y)

array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       ...,
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.]], dtype=float32)
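For intuition, the one-hot matrix above can be reproduced with plain NumPy by indexing an identity matrix with the integer labels. This is a sketch of the idea, not Keras's implementation; `y_demo` is a hypothetical label vector.

```python
import numpy as np

y_demo = np.array([0, 0, 3, 1, 4])   # hypothetical integer labels
num_classes = 5

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="float32")[y_demo]

# argmax along each row recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
print(one_hot)
```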

**Splitting for training and testing sets**

In [60]:
from sklearn.model_selection import train_test_split

In [61]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=42, stratify=y)
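The `stratify=y` argument keeps each publication's share roughly the same in the training and test sets. The pure-Python sketch below illustrates the idea on toy labels; `stratified_split` is a hypothetical helper and not scikit-learn's algorithm.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    """Split indices so that every class contributes about test_size of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

toy_labels = [0] * 50 + [1] * 30 + [2] * 20
train_idx, test_idx = stratified_split(toy_labels)
test_counts = Counter(toy_labels[i] for i in test_idx)
print(test_counts)
```

Each class gives up 30% of its rows to the test set, so the 50/30/20 class balance is the same on both sides of the split.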

In [62]:
y_train_cat = to_categorical(y_train)
y_test_cat = to_categorical(y_test)

**Storing the prepared data for modeling**

In [63]:
pd.DataFrame(X_train).to_pickle("./drive/MyDrive/UofT/Deep NN Projects/Keras-ArticlePublishersData-1D-CNN-MC/X_train.pkl")
pd.DataFrame(X_test).to_pickle("./drive/MyDrive/UofT/Deep NN Projects/Keras-ArticlePublishersData-1D-CNN-MC/X_test.pkl")

In [64]:
pd.DataFrame(y_train_cat).to_pickle("./drive/MyDrive/UofT/Deep NN Projects/Keras-ArticlePublishersData-1D-CNN-MC/y_train_cat.pkl")
pd.DataFrame(y_test_cat).to_pickle("./drive/MyDrive/UofT/Deep NN Projects/Keras-ArticlePublishersData-1D-CNN-MC/y_test_cat.pkl")
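The pickled files can be loaded back with `pd.read_pickle`. The sketch below round-trips a small one-hot array through a temporary file; the array and path are hypothetical stand-ins for `y_train_cat` and the Drive location.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small stand-in for y_train_cat: three one-hot rows over five classes.
y_demo_cat = np.eye(5, dtype="float32")[[0, 3, 1]]

demo_path = os.path.join(tempfile.mkdtemp(), "y_demo_cat.pkl")
pd.DataFrame(y_demo_cat).to_pickle(demo_path)

# Load the frame back and convert it to a NumPy array for Keras.
restored = pd.read_pickle(demo_path).to_numpy()
print(restored.shape)
```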

### This is the end of Part 1. Part 2 picks up here with feature encoding, followed by model building, fitting, and evaluation.