In [1]:
import feedparser
from bs4 import BeautifulSoup
import urllib.request as rqst
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

# Extracting news data using RSS feeds

## What is RSS?
RSS stands for “really simple syndication” or, depending on who you ask, “rich site summary.” At its heart, RSS refers to simple text files with necessary, updated information — news pieces, articles, that sort of thing. That stripped-down content gets plugged into a feed reader, an interface that quickly converts the RSS text files into a stream of the latest updates from around the web.
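
To make the "simple text file" idea concrete, here is a minimal sketch of what an RSS document looks like and how its items can be read with only the standard library. The feed content and the helper name list_items are made up for illustration; feedparser, used in the rest of this notebook, handles the same structure along with many real-world quirks.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal RSS 2.0 document (not taken from any real feed).
SAMPLE_RSS = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com/</link>
    <description>Latest stories</description>
    <item>
      <title>First story</title>
      <link>https://example.com/first</link>
      <pubDate>Tue, 03 Nov 2020 16:50:28 +0000</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second</link>
      <pubDate>Tue, 03 Nov 2020 17:09:24 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def list_items(rss_text):
    """Return one dict per <item>, keyed by the child tag names."""
    root = ET.fromstring(rss_text.encode("utf-8"))
    return [{child.tag: child.text for child in item} for item in root.iter("item")]


items = list_items(SAMPLE_RSS)
for item in items:
    print(item["title"], "->", item["link"])
```

Each item carries the same kind of metadata (title, link, publication date) that we collect from the real feeds.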

## Libraries used
- ### feedparser
Universal Feed Parser is a Python module for downloading and parsing syndicated feeds.
- ### urllib.request
The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.
- ### BeautifulSoup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

In this section, we are extracting metadata from four news websites' RSS feeds: **NY Times, FOX, CBC, BBC**

## 1. NY Times

Use **feedparser** to parse the RSS feed of the NY Times Arts section.

In [2]:
ny_feed = feedparser.parse('https://rss.nytimes.com/services/xml/rss/nyt/Arts.xml')#feedparsing rss
# ny_feed: dictionary containing metadata (title, date, link, author, ...)

Print the first entry to see what a parsed article looks like.

In [3]:
print(ny_feed['entries'][0])

{'title': '‘Song Exploder’ and the Inexhaustible Hustle of Hrishikesh Hirway', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'https://rss.nytimes.com/services/xml/rss/nyt/Arts.xml', 'value': '‘Song Exploder’ and the Inexhaustible Hustle of Hrishikesh Hirway'}, 'links': [{'rel': 'alternate', 'type': 'text/html', 'href': 'https://www.nytimes.com/2020/11/03/arts/hrishikesh-hirway-song-exploder.html'}, {'href': 'https://www.nytimes.com/2020/11/03/arts/hrishikesh-hirway-song-exploder.html', 'rel': 'standout', 'type': 'text/html'}], 'link': 'https://www.nytimes.com/2020/11/03/arts/hrishikesh-hirway-song-exploder.html', 'id': 'https://www.nytimes.com/2020/11/03/arts/hrishikesh-hirway-song-exploder.html', 'guidislink': False, 'summary': 'The creator of several podcasts and a new television series is a popular investigator of the creative process. But his most personal case remains unsolved.', 'summary_detail': {'type': 'text/html', 'language': None, 'base': 'https://rss.nyti

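For each article in the parsed results, save its title, date, link, author, and full text. To get the full text, we use **BeautifulSoup** to parse the HTML behind the article's link and select the tags whose class marks the main body (here css-158dogj). Class names like this one look auto-generated, so the selector may need updating if the site changes.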

In [4]:
#getting metadata
titles = []
dates = []
links = []
authors = []
texts = []
for article in ny_feed['entries']:
    titles.append(article['title'])
    dates.append(article['published'])
    links.append(article['link'])
    html = rqst.urlopen(article['link'])
    bs = BeautifulSoup(html, features='html.parser')
    targets = bs.select('.css-158dogj')
    text = ''
    for target in targets:
        text += target.text
        text += ' '
    texts.append(text)
    if 'author' in article:
        authors.append(article['author'])
    else:
        authors.append(None)
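
The same loop is repeated for each site below, so the metadata step could be factored into small helpers. This is only a sketch: entry_metadata and join_paragraphs are hypothetical names, and the sample entries are plain dicts with made-up values standing in for feedparser entries (which also support .get).

```python
def entry_metadata(entry):
    """Pull the fields used in this notebook out of one feed entry.

    `entry` only needs to behave like a dict. Missing or empty values
    become None so that every column has a consistent "no value" marker.
    """
    fields = ("title", "published", "link", "author")
    return {field: (entry.get(field) or None) for field in fields}


def join_paragraphs(paragraphs):
    """Join paragraph strings with single spaces, skipping blank ones."""
    return " ".join(p.strip() for p in paragraphs if p and p.strip())


# Toy entries standing in for ny_feed['entries'] (values are made up).
sample_entries = [
    {"title": "A", "published": "Tue, 03 Nov 2020 16:50:28 +0000",
     "link": "https://example.com/a", "author": "Jane Doe"},
    {"title": "B", "published": "Tue, 03 Nov 2020 17:09:24 +0000",
     "link": "https://example.com/b", "author": ""},
    {"title": "C", "link": "https://example.com/c"},
]
rows = [entry_metadata(e) for e in sample_entries]
print(rows[1]["author"], rows[2]["published"])
print(join_paragraphs(["First paragraph. ", "", "Second paragraph."]))
```

Treating an empty author string the same as a missing one is a design choice made for this sketch; the loop above keeps empty strings as they are.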

Build a pandas DataFrame from the lists collected above.

In [5]:
#create dataframe
ny_data={"title": titles, "date": dates,"link": links, "author": authors, "text": texts}
ny=pd.DataFrame(ny_data)

In [6]:
ny.head()

Unnamed: 0,title,date,link,author,text
0,‘Song Exploder’ and the Inexhaustible Hustle o...,"Tue, 03 Nov 2020 16:50:28 +0000",https://www.nytimes.com/2020/11/03/arts/hrishi...,Reggie Ugwu,Making something new is like climbing a mounta...
1,I’m a Chess Expert. Here’s What ‘The Queen’s G...,"Tue, 03 Nov 2020 17:09:24 +0000",https://www.nytimes.com/2020/11/03/arts/televi...,Dylan Loeb McClain,This article contains spoilers for “The Queen’...
2,When a Dance Collective Was Like a Rock Band,"Tue, 03 Nov 2020 13:54:35 +0000",https://www.nytimes.com/2020/11/03/arts/dance/...,Gia Kourlas,"In 1970, the Grand Union came into being, and ..."
3,A ‘Wicked’ Challenge and Other Tough Questions...,"Tue, 03 Nov 2020 17:41:32 +0000",https://www.nytimes.com/2020/11/03/theater/ben...,Ben Brantley,I’m 15 years old and here is my question: When...
4,Ian Bostridge on Schubert’s Hidden Depths,"Tue, 03 Nov 2020 13:00:08 +0000",https://www.nytimes.com/2020/11/03/arts/music/...,Ian Bostridge,I first got to know the 20 songs of Franz Schu...
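
The date column above holds plain strings in the usual RSS date format. If real timestamps are needed (for sorting or filtering), one option is the standard library's email.utils.parsedate_to_datetime, sketched here on three of the date strings shown in the table. The variable names are illustrative only.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

raw_dates = [
    "Tue, 03 Nov 2020 16:50:28 +0000",
    "Tue, 03 Nov 2020 17:09:24 +0000",
    "Tue, 03 Nov 2020 13:54:35 +0000",
]

# Each string becomes a timezone-aware datetime object.
parsed = [parsedate_to_datetime(d) for d in raw_dates]

# Aware datetimes compare correctly, so sorting gives chronological order.
newest_first = sorted(parsed, reverse=True)
print(newest_first[0].isoformat())
```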


Export results as a csv file.

In [7]:
ny.to_csv('./ny.csv', index=True)  # export to CSV; index=True also writes the row index as an unnamed first column

## 2. FOX

Use **feedparser** to parse the FOX latest-news RSS feed.

In [8]:
fox_feed = feedparser.parse('http://feeds.foxnews.com/foxnews/latest')#feedparsing rss

Print the first entry to see what a parsed article looks like.

In [9]:
print(fox_feed['entries'][0])

{'id': 'https://www.foxnews.com/sports/no-10-wisconsin-cancels-game-with-purdue-due-to-outbreak', 'guidislink': True, 'link': 'https://www.foxnews.com/sports/no-10-wisconsin-cancels-game-with-purdue-due-to-outbreak', 'links': [{'rel': 'alternate', 'type': 'text/html', 'href': 'https://www.foxnews.com/sports/no-10-wisconsin-cancels-game-with-purdue-due-to-outbreak'}], 'media_content': [{'url': 'https://static.foxnews.com/foxnews.com/content/uploads/2020/11/AP20298017028265.jpg', 'medium': 'image', 'isdefault': 'true'}, {'url': 'http://a57.foxnews.com60/60/AP20298017028265.jpg', 'medium': 'image', 'width': '60', 'height': '60'}], 'media_thumbnail': [{'url': 'http://a57.foxnews.com60/60/AP20298017028265.jpg', 'width': '60', 'height': '60'}], 'href': '', 'tags': [{'term': '058a6f4b-e018-5c86-a423-88463afda044', 'scheme': 'foxnews.com/metadata/dc.identifier', 'label': None}, {'term': 'fox-news/sports/ncaa', 'scheme': 'foxnews.com/taxonomy', 'label': None}, {'term': 'fox-news/sports/ncaa-fb'

For each article in the parsed results, save its title, date, link, author, and full text. To get the full text, we use **BeautifulSoup** to parse the HTML behind the article's link and collect the paragraph tags inside the first element with the class article-body.

In [10]:
#getting metadata
titles2 = []
dates2 = []
links2 = []
authors2 = []
texts2 = []
for article in fox_feed['entries']:
    titles2.append(article['title'])
    dates2.append(article['published'])
    links2.append(article['link'])
    html = rqst.urlopen(article['link'])
    bs = BeautifulSoup(html, features='html.parser')
    targets = bs.select('.article-body')[0].select('p')
    text = ''
    for target in targets:
        text += target.text
        text += ' '
    texts2.append(text)
    if 'author' in article:
        authors2.append(article['author'])
    else:
        authors2.append(None)

Build a pandas DataFrame from the lists collected above.

In [11]:
#create dataframe
fox_data={"title": titles2, "date": dates2,"link": links2, "author": authors2, "text": texts2}
fox=pd.DataFrame(fox_data)
fox.head()

Unnamed: 0,title,date,link,author,text
0,No. 10 Wisconsin cancels game with Purdue due ...,"Tue, 03 Nov 2020 19:33:36 GMT",https://www.foxnews.com/sports/no-10-wisconsin...,,Fox News Flash top headlines are here. Check o...
1,Election 2020: Live Coverage,"Tue, 03 Nov 2020 19:30:59 GMT",https://www.foxnews.com/politics/election-2020...,,
2,Beverly Hills spending $4.8M on 'election-prep...,"Tue, 03 Nov 2020 19:23:52 GMT",https://www.foxnews.com/us/beverly-hills-rodeo...,Danielle Wallace,"Roughly 2,100 voters in Los Angeles County rec..."
3,NFL to consider 16-team playoff structure for ...,"Tue, 03 Nov 2020 19:22:29 GMT",https://www.foxnews.com/sports/nfl-16-team-pla...,Paulina Dedaj,Fox News Flash top headlines are here. Check o...
4,Colorado hiker with coronavirus-like symptoms ...,"Tue, 03 Nov 2020 19:20:33 GMT",https://www.foxnews.com/health/colorado-hiker-...,Madeline Farber,Fox News Flash top headlines are here. Check o...


Export results as a csv file.

In [12]:
fox.to_csv('./fox.csv', index=True)#export to csv

## 3. CBC

Use **feedparser** to parse the RSS feed of the CBC top stories section, then print the first entry as an example.

In [13]:
cbc_feed = feedparser.parse('https://rss.cbc.ca/lineup/topstories.xml')#feedparsing rss
print(cbc_feed['entries'][0])

{'title': 'Election day arrives in U.S. as polls open across country', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'https://www.cbc.ca/cmlink/rss-topstories', 'value': 'Election day arrives in U.S. as polls open across country'}, 'links': [{'rel': 'alternate', 'type': 'text/html', 'href': 'https://www.cbc.ca/news/world/us-election-day-2020-trump-biden-1.5787485?cmp=rss'}], 'link': 'https://www.cbc.ca/news/world/us-election-day-2020-trump-biden-1.5787485?cmp=rss', 'id': '1.5327704', 'guidislink': False, 'published': 'Sat, 19 Oct 2019 12:37:24 EDT', 'published_parsed': time.struct_time(tm_year=2019, tm_mon=10, tm_mday=19, tm_hour=16, tm_min=37, tm_sec=24, tm_wday=5, tm_yday=292, tm_isdst=0), 'authors': [{}], 'author': '', 'tags': [{'term': 'News', 'scheme': None, 'label': None}], 'summary': '<img alt="1229435769" height="259" src="https://i.cbc.ca/1.5787608.1604413470!/fileImage/httpImage/image.jpg_gen/derivatives/16x9_460/1229435769.jpg" title="ATLANTA, GA - NOVEMBE

For each article in the parsed results, save its title, date, link, author, and full text. To get the full text, we use **BeautifulSoup** to parse the HTML behind the article's link and, if the page has an element with the class story, collect the paragraph tags inside the first one.

In [14]:
#getting metadata
titles3 = []
dates3 = []
links3 = []
authors3 = []
texts3 = []
for article in cbc_feed['entries']:
    titles3.append(article['title'])
    dates3.append(article['published'])
    links3.append(article['link'])
    html = rqst.urlopen(article['link'])
    bs = BeautifulSoup(html, features='html.parser')
    targets = bs.find("body").select('.story')
    if targets != []:
        targets = targets[0].select('p')
    text = ''
    for target in targets:
        text += target.text
        text += ' '
    texts3.append(text)
    if 'author' in article:
        authors3.append(article['author'])
    else:
        authors3.append(None)

Build a pandas DataFrame from the lists collected above.

In [15]:
#create dataframe
cbc_data={"title": titles3, "date": dates3,"link": links3, "author": authors3, "text": texts3}
cbc=pd.DataFrame(cbc_data)
cbc.head()

Unnamed: 0,title,date,link,author,text
0,Election day arrives in U.S. as polls open acr...,"Sat, 19 Oct 2019 12:37:24 EDT",https://www.cbc.ca/news/world/us-election-day-...,,The latest: After a campaign marked by rancour...
1,Canada's top public health doctor now recommen...,"Tue, 3 Nov 2020 13:17:07 EST",https://www.cbc.ca/news/politics/three-layer-m...,Catharine Tunney,The Public Health Agency of Canada is now reco...
2,How to follow U.S. election day coverage on CBC,"Thu, 22 Oct 2020 20:36:51 EDT",https://www.cbc.ca/news/world/how-to-watch-us-...,,This election day in the United States is goin...
3,"Trudeau, O'Toole vow to work with Trump, while...","Tue, 3 Nov 2020 13:48:42 EST",https://www.cbc.ca/news/politics/trudeau-otool...,John Paul Tasker,Prime Minister Justin Trudeau and Conservative...
4,A Muskoka cottage owner paid $64K to save it. ...,"Tue, 3 Nov 2020 04:00:00 EST",https://www.cbc.ca/news/canada/toronto/muskoka...,John Lancaster,Liz Saunders ambles around the outside of her ...


Export results as a csv file.

In [16]:
cbc.to_csv('./cbc.csv', index=True)#export to csv

## 4. BBC

Use **feedparser** to parse the BBC News RSS feed, then print the first entry as an example.

In [17]:
bbc_feed = feedparser.parse('http://feeds.bbci.co.uk/news/rss.xml')#feedparsing rss
print(bbc_feed['entries'][0])

{'title': "UK terrorism threat level raised to 'severe'", 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'http://feeds.bbci.co.uk/news/rss.xml', 'value': "UK terrorism threat level raised to 'severe'"}, 'summary': 'It means an attack is highly likely but there is no specific intelligence of an imminent incident.', 'summary_detail': {'type': 'text/html', 'language': None, 'base': 'http://feeds.bbci.co.uk/news/rss.xml', 'value': 'It means an attack is highly likely but there is no specific intelligence of an imminent incident.'}, 'links': [{'rel': 'alternate', 'type': 'text/html', 'href': 'https://www.bbc.co.uk/news/uk-54799377'}], 'link': 'https://www.bbc.co.uk/news/uk-54799377', 'id': 'https://www.bbc.co.uk/news/uk-54799377', 'guidislink': False, 'published': 'Tue, 03 Nov 2020 18:41:29 GMT', 'published_parsed': time.struct_time(tm_year=2020, tm_mon=11, tm_mday=3, tm_hour=18, tm_min=41, tm_sec=29, tm_wday=1, tm_yday=308, tm_isdst=0)}


For each article in the parsed results, save its title, date, link, author, and full text. To get the full text, we use **BeautifulSoup** to parse the HTML behind the article's link and select the elements with the class css-83cqas-RichTextContainer, which hold the main body.

In [18]:
#getting metadata
titles4 = []
dates4 = []
links4 = []
authors4 = []
texts4 = []
for article in bbc_feed['entries']:
    titles4.append(article['title'])
    dates4.append(article['published'])
    links4.append(article['link'])
    html = rqst.urlopen(article['link'])
    bs = BeautifulSoup(html, features='html.parser')
    targets = bs.select('.css-83cqas-RichTextContainer')
    text = ''
    for target in targets:
        text += target.text
        text += ' '
    texts4.append(text)
    if 'author' in article:
        authors4.append(article['author'])
    else:
        authors4.append(None)

Build a pandas DataFrame from the lists collected above.

In [19]:
#create dataframe
bbc_data={"title": titles4, "date": dates4,"link": links4, "author": authors4, "text": texts4}
bbc=pd.DataFrame(bbc_data)
bbc.head()

Unnamed: 0,title,date,link,author,text
0,UK terrorism threat level raised to 'severe',"Tue, 03 Nov 2020 18:41:29 GMT",https://www.bbc.co.uk/news/uk-54799377,,.css-14iz86j-BoldText{font-weight:bold;}The UK...
1,Vienna shootings: Three men praised for helpin...,"Tue, 03 Nov 2020 17:38:35 GMT",https://www.bbc.co.uk/news/world-europe-54779882,,.css-14iz86j-BoldText{font-weight:bold;}Three ...
2,John Lewis and Currys PC World extend hours ah...,"Tue, 03 Nov 2020 16:12:22 GMT",https://www.bbc.co.uk/news/business-54795015,,.css-14iz86j-BoldText{font-weight:bold;}John L...
3,Rape suspect Kadian Nelson urged to hand himse...,"Tue, 03 Nov 2020 19:06:13 GMT",https://www.bbc.co.uk/news/uk-england-london-5...,,.css-14iz86j-BoldText{font-weight:bold;}Police...
4,Town houses collapse leaves gaping hole in bui...,"Tue, 03 Nov 2020 15:54:17 GMT",https://www.bbc.co.uk/news/uk-england-london-5...,,.css-14iz86j-BoldText{font-weight:bold;}Two fo...
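
Every value in the text column above starts with a CSS rule (.css-14iz86j-BoldText{font-weight:bold;}) that was picked up together with the article text. A minimal sketch of stripping such rules with a regular expression follows. It assumes the leaked rules always start with a class selector and contain no nested braces; strip_css_rules is a hypothetical helper, and the sample sentence is made up.

```python
import re

# Matches a rule such as ".css-14iz86j-BoldText{font-weight:bold;}".
# Assumption: a class selector followed by one brace-delimited block.
CSS_RULE = re.compile(r"\.[A-Za-z_][\w-]*\{[^{}]*\}")


def strip_css_rules(text):
    """Remove inline CSS rules that leaked into scraped article text."""
    return CSS_RULE.sub("", text).strip()


sample = ".css-14iz86j-BoldText{font-weight:bold;}Some article text."
print(strip_css_rules(sample))
```

Applied to the DataFrame, this would be something like bbc['text'].apply(strip_css_rules) before exporting.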


Export results as a csv file.

In [20]:
bbc.to_csv('./bbc.csv', index=True)#export to csv

# Advanced Text Processing for NY Times dataset

In this section, we use Bag of Words document representations (TF and TF-IDF) and word embedding techniques such as Word2Vec to extract features that can be used with scikit-learn models. The New York Times data obtained above is used.

#### Data loading, cleaning and preparation

In [21]:
#Load the New York Times data
df = pd.read_csv("ny.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,title,date,link,author,text
0,0,‘Song Exploder’ and the Inexhaustible Hustle o...,"Tue, 03 Nov 2020 16:50:28 +0000",https://www.nytimes.com/2020/11/03/arts/hrishi...,Reggie Ugwu,Making something new is like climbing a mounta...
1,1,I’m a Chess Expert. Here’s What ‘The Queen’s G...,"Tue, 03 Nov 2020 17:09:24 +0000",https://www.nytimes.com/2020/11/03/arts/televi...,Dylan Loeb McClain,This article contains spoilers for “The Queen’...
2,2,When a Dance Collective Was Like a Rock Band,"Tue, 03 Nov 2020 13:54:35 +0000",https://www.nytimes.com/2020/11/03/arts/dance/...,Gia Kourlas,"In 1970, the Grand Union came into being, and ..."
3,3,A ‘Wicked’ Challenge and Other Tough Questions...,"Tue, 03 Nov 2020 17:41:32 +0000",https://www.nytimes.com/2020/11/03/theater/ben...,Ben Brantley,I’m 15 years old and here is my question: When...
4,4,Ian Bostridge on Schubert’s Hidden Depths,"Tue, 03 Nov 2020 13:00:08 +0000",https://www.nytimes.com/2020/11/03/arts/music/...,Ian Bostridge,I first got to know the 20 songs of Franz Schu...


In [22]:
# Data cleaning
df.drop(['Unnamed: 0'], axis=1, inplace=True) # Since it is a duplicate of the dataframe indices

#Check if there are any missing values
df.isnull().sum(axis=0)

title     0
date      0
link      0
author    0
text      0
dtype: int64

The check above reports no missing values in any column. We still drop rows with NaN values as a safeguard, in case a scrape returns empty fields.

In [23]:
df.dropna(inplace=True)

#### Data pre-processing
* We combine the title and text columns since they are likely to share vocabulary. In addition, combining them keeps the vector representation of each article smaller than vectorizing the two columns separately.
* We clean the data by lowercasing it and replacing numbers and punctuation with spaces.

In [24]:
import re

#Combine title and text data
#Note: no separator is added, so the last title word fuses with the first text word
df['title_text'] = df['title'] + df['text']

def preprocessing(text):
    # lowercase
    text = text.lower()
    # replace each run of digits and non-word characters with a single space
    text = re.sub(r"(\d|\W)+", " ", text)

    return text

df['title_text'] = df['title_text'].apply(preprocessing)
print(df['title_text'][1])

i m a chess expert here s what the queen s gambit gets rightthis article contains spoilers for the queen s gambit in the netflix series the queen s gambit the character benny watts walks up to beth harmon the show s heroine at the start of the united states chess championship the location is a small auditorium on the campus of ohio university uttering an expletive benny gestures around the hall and complains about the conditions noting that the best players in the country are competing and yet the venue is second rate the chess boards and pieces are cheap plastic and the few spectators seem bored at best as a chess master who grew up in the era just after the one in which the series takes place and who wrote the chess column for the new york times for eight years i can attest to the scene s almost painful authenticity many tournaments of that era were played in odd and sometimes dingy locations even the u s championship was not immune the competition was not even held the exchange betw
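
In the output above, the last word of the title and the first word of the text are fused ("rightthis") because the two columns are concatenated without a separator. The sketch below reproduces the effect on a made-up title and text and shows that joining with a space avoids it. The preprocessing function mirrors the one defined above.

```python
import re


def preprocessing(text):
    # Lowercase, then replace each run of digits and non-word characters with a space.
    return re.sub(r"(\d|\W)+", " ", text.lower())


# Made-up title and text, for illustration only.
title = "What the Show Gets Right"
text = "This article contains spoilers."

fused = preprocessing(title + text)          # no separator
spaced = preprocessing(title + " " + text)   # explicit separator

print(fused)
print(spaced)
```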

Split the data into training and test datasets

In [25]:
from sklearn.model_selection import train_test_split

#Split the data into training and test sets (this step becomes more meaningful once there are more data points)
df_train,df_test = train_test_split(df,test_size=0.33, random_state=42)

## Bag of Words Document Representation (TF and TF-IDF)

Scikit-learn provides a TF-IDF vectorizer (TfidfVectorizer) that can create both TF and TF-IDF vector representations for text data. Note:
* use_idf: if set to True, the TF-IDF representation is used; otherwise plain TF is used

Helper function for text processing using TF and TF-IDF

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer

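def text_processing(text, is_tfidf):
    # Fit a vectorizer on the given texts: is_tfidf=False gives TF, True gives TF-IDF
    vectorizer = TfidfVectorizer(stop_words = 'english',   # drop common English stop words
                                 use_idf = is_tfidf, 
                                 max_features = 300,       # keep at most 300 features
                                 ngram_range = (2,3)).fit(text)   # bigrams and trigrams only
    return vectorizer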

## Option 1: Combine title and text

In general, the title of the article is related to the content of the article. So, in this context, combining the text and title before extracting features seems reasonable.

### Document Representation based on TF (Term Frequency)

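Term frequency depends on how often a term occurs in the current article. Here, we create a TF vectorizer for the articles. We limit the number of features to at most 300 and use the common English stop words provided by scikit-learn. Because the helper sets ngram_range to (2,3), the features are two- and three-word phrases rather than single words. TfidfVectorizer also scales each article's vector to unit length by default, which is why the values in the table are fractions rather than raw counts.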
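
To show what these features look like, here is a small pure-Python sketch of bigram and trigram extraction followed by length-normalized term frequency. It is not scikit-learn's implementation; the helper names and the toy token list are made up, and stop words are assumed to have been removed already.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=2, n_max=3):
    """Return all n-grams with n_min <= n <= n_max as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def normalized_tf(tokens):
    """Raw n-gram counts scaled to unit Euclidean length (L2 norm)."""
    counts = Counter(ngrams(tokens))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()}


# Toy, already-cleaned token list.
tokens = ["chess", "expert", "queen", "gambit", "chess", "expert"]
tf = normalized_tf(tokens)

for term, weight in sorted(tf.items(), key=lambda kv: -kv[1])[:3]:
    print(term, round(weight, 3))
```

The phrase "chess expert" occurs twice, so it gets the largest weight; every other phrase occurs once.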

In [27]:
#transform the train and test articles to vectors
vectorizer = text_processing(df_train['title_text'], False)
train_text_rep = vectorizer.transform(df_train['title_text'])
test_text_rep = vectorizer.transform(df_test['title_text'])

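#Replace the article's text data (title and text) with their vector representations
#Note: pd.concat(axis=1) aligns rows by index label; df_train and df_test keep their shuffled labels,
#so call reset_index(drop=True) on them first if rows must line up by position
#Note: get_feature_names() was removed in scikit-learn 1.2 in favor of get_feature_names_out()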
X_train_tf = pd.concat([df_train.drop(['title', 'text', 'title_text'], axis=1),
           pd.DataFrame(data = train_text_rep.toarray(), columns = vectorizer.get_feature_names())], axis = 1)

X_test_tf = pd.concat([df_test.drop(['title', 'text', 'title_text'], axis=1),
           pd.DataFrame(data = test_text_rep.toarray(), columns = vectorizer.get_feature_names())], axis = 1)
X_test_tf.head()

Unnamed: 0,date,link,author,accident history,aghdashloo work,alike years,alike years come,alternative weekly,amber heard,american actors,...,work black artists,work mr,work ms,working dancers,worry dolls,year old,years later,years old,york city,york times
0,"Tue, 03 Nov 2020 16:50:28 +0000",https://www.nytimes.com/2020/11/03/arts/hrishi...,Reggie Ugwu,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0
2,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Tue, 03 Nov 2020 17:41:32 +0000",https://www.nytimes.com/2020/11/03/theater/ben...,Ben Brantley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Tue, 03 Nov 2020 13:00:08 +0000",https://www.nytimes.com/2020/11/03/arts/music/...,Ian Bostridge,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
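
Rows 1 and 2 in the table above have no date, link, or author. This is what pd.concat(axis=1) produces when the two frames carry different index labels: the split keeps the original (shuffled) row labels, while the DataFrame built from the vectorizer output is labelled 0 to n-1. The sketch below reproduces the effect with toy frames and shows one way to align rows by position instead. The frame names and values are made up.

```python
import pandas as pd

# `meta` mimics df_test after a shuffled split (non-sequential index labels);
# `vectors` mimics the DataFrame built from the vectorizer output (labels 0..n-1).
meta = pd.DataFrame({"author": ["A", "B", "C"]}, index=[3, 0, 2])
vectors = pd.DataFrame({"york times": [0.1, 0.0, 0.2]})

# Aligned on index labels: labels 1 and 3 exist in only one frame, so NaN appears.
misaligned = pd.concat([meta, vectors], axis=1)

# Dropping the old labels first aligns the rows by position.
aligned = pd.concat([meta.reset_index(drop=True), vectors], axis=1)

print(misaligned)
print(aligned)
```

Whether to keep the original labels or reset them depends on whether later steps need to trace rows back to df; resetting is the simpler choice when they do not.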


### Document Representation based on TF-IDF (Term Frequency-Inverse Document Frequency)

Using TF alone is problematic because terms that are frequent but not necessarily informative (e.g. "the") receive a high score. TF-IDF addresses this limitation by combining TF with IDF, which measures how rare a term is across articles.

The settings for the model are the same as for TF above, except that is_tfidf is now True to indicate that this is a TF-IDF vectorizer.
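
As a rough illustration of the IDF part, the sketch below computes the smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, which is the formula scikit-learn documents for its default settings. The corpus is a toy example with made-up bigram features, and smoothed_idf is a hypothetical helper, not a scikit-learn function.

```python
import math


def smoothed_idf(term, documents):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, where df(t) counts documents containing t."""
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# Toy corpus: each document is a set of bigram features.
documents = [
    {"new york", "york times", "chess expert"},
    {"new york", "york times", "dance collective"},
    {"new york", "hidden depths"},
]

for term in ("new york", "york times", "chess expert"):
    print(term, round(smoothed_idf(term, documents), 3))
```

A phrase that appears in every document gets the lowest weight, and rarer phrases get higher ones. The TF-IDF value of a phrase in an article is its term frequency multiplied by this weight, after which each article vector is again scaled to unit length.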

In [28]:
#transform the train and test articles to vectors
vectorizer = text_processing(df_train['title_text'], True)
train_text_rep = vectorizer.transform(df_train['title_text'])
test_text_rep = vectorizer.transform(df_test['title_text'])

#Replace the article's text data (title and text) with their vector representations
X_train_tfidf = pd.concat([df_train.drop(['title', 'text', 'title_text'], axis=1),
           pd.DataFrame(data = train_text_rep.toarray(), columns = vectorizer.get_feature_names())], axis = 1)

X_test_tfidf = pd.concat([df_test.drop(['title', 'text', 'title_text'], axis=1),
           pd.DataFrame(data = test_text_rep.toarray(), columns = vectorizer.get_feature_names())], axis = 1)
X_test_tfidf.head()

Unnamed: 0,date,link,author,accident history,aghdashloo work,alike years,alike years come,alternative weekly,amber heard,american actors,...,work black artists,work mr,work ms,working dancers,worry dolls,year old,years later,years old,york city,york times
0,"Tue, 03 Nov 2020 16:50:28 +0000",https://www.nytimes.com/2020/11/03/arts/hrishi...,Reggie Ugwu,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.125558,0.0,0.0,0.0,0.0
2,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Tue, 03 Nov 2020 17:41:32 +0000",https://www.nytimes.com/2020/11/03/theater/ben...,Ben Brantley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Tue, 03 Nov 2020 13:00:08 +0000",https://www.nytimes.com/2020/11/03/arts/music/...,Ian Bostridge,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Option 2:  Separate title and text data

In option 1, we combined each article's title and text because they are closely related, so vectorizing them separately may produce redundant columns. In general, however, if text features are unrelated, one might want to vectorize them separately. Below, we show how the two features can be vectorized separately.

### Document Representation based on TF

In [29]:
# On Training

#Vectorize the title and text data separately
title_vectorizer = text_processing(df_train['title'], False)
title_rep = title_vectorizer.transform(df_train['title'])

text_vectorizer = text_processing(df_train['text'], False)
text_rep = text_vectorizer.transform(df_train['text'])

X_train_tf = pd.concat([df_train.drop(['title', 'text'], axis=1),
           pd.DataFrame(data = title_rep.toarray(), columns = title_vectorizer.get_feature_names()),
           pd.DataFrame(data = text_rep.toarray(), columns = text_vectorizer.get_feature_names())], axis = 1)

#On test data
#transform the test data's title and text data separately
title_rep = title_vectorizer.transform(df_test['title'])
text_rep = text_vectorizer.transform(df_test['text'])

X_test_tf = pd.concat([df_test.drop(['title', 'text'], axis=1),
           pd.DataFrame(data = title_rep.toarray(), columns = title_vectorizer.get_feature_names()),
           pd.DataFrame(data = text_rep.toarray(), columns = text_vectorizer.get_feature_names())], axis = 1)
X_test_tf.head()

Unnamed: 0,date,link,author,title_text,activism art,actor kids,actor kids right,actors say,agent suave,agent suave bond,...,white student,working dancers,workshop gallery,worry dolls,worth bingham,year old,years later,years old,york city,york times
0,"Tue, 03 Nov 2020 16:50:28 +0000",https://www.nytimes.com/2020/11/03/arts/hrishi...,Reggie Ugwu,song exploder and the inexhaustible hustle of...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,,,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0
2,,,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Tue, 03 Nov 2020 17:41:32 +0000",https://www.nytimes.com/2020/11/03/theater/ben...,Ben Brantley,a wicked challenge and other tough questions f...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Tue, 03 Nov 2020 13:00:08 +0000",https://www.nytimes.com/2020/11/03/arts/music/...,Ian Bostridge,ian bostridge on schubert s hidden depthsi fir...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Document Representation based on TF-IDF

In [30]:
# On Training

title_vectorizer = text_processing(df_train['title'], True)
title_rep = title_vectorizer.transform(df_train['title'])

text_vectorizer = text_processing(df_train['text'], True)
text_rep = text_vectorizer.transform(df_train['text'])

X_train_tfidf = pd.concat([df_train.drop(['title', 'text'], axis=1),
           pd.DataFrame(data = title_rep.toarray(), columns = title_vectorizer.get_feature_names()),
           pd.DataFrame(data = text_rep.toarray(), columns = text_vectorizer.get_feature_names())], axis = 1)

#On test data

title_rep = title_vectorizer.transform(df_test['title'])
text_rep = text_vectorizer.transform(df_test['text'])

X_test_tfidf = pd.concat([df_test.drop(['title', 'text'], axis=1),
           pd.DataFrame(data = title_rep.toarray(), columns = title_vectorizer.get_feature_names()),
           pd.DataFrame(data = text_rep.toarray(), columns = text_vectorizer.get_feature_names())], axis = 1)

In [31]:
X_test_tfidf.head()

Unnamed: 0,date,link,author,title_text,activism art,actor kids,actor kids right,actors say,agent suave,agent suave bond,...,white student,working dancers,workshop gallery,worry dolls,worth bingham,year old,years later,years old,york city,york times
0,"Tue, 03 Nov 2020 16:50:28 +0000",https://www.nytimes.com/2020/11/03/arts/hrishi...,Reggie Ugwu,song exploder and the inexhaustible hustle of...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,,,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.125558,0.0,0.0,0.0,0.0
2,,,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Tue, 03 Nov 2020 17:41:32 +0000",https://www.nytimes.com/2020/11/03/theater/ben...,Ben Brantley,a wicked challenge and other tough questions f...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Tue, 03 Nov 2020 13:00:08 +0000",https://www.nytimes.com/2020/11/03/arts/music/...,Ian Bostridge,ian bostridge on schubert s hidden depthsi fir...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


##  Word embedding techniques

Word embedding techniques represent each word as a vector in such a way that the context of the word is captured by that representation. That is, the vector representations of words that are used in similar ways (e.g. mother and father) are closer together in the vector space.

Some of the most popular word embedding techniques include Word2Vec and GloVe.

In this section, Word2Vec is used to get word embeddings for the New York Times articles (obtained above). Given the lack of a big dataset to train our own model, we will use Google's pre-trained Word2Vec model. This model fits the context since it was trained on news articles with about 100 billion words. 

To get the vector representation of our text data, we will go through these 3 main steps:
* Basic preprocessing of the data including tokenizing words, removing stop words and punctuation
* Get word embeddings using the pre-trained model
* Adapt word embeddings representation into data that can be used for modelling (e.g. classification)

In [32]:
#import needed libraries
import gensim
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import numpy as np

#### Importing the model and stopwords

Here, the Word2Vec Google news pre-trained model is imported. It will be used to produce word embeddings for our news article dataset.
The pre-trained model can be found here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

In the data preprocessing stage, we want to remove stopwords from the data before using the model, so English stopwords are imported from nltk.

In [33]:
# import stop words from the nltk library
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# Load Word2Vec pre-trained Google News model
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary = True) 

# Convert the article's text into a list
text_list = [text for text in df['title_text']]
print(len(text_list))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\merci\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


59


#### Data pre-processing

In this stage, we want to clean the data before we use the model. This includes:
* Creating word tokens for each article
* Basic data cleaning, including lowercasing and removing stop words and punctuation
* Removing words that are not part of the pre-trained model vocabulary
* Removing articles that are empty
* Removing articles where none of its words is in the pre-trained model vocabulary

In [34]:
import nltk
nltk.download('punkt')

#Function for basic preprocessing
def preprocess(text):
    text = text.lower()
    #Get word tokens for the article
    article = word_tokenize(text)
    
    #remove stop words
    article = [word for word in article if word not in stop_words]
    
    #remove punctuation
    article = [word for word in article if word.isalpha()] 
    
    #remove words that are not in the pre-trained model vocabulary
    #(model.vocab is the gensim 3.x API; gensim 4 replaced it with model.key_to_index)
    article = [word for word in article if word in model.vocab]
    
    return article

collection = [preprocess(text) for text in text_list]

#Remove any empty articles
collection = [article for article in collection if len(article) > 0]

# Remove articles with none of their words in the pre-trained model's vocabulary
# (a safeguard: preprocess() already filters out-of-vocabulary words)
collection = [article for article in collection if not all(word not in model.vocab for word in article)]
print(collection[1])

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\merci\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['chess', 'expert', 'queen', 'gambit', 'gets', 'article', 'contains', 'spoilers', 'queen', 'gambit', 'netflix', 'series', 'queen', 'gambit', 'character', 'benny', 'watts', 'walks', 'beth', 'harmon', 'show', 'heroine', 'start', 'united', 'states', 'chess', 'championship', 'location', 'small', 'auditorium', 'campus', 'ohio', 'university', 'uttering', 'expletive', 'benny', 'gestures', 'around', 'hall', 'complains', 'conditions', 'noting', 'best', 'players', 'country', 'competing', 'yet', 'venue', 'second', 'rate', 'chess', 'boards', 'pieces', 'cheap', 'plastic', 'spectators', 'seem', 'bored', 'best', 'chess', 'master', 'grew', 'era', 'one', 'series', 'takes', 'place', 'wrote', 'chess', 'column', 'new', 'york', 'times', 'eight', 'years', 'attest', 'scene', 'almost', 'painful', 'authenticity', 'many', 'tournaments', 'era', 'played', 'odd', 'sometimes', 'dingy', 'locations', 'even', 'u', 'championship', 'immune', 'competition', 'even', 'held', 'exchange', 'beth', 'played', 'anya', 'taylor', 

After cleaning, an article's tokens look like the output above. We can now pass the tokenized articles to the model to get word embeddings.
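To see what each cleaning step does without loading NLTK data or the Word2Vec file, here is a minimal sketch on a single sentence. The names `toy_stop_words`, `toy_vocab`, and `preprocess_sketch` are hypothetical stand-ins, and the regular expression is only a crude substitute for `word_tokenize`, so the tokens may differ slightly from what NLTK would produce.

```python
import re

# Hypothetical stand-ins so the sketch runs without NLTK data or the Word2Vec file
toy_stop_words = {"the", "a", "of", "is", "in"}
toy_vocab = {"queen", "gambit", "chess", "series"}  # plays the role of the model vocabulary

def preprocess_sketch(text, stop_words, vocab):
    # Lowercase, then keep runs of letters only (crude substitute for word_tokenize + isalpha)
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stop words, then drop anything the vocabulary does not contain
    return [t for t in tokens if t not in stop_words and t in vocab]

tokens = preprocess_sketch("The Queen's Gambit is a chess series!", toy_stop_words, toy_vocab)
print(tokens)  # ['queen', 'gambit', 'chess', 'series']
```

With the real model, the membership test `word in model` should work on a `KeyedVectors` object in both gensim 3.x and 4.x, which avoids depending on the `vocab` attribute that gensim 4.0 removed.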

#### Using the pre-trained model to vectorize data

Passing the tokenized data to the model gives a word embedding for every word in each article. Each word is represented by a 300-dimensional vector, so each article has a matrix representation of shape (number of words in the article, 300).

However, this matrix representation is not practical if we want to feed the result to scikit-learn algorithms such as classifiers, which expect one fixed-length feature vector per sample. In other words, we need to represent each article with a vector instead of a matrix. To do that, we average over the rows of the matrix (i.e., over the words in the article) to get a 300-dimensional vector representation of the article.
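The averaging step is easy to check by hand on a small matrix. The sketch below uses a made-up 3-word, 4-dimensional matrix (`word_vectors` is a hypothetical stand-in for what the model returns, which has 300 columns) and shows that `axis=0` collapses the word dimension.

```python
import numpy as np

# Toy stand-in for an article's embedding matrix: 3 words, 4 dimensions
# (the real model returns 300 dimensions per word)
word_vectors = np.array([
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
])

# axis=0 averages down the rows, i.e. across words, leaving one value per dimension
article_vector = word_vectors.mean(axis=0)

print(word_vectors.shape)    # (3, 4)
print(article_vector.shape)  # (4,)
print(article_vector)        # [2. 1. 2. 2.]
```

Whatever the number of words, the result has one entry per embedding dimension, which is what makes articles of different lengths comparable.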

Below is an example of getting word embeddings for one article using the pre-trained model and converting them into a single vector representation.

In [35]:
#Example of the vector representation returned by the model
example_embedding = model[collection[1]]
print("Article size (Number of words): ", len(collection[1]))
print("Dimensions of returned vector representation: ", np.array(example_embedding).shape)

# To get a one-dimensional representation of the article, average over the words; the result has shape (300,)
final_representation = np.mean(example_embedding, axis=0)
print("Final article vector representation: ", final_representation.shape)
print(final_representation)

Article size (Number of words):  757
Dimensions of returned vector representation:  (757, 300)
Final article vector representation:  (300,)
[ 0.01933521  0.03753862  0.02249091  0.07384218 -0.00304275  0.00455796
  0.03144605 -0.07511539  0.05584898  0.07820767 -0.01551456 -0.10441145
 -0.04442851  0.01821963 -0.0786284   0.0711761   0.05042451  0.11170877
  0.02418499 -0.06758834  0.00220837  0.05266898  0.01790958 -0.01257178
  0.00155603 -0.04016721 -0.10146177  0.06132949 -0.00831238 -0.0232123
 -0.03476984  0.00704002 -0.00741802  0.01187331  0.03841127 -0.01465935
  0.01790204  0.04132473  0.05244576  0.05568149  0.09817482 -0.04886817
  0.10871352  0.02379193 -0.01191496 -0.05154021 -0.03097785 -0.01147875
  0.00709169  0.04319341 -0.04366703  0.05001208 -0.03585338 -0.03495815
 -0.01109524  0.00600169 -0.02844731 -0.10010839  0.01369905 -0.06960253
 -0.00233859  0.06745585 -0.06478704 -0.07270085 -0.00467693  0.00583099
 -0.02882861  0.05726865 -0.01905065  0.06747171  0.049886

Next, we repeat the steps from the example above for every article in the collection. The resulting vectors are then concatenated with the original dataframe to create a dataset that can be used with scikit-learn algorithms.

In [36]:
X = []
# Do the same step as above for all articles to get the final dataset that can be used for modelling
for article in collection:
    #average over the word vectors of the current article to get its final vector representation
    X.append(np.mean(model[article], axis=0))

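# Replace the title and text features with their vector representations.
# Note: pd.concat(axis=1) aligns rows by index, so this assumes that no articles were
# dropped from `collection` during preprocessing and that df has a default 0..n-1 index.
X = pd.concat([df.drop(['title', 'text', 'title_text'], axis=1),
               pd.DataFrame(data=X)], axis=1)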
X.head()

Unnamed: 0,date,link,author,0,1,2,3,4,5,6,...,290,291,292,293,294,295,296,297,298,299
0,"Tue, 03 Nov 2020 16:50:28 +0000",https://www.nytimes.com/2020/11/03/arts/hrishi...,Reggie Ugwu,0.041694,0.019194,-0.013325,0.068121,-0.029784,0.007827,0.04028,...,-0.053279,0.023308,-0.111168,-0.006335,-0.038804,-0.040099,0.033622,-0.042574,0.021526,0.001228
1,"Tue, 03 Nov 2020 17:09:24 +0000",https://www.nytimes.com/2020/11/03/arts/televi...,Dylan Loeb McClain,0.019335,0.037539,0.022491,0.073842,-0.003043,0.004558,0.031446,...,-0.081641,0.005431,-0.088254,-0.017435,-0.018961,-0.029106,0.016734,-0.039739,0.004653,0.021193
2,"Tue, 03 Nov 2020 13:54:35 +0000",https://www.nytimes.com/2020/11/03/arts/dance/...,Gia Kourlas,0.039026,0.022501,0.007078,0.090727,-0.036886,-0.017872,0.040199,...,-0.088283,0.028951,-0.091134,0.005467,-0.055262,-0.051101,0.057514,-0.047479,0.017819,-0.0003
3,"Tue, 03 Nov 2020 17:41:32 +0000",https://www.nytimes.com/2020/11/03/theater/ben...,Ben Brantley,0.044677,0.038605,0.018023,0.091937,-0.052442,-0.00766,0.049574,...,-0.084238,0.033683,-0.095721,0.008553,-0.041119,-0.021825,0.031232,-0.038794,0.034905,0.004407
4,"Tue, 03 Nov 2020 13:00:08 +0000",https://www.nytimes.com/2020/11/03/arts/music/...,Ian Bostridge,0.086072,0.045585,-0.010598,0.069083,-0.045031,0.021492,0.065773,...,-0.06894,0.005981,-0.112651,-0.022302,-0.062983,-0.033652,0.032777,-0.040017,0.031406,0.023256
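
One caveat with the concatenation step: if preprocessing removes any article, the list of vectors becomes shorter than the dataframe and the rows no longer line up. A minimal sketch of one way to guard against this is to carry each article's original row label along with its tokens. Everything here (`toy_df`, `toy_tokens`, `toy_embeddings`) is hypothetical toy data standing in for `df`, the tokenized articles, and the Word2Vec model.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the middle article ends up with no usable tokens
toy_df = pd.DataFrame({"author": ["A", "B", "C"],
                       "title_text": ["chess queen", "!!!", "gambit"]})
toy_tokens = [["chess", "queen"], [], ["gambit"]]   # one token list per row of toy_df
toy_embeddings = {"chess": np.array([1.0, 0.0]),    # stand-in for the Word2Vec model
                  "queen": np.array([0.0, 1.0]),
                  "gambit": np.array([2.0, 2.0])}

# Keep the original row label next to each non-empty article
kept = [(idx, toks) for idx, toks in zip(toy_df.index, toy_tokens) if toks]
vectors = pd.DataFrame(
    [np.mean([toy_embeddings[t] for t in toks], axis=0) for _, toks in kept],
    index=[idx for idx, _ in kept],
)

# join aligns on the index, so each vector lands on the row it came from
features = toy_df.drop(columns=["title_text"]).join(vectors, how="inner")
print(features)
```

The inner join drops the row whose article was filtered out instead of silently pairing its metadata with another article's vector.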
