In [1]:
import feedparser
from bs4 import BeautifulSoup
import urllib.request as rqst
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

# Extracting news data using RSS feeds

We are extracting metadata from four news websites' RSS feeds: **NY Times, FOX, CBC, BBC**

## 1. NY Times

Use **feedparser** to parse the RSS feed of NY Times news of arts section.

In [2]:
ny_feed = feedparser.parse('https://rss.nytimes.com/services/xml/rss/nyt/Arts.xml')#feedparsing rss

Check the result of one article for example.

In [3]:
print(ny_feed['entries'][0])

{'title': 'A Golden Team, a Terrible Title and a Show That Vanished', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'https://rss.nytimes.com/services/xml/rss/nyt/Arts.xml', 'value': 'A Golden Team, a Terrible Title and a Show That Vanished'}, 'links': [{'rel': 'alternate', 'type': 'text/html', 'href': 'https://www.nytimes.com/2020/11/02/theater/a-pray-by-blecht-sondheim-bernstein.html'}, {'href': 'https://www.nytimes.com/2020/11/02/theater/a-pray-by-blecht-sondheim-bernstein.html', 'rel': 'standout', 'type': 'text/html'}], 'link': 'https://www.nytimes.com/2020/11/02/theater/a-pray-by-blecht-sondheim-bernstein.html', 'id': 'https://www.nytimes.com/2020/11/02/theater/a-pray-by-blecht-sondheim-bernstein.html', 'guidislink': False, 'summary': 'Would you like to see a new musical from the people who brought you “West Side Story”? For better or worse, you probably never will.', 'summary_detail': {'type': 'text/html', 'language': None, 'base': 'https://rss.nytimes.com/servi

For each article in parsed results, save its title, date, link, author, and full text. To get the full text of the article, we use **BeautifulSoup** to parse the html from its link, and select tags with the class of its main body.

In [4]:
#getting metadata
titles = []
dates = []
links = []
authors = []
texts = []
for article in ny_feed['entries']:
    titles.append(article['title'])
    dates.append(article['published'])
    links.append(article['link'])
    html = rqst.urlopen(article['link'])
    bs = BeautifulSoup(html, features='html.parser')
    targets = bs.select('.css-158dogj')
    text = ''
    for target in targets:
        text += target.text
        text += ' '
    texts.append(text)
    if 'author' in article:
        authors.append(article['author'])
    else:
        authors.append(None)

Pandas dataframe is created by data saved previously.

In [5]:
#create dataframe
ny_data={"title": titles, "date": dates,"link": links, "author": authors, "text": texts}
ny=pd.DataFrame(ny_data)

In [6]:
ny.head()

Unnamed: 0,title,date,link,author,text
0,"A Golden Team, a Terrible Title and a Show Tha...","Mon, 02 Nov 2020 20:08:17 +0000",https://www.nytimes.com/2020/11/02/theater/a-p...,Jesse Green,How do you top “West Side Story”? If you’re Le...
1,Johnny Depp Loses Court Case Against Newspaper...,"Mon, 02 Nov 2020 17:01:15 +0000",https://www.nytimes.com/2020/11/02/arts/johnny...,Alex Marshall,LONDON — Johnny Depp on Monday lost his court ...
2,"With New Show, Tschabalala Self Explores Black...","Mon, 02 Nov 2020 18:31:41 +0000",https://www.nytimes.com/2020/11/02/arts/design...,Robin Pogrebin,"NEW HAVEN — It was warm for October, the sun f..."
3,Dancing on Grass and Concrete at New York City...,"Mon, 02 Nov 2020 18:38:35 +0000",https://www.nytimes.com/2020/11/02/arts/dance/...,Gia Kourlas,"When it comes to digital site-specific work, t..."
4,"Dementia ‘Took Its Toll’ on Sean Connery, Wife...","Mon, 02 Nov 2020 20:52:18 +0000",https://www.nytimes.com/2020/11/02/movies/sean...,Sarah Bahr,"Sean Connery, the actor who originated the rol..."


Export results as a csv file.

In [16]:
ny.to_csv('./ny.csv', index=True)#export to csv

## 2. FOX

Use **feedparser** to parse the RSS feed of FOX latest news.

In [7]:
fox_feed = feedparser.parse('http://feeds.foxnews.com/foxnews/latest')#feedparsing rss

Check the result of one article for example.

In [8]:
print(fox_feed['entries'][0])

{'id': 'https://www.foxnews.com/politics/sen-john-kennedy-trump-an-insult-to-the-political-elite', 'guidislink': True, 'link': 'https://www.foxnews.com/politics/sen-john-kennedy-trump-an-insult-to-the-political-elite', 'links': [{'rel': 'alternate', 'type': 'text/html', 'href': 'https://www.foxnews.com/politics/sen-john-kennedy-trump-an-insult-to-the-political-elite'}], 'media_content': [{'url': 'https://static.foxnews.com/foxnews.com/content/uploads/2020/11/image-2020-11-02T210112.135.jpg', 'medium': 'image', 'isdefault': 'true'}, {'url': 'http://a57.foxnews.com60/60/image-2020-11-02T210112.135.jpg', 'medium': 'image', 'width': '60', 'height': '60'}], 'media_thumbnail': [{'url': 'http://a57.foxnews.com60/60/image-2020-11-02T210112.135.jpg', 'width': '60', 'height': '60'}], 'href': '', 'tags': [{'term': 'cc643860-917a-52b8-b49d-860952ad8138', 'scheme': 'foxnews.com/metadata/dc.identifier', 'label': None}, {'term': 'fox-news/media/fox-news-flash', 'scheme': 'foxnews.com/taxonomy', 'labe

For each article in parsed results, save its title, date, link, author, and full text. To get the full text of the article, we use **BeautifulSoup** to parse the html from its link, and select tags with the class of its main body.

In [9]:
#getting metadata
titles2 = []
dates2 = []
links2 = []
authors2 = []
texts2 = []
for article in fox_feed['entries']:
    titles2.append(article['title'])
    dates2.append(article['published'])
    links2.append(article['link'])
    html = rqst.urlopen(article['link'])
    bs = BeautifulSoup(html, features='html.parser')
    targets = bs.select('.article-body')[0].select('p')
    text = ''
    for target in targets:
        text += target.text
        text += ' '
    texts2.append(text)
    if 'author' in article:
        authors2.append(article['author'])
    else:
        authors2.append(None)

Pandas dataframe is created by data saved previously.

In [10]:
#create dataframe
fox_data={"title": titles2, "date": dates2,"link": links2, "author": authors2, "text": texts2}
fox=pd.DataFrame(fox_data)
fox.head()

Unnamed: 0,title,date,link,author,text
0,Kennedy calls Trump 'an insult to the politica...,"Tue, 03 Nov 2020 02:56:32 GMT",https://www.foxnews.com/politics/sen-john-kenn...,Charles Creitz,Louisiana Republican gives his thoughts on why...
1,Olivia Newton-John says Kelly Preston’s death ...,"Tue, 03 Nov 2020 02:56:29 GMT",https://www.foxnews.com/entertainment/olivia-n...,Julius Young,Fox News Flash top entertainment and celebrity...
2,Julianne Hough files for divorce from Brooks L...,"Tue, 03 Nov 2020 02:55:16 GMT",https://www.foxnews.com/entertainment/julianne...,Mariah Haas,Fox News Flash top entertainment and celebrity...
3,Candace Owens warns Democrats over potential E...,"Tue, 03 Nov 2020 02:45:46 GMT",https://www.foxnews.com/media/candace-owens-de...,Yael Halon,Businesses and retailers board up windows acro...
4,Joy Behar urges Fauci to 'quit' after Trump te...,"Tue, 03 Nov 2020 02:40:42 GMT",https://www.foxnews.com/media/joy-behar-anthon...,Joseph Wulfsohn,Joe Biden slams President Trump’s coronavirus ...


Export results as a csv file.

In [17]:
fox.to_csv('./fox.csv', index=True)#export to csv

## 3. CBC

Use **feedparser** to parse the RSS feed of CBC news of top stories section. And check the result of one article for example.

In [11]:
cbc_feed = feedparser.parse('https://rss.cbc.ca/lineup/topstories.xml')#feedparsing rss
print(cbc_feed['entries'][0])

{'title': 'B.C. records staggering number of new COVID-19 cases, Manitoba mulls curfew', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'https://www.cbc.ca/cmlink/rss-topstories', 'value': 'B.C. records staggering number of new COVID-19 cases, Manitoba mulls curfew'}, 'links': [{'rel': 'alternate', 'type': 'text/html', 'href': 'https://www.cbc.ca/news/canada/coronavirus-covid-canada-world-nov2-1.5785926?cmp=rss'}], 'link': 'https://www.cbc.ca/news/canada/coronavirus-covid-canada-world-nov2-1.5785926?cmp=rss', 'id': '1.5433912', 'guidislink': False, 'published': 'Mon, 20 Jan 2020 17:11:33 EST', 'published_parsed': time.struct_time(tm_year=2020, tm_mon=1, tm_mday=20, tm_hour=22, tm_min=11, tm_sec=33, tm_wday=0, tm_yday=20, tm_isdst=0), 'authors': [{}], 'author': '', 'tags': [{'term': 'News', 'scheme': None, 'label': None}], 'summary': "<img src='https://i.cbc.ca/1.5787163.1604362139!/fileImage/httpImage/image.JPG_gen/derivatives/16x9_460/fraser-health-covid-19.JPG' alt=

For each article in parsed results, save its title, date, link, author, and full text. To get the full text of the article, we use **BeautifulSoup** to parse the html from its link, and select tags with the class of its main body.

In [12]:
#getting metadata
titles3 = []
dates3 = []
links3 = []
authors3 = []
texts3 = []
for article in cbc_feed['entries']:
    titles3.append(article['title'])
    dates3.append(article['published'])
    links3.append(article['link'])
    html = rqst.urlopen(article['link'])
    bs = BeautifulSoup(html, features='html.parser')
    targets = bs.find("body").select('.story')
    if targets != []:
        targets = targets[0].select('p')
    text = ''
    for target in targets:
        text += target.text
        text += ' '
    texts3.append(text)
    if 'author' in article:
        authors3.append(article['author'])
    else:
        authors3.append(None)

Pandas dataframe is created by data saved previously.

In [13]:
#create dataframe
cbc_data={"title": titles3, "date": dates3,"link": links3, "author": authors3, "text": texts3}
cbc=pd.DataFrame(cbc_data)
cbc.head()

Unnamed: 0,title,date,link,author,text
0,B.C. records staggering number of new COVID-19...,"Mon, 20 Jan 2020 17:11:33 EST",https://www.cbc.ca/news/canada/coronavirus-cov...,,The latest: Manitoba is considering whether to...
1,Police hunt for gunmen after Vienna rocked by ...,"Mon, 2 Nov 2020 15:13:46 EST",https://www.cbc.ca/news/world/vienna-synagogue...,The Associated Press,Gunmen attacked six locations in central Vienn...
2,"Trump, Biden make final appeals to voters in k...","Sat, 19 Oct 2019 12:37:24 EDT",https://www.cbc.ca/news/world/trump-biden-last...,,With early voter turnout setting a record and ...
3,"Secrecy, conflicts in Collingwood deals point ...","Mon, 2 Nov 2020 21:36:15 EST",https://www.cbc.ca/news/canada/toronto/colling...,CBC News,A tangle of insider relationships among offici...
4,"3 dead, 1 injured after shooting in rural Vanc...","Mon, 2 Nov 2020 12:55:16 EST",https://www.cbc.ca/news/canada/british-columbi...,CBC News,Police are investigating after the discovery S...


Export results as a csv file.

In [18]:
cbc.to_csv('./cbc.csv', index=True)#export to csv

## 4. BBC

Use **feedparser** to parse the RSS feed of BBC news. And check the result of one article for example.

In [14]:
bbc_feed = feedparser.parse('http://feeds.bbci.co.uk/news/rss.xml')#feedparsing rss
print(bbc_feed['entries'][0])

{'title': "Vienna shooting: Gunman hunted after deadly 'terror' attack", 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'http://feeds.bbci.co.uk/news/rss.xml', 'value': "Vienna shooting: Gunman hunted after deadly 'terror' attack"}, 'summary': 'Two people are dead, along with a gunman, after attacks at six different locations, police say.', 'summary_detail': {'type': 'text/html', 'language': None, 'base': 'http://feeds.bbci.co.uk/news/rss.xml', 'value': 'Two people are dead, along with a gunman, after attacks at six different locations, police say.'}, 'links': [{'rel': 'alternate', 'type': 'text/html', 'href': 'https://www.bbc.co.uk/news/world-europe-54786952'}], 'link': 'https://www.bbc.co.uk/news/world-europe-54786952', 'id': 'https://www.bbc.co.uk/news/world-europe-54786952', 'guidislink': False, 'published': 'Tue, 03 Nov 2020 01:36:46 GMT', 'published_parsed': time.struct_time(tm_year=2020, tm_mon=11, tm_mday=3, tm_hour=1, tm_min=36, tm_sec=46, tm_wday=1, tm_yday=

For each article in parsed results, save its title, date, link, author, and full text. To get the full text of the article, we use **BeautifulSoup** to parse the html from its link, and select tags with the class of its main body.

In [15]:
#getting metadata
titles4 = []
dates4 = []
links4 = []
authors4 = []
texts4 = []
for article in bbc_feed['entries']:
    titles4.append(article['title'])
    dates4.append(article['published'])
    links4.append(article['link'])
    html = rqst.urlopen(article['link'])
    bs = BeautifulSoup(html, features='html.parser')
    targets = bs.select('.css-83cqas-RichTextContainer')
    text = ''
    for target in targets:
        text += target.text
        text += ' '
    texts4.append(text)
    if 'author' in article:
        authors4.append(article['author'])
    else:
        authors4.append(None)

Pandas dataframe is created by data saved previously.

In [19]:
#create dataframe
bbc_data={"title": titles4, "date": dates4,"link": links4, "author": authors4, "text": texts4}
bbc=pd.DataFrame(bbc_data)
bbc.head()

Unnamed: 0,title,date,link,author,text
0,Vienna shooting: Gunman hunted after deadly 't...,"Tue, 03 Nov 2020 01:36:46 GMT",https://www.bbc.co.uk/news/world-europe-54786952,,Gunmen armed with rifles have opened fire in s...
1,Covid-19: Liverpool to pilot city-wide coronav...,"Tue, 03 Nov 2020 03:10:31 GMT",https://www.bbc.co.uk/news/health-54786130,,People in Liverpool will be offered regular Co...
2,US Election 2020: Biden and Trump make final p...,"Tue, 03 Nov 2020 02:37:44 GMT",https://www.bbc.co.uk/news/election-us-2020-54...,,US President Donald Trump and his Democratic c...
3,Universities and colleges face Covid funding s...,"Tue, 03 Nov 2020 01:01:24 GMT",https://www.bbc.co.uk/news/education-54780790,,"Universities and colleges in England face ""sig..."
4,US election: A wild three-year campaign in thr...,"Mon, 02 Nov 2020 22:43:10 GMT",https://www.bbc.co.uk/news/election-us-2020-54...,,"Billions of dollars spent, dozens of candidate..."


Export results as a csv file.

In [20]:
bbc.to_csv('./bbc.csv', index=True)#export to csv

# Advanced Text Processing for NY times dataset

In this section we will be using Bag of Words document representation like TF and TF-IDF and word embedding techniques like Word2Vec to extract features that can be used for scikit-learn models. The New York times data obtained above will be used.

#### Data loading, cleaning and preparation

In [24]:
#Load the New York Times data
df = pd.read_csv("ny.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,title,date,link,author,text
0,0,"A Golden Team, a Terrible Title and a Show Tha...","Mon, 02 Nov 2020 20:08:17 +0000",https://www.nytimes.com/2020/11/02/theater/a-p...,Jesse Green,How do you top “West Side Story”? If you’re Le...
1,1,Johnny Depp Loses Court Case Against Newspaper...,"Mon, 02 Nov 2020 17:01:15 +0000",https://www.nytimes.com/2020/11/02/arts/johnny...,Alex Marshall,LONDON — Johnny Depp on Monday lost his court ...
2,2,"With New Show, Tschabalala Self Explores Black...","Mon, 02 Nov 2020 18:31:41 +0000",https://www.nytimes.com/2020/11/02/arts/design...,Robin Pogrebin,"NEW HAVEN — It was warm for October, the sun f..."
3,3,Dancing on Grass and Concrete at New York City...,"Mon, 02 Nov 2020 18:38:35 +0000",https://www.nytimes.com/2020/11/02/arts/dance/...,Gia Kourlas,"When it comes to digital site-specific work, t..."
4,4,"Dementia ‘Took Its Toll’ on Sean Connery, Wife...","Mon, 02 Nov 2020 20:52:18 +0000",https://www.nytimes.com/2020/11/02/movies/sean...,Sarah Bahr,"Sean Connery, the actor who originated the rol..."


In [25]:
# Data cleaning

df.drop(['Unnamed: 0'], axis=1, inplace=True) # Since it is a duplicate of the dataframe indices

#Check if there are any missing values
df.isnull().sum(axis=0)

title     0
date      0
link      0
author    1
text      0
dtype: int64

The number of missing values is low, so we can drop the rows with nan values

In [26]:
df.dropna(inplace=True)

Split the data into training and test datasets

In [27]:
from sklearn.model_selection import train_test_split

#Combine title and text data
df['title_text'] = df['title'] + df['text']
#Separate data into training and testing, maybe do this step as a maybe when there are more data points
df_train,df_test = train_test_split(df,test_size=0.33, random_state=42)

#### Helper function for text processing using TF and TF-IDF

The sklearn model object for TF and TF-IDF is the same. i.e. if is_tfidf parameter is true, then the object is a TF-IDF vectorizer, otherwise it's TF.

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer

def text_processing(text, is_tfidf):
    vectorizer = TfidfVectorizer(stop_words = 'english', 
                                 use_idf = is_tfidf, 
                                 max_features = 300, 
                                 ngram_range = (2,3)).fit(text)
    return vectorizer

## Option 1: Combine title and text

In general, the title of the article is related to the content of the article. So, in this context, combining the text and title before extracting features seems reasonable.

### Document Representation based on TF (Term Frequency)

Term frequency depends on the frequency of a word in the current article. Here, we create a TF vectorizer for the words in the article. We have limited the number of features to be 300 or less, used the common english stopwords provide by the model.

In [29]:
#transform the train and test articles to vectors
vectorizer = text_processing(df_train['title_text'], False)
train_text_rep = vectorizer.transform(df_train['title_text'])
test_text_rep = vectorizer.transform(df_test['title_text'])

#Replace the article's text data (title and text) with their vector representations
X_train_tf = pd.concat([df_train.drop(['title', 'text', 'title_text'], axis=1),
           pd.DataFrame(data = train_text_rep.toarray(), columns = vectorizer.get_feature_names())], axis = 1)

X_test_tf = pd.concat([df_test.drop(['title', 'text', 'title_text'], axis=1),
           pd.DataFrame(data = test_text_rep.toarray(), columns = vectorizer.get_feature_names())], axis = 1)
X_test_tf.head()

Unnamed: 0,date,link,author,12 years,30 years,50 cent,590 shirt,631 24,99 ages,aghdashloo work,...,wrong kind,wrong kind black,year old,years ago,years later,years old,york city,york times,young adult,young man
0,"Mon, 02 Nov 2020 20:08:17 +0000",https://www.nytimes.com/2020/11/02/theater/a-p...,Jesse Green,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.447214,0.0,0.447214,0.0,0.0,0.447214,0.0,0.0
1,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.196116,0.0,0.0,0.196116,0.0,0.0
3,"Mon, 02 Nov 2020 18:38:35 +0000",https://www.nytimes.com/2020/11/02/arts/dance/...,Gia Kourlas,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.204124,0.204124,0.0,0.0,0.204124,0.0,0.0,0.0
4,"Mon, 02 Nov 2020 20:52:18 +0000",https://www.nytimes.com/2020/11/02/movies/sean...,Sarah Bahr,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.204124,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Document Representation based on TF-IDF (Term Frequency-Inverse Document Frequency)

Using TF is problematic because words that are frequent but not necessarily useful (e.g. the) will have a high score. TF-IDF, combines TF and IDF, which measures the how rare a word is accross articles to solve the limitation. The settings for the model are the same as TF above expcept is_df which is now true to indicate this is an TF-IDF vectorizer.

In [30]:
#transform the train and test articles to vectors
vectorizer = text_processing(df_train['title_text'], True)
train_text_rep = vectorizer.transform(df_train['title_text'])
test_text_rep = vectorizer.transform(df_test['title_text'])

#Replace the article's text data (title and text) with their vector representations
X_train_tfidf = pd.concat([df_train.drop(['title', 'text', 'title_text'], axis=1),
           pd.DataFrame(data = train_text_rep.toarray(), columns = vectorizer.get_feature_names())], axis = 1)

X_test_tfidf = pd.concat([df_test.drop(['title', 'text', 'title_text'], axis=1),
           pd.DataFrame(data = test_text_rep.toarray(), columns = vectorizer.get_feature_names())], axis = 1)
X_test_tfidf.head()

Unnamed: 0,date,link,author,12 years,30 years,50 cent,590 shirt,631 24,99 ages,aghdashloo work,...,wrong kind,wrong kind black,year old,years ago,years later,years old,york city,york times,young adult,young man
0,"Mon, 02 Nov 2020 20:08:17 +0000",https://www.nytimes.com/2020/11/02/theater/a-p...,Jesse Green,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.459748,0.0,0.506595,0.0,0.0,0.459748,0.0,0.0
1,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.212392,0.0,0.0,0.192751,0.0,0.0
3,"Mon, 02 Nov 2020 18:38:35 +0000",https://www.nytimes.com/2020/11/02/arts/dance/...,Gia Kourlas,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.209923,0.278945,0.0,0.0,0.303431,0.0,0.0,0.0
4,"Mon, 02 Nov 2020 20:52:18 +0000",https://www.nytimes.com/2020/11/02/movies/sean...,Sarah Bahr,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.16685,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Option 2:  Separate title and text data

In option 1, we combined article's title and text together because they are closely related. So, it might be impractical to vectorize them separately as we may end up with redundant columns. In general however, if  there are text features that are not related, one might want to vectorize them separately. Below, we show how the two features can be vectorized separately.

### Document Representation based on TF

In [31]:
# On Training

#Vectorize the title and text data separately
title_vectorizer = text_processing(df_train['title'], False)
title_rep = title_vectorizer.transform(df_train['title'])
text_vectorizer = text_processing(df_train['text'], False)
text_rep = text_vectorizer.transform(df_train['text'])

X_train_tf = pd.concat([df_train.drop(['title', 'text'], axis=1),
           pd.DataFrame(data = title_rep.toarray(), columns = title_vectorizer.get_feature_names()),
           pd.DataFrame(data = text_rep.toarray(), columns = text_vectorizer.get_feature_names())], axis = 1)

#On test data
#transform the test data's title and text data separately
title_rep = title_vectorizer.transform(df_test['title'])
text_rep = text_vectorizer.transform(df_test['text'])

X_test_tf = pd.concat([df_test.drop(['title', 'text'], axis=1),
           pd.DataFrame(data = title_rep.toarray(), columns = title_vectorizer.get_feature_names()),
           pd.DataFrame(data = text_rep.toarray(), columns = text_vectorizer.get_feature_names())], axis = 1)
X_test_tf.head()

Unnamed: 0,date,link,author,title_text,2020 scaffolding,590 scratch,590 scratch sniff,70 years,70 years screams,activism art,...,work art,worry dolls,year old,years ago,years later,years old,york city,york times,young adult,young man
0,"Mon, 02 Nov 2020 20:08:17 +0000",https://www.nytimes.com/2020/11/02/theater/a-p...,Jesse Green,"A Golden Team, a Terrible Title and a Show Tha...",0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.447214,0.0,0.447214,0.0,0.0,0.447214,0.0,0.0
1,,,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,,,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.196116,0.0,0.0,0.196116,0.0,0.0
3,"Mon, 02 Nov 2020 18:38:35 +0000",https://www.nytimes.com/2020/11/02/arts/dance/...,Gia Kourlas,Dancing on Grass and Concrete at New York City...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.19245,0.19245,0.0,0.0,0.19245,0.0,0.0,0.0
4,"Mon, 02 Nov 2020 20:52:18 +0000",https://www.nytimes.com/2020/11/02/movies/sean...,Sarah Bahr,"Dementia ‘Took Its Toll’ on Sean Connery, Wife...",0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.204124,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Document Representation based on TF-IDF

In [32]:
# On Training

title_vectorizer = text_processing(df_train['title'], True)
title_rep = title_vectorizer.transform(df_train['title'])
text_vectorizer = text_processing(df_train['text'], True)
text_rep = text_vectorizer.transform(df_train['text'])

X_train_tfidf = pd.concat([df_train.drop(['title', 'text'], axis=1),
           pd.DataFrame(data = title_rep.toarray(), columns = title_vectorizer.get_feature_names()),
           pd.DataFrame(data = text_rep.toarray(), columns = text_vectorizer.get_feature_names())], axis = 1)

#On test data

title_rep = title_vectorizer.transform(df_test['title'])
text_rep = text_vectorizer.transform(df_test['text'])

X_test_tfidf = pd.concat([df_test.drop(['title', 'text'], axis=1),
           pd.DataFrame(data = title_rep.toarray(), columns = title_vectorizer.get_feature_names()),
           pd.DataFrame(data = text_rep.toarray(), columns = text_vectorizer.get_feature_names())], axis = 1)

In [33]:
X_test_tfidf.head()

Unnamed: 0,date,link,author,title_text,2020 scaffolding,590 scratch,590 scratch sniff,70 years,70 years screams,activism art,...,work art,worry dolls,year old,years ago,years later,years old,york city,york times,young adult,young man
0,"Mon, 02 Nov 2020 20:08:17 +0000",https://www.nytimes.com/2020/11/02/theater/a-p...,Jesse Green,"A Golden Team, a Terrible Title and a Show Tha...",0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.459748,0.0,0.506595,0.0,0.0,0.459748,0.0,0.0
1,,,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,,,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.212392,0.0,0.0,0.192751,0.0,0.0
3,"Mon, 02 Nov 2020 18:38:35 +0000",https://www.nytimes.com/2020/11/02/arts/dance/...,Gia Kourlas,Dancing on Grass and Concrete at New York City...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.184232,0.244807,0.0,0.0,0.266296,0.0,0.0,0.0
4,"Mon, 02 Nov 2020 20:52:18 +0000",https://www.nytimes.com/2020/11/02/movies/sean...,Sarah Bahr,"Dementia ‘Took Its Toll’ on Sean Connery, Wife...",0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.16685,0.0,0.0,0.0,0.0,0.0,0.0,0.0


##  Word embedding techniques

Word embedding techniques represents each word as a vector, in such a way that the context of the word is captured by that representation. That is, vector representation of words that are used in similar ways (e.g. mother, and father) are closer in the vector space.

Some of the most popular word embedding techniques include: Word2Vec and GloVe

In this section, the Word2Vec is used to get word embeddings for the New York Times dataset. Given the lack of a big dataset, we will use Google Word2Vec pre-trained model. This model fits the context given that it is a News model trained on 100 billion words. 

In [34]:
#import needed libraries
import gensim
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import numpy as np

In [35]:
# import stop words from the nltk library
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Load Word2Vec pre-trained Google News model
#Can be downloaded here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary = True) 

# Convert the article's text into a list of texts
text_list = [text for text in df['title_text']]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\merci\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Data pre-processing

In this stage, we want to clean the data before we use the model. This includes:
* Basic data cleaning including removing stop words, punctuation, uppercases
* Remove words that are not part of the pre-trained model vocabulary
* Remove articles that are empty
* Remove articles where none of its words is in the pre-trained model vocabulary

In [36]:
import nltk
nltk.download('punkt')

#FUnction for basic preprocessing
def preprocess(text):
    text = text.lower()
    article = word_tokenize(text)
    #remove stop words
    article = [word for word in article if word not in stop_words]
    #remove punctuation
    article = [word for word in article if word.isalpha()] 
    #remove words that are not in the pre-trained model vocabulary
    article = [word for word in article if word in model.vocab]
    return article

collection = [preprocess(text) for text in text_list]

#Remove any empty articles
collection = [article for article in collection if len(article) > 0]

# Want to remove all articles with none of its words in the pretrained model's vocabulary
collection = [article for article in collection if not all(word not in model.vocab for word in article)]

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\merci\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### Using the pre-trained model to vectorize data

Here, we vectorize the words using the pre-trained model. Note that calling the model on each article returns a 300 by the number of words in article vector. However, this is not practical if the data is going to be used for most scikit-learn algorithms. So, for each article vector representation, we average over word vectors in the article. 

In [37]:
X = []
for article in collection:
    X.append(np.mean(model[article], axis=0)) #average over the word vectors of the current article

#Replave the title and text features with their vector representations
X = pd.concat([df.drop(['title', 'text', 'title_text'], axis=1),
           pd.DataFrame(data = X)], axis = 1)
X.head()

Unnamed: 0,date,link,author,0,1,2,3,4,5,6,...,290,291,292,293,294,295,296,297,298,299
0,"Mon, 02 Nov 2020 20:08:17 +0000",https://www.nytimes.com/2020/11/02/theater/a-p...,Jesse Green,0.049907,0.034578,0.006905,0.064221,-0.027185,-0.016675,0.040625,...,-0.068019,0.016593,-0.088471,0.007614,-0.031239,-0.023109,0.019094,-0.053758,0.030256,0.000952
1,"Mon, 02 Nov 2020 17:01:15 +0000",https://www.nytimes.com/2020/11/02/arts/johnny...,Alex Marshall,0.001007,0.021372,-0.001012,0.028865,-0.024952,0.007513,0.058422,...,-0.027191,0.001041,-0.042919,0.017026,-0.026761,-0.003498,0.026365,-0.061626,0.030746,0.008071
2,"Mon, 02 Nov 2020 18:31:41 +0000",https://www.nytimes.com/2020/11/02/arts/design...,Robin Pogrebin,0.043229,0.04552,-0.006456,0.077451,-0.030619,0.000727,0.053678,...,-0.106418,0.013027,-0.086506,0.005425,-0.049713,-0.015565,0.057395,-0.037284,0.036528,-0.027423
3,"Mon, 02 Nov 2020 18:38:35 +0000",https://www.nytimes.com/2020/11/02/arts/dance/...,Gia Kourlas,0.046235,0.024736,0.011516,0.069366,-0.043427,-0.035909,0.041845,...,-0.08195,0.014459,-0.094453,0.003957,-0.019149,-0.012099,0.055602,-0.047661,0.008169,0.007202
4,"Mon, 02 Nov 2020 20:52:18 +0000",https://www.nytimes.com/2020/11/02/movies/sean...,Sarah Bahr,0.0403,0.039894,-0.016884,0.078152,-0.030645,0.013234,0.053049,...,-0.046308,0.031506,-0.095556,-0.010872,-0.047558,0.001148,-0.035881,-0.034077,0.046425,-0.022088
