# Part 2: Classification of News Articles from Different Sources

This is Part 2 of my AM216 project on simulating and differentiating the Harvard Crimson and the Harvard Gazette. Please go to Part 1 for the intro, scraping, data generation with RNN, and exploratory analyses. 

There is lots of literature on classifying news articles by category (Sports, Arts, Tech, etc.) but I couldn't find any research on classifying news articles from different news sources. However, I think this could be very interesting because it shows how different different news sources may be despite discussing many of the same things. A human could easily read an article and tell you whether it belongs in Sports or Tech, for example. But unless they know the tone of the papers very well they would have a much harder time telling you whether the article came from the Harvard Gazette or the Harvard Crimson. Would a computer be able to distinguish them easily?

This part classifies news articles by tokenizing, word-stemming/lemmatizing the data into a set of words with NLTK, vectorizing with TFIDF, and fitting an SVM. I used [this tutorial](https://medium.com/@bedigunjit/simple-guide-to-text-classification-nlp-using-svm-and-naive-bayes-with-python-421db3a72d34). 

While one can classify text with pretty much any classification method, as seen [here](https://github.com/miguelfzafra/Latest-News-Classifier/blob/master/0.%20Latest%20News%20Classifier/04.%20Model%20Training/12.%20Best%20Model%20Selection.ipynb), it's been shown that SVMs generally perform the best. ![](classification.png)

To begin, I import packages:

In [2]:
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /Users/terry/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/terry/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /Users/terry/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Random seed for reproducible results:

In [3]:
np.random.seed(500)

## Harvard Gazette vs. Harvard Crimson

Building dataframes with all Gazette and Crimson articles using my original and my generated data (stored in .txt files in Part 1). I labeled Crimson articles as 1 and the Gazette articles as 0. 

In [16]:
Corpus_gen=pd.DataFrame(columns=['text', 'label'])

for i in range(100):
    with open("articles/gen/crimson"+str(i)+".txt", "r") as file:
        text=file.read()
        Corpus_gen=Corpus_gen.append({'text': text, 'label': 1}, ignore_index=True)
    
for i in range(144):
    with open("articles/gen/gazette"+str(i)+".txt", "r") as file:
        text=file.read()
        Corpus_gen=Corpus_gen.append({'text': text, 'label': 0}, ignore_index=True)
 

Corpus_og=pd.DataFrame(columns=['text', 'label'])

for i in range(47):
    with open("articles/og/crimson"+str(i+100)+".txt", "r") as file:
        text=file.read()
        Corpus_og=Corpus_og.append({'text': text, 'label': 1}, ignore_index=True)
    
for i in range(24):
    with open("articles/og/gazette"+str(i+144)+".txt", "r") as file:
        text=file.read()
        Corpus_og=Corpus_og.append({'text': text, 'label': 0}, ignore_index=True)

I clean my data by changing everything to lowercase, splitting it into words, and doing word stemming/lemmatizing. Word stemming/lemmatizing essentially breaks words down to their roots and grouping variant forms of a word. "Run" and "running", for example, would be interpreted similarly.

Then, I split the data into a train and a test set. I do the generated data and the original data separately because my generated data is obviously less accurate than the original data and I want more of my better data to go in the test set, since it's been shown that noisy data in training is fine as long as the test set is accurate. 

In [17]:
# For the generated data

# Change all the text to lower case
Corpus_gen['text'] = [entry.lower() for entry in Corpus_gen['text']]

# Tokenization : In this each entry in the corpus will be broken into set of words
Corpus_gen['text']= [word_tokenize(entry) for entry in Corpus_gen['text']]

# Remove stop words, non-numeric and perfom word stemming/lemmatizing

# WordNetLemmatizer requires Pos tags to understand if the word is noun or verb or adjective etc. By default it is set to Noun
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV


for index,entry in enumerate(Corpus_gen['text']):
    # Declaring empty list to store the words that follow the rules for this step
    Final_words = []
    # Initializing WordNetLemmatizer()
    word_Lemmatized = WordNetLemmatizer()
    # pos_tag function below will provide the 'tag' i.e if the word is Noun(N) or Verb(V) or something else.
    for word, tag in pos_tag(entry):
        # Below condition is to check for stop words and consider only alphabets
        if word not in stopwords.words('english') and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            Final_words.append(word_Final)
    # The final processed set of words for each iteration will be stored in 'text_final'
    Corpus_gen.loc[index,'text_final'] = str(Final_words)


# Split the into train and test set, since this is generated data 85% will go into the training set
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(Corpus_gen['text_final'],Corpus_gen['label'],test_size=0.15)

In [18]:
# Change all the text to lower case
Corpus_og['text'] = [entry.lower() for entry in Corpus_og['text']]

# Tokenization
Corpus_og['text']= [word_tokenize(entry) for entry in Corpus_og['text']]

# Remove stop words, non-numeric and perfom word stemming/lemmatizing

# WordNetLemmatizer requires Pos tags to understand if the word is noun or verb or adjective etc. By default it is set to Noun
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV


for index,entry in enumerate(Corpus_og['text']):
    # Declaring empty list to store the words that follow the rules for this step
    Final_words = []
    # Initializing WordNetLemmatizer()
    word_Lemmatized = WordNetLemmatizer()
    # pos_tag function below will provide the 'tag' i.e if the word is Noun(N) or Verb(V) or something else.
    for word, tag in pos_tag(entry):
        # Below condition is to check for stop words and consider only alphabets
        if word not in stopwords.words('english') and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            Final_words.append(word_Final)
    # The final processed set of words for each iteration will be stored in 'text_final'
    Corpus_og.loc[index,'text_final'] = str(Final_words)


# Split the model into train and test data set, since this is the original data 30% will go into the test set
Train_Xadd, Test_Xadd, Train_Yadd, Test_Yadd = model_selection.train_test_split(Corpus_og['text_final'],Corpus_og['label'],test_size=0.3)

Now I combine my original and generated data to make my final train and test sets:

In [19]:
Train_X=Train_X.append(Train_Xadd)
Test_X=Test_X.append(Test_Xadd)
Train_Y=Train_Y.append(Train_Yadd)
Test_Y=Test_Y.append(Test_Yadd)

Corpus=Corpus_gen.append(Corpus_og)

Making my data numeric and vectorizing with TFIDF: Term Frequency * Inverse Document Frequency. This is done to find how important a word in document is in comparison to the corpus.

In [20]:
# Label encode the target variable: transforming categorical data of string type in the data set into numerical values
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
Test_Y = Encoder.fit_transform(Test_Y)

# Vectorize the words by using TF-IDF Vectorizer
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Corpus['text_final'])

Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)

Finally, using an SVM classifier:

In [21]:
# Classifier - Algorithm - SVM
# fit the training dataset on the classifier
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf,Train_Y)

# predict the labels on validation dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)

# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Y)*100)

SVM Accuracy Score ->  94.91525423728814


95% accuracy is pretty good! That suggests Harvard Crimson articles do differ from Harvard Gazette articles, despite all being about the same subject, Harvard. There must be differences in the language used, whether unintentional due to different writers or intentional in the tone/POV they want to convey.

## Comparing Generated vs. Original Data

I thought it would be interesting to try and classify my generated vs. my original data, since it gives a metric on how well my RNN in Part 1 performed in generation. Of course, my loss in Part 1 is also such a metric, but how well an SVM performs is easier to interpret. Ideally, I'd want close to 50% accuracy since that means my generated and original articles would be so close the SVM can't do better than random guessing.

Building my combined dataframe from the .txt files with generated articles labeled as 0 and original articles labeled as 1.

In [22]:
Corpus_comp=pd.DataFrame(columns=['text', 'label'])

for i in range(100):
    with open("articles/gen/crimson"+str(i)+".txt", "r") as file:
        text=file.read()
        Corpus_comp=Corpus_comp.append({'text': text, 'label': 0}, ignore_index=True)
    
for i in range(144):
    with open("articles/gen/gazette"+str(i)+".txt", "r") as file:
        text=file.read()
        Corpus_comp=Corpus_comp.append({'text': text, 'label': 0}, ignore_index=True)

for i in range(47):
    with open("articles/og/crimson"+str(i+100)+".txt", "r") as file:
        text=file.read()
        Corpus_comp=Corpus_comp.append({'text': text, 'label': 1}, ignore_index=True)
    
for i in range(24):
    with open("articles/og/gazette"+str(i+144)+".txt", "r") as file:
        text=file.read()
        Corpus_comp=Corpus_comp.append({'text': text, 'label': 1}, ignore_index=True)

The same data cleaning and 80%-20% train-test split:

In [23]:
# Change all the text to lower case
Corpus_comp['text'] = [entry.lower() for entry in Corpus_comp['text']]

# Tokenization
Corpus_comp['text']= [word_tokenize(entry) for entry in Corpus_comp['text']]

# Remove stop words, non-numeric and perfom word stemming/lemmatization

# WordNetLemmatizer requires Pos tags to understand if the word is noun or verb or adjective etc. By default it is set to Noun
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV


for index,entry in enumerate(Corpus_comp['text']):
    # Declaring empty list to store the words that follow the rules for this step
    Final_words = []
    # Initializing WordNetLemmatizer()
    word_Lemmatized = WordNetLemmatizer()
    # pos_tag function below will provide the 'tag' i.e if the word is Noun(N) or Verb(V) or something else.
    for word, tag in pos_tag(entry):
        # Below condition is to check for stop words and consider only alphabets
        if word not in stopwords.words('english') and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            Final_words.append(word_Final)
    # The final processed set of words for each iteration will be stored in 'text_final'
    Corpus_comp.loc[index,'text_final'] = str(Final_words)


# Split the model into train and test
Train_Xc, Test_Xc, Train_Yc, Test_Yc = model_selection.train_test_split(Corpus_comp['text_final'],Corpus_comp['label'],test_size=0.2)

Again, label encoding and vectorizing with TFIDF:

In [24]:
# Label encode the target variable
Encoder = LabelEncoder()
Train_Yc = Encoder.fit_transform(Train_Yc)
Test_Yc = Encoder.fit_transform(Test_Yc)

# Vectorize the words by using TF-IDF Vectorizer
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Corpus_comp['text_final'])

Train_Xc_Tfidf = Tfidf_vect.transform(Train_Xc)
Test_Xc_Tfidf = Tfidf_vect.transform(Test_Xc)

SVM classification:

In [25]:
# Classifier - Algorithm - SVM
# fit the training dataset on the classifier
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_Xc_Tfidf,Train_Yc)

# predict the labels on validation dataset
predictions_SVM = SVM.predict(Test_Xc_Tfidf)

# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Yc)*100)

SVM Accuracy Score ->  80.95238095238095


The SVM is much less accurate than when we compared Crimson and Gazette articles. This makes sense since our RNNs trained on the Crimson vs the Gazette should generate articles with similar vocabularies as the Crimson and the Gazette themselves! However, we still obtain an accuracy quite a bit higher than 50% meaning our generated articles are still pretty distinguishable from the original ones (which a reader would be able to tell, also). However, this also means that about a fifth of the time the SVM can't tell the difference between generated and original articles. 

A way to continue this project may be to try and develop a text GAN, where the classifier and the generator train each other. This is more difficult than image GANs because RNNs work in a chain so it can be hard to integrate what is already technically a combination of networks. However, there has been research published on this. 

## Beyond the Gazette vs. the Crimson

The Harvard Gazette and the Harvard Crimson are both about Harvard, and can be distinguished. Can this be applied to other news sources? I tested SVM classification on the New York Times and and Wall Street Journal. Both are two very popular newspapers based in New York, so similar subjects should be covered and a human might not be able to classify the articles. 

Scraping with the Python newspaper package:

In [28]:
# NYT vs WSJ
import newspaper

Corpus_other=pd.DataFrame(columns=['text', 'label'])

nyt_paper = newspaper.build('http://nytimes.com', memoize_articles=False)
wsj_paper = newspaper.build('http://wsj.com', memoize_articles=False)  

for article in nyt_paper.articles:
    article.download()
    article.parse()
    Corpus_other=Corpus_other.append({'text': article.text, 'label': 1}, ignore_index=True)

for article in wsj_paper.articles:
    article.download()
    article.parse()
    Corpus_other=Corpus_other.append({'text': article.text, 'label': 0}, ignore_index=True)

In [29]:
print(len(nyt_paper.articles))
print(len(wsj_paper.articles))

137
261


As can be seen, I don't need to generate data because plenty of original articles can be scraped from their category pages and RSS feeds (137 articles from the New York Times and 261 articles from the Wall Street Journal).

The same classification process:

In [30]:
# Change all the text to lower case
Corpus_other['text'] = [entry.lower() for entry in Corpus_other['text']]

# Tokenization
Corpus_other['text']= [word_tokenize(entry) for entry in Corpus_other['text']]

# Remove stop words, non-numeric and perfom word stemming/lemmatization

# WordNetLemmatizer requires Pos tags to understand if the word is noun or verb or adjective etc. By default it is set to Noun
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV


for index,entry in enumerate(Corpus_other['text']):
    # Declaring empty list to store the words that follow the rules for this step
    Final_words = []
    # Initializing WordNetLemmatizer()
    word_Lemmatized = WordNetLemmatizer()
    # pos_tag function below will provide the 'tag' i.e if the word is Noun(N) or Verb(V) or something else.
    for word, tag in pos_tag(entry):
        # Below condition is to check for stop words and consider only alphabets
        if word not in stopwords.words('english') and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            Final_words.append(word_Final)
    # The final processed set of words for each iteration will be stored in 'text_final'
    Corpus_other.loc[index,'text_final'] = str(Final_words)


# Split the model into train and test data set
Train_Xo, Test_Xo, Train_Yo, Test_Yo = model_selection.train_test_split(Corpus_other['text_final'],Corpus_other['label'],test_size=0.2)

# Label encoding
Encoder = LabelEncoder()
Train_Yo = Encoder.fit_transform(Train_Yo)
Test_Yo = Encoder.fit_transform(Test_Yo)

# Vectorize the words with TF-IDF
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Corpus['text_final'])

Train_Xo_Tfidf = Tfidf_vect.transform(Train_Xo)
Test_Xo_Tfidf = Tfidf_vect.transform(Test_Xo)

# SVM Classifier
# fit the training dataset on the classifier
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_Xo_Tfidf,Train_Yo)

# predict the labels on validation dataset
predictions_SVM = SVM.predict(Test_Xo_Tfidf)

# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Yo)*100)

SVM Accuracy Score ->  91.25


At 90%+ classification was still pretty good using all original articles from two prestigious New York-based newspapers. This means that there is a difference in the kinds of words used here too, whether unintentional due to different writers or intentional due to tone/POV they wanted to convey.

## Conclusions

Crimson and Gazette articles generated with RNNs can represent the original articles to some extent, as it was relatively difficult to classify generated and original articles. A text GAN might be something to explore, to generate even better articles. 

Though articles from different papers may be about similar subjects and of similar lengths and sentiments, they are still quite easily distinguished with a SVM. The differences in words used may be due to difference in authors and the tone they try to convey. This can easily be extended to classifying articles from multiple papers, with more labels. 