# AM216 Final Project: Generating and Classifying News Articles from Different Sources

### Terry Ni

This project trains RNNs to generate articles in the style of the Harvard Crimson and the Harvard Gazette, then analyzes whether the two styles can be distinguished using exploratory data analysis and SVM classification.

## Scraping the Crimson

I scraped articles with the Python package [newspaper](https://newspaper.readthedocs.io/en/latest/index.html). Newspaper takes the homepage, category pages, and RSS feeds of a news site and collects the articles linked from them. (The Crimson doesn't currently have RSS feeds, but the Gazette does.)

In [6]:
import newspaper

# scraping the Crimson
crimson_paper = newspaper.build('http://www.thecrimson.com', memoize_articles=False)

# printing categories
for category in crimson_paper.category_urls():
    print(category)

# printing the articles collected
for article in crimson_paper.articles:
    print("article:", article.url)
    
# printing the number of articles collected
print(len(crimson_paper.articles))

http://www.thecrimson.com
http://www.thecrimson.com/flyby
http://www.thecrimson.com/todays-paper
http://globalprograms.thecrimson.com
article: http://www.thecrimson.com/section/news/
article: http://www.thecrimson.com/section/media/
article: http://www.thecrimson.com/article/2020/5/12/harvard-coronavirus-travel-restricted-indefinitely/
article: http://www.thecrimson.com/article/2020/5/11/april-thefts-increase/
article: http://www.thecrimson.com/article/2020/5/11/coronavirus-federal-job-guarantee/
article: http://www.thecrimson.com/article/2020/5/11/current-coronavirus-respiratory-treatment/
article: http://www.thecrimson.com/article/2020/5/11/commencement-retrospective/
article: http://www.thecrimson.com/article/2020/5/6/harvard-coronavirus-resident-tutors/
article: http://www.thecrimson.com/article/2020/5/4/cpd-tweets-kennedy-markey/
article: http://www.thecrimson.com/article/2020/5/9/harvard-net-zero-experts/
article: http://www.thecrimson.com/article/2020/5/8/hms-faces-fy20-losses/
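
The listing above includes section index pages (for example `/section/news/`) alongside real stories. Here is a minimal sketch of how they could be filtered out, assuming story URLs always contain `/article/` as they do in the output above. `is_story_url` is a hypothetical helper, not part of newspaper:

```python
def is_story_url(url):
    """Return True for URLs that look like individual Crimson stories."""
    # Assumption: stories live under /article/, index pages under /section/.
    return "/article/" in url

sample_urls = [
    "http://www.thecrimson.com/section/news/",
    "http://www.thecrimson.com/article/2020/5/11/april-thefts-increase/",
]

# Keep only the URLs that pass the check
story_urls = [u for u in sample_urls if is_story_url(u)]
print(story_urls)
```

The same check could be applied to `article.url` before downloading. The cells in this notebook do not apply such a filter, so the corpus and the article counts include whatever text the index pages yield.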


## Prepping Crimson data

To build the corpus for training my RNN (and for easier exploratory text analysis), I concatenated all my articles into one string. I also tried a version of the training where I concatenated the headlines into the string as well, but I got higher loss. More importantly, in Part 2 I'll be classifying based on article text, so it was more useful to generate articles from article text only.

In [151]:
crimson=''
for crimson_article in crimson_paper.articles:
    crimson_article.download()
    crimson_article.parse()
    toadd = crimson_article.text+'\n\n'
    crimson += toadd

# Length of concatenated string
print(len(crimson))

221033


The Shakespeare model in the Week 8 section notebook on RNNs was trained on a corpus of about 1,000,000 characters. My corpus is much smaller, but this seemed to work fine and give decent loss, as you can see later on.

In [152]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

import numpy as np
import os
import time
import re

Crimson articles all have a statement about the authors at the end (e.g. "-Staff writer...") so I removed these with a regex:

In [204]:
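# The optional Twitter sentence must be a non-capturing group (?:...)?;
# square brackets would define a character class instead. Literal dots are escaped.
crimson=re.sub(r'\n—.* can be reached at .*@thecrimson\.com\.(?: Follow (?:him|her) on Twitter @[^\n]*)?', '', crimson)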
print(crimson)

•

Since students’ departure, House resident tutors have balanced their responsibilities for their students, Houses, and jobs — all the while taking care of themselves in the middle of an unprecedented time in the history of Harvard’s campus.

•

In what can only be described as dominant, the Harvard men’s squash team routed the competition at every step on its way to the CSA title from Feb. 28, 2020, to March 1, 2020, at the Harvard Murr Center. The team played Drexel, Princeton, and Penn on Friday, Saturday, and Sunday on its way to the championship, losing only one match over the entire weekend.

Harvard Medical School is facing losses between $39 million and $65 million for the current fiscal year, Dean George Q. Daley ’82 announced in an email to affiliates Thursday.

Daley wrote that a “significant portion” of the deficit stems from the decision to forgive a fiscal contribution from the Medical School’s affiliated hospitals.

Unlike most medical schools, Harvard Medical School do
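
To check what the byline regex does, here is a small sketch on a made-up sample. The writer names, email addresses, and Twitter handle are hypothetical, and the pattern uses a non-capturing group `(?:...)?` for the optional Twitter sentence. The dash is written as `\u2014` so the character is unambiguous:

```python
import re

# Byline pattern: newline, dash, "... can be reached at ...@thecrimson.com.",
# optionally followed by a "Follow him/her on Twitter @handle" sentence.
BYLINE = re.compile(
    r'\n\u2014.* can be reached at .*@thecrimson\.com\.'
    r'(?: Follow (?:him|her) on Twitter @[^\n]*)?'
)

# Hypothetical sample text with two bylines, one with a Twitter handle
sample = (
    "First article body.\n"
    "\u2014Staff writer Jane Doe can be reached at jane.doe@thecrimson.com."
    " Follow her on Twitter @janedoe.\n\n"
    "Second article body.\n"
    "\u2014Staff writer John Roe can be reached at john.roe@thecrimson.com.\n\n"
)

cleaned = BYLINE.sub('', sample)
print(repr(cleaned))
```

Both bylines are removed, with or without the Twitter sentence, and the article bodies are left intact.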

Here I further prep my data for training as in the section notebook by converting the characters to numerical indices and splitting it into chunks: 

In [205]:
# The unique characters in the file
vocab = sorted(set(crimson))
print ('{} unique characters'.format(len(vocab)))

98 unique characters


In [207]:
# Creating a mapping from unique characters to indices
char2idxcrim = {u:i for i, u in enumerate(vocab)}
idx2charcrim = np.array(vocab)

text_as_int = np.array([char2idxcrim[c] for c in crimson])
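
As a quick illustration of the mapping, here is a sketch on a single word instead of the full corpus. The names `sample_vocab`, `sample_char2idx`, and `sample_idx2char` are made up for this example so they don't overwrite the notebook's variables:

```python
sample = "Harvard"

# Sorted unique characters: uppercase letters sort before lowercase ones
sample_vocab = sorted(set(sample))

# Character -> index and index -> character lookups
sample_char2idx = {ch: i for i, ch in enumerate(sample_vocab)}
sample_idx2char = list(sample_vocab)

# Encode the word as integers, then decode it back
encoded = [sample_char2idx[ch] for ch in sample]
decoded = "".join(sample_idx2char[i] for i in encoded)

print(sample_vocab)  # ['H', 'a', 'd', 'r', 'v']
print(encoded)       # [0, 1, 3, 4, 1, 3, 2]
print(decoded)       # Harvard
```

The round trip is lossless, which is what lets the generated indices be turned back into text later.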

In [156]:
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(crimson)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)
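
The input/target split is easier to see on a plain string, since slicing works the same way as on the integer chunks. This sketch redefines the function under a different name (`split_demo`, hypothetical) so it can run on its own:

```python
def split_demo(chunk):
    # Input drops the last character; target drops the first,
    # so target[i] is the character that follows input[i].
    return chunk[:-1], chunk[1:]

input_text, target_text = split_demo("Crimson")
print(input_text)   # Crimso
print(target_text)  # rimson
```

At each position the network sees one character of the input and is trained to predict the character in the same position of the target, which is the next character in the text.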

In [157]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>

## Training RNN on Crimson data

In [158]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

Here the RNN model is built as in the section notebook: 

In [159]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.GRU(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

In [160]:
model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 98) # (batch_size, sequence_length, vocab_size)


Loss is defined:

In [161]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
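
For intuition about what this loss measures, here is a plain-Python sketch of sparse categorical cross-entropy from logits for a single character. `sparse_ce_from_logits` is a hypothetical helper written for illustration, not the TensorFlow implementation:

```python
import math

def sparse_ce_from_logits(label, logits):
    """Cross-entropy -log(softmax(logits)[label]) for one prediction."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

n_chars = 98  # size of the Crimson vocabulary

# A model with no preference assigns equal logits to every character
baseline = sparse_ce_from_logits(0, [0.0] * n_chars)

# A model that strongly favors the correct character gets a much lower loss
confident_logits = [0.0] * n_chars
confident_logits[0] = 10.0
confident = sparse_ce_from_logits(0, confident_logits)

print(baseline, math.log(n_chars))
print(confident)
```

With uniform predictions the loss equals ln(98), about 4.58, so that is roughly where I would expect an untrained model to start.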

In [162]:
model.compile(optimizer='adam', loss=loss)

I create checkpoints so I can save the network (which is especially important as I'll be generating Crimson and Gazette datasets with two different RNNs in this notebook).

I stop the training once the training loss has failed to improve on its best value for two epochs in a row, and I save only the weights that give the best loss.

In [166]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'

# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# saving only the best loss, stopping training after two epochs in a row with less-than-best loss
checkpoint_callback=[tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True,
    monitor='loss', mode='min',
    save_best_only=True), 
    tf.keras.callbacks.EarlyStopping(monitor='loss', patience=2)]
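
The stopping rule can be sketched in plain Python. This is a simplified imitation of the `patience=2` behavior, not the Keras source, and the loss values are made up:

```python
def simulate_early_stopping(losses, patience=2):
    """Return (epoch training stops at, epoch with the best loss)."""
    best = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss_value in enumerate(losses, start=1):
        if loss_value < best:
            best, best_epoch, wait = loss_value, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(losses), best_epoch

# Hypothetical loss curve: best at epoch 3, then two epochs without improvement
stopped_epoch, best_epoch = simulate_early_stopping([2.0, 1.5, 1.2, 1.3, 1.25, 1.1])
print(stopped_epoch, best_epoch)  # 5 3
```

This matches the pattern in the Gazette run later in the notebook, where the last saved checkpoint is from epoch 71 and the training log ends at epoch 73.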

Training the network:

In [167]:
EPOCHS=100

In [168]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=checkpoint_callback)

Train for 33 steps
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Ep

## Generating Sample Crimson Text

To visually evaluate how my network did, I generated a sample article. 

In [169]:
# recalling best model
tf.train.latest_checkpoint(checkpoint_dir) 

'./training_checkpoints/ckpt_91'

In [213]:
# Rebuilding model
vocab_size=98

model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

# Generating text using the learned model to evaluate
def generate_text_crim(model, start_string, num_generate):
    
    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idxcrim[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty string to store our results
    text_generated = []

    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # I found 0.5 to give reasonable results
    temperature = 0.5

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # using a categorical distribution to predict the character returned by the model
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2charcrim[predicted_id])

    return (start_string + ''.join(text_generated))

# generating an article 5000 chars long (pretty standard)
# starting with the word "College"
print(generate_text_crim(model, start_string=u"College ", num_generate=5000))

College counselors to complete college applications.

The percentage of applicants that Harvard saw from research to residencies to teaching, resident tutors have also made or thought of making a bullet journal at one point in your life. Microsoft! You are the wholesome deal and the players embody that love and particular faculty selection and eugenic importantly, Harvard’s financial health has taken a serious hit in the last few weeks. If the current fiscal year was $4 million.

Advertisement

Despite recent losses pushing that the poor are disadvantaged when the rich can drive up market prices. But the consequences, she dress her from the group, presumption of innocence is not the same thing as presuming that either party investments.”

The letter states that Harvard should not include carbon offsets in its called the drompted by an impartial investigator who will then submit a neutral report for review in a hearing process, the duminished since Facebook that day that the poor are di

It didn't turn out so bad! Now I repeat the process for the Harvard Gazette.
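
To show what the `temperature` setting in the generation function does, here is a small sketch that applies a softmax to temperature-scaled logits. The three logits are made up, and `softmax_with_temperature` is a hypothetical helper:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three characters

probs_t1 = softmax_with_temperature(logits, 1.0)
probs_t05 = softmax_with_temperature(logits, 0.5)

print([round(p, 3) for p in probs_t1])
print([round(p, 3) for p in probs_t05])
```

At temperature 0.5 the most likely character gets a larger share of the probability than at temperature 1.0, so sampling picks it more often and the text becomes more predictable.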

## Scraping the Gazette

In [54]:
# using python package newspaper
gazette_paper = newspaper.build('https://news.harvard.edu', memoize_articles=False)
  
# printing categories
for category in gazette_paper.category_urls():
    print(category)

# printing articles collected
for article in gazette_paper.articles:
    print("article:", article.url)

# printing number of articles
print(len(gazette_paper.articles))

https://news.harvard.edu/
https://accessibility.harvard.edu
https://toservebetter.harvard.edu
https://www.harvard.edu
http://trademark.harvard.edu
https://news.harvard.edu
http://harvard.edu
article: https://news.harvard.edu/gazette/story/series/coronavirus/
article: https://news.harvard.edu/gazette/story/series/honoring-the-class-of-2020/
article: https://news.harvard.edu/gazette/story/series/photography/
article: https://news.harvard.edu/gazette/story/2020/05/kennedy-school-grad-will-return-to-the-south-with-a-plan-in-hand/
article: https://news.harvard.edu/gazette/story/2020/05/breaking-new-ground-with-public-health-and-urban-planning-degree/
article: https://news.harvard.edu/gazette/story/2020/05/assessing-where-vaccine-efforts-stand-and-the-challenges-ahead/
article: https://news.harvard.edu/gazette/story/2020/05/wilderness-medicine-fellows-return-to-lend-a-hand/
article: https://news.harvard.edu/gazette/story/series/experience/
article: https://news.harvard.edu/gazette/story/seri

## Prepping Gazette data

Building the corpus by concatenating all articles:

In [55]:
gazette=''
for gazette_article in gazette_paper.articles:
    gazette_article.download()
    gazette_article.parse()
    toadd = gazette_article.text+'\n\n'
    gazette += toadd
    
print(len(gazette))

107536


Mapping characters to integers:

In [215]:
# The unique characters in the file
vocab = sorted(set(gazette))
print ('{} unique characters'.format(len(vocab)))

87 unique characters


In [216]:
# Creating a mapping from unique characters to indices
char2idxgaz = {u:i for i, u in enumerate(vocab)}
idx2chargaz = np.array(vocab)

text_as_int = np.array([char2idxgaz[c] for c in gazette])

Splitting the data into chunks:

In [182]:
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(gazette)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

dataset = sequences.map(split_input_target)
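
The number of training steps per epoch follows from the corpus length, the chunk length, and the batch size of 64 used in this notebook. A small sketch of the arithmetic, where `steps_per_epoch` is a hypothetical helper and the remainder is dropped at both stages as with `drop_remainder=True`:

```python
def steps_per_epoch(n_chars, seq_length=100, batch_size=64):
    # Each training example uses seq_length + 1 characters
    n_sequences = n_chars // (seq_length + 1)
    # Incomplete batches are dropped
    return n_sequences // batch_size

gazette_steps = steps_per_epoch(107536)  # length of the Gazette corpus
print(gazette_steps)  # 16
```

For the Gazette corpus of 107536 characters this gives 1064 sequences and 16 batches, which agrees with the "Train for 16 steps" line in the training output.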

In [183]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>

## Training RNN on Gazette data

We use the same RNN structure as we used for the Crimson. 

In [184]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [185]:
# using the same RNN model as the Crimson model
model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 87) # (batch_size, sequence_length, vocab_size)


We also use the same loss function:

In [186]:
example_batch_loss  = loss(target_example_batch, example_batch_predictions)

In [187]:
model.compile(optimizer='adam', loss=loss)

Just like with the Crimson I stop the training when I get two epochs in a row with less-than-best loss and save my network that gives me the best loss only. However, I save these checkpoints in a different directory so I can call them both whenever I like. 

In [188]:
# Directory where the checkpoints will be saved
checkpoint_gaz = './training_checkpoints_gaz'

# Name of the checkpoint files
checkpoint_prefix_gaz = os.path.join(checkpoint_gaz, "ckpt_gaz_{epoch}")

# saving only the best loss, stopping training after two epochs in a row with less-than-best loss
checkpoint_callback_gaz=[tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix_gaz,
    save_weights_only=True,
    monitor='loss', mode='min',
    save_best_only=True), 
    tf.keras.callbacks.EarlyStopping(monitor='loss', patience=2)]

Training the network:

In [189]:
EPOCHS=100

In [190]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=checkpoint_callback_gaz)

Train for 16 steps
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100


## Generating Sample Gazette Text

Getting the latest Gazette checkpoint:

In [191]:
tf.train.latest_checkpoint(checkpoint_gaz) 
# the final run stopped training once the loss stopped improving

'./training_checkpoints_gaz/ckpt_gaz_71'

Rebuilding the model from checkpoint and generating sample text to visually evaluate:

In [223]:
# vocab size for crimson=98, for gazette=87
vocab_size=87

# rebuilding the model
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_gaz))
model.build(tf.TensorShape([1, None]))

# Generating text using the learned model to evaluate
def generate_text_gaz(model, start_string, num_generate):
    
    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idxgaz[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty string to store our results
    text_generated = []

    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # I found 0.5 to give reasonable results
    temperature = 0.5

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # using a categorical distribution to predict the character returned by the model
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2chargaz[predicted_id])

    return (start_string + ''.join(text_generated))

# Again, a 5000-character article starting with "College"
print(generate_text_gaz(model, start_string=u"College ", num_generate=5000))

College and a health care system. So having the ability to predict the coincidence of the two, so we had an idea of what printed muscle is supposed to look like. But we have to reconceptualize the whole job of child development and education, and construct systems in exacting detail to better understand how they functions, but she also underscored the importance of connection and communication among schools and students throughout the research, and encouraging them to conduct similar field visits for their contributions to undergraduate teaching.

Claudine Gay, Edgerley Family Dean of the Faculty of Arts and Sciences, and Co-Directors Nathaniel Hendren, a Harvard professor reaching out to interview people in red-leaning areas may seem like courting troubled by whothe teaching faculty, staff, and students.

“I know the issues in the forced overreliance on homeschooling so that we avoid further disadvantaging the already-tense global COVID picture, Bloom said, has been an increase in nat

Again, not bad!

## Generating Gazette and Crimson Articles for Classification (in Part 2)

For classification, the 47 Crimson articles and 24 Gazette articles I collected obviously aren't enough on their own. Instead, I can train my classifier on simulated data from my RNNs, as long as my test set consists mostly of the original data.

I start by generating random seed words to start my simulated articles: 

In [None]:
# generate 1000 random seed words to start articles
import random
from random_words import RandomWords
rw = RandomWords()
seeds = rw.random_words(count=1000)

In [229]:
# printing first 10 seed words as an example
print(seeds[:10])

['invention', 'compiler', 'leads', 'sonars', 'reach', 'discipline', 'community', 'discards', 'anchor', 'wrenches']


To simulate Gazette data, I rebuild my model from my Gazette checkpoint:

In [None]:
# vocab size for crimson=98, for gazette=87
vocab_size=87

# rebuilding the model
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_gaz))
model.build(tf.TensorShape([1, None]))

I store the generated texts in a list:

In [None]:
gazette_gen=[]
# generating some Gazette articles by going through the seeds
for seed in seeds:
    # each text will have a random length between 2000 and 5000 characters
    n=random.randint(2000, 5000)
    gazette_gen.append(generate_text_gaz(model, start_string=seed.capitalize()+" ", num_generate=n))

In this case, I generated 144 articles. You can see what the third article generated looks like below. 

In [92]:
print(len(gazette_gen))
print(gazette_gen[2])

144
Leads they are normal. As she prepares to release her, she notices the fear in the woman’s eyes. Since he use of social media presence, seeing a live black scientist,” said Extavour, who did not meet another black professional scientific response to the outbreak, one that has nonetheless been outpaces on its lands.

For the former drans to pursue a reality of what biology looks like was really challenged in our society will be the most vulnerable in this crisis.

“Because this crisis has highlighted inequities, I’m hoping that we know that no matter what the latest developments in the COVID-19 outbreak may bring.

One of the most popular — and highest-stakes — guessing games to emphasize just how meaningful it is to have a great deal of information, but right now our job is to, number one, in the population people who you think are positive, but enough of them are actually negative that you are going to be the last popiology in this country. Again, in 1983, the report “Nation Infla

To simulate Crimson data, I rebuild my model from my Crimson checkpoint:

In [224]:
# vocab size for crimson=98, for gazette=87
vocab_size=98

# rebuilding the model
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

I store the generated texts in another list:

In [230]:
crimson_gen=[]
for seed in seeds[:100]:
    n=random.randint(2000, 5000)
    crimson_gen.append(generate_text_crim(model, start_string=seed.capitalize()+" ", num_generate=n))

In this case, I generated 100 articles. You can see what the second article generated looks like below. 

In [234]:
print(len(crimson_gen))
print(crimson_gen[1])

100
Compiler s been something that has been going on for a long time?”

While the social consequences of COVID-19 Cancer Institute, and Boston Children’s Hospital. The filming carbon somewhere else’s work — it seems, is exactly what leadership is all about,” Medical School Dean for Students Financial Aid William R. Fitzsimmons '67 attributed the decline in applicants to a few feet of space. Under ordinary circumstances, the lives of factory farm animals will be slaughtered not for consumption, but to make room for others.

These meat carcasses precisely important to me,” she says. “It was a genetic consideration.”

The donor Copeland ultimately chose a school-sponsored activity, but her donation required that she trek farther than ten minutes off once stories were said. “I’m say that the story of slavery, lynching, and racial segregation in the United States and share poems. Event organizers asked The Crimson left the gender from the clothing as they could be effectively coerced by the

## Storing Articles in .txt Files for Use in Part 2

I'll be doing the classification in a different notebook so I stored all my generated and scraped articles in .txt files. 

In [112]:
count=0
for text in crimson_gen:
    with open("articles/gen/crimson"+str(count)+".txt", "w") as file:
        file.write(text)
    count+=1
    
for article in crimson_paper.articles:
    with open("articles/og/crimson"+str(count)+".txt", "w") as file:
        file.write(article.text)
    count+=1

count=0
for text in gazette_gen:
    with open("articles/gen/gazette"+str(count)+".txt", "w") as file:
        file.write(text)
    count+=1
    
for article in gazette_paper.articles:
    with open("articles/og/gazette"+str(count)+".txt", "w") as file:
        file.write(article.text)
    count+=1
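
The cell above assumes the `articles/gen` and `articles/og` directories already exist. A minimal sketch of a helper that creates the directory first is below. `save_texts` is a hypothetical function, not something used elsewhere in this notebook, and the demo writes to a temporary directory:

```python
import os
import tempfile

def save_texts(texts, directory, prefix, start=0):
    """Write each text to directory/prefix<N>.txt and return the paths."""
    os.makedirs(directory, exist_ok=True)  # create the folder if it is missing
    paths = []
    for offset, text in enumerate(texts):
        path = os.path.join(directory, "{}{}.txt".format(prefix, start + offset))
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        paths.append(path)
    return paths

# Demo with two placeholder texts in a temporary directory
demo_dir = tempfile.mkdtemp()
demo_paths = save_texts(["first text", "second text"],
                        os.path.join(demo_dir, "gen"), "crimson")
print([os.path.basename(p) for p in demo_paths])
```

The `start` argument makes it possible to continue the numbering from the generated articles into the original ones, as the counter in the cell above does.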

## Exploratory Analyses, Crimson vs. Gazette

First, I compare average article length in scraped data. 

In [236]:
# len of string/total number of articles collected
print(len(crimson)/47)
print(len(gazette)/24)

4618.425531914893
4480.666666666667


Looks like article lengths are similar! So this can't be a distinguishing factor when I do my classification in Part 2. 

Next, I do simple sentiment analysis with the AFINN lexicon, which is very commonly used for news articles. The Python package sums the sentiments of all the words as the score, so I divide by the length of the passage in characters to roughly normalize.

In [12]:
# initialize afinn sentiment analyzer
from afinn import Afinn
af = Afinn(language='en')

# compute sentiment scores
print("gazette: ",af.score(gazette)/len(gazette))
print("crimson: ",af.score(crimson)/len(crimson))

gazette:  0.0034772290314828164
crimson:  0.003336903864549033
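
To make the scoring and normalization concrete, here is a sketch with a tiny stand-in lexicon. The word values in `toy_lexicon` are placeholders chosen for illustration and `toy_score` is a hypothetical function, so this only imitates the sum-of-word-valences idea rather than calling the afinn package:

```python
# Hypothetical valences for a handful of words
toy_lexicon = {"good": 3, "bad": -3, "crisis": -3, "win": 4}

def toy_score(text):
    """Sum the valence of every word found in the toy lexicon."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(toy_lexicon.get(w, 0) for w in words)

text = "A good win despite the crisis."
raw = toy_score(text)               # 3 + 4 - 3 = 4
per_char = raw / len(text)          # the normalization used in this notebook
per_word = raw / len(text.split())  # an alternative normalization

print(raw, per_char, per_word)

# Word-level scoring ignores negation: "not good" scores the same as "good"
print(toy_score("not good"), toy_score("good"))
```

Dividing by characters or by words both remove the dependence on passage length. The per-character version gives small numbers like the ones above because most characters belong to words with no valence.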


Looks like the Gazette and the Crimson have very similar, positive-trending sentiment scores! This means sentiment is also not a distinguishing feature for classification. For comparison, here's the New York Times similarly scored on the AFINN lexicon:

In [15]:
nyt_paper = newspaper.build('http://nytimes.com', memoize_articles=False)

nyt=''
for nyt_article in nyt_paper.articles:
    nyt_article.download()
    nyt_article.parse()
    toadd = nyt_article.text+'\n\n'
    nyt += toadd
    
print("nyt: ",af.score(nyt)/len(nyt))

nyt:  0.0024826216484607746


As you can see, the NYT scores lower (less positive) than both the Harvard Gazette and the Crimson; not all papers trend so positive.

One deficiency of AFINN scoring is that it scores word by word, so it doesn't take negation into account. Lexicons like VADER, by contrast, score by passage, although VADER is specially tailored for social media. Still, I tried VADER. Again, the Gazette and the Crimson aren't very different, while the New York Times has a lower positive share (0.099, versus 0.115 and 0.118).

In [14]:
import nltk
nltk.download('vader_lexicon')

# first, we import the relevant modules from the NLTK library
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# next, we initialize VADER so we can use it within our Python script
sid = SentimentIntensityAnalyzer()
scores = sid.polarity_scores(gazette)
# Here we loop through the keys contained in scores (pos, neu, neg, and compound scores) and print the key-value pairs on the screen
for key in sorted(scores):
        print('gazette: {0}: {1}, '.format(key, scores[key]), end='')
scores = sid.polarity_scores(crimson)
# Here we loop through the keys contained in scores (pos, neu, neg, and compound scores) and print the key-value pairs on the screen
print()
for key in sorted(scores):
        print('crimson: {0}: {1}, '.format(key, scores[key]), end='')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/terry/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


gazette: compound: 1.0, gazette: neg: 0.06, gazette: neu: 0.825, gazette: pos: 0.115, 
crimson: compound: 1.0, crimson: neg: 0.071, crimson: neu: 0.811, crimson: pos: 0.118, 
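
The compound score is 1.0 for both papers, so it can't separate them. My understanding is that VADER squashes the summed valence `x` into the range -1 to 1 with `x / sqrt(x**2 + 15)` and rounds to four decimals. Treat that formula and the constant 15 as an assumption about the library rather than something verified here. Under that assumption, any long and mildly positive text saturates:

```python
import math

def squash(valence_sum, alpha=15):
    """Map an unbounded valence sum into (-1, 1); assumed VADER-style normalization."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)

short_text = round(squash(4), 4)     # a sentence-sized valence sum
long_text = round(squash(2000), 4)   # a corpus-sized valence sum (made up)

print(short_text)
print(long_text)
```

For a whole corpus the rounded compound score is pinned at 1.0, which is why the `pos`, `neu`, and `neg` proportions are the more informative numbers in the output above.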

In [13]:
scores = sid.polarity_scores(nyt)
# Here we loop through the keys contained in scores (pos, neu, neg, and compound scores) and print the key-value pairs on the screen
for key in sorted(scores):
        print('nyt: {0}: {1}, '.format(key, scores[key]), end='')

nyt:  0.0012236963374303927
nyt: compound: 1.0, nyt: neg: 0.061, nyt: neu: 0.84, nyt: pos: 0.099, 