# News Headline Analysis

In this project we're analyzing news headlines written by two journalists, a **finance** reporter from Business Insider and a **celebrity** reporter from the Huffington Post, to find similarities and differences in the way these authors write headlines for their news articles and blog posts. Our selected reporters are:

- Akin Oyedele from Business Insider, who covers market updates; and
- Carly Ledbetter from the Huffington Post, who mainly writes about celebrities.

### Approach

We'll start by collecting the news headlines from each author and parsing them to obtain a parse tree for each headline. We'll then extract from these parse trees the information that is indicative of the overall structure of the headline.

Next, we will define a simple sequence similarity metric to compare any pair of headlines quantitatively, and we will apply the same method to all of the headlines we've gathered for each author, to find out how similar each pair of headlines is.

Finally, we're going to use K-Means and t-SNE to produce a visual map of all the headlines, where we can see the similarities and the differences between the two authors more clearly.

### Data

For this project we've gathered 700 headlines for each author using the [AYLIEN News API](https://newsapi.aylien.com), and we're going to analyze them using Python. You can obtain the pickled data files directly from the GitHub repository, or generate them with [the data collection notebook](https://github.com/AYLIEN/headline_analysis/blob/master/data-collection.ipynb) that we've prepared for this project.

### A primer on parse trees

In linguistics, a parse tree is a rooted tree that represents the syntactic structure of a sentence, according to some pre-defined grammar.

For a simple sentence like "The cat sat on the mat", a parse tree might look like this:

![The cat sat on the mat](https://raw.githubusercontent.com/AYLIEN/headline_analysis/master/parsetree.png)

We're going to use the [Pattern library](http://www.clips.ua.ac.be/pages/pattern-en#tree) for Python to parse the headlines and create parse trees for them:

In [None]:
from pattern.en import parsetree

Let's see an example:

In [None]:
s = parsetree('The cat sat on the mat.')
for sentence in s:
    for chunk in sentence.chunks:
        print chunk.type, [(w.string, w.type) for w in chunk.words]

### Loading the data

Let's load the pickled data file for the first author (Akin Oyedele), which contains 700 headlines, and look at an example of what a headline record looks like:

In [None]:
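import cPickle as pickle  # Python 2 module

# Load the list of headline records for the first author
with open("author1.p", "rb") as f:
    author1 = pickle.load(f)
author1[0]  # show the first headline record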

### Parsing the data

Now that we have all the headlines for the first author loaded, we're going to analyze them, create parse trees for each headline, and store them together with some basic information about the headline in the same object:

In [None]:
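# Parse each headline and store its chunk type sequence with some basic length stats
for story in author1:
    story["title_length"] = len(story["title"])
    story["title_chunks"] = [chunk.type for chunk in parsetree(story["title"])[0].chunks]
    story["title_chunks_length"] = len(story["title_chunks"])

author1[0]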

Let's see what the numeric attributes for headlines written by this author look like. We're going to use [Pandas](http://pandas.pydata.org/) for this.

In [None]:
import pandas as pd

df1 = pd.DataFrame.from_dict(author1)

In [None]:
df1.describe()

From this information, we're going to extract the chunk type sequence of each headline (i.e. the first level of the parse tree) and use it as an indicator of the overall structure of the headline. So in the above example, we would extract and use the following sequence of chunk types in our analysis:

```
['NP', 'PP', 'NP', 'VP']
```

### Similarity

We have loaded all the headlines written by the first author, and created and stored their parse trees. Next, we need a similarity metric that, given two chunk type sequences, tells us how similar the two headlines are from a structural perspective.

For that we're going to use the SequenceMatcher class of difflib, which produces a similarity score between 0 and 1 for any two sequences (Python lists):

In [None]:
import difflib
print "Similarity scores for...\n"
print "Two identical sequences: ", difflib.SequenceMatcher(None,["A","B","C"],["A","B","C"]).ratio()
print "Two similar sequences: ", difflib.SequenceMatcher(None,["A","B","C"],["A","B","D"]).ratio()
print "Two completely different sequences: ", difflib.SequenceMatcher(None,["A","B","C"],["X","Y","Z"]).ratio()
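For reference, the `difflib` documentation defines `ratio()` as `2.0 * M / T`, where `M` is the number of matching elements and `T` is the total number of elements in both sequences. The sketch below recomputes the score by hand from `get_matching_blocks()`; the `manual_ratio` helper is a hypothetical name used only for illustration.

```python
import difflib

def manual_ratio(a, b):
    """Recompute SequenceMatcher.ratio() as 2 * M / T."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # Each matching block is (i, j, size); the last block is a dummy with size 0.
    matches = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    # difflib returns 1.0 when both sequences are empty.
    return 2.0 * matches / total if total else 1.0

similar = manual_ratio(["A", "B", "C"], ["A", "B", "D"])    # 2 matches out of 6 elements
different = manual_ratio(["A", "B", "C"], ["X", "Y", "Z"])  # no matches at all
```

So two three-element sequences that share their first two elements score 2 * 2 / 6, or about 0.67.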

Now let's see how that works with our chunk type sequences, for two arbitrarily chosen headlines from the first author:

In [None]:
v1 = author1[3]["title_chunks"]
v2 = author1[1]["title_chunks"]

print v1, v2, difflib.SequenceMatcher(None,v1,v2).ratio()

### Pair-wise similarity matrix for the headlines

We're now going to apply the same sequence similarity metric to all of our headlines, and create a 700x700 matrix of pairwise similarity scores between the headlines:

In [None]:
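import numpy as np

# One chunk type sequence per headline
chunks = [story["title_chunks"] for story in author1]
n = len(chunks)  # 700 headlines
m = np.zeros((n, n))
for i, chunkx in enumerate(chunks):
    for j, chunky in enumerate(chunks):
        # m[i][j] holds the similarity between headlines i and j
        m[i][j] = difflib.SequenceMatcher(None, chunkx, chunky).ratio()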
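The nested loop above builds a new `SequenceMatcher` for every pair. The `difflib` documentation notes that the matcher caches its analysis of the second sequence, and suggests calling `set_seq2()` once and `set_seq1()` repeatedly when comparing one sequence against many. Here is a minimal sketch of that idea, using plain Python lists so it runs without NumPy; `pairwise_similarity` and `toy_chunks` are hypothetical names, not part of the original notebook.

```python
import difflib

def pairwise_similarity(sequences):
    """Return an n x n list of lists; entry [i][j] is ratio(sequences[i], sequences[j])."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    matcher = difflib.SequenceMatcher(None)
    for j, seq_b in enumerate(sequences):
        matcher.set_seq2(seq_b)      # analysed once, then reused for the whole column
        for i, seq_a in enumerate(sequences):
            matcher.set_seq1(seq_a)
            matrix[i][j] = matcher.ratio()
    return matrix

# Toy chunk type sequences standing in for the real headlines
toy_chunks = [["NP", "VP"], ["NP", "VP", "PP", "NP"], ["VP", "NP"]]
toy_matrix = pairwise_similarity(toy_chunks)
```

Wrapping the result in `np.array(...)` gives a 2D array of the same shape as `m`, and sizing the matrix from `len(sequences)` avoids hard-coding the number of headlines.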

### Visualization

To make things clearer and more understandable, let's try to put all the headlines written by the first author on a 2D scatter plot, where similarly structured headlines are close together.

For that we're first going to use t-SNE to reduce the dimensionality of our similarity matrix from 700 down to 2:

In [None]:
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1, random_state=0)

In [None]:
tsne = tsne_model.fit_transform(m)

And to add a bit of color to our visualization, let's use K-Means (here, scikit-learn's `MiniBatchKMeans` variant) to identify 5 clusters of similar headlines:

In [None]:
from sklearn.cluster import MiniBatchKMeans

kmeans_model = MiniBatchKMeans(n_clusters=5, init='k-means++', n_init=1, 
                         init_size=1000, batch_size=1000, verbose=False, max_iter=1000)
kmeans = kmeans_model.fit(m)
kmeans_clusters = kmeans.predict(m)
kmeans_distances = kmeans.transform(m)
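Before plotting, it can help to look at which headlines land in each cluster. Below is a minimal sketch that groups titles by cluster label. The `titles_by_cluster` helper is a hypothetical name, and the labels and placeholder titles are made up to stand in for `kmeans_clusters` and the `title` fields.

```python
from collections import defaultdict

def titles_by_cluster(labels, titles):
    """Group headline titles by their cluster label."""
    groups = defaultdict(list)
    for label, title in zip(labels, titles):
        groups[int(label)].append(title)   # int() also accepts NumPy integer labels
    return dict(groups)

# Made-up labels and mostly placeholder titles, for illustration only
toy_labels = [0, 1, 0]
toy_titles = ["Viacom is crashing", "placeholder headline A", "placeholder headline B"]
groups = titles_by_cluster(toy_labels, toy_titles)
```

With the real data this would be called as `titles_by_cluster(kmeans_clusters, [x["title"] for x in author1])`.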

Finally, let's plot the actual chart using [Bokeh](http://bokeh.pydata.org/en/latest/):

In [None]:
import bokeh.plotting as bp
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.plotting import figure, show, output_notebook

colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c", 
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5", 
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f", 
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])

output_notebook()
plot_author1 = bp.figure(plot_width=900, plot_height=700, title="Author1",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)

plot_author1.scatter(x=tsne[:,0], y=tsne[:,1],
                    color=colormap[kmeans_clusters],
                    source=bp.ColumnDataSource({
                        "chunks": [x["title_chunks"] for x in author1], 
                        "title": [x["title"] for x in author1],
                        "cluster": kmeans_clusters
                    }))

hover = plot_author1.select(dict(type=HoverTool))
hover.tooltips={"chunks": "@chunks (title: \"@title\")", "cluster": "@cluster"}
show(plot_author1)

The above interactive chart shows a number of dense groups of headlines, as well as some sparse ones. Some of the dense groups that stand out more are:

- The **NP, VP** group on the left, which typically consists of short, snappy stock update headlines such as "Viacom is crashing";
- The **VP, NP** group on the top right, which is mostly announcement headlines in the "Here comes the..." format; and
- The **NP, VP, ADJP, PP, VP** group at the bottom left, where we have headlines such as "Industrial production **falls more than expected**" or "ADP private payrolls **rise more than expected**".

If you look closely you will find other interesting groups, as well as their similarities and dissimilarities when compared to their neighbors.
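One way to cross-check which groups are dense is to count how often each exact chunk type sequence occurs. This is a minimal sketch with toy data; `most_common_structures` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter

def most_common_structures(chunk_sequences, top_n=3):
    """Return the top_n most frequent chunk type sequences with their counts."""
    # Lists are not hashable, so each sequence is converted to a tuple first.
    counts = Counter(tuple(seq) for seq in chunk_sequences)
    return counts.most_common(top_n)

# Toy sequences standing in for the real title_chunks values
toy_chunks = [["NP", "VP"], ["VP", "NP"], ["NP", "VP"], ["NP", "VP", "ADJP", "PP", "VP"]]
top = most_common_structures(toy_chunks, top_n=1)
```

On the real data, `most_common_structures([x["title_chunks"] for x in author1])` would list the most frequent structures for the first author.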

### Comparing the two authors

Finally, let's load the headlines for the second author and see how they compare with the first author's. The steps are quite similar to the above, except that this time we're going to calculate the pairwise similarity across both sets of headlines and store it in a 1400x1400 matrix:

In [None]:
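# Load the headline records for the second author
with open("author2.p", "rb") as f:
    author2 = pickle.load(f)
# Parse each headline and store its chunk type sequence with some basic length stats
for story in author2:
    story["title_length"] = len(story["title"])
    story["title_chunks"] = [chunk.type for chunk in parsetree(story["title"])[0].chunks]
    story["title_chunks_length"] = len(story["title_chunks"])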

In [None]:
pd.DataFrame.from_dict(author2).describe()

The basic stats don't show a significant difference between the headlines written by the two authors.

In [None]:
# author1's headlines come first, followed by author2's
chunks_joint = [story["title_chunks"] for story in (author1 + author2)]
n_joint = len(chunks_joint)  # 1400 headlines
m_joint = np.zeros((n_joint, n_joint))
for i, chunkx in enumerate(chunks_joint):
    for j, chunky in enumerate(chunks_joint):
        sm = difflib.SequenceMatcher(None, chunkx, chunky)
        m_joint[i][j] = sm.ratio()
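Because `author1 + author2` places the first author's headlines first, rows and columns 0 to 699 of `m_joint` belong to the first author and 700 to 1399 to the second. A quick way to summarize the matrix is to average each block. The sketch below does that on a made-up 4x4 matrix; `block_mean` and the toy values are hypothetical and only illustrate the indexing.

```python
def block_mean(matrix, rows, cols):
    """Average the entries of `matrix` over the given row and column indices."""
    values = [matrix[i][j] for i in rows for j in cols]
    return sum(values) / float(len(values))

# Toy joint matrix: items 0-1 belong to the first author, items 2-3 to the second.
toy_joint = [
    [1.0, 0.8, 0.2, 0.4],
    [0.8, 1.0, 0.4, 0.2],
    [0.2, 0.4, 1.0, 0.6],
    [0.4, 0.2, 0.6, 1.0],
]
first, second = range(0, 2), range(2, 4)
within_first = block_mean(toy_joint, first, first)   # similarity among the first author's items
between = block_mean(toy_joint, first, second)       # similarity across the two authors
```

With the real matrix the ranges would be `range(0, 700)` and `range(700, 1400)`. Note that the within-author blocks include the diagonal, where every headline has a similarity of 1.0 with itself.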

Now that we have analyzed the headlines for the second author, let's see how many structural patterns the two authors have in common. We first make sure the first author's headlines carry the same parsed attributes:

In [None]:
for story in author1:
    story["title_length"] = len(story["title"])
    story["title_chunks"] = [chunk.type for chunk in parsetree(story["title"])[0].chunks]
    story["title_chunks_length"] = len(story["title_chunks"])

In [None]:
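# Chunk type sequences for each author (these are lists, despite the names)
set1 = [story["title_chunks"] for story in author1]
set2 = [story["title_chunks"] for story in author2]
# Keep author1's sequences that also occur, exactly, among author2's
list_new = [itm for itm in set1 if itm in set2]
len(list_new)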

We observe that about 50% (347/700) of the first author's headlines have a chunk type sequence that also appears, exactly, in at least one of the second author's headlines.
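The list comprehension above scans all of the second author's sequences for every headline of the first author. The same count can be sketched with a set lookup instead; `shared_structure_count` and the toy inputs are hypothetical names used for illustration.

```python
def shared_structure_count(chunks_a, chunks_b):
    """Count sequences in chunks_a whose exact chunk sequence also occurs in chunks_b."""
    # Tuples are hashable, so they can be stored in a set for fast membership tests.
    seen_in_b = set(tuple(seq) for seq in chunks_b)
    return sum(1 for seq in chunks_a if tuple(seq) in seen_in_b)

# Toy sequences standing in for the two authors' title_chunks values
toy_a = [["NP", "VP"], ["NP", "VP"], ["VP", "NP"], ["NP", "PP"]]
toy_b = [["NP", "VP"], ["NP", "VP", "NP"]]
count = shared_structure_count(toy_a, toy_b)
fraction = count / float(len(toy_a))
```

The count is taken from the first argument's side, so swapping the arguments can give a different number: here two of the four sequences in `toy_a` occur in `toy_b`, but only one of the two in `toy_b` occurs in `toy_a`.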

### Visualization of headlines by the two authors

Our approach here is quite similar to what we did for the first author. The only difference is that this time we're going to use colors to indicate the _author_ rather than the cluster (blue for author1 and orange for author2).