# SI 618 WN 2018 - Homework 4: Using the Spark RDD API to analyze bigrams in text

## Objectives
1. To gain familiarity with PySpark
2. To learn the basics of the Spark RDD API
3. To get experience finding and downloading data

## Please fill in...
### * Your name: Anthony Cozart
### * People you worked with: Anna Lenhart, Lauren (@ Office Hours)

## Submission Instructions:
Please turn in this Jupyter notebook file (in both .ipynb and .html formats) **and the text file that contains the text you analyzed** via Canvas.

## Overview

This project is designed to give you a basic familiarity with the Apache Spark RDD API.

**Your first task** is to run the next code cell in this notebook, as is, (without modification) and confirm that everything is working.

Your second and main task is to modify the word_count_v2.ipynb file in Lab 4 to create a si618-hw4-YOUR_UNIQNAME, which counts the number of bigrams within the corpus. At the conclusion of its execution, it should output three pieces of information: 
1. the total number of bigrams
2. the 20 most common bigrams
3. the minimum number of bigrams required to add up to 10% of all bigrams

In [1]:
import re
import pyspark
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

totc = ["It was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
    "it was the epoch of belief",
    "it was the epoch of incredulity",
    "it was the season of Light",
    "it was the season of Darkness",
    "it was the spring of hope",
    "it was the winter of despair",
    "we had everything before us",
    "we had nothing before us",
    "we were all going direct to Heaven",
    "we were all going direct the other way"]

WORD_RE = re.compile(r"\b[\w']+\b")

input_text = sc.parallelize(totc)
word_counts_sorted = input_text.flatMap(lambda line: WORD_RE.findall(line)) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda accumulator, value: accumulator + value) \
    .sortBy(lambda x: x[1], ascending = False)

top10_sorted = word_counts_sorted.take(10)
top10_sorted

[('the', 11),
 ('was', 10),
 ('of', 10),
 ('it', 9),
 ('we', 4),
 ('epoch', 2),
 ('before', 2),
 ('us', 2),
 ('times', 2),
 ('age', 2)]

## Bigram Analysis
Bigrams are pairs of continguous words.  For example, consider the text "The quick brown fox".  The bigrams in that text are: (The, quick), (quick, brown), (brown,fox).

You should be able to use the above code as a starting point, with the main difference being that you will be extracting pairs of words rather than single words.

Here are the steps you need to include:

1. Find and download a text file from the Gutenberg Project.  Please select a book written in English and download the plain text version (i.e. the .txt file).  
1. Normalize the text by converting it to lowercase.  
1. Extract all bigrams from the text.  Don’t try to get fancy: just take all the pairs of words from each line of your text file
1. Perform a mapping to yield a count of one for each bigram (e.g. (("the", "quick"), 1)). 
1. Reduce this list of bigrams with count of one to counts of bigrams (e.g. (("the", "quick"), 312)).
1. Sort to determine the most frequent and total number of bigrams. 
1. Report the total number of bigrams, the top 20 most common bigrams and the minimum number of bigrams required to add up to 10% of all bigrams.


### Step 1a: Find and download a text file
Go to Project Gutenberg (http://www.gutenberg.org) and find a book written in English that you think sounds interesting. 
Download the plain text (.txt) version of the book.

### Step 1b: Load the text file into Spark

In [208]:
text_file_name = "war_peace_tolstoy.txt"
input_text = sc.textFile(text_file_name)

In [209]:
for line in input_text.take(10):
    print(line.lower())


the project gutenberg ebook of war and peace, by leo tolstoy

this ebook is for the use of anyone anywhere at no cost and with almost
no restrictions whatsoever. you may copy it, give it away or re-use
it under the terms of the project gutenberg license included with this
ebook or online at www.gutenberg.org


title: war and peace


### Steps 2 and 3: Normalize text and extract bigrams
Note: there are many ways to accomplish this.  We recommend you create
a function that both creates bigrams and normalizes text.  The following
template code assumes this is the approach you are planning to take.

In [210]:
words = input_text.map(lambda line: line.lower().split(" "))

In [211]:
# Using Anna's function here -- the only (but key) difference from mine is that she saves the bigram separately and then together, instead of "returning" each
def normalize_and_extract_bigrams(words):
    bigrams=[]
    print(words)
    for i in range (0, len(words)-1):
        bigram=(words[i], words[i+1])
        #bigram.lower
        bigrams.append(bigram)       
    return bigrams

In [212]:
bigrams1 = words.flatMap(normalize_and_extract_bigrams)

### Step 4:  Map the bigrams to key-value pairs where the key is the bigram and the value is 1

In [213]:
bigrams2 = bigrams1.map(lambda words: (words, 1))

### Step 5: Reduce the (word,1) key-value pairs to give you counts of each bigram

In [214]:
bigrams3 = bigrams2.reduceByKey(lambda accumulator, value: accumulator + value)

### Step 6: Sort the resulting RDD by value in descending order

In [215]:
bigrams_sorted = bigrams3.sortBy(lambda x: x[1], ascending = False)

### Step 7: Report the required values
1. the total number of bigrams
2. the 20 most common bigrams
3. the minimum number of bigrams required to add up to 10% of all bigrams

In [216]:
# 1. The total number of bigrams (this is not the number of bigrams, but the count -- all occurences)
total_bigrams = bigrams_sorted.values().sum()
total_bigrams

517022

In [217]:
# 2. The 20 most common bigrams
top20 = bigrams_sorted.take(20)
top20

[(('of', 'the'), 3851),
 (('to', 'the'), 2189),
 (('', ''), 2182),
 (('in', 'the'), 2174),
 (('and', 'the'), 1390),
 (('at', 'the'), 1281),
 (('on', 'the'), 1236),
 (('he', 'had'), 1141),
 (('did', 'not'), 994),
 (('with', 'a'), 898),
 (('he', 'was'), 858),
 (('from', 'the'), 815),
 (('it', 'was'), 797),
 (('with', 'the'), 755),
 (('of', 'his'), 743),
 (('by', 'the'), 733),
 (('in', 'a'), 704),
 (('to', 'be'), 698),
 (('had', 'been'), 695),
 (('prince', 'andrew'), 630)]

In [218]:
# 3. the minimum number of bigrams required to add up to 10% of all bigrams
def running_total(a):
    threshold = total_bigrams/10
    total = 0
    count = 0
    for i in a:
        if total < threshold:
            total += i[1]
            count += 1
        else:
            return count

print("{0:d} bigrams comprise 10% of the total number of bigrams in War & Peace.".format(running_total(bigrams_sorted.take(500))))

108 bigrams comprise 10% of the total number of bigrams in War & Peace.


### Above and Beyond

Anna Karenina, War & Peace, and the Brothers Karamazov are among the most highly regarded novels in the world. All three are by russian novelists, and are very very long. 

After looking at the counts of bigrams of for War & Peace, I became interested in the concentration of words in novels. So I looked up the other two novels, ran word counts, and used a measure of concentration from public policy and economics called a Herfindahl Index (https://en.wikipedia.org/wiki/Herfindahl_index).

The Herfindahl Index is the square of the count divided by the total number of words. It's typically used to understand the market power (control) of firms -- for example, how much power does Verizon have compared to AT&T. To calculate it, I take the top 1000 words in each book, to speed up computing.

The concentration of words is similar for all three books -- maybe that's not so suprising, since all three are written in a similar, complex style.

In [207]:
ak_txt = "anna_karenina_tolstoy.txt"
input_text = sc.textFile(ak_txt)

ak_word_counts = input_text.flatMap(lambda line: WORD_RE.findall(line)) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda accumulator, value: accumulator + value) \
    .sortBy(lambda x: x[1], ascending = False)

ak_top1000 = ak_word_counts.take(1000)
ak_total = ak_word_counts.values().sum()

In [197]:
wp_txt = "war_peace_tolstoy.txt"
input_text = sc.textFile(wp_txt)

wp_word_counts = input_text.flatMap(lambda line: WORD_RE.findall(line)) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda accumulator, value: accumulator + value) \
    .sortBy(lambda x: x[1], ascending = False)

wp_top1000 = wp_word_counts.take(1000)
wp_total = wp_word_counts.values().sum()

In [198]:
bk_txt = "brothers_karamazov.txt"
input_text = sc.textFile(bk_txt)

bk_word_counts = input_text.flatMap(lambda line: WORD_RE.findall(line)) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda accumulator, value: accumulator + value) \
    .sortBy(lambda x: x[1], ascending = False)

bk_top1000 = bk_word_counts.take(1000)
bk_total = bk_word_counts.values().sum()

In [199]:
ak = pd.DataFrame(list(ak_top1000), columns =['word', 'count'])
wp = pd.DataFrame(list(wp_top1000), columns =['word', 'count'])
bk = pd.DataFrame(list(bk_top1000), columns =['word', 'count'])

In [200]:
wp['share'] = (wp['count']^2)/int(wp_total)
ak['share'] = (ak['count']^2)/int(ak_total)
bk['share'] = (bk['count']^2)/int(bk_total)

In [201]:
print("In a similar spirit to calculating bigrams, we can see just how concentrated the words are in each novel:")
print("The Herfindahl index for Anna Karenina is {0:f}.".format(ak['share'].sum()))
print("The Herfindahl index for War & Peace is {0:f}.".format(wp['share'].sum()))
print("The Herfindahl index for Brothers Karamazov is {0:f}.".format(bk['share'].sum()))

In a similar spirit to calculating bigrams, we can see just how concentrated the words are in each novel:
The Herfindahl index for Anna Karenina is 0.828109.
The Herfindahl index for War & Peace is 0.796827.
The Herfindahl index for Brothers Karamazov is 0.832341.


In [202]:
print(ak_total)
print(wp_total)
print(bk_total)

361904
576627
363547


# END OF HOMEWORK 4