# Chapter 4 - Part 2

Paul E. Anderson

In [19]:
%load_ext autoreload
%autoreload 2

from pathlib import Path
home = str(Path.home()) # all other paths are relative to this path. 

import pandas as pd



In [1]:
from pyspark import SparkConf
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

Creating an RDD from a Python object is useful, but most of our data will be in files or databases. Spark provides a few functions to read files:

* ``sc.textFile(path)``
* ``sc.wholeTextFiles(path)``

Both functions take a string argument: the path to the file or files to be read. Either one can read multiple files (for example, a directory or a glob pattern) into an RDD/PairRDD.

The two functions differ in what they return:

* ``sc.textFile(path)`` - returns an RDD with each line as an element.
* ``sc.wholeTextFiles(path)`` - returns a PairRDD where the key is the file path and the value is the contents of the file.

``textFile`` splits the data into chunks of 32 MB, so a single large file can be processed by several cores in parallel.

This is advantageous from a memory perspective, but the lines of a file end up spread across chunks, so their original ordering is lost. If the order must be preserved, use ``wholeTextFiles`` instead.

``wholeTextFiles`` reads the complete content of each file at once. Each file is handled by one core, and the data for each file lives on a single machine, which makes it harder to distribute the load.
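The difference in shape can be checked directly. The sketch below assumes the ``sc`` created earlier and a path that matches a few text files. ``file_name`` and ``preview_readers`` are hypothetical helper names introduced for illustration; they are not part of Spark.

```python
import os

def file_name(path):
    """Strip the directory part from a URI such as 'file:/data/a.txt'."""
    return os.path.basename(path)

def preview_readers(sc, path):
    """Return a small sample from each reader so their shapes can be compared."""
    lines = sc.textFile(path)        # RDD of str: one element per line
    files = sc.wholeTextFiles(path)  # pair RDD: (file URI, full file content)

    first_lines = lines.take(3)      # a few individual lines
    # One (file name, number of characters) pair per file
    names_and_sizes = files.map(lambda kv: (file_name(kv[0]), len(kv[1]))).collect()
    return first_lines, names_and_sizes
```

Calling ``preview_readers(sc, some_path)`` should show many short strings from ``textFile`` but only one entry per file from ``wholeTextFiles``.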

In [14]:
data = ["Project Gutenberg’s",
        "Alice’s Adventures in Wonderland",
        "Project Gutenberg’s",
        "Adventures in Wonderland",
        "Project Gutenberg’s"]
rdd = sc.parallelize(data)  # distributes data
for element in rdd.collect():
    print(element)

Project Gutenberg’s
Alice’s Adventures in Wonderland
Project Gutenberg’s
Adventures in Wonderland
Project Gutenberg’s


**Exercise 2:** Create a function that returns the word frequency of each word in an RDD.

In [15]:
def word_freq(rdd):
    output = []
    # Your solution here
    return output

In [16]:
word_freq(rdd)

[('Gutenberg’s', 3),
 ('Adventures', 2),
 ('Wonderland', 2),
 ('Alice’s', 1),
 ('in', 2),
 ('Project', 3)]
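One possible way to approach Exercise 2 is the classic ``flatMap`` followed by ``reduceByKey``. This is a sketch, not the reference solution: ``line_to_pairs`` and ``word_freq_sketch`` are hypothetical names, and splitting on single spaces is an assumption chosen because it reproduces the counts shown above.

```python
from operator import add

def line_to_pairs(line):
    """Turn one line into (word, 1) pairs, splitting on single spaces."""
    return [(word, 1) for word in line.split(" ")]

def word_freq_sketch(rdd):
    """Count words across an RDD of lines; returns a list of (word, count)."""
    return (rdd.flatMap(line_to_pairs)  # one (word, 1) pair per word
               .reduceByKey(add)        # sum the 1s for each distinct word
               .collect())              # bring the result back to the driver
```

The order of the collected pairs is not guaranteed. Note also that an empty line produces a single empty-string token, so blank lines in real books are counted under the key ``''``.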

**Exercise 3:** 

Write a function that reads all of the books into a single RDD. Call this function ``load_rdd_all_books``.

In [4]:
def load_rdd_all_books(sc,dir):
    lines = None
    return lines

In [20]:
all_books_rdd = load_rdd_all_books(sc,f"file:{home}/csc-369-student/data/gutenberg")
all_books_rdd

file:/home/jupyter-pander14/csc-369-student/data/gutenberg/*.txt MapPartitionsRDD[33] at textFile at NativeMethodAccessorImpl.java:0
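The printed RDD above names a ``*.txt`` pattern and ``textFile``, which suggests one way to approach Exercise 3: pass a glob to ``sc.textFile``. The sketch below follows that assumption. ``books_glob`` and ``load_rdd_all_books_sketch`` are hypothetical names, and the parameter is called ``directory`` here to avoid shadowing the built-in ``dir``.

```python
def books_glob(directory):
    """Build a pattern matching every .txt file directly inside `directory`."""
    return directory.rstrip("/") + "/*.txt"

def load_rdd_all_books_sketch(sc, directory):
    """Read all matching books into one RDD with one element per line."""
    return sc.textFile(books_glob(directory))
```

Because ``textFile`` is lazy, nothing is read until an action such as ``count()`` or ``take(5)`` is called on the result.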

**Problem 1:** Apply your word frequency function to ``all_books_rdd``.

In [23]:
output = word_freq(all_books_rdd)
# Your solution here
output[:10]

[('', 480179),
 ('have', 43375),
 ('Æt.', 6),
 ('LITERATURE', 2),
 ('XIX.', 28),
 ('HURTFUL?', 2),
 ('government', 462),
 ('environment', 15),
 ('dedicates', 5),
 ('pleasure,', 302)]

In [24]:
# We can use pandas to print it in a readable way
pd.DataFrame(output,columns=['Word','Frequency']).sort_values(by="Frequency")

Unnamed: 0,Word,Frequency
183816,water.—_Anglo-Indian._,1
214278,Hareton?,1
214279,hoile’s,1
214280,disordered:,1
214281,"sneeringly,—“Will",1
...,...,...
195628,to,243237
146853,of,253400
4800,and,290833
205473,the,458876
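Since ``output`` is a plain Python list of ``(word, count)`` pairs, the most frequent words can also be picked out without pandas. The sketch below is one way to do that. ``top_words`` is a hypothetical helper, and dropping the empty-string token by default is an assumption about what is useful here. The sample reuses the ten pairs printed by ``output[:10]`` above.

```python
def top_words(pairs, n=10, skip_empty=True):
    """Return the n pairs with the highest counts, most frequent first."""
    if skip_empty:
        # Blank lines are counted under the empty string; leave them out
        pairs = [(word, count) for word, count in pairs if word != ""]
    return sorted(pairs, key=lambda wc: wc[1], reverse=True)[:n]

sample = [('', 480179), ('have', 43375), ('Æt.', 6), ('LITERATURE', 2),
          ('XIX.', 28), ('HURTFUL?', 2), ('government', 462),
          ('environment', 15), ('dedicates', 5), ('pleasure,', 302)]

print(top_words(sample, n=3))
# [('have', 43375), ('government', 462), ('pleasure,', 302)]
```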


**Exercise 4:** Use ``wholeTextFiles`` and the ``map`` function to return the word counts for each book individually. Call this function ``book_word_counts``. Do not use your previous function ``word_freq``, as it will not work in this case without modifications.

In [29]:
# This is a helper function you can use

def count_words(content):
    counts = {}
    for line in content.split("\n"):
        words = line.split(" ")
        for word in words:
            if word not in counts:
                counts[word] = 0
            counts[word] += 1
    return counts
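It helps to see what the helper does with Windows-style line endings, because it splits only on ``"\n"`` and ``" "``. In the sketch below the helper is repeated in a slightly condensed form so the block runs on its own, and ``sample`` is a small made-up string, not text taken from the data set.

```python
def count_words(content):
    """Count tokens obtained by splitting on newlines, then on single spaces."""
    counts = {}
    for line in content.split("\n"):
        for word in line.split(" "):
            counts[word] = counts.get(word, 0) + 1
    return counts

# Made-up sample with "\r\n" line endings and one blank line
sample = "The Prince, by Nicolo Machiavelli\r\n\r\n"
counts = count_words(sample)
print(counts)
```

The carriage return stays attached to the last word of each line (``'Machiavelli\r'``), a blank line becomes the token ``'\r'``, and the text after the final newline becomes the empty string ``''``. Keys of this kind are therefore expected when the helper runs on files with ``\r\n`` line endings.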

In [30]:
def book_word_counts(sc,dir):
    res = None
    return res

In [33]:
res = book_word_counts(sc,f"file:{home}/csc-369-student/data/gutenberg")
res[0] # One book

('1232-0.txt',
 {'\ufeffThe': 1,
  'Project': 65,
  'Gutenberg': 19,
  'eBook': 5,
  'of': 1645,
  'The': 89,
  'Prince,': 1,
  'by': 452,
  'Nicolo': 11,
  'Machiavelli\r': 6,
  '\r': 588,
  'This': 51,
  'is': 398,
  'for': 377,
  'the': 2641,
  'use': 24,
  'anyone': 3,
  'anywhere': 2,
  'in': 871,
  'United': 12,
  'States': 8,
  'and\r': 167,
  'most': 40,
  'other': 114,
  'parts': 4,
  'world': 9,
  'at': 191,
  'no': 101,
  'cost': 2,
  'and': 1600,
  'with': 447,
  'almost': 9,
  'restrictions\r': 1,
  'whatsoever.': 2,
  'You': 14,
  'may': 100,
  'copy': 8,
  'it,': 41,
  'give': 27,
  'it': 425,
  'away': 19,
  'or': 239,
  're-use': 2,
  'under': 69,
  'terms\r': 2,
  'License': 8,
  'included': 3,
  'this': 244,
  'online': 4,
  'at\r': 11,
  'www.gutenberg.org.': 2,
  'If': 23,
  'you': 179,
  'are': 256,
  'not': 438,
  'located': 7,
  'States,': 3,
  'you\r': 9,
  'will': 183,
  'have': 344,
  'to': 1924,
  'check': 5,
  'laws': 16,
  'country': 26,
  'where': 40,
  ...

In [35]:
# print this in a pretty way
pd.DataFrame(res,columns=['File','Word Frequency'])

Unnamed: 0,File,Word Frequency
0,1232-0.txt,"{'﻿The': 1, 'Project': 65, 'Gutenberg': 19, 'e..."
1,45-0.txt,"{'﻿The': 1, 'Project': 66, 'Gutenberg': 18, 'E..."
2,6133-0.txt,"{'﻿The': 1, 'Project': 64, 'Gutenberg': 18, 'E..."
3,2814-0.txt,"{'﻿The': 1, 'Project': 66, 'Gutenberg': 20, 'E..."
4,46-0.txt,"{'﻿The': 1, 'Project': 64, 'Gutenberg': 18, 'E..."
...,...,...
70,84-0.txt,"{'﻿Project': 1, 'Gutenberg's': 1, 'Frankenstei..."
71,215-0.txt,"{'﻿ ': 1, 'The': 165, 'Project': 66, 'Gutenber..."
72,147-0.txt,"{'﻿': 1, 'This': 15, 'eBook': 5, 'is': 357, 'f..."
73,4300-0.txt,"{'﻿ ': 1, 'The': 967, 'Project': 66, 'Gutenber..."
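Returning to Exercise 4, one possible approach is to map each ``(path, content)`` pair from ``wholeTextFiles`` to a ``(file name, counts)`` pair. This is a sketch under stated assumptions, not the reference solution: ``count_tokens``, ``summarize_book`` and ``book_word_counts_sketch`` are hypothetical names, ``count_tokens`` is a ``Counter``-based stand-in for the ``count_words`` helper, and taking the base name of the path is an assumption based on the file names shown in the output above.

```python
import os
from collections import Counter

def count_tokens(content):
    """Same splitting as the count_words helper, written with Counter."""
    return dict(Counter(word
                        for line in content.split("\n")
                        for word in line.split(" ")))

def summarize_book(pair):
    """Map one (file URI, content) pair to (file name, word counts)."""
    path, content = pair
    return os.path.basename(path), count_tokens(content)

def book_word_counts_sketch(sc, directory):
    """Return one (file name, counts dict) pair per book in `directory`."""
    return sc.wholeTextFiles(directory).map(summarize_book).collect()
```

Each book is counted inside a single ``map`` call, so no shuffle is needed, but the whole file must fit in the memory of the worker that handles it.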
