# Lab 4 - Spark Lab 1

## Map/Reduce

In this lab, you will write some of your first programs using the map/reduce paradigm. The exercises are designed to get you thinking in a map/reduce frame of mind.

### The usual imports

In [1]:
%load_ext autoreload
%autoreload 2


# Put all your solutions into Lab4_helper.py, since that script is autograded
import Lab4_helper
    
import os
from pathlib import Path
home = str(Path.home())

import pandas as pd

### Set up your Spark context

In [2]:
from pyspark import SparkConf
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

In [3]:
rdd=sc.parallelize(Lab4_helper.data) # distributes data
for element in rdd.collect():
    print(element)

Project Gutenberg’s
Alice’s Adventures in Wonderland
Project Gutenberg’s
Adventures in Wonderland
Project Gutenberg’s


**Exercise 1:** Create a Python function called ``word_counts``. Your function should take in an RDD and use ``flatMap``, ``map``, and ``reduceByKey``.

In [4]:
counts = Lab4_helper.word_counts(rdd)
counts

[('Gutenberg’s', 3),
 ('Adventures', 2),
 ('Wonderland', 2),
 ('Alice’s', 1),
 ('in', 2),
 ('Project', 3)]
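
To make the three steps concrete, here is a plain-Python analogue of the `flatMap` / `map` / `reduceByKey` pipeline. It is only a sketch of the idea, not Spark code and not the graded solution: `reduce_by_key` is a hypothetical helper, and the sample lines are assumed to match `Lab4_helper.data` as printed above.

```python
from itertools import chain

# The five sample lines printed above (assumed to match Lab4_helper.data).
lines = [
    "Project Gutenberg’s",
    "Alice’s Adventures in Wonderland",
    "Project Gutenberg’s",
    "Adventures in Wonderland",
    "Project Gutenberg’s",
]

# flatMap analogue: each line yields several words, flattened into one sequence.
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map analogue: each word becomes a (key, value) pair.
pairs = [(word, 1) for word in words]


def reduce_by_key(pairs, combine):
    """Hypothetical helper: merge the values of equal keys with `combine`."""
    acc = {}
    for key, value in pairs:
        acc[key] = combine(acc[key], value) if key in acc else value
    return list(acc.items())


counts = reduce_by_key(pairs, lambda a, b: a + b)
print(sorted(counts))
```

In Spark the same three stages run on partitions of the RDD in parallel, and the order of the resulting pairs is not guaranteed.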

**Exercise 2:** Create a function called ``word_freq`` that returns the word frequency of each word in an RDD.

In [5]:
word_frequencies = Lab4_helper.word_freq(rdd)
word_frequencies

[('Gutenberg’s', 3),
 ('Adventures', 2),
 ('Wonderland', 2),
 ('Alice’s', 1),
 ('in', 2),
 ('Project', 3)]

**Exercise 3:** Write a function called ``load_rdd_all_books`` that reads all of the books into a single RDD.

In [6]:
all_books_rdd = Lab4_helper.load_rdd_all_books(sc,f"file:{home}/csc-369-student/data/gutenberg")
all_books_rdd

file:/home/jupyter-pander14/csc-369-student/data/gutenberg/*.txt MapPartitionsRDD[12] at textFile at NativeMethodAccessorImpl.java:0
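
The representation printed above ends in `/*.txt` and mentions `textFile`, which suggests a single `textFile` call with a glob pattern. A minimal sketch under that assumption follows; the parameter name `base_dir` is hypothetical, and the function only needs an object with a `textFile` method, such as a `SparkContext`.

```python
def load_rdd_all_books(sc, base_dir):
    # A glob pattern lets one textFile call read every .txt file in the
    # directory; each element of the resulting RDD is one line of text.
    return sc.textFile(f"{base_dir}/*.txt")
```

Because every line from every book lands in the same RDD, the information about which file a line came from is lost at this point.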

In [7]:
for element in all_books_rdd.collect()[:10]:
    print(element)

The Project Gutenberg eBook of The Prince, by Nicolo Machiavelli

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.



**Problem 1:** Apply your word frequency function to ``all_books_rdd`` and inspect the output.

In [8]:
output = Lab4_helper.word_freq(all_books_rdd)
output[:10]

[('', 480179),
 ('have', 43375),
 ('Æt.', 6),
 ('LITERATURE', 2),
 ('XIX.', 28),
 ('HURTFUL?', 2),
 ('government', 462),
 ('environment', 15),
 ('dedicates', 5),
 ('pleasure,', 302)]
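
The `('', 480179)` entry deserves a second look. A likely explanation (an assumption, since the helper's code is not shown here) is that lines are split on a single space, which turns blank lines and runs of spaces into empty-string tokens. The snippet below shows how the two forms of `str.split` differ.

```python
blank_line = ""
print(blank_line.split(" "))   # [''] : one empty token
print(blank_line.split())      # []   : no tokens

spaced = "a  b"                # two spaces between the letters
print(spaced.split(" "))       # ['a', '', 'b']
print(spaced.split())          # ['a', 'b']
```

Whether empty tokens should be counted depends on what the autograder expects, so check the expected output before changing the split.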

In [9]:
# We can use pandas to print it in a readable way
pd.DataFrame(output,columns=['Word','Frequency']).sort_values(by="Frequency")

| | Word | Frequency |
|---|---|---|
| 183816 | `water.—_Anglo-Indian._` | 1 |
| 214278 | `Hareton?` | 1 |
| 214279 | `hoile’s` | 1 |
| 214280 | `disordered:` | 1 |
| 214281 | `sneeringly,—“Will` | 1 |
| ... | ... | ... |
| 195628 | `to` | 243237 |
| 146853 | `of` | 253400 |
| 4800 | `and` | 290833 |
| 205473 | `the` | 458876 |


**Exercise 4:** Use ``wholeTextFiles`` and the ``map`` function to return the word counts for each book individually in an **RDD**. Call this function ``book_word_counts``. Do not reuse your previous function ``word_freq``; it will not work in this case without modifications.

In [10]:
res = Lab4_helper.book_word_counts(sc,f"file:{home}/csc-369-student/data/gutenberg")
res

[('1232-0.txt',
  {'\ufeffThe': 1,
   'Project': 65,
   'Gutenberg': 19,
   'eBook': 5,
   'of': 1645,
   'The': 89,
   'Prince,': 1,
   'by': 452,
   'Nicolo': 11,
   'Machiavelli\r': 6,
   '\r': 588,
   'This': 51,
   'is': 398,
   'for': 377,
   'the': 2641,
   'use': 24,
   'anyone': 3,
   'anywhere': 2,
   'in': 871,
   'United': 12,
   'States': 8,
   'and\r': 167,
   'most': 40,
   'other': 114,
   'parts': 4,
   'world': 9,
   'at': 191,
   'no': 101,
   'cost': 2,
   'and': 1600,
   'with': 447,
   'almost': 9,
   'restrictions\r': 1,
   'whatsoever.': 2,
   'You': 14,
   'may': 100,
   'copy': 8,
   'it,': 41,
   'give': 27,
   'it': 425,
   'away': 19,
   'or': 239,
   're-use': 2,
   'under': 69,
   'terms\r': 2,
   'License': 8,
   'included': 3,
   'this': 244,
   'online': 4,
   'at\r': 11,
   'www.gutenberg.org.': 2,
   'If': 23,
   'you': 179,
   'are': 256,
   'not': 438,
   'located': 7,
   'States,': 3,
   'you\r': 9,
   'will': 183,
   'have': 344,
   'to': 1924,
 

In [11]:
# print this in a pretty way
pd.DataFrame(res,columns=['File','Word Frequency'])

| | File | Word Frequency |
|---|---|---|
| 0 | 1232-0.txt | `{'﻿The': 1, 'Project': 65, 'Gutenberg': 19, 'e...` |
| 1 | 45-0.txt | `{'﻿The': 1, 'Project': 66, 'Gutenberg': 18, 'E...` |
| 2 | 6133-0.txt | `{'﻿The': 1, 'Project': 64, 'Gutenberg': 18, 'E...` |
| 3 | 2814-0.txt | `{'﻿The': 1, 'Project': 66, 'Gutenberg': 20, 'E...` |
| 4 | 46-0.txt | `{'﻿The': 1, 'Project': 64, 'Gutenberg': 18, 'E...` |
| ... | ... | ... |
| 70 | 84-0.txt | `{'﻿Project': 1, 'Gutenberg's': 1, 'Frankenstei...` |
| 71 | 215-0.txt | `{'﻿ ': 1, 'The': 165, 'Project': 66, 'Gutenber...` |
| 72 | 147-0.txt | `{'﻿': 1, 'This': 15, 'eBook': 5, 'is': 357, 'f...` |
| 73 | 4300-0.txt | `{'﻿ ': 1, 'The': 967, 'Project': 66, 'Gutenber...` |
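
The per-book result can be pictured in plain Python. `wholeTextFiles` yields one `(path, content)` pair per file, so a single `map` can turn each pair into `(file name, dictionary of counts)`. The sketch below is not the graded solution: `files`, `count_words`, and `per_book` are hypothetical names, and the sample data is made up. Keys such as `'Machiavelli\r'` and `'\r'` in the output above suggest that the text was split on `"\n"` and then on a single space with Windows line endings left in place, so the sketch mirrors that assumption.

```python
import os
from collections import Counter

# Hypothetical stand-in for what wholeTextFiles yields: (path, content) pairs.
files = [
    ("file:/data/gutenberg/a.txt", "the cat\r\nthe dog\r\n"),
    ("file:/data/gutenberg/b.txt", "a cat"),
]


def count_words(pair):
    """Turn one (path, content) pair into (file name, {word: count})."""
    path, content = pair
    counts = Counter()
    for line in content.split("\n"):
        counts.update(line.split(" "))
    return os.path.basename(path), dict(counts)


# In Spark this list comprehension would be rdd.map(count_words).
per_book = [count_words(pair) for pair in files]
print(per_book)
```

All counting for a book happens inside one function call, which is why a `reduceByKey` over the whole RDD is not needed here.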


**Exercise 5:** Create a new function called ``lower_case_word_freq``. This function takes as input the output of ``word_freq`` parallelized into an RDD (see below). It converts all the keys to lowercase and then reduces the counts so that words differing only in case are combined.

In [12]:
output = Lab4_helper.word_freq(all_books_rdd)
output_lower = Lab4_helper.lower_case_word_freq(sc.parallelize(output))
output_lower[:10]

[('', 480179),
 ('dedicates', 5),
 ('yet', 5147),
 ('policy', 69),
 ('destroyed', 198),
 ('1500,', 1),
 ('reside', 28),
 ('therefore,', 790),
 ('everything', 2209),
 ('led', 1127)]
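
The lower-casing step can also be sketched in plain Python. Once keys are lower-cased, pairs such as `The` and `the` collide, so their counts have to be added together again. The pairs below are hypothetical values chosen for illustration, and the loop stands in for a `map` followed by a `reduceByKey`.

```python
from collections import defaultdict

# Hypothetical (word, count) pairs in the shape word_freq returns.
pairs = [("The", 89), ("the", 2641), ("Project", 65), ("project", 1)]

# map analogue: lower-case only the key and keep the count.
lowered = [(word.lower(), count) for word, count in pairs]

# reduceByKey analogue: keys that now collide have their counts added.
totals = defaultdict(int)
for word, count in lowered:
    totals[word] += count

result = list(totals.items())
print(result)   # [('the', 2730), ('project', 66)]
```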

In [13]:
pd.DataFrame(output_lower,columns=['Word','Frequency']).sort_values(by="Frequency")

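| | Word | Frequency |
|---|---|---|
| 169314 | `gay-header,` | 1 |
| 198018 | `like—people` | 1 |
| 198019 | `wussians` | 1 |
| 198020 | `expanse.` | 1 |
| 198022 | `do—perished` | 1 |
| ... | ... | ... |
| 226247 | `to` | 247147 |
| 13796 | `of` | 257229 |
| 212622 | `and` | 305384 |
| 0 | | 480179 |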


In [14]:
# Don't forget to push!