# Loading and Saving Data

## Spark Set Up

In [1]:
## Imports
import re
import json
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

from pyspark.sql import SparkSession

app_name = "week2_demo"
master = "local[*]"
spark = SparkSession\
        .builder\
        .appName(app_name)\
        .master(master)\
        .config("spark.ui.port","42229")\
        .getOrCreate()
sc = spark.sparkContext

## Change the working directory
%cd /media

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/04/19 14:06:06 INFO org.apache.spark.SparkEnv: Registering MapOutputTracker
22/04/19 14:06:06 INFO org.apache.spark.SparkEnv: Registering BlockManagerMaster
22/04/19 14:06:06 INFO org.apache.spark.SparkEnv: Registering BlockManagerMasterHeartbeat
22/04/19 14:06:06 INFO org.apache.spark.SparkEnv: Registering OutputCommitCoordinator


/media


## File Formats

Spark supports a variety of file formats. The following table summarizes the most common ones.

|  Format name | Structured  | Comments  |
|:-----------:|:---:|:-----------------------------------------------------------|
|  Text Files | No  | Plain text files. Records are assumed to be one per line  |
|  JSON | Semi  |  Common text-based format. Semistructured. We will parse it with Python's built-in `json` library |
|  CSV | Yes  |  Very common text-based format, often used in spreadsheet applications |

In Week 3 we will look at how well CSV compresses and compare it with two file formats that we will use in Streaming: Parquet and Avro.

### Text Files

Text files are simple to load from and save to with Spark. When we load a single text file as an RDD with `sc.textFile`, each input line becomes a record of the RDD. We can also load multiple text files at the same time with `sc.wholeTextFiles`. In that case, the data is stored in a key/value pair RDD (Pair RDD), with the key being the path of each text file and the value being the contents of that file.
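As a rough illustration of the two record shapes, the sketch below mimics them in plain Python on two temporary files (the file names and contents are made up for the example). It does not use Spark; in Spark, `sc.textFile` yields the line records and `sc.wholeTextFiles` yields the (path, contents) pairs.

```python
import os
import tempfile

def line_records(paths):
    """Mimic sc.textFile: one record per input line, newline stripped."""
    records = []
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            records.extend(line.rstrip("\n") for line in fh)
    return records

def file_records(paths):
    """Mimic sc.wholeTextFiles: one (path, contents) pair per file."""
    pairs = []
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            pairs.append((path, fh.read()))
    return pairs

def demo():
    # Hypothetical sample files, created only for this illustration
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for name, text in [("a.txt", "one\ntwo\n"), ("b.txt", "three\n")]:
            path = os.path.join(tmp, name)
            with open(path, "w", encoding="utf-8") as fh:
                fh.write(text)
            paths.append(path)
        return line_records(paths), file_records(paths)

lines, pairs = demo()
print(lines)
print([(os.path.basename(p), c) for p, c in pairs])
```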

In [3]:
## Download the Alice in Wonderland data
### Store path to notebook
PWD = !pwd
PWD = PWD[0]

### Make the data directory
!mkdir data

### Download the data and store it in the data folder
!gsutil cp gs://great-learning-data-eng/alice.txt data/
ALICE_TXT = 'file:///media' + "/data/alice.txt"

mkdir: cannot create directory ‘data’: File exists
Copying gs://great-learning-data-eng/alice.txt...
/ [1 files][170.2 KiB/170.2 KiB]                                                
Operation completed over 1 objects/170.2 KiB.                                    


In [4]:
## Load data into an RDD
aliceRDD = sc.textFile(ALICE_TXT)

## Perform a word count
result = aliceRDD.flatMap(lambda line: re.findall('[a-z]+', line.lower())) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)\
                 .cache()
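
To make the three stages of the word count concrete, here is a plain-Python sketch of the same logic on a tiny in-memory sample. The sample lines are made up, and `Counter` stands in for `reduceByKey`. It is only an analogy: Spark evaluates these steps lazily and in parallel across partitions.

```python
import re
from collections import Counter

def word_count(lines):
    # flatMap: each line expands into zero or more lowercase words
    words = [w for line in lines for w in re.findall('[a-z]+', line.lower())]
    # map + reduceByKey: treat each word as (word, 1) and sum the 1s per word
    counts = Counter()
    for word in words:
        counts[word] += 1
    return dict(counts)

sample = [
    "Alice was beginning to get very tired",
    "of sitting by her sister; Alice was bored",
]
counts = word_count(sample)
print(counts["alice"], counts["was"], counts["sister"])

# The pattern [a-z]+ splits contractions at the apostrophe
print(word_count(["Don't"]))
```

Note that the regular expression treats an apostrophe as a separator, so a word such as "don't" is counted as "don" and "t".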

In [5]:
## Get the first 10 words (alphabetically)
result.takeOrdered(10)

                                                                                

[('a', 695),
 ('abide', 2),
 ('able', 1),
 ('about', 102),
 ('above', 3),
 ('absence', 1),
 ('absurd', 2),
 ('accept', 1),
 ('acceptance', 1),
 ('accepted', 2)]

In [6]:
## Get top 10 words (by count)
result.takeOrdered(10, key=lambda x: -x[1])

[('the', 1839),
 ('and', 942),
 ('to', 811),
 ('a', 695),
 ('of', 638),
 ('it', 610),
 ('she', 553),
 ('i', 546),
 ('you', 486),
 ('said', 462)]
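
`takeOrdered(n, key=...)` returns the `n` smallest elements under the given key, which is why negating the count yields the most frequent words. A plain-Python sketch that mimics this behaviour with `heapq.nsmallest`, run on a handful of the pairs shown in the outputs above:

```python
import heapq

pairs = [('a', 695), ('the', 1839), ('abide', 2), ('and', 942), ('to', 811)]

# Default ordering: tuples compare element by element, so this sorts by word
first_alphabetical = heapq.nsmallest(3, pairs)

# Negated count as key: the smallest keys belong to the largest counts
top_by_count = heapq.nsmallest(3, pairs, key=lambda x: -x[1])

print(first_alphabetical)
print(top_by_count)
```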

In [8]:
## Let's save as a text file
## (saveAsTextFile writes a directory of part files and fails if the path already exists)
outputPATH = 'file:///media' + "/data/result"
result.saveAsTextFile(outputPATH)
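
`saveAsTextFile` writes the string form of each element, so every line in the output part files looks like `('the', 1839)`. A minimal sketch of turning such lines back into tuples, assuming the records are plain Python literals (the helper name `parse_record` is made up for this example):

```python
import ast

def parse_record(line):
    """Parse one saved line such as "('the', 1839)" back into a tuple."""
    # literal_eval only accepts Python literals, so it is safer than eval
    return ast.literal_eval(line)

saved_lines = ["('the', 1839)", "('and', 942)"]
records = [parse_record(line) for line in saved_lines]
print(records)
```

In Spark, the same helper could be applied when reading the directory back, for example `sc.textFile(outputPATH).map(parse_record)`.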

### JSON

JSON is a very popular semistructured data format, used by almost all web APIs. Python ships with a built-in `json` library for parsing it. Let's take a look.

In [7]:
## Load the data
!gsutil cp gs://great-learning-data-eng/annot_fpid.json data/

Copying gs://great-learning-data-eng/annot_fpid.json...
/ [1 files][  2.3 MiB/  2.3 MiB]                                                
Operation completed over 1 objects/2.3 MiB.                                      


In [18]:
## Use the json library to parse the file, one JSON object per line
json_path = "file:///media/data/annot_fpid.json"
lines = sc.textFile(json_path)  # named `lines` to avoid shadowing the built-in input()
data = lines.map(lambda x: json.loads(x))

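## Define helper function to parse the lists
def parse_list(x):
    # Returns on the first iteration, so only the first element of each
    # list is counted; an empty list would return None
    for i in x:
        return (i, 1)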

## Calculate a frequency table
freq_tableRDD = data.flatMap(lambda x: x.values())\
                    .map(parse_list)\
                    .reduceByKey(lambda x,y: x+y)\
                    .cache()

In [20]:
freq_tableRDD.takeOrdered(10, key=lambda x: -x[1])

[('programming', 10005),
 ('javascript', 7082),
 ('ms_excel', 2130),
 ('sql', 2094),
 ('android', 1599),
 ('scala', 1538),
 ('php', 1453),
 ('puppet', 1354),
 ('powershell', 1315),
 ('r', 1200)]
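
The cells above assume the file holds one JSON object per line and that each value is a list of labels; both points are inferred from the code rather than stated. A plain-Python sketch of the same counting logic on made-up records shows the difference between counting only the first label of each list, as `parse_list` does, and counting every label:

```python
import json
from collections import Counter

# Hypothetical records in the assumed layout: one JSON object per line
raw_lines = [
    '{"id1": ["sql", "python"]}',
    '{"id2": ["python"], "id3": ["sql", "r"]}',
]

records = [json.loads(line) for line in raw_lines]
label_lists = [labels for record in records for labels in record.values()]

# Like parse_list: only the first label of each list is counted
# (empty lists are skipped here instead of producing None)
first_only = Counter(labels[0] for labels in label_lists if labels)

# Alternative: count every label in every list
all_labels = Counter(label for labels in label_lists for label in labels)

print(first_only)
print(all_labels)
```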

### CSV

Comma-Separated Values files, or CSVs, look very similar to an Excel spreadsheet. As with JSON, we can load a CSV as a text file and then parse each line ourselves. Alternatively, we can use the `spark.read.csv` method. Let's showcase the latter (this returns a DataFrame; we will discuss DataFrames in detail in Week 3).
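
Parsing CSV by hand is less trivial than splitting on commas, because quoted fields may contain commas themselves (several product names in this dataset do). A small sketch with Python's `csv` module; the sample line is built from values that appear in this notebook's output, and its exact quoting in the real file is an assumption:

```python
import csv
import io

header = "usage,product_name,delivery_code,month,pdb_ip_pub_date"
line = '21.13,"XPath and XPointer, 1Ed",SafBook,2017 / 10,2002-07-31'

# A naive split breaks the quoted product name into two fields
naive = line.split(",")

# csv.reader honours the quotes and keeps the field intact
fields = next(csv.reader(io.StringIO(line)))
row = dict(zip(header.split(","), fields))

print(len(naive), len(fields))
print(row["product_name"])
```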

In [2]:
## Load the data
!gsutil cp gs://great-learning-data-eng/lp_data.csv data/

Copying gs://great-learning-data-eng/lp_data.csv...
/ [1 files][  8.7 KiB/  8.7 KiB]                                                
Operation completed over 1 objects/8.7 KiB.                                      


In [10]:
## Read the data
csv_path = "file:///media/data/lp_data.csv"
data_df = spark.read.csv(csv_path, header=True)
type(data_df)

pyspark.sql.dataframe.DataFrame

In [11]:
## Transform the DataFrame into an RDD
data_RDD = data_df.rdd
data_RDD.take(5)

[Row(usage='1170.88', product_name="The Manager's Path, 1Ed", delivery_code='SafBook', month='2017 / 10', pdb_ip_pub_date='2017-03-23'),
 Row(usage='21.13', product_name='XPath and XPointer, 1Ed', delivery_code='SafBook', month='2017 / 10', pdb_ip_pub_date='2002-07-31'),
 Row(usage='516.92', product_name='Learning Path: Advanced CSS & Sass', delivery_code='SafVideo', month='2017 / 10', pdb_ip_pub_date='2015-11-23'),
 Row(usage='2.06', product_name='Learning Path: 2017 Design Conference Viewer???s Choice, 1Ed', delivery_code='SafVideo', month='2017 / 10', pdb_ip_pub_date='2017-05-26'),
 Row(usage='1234.83', product_name="Learning Path: A Beginner's Guide to Architecting Big Data Applications, 1Ed", delivery_code='SafVideo', month='2017 / 10', pdb_ip_pub_date='2016-12-13')]

In [17]:
## Get total usage
data_RDD.map(lambda x: float(x.usage))\
        .reduce(lambda x,y: x+y)

32486.070000000007
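
Because `spark.read.csv` was called without a schema or `inferSchema`, every column arrives as a string, hence the `float(...)` conversion. The trailing digits in the total come from binary floating-point rounding. If exact decimal totals matter, one option is `decimal.Decimal`; a sketch using only the five usage values visible in the rows printed earlier:

```python
from decimal import Decimal

usage_strings = ['1170.88', '21.13', '516.92', '2.06', '1234.83']

# Plain float addition may accumulate small rounding errors
float_total = sum(float(u) for u in usage_strings)

# Decimal parses the text exactly, so the sum is exact
decimal_total = sum(Decimal(u) for u in usage_strings)

print(float_total)
print(decimal_total)
```

The same idea should carry over to the RDD by mapping each row to `Decimal(x.usage)` before the `reduce`, at the cost of slower arithmetic.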