# EDA (using Spark)

#### Description:

This notebook documents the exploratory data analysis (EDA) performed for the w261 final project. The analysis takes a small random sample of the raw data and explores it with Spark.

### Load libraries

In [2]:
! pip install pyarrow

Collecting pyarrow
  Downloading https://files.pythonhosted.org/packages/5a/ee/fd2d696eff911f76ed14feeb51e6db6783dd04abd9b8e14be4cbf48d6088/pyarrow-0.15.1-cp37-cp37m-manylinux2010_x86_64.whl (59.2MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-0.15.1


In [3]:
# General tools & operations libraries
import re
import ast
import time
import csv
import itertools

# Mathematical operations and dataframes libraries
import numpy as np
import pandas as pd

# Plotting and visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Parquet libraries
import pyarrow as pa
import pyarrow.parquet as pq

# PySpark libraries
from pyspark.sql import SQLContext
#from pyspark.sql import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.conf import SparkConf

#### Set parameters and Spark configurations

In [4]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [5]:
# store path to notebook
PWD = !pwd
PWD = PWD[0]

In [6]:
# assign parameters (a Python variable, so that {BUCKET} expands in later ! commands)
BUCKET = 'danielalvarez_w261projects'

In [7]:
# start Spark Session
from pyspark.sql import SparkSession
app_name = "finalproject_notebook"
master = "local[*]"
spark = SparkSession\
        .builder\
        .appName(app_name)\
        .master(master)\
        .getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

In [7]:
# Spark configuration information
for conf_item in sc.getConf().getAll():
    print(conf_item)

('spark.app.id', 'local-1575325734557')
('spark.rdd.compress', 'True')
('spark.serializer.objectStreamReset', '100')
('spark.master', 'local[*]')
('spark.executor.id', 'driver')
('spark.driver.port', '35303')
('spark.app.name', 'finalproject_notebook')
('spark.submit.deployMode', 'client')
('spark.ui.showConsoleProgress', 'true')
('spark.driver.host', 'docker.w261')


In [8]:
spark

# Exploratory Data Analysis

Determine 2-3 relevant EDA tasks that will help you make decisions about how you implement the algorithm to be scalable. Discuss any challenges that you anticipate based on the EDA you perform.

### Load dataset

The sample contains about 0.4% of the rows in the raw `train.txt` dataset.

In [9]:
#!cat train.txt | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .001) print $0}' > data/sample.txt
#!gzip -cd data/dac.tar.gz | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .001) print $0}' > data/sample.txt
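
The commented-out `awk` commands above keep each non-empty line with a fixed probability. A minimal Python sketch of the same idea follows; `sample_lines`, its `fraction` and `seed` arguments, and the demo lines are illustrative and not part of the original notebook.

```python
import random

def sample_lines(lines, fraction, seed=None):
    """Yield each non-empty line with probability `fraction` (Bernoulli sampling)."""
    rng = random.Random(seed)
    for line in lines:
        # Skip empty lines first, then draw one random number per remaining line.
        if line.rstrip("\n") and rng.random() < fraction:
            yield line

# Hypothetical in-memory input; for a real file, pass the open file handle instead.
demo = ["a\t1\n", "\n", "b\t2\n", "c\t3\n"]
kept_all = list(sample_lines(demo, fraction=1.0, seed=0))
kept_none = list(sample_lines(demo, fraction=0.0, seed=0))
```

Because each line is kept independently, the sample size is only approximately `fraction` times the number of input lines, and it changes from run to run unless a seed is fixed.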

Impose a schema on the data.

The first column is the label (`y`). It is followed by 13 numeric variables (`n1` to `n13`) and 26 categorical variables (`cat14` to `cat39`).

In [10]:
# the 13th variable is a numeric (`n13`), 14th variable is categorical (`cat14`)
schema = StructType([
    StructField('y', IntegerType()),
    StructField('n1', IntegerType()),
    StructField('n2', IntegerType()),
    StructField('n3', IntegerType()),
    StructField('n4', IntegerType()),
    StructField('n5', LongType()),
    StructField('n6', IntegerType()),
    StructField('n7', IntegerType()),
    StructField('n8', IntegerType()),
    StructField('n9', IntegerType()),
    StructField('n10', IntegerType()),
    StructField('n11', IntegerType()),
    StructField('n12', IntegerType()),
    StructField('n13', IntegerType()), 
    StructField('cat14', StringType()),
    StructField('cat15', StringType()),
    StructField('cat16', StringType()),
    StructField('cat17', StringType()),
    StructField('cat18', StringType()),
    StructField('cat19', StringType()),
    StructField('cat20', StringType()),
    StructField('cat21', StringType()),
    StructField('cat22', StringType()),
    StructField('cat23', StringType()),
    StructField('cat24', StringType()),
    StructField('cat25', StringType()),
    StructField('cat26', StringType()),
    StructField('cat27', StringType()),
    StructField('cat28', StringType()),
    StructField('cat29', StringType()),
    StructField('cat30', StringType()),
    StructField('cat31', StringType()),
    StructField('cat32', StringType()),
    StructField('cat33', StringType()),
    StructField('cat34', StringType()),
    StructField('cat35', StringType()),
    StructField('cat36', StringType()),
    StructField('cat37', StringType()),
    StructField('cat38', StringType()),
    StructField('cat39', StringType()),
])
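
The schema above assigns one label column, 13 numeric columns, and 26 categorical columns to each tab-separated line. As a rough illustration of that mapping outside Spark, the sketch below parses one raw line into a dict and treats empty fields as missing. `COLUMNS` and `parse_line` are hypothetical helpers, and the example record is made up.

```python
# Column names in file order: label, 13 numeric features, 26 categorical features.
COLUMNS = ["y"] + [f"n{i}" for i in range(1, 14)] + [f"cat{i}" for i in range(14, 40)]

def parse_line(line):
    """Split one tab-separated record and cast its fields to match the schema."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = {}
    for name, value in zip(COLUMNS, fields):
        if value == "":
            row[name] = None        # empty field -> missing value
        elif name.startswith("cat"):
            row[name] = value       # categorical features stay as strings
        else:
            row[name] = int(value)  # label and numeric features are integers
    return row

# Made-up record: y=0, n1=2, n2 missing, n3..n13 set to 0,
# cat14..cat38 set to one hash value, cat39 missing.
example = "\t".join(["0", "2", ""] + ["0"] * 11 + ["68fd1e64"] * 25 + [""])
parsed = parse_line(example)
```

Python integers are unbounded, so the sketch does not need to distinguish `IntegerType` from the `LongType` used for `n5`.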

Create Spark Dataframe

In [11]:
start = time.time()
print('Creating dataframe..')
df = spark.read.load("data/sample.txt", format='csv', sep='\t', header='false', schema=schema)
print(f"... completed job in {time.time() - start} seconds")

Creating dataframe..
... completed job in 1.8787925243377686 seconds


Show the first 5 rows of selected columns

In [12]:
df.select('y','n1','n12','n13','cat14','cat39').show(n=5)

+---+----+----+----+--------+--------+
|  y|  n1| n12| n13|   cat14|   cat39|
+---+----+----+----+--------+--------+
|  0|   2|null|null|68fd1e64|    null|
|  0|   3|   0|   1|68a25dc5|da9fe092|
|  0|   0|null|   2|68fd1e64|    null|
|  0|  64|   0|null|05db9164|    null|
|  1|null|null|null|8cf07265|0a47000d|
+---+----+----+----+--------+--------+
only showing top 5 rows


In [13]:
# Count the number of rows
df.count()

183029
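
The five-row preview of selected columns shown earlier contains several `null` entries, so one natural EDA task is to measure how often each column is missing. The pure-Python sketch below does this for the five previewed rows only. `missing_rate` is a hypothetical helper; on the full DataFrame the same counts would be computed in Spark rather than in local Python.

```python
# The five previewed rows, copied from the output of the earlier cell.
preview_columns = ["y", "n1", "n12", "n13", "cat14", "cat39"]
preview_rows = [
    (0, 2, None, None, "68fd1e64", None),
    (0, 3, 0, 1, "68a25dc5", "da9fe092"),
    (0, 0, None, 2, "68fd1e64", None),
    (0, 64, 0, None, "05db9164", None),
    (1, None, None, None, "8cf07265", "0a47000d"),
]

def missing_rate(columns, rows):
    """Return the fraction of missing (None) values for each column."""
    total = len(rows)
    rates = {}
    for index, name in enumerate(columns):
        missing = sum(1 for row in rows if row[index] is None)
        rates[name] = missing / total
    return rates

rates = missing_rate(preview_columns, preview_rows)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%} missing")
```

Five rows are far too few to draw conclusions from; the sketch only shows the shape of the calculation.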

In [14]:
df.head(5)

[Row(y=0, n1=2, n2=-1, n3=None, n4=None, n5=501, n6=0, n7=2, n8=0, n9=0, n10=1, n11=1, n12=None, n13=None, cat14='68fd1e64', cat15='4c2bc594', cat16='d032c263', cat17='c18be181', cat18='25c83c98', cat19='fe6b92e5', cat20='1e9876db', cat21='0b153874', cat22='a73ee510', cat23='fa7d0797', cat24='043725ae', cat25='dfbb09fb', cat26='7f0d7407', cat27='8ceecbc8', cat28='7ac43a46', cat29='84898b2a', cat30='07c540c4', cat31='bc48b783', cat32=None, cat33=None, cat34='0014c32a', cat35=None, cat36='3a171ecb', cat37='3b183c5c', cat38=None, cat39=None),
 Row(y=0, n1=3, n2=-1, n3=2, n4=1, n5=79, n6=1, n7=3, n8=1, n9=1, n10=1, n11=1, n12=0, n13=1, cat14='68a25dc5', cat15='80e26c9b', cat16='3b40a9aa', cat17='37dff460', cat18='25c83c98', cat19=None, cat20='815e3303', cat21='0b153874', cat22='a73ee510', cat23='b9b1972c', cat24='2cfc1696', cat25='ba5646a2', cat26='9bbdb8bd', cat27='07d13a8f', cat28='f3635baf', cat29='cb880c3a', cat30='07c540c4', cat31='f54016b9', cat32='21ddcdc9', cat33='b1252a9d', cat34=

### Write dataframe to parquet file format

In [15]:
start = time.time()
print('Writing dataframe to parquet format..')

df.write.parquet('data/df.parquet', compression='snappy', mode='overwrite')
#df.write.parquet(OUT_FILES, compression='snappy', mode='overwrite')

print(f"... completed job in {time.time() - start} seconds")

Writing dataframe to parquet format..
... completed job in 4.1327736377716064 seconds


### Read in parquet files

In [16]:
df_pq = spark.read.load('data/df.parquet')

In [17]:
# count the number of rows
print(df_pq.count())

# check that the number of rows matches before and after the parquet conversion
print(df_pq.count() == df.count())

183029
True


In [18]:
# Examine first 5 rows
df_pq.head(5)

[Row(y=0, n1=None, n2=1, n3=2, n4=2, n5=23480, n6=361, n7=2, n8=2, n9=157, n10=None, n11=1, n12=None, n13=2, cat14='68fd1e64', cat15='7008ef6d', cat16='08e19f66', cat17='ded6a29a', cat18='25c83c98', cat19='fbad5c96', cat20='1c63b114', cat21='0b153874', cat22='a73ee510', cat23='f6f942d1', cat24='67841877', cat25='e7e7e539', cat26='781f4d92', cat27='07d13a8f', cat28='03259d67', cat29='ccf3df7a', cat30='e5ba7672', cat31='d1c83925', cat32=None, cat33=None, cat34='3336022d', cat35='ad3062eb', cat36='3a171ecb', cat37='b2f178a3', cat38=None, cat39=None),
 Row(y=0, n1=0, n2=57, n3=16, n4=27, n5=9105, n6=277, n7=26, n8=15, n9=1070, n10=0, n11=5, n12=0, n13=27, cat14='05db9164', cat15='7b99bba3', cat16='3c548aa7', cat17='96cc0f03', cat18='25c83c98', cat19=None, cat20='1c4d06eb', cat21='1f89b562', cat22='a73ee510', cat23='36c6971d', cat24='3168dd4c', cat25='8e5c8813', cat26='060905ec', cat27='b28479f6', cat28='b5de5085', cat29='bc8707ae', cat30='27c07bd6', cat31='42235923', cat32='21ddcdc9', cat3

### Load files to GCP bucket and convert to RDDs for Spark analysis

The `train.txt` and `test.txt` files were downloaded to an external hard drive and subsequently loaded into a GCP bucket. 

In [None]:
# This command streams the main data set from its S3 URL directly to your GCP bucket - this may take a little time (RUN THIS CELL AS IS)
#!curl -L "https://s3-eu-west-1.amazonaws.com/kaggle-display-advertising-challenge-dataset/dac.tar.gz" | gsutil cp - gs://{BUCKET}/finalproject/train.txt

In [None]:
# Do not run in the Docker container. This command puts a local file on GCP
#!gsutil cp 'train.txt' gs://{BUCKET}/finalproject/train.txt
#!gsutil cp 'train.txt' gs://danielalvarez_w261projects/finalproject/train.txt

In [None]:
# Do not run in the Docker container. This command puts a local file on GCP
#!gsutil cp 'test.txt' gs://{BUCKET}/finalproject/test.txt
#!gsutil cp 'test.txt' gs://danielalvarez_w261projects/finalproject/test.txt

In [None]:
# load the data into Spark RDDs for convenience of use later (RUN THIS CELL AS IS)
# trainRDD = sc.textFile('gs://danielalvarez_w261projects/finalproject/train.txt')
# testRDD = sc.textFile('gs://danielalvarez_w261projects/finalproject/test.txt')

In [None]:
# print the class
# print(type(trainRDD))
# print(type(testRDD))

In [None]:
# number of rows in the training RDD (an RDD is not a file, so count it in Spark rather than with `cat | wc -l`)
# print(trainRDD.count())

In [None]:
# convert RDDs to DataFrames
#DF = trainRDD.map(lambda x: (x.split('\t')[0], ast.literal_eval(x.split('\t')[1]))).toDF()
# from pyspark.sql.types import Row

# # build a dict that maps each field position to its value
# def f(x):
#     d = {}
#     for i in range(len(x)):
#         d[str(i)] = x[i]
#     return d

# # split each line on tabs, then populate the rows
# df = trainRDD.map(lambda x: Row(**f(x.split('\t')))).toDF()