<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Reading Data Lab 2

## Instructions

Consider the file `s3a://quantia-master/training/people-with-dups.txt` and:
1. Explore the file on S3
1. Read the file with PySpark, using `inferSchema`
1. Read the file using your own schema
1. Save it as Parquet on local disk
1. Read the Parquet file
1. Compute the average salary
1. Count by gender


### 1 - Explore the file on S3

In [None]:
%load_ext autotime

import pandas
import s3fs
import boto3
import io
import qcutils

baseUri = "s3a://quantia-master/training/"

In [None]:
qcutils.print_s3_bucket_object(key='training/people-with-dups.txt')
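
The point of exploring the raw file is to decide which reader options are needed, mainly whether there is a header row and which character separates the fields. The sketch below parses a small sample with Python's standard `csv` module to confirm that a colon delimiter splits a line into the expected columns. The sample row is hypothetical (made up for illustration, not copied from the real file); only the column names follow the schema used later in this lab.

```python
import csv
import io

# Hypothetical sample mimicking a colon-delimited file with a header row.
# The data row is invented for illustration; it is not from the real file.
sample_text = (
    "firstName:middleName:lastName:gender:birthDate:salary:ssn\n"
    "Ada:Marie:Smith:F:1990-01-15:50000:111-22-3333\n"
)

# Parse with ":" as the delimiter, the same separator passed to Spark via "sep"
reader = csv.reader(io.StringIO(sample_text), delimiter=":")
header = next(reader)      # first line: column names
first_row = next(reader)   # second line: first data record

print(header)
print(dict(zip(header, first_row)))
```

If a value itself contained a colon (for example a timestamp with a time part), the row would split into more fields than the header has, so it is worth checking the field count on a few lines before choosing the separator.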

### 2 - Read the file with PySpark, using inferSchema

In [None]:
from pyspark.sql import SparkSession
import os

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.5 pyspark-shell'

spark = (SparkSession.builder
    .master("local[*]")
    .appName("test")
    .getOrCreate()
)
qcutils.init_spark_session(spark)

spark

In [None]:
csvFile = baseUri + "people-with-dups.txt"

In [None]:
(spark.read                        # The DataFrameReader
   .option("header", "true")       # Use first line of all files as header
   .option("sep", ":")             # Use a colon delimiter (the default is a comma)
   .option("inferSchema", "true")  # Automatically infer data types
   .csv(csvFile)                   # Creates a DataFrame from CSV after reading in the file
   .printSchema()
)

### 3 - Read the file using your own schema

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, IntegerType

csvSchema = StructType([
  StructField("firstName", StringType(), nullable=False),
  StructField("middleName", StringType(), nullable=False),
  StructField("lastName", StringType(), nullable=False),
  StructField("gender", StringType(), nullable=False),
  StructField("birthDate", TimestampType(), nullable=False),
  StructField("salary", IntegerType(), nullable=False),
  StructField("ssn", StringType(), nullable=False)
])

In [None]:
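df1 = (spark.read
  .option('header', 'true')   # First line holds the column names
  .option('sep', ":")         # Fields are colon-separated
  .schema(csvSchema)          # Use the explicit schema instead of inferring it
  .csv(csvFile)
)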

df1.printSchema()

### 4 - Save as Parquet on local disk

In [None]:
outputBaseUri = "/home/jovyan/data/pyspark/"

(df1.write
  .option("compression", "snappy")   # Compress the Parquet files with Snappy
  .mode("overwrite")                 # Replace any existing output at the target path
  .parquet(outputBaseUri + "people-with-dups.parquet")
)

### 5 - Read the Parquet file

In [None]:
df2 = spark.read.parquet(outputBaseUri + "people-with-dups.parquet")

### 6 - Compute the average salary

In [None]:
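# groupBy() with no columns treats the whole DataFrame as a single group
df2.groupBy().avg("salary").show()

In [None]:
# Equivalent: select the column first, then average every numeric column left
df2.select(df2.salary).groupBy().avg().show()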

### 7 - Count by gender

In [None]:
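# Group by a Column object and count the rows in each group
df2.groupBy(df2.gender).count().show()

In [None]:
# Equivalent: refer to the column by name
df2.groupBy("gender").count().show()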

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.