<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) CSV Ingestion

**Data Source**
* The data is available at:
    * `s3a://quantia-master/training/Exam-1-3/store_sales.csv` **for PySpark**
    * `s3://quantia-master/training/Exam-1-3/store_sales.csv` **for Python**
* The CSV file contains around 2 million rows

**Instructions**
* Take a first look at the data
* Read the CSV file using Python (pandas) and PySpark:
  * With an inferred schema
  * With a user-defined schema
* Show part of the resulting `DataFrame`

**Hints**
* The NumPy `int64` data type has a limitation: it cannot represent NA (missing) values
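
The hint above can be checked on a tiny in-memory sample. The sketch below is illustrative only: the three-row sample and its values are made up, and the nullable `Int64` extension dtype is a pandas alternative that this notebook does not use.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row sample (made-up values) with the same "|" separator;
# the second row has a missing ss_customer_sk.
sample = "ss_item_sk|ss_customer_sk\n1|10\n2|\n3|30\n"

# Inferred schema: the column with a missing value falls back to float64.
inferred = pd.read_csv(io.StringIO(sample), sep="|")
print(inferred.dtypes)

# Forcing NumPy int64 on a column that contains NA raises an error.
try:
    pd.read_csv(io.StringIO(sample), sep="|", dtype={"ss_customer_sk": np.int64})
    int64_failed = False
except ValueError as err:
    int64_failed = True
    print("int64 failed:", err)

# The pandas nullable "Int64" extension dtype keeps integers and stores NA as <NA>.
nullable = pd.read_csv(io.StringIO(sample), sep="|", dtype={"ss_customer_sk": "Int64"})
print(nullable)
```

Declaring such columns as `np.float64` avoids the error at the cost of showing keys as floats. `"Int64"` (capital `I`) is one possible alternative when integer semantics matter.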

## ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) General Set-up

In [None]:
%load_ext autotime

import os
import qcutils
from pyspark.sql import SparkSession
import boto3
import io
import pandas
import numpy as np

# Base URIs of the exercise data: s3a:// for PySpark, s3:// for Python (pandas).
psbaseUri = "s3a://quantia-master/training/Exam-1-3/"
pybaseUri = "s3://quantia-master/training/Exam-1-3/"

qcutils.print_s3_bucket_object(key='training/Exam-1-3/store_sales.csv')

## ![Python Tiny Logo](https://www.quantiaconsulting.com/logos/logo_python_tiny.png) Python

In [None]:
# Read with an inferred schema: pandas chooses a dtype for each column.
df = pandas.read_csv(pybaseUri + "store_sales.csv", sep="|")
df.info()

In [None]:
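# Read with a user-defined schema. float64 is used for columns that may
# contain NA values, since NumPy int64 cannot represent them.
df = pandas.read_csv(
    pybaseUri + "store_sales.csv",
    sep="|",
    dtype={
        "ss_sold_date_sk": np.float64,
        "ss_sold_time_sk": np.float64,
        "ss_item_sk": np.int64,
        "ss_customer_sk": np.float64,
        "ss_cdemo_sk": np.float64,
        "ss_hdemo_sk": np.float64,
    },
)

df.info()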
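
For a first look that avoids parsing all rows, `read_csv` accepts `nrows`, and `head()` shows part of a `DataFrame` that is already loaded. A minimal sketch on a made-up in-memory sample (the values are hypothetical; in the notebook the first argument would be the S3 path used above):

```python
import io

import pandas as pd

# Hypothetical 100-row sample with made-up values, standing in for store_sales.csv.
header = "ss_sold_date_sk|ss_item_sk\n"
rows = "".join(f"{2451000 + i}|{i}\n" for i in range(100))
sample = header + rows

# Parse only the first 5 data rows: a cheap first look at a large file.
preview = pd.read_csv(io.StringIO(sample), sep="|", nrows=5)
print(preview)

# Load everything, then display only the first 3 rows.
full = pd.read_csv(io.StringIO(sample), sep="|")
print(full.head(3))
```

Note that dtypes inferred from a short preview may differ from those inferred from the full file, for example when missing values only appear in later rows.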

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) PySpark

In [None]:
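# Fetch the AWS SDK and hadoop-aws packages so Spark can read s3a:// paths.
# This must be set before the SparkSession is created.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.5 pyspark-shell'

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("test")
    .getOrCreate()
)
qcutils.init_spark_session(spark)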

spark

In [None]:
# Read with an inferred schema: Spark scans the data to infer column types.
df = (
    spark.read
    .option("header", "true")
    .option("sep", "|")
    .option("inferSchema", "true")
    .csv(psbaseUri + "store_sales.csv")
)

df.printSchema()

In [None]:
from pyspark.sql.types import IntegerType, StructField, StructType

# User-defined schema: every listed column is a nullable integer.
csvSchema = StructType([
    StructField("ss_sold_date_sk", IntegerType(), nullable=True),
    StructField("ss_sold_time_sk", IntegerType(), nullable=True),
    StructField("ss_item_sk", IntegerType(), nullable=True),
    StructField("ss_customer_sk", IntegerType(), nullable=True),
    StructField("ss_cdemo_sk", IntegerType(), nullable=True),
    StructField("ss_hdemo_sk", IntegerType(), nullable=True),
])

# Read with the explicit schema, so no inference pass over the data is needed.
df = (
    spark.read
    .option("header", "true")
    .option("sep", "|")
    .schema(csvSchema)
    .csv(psbaseUri + "store_sales.csv")
)

df.printSchema()

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.