<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# Reading Data Lab
* The goal of this lab is to put into practice some of what you have learned about reading data with Apache Spark (PySpark) and Python pandas.

## Instructions
0. Start with the file **quantia-master/training/data_geo.csv**.
0. Inspect the content of the file and ask yourself:
    * What is the separator?
    * Is there a header?
    ...
0. Read in the file using Python (pandas) in two different ways
    1 - Let the system infer the schema
    2 - Manually pass the schema
0. Repeat the read operation using PySpark (both ways)
0. Use PySpark to save the dataframe as a `parquet` in the `/home/jovyan/data/pyspark` folder
0. Use PySpark to save the dataframe as a `table` in the default db
0. Perform some exploration queries:
    * Extract the `2015 Median Home Prices`
    * Show the top 10 cities by `2015 Median Sales Price`
    * Show the top 10 cities with the `2015 Median Sales Price` >= $ 300,000
    * ...

## Getting Started

Let's start by importing libraries and creating some useful variables.

In [None]:
%load_ext autotime

import os
import qcutils
from pyspark.sql import SparkSession
import boto3
import io
import pandas
import s3fs

s3 = boto3.client('s3')
psBaseUri = "s3://quantia-master/training/"
pyBaseUri = "s3a://quantia-master/training/"

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.5 pyspark-shell'

spark = (SparkSession.builder
    .master("local[*]")
    .appName("test")
    .getOrCreate()
)
qcutils.init_spark_session(spark)

## Explore data

You can use the `print_s3_bucket_object(...)` function from `qcutils`.

In [None]:
qcutils.print_s3_bucket_object(key='training/data_geo.csv')

`data_geo.csv` is a common CSV file with `,` as the separator and with a header (the first line, up to the first `\n` character).

Often (as today) you cannot open the file in a text editor to "beautify" it, so you have to inspect the raw printed output.
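
The two questions from the instructions (separator and header) can also be checked programmatically. The following is a minimal sketch using `csv.Sniffer` from the Python standard library. The `sample` string is an invented stand-in for the first lines of `data_geo.csv` (the rows are not real data), and the sniffer relies on heuristics, so treat its answer as a hint rather than a guarantee.

```python
import csv

# Invented sample that mimics the layout of data_geo.csv (the rows are NOT real data).
sample = (
    "2014 rank,City,State,State Code,2014 Population estimate,2015 median sales price\n"
    "1,Alpha City,First State,FS,100000,250.5\n"
    "2,Beta Town,Second State,SS,80000,310.0\n"
    "3,Gamma Village,Third State,TS,60000,\n"
)

sniffer = csv.Sniffer()

# Restrict the candidates to a few common separators.
dialect = sniffer.sniff(sample, delimiters=",;\t|")

# Heuristic: compares the first row with the types/lengths of the following rows.
has_header = sniffer.has_header(sample)

print(f"separator: {dialect.delimiter!r}")
print(f"header detected: {has_header}")
```

In practice the sample would be the first few kilobytes of the real file rather than a hard-coded string.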

## Read Data

### Python with inferSchema

In [None]:
pyCsvPath = pyBaseUri + "data_geo.csv"

pydf = pandas.read_csv(pyCsvPath)
pydf.info()
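
To see what pandas infers without touching S3, the sketch below reads a small in-memory sample. The rows are invented for illustration (only the header layout is assumed to match `data_geo.csv`), and the last row deliberately has no price.

```python
import io
import pandas

# Invented rows, not real data; only the header is assumed to match data_geo.csv.
sample = (
    "2014 rank,City,State,State Code,2014 Population estimate,2015 median sales price\n"
    "1,Alpha City,First State,FS,100000,250.5\n"
    "2,Beta Town,Second State,SS,80000,310.0\n"
    "3,Gamma Village,Third State,TS,60000,\n"
)

inferred = pandas.read_csv(io.StringIO(sample))

# Whole-number columns without gaps are inferred as integers;
# the price column has decimals and a missing value, so it becomes float64 with NaN.
print(inferred.dtypes)
print(inferred["2015 median sales price"].isna().sum(), "missing price(s)")
```

Inference depends on the values pandas actually sees, so the result on the real file may differ from this sample.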

### Python with user-defined Schema

In [None]:
import numpy as np

pyCsvPath = pyBaseUri + "data_geo.csv"

# Explicit column types: pandas uses these instead of inferring them.
# Text columns use the built-in `str`; `np.string_` is a bytes type and was removed in NumPy 2.0.
pydf = ( pandas
            .read_csv(
              pyCsvPath
              , dtype={
                '2014 rank': np.int64
                , 'City': str
                , 'State': str
                , 'State Code': str
                , '2014 Population estimate': np.float64
                , '2015 median sales price': np.float64
              }
            )
           )
pydf.info()
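
The effect of an explicit `dtype` mapping can be checked on an in-memory sample. This is only a sketch: the rows are invented, and only the column names are assumed to match `data_geo.csv`.

```python
import io
import numpy as np
import pandas

# Invented rows, not real data; only the header is assumed to match data_geo.csv.
sample = (
    "2014 rank,City,State,State Code,2014 Population estimate,2015 median sales price\n"
    "1,Alpha City,First State,FS,100000,250.5\n"
    "2,Beta Town,Second State,SS,80000,310.0\n"
    "3,Gamma Village,Third State,TS,60000,\n"
)

dtypes = {
    "2014 rank": np.int64,
    "City": str,
    "State": str,
    "State Code": str,
    "2014 Population estimate": np.float64,
    "2015 median sales price": np.float64,
}

typed = pandas.read_csv(io.StringIO(sample), dtype=dtypes)

# The population column would be inferred as an integer here;
# the mapping forces it to float64 instead.
print(typed.dtypes)
```

A float column can hold missing values as `NaN`, which a plain `int64` column cannot.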

### PySpark with inferSchema

In [None]:
psCsvPath = psBaseUri + "data_geo.csv"

psTestDF = (spark.read
            .option("header", True)
            .option("inferschema", True)
            .csv(psCsvPath)
           )

psTestDF.printSchema()

### PySpark with user-defined Schema

In [None]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

psCsvPath = psBaseUri + "data_geo.csv"

csvSchema = StructType([
  StructField("2014_rank", IntegerType(), nullable=False),
  StructField("City", StringType(), nullable=False),
  StructField("State", StringType(), nullable=False),
  StructField("State_Code", StringType(), nullable=False),
  StructField("2014_Population_estimate", DoubleType(), nullable=False),
  StructField("2015_Median_sales_price", DoubleType(), nullable=True),
])

# With an explicit schema Spark does not infer types, so the inferSchema option is not needed
psTestDF = (spark.read
            .option("header", True)
            .schema(csvSchema)
            .csv(psCsvPath)
           )

psTestDF.printSchema()

In [None]:
psTestDF

## Write Data

### Parquet

In [None]:
psTestDF.write.parquet("/home/jovyan/data/pyspark/data_geo.parquet")

### Table

NOTE: use the `overwrite` mode to be able to overwrite a table that has already been saved.

In [None]:
psTestDF.write.mode("overwrite").saveAsTable("data_geo")

## Explore Data

### Extract the `2015 Median Home Prices`

In [None]:
spark.sql("""
SELECT State_Code, 2015_median_sales_price 
FROM data_geo
""")

### Extract the top 10 cities by `2015 Median Sales Price`

In [None]:
spark.sql("""
SELECT
    City, 
    2014_Population_estimate/1000 AS 2014_Population_Estimate_1000, 
    2015_median_sales_price AS 2015_Median_Sales_Price_1000
FROM data_geo 
ORDER BY 2015_median_sales_price DESC
LIMIT 10
""")

### Extract the top 10 cities with the `2015 Median Sales Price` >= $ 300,000

In [None]:
spark.sql("""
SELECT
    City, 
    State_Code, 
    2015_median_sales_price
FROM data_geo 
WHERE 2015_median_sales_price >= 300 -- the price column is expressed in thousands of dollars
ORDER BY 2015_median_sales_price DESC
LIMIT 10
""")
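
For comparison, the same three queries can be sketched in pandas. This is an illustration under assumptions, not part of the original solution: the rows below are invented, the column names are assumed to be the original CSV headers (with spaces), and the price column is assumed to be expressed in thousands of dollars.

```python
import io
import pandas

# Invented rows, not real data; only the header is assumed to match data_geo.csv.
sample = (
    "2014 rank,City,State,State Code,2014 Population estimate,2015 median sales price\n"
    "1,Alpha City,First State,FS,100000,250.5\n"
    "2,Beta Town,Second State,SS,80000,310.0\n"
    "3,Gamma Village,Third State,TS,60000,\n"
    "4,Delta City,First State,FS,40000,420.0\n"
)
df = pandas.read_csv(io.StringIO(sample))
price = "2015 median sales price"

# Query 1: project the state code and the price.
prices = df[["State Code", price]]

# Query 2: top 10 cities by price (rows with a missing price are sorted last).
top10 = df.sort_values(price, ascending=False).head(10)[["City", price]]

# Query 3: cities at or above 300 (thousands of dollars, by assumption).
expensive = (
    df[df[price] >= 300]
    .sort_values(price, ascending=False)
    .head(10)[["City", "State Code", price]]
)
print(expensive)
```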

### A better solution, proposed by Federico

We do not ask Spark to read `2014_Population_estimate` and `2015_Median_sales_price` as `FloatType` or `DoubleType`:

```
  StructField("2014_Population_estimate", DoubleType(), nullable=False),
  StructField("2015_Median_sales_price", DoubleType(), nullable=True),
```

Instead, we read them as `StringType`:

```
  StructField("2014_Population_estimate", StringType(), nullable=False),
  StructField("2015_Median_sales_price", StringType(), nullable=True),
```

In [None]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

psCsvPath = psBaseUri + "data_geo.csv"

csvSchema = StructType([
  StructField("2014_rank", IntegerType(), nullable=False),
  StructField("City", StringType(), nullable=False),
  StructField("State", StringType(), nullable=False),
  StructField("State_Code", StringType(), nullable=False),
  StructField("2014_Population_estimate", StringType(), nullable=False),
  StructField("2015_Median_sales_price", StringType(), nullable=True),
])

# With an explicit schema Spark does not infer types, so the inferSchema option is not needed
psTestDF = (spark.read
            .option("header", True)
            .schema(csvSchema)
            .csv(psCsvPath)
           )

psTestDF.printSchema()

In [None]:
psTestDF

***NOTE*** that the rows are now read correctly, but the `null` values are strings!

In [None]:
psTestDF.filter(psTestDF["2015_Median_sales_price"] == "null")

If we instead cast the columns using `withColumn` and `cast`:

In [None]:
castedPsTestDF = (psTestDF.withColumn("2015_Median_sales_price",psTestDF["2015_Median_sales_price"].cast("float"))
            .withColumn("2014_Population_estimate",psTestDF["2014_Population_estimate"].cast("float"))
           )

***NOTE*** that the nulls are now real nulls.

In [None]:
castedPsTestDF

In [None]:
castedPsTestDF.filter(castedPsTestDF["2015_Median_sales_price"] == "null")
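
The same read-as-string-then-cast pattern can be sketched in pandas for comparison. This is a hedged analogue rather than the Spark solution above: the sample rows are invented, and `keep_default_na=False` is used here only to make pandas keep the literal text `null` instead of converting it at read time.

```python
import io
import pandas

# Invented rows, not real data.
sample = (
    "City,2015 median sales price\n"
    "Alpha City,250.5\n"
    "Beta Town,null\n"
)
price = "2015 median sales price"

# Read everything as text; "null" stays a plain string.
raw = pandas.read_csv(io.StringIO(sample), dtype=str, keep_default_na=False)
print((raw[price] == "null").sum(), "row(s) contain the string 'null'")

# Cast to numbers; values that cannot be parsed become real missing values (NaN).
casted = pandas.to_numeric(raw[price], errors="coerce")
print(casted.isna().sum(), "row(s) are really missing after the cast")
```

After the cast, missing values are found with `isna()` rather than by comparing against the string `"null"`.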