<h1> Deep Learning - CPGEI </h1>
<h2> Distributed processing with Apache Spark - November 29, 2022 </h2>
<h3> Prof. M.Sc. Clayton Kossoski </h3>

## Basic requirements

In [None]:
! pip install findspark pandas matplotlib numpy keras pyspark tensorflow==2.9.0

## Start PySpark

In [None]:
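import os

import findspark
findspark.init()  # locate the Spark installation and make PySpark importable

# Expose only the GPU with index 1 to CUDA-based libraries such as TensorFlow
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import pyspark
from pyspark.sql import SQLContext, SparkSession

# Connect to the standalone Spark master running on host "Tux"
spark = (
    SparkSession.builder
    .master("spark://Tux:7077")
    .appName("sparkFromJupyter")
    .getOrCreate()
)
# SQLContext is deprecated since Spark 3.0; SparkSession is the preferred entry point
sqlContext = SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)
sc = spark.sparkContext
print("Spark Version: " + spark.version)
print("PySpark Version: " + pyspark.__version__)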

## EXAMPLE 1

## **Defining a dataset, loading, transforming and plotting**

Requirements: Functional Programming Concepts

https://www.geeksforgeeks.org/functional-programming-in-python/

https://medium.com/analytics-vidhya/pyspark-in-15-minutes-49bcde83f6b
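
As a quick refresher on the functional style that the RDD API borrows, here is a minimal sketch in plain Python (no Spark required). The sample readings and the 20-degree threshold are assumptions chosen for illustration. `map`, `filter`, and `functools.reduce` are the standard-library counterparts of the RDD operations with the same names.

```python
from functools import reduce

# Hypothetical (name, temperature) readings, used only for illustration.
readings = [("s1", 15.0), ("s2", 24.5), ("s3", 21.9), ("s4", 30.1)]

# map: apply a function to every element (here, extract the temperature).
temps = list(map(lambda r: r[1], readings))

# filter: keep only the elements that satisfy a predicate.
warm = list(filter(lambda t: t > 20.0, temps))

# reduce: fold the elements into a single value (here, the maximum).
hottest = reduce(lambda a, b: a if a > b else b, warm)

print(temps)
print(warm)
print(hottest)
```

In PySpark the same pipeline would be written as chained calls on an RDD. The difference is that Spark evaluates the transformations lazily and distributes the work across the cluster.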

In [None]:
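from pyspark.sql import Row

# Each tuple is (name, temperature, humidity)
sensorDataList = [('s1', 15.0, 40.1), ('s2', 24.5, 12.0), ('s3', 21.9, 42.45), ('s4', 30.1, 10.4)]
rdd = sc.parallelize(sensorDataList)  # distribute the local list as an RDD
# Convert each tuple into a Row with named fields
sensorData = rdd.map(lambda x: Row(name=x[0], temp=float(x[1]), hum=float(x[2])))
df = spark.createDataFrame(sensorData)  # the schema is inferred from the Row fields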

In [None]:
df.printSchema()

In [None]:
df.count()

In [None]:
df.show()

In [None]:
ds_pandas = df.toPandas()

In [None]:
ds_pandas

In [None]:
ds_pandas.set_index('name')['temp'].plot();

## EXAMPLE 2

## **Manipulating a dataset of sensor readings with PySpark**

Download: https://data.melbourne.vic.gov.au/Environment/Sensor-readings-with-temperature-light-humidity-ev/ez6b-syvw

Title: Sensor readings, with temperature, light, humidity every 5 minutes at 8 locations (trial, 2014 to 2015)

## PySpark ETL (Extract, Transform, Load)

https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
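
To make the three ETL stages concrete before running them on the cluster, the following is a minimal sketch that uses only the Python standard library. The two-row sample and the in-memory destination are assumptions made for illustration. Only the column names `boardid` and `temp_avg` are taken from the dataset used in this example.

```python
import csv
import io

# Hypothetical two-row sample standing in for the real CSV file.
raw = "boardid,temp_avg\n501,18.5\n502,21.0\n"

# Extract: parse the text into dictionaries keyed by the header row.
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: cast the string fields to proper numeric types.
clean = [
    {"boardid": int(r["boardid"]), "temp_avg": float(r["temp_avg"])}
    for r in rows
]

# Load: write the cleaned records to a destination (an in-memory buffer here).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["boardid", "temp_avg"])
writer.writeheader()
writer.writerows(clean)

print(clean)
```

Spark's CSV reader covers the extract step and, with schema inference enabled, the type-casting part of the transform step in a single call.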

In [None]:
# Use the built-in "csv" source; "com.databricks.spark.csv" is the name of the older external spark-csv package
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("delimiter", ",") \
    .option("inferSchema", "true") \
    .load("data/Sensor_readings__with_temperature__light__humidity_every_5_minutes_at_8_locations__trial__2014_to_2015_.csv")

### 1. How many rows are there in the file, and how many of them are distinct?

In [None]:
df.count()

In [None]:
df.distinct().count()

### 2. Show a readable summary of this dataset containing the temp_avg, humidity_avg and elevation columns

In [None]:
df.describe(['temp_avg', 'humidity_avg', 'elevation']).show()

### 3. Show the schema

In [None]:
df.printSchema()

### 4. Display the first 10 rows containing only the following attributes: timestamp, boardid, temp_avg, light_avg, and humidity_avg.

In [None]:
df.select('timestamp', 'boardid', 'temp_avg', 'light_avg','humidity_avg').show(10, truncate=False)

### 5. What were the overall minimum (temp_min) and maximum (temp_max) temperatures recorded across the entire dataset?

In [None]:
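# Import the functions module under an alias so that Python's built-in min/max are not shadowed
from pyspark.sql import functions as F

# Compute both aggregates in a single pass over the data
df.agg(F.min("temp_min"), F.max("temp_max")).show()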

### 6. How many distinct sensors recorded this data, and what are their names/IDs?

In [None]:
# how many distinct sensors exist
df.select("boardid").distinct().count()

In [None]:
# what the sensors are called
df.select("boardid").distinct().show()

### 7. Display the highest temperature (temp_max) and lowest temperature (temp_min) recorded by each sensor.

In [None]:
df.groupBy("boardid").max("temp_max").show()

df.groupBy("boardid").min("temp_min").show()

### 8. How many records are there per sensor?

In [None]:
# Count the rows in each boardid group
df.groupBy("boardid").count().show()
