# Query Optimization

We create the Spark session

In [None]:
import $ivy.`org.apache.spark::spark-sql:2.4.5` 

import org.apache.spark.sql.{NotebookSparkSession, SparkSession}

val spark: SparkSession = 
    NotebookSparkSession
      .builder()
      .appName("Queries Optimization")
      .master("local[*]")
      .getOrCreate()


In [None]:
import $ivy.`org.plotly-scala::plotly-almond:0.8.1`

import plotly._
import plotly.element._
import plotly.layout._
import plotly.Almond._

In [None]:
import $ivy.`ch.cern.sparkmeasure:spark-measure_2.12:0.17`

Logging

In [None]:
import org.slf4j.LoggerFactory
import org.apache.log4j.{Level, Logger}
Logger.getRootLogger().setLevel(Level.ERROR)

imports

In [None]:
import spark.implicits._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.{functions => func}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

# The Data

The dataset was obtained from:
https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide

It contains the daily COVID-19 cases per country up to 14-12-2020.

The second part uses the response measures applied in each country, with their start and end dates:

https://www.ecdc.europa.eu/en/publications-data/download-data-response-measures-covid-19

The query that computes infections per km² uses population data:

https://www.kaggle.com/tanuprabhu/population-by-country-2020

Finally, we also work with vaccinations:

https://www.kaggle.com/gpreda/covid-world-vaccination-progress

For that part we need another COVID-19 dataset, because the original one does not include all the dates:

https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset

## A case class for working with infections

In [None]:
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)
case class Infection(day : Int, 
                     month : Int, 
                     year : Int, 
                     nCases: Int, 
                     nDeaths : Int, 
                     country : String,  
                     continent : String) 
extends Serializable

And a helper to measure execution times

In [None]:
def runWithOutput[A](code: => A): Int = {
    val start = System.currentTimeMillis()
    val res = code // forces the by-name block to run; the result itself is discarded
    val elapsed = System.currentTimeMillis() - start
    // print the same value that is returned, instead of taking a second measurement
    println(s"Took $elapsed ms")
    elapsed.toInt
}
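
A timing helper can also hand back the computed value together with the elapsed time, so the result is not thrown away. A minimal sketch, assuming a hypothetical helper named `timed` (it is not part of this notebook or of Spark) and `System.nanoTime` as the clock:

```scala
// Hypothetical helper (not part of the original notebook): returns the
// result of the block together with the elapsed wall-clock time in ms.
def timed[A](code: => A): (A, Long) = {
  val start = System.nanoTime()
  val result = code // the by-name block is evaluated exactly once, here
  val elapsedMs = (System.nanoTime() - start) / 1000000L
  (result, elapsedMs)
}

// Usage with a plain Scala computation standing in for a Spark action.
val (total, ms) = timed((1 to 1000).sum)
println(s"sum = $total, took $ms ms")
```

With a Spark job, the block would typically be an action such as `collect`, since transformations alone are lazy and would measure almost nothing.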

### Helper to use showHTML()

In [None]:
// Credit to Aivean
implicit class RichDF(val ds:DataFrame) {
    def showHTML(limit:Int = 20, truncate: Int = 20) = {
        import xml.Utility.escape
        val data = ds.take(limit)
        val header = ds.schema.fieldNames.toSeq        
        val rows: Seq[Seq[String]] = data.map { row =>
          row.toSeq.map { cell =>
            val str = cell match {
              case null => "null"
              case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
              case array: Array[_] => array.mkString("[", ", ", "]")
              case seq: Seq[_] => seq.mkString("[", ", ", "]")
              case _ => cell.toString
            }
            if (truncate > 0 && str.length > truncate) {
              // do not show ellipses for strings shorter than 4 characters.
              if (truncate < 4) str.substring(0, truncate)
              else str.substring(0, truncate - 3) + "..."
            } else {
              str
            }
          }: Seq[String]
        }
        publish.html(s""" <table>
                <tr>
                 ${header.map(h => s"<th>${escape(h)}</th>").mkString}
                </tr>
                ${rows.map { row =>
                  s"<tr>${row.map{c => s"<td>${escape(c)}</td>" }.mkString}</tr>"
                }.mkString}
            </table>
        """)        
    }
}

# Starting with RDDs

In [None]:
val infectionData = spark.sparkContext.textFile("data.csv")

A function that turns the raw lines into an RDD of infections

In [None]:
def infections(lines : RDD[String]) : RDD[Infection] =
    lines.map(line => {
      val arr = line.split(",")
      Infection(
        day = arr(1).toInt,
        month = arr(2).toInt,
        year = arr(3).toInt,
        nCases = arr(4).toInt,
        nDeaths = arr(5).toInt,
        country = arr(6),
        continent = arr(10)
      )
    })
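
Note that `toInt` throws a `NumberFormatException` on a header row or on a malformed line, and with an RDD the failure only surfaces when an action runs. A sketch of a more defensive line parser in plain Scala follows. The names `ParsedInfection` and `parseInfection` and the sample lines are hypothetical; the column positions are the ones used in the function above.

```scala
import scala.util.Try

// Hypothetical stand-in for the Infection case class, so the sketch runs on its own.
case class ParsedInfection(day: Int, month: Int, year: Int,
                           nCases: Int, nDeaths: Int,
                           country: String, continent: String)

// Returns None for a header row, a line with too few fields or a non-numeric field.
def parseInfection(line: String): Option[ParsedInfection] = {
  val arr = line.split(",")
  Try(ParsedInfection(arr(1).toInt, arr(2).toInt, arr(3).toInt,
                      arr(4).toInt, arr(5).toInt, arr(6), arr(10))).toOption
}

// Made-up sample lines, only for illustration.
val sampleLines = Seq(
  "dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018,continentExp",
  "14/12/2020,14,12,2020,100,5,Spain,ES,ESP,46000000,Europe",
  "broken,line"
)
val parsed = sampleLines.flatMap(line => parseInfection(line))
```

On an RDD the same function should work with `lines.flatMap(line => parseInfection(line))`, which silently drops the lines that cannot be parsed; whether dropping them is acceptable depends on the data.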

I compute the average daily infections per country using a pair RDD

In [None]:
  def infectionGrowthAverage(infections : RDD[Infection]) : RDD[(String, Int)]= {

    val countriesAndCases : RDD[(String, Iterable[Int])] = 
      infections.map(x => (x.country,x.nCases))
      .groupByKey()
      
    countriesAndCases.mapValues(x => (x.sum / x.size)).sortBy(_._2)
  }
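
Two details of the function above are worth illustrating: `x.sum / x.size` is integer division, so the average is truncated, and `groupByKey` keeps every value of a key before averaging. The sketch below uses plain Scala collections (not Spark) and made-up numbers, and carries a `(sum, count)` pair per key instead:

```scala
// Made-up (country, daily cases) pairs, only for illustration.
val sampleCases = Seq("A" -> 1, "A" -> 2, "B" -> 10, "B" -> 20, "B" -> 30)

// Integer division truncates the fractional part of the average.
val truncatedAvgA = Seq(1, 2).sum / Seq(1, 2).size

// Carry (sum, count) per key and divide once at the end, as Doubles.
val sumAndCount: Map[String, (Long, Long)] =
  sampleCases.foldLeft(Map.empty[String, (Long, Long)]) { case (acc, (country, n)) =>
    val (s, c) = acc.getOrElse(country, (0L, 0L))
    acc.updated(country, (s + n, c + 1))
  }

val averages: Map[String, Double] =
  sumAndCount.map { case (country, (s, c)) => country -> s.toDouble / c }
```

In Spark, the same `(sum, count)` merge could be expressed with `reduceByKey` on an `RDD[(String, (Long, Long))]`, which combines values within each partition before the shuffle.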

I build the result and then measure the execution time

In [None]:
val infectionRDD = infections(infectionData)
val infectionAvgRDD = infectionGrowthAverage(infectionRDD)

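Using the Spark API

In [None]:
// spark.time prints the elapsed time and returns the result of the block,
// so timeRDD holds the collected rows, not a duration
val timeRDD = spark.time(infectionAvgRDD.collect())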

or the CERN sparkMeasure library, which reports more detailed metrics

In [None]:
ch.cern.sparkmeasure.StageMetrics(spark).runAndMeasure(infectionAvgRDD.collect())

# The same computation with a DataFrame

I convert the RDD obtained earlier into a DataFrame, so that its schema is inferred from the Infection case class

In [None]:
val infectionDF = spark.createDataFrame(infectionRDD)

I use the DataFrame API, which includes a built-in, optimized aggregation for the average.

Running it, we see that the execution time is significantly lower than with the RDD

In [None]:
val infAvgOrDf = infectionDF.
    groupBy("country")
    .avg("nCases")
    .orderBy(desc("avg(nCases)"))

In [None]:
infAvgOrDf.showHTML()

In [None]:
spark.time(infAvgOrDf.count())

In [None]:
val timeDF = spark.time(infAvgOrDf.collect)

In [None]:
ch.cern.sparkmeasure.StageMetrics(spark).runAndMeasure(infAvgOrDf.collect)

Another option is to create the DataFrame directly from the CSV file, although it is then no longer a DataFrame of Infection records

In [None]:
val dfCovid = spark.read
.option("header", "true")
.option("charset", "UTF8")
.option("delimiter",",")
.option("inferSchema", "true")
.csv("covidworldwide.csv")

In [None]:
dfCovid.schema

In [None]:
dfCovid.explain

In [None]:
val dfCovidWithSchema = dfCovid.toDF
    .groupBy("countriesAndTerritories")
    .agg(mean("cases"))
    .orderBy("avg(cases)")

In [None]:
ch.cern.sparkmeasure.StageMetrics(spark).runAndMeasure(dfCovidWithSchema.collect)

I can also define the schema manually to create the DataFrame

In [None]:
// The schema is defined by hand here, but it could also be read off by importing the CSV and checking what Spark infers by default

val schema = new StructType()
    .add("dateRep",StringType,true)
    .add("day",IntegerType,true)
    .add("month",IntegerType,true)
    .add("year",IntegerType,true)
    .add("cases",IntegerType,true)
    .add("deaths",IntegerType,true)
    .add("countriesAndTerritories",StringType,true)
    .add("geoId",StringType,true)
    .add("countryterritoryCode",StringType,true)
    .add("popData2018",IntegerType,true)
    .add("continentExp",StringType,true)

In [None]:
val df = spark.read
.format("csv")
.option("header","true")
.schema(schema)
.load("data.csv")

In [None]:
df.printSchema()

# And with a Dataset

In [None]:
val infectionDS = spark.read
.option("header", "true")
.option("charset", "UTF8")
.option("delimiter",",")
.csv("covidworldwide.csv")
.as[(String,String,String,String,String,String,String,String,String,String,String,String)]

In [None]:
val avgDS = 
    infectionDS.groupBy($"countriesAndTerritories")
    .agg(avg($"cases"))
    .orderBy("avg(cases)")
    .as[(String,Double)]

In [None]:
ch.cern.sparkmeasure.StageMetrics(spark).runAndMeasure(avgDS.collect)

### Working with Dataset[Infection]

In [None]:
val infectionDataset = spark.createDataset(infectionRDD)

In [None]:
val avgInfectionDS = infectionDataset
    .groupBy($"country")
    .agg(avg($"nCases").as[Double])
    .orderBy("avg(nCases)")
    .as[(String,Double)]

In [None]:
ch.cern.sparkmeasure.StageMetrics(spark).runAndMeasure(avgInfectionDS.collect)

# Joining a second table with RDD, Dataset and DataFrame

## A query to compute the average infections per km²

### Using RDDs

In [None]:
val populationData = spark.sparkContext.textFile("population_by_country_2020.csv")

In [None]:
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)
case class Population(
    country : String, 
    population : Int, 
    density : Int, 
    land_area: Int, 
    ) 
extends Serializable

I drop the first line (the header) of the CSV and create a population RDD

In [None]:
val header = populationData.first() 

def population(lines : RDD[String]) : RDD[Population] =
    lines.filter(x => x != header)
    .map(line => {
      val arr = line.split(",")
      Population(
        country = arr(0),
        population = arr(1).toInt,
        density = arr(4).toInt,
        land_area = arr(5).toInt,
      )
    })

I check that the data is displayed correctly

In [None]:
val populationRDD = population(populationData)

In [None]:
populationRDD.toDF.showHTML()

### A join that is computationally heavy from the start, because it joins all the data without first keeping only the fields we need

Spark only offers join on pair RDDs, so we have to build them first

In [None]:
// populationRDD.join(infectionRDD)

I build pair RDDs that keep every field

In [None]:
val populationByCountry = populationRDD.map(
    x => (x.country,x))

val infectionByCountry = 
      infectionRDD.map(x => (x.country,x))

I run the join and group by country

In [None]:
val megaRDD = infectionByCountry.join(populationByCountry).groupByKey()

Finally, I compute the average

In [None]:
megaRDD.mapValues(
    x => x.map( 
        line => line._1.nCases.toFloat / line._2.land_area.toFloat
    )).mapValues(
    x => x.sum / x.size
).collect()

I do everything in a single expression, to measure the execution time

In [None]:
val notOptimizedRDD =
    infectionByCountry.join(populationByCountry)
    .groupByKey()
    .mapValues(
    x => x.map( 
        line => line._1.nCases.toFloat / line._2.land_area.toFloat)
    ).mapValues(
        x => x.sum / x.size
    )

Does joining in the reverse order make any difference? Apparently not

In [None]:
ch.cern.sparkmeasure.StageMetrics(spark).runAndMeasure(
    populationByCountry.join(infectionByCountry)
    .groupByKey()
    .mapValues(
    x => x.map( 
        // after the reversed join, _1 is the Population and _2 the Infection
        line => line._2.nCases.toFloat / line._1.land_area.toFloat)
    ).mapValues(
        x => x.sum / x.size
    ).collect()
)

#### Optimizing this query a little:

I keep only the fields I need in the pair RDDs before joining

In [None]:
val countriesAndLandArea = populationRDD.map(
    x => (x.country,x.land_area))

In [None]:
val countriesAndCases = 
      infectionRDD.map(x => (x.country,x.nCases))
      .groupByKey()

I run the join and first compute the daily infections per km²,
then the overall average

In [None]:
val average = countriesAndCases.join(countriesAndLandArea)

In [None]:
average.mapValues(
    x => x._1.map(
        y => (y.toFloat / x._2.toFloat)
    )).mapValues(
    x => x.sum/x.size
).collect()

I do everything in a single expression, to measure the execution time

In [None]:
val meanInfectionsRDD =
countriesAndCases.join(countriesAndLandArea)   
.mapValues(
    x => x._1.map(
        y => (y.toDouble / x._2.toDouble)
    )).mapValues(
    x => x.sum / x.size
)

In [None]:
ch.cern.sparkmeasure.StageMetrics(spark).runAndMeasure(
    meanInfectionsRDD.collect
)
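
Since the land area is constant for a given country, the mean of `cases / land_area` equals `mean(cases) / land_area`, so the cases could in principle be reduced to a single number per country before the join. A small check of that identity in plain Scala, with made-up values:

```scala
// Made-up daily cases and land area for one country, only for illustration.
val dailyCases = Seq(10.0, 20.0, 60.0)
val landArea = 4.0

// Divide every day by the area, then average (what the queries above do).
val meanOfRatios = dailyCases.map(_ / landArea).sum / dailyCases.size

// Average first, then divide once by the area.
val ratioOfMean = (dailyCases.sum / dailyCases.size) / landArea

println(s"meanOfRatios = $meanOfRatios, ratioOfMean = $ratioOfMean")
```

Both expressions agree up to floating-point rounding, which suggests that aggregating each country's cases before joining with the land area should give the same result with less data moved through the join.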

## Query with Dataset

In [None]:
val infectionDS = spark.read
.option("header", "true")
.option("charset", "UTF8")
.option("delimiter",",")
.option("inferSchema", "true")
.csv("covidworldwide.csv")
.withColumnRenamed("countriesAndTerritories","Country")
.as[(String,String,String,String,Double,Double,String,String,String,String,String,String)]

In [None]:
val populationDS = spark.read
.option("header", "true")
.option("charset", "UTF8")
.option("delimiter",",")
.option("inferSchema", "true")
.csv("population_by_country_2020.csv")
.withColumnRenamed("Country (or dependency)","Country")
.withColumnRenamed("Population (2020)","Population")
.as[(String,Float,String,Float,Float,Float,Double,String,String,String,String)]

### Infections per km²

In [None]:
val meanInfectionsperKM2DS = 
infectionDS.join(populationDS, "Country")
        .select($"Country",
                $"dateRep" as "date",
                $"cases",
                $"Land Area (Km\u00b2)",
                $"cases" / $"Land Area (Km\u00b2)" as "infection Per Km\u00b2")
        .groupBy("Country")
        .agg(round(avg("infection Per Km\u00b2"),10).as[Float])
        .orderBy(desc("round(avg(infection Per Km²), 10)"))
        .as[(String,Double)]

In [None]:
ch.cern.sparkmeasure.StageMetrics(spark).runAndMeasure(meanInfectionsperKM2DS.collect)

### Infections per inhabitant

In [None]:
val meanInfectionPerPopulationDS = 
infectionDS.join(populationDS, "Country")
        .select($"Country",
                $"dateRep" as "date",
                $"cases",
                $"Land Area (Km\u00b2)",
               $"cases" / $"Population" as "infection Per Population")
        .groupBy("country")
        .avg("infection Per Population")
        .orderBy(desc("avg(infection Per Population)"))
        .as[(String,Double)]

In [None]:
ch.cern.sparkmeasure.StageMetrics(spark).runAndMeasure(meanInfectionPerPopulationDS.collect)

## Query with DataFrame

In [None]:
val dfCovid = spark.read
.option("header", "true")
.option("charset", "UTF8")
.option("delimiter",",")
.option("inferSchema", "true")
.csv("covidworldwide.csv")

In [None]:
val dfMeasures = spark.read
.option("header", "true")
.option("charset", "UTF8")
.option("delimiter",",")
.option("inferSchema", "true")
.csv("response_graphs_data_2021-04-15.csv")
dfMeasures.show
dfMeasures.schema

In [None]:
val dfPopulation = spark.read
.option("header", "true")
.option("charset", "UTF8")
.option("delimiter",",")
.option("inferSchema", "true")
.csv("population_by_country_2020.csv")
.withColumnRenamed("Country (or dependency)","Country")
.withColumnRenamed("Population (2020)","Population")
dfPopulation.showHTML()
dfPopulation.schema

I adjust the input data so that the date text can be parsed into a Spark date

In [None]:
val dfCovidClean = dfCovid
    .withColumn("date", translate($"dateRep","/","-"))
    .drop("dateRep")

In [None]:
// withColumn avoids selecting "date" twice, which left a duplicated column
val dfCovidDate = dfCovidClean
    .withColumn("to_date", to_date(col("date"),"dd-MM-yyyy"))
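
The pattern passed to `to_date` has to match the text produced by the `translate` step. The same day-month-year pattern can be checked outside Spark with `java.time`. This is only a sketch with a made-up date string; Spark 2.4 uses its own parser, so it illustrates the pattern letters rather than Spark's exact behaviour.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

val rawDate = "14/12/2020"                      // made-up value in dd/MM/yyyy layout
val normalizedDate = rawDate.replace('/', '-')  // plays the role of translate(col, "/", "-")

// Day first, then month, then year: matches the normalized text.
val dayFirst = DateTimeFormatter.ofPattern("dd-MM-yyyy")
val parsedDate = LocalDate.parse(normalizedDate, dayFirst)

// A month-first pattern reads 14 as the month and is rejected.
val monthFirst = DateTimeFormatter.ofPattern("MM-dd-yyyy")
val wrongPattern = Try(LocalDate.parse(normalizedDate, monthFirst))

println(parsedDate)
println(wrongPattern.isFailure)
```

For days up to 12 a swapped pattern would not fail but would silently exchange day and month, so it is worth checking a date late in the month.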

A test query to get the average of the cases in Spain only

In [None]:
val spainCovid = dfCovid.select("dateRep","cases").where("countriesAndTerritories == 'Spain'").toDF

In [None]:
spainCovid.agg(avg("cases")).showHTML()

I join the data and run a few simple queries

In [None]:
val megaDF = dfCovid.join(dfMeasures, $"Country" === $"countriesAndTerritories")

In [None]:
ch.cern.sparkmeasure.StageMetrics(spark).runAndMeasure(
    dfCovid.join(dfMeasures, $"Country" === $"countriesAndTerritories")
        // filter first, while countriesAndTerritories is still among the columns
        .where("countriesAndTerritories == 'Spain'")
        .select("cases","deaths","dateRep","Response_measure")
        .collect()
)

### Finally, the query for our use case: infections per km²

In [None]:
val meanInfectionsperKM2DF = 
dfCovid.join(dfPopulation, $"country" === $"countriesAndTerritories")
        .select($"country",
                $"dateRep" as "date",
                $"cases",
                $"Land Area (Km\u00b2)",
                $"cases" / $"Land Area (Km\u00b2)" as "infection Per Km\u00b2")
        .groupBy("country")
        .avg("infection Per Km\u00b2")
        .orderBy(desc("avg(infection Per Km²)"))

In [None]:
ch.cern.sparkmeasure.StageMetrics(spark).runAndMeasure(
    meanInfectionsperKM2DF.collect
    )

### Average cases per inhabitant

In [None]:
val infectionsPerPopulation = dfCovid.join(dfPopulation, $"country" === $"countriesAndTerritories")
        .select($"country",
                $"dateRep" as "date",
                $"cases",
                $"Population",
                $"cases" / $"Population" as "infection Per Population")
        .groupBy("country")
        .avg("infection Per Population")
        .orderBy(desc("avg(infection Per Population)"))

### Daily percentage of the population infected

In [None]:
val diaryInfectionsDF =
dfCovidDate.join(dfPopulation, $"country" === $"countriesAndTerritories")
        .select($"country",
                $"to_date",
                $"day",
                $"month",
                $"cases",
                $"Population",
                $"cases" / $"Population" as "infection Per Population")
        .orderBy($"to_date".asc)

In [None]:
ch.cern.sparkmeasure.StageMetrics(spark).runAndMeasure(
    diaryInfectionsDF.collect
    )

# Query with vaccinations

## Infections compared with vaccinations

In [0]:
val dfCovid2 = spark.read
.option("header", "true")
.option("charset", "UTF8")
.option("delimiter",",")
.option("inferSchema", "true")
.csv("covid_19_data.csv")
dfCovid2.schema

In [None]:
val vaccinations = spark.read
.option("header", "true")
.option("charset", "UTF8")
.option("delimiter",",")
.option("inferSchema", "true")
.csv("country_vaccinations.csv")
vaccinations.schema

I adjust the input data so that the dates are parsed

In [None]:
val vaccinationsClean = vaccinations
    .select($"*",col("date"),to_date(col("date"),"MM-dd-yyyy")
            .as("dateVaccinated"))
    .drop("date")

In [None]:
val dfCovidClean2 = dfCovid2
    .select($"*",$"ObservationDate",translate($"ObservationDate","/","-")
            .as("date1"))
    .drop("ObservationDate")
    .select($"*",col("date1"),to_date(col("date1"),"MM-dd-yyyy")
            .as("date"))
    .drop("date1")

triple join

In [None]:
dfCovidClean2.join(
    vaccinationsClean,$"date" === $"dateVaccinated"
    && dfCovidClean2("Country/Region") <=> vaccinationsClean("country")
).join(dfPopulation, "country").showHTML()

In [None]:
val megaQuerie = dfCovidClean2.join(
    vaccinationsClean,$"date" === $"dateVaccinated"
    && dfCovidClean2("Country/Region") <=> vaccinationsClean("country")
).join(dfPopulation,"country")
        .select($"country",
                $"date",
                $"confirmed",
                $"people_vaccinated",
                $"Population",
                $"confirmed" / $"Population" as "infection Per Population",
                $"people_vaccinated"/ $"Population" as "vaccination Per Population",
                $"people_vaccinated" / $"confirmed" as "infection-vaccination rate")
        .orderBy($"date".asc)
        .na.fill("")
        .withColumn("infection-vaccination rate", round($"infection-vaccination rate",8))
        .withColumn("vaccination Per Population", round($"vaccination Per Population",8))

In [None]:
ch.cern.sparkmeasure.StageMetrics(spark).runAndMeasure(
    megaQuerie.collect
    )

# Query using the data in Parquet format

In [None]:
/*dfCovid.write
    .partitionBy("countriesAndTerritories","cases")
    .parquet("data_files/covid.parquet")
*/

In [None]:
val parqDF = spark.read.parquet("data_files/covid.parquet")

Cases per km²

In [None]:
val parquetCasesKM2 =
parqDF.join(dfPopulation, $"country" === $"countriesAndTerritories")
        .select($"country",
                $"dateRep" as "date",
                $"cases",
                $"Land Area (Km\u00b2)",
                $"cases" / $"Land Area (Km\u00b2)" as "infection Per Km\u00b2")
        .groupBy("country")
        .avg("infection Per Km\u00b2")
        .orderBy(desc("avg(infection Per Km²)"))

In [None]:
ch.cern.sparkmeasure.StageMetrics(spark).runAndMeasure(
    parquetCasesKM2
    .collect()
)


In [None]:
val parquetCasesPopulation =
parqDF.join(dfPopulation, $"country" === $"countriesAndTerritories")
        .select($"country",
                $"dateRep" as "date",
                $"cases",
                $"Population",
                $"cases" / $"Population" as "infection Per Population")
        .groupBy("country")
        .avg("infection Per Population")
        .orderBy(desc("avg(infection Per Population)"))

In [None]:
ch.cern.sparkmeasure.StageMetrics(spark).runAndMeasure(
    parquetCasesPopulation
    .collect()
)

Daily infection rate

In [None]:
val parquetDailyCasesRate =
parqDF.join(dfPopulation, $"country" === $"countriesAndTerritories")
                .select($"country",
                $"dateRep",
                $"day",
                $"month",
                $"cases",
                $"Population",
                $"cases" / $"Population" as "infection Per Population")
        .orderBy($"dateRep".asc)

In [None]:
ch.cern.sparkmeasure.StageMetrics(spark).runAndMeasure(
    parquetDailyCasesRate
    .collect()
)
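
One caveat about the ordering in this query: assuming `dateRep` is still stored as text in the `dd/MM/yyyy` layout of the CSV, sorting by it is lexicographic rather than chronological. A plain Scala illustration with made-up dates:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Made-up dates in dd/MM/yyyy layout, only for illustration.
val sampleDates = Seq("02/01/2021", "01/03/2020", "15/02/2020")

// Sorting the text compares characters from the left, i.e. by day of month first.
val lexicographic = sampleDates.sorted

// Parsing to a date first gives the chronological order.
val dayMonthYear = DateTimeFormatter.ofPattern("dd/MM/yyyy")
val chronological = sampleDates.sortBy(d => LocalDate.parse(d, dayMonthYear).toEpochDay)

println(lexicographic)
println(chronological)
```

If the chronological order matters for the Parquet variant, a parsed date column (as built earlier with `to_date`) would be the safer sort key.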

# Data visualization with plotly

In [None]:
val (x, y) = Seq(
  "Banana" -> 10,
  "Apple" -> 8,
  "Grapefruit" -> 5
).unzip

Bar(x, y).plot()

## Average daily infections

In [None]:
val (x,y) = infAvgOrDf.collect.map(r=>(r(0).toString, r(1).toString.toDouble)).toList.unzip
Bar(x, y).plot()

## Average infections per km²

In [None]:
val (x,y) = meanInfectionsperKM2DF.collect.map(r=>(r(0).toString, r(1).toString.toFloat)).toList.unzip
Bar(x, y).plot()

## Average infections per inhabitant

In [None]:
val (x,y) = infectionsPerPopulation.collect.map(r=>(r(0).toString, r(1).toString.toDouble)).toList.unzip
Bar(x, y).plot()

## Daily percentage of the population infected

In [None]:
val (x,y) = diaryInfectionsDF.filter($"country" === "Spain").collect.map(r=>(r(1).toString, r(6).toString.toDouble)).toList.unzip
Bar(x, y).plot()


## Comparing the growth of the disease between countries

In [None]:
val y = diaryInfectionsDF.filter($"country" === "Spain").select($"infection Per Population").
    collect.map(r => r(0).toString.toDouble).toList

// take the cell value: Row.toString would wrap each date in brackets
val x = diaryInfectionsDF.filter($"country" === "Spain").select($"to_date").collect.toList.map(r => r(0).toString)

val y1 = diaryInfectionsDF.filter($"country" === "Italy").select($"infection Per Population").
    collect.map(r => r(0).toString.toDouble).toList
val x1 = diaryInfectionsDF.filter($"country" === "Italy").select($"to_date").collect.toList.map(r => r(0).toString)

val data = Seq(
    Scatter(x,y).withName("Spain"),
    Scatter(x1,y1,mode = ScatterMode(ScatterMode.Lines),
  line = Line(color = Color.StringColor("#7F7F7F"))).withName("Italy")
).map(_.withFill(Fill.ToNextY).withStackgroup("A"))

plot(data)

## Percentage of the population vaccinated

## Growth of vaccination relative to the population

In [None]:
val y = megaQuerie.filter($"country" === "Chile").select($"vaccination Per Population" * 10000000).
    collect.map(r => r(0).toString.toDouble).toList

// take the cell value: Row.toString would wrap each date in brackets
val x = megaQuerie.filter($"country" === "Chile").select($"date").collect.toList.map(r => r(0).toString)

val y1 = megaQuerie.filter($"country" === "Chile").select($"people_vaccinated").
    collect.map(r => r(0).toString.toDouble).toList
val x1 = megaQuerie.filter($"country" === "Chile").select($"date").collect.toList.map(r => r(0).toString)

val data = Seq(
    Scatter(x,y).withName("% population"),
    Scatter(x1,y1).withName("Vaccines administered")
).map(_.withFill(Fill.ToNextY).withStackgroup("A"))

val myLayout =
  Layout()
    .withTitle("CHILE")

plot(data,myLayout)

# Visualizing performance

For the average daily infections query:

In [None]:
val (x, y) = Seq(
    "RDD" -> runWithOutput(infectionAvgRDD.collect),
    "DataSet" -> runWithOutput(avgDS.collect),
    "DataFrame" -> runWithOutput(infAvgOrDf.collect)
).unzip

Bar(x, y).plot()

For the infections per km² query:

In [None]:
val (x, y) = Seq(
    "Not Optimized RDD" -> runWithOutput(notOptimizedRDD.collect),
    "RDD" -> runWithOutput(meanInfectionsRDD.collect),
    "DataSet" -> runWithOutput(meanInfectionsperKM2DS.collect),
    "DataFrame" -> runWithOutput(meanInfectionsperKM2DF.collect),
    "DataFrame using Parquet" -> runWithOutput(parquetCasesKM2.collect)
).unzip

Bar(x, y).plot()

For the infections per inhabitant query:

In [None]:
val (x, y) = Seq(
    "DataSet" -> runWithOutput(meanInfectionPerPopulationDS.collect),
    "DataFrame" -> runWithOutput(infectionsPerPopulation.collect),
    "DataFrame vaccinations" -> runWithOutput(megaQuerie.collect),
    "DataFrame using Parquet" -> runWithOutput(parquetCasesPopulation.collect)
).unzip

Bar(x, y).plot()