# Zenpli challenge

### Part 1.3


# Install Java, Spark, and Findspark
The following code block installs Apache Spark 3.4.0, Java 8, and [Findspark](https://github.com/minrk/findspark), a library that makes it easy to set up Spark in hosted notebook environments (Google Colab, etc.).

In [2]:
!rm -rf spark-3.4.0-bin-hadoop3.tgz
!rm -rf spark-3.4.0-bin-hadoop3
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
!tar xf spark-3.4.0-bin-hadoop3.tgz
!pip install -q findspark --quiet

Environment variables for findspark

In [3]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.4.0-bin-hadoop3"
os.environ["SPARK_VERSION"] = "3.4"
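Before `findspark.init()` runs, it can help to confirm that both directories actually exist, since a wrong path tends to surface later as a less obvious import or JVM error. A minimal sketch using only the standard library; the helper name `missing_dirs` is hypothetical and is not part of findspark:

```python
import os

def missing_dirs(env, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or do not point to a directory."""
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

# Report problems up front instead of failing later inside findspark.init().
problems = missing_dirs(os.environ)
if problems:
    print("Check these variables:", ", ".join(problems))
```

An empty list means both variables point to existing directories.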

### Imports


In [4]:
import findspark
findspark.init()

In [5]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import (ArrayType, DateType, DoubleType, IntegerType, LongType,
                               StringType, StructField, StructType, TimestampType)


In [6]:
import pandas as pd
import numpy as np
import time
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import VectorAssembler

We define the Spark configuration and create the session

In [7]:
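conf = (
    SparkConf()
    .setMaster("local[*]")
    # Parquet and SQL behavior
    .set("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")
    .set("spark.sql.parquet.mergeSchema", "true")
    .set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    .set("spark.sql.caseSensitive", "true")
    # Legacy memory-manager setting; likely ignored by Spark 3.x
    .set("spark.storage.memoryFraction", "1")
    # In local mode everything runs inside the driver JVM, so the driver
    # memory is the value that matters; 20g may exceed the RAM of a Colab VM
    .set("spark.executor.memory", "20g")
    .set("spark.driver.memory", "20g")
    # spark.cores.max targets standalone/Mesos clusters; local[*] already uses all cores
    .set("spark.cores.max", "5")
    .set("spark.executor.cores", "5")
)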

In [8]:
sc = SparkContext(conf=conf)

# Local
spark = SparkSession \
    .builder \
    .appName("zenpli-challenge-app") \
    .config('spark.sql.session.timeZone', 'UTC') \
    .config(conf=conf) \
    .getOrCreate()

We define the appropriate schema

In [9]:
schema = StructType([
    StructField("key_1", StringType()),
    StructField("date_2", TimestampType()),
    StructField("cont_3", DoubleType()),
    StructField("cont_4", DoubleType()),
    StructField("disc_5", IntegerType()),
    StructField("disc_6", IntegerType()),
    StructField("cat_7", StringType()),
    StructField("cat_8", StringType()),
    StructField("cont_9", DoubleType()),
    StructField("cont_10", DoubleType())
])

We define the variable pointing to the local file

In [10]:
data_path = 'file:///content/backend-dev-data-dataset.txt'

We run the read

In [13]:
df = spark.read \
        .schema(schema) \
        .option("header",True) \
        .option("delimiter",",") \
        .option("quote", "\"") \
        .option("escape", "\"") \
        .option("unescapedQuoteHandling", "STOP_AT_DELIMITER") \
        .csv(data_path)
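Setting both `quote` and `escape` to `"` tells the reader that a doubled quote inside a quoted field stands for one literal quote. Python's `csv` module follows the same convention when `doublequote=True`, so it can illustrate the intended parsing. The sample line below is hypothetical and is not taken from the dataset:

```python
import csv
import io

# Hypothetical line: the second field contains both a comma and embedded quotes.
sample = 'AB1234,"said ""hi"", then left",1.5\n'

reader = csv.reader(io.StringIO(sample), delimiter=",", quotechar='"', doublequote=True)
fields = next(reader)
print(fields)
```

The quoted field comes back as a single value with the inner quotes preserved and the comma kept inside it.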

We display the number of records

In [14]:
df.count()

16825

We display the schema

In [15]:
df.printSchema()

root
 |-- key_1: string (nullable = true)
 |-- date_2: timestamp (nullable = true)
 |-- cont_3: double (nullable = true)
 |-- cont_4: double (nullable = true)
 |-- disc_5: integer (nullable = true)
 |-- disc_6: integer (nullable = true)
 |-- cat_7: string (nullable = true)
 |-- cat_8: string (nullable = true)
 |-- cont_9: double (nullable = true)
 |-- cont_10: double (nullable = true)



We display the first 10 records

In [16]:
df.show(10, truncate=False)

+------+-------------------+-------+------+------+------+--------+---------+------+-------+
|key_1 |date_2             |cont_3 |cont_4|disc_5|disc_6|cat_7   |cat_8    |cont_9|cont_10|
+------+-------------------+-------+------+------+------+--------+---------+------+-------+
|HC2030|2016-11-16 00:00:00|622.27 |-2.36 |2     |6     |frequent|happy    |0.24  |0.25   |
|sP8147|2004-02-18 00:00:00|1056.16|59.93 |2     |8     |never   |happy    |1.94  |2.29   |
|Cq3823|2007-03-25 00:00:00|210.73 |-93.94|1     |1     |never   |happy    |-0.11 |-0.1   |
|Hw9428|2013-12-28 00:00:00|1116.48|80.58 |3     |10    |never   |surprised|1.27  |1.15   |
|xZ0360|2003-08-25 00:00:00|1038.3 |12.37 |6     |17    |never   |happy    |1.76  |1.76   |
|IK2721|2012-10-19 00:00:00|835.17 |16.3  |4     |11    |frequent|surprised|2.04  |2.3    |
|iK8875|2005-02-04 00:00:00|769.02 |75.69 |3     |2     |never   |happy    |-1.53 |-1.56  |
|qd0312|2014-11-17 00:00:00|273.11 |66.2  |1     |8     |frequent|surprised|2.67

1.   Transform a variable and add it to the dataset. (Apply the function x^3 + exp(y) to any tuple of continuous variables.)






In [17]:
df_transformed = df.withColumn("transform_step_3_1", F.expr("pow(cont_3, 3) + exp(cont_4)"))
df_transformed.show()

+------+-------------------+-------+------+------+------+--------+---------+------+-------+--------------------+
| key_1|             date_2| cont_3|cont_4|disc_5|disc_6|   cat_7|    cat_8|cont_9|cont_10|  transform_step_3_1|
+------+-------------------+-------+------+------+------+--------+---------+------+-------+--------------------+
|HC2030|2016-11-16 00:00:00| 622.27| -2.36|     2|     6|frequent|    happy|  0.24|   0.25| 2.409553601855032E8|
|sP8147|2004-02-18 00:00:00|1056.16| 59.93|     2|     8|   never|    happy|  1.94|   2.29|1.064800632551066...|
|Cq3823|2007-03-25 00:00:00| 210.73|-93.94|     1|     1|   never|    happy| -0.11|   -0.1|   9357915.116016999|
|Hw9428|2013-12-28 00:00:00|1116.48| 80.58|     3|    10|   never|surprised|  1.27|   1.15|9.895764508800898E34|
|xZ0360|2003-08-25 00:00:00| 1038.3| 12.37|     6|    17|   never|    happy|  1.76|   1.76|1.1195924776322393E9|
|IK2721|2012-10-19 00:00:00| 835.17|  16.3|     4|    11|frequent|surprised|  2.04|    2.3| 5.94
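As a sanity check, the same expression can be evaluated in plain Python for the rows whose result is printed in full above. This is only a sketch for verification; the function name `transform` is hypothetical and the input values are copied from the output:

```python
import math

def transform(x, y):
    """x^3 + exp(y), the same expression as pow(cont_3, 3) + exp(cont_4)."""
    return x ** 3 + math.exp(y)

# (cont_3, cont_4, transform_step_3_1 as shown by Spark)
shown_rows = [
    (622.27, -2.36, 2.409553601855032e8),
    (210.73, -93.94, 9357915.116016999),
    (1116.48, 80.58, 9.895764508800898e34),
    (1038.3, 12.37, 1.1195924776322393e9),
]

# Relative difference between the plain-Python value and the value Spark printed.
rel_errors = [abs(transform(x, y) - shown) / shown for x, y, shown in shown_rows]
print(max(rel_errors))
```

Note how quickly `exp(y)` grows: in the row with `cont_4 = 80.58` the exponential term is on the order of 1e34 and completely dominates the cube.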

2.   Aggregation: count of unique records (over any column of categorical values).

In [18]:
grouped_df = df.groupBy("cat_8","cat_7").agg(F.count("key_1").alias("cnt_keys")).orderBy(F.col("cnt_keys").desc())
grouped_df.show()

+---------+----------+--------+
|    cat_8|     cat_7|cnt_keys|
+---------+----------+--------+
|    happy|     never|    3958|
|    happy|  frequent|    3856|
|surprised|     never|    3505|
|surprised|  frequent|    3453|
|      sad|  frequent|     604|
|      sad|     never|     560|
|    happy|infrequent|     315|
|surprised|infrequent|     256|
|    happy|    always|     129|
|surprised|    always|     118|
|      sad|infrequent|      51|
|      sad|    always|      17|
|   scared|     never|       2|
|   scared|  frequent|       1|
+---------+----------+--------+
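`F.count("key_1")` counts the non-null `key_1` values in each group; it does not deduplicate them. If `key_1` could repeat, a distinct count (for example `F.countDistinct("key_1")`) would be closer to "unique records". The difference can be sketched in plain Python on a few hypothetical rows, which are not taken from the dataset:

```python
from collections import defaultdict

# Hypothetical rows (key_1, cat_8, cat_7); "A1" appears twice on purpose.
rows = [
    ("A1", "happy", "never"),
    ("A1", "happy", "never"),
    ("B2", "happy", "never"),
    ("C3", "sad", "always"),
]

row_counts = defaultdict(int)      # analogous to count(key_1)
distinct_keys = defaultdict(set)   # analogous to count(distinct key_1)
for key, cat_8, cat_7 in rows:
    row_counts[(cat_8, cat_7)] += 1
    distinct_keys[(cat_8, cat_7)].add(key)

distinct_counts = {group: len(keys) for group, keys in distinct_keys.items()}
print(dict(row_counts))
print(distinct_counts)
```

The two results agree only when no key repeats within a group.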



The same, but with Spark SQL

In [19]:
df.createOrReplaceTempView("data")

my_new_df = spark.sql("""
  select cat_8,
         cat_7,
         count(key_1) as cnt_keys
  from data
  group by 1, 2
  order by 3 desc
""")

my_new_df.show()

+---------+----------+--------+
|    cat_8|     cat_7|cnt_keys|
+---------+----------+--------+
|    happy|     never|    3958|
|    happy|  frequent|    3856|
|surprised|     never|    3505|
|surprised|  frequent|    3453|
|      sad|  frequent|     604|
|      sad|     never|     560|
|    happy|infrequent|     315|
|surprised|infrequent|     256|
|    happy|    always|     129|
|surprised|    always|     118|
|      sad|infrequent|      51|
|      sad|    always|      17|
|   scared|     never|       2|
|   scared|  frequent|       1|
+---------+----------+--------+
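A quick consistency check on the result: since `count(key_1)` skips nulls, the group counts should add up to the total number of rows only if no `key_1` is null. Summing the `cnt_keys` values copied from the output above:

```python
# cnt_keys values copied from the grouped output
cnt_keys = [3958, 3856, 3505, 3453, 604, 560, 315, 256, 129, 118, 51, 17, 2, 1]

total = sum(cnt_keys)
print(total)  # 16825, the same value returned by df.count()
```

The totals match, which suggests that every record has a non-null `key_1`.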

