## Flat JSON to CSV in Spark

We received a JSON file stored as plain text (that is, the whole document on a single line) and want to convert it to a CSV file.

In [1]:
import findspark
findspark.init()

# Create the Spark session
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.getOrCreate()
print('Apache Spark Version :' + spark.sparkContext.version)

Apache Spark Version :3.3.1


In [5]:
df_json = spark.read.text("filejsonplano.csv")

In [6]:
df_json.show()

+--------------------+
|               value|
+--------------------+
|{"records":[{"id"...|
+--------------------+



In [7]:
json_schema =   StructType([
                StructField('records',  
                    ArrayType(StructType([
                        StructField('id',  StringType(), True ),
                        StructField('first_name', StringType(), True),
                        StructField('last_name', StringType(), True),
                        StructField('email', StringType(), True),
                        StructField('gender', StringType(), True),
                        StructField('ip_address', StringType(), True)
                    ])),
                True )])

In [9]:
df_details = df_json.withColumn("parsed_data", from_json(df_json["value"], json_schema)).drop("value")
# Go down one level so we don't have to write 'parsed_data.column'
df = df_details.select(col("parsed_data.*"))
df.printSchema()

root
 |-- records: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- first_name: string (nullable = true)
 |    |    |-- last_name: string (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- gender: string (nullable = true)
 |    |    |-- ip_address: string (nullable = true)



In [10]:
df = df.withColumn("rec_exp", explode_outer("records"))
df_final = df.select(col('rec_exp.*'))

In [16]:
df_final.show()
df_final.printSchema()

+---+----------+---------+--------------------+------+--------------+
| id|first_name|last_name|               email|gender|    ip_address|
+---+----------+---------+--------------------+------+--------------+
|  1|  Jeanette|Penddreth|jpenddreth0@censu...|Female|   26.58.193.2|
|  2|   Giavani| Frediani|gfrediani1@senate...|  Male| 229.179.4.212|
|  3|     Noell|      Bea| nbea2@imageshack.us|Female|180.66.162.255|
|  4|   Willard|    Valek|      wvalek3@vk.com|  Male|  67.76.188.26|
+---+----------+---------+--------------------+------+--------------+

root
 |-- id: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- ip_address: string (nullable = true)
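
For intuition, the parse, explode and select steps above amount to turning each element of the `records` array into one row. A minimal plain-Python sketch of the same flattening, without Spark, might look like the following; the sample line and the helper names `flatten_records` and `rows_to_csv` are illustrative assumptions, not part of the original notebook.

```python
import csv
import io
import json

# Illustrative one-line document with the same layout as the input file
# (assumption: every value is a string, matching the schema above).
SAMPLE_LINE = json.dumps({
    "records": [
        {"id": "3", "first_name": "Noell", "last_name": "Bea",
         "email": "nbea2@imageshack.us", "gender": "Female",
         "ip_address": "180.66.162.255"},
        {"id": "4", "first_name": "Willard", "last_name": "Valek",
         "email": "wvalek3@vk.com", "gender": "Male",
         "ip_address": "67.76.188.26"},
    ]
})

COLUMNS = ["id", "first_name", "last_name", "email", "gender", "ip_address"]


def flatten_records(line):
    """Return one list of column values per element of the 'records' array."""
    document = json.loads(line)
    # .get() yields None for a missing key, similar to the nullable fields.
    return [[record.get(name) for name in COLUMNS]
            for record in document.get("records") or []]


def rows_to_csv(rows, sep="|"):
    """Serialize rows as delimiter-separated text without a header row."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=sep, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = flatten_records(SAMPLE_LINE)
print(rows_to_csv(rows), end="")
```

Unlike `explode_outer`, which produces a row of nulls when the array is null or empty, this sketch yields no rows in that case.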



## Save Data

### Option 1

If the DataFrame fits in the **driver's memory** and you want to save it to the **local file system**, 
you can convert the **Spark DataFrame** to a **local Pandas DataFrame** with the `toPandas()` method and then simply call `to_csv()`.

In [18]:
df_final.toPandas().to_csv('mycsvtoPandas.csv', sep='|', header=False, index=False)
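
Because the file is written with `header=False` and `sep='|'`, it carries no column names, so whoever reads it back has to supply them. A minimal sketch with the standard `csv` module, assuming the column order of the schema shown earlier; the helper name `read_headerless_csv` and the temporary demo file are hypothetical.

```python
import csv
import os
import tempfile

# Column order taken from the DataFrame schema shown earlier.
COLUMNS = ["id", "first_name", "last_name", "email", "gender", "ip_address"]


def read_headerless_csv(path, sep="|"):
    """Read a delimiter-separated file with no header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, fieldnames=COLUMNS, delimiter=sep))


# Self-contained demo: write a stand-in for the exported file (hypothetical
# location in a temporary directory), then read it back.
demo_path = os.path.join(tempfile.mkdtemp(), "mycsvtoPandas.csv")
with open(demo_path, "w", newline="", encoding="utf-8") as handle:
    handle.write("3|Noell|Bea|nbea2@imageshack.us|Female|180.66.162.255\n")
    handle.write("4|Willard|Valek|wvalek3@vk.com|Male|67.76.188.26\n")

records = read_headerless_csv(demo_path)
print(records[0]["email"])
```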

### Option 2

If you want to save it to **HDFS** or to an object store such as **Amazon S3**

Keep in mind that this creates a folder named `path/` containing a CSV file with an auto-generated name. From there, download it by hand or use your file system's move, rename and delete operations on files or folders.

```shell
path/
 |--  _SUCCESS
 |--  part-010233110193.csv
```

In [None]:
df_final.repartition(1).write.csv("path", sep='|')
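
When the output directory lives on a local file system, the manual rename step described above can be scripted. A minimal sketch using only the standard library; the function name `promote_single_part` and the target file name are hypothetical, and on HDFS or S3 the equivalent would go through that storage system's own tools instead.

```python
import glob
import os
import shutil
import tempfile


def promote_single_part(output_dir, target_file):
    """Move the only part-*.csv file out of output_dir and delete the directory."""
    parts = glob.glob(os.path.join(output_dir, "part-*.csv"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    shutil.move(parts[0], target_file)
    # Removes what is left in the directory, such as the success marker file.
    shutil.rmtree(output_dir)
    return target_file


# Self-contained demo on a stand-in directory that mimics the layout above.
work_dir = tempfile.mkdtemp()
fake_output = os.path.join(work_dir, "path")
os.makedirs(fake_output)
open(os.path.join(fake_output, "_SUCCESS"), "w").close()
with open(os.path.join(fake_output, "part-010233110193.csv"), "w") as handle:
    handle.write("4|Willard|Valek|wvalek3@vk.com|Male|67.76.188.26\n")

final_path = promote_single_part(fake_output, os.path.join(work_dir, "mycsv.csv"))
print(os.path.basename(final_path))
```

The check for exactly one part file relies on the `repartition(1)` call; without it Spark may write several part files and the helper raises instead of guessing.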