In [1]:
import pandas as pd

APP_NAME = 'pyspark_python'
MASTER = 'local[*]'
from pyspark import SparkConf
from pyspark.sql import SparkSession


conf = SparkConf().setAppName(APP_NAME)
conf = conf.setMaster(MASTER)
spark = SparkSession.builder.config(conf = conf).getOrCreate()
sc = spark.sparkContext

## **Spark tricks**

Some tricks:

## **See all the columns of a large dataset**

Configuration instructions:

Add this to ~/.jupyter/custom/custom.js (to avoid wrapping and get a horizontal scroll instead):

```
$([IPython.events]).on('app_initialized.NotebookApp', function(){
  IPython.CodeCell.options_default['cm_config']['lineWrapping'] = true;
});
```

And add this to ~/.jupyter/custom/custom.css (to use the full width):

```
.container { width:100% !important; }
pre, code, kbd, samp {
    white-space: pre;
}
```

These two tweaks make the output of the Spark SQL DataFrame show() method easier to read: columns stay aligned because word wrap is disabled, and the view gets more width. If these files do not exist, create them.

In [2]:
from IPython.display import HTML, display
display(HTML("<style>pre { white-space: pre !important; }</style>"))

In [5]:
data1 = {'PassengerId': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
         'Name': {0: 'Owen', 1: 'Florence', 2: 'Laina', 3: 'Lily', 4: 'William'},
         'Sex': {0: 'male', 1: 'female', 2: 'female', 3: 'female', 4: 'male'},
         'Survived': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0},
         'Age': {0: 22, 1: 38, 2: 26, 3: 35, 4: 35},
         'Fare': {0: 7.3, 1: 71.3, 2: 7.9, 3: 53.1, 4: 8.0},
         'Pclass': {0: 3, 1: 1, 2: 3, 3: 1, 4: 3},
         'PassengerId2': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
         'Name2': {0: 'Owen', 1: 'Florence', 2: 'Laina', 3: 'Lily', 4: 'William'},
         'Sex2': {0: 'male', 1: 'female', 2: 'female', 3: 'female', 4: 'male'},
         'Survived2': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0},
         'Age2': {0: 22, 1: 38, 2: 26, 3: 35, 4: 35},
         'Fare2': {0: 7.3, 1: 71.3, 2: 7.9, 3: 53.1, 4: 8.0},
         'Pclass2': {0: 3, 1: 1, 2: 3, 3: 1, 4: 3},
         'Pclass3': {0: 3, 1: 1, 2: 3, 3: 1, 4: 3},
         'Pclass4': {0: 3, 1: 1, 2: 3, 3: 1, 4: 3}
            }
df1_pd = pd.DataFrame(data1, columns=data1.keys())
data1 = spark.createDataFrame(df1_pd)

In [6]:
data1.show()

+-----------+--------+------+--------+---+----+------+------------+--------+------+---------+----+-----+-------+-------+-------+
|PassengerId|    Name|   Sex|Survived|Age|Fare|Pclass|PassengerId2|   Name2|  Sex2|Survived2|Age2|Fare2|Pclass2|Pclass3|Pclass4|
+-----------+--------+------+--------+---+----+------+------------+--------+------+---------+----+-----+-------+-------+-------+
|          1|    Owen|  male|       0| 22| 7.3|     3|           1|    Owen|  male|        0|  22|  7.3|      3|      3|      3|
|          2|Florence|female|       1| 38|71.3|     1|           2|Florence|female|        1|  38| 71.3|      1|      1|      1|
|          3|   Laina|female|       1| 26| 7.9|     3|           3|   Laina|female|        1|  26|  7.9|      3|      3|      3|
|          4|    Lily|female|       1| 35|53.1|     1|           4|    Lily|female|        1|  35| 53.1|      1|      1|      1|
|          5| William|  male|       0| 35| 8.0|     3|           5| William|  male|        0|  35
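
When even the widened view is not enough, another option (not part of the original tricks, just a sketch) is to print the columns in small groups. The helper below is plain Python; the name `column_chunks` and the group size are assumptions chosen for illustration.

```python
def column_chunks(columns, size=6):
    """Split a list of column names into consecutive groups of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [columns[i:i + size] for i in range(0, len(columns), size)]


# Hypothetical usage with the wide DataFrame built above:
# for chunk in column_chunks(data1.columns, size=6):
#     data1.select(chunk).show()
```

Each call to show() then prints a narrow table, so no row is cut off at the edge of the cell.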

## **Use partitions**

By default, a shuffle in Spark SQL produces 200 output partitions (the spark.sql.shuffle.partitions setting). Below we set this value to 1 and then to 5, reducing the number of shuffle output partitions from 200.

Go ahead and experiment with different values and check the number of partitions yourself. With different values you should see clearly different run times. Remember that you can monitor job progress in the Spark UI (port 4040 by default) to see the physical and logical execution characteristics of the jobs.

In [2]:
import datetime

# load data

flightData2015 = spark\
.read\
.option("inferSchema", "true")\
.option("header", "true")\
.csv("../data/2015-summary.csv") 


# time of execution with one shuffle partition
timestart = datetime.datetime.now()
spark.conf.set("spark.sql.shuffle.partitions", "1")
flightData2015.sort("count").take(2)

# elapsed time
timeend = datetime.datetime.now()
timedelta = round((timeend - timestart).total_seconds(), 2)
print("Time required to run the query: " + str(timedelta) + " seconds")

Time required to run the query: 0.18 seconds


In [3]:
# time of execution with five shuffle partitions
timestart = datetime.datetime.now()
spark.conf.set("spark.sql.shuffle.partitions", "5")
flightData2015.sort("count").take(2)

# elapsed time
timeend = datetime.datetime.now()
timedelta = round((timeend - timestart).total_seconds(), 2)
print("Time required to run the query: " + str(timedelta) + " seconds")

Time required to run the query: 0.08 seconds
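
The two timing cells above repeat the same bookkeeping. A minimal sketch of a reusable helper follows; the name `timed` and the optional `results` dictionary are assumptions for illustration, not something the original notebook defines.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print (and optionally store) the wall-clock time of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.2f} seconds")


# Hypothetical usage with the objects defined earlier in this notebook:
# timings = {}
# for n in ("1", "5", "50"):
#     spark.conf.set("spark.sql.shuffle.partitions", n)
#     with timed(f"{n} shuffle partitions", timings):
#         flightData2015.sort("count").take(2)
```

A single run on a small file is noisy, so it is worth repeating each setting a few times before comparing the numbers.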


## **Basic dataframe terms**
We define what we think are the five basic verbs: select, filter, mutate, summarize, and arrange.

In [4]:
data1 = {'PassengerId': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
         'Name': {0: 'Owen', 1: 'Florence', 2: 'Laina', 3: 'Lily', 4: 'William'},
         'Sex': {0: 'male', 1: 'female', 2: 'female', 3: 'female', 4: 'male'},
         'Survived': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}}

data2 = {'PassengerId': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
         'Age': {0: 22, 1: 38, 2: 26, 3: 35, 4: 35},
         'Fare': {0: 7.3, 1: 71.3, 2: 7.9, 3: 53.1, 4: 8.0},
         'Pclass': {0: 3, 1: 1, 2: 3, 3: 1, 4: 3}}

df1_pd = pd.DataFrame(data1, columns=data1.keys())
df2_pd = pd.DataFrame(data2, columns=data2.keys())

In [5]:
df1 = spark.createDataFrame(df1_pd)
df2 = spark.createDataFrame(df2_pd)
df1.show()

+-----------+--------+------+--------+
|PassengerId|    Name|   Sex|Survived|
+-----------+--------+------+--------+
|          1|    Owen|  male|       0|
|          2|Florence|female|       1|
|          3|   Laina|female|       1|
|          4|    Lily|female|       1|
|          5| William|  male|       0|
+-----------+--------+------+--------+



In [6]:
df1.printSchema()

root
 |-- PassengerId: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Survived: long (nullable = true)



## **Select**

In [7]:
cols1 = ['PassengerId', 'Name']
df1.select(cols1).show()

+-----------+--------+
|PassengerId|    Name|
+-----------+--------+
|          1|    Owen|
|          2|Florence|
|          3|   Laina|
|          4|    Lily|
|          5| William|
+-----------+--------+



In [8]:
df1.select(df1.Name.substr(1, 3).alias("name")).show()

+----+
|name|
+----+
| Owe|
| Flo|
| Lai|
| Lil|
| Wil|
+----+



In [9]:
df1.select("PassengerId", "Sex", df1.Name.startswith("Ow")).show(5)
df1.select("PassengerId", "Sex", df1.Name.endswith("am")).show(5)

+-----------+------+--------------------+
|PassengerId|   Sex|startswith(Name, Ow)|
+-----------+------+--------------------+
|          1|  male|                true|
|          2|female|               false|
|          3|female|               false|
|          4|female|               false|
|          5|  male|               false|
+-----------+------+--------------------+

+-----------+------+------------------+
|PassengerId|   Sex|endswith(Name, am)|
+-----------+------+------------------+
|          1|  male|             false|
|          2|female|             false|
|          3|female|             false|
|          4|female|             false|
|          5|  male|              true|
+-----------+------+------------------+



## **Filter**

In [10]:
# one way
df1.filter(df1.Sex == 'female').show()

+-----------+--------+------+--------+
|PassengerId|    Name|   Sex|Survived|
+-----------+--------+------+--------+
|          2|Florence|female|       1|
|          3|   Laina|female|       1|
|          4|    Lily|female|       1|
+-----------+--------+------+--------+



In [11]:
df1.filter(df1["PassengerId"]>2).show()

+-----------+-------+------+--------+
|PassengerId|   Name|   Sex|Survived|
+-----------+-------+------+--------+
|          3|  Laina|female|       1|
|          4|   Lily|female|       1|
|          5|William|  male|       0|
+-----------+-------+------+--------+



In [12]:
# second way: sql expression
df1.filter("Sex='female'").show()

+-----------+--------+------+--------+
|PassengerId|    Name|   Sex|Survived|
+-----------+--------+------+--------+
|          2|Florence|female|       1|
|          3|   Laina|female|       1|
|          4|    Lily|female|       1|
+-----------+--------+------+--------+



In [13]:
## other way to filter strings
df1.filter(df1.Sex.contains("female")).show()

+-----------+--------+------+--------+
|PassengerId|    Name|   Sex|Survived|
+-----------+--------+------+--------+
|          2|Florence|female|       1|
|          3|   Laina|female|       1|
|          4|    Lily|female|       1|
+-----------+--------+------+--------+



In [14]:
## SQL LIKE pattern: % matches any sequence of characters
df1.filter(df1.Sex.like("fem%")).show()
# or
#df1.filter(df1.Sex.like("%em%")).show()
# or 
#df1.filter(df1.Sex.like("%emale")).show()

+-----------+--------+------+--------+
|PassengerId|    Name|   Sex|Survived|
+-----------+--------+------+--------+
|          2|Florence|female|       1|
|          3|   Laina|female|       1|
|          4|    Lily|female|       1|
+-----------+--------+------+--------+



In [15]:
## rlike takes a regular expression, which can match anywhere in the string, so no '%' is needed
df1.filter(df1.Sex.rlike("fem")).show()

+-----------+--------+------+--------+
|PassengerId|    Name|   Sex|Survived|
+-----------+--------+------+--------+
|          2|Florence|female|       1|
|          3|   Laina|female|       1|
|          4|    Lily|female|       1|
+-----------+--------+------+--------+



In [16]:
from pyspark.sql.functions import col
# PassengerId is a long column, so compare against integers rather than strings
df1.filter(~col('PassengerId').isin([2, 3])).show()

+-----------+-------+------+--------+
|PassengerId|   Name|   Sex|Survived|
+-----------+-------+------+--------+
|          1|   Owen|  male|       0|
|          4|   Lily|female|       1|
|          5|William|  male|       0|
+-----------+-------+------+--------+



In [17]:
df1.filter(df1.PassengerId.between(3, 4)).show() 

+-----------+-----+------+--------+
|PassengerId| Name|   Sex|Survived|
+-----------+-----+------+--------+
|          3|Laina|female|       1|
|          4| Lily|female|       1|
+-----------+-----+------+--------+



## **Where**

In [18]:
# where() is an alias for filter()
df1.where(df1.PassengerId.between(3, 4)).show()

+-----------+-----+------+--------+
|PassengerId| Name|   Sex|Survived|
+-----------+-----+------+--------+
|          3|Laina|female|       1|
|          4| Lily|female|       1|
+-----------+-----+------+--------+



In [19]:
df1.where(~col('PassengerId').isin([2, 3])).show()

+-----------+-------+------+--------+
|PassengerId|   Name|   Sex|Survived|
+-----------+-------+------+--------+
|          1|   Owen|  male|       0|
|          4|   Lily|female|       1|
|          5|William|  male|       0|
+-----------+-------+------+--------+



## **Mutate**: creating new columns

In [20]:
df2.withColumn('AgeTimesFare', df2.Age*df2.Fare).show()

+-----------+---+----+------+------------+
|PassengerId|Age|Fare|Pclass|AgeTimesFare|
+-----------+---+----+------+------------+
|          1| 22| 7.3|     3|       160.6|
|          2| 38|71.3|     1|      2709.4|
|          3| 26| 7.9|     3|       205.4|
|          4| 35|53.1|     1|      1858.5|
|          5| 35| 8.0|     3|       280.0|
+-----------+---+----+------+------------+



In [24]:
import pyspark.sql.functions as F
df2.withColumn('new_column', F.lit('This is a new column')).show()

+-----------+---+----+------+--------------------+
|PassengerId|Age|Fare|Pclass|          new_column|
+-----------+---+----+------+--------------------+
|          1| 22| 7.3|     3|This is a new column|
|          2| 38|71.3|     1|This is a new column|
|          3| 26| 7.9|     3|This is a new column|
|          4| 35|53.1|     1|This is a new column|
|          5| 35| 8.0|     3|This is a new column|
+-----------+---+----+------+--------------------+



## **Drop**

In [25]:
# the withColumn() result above was never assigned back to df2, so this drop() is a no-op
df2.drop("new_column").show(5)

+-----------+---+----+------+
|PassengerId|Age|Fare|Pclass|
+-----------+---+----+------+
|          1| 22| 7.3|     3|
|          2| 38|71.3|     1|
|          3| 26| 7.9|     3|
|          4| 35|53.1|     1|
|          5| 35| 8.0|     3|
+-----------+---+----+------+



##### Drop Na's

In [29]:
# both calls are equivalent and return a new DataFrame without rows that contain nulls
df2.na.drop()
df2.dropna()

DataFrame[PassengerId: bigint, Age: bigint, Fare: double, Pclass: bigint]

## **When**

In [30]:
from pyspark.sql import functions as F 
df1.withColumn("set", F.when( df1.PassengerId > 4, 1 ).otherwise( 0 )).show()

+-----------+--------+------+--------+---+
|PassengerId|    Name|   Sex|Survived|set|
+-----------+--------+------+--------+---+
|          1|    Owen|  male|       0|  0|
|          2|Florence|female|       1|  0|
|          3|   Laina|female|       1|  0|
|          4|    Lily|female|       1|  0|
|          5| William|  male|       0|  1|
+-----------+--------+------+--------+---+



In [31]:
from pyspark.sql import functions as sf 
df1.withColumn("set", sf.when( df1.PassengerId > 4, 1 ).otherwise( 0 )).show()

+-----------+--------+------+--------+---+
|PassengerId|    Name|   Sex|Survived|set|
+-----------+--------+------+--------+---+
|          1|    Owen|  male|       0|  0|
|          2|Florence|female|       1|  0|
|          3|   Laina|female|       1|  0|
|          4|    Lily|female|       1|  0|
|          5| William|  male|       0|  1|
+-----------+--------+------+--------+---+



In [32]:
from pyspark.sql import functions as f 
df1.withColumn("set", f.when( df1.PassengerId > 4, 1 ).otherwise( 0 )).show()

+-----------+--------+------+--------+---+
|PassengerId|    Name|   Sex|Survived|set|
+-----------+--------+------+--------+---+
|          1|    Owen|  male|       0|  0|
|          2|Florence|female|       1|  0|
|          3|   Laina|female|       1|  0|
|          4|    Lily|female|       1|  0|
|          5| William|  male|       0|  1|
+-----------+--------+------+--------+---+



## **Summarize** using group by

In [33]:
gdf2 = df2.groupby('Pclass')

In [34]:
#gdf2.count().select(
#  'count'
#).rdd.flatMap(
#  lambda x: x
#).histogram(20)

In [39]:
avg_cols = ['Age', 'Fare']
gdf2.avg(*avg_cols).show()

+------+------------------+-----------------+
|Pclass|          avg(Age)|        avg(Fare)|
+------+------------------+-----------------+
|     3|27.666666666666668|7.733333333333333|
|     1|              36.5|             62.2|
+------+------------------+-----------------+



To call multiple aggregation functions at once, pass a dictionary.

In [40]:
gdf2.agg({'*': 'count', 'Age': 'avg', 'Fare':'sum'}).show()

+------+--------+------------------+---------+
|Pclass|count(1)|          avg(Age)|sum(Fare)|
+------+--------+------------------+---------+
|     3|       3|27.666666666666668|     23.2|
|     1|       2|              36.5|    124.4|
+------+--------+------------------+---------+



#### The toDF() method returns a new DataFrame with the given column names.

In [41]:
gdf2.agg({'*': 'count', 'Age': 'avg', 'Fare':'sum'})\
    .toDF('Pclass', 'counts', 'average_age', 'total_fare')\
    .show()

+------+------+------------------+----------+
|Pclass|counts|       average_age|total_fare|
+------+------+------------------+----------+
|     3|     3|27.666666666666668|      23.2|
|     1|     2|              36.5|     124.4|
+------+------+------------------+----------+



### **Count Distinct**

In [42]:
df2.show()

+-----------+---+----+------+
|PassengerId|Age|Fare|Pclass|
+-----------+---+----+------+
|          1| 22| 7.3|     3|
|          2| 38|71.3|     1|
|          3| 26| 7.9|     3|
|          4| 35|53.1|     1|
|          5| 35| 8.0|     3|
+-----------+---+----+------+



In [43]:
from pyspark.sql.functions import col, countDistinct
df2.agg(*(countDistinct(col(c)).alias(c) for c in df2.columns)).show()

+-----------+---+----+------+
|PassengerId|Age|Fare|Pclass|
+-----------+---+----+------+
|          5|  4|   5|     2|
+-----------+---+----+------+



In [44]:
from pyspark.sql.functions import col, countDistinct

# skip the grouping column to avoid a duplicate Pclass column in the result
value_cols = [c for c in df2.columns if c != 'Pclass']
df2.groupby('Pclass').agg(*(countDistinct(col(c)).alias(c) for c in value_cols)).show()

+------+-----------+---+----+
|Pclass|PassengerId|Age|Fare|
+------+-----------+---+----+
|     3|          3|  3|   3|
|     1|          2|  2|   2|
+------+-----------+---+----+



In [45]:
df2.distinct().count()

5

## **Sort**

In [46]:
df2.sort('Fare', ascending=False).show()

+-----------+---+----+------+
|PassengerId|Age|Fare|Pclass|
+-----------+---+----+------+
|          2| 38|71.3|     1|
|          4| 35|53.1|     1|
|          5| 35| 8.0|     3|
|          3| 26| 7.9|     3|
|          1| 22| 7.3|     3|
+-----------+---+----+------+



In [47]:
df2.orderBy('Fare').show()

+-----------+---+----+------+
|PassengerId|Age|Fare|Pclass|
+-----------+---+----+------+
|          1| 22| 7.3|     3|
|          3| 26| 7.9|     3|
|          5| 35| 8.0|     3|
|          4| 35|53.1|     1|
|          2| 38|71.3|     1|
+-----------+---+----+------+



## **Joins and unions**

In [48]:
#join
df1.join(df2, ['PassengerId']).show()

+-----------+--------+------+--------+---+----+------+
|PassengerId|    Name|   Sex|Survived|Age|Fare|Pclass|
+-----------+--------+------+--------+---+----+------+
|          4|    Lily|female|       1| 35|53.1|     1|
|          3|   Laina|female|       1| 26| 7.9|     3|
|          2|Florence|female|       1| 38|71.3|     1|
|          5| William|  male|       0| 35| 8.0|     3|
|          1|    Owen|  male|       0| 22| 7.3|     3|
+-----------+--------+------+--------+---+----+------+



In [49]:
# Unions
# union() appends the rows of the second dataframe, matching columns by position; duplicate rows are kept
df1.union(df1).show()

+-----------+--------+------+--------+
|PassengerId|    Name|   Sex|Survived|
+-----------+--------+------+--------+
|          1|    Owen|  male|       0|
|          2|Florence|female|       1|
|          3|   Laina|female|       1|
|          4|    Lily|female|       1|
|          5| William|  male|       0|
|          1|    Owen|  male|       0|
|          2|Florence|female|       1|
|          3|   Laina|female|       1|
|          4|    Lily|female|       1|
|          5| William|  male|       0|
+-----------+--------+------+--------+



A common symptom of performance issues caused by chaining unions in a for loop is that each iteration takes longer than the previous one. In this case, **repartition()** and **checkpoint()** may help solve the problem.
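
As a rough sketch of that idea, the helper below unions a sequence of dataframes and checkpoints the intermediate result every few steps so the plan does not keep growing. The function name and the `checkpoint_every` parameter are assumptions for illustration. It also assumes that all frames share the same column layout and that a checkpoint directory has already been set with sc.setCheckpointDir(), as described in the checkpointing section of this notebook.

```python
def union_with_checkpoints(frames, checkpoint_every=10):
    """Union an iterable of DataFrames, truncating the plan every few unions.

    Assumes a checkpoint directory was set beforehand and that every
    frame has the same columns in the same order.
    """
    if checkpoint_every < 1:
        raise ValueError("checkpoint_every must be at least 1")
    result = None
    for i, frame in enumerate(frames, start=1):
        result = frame if result is None else result.union(frame)
        if i % checkpoint_every == 0:
            # checkpoint() materializes the data and cuts the lineage
            result = result.checkpoint()
    return result


# Hypothetical usage:
# sc.setCheckpointDir("checkpointdir")
# combined = union_with_checkpoints([df1] * 50, checkpoint_every=10)
```

The right interval depends on how expensive each frame is to compute; checkpointing too often trades plan size for extra disk writes.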

## **The spark.sql API**

Many of the operations that I showed can be accessed by writing SQL (Hive) queries in spark.sql().

In [50]:
df1.createOrReplaceTempView('df1_temp')
df2.createOrReplaceTempView('df2_temp')
#df.registerTempTable("connections")

In [51]:
query = '''
    select
        a.PassengerId,
        a.Name,
        a.Sex,
        a.Survived,
        b.Age,
        b.Fare,
        b.Pclass
    from df1_temp a
    join df2_temp b
        on a.PassengerId = b.PassengerId'''
dfj = spark.sql(query)


In [52]:
dfj.show()

+-----------+--------+------+--------+---+----+------+
|PassengerId|    Name|   Sex|Survived|Age|Fare|Pclass|
+-----------+--------+------+--------+---+----+------+
|          4|    Lily|female|       1| 35|53.1|     1|
|          3|   Laina|female|       1| 26| 7.9|     3|
|          2|Florence|female|       1| 38|71.3|     1|
|          5| William|  male|       0| 35| 8.0|     3|
|          1|    Owen|  male|       0| 22| 7.3|     3|
+-----------+--------+------+--------+---+----+------+



## Another way to do the same

```
spark.read.parquet('hdfs://sces3p100.hdfsinternalana/user/red_mov_with_customer_2')
# registerTempTable() is deprecated and returns None, so register the view in a separate statement
df = spark.read.format('parquet').load('prueba')
df.createOrReplaceTempView("tmp")
spark.sql('''select * from tmp''').show(1)
```

## **Create empty DataFrames**

In [53]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

field = [StructField('cod_pais_1', StringType(), True),
         StructField('cod_entidad_1', IntegerType(), True),
         StructField('cod_id_1', IntegerType(), True),
         StructField('cod_persona_1', IntegerType(), True),
         StructField('fec_movim_1', IntegerType(), True),
         StructField('fec_month_1', IntegerType(), True),
         StructField('partition_1', IntegerType(), True)
        ]
schema = StructType(field)
#schema = cl_contratos_nomina.printSchema()
# the SparkSession creates DataFrames directly; the legacy SQLContext is not needed
table_name = spark.createDataFrame([], schema)

In [54]:
table_name.printSchema()

root
 |-- cod_pais_1: string (nullable = true)
 |-- cod_entidad_1: integer (nullable = true)
 |-- cod_id_1: integer (nullable = true)
 |-- cod_persona_1: integer (nullable = true)
 |-- fec_movim_1: integer (nullable = true)
 |-- fec_month_1: integer (nullable = true)
 |-- partition_1: integer (nullable = true)



### **Change the names of the columns in a DataFrame**

## 1st way

In [55]:
table_name.show()

+----------+-------------+--------+-------------+-----------+-----------+-----------+
|cod_pais_1|cod_entidad_1|cod_id_1|cod_persona_1|fec_movim_1|fec_month_1|partition_1|
+----------+-------------+--------+-------------+-----------+-----------+-----------+
+----------+-------------+--------+-------------+-----------+-----------+-----------+



In [56]:
new_column_name_list= list(map(lambda x: x.replace("_1", ""), table_name.columns))
table_name_renamed = table_name.toDF(*new_column_name_list)

In [57]:
table_name_renamed.show()

+--------+-----------+------+-----------+---------+---------+---------+
|cod_pais|cod_entidad|cod_id|cod_persona|fec_movim|fec_month|partition|
+--------+-----------+------+-----------+---------+---------+---------+
+--------+-----------+------+-----------+---------+---------+---------+



## 2nd way

In [58]:
data = spark.createDataFrame([(1,2), (3,4)], ['x1', 'x2'])

In [59]:
data.show()

+---+---+
| x1| x2|
+---+---+
|  1|  2|
|  3|  4|
+---+---+



In [60]:
data = (data
   .withColumnRenamed('x1','x3')
   .withColumnRenamed('x2', 'x4'))

In [61]:
data.show()

+---+---+
| x3| x4|
+---+---+
|  1|  2|
|  3|  4|
+---+---+



## If you want to add a prefix

In [62]:
# named cols (not col) so the col() function imported earlier is not shadowed
cols = data.columns

In [63]:
from pyspark.sql import functions as f 
data.select([f.col(c).alias('x' + c) for c in cols]).show()

+---+---+
|xx3|xx4|
+---+---+
|  1|  2|
|  3|  4|
+---+---+



## If you want to add a suffix

In [64]:
from pyspark.sql import functions as F
(data
 .select(*[F.col(c).alias(f"{c}_x") for c in data.columns])
 .show()
)

+----+----+
|x3_x|x4_x|
+----+----+
|   1|   2|
|   3|   4|
+----+----+



## 3rd way

In [65]:
df = data.selectExpr("x3 as name", "x4 as age")
df.show()

+----+---+
|name|age|
+----+---+
|   1|  2|
|   3|  4|
+----+---+



## **Fill na**

In [66]:
df.na.fill(0).show()

+----+---+
|name|age|
+----+---+
|   1|  2|
|   3|  4|
+----+---+



In [67]:
df.fillna(0).show()

+----+---+
|name|age|
+----+---+
|   1|  2|
|   3|  4|
+----+---+



## Fill na's for specific columns

In [68]:
# subset must name columns that exist in the dataframe
df.fillna(0, subset=['name', 'age'])

DataFrame[name: bigint, age: bigint]

## **schema with arrays**

**First way**

Sometimes one of the columns holds an array, and we have to declare that column as an array type.

* 1st. Define the schema before loading.
* 2nd. Cast the columns that are arrays.

In [8]:
from pyspark.sql.functions import split
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("_c0",  StringType(), True),
     StructField("imp_sdopost",  StringType(), True)
])

trends = spark\
.read\
.schema(schema)\
.option("header", "true")\
.csv("data/trends.csv") 

trends.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- imp_sdopost: string (nullable = true)



In [9]:
trends = trends.withColumn("imp_sdopost", split(col("imp_sdopost"), ",").cast("array<long>"))

In [10]:
trends.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- imp_sdopost: array (nullable = true)
 |    |-- element: long (containsNull = true)



In [11]:
trends.show()

+---+--------------------+
|_c0|         imp_sdopost|
+---+--------------------+
|  0|[, 2915, 2912, 28...|
|  1|[, 228, 228, 228,...|
+---+--------------------+



## Separate into elements

In [12]:
from pyspark.sql.functions import explode
trends.select("_c0",explode("imp_sdopost")).show()

+---+----+
|_c0| col|
+---+----+
|  0|null|
|  0|2915|
|  0|2912|
|  0|2853|
|  0|2853|
|  0|2853|
|  0|2853|
|  0|2796|
|  0|2796|
|  0|2796|
|  0|2796|
|  0|2431|
|  0|2431|
|  0|2431|
|  0|2339|
|  0|2339|
|  0|2339|
|  0|2339|
|  0|2339|
|  0|2339|
+---+----+
only showing top 20 rows



## Row number

In [18]:
df = spark.createDataFrame([("A", 2000), ("A", 2002), ("A", 2007), ("B", 1999), ("B", 2015)], ["Group", "Date"])

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# number the rows within each group, ordered by date
df = df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date")))

df.show()

+-----+----+-------+
|Group|Date|row_num|
+-----+----+-------+
|    B|1999|      1|
|    B|2015|      2|
|    A|2000|      1|
|    A|2002|      2|
|    A|2007|      3|
+-----+----+-------+



## **Save CSV**

In [19]:
## other ways to do it
# with repartition ("com.databricks.spark.csv" is the legacy Spark 1.x name; Spark 2+ ships the "csv" format)
flightData2015.repartition(1).write.format("csv").option("header", "true")\
   .mode("overwrite").save("data/data_csv_2") # path to the output folder

# with coalesce (reduces the number of partitions without a full shuffle)
flightData2015.coalesce(1).write.format("csv")\
.option("header", "true").save("data/data_csv_3")

In [20]:
flightData2015.columns

['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count']

In [5]:
# simplest way: the built-in CSV writer (the path is a folder; no header is written by default)
flightData2015.repartition(1).write.csv("data/test_2.csv")

## **Save and read Parquet format**

In [22]:
flightData2015.write.partitionBy("DEST_COUNTRY_NAME").format("parquet").save("data/flightData2015.parquet")

In [25]:
df = spark.read.load("data/flightData2015.parquet")
df.show(5)

+-------------------+-----+-----------------+
|ORIGIN_COUNTRY_NAME|count|DEST_COUNTRY_NAME|
+-------------------+-----+-----------------+
|            Romania|   15|    United States|
|            Croatia|    1|    United States|
|            Ireland|  344|    United States|
|              India|   62|    United States|
|          Singapore|    1|    United States|
+-------------------+-----+-----------------+
only showing top 5 rows



## **Tuning performance or debugging dataframes**

1. Cache a dataframe when it is used multiple times in the script.

In [23]:
df1.cache()

DataFrame[PassengerId: bigint, Name: string, Sex: string, Survived: bigint]

In [24]:
df1 = df1.cache()

In [25]:
df1.storageLevel

StorageLevel(True, True, False, True, 1)

In [26]:
df1.unpersist()
df1.storageLevel

StorageLevel(False, False, False, False, 1)

2. Checkpointing

As mentioned earlier, chaining too many union() calls can cause performance problems or even out-of-memory errors. checkpoint() truncates the execution plan and saves the checkpointed dataframe to a temporary location on disk.

2.1. Caching before checkpointing is recommended, so Spark doesn’t have to read in the dataframe from disk after it’s checkpointed.

2.2. To use checkpoint(), I need to specify the temporary location where the dataframe will be saved, through the sparkContext object of the SparkSession.

In [27]:
#...
#sc = spark.sparkContext
sc.setCheckpointDir("checkpointdir") 

In [28]:
# For example, I can chain two self-joins of df1:
df = df1.join(df1, ['PassengerId'])
df.join(df1, ['PassengerId']).explain()

== Physical Plan ==
*(8) Project [PassengerId#22L, Name#23, Sex#24, Survived#25L, Name#334, Sex#335, Survived#336L, Name#345, Sex#346, Survived#347L]
+- *(8) SortMergeJoin [PassengerId#22L], [PassengerId#344L], Inner
   :- *(5) Project [PassengerId#22L, Name#23, Sex#24, Survived#25L, Name#334, Sex#335, Survived#336L]
   :  +- *(5) SortMergeJoin [PassengerId#22L], [PassengerId#333L], Inner
   :     :- *(2) Sort [PassengerId#22L ASC NULLS FIRST], false, 0
   :     :  +- Exchange hashpartitioning(PassengerId#22L, 5)
   :     :     +- *(1) Filter isnotnull(PassengerId#22L)
   :     :        +- Scan ExistingRDD[PassengerId#22L,Name#23,Sex#24,Survived#25L]
   :     +- *(4) Sort [PassengerId#333L ASC NULLS FIRST], false, 0
   :        +- ReusedExchange [PassengerId#333L, Name#334, Sex#335, Survived#336L], Exchange hashpartitioning(PassengerId#22L, 5)
   +- *(7) Sort [PassengerId#344L ASC NULLS FIRST], false, 0
      +- ReusedExchange [PassengerId#344L, Name#345, Sex#346, Survived#347L], Excha

In [29]:
#I can also checkpoint() after the first join to truncate the plan.
df = df1.join(df1, ['PassengerId']).checkpoint()
df.join(df1, ['PassengerId']).explain()

== Physical Plan ==
*(4) Project [PassengerId#22L, Name#23, Sex#24, Survived#25L, Name#359, Sex#360, Survived#361L, Name#377, Sex#378, Survived#379L]
+- *(4) SortMergeJoin [PassengerId#22L], [PassengerId#376L], Inner
   :- *(1) Filter isnotnull(PassengerId#22L)
   :  +- Scan ExistingRDD[PassengerId#22L,Name#23,Sex#24,Survived#25L,Name#359,Sex#360,Survived#361L]
   +- *(3) Sort [PassengerId#376L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(PassengerId#376L, 5)
         +- *(2) Filter isnotnull(PassengerId#376L)
            +- Scan ExistingRDD[PassengerId#376L,Name#377,Sex#378,Survived#379L]


## **Partitions and repartition()**

Another common cause of performance problems for me was having too many partitions. I think the Hadoop world calls this the small files problem. A rule of thumb: keep each partition at roughly 128 MB.
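
The rule of thumb turns into simple arithmetic. The sketch below is plain Python; the function name is hypothetical, and the size estimate has to come from elsewhere (for example, the total size of the input files), since the helper only does the division.

```python
def suggested_partitions(total_bytes, target_bytes=128 * 1024 * 1024):
    """Return a partition count that keeps each partition near target_bytes."""
    if total_bytes < 0 or target_bytes < 1:
        raise ValueError("total_bytes must be >= 0 and target_bytes >= 1")
    # ceiling division, with a floor of one partition
    return max(1, -(-total_bytes // target_bytes))


# Hypothetical usage, given an estimated dataset size in bytes:
# n = suggested_partitions(estimated_size_bytes)
# df = df.repartition(n)
```

For a 1 GiB dataset this gives 8 partitions, and anything smaller than 128 MiB ends up in a single partition.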

To check the number of partitions, use **.rdd.getNumPartitions()**

In [31]:
df1.rdd.getNumPartitions()

4

This dataframe, despite having only 5 rows, has 4 partitions. This is too many. I can repartition to only 1 partition.

In [32]:
df1_repartitioned = df1.repartition(1)
df1_repartitioned.rdd.getNumPartitions()

1

## Magic scala

In [22]:
!pip install pixiedust


# !jupyter pixiedust install y

In [36]:
import pixiedust

In [37]:
pixiedust.optOut()

Pixiedust will not collect anonymous install statistics.


In [38]:
var1="Hello"
var2=200

In [39]:
%%scala
println(var1)
println(var2 + 10)

Error Cannot run scala code: SCALA_HOME environment variable not set


## **UDF functions**

https://towardsdatascience.com/a-brief-introduction-to-pyspark-ff4284701873

