<img src="../static/logo.png" alt="datio" style="width: 200px;" align="right"/>

## Dataframe Python vs Spark

## 1 - Packages

Let's first import all the packages that you will need. 

- [pandas](http://pandas.pydata.org/)
- [pyspark](http://spark.apache.org/docs/2.1.0/api/python/pyspark.html)

In [None]:
import pandas as pd
import pyspark
from pyspark.sql import SQLContext

# Start a local Spark context that uses all available cores
sc = pyspark.SparkContext('local[*]')
sqlContext = SQLContext(sc)

## 2 - Reading csv file

**SAS**:

SAS **proc import** is usually a good starting point for reading a delimited ASCII data file, such as a .csv (comma-separated values) file or a tab-delimited file.

*proc import datafile="DATA.csv" out=mydata dbms=dlm replace; delimiter=","; getnames=yes; run;*

**PYTHON**:

With Pandas, you can easily read CSV files with **read_csv(path_file)**.
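
As a rough sketch of how the SAS options above map onto **read_csv**, the example below reads a small in-memory CSV. The column names are borrowed from this notebook, but the values are made up for illustration, and `sample_df` is a hypothetical name.

```python
import io

import pandas as pd

# Hypothetical sample data: column names come from this notebook,
# the values are invented for illustration only.
csv_text = (
    "cod_ofici,des_nomab,f_cierre\n"
    "101,Centro,2017-01-31\n"
    "102,Norte,2017-02-28\n"
)

# sep plays the role of SAS's delimiter=",", and header=0 (the default)
# plays the role of getnames=yes: the first line supplies the column names.
sample_df = pd.read_csv(io.StringIO(csv_text), sep=",", header=0)

print(sample_df.shape)           # (number of rows, number of columns)
print(list(sample_df.columns))   # column names taken from the header line
```

Passing a file path instead of the `io.StringIO` object reads from disk in the same way.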

**SPARK**:

Spark DataFrames can read data from many common formats, such as JSON files, Parquet files, and Hive tables, whether the data lives on a local file system, a distributed file system (HDFS), cloud storage (S3), or an external relational database. Before Spark 2.0, CSV was not supported natively and you had to use a separate library, spark-csv. Since Spark 2.0 the CSV reader is built in, and the **com.databricks.spark.csv** format name used below is still accepted as an alias for it.
Both Pandas and Spark DataFrames can read multiple formats, including CSV, JSON, and some binary formats.

In [None]:
dataPath = "../data/ttgofici.csv"

#PYTHON
pandasDF = pd.read_csv(dataPath)
#SPARK
sparkDF = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load(dataPath)

## 3 - Counting

In [None]:
## PYTHON : Count non NA/null observations of each column
pandasDF.count()
## PYTHON : Count number of rows (rows with NA/null values included)
len(pandasDF)
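
The two Pandas calls above answer different questions, which only shows up when values are missing. A minimal sketch on a hypothetical toy frame (`toy_df` and its contents are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in column "b".
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [10.0, np.nan, 30.0]})

non_null_per_column = toy_df.count()  # Series: non-null count for each column
n_rows = len(toy_df)                  # int: total number of rows

print(non_null_per_column.to_dict())
print(n_rows)
```

Of the two, `len(...)` is the one comparable to a row count such as Spark's `count()`.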

In [None]:
# SPARK : Count number of rows
sparkDF.count() 

## 4 - Viewing

In [None]:
## PYTHON
pandasDF.head(5)

In [None]:
## SPARK : head(n) returns the first n rows as a list of Row objects
sparkDF.head(5)

In [None]:
## SPARK : show(n) prints the first n rows as a formatted table
sparkDF.show(5)

## 5 - Inferring Types 

In [None]:
## PYTHON
pandasDF.dtypes

## 6 - Casting the values in a column

In [None]:
#PYTHON:
# also : pandasDF['f_cierre'] = pandasDF['f_cierre'].astype('datetime64[ns]')
pandasDF['f_cierre'] = pd.to_datetime(pandasDF['f_cierre'])
pandasDF.dtypes
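
To make the effect of a cast visible without the data file, here is a small sketch on a hypothetical frame whose columns all start out as text. The values and the names `raw_df` and `typed_df` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame where every column was loaded as text.
raw_df = pd.DataFrame({
    "cod_ofici": ["101", "102"],
    "f_cierre": ["2017-01-31", "2017-02-28"],
})

typed_df = raw_df.copy()
# Cast the numeric code from text to a 64-bit integer.
typed_df["cod_ofici"] = typed_df["cod_ofici"].astype("int64")
# errors="coerce" turns unparseable values into NaT instead of raising.
typed_df["f_cierre"] = pd.to_datetime(typed_df["f_cierre"], errors="coerce")

print(typed_df.dtypes)
```

Using `errors="coerce"` is a design choice: it keeps the cast from failing on dirty rows, at the cost of silently producing missing dates that should be checked afterwards.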

In [None]:
#SPARK:
# With Spark DataFrames loaded from CSV files, every column is read as a string by default.
sparkDF.printSchema()

In [None]:
# SPARK: let the reader infer column types (this needs an extra pass over the data)
sparkDF = sqlContext.read.format('com.databricks.spark.csv')\
.options(header='true')\
.option("inferSchema", "true")\
.load(dataPath)
sparkDF.printSchema()

In [None]:
#  SPARK: Change types of columns
from pyspark.sql.types import DateType
sparkDF = sparkDF.withColumn("f_cierre", sparkDF.f_cierre.cast(DateType()))
sparkDF.select("f_cierre").schema

## 7 - Reading and applying a custom schema with Spark

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

customSchema = StructType([
 StructField("cod_bancsb",  StringType(), True),
 StructField("cod_ofici",  IntegerType(), True),
 StructField("cnivel",  StringType(), True),
 StructField("cod_zona",  StringType(), True),
 StructField("cod_territor",  StringType(), True),
 StructField("cod_dirgener",  StringType(), True),
 StructField("cod_areanego",  IntegerType(), True),
 StructField("cod_dar",  StringType(), True),
 StructField("des_nomco",  StringType(), True),
 StructField("des_nomab",  StringType(), True),
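 # Assumption: f_cierre holds the date (it is cast to DateType in section 6) and cod_cbc is a plain code
 StructField("f_cierre",  DateType(), True),
 StructField("cod_cbc",  StringType(), True)])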

sparkDFSchemaApplied = sqlContext.read.format("com.databricks.spark.csv")\
            .option("header", "true")\
            .load(dataPath, schema=customSchema)

In [None]:
sparkDFSchemaApplied.printSchema()
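
For comparison, Pandas has no schema object, but a similar effect can be sketched with the `dtype` and `parse_dates` arguments of **read_csv**. The sample below uses three of the columns from the schema above. The values are invented, and `schema_df` is a hypothetical name.

```python
import io

import pandas as pd

# Hypothetical sample rows: real column names, made-up values.
csv_text = (
    "cod_bancsb,cod_ofici,f_cierre\n"
    "0182,101,2017-01-31\n"
    "0182,102,2017-02-28\n"
)

schema_df = pd.read_csv(
    io.StringIO(csv_text),
    # Keep the code as text so a leading zero is not lost to integer parsing.
    dtype={"cod_bancsb": str, "cod_ofici": "int64"},
    # Parse this column as dates while reading.
    parse_dates=["f_cierre"],
)

print(schema_df.dtypes)
```

Declaring codes as strings up front is the main reason to fix types at read time: once a value like a zero-padded code has been inferred as an integer, the padding cannot be recovered by casting back.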

## 8 - Describing

In Pandas and Spark, .describe() generates various summary statistics. The two can give slightly different results for two reasons:


1) In Pandas, NaN values are excluded. In Spark, NaN values propagate, so the mean and standard deviation of a column that contains NaN come out as NaN (nulls, by contrast, are ignored).

2) The standard deviation may not be computed in the same way. Pandas uses the unbiased (corrected) standard deviation by default, dividing by N-1. Early Spark 1.x releases reported the uncorrected standard deviation, dividing by N; in Spark 2.x, describe() reports the sample standard deviation (N-1), like Pandas.
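
The size of both effects can be checked on a tiny series using Pandas and NumPy alone. This is a sketch on made-up numbers. NumPy's plain `mean` is used here only as a stand-in for a computation that does not skip NaN, not as a model of Spark itself.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical values with one NaN.
values = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])

mean_skipping_nan = values.mean()      # Pandas skips NaN by default
sample_std = values.std()              # ddof=1: divide by N-1 (Pandas default)
population_std = values.std(ddof=0)    # ddof=0: divide by N

# NumPy's plain mean does not skip NaN, so the NaN propagates to the result.
mean_with_nan = np.mean(values.to_numpy())

print(mean_skipping_nan, sample_std, population_std, mean_with_nan)
```

With only four valid observations the N-1 and N versions differ noticeably. On large columns the gap shrinks, which is why the two tools usually disagree only slightly.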


In [None]:
#PYTHON:
pandasDF.describe()

In [None]:
#SPARK:
sparkDF.describe().show()