<img src="../static/logo.png" alt="datio" style="width: 200px" align="right"/>

## Dataframe Python vs Spark

In [None]:
#PYTHON
import pandas as pd

In [None]:
#SPARK
import pyspark
from pyspark.sql.context import SQLContext
sc = pyspark.SparkContext('local[*]')
sqlContext = SQLContext(sc)

## Reading a CSV file

SAS: proc import is usually a good starting point for reading a delimited ASCII data file, such as a .csv (comma-separated values) file or a tab-delimited file. A data step can also read such a file. The example below uses proc import.

proc import datafile="DATA.csv" out=mydata dbms=dlm replace;

   delimiter=",";
   
   getnames=yes;
   
run;


PYTHON: With pandas, you can easily read CSV files with read_csv().

SPARK: Spark DataFrames can read data from popular formats such as JSON files, Parquet files, and Hive tables, whether the data lives on a local file system, a distributed file system (HDFS), cloud storage (S3), or an external relational database system. In Spark 1.x, however, CSV is not supported natively: you have to use a separate library, spark-csv. Since Spark 2.0, a CSV reader is built in.
With that library in place, both pandas and Spark DataFrames can read multiple formats, including CSV, JSON, and some binary formats.

In [None]:
dataPath = "../data/ttgofici.csv"
#PYTHON
pandasDF = pd.read_csv(dataPath)
#SPARK
sparkDF = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load(dataPath)

## Counting

In [None]:
# Count non NA/null observations of each column
pandasDF.count()

In [None]:
# Number of rows
sparkDF.count() 
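
The two count() calls above are not equivalent: pandas counts the non-null values of each column, while Spark returns the number of rows. The following pandas-only sketch uses a small made-up frame (its values are hypothetical, only the column names are borrowed from this dataset) to show the difference and one way to get a Spark-style row count in pandas.

```python
import pandas as pd

# Hypothetical sample data: two of the four names are missing.
df = pd.DataFrame({
    "cod_ofici": [1, 2, 3, 4],
    "des_nomco": ["a", None, "c", None],
})

# pandas: one count per column, missing values excluded.
non_null_per_column = df.count()

# Row count, comparable to what sparkDF.count() returns.
n_rows = len(df)

print(non_null_per_column)
print(n_rows)
```

If a column contains missing values, its pandas count is lower than the row count, so the two numbers should not be compared directly.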

## Viewing

In [None]:
pandasDF.head(5)

In [None]:
sparkDF.head(5)

In [None]:
sparkDF.show(5)

## Inferring Types 

In [None]:
pandasDF.dtypes

In [None]:
#PYTHON: CAST THE VALUES IN A COLUMN
# also : pandasDF['f_cierre'] = pandasDF['f_cierre'].astype('datetime64[ns]')
pandasDF['f_cierre'] = pd.to_datetime(pandasDF['f_cierre'])
pandasDF.dtypes
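
As a self-contained illustration of the cast above, the sketch below reads a tiny in-memory CSV and converts the date column in two ways. The sample rows and the ISO date format are assumptions made for the example; the real f_cierre values may use a different format.

```python
import io

import pandas as pd

# Hypothetical CSV content; the actual file may format dates differently.
csv_text = "cod_ofici,f_cierre\n1,2016-01-31\n2,2016-02-29\n"

# Without hints, the date column is read as text, not as a datetime.
raw = pd.read_csv(io.StringIO(csv_text))
was_datetime_before = pd.api.types.is_datetime64_any_dtype(raw["f_cierre"])

# Option 1: convert after loading, with an explicit format.
converted = raw.copy()
converted["f_cierre"] = pd.to_datetime(converted["f_cierre"], format="%Y-%m-%d")

# Option 2: ask read_csv to parse the column while loading.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["f_cierre"])

print(was_datetime_before)
print(converted.dtypes)
print(parsed.dtypes)
```

Passing an explicit format makes the conversion raise an error on values that do not match, instead of guessing their meaning.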

In [None]:
# With Spark DataFrames loaded from CSV files, every column defaults to the string type.
sparkDF.schema

In [None]:
sparkDF = sqlContext.read.format('com.databricks.spark.csv')\
.options(header='true')\
.option("inferSchema", "true")\
.load(dataPath)
sparkDF.schema

In [None]:
# SPARK: change the type of a column
from pyspark.sql.types import DateType
sparkDF = sparkDF.withColumn("f_cierre", sparkDF.f_cierre.cast(DateType()))
sparkDF.select("f_cierre").schema

## Reading and applying a custom schema with Spark

In [None]:
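# Import only the types the schema needs instead of using a star import
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
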
customSchema = StructType([
 StructField("cod_bancsb",  StringType(), True),
 StructField("cod_ofici",  IntegerType(), True),
 StructField("cnivel",  StringType(), True),
 StructField("cod_zona",  StringType(), True),
 StructField("cod_territor",  StringType(), True),
 StructField("cod_dirgener",  StringType(), True),
 StructField("cod_areanego",  IntegerType(), True),
 StructField("cod_dar",  StringType(), True),
 StructField("des_nomco",  StringType(), True),
 StructField("des_nomab",  StringType(), True),
 StructField("f_cierre",  StringType(), True),
 StructField("cod_cbc",  StringType(), True)])

sparkDFSchemaApplied = sqlContext.read.format("com.databricks.spark.csv")\
            .option("header", "true")\
            .load(dataPath, schema=customSchema)

In [None]:
#process schema doesn't work with StructType
sparkDFSchemaApplied.printSchema()

## Describing

In pandas and Spark, .describe() generates various summary statistics. The results can differ slightly for two reasons:


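1) In pandas, NaN values are excluded. In Spark, NaN values propagate into the computation, so the mean and standard deviation come back as NaN.

2) The standard deviation is not computed in the same way. pandas uses the unbiased (corrected) standard deviation by default, while the Spark 1.x release this notebook targets uses the uncorrected one. The difference is the use of N-1 instead of N in the denominator. Recent Spark releases report the sample standard deviation, in which case this second difference disappears.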
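
Both effects can be reproduced with Python's standard library alone. This is a minimal sketch on made-up values; it illustrates the arithmetic, not the code that pandas or Spark actually run.

```python
import math
import statistics

# Hypothetical sample: mean 5.0, sum of squared deviations 32.0.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(values)

# Reason 2: corrected (N-1) versus uncorrected (N) standard deviation.
sample_std = statistics.stdev(values)       # divides by N - 1
population_std = statistics.pstdev(values)  # divides by N

# The two differ only by the factor sqrt(N / (N - 1)).
rescaled = population_std * math.sqrt(n / (n - 1))

# Reason 1: excluding NaN versus letting it propagate.
with_nan = values + [math.nan]
clean = [v for v in with_nan if not math.isnan(v)]
skipping_mean = statistics.fmean(clean)             # NaN excluded
propagating_mean = sum(with_nan) / len(with_nan)    # NaN propagates

print(sample_std, population_std, rescaled)
print(skipping_mean, propagating_mean)
```

The gap between the two standard deviations shrinks as N grows, so on large tables it is usually small, whereas a single NaN is enough to change the mean of a whole column.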


In [None]:
pandasDF.describe()

In [None]:
sparkDF.describe().show()