<img src="../static/logo.png" alt="datio" style="width: 200px" align="right"/>

## READ SAS DATA FILE WITH SAS7BDAT AND SAVE TO CSV


This notebook reads sas7bdat files using the pure-Python `sas7bdat` package (Python 2.6+, 3+). No SAS software is required.
The package was originally based on the work of Matt Shotwell and Clint Cummins in their R project at https://github.com/BioStatMatt/sas7bdat, but it has since been completely rewritten.

In [None]:
import sys
# Make the bundled sas7bdat module in ../lib/ importable
sys.path.append("../lib/")
from sas7bdat import SAS7BDAT

Read the SAS file by instantiating the `SAS7BDAT` class


In [None]:
#File from nces: http://nces.ed.gov/ccd/Data/zip/ag121a_supp_sas.zi
nameFile = "ag121a_supp"
inFile = "../data/" + nameFile + ".sas7bdat"
data = SAS7BDAT(inFile)

Get a pandas DataFrame


In [None]:
df = data.to_data_frame()

In [None]:
#Visualize df
df.head(4)

In [None]:
#Getting columns
df.columns

In [None]:
#Getting the number of rows:
len(df.index)

## Convert sas7bdat files to CSV files

The `sas7bdat_to_csv` script converts sas7bdat files to CSV. Its `<infile>` argument is the path to a sas7bdat file.

In [None]:
%run ../lib/sas7bdat_to_csv ../data/*.sas7bdat 
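
To convert a single file from inside Python rather than through the script, something along the following lines may be enough. This is a minimal sketch using only the standard library: `rows_to_csv` is a hypothetical helper, not part of the `sas7bdat` package, and it assumes the rows arrive as an iterable of lists with the header row first, which is what iterating over a `SAS7BDAT` object is assumed to yield here.

```python
import csv


def rows_to_csv(rows, out_path):
    """Write an iterable of rows (header first) to a CSV file.

    Returns the number of lines written, header included.
    """
    written = 0
    # newline="" lets the csv module control line endings itself
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for row in rows:
            writer.writerow(row)
            written += 1
    return written


# Hypothetical usage, assuming iteration over SAS7BDAT yields rows:
# with SAS7BDAT(inFile) as f:
#     rows_to_csv(f, "../data/" + nameFile + ".csv")
```

Streaming row by row avoids building a full pandas DataFrame when the only goal is a CSV file.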

## TRANSFORM SAS FILES TO PARQUET THROUGH SPARK

So far, we have converted a SAS sas7bdat file to a pandas DataFrame. Conveniently, Spark can turn a pandas DataFrame into a Spark DataFrame with the `createDataFrame` method. After that step we have two DataFrames: one in pandas and one in Spark.

The strategy is to build a pipeline of the form SAS --> Python --> Spark --> Parquet.

In [None]:
with SAS7BDAT(inFile) as f:
    pandas_df = f.to_data_frame()
print('-----Data in Pandas dataframe-----')
print(pandas_df.head())


In [None]:
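import pyspark
from pyspark.sql.context import SQLContext

# Local Spark context using all available cores
sc = pyspark.SparkContext('local[*]')
sqlContext = SQLContext(sc)

# Convert the pandas DataFrame to a Spark DataFrame and preview it
spark_df = sqlContext.createDataFrame(pandas_df)
print('-----Data in Spark dataframe-----')
spark_df.show(5)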

The two DataFrames should have the same length. Here both contain 1838 rows.

In [None]:
print(len(pandas_df))
print(spark_df.count())

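To write in Parquet format, use **df.write.save()**. Parquet is Spark's default data source format, so no explicit format argument is needed unless that default has been changed.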

In [None]:
spark_df.write.save(path="../data/" + nameFile, mode="overwrite")

## Automate the transformation

In [None]:
def sas_to_parquet(filelist, destination):
    """Save SAS file to parquet
    Args:
        filelist (list): the list of sas file names
        destination (str): the path for parquet
    Returns:
        None
    """
    rows = 0
    for i, filename in enumerate(filelist):
        with SAS7BDAT(filename) as f:
            pandas_df = f.to_data_frame()
            rows += len(pandas_df)
        spark_df = sqlContext.createDataFrame(pandas_df)
        # i is an int, so convert it before concatenating it into the path
        spark_df.write.save(destination + str(i) + ".parquet")
    print('{0} rows have been transformed'.format(rows))
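
The function above names each output by its position in the list, which can make it hard to tell which Parquet directory came from which SAS file. One possible alternative is sketched below. `find_sas_files` and `parquet_target` are hypothetical helpers introduced for illustration only: the first collects the input files from a folder, and the second derives the output name from the SAS file name.

```python
import glob
import os


def find_sas_files(folder):
    """Return the sorted list of .sas7bdat files directly inside `folder`."""
    return sorted(glob.glob(os.path.join(folder, "*.sas7bdat")))


def parquet_target(sas_path, destination):
    """Return an output path inside `destination` named after the SAS file."""
    stem = os.path.splitext(os.path.basename(sas_path))[0]
    return os.path.join(destination, stem + ".parquet")


# Hypothetical usage with the function defined above:
# sas_to_parquet(find_sas_files("../data/"), "../data/out")
```

To use the name-based scheme, the save path inside the loop could be replaced with `parquet_target(filename, destination)`.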

## Advantages of the Parquet Version
- Self-describing   
- Columnar format (very efficient compression)   
- Language-independent  