**_pySpark Basics: Loading and Exploring CSV Data_**

_by Jeff Levy (jlevy@urban.org)_

_Last Updated: 15 June 2016, Spark v1.6.1_

_Abstract: This guide will go over loading a CSV file into a dataframe, setting data types, and renaming columns._

***

A few initial setup items:  First we test that the spark context was successfully created during bootstrap and is available in the global namespace as 'sc'.  After that we create the SQL context necessary for working with a dataframe (panel data).  Note that the spark-csv package we'll be using is installed automatically in the bootstrap script.

In [19]:
try:
    sc
except NameError:
    raise Exception('Spark context not created.')

In [20]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

Next we load our data from a CSV file in an S3 bucket.  There are three ways to handle data types (dtypes) for each column.  The easiest, but the most computationally expensive, is to pass `inferSchema='true'` to the load call, which costs an extra pass over the data.  The second way entails specifying the dtypes manually by passing `schema=StructType(...)` to the load call, which is computationally efficient but can be tedious and prone to coder error for especially wide datasets.  The final option is to not specify a schema at all, in which case every column will have a string dtype.  Note that dtypes can be changed later, though that costs more than setting them correctly at load time.

Loading the data with the schema inferred:

In [21]:
df = sqlContext.read.load('s3://ui-hfpc/Performance_2015Q1.txt',
                          format='com.databricks.spark.csv',
                          header='false',
                          inferSchema='true',
                          delimiter='|')

Example loading of the same data by passing a custom schema:

In [22]:
"""
from pyspark.sql.types import DateType, TimestampType, IntegerType, FloatType, LongType, DoubleType, StringType
from pyspark.sql.types import StructType, StructField

customSchema = StructType([StructField('C0', DateType(), True),
                           StructField('C1', StringType(), True),
                           StructField('C2', DoubleType(), True),
                           StructField('C3', DoubleType(), True),
                           StructField('C4', DoubleType(), True),
                           StructField('C5', IntegerType(), True),
                           ...
                           StructField('C27', StringType(), True)])
                           
df = sqlContext.read.load('s3://ui-hfpc/Performance_2015Q1.txt',
                          format='com.databricks.spark.csv',
                          header='false',
                          schema=customSchema,
                          delimiter='|')
""";
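
For especially wide datasets, typing out every `StructField` by hand is where the coder error tends to creep in.  The following is a minimal sketch of one way around that, not part of the original workflow: we list only the columns that are *not* strings and fill in the rest automatically.  The helper names `build_column_specs` and `specs_to_schema` are our own inventions for illustration, and the four overrides simply mirror dtypes that the inferred schema reports for this file.

```python
def build_column_specs(n_columns, overrides, default='string'):
    """Return [(name, type_name), ...] for columns C0 .. C(n_columns - 1)."""
    specs = []
    for i in range(n_columns):
        name = 'C%d' % i
        # Any column not listed in overrides falls back to the default type
        specs.append((name, overrides.get(name, default)))
    return specs


def specs_to_schema(specs):
    """Turn the (name, type_name) pairs into a pyspark StructType."""
    # Imported here so the pure-Python helper above works without Spark
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   DoubleType, IntegerType, LongType)
    type_lookup = {'string': StringType, 'double': DoubleType,
                   'int': IntegerType, 'bigint': LongType}
    fields = [StructField(name, type_lookup[type_name](), True)
              for name, type_name in specs]
    return StructType(fields)


# Only the non-string columns need to be spelled out
overrides = {'C0': 'bigint', 'C3': 'double', 'C4': 'double', 'C5': 'int'}
specs = build_column_specs(28, overrides)

# With a live Spark context this would give the schema to pass to load():
# customSchema = specs_to_schema(specs)
```

Keeping the specification as plain tuples means it can be printed and checked by eye before Spark ever touches the data.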

Count the number of rows in the dataframe:

In [23]:
df.count()

3526154

We can easily see what the dtypes are for each column:

In [24]:
df.dtypes

[('C0', 'bigint'),
 ('C1', 'string'),
 ('C2', 'string'),
 ('C3', 'double'),
 ('C4', 'double'),
 ('C5', 'int'),
 ('C6', 'int'),
 ('C7', 'int'),
 ('C8', 'string'),
 ('C9', 'int'),
 ('C10', 'string'),
 ('C11', 'string'),
 ('C12', 'int'),
 ('C13', 'string'),
 ('C14', 'string'),
 ('C15', 'string'),
 ('C16', 'string'),
 ('C17', 'string'),
 ('C18', 'string'),
 ('C19', 'string'),
 ('C20', 'string'),
 ('C21', 'string'),
 ('C22', 'string'),
 ('C23', 'string'),
 ('C24', 'string'),
 ('C25', 'string'),
 ('C26', 'int'),
 ('C27', 'string')]
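
As mentioned earlier, dtypes can also be changed after loading.  Here is a minimal sketch of how that might look, assuming the dataframe `df` from above.  The helper `cast_columns` is a hypothetical name of our own, and the choice of casting `C5` to a double is purely illustrative.  It relies on `withColumn` replacing an existing column when given the same name.

```python
def cast_columns(df, type_map):
    """Return a dataframe with each column in type_map cast to its new dtype."""
    for name, new_type in type_map.items():
        # withColumn with an existing name replaces that column in place
        df = df.withColumn(name, df[name].cast(new_type))
    return df


# Illustrative only: which columns to cast depends on your data
type_map = {'C5': 'double'}

# With the dataframe loaded above:
# df = cast_columns(df, type_map)
# df.dtypes
```

Each cast is only added to the execution plan here; Spark does the actual work when an action such as `count` or `take` is called.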

Take a peek at five rows:

In [25]:
df.take(5)

[Row(C0=100002091588, C1=u'01/01/2015', C2=u'OTHER', C3=4.125, C4=None, C5=0, C6=360, C7=360, C8=u'01/2045', C9=16740, C10=u'0', C11=u'N', C12=None, C13=u'', C14=u'', C15=u'', C16=u'', C17=u'', C18=u'', C19=u'', C20=u'', C21=u'', C22=u'', C23=u'', C24=u'', C25=u'', C26=None, C27=u''),
 Row(C0=100002091588, C1=u'02/01/2015', C2=u'', C3=4.125, C4=None, C5=1, C6=359, C7=359, C8=u'01/2045', C9=16740, C10=u'0', C11=u'N', C12=None, C13=u'', C14=u'', C15=u'', C16=u'', C17=u'', C18=u'', C19=u'', C20=u'', C21=u'', C22=u'', C23=u'', C24=u'', C25=u'', C26=None, C27=u''),
 Row(C0=100002091588, C1=u'03/01/2015', C2=u'', C3=4.125, C4=None, C5=2, C6=358, C7=358, C8=u'01/2045', C9=16740, C10=u'0', C11=u'N', C12=None, C13=u'', C14=u'', C15=u'', C16=u'', C17=u'', C18=u'', C19=u'', C20=u'', C21=u'', C22=u'', C23=u'', C24=u'', C25=u'', C26=None, C27=u''),
 Row(C0=100002091588, C1=u'04/01/2015', C2=u'', C3=4.125, C4=None, C5=3, C6=357, C7=357, C8=u'01/2045', C9=16740, C10=u'0', C11=u'N', C12=None, C13=u'',

This data came with no headers, so pySpark named all columns *C0, C1, C2, ..., Cn*.  We can rename columns one or a few at a time:

In [26]:
df = df.withColumnRenamed('C0','id').withColumnRenamed('C1','date')

In [27]:
df.take(1)

[Row(id=100002091588, date=u'01/01/2015', C2=u'OTHER', C3=4.125, C4=None, C5=0, C6=360, C7=360, C8=u'01/2045', C9=16740, C10=u'0', C11=u'N', C12=None, C13=u'', C14=u'', C15=u'', C16=u'', C17=u'', C18=u'', C19=u'', C20=u'', C21=u'', C22=u'', C23=u'', C24=u'', C25=u'', C26=None, C27=u'')]

Or rename many of them in a loop using two lists or a dictionary:

In [28]:
old_names = ['C2', 'C3', 'C4', 'C5', 'C6', 'C7']
new_names = ['foo', 'bar', 'baz', 'more', 'another', 'stuff']
for old, new in zip(old_names, new_names):
    df = df.withColumnRenamed(old, new)

In [29]:
df.take(1)

[Row(id=100002091588, date=u'01/01/2015', foo=u'OTHER', bar=4.125, baz=None, more=0, another=360, stuff=360, C8=u'01/2045', C9=16740, C10=u'0', C11=u'N', C12=None, C13=u'', C14=u'', C15=u'', C16=u'', C17=u'', C18=u'', C19=u'', C20=u'', C21=u'', C22=u'', C23=u'', C24=u'', C25=u'', C26=None, C27=u'')]

In [30]:
df.columns

['id',
 'date',
 'foo',
 'bar',
 'baz',
 'more',
 'another',
 'stuff',
 'C8',
 'C9',
 'C10',
 'C11',
 'C12',
 'C13',
 'C14',
 'C15',
 'C16',
 'C17',
 'C18',
 'C19',
 'C20',
 'C21',
 'C22',
 'C23',
 'C24',
 'C25',
 'C26',
 'C27']
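
The loop above used two lists; the dictionary variant is much the same.  A minimal sketch follows, where `rename_columns` is a helper name of our own and the new column names are placeholders in the same spirit as `foo` and `bar`:

```python
def rename_columns(df, name_map):
    """Rename every column that appears as a key in name_map."""
    for old, new in name_map.items():
        df = df.withColumnRenamed(old, new)
    return df


# Placeholder names, for illustration only
name_map = {'C8': 'qux', 'C9': 'quux'}

# With the dataframe from above:
# df = rename_columns(df, name_map)
# df.columns
```

A dictionary keeps each old name right next to its replacement, which is harder to get out of step than two parallel lists.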

And finally, we'll describe the data.  Note that `describe` returns a new dataframe with the information, and so must have `show` called after it if our goal is to view it.  This can be called on one or more specific columns, as we do here, or the entire dataframe by passing no columns to describe:

In [31]:
df.describe('foo', 'bar', 'baz').show()

+-------+--------------------+-------------------+------------------+
|summary|                 foo|                bar|               baz|
+-------+--------------------+-------------------+------------------+
|  count|             3526154|            3526154|           1580402|
|   mean|                null|  4.178168090221902|234846.78065481802|
| stddev|                null|0.34382335723646484|118170.68592261615|
|    min|                    |               2.75|              0.85|
|    max|WELLS FARGO BANK,...|              6.125|        1193544.39|
+-------+--------------------+-------------------+------------------+



Working with big data can be difficult, because we're often used to being able to "look" at all of our data in Excel to get a feel for it.  Dealing with data too large to load into memory is entirely doable but requires a shift in thinking.  When necessary, tiny subsets can be inspected with the `take` command, as we did repeatedly above, though doing so can be an expensive operation.

HOWEVER, the ability to look at your data in a traditional way does exist in pySpark; you just can't do it unless the data fits in the memory of one computer.  With that crucial caveat in mind, here are some commands you can use on a pySpark dataframe, but usually shouldn't:

In [33]:
"""
df.show()                          #show the entire dataframe
df.select('column_name').show()    #show the entire selected column
""";
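
If you do want a traditional look at the data, one option is to cap the number of rows before anything is brought back to a single machine.  This is a rough sketch rather than a recommendation: `preview_as_pandas` is a hypothetical helper of our own, the default of 1000 rows is arbitrary, and `toPandas` assumes pandas is installed on the driver.

```python
def preview_as_pandas(df, n=1000):
    """Bring at most n rows back to the driver as a pandas DataFrame."""
    # limit() is part of the Spark plan, so no more than n rows are collected
    return df.limit(n).toPandas()


# With the dataframe from above:
# small = preview_as_pandas(df, n=100)
# small.head()
```

Because the result is an ordinary pandas DataFrame, it must fit comfortably in the memory of one computer, so keep `n` small.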