**_pySpark Basics: Loading, Exploring and Saving Data_**

_by Jeff Levy (jlevy@urban.org)_

_Last Updated: 8 Aug 2016, Spark v2.0_

_Abstract: This guide will go over loading a CSV file into a dataframe, exploring it with basic commands, and finally writing it out to a file in S3 and reading it back in._

_Main operations used:_ `read.csv`, `write.csv`, `count`, `dtypes`, `schema`/`inferSchema`, `take`, `show`, `withColumnRenamed`, `columns`, `describe`, `coalesce`

***

The code below uses the `spark` session object (a `SparkSession`), which in Spark 2.0 is the entry point for working with a dataframe (panel data). It replaces the SQL context that the Spark 1.6.1 tutorials create as an initial setup item. Note that CSV support is built into Spark 2.0, so the separate spark-csv package used with 1.6.1 is no longer needed.

# Loading From CSV

Next we load our data from a CSV file in an S3 bucket. There are three ways to handle data types (dtypes) for each column. The easiest, but the most computationally expensive, is to pass `inferSchema=True` to the load method. The second way entails specifying the dtypes manually for every column by passing `schema=StructType(...)`, which is computationally efficient but may be difficult and error-prone for especially wide datasets. The final option is to not specify a schema option at all, in which case Spark will assign all the columns string dtypes. Note that dtypes can also be changed after loading, though this is more costly than getting them right in the loading process.
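
As a rough illustration of changing dtypes after loading, the sketch below casts columns by name. The helper `cast_columns` and its `casts` argument are hypothetical names introduced here, not part of Spark; the helper relies only on the dataframe's `withColumn` method and the column's `cast` method.

```python
def cast_columns(df, casts):
    """Return a new dataframe with each column named in `casts` cast to a new dtype.

    `casts` is a hypothetical {column_name: dtype_name} dictionary,
    for example {'_c5': 'double'}.
    """
    for name, dtype in casts.items():
        # df[name] selects the column; cast() accepts a dtype name such as 'double'.
        # withColumn() with an existing name replaces that column.
        df = df.withColumn(name, df[name].cast(dtype))
    return df

# Usage, once a dataframe named df has been loaded:
# df = cast_columns(df, {'_c5': 'double'})
```

Each cast is an extra transformation on data that has already been parsed, which is why setting the dtypes at load time is the cheaper route.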

Loading the data with the schema inferred:

In [1]:
df = spark.read.csv('s3://ui-spark-data/Performance_2015Q1.txt', header=False, inferSchema=True, sep='|')

Example loading of the same data by passing a custom schema:

In [2]:
"""
from pyspark.sql.types import DateType, TimestampType, IntegerType, FloatType, LongType, DoubleType, StringType
from pyspark.sql.types import StructType, StructField

customSchema = StructType([StructField('_c0', DateType(), True),
                           StructField('_c1', StringType(), True),
                           StructField('_c2', DoubleType(), True),
                           StructField('_c3', DoubleType(), True),
                           StructField('_c4', DoubleType(), True),
                           StructField('_c5', IntegerType(), True),
                           ...
                           StructField('_c27', StringType(), True)])
                           
df = spark.read.csv('s3://ui-spark-data/Performance_2015Q1.txt', header=False, schema=customSchema, sep='|')
""";

Inferring and specifying a schema can also be used together, for example with a large, unfamiliar dataset that you know you will need to load and work with repeatedly. The first time you load it, use `inferSchema` and make note of the dtypes it assigns. Use that information to build the custom schema, so that when you load the data in the future you avoid the extra processing time necessary for inferring.
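
One way to turn noted dtypes into a custom schema is to generate the schema text from a `df.dtypes` style list and paste it into your loading code. The sketch below is plain Python and does not need Spark to run; `DTYPE_TO_CLASS` and `schema_source` are hypothetical names, and the mapping covers only the dtypes that appear in this tutorial.

```python
# Map dtype names as reported by df.dtypes to pyspark.sql.types class names.
# Only the dtypes seen in this tutorial are listed; extend the mapping as needed.
DTYPE_TO_CLASS = {
    'bigint': 'LongType',
    'int': 'IntegerType',
    'double': 'DoubleType',
    'string': 'StringType',
}

def schema_source(dtypes):
    """Return Python source text for a StructType matching a list of (name, dtype) tuples.

    Raises KeyError for a dtype that is missing from DTYPE_TO_CLASS.
    """
    lines = ["    StructField('{}', {}(), True),".format(name, DTYPE_TO_CLASS[dtype])
             for name, dtype in dtypes]
    return "StructType([\n" + "\n".join(lines) + "\n])"

# A few (name, dtype) tuples of the kind df.dtypes returns.
inferred = [('_c0', 'bigint'), ('_c1', 'string'), ('_c3', 'double'), ('_c5', 'int')]
print(schema_source(inferred))
```

The printed text can be assigned to `customSchema` in a later session, alongside the matching imports from `pyspark.sql.types`.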

# Exploring the Data

Our data is now loaded into a dataframe that we named `df`, with all the dtypes inferred.  First we'll count the number of rows it found:

In [3]:
df.count()

3526154

Then we look at the column-by-column dtypes the system estimated:

In [4]:
df.dtypes

[('_c0', 'bigint'),
 ('_c1', 'string'),
 ('_c2', 'string'),
 ('_c3', 'double'),
 ('_c4', 'double'),
 ('_c5', 'int'),
 ('_c6', 'int'),
 ('_c7', 'int'),
 ('_c8', 'string'),
 ('_c9', 'int'),
 ('_c10', 'string'),
 ('_c11', 'string'),
 ('_c12', 'int'),
 ('_c13', 'string'),
 ('_c14', 'string'),
 ('_c15', 'string'),
 ('_c16', 'string'),
 ('_c17', 'string'),
 ('_c18', 'string'),
 ('_c19', 'string'),
 ('_c20', 'string'),
 ('_c21', 'string'),
 ('_c22', 'string'),
 ('_c23', 'string'),
 ('_c24', 'string'),
 ('_c25', 'string'),
 ('_c26', 'int'),
 ('_c27', 'string')]

For each pairing (a `tuple` object in Python, denoted by the parentheses), the first entry is the column name and the second is the dtype. Notice that this data has no headers with it (we specified `header=False` when we loaded it), so Spark used its default naming convention of `_c0, _c1, ... _cn`. We'll make some changes to that in a minute.

Take a peek at five rows:
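
Because `dtypes` is an ordinary Python list of tuples, it can be processed with ordinary Python. As a small sketch, the following groups column names by dtype, using the first six tuples from the output above; the variable names are illustrative only.

```python
from collections import defaultdict

# First six (name, dtype) tuples from the df.dtypes output.
dtypes = [('_c0', 'bigint'), ('_c1', 'string'), ('_c2', 'string'),
          ('_c3', 'double'), ('_c4', 'double'), ('_c5', 'int')]

columns_by_dtype = defaultdict(list)
for name, dtype in dtypes:          # each tuple unpacks into two variables
    columns_by_dtype[dtype].append(name)

string_columns = columns_by_dtype['string']
print(string_columns)               # ['_c1', '_c2']
```

With a live dataframe, `df.dtypes` can be used in place of the hard-coded list.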

In [5]:
df.take(5)

[Row(_c0=100002091588, _c1=u'01/01/2015', _c2=u'OTHER', _c3=4.125, _c4=None, _c5=0, _c6=360, _c7=360, _c8=u'01/2045', _c9=16740, _c10=u'0', _c11=u'N', _c12=None, _c13=u'', _c14=u'', _c15=u'', _c16=u'', _c17=u'', _c18=u'', _c19=u'', _c20=u'', _c21=u'', _c22=u'', _c23=u'', _c24=u'', _c25=u'', _c26=None, _c27=u''),
 Row(_c0=100002091588, _c1=u'02/01/2015', _c2=u'', _c3=4.125, _c4=None, _c5=1, _c6=359, _c7=359, _c8=u'01/2045', _c9=16740, _c10=u'0', _c11=u'N', _c12=None, _c13=u'', _c14=u'', _c15=u'', _c16=u'', _c17=u'', _c18=u'', _c19=u'', _c20=u'', _c21=u'', _c22=u'', _c23=u'', _c24=u'', _c25=u'', _c26=None, _c27=u''),
 Row(_c0=100002091588, _c1=u'03/01/2015', _c2=u'', _c3=4.125, _c4=None, _c5=2, _c6=358, _c7=358, _c8=u'01/2045', _c9=16740, _c10=u'0', _c11=u'N', _c12=None, _c13=u'', _c14=u'', _c15=u'', _c16=u'', _c17=u'', _c18=u'', _c19=u'', _c20=u'', _c21=u'', _c22=u'', _c23=u'', _c24=u'', _c25=u'', _c26=None, _c27=u''),
 Row(_c0=100002091588, _c1=u'04/01/2015', _c2=u'', _c3=4.125, _c4=No

Each row is displayed in the format `column_name=value`. Note that the formatting above is ugly because `take` doesn't try to make it pretty; it just returns the row objects themselves. We can use `show` instead, which attempts to format the data better, but because there are so many columns in this case the output of `show` doesn't fit and each line wraps down to the next. We'll use `show` on a subset below.

# Renaming Columns

We can rename columns one at a time, or several at a time:

In [6]:
df = df.withColumnRenamed('_c0','id').withColumnRenamed('_c1','date')

In [7]:
df.take(1)

[Row(id=100002091588, date=u'01/01/2015', _c2=u'OTHER', _c3=4.125, _c4=None, _c5=0, _c6=360, _c7=360, _c8=u'01/2045', _c9=16740, _c10=u'0', _c11=u'N', _c12=None, _c13=u'', _c14=u'', _c15=u'', _c16=u'', _c17=u'', _c18=u'', _c19=u'', _c20=u'', _c21=u'', _c22=u'', _c23=u'', _c24=u'', _c25=u'', _c26=None, _c27=u'')]

You can see that column `_c0` has been renamed to `id`, and `_c1` to `date`.

We can also rename many of them in a loop using two lists or a dictionary:

In [8]:
old_names = ['_c2', '_c3', '_c4', '_c5', '_c6', '_c7']
new_names = ['foo', 'bar', 'baz', 'more', 'another', 'stuff']
for old, new in zip(old_names, new_names):
    df = df.withColumnRenamed(old, new)

In [9]:
df.take(1)

[Row(id=100002091588, date=u'01/01/2015', foo=u'OTHER', bar=4.125, baz=None, more=0, another=360, stuff=360, _c8=u'01/2045', _c9=16740, _c10=u'0', _c11=u'N', _c12=None, _c13=u'', _c14=u'', _c15=u'', _c16=u'', _c17=u'', _c18=u'', _c19=u'', _c20=u'', _c21=u'', _c22=u'', _c23=u'', _c24=u'', _c25=u'', _c26=None, _c27=u'')]

In [10]:
df.columns

['id',
 'date',
 'foo',
 'bar',
 'baz',
 'more',
 'another',
 'stuff',
 '_c8',
 '_c9',
 '_c10',
 '_c11',
 '_c12',
 '_c13',
 '_c14',
 '_c15',
 '_c16',
 '_c17',
 '_c18',
 '_c19',
 '_c20',
 '_c21',
 '_c22',
 '_c23',
 '_c24',
 '_c25',
 '_c26',
 '_c27']
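
The loop above used two lists. A dictionary version might look like the sketch below, where `rename_columns` is a hypothetical helper and the new names in the usage comment are placeholders rather than the real meaning of those columns.

```python
def rename_columns(df, name_map):
    """Rename columns according to an {old_name: new_name} dictionary."""
    for old, new in name_map.items():
        # withColumnRenamed returns a new dataframe, so reassign on each pass.
        df = df.withColumnRenamed(old, new)
    return df

# Usage, with placeholder names:
# df = rename_columns(df, {'_c8': 'eighth', '_c9': 'ninth'})
```

A dictionary keeps each old name next to its replacement, which can be easier to check by eye than two parallel lists.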

# Describe

Now we'll describe the data.  Note that `describe` returns a new dataframe with the information, and so must have `show` called after it if our goal is to view it (note the nice formatting in this case).  This can be called on one or more specific columns, as we do here, or the entire dataframe by passing no columns to describe:

In [11]:
df_described = df.describe('foo', 'bar', 'baz')
df_described.show()

+-------+--------------------+-------------------+------------------+
|summary|                 foo|                bar|               baz|
+-------+--------------------+-------------------+------------------+
|  count|             3526154|            3526154|           1580402|
|   mean|                null|  4.178168090219519|234846.78065481762|
| stddev|                null|0.34382335723646673|118170.68592261661|
|    min|                    |               2.75|              0.85|
|    max|WELLS FARGO BANK,...|              6.125|        1193544.39|
+-------+--------------------+-------------------+------------------+
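
The `count` row reports non-null values per column, so comparing it with the total row count shows how much data is missing. A quick worked example using the numbers in the table above (the variable names are illustrative):

```python
total_rows = 3526154   # df.count(), also the count shown for foo and bar
baz_count = 1580402    # count shown by describe for baz

missing_baz = total_rows - baz_count
share_present = baz_count / float(total_rows)

print(missing_baz)     # 1945752
print(round(share_present, 3))
```

So more than half of the rows have no value for `baz`, which is worth knowing before reading too much into its mean and standard deviation.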



# Writing to S3

And finally, we can write data out to our S3 bucket.  Note that if your data is small enough to be collected onto one computer, writing it is easy.  We'll use the dataframe we just created using `describe` above as an example:

In [12]:
df_described.write.csv('s3://pyspark-tutorials/mycsv', header=True)

The above line will turn *each partition* of this dataframe into a separate .csv file. This is an important note: if your data is very big it may be spread over a lot of partitions. This may be required if your data is too large to fit in one csv file, but if your data should fit you can include the `coalesce` command, like this:

    df_described.coalesce(1).write.csv('s3://pyspark-tutorials/mycsv', header=True)

This tells Spark to combine all the data into 1 partition (or however many you pass in as the value). Again, only do this if your data isn't very large. See the pySpark tutorial on subsetting for more.

Now you can read our output back in:

In [15]:
df_new = spark.read.csv('s3://pyspark-tutorials/mycsv', header=True, inferSchema=True, sep=',')

In [16]:
df_new.show()

+-------+--------------------+-------------------+------------------+
|summary|                 foo|                bar|               baz|
+-------+--------------------+-------------------+------------------+
| stddev|                    |0.34382335723646673|118170.68592261661|
|    min|                    |               2.75|              0.85|
|    max|WELLS FARGO BANK,...|              6.125|        1193544.39|
|  count|             3526154|          3526154.0|         1580402.0|
|   mean|                    |  4.178168090219519|234846.78065481762|
+-------+--------------------+-------------------+------------------+



Note the differences from the original loading we did at the top of this tutorial: we pointed it to the new path we created on S3, we told it our data had headers that we wanted to keep, and the delimiter is now the default `,` instead of the `|` the original data had. We could have specified the delimiter as an option in our write operation if we had wanted it to be something else. Also, since commas are the default, we could have left the `sep` argument out altogether here.
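
As a sketch of writing with a different delimiter, the helper below passes `sep` to `write.csv`. The function name `write_delimited` and the path in the usage comment are hypothetical.

```python
def write_delimited(df, path, sep='|'):
    """Write df as CSV files under `path`, one file per partition, using `sep`."""
    df.write.csv(path, header=True, sep=sep)

# Usage, with a hypothetical output path:
# write_delimited(df_described, 's3://pyspark-tutorials/mycsv_pipe')
#
# Reading it back then needs the same separator:
# df_new = spark.read.csv('s3://pyspark-tutorials/mycsv_pipe',
#                         header=True, inferSchema=True, sep='|')
```

Whatever separator is used for writing has to be passed again when reading, since the reader otherwise assumes commas.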