Spark DataFrames are a Spark feature that lets us create and work with tabular DataFrame objects. As the name suggests, the design was inspired by pandas.

Spark is well known for its ability to [process large data sets](https://app.dataquest.io/m/91/spark-dataframes/1/the-spark-dataframe-an-introduction). Spark DataFrames combine the scale and speed of Spark with the familiar query, filter, and analysis capabilities of pandas. Unlike pandas, which can only run on one computer, Spark can use distributed memory (and disk when necessary) to handle larger data sets and run computations more quickly.

Because the two APIs are similar, we can adapt much of our existing pandas code to scale up to much larger data sets. Spark DataFrames can also read from a wider range of data formats and storage systems. We can even use a SQL interface to write distributed SQL queries that query large database systems and other data stores.

Here, we'll be working with a JSON file containing data from the 2010 U.S. Census. It has the following columns:

* age - Age (in years)
* females - Number of females
* males - Number of males
* total - Total number of individuals
* year - Year column (2010 for all rows)

Let's open and explore the data set before we dive into Spark DataFrames.

In [1]:
# Print the first four lines of the raw file; the `with` block closes it afterwards
with open('census_2010.json') as f:
    for i in range(4):
        print(f.readline())

{"females": 1994141, "total": 4079669, "males": 2085528, "age": 0, "year": 2010}

{"females": 1997991, "total": 4085341, "males": 2087350, "age": 1, "year": 2010}

{"females": 2000746, "total": 4089295, "males": 2088549, "age": 2, "year": 2010}

{"females": 2002756, "total": 4092221, "males": 2089465, "age": 3, "year": 2010}



Previously, we explored reading data into an RDD object. Recall that an RDD is essentially a list of tuples with no enforced schema or structure of any kind. The tuples in an RDD can vary in length, and the types they hold can differ from one tuple to the next.

RDDs are useful for representing unstructured data like text. Without them, we'd need to write a lot of custom Python code to interact with such data.

We use the SparkContext object to read data into an RDD:

![image.png](attachment:image.png)

To use the familiar DataFrame query interface from pandas, however, the data representation needs to include rows, columns, and types. Spark DataFrames follow the same row-and-column model as pandas DataFrames, and each column has a name and a type.

The Spark SQL module is very powerful. It gives Spark more information about the structure of the data we're using and the computations we want to perform. Spark uses that information to optimize how it executes those computations.

To take advantage of these features, we'll use the SQLContext object, instead of the SparkContext object, to structure external data as a DataFrame.

We can query Spark DataFrame objects with SQL. The SQLContext class gets its name from this capability.

This class allows us to read in data and create new DataFrames from a wide range of sources. It can do this because it takes advantage of Spark's powerful [Data Sources API](https://databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html).

**File Formats and Storage**

* JSON, CSV/TSV, XML
* Parquet, Avro, Amazon S3 (cloud storage service)

**Big Data Systems**

* Hive, HBase

**SQL Database Systems**

* MySQL, PostgreSQL

Data science organizations often use a wide range of systems to collect and store data, and they're constantly making changes to those systems. Spark DataFrames allow us to interface with different types of data, and ensure that our analysis logic will still work as the data storage mechanisms change.

Now that we've learned a bit about Spark DataFrames, let's read in census_2010.json. This data set contains valid JSON on each line, which is what Spark needs in order to read the data in properly.
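
If we are unsure whether a file follows this one-object-per-line layout, we can check it with plain Python before handing it to Spark. The following is a minimal sketch using only the standard library. The parse_json_lines() helper is hypothetical, and the sample lines are copied from the output shown earlier rather than read from census_2010.json.

```python
import json

# Four sample records copied from the census output shown earlier.
sample_lines = [
    '{"females": 1994141, "total": 4079669, "males": 2085528, "age": 0, "year": 2010}',
    '{"females": 1997991, "total": 4085341, "males": 2087350, "age": 1, "year": 2010}',
    '{"females": 2000746, "total": 4089295, "males": 2088549, "age": 2, "year": 2010}',
    '{"females": 2002756, "total": 4092221, "males": 2089465, "age": 3, "year": 2010}',
]


def parse_json_lines(lines):
    """Parse each non-empty line as its own JSON document.

    Raises ValueError naming the first line that is not valid JSON.
    """
    records = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"line {number} is not valid JSON: {err}") from err
    return records


records = parse_json_lines(sample_lines)
print(len(records), sorted(records[0]))
```

To check a real file, we could pass an open file object instead of the list, since iterating over a file yields one line at a time.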

In the following code cell, we:

* Import SQLContext from the pyspark.sql module
* [Instantiate the SQLContext object](https://spark.apache.org/docs/1.5.0/api/python/pyspark.sql.html#pyspark.sql.SQLContext), passing in the SparkContext object (sc) as a parameter, and assign it to the variable sqlCtx
* Call read.json() on the SQLContext object to read the JSON data set into a Spark DataFrame object named df
* Print df's data type to confirm that we successfully read it in as a Spark DataFrame

In [11]:
import pyspark

# Reuse the running SparkContext if there is one; otherwise start a new one
sc = pyspark.SparkContext.getOrCreate()

In [12]:
# Import SQLContext
from pyspark.sql import SQLContext

In [13]:
# Pass in the SparkContext object `sc`
sqlCtx = SQLContext(sc)

In [14]:
# Read JSON data into a DataFrame object `df`
df = sqlCtx.read.json("census_2010.json")

# Print the type
print(type(df))

<class 'pyspark.sql.dataframe.DataFrame'>
