<h1>Dataframe creation and basic data selection</h1>
<p>The idea of a dataframe has become pervasive among users of both R and the Python package pandas.  If you're not familiar with its usage, you can think of it as a dictionary that points to arrays, where the dictionary keys are the column names.  Let's start by creating one from Twitter data, which is stored as JSON-formatted data.</p>
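
<p>As a rough sketch of that mental model, the plain-Python snippet below (not Spark) stores a tiny table as a dictionary mapping column names to equal-length lists.  The column names, the values, and the helpers <code>get_row</code> and <code>n_rows</code> are made up for illustration.</p>

```python
# A toy "dataframe": column names map to equal-length lists of values.
# The columns and values here are illustrative, not real Twitter data.
table = {
    "lang": ["en", "und", "fr"],
    "name": ["alice", "bob", "carol"],
}


def get_row(table, i):
    """Assemble row i by taking the i-th entry of every column."""
    return {column: values[i] for column, values in table.items()}


def n_rows(table):
    """All columns share one length, which is the number of rows."""
    lengths = {len(values) for values in table.values()}
    assert len(lengths) == 1, "columns must have equal length"
    return lengths.pop()


print(table["lang"])      # selecting a column is a dictionary lookup
print(get_row(table, 1))  # a row cuts across all columns
print(n_rows(table))
```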

In [1]:
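from pyspark.sql import SparkSession

# Directory holding the JSON tweets (spark.read.json expects one JSON object per line by default)
path = "/Users/josephgartner/Desktop/data/"

# Start (or reuse) a local Spark session, then load the JSON files into a dataframe
spk = SparkSession.builder.master("local").getOrCreate()
df = spk.read.json(path)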

<h2>The Spark dataframe</h2>
<p>SQL and dataframes are covered in greater detail <a href="http://spark.apache.org/docs/latest/sql-programming-guide.html">in the Spark SQL programming guide</a>.  I'll briefly cover the usage here, using an example that is a bit more complicated than those in the docs.  The data they use is much better behaved than Twitter data, so a few extra tricks, such as handling nested JSON, are shown here.</p>

In [2]:
df.printSchema()

root
 |-- contributors: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- display_text_range: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |-- media: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- display_url: string (nullable = true)
 |    |    |    |-- expanded_url: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- id_str: string (nullable = true)
 |    |    |    |

<h2>Data handles</h2>
<p>Dataframes allow you to quickly grab parts of the data without writing a lambda function.  The schema above shows you how to reach each field.  As it happens, Twitter data has many nested JSON objects, which you can see in the schema above.  To access a nested field, simply join the field names with a period (.).</p>

In [3]:
df.select('lang', 'user.name').show(5)

+----+-------------------+
|lang|               name|
+----+-------------------+
|  en|      The Notorious|
| und|             Stefan|
|  en|Stuart Tomlin (5-5)|
|  en|       george coley|
|  en|              Nduka|
+----+-------------------+
only showing top 5 rows
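
<p>To make the period notation concrete, here is a minimal plain-Python sketch (not Spark's implementation) of how a dotted path such as <code>user.name</code> walks into a nested record.  The helper <code>get_path</code> and the sample record are hypothetical.</p>

```python
def get_path(record, dotted_path):
    """Follow a dotted path through nested dicts; return None if any step is missing."""
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical tweet-like record with a nested "user" object.
tweet = {"lang": "en", "user": {"name": "example_user", "id": 1}}

print(get_path(tweet, "lang"))          # top-level field
print(get_path(tweet, "user.name"))     # nested field
print(get_path(tweet, "user.missing"))  # absent field yields None
```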



<h2>Data slicing</h2>
<p>Dataframes also allow you to slice data quickly.  This is a really important technique in data science.  While it's much more interesting to spin up projects using deep learning, a great many questions can be answered by simple data slicing!</p>

In [4]:
n_tot = df.count()
print("There are {} tweets in this sample.".format(n_tot))

# isNotNull() is the null-safe way to keep rows that have a geo tag
n_w_geo = df.filter(df['geo'].isNotNull()).count()
print("{0:.2f}% of which have geo tags".format(100.0 * n_w_geo / n_tot))

# Exclude English ('en') and undetermined ('und') tweets
n_non_eng = df.filter(df['lang'] != 'en').filter(df['lang'] != 'und').count()
print("{0:.2f}% of which aren't English".format(100.0 * n_non_eng / n_tot))

There are 4868 tweets in this sample.
9.51% of which have geo tags
8.22% of which aren't English
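
<p>The same filter-then-count pattern can be sketched in plain Python on a handful of made-up records, which also shows how a fraction is turned into a percentage.  This only illustrates the logic; it is not how Spark executes the query.</p>

```python
# Made-up records standing in for tweets; "geo" is None when absent.
tweets = [
    {"lang": "en", "geo": None},
    {"lang": "fr", "geo": {"type": "Point"}},
    {"lang": "und", "geo": None},
    {"lang": "en", "geo": {"type": "Point"}},
]

n_tot = len(tweets)

# Keep only records whose "geo" field is present.
n_w_geo = sum(1 for t in tweets if t["geo"] is not None)

# Chain two conditions: neither English nor undetermined.
n_non_eng = sum(1 for t in tweets if t["lang"] != "en" and t["lang"] != "und")

# The "%" format type multiplies the fraction by 100 before printing.
print("{:.2%} have geo tags".format(n_w_geo / n_tot))
print("{:.2%} aren't English".format(n_non_eng / n_tot))
```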
