# WORK  IN PROGRESS


# Exploratory Data Analysis using Spark and Python  
Now that we have an idea of how to explore some data in Spark, the following content describes how to apply some of those principles to the __Exploratory Data Analysis__ methodology within Data Science. This document outlines some of the pitfalls and issues that one may encounter when trying to explore data in Spark.

__Note:__ The information within this document is based on the [Python Tutorials](https://www.codementor.io/python/tutorial) from __Code Mentor__. 


## Getting the Data  
### Getting the Data  
For this exercise, we will use the incidents derived from the [SFPD Crime Incident Reporting system](https://data.sfgov.org/Public-Safety/SFPD-Incidents-from-1-January-2003/tmnf-yvry).  

The data is formatted to show the following information:
- Incident Number
- Category of the Incident
- Day of the Week
- Date
- Time
- Police Department District
- Resolution
- Address
- X map coordinates
- Y map coordinates
- Map location
- Police Department ID

The data has been exported to `.csv` format and copied to HDFS using the following example procedure:
```
wget https://data..org/api/views/tmnf-yvry/rows.csv?accessType=DOWNLOAD -O SFPD_Incidents.csv
hdfs dfs -put SFPD_Incidents.csv /data/
hdfs dfs -ls /data/
```

### Importing the Data into Spark  
#### Manual Schema Preparation  
The first step is to isolate the headers of the data to be used as the field names, in order to understand what the fields are for and hence the field types. Since this is a manual process, we will manually bring the data into Spark using the `SQLContext`. We will not be using any of the conveniences of dataframes and `spark-csv` here; the reason is to highlight how much easier this is with dataframes later, compared to the manual (or traditional) steps.

In [None]:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
# Local file if using Spark.local
#data = sc.textFile("file:///vagrant/notebook.tmp/data/SFPD_Incidents.csv")
# If using HDFS on Spark.vsphere
data = sc.textFile("hdfs://master:54310/data/SFPD_Incidents.csv")
data.take(1)

The first thing we do, as the [documentation](https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema) suggests, is to isolate the headings. We will use these headings to build the schema.

In [None]:
header = data.first()
header

Next, we can construct the individual fields by splitting the header on the `,` delimiter. As a baseline, we force each field to be of type `string`.

In [None]:
fields = [StructField(field_name, StringType(), True) for field_name in header.split(',')]
fields
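
Splitting on `,` is sufficient for the header, but a plain `split(',')` miscounts the fields whenever a value itself contains a comma inside quotes. The following is a minimal sketch of a quote-aware split using Python's standard `csv` module. The helper name `split_csv_line` is hypothetical, and the sample header and row are made up for illustration rather than taken from the dataset.

```python
import csv

def split_csv_line(line):
    """Split one line of CSV text into fields, honouring double-quoted values."""
    return next(csv.reader([line]))

# Illustrative header and row (made-up values, not taken from the dataset)
sample_header = 'IncidntNum,Category,Location,PdId'
sample_row = '1,EXAMPLE,"(37.0, -122.0)",100'

print(split_csv_line(sample_header))  # the four field names
print(sample_row.split(','))          # naive split breaks the quoted value apart
print(split_csv_line(sample_row))     # quote-aware split keeps it as one field
```

Under that assumption, `split_csv_line(header)` could stand in for `header.split(',')` when building the fields.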

Now that we have individual fields, we can specify the exact type of data within each column based on the description from the original source. For example, according to the website, the `DayOfWeek` column is of __Plain Text__ type, but the `Date` column is of type __Date & Time__. So all we need to do is change the type of data in each of our fields to match the description from the website.  

Therefore, the only fields we need to change are:
- __Date__ from `StringType` to `DateType` or `LongType`
- __Time__ from `StringType` to `TimestampType` or leave it as a `string`
- __X__ from `StringType` to `FloatType` or `DoubleType`
- __Y__ from `StringType` to `FloatType` or `DoubleType`
- __PdId__ from `StringType` to `LongType`
- Potentially change __IncidntNum__ to `LongType`

In [None]:
# Set the necessary fields to the proper type based on my assumptions
#fields[4].dataType = DateType() #Date
#fields[5].dataType = TimestampType() #Time
#fields[9].dataType = FloatType() #X
#fields[10].dataType = FloatType() #Y
#fields[12].dataType = LongType() #PdId

# Set the fields based on what pandas' picked up
fields[0].dataType = LongType()
fields[9].dataType = DoubleType()
fields[10].dataType = DoubleType()
fields[12].dataType = LongType()
fields

As part of the data acquisition process, extracting the headers also provides an opportunity to clean them up. Although this is not necessary, we can change the headings to something that's more understandable. For example:

In [None]:
# Change `IncidntNum` to `Incident`
fields[0].name = "Incident"

# Change `DayOfWeek` to `Day`
fields[3].name = "Day"

# Change `Descript` to `Description`
fields[2].name = "Description"
fields
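
Positional indexes such as `fields[3]` rename the wrong column if the column order ever changes. A possible alternative is to rename by name. The sketch below uses plain Python lists; `RENAMES` and `rename_columns` are hypothetical names, and the column list is shortened for illustration.

```python
# Hypothetical mapping from the original header names to friendlier ones
RENAMES = {
    "IncidntNum": "Incident",
    "DayOfWeek": "Day",
    "Descript": "Description",
}

def rename_columns(names, renames):
    """Return a new list of names; names without a mapping are kept unchanged."""
    return [renames.get(name, name) for name in names]

columns = ["IncidntNum", "Category", "Descript", "DayOfWeek"]
print(rename_columns(columns, RENAMES))
```

The same mapping could be applied to the schema fields by looping over `fields` and assigning `field.name = RENAMES.get(field.name, field.name)`.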

Now that the data types have been changed, we can use the fields to construct the schema. This will be used later as we construct the dataframe. 

In [None]:
# Create the schema
schema = StructType(fields)

Before creating the dataframe manually, a good practice is to strip out the header row, so that it does not conflict with the actual data, using Spark's `subtract()` method.

In [None]:
dataHeader = data.filter(lambda x: "PdId" in x)
dataHeader.collect()

# Remove the header data collected
dataNoHeader = data.subtract(dataHeader)
dataNoHeader.first()
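
`subtract()` works, but it compares two whole datasets and therefore shuffles data across the cluster. A lighter alternative is to filter out the one line that equals the header captured earlier. The sketch below shows the idea in plain Python, with a small list standing in for the RDD of text lines; `make_header_filter` is a hypothetical helper and the sample lines are made up.

```python
def make_header_filter(header_line):
    """Build a predicate that is True for every line except the header."""
    def is_not_header(line):
        return line != header_line
    return is_not_header

# A small list stands in for the RDD of text lines (made-up values)
lines = ["IncidntNum,Category,PdId", "1,EXAMPLE,100", "2,EXAMPLE,200"]
header_line = lines[0]

rows = list(filter(make_header_filter(header_line), lines))
print(rows)
```

In the notebook this would amount to something like `data.filter(lambda line: line != header)`. Comparing against the full header line also avoids discarding a data row that merely happens to contain the text `PdId`.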

Now that the first row starts with the actual data we can use the raw data and the schema to create a dataframe.

In [None]:
df = sqlContext.createDataFrame(dataNoHeader, schema)
df.head()

As can be seen from the above result, one needs a very definite understanding of the type of data being dealt with. Keeping in mind that we are working with __Big Data__, we will see that not all of the raw data in the rows conforms to the specific schema we have created. So another option is to leverage `spark-csv`.
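
When a schema is supplied, `createDataFrame` expects each record to be a row of values whose types match that schema, whereas the RDD above still holds raw text lines. The following sketch shows one way to turn a line into a typed tuple and to return `None` for lines that do not fit. `CASTS`, `EXPECTED_FIELDS` and `parse_line` are hypothetical names, the column positions follow the indexes used earlier (0, 9, 10 and 12), and the sample rows are made up.

```python
import csv

# Hypothetical converters keyed by column position (0=IncidntNum, 9=X, 10=Y, 12=PdId)
CASTS = {0: int, 9: float, 10: float, 12: int}
EXPECTED_FIELDS = 13

def parse_line(line):
    """Turn one CSV line into a typed tuple, or None if it does not fit the schema."""
    values = next(csv.reader([line]))
    if len(values) != EXPECTED_FIELDS:
        return None
    try:
        # Columns without a converter are kept as strings
        return tuple(CASTS.get(i, str)(v) for i, v in enumerate(values))
    except ValueError:
        return None

# Made-up rows for illustration only
good = '1,EXAMPLE,DESC,Monday,01/01/2015,12:00,DISTRICT,NONE,ADDRESS,-122.0,37.0,"(37.0, -122.0)",100'
bad = '1,EXAMPLE,DESC'

print(parse_line(good))
print(parse_line(bad))
```

Assuming the helper is available on the workers, the RDD could be prepared with `dataNoHeader.map(parse_line).filter(lambda row: row is not None)` before it is passed to `createDataFrame`.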

#### Using Spark-csv  
This procedure uses `spark-csv` from [__Databricks__](http://spark-packages.org/package/databricks/spark-csv) to get the data into Spark. This package allows us to import `.csv` data into a Spark DataFrame, using the example below:

In [None]:
# Local downloaded file if using Spark.local
#data = "file:///vagrant/notebook.tmp/data/SFPD_Incidents.csv"

# HDFS location of the downloaded file if using Spark.vsphere
data = "hdfs://master:54310/data/SFPD_Incidents.csv"
    
# Use sqlContext to read and load the file, capturing the header and schema
df = sqlContext.read.load(data,
                          format="com.databricks.spark.csv",
                          header="true",
                          infereSchema="true")

# Take the first row
df.head()

There are a few important things to note from the output above. __Firstly__, the raw formatting may not be helpful in describing the data. Therefore, another option to display this is shown below: 

In [None]:
# Show the first row
df.show(1)

The `show()` function attempts to display the data in a better format, but may not be the best display output if the number of columns exceeds the width of the Notebook. __Secondly__, the call above asks `spark-csv` to infer the Schema of the data, so it is worth checking which types were actually assigned, as is seen from the output below.

In [None]:
# Show the Schema
df.printSchema()
df.dtypes

As can be seen, every column in the resulting Schema is set to `string`. Note that `spark-csv` reads this option under the name `inferSchema`; the `infereSchema` spelling used in the call above is not recognized, so no inference takes place. __Thirdly__, calling the `.csv` file from the local filesystem seems to produce errors stating that the file cannot be found. I'm assuming that this is because the file needs to be on all nodes of the Spark Cluster and not just the Master node. To circumvent this issue, the data file should be copied onto HDFS, or some other shared filesystem, to ensure that all nodes can access the data.

__Side Note:__ Once the data has been captured as a Spark DataFrame, it can be converted to a __Pandas__ dataframe by making use of the `toPandas()` function on the Spark DataFrame, as shown below. Pandas dataframes differ from Spark dataframes in a number of ways. For more information on this, see [6 differences between Pandas and Spark DataFrames](https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2#.x2a9hwn4z).  

In [None]:
import pandas as pd
pddf = df.toPandas()
pddf.head()

Pandas dataframes are displayed in a better format, but we still need to see how Pandas interprets the schema.

In [None]:
pddf.dtypes

Unfortunately, after converting to a Pandas dataframe, the type of each column is reported as `object`. So once again we still don't have a clear idea of the actual schema, and we will have to prepare it manually. Fortunately, Spark `1.5` provides a number of better ways to work with data types within newly created dataframes (instead of the raw `RDD`), without having to manually build and test the schema. As can be seen below, we will make use of the [pyspark.sql.DataFrame](https://spark.apache.org/docs/1.5.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame) class methods [withColumn](https://spark.apache.org/docs/1.5.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumn) and [withColumnRenamed](https://spark.apache.org/docs/1.5.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumnRenamed) to change the data type and even rename the columns from within the dataframe directly, by referencing the actual column name.

In [None]:
# Re-create the dataframe using `spark-csv`
data = "hdfs://master:54310/data/SFPD_Incidents.csv"
df = sqlContext.read.load(data,
                          format="com.databricks.spark.csv",
                          header="true",
                          infereSchema="true")
df.dtypes

In [None]:
# Ensure to import pyspark DataTypes
from pyspark.sql import SQLContext
from pyspark.sql.types import *

df = (df.withColumn("IncidntNum", df.IncidntNum.cast(LongType()))
#     .withColumn("Date", df.Date.cast(DateType()))
#     .withColumn("Time", df.Time.cast(TimestampType()))
     .withColumn("X", df.X.cast(DoubleType()))
     .withColumn("Y", df.Y.cast(DoubleType()))
     .withColumn("PdId", df.PdId.cast(LongType()))
     .withColumnRenamed("IncidntNum", "Incident")
     .withColumnRenamed("DayOfWeek", "Day")
     .withColumnRenamed("Descript", "Description")
     .withColumnRenamed("PdDistrict", "District")
     )

# Confirm the new changes
df.dtypes

In [None]:
df.head()
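
The `Date` and `Time` casts are commented out above. One possible way to handle these two columns is to parse their text into a single Python `datetime` value. The sketch below assumes the date text looks like `MM/DD/YYYY` and the time text like `HH:MM` on a 24-hour clock; these formats are assumptions that should be checked against the actual export. `parse_incident_datetime` is a hypothetical helper.

```python
from datetime import datetime

# Assumed text formats; adjust these if the export uses a different layout
DATE_FORMAT = "%m/%d/%Y"
TIME_FORMAT = "%H:%M"

def parse_incident_datetime(date_text, time_text):
    """Combine separate date and time strings into one datetime, or None if unparseable."""
    try:
        return datetime.strptime(date_text + " " + time_text,
                                 DATE_FORMAT + " " + TIME_FORMAT)
    except ValueError:
        return None

print(parse_incident_datetime("01/19/2015", "14:00"))
print(parse_incident_datetime("not a date", "14:00"))
```

Returning `None` for unparseable values keeps a single malformed row from failing the whole job, at the cost of having to count and inspect the missing values afterwards.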

### Cleaning up the data

*TODO: Now that the import is working, show `describe()` to display the data in a neat format.*

In [None]:
df.show(5)

In [None]:
df = df.select([x for x in df.columns if x not in {"X", "Y", "Location", "PdId"}])


#from functools import reduce
#from pyspark.sql import DataFrame
#df = reduce(DataFrame.drop, ["X", "Y", "Location", "PdId"], df)
df.show(5)

# Appendix A: Using Pandas and JSON
Pandas also provides a method of reading `.csv` files, which can then be used as a Spark DataFrame. For an example on how to work with a `.csv` file in Pandas, see [Chris Albon's](http://chrisalbon.com/python/pandas_dataframe_importing_csv.html) post.

In [None]:
import pandas as pd
pd_csv = pd.read_csv("data/SFPD_Incidents.csv")
pd_df = sqlContext.createDataFrame(pd_csv)
pd_df.take(1)

- Doesn't have to be on HDFS
- (Confirm) use of broadcast variables for the above statement

In [None]:
#pd_df.printSchema()
pd_df.dtypes

In [None]:
pd_df.show(1)
pd_csv.head(1)

In [None]:
pd_csv.dtypes

Furthermore, Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using `SQLContext.read.json` on a JSON file.

Note that the file that is offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

In [None]:
df = sqlContext.read.load("hdfs://master:54310/data/incidents.json", format='json')

In [None]:
df.printSchema()
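
Because each line must be a self-contained JSON object, a regular JSON array may need to be rewritten before it is loaded. The following is a minimal sketch of that conversion using Python's standard `json` module; `to_json_lines` is a hypothetical helper and the sample records are made up for illustration.

```python
import json

def to_json_lines(json_array_text):
    """Convert the text of a JSON array into one compact JSON object per line."""
    records = json.loads(json_array_text)
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)

# A regular multi-line JSON array (made-up records)
multi_line = """
[
  {"Incident": 1, "Category": "EXAMPLE"},
  {"Incident": 2, "Category": "EXAMPLE"}
]
"""

print(to_json_lines(multi_line))
```

This sketch loads the whole array into memory, so it only suits files that fit comfortably on a single machine.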

In [None]:
input_csv = "hdfs://master:54310/data/incidents.csv"
df = sqlContext.read.load(input_csv, format='com.databricks.spark.csv', header='true', infereSchema='true')
#df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("hdfs://master:54310/data/incidents.csv")
#df.printSchema()
df.take(5)
