# WORK  IN PROGRESS


# Exploratory Data Analysis using Spark and Python  
Now that we have an idea of how to explore some data in Spark, the following content describes how to apply some of those principles to the __Exploratory Data Analysis__ methodology within Data Science. 

__Note:__ The information within this document is based on the [Python Tutorials](https://www.codementor.io/python/tutorial) from __Code Mentor__. 


## Getting the Data  
For this exercise, we will use the incidents derived from the [SFPD Crime Incident Reporting system](https://data.sfgov.org/Public-Safety/SFPD-Incidents-from-1-January-2003/tmnf-yvry), showing data from __1/1/2003__ up until two weeks before the current date (__3/25/2016__).  

The data is formatted to show the following information:
- Incident Number
- Category of the Incident
- Description of the Incident
- Day of the Week
- Date
- Time
- Police Department District
- Resolution
- Address
- X map coordinates
- Y map coordinates
- Map location
- Police Department ID

The data has been exported to `.csv` format and copied to HDFS using the following procedure:

In [None]:
# Export the Data to .csv format and copy

#!wget https://data.sfgov.org/api/views/tmnf-yvry/rows.json?accessType=DOWNLOAD -O incidents.json
#!wget https://data..org/api/views/tmnf-yvry/rows.csv?accessType=DOWNLOAD -O incidents.csv
#!hdfs dfs -put incidents.csv /data/
#!hdfs dfs -ls /data/

### Importing the Data into Spark  
#### Using Spark-csv  
The first procedure we will use to get the data into Spark is `spark-csv` from [__Databricks__](http://spark-packages.org/package/databricks/spark-csv). This package allows us to import `.csv` data into a Spark DataFrame, using the example below:

In [None]:
# HDFS location of the downloaded file
input_csv = "hdfs://master:54310/data/incidents.csv"

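# Use sqlContext to read and load the file, capturing the header and schema.
# The option name must be spelled inferSchema; an unrecognized key such as
# infereSchema is ignored, which leaves every column typed as string.
df = sqlContext.read.load(input_csv,
                          format="com.databricks.spark.csv",
                          header="true",
                          inferSchema="true")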

# Take the first row
df.take(1)

There are a few important things to note from the output above. __Firstly__, the raw formatting may not be helpful in describing the data. Therefore, another option to display this is shown below: 

In [None]:
# Show the first row
df.show(1)

The `show()` function formats the output more readably, but it may still not be the best display if the number of columns exceeds the width of the Notebook. __Secondly__, although the intent is to set `inferSchema` to `true`, `spark-csv` did not capture the schema of the data when this was first run, as is seen from the output below.

In [None]:
# Show the Schema
df.printSchema()
df.dtypes

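As can be seen from that run, no schema was inferred: every column is typed as __string__. The likely cause is the spelling of the option name, since `spark-csv` only recognizes `inferSchema` and ignores a misspelled key such as `infereSchema`. Either way, the column types need to be verified before further exploring the data.  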
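To make concrete what schema inference is meant to do, here is a minimal plain-Python sketch. It is not the `spark-csv` implementation: `infer_type` is a hypothetical helper, the type names simply mirror the ones Spark reports, and the sample uses only a subset of the columns with values from the first incident in the dataset.

```python
import csv
import io


def infer_type(values):
    """Return 'bigint', 'double' or 'string' for a column of raw strings."""
    for type_name, caster in (("bigint", int), ("double", float)):
        try:
            for value in values:
                caster(value)  # raises ValueError if the value does not fit
            return type_name
        except ValueError:
            continue
    return "string"


# Header plus one data row (subset of columns, for illustration only).
sample = io.StringIO(
    "IncidntNum,Category,X,Y,PdId\n"
    "160203619,ASSAULT,-122.414003178329,37.8079694729269,16020361904154\n"
)

rows = list(csv.reader(sample))
header, data = rows[0], rows[1:]

# Transpose the rows so that each tuple holds the values of one column.
columns = list(zip(*data))
inferred = [(name, infer_type(column)) for name, column in zip(header, columns)]
print(inferred)
```

When inference is switched off, the equivalent of `infer_type` is never called and every column simply stays a string, which matches the symptom described above.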

__Thirdly__, calling the `.csv` file from the local filesystem seems to produce errors stating that the file cannot be found. I'm assuming that this is because the file needs to be available at the same path on all nodes of the Spark cluster and not just the Master node. To circumvent this issue, the data file has been copied onto HDFS - as shown at the outset - to ensure that all nodes can access the data.

As a side note, once the data has been captured as a Spark DataFrame, it can be converted to a __Pandas__ DataFrame by calling the `toPandas()` function on the Spark DataFrame, as shown below. Note that `toPandas()` collects the entire dataset onto the driver, so it is best suited to data that fits in the driver's memory. 
```
# Example to create Pandas dataframe
df.toPandas().head(1)
```
Pandas DataFrames differ from Spark DataFrames in a number of ways. For more information on this, see [6 differences between Pandas and Spark DataFrames](https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2#.x2a9hwn4z).

#### Using Pandas  
Pandas also provides a method of reading `.csv` files into a Pandas DataFrame, which can then be converted into a Spark DataFrame. 

In [3]:
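import pandas as pd
# Read the local copy of the file with Pandas, which infers the numeric column types
pd_csv = pd.read_csv("incidents.csv")
# Convert the Pandas DataFrame into a Spark DataFrame
pd_df = sqlContext.createDataFrame(pd_csv)
# Take the first row
pd_df.take(1)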

[Row(IncidntNum=160203619, Category=u'ASSAULT', Descript=u'BATTERY OF A POLICE OFFICER', DayOfWeek=u'Wednesday', Date=u'03/09/2016', Time=u'23:36', PdDistrict=u'CENTRAL', Resolution=u'ARREST, BOOKED', Address=u'2600 Block of MASON ST', X=-122.414003178329, Y=37.8079694729269, Location=u'(37.8079694729269, -122.414003178329)', PdId=16020361904154)]

In [9]:
pd_df.show(1)
pd_csv.head(1)

+----------+--------+--------------------+---------+----------+-----+----------+--------------+--------------------+-----------------+----------------+--------------------+--------------+
|IncidntNum|Category|            Descript|DayOfWeek|      Date| Time|PdDistrict|    Resolution|             Address|                X|               Y|            Location|          PdId|
+----------+--------+--------------------+---------+----------+-----+----------+--------------+--------------------+-----------------+----------------+--------------------+--------------+
| 160203619| ASSAULT|BATTERY OF A POLI...|Wednesday|03/09/2016|23:36|   CENTRAL|ARREST, BOOKED|2600 Block of MAS...|-122.414003178329|37.8079694729269|(37.8079694729269...|16020361904154|
+----------+--------+--------------------+---------+----------+-----+----------+--------------+--------------------+-----------------+----------------+--------------------+--------------+
only showing top 1 row



Unnamed: 0,IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Address,X,Y,Location,PdId
0,160203619,ASSAULT,BATTERY OF A POLICE OFFICER,Wednesday,03/09/2016,23:36,CENTRAL,"ARREST, BOOKED",2600 Block of MASON ST,-122.414003,37.807969,"(37.8079694729269, -122.414003178329)",16020361904154


In [14]:
#pd_df.printSchema()
pd_df.dtypes

[('IncidntNum', 'bigint'),
 ('Category', 'string'),
 ('Descript', 'string'),
 ('DayOfWeek', 'string'),
 ('Date', 'string'),
 ('Time', 'string'),
 ('PdDistrict', 'string'),
 ('Resolution', 'string'),
 ('Address', 'string'),
 ('X', 'double'),
 ('Y', 'double'),
 ('Location', 'string'),
 ('PdId', 'bigint')]

In [13]:
pd_csv.dtypes

IncidntNum      int64
Category       object
Descript       object
DayOfWeek      object
Date           object
Time           object
PdDistrict     object
Resolution     object
Address        object
X             float64
Y             float64
Location       object
PdId            int64
dtype: object

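Reading the file through Pandas preserves the numeric columns (`IncidntNum`, `X`, `Y` and `PdId`), but the remaining columns, including `Date` and `Time`, are still of type __`string`__.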
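The string columns that carry structured values can be converted by hand. The following is a minimal plain-Python sketch using the first row shown above; `parse_timestamp` and `parse_location` are hypothetical helper names, and the `MM/DD/YYYY` date layout is an assumption, supported by the fact that 9 March 2016 was a Wednesday, which agrees with the `DayOfWeek` value of that row.

```python
from datetime import datetime

# Values copied from the first row of the output above.
row = {
    "Date": "03/09/2016",
    "Time": "23:36",
    "Location": "(37.8079694729269, -122.414003178329)",
}


def parse_timestamp(date_str, time_str):
    """Combine Date and Time into one datetime (assumes MM/DD/YYYY and 24-hour HH:MM)."""
    return datetime.strptime(date_str + " " + time_str, "%m/%d/%Y %H:%M")


def parse_location(location_str):
    """Split a '(lat, lon)' string into two floats."""
    lat, lon = location_str.strip("()").split(",")
    return float(lat), float(lon)


timestamp = parse_timestamp(row["Date"], row["Time"])
latitude, longitude = parse_location(row["Location"])

print(timestamp.isoformat(), timestamp.strftime("%A"))
print(latitude, longitude)
```

The same functions could be applied per row inside Spark, for example through a `map` over an RDD, once the lines have been split into fields.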

#### Manual Schema Creation

In [None]:
from pyspark.sql.types import *
incidentsFile = sc.textFile("hdfs://master:54310/data/incidents.csv")
incidentsFile.take(1)

The first line of the file holds the column names, so it is isolated as the header:

In [None]:
header = incidentsFile.first()
header

In [None]:
fields = [StructField(field_name, StringType(), True) for field_name in header.split(',')]
fields
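Splitting the header on commas is safe because the column names contain no commas, but the data rows do: values such as `"ARREST, BOOKED"` are quoted and contain a comma of their own. The sketch below, in plain Python, shows the difference between a naive split and the standard `csv` module. The raw line is an assumption, reconstructed from the first row displayed earlier rather than copied from the file.

```python
import csv

header = ("IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,"
          "Resolution,Address,X,Y,Location,PdId")

# Assumed raw form of the first data row (reconstructed, not read from HDFS).
line = ('160203619,ASSAULT,BATTERY OF A POLICE OFFICER,Wednesday,03/09/2016,23:36,'
        'CENTRAL,"ARREST, BOOKED",2600 Block of MASON ST,-122.414003178329,'
        '37.8079694729269,"(37.8079694729269, -122.414003178329)",16020361904154')

field_names = header.split(",")

# A naive split also breaks on the commas inside the quoted values.
naive = line.split(",")

# csv.reader honours the quotes and returns one value per column.
parsed = next(csv.reader([line]))
record = dict(zip(field_names, parsed))

print(len(field_names), len(naive), len(parsed))
print(record["Resolution"])
```

If the rows of `incidentsFile` are later split into fields to match the schema, a quote-aware parser along these lines avoids misaligned columns.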

### Importing the Data into Spark (from JSON)

## Exploring the Data  

In [None]:
import urllib
import numpy as np
import pandas as pd
#url = 'https://www.quandl.com/api/v3/datasets/BLSE/CES9000000010.csv'
#file = './data/SFPD_Incidents.csv'
#f = urllib.urlretrieve(url, file)
#df = pd.read_csv(file, index_col = 0, thousands  = ',').T
#df.head(20)



#from pyspark import SparkContext
#from pyspark.sql import SQLContext
#import pandas as pd

#pandas_df = pd.read_csv('hdfs://localhost/data/data.csv')  # assuming the file contains a header
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2']) # if no header
#s_df = sql_sc.createDataFrame(pandas_df)

file = sc.textFile('hdfs://master:54310/data/incidents.csv')
file.take(5)
#file.count()

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using `sqlContext.read.json` on a JSON file.

Note that the file Spark expects here is not a typical JSON file: each line must contain a separate, self-contained, valid JSON object. As a consequence, a regular multi-line JSON file will most often fail to load.
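The difference between the two layouts can be sketched in plain Python. The records below are illustrative placeholders, not values from the dataset, and the layout of the downloaded `incidents.json` is not shown in this document, so it may need a different extraction step before a conversion like this applies.

```python
import json

# A hypothetical multi-line JSON document: one array holding all records.
multi_line = """
[
  {"id": 1, "category": "A"},
  {"id": 2, "category": "B"}
]
"""

records = json.loads(multi_line)

# Line-delimited form: one self-contained JSON object per line.
json_lines = "\n".join(json.dumps(record, sort_keys=True) for record in records)
print(json_lines)

# Every line can now be parsed on its own, which is what a line-based reader needs.
round_trip = [json.loads(line) for line in json_lines.splitlines()]
```

In the original layout, the first non-empty line is just `[`, which is not a valid JSON document by itself; this is why a line-by-line reader fails on it.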

In [None]:
df = sqlContext.read.load("hdfs://master:54310/data/incidents.json", format='json')

In [None]:
df.printSchema()

In [None]:
input_csv = "hdfs://master:54310/data/incidents.csv"
df = sqlContext.read.load(input_csv, format='com.databricks.spark.csv', header='true', inferSchema='true')
#df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("hdfs://master:54310/data/incidents.csv")
#df.printSchema()
df.take(5)
