# Conducting EDA in Spark

Exploratory data analysis (EDA) is a critical part of the data science process. In this example I will show you how to:

1. Read in the data
2. Transform it to a functional state
3. Conduct EDA (frequent items, grouping, crosstabs, summary statistics)
4. Create a histogram

In [None]:
import getspark
from IPython.display import Image
from pyspark import SparkContext 
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
import matplotlib.pyplot as plt
import pandas as pd

%pylab inline

In [None]:
sc = SparkContext()
sqlContext = SQLContext(sc)

In [None]:
rdd = sc.textFile(r"C:\Spark\clickinfo.csv")
# Split each line on its delimiter (a lazy transformation)
rdd = rdd.map(lambda line: line.split(","))
# Strip out the header
header = rdd.first()  # action: fetch the first row, which is the header
data = rdd.filter(lambda x: x != header)  # keep every row except the header
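
Splitting on a bare comma breaks if any field contains a quoted comma. A minimal sketch, assuming the file might contain such fields, is to parse each line with Python's standard `csv` module. `parse_csv_line` is a hypothetical helper name, not part of the original notebook, and the sample line is made up.

```python
import csv

def parse_csv_line(line):
    """Parse one CSV line into a list of fields, honoring quoted commas."""
    # csv.reader accepts any iterable of strings, so wrap the single line in a list
    return next(csv.reader([line]))

# Example with a quoted field that contains a comma
print(parse_csv_line('u1,"3,000",5,1'))
```

In the notebook, this helper could stand in for the lambda, as in `rdd.map(parse_csv_line)`.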

In [None]:
# Function to label a row from its sign-in flag ('0' means not signed in)
def signedin(flag):
    if flag == '0':
        return "Not_Signed_In"
    else:
        return "Signed_In"

In [None]:
clicksmappd = data.map(lambda line: Row(user_id = str(line[0]), 
                              clicks = int(line[1]), 
                              impression=int(line[2]), 
                              signedin=signedin(line[3]))).toDF()

#### EDA One:  Looking up the frequency of events

It is possible to approximate how frequently something occurs.  This can be done using the **.freqItems()** method, which takes a list of columns and a minimum support (the proportion of rows in which an item must appear).  Note that the algorithm is an approximation and may produce some false positives.  Here we check whether people are more often signed in or signed out.

In [None]:
freqcount = clicksmappd.freqItems(['signedin'], 0.7).collect() #0.7 is the frequency proportion (minimum proportion of rows)

In [None]:
freqcount[0]
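
Because `freqItems()` is approximate, it can help to cross-check it against exact proportions on a small sample. The sketch below does this in plain Python with `collections.Counter`. The function name `exact_frequent_items` and the sample values are illustrative assumptions, not taken from the dataset.

```python
from collections import Counter

def exact_frequent_items(values, support):
    """Return the items whose share of `values` is at least `support`."""
    total = len(values)
    counts = Counter(values)
    return sorted(item for item, n in counts.items() if n / total >= support)

# Hypothetical sample: 8 of 10 rows are signed in
sample = ["Signed_In"] * 8 + ["Not_Signed_In"] * 2
print(exact_frequent_items(sample, 0.7))
```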

#### EDA Two: Grouping and summarizing

Spark dataframes allow you to group and summarize data much as you would with pandas data frames.  Another useful function is the crosstab, which builds a contingency table: each distinct value of the first column becomes a row, each distinct value of the second column becomes a column, and each cell counts how often that pair of values occurs together.

In [None]:
clicksmappd.groupby(['clicks', 'signedin']).count().show()

In [None]:
clicksmappd.crosstab('clicks', 'signedin').show()
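
To make the crosstab output easier to read, here is a small pure-Python sketch of the same idea: count each (row value, column value) pair and lay the counts out as a nested table. The helper name `crosstab_counts` and the sample pairs are assumptions for illustration only.

```python
from collections import Counter

def crosstab_counts(pairs):
    """Build {row_value: {col_value: count}} from (row_value, col_value) pairs."""
    pair_counts = Counter(pairs)
    rows = sorted({r for r, _ in pair_counts})
    cols = sorted({c for _, c in pair_counts})
    # Fill in zero for pairs that never occur, as a contingency table does
    return {r: {c: pair_counts.get((r, c), 0) for c in cols} for r in rows}

# Hypothetical (clicks, signedin) pairs
pairs = [(0, "Signed_In"), (0, "Signed_In"), (0, "Not_Signed_In"), (1, "Signed_In")]
print(crosstab_counts(pairs))
```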

#### EDA Three: The five number summary

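Spark can create summary statistics from dataframes as well.  This is accomplished using the **.describe()** method, which reports the count, mean, standard deviation, minimum, and maximum of each requested column (it does not report the median or quartiles).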

In [None]:
clicksmappd.describe('clicks','impression').show()
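
To see what a full five-number summary (minimum, quartiles, maximum) looks like next to the `describe()` output, the block below computes one in plain Python with the standard `statistics` module (Python 3.8+). `five_number_summary` is a hypothetical helper and the sample values are made up. For data that stays in Spark, `DataFrame.approxQuantile` is the method to look at.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a sequence with at least two values."""
    # method="inclusive" treats the data as the whole population of interest
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

print(five_number_summary([1, 2, 3, 4, 5]))
```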

In [None]:
# Function to make a histogram; kind(s) include 'bar', 'box', and 'density'
def spark_histogram(df, column):
    counts = df.groupby(column).count()  # count rows per distinct value in Spark
    pdf = counts.toPandas()  # collect the aggregated counts into pandas
    pdf[column] = pdf[column].astype(float)  # use the requested column, not a hard-coded one
    return pdf.sort_values(column).set_index(column).iloc[:50, :].plot(kind='bar', figsize=(14, 5))

In [None]:
spark_histogram(clicksmappd, 'impression')

In [None]:
# Massive outliers would skew the histogram buckets, so keep only impressions below 12
no_out_df = clicksmappd.filter(clicksmappd['impression'] < 12)

In [None]:
spark_histogram(no_out_df, 'impression')
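
The `spark_histogram` helper counts rows per distinct value, which works when the column has few distinct values. A hedged sketch of the alternative, fixed-width binning, is shown below in plain Python. `bin_counts`, the bin width, and the sample values are assumptions for illustration.

```python
from collections import Counter

def bin_counts(values, width):
    """Count values per fixed-width bin, keyed by each bin's lower edge."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical impression counts, including one large outlier
print(bin_counts([1, 2, 3, 5, 6, 11, 40], width=5))
```

With binning, an outlier lands in its own sparse bin instead of adding one bar per distinct value.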