# Explore House Sales Data

Let's explore the house sales data.

In [None]:
import numpy as np
import pandas as pd
import time

print('Spark UI running on http://YOURIPADDRESS:' + sc.uiWebUrl.split(':')[2])

## Step 1: Read and Display data

We will also start profiling our code.

#### 1.1 - Using perf_counter() for timing -- works in both plain Python scripts and Jupyter

In [None]:
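# Time the read; inferSchema=True makes Spark scan the file to work out column types
t1 = time.perf_counter()
house_prices = spark.read.csv("/data/house-prices/house-sales-simplified.csv",
                              header=True, inferSchema=True)
t2 = time.perf_counter()
# count() runs as a separate job after the timer has stopped, so it is not included
print("read {:,} records in {:,.2f} ms".format(house_prices.count(), (t2-t1)*1000))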
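If you time several steps, the start/stop bookkeeping can be wrapped in a small helper. The sketch below is one possible way to do that with the standard library; `timed` is a hypothetical helper name, not part of Spark or this lab, and the workload inside the `with` block is a stand-in for a Spark call such as `spark.read.csv(...)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the enclosed block took, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        print("{} took {:,.2f} ms".format(label, elapsed_ms))


# Stand-in workload; in the notebook, put the Spark call inside the block instead.
with timed("sum of squares"):
    total = sum(i * i for i in range(100_000))
```

Spark transformations are lazy, so if you want the actual work measured, put an action such as `count()` or `show()` inside the timed block.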

#### 1.2 - Using the %%time magic function -- only works in a Jupyter notebook

In [None]:
%%time
house_prices = spark.read.csv("/data/house-prices/house-sales-simplified.csv",
                              header=True, inferSchema=True)

In [None]:
print ("number of records read ", house_prices.count())

## Step 2: See schema and data

In [None]:
house_prices.printSchema()
house_prices.show(10)

## Step 3: 'Summary' of data

In [None]:
house_prices.describe().show()

## This output is hard to read.

In [None]:
## convert 'describe' output to Pandas for better display
house_prices.describe().toPandas()

In [None]:
# you can also display vertically
house_prices.describe().toPandas().T

## Step 4: Get an idea of one or more attribute(s)

In [None]:
house_prices.describe("SalePrice").show()


In [None]:
house_prices.describe(["SalePrice", 'Bedrooms']).show()

## Step 5: Report on Bedrooms vs Sales
Let's calculate the number of sales for each bedroom count.

In [None]:
## Hint : 'Bedrooms'
bedroom_sales = house_prices.groupBy("???").count()
bedroom_sales.show()

In [None]:
## order by bedrooms
bedroom_sales.orderBy("Bedrooms").show()

In [None]:
## order by count top to bottom
bedroom_sales.orderBy("???", ascending=False).show()

## Step 6 : Remove Outliers
The data contains some outliers.  
For example, there are houses with a very large number of bedrooms (9, 33!).  
Let's remove those.


In [None]:
## TODO : keep only houses with 5 or fewer bedrooms
print("raw data record count ", house_prices.count())
## Hint : 5
x = house_prices.filter("Bedrooms <= ???")
print("record count with 5 or fewer bedrooms ", x.count())
x.show()

In [None]:
# do a summary on cleaned up data
x.groupBy('Bedrooms').count().orderBy('Bedrooms').show()

## Step 7: Calculate some percentiles

In [None]:
percentiles = (0.25, 0.5, 0.75, 0.9, 0.95)
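# the last argument is the relative error; 0.0 asks for exact quantiles (can be slow on large data)
prices = house_prices.stat.approxQuantile("SalePrice", percentiles, 0.0)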

print(percentiles)
print(prices)

# get a Pandas dataframe for pretty print
percentile_pricing_df = pd.DataFrame({"percentile": percentiles, "price": prices} )
percentile_pricing_df
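To make the numbers above easier to interpret, here is a tiny plain-Python sketch of what a percentile means, using made-up prices and the nearest-rank definition. This is one common definition chosen for illustration; Spark may pick values slightly differently, so treat it as intuition rather than a reimplementation of `approxQuantile`. The names `nearest_rank` and `toy_prices` are hypothetical.

```python
import math


def nearest_rank(sorted_values, p):
    """Smallest value such that at least a fraction p of the data is <= it."""
    rank = max(1, math.ceil(p * len(sorted_values)))
    return sorted_values[rank - 1]


# Made-up sale prices, not taken from the dataset
toy_prices = sorted([250_000, 300_000, 320_000, 350_000, 400_000,
                     450_000, 500_000, 650_000, 900_000, 2_000_000])

toy_percentiles = (0.25, 0.5, 0.75, 0.9, 0.95)
toy_results = {p: nearest_rank(toy_prices, p) for p in toy_percentiles}

for p, price in toy_results.items():
    print("about {:.0%} of the toy houses sold for {:,} or less".format(p, price))
```

Note how the single 2,000,000 sale only shows up at the very top percentile: percentiles are much less sensitive to a few extreme prices than the mean is.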

## Step 8: Explore Stat object in dataframe

[API for pyspark.sql.DataFrameStatFunctions](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrameStatFunctions)


In [None]:
# hit the tab key after the dot(.)
house_prices.stat.

## Step 9: Covariance & Correlation

Q1 : Calculate the covariance between "SalePrice" and "Bedrooms"

Q2 : Which attribute influences the sale price more?  
- Number of bedrooms ("Bedrooms")
- or size of the home ("SqFtTotLiving")

Hint : calculate the correlation

**Q=> Can you explain the result?**

In [None]:
print (house_prices.stat.cov("SalePrice", "Bedrooms"))

print(house_prices.stat.corr("SalePrice", "Bedrooms"))
print(house_prices.stat.corr("SalePrice", "SqFtTotLiving"))
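One reason the hint points to correlation rather than covariance: covariance carries the units of both columns, so its size changes when a column is rescaled, while the Pearson correlation always lies between -1 and 1. The sketch below shows this on made-up numbers in plain Python; the values and the helper names `sample_cov` and `pearson_corr` are illustrative and do not come from the dataset or from Spark.

```python
import math


def sample_cov(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_corr(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))


# Made-up toy values, not taken from the dataset
sale_price = [300_000, 450_000, 500_000, 650_000, 800_000]
sqft = [1_000, 1_500, 1_700, 2_200, 2_600]
sqm = [s * 0.092903 for s in sqft]  # the same sizes expressed in square meters

cov_sqft = sample_cov(sale_price, sqft)
cov_sqm = sample_cov(sale_price, sqm)
corr_sqft = pearson_corr(sale_price, sqft)
corr_sqm = pearson_corr(sale_price, sqm)

print("covariance  (sqft vs sqm):", cov_sqft, cov_sqm)    # changes with the unit
print("correlation (sqft vs sqm):", corr_sqft, corr_sqm)  # stays the same
```

Because "Bedrooms" and "SqFtTotLiving" are on very different scales, their covariances with "SalePrice" are not directly comparable, whereas their correlations are.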

## Bonus Lab : Find the most expensive zip codes
We have data from many zip codes.  
To find the most expensive zip code, let's first calculate the **price per sqft**.

In [None]:
## Step 1 : calculate price per sqft
## TODO : do the math, divide  house_prices['SalePrice'] by  house_prices['SqFtTotLiving']
a = house_prices.withColumn("price_per_sqft", house_prices["???"] / house_prices['???'])

price_per_sqft = a.select('SalePrice', 'SqFtTotLiving', 'price_per_sqft' , 'ZipCode')
price_per_sqft.show()

In [None]:
## Group data by zipcode and take the avg of price_per_sqft
b  = price_per_sqft.groupBy("???").avg("???")
b.show()

## How many sales by zipcode
price_per_sqft.groupBy("ZipCode").count().show()

In [None]:
## Now sort by avg price
b.orderBy('???', ascending=False).show()
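For reference, the group-then-average logic of this lab looks like the following in plain Python. The rows and zip codes are made up for illustration, and the sketch does not use Spark, so the lab cells above still need to be completed with the DataFrame API.

```python
from collections import defaultdict

# (zip_code, sale_price, sqft_living) -- made-up rows, not from the dataset
toy_sales = [
    ("11111", 300_000, 1_500),
    ("11111", 400_000, 2_000),
    ("22222", 600_000, 1_500),
    ("22222", 500_000, 1_000),
]

# Step 1: price per sqft for every sale, collected per zip code
per_zip = defaultdict(list)
for zip_code, price, sqft in toy_sales:
    per_zip[zip_code].append(price / sqft)

# Step 2: average price per sqft within each zip code
avg_by_zip = {z: sum(vals) / len(vals) for z, vals in per_zip.items()}

# Step 3: sort from most to least expensive
ranked = sorted(avg_by_zip.items(), key=lambda kv: kv[1], reverse=True)
for zip_code, avg in ranked:
    print("{}: {:,.2f} per sqft".format(zip_code, avg))
```

When reading the real results, keep the sales count per zip code in view: an average built from only a handful of sales is easy to over-interpret.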

## Bonus Lab 2 : Pre-Post Bubble data
The sales data spans the housing bubble (2004 to 2006) and the post-bubble period (2008 onward).
You may want to split the data into two segments, bubble-era and post-bubble, to get better results.
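
A minimal sketch of such a split in plain Python, assuming each record carries a sale year. The field names `sale_year` and `sale_price` and the rows are hypothetical; check the schema from Step 2 for the actual date column before adapting this. Years outside the two ranges named above (for example 2007) are left out of both segments here, which is a choice you may want to revisit.

```python
# Made-up records; the field names are assumptions, not the dataset's schema
toy_sales = [
    {"sale_year": 2005, "sale_price": 400_000},
    {"sale_year": 2006, "sale_price": 450_000},
    {"sale_year": 2007, "sale_price": 430_000},
    {"sale_year": 2009, "sale_price": 350_000},
    {"sale_year": 2012, "sale_price": 380_000},
]

# Segment 1: sales during the bubble years
bubble = [r for r in toy_sales if 2004 <= r["sale_year"] <= 2006]
# Segment 2: sales after the bubble
post_bubble = [r for r in toy_sales if r["sale_year"] >= 2008]


def avg_price(rows):
    """Average sale price of a list of records."""
    return sum(r["sale_price"] for r in rows) / len(rows)


print("bubble:      {} sales, avg {:,.0f}".format(len(bubble), avg_price(bubble)))
print("post-bubble: {} sales, avg {:,.0f}".format(len(post_bubble), avg_price(post_bubble)))
```

With the real data, the same idea would be two DataFrame filters on the year of sale, after which the earlier steps of this lab can be rerun on each segment separately.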