### Predicting House Prices with Apache Spark

This dataset is not big, so Apache Spark is not strictly necessary here. We use it anyway to show how PySpark can be used to build a machine learning model.

In [5]:
import pyspark
from pyspark.sql import SparkSession

In [8]:
spark=SparkSession.builder.appName('Practise').getOrCreate()

In [9]:
spark

In [10]:
df= spark.sql('''select 'spark' as hello''')
df.show()

+-----+
|hello|
+-----+
|spark|
+-----+



In [11]:
df=spark.read.csv('cal_housing.data')

In [12]:
df

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string, _c6: string, _c7: string, _c8: string]

In [14]:
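# Note: cal_housing.data has no header row, so header=true turns the first record into column names
df=spark.read.option('header','true').csv('cal_housing.data')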

In [15]:
df

DataFrame[-122.230000: string, 37.880000: string, 41.000000: string, 880.000000: string, 129.000000: string, 322.000000: string, 126.000000: string, 8.325200: string, 452600.000000: string]

#### Understanding the Data Set

The California Housing data set appeared in a 1997 paper titled "Sparse Spatial Autoregressions", written by R. Kelley Pace and Ronald Barry and published in the journal Statistics & Probability Letters. The researchers built this data set from the 1990 California census data.

The data contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). In this sample, a block group includes on average 1,425.5 individuals living in a geographically compact area.

These spatial data contain 20,640 observations on housing prices with 9 economic variables:

Longitude: the angular distance of a place east or west of the prime meridian, given for each block group

Latitude: the angular distance of a place north or south of the earth’s equator, given for each block group

Housing Median Age: the median age of the houses in a block group. Note that the median is the value that lies at the midpoint of a frequency distribution of observed values

Total Rooms: the total number of rooms in the houses per block group

Total Bedrooms: the total number of bedrooms in the houses per block group

Population: the number of inhabitants of a block group

Households: the units of houses and their occupants per block group

Median Income: the median income of the people that belong to a block group

Median House Value: the dependent variable, the median house value per block group
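Several of these variables are medians, so a small illustration of the definition may help. The sketch below uses only Python's standard library and takes the housing median age of the first five records shown in this notebook as sample values; it illustrates how a median is computed and differs from a mean, and is not a statistic of the full data set.

```python
import statistics

# Housing median age of the first five block groups shown in the notebook
# output (illustrative sample only).
ages = [41.0, 21.0, 52.0, 52.0, 52.0]

# The median is the middle value once the data is sorted: 21, 41, 52, 52, 52.
median_age = statistics.median(ages)

# The mean is pulled down by the single low value (21), the median is not.
mean_age = statistics.mean(ages)

print("median:", median_age)
print("mean:", mean_age)
```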

In [18]:
from pyspark.sql.types import StructType, StructField, FloatType

In [19]:
schema = StructType([
    StructField("long", FloatType(), nullable=True),
    StructField("lat", FloatType(), nullable=True),
    StructField("medage", FloatType(), nullable=True),
    StructField("totrooms", FloatType(), nullable=True),
    StructField("totbdrms", FloatType(), nullable=True),
    StructField("pop", FloatType(), nullable=True),
    StructField("houshlds", FloatType(), nullable=True),
    StructField("medinc", FloatType(), nullable=True),
    StructField("medhv", FloatType(), nullable=True),
])

In [20]:
df = spark.read.csv("cal_housing.data", schema=schema)

In [21]:
df

DataFrame[long: float, lat: float, medage: float, totrooms: float, totbdrms: float, pop: float, houshlds: float, medinc: float, medhv: float]

In [22]:
df.show()

+-------+-----+------+--------+--------+------+--------+------+--------+
|   long|  lat|medage|totrooms|totbdrms|   pop|houshlds|medinc|   medhv|
+-------+-----+------+--------+--------+------+--------+------+--------+
|-122.23|37.88|  41.0|   880.0|   129.0| 322.0|   126.0|8.3252|452600.0|
|-122.22|37.86|  21.0|  7099.0|  1106.0|2401.0|  1138.0|8.3014|358500.0|
|-122.24|37.85|  52.0|  1467.0|   190.0| 496.0|   177.0|7.2574|352100.0|
|-122.25|37.85|  52.0|  1274.0|   235.0| 558.0|   219.0|5.6431|341300.0|
|-122.25|37.85|  52.0|  1627.0|   280.0| 565.0|   259.0|3.8462|342200.0|
|-122.25|37.85|  52.0|   919.0|   213.0| 413.0|   193.0|4.0368|269700.0|
|-122.25|37.84|  52.0|  2535.0|   489.0|1094.0|   514.0|3.6591|299200.0|
|-122.25|37.84|  52.0|  3104.0|   687.0|1157.0|   647.0|  3.12|241400.0|
|-122.26|37.84|  42.0|  2555.0|   665.0|1206.0|   595.0|2.0804|226700.0|
|-122.25|37.84|  52.0|  3549.0|   707.0|1551.0|   714.0|3.6912|261100.0|
|-122.26|37.85|  52.0|  2202.0|   434.0| 910.0|   4

In [25]:
df.show(5,vertical=True)

-RECORD 0------------
 long     | -122.23  
 lat      | 37.88    
 medage   | 41.0     
 totrooms | 880.0    
 totbdrms | 129.0    
 pop      | 322.0    
 houshlds | 126.0    
 medinc   | 8.3252   
 medhv    | 452600.0 
-RECORD 1------------
 long     | -122.22  
 lat      | 37.86    
 medage   | 21.0     
 totrooms | 7099.0   
 totbdrms | 1106.0   
 pop      | 2401.0   
 houshlds | 1138.0   
 medinc   | 8.3014   
 medhv    | 358500.0 
-RECORD 2------------
 long     | -122.24  
 lat      | 37.85    
 medage   | 52.0     
 totrooms | 1467.0   
 totbdrms | 190.0    
 pop      | 496.0    
 houshlds | 177.0    
 medinc   | 7.2574   
 medhv    | 352100.0 
-RECORD 3------------
 long     | -122.25  
 lat      | 37.85    
 medage   | 52.0     
 totrooms | 1274.0   
 totbdrms | 235.0    
 pop      | 558.0    
 houshlds | 219.0    
 medinc   | 5.6431   
 medhv    | 341300.0 
-RECORD 4------------
 long     | -122.25  
 lat      | 37.85    
 medage   | 52.0     
 totrooms | 1627.0   
 totbdrms 

In [23]:
df.printSchema()

root
 |-- long: float (nullable = true)
 |-- lat: float (nullable = true)
 |-- medage: float (nullable = true)
 |-- totrooms: float (nullable = true)
 |-- totbdrms: float (nullable = true)
 |-- pop: float (nullable = true)
 |-- houshlds: float (nullable = true)
 |-- medinc: float (nullable = true)
 |-- medhv: float (nullable = true)



In [24]:
df.columns

['long',
 'lat',
 'medage',
 'totrooms',
 'totbdrms',
 'pop',
 'houshlds',
 'medinc',
 'medhv']

#### Data Exploration

In [32]:
df.select(['long','lat','medage']).show(5)

+-------+-----+------+
|   long|  lat|medage|
+-------+-----+------+
|-122.23|37.88|  41.0|
|-122.22|37.86|  21.0|
|-122.24|37.85|  52.0|
|-122.25|37.85|  52.0|
|-122.25|37.85|  52.0|
+-------+-----+------+
only showing top 5 rows

