The goal here is to predict the median house value of a California block group from the other attributes recorded for that block group.

# Prepare: 

Longitude: the angular distance of a block group east or west of the prime meridian<br>
Latitude: the angular distance of a block group north or south of the earth's equator<br>
Housing Median Age: the median age of the houses in a block group. The median is the value that lies at the midpoint of the sorted observations<br>
Total Rooms: the total number of rooms in the houses of a block group<br>
Total Bedrooms: the total number of bedrooms in the houses of a block group<br>
Population: the number of inhabitants of a block group<br>
Households: the number of households (a housing unit together with its occupants) in a block group<br>
Median Income: the median income of the people that belong to a block group<br>
Median House Value: the dependent variable, that is, the median house value of a block group<br>

## Import libs: 

In [38]:
import pandas as pd

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql.types import StructType, StringType, IntegerType, StructField, FloatType
from pyspark.ml.regression import LinearRegression
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator

## Preprocessing

In [2]:
spark = SparkSession.builder.appName("cali_housing").getOrCreate()

In [12]:
schema = StructType([
    StructField("Longtitude", FloatType(), True),
    StructField("Latitude", FloatType(), True),
    StructField("HousingMedianAge", FloatType(), True),
    StructField("Total Rooms", FloatType(), True),
    StructField("Total Bedrooms", FloatType(), True),
    StructField("Population", FloatType(), True),
    StructField("Households", FloatType(), True),
    StructField("MedianIncome", FloatType(), True),
    StructField("MedianHouseValue", FloatType(), True),
])
df = spark.read.csv("cal_housing.data", schema=schema)

In [13]:
df.show()

+----------+--------+----------------+-----------+--------------+----------+----------+------------+----------------+
|Longtitude|Latitude|HousingMedianAge|Total Rooms|Total Bedrooms|Population|Households|MedianIncome|MedianHouseValue|
+----------+--------+----------------+-----------+--------------+----------+----------+------------+----------------+
|   -122.23|   37.88|            41.0|      880.0|         129.0|     322.0|     126.0|      8.3252|        452600.0|
|   -122.22|   37.86|            21.0|     7099.0|        1106.0|    2401.0|    1138.0|      8.3014|        358500.0|
|   -122.24|   37.85|            52.0|     1467.0|         190.0|     496.0|     177.0|      7.2574|        352100.0|
|   -122.25|   37.85|            52.0|     1274.0|         235.0|     558.0|     219.0|      5.6431|        341300.0|
|   -122.25|   37.85|            52.0|     1627.0|         280.0|     565.0|     259.0|      3.8462|        342200.0|
|   -122.25|   37.85|            52.0|      919.0|      

Check for null values in every column:

In [14]:
from pyspark.sql.functions import col, count, when

df.select([count(when(col(c).isNull() , c)).alias(c) for c in df.columns]).show()

+----------+--------+----------------+-----------+--------------+----------+----------+------------+----------------+
|Longtitude|Latitude|HousingMedianAge|Total Rooms|Total Bedrooms|Population|Households|MedianIncome|MedianHouseValue|
+----------+--------+----------------+-----------+--------------+----------+----------+------------+----------------+
|         0|       0|               0|          0|             0|         0|         0|           0|               0|
+----------+--------+----------------+-----------+--------------+----------+----------+------------+----------------+



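Derive two per-household features:
<br>
MeanPeopleInHousehold: the mean number of people per household (Population / Households), rounded to the nearest integer
<br>
NumberRoomPerHouse: the mean number of rooms per household (Total Rooms / Households), rounded to the nearest integer
<br>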


In [15]:
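# Python UDF that rounds a ratio to the nearest integer (Python's round() rounds ties to even)
round_number = F.udf(lambda x: round(x), T.IntegerType())
# Mean household size and mean number of rooms per household for each block group
df = df.withColumn("MeanPeopleInHousehold", round_number(df["Population"]/df["Households"]))
df = df.withColumn("NumberRoomPerHouse", round_number(df["Total Rooms"]/df["Households"]))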
df.show()

+----------+--------+----------------+-----------+--------------+----------+----------+------------+----------------+---------------------+------------------+
|Longtitude|Latitude|HousingMedianAge|Total Rooms|Total Bedrooms|Population|Households|MedianIncome|MedianHouseValue|MeanPeopleInHousehold|NumberRoomPerHouse|
+----------+--------+----------------+-----------+--------------+----------+----------+------------+----------------+---------------------+------------------+
|   -122.23|   37.88|            41.0|      880.0|         129.0|     322.0|     126.0|      8.3252|        452600.0|                    3|                 7|
|   -122.22|   37.86|            21.0|     7099.0|        1106.0|    2401.0|    1138.0|      8.3014|        358500.0|                    2|                 6|
|   -122.24|   37.85|            52.0|     1467.0|         190.0|     496.0|     177.0|      7.2574|        352100.0|                    3|                 8|
|   -122.25|   37.85|            52.0|     127
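
As a quick sanity check, the two derived columns can be reproduced by hand for the first two rows shown above. This is a plain-Python sketch rather than Spark code: the helper name `per_household` is hypothetical, and it applies the same `round()` call that the UDF uses.

```python
def per_household(total, households):
    """Ratio of a block-group total to its household count, rounded with Python's round()."""
    return round(total / households)

# (Population, Total Rooms, Households) copied from the first two rows of df.show()
rows = [
    (322.0, 880.0, 126.0),
    (2401.0, 7099.0, 1138.0),
]

# (MeanPeopleInHousehold, NumberRoomPerHouse) for each row
derived = [(per_household(p, h), per_household(r, h)) for p, r, h in rows]
print(derived)  # [(3, 7), (2, 6)]
```

Both pairs agree with the `MeanPeopleInHousehold` and `NumberRoomPerHouse` values in the output above.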

Summary statistics:

In [19]:
df.describe().select("summary","HousingMedianAge", "Total Rooms", "Total Bedrooms", "Population","Households","MedianIncome","MedianHouseValue","MeanPeopleInHousehold","NumberRoomPerHouse").show()

+-------+------------------+------------------+-----------------+------------------+-----------------+------------------+------------------+---------------------+------------------+
|summary|  HousingMedianAge|       Total Rooms|   Total Bedrooms|        Population|       Households|      MedianIncome|  MedianHouseValue|MeanPeopleInHousehold|NumberRoomPerHouse|
+-------+------------------+------------------+-----------------+------------------+-----------------+------------------+------------------+---------------------+------------------+
|  count|             20640|             20640|            20640|             20640|            20640|             20640|             20640|                20640|             20640|
|   mean|28.639486434108527|2635.7630813953488|537.8980135658915|1425.4767441860465|499.5396802325581|3.8706710030346416|206855.81690891474|   3.0757267441860465| 5.424806201550387|
| stddev| 12.58555761211163|2181.6152515827944| 421.247905943133|  1132.46212176534|382.32

# Predicting

## Modeling: 

Assemble the feature columns into a single vector column so the model can consume them. Note that Longitude and Latitude are left out of the feature set.

In [21]:
assembler = VectorAssembler(inputCols=["HousingMedianAge", "Total Rooms", "Total Bedrooms", "Population","Households","MedianIncome","MeanPeopleInHousehold","NumberRoomPerHouse"], outputCol="features")
transformed_df = assembler.transform(df)

In [22]:
transformed_df.show(5)

+----------+--------+----------------+-----------+--------------+----------+----------+------------+----------------+---------------------+------------------+--------------------+
|Longtitude|Latitude|HousingMedianAge|Total Rooms|Total Bedrooms|Population|Households|MedianIncome|MedianHouseValue|MeanPeopleInHousehold|NumberRoomPerHouse|            features|
+----------+--------+----------------+-----------+--------------+----------+----------+------------+----------------+---------------------+------------------+--------------------+
|   -122.23|   37.88|            41.0|      880.0|         129.0|     322.0|     126.0|      8.3252|        452600.0|                    3|                 7|[41.0,880.0,129.0...|
|   -122.22|   37.86|            21.0|     7099.0|        1106.0|    2401.0|    1138.0|      8.3014|        358500.0|                    2|                 6|[21.0,7099.0,1106...|
|   -122.24|   37.85|            52.0|     1467.0|         190.0|     496.0|     177.0|      7.2574|
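
Conceptually, the assembly step picks the listed columns in order and packs them into one numeric vector per row. The plain-Python sketch below mimics that for the first row; `assemble` is a hypothetical stand-in for illustration, not the Spark implementation.

```python
# Column order passed to VectorAssembler(inputCols=...) in the cell above
input_cols = [
    "HousingMedianAge", "Total Rooms", "Total Bedrooms", "Population",
    "Households", "MedianIncome", "MeanPeopleInHousehold", "NumberRoomPerHouse",
]

# First row of the DataFrame as a dict (values copied from the output above)
row = {
    "Longtitude": -122.23, "Latitude": 37.88, "HousingMedianAge": 41.0,
    "Total Rooms": 880.0, "Total Bedrooms": 129.0, "Population": 322.0,
    "Households": 126.0, "MedianIncome": 8.3252, "MedianHouseValue": 452600.0,
    "MeanPeopleInHousehold": 3, "NumberRoomPerHouse": 7,
}

def assemble(row, cols):
    """Hypothetical helper: select the columns in order and cast each value to float."""
    return [float(row[c]) for c in cols]

features = assemble(row, input_cols)
print(features)  # [41.0, 880.0, 129.0, 322.0, 126.0, 8.3252, 3.0, 7.0]
```

The label (`MedianHouseValue`) and the two coordinate columns are not part of the vector, which matches the truncated `features` value `[41.0,880.0,129.0...` shown above.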

Splitting the dataset into roughly 70% training and 30% test rows:

In [23]:
(train,test) = transformed_df.randomSplit([0.7,0.3], seed=11)

In [24]:
train.show()

+----------+--------+----------------+-----------+--------------+----------+----------+------------+----------------+---------------------+------------------+--------------------+
|Longtitude|Latitude|HousingMedianAge|Total Rooms|Total Bedrooms|Population|Households|MedianIncome|MedianHouseValue|MeanPeopleInHousehold|NumberRoomPerHouse|            features|
+----------+--------+----------------+-----------+--------------+----------+----------+------------+----------------+---------------------+------------------+--------------------+
|   -124.35|   40.54|            52.0|     1820.0|         300.0|     806.0|     270.0|      3.0147|         94600.0|                    3|                 7|[52.0,1820.0,300....|
|    -124.3|    41.8|            19.0|     2672.0|         552.0|    1298.0|     478.0|      1.9797|         85800.0|                    3|                 6|[19.0,2672.0,552....|
|    -124.3|   41.84|            17.0|     2677.0|         531.0|    1244.0|     456.0|      3.0313|

In [26]:
test.show()

+----------+--------+----------------+-----------+--------------+----------+----------+------------+----------------+---------------------+------------------+--------------------+
|Longtitude|Latitude|HousingMedianAge|Total Rooms|Total Bedrooms|Population|Households|MedianIncome|MedianHouseValue|MeanPeopleInHousehold|NumberRoomPerHouse|            features|
+----------+--------+----------------+-----------+--------------+----------+----------+------------+----------------+---------------------+------------------+--------------------+
|   -124.25|   40.28|            32.0|     1430.0|         419.0|     434.0|     187.0|      1.9417|         76100.0|                    2|                 8|[32.0,1430.0,419....|
|   -124.23|   40.54|            52.0|     2694.0|         453.0|    1152.0|     435.0|      3.0806|        106700.0|                    3|                 6|[52.0,2694.0,453....|
|   -124.21|   41.75|            20.0|     3810.0|         787.0|    1993.0|     721.0|      2.0074|
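
`randomSplit` assigns rows to the splits at random according to the weights, so the 70/30 proportion is approximate rather than exact, and the seed makes the assignment repeatable. The sketch below illustrates that idea in plain Python. It is an assumption-laden illustration, not Spark's actual algorithm, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform random number (illustrative only)."""
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, e.g. [0.7, 1.0]
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any floating-point remainder
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(20640))  # stand-in row ids; 20640 is the count reported by describe()
train_ids, test_ids = random_split(rows, [0.7, 0.3], seed=11)
print(len(train_ids), len(test_ids))
```

Every row lands in exactly one split, and rerunning with the same seed reproduces the same assignment. The exact sizes here will not match the Spark split, since the random number generators differ.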

Let's build a simple linear regression model, fitted on the training set:

In [35]:
lr = LinearRegression(featuresCol="features", labelCol="MedianHouseValue")
model = lr.fit(train)

In [36]:
test_result = model.transform(test)

In [37]:
test_result.select("prediction", "MedianHouseValue", "features").show()

+------------------+----------------+--------------------+
|        prediction|MedianHouseValue|            features|
+------------------+----------------+--------------------+
|127009.41117843846|         76100.0|[32.0,1430.0,419....|
|207292.15866354905|        106700.0|[52.0,2694.0,453....|
|110955.87634998583|         66900.0|[20.0,3810.0,787....|
|124669.35465124811|         68400.0|[17.0,3461.0,722....|
|173964.67824032521|         90100.0|[21.0,5694.0,1056...|
|157929.62380723993|         69000.0|[30.0,2975.0,634....|
| 93126.48014926424|         74600.0|[15.0,3140.0,714....|
|172269.39573363625|        107000.0|[35.0,952.0,178.0...|
| 122685.5051451314|         70500.0|[39.0,1836.0,352....|
|127276.05728543006|         81300.0|[30.0,1895.0,366....|
|152423.62063192952|         70500.0|[43.0,2285.0,479....|
| 153937.4346045175|         75500.0|[52.0,1606.0,419....|
|168456.28338264956|        109400.0|[16.0,2739.0,480....|
|192066.02728325778|         82000.0|[43.0,2241.0,446...

### Evaluation: 

Evaluate the test-set predictions with the R squared metric:

In [43]:
evaluator = RegressionEvaluator(labelCol="MedianHouseValue", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(test_result)
print("R squared:", r2)

R squared: 0.5536347916735185
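
An R squared of about 0.55 means the model accounts for roughly 55% of the variance of `MedianHouseValue` in the test set. The usual definition is $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $y_i$ is the label, $\hat{y}_i$ the prediction, and $\bar{y}$ the mean label. The sketch below computes it in plain Python; `r_squared` is a hypothetical helper, and it is assumed here that the evaluator follows this standard definition.

```python
def r_squared(labels, predictions):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# Toy checks: perfect predictions give 1.0, always predicting the mean gives 0.0
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0

# First five rows of the prediction table above (predictions rounded to 2 decimals)
labels = [76100.0, 106700.0, 66900.0, 68400.0, 90100.0]
predictions = [127009.41, 207292.16, 110955.88, 124669.35, 173964.68]
sample_r2 = r_squared(labels, predictions)
print(sample_r2)
```

The five-row value is only an illustration of the arithmetic. A handful of rows from one corner of the data is not representative, so it can be far from the full test-set figure and can even be negative.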


Inspect the coefficients (one per feature, in the order passed to VectorAssembler):

In [44]:
model.coefficients

DenseVector([1904.9381, -16.979, 88.6624, -40.4206, 134.6983, 47848.282, 197.5792, -777.896])
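
The coefficient vector is easier to read when each value is paired with its feature name. The sketch below does that in plain Python, assuming the coefficients follow the `inputCols` order given to the assembler; the values are copied from the output above.

```python
# Feature order from VectorAssembler(inputCols=...)
feature_names = [
    "HousingMedianAge", "Total Rooms", "Total Bedrooms", "Population",
    "Households", "MedianIncome", "MeanPeopleInHousehold", "NumberRoomPerHouse",
]
# Values copied from model.coefficients
coefficients = [1904.9381, -16.979, 88.6624, -40.4206, 134.6983, 47848.282, 197.5792, -777.896]

by_feature = dict(zip(feature_names, coefficients))

# Print the pairs, largest absolute coefficient first
for name, coef in sorted(by_feature.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:>22}: {coef:>12.4f}")
```

Each coefficient is the change in the predicted `MedianHouseValue` for a one-unit increase in that feature with the others held fixed. Because the features are on very different scales (compare `MedianIncome` with `Total Rooms` in the summary statistics), the raw magnitudes are not a direct ranking of feature importance.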