In this second set of exercises you will develop two Machine Learning procedures. In both cases it is mandatory to use Apache Spark 2.x, and any library you need to manage data or generate features must come from Spark MLlib (the DataFrame-based API, pyspark.ml). Check the following aspects:

Problem 1:

Using the dataset, build a Machine Learning procedure that predicts house prices from neighbourhood variables. The Boston House Price Dataset involves predicting a house price, in thousands of dollars, given details of the house and its neighbourhood. You can download the data from here: https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.data. More info here: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html.

Make sure that your dataset is technically correct
Check the consistency of your dataset
In this exercise it is not mandatory to use Pipelines
Split your data into two sets: 80% of the data for training and 20% of the data for testing
Provide suitable metrics to check how the model is behaving

There are 14 attributes in each case of the dataset. They are:

CRIM - per capita crime rate by town

ZN - proportion of residential land zoned for lots over 25,000 sq.ft.

INDUS - proportion of non-retail business acres per town.

CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)

NOX - nitric oxides concentration (parts per 10 million)

RM - average number of rooms per dwelling

AGE - proportion of owner-occupied units built prior to 1940

DIS - weighted distances to five Boston employment centres

RAD - index of accessibility to radial highways

TAX - full-value property-tax rate per $10,000

PTRATIO - pupil-teacher ratio by town

B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

LSTAT - % lower status of the population

MEDV - Median value of owner-occupied homes in $1000's

In [1]:
from pyspark.sql.types import StructType, StructField, FloatType, IntegerType
from pyspark.sql import SparkSession
import pyspark

# Local Spark context that uses all available cores
sc = pyspark.SparkContext('local[*]')

In [2]:
sc.textFile("housing.data").take(5)

[u' 0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30 396.90   4.98  24.00',
 u' 0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671   2  242.0  17.80 396.90   9.14  21.60',
 u' 0.02729   0.00   7.070  0  0.4690  7.1850  61.10  4.9671   2  242.0  17.80 392.83   4.03  34.70',
 u' 0.03237   0.00   2.180  0  0.4580  6.9980  45.80  6.0622   3  222.0  18.70 394.63   2.94  33.40',
 u' 0.06905   0.00   2.180  0  0.4580  7.1470  54.20  6.0622   3  222.0  18.70 396.90   5.33  36.20']

In [3]:
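def MakeDataConsistent(line):
    """Cast the 14 string fields of one record to numbers, in place."""
    for index in range(len(line)):
        # CHAS (index 3) and RAD (index 8) are integers; every other field is a float
        if index != 3 and index != 8:
            line[index] = float(line[index])
        else:
            line[index] = int(line[index])
        # Rescale TAX (index 9) by 10,000 and MEDV (index 13, given in $1000's) to dollars
        if index == 9:
            line[index] = 10000 * line[index]
        elif index == 13:
            line[index] = 1000 * line[index]
    return line


# Split each line on whitespace, then cast and rescale its fields
housing = sc.textFile("housing.data").map(lambda x: x.split()).map(MakeDataConsistent)
print(type(housing))
housing.first()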

<class 'pyspark.rdd.PipelinedRDD'>


[0.00632,
 18.0,
 2.31,
 0,
 0.538,
 6.575,
 65.2,
 4.09,
 1,
 2960000.0,
 15.3,
 396.9,
 4.98,
 24000.0]
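
The "technically correct" and "consistency" requirements can be sketched as a per-record check on the parsed values. The helper below is hypothetical (it is not part of the original notebook), and the value ranges it enforces are assumptions read off the attribute descriptions above, for example that ZN, AGE and LSTAT are percentages between 0 and 100.

```python
# Hypothetical helper: the ranges below are assumptions, not facts from the dataset docs.
COLUMNS = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
           'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']


def check_record(record):
    """Return a list of human-readable problems found in one parsed record."""
    if len(record) != len(COLUMNS):
        return ["expected %d fields, got %d" % (len(COLUMNS), len(record))]
    row = dict(zip(COLUMNS, record))
    problems = []
    # CHAS is a dummy variable, so only 0 and 1 are valid
    if row['CHAS'] not in (0, 1):
        problems.append("CHAS must be 0 or 1, got %r" % row['CHAS'])
    # Assumed to be percentages
    for name in ('ZN', 'AGE', 'LSTAT'):
        if not 0 <= row[name] <= 100:
            problems.append("%s out of range [0, 100]: %r" % (name, row[name]))
    # Assumed to be strictly positive quantities
    for name in ('NOX', 'RM', 'DIS', 'TAX', 'PTRATIO', 'MEDV'):
        if row[name] <= 0:
            problems.append("%s must be positive: %r" % (name, row[name]))
    return problems


# First parsed record, copied from the output above
first = [0.00632, 18.0, 2.31, 0, 0.538, 6.575, 65.2, 4.09, 1,
         2960000.0, 15.3, 396.9, 4.98, 24000.0]
print(check_record(first))
```

On the RDD, something like housing.map(check_record).filter(lambda p: p).count() would then count the records that have at least one problem.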

In [4]:
spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('CRIM', FloatType()),
    StructField('ZN', FloatType()),
    StructField('INDUS', FloatType()),
    StructField('CHAS', IntegerType()),
    StructField('NOX', FloatType()),
    StructField('RM', FloatType()),
    StructField('AGE', FloatType()),
    StructField('DIS', FloatType()),
    StructField('RAD', IntegerType()),
    StructField('TAX', FloatType()),
    StructField('PTRATIO', FloatType()),
    StructField('B', FloatType()),
    StructField('LSTAT', FloatType()),
    StructField('MEDV', FloatType())
    ])

df = spark.createDataFrame(housing, schema)

df.show()

+-------+----+-----+----+-----+-----+-----+------+---+---------+-------+------+-----+-------+
|   CRIM|  ZN|INDUS|CHAS|  NOX|   RM|  AGE|   DIS|RAD|      TAX|PTRATIO|     B|LSTAT|   MEDV|
+-------+----+-----+----+-----+-----+-----+------+---+---------+-------+------+-----+-------+
|0.00632|18.0| 2.31|   0|0.538|6.575| 65.2|  4.09|  1|2960000.0|   15.3| 396.9| 4.98|24000.0|
|0.02731| 0.0| 7.07|   0|0.469|6.421| 78.9|4.9671|  2|2420000.0|   17.8| 396.9| 9.14|21600.0|
|0.02729| 0.0| 7.07|   0|0.469|7.185| 61.1|4.9671|  2|2420000.0|   17.8|392.83| 4.03|34700.0|
|0.03237| 0.0| 2.18|   0|0.458|6.998| 45.8|6.0622|  3|2220000.0|   18.7|394.63| 2.94|33400.0|
|0.06905| 0.0| 2.18|   0|0.458|7.147| 54.2|6.0622|  3|2220000.0|   18.7| 396.9| 5.33|36200.0|
|0.02985| 0.0| 2.18|   0|0.458| 6.43| 58.7|6.0622|  3|2220000.0|   18.7|394.12| 5.21|28700.0|
|0.08829|12.5| 7.87|   0|0.524|6.012| 66.6|5.5605|  5|3110000.0|   15.2| 395.6|12.43|22900.0|
|0.14455|12.5| 7.87|   0|0.524|6.172| 96.1|5.9505|  5|311000

In [18]:
from pyspark.ml.feature import VectorAssembler

vectorAssembler = VectorAssembler(inputCols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'], outputCol = 'neighbourhood_variables')
vhouse_df = vectorAssembler.transform(df)
vhouse_df.take(1)

[Row(CRIM=0.006320000160485506, ZN=18.0, INDUS=2.309999942779541, CHAS=0, NOX=0.5379999876022339, RM=6.574999809265137, AGE=65.19999694824219, DIS=4.090000152587891, RAD=1, TAX=2960000.0, PTRATIO=15.300000190734863, B=396.8999938964844, LSTAT=4.980000019073486, MEDV=24000.0, neighbourhood_variables=DenseVector([0.0063, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2, 4.09, 1.0, 2960000.0, 15.3, 396.9, 4.98]))]

In [20]:
vhouse_df = vhouse_df.select(['neighbourhood_variables', 'MEDV'])
vhouse_df.show(3)

+-----------------------+-------+
|neighbourhood_variables|   MEDV|
+-----------------------+-------+
|   [0.00632000016048...|24000.0|
|   [0.02731000073254...|21600.0|
|   [0.02728999964892...|34700.0|
+-----------------------+-------+
only showing top 3 rows



In [21]:
to_train, to_test = vhouse_df.randomSplit([0.8, 0.2], seed=12345)

In [22]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol = 'neighbourhood_variables', labelCol='MEDV', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(to_train)
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))

Coefficients: [-118.24514205819007,44.5401644332137,-74.77194523615698,3039.3675060963847,-14916.055048104574,3929.1882673212394,-9.869240142757794,-1447.8762290576617,195.74562092143822,-0.0003297656816680554,-896.8656687893629,13.204799949789539,-468.38338410322376]
Intercept: 30116.3848057


In [24]:
lr_predictions = lr_model.transform(to_test)
lr_predictions.select("prediction","MEDV","neighbourhood_variables").show(10)

from pyspark.ml.evaluation import RegressionEvaluator
lr_evaluator = RegressionEvaluator(predictionCol="prediction",
                                   labelCol="MEDV", metricName="r2")
print("R Squared (R2) on test data = %g" % lr_evaluator.evaluate(lr_predictions))

+------------------+-------+-----------------------+
|        prediction|   MEDV|neighbourhood_variables|
+------------------+-------+-----------------------+
| 32597.78121390356|31600.0|   [0.01432000007480...|
|23171.301064468396|33000.0|   [0.01950999908149...|
|42448.978840272095|50000.0|   [0.02009000070393...|
| 28020.14480641378|25000.0|   [0.02875000052154...|
|24874.105772169885|28700.0|   [0.02985000051558...|
|31516.822632207026|34900.0|   [0.03150000050663...|
| 33167.20604630139|30300.0|   [0.04665999859571...|
|30906.417127279237|28700.0|   [0.05302000045776...|
|28877.386733444808|24600.0|   [0.05424999818205...|
| 34033.92402249715|39800.0|   [0.06588000059127...|
+------------------+-------+-----------------------+
only showing top 10 rows

R Squared (R2) on test data = 0.794089
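
R2 alone says little about the size of the errors in dollars. RegressionEvaluator also accepts metricName="rmse", "mse" and "mae". As a rough illustration of what these metrics compute, here is a plain-Python sketch with a hypothetical helper, regression_metrics. It runs only on the ten rows displayed above (rounded), so its numbers describe that sample and not the full test set.

```python
import math


def regression_metrics(pairs):
    """Compute RMSE, MAE and R2 from an iterable of (prediction, label) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    labels = [y for _, y in pairs]
    mean_y = sum(labels) / n
    ss_res = sum((y - p) ** 2 for p, y in pairs)    # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in labels)  # total sum of squares
    return {
        'rmse': math.sqrt(ss_res / n),
        'mae': sum(abs(y - p) for p, y in pairs) / n,
        'r2': 1.0 - ss_res / ss_tot,
    }


# (prediction, MEDV) pairs copied from the ten displayed rows, predictions rounded
sample = [(32597.78, 31600.0), (23171.30, 33000.0), (42448.98, 50000.0),
          (28020.14, 25000.0), (24874.11, 28700.0), (31516.82, 34900.0),
          (33167.21, 30300.0), (30906.42, 28700.0), (28877.39, 24600.0),
          (34033.92, 39800.0)]
metrics = regression_metrics(sample)
print(metrics)
```

Because MEDV was rescaled to dollars, RMSE and MAE are expressed in dollars as well, which makes them easier to interpret than R2.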


Problem 2:

Using the dataset, build a Machine Learning procedure to classify whether the return of a SONAR signal comes from a Rock or a Mine. All the data is available at: https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar%2C+Mines+vs.+Rocks%29. Make sure that you use the sonar.all-data dataset. Check the following aspects:

Make sure that your dataset is technically correct
Check the consistency of your dataset
In this exercise it is mandatory to use Pipelines
Split your data into two sets: 80% of the data for training and 20% of the data for testing
Check that the labels in both sets are equally distributed (hint: this is called stratified sampling)
Provide suitable metrics to check how the model is behaving
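
The stratified-sampling hint can be illustrated in plain Python: split each label group separately, so that both sets keep roughly the same label proportions. This is a minimal sketch and not the Spark solution; stratified_split is a hypothetical helper, and the toy rows assume 111 mines and 97 rocks. In Spark, DataFrame.sampleBy(col, fractions, seed) draws a stratified sample and could play a similar role.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, label_of, train_fraction=0.8, seed=12345):
    """Split rows into (train, test) while keeping label proportions similar."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)  # random order within each label group
        cut = int(round(train_fraction * len(group)))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Toy rows (id, label); the 111/97 class counts are an assumption for illustration
rows = [(i, 'M') for i in range(111)] + [(i, 'R') for i in range(97)]
train, test = stratified_split(rows, label_of=lambda row: row[1])
print(Counter(label for _, label in train))
print(Counter(label for _, label in test))
```

Comparing the two label counts is a direct way to verify that the labels are equally distributed in both sets.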