
# **Big Data Analytics Course**

Prof. José M. Luna

jmluna@uco.es

---

First, we need to install all the dependencies required to run the algorithm: Java 8, Spark 3.0 with Hadoop 3.2, and findspark, which locates the Spark installation so that Spark can be used from a notebook.

In [None]:
# Install Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# Download Spark
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

# Extract the Spark archive
!tar xf spark-3.0.0-bin-hadoop3.2.tgz

# Install findspark
!pip install -q findspark

# Set the environment variables for Java and Spark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"
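
If the session later fails to start, a common cause is that one of these two variables points to a path that does not exist. The following is a minimal sketch of a sanity check, not part of the original notebook; the helper name `missing_homes` is hypothetical.

```python
import os

def missing_homes(env, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or do not point to a directory."""
    missing = []
    for name in names:
        path = env.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

# An empty list means both variables point to existing directories.
print("Missing or invalid:", missing_homes(os.environ))
```

Running this right after the installation cell gives an early hint about a wrong path before findspark is initialised.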

We start a local Spark session to check that it is working.

In [None]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

We import the dataset we want to work with and check that it was uploaded correctly.

In [None]:
from google.colab import files
uploaded = files.upload()

!ls

Saving BostonHousing.csv to BostonHousing.csv
BostonHousing.csv  spark-3.0.0-bin-hadoop3.2
sample_data	   spark-3.0.0-bin-hadoop3.2.tgz


We read the dataset we want to work with; in this case, the BostonHousing dataset.



In [None]:
dataset = spark.read.csv('BostonHousing.csv', inferSchema=True, header=True)

We show the attributes of the dataset and their types.

In [None]:
dataset.printSchema()

root
 |-- crim: double (nullable = true)
 |-- zn: double (nullable = true)
 |-- indus: double (nullable = true)
 |-- chas: integer (nullable = true)
 |-- nox: double (nullable = true)
 |-- rm: double (nullable = true)
 |-- age: double (nullable = true)
 |-- dis: double (nullable = true)
 |-- rad: integer (nullable = true)
 |-- tax: integer (nullable = true)
 |-- ptratio: double (nullable = true)
 |-- b: double (nullable = true)
 |-- lstat: double (nullable = true)
 |-- medv: double (nullable = true)



We now use all the variables as predictors, with medv as the output. We build a dataset in which the inputs are combined into a single vector column.

In [None]:
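from pyspark.ml.feature import VectorAssembler

# Every column except the target 'medv' is used as a predictor
feature_cols = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat']
assembler = VectorAssembler(inputCols=feature_cols, outputCol='inputs')
df = assembler.transform(dataset)

# Keep only the assembled feature vector and the target
df = df.select(['inputs', 'medv'])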

df.printSchema()

root
 |-- inputs: vector (nullable = true)
 |-- medv: double (nullable = true)
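
To make the assembling step concrete, here is a plain-Python sketch of the idea, written without Spark: the 13 predictor columns of one record are packed, in order, into a single numeric vector, and the target is kept apart. The function `assemble` and the record values are illustrative assumptions, not output from this notebook.

```python
FEATURE_COLS = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age',
                'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat']

def assemble(row, input_cols=FEATURE_COLS, label_col='medv'):
    """Map one record (a dict) to a (feature_vector, label) pair."""
    features = [float(row[c]) for c in input_cols]
    return features, float(row[label_col])

# Illustrative record; the values are assumed for the example only.
example_row = {
    'crim': 0.00632, 'zn': 18.0, 'indus': 2.31, 'chas': 0, 'nox': 0.538,
    'rm': 6.575, 'age': 65.2, 'dis': 4.09, 'rad': 1, 'tax': 296,
    'ptratio': 15.3, 'b': 396.9, 'lstat': 4.98, 'medv': 24.0,
}

features, label = assemble(example_row)
print(len(features), label)
```

The order of the input columns matters: the coefficients of the fitted model are reported in that same order.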



We split the dataset into train and test sets, using an 80/20 partition.

In [None]:
trainData, testData = df.randomSplit([0.8, 0.2])
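
A weighted random split of this kind assigns each row to a subset independently, so the resulting sizes are only approximately 80/20, and without a seed they change from run to run. The sketch below illustrates that idea in plain Python; it is a conceptual approximation under these assumptions, not Spark's actual implementation, and `random_split` and the row count of 506 are assumptions made for the example.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of the normalised weights, e.g. [0.8, 1.0]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            if u < bound:
                splits[i].append(row)
                break
        else:
            # Guard against rounding in the last cumulative bound
            splits[-1].append(row)
    return splits

rows = list(range(506))  # assumed number of rows, for illustration only
train, test = random_split(rows, [0.8, 0.2], seed=42)
print(len(train), len(test))
```

In PySpark, `randomSplit` accepts an optional `seed` argument (for example `df.randomSplit([0.8, 0.2], seed=42)`), which should make the partition repeatable when the input data and its partitioning are unchanged.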

We specify which column holds the inputs and which one holds the output. In this case, the input is the "inputs" vector, which contains all 13 predictors, and the output is "medv". Finally, we train on trainData, evaluate on testData, and show the predictions.

In [None]:
from pyspark.ml.regression import LinearRegression

regressor = LinearRegression(featuresCol='inputs', labelCol='medv')

# Train the model on the training set (fit returns the fitted model)
regressor = regressor.fit(trainData)

# Evaluate the model on the test set
pred = regressor.evaluate(testData)

# Show the predictions
pred.predictions.show()

+--------------------+----+------------------+
|              inputs|medv|        prediction|
+--------------------+----+------------------+
|[0.0136,75.0,4.0,...|18.9|15.401288163550717|
|[0.01538,90.0,3.7...|44.0|37.499867106244565|
|[0.01965,80.0,1.7...|20.1| 19.99056024515422|
|[0.02177,82.5,2.0...|42.3|36.826671537628634|
|[0.03049,55.0,3.7...|31.2| 28.30222960629167|
|[0.03113,0.0,4.39...|17.5|16.441806555211677|
|[0.03237,0.0,2.18...|33.4| 28.62830923442691|
|[0.03502,80.0,4.9...|28.5|33.521152900494705|
|[0.03705,20.0,3.3...|35.4|34.236283397973885|
|[0.04297,52.5,5.3...|24.8|26.813428863733627|
|[0.04462,25.0,4.8...|23.9| 27.23207314240119|
|[0.04544,0.0,3.24...|19.8| 21.36085090371814|
|[0.04981,21.0,5.6...|23.4|23.813218074934493|
|[0.0536,21.0,5.64...|25.0|27.459397609869274|
|[0.05479,33.0,2.1...|28.4|30.947328473865543|
|[0.05735,0.0,4.49...|26.6|27.931965051506985|
|[0.06129,20.0,3.3...|46.0| 40.72336979670956|
|[0.06162,0.0,4.39...|17.2|14.705108957708884|
|[0.06417,0.0

We show the results of the regression model on the training set.

In [None]:
# Coefficients of the regression model
coeff = regressor.coefficients
# Intercept of the model (the prediction when every input is 0)
intr = regressor.intercept
print("Coefficients: %a" % coeff)
print("Intercept: %f" % intr)

# Metrics on the training set
print("Root Mean Squared Err (RMSE): %f" % regressor.summary.rootMeanSquaredError)
print("Mean Absolute Err (MAE): %f" % regressor.summary.meanAbsoluteError)
print("Mean Squared Err (MSE): %f" % regressor.summary.meanSquaredError)

Coefficients: DenseVector([-0.1148, 0.0458, 0.0275, 3.0299, -20.2886, 3.936, 0.0052, -1.5201, 0.3199, -0.0127, -0.9036, 0.009, -0.5171])
Intercept: 36.014081
Root Mean Squared Err (RMSE): 4.705631
Mean Absolute Err (MAE): 3.344118
Mean Squared Err (MSE): 22.142966
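
A linear regression model predicts the intercept plus the dot product of the coefficients and the input vector. The sketch below reproduces that computation in plain Python with the rounded coefficients printed above, so its results will differ slightly from Spark's full-precision predictions. The input record is an assumed example, not a row taken from the output of this notebook.

```python
# Rounded values copied from the printed model; order follows the input columns.
COEFFICIENTS = [-0.1148, 0.0458, 0.0275, 3.0299, -20.2886, 3.936, 0.0052,
                -1.5201, 0.3199, -0.0127, -0.9036, 0.009, -0.5171]
INTERCEPT = 36.014081

def predict(features, coefficients=COEFFICIENTS, intercept=INTERCEPT):
    """Return intercept + sum(coefficient_i * feature_i)."""
    if len(features) != len(coefficients):
        raise ValueError("expected %d features, got %d"
                         % (len(coefficients), len(features)))
    return intercept + sum(w * x for w, x in zip(coefficients, features))

# Assumed example record: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat
example = [0.00632, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2, 4.09, 1.0, 296.0, 15.3, 396.9, 4.98]
print(round(predict(example), 2))
```

Read this way, each coefficient is the change in the predicted medv for a one-unit increase in that input with the others held fixed; for instance, one more unit of rm raises the prediction by about 3.94.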


We show the results of the regression model on the test set.

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

rmse = RegressionEvaluator(labelCol="medv", predictionCol="prediction", metricName="rmse").evaluate(pred.predictions)
print("Root Mean Squared Err (RMSE): %f" % rmse)
mae = RegressionEvaluator(labelCol="medv", predictionCol="prediction", metricName="mae").evaluate(pred.predictions)
print("Mean Absolute Err (MAE): %f" % mae)
mse = RegressionEvaluator(labelCol="medv", predictionCol="prediction", metricName="mse").evaluate(pred.predictions)
print("Mean Squared Err (MSE): %f" % mse)

Root Mean Squared Err (RMSE): 4.601477
Mean Absolute Err (MAE): 3.113926
Mean Squared Err (MSE): 21.173586
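
The three metrics are closely related: MSE is the mean of the squared errors, RMSE is its square root (here 4.601477² ≈ 21.1736), and MAE is the mean of the absolute errors. As a minimal sketch, they can be computed by hand from pairs of labels and predictions. The helper `regression_metrics` is hypothetical, and the three pairs are the first rows of the predictions table shown earlier, so the numbers it prints describe only those rows, not the whole test set.

```python
import math

def regression_metrics(labels, predictions):
    """Compute RMSE, MAE and MSE from two equally long sequences."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equally long")
    errors = [p - y for y, p in zip(labels, predictions)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    return {"rmse": math.sqrt(mse), "mae": mae, "mse": mse}

# First three (medv, prediction) pairs from the predictions table
labels = [18.9, 44.0, 20.1]
predictions = [15.401288163550717, 37.499867106244565, 19.99056024515422]

metrics = regression_metrics(labels, predictions)
for name, value in metrics.items():
    print("%s: %f" % (name, value))
```

Because errors are squared before averaging, RMSE is never smaller than MAE and reacts more strongly to a few large errors.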
