## Cars Dataset
The cars dataset contains information on a range of car models, mostly built in the 1970s and 1980s. Its columns are:

<ul>
<li>mpg: Miles per gallon, a measure of fuel efficiency. Numeric.</li>
<li>cylinders: Number of cylinders in the engine. Numeric.</li>
<li>displacement: Engine displacement (in cubic inches). Numeric.</li>
<li>horsepower: Engine power (in horsepower). Numeric.</li>
<li>weight: Weight of the car (in pounds). Numeric.</li>
<li>acceleration: Time the car takes to accelerate from 0 to 60 mph (in seconds). Numeric.</li>
<li>model: Model year of the car (two digits, e.g. 70). Numeric.</li>
<li>origin: Origin of the car (US, Europe or Japan). Categorical.</li>
<li>car: Name of the car. Categorical.</li>
</ul>

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488490 sha256=5a973d0efad1a3b6bf86604a8c8436779fe4cc6e7c7629be94632bb9fc1363e2
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [2]:
from google.colab import drive

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
# Note: max, min and round imported below shadow the Python built-ins of the same name.
from pyspark.sql.functions import (
    col, desc, mean, max, min, stddev, count, when, isnan, round,
)
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler, MinMaxScaler, OneHotEncoder, StringIndexer
from pyspark.ml.regression import (
    LinearRegression, RandomForestRegressor, GBTRegressor, DecisionTreeRegressor,
)
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder


In [4]:
drive.mount('/content/drive')
path = '/content/drive/MyDrive/Colab Notebooks/proyecto-pyspark/cars.csv'

Mounted at /content/drive


In [5]:
spark = SparkSession.builder \
    .appName("Car Analysis") \
    .getOrCreate()

In [6]:
# Read the data (semicolon-separated CSV with a header row)
cars_df = spark.read.csv(path, header=True, inferSchema=True, sep=";")
# Print the inferred schema
cars_df.printSchema()
# Show the first rows
cars_df.show()

root
 |-- Car: string (nullable = true)
 |-- MPG: double (nullable = true)
 |-- Cylinders: integer (nullable = true)
 |-- Displacement: double (nullable = true)
 |-- Horsepower: double (nullable = true)
 |-- Weight: decimal(4,0) (nullable = true)
 |-- Acceleration: double (nullable = true)
 |-- Model: integer (nullable = true)
 |-- Origin: string (nullable = true)

+--------------------+----+---------+------------+----------+------+------------+-----+------+
|                 Car| MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|
+--------------------+----+---------+------------+----------+------+------------+-----+------+
|Chevrolet Chevell...|18.0|        8|       307.0|     130.0|  3504|        12.0|   70|    US|
|   Buick Skylark 320|15.0|        8|       350.0|     165.0|  3693|        11.5|   70|    US|
|  Plymouth Satellite|18.0|        8|       318.0|     150.0|  3436|        11.0|   70|    US|
|       AMC Rebel SST|16.0|        8|       304.0|     150.0| 

In [7]:
# Numeric attributes
numeric_cols = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration"]

In [8]:
# Compute descriptive statistics:
# build the aggregations and round them to 2 decimal places.
# The loop variable is named c so it does not hide the imported col function.
stats = cars_df.select(
    [round(mean(c), 2).alias(f"{c}_mean") for c in numeric_cols]
    + [round(max(c), 2).alias(f"{c}_max") for c in numeric_cols]
    + [round(min(c), 2).alias(f"{c}_min") for c in numeric_cols]
    + [round(stddev(c), 2).alias(f"{c}_stddev") for c in numeric_cols]
)
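
As a sanity check on what these aggregations compute, the sketch below reproduces the four statistics in plain Python for the MPG values of the first four rows shown in the table above. It assumes only the standard library; `builtins.max`, `builtins.min` and `builtins.round` are called explicitly because the notebook's imports replace those names with Spark functions. Spark's `stddev` is the sample standard deviation, which matches `statistics.stdev`.

```python
import builtins
import statistics

# MPG values of the first four rows displayed by cars_df.show().
mpg = [18.0, 15.0, 18.0, 16.0]

summary = {
    "mpg_mean": builtins.round(statistics.mean(mpg), 2),
    "mpg_max": builtins.round(builtins.max(mpg), 2),
    "mpg_min": builtins.round(builtins.min(mpg), 2),
    # Sample standard deviation (n - 1 in the denominator).
    "mpg_stddev": builtins.round(statistics.stdev(mpg), 2),
}
print(summary)
```

The numbers cover four rows only, so they will not match the full-dataset values in `stats`.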


In [9]:
stats.show()

+--------+--------------+-----------------+---------------+-----------+-----------------+-------+-------------+----------------+--------------+----------+----------------+-------+-------------+----------------+--------------+----------+----------------+----------+----------------+-------------------+-----------------+-------------+-------------------+
|mpg_mean|cylinders_mean|displacement_mean|horsepower_mean|weight_mean|acceleration_mean|mpg_max|cylinders_max|displacement_max|horsepower_max|weight_max|acceleration_max|mpg_min|cylinders_min|displacement_min|horsepower_min|weight_min|acceleration_min|mpg_stddev|cylinders_stddev|displacement_stddev|horsepower_stddev|weight_stddev|acceleration_stddev|
+--------+--------------+-----------------+---------------+-----------+-----------------+-------+-------------+----------------+--------------+----------+----------------+-------+-------------+----------------+--------------+----------+----------------+----------+----------------+-----------

In [10]:
# Number of cars per origin
origin_count = cars_df.groupBy("origin").count()
origin_count.show()

+------+-----+
|origin|count|
+------+-----+
|Europe|   73|
|    US|  254|
| Japan|   79|
+------+-----+



In [11]:
# Count null or NaN values in each column
cars_df.select([count(when(col(c).isNull() | isnan(c), c)).alias(c) for c in cars_df.columns]).show()

# Drop rows that contain null values
cars_df = cars_df.dropna()

+---+---+---------+------------+----------+------+------------+-----+------+
|Car|MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|
+---+---+---------+------------+----------+------+------------+-----+------+
|  0|  0|        0|           0|         0|     0|           0|    0|     0|
+---+---+---------+------------+----------+------+------------+-----+------+



In [12]:
# Count the outliers in a column using the 1.5 * IQR rule
def count_outliers(df, columna):
    # Approximate quartiles (relative error 0.05), both obtained in one call
    q1, q3 = df.approxQuantile(columna, [0.25, 0.75], 0.05)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return df.filter((col(columna) < lower_bound) | (col(columna) > upper_bound)).count()

In [13]:
# Number of outliers for each numeric attribute
outliers = {columna: count_outliers(cars_df, columna) for columna in numeric_cols}
print(outliers)

{'mpg': 1, 'cylinders': 0, 'displacement': 0, 'horsepower': 28, 'weight': 0, 'acceleration': 6}
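
The counts above depend on approximate quartiles: `approxQuantile` is called with a relative error of 0.05, so the bounds can differ from the exact ones (passing 0 as the relative error requests exact quantiles, at a higher cost). The sketch below shows the same 1.5 * IQR rule with exact quartiles in plain Python. The function name `count_outliers_exact` and the sample values are hypothetical, and `statistics.quantiles` uses its own interpolation method, so it is an illustration of the rule rather than a reproduction of Spark's result.

```python
import statistics

def count_outliers_exact(values):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] using exact quartiles."""
    # method="inclusive" interpolates between the observed minimum and maximum.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return sum(1 for v in values if v < lower_bound or v > upper_bound)

# Hypothetical sample: nine ordinary values and one extreme value.
sample = [10, 11, 12, 12, 13, 13, 14, 15, 16, 60]
print(count_outliers_exact(sample))
```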


In [14]:
# Mean mpg and weight per origin
mean_by_origin = cars_df.groupBy("origin").agg(
    round(mean("mpg"), 2).alias("avg_mpg"),
    round(mean("weight"), 2).alias("avg_weight"),
)
mean_by_origin.show()

+------+-------+----------+
|origin|avg_mpg|avg_weight|
+------+-------+----------+
|Europe|  26.75|   2431.49|
|    US|  19.69|   3372.70|
| Japan|  30.45|   2221.23|
+------+-------+----------+



In [15]:
# Percentage of cars per origin
total_cars = cars_df.count()
origin_percentage = origin_count.withColumn("percentage", (col("count") / total_cars) * 100)
origin_percentage.show()

+------+-----+------------------+
|origin|count|        percentage|
+------+-----+------------------+
|Europe|   73|17.980295566502463|
|    US|  254|  62.5615763546798|
| Japan|   79|19.458128078817737|
+------+-----+------------------+
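
The percentages can be checked by hand from the counts in the table. A minimal sketch in plain Python, using only the three counts reported above:

```python
# Counts per origin as reported by origin_count.show().
counts = {"Europe": 73, "US": 254, "Japan": 79}

total = sum(counts.values())
percentages = {origin: 100 * n / total for origin, n in counts.items()}

print(total)
for origin, pct in percentages.items():
    print(f"{origin}: {pct:.2f}%")
```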



In [16]:
# Compute the mean 'mpg' per 'origin'
mean_mpg_by_origin = cars_df.groupBy("origin").agg(mean("mpg").alias("mean_mpg"))
mean_mpg_by_origin.show()

+------+------------------+
|origin|          mean_mpg|
+------+------------------+
|Europe|26.745205479452057|
|    US|19.688188976377948|
| Japan|30.450632911392397|
+------+------------------+



In [17]:
# Join the original DataFrame with the per-origin mean 'mpg'
data_with_mean_mpg = cars_df.join(mean_mpg_by_origin, on="origin")
data_with_mean_mpg.show()

+------+--------------------+----+---------+------------+----------+------+------------+-----+------------------+
|Origin|                 Car| MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|          mean_mpg|
+------+--------------------+----+---------+------------+----------+------+------------+-----+------------------+
|    US|Chevrolet Chevell...|18.0|        8|       307.0|     130.0|  3504|        12.0|   70|19.688188976377948|
|    US|   Buick Skylark 320|15.0|        8|       350.0|     165.0|  3693|        11.5|   70|19.688188976377948|
|    US|  Plymouth Satellite|18.0|        8|       318.0|     150.0|  3436|        11.0|   70|19.688188976377948|
|    US|       AMC Rebel SST|16.0|        8|       304.0|     150.0|  3433|        12.0|   70|19.688188976377948|
|    US|         Ford Torino|17.0|        8|       302.0|     140.0|  3449|        10.5|   70|19.688188976377948|
|    US|    Ford Galaxie 500|15.0|        8|       429.0|     198.0|  4341|        10.0|

In [18]:
# Keep the cars whose 'mpg' exceeds the mean of their origin and count them per 'origin'
above_mean_count = data_with_mean_mpg.filter(col("mpg") > col("mean_mpg")) \
    .groupBy("origin").count().withColumnRenamed("count", "above_mean_count")
above_mean_count.show()

+------+----------------+
|origin|above_mean_count|
+------+----------------+
|Europe|              35|
|    US|             109|
| Japan|              46|
+------+----------------+



In [19]:
# Count the total number of cars per 'origin'
origin_count = cars_df.groupBy("origin").count().withColumnRenamed("count", "total_count")
origin_count.show()

+------+-----------+
|origin|total_count|
+------+-----------+
|Europe|         73|
|    US|        254|
| Japan|         79|
+------+-----------+



In [20]:
# Compute the percentage of cars per 'origin' whose 'mpg' exceeds the mean of that origin
percentage_above_mean_mpg = above_mean_count.join(origin_count, on="origin") \
    .withColumn("percentage_above_mean_mpg", (col("above_mean_count") / col("total_count")) * 100)

# Show the result
percentage_above_mean_mpg.show()

+------+----------------+-----------+-------------------------+
|origin|above_mean_count|total_count|percentage_above_mean_mpg|
+------+----------------+-----------+-------------------------+
|Europe|              35|         73|        47.94520547945205|
|    US|             109|        254|        42.91338582677165|
| Japan|              46|         79|        58.22784810126582|
+------+----------------+-----------+-------------------------+
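
The last five cells implement one idea: within each origin, compare every car with the mean of its own group and report the share that lies strictly above it. The sketch below shows that logic in plain Python on a few hypothetical `(origin, mpg)` pairs that are not taken from `cars.csv`.

```python
from collections import defaultdict

# Hypothetical (origin, mpg) pairs, for illustration only.
rows = [
    ("US", 18.0), ("US", 15.0), ("US", 24.0),
    ("Japan", 27.0), ("Japan", 31.0), ("Japan", 35.0), ("Japan", 31.0),
]

# Group the mpg values by origin.
by_origin = defaultdict(list)
for origin, mpg in rows:
    by_origin[origin].append(mpg)

# For each origin, the percentage of cars strictly above that origin's mean.
share_above_mean = {}
for origin, values in by_origin.items():
    group_mean = sum(values) / len(values)
    above = sum(1 for v in values if v > group_mean)
    share_above_mean[origin] = 100 * above / len(values)

print(share_above_mean)
```

The comparison is strict, so a car whose mpg equals its group mean is not counted, which mirrors the `>` used in the Spark filter.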



In [21]:
# Index the 'Origin' column and then one-hot encode the resulting index
indexer = StringIndexer(inputCol="Origin", outputCol="originIndex")
encoder = OneHotEncoder(inputCol="originIndex", outputCol="originVec")
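
To make the two stages concrete, the sketch below mimics their default behaviour in plain Python: `StringIndexer` assigns index 0 to the most frequent label, and `OneHotEncoder` drops the last category, so three origins become vectors of length two. The helper names are hypothetical, the counts are the full-dataset counts reported earlier (the pipeline is fitted on the training split, where they will differ somewhat), and Spark stores the result as sparse vectors rather than lists.

```python
from collections import Counter

def index_by_frequency(labels):
    """Map each label to an index, most frequent first (ties broken alphabetically)."""
    freq = Counter(labels)
    ordered = sorted(freq, key=lambda label: (-freq[label], label))
    return {label: i for i, label in enumerate(ordered)}

def one_hot_drop_last(index, num_categories):
    """Encode an index as a vector of length num_categories - 1."""
    vec = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1.0  # the last category stays all zeros
    return vec

# Origin counts reported earlier in the notebook.
labels = ["US"] * 254 + ["Japan"] * 79 + ["Europe"] * 73
mapping = index_by_frequency(labels)
encoded = {label: one_hot_drop_last(i, len(mapping)) for label, i in mapping.items()}

print(mapping)
print(encoded)
```

Dropping the last category avoids a redundant column: the all-zero vector already identifies the least frequent origin.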

In [22]:
# Create the VectorAssembler that combines the feature columns into a single vector
feature_cols = ["Cylinders", "Displacement", "Horsepower", "Weight", "Acceleration", "Model", "originVec"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

In [23]:
# Choose the scaler type
scaler_type = "minmax"  # Either "standard" or "minmax"

In [24]:
# Create the scaler selected by scaler_type
if scaler_type == "standard":
    scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withMean=True, withStd=True)
elif scaler_type == "minmax":
    scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
else:
    raise ValueError("scaler_type must be 'standard' or 'minmax'")
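
The two options rescale each feature differently. The sketch below applies both formulas in plain Python to the first four Weight values from the joined table shown earlier. The function names are hypothetical, and `builtins.min` and `builtins.max` are called explicitly because the notebook's imports replace those names with Spark functions. With default settings, `MinMaxScaler` maps each feature to [0, 1], and `StandardScaler` with `withMean=True, withStd=True` subtracts the mean and divides by the sample standard deviation.

```python
import builtins
import statistics

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = builtins.min(values), builtins.max(values)
    if hi == lo:
        return [0.5 for _ in values]  # constant feature: midpoint of [0, 1]
    return [(v - lo) / (hi - lo) for v in values]

def standard_scale(values):
    """Center on the mean and divide by the sample standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]

# First four Weight values from the tables above.
weights = [3504.0, 3693.0, 3436.0, 3433.0]

scaled = min_max_scale(weights)
standardized = standard_scale(weights)
print(scaled)
print(standardized)
```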

In [25]:
# Create the linear regression model
lr = LinearRegression(featuresCol="scaledFeatures", labelCol="MPG")

In [26]:
# Build the pipeline: transformation stages followed by the model
pipeline = Pipeline(stages=[indexer, encoder, assembler, scaler, lr])

In [27]:
# Define the hyperparameter grid to search
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()
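
The grid is the Cartesian product of the listed values. In `LinearRegression`, `elasticNetParam` mixes the two penalties (0.0 is a pure L2 penalty, 1.0 a pure L1 penalty) and `regParam` sets the overall strength. The sketch below enumerates the combinations in plain Python and estimates the number of model fits, assuming the three folds configured later in the notebook.

```python
from itertools import product

reg_params = [0.1, 0.01]
elastic_net_params = [0.0, 0.5, 1.0]

# One dictionary per candidate setting, as the grid builder would produce.
grid = [
    {"regParam": reg, "elasticNetParam": alpha}
    for reg, alpha in product(reg_params, elastic_net_params)
]

num_folds = 3  # value passed to CrossValidator later in the notebook
fits_during_cv = len(grid) * num_folds

print(len(grid))
print(fits_during_cv)
```

After the search, `CrossValidator` refits the best setting once more on the whole training set, so the total is one fit higher.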

In [28]:
# Create the evaluator
evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="rmse")

In [29]:
# Create the CrossValidator
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)  # Use 3 folds for cross-validation
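
With three folds, every candidate setting is trained three times, each time holding out a different third of the training data for validation; the setting with the best average metric is selected. The sketch below illustrates the splitting in plain Python. The function `k_fold_indices` is hypothetical, and Spark assigns rows to folds at random, so its fold sizes are only approximately equal.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

splits = list(k_fold_indices(n_rows=10, k=3))
for training, validation in splits:
    print(len(training), len(validation))
```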

In [30]:
# Split the data into training and test sets
train_data, test_data = cars_df.randomSplit([0.8, 0.2], seed=42)

In [31]:
# Train the model with cross-validation
cv_model = crossval.fit(train_data)

In [32]:
# Generate predictions on the test set
predictions = cv_model.transform(test_data)

In [33]:
# Evaluate the model
rmse = evaluator.evaluate(predictions)
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})
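
For reference, the two metrics are defined as follows: RMSE is the square root of the mean squared error, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the labels. A minimal plain-Python sketch with hypothetical values (the function names `rmse_score` and `r2_score` are illustrative):

```python
import math

def rmse_score(y_true, y_pred):
    """Root mean squared error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical labels and predictions.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]

print(rmse_score(y_true, y_pred))
print(r2_score(y_true, y_pred))
```

RMSE is expressed in the units of the label (miles per gallon here), while R2 is unitless and equals 1 for perfect predictions.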

In [34]:

# Print the model metrics
print(f"RMSE: {rmse}")
print(f"R2: {r2}")

# Show a few predictions
predictions.select("MPG", "prediction").show(10)

RMSE: 5.715282960616281
R2: 0.5993569732283703
+----+------------------+
| MPG|        prediction|
+----+------------------+
|17.0|14.408205895889031|
|20.2|23.662740765679228|
|18.0| 21.36818993222188|
|18.0|20.435384887822213|
|15.5| 16.15572506707235|
|15.0|13.046200158267435|
|24.0| 21.95961280833238|
|29.0| 26.85451866166292|
|14.0|19.179663414281322|
|16.9|16.987137495978644|
+----+------------------+
only showing top 10 rows



In [35]:
# Create the RandomForestRegressor model
# (tree-based models are largely insensitive to feature scaling, so the
# scaler stage should not materially affect the result)
rf = RandomForestRegressor(featuresCol="scaledFeatures", labelCol="MPG")

In [36]:
# Build the pipeline: transformation stages followed by the model
pipeline = Pipeline(stages=[indexer, encoder, assembler, scaler, rf])

In [37]:
# Define the hyperparameter grid to search
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [10, 20]) \
    .addGrid(rf.maxDepth, [5, 10]) \
    .build()

In [38]:
# Create the evaluator
evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="rmse")

In [39]:
# Create the CrossValidator
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)  # Use 3 folds for cross-validation

In [40]:
# Split the data into training and test sets (same seed as before)
train_data, test_data = cars_df.randomSplit([0.8, 0.2], seed=42)

In [41]:
# Train the model with cross-validation
cv_model = crossval.fit(train_data)

In [42]:
# Generate predictions on the test set
predictions = cv_model.transform(test_data)

In [43]:
# Evaluate the model
rmse = evaluator.evaluate(predictions)
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})

In [44]:
# Print the model metrics
print(f"RMSE: {rmse}")
print(f"R2: {r2}")

# Show a few predictions
predictions.select("MPG", "prediction").show(10)

RMSE: 5.5206353932919585
R2: 0.6261819694442986
+----+------------------+
| MPG|        prediction|
+----+------------------+
|17.0|14.029591450216452|
|20.2|20.329560439560446|
|18.0| 19.81438988095238|
|18.0|19.198333333333334|
|15.5|          17.13625|
|15.0|13.692091450216452|
|24.0|24.454317875107346|
|29.0|26.095010632642207|
|14.0| 9.618690476190476|
|16.9|            16.006|
+----+------------------+
only showing top 10 rows



In [45]:
# Create the DecisionTreeRegressor model
dt = DecisionTreeRegressor(featuresCol="scaledFeatures", labelCol="MPG")

In [46]:
# Build the pipeline: transformation stages followed by the model
pipeline = Pipeline(stages=[indexer, encoder, assembler, scaler, dt])

In [47]:
# Define the hyperparameter grid to search
paramGrid = ParamGridBuilder() \
    .addGrid(dt.maxDepth, [5, 10]) \
    .addGrid(dt.maxBins, [32, 64]) \
    .build()

In [48]:
# Create the evaluator
evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="rmse")

In [49]:
# Create the CrossValidator
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)  # Use 3 folds for cross-validation

In [50]:
# Split the data into training and test sets (same seed as before)
train_data, test_data = cars_df.randomSplit([0.8, 0.2], seed=42)

In [51]:
# Train the model with cross-validation
cv_model = crossval.fit(train_data)

In [52]:
# Generate predictions on the test set
predictions = cv_model.transform(test_data)

In [53]:
# Evaluate the model
rmse = evaluator.evaluate(predictions)
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})

In [54]:
# Print the model metrics
print(f"RMSE: {rmse}")
print(f"R2: {r2}")

# Show a few predictions
predictions.select("MPG", "prediction").show(10)

RMSE: 6.477456099679724
R2: 0.4853747738398635
+----+------------------+
| MPG|        prediction|
+----+------------------+
|17.0|              14.5|
|20.2| 20.88421052631579|
|18.0| 19.23076923076923|
|18.0| 19.23076923076923|
|15.5|26.600000000000023|
|15.0|              14.5|
|24.0|24.060000000000002|
|29.0|27.789473684210527|
|14.0|               0.0|
|16.9|            15.625|
+----+------------------+
only showing top 10 rows



In [55]:
# Create the GBTRegressor model
gbt = GBTRegressor(featuresCol="scaledFeatures", labelCol="MPG")

In [56]:
# Build the pipeline: transformation stages followed by the model
pipeline = Pipeline(stages=[indexer, encoder, assembler, scaler, gbt])

In [57]:
# Define the hyperparameter grid to search
paramGrid = ParamGridBuilder() \
    .addGrid(gbt.maxDepth, [5, 10]) \
    .addGrid(gbt.maxIter, [10, 20]) \
    .build()

In [58]:
# Create the evaluator
evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="rmse")

In [59]:
# Create the CrossValidator
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)  # Use 3 folds for cross-validation

In [60]:
# Split the data into training and test sets (same seed as before)
train_data, test_data = cars_df.randomSplit([0.8, 0.2], seed=42)

In [61]:
# Train the model with cross-validation
cv_model = crossval.fit(train_data)

In [62]:
# Generate predictions on the test set
predictions = cv_model.transform(test_data)

In [63]:
# Evaluate the model
rmse = evaluator.evaluate(predictions)
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})

In [64]:
# Print the model metrics
print(f"RMSE: {rmse}")
print(f"R2: {r2}")

# Show a few predictions
predictions.select("MPG", "prediction").show(10)

RMSE: 6.52414662789623
R2: 0.4779290351023283
+----+------------------+
| MPG|        prediction|
+----+------------------+
|17.0|14.457144170543382|
|20.2| 20.71159663122862|
|18.0|19.464722756692225|
|18.0|19.309464205455466|
|15.5|26.567774720304826|
|15.0| 14.26150002651576|
|24.0| 24.14503802614297|
|29.0|  26.7559393700798|
|14.0| 1.475400485122426|
|16.9|16.153409304431197|
+----+------------------+
only showing top 10 rows
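
The four runs can be compared side by side. The sketch below collects the test-set metrics printed above into a plain dictionary and ranks the models by RMSE. It only rearranges numbers already reported in this notebook. They come from a single random split whose test set holds roughly a fifth of the 406 rows, so small differences between models should be read with caution.

```python
# Test-set metrics copied from the outputs above.
results = {
    "LinearRegression": {"rmse": 5.715282960616281, "r2": 0.5993569732283703},
    "RandomForestRegressor": {"rmse": 5.5206353932919585, "r2": 0.6261819694442986},
    "DecisionTreeRegressor": {"rmse": 6.477456099679724, "r2": 0.4853747738398635},
    "GBTRegressor": {"rmse": 6.52414662789623, "r2": 0.4779290351023283},
}

# Lower RMSE is better, so sort in ascending order.
ranked = sorted(results.items(), key=lambda item: item[1]["rmse"])
best_model = ranked[0][0]

print(f"{'model':<24}{'RMSE':>8}{'R2':>8}")
for name, metrics in ranked:
    print(f"{name:<24}{metrics['rmse']:>8.3f}{metrics['r2']:>8.3f}")
print(f"Lowest RMSE: {best_model}")
```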



In [65]:
# Stop the Spark session
spark.stop()
