## EXERCISE M6

Dataset: diamonds

* PART 1 (10 %) Load the diamonds data from CSV with an explicit schema: https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/diamonds.csv

* PART 2 (40 %) Regression pipeline for price, with preprocessing
  * Imputer, StringIndexer, OneHotEncoder, MinMaxScaler or StandardScaler, VectorAssembler

* PART 3 (40 %) Multiclass classification pipeline for the cut variable, with preprocessing
  * Imputer, StringIndexer, OneHotEncoder, MinMaxScaler or StandardScaler, VectorAssembler

* PART 4 (10 %) Grid search with cross-validation on either of the pipelines

Any model may be used: for example, RandomForest for both tasks, or RandomForestRegressor for regression and MultilayerPerceptronClassifier for classification.

Submission file: m6_nombre_apellido.ipynb (nombre = first name, apellido = last name)

Due date: 02/03/2025

Use PySpark MLlib and PySpark DataFrames. Follow the notebook 08.pipelines.ipynb.


In [26]:
import requests
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, NumericType, StringType, DoubleType, IntegerType
from pyspark.sql.functions import col, sum  # Spark's column aggregate; it shadows Python's built-in sum
from pyspark.ml.feature import StringIndexer, Imputer, OneHotEncoder, VectorAssembler, MinMaxScaler
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

In [27]:
spark = SparkSession.builder.appName('uso_pipelines').getOrCreate()

In [28]:
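url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/diamonds.csv'
csv_path = 'diamonds.csv'
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early if the download did not succeed
with open(csv_path, 'wb') as file:
    file.write(response.content)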

In [29]:
    
schema = StructType([
    StructField("carat", DoubleType(), True),
    StructField("cut", StringType(), True),
    StructField("color", StringType(), True),
    StructField("clarity", StringType(), True),
    StructField("depth", DoubleType(), True),
    StructField("table", DoubleType(), True),
    StructField("price", IntegerType(), True),
    StructField("x", DoubleType(), True),
    StructField("y", DoubleType(), True),
    StructField("z", DoubleType(), True)
])
df = spark.read.csv(csv_path, header=True, inferSchema=False, schema=schema)
df.show(5)
df.printSchema()

+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|carat|    cut|color|clarity|depth|table|price|   x|   y|   z|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 0.23|  Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
| 0.21|Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
| 0.23|   Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|
| 0.29|Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
| 0.31|   Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 5 rows

root
 |-- carat: double (nullable = true)
 |-- cut: string (nullable = true)
 |-- color: string (nullable = true)
 |-- clarity: string (nullable = true)
 |-- depth: double (nullable = true)
 |-- table: double (nullable = true)
 |-- price: integer (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)
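
To make the role of the explicit schema concrete, here is a minimal plain-Python sketch of the same idea: every field is cast to a declared type instead of being inferred, and a field that is empty or cannot be cast becomes `None`. The `SCHEMA` list and `parse_csv` helper are hypothetical names for illustration only, and the blank price in the second sample row was introduced on purpose; Spark's own CSV reader is more involved than this.

```python
import csv
import io

# Column name -> Python type, mirroring the StructType above
# (DoubleType -> float, StringType -> str, IntegerType -> int).
SCHEMA = [
    ("carat", float), ("cut", str), ("color", str), ("clarity", str),
    ("depth", float), ("table", float), ("price", int),
    ("x", float), ("y", float), ("z", float),
]


def parse_csv(text):
    """Parse CSV text into dicts, casting each field according to SCHEMA.

    Empty or uncastable fields become None (illustrative behaviour only).
    """
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = {}
        for name, cast in SCHEMA:
            raw = record.get(name)
            try:
                row[name] = cast(raw) if raw not in (None, "") else None
            except ValueError:
                row[name] = None
        rows.append(row)
    return rows


# First two rows of the dataset; the price of the second row is blanked
# here on purpose to show how a missing value is handled.
sample = (
    "carat,cut,color,clarity,depth,table,price,x,y,z\n"
    "0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43\n"
    "0.21,Premium,E,SI1,59.8,61.0,,3.89,3.84,2.31\n"
)
rows = parse_csv(sample)
print(rows[0]["price"], rows[1]["price"])
```

Declaring the types up front avoids the extra pass over the file that schema inference needs and keeps `price` an integer rather than whatever the reader would guess.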



In [30]:
dfr = df
dfr.show(3)

+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|carat|    cut|color|clarity|depth|table|price|   x|   y|   z|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 0.23|  Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
| 0.21|Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
| 0.23|   Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 3 rows



# Regression

In [31]:
# We are predicting 'price', so drop the rows where 'price' is null:
dfr = dfr.dropna(subset=['price'])  # target column

# Count the nulls in every column: the equivalent of pandas df.isna().sum()
dfr.select([sum(col(c).isNull().cast('int')).alias(c) for c in dfr.columns]).show()

+-----+---+-----+-------+-----+-----+-----+---+---+---+
|carat|cut|color|clarity|depth|table|price|  x|  y|  z|
+-----+---+-----+-------+-----+-----+-----+---+---+---+
|    0|  0|    0|      0|    0|    0|    0|  0|  0|  0|
+-----+---+-----+-------+-----+-----+-----+---+---+---+



In [32]:
# Select the names of the columns to preprocess
numerical_cols = [field.name for field in dfr.schema.fields if isinstance(field.dataType, NumericType) and field.name != 'price']
categorical_cols = [field.name for field in dfr.schema.fields if isinstance(field.dataType, StringType)]
label = 'price'
print(numerical_cols)
print(categorical_cols)

['carat', 'depth', 'table', 'x', 'y', 'z']
['cut', 'color', 'clarity']


In [33]:
dfr = dfr.withColumnRenamed('price', 'label')

In [34]:
# Indexers for the input features (not for the label column to predict):
# create one StringIndexer per categorical column
indexers_features = [
    StringIndexer(inputCol=c, outputCol=c + '_indexed', handleInvalid='keep') for c in categorical_cols
]
categorical_cols_indexed = [c + '_indexed' for c in categorical_cols]
print(categorical_cols_indexed)

['cut_indexed', 'color_indexed', 'clarity_indexed']
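
As a rough mental model of what each `StringIndexer` does, the plain-Python sketch below assigns index 0 to the most frequent string and sends unseen values to one extra bucket, which is my understanding of `handleInvalid='keep'`. The helper names are hypothetical, and the alphabetical tie-break is an assumption about Spark's default ordering rather than something this notebook shows.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index: most frequent first,
    ties broken alphabetically (assumed to match Spark's default)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def transform_keep(value, mapping):
    """Look up a value; unseen values go to one extra bucket whose
    index equals the number of known labels."""
    return mapping.get(value, len(mapping))


# 'cut' values of the five rows displayed earlier in the notebook
cuts = ["Ideal", "Premium", "Good", "Premium", "Good"]
mapping = fit_string_index(cuts)
print(mapping)
print(transform_keep("Fair", mapping))  # 'Fair' was not seen when fitting
```

Because the indices depend on label frequencies in the data the indexer was fitted on, the same string can receive a different index on a different training split.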


In [35]:
# Imputer using the mode for the indexed categorical columns
imputer_categorical = Imputer(
    inputCols=categorical_cols_indexed,
    outputCols= [c + '_imputed' for c in categorical_cols_indexed],
    strategy='mode'
)
categorical_cols_indexed_imputed = [c + '_imputed' for c in categorical_cols_indexed]
print(categorical_cols_indexed_imputed)

['cut_indexed_imputed', 'color_indexed_imputed', 'clarity_indexed_imputed']


In [36]:
# One-hot encoders for the indexed and imputed categorical columns
encoders_onehot = [
    OneHotEncoder(inputCol=c, outputCol=c + '_onehot')
    for c in categorical_cols_indexed_imputed
]
categorical_cols_onehot = [c + '_onehot' for c in categorical_cols_indexed_imputed]
print(categorical_cols_onehot)

['cut_indexed_imputed_onehot', 'color_indexed_imputed_onehot', 'clarity_indexed_imputed_onehot']
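
The sketch below illustrates the encoding these stages produce, assuming the default `dropLast=True`: a column with n categories becomes a vector of length n - 1, and the last category is represented by all zeros. The `one_hot` helper is hypothetical and returns a dense list, whereas Spark stores the result as a sparse vector.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index.

    With drop_last=True the vector has num_categories - 1 slots and the
    last category is encoded as all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


print(one_hot(0, 3))                   # first of three categories
print(one_hot(2, 3))                   # last category: all zeros
print(one_hot(2, 3, drop_last=False))  # full-length encoding
```

Dropping the last slot avoids a redundant column, since the last category is implied when all other slots are zero.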


In [37]:
# Imputer using the median for the numeric columns
imputer_numerical = Imputer(
    inputCols=numerical_cols,
    outputCols= [c + '_imputed' for c in numerical_cols],
    strategy='median'
)
numerical_cols_imputed = [c + '_imputed' for c in numerical_cols]
print(numerical_cols_imputed)

['carat_imputed', 'depth_imputed', 'table_imputed', 'x_imputed', 'y_imputed', 'z_imputed']
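
Conceptually, median imputation replaces each missing value with the median of the observed values in the same column. The plain-Python sketch below shows that idea on the `depth` values of the five rows displayed earlier plus one hypothetical missing entry; `impute_median` is an illustrative helper, and Spark's `Imputer` computes its median with an approximate quantile, so results on large columns may differ slightly.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


# 'depth' of the five rows shown earlier, plus one hypothetical missing value
depth = [61.5, 59.8, 56.9, 62.4, 63.3, None]
print(impute_median(depth))
```

The median is less sensitive to outliers than the mean, which is a common reason to prefer it for skewed measurements.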


In [38]:
# (Optional) scale the numeric columns with MinMaxScaler
assembler_numerical = VectorAssembler(
    inputCols=numerical_cols_imputed,
    outputCol='numeric_features'
)
scaler = MinMaxScaler(
    inputCol='numeric_features',
    outputCol='numeric_features_scaled'
)
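
For reference, min-max scaling maps each value of a feature linearly onto a target range, [0, 1] by default. The sketch below applies the formula to a single column in plain Python; `min_max_scale` is a hypothetical helper, and the handling of a constant column (mapping to the midpoint of the output range) follows my reading of the Spark documentation.

```python
def min_max_scale(values, out_min=0.0, out_max=1.0):
    """Rescale values linearly so min(values) -> out_min and max(values) -> out_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: use the midpoint of the output range
        return [0.5 * (out_max + out_min)] * len(values)
    scale = (out_max - out_min) / (hi - lo)
    return [(v - lo) * scale + out_min for v in values]


# 'carat' of the five rows shown earlier
carat = [0.23, 0.21, 0.23, 0.29, 0.31]
scaled = min_max_scale(carat)
print([round(s, 3) for s in scaled])
```

Note that the minimum and maximum are learned when the scaler is fitted, so test rows outside the training range can fall outside [0, 1].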

In [39]:
all_columns = ['numeric_features_scaled'] + categorical_cols_onehot

In [40]:
# Assemble everything (numeric + categorical) into the 'features' column
assembler_all = VectorAssembler(
    inputCols=all_columns,
    outputCol='features')

In [41]:
regressor = RandomForestRegressor(seed=42)

In [42]:
# Split the data. Use dfr (where 'price' was renamed to 'label'), not df:
# the regressor looks for a column named 'label' by default.
df_train, df_test = dfr.randomSplit([0.8, 0.2], seed=42)

In [43]:
pipeline = Pipeline(stages = [
    # 1. indexers for the categorical columns 'cut', 'color' and 'clarity'
    *indexers_features,  # * unpacks the list of stages
    # 2. Imputer for the categorical columns
    imputer_categorical,
    # 3. OneHotEncoders for the categorical columns
    *encoders_onehot,
    # 4. Imputer for the numeric columns 'carat', 'depth', 'table', 'x', 'y' and 'z'
    imputer_numerical,
    # 5. assemble the numeric columns and scale them
    assembler_numerical,
    scaler,
    # 6. assemble all columns (scaled numeric + one-hot categorical) into 'features'
    assembler_all,
    # 7. regression model
    regressor
])

In [44]:
pipeline_model = pipeline.fit(df_train)
df_pred = pipeline_model.transform(df_test)

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

# 'price' is continuous, so use regression metrics instead of classification metrics
evaluator_rmse = RegressionEvaluator(metricName='rmse')
evaluator_mae = RegressionEvaluator(metricName='mae')
evaluator_r2 = RegressionEvaluator(metricName='r2')

In [None]:
print('rmse', evaluator_rmse.evaluate(df_pred))
print('mae', evaluator_mae.evaluate(df_pred))
print('r2', evaluator_r2.evaluate(df_pred))

In [None]:
paramGrid = (
    ParamGridBuilder()
    .addGrid(regressor.numTrees, [5, 10, 15, 20, 25, 30]) # default is 20
    .addGrid(regressor.maxDepth, [3, 5, 10, 15]) # default is 5 (allowed range: 0 to 30)
    .build()
)

In [None]:
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid, # parameter combinations for the grid search
    evaluator=evaluator_rmse, # lower RMSE is better; the evaluator reports this to CrossValidator
    numFolds=3, # default is 3
    parallelism=4, 
    seed=42
)
cv_model = crossval.fit(df_train)
df_pred = cv_model.transform(df_test)

In [None]:
print('rmse', evaluator_rmse.evaluate(df_pred))
print('mae', evaluator_mae.evaluate(df_pred))
print('r2', evaluator_r2.evaluate(df_pred))

In [None]:
best_model = cv_model.bestModel
best_rf =best_model.stages[-1]
print(best_rf.extractParamMap())
print(best_rf.getNumTrees)
print(best_rf.getOrDefault('maxDepth'))
print(best_rf.featureImportances)
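
For a continuous target such as price, the usual error measures are RMSE, MAE and R², which Spark's `RegressionEvaluator` exposes under the metric names `rmse`, `mae` and `r2`. The plain-Python sketch below shows how the three are computed; `regression_metrics` is a hypothetical helper, the true prices are the five rows displayed earlier, and the predictions are invented purely for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


prices = [326, 326, 327, 334, 335]     # 'price' of the five rows shown earlier
predicted = [330, 325, 329, 333, 331]  # hypothetical predictions
metrics = regression_metrics(prices, predicted)
print(metrics)
```

RMSE and MAE are in the units of the target (here, price), while R² is unitless and compares the model against always predicting the mean.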

# Classification

In [None]:
# We are predicting 'cut', so drop the rows where 'cut' is null:
df = df.dropna(subset=['cut'])  # target column

# Count the nulls in every column: the equivalent of pandas df.isna().sum()
df.select([sum(col(c).isNull().cast('int')).alias(c) for c in df.columns]).show()



In [None]:
# Select the names of the columns to preprocess
numerical_cols = [field.name for field in df.schema.fields if isinstance(field.dataType, NumericType)]
categorical_cols = [field.name for field in df.schema.fields if isinstance(field.dataType, StringType) and field.name != 'cut']
label_col = 'cut'

In [None]:
# Indexer for the label column (the column to predict)
indexer_label = StringIndexer(
    inputCol= label_col,
    outputCol='label',
    handleInvalid='keep'
)

In [None]:
# Indexers for the input features (not for the label column to predict):
# create one StringIndexer per categorical column
indexers_features = [
    StringIndexer(inputCol=c, outputCol=c + '_indexed', handleInvalid='keep') for c in categorical_cols
]
categorical_cols_indexed = [c + '_indexed' for c in categorical_cols]
print(categorical_cols_indexed)


In [None]:
# Imputer using the mode for the indexed categorical columns
imputer_categorical = Imputer(
    inputCols=categorical_cols_indexed,
    outputCols= [c + '_imputed' for c in categorical_cols_indexed],
    strategy='mode'
)
categorical_cols_indexed_imputed = [c + '_imputed' for c in categorical_cols_indexed]
print(categorical_cols_indexed_imputed)


In [None]:
# One-hot encoders for the indexed and imputed categorical columns
encoders_onehot = [
    OneHotEncoder(inputCol=c, outputCol=c + '_onehot')
    for c in categorical_cols_indexed_imputed
]
categorical_cols_onehot = [c + '_onehot' for c in categorical_cols_indexed_imputed]
print(categorical_cols_onehot)


In [None]:
# Imputer using the median for the numeric columns
imputer_numerical = Imputer(
    inputCols=numerical_cols,
    outputCols= [c + '_imputed' for c in numerical_cols],
    strategy='median'
)
numerical_cols_imputed = [c + '_imputed' for c in numerical_cols]
print(numerical_cols_imputed)


In [None]:
# (Optional) scale the numeric columns with MinMaxScaler
assembler_numerical = VectorAssembler(
    inputCols=numerical_cols_imputed,
    outputCol='numeric_features'
)
scaler = MinMaxScaler(
    inputCol='numeric_features',
    outputCol='numeric_features_scaled'
)

In [None]:
all_columns = ['numeric_features_scaled'] + categorical_cols_onehot

In [None]:
# Assemble everything (numeric + categorical) into the 'features' column
assembler_all = VectorAssembler(
    inputCols=all_columns,
    outputCol='features')

In [None]:
classifier = RandomForestClassifier(seed=42)

In [None]:
# Split the data into training and test sets
df_train, df_test = df.randomSplit([0.8, 0.2], seed=42)

In [None]:
pipeline = Pipeline(stages = [
    # 1. StringIndexer for the label column
    indexer_label,
    # 2. indexers for the categorical feature columns
    *indexers_features,  # * unpacks the list of stages
    # 3. Imputer for the categorical columns
    imputer_categorical,
    # 4. OneHotEncoders for the categorical columns
    *encoders_onehot,
    # 5. Imputer for the numeric columns
    imputer_numerical,
    # 6. assemble the numeric columns and scale them
    assembler_numerical,
    scaler,
    # 7. assemble all columns (scaled numeric + one-hot categorical) into 'features'
    assembler_all,
    # 8. classification model
    classifier
])

In [None]:
pipeline_model = pipeline.fit(df_train)
df_pred = pipeline_model.transform(df_test)

In [None]:
evaluator_accuracy = MulticlassClassificationEvaluator(metricName='accuracy')
evaluator_f1 = MulticlassClassificationEvaluator(metricName='f1')
evaluator_precision = MulticlassClassificationEvaluator(metricName='weightedPrecision')
evaluator_recall = MulticlassClassificationEvaluator(metricName='weightedRecall')
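
To clarify what these four metric names measure, the plain-Python sketch below computes accuracy and the support-weighted precision, recall and F1 from per-class counts. It is an illustration under the assumption that Spark's `f1`, `weightedPrecision` and `weightedRecall` are averages weighted by the number of true instances of each class; `weighted_metrics` is a hypothetical helper and the labels are made up.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus precision/recall/F1 averaged over classes,
    each class weighted by its share of the true labels."""
    n = len(y_true)
    support = Counter(y_true)
    result = {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n,
        "precision": 0.0, "recall": 0.0, "f1": 0.0,
    }
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        precision = tp / predicted if predicted else 0.0
        recall = tp / count
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = count / n
        result["precision"] += weight * precision
        result["recall"] += weight * recall
        result["f1"] += weight * f1
    return result


y_true = [0, 0, 1, 1, 2]  # made-up class indices
y_pred = [0, 1, 1, 1, 2]
scores = weighted_metrics(y_true, y_pred)
print(scores)
```

With support weighting, the weighted recall always equals the accuracy, so those two numbers carry the same information.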

In [None]:
print('accuracy', evaluator_accuracy.evaluate(df_pred))
print('f1', evaluator_f1.evaluate(df_pred))
print('precision', evaluator_precision.evaluate(df_pred))
print('recall', evaluator_recall.evaluate(df_pred))



## Grid search and cross-validation

In [None]:
paramGrid = (
    ParamGridBuilder()
    .addGrid(classifier.numTrees, [5, 10, 15, 20, 25, 30]) # default is 20
    .addGrid(classifier.maxDepth, [3, 5, 10, 15]) # default is 5 (allowed range: 0 to 30)
    .build()
)
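
It helps to estimate the cost of the search before running it. The sketch below enumerates the same grid in plain Python and counts the pipeline fits, assuming one fit per parameter combination per fold plus one final refit of the best combination on the full training set; the variable names are illustrative, and `num_folds = 3` matches the `numFolds` value used in this notebook.

```python
from itertools import product

# Same candidate values as the parameter grid above
grid = {
    "numTrees": [5, 10, 15, 20, 25, 30],
    "maxDepth": [3, 5, 10, 15],
}

# Cartesian product of all candidate values: one dict per combination
param_maps = [dict(zip(grid, combo)) for combo in product(*grid.values())]

num_folds = 3
fits_during_cv = len(param_maps) * num_folds
total_fits = fits_during_cv + 1  # plus the final refit of the best combination

print(len(param_maps), fits_during_cv, total_fits)
```

Each added value in one of the lists multiplies the number of combinations, so the grid is worth keeping small while the pipeline is still being debugged.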

In [None]:
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid, # parameter combinations for the grid search
    evaluator=evaluator_f1,
    numFolds=3, # default is 3
    parallelism=4, 
    seed=42
)
cv_model = crossval.fit(df_train)
df_pred = cv_model.transform(df_test)

In [None]:
print('accuracy', evaluator_accuracy.evaluate(df_pred))
print('f1', evaluator_f1.evaluate(df_pred))
print('precision', evaluator_precision.evaluate(df_pred))
print('recall', evaluator_recall.evaluate(df_pred))



In [None]:
best_model = cv_model.bestModel
best_rf =best_model.stages[-1]
print(best_rf.extractParamMap())
print(best_rf.getNumTrees)
print(best_rf.getOrDefault('maxDepth'))
print(best_rf.featureImportances)


## Export the model

In [None]:
pipeline_model.write().overwrite().save('pipeline_spark')