# Decision Trees and Derived Methods: Example 2

A dataset is used to classify universities as private or public, based on the attributes below. In the CSV file loaded later, the dots in these names are replaced by underscores (for example, `F_Undergrad`).
* Apps: Number of applications received
* Accept: Number of applications accepted
* Enroll: Number of new students enrolled
* Top10perc: Percentage of new students from the top 10% of their high school class
* Top25perc: Percentage of new students from the top 25% of their high school class
* F.Undergrad: Number of full-time undergraduate students
* P.Undergrad: Number of part-time undergraduate students
* Outstate: Out-of-state tuition
* Room.Board: Room and board costs
* Books: Estimated book costs
* Personal: Estimated personal spending
* PhD: Percentage of faculty with a Ph.D.
* Terminal: Percentage of faculty with a terminal degree
* S.F.Ratio: Student/faculty ratio
* perc.alumni: Percentage of alumni who donate
* Expend: Institutional expenditure per student
* Grad.Rate: Graduation rate

In [1]:
# Only needed when running on Colab
!pip install pyspark

Successfully installed py4j-0.10.9 pyspark-3.1.2


In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Ejemplo-Arboles2').getOrCreate()

In [3]:
# Load the data, inferring column types and reading the header row
df = spark.read.csv('/content/College.csv', inferSchema=True, header=True)

# Print the inferred schema
df.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)



In [4]:
# Show a few sample rows
df.show(5)

+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+
|              School|Private|Apps|Accept|Enroll|Top10perc|Top25perc|F_Undergrad|P_Undergrad|Outstate|Room_Board|Books|Personal|PhD|Terminal|S_F_Ratio|perc_alumni|Expend|Grad_Rate|
+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+
|Abilene Christian...|    Yes|1660|  1232|   721|       23|       52|       2885|        537|    7440|      3300|  450|    2200| 70|      78|     18.1|         12|  7041|       60|
|  Adelphi University|    Yes|2186|  1924|   512|       16|       29|       2683|       1227|   12280|      6450|  750|    1500| 29|      30|     12.2|         16| 10527|       56|
|      Adrian College|    Yes|1428|  1097|   336|       22|       50|       1036|         99|  

### Data Preprocessing

In [5]:
# Spark ML expects the data in the form ("label", "features")
# VectorAssembler builds the single "features" vector column

from pyspark.ml.feature import VectorAssembler

In [6]:
# List the available columns
df.columns

['School',
 'Private',
 'Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate']
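
Since the features are every column except `School` and `Private`, the list passed to `VectorAssembler` could also be derived from the column names instead of being typed by hand. The following is a minimal sketch in plain Python; the names `excluded` and `feature_cols` are illustrative and do not appear in the notebook.

```python
# Column names as returned by df.columns in the cell above
columns = ['School', 'Private', 'Apps', 'Accept', 'Enroll', 'Top10perc',
           'Top25perc', 'F_Undergrad', 'P_Undergrad', 'Outstate',
           'Room_Board', 'Books', 'Personal', 'PhD', 'Terminal',
           'S_F_Ratio', 'perc_alumni', 'Expend', 'Grad_Rate']

# Illustrative names: drop the identifier column and the target column
excluded = {'School', 'Private'}
feature_cols = [c for c in columns if c not in excluded]

print(len(feature_cols))   # 17
print(feature_cols[:3])    # ['Apps', 'Accept', 'Enroll']
```

With a live session, the same list could then be passed as `VectorAssembler(inputCols=feature_cols, outputCol="features")`.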

In [7]:
# Use every column except the first two (School and Private)
assembler = VectorAssembler(inputCols=['Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc', 'F_Undergrad',
                                       'P_Undergrad', 'Outstate', 'Room_Board', 'Books', 'Personal', 'PhD',
                                       'Terminal', 'S_F_Ratio', 'perc_alumni', 'Expend', 'Grad_Rate'],
                            outputCol="features")

# Transform the data (adds the "features" column)
df2 = assembler.transform(df)
df2.show(5)

+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+--------------------+
|              School|Private|Apps|Accept|Enroll|Top10perc|Top25perc|F_Undergrad|P_Undergrad|Outstate|Room_Board|Books|Personal|PhD|Terminal|S_F_Ratio|perc_alumni|Expend|Grad_Rate|            features|
+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+--------------------+
|Abilene Christian...|    Yes|1660|  1232|   721|       23|       52|       2885|        537|    7440|      3300|  450|    2200| 70|      78|     18.1|         12|  7041|       60|[1660.0,1232.0,72...|
|  Adelphi University|    Yes|2186|  1924|   512|       16|       29|       2683|       1227|   12280|      6450|  750|    1500| 29|      30|     12.2|         16| 10527|       56|[2186.0,1924

In [8]:
# Convert the target column (Private: Yes/No) from categorical to a numeric index
from pyspark.ml.feature import StringIndexer

# Object that performs the indexing
indexer = StringIndexer(inputCol="Private", outputCol="PrivateIndex")
# Fit on the data and apply the mapping
df3 = indexer.fit(df2).transform(df2)

df3.show(5)

+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+--------------------+------------+
|              School|Private|Apps|Accept|Enroll|Top10perc|Top25perc|F_Undergrad|P_Undergrad|Outstate|Room_Board|Books|Personal|PhD|Terminal|S_F_Ratio|perc_alumni|Expend|Grad_Rate|            features|PrivateIndex|
+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+--------------------+------------+
|Abilene Christian...|    Yes|1660|  1232|   721|       23|       52|       2885|        537|    7440|      3300|  450|    2200| 70|      78|     18.1|         12|  7041|       60|[1660.0,1232.0,72...|         0.0|
|  Adelphi University|    Yes|2186|  1924|   512|       16|       29|       2683|       1227|   12280|      6450|  750|    1500| 29|      30
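
By default `StringIndexer` orders labels by descending frequency, so the most common value of `Private` receives index 0.0. That is consistent with the rows shown, where `Yes` maps to `0.0`, and it means label `1.0` stands for public universities in everything that follows. The sketch below imitates that ordering in plain Python; the function name and the sample list are made up for illustration and are not the real class counts.

```python
from collections import Counter

def frequency_desc_index(values):
    """Map each distinct string to a float index, most frequent first.

    Ties are broken alphabetically in this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical sample of the Private column
sample = ["Yes", "Yes", "No", "Yes", "No", "Yes"]
mapping = frequency_desc_index(sample)
print(mapping)  # {'Yes': 0.0, 'No': 1.0}
```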

In [9]:
# Keep only the columns the classifiers need: features and label
df = df3.select("features", 'PrivateIndex')

df.show(5)

+--------------------+------------+
|            features|PrivateIndex|
+--------------------+------------+
|[1660.0,1232.0,72...|         0.0|
|[2186.0,1924.0,51...|         0.0|
|[1428.0,1097.0,33...|         0.0|
|[417.0,349.0,137....|         0.0|
|[193.0,146.0,55.0...|         0.0|
+--------------------+------------+
only showing top 5 rows



In [10]:
# Split into training (70%) and test (30%) sets
df_train, df_test = df.randomSplit([0.7,0.3])
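
The split above is random and unseeded, so the rows in each set, and therefore the metrics reported later, will differ from run to run. `randomSplit` accepts an optional `seed` argument; a small wrapper along these lines would make the split repeatable. The function name and the seed value 42 are arbitrary choices for this sketch.

```python
def split_train_test(df, train_fraction=0.7, seed=42):
    """Split a DataFrame into (train, test) using a fixed random seed."""
    weights = [train_fraction, 1.0 - train_fraction]
    return df.randomSplit(weights, seed=seed)

# Usage with the notebook's DataFrame:
# df_train, df_test = split_train_test(df)
```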

### Classifiers

In [11]:
from pyspark.ml.classification import (DecisionTreeClassifier,
                                       GBTClassifier,
                                       RandomForestClassifier)
from pyspark.ml import Pipeline

In [12]:
# Create the three models
dt = DecisionTreeClassifier(labelCol='PrivateIndex',featuresCol='features')
rf = RandomForestClassifier(labelCol='PrivateIndex',featuresCol='features', numTrees=100)
gb = GBTClassifier(labelCol='PrivateIndex',featuresCol='features')

In [13]:
# Train the models
modelo_DT = dt.fit(df_train)
modelo_RF = rf.fit(df_train)
modelo_GB = gb.fit(df_train)

### Model Comparison

In [14]:
preds_DT = modelo_DT.transform(df_test)

#preds_DT.printSchema()
preds_DT.select("prediction", "PrivateIndex", "features").show(5)

+----------+------------+--------------------+
|prediction|PrivateIndex|            features|
+----------+------------+--------------------+
|       0.0|         0.0|[167.0,130.0,46.0...|
|       0.0|         0.0|[174.0,146.0,88.0...|
|       0.0|         0.0|[213.0,166.0,85.0...|
|       0.0|         1.0|[233.0,233.0,153....|
|       0.0|         0.0|[244.0,198.0,82.0...|
+----------+------------+--------------------+
only showing top 5 rows



In [15]:
preds_RF = modelo_RF.transform(df_test)

preds_RF.select("prediction", "PrivateIndex", "features").show(5)

+----------+------------+--------------------+
|prediction|PrivateIndex|            features|
+----------+------------+--------------------+
|       0.0|         0.0|[167.0,130.0,46.0...|
|       0.0|         0.0|[174.0,146.0,88.0...|
|       0.0|         0.0|[213.0,166.0,85.0...|
|       0.0|         1.0|[233.0,233.0,153....|
|       0.0|         0.0|[244.0,198.0,82.0...|
+----------+------------+--------------------+
only showing top 5 rows



In [16]:
preds_GB = modelo_GB.transform(df_test)

preds_GB.select("prediction", "PrivateIndex", "features").show(5)

+----------+------------+--------------------+
|prediction|PrivateIndex|            features|
+----------+------------+--------------------+
|       0.0|         0.0|[167.0,130.0,46.0...|
|       0.0|         0.0|[174.0,146.0,88.0...|
|       0.0|         0.0|[213.0,166.0,85.0...|
|       0.0|         1.0|[233.0,233.0,153....|
|       0.0|         0.0|[244.0,198.0,82.0...|
+----------+------------+--------------------+
only showing top 5 rows
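
In the five rows displayed, all three models agree, and each misclassifies the fourth row (label `1.0`, predicted `0.0`). A tally of (label, prediction) pairs makes such comparisons explicit. The sketch below does this in plain Python for those five rows only, so the resulting figure says nothing about the full test set.

```python
from collections import Counter

# (PrivateIndex, prediction) pairs copied from the five rows shown above
rows = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0)]

# Count each (label, prediction) combination: a tiny confusion matrix
confusion = Counter(rows)
correct = sum(n for (label, pred), n in confusion.items() if label == pred)
accuracy = correct / len(rows)

print(confusion[(0.0, 0.0)], confusion[(1.0, 0.0)])  # 4 1
print(accuracy)  # 0.8
```

On the full predictions, the same tally could be obtained in Spark with `preds_DT.groupBy("PrivateIndex", "prediction").count().show()`.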



### Evaluation Metrics

In [None]:
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)

# Evaluator: accuracy
evaluadorEX = MulticlassClassificationEvaluator(labelCol="PrivateIndex", 
                                                predictionCol="prediction", 
                                                metricName='accuracy')

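# Evaluator: area under the ROC curve (AUC)
# Note: the hard 0/1 "prediction" column is passed as the score here,
# so the ROC curve has a single operating point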
evaluadorAUC = BinaryClassificationEvaluator(labelCol="PrivateIndex", 
                                             rawPredictionCol="prediction",
                                             metricName="areaUnderROC")
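
Because the AUC evaluator above receives the hard `prediction` column, every row has a score of either 0 or 1. In that case the area under the ROC curve reduces to the average of the true positive rate and the true negative rate, which is generally lower than the AUC computed from continuous scores. The plain-Python sketch below illustrates the difference; the labels and scores are made up, and `auc` is a helper written for this example, not a Spark function.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.1]
# Hard decisions at a 0.5 threshold, like the "prediction" column
hard = [1.0 if s >= 0.5 else 0.0 for s in scores]

auc_scores = auc(labels, scores)   # 7 of 9 pairs ranked correctly
auc_hard = auc(labels, hard)       # collapses to (TPR + TNR) / 2

tpr = sum(1 for y, h in zip(labels, hard) if y == 1 and h == 1.0) / 3
tnr = sum(1 for y, h in zip(labels, hard) if y == 0 and h == 0.0) / 3
print(round(auc_scores, 3), round(auc_hard, 3), round((tpr + tnr) / 2, 3))
```

Spark's tree-based classifiers should also emit a `rawPrediction` column, which is the default `rawPredictionCol` of `BinaryClassificationEvaluator`; using it would rank rows by score rather than by the 0/1 decision, and would change the AUC values reported in this notebook.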

In [None]:
# Metrics for the decision tree
exactitud_dt = evaluadorEX.evaluate(preds_DT)
auc_dt = evaluadorAUC.evaluate(preds_DT)

print("Decision tree: accuracy={}, AUC={:.3f}".format(exactitud_dt, auc_dt))

Decision tree: accuracy=0.9008620689655172, AUC=0.871


In [None]:
# Metrics for the random forest
exactitud_rf = evaluadorEX.evaluate(preds_RF)
auc_rf = evaluadorAUC.evaluate(preds_RF)

print("Random forest: accuracy={:.3f}, AUC={:.3f}".format(exactitud_rf, auc_rf))

Random forest: accuracy=0.935, AUC=0.900


In [None]:
# Metrics for gradient boosting
exactitud_gb = evaluadorEX.evaluate(preds_GB)
auc_gb = evaluadorAUC.evaluate(preds_GB)

print("Gradient boosting: accuracy={:.3f}, AUC={:.3f}".format(exactitud_gb, auc_gb))

Gradient boosting: accuracy=0.897, AUC=0.875


In [17]:
modelo_RF.featureImportances

SparseVector(17, {0: 0.0298, 1: 0.0723, 2: 0.1142, 3: 0.0117, 4: 0.0063, 5: 0.2488, 6: 0.0948, 7: 0.1916, 8: 0.0805, 9: 0.0057, 10: 0.0063, 11: 0.0157, 12: 0.0211, 13: 0.0316, 14: 0.0174, 15: 0.0358, 16: 0.0163})

In [18]:
modelo_GB.featureImportances

SparseVector(17, {0: 0.0497, 1: 0.0029, 2: 0.0265, 3: 0.0393, 4: 0.0306, 5: 0.3715, 6: 0.0393, 7: 0.2393, 8: 0.0551, 9: 0.0115, 10: 0.0217, 11: 0.0131, 12: 0.0269, 13: 0.0325, 14: 0.0139, 15: 0.0117, 16: 0.0144})

In [19]:
modelo_DT.featureImportances

SparseVector(17, {0: 0.0214, 2: 0.0086, 3: 0.0083, 4: 0.0154, 5: 0.5424, 6: 0.0431, 7: 0.2615, 8: 0.031, 12: 0.0303, 14: 0.0109, 15: 0.0271})
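
The importance vectors above are indexed by position in the `inputCols` list given to `VectorAssembler`, so index 5 is `F_Undergrad` and index 7 is `Outstate`. Pairing indices with names makes the output easier to read. The sketch below does so in plain Python using the random forest values copied from the output above; `rank_importances` is a helper name introduced for this example.

```python
# Same order as the inputCols list passed to VectorAssembler
feature_cols = ['Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc',
                'F_Undergrad', 'P_Undergrad', 'Outstate', 'Room_Board',
                'Books', 'Personal', 'PhD', 'Terminal', 'S_F_Ratio',
                'perc_alumni', 'Expend', 'Grad_Rate']

# Values copied from the modelo_RF.featureImportances output
rf_importances = {0: 0.0298, 1: 0.0723, 2: 0.1142, 3: 0.0117, 4: 0.0063,
                  5: 0.2488, 6: 0.0948, 7: 0.1916, 8: 0.0805, 9: 0.0057,
                  10: 0.0063, 11: 0.0157, 12: 0.0211, 13: 0.0316,
                  14: 0.0174, 15: 0.0358, 16: 0.0163}

def rank_importances(names, importances):
    """Return (name, importance) pairs, most important first.

    Indices missing from a sparse vector are treated as 0.0.
    """
    pairs = [(names[i], importances.get(i, 0.0)) for i in range(len(names))]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

for name, value in rank_importances(feature_cols, rf_importances)[:3]:
    print(f"{name}: {value:.4f}")
# F_Undergrad: 0.2488
# Outstate: 0.1916
# Enroll: 0.1142
```

All three models rank `F_Undergrad` and `Outstate` as the two most important features. With a live model, `modelo_RF.featureImportances.toArray()` should give the values as a dense array that can be zipped with `feature_cols` directly.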