# Intro to Machine Learning Big Data

Below we carry out a basic classification exercise; later on we will cover other details of supervised learning and deep learning in the Big Data setting.

We will use a dataset with information on customers who have applied for bank loans. Our class, that is, the output of the ML system, is a binary value indicating whether, based on the features, the customer was granted the loan or not.

In general terms, and as a review, the stages we will go through are:

* Load the dataset.
* Carry out an exploratory analysis of the data.
* Apply any transformation that may be needed.
* Split the data into *training* and *testing* sets.
* Train and evaluate the model.
* Carry out any hyperparameter tuning.
* Build the final model with the best parameters.


To begin, we initialize the Spark session object:


In [1]:
#import SparkSession
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('binary_class').getOrCreate()

Next we use Spark to load the dataset and create the Spark "DataFrame" data structure:


In [2]:
#read the dataset
df=spark.read.csv('classification_data.csv',inferSchema=True,header=True)

Next, we explore the data.

We start by checking the size of the data:

In [3]:
#check the shape of the data 
print((df.count(),len(df.columns)))

(46751, 12)


We explore the features and the class, in this case a binary class:

In [4]:
#printSchema
df.printSchema()

root
 |-- loan_id: string (nullable = true)
 |-- loan_purpose: string (nullable = true)
 |-- is_first_loan: integer (nullable = true)
 |-- total_credit_card_limit: integer (nullable = true)
 |-- avg_percentage_credit_card_limit_used_last_year: double (nullable = true)
 |-- saving_amount: integer (nullable = true)
 |-- checking_amount: integer (nullable = true)
 |-- is_employed: integer (nullable = true)
 |-- yearly_salary: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- dependent_number: integer (nullable = true)
 |-- label: integer (nullable = true)



In [5]:
#list the column names in the dataset
df.columns

['loan_id',
 'loan_purpose',
 'is_first_loan',
 'total_credit_card_limit',
 'avg_percentage_credit_card_limit_used_last_year',
 'saving_amount',
 'checking_amount',
 'is_employed',
 'yearly_salary',
 'age',
 'dependent_number',
 'label']

We view the data as a Spark DataFrame:

In [6]:
#view the data
df.show()

+-------+------------+-------------+-----------------------+-----------------------------------------------+-------------+---------------+-----------+-------------+---+----------------+-----+
|loan_id|loan_purpose|is_first_loan|total_credit_card_limit|avg_percentage_credit_card_limit_used_last_year|saving_amount|checking_amount|is_employed|yearly_salary|age|dependent_number|label|
+-------+------------+-------------+-----------------------+-----------------------------------------------+-------------+---------------+-----------+-------------+---+----------------+-----+
|    A_1|    personal|            1|                   7900|                                            0.8|         1103|           6393|          1|        16400| 42|               4|    0|
|    A_2|    personal|            0|                   3300|                                           0.29|         2588|            832|          1|        75500| 56|               1|    0|
|    A_3|    personal|            0|    

To get a better view of the data in terms of format and ordering, we can produce the same view as above, this time as a Pandas DataFrame:

In [7]:
df.limit(5).toPandas().head()

| | loan_id | loan_purpose | is_first_loan | total_credit_card_limit | avg_percentage_credit_card_limit_used_last_year | saving_amount | checking_amount | is_employed | yearly_salary | age | dependent_number | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A_1 | personal | 1 | 7900 | 0.8 | 1103 | 6393 | 1 | 16400 | 42 | 4 | 0 |
| 1 | A_2 | personal | 0 | 3300 | 0.29 | 2588 | 832 | 1 | 75500 | 56 | 1 | 0 |
| 2 | A_3 | personal | 0 | 7600 | 0.9 | 1651 | 8868 | 1 | 59000 | 46 | 1 | 0 |
| 3 | A_4 | personal | 1 | 3400 | 0.38 | 1269 | 6863 | 1 | 26000 | 55 | 8 | 0 |
| 4 | A_5 | emergency | 0 | 2600 | 0.89 | 1310 | 3423 | 1 | 9700 | 41 | 4 | 1 |


We compute summary statistics for the data:

In [8]:
#Exploratory Data Analysis
df.describe().show()


+-------+-------+------------+------------------+-----------------------+-----------------------------------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+
|summary|loan_id|loan_purpose|     is_first_loan|total_credit_card_limit|avg_percentage_credit_card_limit_used_last_year|     saving_amount|   checking_amount|       is_employed|     yearly_salary|               age|  dependent_number|              label|
+-------+-------+------------+------------------+-----------------------+-----------------------------------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+
|  count|  46751|       46751|             46751|                  46751|                                          46751|             46751|             46751|             46751|             46751|             46751|             467

We check how the rows are distributed across the output (class):

In [9]:
df.groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|    1|16201|
|    0|30550|
+-----+-----+



Note that slightly more than one third of the customers (16,201 of 46,751, about 34.7%) had their loan application approved.

And which loan purposes are the most common? The following command gives us that information:
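As a quick check of the class balance, the shares can be worked out directly from the two counts printed above. The sketch below is plain Python and does not touch Spark; the variable names are our own, and the "majority-class baseline" is just the accuracy a model would get by always predicting the most frequent label.

```python
# Class counts copied from the groupBy('label') output above.
label_counts = {1: 16201, 0: 30550}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

for label, share in sorted(shares.items()):
    print(f"label={label}: {share:.1%}")

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = max(shares.values())
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any accuracy reported later is easier to interpret when compared against this baseline.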

In [52]:
df.groupBy('loan_purpose').count().show()

+------------+-----+
|loan_purpose|count|
+------------+-----+
|      others| 6763|
|   emergency| 7562|
|    property|11388|
|  operations|10580|
|    personal|10458|
+------------+-----+



Next, we check which data require some kind of transformation. In our case almost all of the relevant features are numeric, with the exception of *"loan_purpose"*, which is a relevant feature but requires a One Hot Encoder transformation:

In [11]:
#converting categorical data to numerical form

In [12]:
#import required libraries
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler



In [13]:
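#index the categories by frequency (the most frequent category gets index 0.0)
loan_purpose_indexer = StringIndexer(inputCol="loan_purpose", outputCol="loan_index").fit(df)
df = loan_purpose_indexer.transform(df)
#one-hot encode the index; with the default dropLast=True the last category becomes an all-zero vector
loan_encoder = OneHotEncoder(inputCol="loan_index", outputCol="loan_purpose_vec")
#note: in Spark 3.x OneHotEncoder is an estimator, so use loan_encoder.fit(df).transform(df) instead
df = loan_encoder.transform(df)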

In [53]:
df.select(['loan_purpose','loan_index','loan_purpose_vec']).show(20,False)

+------------+----------+----------------+
|loan_purpose|loan_index|loan_purpose_vec|
+------------+----------+----------------+
|personal    |2.0       |(4,[2],[1.0])   |
|personal    |2.0       |(4,[2],[1.0])   |
|personal    |2.0       |(4,[2],[1.0])   |
|personal    |2.0       |(4,[2],[1.0])   |
|emergency   |3.0       |(4,[3],[1.0])   |
|operations  |1.0       |(4,[1],[1.0])   |
|operations  |1.0       |(4,[1],[1.0])   |
|personal    |2.0       |(4,[2],[1.0])   |
|personal    |2.0       |(4,[2],[1.0])   |
|personal    |2.0       |(4,[2],[1.0])   |
|others      |4.0       |(4,[],[])       |
|operations  |1.0       |(4,[1],[1.0])   |
|emergency   |3.0       |(4,[3],[1.0])   |
|property    |0.0       |(4,[0],[1.0])   |
|others      |4.0       |(4,[],[])       |
|emergency   |3.0       |(4,[3],[1.0])   |
|personal    |2.0       |(4,[2],[1.0])   |
|emergency   |3.0       |(4,[3],[1.0])   |
|operations  |1.0       |(4,[1],[1.0])   |
|personal    |2.0       |(4,[2],[1.0])   |
+----------
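The output above can look puzzling at first: "others" gets index 4.0 but an empty vector. The following plain-Python sketch reproduces the mapping under two assumptions about the defaults used here: StringIndexer orders labels by descending frequency, and OneHotEncoder drops the last category. The helper `one_hot_drop_last` is our own illustration, not a Spark function.

```python
# Category counts copied from the groupBy('loan_purpose') output above.
purpose_counts = {
    "others": 6763,
    "emergency": 7562,
    "property": 11388,
    "operations": 10580,
    "personal": 10458,
}

# Emulate StringIndexer's default ordering: most frequent label gets index 0.
ordered = sorted(purpose_counts, key=purpose_counts.get, reverse=True)
index_of = {purpose: float(i) for i, purpose in enumerate(ordered)}


def one_hot_drop_last(index, num_categories):
    """Dense one-hot vector with the last category dropped (all zeros)."""
    vector = [0.0] * (num_categories - 1)
    if int(index) < num_categories - 1:
        vector[int(index)] = 1.0
    return vector


for purpose in ordered:
    idx = index_of[purpose]
    print(purpose, idx, one_hot_drop_last(idx, len(ordered)))
```

Dropping the last category keeps the encoded columns from summing to one in every row, which matters for linear models such as logistic regression.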

Now we use "VectorAssembler" to combine all the features into a single vector column, which our model will use for training (the class stays in its own "label" column):
   

In [15]:
from pyspark.ml.feature import VectorAssembler

We check the columns in the DataFrame after applying the one-hot encoding:

In [16]:
df.columns

['loan_id',
 'loan_purpose',
 'is_first_loan',
 'total_credit_card_limit',
 'avg_percentage_credit_card_limit_used_last_year',
 'saving_amount',
 'checking_amount',
 'is_employed',
 'yearly_salary',
 'age',
 'dependent_number',
 'label',
 'loan_index',
 'loan_purpose_vec']

We build the vector:

In [17]:
df_assembler = VectorAssembler(inputCols=['is_first_loan',
 'total_credit_card_limit',
 'avg_percentage_credit_card_limit_used_last_year',
 'saving_amount',
 'checking_amount',
 'is_employed',
 'yearly_salary',
 'age',
 'dependent_number',
 'loan_purpose_vec'], outputCol="features")
df = df_assembler.transform(df)

We check the schema:

In [18]:
df.printSchema()

root
 |-- loan_id: string (nullable = true)
 |-- loan_purpose: string (nullable = true)
 |-- is_first_loan: integer (nullable = true)
 |-- total_credit_card_limit: integer (nullable = true)
 |-- avg_percentage_credit_card_limit_used_last_year: double (nullable = true)
 |-- saving_amount: integer (nullable = true)
 |-- checking_amount: integer (nullable = true)
 |-- is_employed: integer (nullable = true)
 |-- yearly_salary: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- dependent_number: integer (nullable = true)
 |-- label: integer (nullable = true)
 |-- loan_index: double (nullable = false)
 |-- loan_purpose_vec: vector (nullable = true)
 |-- features: vector (nullable = true)



We print the first 20 rows; note how the categorical value now appears in vectorized form:

In [54]:
df.select(['features','label']).show(20,False)

+--------------------------------------------------------------------+-----+
|features                                                            |label|
+--------------------------------------------------------------------+-----+
|[1.0,7900.0,0.8,1103.0,6393.0,1.0,16400.0,42.0,4.0,0.0,0.0,1.0,0.0] |0    |
|[0.0,3300.0,0.29,2588.0,832.0,1.0,75500.0,56.0,1.0,0.0,0.0,1.0,0.0] |0    |
|[0.0,7600.0,0.9,1651.0,8868.0,1.0,59000.0,46.0,1.0,0.0,0.0,1.0,0.0] |0    |
|[1.0,3400.0,0.38,1269.0,6863.0,1.0,26000.0,55.0,8.0,0.0,0.0,1.0,0.0]|0    |
|[0.0,2600.0,0.89,1310.0,3423.0,1.0,9700.0,41.0,4.0,0.0,0.0,0.0,1.0] |1    |
|[0.0,7600.0,0.51,1040.0,2406.0,1.0,22900.0,52.0,0.0,0.0,1.0,0.0,0.0]|0    |
|[1.0,6900.0,0.82,2408.0,5556.0,1.0,34800.0,48.0,4.0,0.0,1.0,0.0,0.0]|0    |
|[0.0,5700.0,0.56,1933.0,4139.0,1.0,32500.0,64.0,2.0,0.0,0.0,1.0,0.0]|0    |
|[1.0,3400.0,0.95,3866.0,4131.0,1.0,13300.0,23.0,3.0,0.0,0.0,1.0,0.0]|0    |
|[0.0,2900.0,0.91,88.0,2725.0,1.0,21100.0,52.0,1.0,0.0,0.0,1.0,0.0]  |1    |

We now proceed to build the machine learning model:

In [20]:
#select data for building model
model_df=df.select(['features','label'])

In [21]:
from pyspark.ml.classification import LogisticRegression

We split the dataset:

In [22]:
#split the data 
training_df,test_df = model_df.randomSplit([0.75,0.25])
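Because the split is random, the resulting sizes are only approximately 75/25 and change between runs unless a seed is fixed; `randomSplit` accepts an optional `seed` argument for that purpose. The plain-Python sketch below illustrates the idea only. It is not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split with probability proportional to weights."""
    rng = random.Random(seed)
    total = sum(weights)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for i, weight in enumerate(weights):
            cumulative += weight
            if draw < cumulative:
                splits[i].append(row)
                break
        else:
            # Guard against floating-point rounding at the upper edge.
            splits[-1].append(row)
    return splits


rows = list(range(46751))  # stand-in row ids; same row count as the dataset
train_a, test_a = random_split(rows, [0.75, 0.25], seed=42)
train_b, test_b = random_split(rows, [0.75, 0.25], seed=42)

print(len(train_a), len(test_a))  # close to, but not exactly, 75% / 25%
print(train_a == train_b)         # same seed, same split
```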

We explore the training and test sets:

In [23]:
training_df.count()

34970

In [24]:
training_df.groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|    1|12137|
|    0|22833|
+-----+-----+



In [25]:
test_df.count()

11781

In [26]:
test_df.groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|    1| 4064|
|    0| 7717|
+-----+-----+



We apply logistic regression.

**TRAINING**

In [27]:
log_reg=LogisticRegression().fit(training_df)

In [28]:
#Training Results

In [29]:
lr_summary=log_reg.summary

In [30]:
lr_summary.accuracy

0.8933943380040035

In [31]:
lr_summary.areaUnderROC

0.9584570193368336

In [32]:
print(lr_summary.precisionByLabel)

[0.9229951733604924, 0.8394284330346331]


In [33]:
print(lr_summary.recallByLabel)

[0.9128892392589673, 0.8567191233418472]


In [34]:
predictions = log_reg.transform(test_df)
predictions.show(10)


+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|(13,[0,1,2,3,4,7]...|    1|[-1.5737970518925...|[0.17167576548564...|       1.0|
|(13,[0,1,2,3,4,7,...|    1|[-4.5573852767910...|[0.01038056372090...|       1.0|
|(13,[0,1,2,3,4,7,...|    1|[-6.1031900223565...|[0.00223073700819...|       1.0|
|(13,[0,1,2,3,4,7,...|    1|[-6.6930010076807...|[0.00123802266475...|       1.0|
|(13,[0,1,2,3,4,7,...|    1|[-6.7192708021125...|[0.00120596223900...|       1.0|
|(13,[0,1,2,3,4,7,...|    1|[-5.5200562402430...|[0.00398964165953...|       1.0|
|(13,[0,1,2,3,4,7,...|    1|[-6.3680258628245...|[0.00171260455519...|       1.0|
|(13,[0,1,2,3,4,7,...|    1|[-5.1219420736524...|[0.00592906479907...|       1.0|
|(13,[0,1,2,3,4,7,...|    1|[-3.6564198944700...|[0.02517467199970...|       1.0|
|(13,[0,1,2,3,4,

**TESTING**

In [35]:
model_predictions = log_reg.transform(test_df)


In [36]:
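#evaluate() returns a summary of test-set metrics; this reassigns model_predictions from the DataFrame above
model_predictions = log_reg.evaluate(test_df)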


In [37]:
model_predictions.accuracy

0.8950853068500128

In [38]:
model_predictions.weightedPrecision

0.895524739623859

In [39]:
model_predictions.recallByLabel

[0.9157703770895426, 0.8558070866141733]

In [40]:
print(model_predictions.precisionByLabel)

[0.9234287207630995, 0.8425387596899225]


In [41]:
model_predictions.areaUnderROC

0.9601851138553903

Which case does this correspond to: underfitting, a good fit, or overfitting?
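One way to approach the question is to put the training and test metrics side by side. The sketch below uses the values printed above; the threshold `MAX_DROP` is a hypothetical choice of ours, and a small gap only argues against overfitting. Telling a good fit from underfitting still requires comparing the scores against a baseline.

```python
# Metrics copied from the logistic regression outputs above.
train_metrics = {"accuracy": 0.8933943380040035, "areaUnderROC": 0.9584570193368336}
test_metrics = {"accuracy": 0.8950853068500128, "areaUnderROC": 0.9601851138553903}

# Hypothetical threshold: a drop larger than this from train to test is flagged.
MAX_DROP = 0.02

for name in train_metrics:
    drop = train_metrics[name] - test_metrics[name]
    flag = "possible overfitting" if drop > MAX_DROP else "no large train/test gap"
    print(f"{name}: train={train_metrics[name]:.4f} test={test_metrics[name]:.4f} "
          f"drop={drop:+.4f} -> {flag}")
```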

We now turn to hyperparameter tuning, for which we use a more advanced method, RandomForest. First we train and test with the default hyperparameters:

In [42]:
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier()
rf_model = rf.fit(training_df)


In [43]:
model_predictions = rf_model.transform(test_df)


The parameters to tune are **maxDepth**, **maxBins** and **numTrees**. Five-fold cross-validation is used to determine the best model: in each round, 4 parts are used for training and 1 for validation.



**DEPENDING ON THE AVAILABLE COMPUTING POWER, EXECUTION MAY TAKE AROUND ONE HOUR**
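The long running time follows from the size of the search. As a rough count, assuming the grid values and fold count used in the next cell, the number of model fits can be worked out as follows:

```python
from itertools import product

# Candidate values copied from the ParamGridBuilder call in the next cell.
grid = {
    "maxDepth": [5, 10, 20, 25, 30],
    "maxBins": [20, 30, 40],
    "numTrees": [5, 20, 50],
}
num_folds = 5

combinations = list(product(*grid.values()))
fits_during_cv = len(combinations) * num_folds
# CrossValidator refits the best parameter combination on the full training set.
total_fits = fits_during_cv + 1

print(f"{len(combinations)} combinations x {num_folds} folds = {fits_during_cv} fits")
print(f"{total_fits} fits including the final refit")
```

Trimming any one of the three lists shrinks the search multiplicatively, which is usually the simplest way to cut the running time.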

In [44]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
import time


start = time.time()

#the default metric of BinaryClassificationEvaluator is areaUnderROC
evaluator = BinaryClassificationEvaluator()

rf = RandomForestClassifier()
paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [5,10,20,25,30])
             .addGrid(rf.maxBins, [20,30,40 ])
             .addGrid(rf.numTrees, [5, 20,50])
             .build())
cv = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
cv_model = cv.fit(training_df)

end = time.time()
elapsed = end - start  #elapsed wall-clock time in seconds
print(f"Time: {round(elapsed / 60, 2)} min / {round(elapsed, 2)} sec")

Time: 47.33 min / 2839.69 sec


In [45]:
best_rf_model = cv_model.bestModel

We now run testing with the best model obtained:

In [46]:
# Generate predictions for the test set
model_predictions = best_rf_model.transform(test_df)

In [47]:
true_pos=model_predictions.filter(model_predictions['label']==1).filter(model_predictions['prediction']==1).count()
actual_pos=model_predictions.filter(model_predictions['label']==1).count()
pred_pos=model_predictions.filter(model_predictions['prediction']==1).count()

In [48]:
#Recall 
float(true_pos)/(actual_pos)

0.9023129921259843

In [49]:
#Precision on test Data 
float(true_pos)/(pred_pos)

0.848646146725295

Note: review what recall means, and check whether an accuracy percentage can also be obtained.
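As a reminder, recall is the share of actual positives that the model finds, and precision is the share of predicted positives that are correct. Accuracy can be derived from the same three counts once the total number of rows is known. The sketch below is plain Python with a hypothetical helper, `binary_metrics`, and illustrative counts that are not taken from the notebook; in the notebook the total would be `model_predictions.count()`. PySpark's `MulticlassClassificationEvaluator` with `metricName="accuracy"` is another way to obtain the same figure.

```python
def binary_metrics(true_pos, actual_pos, pred_pos, total):
    """Derive recall, precision and accuracy from three counts plus the row total."""
    false_neg = actual_pos - true_pos   # positives the model missed
    false_pos = pred_pos - true_pos     # negatives predicted as positive
    true_neg = total - true_pos - false_neg - false_pos
    return {
        "recall": true_pos / actual_pos,
        "precision": true_pos / pred_pos,
        "accuracy": (true_pos + true_neg) / total,
    }


# Illustrative counts, not taken from the notebook.
metrics = binary_metrics(true_pos=90, actual_pos=100, pred_pos=120, total=300)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```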

In [73]:
model_predictions.limit(20).toPandas().head(20)

| | features | label | rawPrediction | probability | prediction |
|---|---|---|---|---|---|
| 0 | (1.0, 3600.0, 0.88, 1599.0, 3470.0, 0.0, 0.0, ... | 1 | [9.082473629848312, 40.91752637015169] | [0.18164947259696626, 0.8183505274030338] | 1.0 |
| 1 | (1.0, 500.0, 0.78, 779.0, 3461.0, 0.0, 0.0, 45... | 1 | [2.2665492292811757, 47.73345077071883] | [0.045330984585623506, 0.9546690154143765] | 1.0 |
| 2 | (1.0, 500.0, 0.89, 1208.0, 1735.0, 0.0, 0.0, 3... | 1 | [1.1966448465429012, 48.80335515345711] | [0.023932896930858018, 0.9760671030691419] | 1.0 |
| 3 | (1.0, 700.0, 0.72, 1234.0, 437.0, 0.0, 0.0, 32... | 1 | [1.4666908845655189, 48.5333091154345] | [0.029333817691310368, 0.9706661823086897] | 1.0 |
| 4 | (1.0, 800.0, 0.57, 822.0, 46.0, 0.0, 0.0, 43.0... | 1 | [2.450566764893948, 47.549433235106065] | [0.04901133529787895, 0.9509886647021211] | 1.0 |
| 5 | (1.0, 900.0, 0.76, 1165.0, 1136.0, 0.0, 0.0, 5... | 1 | [1.3702831467895977, 48.62971685321043] | [0.027405662935791938, 0.9725943370642081] | 1.0 |
| 6 | (1.0, 1100.0, 1.04, 1093.0, 1795.0, 0.0, 0.0, ... | 1 | [1.1796400060645191, 48.820359993935504] | [0.023592800121290374, 0.9764071998787097] | 1.0 |
| 7 | (1.0, 1200.0, 1.06, 1755.0, 2056.0, 0.0, 0.0, ... | 1 | [2.604731487311582, 47.39526851268844] | [0.052094629746231615, 0.9479053702537684] | 1.0 |
| 8 | (1.0, 1300.0, 0.79, 1886.0, 1977.0, 0.0, 0.0, ... | 1 | [1.3295840411165634, 48.67041595888343] | [0.026591680822331272, 0.9734083191776688] | 1.0 |
| 9 | (1.0, 1500.0, 0.79, 545.0, 7686.0, 0.0, 0.0, 3... | 0 | [42.78919515850288, 7.210804841497128] | [0.8557839031700575, 0.14421609682994255] | 0.0 |
