# MLE challenge - Train model notebook

### Notebook 2

In this notebook, we train the model on only a few features (to limit the time and complexity of solving the challenge). We also show how to persist the model to a file, load it back into memory, and make a prediction.



In [1]:
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext(master="local[1]", appName="CreditRisk_N2_czavalaj")
sqlContext = SQLContext(sc)


Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/11/18 08:01:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


#### Read dataset

In [2]:
# Read every CSV part file in the directory and infer the column types
cRiskDF = sqlContext.read.csv('/home/jovyan/train_model/*.csv', sep=',', header=True, inferSchema=True)
# Treat a missing average of previous loan amounts as 0
cRiskDF = cRiskDF.na.fill(value=0, subset=['avg_amount_loans_previous'])
cRiskDF.printSchema()
cRiskDF.show()

                                                                                

root
 |-- id: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- years_on_the_job: integer (nullable = true)
 |-- nb_previous_loans: integer (nullable = true)
 |-- avg_amount_loans_previous: double (nullable = false)
 |-- flag_own_car: integer (nullable = true)
 |-- status: integer (nullable = true)

+-------+---+----------------+-----------------+-------------------------+------------+------+
|     id|age|years_on_the_job|nb_previous_loans|avg_amount_loans_previous|flag_own_car|status|
+-------+---+----------------+-----------------+-------------------------+------------+------+
|5008804| 33|              12|                0|                      0.0|           1|     0|
|5008804| 33|              12|                1|       102.28336090329137|           1|     0|
|5008804| 33|              12|                2|       119.44270503521452|           1|     0|
|5008804| 33|              12|                3|        117.8730346375606|           1|     0|
|5008804| 33|    

We first define the individual stages (a feature assembler and a label indexer) and later chain them, together with the classifier, into a pipeline.

In [3]:
from pyspark.ml.feature import VectorAssembler

numericCols = ['age', 'years_on_the_job', 'nb_previous_loans', 'avg_amount_loans_previous', 'flag_own_car']
# stage 1: assemble the numeric columns into a single 'features' vector
vectorizer = VectorAssembler(inputCols=numericCols, outputCol='features')
#cRiskDF = vectorizer.transform(cRiskDF)
#cRiskDF.show()
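To make the assembler's role concrete, here is a plain-Python sketch of what it does to a single row: it concatenates the listed columns, in order, into one numeric vector. The helper `assemble_row` is hypothetical and only mimics the behavior; the real work is done by Spark's `VectorAssembler` when the pipeline runs.

```python
# Plain-Python analogy of VectorAssembler (not Spark code).
numeric_cols = ['age', 'years_on_the_job', 'nb_previous_loans',
                'avg_amount_loans_previous', 'flag_own_car']

def assemble_row(row, input_cols):
    """Concatenate the selected columns of one row into a list of floats."""
    return [float(row[col]) for col in input_cols]

# One row taken from the dataset preview above.
row = {'id': 5008804, 'age': 33, 'years_on_the_job': 12,
       'nb_previous_loans': 2,
       'avg_amount_loans_previous': 119.44270503521452,
       'flag_own_car': 1, 'status': 0}

features = assemble_row(row, numeric_cols)
print(features)
```

Note that `id` and `status` are left out of the vector: the first is an identifier and the second is the label.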


In [4]:
from pyspark.ml.feature import StringIndexer
# stage 2: index the 'status' label column into 'labelIndex'
label_stringIdx = StringIndexer(inputCol='status', outputCol='labelIndex')
#cRiskDF = label_stringIdx.fit(cRiskDF).transform(cRiskDF)
#cRiskDF.show()
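`StringIndexer` does not simply copy `status` into `labelIndex`: with its default ordering (`frequencyDesc`), the most frequent value gets index 0.0, the next one 1.0, and so on. The sketch below imitates that rule in plain Python; `fit_label_index` and the sample values are hypothetical, and the tie-breaking is simplified.

```python
from collections import Counter

def fit_label_index(values):
    """Map each distinct value to an index, most frequent value first."""
    counts = Counter(str(v) for v in values)
    # Sort by descending frequency; break ties alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up sample in which status 1 is the majority class.
sample_status = [1, 1, 1, 0, 1, 0]
mapping = fit_label_index(sample_status)
print(mapping)
```

In this made-up sample, status 1 maps to index 0.0, so an index is not guaranteed to equal the original value. After fitting the pipeline, the actual mapping can be read from the `labels` attribute of the fitted indexer stage.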

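Next, we split the data into a training set (70%) and a test set (30%), using a fixed seed so the split is reproducible.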

In [5]:
train_cRiskDF, test_cRiskDF = cRiskDF.randomSplit([0.7,0.3],seed=123)
print(train_cRiskDF.cache().count())
print(test_cRiskDF.count())


                                                                                

543980



233735
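The two counts are close to, but not exactly, 70% and 30% of the 777,715 rows (543,980 is about 69.95%). That is expected: `randomSplit` assigns each row to a split by a random draw, so the weights act as probabilities rather than exact quotas. The sketch below imitates this in plain Python; `random_split` is a hypothetical helper, not Spark's implementation.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split by a seeded random draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative, normalized boundaries, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point remainder.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, test = random_split(range(10_000), [0.7, 0.3], seed=123)
print(len(train), len(test))  # sizes are near 7000 / 3000, not exact
```

Reusing the same seed reproduces the same split, which is why the notebook passes `seed=123`.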


                                                                                

## Train model

In [6]:
# ************ RANDOM FOREST ********************
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


In [15]:
# Configure the RandomForest classifier (20 trees, max depth 5); it is trained later by the pipeline.
rfc = RandomForestClassifier(numTrees=20, maxDepth=5, labelCol='labelIndex', featuresCol='features', predictionCol='prediction')


In [19]:
# Chain the assembler, the label indexer and the forest in a Pipeline
pipeline = Pipeline(stages=[vectorizer, label_stringIdx, rfc])


In [20]:
# Train the model. This also runs the assembler and fits the label indexer.
model = pipeline.fit(train_cRiskDF)
#model = rfc.fit(train_cRiskDF)

                                                                                

In [21]:
# Make predictions.
predictions = model.transform(test_cRiskDF)

# Select example rows to display.
predictions.select('age', 'years_on_the_job','nb_previous_loans','avg_amount_loans_previous','flag_own_car','features','labelIndex', 'prediction').show(5)



+---+----------------+-----------------+-------------------------+------------+--------------------+----------+----------+
|age|years_on_the_job|nb_previous_loans|avg_amount_loans_previous|flag_own_car|            features|labelIndex|prediction|
+---+----------------+-----------------+-------------------------+------------+--------------------+----------+----------+
| 33|              12|                2|       119.44270503521452|           1|[33.0,12.0,2.0,11...|       0.0|       0.0|
| 33|              12|                6|       116.29139797883522|           1|[33.0,12.0,6.0,11...|       0.0|       0.0|
| 33|              12|                9|       127.88591513832655|           1|[33.0,12.0,9.0,12...|       0.0|       0.0|
| 33|              12|               12|       133.61051710521372|           1|[33.0,12.0,12.0,1...|       0.0|       0.0|
| 33|              12|               13|       131.99517467789153|           1|[33.0,12.0,13.0,1...|       1.0|       0.0|
+---+-----------

                                                                                

In [23]:
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="labelIndex", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

rfModel = model.stages[2]
print(rfModel)  # summary only


Test Error = 0.0149357
RandomForestClassificationModel: uid=RandomForestClassifier_6aa661ec2b84, numTrees=20, numClasses=2, numFeatures=5


                                                                                

In [24]:
print(accuracy)

0.9850642821999273
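Accuracy alone can hide poor performance on a rare class. As an illustration, the five preview rows shown earlier contain one row with `labelIndex` 1.0 that the model predicts as 0.0. The plain-Python sketch below computes accuracy and per-class recall for just those five rows; the helper names are hypothetical, and the numbers describe this tiny sample, not the full test set.

```python
def accuracy(pairs):
    """Fraction of (label, prediction) pairs that agree."""
    return sum(1 for label, pred in pairs if label == pred) / len(pairs)

def recall_per_class(pairs):
    """For each true label, the fraction of its rows predicted correctly."""
    totals, hits = {}, {}
    for label, pred in pairs:
        totals[label] = totals.get(label, 0) + 1
        hits[label] = hits.get(label, 0) + (1 if label == pred else 0)
    return {label: hits[label] / totals[label] for label in totals}

# (labelIndex, prediction) for the five rows displayed by predictions.show(5).
sample = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 0.0)]

print(accuracy(sample))          # 0.8
print(recall_per_class(sample))  # {0.0: 1.0, 1.0: 0.0}
```

On the full test set, a comparable check is possible by re-running `MulticlassClassificationEvaluator` with `metricName="weightedRecall"` or `metricName="f1"`.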


## Model persistence

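In [None]:
from pyspark.ml import PipelineModel

In [None]:
# Save the fitted pipeline with Spark's own writer.
# (joblib cannot pickle PySpark models, which wrap JVM objects.)
model.write().overwrite().save('model_risk')

### Load model & predict

In [None]:
my_model = PipelineModel.load('model_risk')

In [None]:
# example dict 'user_id' -> features
d = {
    '5008804': [32, 12, 2, 119.45, 1],
    '5008807': [29, 2, 1, 100, 0]
}

In [None]:
# Build a one-row DataFrame with the pipeline's input columns, then score it.
# The label column 'status' is not needed at prediction time.
row_df = sqlContext.createDataFrame([tuple(d['5008804'])], numericCols)
my_model.transform(row_df).select('prediction').show()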