# App for training + saving ML model for Pre2Post


In machine learning, the first thing to do is to train a model. Then you usually save it for further usage (predictions).

In this notebook, we will train a bunch of models, select the best one, and save it so that it can be used in notebooks 2 & 3. To run it, the following assumptions must hold:

1. The following tables exist in Hive's metastore:
    + `raw_es.vf_pre_ac_final`
    + `raw_es.vf_pos_ac_final`
    + `raw_es.vf_pre_info_tarif`
    + `raw_es.campaign_msisdncontacthist`
    + `raw_es.campaign_msisdnresponsehist`
    + `raw_es.anonymisation_lookup_msisdn` --> This is a temporary workaround. Chris knows about it.
2. Date and time clocks in the Spark driver are accurate. This is important because we rely heavily on dates to compute things like *which month comes after the current one*.

With all that said, let's start:

## 1. The one and only line that we have to change between executions

To understand better the whole workflow:

1. We use historical data from one month (e.g. 2017/04) to train a couple of models and keep/save the best one. Once we have a good model, we just save it to HDFS. This *does not have to happen every single month*; as long as the model is not badly outdated, there should be no need to run this every month. This notebook (#1) covers that step.
2. We then use historical data from the following month (2017/05) to get an unbiased measure of how good our saved model is. This should run every month; it is always important to keep track of model performance on a monthly basis.
3. Finally, in order to make predictions, we use the most recent data to predict customer behaviour. This can be run as many times as we want, whenever needed (usually once a month).

Since these notebooks are re-run for different months, I took care to structure the code so that only one line needs to change from one execution to the next:

In [1]:
months = ["201703","201704","201705", "201706"]
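
Everything downstream parses these strings with the `%Y%m` format, so a malformed entry would only fail deep inside the loop. A minimal sketch of an up-front check follows; `validate_months` is a hypothetical helper and not part of the original notebook.

```python
from datetime import datetime

def validate_months(months):
    """Return the list unchanged if every entry is a valid yyyymm string."""
    for month in months:
        # strptime alone would accept "20173", so check the shape first.
        if len(month) != 6 or not month.isdigit():
            raise ValueError("Expected a yyyymm string, got %r" % (month,))
        # Raises ValueError for impossible months such as 00 or 13.
        datetime.strptime(month, "%Y%m")
    return months

months = validate_months(["201703", "201704", "201705", "201706"])
```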

## 2. Imports and app setup

Your usual stuff:

In [2]:
# Standard Library stuff:
from functools import partial
from datetime import date, timedelta, datetime

# Numpy stuff
from numpy import nan as np_nan

# Spark stuff
from pyspark.sql import SparkSession
from pyspark import StorageLevel
from pyspark.sql.functions import (udf, col, decode, when, lit, lower, 
                                   translate, count, sum as sql_sum, max as sql_max, isnull)
from pyspark.sql.types import DoubleType, StringType, IntegerType

In [3]:
spark = (SparkSession.builder
         .appName("Pre2Post Spain training")
         .master("yarn")
         .config("spark.submit.deployMode", "client")
         .config("spark.ui.showConsoleProgress", "true")
         .enableHiveSupport()
         .getOrCreate()
         )

sc = spark.sparkContext

## 3. Data imports and first transformations

In [4]:
# This function will be very handy:
def get_next_month(dt):
    """
    Given a yyyymm string, returns the yyyymm string
    for the next month.
    """
    current_month = datetime.strptime(dt, "%Y%m")
    return (datetime(current_month.year, current_month.month, 28) + timedelta(days=4)).strftime("%Y%m")
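
The trick is that day 28 exists in every month, and adding 4 days to it always lands in the following month (no month has more than 31 days), including across a year boundary. A quick check, repeating the function so that the snippet runs on its own:

```python
from datetime import datetime, timedelta

def get_next_month(dt):
    current_month = datetime.strptime(dt, "%Y%m")
    return (datetime(current_month.year, current_month.month, 28)
            + timedelta(days=4)).strftime("%Y%m")

print(get_next_month("201701"))  # 201702
print(get_next_month("201702"))  # 201703 (28 Feb + 4 days = 4 Mar)
print(get_next_month("201612"))  # 201701 (year rollover)
```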

In [5]:
historical_dataframes = []

for month_for_verifying_in_pospago in months:
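    # Go back roughly 5 months from the verification month:
    time_delta = (datetime.strptime(month_for_verifying_in_pospago, "%Y%m")
                  - timedelta(5*365/12))

    # Round to the nearest month: past mid-month we move to the first day
    # of the next month; otherwise we keep the current month.
    if time_delta.day > 16: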
        if time_delta.month > 11:
            date_prepaid = datetime(time_delta.year + 1, 1, 1)
        else:
            date_prepaid = datetime(time_delta.year, time_delta.month + 1, 1)
    else:
        date_prepaid = time_delta

    month_for_getting_prepaid_data = date_prepaid.strftime("%Y%m")

    # Do they migrate next month? (Well, really next 2 months due to delays in lists):
    month_for_appearance_in_pospago = get_next_month(get_next_month(month_for_getting_prepaid_data))
    print(month_for_getting_prepaid_data, month_for_appearance_in_pospago, month_for_verifying_in_pospago)




    # I know, it is atrocious, but the most straightforward/simple/robust
    # way to select columns is with a simple, hardcoded Python list:

    useful_columns_from_acFinalPrepago = ["FECHA_EJECUCION",
                                          "MSISDN",
                                          "NUM_DOCUMENTO_CLIENTE",
                                          "NACIONALIDAD",
                                          "NUM_PREPAGO",
                                          "NUM_POSPAGO",
                                          #"Tipo_Documento_Cliente", Very uninformed
                                          "Tipo_Documento_Comprador",
                                          "X_FECHA_NACIMIENTO"]

    # Lots of tables in Hive have empty string instead
    # of null for missing values in StringType columns:
    def empty_str_to_null(string_value):
        # "" and u"" compare equal, so a single check covers both:
        return None if string_value == "" else string_value

    # We register previous function as a udf:
    empty_string_to_null = udf(empty_str_to_null, StringType())

    # Function that returns the customer's age (in years) from their birthdate:
    def get_customer_age_raw(birthdate, month_for_getting_prepaid_data):
        if birthdate is None:
            return np_nan
        parsed_date = datetime.strptime(str(int(birthdate)), "%Y%m%d")
        # Named `age` so that it does not shadow the imported `timedelta`:
        age = datetime.strptime(month_for_getting_prepaid_data, "%Y%m") - parsed_date
        return age.days / 365.25

    # We register previous function as a udf:
    def get_customer_age_udf(birthdate, month):
        return udf(partial(get_customer_age_raw, month_for_getting_prepaid_data=month), DoubleType())(birthdate)

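    # Repairs characters that were mangled by charset encoding issues.
    def substitute_crappy_characters(string_column):
        """
        I really hate charset encoding.
        """
        # Nulls arrive as None in a Python UDF; pass them through untouched:
        if string_column is None:
            return None
        return (string_column
                .replace(u"\ufffd", u"ñ")
                # add more here in the future
               )

    # We register previous function as a udf:
    substitute_crappy_characters_udf = udf(substitute_crappy_characters, StringType())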

    # And we finally read raw_es.vf_pre_ac_final,
    # filtering by date, and with new columns
    # that we create using our UDFs:
    acFinalPrepago = (spark.read.table("raw_es.vf_pre_ac_final")
                      .where((col("year") == int(month_for_getting_prepaid_data[:4]))
                             & (col("month") == int(month_for_getting_prepaid_data[4:]))
                            )
                      #.select(*useful_columns_from_acFinalPrepago)
                      .withColumn("X_FECHA_NACIMIENTO", empty_string_to_null(col("X_FECHA_NACIMIENTO")))
                      .withColumn("NUM_DOCUMENTO_CLIENTE", empty_string_to_null(col("NUM_DOCUMENTO_CLIENTE")))
                      .withColumn("NUM_DOCUMENTO_COMPRADOR", empty_string_to_null(col("NUM_DOCUMENTO_COMPRADOR")))
                      .withColumn("age_in_years", get_customer_age_udf(col("X_FECHA_NACIMIENTO"),
                                                                       month_for_getting_prepaid_data)
                                 )
                      .withColumn("NACIONALIDAD", substitute_crappy_characters_udf(col("NACIONALIDAD")))
                     )




    # In this acFinalPrepago DF we have a column (nationality)
    # with lots of different values (high cardinality), which
    # is terrible for ML models, so we will keep the most frequent
    # countries, and replace all others with "Other":
    most_frequent_countries = [u"España",
                               u"Marruecos",
                               u"Rumania",
                               u"Colombia",
                               u"Italia",
                               u"Ecuador",
                               u"Alemania",
                               u"Estados Unidos",
                               u"Francia",
                               u"Brasil",
                               u"Argentina",
                               u"Afganistan",
                               u"Bolivia",
                               u"Gran Bretaña",
                               u"Portugal",
                               u"Paraguay",
                               u"China",
                               u"Gran Bretana",
                               u"Venezuela",
                               u"Honduras",
                               u"Corea del Sur"]


    acFinalPrepago = acFinalPrepago.withColumn("NACIONALIDAD", when(col("NACIONALIDAD").isin(most_frequent_countries),
                                                                      col("NACIONALIDAD")
                                                                     ).otherwise(lit("Other"))
                                                )



    # Now we read another table: raw_es.vf_pos_ac_final,
    # with some yyyymm predicate, and only two columns:
    acFinalPospago_nextMonth = (spark.read.table("raw_es.vf_pos_ac_final")
                                   .where((col("year") == int(month_for_appearance_in_pospago[:4]))
                                          & (col("month") == int(month_for_appearance_in_pospago[4:]))
                                         )
                                   .select("x_id_red","x_num_ident")
                                   .na.drop()
                                   .withColumnRenamed("x_id_red", "x_id_red_NextMonth")
                                   .withColumnRenamed("x_num_ident", "x_num_ident_NextMonth")
                                  )



    # And yet again, we read the same table, but with 
    # different yyyymm predicate:
    acFinalPospago_4monthsLater = (spark.read.table("raw_es.vf_pos_ac_final")
                                   .where((col("year") == int(month_for_verifying_in_pospago[:4]))
                                          & (col("month") == int(month_for_verifying_in_pospago[4:]))
                                         )
                                   .select("x_id_red","x_num_ident")
                                   .na.drop()
                                  )


    # And we perform one join:

    join_prepago_pospago_1 = (acFinalPrepago
                             .join(acFinalPospago_nextMonth,
                                   how="left",
                                   on=(acFinalPrepago["MSISDN"]==acFinalPospago_nextMonth["x_id_red_NextMonth"])
                                    & (acFinalPrepago["NUM_DOCUMENTO_COMPRADOR"]==acFinalPospago_nextMonth["x_num_ident_NextMonth"])
                                 )
                           )



    # And another:

    join_prepago_pospago_2 = (join_prepago_pospago_1
                             .join(acFinalPospago_4monthsLater,
                                  on=(join_prepago_pospago_1["x_id_red_NextMonth"]==acFinalPospago_4monthsLater["x_id_red"])
                                      & (join_prepago_pospago_1["x_num_ident_NextMonth"]==acFinalPospago_4monthsLater["x_num_ident"]),
                                  how="left"
                                 )
                             )



    # Beautiful datetime manipulations:
    datetime_for_appearance_in_pospago = datetime.strptime(month_for_appearance_in_pospago, "%Y%m")

    datetime_min_contact = datetime((datetime_for_appearance_in_pospago - timedelta(days=8)).year,
                                    (datetime_for_appearance_in_pospago - timedelta(days=8)).month,
                                    1)

    datetime_max_contact = datetime((datetime_for_appearance_in_pospago + timedelta(days=8)).year,
                                    (datetime_for_appearance_in_pospago + timedelta(days=8)).month,
                                    7)

    month_for_getting_prepaid_data, month_for_appearance_in_pospago, datetime_min_contact, datetime_max_contact


    # Now, we read raw_es.campaign_msisdncontacthist
    # with yyyymm predicates and other stupid business filters
    # that I do not fully understand:
    contacts = (spark.read.table("raw_es.campaign_msisdncontacthist")
                .where(col("contactdatetime") >= datetime_min_contact.strftime("%Y-%m-%d %H:%M:%S"))
                .where(col("contactdatetime") < datetime_max_contact.strftime("%Y-%m-%d %H:%M:%S"))
                #.where(col("contactdatetime") >= "2017-05-01 00:00:00")
                #.where(col("contactdatetime") < "2017-06-08 00:00:00")
                .where(col("CampaignCode").isin(['AUTOMMES_PXXXP_MIG_PROPENSOS']))
                .where(~(col("Canal").like("PO%")))
                .where(~(col("Canal").like("NBA%")))
                .where(col("Canal")=="TEL")
                .where(col("Flag_Borrado") == 0)
                )

    # We read raw_es.campaign_msisdnresponsehist:
    responses = (spark.read.table("raw_es.campaign_msisdnresponsehist")
                )

    # We are going to join contacts DF with responses DF, and they
    # happen to have columns with the same names (but not the same data),
    # so we rename all columns in responses DF, adding responses_
    # at the beginning:
    responses_columns = [(column,"responses_"+column) for column in responses.columns]

    for existing, new in responses_columns:
        responses = responses.withColumnRenamed(existing, new)

    # Beautiful join. I do not expect you to understand
    # it, because neither do I. I just translated some
    # Teradata Query that VF Spain's CVM department uses
    # to Spark DF syntax. It runs quite fast...
    contacts_and_responses = (contacts.join(responses,
                                           how="left_outer",
                                           on=(contacts["TREATMENTCODE"]==responses["responses_TREATMENTCODE"])
                                              & (contacts["MSISDN"]==responses["responses_MSISDN"])
                                              & (contacts["CampaignCode"]==responses["responses_CampaignCode"])
                                           )
                                      .groupBy("MSISDN",
                                               "CAMPAIGNCODE",
                                               "CREATIVIDAD",
                                               "CELLCODE",
                                               "CANAL",
                                               "contactdatetime",
                                               "responses_responsedatetime")
                                      .agg(sql_max("responses_responsedatetime"))
                                      .select(col("MSISDN"), 
                                              col("CAMPAIGNCODE"), 
                                              col("CREATIVIDAD"), 
                                              col("CELLCODE"), 
                                              col("CANAL"), 
                                              col("contactdatetime").alias("DATEID"),
                                              when(isnull(col("max(responses_responsedatetime)")), "0")
                                                  .otherwise("1").alias("EsRespondedor")

                                             )
                             ).withColumnRenamed("msisdn","msisdn_contact")


    # Here is the lookup table
    lookup_msisdn = spark.read.table("raw_es.anonymisation_lookup_msisdn")

    # Join between customer data and lookup table:
    join_prepago_pospago_3 = join_prepago_pospago_2.join(lookup_msisdn,
                                                         how="left",
                                                         on=join_prepago_pospago_2["x_id_red"]==lookup_msisdn["cvm_value"]
                                                        )

    # Another join, where we
    # also create the target column for our 
    # machine learning model:
    join_prepago_pospago_4 = (join_prepago_pospago_3.join(contacts_and_responses,
                                                        how="left",
                                                        on=join_prepago_pospago_3["correct_value"]==contacts_and_responses["msisdn_contact"]
                                                       )
                                                  .withColumn("migrated_to_postpaid", ((~col("msisdn_contact").isNull())
                                                                                      #&(~col("EsRespondedor").isNull())
                                                                                      ).cast(IntegerType()))
                            )

    join_prepago_pospago_5 = (join_prepago_pospago_2
                              .where(col("x_id_red").isNull())
                              .join(lookup_msisdn,
                                    how="left",
                                    on=join_prepago_pospago_2["MSISDN"]==lookup_msisdn["cvm_value"]
                                   )
                              )

    join_prepago_pospago_6 = (join_prepago_pospago_5.join(contacts_and_responses,
                                                         how="left",
                                                         on=join_prepago_pospago_5["correct_value"]==contacts_and_responses["msisdn_contact"]
                                                         )
                              .withColumn("migrated_to_postpaid", lit("0").cast(IntegerType()))
                              .where(~(col("EsRespondedor").isNull()))
                              )

    join_prepago_pospago = (join_prepago_pospago_4
                            .where(col("migrated_to_postpaid")==1)
                            .union(join_prepago_pospago_6)
                            )


    # We will read raw_es.vf_pre_info_tarif.
    # The columns that we care about are the following:

    useful_columns_from_tarificadorPre = ['MSISDN',
                                          'MOU',
                                          'TOTAL_LLAMADAS',
                                          'TOTAL_SMS',
                                          'MOU_Week',
                                          'LLAM_Week',
                                          'SMS_Week',
                                          'MOU_Weekend',
                                          'LLAM_Weekend',
                                          'SMS_Weekend',
                                          'MOU_VF',
                                          'LLAM_VF',
                                          'SMS_VF',
                                          'MOU_Fijo',
                                          'LLAM_Fijo',
                                          'SMS_Fijo',
                                          'MOU_OOM',
                                          'LLAM_OOM',
                                          'SMS_OOM',
                                          'MOU_Internacional',
                                          'LLAM_Internacional',
                                          'SMS_Internacional',
                                          'ActualVolume',
                                          'Num_accesos',
                                          'Plan',
                                          'Num_Cambio_Planes',
                                          #'TOP_Internacional', # No idea what this column is
                                          'LLAM_COMUNIDAD_SMART',
                                          'MOU_COMUNIDAD_SMART',
                                          'LLAM_SMS_COMUNIDAD_SMART',
                                          'Flag_Uso_Etnica',
                                          'cuota_SMART8',
                                          'cuota_SMART12',
                                          'cuota_SMART16']


    # Read raw_es.vf_pre_info_tarif + yyyymm predicates + 
    # column selection:
    tarificadorPre = (spark.read.table("raw_es.vf_pre_info_tarif")
                      .where((col("year") == int(month_for_getting_prepaid_data[:4]))
                             & (col("month") == int(month_for_getting_prepaid_data[4:]))
                            )
                      .select(*useful_columns_from_tarificadorPre)
                     )

    
    
    # Just as happened with the nationality column,
    # Plan is a column with very high cardinality.
    # We will replace any category not included
    # in the following list with "Other":
    plan_categories = ['PPIB7',
                       'PPFCL',
                       'PPIB4',
                       'PPXS8',
                       'PPIB8',
                       'PPIB9',
                       'PPTIN',
                       'PPIB1',
                       'PPVIS',
                       'PPREX',
                       'PPIB5',
                       'PPREU',
                       'PPRET',
                       'PPFCS',
                       'PPIB6',
                       'PPREY',
                       'PPVSP',
                       'PPIB2',
                       'PPIB3',
                       'PPRE2',
                       'PPRE5',
                       'PPVE2',
                       'PPVE1',
                       'PPRES',
                       'PPJ24',
                       'PPVE3',
                       'PPJAT',
                       'PPJMI']

    tarificadorPre_2 = tarificadorPre.withColumn("Plan",
                                                 when(col("Plan").isin(plan_categories),
                                                      col("Plan")
                                                     ).otherwise(lit("Other"))
                                                )

    # Only one step left:
    prepaid_dataset_1 = join_prepago_pospago.join(tarificadorPre_2,
                                                   how="inner",
                                                   on="MSISDN")
    
    historical_dataframes.append(prepaid_dataset_1)

('201610', '201612', '201703')
('201611', '201701', '201704')
('201612', '201702', '201705')
('201701', '201703', '201706')
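
Each printed tuple is (month of the prepaid snapshot, month in which the customer should show up in postpaid, month used for verification). For the four months configured in `months`, the first two sit 5 and 3 months before the verification month. The sketch below reproduces the printed output with plain integer month arithmetic; `shift_month` is a hypothetical helper, shown as an alternative to the `timedelta`-based rounding used in the loop rather than as the notebook's own method.

```python
def shift_month(yyyymm, offset):
    """Shift a yyyymm string by a (possibly negative) number of months."""
    year, month = int(yyyymm[:4]), int(yyyymm[4:])
    # Count months from year 0 so that divmod handles year boundaries.
    total = year * 12 + (month - 1) + offset
    new_year, new_month_index = divmod(total, 12)
    return "%04d%02d" % (new_year, new_month_index + 1)

windows = []
for verification_month in ["201703", "201704", "201705", "201706"]:
    prepaid_month = shift_month(verification_month, -5)
    appearance_month = shift_month(prepaid_month, 2)
    windows.append((prepaid_month, appearance_month, verification_month))
    print(windows[-1])
```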


In [6]:
prepaid_dataset_1 = historical_dataframes[0]

for dataframe in historical_dataframes[1:]:
    prepaid_dataset_1 = prepaid_dataset_1.union(dataframe)

## 5. Feature engineering

Up until now, we have only integrated different sources and done some data cleansing, validation and preparation according to business rules. Now we have to perform further preprocessing before feeding the data to the machine learning models that we train in this notebook.

Let's start by removing some columns that, after a lot of local testing, we decided are pretty much useless.

Also, we will separate numeric columns (IntegerType, DoubleType) from categorical columns (StringType), since it is a requirement to treat them differently before feeding them to any machine learning model:

In [7]:
numeric_columns = ['NUM_PREPAGO',
                   'NUM_POSPAGO',
                   'age_in_years',
                   #'documenttype_Other',
                   #'documenttype_cif',
                   #'documenttype_nif',
                   #'documenttype_pasaporte',
                   #'documenttype_tarj_residente',
                   #'nationality_Afganistan',
                   #'nationality_Alemania',
                   #'nationality_Argentina',
                   #'nationality_Bolivia',
                   #'nationality_Brasil',
                   #'nationality_China',
                   #'nationality_Colombia',
                   #'nationality_Corea_del_Sur',
                   #'nationality_Ecuador',
                   #'nationality_España',
                   #'nationality_Estados_Unidos',
                   #'nationality_Francia',
                   #'nationality_Gran_Bretana',
                   #'nationality_Gran_Bretaña',
                   #'nationality_Honduras',
                   #'nationality_Italia',
                   #'nationality_Marruecos',
                   #'nationality_Other',
                   #'nationality_Paraguay',
                   #'nationality_Portugal',
                   #'nationality_Rumania',
                   #'nationality_Venezuela',
                   #'migrated_to_postpaid',
                   'MOU',
                   'TOTAL_LLAMADAS',
                   'TOTAL_SMS',
                   'MOU_Week',
                   'LLAM_Week',
                   'SMS_Week',
                   'MOU_Weekend',
                   'LLAM_Weekend',
                   'SMS_Weekend',
                   'MOU_VF',
                   'LLAM_VF',
                   'SMS_VF',
                   'MOU_Fijo',
                   'LLAM_Fijo',
                   'SMS_Fijo',
                   'MOU_OOM',
                   'LLAM_OOM',
                   'SMS_OOM',
                   'MOU_Internacional',
                   'LLAM_Internacional',
                   'SMS_Internacional',
                   'ActualVolume',
                   'Num_accesos',
                   'Num_Cambio_Planes',
                   'LLAM_COMUNIDAD_SMART',
                   'MOU_COMUNIDAD_SMART',
                   'LLAM_SMS_COMUNIDAD_SMART',
                   #'Flag_Uso_Etnica',
                   'cuota_SMART8',
                   #'cuota_SMART12',
                   #'cuota_SMART16',
                   #'plan_PPFCL',
                   #'plan_PPFCS',
                   #'plan_PPIB1',
                   #'plan_PPIB2',
                   #'plan_PPIB3',
                   #'plan_PPIB4',
                   #'plan_PPIB5',
                   #'plan_PPIB6',
                   #'plan_PPIB7',
                   #'plan_PPIB8',
                   #'plan_PPIB9',
                   #'plan_PPJ24',
                   #'plan_PPJAT',
                   #'plan_PPJMI',
                   #'plan_PPRE2',
                   #'plan_PPRE5',
                   #'plan_PPRES',
                   #'plan_PPRET',
                   #'plan_PPREU',
                   #'plan_PPREX',
                   #'plan_PPREY',
                   #'plan_PPTIN',
                   #'plan_PPVE1',
                   #'plan_PPVE2',
                   #'plan_PPVE3',
                   #'plan_PPVIS',
                   #'plan_PPVSP',
                   #'plan_PPXS8'
                  ]

categorical_columns = ["tipo_documento_comprador", "NACIONALIDAD", "Plan"]

# We just rename our big DF...
prepaid_dataset_2 = prepaid_dataset_1

# A simple loop to cast every numeric
# column to DoubleType:
for column in numeric_columns:
    prepaid_dataset_2 = prepaid_dataset_2.withColumn(column, col(column).cast(DoubleType()))
    
# Good old repartition for underpartitioned tables + 
# disk persistence:
prepaid_dataset_2 = (prepaid_dataset_2
                     #.repartition(int(prepaid_dataset_2.count() / 50000)+1)
                     .persist(StorageLevel.DISK_ONLY_2)
                     )

Now we start with the ML stuff. First of all, we have to treat categorical columns differently from numeric ones.

For our ML model to correctly "understand" these columns, we have to pass them through a Spark ML transformer called StringIndexer:

In [8]:
from pyspark.ml.feature import StringIndexer


string_indexer_document = (StringIndexer(inputCol="tipo_documento_comprador", outputCol="documentType_indexed")
                           .fit(prepaid_dataset_2)
                          )
                           
string_indexer_nation = (StringIndexer(inputCol="NACIONALIDAD", outputCol="nationality_indexed")
                         .fit(prepaid_dataset_2)
                         )
                         
string_indexer_plan = (StringIndexer(inputCol="Plan", outputCol="tariffPlan_indexed")
                       .fit(prepaid_dataset_2)
                       )
                       
string_indexer_label = (StringIndexer(inputCol="migrated_to_postpaid", outputCol="label")
                        .fit(prepaid_dataset_2)
                        )

# A list with the new columns that these 
# StringIndexers generate (except for the
# label one, which has to be treated differently):
categorical_columns_indexed = ["documentType_indexed", "nationality_indexed", "tariffPlan_indexed"]
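
By default, `StringIndexer` assigns index 0.0 to the most frequent label, 1.0 to the next most frequent, and so on. The plain-Python sketch below mimics that ordering on a made-up list of plan codes; `fit_string_index` is a hypothetical stand-in, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically (assumed tie-break).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}

plans = ["PPIB7", "PPFCL", "PPIB7", "Other", "PPIB7", "PPFCL"]
mapping = fit_string_index(plans)
indexed = [mapping[plan] for plan in plans]
print(indexed)  # [0.0, 1.0, 0.0, 2.0, 0.0, 1.0]
```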

In [9]:
# These nested calls apply all
# StringIndexer transformations in
# only one statement.
#
# Beautiful, huh?
prepaid_dataset_3 = string_indexer_label.transform(
    string_indexer_plan.transform (
        string_indexer_nation.transform (
            string_indexer_document.transform (
                prepaid_dataset_2)
        )))
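
The same chain can be written as a left fold over a list of fitted transformers, which stays readable when more indexers are added. A minimal sketch: `AddColumn` is a hypothetical stand-in that exposes a `.transform()` method and works on plain dicts, so the snippet runs without Spark.

```python
from functools import reduce

class AddColumn(object):
    """Hypothetical stand-in for a fitted transformer."""
    def __init__(self, name, value):
        self.name = name
        self.value = value

    def transform(self, dataset):
        # Return a new dataset instead of mutating the input.
        result = dict(dataset)
        result[self.name] = self.value
        return result

def apply_all(transformers, dataset):
    """Equivalent to t3.transform(t2.transform(t1.transform(dataset)))."""
    return reduce(lambda current, t: t.transform(current), transformers, dataset)

stages = [AddColumn("documentType_indexed", 0.0),
          AddColumn("nationality_indexed", 1.0),
          AddColumn("tariffPlan_indexed", 2.0)]
result = apply_all(stages, {"MSISDN": "x"})
print(sorted(result))
```

With the real Spark objects, the list would be `[string_indexer_document, string_indexer_nation, string_indexer_plan, string_indexer_label]` and the starting dataset would be `prepaid_dataset_2`.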

In [10]:
# Filling with extreme value
#prepaid_dataset_3 = prepaid_dataset_3.na.fill(-9999.0, subset=["age_in_years"])

In [13]:
# In (supervised) machine learning, you pretty much always
# have to split your dataset into two subsets:
# the training one and the testing one.

# Furthermore, we will also use another testing methodology,
# which is the whole notebook #2. This allows us to really
# make sure that our model is robust and stable.
train, test = prepaid_dataset_3.randomSplit([0.8, 0.2])

#number_of_partitions_train = int(train.count() / 50000)+1

# Good old repartition for underpartitioned tables:
#train = train.repartition(number_of_partitions_train).cache()
#test = test.repartition(int(test.count() / 50000)+1).cache()

One column (`age_in_years`) in both `train` and `test` DFs is very sparsely populated (it has lots of missing values). It turns out that most ML models cannot handle nulls, so we will have to *impute* that column (which means replacing the nulls with some value).

Here we will do *median imputation* (replacing nulls with the median of the values of that column, excluding the nulls for the median computation obviously).

The correct way to perform this median imputation is as follows:

In [14]:
# We compute the median age in the
# train DF:
age_median_train = (train
                    .na.drop(subset=["age_in_years"])
                    .approxQuantile("age_in_years",
                                    probabilities=[0.5],
                                    relativeError=0.0
                                   )
                   )[0]

# And we impute this value in BOTH train and test.
# No, this is not an error: computing the median on train only
# avoids a common ML problem called "data leakage", where
# information from the test set leaks into training and makes
# the test metrics look better than they really are:
train_filled = train.na.fill(age_median_train, subset=["age_in_years"])
test_filled = test.na.fill(age_median_train, subset=["age_in_years"])


#train_filled = train
#test_filled = test
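
The key point is that the median is learned from `train` only and then reused for `test`. A plain-Python sketch of the same idea on made-up ages, with `None` standing in for a missing value (`fit_median` and `impute` are hypothetical helpers):

```python
from statistics import median

def fit_median(values):
    """Median of the non-missing values."""
    return median(v for v in values if v is not None)

def impute(values, fill_value):
    """Replace missing values with fill_value."""
    return [fill_value if v is None else v for v in values]

train_ages = [20.0, None, 40.0, 30.0]
test_ages = [None, 50.0]

age_median = fit_median(train_ages)          # uses train rows only
train_ages_filled = impute(train_ages, age_median)
test_ages_filled = impute(test_ages, age_median)

print(age_median)         # 30.0
print(train_ages_filled)  # [20.0, 30.0, 40.0, 30.0]
print(test_ages_filled)   # [30.0, 50.0]
```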

With all our columns prepared for the model, we have to `VectorAssemble` them before feeding them to the ML model:

In [15]:
from pyspark.ml.feature import VectorAssembler

vector_assembler = VectorAssembler(inputCols=numeric_columns + categorical_columns_indexed, outputCol="features")

train_assembled = vector_assembler.transform(train_filled)
test_assembled = vector_assembler.transform(test_filled)
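
Conceptually, `VectorAssembler` concatenates the listed columns, in order, into one numeric vector per row. A plain-Python sketch with a made-up row (the real transformer works on Spark DataFrames and produces `Vector` objects; `assemble` is a hypothetical helper):

```python
def assemble(row, input_cols):
    """Concatenate the selected columns of one row into a list of floats."""
    return [float(row[name]) for name in input_cols]

row = {"MOU": 12.5, "TOTAL_SMS": 3, "tariffPlan_indexed": 1.0, "MSISDN": "ignored"}
features = assemble(row, ["MOU", "TOTAL_SMS", "tariffPlan_indexed"])
print(features)  # [12.5, 3.0, 1.0]
```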

## 6. Machine Learning hyperparameter tuning through Cross Validation

We will use Random Forest for this project.

Given that the Random Forest accepts lots of different configurations, we have to try a couple of them and see which one performs the best.

This is done in Spark through a *CrossValidator* object, which depends on an *Estimator* (a model, in this case our Random Forest) + a *ParamGridBuilder* (the different Random Forest configurations that we want to try) + an *Evaluator* (a metric that will be used to decide which Random Forest configuration is the best one).
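
To make the moving parts concrete, here is a minimal plain-Python sketch of what a cross-validated grid search does: for every configuration, train on all folds but one, score on the held-out fold, average the scores, and keep the best configuration. The toy threshold "model", the accuracy metric and the deterministic fold assignment are all assumptions made for illustration; Spark assigns rows to folds randomly and uses the Evaluator it is given.

```python
def cross_validate(rows, param_grid, fit, evaluate, num_folds=3):
    """Return (best_params, best_avg_score) over the grid using k-fold CV."""
    # Deterministic folds by striding over row positions.
    folds = [list(range(start, len(rows), num_folds)) for start in range(num_folds)]
    best_params, best_score = None, None
    for params in param_grid:
        scores = []
        for held_out in folds:
            held_out_set = set(held_out)
            train_rows = [r for i, r in enumerate(rows) if i not in held_out_set]
            valid_rows = [rows[i] for i in held_out]
            model = fit(train_rows, params)
            scores.append(evaluate(model, valid_rows))
        avg_score = sum(scores) / len(scores)
        if best_score is None or avg_score > best_score:
            best_params, best_score = params, avg_score
    return best_params, best_score

# Toy "estimator": predicts 1 when x >= threshold. Nothing is learned from
# the training rows here; the point is only the search loop.
def fit(train_rows, params):
    return params["threshold"]

def evaluate(threshold, valid_rows):
    hits = sum(1 for x, label in valid_rows if int(x >= threshold) == label)
    return hits / float(len(valid_rows))

rows = [(x, int(x >= 5)) for x in range(10)]
grid = [{"threshold": 2}, {"threshold": 5}, {"threshold": 8}]
best_params, best_score = cross_validate(rows, grid, fit, evaluate)
print(best_params, best_score)  # {'threshold': 5} 1.0
```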

We will start by creating our Random Forest Estimator:

In [16]:
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(featuresCol="features",
                            labelCol="label",
                            maxBins=128,
                            maxMemoryInMB=4076,
                            cacheNodeIds=True,
                            checkpointInterval=5
                           )

Now, we will define our Param Grid, and all the different configurations that we want to try. The more configurations we try, the more time the whole CrossValidation process will take.

After a lot of *pruning*, I have determined that the following configurations are a good compromise between processing time and good results:

In [17]:
from pyspark.ml.tuning import ParamGridBuilder

hyperparam_grid_pipeline_random_forest = (ParamGridBuilder()
                                          .addGrid(rf.maxDepth, [14, 12, 10, 8, 6])
                                          #.addGrid(rf.maxDepth, [3])
                                          .addGrid(rf.numTrees, [256])
                                          #.addGrid(rf.numTrees, [5])
                                          .addGrid(rf.featureSubsetStrategy, ["all","0.6","onethird","0.1","sqrt"])
                                          #.addGrid(rf.featureSubsetStrategy, ["sqrt"])
                                          .addGrid(rf.minInstancesPerNode, [1,6,12,32,64])
                                          .build()
                                          )
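
To get a feeling for how much work this grid implies, we can count the configurations without touching Spark. This is a small pure-Python sketch: the values are copied from the grid above, `num_folds` is the `numFolds` value passed to the CrossValidator in this notebook, and the variable names are just illustrative:

```python
from itertools import product

# Values copied from the ParamGridBuilder cell above.
grid = {
    "maxDepth": [14, 12, 10, 8, 6],
    "numTrees": [256],
    "featureSubsetStrategy": ["all", "0.6", "onethird", "0.1", "sqrt"],
    "minInstancesPerNode": [1, 6, 12, 32, 64],
}

# Every combination of one value per hyperparameter.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

num_folds = 10  # numFolds used by the CrossValidator in this notebook

# One fit per (configuration, fold), plus one final refit of the winner.
num_fits = len(combinations) * num_folds + 1

print(len(combinations))  # 125
print(num_fits)           # 1251
```

Every extra value added to any hyperparameter multiplies the number of fits, which is why the grid has to be kept small.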

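Given the iterative way in which Random Forest trees are grown, Spark gives us the option to checkpoint intermediate training state (this is what the `cacheNodeIds` and `checkpointInterval` parameters above are for) in order to remove some pressure from executor RAM. It is important to set this checkpoint directory, because Spark ignores `checkpointInterval` when no directory is set. The path itself does not really matter (any temporary directory in HDFS is OK), because the checkpoint data is only needed while the Spark process is running.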

In [18]:
# Feel free to change it at will.
sc.setCheckpointDir("hdfs:///user/jsotovi2/spark_checkpoints/")

Now we define our Evaluator:

In [19]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

random_forest_evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",
                                                        labelCol="label",
                                                        metricName="areaUnderROC")
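
For intuition about what `areaUnderROC` measures, here is a pure-Python sketch of one textbook formulation: the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one (ties count as one half). This is not how Spark computes it internally, and the function name and toy data are hypothetical:

```python
def area_under_roc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical toy scores and labels.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auc = area_under_roc(toy_scores, toy_labels)
print(toy_auc)  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect one, which is why higher is better when comparing configurations.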

And finally, the CrossValidator which uses all the other components:

In [20]:
from pyspark.ml.tuning import CrossValidator

cross_validator_pipeline_random_forest = CrossValidator(estimator=rf,
                                                        estimatorParamMaps=hyperparam_grid_pipeline_random_forest,
                                                        evaluator=random_forest_evaluator,
                                                        numFolds=10)

Done. Now, let the process run!

NOTE: this will take quite some time (70h approx):

In [None]:
cross_validator_model_rf = cross_validator_pipeline_random_forest.fit(train_assembled.coalesce(16))

## 7. Get best model from cross validation and its results

We are done with our cross validation. Now, there are a couple of things on our TODO list:

1. Get the model which performed the best
2. Save the configuration for this best model. This is important for the future, because it will help us troubleshoot and determine other useful configurations in future releases (so, these configurations will be read by a human being at some point)
3. Save the performance results (using both the train and test DFs)

In [None]:
# This is the best model:
best_rf = cross_validator_model_rf.bestModel

# The configuration of this "winner" model, as a string that we will
# save to HDFS as a TextFile for human consumption and/or logging:
string_best_model = best_rf._call_java("parent").extractParamMap().toString()

# Performance results:
auc_train = random_forest_evaluator.evaluate(best_rf.transform(train_assembled))

auc_test = random_forest_evaluator.evaluate(best_rf.transform(test_assembled))

print(auc_train)
print(auc_test)

In [None]:
sorted(zip(numeric_columns + categorical_columns_indexed, best_rf.featureImportances.toArray()),
       key=lambda x: -x[1])

In [None]:
sorted(zip(cross_validator_model_rf.avgMetrics, cross_validator_model_rf.getEstimatorParamMaps()),
       key=lambda x: -x[0])

Almost done. But if we really want to follow best practices, we *should* re-train our "winner" model using both the train and test DFs at the same time.

It turns out that this part has to be quite manual due to some Spark API limitations, but that's OK. First of all, we will *manually* extract the configurations that happen to be the best ones:

In [None]:
best_max_depth = best_rf._call_java("parent").getMaxDepth()
best_num_trees = best_rf._call_java("parent").getNumTrees()
best_num_features = best_rf._call_java("parent").getFeatureSubsetStrategy()
best_min_instances_per_node = best_rf._call_java("parent").getMinInstancesPerNode()

print(best_max_depth, best_num_trees, best_num_features, best_min_instances_per_node)
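
As a cross-check of the values printed above, the winning configuration can also be located by pairing each average metric with its parameter map and taking the highest one. The sketch below uses plain Python lists as hypothetical stand-ins for `cross_validator_model_rf.avgMetrics` and `cross_validator_model_rf.getEstimatorParamMaps()`; the numbers are made up, and in Spark the map keys are `Param` objects rather than strings:

```python
# Hypothetical stand-ins (made-up values) for avgMetrics and the param maps.
avg_metrics = [0.81, 0.86, 0.84]
param_maps = [
    {"maxDepth": 6, "minInstancesPerNode": 64},
    {"maxDepth": 10, "minInstancesPerNode": 12},
    {"maxDepth": 14, "minInstancesPerNode": 1},
]

# For areaUnderROC larger is better, so we take the index of the maximum.
best_index = max(range(len(avg_metrics)), key=lambda i: avg_metrics[i])
best_params = param_maps[best_index]

print(best_index, best_params)
```

Both lists are in the same order, so the index of the best metric is also the index of the best parameter map.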

In [None]:
print("DONEEE")

And now, we will train a Random Forest again, using this configuration and the whole train + test DFs.

But there is one thing left: we have to recompute and re-impute the median, now using the whole train + test DFs (which happens to be none other than `prepaid_dataset_2`, since `prepaid_dataset_3` already has the StringIndexer + VectorAssembler transformers applied, and we actually want to re-apply them as well):

In [None]:
# Impute median in whole dataset
age_median_whole = (prepaid_dataset_2
                    .na.drop(subset=["age_in_years"])
                    .approxQuantile("age_in_years",
                                    probabilities=[0.5],
                                    relativeError=0.0
                                   )
                   )[0]

#age_median_whole = -9999.0

prepaid_dataset_4_filled = prepaid_dataset_2.na.fill(age_median_whole, subset=["age_in_years"])

We create the StringIndexers again, but now using our new full DF:

In [None]:
string_indexer_document_whole = (StringIndexer(inputCol="tipo_documento_comprador", outputCol="documentType_indexed")
                                 .fit(prepaid_dataset_4_filled)
                                )
                           
string_indexer_nation_whole = (StringIndexer(inputCol="NACIONALIDAD", outputCol="nationality_indexed")
                               .fit(prepaid_dataset_4_filled)
                              )
                         
string_indexer_plan_whole = (StringIndexer(inputCol="Plan", outputCol="tariffPlan_indexed")
                             .fit(prepaid_dataset_4_filled)
                            )
                       
string_indexer_label_whole = (StringIndexer(inputCol="migrated_to_postpaid", outputCol="label")
                              .fit(prepaid_dataset_4_filled)
                             )
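
Why re-fit the indexers at all? A label-to-index mapping depends on the data it was fitted on. The pure-Python sketch below mimics a frequency-descending ordering (most frequent label gets index 0.0); the function `fit_string_indexer`, the alphabetical tie-breaking and the toy plan names are assumptions made for illustration only:

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically here (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical toy column values.
train_only = ["plan_a", "plan_a", "plan_b"]
whole = ["plan_a", "plan_a", "plan_b", "plan_b", "plan_b", "plan_c"]

mapping_train = fit_string_indexer(train_only)
mapping_whole = fit_string_indexer(whole)

print(mapping_train)
print(mapping_whole)
```

In this toy case the indices of `plan_a` and `plan_b` swap and a new label appears, so a model trained on the whole dataset needs indexers fitted on that same dataset.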

And we apply them in conjunction with our VectorAssembler (which we don't have to re-create; the *old* one is fine):

In [None]:
prepaid_dataset_5 = vector_assembler.transform(
    string_indexer_label_whole.transform(
        string_indexer_plan_whole.transform(
            string_indexer_nation_whole.transform(
                string_indexer_document_whole.transform(
                    prepaid_dataset_4_filled)
            ))))

Now our `prepaid_dataset_5` is ready. We can now train our Random Forest with the best configuration parameters:

In [None]:
final_rf_trained = RandomForestClassifier(featuresCol="features",
                                          labelCol="label",
                                          maxBins=128,
                                          maxMemoryInMB=4076,
                                          cacheNodeIds=True,
                                          checkpointInterval=1,
                                          featureSubsetStrategy=best_num_features,
                                          maxDepth=best_max_depth,
                                          numTrees=best_num_trees,
                                          minInstancesPerNode=best_min_instances_per_node
                                         ).fit(prepaid_dataset_5)

Our final model is trained, and ready to be further tested (in notebook #2) and used for actual predictions (notebook #3). We just have to save it in order to be able to use it in other Spark programs.

But rather than only saving the Random Forest itself, we will save a Machine Learning Pipeline, with all our StringIndexers + VectorAssembler + Random Forest. When used, this Pipeline will apply each of the elements sequentially to any provided dataset:

In [None]:
from pyspark.ml import PipelineModel

final_pipeline_model = PipelineModel([string_indexer_document_whole,
                                      string_indexer_nation_whole,
                                      string_indexer_plan_whole,
                                      string_indexer_label_whole,
                                      vector_assembler,
                                      final_rf_trained])
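
The "apply each element sequentially" behaviour can be pictured with a few lines of plain Python. This is only a conceptual sketch, not Spark code: the `AddColumn` class, the `apply_stages` helper and the toy rows are hypothetical, and rows are plain dicts instead of DataFrame rows:

```python
from functools import reduce


class AddColumn:
    """Toy transformer: adds one derived key to every row (a dict)."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def transform(self, rows):
        return [dict(row, **{self.name: self.func(row)}) for row in rows]


def apply_stages(stages, rows):
    """Feed the output of each stage into the next one, in order."""
    return reduce(lambda current, stage: stage.transform(current), stages, rows)


# Hypothetical stages standing in for indexer -> assembler -> model.
stages = [
    AddColumn("plan_indexed", lambda r: 0.0 if r["plan"] == "basic" else 1.0),
    AddColumn("features", lambda r: [r["age"], r["plan_indexed"]]),
    AddColumn("prediction", lambda r: 1.0 if r["features"][1] == 1.0 else 0.0),
]

rows = [{"age": 30.0, "plan": "basic"}, {"age": 45.0, "plan": "plus"}]
result = apply_stages(stages, rows)
print(result)
```

The order of the stages matters: each one relies on the columns produced by the previous ones, just like the nested `transform` calls used earlier.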

Everything is prepared. We will save two files:

1. A TextFile containing the human-readable best configuration for our model + model performance metrics + the median value for the age, because in the future we will want to impute the exact same value to that column (again, in order to prevent data leakage).
2. Our Pipeline with all the transformers + the Random Forest.

We will name the files with the current timestamp, which will make it easier to retrieve the latest model + results in the future:

In [None]:
files_surname = datetime.now().strftime("%Y%m%d_%H%M%S")

# Best Random Forest config +
# results in AUC for both train and test
# + median value for age:
rdd_results = sc.parallelize([("month_of_training", month_for_verifying_in_pospago),
                              ("best_model",string_best_model), 
                              ("auc_train",auc_train), 
                              ("auc_test",auc_test),
                              ("age_median_value",age_median_whole)])

rdd_results.saveAsTextFile("hdfs:///user/jsotovi2/pre2post_v2/best_model_pre2post_results_"
                           + files_surname + ".txt")

# Save Pipeline
final_pipeline_model.save("hdfs:///user/jsotovi2/pre2post_v2/best_model_pre2post_"
                          + files_surname + ".sparkModel")
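
Because the timestamp format `%Y%m%d_%H%M%S` is fixed-width and goes from year down to second, sorting the file names alphabetically also sorts them chronologically. Here is a minimal sketch of how the latest model could be picked later on, assuming a hypothetical directory listing (the names below are made up, and obtaining the real listing from HDFS is left out):

```python
from datetime import datetime

prefix = "best_model_pre2post_"
suffix = ".sparkModel"

# Hypothetical directory listing (made-up timestamps).
names = [
    "best_model_pre2post_20170412_093000.sparkModel",
    "best_model_pre2post_results_20170603_181500.txt",
    "best_model_pre2post_20170603_181500.sparkModel",
    "best_model_pre2post_20170511_070000.sparkModel",
]

# Keep only the saved pipelines, then take the alphabetically largest name.
model_names = [n for n in names if n.endswith(suffix)]
latest = max(model_names)

# Recover the timestamp embedded in the name.
stamp = latest[len(prefix):-len(suffix)]
saved_at = datetime.strptime(stamp, "%Y%m%d_%H%M%S")

print(latest, saved_at)
```

The matching results file can then be found by reusing the same `stamp` value.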

In [None]:
print("Done!")

And we're done.

In [None]:
# Not really needed anymore
prepaid_dataset_1.unpersist()
train.unpersist()
test.unpersist()

spark.stop()


print("Finished!")