# App for making predictions using model from stage 1, once tested and validated in stage 2


We have already created and trained a model (Notebook #1) and tested it on a different month (Notebook #2).

There's only one step left: actually using it to make predictions! This notebook does exactly that. Assumptions:

1. We have a saved model (which is in fact a `pyspark.ml.PipelineModel`) in Julio's HDFS home directory for the project (`"hdfs:///user/jsotovi2/pre2post_v2/best_model_pre2post_yyyymmdd_hhmmss.sparkModel"`), where `yyyymmdd_hhmmss` is the datetime at which the model was saved. By default, this code always picks the latest model when more than one is present.
2. The model scored well in Notebook #2, meaning that the `auc_test` variable >= $0.8$.
3. The following tables exist in Hive's metastore:
    + `raw_es.vf_pre_ac_final`
    + `raw_es.vf_pre_info_tarif`
4. Date and time clocks in the Spark driver are accurate. This is important because we rely heavily on dates in order to compute *which is the next month after the current one* and that sort of stuff.
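
To make assumption 4 concrete, here is a minimal sketch of the kind of month arithmetic that depends on the driver clock. The helper names `next_month` and `current_month` are made up for illustration and are not used elsewhere in this notebook:

```python
from datetime import date


def next_month(yyyymm):
    """Return the month after `yyyymm` (e.g. "201712"), in the same format."""
    year, month = int(yyyymm[:4]), int(yyyymm[4:])
    if month == 12:
        # December rolls over to January of the following year
        year, month = year + 1, 1
    else:
        month += 1
    return "{:04d}{:02d}".format(year, month)


def current_month(today=None):
    """Return the current month as "yyyymm", based on the local clock."""
    today = today or date.today()
    return today.strftime("%Y%m")


print(next_month("201712"))
```

If the driver clock is wrong, `current_month()` is wrong too, and everything derived from it shifts by a month.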

With all that said, let's start:

## 1. The one and only line that we have to change between executions

To understand better the whole workflow:

1. We use historical data from one month (e.g. 2017/04) to train a couple of models, and keep/save the best one (this is done in notebook #1, which you do not currently have). Once we have a good model, we just save it to HDFS. This *does not have to happen every single month*; as long as the model is not extremely outdated, there should be no need to run this every month.
2. We then use historical data from the next month (2017/05) to get an unbiased measure of how good our saved model is. This should run every month; it is always important to keep track of model performance on a monthly basis.
3. Finally, in order to make predictions, we use the most recent data to predict customer behaviour (notebook #3). This can be run as many times as we want, when needed (usually once a month). This notebook does exactly that.

This is notebook #3, which in theory only has to be run once a month (probably towards the end of the month), since its output will be used for marketing campaigns (which start at the beginning of the following month):

In [None]:
# The yyyymm date to predict people who will
# migrate from prepaid to postpaid in two months.
# Should be manually changed.
# We should get the most recent month for which
# we have data.

# Given that data needed for this notebook
# comes from Spain CVM, the data of 201706
# is usually available on 2017/07/15.
month_for_getting_prepaid_data = "201712"

And that's it. There are no other dates needed in this notebook.
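
The `year` and `month` filters used later in this notebook are obtained by slicing this string, so a typo here would silently select the wrong partition (or none at all). A minimal sketch of a sanity check, assuming the value is always a six-character `yyyymm` string; `split_yyyymm` is a hypothetical helper, not part of the original code:

```python
from datetime import datetime


def split_yyyymm(yyyymm):
    """Validate a "yyyymm" string and return (year, month) as integers."""
    if len(yyyymm) != 6:
        raise ValueError("Expected a 6-character yyyymm string, got %r" % (yyyymm,))
    # strptime raises ValueError for impossible values such as month 13
    parsed = datetime.strptime(yyyymm, "%Y%m")
    return parsed.year, parsed.month


year, month = split_yyyymm("201712")
print(year, month)
```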

## 2. Imports and app setup

Your usual stuff:

In [2]:
# This literal_eval is needed since 
# we have to read from a textfile
# which is formatted as python objects.
# It is totally safe.
from ast import literal_eval

# Standard Library stuff:
from functools import partial
from datetime import date, timedelta, datetime

# Numpy stuff
from numpy import nan as np_nan, round as np_round, int64 as np_int64

# Spark stuff
from pyspark.sql import SparkSession
from pyspark import StorageLevel
from pyspark.sql.functions import (udf, col, decode, when, lit, lower, 
                                   translate, count, sum as sql_sum, max as sql_max, 
                                   isnull)
from pyspark.sql.types import DoubleType, StringType, IntegerType

In [3]:
spark = (SparkSession.builder
         .appName("Pre2Post prediction and list creation")
         .master("yarn")
         .config("spark.submit.deployMode", "client")
         .config("spark.ui.showConsoleProgress", "true")
         .enableHiveSupport()
         .getOrCreate()
         )

sc = spark.sparkContext

## 3. Data imports and first transformations

Compared to notebooks #1 and #2, we only have to use two tables in this one.

The first one is `raw_es.vf_pre_ac_final`, which contains information about *which VF clients were prepaid for a given month*, and very basic info about them (age, nationality, number of prepaid/postpaid services...)

The following cell reads the table and applies some transformations (which can probably be translated to RedAgent easily):

In [4]:
useful_columns_from_acFinalPrepago = ["FECHA_EJECUCION",
                                      "MSISDN",
                                      "NUM_DOCUMENTO_CLIENTE",
                                      "NACIONALIDAD",
                                      "NUM_PREPAGO",
                                      "NUM_POSPAGO",
                                      #"Tipo_Documento_Cliente", Very uninformed
                                      "Tipo_Documento_Comprador",
                                      "X_FECHA_NACIMIENTO"]

# Lots of tables in Hive have empty string instead
# of null for missing values in StringType columns:
def empty_str_to_null(string_value):
    """
    Turns empty strings into None, which
    Spark DataFrames handle as nulls.
    """
    return None if string_value == "" else string_value

# We register previous function as a udf:
empty_string_to_null = udf(empty_str_to_null, StringType())

# Function that returns customer age out of his/her birthdate:
def get_customer_age_raw(birthdate, month_for_getting_prepaid_data):
    if birthdate is None:
        return np_nan

    # Now, they use only birth year:
    #parsed_date = datetime.strptime(str(int(birthdate)), "%Y%m%d")
    parsed_date = datetime(int(birthdate), 6, 1)
    # Named age_delta so that it does not shadow the imported timedelta class:
    age_delta = datetime.strptime(month_for_getting_prepaid_data, "%Y%m") - parsed_date
    return age_delta.days / 365.25

# We register previous function as a udf:
def get_customer_age_udf(birthdate, month):
    return udf(partial(get_customer_age_raw, month_for_getting_prepaid_data=month), DoubleType())(birthdate)

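# Replaces characters that got mangled by charset encoding issues:
def substitute_crappy_characters(string_column):
    """
    I really hate charset encoding.
    """
    # Nulls arrive as None, which has no .replace method, so pass them through:
    if string_column is None:
        return None
    return (string_column
            .replace(u"\ufffd", u"ñ")
            # add more here in the future if needed
           )

# We register previous function as a udf:
substitute_crappy_characters_udf = udf(substitute_crappy_characters, StringType())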

# And we finally read raw_es.vf_pre_ac_final,
# filtering by date, and with new columns
# that we create using our UDFs:
acFinalPrepago = (spark.read.table("raw_es.vf_pre_ac_final")
                  .where((col("year") == int(month_for_getting_prepaid_data[:4]))
                         & (col("month") == int(month_for_getting_prepaid_data[4:]))
                        )
                  .where(col("X_FECHA_NACIMIENTO") != "X_FE")
                  #.select(*useful_columns_from_acFinalPrepago)
                  .withColumn("X_FECHA_NACIMIENTO", empty_string_to_null(col("X_FECHA_NACIMIENTO")))
                  .withColumn("NUM_DOCUMENTO_CLIENTE", empty_string_to_null(col("NUM_DOCUMENTO_CLIENTE")))
                  .withColumn("NUM_DOCUMENTO_COMPRADOR", empty_string_to_null(col("NUM_DOCUMENTO_COMPRADOR")))
                  .withColumn("age_in_years", get_customer_age_udf(col("X_FECHA_NACIMIENTO"),
                                                                   month_for_getting_prepaid_data)
                             )
                  .withColumn("NACIONALIDAD", substitute_crappy_characters_udf(col("NACIONALIDAD")))
                 )

# Good old repartition for underpartitioned tables:
acFinalPrepago = acFinalPrepago.repartition(int(acFinalPrepago.count() / 500)+1)
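
As a quick sanity check of the age logic in the cell above, here is a minimal sketch in plain Python (no Spark needed). The function name `age_in_years` is made up for illustration; like the cell above, it assumes that the birth date column carries only a birth year and places the birthday on June 1st:

```python
from datetime import datetime


def age_in_years(birth_year, yyyymm):
    """Approximate age at the first day of `yyyymm`, given only a birth year."""
    if birth_year is None:
        # Missing birth years become NaN, to be imputed later
        return float("nan")
    birth_date = datetime(int(birth_year), 6, 1)
    reference_date = datetime.strptime(yyyymm, "%Y%m")
    return (reference_date - birth_date).days / 365.25


print(age_in_years("1980", "201712"))
```

Since the real birthday is unknown, the result can be off by up to about half a year in either direction, which should be acceptable for a model feature.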

In [5]:
# In this acFinalPrepago DF we have a column (nationality)
# with lots of different values (high cardinality), which
# is terrible for ML models, so we will keep the most frequent
# countries, and replace all others with "Other":

most_frequent_countries = [u"España",
                           u"Marruecos",
                           u"Rumania",
                           u"Colombia",
                           u"Italia",
                           u"Ecuador",
                           u"Alemania",
                           u"Estados Unidos",
                           u"Francia",
                           u"Brasil",
                           u"Argentina",
                           u"Afganistan",
                           u"Bolivia",
                           u"Gran Bretaña",
                           u"Portugal",
                           u"Paraguay",
                           u"China",
                           u"Gran Bretana",
                           u"Venezuela",
                           u"Honduras",
                           u"Corea del Sur"]


acFinalPrepago = acFinalPrepago.withColumn("NACIONALIDAD", when(col("NACIONALIDAD").isin(most_frequent_countries),
                                                                  col("NACIONALIDAD")
                                                                 ).otherwise(lit("Other"))
                                            )
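
The bucketing done by `when(...).isin(...).otherwise(...)` in the cell above boils down to the following plain-Python logic. This is only a sketch: `bucket_rare_values` is a hypothetical name and the three-country set is a shortened stand-in for `most_frequent_countries`:

```python
def bucket_rare_values(value, frequent_values, other_label="Other"):
    """Keep `value` if it is one of the frequent categories, else return `other_label`."""
    return value if value in frequent_values else other_label


# Shortened, illustrative subset of the real list
frequent = {u"España", u"Marruecos", u"Rumania"}

for nationality in [u"España", u"Japón", None]:
    print(bucket_rare_values(nationality, frequent))
```

Missing values end up in "Other" as well, which matches what the Spark expression does for nulls.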

There is only one data source remaining, which is the one with pretty much all customer consumption patterns (MOU, MB, number of monthly calls...). We just have to read it, and join it with `acFinalPrepago`:

In [6]:
# We will read raw_es.vf_pre_info_tarif.
# The columns that we care about are the following:

useful_columns_from_tarificadorPre = ['MSISDN',
                                      'MOU',
                                      'TOTAL_LLAMADAS',
                                      'TOTAL_SMS',
                                      'MOU_Week',
                                      'LLAM_Week',
                                      'SMS_Week',
                                      'MOU_Weekend',
                                      'LLAM_Weekend',
                                      'SMS_Weekend',
                                      'MOU_VF',
                                      'LLAM_VF',
                                      'SMS_VF',
                                      'MOU_Fijo',
                                      'LLAM_Fijo',
                                      'SMS_Fijo',
                                      'MOU_OOM',
                                      'LLAM_OOM',
                                      'SMS_OOM',
                                      'MOU_Internacional',
                                      'LLAM_Internacional',
                                      'SMS_Internacional',
                                      'ActualVolume',
                                      'Num_accesos',
                                      'Plan',
                                      'Num_Cambio_Planes',
                                      #'TOP_Internacional', # No idea what this is
                                      'LLAM_COMUNIDAD_SMART',
                                      'MOU_COMUNIDAD_SMART',
                                      'LLAM_SMS_COMUNIDAD_SMART',
                                      'Flag_Uso_Etnica',
                                      'cuota_SMART8',
                                      'cuota_SMART12',
                                      'cuota_SMART16']

# Read raw_es.vf_pre_info_tarif + yyyymm predicates + 
# column selection:
tarificadorPre = (spark.read.table("raw_es.vf_pre_info_tarif")
                  .where((col("year") == int(month_for_getting_prepaid_data[:4]))
                         & (col("month") == int(month_for_getting_prepaid_data[4:]))
                        )
                  .select(*useful_columns_from_tarificadorPre)
                 )

# Good old repartition for underpartitioned tables:
tarificadorPre = tarificadorPre.repartition(int(tarificadorPre.count() / 500)+1)

# Just as it happened with the Nationality column,
# Plan is a column with very high cardinality.
# We will replace any category not included
# in the following list with "Other":
plan_categories = ['PPIB7',
                   'PPFCL',
                   'PPIB4',
                   'PPXS8',
                   'PPIB8',
                   'PPIB9',
                   'PPTIN',
                   'PPIB1',
                   'PPVIS',
                   'PPREX',
                   'PPIB5',
                   'PPREU',
                   'PPRET',
                   'PPFCS',
                   'PPIB6',
                   'PPREY',
                   'PPVSP',
                   'PPIB2',
                   'PPIB3',
                   'PPRE2',
                   'PPRE5',
                   'PPVE2',
                   'PPVE1',
                   'PPRES',
                   'PPJ24',
                   'PPVE3',
                   'PPJAT',
                   'PPJMI']

tarificadorPre_2 = tarificadorPre.withColumn("Plan",
                                             when(col("Plan").isin(plan_categories),
                                                  col("Plan")
                                                 ).otherwise(lit("Other"))
                                            )

In [7]:
# Only one step left:
prepaid_dataset_1 = acFinalPrepago.join(tarificadorPre_2,
                                        how="inner",
                                        on="MSISDN").persist(StorageLevel.DISK_ONLY_2)

`prepaid_dataset_1` is the DF that we will use for model predictions.

## 4. Feature engineering

100% analogous to notebook #2:

In [8]:
numeric_columns = ['NUM_PREPAGO',
                   'NUM_POSPAGO',
                   'age_in_years',
                   #'documenttype_Other',
                   #'documenttype_cif',
                   #'documenttype_nif',
                   #'documenttype_pasaporte',
                   #'documenttype_tarj_residente',
                   #'nationality_Afganistan',
                   #'nationality_Alemania',
                   #'nationality_Argentina',
                   #'nationality_Bolivia',
                   #'nationality_Brasil',
                   #'nationality_China',
                   #'nationality_Colombia',
                   #'nationality_Corea_del_Sur',
                   #'nationality_Ecuador',
                   #'nationality_España',
                   #'nationality_Estados_Unidos',
                   #'nationality_Francia',
                   #'nationality_Gran_Bretana',
                   #'nationality_Gran_Bretaña',
                   #'nationality_Honduras',
                   #'nationality_Italia',
                   #'nationality_Marruecos',
                   #'nationality_Other',
                   #'nationality_Paraguay',
                   #'nationality_Portugal',
                   #'nationality_Rumania',
                   #'nationality_Venezuela',
                   #'migrated_to_postpaid',
                   'MOU',
                   'TOTAL_LLAMADAS',
                   'TOTAL_SMS',
                   'MOU_Week',
                   'LLAM_Week',
                   'SMS_Week',
                   'MOU_Weekend',
                   'LLAM_Weekend',
                   'SMS_Weekend',
                   'MOU_VF',
                   'LLAM_VF',
                   'SMS_VF',
                   'MOU_Fijo',
                   'LLAM_Fijo',
                   'SMS_Fijo',
                   'MOU_OOM',
                   'LLAM_OOM',
                   'SMS_OOM',
                   'MOU_Internacional',
                   'LLAM_Internacional',
                   'SMS_Internacional',
                   'ActualVolume',
                   'Num_accesos',
                   'Num_Cambio_Planes',
                   'LLAM_COMUNIDAD_SMART',
                   'MOU_COMUNIDAD_SMART',
                   'LLAM_SMS_COMUNIDAD_SMART',
                   #'Flag_Uso_Etnica',
                   'cuota_SMART8',
                   #'cuota_SMART12',
                   #'cuota_SMART16',
                   #'plan_PPFCL',
                   #'plan_PPFCS',
                   #'plan_PPIB1',
                   #'plan_PPIB2',
                   #'plan_PPIB3',
                   #'plan_PPIB4',
                   #'plan_PPIB5',
                   #'plan_PPIB6',
                   #'plan_PPIB7',
                   #'plan_PPIB8',
                   #'plan_PPIB9',
                   #'plan_PPJ24',
                   #'plan_PPJAT',
                   #'plan_PPJMI',
                   #'plan_PPRE2',
                   #'plan_PPRE5',
                   #'plan_PPRES',
                   #'plan_PPRET',
                   #'plan_PPREU',
                   #'plan_PPREX',
                   #'plan_PPREY',
                   #'plan_PPTIN',
                   #'plan_PPVE1',
                   #'plan_PPVE2',
                   #'plan_PPVE3',
                   #'plan_PPVIS',
                   #'plan_PPVSP',
                   #'plan_PPXS8'
                  ]

categorical_columns = ["tipo_documento_comprador", "NACIONALIDAD", "Plan"]

prepaid_dataset_2 = prepaid_dataset_1

for column in numeric_columns:
    prepaid_dataset_2 = prepaid_dataset_2.withColumn(column, col(column).cast(DoubleType()))
    
prepaid_dataset_2 = (prepaid_dataset_2
                     .repartition(int(prepaid_dataset_2.count() / 50000)+1)
                     .persist(StorageLevel.DISK_ONLY_2)
                     )
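
The `repartition` calls above all size the number of partitions from the row count in the same way. A minimal sketch of that arithmetic, with a hypothetical helper name (`target_partitions`) and made-up row counts:

```python
def target_partitions(row_count, rows_per_partition):
    """Number of partitions so that each holds at most about `rows_per_partition` rows."""
    # The +1 guarantees at least one partition, even for an empty DataFrame
    return row_count // rows_per_partition + 1


print(target_partitions(1200000, 50000))
```

Note that each of these calls triggers a full `count()` on the DataFrame first, so this trades an extra pass over the data for evenly sized partitions.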

## 5. Load machine learning model from HDFS

100% analogous to notebook #2:

In [9]:
import subprocess

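# .decode() keeps this working when check_output returns bytes (Python 3):
directory_list = (subprocess.check_output(["hadoop", "fs", "-ls", "/user/jsotovi2/pre2post_v2"])
                  .decode("utf-8")
                  .split("\n"))
# Keep only the file name (last path component) of the entries that have an extension:
file_names = [item.split(" ")[-1].split("/")[-1] for item in directory_list]
files_list = [name for name in file_names if "." in name]
# Extract the yyyymmdd_hhmmss suffix from every results file:
history_list = ["_".join(theFile.replace(".txt","").split("_")[-2:]) 
                for theFile in files_list 
                if "model_pre2post_results" in theFile]

# The most recent timestamp among all saved results files:
most_recent_model_date = max(datetime.strptime(a_date, "%Y%m%d_%H%M%S") for a_date in history_list)
most_recent_model_date_str = most_recent_model_date.strftime("%Y%m%d_%H%M%S")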

In [10]:
most_recent_model_date_str

'20170810_130214'
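
The file name handling above can be tried without a Hadoop cluster. The following is a minimal sketch with the same parsing logic applied to a hard-coded list; the function name `latest_model_timestamp` is hypothetical and the second and third file names are invented for the example:

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"


def latest_model_timestamp(file_names):
    """Return the newest yyyymmdd_hhmmss suffix among the results files."""
    stamps = []
    for name in file_names:
        if "model_pre2post_results" not in name:
            continue  # skip the model directories and unrelated files
        stem = name.replace(".txt", "")
        stamps.append("_".join(stem.split("_")[-2:]))
    # max() raises ValueError if no results file is found
    newest = max(datetime.strptime(stamp, TIMESTAMP_FORMAT) for stamp in stamps)
    return newest.strftime(TIMESTAMP_FORMAT)


example_files = [
    "best_model_pre2post_results_20170810_130214.txt",
    "best_model_pre2post_results_20170701_091500.txt",  # invented
    "best_model_pre2post_20170810_130214.sparkModel",   # ignored: not a results file
]
print(latest_model_timestamp(example_files))
```

The lookup relies on the results `.txt` file and the `.sparkModel` directory sharing the same timestamp, which is what the next cell assumes when it builds the model path.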

In [11]:
from pyspark.ml import PipelineModel

most_recent_model = PipelineModel.load("hdfs:///user/jsotovi2/pre2post_v2/best_model_pre2post_"
                                       + most_recent_model_date_str + ".sparkModel")

## 6. Final data preparations

100% analogous to notebook #2:

In [12]:
# Get the median value for age:
training_results_file = sc.textFile("hdfs:///user/jsotovi2/pre2post_v2/best_model_pre2post_results_"
                                    + most_recent_model_date_str + ".txt")

In [13]:
training_results = dict([literal_eval(row) for row in training_results_file.collect()])
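
For readers without access to the results file, here is a minimal sketch of what the cell above does. It assumes each line of the file is the text representation of a `(key, value)` tuple, which is what the `dict([...])` call implies. The key `age_median_value` is the one used in the next cell; the values and the second key are invented:

```python
from ast import literal_eval


def parse_results(rows):
    """Turn lines such as "('key', 1.0)" into a {key: value} dictionary."""
    return dict(literal_eval(row) for row in rows)


# Invented example content; the real file lives in HDFS
example_rows = [
    "('age_median_value', 35.0)",
    "('auc_test', 0.85)",
]
print(parse_results(example_rows)["age_median_value"])
```

`literal_eval` only accepts Python literals (strings, numbers, tuples, lists, dicts and so on) and raises `ValueError` for anything else, such as a function call, which is why it is safe to use here where `eval` would not be.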

In [14]:
# The actual imputation:
prepaid_dataset_2_filled = (prepaid_dataset_2
                            .na.fill(float(training_results["age_median_value"]), subset=["age_in_years"])
                            )

In [15]:
prepaid_dataset_2_filled_filtered = (prepaid_dataset_2_filled
                                     .where(col("NACIONALIDAD")!="Rumania")
                                     .where(col("NACIONALIDAD")!="Marruecos")
                                     .where(col("NACIONALIDAD")!="Colombia")
                                     .where(col("NACIONALIDAD")!="Ecuador")
                                     .where(col("NACIONALIDAD")!="Bolivia")
                                     .where(col("NACIONALIDAD")!="Gran Bretana")
                                     .where(col("NACIONALIDAD")!="Argentina")
                                     .where(col("tipo_documento_comprador")!="Pasaporte")
                                     .where(col("tipo_documento_cliente")!="Pasaporte")
                                     .where(col("tipo_documento_comprador")!="PASAPORTE")
                                     .where(col("tipo_documento_cliente")!="PASAPORTE")
                                     .where(col("tipo_documento_comprador")!="CIF")
                                     .where(col("tipo_documento_cliente")!="CIF")
                                     .where(col("tipo_documento_comprador")!="Otros")
                                     .where(col("tipo_documento_cliente")!="Otros")
                                     .where(col("Plan")!="PPVE3")
                                     .where(col("Plan")!="PPJAT")
                                     .where(col("Plan")!="PPJ24")
                                    )

## 7. Make predictions

In [16]:
predictions = most_recent_model.transform(prepaid_dataset_2_filled_filtered).cache()

## 8. Output results

In [17]:
# In order to export the predictions, 
# we only care about two columns:
# MSISDN and the second element (index 1)
# of the probability column created
# by our model.transform (this probability
# column holds ML vectors, i.e. pyspark.ml.linalg.VectorUDT):
results = (predictions
           .select(col("MSISDN"),
                   udf(lambda x: x.tolist()[1], DoubleType())
                   (col("probability")).alias("raw_score")
                  )
          )

In [18]:
# This cell will add one more column, called percentile,
# which holds the percentile that each raw_score falls into.

# This code is a little harder to understand, but that's OK.


# Percentile computation in Spark DFs is as counter-intuitive as it gets...
percentiles = list(zip(list(reversed([i/100.0 for i in range(1, 101, 1)])),
                       results
                       .approxQuantile("raw_score", 
                                       list(reversed([i/100.0 for i in range(1, 101, 1)])),
                                       relativeError=0.0)
                  ))

# Broadcasting this list is not really necessary,
# but may help understanding the code.
# If you decide to remove the broadcast,
# remember to substitute percentiles.value with
# just percentiles in the next function.
percentiles_broadcast = sc.broadcast(percentiles)

def get_percentile(row, percentiles):
    """
    For each row of a column,
    returns the corresponding percentile.
    
    percentiles argument must be a broadcast
    value.
    """
    resulting_percentile = 1.0
    for p, q in percentiles.value:
        if row <= q:
            resulting_percentile = p
    return resulting_percentile

def get_percentile_udf(column, percentiles):
    """
    Computes the corresponding percentiles
    for one column.
    
    Args:
        column -> A Spark DF column
        percentiles -> A broadcasted list of tuples (percentile, value)
        
    Returns:
        A Spark DF column with the percentile each row belongs to.
    """
    return udf(partial(get_percentile, percentiles=percentiles),
                       DoubleType())(column)

results_with_percentile = (results
                           .withColumn("percentile", 
                                       get_percentile_udf(col("raw_score"), 
                                                          percentiles_broadcast)
                                      )
                          )
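
The lookup performed by `get_percentile` is easier to follow on a tiny example. Below is a minimal sketch of the same scan in plain Python, without the broadcast; `percentile_of` is a hypothetical name and the four cut-off values are invented (the real list has 100 entries coming from `approxQuantile`):

```python
def percentile_of(score, percentiles):
    """
    Return the smallest percentile whose cut-off value is >= score.

    `percentiles` is a list of (percentile, cut_off) tuples sorted by
    percentile in descending order, as in the cell above.
    """
    result = 1.0
    for p, q in percentiles:
        if score <= q:
            # Keep overwriting: the last match is the smallest percentile
            result = p
    return result


# Invented cut-offs: 25% of the scores are <= 0.1, 50% are <= 0.2, and so on
example_percentiles = [(1.0, 0.9), (0.75, 0.5), (0.5, 0.2), (0.25, 0.1)]

for score in [0.05, 0.15, 0.5, 0.95]:
    print(score, percentile_of(score, example_percentiles))
```

The descending order matters: the loop never breaks early, so the value that survives is the one from the last matching tuple. A score above every cut-off falls back to the default of `1.0`.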

In [19]:
results_with_percentile.show()

+---------+--------------------+----------+
|   MSISDN|           raw_score|percentile|
+---------+--------------------+----------+
|625383731|0.007328565115779894|      0.57|
|631920790|0.015470708484037669|      0.88|
|683827273|0.003903671522442377|      0.03|
|628933506|0.006139007037681297|      0.35|
|656769413|0.003516374760892...|      0.02|
|660395176|0.012018122158675678|      0.81|
|638563158|0.004931211157711...|      0.09|
|657501368|0.007424749033163...|      0.59|
|617634924| 0.00781499097487612|      0.62|
|671581870|0.006926897911887...|      0.52|
|630681055|0.006573026811215898|      0.45|
|677719608|0.006426949732688776|      0.42|
|601022800|0.007370248474481394|      0.58|
|640574722|0.006433768431484456|      0.42|
|684448694|0.029722009194229435|      0.99|
|615742660| 0.01090617127031737|      0.77|
|670846968|0.005423790997771696|      0.19|
|622321146|0.007711084329766974|      0.62|
|644122841| 0.00528795536017321|      0.16|
|691265708|0.010184206075255765|

In [20]:
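# The table name is built from month_for_getting_prepaid_data,
# so the date only has to be changed in one place:
(results_with_percentile.write.format("parquet")
 .saveAsTable("tests_es.output_pre2post_" + month_for_getting_prepaid_data + "_notprepared"))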

And we are done! There are only two tasks left:

+ Deanonymize the `MSISDN` column
+ Figure out a way to return `results_with_percentile` (with the `MSISDN` column deanonymized) back to Spain CVM