For our first taste of programming with Spark, we'll revisit the Lending Club dataset you first used earlier in the course. Instead of building a random forest though, we'll perform a logistic regression to determine the likelihood that someone will be approved for a loan.

Some notes:
* Data loading: PySpark can't load a CSV directly from a URL, so we copy the data file into our container and load it from there. This isn't the case for something like HDFS, however; you can load directly from Hadoop. For details, see the [external datasets section of the RDD programming guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html#external-datasets).
* Dropping columns: `drop` returns a new DataFrame; it doesn't remove columns in place.

* Data prep strategy:
    1. Drop columns that have a high proportion of nulls: keep only columns with at most 10 nulls in total and drop the rest.
    2. For the remaining columns, review the schema and types. Cast any string columns that should be numeric to the appropriate type.
    3. Once all column types are correct, look at the categorical variables and create dummies as needed.
    
    
* At that point data prep will be complete and we can run the random forest.
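
The null-count filter in step 1 can be pictured on a tiny in-memory table before running it in Spark. This is a plain-Python stand-in, not Spark code: the `rows` data, the helper names, and the toy threshold `MAX_NULLS = 1` are all made up for illustration (the strategy above uses 10).

```python
# Hypothetical toy data: each row is a dict, None marks a missing value.
rows = [
    {"loan_amnt": 5000, "desc": None, "grade": "A"},
    {"loan_amnt": 12000, "desc": None, "grade": "B"},
    {"loan_amnt": None, "desc": None, "grade": "C"},
]
MAX_NULLS = 1  # toy threshold; the strategy above uses 10


def count_nulls(rows):
    """Return {column: number of None values} across all rows."""
    counts = {column: 0 for column in rows[0]}
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return counts


def drop_sparse_columns(rows, max_nulls):
    """Build NEW rows without columns that have more than max_nulls nulls."""
    null_counts = count_nulls(rows)
    keep = [c for c, n in null_counts.items() if n <= max_nulls]
    return [{c: row[c] for c in keep} for row in rows]


cleaned = drop_sparse_columns(rows, MAX_NULLS)
print(sorted(cleaned[0]))  # columns that survived the filter
```

Like a Spark DataFrame, the original `rows` is left untouched here; the filtered result has to be bound to a new name.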

In [1]:
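# Spark's CSV reader can't fetch an HTTP(S) URL, so download the file from
# https://www.dropbox.com/s/m7z42lubaiory33/LoanStats3d.csv?dl=0
# into the container and load the local copy.
CSV_PATH = "/home/ds/notebooks/datasets/LoanStats3d.csv"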
APP_NAME = "Lending Club Random Forest Example"
SPARK_URL = "local[*]"
RANDOM_SEED = 141107
TRAINING_DATA_RATIO = 0.7
RF_NUM_TREES = 10
RF_MAX_DEPTH = 4
RF_NUM_BINS = 32

In [2]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [4]:
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName(APP_NAME) \
    .master(SPARK_URL) \
    .getOrCreate()

df = spark.read \
    .options(header = "true", inferschema = "true") \
    .csv(CSV_PATH)

TypeError: join() argument must be str or bytes, not 'NoneType'

In [None]:
from pyspark.sql.functions import isnan, when, count, col

null_counts = df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).toPandas().to_dict(orient='records')

In [None]:
null_counts[0]

In [None]:
null_columns_to_drop = [key for key, value in null_counts[0].items() if value > 10]

In [None]:
null_columns_to_drop

In [None]:
print(len(df.columns))
print(len(null_counts[0].keys()))
print(len(null_columns_to_drop))

In [None]:
df_updated = df.drop(*null_columns_to_drop)

In [None]:
print(len(df_updated.columns))

In [None]:
print("Total number of rows: %d" % df_updated.count())

In [None]:
df_updated.printSchema()

We still have a lot of columns, and we should probably look at reducing them further. For now, let's look only at the string columns and check whether any of them should be converted to a numeric datatype.

In [None]:
categoricals = [col[0] for col in df_updated.dtypes if col[1] == 'string']
print(len(categoricals))

In [None]:
df_updated.select(*categoricals[:6]).show(5)
df_updated.select(*categoricals[6:12]).show(5)
df_updated.select(*categoricals[12:17]).show(5)
df_updated.select(*categoricals[17:22]).show(5)
df_updated.select(*categoricals[22:]).show(5)

We want to make the following numeric:
term, int_rate, annual_inc, inq_last_6mths, total_acc, out_prncp, dti, last_pymnt_amnt

We'll drop the following columns:
last_credit_pull_d, issue_d, zip_code, addr_state, earliest_cr_line
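
To make the string-to-number cleanup concrete, here is a minimal plain-Python sketch of what stripping the suffix and casting does to a single value. The helper names and the sample strings (`"36 months"`, `"13.99%"`) are assumptions for illustration; check the `.show(5)` output above for the actual formats in the data.

```python
def parse_term(raw):
    """Turn a term string such as '36 months' into the integer 36."""
    return int(raw.replace(" months", "").strip())


def parse_percent(raw):
    """Turn a percentage string such as '13.99%' into the float 13.99."""
    return float(raw.replace("%", "").strip())


print(parse_term("36 months"))
print(parse_percent("13.99%"))
```

The Spark version applies the same idea to a whole column at once with `regexp_replace` followed by `cast`.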

In [None]:
from pyspark.sql.functions import regexp_replace

df_updated = df_updated.withColumn('term', regexp_replace(df_updated['term'], " months", "").cast("int"))
df_updated = df_updated.withColumn('int_rate', regexp_replace(df_updated['int_rate'], "%", "").cast("float"))
df_updated = df_updated.withColumn('annual_inc', df_updated['annual_inc'].cast("int"))
df_updated = df_updated.withColumn('inq_last_6mths', df_updated['inq_last_6mths'].cast("int"))
df_updated = df_updated.withColumn('total_acc', df_updated['total_acc'].cast("int"))
df_updated = df_updated.withColumn('out_prncp', df_updated['out_prncp'].cast("float"))
df_updated = df_updated.withColumn('dti', df_updated['dti'].cast("float"))
df_updated = df_updated.withColumn('last_pymnt_amnt', df_updated['last_pymnt_amnt'].cast("float"))

df_final = df_updated.drop('last_credit_pull_d', 'issue_d', 'zip_code', 'addr_state', 'earliest_cr_line')

In [None]:
df_final.printSchema()

In [None]:
categoricals = [col[0] for col in df_final.dtypes if col[1] == 'string']
categoricals.remove('loan_status')
uniques = [df_final.select(col).distinct().count() for col in categoricals]
categorical_unique_counts = dict(zip(categoricals, uniques))

In [None]:
len(categoricals)

In [None]:
df_final.select(categoricals[:8]).show(10)
df_final.select(categoricals[8:]).show(10)

In [None]:
categorical_unique_counts

Now we need to convert the categoricals to dummies.
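
Conceptually, this takes two steps per column: `StringIndexer` maps each category to a numeric index (by default, the most frequent label gets index 0), and `OneHotEncoder` with `dropLast=False` turns that index into a vector with one slot per category. The plain-Python sketch below mimics that behaviour on made-up values. The helper names, the toy `home_ownership` list, and the alphabetical tie-breaking are assumptions for illustration, and Spark stores the result as a sparse vector rather than a list.

```python
from collections import Counter


def fit_string_index(values):
    """Map each category to an index: most frequent first, ties alphabetical."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, size):
    """Return a dense 0/1 list with a single 1 at position index."""
    return [1 if position == index else 0 for position in range(size)]


home_ownership = ["RENT", "OWN", "RENT", "MORTGAGE", "RENT", "OWN"]  # toy values
mapping = fit_string_index(home_ownership)
dummies = [one_hot(mapping[value], len(mapping)) for value in home_ownership]
print(mapping)
print(dummies[0])
```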

In [None]:
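categorical_dummies = [c + '_dmy' for c in categoricals]

# Map each categorical string column to a numeric index column.
indexers = [StringIndexer(inputCol=x, outputCol=x + '_tmp')
            for x in categoricals]

# One-hot encode each index column, keeping every category (dropLast=False).
encoders = [OneHotEncoder(dropLast=False, inputCol=x + '_tmp', outputCol=y)
            for x, y in zip(categoricals, categorical_dummies)]

# Interleave the stages: indexer, encoder, indexer, encoder, ...
tmp = [stage for pair in zip(indexers, encoders) for stage in pair]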

In [None]:
# prepare labeled sets
non_categoricals = [col[0] for col in df_final.dtypes if col[1] != 'string']
# list.extend() returns None, so build the combined list with + instead.
cols_now = non_categoricals + categorical_dummies

print(len(cols_now))

In [None]:
df_final.printSchema()

In [None]:
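# Assemble the numeric columns and one-hot dummies into a single feature vector.
assembler_features = VectorAssembler(inputCols=cols_now, outputCol='features')
# NOTE: 'binary_response' is not created earlier in this notebook; it must be
# added to df_final (e.g. derived from 'loan_status') before the pipeline is fit.
labelIndexer = StringIndexer(inputCol='binary_response', outputCol="label")
tmp += [assembler_features, labelIndexer]
pipeline = Pipeline(stages=tmp)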

In [None]:
df_updated.select("loan_amnt", "int_rate", "loan_status").show(10)

In [None]:
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="loan_status", outputCol="indexed_loan_status").fit(df_updated)

In [None]:
from matplotlib import pyplot as plt
import numpy as np
import functools
%matplotlib inline
 
statuses = df_updated.groupBy('loan_status').count().collect()
categories = [i[0] for i in statuses]
counts = [i[1] for i in statuses]
 
ind = np.array(range(len(categories)))
width = 0.35
plt.bar(ind, counts, width=width, color='r')
 
plt.ylabel('counts')
plt.title('Status distribution')
plt.xticks(ind + width/2., categories)

In [None]:
categorical_unique_counts