## Lab4 : Spark ML : Logistic Regression Example

### Category : Supervised learning


### Concepts :

* Creating DataFrames from CSV input data format
* Performing basic data analysis using Spark SQL
* Using Spark ML to perform Logistic Regression

### Reference :
* Spark Reference Documentation : https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#logistic-regression 

### Input Dataset :

* Adults dataset : https://archive.ics.uci.edu/ml/datasets/Adult

### Objective :

* Build and train a model that predicts whether a person's income exceeds 50K/yr based on census data. The dataset is also known as the "Census Income" dataset.

### Dataset Details:

Features:

* age: 
* workclass:
* fnlwgt:
* education: 
* education-num: 
* marital-status:
* occupation: 
* relationship: 
* race: 
* sex:
* capital-gain: 
* capital-loss: 
* hours-per-week: 
* native-country: 

Target :

* Income: >50K/yr, <=50K/yr.


In [36]:
import os
dataset_path="/data/shared/spark/adults_data/"
outputs_path=os.environ['HOME']

In [37]:
import findspark
findspark.init()
import pyspark

In [38]:
os.environ['HADOOP_HOME']="/usr/hdp/current/hadoop-client/"
os.environ['HADOOP_CONF_DIR']="/usr/hdp/current/hadoop-client/conf/"

In [39]:
print(os.environ['HADOOP_HOME'])
print(os.environ['HADOOP_CONF_DIR'])

/usr/hdp/current/hadoop-client/
/usr/hdp/current/hadoop-client/conf/


In [40]:
# Create a SparkSession and specify configuration
from pyspark.sql import SparkSession

# spark = SparkSession \
#    .builder \
#    .master("yarn") \
#    .config("spark.executor.cores", "2")  \
#    .config("spark.executor.memory","3G") \
#    .appName("Lab4-ML-LogisticRegression") \
#    .getOrCreate()

spark = SparkSession \
    .builder \
    .master("local[16]") \
    .config("spark.executor.memory","4G") \
    .appName("Lab4-ML-LogisticRegression") \
    .getOrCreate()

In [48]:
train_df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("file://"+dataset_path+"train.csv")

In [49]:
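# Note:
# According to UCI, the dataset contains missing values.
# Drop rows with null values for starters. dropna() returns a new DataFrame,
# so the result must be assigned. Missing values encoded as '?' are not nulls
# and are not removed by this call.
train_df = train_df.dropna()
# Cache the DataFrame, since it is reused in the cells below
train_df.cache()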

DataFrame[age: int, workclass: string, fnlwgt: int, education: string, education-num: int, marital-status: string, occupation: string, relationship: string, race: string, sex: string, capital-gain: int, capital-loss: int, hours-per-week: int, native-country: string]
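
`dropna()` only removes nulls, whereas this dataset marks missing entries with `?` (visible in the distinct `workclass` and `occupation` values further down). The following is a minimal sketch for dropping those rows as well. The helper name `drop_placeholder_rows` and its `placeholder` parameter are hypothetical, and it assumes the placeholder is stored without surrounding whitespace, as in the outputs of this notebook.

```python
def drop_placeholder_rows(df, placeholder="?"):
    """Drop rows where any string column equals the missing-value placeholder.

    `df` is assumed to be a Spark DataFrame such as train_df.
    Rows where a string column is null are dropped too, because a
    comparison with null does not evaluate to true in a filter.
    """
    for name, dtype in df.dtypes:
        if dtype == "string":
            df = df.filter(df[name] != placeholder)
    return df
```

In the notebook this could be applied as `train_df = drop_placeholder_rows(train_df)` before caching. Whether dropping these rows is preferable to treating `?` as its own category is a modelling choice.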

In [51]:
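# Optional: convert to pandas to see a nicely rendered table.
# toPandas() collects every row to the driver, so use it only on small data.
pd_df=train_df.toPandas()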

In [52]:
pd_df

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba |
| 5 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States |
| 6 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica |
| 7 | 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States |
| 8 | 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States |
| 9 | 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States |


In [53]:
# Register a temporary view for SQL access
# (registerTempTable is deprecated since Spark 2.0)
train_df.createOrReplaceTempView("train_data")
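
With the table registered, basic analysis can be expressed in SQL. The following is a minimal sketch, assuming the `spark` session and the `train_data` table from the cells above. The helper names `workclass_counts` and `avg_hours_by_education` are hypothetical. Column names containing a hyphen, such as `hours-per-week`, have to be quoted with backticks in Spark SQL.

```python
# Hypothetical helpers; `spark` is the SparkSession created earlier and
# `train_data` is the temporary table registered above.

WORKCLASS_COUNTS_SQL = """
    SELECT workclass, COUNT(*) AS nb_records
    FROM train_data
    GROUP BY workclass
    ORDER BY nb_records DESC
"""

# Hyphenated column names must be backtick-quoted, otherwise the hyphen
# is parsed as a minus sign.
AVG_HOURS_SQL = """
    SELECT education, AVG(`hours-per-week`) AS avg_hours
    FROM train_data
    GROUP BY education
    ORDER BY avg_hours DESC
"""


def workclass_counts(spark):
    """Number of records per workclass, most frequent first."""
    return spark.sql(WORKCLASS_COUNTS_SQL)


def avg_hours_by_education(spark):
    """Average weekly working hours per education level."""
    return spark.sql(AVG_HOURS_SQL)
```

In the notebook these would be called as `workclass_counts(spark).show()` and `avg_hours_by_education(spark).show()`.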

### Inspect Data

1. How many features and records do we have?
2. What is the data format?
3. Do we need to make some transformations on our data even before starting a model?

Always consult the documentation:

* Feature Extraction and Operation : https://spark.apache.org/docs/2.2.1/ml-features.html

In [56]:
print('Nb. of records  : %d' % train_df.count())
print('Nb. of features : %d' % len(train_df.columns))

Nb. of records  : 32561
Nb. of features : 14


In [57]:
train_df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: integer (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: integer (nullable = true)
 |-- capital-loss: integer (nullable = true)
 |-- hours-per-week: integer (nullable = true)
 |-- native-country: string (nullable = true)



In [67]:
# List the distinct workclass values
# List the distinct education values
# List the distinct occupation values
train_df.select('workclass').distinct().collect()

[Row(workclass='Self-emp-not-inc'),
 Row(workclass='Local-gov'),
 Row(workclass='State-gov'),
 Row(workclass='Private'),
 Row(workclass='Without-pay'),
 Row(workclass='Federal-gov'),
 Row(workclass='Never-worked'),
 Row(workclass='?'),
 Row(workclass='Self-emp-inc')]

In [68]:
train_df.select('education').distinct().collect()

[Row(education='Masters'),
 Row(education='10th'),
 Row(education='5th-6th'),
 Row(education='Assoc-acdm'),
 Row(education='Assoc-voc'),
 Row(education='7th-8th'),
 Row(education='9th'),
 Row(education='HS-grad'),
 Row(education='Bachelors'),
 Row(education='11th'),
 Row(education='1st-4th'),
 Row(education='Preschool'),
 Row(education='12th'),
 Row(education='Doctorate'),
 Row(education='Some-college'),
 Row(education='Prof-school')]

In [69]:
train_df.select('occupation').distinct().collect()

[Row(occupation='Sales'),
 Row(occupation='Exec-managerial'),
 Row(occupation='Prof-specialty'),
 Row(occupation='Handlers-cleaners'),
 Row(occupation='Farming-fishing'),
 Row(occupation='Craft-repair'),
 Row(occupation='Transport-moving'),
 Row(occupation='Priv-house-serv'),
 Row(occupation='Protective-serv'),
 Row(occupation='Other-service'),
 Row(occupation='Tech-support'),
 Row(occupation='Machine-op-inspct'),
 Row(occupation='Armed-Forces'),
 Row(occupation='?'),
 Row(occupation='Adm-clerical')]


### Preprocess Data

The data inspection shows that our dataset contains categorical variables,
for example: workclass, education, marital-status, occupation, relationship.
Since models operate on numerical values, we need to transform these variables into a numeric representation.

To convert categorical variables (features) into numerical ones,
we will use the following elements for our transformation:

 * StringIndexer 
 * OneHotEncoder
 * VectorAssembler

1. **StringIndexer**: https://spark.apache.org/docs/2.2.1/ml-features.html#stringindexer
   StringIndexer encodes a string column of labels to a column of label indices.

2. **OneHotEncoder**: https://spark.apache.org/docs/2.2.1/ml-features.html#onehotencoder
   OneHotEncoder maps a column of label indices to a column of binary vectors, with at most a single one-value.
   This encoding allows algorithms that expect continuous features, such as Logistic Regression,
   to use categorical features.

   Each categorical column will be indexed using the StringIndexer
   and then converted into one-hot encoded variables using the OneHotEncoder.

   The resulting output has the binary vectors appended to the end of each row.
   
3. **VectorAssembler**: https://spark.apache.org/docs/2.2.1/ml-features.html#vectorassembler
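   VectorAssembler combines a given list of columns into a single vector column.
   We use it to merge the numeric columns and the one-hot encoded vectors into one feature vector.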

4. **Pipelines** : 

   Our transformation has more than one 'process' or stage, so we use a **Pipeline**
   to chain the stages together. This greatly simplifies the code.
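
To make the three transformations concrete, here is a plain-Python sketch of what they do to a handful of rows. It illustrates the semantics only and is not Spark's implementation. The function names `string_index` and `one_hot` and the sample values are made up for this example, and breaking frequency ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def string_index(values):
    """Like StringIndexer: the most frequent label gets index 0.0.

    Ties are broken alphabetically here (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Like OneHotEncoder: a binary vector with at most a single one-value.

    With drop_last=True the last category is encoded as all zeros,
    so the vector has num_categories - 1 entries.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if int(index) < size:
        vector[int(index)] = 1.0
    return vector


# Made-up sample rows: one categorical and one numeric column
workclass = ["Private", "Private", "State-gov", "Private", "Local-gov", "State-gov"]
ages = [39, 50, 38, 53, 28, 37]

mapping = string_index(workclass)

# Like VectorAssembler: concatenate the encoded vector and the numeric column
features = [
    one_hot(mapping[w], len(mapping)) + [float(a)]
    for w, a in zip(workclass, ages)
]

for row in features:
    print(row)
```

Here `Private` is the most frequent label and gets index 0.0, while the least frequent label `Local-gov` is the dropped last category and is encoded as all zeros.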

In [96]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

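# For demonstration purposes, let's see how the indexer works.
# Note: every label occurs exactly once in a_df, so the frequency-based
# ordering of the indices carries no meaning here.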
a_df=train_df.select('workclass').distinct()
indexer = StringIndexer(inputCol='workclass', outputCol='workclassIndex')
model = indexer.fit(a_df)
indexed = model.transform(a_df)
indexed.show()

+----------------+--------------+
|       workclass|workclassIndex|
+----------------+--------------+
|Self-emp-not-inc|           0.0|
|       Local-gov|           2.0|
|       State-gov|           1.0|
|         Private|           7.0|
|     Without-pay|           3.0|
|     Federal-gov|           4.0|
|    Never-worked|           6.0|
|               ?|           5.0|
|    Self-emp-inc|           8.0|
+----------------+--------------+



In [97]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

categoricalColumns = [
    "workclass", "education", "marital-status",
    "occupation", "relationship", "race", "sex", "native-country"]

stages = []  # stages in our Pipeline
for col in categoricalColumns:

    # Category indexing with StringIndexer
    indexer = StringIndexer(inputCol=col, outputCol=col+"_Index")

    # Use OneHotEncoder to convert categorical variables into binary SparseVectors
    encoder = OneHotEncoder(inputCol=col+"_Index", outputCol=col+"_Vector")

    # Add stages. These are not run here, but will run all at once later on.
    stages += [indexer, encoder]

In [98]:
# Also use StringIndexer to encode our target (income) as label indices.
# Note: this stage expects an `income` column in the DataFrame it is fitted on;
# the schema printed above lists only the 14 feature columns.
label_stringIdx = StringIndexer(inputCol="income", outputCol="label")
stages += [label_stringIdx]

In [99]:
# Now use the VectorAssembler to combine all the feature columns into a single vector column.
# It is useful for combining raw features and features generated by different feature transformers
# into a single feature vector, in order to train ML models like logistic regression and decision trees.
# The output will include both the numeric columns and the one-hot encoded binary vector columns of our dataset.

In [89]:
# Combine the one-hot encoded vectors and the numerical features into a single vector using VectorAssembler
numericCols = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
# In Python 3, map() returns an iterator that cannot be concatenated with a list,
# so build the list of encoded column names with a list comprehension
assemblerInputs = [c + "_Vector" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

In [103]:
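# Check the stages of our pipeline.
# Note: the listing below was recorded without the VectorAssembler stage (17 stages);
# once the assembler has been appended, it shows up as an additional final stage.
for n, s in enumerate(stages):
    print('stage number %d %s' % (n, s))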

stage number 0 StringIndexer_4f3fbb3dd14984065fc3
stage number 1 OneHotEncoder_41af825b9f54e7006688
stage number 2 StringIndexer_4a9699b520cd4cdcfb98
stage number 3 OneHotEncoder_4b5dae9a965c9ea8d5b8
stage number 4 StringIndexer_41968dd7bf734b0ed23b
stage number 5 OneHotEncoder_444bbf3fddace177da4d
stage number 6 StringIndexer_4aa38bf9d5b0ab398743
stage number 7 OneHotEncoder_41c98174068ccdc121b2
stage number 8 StringIndexer_4ed698d713c1894043c2
stage number 9 OneHotEncoder_430bb9b38b2b6106d6fc
stage number 10 StringIndexer_4a4db52534db0e430568
stage number 11 OneHotEncoder_447b89c5551d0c84c7b5
stage number 12 StringIndexer_4a79982800a3829f2d9c
stage number 13 OneHotEncoder_4f7ba8335dfefd534190
stage number 14 StringIndexer_42218f584c4e07eeb73b
stage number 15 OneHotEncoder_47a4b3e6ca623ab984e8
stage number 16 StringIndexer_4ac091cae220553a3091


In [None]:
from pyspark.ml import Pipeline
# Create a Pipeline.
pipeline = Pipeline(stages=stages)

# Run the feature transformations on the training DataFrame.
#  - fit() computes feature statistics as needed.
#  - transform() actually transforms the features.
pipelineModel = pipeline.fit(train_df)
dataset = pipelineModel.transform(train_df)

# Keep relevant columns: label, features and the original input columns
cols = train_df.columns
selectedcols = ["label", "features"] + cols
dataset = dataset.select(selectedcols)
dataset.show(5)
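
Once a DataFrame with `label` and `features` columns is available, a logistic regression model can be trained on it. The following is a minimal sketch and not part of the original lab. The helper name `train_logistic_regression`, the 70/30 split, the seed and `maxIter=10` are illustrative assumptions. The pyspark imports are placed inside the function so that it can be defined before `findspark.init()` has run.

```python
def train_logistic_regression(dataset, seed=42):
    """Fit a logistic regression model and report the area under the ROC curve.

    `dataset` is assumed to be a Spark DataFrame with a numeric `label`
    column and a vector `features` column, as produced by a feature pipeline.
    The split ratio, seed and maxIter are illustrative choices.
    """
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # Hold out part of the data to evaluate the model on unseen records
    train, test = dataset.randomSplit([0.7, 0.3], seed=seed)

    lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10)
    model = lr.fit(train)

    # transform() adds rawPrediction, probability and prediction columns
    predictions = model.transform(test)

    evaluator = BinaryClassificationEvaluator(
        labelCol="label",
        rawPredictionCol="rawPrediction",
        metricName="areaUnderROC",
    )
    return model, evaluator.evaluate(predictions)
```

In the notebook this could be called as `model, auc = train_logistic_regression(dataset)` before the Spark session is stopped.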

In [35]:
spark.stop()