## Lab4 : Spark ML : Logistic Regression Example

### Category : Supervised learning


### Concepts :

* Creating DataFrames from CSV input data format
* Performing basic data analysis using Spark SQL
* Using Spark ML to perform Logistic Regression

### Reference :
* Spark Reference Documentation : https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#logistic-regression 

### Input Dataset :

* Adults dataset : https://archive.ics.uci.edu/ml/datasets/Adult

### Objective :

* Build and train a model that predicts whether a person's income exceeds 50K/yr based on census data. The dataset is also known as the "Census Income" dataset.

### Dataset Details:

Features:

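* age: integer
* workclass: categorical
* fnlwgt: integer, sampling weight ("final weight")
* education: categorical
* education-num: integer, numeric code for the education level
* marital-status: categorical
* occupation: categorical
* relationship: categorical
* race: categorical
* sex: categorical
* capital-gain: integer
* capital-loss: integer
* hours-per-week: integer
* native-country: categorical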

Target :

* Income: >50K/yr, <=50K/yr.


In [1]:
import os
dataset_path="/user/common/adults_data/"
outputs_path=os.environ['HOME']

In [2]:
import findspark
findspark.init()
import pyspark

In [3]:
# Create a SparkSession and specify configuration
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .master("yarn") \
    .appName("Lab4-ML-LogisticRegression") \
    .getOrCreate()

In [4]:
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("hdfs://"+dataset_path+"data.csv")

In [5]:
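# Note:
# According to the UCI documentation, this dataset has missing values.
# Let's drop rows with NA values for starters.
# dropna() returns a new DataFrame, so the result is assigned back to df.
# Then cache the DataFrame for reuse in the following cells.
df = df.dropna()
df.cache()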

DataFrame[age: int, workclass: string, fnlwgt: int, education: string, education-num: int, marital-status: string, occupation: string, relationship: string, race: string, sex: string, capital-gain: int, capital-loss: int, hours-per-week: int, native-country: string]
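
Note that `dropna()` only removes rows that contain nulls. In the UCI Adult files, missing values are written as the string `?` (it shows up further down as one of the `workclass` categories), so such rows survive this step. The plain-Python sketch below shows the idea of counting and filtering these placeholders. The sample rows and helper names are made up for illustration and are not taken from `data.csv`; on the Spark DataFrame the equivalent would be a `filter` on each string column.

```python
# Hypothetical sample rows; the values are illustrative, not taken from data.csv.
rows = [
    {"age": 39, "workclass": "State-gov", "occupation": "Adm-clerical"},
    {"age": 54, "workclass": "?", "occupation": "?"},
    {"age": 28, "workclass": "Private", "occupation": None},
    {"age": 41, "workclass": "Private", "occupation": "Sales"},
]

MISSING = "?"  # placeholder string assumed to mark a missing value


def count_placeholders(rows, marker=MISSING):
    """Count, per column, how many rows hold the placeholder string."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if value == marker:
                counts[column] = counts.get(column, 0) + 1
    return counts


def drop_nulls(rows):
    """Mimic dropna(): keep only rows without None values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def drop_placeholders(rows, marker=MISSING):
    """Keep only rows in which no column equals the placeholder."""
    return [r for r in rows if all(v != marker for v in r.values())]


after_dropna = drop_nulls(rows)
print(len(after_dropna))                      # the "?" row is still present
print(count_placeholders(after_dropna))       # placeholders left per column
print(len(drop_placeholders(after_dropna)))   # rows left after removing them
```

Whether to drop such rows or keep `?` as a category of its own is a modelling choice; the rest of this lab keeps it as a category.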

In [6]:
# Optional: convert to a pandas DataFrame to see a nicely rendered table.
# Note that toPandas() collects all rows to the driver.
pd_df=df.toPandas()

In [7]:
pd_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States


In [8]:
# Create a temporary view for SQL access
# (registerTempTable is deprecated since Spark 2.0)
df.createOrReplaceTempView("train_data")

### Inspect Data

1. How many features and records do we have?
2. What is the data format?
3. Do we need to make some transformations on our data even before starting a model?

Always inspect the manuals:
Feature Extraction and Operation: https://spark.apache.org/docs/2.2.1/ml-features.html

In [9]:
print('Nb. of records  : %d' % df.count())
print('Nb. of features : %d' % len(df.columns))

Nb. of records  : 32561
Nb. of features : 14


In [10]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: integer (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: integer (nullable = true)
 |-- capital-loss: integer (nullable = true)
 |-- hours-per-week: integer (nullable = true)
 |-- native-country: string (nullable = true)



In [11]:
# Some stats on numerical features
df.select('age').describe().show()

+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|             32561|
|   mean| 38.58164675532078|
| stddev|13.640432553581356|
|    min|                17|
|    max|                90|
+-------+------------------+



In [None]:
# How many distinct workclass values do we have?
# How many distinct education values do we have?
# How many distinct occupation values do we have?
df.select('workclass').distinct().collect()

In [None]:
df.select('education').distinct().collect()

In [None]:
df.select('occupation').distinct().collect()


### Preprocess Data

The data inspection shows that our dataset contains categorical variables, 
for example: workclass, education, marital-status, occupation, relationship.

#### Feature Transformation

Since models work on numerical values, we have to transform these variables into a numeric representation. For this transformation (categorical -> numerical) we will use the following components:

 1. **StringIndexer** 
     https://spark.apache.org/docs/2.2.1/ml-features.html#stringindexer
     StringIndexer encodes a string column of labels to a column of label indices.

 2. **OneHotEncoder**: 
     https://spark.apache.org/docs/2.2.1/ml-features.html#onehotencoder
     OneHotEncoder maps a column of label indices to a column of binary vectors, with at most a single one-value.
     This encoding allows algorithms which expect continuous features, such as Logistic Regression, 
     to use categorical features. Each categorical column will be indexed using the StringIndexer, 
     and then converted into one-hot encoded variables using the OneHotEncoder. 
     The resulting output has the binary vectors appended to the end of each row.
   
 3. **VectorAssembler**: 
     https://spark.apache.org/docs/2.2.1/ml-features.html#vectorassembler
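     VectorAssembler combines a given list of columns (raw numeric features and vectors produced 
     by other feature transformers) into a single vector column.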

 4. **Pipelines** : 
    We will have more than one stage in our transformation, so we use a **Pipeline** 
    to chain the stages together. This greatly simplifies the code.
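
To make the first two steps concrete, here is a plain-Python sketch of the idea behind them. It is not Spark's implementation. It assumes Spark's documented defaults: `StringIndexer` gives index 0 to the most frequent label, and `OneHotEncoder` drops the last category, which is why that category becomes an all-zero vector. The helper names, the tie-breaking rule, and the sample values are made up for illustration.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically here; that choice is an assumption.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Turn a label index into a 0/1 vector with at most one 1.

    With drop_last=True the last category is encoded as all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if int(index) < size:
        vector[int(index)] = 1.0
    return vector


# Hypothetical column values, not taken from the dataset.
workclass = ["Private", "State-gov", "Private", "Local-gov", "Private", "State-gov"]

mapping = fit_string_indexer(workclass)
print(mapping)
for label in ["Private", "State-gov", "Local-gov"]:
    print(label, one_hot(mapping[label], len(mapping)))
```

Dropping the last category keeps the encoded columns from summing to one in every row, which avoids redundant information in linear models such as logistic regression.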

In [12]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# For demonstration purposes, let's see how the indexer works.
# Note: on distinct values every label occurs once, so the index order carries no meaning here.
a_df=df.select('workclass').distinct()
indexer = StringIndexer(inputCol='workclass', outputCol='workclassIndex')
model = indexer.fit(a_df)
indexed = model.transform(a_df)
indexed.show()

+----------------+--------------+
|       workclass|workclassIndex|
+----------------+--------------+
|Self-emp-not-inc|           0.0|
|       Local-gov|           2.0|
|       State-gov|           1.0|
|         Private|           7.0|
|     Without-pay|           3.0|
|     Federal-gov|           4.0|
|    Never-worked|           6.0|
|               ?|           5.0|
|    Self-emp-inc|           8.0|
+----------------+--------------+



In [13]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

categoricalColumns = [
    "workclass", "education", "marital-status",
    "occupation", "relationship", "race", "sex", "native-country"]

stages = [] # stages in our Pipeline
for col in categoricalColumns:
  
  # Category Indexing with StringIndexer
  indexer = StringIndexer(inputCol=col, outputCol=col+"_index")
   
  # Use OneHotEncoder to convert categorical variables into binary SparseVectors
  encoder = OneHotEncoder(inputCol=col+"_index", outputCol=col+"_vector")
  
  # Add stages.  These are not run here, but will run all at once later on.
  stages += [indexer, encoder]

In [14]:
# Also use StringIndexer to encode our target (income) as label indices.
label_stringIdx = StringIndexer(inputCol = "income", outputCol = "label")
stages += [label_stringIdx]

In [15]:
# Now use the VectorAssembler to combine all the feature columns into a single vector column.
# VectorAssembler can combine raw features and features generated by different feature transformers
# into a single feature vector, in order to train ML models like logistic regression.
# The output will include both the numeric columns and the one-hot encoded binary vector columns in our dataset.
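
As a rough picture of what the assembler produces, the plain-Python sketch below concatenates one-hot vectors and raw numeric values into one flat feature list, in the order the input columns are given. The row contents and the `assemble` helper are hypothetical; Spark's `VectorAssembler` performs the corresponding step per row on DataFrame columns.

```python
def assemble(row, input_cols):
    """Concatenate the listed columns of one row into a single flat list.

    Vector-valued columns are expanded; scalar columns add one entry.
    """
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)
        else:
            features.append(float(value))
    return features


# Hypothetical row after indexing and one-hot encoding two categorical columns.
row = {
    "workclass_vector": [0.0, 1.0],
    "sex_vector": [1.0],
    "age": 39,
    "hours-per-week": 40,
}
input_cols = ["workclass_vector", "sex_vector", "age", "hours-per-week"]

features = assemble(row, input_cols)
print(features)
```

The order of `input_cols` fixes the position of every feature in the output, so the same list has to be used for training and for prediction.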

In [16]:
# Assemble the one-hot encoded vectors and the numerical features into a single vector using VectorAssembler
numericCols = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
assemblerInputs = [ col + "_vector" for col in categoricalColumns ] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

In [17]:
print(assemblerInputs)

['workclass_vector', 'education_vector', 'marital-status_vector', 'occupation_vector', 'relationship_vector', 'race_vector', 'sex_vector', 'native-country_vector', 'age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']


In [18]:
# Check the stages of our pipeline
for n, s in enumerate(stages):
    print('stage number %d %s' % (n, s.getOutputCol()))

stage number 0 workclass_index
stage number 1 workclass_vector
stage number 2 education_index
stage number 3 education_vector
stage number 4 marital-status_index
stage number 5 marital-status_vector
stage number 6 occupation_index
stage number 7 occupation_vector
stage number 8 relationship_index
stage number 9 relationship_vector
stage number 10 race_index
stage number 11 race_vector
stage number 12 sex_index
stage number 13 sex_vector
stage number 14 native-country_index
stage number 15 native-country_vector
stage number 16 label
stage number 17 features


In [20]:
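from pyspark.ml import Pipeline
# Create a Pipeline.
pipeline = Pipeline(stages=stages)

# Run the feature transformations.
#  - fit() computes feature statistics as needed.
#  - transform() actually transforms the features.
# Note: the label stage reads the "income" column, so df must contain it;
# the schema printed earlier lists only the 14 feature columns.
pipelineModel = pipeline.fit(df)
dataset = pipelineModel.transform(df)

# Keep relevant columns: label, features, and the original input columns
cols = df.columns
selectedcols = ["label", "features"] + cols
dataset = dataset.select(selectedcols)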

IllegalArgumentException: 'Field "income" does not exist.'
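
The fit/transform split used by the pipeline can be pictured with a small plain-Python sketch: fitting a stage learns something from the data (here, a label-to-index mapping), and the fitted result is then applied to the rows. It also shows why a stage whose input column is absent already fails at fit time. All class and function names are hypothetical and only mimic the pattern; this is not how Spark implements `Pipeline`, and the alphabetical index order is a simplification.

```python
class IndexerModel:
    """Fitted stage: transform() applies the learned mapping to each row."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = mapping

    def transform(self, rows):
        return [{**row, self.output_col: self.mapping[row[self.input_col]]}
                for row in rows]


class IndexerStage:
    """Unfitted stage: fit() learns a label -> index mapping for one column."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        if any(self.input_col not in row for row in rows):
            raise KeyError("column %r not found" % self.input_col)
        labels = sorted({row[self.input_col] for row in rows})
        mapping = {label: float(i) for i, label in enumerate(labels)}
        return IndexerModel(self.input_col, self.output_col, mapping)


def fit_pipeline(stages, rows):
    """Fit each stage in order, feeding it the output of the previous ones."""
    models = []
    for stage in stages:
        model = stage.fit(rows)
        rows = model.transform(rows)
        models.append(model)
    return models


def transform_pipeline(models, rows):
    """Apply all fitted stages in order."""
    for model in models:
        rows = model.transform(rows)
    return rows


# Hypothetical rows with a single categorical column.
rows = [{"sex": "Male"}, {"sex": "Female"}, {"sex": "Male"}]

models = fit_pipeline([IndexerStage("sex", "sex_index")], rows)
result = transform_pipeline(models, rows)
print(result)

# A stage that reads a column the data does not have fails during fit().
try:
    fit_pipeline([IndexerStage("income", "label")], rows)
    missing_column_failed = False
except KeyError as err:
    missing_column_failed = True
    print("fit failed:", err)
```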

In [None]:
spark.stop()