# Binary Classification
--> https://spark.apache.org/docs/3.3.0/ml-classification-regression.html#classification

**About Dataset**

This data was extracted from the **1994 Census Bureau database** by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was selected using the following conditions: `((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))`. The prediction task is to determine whether a person makes over $50K a year.

**Description of fnlwgt (final weight):**

The weights on the Current Population Survey (CPS) files are controlled to independent estimates of the civilian noninstitutional population of the US. These are prepared monthly for us by Population Division here at the Census Bureau. We use 3 sets of controls. These are:

- A single cell estimate of the population 16+ for each state.
- Controls for Hispanic Origin by age and sex.
- Controls by Race, age and sex.

We use all three sets of controls in our weighting program and "rake" through them 6 times so that by the end we come back to all the controls we used. The term estimate refers to population totals derived from CPS by creating "weighted tallies" of any specified socio-economic characteristics of the population. People with similar demographic characteristics should have similar weights. There is one important caveat to remember about this statement. That is that since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, the statement only applies within state.
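The "raking" described above is commonly known as iterative proportional fitting: weights are rescaled to match one set of control totals, then the next, and the cycle is repeated. The sketch below is a toy illustration only. It uses two made-up control dimensions and made-up totals, and it is not the Census Bureau's weighting program.

```python
def rake(weights, row_targets, col_targets, passes=6):
    """Iterative proportional fitting on a 2-D table of weights.

    weights: list of rows; row_targets / col_targets: desired marginal totals.
    All names and numbers here are illustrative assumptions.
    """
    table = [row[:] for row in weights]  # work on a copy
    for _ in range(passes):
        # Scale each row so its sum matches the row control total.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            table[i] = [w * target / total for w in table[i]]
        # Scale each column so its sum matches the column control total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / total
    return table


# Hypothetical sample tallies: rows = two sex groups, columns = two age groups.
sample = [[10.0, 20.0],
          [30.0, 40.0]]
raked = rake(sample, row_targets=[40.0, 60.0], col_targets=[45.0, 55.0])

row_sums = [sum(row) for row in raked]
col_sums = [sum(row[j] for row in raked) for j in range(2)]
print(row_sums, col_sums)
```

Each pass fixes one set of margins and slightly disturbs the other, which is why several passes are needed before all controls are met at once.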

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('adult_census_df').getOrCreate()

In [2]:
df = spark.read.csv('data/adult.csv', 
                    inferSchema=True, 
                    header=True)
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: integer (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: integer (nullable = true)
 |-- capital-loss: integer (nullable = true)
 |-- hours-per-week: integer (nullable = true)
 |-- native-country: string (nullable = true)
 |-- income: string (nullable = true)



In [3]:
df.show(3)

+---+---------+------+------------+-------------+--------------+---------------+-------------+-----+------+------------+------------+--------------+--------------+------+
|age|workclass|fnlwgt|   education|education-num|marital-status|     occupation| relationship| race|   sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+---+---------+------+------------+-------------+--------------+---------------+-------------+-----+------+------------+------------+--------------+--------------+------+
| 90|        ?| 77053|     HS-grad|            9|       Widowed|              ?|Not-in-family|White|Female|           0|        4356|            40| United-States| <=50K|
| 82|  Private|132870|     HS-grad|            9|       Widowed|Exec-managerial|Not-in-family|White|Female|           0|        4356|            18| United-States| <=50K|
| 66|        ?|186061|Some-college|           10|       Widowed|              ?|    Unmarried|Black|Female|           0|        4356|            

In [4]:
cols = df.columns  # original column names, reused later when selecting output columns

In [5]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

categoricalColumns = ["workclass","education","marital-status","occupation","relationship","race","sex","native-country"]
stages = []

for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + 'Index')
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]

**OneHotEncoder:**

One-hot encoding maps a categorical feature, represented as a label index, to a *binary vector* with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values. This encoding allows algorithms which expect continuous features, such as *Logistic Regression*, to use categorical features. For string type input data, it is common to encode categorical features using *StringIndexer* first.

--> https://spark.apache.org/docs/3.3.0/ml-features.html#onehotencoder
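To make the two encoding steps concrete, here is a minimal pure-Python sketch of what index-then-one-hot does to a single column. It is not Spark's implementation. The helper names and the toy column are hypothetical, and it assumes the defaults of ordering labels by descending frequency and dropping the last category.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent first.

    Ties are broken alphabetically (an assumption mirroring the
    'frequencyDesc' ordering).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: index for index, value in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Return a dense 0/1 list for a category index.

    With drop_last=True the last category becomes the all-zeros vector,
    which is why the vector holds *at most* a single one-value.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# Hypothetical toy column, not taken from the census data.
sex = ["Male", "Female", "Male", "Male", "Female", "Other"]
mapping = fit_string_index(sex)
encoded = [one_hot(mapping[v], len(mapping)) for v in sex]
print(mapping)
print(encoded[0], encoded[1], encoded[5])
```

The least frequent label ("Other" here) ends up as the all-zeros vector, so three categories need only two vector slots.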

In [6]:
label_stringIdx = StringIndexer(inputCol='income', outputCol='label')
stages += [label_stringIdx]

In [7]:
numericCols = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

In [8]:
stages

[StringIndexer_6ab842a02e81,
 OneHotEncoder_a76b22138103,
 StringIndexer_6db8f836e0ba,
 OneHotEncoder_1163fba67edc,
 StringIndexer_89ee5de91eac,
 OneHotEncoder_ac6de442e167,
 StringIndexer_a5cdf82f90ef,
 OneHotEncoder_5c492b60adfe,
 StringIndexer_0bddf29077eb,
 OneHotEncoder_186b41d1cfac,
 StringIndexer_215765d09a59,
 OneHotEncoder_81546ad4a6ad,
 StringIndexer_37061eb76596,
 OneHotEncoder_c0a23a7b15d3,
 StringIndexer_ca61bae2a5c2,
 OneHotEncoder_4faaa7c153a2,
 StringIndexer_de0580952fdc,
 VectorAssembler_96cc3cc8cd59]

**Pipeline:**

In machine learning, it is common to run a sequence of algorithms to process and learn from data. A Pipeline is specified as *a sequence of stages*, and each stage is either a *Transformer* or an *Estimator*. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For *Transformer* stages, the transform() method is called on the DataFrame. For *Estimator* stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer’s transform() method is called on the DataFrame.

--> https://spark.apache.org/docs/3.3.0/ml-pipeline.html#pipeline
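The fit/transform flow described above can be sketched with a few toy classes. This is a simplified illustration of the idea, not Spark's Pipeline code. `MeanCenterer`, `Shift`, `Doubler`, and the two helper functions are hypothetical names, and plain lists of numbers stand in for DataFrames.

```python
class MeanCenterer:
    """Toy Estimator: fit() learns the mean and returns a Transformer."""

    def fit(self, rows):
        mean = sum(rows) / len(rows)
        return Shift(-mean)


class Shift:
    """Toy Transformer: adds a fixed offset to every value."""

    def __init__(self, offset):
        self.offset = offset

    def transform(self, rows):
        return [r + self.offset for r in rows]


class Doubler:
    """Toy Transformer with nothing to learn."""

    def transform(self, rows):
        return [2 * r for r in rows]


def fit_pipeline(stages, rows):
    """Fit stages in order and return the list of fitted Transformers."""
    fitted = []
    for stage in stages:
        # Estimators are fitted first; plain Transformers are used as-is.
        transformer = stage.fit(rows) if hasattr(stage, "fit") else stage
        rows = transformer.transform(rows)  # output feeds the next stage
        fitted.append(transformer)
    return fitted


def transform_pipeline(fitted, rows):
    """Apply every fitted Transformer in order."""
    for transformer in fitted:
        rows = transformer.transform(rows)
    return rows


model = fit_pipeline([MeanCenterer(), Doubler()], [1.0, 2.0, 3.0])
print(transform_pipeline(model, [1.0, 2.0, 3.0]))
print(transform_pipeline(model, [4.0]))
```

The fitted list plays the role of the PipelineModel: it remembers the mean learned during fitting and reuses it on any new data.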

In [9]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
selectedcols = ["label", "features"] + cols
df = df.select(selectedcols)
df.show(3)

+-----+--------------------+---+---------+------+------------+-------------+--------------+---------------+-------------+-----+------+------------+------------+--------------+--------------+------+
|label|            features|age|workclass|fnlwgt|   education|education-num|marital-status|     occupation| relationship| race|   sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+-----+--------------------+---+---------+------+------------+-------------+--------------+---------------+-------------+-----+------+------------+------------+--------------+--------------+------+
|  0.0|(100,[3,8,27,36,4...| 90|        ?| 77053|     HS-grad|            9|       Widowed|              ?|Not-in-family|White|Female|           0|        4356|            40| United-States| <=50K|
|  0.0|(100,[0,8,27,31,4...| 82|  Private|132870|     HS-grad|            9|       Widowed|Exec-managerial|Not-in-family|White|Female|           0|        4356|            18| United-States| <=50K|
|  0.0|(10

In [10]:
df.select(["label", "features"]).show(3, truncate=False)

+-----+------------------------------------------------------------------------------------------------------+
|label|features                                                                                              |
+-----+------------------------------------------------------------------------------------------------------+
|0.0  |(100,[3,8,27,36,44,48,53,94,95,96,98,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,90.0,77053.0,9.0,4356.0,40.0])  |
|0.0  |(100,[0,8,27,31,44,48,53,94,95,96,98,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,82.0,132870.0,9.0,4356.0,18.0]) |
|0.0  |(100,[3,9,27,36,46,49,53,94,95,96,98,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,66.0,186061.0,10.0,4356.0,40.0])|
+-----+------------------------------------------------------------------------------------------------------+
only showing top 3 rows



In [11]:
train, test = df.randomSplit([0.7, 0.3], seed=1)
print(train.count())
print(test.count())
print(df.count())

22866
9695
32561
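Because `randomSplit` assigns rows by random sampling, the weights `[0.7, 0.3]` are targets rather than exact proportions. A quick arithmetic check on the counts printed above shows how close the realized split is.

```python
# Row counts copied from the output of cell [11].
train_rows, test_rows, total_rows = 22866, 9695, 32561

# The two splits partition the data: no row is lost or duplicated.
assert train_rows + test_rows == total_rows

train_fraction = train_rows / total_rows
test_fraction = test_rows / total_rows
print(f"train: {train_fraction:.4f}, test: {test_fraction:.4f}")
```

The realized train share is roughly 70.2%, slightly off the requested 70%.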


## Logistic Regression

In [12]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol='label', featuresCol='features', maxIter=10)

In [13]:
lrModel = lr.fit(train)

In [14]:
predictions = lrModel.transform(test)

In [15]:
predictions.take(1)

[Row(label=0.0, features=SparseVector(100, {0: 1.0, 8: 1.0, 23: 1.0, 29: 1.0, 43: 1.0, 48: 1.0, 52: 1.0, 53: 1.0, 94: 26.0, 95: 58426.0, 96: 9.0, 99: 50.0}), age=26, workclass='Private', fnlwgt=58426, education='HS-grad', education-num=9, marital-status='Married-civ-spouse', occupation='Prof-specialty', relationship='Husband', race='White', sex='Male', capital-gain=0, capital-loss=0, hours-per-week=50, native-country='United-States', income='<=50K', rawPrediction=DenseVector([0.9007, -0.9007]), probability=DenseVector([0.7111, 0.2889]), prediction=0.0)]
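For a binary logistic regression, the `probability` column is the logistic (sigmoid) function applied to the `rawPrediction` margins. The sketch below checks this with the rounded values from the row above; the default decision threshold of 0.5 is assumed.

```python
import math


def sigmoid(margin):
    """Logistic function: maps a real-valued margin to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


raw_prediction = [0.9007, -0.9007]  # rounded values from the row above
probability = [sigmoid(m) for m in raw_prediction]

# With a 0.5 threshold, the predicted class is the more probable one.
predicted_class = max(range(2), key=lambda k: probability[k])
print([round(p, 4) for p in probability], predicted_class)
```

The two probabilities sum to one because the two margins are negatives of each other.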

In [16]:
predictions.printSchema()

root
 |-- label: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: integer (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: integer (nullable = true)
 |-- capital-loss: integer (nullable = true)
 |-- hours-per-week: integer (nullable = true)
 |-- native-country: string (nullable = true)
 |-- income: string (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [17]:
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
selected.show(5)

+-----+----------+--------------------+---+--------------+
|label|prediction|         probability|age|    occupation|
+-----+----------+--------------------+---+--------------+
|  0.0|       0.0|[0.71109703485537...| 26|Prof-specialty|
|  0.0|       0.0|[0.73511243507650...| 26|Prof-specialty|
|  0.0|       0.0|[0.68297237231899...| 37|Prof-specialty|
|  0.0|       0.0|[0.70718806674485...| 39|Prof-specialty|
|  0.0|       0.0|[0.65134053112092...| 39|Prof-specialty|
+-----+----------+--------------------+---+--------------+
only showing top 5 rows



In [18]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
print('Area Under ROC:', evaluator.evaluate(predictions))

Area Under ROC: 0.905866418491401


In [19]:
evaluator.getMetricName()

'areaUnderROC'
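Area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on a tiny made-up sample. It is a pure-Python illustration, not how the evaluator computes the metric internally, so values on real data may differ slightly.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores, not taken from the model above.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
auc = area_under_roc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which puts the 0.906 reported above in context.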

In [20]:
print(lr.explainParams())

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features, current: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label, current: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The

In [21]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# ParamGridBuilder: Builder for a param grid used in grid search-based model selection.
# Create ParamGrid for Cross Validation:
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.5, 2.0])
             .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
             .addGrid(lr.maxIter, [1, 5, 10])
             .build())

CrossValidator --> https://spark.apache.org/docs/3.3.0/ml-tuning.html#cross-validation

In [22]:
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
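It helps to estimate the cost before calling `fit`. The grid above is the Cartesian product of three lists, and each combination is trained once per fold. The sketch below counts the fits, assuming one additional refit of the best combination on the full training set at the end.

```python
from itertools import product

# Values copied from the param grid in cell [21].
reg_params = [0.01, 0.5, 2.0]
elastic_net_params = [0.0, 0.5, 1.0]
max_iters = [1, 5, 10]
num_folds = 5

param_maps = list(product(reg_params, elastic_net_params, max_iters))
fits_during_search = len(param_maps) * num_folds
total_fits = fits_during_search + 1  # plus the final refit of the best map
print(len(param_maps), fits_during_search, total_fits)
```

Adding one more value to any list multiplies the work, so grids are best kept small at first.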

In [23]:
cvModel = cv.fit(train)

In [24]:
predictions = cvModel.transform(test)
print('Area Under ROC:', evaluator.evaluate(predictions))

Area Under ROC: 0.903116589697211


## Decision Tree

In [25]:
from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(featuresCol='features', labelCol='label', maxDepth=3)

In [26]:
dtModel = dt.fit(train)

In [27]:
print("numNodes:", dtModel.numNodes)
print("depth:", dtModel.depth)

numNodes: 11
depth: 3


In [28]:
predictions = dtModel.transform(test)

In [29]:
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
selected.show(5)

+-----+----------+--------------------+---+--------------+
|label|prediction|         probability|age|    occupation|
+-----+----------+--------------------+---+--------------+
|  0.0|       0.0|[0.68463574313971...| 26|Prof-specialty|
|  0.0|       0.0|[0.68463574313971...| 26|Prof-specialty|
|  0.0|       0.0|[0.68463574313971...| 37|Prof-specialty|
|  0.0|       0.0|[0.68463574313971...| 39|Prof-specialty|
|  0.0|       0.0|[0.68463574313971...| 39|Prof-specialty|
+-----+----------+--------------------+---+--------------+
only showing top 5 rows



In [30]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()
print('Area Under ROC:', evaluator.evaluate(predictions))

Area Under ROC: 0.6654557586612584


In [31]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [1, 2, 6, 10])
             .addGrid(dt.maxBins, [20, 40, 80])
             .build())

In [32]:
# Create 5-fold CrossValidator
cv = CrossValidator(estimator=dt, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

In [33]:
# Run cross validations
cvModel = cv.fit(train)

In [34]:
print("numNodes:", cvModel.bestModel.numNodes)
print("depth:", cvModel.bestModel.depth)

numNodes: 431
depth: 10
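The node counts are easier to interpret against the largest tree each depth allows. Assuming binary splits and a root at depth 0, a tree of depth d has at most 2^(d+1) - 1 nodes. The sketch below compares that bound with the counts reported in cells [27] and [34].

```python
def max_nodes(depth):
    """Upper bound on nodes in a binary tree whose root is at depth 0."""
    return 2 ** (depth + 1) - 1


# (description, depth, observed numNodes) copied from the outputs above.
trees = [("maxDepth=3 tree", 3, 11), ("cross-validated tree", 10, 431)]

for name, depth, observed in trees:
    bound = max_nodes(depth)
    print(f"{name}: {observed} of at most {bound} nodes ({observed / bound:.0%})")
```

The tuned tree is far from full, which suggests many branches stopped splitting before reaching the depth limit.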


In [35]:
predictions = cvModel.transform(test)

In [36]:
print('Area Under ROC:', evaluator.evaluate(predictions))

Area Under ROC: 0.7767684536456595


In [37]:
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
selected.show(5)

+-----+----------+--------------------+---+--------------+
|label|prediction|         probability|age|    occupation|
+-----+----------+--------------------+---+--------------+
|  0.0|       1.0|[0.28571428571428...| 26|Prof-specialty|
|  0.0|       1.0|[0.28571428571428...| 26|Prof-specialty|
|  0.0|       0.0|[0.65982241953385...| 37|Prof-specialty|
|  0.0|       0.0|[0.65982241953385...| 39|Prof-specialty|
|  0.0|       0.0|[0.65982241953385...| 39|Prof-specialty|
+-----+----------+--------------------+---+--------------+
only showing top 5 rows



## Random Forest

In [38]:
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(featuresCol='features', labelCol='label')

In [39]:
rfModel = rf.fit(train)

In [40]:
predictions = rfModel.transform(test)

In [41]:
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
selected.show(5)

+-----+----------+--------------------+---+--------------+
|label|prediction|         probability|age|    occupation|
+-----+----------+--------------------+---+--------------+
|  0.0|       0.0|[0.65378863422984...| 26|Prof-specialty|
|  0.0|       0.0|[0.64427856964731...| 26|Prof-specialty|
|  0.0|       0.0|[0.61239774963949...| 37|Prof-specialty|
|  0.0|       0.0|[0.61239774963949...| 39|Prof-specialty|
|  0.0|       0.0|[0.62620954129365...| 39|Prof-specialty|
+-----+----------+--------------------+---+--------------+
only showing top 5 rows



In [42]:
evaluator = BinaryClassificationEvaluator()
print('Area Under ROC:', evaluator.evaluate(predictions))

Area Under ROC: 0.8897445960420728


In [43]:
paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [2, 4, 6])
             .addGrid(rf.maxBins, [20, 60])
             .addGrid(rf.numTrees, [5, 20])
             .build())

In [44]:
# Create 5-fold CrossValidator
cv = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

In [45]:
# Run cross validations. This can take about 5 minutes since it is training over 20 trees!
cvModel = cv.fit(train)

In [46]:
predictions = cvModel.transform(test)
print('Area Under ROC:', evaluator.evaluate(predictions))

Area Under ROC: 0.8958787979888452


In [47]:
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
selected.show(5)

+-----+----------+--------------------+---+--------------+
|label|prediction|         probability|age|    occupation|
+-----+----------+--------------------+---+--------------+
|  0.0|       0.0|[0.68698170495699...| 26|Prof-specialty|
|  0.0|       0.0|[0.68531737261994...| 26|Prof-specialty|
|  0.0|       0.0|[0.62443624953687...| 37|Prof-specialty|
|  0.0|       0.0|[0.62443624953687...| 39|Prof-specialty|
|  0.0|       0.0|[0.62191661844471...| 39|Prof-specialty|
+-----+----------+--------------------+---+--------------+
only showing top 5 rows



## Making Predictions

In [48]:
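bestModel = cvModel.bestModel
# Score every row of the full DataFrame (train and test) with the best model.
final_predictions = bestModel.transform(df)
# The AUC below is computed on `predictions` (the test set from cell [46]),
# not on `final_predictions`, so it repeats the test-set value.
print('Area Under ROC:', evaluator.evaluate(predictions))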

Area Under ROC: 0.8958787979888452
