# Logistic Regression

Let's walk through an example of running a logistic regression with Python and Spark! This is the documentation example: we will run through it quickly and then show a more realistic example. Afterwards, you will have another consulting project!

In [1]:
import findspark
findspark.init('/home/czh/spark-2.4.6-bin-hadoop2.7/')

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('logregdoc').getOrCreate()

In [3]:
from pyspark.ml.classification import LogisticRegression

In [4]:
# Load training data
training = spark.read.format("libsvm").load("sample_libsvm_data.txt")

lr = LogisticRegression()

# Fit the model
lrModel = lr.fit(training)

trainingSummary = lrModel.summary

In [5]:
training.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|
|  1.0|(692,[152,153,154...|
|  1.0|(692,[151,152,153...|
|  0.0|(692,[129,130,131...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[99,100,101,...|
|  0.0|(692,[154,155,156...|
|  0.0|(692,[127,128,129...|
|  1.0|(692,[154,155,156...|
|  0.0|(692,[153,154,155...|
|  0.0|(692,[151,152,153...|
|  1.0|(692,[129,130,131...|
|  0.0|(692,[154,155,156...|
|  1.0|(692,[150,151,152...|
|  0.0|(692,[124,125,126...|
|  0.0|(692,[152,153,154...|
|  1.0|(692,[97,98,99,12...|
|  1.0|(692,[124,125,126...|
+-----+--------------------+
only showing top 20 rows
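
The `features` column above prints sparse vectors as `(size,[indices],[values])`. As a rough illustration of where those come from, here is a minimal pure-Python sketch of parsing one LIBSVM-format line, assuming the usual `label index:value ...` layout with one-based indices in the file. The helper name `parse_libsvm_line` and the sample line are hypothetical; Spark's own loader handles this for you.

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM-style line into (label, {zero_based_index: value})."""
    parts = line.strip().split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        # LIBSVM files use one-based indices; shift to zero-based here
        features[int(index) - 1] = float(value)
    return label, features


# Hypothetical sample line, not taken from sample_libsvm_data.txt
label, features = parse_libsvm_line("1.0 3:0.5 10:2.0")
print(label, features)
```

Only the non-zero entries are stored, which is why each row shows a vector size (692 here) followed by a short list of indices.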



In [6]:
trainingSummary.predictions.select('probability', 'prediction').show(20, False)

+-------------------------------------------+----------+
|probability                                |prediction|
+-------------------------------------------+----------+
|[0.9999999976135945,2.3864054543471998E-9] |0.0       |
|[1.4132155511105642E-9,0.9999999985867845] |1.0       |
|[1.2580486512697989E-12,0.9999999999987419]|1.0       |
|[6.427105091703036E-9,0.9999999935728949]  |1.0       |
|[1.2715720920060412E-9,0.9999999987284278] |1.0       |
|[0.9999999976067364,2.393263547463311E-9]  |0.0       |
|[1.4710981469558102E-9,0.9999999985289019] |1.0       |
|[3.088501681026319E-9,0.9999999969114983]  |1.0       |
|[0.9999999957267036,4.273296361841363E-9]  |0.0       |
|[0.9999999999448097,5.519035886749981E-11] |0.0       |
|[2.5681887277651024E-11,0.9999999999743181]|1.0       |
|[0.9999999999962461,3.753800775785978E-12] |0.0       |
|[0.9999999999939617,6.038255127600314E-12] |0.0       |
|[2.5311068452957554E-9,0.9999999974688931] |1.0       |
|[0.9999999992612372,7.38762892

In [7]:
train, test = training.randomSplit([0.3, 0.3])
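
`randomSplit` normalizes its weights when they do not sum to 1, so `[0.3, 0.3]` behaves like an even split rather than leaving 40% of the rows unused. The sketch below only illustrates that normalization arithmetic in plain Python; `normalize_weights` is a hypothetical helper, not part of the Spark API. If a 70/30 train/test split is what you want, `[0.7, 0.3]` would be the weights to pass.

```python
def normalize_weights(weights):
    """Scale weights so they sum to 1 (the same idea randomSplit applies)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [w / total for w in weights]


# Equal weights give equal fractions, whatever their absolute size
print(normalize_weights([0.3, 0.3]))
# Weights that already sum to 1 are left effectively unchanged
print(normalize_weights([0.7, 0.3]))
```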

In [8]:
lrModel = lr.fit(train)

In [9]:
lrModel.evaluate(test)

<pyspark.ml.classification.BinaryLogisticRegressionSummary at 0x7f3462bd32e8>

In [10]:
predictionAndLabels = lrModel.evaluate(test)

In [11]:
predictionAndLabels.predictions.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[98,99,100,1...|[27.2697742627702...|[0.99999999999856...|       0.0|
|  0.0|(692,[121,122,123...|[22.4609704788300...|[0.99999999982407...|       0.0|
|  0.0|(692,[122,123,124...|[18.6050132640227...|[0.99999999168340...|       0.0|
|  0.0|(692,[123,124,125...|[31.9491790434436...|[0.99999999999998...|       0.0|
|  0.0|(692,[123,124,125...|[31.7577501460002...|[0.99999999999998...|       0.0|
|  0.0|(692,[124,125,126...|[37.1587404143751...|[1.0,7.2805482360...|       0.0|
|  0.0|(692,[124,125,126...|[30.1092845501855...|[0.99999999999991...|       0.0|
|  0.0|(692,[124,125,126...|[29.6147429436880...|[0.99999999999986...|       0.0|
|  0.0|(692,[124,125,126...|[17.9280323186061...|[0.99999998363355...|       0.0|
|  0.0|(692,[124
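
For binary logistic regression, the `probability` column is the logistic (sigmoid) function applied to the `rawPrediction` margins. The following is a small sketch, assuming `rawPrediction` holds `[-margin, margin]` for classes 0 and 1. The input is the first raw prediction from the table above, with its digits truncated as displayed, and `sigmoid` is a helper defined here for illustration.

```python
import math


def sigmoid(x):
    """Logistic function: maps a real-valued margin to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Class-0 raw score from the first row of the table above (digits as displayed)
raw_class0 = 27.2697742627702

p_class0 = sigmoid(raw_class0)   # probability of label 0.0
p_class1 = sigmoid(-raw_class0)  # probability of label 1.0

print(p_class0, p_class1)
```

A raw score this far from zero pushes the probability very close to 1, which is why nearly every row in the output looks fully confident.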

In [12]:
predictionAndLabels = predictionAndLabels.predictions.select('label','prediction')

In [13]:
predictionAndLabels.show()

+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
+-----+----------+
only showing top 20 rows



## Evaluators

Evaluators will be a very important part of our pipeline when working with machine learning. Let's look at some basics for logistic regression. Useful links:

https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.BinaryClassificationEvaluator

https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.MulticlassClassificationEvaluator

In [14]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

In [15]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='label')
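
`BinaryClassificationEvaluator` reports area under the ROC curve by default. One way to read that number: it is the chance that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. Below is a minimal pure-Python sketch of that interpretation; `pairwise_auc` and the toy scores are made up for illustration, and this is not how Spark computes the metric internally.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1.0]
    negatives = [s for y, s in zip(labels, scores) if y == 0.0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): one negative outranks one positive
labels = [0.0, 0.0, 1.0, 1.0]
scores = [0.1, 0.6, 0.4, 0.9]
print(pairwise_auc(labels, scores))
```

Passing the hard 0/1 `prediction` column as `rawPredictionCol`, as in the cell above, collapses the scores to two values. Using the `rawPrediction` or `probability` column typically gives a more informative result.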

In [16]:
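# For multiclass (note: this reassigns `evaluator`, replacing the binary evaluator above)
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label',
                                             metricName='accuracy')

In [17]:
# `evaluator` is now the multiclass evaluator, so this value is accuracy, not AUC
accuracy = evaluator.evaluate(predictionAndLabels)

In [18]:
accuracy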

1.0
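
The value above comes from the `MulticlassClassificationEvaluator` configured with `metricName='accuracy'`, so it is the fraction of rows where `prediction` equals `label`. A plain-Python sketch of the same calculation, using made-up label/prediction pairs rather than the actual test set (`fraction_correct` is a hypothetical helper):

```python
def fraction_correct(pairs):
    """Fraction of (label, prediction) pairs that match."""
    if not pairs:
        raise ValueError("no rows to evaluate")
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)


# Hypothetical rows: three correct, one wrong
rows = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0)]
print(fraction_correct(rows))
```

A score of exactly 1.0 means every test row was classified correctly, which is plausible on this small, cleanly separable sample dataset but should not be expected on real data.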

Okay, let's move on and see some more examples!