# Predicting whether an internet user will click on an ad based on that user's features

In this project we work with a synthetic advertising data set that records whether a particular internet user clicked on an advertisement.
We build a model that predicts whether a user will click on an ad based on that user's features.

This data set contains the following features:

 - **`Daily Time Spent on Site`**: consumer time on site in minutes
 - **`Age`**: customer age in years
 - **`Area Income`**: Avg. Income of geographical area of consumer
 - **`Daily Internet Usage`**: Avg. minutes a day consumer is on the internet
 - **`Ad Topic Line`**: Headline of the advertisement
 - **`City`**: City of consumer
 - **`Male`**: Whether or not consumer was male
 - **`Country`**: Country of consumer
 - **`Timestamp`**: Time at which consumer clicked on Ad or closed window
 - **`Clicked on Ad`**: 0 or 1, indicating whether the consumer clicked on the ad

### Initialize and create a spark session

In [1]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Advertisement").getOrCreate()

Intitializing Scala interpreter ...

Spark Web UI available at http://Varun-CK:4040
SparkContext available as 'sc' (version = 2.3.0, master = local[*], app id = local-1577539551875)
SparkSession available as 'spark'


2019-12-28 18:56:04 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.


import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@187ea8d3


### Initialize Logger

In [2]:
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)

import org.apache.log4j._


### Import statements to setup ML for Logistic Regression

In [3]:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer,VectorAssembler,OneHotEncoder}
import org.apache.spark.ml.linalg.Vectors

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler, OneHotEncoder}
import org.apache.spark.ml.linalg.Vectors


### Using Spark to read in the advertising csv file

In [4]:
val data = spark.read.options(Map(("header","true"),("inferSchema","true"))).csv("advertising.csv")

data: org.apache.spark.sql.DataFrame = [Daily Time Spent on Site: double, Age: int ... 8 more fields]


### Printing the first row of the dataframe

In [5]:
data.head(1)

res1: Array[org.apache.spark.sql.Row] = Array([68.95,35,61833.9,256.09,Cloned 5thgeneration orchestration,Wrightburgh,0,Tunisia,2016-03-27 00:53:11.0,0])


### Printing the schema of the dataframe

In [6]:
data.printSchema

root
 |-- Daily Time Spent on Site: double (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Area Income: double (nullable = true)
 |-- Daily Internet Usage: double (nullable = true)
 |-- Ad Topic Line: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Male: integer (nullable = true)
 |-- Country: string (nullable = true)
 |-- Timestamp: timestamp (nullable = true)
 |-- Clicked on Ad: integer (nullable = true)



### Showing the first five rows

In [7]:
data.show(5)

+------------------------+---+-----------+--------------------+--------------------+--------------+----+----------+-------------------+-------------+
|Daily Time Spent on Site|Age|Area Income|Daily Internet Usage|       Ad Topic Line|          City|Male|   Country|          Timestamp|Clicked on Ad|
+------------------------+---+-----------+--------------------+--------------------+--------------+----+----------+-------------------+-------------+
|                   68.95| 35|    61833.9|              256.09|Cloned 5thgenerat...|   Wrightburgh|   0|   Tunisia|2016-03-27 00:53:11|            0|
|                   80.23| 31|   68441.85|              193.77|Monitored nationa...|     West Jodi|   1|     Nauru|2016-04-04 01:39:02|            0|
|                   69.47| 26|   59785.94|               236.5|Organic bottom-li...|      Davidton|   0|San Marino|2016-03-13 20:35:42|            0|
|                   74.15| 29|   54806.18|              245.89|Triple-buffered r...|West Terrifurt| 

### Inspecting hour of the timestamp column to check whether it contains any valuable information

In [21]:
data.groupBy(hour(data("Timestamp"))).count().orderBy($"Count".desc).show(5)

+---------------+-----+
|hour(Timestamp)|count|
+---------------+-----+
|              7|   54|
|             20|   50|
|              9|   49|
|             21|   48|
|              0|   45|
+---------------+-----+
only showing top 5 rows



### Converting the "Timestamp" column to a numeric feature by extracting the hour

In [9]:
val filtered_data = data.withColumn("Hour",hour(data("Timestamp")))

filtered_data: org.apache.spark.sql.DataFrame = [Daily Time Spent on Site: double, Age: int ... 9 more fields]


### Counting the rows

In [10]:
filtered_data.count()

res5: Long = 1000


### Counting the rows after dropping rows with missing values

In [11]:
filtered_data.na.drop().count()

res6: Long = 1000
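Since `na.drop()` removes rows that contain nulls while `dropDuplicates()` removes repeated rows, the two counts answer different questions. The sketch below illustrates the difference on a hypothetical three-row DataFrame (`toy`, with made-up `user` and `age` columns that are not part of the advertising data); it assumes a local Spark session is available.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses an already running session if there is one.
val spark = SparkSession.builder().appName("NullsVsDuplicates").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical toy rows: one duplicated row and one row with a missing age.
val toy = Seq[(String, Option[Int])](
  ("a", Some(30)),
  ("a", Some(30)),
  ("b", None)
).toDF("user", "age")

val total = toy.count()                              // every row
val withoutNulls = toy.na.drop().count()             // rows with any null removed
val withoutDuplicates = toy.dropDuplicates().count() // repeated rows collapsed into one

println(s"total=$total, withoutNulls=$withoutNulls, withoutDuplicates=$withoutDuplicates")
```

In the notebook both counts are 1000, so the advertising data has no rows with missing values; duplicates would need a separate `dropDuplicates()` check.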


### Checking whether the string columns "City", "Country" and "Ad Topic Line" are useful

In [16]:
filtered_data.groupBy("City").count().orderBy($"Count".desc).show(5)

+------------+-----+
|        City|count|
+------------+-----+
|   Lisamouth|    3|
|Williamsport|    3|
|  Lake David|    2|
|  South Lisa|    2|
|North Daniel|    2|
+------------+-----+
only showing top 5 rows



In [17]:
filtered_data.groupBy("Country").count().orderBy($"Count".desc).show(5)

+--------------+-----+
|       Country|count|
+--------------+-----+
|        France|    9|
|Czech Republic|    9|
|          Peru|    8|
|        Turkey|    8|
|        Greece|    8|
+--------------+-----+
only showing top 5 rows



In [18]:
filtered_data.groupBy("Ad Topic Line").count().orderBy($"Count".desc).show(5)

+--------------------+-----+
|       Ad Topic Line|count|
+--------------------+-----+
|Phased zero admin...|    1|
|Customizable modu...|    1|
|Reactive national...|    1|
|Vision-oriented h...|    1|
|Ergonomic client-...|    1|
+--------------------+-----+
only showing top 5 rows



### The "City", "Country" and "Ad Topic Line" columns contain a large number of categories with very few rows each, so they are not useful features and are dropped

In [22]:
val df = filtered_data.drop("City","Country","Ad Topic Line")

df: org.apache.spark.sql.DataFrame = [Daily Time Spent on Site: double, Age: int ... 6 more fields]


### Assembling all the features to a single vector column "features"

In [24]:
df.printSchema

root
 |-- Daily Time Spent on Site: double (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Area Income: double (nullable = true)
 |-- Daily Internet Usage: double (nullable = true)
 |-- Male: integer (nullable = true)
 |-- Timestamp: timestamp (nullable = true)
 |-- Clicked on Ad: integer (nullable = true)
 |-- Hour: integer (nullable = true)



In [31]:
val assembler = new VectorAssembler().setInputCols(Array("Daily Time Spent on Site","Age","Area Income","Daily Internet Usage"
                                                         ,"Male","Hour")).setOutputCol("features")

assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_e6a36fdcc8e8
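To make the effect of `VectorAssembler` concrete, here is a minimal sketch that assembles three of the columns for the first two rows shown earlier. The names `toy`, `toyAssembler` and `firstFeatures` are illustrative only, and a local Spark session is assumed.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses an already running session if there is one.
val spark = SparkSession.builder().appName("AssemblerSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Two rows taken from the data.show(5) output, restricted to three numeric columns.
val toy = Seq((68.95, 35, 0), (80.23, 31, 1)).toDF("Daily Time Spent on Site", "Age", "Male")

// The assembler copies the listed numeric columns, in order, into one vector column.
val toyAssembler = new VectorAssembler().setInputCols(Array("Daily Time Spent on Site", "Age", "Male")).setOutputCol("features")

val assembled = toyAssembler.transform(toy)
val firstFeatures = assembled.select("features").head().getAs[Vector](0)

assembled.show(false)
```

Integer inputs such as `Age` and `Male` are converted to doubles inside the vector, which is why the later `features` column shows values like `38.0`.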


### Splitting the resulting data into training data and testing data

<code>
<b>Training data is used to train the model</b>
<b>Testing data is used to test the fitted model</b>
</code>

In [27]:
val Array(train_data,test_data) = df.randomSplit(Array(0.7,0.3))

train_data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Daily Time Spent on Site: double, Age: int ... 6 more fields]
test_data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Daily Time Spent on Site: double, Age: int ... 6 more fields]
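`randomSplit` treats the weights as approximate proportions, and without a seed the rows assigned to each side change from run to run. If reproducible results are wanted, a seed can be passed as the second argument. The following is a sketch on a hypothetical 1000-row DataFrame (`toy`); the seed value `42L` is an arbitrary choice, not something taken from the notebook.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses an already running session if there is one.
val spark = SparkSession.builder().appName("SeededSplit").master("local[*]").getOrCreate()

// Hypothetical stand-in for the 1000-row advertising DataFrame.
val toy = spark.range(0, 1000).toDF("id")

// The same weights and the same seed give the same partition of the rows.
val Array(trainA, testA) = toy.randomSplit(Array(0.7, 0.3), seed = 42L)
val Array(trainB, testB) = toy.randomSplit(Array(0.7, 0.3), seed = 42L)

println(s"train=${trainA.count()}, test=${testA.count()}")
println(s"rows that differ between the two runs: ${trainA.except(trainB).count()}")
```

The two sides always add up to the full data set, but the individual sizes only approximate the 70/30 weights.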


In [28]:
df.describe().show()

+-------+------------------------+-----------------+------------------+--------------------+-------------------+------------------+-----------------+
|summary|Daily Time Spent on Site|              Age|       Area Income|Daily Internet Usage|               Male|     Clicked on Ad|             Hour|
+-------+------------------------+-----------------+------------------+--------------------+-------------------+------------------+-----------------+
|  count|                    1000|             1000|              1000|                1000|               1000|              1000|             1000|
|   mean|       65.00020000000012|           36.009| 55000.00008000003|  180.00010000000003|              0.481|               0.5|            11.66|
| stddev|      15.853614567500212|8.785562310125924|13414.634022282358|    43.9023393019801|0.49988887654046543|0.5002501876563867|6.960952151455644|
|    min|                    32.6|               19|           13996.5|              104.78|        

In [29]:
train_data.describe().show()

+-------+------------------------+------------------+------------------+--------------------+------------------+------------------+------------------+
|summary|Daily Time Spent on Site|               Age|       Area Income|Daily Internet Usage|              Male|     Clicked on Ad|              Hour|
+-------+------------------------+------------------+------------------+--------------------+------------------+------------------+------------------+
|  count|                     721|               721|               721|                 721|               721|               721|               721|
|   mean|        65.2325104022192|36.059639389736475|55102.108585298214|  180.93984743411903|0.4882108183079057|0.4895977808599168|11.515950069348127|
| stddev|      15.841922948837537| 8.917557237943695|13475.913660724953|  43.448656245734746|0.5002080011183364|0.5002388087432303| 7.056130763602599|
|    min|                    32.6|                19|           13996.5|               105.0| 

In [30]:
test_data.describe().show()

+-------+------------------------+------------------+------------------+--------------------+-------------------+------------------+-----------------+
|summary|Daily Time Spent on Site|               Age|       Area Income|Daily Internet Usage|               Male|     Clicked on Ad|             Hour|
+-------+------------------------+------------------+------------------+--------------------+-------------------+------------------+-----------------+
|  count|                     279|               279|               279|                 279|                279|               279|              279|
|   mean|       64.39985663082437|35.878136200716845|54736.128279569886|  177.57157706093193|0.46236559139784944|0.5268817204301075|12.03225806451613|
| stddev|       15.89651735469592| 8.449140399601253| 13275.35319978117|   45.04194916468742|0.49947756414594924|0.5001740240205834|6.706785374727563|
|    min|                    32.6|                19|          17709.98|              104.78| 

### Creating a logistic regression model object

In [34]:
val lor = new LogisticRegression().setLabelCol("Clicked on Ad").setFeaturesCol("features")

lor: org.apache.spark.ml.classification.LogisticRegression = logreg_f3ae57488577


### Setting Up the Pipeline

In [35]:
import org.apache.spark.ml.Pipeline

import org.apache.spark.ml.Pipeline


In [36]:
val pipeline = new Pipeline().setStages(Array(assembler,lor))

pipeline: org.apache.spark.ml.Pipeline = pipeline_56e02631b389


### Fitting the pipeline to training set.

In [37]:
val pipelineModel = pipeline.fit(train_data)

2019-12-29 00:53:37 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
2019-12-29 00:53:37 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS


pipelineModel: org.apache.spark.ml.PipelineModel = pipeline_56e02631b389


### Getting Results on Test Set

In [38]:
val results = pipelineModel.transform(test_data)

results: org.apache.spark.sql.DataFrame = [Daily Time Spent on Site: double, Age: int ... 10 more fields]


In [39]:
results.printSchema

root
 |-- Daily Time Spent on Site: double (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Area Income: double (nullable = true)
 |-- Daily Internet Usage: double (nullable = true)
 |-- Male: integer (nullable = true)
 |-- Timestamp: timestamp (nullable = true)
 |-- Clicked on Ad: integer (nullable = true)
 |-- Hour: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [40]:
val output = results.select("Clicked on Ad","rawPrediction","prediction","probability","features")

output: org.apache.spark.sql.DataFrame = [Clicked on Ad: int, rawPrediction: vector ... 3 more fields]


In [41]:
output.show(5)

+-------------+--------------------+----------+--------------------+--------------------+
|Clicked on Ad|       rawPrediction|prediction|         probability|            features|
+-------------+--------------------+----------+--------------------+--------------------+
|            1|[-9.1403211454047...|       1.0|[1.07241378264061...|[32.6,38.0,40159....|
|            1|[-10.588619855570...|       1.0|[2.52005419826471...|[32.84,40.0,41232...|
|            1|[-10.003348132975...|       1.0|[4.52461316382355...|[32.99,45.0,49282...|
|            1|[-9.3818791610223...|       1.0|[8.42296640411551...|[34.3,41.0,53167....|
|            1|[-11.620008633307...|       1.0|[8.98442875143056...|[34.96,42.0,36913...|
+-------------+--------------------+----------+--------------------+--------------------+
only showing top 5 rows



### MODEL EVALUATION

#### 1) Converting the data to an RDD and evaluating with MulticlassMetrics to print the confusion matrix

In [43]:
import org.apache.spark.mllib.evaluation.MulticlassMetrics

import org.apache.spark.mllib.evaluation.MulticlassMetrics


In [44]:
val clean_result = output.withColumn("Clicked on Ad",output("Clicked on Ad").cast("double"))

clean_result: org.apache.spark.sql.DataFrame = [Clicked on Ad: double, rawPrediction: vector ... 3 more fields]


In [45]:
clean_result.select("Clicked on Ad","prediction").show(5)

+-------------+----------+
|Clicked on Ad|prediction|
+-------------+----------+
|          1.0|       1.0|
|          1.0|       1.0|
|          1.0|       1.0|
|          1.0|       1.0|
|          1.0|       1.0|
+-------------+----------+
only showing top 5 rows



In [49]:
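// Note: MulticlassMetrics expects (prediction, label) pairs; selecting (label, prediction) transposes the confusion matrix (rows = predicted, columns = actual).
val predictionAndLabel = clean_result.select("Clicked on Ad","prediction").as[(Double,Double)].rdd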

predictionAndLabel: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[171] at rdd at <console>:51


In [50]:
val metrics = new MulticlassMetrics(predictionAndLabel)

metrics: org.apache.spark.mllib.evaluation.MulticlassMetrics = org.apache.spark.mllib.evaluation.MulticlassMetrics@5f1e727


### Printing the confusion matrix

In [53]:
println(metrics.confusionMatrix)

130.0  7.0    
2.0    140.0  
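The four cells of the matrix are enough to recompute the headline metrics by hand. The sketch below is plain Scala and assumes the pair order used in cell `In [49]`, namely (label, prediction), which means the rows of the printed matrix are the model's predictions and the columns are the actual labels. This reading is consistent with the test-set summary, where the mean of `Clicked on Ad` over 279 rows corresponds to 147 actual clicks (7 + 140). The variable names are illustrative.

```scala
// Cell values copied from the printed confusion matrix.
// Assumed orientation: rows = predicted class, columns = actual class.
val trueNegatives = 130.0  // predicted 0, actual 0
val falseNegatives = 7.0   // predicted 0, actual 1
val falsePositives = 2.0   // predicted 1, actual 0
val truePositives = 140.0  // predicted 1, actual 1

val total = trueNegatives + falseNegatives + falsePositives + truePositives

// Share of all test rows that were classified correctly.
val accuracy = (trueNegatives + truePositives) / total

// Of the rows predicted as clicks, the share that really were clicks.
val precisionClicked = truePositives / (truePositives + falsePositives)

// Of the rows that really were clicks, the share the model found.
val recallClicked = truePositives / (truePositives + falseNegatives)

println(f"accuracy=$accuracy%.4f precision(1)=$precisionClicked%.4f recall(1)=$recallClicked%.4f")
```

Accuracy does not depend on the orientation of the matrix, but precision and recall for a single class do, so they swap if the rows are read as actual labels instead.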


### Printing the Accuracy

In [54]:
println(metrics.accuracy)

0.967741935483871


### Recall

In [56]:
println(metrics.recall)

0.967741935483871


### Precision

In [57]:
println(metrics.precision)

0.967741935483871
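The accuracy, recall and precision printed above are identical because, in Spark 2.x, the no-argument `recall` and `precision` methods of `MulticlassMetrics` are deprecated aliases that return the overall accuracy. Per-class values are available through the overloads that take a label. The sketch below shows them on a small set of hypothetical (prediction, label) pairs; `pairs` and `toyMetrics` are illustrative names, and a local Spark session is assumed.

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses an already running session if there is one.
val spark = SparkSession.builder().appName("PerClassMetrics").master("local[*]").getOrCreate()

// Hypothetical (prediction, label) pairs, in the order MulticlassMetrics expects.
val pairs = spark.sparkContext.parallelize(Seq(
  (1.0, 1.0), (1.0, 1.0), (1.0, 0.0),
  (0.0, 1.0), (0.0, 0.0), (0.0, 0.0)
))

val toyMetrics = new MulticlassMetrics(pairs)

// Metrics for the positive class (label 1.0) only.
val precisionClicked = toyMetrics.precision(1.0)
val recallClicked = toyMetrics.recall(1.0)
val f1Clicked = toyMetrics.fMeasure(1.0)

println(s"accuracy=${toyMetrics.accuracy} precision(1)=$precisionClicked recall(1)=$recallClicked f1(1)=$f1Clicked")
```

Applied to the notebook's `metrics` object, the same calls would report how well the model does on clicks specifically, provided the pairs are built in (prediction, label) order.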


#### 2) Evaluating using BinaryClassificationEvaluator

In [58]:
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator


In [59]:
val bin_eval = new BinaryClassificationEvaluator().setRawPredictionCol("rawPrediction").setLabelCol("Clicked on Ad")

bin_eval: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator = binEval_46a0709642e4


### Calculating Area Under ROC

In [61]:
val AOC = bin_eval.evaluate(output)

AOC: Double = 0.9891774891774894


### Printing Area Under ROC

In [62]:
println(AOC)

0.9891774891774894
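`BinaryClassificationEvaluator` reports the area under the ROC curve by default, and `setMetricName` can switch it to the area under the precision-recall curve. The sketch below is a toy illustration with hypothetical scores: it uses a plain double column as `rawPrediction` (the evaluator accepts either a double or a vector column) and the evaluator's default column names. A local Spark session is assumed.

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses an already running session if there is one.
val spark = SparkSession.builder().appName("BinaryEvalSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical scores for label 1; every positive row scores higher than every negative row.
val toy = Seq((0.9, 1.0), (0.8, 1.0), (0.3, 0.0), (0.1, 0.0)).toDF("rawPrediction", "label")

val areaUnderROC = new BinaryClassificationEvaluator().setMetricName("areaUnderROC").evaluate(toy)
val areaUnderPR = new BinaryClassificationEvaluator().setMetricName("areaUnderPR").evaluate(toy)

println(s"areaUnderROC=$areaUnderROC areaUnderPR=$areaUnderPR")
```

Because the toy scores separate the two classes perfectly, both areas come out at their maximum; the notebook's value of about 0.989 indicates the real model ranks almost all clicks above non-clicks.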


#### 3) Evaluating using MulticlassClassificationEvaluator

In [63]:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator


In [64]:
val multi_eval = new MulticlassClassificationEvaluator().setPredictionCol("prediction").setLabelCol("Clicked on Ad")

multi_eval: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_261a6dc05813


### Calculating the F1 score (the default metric of MulticlassClassificationEvaluator)

In [65]:
val AOC_2 = multi_eval.evaluate(output)

AOC_2: Double = 0.967762682621492


### Printing the F1 score

In [66]:
println(AOC_2)

0.967762682621492
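`MulticlassClassificationEvaluator` does not compute an area under the ROC curve; its default metric is the weighted F1 score, and other metrics such as `accuracy` can be requested with `setMetricName`. The following sketch shows this on a hypothetical six-row DataFrame (`toy`) using the evaluator's default column names; a local Spark session is assumed.

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses an already running session if there is one.
val spark = SparkSession.builder().appName("MulticlassEvalSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical (label, prediction) rows: four correct and two wrong.
val toy = Seq(
  (1.0, 1.0), (1.0, 1.0), (1.0, 0.0),
  (0.0, 1.0), (0.0, 0.0), (0.0, 0.0)
).toDF("label", "prediction")

val evaluator = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("prediction")

// "f1" is the default; it is set explicitly here to make the choice visible.
val f1 = evaluator.setMetricName("f1").evaluate(toy)
val accuracy = evaluator.setMetricName("accuracy").evaluate(toy)

println(s"f1=$f1 accuracy=$accuracy")
```

With the notebook's `multi_eval`, calling `setMetricName("accuracy")` before `evaluate(output)` should reproduce the accuracy obtained from `MulticlassMetrics`.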


### Stopping the created spark session

In [67]:
spark.stop()

## Thank You!