# Dog Food Prediction

A dog food company wants to know why some batches of its dog food spoil much sooner than intended. The company has not upgraded to the latest machinery, so the amounts of the chemicals in its preservative mix vary a lot from batch to batch. Each batch of preservative contains four preservative chemicals (A, B, C, D) and is topped up with a "filler" chemical. The food scientists believe that one of A, B, C, or D is causing the problem, but they need help working out which one. This notebook fits a Random Forest classifier with Spark ML and uses its feature importances to find which chemical has the most predictive power for spoilage, and is therefore the most likely cause of the early spoiling.

The data set has the following columns:

- **Pres_A**: Percentage of preservative A in the mix (column `A` in the data)
- **Pres_B**: Percentage of preservative B in the mix (column `B` in the data)
- **Pres_C**: Percentage of preservative C in the mix (column `C` in the data)
- **Pres_D**: Percentage of preservative D in the mix (column `D` in the data)
- **Spoiled**: Label indicating whether or not the dog food batch was spoiled.

### Initialize and create a Spark session

In [1]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("dog_food").getOrCreate()

Intitializing Scala interpreter ...

Spark Web UI available at http://Varun-CK:4040
SparkContext available as 'sc' (version = 2.3.0, master = local[*], app id = local-1577781517846)
SparkSession available as 'spark'


2019-12-31 14:08:34 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-12-31 14:08:53 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.


import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@75214a33


### Initializing Logger

In [2]:
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)

import org.apache.log4j._


### Using Spark to read the Dog Food data set

In [3]:
val data = spark.read.options(Map(("header","true"),("inferSchema","true"))).csv("dog_food.csv")

data: org.apache.spark.sql.DataFrame = [A: int, B: int ... 3 more fields]


### Count

In [4]:
data.count()

res1: Long = 490


### Schema

In [5]:
data.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)



### Show

In [6]:
data.show(5)

+---+---+----+---+-------+
|  A|  B|   C|  D|Spoiled|
+---+---+----+---+-------+
|  4|  2|12.0|  3|    1.0|
|  5|  6|12.0|  7|    1.0|
|  6|  2|13.0|  6|    1.0|
|  4|  2|12.0|  1|    1.0|
|  4|  2|12.0|  3|    1.0|
+---+---+----+---+-------+
only showing top 5 rows



### Describe

In [7]:
data.describe().show()

+-------+------------------+------------------+------------------+------------------+-------------------+
|summary|                 A|                 B|                 C|                 D|            Spoiled|
+-------+------------------+------------------+------------------+------------------+-------------------+
|  count|               490|               490|               490|               490|                490|
|   mean|  5.53469387755102| 5.504081632653061| 9.126530612244897| 5.579591836734694| 0.2857142857142857|
| stddev|2.9515204234399057|2.8537966089662063|2.0555451971054275|2.8548369309982857|0.45221563164613465|
|    min|                 1|                 1|               5.0|                 1|                0.0|
|    max|                10|                10|              14.0|                10|                1.0|
+-------+------------------+------------------+------------------+------------------+-------------------+
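
The summary above also gives the class balance. The sketch below derives it from the reported row count and the mean of `Spoiled`, assuming `Spoiled` only takes the values 0.0 and 1.0 (consistent with the min and max shown, but not verified here). The object name `ClassBalance` is illustrative only; inside the notebook, `data.groupBy("Spoiled").count().show()` should report the same split directly.

```scala
object ClassBalance {
  // Row count, from data.count() and the describe() output above.
  val total: Long = 490L
  // Mean of the Spoiled column, from the describe() output above.
  val meanSpoiled: Double = 0.2857142857142857

  // Assumption: Spoiled is 0.0 or 1.0, so mean * count is the number of spoiled batches.
  val spoiled: Long = math.round(meanSpoiled * total)
  val notSpoiled: Long = total - spoiled

  // Accuracy of a trivial model that always predicts the majority class.
  val majorityBaseline: Double = math.max(spoiled, notSpoiled).toDouble / total

  def main(args: Array[String]): Unit =
    println(f"spoiled=$spoiled%d notSpoiled=$notSpoiled%d baseline=$majorityBaseline%.3f")
}
```

Under that assumption, 140 of the 490 batches are spoiled and 350 are not, so a classifier has to beat roughly 71% accuracy to be better than always predicting "not spoiled".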



### Import VectorAssembler and Vectors

In [8]:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors


### Assembling the feature columns (A, B, C, D) into a single vector column "features"

In [9]:
val assembler = new VectorAssembler().setInputCols(Array("A", "B", "C", "D")).setOutputCol("features")

assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_5a5f7b950f56


In [10]:
val output = assembler.transform(data)

output: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]


In [11]:
output.show(5)

+---+---+----+---+-------+------------------+
|  A|  B|   C|  D|Spoiled|          features|
+---+---+----+---+-------+------------------+
|  4|  2|12.0|  3|    1.0|[4.0,2.0,12.0,3.0]|
|  5|  6|12.0|  7|    1.0|[5.0,6.0,12.0,7.0]|
|  6|  2|13.0|  6|    1.0|[6.0,2.0,13.0,6.0]|
|  4|  2|12.0|  1|    1.0|[4.0,2.0,12.0,1.0]|
|  4|  2|12.0|  3|    1.0|[4.0,2.0,12.0,3.0]|
+---+---+----+---+-------+------------------+
only showing top 5 rows



### Creating a Random Forest Classifier Model

In [12]:
import org.apache.spark.ml.classification.RandomForestClassifier

import org.apache.spark.ml.classification.RandomForestClassifier


In [13]:
val rfc = new RandomForestClassifier().setLabelCol("Spoiled").setFeaturesCol("features")

rfc: org.apache.spark.ml.classification.RandomForestClassifier = rfc_65dcc09a5f40


In [14]:
val rfc_model = rfc.fit(output)

rfc_model: org.apache.spark.ml.classification.RandomForestClassificationModel = RandomForestClassificationModel (uid=rfc_65dcc09a5f40) with 20 trees
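
The model above is fitted on all 490 rows, which is fine for reading feature importances but says nothing about how well the forest actually predicts spoilage. As an optional sanity check, the sketch below holds out part of the data and reports accuracy on it. The object name `HoldoutCheck`, the helper `holdoutAccuracy`, the 70/30 split, and the seed are all assumptions made for illustration; none of them come from the original notebook.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.DataFrame

object HoldoutCheck {
  // Hypothetical helper: expects a DataFrame that already has the
  // "features" vector column and the "Spoiled" label column.
  def holdoutAccuracy(assembled: DataFrame, seed: Long = 42L): Double = {
    // Assumed 70/30 split; the seed makes the split repeatable.
    val Array(train, test) = assembled.randomSplit(Array(0.7, 0.3), seed)

    // Same classifier settings as above, fitted on the training part only.
    val model = new RandomForestClassifier()
      .setLabelCol("Spoiled")
      .setFeaturesCol("features")
      .setSeed(seed)
      .fit(train)

    // Score the held-out rows and compute the fraction predicted correctly.
    new MulticlassClassificationEvaluator()
      .setLabelCol("Spoiled")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
      .evaluate(model.transform(test))
  }
}
```

In the notebook this could be called as `HoldoutCheck.holdoutAccuracy(output)`. A held-out accuracy well above the majority-class rate would support trusting the importances reported below.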


### Identifying the most important feature with the Random Forest Classifier model

In [15]:
rfc_model.featureImportances

res6: org.apache.spark.ml.linalg.Vector = (4,[0,1,2,3],[0.022967322706242896,0.02015552665936881,0.9292833974441577,0.027593753190230587])
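
The importance vector is indexed by position, in the same order as the columns passed to `setInputCols`. To make the result easier to read, the sketch below pairs each score with its column name and sorts them. It is plain Scala with the values copied from the output above, and the object name `FeatureRanking` is illustrative only. Inside the notebook the same pairing could be built with `assembler.getInputCols.zip(rfc_model.featureImportances.toArray)`.

```scala
object FeatureRanking {
  // Column order passed to VectorAssembler.setInputCols above.
  val featureNames: Array[String] = Array("A", "B", "C", "D")

  // Importances copied from the featureImportances output above.
  val importances: Array[Double] = Array(
    0.022967322706242896, 0.02015552665936881,
    0.9292833974441577, 0.027593753190230587)

  // Pair each name with its score and sort from most to least important.
  val ranked: Array[(String, Double)] =
    featureNames.zip(importances).sortBy { case (_, score) => -score }

  def main(args: Array[String]): Unit =
    ranked.foreach { case (name, score) => println(f"$name%s  $score%.4f") }
}
```

Sorted this way, the order is C, D, A, B, with C accounting for about 0.93 of the total importance and each of the other three below 0.03.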


## *From the above observation, the feature at index 2 (chemical C) is by far the most important feature (importance of about 0.93), which makes it the most likely cause of the early spoilage!*

### Closing the Spark session

In [16]:
spark.stop()

## Thank You!