# Dog Food Prediction

A dog food company wants to understand why some batches of its dog food are spoiling much sooner than intended. The company hasn't upgraded to the latest machinery, so the amounts of the five chemicals in its preservative mix (four preservatives plus a filler) can vary a lot from batch to batch. Each batch of preservative is mixed from 4 different preservative chemicals (A, B, C, D) and then completed with a "filler" chemical. The food scientists believe one of the preservatives A, B, C, or D is causing the problem, but need help figuring out which one.

The approach here is to fit a Random Forest classifier with Spark and check which feature has the most predictive power, which points to the chemical most strongly linked to the early spoiling.


* **Pres_A**: Percentage of preservative A in the mix (column `A` in the dataset)
* **Pres_B**: Percentage of preservative B in the mix (column `B` in the dataset)
* **Pres_C**: Percentage of preservative C in the mix (column `C` in the dataset)
* **Pres_D**: Percentage of preservative D in the mix (column `D` in the dataset)
* **Spoiled**: Label indicating whether the dog food batch spoiled (`1.0`) or not (`0.0`).
___

In [17]:
# Initialize pyspark
import findspark
findspark.init()
import pyspark

In [18]:
# Initialize and create a spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('dog_food').getOrCreate()

In [19]:
# Using Spark to read the Dog Food data set
data = spark.read.csv('dog_food.csv', header=True, inferSchema=True)

In [20]:
# Printing the first few rows of the dataframe
data.show(4)

+---+---+----+---+-------+
|  A|  B|   C|  D|Spoiled|
+---+---+----+---+-------+
|  4|  2|12.0|  3|    1.0|
|  5|  6|12.0|  7|    1.0|
|  6|  2|13.0|  6|    1.0|
|  4|  2|12.0|  1|    1.0|
+---+---+----+---+-------+
only showing top 4 rows



In [21]:
# Printing the schema of the dataframe
data.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)



In [22]:
# Summary statistics (count, mean, stddev, min, max) for every column
data.describe().show()

+-------+------------------+------------------+------------------+------------------+-------------------+
|summary|                 A|                 B|                 C|                 D|            Spoiled|
+-------+------------------+------------------+------------------+------------------+-------------------+
|  count|               490|               490|               490|               490|                490|
|   mean|  5.53469387755102| 5.504081632653061| 9.126530612244897| 5.579591836734694| 0.2857142857142857|
| stddev|2.9515204234399057|2.8537966089662063|2.0555451971054275|2.8548369309982857|0.45221563164613465|
|    min|                 1|                 1|               5.0|                 1|                0.0|
|    max|                10|                10|              14.0|                10|                1.0|
+-------+------------------+------------------+------------------+------------------+-------------------+
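
Since `Spoiled` only ranges from 0.0 to 1.0 here, and assuming it holds just those two values, its mean in the summary above is the fraction of spoiled batches. A small sketch of that arithmetic, using the `count` and `mean` values copied from the table (the variable names are illustrative only):

```python
# Values copied from the describe() output above.
total_batches = 490
spoiled_mean = 0.2857142857142857  # mean of the 0/1 "Spoiled" label

# Assumption: the label is strictly 0.0 or 1.0, so mean * count
# recovers the number of spoiled rows (rounded to absorb float error).
spoiled_batches = round(spoiled_mean * total_batches)
fresh_batches = total_batches - spoiled_batches

print(f"spoiled: {spoiled_batches}, not spoiled: {fresh_batches}")
print(f"spoiled share: {spoiled_batches / total_batches:.1%}")
```

The classes are imbalanced, with roughly 2 in 7 batches spoiled, which is worth keeping in mind when judging any accuracy figure for a model trained on this data.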



In [23]:
data.columns

['A', 'B', 'C', 'D', 'Spoiled']

In [24]:
# Import VectorAssembler and Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors

In [25]:
# Assemble all the independent (predictor) columns into a single vector column "features"
assembler = VectorAssembler(inputCols=['A', 'B', 'C', 'D'], outputCol='features')

In [26]:
# Add the "features" vector column to the dataframe
output = assembler.transform(data)

In [27]:
output.show(3)

+---+---+----+---+-------+------------------+
|  A|  B|   C|  D|Spoiled|          features|
+---+---+----+---+-------+------------------+
|  4|  2|12.0|  3|    1.0|[4.0,2.0,12.0,3.0]|
|  5|  6|12.0|  7|    1.0|[5.0,6.0,12.0,7.0]|
|  6|  2|13.0|  6|    1.0|[6.0,2.0,13.0,6.0]|
+---+---+----+---+-------+------------------+
only showing top 3 rows
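
Conceptually, the assembler takes the values of the input columns in each row and packs them, as doubles, into one vector. A plain-Python sketch of that per-row step follows. It is only an illustration, not how Spark implements it, and `assemble_row` is a hypothetical helper:

```python
def assemble_row(row, input_cols):
    """Pack the chosen columns of one row into a list of floats."""
    return [float(row[col]) for col in input_cols]

# First row of the table above, written as a plain dict.
first_row = {"A": 4, "B": 2, "C": 12.0, "D": 3, "Spoiled": 1.0}

# Same column order as inputCols in the VectorAssembler.
features = assemble_row(first_row, ["A", "B", "C", "D"])
print(features)
```

The label column `Spoiled` is deliberately left out of the vector, since it is what the model will be trained to predict.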



**Creating a Random Forest Model**

In [28]:
from pyspark.ml.classification import RandomForestClassifier

In [29]:
# Random Forest classifier that predicts "Spoiled" from the "features" vector
rfc = RandomForestClassifier(labelCol='Spoiled', featuresCol='features')

In [30]:
# Fit on the full dataset; the goal is feature importance, not held-out accuracy
rfc_model = rfc.fit(output)

Use the fitted Random Forest classifier to see which feature is the most important:

In [31]:
rfc_model.featureImportances

SparseVector(4, {0: 0.0202, 1: 0.0167, 2: 0.9428, 3: 0.0204})

__From the output above, the feature at index 2 (Chemical C) is by far the most important, with about 94% of the total feature importance. Chemical C is therefore the most likely cause of the early spoilage!__
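
The indices in the importance vector follow the order of `inputCols` given to the `VectorAssembler`, so they can be mapped back to column names. A minimal sketch, using the importance values copied from the output above as a plain list (`rank_features` is a hypothetical helper, not part of PySpark):

```python
def rank_features(names, importances):
    """Return (name, importance) pairs sorted from most to least important."""
    pairs = zip(names, importances)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Same order as inputCols in the VectorAssembler.
feature_names = ["A", "B", "C", "D"]
# Values copied by hand from the featureImportances output above.
importances = [0.0202, 0.0167, 0.9428, 0.0204]

ranking = rank_features(feature_names, importances)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

With a live model, the list could be obtained with `rfc_model.featureImportances.toArray()` instead of being typed in by hand.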

In [None]:
# Close the Spark session
spark.stop()