Goal: determine why some batches of dog food are spoiling much more quickly than intended, and identify which preservative chemical in the dog food has the strongest effect on spoilage.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('dogfood').getOrCreate()

In [2]:
data = spark.read.csv('data/dog_food.csv',inferSchema=True,header=True)

In [3]:
data.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)



In [4]:
data.head()

Row(A=4, B=2, C=12.0, D=3, Spoiled=1.0)

In [5]:
data.describe().show()

+-------+------------------+------------------+------------------+------------------+-------------------+
|summary|                 A|                 B|                 C|                 D|            Spoiled|
+-------+------------------+------------------+------------------+------------------+-------------------+
|  count|               490|               490|               490|               490|                490|
|   mean|  5.53469387755102| 5.504081632653061| 9.126530612244897| 5.579591836734694| 0.2857142857142857|
| stddev|2.9515204234399057|2.8537966089662063|2.0555451971054275|2.8548369309982857|0.45221563164613465|
|    min|                 1|                 1|               5.0|                 1|                0.0|
|    max|                10|                10|              14.0|                10|                1.0|
+-------+------------------+------------------+------------------+------------------+-------------------+
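
The summary above also gives the class balance. Assuming Spoiled only takes the values 0.0 and 1.0 (which the min and max suggest), its mean is the fraction of spoiled batches. The short sketch below recomputes the counts from the two numbers printed in the table. The variable names are illustrative and not part of the original notebook.

```python
# Values copied from the describe() output above.
n_rows = 490
spoiled_mean = 0.2857142857142857  # mean of the Spoiled column

# Assumption: Spoiled is a 0/1 label, so mean * count = number of spoiled rows.
n_spoiled = round(n_rows * spoiled_mean)
n_fresh = n_rows - n_spoiled

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(n_spoiled, n_fresh) / n_rows

print(n_spoiled, n_fresh, round(majority_baseline, 3))
```

Under that assumption, roughly 29% of the batches are spoiled, so any accuracy figure for a classifier on this data should be read against the majority-class baseline of about 71%.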



In [6]:
from pyspark.ml.feature import VectorAssembler

In [7]:
data.columns

['A', 'B', 'C', 'D', 'Spoiled']

In [8]:
assembler = VectorAssembler(inputCols=['A', 'B', 'C', 'D'],outputCol="features")

In [9]:
output = assembler.transform(data)

In [10]:
from pyspark.ml.classification import DecisionTreeClassifier

In [11]:
# Note: despite the variable name `rfc`, this is a single decision tree, not a random forest
rfc = DecisionTreeClassifier(labelCol='Spoiled',featuresCol='features')

In [12]:
output.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)
 |-- features: vector (nullable = true)



In [13]:
final_data = output.select('features','Spoiled')
final_data.head()

Row(features=DenseVector([4.0, 2.0, 12.0, 3.0]), Spoiled=1.0)

In [14]:
rfc_model = rfc.fit(final_data)

In [15]:
rfc_model.featureImportances

SparseVector(4, {1: 0.0019, 2: 0.9832, 3: 0.0149})

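The feature importances above show that the feature at index 2 (chemical C) is by far the most important, with a score of about 0.98. Index 0 (chemical A) does not appear in the sparse vector at all, so its importance is 0, and B and D together contribute less than 0.02. This strongly suggests that chemical C is what is driving the early spoilage.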
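
To make the index-to-column mapping explicit, here is a minimal sketch in plain Python that pairs the importances printed above with the input column names, in the same order they were passed to the VectorAssembler. The scores are copied by hand from the SparseVector output. With a live model, `rfc_model.featureImportances.toArray()` should give the same dense values. The names `feature_names`, `sparse_importances`, `dense`, and `ranked` are illustrative only.

```python
# Column order used for inputCols in the VectorAssembler.
feature_names = ['A', 'B', 'C', 'D']

# Non-zero entries copied from the SparseVector printed above (index -> importance).
sparse_importances = {1: 0.0019, 2: 0.9832, 3: 0.0149}

# Expand to a dense list; indices missing from the sparse vector have importance 0.
dense = [sparse_importances.get(i, 0.0) for i in range(len(feature_names))]

# Rank features from most to least important.
ranked = sorted(zip(feature_names, dense), key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(f"{name}: {score:.4f}")
```

Reading the result by name rather than by index avoids off-by-one mistakes: index 2 is the third column, C, not B.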