
# Consulting project: predicting dog food spoiling
### Use machine learning with Random Forest (RF) to find out which parameter has the most predictive power, and thus which chemical is behind the early spoiling.

### Your main task is to figure out which preservative chemical (A, B, C or D) is responsible for the dog food spoiling.

## Load Spark

In [None]:
appname = "Tree Methods - Dog food data"

# Look into https://spark.apache.org/downloads.html for the latest version
spark_mirror = "https://mirrors.sonic.net/apache/spark"
spark_version = "3.3.1"
hadoop_version = "3"

# Install Java 8 (Spark 3.3 runs on Java 8, 11 or 17; Java 8 is used here to match JAVA_HOME below)
! apt-get update
! apt-get install openjdk-8-jdk-headless -qq > /dev/null

# Download and extract Spark binary distribution
! rm -rf spark-{spark_version}-bin-hadoop{hadoop_version}.tgz spark-{spark_version}-bin-hadoop{hadoop_version}
! wget -q {spark_mirror}/spark-{spark_version}/spark-{spark_version}-bin-hadoop{hadoop_version}.tgz
! tar xzf spark-{spark_version}-bin-hadoop{hadoop_version}.tgz

# The only 2 environment variables needed to set up Java and Spark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/spark-{spark_version}-bin-hadoop{hadoop_version}"

# Set up the Spark environment based on the environment variable SPARK_HOME 
! pip install -q findspark
import findspark
findspark.init()

# Get the Spark session object (basic entry point for every operation)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName(appname).master("local[*]").getOrCreate()


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Exploratory data analysis and Preprocessing

In [None]:
# Load the data into a dataframe

df = spark.read.format('csv').options(inferSchema=True, header=True).load('/content/drive/MyDrive/files/dog_food.csv')
df.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)



In [None]:
# Statistical summary

df.describe().show()

+-------+------------------+------------------+------------------+------------------+-------------------+
|summary|                 A|                 B|                 C|                 D|            Spoiled|
+-------+------------------+------------------+------------------+------------------+-------------------+
|  count|               490|               490|               490|               490|                490|
|   mean|  5.53469387755102| 5.504081632653061| 9.126530612244897| 5.579591836734694| 0.2857142857142857|
| stddev|2.9515204234399057|2.8537966089662063|2.0555451971054275|2.8548369309982857|0.45221563164613465|
|    min|                 1|                 1|               5.0|                 1|                0.0|
|    max|                10|                10|              14.0|                10|                1.0|
+-------+------------------+------------------+------------------+------------------+-------------------+



In [None]:
# Count null values within the data columns

print({col: df.filter(df[col].isNull()).count() for col in df.columns})

{'A': 0, 'B': 0, 'C': 0, 'D': 0, 'Spoiled': 0}


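We observe that all features are quantitative and numeric, and that no column contains null values. Before building the machine learning model, we need to assemble the features into a single vector column, which is the input format that Spark ML estimators expect.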

In [None]:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(
    inputCols=["A", "B", "C", "D"], 
    outputCol='features')
print(assembler.explainParams())

final_data = assembler.transform(df)
final_data.show()

handleInvalid: How to handle invalid data (NULL and NaN values). Options are 'skip' (filter out rows with invalid data), 'error' (throw an error), or 'keep' (return relevant number of NaN in the output). Column lengths are taken from the size of ML Attribute Group, which can be set using `VectorSizeHint` in a pipeline before `VectorAssembler`. Column lengths can also be inferred from first rows of the data since it is safe to do so but only in case of 'error' or 'skip'). (default: error)
inputCols: input column names. (current: ['A', 'B', 'C', 'D'])
outputCol: output column name. (default: VectorAssembler_d32ad6eeed4e__output, current: features)
+---+---+----+---+-------+-------------------+
|  A|  B|   C|  D|Spoiled|           features|
+---+---+----+---+-------+-------------------+
|  4|  2|12.0|  3|    1.0| [4.0,2.0,12.0,3.0]|
|  5|  6|12.0|  7|    1.0| [5.0,6.0,12.0,7.0]|
|  6|  2|13.0|  6|    1.0| [6.0,2.0,13.0,6.0]|
|  4|  2|12.0|  1|    1.0| [4.0,2.0,12.0,1.0]|
|  4|  2|12.0|  3

## Building the model

In [None]:
from pyspark.ml.classification import (RandomForestClassifier, GBTClassifier, DecisionTreeClassifier)

dtc = DecisionTreeClassifier(featuresCol="features", labelCol="Spoiled")
rfc = RandomForestClassifier(numTrees = 100, featuresCol="features", labelCol="Spoiled")
gbt = GBTClassifier(featuresCol="features", labelCol="Spoiled")

print(
    dtc.explainParams(),"\n\n",
    rfc.explainParams(),"\n\n",
    gbt.explainParams(),"\n\n")

cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featuresCol: features column name. (default: features, current: features)
impurity: Criterion used for information gain calculation (case-insensitive). Supported options: entropy, gini (default: gini)
labelCol: label column name. (default: label, current: Spoiled)
leafCol: Leaf indices column name. Predicted leaf index of each instance in each tree by preorder. (default: )
maxBins: Max number of bins for discre

In [None]:
# Fit the different models with the data

dtc_model = dtc.fit(final_data)
rfc_model = rfc.fit(final_data)
gbt_model = gbt.fit(final_data)

In [None]:
# Identify the relative importances of every feature

print("Decision Tree Classifier:\nChemical A:", dtc_model.featureImportances[0], "\nChemical B:", dtc_model.featureImportances[1], "\nChemical C:", dtc_model.featureImportances[2],  "\nChemical D:", dtc_model.featureImportances[3])
print("\nRandom Forest Classifier:\nChemical A:", rfc_model.featureImportances[0], "\nChemical B:", rfc_model.featureImportances[1], "\nChemical C:", rfc_model.featureImportances[2],  "\nChemical D:", rfc_model.featureImportances[3])
print("\nGBTClassifier:\nChemical A:", gbt_model.featureImportances[0], "\nChemical B:", gbt_model.featureImportances[1], "\nChemical C:", gbt_model.featureImportances[2],  "\nChemical D:", gbt_model.featureImportances[3])


Decision Tree Classifier:
Chemical A: 0.0 
Chemical B: 0.0019107795086908742 
Chemical C: 0.9831676511855764 
Chemical D: 0.014921569305732818

Random Forest Classifier:
Chemical A: 0.019387308308770945 
Chemical B: 0.021112956784513963 
Chemical C: 0.9348987066729869 
Chemical D: 0.024601028233728145

GBTClassifier:
Chemical A: 0.02962567485294246 
Chemical B: 0.03830179415146122 
Chemical C: 0.8286277188140007 
Chemical D: 0.10344481218159562


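All three models lead to the same conclusion: Chemical C carries by far the greatest weight in predicting food spoilage, with a relative importance between roughly 0.83 and 0.98 depending on the model. Comparing the mean percentage of each chemical in spoiled and unspoiled samples points the same way: C is the only chemical whose mean differs markedly between the two groups.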

In [None]:
# show() prints the table itself and returns None, so no print() wrapper is needed
df.filter(df.Spoiled == 0).summary("mean").show()
df.filter(df.Spoiled == 1).summary("mean").show()

+-------+-----------------+----+----------------+------------------+-------+
|summary|                A|   B|               C|                 D|Spoiled|
+-------+-----------------+----+----------------+------------------+-------+
|   mean|5.422857142857143|5.66|8.01142857142857|5.6085714285714285|    0.0|
+-------+-----------------+----+----------------+------------------+-------+

+-------+-----------------+-----------------+------------------+-----------------+-------+
|summary|                A|                B|                 C|                D|Spoiled|
+-------+-----------------+-----------------+------------------+-----------------+-------+
|   mean|5.814285714285714|5.114285714285714|11.914285714285715|5.507142857142857|    1.0|
+-------+-----------------+-----------------+------------------+-----------------+-------+
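
To make the comparison between the two tables above explicit, the gap between the group means can be computed for each chemical. This is a small sketch in plain Python using the means copied from those tables; `mean_gaps` and the two dictionaries are hypothetical names used only for illustration.

```python
# Group means copied from the two summary tables above.
mean_not_spoiled = {"A": 5.422857142857143, "B": 5.66,
                    "C": 8.01142857142857, "D": 5.6085714285714285}
mean_spoiled = {"A": 5.814285714285714, "B": 5.114285714285714,
                "C": 11.914285714285715, "D": 5.507142857142857}


def mean_gaps(spoiled, not_spoiled):
    """Return the spoiled-minus-unspoiled difference in means per chemical."""
    return {name: spoiled[name] - not_spoiled[name] for name in spoiled}


gaps = mean_gaps(mean_spoiled, mean_not_spoiled)

# List the chemicals from the largest to the smallest absolute gap.
for name, gap in sorted(gaps.items(), key=lambda item: abs(item[1]), reverse=True):
    print(f"Chemical {name}: {gap:+.2f}")
```

Chemical C shows a gap of about 3.9 between spoiled and unspoiled samples, while the gaps for A, B and D all stay below 0.6 in absolute value, which is consistent with the feature importances reported by the three models.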
