## REFER: Chapter 8 ("Tree-Based Methods") of "An Introduction to Statistical Learning" by Gareth James et al.

#### pyspark API Documentation:
* http://spark.apache.org/docs/latest/
* http://spark.apache.org/docs/latest/ml-guide.html
* https://spark.apache.org/docs/latest/api/python/

## [Introduction to Statistical Learning](<https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf>)

#### Very powerful algorithms: Decision Trees, Random Forests and Gradient Boosted Trees all fall under the "Tree Methods" title

In [1]:
import sys
sys.path.append('C:/Users/nishita/exercises_udemy/MyTrials/tools')
from chinmay_tools import *

# Find out which component (A/B/C/D) has the major effect on the dog food spoiling

* Many ML models produce some kind of coefficient or score for each feature, indicating its "importance" or predictive power.
* A fitted RandomForestClassifier model exposes featureImportances, which gives the relative importance of each feature: the feature's importance averaged across all the trees in the random forest.
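To make the "average across all the trees" idea concrete, here is a minimal plain-Python sketch. It is not Spark's actual implementation: the function name `average_importances` and the per-tree numbers are made up for illustration, and the normalisation steps are an assumption about how such an average is typically turned into relative values that sum to 1.

```python
def average_importances(per_tree):
    """Average per-tree importance vectors, then rescale them to sum to 1."""
    n_features = len(per_tree[0])
    # Normalise each tree's vector first so every tree carries equal weight.
    normalised = []
    for tree in per_tree:
        total = sum(tree)
        normalised.append([v / total if total else 0.0 for v in tree])
    # Average feature by feature across the trees.
    averaged = [
        sum(tree[i] for tree in normalised) / len(normalised)
        for i in range(n_features)
    ]
    # Rescale so the final importances are relative values summing to 1.
    grand_total = sum(averaged)
    return [v / grand_total for v in averaged]


# Hypothetical raw importances from 3 trees over 4 features (A, B, C, D).
trees = [
    [1.0, 0.0, 8.0, 1.0],
    [0.0, 2.0, 6.0, 2.0],
    [1.0, 1.0, 18.0, 0.0],
]
importances = average_importances(trees)
print(importances)
```

Because the result is relative, only the comparison between features matters, not the absolute numbers.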

In [8]:
printHighlighted('Refer: "Tree_Methods_Consulting_Project.ipynb"')

Refer: "Tree_Methods_Consulting_Project.ipynb"


In [50]:
from pyspark.sql import SparkSession

In [51]:
spark3 = SparkSession.builder.appName('dog_food').getOrCreate()

In [54]:
sdf_dogfood = spark3.read.csv('Tree_Methods/dog_food.csv', inferSchema=True, header=True)

In [55]:
sdf_dogfood.printSchema()
sdf_dogfood.describe().show()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)

+-------+------------------+------------------+------------------+------------------+-------------------+
|summary|                 A|                 B|                 C|                 D|            Spoiled|
+-------+------------------+------------------+------------------+------------------+-------------------+
|  count|               490|               490|               490|               490|                490|
|   mean|  5.53469387755102| 5.504081632653061| 9.126530612244897| 5.579591836734694| 0.2857142857142857|
| stddev|2.9515204234399057|2.8537966089662063|2.0555451971054275|2.8548369309982857|0.45221563164613465|
|    min|                 1|                 1|               5.0|                 1|                0.0|
|    max|                10|                10|              14.0|            

In [56]:
from pyspark.ml.feature import VectorAssembler

In [58]:
assembler_food = VectorAssembler(inputCols=['A', 'B', 'C', 'D'], outputCol='features')

In [59]:
final_food_data = assembler_food.transform(sdf_dogfood)

In [60]:
sdf_dogfood.printSchema()
final_food_data.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)
 |-- features: vector (nullable = true)
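Conceptually, the assembler only gathers the listed input columns, in order, into one numeric vector per row. A rough plain-Python sketch of that idea follows; the helper `assemble` is hypothetical, and the row values are those of the first row of the dog food data as printed by `head(1)` further down in this notebook.

```python
def assemble(row, input_cols):
    """Return the values of input_cols as one list of floats, in column order."""
    return [float(row[col]) for col in input_cols]


# First row of the dog food data, written as a plain dict.
row = {"A": 4, "B": 2, "C": 12.0, "D": 3, "Spoiled": 1.0}
features = assemble(row, ["A", "B", "C", "D"])
print(features)  # [4.0, 2.0, 12.0, 3.0]
```

The position of each column in `inputCols` becomes its index in the vector, which is what the indices in featureImportances refer to.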



In [65]:
training_food_data, test_food_data = final_food_data.select('features', 'Spoiled').randomSplit([0.7, 0.3])

In [66]:
from pyspark.ml.classification import RandomForestClassifier

In [67]:
rfc_food = RandomForestClassifier(labelCol='Spoiled')

In [68]:
rfc_model_food = rfc_food.fit(training_food_data)

In [69]:
rfc_food_predictions = rfc_model_food.transform(test_food_data)

In [73]:
rfc_model_food.featureImportances

SparseVector(4, {0: 0.0368, 1: 0.0205, 2: 0.9188, 3: 0.0239})

In [3]:
printHighlighted('The relative values in featureImportances tell how important each feature is for predicting the label column')

The relative values in featureImportances tell how important each feature is for predicting the label column


In [74]:
rfc_model_food = rfc_food.fit(final_food_data.select('features', 'Spoiled'))

In [78]:
final_food_data.head(1)

[Row(A=4, B=2, C=12.0, D=3, Spoiled=1.0, features=DenseVector([4.0, 2.0, 12.0, 3.0]))]

In [75]:
rfc_model_food.featureImportances

SparseVector(4, {0: 0.0199, 1: 0.0154, 2: 0.9455, 3: 0.0193})
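The vector only carries indices, so it helps to map them back to the column names passed to the assembler. A small sketch in plain Python, with the index/value pairs copied by hand from the output above (on a live model the values could instead be read with `featureImportances.toArray()`):

```python
# Same order as the inputCols given to the VectorAssembler.
input_cols = ["A", "B", "C", "D"]

# Index -> importance, copied from the SparseVector printed above.
importance_by_index = {0: 0.0199, 1: 0.0154, 2: 0.9455, 3: 0.0193}

# Pair each importance with its column name and sort from most to least important.
ranked = sorted(
    ((input_cols[i], value) for i, value in importance_by_index.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, value in ranked:
    print(f"{name}: {value:.4f}")
```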

In [77]:
printHighlighted('featureImportances is highest for the feature with index 2 ("2:"), i.e. the 3rd feature, chemical C, so it is the one that is causing the food to spoil')

featureImportances is highest for the feature with index 2 ("2:"), i.e. the 3rd feature, chemical C, so it is the one that is causing the food to spoil


# Predict if a college is Private or Public based on its features

In [5]:
printHighlighted('Refer: "Tree Methods Code Along.ipynb"')

Refer: "Tree Methods Code Along.ipynb"


In [23]:
from pyspark.sql import SparkSession
spark2 = SparkSession.builder.appName('tree_example_2').getOrCreate()

In [24]:
sdf_college = spark2.read.csv('Tree_Methods/College.csv', inferSchema=True, header=True)

In [25]:
sdf_college.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)



In [26]:
sdf_college.select('Private').distinct().show()

+-------+
|Private|
+-------+
|     No|
|    Yes|
+-------+



In [30]:
sdf_college.select('School', 'Private', 'Enroll').describe().show()

+-------+--------------------+-------+----------------+
|summary|              School|Private|          Enroll|
+-------+--------------------+-------+----------------+
|  count|                 777|    777|             777|
|   mean|                null|   null|779.972972972973|
| stddev|                null|   null| 929.17619013287|
|    min|Abilene Christian...|     No|              35|
|    max|York College of P...|    Yes|            6392|
+-------+--------------------+-------+----------------+



In [31]:
from pyspark.ml.feature import StringIndexer, VectorAssembler

In [32]:
indexer = StringIndexer(inputCol='Private', outputCol='PrivateIndex')

In [33]:
sdf_college_indexed = indexer.fit(sdf_college).transform(sdf_college)
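StringIndexer turns each distinct string into a numeric index. By default the most frequent value gets index 0.0, so which of "Yes"/"No" maps to 0.0 depends on the data (on the fitted model this can be checked through its `labels` attribute). The sketch below imitates that ordering in plain Python; `fit_string_indexer` is a hypothetical helper, the sample values are made up, and the alphabetical tie-break is an assumption.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each distinct string to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically (assumed).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Made-up sample, not the real College data.
sample = ["Yes", "No", "Yes", "Yes", "No"]
mapping = fit_string_indexer(sample)
indexed = [mapping[value] for value in sample]
print(mapping)  # {'Yes': 0.0, 'No': 1.0}
print(indexed)  # [0.0, 1.0, 0.0, 0.0, 1.0]
```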

In [34]:
sdf_college_indexed.columns

['School',
 'Private',
 'Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate',
 'PrivateIndex']

In [35]:
assembler = VectorAssembler(inputCols=[
                         'Apps',
                         'Accept',
                         'Enroll',
                         'Top10perc',
                         'Top25perc',
                         'F_Undergrad',
                         'P_Undergrad',
                         'Outstate',
                         'Room_Board',
                         'Books',
                         'Personal',
                         'PhD',
                         'Terminal',
                         'S_F_Ratio',
                         'perc_alumni',
                         'Expend',
                         'Grad_Rate'], outputCol='features')

In [36]:
data = assembler.transform(sdf_college_indexed)

In [37]:
data.columns

['School',
 'Private',
 'Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate',
 'PrivateIndex',
 'features']

In [38]:
final_data = data.select('features', 'PrivateIndex')

In [39]:
final_data.printSchema()

root
 |-- features: vector (nullable = true)
 |-- PrivateIndex: double (nullable = false)



In [40]:
training_data, test_data = final_data.randomSplit([0.7,0.3])

In [41]:
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier, GBTClassifier

In [42]:
dtc = DecisionTreeClassifier(labelCol='PrivateIndex')
rfc = RandomForestClassifier(labelCol='PrivateIndex')
gbt = GBTClassifier(labelCol='PrivateIndex')

In [43]:
%%time
dtc_model = dtc.fit(training_data)

Wall time: 2.66 s


In [44]:
%%time
rfc_model = rfc.fit(training_data)

Wall time: 1.64 s


In [45]:
%%time
gbt_model = gbt.fit(training_data)

Wall time: 7.55 s


In [46]:
%%time
dtc_predictions = dtc_model.transform(test_data)
rfc_predictions = rfc_model.transform(test_data)
gbt_predictions = gbt_model.transform(test_data)

Wall time: 218 ms


In [47]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [48]:
multi_eval = MulticlassClassificationEvaluator(labelCol='PrivateIndex')

In [49]:
multi_eval.evaluate(dtc_predictions)
multi_eval.evaluate(rfc_predictions)
multi_eval.evaluate(gbt_predictions)

0.8712457802352499

0.9527283781729747

0.881598308668076
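The evaluator above was created without a metricName. According to the API documentation the default metric is "f1" (a weighted F1 score), so these three numbers are most likely F1 values rather than plain accuracy. A plain-Python sketch of a weighted F1 computation, with made-up labels and predictions and a hypothetical helper `weighted_f1`:

```python
def weighted_f1(labels, predictions):
    """Per-class F1, weighted by how often each class occurs in the labels."""
    pairs = list(zip(labels, predictions))
    score = 0.0
    for cls in sorted(set(labels)):
        tp = sum(1 for y, p in pairs if y == cls and p == cls)
        fp = sum(1 for y, p in pairs if y != cls and p == cls)
        fn = sum(1 for y, p in pairs if y == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # (tp + fn) is the number of rows whose true label is cls.
        score += f1 * (tp + fn) / len(pairs)
    return score


# Made-up example: 3 rows of class 0.0 and 5 rows of class 1.0.
labels      = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
predictions = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
score = weighted_f1(labels, predictions)
print(score)
```

To report accuracy instead, the evaluator can be built with metricName='accuracy', as done in the libsvm example later in this notebook.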

#### Change the parameters of the classifier constructors to improve prediction
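One simple way to organise such experiments is to enumerate the candidate settings up front. The sketch below builds every combination of two RandomForestClassifier parameters, numTrees and maxDepth. The candidate values are arbitrary picks for illustration, not recommendations, and the loop that fits and evaluates each candidate is left out.

```python
from itertools import product

# Candidate values per parameter (arbitrary illustrative choices).
grid = {
    "numTrees": [20, 50, 100],
    "maxDepth": [5, 10],
}

# One dict of keyword arguments per combination of values.
candidates = [dict(zip(grid, combo)) for combo in product(*grid.values())]

for params in candidates:
    print(params)
```

Each dict could then be passed on as keyword arguments, e.g. `RandomForestClassifier(labelCol='PrivateIndex', **params)`, fitted on training_data and scored on test_data with the evaluator used above.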

# Sample code showing how to fit and evaluate tree-based models

#### Decision Trees, Random Forests, Gradient Boosted Trees

In [6]:
printHighlighted('Refer: "Tree_Methods_Doc_Example.ipynb"')

Refer: "Tree_Methods_Doc_Example.ipynb"


In [1]:
from pyspark.sql import SparkSession
spark1 = SparkSession.builder.appName('tree_example_1').getOrCreate()

In [2]:
sdf_data = spark1.read.format("libsvm").load('Tree_Methods/sample_libsvm_data.txt')
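The libsvm text format stores one row per line as `label index:value index:value ...`, listing only the non-zero features, with indices starting at 1. Spark's loader converts them to zero-based indices in the features vector. A minimal parser sketch in plain Python (the helper `parse_libsvm_line` and the sample line are made up, not taken from sample_libsvm_data.txt):

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {zero_based_idx: val})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        # Indices in the file are 1-based; shift them to 0-based.
        features[int(index) - 1] = float(value)
    return label, features


# Hypothetical line with two non-zero features.
label, features = parse_libsvm_line("1 3:0.5 10:2.0")
print(label, features)  # 1.0 {2: 0.5, 9: 2.0}
```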

In [3]:
sdf_data.count()
sdf_data.printSchema()
sdf_data.describe().show()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

+-------+-------------------+
|summary|              label|
+-------+-------------------+
|  count|                100|
|   mean|               0.57|
| stddev|0.49756985195624287|
|    min|                0.0|
|    max|                1.0|
+-------+-------------------+
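The stddev row can be cross-checked by hand. Assuming the label only takes the values 0.0 and 1.0, a count of 100 with mean 0.57 means 57 ones and 43 zeros. The sketch below recomputes the standard deviation of such a column and suggests that describe() reports the sample (n - 1) version rather than the population version.

```python
import statistics

# 57 ones and 43 zeros reproduce count = 100 and mean = 0.57 (labels assumed binary).
labels = [1.0] * 57 + [0.0] * 43

mean = statistics.mean(labels)
sample_sd = statistics.stdev(labels)       # divides by n - 1
population_sd = statistics.pstdev(labels)  # divides by n

print(mean, sample_sd, population_sd)
```

The sample value agrees with the 0.4975698... shown in the table, while the population value is slightly smaller.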



In [4]:
training_data, test_data = sdf_data.randomSplit([0.7,0.3])

In [5]:
training_data.describe().show()
test_data.describe().show()

+-------+------------------+
|summary|             label|
+-------+------------------+
|  count|                72|
|   mean|0.5555555555555556|
| stddev|0.5003910833605344|
|    min|               0.0|
|    max|               1.0|
+-------+------------------+

+-------+-------------------+
|summary|              label|
+-------+-------------------+
|  count|                 28|
|   mean| 0.6071428571428571|
| stddev|0.49734746139343805|
|    min|                0.0|
|    max|                1.0|
+-------+-------------------+
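The split came out as 72/28 rather than exactly 70/30 because the weights act as per-row probabilities, not exact proportions. A plain-Python sketch of that behaviour follows; `random_split` is a hypothetical helper that only imitates the idea and is not Spark's sampler.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by an independent random draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of the normalised weights, e.g. [0.7, 1.0].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split catches any rounding error in the final bound.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train, test = random_split(list(range(100)), [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Since no seed is passed to randomSplit in this notebook, the counts and every metric computed afterwards will change from run to run. Passing one, e.g. `randomSplit([0.7, 0.3], seed=42)`, makes the split repeatable.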



In [6]:
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier, GBTClassifier

In [7]:
%%time
dtc = DecisionTreeClassifier()
rfc = RandomForestClassifier()
gbt = GBTClassifier()

Wall time: 121 ms


In [8]:
%%time
dtc_model = dtc.fit(training_data)

Wall time: 3.71 s


In [9]:
%%time
rfc_model = rfc.fit(training_data)

Wall time: 2.52 s


In [10]:
%%time
gbt_model = gbt.fit(training_data)

Wall time: 13.3 s


In [11]:
%%time
dtc_predictions = dtc_model.transform(test_data)
rfc_predictions = rfc_model.transform(test_data)
gbt_predictions = gbt_model.transform(test_data)


Wall time: 315 ms


In [12]:
type(dtc_predictions)

pyspark.sql.dataframe.DataFrame

In [13]:
dtc_predictions.printSchema()
rfc_predictions.printSchema()
gbt_predictions.printSchema()


root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [14]:
test_data.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)



In [15]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

In [16]:
# bin_eval = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction')
bin_eval = BinaryClassificationEvaluator()

In [17]:
bin_eval.evaluate(dtc_predictions)
bin_eval.evaluate(rfc_predictions)
bin_eval.evaluate(gbt_predictions)


1.0

In [18]:
bin_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction')

In [19]:
bin_eval.evaluate(dtc_predictions)
bin_eval.evaluate(rfc_predictions)
bin_eval.evaluate(gbt_predictions)


0.9545454545454545
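BinaryClassificationEvaluator reports the area under the ROC curve by default. Feeding it the hard 0/1 prediction column instead of rawPrediction leaves only one usable threshold, which is why the value drops from 1.0. The sketch below computes the area by pairwise comparison in plain Python. The data is a hypothetical reconstruction: the test split printed earlier has 28 rows with mean label 0.607, i.e. 17 positives and 11 negatives, and a single negative predicted as positive is assumed.

```python
def area_under_roc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1.0]
    negatives = [s for y, s in zip(labels, scores) if y == 0.0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0   # positive ranked above negative
            elif pos == neg:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))


# Assumed reconstruction: 17 positives, 11 negatives, one negative predicted as 1.0.
labels = [1.0] * 17 + [0.0] * 11
hard_predictions = [1.0] * 17 + [1.0] + [0.0] * 10
auc = area_under_roc(labels, hard_predictions)
print(auc)
```

Under these assumptions the result is 21/22, which is consistent with the 0.9545... displayed above.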

In [20]:
multi_eval = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label', metricName='accuracy')

In [21]:
multi_eval.evaluate(dtc_predictions)
multi_eval.evaluate(rfc_predictions)
multi_eval.evaluate(gbt_predictions)


0.9642857142857143
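With metricName='accuracy' the evaluator returns the fraction of rows whose prediction equals the label. The displayed value is consistent with 27 correct predictions out of the 28 test rows. A plain-Python check, using a hypothetical set of predictions with exactly one mistake:

```python
def accuracy(labels, predictions):
    """Fraction of rows where the prediction equals the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Assumed data: 28 test rows (17 positives, 11 negatives), one wrong prediction.
labels = [1.0] * 17 + [0.0] * 11
predictions = [1.0] * 17 + [1.0] + [0.0] * 10
acc = accuracy(labels, predictions)
print(acc)
```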