# Intro to Decision Trees - HOMEWORK

Consider the dataset shown in a code cell below, which records whether a person will eat a food on the basis of three characteristics. A decision tree works as follows: it loops over every attribute (taste, temperature, texture) and calculates the information gain (the mutual information between the attribute and the response) for each candidate split. It then splits the data on the attribute that gives the largest information gain.

Suppose X is the distribution of 'yes', 'no' responses without any splits. If you split on the basis of 'taste', you will end up with three classes. You can then calculate the conditional entropy H(X|taste) and use this to calculate the information gain

$$ I(X,taste) = H(X) - H(X|taste). $$

Recall that one formulation of the conditional entropy (the one useful here) is

$$ H(X|taste) = \displaystyle \sum_{t \in taste} p(t) H(X|taste=t)  $$

where the last quantity $H(X|taste = t)$ is the entropy of the responses calculated within the proposed class (the rows where taste equals $t$).
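
As a worked check of this formula, here is the split on taste, using counts read off the dataset below and assuming the standard Shannon entropy $H = -\sum_i p_i \log_2 p_i$ in bits. Overall there are 6 'Yes' and 4 'No' responses; Salty has 1 'Yes' and 2 'No', Spicy has 3 'Yes' and 2 'No', and Sweet has 2 'Yes' and 0 'No'.

$$ H(X) = -\tfrac{6}{10}\log_2\tfrac{6}{10} - \tfrac{4}{10}\log_2\tfrac{4}{10} \approx 0.971 $$

$$ H(X|taste) = \tfrac{3}{10}\,(0.918) + \tfrac{5}{10}\,(0.971) + \tfrac{2}{10}\,(0) \approx 0.761 $$

$$ I(X,taste) \approx 0.971 - 0.761 = 0.210 $$

The Sweet class contributes zero because it is already pure.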

TASK: For this data, calculate and output the information gained by splitting along each of the three attributes. 

The attribute with the highest information gain is the one chosen for the first split. For a full decision tree, you would then split each resulting class on the attribute with the largest information gain, recursing until no further splitting is possible (that is, until you end up with pure classes with no mixing). Managing the data structures for this can be a bit of a challenge, so in this HW you'll just use a library to construct full decision trees. This is just a taste of how they work.
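
The recursion described above can be sketched in a few lines of plain Julia. This is only an illustrative sketch, not the method used by the library later in this notebook: the names `label_entropy`, `gain`, and `build_tree` are hypothetical, the tree is stored as nested `Dict`s, and each attribute is assumed to be used at most once along a path (an ID3-style rule).

```julia
# Entropy (in bits) of a vector of class labels.
function label_entropy(labels)
    n = length(labels)
    h = 0.0
    for v in unique(labels)
        p = count(==(v), labels) / n
        h -= p * log2(p)
    end
    return h
end

# Information gain from partitioning `labels` by the values in `column`.
function gain(column, labels)
    n = length(labels)
    cond = sum(count(==(v), column) / n * label_entropy(labels[column .== v])
               for v in unique(column))
    return label_entropy(labels) - cond
end

# Build the tree as nested Dicts: attribute => (value => subtree or leaf label).
function build_tree(columns, labels)
    # Stop on a pure node, or take a majority vote when no attributes remain.
    if length(unique(labels)) == 1 || isempty(columns)
        return argmax(v -> count(==(v), labels), unique(labels))
    end
    # Pick the attribute with the largest gain (sorted keys make ties deterministic).
    best = argmax(a -> gain(columns[a], labels), sort(collect(keys(columns))))
    children = Dict{String,Any}()
    for v in unique(columns[best])
        mask = columns[best] .== v
        rest = Dict{Symbol,Vector{String}}(
            a => col[mask] for (a, col) in columns if a != best)
        children[v] = build_tree(rest, labels[mask])
    end
    return Dict(best => children)
end

# The same ten rows as the homework dataset, as plain vectors.
columns = Dict(
    :Taste => ["Salty", "Spicy", "Spicy", "Spicy", "Spicy", "Sweet", "Salty", "Sweet", "Spicy", "Salty"],
    :Temperature => ["Hot", "Hot", "Hot", "Cold", "Hot", "Cold", "Cold", "Hot", "Cold", "Hot"],
    :Texture => ["Soft", "Soft", "Hard", "Hard", "Hard", "Soft", "Soft", "Soft", "Soft", "Hard"],
)
eat = ["No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"]

tree = build_tree(columns, eat)
println(tree)
```

On this data the root split comes out as `:Taste`, and the Sweet branch is immediately a pure 'Yes' leaf. Nested `Dict`s keep the sketch short; a dedicated node type would be easier to traverse in a real implementation.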

In [51]:
using Pkg
Pkg.status()
# Install the packages used in this notebook (CategoricalArrays is added in a later cell)
Pkg.add("DataFrames")
Pkg.add("StatsBase")
Pkg.add("DecisionTree")
using DataFrames


Status `~/.julia/environments/v1.10/Project.toml`
  [a93c6f00] DataFrames v1.6.1
  [7806a523] DecisionTree v0.12.4
  [31c24e10] Distributions v0.25.107
  [033835bb] JLD2 v0.4.45
  [f0f68f2c] PlotlyJS v0.18.12
⌃ [91a5bcdd] Plots v1.40.0
  [2913bbd2] StatsBase v0.34.2
  [f3b207a7] StatsPlots v0.15.6
Info Packages marked with ⌃ have new versions available and may be upgradable.
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`

In [5]:

using DataFrames

dataset = Dict(
    "Taste" => ["Salty", "Spicy", "Spicy", "Spicy", "Spicy", "Sweet", "Salty", "Sweet", "Spicy", "Salty"],
    "Temperature" => ["Hot", "Hot", "Hot", "Cold", "Hot", "Cold", "Cold", "Hot", "Cold", "Hot"],
    "Texture" => ["Soft", "Soft", "Hard", "Hard", "Hard", "Soft", "Soft", "Soft", "Soft", "Hard"],
    "Eat" => ["No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"]
)

df = DataFrame(dataset)

println(first(df, 10))


10×4 DataFrame
 Row │ Eat     Taste   Temperature  Texture
     │ String  String  String       String
─────┼──────────────────────────────────────
   1 │ No      Salty   Hot          Soft
   2 │ No      Spicy   Hot          Soft
   3 │ Yes     Spicy   Hot          Hard
   4 │ No      Spicy   Cold         Hard
   5 │ Yes     Spicy   Hot          Hard
   6 │ Yes     Sweet   Cold         Soft
   7 │ No      Salty   Cold         Soft
   8 │ Yes     Sweet   Hot          Soft
   9 │ Yes     Spicy   Cold         Soft
  10 │ Yes     Salty   Hot          Hard


In [30]:
using DataFrames
using StatsBase

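# Entropy (in bits, log base 2) of a vector of class labels.
# Note: this definition shadows `StatsBase.entropy` for the rest of the session.
function entropy(labels)
    counts = countmap(labels)
    probs = values(counts) ./ sum(values(counts))
    # epsilon guards log2 against a zero argument; countmap never yields p == 0,
    # so it only perturbs the result at roughly the 1e-10 level
    epsilon = 1e-10
    return -sum(p * log2(p + epsilon) for p in probs if p > 0)
end

println("Total entropy (bits, log base 2) = ")
total_ent = entropy(df.Eat)


Total entropy (bits, log base 2) = 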


0.9709505941661296

In [43]:

# Conditional entropy H(Eat | attribute): the entropy within each attribute value, weighted by that value's frequency
function conditional_entropy(df, attribute_name)
    labels = df[!,:Eat]  
    attribute = df[!, attribute_name] 
    println("Attribute: $attribute_name")
    unique_values = unique(attribute)
    println("Unique Values: $unique_values")
    total_count = length(labels)
    println("Total Count: $total_count")
    weighted_entropy = sum((count(==(v), attribute) / total_count) * entropy(df[attribute .== v, :Eat]) for v in unique_values)
    println("Weighted Entropy: $weighted_entropy")
    return weighted_entropy
end

# Information gain: H(Eat) - H(Eat | attribute)
function information_gain(df, attribute_name)
    labels = df[!,:Eat] 
    println("Labels: $labels")
    total_entropy = entropy(labels)
    println("Total Entropy: $total_entropy")
    cond_entropy = conditional_entropy(df, attribute_name)
    println("Conditional Entropy: $cond_entropy")
    return total_entropy - cond_entropy
end

attribute_list = [:Taste, :Temperature, :Texture]  # DataFrame column names as symbols

# Calculate and print the information gain for each attribute
for attribute in attribute_list
    ig = information_gain(df, attribute)
    println("$attribute Information Gain = $ig \n")
end



Labels: ["No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"]
Total Entropy: 0.9709505941661296
Attribute: Taste
Unique Values: ["Salty", "Spicy", "Sweet"]
Total Count: 10


Weighted Entropy: 0.7609640471839959
Conditional Entropy: 0.7609640471839959
Taste Information Gain = 0.2099865469821337 

Labels: ["No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"]
Total Entropy: 0.9709505941661296
Attribute: Temperature
Unique Values: ["Hot", "Cold"]
Total Count: 10
Weighted Entropy: 0.9509775001441545
Conditional Entropy: 0.9509775001441545
Temperature Information Gain = 0.019973094021975113 

Labels: ["No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"]
Total Entropy: 0.9709505941661296
Attribute: Texture
Unique Values: ["Soft", "Hard"]
Total Count: 10
Weighted Entropy: 0.9245112494951141
Conditional Entropy: 0.9245112494951141
Texture Information Gain = 0.046439344671015514 



In [56]:
using Pkg
Pkg.add("DecisionTree")
Pkg.add("DataFrames")
Pkg.add("CategoricalArrays")  # For handling categorical variables


   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`
   Resolving package versions...
   Installed WorkerUtilities ─ v1.6.1
   Installed WeakRefStrings ── v1.4.2
   Installed FilePathsBase ─── v0.9.21
   Installed CSV ───────────── v0.10.12
    Updating `~/.julia/environments/v1.10/Project.toml`
  [336ed68f] + CSV v0.10.12
    Updating `~/.julia/environments/v1.10/Manifest.toml`
  [336ed68f] + CSV v0.10.12
  [48062228] + FilePathsBase v0.9.21
  [ea10d353] + WeakRefStrings v1.4.2
  [76eceee3] + WorkerUtilities v1.6.1
Precompiling project...
  ✓ WorkerUtilities
  ✓ WeakRefStrings
  ✓ FilePathsBase
  ✓ CSV
  4 dependencies successfully precompiled in 19 seconds. 243 already precompiled.
   Resolving package versions...
   Installed CategoricalArrays ─ v0.10.8
    Updating `~/.julia/environments/v1.10/Project.toml`
  [324d7699] + CategoricalArrays v0.10.8
    Updating `~/.julia/environments/v1.10/Manifest.toml`
  [324d7699] + CategoricalArrays v0.10.8
Precompiling project...
  ✓ CategoricalArrays
  ✓ CategoricalArrays → CategoricalArraysRecipesBaseExt
  ✓ CategoricalArrays → CategoricalArraysSentinelArraysExt
  ✓ CategoricalArrays → CategoricalArraysJSONExt
  4 dependencies successfully precompiled in 6 seconds. 247 already precompiled.


In [60]:
using DataFrames, DecisionTree, CategoricalArrays

# Convert the features and the target variable in df to categorical arrays
# (this overwrites the original string columns in place)
df.Taste = categorical(df.Taste, compress=true)
df.Temperature = categorical(df.Temperature, compress=true)
df.Texture = categorical(df.Texture, compress=true)
df.Eat = categorical(df.Eat, compress=true)

# Convert categories to integer codes
df.Taste = CategoricalArrays.levelcode.(df.Taste)
df.Temperature = CategoricalArrays.levelcode.(df.Temperature)
df.Texture = CategoricalArrays.levelcode.(df.Texture)
df.Eat = CategoricalArrays.levelcode.(df.Eat)

# Prepare the features and labels for the model
features = Matrix(df[:, [:Taste, :Temperature, :Texture]])
labels = df[:, :Eat]

# Initialize and train the classifier
model = DecisionTreeClassifier(max_depth=5)  # Adjust max_depth as needed
DecisionTree.fit!(model, features, labels)

# Print the fitted tree. print_tree writes to stdout and returns nothing,
# so it is called directly rather than wrapped in println.
DecisionTree.print_tree(model)


Feature 1 < 3.0 ?
├─ Feature 3 < 2.0 ?
    ├─ Feature 2 < 2.0 ?
        ├─ 1 : 1/1
        └─ 2 : 3/3
    └─ Feature 1 < 2.0 ?
        ├─ 1 : 2/2
        └─ Feature 2 < 2.0 ?
            ├─ 2 : 1/1
            └─ 1 : 1/1
└─ 2 : 2/2
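
To read the printed tree, it helps to know which integer each category was given. The sketch below reproduces the mapping in plain Julia under the assumption that the string levels were ordered lexicographically (the default for `categorical` on strings, and consistent with the leaf counts above). `sorted_codes` is a hypothetical helper, not part of CategoricalArrays.

```julia
# Hypothetical helper: map each distinct string to its rank in sorted order,
# mimicking the integer codes assumed for the tree's features and labels.
sorted_codes(column) = Dict(v => i for (i, v) in enumerate(sort(unique(column))))

taste = ["Salty", "Spicy", "Spicy", "Spicy", "Spicy", "Sweet", "Salty", "Sweet", "Spicy", "Salty"]
eat = ["No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"]

taste_codes = sorted_codes(taste)
eat_codes = sorted_codes(eat)

# Print the codes in increasing order.
for (value, code) in sort(collect(taste_codes), by = last)
    println("Taste code $code = $value")
end
for (value, code) in sort(collect(eat_codes), by = last)
    println("Eat code $code = $value")
end
```

Under this mapping, the root test `Feature 1 < 3.0` separates Salty and Spicy (codes 1 and 2) from Sweet (code 3), and the leaf `2 : 2/2` says both Sweet rows are class 2, that is 'Yes'.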
