## Generating a Decision Tree for the Balloons Dataset
Dataset Link - https://archive.ics.uci.edu/ml/datasets/Balloons
1. Task 1: Understand the data (10 Marks)
2. Task 2: Manually - Generate Decision Tree using Information Gain & Gini Impurity (60 Marks)
3. Task 3: scikit-learn - Generate & Display Decision Tree using export_graphviz (30 Marks)
4. Task 4: Share the GitHub link


In [22]:
import pandas as pd
import numpy as np

In [2]:
balloon_data = pd.read_csv('https://raw.githubusercontent.com/edyoda/Data-Scientist-program/master/Practice%20Problems/data/yellow-small%2Badult-stretch.data',names=['color','size','act','age','inflated'])

In [3]:
balloon_data

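| | color | size | act | age | inflated |
|---|---|---|---|---|---|
| 0 | YELLOW | SMALL | STRETCH | ADULT | T |
| 1 | YELLOW | SMALL | STRETCH | CHILD | T |
| 2 | YELLOW | SMALL | DIP | ADULT | T |
| 3 | YELLOW | SMALL | DIP | CHILD | T |
| 4 | YELLOW | LARGE | STRETCH | ADULT | T |
| 5 | YELLOW | LARGE | STRETCH | CHILD | F |
| 6 | YELLOW | LARGE | DIP | ADULT | F |
| 7 | YELLOW | LARGE | DIP | CHILD | F |
| 8 | PURPLE | SMALL | STRETCH | ADULT | T |
| 9 | PURPLE | SMALL | STRETCH | CHILD | F |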


## Task 1: Understand the data (10 Marks)

In [6]:
balloon_data.describe()

| | color | size | act | age | inflated |
|---|---|---|---|---|---|
| count | 16 | 16 | 16 | 16 | 16 |
| unique | 2 | 2 | 2 | 2 | 2 |
| top | YELLOW | LARGE | STRETCH | ADULT | F |
| freq | 8 | 8 | 8 | 8 | 9 |


In [12]:
balloon_data['color'].value_counts()

YELLOW    8
PURPLE    8
Name: color, dtype: int64

In [11]:
balloon_data['size'].value_counts()

LARGE    8
SMALL    8
Name: size, dtype: int64

In [13]:
balloon_data['act'].value_counts()

STRETCH    8
DIP        8
Name: act, dtype: int64

In [14]:
balloon_data['age'].value_counts()

ADULT    8
CHILD    8
Name: age, dtype: int64

In [15]:
balloon_data['inflated'].value_counts()

F    9
T    7
Name: inflated, dtype: int64

### Dataset Conclusion

1) color
    * YELLOW: 8 (50 %)
    * PURPLE: 8 (50 %)

2) size
    * LARGE: 8 (50 %)
    * SMALL: 8 (50 %)

3) act
    * STRETCH: 8 (50 %)
    * DIP: 8 (50 %)

4) age
    * ADULT: 8 (50 %)
    * CHILD: 8 (50 %)

5) inflated
    * F: 9 (56 %)
    * T: 7 (44 %)
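The shares above can be reproduced without downloading the file. The following is a minimal sketch that rebuilds the table in plain Python. It assumes the file holds the full cross product of the four binary attributes (16 rows) and that a balloon is inflated when it is (YELLOW and SMALL) or (ADULT and STRETCH), as the file name suggests. The helper names `build_rows` and `value_shares` are made up for this illustration.

```python
from collections import Counter
from itertools import product

COLUMNS = ["color", "size", "act", "age", "inflated"]

def build_rows():
    """Rebuild the 16 rows under the assumption stated above (not read from the file)."""
    rows = []
    for color, size, act, age in product(
        ["YELLOW", "PURPLE"], ["SMALL", "LARGE"], ["STRETCH", "DIP"], ["ADULT", "CHILD"]
    ):
        # Assumed labelling rule: (YELLOW and SMALL) or (ADULT and STRETCH).
        inflated = (color == "YELLOW" and size == "SMALL") or (
            age == "ADULT" and act == "STRETCH"
        )
        rows.append((color, size, act, age, "T" if inflated else "F"))
    return rows

def value_shares(rows, column):
    """Return {value: (count, percent)} for one column."""
    idx = COLUMNS.index(column)
    counts = Counter(row[idx] for row in rows)
    return {value: (count, 100 * count / len(rows)) for value, count in counts.items()}

rows = build_rows()
for column in COLUMNS:
    print(column, value_shares(rows, column))
```

With pandas, `balloon_data[col].value_counts(normalize=True)` gives the same proportions directly from the loaded frame.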


#### Binning Technique

* For the Color column, the encoded values are 0 and 1
    * Single threshold: 0.5
* For the Size column, the encoded values are 0 and 1
    * Single threshold: 0.5
* For the Act column, the encoded values are 0 and 1
    * Single threshold: 0.5
* For the Age column, the encoded values are 0 and 1
    * Single threshold: 0.5


### What are the possible questions or decisions?
* For the Color attribute, the only possible question is
    * (a) color < 0.5
* For Size
    * (b) size < 0.5
* For Act
    * (c) act < 0.5
* For Age
    * (d) age < 0.5
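The list of candidate questions can be derived mechanically by placing a threshold halfway between consecutive encoded values. The sketch below illustrates that idea. The `candidate_thresholds` helper and the small `encoded_columns` sample are hypothetical and are not taken from the notebook.

```python
def candidate_thresholds(values):
    """Return the midpoints between consecutive distinct encoded values."""
    levels = sorted(set(values))
    return [(low + high) / 2 for low, high in zip(levels, levels[1:])]

# Hypothetical encoded sample: four binary features, as in this dataset.
encoded_columns = {
    "color": [1, 1, 0, 0],
    "size": [1, 0, 1, 0],
    "act": [1, 0, 0, 1],
    "age": [0, 1, 0, 1],
}

# One (feature, threshold) pair per candidate question.
decisions = [
    (feature, threshold)
    for feature, values in encoded_columns.items()
    for threshold in candidate_thresholds(values)
]
print(decisions)
```

A binary column yields exactly one threshold (0.5). A column with three levels 0, 1, 2 would yield 0.5 and 1.5.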
    

## Task 2: Manually - Generate Decision Tree using Information Gain & Gini Impurity (60 Marks)

In [35]:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
enc_data = oe.fit_transform(balloon_data)
data = pd.DataFrame(enc_data, columns=['color','size','act','age','inflated'], dtype=int)# convert the data to dataframe

In [36]:
data

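| | color | size | act | age | inflated |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 0 | 0 | 1 |
| 3 | 1 | 1 | 0 | 1 | 1 |
| 4 | 1 | 0 | 1 | 0 | 1 |
| 5 | 1 | 0 | 1 | 1 | 0 |
| 6 | 1 | 0 | 0 | 0 | 0 |
| 7 | 1 | 0 | 0 | 1 | 0 |
| 8 | 0 | 1 | 1 | 0 | 1 |
| 9 | 0 | 1 | 1 | 1 | 0 |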
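To read the encoded table, it helps to know which integer stands for which category. By default `OrdinalEncoder` numbers the sorted unique values of each column, so PURPLE becomes 0 and YELLOW becomes 1, and so on. The following plain-Python sketch mimics that default for four of the rows shown earlier. The `ordinal_mapping` helper is a hypothetical stand-in, not part of scikit-learn.

```python
def ordinal_mapping(values):
    """Map each distinct value to its rank in sorted order."""
    return {value: rank for rank, value in enumerate(sorted(set(values)))}

columns = ["color", "size", "act", "age", "inflated"]

# Rows 0, 2, 5 and 9 of the raw table; together they contain every category.
sample = [
    ("YELLOW", "SMALL", "STRETCH", "ADULT", "T"),
    ("YELLOW", "SMALL", "DIP", "ADULT", "T"),
    ("YELLOW", "LARGE", "STRETCH", "CHILD", "F"),
    ("PURPLE", "SMALL", "STRETCH", "CHILD", "F"),
]

# Build one mapping per column, then encode every row with it.
mappings = {
    name: ordinal_mapping(row[i] for row in sample) for i, name in enumerate(columns)
}
encoded = [[mappings[name][row[i]] for i, name in enumerate(columns)] for row in sample]

for name in columns:
    print(name, mappings[name])
print(encoded)
```

On the fitted encoder from the cell above, `oe.categories_` lists the same sorted categories per column.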


## Based on Gini Impurity

In [83]:
#Based on Gini Impurity
class DecisionTreeClassifier:
    
    def __init__(self):
        self.feature_ranges = {}
    
    def createDecisionsFromData(self,feature_data):
        
        for col in feature_data.columns:
            ranges_info = []
            i = 0.5
            cat_max = feature_data[col].max()
            while i < cat_max:
                ranges_info.append(i)
                i = i+1
                
            self.feature_ranges[col] = ranges_info
            
        decisions = list(self.feature_ranges.items())  # use this instance, not the global dt
        
        self.decisions = []
        for f,buckets in decisions:
            for bucket in buckets:
                self.decisions.append((f,bucket))
    
    def calGini(self,target):
        total = (target.value_counts()).sum()
        
        l = list(target.value_counts().values)
        
        gini = 0
        
        for e in l:
            gini += (e/total) * (1 - (e/total)) # Formula to calculate gini index
        
        return gini
    
    def select_decision(self, data):
        GiniBeforeSplit = self.calGini(data.target)
        if GiniBeforeSplit == 0:
            return
        #print('GiniBeforeSplit',GiniBeforeSplit)
        
        max_gini_gain = 0
        for feature,value in self.decisions:
            
            data_left = data[data[feature] < value]
            GiniLeft = self.calGini(data_left.target)
            
            data_right = data[data[feature] > value]
            GiniRight = self.calGini(data_right.target)
            
            GiniSplit = (data_left.shape[0]/data.shape[0])* GiniLeft + (data_right.shape[0]/data.shape[0])* GiniRight
            GiniGain = GiniBeforeSplit - GiniSplit
            
            if GiniGain > max_gini_gain:
                max_gini_gain = GiniGain
                best_data_left = data_left
                best_data_right = data_right
                best_feature = feature
                best_value = value
        
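        # Stop when no candidate split reduces the impurity of this node.
        if max_gini_gain == 0:
            return

        # Caveat: GiniLeft and GiniRight still hold the values from the last
        # candidate tried in the loop above, not those of the chosen split.
        print(best_feature,best_value)
        print('Gini-Left',GiniLeft)
        self.select_decision(best_data_left)
        print('Gini-Right',GiniRight)
        self.select_decision(best_data_right)
    
    # Calculate the Gini gain for each decision and choose the best one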
    def myfit(self, feature_data, target_data):
        
        feature_data['target'] = target_data
        data = feature_data
        self.select_decision(data)

In [84]:
dt = DecisionTreeClassifier()

In [85]:
dt.createDecisionsFromData(data.drop(columns=['inflated']))

In [86]:
dt.myfit(data.drop(columns=['inflated']), data.inflated)

color 0.5
Gini-Left 0.46875
act 0.5
Gini-Left 0.5
Gini-Right 0.0
age 0.5
Gini-Left 0.0
Gini-Right 0.0
Gini-Right 0.375
size 0.5
Gini-Left 0.375
act 0.5
Gini-Left 0.5
Gini-Right 0.0
age 0.5
Gini-Left 0.0
Gini-Right 0.0
Gini-Right 0.5
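As a cross-check on the first split, the Gini gain at the root can be worked out by hand. The sketch below does this in plain Python. It assumes the encoded data is the full cross product of the four binary features, with the label rule (YELLOW and SMALL) or (ADULT and STRETCH) expressed in the encoded values. The function names `gini` and `gini_gain` are hypothetical. It uses the form 1 - sum(p^2), which equals the sum of p(1 - p) used in the class above.

```python
from itertools import product

def gini(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the classes present."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_gain(rows, feature_idx, threshold=0.5):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    parent = [row[-1] for row in rows]
    left = [row[-1] for row in rows if row[feature_idx] < threshold]
    right = [row[-1] for row in rows if row[feature_idx] > threshold]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
    return gini(parent) - weighted

# Assumed encoding: color 1=YELLOW, size 1=SMALL, act 1=STRETCH, age 0=ADULT.
rows = [
    (color, size, act, age, int((color == 1 and size == 1) or (age == 0 and act == 1)))
    for color, size, act, age in product([0, 1], repeat=4)
]

root_gini = gini([row[-1] for row in rows])
gains = {
    name: gini_gain(rows, i) for i, name in enumerate(["color", "size", "act", "age"])
}
print(root_gini)
print(gains)
```

Under these assumptions all four features give the same gain at the root. Because the class above only replaces the best split on a strictly larger gain, the first candidate, `color`, is the one that gets printed.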


## Using Information Gain

In [89]:
#Based on Information Gain
class DecisionTreeClassifier:
    
    def __init__(self):
        self.feature_ranges = {}
    
    def createDecisionsFromData(self,feature_data):
        
        for col in feature_data.columns:
            ranges_info = []
            i = 0.5
            cat_max = feature_data[col].max()
            while i < cat_max:
                ranges_info.append(i)
                i = i+1
                
            self.feature_ranges[col] = ranges_info
            
        decisions = list(self.feature_ranges.items())  # use this instance, not the global dt
        
        self.decisions = []
        for f,buckets in decisions:
            for bucket in buckets:
                self.decisions.append((f,bucket))
    
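    def calGini(self,target):
        # Despite its name, this version returns the entropy (in bits) of the
        # target, so the "Gini" names used below actually refer to entropy.
        total = (target.value_counts()).sum()
        
        l = list(target.value_counts().values)
        
        entropy = 0
        
        for e in l:
            entropy += (e/total) * np.log2(e/total) # accumulates sum(p * log2(p)), which is <= 0
        
        # Entropy is the negated sum
        if entropy < 0:
            return -entropy
        return entropy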
    
    def select_decision(self, data):
        GiniBeforeSplit = self.calGini(data.target)
        if GiniBeforeSplit == 0:
            return
        #print('GiniBeforeSplit',GiniBeforeSplit)
        
        max_gini_gain = 0
        for feature,value in self.decisions:
            
            data_left = data[data[feature] < value]
            GiniLeft = self.calGini(data_left.target)
            
            data_right = data[data[feature] > value]
            GiniRight = self.calGini(data_right.target)
            
            GiniSplit = (data_left.shape[0]/data.shape[0])* GiniLeft + (data_right.shape[0]/data.shape[0])* GiniRight
            GiniGain = GiniBeforeSplit - GiniSplit
            
            if GiniGain > max_gini_gain:
                max_gini_gain = GiniGain
                best_data_left = data_left
                best_data_right = data_right
                best_feature = feature
                best_value = value
        
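        # Stop when no candidate split reduces the entropy of this node.
        if max_gini_gain == 0:
            return

        # Caveat: GiniLeft and GiniRight still hold the values from the last
        # candidate tried in the loop above, not those of the chosen split.
        print(best_feature,best_value)
        print('Gini-Left',GiniLeft)
        self.select_decision(best_data_left)
        print('Gini-Right',GiniRight)
        self.select_decision(best_data_right)
    
    # Calculate the information gain for each decision and choose the best one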
    def myfit(self, feature_data, target_data):
        
        feature_data['target'] = target_data
        data = feature_data
        self.select_decision(data)

In [43]:
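# Create the instance after running the class definition above; an instance
# created from the earlier Gini-based class keeps using the Gini methods.
dt = DecisionTreeClassifier()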

In [91]:
dt.createDecisionsFromData(data.drop(columns=['inflated']))

In [93]:
dt.myfit(data.drop(columns=['inflated']), data.inflated)

color 0.5
Gini-Left 0.46875
act 0.5
Gini-Left 0.5
Gini-Right 0.0
age 0.5
Gini-Left 0.0
Gini-Right 0.0
Gini-Right 0.375
size 0.5
Gini-Left 0.375
act 0.5
Gini-Left 0.5
Gini-Right 0.0
age 0.5
Gini-Left 0.0
Gini-Right 0.0
Gini-Right 0.5
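For reference, the entropy-based quantities at the root can be computed by hand. The following is a minimal sketch in plain Python under the same assumption as before: the encoded data is the full cross product of the four binary features, labelled 1 when (YELLOW and SMALL) or (ADULT and STRETCH). The names `entropy` and `information_gain` are hypothetical. The results are entropies in bits, so they are not expected to coincide with Gini figures.

```python
import math
from itertools import product

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)) over the classes present."""
    n = len(labels)
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

def information_gain(rows, feature_idx, threshold=0.5):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    parent = [row[-1] for row in rows]
    left = [row[-1] for row in rows if row[feature_idx] < threshold]
    right = [row[-1] for row in rows if row[feature_idx] > threshold]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(rows)
    return entropy(parent) - weighted

# Assumed encoding: color 1=YELLOW, size 1=SMALL, act 1=STRETCH, age 0=ADULT.
rows = [
    (color, size, act, age, int((color == 1 and size == 1) or (age == 0 and act == 1)))
    for color, size, act, age in product([0, 1], repeat=4)
]

root_entropy = entropy([row[-1] for row in rows])
gains = {
    name: information_gain(rows, i)
    for i, name in enumerate(["color", "size", "act", "age"])
}
print(root_entropy)
print(gains)
```

Under these assumptions the root entropy is roughly 0.989 bits and each of the four features yields an information gain of roughly 0.106 bits, so the tie at the root is the same as with Gini impurity.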


## Task 3: scikit-learn - Generate & Display Decision Tree using export_graphviz (30 Marks)

In [50]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()

In [51]:
dt.fit(data.drop(columns=['inflated']), data['inflated'])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [52]:
from sklearn.tree import export_graphviz

In [55]:
export_graphviz(dt, out_file='BalloonDecisionTree.tree', feature_names=['color','size','act','age'])
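The exported file is in Graphviz DOT format and still has to be rendered to become an image. As a hedged alternative, the sketch below keeps the DOT source in memory by passing `out_file=None` and prints a text view of the tree with `export_text`. The training data is rebuilt under the assumption used for the earlier checks (full cross product of the four binary features, labelled 1 when (YELLOW and SMALL) or (ADULT and STRETCH)). The `class_names`, `filled` and `random_state` settings are illustrative choices that do not appear in the original cell.

```python
from itertools import product

from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text

feature_names = ["color", "size", "act", "age"]

# Assumed encoding: color 1=YELLOW, size 1=SMALL, act 1=STRETCH, age 0=ADULT.
X = [list(combo) for combo in product([0, 1], repeat=4)]
y = [int((c == 1 and s == 1) or (g == 0 and a == 1)) for c, s, a, g in X]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# With out_file=None the DOT source is returned as a string instead of written to disk.
dot_source = export_graphviz(
    clf,
    out_file=None,
    feature_names=feature_names,
    class_names=["F", "T"],  # order follows the sorted class labels 0 and 1
    filled=True,
)

# Plain-text rendering of the same tree, useful when Graphviz is not installed.
print(export_text(clf, feature_names=feature_names))
```

If Graphviz is installed, a file such as `BalloonDecisionTree.tree` can be rendered from the command line with `dot -Tpng BalloonDecisionTree.tree -o tree.png`.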

## Tree

The rendered tree image is available here:

https://github.com/akkysanap22/Edyoda_graded_Assignments/blob/master/ballon_decisionTree.JPG