# FE Assignment2 Question2

### (B) TAKE THE IRIS DATA SET, OBTAINED FROM THE UNIVERSITY OF CALIFORNIA-IRVINE MACHINE LEARNING REPOSITORY (LINK PROVIDED IN THE REFERENCE SECTION), AS A DATA SET TO BE DISCRETIZED. PERFORM DATA DISCRETIZATION FOR EACH OF THE FOUR NUMERIC ATTRIBUTES USING THE CHIMERGE METHOD (LET THE STOPPING CRITERION BE: MAX-INTERVAL 6). YOU NEED TO WRITE A SMALL PYTHON PROGRAM TO DO THIS TO AVOID CLUMSY NUMERICAL COMPUTATIONS. SUBMIT YOUR SIMPLE ANALYSIS AND YOUR TEST RESULTS: SPLIT-POINTS, FINAL INTERVALS AND THE WELL-DOCUMENTED SOURCE PROGRAM IN A PYTHON JUPYTER NOTEBOOK. [8]

In [4]:
#import necessary libraries
import pandas as pd
from collections import Counter
import numpy as np


In [7]:
#Read the iris dataset into pandas dataframe with headers off as this data file has no headers
iris = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
0    150 non-null float64
1    150 non-null float64
2    150 non-null float64
3    150 non-null float64
4    150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


###### Assigning column names to Iris data

In [8]:
iris.columns = ['sepal_l', 'sepal_w', 'petal_l', 'petal_w', 'type']

In [9]:
iris.head(5)

   sepal_l  sepal_w  petal_l  petal_w         type
0      5.1      3.5      1.4      0.2  Iris-setosa
1      4.9      3.0      1.4      0.2  Iris-setosa
2      4.7      3.2      1.3      0.2  Iris-setosa
3      4.6      3.1      1.5      0.2  Iris-setosa
4      5.0      3.6      1.4      0.2  Iris-setosa


###### ChiMerge algorithm
1. Sort the distinct values of the feature.
2. Form one interval for each distinct value in the sorted list.
3. If the number of intervals is greater than max_intervals (6), go to Step 4; otherwise go to Step 12.
4. For each pair of adjacent intervals, count the occurrences of each class label in both intervals. These are the observed values.
5. Calculate the expected values for each pair of adjacent intervals.
6. Calculate the chi-square value of the pair from the observed (Step 4) and expected (Step 5) values.
7. Repeat Steps 4 to 6 for all pairs of adjacent intervals.
8. Once chi-square values have been calculated for all pairs, find the minimum chi-square value.
9. Create the new intervals: copy every existing interval except the pair with the minimum chi-square value.
   Merge that pair into a single interval and add it to the new intervals.
10. Repeat Step 9 until all intervals have been added to the new intervals. This gives the merged intervals for the next iteration.
11. Go to Step 3.
12. Repeat Steps 1 to 11 for the other features.
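
As a worked illustration of Steps 4 to 6, the chi-square value for one pair of adjacent intervals can be sketched as below. This is a minimal standalone sketch, not the notebook's implementation: the helper name `chi2_adjacent` and the example counts are hypothetical, and cells with an expected count of zero are assumed to contribute zero to the sum.

```python
import numpy as np

def chi2_adjacent(count_0, count_1):
    """Chi-square statistic for two adjacent intervals.

    count_0 and count_1 hold the per-class observation counts of the two
    intervals, listed in the same class order.
    """
    observed = np.array([count_0, count_1], dtype=float)  # shape (2, n_classes)
    row_totals = observed.sum(axis=1, keepdims=True)      # observations per interval
    col_totals = observed.sum(axis=0, keepdims=True)      # observations per class
    total = observed.sum()
    expected = row_totals * col_totals / total
    # Assumption: a class absent from both intervals (expected == 0) adds nothing
    mask = expected > 0
    return float((((observed - expected) ** 2)[mask] / expected[mask]).sum())

# Hypothetical counts for three classes
same_class = chi2_adjacent([5, 0, 0], [3, 0, 0])    # identical class distribution
different = chi2_adjacent([5, 0, 0], [0, 4, 0])     # completely different classes
print(same_class, different)
```

A pair with identical class distributions gets a chi-square value of zero and is therefore the first candidate for merging, while a pair whose classes differ gets a larger value.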

In [10]:
def chimerge(data, attr, label, max_intervals):
    # Suppress NumPy warnings for division by zero; a class absent from both intervals has an expected count of 0
    np.seterr(divide='ignore', invalid='ignore')
    
    #1.Sort the distinct feature values
    distinct_vals = sorted(set(data[attr]))  
    
    #Get the distinct sorted labels
    labels = sorted(set(data[label]))  
    
    #A dictionary of counts for each label
    empty_count = {l: 0 for l in labels}  
    
    # 2.Form intervals for each distinct value in the sorted distinct feature list.
    intervals = [[distinct_vals[i], distinct_vals[i]] for i in range(len(distinct_vals))] 
 
    # 3. Keep merging until the number of intervals drops to max_intervals
    while len(intervals) > max_intervals:
        
        #Array to hold the chi values for this iteration
        chi = []
        
        # Calculate the chi-square value for each pair of adjacent intervals in this iteration
        for i in range(len(intervals)-1):
            
            # 4. Select the rows that fall in each of the two adjacent intervals; their class counts are the observed values
            obs0 = data[data[attr].between(intervals[i][0], intervals[i][1])]
            obs1 = data[data[attr].between(intervals[i+1][0], intervals[i+1][1])]
            total = len(obs0) + len(obs1)
            
            # Per-class counts in each interval; empty_count fills in labels that are absent from an interval
            count_0 = np.array([v for _, v in {**empty_count, **Counter(obs0[label])}.items()])
            count_1 = np.array([v for _, v in {**empty_count, **Counter(obs1[label])}.items()])
            count_total = count_0 + count_1
            
            # 5. Calculate expected values for each interval.
            expected_0 = count_total*sum(count_0)/total
            expected_1 = count_total*sum(count_1)/total
  
            # 6. Calculate the chi-square terms from the observed (Step 4) and expected (Step 5) values.
            chi_ = (count_0 - expected_0)**2/expected_0 + (count_1 - expected_1)**2/expected_1
            chi_ = np.nan_to_num(chi_) # Deal with the zero counts
            
            # Finally do the summation for Chi2 and append it to list of chi values
            chi.append(sum(chi_)) 
            
        
        # 8. Once chi-square values are calculated for all adjacent pairs, find the minimum chi-square value.
        min_chi = min(chi)

        # Index of the first pair with the minimum chi-square value; this pair is merged
        min_chi_index = chi.index(min_chi)
                
        
        # 9. Create the new intervals: copy every existing interval except the pair with the minimum chi-square value.
        # Merge that pair into a single interval and add it to the new intervals.
        new_intervals = [] 
        skip = False
        done = False
        
        #Merge the intervals found at min_chi_index with next interval
        for i in range(len(intervals)):
            if skip:
                skip = False
                continue
            if i == min_chi_index and not done:  # merge this interval with the next one
                t = intervals[i] + intervals[i+1]
                new_intervals.append([min(t), max(t)])
                skip = True
                done = True
            else:
                new_intervals.append(intervals[i])
        
        #Start the chimerge with new set of merged intervals
        intervals = new_intervals
    
    # Print the split points (the lower bound of each final interval) for the given attribute
    print('\nSplit points for',attr)
    for i in intervals:
        print(i[0])
        
    #print intervals for the given attribute
    print('Intervals for', attr)
    
    for i in intervals:
        print('[', i[0], ',', i[1], ']', sep='')
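
The merge in Step 9 can also be written with list slicing instead of the `skip`/`done` flags. The sketch below is an alternative shown for illustration only; the helper name `merge_at` is hypothetical and is not used by `chimerge` above.

```python
def merge_at(intervals, k):
    """Return a new list in which intervals k and k+1 are merged into one.

    Each interval is a [low, high] pair and the list is sorted, so the merged
    interval runs from the low end of interval k to the high end of interval k+1.
    """
    merged = [intervals[k][0], intervals[k + 1][1]]
    return intervals[:k] + [merged] + intervals[k + 2:]

# Hypothetical example: merge the second and third of four single-value intervals
example = merge_at([[1.0, 1.0], [1.1, 1.1], [1.2, 1.2], [1.3, 1.3]], 1)
print(example)
```

Because the function returns a new list, the original `intervals` list is left unchanged, which matches how `new_intervals` is built in the loop above.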
        

#12. Perform chimerge and get final intervals on each feature with stopping criteria as maximum 6 intervals

In [15]:
for attr in ['sepal_l','sepal_w', 'petal_l', 'petal_w']:
    chimerge(data=iris, attr=attr, label='type', max_intervals=6)


Split points for sepal_l
4.3
4.9
5.0
5.5
5.8
7.1
Intervals for sepal_l
[4.3,4.8]
[4.9,4.9]
[5.0,5.4]
[5.5,5.7]
[5.8,7.0]
[7.1,7.9]

Split points for sepal_w
2.0
2.3
2.5
2.9
3.0
3.4
Intervals for sepal_w
[2.0,2.2]
[2.3,2.4]
[2.5,2.8]
[2.9,2.9]
[3.0,3.3]
[3.4,4.4]

Split points for petal_l
1.0
3.0
4.5
4.8
5.0
5.2
Intervals for petal_l
[1.0,1.9]
[3.0,4.4]
[4.5,4.7]
[4.8,4.9]
[5.0,5.1]
[5.2,6.9]

Split points for petal_w
0.1
1.0
1.4
1.7
1.8
1.9
Intervals for petal_w
[0.1,0.6]
[1.0,1.3]
[1.4,1.6]
[1.7,1.7]
[1.8,1.8]
[1.9,2.5]
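
Once the split points are known, each raw value can be mapped to the index of its interval. The following is a minimal sketch of one way to do that with `np.searchsorted`, using the `petal_l` split points printed above; the helper name `assign_interval` and the sample values are hypothetical.

```python
import numpy as np

# Lower bounds of the six petal_l intervals, taken from the output above
petal_l_lower_bounds = [1.0, 3.0, 4.5, 4.8, 5.0, 5.2]

def assign_interval(values, lower_bounds):
    """Return the 0-based interval index for each value."""
    # side='right' places a value equal to a lower bound in the interval that starts there
    idx = np.searchsorted(lower_bounds, values, side='right') - 1
    # Assumption: values below the first lower bound are clipped into the first interval
    return np.clip(idx, 0, len(lower_bounds) - 1)

codes = assign_interval([1.4, 4.5, 4.7, 6.0], petal_l_lower_bounds)
print(codes)
```

The resulting integer codes can be stored in a new column so that the discretized attribute replaces the continuous one in later analysis.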


In [18]:
print("Final intervals for each feature where merging is stopped at 6 and their respective split points are printed above!")

Final intervals for each feature where merging is stopped at 6 and their respective split points are printed above!
