# <center>DECISION TREE AND RANDOM FOREST</center>

#### Given all the details of a record, we want to find the best operator for a user. The models in this notebook predict that operator.
#### The decision tree algorithm has an average accuracy of 0.906437 (90.64%) over the 12 months and a total accuracy (after appending all data) of 0.89493 (89.49%).
#### The random forest algorithm has an average accuracy of 0.919886 (91.99%).
#### The total accuracy of the random forest wasn’t calculated because of computational constraints: it took nearly a day to compute the total accuracy of the decision tree, and hours to compute the random forest accuracy for even a single month.

<br> </br>
### Some points: 
#### We found the maximum depth for the decision tree and the total number of trees for the random forest by trial and error. The csv files containing "max. depth and accuracy" and "no. of trees and accuracy" are stored in the hyperparameter tuning sub-folder within the same folder, and the corresponding plots are stored in the plots folder. They are also shown in the hyperparameter_tuning IPython notebook in this folder.
#### The final reports and output showing actual class vs. classification and decision trees are in result sub-folder within the same folder.
#### Initially the algorithm was run without giving preference to any feature, but it then took a lot of time to execute even for a single month.
#### We found that if the state is fixed as the first criterion of the decision tree, each subtree only has to search a small subset of the dataset, and the algorithm becomes much faster than the original one. This reduces the running time while giving almost the same accuracy.
#### So until we match a state at some node, we recursively run the algorithm with the state name as the only feature. Once a state is matched, we run the algorithm on all the remaining features.
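
The effect of the state-first strategy described above can be illustrated with a minimal sketch. The toy rows, the column layout and the helper `candidate_split_count` are assumptions made for this illustration only; they are not part of the notebook's code. The sketch counts how many (column, value) candidates an exhaustive search would evaluate at a node, and shows how small the subset becomes once a state is matched.

```python
import numpy as np

# Toy rows: [nwrk, rt, stn, op]; all values are made up for illustration.
rows = np.array([
    ["4G", 5, "Assam", "RJio"],
    ["3G", 3, "Assam", "Airtel"],
    ["4G", 4, "Kerala", "Idea"],
    ["4G", 1, "Kerala", "Idea"],
    ["3G", 2, "Goa", "Vodafone"],
], dtype=object)

STATE_COLUMN = 2  # hypothetical index of the state column in this toy layout


def candidate_split_count(data, column_indices):
    """Count the (column, value) pairs a search over these columns would try."""
    return sum(len(np.unique(data[:, c])) for c in column_indices)


# Searching every feature versus searching the state column only.
all_features = candidate_split_count(rows, [0, 1, 2])
state_only = candidate_split_count(rows, [STATE_COLUMN])

# After a state is matched, later splits only see that state's rows.
assam_rows = rows[rows[:, STATE_COLUMN] == "Assam"]

print(all_features, state_only, len(assam_rows))
```

On real data the gap is much larger, because latitude and longitude contribute one candidate per distinct coordinate.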
<br> </br>
### The Decision Tree algorithm takes a few minutes to execute, but the Random Forest algorithm takes several hours.

# Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random                # to use random.sample while finding random columns to be used for decision tree in random forest
from pprint import pprint    # data pretty printer, used to print the decision tree in a readable form
import json                  # to store decision tree in decision_tree.txt file
import ast                   # to read decision tree stored in decision_tree.txt file
%matplotlib inline

# Defining months and a function which is used to rename columns of the dataframe for ease in further coding

In [2]:
month=["April 2018","May 2018","June 2018","July 2018","August 2018","September 2018","October 2018","November 2018","December 2018","January 2019","February 2019","March 2019"]

In [3]:
month2="April 2018" 
#this is used to print a decision tree in this file, change it if you want to print some other decision tree

In [4]:
def start(df):
    df.rename(
        columns={
            "Operator": "op",
            "In Out Travelling": "inOut",
            "Network Type": "nwrk",
            "Latitude": "lat",
            "Longitude": "long",
            "Call Drop Category": "cdc",
            "Rating": "rt",
            "State Name": "stn"
        },
        inplace=True
        )
    df=df.dropna()
    column_titles = ['inOut','nwrk','rt','cdc','lat','long','stn','op']
    df=df.reindex(columns=column_titles)
    # shifting the column "op" to the end because we are classifying with respect to it
    return df

# Defining helper functions for making Decision tree:

#### This Function checks whether the data is pure or not, i.e. all the data have the same class labels 

In [5]:
def purityCheck(data):
    
    label_column = data[:, -1]                    # getting last column, i.e., "op" (operator type)
    unique_classes = np.unique(label_column)

    # data is pure when exactly one unique class label is present
    return len(unique_classes) == 1

#### This Function classifies the data by finding which unique class appears most in data

In [6]:
def classify_data(data):
    
    label_column = data[:, -1]                         # getting last column data, i.e., "op" (operator type)
    unique_classes, counts_unique_classes = np.unique(label_column, return_counts=True)

    index = counts_unique_classes.argmax()             # finds index of the unique class appearing maximum time
    classification = unique_classes[index]             # storing unique class in classification
    
    return classification

#### This Function finds potential splits in the data, i.e. those data points which qualify for splitting data into different classes

In [7]:
def potentialSplits(data, random_subspace,counter2):
    
    potential_splits = {}                              # dictionary to store potential splits
    _, n_columns = data.shape
    column_indices = list(range(n_columns - 1))        # excluding the last column which is "op"
    
    # this if is executed when we are using random forest and 
    # it is passing the no. of columns that should be there in decision tree of random forest
    if random_subspace and random_subspace <= len(column_indices):
        column_indices = random.sample(population=column_indices, k=random_subspace)
        # finding random columns to be used for decision tree in random forest
    
    # hard-code that the state name has the highest priority and must be used first
    # (the reason is explained in the introduction); 6 is the index of "stn" in column_titles
    if counter2==0:
        column_indices=[6]
    
    for column_index in column_indices:       
        values = data[:, column_index]
        unique_values = np.unique(values)                # find all the unique values in a specified column
        
        type_of_feature = feature_types[column_index]
        
        # for continuous feature, potential splits will consist of values between two unique values
        # e.g, if we have unique values 1,3,5 then the potential splits will be (1+3)/2=2 and (3+5)/2=4
        if type_of_feature == "continuous":
            if len(unique_values)>1:
                potential_splits[column_index] = []
                for index in range(len(unique_values)):
                    if index != 0:
                        current_value = unique_values[index]
                        previous_value = unique_values[index - 1]
                        potential_split = (current_value + previous_value) / 2

                        potential_splits[column_index].append(potential_split)
        
        # for categorical feature, all the unique values will be considered as a potential split
        elif len(unique_values) > 1:                           
            potential_splits[column_index] = unique_values
    
    return potential_splits
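
As a small worked example of the midpoint rule used for continuous features, the following sketch reproduces the "1, 3, 5 gives 2 and 4" case from the comment in the function. It uses array slicing instead of the explicit loop, which is only an alternative way of writing the same computation, not the notebook's own code.

```python
import numpy as np

values = np.array([5, 1, 3, 3, 1])           # a toy continuous column with repeats
unique_values = np.unique(values)            # sorted unique values: 1, 3, 5

# Midpoint between each pair of neighbouring unique values.
midpoints = (unique_values[:-1] + unique_values[1:]) / 2

print(midpoints)
```

A column with k distinct values therefore yields k - 1 candidate thresholds, which is why latitude and longitude dominate the search cost.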

#### This function splits data into two parts using the split column and split value from potential splits

In [8]:
def dataSplitting(data, split_column, split_value):
    
    split_column_values = data[:, split_column]           # values of column with respect to which we want to split

    type_of_feature = feature_types[split_column]         # feature type for split_column
    
    # for continuous feature, data will be splitting on these criteria
    if type_of_feature == "continuous":
        data_below = data[split_column_values <= split_value]
        data_above = data[split_column_values >  split_value]
    
    # for categorical feature, data will be splitting on these criteria  
    else:
        data_below = data[split_column_values == split_value]
        data_above = data[split_column_values != split_value]
    
    return data_below, data_above
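
The two splitting rules can be tried on a toy array. The rows below are made up for illustration; the sketch assumes, as in the notebook, a mixed-type (object) NumPy array whose rows are filtered with boolean masks.

```python
import numpy as np

# Toy rows: [long, nwrk, op]; values are made up for illustration.
data = np.array([
    [72.80, "4G", "RJio"],
    [72.90, "3G", "Airtel"],
    [72.85, "4G", "RJio"],
    [73.10, "4G", "Idea"],
], dtype=object)

# Continuous feature: the "yes" branch holds rows at or below the threshold.
long_values = data[:, 0]
below = data[long_values <= 72.86]
above = data[long_values > 72.86]

# Categorical feature: the "yes" branch holds rows equal to the split value.
nwrk_values = data[:, 1]
is_4g = data[nwrk_values == "4G"]
not_4g = data[nwrk_values != "4G"]

print(len(below), len(above), len(is_4g), len(not_4g))
```

In both cases the two parts are disjoint and together contain every row, so no record is lost or duplicated by a split.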

#### This function calculates entropy for a given part of data, i.e, for the data from both the splits individually

In [9]:
def entropyCalculation(data):
    
    label_column = data[:, -1]                                  # getting last column data, i.e., "op" (operator type)
    _, counts = np.unique(label_column, return_counts=True)     # stores count of unique values individually in label column

    probabilities = counts / counts.sum()                       # finds probability of each unique label (class)
    entropy = np.sum(probabilities * -np.log2(probabilities))   # Shannon entropy (in bits) of the labels
     
    return entropy
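
To make the entropy values concrete, here is a minimal sketch that applies the same formula to three toy label lists. The helper name `entropy` and the labels are assumptions for this illustration; it takes a plain list of labels rather than the notebook's 2D array.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * -np.log2(p)))


pure = entropy(["RJio", "RJio", "RJio", "RJio"])              # one class only
two_way = entropy(["RJio", "RJio", "Airtel", "Airtel"])       # two equally likely classes
four_way = entropy(["RJio", "Airtel", "Idea", "Vodafone"])    # four equally likely classes

print(pure, two_way, four_way)
```

Pure data has zero entropy, two balanced classes give 1 bit and four balanced classes give 2 bits, so a lower value means the labels are closer to pure.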

#### This function calculates overall entropy by combining entropy from both the splits

In [10]:
def overallEntropyCalculation(data_below, data_above):
    
    n = len(data_below) + len(data_above)
    p_data_below = len(data_below) / n                    # finds probability of data below
    p_data_above = len(data_above) / n                    # finds probability of data above
    
    # overall entropy of the split
    overall_entropy =  ( p_data_below * entropyCalculation(data_below) + p_data_above * entropyCalculation(data_above) )
    
    
    return overall_entropy
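
The weighting by subset size can be checked on toy splits. This is a sketch under the assumption that labels are passed as plain lists; `entropy` and `weighted_entropy` are illustrative helper names, not functions from the notebook.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * -np.log2(p)))


def weighted_entropy(below, above):
    """Entropy of a split: each side's entropy weighted by its share of rows."""
    n = len(below) + len(above)
    return len(below) / n * entropy(below) + len(above) / n * entropy(above)


# A split that separates the classes perfectly.
clean_split = weighted_entropy(["RJio", "RJio"], ["Airtel", "Airtel"])
# A split that leaves both sides as mixed as the original data.
useless_split = weighted_entropy(["RJio", "Airtel"], ["RJio", "Airtel"])
# A pure small side and a mixed larger side.
uneven_split = weighted_entropy(["RJio", "RJio"], ["RJio", "Airtel", "Airtel", "Airtel"])

print(clean_split, useless_split, round(uneven_split, 4))
```

The search for the best split prefers the candidate with the smallest weighted value, so the clean split would be chosen over the other two.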

#### This function finds the best split, i.e. the split which minimizes overall entropy

In [11]:
def bestSplit(data, potential_splits):
    
    overall_entropy = float("inf")                 # any real entropy value will be lower than this
    for column_index in potential_splits:
        for value in potential_splits[column_index]:
            data_below, data_above = dataSplitting(data, split_column=column_index, split_value=value)    # splitting data
            current_overall_entropy = overallEntropyCalculation(data_below, data_above)         # calculating overall entropy

            # finding split with minimum overall entropy
            # that split will be the best split
            if current_overall_entropy <= overall_entropy:
                overall_entropy = current_overall_entropy
                best_split_column = column_index
                best_split_value = value
    
    return best_split_column, best_split_value

#### This function finds whether the feature is continuous or categorical

In [12]:
def featureType(df):
    
    typeOf_feature = []
    n_unique_values_threshold = 15
    for feature in df.columns:
        if feature != "op":
            unique_values = df[feature].unique()          # unique values in a feature
            example_value = unique_values[0]

            # if the values of a feature are strings or
            # the no. of unique values is at most the threshold then the feature type is categorical
            if (isinstance(example_value, str)) or (len(unique_values) <= n_unique_values_threshold):
                typeOf_feature.append("categorical")
            else:
                typeOf_feature.append("continuous")
    
    return typeOf_feature
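
The following sketch applies the same heuristic to a toy dataframe to show how each kind of column is classified. The helper `infer_feature_types` and the toy values are assumptions for illustration; the column names follow the renamed columns used in this notebook.

```python
import numpy as np
import pandas as pd


def infer_feature_types(df, label="op", threshold=15):
    """Label each non-target column as categorical or continuous."""
    types = []
    for col in df.columns.drop(label):
        uniques = df[col].unique()
        is_categorical = isinstance(uniques[0], str) or len(uniques) <= threshold
        types.append("categorical" if is_categorical else "continuous")
    return types


toy = pd.DataFrame({
    "nwrk": ["4G", "3G"] * 10,               # strings
    "rt": [1, 2, 3, 4, 5] * 4,               # numeric, but only 5 distinct values
    "lat": np.linspace(18.0, 19.9, 20),      # numeric with 20 distinct values
    "op": ["RJio", "Airtel"] * 10,           # target column, skipped
})

types = infer_feature_types(toy)
print(types)
```

Note that a numeric column with few distinct values, such as the rating, is treated as categorical, which is why the trees contain questions like `rt = 5` rather than `rt <= 4.5`.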

# Main decision tree algorithm

In [13]:
def decisionTree(df, counter2 , counter=0, min_samples=2, max_depth=None, random_subspace=None):
    
    # data preparations
    if counter == 0:
        global column_headers, feature_types        # made global because we are using them outside this function also
        
        column_headers = df.columns                 # stores column names of the dataframe 
                                                    # because we are converting that into numpy array
        feature_types = featureType(df)
        
        data = df.values                            # converting into numpy 2D array
        
    else:
        data = df                                   # when counter is not 0, then df already in numpy array
    
    
    # base cases
    # classify data when :
                           # 1: It is pure (same class for all data points)                   
                           # 2: length of remaining data is less than min_samples
                           #    (to remove unnecessary computations for very minimal change in accuracy)
                           # 3: The tree's depth has reached max_depth
                           #    (to remove unnecessary computations for very minimal change in accuracy)
    if (purityCheck(data)) or (len(data) < min_samples) or (counter == max_depth):
        classification = classify_data(data)
        
        return classification

    
    # recursive part
    else:    
        counter += 1                                # increase counter at every increasing depth    

        potential_splits = potentialSplits(data, random_subspace,counter2)   # finding potential splits in data
        
        if len(potential_splits)==0:                        # if we cannot find any potential split then we classify the data
            classification = classify_data(data)
            return classification
        
        split_column, split_value = bestSplit(data, potential_splits)        # finding the best split among all potential splits
        data_below, data_above = dataSplitting(data, split_column, split_value)     # splitting the data

        
        # determine question
        feature_name = column_headers[split_column]           # storing the feature name of the column which contains best split
        type_of_feature = feature_types[split_column]         # finding whether that feature is continuous or categorical
        
        # framing questions (nodes of the decision tree) based on the type of feature
        if type_of_feature == "continuous":                   
            question = "{} <= {}".format(feature_name, split_value)     # continuous feature
            
        else:
            question = "{} = {}".format(feature_name, split_value)      # categorical feature
        
        # instantiating sub-tree
        sub_tree = {question: []}
        
        # finding answers (recursion)
        
        # as explained in the potentialSplits function, counter2 is used to hard-code that the state name
        # has the highest priority and must be used first (the reason is explained in the introduction)
        # counter2=0 means we are at the root node or have only followed "no" branches so far,
        # i.e., no state has been matched yet
        if counter2==0:
            yes_answer = decisionTree(data_below, 1, counter, min_samples, max_depth, random_subspace)
            no_answer = decisionTree(data_above, 0, counter, min_samples, max_depth, random_subspace)
            
        # counter2=1 means we have matched with some state, i.e., 
        # we have entered the yes part of the subtree and now there is no need to hard-code the state part
        else:
            yes_answer = decisionTree(data_below, 1, counter, min_samples, max_depth, random_subspace)
            no_answer = decisionTree(data_above, 1, counter, min_samples, max_depth, random_subspace)
        
        # If the answers are the same, then sub_tree is a single answer
        if yes_answer == no_answer:
            sub_tree = yes_answer
        
        # otherwise subtree will append both the answers to the sub_tree (python dictionary)
        else:
            sub_tree[question].append(yes_answer)
            sub_tree[question].append(no_answer)
        
        return sub_tree   # will return subtree at every recursion, but at the outermost recursion, it will be the whole tree
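
The tree returned by this function is a nested dictionary of the form `{question: [yes_answer, no_answer]}`, where each answer is either a class label or another such dictionary. The sketch below builds a small hand-written tree of that shape (the questions and labels are made up) and walks it with two hypothetical helpers, `tree_depth` and `count_leaves`. It also checks that the structure survives a JSON round trip, since the notebook stores its trees with `json.dumps`.

```python
import json

# Hand-written toy tree in the {question: [yes_answer, no_answer]} shape.
toy_tree = {
    "stn = Assam": [
        {"nwrk = 4G": ["RJio", {"long <= 92.5": ["Airtel", "RJio"]}]},
        "Vodafone",
    ]
}


def tree_depth(node):
    """Number of questions on the longest path from this node to a leaf."""
    if not isinstance(node, dict):
        return 0
    (branches,) = node.values()          # each node holds exactly one question
    return 1 + max(tree_depth(branch) for branch in branches)


def count_leaves(node):
    """Number of class labels stored at the leaves below this node."""
    if not isinstance(node, dict):
        return 1
    (branches,) = node.values()
    return sum(count_leaves(branch) for branch in branches)


depth = tree_depth(toy_tree)
leaves = count_leaves(toy_tree)
round_trip = json.loads(json.dumps(toy_tree))

print(depth, leaves, round_trip == toy_tree)
```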

#### This function is used for classification of the testing dataset using the decision tree we already got

In [14]:
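def classifyData(example, tree):
    if not isinstance(tree, dict):   # the whole tree collapsed to a single class label, so return it directly
        return tree
    question = list(tree.keys())[0]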
    A = question.split()             
    # as question is made of feature name, operator and feature value, so we are dividing it into 3 parts
    # to extract all those 3 information
    feature_name, comparison_operator=A[0],A[1]
    
    # looping because feature value can be of more than 1 words, e.g. state name can be "Uttar Pradesh"
    value=""
    for i in range(2,len(A)):
        if i==2:
            value+=A[i]
        else:
            value+=" "+A[i]


    if comparison_operator == "<=":                     # feature is continuous
        if example[feature_name] <= float(value):
            answer = tree[question][0]                  # as yes_answer was appended first
        else:
            answer = tree[question][1]                  # as no_answer was appended after that

    else:                                               # feature is categorical
        if str(example[feature_name]) == value:
            answer = tree[question][0]                  # as yes_answer was appended first
        else:
            answer = tree[question][1]                  # as no_answer was appended after that

    # base case
    if not isinstance(answer, dict):                    # if answer is not a dictionary then it is a single value
        return answer                                   # which is the direct answer (leaf of the sub tree)
                                                        # so return that and it will be the classification of that entry (row)
    
    # recursive part
    else:
        residual_tree = answer
        return classifyData(example, residual_tree)          # otherwise recursively do this till we reach the leaf of the tree
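
The question parsing at the top of this function can also be written with a bounded split, which keeps multi-word values such as "Uttar Pradesh" intact without a loop. This is an alternative sketch, not the notebook's code; `parse_question` is a hypothetical helper, and it assumes the question was built with single spaces around the operator, as the format strings in `decisionTree` do.

```python
def parse_question(question):
    """Split 'feature operator value' into its three parts.

    maxsplit=2 stops after the operator, so the value keeps its inner spaces.
    """
    feature_name, comparison_operator, value = question.split(" ", 2)
    return feature_name, comparison_operator, value


categorical = parse_question("stn = Uttar Pradesh")
continuous = parse_question("long <= 72.86281237")

print(categorical)
print(continuous)
```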

#### This function calculates accuracy by comparing actual class and classification we got

In [15]:
def calc_accuracy(df, tree):

    df["classification"] = df.apply(classifyData, axis=1, args=(tree,))    # calling classifyData for all rows of the dataframe
    df["classification_correct"] = df["classification"] == df["op"]        # stores True if the row is correctly classified
    
    # mean of classification correct will be the accuracy as accuracy=(correct classified)/(total entries)
    accuracy = df["classification_correct"].mean()                         
    
    return accuracy
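
The accuracy computation relies on the fact that the mean of a boolean column is the fraction of `True` values. A minimal sketch on made-up predictions (the labels are illustrative only):

```python
import pandas as pd

# Toy actual labels and predictions; values are made up for illustration.
results = pd.DataFrame({
    "op":             ["RJio", "Airtel", "Idea", "RJio", "Vodafone"],
    "classification": ["RJio", "Airtel", "RJio", "RJio", "Vodafone"],
})

correct = results["classification"] == results["op"]   # boolean Series
accuracy = correct.mean()                               # fraction of True values

print(int(correct.sum()), "of", len(results), "correct, accuracy =", accuracy)
```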

# Running the algorithm for all the data over 12 months

In [31]:
sumAccuracy=0

for i in month:
    print("month:{}...".format(i))
    df = pd.read_csv('../../data/{}/Training_Data_Set.csv'.format(i))
    df2=pd.read_csv('../../data/{}/Testing_Data_Set.csv'.format(i))
    
    df=start(df)                         # function which is used to rename columns of the dataframe for ease in further coding
    df2=start(df2)

    tree = decisionTree(df, 0,max_depth=40)     # max depth of 40 is decided by hit and trial (the max depth which
                                                # gives maximum accuracy). Results of which are stored in separate file
    accuracy=calc_accuracy(df2,tree)
    print("accuracy:{}\n".format(accuracy))
    
    sumAccuracy+=accuracy                       # adding all accuracies to find average accuracy later on
        
    with open('./result/output/{}/decision_tree.txt'.format(i), 'w') as file:
        file.write(json.dumps(tree))                                      # storing decision tree in a txt file
        
    DF2 = pd.DataFrame()
    DF2['Actual operator']=df2['op']
    DF2['Classification']=df2['classification']
    DF2.to_csv('./result/output/{}/output.csv'.format(i),index=False)           # storing output of the testing dataset
    
    a=[[i,accuracy]]
    DF = pd.DataFrame(a, columns = ["Month", "Accuracy"])
                                                                          # storing report for all the months
    if i=="April 2018":
        DF.to_csv('./result/report_decision_tree.csv',index=False,header=True)          
    else:
        DF.to_csv('./result/report_decision_tree.csv',index=False,mode='a',header=False)

sumAccuracy/=len(month) 
a=[["Average over 12 months",sumAccuracy]]
DF = pd.DataFrame(a, columns = ["Month", "Accuracy"])
DF.to_csv('./result/report_decision_tree.csv',index=False,mode='a',header=False)

month:April 2018...
accuracy:0.9231703087062089

month:May 2018...
accuracy:0.8992654774396642

month:June 2018...
accuracy:0.9263634892377407

month:July 2018...
accuracy:0.9029441971927422

month:August 2018...
accuracy:0.9109613338063143

month:September 2018...
accuracy:0.9167774086378737

month:October 2018...
accuracy:0.8998756218905473

month:November 2018...
accuracy:0.9102458340207886

month:December 2018...
accuracy:0.9024709302325581

month:January 2019...
accuracy:0.9061967026719727

month:February 2019...
accuracy:0.887459807073955

month:March 2019...
accuracy:0.8915094339622641
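
The average reported in the introduction can be recomputed from the monthly accuracies printed above. The sketch below only copies those printed values into a dictionary (the variable names are chosen for illustration) and averages them.

```python
# Monthly decision tree accuracies, copied from the output above.
monthly_accuracy = {
    "April 2018": 0.9231703087062089,
    "May 2018": 0.8992654774396642,
    "June 2018": 0.9263634892377407,
    "July 2018": 0.9029441971927422,
    "August 2018": 0.9109613338063143,
    "September 2018": 0.9167774086378737,
    "October 2018": 0.8998756218905473,
    "November 2018": 0.9102458340207886,
    "December 2018": 0.9024709302325581,
    "January 2019": 0.9061967026719727,
    "February 2019": 0.887459807073955,
    "March 2019": 0.8915094339622641,
}

average_accuracy = sum(monthly_accuracy.values()) / len(monthly_accuracy)
best_month = max(monthly_accuracy, key=monthly_accuracy.get)
worst_month = min(monthly_accuracy, key=monthly_accuracy.get)

print(round(average_accuracy, 6), best_month, worst_month)
```

The result agrees with the 0.906437 average quoted at the top of the notebook; June 2018 is the best month and February 2019 the worst.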



#### Printing the decision tree stored in the decision_tree.txt file

In [8]:
#reading the decision tree saved
val = month2   # change month2 to visualize some other decision tree 
with open('./result/output/{}/decision_tree.txt'.format(val),"r") as file:   # the file is closed automatically
    contents = file.read()                                       # reading contents of file
dictionary = ast.literal_eval(contents)         # converting the data read into dictionary as our decision tree is a dictionary

print("Decision tree for ",val,":\n")
pprint(dictionary)                                               # using pprint to print prettier decision tree


Decision tree for  April 2018 :

{'stn = Maharashtra': [{'nwrk = 4G': [{'long <= 72.86281237': [{'lat <= 19.170561515000003': [{'rt = 5': [{'inOut = Indoor': [{'long <= 72.82295730499999': [{'long <= 72.82190324': ['RJio',
                                                                                                                                                                                      'Airtel']},
                                                                                                                                                             {'long <= 72.84645653999999': ['RJio',
                                                                                                                                                                                            {'long <= 72.84693654': ['Airtel',
                                                                                                                                                                    

                                                                                                                                                                                                                                                            'RJio']},
                                                                                                                                                                                                                                   'Vodafone']},
                                                                                                                                                                                                                      {'lat <= 19.054916235': [{'lat <= 19.034172809999998': ['Airtel',
                                                                                                                                                                                                                         

                                                                                                                                                                                                                                                             {'cdc = 2': [{'long <= 73.87641833': ['Idea',
                                                                                                                                                                                                                                                                                                   {'lat <= 18.466023200000002': [{'long <= 73.88140045': ['Idea',
                                                                                                                                                                                                                                                                                                                                                          

                                                                                                                                                           {'long <= 72.87905015000001': [{'lat <= 19.18813478': [{'lat <= 19.15320063': [{'long <= 72.872458545': [{'rt = 3': [{'long <= 72.871942845': ['Airtel',
                                                                                                                                                                                                                                                                                                          'Vodafone']},
                                                                                                                                                                                                                                                                                'Vodafone']},
                                                                                              

                                                                                                                                                     {'nwrk = 3G': [{'long <= 73.07597864': [{'lat <= 19.075248655': [{'inOut = Travelling': [{'lat <= 19.013546124999998': [{'lat <= 18.98251329': ['Vodafone',
                                                                                                                                                                                                                                                                                                     'Airtel']},
                                                                                                                                                                                                                                                                             'Vodafone']},
                                                                                                           

                                                                                                                                                                                                                          {'lat <= 18.853431434999997': [{'long <= 73.76709629000001': [{'long <= 73.36636555': ['Vodafone',
                                                                                                                                                                                                                                                                                                                 'Airtel']},
                                                                                                                                                                                                                                                                                        'Vodafone']},
                                                                        

                                                                                                           {'lat <= 22.22664043': [{'long <= 84.81515454999999': ['Airtel',
                                                                                                                                                                  {'lat <= 22.226614705': [{'lat <= 22.225510415000002': ['Airtel',
                                                                                                                                                                                                                          {'long <= 84.83512345000003': ['RJio',
                                                                                                                                                                                                                                                         {'long <= 84.83513734500002': ['Airtel',
                                                     

                                                                                                                                                                                                                                                                      'Airtel']}]}]},
                                                                                                                                                                                         {'lat <= 22.57247437': [{'lat <= 22.336092665000002': ['RJio',
                                                                                                                                                                                                                                                {'cdc = 1': ['RJio',
                                                                                                                                                                                                                     

                                                                {'stn = Assam': [{'nwrk = 4G': [{'long <= 91.809988175': ['RJio',
                                                                                                                          {'long <= 92.547243675': [{'lat <= 26.1524127': ['Airtel',
                                                                                                                                                                           'RJio']},
                                                                                                                                                    'RJio']}]},
                                                                                                {'long <= 92.28031103500001': [{'long <= 91.76047471999999': [{'long <= 90.751708495': ['Vodafone',
                                                                                                                                                        

                                                                                                   {'stn = Uttar Pradesh': [{'nwrk = 4G': [{'long <= 77.39151165': [{'lat <= 28.538955715': [{'long <= 77.26355001499999': ['RJio',
                                                                                                                                                                                                                            {'rt = 1': [{'long <= 77.32864785000001': ['Vodafone',
                                                                                                                                                                                                                                                                       'Airtel']},
                                                                                                                                                                                                                              

                                                                                                                                                                                                                                                    {'long <= 77.282384685': [{'rt = 3': [{'long <= 77.25986871500001': ['Airtel',
                                                                                                                                                                                                                                                                                                                         'Idea']},
                                                                                                                                                                                                                                                                                          {'long <= 77.26828085000001': [{'long <= 77.26734585000001': ['Vodafone'

                                                                                                                            {'stn = Madhya Pradesh': [{'nwrk = 4G': [{'rt = 4': [{'long <= 77.40761294500001': [{'long <= 75.911926675': [{'long <= 75.891969325': ['RJio',
                                                                                                                                                                                                                                                                    'Airtel']},
                                                                                                                                                                                                                                          'RJio']},
                                                                                                                                                                                                                {'lat <=

                                                                                                                                                      {'stn = Telangana': [{'nwrk = 4G': [{'lat <= 17.399138649999998': [{'lat <= 17.38386419': [{'long <= 78.42860046500002': [{'lat <= 17.350959664999998': [{'long <= 78.225549625': ['Airtel',
                                                                                                                                                                                                                                                                                                                                         'RJio']},
                                                                                                                                                                                                                                                                                                               {'long <= 78.386135

                                                                                                                                                                                                                                                                                                                                         {'lat <= 17.57800143': ['Airtel',
                                                                                                                                                                                                                                                                                                                                                                 'RJio']}]}]}]}]},
                                                                                                                                                                                                                                      {'long <= 78.31644291': [{'lat <= 17

                                                                                                                                                                           {'stn = Andhra Pradesh': [{'nwrk = 4G': [{'rt = 1': [{'lat <= 16.49892665': [{'long <= 79.911073735': [{'long <= 79.85291555': [{'lat <= 15.116440224999998': [{'cdc = 2': [{'long <= 79.09447362': ['RJio',
                                                                                                                                                                                                                                                                                                                                                                                'Airtel']},
                                                                                                                                                                                                                                                    

                                                                                                                                                                                                     {'stn = Rajasthan': [{'nwrk = 4G': [{'long <= 71.71849994': ['Vodafone',
                                                                                                                                                                                                                                                                  {'long <= 75.77650259': [{'rt = 5': [{'inOut = Outdoor': ['Idea',
                                                                                                                                                                                                                                                                                                                            {'long <= 75.737941605': ['RJio',
                                                        

                                                                                                                                                                                                                          {'stn = NCT': [{'nwrk = 4G': [{'long <= 77.015002055': [{'rt = 5': [{'long <= 76.9996235': ['Airtel',
                                                                                                                                                                                                                                                                                                                      {'long <= 77.00331809': ['RJio',
                                                                                                                                                                                                                                                                                                                                               'A

                                                                                                                                                                                                                                                                                                                                                                                                        'Airtel']},
                                                                                                                                                                                                                                                                                                                                                                               'Vodafone']},
                                                                                                                                                                                                                       

                                                                                                                                                                                                                                                                          {'stn = Bihar': [{'nwrk = 4G': [{'lat <= 25.621865275': [{'lat <= 25.60879025': [{'lat <= 25.41470451': [{'long <= 84.97906': ['RJio',
                                                                                                                                                                                                                                                                                                                                                                                                         {'rt = 5': ['Airtel',
                                                                                                                                                                                        

                                                                                                                                                                                                                                                                                                                                                                                                                                                                         {'long <= 80.16252937499999': ['RJio',
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        'Airtel']}]},
  

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          'BSNL']}]}]}]},
                                                                                                                                                                                                                                                                                                                                                                                                          {'lat <= 10.769333845': ['Vodafone',
                                                                                               

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               {'long <= 73.205158825': [{'long <= 73.190577765': [{'long <= 73.179863545': [{'long <= 73.179725055': ['Vodafone',
                                                                                                                                                                                                                                                                                     

                                                                                                                                                                                                                                                                                                                                    {'stn = Himachal Pradesh': [{'nwrk = 4G': [{'long <= 77.170308375': [{'lat <= 31.107605805': ['RJio',
                                                                                                                                                                                                                                                                                                                                                                                                                                  {'inOut = Indoor': [{'long <= 76.936791285': [{'lat <= 32.109846024999996': ['RJio',
                                                                       

                                                                                                                                                                                                                                                                                                                                                                                     {'stn = Chhattisgarh': [{'nwrk = 4G': [{'lat <= 21.26933716': [{'lat <= 19.119154955': ['RJio',
                                                                                                                                                                                                                                                                                                                                                                                                                                                                             {'lat <= 21.102023205000002': [{'cdc = 0': [{'inOut = Outdoor': [{'lon

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  'RJio']}]},
                                                                                                                                                                                                                                                                                                          

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   'Airtel']}]}]}]},
                                                                                                                                                                                                                                                                                                                                                                   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         {'long <= 79.93524538499999': [{'cdc = 0': [{'long <= 79.63757794999998': [{'long <= 79.621742305': ['Airtel',
                                                                                                                                                                                                                                                                                                                                                                

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    'RJio']}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}]}


# Making a random forest using the above functions:

#### This function takes the whole dataset as input and returns a random sample of n rows, drawn with replacement

In [36]:
def bootstrap(df, n):
    bootstrap_indices = np.random.randint(low=0, high=len(df), size=n)       
    # finding random n indices between 0 and length of the dataframe
    
    df_bootstrapped = df.iloc[bootstrap_indices]           # dataframe at those indices
    
    return df_bootstrapped
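
Because `np.random.randint` draws indices independently, the same row can be picked more than once, which is what makes this a bootstrap sample. A minimal sketch on a toy dataframe (the `toy` data and the fixed seed are assumptions made only for illustration):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed only so the illustration is repeatable

# Hypothetical five-row dataset standing in for the real training data
toy = pd.DataFrame({"op": ["Airtel", "RJio", "Idea", "Vodafone", "BSNL"]})

# Same two steps as bootstrap(): random positions, then the rows at those positions
indices = np.random.randint(low=0, high=len(toy), size=8)
sample = toy.iloc[indices]

print(sample)
# Drawing 8 rows from only 5 forces repeats, so some index labels appear twice
print(sample.index.duplicated().any())
```

In the notebook, `n_bootstrap` is set to 60% of the training rows, so each tree sees a different, partially repeated subset of the data.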

#### This function calls the classifyData function created earlier to classify every row of the testing data with a single decision tree

In [37]:
def predictions_decisionTree(df, tree):
    # storing predictions of the whole test data for a given decision tree
    predictions = df.apply(classifyData, args=(tree,), axis=1)         
    return predictions
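
To illustrate how a row-wise `apply` walks a nested-dictionary tree like the ones printed above, here is a minimal sketch. `classify_row` is a hypothetical stand-in, not the notebook's `classifyData`; it assumes each question has the form `feature <= value` or `feature = value`, and that the first list element is the "yes" branch and the second the "no" branch.

```python
import pandas as pd

def classify_row(row, tree):
    # Hypothetical helper: assumes {question: [yes_branch, no_branch]} nodes
    question = next(iter(tree))
    feature, op, value = question.split(" ", 2)
    if op == "<=":
        branch = 0 if row[feature] <= float(value) else 1
    else:
        branch = 0 if str(row[feature]) == value else 1
    answer = tree[question][branch]
    # A dict means another question; anything else is a leaf (operator name)
    return classify_row(row, answer) if isinstance(answer, dict) else answer

# Toy tree and test rows, invented for illustration
toy_tree = {"stn = NCT": [{"long <= 77.0": ["Airtel", "RJio"]}, "Vodafone"]}
toy_test = pd.DataFrame({"stn": ["NCT", "NCT", "Bihar"],
                         "long": [76.9, 77.2, 85.0]})

# Same call pattern as predictions_decisionTree: one prediction per row
toy_predictions = toy_test.apply(classify_row, args=(toy_tree,), axis=1)
print(toy_predictions.tolist())
```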

## Main Random Forest Algorithm

In [38]:
def randomForest(df, n_trees, n_bootstrap, n_features, dt_max_depth):
    forest = []                                    # will append all the trees into this random forest
    for i in range(n_trees):
        df_bootstrapped = bootstrap(df, n_bootstrap)      # bootstrapping the dataframe (finding random n_bootstrap data points)
        tree = decisionTree(df_bootstrapped,0, max_depth=dt_max_depth, random_subspace=n_features)
        # calling decisionTree with the bootstrapped dataframe and a random subspace of n_features features
        
        forest.append(tree)                              # appending the tree into the forest
    
    return forest

#### This function calls the predictions_decisionTree function for every tree in the forest and takes the majority vote as the random forest's classification of the testing data

In [39]:
def predictions_randomForest(df, forest):
    df_predictions = {}                  # initializing df_predictions. It will later be converted into a dataframe
    for i in range(len(forest)):
        column_name = "tree_{}".format(i)
        predictions = predictions_decisionTree(df, tree=forest[i])     
        # finding predictions for different decision trees at each iteration
        
        df_predictions[column_name] = predictions         # storing them

    df_predictions = pd.DataFrame(df_predictions)
    predictions_randomForest = df_predictions.mode(axis=1)[0]     
    # finding mode of all predictions over all decision trees. It will be prediction of the random forest
    
    df_predictions["Random Forest"]=predictions_randomForest     # storing it into dataframe
    
    return df_predictions
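
The majority vote relies on `DataFrame.mode(axis=1)`, which returns the most frequent value in each row. A minimal sketch with three hypothetical trees (the `votes` table is invented for illustration):

```python
import pandas as pd

# One column per tree, one row per test sample (toy values)
votes = pd.DataFrame({
    "tree_0": ["Airtel", "RJio", "Idea", "RJio"],
    "tree_1": ["Airtel", "Idea", "Idea", "Airtel"],
    "tree_2": ["RJio",   "RJio", "Airtel", "Idea"],
})

# Column 0 of the row-wise mode holds the winning class for each row
forest_vote = votes.mode(axis=1)[0]
print(forest_vote.tolist())
```

In the last row every tree disagrees. pandas returns all tied modes in sorted order, so taking column `0` silently picks the alphabetically first operator. With 96 trees exact ties are probably rare, but this tie-breaking rule is worth keeping in mind.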

#### This function calculates accuracy by comparing the actual class with the predicted class

In [40]:
def calc_accuracy2(predictions, labels):
    predictions_correct = predictions == labels      # True where the prediction matches the label
    
    accuracy = predictions_correct.mean()
    # the mean of the True/False values is the accuracy: (correctly classified) / (total entries)
    
    return accuracy
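
As a quick check of the arithmetic, the mean of a boolean Series counts `True` as 1 and `False` as 0. A minimal sketch with made-up predictions and labels:

```python
import pandas as pd

# Toy values for illustration only
predicted = pd.Series(["Airtel", "RJio", "Idea", "RJio"])
actual = pd.Series(["Airtel", "RJio", "Vodafone", "RJio"])

correct = predicted == actual      # element-wise comparison: True, True, False, True
accuracy = correct.mean()          # 3 correct out of 4
print(accuracy)
```

The comparison requires both Series to carry the same index labels. That holds in the notebook because the predictions are produced by `apply` on the testing dataframe itself.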

# Running the Random Forest algorithm for all the data over 12 months

In [52]:
sumAccuracy = 0                          # running total of the monthly accuracies, used for the average at the end
for i in month:
    print("Month:{}...".format(i))
    df = pd.read_csv('../../data/{}/Training_Data_Set.csv'.format(i))
    df2 = pd.read_csv('../../data/{}/Testing_Data_Set.csv'.format(i))
    
    df = start(df)                       # start() renames the columns of the dataframe for ease in further coding
    df2 = start(df2)

    # forest will store all the decision trees
    # 96 trees gave the best accuracy during trial-and-error tuning
    forest = randomForest(df, n_trees=96, n_bootstrap=int(0.6*len(df)), n_features=int(len(df.axes[1])/2), dt_max_depth=40)
    
    predictions = predictions_randomForest(df2, forest)        # finding predictions on testing data
    print("Forest created and stored")
          
    for j in range(len(forest)):                                    # storing all decision trees in txt files
        with open('./result/output/{}/random forest/decision_tree_{}.txt'.format(i,j), 'w') as file:
            file.write(json.dumps(forest[j]))                         
      
    accuracy = calc_accuracy2(predictions["Random Forest"], df2["op"])       # calculating accuracy
    sumAccuracy += accuracy                                             # accumulating for the 12-month average
    print("Accuracy= {}\n".format(accuracy))
    
    DF2 = pd.DataFrame()                                                # storing output of the testing dataset
    DF2['Actual operator'] = df2['op']
    DF2['Classification using random forest'] = predictions["Random Forest"]
    for j in predictions:                                               # one column per individual tree
        if j != "Random Forest":
            DF2[j] = predictions[j]
    DF2.to_csv('./result/output/{}/random forest/output.csv'.format(i), index=False)
    
    a = [[i, accuracy]]                                                 # storing result of all the months
    DF = pd.DataFrame(a, columns=["Month", "Accuracy"])
    if i == "April 2018":                                               # first month creates the report with a header
        DF.to_csv('./result/report_random_forest.csv', index=False, header=True)
    else:                                                               # later months are appended to it
        DF.to_csv('./result/report_random_forest.csv', index=False, mode='a', header=False)
          
sumAccuracy /= len(month)                                               # average accuracy over all months
a = [["Average over 12 months", sumAccuracy]]
DF = pd.DataFrame(a, columns=["Month", "Accuracy"])
DF.to_csv('./result/report_random_forest.csv', index=False, mode='a', header=False)

Month:April 2018...
Forest created and stored
Accuracy= 0.9290669441553937

Month:May 2018...
Forest created and stored
Accuracy= 0.9094088842252536

Month:June 2018...
Forest created and stored
Accuracy= 0.9302476128823434

Month:July 2018...
Forest created and stored
Accuracy= 0.9108182129407737

Month:August 2018...
Forest created and stored
Accuracy= 0.9184107839659453

Month:September 2018...
Forest created and stored
Accuracy= 0.922093023255814

Month:October 2018...
Forest created and stored
Accuracy= 0.9160447761194029

Month:November 2018...
Forest created and stored
Accuracy= 0.9224550404223726

Month:December 2018...
Forest created and stored
Accuracy= 0.9142441860465116

Month:January 2019...
Forest created and stored
Accuracy= 0.915861284820921

Month:February 2019...
Forest created and stored
Accuracy= 0.905144694533762

Month:March 2019...
Forest created and stored
Accuracy= 0.9386792452830188

