# The University of Melbourne, School of Computing and Information Systems
# COMP90049 Introduction to Machine Learning, 2020 Semester 1
-----
## Project 1: Understanding Student Success with Naive Bayes
-----
###### Student Name: Benjamin De Worsop
###### Python version: Python 3.6.8
###### Submission deadline: 11am, Wed 22 Apr 2019

This iPython notebook is a template which you will use for your Project 1 submission. 

Marking will be applied on the five functions that are defined in this notebook, and to your responses to the questions at the end of this notebook.

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find. 


# Question 1 is found below
## 1) Part A
##### Explain the ‘naive’ assumption underlying Naive Bayes. (1) Why is it necessary? (2) Why can it be problematic? Link your discussion to the features of the students data set. [no programming required]

The naive assumption in the Naive Bayes algorithm is that the features are all conditionally independent of each other given the class. This is required so that Bayes' rule can be used to compute the probabilities of combinations of feature values that haven't been seen before. This can be seen in the example:

p(A+ | internet, absences) = p(internet, absences | A+) * p(A+) / p(internet, absences)

It is very difficult to find these probabilities. Instead, we can assume independence to get this equation:

p(A+ | internet, absences) = p(internet | A+) * p(absences | A+) * p(A+) / (p(internet) * p(absences))

This is much easier to solve because the individual terms are easier to estimate: there is usually far more data on how often a single feature value co-occurs with a class than on how often many feature values occur together. The assumption is therefore necessary both to score instances whose combination of feature values has never been seen before and to let more samples inform each probability estimate.

Even though it retains some predictive power, this assumption yields only an approximation of the desired probability. The approximation gets worse the less the assumption holds (i.e. when the features *are* correlated). In the student data set, there are many features that we would expect to be correlated, such as parents' jobs/education, attendance of extra paid classes and address, which are all linked to wealth. Treating these features as probabilistically independent may be flawed, leading to less accurate results.
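The factorisation above can be illustrated with a minimal sketch. The toy rows, the two features and the helper name `naive_bayes_scores` are all hypothetical and are not taken from student.csv; the sketch only shows how per-feature counts let us score a feature combination that never appeared together with a class in training.

```python
from collections import Counter

# Hypothetical toy training rows: (internet, absences, grade)
rows = [
    ("yes", "low", "A"),
    ("yes", "low", "A"),
    ("yes", "high", "D"),
    ("no", "high", "D"),
    ("no", "low", "D"),
]

def naive_bayes_scores(rows, instance):
    """Return the unnormalised score p(grade) * prod_i p(feature_i | grade) per grade."""
    grade_counts = Counter(row[-1] for row in rows)

    # feature_counts[(feature_index, value, grade)] = number of matching rows
    feature_counts = Counter()
    for row in rows:
        grade = row[-1]
        for i, value in enumerate(row[:-1]):
            feature_counts[(i, value, grade)] += 1

    scores = {}
    for grade, n_grade in grade_counts.items():
        score = n_grade / len(rows)  # prior p(grade)
        for i, value in enumerate(instance):
            # per-feature likelihood p(value | grade), estimated independently
            score *= feature_counts[(i, value, grade)] / n_grade
        scores[grade] = score
    return scores

# ("yes", "low") never occurs together with grade "D" in the toy rows,
# yet "D" still receives a non-zero score from the per-feature counts.
scores = naive_bayes_scores(rows, ("yes", "low"))
print(scores)
```

Estimating p(internet, absences | D) directly would give zero for this instance, whereas the naive factorisation can still rank the grades.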


## 1) Part B


In [343]:
import pandas as pd
path = "student.csv"

# This function should open a data file in csv, and transform it into a usable format 
def load_data(path):
    data = pd.read_csv(path)
    return data

rawData = load_data(path)

In [390]:
from sklearn.model_selection import train_test_split

# This function should split a data set into a training set and hold-out test set
def split_data(rawData):
    return train_test_split(rawData, test_size=0.95)

trainData, testData = split_data(rawData)

In [394]:
import pandas as pd
featureFrames = None
GRADES = ["A+", "A","B","C","D","F"]


def make_grade_struct():
    #Create the grades df manually inserting them as column/row names
    featureFrames[len(list(rawData))-1] = pd.DataFrame(columns=GRADES, index=GRADES)

def make_data_structs():
    global featureFrames
    featureFrames = []
    paramNames = []

    #For each feature
    for feature in list(rawData):
        
        #Find all the names of the parameters by iterating through all data
        for j in range(len(rawData)):
            if str(rawData[feature][j]) not in paramNames:
                paramNames.append(str(rawData[feature][j]))        
        
        #Make a dataframe to store frequency information with format:     
        #     A+    A    B    C    D    F
        # T  NaN  NaN  NaN  NaN  NaN  NaN
        # A  NaN  NaN  NaN  NaN  NaN  NaN
        
        df = pd.DataFrame(columns=GRADES, index=paramNames)
        featureFrames.append(df.copy())

        #Reset paramNames 
        paramNames = []
        
    #In case we don't see all the grade values, do this manually 
    make_grade_struct()    
    
def count_param_freq(data):
    #For every feature
    for i, feature in enumerate(list(data)):
        
        #Go and count the number of times each parameter resulted in what grade
        for j in range(len(data)):

            #Define the parameter feature, grade, and current count in data struct
            param = str(data[feature].iloc[j])
            grade = data["Grade"].iloc[j]
            freqCount = featureFrames[i].loc[param, grade]
        
            #Count the number of times that this parameter is seen.
            #.loc writes to the frame itself; chained indexing
            #(df[grade][param] = ...) may silently write to a copy
            if pd.isnull(freqCount):
                featureFrames[i].loc[param, grade] = 1
            else:
                featureFrames[i].loc[param, grade] += 1
                
                
#         print(feature)
#         print(featureFrames[i])
#         print()
  
            
 

# This function should build a supervised NB model
def train(data):
    make_data_structs()
    count_param_freq(data)
    return


train(trainData)

In [399]:
import operator
import random

eps = None
totTrainingRows = None


def find_probability(testGrade, testInst):

    #P(label|params) = const * p(param1|label) * p(param2|label) * ... * p(label)
    probability = 1

    #The number of times this grade has been counted in training.
    #Looked up once, before the loop, so it is always defined for the prior below
    gradeFreq = featureFrames[len(featureFrames)-1][testGrade][testGrade]
    if pd.isnull(gradeFreq):
        #Grade never seen in training: every likelihood and the prior fall back to eps
        return eps ** len(list(rawData))

    #For every feature name, find the probability and update main prob
    for i, label in enumerate(rawData):
        #Don't use the "Grade" column for the instance for predictions
        #otherwise you're cheating!
        if i == len(list(rawData))-1:
            break

        # p(param1|label) = numberParam1|label / numberOfLabel
        #Get the actual parameter we're finding the probability of
        testParam = str(testInst[label])

        #Number of times this parameter was counted in training given the label
        try:
            paramFreq = featureFrames[i][testGrade][testParam]
        except KeyError:
            paramFreq = None

        #If the parameter can't be found, or its count is null, probability *= eps
        if pd.isnull(paramFreq):
            probability *= eps
            continue

        #Update probability
        probability *= (paramFreq/gradeFreq)

    #prob *= p(label)
    probability *= gradeFreq/totTrainingRows

    return probability

def predict_grade(instance):
    labelProbability = {"A+":0,"A":0,"B":0,"C":0,"D":0,"F":0}
    
    #For every potential label (A+, A, B, C, D, F), find P(label|params)
    for label in labelProbability.keys():
        labelProbability[label] = find_probability(label, instance)  
    
#     print("Calculated probabilities = {}".format(labelProbability))
    return max(labelProbability.items(), key=operator.itemgetter(1))[0]


# This function should predict the class for an instance or a set of instances, based on a trained model 
def predict(data):
    global eps
    global totTrainingRows
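    #Smoothing value used for parameter/grade pairs never seen in training
    eps = 1/len(data)
    #The class prior p(label) is estimated from the training set,
    #not from the data being predicted
    totTrainingRows = len(trainData)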
    
    for i, inst in enumerate(data.iterrows()):
        print("instance number {} is predicted to be = {}, actual grade = {}".format(i, predict_grade(inst[1]), inst[1]["Grade"]))
    


predict(testData)


Calculated probabilities = {'A+': 9.651212619190701e-54, 'A': 1.14752957271735e-22, 'B': 1.6872884142953789e-21, 'C': 7.00311995033447e-23, 'D': 3.951309109328609e-17, 'F': 1.2528789717623695e-22}
instance number 0 is predicted to be = D, actual grade = C
Calculated probabilities = {'A+': 8.629938554541597e-40, 'A': 1.507174587021624e-28, 'B': 1.17118907057278e-18, 'C': 1.3330053653864397e-18, 'D': 1.3314331174793682e-11, 'F': 9.4170438034473e-13}
instance number 1 is predicted to be = D, actual grade = C
Calculated probabilities = {'A+': 8.629938554541597e-40, 'A': 1.2833291712460495e-36, 'B': 1.3971077905512032e-31, 'C': 8.10173439254319e-23, 'D': 3.963029296684352e-19, 'F': 1.1446973829149882e-16}
instance number 2 is predicted to be = F, actual grade = D
Calculated probabilities = {'A+': 9.651212619190701e-54, 'A': 1.507174587021624e-28, 'B': 1.862810387401604e-31, 'C': 2.2133317373896607e-22, 'D': 1.3660441893874085e-19, 'F': 7.310152833621603e-26}
instance number 3 is predicted t

Calculated probabilities = {'A+': 3.674110480787089e-48, 'A': 7.53587293510812e-29, 'B': 6.836662942849995e-25, 'C': 1.4939989227380206e-22, 'D': 6.302641976233693e-14, 'F': 1.6280140557013154e-16}
instance number 46 is predicted to be = D, actual grade = D
Calculated probabilities = {'A+': 5.954798186040662e-51, 'A': 5.133316684984198e-36, 'B': 2.1877321417119984e-23, 'C': 2.0178267459994875e-23, 'D': 7.354823092136655e-16, 'F': 3.5012529456578657e-32}
instance number 47 is predicted to be = D, actual grade = B
Calculated probabilities = {'A+': 3.674110480787089e-48, 'A': 3.014349174043248e-28, 'B': 5.909595196412745e-27, 'C': 1.4403083364521235e-20, 'D': 3.19691006057835e-14, 'F': 1.8041457193378126e-21}
instance number 48 is predicted to be = D, actual grade = C
Calculated probabilities = {'A+': 2.2669261666456337e-45, 'A': 6.028698348086496e-28, 'B': 6.563196425135997e-23, 'C': 1.6599988030422455e-21, 'D': 5.900546889153401e-15, 'F': 8.772183400345929e-24}
instance number 49 is pre

Calculated probabilities = {'A+': 3.674110480787089e-48, 'A': 1.3484278991471246e-41, 'B': 5.469330354279996e-24, 'C': 8.8867024359096e-19, 'D': 2.8209723206976245e-18, 'F': 5.26331004020756e-23}
instance number 92 is predicted to be = D, actual grade = B
Calculated probabilities = {'A+': 3.674110480787089e-48, 'A': 8.319800137737759e-39, 'B': 3.0764983242824975e-24, 'C': 1.839590834075708e-27, 'D': 1.7876213060630415e-17, 'F': 5.923948811686878e-27}
instance number 93 is predicted to be = D, actual grade = D
Calculated probabilities = {'A+': 1.3986934448203561e-42, 'A': 1.2057396696172992e-27, 'B': 5.350305553047372e-12, 'C': 2.7653920059880767e-18, 'D': 2.753588548271586e-15, 'F': 4.147724289544135e-28}
instance number 94 is predicted to be = B, actual grade = A+
Calculated probabilities = {'A+': 2.2669261666456337e-45, 'A': 6.028698348086496e-28, 'B': 4.432196397309559e-27, 'C': 9.33749326711263e-22, 'D': 4.2027560526495213e-13, 'F': 2.4367176112072023e-24}
instance number 95 is pre

Calculated probabilities = {'A+': 2.5351960837299478e-59, 'A': 3.014349174043248e-28, 'B': 2.0127610884944395e-34, 'C': 6.14531556886239e-20, 'D': 2.2414233197163906e-16, 'F': 7.730263255773824e-21}
instance number 138 is predicted to be = D, actual grade = F
Calculated probabilities = {'A+': 7.716733885550715e-26, 'A': 1.9541971954899498e-30, 'B': 7.292440472373328e-24, 'C': 1.7507799875836173e-22, 'D': 2.088647906244521e-14, 'F': 2.7828947720785773e-18}
instance number 139 is predicted to be = D, actual grade = F
Calculated probabilities = {'A+': 1.3986934448203561e-42, 'A': 1.0266633369968396e-35, 'B': 4.432196397309559e-27, 'C': 2.2409983841070304e-22, 'D': 2.6290672945735225e-15, 'F': 1.8041457193378122e-21}
instance number 140 is predicted to be = D, actual grade = C
Calculated probabilities = {'A+': 3.674110480787089e-48, 'A': 1.14752957271735e-22, 'B': 1.9269964174490804e-15, 'C': 2.274995823592858e-16, 'D': 8.876220783195789e-13, 'F': 1.7060972577658208e-26}
instance number 14

Calculated probabilities = {'A+': 1.3986934448203561e-42, 'A': 3.014349174043248e-28, 'B': 2.4623313318386442e-28, 'C': 2.837568861561779e-25, 'D': 2.4385221931856554e-13, 'F': 9.539144857624902e-18}
instance number 184 is predicted to be = D, actual grade = F
Calculated probabilities = {'A+': 5.954798186040662e-51, 'A': 1.3484278991471246e-41, 'B': 4.19132337165361e-31, 'C': 4.2656171692366073e-19, 'D': 1.9862458956256004e-18, 'F': 3.6082914386756265e-22}
instance number 185 is predicted to be = D, actual grade = F
Calculated probabilities = {'A+': 5.954798186040662e-51, 'A': 1.5836281973176252e-33, 'B': 1.7240310135401848e-28, 'C': 3.072657784431197e-19, 'D': 6.385873256659524e-16, 'F': 1.2183588056036007e-25}
instance number 186 is predicted to be = D, actual grade = D
Calculated probabilities = {'A+': 1.3986934448203561e-42, 'A': 5.393711596588498e-41, 'B': 4.4994357714543444e-21, 'C': 1.2774936347747968e-27, 'D': 1.9339332465227055e-16, 'F': 5.151138223117447e-16}
instance number 

Calculated probabilities = {'A+': 1.3986934448203561e-42, 'A': 1.2833291712460495e-36, 'B': 1.6872884142953789e-21, 'C': 5.5541890224435e-20, 'D': 6.304134078974284e-14, 'F': 7.062782852585475e-14}
instance number 219 is predicted to be = F, actual grade = A
Calculated probabilities = {'A+': 1.5642159836613777e-56, 'A': 6.028698348086496e-28, 'B': 2.734665177139998e-24, 'C': 2.8012479801337892e-21, 'D': 4.4028872932518784e-14, 'F': 7.310152833621604e-26}
instance number 220 is predicted to be = D, actual grade = C
Calculated probabilities = {'A+': 2.2669261666456337e-45, 'A': 2.1842594275409767e-17, 'B': 1.8739025129164477e-17, 'C': 5.263771586837973e-15, 'D': 2.0546807368508768e-12, 'F': 1.8552631813857166e-19}
instance number 221 is predicted to be = D, actual grade = D
Calculated probabilities = {'A+': 5.954798186040662e-51, 'A': 1.14752957271735e-22, 'B': 3.037119145731682e-20, 'C': 5.530784011976151e-18, 'D': 1.294448864216053e-11, 'F': 9.020728596689062e-21}
instance number 222 i

Calculated probabilities = {'A+': 5.954798186040662e-51, 'A': 3.014349174043248e-28, 'B': 6.896124054160739e-28, 'C': 2.489998204563367e-21, 'D': 1.2784680140405455e-13, 'F': 7.310152833621606e-24}
instance number 269 is predicted to be = D, actual grade = A
Calculated probabilities = {'A+': 2.2669261666456337e-45, 'A': 3.014349174043248e-28, 'B': 5.909595196412746e-27, 'C': 1.1374979117964284e-16, 'D': 4.931233768442104e-13, 'F': 4.638157953464294e-19}
instance number 270 is predicted to be = D, actual grade = A
Calculated probabilities = {'A+': 2.2669261666456337e-45, 'A': 1.859853440384684e-25, 'B': 2.5309326214430688e-21, 'C': 1.579131476051392e-14, 'D': 2.941929236854665e-12, 'F': 3.006909532229688e-21}
instance number 271 is predicted to be = D, actual grade = A
Calculated probabilities = {'A+': 3.674110480787089e-48, 'A': 9.29926720192342e-26, 'B': 2.050998882854999e-24, 'C': 3.4124937353892858e-18, 'D': 6.366305026498679e-15, 'F': 3.949299207791252e-27}
instance number 272 is p

Calculated probabilities = {'A+': 5.324672088152166e-37, 'A': 9.29926720192342e-26, 'B': 5.205284758101244e-19, 'C': 3.199212876927454e-17, 'D': 4.045152700675167e-10, 'F': 4.357737020045238e-10}
instance number 306 is predicted to be = F, actual grade = C
Calculated probabilities = {'A+': 3.674110480787089e-48, 'A': 2.2950591454347e-22, 'B': 8.436442071476894e-22, 'C': 5.459989976622859e-15, 'D': 4.494614111861294e-13, 'F': 2.7413073126081026e-24}
instance number 307 is predicted to be = D, actual grade = D
Calculated probabilities = {'A+': 2.2669261666456337e-45, 'A': 9.770985977449749e-31, 'B': 5.469330354279996e-24, 'C': 2.017826745999488e-23, 'D': 6.491441987968968e-18, 'F': 7.680970906563211e-30}
instance number 308 is predicted to be = D, actual grade = B
Calculated probabilities = {'A+': 1.3986934448203561e-42, 'A': 3.014349174043248e-28, 'B': 7.451241549606418e-31, 'C': 3.0659847234595123e-27, 'D': 3.137992647602949e-15, 'F': 4.2926151859312044e-16}
instance number 309 is pred

Calculated probabilities = {'A+': 3.674110480787089e-48, 'A': 7.175997032787283e-51, 'B': 1.7958656391043594e-30, 'C': 3.3992512830485257e-36, 'D': 2.494214242668472e-21, 'F': 2.132621572207277e-27}
instance number 343 is predicted to be = D, actual grade = F
Calculated probabilities = {'A+': 2.5351960837299478e-59, 'A': 1.507174587021624e-28, 'B': 2.676383913123723e-17, 'C': 1.2638865686626991e-17, 'D': 1.1235421514553248e-16, 'F': 1.991823897974252e-32}
instance number 344 is predicted to be = D, actual grade = D
Calculated probabilities = {'A+': 4.108907753208991e-62, 'A': 2.4427464943624373e-31, 'B': 3.4927694763780086e-32, 'C': 1.5166638823952388e-17, 'D': 1.6141984472437747e-16, 'F': 1.3742690232486787e-20}
instance number 345 is predicted to be = D, actual grade = F
Calculated probabilities = {'A+': 9.651212619190701e-54, 'A': 3.014349174043248e-28, 'B': 3.8311800300893004e-29, 'C': 2.0484385229541304e-19, 'D': 4.359422809879565e-18, 'F': 3.1594393662330028e-27}
instance number 

Calculated probabilities = {'A+': 2.2669261666456337e-45, 'A': 1.0399750172172199e-39, 'B': 1.3296589191928676e-26, 'C': 5.521312305887832e-31, 'D': 4.8151415651529726e-17, 'F': 1.7170460743724815e-17}
instance number 375 is predicted to be = D, actual grade = F
Calculated probabilities = {'A+': 1.5642159836613777e-56, 'A': 7.439413761538736e-25, 'B': 5.0618652428861375e-21, 'C': 3.734997306845051e-21, 'D': 1.2841754605317975e-13, 'F': 2.818977686465333e-23}
instance number 376 is predicted to be = D, actual grade = B
Calculated probabilities = {'A+': 1.3986934448203561e-42, 'A': 1.2057396696172992e-27, 'B': 3.3745768285907577e-21, 'C': 1.5555330033682918e-18, 'D': 7.584661313765934e-12, 'F': 1.3914473860392883e-18}
instance number 377 is predicted to be = D, actual grade = B
Calculated probabilities = {'A+': 3.674110480787089e-48, 'A': 4.8854929887248745e-31, 'B': 8.02915173937117e-17, 'C': 1.1380214016411839e-20, 'D': 6.224882107695749e-13, 'F': 6.318878732466001e-28}
instance number

Calculated probabilities = {'A+': 3.674110480787089e-48, 'A': 1.2057396696172992e-27, 'B': 1.605830347874234e-16, 'C': 1.536328892215598e-19, 'D': 1.2608268157948557e-15, 'F': 1.7771846435060627e-27}
instance number 408 is predicted to be = D, actual grade = C
Calculated probabilities = {'A+': 3.674110480787089e-48, 'A': 3.1672563946352505e-33, 'B': 1.7240310135401848e-28, 'C': 9.832504910179827e-18, 'D': 3.476753217514629e-16, 'F': 1.8275382084054016e-24}
instance number 409 is predicted to be = D, actual grade = B
Calculated probabilities = {'A+': 5.324672088152166e-37, 'A': 1.14752957271735e-22, 'B': 1.124858942863586e-21, 'C': 5.748721356486586e-27, 'D': 6.136646467394618e-13, 'F': 8.34868431623573e-18}
instance number 410 is predicted to be = D, actual grade = B
Calculated probabilities = {'A+': 1.3986934448203561e-42, 'A': 7.918140986588126e-34, 'B': 2.1877321417119984e-23, 'C': 9.726555486575656e-24, 'D': 4.052657622197753e-13, 'F': 8.34868431623573e-20}
instance number 411 is p

Calculated probabilities = {'A+': 2.2669261666456337e-45, 'A': 1.14752957271735e-22, 'B': 1.0938660708559992e-23, 'C': 1.4939989227380202e-20, 'D': 5.547637989497369e-12, 'F': 3.655076416810803e-24}
instance number 443 is predicted to be = D, actual grade = C
Calculated probabilities = {'A+': 3.674110480787089e-48, 'A': 1.2833291712460495e-36, 'B': 9.057424898224978e-34, 'C': 5.675137723123558e-25, 'D': 2.7017156085867206e-17, 'F': 5.482614625216206e-25}
instance number 444 is predicted to be = D, actual grade = D
Calculated probabilities = {'A+': 1.3986934448203561e-42, 'A': 3.014349174043248e-28, 'B': 6.206511648744665e-27, 'C': 4.668746633556315e-23, 'D': 1.896628372477732e-15, 'F': 4.3860917001729653e-23}
instance number 445 is predicted to be = D, actual grade = D
Calculated probabilities = {'A+': 2.2669261666456337e-45, 'A': 7.918140986588126e-34, 'B': 2.2160981986547795e-27, 'C': 2.5222834324993598e-26, 'D': 1.0345030706383341e-17, 'F': 1.4308717286437346e-17}
instance number 44

Calculated probabilities = {'A+': 3.674110480787089e-48, 'A': 2.4427464943624373e-31, 'B': 6.209367958005349e-32, 'C': 2.188474984479522e-23, 'D': 7.342902795390896e-14, 'F': 1.9618841257181876e-14}
instance number 478 is predicted to be = D, actual grade = D
Calculated probabilities = {'A+': 8.629938554541597e-40, 'A': 7.918140986588126e-34, 'B': 2.7942155811024056e-31, 'C': 1.226393889383805e-26, 'D': 1.1573219777221022e-18, 'F': 7.730263255773822e-20}
instance number 479 is predicted to be = D, actual grade = D
Calculated probabilities = {'A+': 3.2853226783898868e-34, 'A': 1.14752957271735e-22, 'B': 2.0821139032404976e-18, 'C': 5.761233345808494e-20, 'D': 4.142236365491367e-12, 'F': 2.4055276257837507e-20}
instance number 480 is predicted to be = D, actual grade = C
Calculated probabilities = {'A+': 2.5351960837299478e-59, 'A': 6.742139495735623e-42, 'B': 1.5523419895013372e-32, 'C': 8.281968458831745e-31, 'D': 4.924576600724633e-18, 'F': 1.4005011782631464e-32}
instance number 481 

Calculated probabilities = {'A+': 1.3986934448203561e-42, 'A': 5.73764786358675e-23, 'B': 4.216280654062008e-17, 'C': 1.5791314760513916e-14, 'D': 3.78248044738457e-13, 'F': 2.1930458500864824e-23}
instance number 514 is predicted to be = D, actual grade = C
Calculated probabilities = {'A+': 2.2669261666456337e-45, 'A': 8.319800137737759e-39, 'B': 2.734665177139998e-24, 'C': 1.5562488778521052e-22, 'D': 5.3947857272259645e-15, 'F': 5.8481222668972854e-24}
instance number 515 is predicted to be = D, actual grade = B
Calculated probabilities = {'A+': 1.5642159836613777e-56, 'A': 1.2213732471812186e-31, 'B': 1.1080490993273897e-27, 'C': 4.739574632485119e-19, 'D': 3.395319688463894e-16, 'F': 6.184210604619056e-20}
instance number 516 is predicted to be = D, actual grade = D
Calculated probabilities = {'A+': 3.674110480787089e-48, 'A': 5.133316684984198e-36, 'B': 4.375464283423997e-23, 'C': 5.254757151040332e-26, 'D': 2.1037080516566088e-19, 'F': 1.8552631813857175e-19}
instance number 517

Calculated probabilities = {'A+': 1.3986934448203561e-42, 'A': 1.7710360676919013e-47, 'B': 1.2154067453955548e-24, 'C': 1.5528690860309523e-32, 'D': 4.297723002751828e-22, 'F': 1.066310786103638e-27}
instance number 554 is predicted to be = D, actual grade = D
Calculated probabilities = {'A+': 2.0270440925665603e-31, 'A': 1.507174587021624e-28, 'B': 1.367332588569999e-24, 'C': 2.7064558908991913e-14, 'D': 2.593700878206562e-12, 'F': 7.847536502872752e-15}
instance number 555 is predicted to be = D, actual grade = B
Calculated probabilities = {'A+': 1.3986934448203561e-42, 'A': 7.918140986588126e-34, 'B': 1.0938660708559992e-23, 'C': 1.1671866583890788e-22, 'D': 5.200910615153783e-11, 'F': 1.4125565705170948e-13}
instance number 556 is predicted to be = D, actual grade = C
Calculated probabilities = {'A+': 1.3986934448203561e-42, 'A': 1.9541971954899498e-30, 'B': 3.4701898387341626e-19, 'C': 7.58331941197619e-17, 'D': 8.582613656950397e-15, 'F': 4.608582543937927e-29}
instance number 5

Calculated probabilities = {'A+': 1.0793345101142903e-67, 'A': 8.319800137737759e-39, 'B': 2.2643562245562454e-34, 'C': 3.0267401189992323e-25, 'D': 1.1922959935323855e-19, 'F': 1.5034547661148432e-23}
instance number 603 is predicted to be = D, actual grade = F
Calculated probabilities = {'A+': 2.5351960837299478e-59, 'A': 1.6639600275475519e-38, 'B': 8.620155067700924e-29, 'C': 3.791659705988096e-17, 'D': 9.990343939307342e-18, 'F': 5.686990859219403e-26}
instance number 604 is predicted to be = C, actual grade = B
Calculated probabilities = {'A+': 1.3986934448203561e-42, 'A': 1.2057396696172992e-27, 'B': 4.684756282291119e-18, 'C': 8.641850018712739e-20, 'D': 2.4516076973788866e-15, 'F': 5.637955372930664e-22}
instance number 605 is predicted to be = D, actual grade = C
Calculated probabilities = {'A+': 3.674110480787089e-48, 'A': 1.2057396696172992e-27, 'B': 2.024746097154455e-20, 'C': 1.2802740768463318e-20, 'D': 1.8785652451208012e-13, 'F': 5.412437158013438e-22}
instance number 

In [397]:
accuracy = None


# This function should evaluate a set of predictions in terms of accuracy
def evaluate(data):
    count = 0
    global accuracy
    accuracy = {"success":0, "fail":0}
    print("evaluating")
    
    #Go through every row and count if the prediction was a success or fail
    for inst in data.iterrows():
        if(predict_grade(inst[1]) == inst[1]["Grade"]):
            accuracy["success"] += 1
        else:
            accuracy["fail"] += 1
    
    s = accuracy["success"]
    f = accuracy["fail"]
    print("accuracy is {}. ({} successes and {} fails)".format((s/(s+f)), s, f))
            
    return (s/(s+f))

evaluate(testData)

# print(featureFrames[len(list(rawData))-1])


evaluating
accuracy is 0.31442463533225284. (194 successes and 423 fails)


0.31442463533225284

## 1) Part C

#### What accuracy does your classifier achieve? Manually inspect a few instances for which your classifier made correct predictions, and some for which it predicted incorrectly, and discuss any patterns you can find.

From inspection, the accuracy of this model is around 30% for an 80-20 train-test data split (the run shown above, which holds out 95% of the data for testing, scored 31.4%, i.e. 194 of 617). This is approximately twice as good as guessing uniformly at random (16.7%) and about as good as always guessing the most common class (30%). Observing the classifier output, it seems that this Naive Bayes model produces better results for the more common classes in the data set (Grade = C and D) and poorer results for the less common ones (Grade = A+). This may be because instances of rare classes give the model fewer opportunities to learn their defining characteristics.

# Question 2

### A Closer Look at Evaluation

#### - A) You learnt in the lectures that precision, recall and f-1 measure can provide a more holistic and realistic picture of the classifier performance. (i) Explain the intuition behind accuracy, precision, recall, and F1-measure, (ii) contrast their utility, and (iii) discuss the difference between micro and macro averaging in the context of the data set. [no programming required]
#### - B) Compute precision, recall and f-1 measure of your model’s predictions on the test data set (1) separately for each class, and (2) as a single number using macro-averaging. Compare the results against your accuracy scores from Question 1. In the context of the student dataset, and your response to question 2a analyze the additional knowledge you gained about your classifier performance.



## 2) Part a

##### Intuition
Accuracy - This is the proportion of guesses we got right compared to all guesses. I.e. How many times was our prediction accurate? More formally:

(True Positives + True Negatives)/(True Positives + True Negatives + False Positives + False Negatives)


Precision - With respect to a class that we care to measure (maybe A+ students in this example), how often is our model correct when it predicts that class? If we really care about one label more than the others, we want to see how reliably we can predict that class. I.e. When our model detects an interesting class, is it right? More formally:

(True Positives)/(True Positives + False Positives)


Recall - How good is our model at detecting our desired class? I.e. Of all the instances that truly belong to the class, what fraction does the model find? More formally:

(True positives)/(True positives + False Negatives)


F1 measure - A combined measure of precision and recall using the harmonic mean. This is important because precision and recall usually trade off against each other (models with higher recall tend to have lower precision and vice versa). It shows how good your classifier is at balancing precision and recall. It is defined as:

(2 * Precision*Recall)/(Precision + Recall)
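The four definitions above can be written as a short sketch for a single class of interest. The function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration; a zero denominator is mapped to 0.0 here, which is one common convention rather than the only option.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for one class
metrics = binary_metrics(tp=8, fp=2, fn=4, tn=6)
print(metrics)
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, unlike an arithmetic mean of the two.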


##### Compare/Contrast:
The relative value of precision and recall depends on your tolerance for type 1 errors (false positives) versus type 2 errors (false negatives).

If you are a doctor and your model is determining if a patient has cancer, you really don't want it to tell patients who have cancer that they don't have cancer (false negative). Better safe than sorry. This is where you don't mind false positives, so you can have a low precision, but you need a high recall. 

The opposite holds if you don't mind letting false negatives pass, but want to be sure that the instances you classify as the class of interest really belong to it. An example of this is an employer who wants close to 100% of the candidates they interview to be high quality, but doesn't mind rejecting some potentially great candidates to make this happen.

F1 is important when you want a balance of these 2, and want to overall maximise your performance without having preference for type 1 or type 2 error.


##### Micro vs macro averaging
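Macro averaging computes the precision and recall for each class separately and then takes the unweighted mean across classes, so every class counts equally regardless of its size.

Micro averaging first sums the TP, FP and FN counts over all classes, and then plugs these totals into the precision or recall equation. E.g.

micro_precision = sum_TP/(sum_TP + sum_FP)

In the student data set the grades are imbalanced (A+ is rare, C and D are common), so the micro average is dominated by the common grades, while the macro average is pulled down if the classifier does poorly on the rare ones.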
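The difference between the two averages can be sketched as follows. The per-class TP/FP/FN counts are copied from the output printed in Part b of this notebook; the variable and function names are hypothetical, and a class with no positive predictions is given a precision of 0, matching the `safe_divide` convention used in Part b.

```python
# Per-class counts (TP, FP, FN) as printed in the Part b output
counts = {
    "A+": (0, 0, 16),
    "A": (0, 0, 63),
    "B": (1, 0, 107),
    "C": (44, 96, 102),
    "D": (158, 310, 31),
    "F": (5, 3, 90),
}

def precision(tp, fp):
    # Classes that were never predicted get precision 0 (assumed convention)
    return tp / (tp + fp) if (tp + fp) else 0.0

# Macro: average the per-class precisions, each class weighted equally
macro_precision = sum(precision(tp, fp) for tp, fp, _ in counts.values()) / len(counts)

# Micro: pool the counts over all classes first, then apply the formula once
sum_tp = sum(tp for tp, _, _ in counts.values())
sum_fp = sum(fp for _, fp, _ in counts.values())
micro_precision = sum_tp / (sum_tp + sum_fp)

print("macro precision = {:.3f}".format(macro_precision))
print("micro precision = {:.3f}".format(micro_precision))
```

With these counts the macro value reproduces the macro-averaged precision reported in Part b, while the micro value is driven mostly by the large "C" and "D" classes.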


## 2) Part b


In [386]:
#First evaluate all the predictions into TP, TN, FN and FP for each class

#Setup dataframes for storing the TP, TN, FN and FP results
labelAccuracy = {"A+":None,"A":None,"B":None,"C":None,"D":None,"F":None}

for label in labelAccuracy.keys():
    labelAccuracy[label] = pd.DataFrame(0, index=["positive", "negative"], columns=["true", "false"]) 
    
# print(labelAccuracy)


def add_one(label, truFal, posNeg):
    #.loc updates the frame in place (chained indexing may write to a copy).
    #Counts are initialised to 0, so no null check is needed
    labelAccuracy[label].loc[posNeg, truFal] += 1

        
def count_error_type(label, predictedGrade, actualGrade):
    if(predictedGrade == actualGrade and predictedGrade == label):
        add_one(label, "true", "positive")
    elif(label != predictedGrade and label != actualGrade):
        add_one(label, "true", "negative")
    elif(label == predictedGrade and label != actualGrade):
        add_one(label, "false", "positive")
    elif(label != predictedGrade and label == actualGrade):
        add_one(label, "false", "negative")
    

# This function tallies, for every class, whether each prediction is a TP, TN, FP or FN
def evaluate_F1(data):
    #For each instance
    for inst in data.iterrows():
        
        predictedGrade = predict_grade(inst[1])
        actualGrade = inst[1]["Grade"]
        
        #Go through all the labels and find if it evaluated a TP, TN, FN or FP
        for label in labelAccuracy.keys():
            #If interesting key is correct, label as correct
            count_error_type(label, predictedGrade, actualGrade)

    return

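#Separate name so this threshold does not overwrite the global smoothing
#value eps that find_probability relies on
DIV_EPS = 1e-30

def safe_divide(num, denom):
    if (denom < DIV_EPS):
        return 0
    else:
        return num/denom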
    
def ave(lst, index): 
    #Mean of the values at position `index` across all rows of lst
    #(avoids shadowing the built-in sum)
    values = [indList[index] for indList in lst]
    return sum(values)/len(values)
        

def print_func():
    mac_ave = []
    for key in labelAccuracy.keys():
        theKey = key
        la = labelAccuracy
        
        precision = safe_divide(la[key]["true"]["positive"],(la[key]["true"]["positive"] + la[key]["false"]["positive"]))
        recall = safe_divide(la[key]["true"]["positive"],(la[key]["true"]["positive"] + la[key]["false"]["negative"]))
        f1 = safe_divide((2*precision*recall),(precision + recall))
        mac_ave.append([key, precision, recall, f1].copy())
      
        print("For the label {}, precision = {:.3f}, recall = {:.3f}, f-1 = {:.3f}".format(key, precision, recall, f1))
        print()
 
    print("macro averaging scores: precision = {}, recall = {}, f1 = {}".format(ave(mac_ave, 1),ave(mac_ave,2),ave(mac_ave,3)))
        
 

evaluate_F1(testData)
print(labelAccuracy)
print_func()

{'A+':           true  false
positive     0      0
negative   601     16, 'A':           true  false
positive     0      0
negative   554     63, 'B':           true  false
positive     1      0
negative   509    107, 'C':           true  false
positive    44     96
negative   375    102, 'D':           true  false
positive   158    310
negative   118     31, 'F':           true  false
positive     5      3
negative   519     90}
For the label A+, precision = 0.000, recall = 0.000, f-1 = 0.000

For the label A, precision = 0.000, recall = 0.000, f-1 = 0.000

For the label B, precision = 1.000, recall = 0.009, f-1 = 0.018

For the label C, precision = 0.314, recall = 0.301, f-1 = 0.308

For the label D, precision = 0.338, recall = 0.836, f-1 = 0.481

For the label F, precision = 0.625, recall = 0.053, f-1 = 0.097

macro averaging scores: precision = 0.379482091982092, recall = 0.19987325619986038, f1 = 0.1506837391660061



## 2) Part b continued

#### Compare the results against your accuracy scores from Question 1. In the context of the student dataset, and your response to question 2a, analyze the additional knowledge you gained about your classifier performance.


The data gathered from part b supports the hypothesis formed in Question 1: less common classes are predicted less accurately than more common classes, by around 20-30%. While the D and F classes can be predicted 40-50% of the time on a similar 80-20 data split, there are close to, or exactly, 0 correct predictions of the A+ class, depending on the random sampling.

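The model performs best in recall for the most common class: "D" reaches a recall of 0.836 against a precision of 0.338. High recall with low precision points to a large number of false positives (310 for "D") rather than false negatives, so the model over-predicts "D". In this run "F" shows the opposite pattern, with a precision of 0.625 but a recall of only 0.053, meaning it is guessed rarely and conservatively. The prevalence of the "D" and "F" classes suggests that this behaviour may be due to overfitting to the traits of these classes. If the model has a very clear idea of what a class looks like, it may be conservative with its guesses for that class, as seen with "F". A more balanced data set may help improve the accuracy of this classifier.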
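
As a worked check of the figures above, the following minimal sketch recomputes precision, recall and F1 for the "D" label from the counts in the printed confusion table (158 true positives, 310 false positives, 31 false negatives). The function name `prf_from_counts` is a hypothetical helper and is not part of the notebook.

```python
def prf_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Counts for the "D" label, copied from the confusion table printed above
d_precision, d_recall, d_f1 = prf_from_counts(tp=158, fp=310, fn=31)
print("D: precision = {:.3f}, recall = {:.3f}, f-1 = {:.3f}".format(d_precision, d_recall, d_f1))
```

The large false-positive count is what pulls precision down while recall stays high.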



# Question 3: 
### Training Strategies 

#### There are other evaluation strategies, which tend to be preferred over the hold-out strategy you implemented in Question 1.

#### - A) Select one such strategy, (i) describe how it works, and (ii) explain why it is preferable over hold-out evaluation. [no programming required]

#### - B) Implement your chosen strategy from Question 3a, and report the accuracy score(s) of your classifier under this strategy. Compare your outcomes against your accuracy score in Question 1, and explain your observations in the context of your response to question 3a.



## 3) Part a - Cross Validation


Cross validation separates the data into n chunks and iteratively uses each chunk as testing data, with the rest as training data. E.g. if n=3, chunk 1 is used as test data while chunks 2 and 3 are used for training, and the accuracy is recorded. Then chunk 2 becomes the test data, and the process is repeated until every chunk has been used as a test set once.

This is more effective than the hold-out strategy because it uses all of the data in the set for both training and testing, unlike hold-out, which uses one fixed selection for testing and one for training. This makes the assessment more reflective of the model's performance across all the data.

By design, each run uses most of the data: with n chunks, every iteration trains on (n-1)/n of it. As n increases, the model can be trained with more data on each iteration, making each iteration more accurate. Having the model be the most informed during training creates the best proxy of how it would perform outside a research sandbox.

Increasing the number of times the model is run gives a more accurate representation of its performance too. Outlier data doesn't affect the result as much because the final accuracy aggregates the results of every run. The hold-out strategy is affected by this because it is only run once, with possibly outlier-containing data.

It is also repeatable - giving flexibility to researchers to better understand how their model could be improved. 
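
The chunking described above can be sketched without any real data. This is a minimal illustration, assuming rows are addressed by position and that the last chunk absorbs any remainder; `fold_bounds` is a hypothetical helper and the 10-row example is made up for demonstration.

```python
def fold_bounds(num_rows, n):
    """Return n (start, end) pairs, end exclusive, covering every row exactly once."""
    chunk_size = num_rows // n
    bounds = []
    for i in range(n):
        start = i * chunk_size
        # The last fold takes whatever rows are left over
        end = num_rows if i == n - 1 else start + chunk_size
        bounds.append((start, end))
    return bounds


# Hypothetical example: 10 rows split into 3 folds
for start, end in fold_bounds(10, 3):
    test_rows = list(range(start, end))
    train_rows = [r for r in range(10) if r < start or r >= end]
    print("test:", test_rows, "train:", train_rows)
```

With a pandas DataFrame, positional slices such as `df.iloc[start:end]` follow the same end-exclusive convention, so no row appears in two test chunks.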



## 3) Part b - Implementation of Cross Validation


n = 3
chunkSize = int(len(rawData)/n)
accuracyLog = []

for i in range(n):
    # Row labels bounding the test chunk for this iteration
    startIndex = i*chunkSize
    endIndex = startIndex + chunkSize
    
    #Deals with ugly divisors len(rawData)/chunkSize: the last chunk runs to the final row
    if i == n-1:
        endIndex = len(rawData)-1
    
    # .loc slices by label and includes endIndex, so the row at endIndex
    # is also the first row of the next chunk
    tempTestDF = rawData.loc[startIndex:endIndex]
    # Everything outside the test chunk is used for training
    tempTrainDF = rawData.drop(tempTestDF.index, inplace=False)

    print("Iteration number {}...".format(i+1))
    train(tempTrainDF)
    accuracyLog.append(evaluate(tempTestDF))
    print()
    

print("average value = {}".format((sum(accuracyLog)/len(accuracyLog))))



Iteration number 1...
evaluating
accuracy is 0.3317972350230415. (72 successes and 145 fails)

Iteration number 2...
evaluating
accuracy is 0.35023041474654376. (76 successes and 141 fails)

Iteration number 3...
evaluating
accuracy is 0.3640552995391705. (79 successes and 138 fails)

average value = 0.3486943164362519



## 3) Part b - Continued

#### Report the accuracy score(s) of your classifier under this strategy. Compare your outcomes against your accuracy score in Question 1, and explain your observations in the context of your response to question 3a.

Unfortunately, the benefits of cross validation don't seem to be apparent in this example. The three folds scored 0.332, 0.350 and 0.364, for an average of 0.349. Experimenting manually with different train-test splits revealed an interesting guessing pattern: the accuracy tends to be similar, even with different training-test splits.
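
To put a number on how similar the fold scores are, this short sketch summarises the three accuracies reported in the cross validation output using the standard library. The variable names are illustrative only.

```python
import statistics

# Fold accuracies copied from the cross validation output above
fold_accuracies = [0.3317972350230415, 0.35023041474654376, 0.3640552995391705]

mean_accuracy = statistics.mean(fold_accuracies)
# Sample standard deviation across folds (n-1 in the denominator)
spread = statistics.stdev(fold_accuracies)

print("mean = {:.3f}, standard deviation = {:.3f}".format(mean_accuracy, spread))
```

A small spread relative to the mean is consistent with the observation that accuracy changes little between splits, although three folds is a small sample to judge from.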

The model, trained with low amounts of data, guessed primarily "D" and "F", the most popular labels. As the amount of training data increased, it began predicting other classes. However, even though there was more variety, the overall accuracy remained the same: the performance of the least common classes improved, whilst the performance of the most common classes decreased. This happens in an almost linear fashion, giving around 28-37% accuracy across a variety of statistically significant splits (at a 99-1 train-test split, the results are not reliable because the law of large numbers doesn't apply).

This behaviour can largely be explained by the disproportionate spread of classes in the data. Removing this bias would reduce the current model's higher accuracy at low training volumes, which comes from the successful guess-the-most-popular-class strategy. It may also make the differences between the classes more salient, creating more accurate classifiers when more training data is provided. Other causes may be a low correlation between the features and high- or low-performing student outcomes, and/or high levels of dependence between the features.
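
One way to gauge how much of the accuracy could come from guessing popular classes is to compare against a majority-class baseline. The sketch below derives the actual class counts from the confusion table printed in Question 2 (true positives plus false negatives per label). It assumes that table reflects the class balance of the evaluated data; the variable names are illustrative.

```python
# Actual instances per label = true positives + false negatives,
# copied from the confusion table printed in Question 2
actual_counts = {
    "A+": 0 + 16,
    "A": 0 + 63,
    "B": 1 + 107,
    "C": 44 + 102,
    "D": 158 + 31,
    "F": 5 + 90,
}

total = sum(actual_counts.values())
majority_label = max(actual_counts, key=actual_counts.get)
# Accuracy of a classifier that always predicts the most common label
baseline_accuracy = actual_counts[majority_label] / total

print("always predicting {} gives accuracy {:.3f}".format(majority_label, baseline_accuracy))
```

An accuracy close to this baseline can be reached without learning anything about the features, which is worth keeping in mind when reading the 28-37% range above.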