# ID2214 Assignment 1 Group no. 5
### Project members: 
[Ceren Dikmen, cerend@kth.se]
[Jakob Heyder, heyder@kth.se]
[Lutfi Altin, lutfia@kth.se]
[Muhammad Fasih Ullah, mufu@kth.se]

### Declaration:
By submitting this solution, it is hereby declared that all individuals listed above have contributed to the solution, either with code that appears in the final solution below, or with code that has been evaluated and compared to the final solution, but for some reason has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214) and no part of the solution has been provided by someone not listed as project member above.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy and pandas may be used in the solution for this assignment.

### Instructions
All assignments starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of assignments starting with 
number 2 below, then the assignment will receive 2 points (in total).

It is highly recommended that you do not develop the code directly within the notebook
but that you copy the comments and test cases to your regular development environment
and only when everything works as expected, that you paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above).


## Load NumPy and pandas

In [0]:
import numpy as np
import pandas as pd

## 1a. Create and apply normalization

In [0]:
# Insert the functions create_normalization and apply_normalization below (after the comments)
#
# Input to create_normalization:
# df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
# normalizationtype: "minmax" (default) or "zscore"
#
# Output from create_normalization:
# df: a new dataframe, where each numeric value in a column has been replaced by a normalized value
# normalization: a mapping (dictionary) from each column name to a triple, consisting of
#                ("minmax",min_value,max_value) or ("zscore",mean,std)
#
# Hint 1: First copy the input dataframe and modify the copy (the input dataframe should be kept unchanged)
# Hint 2: Consider columns of type "float" or "int" only (and which are not labeled "CLASS" or "ID"),
#         the other columns should remain unchanged
# Hint 3: Take a close look at the lecture slides on data preparation
#
# Input to apply_normalization:
# df: a dataframe
# normalization: a mapping (dictionary) from column names to triples (see above)
#
# Output from apply_normalization:
# df: a new dataframe, where each numerical value has been normalized according to the mapping
#
# Hint 1: First copy the input dataframe and modify the copy (the input dataframe should be kept unchanged)
# Hint 2: For minmax-normalization, you may consider to limit the output range to [0,1]

def create_normalization(df, normalizationtype = "minmax"):
    copydf = df.copy()
    normalization = {}
    
    cols = copydf.loc[:, ~df.columns.isin(['ID', 'CLASS'])] # Removing ID and CLASS columns
    cols = cols.select_dtypes(include="number") # Selecting only numeric (int and float) columns from the remaining
    
    if normalizationtype == 'minmax':
        for column in cols:
            # Named min_value/max_value so that the built-ins min and max are not shadowed
            min_value = copydf[column].min()
            max_value = copydf[column].max()
            copydf[column] = (copydf[column] - min_value) / (max_value - min_value)
            normalization[column] = ("minmax", min_value, max_value)
            
    elif normalizationtype == 'zscore':
        for column in cols:
            mean = copydf[column].mean()
            std = copydf[column].std()
            copydf[column] = copydf[column].apply(lambda x: (x-mean)/std)
            normalization[column] = ("zscore", mean, std)
            
    return copydf, normalization

def apply_normalization(df, normalization):
    copydf = df.copy()
    
    for column in normalization:
        kind = normalization[column][0] # "minmax" or "zscore"
        
        if kind == 'minmax':
            min_value = normalization[column][1]
            max_value = normalization[column][2]
            copydf[column] = ((copydf[column] - min_value) / (max_value - min_value)).clip(0, 1) # Clipping out-of-range values to [0,1]
            
        elif kind == 'zscore':
            mean = normalization[column][1]
            std = normalization[column][2]
            copydf[column] = copydf[column].apply(lambda x: (x-mean)/std)
   
    return copydf

In [3]:
# Test your code (leave this part unchanged)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

glass_train_norm, normalization = create_normalization(glass_train_df,normalizationtype="minmax")
print("normalization:\n")
for f in normalization:
    print("{}:{}".format(f,normalization[f]))

glass_test_norm = apply_normalization(glass_test_df,normalization)
print("\nglass_test_norm:\n")
glass_test_norm

normalization:

RI:('minmax', 1.51131, 1.53125)
Na:('minmax', 10.73, 15.79)
Mg:('minmax', 0.0, 4.49)
Al:('minmax', 0.29, 3.04)
Si:('minmax', 69.81, 75.18)
K:('minmax', 0.0, 6.21)
Ca:('minmax', 5.43, 14.68)
Ba:('minmax', 0.0, 3.15)
Fe:('minmax', 0.0, 0.37)

glass_test_norm:



Unnamed: 0,ID,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,CLASS
0,101,0.262788,0.399209,0.634744,0.418182,0.644320,0.091787,0.363243,0.034921,0.594595,2
1,104,0.799398,0.606719,0.701559,0.134545,0.141527,0.012882,0.671351,0.000000,0.000000,2
2,44,0.541123,0.592885,0.855234,0.156364,0.363128,0.027375,0.465946,0.000000,0.000000,1
3,17,0.327482,0.385375,0.817372,0.316364,0.614525,0.098229,0.353514,0.000000,0.000000,1
4,81,0.231194,0.420949,0.783964,0.665455,0.530726,0.111111,0.274595,0.000000,0.000000,2
5,142,0.361083,0.488142,0.808463,0.283636,0.562384,0.091787,0.322162,0.028571,0.459459,2
6,120,0.261284,0.559289,0.795100,0.429091,0.491620,0.103060,0.273514,0.000000,0.000000,2
7,123,0.278837,0.494071,0.788419,0.432727,0.564246,0.090177,0.288649,0.000000,0.000000,2
8,133,0.342026,0.533597,0.886414,0.323636,0.499069,0.093398,0.294054,0.000000,0.000000,2
9,185,0.000000,1.000000,0.000000,0.018182,1.000000,0.000000,0.131892,0.000000,0.000000,6


### Comment on assumptions, things that do not work properly, etc.
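One case the functions above do not guard against: for a constant column, `max - min` is zero, so min-max normalization divides by zero and yields NaN. A minimal sketch of one possible guard, assuming that mapping such a column to 0.0 is acceptable (the helper name `minmax_column` is hypothetical and not part of the assignment):

```python
import pandas as pd

def minmax_column(values, min_value, max_value):
    """Min-max normalize a Series, clipped to [0, 1]; constant columns map to 0.0."""
    width = max_value - min_value
    if width == 0:
        # Assumption: a constant column carries no information, so every value becomes 0.0
        return pd.Series(0.0, index=values.index)
    return ((values - min_value) / width).clip(0, 1)

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]})
scaled_a = minmax_column(toy["a"], toy["a"].min(), toy["a"].max())
scaled_b = minmax_column(toy["b"], toy["b"].min(), toy["b"].max())
print(scaled_a.tolist())  # [0.0, 0.5, 1.0]
print(scaled_b.tolist())  # [0.0, 0.0, 0.0]
```

The same consideration would apply to z-score normalization when the standard deviation is zero.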


## 1b. Create and apply imputation

In [0]:
# Insert the functions create_imputation and apply_imputation below (after the comments)
#
# Input to create_imputation:
# df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
#
# Output from create_imputation:
# df: a new dataframe, where each missing numeric value in a column has been replaced by the mean of that column 
#     and each missing categoric value in a column has been replaced by the mode of that column
# imputation: a mapping (dictionary) from column name to value that has replaced missing values
#
# Hint 1: First copy the input dataframe and modify the copy (the input dataframe should be kept unchanged)
# Hint 2: Handle columns of type "float" or "int" only (and which are not labeled "CLASS" or "ID") in one way
#         and columns of type "object" and "category" in other ways
# Hint 3: Consider using the pandas functions mean() and mode() respectively, as well as fillna
# Hint 4: In the rare case of all values in a column being missing, replace numeric values with 0,
#         object values with "" and category values with the first category (cat.categories[0])  
#
# Input to apply_imputation:
# df: a dataframe
# imputation: a mapping (dictionary) from column name to value that should replace missing values
#
# Output from apply_imputation:
# df: a new dataframe, where each missing value has been replaced according to the mapping
#
# Hint 1: First copy the input dataframe and modify the copy (the input dataframe should be kept unchanged)
# Hint 2: Consider using fillna

def create_imputation(df):
    copydf = df.copy()
    imputation = {}
    
    cols = copydf.loc[:, ~df.columns.isin(['ID', 'CLASS'])] # Dropping ID and CLASS columns
    numbercols = cols.select_dtypes(include="number") # Choosing numeric (int and float) columns
    othercols = cols.select_dtypes(exclude="number") # Choosing all other columns 
    
    for column in numbercols:
        fill = copydf[column].mean() # For numeric columns the replacement value is the mean
        if pd.isna(fill): # The mean is NaN only if the whole column was NaN
            fill = 0 # Replace with 0
        copydf[column] = copydf[column].fillna(fill)
        imputation[column] = fill
        
    for column in othercols:
        # mode() returns a Series, which is empty if every value is missing,
        # so the mode itself is taken with iloc[0] only when one exists
        modes = copydf[column].mode()
        if not modes.empty:
            copydf[column] = copydf[column].fillna(modes.iloc[0])
            imputation[column] = modes.iloc[0]
        
        # Check if column still has some NA 
        # if it does, change it to category [to match the output] and replace it with the first category
        if copydf[column].isna().any():
            copydf[column] = copydf[column].astype('category')
            categories = copydf[column].cat.categories
            # An all-missing object column has no categories after the cast, so "" is used instead
            if len(categories) > 0:
                fill = categories[0]
            else:
                fill = ""
                copydf[column] = copydf[column].cat.add_categories([fill])
            copydf[column] = copydf[column].fillna(fill)
            imputation[column] = fill
            
            # print("CopyDF: ", column, copydf[column].dtype, copydf[column].astype('category').cat.categories[0])
            
    return copydf, imputation

def apply_imputation(df, imputation):
    copydf = df.copy()
    
    for column in imputation:
        # Assign the result back: an in-place fillna on a selected column may act on a temporary copy
        copydf[column] = copydf[column].fillna(imputation[column])
    
    return copydf

In [5]:
# Test your code (leave this part unchanged)

anneal_train_df = pd.read_csv("anneal_train.txt")
anneal_test_df = pd.read_csv("anneal_test.txt")

anneal_train_imp, imputation = create_imputation(anneal_train_df)
anneal_test_imp = apply_imputation(anneal_test_df,imputation)

print("Imputation:\n")
for f in imputation:
    print("{}:{}".format(f,imputation[f]))

print("\nNo. of replaced missing values in training data:\n{}".format(anneal_train_imp.count()-anneal_train_df.count()))
print("\nNo. of replaced missing values in test data:\n{}".format(anneal_test_imp.count()-anneal_test_df.count()))



Imputation:

carbon:3.859688195991091
hardness:13.084632516703786
formability:2.251748251748252
strength:26.302895322939868
enamelability:1.7142857142857142
m:0
marvi:0
exptl:0
corr:0
jurofm:0
s:0
p:0
thick:1.1911937639198218
width:769.4917594654788
len:1229.293986636971
bore:35.18930957683742
packing:3.0
family:TN
product-type:C
steel:A
temper_rolling:T
condition:S
non-ageing:N
surface-finish:P
surface-quality:E
bc:Y
bf:Y
bt:Y
bw/me:B
bl:Y
chrom:C
phos:P
cbond:Y
ferro:Y
blue-bright-varn-clean:B
lustre:Y
shape:SHEET
oil:Y

No. of replaced missing values in training data:
family                    382
product-type                0
steel                      43
carbon                      0
hardness                    0
temper_rolling            374
condition                 160
formability               163
strength                    0
non-ageing                391
surface-finish            444
surface-quality           128
enamelability             442
bc                        448
bf

### Comment on assumptions, things that do not work properly, etc.
We did not have any categorical data to test our code against. The explicit cast to the category type is there to match the expected outputs. The fallback branch for columns in which every value is missing also contains a sanity check for object-type data, although a check against 'object' is not really needed after the explicit cast.
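
The idea behind the two functions is that the replacement values are learned from the training data only and then reused unchanged on the test data. A small self-contained illustration with made-up data (the column names `x` and `c` are hypothetical):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"x": [1.0, np.nan, 3.0], "c": ["a", "a", None]})
test = pd.DataFrame({"x": [np.nan, 10.0], "c": [None, "b"]})

# Learn the replacement values from the training data only
fill_values = {"x": train["x"].mean(), "c": train["c"].mode().iloc[0]}

# fillna accepts a dict that maps column names to replacement values
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(fill_values)
print(test_filled)
```

Note that the missing `x` in the test set is replaced by the training mean (2.0), not by a statistic of the test set itself.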

## 1c. Create and apply discretization

In [0]:
# Insert the functions create_bins and apply_bins below
#
# Input to create_bins:
# df: a dataframe
# nobins: no. of bins (default = 10)
# bintype: either "equal-width" (default) or "equal-size" 
#
# Output from create_bins:
# df: a new dataframe, where each numeric feature value has been replaced by a categoric (corresponding to some bin)
# binning: a mapping (dictionary) from column name to bins (threshold values for the bin)
#
# Hint 1: First copy the input dataframe and modify the copy (the input dataframe should be kept unchanged)
# Hint 2: Discretize columns of type "float" or "int" only (and which are not labeled "CLASS" or "ID")
# Hint 3: Consider using pd.cut and pd.qcut respectively, with labels=False, retbins=True and duplicates="drop"
#         (the last option will avoid errors when not enough bins can be created)
# Hint 4: Set all columns in the new dataframe to be of type "category"
# Hint 5: Set the categories of the discretized features to be [0,...,nobins-1]
# Hint 6: Change the first and the last element of each binning to -np.inf and np.inf respectively 
#
# Input to apply_bins:
# df: a dataframe
# binning: a mapping (dictionary) from column name to bins (threshold values for the bin)
#
# Output from apply_bins:
# df: a new dataframe, where each numeric feature value has been replaced by a categoric (corresponding to some bin)
#
# Hint 1: First copy the input dataframe and modify the copy (the input dataframe should be kept unchanged)
# Hint 2: Consider using pd.cut 
# Hint 3: Set all columns in the new dataframe to be of type "category"
# Hint 4: Set the categories of the discretized features to be [0,...,nobins-1]
#

def create_bins(df, nobins = 10, bintype = "equal-width"):
    copydf = df.copy()
    binning = {}
    
    cols = copydf.loc[:, ~df.columns.isin(['ID', 'CLASS'])]
    numbercols = cols.select_dtypes(include="number") # numeric (int and float) columns only
    
    for col in numbercols:
        cutfunc = pd.cut if bintype == "equal-width" else pd.qcut
        copydf[col], bins = cutfunc(copydf[col], nobins, labels=False, retbins=True, duplicates="drop")
        # hint 4-5
        copydf[col] = copydf[col].astype(pd.CategoricalDtype(categories=list(range(nobins))))
        
        # hint 6
        bins[0] = -np.inf
        bins[-1] = np.inf
        binning[col] = bins
        
    return copydf, binning

def apply_bins(df, binning):
    copydf = df.copy()
    
    for column in binning:
        copydf[column] = pd.cut(copydf[column], labels=False, bins=binning[column])
        
        # hint 3-4
        nobins = len(binning[column]) - 1 # n bins are delimited by n+1 threshold values
        copydf[column] = copydf[column].astype(pd.CategoricalDtype(categories=list(range(nobins))))
    
    return copydf

In [7]:
# Test your code  (leave this part unchanged)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

glass_train_disc, binning = create_bins(glass_train_df,nobins=10,bintype="equal-size")
print("binning:\n")
for f in binning:
    print("{}:{}".format(f,binning[f]))

glass_test_disc = apply_bins(glass_test_df,binning)
print("\nglass_test_disc:\n")
glass_test_disc


binning:

RI:[    -inf 1.515896 1.51618  1.516516 1.516866 1.51753  1.517902 1.518618
 1.520114 1.521846      inf]
Na:[  -inf 12.73  12.872 13.    13.222 13.38  13.492 13.794 14.198 14.82
    inf]
Mg:[ -inf 1.82  3.188 3.41  3.476 3.55  3.61  3.728   inf]
Al:[ -inf 0.906 1.172 1.23  1.348 1.48  1.54  1.622 1.808 2.094   inf]
Si:[  -inf 71.756 72.196 72.388 72.72  72.79  72.966 73.06  73.208 73.372
    inf]
K:[ -inf 0.006 0.148 0.39  0.54  0.576 0.6   0.636 0.67    inf]
Ca:[  -inf  7.978  8.112  8.338  8.554  8.67   8.81   9.032  9.674 10.924
    inf]
Ba:[-inf 0.78  inf]
Fe:[ -inf 0.062 0.118 0.24    inf]

glass_test_disc:





Unnamed: 0,ID,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,CLASS
0,101,3,1,1,4,8,4,5,0,2,2
1,104,9,7,1,0,0,1,9,0,0,2
2,44,9,6,7,0,1,2,8,0,0,1
3,17,5,0,6,1,7,6,5,0,0,1
4,81,1,1,4,9,3,8,0,0,0,2
5,142,6,3,6,1,5,4,3,0,2,2
6,120,3,6,5,4,3,7,0,0,0,2
7,123,4,4,4,4,5,4,1,0,0,2
8,133,6,5,7,2,3,5,2,0,0,2
9,185,0,9,0,0,9,0,0,0,0,6


### Comment on assumptions, things that do not work properly, etc.
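The binning printed above has fewer than 10 bins for some features (for example `Ba`). A likely explanation is that, with `duplicates="drop"`, `pd.qcut` merges quantile edges that coincide, which happens when a single value dominates a column. A small illustration with made-up data:

```python
import pandas as pd

# Eight of the ten values are 0, so most quantile edges coincide
values = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 2])

# Equal-width binning yields the requested number of bins
_, width_edges = pd.cut(values, 4, labels=False, retbins=True)

# Equal-size binning drops duplicate edges and may yield fewer bins
_, size_edges = pd.qcut(values, 4, labels=False, retbins=True, duplicates="drop")

print(len(width_edges) - 1)  # number of equal-width bins
print(len(size_edges) - 1)   # number of equal-size bins after dropping duplicates
```

As a consequence, some of the categories [0,...,nobins-1] can remain unused for such features.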

## 1d. Divide a dataset into a training and a test set

In [0]:
# Insert the function split below
#
# Input to split:
# df: a dataframe
# testfraction: a float in the range (0,1) (default = 0.5)
#
# Output from split:
# trainingdf: a dataframe consisting of a random sample of (1-testfraction) of the rows in df
# testdf: a dataframe consisting of the rows in df that are not included in trainingdf
#
# Hint: You may use np.random.permutation(df.index) to get a permuted list of indexes where a 
#       prefix corresponds to the test instances, and the suffix to the training instances 


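# OLD FUNCTION (superseded by the redefinition of split further down) - Creates a totally random split
# Note: range(1,testsamples) skips the first permuted position, so one row ends up in neither set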
def split(df, testfraction = 0.5):
    permutation = np.random.permutation(len(df))
    testsamples = int(len(permutation)*testfraction)
    
    testdf = df.iloc[ permutation[range(1,testsamples)] ]
    trainingdf = df.iloc[ permutation[range(testsamples,len(permutation))] ]

    return trainingdf, testdf

##  FASIH
# Redefined to permute df.index with np.random.permutation(df.index), as in the hint; previously we permuted range(len(df))
# Rows keep their original order within each of the two sets
def split(df, testfraction = 0.5):
    permutation = np.random.permutation(df.index)
    cut = int(np.ceil(len(permutation)*testfraction))
    
    testdf = df[df.index.isin(permutation[:cut])]
    trainingdf = df[df.index.isin(permutation[cut:])]
    
    return trainingdf, testdf


In [0]:
# Test your code  (leave this part unchanged)

glass_df = pd.read_csv("glass.txt")

glass_train, glass_test = split(glass_df,testfraction=0.25)

print("Training IDs:\n{}".format(glass_train["ID"].values))

print("\nTest IDs:\n{}".format(glass_test["ID"].values))

print("\nOverlap: {}".format(set(glass_train["ID"]).intersection(set(glass_test["ID"]))))


Training IDs:
[  1   3   4   5   7   8   9  12  13  16  17  18  19  20  21  22  23  25
  26  27  28  29  30  31  32  33  34  36  38  39  42  43  45  46  48  49
  50  51  52  53  54  55  56  57  59  61  62  63  64  65  66  67  69  70
  71  72  73  74  75  77  78  79  80  81  82  84  85  86  90  91  93  95
  96  97  98  99 100 101 103 104 105 106 107 108 109 110 112 113 114 115
 116 117 118 119 120 122 123 124 126 127 128 129 130 132 133 136 137 138
 141 142 144 145 146 147 148 149 150 151 152 153 155 156 159 160 162 164
 165 166 167 169 170 173 174 175 176 177 178 180 183 186 187 188 190 192
 195 196 197 198 199 200 201 203 206 207 208 209 210 211 213 214]

Test IDs:
[  2   6  10  11  14  15  24  35  37  40  41  44  47  58  60  68  76  83
  87  88  89  92  94 102 111 121 125 131 134 135 139 140 143 154 157 158
 161 163 168 171 172 179 181 182 184 185 189 191 193 194 202 204 205 212]

Overlap: set()


### Comment on assumptions, things that do not work properly, etc.
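Because `split` draws a new random permutation on every call, the printed IDs differ between runs. A minimal sketch of a reproducible variant, assuming that a seeded NumPy `Generator` is acceptable for the assignment (the name `split_seeded` and the `seed` parameter are hypothetical additions):

```python
import numpy as np
import pandas as pd

def split_seeded(df, testfraction=0.5, seed=0):
    """Like split above, but reproducible for a fixed seed (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(df.index.to_numpy())
    cut = int(np.ceil(len(permutation) * testfraction))
    # The prefix of the permutation forms the test set, the suffix the training set
    testdf = df[df.index.isin(permutation[:cut])]
    trainingdf = df[df.index.isin(permutation[cut:])]
    return trainingdf, testdf

toy = pd.DataFrame({"ID": range(1, 9)})
train_a, test_a = split_seeded(toy, testfraction=0.25, seed=1)
train_b, test_b = split_seeded(toy, testfraction=0.25, seed=1)
print(len(train_a), len(test_a))  # 6 2
print(test_a["ID"].tolist() == test_b["ID"].tolist())  # True
```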

## 1e. Calculate accuracy of a set of predictions

In [0]:
# Insert the function accuracy below
#
# Input to accuracy:
# df: a dataframe with class labels as column names and each row corresponding to
#     a prediction with estimated probabilities for each class
# correctlabels: an array (or list) of the correct class label for each prediction
#                (the number of correct labels must equal the number of rows in df)
#
# Output from accuracy:
# accuracy: the fraction of cases for which the predicted class label coincides with the correct label
#
# Hint: In case the label receiving the highest probability is not unique, you may
#       resolve that by picking the first (as ordered by the column names) or 
#       by randomly selecting one of the labels with highest probability.

def accuracy(df, correctlabels):
    return sum(df.idxmax(axis=1)==correctlabels)/len(correctlabels)

In [0]:
# Test your code  (leave this part unchanged)

predictions = pd.DataFrame({"A":[0.5,0.5,0.5,0.25,0.25],"B":[0.5,0.25,0.25,0.5,0.25],"C":[0.0,0.25,0.25,0.25,0.5]})
predictions


In [0]:
correctlabels = ["B","A","B","B","C"]

accuracy(predictions,correctlabels) # Note that depending on how ties are resolved the accuracy may be 0.6 or 0.8

### Comment on assumptions, things that do not work properly, etc.
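`idxmax` returns the first column that holds the row maximum, so ties are resolved in favor of the first label in column order. For the test predictions above this gives 0.6 rather than 0.8, as the following check illustrates:

```python
import pandas as pd

predictions = pd.DataFrame({"A": [0.5, 0.5, 0.5, 0.25, 0.25],
                            "B": [0.5, 0.25, 0.25, 0.5, 0.25],
                            "C": [0.0, 0.25, 0.25, 0.25, 0.5]})
correctlabels = ["B", "A", "B", "B", "C"]

# The first row is a tie between A and B; idxmax picks A, the first column
predicted = predictions.idxmax(axis=1)
print(predicted.tolist())  # ['A', 'A', 'A', 'B', 'C']

acc = (predicted == pd.Series(correctlabels)).mean()
print(acc)  # 0.6
```

Resolving the tie in the first row randomly would instead give 0.8 whenever B happens to be picked.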

## 2a. Create and apply one-hot encoding

In [0]:
# Insert the functions create_one_hot and apply_one_hot below
#
# Input to create_one_hot:
# df: a dataframe
#
# Output from create_one_hot:
# df: a new dataframe, where each categoric feature has been replaced by a set of binary features 
#    (as many new features as there are possible values)
# one_hot: a mapping (dictionary) from column name to a set of categories (possible values for the feature)
#
# Hint 1: First copy the input dataframe and modify the copy (the input dataframe should be kept unchanged)
# Hint 2: Consider columns of type "object" or "category" only (and which are not labeled "CLASS" or "ID")
# Hint 3: Consider creating new column names by merging the original column name and the categorical value
# Hint 4: Set all new columns to be of type "float"
# Hint 5: Do not forget to remove the original categoric feature
#
# Input to apply_one_hot:
# df: a dataframe
# one_hot: a mapping (dictionary) from column name to categories
#
# Output from apply_one_hot:
# df: a new dataframe, where each categoric feature has been replaced by a set of binary features
#
# Hint: See the above Hints

def create_one_hot(df):
    copydf = df.copy()
    one_hot = {}
    
    # keep all columns except ID and CLASS
    cols = copydf.loc[:, ~df.columns.isin(['ID', 'CLASS'])]
    # keep only categorical data
    cols = cols.select_dtypes(include=["object", "category"])
    
    for col in cols:
        # convert to get categories
        if copydf[col].dtype == 'object':
            copydf[col] = copydf[col].astype('category')
            
        # convert categorical to binary (one-hot)    
        dummies = pd.get_dummies(copydf[col])

        # create binary column for each category
        one_hot[col] = {}
        for cat in dummies:
            new_col = col + '-' + str(cat) # str() in case the category is not a string
            one_hot[col][cat] = new_col
            # convert to float type (hint 4)
            copydf[new_col] = dummies[cat].astype('float')
            
        # drop original columns     
        copydf.drop(col, axis=1, inplace=True)
    
    return copydf, one_hot

def apply_one_hot(df, one_hot):
    copydf = df.copy()
    
    for col in one_hot:
        # iterate over the categories recorded by create_one_hot
        for (cat, col_name) in one_hot[col].items():
            # 1.0 where the value equals the category, 0.0 otherwise (float type, hint 4);
            # a direct comparison also works for categories that do not occur in df
            copydf[col_name] = (copydf[col] == cat).astype('float')
        copydf.drop(col, axis=1, inplace=True)
    
    return copydf

# A float type for the new columns does not make much sense; a binary type would be enough. 

In [0]:
# Test your code  (leave this part unchanged)

tictactoe = pd.read_csv("tic-tac-toe.txt")

train_df, test_df = split(tictactoe) # Using your above function

new_train, one_hot = create_one_hot(train_df)

new_test = apply_one_hot(test_df,one_hot)
new_test

### Comment on assumptions, things that do not work properly, etc.
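A test set may lack a value that occurred in the training set, in which case `pd.get_dummies` applied to the test column alone produces no column for that value. One possible way around this, sketched here with made-up data, is to fix the categories from the training data before encoding (the variable names are hypothetical):

```python
import pandas as pd

train_col = pd.Series(["x", "o", "b", "x"])
test_col = pd.Series(["x", "o"])  # the value "b" never occurs here

# Fix the set of categories using the training data
categories = sorted(train_col.unique())

# Declaring the categories makes get_dummies emit one column for each of them
test_cat = pd.Series(pd.Categorical(test_col, categories=categories))
encoded = pd.get_dummies(test_cat, dtype=float)

print(encoded.columns.tolist())  # ['b', 'o', 'x']
print(encoded["b"].tolist())     # [0.0, 0.0]
```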

## 2b. Divide a dataset into a number of folds

In [0]:
# Insert the function folds below
#
# Input to folds:
# df: a dataframe
# nofolds: an integer greater than 1 (default = 10)
#
# Output from folds:
# folds: a list (of length = nofolds) dataframes consisting of random non-overlapping, 
#        approximately equal-sized subsets of the rows in df
#
# Hint: You may use np.random.permutation(df.index) to get a permuted list of indexes from which a 
#       prefix corresponds to the test instances, and the suffix to the training instances 

def folds(df, nofolds=10):
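    if nofolds < 2:
        raise ValueError("nofolds parameter can not be {}. It must be greater than 1.".format(nofolds))
    
    # shuffle the index labels, split them into nofolds nearly equal parts and select the rows of each part
    # (assumes that the index labels of df are unique)
    return [df.loc[part] for part in np.array_split(np.random.permutation(df.index), nofolds)]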



In [0]:
# Test your code  (leave this part unchanged)

glass_df = pd.read_csv("glass.txt")

glass_folds = folds(glass_df,nofolds=5)

fold_sizes = [len(f) for f in glass_folds]

print("Fold sizes:{}\nTotal no. instances: {}".format(fold_sizes,sum(fold_sizes)))

### Comment on assumptions, things that do not work properly, etc.
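A sketch of how the fold sizes come about, splitting the shuffled index labels rather than the dataframe itself. A stand-in frame with 214 rows is used here (the number of IDs printed in 1d), since the glass data itself is not reproduced:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"ID": range(1, 215)})  # 214 rows

# Split the shuffled index labels into 5 nearly equal parts
parts = np.array_split(np.random.permutation(toy.index), 5)
toy_folds = [toy.loc[part] for part in parts]

fold_sizes = [len(f) for f in toy_folds]
print(fold_sizes)       # [43, 43, 43, 43, 42]
print(sum(fold_sizes))  # 214
```

`np.array_split` gives the first `214 % 5 = 4` parts one extra element, which is why only the last fold is smaller.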

## 2c. Calculate Brier score of a set of predictions

In [0]:
# Insert the function brier_score below
#
# Input to brier_score:
# df: a dataframe with class labels as column names and each row corresponding to
#     a prediction with estimated probabilities for each class
# correctlabels: an array (or list) of the correct class label for each prediction
#                (the number of correct labels must equal the number of rows in df)
#
# Output from brier_score:
# brier_score: the average square error of the predicted probabilities 
#
# Hint: Compare each predicted vector to a vector for each correct label, which is all zeros except 
#       for at the index of the correct class. The index can be found using np.where(df.columns==l)[0] 
#       where l is the correct label.

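def brier_score(df, correctlabels):
    # get observed probabilities (one for the correct label, otherwise zero),
    # with one column per predicted class, in the same order as the columns of df
    observed_probs = pd.get_dummies(correctlabels).reindex(columns=df.columns, fill_value=0)
    # vectorized Brier score formula from the slides: (probability - observed_probability) squared;
    # plain arrays are used so that rows are matched by position and not by index label
    score = (df.to_numpy() - observed_probs.to_numpy(dtype=float)) ** 2
    # sum over the labels of each instance
    score = score.sum(axis=1)
    # average over instances
    score = score.mean()
        
    return score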


In [0]:
# Test your code  (leave this part unchanged)

predictions = pd.DataFrame({"A":[0.5,0.5,0.5,0.25,0.25],"B":[0.5,0.25,0.25,0.5,0.25],"C":[0.0,0.25,0.25,0.25,0.5]})

correctlabels = ["B","A","B","B","C"]

brier_score(predictions,correctlabels)

### Comment on assumptions, things that do not work properly, etc.
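As a check on the test case above, the score can be worked out instance by instance. The following sketch hard-codes the test predictions and the one-hot vectors for the correct labels B, A, B, B, C:

```python
import numpy as np

# Predicted probabilities for classes A, B, C (one row per instance)
probs = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.25, 0.25],
                  [0.5, 0.25, 0.25],
                  [0.25, 0.5, 0.25],
                  [0.25, 0.25, 0.5]])
# One-hot vectors for the correct labels B, A, B, B, C
observed = np.array([[0, 1, 0],
                     [1, 0, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])

# Squared error summed over the classes of each instance
per_instance = ((probs - observed) ** 2).sum(axis=1)
print(per_instance.tolist())  # [0.5, 0.375, 0.875, 0.375, 0.375]
print(per_instance.mean())    # 0.5
```

For this test case, `brier_score(predictions,correctlabels)` should therefore return 0.5.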

## 2d. Calculate AUC of a set of predictions

In [0]:
# Insert the function auc below
#
# Input to auc:
# df: a dataframe with class labels as column names and each row corresponding to
#     a prediction with estimated probabilities for each class
# correctlabels: an array (or list) of the correct class label for each prediction
#                (the number of correct labels must equal the number of rows in df)
#
# Output from auc:
# auc: the weighted area under ROC curve
#
# Hint 1: Calculate the binary AUC first for each class label c, i.e., treating the
#         predicted probability of this class for each instance as a score; the positive
#         instances are the ones belonging to class c and the negative instances the rest
# Hint 2: When calculating the binary AUC, first find the scores of the positive instances and then
#         the scores of the negative instances
# Hint 3: You may use a dictionary with a mapping from each score to an array of two numbers; 
#         the number of positive instances with this score and the number of negative instances with this score
# Hint 4: Created a (reversely) sorted (on the scores) list of pairs from the dictionary and
#         iterate over this to additively calculate the AUC
# Hint 5: For each pair in the above list, there are three cases to consider; the no. of true positives
#         (tp_i) is zero, the number of false positives (fp_i) (negatives) is zero, and both are non-zero
# Hint 6: Calculate the weighted AUC by summing the individual AUCs weighted by the relative
#         frequency of each class (as estimated from the correct labels)

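# helper for one class (c is the column position of the class in predictions):
# returns the pair (tpr, fpr) obtained with the given threshold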
def auc_single(predictions, correctlabels, threshold, c):
   
    # array with true for correct labels for class c (by row index)
    correctlabels_class = np.array(correctlabels)==predictions.columns[c]
    
    # array with predictions for all instances that should be classified class c
    predictions_class = predictions[ predictions.columns[c] ]
    
    # array with true for all correctly predicted labels according to threshold
    predicted_labels = predictions_class[correctlabels_class] >= threshold
    pos = sum(predicted_labels)
    
    # correctly predicted instances (according to threshold) divided by total number of instances that should be class c
    tpr = pos / sum(correctlabels_class)
    
    # repeat for false positive rate (instances not in class)
    not_correctlabels_class = np.array(correctlabels)!=predictions.columns[c]
    predictions_class = predictions[ predictions.columns[c] ]
    predicted_labels = predictions_class[not_correctlabels_class] >= threshold
    neg = sum(predicted_labels)
    fpr = neg / sum(not_correctlabels_class)
    
    return tpr, fpr


def auc(predictions, correctlabels):
    thresholds = np.unique(predictions)
    total_number_of_labels = len(correctlabels)
    
    AUCs = {}
    
    # iterate over all classes and calculate the area under the ROC(tpr/fpr) curve (AUC)
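    for c in np.unique(correctlabels):
        # look up the column position by label, so that the column order need not match the sorted labels
        idx = predictions.columns.get_loc(c)
        single = [auc_single(predictions, correctlabels, t, idx) for t in reversed(thresholds)]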
                    
        # calculate AUC as area under the curve
        AUC = 0
        tpr_last = 0
        fpr_last = 0
        
        # iterate over all thresholds
        for s in single:
            tpr, fpr = s
            
            # Case 1.) both rates increased: add the rectangle below the previous point plus the triangle on top        
            if tpr > tpr_last and fpr > fpr_last:
                AUC += (fpr-fpr_last)*tpr_last + (fpr-fpr_last)*(tpr-tpr_last) / 2
            
            # Case 2.) only fpr increased: add the area of a rectangle            
            elif fpr > fpr_last:
                AUC += (fpr-fpr_last)*tpr
            
            # update point coordinates (tpr, fpr) of curve
            tpr_last = tpr
            fpr_last = fpr
       
        AUCs[c] = AUC
        
                
    # take the weighted average over all classes (weighted by their frequency of occurrence)
    AUC_total = 0
    for (cName, class_auc) in AUCs.items():
        number_of_labels = np.sum(np.array(correctlabels) == cName)
        weight = number_of_labels / total_number_of_labels
        AUC_total += weight * class_auc
        
    return AUC_total    
    
    #print(tpr)
    #print(fpr)
    
    #plt.plot([0.0]+fpr+[1.0],[0.0]+tpr+[1.0],"-",label="1")
    #plt.plot([0.0,1.0],[0.0,1.0],"--",label="Baseline")
    #plt.xlabel("fpr")
    #plt.ylabel("tpr")
    #plt.legend()
    #plt.show() 


In [0]:
# Test your code  (leave this part unchanged)

predictions = pd.DataFrame({"A":[0.9,0.9,0.6,0.55],"B":[0.1,0.1,0.4,0.45]})

correctlabels = ["A","B","B","A"]

auc(predictions,correctlabels)



0.375

In [0]:
predictions = pd.DataFrame({"A":[0.5,0.5,0.5,0.25,0.25],"B":[0.5,0.25,0.25,0.5,0.25],"C":[0.0,0.25,0.25,0.25,0.5]})

correctlabels = ["B","A","B","B","C"]

auc(predictions,correctlabels)

0.8499999999999999

### Comment on assumptions, things that do not work properly, etc.
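As a cross-check of the threshold-based implementation above, the binary AUC can also be computed as the fraction of (positive, negative) pairs in which the positive instance receives the higher score, counting ties as one half. A minimal sketch, assuming that every class has at least one positive and one negative instance (the name `pairwise_auc` is hypothetical):

```python
import numpy as np
import pandas as pd

def pairwise_auc(predictions, correctlabels):
    """Weighted AUC via pairwise comparison of positive and negative scores."""
    labels = np.array(correctlabels)
    total = 0.0
    for c in np.unique(labels):
        scores = predictions[c].to_numpy()
        pos = scores[labels == c]
        neg = scores[labels != c]
        # Compare every positive score with every negative score; ties count one half
        wins = (pos[:, None] > neg[None, :]).sum()
        ties = (pos[:, None] == neg[None, :]).sum()
        binary_auc = (wins + 0.5 * ties) / (len(pos) * len(neg))
        # Weight by the relative frequency of the class
        total += binary_auc * len(pos) / len(labels)
    return total

two_class = pd.DataFrame({"A": [0.9, 0.9, 0.6, 0.55], "B": [0.1, 0.1, 0.4, 0.45]})
auc_two = pairwise_auc(two_class, ["A", "B", "B", "A"])
print(auc_two)  # 0.375

three_class = pd.DataFrame({"A": [0.5, 0.5, 0.5, 0.25, 0.25],
                            "B": [0.5, 0.25, 0.25, 0.5, 0.25],
                            "C": [0.0, 0.25, 0.25, 0.25, 0.5]})
auc_three = pairwise_auc(three_class, ["B", "A", "B", "B", "C"])
print(round(auc_three, 6))  # 0.85
```

Both values agree with the outputs of `auc` shown above.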