# Compute performance metrics for the given Y and Y_score without sklearn

In \[1\]:

    import numpy as np
    import pandas as pd
    # other than these two you should not import any other packages

## A. Compute performance metrics for the given data '5_a.csv'

    Note 1: in this data the number of positive points >> the number of negative points
    Note 2: use pandas or numpy to read the data from 5_a.csv
    Note 3: you need to derive the class labels from the given score

$y^{pred} = \text{[0 if y_score < 0.5 else 1]}$

    Compute Confusion Matrix
    Compute F1 Score
    Compute AUC Score: compute tpr and fpr for each threshold, then use
        numpy.trapz(tpr_array, fpr_array), not numpy.trapz(fpr_array, tpr_array)
        https://stackoverflow.com/q/53603376/4084039
        https://stackoverflow.com/a/39678975/4084039
        Note: arrange the probability scores in descending order while calculating AUC
    Compute Accuracy Score

In \[2\]:

    df_a=pd.read_csv('5_a.csv')
    df_a.head()

Out\[2\]:

|     | y   | proba    |
|-----|-----|----------|
| 0   | 1.0 | 0.637387 |
| 1   | 1.0 | 0.635165 |
| 2   | 1.0 | 0.766586 |
| 3   | 1.0 | 0.724564 |
| 4   | 1.0 | 0.889199 |

## CONVERTING PROBABILITIES TO OUTPUT LABEL

In \[3\]:

    def convert_prob2label(prob,threshold):
        # prob : list of predicted output
        #threshold : threshold value to make output 1 or 0
        label=[]
        for i in prob:
            if i>=threshold:
                label.append(1)
            else:
                label.append(0)
        return label
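
The loop above can also be written as one vectorized comparison. The following is a minimal sketch, assuming the scores fit in a NumPy array; the name `convert_prob2label_np` is hypothetical and not part of the original notebook.

```python
import numpy as np

def convert_prob2label_np(prob, threshold):
    """Return 1 where prob >= threshold, else 0 (hypothetical vectorized variant)."""
    prob = np.asarray(prob, dtype=float)
    # The boolean comparison is evaluated element-wise, then cast to 0/1 integers.
    return (prob >= threshold).astype(int)

labels = convert_prob2label_np([0.2, 0.5, 0.9], 0.5)
print(labels.tolist())  # [0, 1, 1]
```

As in the loop version, a score exactly equal to the threshold is labelled 1.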

## FUNCTION TO GET CONFUSION MATRIX WHICH INCLUDE TRUE POSITIVE, TRUE NEGATIVE, FALSE POSITIVE, FALSE NEGATIVE

In \[4\]:

    def confusion_matrix(true_label, predicted_prob):
        # true_label : list of true labels (0 or 1)
        # predicted_prob : list of predicted probabilities
        # (0/1 labels may also be passed; the 0.5 cutoff leaves them unchanged)
        predicted_label = convert_prob2label(predicted_prob, 0.5)  # threshold = 0.5
        T_N = 0  # true negatives
        T_P = 0  # true positives
        F_P = 0  # false positives
        F_N = 0  # false negatives
        for i in range(len(true_label)):
            if true_label[i] == 0 and predicted_label[i] == 0:
                T_N += 1
            elif true_label[i] == 1 and predicted_label[i] == 1:
                T_P += 1
            elif true_label[i] == 1 and predicted_label[i] == 0:
                F_N += 1
            else:
                F_P += 1
        # rows = predicted class, columns = actual class
        conf_mat = np.array([[T_N, F_N],
                             [F_P, T_P]])
        return T_N, F_N, F_P, T_P, conf_mat
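
The four counts can also be obtained without an explicit loop by combining boolean masks. This is a sketch only; `confusion_counts_np` is a hypothetical helper that takes 0/1 labels (not probabilities) and returns the counts in the same order as the function above.

```python
import numpy as np

def confusion_counts_np(true_label, predicted_label):
    """Count TN, FN, FP, TP from two 0/1 label sequences (hypothetical helper)."""
    y = np.asarray(true_label).astype(bool)
    p = np.asarray(predicted_label).astype(bool)
    t_n = int(np.sum(~y & ~p))  # actual 0, predicted 0
    f_n = int(np.sum(y & ~p))   # actual 1, predicted 0
    f_p = int(np.sum(~y & p))   # actual 0, predicted 1
    t_p = int(np.sum(y & p))    # actual 1, predicted 1
    return t_n, f_n, f_p, t_p

print(confusion_counts_np([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))  # (1, 1, 1, 2)
```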

## COMPUTING F1 SCORE

In \[5\]:

    def F1_score(a,b):
        # a : list of true label
        # b : list of predicted probabilities
        T_N, F_N, F_P, T_P, _ = confusion_matrix(a, b)
        precision = T_P / (T_P+F_P)   # precision value
        recall = T_P / (F_N+T_P)      # recall value
        f1_score = (2* precision * recall)/ (precision+recall)
        return f1_score
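
Substituting precision = TP / (TP + FP) and recall = TP / (TP + FN) into 2PR / (P + R) gives F1 = 2·TP / (2·TP + FP + FN). The sketch below uses that form with the counts reported in this notebook's confusion matrices. The name `f1_from_counts` is hypothetical, and returning 0.0 when the denominator is zero is a convention assumed here, not something the notebook defines.

```python
def f1_from_counts(t_p, f_p, f_n):
    """F1 written directly in terms of counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * t_p + f_p + f_n
    # Assumed convention: report 0.0 when there are no positives of any kind.
    return 2 * t_p / denom if denom else 0.0

print(f1_from_counts(t_p=10000, f_p=100, f_n=0))  # counts reported for 5_a.csv
print(f1_from_counts(t_p=55, f_p=239, f_n=45))    # counts reported for 5_b.csv
```

Both results agree with the F1 values recorded in Parts A and B up to floating-point rounding.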

## FUNCTION TO GET TPR/FPR FOR ROC CURVE

In \[6\]:

    def ROC_Curve_TPR_FPR(true_label, predicted_prob):
        # true_label : list of true labels (0 or 1)
        # predicted_prob : list of predicted probabilities
        lst_threshold = sorted(predicted_prob, reverse=True)  # every score, in descending order, is used as a threshold
        TPR = []     # true positive rate at each threshold
        FPR = []     # false positive rate at each threshold
        for threshold in lst_threshold:
            predicted_label = convert_prob2label(predicted_prob, threshold)
            # confusion_matrix re-applies a 0.5 cutoff, which leaves 0/1 labels unchanged
            T_N, F_N, F_P, T_P, _ = confusion_matrix(true_label, predicted_label)
            TPR.append(T_P / (F_N + T_P))
            FPR.append(F_P / (T_N + F_P))
        return TPR, FPR
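
The function above rebuilds the full confusion matrix for every threshold, so its cost grows quadratically with the number of points. A common alternative sorts the scores once and uses cumulative sums. The sketch below is not the notebook's method: `roc_points_sorted` is a hypothetical name, and it assumes all scores are distinct (tied scores would need to be grouped to match the `>=` rule used above).

```python
import numpy as np

def roc_points_sorted(true_label, predicted_prob):
    """TPR/FPR with each score used as a threshold, via one sort and cumulative sums.

    Assumes distinct scores and at least one point of each class.
    """
    y = np.asarray(true_label, dtype=int)
    s = np.asarray(predicted_prob, dtype=float)
    order = np.argsort(-s, kind="stable")  # indices of scores in descending order
    y_sorted = y[order]
    tp = np.cumsum(y_sorted)       # positives among the k highest scores
    fp = np.cumsum(1 - y_sorted)   # negatives among the k highest scores
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return tp / n_pos, fp / n_neg

tpr, fpr = roc_points_sorted([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(tpr.tolist())  # [0.5, 0.5, 1.0, 1.0]
print(fpr.tolist())  # [0.0, 0.5, 0.5, 1.0]
```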
        

## FUNCTION TO DISPLAY ROC CURVE

In \[7\]:

    def display_ROC_curve(TPR,FPR):
        df_roc= pd.DataFrame({'tpr': TPR, 'fpr': FPR })
        df_roc.plot(x= 'fpr', y= 'tpr')
        return

## COMPUTING AUC SCORE

In \[8\]:

    def auc_score(TPR, FPR):
        return np.trapz(TPR, FPR)
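
To make the argument order concrete, the sketch below writes the trapezoidal rule out by hand on a small made-up ROC curve. `trapezoid_area` is a hypothetical helper, written out here because recent NumPy releases deprecate `np.trapz` in favour of `np.trapezoid`; the five points are illustrative and not taken from the data.

```python
import numpy as np

def trapezoid_area(y, x):
    """Area under the piecewise-linear curve through the points (x[i], y[i])."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    # Each segment contributes width * average height.
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

tpr = [0.0, 0.5, 0.5, 1.0, 1.0]
fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
print(trapezoid_area(tpr, fpr))  # 0.75, TPR integrated over FPR
print(trapezoid_area(fpr, tpr))  # 0.25, the axes swapped
```

Swapping the arguments integrates FPR over TPR, which for this curve gives 1 minus the AUC. That is why the task note insists on `numpy.trapz(tpr_array, fpr_array)`.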

## CONFUSION MATRIX

In \[9\]:

    T_N, F_N, F_P, T_P, conf_mat = confusion_matrix((df_a['y'].astype(int)).tolist(), df_a['proba'].tolist())
    conf_mat

Out\[9\]:

    array([[    0,     0],
           [  100, 10000]])

## F1-SCORE

In \[10\]:

    f1_score = F1_score((df_a['y'].astype(int)).tolist(),df_a['proba'].tolist())
    f1_score

Out\[10\]:

    0.9950248756218906

The F1 score is high, but the confusion matrix shows that every point is
predicted positive (TN = 0, FN = 0), so the score mostly reflects the
100:1 class imbalance. The AUC score is a better check here.
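
The task list also asks for an accuracy score, which this notebook does not compute. The following is a minimal sketch from the four confusion counts; the name `accuracy_from_counts` is hypothetical.

```python
def accuracy_from_counts(t_n, f_n, f_p, t_p):
    """Fraction of points whose predicted label equals the true label."""
    return (t_n + t_p) / (t_n + f_n + f_p + t_p)

# Counts taken from the 5_a.csv confusion matrix shown above.
acc_a = accuracy_from_counts(t_n=0, f_n=0, f_p=100, t_p=10000)
print(round(acc_a, 4))  # 0.9901
```

Like F1, this value is high only because 10000 of the 10100 points are positive.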

## PLOTTING ROC-CURVE

In \[ \]:

    TPR, FPR = ROC_Curve_TPR_FPR((df_a['y'].astype(int)).tolist(), df_a['proba'].tolist())

In \[12\]:

    display_ROC_curve(TPR, FPR)

![](attachment:vertopal_48597fe6773b495b80a6d4a512d3a4d4/52a745919f65b34b3d42ec67ea802c6c185746e5.png)

The curve stays close to the diagonal y = x, so the scores rank positives and negatives no better than random guessing.

## AUC-SCORE

In \[13\]:

    print('AUC Score:' + str(auc_score(TPR, FPR)) )

    AUC Score:0.48829900000000004

An AUC of about 0.49 is no better than the 0.5 expected from random scores, despite the high F1 score.

## B. Compute performance metrics for the given data '5_b.csv'

    Note 1: in this data the number of positive points << the number of negative points
    Note 2: use pandas or numpy to read the data from 5_b.csv
    Note 3: you need to derive the class labels from the given score

$y^{pred} = \text{[0 if y_score < 0.5 else 1]}$

    Compute Confusion Matrix
    Compute F1 Score
    Compute AUC Score: compute tpr and fpr for each threshold, then use
        numpy.trapz(tpr_array, fpr_array)
        https://stackoverflow.com/q/53603376/4084039
        https://stackoverflow.com/a/39678975/4084039
        Note: arrange the probability scores in descending order while calculating AUC
    Compute Accuracy Score

In \[14\]:

    df_b=pd.read_csv('5_b.csv')
    df_b.head()

Out\[14\]:

|     | y   | proba    |
|-----|-----|----------|
| 0   | 0.0 | 0.281035 |
| 1   | 0.0 | 0.465152 |
| 2   | 0.0 | 0.352793 |
| 3   | 0.0 | 0.157818 |
| 4   | 0.0 | 0.276648 |

In \[15\]:

    df_b['y'].value_counts()

Out\[15\]:

    0.0    10000
    1.0      100
    Name: y, dtype: int64

## CONFUSION MATRIX

In \[ \]:

    T_N, F_N, F_P, T_P, conf_mat = confusion_matrix((df_b['y'].astype(int)).tolist(), df_b['proba'].tolist())

In \[17\]:

    print('confusion matrix: ' +str(conf_mat))

    confusion matrix: [[9761   45]
     [ 239   55]]

## F1-SCORE

In \[ \]:

    f1_score = F1_score((df_b['y'].astype(int)).tolist(),df_b['proba'].tolist())

In \[19\]:

    f1_score

Out\[19\]:

    0.2791878172588833

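The F1 score is low because precision is poor: from the confusion matrix, precision = 55 / (55 + 239) ≈ 0.19, while recall = 55 / (55 + 45) = 0.55. With only 100 positives among 10100 points, even a small false-positive rate swamps the true positives.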

## PLOTTING ROC-CURVE

In \[ \]:

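    TPR, FPR = ROC_Curve_TPR_FPR((df_b['y'].astype(int)).tolist(), df_b['proba'].tolist())

In \[21\]:

    display_ROC_curve(TPR, FPR)

![](attachment:vertopal_48597fe6773b495b80a6d4a512d3a4d4/52a745919f65b34b3d42ec67ea802c6c185746e5.png)

The recorded plot is identical to the one in Part A because the recorded run passed `df_a` rather than `df_b` to `ROC_Curve_TPR_FPR`. It therefore does not describe `5_b.csv`, and the curve has to be regenerated from `df_b` before it can be interpreted.

## AUC-SCORE

In \[22\]:

    print('AUC Score:' + str(auc_score(TPR, FPR)) )

    AUC Score:0.48829900000000004

This recorded value equals the Part A result for the same reason: it was computed from `5_a.csv`, so it does not measure the model on `5_b.csv`.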

### C. Compute the best threshold (similarly to ROC curve computation) of probability which gives the lowest value of metric **A** for the given data

  

You will predict the label of each data point like this:
$y^{pred} = \text{[0 if y_score < threshold else 1]}$

$A = 500 \times \text{number of false negatives} + 100 \times \text{number of false positives}$

    Note 1: in this data the number of negative points > the number of positive points
    Note 2: use pandas or numpy to read the data from 5_c.csv

In \[23\]:

    df_c=pd.read_csv('5_c.csv')
    df_c.head()

Out\[23\]:

|     | y   | prob     |
|-----|-----|----------|
| 0   | 0   | 0.458521 |
| 1   | 0   | 0.505037 |
| 2   | 0   | 0.418652 |
| 3   | 0   | 0.412057 |
| 4   | 0   | 0.375579 |

## FUNCTION TO GET BEST THRESHOLD VALUE FOR SMALLER 'A'

In \[24\]:

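    def best_threshold_fn(true_label, predicted_prob):
        # true_label : list of true labels (0 or 1)
        # predicted_prob : list of predicted probabilities
        # Returns the threshold that minimizes A = 500 * FN + 100 * FP
        best_threshold = 0
        prev_A = float('inf')  # lowest A seen so far
        lst_threshold = sorted(predicted_prob, reverse=True)  # every score is a candidate threshold
        for threshold in lst_threshold:
            predicted_label = convert_prob2label(predicted_prob, threshold)
            # confusion_matrix re-applies a 0.5 cutoff, which leaves 0/1 labels unchanged
            T_N, F_N, F_P, T_P, _ = confusion_matrix(true_label, predicted_label)
            A = 500 * F_N + 100 * F_P
            if A < prev_A:  # strict '<' keeps the highest threshold among ties
                best_threshold = threshold
                prev_A = A
        return best_threshold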
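
The same search can be done with one sort and cumulative sums instead of rebuilding the confusion matrix for every threshold. This is a sketch under stated assumptions, not the notebook's method: `best_threshold_sorted` and its `fn_cost`/`fp_cost` parameters are hypothetical names, and the scores are assumed to be distinct.

```python
import numpy as np

def best_threshold_sorted(true_label, predicted_prob, fn_cost=500, fp_cost=100):
    """Return (threshold, A) minimizing A = fn_cost * FN + fp_cost * FP.

    Assumes distinct scores; candidate thresholds are the scores themselves.
    """
    y = np.asarray(true_label, dtype=int)
    s = np.asarray(predicted_prob, dtype=float)
    order = np.argsort(-s, kind="stable")   # descending scores
    y_sorted, s_sorted = y[order], s[order]
    tp = np.cumsum(y_sorted)        # true positives when threshold = k-th highest score
    fp = np.cumsum(1 - y_sorted)    # false positives at the same threshold
    fn = y.sum() - tp               # positives that fall below the threshold
    cost = fn_cost * fn + fp_cost * fp
    best = int(np.argmin(cost))     # first minimum, i.e. the highest such threshold
    return float(s_sorted[best]), int(cost[best])

print(best_threshold_sorted([1, 0, 1, 0, 1], [0.9, 0.8, 0.7, 0.3, 0.2]))  # (0.2, 200)
```

Because a false negative costs five times as much as a false positive, the minimum tends to sit at a low threshold, which is consistent with the value below 0.5 reported for `5_c.csv`.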

In \[ \]:

    best_threshold = best_threshold_fn((df_c['y'].astype(int)).tolist(),df_c['prob'].tolist())

## PRINT BEST THRESHOLD VALUE

In \[26\]:

    print('Best threshold value :' +str(best_threshold))

    Best threshold value :0.2300390278970873

## D. Compute performance metrics (for regression) for the given data 5_d.csv

    Note 1: 5_d.csv has two columns, Y and predicted_Y; both are real-valued features
    Note 2: use pandas or numpy to read the data from 5_d.csv

    Compute Mean Square Error
    Compute MAPE: https://www.youtube.com/watch?v=ly6ztgIkUxk
    Compute R^2 error: https://en.wikipedia.org/wiki/Coefficient_of_determination#Definitions

In \[27\]:

    df_d=pd.read_csv('5_d.csv')
    df_d.head()

Out\[27\]:

|     | y     | pred  |
|-----|-------|-------|
| 0   | 101.0 | 100.0 |
| 1   | 120.0 | 100.0 |
| 2   | 131.0 | 113.0 |
| 3   | 164.0 | 125.0 |
| 4   | 154.0 | 152.0 |

## COMPUTING MEAN SQUARED ERROR

In \[28\]:

    def MSE(a,b):
        return (1/len(a)) * np.sum(np.square(a-b))

## COMPUTING MEAN ABSOLUTE ERROR

In \[29\]:

    def MAE(a,b):
        # mean of |a - b|, in the same units as a (not a percentage)
        return (1/len(a)) * np.sum(np.absolute(a-b))

## COMPUTING R-SQUARED ERROR

In \[30\]:

    def R_squared(a,b):
        mean_y = np.mean(a)
        SS_res = np.sum(np.square(a-b))       # residual sum of squares
        SS_tot = np.sum(np.square(a-mean_y))  # total sum of squares
        return (1 - (SS_res/SS_tot))

In \[31\]:

    print('mean square error :' + str(MSE(df_d['y'], df_d['pred'])))

    mean square error :177.16569974554707

In \[32\]:

    print('MAE :' + str(MAE(df_d['y'], df_d['pred'])))

    MAE :8.594516539440203
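
The task asks for MAPE, whereas the value printed above is a mean of |y - pred| in the units of y, with no division by y. Two common ways to turn the absolute error into a relative one are sketched below. Which of them the linked video intends is not confirmed here, and both function names are hypothetical. The sample values are made up for illustration and are not from `5_d.csv`.

```python
import numpy as np

def mape_per_point(y, pred):
    """Mean of |y - pred| / |y|; undefined if any y is zero."""
    y = np.asarray(y, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.mean(np.abs(y - pred) / np.abs(y)))

def mape_aggregate(y, pred):
    """Sum of |y - pred| divided by sum of |y|; tolerates individual zeros in y."""
    y = np.asarray(y, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.sum(np.abs(y - pred)) / np.sum(np.abs(y)))

y = [100.0, 200.0, 400.0]       # illustrative values only
pred = [110.0, 180.0, 400.0]
print(mape_per_point(y, pred))  # about 0.0667
print(mape_aggregate(y, pred))  # about 0.0429
```

The aggregate form is often preferred when some true values are zero or very small, since the per-point form then divides by zero or is dominated by a few points.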

In \[33\]:

    print('R_squared error :' + str(R_squared(df_d['y'], df_d['pred'])))

    R_squared error :0.9563582786990937

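MSE and the mean absolute error depend on the scale of y, so on their
own they do not show whether the model is good. R^2 is scale-free: 1 is
the best possible value, and 0 corresponds to always predicting the mean
of y. Here R^2 is about 0.956, which is close to 1, so the model fits
well, although what counts as good enough depends on the application.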
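
The two reference points of R^2 can be checked on a tiny made-up example: a perfect prediction gives 1, and always predicting the mean of y gives 0. The sketch re-implements the same formula under the hypothetical name `r_squared_demo` so that it runs on its own.

```python
import numpy as np

def r_squared_demo(y, pred):
    """1 - SS_res / SS_tot, the same formula as R_squared above."""
    y = np.asarray(y, dtype=float)
    pred = np.asarray(pred, dtype=float)
    ss_res = np.sum(np.square(y - pred))        # residual sum of squares
    ss_tot = np.sum(np.square(y - np.mean(y)))  # total sum of squares
    return float(1 - ss_res / ss_tot)

y = np.array([1.0, 2.0, 3.0, 4.0])              # illustrative values only
print(r_squared_demo(y, y))                      # 1.0, perfect prediction
print(r_squared_demo(y, np.full(4, y.mean())))   # 0.0, mean-only baseline
```

A model that does worse than the mean-only baseline gets a negative R^2.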