<font size="5">**ABHISHEK KUMAR SINGH**</font>

<font size="5">**2K19/CO/021**</font>

**<font size="8"><center>EXPERIMENT - 8</center></font>**

**AIM:** To implement the Friedman test and the Wilcoxon signed-rank test in Python.

[Link of Dataset](https://www.kaggle.com/chirin/africa-economic-banking-and-systemic-crisis-data/version/1)

**THEORY**

**P-value** 

The P value, or calculated probability, is the probability of obtaining the observed results, or results more extreme, when the null hypothesis (H0) of a study question is true. What counts as "extreme" depends on how the hypothesis is being tested (for example, one-sided or two-sided).
If the P value is less than the chosen significance level, we reject the null hypothesis, i.e. we conclude that our sample gives reasonable evidence to support the alternative hypothesis. It does NOT imply a "meaningful" or "important" difference; that is for us to decide when considering the real-world relevance of our result.
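
As a small illustration of the definition above, the sketch below computes a two-sided P value by hand. The coin-flip numbers are made up for illustration and are not part of this experiment's data: 9 heads in 10 flips, tested against the null hypothesis of a fair coin.

```python
from scipy import stats

# Hypothetical data: 9 heads observed in 10 flips of a coin.
n_flips, n_heads = 10, 9
alpha = 0.05  # chosen significance level

# Under H0 (fair coin) the number of heads follows Binomial(10, 0.5).
# "As extreme or more extreme" in the two-sided sense: >= 9 heads or <= 1 head.
p_upper = stats.binom.sf(n_heads - 1, n_flips, 0.5)         # P(X >= 9)
p_lower = stats.binom.cdf(n_flips - n_heads, n_flips, 0.5)  # P(X <= 1)
p_value = p_upper + p_lower

print(f"p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Rejecting H0 here only says the result is unlikely under a fair coin; it says nothing about how large or important the bias is.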

**Friedman Test** 

It is a non-parametric alternative to the one-way ANOVA with repeated measures. It tests whether the same subjects differ significantly across three or more occasions/conditions. For example: is the problem-solving ability of a set of people the same or different in the morning, afternoon and evening? It is used to test for differences between related groups when the dependent variable is ordinal. Because it works on within-subject ranks rather than raw values, this test is particularly useful when the sample size is very small.
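
To make the ranking idea concrete, here is a minimal sketch that computes the Friedman statistic by hand and compares it with `scipy.stats.friedmanchisquare`. The scores are hypothetical (five people measured in the morning, afternoon and evening), and the hand formula as written assumes there are no ties within a row.

```python
import numpy as np
from scipy import stats

# Hypothetical scores: one row per subject, columns = morning, afternoon, evening.
scores = np.array([
    [7, 5, 6],
    [6, 4, 5],
    [8, 6, 9],
    [5, 3, 4],
    [9, 7, 8],
])
n, k = scores.shape  # n subjects, k conditions

# Rank the k conditions within each subject (1 = lowest score).
ranks = stats.rankdata(scores, axis=1)
rank_sums = ranks.sum(axis=0)

# Friedman statistic (valid as written only when no row contains ties).
q_manual = 12.0 / (n * k * (k + 1)) * np.sum(rank_sums ** 2) - 3 * n * (k + 1)

# The same test through SciPy, one argument per condition.
q_scipy, p_value = stats.friedmanchisquare(scores[:, 0], scores[:, 1], scores[:, 2])

print("rank sums:", rank_sums)
print("Q (manual):", q_manual, " Q (scipy):", q_scipy, " p:", p_value)
```

If every condition were equally good, the rank sums would be roughly equal; the statistic grows as they move apart.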

**Post Hoc Analysis** 

If the null hypothesis is rejected, we can find out which pairs of experimental conditions differ using post hoc analysis, which can be done with the Wilcoxon signed-rank test, Conover's test, etc. With the Wilcoxon test we can obtain results for all pairs, but we have to apply a Bonferroni correction, which changes the level of significance to (given level of significance) / (total number of pairs).
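
The correction itself is simple arithmetic. The sketch below enumerates all pairs for three hypothetical conditions (the names are placeholders) and derives the adjusted significance level.

```python
from itertools import combinations

conditions = ["Morning", "Afternoon", "Evening"]  # hypothetical condition names
alpha = 0.05  # significance level before correction

# Every unordered pair of conditions gets its own post hoc test.
pairs = list(combinations(conditions, 2))
alpha_adjusted = alpha / len(pairs)

for a, b in pairs:
    print(f"{a} vs {b}")
print(f"{len(pairs)} pairs -> adjusted significance level = {alpha_adjusted:.4f}")
```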

**Wilcoxon signed-rank test**

The Wilcoxon signed-rank test is the nonparametric test equivalent to the dependent t-test. As the Wilcoxon signed-rank test does not assume normality in the data, it can be used when this assumption has been violated and the use of the dependent t-test is inappropriate. It is used to compare two sets of scores that come from the same participants. This can occur when we wish to investigate any change in scores from one time point to another, or when individuals are subjected to more than one condition.

For example, Wilcoxon signed-rank test can be used to understand whether there was a difference in smokers' daily cigarette consumption before and after a 6 week hypnotherapy programme (i.e., the dependent variable would be "daily cigarette consumption", and the two related groups would be the cigarette consumption values "before" and "after" the hypnotherapy programme). Wilcoxon signed-rank test can also be used to understand whether there was a difference in reaction times under two different lighting conditions (i.e., the dependent variable would be "reaction time", measured in milliseconds, and the two related groups would be reaction times in a room using "blue light" versus "red light").
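
A minimal sketch of the cigarette example, using made-up counts for eight smokers (the numbers are assumptions chosen so that there are no zero differences and no tied ranks). It computes the signed-rank sums by hand and then runs `scipy.stats.wilcoxon` on the same data.

```python
import numpy as np
from scipy import stats

# Hypothetical daily cigarette counts for 8 smokers.
before = np.array([20, 15, 30, 25, 18, 22, 28, 16])
after = np.array([17, 14, 24, 20, 20, 15, 18, 12])

# Signed ranks by hand: rank the absolute differences, then sum ranks by sign.
diff = before - after
ranks = stats.rankdata(np.abs(diff))
w_plus = ranks[diff > 0].sum()
w_minus = ranks[diff < 0].sum()

# SciPy's default two-sided test reports the smaller of the two rank sums.
result = stats.wilcoxon(before, after)
statistic, p_value = result.statistic, result.pvalue

print("W+ =", w_plus, " W- =", w_minus)
print("statistic =", statistic, " p =", p_value)
```

A small statistic means almost all of the rank weight sits on one sign, i.e. the change is consistently in one direction.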

**CODE AND OUTPUT:**

**Importing Libraries**

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

**Loading Dataset**

In [2]:
data = pd.read_csv("african_crises.csv")
data.head()

| | case | cc3 | country | year | systemic_crisis | exch_usd | domestic_debt_in_default | sovereign_external_debt_default | gdp_weighted_default | inflation_annual_cpi | independence | currency_crises | inflation_crises | banking_crisis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | DZA | Algeria | 1870 | 1 | 0.052264 | 0 | 0 | 0.0 | 3.441456 | 0 | 0 | 0 | crisis |
| 1 | 1 | DZA | Algeria | 1871 | 0 | 0.052798 | 0 | 0 | 0.0 | 14.14914 | 0 | 0 | 0 | no_crisis |
| 2 | 1 | DZA | Algeria | 1872 | 0 | 0.052274 | 0 | 0 | 0.0 | -3.718593 | 0 | 0 | 0 | no_crisis |
| 3 | 1 | DZA | Algeria | 1873 | 0 | 0.05168 | 0 | 0 | 0.0 | 11.203897 | 0 | 0 | 0 | no_crisis |
| 4 | 1 | DZA | Algeria | 1874 | 0 | 0.051308 | 0 | 0 | 0.0 | -3.848561 | 0 | 0 | 0 | no_crisis |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1059 entries, 0 to 1058
Data columns (total 14 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   case                             1059 non-null   int64  
 1   cc3                              1059 non-null   object 
 2   country                          1059 non-null   object 
 3   year                             1059 non-null   int64  
 4   systemic_crisis                  1059 non-null   int64  
 5   exch_usd                         1059 non-null   float64
 6   domestic_debt_in_default         1059 non-null   int64  
 7   sovereign_external_debt_default  1059 non-null   int64  
 8   gdp_weighted_default             1059 non-null   float64
 9   inflation_annual_cpi             1059 non-null   float64
 10  independence                     1059 non-null   int64  
 11  currency_crises                  1059 non-null   int64  
 12  inflation_crises    

**Data Preprocessing**

In [4]:
x = data.iloc[:, 2:12]
y = data.iloc[:,13]

In [5]:
col_names_int = list(x.select_dtypes(int).columns)
le = LabelEncoder()
# Encode the target once; LabelEncoder sorts labels, so crisis -> 0 and no_crisis -> 1
y = pd.DataFrame(le.fit_transform(np.array(y.values.ravel())), columns=["BankCrisis"])
# Label-encode every integer feature column (the target is not touched again here)
for col in col_names_int:
    x[col] = le.fit_transform(x[col].astype(str))
df1 = x[col_names_int]

In [6]:
col_names = list(x.select_dtypes(float).columns)
scaler = StandardScaler()
df2 = scaler.fit_transform(x.select_dtypes(float))
df2 = pd.DataFrame(df2, columns=col_names)

In [7]:
# Join features

x = pd.concat([df1, df2], axis=1)

**Building the Model**

In [8]:
resultsagg = pd.DataFrame()

In [9]:
# KNN

results30=[]
for i in range(30): 
    kf = KFold(10, shuffle=True, random_state=i)
    results = []
    for l_train, l_valid in kf.split(x):
        x_train, x_valid = x.iloc[l_train], x.iloc[l_valid] 
        y_train, y_valid = y.iloc[l_train], y.iloc[l_valid]

        knn = KNeighborsClassifier(n_neighbors=10)
        knn.fit(x_train, y_train.values.ravel())
        y_pred = knn.predict(x_valid)
        acc = accuracy_score(y_valid.values.ravel(), y_pred)
        results.append(acc)
    
    results30.append(np.mean(results))

resultsagg["KNN"]=results30

In [10]:
# Logistic Regression

results30=[]
for i in range(30):
    kf = KFold(10, shuffle=True, random_state=i)
    results = []
    for l_train, l_valid in kf.split(x):
        x_train, x_valid = x.iloc[l_train], x.iloc[l_valid] 
        y_train, y_valid = y.iloc[l_train], y.iloc[l_valid]

        log = LogisticRegression(random_state=i, solver='liblinear')
        log.fit(x_train, y_train.values.ravel())
        y_pred = log.predict(x_valid)
        acc = accuracy_score(y_valid.values.ravel(), y_pred)
        results.append(acc)
    results30.append(np.mean(results))

resultsagg["LogisticRegression"]=results30

In [11]:
# Naive Bayes

results30=[]
for i in range(30):
    kf = KFold(10, shuffle=True, random_state=i)
    results = []
    for l_train, l_valid in kf.split(x):
        x_train, x_valid = x.iloc[l_train], x.iloc[l_valid] 
        y_train, y_valid = y.iloc[l_train], y.iloc[l_valid]

        nb = GaussianNB()
        nb.fit(x_train, y_train.values.ravel())
        y_pred = nb.predict(x_valid)
        acc = accuracy_score(y_valid.values.ravel(), y_pred)
        results.append(acc)
    results30.append(np.mean(results))

resultsagg["NaiveBayes"]=results30
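
The three cells above repeat the same loop with a different classifier each time. One way to factor this out is sketched below. `repeated_cv_accuracy` is a hypothetical helper name, and the synthetic data from `make_classification` merely stands in for `x` and `y` so that the block runs on its own; the repeat and fold counts are reduced to keep it quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier


def repeated_cv_accuracy(make_model, X, y, n_repeats=30, n_splits=10):
    """Mean k-fold accuracy for each of n_repeats differently shuffled splits."""
    means = []
    for seed in range(n_repeats):
        kf = KFold(n_splits, shuffle=True, random_state=seed)
        fold_acc = cross_val_score(make_model(), X, y, cv=kf, scoring="accuracy")
        means.append(fold_acc.mean())
    return means


# Synthetic stand-in data (an assumption, not the African crises dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_scores = {
    "KNN": repeated_cv_accuracy(
        lambda: KNeighborsClassifier(n_neighbors=10), X_demo, y_demo,
        n_repeats=3, n_splits=5),
    "NaiveBayes": repeated_cv_accuracy(
        GaussianNB, X_demo, y_demo, n_repeats=3, n_splits=5),
}
print({name: np.round(vals, 3) for name, vals in demo_scores.items()})
```

Because every model is evaluated with the same seeds, entry i of each list comes from the same train/validation splits. That pairing is what makes the accuracy columns related samples, which the Friedman and Wilcoxon tests require.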

In [12]:
print("KNN:",  resultsagg["KNN"].mean())
print("LogisticRegression:",  resultsagg["LogisticRegression"].mean())
print("NaiveBayes:",  resultsagg["NaiveBayes"].mean())

KNN: 0.9079008685235102
LogisticRegression: 0.911137466307278
NaiveBayes: 0.9113932315064392


In [13]:
resultsagg

| | KNN | LogisticRegression | NaiveBayes |
|---|---|---|---|
| 0 | 0.910314 | 0.910314 | 0.908437 |
| 1 | 0.908455 | 0.911267 | 0.90938 |
| 2 | 0.9046 | 0.91124 | 0.910305 |
| 3 | 0.910261 | 0.910234 | 0.910243 |
| 4 | 0.906478 | 0.912165 | 0.913109 |
| 5 | 0.904663 | 0.911258 | 0.912192 |
| 6 | 0.90841 | 0.912183 | 0.912156 |
| 7 | 0.910279 | 0.911231 | 0.911231 |
| 8 | 0.909362 | 0.910305 | 0.910314 |
| 9 | 0.909299 | 0.911231 | 0.910296 |


**Friedman Test** 

In [14]:
stats.friedmanchisquare(resultsagg['KNN'], resultsagg['LogisticRegression'], resultsagg['NaiveBayes'])

FriedmanchisquareResult(statistic=28.288135593220336, pvalue=7.199617174126064e-07)

There was a statistically significant difference in accuracy_score depending on which type of machine learning model was trained to make predictions, χ2(2) = 28.2881, p = 7.1996e-07.
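
The "(2)" in χ2(2) is the degrees of freedom, k - 1 for k = 3 models. As a quick check, the reported p value can be recovered from the statistic with the chi-square survival function, which is the approximation SciPy's Friedman test relies on.

```python
from scipy import stats

k = 3                           # number of models compared
statistic = 28.288135593220336  # Friedman statistic reported above

# Upper-tail probability of a chi-square distribution with k - 1 degrees of freedom.
p_value = stats.chi2.sf(statistic, df=k - 1)
print(p_value)
```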

**Post Hoc Tests**

To examine where the differences actually occur, we need to run separate Wilcoxon signed-rank tests on the different combinations of related groups.

We need to use a Bonferroni adjustment on the results of the Wilcoxon tests because we are making multiple comparisons, which makes it more likely that we will declare a result significant when we should not (a Type I error). Luckily, the Bonferroni adjustment is very easy to calculate: simply take the significance level we were initially using (in this case, 0.05) and divide it by the number of tests we are running. So in this example, we have a new significance level of 0.05/3 ≈ 0.017. This means that if the p value is larger than 0.017, we do not have a statistically significant result.

**Wilcoxon signed-rank test** 

In [15]:
stats.wilcoxon(resultsagg['KNN'], resultsagg['LogisticRegression'])

WilcoxonResult(statistic=3.0, pvalue=3.50805501501122e-06)

In [16]:
stats.wilcoxon(resultsagg['LogisticRegression'], resultsagg['NaiveBayes'])

WilcoxonResult(statistic=172.0, pvalue=0.3251157886671938)

In [17]:
stats.wilcoxon(resultsagg['KNN'], resultsagg['NaiveBayes'])

WilcoxonResult(statistic=29.0, pvalue=2.839269517405251e-05)

We can see that at the p < 0.017 significance level, the accuracy scores of KNN and Logistic Regression, and of KNN and Naive Bayes, were statistically significantly different.
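
An equivalent way to read these results is to leave the significance level at 0.05 and instead multiply each p value by the number of tests. The sketch below does this with the three p values copied from the outputs above; the dictionary layout is just one convenient choice.

```python
# p values copied from the three Wilcoxon outputs above.
p_values = {
    ("KNN", "LogisticRegression"): 3.50805501501122e-06,
    ("LogisticRegression", "NaiveBayes"): 0.3251157886671938,
    ("KNN", "NaiveBayes"): 2.839269517405251e-05,
}
alpha = 0.05
n_tests = len(p_values)

# Bonferroni-adjusted p value: multiply by the number of tests, cap at 1.
adjusted = {pair: min(1.0, p * n_tests) for pair, p in p_values.items()}
significant = {pair: p_adj < alpha for pair, p_adj in adjusted.items()}

for pair, p_adj in adjusted.items():
    verdict = "significant" if significant[pair] else "not significant"
    print(f"{pair[0]} vs {pair[1]}: adjusted p = {p_adj:.3g} ({verdict})")
```

Comparing the adjusted p value with 0.05 gives the same decision as comparing the raw p value with 0.05/3.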

There was a statistically significant difference in accuracy_score depending on which type of machine learning model was trained to make predictions, χ2(2) = 28.2881, p = 7.1996e-07.

Post hoc analysis with Wilcoxon signed-rank tests was conducted with a Bonferroni correction applied, resulting in a significance level set at p < 0.017. 

There was no significant difference between the Logistic Regression and Naive Bayes models (p = 0.325).
However, KNN differed significantly from both Logistic Regression (p = 3.51e-06) and Naive Bayes (p = 2.84e-05).

**LEARNING OUTCOMES**

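We learnt how to use the Friedman test to compare several models evaluated on the same cross-validation runs, and how to follow it up with pairwise Wilcoxon signed-rank tests under a Bonferroni correction.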