In [None]:

'''
This notebook classifies the Adult data set, which mixes categorical and continuous columns, with a hand-written
K-nearest-neighbours (knn) classifier. A range of knn values is evaluated with k-fold cross-validation: for each knn value,
the cross-validation folds are iterated and one accuracy score is calculated.
The knn value with the highest accuracy is chosen for this data set.

In cross-validation, the X data is divided into n folds, where n is the n_splits value. At every iteration 1 fold is
the test data and the remaining folds are the training data.
E.g. with n_splits=3, 1 fold is the test data and 2 folds are the training data, and there are 3 iterations in total.
The combination of test and training data is different at each iteration. Because the data set holds both categorical
and continuous columns, every iteration handles the two types differently, as explained below.

The two types of data need different preparation when used with this classifier. The continuous data is
transformed into z-score values. The categorical data is transformed to integer codes using scikit-learn's LabelEncoder.
The distance between a test row and a training row is also measured differently: for continuous data
we use Euclidean distance and for categorical data we use Hamming distance.

The Euclidean distance is the square root of the sum of the squared coordinate differences between two points,
e.g. for two points in the plane SQUARE ROOT OF ((X1-X2)**2 + (Y1-Y2)**2). The Hamming distance between two sequences of
equal length is the number of positions at which the corresponding symbols are different. scikit-learn's 'hamming' metric,
which this code uses, returns that count divided by the number of positions.


Finally, both distances are added. The k training rows with the smallest total distance are the k-nearest neighbours
(found manually in this code; the run shown uses the 3 smallest distances), and the mode of their labels is the final
prediction. The accuracy score is calculated from the predicted and actual Y data.

Details of the algorithm are given below.

Data is read into a pandas data frame. ' ?' entries are replaced with the mode of their column. The data is then
split into X (input) and Y (output/class). The code implements k-fold cross-validation. The kfold object is created with
n_splits=3. For every value of knn, all 3 folds are iterated. At every iteration, the X data is divided into test and
training data (as explained above). The test and training data are handled as given below to give final accuracy scores
for a range of knn values:


1. The training and testing data are each separated into a continuous and a categorical data frame.

2. For continuous data, the std and mean of every training column are calculated and stored in a dictionary. These values
are used to transform the training columns to z-scores. The continuous test data is transformed with the std and mean of
the training data.

3. For categorical data, a LabelEncoder object is fit on each column to learn its labels, and the column is then
transformed to the matching integer codes. The test data is transformed with the same labels.
A test fold may contain labels that do not occur in the training fold. To avoid failing on them, the encoder is fit on
the unique values of each column of the complete data set.

4. Once both training and test data are transformed, we calculate the distance of every test row to all training rows.

For continuous data, we use the Euclidean distance metric and for categorical data,
we use the Hamming distance metric. The two sets of distances are stored in 2 separate numpy arrays of the same shape.

5. We need 1 distance per pair of rows, therefore we add the continuous and categorical distance arrays element-wise.
The summed array is processed row by row. Each row is converted into a series whose index is the Y train index. From this
series we take the k smallest distances, where k is the knn value for which these steps are executed. Their index labels
select the matching entries of Y train, and the mode of those entries is the prediction for that row.
After every fold, the predictions are appended to a series to build up a cumulative prediction for the whole data set.

6. The cumulative test predictions and the actual Y data are used to calculate the accuracy score.

7. Steps 1-6 are repeated for every knn value. The accuracy score for each knn value is stored in a dictionary.

8. The knn value with the highest accuracy is the best knn for this data.

9. The knn range and the respective accuracy scores are plotted.



'''
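The distance and voting steps described above can be sketched on toy data. This is a minimal numpy-only illustration, not the notebook's implementation: the arrays, the labels "low"/"high" and the helper names `mixed_distances` and `predict` are made up for the example, the continuous columns are assumed to be z-scored already, and ties in the vote are broken by first occurrence rather than by pandas' `mode()`.

```python
import numpy as np
from collections import Counter

# Hypothetical toy data (not taken from the Adult data set).
train_cont = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [0.5, 0.0]])
train_cat = np.array([[0, 1], [0, 1], [1, 0], [0, 0]])  # integer-encoded categories
train_y = np.array(["low", "low", "high", "low"])

test_cont = np.array([[0.0, 0.0]])
test_cat = np.array([[0, 1]])


def mixed_distances(test_cont, test_cat, train_cont, train_cat):
    """Return an (n_test, n_train) array of Euclidean + Hamming distances."""
    diff = test_cont[:, None, :] - train_cont[None, :, :]
    euclid = np.sqrt((diff ** 2).sum(axis=2))
    # Fraction of categorical columns that differ (normalised Hamming distance).
    hamming = (test_cat[:, None, :] != train_cat[None, :, :]).mean(axis=2)
    return euclid + hamming


def predict(dist_row, labels, k):
    """Majority label among the k training rows with the smallest distance."""
    nearest = np.argsort(dist_row, kind="stable")[:k]
    return Counter(labels[nearest]).most_common(1)[0][0]


total = mixed_distances(test_cont, test_cat, train_cont, train_cat)
prediction = predict(total[0], train_y, k=3)
print(total, prediction)
```

Because the Hamming part is at most 1 while z-scored Euclidean distances are unbounded, the continuous columns can dominate the sum; whether that weighting is appropriate is a modelling choice.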

In [1]:
import pandas as pd
import numpy as np
import time
from IPython.display import display

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics.pairwise import euclidean_distances
from sklearn import preprocessing
from scipy.spatial import distance
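try:
    from sklearn.metrics import DistanceMetric    # location in newer scikit-learn releases
except ImportError:
    from sklearn.neighbors import DistanceMetric  # location in older scikit-learn releases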

from sklearn.model_selection import KFold 


from sklearn.model_selection import cross_val_score
from statistics import mean 
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', None)
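Step 2 of the overview standardises the continuous columns with the mean and std of the training fold only, and reuses those statistics for the test fold. The sketch below shows that idea on two made-up columns; the frames and column names are hypothetical, and it builds new frames instead of overwriting columns in place.

```python
import pandas as pd

# Hypothetical continuous columns for one train/test split.
train = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Hrs": [10.0, 20.0, 60.0]})
test = pd.DataFrame({"Age": [30.0, 50.0], "Hrs": [30.0, 30.0]})

# Statistics come from the training rows only.
mean = train.mean()
std = train.std()  # sample standard deviation (ddof=1), the pandas default

train_z = (train - mean) / std
test_z = (test - mean) / std  # test rows reuse the training mean and std

print(test_z)
```

Using the training statistics for the test fold keeps information about the test rows out of the model, which is the reason the notebook computes them once per fold.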

In [13]:
def read_data():
    
    '''
    Read the Adult data set into a dataframe using the pandas read_csv function.
    The ' ?' values of each column are replaced with the mode value of that column.
    
    Return :- 
    data :- Dataframe of adult data set
    
    '''
    
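    columns=['Age','Workclass','FNLWGT','Education','Education-Num','Marital Status','Occupation','Relationship',
             'Race','Sex','Capital Gain','Caplital Loss','Hrs-Per-Week','Native-Country','Sal']
    data=pd.read_csv('http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.data',names=columns)
    for col in data.columns:
        # Values in the raw file carry a leading space, so missing entries appear as ' ?'.
        # Assign the result back instead of calling replace(inplace=True) on a column selection,
        # which newer pandas versions may apply to a temporary copy.
        data[col]=data[col].replace(' ?',data[col].mode()[0])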

    return data


def split_data(df):
    
    '''
    Split the data into the Input (X) and Output (Y) data sets.
    Argument :-
    df :- Adult data frame
    
    Return :- 
    x :- Input data (the first 14 columns)
    y :- Output/class data (the last column, 'Sal')
    cat_data :- Categorical columns of the complete data set before any train/test split.
    
    '''
    
    cat_data=df[['Workclass','Education','Marital Status','Occupation','Relationship','Race','Sex','Native-Country']]
    
    x=df.iloc[:,0:14]
    y=df.iloc[:,-1]
    
    
    return x,y,cat_data


def kfold_split_data(kval):
    
    '''
    Define into how many folds the complete data is split for cross-validation.
    
    Argument :-
    kval :- value for n_splits.
    
    Return :-
    kf1 :- KFold object (no shuffling, so the folds are contiguous blocks of rows).
    
    '''
    
    kf1 = KFold(n_splits=kval)
        
    return kf1



def process_data(x_train,x_test):
    
    '''
    Separate the train and test data into categorical and continuous data respectively.
    
    Argument:-
    x_train :- Training data.
    x_test  :- Testing data
 
    
    Return :-
    xtrain_cat  :- Categorical data of training data set
    xtrain_cont :- Continuous data of training data set.
    xtest_cat :-  Categorical data of test data.
    xtest_cont :- Continuous data of test data.
    '''
    # Training data is separated into categorical and continuous data.
    # copy() gives independent frames, so the later in-place transforms do not write into views of x_train.
    
    xtrain_cat=x_train[['Workclass','Education','Marital Status','Occupation','Relationship','Race','Sex','Native-Country']].copy()
    xtrain_cont=x_train[['Age','FNLWGT','Education-Num','Capital Gain','Caplital Loss','Hrs-Per-Week']].copy()
    
    # Testing data is separated into categorical and continuous data
    
    xtest_cat=x_test[['Workclass','Education','Marital Status','Occupation','Relationship','Race','Sex','Native-Country']].copy()
    xtest_cont=x_test[['Age','FNLWGT','Education-Num','Capital Gain','Caplital Loss','Hrs-Per-Week']].copy()
    
    
    return xtrain_cat,xtrain_cont,xtest_cat,xtest_cont




def cont_cal_stdmean(data):
    
    '''
    Calculate the std and mean of every column of the continuous training data.
    
    Argument :-
    data :- continuous columns of the training data set
    
    Return :-
    stdmean :- Dictionary holding std and mean for each column.        
    '''
    
    
    stdmean={}
    for col in data.columns:
               
        mean_data=data[col].mean()
        stdmean.update({'Mean'+col:mean_data})
        std_data=data[col].std()
        stdmean.update({'Std'+col:std_data})
        
    return stdmean




def cont_transform_data(datatotransform,std_mean):
    
    '''
    Replace the column values with z-scores, (value - mean) / std, using the dictionary of std and mean for every column.
    Both training and test data are transformed with it. Test data is transformed with the std and mean of the
    training data.
        
    Argument:- 
    std_mean=Std and mean of every column of training data.
    datatotransform=training data/testing data
    
    Return:
    datatotransform=Transformed training/test data.
    
       
    '''
    
    
    for col in datatotransform.columns:
        datatotransform[col]=((datatotransform[col]-std_mean['Mean'+col])/std_mean['Std'+col])
    
    return datatotransform




def cont_distance_cal(xcont_test,xcont_train):
    
    '''
    Calculate the Euclidean distance between training and testing data.
    
    Argument:-
    xcont_test :- Continuous data of test data set
    xcont_train :- Continuous data of training data set
    
    Return :-
    contdist_arr :- Array of distances with one row per test row and one column per training row.
    '''
       
    contdist_arr=euclidean_distances(xcont_test,xcont_train)
    
    return contdist_arr




def cat_label_encode(cato_data,datatotransform):
    
    '''
    Convert the categorical data into integer format. LabelEncoder generates an integer label
    for every unique value of a column, and the data is transformed into these label values. The encoder is fit on the
    complete categorical data set, so labels that occur only in a test fold are still known when transforming.
    
    
    Argument:- 
    cato_data=complete categorical dataset (used to fit the encoder)
    datatotransform=training/test data
    
    Return :-
    datatotransform :- Transformed training/test data
    
    '''
    
    le = preprocessing.LabelEncoder()
    for x in cato_data.columns:
        
        
        le.fit(cato_data[x])
        datatotransform[x]=le.transform(datatotransform[x])
        
    return datatotransform




def cat_hamming_dist(cat_train,cat_test):
    
    '''
    Calculate the Hamming distance between training and testing categorical data.
    A DistanceMetric object is created with 'hamming' as an argument.
    Its pairwise function calculates the distance of every test row to all rows of training data.
    scikit-learn's 'hamming' metric returns the fraction of columns that differ, a value between 0 and 1.
    
    Argument :-
    cat_train :- Categorical data of training data set.
    cat_test :- Categorical data of testing data set.
    
    Return :-
    dist_arr :- Array of distances with one row per test row and one column per training row.
    '''
    
    
    dist=DistanceMetric.get_metric('hamming')
    
    dist_arr=dist.pairwise(cat_test,cat_train)
       
    
    return dist_arr



def get_prediction(distance,y_train,val):
    
    '''
    Predict the class of one test row from its summed (categorical + continuous) distances to all training rows.
    The row is converted into a series indexed by the Y train index and the k smallest distances are selected.
    The mode of the Y train labels at those index positions is the prediction.
    
    Argument :-
    distance :- 1-D array of summed distances from one test row to every training row
    y_train  :- Y data of the training fold
    val      :- number of nearest neighbours (knn value)
    
    Return :-
    ypred :- Predicted label for the row.
    
    
    '''
    
    
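    list_ser=pd.Series(distance,index=y_train.index.values)
      
    # Keep the val smallest distances; their index labels identify the nearest training rows.
    list_ser=list_ser.nsmallest(val)
    # mode() may return several labels on a tie; item(0) takes the first of them.
    ypred=y_train.loc[list_ser.keys()].mode().values.item(0)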
    
    
    return ypred




def pred_accuracy(y_pred_final,ydata):
    
    '''
    Calculate the accuracy score as the percentage of predictions that match the actual labels.
    
    Argument :-
    y_pred_final :- Series of predictions for all rows, indexed by the row labels of the X data
    ydata        :- Actual Y data
    
    Return :-
    score :- Accuracy in percent.
    
    '''
    # Series.as_matrix() was removed in pandas 1.0; to_numpy() is the replacement.
    # Reindexing first lines each prediction up with the actual label of the same row.
    y_pred_final=y_pred_final.reindex(ydata.index).to_numpy()
    ydata=ydata.to_numpy()
    print("y_pred_final shape",y_pred_final.shape)
    
    
    score=(np.sum(y_pred_final==ydata) *100) /ydata.size
    
    return score




def knn_cv_datatransform(kf,xdata,ydata,cato_data):
    
    '''
    Get an accuracy score for a range of knn values. For each value of knn, the cross-validation folds are iterated to
    split the X data into training and test data. At each iteration 1 fold is the test data and the remaining folds are
    the training data, so the combination of test and training data differs between iterations. At every iteration, the
    categorical and continuous data are handled differently. The predictions of every fold are appended to a series to
    get a cumulative prediction for all rows. The predicted data and actual Y data are used to calculate the accuracy score.
    For each knn, these steps are repeated and the accuracy score for each knn is stored in a dictionary object.
    
    
    Arguments :-
    kf :- kfold object
    xdata :- X data
    ydata :- Y data
    cato_data :- Categorical data of complete data set.
    
    Return :-
    accur_score :- dictionary mapping each knn value to its accuracy score
    '''
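    # range(3,4,5) yields only [3]; widen it (for example range(1,10,2)) to compare several knn values.
    knn_list=list(range(3,4,5))
    accur_score={}
    
    for val in knn_list:
        
        # Start every knn value with an empty prediction series so that folds of earlier values do not accumulate.
        y_pred_final=pd.Series(dtype=object)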
        for train_index,test_index in kf.split(xdata):
            
            x_train,x_test=xdata.iloc[train_index],xdata.iloc[test_index]
            y_train,y_test=ydata.iloc[train_index],ydata.iloc[test_index]
            
            # Split x_train and x_test data into Categorical and Continuous data respectively.
            
            xcat_train,xcont_train,xcat_test,xcont_test=process_data(x_train,x_test)            
           
            # Calculate std and mean of every column.
            
            xtrain_cont_stdmean=cont_cal_stdmean(xcont_train)
            
            
            # Transform the continuous data to z-score values. This is done for both training and testing data.
            
            xcont_train=cont_transform_data(xcont_train,xtrain_cont_stdmean)
            xcont_test=cont_transform_data(xcont_test,xtrain_cont_stdmean)
            
            
            # Calculate the Euclidean distance between training and test data.
            
            cont_dist_arr=cont_distance_cal(xcont_test,xcont_train)
            
            
            # Convert the categorical data into integer codes using LabelEncoder.
            # This is done for training and testing data.
            
            xcat_train=cat_label_encode(cato_data,xcat_train)
            xcat_test=cat_label_encode(cato_data,xcat_test)
            
            
            #Calculate the hamming distance between training and test data.
            
            cat_dist_arr=cat_hamming_dist(xcat_train,xcat_test)
            
         
            # Add the categorical and continuous distance arrays element-wise; both have shape (n_test, n_train).
            
            total_dist=cat_dist_arr+cont_dist_arr
            # A list comprehension avoids np.apply_along_axis, which sizes its string output from the first
            # prediction and could truncate longer labels.
            y_pred=[get_prediction(row,y_train,val) for row in total_dist]
        
            
            ypred_ser=pd.Series(y_pred,index=x_test.index.values)
           
            # Series.append() was removed in pandas 2.0; pd.concat() is the replacement.
            y_pred_final=pd.concat([y_pred_final,ypred_ser])
            
            
        # Calculate the accuracy score once all folds for this knn value have been predicted.
        
        score=pred_accuracy(y_pred_final,ydata)
        accur_score.update({val:score})
            
      
        
    return accur_score



def max_score_knn(max_score):
    
    
    '''
    The function is to obtain the knn with maximum accuracy score.
    
    Argument :-
    max_score :- dictionary holding knn values and corresponding accuracy score.
    
    
    '''
    max_v = max(zip(max_score.values(), max_score.keys()))
    print('Multiple Knn values with CV,manually code')
    print('Accuracy   :  ' ,max_v[0],'       ' ,'Top Knn with max accuracy      :   ',max_v[1])
        
        
        
        
def plot_graph(score_accuracy):
    
    '''
    Plot the accuracy score against each knn value.
    
    Argument:-
    score_accuracy :-dictionary holding knn values and corresponding accuracy score.
    
    '''
    
    
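    # marker='o' keeps a single knn value visible, since a line needs at least two points.
    plt.plot(*zip(*sorted(score_accuracy.items())),marker='o')
    plt.xlabel('Number of Neighbors')
    plt.ylabel('Accuracy for  K')
    plt.yscale("linear")
    plt.xticks(range(1,10))
    # ylim takes (bottom, top) only; accuracy is a percentage, so show the 0-100 range.
    plt.ylim(0,100)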
    plt.show()
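The functions above fit the label encoder on the complete categorical data rather than on the training fold. The sketch below illustrates why, using `pandas.Categorical` as a stand-in for `LabelEncoder` (an assumption made so the example needs only pandas): it marks an unseen value with the code -1, whereas `LabelEncoder.transform` raises an error for a label it was not fit on. The series and the helper name `encode` are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column and one train/test split of it.
full = pd.Series(["Private", "State-gov", "Self-emp", "Private"])
train = full.iloc[:2]  # contains only "Private" and "State-gov"
test = full.iloc[2:]   # contains "Self-emp", which the training rows never show


def encode(values, vocabulary):
    """Map each value to its position in the sorted vocabulary (-1 if absent)."""
    categories = sorted(set(vocabulary))
    return pd.Categorical(values, categories=categories).codes


codes_train_vocab = encode(test, train)  # vocabulary from the training rows only
codes_full_vocab = encode(test, full)    # vocabulary from the complete column

print(list(codes_train_vocab), list(codes_full_vocab))
```

Since the Hamming distance only checks whether two codes are equal, the particular integers do not matter, only that every value maps to a consistent code in both folds.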
    


In [14]:
#Read the adult data set into a pandas data frame.
df=read_data()

In [15]:
#Split the data frame into X ,Y and categorical data frames
xdata,ydata,catogorical_data=split_data(df)

In [16]:
#Create an object of K-fold cross validation. k_val is the integer number of folds (n_splits).
k_val=3
kf=kfold_split_data(k_val)
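To make the fold layout concrete, here is a numpy-only sketch of how an unshuffled split into 3 folds partitions row positions. It is intended to mirror what `KFold(n_splits=3)` yields without shuffling, but it is an illustration rather than scikit-learn's code; the helper name `contiguous_folds` and the sample size of 7 are made up.

```python
import numpy as np


def contiguous_folds(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(n_samples), n_splits)
    for i, test_idx in enumerate(folds):
        # Every fold except the current one becomes training data.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx


splits = list(contiguous_folds(7, 3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Every row lands in exactly one test fold, which is why the per-fold predictions can be concatenated into one prediction per row of the data set.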

In [None]:
#Execute the code with a range of knn values and kfold n_splits=3. The return value is a dictionary of knn values and accuracy scores.
score_accuracy=knn_cv_datatransform(kf,xdata,ydata,catogorical_data)

In [27]:
score_accuracy

{3: 81.8156690519333}
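A score like the one above is the percentage of rows whose cross-validated prediction equals the actual label. The sketch below shows one way to compute it from per-fold predictions, aligning on the row index before comparing; the labels and fold contents are made up for illustration.

```python
import pandas as pd

# Hypothetical actual labels and per-fold predictions, each indexed by row label.
y_true = pd.Series(["a", "b", "a", "b"], index=[0, 1, 2, 3])
fold_preds = [
    pd.Series(["a", "a"], index=[2, 3]),  # predictions for the fold holding rows 2-3
    pd.Series(["a", "b"], index=[0, 1]),  # predictions for the fold holding rows 0-1
]

# Concatenate the folds, then put the predictions in the same row order as y_true.
y_pred = pd.concat(fold_preds).reindex(y_true.index)
accuracy = (y_pred == y_true).mean() * 100

print(accuracy)
```

Aligning on the index makes the comparison independent of the order in which the folds were processed.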

In [28]:
max_score_knn(score_accuracy)

Multiple Knn values with CV,manually code
Accuracy   :   81.8156690519333         Top Knn with max accuracy      :    3


In [None]:
plot_graph(score_accuracy)