# Supervised Learning and K Nearest Neighbors Exercises

![knn.png](Assets/knn.png)

# Learning Objectives:

- Explain supervised learning and how it can be applied to regression and classification problems
- Apply the K-Nearest Neighbors (KNN) algorithm for classification
- Apply Intel® Extension for Scikit-learn* to leverage underlying compute capabilities of hardware

# scikit-learn* 

Frameworks provide structure that Data Scientists use to build code. Frameworks are more than just libraries, because in addition to callable code, frameworks influence how code is written. 

A main virtue of using an optimized framework is that code runs faster. Faster code is always more convenient, but in applied data science and AI models the benefits are more material. Here you will see how optimization, particularly hyperparameter optimization, can deliver more than just speed.

These exercises will demonstrate how to apply **the Intel® Extension for Scikit-learn*,** a seamless way to speed up your scikit-learn application. The acceleration is achieved through the use of the Intel® oneAPI Data Analytics Library (oneDAL). Patching is the term for replacing stock scikit-learn algorithms with Intel-optimized versions, which makes scikit-learn a well-suited machine learning framework for dealing with real-life problems.

To get optimized versions of many scikit-learn algorithms, use the patch() approach, which consists of adding these lines of code before importing the scikit-learn estimators you want to accelerate:

- **from sklearnex import patch_sklearn**
- **patch_sklearn()**

## This exercise relies on installation of Intel® Extension for Scikit-learn*

If you have not already done so, follow the installation instructions from Week 1.


## Introduction

We will be using customer churn data from the telecom industry for the first week's exercises. The data file is called `Orange_Telecom_Churn_Data.csv`. We will load this data together, do some preprocessing, and use K-nearest neighbors to predict customer churn based on account characteristics.

In [82]:
import os

import numpy as np
import pandas as pd

# Patch before importing scikit-learn estimators so the optimized versions are used
from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler

data_path = ['data']

Extension for Scikit-learn* enabled (https://github.com/uxlfoundation/scikit-learn-intelex)


## Question 1

* Begin by importing the data. Examine the columns and data.
* Notice that the data contains a state, area code, and phone number. Do you think these are good features to use when building a machine learning model? Why or why not? 

We will not be using them, so they can be dropped from the data.

In [83]:
# load the churn data and preview the first rows
filepath = os.sep.join(data_path + ["Orange_Telecom_Churn_Data.csv"])
df = pd.read_csv(filepath)
df.head()

Unnamed: 0,state,account_length,area_code,phone_number,intl_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,...,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls,churned
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


## Question 2

* Notice that some of the columns are categorical data and some are floats. These features will need to be numerically encoded using one of the methods from the lecture.
* Finally, remember from the lecture that K-nearest neighbors requires scaled data. Scale the data using one of the scaling methods discussed in the lecture.
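To make the scaling requirement concrete, here is a minimal plain-Python sketch of min-max scaling and its effect on Euclidean distance. It is not how scikit-learn implements `MinMaxScaler`. The three toy customers and the helper names `min_max_scale` and `euclidean` are hypothetical and chosen only for illustration.

```python
import math

def min_max_scale(column):
    """Rescale a list of numbers to the [0, 1] range, one column at a time."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy customers: minutes of usage and an encoded yes/no plan flag
minutes = [100.0, 300.0, 110.0]
intl_plan = [0, 0, 1]

raw = list(zip(minutes, intl_plan))
scaled = list(zip(min_max_scale(minutes), min_max_scale(intl_plan)))

# Unscaled: minutes dominate, so the plan flag barely affects the distance
d_raw_01 = euclidean(raw[0], raw[1])        # 200.0
d_raw_02 = euclidean(raw[0], raw[2])        # about 10.05

# Scaled: a different plan counts as much as the full range of minutes
d_scaled_01 = euclidean(scaled[0], scaled[1])  # 1.0
d_scaled_02 = euclidean(scaled[0], scaled[2])  # about 1.001

print(d_raw_01, d_raw_02)
print(d_scaled_01, d_scaled_02)
```

Before scaling, customer 2 looks almost identical to customer 0 even though the plan flag differs. After scaling, both features contribute on the same footing, which is what a distance-based method such as KNN needs.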

In [84]:
# numerically encoding it

class NonNumericEncoder:

    def __init__(self):
        self.mappings = {}

    def encode(self, df: pd.DataFrame) -> pd.DataFrame:

        """
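        Encodes non-numeric columns in a DataFrame using pandas factorization.
        Codes are assigned in order of first appearance, and the code-to-value
        mapping for each column is stored in self.mappings.
        
        Args:
            df (pd.DataFrame): The DataFrame to encode (modified in place).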
        
        Returns:
            pd.DataFrame: DataFrame with encoded non-numeric columns.
        """

        non_numeric_columns = df.select_dtypes(exclude=["number"]).columns.tolist()
        
        for column in non_numeric_columns:
            df[column], uniques = pd.factorize(df[column])
            self.mappings[column] = dict(enumerate(uniques))  # store mapping
        
        return df


encoder = NonNumericEncoder()
df_encoded = encoder.encode(df)
df_encoded

Unnamed: 0,state,account_length,area_code,phone_number,intl_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,...,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls,churned
0,0,128,415,0,0,0,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.70,1,0
1,1,107,415,1,0,0,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.70,1,0
2,2,137,415,2,0,1,0,243.4,114,41.38,...,110,10.30,162.6,104,7.32,12.2,5,3.29,0,0
3,1,84,408,3,1,1,0,299.4,71,50.90,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,0
4,3,75,415,4,1,1,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,24,50,408,4995,0,0,40,235.7,127,40.07,...,126,18.96,297.5,116,13.39,9.9,5,2.67,2,0
4996,8,152,415,4996,0,1,0,184.2,90,31.31,...,73,21.83,213.6,113,9.61,14.7,2,3.97,3,1
4997,43,61,415,4997,0,1,0,140.6,89,23.90,...,128,14.69,212.4,97,9.56,13.6,4,3.67,1,0
4998,43,109,510,4998,0,1,0,188.8,67,32.10,...,92,14.59,224.4,89,10.10,8.5,6,2.30,0,0


In [85]:
# scaling helper

def scale(df, scaling_method=None):

    """
    Scales a DataFrame for KNN using one of 3 scaling methods.

    Args:
        df (pd.DataFrame): DataFrame to be scaled
        scaling_method (str): 'standard', 'minmax', or 'maxabs'

    Returns:
        pd.DataFrame: Scaled DataFrame
    """

    if scaling_method == "standard":
        scaler = StandardScaler()

    elif scaling_method == "minmax":
        scaler = MinMaxScaler()

    elif scaling_method == "maxabs":
        scaler = MaxAbsScaler()

    else:
        raise ValueError(f"invalid scaling_method: {scaling_method!r}")

    scaled_array = scaler.fit_transform(df)

    df_scaled = pd.DataFrame(scaled_array, columns=df.columns, index=df.index)

    return df_scaled


In [86]:
#testing the scaler

np.random.seed(42)
df_test = pd.DataFrame({
    "age": np.random.randint(18, 70, 10),
    "income": np.random.randint(2000, 10000, 10),
    "score": np.random.uniform(0, 1, 10)
})

print("Original DataFrame:")
print(df_test)
print("\nBasic stats:")
print(df_test.describe())

# --- test all three scaling methods ---
for method in ["standard", "minmax", "maxabs"]:
    print(f"\n===== Scaling method: {method} =====")
    df_test_scaled = scale(df_test, scaling_method=method)
    print(df_test_scaled)
    print("\nScaled stats:")
    print(df_test_scaled.describe())

Original DataFrame:
   age  income     score
0   56    6426  0.969910
1   69    7578  0.832443
2   46    8231  0.212339
3   32    5444  0.181825
4   60    5171  0.183405
5   25    4919  0.304242
6   38    9831  0.524756
7   56    2130  0.431945
8   36    3685  0.291229
9   40    9476  0.611853

Basic stats:
             age       income      score
count  10.000000    10.000000  10.000000
mean   45.800000  6289.100000   0.454395
std    13.990473  2496.070977   0.277772
min    25.000000  2130.000000   0.181825
25%    36.500000  4982.000000   0.232062
50%    43.000000  5935.000000   0.368094
75%    56.000000  8067.750000   0.590079
max    69.000000  9831.000000   0.969910

## Question 3

* Separate the feature columns (everything except `churned`) from the label (`churned`). This will create two tables.
* Fit a K-nearest neighbors model with a value of `k=3` to this data and predict the outcome on the same data.

In [87]:
# separating target from the rest 

df_features = df_encoded.drop(columns=["churned"])
target = df_encoded["churned"]

# fit KNN with k=3 and predict on the same data

KNN = KNeighborsClassifier(n_neighbors=3)
KNN = KNN.fit(df_features, target)
predicted_column = KNN.predict(df_features)
predicted_column[:100]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])

## Question 4

Ways to measure error haven't been discussed in class yet, but accuracy is an easy one to understand--it is simply the percent of labels that were correctly predicted (either true or false). 

* Write a function to calculate accuracy using the actual and predicted labels.
* Using the function, calculate the accuracy of this K-nearest neighbors model on the data.

In [88]:
def accuracy(actual, predicted):
    """Return the fraction of predicted labels that match the actual labels."""
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    return (actual == predicted).mean()

# compare the true labels with the KNN predictions from Question 3
print(f"accuracy is {accuracy(target, predicted_column)}")

## Question 5

* Fit the K-nearest neighbors model again with `n_neighbors=3` but this time use distance for the weights. Calculate the accuracy using the function you created above. 
* Fit another K-nearest neighbors model. This time use uniform weights but set the power parameter for the Minkowski distance metric to be 1 (`p=1`) i.e. Manhattan Distance.

When weighted distances are used for part 1 of this question, a value of 1.0 should be returned for the accuracy. Why do you think this is? *Hint:* we are predicting on the data and with KNN the model *is* the data. We will learn how to avoid this pitfall in the next lecture.
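The following is a minimal plain-Python sketch of the two variations, under stated assumptions. It is not scikit-learn's implementation, and in the exercise itself you would pass `weights="distance"` or `p=1` to `KNeighborsClassifier`. The toy points, the labels, and the helper names `minkowski`, `knn_predict`, and `training_accuracy` are hypothetical. The sketch assumes that an exact match (distance 0) decides the vote on its own when distance weighting is used.

```python
from collections import defaultdict

def minkowski(a, b, p=2):
    """Minkowski distance: p=2 is Euclidean, p=1 is Manhattan."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn_predict(train_X, train_y, query, k=3, weights="uniform", p=2):
    """Predict one label by voting among the k nearest training points."""
    neighbors = sorted(
        (minkowski(x, query, p), label) for x, label in zip(train_X, train_y)
    )[:k]
    has_exact_match = any(d == 0 for d, _ in neighbors)
    votes = defaultdict(float)
    for d, label in neighbors:
        if weights == "uniform":
            votes[label] += 1.0              # every neighbor counts equally
        elif has_exact_match:
            votes[label] += 1.0 if d == 0 else 0.0  # exact matches decide alone
        else:
            votes[label] += 1.0 / d          # closer neighbors count more
    return max(votes, key=votes.get)

def training_accuracy(X, y, **knn_options):
    """Accuracy when predicting on the same data used as the 'model'."""
    predictions = [knn_predict(X, y, point, **knn_options) for point in X]
    return sum(p == t for p, t in zip(predictions, y)) / len(y)

# Hypothetical toy data: one class-1 point sits inside a cluster of class-0 points
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.05, 0.05), (1.0, 1.0), (0.9, 1.0)]
y = [0, 0, 0, 1, 1, 1]

acc_uniform = training_accuracy(X, y, k=3, weights="uniform")
acc_distance = training_accuracy(X, y, k=3, weights="distance")
acc_manhattan = training_accuracy(X, y, k=3, weights="uniform", p=1)

print(acc_uniform, acc_distance, acc_manhattan)
```

With uniform weights, the isolated class-1 point is outvoted by its two class-0 neighbors. With distance weights, each training point finds itself at distance 0, so its own label always wins.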

## Question 6

* Fit a K-nearest neighbors model using values of `k` (`n_neighbors`) ranging from 1 to 20. Use uniform weights (the default). The coefficient for the Minkowski distance (`p`) can be set to either 1 or 2--just be consistent. Store the accuracy and the value of `k` used from each of these fits in a list or dictionary.
* Plot (or view the table of) the `accuracy` vs `k`. What do you notice happens when `k=1`? Why do you think this is? *Hint:* it's for the same reason discussed above.
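One possible shape for the sweep is sketched below in plain Python, on hypothetical random data rather than the churn dataset. The helper `knn_predict`, the seed, and the noisy labeling rule are assumptions made for illustration. In the exercise you would fit `KNeighborsClassifier(n_neighbors=k)` inside the loop instead.

```python
import random
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Uniform-weight KNN vote; squared Euclidean distance gives the same ranking."""
    by_distance = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_X, train_y)
    )
    top_labels = [label for _, label in by_distance[:k]]
    # On a tied vote, most_common returns the label seen first (the nearer one)
    return Counter(top_labels).most_common(1)[0][0]

rng = random.Random(0)
# Hypothetical toy data: 40 random 2-D points with labels from a noisy rule
X = [(rng.random(), rng.random()) for _ in range(40)]
y = [int(a + b + rng.gauss(0, 0.3) > 1.0) for a, b in X]

# Store the training accuracy for each k in a dictionary keyed by k
accuracy_by_k = {}
for k in range(1, 21):
    predictions = [knn_predict(X, y, point, k) for point in X]
    accuracy_by_k[k] = sum(p == t for p, t in zip(predictions, y)) / len(y)

for k, acc in accuracy_by_k.items():
    print(f"k={k:2d}  training accuracy={acc:.3f}")
```

With `k=1`, every point's nearest neighbor is itself, so the training accuracy is 1.0 no matter how noisy the labels are.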