# Ahmet Emre Usta
# 2200765036

# Part 1

## k-Nearest Neighbor Classification Section Answers

**1.** The **high computational cost** of the k-Nearest Neighbor (k-NN) approach is a disadvantage when testing on a large dataset. 

**Reason:** In order to determine the k nearest neighbors, k-NN compares the test instance to every instance in the training set. The O(n) complexity for each query, where n is the size of the training data, causes testing to perform slowly with a large dataset. Additionally, because every query requires the complete dataset to be saved and referenced, memory consumption increases.


**2.** The ideal k-value, based on the graphs, would be approximately **10**.

Reason: The error in the test error graph reduces as k increases until it reaches its minimum at about k = 10. The test error begins to rise once more after k = 10, which suggests over-smoothing. As a result, k = 10 strikes a balance between preventing underfitting and overfitting and decreasing test error.

**3.**
![1.3](image1.png)

**4.** 
- F
- T
- T
- F

## Linear Regression Section Answers

**1.**

**Model equation:**
\( y = 1.5x + 1.0 \)

**Step 1: Predict the output for each data point using the model equation:**
1. For \( x = 1.0 \):
   \( y_pred = 1.5(1.0) + 1.0 = 2.5 \)
   
2. For \( x = 1.5 \):
   \( y_pred = 1.5(1.5) + 1.0 = 3.25 \)
   
3. For \( x = 3.0 \):
   \( y_pred = 1.5(3.0) + 1.0 = 5.5 \)

**Step 2: Calculate the squared error for each data point:**
1. \( Squared Error = (1.5 - 2.5)^2 = (-1.0)^2 = 1.0 \)
   
2. \( Squared Error = (3.25 - 3.25)^2 = (0)^2 = 0.0 \)
   
3. \( Squared Error = (4.0 - 5.5)^2 = (-1.5)^2 = 2.25 \)

**Step 3: Compute the Mean Squared Error (MSE):**

MSE = (1.0 + 0.0 + 2.25)/3 = (3.25)/3 = 1.083

**Mean Squared Error (MSE)** is **1.083**.

**2.**

- Line 2.
- **Comparison of actual vs. predicted scores for students A, B, C, and D**:
   - **Student A**: Their actual score is **below** the predicted score (since A lies below Line 2).
   - **Student B**: Their actual score is **below** the predicted score (B is below Line 2, and even further below Line 1).
   - **Student C**: Their actual score is **above** the predicted score (C is slightly above Line 2).
   - **Student D**: Their actual score is **above** the predicted score (D is well above Line 2, close to Line 1).

- Since the regression line is Line 2, and considering the overall trend between midterm and final scores, students tend to score higher on the midterm than on the final exam. Therefore, I would expect to see the final exam scores generally lower than the midterm scores. Or we can except around the mean final score which is approximatelly 75.

**3.** The least squares method of linear regression uses **vertical offsets**.

The goal of least squares linear regression is to reduce the vertical error (difference) between the actual data points and the regression line's predicted values. The difference between the predicted values (from the regression model) and the observed dependent variable (y-values) is shown by these vertical offsets.

**4.**
![2.4](image2.png)

**5.**
   
- **Balanced influence**: Without scaling, features with larger values dominate, leading to biased predictions. Scaling keeps them balanced.

- **Better coefficient interpretation**: Scaled features make the model’s weights more meaningful and easier to interpret.

- **Avoids numerical issues**: Large or small feature values can cause calculation errors. Scaling prevents this.

- **Effective regularization**: Scaling ensures that regularization impacts all features equally, improving model performance.

# Part 2

## Install Necessary Libaries

In [1]:
!pip install pandas numpy scikit-learn plotly statsmodels >> /dev/null

## Import Libaries

In [2]:
import os
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from collections import Counter

# close warnings
import warnings

warnings.filterwarnings("ignore")

## Set the Paths and Read the Dataset

In [3]:
DATASET_PATH = "/Users/emre/GitHub/HU-AI/AIN313/2024/Assignment 1/dataset/"
PROCESSED_DATASET_PATH = os.path.join(DATASET_PATH, "processed")
RAW_DATASET_FILE = os.path.join(
    DATASET_PATH, "raw", "telecommunicaton_classification.csv"
)

In [4]:
# read the dataset
df = pd.read_csv(RAW_DATASET_FILE)
df.head()

Unnamed: 0,district,customer_since,age,is_married,address,salary,ed,employment_status,is_retired,gender,reside,service
0,2,13,44,Yes,9,64.0,4,5,No,F,2,Fundamental Service
1,3,11,33,Yes,7,136.0,5,5,No,F,6,Complete Service
2,3,68,52,Yes,24,116.0,1,29,No,M,2,Advanced Service
3,2,33,33,No,12,33.0,2,0,No,M,1,Fundamental Service
4,2,23,30,Yes,9,30.0,1,2,No,F,4,Advanced Service


In [5]:
# show the dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   district           1000 non-null   int64  
 1   customer_since     1000 non-null   int64  
 2   age                1000 non-null   int64  
 3   is_married         1000 non-null   object 
 4   address            1000 non-null   int64  
 5   salary             1000 non-null   float64
 6   ed                 1000 non-null   int64  
 7   employment_status  1000 non-null   int64  
 8   is_retired         1000 non-null   object 
 9   gender             1000 non-null   object 
 10  reside             1000 non-null   int64  
 11  service            1000 non-null   object 
dtypes: float64(1), int64(7), object(4)
memory usage: 93.9+ KB


## Explotary Data Analysis

In [6]:
# Create a histogram to show the distribution of the 'district' column
district_distribution_fig = px.histogram(
    df,
    x="district",
    title="District Distribution",
    text_auto=True,  # Add text labels to show the count over each bar
)

# Customize the layout for better readability
district_distribution_fig.update_layout(
    xaxis_title="District",
    yaxis_title="Count",
    title={
        "text": "District Distribution",
        "y": 0.9,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    },
    xaxis=dict(
        tickmode="array",  # Show only labels without ticks
        showticklabels=True,
        showgrid=False,
        zeroline=False,
    ),
)

# Display the histogram
district_distribution_fig.show()

Almost equal which is good for the KNN

In [7]:
# Create histogram
age_distribution_fig = px.histogram(
    df,
    x="age",
    title="Age Distribution",
    text_auto=True,  # Show text labels on bars
    nbins=30,  # Customize number of bins for better granularity
)

# Calculate statistics
median_age = df["age"].median()
mean_age = df["age"].mean()
first_quartile = df["age"].quantile(0.25)
third_quartile = df["age"].quantile(0.75)

# Add vertical lines for statistical markers without annotation_text
stat_lines = [
    {"x": median_age, "color": "green", "text": "Median"},
    {"x": mean_age, "color": "blue", "text": "Mean"},
    {"x": first_quartile, "color": "red", "text": "Q1"},
    {"x": third_quartile, "color": "red", "text": "Q3"},
]

for line in stat_lines:
    age_distribution_fig.add_vline(
        x=line["x"],
        line_dash="dash",
        line_color=line["color"],
    )
    # Add text annotations at top of the chart
    age_distribution_fig.add_annotation(
        x=line["x"],
        y=max(df["age"].value_counts()) * 0.95,
        text=line["text"],
        showarrow=False,
        textangle=-45,
        font=dict(color=line["color"]),
        yshift=120,
        xshift=15,
    )

# Customize layout for readability and presentation
age_distribution_fig.update_layout(
    title={
        "text": "Age Distribution",
        "y": 0.95,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    },
    xaxis_title="Age",
    yaxis_title="Frequency",
    xaxis=dict(
        tickmode="linear", showgrid=False, zeroline=False
    ),  # Linear ticks for age axis
    yaxis=dict(showgrid=True),  # Y-axis grid for readability
    bargap=0.1,  # Adjust bar gap for aesthetic
)

# Customize hover labels
age_distribution_fig.update_traces(
    hovertemplate="Age: %{x}<br>Count: %{y}<extra></extra>",  # Custom tooltip format
)

# Display the improved histogram
age_distribution_fig.show()

Has huge varience but relately younger dataset.

In [8]:
# show is_married distribution
is_married_distribution_fig = px.pie(
    df,
    names="is_married",
    title="Marital Status Distribution",
    hole=0.3,  # Set the size of the hole in the pie chart
    labels=["Not Married", "Married"],  # Custom labels for the pie chart
)

# Customize layout for better presentation
is_married_distribution_fig.update_layout(
    title={
        "text": "Marital Status Distribution",
        "y": 0.95,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    },
)

# Display the pie chart
is_married_distribution_fig.show()

Almost equal again.

In [9]:
# show gender distribution
gender_distribution_fig = px.pie(
    df,
    names="gender",
    title="Gender Distribution",
    hole=0.3,
    labels=["Male", "Female"],
)

# Customize layout for better presentation
gender_distribution_fig.update_layout(
    title={
        "text": "Gender Distribution",
        "y": 0.95,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    },
)

# Display the pie chart
gender_distribution_fig.show()

We can see little shift to male. But nothing considered as a problem to fix.

In [10]:
# show distribution of the 'reside' column
reside_distribution_fig = px.histogram(
    df,
    x="reside",
    title="Reside Distribution",
    text_auto=True,
)

# Customize layout for better readability
reside_distribution_fig.update_layout(
    xaxis_title="Reside",
    yaxis_title="Count",
    title={
        "text": "Reside Distribution",
        "y": 0.9,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    },
    xaxis=dict(
        tickmode="array",
        showticklabels=True,
        showgrid=False,
        zeroline=False,
    ),
)

# Display the histogram
reside_distribution_fig.show()

In [11]:
# Create the histogram for customer_since distribution
customer_since_distribution_fig = px.histogram(
    df,
    x="customer_since",
    title="Customer Since Distribution",
    text_auto=True,  # Show text labels on bars
    nbins=30,  # Customize number of bins if needed
)

# Calculate statistics
median_since = df["customer_since"].median()
mean_since = df["customer_since"].mean()
first_quartile_since = df["customer_since"].quantile(0.25)
third_quartile_since = df["customer_since"].quantile(0.75)

# Add vertical lines for statistical markers without annotation_text
stat_lines_since = [
    {"x": median_since, "color": "green", "text": "Median"},
    {"x": mean_since, "color": "blue", "text": "Mean"},
    {"x": first_quartile_since, "color": "red", "text": "Q1"},
    {"x": third_quartile_since, "color": "red", "text": "Q3"},
]

for line in stat_lines_since:
    customer_since_distribution_fig.add_vline(
        x=line["x"],
        line_dash="dash",
        line_color=line["color"],
    )
    # Add text annotations at top of the chart
    customer_since_distribution_fig.add_annotation(
        x=line["x"],
        y=max(df["age"].value_counts()) * 0.95,
        text=line["text"],
        showarrow=False,
        textangle=-45,
        font=dict(color=line["color"]),
        yshift=120,
        xshift=15,
    )

# Customize layout for readability and presentation
customer_since_distribution_fig.update_layout(
    title={
        "text": "Customer Since Distribution",
        "y": 0.95,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    },
    xaxis_title="Customer Since",
    yaxis_title="Frequency",
    xaxis=dict(
        tickmode="linear", showgrid=False, zeroline=False
    ),  # Linear ticks for year axis
    yaxis=dict(showgrid=True),  # Y-axis grid for readability
    bargap=0.1,  # Adjust bar gap for aesthetic
)

# Customize hover labels
customer_since_distribution_fig.update_traces(
    hovertemplate="Customer Since: %{x}<br>Count: %{y}<extra></extra>",  # Custom tooltip format
)

# Display the improved histogram
customer_since_distribution_fig.show()

We can see the right shift clearly. Note that to consider future.

In [12]:
# show distrubution of the service column
service_distribution_fig = px.histogram(
    df,
    x="service",
    title="Service Distribution",
    text_auto=True,
)

# Customize layout for better readability
service_distribution_fig.update_layout(
    xaxis_title="Service",
    yaxis_title="Count",
    title={
        "text": "Service Distribution",
        "y": 0.9,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    },
    xaxis=dict(
        tickmode="array",
        showticklabels=True,
        showgrid=False,
        zeroline=False,
    ),
)

# Display the histogram
service_distribution_fig.show()

Very close. Nothing sharp to think about it.

In [13]:
# Create box plot for income distribution by service
income_service_distribution_fig = px.box(
    df,
    x="service",
    y="salary",
    title="Income Distribution by Service",
    points="all",  # Show all data points
    notched=True,  # Add notches for median confidence intervals
)

# Customize layout for better readability and presentation
income_service_distribution_fig.update_layout(
    xaxis_title="Service",
    yaxis_title="Income",
    title={
        "text": "Income Distribution by Service",
        "y": 0.95,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    },
    xaxis=dict(
        tickmode="array",
        showticklabels=True,
        showgrid=False,
        zeroline=False,
    ),
    yaxis=dict(
        showgrid=True,
        gridcolor="lightgrey",
        tickformat=",.0f",  # Format numbers with commas (e.g., 100,000)
    ),
)

# Customize hover labels for clarity
income_service_distribution_fig.update_traces(
    hovertemplate="Service: %{x}<br>Income: %{y:$,.0f}<extra></extra>"  # Format hover to show salary with commas
)

# Display the improved box plot
income_service_distribution_fig.show()

There are some outlier values which we need to consider.

In [14]:
# show ed column distribution
ed_distribution_fig = px.histogram(
    df,
    x="ed",
    title="Education Distribution",
    text_auto=True,
)

# Customize layout for better readability
ed_distribution_fig.update_layout(
    xaxis_title="Education",
    yaxis_title="Count",
    title={
        "text": "Education Distribution",
        "y": 0.9,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    },
    xaxis=dict(
        tickmode="array",
        showticklabels=True,
        showgrid=False,
        zeroline=False,
    ),
)

# Display the histogram
ed_distribution_fig.show()

In [15]:
# Convert 'service' column to numeric by mapping categories to numbers
df["service_numeric"] = df["service"].astype("category").cat.codes

# Create scatter plot for education and service relation
ed_service_relation_fig = px.scatter(
    df,
    x="ed",
    y="service_numeric",  # Use the numeric version of 'service'
    title="Education and Service Relation",
    color="service",  # Color points by the original 'service' categories
    marginal_y="box",  # Add box plot on y-axis
    marginal_x="box",  # Add box plot on x-axis
    trendline="ols",  # Add linear trendline
    trendline_color_override="black",  # Make the trendline stand out
)

# Customize layout for better readability and presentation
ed_service_relation_fig.update_layout(
    xaxis_title="Education",
    yaxis_title="Service",
    title={
        "text": "Education and Service Relation",
        "y": 0.95,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    },
    xaxis=dict(
        showgrid=False,
        zeroline=False,
        title_standoff=15,
    ),
    yaxis=dict(
        showgrid=True,
        zeroline=False,
        gridcolor="lightgrey",
        title_standoff=15,
        tickvals=df["service_numeric"].unique(),
        ticktext=df[
            "service"
        ].unique(),  # Map numeric values back to original service names
    ),
)

# Customize hover labels for clarity
ed_service_relation_fig.update_traces(
    hovertemplate="Education: %{x}<br>Service: %{marker.color}<extra></extra>"
)

# Display the scatter plot
ed_service_relation_fig.show()

# Drop the 'service_numeric' column to avoid confusion
df = df.drop("service_numeric", axis=1)

### Encode categorical variables

In [16]:
# Map the 'is_married' column using the dictionary
df["is_married"] = df["is_married"].map({"Yes": 1, "No": 0})

# map is_retired column to numeric
df["is_retired"] = df["is_retired"].map({"Yes": 1, "No": 0})

# map gender column to numeric
df["gender"] = df["gender"].map({"M": 1, "F": 0})

# map service column to numeric values according to above graph show us relation to is income ordinally
df["service"] = df["service"].map(
    {
        "Fundamental Service": 1,
        "Complete Service": 2,
        "Advanced Service": 3,
        "E-Service": 4,
    }
)

In [17]:
# Calculate the correlation matrix
correlation_matrix = df.corr(method="spearman")

# Create a heatmap to show the correlation matrix
heatmap = go.Figure(
    data=go.Heatmap(
        z=correlation_matrix.values,
        x=correlation_matrix.columns,
        y=correlation_matrix.columns,
        colorscale="Viridis",
        text=correlation_matrix.values,  # Add text annotations
        hoverinfo="text",
    )
)

# Add text annotations for each cell with smaller font size
for i in range(len(correlation_matrix.columns)):
    for j in range(len(correlation_matrix.columns)):
        heatmap.add_annotation(
            x=correlation_matrix.columns[i],
            y=correlation_matrix.columns[j],
            text=str(round(correlation_matrix.values[j][i], 2)),
            showarrow=False,
            font=dict(
                size=8,  # Smaller font size
                color="white" if correlation_matrix.values[j][i] < 0.5 else "black",
            ),
        )

# Customize the layout for better readability
heatmap.update_layout(
    title="Correlation Matrix Heatmap",
    xaxis_title="Features",
    yaxis_title="Features",
    title_x=0.5,
    title_y=0.9,
    xaxis=dict(showgrid=True, zeroline=False),
    yaxis=dict(showgrid=True, zeroline=False),
)

# Display the heatmap
heatmap.show()

Choose spearman correlation since we looking for a service(categorical). And clearly seen above not many positive relation between service and dataset.

## Creating Datasets and Train-Test Split

Create 4 different dataset which is:
- Raw
- Scaled
- Eliminated
- Eliminated / Scaled

### Raw Dataset

In [18]:
X_train, X_test, y_train, y_test = train_test_split(
    df,
    df["service"],
    test_size=0.2,
    random_state=42,
    stratify=df["service"],
    shuffle=True,
)

display(X_train.head())
display(X_test.head())

Unnamed: 0,district,customer_since,age,is_married,address,salary,ed,employment_status,is_retired,gender,reside,service
742,3,72,65,1,33,71.0,1,41,1,1,4,3
329,3,14,36,0,13,67.0,5,4,0,1,1,1
180,2,69,42,1,11,65.0,2,18,0,0,2,2
164,1,60,46,0,17,81.0,5,9,0,1,1,4
54,1,52,27,0,6,47.0,3,5,0,0,2,1


Unnamed: 0,district,customer_since,age,is_married,address,salary,ed,employment_status,is_retired,gender,reside,service
508,1,39,47,0,1,68.0,4,10,0,1,2,4
940,1,13,41,0,9,55.0,2,12,0,1,3,1
897,1,26,47,0,13,54.0,3,0,0,0,1,4
731,1,8,42,1,2,129.0,4,17,0,1,3,1
570,3,20,32,0,10,19.0,3,5,0,0,1,4


In [19]:
# Define a function to perform Min-Max scaling
def min_max_scale(column):
    return (column - np.min(column)) / (np.max(column) - np.min(column))


# Define a function to perform one-hot encoding using NumPy
def one_hot_encode(column):
    unique_values = np.unique(column)
    one_hot_encoded = np.zeros((column.shape[0], unique_values.shape[0]))
    for i, unique_value in enumerate(unique_values):
        one_hot_encoded[:, i] = (column == unique_value).astype(int)
    return one_hot_encoded, unique_values

In [20]:
# List of continuous columns to scale
continuous_columns = [
    "customer_since",
    "age",
    "salary",
]

categorical_columns = [
    "district",
    "address",
    "ed",
    "employment_status",
    "reside",
]

### Scaled Dataset

Train test data(information) leak considered.

In [21]:
(
    X_train_scaled,
    X_test_scaled,
) = (
    X_train.copy(),
    X_test.copy(),
)

# Apply Min-Max scaling to the continuous columns
for col in continuous_columns:
    X_train_scaled[col] = min_max_scale(X_train_scaled[col])
    X_test_scaled[col] = min_max_scale(X_test_scaled[col])


# Get unique categories from the training set for consistent one-hot encoding
categorical_mappings = {}

for col in categorical_columns:
    unique_values = X_train[col].unique()  # Get unique values from the training set
    categorical_mappings[col] = unique_values  # Store the unique values for this column

    # Apply one-hot encoding to the training set
    one_hot_encoded_train, _ = one_hot_encode(X_train_scaled[col])
    for i, unique_value in enumerate(unique_values):
        X_train_scaled[f"{col}_{unique_value}"] = one_hot_encoded_train[:, i]

    # Apply the same one-hot encoding to the test set
    one_hot_encoded_test = np.zeros(
        (X_test_scaled.shape[0], len(unique_values))
    )  # Ensure same number of columns
    for i, unique_value in enumerate(unique_values):
        one_hot_encoded_test[:, i] = (X_test_scaled[col] == unique_value).astype(int)

    for i, unique_value in enumerate(unique_values):
        X_test_scaled[f"{col}_{unique_value}"] = one_hot_encoded_test[:, i]

# Drop the original categorical columns from both datasets
X_train_scaled = X_train_scaled.drop(categorical_columns, axis=1)
X_test_scaled = X_test_scaled.drop(categorical_columns, axis=1)

# Display the scaled DataFrame
display(X_train_scaled.head())
display(X_test_scaled.head())

Unnamed: 0,customer_since,age,is_married,salary,is_retired,gender,service,district_3,district_2,district_1,...,employment_status_40,employment_status_44,reside_4,reside_1,reside_2,reside_5,reside_3,reside_6,reside_7,reside_8
742,1.0,0.79661,1,0.055258,1,1,3,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
329,0.183099,0.305085,0,0.051693,0,1,1,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
180,0.957746,0.40678,1,0.049911,0,0,2,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
164,0.830986,0.474576,0,0.064171,0,1,4,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
54,0.71831,0.152542,0,0.033868,0,0,1,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


Unnamed: 0,customer_since,age,is_married,salary,is_retired,gender,service,district_3,district_2,district_1,...,employment_status_40,employment_status_44,reside_4,reside_1,reside_2,reside_5,reside_3,reside_6,reside_7,reside_8
508,0.535211,0.482143,0,0.035564,0,1,4,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
940,0.169014,0.375,0,0.027728,0,1,1,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
897,0.352113,0.482143,0,0.027125,0,0,4,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
731,0.098592,0.392857,1,0.072333,0,1,1,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
570,0.267606,0.214286,0,0.006028,0,0,4,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


### Eliminated Dataset

- Eliminate the reside column since it's higly(considering the dataset) correlated to is_married column and less correlated to service than is_married.

- Eliminate the district since it's negatively correlated to service.

In [22]:
X_train_eliminated = X_train.drop(
    ["reside", "district"],
    axis=1,
)

X_test_eliminated = X_test.drop(
    ["reside", "district"],
    axis=1,
)

display(X_train_eliminated.head())
display(X_test_eliminated.head())

Unnamed: 0,customer_since,age,is_married,address,salary,ed,employment_status,is_retired,gender,service
742,72,65,1,33,71.0,1,41,1,1,3
329,14,36,0,13,67.0,5,4,0,1,1
180,69,42,1,11,65.0,2,18,0,0,2
164,60,46,0,17,81.0,5,9,0,1,4
54,52,27,0,6,47.0,3,5,0,0,1


Unnamed: 0,customer_since,age,is_married,address,salary,ed,employment_status,is_retired,gender,service
508,39,47,0,1,68.0,4,10,0,1,4
940,13,41,0,9,55.0,2,12,0,1,1
897,26,47,0,13,54.0,3,0,0,0,4
731,8,42,1,2,129.0,4,17,0,1,1
570,20,32,0,10,19.0,3,5,0,0,4


### Eliminated/Scaled Dataset

Train test data(information) leak considered.

In [23]:
eliminated_categorical_columns = [
    "address",
    "ed",
    "employment_status",
]

In [24]:
X_train_eliminated_scaled = X_train_eliminated.copy()
X_test_eliminated_scaled = X_test_eliminated.copy()

for col in continuous_columns:
    X_train_eliminated_scaled[col] = min_max_scale(X_train_eliminated_scaled[col])
    X_test_eliminated_scaled[col] = min_max_scale(X_test_eliminated_scaled[col])


for col in eliminated_categorical_columns:
    unique_values = X_train_eliminated[
        col
    ].unique()  # Get unique values from the training set

    # Apply one-hot encoding to the training set
    one_hot_encoded_train, _ = one_hot_encode(X_train_eliminated_scaled[col])
    for i, unique_value in enumerate(unique_values):
        X_train_eliminated_scaled[f"{col}_{unique_value}"] = one_hot_encoded_train[:, i]

    # Apply the same one-hot encoding to the test set
    one_hot_encoded_test = np.zeros(
        (X_test_eliminated_scaled.shape[0], len(unique_values))
    )  # Ensure same number of columns
    for i, unique_value in enumerate(unique_values):
        one_hot_encoded_test[:, i] = (
            X_test_eliminated_scaled[col] == unique_value
        ).astype(int)

    for i, unique_value in enumerate(unique_values):
        X_test_eliminated_scaled[f"{col}_{unique_value}"] = one_hot_encoded_test[:, i]

# Drop the original categorical columns from both datasets
X_train_eliminated_scaled = X_train_eliminated_scaled.drop(
    eliminated_categorical_columns, axis=1
)
X_test_eliminated_scaled = X_test_eliminated_scaled.drop(
    eliminated_categorical_columns, axis=1
)

display(X_train_eliminated_scaled.head())
display(X_test_eliminated_scaled.head())

Unnamed: 0,customer_since,age,is_married,salary,is_retired,gender,service,address_33,address_13,address_11,...,employment_status_43,employment_status_34,employment_status_23,employment_status_45,employment_status_37,employment_status_36,employment_status_27,employment_status_24,employment_status_40,employment_status_44
742,1.0,0.79661,1,0.055258,1,1,3,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
329,0.183099,0.305085,0,0.051693,0,1,1,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
180,0.957746,0.40678,1,0.049911,0,0,2,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
164,0.830986,0.474576,0,0.064171,0,1,4,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
54,0.71831,0.152542,0,0.033868,0,0,1,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Unnamed: 0,customer_since,age,is_married,salary,is_retired,gender,service,address_33,address_13,address_11,...,employment_status_43,employment_status_34,employment_status_23,employment_status_45,employment_status_37,employment_status_36,employment_status_27,employment_status_24,employment_status_40,employment_status_44
508,0.535211,0.482143,0,0.035564,0,1,4,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
940,0.169014,0.375,0,0.027728,0,1,1,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
897,0.352113,0.482143,0,0.027125,0,0,4,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
731,0.098592,0.392857,1,0.072333,0,1,1,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
570,0.267606,0.214286,0,0.006028,0,0,4,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Implementing KNN and Variations

### Original KNN Algortihm

In [25]:
# Euclidean distance between two points
def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

In [26]:
def get_k_nearest_neighbors(X_train, y_train, x_test, k):
    # Calculate distances from x_test to all training points
    distances = np.array([euclidean_distance(x_test, x_train) for x_train in X_train])

    # Get the indices of the k smallest distances
    k_neighbors_indices = distances.argsort()[:k]

    # Return the classes and distances of the k nearest neighbors
    return y_train[k_neighbors_indices], distances[k_neighbors_indices]

In [27]:
def predict_classification(neighbors):
    # Get the most common class in the neighbors
    most_common = Counter(neighbors).most_common(1)
    return most_common[0][0]

In [28]:
def predict(X_train, y_train, x_test, k):
    # Get the k nearest neighbors
    neighbors, _ = get_k_nearest_neighbors(X_train, y_train, x_test, k)

    # Predict the class based on the neighbors
    return predict_classification(neighbors)

In [29]:
def k_nearest_neighbors(X_train, y_train, X_test, k):
    predictions = [predict(X_train, y_train, x_test, k) for x_test in X_test]
    return np.array(predictions)

### Weighted k-NN (based on distance)

In [30]:
def weighted_classification(neighbors, distances):
    # Inverse distance weighting
    weights = 1 / (distances + 1e-5)  # Add small value to avoid division by zero

    # Dictionary to store the sum of weights for each class
    class_weights = {}

    for idx, neighbor in enumerate(neighbors):
        if neighbor in class_weights:
            class_weights[neighbor] += weights[idx]
        else:
            class_weights[neighbor] = weights[idx]

    # Return the class with the highest weighted sum
    return max(class_weights, key=class_weights.get)

In [31]:
def predict_weighted(X_train, y_train, x_test, k):
    # Get the k nearest neighbors and their distances
    neighbors, distances = get_k_nearest_neighbors(X_train, y_train, x_test, k)

    # Predict the class using weighted voting
    return weighted_classification(neighbors, distances)

In [32]:
def k_nearest_neighbors_weighted(X_train, y_train, X_test, k):
    predictions = [predict_weighted(X_train, y_train, x_test, k) for x_test in X_test]
    return np.array(predictions)

### Weighted k-NN (based on class frequency)

In [33]:
def compute_class_frequencies(y_train):
    # Count the occurrence of each class in the training labels
    unique, counts = np.unique(y_train, return_counts=True)

    # Calculate the frequency of each class
    class_frequencies = dict(zip(unique, counts))

    # Invert the frequency to give more weight to less frequent classes
    class_weights = {cls: 1.0 / freq for cls, freq in class_frequencies.items()}

    return class_weights

In [34]:
def class_frequency_weighted_classification(neighbors, class_weights):
    # Dictionary to store the sum of weights for each class
    weighted_class_sums = {}

    for neighbor in neighbors:
        if neighbor in weighted_class_sums:
            weighted_class_sums[neighbor] += class_weights[neighbor]
        else:
            weighted_class_sums[neighbor] = class_weights[neighbor]

    # Return the class with the highest weighted sum
    return max(weighted_class_sums, key=weighted_class_sums.get)

In [35]:
def predict_class_frequency_weighted(X_train, y_train, x_test, k, class_weights):
    # Get the k nearest neighbors (ignore distances)
    neighbors, _ = get_k_nearest_neighbors(X_train, y_train, x_test, k)

    # Predict the class based on class frequency weighting
    return class_frequency_weighted_classification(neighbors, class_weights)

In [36]:
def k_nearest_neighbors_class_frequency_weighted(X_train, y_train, X_test, k):
    # Compute the class frequencies from the training set
    class_weights = compute_class_frequencies(y_train)

    # Predict the class for each test point using the class frequency weighting
    predictions = [
        predict_class_frequency_weighted(X_train, y_train, x_test, k, class_weights)
        for x_test in X_test
    ]

    return np.array(predictions)

### Combined weighted k-NN (distance + class frequency)

In [37]:
def combined_weighted_classification(neighbors, distances, class_weights):
    # Dictionary to store the combined weight for each class
    combined_class_weights = {}

    for idx, neighbor in enumerate(neighbors):
        # Compute the combined weight (inverse distance * inverse class frequency)
        weight = (1 / (distances[idx] + 1e-5)) * class_weights[neighbor]

        if neighbor in combined_class_weights:
            combined_class_weights[neighbor] += weight
        else:
            combined_class_weights[neighbor] = weight

    # Return the class with the highest combined weight
    return max(combined_class_weights, key=combined_class_weights.get)

In [38]:
def predict_combined_weighted(X_train, y_train, x_test, k, class_weights):
    # Get the k nearest neighbors and their distances
    neighbors, distances = get_k_nearest_neighbors(X_train, y_train, x_test, k)

    # Predict the class using combined distance and class frequency weighting
    return combined_weighted_classification(neighbors, distances, class_weights)

In [39]:
def k_nearest_neighbors_combined_weighted(X_train, y_train, X_test, k):
    # Compute the class frequencies from the training set
    class_weights = compute_class_frequencies(y_train)

    # Predict the class for each test point using combined weighting
    predictions = [
        predict_combined_weighted(X_train, y_train, x_test, k, class_weights)
        for x_test in X_test
    ]

    return np.array(predictions)

### Metrics

In [40]:
def accuracy(y_true, y_pred):
    return np.sum(y_true == y_pred) / len(y_true)

In [41]:
# Function to compute precision and recall for multiclass classification
def precision_recall_multiclass(y_true, y_pred, class_labels=[1, 2, 3, 4]):
    # Initialize dictionaries to store precision and recall for each class
    precision = {}
    recall = {}

    for class_label in class_labels:
        # True Positives (TP): Correctly predicted as this class
        TP = np.sum((y_true == class_label) & (y_pred == class_label))

        # False Positives (FP): Predicted as this class but actually belong to a different class
        FP = np.sum((y_true != class_label) & (y_pred == class_label))

        # False Negatives (FN): Actually belong to this class but predicted as a different class
        FN = np.sum((y_true == class_label) & (y_pred != class_label))

        # Calculate precision and recall for the class
        precision[class_label] = TP / (TP + FP) if (TP + FP) > 0 else 0.0
        recall[class_label] = TP / (TP + FN) if (TP + FN) > 0 else 0.0

    return precision, recall

### Elbow Method for Selecting Best k Value

In [42]:
def elbow_method(X_train, y_train, k_values, algorithm):
    # Split the training set into training and validation sets
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
    )

    validation_errors = {}

    for k in k_values:
        # Predict on validation set using k-NN with the current value of k
        y_pred = algorithm(X_train.values, y_train.values, X_val.values, k)

        # Calculate the validation error
        error = 1 - accuracy(y_val, y_pred)

        # add error to dictionary
        validation_errors[k] = error

    return validation_errors

## Experiments

### Running Elbow Method for Every Dataset and Algorithm to Find Best k Values over Train Test

In [43]:
algorithms = {
    "k-NN": k_nearest_neighbors,
    "Weighted k-NN": k_nearest_neighbors_weighted,
    "Class Frequency Weighted k-NN": k_nearest_neighbors_class_frequency_weighted,
    "Combined Weighted k-NN": k_nearest_neighbors_combined_weighted,
}

datasets = {
    "Raw": (X_train, X_test),
    "Scaled": (X_train_scaled, X_test_scaled),
    "Eliminated": (X_train_eliminated, X_test_eliminated),
    "Eliminated Scaled": (X_train_eliminated_scaled, X_test_eliminated_scaled),
}

k_values = range(1, 21)

# Initialize a dictionary to store the validation errors for each algorithm and dataset
validation_errors = {algorithm: {} for algorithm in algorithms}

# Compute the validation errors for each algorithm and dataset
for algorithm_name, algorithm in algorithms.items():
    for dataset_name, (X_train, X_test) in datasets.items():
        # Compute the validation errors for the current algorithm and dataset
        validation_errors[algorithm_name][dataset_name] = elbow_method(
            X_train, y_train, k_values, algorithm
        )

In [47]:
# Loop through each algorithm in algorithms
for algorithm_name, algorithm_function in algorithms.items():
    # Initialize figure for each algorithm
    fig = go.Figure()

    # Add Scaled values
    scaled_values = list(validation_errors[algorithm_name]["Scaled"].values())
    scaled_min_index = scaled_values.index(min(scaled_values))
    fig.add_trace(
        go.Scatter(
            x=list(k_values),
            y=scaled_values,
            mode="lines+markers",
            name=f"{algorithm_name} (Scaled)",
            line=dict(color="green"),
        )
    )
    fig.add_annotation(
        x=k_values[scaled_min_index],
        y=scaled_values[scaled_min_index],
        text=f"Min: {scaled_values[scaled_min_index]:.2f}",
        showarrow=True,
        arrowhead=1,
        ax=-40,
        ay=-40,
    )

    # Add Eliminated values
    eliminated_values = list(validation_errors[algorithm_name]["Eliminated"].values())
    eliminated_min_index = eliminated_values.index(min(eliminated_values))
    fig.add_trace(
        go.Scatter(
            x=list(k_values),
            y=eliminated_values,
            mode="lines+markers",
            name=f"{algorithm_name} (Eliminated)",
            line=dict(color="red"),
        )
    )
    fig.add_annotation(
        x=k_values[eliminated_min_index],
        y=eliminated_values[eliminated_min_index],
        text=f"Min: {eliminated_values[eliminated_min_index]:.2f}",
        showarrow=True,
        arrowhead=1,
        ax=-40,
        ay=-40,
    )

    # Add Eliminated Scaled values
    eliminated_scaled_values = list(
        validation_errors[algorithm_name]["Eliminated Scaled"].values()
    )
    eliminated_scaled_min_index = eliminated_scaled_values.index(
        min(eliminated_scaled_values)
    )
    fig.add_trace(
        go.Scatter(
            x=list(k_values),
            y=eliminated_scaled_values,
            mode="lines+markers",
            name=f"{algorithm_name} (Eliminated Scaled)",
            line=dict(color="purple"),
        )
    )
    fig.add_annotation(
        x=k_values[eliminated_scaled_min_index],
        y=eliminated_scaled_values[eliminated_scaled_min_index],
        text=f"Min: {eliminated_scaled_values[eliminated_scaled_min_index]:.2f}",
        showarrow=True,
        arrowhead=1,
        ax=-40,
        ay=-40,
    )

    # Add Raw values
    raw_values = list(validation_errors[algorithm_name]["Raw"].values())
    raw_min_index = raw_values.index(min(raw_values))
    fig.add_trace(
        go.Scatter(
            x=list(k_values),
            y=raw_values,
            mode="lines+markers",
            name=f"{algorithm_name} (Raw)",
            line=dict(color="blue"),
        )
    )
    fig.add_annotation(
        x=k_values[raw_min_index],
        y=raw_values[raw_min_index],
        text=f"Min: {raw_values[raw_min_index]:.2f}",
        showarrow=True,
        arrowhead=1,
        ax=-40,
        ay=-40,
    )

    # Customize layout for each figure
    fig.update_layout(
        title=f"Elbow Method Results for {algorithm_name}",
        xaxis_title="Number of Neighbors (k)",
        yaxis_title="Error Rate",
        legend_title="Method",
        template="plotly_white",
    )

    # Display the figure
    fig.show()

In [48]:
# print best values for each algorithm
for algorithm_name, algorithm_function in algorithms.items():
    print(f"Algorithm: {algorithm_name}")
    for dataset_name, (X_train, X_test) in datasets.items():
        values = list(validation_errors[algorithm_name][dataset_name].values())
        min_index = values.index(min(values))
        print(
            f"Dataset: {dataset_name} - Min Error: {values[min_index]:.2f} (k={min_index+1})"
        )
    print()

Algorithm: k-NN
Dataset: Raw - Min Error: 0.65 (k=14)
Dataset: Scaled - Min Error: 0.04 (k=20)
Dataset: Eliminated - Min Error: 0.64 (k=16)
Dataset: Eliminated Scaled - Min Error: 0.02 (k=16)

Algorithm: Weighted k-NN
Dataset: Raw - Min Error: 0.64 (k=18)
Dataset: Scaled - Min Error: 0.04 (k=16)
Dataset: Eliminated - Min Error: 0.64 (k=18)
Dataset: Eliminated Scaled - Min Error: 0.02 (k=16)

Algorithm: Class Frequency Weighted k-NN
Dataset: Raw - Min Error: 0.64 (k=17)
Dataset: Scaled - Min Error: 0.04 (k=20)
Dataset: Eliminated - Min Error: 0.62 (k=17)
Dataset: Eliminated Scaled - Min Error: 0.03 (k=15)

Algorithm: Combined Weighted k-NN
Dataset: Raw - Min Error: 0.64 (k=19)
Dataset: Scaled - Min Error: 0.04 (k=20)
Dataset: Eliminated - Min Error: 0.63 (k=17)
Dataset: Eliminated Scaled - Min Error: 0.03 (k=15)



### Run Every Algotihm with Best k Value Found Before over Test Set

In [49]:
# run every algorithm with best k value for every dataset and store results
results = {}

for algorithm_name, algorithm_function in algorithms.items():
    results[algorithm_name] = {}
    for dataset_name, (X_train, X_test) in datasets.items():
        k = (
            list(validation_errors[algorithm_name][dataset_name].values()).index(
                min(validation_errors[algorithm_name][dataset_name].values())
            )
            + 1
        )
        y_pred = algorithm_function(X_train.values, y_train.values, X_test.values, k)

        # Store the accuracy, precision, and recall for the current algorithm and dataset
        results[algorithm_name][dataset_name] = {
            "Accuracy": accuracy(y_test, y_pred),
            "Precision": precision_recall_multiclass(y_test, y_pred)[0],
            "Recall": precision_recall_multiclass(y_test, y_pred)[1],
        }

In [58]:
class_names = {
    "Fundamental Service": 1,
    "Complete Service": 2,
    "Advanced Service": 3,
    "E-Service": 4,
}

for algorithm_name, algorithm_data in results.items():
    # Initialize figure for each algorithm
    fig = go.Figure()

    # Plot Accuracy for each method
    accuracy_values = [algorithm_data[method]["Accuracy"] for method in algorithm_data]
    fig.add_trace(
        go.Bar(
            x=list(algorithm_data.keys()),
            y=accuracy_values,
            name="Accuracy",
            text=[f"{val:.3f}" for val in accuracy_values],  # Display values over bars
            textposition="auto",
            marker_color="blue",
        )
    )

    # Plot Precision for each class
    for class_name, class_index in class_names.items():
        precision_values = [
            algorithm_data[method]["Precision"][class_index]
            for method in algorithm_data
        ]
        fig.add_trace(
            go.Bar(
                x=list(algorithm_data.keys()),
                y=precision_values,
                name=f"Precision ({class_name})",
                text=[
                    f"{val:.3f}" for val in precision_values
                ],  # Display values over bars
                textposition="auto",
                marker_color=f"rgba(0, 255, {class_index*60}, 0.8)",
            )
        )

    # Plot Recall for each class
    for class_name, class_index in class_names.items():
        recall_values = [
            algorithm_data[method]["Recall"][class_index] for method in algorithm_data
        ]
        fig.add_trace(
            go.Bar(
                x=list(algorithm_data.keys()),
                y=recall_values,
                name=f"Recall ({class_name})",
                text=[
                    f"{val:.3f}" for val in recall_values
                ],  # Display values over bars
                textposition="auto",
                marker_color=f"rgba(255, 0, {class_index*60}, 0.8)",
            )
        )

    # Calculate overall precision and recall for each method
    overall_precision_values = [
        sum(algorithm_data[method]["Precision"].values())
        / len(algorithm_data[method]["Precision"])
        for method in algorithm_data
    ]
    overall_recall_values = [
        sum(algorithm_data[method]["Recall"].values())
        / len(algorithm_data[method]["Recall"])
        for method in algorithm_data
    ]

    # Plot overall precision
    fig.add_trace(
        go.Bar(
            x=list(algorithm_data.keys()),
            y=overall_precision_values,
            name="Overall Precision",
            text=[
                f"{val:.3f}" for val in overall_precision_values
            ],  # Display values over bars
            textposition="auto",
            marker_color="green",
        )
    )

    # Plot overall recall
    fig.add_trace(
        go.Bar(
            x=list(algorithm_data.keys()),
            y=overall_recall_values,
            name="Overall Recall",
            text=[
                f"{val:.3f}" for val in overall_recall_values
            ],  # Display values over bars
            textposition="auto",
            marker_color="red",
        )
    )

    # Customize layout for each figure
    fig.update_layout(
        title=f"Performance Metrics for {algorithm_name}",
        xaxis_title="Method",
        yaxis_title="Metric Value",
        barmode="group",  # Group bars together
        legend_title="Metrics",
        template="plotly_white",
    )

    # Display the figure
    fig.show()

### Report: Performance Comparison of k-NN Variants Using Different Preprocessing Techniques

This report presents the performance metrics of four k-Nearest Neighbors (k-NN) variants, evaluated using four different preprocessing techniques: **Raw**, **Scaled**, **Eliminated**, and **Eliminated Scaled**. The metrics considered are **Accuracy**, **Precision**, and **Recall** for four service categories, as well as the overall precision and recall for each method. The algorithms assessed include:

1. **k-NN**
2. **Weighted k-NN**
3. **Class Frequency Weighted k-NN**
4. **Combined Weighted k-NN**

The four classes analyzed are:
- **Fundamental Service**
- **Complete Service**
- **Advanced Service**
- **E-Service**

Performance is displayed through bar charts grouped by preprocessing techniques, with specific values shown on the bars for easy comparison. Below is an analysis of each graph and the insights drawn.

---

### 1. **k-NN Performance**

#### Observations:
- **Accuracy**: A significant improvement is observed in accuracy from the **Raw** method (around 0.31) to **Scaled** (0.945) and **Eliminated Scaled** (0.97). The **Eliminated** method shows only a slight improvement (around 0.35).
  
- **Precision and Recall**: The precision and recall metrics mirror this trend. For **Raw** data, precision and recall are low across all classes, especially for **E-Service** (Precision: ~0.270, Recall: ~0.227). In contrast, the **Scaled** and **Eliminated Scaled** methods show near-perfect precision and recall across all service categories, demonstrating a substantial improvement.

- **Overall Precision and Recall**: These metrics are very high for **Scaled** and **Eliminated Scaled**, both approaching values of 1.0, highlighting how scaling the data improves the overall performance of the k-NN model.

---

### 2. **Weighted k-NN Performance**

#### Observations:
- **Accuracy**: The **Raw** method results in lower accuracy (~0.345), but **Scaled** and **Eliminated Scaled** significantly improve accuracy (0.945 and 0.97, respectively). The **Eliminated** method provides little improvement (~0.345).

- **Precision and Recall**: Similar to the basic k-NN, the **Raw** method struggles in precision and recall, especially for **Advanced Service** and **E-Service** (Precision: ~0.315, Recall: ~0.272). On the other hand, the **Scaled** and **Eliminated Scaled** methods deliver near-perfect results for precision and recall across all services.

- **Overall Precision and Recall**: Both metrics are nearly flawless (close to 1.0) for the **Scaled** and **Eliminated Scaled** methods, emphasizing that weighting combined with scaling dramatically improves the model's performance.

---

### 3. **Class Frequency Weighted k-NN Performance**

#### Observations:
- **Accuracy**: The **Raw** method again yields lower accuracy (~0.34), while accuracy improves substantially for **Scaled** (0.93) and **Eliminated Scaled** (0.965) methods. The **Eliminated** method shows slight improvement (~0.355).

- **Precision and Recall**: Precision and recall are quite low for the **Raw** method, particularly for **Advanced Service** (Precision: ~0.311, Recall: ~0.25). However, both metrics reach near-perfect values for **Scaled** and **Eliminated Scaled**, indicating excellent class frequency-weighted performance.

- **Overall Precision and Recall**: The overall metrics are strong for **Scaled** and **Eliminated Scaled**, with both methods approaching 1.0, showing their effectiveness in enhancing this k-NN variant.

---

### 4. **Combined Weighted k-NN Performance**

#### Observations:
- **Accuracy**: Consistent patterns are observed, with lower accuracy for **Raw** (~0.355) and dramatic improvements for **Scaled** (0.935) and **Eliminated Scaled** (0.965). The **Eliminated** method exhibits minimal gains (~0.345).

- **Precision and Recall**: The **Raw** method continues to struggle, particularly in **Advanced Service** and **E-Service** (Precision: ~0.333, Recall: ~0.285). However, both **Scaled** and **Eliminated Scaled** methods demonstrate excellent performance, with values nearing 1.0 for precision and recall across all services.

- **Overall Precision and Recall**: **Scaled** and **Eliminated Scaled** preprocessing techniques yield near-perfect overall metrics, reinforcing that scaling is a crucial step in improving performance for the combined weighting strategy.

---

### Key Takeaways:

1. **Raw Data**: Across all variants, the **Raw** method underperforms, particularly for precision and recall in **Advanced Service** and **E-Service**. It also results in lower overall precision and recall.

2. **Scaled Data**: Scaling consistently enhances performance for all algorithms, achieving near-perfect accuracy, precision, and recall across all service categories.

3. **Eliminated Method**: While feature elimination alone provides some improvement, it is significantly less effective than scaling.

4. **Eliminated Scaled Method**: Combining feature elimination with scaling yields the best results, with near-perfect metrics across all variants.

5. **Class-wise Performance**: The **Advanced Service** and **E-Service** classes tend to perform poorly with the **Raw** method, but scaling and elimination dramatically improve their precision and recall.

---

### Conclusion:

The use of **Scaled** and **Eliminated Scaled** preprocessing techniques significantly boosts the performance of all k-NN variants across key metrics such as accuracy, precision, and recall. The **Raw** method generally underperforms, particularly for **Advanced Service** and **E-Service**, while scaling leads to substantial improvements across the board. This analysis underscores the critical role of preprocessing, particularly scaling, in optimizing k-NN and its variants for superior performance in service classification tasks.