# AIN313 ASSIGNMENT 1


# PART I: Theory Questions


# Telecommunication Customer Classification System

### __STEP 1__
__1. Import and visualize the data in any aspects that you think it is beneficial for the reader’s better understanding of the data.__ 


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
import numpy as np
from collections import Counter

# Load dataset and show it
df = pd.read_csv('telecommunicaton_classification.csv')

df.head()


In [None]:
# Check for missing values
print(df.isnull().sum())

In [None]:
# Visualize the distribution of the target column (service)
sns.countplot(x='service', data=df)
plt.show()

In [150]:
# Format and normalize the DataFrame excluding the service column
df_except_service = df.iloc[:,:-1]

dummies = pd.get_dummies(df_except_service, drop_first=True, dtype=int)
dummies_normalized = (dummies - dummies.min()) / (dummies.max() - dummies.min())

df = pd.concat([dummies, df.iloc[:, -1]], axis=1)

n_df = pd.concat([dummies_normalized, df.iloc[:, -1]], axis=1)
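
To make the transformation above concrete, here is a minimal sketch on a tiny made-up frame. The column names (`age`, `region`) and their values are hypothetical and are not taken from the dataset. It shows how `get_dummies` with `drop_first=True` encodes a categorical column and how min-max scaling maps each column to the range [0, 1].

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not the assignment dataset).
toy = pd.DataFrame({
    "age": [20, 30, 40],
    "region": ["north", "south", "north"],
})

# One-hot encode the categorical column. drop_first=True drops the first
# category ("north"), so "region" becomes a single 0/1 column "region_south".
toy_dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

# Min-max scaling, column by column: (x - min) / (max - min).
toy_normalized = (toy_dummies - toy_dummies.min()) / (toy_dummies.max() - toy_dummies.min())

print(toy_normalized)
```

Note that a constant column would give 0/0 under this formula and therefore produce NaN values, so such columns may need to be dropped or handled separately.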

In [None]:
# Show and Check transformed data
df.head()

In [None]:
n_df.head()

### __STEP 2__ 
__2. Split data into train and test set randomly (you can use 80% of the data for training and 20% of it for the test purposes).__


In [153]:

# Define features (X) and target (y) based on non_normalized df (df)
X = df.drop('service', axis=1)  
y = df['service']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define features (X) and target (y) based on normalized df (ndf/NDF)
X_NDF = n_df.drop('service', axis=1)  
y_ndf = n_df['service']

# Split data
X_train_ndf, X_test_ndf, y_train_ndf, y_test_ndf = train_test_split(X_NDF, y_ndf, test_size=0.2, random_state=42)


### __STEP 3, STEP 4 and STEP 5__

__3. For the test set that you separated at the previous step try to determine classes for the customers.__ <br />
__4. Finally compute performance of your model to measure the success of your KNN Classification method for each setting you have used: You will report Accuracy, Precision, and Recall measures.__ <br />
__5. The most important part of this project is doing as much experiment as you can to show strengths and weaknesses of the kNN algorithm.__ <br />


In [154]:
# distance functions

def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan_distance(a, b):
    return np.sum(np.abs(a - b))
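
As a quick sanity check of the two distance functions, the sketch below repeats their definitions and applies them to a pair of made-up 2-D points (the points are illustrative assumptions, not rows from the dataset). Euclidean distance measures the straight-line length, while Manhattan distance sums the absolute differences per coordinate.

```python
import numpy as np

def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan_distance(a, b):
    return np.sum(np.abs(a - b))

# Illustrative points: a 3-4-5 right triangle.
p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])

d_euclid = euclidean_distance(p, q)      # sqrt(3^2 + 4^2) = 5.0
d_manhattan = manhattan_distance(p, q)   # |3| + |4| = 7.0
print(d_euclid, d_manhattan)
```

Manhattan distance is never smaller than Euclidean distance for the same pair of points, so the two metrics can rank neighbors differently.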



In [155]:
# auxiliary functions to knn

def nearest_neighbors(distance_class, k):
    # Sort the (distance, class) pairs by distance without modifying the
    # caller's list, then keep the classes of the k closest points.
    nearest = sorted(distance_class, key=lambda pair: pair[0])[:k]
    return [cls for _, cls in nearest]

def most_common_neighbor(neighbors):
    neighbor = Counter(neighbors).most_common(1)[0][0] # The most common neighbor among the k nearest neighbors
    return neighbor
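
One detail worth spelling out is how ties are resolved. With more than two classes, two classes can receive the same number of votes even when k is odd. The sketch below uses hypothetical labels to show that `Counter.most_common` returns the class that was encountered first among equally frequent ones. Because the neighbor list is ordered from nearest to farthest, a tie goes to the class of the closer neighbor.

```python
from collections import Counter

# Hypothetical classes of the k=4 nearest neighbors, nearest first.
neighbors = ["B", "A", "A", "B"]

# "A" and "B" both have two votes; the first one encountered ("B") wins.
winner = Counter(neighbors).most_common(1)[0][0]
print(winner)
```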

In [156]:
# KNN Algorithm function

def k_nearest_neighbors(X_train, y_train, X_test, k, distance_func):
    predictions = []  # Predicted class for each test point

    # Convert to NumPy once, instead of on every loop iteration.
    train_points = X_train.to_numpy()
    train_labels = y_train.to_numpy()

    for test_point in X_test.to_numpy():
        # Distance from this test point to every training point, paired with that point's class.
        distances = [
            (distance_func(test_point, train_point), label)
            for train_point, label in zip(train_points, train_labels)
        ]

        # Keep the classes of the k nearest neighbors.
        neighbors = nearest_neighbors(distances, k)

        # Predict the most common class among them.
        predictions.append(most_common_neighbor(neighbors))

    return predictions
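
The loop-based implementation computes one distance at a time, which can be slow when many k values are tried. As an alternative, here is a minimal vectorized sketch for the Euclidean case only. The function name `knn_predict_vectorized` and the toy data are assumptions made for illustration; this is not the implementation used for the results in this notebook.

```python
import numpy as np
from collections import Counter

def knn_predict_vectorized(X_train, y_train, X_test, k):
    """Predict labels with Euclidean kNN using NumPy broadcasting (sketch)."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)

    # diffs has shape (n_test, n_train, n_features).
    diffs = X_test[:, None, :] - X_train[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))

    # For each test row, indices of the k closest training points, nearest first.
    nearest = np.argsort(distances, axis=1, kind="stable")[:, :k]

    # Majority vote over the labels of those neighbors.
    return [Counter(y_train[row].tolist()).most_common(1)[0][0] for row in nearest]

# Toy data: two tight clusters labelled "a" and "b".
X_tr = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
y_tr = np.array(["a", "a", "b", "b"])
X_te = np.array([[0.05, 0.1], [0.95, 0.9]])

preds = knn_predict_vectorized(X_tr, y_tr, X_te, k=3)
print(preds)
```

The trade-off is memory: the intermediate `diffs` array holds n_test × n_train × n_features values, so for large datasets the test set would need to be processed in batches.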

In [157]:


def evaluate_k_values(X_train, y_train, X_test, y_test, k_values , distance_func):
    accuracy_scores = []
    
    for k in k_values:
        y_pred = k_nearest_neighbors(X_train, y_train, X_test, k , distance_func)
        accuracy = accuracy_score(y_test, y_pred)
        accuracy_scores.append(accuracy)
        print(f'k={k}, Accuracy={accuracy:.2f}')
    
    return accuracy_scores


### Evaluating kNN performance on normalized vs. non-normalized data (with Euclidean distance)

In [158]:
def best_k_func(X_train, y_train, X_test, y_test ,distance_func):
    # k values to try
    k_values = range(1, 100, 2)

    # We evaluate the model for each k.
    accuracy_scores = evaluate_k_values(X_train, y_train, X_test, y_test, k_values , distance_func)

    # We find the best k value.
    best_k = k_values[np.argmax(accuracy_scores)]
    print(f'Best k value: {best_k}')

    # We draw graphs according to the accuracy scores of the k values we tried.
    plt.figure(figsize=(10, 6))
    plt.plot(k_values, accuracy_scores, marker='o', linestyle='--', color='b')
    plt.xlabel('k value')
    plt.ylabel('Accuracy Score')
    plt.title('Accuracy Score According to Different k Values')
    plt.xticks(k_values)  
    plt.grid(True)
    plt.show()

    return best_k

def scores(X_train, y_train, X_test, y_test, best_k, distance_func):
    # Predict the test set with the chosen k.
    y_pred = k_nearest_neighbors(X_train, y_train, X_test, best_k, distance_func)

    # Evaluate the predictions. Precision and recall are macro-averaged,
    # i.e. the unweighted mean of the per-class scores.
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='macro')
    recall = recall_score(y_test, y_pred, average='macro')

    print(f'Accuracy: {accuracy:.2f}')
    print(f'Precision: {precision:.2f}')
    print(f'Recall: {recall:.2f}')
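
Since the report relies on `average='macro'`, it may help to see what that averaging computes. The following pure-Python sketch calculates per-class precision and recall and takes their unweighted mean. The helper name `macro_precision_recall` and the labels are hypothetical, and the convention that a class with no predictions (or no true samples) scores 0 is an assumption intended to mirror scikit-learn's default behavior.

```python
def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of per-class precision and recall (illustrative sketch)."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        true_pos = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)  # how often c was predicted
        actual = sum(t == c for t in y_true)     # how often c is the true class
        # Assumed convention: score 0 when the denominator is zero.
        precisions.append(true_pos / predicted if predicted else 0.0)
        recalls.append(true_pos / actual if actual else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)

# Hypothetical labels for three classes.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]

macro_precision, macro_recall = macro_precision_recall(y_true, y_pred)
print(f"{macro_precision:.2f} {macro_recall:.2f}")
```

Because every class counts equally, a rarely occurring class that is predicted poorly lowers the macro scores as much as a frequent one would.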

#### kNN algorithm for the non-normalized DataFrame

In [None]:
best_k = best_k_func(X_train, y_train, X_test, y_test , euclidean_distance)

In [None]:
scores(X_train, y_train, X_test, y_test, best_k , euclidean_distance)

#### kNN algorithm for the normalized DataFrame

In [None]:
best_k = best_k_func(X_train_ndf, y_train_ndf, X_test_ndf, y_test_ndf, euclidean_distance)

In [None]:
scores(X_train_ndf, y_train_ndf, X_test_ndf, y_test_ndf, best_k, euclidean_distance)

### kNN Results: Normalized vs. Non-Normalized Data

#### 1. General Performance Comparison
- **Non-Normalized Data**: Accuracy ranges from 0.30 to 0.39 across the tested k values. The best k value is 37, with 0.39 accuracy.
- **Normalized Data**: Accuracy is generally higher, between 0.30 and 0.43. The best k value is 87, with 0.43 accuracy.
- Normalization improves peak accuracy (0.43 vs. 0.39) and stability, especially at higher k values.

#### 2. Optimal k Value
- **Non-Normalized**: Best k is 37 (0.39 accuracy). Accuracy improves as k grows, then plateaus.
- **Normalized**: Best k is 87 (0.43 accuracy). Larger k values work better, which suggests that averaging over more neighbors helps once all features are on a comparable scale.

#### 3. Overfitting vs. Generalization
- **Non-Normalized**: Small k values (e.g., k=1) overfit, because each prediction depends on a single, possibly noisy neighbor. Larger k values generalize better.
- **Normalized**: Accuracy is more stable at small k values, and higher k values (e.g., k=87) give the best performance.

#### 4. Impact of Normalization
- Normalization results in generally higher accuracy and less sensitivity to changes in k.
- Larger k values perform better on normalized data.

#### 5. Conclusion
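- On this dataset, normalization improves kNN's performance, especially with large k values.
- **Recommendation**: Normalize features before applying kNN. Distance-based methods are sensitive to feature scale, so normalization helps both accuracy and generalization.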
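
The reason normalization matters for a distance-based method can be illustrated with a small sketch. The feature names (age, income), the three customers, and the feature ranges below are all hypothetical. On raw values the large-scale feature dominates the Euclidean distance, and after min-max scaling both features contribute comparably, which changes which point counts as the nearest neighbor.

```python
import numpy as np

# Hypothetical customers: [age in years, income in currency units].
query = np.array([30.0, 50_000.0])
a = np.array([60.0, 50_500.0])  # very different age, almost the same income
b = np.array([31.0, 52_000.0])  # almost the same age, somewhat different income

# Raw distances: income differences dominate, so "a" looks closer.
raw_a = np.linalg.norm(query - a)
raw_b = np.linalg.norm(query - b)

# Assumed feature ranges used for min-max scaling.
mins = np.array([18.0, 10_000.0])
maxs = np.array([78.0, 110_000.0])

def min_max(x):
    return (x - mins) / (maxs - mins)

# Scaled distances: the 30-year age gap now matters, so "b" is closer.
norm_a = np.linalg.norm(min_max(query) - min_max(a))
norm_b = np.linalg.norm(min_max(query) - min_max(b))

print(raw_a < raw_b, norm_b < norm_a)
```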



### Evaluating kNN performance by distance function (with the normalized DataFrame)

#### with Manhattan Distance

In [None]:
best_k = best_k_func(X_train_ndf, y_train_ndf, X_test_ndf, y_test_ndf, manhattan_distance)

In [None]:
scores(X_train_ndf, y_train_ndf, X_test_ndf, y_test_ndf, best_k, manhattan_distance)

#### with Euclidean Distance

In [None]:
best_k = best_k_func(X_train_ndf, y_train_ndf, X_test_ndf, y_test_ndf, euclidean_distance)

In [None]:
scores(X_train_ndf, y_train_ndf, X_test_ndf, y_test_ndf, best_k, euclidean_distance)

### kNN Results: Manhattan vs. Euclidean Distance

#### 1. General Performance Comparison

- **Manhattan**: Accuracy ranges from 0.29 to 0.41, with the best k at 99 achieving 0.41 accuracy.
- **Euclidean**: Accuracy ranges from 0.30 to 0.43, with the best k at 87 achieving 0.43 accuracy.
- Euclidean shows higher overall performance, and its accuracy improves faster as k increases.

#### 2. Optimal k Value

- **Manhattan**: Best k = 99 (accuracy = 0.41). Larger k values provide better generalization but limited improvement.
- **Euclidean**: Best k = 87 (accuracy = 0.43). Euclidean improves faster and gives higher accuracy at medium to large k values.

#### 3. Overfitting vs. Generalization

- **Manhattan**: Lower k values (e.g., k=1) show overfitting, while larger k values provide more generalization.
- **Euclidean**: Accuracy rises more quickly as k increases from small values, and larger k values generalize better.

#### 4. Manhattan vs. Euclidean

- **Euclidean Distance** performs better overall, especially at larger k values (best accuracy = 0.43).
- **Manhattan Distance** stabilizes with higher k but doesn't improve beyond 0.41 accuracy.
  
#### 5. Conclusion

- On this dataset, Euclidean distance is the better choice for kNN: accuracy rises faster with k and peaks higher than with Manhattan distance (0.43 vs. 0.41).
