## **K-Nearest Neighbors (KNN) Tutorial with the Census Income Dataset**

**Introduction**

In this tutorial, we'll use KNN to predict whether an individual's annual income exceeds $50K/yr based on census data. The dataset is well suited to learning because most of its columns are categorical and must be encoded before a distance-based model such as KNN can use them.

**What you will learn:**

* Handling categorical data with encoding

* Data preprocessing for machine learning

* Building and evaluating a KNN model

* Finding the optimal K value

**Let's Code!**


**Step 1: Import Libraries**

In [87]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [23]:
# Step 1: Load the Adult Income dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                'marital_status', 'occupation', 'relationship', 'race', 'sex',
                'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']
try:
    df = pd.read_csv(url,
                     names=column_names,
                     na_values=[' ?', '?', 'nan', 'NaN', '', 'NA'],
                     skipinitialspace=True,
                     sep=',',
                     engine='python')

    print(f"✓ Dataset loaded successfully: {df.shape}")

except Exception as e:
    print(f"Error loading dataset: {e}")

✓ Dataset loaded successfully: (32561, 15)


In [43]:
df

,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [31]:
print(f"\nDataset Shape: {df.shape}")
print(f"Total records: {len(df)}")

print(f"\nColumn Data Types:")
print(df.dtypes)

print(f"\nCategorical Columns:")
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
for col in categorical_cols:
    print(f"  {col}: {df[col].nunique()} unique values")


Dataset Shape: (32561, 15)
Total records: 32561

Column Data Types:
age                int64
workclass         object
fnlwgt             int64
education         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hours_per_week     int64
native_country    object
income            object
dtype: object

Categorical Columns:
  workclass: 8 unique values
  education: 16 unique values
  marital_status: 7 unique values
  occupation: 14 unique values
  relationship: 6 unique values
  race: 5 unique values
  sex: 2 unique values
  native_country: 41 unique values
  income: 2 unique values


**Step 2: Data Preprocessing and Encoding**

In [68]:
# Select features for our model
selected_features = ['age', 'workclass', 'education', 'marital_status',
                    'occupation', 'sex', 'hours_per_week', 'income']

df1 = df[selected_features].copy()

**Why These 8 Columns Were Selected (7 Features + Target):**

**Features INCLUDED:**

* age - Strong predictor of income (earnings tend to rise with experience)
* workclass - Type of employer matters (private, government, or self-employed)
* education - Higher education is typically associated with higher income
* marital_status - Income patterns differ by marital status
* occupation - Job type directly affects salary
* sex - The data reflects a gender pay gap, so this column carries signal
* hours_per_week - More hours often mean more income
* income - This is our TARGET variable (what we're predicting)

**Features EXCLUDED (and why):**

* fnlwgt (final weight) - Census sampling weight, not a personal characteristic
* education_num - Redundant: it is just a numeric version of education
* relationship - Overlaps heavily with marital_status, so it would be largely redundant
* race - Ethical consideration: we avoid using race in predictive models
* capital_gain - Closely tied to income, so this tutorial treats it as a potential source of leakage
* capital_loss - Same concern: too directly related to our target
* native_country - About 90% of records are from one country (United-States), so it adds little signal
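
The redundancy argument for `education_num` can be checked directly: if every `education` label maps to exactly one numeric code, the second column adds no information. The sketch below uses a small hand-made frame (the `toy` rows are illustrative, not taken from the dataset); the same `groupby` could be run on `df`.

```python
import pandas as pd

# Illustrative rows mimicking the two education columns (assumed values).
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "11th", "Bachelors", "HS-grad"],
    "education_num": [13, 9, 7, 13, 9],
})

# Count the distinct numeric codes seen for each label.
# If every count is 1, the numeric column is a pure re-encoding of the label.
codes_per_label = toy.groupby("education")["education_num"].nunique()
is_redundant = bool((codes_per_label == 1).all())

print(codes_per_label)
print("education_num is redundant:", is_redundant)
```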

In [59]:
# Handle missing values
print("\n1. Handling Missing Values:")
df1.isnull().sum()


1. Handling Missing Values:


age                  0
workclass         1836
education            0
marital_status       0
occupation        1843
sex                  0
hours_per_week       0
income               0


In [69]:
# Handle missing values with PROPER IMPUTATION
print(f"   Missing values BEFORE imputation:")
missing_before = df1.isnull().sum()
missing_before[missing_before > 0]

   Missing values BEFORE imputation:


workclass     1836
occupation    1843


In [61]:
print("\n   Imputation Strategy:")
print("   → For categorical columns: Use MODE (most frequent value)")
print("   → For numerical columns: Use MEDIAN (robust to outliers)")


   Imputation Strategy:
   → For categorical columns: Use MODE (most frequent value)
   → For numerical columns: Use MEDIAN (robust to outliers)
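
As a minimal sketch of this strategy, the loop below fills each column of a tiny made-up frame with its mode or its median depending on the column type. The frame and the name `toy` are assumptions for illustration; in this tutorial only the categorical columns actually have missing values.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one gap per column (assumed values).
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "State-gov"],
    "hours_per_week": [40.0, 50.0, np.nan, 20.0],
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        fill_value = toy[col].median()   # robust to outliers
    else:
        fill_value = toy[col].mode()[0]  # most frequent category
    # Assign the result back instead of using inplace=True on a column selection.
    toy[col] = toy[col].fillna(fill_value)

print(toy)
```

Strictly speaking, the fill values would ideally be computed on the training split only and then applied to the test split, so that no test-set information influences preprocessing.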


In [70]:
# Method 1: MODE imputation for categorical columns
# (assign the result back; calling fillna(..., inplace=True) on a column
#  selection is deprecated in recent pandas and may not modify df1)

print("\n   a) Imputing 'workclass' with MODE:")
workclass_mode = df1['workclass'].mode()[0]
print(f"      Most frequent value: '{workclass_mode}'")
df1['workclass'] = df1['workclass'].fillna(workclass_mode)
print(f"      ✓ Filled {missing_before['workclass']} missing values")

print("\n   b) Imputing 'occupation' with MODE:")
occupation_mode = df1['occupation'].mode()[0]
print(f"      Most frequent value: '{occupation_mode}'")
df1['occupation'] = df1['occupation'].fillna(occupation_mode)
print(f"      ✓ Filled {missing_before['occupation']} missing values")


   a) Imputing 'workclass' with MODE:
      Most frequent value: 'Private'
      ✓ Filled 1836 missing values

   b) Imputing 'occupation' with MODE:
      Most frequent value: 'Prof-specialty'
      ✓ Filled 1843 missing values


In [71]:
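# LabelEncoder maps each category to an integer code in sorted (alphabetical) order.
# Note: `le` is refit for every column, so it only remembers the most recent mapping.
le = LabelEncoder()

df1['sex_encoded'] = le.fit_transform(df1['sex'])              # binary: Female=0, Male=1
df1['workclass_encoded'] = le.fit_transform(df1['workclass'])  # 8 nominal categories coded 0-7

print("   - Label encoded: 'sex', 'workclass'")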

   - Label encoded: 'sex', 'workclass'
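
To see what label encoding produces, here is a small sketch on made-up values. It keeps a dedicated encoder (the name `sex_encoder` is an assumption) so the mapping can be inverted later.

```python
from sklearn.preprocessing import LabelEncoder

# One encoder per column keeps the fitted mapping available for decoding.
sex_encoder = LabelEncoder()
codes = sex_encoder.fit_transform(["Male", "Female", "Female", "Male"])

print(list(sex_encoder.classes_))  # categories are stored in sorted order
print(list(codes))                 # integer code = position in classes_

# The mapping can be reversed as long as the fitted encoder is kept.
decoded = sex_encoder.inverse_transform(codes)
print(list(decoded))
```

For a two-category column this is harmless. For a column with several unordered categories, such as `workclass`, the integer codes imply an ordering that KNN's distance calculation will treat as meaningful, so one-hot encoding may be the safer choice there.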


In [72]:
# -------------------------
# One-Hot Encoding (multi-category)
# -------------------------
education_dummies = pd.get_dummies(df1['education'], prefix='edu', drop_first=True)
marital_dummies = pd.get_dummies(df1['marital_status'], prefix='marital', drop_first=True)
occupation_dummies = pd.get_dummies(df1['occupation'], prefix='occ', drop_first=True)

print("   - One-hot encoded: 'education', 'marital_status', 'occupation'")

   - One-hot encoded: 'education', 'marital_status', 'occupation'
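
The effect of `drop_first=True` is easiest to see on a tiny example. The sketch below (with made-up values) compares the columns produced with and without it: the first category in sorted order is dropped and becomes the implicit baseline, represented by all remaining columns being zero.

```python
import pandas as pd

# Illustrative values only.
education = pd.Series(["HS-grad", "Bachelors", "Masters", "HS-grad"], name="education")

full = pd.get_dummies(education, prefix="edu")                      # one column per category
reduced = pd.get_dummies(education, prefix="edu", drop_first=True)  # first category dropped

print(list(full.columns))
print(list(reduced.columns))
```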


In [73]:
df1['income_encoded'] = le.fit_transform(df1['income'])  # '<=50K' -> 0, '>50K' -> 1
print("   - Encoded target: 'income'")


   - Encoded target: 'income'


In [77]:
df_encoded = pd.concat([
    df1[['age', 'hours_per_week', 'sex_encoded', 'workclass_encoded']],
    education_dummies,
    marital_dummies,
    occupation_dummies,
    df1['income_encoded']
], axis=1)

print("\n Final Encoded Dataset:")
print(f"   Shape: {df_encoded.shape}")
print(df_encoded.head())


 Final Encoded Dataset:
   Shape: (32561, 39)
   age  hours_per_week  sex_encoded  workclass_encoded  edu_11th  edu_12th  \
0   39              40            1                  6     False     False   
1   50              13            1                  5     False     False   
2   38              40            1                  3     False     False   
3   53              40            1                  3      True     False   
4   28              40            0                  3     False     False   

   edu_1st-4th  edu_5th-6th  edu_7th-8th  edu_9th  ...  occ_Handlers-cleaners  \
0        False        False        False    False  ...                  False   
1        False        False        False    False  ...                  False   
2        False        False        False    False  ...                   True   
3        False        False        False    False  ...                   True   
4        False        False        False    False  ...                  False  

In [78]:
# Prepare X and y
X = df_encoded.drop('income_encoded', axis=1)
y = df_encoded['income_encoded']

print(f"\n4. Feature and Target Sets:")
print(f"   Features (X): {X.shape}")
print(f"   Target (y): {y.shape}")
print(f"\n   Income distribution:")
print(f"   <=50K: {(y==0).sum()} ({(y==0).sum()/len(y)*100:.1f}%)")
print(f"   >50K: {(y==1).sum()} ({(y==1).sum()/len(y)*100:.1f}%)")


4. Feature and Target Sets:
   Features (X): (32561, 38)
   Target (y): (32561,)

   Income distribution:
   <=50K: 24720 (75.9%)
   >50K: 7841 (24.1%)
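
This class imbalance sets a baseline worth keeping in mind: a model that always predicts the majority class would already be right for every `<=50K` record. A quick calculation from the counts printed above:

```python
# Class counts taken from the income distribution output.
n_low, n_high = 24720, 7841

# Accuracy of a trivial model that always predicts the larger class.
baseline_accuracy = max(n_low, n_high) / (n_low + n_high)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any accuracy reported later should be read against this figure of roughly 76%, not against 50%.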


**Step 3: Split data**

In [79]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} samples ({len(X_train)/len(X)*100:.1f}%)")
print(f"Testing set: {len(X_test)} samples ({len(X_test)/len(X)*100:.1f}%)")

Training set: 26048 samples (80.0%)
Testing set: 6513 samples (20.0%)
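
The `stratify=y` argument keeps the class proportions roughly the same in both splits. A small sketch on synthetic labels (the arrays `X_toy` and `y_toy` are made up, with a 75/25 class ratio similar to the income data) illustrates this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 75 negatives and 25 positives.
y_toy = np.array([0] * 75 + [1] * 25)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The share of positives should be close to 0.25 in both splits.
print(f"train positives: {y_tr.mean():.2f}, test positives: {y_te.mean():.2f}")
```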


**Step 4: Feature Scaling**

In [80]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\nBefore scaling (age column):")
print(f"  Mean: {X_train['age'].mean():.2f}, Std: {X_train['age'].std():.2f}")
print("\nAfter scaling (age column):")
print(f"  Mean: {X_train_scaled[:, 0].mean():.2f}, Std: {X_train_scaled[:, 0].std():.2f}")


Before scaling (age column):
  Mean: 38.59, Std: 13.64

After scaling (age column):
  Mean: -0.00, Std: 1.00
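
Under the hood, `StandardScaler` computes a z-score, (x - mean) / std, using the mean and standard deviation learned from the training data only. The sketch below checks this by hand on made-up ages (the `toy_` names are assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages_train = np.array([[20.0], [30.0], [40.0], [50.0]])
ages_test = np.array([[45.0]])

# Fit on training data only, then reuse the same statistics for test data.
toy_scaler = StandardScaler().fit(ages_train)
sklearn_z = toy_scaler.transform(ages_test)

# Manual z-score; np.std defaults to the population formula (ddof=0),
# which is what StandardScaler uses.
manual_z = (ages_test - ages_train.mean()) / ages_train.std()

print(sklearn_z, manual_z)
```

Note that pandas' `.std()` uses the sample formula (ddof=1), so the "before scaling" figure above is computed slightly differently from the scaler's internal value; with about 26,000 rows the difference is negligible.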


**Step 5: Building the KNN Classifier with an Initial K Value**

In [82]:
# Start with K=5 (scikit-learn's default); 'minkowski' with the default p=2 is Euclidean distance
k_initial = 5
knn_initial = KNeighborsClassifier(n_neighbors=k_initial, metric='minkowski')
knn_initial.fit(X_train_scaled, y_train)

# Evaluate on both splits; a large train/test gap suggests overfitting
train_accuracy_initial = knn_initial.score(X_train_scaled, y_train)
y_pred_initial = knn_initial.predict(X_test_scaled)
test_accuracy_initial = accuracy_score(y_test, y_pred_initial)
print(f"\nResults with K={k_initial}:")
print(f"  Training Accuracy: {train_accuracy_initial:.4f} ({train_accuracy_initial*100:.2f}%)")
print(f"  Testing Accuracy: {test_accuracy_initial:.4f} ({test_accuracy_initial*100:.2f}%)")


Results with K=5:
  Training Accuracy: 0.8691 (86.91%)
  Testing Accuracy: 0.8118 (81.18%)
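
To make the classifier less of a black box, here is a minimal from-scratch sketch of what KNN does for a single query point: compute Euclidean distances, take the K closest training points, and return the majority label. The data and the helper `knn_predict` are hypothetical and only meant for illustration; the result is compared with scikit-learn on the same toy data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two small, well-separated clusters (made-up points).
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
query = np.array([0.8, 0.8])

def knn_predict(X, y, point, k):
    """Predict the label of `point` by majority vote among its k nearest rows of X."""
    distances = np.sqrt(((X - point) ** 2).sum(axis=1))  # Euclidean distance
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    return int(np.bincount(y[nearest]).argmax())         # most common label among them

manual_label = knn_predict(X_toy, y_toy, query, k=3)
sklearn_label = int(KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy).predict([query])[0])
print(manual_label, sklearn_label)
```

Because every feature contributes to the distance, unscaled or arbitrarily coded columns can dominate the vote, which is why the encoding and scaling steps matter so much for KNN.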


**Step 5B: Finding the optimal K-value**

In [83]:
k_values = range(1, 31)  # Test K from 1 to 30
train_scores = []
test_scores = []

print("Testing different K values...")
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)

    train_scores.append(knn.score(X_train_scaled, y_train))
    test_scores.append(knn.score(X_test_scaled, y_test))

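# Caveat: K is chosen by test-set accuracy here, so the reported test score
# is optimistic; cross-validation on the training set is the more rigorous choice.
best_k = list(k_values)[np.argmax(test_scores)]
best_accuracy = max(test_scores)


print(f"✓ OPTIMAL K VALUE: K={best_k}")
print(f"✓ Best Test Accuracy: {best_accuracy:.4f} ({best_accuracy*100:.2f}%)")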

Testing different K values...
✓ OPTIMAL K VALUE: K=12
✓ Best Test Accuracy: 0.8326 (83.26%)
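
Because the loop selects K using test-set accuracy, the test score is no longer a fully independent estimate. One common alternative, sketched here on synthetic data, is to choose K by cross-validation on the training split only and touch the test split once at the end. The data from `make_classification` and all variable names are assumptions; the tutorial's own results were not produced this way.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded census features.
X_toy, y_toy = make_classification(n_samples=600, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Putting the scaler inside the pipeline refits it within each CV fold,
# so validation folds never influence the scaling statistics.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
search = GridSearchCV(
    pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": list(range(1, 31))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

cv_best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
held_out_accuracy = search.score(X_te, y_te)  # test set used only once, at the end
print(f"K chosen by 5-fold CV: {cv_best_k}")
print(f"Held-out test accuracy: {held_out_accuracy:.4f}")
```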


**Step 6: Training the Final Model**

In [85]:
knn_final = KNeighborsClassifier(n_neighbors=best_k)
knn_final.fit(X_train_scaled, y_train)
y_pred = knn_final.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"\nOverall Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")

print("\nClassification Report:")
print(classification_report(y_test, y_pred,
                          target_names=['<=50K', '>50K']))


Overall Accuracy: 0.8326 (83.26%)

Classification Report:
              precision    recall  f1-score   support

       <=50K       0.87      0.92      0.89      4945
        >50K       0.69      0.55      0.61      1568

    accuracy                           0.83      6513
   macro avg       0.78      0.74      0.75      6513
weighted avg       0.82      0.83      0.83      6513
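
The precision and recall columns in this report come straight from the confusion matrix. The sketch below works through the arithmetic on ten made-up labels (not the tutorial's predictions) and checks it against scikit-learn's own functions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 stands for the positive class (e.g. >50K).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Calling `confusion_matrix(y_test, y_pred)` on the tutorial's own predictions would give the counts behind the report above.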



### **📘 Explanation of Classification Report:**

| Metric        | What It Means (Simple)                                             | <=50K Score | >50K Score | Easy Explanation                                                                             |
| ------------- | ------------------------------------------------------------------ | ----------- | ---------- | -------------------------------------------------------------------------------------------- |
| **Precision** | Of all predictions for this class, how many were correct?          | **0.87**    | **0.69**   | Predictions of <=50K are usually right; predictions of >50K are right about 69% of the time. |
| **Recall**    | Of all people actually in this class, how many did the model find? | **0.92**    | **0.55**   | The model finds most <=50K people but misses almost half of the >50K people.                 |
| **F1-score**  | Harmonic mean of precision and recall                              | **0.89**    | **0.61**   | Overall performance is strong for <=50K, weaker for >50K.                                    |
| **Support**   | Number of true samples of this class in the test set               | **4945**    | **1568**   | There are about three times as many <=50K samples, so the classes are imbalanced.            |

### **📊 Overall Model Summary:**


| Metric           | Score                                       | Easy Explanation                                                                       |
| ---------------- | ------------------------------------------- | -------------------------------------------------------------------------------------- |
| **Accuracy**     | **0.83**                                    | The model is correct 83% of the time on the test set, but always predicting <=50K would already score about 76%. |
| **Macro Avg**    | Precision: 0.78<br>Recall: 0.74<br>F1: 0.75 | Treats both classes equally, so it reveals the performance drop for >50K.              |
| **Weighted Avg** | Precision: 0.82<br>Recall: 0.83<br>F1: 0.83 | Weighted by class size, so it mostly reflects the good performance on <=50K.           |
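
The averages in this summary can be approximately reproduced from the per-class rows of the report. The sketch below uses the two-decimal values as printed, so results may differ from the report in the last digit because of rounding.

```python
import numpy as np

# Per-class values copied from the classification report (rounded to two decimals).
recall = np.array([0.92, 0.55])    # order: <=50K, >50K
f1 = np.array([0.89, 0.61])
support = np.array([4945, 1568])

# Macro average: plain mean, each class counts equally.
macro_f1 = f1.mean()

# Weighted average: each class counts in proportion to its support.
weighted_recall = np.average(recall, weights=support)

# F1 is the harmonic mean of precision and recall (shown for the >50K class).
precision_high, recall_high = 0.69, 0.55
f1_high = 2 * precision_high * recall_high / (precision_high + recall_high)

print(f"macro F1 ~ {macro_f1:.2f}")
print(f"weighted recall ~ {weighted_recall:.2f}")
print(f"F1 for >50K ~ {f1_high:.2f}")
```

The weighted recall equals the overall accuracy, which is why both appear as 0.83 in the report.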
