# **K-Nearest Neighbours: KNN**
> Regression problem:

KNN for regression works much like KNN for classification.

The only difference is that the `output variable is continuous` rather than categorical.

The predicted value is the average of the target values of the query point's `k nearest neighbors`.
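
The averaging rule above can be written in a few lines of NumPy. This is a minimal sketch on made-up toy data, not the scikit-learn implementation; the function name `knn_predict` and the arrays `X_toy` and `y_toy` are assumptions introduced only for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict a continuous target as the mean of the k nearest training targets."""
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Unweighted average of the neighbours' target values
    return y_train[nearest].mean()

# Toy data (made up for illustration): one feature, one target
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_toy = np.array([10.0, 20.0, 30.0, 100.0, 110.0])

# The three nearest neighbours of 2.5 are the points at 2.0, 3.0 and 1.0
prediction = knn_predict(X_toy, y_toy, np.array([2.5]), k=3)
print(prediction)  # 20.0
```

The two distant points (10.0 and 11.0) do not influence the result because only the `k` closest targets are averaged.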

## About the dataset:
The Auto MPG dataset is used to predict the `miles per gallon (mpg)` of a vehicle based on its characteristics.

The dataset contains `398 rows` and `9 columns`.

The columns are:

1. mpg: miles per gallon
2. cylinders: Number of cylinders (between 3 and 8)
3. displacement: Engine displacement (cu. inches)
4. horsepower: Engine horsepower
5. weight: Vehicle weight (lbs.)
6. acceleration: Time to accelerate from 0 to 60 mph (sec.)
7. model year: Model year (modulo 100)
8. origin: Origin of car (1 = American, 2 = European, 3 = Japanese)
9. car name: Vehicle name

> In this notebook, I will use the KNN algorithm to predict the miles per gallon (mpg) of a vehicle based on its characteristics.

In [1]:
# importing the libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import cross_val_score


In [19]:
import pandas as pd

# Define the column names based on auto-mpg.names file
column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year', 'origin', 'car_name']

# Load the data into a DataFrame
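# sep=r'\s+' splits on runs of whitespace; it replaces delim_whitespace=True, which is deprecated in recent pandas
df = pd.read_csv('../dataset/auto_mpg/auto-mpg.data', sep=r'\s+', names=column_names, na_values='?')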



In [15]:
df.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |


In [20]:
# Check for missing values
print("\nMissing values in each column:")
print(df.isnull().sum())

# Convert data types if necessary
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')  # Convert 'horsepower' to numeric, coercing errors to NaN



Missing values in each column:
mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
car_name        0
dtype: int64


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    float64
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car_name      398 non-null    object 
dtypes: float64(5), int64(3), object(1)
memory usage: 28.1+ KB


In [23]:
# impute the missing values with the median value of the column
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

In [25]:
# Display the DataFrame info to verify data types and missing values
# (df.info() prints its report itself and returns None, so it is not wrapped in print())
print("\nDataFrame Info:")
df.info()



DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    float64
 4   weight        398 non-null    float64
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car_name      398 non-null    object 
dtypes: float64(5), int64(3), object(1)
memory usage: 28.1+ KB


##### So now the data is cleaned and ready to be used for the model:

In [26]:
# split the data into features and target
X = df.drop(columns=['mpg', 'car_name'])
y = df['mpg']

In [27]:
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [28]:
# create the KNeighborsRegressor model (k = 5 neighbours)

model = KNeighborsRegressor(n_neighbors=5)

# fit the model with the training data
model.fit(X_train, y_train)

# predict the target values
y_pred = model.predict(X_test)

In [29]:
# evaluate the model

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'Root Mean Squared Error: {rmse}')
print(f'R^2 Score: {r2}')
print(f'Mean Absolute Error: {mae}')

Mean Squared Error: 12.581365000000002
Root Mean Squared Error: 3.547021990346268
R^2 Score: 0.7659996807953288
Mean Absolute Error: 2.7692500000000004
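
The baseline above is trained on unscaled features. Because KNN relies on distances, a feature measured in large units (such as weight in lbs) can dominate one measured in small units (such as cylinders). The sketch below illustrates that effect on synthetic data, not on the Auto MPG file, so the numbers it prints say nothing about the results above. The names `X_demo`, `y_demo`, `unscaled_knn`, and `scaled_knn` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300

# One informative feature on a small scale, one irrelevant feature on a large scale
informative = rng.uniform(0, 1, n)
large_scale_noise = rng.uniform(0, 1000, n)
X_demo = np.column_stack([informative, large_scale_noise])
y_demo = 10 * informative + rng.normal(0, 0.1, n)

# Same model, with and without standardization inside a pipeline
unscaled_knn = KNeighborsRegressor(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))

# 5-fold cross-validated R^2 for both variants
unscaled_r2 = cross_val_score(unscaled_knn, X_demo, y_demo, cv=5, scoring='r2').mean()
scaled_r2 = cross_val_score(scaled_knn, X_demo, y_demo, cv=5, scoring='r2').mean()

print(f'Unscaled R^2: {unscaled_r2:.3f}')
print(f'Scaled R^2:   {scaled_r2:.3f}')
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so the held-out fold never influences the scaling statistics.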


---

## Let's try to improve the model's performance with feature scaling and a tuned Random Forest:

---

In [41]:
# importing the dataset:

# Load the data into a DataFrame
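# sep=r'\s+' splits on runs of whitespace; it replaces delim_whitespace=True, which is deprecated in recent pandas
df = pd.read_csv('../dataset/auto_mpg/auto-mpg.data', sep=r'\s+', names=column_names, na_values='?')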



In [42]:
df.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |


In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    float64
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car_name      398 non-null    object 
dtypes: float64(5), int64(3), object(1)
memory usage: 28.1+ KB


In [44]:
# Drop the rows with missing values (the 6 rows where horsepower is NaN)
df.dropna(inplace=True)

In [45]:
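# Feature engineering: extract a trailing number from car_name, if there is one
# (e.g. 320 from 'buick skylark 320'). Note that this number is a model designation, not a year,
# and that every row whose name has no trailing number is dropped on the next line.
df['car_year'] = df['car_name'].apply(lambda x: int(x.split()[-1]) if x.split()[-1].isdigit() else np.nan)
df = df.dropna(subset=['car_year'])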
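
To see what this step does to the data, the sketch below applies the same rule to the five car names shown in the `df.head()` output. The helper name `trailing_number` and the `sample` frame are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

def trailing_number(name):
    """Return the last whitespace-separated token as an int if it is all digits, else NaN."""
    last_token = name.split()[-1]
    return int(last_token) if last_token.isdigit() else np.nan

# The five car names from the head of the dataset
sample = pd.DataFrame({'car_name': [
    'chevrolet chevelle malibu',
    'buick skylark 320',
    'plymouth satellite',
    'amc rebel sst',
    'ford torino',
]})

sample['car_year'] = sample['car_name'].apply(trailing_number)
kept = sample.dropna(subset=['car_year'])

print(sample)
print(f'{len(kept)} of {len(sample)} rows kept')
```

Only `buick skylark 320` survives the `dropna` here, so it is worth checking `len(df)` after this cell to see how many rows the model is actually trained on.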



In [46]:
# Drop the 'car_name' column
df = df.drop('car_name', axis=1)


In [47]:
# Define features and target
X = df.drop('mpg', axis=1)
y = df['mpg']


In [48]:
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [49]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
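
Fitting the scaler on all rows before splitting, as above, lets the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training part only. The sketch below shows that order on synthetic data; the names `X_demo`, `y_demo`, `scaler_demo`, and the `_tr`/`_te` variables are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix and target
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(100, 3))
y_demo = rng.normal(size=100)

# 1. Split first, on the raw features
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# 2. Learn the scaling statistics from the training rows only
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)

# 3. Apply the same (training) statistics to the test rows
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))  # approximately 0 for every column
print(X_te_scaled.mean(axis=0).round(3))  # generally not exactly 0
```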

In [52]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

In [53]:
# Model selection and hyperparameter tuning
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Best model
best_rf = grid_search.best_estimator_

# Predictions
y_pred = best_rf.predict(X_test)

# Evaluation
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f"R^2 Score: {r2}")
print(f"Mean Absolute Error: {mae}")

Mean Squared Error: 7.316985862886126
Root Mean Squared Error: 2.704992765773344
R^2 Score: 0.7414582857581309
Mean Absolute Error: 1.975050448933773


---
### So, we learned how to implement the KNN algorithm for a regression problem using the scikit-learn library in Python, and compared it with a Random Forest tuned by grid search on standardized features. Finding the optimal value of k and trying different distance metrics are natural next steps.

---

# About Me:

<img src="https://scontent.flhe6-1.fna.fbcdn.net/v/t39.30808-6/449152277_18043153459857839_8752993961510467418_n.jpg?_nc_cat=108&ccb=1-7&_nc_sid=127cfc&_nc_ohc=6slHzGIxf0EQ7kNvgEeodY9&_nc_ht=scontent.flhe6-1.fna&oh=00_AYCiVUtssn2d_rREDU_FoRbXvszHQImqOjfNEiVq94lfBA&oe=66861B78" width="30%">

**Muhammd Faizan**

3rd Year BS Computer Science student at University of Agriculture, Faisalabad.\
Contact me for queries, collaborations, or corrections.

[Kaggle](https://www.kaggle.com/faizanyousafonly/)\
[Linkedin](https://www.linkedin.com/in/mrfaizanyousaf/)\
[GitHub](https://github.com/faizan-yousaf/)\
[Email] faizan6t45@gmail.com or faizanyousaf815@gmail.com \
[Phone/WhatsApp]() +923065375389