1 Program Description
In this programming assignment, you will implement a general k-NN model and a weighted
k-NN model using Euclidean distance. k should be a parameter, i.e., it can take any value.
Specifically, you will use your models to predict the level of corruption in a country based
on a range of macro-economic and social features. The data is given. The descriptive
features (Columns 2-6 in the dataset) are:
• LIFE EXP.: the mean life expectancy at birth
• TOP-10 INCOME: the percentage of the annual income of the country that goes to
the top 10% of earners
• INFANT MORT.: the number of infant deaths per 1,000 births
• MIL. SPEND: the percentage of GDP spent on the military
• SCHOOL YEARS: the mean number of years spent in school by adult females
The target feature is the Corruption Perception Index (CPI) (the last column in the dataset).
The CPI measures the perceived levels of corruption in the public sector of countries and
ranges from 0 (highly corrupt) to 100 (very clean).
We will use Russia as our query country for this question. The table below lists the descriptive features for Russia.
COUNTRY ID   LIFE EXP.   TOP-10 INCOME   INFANT MORT.   MIL. SPEND   SCHOOL YEARS   CPI
Russia       67.62       31.68           10.00          3.87         12.90          ?

In [4]:
from math import sqrt
from csv import reader

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = row[column].strip()
        row[column] = float(row[column])

# Find the min and max values for each column
def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        value_min = min(col_values)
        value_max = max(col_values)
        minmax.append([value_min, value_max])
    return minmax

# Rescale dataset columns to the range 0-1
def normalize_dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

# Calculate the Euclidean distance between two vectors
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += (row1[i] - row2[i])**2
    return sqrt(distance)

# Locate the most similar neighbors
def get_neighbors(train, test_row, num_neighbors):
    distances = list()
    for train_row in train:
        dist = euclidean_distance(test_row, train_row)
        distances.append((train_row, dist))
    distances.sort(key=lambda tup: tup[1])
    neighbors = list()
    for i in range(num_neighbors):
        neighbors.append(distances[i][0])
    return neighbors

# Predict the continuous value with neighbors
def predict_value(train, test_row, num_neighbors):
    neighbors = get_neighbors(train, test_row, num_neighbors)
    output_values = [row[-1] for row in neighbors]
    prediction = sum(output_values) / len(neighbors)
    return prediction

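# Hardcoded dataset (it could instead be loaded from a CSV with load_csv above).
# Only a few rows are listed here, so the predictions printed by this cell
# reflect just those rows, not the full 16-country dataset.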
dataset = [
    ["Afghanistan",59.61,23.21,74.30,4.44,0.40,1.5171],
    ["Haiti",45.00,47.67,73.10,0.09,3.40,1.7999],
    # ... include all countries
    ["NewZealand",80.67,27.81,4.90,1.13,12.30,9.4627]
]

russia_data = [67.62, 31.68, 10.00, 3.87, 12.90]
normalized_russia = [0.6099, 0.3754, 0.0948, 0.5658, 0.9058]

# Euclidean distance over feature-only vectors.
# This redefinition replaces the earlier one, which skipped the last element of each row.
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)):
        distance += (row1[i] - row2[i])**2
    return sqrt(distance)

# k-NN prediction model: mean CPI of the k nearest neighbors (k defaults to 3)
def k_nearest_neighbors(data, test_row, k=3):
    distances = [(row[0], euclidean_distance(test_row, row[1:6]), row[6]) for row in data]
    distances.sort(key=lambda tup: tup[1])
    neighbors = distances[:k]
    output_values = [row[2] for row in neighbors]
    # Divide by the number of neighbors actually found, in case k exceeds len(data)
    prediction = sum(output_values) / len(output_values)
    return prediction

# Function for weighted k-NN prediction model
def weighted_k_nearest_neighbors(data, test_row, k=16):
    distances = [(row[0], euclidean_distance(test_row, row[1:6]), row[6]) for row in data]
    distances.sort(key=lambda tup: tup[1])
    neighbors = distances[:k]
    weighted_sum = sum(row[2] / (row[1]**2 + 1e-10) for row in neighbors)  # 1e-10 to avoid division by zero
    weights = sum(1 / (row[1]**2 + 1e-10) for row in neighbors)
    prediction = weighted_sum / weights
    return prediction

# 1. 3-NN prediction for Russia
cpi_3nn = k_nearest_neighbors(dataset, russia_data, 3)
print(f"1. Predicted CPI for Russia (3-NN): {cpi_3nn:.4f}")

# 2. Weighted k-NN prediction for Russia
cpi_weighted_16nn = weighted_k_nearest_neighbors(dataset, russia_data, 16)
print(f"2. Predicted CPI for Russia (weighted 16-NN): {cpi_weighted_16nn:.4f}")

# 3. 3-NN prediction for the normalized Russia query
# NOTE: dataset is still in raw units here; it must be range-normalized
# with the same min/max values before it is compared with normalized_russia.
cpi_3nn_normalized = k_nearest_neighbors(dataset, normalized_russia, 3)
print(f"3. Predicted CPI for Russia with normalized data (3-NN): {cpi_3nn_normalized:.4f}")

# 4. Weighted k-NN prediction for the normalized Russia query
# NOTE: the same caveat applies; dataset has not been normalized.
cpi_weighted_16nn_normalized = weighted_k_nearest_neighbors(dataset, normalized_russia, 16)
print(f"4. Predicted CPI for Russia with normalized data (weighted 16-NN): {cpi_weighted_16nn_normalized:.4f}")


1. Predicted CPI for Russia (3-NN): 4.2599
2. Predicted CPI for Russia (weighted 16-NN): 8.7873
3. Predicted CPI for Russia with normalized data (3-NN): 4.2599
4. Predicted CPI for Russia with normalized data (weighted 16-NN): 4.7374


Country  Euclid  CPI
Argentina 9.7575 2.9961
China 10.7275 3.6356
U.S.A 11.7044 7.1357
Egypt 13.7168 2.8622
U.K. 14.0956 7.7751
Brazil 14.6801 3.7741
NewZealand 14.8040 9.4627
Ireland 15.2219 7.5360
Israel 15.6514 5.8069
Canada 16.1224 8.6725
Australia 16.9841 8.8442
Germany 17.3560 8.0461
Sweden 18.5875 9.2985
Afghanistan 66.5354 1.5171
Haiti 69.6670 1.7999
Nigeria 75.2681 2.4493
CPI for 3-NN: 4.5891

Country Euclid CPI Weight W*CPI
Argentina 9.7575 2.9961 0.0105 0.0315
China 10.7275 3.6356 0.0087 0.0316
U.S.A 11.7044 7.1357 0.0073 0.0521
Egypt 13.7168 2.8622 0.0053 0.0152 
U.K. 14.0956 7.7751 0.0050 0.0391
Brazil 14.6801 3.7741 0.0046 0.0175
NewZealand 14.8040 9.4627 0.0046 0.0432
Ireland 15.2219 7.5360 0.0043 0.0325
Israel 15.6514 5.8069 0.0041 0.0237
Canada 16.1224 8.6725 0.0038 0.0334
Australia 16.9841 8.8442 0.0035 0.0307
Germany 17.3560 8.0461 0.0033 0.0267
Sweden 18.5875 9.2985 0.0029 0.0269 
Afghanistan 66.5354 1.5171 0.0002 0.0003 
Haiti 69.6670 1.7999 0.0002 0.0004
Nigeria 75.2681 2.4493 0.0002 0.0004
      
CPI for weighted 16-NN: 5.9087

In [None]:
# Dataset
data = [
    ["Afghanistan", 59.61, 23.21, 74.30, 4.44, 0.40, 1.5171],
    ["Haiti", 45.00, 47.67, 73.10, 0.09, 3.40, 1.7999],
    ["Nigeria", 51.30, 38.23, 82.60, 1.07, 4.10, 2.4493],
    ["Egypt", 70.48, 26.58, 19.60, 1.86, 5.30, 2.8622],
    ["Argentina", 75.77, 32.30, 13.30, 0.76, 10.10, 2.9961],
    ["China", 74.87, 29.98, 13.70, 1.95, 6.40, 3.6356],
    ["Brazil", 73.12, 42.93, 14.50, 1.43, 7.20, 3.7741],
    ["Israel", 81.30, 28.80, 3.60, 6.77, 12.50, 5.8069],
    ["U.S.A", 78.51, 29.85, 6.30, 4.72, 13.70, 7.1357],
    ["Ireland", 80.15, 27.23, 3.50, 0.60, 11.50, 7.5360],
    ["U.K.", 80.09, 28.49, 4.40, 2.59, 13.00, 7.7751],
    ["Germany", 80.24, 22.07, 3.50, 1.31, 12.00, 8.0461],
    ["Canada", 80.99, 24.79, 4.90, 1.42, 14.20, 8.6725],
    ["Australia", 82.09, 25.40, 4.20, 1.86, 11.50, 8.8442],
    ["Sweden", 81.43, 22.18, 2.40, 1.27, 12.80, 9.2985],
    ["NewZealand", 80.67, 27.81, 4.90, 1.13, 12.30, 9.4627]
]


1. What value would a 3-nearest neighbor prediction model using Euclidean distance
return for the CPI of Russia?
2. What value would a weighted k-NN prediction model return for the CPI of Russia?
Use k = 16 (i.e., the full dataset) and a weighting scheme of the reciprocal of the
squared Euclidean distance between the neighbor and the query.
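Questions 1 and 2 differ only in k and in how the neighbors are combined, so both can be served by one function with k as a parameter. The following is a minimal sketch under that assumption; `knn_predict`, `euclidean`, `COUNTRIES`, and `RUSSIA` are hypothetical names introduced here, and the rows are copied from the dataset above. Run on the full dataset, it should agree with the 3-NN and weighted 16-NN values reported in the tables above.

```python
from math import sqrt

# Rows copied from the assignment data: name, five descriptive features, CPI.
COUNTRIES = [
    ["Afghanistan", 59.61, 23.21, 74.30, 4.44, 0.40, 1.5171],
    ["Haiti", 45.00, 47.67, 73.10, 0.09, 3.40, 1.7999],
    ["Nigeria", 51.30, 38.23, 82.60, 1.07, 4.10, 2.4493],
    ["Egypt", 70.48, 26.58, 19.60, 1.86, 5.30, 2.8622],
    ["Argentina", 75.77, 32.30, 13.30, 0.76, 10.10, 2.9961],
    ["China", 74.87, 29.98, 13.70, 1.95, 6.40, 3.6356],
    ["Brazil", 73.12, 42.93, 14.50, 1.43, 7.20, 3.7741],
    ["Israel", 81.30, 28.80, 3.60, 6.77, 12.50, 5.8069],
    ["U.S.A", 78.51, 29.85, 6.30, 4.72, 13.70, 7.1357],
    ["Ireland", 80.15, 27.23, 3.50, 0.60, 11.50, 7.5360],
    ["U.K.", 80.09, 28.49, 4.40, 2.59, 13.00, 7.7751],
    ["Germany", 80.24, 22.07, 3.50, 1.31, 12.00, 8.0461],
    ["Canada", 80.99, 24.79, 4.90, 1.42, 14.20, 8.6725],
    ["Australia", 82.09, 25.40, 4.20, 1.86, 11.50, 8.8442],
    ["Sweden", 81.43, 22.18, 2.40, 1.27, 12.80, 9.2985],
    ["NewZealand", 80.67, 27.81, 4.90, 1.13, 12.30, 9.4627],
]

# Query features for Russia, in the same column order as the rows above.
RUSSIA = [67.62, 31.68, 10.00, 3.87, 12.90]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(rows, query, k, weighted=False):
    """Predict a continuous target from the k nearest rows.

    Hypothetical helper: each row is [name, features..., target]. With
    weighted=True, each neighbor's target is weighted by 1 / distance**2.
    """
    if not 1 <= k <= len(rows):
        raise ValueError("k must be between 1 and the number of rows")
    # Pair every row's distance to the query with its target, nearest first.
    scored = sorted((euclidean(row[1:-1], query), row[-1]) for row in rows)
    nearest = scored[:k]
    if not weighted:
        return sum(target for _, target in nearest) / k
    # An exact match has zero distance; return its target instead of dividing by zero.
    for dist, target in nearest:
        if dist == 0:
            return target
    weights = [1 / dist ** 2 for dist, _ in nearest]
    return sum(w * t for w, (_, t) in zip(weights, nearest)) / sum(weights)


print(f"3-NN: {knn_predict(COUNTRIES, RUSSIA, k=3):.4f}")
print(f"weighted 16-NN: {knn_predict(COUNTRIES, RUSSIA, k=16, weighted=True):.4f}")
```

Keeping k and the weighting switch as arguments means the same function can be reused unchanged on range-normalized rows.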

In [7]:
from math import sqrt

# Sample data
countries = [
    ["Afghanistan", 59.61, 23.21, 74.30, 4.44, 0.40, 1.5171],
    ["Haiti", 45.00, 47.67, 73.10, 0.09, 3.40, 1.7999],
    ["Nigeria", 51.30, 38.23, 82.60, 1.07, 4.10, 2.4493],
    ["Egypt", 70.48, 26.58, 19.60, 1.86, 5.30, 2.8622],
    ["Argentina", 75.77, 32.30, 13.30, 0.76, 10.10, 2.9961],
    ["China", 74.87, 29.98, 13.70, 1.95, 6.40, 3.6356],
    ["Brazil", 73.12, 42.93, 14.50, 1.43, 7.20, 3.7741],
    ["Israel", 81.30, 28.80, 3.60, 6.77, 12.50, 5.8069],
    ["U.S.A", 78.51, 29.85, 6.30, 4.72, 13.70, 7.1357],
    ["Ireland", 80.15, 27.23, 3.50, 0.60, 11.50, 7.5360],
    ["U.K.", 80.09, 28.49, 4.40, 2.59, 13.00, 7.7751],
    ["Germany", 80.24, 22.07, 3.50, 1.31, 12.00, 8.0461],
    ["Canada", 80.99, 24.79, 4.90, 1.42, 14.20, 8.6725],
    ["Australia", 82.09, 25.40, 4.20, 1.86, 11.50, 8.8442],
    ["Sweden", 81.43, 22.18, 2.40, 1.27, 12.80, 9.2985],
    ["NewZealand", 80.67, 27.81, 4.90, 1.13, 12.30, 9.4627]
]
# Euclidean distance
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(1, 6):  # Only using the feature columns for distance calculation
        distance += (row1[i] - row2[i]) ** 2
    return sqrt(distance)

# Print the table
def print_table(countries_with_distance_and_weights, is_weighted):
    print("Country", "Euclid", "CPI", "Weight" if is_weighted else "", "W*CPI" if is_weighted else "")
    for entry in countries_with_distance_and_weights:
        country, dist, cpi, weight, weighted_cpi = entry
        if is_weighted:
            print(country, "{:.4f}".format(dist), "{:.4f}".format(cpi), "{:.4f}".format(weight), "{:.4f}".format(weighted_cpi))
        else:
            print(country, "{:.4f}".format(dist), "{:.4f}".format(cpi))

# k-nearest neighbor prediction (k defaults to 3)
def kNN_3(countries, target_country, k=3):
    countries_with_distance = []
    for country in countries:
        dist = euclidean_distance(country, target_country)
        countries_with_distance.append((country[0], dist, country[6], 0, 0))

    # Sort by distance so the nearest neighbors come first
    countries_with_distance.sort(key=lambda tup: tup[1])
    
    print_table(countries_with_distance, is_weighted=False)

    # Compute the mean CPI of the k nearest neighbors
    sum_cpi = 0
    for i in range(k):
        sum_cpi += countries_with_distance[i][2]
    
    return sum_cpi / k

# Weighted k-NN prediction (k defaults to 16, i.e., the full dataset)
def weighted_kNN_16(countries, target_country, k=16):
    countries_with_weights = []
    for country in countries:
        dist = euclidean_distance(country, target_country)
        weight = 1 / (dist ** 2)  # reciprocal of the squared distance; assumes dist > 0
        weighted_cpi = weight * country[6]
        countries_with_weights.append((country[0], dist, country[6], weight, weighted_cpi))

    # Sort by distance so the nearest neighbors come first
    countries_with_weights.sort(key=lambda tup: tup[1])
    
    print_table(countries_with_weights, is_weighted=True)

    # Weighted average of CPI over the k nearest neighbors
    weighted_sum = 0
    weight_accumulator = 0
    for i in range(k):
        weighted_sum += countries_with_weights[i][4]
        weight_accumulator += countries_with_weights[i][3]
    
    return weighted_sum / weight_accumulator

russia = ["Russia", 67.62, 31.68, 10.00, 3.87, 12.90]
cpi_3nn = kNN_3(countries, russia)
print("Country  Euclid  CPI")
print(f"CPI for 3-NN: {cpi_3nn}")
cpi_weighted_16nn = weighted_kNN_16(countries, russia)
print("Country  Euclid  CPI  Weight  W*CPI")
print(f"CPI for weighted 16-NN: {cpi_weighted_16nn}")

Country Euclid CPI  
Argentina 9.7575 2.9961
China 10.7275 3.6356
U.S.A 11.7044 7.1357
Egypt 13.7168 2.8622
U.K. 14.0956 7.7751
Brazil 14.6801 3.7741
NewZealand 14.8040 9.4627
Ireland 15.2219 7.5360
Israel 15.6514 5.8069
Canada 16.1224 8.6725
Australia 16.9841 8.8442
Germany 17.3560 8.0461
Sweden 18.5875 9.2985
Afghanistan 66.5354 1.5171
Haiti 69.6670 1.7999
Nigeria 75.2681 2.4493
Country  Euclid  CPI
CPI for 3-NN: 4.589133333333334
Country Euclid CPI Weight W*CPI
Argentina 9.7575 2.9961 0.0105 0.0315
China 10.7275 3.6356 0.0087 0.0316
U.S.A 11.7044 7.1357 0.0073 0.0521
Egypt 13.7168 2.8622 0.0053 0.0152
U.K. 14.0956 7.7751 0.0050 0.0391
Brazil 14.6801 3.7741 0.0046 0.0175
NewZealand 14.8040 9.4627 0.0046 0.0432
Ireland 15.2219 7.5360 0.0043 0.0325
Israel 15.6514 5.8069 0.0041 0.0237
Canada 16.1224 8.6725 0.0038 0.0334
Australia 16.9841 8.8442 0.0035 0.0307
Germany 17.3560 8.0461 0.0033 0.0267
Sweden 18.5875 9.2985 0.0029 0.0269
Afghanistan 66.5354 1.5171 0.0002 0.0003
Haiti 69.6670 1.

3. The descriptive features in this dataset are of different types. For example, some are
percentages, others are measured in years, and others are measured in counts per 1,000.
We should always consider normalizing our data, but it is particularly important to do
this when the descriptive features are measured in different units. What value would
a 3-nearest neighbor prediction model using Euclidean distance return for the CPI of
Russia when the descriptive features have been normalized using range normalization?
(Hint: the normalized query is ['Russia', 0.6099, 0.3754, 0.0948, 0.5658, 0.9058].)
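Range normalization rescales each feature to [0, 1] using that feature's minimum and maximum in the dataset, and the query has to be scaled with the same minimum and maximum. The sketch below illustrates this on the feature columns copied from the dataset; `column_ranges` and `range_normalize_row` are hypothetical helper names, and the code assumes every column has max > min. The normalized Russia row it prints should agree with the hinted query to about three decimal places.

```python
# Descriptive features only (name and CPI omitted), copied from the assignment data.
FEATURES = [
    [59.61, 23.21, 74.30, 4.44, 0.40],   # Afghanistan
    [45.00, 47.67, 73.10, 0.09, 3.40],   # Haiti
    [51.30, 38.23, 82.60, 1.07, 4.10],   # Nigeria
    [70.48, 26.58, 19.60, 1.86, 5.30],   # Egypt
    [75.77, 32.30, 13.30, 0.76, 10.10],  # Argentina
    [74.87, 29.98, 13.70, 1.95, 6.40],   # China
    [73.12, 42.93, 14.50, 1.43, 7.20],   # Brazil
    [81.30, 28.80, 3.60, 6.77, 12.50],   # Israel
    [78.51, 29.85, 6.30, 4.72, 13.70],   # U.S.A
    [80.15, 27.23, 3.50, 0.60, 11.50],   # Ireland
    [80.09, 28.49, 4.40, 2.59, 13.00],   # U.K.
    [80.24, 22.07, 3.50, 1.31, 12.00],   # Germany
    [80.99, 24.79, 4.90, 1.42, 14.20],   # Canada
    [82.09, 25.40, 4.20, 1.86, 11.50],   # Australia
    [81.43, 22.18, 2.40, 1.27, 12.80],   # Sweden
    [80.67, 27.81, 4.90, 1.13, 12.30],   # NewZealand
]


def column_ranges(rows):
    """Return a (min, max) pair for every column of the dataset."""
    return [(min(col), max(col)) for col in zip(*rows)]


def range_normalize_row(row, ranges):
    """Scale one row column by column; assumes max > min for every column."""
    return [(value - lo) / (hi - lo) for value, (lo, hi) in zip(row, ranges)]


ranges = column_ranges(FEATURES)

# The query is scaled with the dataset's ranges, not with its own values.
russia_normalized = range_normalize_row([67.62, 31.68, 10.00, 3.87, 12.90], ranges)
print([round(value, 4) for value in russia_normalized])
```

Because the query is scaled with the dataset's ranges, a query value outside the observed minimum or maximum would fall outside [0, 1]; that does not happen for Russia here.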

In [8]:
def range_normalize(dataset):
    # Find min and max values for each feature
    min_vals = [min([row[i] for row in dataset]) for i in range(1, 6)]
    max_vals = [max([row[i] for row in dataset]) for i in range(1, 6)]
    
    # Normalize the dataset
    normalized_dataset = []
    for row in dataset:
        normalized_row = [row[0]]
        for i in range(1, 6):
            normalized_val = (row[i] - min_vals[i-1]) / (max_vals[i-1] - min_vals[i-1])
            normalized_row.append(normalized_val)
        normalized_row.append(row[6])  # Append CPI without normalization
        normalized_dataset.append(normalized_row)
    
    return normalized_dataset

normalized_countries = range_normalize(countries)
russia_normalized = ["Russia", 0.6099, 0.3754, 0.0948, 0.5658, 0.9058]
cpi_3nn_normalized = kNN_3(normalized_countries, russia_normalized)
print(f"CPI for 3-NN with normalized data: {cpi_3nn_normalized}")


Country Euclid CPI  
U.S.A 0.3362 7.1357
U.K. 0.4125 7.7751
Argentina 0.5553 2.9961
Australia 0.5643 8.8442
NewZealand 0.5664 9.4627
Israel 0.5869 5.8069
China 0.5909 3.6356
Canada 0.5914 8.6725
Ireland 0.6331 7.5360
Germany 0.6437 8.0461
Sweden 0.6609 9.2985
Egypt 0.6736 2.8622
Brazil 0.7226 3.7741
Afghanistan 1.2754 1.5171
Nigeria 1.2887 2.4493
Haiti 1.4748 1.7999
CPI for 3-NN with normalized data: 5.968966666666667


4. What value would a weighted k-NN prediction model—with k=16 (i.e., the full dataset)
and using a weighting scheme of the reciprocal of the squared Euclidean distance
between the neighbor and the query—return for the CPI of Russia when it is applied
to the range-normalized data?
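Written out, the weighting scheme described in this question is a weighted average of the neighbors' CPI values. The notation below is introduced here for illustration and is not taken from the assignment: q is the query, d_i is the i-th nearest neighbor, t_i is its CPI, k is the number of neighbors, and m is the number of descriptive features.

```latex
\hat{t}(\mathbf{q}) =
  \frac{\displaystyle\sum_{i=1}^{k} \frac{1}{\operatorname{dist}(\mathbf{q}, \mathbf{d}_i)^{2}} \, t_i}
       {\displaystyle\sum_{i=1}^{k} \frac{1}{\operatorname{dist}(\mathbf{q}, \mathbf{d}_i)^{2}}},
\qquad
\operatorname{dist}(\mathbf{q}, \mathbf{d}) = \sqrt{\sum_{j=1}^{m} (q_j - d_j)^{2}}
```

With k = 16 every country contributes to the prediction, but a country far from the query has a small weight and so moves the result only slightly.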

In [10]:
from math import sqrt

def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)):
        distance += (row1[i] - row2[i]) ** 2
    return sqrt(distance)

def range_normalize(dataset):
    # Find min and max values for each feature
    min_vals = [min([row[i] for row in dataset]) for i in range(1, 6)]
    max_vals = [max([row[i] for row in dataset]) for i in range(1, 6)]
    
    # Normalize the dataset
    normalized_dataset = []
    for row in dataset:
        normalized_row = [row[0]]
        for i in range(1, 6):
            normalized_val = (row[i] - min_vals[i-1]) / (max_vals[i-1] - min_vals[i-1])
            normalized_row.append(normalized_val)
        normalized_row.append(row[6])  # Append CPI without normalization
        normalized_dataset.append(normalized_row)
    
    return normalized_dataset

def weighted_kNN_16(dataset, query):
    distances = []
    for row in dataset:
        if row[0] != "Russia":  # Exclude Russia itself
            dist = euclidean_distance(query[1:], row[1:-1])  # excluding the name and the CPI
            distances.append((row[0], dist, row[-1]))  # name, distance, CPI

    # Compute the weighted CPI using the reciprocal of the squared distance
    total_weight = 0
    weighted_cpi_sum = 0
    for _, distance, cpi in distances:
        weight = 1 / (distance ** 2)
        total_weight += weight
        weighted_cpi_sum += weight * cpi

    # Return the weighted average CPI
    return weighted_cpi_sum / total_weight

normalized_countries = range_normalize(countries)
russia_normalized = [0.6099, 0.3754, 0.0948, 0.5658, 0.9058]
cpi_weighted_16nn_normalized = weighted_kNN_16(normalized_countries, ["Russia"] + russia_normalized)
print(f"CPI for weighted 16-NN with normalized data: {cpi_weighted_16nn_normalized:.4f}")


CPI for weighted 16-NN with normalized data: 6.6347


2 Useful Help
You should not use scikit-learn's k-NN for this program, but you are allowed to use
scikit-learn for range normalization.
An online tutorial, "Tutorial To Implement k-Nearest Neighbors in Python From Scratch" (see the link:
http://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/),
should be helpful for this program. Read the tutorial and understand how k-NN can be used
to make predictions on Iris.data.
You can use the code as the starting point for your program and modify it from there. Keep
in mind that there are many differences. To name a few:
1. The target values of your problem are continuous, not discrete;
2. You do not need to split training and testing data;
3. You do not need to evaluate your prediction accuracy;
4. You need to normalize your data for questions 3 and 4.
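The first difference affects only the last step of k-NN, where the neighbors are combined into a prediction. A classifier such as the Iris example typically takes a majority vote over the neighbors' labels, while a continuous target calls for an average. The sketch below contrasts the two; `majority_vote` and `mean_target` are hypothetical names, the labels are made-up examples, and the three CPI values are those of the three nearest neighbors of Russia in the unnormalized tables above.

```python
from collections import Counter


def majority_vote(labels):
    """Classification: return the most common label among the neighbors."""
    return Counter(labels).most_common(1)[0][0]


def mean_target(values):
    """Regression: return the average target value among the neighbors."""
    return sum(values) / len(values)


# Discrete target (illustrative labels only).
print(majority_vote(["setosa", "setosa", "virginica"]))

# Continuous target: CPI of Argentina, China, and U.S.A.
print(round(mean_target([2.9961, 3.6356, 7.1357]), 4))
```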