PROBLEM 2

The solution involves using the cumulative distribution function (CDF) of the minimum distance from the origin to n points drawn uniformly from a p-dimensional unit ball centered at the origin. The CDF can be derived by finding the probability that the minimum distance is less than a certain value y, which can be calculated as the volume of the p-dimensional unit ball centered at the origin that is inside the sphere of radius y around the origin. The median of the CDF then corresponds to the value of y such that 50% of the points are closer to the origin than y and 50% are farther away.

To find the CDF, we first need to find the probability density function (PDF) of the distance from the origin to a single point xi. The PDF is given by the product of the volume of the unit ball and the reciprocal of the total number of points, which is proportional to the volume of the p-dimensional sphere of radius y.

Next, we use the CDF of the minimum of n independent and identically distributed random variables to find the CDF of the minimum distance from the origin to the closest data point. The CDF can be expressed as the product of the CDF of the distance to a single point and the complementary cumulative distribution function (CCDF) of the maximum distance, raised to the power of n.

Finally, we find the median of the CDF by setting it equal to 0.5 and solving for y. The median distance from the origin to the closest data point can then be calculated. The specific expression for the median distance will depend on the dimensionality of the unit ball, but in general it will increase as the number of data points n increases and as the dimensionality of the unit ball p increases.

Taking into account the equations:

The probability density function (PDF) of the distance from the origin to a single point xi can be expressed as:

f(y) = n * Vp * (1/Vp) * (y^(p-1))/y^p
where n is the number of points, Vp is the volume of the p-dimensional unit ball, and y is the distance from the origin to the closest point.

The CDF of the minimum distance from the origin to the closest point can then be derived as:

F(y) = (1 - (1 - f(y))^n)

Finally, to find the median distance from the origin to the closest point, we set F(y) = 0.5 and solve for y:

0.5 = 1 - (1 - f(y))^n
1 - 0.5 = (1 - f(y))^n
0.5 = (1 - y^p)^n
(1 - 0.5^(1/n))^(1/p) = y


d(p,N)  = \left(1-\frac{1}{2}^{1/N}\right)^{1/p}.
d(p,N)  = \left(1-\frac{1}{2}^{1/N}\right)^{1/2}.

The specific expression for the median distance will depend on the dimensionality of the unit ball and the number of data points. In general, the median distance will increase as the number of data points n increases and as the dimensionality of the unit ball p increases.

PROBLEM 4

Theoretically, the result of f(0) is = 1. If we perform the simulation, we get a value close to 0, which is not the correct one. Therefore, we can say that the bias error is high due to the fact that the final result is always  close to 0, not to 1.

We know that the input data is homogeneously distributed in a p-dimensional unit ball space (origin- centered). Hence, the probability of any point to be closest to the origin is going to be the samefor all. Thus, the average value of the nearest neighbors will not be a good (accurate) estimator because it will not fit into the correct answer.

In [4]:
from keras.models import Sequential
from keras.layers import Dense, Activation
import math
import tensorflow as tf
import random as rand
import pandas as pd
import numpy as np
import sklearn
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

N = 200
p = 10

def f(x):
    sum = 0
    for h in range(p):
        sum += pow(x[h], 2)
    inside = pow(sum, 1/p)
    result = math.exp((-10)*inside)
    return result

# Generate "num_points" random points in "dimension" that have uniform
# probability over the unit ball scaled by "radius" (length of points
# are in range [0, "radius"]).
def random_ball(num_points, dimension, radius=1):
    from numpy import random, linalg
    # First generate random directions by normalizing the length of a
    # vector of random-normal values (these distribute evenly on ball).
    random_directions = random.normal(size=(dimension,num_points))
    random_directions /= linalg.norm(random_directions, axis=0)
    # Second generate a random radius with probability proportional to
    # the surface area of a ball with a given radius.
    random_radii = random.random(num_points) ** (1/dimension)
    # Return the list of random (direction & length) points.
    return radius * (random_directions * random_radii).T


def plot(X_train):
    plt.xlabel("X1")
    plt.ylabel("X2")
    plt.title("Homogeneously distributed p-ball sampling")
    plt.scatter(X_train.loc[:, 0], X_train.loc[:, 1], color = 'orange')
    plt.show()


X_radius = pd.DataFrame(random_ball(N, p))
for i in range(N): X_radius['result'] = f(X_radius.loc[i, :9])

# print(X_radius.head())    
# plot(X_radius)




In [5]:
result_y = X_radius.loc[:, 9]
y_hat = result_y.mean()
y_std = result_y.std()

# print(result_y)
# print(y_hat)
# print(y_std)

# plt.hist(result_y, bins = 50)


With a Nearest Neighbor. k = 1.

In [6]:
# Import necessary modules
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
  
  
# Create feature and target arrays
X = X_radius.loc[:, :9]
y = X_radius['result']
  
# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
  
knn = KNeighborsRegressor(n_neighbors=1)
  
knn.fit(X_train, y_train)
  
# Predict on dataset which model has not seen before
test = np.zeros((1,p))
pred = round(knn.predict(test)[0], 6)
#knn.score(X_train, y_train)

correct = round(f(np.zeros(p)), 5)

print("The predicted result is: ", pred )
print("The accurate result is: ", correct)


The predicted result is:  4.8e-05
The accurate result is:  1.0


Look how 4.7e-05 is far away from 1.0. Thus, demonstrating the high bias of the estimator.