### Import necessary modules and libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import StructType, StructField, IntegerType

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.clustering import KMeans
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.evaluation import ClusteringEvaluator

seed = 56611230  # fixed seed so the K-means and MLP runs are reproducible

### Create a spark session

In [2]:
spark = SparkSession.builder \
    .master("local[*]") \
    .getOrCreate()

spark

### Import the train and test data

In [3]:
schema = StructType([StructField("label", IntegerType(), True)] +
                    [StructField(f"pixel{i}", IntegerType(), True) for i in range(1, 785)])

mnist_train = spark.read.options(header = False).schema(schema).csv("dataset/mnist_train.csv.tar.gz").dropna()
mnist_test = spark.read.options(header = False).schema(schema).csv("dataset/mnist_test.csv.tar.gz").dropna()

# Recommended alternative: extract the archives first and read the plain .csv files directly

# mnist_train = spark.read.options(header = False).schema(schema).csv("dataset/mnist_train.csv").dropna()
# mnist_test = spark.read.options(header = False).schema(schema).csv("dataset/mnist_test.csv").dropna()

### Reformat the train and test set<br/>

Each 28×28 image is flattened into a 784-dimensional vector (28*28 = 784), where each component is the grayscale intensity of one pixel. VectorAssembler packs the 784 pixel columns into a single "features" column, which is the input to every model below.
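
As a minimal illustration of the flattening described above, the sketch below turns a tiny 3×3 "image" (a made-up stand-in for a 28×28 digit) into a row-major vector. The helper name `flatten_image` is hypothetical and not part of this notebook.

```python
def flatten_image(image):
    """Flatten a 2-D list of pixel intensities into a 1-D list, row by row."""
    return [pixel for row in image for pixel in row]

# A made-up 3x3 grayscale image; a real MNIST image would be 28x28.
tiny_image = [
    [0, 128, 255],
    [64, 32, 16],
    [7, 8, 9],
]

flat = flatten_image(tiny_image)
print(flat)       # [0, 128, 255, 64, 32, 16, 7, 8, 9]
print(len(flat))  # 9, i.e. 3*3; for MNIST this would be 28*28 = 784
```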

In [4]:
assembler = VectorAssembler(inputCols=[f"pixel{i}" for i in range(1, 785)],
                            outputCol="features")
train = assembler.transform(mnist_train)
test = assembler.transform(mnist_test)

### Logistic regression<br/>

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable. It is a form of binomial regression.<br/>

$$
p(y = 1 | \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x} - b}}
$$

The core of logistic regression is the logistic function $\sigma(z) = 1 / (1 + e^{-z})$, where $z = \mathbf{w}^T\mathbf{x} + b$ is the linear combination of the input features $\mathbf{x}$ with their weights $\mathbf{w}$, plus the bias $b$. The parameters (weights and bias) are estimated by maximum likelihood, that is, by choosing the values that maximize the likelihood of the observed data.<br/>

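Because MNIST is a multi-class problem (digits 0 through 9), the model outputs a vector of 10 class probabilities for each image rather than a single probability. With more than two classes, Spark's LogisticRegression fits a multinomial (softmax) model by default.<br/>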
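
To make the formula concrete, here is a small pure-Python sketch of the logistic function and of softmax, the usual generalization that produces one probability per class. The weights, inputs, and scores are made-up values for illustration only; this is not Spark's internal implementation.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turn a list of class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Binary case: p(y = 1 | x) = sigmoid(w.x + b), with made-up w, x and b.
w, x, b = [0.5, -0.25], [2.0, 4.0], 0.0
z = sum(wi * xi for wi, xi in zip(w, x)) + b  # 0.5*2 - 0.25*4 + 0 = 0
p = sigmoid(z)
print(p)  # 0.5

# Multi-class case: one made-up score per class (3 classes here, 10 for MNIST).
probs = softmax([2.0, 1.0, 0.1])
predicted = probs.index(max(probs))  # the class with the highest probability
print(predicted)  # 0
```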

The result below shows that logistic regression performs reasonably well, at about 90% test accuracy:

In [5]:
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
lr_model = lr.fit(train)

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")

train_predictions = lr_model.transform(train)
train_accuracy = evaluator.evaluate(train_predictions)
test_predictions = lr_model.transform(test)
test_accuracy = evaluator.evaluate(test_predictions)

print(f"Training Accuracy (Logistic regression) = {train_accuracy}")
print(f"Test Accuracy (Logistic regression) = {test_accuracy}")

Training Accuracy (Logistic regression) = 0.8965833333333333
Test Accuracy (Logistic regression) = 0.9002


regParam: Controls the regularization strength, which helps to prevent overfitting. A higher value regularizes more strongly, which reduces overfitting but may underfit the data.<br/>

elasticNetParam: Dictates the mix between L1 and L2 regularization. A value of 0 corresponds to L2 regularization only, while 1 corresponds to L1 only.<br/>
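
The sketch below shows how the two parameters combine into a single penalty term, following the elastic-net form given in Spark's documentation on linear methods: regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2). The function name and the weight vector are hypothetical, chosen only for illustration.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss: a mix of the L1 and (halved) squared L2 norms."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    mix = elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    return reg_param * mix

weights = [1.0, -2.0, 0.0]  # made-up weights: L1 norm = 3, squared L2 norm = 5

l2_only = elastic_net_penalty(weights, reg_param=0.01, elastic_net_param=0.0)
l1_only = elastic_net_penalty(weights, reg_param=0.01, elastic_net_param=1.0)
mixed = elastic_net_penalty(weights, reg_param=0.01, elastic_net_param=0.5)

print(l2_only)  # about 0.025  (0.01 * 5 / 2)
print(l1_only)  # about 0.03   (0.01 * 3)
print(mixed)    # about 0.0275 (0.01 * (1.5 + 1.25))
```

Doubling regParam doubles the penalty for the same weights, which is why larger values push the optimizer toward smaller weights.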

The example below adds regularization to logistic regression. In this run, it lowers both training and test accuracy slightly compared with the unregularized model:

In [6]:
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10, regParam=0.01, elasticNetParam=0.5)
lr_model = lr.fit(train)

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
train_predictions = lr_model.transform(train)
train_accuracy = evaluator.evaluate(train_predictions)
test_predictions = lr_model.transform(test)
test_accuracy = evaluator.evaluate(test_predictions)

print(f"Training Accuracy (Logistic regression) = {train_accuracy}")
print(f"Test Accuracy (Logistic regression) = {test_accuracy}")

Training Accuracy (Logistic regression) = 0.88075
Test Accuracy (Logistic regression) = 0.8884


### K-Means clustering<br/>

K-Means is a clustering algorithm that partitions a dataset into a predefined number of clusters, denoted k, such that points in the same cluster are more similar to each other than to points in other clusters. It alternates between the two steps below until the assignments stop changing.<br/>

Assignment step: Each observation is assigned to the nearest cluster by minimizing the distance between the observation and the cluster centroid.<br/>

$$
S_i^{(t)} = \{ x_p : \| x_p - \mu_i^{(t)} \| \leq \| x_p - \mu_j^{(t)} \| \text{ for all } j = 1, \dots, k \}
$$

Update step: The centroids of the clusters are recalculated as the mean of all observations assigned to each cluster:<br/>

$$
\mu_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j
$$

The output of applying K-Means to MNIST would be the assignment of each image to one of k clusters.<br/>
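
The two steps above can be sketched in plain Python as follows. This is a toy version of the textbook iteration on four made-up 2-D points with hand-picked starting centroids; it is not how Spark's KMeans is implemented, and the helper names `assign` and `update` are hypothetical.

```python
import math

def assign(points, centroids):
    """Assignment step: index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]

def update(points, labels, centroids):
    """Update step: each centroid becomes the mean of its assigned points."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:  # keep an empty cluster's centroid where it was
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]  # made-up starting centroids, k = 2

for _ in range(10):  # iterate until the centroids stop moving
    labels = assign(points, centroids)
    new_centroids = update(points, labels, centroids)
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(0.0, 1.0), (10.0, 11.0)]
```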

We choose k=10 because MNIST has exactly 10 classes (digits 0 through 9).<br/>

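The Silhouette Score is a popular metric for assessing the quality of clusters created by algorithms like K-means. It measures how similar each data point is to its own cluster compared to other clusters. It ranges from -1 to 1: values near 1 indicate well-separated clusters, and values near 0 indicate overlapping clusters.<br/>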
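
As a rough sketch of how the score is computed, the function below uses the standard definition s = (b - a) / max(a, b), where a is a point's mean distance to the other points in its own cluster and b is its mean distance to the nearest other cluster. It uses plain Euclidean distance on made-up points, whereas Spark's ClusteringEvaluator defaults to squared Euclidean distance, so the numbers are not directly comparable. The function names are hypothetical.

```python
import math

def mean_distance(point, others):
    """Average Euclidean distance from point to every point in others."""
    return sum(math.dist(point, q) for q in others) / len(others)

def silhouette_score(points, labels):
    """Mean silhouette over all points (needs at least two clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other points in the same cluster
        # (the point's distance to itself is 0, so divide by len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            mean_distance(p, members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

good = silhouette_score(points, [0, 0, 1, 1])  # clusters match the two groups
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters cut across the groups

print(good)  # close to 1: tight, well-separated clusters
print(bad)   # negative: points sit closer to the other cluster
```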

The results for K-Means clustering are shown below. Both scores are close to 0, which suggests that the clusters overlap considerably and are not distinctly separated, implying that K-Means clustering on raw pixels may not be a good fit for the MNIST problem:

In [7]:
# Train a K-means model
kmeans = KMeans().setK(10).setSeed(seed)
kmeans_model = kmeans.fit(train.select('features'))  # Note: no labels used

# Assign clusters
train_predictions_kmeans = kmeans_model.transform(train)
test_predictions_kmeans = kmeans_model.transform(test)

# Evaluate clustering
evaluator = ClusteringEvaluator()
silhouette_train = evaluator.evaluate(train_predictions_kmeans)
silhouette_test = evaluator.evaluate(test_predictions_kmeans)

print(f"Train Silhouette Score (K-means) = {silhouette_train}")
print(f"Test Silhouette Score (K-means) = {silhouette_test}")

Train Silhouette Score (K-means) = 0.10656523514439634
Test Silhouette Score (K-means) = 0.1083833467339384


### Multi-Layer Perceptron (MLP)<br/>

A Multi-Layer Perceptron (MLP) is a type of artificial neural network for tackling complex prediction tasks. MLPs belong to the category of feedforward neural networks, which means the data flows in one direction from input to output, with no cycles or loops.<br/>

An MLP consists of multiple layers, each composed of nodes or neurons.<br/>

Input Layer: This layer receives the raw input signal.<br/>
Hidden Layers: One or more layers that perform computations and feature transformations.<br/>
Output Layer: This layer produces the final prediction or classification output.<br/>

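Node activation, where $\sigma$ is the activation function, $w_{ij}$ are the weights, and $b_i$ is the bias:<br/>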

$$
a_i = \sigma\left(\sum_{j} w_{ij} x_j + b_i\right)
$$

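Cost function (mean squared error over n examples, shown here as a simple example; Spark's MLP classifier uses a softmax output layer with cross-entropy loss):<br/>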

$$
C = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

Weight update (gradient descent with learning rate $\eta$; backpropagation computes the gradients):<br/>

$$
w_{ij} \leftarrow w_{ij} - \eta \frac{\partial C}{\partial w_{ij}}
$$

The output layer consists of 10 neurons, where each neuron corresponds to a digit from 0 to 9.
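
The three formulas above can be tied together in a toy example: a single sigmoid neuron trained with plain gradient descent on the mean squared error. This is only a sketch under simplifying assumptions (one neuron, a made-up OR dataset, a hand-picked learning rate); Spark's MultilayerPerceptronClassifier uses its own optimizer and loss, and all function names here are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    """Node formula: a = sigma(sum_j w_j * x_j + b)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def mse(w, b, data):
    """Cost formula: C = (1/n) * sum_i (y_i - y_hat_i)^2."""
    return sum((y - forward(w, b, x)) ** 2 for x, y in data) / len(data)

def gradient_step(w, b, data, eta):
    """Update formula: w <- w - eta * dC/dw (and likewise for b)."""
    n = len(data)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in data:
        a = forward(w, b, x)
        # Chain rule: dC/dz = -2 * (y - a) * a * (1 - a) / n for one example.
        delta = -2.0 * (y - a) * a * (1.0 - a) / n
        for j, xj in enumerate(x):
            grad_w[j] += delta * xj
        grad_b += delta
    new_w = [wj - eta * gj for wj, gj in zip(w, grad_w)]
    return new_w, b - eta * grad_b

# Made-up training set: the logical OR of two inputs.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

w, b = [0.0, 0.0], 0.0
cost_before = mse(w, b, data)  # 0.25: every output starts at 0.5
for _ in range(1000):
    w, b = gradient_step(w, b, data, eta=1.0)
cost_after = mse(w, b, data)

print(cost_before, cost_after)  # the cost drops as the weights are updated
```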

The results for the MLPs are shown below. Two architectures were tried, and both attained similar results:

In [8]:
layers = [784, 128, 64, 10]  # 784 input features, two hidden layers of 128 and 64 neurons, and 10 output classes

mlp = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=seed)

mlp_model = mlp.fit(train)

train_predictions_mlp = mlp_model.transform(train)
test_predictions_mlp = mlp_model.transform(test)

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
train_accuracy_mlp = evaluator.evaluate(train_predictions_mlp)
test_accuracy_mlp = evaluator.evaluate(test_predictions_mlp)

print(f"Training Accuracy (MLP) = {train_accuracy_mlp}")
print(f"Test Accuracy (MLP) = {test_accuracy_mlp}")

Training Accuracy (MLP) = 0.9563166666666667
Test Accuracy (MLP) = 0.9446


In [9]:
layers = [784, 256, 128, 64, 10]  # 784 input features, three hidden layers of 256, 128 and 64 neurons, and 10 output classes

mlp = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=seed)

mlp_model = mlp.fit(train)

train_predictions_mlp = mlp_model.transform(train)
test_predictions_mlp = mlp_model.transform(test)

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
train_accuracy_mlp = evaluator.evaluate(train_predictions_mlp)
test_accuracy_mlp = evaluator.evaluate(test_predictions_mlp)

print(f"Training Accuracy (MLP) = {train_accuracy_mlp}")
print(f"Test Accuracy (MLP) = {test_accuracy_mlp}")

Training Accuracy (MLP) = 0.9568
Test Accuracy (MLP) = 0.9423
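
One way to compare the two architectures above is by their number of trainable parameters. The sketch below counts them, assuming fully connected layers with one bias per neuron; the helper `count_parameters` is hypothetical and not a Spark API.

```python
def count_parameters(layers):
    """Weights plus biases for a fully connected network with these layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))

small = count_parameters([784, 128, 64, 10])
large = count_parameters([784, 256, 128, 64, 10])

print(small)  # 109386
print(large)  # 242762
```

Under this assumption the second network has more than twice as many parameters, yet its test accuracy in these runs (0.9423) is no higher than the first network's (0.9446).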


### Result<br/>

MLP achieves the best result among all methods. Although the silhouette score of K-Means clustering cannot be compared directly with the accuracy of the other models, logistic regression's roughly 90% test accuracy is strong, whereas the near-zero silhouette scores suggest that K-Means separates the digits poorly.<br/>

#### Logistic Regression<br/>

Advantages:<br/>

It is relatively simple to implement, understand, and interpret.<br/>
It is computationally less intensive than more complex models like MLP.<br/>
It provides probabilities for outcomes, which can be a useful measure of confidence in predictions.<br/>

Disadvantages:<br/>

It can only capture linear boundaries between classes unless manually extended with kernels or polynomial terms.<br/>
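
The linearity limitation can be seen on a toy problem outside MNIST. In the sketch below, a linear classifier on two binary inputs cannot reproduce XOR for any weights on a small search grid (in fact, no linear solution exists), while adding the product term x1*x2 as a third feature makes it solvable. The grid, weights, and function names are made up for illustration.

```python
import itertools

# XOR truth table: ((x1, x2), label)
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def classify(weights, bias, features):
    """Predict 1 if the linear score is positive, else 0."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if z > 0 else 0

def linear(x):
    return (x[0], x[1])

def with_product(x):
    """Same inputs plus one polynomial term, x1 * x2."""
    return (x[0], x[1], x[0] * x[1])

def fits(weights, bias, expand):
    """True if the classifier reproduces every row of the XOR table."""
    return all(classify(weights, bias, expand(x)) == y for x, y in XOR)

# Search a coarse grid of linear weights and biases: -3.0, -2.5, ..., 3.0.
grid = [v / 2 for v in range(-6, 7)]
linear_solutions = [
    (w1, w2, b)
    for w1, w2, b in itertools.product(grid, repeat=3)
    if fits((w1, w2), b, linear)
]
print(len(linear_solutions))  # 0

# With the extra product feature, hand-picked weights solve XOR.
solved = fits((1.0, 1.0, -2.0), -0.5, with_product)
print(solved)  # True
```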

#### K-means Clustering<br/>

Advantages:<br/>

It can be used to find patterns or groupings in data without needing any labels.<br/>
It is typically fast and efficient in terms of computational resources.<br/>
It can be used for feature extraction or dimensionality reduction before applying another classification technique.<br/>

Disadvantages:<br/>

It does not utilize label information, making it less suitable directly for classification tasks like MNIST.<br/>
It assumes clusters are spherical and equally sized, which might not hold true for complex datasets.<br/>
The results can significantly vary based on the initial cluster centers' placement.<br/>

#### Multi-Layer Perceptron (MLP)

Advantages:<br/>

It can model highly complex relationships due to its capability to learn nonlinear models.<br/>
It performs well on large and complex datasets like MNIST.<br/>
It is capable of learning feature interactions automatically without needing manual intervention.<br/>

Disadvantages:<br/>

It is more computationally intensive and requires more resources, making training longer and more expensive.<br/>
Without proper regularization, MLPs can easily overfit to training data.<br/>
It requires careful tuning of parameters, including the number of hidden layers, number of neurons in each layer, learning rate, etc.<br/>

#### Comparison:

MLP: Best performance due to its flexibility in modeling non-linear and high dimensional data.<br/>
Logistic Regression: Decent performance but limited by its linearity.<br/>
K-means: Least suitable for direct application to classification problems like MNIST due to its unsupervised nature and basic assumptions about data distribution.<br/>

### Terminate the spark session

In [10]:
spark.stop()