# Problem Set 4 


___


_____

In [2]:
import pandas as pd
import numpy as np
from numpy.linalg import eig
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl

In [3]:
moral_df = pd.read_excel('moral_machine_data.xlsx')
moral_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1483 entries, 0 to 1482
Columns: 148 entries, age to cat_no_of_pedestrians_on_other_lane_saved
dtypes: int64(139), object(9)
memory usage: 1.7+ MB


## Question 1

[10 points] To develop a better understanding of the Moral Machine Experiment, please read the academic article on ‘Moral Machine Experiment’. Provide a summary of around 250 words by talking about the goals of the paper, data, and the results.


The **Moral Machine experiment** explores moral preferences of the public regarding ethical dilemmas faced by autonomous vehicles (AVs). This comes at a time when machines are being integrated into more and more human tasks and are faced with potential situations where human harm cannot be eliminated. Researchers developed the Moral Machine Experiment to get a more centralized view of such moral situations, which can then be taught to machines. This is an online platform that presents participants with various moral dilemmas, such as sacrificing passengers versus pedestrians or deciding whom to prioritize among potential victims. (in terms of traffic scenarios.) where AVs must choose between two different outcomes.

The data collected (which is all public) was first used to summerhouse global preferences. 9 categories of preference were explored. In general, participants showed a tendency to favor human lives over animals, prioritizing more lives over fewer lives, and saving younger individuals over older ones. However, researchers determined it would also be beneficial to be more specific with the preferences. 492, 921 respondents also completed a survey on age, education, gender, income, and political and religious views. This second analysis was done to observe the individual variation in choices. The researchers then looked at cultural clusters, identifying 3 clusters with distinct moral preferences: Western, Eastern, and Southern.  These variations in moral preferences also correlated with countries' cultural values, economic conditions, and institutions. For example, places with high economic inequality appeared to influence decisions about prioritizing individuals of higher or lower social status.

## Question 2

1. [10 points] Code the PCA algorithm from scratch. (Note: Your code should be able to
process any m x n dataset). (Note: Set random.seed(265) before you start).

2. [10 points] Test your algorithm on the columns that denote ‘the results of your decision’
(more information can be found in the data dictionary above). (Note: Set number of
dimensions to 2).

In [4]:
import random
random.seed(265)

In [5]:
# standardize data
def standardize(mat):
    mean = np.mean(mat, axis=0)
    std = np.std(mat, axis=0)
    return (mat - mean)/std
    
def covariance(mat):
    return np.cov(mat.T)

# find eigenvectors and eigenvalues of covariance matrix
# sort eigenvectors acc to magnitude
# choose top k and create transformation matrix
def extract_principal_components(mat, k=2):
    vals, vecs = eig(mat)
    abs_vec = np.abs(vecs)
    #making largest eigenvector positive directioned (for reference)
    vecs = (vecs*np.sign(vecs[np.argmax(abs_vec, axis=0), :])).T
    pairs = [(vals[i], vecs[i]) for i in range(len(vals))]
    pairs.sort(key=lambda x: x[0], reverse=True)
    top_k_vecs = np.array([vecs[i] for i in range(len(vecs))])[:k, :]
    return top_k_vecs

# transform matrix to one in k dimensions
def transform(top_k_vecs, og_mat):
    return og_mat.dot(top_k_vecs.T)

In [6]:
# apply on the results columns (ending in _saved or _died)
result_cols = [col for col in moral_df.columns if col.endswith('_saved') or col.endswith('_died')]

matrix = moral_df[result_cols]
matrix = standardize(matrix)
covar_matrix = covariance(matrix)
transformation_mat = extract_principal_components(covar_matrix)
new_matrix = transform(transformation_mat, matrix)

In [7]:
new_matrix #with new components

Unnamed: 0,0,1
0,-0.229410,-0.251343
1,0.182999,0.340581
2,-0.155791,-0.566522
3,-0.091939,0.202081
4,-0.602857,0.680327
...,...,...
1478,1.348812,-0.276964
1479,-0.001224,2.026693
1480,-0.746230,3.321773
1481,0.010015,1.726834


## Question 3

Using the two principal components you obtained in Q1, create five scatterplots
for the following five columns by using the visualization_code.py file in the assignment folder.
Cluster labels will be determined by the unique values in columns of the dataset (you don’t
need to run a separate clustering algorithm, but you will need to create class labels for some
of the observations in the columns below) (example: USA = Cluster 0, India = Cluster 1 etc.):
1. [3 points] age
2. [3 points] gender
3. [3 points] grown_up_in_US
4. [3 points] country_of_origin
5. [3 points] no_of_siblings (you will need to create a new column by doing: no_of_sisters + no_of_brothers)
6. [5 points] Interpret the visuals you obtained above in around 200 words. Specifically: Do
you see any patterns? Which column do you think creates the best clustering pattern?

In [None]:
#TO-DO

## Question 4
Repeat Q2, this time using spectral embedding for dimensionality reduction.
Please answer the following:
1. [10 points] What is spectral embedding? Please do some online research and explain how
spectral embedding works in around 200 words.
2. [10 points] Using the SpectralEmbedding module of sklearn2, create the same
set of five graphs (1 point each). Interpret the results in around 150 words (5 points).
Specifically: Do you see any patterns? Which column do you think creates the best
clustering pattern? And: Are the results better than PCA?

1. Spectral Embedding is a non linear dimensionality reduction technique that preserves local relationships, especially those used in graph structures or for clustering. Unlike PCA which decomposes the covariance matrix, spectral embedding performs eigendecomposition on a Laplacian matrix that  preserves the relationship between data points. The resulting eigenvectors represent the relationships in a lower dimensional space

    The process begins by representing the dataset as a similarity graph, where each data point is a node, and edges between nodes indicate similarity based on a chosen metric. The strength of these connections is quantified in an adjacency matrix. From this, the degree matrix is computed, containing the sum of connections or the degree of the nodes The graph Laplacian is then derived by subtracting the adjacency matrix from the degree matrix. This Laplacian matrix encapsulates the structure of the data.​
 
    By performing eigendecomposition on the Laplacian matrix, we obtain its eigenvalues and eigenvectors. Unlike PCA which looks at the k largest eigenvectors, this method finds k smallest non-zero eigenvalues. This is low magnitude eigenvalues represent slowly varying eigenvectors, which capture global structure and broad patterns in the data. Conversely, large eigenvalues capture diversity and high frequency changes in the dataset.
 
    Spectral embedding is particularly effective for clustering tasks, especially when dealing with data that exhibits complex, non-convex patterns. It has applications in various fields, including image segmentation, social network analysis, and bioinformatics, where understanding the underlying manifold structure of the data is crucial. ​

In [8]:
from sklearn.manifold import SpectralEmbedding
from sklearn.preprocessing import LabelEncoder

X = moral_df[result_cols]
embedding = SpectralEmbedding(n_components=2, random_state=265)
X_transformed = embedding.fit_transform(X)
print(f"Matrix X of dim {X.shape} -> Transformed Matrix of dim {X_transformed.shape}")

Matrix X of dim (1483, 120) -> Transformed Matrix of dim (1483, 2)


In [14]:
from scipy import interpolate
from scipy.spatial import ConvexHull

In [15]:
#TO-DO
moral_df['X_dim_1'] = X_transformed[:,0]
moral_df['X_dim_2'] = X_transformed[:,1]

moral_df['no_of_sublings'] = moral_df['no_of_brothers'] + moral_df['no_of_sisters']

moral_df['gender_label'] = moral_df['gender'].astype(str)
moral_df['grown_up_in_US_label'] = df['grown_up_in_US'].astype(str)
moral_df['country_of_origin_label'] = pd.factorize(moral_df['country_of_origin'])
moral_df['no_of_siblings_label'] = moral_df['no_of_siblings']

data = {
    'age':'Age',
    'gender_label': 'Gender',
    'grown_up_in_US_label': 'Country of Origin',
    'no_of_siblings_label':'Number of Siblings'
}

KeyError: 'gender'

In [None]:
mpl.rcParams['figure.dpi'] = 600
plt.rcParams['figure.figsize'] = (20.0, 10.0)
plt.rcParams['font.family'] = "serif"
plt.rcParams['font.size'] = 10

for col, title in data.items():
    moral_df['kmeans'] = mpl.rcParams['figure.dpi'] = 600
plt.rcParams['figure.figsize'] = (20.0, 10.0)
plt.rcParams['font.family'] = "serif"
plt.rcParams['font.size'] = 10[col]

pal = sns.color_palette("Paired", n_colors=len(set(moral_df['kmeans_label'])))

label_values = moral_df['kmeans_label'].unique()
label_to_color_index = {label: idx for idx, label in enumerate(label_values)}

plt.figure(figsize=(20,5))

pl = sns.scatterplot(
    x="x_dim_1", y="x_dim_2",
    hue="kmeans_label", palette=pal, data=moral_df,
    s=250, alpha=0.7, legend=False
)

for line in range(moral_df.sjhape[0]):
    pl.text(
        moral_df.x_dim_1[line], moral_df.x_dim_2[line],
        str(moral_df.kmeans_label[line]),
        horizontalalignment='left', size='medium',
        color='black', weight='semibold'
    )

plt.suptitle(f"Spectral Embedding Scatter Plot by {title}", fontsize=36)
plt.xlable("Spectral Dim 1", font_size=24)
plt.ylable("Spectral Dim 2", font_size=24)

for label in label_values:

    points = moral_df[moral_df.kmeans == label][['x_dim_1', 'x_dim_2']].drop_duplicates()
    if len(points) <3:
        continue
    hull = ConvexHull(points)

    x_hull = np.append(points[hull.vertivces, 0], points[hull.vertices, 0][0])
    y_hull = np.append(points[hull.vertivces, 1], points[hull.vertices, 1][0])

    #interplotate
    dist = np.sqrt((x_hull[:-1] - x_hull[1:]) ** 2 + (y_hull[:-1] - y_hull[1:]) ** 2)
    dist_along = np.concatenate(([0], dist.cumsum()))
    spline, _ = interpolate.splprep([x_hull, y_hull], u=dist_along, s=0)
    interp_d = np.linspace(dist_along[0], dist_along[-1], 50)
    interp_x, interp_y = interpolate.splev(interp_d, spline)

    plt.fill(interp_x, interp_y, '--', c=pal[label_to_color_index[label]], alpha=0.2)

    plt.grid()
    plt.tight_layout()
    plt.show
 




## Question 5
[20 points] Repeat Q2, this time using T-SNE for dimensionality reduction. Please answer the
following:
1. [10 points] What is T-SNE? Please do some online research and explain how T-SNE works
in around 200 words.
2. [10 points] Using the TSNE module of sklearn3, create the same set of five graphs (1
point each). Interpret the results 150 words (5 points). Specifically: Do you see any
patterns? Which column do you think creates the best clustering pattern? And: Compare
the results to PCA and Spectral Embedding. Are the results any better?

1. T-SNE (t-Distributed Stochastic Neighbor Embedding) is a non-linear dimensionality reduction technique that uses probabilistic pairwise similarity between data points in both high-dimensional and low-dimensional spaces to ensure that similar points remain close together in the embedding.        

    The process begins by calculating pairwise similarities between data points using a Gaussian distribution based on Euclidean distances. Points that are closer in high-dimensional space have higher similarity scores.

    Next, T-SNE randomly initializes a set of points in the lower-dimensional space (usually 2D or 3D) and computes pairwise similarities between them using a Student’s t-distribution. The t-distribution’s heavy tails allow for better separation of distant points, avoiding crowding in the lower dimensions.

    To align the high-dimensional similarities with the low-dimensional ones, the algorithm minimizes the Kullback-Leibler (KL) divergence, a measure of dissimilarity between the two distributions, using gradient descent.

    T-SNE is highly effective for visualizing complex data, often revealing clusters and patterns. However, it is computationally expensive and does not preserve global structure as well as local relationships, making it more suitable for data exploration rather than precise metric-based analysis.

In [59]:
from sklearn.manifold import TSNE
X_tsne = TSNE(n_components=2, 
                  learning_rate='auto',
                  init='random', 
                  perplexity=3).fit_transform(X)
print(f"Matrix X of dim {X.shape} -> Transformed Matrix of dim {X_tsne.shape}")


Matrix X of dim (1483, 120) -> Transformed Matrix of dim (1483, 2)


In [None]:
#TO-DO

## Question 6

Finally, create a correlation matrix4 that looks at the correlations between all of
the numerical and numerically coded variables in your dataset (excluding the string variables) and also the six new dimension reduction columns you created in Q2, Q3, and Q4. Please do
the following:
1. [5 points] Create the correlation matrix.
2. [5 points] Interpret the results in around 250 words. Specifically, answer the
following: Are there any variables that are strongly correlated? (i) Are there any
variables from the original dataset that are strongly correlated with the dimension
reduction variables? (ii) Do you think there are any variables in the original dataset
that are strongly represented by the dimension reduction variables? (iii)

In [80]:
numeric_cols = moral_df.select_dtypes(include=np.number)

#adding our embeddings
numeric_cols['spectral_c1'] = X_transformed.T[0]
numeric_cols['spectral_c2'] = X_transformed.T[1]

numeric_cols['tsne_c1'] = X_tsne.T[0]
numeric_cols['tsne_c2'] = X_tsne.T[1]

numeric_cols['pca_c1'] = new_matrix[0]
numeric_cols['pca_c2'] = new_matrix[1]

print(numeric_cols.columns[-6:])

Index(['spectral_c1', 'spectral_c2', 'tsne_c1', 'tsne_c2', 'pca_c1', 'pca_c2'], dtype='object')


In [104]:
corr_matrix = numeric_cols.corr()


corr_pairs = corr_matrix.unstack()
corr_pairs = corr_pairs[corr_pairs != 1]


# drop duplicate pairs 
corr_pairs = corr_pairs.sort_index()
corr_pairs = corr_pairs[corr_pairs.index.get_level_values(0) < corr_pairs.index.get_level_values(1)]


threshold = 0.5
significant_corrs = corr_pairs[abs(corr_pairs) > threshold]
significant_corrs = significant_corrs.sort_values(ascending=False, key=lambda x:np.abs(x))

# top N significant unique pairs
top_n = significant_corrs.head(15)  
print(top_n)

pedestrians_ahead_vs_passengers           pedestrians_on_other_lane_crossing   -0.963335
passengers_in_the_self_driving_car        pedestrians_vs_pedestrians           -0.960889
passengers_vs_pedestrians_on_other_lane   pedestrians_on_lane_crossing_ahead   -0.943989
cat_no_of_pedestrians_on_lane_ahead_died  spectral_c2                           0.790923
dog_no_of_pedestrians_on_lane_ahead_died  spectral_c2                           0.762263
spectral_c1                               tsne_c1                               0.728717
pedestrians_on_lane_ahead_should_die      spectral_c1                           0.682536
pedestrians_ahead_vs_passengers           pedestrians_vs_pedestrians           -0.653102
pedestrians_on_other_lane_crossing        pedestrians_vs_pedestrians            0.634352
passengers_in_the_self_driving_car        pedestrians_ahead_vs_passengers       0.625799
                                          pedestrians_on_other_lane_crossing   -0.613140
pedestrians_on_lane_a