### PCA of the MET Exhibitions

- This notebook applies principal component analysis (PCA) to 718 MET exhibitions held from 2015 to 2024. The exhibition data come from the Metropolitan Museum of Art's official website: https://www.metmuseum.org/exhibitions/past
- Each exhibition description is TF-IDF vectorized and reduced to two dimensions with PCA; the exhibitions are then grouped into 10 clusters with k-means, and each cluster is color-coded. Some clusters are easier to interpret than others (e.g. cluster 4 is about "sports"; cluster 7 is about "Christmas"), while the themes of others are harder to identify. The number of clusters still needs tuning to find the optimal clustering.
- A zoom-in function lets the user take a closer look at a single cluster and the exhibitions within it. In the zoom-in view, color encodes the year of an exhibition (this may be confusing, since color encodes the cluster in the overview plot).
- Possible improvements: outliers could be removed so that the PCA plot is more spread out, and descriptive category names could replace the cluster numbers.

In [None]:
# Import packages

import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

In [2]:
# Load the exhibition data
df = pd.read_csv("metexhibitions_2015-2024_pos.csv")

In [None]:
# Drop rows with a missing description or title, then convert the columns to lists

df = df.dropna(subset=["Description_filtered", "Title"])
documents = df["Description_filtered"].tolist()
titles = df["Title"].tolist()
years = df["Year"].astype(str).tolist()

In [None]:
# TF-IDF vectorize the filtered descriptions

vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_matrix = vectorizer.fit_transform(documents)

In [5]:
# Convert the sparse TF-IDF matrix to a dense array before PCA
tfidf_dense = tfidf_matrix.toarray()

In [None]:
# Fit the PCA model

pca = PCA(n_components=2)
pca_result = pca.fit_transform(tfidf_dense)
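
Before reading too much into the 2-D plot, it can help to check how much of the variance the two components actually retain; for sparse TF-IDF features this share may be small. The sketch below uses a random matrix as a stand-in for `tfidf_dense`, and the names `demo_matrix`, `demo_pca`, `demo_coords`, `ratios`, and `retained` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for tfidf_dense (assumption: 50 documents, 20 features)
rng = np.random.default_rng(0)
demo_matrix = rng.random((50, 20))

# Fit a two-component PCA, as in the cell above
demo_pca = PCA(n_components=2)
demo_coords = demo_pca.fit_transform(demo_matrix)

# Share of the total variance captured by each component, and by both together
ratios = demo_pca.explained_variance_ratio_
retained = float(ratios.sum())
print(f"Variance retained by 2 components: {retained:.1%}")
```

In the notebook itself, the same attribute is available as `pca.explained_variance_ratio_` once `pca.fit_transform(tfidf_dense)` has run.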

In [None]:
# Set the number of clusters and run k-means on the 2-D PCA coordinates

num_clusters = 10
kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(pca_result)
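
The introduction notes that the number of clusters still needs tuning. One common heuristic, sketched here, is to compare the mean silhouette score across several candidate values of k and prefer the highest. The helper `silhouette_by_k` and the synthetic `demo_points` are hypothetical stand-ins; in the notebook, `pca_result` would take the place of `demo_points`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(points, k_values, random_state=42):
    """Return {k: mean silhouette score} for k-means fitted on `points`."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, random_state=random_state, n_init=10)
        labels = model.fit_predict(points)
        scores[k] = silhouette_score(points, labels)
    return scores


# Synthetic stand-in for pca_result: three tight, well-separated 2-D blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
demo_points = np.vstack([c + rng.normal(scale=0.1, size=(30, 2)) for c in centers])

scores = silhouette_by_k(demo_points, range(2, 7))
best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette score:", best_k)
```

The silhouette score is only one criterion; whether the resulting clusters are interpretable as exhibition themes still has to be judged by inspection.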

In [None]:
# Create a PCA data frame

pca_df = pd.DataFrame({
    "PCA1": pca_result[:, 0],
    "PCA2": pca_result[:, 1],
    "Cluster": cluster_labels,
    "Year": years,
    "Title": titles
})
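
The introduction suggests removing outliers so that the plot is more spread out. A minimal sketch of one possible filter follows, assuming an interquartile-range rule applied to each PCA axis; the helper `drop_pca_outliers`, its threshold `k`, and the toy `demo_df` are assumptions for illustration, not the notebook's method.

```python
import pandas as pd


def drop_pca_outliers(pca_df, cols=("PCA1", "PCA2"), k=3.0):
    """Keep rows whose coordinates lie within k * IQR of the quartiles on every axis."""
    coords = pca_df[list(cols)]
    q1 = coords.quantile(0.25)
    q3 = coords.quantile(0.75)
    iqr = q3 - q1
    within = ((coords >= q1 - k * iqr) & (coords <= q3 + k * iqr)).all(axis=1)
    return pca_df[within].reset_index(drop=True)


# Toy data frame with the same column names as pca_df and one far-away point
demo_df = pd.DataFrame({
    "PCA1": [0.0, 0.1, -0.1, 0.05, -0.05, 5.0],
    "PCA2": [0.0, 0.1, -0.1, -0.05, 0.05, 0.0],
    "Title": ["A", "B", "C", "D", "E", "Outlier"],
})

trimmed = drop_pca_outliers(demo_df)
print(trimmed)
```

Filtering only changes what is plotted; the removed exhibitions are still part of the fitted PCA and k-means models unless those are refitted on the trimmed data.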

In [None]:
# Plot the scatter plot

fig = px.scatter(
    pca_df,
    x="PCA1",
    y="PCA2",
    color=pca_df["Cluster"].astype(str),  # Cast to str so clusters get discrete colors
    hover_data={"Title": True, "Year": True},
    title="PCA of the MET Exhibitions (Color-Coded by Cluster)",
    labels={"PCA1": "PCA Component 1", "PCA2": "PCA Component 2", "color": "Cluster"}
)
fig.show()

In [None]:
# Define the Zoomed-In View Function

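def zoom_to_cluster(pca_df, cluster_number):
    """Plot a single cluster, colored by year, with axes padded around its points."""
    cluster_data = pca_df[pca_df["Cluster"] == cluster_number]
    if cluster_data.empty:
        raise ValueError(f"No exhibitions found in cluster {cluster_number}")
    x_min, x_max = cluster_data["PCA1"].min(), cluster_data["PCA1"].max()
    y_min, y_max = cluster_data["PCA2"].min(), cluster_data["PCA2"].max()

    # Pad each axis range by 20% of the cluster's extent on both sides
    x_padding = (x_max - x_min) * 0.2
    y_padding = (y_max - y_min) * 0.2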

    # Create the zoomed-in plot with larger canvas size and adjusted axis ranges
    fig = px.scatter(
        cluster_data,
        x="PCA1",
        y="PCA2",
        color="Year",  # Color by Year for the zoomed-in view
        hover_data={"Title": True, "Year": True},  # Include Year in hover
        title=f"Zoomed-in View of Cluster {cluster_number} (Color-Coded by Year)",
        labels={"PCA1": "PCA Component 1", "PCA2": "PCA Component 2"}
    )
    
    fig.update_layout(
        width=1000,  # Set canvas width
        height=800,  # Set canvas height
        margin=dict(l=50, r=50, t=50, b=50)  # Adjust margins for better spacing
    )
    
    fig.update_xaxes(range=[x_min - x_padding, x_max + x_padding])
    fig.update_yaxes(range=[y_min - y_padding, y_max + y_padding])
    
    fig.show()

In [None]:
# Call the Zoomed-In-View Function
zoom_to_cluster(pca_df, cluster_number=1)