Objective:

Kyle: I had an idea that I wanted to run past you. As you know DSI will move to the new CDIS building in about a year. There is some discussion about arranging the faculty from computer science, statistics, and biomedical informatics into interdisciplinary “pods”. They are trying to come up with some themes that would achieve this goal. It’s basically a clustering problem. I was thinking that it might be cool to use the vector store to aid in this.
Possible approaches:
- Make a few t-SNE / PCA plots restricted to papers by faculty in those departments [either on the web or exported as HTML with Plotly]
- Make a few t-SNE / PCA plots by faculty instead of by paper (using the mean or a barycenter of their papers) [either on the web or exported as HTML with Plotly]
- Try a few clustering approaches on either 1) papers or 2) faculty
- Attempt to use GPT to suggest categories, similar to the taxonomy project for the news feed
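
The faculty-level variant above needs one vector per person. A minimal sketch of the mean (barycenter) option follows, assuming row `i` of the embedding matrix belongs to the author `names[i]`. The function name `faculty_barycenters` and the toy authors are hypothetical and not part of the project code.

```python
import numpy as np
import pandas as pd


def faculty_barycenters(embeddings: np.ndarray, names: list[str]) -> pd.DataFrame:
    """Average each faculty member's paper embeddings into one vector.

    Row i of `embeddings` is assumed to belong to the author `names[i]`.
    """
    frame = pd.DataFrame(np.asarray(embeddings))
    frame["name"] = list(names)
    # One row per faculty member, one column per embedding dimension
    return frame.groupby("name").mean()


# Toy data: three papers by two hypothetical authors, 2-d embeddings
toy_embeddings = np.array([[0.0, 2.0], [2.0, 0.0], [4.0, 4.0]])
toy_names = ["A. Author", "A. Author", "B. Author"]
centers = faculty_barycenters(toy_embeddings, toy_names)
print(centers)
```

The resulting matrix could then be passed to PCA, t-SNE, or k-means in place of the per-paper embeddings.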

Kyle: I will send an email about the clustering project sometime soon. I’m curious if you had any plans for next steps beyond what you sent above. I think we talked about using ChatGPT to name the clusters.
If you could attach some simple stats to the clusters it might help. E.g. Number of unique faculty, number of departments represented. Maybe a small table with faculty assigned to each cluster.
If we wanted to try to find clusters that had some nicer properties (like mix of departments), I’m not sure the best way to go about it. We could make more clusters first and then try to merge nearby ones to maximize some metric (like number of departments with a min/max number of people in the cluster).
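
One possible way to sketch the "make more clusters, then merge nearby ones" idea is a greedy pass: while some cluster spans fewer than a minimum number of departments, fold the least diverse cluster into the cluster with the nearest centroid. This is only an illustration under assumptions. The function `merge_until_mixed`, the `min_departments` threshold, and the Euclidean centroid distance are all hypothetical choices, not something already decided for the project.

```python
import numpy as np


def merge_until_mixed(
    embeddings: np.ndarray,
    clusters: np.ndarray,
    departments: np.ndarray,
    min_departments: int = 2,
) -> np.ndarray:
    """Greedily merge clusters until each spans at least `min_departments`."""
    clusters = np.asarray(clusters).copy()
    departments = np.asarray(departments)
    while True:
        ids = np.unique(clusters)
        n_depts = {c: len(np.unique(departments[clusters == c])) for c in ids}
        lacking = [c for c in ids if n_depts[c] < min_departments]
        if not lacking or len(ids) == 1:
            return clusters
        # Pick the least diverse cluster and merge it into its nearest neighbour
        source = min(lacking, key=lambda c: n_depts[c])
        centroids = {c: embeddings[clusters == c].mean(axis=0) for c in ids}
        target = min(
            (c for c in ids if c != source),
            key=lambda c: np.linalg.norm(centroids[c] - centroids[source]),
        )
        clusters[clusters == source] = target


# Toy data: cluster 1 contains only CS papers and sits closer to cluster 0
toy_embeddings = np.array(
    [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0], [10.0, 10.0], [10.0, 11.0]]
)
toy_clusters = np.array([0, 0, 1, 1, 2, 2])
toy_departments = np.array(["CS", "Statistics", "CS", "CS", "BMI", "Statistics"])
merged = merge_until_mixed(toy_embeddings, toy_clusters, toy_departments)
print(merged)
```

A different objective, such as a minimum and maximum head count per department, would only change how `lacking` and `source` are chosen.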


Departments:
1. Statistics
1. CS
1. BMI
1. Information School

Steps:
1. Get all departments' faculty
1. Get all papers from faculty
1. Get all papers' embeddings
1. Cluster all papers
1. Define clusters and generate descriptions with a list of the people involved
1. Maybe group multiple clusters together

### Get all departments' faculty

In [None]:
import pickle
import numpy as np
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

from embedding_search.academic_analytics import get_units, get_faculties
from embedding_search.vector_store import get_author

from openai import OpenAI

In [None]:
def ask_openai(messages: list[dict]) -> str:
    """Send chat messages to GPT and return the text of its reply.

    Example input: [{"role": "user", "content": "Hello world example in python."}]
    """

    client = OpenAI()
    chat_completion = client.chat.completions.create(
        messages=messages,
        model="gpt-4-1106-preview",
    )
    return chat_completion.choices[0].message.content


alt.data_transformers.disable_max_rows()

df = pd.read_parquet("data/cdis_clustering.parquet")
with open("data/cdis_embeddings.pkl", "rb") as f:
    embeddings = pickle.load(f)

In [None]:
# Run once

# units = get_units()
# for unit in units:
#     print(f"{unit['unit']['id']}: {unit['unit']['name']}")

In [None]:
# selected_units = {
#     "28603": "Department of Computer Sciences",
#     "28673": "Department of Statistics",
#     "28591": "Department of Biostatistics and Medical Informatics",
#     "28634": "Information School",
# }

In [None]:
# Get updated faculty from each selected unit

# faculty_list = []
# for unit_id, name in selected_units.items():
#     print(f"{unit_id}: {name}")
#     unit_faculties = get_faculties(unit_id)

#     sel_faculties = [f for f in unit_faculties if f["isNonFaculty"] == False]
#     for f in sel_faculties:
#         f["unit"] = name

#     faculty_list.extend(sel_faculties)

In [None]:
# def get_unit_from_faculty_profile(id: int, faculty_list: list[dict]) -> str:
#     """Lookup unit from faculty profile."""
#     for faculty in faculty_list:
#         if faculty["id"] == id:
#             return faculty["unit"]
#     return None

In [None]:
# Re-download all authors in selected departments/units (expensive, run once)

# from crawl import download_all_authors_in_unit
# for unit_id, name in selected_units.items():
#     print(f"Downloading unit in {name}")
#     download_all_authors_in_unit(unit=unit_id, overwrite=True)

In [None]:
# Load downloaded data to memory
# authors = []
# for x in faculty_list:
#     try:
#         authors.append(get_author(x["id"]))
#     except FileNotFoundError:
#         print(f"Author {x['id']} not found")

### Collect all papers' embeddings

In [None]:
# units = []
# names = []
# embeddings = []
# article_titles = []
# article_doi = []

# # Collect useful information
# for a in authors:
#     embeddings.extend(a.articles_embeddings)
#     for article in a.articles:
#         article_titles.append(article.title)
#         article_doi.append(article.doi)
#         names.append(a.first_name + " " + a.last_name)

#         # Get department/unit name
#         try:
#             # Prioritizing primary affiliation
#             unit = selected_units[str(a.unit_id)]
#         except KeyError:
#             # If faculty's primary affiliation is not in selected units
#             # use the unit's faculty list to determine affiliation
#             unit = get_unit_from_faculty_profile(a.id, faculty_list)
#         units.append(unit)
# embeddings = np.array(embeddings)

In [None]:
# import pickle
# with open("data/cdis_embeddings.pkl", "wb") as f:
#     pickle.dump(embeddings, f)

In [None]:
# print(f"Number of units: {len(units)}")
# print(f"Number of authors: {len(names)}")
# print(f"Number of articles: {len(article_titles)}")
# print(f"Number of dois: {len(article_doi)}")
# print(f"Number of embeddings: {len(embeddings)}")

### Cluster all papers (on full vector) and make 2d projections

In [None]:
# Try some basic 2d projections
# projection_pca = PCA(n_components=2).fit_transform(embeddings)
# projection_tsne = TSNE(n_components=2, random_state=0).fit_transform(embeddings)

In [None]:
# # Pack into dataframe
# df_tsne = pd.DataFrame(projection_tsne, columns=["x_tsne", "y_tsne"])
# df_pca = pd.DataFrame(projection_pca, columns=["x_pca", "y_pca"])
# df = pd.concat([df_tsne, df_pca], axis=1)
# df["title"] = article_titles
# df["doi"] = article_doi
# df["name"] = names
# df["department"] = units

In [None]:
# # Save df for later use
# df.to_parquet("data/cdis_clustering.parquet")

Perhaps we can use the elbow method to determine the number of clusters.

In [None]:
sse = []
for k in range(2, 20):
    kmeans = KMeans(n_clusters=k, random_state=0).fit(embeddings)
    sse.append(kmeans.inertia_)

In [None]:
# Plot the results
plt.plot(range(2, 20), sse)
plt.title("Elbow Method")
plt.xlabel("Number of Clusters")
plt.ylabel("Sum of Squared Distances")
plt.show()

It doesn't seem to have an obvious inflection point...
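
When the elbow is ambiguous, one alternative that could be tried is the silhouette score, which is a different criterion from the sum of squared distances plotted above. The sketch below uses scikit-learn's `silhouette_score` on toy data. The helper `silhouette_by_k` is a hypothetical name, and the score may not show a clear optimum on the real embeddings either.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(embeddings, k_values, random_state=0):
    """Return {k: mean silhouette score} for k-means with each k."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(embeddings)
        scores[k] = silhouette_score(embeddings, labels)
    return scores


# Toy data: three well-separated blobs instead of the real embeddings
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in blob_centers])

scores = silhouette_by_k(toy, range(2, 6))
best_k = max(scores, key=scores.get)  # k with the highest mean silhouette
print(best_k, scores)
```

Higher is better here, so the candidate k is the one that maximizes the score rather than the location of a bend in the curve.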

Make an experiment class to help with experimenting.

In [None]:
class ClusterExplore:
    """Clustering experiment manager."""

    def __init__(
        self,
        embeddings: np.ndarray,
        df: pd.DataFrame,
        n_clusters: int,
        label: bool = False,
    ) -> None:
        self.embeddings = embeddings
        self.df = df.copy()
        self.n_clusters = n_clusters

        # Run clustering
        if "cluster" not in self.df.columns:
            self._cluster(n_clusters)

        # Get cluster label from GPT
        if label:
            self.get_clusters_label()

    def get_clusters_label(self) -> None:
        """Get label for each cluster."""

        for cluster_id in range(self.n_clusters):
            self._label_cluster(cluster_id)

    def plot(self) -> alt.Chart:
        return (self._plot_clusters() & self._plot_faculty()).resolve_scale(
            color="independent"
        )

    # def cluster_statistics(self) -> pd.DataFrame:
    #     count_unique_faculty = (
    #         self.df.groupby(["cluster", "department"])
    #         .agg(n_faculty=("name", "nunique"))
    #         .reset_index()
    #     )
    #     return count_unique_faculty.pivot_table(
    #         index="cluster", columns="department", values="n_faculty", fill_value=0
    #     )

    def cluster_faculty(self, cluster: int) -> pd.DataFrame:
        return (
            self.df.query(f"cluster == {cluster}")
            .groupby(["department", "name"])
            .agg(n_articles=("title", "count"))
            .reset_index()
        )

    # Private methods
    def _cluster(self, n: int) -> None:
        """K-means clustering."""
        kmeans = KMeans(n_clusters=n, random_state=0).fit(self.embeddings)
        self.df["cluster"] = kmeans.labels_

    def _label_cluster(self, cluster_id: int) -> None:
        """Use GPT to label a cluster."""
        try:
            prompt = f"Try to give a topic name that describes all of the publications below: \n\n {self._get_cluster_pub_titles(cluster_id)}"
            gpt_label = ask_openai([{"role": "user", "content": prompt}])
            self.df.loc[self.df.cluster == cluster_id, "label"] = gpt_label
        except TypeError:
            pass

    def _get_cluster_pub_titles(self, cluster_id: int) -> str:
        """Get a list of publications in a cluster."""

        titles = self.df.query(f"cluster == {cluster_id}").title.to_list()
        return "\n\n ".join([t for t in titles if t is not None])

    def _plot_clusters(self) -> alt.Chart:
        self.cluster_select = alt.selection_point(fields=["cluster"])

        tooltip = [
            "title",
            "cluster",
            "doi",
            "name",
            "department",
        ]

        if "label" in self.df.columns:
            tooltip.append("label")

        chart = (
            alt.Chart(self.df)
            .mark_circle()
            .encode(
                x="x_tsne",
                y="y_tsne",
                color="cluster:N",
                opacity=alt.condition(
                    self.cluster_select, alt.value(1), alt.value(0.2)
                ),
                tooltip=tooltip,
            )
            .add_params(self.cluster_select)
        )

        return (
            (chart)
            .properties(
                width=1000,
                height=600,
            )
            .interactive()
        )

    def _plot_faculty(self) -> alt.Chart:
        return (
            alt.Chart(self.df)
            .mark_bar()
            .encode(
                x=alt.X("name:N", sort="-y"),
                y="count()",
                color="department:N",
                tooltip=["department", "name", "count()"],
            )
            .transform_filter(self.cluster_select)
            .properties(
                width=1000,
                height=300,
                title="Number of publications in selected cluster by faculty",
            )
        )

In [None]:
experiment = ClusterExplore(embeddings, df, 18)
experiment.get_clusters_label()
experiment.plot()

In [None]:
# Faculty distribution in each cluster (by department)

count_unique_faculty = (
    experiment.df.groupby(["cluster", "label", "department"])
    .agg(n_faculty=("name", "nunique"))
    .reset_index()
)
count_unique_faculty.pivot_table(
    index=["cluster", "label"], columns="department", values="n_faculty", fill_value=0
)

In [None]:
# Faculty names in each cluster
for i in range(experiment.n_clusters):
    print({i: experiment.cluster_faculty(i).name.tolist()})
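
The simple per-cluster stats mentioned at the top (number of unique faculty, number of departments represented, and a small table of the faculty assigned to each cluster) could be collected in one table. This is a sketch that assumes a dataframe with the same `cluster`, `title`, `name`, and `department` columns as `experiment.df`. The function `cluster_summary` and the toy rows are hypothetical.

```python
import pandas as pd


def cluster_summary(df: pd.DataFrame) -> pd.DataFrame:
    """One row per cluster with simple size and diversity statistics."""
    return (
        df.groupby("cluster")
        .agg(
            n_articles=("title", "count"),  # counts non-null titles only
            n_faculty=("name", "nunique"),
            n_departments=("department", "nunique"),
            faculty=("name", lambda s: ", ".join(sorted(s.unique()))),
        )
        .reset_index()
    )


# Toy rows standing in for experiment.df
toy_df = pd.DataFrame(
    {
        "cluster": [0, 0, 0, 1],
        "title": ["Paper a", "Paper b", "Paper c", "Paper d"],
        "name": ["A. Author", "B. Author", "A. Author", "C. Author"],
        "department": ["CS", "Statistics", "CS", "BMI"],
    }
)
summary = cluster_summary(toy_df)
print(summary)
```

With the real data, the call would be `cluster_summary(experiment.df)`.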