Objective:

Kyle: I had an idea that I wanted to run past you. As you know DSI will move to the new CDIS building in about a year. There is some discussion about arranging the faculty from computer science, statistics, and biomedical informatics into interdisciplinary “pods”. They are trying to come up with some themes that would achieve this goal. It’s basically a clustering problem. I was thinking that it might be cool to use the vector store to aid in this.
Possible approaches:
make a few t-SNE / PCA plots restricted to papers by faculty in those departments [either on the web or exported as HTML with Plotly]
make a few t-SNE / PCA plots by faculty instead of by paper (using the mean or a barycenter of their papers) [either on the web or exported as HTML with Plotly]
try a few clustering approaches on either 1) papers or 2) faculty
attempt to use GPT to suggest categories similar to the taxonomy project for news feed.
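
The "by faculty instead of by paper" approach could be sketched as follows. This is a minimal illustration on toy data, not code from this notebook: the helper `mean_embedding_per_author` and the toy names and vectors are hypothetical. It uses the arithmetic mean, which is the barycenter under squared Euclidean distance; other barycenters would need a different reduction.

```python
import numpy as np
import pandas as pd


def mean_embedding_per_author(names, embeddings):
    """Average the paper embeddings of each author (hypothetical helper)."""
    emb = np.asarray(embeddings, dtype=float)
    frame = pd.DataFrame(emb)
    frame["name"] = list(names)
    # One row per author; groupby sorts the author names alphabetically
    return frame.groupby("name").mean()


# Toy data (made up): three papers, two authors, 2-dimensional embeddings
toy_names = ["Author A", "Author A", "Author B"]
toy_embeddings = [[0.0, 2.0], [2.0, 0.0], [1.0, 1.0]]

author_vectors = mean_embedding_per_author(toy_names, toy_embeddings)
print(author_vectors)
```

The resulting per-author matrix (`author_vectors.to_numpy()`) can be passed to PCA, t-SNE, or a clustering routine in the same way as the per-paper embeddings.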


Departments:
1. Statistics
1. CS
1. BMI
1. Information school

Steps:
1. Get all departments' faculty
1. Get all papers from faculty
1. Get all papers' embeddings
1. Cluster all papers
1. Define clusters and generate descriptions with a list of people involved (not yet implemented)
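
The last step is not implemented in this notebook. One possible starting point, sketched here on toy data, is to group the papers by cluster label and collect the people and a few example titles per cluster. The helper `summarize_clusters` is hypothetical, and the sketch assumes a dataframe with `cluster`, `name`, and `title` columns like the one built later in this notebook.

```python
import pandas as pd


def summarize_clusters(df, n_titles=3):
    """Collect people and example titles per cluster (hypothetical helper)."""
    summary = {}
    for cluster_id, group in df.groupby("cluster"):
        summary[int(cluster_id)] = {
            "people": sorted(group["name"].unique()),
            "n_papers": len(group),
            "example_titles": group["title"].head(n_titles).tolist(),
        }
    return summary


# Toy data (made up) standing in for the per-paper dataframe
toy = pd.DataFrame(
    {
        "cluster": [0, 0, 1],
        "name": ["Author B", "Author A", "Author A"],
        "title": ["Paper 1", "Paper 2", "Paper 3"],
    }
)

summary = summarize_clusters(toy)
for cluster_id, info in summary.items():
    print(cluster_id, info["people"], info["example_titles"])
```

The example titles could then be handed to a language model to draft a description for each cluster, as suggested in the approaches above.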

### Get all departments' faculty

In [None]:
import numpy as np
import pandas as pd
import altair as alt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

from embedding_search.academic_analytics import get_units, get_faculties, get_author

In [None]:
units = get_units()
for unit in units:
    print(f"{unit['unit']['id']}: {unit['unit']['name']}")

In [None]:
selected_units = {
    "28603": "Department of Computer Sciences",
    "28673": "Department of Statistics",
    "28591": "Department of Biostatistics and Medical Informatics",
    "28634": "Information School",
}

In [None]:
faculty_list = []
for unit_id, name in selected_units.items():
    print(f"{unit_id}: {name}")
    unit_faculties = get_faculties(unit_id)

    # Keep faculty members only
    sel_faculties = [f for f in unit_faculties if not f["isNonFaculty"]]
    for f in sel_faculties:
        f["unit"] = name

    faculty_list.extend(sel_faculties)

In [None]:
# Re-download all authors in selected departments/units (expensive, run once)

# from crawl import download_all_authors_in_unit
# for unit_id, name in selected_units.items():
#     print(f"Downloading unit in {name}")
#     download_all_authors_in_unit(unit=unit_id, overwrite=True)

In [None]:
# Load downloaded data to memory
authors = []
for x in faculty_list:
    try:
        authors.append(get_author(x["id"]))
    except FileNotFoundError:
        print(f"Author {x['id']} not found")

### Collect all papers' embeddings

In [None]:
names = []
embeddings = []
article_titles = []
article_doi = []

# Collect useful information
for a in authors:
    embeddings.extend(a.articles_embeddings)
    for article in a.articles:
        article_titles.append(article.title)
        article_doi.append(article.doi)
        names.append(a.first_name + " " + a.last_name)
embeddings = np.array(embeddings)

In [None]:
print(f"Number of articles: {len(article_titles)}")
print(f"Number of author labels (one per article): {len(names)}")
print(f"Number of embeddings: {len(embeddings)}")
print(f"Number of dois: {len(article_doi)}")

### Cluster all papers (on the full embedding vectors) and make 2D projections

In [None]:
# Try some basic 2d projections
projection_pca = PCA(n_components=2).fit_transform(embeddings)
projection_2d = TSNE(n_components=2, random_state=0).fit_transform(embeddings)

In [None]:
# Pack into dataframe
df_tsne = pd.DataFrame(projection_2d, columns=["x_tsne", "y_tsne"])
df_pca = pd.DataFrame(projection_pca, columns=["x_pca", "y_pca"])
df = pd.concat([df_tsne, df_pca], axis=1)
df["title"] = article_titles
df["doi"] = article_doi
df["name"] = names

In [None]:
# Apply k-means clustering

kmeans = KMeans(n_clusters=7, random_state=0).fit(embeddings)
df["cluster"] = kmeans.labels_

### Check how the k-means clusters match the 2D projections

In [None]:
alt.data_transformers.disable_max_rows()
pca_plot = (
    alt.Chart(df)
    .mark_circle()
    .encode(x="x_pca", y="y_pca", color="cluster:N", tooltip=["title", "doi", "name"])
    .interactive()
)

tsne_plot = pca_plot.encode(
    x="x_tsne",
    y="y_tsne",
)

pca_plot | tsne_plot

The t-SNE projection seems to match the k-means clusters better, so export that plot.
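
The comparison above is visual. A rough numerical check, sketched here on synthetic data rather than the real embeddings, is to compute the silhouette score of the k-means labels in each 2D projection. The blob data and all variable names below are made up for illustration. Because t-SNE does not preserve distances, the score in t-SNE space is only an indication of how cleanly the clusters separate in the plot, not a measure of clustering quality.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for paper embeddings: three separated blobs in 10 dimensions
centers = rng.normal(scale=10.0, size=(3, 10))
toy_embeddings = np.vstack([c + rng.normal(size=(40, 10)) for c in centers])

# Cluster on the full vectors, then project to 2D two different ways
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(toy_embeddings)
pca_2d = PCA(n_components=2).fit_transform(toy_embeddings)
tsne_2d = TSNE(n_components=2, random_state=0).fit_transform(toy_embeddings)

# Higher silhouette means the labels separate more cleanly in that projection
scores = {
    "pca": silhouette_score(pca_2d, labels),
    "tsne": silhouette_score(tsne_2d, labels),
}
print(scores)
```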

In [None]:
tsne_plot.properties(
    title="k-means clustering with t-SNE projection summarizing all published works",
    width=800,
    height=800,
).save("explore_cdis.html")