# CDIS pod topic exploration

This notebook explores the research interests of all faculty members in the School of Computer, Data & Information Sciences (CDIS).

Data source: [Academic Analytics API](https://wisc.discovery.academicanalytics.com/)

Authors: [Jason Lo](https://datascience.wisc.edu/staff/lo-jason/) and [Kyle Cranmer](https://datascience.wisc.edu/staff/cranmer-kyle/)

Version: 3

Date: 2024-01-30

Objectives:

1. Explore the research interests of all faculty members in CDIS.
2. Potentially identify research topics that are of interest to multiple faculty members across all departments in CDIS.

Departments in CDIS:

- Department of Computer Sciences
- Department of Statistics
- Department of Biostatistics and Medical Informatics
- Information School

Procedure:

1. Retrieve CDIS faculty data via the Academic Analytics API.
2. Fetch research outputs for each faculty member from the same API.
3. Use OpenAI embeddings to convert research output titles into vectors.
4. Apply k-means clustering to these vectors.
5. Use GPT to name each cluster based on its publication titles.
6. Create visual representations of the results.

For simplicity, the implementation details of steps 1-5 are omitted from this notebook. If you are interested in them, please refer to the [source code](https://github.com/UW-Madison-DSI/faculty-search/blob/122ecca93a9a65414645a84d04d29f2419c9e711/notebooks/proto_cdis_cluster.ipynb).
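
To give a feel for step 4 without opening the source code, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It is not the notebook's actual implementation: the function `kmeans`, the toy 2-D vectors, and the "first k points" initialization are all assumptions made for illustration, and real title embeddings have far more dimensions.

```python
import math


def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm. Initial centroids are simply the first k points
    (a simplification; library implementations use smarter seeding)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical 2-D stand-ins for publication-title embeddings.
toy_embeddings = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
labels, centroids = kmeans(toy_embeddings, k=2)
print(labels)
```

Each publication ends up with an integer cluster id, which is the role the `cluster` column plays in the preprocessed data loaded below.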

In [None]:
import pickle
import pandas as pd
from embedding_search.experimental.cdis import ClusterExplore

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

In [None]:
# Load data from preprocessed files.
df = pd.read_parquet("data/cdis_clustering.parquet")
with open("data/cdis_embeddings.pkl", "rb") as f:
    embeddings = pickle.load(f)

In [None]:
# Convenience function: look up a cluster's label from its numeric ID
def get_cluster_name(cluster: int) -> str:
    return df.query(f"cluster == {cluster}")["label"].iloc[0]

### Visualizing research topics

In [None]:
experiment = ClusterExplore(embeddings, df, n_clusters=18)
experiment.plot()

Note: Clicking a point in the plot selects its cluster and updates the information in the bottom panel.

### Faculty distribution in each cluster (by department)

In [None]:
count_unique_faculty = (
    experiment.df.groupby(["cluster", "label", "department"])
    .agg(n_faculty=("name", "nunique"))
    .reset_index()
)
count_unique_faculty.pivot_table(
    index=["cluster", "label"], columns="department", values="n_faculty", fill_value=0
)
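
The table above counts distinct faculty members, not publications. The sketch below mirrors that logic in plain Python on made-up rows, assuming the data holds one row per publication (the notebook does not state this explicitly). The names and rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (cluster, department, faculty name), one row per publication.
rows = [
    (0, "Statistics", "A. Example"),
    (0, "Statistics", "A. Example"),  # second paper by the same person
    (0, "Statistics", "B. Example"),
    (0, "Computer Sciences", "C. Example"),
    (1, "Computer Sciences", "C. Example"),  # same person, different cluster
]

# Collect the set of distinct names per (cluster, department) group.
faculty_by_group = defaultdict(set)
for cluster, department, name in rows:
    faculty_by_group[(cluster, department)].add(name)

n_faculty = {group: len(names) for group, names in faculty_by_group.items()}
print(n_faculty)
```

Two consequences follow: a prolific author counts only once per cluster, and a faculty member who publishes in several clusters is counted in each of them, so a department's column can sum to more than its head count.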

### Faculty publication count in each cluster (by department)

In [None]:
count_unique_publications = (
    experiment.df.groupby(["cluster", "label", "department", "name"])
    .agg(n_publications=("title", "nunique"))
    .reset_index()
)

In [None]:
publication_count = count_unique_publications.pivot_table(
    index=["department", "name"],
    columns="cluster",
    values="n_publications",
    fill_value=0,
)

publication_count

### Normalized faculty publication count (by row) in each cluster (by department)

In [None]:
normalized_publication_count = publication_count.apply(
    lambda x: x / x.sum(), axis=1
).round(2)

normalized_publication_count
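
Row normalization turns each faculty member's counts into shares that sum to roughly 1, so people with very different output volumes become comparable. A small worked example in plain Python, using hypothetical names and counts:

```python
# Hypothetical publication counts per cluster for two faculty members.
counts = {
    "A. Example": {0: 2, 1: 6, 2: 0},
    "B. Example": {0: 20, 1: 60, 2: 0},
}

# Divide each count by the row total, then round to two decimals.
shares = {
    name: {cluster: round(n / sum(row.values()), 2) for cluster, n in row.items()}
    for name, row in counts.items()
}
print(shares)
```

Both rows come out as 0.25, 0.75 and 0.0, even though the second person has ten times as many publications. The normalized table therefore describes where someone's work is concentrated, not how much of it there is.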

### Group authors by their dominant cluster (the one with the highest normalized publication share)

In [None]:
# Author to cluster mapping
author_cluster_map = normalized_publication_count.idxmax(axis=1).to_dict()

# Inverting the mapping
cluster_author_map = {}
for author, cluster in author_cluster_map.items():
    cluster_author_map.setdefault(cluster, []).append(author)

clusters_with_someone = sorted(cluster_author_map.keys())

for i in clusters_with_someone:
    print(f"cluster {i}: {get_cluster_name(i)}")
    print(f"Faculty: {cluster_author_map[i]}")
    print()
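
One subtlety when reading this grouping: the assignment takes the maximum of shares that were already rounded to two decimals, and rounding can turn a narrow lead into a tie. The plain-Python sketch below uses hypothetical counts to show the effect. Python's `max`, like pandas `idxmax`, returns the first entry when several are tied.

```python
# Hypothetical counts for one faculty member: cluster 7 is slightly ahead.
counts = {3: 199, 7: 201}
total = sum(counts.values())

exact = {cluster: n / total for cluster, n in counts.items()}
rounded = {cluster: round(share, 2) for cluster, share in exact.items()}

# max() returns the first key on ties.
exact_pick = max(exact, key=exact.get)
rounded_pick = max(rounded, key=rounded.get)
print(exact_pick, rounded_pick)
```

With unrounded shares the faculty member lands in cluster 7. After rounding, both shares become 0.5 and the first cluster (3) wins. Cases this close are likely rare, but they are worth keeping in mind for faculty whose work is split almost evenly between two topics.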