### HCP Publications Dataset
- **Description:** Human Connectome Project (HCP) publications and citation networks.
- **Data Source:** [aggregate_titles_embeddings_umap_2d_with_info.parquet](https://www.dropbox.com/scl/fi/uj14y2hre4he2iafpativ/aggregate_titles_embeddings_umap_2d_with_info.parquet?rlkey=tjey12v6cru3iq88xitytefsr&dl=1)
  - **Potential columns for visualization:**
    - **X & Y Coordinates:** `x`, `y`
    - **Point Size:** `n_cits` (citation count)
    - **Color:** `main_field` (research domain)
    - **Label:** `title`
  - **Related code file:** [hcp.py](https://github.com/thorwhalen/imbed_data_prep/blob/main/imbed_data_prep/hcp.py)

## Get data

### Data parameters

In [1]:
ext = '.parquet'
src = 'https://www.dropbox.com/scl/fi/uj14y2hre4he2iafpativ/aggregate_titles_embeddings_umap_2d_with_info.parquet?rlkey=tjey12v6cru3iq88xitytefsr&dl=1'
target_filename = 'aggregate_titles_embeddings_umap_2d_with_info.parquet'

### Install and import

In [2]:
import os
if not os.getenv('IN_COSMO_DEV_ENV'):
    %pip install -q cosmograph tabled cosmodata

import tabled
import cosmodata

from functools import partial
from cosmograph import cosmo

### Load data

In [3]:
if ext:
    getter = partial(tabled.get_table, ext=ext)
else:
    getter = tabled.get_table
# acquire_data takes care of caching locally too, so next time access will be faster
# (If you want a fresh copy, you can delete the local cache file manually.)
data = cosmodata.acquire_data(src, target_filename, getter=getter)
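
The caching behaviour mentioned in the comments above can be pictured with a small stand-alone helper. This is only a sketch of the general download-once pattern, not the actual implementation of `cosmodata.acquire_data`; the names `cached_fetch` and `fetch_bytes` and the `cache_dir` argument are hypothetical and introduced purely for illustration.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url):
    """Download the raw bytes at `url` (hypothetical default fetcher)."""
    with urlopen(url) as response:
        return response.read()


def cached_fetch(url, filename, cache_dir, fetch=fetch_bytes):
    """Return the path to a local copy of `url`, downloading only if it is missing."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    local_path = cache_dir / filename
    if not local_path.exists():
        # First call: fetch the content and store it on disk.
        local_path.write_bytes(fetch(url))
    # Later calls: the file already exists, so nothing is downloaded.
    return local_path
```

With a helper of this shape, deleting the cached file is enough to force a fresh download on the next call, which matches the advice in the cell above.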

## Peek at the data

In [4]:
mode = 'short'  # one of 'short', 'sample', 'stats'
exclude_cols = []
cosmodata.print_dataframe_info(data, exclude_cols, mode=mode)

DataFrame shape: (340855, 9)
First row
------------------------------------------------------------
id                                                          245658
x                                                        10.981117
y                                                          7.02786
title            Standardized low-resolution brain electromagne...
source                                                      PubMed
pub_date                                                2002-01-01
n_cits                                                        1056
micro_cluster                                                  241
main_field                          Biomedical and health sciences


## Visualize data

### Scatter Plot of Publications

This scatter plot places each publication at its `x` and `y` coordinates. As written, the cell below sets only the coordinates; the remaining options are commented out. Uncommenting `point_size_by="n_cits"` sizes each point by its citation count, giving a quick read on impact, and uncommenting `point_color_by="main_field"` colors each point by research field so that the different fields of study are easy to tell apart.

In [None]:
cosmo(
    data,
    point_x_by="x",
    point_y_by="y",
    # point_size_by="n_cits",
    # point_color_by="main_field",
    # point_id_by="id",
    # point_label_by="title",
    # point_color_palette=["#FF6347", "#4682B4", "#3CB371"],
    # point_size_range=[5, 20],
    # show_labels=True,
    # render_links=False,
)
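
To give a feel for what a size range such as `[5, 20]` implies, the sketch below linearly rescales a list of citation counts into that range. It assumes a simple min-max mapping; how cosmograph actually maps values to sizes may differ, and `scale_sizes` is a hypothetical helper, not part of any library used here.

```python
def scale_sizes(values, size_range=(5, 20)):
    """Linearly rescale `values` so the minimum maps to size_range[0]
    and the maximum maps to size_range[1]."""
    lo, hi = min(values), max(values)
    small, large = size_range
    if hi == lo:
        # All values are equal: give every point the smallest size.
        return [float(small) for _ in values]
    return [small + (v - lo) / (hi - lo) * (large - small) for v in values]


n_cits = [0, 50, 100]  # illustrative citation counts, not taken from the dataset
sizes = scale_sizes(n_cits, size_range=(5, 20))
print(sizes)  # [5.0, 12.5, 20.0]
```

Citation counts are typically very skewed, so under a linear mapping like this one a handful of highly cited papers would dominate while most points sit near the minimum size.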

### Clustered Publication Map by Micro Clusters

This visualization colors each point according to its `micro_cluster` value, making concentrations of related publications easier to spot in the coordinate space defined by `x` and `y`. Points are sized by citation count (`n_cits`).

In [None]:
cosmo(
    data,
    point_x_by="x",
    point_y_by="y",
    point_size_by="n_cits",
    point_color_by="micro_cluster",
    point_id_by="id",
    point_color_palette=["#FF4500", "#32CD32", "#1E90FF"],
    point_size_range=[2, 15],
    show_labels=True,
    show_top_labels=True,
    show_top_labels_limit=5,
    fit_view_on_init=True,
    disable_simulation=False,
)
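
The cell above passes a three-color palette, while the first row of the data already shows a `micro_cluster` value of 241, which suggests there are many more clusters than colors. If one distinct color per cluster is wanted, a longer palette can be generated programmatically. The sketch below uses only the standard library; `make_palette` is a hypothetical helper, and whether cosmograph assigns palette entries one-to-one to cluster values is an assumption that should be checked against its documentation.

```python
import colorsys


def make_palette(n_colors, saturation=0.65, value=0.9):
    """Return `n_colors` hex color strings with evenly spaced hues."""
    palette = []
    for i in range(n_colors):
        r, g, b = colorsys.hsv_to_rgb(i / n_colors, saturation, value)
        palette.append(
            "#{:02X}{:02X}{:02X}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return palette


cluster_ids = [241, 17, 241, 3, 17]  # illustrative values, not real data
n_clusters = len(set(cluster_ids))   # number of distinct clusters
palette = make_palette(n_clusters)   # one color per distinct cluster
```

With the real data, `n_clusters` would come from something like `data["micro_cluster"].nunique()`, and the resulting list could be tried as the `point_color_palette` argument.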

### Publication Network Visualization

Here publications are treated as nodes connected by citation links, with `id` as the node identifier. The dataset loaded above contains only the publications themselves, so this cell will not run as-is: it expects a separate `links` table that is not defined anywhere in this notebook. Judging from the arguments used, that table would need an `id` column (the citing publication), a `cited_id` column (the cited publication), and a `link_strength` column.

In [None]:
cosmo(
    data,
    links=links,  # NOTE: `links` must be supplied separately; it is not defined in this notebook
    link_source_by="id",
    link_target_by="cited_id",
    link_color_by="link_strength",
    link_color="#999999",
    link_width=1,
    link_arrows=True,
    render_links=True,
    show_labels=False,
)
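
Since no citation table ships with this dataset, the sketch below only illustrates the shape such link data might take and one sanity check worth running before plotting: dropping links whose endpoints are not among the known publication ids. The records and the `valid_links` helper are hypothetical; the column names simply mirror the arguments used in the cell above.

```python
def valid_links(link_records, point_ids, source_key="id", target_key="cited_id"):
    """Keep only links whose source and target both appear in `point_ids`."""
    known = set(point_ids)
    return [
        rec for rec in link_records
        if rec[source_key] in known and rec[target_key] in known
    ]


# Hypothetical citation records: publication `id` cites publication `cited_id`.
link_records = [
    {"id": 1, "cited_id": 2, "link_strength": 1.0},
    {"id": 2, "cited_id": 3, "link_strength": 0.5},
    {"id": 3, "cited_id": 99, "link_strength": 0.2},  # 99 is not a known publication
]
point_ids = [1, 2, 3]
kept = valid_links(link_records, point_ids)  # the link pointing at 99 is dropped
```

Wrapping the result in `pandas.DataFrame(kept)` would give a table with the `id`, `cited_id`, and `link_strength` columns that the cell above refers to.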