## PMC15 Pipeline

This notebook runs the PMC15 pipeline in three steps:

1. Download the PMC Open Access file list
2. Download and extract the article archives named in that list
3. Parse all the articles and create a `_results/data/pubmed_parsed_data.jsonl` file

In the `pubmed_parsed_data.jsonl` file, each line is a JSON object with the following shape:

```json
{
    "pmid": "PMID_VALUE like 11178228",
    "pmc": "PMC_VALUE like 15015",
    "location": "LOCATION_PATH: path to where the article is stored on disk",
    "figures": [
        {
            "fig_caption": "FIGURE_CAPTION: the caption of the figure in the article",
            "fig_id": "FIGURE_ID: F1, F2, etc",
            "fig_label": "FIGURE_LABEL: Figure 1, Figure 2, etc. Where the figure is referenced in the article",
            "graphic_ref": "GRAPHIC_REFERENCE_PATH: path to where the image is stored on disk",
            "pair_id": "PAIR_ID: {pmid}_{fig_id}"
        }
    ]
}
```
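As a minimal sketch of how this file can be consumed, assuming only the shape documented above, the snippet below flattens each article into one record per figure. The helper name `iter_figures` and the output field names `caption` and `image_path` are hypothetical and not part of `pmc15_pipeline`.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_figures(jsonl_path: Path) -> Iterator[dict]:
    """Yield one flat record per figure from a pubmed_parsed_data.jsonl file."""
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            article = json.loads(line)
            for fig in article.get("figures", []):
                yield {
                    "pmid": article.get("pmid"),
                    "pmc": article.get("pmc"),
                    "pair_id": fig.get("pair_id"),
                    "caption": fig.get("fig_caption"),
                    "image_path": fig.get("graphic_ref"),
                }
```

Because it is a generator, the file is streamed line by line rather than loaded into memory at once, which matters when the full PMC Open Access list is processed.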

In [5]:
# this controls how many articles will be downloaded and processed. Set to `None` to process all articles in the PMCOA list
MAX_ITEMS_TO_PROCESS = 200

In [1]:
from pmc15_pipeline import data
from pmc15_pipeline.utils import fs_utils

In [None]:
## https://ftp.ncbi.nlm.nih.gov/pub/pmc/
repo_root = fs_utils.get_repo_root_path()

In [3]:
list_output_path = repo_root / "_results" / "data" / "pubmed_open_access_file_list.txt"

data.download_pubmed_file_list(
    output_file_path=list_output_path,
)

Downloading OpenAccess file list from: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.txt to f:\PMC_Data\_results\data\pubmed_open_access_file_list.txt
File already exists: f:\PMC_Data\_results\data\pubmed_open_access_file_list.txt


In [6]:
# remove the subset_size argument to download all files
downloaded_articles_output_path = repo_root / "_results" / "data" / "pubmed_open_access_files_compressed"

data.download_pubmed_files_from_list(
    file_list_path=list_output_path,
    output_folder_path=downloaded_articles_output_path,
    subset_size=MAX_ITEMS_TO_PROCESS,
)

                                                 

File: PMC13900.tar.gz already exists. Not downloading again.
File: PMC13901.tar.gz already exists. Not downloading again.
File: PMC13902.tar.gz already exists. Not downloading again.
[...]
File: PMC29019.tar.gz size: 122308 bytes
File: PMC29020.tar.gz size: 554865 bytes
File: PMC29021.tar.gz size: 566226 bytes
[...]
File: PMC30938.tar.gz size: 94144 bytes

100%|██████████| 200/200 [09:03<00:00,  2.72s/it]


Skipped 0 files.


In [7]:
decompressed_folder_path = repo_root / "_results" / "data" / "pubmed_open_access_files"

data.decompress_pubmed_files(
    input_folder_path=downloaded_articles_output_path,
    output_folder_path=decompressed_folder_path,
)

Found 200 files that match *.tar.gz in f:\PMC_Data\_results\data\pubmed_open_access_files_compressed


100%|██████████| 200/200 [00:05<00:00, 38.27it/s]

Finished extracting 200 files
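
The internals of `data.decompress_pubmed_files` are not shown in this notebook. As a rough sketch of what such a step might look like using only the standard library, the hypothetical helper `extract_archives` below unpacks every `*.tar.gz` archive in one folder into another. It is an illustration under that assumption, not the pipeline's actual implementation.

```python
import tarfile
from pathlib import Path


def extract_archives(input_folder: Path, output_folder: Path) -> int:
    """Extract every *.tar.gz in input_folder into output_folder; return the count."""
    output_folder.mkdir(parents=True, exist_ok=True)
    # The "data" extraction filter rejects unsafe members (for example absolute
    # paths). It only exists on newer Python versions, so fall back gracefully.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    count = 0
    for archive in sorted(input_folder.glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=output_folder, **extra)
        count += 1
    return count
```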





In [8]:
pipeline_input_file_path = repo_root / "_results" / "data" / "pubmed_parsed_data.jsonl"

data.generate_pmc15_pipeline_outputs(
    decompressed_folder=decompressed_folder_path,
    output_file_path=pipeline_input_file_path,
)

f:\PMC_Data\_results\data\pubmed_open_access_files\PMC13900\BCR-3-1-055.nxml
starting...
parsed f:\PMC_Data\_results\data\pubmed_open_access_files\PMC13900\BCR-3-1-055.nxml
no output
f:\PMC_Data\_results\data\pubmed_open_access_files\PMC13901\BCR-3-1-061.nxml
starting...
parsed f:\PMC_Data\_results\data\pubmed_open_access_files\PMC13901\BCR-3-1-061.nxml
f:\PMC_Data\_results\data\pubmed_open_access_files\PMC13902\BCR-3-1-066.nxml
starting...
parsed f:\PMC_Data\_results\data\pubmed_open_access_files\PMC13902\BCR-3-1-066.nxml
f:\PMC_Data\_results\data\pubmed_open_access_files\PMC13911\bcr-2-1-059.nxml
starting...
parsed f:\PMC_Data\_results\data\pubmed_open_access_files\PMC13911\bcr-2-1-059.nxml
no output
f:\PMC_Data\_results\data\pubmed_open_access_files\PMC13912\bcr-2-1-064.nxml
starting...
parsed f:\PMC_Data\_results\data\pubmed_open_access_files\PMC13912\bcr-2-1-064.nxml
f:\PMC_Data\_results\data\pubmed_open_access_files\PMC13913\bcr-1-1-073.nxml
starting...
parsed f:\PMC_Data\_result

In [9]:
num_lines = fs_utils.get_line_count(pipeline_input_file_path)
print(f"Number of lines in pipeline output file: {num_lines}")

Number of lines in pipeline output file: 162


### Processed parsed data

In [None]:
# import json
# import os

# # Paths to input and output files in the _results/data folder
# input_file_path = '_results/data/pubmed_parsed_data.jsonl'
# output_file_path = '_results/data/pubmed_processed_data.json'

# def parse_pubmed_data(file_path):
#     """
#     Reads the JSONL file and returns a list of articles that have figure data.
#     Each line in the file should be a JSON object.
#     """
#     articles = []
#     with open(file_path, 'r', encoding='utf-8') as f:
#         for line in f:
#             try:
#                 data = json.loads(line.strip())
#                 # Only keep articles that contain figure data
#                 if data.get("figures"):
#                     articles.append(data)
#             except json.JSONDecodeError as e:
#                 print(f"Error decoding JSON: {e}")
#     return articles

# def process_article(article):
#     """
#     Extracts relevant information from an article.
#     Returns a dictionary with the article's pmid, pmc, location, and a list of figures.
#     """
#     processed = {
#         "pmid": article.get("pmid"),
#         "pmc": article.get("pmc"),
#         "location": article.get("location"),
#         "figures": []
#     }
#     for fig in article.get("figures", []):
#         figure_data = {
#             "fig_caption": fig.get("fig_caption"),
#             "fig_id": fig.get("fig_id"),
#             "fig_label": fig.get("fig_label"),
#             "graphic_ref": fig.get("graphic_ref"),
#             "pair_id": fig.get("pair_id")
#         }
#         processed["figures"].append(figure_data)
#     return processed

# def main():
#     # Parse the JSONL file
#     articles = parse_pubmed_data(input_file_path)
#     print(f"Found {len(articles)} articles with figure data.\n")
    
#     # Process all articles
#     processed_articles = [process_article(article) for article in articles]
    
#     # Ensure output directory exists
#     os.makedirs(os.path.dirname(output_file_path), exist_ok=True)
    
#     # Save the processed data to a JSON file
#     with open(output_file_path, 'w', encoding='utf-8') as outfile:
#         json.dump(processed_articles, outfile, indent=2)
    
#     print(f"Processed data saved to {output_file_path}\n")
    
#     # Display the first 5 processed articles as an example
#     for article in processed_articles[:5]:
#         print(json.dumps(article, indent=2))
#         print("-" * 80)

# if __name__ == '__main__':
#     main()

Found 162 articles with figure data.

Processed data saved to _results/data/pubmed_processed_data.json

{
  "pmid": "11250747",
  "pmc": "13901",
  "location": "f:\\PMC_Data\\_results\\data\\pubmed_open_access_files\\PMC13901",
  "figures": [
    {
      "fig_caption": "Immunohistochemical localization of BRCA1 and BRCA2 with formalin-fixed and paraffin sections in an ovotestis. (a) Hematoxylin eosin saffron (HES) histology of the ovotestis demonstrating testicular tissue with seminiferous cords (triangle), adjacent to ovarian tissue with primordial follicles (arrow) (\u00d7 200). For BRCA1 protein: (b) K-18 antibodies showed cytoplasmic staining of oocytes (arrowhead) surrounded by follicule primordial and cytoplasmic staining of male germ cells (arrow) in seminiferous cords identified by the presence of Sertoli cells inside (\u00d7 220); (c) 8F7 antibodies showed predominantly nuclear stainings of Sertoli cells (asterisk) and of oocytes (arrowhead), and cytoplasmic staining was also 

### Processed dataset

In [11]:
import json
import logging
from pathlib import Path
import shutil
import re
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[logging.FileHandler('dataset_generation.log'), logging.StreamHandler()]
)

class DatasetGenerator:
    MODALITY_MAP = {
        r'\b(MRI|MR|fMRI|DWI|MRA)\b': 'Magnetic Resonance Imaging',
        r'\b(CT|HRCT|CBCT)\b': 'Computed Tomography',
        r'\b(PET|SPECT)\b': 'Nuclear Medicine',
        r'\b(X-?ray|Radiography)\b': 'X-ray',
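        # Alternatives are grouped so that \b applies to each one; ungrouped,
        # the bare `US` would also match inside words such as "thus".
        r'\b(Ultrasound|US|Sonography)\b': 'Ultrasound',
        r'\b(Angiography|DSA)\b': 'Angiography',
        r'\b(Histology|IHC)\b': 'Histopathology',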
        r'\bEndoscopy\b': 'Endoscopy',
        r'\b(ECG|EKG)\b': 'Electrocardiography'
    }

    BODY_PART_MAP = {
        r'\bAbdomen|Abdominal\b': 'Abdomen',
        r'\bBrain|Cerebral|CNS\b': 'Brain',
        r'\bChest|Thoracic\b': 'Chest',
        r'\bHeart|Cardiac\b': 'Heart',
        r'\bSpine|Vertebral\b': 'Spine',
        r'\bPelvis|Pelvic\b': 'Pelvis',
        r'\bLiver|Hepatic\b': 'Liver',
        r'\bLung|Pulmonary\b': 'Lung'
    }

    def __init__(self):
        self.input_jsonl = Path("_results/data/pubmed_parsed_data.jsonl")
        self.output_jsonl = Path("final_data/final_dataset.jsonl")
        self.images_dir = Path("final_data/images")
        self.counter = 1
        
        # Create directories if they don't exist
        self.images_dir.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def extract_metadata(text, pattern_map):
        """Extract metadata using prioritized regex patterns"""
        for pattern, value in pattern_map.items():
            if re.search(pattern, text, re.IGNORECASE):
                return value
        return 'Unknown'

    def process_article(self, article):
        """Process a single article and its figures"""
        try:
            pmc = article['pmc']
            entries = []

            for idx, fig in enumerate(article['figures']):
                src_img = Path(fig['graphic_ref'])
                if not src_img.exists():
                    logging.warning(f"Missing image: {src_img}")
                    continue

                dest_img = self.images_dir / f"pmc_{pmc}_{idx}{src_img.suffix}"
                if not dest_img.exists():
                    shutil.copy(str(src_img), str(dest_img))

                entry = {
                    "image": [f"images/{dest_img.name}"],
                    "sequence": [f"images/{dest_img.name}"],
                    "conversations": [
                        {"from": "human", "value": "Analyze the image in a comprehensive and detailed manner."},
                        {"from": "gpt", "value": fig['fig_caption']}
                    ],
                    # Placeholder: IDs are assigned in generate_dataset on the main
                    # thread, because incrementing self.counter from worker threads
                    # is not atomic and could produce duplicate IDs.
                    "id": None,
                    "modality": self.extract_metadata(fig['fig_caption'], self.MODALITY_MAP),
                    "body_part": self.extract_metadata(fig['fig_caption'], self.BODY_PART_MAP),
                    "pmcid": pmc,
                    "pmid": article.get('pmid', ''),
                    "original_fig_id": fig.get('fig_id', '')
                }

                entries.append(entry)

            return entries
        except Exception as e:
            logging.error(f"Error processing article {article.get('pmc', 'unknown')}: {str(e)}")
            return []

    def generate_dataset(self):
        """Main method to generate the dataset"""
        # Count articles up front for the progress bar; the handle is closed afterwards
        with open(self.input_jsonl, 'r', encoding='utf-8') as f:
            total_articles = sum(1 for _ in f)

        with open(self.input_jsonl, 'r', encoding='utf-8') as infile, \
             open(self.output_jsonl, 'w', encoding='utf-8') as outfile:

            with ThreadPoolExecutor() as executor:
                # Submit in file order; iterating the list keeps the output order stable
                futures = [
                    executor.submit(self.process_article, json.loads(line.strip()))
                    for line in infile
                ]

                with tqdm(total=total_articles, desc="Processing articles") as pbar:
                    for future in futures:
                        for entry in future.result():
                            # Assign sequential IDs here, on a single thread
                            entry["id"] = f"Alignment_VQA_{self.counter}"
                            self.counter += 1
                            json.dump(entry, outfile, ensure_ascii=False)
                            outfile.write('\n')
                        pbar.update(1)

if __name__ == "__main__":
    generator = DatasetGenerator()
    generator.generate_dataset()

Processing articles: 100%|██████████| 162/162 [00:00<00:00, 267.33it/s]
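
One detail is worth knowing when writing keyword patterns of the kind used in `MODALITY_MAP` and `BODY_PART_MAP`: `|` has the lowest precedence in a regular expression, so in `\bA|B|C\b` the `\b` anchors apply only to the first and last alternatives. The short sketch below shows the difference grouping makes. The caption is made up for illustration and is not pipeline data.

```python
import re

ungrouped = re.compile(r"\bUltrasound|US|Sonography\b", re.IGNORECASE)
grouped = re.compile(r"\b(?:Ultrasound|US|Sonography)\b", re.IGNORECASE)

caption = "Histogram of values, thus unrelated to imaging."  # made-up caption

# Ungrouped: the bare alternative `US` matches inside "thus".
print(bool(ungrouped.search(caption)))  # True
# Grouped: `\b` applies to every alternative, so nothing matches.
print(bool(grouped.search(caption)))    # False
```

Even when grouped, short acronyms combined with `re.IGNORECASE` can still match ordinary words (for example the pronoun "us"), so case-sensitive patterns for acronyms may be worth considering.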


### PMC-Fine-Grained Dataset 

In [19]:
import json
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel
from torchvision import transforms
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# 1. Load dataset
def load_dataset(file_path):
    data = []
    # The dataset is written as UTF-8 with ensure_ascii=False, so read it as UTF-8
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            data.append(json.loads(line))
    return data

# 2. Image preprocessing with enhanced size
def create_image_pipeline(target_size=512):
    return transforms.Compose([
        transforms.Resize((target_size, target_size)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                            std=[0.229, 0.224, 0.225])
    ])

# 3. Text processing with extended length
def process_text(caption, processor, max_length=128):
    return processor(text=caption, 
                    padding='max_length',
                    truncation=True,
                    max_length=max_length,
                    return_tensors="pt")

# 4. BiomedCLIP embedding model
class BiomedCLIPEmbedder:
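    def __init__(self):
        # The published BiomedCLIP checkpoint is distributed for the open_clip
        # library; confirm that this ID exists on the Hugging Face Hub and loads
        # through the transformers Auto* classes before relying on it.
        self.model = AutoModel.from_pretrained(
            "microsoft/BiomedCLIP-PubMedBERT-base-uncased-abstract-fulltext"
        )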
        self.processor = AutoProcessor.from_pretrained(
            "microsoft/BiomedCLIP-PubMedBERT-base-uncased-abstract-fulltext"
        )
        
    def get_image_embedding(self, image):
        inputs = self.processor(images=image, return_tensors="pt")
        with torch.no_grad():
            return self.model.get_image_features(**inputs)
        
    def get_text_embedding(self, text):
        inputs = process_text(text, self.processor)
        with torch.no_grad():
            return self.model.get_text_features(**inputs)

# 5. Taxonomy classifier
class TaxonomyClassifier:
    def __init__(self, taxonomy_keywords):
        self.embedder = BiomedCLIPEmbedder()
        self.keywords = taxonomy_keywords
        
        # Precompute keyword embeddings
        text_embeddings = [self.embedder.get_text_embedding(k) for k in self.keywords]
        self.text_features = torch.cat(text_embeddings).numpy()
        
        # Build nearest neighbors index
        self.nn = NearestNeighbors(n_neighbors=1, metric='cosine')
        self.nn.fit(self.text_features)
        
    def classify_image(self, image):
        image_features = self.embedder.get_image_embedding(image).numpy()
        _, indices = self.nn.kneighbors(image_features)
        return self.keywords[indices[0][0]]

# 6. Analysis pipeline
def analyze_dataset(dataset_path, taxonomy):
    from pathlib import Path  # local import keeps this cell self-contained

    # Initialize components
    data = load_dataset(dataset_path)
    classifier = TaxonomyClassifier(taxonomy)
    # Image paths in the dataset are relative to the dataset file's folder
    base_dir = Path(dataset_path).parent

    # Process dataset
    results = {}
    for entry in data:
        try:
            # Load the image; the model's processor handles resizing and
            # normalization itself, so it receives the PIL image directly
            # (running the torchvision pipeline first would normalize twice)
            img = Image.open(base_dir / entry['image'][0]).convert("RGB")

            # Classify image
            image_type = classifier.classify_image(img)
            results[image_type] = results.get(image_type, 0) + 1

        except Exception as e:
            print(f"Error processing {entry['id']}: {str(e)}")
    
    # Generate visualization
    plot_results(results)

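def plot_results(results):
    # zip(*[]) cannot be unpacked into two names, so handle the empty case first
    if not results:
        print("No images were classified; nothing to plot.")
        return
    sorted_items = sorted(results.items(), key=lambda x: x[1], reverse=True)[:30]
    labels, values = zip(*sorted_items)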
    
    plt.figure(figsize=(12, 8))
    plt.barh(range(len(labels)), values, align='center')
    plt.yticks(range(len(labels)), labels)
    plt.gca().invert_yaxis()
    plt.xlabel('Frequency')
    plt.title('Top 30 Image Types in PMC-Fine-Grained')
    plt.tight_layout()
    plt.show()

# Example usage
if __name__ == "__main__":
    taxonomy = [
        "statistical figures", "graphs", "charts", "tables",
        "X-ray", "magnetic resonance", "CT scan", 
        "light microscopy", "electron microscopy",
        "histopathology", "immunohistochemistry",
        "flow cytometry", "gel electrophoresis",
        "ultrasound", "endoscopy", "ECG",
        "genetic sequencing", "protein structure",
        "molecular model", "surgical illustration"
    ]
    
    # Forward slash: in a normal string literal, "\f" is a form-feed escape
    analyze_dataset("final_data/final_dataset.jsonl", taxonomy)

ModuleNotFoundError: No module named 'torch'
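
The error above means the cell's dependencies are not installed in the active environment. A small check like the following sketch can confirm which import names are missing before running the cell. The `REQUIRED` list is taken from the cell's import statements, and the helper name `missing_modules` is hypothetical.

```python
import importlib.util

# Import names used by the cell above (scikit-learn imports as "sklearn",
# Pillow as "PIL").
REQUIRED = ["torch", "torchvision", "transformers", "PIL", "sklearn", "matplotlib"]


def missing_modules(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing:", missing_modules(REQUIRED) or "none")
```

Any names it reports need to be installed into the same environment the notebook kernel uses.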