# Building Custom DataFrames for Various Uses  
**Filename:** building_custom_dfs.ipynb  
**Path:** TAMIDS/Code/Scholars@TAMU Data/building_custom_dfs.ipynb  
**Created Date:** 04 April 2022, 18:09 

In this notebook I build custom DataFrames (`people` and `publications`) from the Scholars@TAMU data I have collected and save each one as a pickle.

In [1]:
from IPython.display import Markdown, display, HTML
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import json
import requests
from requests.exceptions import HTTPError
from tqdm import tqdm

pd.options.display.float_format = '{:,.3f}'.format
plt.style.use('seaborn-darkgrid')

# General Markdown Formatting Functions

def printmd(string, level=1):
    header_level = '#'*level + ' '
    display(Markdown(header_level + string))

## Loading the Data

Creates a dictionary of DataFrames, one for each pickled data file, keyed by filename.  
Ex: `data['people_education']` contains the 'people_education' DataFrame loaded from `../../Data/Scholars@TAMU/people/people_education.pickle`.  
This makes access to each DataFrame simpler than it was in the `pickling_raw_data.ipynb` and `completeness.ipynb` files.

In [2]:
base_path = "../../Data/Scholars@TAMU"

with open('dicts/data_filenames.json', 'r') as infile:
    data_filenames = json.load(infile)

data = {}
for foldername, filenames in data_filenames.items():
    for filename in filenames:
        data[filename] = pd.read_pickle(base_path + "/" + foldername + "/" + filename + ".pickle")
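
As a minimal sketch of the loading pattern above: the real contents of `dicts/data_filenames.json` are not shown in this notebook, so the manifest shape (folder name mapped to a list of pickle basenames), the folder and file names, and the helper `load_pickles` are all assumptions made for illustration. The example writes tiny stand-in DataFrames to a temporary directory and loads them back into a flat dictionary.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical manifest: folder name -> list of pickle basenames (assumed shape).
example_filenames = {"people": ["people_overview", "people_education"]}

def load_pickles(base_path, filenames_by_folder):
    """Return a flat dict mapping each filename to its unpickled DataFrame."""
    loaded = {}
    for foldername, filenames in filenames_by_folder.items():
        for filename in filenames:
            loaded[filename] = pd.read_pickle(Path(base_path) / foldername / f"{filename}.pickle")
    return loaded

# Write tiny stand-in DataFrames to a temporary directory, then load them back.
with tempfile.TemporaryDirectory() as tmp:
    for foldername, filenames in example_filenames.items():
        (Path(tmp) / foldername).mkdir()
        for filename in filenames:
            pd.DataFrame({"uin": [1, 2]}).to_pickle(Path(tmp) / foldername / f"{filename}.pickle")
    example_data = load_pickles(tmp, example_filenames)

print(sorted(example_data))
```

Because the dictionary is keyed by filename alone, two files with the same name in different folders would overwrite each other, so this layout relies on filenames being unique across folders.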

In [3]:
with open('../../Data/Scholars@TAMU/my_api_calls/general_data_dict.json', 'r') as infile:
    api = json.load(infile)

## people

Each row of this dataframe represents a unique person from Scholars@TAMU.

In [4]:
URL_PREFIX = 'https://scholars.library.tamu.edu/vivo/display/'
API_PREFIX = 'https://api.library.tamu.edu/scholars-discovery/individual/'

people = pd.DataFrame({
    'api_id': data['people_overview']['people_api'].str.replace(API_PREFIX, '', regex=False),
    'uin': data['people_overview']['uin'],
    'lastname': data['people_overview']['lastname'],
    'middlename': data['people_overview']['middle'],
    'firstname': data['people_overview']['firstname'],
    'email': data['people_overview']['email'],
    'preferred_title': data['people_overview']['preferred_title'],
    'employment_type': data['people_overview']['status']

}).set_index('api_id')
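
The `api_id` index is the API URL with its prefix removed. A minimal sketch of that step, using made-up identifiers and one missing value, suggests how the vectorized `Series.str.replace` with `regex=False` behaves: the prefix is treated as a literal string, and a missing URL stays missing rather than raising an error.

```python
import pandas as pd

API_PREFIX = 'https://api.library.tamu.edu/scholars-discovery/individual/'

# Made-up identifiers plus one missing value, for illustration only.
urls = pd.Series([API_PREFIX + 'n0000aaaa', API_PREFIX + 'n0000bbbb', None])

# regex=False keeps '.' and other characters in the URL from being read as a pattern.
api_ids = urls.str.replace(API_PREFIX, '', regex=False)
print(api_ids)
```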

In [5]:
def get_attribute_series(key: str, is_list: bool = True) -> pd.Index:
    """Look up `key` in the API record of every person in `people`.

    Returns an Index aligned with `people.index`. People without the
    attribute get an empty list (or an empty string if `is_list` is False).
    """
    has_it_counter = {'Yes': 0, 'No': 0}

    def get_attribute(api_id: str):
        # People with no API record at all are not counted as 'Yes' or 'No'.
        try:
            person = api[api_id]
        except KeyError:
            return [] if is_list else ''

        try:
            attribute = list(person[key]) if is_list else person[key]
            has_it_counter['Yes'] += 1
        except KeyError:
            attribute = [] if is_list else ''
            has_it_counter['No'] += 1

        return attribute

    s = people.index.map(get_attribute)
    print(f"{key}: {has_it_counter}")
    return s
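
The lookup pattern can be tried on a toy dictionary. This is a simplified variant rather than the function above: chained `dict.get` calls stand in for the nested `try`/`except`, the counter is omitted, and the records, the index, and the helper name `lookup` are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the `api` dictionary; the records below are invented.
toy_api = {
    'p1': {'keywords': ['Glycine', 'Ethanol'], 'hrJobTitle': 'Professor'},
    'p2': {'hrJobTitle': 'Lecturer'},
}
toy_index = pd.Index(['p1', 'p2', 'p3'], name='api_id')  # 'p3' has no API record

def lookup(api_id, key, default):
    """Return the person's value for `key`, or `default` when either is missing."""
    return toy_api.get(api_id, {}).get(key, default)

toy_people = pd.DataFrame(index=toy_index)
# Each call passes a fresh [] so rows never share one mutable default.
toy_people['keywords'] = pd.Series([lookup(i, 'keywords', []) for i in toy_index], index=toy_index)
toy_people['hr_title'] = pd.Series([lookup(i, 'hrJobTitle', '') for i in toy_index], index=toy_index)
print(toy_people)
```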


In [6]:
people['research_areas'] = get_attribute_series(key='researchAreas')
people['keywords'] = get_attribute_series(key='keywords')
people['colleges'] = get_attribute_series(key='schools')
people['organizations'] = get_attribute_series(key='organizations')
people['education'] = get_attribute_series(key='educationAndTraining')
people['teaching'] = get_attribute_series(key='teachingActivities')
people['publications'] = get_attribute_series(key='publications')
people['hr_title'] = get_attribute_series(key='hrJobTitle', is_list=False)
people['all_positions'] = get_attribute_series(key='positions')
people['overview'] = get_attribute_series(key='overview', is_list=False)


researchAreas: {'Yes': 643, 'No': 4189}
keywords: {'Yes': 3099, 'No': 1733}
schools: {'Yes': 4583, 'No': 249}
organizations: {'Yes': 4775, 'No': 57}
educationAndTraining: {'Yes': 3054, 'No': 1778}
teachingActivities: {'Yes': 3874, 'No': 958}
publications: {'Yes': 3304, 'No': 1528}
hrJobTitle: {'Yes': 4822, 'No': 10}
positions: {'Yes': 4775, 'No': 57}
overview: {'Yes': 1788, 'No': 3044}


In [7]:
people

api_id,uin,lastname,middlename,firstname,email,preferred_title,employment_type,research_areas,keywords,colleges,organizations,education,teaching,publications,hr_title,all_positions,overview
n28cb7333,706006006,Carter,H,Misti,hillcarter@tamu.edu,Assistant Professor,Faculty,[],[],[College of Medicine],[Humanities in Medicine],[{'id': 'n28cb7333_c2acab6a-b3ca-11e9-adb7-001...,"[{'id': 'n63e653c9', 'label': 'MEID610 Heal I'...","[{'id': 'n92740SE', 'label': 'Examining First-...",Clinical Assistant Professor,"[{'id': 'ne69a2757', 'label': 'Assistant Profe...",
n014c3d0f,502001050,Allen,C,Gregg,gregg.allen@tamu.edu,Associate Professor,Faculty,[],"[Photoreceptor Cells, Vertebrate, Glycine, Gam...",[College of Medicine],[Neuroscience and Experimental Therapeutics],[{'id': 'n014c3d0f_26a39892-b399-11e9-adb7-001...,"[{'id': 'n9d4cc9ec', 'label': 'NEXT620 Gross A...","[{'id': 'n87623SE', 'label': 'An autonomous ci...",Instructional Associate Professor,"[{'id': 'n073b4eaf', 'label': 'Associate Profe...",My primary research interest focuses on the un...
n7a168a93,902000258,Dubois,W,Dustin,dubois@tamu.edu,Assistant Professor,Faculty,[],"[Sipsc, Ethanol, Excitatory Postsynaptic Poten...",[College of Medicine],[Neuroscience and Experimental Therapeutics],[{'id': 'n7a168a93_c2acab6a-b3ca-11e9-adb7-001...,"[{'id': 'n8f977c82', 'label': 'NEXT605 Moleclr...","[{'id': 'n367780SE', 'label': 'Effects of etha...",Instructional Assistant Professor,"[{'id': 'n87109373', 'label': 'Assistant Profe...",My recent research interests have focused on u...
nbccd1f64,202004708,Mccord,C,Gary,g-mccord@tamu.edu,Professor,Staff,[],"[Tendinopathy, Bursitis, Hydroxyapatites, Calc...","[College of Medicine, Health Science Center]","[Neuroscience and Experimental Therapeutics, C...",[{'id': 'nbccd1f64_dc7dc0d0-b399-11e9-adb7-001...,"[{'id': 'n075561e4', 'label': 'MEID708 Integum...","[{'id': 'n317155SE', 'label': 'Four teaching s...",Senior Associate Dean for Student Affairs and ...,"[{'id': 'n2c80881d', 'label': 'Professor', 'ty...",I am primarily a teaching faculty member and a...
n18de9127,919002271,Bondos,E,Sarah,bondos@tamu.edu,Associate Professor,Faculty,[],"[Recombinant Fusion Proteins, Gene Expression,...",[],[],[{'id': 'n18de9127_580dfdad-b399-11e9-adb7-001...,"[{'id': 'nd8499416', 'label': 'MSCI691 Researc...","[{'id': 'n83617SE', 'label': 'Detection and pr...",Associate Professor,[],My laboratory works in two research areas. Fir...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
nc1e62471,701002712,Ferro,J,Pamela,p-ferro@tamu.edu,"Section Head, Molecular Diagnostics",Staff,[],"[Viral Envelope Proteins, Sensitivity And Spec...",[College of Veterinary Medicine and Biomedical...,[Texas A&M Veterinary Medical Diagnostic Labor...,[{'id': 'nc1e62471_dc7dc0d0-b399-11e9-adb7-001...,[],"[{'id': 'n93579SE', 'label': 'A duplex real-ti...",Section Head,"[{'id': 'nf71ae86c', 'label': 'Section Head, M...",
n14b2580b,730003535,Chakraborty,,Joydeep,joydeep.c2019@tamu.edu,Postdoctoral Researcher,Staff,"[{'id': 'nfst01036069', 'label': 'Nerves, Peri...","[Saturation Mutagenesis, 2-oxo Acid Dehydrogen...",[College of Agriculture and Life Sciences],[Nutrition],[{'id': 'n14b2580b_874490bb-b3ae-11e9-adb7-001...,[],"[{'id': 'n589414SE', 'label': 'Catalysis of tr...",Postdoctoral Researcher,"[{'id': 'n18641334', 'label': 'Postdoctoral Re...","Currently, working as a post doc and investiga..."
n4f37dfa5,324003135,Dvir,,Rotem,rdvir@tamu.edu,Assistant Research Scientist,Staff,"[{'id': 'nfst01110895', 'label': 'Security, In...","[8.3 Policy, Ethics, And Research Governance]",[Bush School of Government and Public Service],"[Institute for Science, Technology, and Public...",[{'id': 'n4f37dfa5_3d0e6a4d-b399-11e9-adb7-001...,"[{'id': 'na4f3301a', 'label': 'BUSH631 Quant M...","[{'id': 'n596137SE', 'label': 'Who Is a Rebel?...",Assistant Research Scientist,"[{'id': 'n13f7fe34', 'label': 'Assistant Resea...",
n0e788fcb,231005803,Agadi,,Satish,agadi@tamu.edu,Clinical Assistant Professor,Staff,[],[],[Health Science Center],[College of Medicine],[],[],[],Clinical Assistant Professor,"[{'id': 'n1d395815', 'label': 'Clinical Assist...",


In [8]:
people.to_pickle('../../Data/Scholars@TAMU/my_api_calls/people_df.pickle')

## departments

Planned columns for the `departments` DataFrame:  
metadata -> name, id, research fields

## publications

In [9]:
publications = data['publications_overview'].copy()
# Strip the API URL prefix so only the bare identifiers remain.
publications['people_api_id'] = publications['people_api'].str.replace(API_PREFIX, '', regex=False)
publications['publication_api_id'] = publications['publication_api'].str.replace(API_PREFIX, '', regex=False)

# One row per publication: collect every author, keep the first value of the shared metadata.
pubs_grouped = publications.groupby('publication_api_id')
publications_unified = pd.DataFrame({
    'author_ids': pubs_grouped['people_api_id'].apply(list),
    'author_uins': pubs_grouped['uin'].apply(list),
    'year': pubs_grouped['year'].first(),
    'publication_type': pubs_grouped['publication_type'].first(),
    'publication_title': pubs_grouped['publication_title'].first()
})
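
The same collapse from one row per (author, publication) pair to one row per publication can be sketched with pandas named aggregation. The rows below are invented, and this is an alternative formulation rather than the cell above.

```python
import pandas as pd

# Invented rows: one row per (author, publication) pair.
toy_pubs = pd.DataFrame({
    'publication_api_id': ['pubA', 'pubA', 'pubB'],
    'people_api_id': ['p1', 'p2', 'p1'],
    'year': [2014, 2014, 2020],
})

# Collapse to one row per publication: list the authors, keep the first year.
toy_unified = toy_pubs.groupby('publication_api_id').agg(
    author_ids=('people_api_id', list),
    year=('year', 'first'),
)
print(toy_unified)
```

Taking the first `year` assumes every row of a publication carries the same metadata, which is also what the `.first()` calls above rely on.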

In [10]:
pubs_wos = data['publications_subject_journal_wos'].copy()
pubs_wos['publication_api_id'] = pubs_wos['publication_api'].str.replace(API_PREFIX, '', regex=False)

pubs_wos_unified = pd.DataFrame({
    'keyword': pubs_wos.groupby('publication_api_id')['keyword'].apply(list)
    })

In [11]:
unsdg = data['publications_unsdg'].copy()
unsdg['publication_api_id'] = unsdg['publication_api'].str.replace(API_PREFIX, '', regex=False)

unsdg_unified = pd.DataFrame({
    'un_sustainable_development_goals': unsdg.groupby('publication_api_id')['name'].apply(list)
    })

In [12]:
authors = data['publications_author_institutions'].copy()
authors['publication_api_id'] = authors['publication_api'].str.replace(API_PREFIX, '', regex=False)

authors_unified = pd.DataFrame({
    'author_organization': authors.groupby('publication_api_id')['organisation'].apply(list),
    'author_city': authors.groupby('publication_api_id')['city'].apply(list),
    'author_country': authors.groupby('publication_api_id')['country'].apply(list),
    })

In [17]:
abstract = data['publications_abstract'].copy()
abstract['publication_api_id'] = abstract['publication_api'].str.replace(API_PREFIX, '', regex=False)

abstract_unified = pd.DataFrame({
    'abstract': abstract.groupby('publication_api_id')['abstract'].first()
    })

In [18]:
# Spot check: pick one random publication and show its rows in every source table.
key = authors['publication_api_id'].sample(1).iloc[0]

print('publication_api_id: ' + key)
display(publications[publications['publication_api_id'] == key])
display(pubs_wos[pubs_wos['publication_api_id'] == key])
display(unsdg[unsdg['publication_api_id'] == key])
display(authors[authors['publication_api_id'] == key])
display(abstract[abstract['publication_api_id'] == key])

publication_api_id: n301830SE


Unnamed: 0,people_uid,uin,people_uri,people_api,dept_id,publication_uid,publication_uri,publication_api,doi,issn,...,year,begin_page,end_page,volume,issue,publisher,publication_type,publication_title,people_api_id,publication_api_id
187742,29c4db01cbc04b27820afdd347f0ab11,726000577,https://scholars.library.tamu.edu/vivo/display...,https://api.library.tamu.edu/scholars-discover...,16,301830SE,https://scholars.library.tamu.edu/vivo/display...,https://api.library.tamu.edu/scholars-discover...,10.1128/CDLI.7.1.114-118.2000,1071-412X,...,2000,0 388\n1 r40\n2 829...,0 394\n1 r40\n2 838...,0 155\n1 12\n2 4\n3...,0 NaN\n1 4\n2 6\n3...,American Society for Microbiology,Journal Article,Characterization of specific immune responses ...,n5889f585,n301830SE


Unnamed: 0,publication_uri,publication_api,wos_id,wos_research_area_id,keyword,publication_api_id
100629,https://scholars.library.tamu.edu/vivo/display...,https://api.library.tamu.edu/scholars-discover...,WOS:000084723000023,35.0,Immunology,n301830SE
100630,https://scholars.library.tamu.edu/vivo/display...,https://api.library.tamu.edu/scholars-discover...,WOS:000084723000023,36.0,Infectious Diseases,n301830SE
100631,https://scholars.library.tamu.edu/vivo/display...,https://api.library.tamu.edu/scholars-discover...,WOS:000084723000023,37.0,Microbiology,n301830SE


Unnamed: 0,publication_uri,publication_api,category_sdg_id,name,publication_api_id


Unnamed: 0,publication_uri,publication_api,organisation,city,country,publication_api_id
693552,https://scholars.library.tamu.edu/vivo/display...,https://api.library.tamu.edu/scholars-discover...,National Agricultural Technology Institute,Buenos Aires,Argentina,n301830SE


Unnamed: 0,publication_uri,publication_api,abstract,publication_api_id
93239,https://scholars.library.tamu.edu/vivo/display...,https://api.library.tamu.edu/scholars-discover...,Using the shuttle vector pMCO2 and the vaccini...,n301830SE


In [19]:
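# Combine the per-publication tables column-wise, aligned on publication_api_id.
# The default outer join keeps every publication found in any table; gaps become NaN.
pubs_unified = pd.concat([publications_unified, pubs_wos_unified, unsdg_unified, authors_unified, abstract_unified], axis=1)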

pubs_unified.sample(n=10)

publication_api_id,author_ids,author_uins,year,publication_type,publication_title,keyword,un_sustainable_development_goals,author_organization,author_city,author_country,abstract
n43668SE,[n06bf3bf8],[501001957],1983.0,Journal Article,Homogeneous catalysts for carbon dioxide/hydro...,[Chemistry],,[Texas A and M University],[College Station],[United States],
n105384SE,[n238242e1],[226003814],2014.0,Working Paper,Gender Differences in Competitiveness: The Rol...,,,[University of Zurich],[Zurich],[Switzerland],Gender differences in competitiveness have bee...
n335694SE,,,,,,,,"[University of Oklahoma, Drake University]","[Norman, Des Moines]","[United States, United States]",
n170405SE,[n35757a82],[101009288],1985.0,Journal Article,A Method for Investigation of Steady State Wav...,"[Engineering, Water Resources]",,[University of Maine],[Orono],[United States],A technique is presented for modeling the evol...
n42398SE,[nd695d1d9],[917000989],2006.0,Journal Article,Microarray Applications in Microbial Ecology R...,,,[Oak Ridge National Laboratory],[Oak Ridge],[United States],Microarray technology has the unparalleled pot...
n227606SE,[n939257d5],[802005810],2015.0,Book,Radio-Frequency Integrated-Circuit Enginee...,,,,,,"© 2015 by John Wiley & Sons, Inc. All right..."
n196684SE,[n01799b2e],[202008097],1998.0,Journal Article,Heat/Mass Transfer Distribution in a Rotating ...,,,[Texas A and M University],[College Station],[United States],Naphthalene sublimation experiments have been ...
n61122SE,[nb82a0bc7],[814001818],2014.0,Journal Article,CO adsorption on Pt clusters supported on grap...,"[Chemistry, Electrochemistry]",,[Texas A and M University],[College Station],[United States],Density functional theory calculations are use...
n120003SE,[n82bca37a],[224005011],1995.0,Journal Article,"Methods for Collecting, Processing, and Provid...",,,,,,
n351210SE,[n2d08218e],[818002449],2020.0,Journal Article,A Power-of-Two Choices Based Algorithm for Fog...,[Computer Science],,"[Universita degli Studi di Roma La Sapienza, T...","[Roma, Doha, College Station, Rome, Rome, Doha...","[Italy, Qatar, United States, Italy, Italy, Qa...",IEEE The fog computing paradigm brings togethe...
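
Rows such as `n335694SE` in the sample above have author institutions but no title or year, which is what a column-wise `pd.concat` produces when an index value is missing from some of the inputs. A minimal sketch with invented tables illustrates the default outer join and the stricter `join='inner'` alternative; the table names and values here are made up.

```python
import pandas as pd

# Invented per-publication tables with partially overlapping indexes.
titles = pd.DataFrame({'publication_title': ['Paper A', 'Paper B']}, index=['pubA', 'pubB'])
cities = pd.DataFrame({'author_city': [['Norman'], ['Orono']]}, index=['pubB', 'pubC'])

# axis=1 aligns on the index; the default join='outer' keeps the union of keys.
combined = pd.concat([titles, cities], axis=1)
print(combined)

# join='inner' keeps only publications present in every table.
complete_only = pd.concat([titles, cities], axis=1, join='inner')
print(complete_only)
```

Whether the outer or inner join is preferable depends on the downstream use: the outer join preserves partial records, while the inner join guarantees that every column is populated from its source table.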


In [20]:
pubs_unified.to_pickle('../../Data/Scholars@TAMU/my_api_calls/publications_df.pickle')