# Pulling in More Data from Scholars@TAMU  
**Filename:** pulling_more_data.ipynb  
**Path:** TAMIDS/Code/Scholars@TAMU Data/pulling_more_data.ipynb  
**Created Date:** 03 April 2022, 22:07 

There were some things missing from the provided data that I needed. I am making API calls to get get information like:
- Department Name (not ID)
- 

In [81]:
from IPython.display import Markdown, display, HTML
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import json
import requests
from requests.exceptions import HTTPError
from tqdm import tqdm

pd.options.display.float_format = '{:,.3f}'.format
plt.style.use('seaborn-darkgrid')

# General Markdown Formatting Functions

def printmd(string, level=1):
    header_level = '#'*level + ' '
    display(Markdown(header_level + string))

## DataFrame Preperation

Creates a dictionary of dictionaries of dataframes for all the pickeled data.  
Ex: `data['people_education']` contains the 'people_education' DataFrame loaded from `../../Data/Scholars@TAMU/people/people_education.pickle`.  
This makes calls to each DataFrame simpler than they were in the `pickling_raw_data.ipynb` and `completeness.ipynb` files.

In [3]:
base_path = "../../Data/Scholars@TAMU"

with open('dicts/data_filenames.json', 'r') as infile:
    data_filenames = json.load(infile)

data = {}
for foldername, filenames in data_filenames.items():
    for filename in filenames:
        data[filename] = pd.read_pickle(base_path + "/" + foldername + "/" + filename + ".pickle")

## Getting API URLs

In [4]:
apis = data['people_overview']['people_api'].unique()

apis

array(['https://api.library.tamu.edu/scholars-discovery/individual/n28cb7333',
       'https://api.library.tamu.edu/scholars-discovery/individual/n014c3d0f',
       'https://api.library.tamu.edu/scholars-discovery/individual/n7a168a93',
       ...,
       'https://api.library.tamu.edu/scholars-discovery/individual/n4f37dfa5',
       'https://api.library.tamu.edu/scholars-discovery/individual/n0e788fcb',
       'https://api.library.tamu.edu/scholars-discovery/individual/nca549702'],
      dtype=object)

## API Calls

In [64]:
def get_api_dict(url: str) -> dict:
    try:
        response = requests.get(url)
        response.raise_for_status()
        jsonResponse = response.json()
    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}')
    except Exception as err:
        print(f'Other error occurred: {err}')
    else:
        return jsonResponse
    return {}

### test case

In [22]:
url = 'https://api.library.tamu.edu/scholars-discovery/individual/n029092b0'
get_api_dict(url)

{'id': 'n029092b0',
 'class': 'Person',
 'name': 'Garay, Juan',
 'primaryEmail': 'garay@tamu.edu',
 'preferredTitle': 'Professor - Term Appointment',
 'positions': [{'id': 'n0863ed82',
   'label': 'Professor - Term Appointment',
   'type': 'FacultyPosition',
   'organizations': [{'id': 'n3b8431fc',
     'label': 'Computer Science and Engineering',
     'parent': [{'id': 'n8627320c', 'label': 'College of Engineering'}]}]}],
 'overview': "Dr. Garay's research interests include both foundational and applied aspects of cryptography and information security. He has published extensively in the areas of cryptography, network security, distributed computing and algorithms. He has been involved in the design, analysis and implementation of a variety of secure systems, and is the recipient of over two dozen patents.",
 'hrJobTitle': 'Professor',
 'keywords': ['Perennial Computation',
  'Secret Sharing',
  'Oblivious Turing-machine',
  'Information-theoretic Security',
  'Cs.dc',
  'Cs.dm',
  'C

## General Data from all IDs

In [18]:
succesfull_calls = []
failed_calls = []

for api in tqdm(apis):
    response = get_api_dict(url=api)
    if response:
        succesfull_calls.append(response)
    else:
        failed_calls.append(api)

  8%|▊         | 391/4860 [02:22<24:02,  3.10it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n531a30c4


 14%|█▍        | 684/4860 [04:34<27:41,  2.51it/s]  

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n6b86bf0a


 16%|█▌        | 777/4860 [05:07<21:05,  3.23it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/nb0ba8174


 17%|█▋        | 830/4860 [05:25<23:32,  2.85it/s]

Other error occurred: Invalid URL 'and cell wall assembly."': No scheme supplied. Perhaps you meant http://and cell wall assembly."?


 22%|██▏       | 1077/4860 [06:56<20:53,  3.02it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/nb2af52ee


 25%|██▌       | 1217/4860 [07:51<26:59,  2.25it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n6fe482a6


 26%|██▌       | 1247/4860 [08:01<20:38,  2.92it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n14aa5202


 33%|███▎      | 1587/4860 [10:01<16:42,  3.26it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n3d197541


 38%|███▊      | 1836/4860 [11:27<18:01,  2.80it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n7b5179b3


 40%|████      | 1960/4860 [12:12<16:58,  2.85it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n847f3873


 41%|████      | 1997/4860 [12:26<17:57,  2.66it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n631e24a7


 41%|████▏     | 2006/4860 [12:31<19:05,  2.49it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n8247cf18


 43%|████▎     | 2094/4860 [13:46<22:53,  2.01it/s]  

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n0e3856dd


 45%|████▌     | 2198/4860 [14:47<14:09,  3.13it/s]  

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/ne71697c4


 48%|████▊     | 2333/4860 [15:36<14:42,  2.86it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n3525e685


 71%|███████▏  | 3470/4860 [22:12<08:30,  2.72it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n1ab41bb2


 76%|███████▌  | 3678/4860 [24:11<06:05,  3.23it/s]  

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/na9655306


 76%|███████▌  | 3684/4860 [24:13<06:12,  3.16it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/nf26d46a6


 78%|███████▊  | 3789/4860 [24:46<05:17,  3.37it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n981a6d1d


 78%|███████▊  | 3815/4860 [24:54<05:08,  3.39it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n87907ac0


 79%|███████▉  | 3858/4860 [25:06<05:17,  3.16it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/nb2760449


 80%|███████▉  | 3886/4860 [25:16<05:00,  3.24it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/ncbe3c229


 82%|████████▏ | 3989/4860 [25:48<04:28,  3.24it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n4be06dfb


 83%|████████▎ | 4018/4860 [25:57<04:38,  3.02it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n63a8f154


 84%|████████▎ | 4061/4860 [26:12<04:39,  2.86it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n7b242370


 84%|████████▎ | 4064/4860 [26:13<04:35,  2.89it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n3706f4f0


 87%|████████▋ | 4251/4860 [27:12<03:04,  3.30it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n83cd1943


 99%|█████████▊| 4789/4860 [29:52<00:20,  3.49it/s]

HTTP error occurred: 404 Client Error:  for url: https://api.library.tamu.edu/scholars-discovery/individual/n57a931e2


100%|██████████| 4860/4860 [30:48<00:00,  2.63it/s]


It looks like there is another problem with information shifting as seen in `failed_calls[3]`. Unfortunately, I do not have the time to deal with it.
Only 28 failed calls out of 4860; the reason is unknown.

In [25]:
failed_calls

['https://api.library.tamu.edu/scholars-discovery/individual/n531a30c4',
 'https://api.library.tamu.edu/scholars-discovery/individual/n6b86bf0a',
 'https://api.library.tamu.edu/scholars-discovery/individual/nb0ba8174',
 ' and cell wall assembly."',
 'https://api.library.tamu.edu/scholars-discovery/individual/nb2af52ee',
 'https://api.library.tamu.edu/scholars-discovery/individual/n6fe482a6',
 'https://api.library.tamu.edu/scholars-discovery/individual/n14aa5202',
 'https://api.library.tamu.edu/scholars-discovery/individual/n3d197541',
 'https://api.library.tamu.edu/scholars-discovery/individual/n7b5179b3',
 'https://api.library.tamu.edu/scholars-discovery/individual/n847f3873',
 'https://api.library.tamu.edu/scholars-discovery/individual/n631e24a7',
 'https://api.library.tamu.edu/scholars-discovery/individual/n8247cf18',
 'https://api.library.tamu.edu/scholars-discovery/individual/n0e3856dd',
 'https://api.library.tamu.edu/scholars-discovery/individual/ne71697c4',
 'https://api.library

In [26]:
# because I spelled it wrong and don't want to want 30min to correct it
successfull_calls = succesfull_calls

In [44]:
num_positions = {'0': 0, '1': 0, '2+': 0}
for call in successfull_calls:
    try:
        x = len(call['positions'])
        match x:
            case 0:
                num_positions['0'] += 1
            case 1:
                num_positions['1'] += 1
            case _ if x > 1:
                num_positions['2+'] += 1
            case _:
                print('Case was out of bounds.')
    except:
        num_positions['0'] += 1

print(f"Has no listed positions: {num_positions['0']}")
print(f"Has one listed position: {num_positions['1']}")
print(f"Has multiple positions: {num_positions['2+']}")

Has no listed positions: 57
Has one listed position: 3935
Has multiple positions: 840


In [54]:
with open('../../Data/Scholars@TAMU/my_api_calls/general_data_list.json', 'w') as outfile:
    json.dump(successfull_calls, outfile, indent=4)

In [None]:
succesfull_calls_by_id = {person['id']: person for person in succesfull_calls}

with open('../../Data/Scholars@TAMU/my_api_calls/general_data_dict.json', 'w') as outfile:
    json.dump(successfull_calls_by_id, outfile, indent=4)

## TAMU Department Structure

In [78]:
api_base = 'https://api.library.tamu.edu/scholars-discovery/individual/'

tamu_info_data = get_api_dict(url=api_base + 'n5d3837d6')
tamu_info = tamu_info_data

In [98]:
structure_data = {
    'name': 'Texas A&M University',
    'id': 'n5d3837d6',
    'api_data': get_api_dict(api_base + 'n5d3837d6'),
    'colleges': []
    }


for college in structure_data['api_data']['hasSubOrganizations']:
    college_data = {
        'name': college['label'],
        'id': college['id'],
        'api_data': get_api_dict(api_base + college['id']),
        'departments': [],
        'research_areas': []
    }

    try:
        for department in college_data['api_data']['hasSubOrganizations']:
            department_data = {
                'name': department['label'],
                'id': department['id'],
                'api_data': get_api_dict(api_base + department['id'])
            }
            college_data['departments'].append(department_data)
    except KeyError:
        print(college_data['name'] + ' has no departments.')

    
    try:
        for research in college_data['api_data']['affiliatedResearchAreas']:
            research_data = {
                'name': research['label'],
                'id': research['id'],
                'api_data': get_api_dict(api_base + research['id'])
            }
            college_data['research_areas'].append(research_data)
    except KeyError:
        print(college_data['name'] + ' has no reseach areas.')

    structure_data['colleges'].append(college_data)


Office of Academic Affairs has no departments.
Office of Academic Affairs has no reseach areas.
Texas A&M Institute for Neuroscience has no departments.
Texas A&M Engineering Extension Service (TEEX) has no departments.
School of Law has no departments.
Office of the Provost and Executive Vice President has no departments.
Office of the President has no reseach areas.
Texas A&M Veterinary Medical Diagnostic Laboratory has no departments.
Albritton Center for Grand Strategy has no departments.
College of Architecture has no reseach areas.
Institute for Sustainable Communities has no departments.
Institute for Sustainable Communities has no reseach areas.


In [103]:
tamu_structure = structure_data.copy()

del tamu_structure['api_data']
for college in tamu_structure['colleges']:
    try:
        del college['api_data']
    except KeyError:
        print(college['name'] + " has no api_data.")
    
    for department in college['departments']:
        try:
            del department['api_data']
        except KeyError:
            print(college['name'] + " has no departments.")
    for research_area in college['research_areas']:
        try:
            del research_area['api_data']
        except KeyError:
            print(college['name'] + " has no research areas.")
        

tamu_structure

Office of Academic Affairs has no api_data.
Texas A&M Institute for Neuroscience has no api_data.
Texas A&M Engineering Extension Service (TEEX) has no api_data.
School of Law has no api_data.
Office of the Provost and Executive Vice President has no api_data.
Texas A&M University at Galveston has no api_data.
Texas A&M Engineering Experiment Station (TEES) has no api_data.
Office of the President has no api_data.
University Libraries has no api_data.
Texas A&M Veterinary Medical Diagnostic Laboratory has no api_data.
College of Science has no api_data.
Albritton Center for Grand Strategy has no api_data.
Texas A&M AgriLife Research has no api_data.
Bush School of Government and Public Service has no api_data.
College of Engineering has no api_data.
College of Geosciences has no api_data.
Mays Business School has no api_data.
Texas A&M University at Qatar has no api_data.
Texas A&M AgriLIFE Extension has no api_data.
Texas A&M Transportation Institute (TTI) has no api_data.
College of 

{'name': 'Texas A&M University',
 'id': 'n5d3837d6',
 'colleges': [{'name': 'Office of Academic Affairs',
   'id': 'n78eb517b',
   'departments': [],
   'research_areas': []},
  {'name': 'Texas A&M Institute for Neuroscience',
   'id': 'n772ec370',
   'departments': [],
   'research_areas': [{'name': 'Medical sciences', 'id': 'nfst01014601'},
    {'name': 'Art', 'id': 'nfst00815177'},
    {'name': 'Psychology', 'id': 'nfst01081447'},
    {'name': 'Electroencephalography', 'id': 'nfst00906445'},
    {'name': 'Music', 'id': 'nfst01030269'},
    {'name': 'Visual perception', 'id': 'nfst01168049'},
    {'name': 'Public health', 'id': 'nfst01082238'},
    {'name': 'Public health disparities', 'id': 'nfst01082238'},
    {'name': 'History', 'id': 'nfst00958235'},
    {'name': 'Neurosciences', 'id': 'nfst01036509'},
    {'name': 'Electromyography', 'id': 'nfst00906641'},
    {'name': 'Biomechanics', 'id': 'nfst00832558'}]},
  {'name': 'Texas A&M Engineering Extension Service (TEEX)',
   'id': 

In [111]:
department_struct = {}

for college in structure_data['colleges']:
    try:
        departments = [department['name'] for department in college['departments']]
    except KeyError:
        departments = []

    department_struct[college['name']] = departments

with open('./dicts/structure/departments.json', 'w') as outfile:
    json.dump(department_struct, outfile, indent=4)

In [113]:
research_struct = {}

for college in structure_data['colleges']:
    try:
        departments = {department['name']: [] for department in college['departments']}
    except KeyError:
        departments = {}
    else:
        try:
            department_dicts = [department for department in college['departments']]
            departments[department_dicts['name']: get_api_dict(api_base + department_dicts['id'])]

        except KeyError:
            divisions = []
            print(divisions)

    research_struct[college['name']] = departments

# with open('./dicts/structure/departments.json', 'w') as outfile:
#     json.dump(department_struct, outfile, indent=4)

TypeError: list indices must be integers or slices, not str

In [115]:
with open('./test1.json', 'w') as outfile:
    json.dump(tamu_structure, outfile, indent=4)
with open('./test2.json', 'w') as outfile:
    json.dump(structure_data, outfile, indent=4)