<center>

<h1><b>Predict hierarchical levels in Knowledge Organization Systems (KOS) </b></h1>
<h2><b>Ahsan Ali </b></h2>
<h2><b>Supervisor: Péter Kiraly</b></h2>
</center>

### The main aim of this interim document is to find out which of the terms provided in the data are resolvable and which are not. This also helps in getting familiar with the API. Further data analysis can then be done to retrieve more information.

Here is the data loaded into a dataframe:

In [1]:
import pandas as pd
import requests

In [2]:
df = pd.read_csv("facet-terms-for-082a_ClassificationDdc.csv")
df

Unnamed: 0,term,count
0,2,7506
1,4,6369
2,1,6342
3,3,5928
4,5,4866
...,...,...
45482,zot,1
45483,zs95f,1
45484,zur,1
45485,zw315m,1


The data has two columns: term (a Dewey Decimal Classification notation) and count (the count for the corresponding term). The goal is to find out whether each term is resolvable using the API.

In [None]:
df.describe(include='all')

From these statistics we can see that both columns contain 45487 values in total. The minimum of the count column is 1, so no term has a count of 0. This makes sense because a term only appears in the data if it occurs at least once. The maximum count is 7506.
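The claims read off the summary table can also be checked directly. The following is a minimal sketch on a made-up three-row dataframe (the `toy` frame and the `summary` dictionary are illustrative names, not part of the original notebook); on the real data the same expressions would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the real file; the rows are chosen only for illustration.
toy = pd.DataFrame({"term": ["2", "4", "zot"], "count": [7506, 6369, 1]})

summary = {
    "rows": len(toy),
    "min_count": int(toy["count"].min()),
    "max_count": int(toy["count"].max()),
    # True would mean that at least one term has a count below 1.
    "has_zero_or_negative": bool((toy["count"] < 1).any()),
}
print(summary)
```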

In [None]:
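def get_concept_from_term(term):
    """Look up a DDC notation via the coli-conc API; return the parsed JSON or the status code."""
    voc = "http://dewey.info/scheme/edition/e23/"
    url = "https://coli-conc.gbv.de/api/concepts"

    # Passing params lets requests URL-encode the notation and the vocabulary URI
    response = requests.get(url, params={"notation": term, "voc": voc})
    if response.status_code == 200:
        return response.json()
    else:
        return response.status_code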

The function above uses the API to get information about a term from the coli-conc database, which enables interoperability between Knowledge Organization Systems (KOS) with a focus on German library KOS. In our case the vocabulary is the Dewey Decimal Classification (DDC).
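The notebook does not show how the function was applied to all terms, so the following is only a sketch of one possible loop. The helper `resolve_terms` and the stand-in `fake_fetch` are hypothetical names introduced here. The column names mirror those of the output file, and the rule that a list result means status 200 is an assumption based on how `get_concept_from_term` returns its values.

```python
def resolve_terms(terms, fetch):
    """Call `fetch` once per term and collect one result row per term."""
    rows = []
    for term in terms:
        result = fetch(term)
        if isinstance(result, list):
            # A parsed JSON list came back, so the request succeeded.
            rows.append({"term": term, "Response Code": 200, "Returned List": result})
        else:
            # Otherwise the function returned the failing HTTP status code.
            rows.append({"term": term, "Response Code": result, "Returned List": None})
    return rows


def fake_fetch(term):
    # Hypothetical stand-in for get_concept_from_term; makes no network call.
    if term == "bad":
        return 500
    known = {"2": [{"notation": ["2"]}]}
    return known.get(term, [])


rows = resolve_terms(["2", "zot", "bad"], fake_fetch)
for row in rows:
    print(row)
```

For the real run, `fake_fetch` would be replaced by `get_concept_from_term`, and the rows could be turned into a dataframe with `pd.DataFrame(rows)`.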


After running the API call for every term in the data, this is the result I got.

In [3]:
df_output = pd.read_csv("combined_output.csv")
df_output

Unnamed: 0,term,Response Code,Returned List,count
0,0,200.0,"[{'notation': ['000'], 'inScheme': [{'uri': 'h...",12.0
1,0,200.0,"[{'uri': 'http://dewey.info/class/0/e23/', 'mo...",12.0
2,0,200.0,"[{'notation': ['00'], 'inScheme': [{'uri': 'ht...",12.0
3,0,200.0,[],12.0
4,00,,,1.0
...,...,...,...,...
51185,zot,200.0,[],1.0
51186,zs95f,200.0,[],1.0
51187,zur,200.0,[],1.0
51188,zw315m,200.0,[],1.0


In [4]:
df_output.describe(include='all')

Unnamed: 0,term,Response Code,Returned List,count
count,51190.0,45487.0,45487,47074.0
unique,49510.0,,10602,
top,941.0,,[],
freq,5.0,,34762,
mean,,200.0,,15.718868
std,,0.0,,162.27175
min,,200.0,,1.0
25%,,200.0,,1.0
50%,,200.0,,1.0
75%,,200.0,,3.0


Here is the data after collecting the responses from the API. Interestingly, the "Returned List" column has only 10602 unique values, and the most frequent one is the empty list [], which occurs 34762 times.

Next, I check whether there are null values in the data.
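The numbers in the summary table can be reproduced with a few direct column operations. This is a small sketch on invented data (`toy_output` is a made-up frame); it assumes, as in the CSV, that the responses are stored as strings, so an empty response is the literal text `'[]'`.

```python
import pandas as pd

# Invented sample: two empty responses, two distinct non-empty ones, one missing.
toy_output = pd.DataFrame({
    "Returned List": [
        "[]",
        "[{'notation': ['2']}]",
        "[]",
        None,
        "[{'notation': ['4']}]",
    ]
})

col = toy_output["Returned List"]
n_unique = col.nunique()             # distinct non-null values, '[]' included
n_empty = int((col == "[]").sum())   # rows whose response is an empty list
n_missing = int(col.isna().sum())    # rows without any stored response

print(n_unique, n_empty, n_missing)
```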

In [10]:
non_null_count = df_output['Returned List'].notnull().sum()
total_count = len(df)
percentage_non_null = (non_null_count / total_count) * 100

print(f"Percentage of terms with values in 'Returned List': {percentage_non_null:.2f}%")


Percentage of terms with values in 'Returned List': 100.00%


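The number of non-null values in "Returned List" equals the number of input terms in df, which means the API returned some response for every term. Note that the percentage is computed relative to len(df); df_output itself has more rows (51190 in the summary above), and some of them, such as the row for term 00, have no stored response.

Now I check how many rows have a valid response and how many have an empty list.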

In [None]:
valid_responses = df_output['Returned List'].apply(lambda x: x != '[]' and pd.notnull(x))

valid_count = valid_responses.sum()

empty_count = (~valid_responses).sum()

total_count = len(df)


print(f"Total rows: {total_count}")
print(f"Number of rows with valid responses: {valid_count}")
print(f"Number of rows with empty lists: {empty_count}")


percentage_valid = (valid_count / total_count) * 100
percentage_empty = (empty_count / total_count) * 100

print(f"Percentage of rows with valid responses: {percentage_valid:.2f}%")
print(f"Percentage of rows with empty lists: {percentage_empty:.2f}%")

Total rows: 45487
Number of rows with valid responses: 10725
Number of rows with empty lists: 34762
Percentage of rows with valid responses: 23.58%
Percentage of rows with empty lists: 76.42%


The statistics above show that 23.58% of the terms are resolvable and got a non-empty response from the API, while 76.42% of the terms are not resolvable and got an empty response.

In [None]:
df_valid = df_output[valid_responses]
df_valid.head()

Unnamed: 0,term,Response Code,Returned List
0,2,200,"[{'modified': '2014-09-23', 'created': '2000-0..."
1,4,200,"[{'modified': '2005-11-02', 'created': '2000-0..."
2,1,200,"[{'uri': 'http://dewey.info/class/1/e23/', 'mo..."
3,3,200,"[{'uri': 'http://dewey.info/class/3/e23/', 'mo..."
4,5,200,[{'inScheme': [{'uri': 'http://bartoc.org/en/n...


The df_valid dataframe now contains only the data for resolvable terms.
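Since the responses are stored as text in the CSV, they need to be parsed before any field can be used. The sketch below assumes that each cell holds a complete Python-literal list of dictionaries (the table display above is truncated, so this is not verified here) and that a concept may carry a `notation` list. The function name `extract_notations` and the sample string are made up for illustration.

```python
import ast


def extract_notations(returned_list_text):
    """Parse one stored response and collect all notations found in it."""
    # literal_eval only evaluates literals, so it is safe for stored text.
    concepts = ast.literal_eval(returned_list_text)
    notations = []
    for concept in concepts:
        notations.extend(concept.get("notation", []))
    return notations


# Made-up sample in the same shape as the stored responses.
sample = "[{'notation': ['629.13'], 'uri': 'http://dewey.info/class/629.13/e23/'}]"
parsed = extract_notations(sample)
empty = extract_notations("[]")
print(parsed, empty)
```

On the real data this could be applied per row, for example with `df_valid['Returned List'].apply(extract_notations)`.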

In [27]:
df_valid.describe(include='all')

Unnamed: 0,term,Response Code,Returned List
count,10725.0,10725.0,10725
unique,10511.0,,10601
top,2.0,,"[{'notation': ['629.13'], 'prefLabel': {'de': ..."
freq,3.0,,2
mean,,200.0,
std,,0.0,
min,,200.0,
25%,,200.0,
50%,,200.0,
75%,,200.0,


There are 10725 resolvable terms in total. Notably, only 10601 values in "Returned List" are unique, which means that the API returned the same response for some terms. We can use these 10725 rows of data for our further work.
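To see which terms share a response, the duplicated responses can be grouped. The following sketch uses invented rows (`toy_valid`, including the term `629.130`, is made up for illustration); on the real data the same two steps would be applied to `df_valid`.

```python
import pandas as pd

# Invented rows: the first two terms share one response text.
toy_valid = pd.DataFrame({
    "term": ["629.13", "629.130", "2"],
    "Returned List": [
        "[{'notation': ['629.13']}]",
        "[{'notation': ['629.13']}]",
        "[{'notation': ['2']}]",
    ],
})

# keep=False marks every member of a duplicate group, not only the repeats.
shared = toy_valid[toy_valid.duplicated("Returned List", keep=False)]

# One list of terms per shared response.
groups = shared.groupby("Returned List")["term"].apply(list)
print(groups.tolist())
```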