In this notebook we get semantically related terms for the topics we are exploring in the paper using [ConceptNet](https://conceptnet.io/)'s [API](https://github.com/commonsense/conceptnet5/wiki/API). The data from the API is returned in JSON-LD format.

In [2]:
import requests
import pandas as pd
from nltk.corpus import stopwords  # Stopwords list
from nltk.tokenize import word_tokenize  # For tokenizing
from pandarallel import pandarallel  # For parallelizing pandas row operations

# 1. Introduction to ConceptNet's API

## Walk-through example

Simple example: getting the information associated with the term "example" from the network. For a complete guide to their API, see https://github.com/commonsense/conceptnet5/wiki/API.

In [3]:
obj = requests.get('http://api.conceptnet.io/c/en/example').json()
display(obj)

{'@context': ['http://api.conceptnet.io/ld/conceptnet5.7/context.ld.json'],
 '@id': '/c/en/example',
 'edges': [{'@id': '/a/[/r/Synonym/,/c/ca/exemple/n/wn/communication/,/c/en/example/n/wn/communication/]',
   '@type': 'Edge',
   'dataset': '/d/wordnet/3.1',
   'end': {'@id': '/c/en/example/n/wn/communication',
    '@type': 'Node',
    'label': 'example',
    'language': 'en',
    'sense_label': 'n, communication',
    'term': '/c/en/example'},
   'license': 'cc:by/4.0',
   'rel': {'@id': '/r/Synonym', '@type': 'Relation', 'label': 'Synonym'},
   'sources': [{'@id': '/s/resource/wordnet/rdf/3.1',
     '@type': 'Source',
     'contributor': '/s/resource/wordnet/rdf/3.1'}],
   'start': {'@id': '/c/ca/exemple/n/wn/communication',
    '@type': 'Node',
    'label': 'exemple',
    'language': 'ca',
    'sense_label': 'n, communication',
    'term': '/c/ca/exemple'},
   'surfaceText': '[[exemple]] is a translation of [[example]]',
   'weight': 2.0},
  {'@id': '/a/[/r/Synonym/,/c/ja/代表/n/wn/c

- Each node in ConceptNet can be retrieved by appending `/c/lan_code/term` to `http://api.conceptnet.io`, where `lan_code` stands for the language code. `/c/lan_code/term` is the node identifier within ConceptNet (the author calls it the "URI").
  - The URIs for terms (also known as "concepts") start with `/c/`, and follow a hierarchy from languages, to terms, to senses of terms with a particular part of speech.
  - Consider the term `/c/it/esempio/n`. This represents the Italian noun (`n`) "esempio".
  - `/c/it/esempio` represents the term "esempio" in Italian, whether it is a noun or not. You'll still find the results from /c/it/esempio/n when you browse it. Any term URI implicitly contains all its more specific URIs.
  - Phrases can also be looked up in ConceptNet by replacing spaces with underscores (`_`). For example, the phrase "french toast" is at `/c/en/french_toast`.
- Note that only the first 20 results related to the seed term are returned! To get more, follow the `nextPage` link to access the next page of results, or use a complex query with a higher `limit`.
- Explanation of the output:
  - `@context` links to a file of information that helps JSON-LD tools understand the API, and also comes with comments that may be helpful to humans. The context file explains, in RDF and in English, what the @-less property names like "edges" and "view" mean.
  - `@id`: identifier for the ConceptNet node (URI). The @id at the top level is the URI you just looked up. You'll find more @ids inside the "edges" list that you can use to browse to related information.
  - `view` (see the end of the output, outside the `edges` list): describes how a long list is paginated. It has an `@id` that links to the particular page of results that you're seeing, and `firstPage`, `previousPage`, and `lastPage` values that link to various pages. `"paginatedProperty": "edges"` tells you that `edges` is the list that you're browsing page by page.
  - `edges`: the list of edges linked to the input URI, returned one page at a time (see the note on the 20-result limit above).
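
As a rough illustration of the URI scheme described above, the sketch below builds a term URI from a phrase and splits a URI back into its parts. The helper names `concept_uri` and `split_concept_uri` are hypothetical (they are not part of ConceptNet's API), and the normalization shown (lowercasing, underscores for spaces) is a simplified assumption about how terms are written in URIs.

```python
def concept_uri(term: str, lang: str = "en") -> str:
    """Build a ConceptNet-style term URI (hypothetical helper, simplified normalization)."""
    # Lowercase and join words with underscores, e.g. "French toast" -> "french_toast"
    normalized = "_".join(term.lower().split())
    return f"/c/{lang}/{normalized}"


def split_concept_uri(uri: str) -> dict:
    """Split a term URI into its language, term, and optional sense parts."""
    parts = uri.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "c":
        raise ValueError(f"Not a term URI: {uri!r}")
    return {
        "language": parts[1],
        "term": parts[2],
        # Anything after the term (part of speech, sense source, ...) is kept as a list
        "sense": parts[3:],
    }


print(concept_uri("French toast"))           # /c/en/french_toast
print(split_concept_uri("/c/it/esempio/n"))  # {'language': 'it', 'term': 'esempio', 'sense': ['n']}
```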

There are three methods for accessing data through the ConceptNet 5 API: lookup, search, and association.
1. **Lookup** (example above) is for when you know the URI of an object in ConceptNet, and want to see a list of edges that include it.
2. **Search** finds a list of edges that match certain criteria.
3. **Association** is for finding concepts similar to a particular concept or a list of concepts. It relies on word embeddings, which are closer to an LLM, so it cannot be used for the paper.

In [4]:
print(obj.keys())
print(len(obj['edges']))

dict_keys(['@context', '@id', 'edges', 'version', 'view'])
20
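
Since only 20 edges come back per page, one way to collect more is to follow the `nextPage` link until it disappears. The sketch below assumes that `nextPage` sits inside `view` and is a path relative to `http://api.conceptnet.io` (like the `@id` values in the output above). `fetch_all_edges` and its `fetch` parameter are hypothetical names; `fetch` is injectable so that the loop can be tried offline with fake pages.

```python
import time

API_ROOT = "http://api.conceptnet.io"


def fetch_all_edges(path, fetch, max_pages=5, pause=1.0, sleep=time.sleep):
    """Follow `view.nextPage` links and collect edges (sketch, hypothetical helper).

    `fetch` is any callable that maps a full URL to the decoded JSON response.
    """
    edges = []
    for _ in range(max_pages):
        obj = fetch(API_ROOT + path)
        edges.extend(obj.get("edges", []))
        path = obj.get("view", {}).get("nextPage")
        if not path:  # No further page: we reached the end of the list
            break
        sleep(pause)  # Pause between pages to be polite to the public API
    return edges


# Offline demonstration with two fake pages standing in for API responses
fake_pages = {
    API_ROOT + "/c/en/example": {
        "edges": [{"@id": "e1"}, {"@id": "e2"}],
        "view": {"nextPage": "/c/en/example?offset=2&limit=2"},
    },
    API_ROOT + "/c/en/example?offset=2&limit=2": {
        "edges": [{"@id": "e3"}],
        "view": {},
    },
}
all_edges = fetch_all_edges("/c/en/example", fake_pages.__getitem__, sleep=lambda s: None)
print([e["@id"] for e in all_edges])  # ['e1', 'e2', 'e3']
```

With the real API, `fetch` could be something like `lambda url: requests.get(url).json()`.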


Getting the 3rd edge:

In [5]:
jap_example = obj['edges'][2]
display(jap_example)

{'@id': '/a/[/r/Synonym/,/c/ja/例/n/wn/cognition/,/c/en/example/n/wn/cognition/]',
 '@type': 'Edge',
 'dataset': '/d/wordnet/3.1',
 'end': {'@id': '/c/en/example/n/wn/cognition',
  '@type': 'Node',
  'label': 'example',
  'language': 'en',
  'sense_label': 'n, cognition',
  'term': '/c/en/example'},
 'license': 'cc:by/4.0',
 'rel': {'@id': '/r/Synonym', '@type': 'Relation', 'label': 'Synonym'},
 'sources': [{'@id': '/s/resource/wordnet/rdf/3.1',
   '@type': 'Source',
   'contributor': '/s/resource/wordnet/rdf/3.1'}],
 'start': {'@id': '/c/ja/例/n/wn/cognition',
  '@type': 'Node',
  'label': '例',
  'language': 'ja',
  'sense_label': 'n, cognition',
  'term': '/c/ja/例'},
 'surfaceText': '[[例]] is a translation of [[example]]',
 'weight': 2.0}

Description of the information contained in each of the `edges`:
- `@id`: complex URI which "uniquely describes this edge in terms of the nodes it connects and how it connects them." It contains the type of edge (see `rel` below), the `start` and `end` nodes, etc.
- `start` and `end`: the two nodes that the edge connects. In the lookup example above, `start` holds the attributes of the related node, while `end` holds the input term. Attributes inside `start` and `end`:
  - A human-readable `label`, which may be a more complete phrase such as "an example" instead of just the word "example" that appears in the URI.
  - `language`, the language code for what language the label is in (this is always the same as the language code that appears in its URI).
  - `term`, a link to the most general version of this term. In many cases this is just the same URI. If you've looked up a particular sense, such as the noun sense of "example" at /c/en/example/n, this links to the more general /c/en/example.
- `rel`: describes the type of relation that connects the input and the returned node. There are [34 types of relations](https://github.com/commonsense/conceptnet5/wiki/Relations). The `Synonym` type can just be a translation of the input term.
- `surfaceText`: Some of ConceptNet's data is extracted from natural-language text. The `surfaceText` value shows you what this text was.
- `sources`: "tells you why ConceptNet believes this information."
- `weight`: the weight value says how believable the information is. A typical weight is 1.0, and the number is higher when the information comes from more sources or more reliable sources.
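
Because the input term can sit at either end of an edge, it can help to pick out "the other node" explicitly. The sketch below compares the `term` field of `start` and `end` with the URI that was looked up. `other_node` is a hypothetical helper, and the sample edge is a trimmed copy of the one displayed above.

```python
def other_node(edge: dict, input_uri: str) -> dict:
    """Return the node of `edge` that sits opposite the node matching `input_uri`."""
    start, end = edge["start"], edge["end"]
    if start.get("term") == input_uri:
        return end
    if end.get("term") == input_uri:
        return start
    raise ValueError(f"{input_uri!r} is at neither end of the edge")


# Trimmed copy of the edge displayed above (only the fields used here)
sample_edge = {
    "start": {"@id": "/c/ja/例/n/wn/cognition", "label": "例", "language": "ja", "term": "/c/ja/例"},
    "end": {"@id": "/c/en/example/n/wn/cognition", "label": "example", "language": "en", "term": "/c/en/example"},
    "rel": {"@id": "/r/Synonym", "label": "Synonym"},
    "weight": 2.0,
}

node = other_node(sample_edge, "/c/en/example")
print(node["label"], node["language"])  # 例 ja
```

If both ends share the input term (a self-referential edge), this sketch simply returns the `end` node.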

In [6]:
print(jap_example['weight'])
print(jap_example['start']['language'])

2.0
ja


## Complex queries

To filter for specific information, you can give parameters to http://api.conceptnet.io/query, which gives you a list of matching edges.

You can specify any of the following parameters:
- **start**: a URI that the "start" or "subject" position must match.
- **end**: a URI that the "end" or "object" position must match.
- **rel**: a relation.
- **node**: a URI that must match either the start or the end.
- **other**: a URI that must match either the start or the end, and be different from node.
- **sources**: a URI that must match one of the sources of the edge.

Examples:
- To see all relations that connect "dog" and "bark": /query?node=/c/en/dog&other=/c/en/bark
- To see what the original OMCS dev team said about ferrets: /query?node=/c/en/ferret&sources=/s/contributor/omcs/dev
- To see assertions about cats (猫) that are entirely in Japanese: /query?node=/c/ja/猫&other=/c/ja
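
Query strings like the ones above can also be assembled programmatically. The following is a minimal sketch using the standard library's `urlencode`; `build_query_url` is a hypothetical helper, and keeping `/` unescaped is just a readability choice so that the result matches the examples above.

```python
from urllib.parse import urlencode

QUERY_ENDPOINT = "http://api.conceptnet.io/query"


def build_query_url(**params) -> str:
    """Build a /query URL from keyword parameters (hypothetical helper)."""
    # Drop unset parameters; keep "/" readable instead of percent-encoding it
    filtered = {key: value for key, value in params.items() if value is not None}
    return f"{QUERY_ENDPOINT}?{urlencode(filtered, safe='/')}"


print(build_query_url(node="/c/en/dog", other="/c/en/bark"))
# http://api.conceptnet.io/query?node=/c/en/dog&other=/c/en/bark
print(build_query_url(start="/c/en/apple", rel="/r/ExternalURL", limit=1000))
# http://api.conceptnet.io/query?start=/c/en/apple&rel=/r/ExternalURL&limit=1000
```

Non-ASCII characters (such as 猫) are percent-encoded by `urlencode`.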

Here's an example of using the API to get all external Linked Data items that are connected to the ConceptNet term "apple":

In [7]:
response = requests.get('http://api.conceptnet.io/query?start=/c/en/apple&rel=/r/ExternalURL&limit=1000')
obj = response.json()
[edge['end']['@id'] for edge in obj['edges']]

['http://wikidata.dbpedia.org/resource/Q89',
 'http://dbpedia.org/resource/Apple',
 'http://wordnet-rdf.princeton.edu/wn31/112654755-n',
 'http://wordnet-rdf.princeton.edu/wn31/107755101-n',
 'http://sw.opencyc.org/2012/05/10/concept/en/Apple',
 'http://fr.wiktionary.org/wiki/apple',
 'http://en.wiktionary.org/wiki/apple',
 'http://en.wiktionary.org/wiki/Apple']

In [8]:
display(obj)

{'@context': ['http://api.conceptnet.io/ld/conceptnet5.7/context.ld.json'],
 '@id': '/query?start=/c/en/apple&rel=/r/ExternalURL',
 'edges': [{'@id': '/a/[/r/ExternalURL/,/c/en/apple/,/http://wikidata.dbpedia.org/resource/Q89/]',
   '@type': 'Edge',
   'dataset': '/d/dbpedia/en',
   'end': {'@id': 'http://wikidata.dbpedia.org/resource/Q89',
    '@type': 'Node',
    'label': 'Q89',
    'path': '/resource/Q89',
    'site': 'wikidata.dbpedia.org',
    'site_available': False,
    'term': 'http://wikidata.dbpedia.org/resource/Q89'},
   'license': 'cc:by-sa/4.0',
   'rel': {'@id': '/r/ExternalURL',
    '@type': 'Relation',
    'label': 'ExternalURL'},
   'sources': [{'@id': '/s/resource/dbpedia/2015/en',
     '@type': 'Source',
     'contributor': '/s/resource/dbpedia/2015/en'}],
   'start': {'@id': '/c/en/apple',
    '@type': 'Node',
    'label': 'apple',
    'language': 'en',
    'term': '/c/en/apple'},
   'surfaceText': None,
   'weight': 1.0},
  {'@id': '/a/[/r/ExternalURL/,/c/en/apple/

Below, we get up to 1000 edges that start at "apple", with any type of relation.

In [9]:
response = requests.get('http://api.conceptnet.io/query?start=/c/en/apple&limit=1000')
obj = response.json()
display(obj)

{'@context': ['http://api.conceptnet.io/ld/conceptnet5.7/context.ld.json'],
 '@id': '/query?start=/c/en/apple',
 'edges': [{'@id': '/a/[/r/RelatedTo/,/c/en/apple/,/c/en/fruit/]',
   '@type': 'Edge',
   'dataset': '/d/verbosity',
   'end': {'@id': '/c/en/fruit',
    '@type': 'Node',
    'label': 'fruit',
    'language': 'en',
    'term': '/c/en/fruit'},
   'license': 'cc:by/4.0',
   'rel': {'@id': '/r/RelatedTo', '@type': 'Relation', 'label': 'RelatedTo'},
   'sources': [{'@id': '/and/[/s/process/split_words/,/s/resource/verbosity/]',
     '@type': 'Source',
     'contributor': '/s/resource/verbosity',
     'process': '/s/process/split_words'},
    {'@id': '/s/resource/verbosity',
     '@type': 'Source',
     'contributor': '/s/resource/verbosity'}],
   'start': {'@id': '/c/en/apple',
    '@type': 'Node',
    'label': 'apple',
    'language': 'en',
    'term': '/c/en/apple'},
   'surfaceText': '[[apple]] is related to [[fruit]]',
   'weight': 12.80968383684781},
  {'@id': '/a/[/r/HasPro

Note that, in complex queries that use the `start` parameter, `end` points to the term associated with the input term (which is more intuitive).

In [10]:
print(len(obj['edges']))

709


In [11]:
edges = [edge for edge in obj['edges']]
display(edges)

[{'@id': '/a/[/r/RelatedTo/,/c/en/apple/,/c/en/fruit/]',
  '@type': 'Edge',
  'dataset': '/d/verbosity',
  'end': {'@id': '/c/en/fruit',
   '@type': 'Node',
   'label': 'fruit',
   'language': 'en',
   'term': '/c/en/fruit'},
  'license': 'cc:by/4.0',
  'rel': {'@id': '/r/RelatedTo', '@type': 'Relation', 'label': 'RelatedTo'},
  'sources': [{'@id': '/and/[/s/process/split_words/,/s/resource/verbosity/]',
    '@type': 'Source',
    'contributor': '/s/resource/verbosity',
    'process': '/s/process/split_words'},
   {'@id': '/s/resource/verbosity',
    '@type': 'Source',
    'contributor': '/s/resource/verbosity'}],
  'start': {'@id': '/c/en/apple',
   '@type': 'Node',
   'label': 'apple',
   'language': 'en',
   'term': '/c/en/apple'},
  'surfaceText': '[[apple]] is related to [[fruit]]',
  'weight': 12.80968383684781},
 {'@id': '/a/[/r/HasProperty/,/c/en/apple/,/c/en/red/]',
  '@type': 'Edge',
  'dataset': '/d/conceptnet/4/en',
  'end': {'@id': '/c/en/red',
   '@type': 'Node',
   'labe

Useful things to be extracted for each edge:
1. `label` of `end` node.
2. Number of terms in the `label` (i.e., length of the split string).
3. `term` of `end` node (as a reference).
4. Type of relation: `label` in `rel`.
5. `weight`.
6. `language` in `end` node.

In [12]:
[edge['end']['label'] for edge in edges]

['fruit',
 'red',
 'red',
 'green',
 'eaten',
 'apple tree',
 'fall from a tree',
 'eating',
 'a core',
 'red fruit',
 'green',
 'core',
 'eve',
 'tree',
 'computer',
 'round',
 'a grocery store',
 'mac',
 'adam',
 'pie',
 'pear',
 'macintosh',
 'seeds',
 'eden',
 'stem',
 'trees',
 'orange',
 'delicious',
 'food',
 'adam eve',
 'skin',
 'sate hunger',
 'pome',
 'computer brand',
 'many cooking uses',
 'a skin',
 'apple tree',
 'apple tree',
 'jabuka',
 'apple',
 'making apple pie',
 'edible fruit',
 'seeds inside',
 'orange',
 'pies',
 'forbidden',
 'red green',
 'cider',
 'green fruit',
 'orange',
 'tree fruit',
 'teachers',
 'smith',
 'granny smith',
 'granny',
 'round fruit',
 'red delicious',
 'peel',
 'newton',
 'mac computer',
 'crunchy',
 'ate',
 'eden fruit',
 'forbidden fruit',
 'pips',
 'adams',
 'doctor',
 'orchard',
 'ball',
 'company',
 'cider fruit',
 'ፖም',
 'pome',
 'appel',
 'æble',
 'mansanas',
 'āporo',
 'pomme',
 'સફરજન',
 'making pie',
 'ябко',
 '苹果',
 'dessert',
 

In [13]:
[edge['weight'] for edge in edges]

[12.80968383684781,
 9.591663046625438,
 9.309350138436088,
 7.211102550927979,
 6.32455532033676,
 5.656854249492381,
 5.291502622129181,
 5.291502622129181,
 4.898979485566356,
 4.827835954131002,
 4.44837048816755,
 4.441171016747722,
 4.315089802078283,
 4.297441099072796,
 4.164132562731402,
 4.144393803682271,
 4.0,
 3.887415593938986,
 3.6149688795340964,
 3.604441704342019,
 3.0672463220289305,
 2.9933259094191538,
 2.93393933134276,
 2.9318935860634507,
 2.6877499883731746,
 2.6765649627834556,
 2.571380951940027,
 2.42404620418011,
 2.318620279390311,
 2.2045407685048604,
 2.1762352813976715,
 2.0,
 2.0,
 2.0,
 2.0,
 2.0,
 2.0,
 2.0,
 2.0,
 2.0,
 2.0,
 2.0,
 2.0,
 1.8489999999999998,
 1.843,
 1.8120000000000003,
 1.775,
 1.767,
 1.7320000000000002,
 1.69,
 1.638,
 1.62,
 1.5259999999999998,
 1.5259999999999998,
 1.5259999999999998,
 1.48,
 1.396,
 1.3840000000000003,
 1.3070000000000004,
 1.302,
 1.2800000000000002,
 1.255,
 1.2409999999999997,
 1.2350000000000003,
 1.2199999

## Looking up related terms

This API endpoint uses word embeddings built from ConceptNet and other inputs to find related terms. The embeddings are a version of [ConceptNet Numberbatch](https://github.com/commonsense/conceptnet-numberbatch) with a reduced vocabulary that makes it more reasonable to load on the server. Because this approach is closer to an LLM, we should not use it for the paper.

## Rate limits

You can make 3600 requests per hour to the ConceptNet API, with bursts of 120 requests per minute allowed. The /related and /relatedness endpoints count as two requests when you call them.

This means you should design your usage of the API to average less than 1 request per second.
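
One simple way to respect this limit is to enforce a minimum interval between consecutive requests. The class below is a sketch under that assumption, not something provided by ConceptNet or `requests`; `Throttle` and its parameters are hypothetical names, and the clock and sleep functions are injectable so that the behavior can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (sketch, hypothetical helper)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until at least `min_interval` seconds have passed since the last call."""
        now = self.clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last_call = now


# Offline demonstration with a fake clock, so nothing actually sleeps
fake_time = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_time[0] += seconds


throttle = Throttle(min_interval=1.0, clock=lambda: fake_time[0], sleep=fake_sleep)
throttle.wait()       # First call: no waiting
fake_time[0] += 0.25  # Pretend the request took 0.25 s
throttle.wait()       # Second call: waits for the remaining 0.75 s
print(slept)          # [0.75]
```

Calling `throttle.wait()` before each `requests.get(...)` would keep the average at or below one request per second; calls to `/related` and `/relatedness` would need twice the interval, since they count double.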

# 2. Lookup of associated terms from ConceptNet's semantic network

## Function and example usage

Below, we define a function for retrieving the information that we are interested in for the related terms.

In [14]:
def related_conceptnet(term: str, max_rel_terms: int) -> pd.DataFrame:
    """
    Query ConceptNet for the edges that start at an English term and
    return the fields of interest for each edge as a DataFrame.
    """
    # ConceptNet URIs are lowercase, with underscores instead of spaces (for phrases)
    term = term.lower().replace(' ', '_')

    api_query = f'http://api.conceptnet.io/query?start=/c/en/{term}&limit={max_rel_terms}'

    print(f"Requesting: {api_query}")  # Debug print to see the actual URL

    response = requests.get(api_query)
    response.raise_for_status()  # Stop early on HTTP errors instead of parsing an error page
    obj = response.json()

    nan = float('nan')  # Placeholder for missing fields
    rows = []

    # Extract fields of interest from each edge, tolerating missing fields
    for edge in obj['edges']:
        end = edge.get('end') or {}
        rel = edge.get('rel') or {}
        rows.append({
            'label': end.get('label', nan),           # 1. Name of the associated term
            'language': end.get('language', nan),     # 2. Language of the node at the end of the edge
            'weight': edge.get('weight', nan),        # 3. Weight of the edge (higher -> closer association)
            'type_relation': rel.get('label', nan),   # 4. Type of relation with the input term
            'term_conceptnet': end.get('term', nan),  # 5. Link to append to ConceptNet API for getting the term
            'input_term': term                        # 6. Input term as a reference
        })

    # Construct data frame with related terms (fixed column order, even with no edges)
    columns = ['label', 'language', 'weight', 'type_relation', 'term_conceptnet', 'input_term']
    return pd.DataFrame(rows, columns = columns)

In [15]:
# Example usage

df = related_conceptnet(term = 'apple', max_rel_terms = 50)

df

Requesting: http://api.conceptnet.io/query?start=/c/en/apple&limit=50


Unnamed: 0,label,language,weight,type_relation,term_conceptnet,input_term
0,fruit,en,12.809684,RelatedTo,/c/en/fruit,apple
1,red,en,9.591663,HasProperty,/c/en/red,apple
2,red,en,9.30935,RelatedTo,/c/en/red,apple
3,green,en,7.211103,HasProperty,/c/en/green,apple
4,eaten,en,6.324555,ReceivesAction,/c/en/eaten,apple
5,apple tree,en,5.656854,AtLocation,/c/en/apple_tree,apple
6,fall from a tree,en,5.291503,CapableOf,/c/en/fall_from_tree,apple
7,eating,en,5.291503,UsedFor,/c/en/eating,apple
8,a core,en,4.898979,HasA,/c/en/core,apple
9,red fruit,en,4.827836,RelatedTo,/c/en/red_fruit,apple


## Application to specific topics

Given the information that we have about the World Values Survey, for consistency we will search for information on 6 different topics:
1. Financial situation (i.e., "money").
2. Confidence in the police ("police").
3. Confidence in the parliament ("parliament" or "politics").
4. Confidence in the courts ("justice").
5. Pride in nationality ("US", "country").
6. Subjective perception of health ("health").

Topics 2 to 4 could also be aggregated under "authorities".

In [16]:
# Topics of interest
topics = ['money', 
          'police', 'parliament', 'politics', 'justice', 'authority',
          'United States', 'country',
          'health']

df_dict = {} # Initialize dictionary to access DataFrames by name

for topic in topics:
    
    # Create a clean name by replacing spaces with underscores
    clean_topic = topic.replace(' ', '_').lower()
    df_name = f"df_{clean_topic}"
    
    # Retrieve up to 500 related terms for this topic from ConceptNet
    df = related_conceptnet(term = topic, max_rel_terms = 500)
    
    # Store in dictionary for easy access by name
    df_dict[df_name] = df

Requesting: http://api.conceptnet.io/query?start=/c/en/money&limit=500
Requesting: http://api.conceptnet.io/query?start=/c/en/police&limit=500
Requesting: http://api.conceptnet.io/query?start=/c/en/parliament&limit=500
Requesting: http://api.conceptnet.io/query?start=/c/en/politics&limit=500
Requesting: http://api.conceptnet.io/query?start=/c/en/justice&limit=500
Requesting: http://api.conceptnet.io/query?start=/c/en/authority&limit=500
Requesting: http://api.conceptnet.io/query?start=/c/en/united_states&limit=500
Requesting: http://api.conceptnet.io/query?start=/c/en/country&limit=500
Requesting: http://api.conceptnet.io/query?start=/c/en/health&limit=500


## Exploration of results

In [17]:
print(df_dict.keys())

dict_keys(['df_money', 'df_police', 'df_parliament', 'df_politics', 'df_justice', 'df_authority', 'df_united_states', 'df_country', 'df_health'])


In [18]:
money_df = df_dict['df_money']
money_df

Unnamed: 0,label,language,weight,type_relation,term_conceptnet,input_term
0,a bank,en,7.745967,AtLocation,/c/en/bank,money
1,a wallet,en,7.483315,AtLocation,/c/en/wallet,money
2,cash,en,5.338164,RelatedTo,/c/en/cash,money
3,currency,en,5.214211,RelatedTo,/c/en/currency,money
4,a pocket,en,4.898979,AtLocation,/c/en/pocket,money
...,...,...,...,...,...,...
495,income,en,0.270000,RelatedTo,/c/en/income,money
496,green bill,en,0.262000,RelatedTo,/c/en/green_bill,money
497,drawer,en,0.261000,RelatedTo,/c/en/drawer,money
498,dollar euro,en,0.261000,RelatedTo,/c/en/dollar_euro,money


In [19]:
police_df = df_dict['df_police']
police_df

Unnamed: 0,label,language,weight,type_relation,term_conceptnet,input_term
0,tail a suspect,en,3.464102,CapableOf,/c/en/tail_suspect,police
1,donut shop,en,2.828427,AtLocation,/c/en/donut_shop,police
2,milicija,sh,2.000000,Synonym,/c/sh/milicija,police
3,patrol,en,2.000000,Synonym,/c/en/patrol,police
4,law enforcement agency,en,2.000000,IsA,/c/en/law_enforcement_agency,police
...,...,...,...,...,...,...
242,post punk,en,0.500000,genre,/c/en/post_punk,police
243,band,en,0.500000,IsA,/c/en/band,police
244,police,,0.250000,ExternalURL,http://en.wiktionary.org/wiki/police,police
245,police,,0.250000,ExternalURL,http://fr.wiktionary.org/wiki/police,police


In [20]:
health_df = df_dict['df_health']
health_df

Unnamed: 0,label,language,weight,type_relation,term_conceptnet,input_term
0,being,en,2.474672,RelatedTo,/c/en/being,health
1,well,en,2.442949,RelatedTo,/c/en/well,health
2,condition of the body,en,2.000000,DefinedAs,/c/en/condition_of_body,health
3,wellbeing,en,2.000000,IsA,/c/en/wellbeing,health
4,condition,en,2.000000,IsA,/c/en/condition,health
...,...,...,...,...,...,...
295,good body,en,0.112000,RelatedTo,/c/en/good_body,health
296,body quality,en,0.105000,RelatedTo,/c/en/body_quality,health
297,good nutrition,en,0.103000,RelatedTo,/c/en/good_nutrition,health
298,important,en,0.102000,RelatedTo,/c/en/important,health


In [21]:
us_df = df_dict['df_united_states']
us_df

Unnamed: 0,label,language,weight,type_relation,term_conceptnet,input_term
0,north america,en,4.472136,AtLocation,/c/en/north_america,united_states
1,a map,en,3.464102,AtLocation,/c/en/map,united_states
2,a country in North America,en,2.828427,IsA,/c/en/country_in_north_america,united_states
3,the western hemisphere,en,2.828427,AtLocation,/c/en/western_hemisphere,united_states
4,America,en,2.828427,AtLocation,/c/en/america,united_states
...,...,...,...,...,...,...
195,tax,en,0.500000,RelatedTo,/c/en/tax,united_states
196,united states,en,0.500000,Synonym,/c/en/united_states,united_states
197,country,en,0.500000,IsA,/c/en/country,united_states
198,United States,,0.250000,ExternalURL,http://fr.wiktionary.org/wiki/United_States,united_states


## Data cleaning

In [22]:
# Concatenate all data frames in a single one
df = pd.concat(df_dict, ignore_index = True)
df

Unnamed: 0,label,language,weight,type_relation,term_conceptnet,input_term
0,a bank,en,7.745967,AtLocation,/c/en/bank,money
1,a wallet,en,7.483315,AtLocation,/c/en/wallet,money
2,cash,en,5.338164,RelatedTo,/c/en/cash,money
3,currency,en,5.214211,RelatedTo,/c/en/currency,money
4,a pocket,en,4.898979,AtLocation,/c/en/pocket,money
...,...,...,...,...,...,...
2374,good body,en,0.112000,RelatedTo,/c/en/good_body,health
2375,body quality,en,0.105000,RelatedTo,/c/en/body_quality,health
2376,good nutrition,en,0.103000,RelatedTo,/c/en/good_nutrition,health
2377,important,en,0.102000,RelatedTo,/c/en/important,health


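First, we want to avoid near-duplicates caused by stopwords: for instance, we don't want to keep both "a bank" and "bank", so the idea is to keep only the one with the highest weight. The helper below removes stopwords from a label. Note that the cell that applies it is currently commented out, so the deduplication further down runs on the raw labels.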

In [23]:
def remove_stopwords(text, rm_stopwords = False, stopword_set = None):
    """
    Preprocess text by tokenizing it and, optionally, removing stopwords.
    """
    tokens = word_tokenize(text)
    # Remove stopwords if desired (nothing is removed if no stopword set is given)
    if rm_stopwords:
        stopword_set = stopword_set or set()
        tokens = [token for token in tokens if token not in stopword_set]
    # We return the whole string of tokens so that we can find n-grams later
    return " ".join(tokens)

In [24]:
# my_stop_words = set(stopwords.words('english'))

# df['label_nostopwords'] = df['label'].apply(
#     lambda row: remove_stopwords(text = row, rm_stopwords=True, stopword_set=my_stop_words)
#     )

# df
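
The deduplication idea described above (treating "a bank" and "bank" as the same concept and keeping the one with the highest weight) could be sketched as follows. This is not the code used for the results in this notebook: the tiny stopword set, the helper name `dedupe_by_normalized_label`, and the toy rows (with made-up weights) are assumptions for illustration, and plain dictionaries stand in for DataFrame rows.

```python
SMALL_STOPWORDS = {"a", "an", "the"}  # Illustrative only; NLTK's English list is much longer


def dedupe_by_normalized_label(rows, stopword_set=SMALL_STOPWORDS):
    """Keep, per (input_term, label without stopwords), the row with the highest weight."""
    best = {}
    for row in rows:
        tokens = [tok for tok in row["label"].lower().split() if tok not in stopword_set]
        key = (row["input_term"], " ".join(tokens))
        if key not in best or row["weight"] > best[key]["weight"]:
            best[key] = row
    # Return the surviving rows sorted by decreasing weight
    return sorted(best.values(), key=lambda row: row["weight"], reverse=True)


# Toy rows with made-up weights
rows = [
    {"label": "a bank", "weight": 7.75, "input_term": "money"},
    {"label": "bank", "weight": 3.10, "input_term": "money"},
    {"label": "cash", "weight": 5.34, "input_term": "money"},
]
for row in dedupe_by_normalized_label(rows):
    print(row["label"], row["weight"])
# a bank 7.75
# cash 5.34
```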

In [25]:
# Remove "false relations" - ExternalURL (see https://github.com/commonsense/conceptnet5/wiki/Relations)
df = df[df['type_relation'] != 'ExternalURL']

# Keep only labels in English and Spanish (2nd most spoken language in US homes - see https://www.census.gov/library/stories/2022/12/languages-we-speak-in-united-states.html)
df = df[(df['language'] == 'en') | (df['language'] == 'es')]

# Drop duplicate labels within each topic, based on the raw label (keeps only the first
# term, i.e., the one with the highest weight per topic)
df = df.drop_duplicates(subset = ['label', 'input_term'], keep = 'first').copy()

# Create a column with the number of words in the label
df['num_words'] = df['label'].apply(lambda x: len(x.split()) if isinstance(x, str) else float('nan'))

# Keep only uni-grams, bi-grams and tri-grams (relatively simple concepts)
df = df[df['num_words'] <= 3]

df

Unnamed: 0,label,language,weight,type_relation,term_conceptnet,input_term,num_words
0,a bank,en,7.745967,AtLocation,/c/en/bank,money,2
1,a wallet,en,7.483315,AtLocation,/c/en/wallet,money,2
2,cash,en,5.338164,RelatedTo,/c/en/cash,money,1
3,currency,en,5.214211,RelatedTo,/c/en/currency,money,1
4,a pocket,en,4.898979,AtLocation,/c/en/pocket,money,2
...,...,...,...,...,...,...,...
2371,our body,en,0.141000,RelatedTo,/c/en/our_body,health,2
2373,death,en,0.121000,Antonym,/c/en/death,health,1
2374,good body,en,0.112000,RelatedTo,/c/en/good_body,health,2
2375,body quality,en,0.105000,RelatedTo,/c/en/body_quality,health,2


Now, the idea is to select a subset of the resulting terms. The criteria that we will apply are the following:
- Select top 40 terms per topic, as described by the weights.
- Select all `Antonym`s and `DistinctFrom` terms for a topic. This is done to avoid missing some relevant concepts related to a topic (e.g., "death" is considered as an antonym for "health", but it has a very low weight, lower than 1, even if it is closely connected).

In [26]:
# Step 1: Select top 40 terms per input_term based on weight
df_top40 = df.sort_values(['input_term', 'weight'], ascending=[True, False]).groupby('input_term').head(40)

# Step 2: Select all "Antonym" and "DistinctFrom" terms
df_antonym_distinct = df[df['type_relation'].isin(["Antonym", "DistinctFrom"])]

# Step 3: Combine both DataFrames and drop duplicates (terms with high weight and antonyms/distinctfrom)
df_final = pd.concat([df_top40, df_antonym_distinct]).drop_duplicates()

# Step 4: Sort by topic, and then by weight
df_final = df_final.sort_values(['input_term', 'weight'], ascending=[True, False]).reset_index(drop = True)

# Display the final result
df_final

Unnamed: 0,label,language,weight,type_relation,term_conceptnet,input_term,num_words
0,agency,en,2.0,Synonym,/c/en/agency,authority,1
1,control,en,2.0,IsA,/c/en/control,authority,1
2,expert,en,2.0,IsA,/c/en/expert,authority,1
3,permission,en,2.0,IsA,/c/en/permission,authority,1
4,person,en,2.0,IsA,/c/en/person,authority,1
...,...,...,...,...,...,...,...
314,military,en,1.0,HasContext,/c/en/military,united_states,1
315,estados unidos,es,1.0,Synonym,/c/es/estados_unidos,united_states,2
316,50 States,en,1.0,HasA,/c/en/50_states,united_states,2
317,protect you,en,1.0,UsedFor,/c/en/protect,united_states,2


In [27]:
# Count number of terms per topic ("input_term")
print(df_final['input_term'].value_counts())

input_term
health           44
country          43
justice          41
police           40
money            40
united_states    40
parliament       30
politics         24
authority        17
Name: count, dtype: int64


## Exporting CSV

In [28]:
# Keep only columns of interest in right order
columns_ordered = ['label', 'num_words', 'language', 'weight', 'type_relation', 'term_conceptnet', 'input_term']
df_final[columns_ordered].to_csv('concept_net_terms.csv', index = False)

Additional preprocessing and exploration (if desired):

In [29]:
df_final = pd.read_csv('concept_net_terms.csv')

df_final

Unnamed: 0,label,num_words,language,weight,type_relation,term_conceptnet,input_term
0,agency,1,en,2.0,Synonym,/c/en/agency,authority
1,control,1,en,2.0,IsA,/c/en/control,authority
2,expert,1,en,2.0,IsA,/c/en/expert,authority
3,permission,1,en,2.0,IsA,/c/en/permission,authority
4,person,1,en,2.0,IsA,/c/en/person,authority
...,...,...,...,...,...,...,...
314,military,1,en,1.0,HasContext,/c/en/military,united_states
315,estados unidos,2,es,1.0,Synonym,/c/es/estados_unidos,united_states
316,50 States,2,en,1.0,HasA,/c/en/50_states,united_states
317,protect you,2,en,1.0,UsedFor,/c/en/protect,united_states


In [30]:
# Step 1: Select top 10 terms per input_term based on weight
df_top10 = df_final.sort_values(['input_term', 'weight'], ascending=[True, False]).groupby('input_term').head(10)

# # Step 2: Select all "Antonym" and "DistinctFrom" terms
# df_antonym_distinct = df_final[df_final['type_relation'].isin(["Antonym", "DistinctFrom"])]

# # Step 3: Combine both DataFrames and drop duplicates (terms with high weight and antonyms/distinctfrom)
# df_final = pd.concat([df_top10, df_antonym_distinct]).drop_duplicates()

# # Step 4: Sort by topic, and then by weight
# df_final = df_final.sort_values(['input_term', 'weight'], ascending=[True, False]).reset_index(drop = True)

df_top10

Unnamed: 0,label,num_words,language,weight,type_relation,term_conceptnet,input_term
0,agency,1,en,2.0,Synonym,/c/en/agency,authority
1,control,1,en,2.0,IsA,/c/en/control,authority
2,expert,1,en,2.0,IsA,/c/en/expert,authority
3,permission,1,en,2.0,IsA,/c/en/permission,authority
4,person,1,en,2.0,IsA,/c/en/person,authority
...,...,...,...,...,...,...,...
284,freedom and refuge,3,en,2.0,UsedFor,/c/en/freedom_and_refuge,united_states
285,fifty states,2,en,2.0,HasA,/c/en/fifty_states,united_states
286,living in,2,en,2.0,UsedFor,/c/en/living_in,united_states
287,protection,1,en,2.0,UsedFor,/c/en/protection,united_states


In [31]:
df_top10 = df_top10[['label', 'input_term']]

df_top10

Unnamed: 0,label,input_term
0,agency,authority
1,control,authority
2,expert,authority
3,permission,authority
4,person,authority
...,...,...
284,freedom and refuge,united_states
285,fifty states,united_states
286,living in,united_states
287,protection,united_states


Convert to LaTeX format to show in the paper.

In [32]:
# Create a helper column to preserve the order within each input_term group
# (assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
df_top10 = df_top10.assign(order = df_top10.groupby('input_term').cumcount())

# Pivot the DataFrame so that each column corresponds to an input_term
df_wide = df_top10.pivot(index='order', columns='input_term', values='label')

df_wide


order,authority,country,health,justice,money,parliament,police,politics,united_states
0,agency,nation,being,natural virtue,a bank,United Kingdom,tail a suspect,activity,north america
1,control,land,well,court,a wallet,legislature,donut shop,profession,a map
2,expert,states,wellbeing,Department of Justice,cash,fantan,patrol,social science,the western hemisphere
3,permission,state,condition,judgment,currency,ordinal number,law enforcement agency,government,America
4,person,a CONTINENT,good,judge,a pocket,judicial,force,affairs,North America
5,book,usa,well being,righteousness,dollars,debate,direct traffic,politics,freedom and refuge
6,assurance,place,body,just,coins,gingerbread,guard banks,opinion,fifty states
7,enforce,music,healthiness,judiciary,value,legislative organization,government,social relation,living in
8,authoritarian,america,sanidad,justeza,paper,historical,free a prisoner,methodology,protection
9,power,continent,heartiness,certain rules,dollar,flock,hurry drivers,social event,United States government


In [33]:
# Keep only a subset of columns (so that the table fits into the LaTeX file)
print(df_wide[['health', 'justice', 'money', 'politics']].to_latex())

\begin{tabular}{lllll}
\toprule
input_term & health & justice & money & politics \\
order &  &  &  &  \\
\midrule
0 & being & natural virtue & a bank & activity \\
1 & well & court & a wallet & profession \\
2 & wellbeing & Department of Justice & cash & social science \\
3 & condition & judgment & currency & government \\
4 & good & judge & a pocket & affairs \\
5 & well being & righteousness & dollars & politics \\
6 & body & just & coins & opinion \\
7 & healthiness & judiciary & value & social relation \\
8 & sanidad & justeza & paper & methodology \\
9 & heartiness & certain rules & dollar & social event \\
\bottomrule
\end{tabular}



# 3. Attribution 

This work includes data from ConceptNet 5, which was compiled by the Commonsense Computing Initiative. ConceptNet 5 is freely available under the Creative Commons Attribution-ShareAlike license (CC BY SA 4.0) from https://conceptnet.io. The included data was created by contributors to Commonsense Computing projects, contributors to Wikimedia projects, Games with a Purpose, Princeton University's WordNet, DBPedia, OpenCyc, and Umbel. 