# Imports

In [1]:
import json
import nltk

# Read the data
* All of our data is stored as JSON objects, one per line, in the file `ARXIV_TOPICS_TITLES.txt`, which we load into memory.
* Each JSON object contains a `title`, an `abstract`, and `topics`, the list of topics assigned to that title.
* The dictionary `title_to_topics` maps each title (lowercased and tokenized) to its topics.
* The dictionary `idict` stores how often each individual topic occurs.

In [12]:
idict = {}
title_to_topics = {}
with open("ARXIV_TOPICS_TITLES.txt") as f:
    for line in f:
        j = json.loads(line)
        title = " ".join(nltk.word_tokenize(j["title"].lower()))
        title_to_topics[title] = j["topics"]
        for t in j["topics"]:
            idict[t] = idict.get(t, 0) + 1
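
The loading step can be illustrated on a couple of in-memory records. This is only a minimal sketch: the sample lines are made up, `normalize` is a hypothetical stand-in that splits on whitespace instead of calling `nltk.word_tokenize`, and `collections.Counter` replaces the manual `idict.get(t, 0) + 1` bookkeeping.

```python
import json
from collections import Counter

# Hypothetical sample lines in the same shape as ARXIV_TOPICS_TITLES.txt.
sample_lines = [
    '{"title": "Deep Learning for Vision", "abstract": "placeholder", '
    '"topics": ["Computer vision", "Deep learning"]}',
    '{"title": "Graph Kernels", "abstract": "placeholder", '
    '"topics": ["Kernel method", "Deep learning"]}',
]


def normalize(text):
    # Assumption: whitespace splitting stands in for nltk.word_tokenize,
    # which would additionally separate punctuation into its own tokens.
    return " ".join(text.lower().split())


topic_counts = Counter()   # plays the role of idict
title_to_topics = {}
for line in sample_lines:
    record = json.loads(line)
    title_to_topics[normalize(record["title"])] = record["topics"]
    topic_counts.update(record["topics"])  # adds 1 for every topic in the list
```

Normalizing the title before using it as a key matters because the later lookup only succeeds when both datasets are normalized the same way.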

# Pruning the topics
We keep only topics that occur at least 501 times across the entire dataset. We also want to drop the most frequent topics, so we sort by frequency in descending order and remove the top 8.

In [13]:
topics = [(k, v) for k, v in idict.items() if v > 500]
topics.sort(reverse=True, key=lambda x: x[1])
topics = topics[8:]
topics = [k for k, _ in topics]
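
To see what the two pruning steps do, here is a small sketch on toy counts. The helper `prune_topics` and its parameter names are hypothetical, and the thresholds are scaled down so the effect is visible on five topics.

```python
def prune_topics(counts, min_count=500, drop_top=8):
    """Keep topics with count > min_count, then drop the drop_top most frequent."""
    frequent = [(topic, count) for topic, count in counts.items() if count > min_count]
    frequent.sort(key=lambda pair: pair[1], reverse=True)
    return [topic for topic, _ in frequent[drop_top:]]


# Toy frequencies; "e" is too rare, "a" and "b" are too frequent.
toy_counts = {"a": 10, "b": 7, "c": 5, "d": 2, "e": 1}
kept = prune_topics(toy_counts, min_count=1, drop_top=2)
# kept is ["c", "d"]
```

With the defaults `min_count=500` and `drop_top=8`, the function mirrors the cell above.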

In [16]:
topics[0]

'Computer vision'

# Stringify topics and sanity print

In [18]:
topics = list(map(str, topics))
topics[:5], len(topics)

# Filtering out important topics
* Load all the titles and abstracts from our original dataset `ARXIV-CORPUS-COMPLETE-50k.txt`, in which title and abstract lines alternate.
* Iterate over the loaded titles and, for each one, check whether it is present in the Semantic Scholar dataset we loaded previously (`title_to_topics`).
* For the titles that exist, keep only the topics that appear in the list of 76 important topics we found earlier, and skip any record left with no topics.
* Write this data to a new file, `FINAL-DATA-WITH-TOPICS.txt`, as one JSON object per line.

In [24]:
f_t = []  # titles
f_a = []  # abstracts
topic_set = set(topics)  # set membership is faster than scanning the list
with open("FINAL-DATA-WITH-TOPICS.txt", "w") as f1:
    with open("ARXIV-CORPUS-COMPLETE-50k.txt") as f2:
        # Even-numbered lines are titles, odd-numbered lines are abstracts.
        for i, line in enumerate(f2):
            if i % 2 == 0:
                f_t.append(line.strip())
            else:
                f_a.append(line.strip())

        for t, a in zip(f_t, f_a):
            # Normalize the same way as the keys of title_to_topics.
            t = " ".join(nltk.word_tokenize(t.lower()))
            a = " ".join(nltk.word_tokenize(a.lower()))

            if t in title_to_topics:
                # Keep only the topics that survived pruning.
                final_topics = [to.lower() for to in title_to_topics[t] if to in topic_set]

                # Skip records with no remaining topics.
                if final_topics:
                    val1 = {"title": t, "abstract": a, "topics": final_topics}
                    f1.write(json.dumps(val1) + "\n")
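
The filtering logic can also be tried out without touching the real files. The following is a rough sketch under a few assumptions: `filter_records` is a hypothetical helper, the toy titles and abstracts are invented and already normalized, and an `io.StringIO` buffer stands in for `FINAL-DATA-WITH-TOPICS.txt`.

```python
import io
import json


def filter_records(pairs, title_to_topics, allowed_topics):
    """Yield one output record per (title, abstract) pair that keeps a topic."""
    allowed = set(allowed_topics)  # constant-time membership checks
    for title, abstract in pairs:
        kept = [t.lower() for t in title_to_topics.get(title, []) if t in allowed]
        if kept:  # titles with no allowed topic are dropped
            yield {"title": title, "abstract": abstract, "topics": kept}


# Toy inputs; titles are assumed to be normalized already.
toy_title_to_topics = {"graph kernels": ["Kernel method", "Deep learning"]}
toy_pairs = [
    ("graph kernels", "we study kernels ."),
    ("unknown title", "no match ."),
]

buffer = io.StringIO()  # stands in for the output file
for record in filter_records(toy_pairs, toy_title_to_topics, ["Kernel method"]):
    buffer.write(json.dumps(record) + "\n")

# Read the JSON lines back to confirm the output format.
written = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Only the first pair is written: its title is known and one of its topics is in the allowed list, while the second title has no entry in `toy_title_to_topics`.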
