This notebook was used to explore the dataset and to decide which operations were needed to clean it and reduce it to a more manageable size.

Only one chunk of the dataset is analyzed in this notebook, and each step of the process is explained in detail.

The complete dataset is then processed in a different script for convenience.

In [1]:
# Uncomment if using Colab and if the dataset is stored on Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

Specify the file location of the dataset below:

In [2]:
import pandas as pd
import numpy as np
file_path = ""

In [3]:
df = pd.DataFrame()

We read the first chunk of the dataset into a dataframe. A chunk is composed of `chunk_size` JSON objects, one per line of the input file.



In [4]:
chunk_size = 10_000
with pd.read_json(file_path, encoding="UTF-8", lines=True, chunksize=chunk_size) as reader:
    for chunk in reader:
      df = chunk
      break
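
The cell above stops after the first chunk. As a minimal sketch of how a separate script could walk through every chunk with the same reader, assuming a hypothetical helper `process_in_chunks` and a made-up stand-in file (neither is part of the original script):

```python
import json
import os
import tempfile

import pandas as pd


def process_in_chunks(path, chunk_size, clean_chunk):
    """Apply `clean_chunk` to every chunk of a JSON Lines file and concatenate the results."""
    cleaned = []
    with pd.read_json(path, encoding="UTF-8", lines=True, chunksize=chunk_size) as reader:
        for chunk in reader:
            cleaned.append(clean_chunk(chunk))
    return pd.concat(cleaned, ignore_index=True)


# Tiny stand-in file with made-up records, so that the sketch runs on its own
records = [{"id": i, "n_citation": i % 7} for i in range(25)]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    for record in records:
        tmp.write(json.dumps(record) + "\n")
    demo_path = tmp.name

# Hypothetical cleaning step: keep only the rows with at least 5 citations
result = process_in_chunks(demo_path, 10, lambda chunk: chunk[chunk["n_citation"] >= 5])
os.remove(demo_path)
print(result.shape)
```

Only one chunk is held in memory while it is being cleaned, which is the reason for reading the file in chunks in the first place.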

In [5]:
df

Unnamed: 0,id,title,authors,venue,year,n_citation,page_start,page_end,doc_type,publisher,volume,issue,fos,doi,references,indexed_abstract
0,100001334,Ontologies in HYDRA - Middleware for Ambient I...,"[{'name': 'Peter Kostelnik', 'id': '2702511795...",{'raw': 'AMIF'},2009,2,43,46,,,,,"[{'name': 'Lernaean Hydra', 'w': 0.4178039}, {...",,,
1,1000018889,Remote Policy Enforcement for Trusted Applicat...,"[{'name': 'Fabio Martinelli', 'id': '210743870...",{'raw': 'international conference on trusted s...,2013,2,70,84,Conference,"Springer, Cham",,,"[{'name': 'Trusted Computing', 'w': 0.63148589...",10.1007/978-3-319-03491-1_5,"[94181602, 1504669610, 1542792105, 1639158619,...","{'IndexLength': 173, 'InvertedIndex': {'Both':..."
2,1000022707,A SIMPLE OBSERVATION REGARDING ITERATIONS OF F...,"[{'name': 'Jerzy Mycka', 'id': '263067851'}]","{'raw': 'Reports on Mathematical Logic', 'id':...",2009,0,19,29,Journal,,44,,"[{'name': 'Discrete mathematics', 'w': 0.47368...",,"[1972178849, 2069792094]","{'IndexLength': 49, 'InvertedIndex': {'A': [0]..."
3,100004108,Gait based human identity recognition from mul...,"[{'name': 'Emdad Hossain', 'id': '2017661848',...",{'raw': 'international conference on algorithm...,2012,0,319,328,Conference,"Springer, Berlin, Heidelberg",,,"[{'name': 'Biometrics', 'w': 0.529778063000000...",10.1007/978-3-642-33065-0_34,"[1578000111, 2120433720, 2136461127, 213893135...","{'IndexLength': 82, 'InvertedIndex': {'In': [0..."
4,10000571,The GAME Algorithm Applied to Complex Fraction...,"[{'name': 'Pavel Kordík', 'id': '419063071', '...",{'raw': 'international conference on artificia...,2008,5,859,868,Conference,"Springer, Berlin, Heidelberg",,,"[{'name': 'Pattern recognition', 'w': 0.453429...",10.1007/978-3-540-87559-8_89,"[291899685, 1964166287, 2135293965, 2146842127...","{'IndexLength': 171, 'InvertedIndex': {'Comple..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,105154181,Mathematical Theory Makes Formal Concept Analy...,"[{'name': 'Bernhard Ganter', 'id': '2584030807'}]","{'raw': 'Description Logics', 'id': '2764557503'}",2008,0,,,Journal,,,,"[{'name': 'Applied mathematics', 'w': 0.440267...",,,
9996,105154304,Improved phoneme recognition by integrating ev...,"[{'name': 'Shang-wen Li', 'id': '2137430318'},...",{'raw': 'conference of the international speec...,2010,5,1177,1180,Conference,,,,"[{'name': 'Artificial intelligence', 'w': 0.0}...",,,
9997,10515480,A Unified Taxonomy of Hybrid Metaheuristics wi...,"[{'name': 'El-Ghazali Talbi', 'id': '228565846...",{'raw': 'Hybrid Metaheuristics'},2013,7,3,76,,Springer Berlin Heidelberg,,,"[{'name': 'Metaheuristic', 'w': 0.605617642}, ...",10.1007/978-3-642-30671-6_1,"[51126395, 72011665, 73826515, 91781137, 12577...","{'IndexLength': 101, 'InvertedIndex': {'Over':..."
9998,105155086,Multithreaded Execution Architecture and Compi...,"[{'name': 'Organizers: Dean M. Tullsen', 'id':...",{'raw': 'high performance computer architectur...,1999,2,321,,Conference,IEEE Computer Society,,,"[{'name': 'Parallel computing', 'w': 0.4707954...",,,


In [6]:
df.shape

(10000, 16)

We check every column, one by one, for null and `NaN` values.
Since the dataset is very large, we can afford to drop every sample that contains such values.

In [7]:
# Note: in pandas, isnull() is an alias of isna(), so both lines report the same result
for column in df:
  print(f"Column: {column}")
  print(f"NUll values: {df[column].isnull().values.any()}")
  print(f"NaN values : {df[column].isna().values.any()}\n")

Column: id
NUll values: False
NaN values : False

Column: title
NUll values: False
NaN values : False

Column: authors
NUll values: False
NaN values : False

Column: venue
NUll values: True
NaN values : True

Column: year
NUll values: False
NaN values : False

Column: n_citation
NUll values: False
NaN values : False

Column: page_start
NUll values: False
NaN values : False

Column: page_end
NUll values: False
NaN values : False

Column: doc_type
NUll values: False
NaN values : False

Column: publisher
NUll values: False
NaN values : False

Column: volume
NUll values: False
NaN values : False

Column: issue
NUll values: False
NaN values : False

Column: fos
NUll values: True
NaN values : True

Column: doi
NUll values: True
NaN values : True

Column: references
NUll values: True
NaN values : True

Column: indexed_abstract
NUll values: True
NaN values : True



In [8]:
df = df.dropna(axis=0, how="any")
df

Unnamed: 0,id,title,authors,venue,year,n_citation,page_start,page_end,doc_type,publisher,volume,issue,fos,doi,references,indexed_abstract
1,1000018889,Remote Policy Enforcement for Trusted Applicat...,"[{'name': 'Fabio Martinelli', 'id': '210743870...",{'raw': 'international conference on trusted s...,2013,2,70,84,Conference,"Springer, Cham",,,"[{'name': 'Trusted Computing', 'w': 0.63148589...",10.1007/978-3-319-03491-1_5,"[94181602, 1504669610, 1542792105, 1639158619,...","{'IndexLength': 173, 'InvertedIndex': {'Both':..."
3,100004108,Gait based human identity recognition from mul...,"[{'name': 'Emdad Hossain', 'id': '2017661848',...",{'raw': 'international conference on algorithm...,2012,0,319,328,Conference,"Springer, Berlin, Heidelberg",,,"[{'name': 'Biometrics', 'w': 0.529778063000000...",10.1007/978-3-642-33065-0_34,"[1578000111, 2120433720, 2136461127, 213893135...","{'IndexLength': 82, 'InvertedIndex': {'In': [0..."
4,10000571,The GAME Algorithm Applied to Complex Fraction...,"[{'name': 'Pavel Kordík', 'id': '419063071', '...",{'raw': 'international conference on artificia...,2008,5,859,868,Conference,"Springer, Berlin, Heidelberg",,,"[{'name': 'Pattern recognition', 'w': 0.453429...",10.1007/978-3-540-87559-8_89,"[291899685, 1964166287, 2135293965, 2146842127...","{'IndexLength': 171, 'InvertedIndex': {'Comple..."
11,1000096266,Pilots Aided Channel Estimation for Doubly Sel...,"[{'name': 'Sunzeng Cai', 'id': '2703866035', '...",{'raw': 'international wireless internet confe...,2013,0,1,13,Conference,"Springer, Berlin, Heidelberg",,,"[{'name': 'Computer science', 'w': 0.361955434...",10.1007/978-3-642-41773-3_1,"[1801089468, 1963850605, 2064076416, 211953192...","{'IndexLength': 131, 'InvertedIndex': {'In': [..."
13,1000117647,Mould-taper asymptotics and air gap formation ...,"[{'name': 'Brendan J. Florio', 'id': '19948004...","{'raw': 'Applied Mathematics and Computation',...",2015,7,1122,1139,Journal,Elsevier,268,,"[{'name': 'Mathematical optimization', 'w': 0....",10.1016/j.amc.2015.07.011,"[2115718968, 2151110682]","{'IndexLength': 159, 'InvertedIndex': {'We': [..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9991,105153476,Earley’s Parsing Algorithm and k-Petri Net Con...,"[{'name': 'Taishin Y. Nishida', 'id': '2040037...",{'raw': 'Languages Alive'},2012,1,174,185,,"Springer, Berlin, Heidelberg",,,"[{'name': 'Parsing', 'w': 0.651717365}, {'name...",10.1007/978-3-642-31644-9_12,"[196129900, 1496549841, 1601504131, 1970961429...","{'IndexLength': 64, 'InvertedIndex': {'In': [0..."
9993,1051540501,Towards Forward Security Properties for PEKS a...,"[{'name': 'Qiang Tang', 'id': '2132485330', 'o...",{'raw': 'australasian conference on informatio...,2015,2,127,144,Conference,"Springer, Cham",,,"[{'name': 'Computer science', 'w': 0.3969278},...",10.1007/978-3-319-19962-7_8,"[122561124, 165774471, 177462532, 198181672, 1...","{'IndexLength': 218, 'InvertedIndex': {'In': [..."
9994,1051540762,Constructions of CCA-Secure Revocable Identity...,"[{'name': 'Yuu Ishida', 'id': '2512789699', 'o...",{'raw': 'australasian conference on informatio...,2015,3,174,191,Conference,"Springer, Cham",,,"[{'name': 'Ciphertext', 'w': 0.5906863}, {'nam...",10.1007/978-3-319-19962-7_11,"[21280657, 1484751769, 1522268755, 1546774120,...","{'IndexLength': 177, 'InvertedIndex': {'Key': ..."
9997,10515480,A Unified Taxonomy of Hybrid Metaheuristics wi...,"[{'name': 'El-Ghazali Talbi', 'id': '228565846...",{'raw': 'Hybrid Metaheuristics'},2013,7,3,76,,Springer Berlin Heidelberg,,,"[{'name': 'Metaheuristic', 'w': 0.605617642}, ...",10.1007/978-3-642-30671-6_1,"[51126395, 72011665, 73826515, 91781137, 12577...","{'IndexLength': 101, 'InvertedIndex': {'Over':..."


We check that all the incomplete values have been removed:

In [9]:
for column in df:
  print(f"Column: {column}")
  print(f"NUll values: {df[column].isnull().values.any()}") 
  print(f"NaN values : {df[column].isna().values.any()}\n")

Column: id
NUll values: False
NaN values : False

Column: title
NUll values: False
NaN values : False

Column: authors
NUll values: False
NaN values : False

Column: venue
NUll values: False
NaN values : False

Column: year
NUll values: False
NaN values : False

Column: n_citation
NUll values: False
NaN values : False

Column: page_start
NUll values: False
NaN values : False

Column: page_end
NUll values: False
NaN values : False

Column: doc_type
NUll values: False
NaN values : False

Column: publisher
NUll values: False
NaN values : False

Column: volume
NUll values: False
NaN values : False

Column: issue
NUll values: False
NaN values : False

Column: fos
NUll values: False
NaN values : False

Column: doi
NUll values: False
NaN values : False

Column: references
NUll values: False
NaN values : False

Column: indexed_abstract
NUll values: False
NaN values : False



For the same reason, we also drop the samples that contain an empty string in any column. This step is needed because pandas does not treat an empty string as a null value, so `dropna` leaves such samples in place.

In [10]:
for column in df:
  df = df[df[column] != ""]
  
df

Unnamed: 0,id,title,authors,venue,year,n_citation,page_start,page_end,doc_type,publisher,volume,issue,fos,doi,references,indexed_abstract
44,100026658,On the Seismic Disturbance Rejection of Struct...,"[{'name': 'Emmanuel C. Zacharenakis', 'id': '2...","{'raw': 'Journal of Global Optimization', 'id'...",2000,1,403,410,Journal,Kluwer Academic Publishers,17,1,"[{'name': 'Mathematical optimization', 'w': 0....",10.1023/A:1026528715747,"[2003915072, 2006246895]","{'IndexLength': 40, 'InvertedIndex': {'Using':..."
49,1000290224,Fast motion estimation for HEVC on graphics pr...,"[{'name': 'Dongkyu Lee', 'id': '2163056226', '...",{'raw': 'Journal of Real-time Image Processing...,2016,3,549,562,Journal,Springer Berlin Heidelberg,12,2,[{'name': 'Synchronization (computer science)'...,10.1007/s11554-015-0522-6,"[2019126579, 2024876340, 2033981732, 208518556...","{'IndexLength': 211, 'InvertedIndex': {'The': ..."
50,1000298444,Technological unemployment and human disenhanc...,"[{'name': 'Michele Loi', 'id': '2676097775', '...","{'raw': 'Ethics and Information Technology', '...",2015,6,201,210,Journal,Springer Netherlands,17,3,"[{'name': 'Sociology', 'w': 0.410797566}, {'na...",10.1007/s10676-015-9375-8,[1631150390],"{'IndexLength': 135, 'InvertedIndex': {'This':..."
56,1000312060,Runtime verification with minimal intrusion th...,"[{'name': 'Shay Berkovich', 'id': '2117350123'...","{'raw': 'formal methods', 'id': '1169806927'}",2015,8,317,348,Conference,Springer US,46,3,"[{'name': 'Real-time computing', 'w': 0.469279...",10.1007/s10703-015-0226-3,"[39084266, 140229164, 143962766, 209133351, 32...","{'IndexLength': 207, 'InvertedIndex': {'Runtim..."
61,100031634,A window on shared virtual environments,"[{'name': 'Denis Amselem', 'id': '2533711948',...",{'raw': 'Presence: Teleoperators & Virtual Env...,1995,28,130,145,Journal,MIT Press,4,2,"[{'name': 'Multi-user', 'w': 0.477106541000000...",10.1162/pres.1995.4.2.130,"[203056148, 1515186102, 1965498649, 2023331673...","{'IndexLength': 56, 'InvertedIndex': {'This': ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9840,1050485050,Several classes of polynomials with low differ...,"[{'name': 'Guangkui Xu', 'id': '2097899471', '...","{'raw': 'Applicable Algebra in Engineering, Co...",2016,0,91,103,Journal,Springer Berlin Heidelberg,27,2,"[{'name': 'Combinatorics', 'w': 0.461340159}, ...",10.1007/s00200-015-0272-5,"[103329842, 1528065150, 1587687098, 1858314666...","{'IndexLength': 66, 'InvertedIndex': {'In': [0..."
9846,1050549874,A diagnosis algorithm by using graph-coloring ...,"[{'name': 'Qiang Zhu', 'id': '2656285327', 'or...",{'raw': 'Journal of Combinatorial Optimization...,2016,2,960,969,Journal,Springer US,32,3,"[{'name': 'Vertex (geometry)', 'w': 0.44818171...",10.1007/s10878-015-9923-5,"[1584084163, 1595619353, 1699514597, 192339624...","{'IndexLength': 144, 'InvertedIndex': {'Fault'..."
9868,1050717819,Descriptive analysis of medication errors repo...,"[{'name': 'Zahraa Hassan Abdelrahman Shehata',...",{'raw': 'Journal of the American Medical Infor...,2016,4,366,374,Journal,The Oxford University Press,23,2,"[{'name': 'Knowledge transfer', 'w': 0.4179790...",10.1093/jamia/ocv096,[2099114371],"{'IndexLength': 282, 'InvertedIndex': {'Object..."
9949,1051237357,Construction of self-dual codes over ℤ2m,"[{'name': 'Anuradha Sharma', 'id': '2424898481...","{'raw': 'Cryptography and Communications', 'id...",2016,0,83,101,Journal,Springer US,8,1,"[{'name': 'Combinatorics', 'w': 0.401857883}, ...",10.1007/s12095-015-0139-4,"[2061721328, 2063618950, 2127017001, 2151813678]","{'IndexLength': 147, 'InvertedIndex': {'Self-d..."


In [11]:
df.shape

(478, 16)
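
To illustrate why this extra step is needed, here is a minimal sketch on a toy dataframe (made-up values, string columns only). It shows that empty strings are not reported as missing, and then removes the affected rows with a single row-wise mask; for columns that hold lists or dictionaries, the per-column loop used above remains the approach that was actually run on the dataset.

```python
import pandas as pd

# Toy data: the values are invented for illustration only
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "volume": ["12", "", "7"],
    "issue": ["3", "1", ""],
})

# Empty strings are not flagged as missing values
any_missing = toy.isna().any().any()

# Mark the rows that contain an empty string in at least one column
has_empty = (toy == "").any(axis=1)
filtered = toy[~has_empty]

print(any_missing)
print(filtered)
```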

Additionally, we remove the samples with fewer citations than a threshold (5 here), on the assumption that they are "less important".

In [12]:
df = df[df["n_citation"] >= 5]
df

Unnamed: 0,id,title,authors,venue,year,n_citation,page_start,page_end,doc_type,publisher,volume,issue,fos,doi,references,indexed_abstract
50,1000298444,Technological unemployment and human disenhanc...,"[{'name': 'Michele Loi', 'id': '2676097775', '...","{'raw': 'Ethics and Information Technology', '...",2015,6,201,210,Journal,Springer Netherlands,17,3,"[{'name': 'Sociology', 'w': 0.410797566}, {'na...",10.1007/s10676-015-9375-8,[1631150390],"{'IndexLength': 135, 'InvertedIndex': {'This':..."
56,1000312060,Runtime verification with minimal intrusion th...,"[{'name': 'Shay Berkovich', 'id': '2117350123'...","{'raw': 'formal methods', 'id': '1169806927'}",2015,8,317,348,Conference,Springer US,46,3,"[{'name': 'Real-time computing', 'w': 0.469279...",10.1007/s10703-015-0226-3,"[39084266, 140229164, 143962766, 209133351, 32...","{'IndexLength': 207, 'InvertedIndex': {'Runtim..."
61,100031634,A window on shared virtual environments,"[{'name': 'Denis Amselem', 'id': '2533711948',...",{'raw': 'Presence: Teleoperators & Virtual Env...,1995,28,130,145,Journal,MIT Press,4,2,"[{'name': 'Multi-user', 'w': 0.477106541000000...",10.1162/pres.1995.4.2.130,"[203056148, 1515186102, 1965498649, 2023331673...","{'IndexLength': 56, 'InvertedIndex': {'This': ..."
67,100034678,Technical Section: Topology authentication for...,"[{'name': 'Zhiyong Su', 'id': '2493435590', 'o...","{'raw': 'Computers & Graphics', 'id': '94821547'}",2013,6,269,279,Journal,Pergamon,37,4,"[{'name': 'Digital watermarking', 'w': 0.51685...",10.1016/j.cag.2013.02.009,"[1883100757, 1964254008, 1964939487, 196870166...","{'IndexLength': 173, 'InvertedIndex': {'Topolo..."
94,1000498390,Automatic Geolocation Correction of Satellite ...,"[{'name': 'Özge Can Özcanli', 'id': '204855396...",{'raw': 'International Journal of Computer Vis...,2016,7,263,277,Journal,Springer US,116,3,"[{'name': 'Computer science', 'w': 0.3579622},...",10.1007/s11263-015-0852-7,[2098079560],"{'IndexLength': 211, 'InvertedIndex': {'Modern..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9530,1048334742,Security framework for RESTful mobile cloud co...,"[{'name': 'Feda AlShahwan', 'id': '1384795202'...","{'raw': 'ambient intelligence', 'id': '1197908...",2016,9,649,659,Conference,Springer Berlin Heidelberg,7,5,"[{'name': 'Mobile search', 'w': 0.680315554}, ...",10.1007/s12652-015-0308-5,"[83167447, 86792954, 141134686, 1524329929, 16...","{'IndexLength': 146, 'InvertedIndex': {'Provid..."
9549,104841381,Refining UML interactions with underspecificat...,"[{'name': 'Ragnhild Kobro Runde', 'id': '20412...","{'raw': 'Nordic Journal of Computing', 'id': '...",2005,31,157,188,Journal,Publishing Association Nordic Journal of Compu...,12,2,"[{'name': 'Underspecification', 'w': 0.5243531...",10.13140/rg.2.2.36691.37920,"[189544743, 1482962177, 1561178350, 1581532873...","{'IndexLength': 111, 'InvertedIndex': {'STAIRS..."
9660,104916616,GDDP: Generalized Dual Dynamic Programming Theory,"[{'name': 'Jesús M. Velásquez Bermúdez', 'id':...","{'raw': 'Annals of Operations Research', 'id':...",2002,6,21,31,Journal,Springer,117,1,"[{'name': 'Mathematical optimization', 'w': 0....",10.1023/A:1021557003554,[2019710194],"{'IndexLength': 81, 'InvertedIndex': {'This': ..."
9699,104945644,Invariant imbedding and the calculation of eig...,"[{'name': 'M. R. Scott', 'id': '2645941388', '...","{'raw': 'Computing', 'id': '35593046'}",1969,34,10,23,Journal,Springer Vienna,4,1,"[{'name': 'Gravitational singularity', 'w': 0....",10.1007/BF02236538,[],"{'IndexLength': 193, 'InvertedIndex': {'A': [0..."


In [13]:
df.shape

(234, 16)

We also remove the samples with fewer references than a threshold (again 5).

In [14]:
# Number of references of each sample (length of its list of reference ids)
num_references = df["references"].apply(len)

In [15]:
df = df[num_references >= 5]
df

Unnamed: 0,id,title,authors,venue,year,n_citation,page_start,page_end,doc_type,publisher,volume,issue,fos,doi,references,indexed_abstract
56,1000312060,Runtime verification with minimal intrusion th...,"[{'name': 'Shay Berkovich', 'id': '2117350123'...","{'raw': 'formal methods', 'id': '1169806927'}",2015,8,317,348,Conference,Springer US,46,3,"[{'name': 'Real-time computing', 'w': 0.469279...",10.1007/s10703-015-0226-3,"[39084266, 140229164, 143962766, 209133351, 32...","{'IndexLength': 207, 'InvertedIndex': {'Runtim..."
61,100031634,A window on shared virtual environments,"[{'name': 'Denis Amselem', 'id': '2533711948',...",{'raw': 'Presence: Teleoperators & Virtual Env...,1995,28,130,145,Journal,MIT Press,4,2,"[{'name': 'Multi-user', 'w': 0.477106541000000...",10.1162/pres.1995.4.2.130,"[203056148, 1515186102, 1965498649, 2023331673...","{'IndexLength': 56, 'InvertedIndex': {'This': ..."
67,100034678,Technical Section: Topology authentication for...,"[{'name': 'Zhiyong Su', 'id': '2493435590', 'o...","{'raw': 'Computers & Graphics', 'id': '94821547'}",2013,6,269,279,Journal,Pergamon,37,4,"[{'name': 'Digital watermarking', 'w': 0.51685...",10.1016/j.cag.2013.02.009,"[1883100757, 1964254008, 1964939487, 196870166...","{'IndexLength': 173, 'InvertedIndex': {'Topolo..."
151,1000804384,Parallel Double Snakes. Application to the seg...,"[{'name': 'Florence Rossant', 'id': '182637985...","{'raw': 'Pattern Recognition', 'id': '414566'}",2015,11,3857,3870,Journal,Elsevier Science Inc.,48,12,"[{'name': 'Pattern recognition', 'w': 0.418113...",10.1016/j.patcog.2015.06.009,"[246149496, 1501827752, 1534168035, 1603965921...","{'IndexLength': 191, 'InvertedIndex': {'In': [..."
218,1001060374,Lattice Boltzmann analysis of micro-particles ...,"[{'name': 'Hamid Hassanzadeh Afrouzi', 'id': '...",{'raw': 'Computers & Mathematics With Applicat...,2015,8,1136,1151,Journal,"Pergamon Press, Inc.",70,5,"[{'name': 'Lift (force)', 'w': 0.508257}, {'na...",10.1016/j.camwa.2015.07.008,"[1996902556, 2015054985, 2032012801, 204422735...","{'IndexLength': 242, 'InvertedIndex': {'Disper..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9421,104754383,Shackled to the status quo: the inhibiting eff...,"[{'name': 'Greta L. Polites', 'id': '230572121...",{'raw': 'Management Information Systems Quarte...,2012,215,21,42,Journal,Society for Information Management and The Man...,36,1,"[{'name': 'Management science', 'w': 0.3945593...",10.2307/41410404,"[150339760, 1482773440, 1487725643, 1508343863...","{'IndexLength': 293, 'InvertedIndex': {'Given'..."
9529,104833327,A framework for linking the structure of infor...,"[{'name': 'Sunro Lee', 'id': '2576570380'}, {'...",{'raw': 'Journal of Management Information Sys...,1992,49,27,44,Journal,"M. E. Sharpe, Inc.",8,4,"[{'name': 'Knowledge management', 'w': 0.47296...",10.1080/07421222.1992.11517937,"[96126854, 1509283903, 1521870617, 1527944521,...","{'IndexLength': 71, 'InvertedIndex': {'Abstrac..."
9530,1048334742,Security framework for RESTful mobile cloud co...,"[{'name': 'Feda AlShahwan', 'id': '1384795202'...","{'raw': 'ambient intelligence', 'id': '1197908...",2016,9,649,659,Conference,Springer Berlin Heidelberg,7,5,"[{'name': 'Mobile search', 'w': 0.680315554}, ...",10.1007/s12652-015-0308-5,"[83167447, 86792954, 141134686, 1524329929, 16...","{'IndexLength': 146, 'InvertedIndex': {'Provid..."
9549,104841381,Refining UML interactions with underspecificat...,"[{'name': 'Ragnhild Kobro Runde', 'id': '20412...","{'raw': 'Nordic Journal of Computing', 'id': '...",2005,31,157,188,Journal,Publishing Association Nordic Journal of Compu...,12,2,"[{'name': 'Underspecification', 'w': 0.5243531...",10.13140/rg.2.2.36691.37920,"[189544743, 1482962177, 1561178350, 1581532873...","{'IndexLength': 111, 'InvertedIndex': {'STAIRS..."


In [16]:
df.shape

(182, 16)
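
The two threshold filters can also be expressed together. This is a minimal sketch on toy data, where the constants `MIN_CITATIONS` and `MIN_REFERENCES` are hypothetical names introduced here for the two values used above:

```python
import pandas as pd

MIN_CITATIONS = 5   # hypothetical name; same value as the citation filter above
MIN_REFERENCES = 5  # hypothetical name; same value as the reference filter above

# Toy data: the values are invented for illustration only
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "n_citation": [8, 2, 34, 6],
    "references": [
        [10, 11, 12, 13, 14],
        [10, 11, 12, 13, 14, 15],
        [],
        [1, 2, 3, 4, 5, 6],
    ],
})

enough_citations = toy["n_citation"] >= MIN_CITATIONS
enough_references = toy["references"].apply(len) >= MIN_REFERENCES

# A sample is kept only if it passes both filters
kept = toy[enough_citations & enough_references]
print(kept["id"].tolist())
```

Naming the thresholds makes it easier to reuse the same values when the complete dataset is processed.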

We can see that the field `indexed_abstract` has the structure of an inverted index: each word is mapped to the list of positions at which it occurs in the abstract. Here we convert this structure into a plain string, so that it is easier to query once the dataset has been imported into a database.



In [17]:
indexed_abstract = df["indexed_abstract"].to_numpy()
abstracts = []

# Read the samples one by one
for entry in indexed_abstract:
  # Words are stored temporarily in an array, with one slot per position
  string_arr = np.full(entry["IndexLength"], None, dtype=object)
  dictionary = entry["InvertedIndex"]

  # Place every word at each position where it occurs
  for key, value in dictionary.items():
    for elem in value:
      string_arr[elem] = key

  # The input data is not always consistent: drop the positions left unfilled
  string_arr = string_arr[string_arr != None]

  # Join the words into a single string
  abstracts.append(" ".join(string_arr))

# Write the result back explicitly instead of relying on in-place changes
# to the array returned by to_numpy(), which may be a copy or read-only
df = df.assign(indexed_abstract=abstracts)
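
As a worked example of the conversion, here is a minimal plain-Python sketch applied to a tiny, made-up inverted index. The helper name `rebuild_abstract` is hypothetical, and the entry is deliberately inconsistent (position 5 is never filled) to show the cleaning step:

```python
def rebuild_abstract(entry):
    """Turn an {'IndexLength', 'InvertedIndex'} entry back into a plain string."""
    words = [None] * entry["IndexLength"]
    for word, positions in entry["InvertedIndex"].items():
        for position in positions:
            words[position] = word
    # Skip the positions that no word was assigned to
    return " ".join(word for word in words if word is not None)


# Made-up entry: "the" occurs at positions 0 and 3, position 5 is left empty
example = {
    "IndexLength": 6,
    "InvertedIndex": {"the": [0, 3], "cat": [1], "chased": [2], "mouse": [4]},
}

abstract = rebuild_abstract(example)
print(abstract)
```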

Accordingly, we rename the column `indexed_abstract` to `abstract`:

In [21]:
df = df.rename({"indexed_abstract": "abstract"}, axis=1)
df

Unnamed: 0,id,title,authors,venue,year,n_citation,page_start,page_end,doc_type,publisher,volume,issue,fos,doi,references,abstract
56,1000312060,Runtime verification with minimal intrusion th...,"[{'name': 'Shay Berkovich', 'id': '2117350123'...","{'raw': 'formal methods', 'id': '1169806927'}",2015,8,317,348,Conference,Springer US,46,3,"[{'name': 'Real-time computing', 'w': 0.469279...",10.1007/s10703-015-0226-3,"[39084266, 140229164, 143962766, 209133351, 32...",Runtime verification is a monitoring technique...
61,100031634,A window on shared virtual environments,"[{'name': 'Denis Amselem', 'id': '2533711948',...",{'raw': 'Presence: Teleoperators & Virtual Env...,1995,28,130,145,Journal,MIT Press,4,2,"[{'name': 'Multi-user', 'w': 0.477106541000000...",10.1162/pres.1995.4.2.130,"[203056148, 1515186102, 1965498649, 2023331673...",This paper presents the architecture of a mult...
67,100034678,Technical Section: Topology authentication for...,"[{'name': 'Zhiyong Su', 'id': '2493435590', 'o...","{'raw': 'Computers & Graphics', 'id': '94821547'}",2013,6,269,279,Journal,Pergamon,37,4,"[{'name': 'Digital watermarking', 'w': 0.51685...",10.1016/j.cag.2013.02.009,"[1883100757, 1964254008, 1964939487, 196870166...",Topology authentication for computer-aided pla...
151,1000804384,Parallel Double Snakes. Application to the seg...,"[{'name': 'Florence Rossant', 'id': '182637985...","{'raw': 'Pattern Recognition', 'id': '414566'}",2015,11,3857,3870,Journal,Elsevier Science Inc.,48,12,"[{'name': 'Pattern recognition', 'w': 0.418113...",10.1016/j.patcog.2015.06.009,"[246149496, 1501827752, 1534168035, 1603965921...","In order to segment elongated structures, we p..."
218,1001060374,Lattice Boltzmann analysis of micro-particles ...,"[{'name': 'Hamid Hassanzadeh Afrouzi', 'id': '...",{'raw': 'Computers & Mathematics With Applicat...,2015,8,1136,1151,Journal,"Pergamon Press, Inc.",70,5,"[{'name': 'Lift (force)', 'w': 0.508257}, {'na...",10.1016/j.camwa.2015.07.008,"[1996902556, 2015054985, 2032012801, 204422735...",Dispersion and deposition of microparticles ar...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9421,104754383,Shackled to the status quo: the inhibiting eff...,"[{'name': 'Greta L. Polites', 'id': '230572121...",{'raw': 'Management Information Systems Quarte...,2012,215,21,42,Journal,Society for Information Management and The Man...,36,1,"[{'name': 'Management science', 'w': 0.3945593...",10.2307/41410404,"[150339760, 1482773440, 1487725643, 1508343863...",Given that adoption of a new system often impl...
9529,104833327,A framework for linking the structure of infor...,"[{'name': 'Sunro Lee', 'id': '2576570380'}, {'...",{'raw': 'Journal of Management Information Sys...,1992,49,27,44,Journal,"M. E. Sharpe, Inc.",8,4,"[{'name': 'Knowledge management', 'w': 0.47296...",10.1080/07421222.1992.11517937,"[96126854, 1509283903, 1521870617, 1527944521,...",Abstract:This paper describes relations betwee...
9530,1048334742,Security framework for RESTful mobile cloud co...,"[{'name': 'Feda AlShahwan', 'id': '1384795202'...","{'raw': 'ambient intelligence', 'id': '1197908...",2016,9,649,659,Conference,Springer Berlin Heidelberg,7,5,"[{'name': 'Mobile search', 'w': 0.680315554}, ...",10.1007/s12652-015-0308-5,"[83167447, 86792954, 141134686, 1524329929, 16...",Providing Web services from the mobile cloud i...
9549,104841381,Refining UML interactions with underspecificat...,"[{'name': 'Ragnhild Kobro Runde', 'id': '20412...","{'raw': 'Nordic Journal of Computing', 'id': '...",2005,31,157,188,Journal,Publishing Association Nordic Journal of Compu...,12,2,"[{'name': 'Underspecification', 'w': 0.5243531...",10.13140/rg.2.2.36691.37920,"[189544743, 1482962177, 1561178350, 1581532873...",STAIRS is an approach to the compositional dev...


Lastly, we export the cleaned chunk to a .csv file:

In [22]:
df.to_csv("dataset_v11.csv", encoding="UTF-8", index=False)
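
One thing to keep in mind when reading the exported file back: CSV has no notion of lists or dictionaries, so nested columns such as `references` or `venue` come back as strings. The following is a minimal sketch on toy data (made-up values, in-memory buffer instead of a file) of one possible way to restore them with `ast.literal_eval`; it is an illustration, not necessarily how the file is consumed later on.

```python
import ast
import io

import pandas as pd

# Toy data: the values are invented for illustration only
toy = pd.DataFrame({
    "id": [1, 2],
    "references": [[10, 11], []],
    "venue": [{"raw": "Computing"}, {"raw": "formal methods"}],
})

# Round trip through CSV, using an in-memory buffer instead of a file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# After the round trip, the nested values are plain strings
is_string_before = isinstance(restored.loc[0, "references"], str)

# Parse the strings back into Python lists and dictionaries
restored["references"] = restored["references"].apply(ast.literal_eval)
restored["venue"] = restored["venue"].apply(ast.literal_eval)
print(restored.loc[0, "references"], restored.loc[0, "venue"])
```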