# Cleaning and Preprocessing the Scopus publications related to COVID-19

To collect the Scopus publications related to COVID-19, we used the "pybliometrics" library, which is available at https://pypi.org/project/pybliometrics/.

In [1]:
# Importing the required libraries.
import re, csv, pandas as pd, numpy as np
from pylatexenc.latex2text import LatexNodes2Text

## 1. Generating the dataframe from the raw data

In [2]:
# Creating a dataframe from the raw data.
df_data = pd.read_csv("../../data/raw/scopus_raw.csv", header=0, dtype=object)

In [3]:
# Checking the dataframe.
df_data.head()

Unnamed: 0,id,doi,eid,pii,pubmed_id,title,abstract,description,publication_date,citation_num,...,vehicle_address,title_edition,publisher,affiliations,subject_areas,authors,author_affil,ref_count,references,period
0,85086284745,10.1093/jas/skaa159,2-s2.0-85086284745,,32447386,Effects of medium chain fatty acids as a mitig...,© 2020 The Author(s) 2020. Published by Oxford...,© 2020 The Author(s) 2020. Published by Oxford...,2019-12-31,1.0,...,,,Oxford University Press,"({'id': '60000689', 'affiliation': 'Kansas Sta...","('Food Science', 'Animal Science and Zoology',...","({'id': '57205663870', 'name': 'Annie B. Lerne...","({'id': '57205663870', 'name': 'Annie B. Lerne...",30.0,"({'id': '85015948816', 'title': 'Weight of the...",12-2019
1,85077574207,10.3390/v12010043,2-s2.0-85077574207,,31905881,Feline infectious peritonitis virus NSP5 inhib...,© 2019 by the authors.Feline infectious perito...,© 2019 by the authors.Feline infectious perito...,2019-12-30,4.0,...,,,MDPI AG,"({'id': '60017705', 'affiliation': 'Chinese Ac...","('Infectious Diseases', 'Virology')","({'id': '57193357295', 'name': 'Si Chen'}, {'i...","({'id': '57193357295', 'name': 'Si Chen', 'aff...",53.0,"({'id': '33845329175', 'title': 'Factors assoc...",12-2019
2,85077542676,10.3390/v12010041,2-s2.0-85077542676,,31905842,Investigation of the role of the spike protein...,© 2019 by the authorsPorcine epidemic diarrhea...,© 2019 by the authorsPorcine epidemic diarrhea...,2019-12-30,2.0,...,,,MDPI AG,"({'id': '60005429', 'affiliation': 'National T...","('Infectious Diseases', 'Virology')","({'id': '57194272852', 'name': 'Chi Fei Kao'},...","({'id': '57194272852', 'name': 'Chi-Fei Kao', ...",31.0,"({'id': '0018177616', 'title': 'A new coronavi...",12-2019
3,85077287373,10.1186/s12917-019-2212-2,2-s2.0-85077287373,,31881873,Prevalence and phylogenetic analysis of porcin...,"© 2019 The Author(s).Background: In China, lar...","© 2019 The Author(s).Background: In China, lar...",2019-12-27,8.0,...,,,BioMed Central Ltd.,"({'id': '60004148', 'affiliation': 'Jiangxi Ag...","('Veterinary (all)',)","({'id': '56764850300', 'name': 'Fanfan Zhang'}...","({'id': '56764850300', 'name': 'Fanfan Zhang',...",42.0,"({'id': '84962194707', 'title': 'Epidemiology ...",12-2019
4,85073749551,10.1016/j.jbiotec.2019.10.007,2-s2.0-85073749551,S0168165619308879,31614169,Preparation of virus-like particle mimetic nan...,© 2019 Elsevier B.V.Middle East respiratory sy...,© 2019 Elsevier B.V.Middle East respiratory sy...,2019-12-20,18.0,...,,,Elsevier B.V.,"({'id': '60103680', 'affiliation': 'Shizuoka U...","('Biotechnology', 'Bioengineering', 'Applied M...","({'id': '55270209300', 'name': 'Tatsuya Kato'}...","({'id': '55270209300', 'name': 'Tatsuya Kato',...",31.0,"({'id': '84869081784', 'title': 'Is the discov...",12-2019


In [4]:
# Showing summary information about the dataset.
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84613 entries, 0 to 84612
Data columns (total 30 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   id                84613 non-null  object
 1   doi               81883 non-null  object
 2   eid               84610 non-null  object
 3   pii               24426 non-null  object
 4   pubmed_id         58339 non-null  object
 5   title             84610 non-null  object
 6   abstract          52351 non-null  object
 7   description       52351 non-null  object
 8   publication_date  84610 non-null  object
 9   citation_num      84610 non-null  object
 10  language          84161 non-null  object
 11  production_type   84610 non-null  object
 12  source_type       84610 non-null  object
 13  auth_keywords     46201 non-null  object
 14  index_terms       50756 non-null  object
 15  issn              83048 non-null  object
 16  isbn              1707 non-null   object
 17  conf_locatio

## 2. Cleaning and preprocessing the dataframe

In [5]:
# Defining the function "clean_text" to clean and preprocess any text.
def clean_text(text, has_latex=False):
    if text:
        text = re.sub(r"\u2fff(s|\s)", r"'\1", re.sub(r"\s+", " ", re.sub(r"\ufeff\.?", "", re.sub(
            r"\\\\(\’\s)?", "", str(text))))).replace("\u200b", "").replace("\ue001", "").replace(
            "\ue061", "").replace("\u202f", "").replace("\u2060", "").replace("\u200f", "").replace(
            "\u200e", "").replace("\u202c", "").replace("&#x2013;", "-").replace("&quot", "\"\"").replace(
            "\u200c", "").replace("\\u0019", "").replace("\\s", "s").replace("\u202a", "").replace(
            "\u202d", "-").replace("\u0383", "-").replace("\u20f3", "ó").replace("\u20fa", "ú").replace(
            "\u2fff", "-").strip()
        text = text.replace("TNF-alpha induced", "TNF-α induced").replace(
            "TNF-Alpha induced", "TNF-α induced").replace("TNF- ␣ induced", "TNF-α induced").replace(
            "TNF-αinduced", "TNF-α induced").replace(
            "via NF- \u242c B pathway", "via NF-κB pathway").replace(
            "via NF-kappaB pathway", "via NF-κB pathway").strip()
        if has_latex:
            # Protect "%" (optionally escaped as "\%") so that it is not parsed as a LaTeX comment.
            text = LatexNodes2Text().latex_to_text(re.sub(r"\\?%", "@PER@CENT@", text)).replace("@PER@CENT@", "%")
        text = re.sub(r"\s+", " ", re.sub(r"\-{2,}", "-", re.sub(r"\s?\xad(\s|\-)?", "-", text))).replace(
            "\\", "").replace("\\%", "%").replace("()", "").replace("[]", "").strip()
        return text
    else:
        return None
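
Much of `clean_text` consists of removing invisible Unicode characters and collapsing whitespace. The sketch below isolates that one step on a reduced, illustrative subset of the characters handled above. The names `INVISIBLE` and `strip_invisible` are hypothetical and not part of the notebook.

```python
import re

# Reduced, illustrative subset of the invisible characters removed by clean_text.
INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200e\u200f\u2060\ufeff"))


def strip_invisible(text):
    """Drop zero-width and directional marks, then collapse runs of whitespace."""
    if not text:
        return None
    return re.sub(r"\s+", " ", str(text).translate(INVISIBLE)).strip()


print(strip_invisible("COVID\u200b-19   and\u200e SARS-CoV-2 "))
```

A translation table removes all listed characters in a single pass, which may be easier to maintain than a long chain of `replace` calls when the list grows.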

In [6]:
# Removing the invalid articles (those without an "id" or an "eid").
df_data = df_data.loc[df_data.id.notnull() & df_data.eid.notnull()]

In [7]:
# Replacing the "NaN" values with "None".
df_data.replace({np.nan: None}, inplace=True)

In [8]:
# Setting the numbers of citations and references to zero for the articles where they are missing.
df_data.citation_num.loc[df_data.citation_num.isnull()] = 0
df_data.ref_count.loc[df_data.ref_count.isnull()] = 0
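
The two assignments above go through `df_data.citation_num.loc[...]`, a chained form that pandas may apply to a temporary copy (this is what `SettingWithCopyWarning` reports). A minimal sketch of the single-step `.loc` form follows, using a toy frame `df_toy` that is an assumption for illustration and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_data; the values are illustrative only.
df_toy = pd.DataFrame({"citation_num": ["3", np.nan, "0"],
                       "ref_count": [np.nan, "12", np.nan]}, dtype=object)

# One .loc call selects rows and column together, so the assignment
# writes into df_toy itself rather than into an intermediate object.
for column in ["citation_num", "ref_count"]:
    df_toy.loc[df_toy[column].isnull(), column] = 0

print(df_toy)
```

The same pattern could be applied to the other `df_data.<column>.loc[...] = ...` assignments in this notebook.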

In [9]:
# Normalizing the feature "abstract".
df_data.abstract.loc[df_data.abstract.isnull() & df_data.description.notnull()] = df_data.description.loc[
    df_data.abstract.isnull() & df_data.description.notnull()]
df_data.abstract.loc[df_data.abstract.notnull()] = df_data.abstract.loc[df_data.abstract.notnull()].apply(
    lambda x: clean_text(x, True))

In [10]:
# Normalizing the feature "vehicle_name".
df_data.vehicle_name.loc[df_data.conference_name.notnull() & df_data.vehicle_name.notnull()] = \
    df_data.conference_name.loc[df_data.conference_name.notnull() & df_data.vehicle_name.notnull()]
df_data.vehicle_name.loc[df_data.vehicle_name.notnull()] = df_data.vehicle_name.loc[
    df_data.vehicle_name.notnull()].apply(clean_text)

In [11]:
# Normalizing the feature "title".
df_data.title.loc[df_data.title.notnull()] = df_data.title.loc[df_data.title.notnull()].apply(clean_text)

In [12]:
# Removing unnecessary columns.
columns_drop = ["eid", "pii", "description", "isbn", "conf_location", "conference_name",
    "vehicle_address", "title_edition"]
df_data.drop(axis=1, columns=columns_drop, inplace=True)

In [13]:
# Changing the type of some features.
df_data.loc[:, ["citation_num", "ref_count"]] = df_data.loc[
    :, ["citation_num", "ref_count"]].astype(np.float32)
df_data.auth_keywords.loc[df_data.auth_keywords.notnull()] = df_data.auth_keywords.loc[
    df_data.auth_keywords.notnull()].apply(eval)
df_data.index_terms.loc[df_data.index_terms.notnull()] = df_data.index_terms.loc[
    df_data.index_terms.notnull()].apply(eval)
df_data.affiliations.loc[df_data.affiliations.notnull()] = df_data.affiliations.loc[
    df_data.affiliations.notnull()].apply(eval)
df_data.subject_areas.loc[df_data.subject_areas.notnull()] = df_data.subject_areas.loc[
    df_data.subject_areas.notnull()].apply(eval)
df_data.authors.loc[df_data.authors.notnull()] = df_data.authors.loc[df_data.authors.notnull()].apply(eval)
df_data.author_affil.loc[df_data.author_affil.notnull()] = df_data.author_affil.loc[
    df_data.author_affil.notnull()].apply(eval)
df_data.references.loc[df_data.references.notnull()] = df_data.references.loc[
    df_data.references.notnull()].apply(eval)
df_data.publication_date = pd.to_datetime(df_data.publication_date)
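
The cell above uses `eval` to turn the stringified tuples and dictionaries back into Python objects. Since `eval` executes arbitrary expressions, a possible safer alternative is `ast.literal_eval` from the standard library, which only accepts literals. The sketch below assumes the cell values have the same shape as the ones shown in the `head()` output. It is not the method used in this notebook.

```python
import ast

# Example cell value, shaped like an entry of the "subject_areas" column.
raw_value = "('Infectious Diseases', 'Virology')"

# literal_eval accepts only Python literals (tuples, dicts, strings, numbers, ...).
parsed = ast.literal_eval(raw_value)
print(parsed)

# Anything that is not a literal, such as a function call, is rejected.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print("call rejected:", rejected)
```

With this approach, `.apply(eval)` would become `.apply(ast.literal_eval)`, provided every cell contains only literals.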

In [14]:
# Normalizing the items contained in the features "auth_keywords" and "index_terms".
df_data.auth_keywords.loc[df_data.auth_keywords.notnull()] = df_data.auth_keywords.loc[
    df_data.auth_keywords.notnull()].apply(lambda x: tuple([clean_text(item) for item in x]))
df_data.index_terms.loc[df_data.index_terms.notnull()] = df_data.index_terms.loc[
    df_data.index_terms.notnull()].apply(lambda x: tuple([clean_text(item) for item in x]))



In [15]:
# Checking whether there are invalid values in the features "auth_keywords", "index_terms" and "subject_areas".
for column in ["auth_keywords", "index_terms", "subject_areas"]:
    count = df_data.loc[df_data[column].notnull(), column][
                [np.any([item is None or item.lower() == "none" for item in items])
                 for items in df_data.loc[df_data[column].notnull(), column]]].size
    print("{}: {}".format(column, count))

auth_keywords: 0
index_terms: 0
subject_areas: 0


In [16]:
# Removing the invalid values in the features "auth_keywords", "index_terms" and "subject_areas".
for column in ["auth_keywords", "index_terms", "subject_areas"]:
    df_data.loc[df_data[column].notnull(), column] = [
        tuple([item for item in items if item])
        for items in df_data.loc[df_data[column].notnull(), column]]
    df_data.loc[df_data[column].notnull(), column] = df_data.loc[
        df_data[column].notnull(), column].apply(lambda x: x if len(x) > 0 else None)



In [17]:
# Normalizing the content contained in the features "authors", "affiliations" and "author_affil".
df_data.affiliations.loc[df_data.affiliations.notnull()] = df_data.affiliations.loc[
    df_data.affiliations.notnull()].apply(lambda x: tuple([{"id": item["id"],
        "affiliation": clean_text(item["affiliation"]), "country": item["country"]}
        for item in x if item["id"]]))
df_data.author_affil.loc[df_data.author_affil.notnull()] = df_data.author_affil.loc[
    df_data.author_affil.notnull()].apply(lambda x: tuple(
        [{"id": item["id"], "name": clean_text(item["name"]), "affil_id": item["affil_id"],
          "affiliation": clean_text(item["affiliation"]), "country": item["country"]}
         for item in x if item["id"] or item["name"] or item["affil_id"] or \
             item["affiliation"] or item["country"]]))
df_data.authors.loc[df_data.authors.notnull()] = df_data.authors.loc[
    df_data.authors.notnull()].apply(lambda x: tuple(
        [{"id": item["id"], "name": clean_text(item["name"])} for item in x if item["id"]]))

In [18]:
# Removing the invalid values in the features "authors", "affiliations" and "author_affil".
for column in ["authors", "affiliations", "author_affil"]:
    df_data.loc[df_data[column].notnull(), column] = df_data.loc[
        df_data[column].notnull(), column].apply(lambda x: x if len(x) > 0 else None)

In [19]:
# Creating IDs for the affiliations and authors that do not have one.
df_data.author_affil.loc[df_data.author_affil.notnull()] = df_data.author_affil.loc[
    df_data.author_affil.notnull()].apply(lambda x: tuple([{
        "id": item["id"] if item["id"] and item["name"] else \
            str(hash("{} - {}".format(item["name"], "Scopus"))) if item["name"] else None,
        "name": item["name"],
        "affil_id": item["affil_id"] if item["affil_id"] and item["affiliation"] else \
            str(hash("{} - {}".format(item["affiliation"], "Scopus"))) \
                if item["affiliation"] else None,
        "affiliation": item["affiliation"], "country": item["country"]}
    for item in x]))
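
The generated IDs above rely on the built-in `hash`, whose result for strings is salted per Python process unless `PYTHONHASHSEED` is fixed, so the same name can receive a different ID on each run. If reproducible IDs are needed, one option is a digest from `hashlib`. The helper `stable_id` below is hypothetical and only sketches the idea. The choice of SHA-1 and of a 16-character prefix is an assumption.

```python
import hashlib


def stable_id(label, source="Scopus"):
    """Hypothetical helper: derive a reproducible ID from a name or an affiliation."""
    key = "{} - {}".format(label, source).encode("utf-8")
    # The digest depends only on the input bytes, not on the running process.
    return hashlib.sha1(key).hexdigest()[:16]


print(stable_id("Si Chen"))
print(stable_id("Chinese Academy of Sciences"))
```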

In [20]:
# Removing duplicates within the list of affiliations and authors.
df_data.author_affil.loc[df_data.author_affil.notnull()] = [
    set([(au["id"], au["name"], au["affil_id"],
        au["affiliation"], au["country"]) for au in row])
    for row in df_data.author_affil[df_data.author_affil.notnull()]]
df_data.author_affil.loc[df_data.author_affil.notnull()] = [tuple([dict(zip(
        ["id", "name", "affil_id", "affiliation", "country"], au)) for au in row])
    for row in df_data.author_affil[df_data.author_affil.notnull()]]
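
The round trip through `set` above removes duplicates but does not keep the original order of the authors. If that order matters, one possible alternative is `dict.fromkeys`, which keeps the first occurrence of each key in insertion order. The function `unique_authors` is a hypothetical name used only for this sketch.

```python
KEYS = ["id", "name", "affil_id", "affiliation", "country"]


def unique_authors(rows):
    """Drop repeated author/affiliation records while keeping first-seen order."""
    seen = dict.fromkeys(tuple(row[key] for key in KEYS) for row in rows)
    return tuple(dict(zip(KEYS, values)) for values in seen)


# Illustrative records; the IDs and names are made up for the example.
rows = (
    {"id": "2", "name": "B. Author", "affil_id": "10", "affiliation": "Univ X", "country": "BR"},
    {"id": "1", "name": "A. Author", "affil_id": "10", "affiliation": "Univ X", "country": "BR"},
    {"id": "2", "name": "B. Author", "affil_id": "10", "affiliation": "Univ X", "country": "BR"},
)
result = unique_authors(rows)
print([author["id"] for author in result])
```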

In [21]:
# Removing duplicated records that share the same "title" and "doi".
df_data = pd.concat([df_data[df_data.title.isnull() | df_data.doi.isnull()],
    df_data[df_data.title.notnull() & df_data.doi.notnull()].sort_values(
        by=["title", "citation_num", "publication_date"]).drop_duplicates(
            subset=["title", "doi"], keep="last")], ignore_index=True)
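
The deduplication above sorts the records before dropping duplicates, so that the last row of each ("title", "doi") group is the one that is kept. A small sketch of the same pattern on a toy frame follows. The frame `df_toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy frame: "Paper A" appears twice with different citation counts.
df_toy = pd.DataFrame({
    "title": ["Paper A", "Paper A", "Paper B"],
    "doi": ["10.1/a", "10.1/a", "10.1/b"],
    "citation_num": [5.0, 2.0, 1.0],
})

# After sorting, the highest citation count is last within each title,
# so keep="last" retains that row.
deduplicated = (df_toy.sort_values(by=["title", "citation_num"])
                .drop_duplicates(subset=["title", "doi"], keep="last")
                .reset_index(drop=True))
print(deduplicated)
```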

In [22]:
# Normalizing the feature "references".
df_data.references.loc[df_data.references.notnull()] = df_data.references.loc[
    df_data.references.notnull()].apply(lambda x: tuple(
        [{"id": ref["id"], "title": clean_text(ref["title"], True),
          "doi": clean_text(ref["doi"]), "authors": clean_text(ref["authors"], True)}
         for ref in x]))

In [23]:
# Checking the result.
df_data.head()

Unnamed: 0,id,doi,pubmed_id,title,abstract,publication_date,citation_num,language,production_type,source_type,...,issn,vehicle_name,publisher,affiliations,subject_areas,authors,author_affil,ref_count,references,period
0,85086071498,,,"Apping and visualisation of health data, le co...","© 2019 University of L'Aquila, Department of C...",2019-12-01,1.0,eng,Journal,j,...,18285961.0,DISEGNARECON,"University of L'Aquila, Department of Civil Co...","({'id': '60010110', 'affiliation': 'Università...","(Architecture, Visual Arts and Performing Arts...","({'id': '57218914310', 'name': 'Enrico Cicald'...","({'id': '57218914310', 'name': 'Enrico Cicald'...",27.0,"({'id': '77949657266', 'title': 'Health resear...",12-2019
1,85098881043,,,CODS-COMAD 2021 - Proceedings of the 3rd ACM I...,The proceedings contain 93 papers. The topics ...,2020-01-02,0.0,eng,Conference Proceeding,p,...,,3rd ACM India Joint International Conference o...,Association for Computing Machinery,,"(Human-Computer Interaction, Computer Networks...",,,0.0,,01-2020
2,85082342162,,32200398.0,The Novel Coronavirus (SARS-CoV-2) Epidemic,,2020-01-01,14.0,eng,Journal,j,...,3044602.0,"Annals of the Academy of Medicine, Singapore",NLM (Medline),"({'id': '60017161', 'affiliation': 'National U...","(Medicine (all),)","({'id': '8161583900', 'name': 'Li Yang Hsu'}, ...","({'id': '57215908259', 'name': 'Jeremy Fy Lim'...",0.0,,01-2020
3,85083405993,,32291373.0,Gastrointestinal Presentation in COVID-19 in I...,Severe acute respiratory syndrome coronavirus ...,2020-01-01,7.0,eng,Journal,j,...,1259326.0,Acta medica Indonesiana,NLM (Medline),"({'id': '60069377', 'affiliation': 'Universita...","(Medicine (all),)","({'id': '57202798959', 'name': 'Muhammad Khifz...","({'id': '57216406590', 'name': 'Fauzia Kirana'...",0.0,,01-2020
4,85083410524,,32291376.0,Clinical Progression of COVID-19 Patient with ...,"Coronavirus Disease 2019 (COVID-19), previousl...",2020-01-01,6.0,eng,Journal,j,...,1259326.0,Acta medica Indonesiana,NLM (Medline),"({'id': '60196806', 'affiliation': 'RSUP Persa...","(Medicine (all),)","({'id': '36058554600', 'name': 'Erlina Burhan'...","({'id': '57216406235', 'name': 'Ibrahim Dharma...",0.0,,01-2020


In [24]:
# Showing summary information about the dataset.
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84526 entries, 0 to 84525
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   id                84526 non-null  object        
 1   doi               81799 non-null  object        
 2   pubmed_id         58269 non-null  object        
 3   title             84526 non-null  object        
 4   abstract          52326 non-null  object        
 5   publication_date  84526 non-null  datetime64[ns]
 6   citation_num      84526 non-null  object        
 7   language          84077 non-null  object        
 8   production_type   84526 non-null  object        
 9   source_type       84526 non-null  object        
 10  auth_keywords     46182 non-null  object        
 11  index_terms       50712 non-null  object        
 12  issn              82964 non-null  object        
 13  vehicle_name      84526 non-null  object        
 14  publisher         8452

## 3. Saving the dataframe

In [25]:
# Exporting the data to a CSV file.
df_data.to_csv("../../data/prepared/scopus_covid_19.csv", index=False, quoting=csv.QUOTE_ALL)