## Get all archives using http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/download/

We need a way to filter all these articles and keep only those of interest to us, i.e. those that discuss species.

Ideally, we need to tag each article of interest with:
- tag of the species it talks about
- tag of the genus
- tag of the family

We can base our work on: 
- paper title
- paper abstract

Our assumption is that if the authors used a particular species for their article, they will mention it in either the title or the abstract (a strong assumption).

First, I will study the papers' abstracts.

In a given gz file, we will need to drop all rows with no abstract. 
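
As a rough sketch of that first filtering step, the snippet below streams one gzip archive line by line and yields only the records with a non-empty abstract. It assumes one JSON object per line with a `paperAbstract` field, as in the records loaded later in this notebook; `iter_records_with_abstract` is a hypothetical helper name, not part of the notebook.

```python
import gzip
import json
from typing import Dict, Iterator


def iter_records_with_abstract(path: str) -> Iterator[Dict]:
    """Yield records from a gzipped JSON-lines file, skipping those without an abstract."""
    # "rt" decodes the gzip stream as text so each line is one JSON document.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)
            # Assumption: a missing field and an empty string both mean "no abstract".
            abstract = record.get("paperAbstract") or ""
            if abstract.strip():
                yield record
```

Streaming avoids holding the discarded records in memory, which may matter if several archives are processed.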

#### Imports

In [1]:
import glob
import json
import pandas as pd
import gzip
import os
import io
from langdetect import detect
import scispacy                                                        
import spacy                                                        
import en_core_sci_lg
from spacy import displacy

In [2]:
from typing import List, Dict, Union

### Get data

In [3]:
data_dir = "/Users/chloesekkat/Documents/batch8_ceebios/data_open_source"

In [4]:
def get_gz_files(data_dir: str) -> List[Dict]:
    """
    Read every .gz file in `data_dir` and return all JSON records as a list of dicts.
    Each line of a file is expected to be one JSON document.
    """
    json_list = []  # accumulate records across all files
    for file in os.listdir(data_dir):
        if file.endswith('.gz'):
            with gzip.open(os.path.join(data_dir, file), 'rb') as gz:
                for line in io.BufferedReader(gz):
                    json_list.append(json.loads(line))
    # return only after every file has been read
    return json_list

In [5]:
json_list = get_gz_files(data_dir)

In [6]:
data = pd.DataFrame(json_list)

In [7]:
data.shape

(32229, 21)

In [8]:
def missing_data(data: pd.DataFrame) -> pd.DataFrame:
    """
    Return the number and percentage of null values per column.
    """
    total = data.isnull().sum()
    percent = total / len(data) * 100
    tt = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return tt

In [9]:
missing_data(data)

Unnamed: 0,Total,Percent
id,0,0.0
title,0,0.0
paperAbstract,0,0.0
authors,0,0.0
inCitations,0,0.0
outCitations,0,0.0
year,254,0.78811
s2Url,0,0.0
sources,0,0.0
pdfUrls,0,0.0


In [10]:
data.head()

Unnamed: 0,id,title,paperAbstract,authors,inCitations,outCitations,year,s2Url,sources,pdfUrls,...,journalName,journalVolume,journalPages,doi,doiUrl,pmid,fieldsOfStudy,magId,s2PdfUrl,entities
0,5cf3fcad3ee67c45f1c3d98c2b4bc22f683bfca7,Quantum phase diagrams and time-of-flight pict...,By treating the hopping parameter as a perturb...,"[{'name': 'Jun Zhang', 'ids': ['49050562']}, ...",[7dfd349892b456de2b60a9e9a4dedfbaeedf767f],"[b029856e343340fc7c034f514dc15a2ad49f6c1a, 7c8...",2016.0,https://semanticscholar.org/paper/5cf3fcad3ee6...,[],[],...,Laser Physics,26.0,095501,10.1088/1054-660X/26/9/095501,https://doi.org/10.1088/1054-660X%2F26%2F9%2F0...,,[Physics],2461559103,,[]
1,9758f471a6789d32e9d684856c1b257b1fbb5546,IPCC (Intergovernmental Panel on Climate Chang...,The European Science Foundation (ESF) and the ...,"[{'name': 'Richard C. J. Somerville', 'ids': [...","[eca662dc341726551b1e74e9bc86d6f9aa37b15e, efd...",[],2008.0,https://semanticscholar.org/paper/9758f471a678...,[],[],...,,,,,,,[],2902693084,,[]
2,73236ad2ede98f3f3f6acbf83a463b4afe2dd1fb,Increasing CRISPR Efficiency and Measuring Its...,Genome editing of human cluster of differentia...,"[{'name': 'Jenny Shapiro', 'ids': ['36049549'...",[],"[d83ebd27f91b75b472da63636596668da6c72059, 56c...",2020.0,https://semanticscholar.org/paper/73236ad2ede9...,[Medline],[],...,Molecular Therapy. Methods & Clinical Development,17.0,1097 - 1107,10.1016/j.omtm.2020.04.027,https://doi.org/10.1016/j.omtm.2020.04.027,32478125.0,"[Biology, Medicine]",3022279551,,[]
3,3d9a672fef4b9fe99cd948787e340d68fbf1513b,Don’t You Be Telling Me How Tah Talk: Educatio...,,"[{'name': 'LaQuita N Gresham', 'ids': ['811828...",[],"[3285d0b0374acfeaeca887ba5e884d62393d5e40, 136...",2014.0,https://semanticscholar.org/paper/3d9a672fef4b...,[],[],...,,,,,,,[Medicine],49447698,,[]
4,251045388ce98e901ccc7d22ae754a224c72dc28,Therapeutic effect of taurine against aluminum...,The aim of the study was to demonstrate the th...,"[{'name': 'Springer-Verlag Italia', 'ids': ['...",[077ca4833b07ff8fb39501d915f43d48976c7eab],"[fb93f765e84bfcbdc722c07762e247a544bae269, 21b...",2014.0,https://semanticscholar.org/paper/251045388ce9...,[],[],...,,,,,,,[Medicine],2327692075,,[]


In [11]:
type(data.iloc[3]["paperAbstract"])

str

We need to remove the rows whose abstract is an empty string.
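
Note that `missing_data` reported 0 missing abstracts even though row 3 has an empty one: `isnull()` does not treat an empty string as missing. A minimal sketch on a toy frame (the values are made up for illustration) shows the difference and one way to count blank abstracts as missing:

```python
import pandas as pd

# Toy frame: one real abstract, one empty string, one whitespace-only string.
toy = pd.DataFrame({"paperAbstract": ["Some abstract.", "", "   "]})

# isnull() sees no missing values because "" is a valid string.
n_null = int(toy["paperAbstract"].isnull().sum())

# Replace blank strings with a missing value before counting.
blank = toy["paperAbstract"].str.strip() == ""
as_missing = toy["paperAbstract"].mask(blank)
n_missing = int(as_missing.isnull().sum())
```

Here `n_null` counts what `isnull()` sees on the raw column, while `n_missing` also counts the blank strings.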

In [12]:
def remove_empty_abstract(df: pd.DataFrame) -> pd.DataFrame:
    """
    Remove rows where paper abstract is an empty string.
    """
    where = df["paperAbstract"].values != ""
    return df[where]

In [13]:
data = remove_empty_abstract(data)

In [14]:
to_keep = [
    "title",
    "paperAbstract",
    "fieldsOfStudy"
]

def keep_columns(df: pd.DataFrame, cols_to_keep: List[str]) -> pd.DataFrame:
    """
    Return dataframe with wanted columns.
    """
    return df[cols_to_keep]

In [15]:
data = keep_columns(data, to_keep)

In [16]:
data.head()

Unnamed: 0,title,paperAbstract,fieldsOfStudy
0,Quantum phase diagrams and time-of-flight pict...,By treating the hopping parameter as a perturb...,[Physics]
1,IPCC (Intergovernmental Panel on Climate Chang...,The European Science Foundation (ESF) and the ...,[]
2,Increasing CRISPR Efficiency and Measuring Its...,Genome editing of human cluster of differentia...,"[Biology, Medicine]"
4,Therapeutic effect of taurine against aluminum...,The aim of the study was to demonstrate the th...,[Medicine]
6,Mechanism of 'crowd' in evolutionary MAS for m...,This work introduces a new evolutionary approa...,[Computer Science]


In [17]:
def keep_english_abstracts(df: pd.DataFrame) -> pd.DataFrame:
    """
    Keep only papers whose abstract is detected as English.
    """
    keep_indexes = []  # positions of the rows we will keep
    df = df.reset_index(drop=True)
    for i, element in enumerate(df["paperAbstract"]):
        try:
            if detect(element) == 'en':
                keep_indexes.append(i)
        except Exception:
            # langdetect raises when the text has no usable features (e.g. only digits)
            print('Error with: ', element)
            print(type(element))
    return df[df.index.isin(keep_indexes)]

In [18]:
data = keep_english_abstracts(data)

Error with:  2
<class 'str'>
Error with:  ............................................................................................................................................... 3
<class 'str'>
Error with:  3.
<class 'str'>
Error with:  3
<class 'str'>
Error with:  473
<class 'str'>
Error with:  5.
<class 'str'>
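
Every abstract that raised an error above consists only of digits and punctuation, so the detector has nothing to work with. A minimal sketch of a pre-filter, assuming that abstracts without any alphabetic character can simply be discarded (`has_letters` is a hypothetical helper, not part of `langdetect`):

```python
def has_letters(text: str) -> bool:
    """Return True if the text contains at least one alphabetic character."""
    return any(ch.isalpha() for ch in text)


# The abstracts reported in the error messages above
# (the long run of dots is shortened here for readability).
failing = ["2", "." * 20 + " 3", "3.", "3", "473", "5."]
kept = [t for t in failing if has_letters(t)]
```

Applying such a check before calling `detect` would drop these rows silently instead of going through the exception handler.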


In [19]:
data.shape

(15330, 3)

In [20]:
data.head()

Unnamed: 0,title,paperAbstract,fieldsOfStudy
0,Quantum phase diagrams and time-of-flight pict...,By treating the hopping parameter as a perturb...,[Physics]
1,IPCC (Intergovernmental Panel on Climate Chang...,The European Science Foundation (ESF) and the ...,[]
2,Increasing CRISPR Efficiency and Measuring Its...,Genome editing of human cluster of differentia...,"[Biology, Medicine]"
3,Therapeutic effect of taurine against aluminum...,The aim of the study was to demonstrate the th...,[Medicine]
4,Mechanism of 'crowd' in evolutionary MAS for m...,This work introduces a new evolutionary approa...,[Computer Science]


### Exploration using SciSpacy

In [21]:
test = data.iloc[1]["paperAbstract"]

In [22]:
nlp = en_core_sci_lg.load()
doc = nlp(test)

In [23]:
doc.ents

(European Science Foundation,
 ESF,
 French Foundation of the Maison des Sciences de l’Homme,
 FMSH,
 Entre-Sciences programme,
 conference series,
 environmental sciences,
 scientists,
 humanities,
 social sciences,
 colleagues,
 life,
 natural sciences,
 interdisciplinary conference,
 modelling,
 Global Change,
 Geosciences,
 Economics,
 history,
 novelty,
 interrelations,
 political science,
 impact,
 model,
 public sphere,
 speakers,
 attendees,
 Europe,
 United States,
 Asia,
 construction,
 models,
 global change,
 Earth sciences,
 Economics,
 issues,
 impacts,
 risks,
 multidisciplinary approaches,
 public policy perspective,
 history of,
 science,
 Chaired,
 Joel Guiot,
 climatologist,
 European Centre,
 Research,
 Teaching,
 Geosciences,
 Environment,
 CEREGE,
 Aix-en-Provence,
 Sylvie Thoron,
 economist,
 Aix-Marseille Research Group,
 Quantitative Economy,
 GREQAM,
 Marseille,
 conference,
 intense,
 discussions,
 scientists,
 disciplinary horizons,
 integrative approaches,


In [24]:
displacy.render(next(doc.sents), style='dep', jupyter=True)

### Load GBIF data

In [26]:
gbif = pd.read_csv("/Users/chloesekkat/Documents/batch8_ceebios/data/gbif_extract.csv")
gbif.head()

Unnamed: 0.1,Unnamed: 0,key,nubKey,nameKey,taxonID,sourceTaxonKey,kingdom,phylum,order,family,...,publishedIn,acceptedKey,accepted,proParteKey,genus,genusKey,species,speciesKey,basionymKey,basionym
0,0,8003,8003,6849425,gbif:8003,156957565.0,Animalia,Arthropoda,Amphipoda,Melitidae,...,,,,,,,,,,
1,1,8004,8004,7068178,gbif:8004,156957851.0,Animalia,Arthropoda,Amphipoda,Mimonectidae,...,,,,,,,,,,
2,2,8005,8005,7669892,gbif:8005,156957506.0,Animalia,Arthropoda,Amphipoda,Ochlesidae,...,,,,,,,,,,
3,3,8006,8006,7718541,gbif:8006,156957210.0,Animalia,Arthropoda,Amphipoda,Oedicerotidae,...,"LILLJEBORG, W. (1865). On the Lysianassa magel...",,,,,,,,,
4,4,8007,8007,7848133,gbif:8007,156085450.0,Animalia,Arthropoda,Amphipoda,Opisidae,...,"Lowry, J. K.; Stoddart, H. E. (1995). The Amph...",,,,,,,,,


As a reminder, ideally we need to tag each article of interest with:

- tag of the species it talks about
- tag of the genus
- tag of the family
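
A minimal sketch of how these three tags could be stored per article, assuming GBIF integer keys serve as tags. `TaxonTags` is a hypothetical container, and the example values are taken from the `Trichocerca musculus` row of the GBIF extract shown further down.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaxonTags:
    """Taxonomic tags attached to one article, one list per rank."""
    species: List[int] = field(default_factory=list)
    genus: List[int] = field(default_factory=list)
    family: List[int] = field(default_factory=list)

    def is_empty(self) -> bool:
        """True when no taxon at any rank was found for the article."""
        return not (self.species or self.genus or self.family)


# Keys from the GBIF extract: Trichocerca musculus / Trichocerca / Trichocercidae.
tags = TaxonTags(species=[1002097], genus=[1001946], family=[8115])
```

Keeping the ranks separate makes it possible to query articles by family or genus without knowing which key belongs to which rank.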

In [28]:
to_keep = [
    "key",
    "canonicalName",
    "family",
    "familyKey",
    "genus",
    "genusKey"
]
gbif = gbif[to_keep]

In [29]:
missing_data(gbif)

Unnamed: 0,Total,Percent
key,0,0.0
canonicalName,0,0.0
family,1349,1.335644
familyKey,1349,1.335644
genus,8657,8.571287
genusKey,8657,8.571287


#### What should we do about the missing values?
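
One option, sketched here as an assumption rather than a decision, is to keep the rows and store the keys in pandas' nullable integer dtype. A plain integer column cannot hold NaN, so pandas stores `familyKey` and `genusKey` as floats (`8115.0`); `Int64` keeps them as integers and marks the gaps as `<NA>`. The toy values reuse keys that appear in this notebook's outputs.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "key": [1002097, 3],
    "familyKey": [8115.0, np.nan],
    "genusKey": [1001946.0, np.nan],
})

# Convert the float key columns to the nullable integer dtype.
toy = toy.astype({"familyKey": "Int64", "genusKey": "Int64"})

n_missing_genus = int(toy["genusKey"].isna().sum())
```

Rows without a genus or family then remain available for matching higher-rank names, and missing keys can still be detected with `isna()`.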

In [30]:
gbif.shape

(101000, 6)

In [31]:
gbif = gbif.drop_duplicates()
gbif.shape

(101000, 6)

### Load Keyword processor

In [32]:
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()

for name in gbif["canonicalName"]:
    keyword_processor.add_keyword(name)

In [33]:
text = "Distribution and movement patterns of Antarctic blue whales Balaenoptera musculus intermedia at large temporal and spatial scales are still poorly understood. The objective of this study was to explore spatio-temporal distribution patterns of Antarctic blue whales in the Atlantic sector of the Southern Ocean,using passive acoustic monitoring data. Multi-year data were collected between 2008 and 2013 by 11 recorders deployed in the Weddell Sea and along the Greenwich meridian. Antarctic blue whale Z-calls were detected via spectrogram cross-correlation. A Blue Whale Index was developed to quantify the proportion of time during which acoustic energy from Antarctic blue whales dominatedover background noise. Our results show that Antarctic blue whales were acoustically present year-round, with most call detections between January and April.During austral summer, the number of detected calls peaked synchronously throughout the study area in mostyears, and hence, no directed meridional movement pattern was detectable. During austral winter,vocalizations were recorded at latitudes as high as 69°S, with sea ice cover exceeding 90%,suggesting that some Antarctic blue whales overwinterin Antarctic waters. Polynyas likely serve as an important habitat for baleen whales duringaustral winter, providing food and reliable access to open water for breathing. Overall, our results support increasing evidence of a complex and non-obligatory migratory behavior of Antarctic blue whales,potentially involving temporally and spatially dynamic migration routes and destinations, as well as variable timing of migration to and from the feeding grounds."

In [34]:
keyword_processor.extract_keywords(text)

[]

The keyword processor does not find "Balaenoptera musculus". Do we need to split all canonical names?

In [35]:
where = gbif["canonicalName"].str.contains('musculus') 
gbif[where]

Unnamed: 0,key,canonicalName,family,familyKey,genus,genusKey
10270,1002097,Trichocerca musculus,Trichocercidae,8115.0,Trichocerca,1001946.0
42790,1048436,Meibomeus musculus,Chrysomelidae,7780.0,Meibomeus,1048435.0
47513,1046890,Anthicus musculus,Anthicidae,7771.0,Anthicus,1046876.0
47806,1047185,Vanonus musculus,Aderidae,1047172.0,Vanonus,1047173.0
56552,1066979,Pharaphodius musculus,Aphodiidae,2933.0,Pharaphodius,1066872.0
62966,1073160,Xenochodaeus musculus,Ochodaeidae,9523.0,Xenochodaeus,1073158.0
72662,1085757,Neoathyreus ramusculus,Bolboceratidae,7720.0,Neoathyreus,1085749.0
77248,1091082,Onthophagus musculus,Scarabaeidae,5840.0,Onthophagus,1089294.0
80296,1108020,Lepturges musculus,Cerambycidae,5602.0,Urgleptes,1107917.0
80298,1108023,Lepturgus musculus,Cerambycidae,5602.0,Urgleptes,1107917.0
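
The lookup above shows two things: `Balaenoptera musculus` is not in this GBIF extract at all, which by itself explains the empty result, and the epithet `musculus` is shared by many genera. The sketch below, using only names from the table above, groups canonical names by their last word to illustrate why matching on a split epithet alone would be ambiguous. `group_by_epithet` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_epithet(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group multi-word names by their last word (the specific epithet)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        parts = name.split()
        if len(parts) >= 2:  # skip single-word names such as genera or families
            groups[parts[-1]].append(name)
    return dict(groups)


names = [
    "Trichocerca musculus",
    "Meibomeus musculus",
    "Anthicus musculus",
    "Neoathyreus ramusculus",
]
groups = group_by_epithet(names)
```

`Neoathyreus ramusculus` only appeared in the lookup because `str.contains` matches substrings; its epithet is `ramusculus`, so it ends up in its own group here.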


### Find keywords on all dataframe

In [36]:
def keep_articles_species(data: pd.DataFrame, keyword_processor: KeywordProcessor) -> pd.DataFrame:
    """
    Keep only articles for which we find a match and add a keyword column.
    The keywords to match must already be registered in `keyword_processor`.
    """
    data = data.reset_index(drop=True)
    keep_indexes = []
    keywords = []
    for i, element in enumerate(data["paperAbstract"]):
        res = keyword_processor.extract_keywords(element)
        if len(res) > 0:
            keep_indexes.append(i)
            keywords.append(list(set(res)))  # unique matches for this abstract
    # copy so that adding a column does not write to a slice of the original frame
    data = data[data.index.isin(keep_indexes)].copy()
    data["keyword"] = keywords
    return data

In [37]:
data = keep_articles_species(data, keyword_processor)

In [38]:
data.shape

(556, 4)

In [39]:
data.head()

Unnamed: 0,title,paperAbstract,fieldsOfStudy,keyword
73,Acid Content of Scent Fluid from Acanthocephal...,The major component of the defensive scent flu...,[Biology],[Acanthocephala]
102,A taxonomy of datatypes,This is the second article based on language-i...,[Computer Science],[Goes]
108,Quantitative response to photoperiod and weak ...,Reproduction and wing patterns (shape and colo...,[],"[Allomyrina, Nymphalidae, Lepidoptera, Gymnopl..."
121,Ability of a wash regimen to remove biofilm fr...,The skin/implant interface of osseointegrated ...,[Medicine],[Bacteria]
150,Synthesis and biological activities of (Z) and...,Synthesis of Z and E ethenyl acyclonucleosides...,[Medicine],[Viruses]
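
Some of these matches look like false positives: `Goes` in a computer science paper is probably the English verb rather than the genus, since `KeywordProcessor` is case-insensitive by default. A minimal sketch of a post-filter, assuming genuine taxon mentions keep their capitalisation in the abstract (`confirm_case_sensitive` is a hypothetical helper, and the two sample sentences are made up for illustration):

```python
import re
from typing import Iterable, List


def confirm_case_sensitive(text: str, keywords: Iterable[str]) -> List[str]:
    """Keep keywords that occur in the text as whole words with exact casing."""
    confirmed = []
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text):
            confirmed.append(keyword)
    return confirmed


verb_sentence = "The second article goes further into datatypes."
taxon_sentence = "Specimens of Goes were collected from oak."
```

This check is only a heuristic: a sentence that starts with the word "Goes" would still pass, and broad names such as `Bacteria` or `Viruses` are not affected by it.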


### Add keys to dataframe

In [40]:
def get_dict_name_keys(df: pd.DataFrame) -> Dict[str, Union[int, List[float]]]:
    """
    Construct a dictionary mapping each canonical name
    to its corresponding key, familyKey and genusKey when available.
    """
    # Case 1: neither genusKey nor familyKey -> map to the key alone
    where = df["genusKey"].isna() & df["familyKey"].isna()
    df_tmp = df[where]
    dict_1 = df_tmp.set_index("canonicalName")["key"].to_dict()
    df = df[~where]
    
    # Case 2: familyKey but no genusKey -> map to [key, familyKey]
    where = df["genusKey"].isna()
    df_tmp = df[where].copy()  # copy to avoid SettingWithCopyWarning
    df_tmp["allKeys"] = df_tmp[["key", "familyKey"]].values.tolist()
    dict_2 = df_tmp.set_index("canonicalName")["allKeys"].to_dict()
    df = df[~where]
    
    # Case 3: genusKey and family both present -> map to [key, genusKey]
    where = df["family"].isna()
    df_tmp = df[~where].copy()
    df_tmp["allKeys"] = df_tmp[["key", "genusKey"]].values.tolist()
    dict_3 = df_tmp.set_index("canonicalName")["allKeys"].to_dict()
    return {**dict_1, **dict_2, **dict_3}
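
A possible simplification, sketched below as an alternative rather than the notebook's method, is to build the mapping in a single pass and keep every key that is not NaN. Unlike the function above, it also includes `familyKey` when `genusKey` is present. `build_name_to_keys` is a hypothetical name; the toy rows reuse values from the outputs in this notebook.

```python
import numpy as np
import pandas as pd
from typing import Dict, List


def build_name_to_keys(df: pd.DataFrame) -> Dict[str, List[int]]:
    """Map each canonical name to [key, familyKey, genusKey], skipping missing keys."""
    mapping: Dict[str, List[int]] = {}
    for row in df.itertuples(index=False):
        candidates = [row.key, row.familyKey, row.genusKey]
        # Cast to int so keys are not stored as floats such as 8115.0.
        mapping[row.canonicalName] = [int(k) for k in candidates if pd.notna(k)]
    return mapping


toy = pd.DataFrame({
    "key": [1002097, 3],
    "canonicalName": ["Trichocerca musculus", "Bacteria"],
    "familyKey": [8115.0, np.nan],
    "genusKey": [1001946.0, np.nan],
})
name_to_keys = build_name_to_keys(toy)
```

Every value is a list here, so downstream code no longer needs to distinguish a single key from a list of keys. Duplicates can still occur when an entry's own key equals its `genusKey` or `familyKey`.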

In [41]:
dict_name_keys = get_dict_name_keys(gbif)



In [42]:
def from_str_to_keys(liste: List[str], dict_map: Dict[str, Union[int, List[float]]]) -> List[Union[int, float]]:
    """
    Convert a list of keywords into a list of corresponding keys.
    Corresponding keys are deduced from `dict_map`.
    """
    to_return = []
    for name in liste:
        key = dict_map[name.strip("'")]  # drop stray quotes around the keyword
        if isinstance(key, list):
            to_return += key
        else:
            to_return.append(key)
    return to_return

In [43]:
data["paper_keys"] = data["keyword"].apply(lambda x: from_str_to_keys(x, dict_name_keys))

In [44]:
data.head()

Unnamed: 0,title,paperAbstract,fieldsOfStudy,keyword,paper_keys
73,Acid Content of Scent Fluid from Acanthocephal...,The major component of the defensive scent flu...,[Biology],[Acanthocephala],[67]
102,A taxonomy of datatypes,This is the second article based on language-i...,[Computer Science],[Goes],"[1125298.0, 1125298.0]"
108,Quantitative response to photoperiod and weak ...,Reproduction and wing patterns (shape and colo...,[],"[Allomyrina, Nymphalidae, Lepidoptera, Gymnopl...","[1075673.0, 1075673.0, 7017.0, 7017.0, 797, 10..."
121,Ability of a wash regimen to remove biofilm fr...,The skin/implant interface of osseointegrated ...,[Medicine],[Bacteria],[3]
150,Synthesis and biological activities of (Z) and...,Synthesis of Z and E ethenyl acyclonucleosides...,[Medicine],[Viruses],[8]
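
The `paper_keys` column above mixes integers and floats and repeats some keys (for example `[1125298.0, 1125298.0]` for `Goes`). A small clean-up step, sketched here with a hypothetical `normalize_keys` helper, casts every key to `int` and drops duplicates while preserving order:

```python
from typing import Iterable, List, Union


def normalize_keys(keys: Iterable[Union[int, float]]) -> List[int]:
    """Cast keys to int and remove duplicates, keeping first-seen order."""
    # dict.fromkeys preserves insertion order, so it works as an ordered set.
    return list(dict.fromkeys(int(k) for k in keys))


goes_keys = normalize_keys([1125298.0, 1125298.0])
mixed_keys = normalize_keys([1075673.0, 1075673.0, 7017.0, 7017.0, 797])
```

It could be applied with `data["paper_keys"].apply(normalize_keys)`, assuming no key in the column is NaN.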
