## Get all archives using http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/download/

Need to find a way to filter all these articles and keep only those which can be of interest to us i.e. those talking about species.

Idealy we need to tag each article of interest with:
- tag of the specie it talks about
- tag of the genus
- tag of the family

We can base our work on: 
- paper title
- paper abstract

Our assumption is that is the author used a particular species for his article, he will mention it either in the title or the abstract (strong assumption).

First I will study papers' abstract. 

In a given gz file, we will need to drop all rows with no abstract. 

#### Imports

In [1]:
import json
import pandas as pd
import gzip
import os
import io
from langdetect import detect
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 

In [2]:
from typing import List, Dict, Union, Set

### Get data

In [3]:
data_dir = "/Users/chloesekkat/Documents/batch8_ceebios/data_open_source"

In [4]:
def get_gz_files(data_dir: str) -> List[Dict]:
    """
    Get list of json files from gz files.
    """
    for file in os.listdir(data_dir):
        json_list = []
        if file.endswith('.gz'):
            gz = gzip.open(os.path.join(data_dir, file), 'rb')
            f = io.BufferedReader(gz)
            for line in f.readlines():
                json_list.append(json.loads(line))
            gz.close()
        return json_list

In [5]:
json_list = get_gz_files(data_dir)

In [6]:
data = pd.DataFrame(json_list)

In [7]:
data.shape

(32229, 21)

In [8]:
def missing_data(data: pd.DataFrame) -> pd.DataFrame:
    total = data.isnull().sum()
    percent = (data.isnull().sum()/data.isnull().count()*100)
    tt = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return(tt)

In [9]:
missing_data(data)

Unnamed: 0,Total,Percent
id,0,0.0
title,0,0.0
paperAbstract,0,0.0
authors,0,0.0
inCitations,0,0.0
outCitations,0,0.0
year,254,0.78811
s2Url,0,0.0
sources,0,0.0
pdfUrls,0,0.0


In [10]:
data.head()

Unnamed: 0,id,title,paperAbstract,authors,inCitations,outCitations,year,s2Url,sources,pdfUrls,...,journalName,journalVolume,journalPages,doi,doiUrl,pmid,fieldsOfStudy,magId,s2PdfUrl,entities
0,5cf3fcad3ee67c45f1c3d98c2b4bc22f683bfca7,Quantum phase diagrams and time-of-flight pict...,By treating the hopping parameter as a perturb...,"[{'name': 'Jun Zhang', 'ids': ['49050562']}, ...",[7dfd349892b456de2b60a9e9a4dedfbaeedf767f],"[b029856e343340fc7c034f514dc15a2ad49f6c1a, 7c8...",2016.0,https://semanticscholar.org/paper/5cf3fcad3ee6...,[],[],...,Laser Physics,26.0,095501,10.1088/1054-660X/26/9/095501,https://doi.org/10.1088/1054-660X%2F26%2F9%2F0...,,[Physics],2461559103,,[]
1,9758f471a6789d32e9d684856c1b257b1fbb5546,IPCC (Intergovernmental Panel on Climate Chang...,The European Science Foundation (ESF) and the ...,"[{'name': 'Richard C. J. Somerville', 'ids': [...","[eca662dc341726551b1e74e9bc86d6f9aa37b15e, efd...",[],2008.0,https://semanticscholar.org/paper/9758f471a678...,[],[],...,,,,,,,[],2902693084,,[]
2,73236ad2ede98f3f3f6acbf83a463b4afe2dd1fb,Increasing CRISPR Efficiency and Measuring Its...,Genome editing of human cluster of differentia...,"[{'name': 'Jenny Shapiro', 'ids': ['36049549'...",[],"[d83ebd27f91b75b472da63636596668da6c72059, 56c...",2020.0,https://semanticscholar.org/paper/73236ad2ede9...,[Medline],[],...,Molecular Therapy. Methods & Clinical Development,17.0,1097 - 1107,10.1016/j.omtm.2020.04.027,https://doi.org/10.1016/j.omtm.2020.04.027,32478125.0,"[Biology, Medicine]",3022279551,,[]
3,3d9a672fef4b9fe99cd948787e340d68fbf1513b,Don’t You Be Telling Me How Tah Talk: Educatio...,,"[{'name': 'LaQuita N Gresham', 'ids': ['811828...",[],"[3285d0b0374acfeaeca887ba5e884d62393d5e40, 136...",2014.0,https://semanticscholar.org/paper/3d9a672fef4b...,[],[],...,,,,,,,[Medicine],49447698,,[]
4,251045388ce98e901ccc7d22ae754a224c72dc28,Therapeutic effect of taurine against aluminum...,The aim of the study was to demonstrate the th...,"[{'name': 'Springer-Verlag Italia', 'ids': ['...",[077ca4833b07ff8fb39501d915f43d48976c7eab],"[fb93f765e84bfcbdc722c07762e247a544bae269, 21b...",2014.0,https://semanticscholar.org/paper/251045388ce9...,[],[],...,,,,,,,[Medicine],2327692075,,[]


In [11]:
type(data.iloc[3]["paperAbstract"])

str

We need to remove the rows with empty string.

In [12]:
def remove_empty_abstract(df: pd.DataFrame) -> pd.DataFrame:
    """
    Remove rows where paper abstract is an empty string.
    """
    where = df["paperAbstract"].values != ""
    return df[where]

In [13]:
data = remove_empty_abstract(data)

In [14]:
to_keep = [
    "title",
    "paperAbstract",
    "fieldsOfStudy"
]

def keep_columns(df: pd.DataFrame, cols_to_keep: List[str]) -> pd.DataFrame:
    """
    Return dataframe with wanted columns.
    """
    return df[cols_to_keep]

In [15]:
data = keep_columns(data, to_keep)

In [16]:
data.head()

Unnamed: 0,title,paperAbstract,fieldsOfStudy
0,Quantum phase diagrams and time-of-flight pict...,By treating the hopping parameter as a perturb...,[Physics]
1,IPCC (Intergovernmental Panel on Climate Chang...,The European Science Foundation (ESF) and the ...,[]
2,Increasing CRISPR Efficiency and Measuring Its...,Genome editing of human cluster of differentia...,"[Biology, Medicine]"
4,Therapeutic effect of taurine against aluminum...,The aim of the study was to demonstrate the th...,[Medicine]
6,Mechanism of 'crowd' in evolutionary MAS for m...,This work introduces a new evolutionary approa...,[Computer Science]


In [17]:
def keep_english_titles(df: pd.DataFrame) -> pd.DataFrame:
    """
    Keep only papers with title in english.
    """
    keep_indexes = [] # get lits of indexes that we will keep
    df = df.reset_index(drop=True)
    for i, element in enumerate(df["title"]):
        try:
            res = detect(element)
            if res == 'en':
                keep_indexes.append(i)
        except:
            print('Error with: ', element)
            print(type(element))
    return df[df.index.isin(keep_indexes)]

In [18]:
# TODO: changer la fonction précédente avec un map 

In [19]:
data = keep_english_titles(data)

Error with:  206
<class 'str'>


In [20]:
data.shape

(14425, 3)

In [21]:
data.head()

Unnamed: 0,title,paperAbstract,fieldsOfStudy
0,Quantum phase diagrams and time-of-flight pict...,By treating the hopping parameter as a perturb...,[Physics]
1,IPCC (Intergovernmental Panel on Climate Chang...,The European Science Foundation (ESF) and the ...,[]
2,Increasing CRISPR Efficiency and Measuring Its...,Genome editing of human cluster of differentia...,"[Biology, Medicine]"
3,Therapeutic effect of taurine against aluminum...,The aim of the study was to demonstrate the th...,[Medicine]
4,Mechanism of 'crowd' in evolutionary MAS for m...,This work introduces a new evolutionary approa...,[Computer Science]


### Load GBIF data

In [22]:
gbif = pd.read_csv("/Users/chloesekkat/Documents/batch8_ceebios/data/simplified_taxon_gbif.csv")
gbif.head()

Unnamed: 0,taxonID,parentNameUsageID,canonicalName,scientificName,taxonRank,family,genus
0,1162096,1162079.0,Eopenthes deceptor,"Eopenthes deceptor Sharp, 1908",species,Elateridae,Eopenthes
1,1162114,1162079.0,Eopenthes basalis,"Eopenthes basalis Sharp, 1885",species,Elateridae,Eopenthes
2,1741665,1741585.0,Cochylis psychrasema,"Cochylis psychrasema Meyrick, 1937",species,Tortricidae,Cochylis
3,1741670,1741585.0,Cochylis sagittigera,"Cochylis sagittigera Razowski & Becker, 1983",species,Tortricidae,Cochylis
4,1782495,1782493.0,Baputa dichroa,"Baputa dichroa Kirsch, 1877",species,Noctuidae,Baputa


Idealy we need to tag each article of interest with:

- tag of the specie it talks about
- tag of the genus
- tag of the family

In [23]:
missing_data(gbif)

Unnamed: 0,Total,Percent
taxonID,0,0.0
parentNameUsageID,17,0.000258
canonicalName,640022,9.716995
scientificName,0,0.0
taxonRank,0,0.0
family,327389,4.970512
genus,339089,5.148145


In [24]:
gbif = gbif.dropna()

In [25]:
all_species = gbif["canonicalName"].unique().tolist()
all_family = gbif["family"].unique().tolist()
all_genus = gbif["genus"].unique().tolist()

In [26]:
all_names = set(all_species + all_family + all_genus)
len(all_names)

5285518

In [27]:
gbif.shape

(5710818, 7)

### Load Keyword processor

In [28]:
# prend un certain temps

from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()

for name in all_names:
    keyword_processor.add_keyword(name)

In [29]:
text = "Distribution and movement patterns of Antarctic blue whales Balaenoptera musculus intermedia at large temporal and spatial scales are still poorly understood. The objective of this study was to explore spatio-temporal distribution patterns of Antarctic blue whales in the Atlantic sector of the Southern Ocean,using passive acoustic monitoring data. Multi-year data were collected between 2008 and 2013 by 11 recorders deployed in the Weddell Sea and along the Greenwich meridian. Antarctic blue whale Z-calls were detected via spectrogram cross-correlation. A Blue Whale Index was developed to quantify the proportion of time during which acoustic energy from Antarctic blue whales dominatedover background noise. Our results show that Antarctic blue whales were acoustically present year-round, with most call detections between January and April.During austral summer, the number of detected calls peaked synchronously throughout the study area in mostyears, and hence, no directed meridional movement pattern was detectable. During austral winter,vocalizations were recorded at latitudes as high as 69°S, with sea ice cover exceeding 90%,suggesting that some Antarctic blue whales overwinterin Antarctic waters. Polynyas likely serve as an important habitat for baleen whales duringaustral winter, providing food and reliable access to open water for breathing. Overall, our results support increasing evidence of a complex and non-obligatory migratory behavior of Antarctic blue whales,potentially involving temporally and spatially dynamic migration routes and destinations, as well as variable timing of migration to and from the feeding grounds."

In [30]:
keyword_processor.extract_keywords(text)

['Balaenoptera musculus intermedia',
 'Scales',
 'Are',
 'This',
 'Data',
 'Data',
 'Sea',
 'Via',
 'Area',
 'As',
 'As',
 'Sea',
 'As',
 'As',
 'As']

In [31]:
where = gbif["canonicalName"] == "Sea"
gbif[where]

Unnamed: 0,taxonID,parentNameUsageID,canonicalName,scientificName,taxonRank,family,genus
3013958,1892931,7017.0,Sea,"Sea Hayward, 1950",genus,Nymphalidae,Sea


All these words "As", "Are", "Data", "This" seem to be actual canonical names. We will get many false positive when finding matches, we will have to add another filter on the articles.

First we will remove all stopwords in title and abstract.

In [32]:
stop_words = set(stopwords.words('english'))

own_list = set([
    "age",
    "sea",
    "data",
    "idea",
    "may",
])

list_stopwords = stop_words | own_list

def remove_stopwords_from_title_abstract(data: pd.DataFrame, list_stopwords: Set) -> pd.DataFrame:
    """
    Remove all stopwords from title and abstract in order to prevent many false positive.
    """
    data = data.reset_index(drop=True)
    data["full"] = data["title"] + " " + data["paperAbstract"]
    data["full"] = data["full"].str.lower()
    data["full"] = data["full"].map(lambda x: word_tokenize(x))
    data["full"] = data["full"].map(lambda sentence: " ".join([word for word in sentence if not word in list_stopwords]))
    return data

In [33]:
data = remove_stopwords_from_title_abstract(data, list_stopwords)

In [34]:
# par sécurité notamment à cause des mots composés comme as-grown
keyword_processor.remove_keywords_from_list(list(list_stopwords))

### Find keywords on all dataframe

In [37]:
def keep_articles_with_species(data: pd.DataFrame, keyword_processor: KeywordProcessor) -> pd.DataFrame:
    """
    Keep only articles for which we find a match and add a keyword column.
    Keywords are searched in paper title and paper abstract.
    The matched keywords need to be previously set in `keyword_processor`.
    """
    data["keyword"] = data["full"].map(lambda x: keyword_processor.extract_keywords(x))
    where = data["keyword"].astype(str) == "[]"
    data = data[~where]
    data = data.drop(["full"], axis=1)
    data = data.reset_index(drop=True)
    return data

In [38]:
data = keep_articles_with_species(data, keyword_processor)

In [39]:
data

Unnamed: 0,title,paperAbstract,fieldsOfStudy,keyword
0,Quantum phase diagrams and time-of-flight pict...,By treating the hopping parameter as a perturb...,[Physics],"[Via, Momentum]"
1,IPCC (Intergovernmental Panel on Climate Chang...,The European Science Foundation (ESF) and the ...,[],"[Panel, Des, Aix, Thoron, Aix]"
2,Increasing CRISPR Efficiency and Measuring Its...,Genome editing of human cluster of differentia...,"[Biology, Medicine]",[Bona]
3,Therapeutic effect of taurine against aluminum...,The aim of the study was to demonstrate the th...,[Medicine],[Gaba]
4,Conformationally restricted TRH analogs: a pro...,"In principle, the development of the active an...","[Chemistry, Medicine]",[Area]
...,...,...,...,...
8382,Dissecting the Core Fear in Anorexia Nervosa: ...,Anorexia nervosa (AN) is uniquely placed in th...,"[Medicine, Psychology]","[Spectrum, Via]"
8383,[Blood lipids in 11- to 14-year-old schoolchil...,"As a part of the cooperative programme ""Epidem...",[Medicine],[Alpha]
8384,Effect of Inhalation of Essential Oil of Rosa ...,Some studies document that odorants influence ...,[Medicine],"[Rosa, Rosa damascena]"
8385,Supercritical Carbon Dioxide Extraction of Squ...,Separation of squalene from Amaranthus panicul...,[Chemistry],"[Amaranthus paniculatus, Amaranthus paniculatus]"


In [40]:
data.shape

(8387, 4)