# Install libraries

Make sure you [install `rapidfuzz`](https://github.com/rapidfuzz/RapidFuzz?tab=readme-ov-file) for fuzzy matching. Click the link for documentation.

If you have issues with it, you can always go with [thefuzz](https://github.com/seatgeek/thefuzz). In that case, do `from thefuzz import fuzz, process` instead.

In [1]:
# Standard Python
import os
import unicodedata
from glob import glob
from importlib import reload
import datetime

# Canon libraries
import pandas as pd
import numpy as np

# You'll need to install this one
from rapidfuzz import fuzz, process

# The utils file to keep the notebook short
import utils

- The file `pnas.2217564120.sd01.xlsx` has a list of plant-specific academic journals. The base of the list comes [from here](https://www.pnas.org/doi/10.1073/pnas.2217564120). Then, based on the journals that appear in the MU publications, I added missing journals whose titles contain any of the words
```
['Plant', 'Botan', 'Phyto', 'Hort']
```

- The `MeSH_terms` file contains keywords that are mostly unique to plant biology research. 
- The terms are lowercased to have better chances of matching.

In [63]:
src = '..' + os.sep + 'raw' + os.sep
journals = pd.read_excel(src + 'pnas.2217564120.sd01.xlsx')
plantssns = pd.unique( journals.loc[:, ['ISSN','eISSN'] ].values.ravel() )
plantssns = plantssns[~pd.isna(plantssns)]
meshterms = np.char.lower(np.loadtxt(src + 'MeSH_terms.txt', dtype=str, delimiter=','))
meshterms = np.array([' ' + x for x in meshterms])

## Loading and preparing the data

- Load all the papers published since 2015 with at least one author affiliated with MU.
- Data obtained from [dimensions.ai](https://www.dimensions.ai/)
- In reality, `MU_Pubs_2026.xlsx` contains all the papers published under *University of Missouri System*.
- To make name comparisons and diagnostics easier down the road, all the author names will be converted to `ascii` (standard English-language keyboard).

```python
# Example
i = 321
text = df.iloc[i]['Authors']
print(text)
text = unicodedata.normalize('NFD', text)
text = text.encode('ascii', 'ignore').decode("utf-8")
print('\n',text,sep='')
```
The commands above rewrite the names
```
Sanz, Amparo; Pike, Sharon; Khan, Mather A; Carrió-Seguí, Àngela; Mendoza-Cózatl, David G; Peñarrubia, Lola; Gassmann, Walter
```
as:
```
Sanz, Amparo; Pike, Sharon; Khan, Mather A; Carrio-Segui, Angela; Mendoza-Cozatl, David G; Penarrubia, Lola; Gassmann, Walter
```
- Notice that all the accents and non-English characters have been replaced.
- Also, part of the data preparation removes any `.` or `-` characters: `David M. Braun ---> David M Braun`
- **This will also change `Missouri-Columbia` to `MissouriColumbia`**
- *You may want to save the 42K entry file so you won't have to compile it every time*
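
The normalization and cleanup described above can be sketched as a single helper. This is a minimal sketch, not the code in `utils`: the function name `to_ascii_name` is hypothetical, and it assumes that only `.` and `-` are stripped once the accents are removed.

```python
import unicodedata

def to_ascii_name(text):
    """Hypothetical helper: strip accents, then drop '.' and '-' characters."""
    # NFD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFD', text)
    # Encoding to ASCII with errors='ignore' discards the combining marks
    ascii_text = decomposed.encode('ascii', 'ignore').decode('ascii')
    # Remove periods and hyphens so that name variants compare more easily
    return ascii_text.replace('.', '').replace('-', '')

print(to_ascii_name('Mendoza-Cózatl, David G.'))
print(to_ascii_name('University of Missouri-Columbia'))
```

The second call shows the side effect noted above: `Missouri-Columbia` becomes `MissouriColumbia`. Letters that have no decomposition into a base letter plus accent (such as `ø`) are dropped by this approach rather than transliterated.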

In [3]:
# Of the 54 fields available from Dimensions, these 29 might be relevant at some point
columns_to_keep = [
    'Publication ID', 'Title', 'Abstract', 'Source title', 'ISSN', 'Publisher', 'MeSH terms', 'PubYear',
    'Open Access', 'Publication Type', 'Document Type', 'Authors', 'Authors (Raw Affiliation)', 'Corresponding Authors',
    'Research Organizations - standardized', 'GRID IDs', 'City of standardized research organization',
    'State of standardized research organization', 'Country of standardized research organization', 'Funder',
    'Funder Group', 'Funder Country', 'Times cited', 'RCR', 'FCR', 'Altmetric', 'Fields of Research (ANZSRC 2020)',
    'Units of Assessment', 'Sustainable Development Goals'
]

institute = 'MU'
filenames = sorted(glob(src + institute + '*.xlsx'))

df = utils.prepare_dimensions_data(filenames, columns_to_keep)
print('Loaded and kept', len(df), 'publications')

../raw/MU_Pubs_2015.xlsx
../raw/MU_Pubs_2016.xlsx
../raw/MU_Pubs_2017.xlsx
../raw/MU_Pubs_2018.xlsx
../raw/MU_Pubs_2019.xlsx
../raw/MU_Pubs_2020.xlsx
../raw/MU_Pubs_2021.xlsx
../raw/MU_Pubs_2022.xlsx
../raw/MU_Pubs_2023.xlsx
../raw/MU_Pubs_2024.xlsx
../raw/MU_Pubs_2025.xlsx
../raw/MU_Pubs_2026.xlsx
Loaded and kept 41932 publications


## Criteria to determine if a publication is plant-specific

To be considered plant-specific, a paper has to match at least one criterion (out of three).

### 1. Published in a plant-specific journal

- Subset the papers that were published in plant-specific journals.
- Journals are identified by their unique numerical identifier (ISSN) instead of their names to avoid spelling discrepancies.
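
As an illustration of ISSN-based matching, here is a minimal sketch. It is not the implementation of `utils.mask_plant_journals`: the function name `is_plant_journal` is hypothetical, and it assumes the `ISSN` field holds one or more comma-separated ISSNs (as in values like `0739-4462, 1520-6327` seen elsewhere in this notebook).

```python
def is_plant_journal(issn_field, plant_issns):
    """Hypothetical check: True if any ISSN in the field is a known plant-journal ISSN."""
    # Missing values (None or NaN) cannot match anything
    if not isinstance(issn_field, str):
        return False
    # Assumed layout: one or more ISSNs separated by commas
    issns = {issn.strip() for issn in issn_field.split(',')}
    return not issns.isdisjoint(plant_issns)

plant_issns = {'0940-6360'}  # illustrative set, not the real journal list
print(is_plant_journal('0940-6360, 1432-1890', plant_issns))
print(is_plant_journal('2309-608X', plant_issns))
print(is_plant_journal(float('nan'), plant_issns))
```

Matching on either the print or the electronic ISSN is why both the `ISSN` and `eISSN` columns of the journal list are pooled into `plantssns`.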

In [4]:
isplantjournal = utils.mask_plant_journals(df, plantssns)
print('Plant-specific journal publications:\t',np.sum(isplantjournal))

Plant-specific journal publications:	 1219


### 2. Categorized as *Agriculture*, *Plant Biology*, *Soil*, or *Horticulture* according to the ANZSRC Fields of Research

- There are other related fields of research (e.g. Environmental Biotechnology), but the ones selected below are pretty much exclusive to plant research
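
A minimal sketch of this criterion, assuming a simple substring test against the `Fields of Research (ANZSRC 2020)` string. The function name `has_plant_field` is hypothetical and the real `utils.mask_plant_anzsrc` may work differently.

```python
def has_plant_field(fields, codes):
    """Hypothetical check: True if any target ANZSRC label occurs in the field string."""
    # Missing values (None or NaN) cannot match anything
    if not isinstance(fields, str):
        return False
    return any(code in fields for code in codes)

codes = ['3108 Plant Biology', '3008 Horticultural', '4106 Soil']
print(has_plant_field('31 Biological Sciences; 3108 Plant Biology', codes))
print(has_plant_field('31 Biological Sciences; 3107 Microbiology', codes))
```

Because the test is a substring match, a partial label such as `3008 Horticultural` also matches the full category `3008 Horticultural Production`.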

In [5]:
ANZSRC = ['3108 Plant Biology','3008 Horticultural', '4106 Soil']
isplantanz = utils.mask_plant_anzsrc(df, ANZSRC)
print('Categorized as plant-specific:\t',np.sum(isplantanz))

Categorized as plant-specific:	 1176


### 3. It has at least 3 plant-related keywords

- The `MeSH_terms` file has a list of what I deemed plant-related keywords that are very much related to plant biology and **not** biology in general
- We require at least 3 terms to make sure the paper is truly focused on plants
- For example, while *cellulose* and *lignin* are very much plant-exclusive terms, some papers discuss them in the context of materials science
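
The keyword count can be sketched as follows. This is only an illustration under assumptions: the function name `count_plant_terms` is hypothetical, the terms are made up for the example, and the leading space on each term mirrors how `meshterms` is prepared above (presumably so that a term only matches at the start of a word).

```python
def count_plant_terms(mesh_field, plant_terms):
    """Hypothetical count of plant-related keywords present in one MeSH string."""
    # Missing values (None or NaN) contain no keywords
    if not isinstance(mesh_field, str):
        return 0
    # Pad with a leading space so the very first term can also match ' term'
    text = ' ' + mesh_field.lower()
    return sum(term in text for term in plant_terms)

# Illustrative terms only; the real list lives in MeSH_terms.txt
plant_terms = [' plant roots', ' cellulose', ' lignin', ' glycine max']
mesh = 'plant roots; algorithms; cellulose'
n = count_plant_terms(mesh, plant_terms)
print(n, n >= 3)
```

With a threshold of 3, this example paper would not qualify on keywords alone, which is the intended behavior for papers that only touch on plant-related terms.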

In [64]:
isplantmesh = utils.count_plant_mesh(df, meshterms)
for N in range(isplantmesh.max()+1):
    print(N, np.sum(isplantjournal + (isplantmesh > N)), np.sum(isplantjournal + (isplantmesh > N)) - np.sum(isplantjournal), sep='\t')

data = df.loc[~isplantjournal & (isplantmesh >= 3), ['Publication ID', 'Title', 'Source title', 'Authors', 'Corresponding Authors']]
data = data[~pd.isna(data['Corresponding Authors'])]
print(data.shape)
data.tail(10)

## Getting plant-specific corresponding authors

- Find the union of those three criteria to determine the subset of plant-specific publications
- Discard those with no corresponding authors

In [69]:
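# Union of all three criteria; on boolean masks, `+` acts as a logical OR
isplant = isplantanz + isplantjournal + (isplantmesh > 2)
# Overrides the line above: only the journal and MeSH criteria are used from here on
isplant = isplantjournal + (isplantmesh >= 3)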

data = df.loc[isplant, ['Publication ID', 'Title', 'Source title', 'Authors', 'Corresponding Authors']]
data = data[~pd.isna(data['Corresponding Authors'])]
print(data.shape)
data.head()

(1363, 5)


Unnamed: 0,Publication ID,Title,Source title,Authors,Corresponding Authors
58,pub.1071447931,Genome‐Wide Association Analysis of Diverse So...,The Plant Genome,"Dhanapal, Arun Prabhu; Ray, Jeffery D; Singh, ...","Fritschi, Felix B (University of MissouriColum..."
59,pub.1071447926,Phytic Acid and Inorganic Phosphate Compositio...,The Plant Genome,"Vincent, Jennifer A; Stacey, Minviluz; Stacey,...","Bilyeu, Kristin D (University of MissouriColum..."
60,pub.1071447895,Linkage Maps of a Mediterranean × Continental ...,The Plant Genome,"Dierking, Ryan; Azhaguvel, Perumal; Kallenbach...","Dierking, Ryan (Purdue University West Lafayette)"
178,pub.1059858594,A model for intracellular movement of Cauliflo...,Journal of Experimental Botany,"Schoelz, James E; Angel, Carlos A; Nelson, Ric...","Schoelz, James E (University of MissouriColumbia)"
179,pub.1059858561,"Core clock, SUB1, and ABAR genes mediate flood...",Journal of Experimental Botany,"Syed, Naeem H; Prince, Silvas J; Mutava, Raymo...","Syed, Naeem H (Canterbury Christ Church Univer..."


- Subset only the papers that have at least one corresponding author in the list of `institutes`
- Some papers have multiple corresponding authors: these are separated by `);`
    - They appear as `Author 1 (University 1); Author 2 (University 2)`
    - The closing parenthesis `)` is important to separate multiple authors
    - Otherwise, you can get confused with authors with multiple affiliations: `Author 1 (University 1; University 2)`
- Once you have separated all the corresponding authors, separate their name from their affiliation
- Get the unique authors (remove the repetitions)
- Count how many papers they have associated to them
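
The parsing steps above can be sketched as follows. This is a hedged illustration, not the code behind `utils.corresponding_authors_from_institute`: the function names `split_corresponding` and `count_papers` are hypothetical, and the input strings are placeholders in the format described above.

```python
from collections import Counter

def split_corresponding(field):
    """Split 'Name (Affil; Affil); Name (Affil)' into (name, affiliation) pairs."""
    pairs = []
    # Split on ');' rather than ';' so multi-affiliation authors stay in one piece
    for chunk in field.split(');'):
        chunk = chunk.strip().rstrip(')')
        if not chunk:
            continue
        # The name is everything before the first ' ('
        name, _, affiliation = chunk.partition(' (')
        pairs.append((name.strip(), affiliation.strip()))
    return pairs

def count_papers(fields, institutes):
    """Count papers per corresponding author affiliated with any target institute."""
    counts = Counter()
    for field in fields:
        for name, affiliation in split_corresponding(field):
            if any(inst in affiliation for inst in institutes):
                counts[name] += 1
    return counts

fields = [
    'Author 1 (University 1; University 2); Author 2 (University 3)',
    'Author 1 (University 1)',
]
print(split_corresponding(fields[0]))
print(count_papers(fields, ['University 1']))
```

Malformed entries that lack a name before the parenthesis would come out of a parser like this with an affiliation fragment in the name slot, so the result still needs a sanity check.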

In [72]:
institutes = ['University of MissouriColumbia', 'University of Missouri System']
authors, pnum = utils.corresponding_authors_from_institute(data, institutes)
print('Found', len(authors),'different corresponding authors')

Found 206 different corresponding authors


Some Corresponding Author values are weird in the raw data. For example, one paper lists its corresponding author as
```
Ferrieri, Richard (University of Missouri-Columbia; Missouri Research Reactor Center, University of Missouri, Columbia, MO 65211, USA;, srstt9@mail.missouri.edu, (S.S.);, afbkhn@mail.missouri.edu, (A.H.);, garren.powell@mail.missouri.edu, (G.P.);, alanstaett@burnsmcd.com, (A.A.);, gerheart@msu.edu, (A.G.);, mvbenoit@mail.missouri.edu, (M.B.);, wildersl@missouri.edu, (S.W.);, schuellerm@missouri.edu, (M.S.
```
which gives
```
University of Missouri-Columbia; University of Missouri-Columbia)
```
as a corresponding author.

- This is obviously wrong, so we are going to remove from the list those names that are *too* long.
- *Too long* in this case means much longer than the 90% quantile of the name lengths, similar to how outliers are flagged in a [boxplot](https://en.wikipedia.org/wiki/Box_plot).
- Keep a Series with number of papers associated to each name
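
A minimal sketch of a length-based filter is shown below. The exact cutoff used by `utils.remove_long_corresponding` is not shown in this notebook, so the rule here (90% quantile plus 1.5 interquartile ranges) is an assumption, and the function name `drop_long_names` is hypothetical.

```python
import statistics

def drop_long_names(names, whisker=1.5):
    """Hypothetical filter: drop names much longer than the 90% length quantile."""
    lengths = [len(name) for name in names]
    q90 = statistics.quantiles(lengths, n=10)[-1]      # 90% quantile
    q25, _, q75 = statistics.quantiles(lengths, n=4)   # quartiles
    # Assumed cutoff, boxplot style: 90% quantile plus 1.5 interquartile ranges
    cutoff = q90 + whisker * (q75 - q25)
    kept = [name for name in names if len(name) <= cutoff]
    dropped = [name for name in names if len(name) > cutoff]
    return kept, dropped

names = [
    'Angelovici, Ruthie', 'Appel, Heidi M', 'Beissinger, Timothy M',
    'Best, Norman B', 'Bilyeu, Kristin D', 'Birchler, James A',
    'Bish, Mandy D', 'Bradley, Kevin W', 'Braun, David M', 'Chen, Pengyin',
    'Das, Debatosh', 'Emerich, David W', 'Gassmann, Walter', 'Heese, Antje',
    'Joshi, Trupti', 'McClure, Bruce', 'McSteen, Paula', 'Mittler, Ron',
    'Stacey, Gary', 'Guo, Ya',
    'University of MissouriColumbia; University of MissouriColumbia)',
]
kept, dropped = drop_long_names(names)
print('Dropped:', dropped)
print('Kept', len(kept), 'names')
```

A quantile-based cutoff adapts to the data, but on very small lists a single outlier can shift the quantile itself, so the rule is more reliable with a few hundred names than with a handful.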

In [74]:
reload(utils)
authors, pnum = utils.remove_long_corresponding(authors, pnum)
print('Reduced to', len(authors),'corresponding authors')

Dropped:
['University of MissouriColumbia)'
 'University of MissouriColumbia; University of MissouriColumbia)'
 'University of MissouriColumbia; University of MissouriColumbia; University of MissouriColumbia)']
--
Reduced to 203 corresponding authors


### Fuzzy-match each name with everyone else in the list

- A score of 100 means perfect match
- **I have not fully verified, but I think the fuzzy match operations are not symmetric**
- (Which does not make sense to me, but oh well...)
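
One way to test the symmetry question directly is to compare both argument orders for every pair of names. The helper below is a sketch with a hypothetical name (`asymmetric_pairs`); the two toy scorers exist only to show what the check reports, and the real scorer used in `utils.fuzzy_matrix` would be passed in their place.

```python
from itertools import combinations

def asymmetric_pairs(scorer, names):
    """Return the pairs (a, b) for which scorer(a, b) differs from scorer(b, a)."""
    return [(a, b) for a, b in combinations(names, 2)
            if scorer(a, b) != scorer(b, a)]

# Toy scorers for illustration only
def shared_letters(a, b):
    return len(set(a) & set(b))      # symmetric by construction

def letters_missing_from_b(a, b):
    return len(set(a) - set(b))      # depends on argument order

names = ['Braun, David M', 'Braun, David', 'Stacey, Gary']
print(asymmetric_pairs(shared_letters, names))
print(asymmetric_pairs(letters_missing_from_b, names))
```

An empty list means the scorer gave the same value in both directions for every pair tested; any pair that is returned is a concrete counterexample to inspect.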

In [86]:
reload(utils)
fz = utils.fuzzy_matrix(authors)
fz.iloc[:5, :5]

Unnamed: 0,"Kalaitzandonakes, Nicholas","Mitchum, Melissa Goellner","VieiraPotter, Victoria J","Chhapekar, Sushil Satish","MunozSanz, Juan Vicente"
"Kalaitzandonakes, Nicholas",-1.0,-1.0,-1.0,-1.0,-1.0
"Mitchum, Melissa Goellner",-1.0,-1.0,-1.0,-1.0,-1.0
"VieiraPotter, Victoria J",-1.0,-1.0,-1.0,-1.0,14.285714
"Chhapekar, Sushil Satish",-1.0,-1.0,-1.0,-1.0,-1.0
"MunozSanz, Juan Vicente",-1.0,-1.0,14.285714,-1.0,-1.0


**Re-order the remaining authors by the length of their names.**
- Make a copy of the list
- Remove those names that are deemed copies
- Add the papers of the matches (if the fuzzy match is higher than `tol`)
- Only remove names downstream:
    - Since the list is ordered by name length, *David Braun* will be removed because of *David M Braun* but not the other way around
    - That way, we always keep the longer version of the name
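
The merging procedure described above can be sketched as follows. This is an illustration under assumptions, not `utils.fuzzymatching_authors`: it uses `difflib` from the standard library as a stand-in for `rapidfuzz`, the function names are hypothetical, and the paper counts are made up.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in similarity score on a 0-100 scale (not the rapidfuzz scorer)."""
    return 100 * SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

def merge_duplicates(paper_counts, tol=90):
    """Fold shorter name variants into the longer one, adding their papers."""
    # Longest names first, so only shorter (downstream) variants get removed
    names = sorted(paper_counts, key=lambda n: (-len(n), n))
    merged = dict(paper_counts)
    for i, longer in enumerate(names):
        if longer not in merged:
            continue  # already absorbed by an even longer variant
        for shorter in names[i + 1:]:
            if shorter in merged and similarity(longer, shorter) >= tol:
                merged[longer] += merged.pop(shorter)
    return merged

# Made-up counts for illustration
counts = {'Braun, David M': 5, 'Braun, David': 2, 'Stacey, Gary': 3}
print(merge_duplicates(counts))
```

Sorting by length first is what guarantees that the surviving entry is always the most complete version of a name.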

In [87]:
tol = 90
pnums = utils.fuzzymatching_authors(authors, pnum, fz, tol)

Started with:	 203 

Mitchum, Melissa Goellner	-->	['Mitchum, Melissa G']
Voothuluru, Priyamvada	-->	['Voothuluru, Priya']
Patharkar, Osric Rahul	-->	['Patharkar, O Rahul']
FlintGarcia, Sherry A	-->	['FlintGarcia, Sherry']
Matthes, Michaela S	-->	['Matthes, Michaela']
Cocroft, Reginald B	-->	['Cocroft, Reginald']
Ferrieri, Richard A	-->	['Ferrieri, Richard' 'Ferrieri, R A']
Shannon, J Grover	-->	['Shannon, Grover']
Ferrieri, Richard	-->	['Ferrieri, R A']
Birchler, James A	-->	['Birchler, James']
Bilyeu, Kristin D	-->	['Bilyeu, Kristin']
Scaboo, Andrew M	-->	['Scaboo, Andrew']
Hibbard, Bruce E	-->	['Hibbard, B E']
Bradley, Kevin W	-->	['Bradley, Kevin']
Nguyen, Henry T	-->	['T Nguyen, Henry' 'Nguyen, H T']
T Nguyen, Henry	-->	['Nguyen, H T']
Shelby, Kent S	-->	['Shelby, KS']
Lall, Namrita	-->	['Lall, N']
Chen, Pengyin	-->	['Chen, P']
Bish, Mandy D	-->	['Bish, Mandy']
Vuong, Tri D	-->	['Vuong, T D']
Guo, Ya	-->	['Guo, Y']

After matching:	181


### Getting extra papers from the USDA

- Some folks are both USDA and MU, but Dimensions registers them as USDA only.
- We'll repeat the steps above and see if we can add any papers to the list of MU authors we already have
- **We'll be adding only papers, not authors**
- The end goal is to generate a file with USDA-affiliated authors and the number of plant-specific publications they have as corresponding authors.

In [92]:
filename = src + 'USDA_plant_corresponding_authors.csv'
usda_pubs = pd.read_csv(filename).set_index('Corresponding Authors').squeeze()
choices = usda_pubs.index.values
for i in range(len(pnums)):
    name = pnums.index[i]
    match, fscore, idx = process.extractOne(name, choices, scorer=fuzz.partial_ratio)
    if fscore >= 99:
        print(name, '--->', match, '[{:.2f}]'.format(fscore), sep='\t')
        pnums[name] += usda_pubs[match]

Beissinger, Timothy M	--->	Beissinger, Timothy M	[100.00]
Best, Norman B	--->	Best, Norman B	[100.00]
Bilyeu, Kristin D	--->	Bilyeu, Kristin D	[100.00]
Chen, Wei	--->	Chen, Weidong	[100.00]
Das, Debatosh	--->	Das, Debatosh	[100.00]
FlintGarcia, Sherry A	--->	FlintGarcia, Sherry	[100.00]
Gassmann, Walter	--->	Gassmann, Walter	[100.00]
Gillman, Jason D	--->	Gillman, Jason D	[100.00]
Hibbard, Bruce E	--->	Hibbard, Bruce E	[100.00]
Islam, Md Sariful	--->	Islam, Md Sariful	[100.00]
Krishnan, Hari B	--->	Krishnan, Hari B	[100.00]
Oliver, Melvin J	--->	Oliver, Melvin J	[100.00]
Pereira, Adriano E	--->	Pereira, Adriano E	[100.00]
Shelby, Kent S	--->	Shelby, Kent S	[100.00]
Washburn, Jacob D	--->	Washburn, Jacob D	[100.00]
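
In the output above, `Chen, Wei` is paired with `Chen, Weidong` at a score of 100. That is consistent with how partial matching works: when the shorter string appears in full inside the longer one, the match is perfect. These could be two different people, so pairs like this may deserve a manual check. The sketch below illustrates the idea in pure Python with `difflib`; it is a simplified stand-in and not the `rapidfuzz` implementation.

```python
from difflib import SequenceMatcher

def full_score(a, b):
    """Similarity of the two whole strings, scaled to 0-100."""
    return 100 * SequenceMatcher(None, a, b).ratio()

def partial_score(a, b):
    """Best similarity between the shorter string and any equally long
    window of the longer one (simplified stand-in for partial matching)."""
    short, long_ = sorted((a, b), key=len)
    n_windows = len(long_) - len(short) + 1
    windows = (long_[i:i + len(short)] for i in range(n_windows))
    return max(full_score(short, window) for window in windows)

print(partial_score('Chen, Wei', 'Chen, Weidong'))
print(full_score('Chen, Wei', 'Chen, Weidong'))
```

The partial score is a perfect match while the whole-string score is clearly lower, which is why a very high threshold alone does not rule out prefix matches of this kind.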


## Determine which authors make the cut

- To be discussed with David
- How many papers are required to be considered an "IPG" member?

In [93]:
print('Number of authors with at least N plant-specific papers as corresponding author:\n--')
for N in range(1,11):
    print(N, np.sum(pnums >= N), sep='\t')

Number of authors with at least N plant-specific papers as corresponding author:
--
1	181
2	87
3	63
4	47
5	39
6	35
7	28
8	23
9	21
10	16


In [94]:
pub_req = 3
list(pnums[ pnums >= pub_req ].index.values)

['Angelovici, Ruthie',
 'Appel, Heidi M',
 'Beissinger, Timothy M',
 'Best, Norman B',
 'Bilyeu, Kristin D',
 'Birchler, James A',
 'Bish, Mandy D',
 'Bradley, Kevin W',
 'Braun, David M',
 'Chen, Pengyin',
 'Das, Debatosh',
 'Emerich, David W',
 'Ferrieri, Richard A',
 'FlintGarcia, Sherry A',
 'Fritsche, Kevin L',
 'Fritschi, Felix B',
 'Gassmann, Walter',
 'Gillman, Jason D',
 'Guo, Ya',
 'Heese, Antje',
 'Hibbard, Bruce E',
 'Huynh, Man P',
 'Islam, Md Sariful',
 'Joshi, Trupti',
 'Koo, Abraham J',
 'Krishnan, Hari B',
 'Lall, Namrita',
 'Mabry, Makenzie E',
 'Matthes, Michaela S',
 'McClure, Bruce',
 'McSteen, Paula',
 'MendozaCozatl, David G',
 'Meyers, Blake C',
 'Mitchum, Melissa Goellner',
 'Mittler, Ron',
 'Nguyen, Henry T',
 'Oliver, Melvin J',
 'Park, SoYon',
 'Patharkar, Osric Rahul',
 'Peck, Scott C',
 'Pereira, Adriano E',
 'Pires, J Chris',
 'Scaboo, Andrew M',
 'Schenck, Craig A',
 'Schoelz, James E',
 'Shelby, Kent S',
 'Slotkin, R Keith',
 'Stacey, Gary',
 'Stacey, M

In [95]:
foo = pnums.to_frame('N').reset_index(names='names').sort_values(by=['N','names'], ascending=[False,True]).set_index('names').squeeze()
foo.to_csv('MU_IPG.csv', index=True, index_label='Corresponding Authors', header=['Pubs Num'])
foo

names
Nguyen, Henry T       55
Mittler, Ron          43
Meyers, Blake C       33
Stacey, Gary          33
Birchler, James A     25
                      ..
Xi, Xiong              1
Xiao, Lihong           1
Xiong, Zhiyong         1
Zhang, Hao             1
atalya, K Kutsokon     1
Name: N, Length: 181, dtype: int64

____

# Computing the USDA file

- This section builds the `USDA_plant_corresponding_authors.csv` file that the *Getting extra papers from the USDA* step reads.
- Some folks are both USDA and MU, but Dimensions registers them as USDA only.
- We repeat the same steps used for the MU data (plant-specific masks, corresponding-author extraction, fuzzy matching) on the USDA publications.
- The end goal is to generate a file with USDA-affiliated authors and the number of plant-specific publications they have as corresponding authors.

In [88]:
institute = 'USDA'

filenames = sorted(glob(src + institute + '*.xlsx'))
usda = utils.prepare_dimensions_data(filenames, columns_to_keep)
print('Loaded and kept', len(usda), 'publications')
print(usda.shape)

../raw/USDA_Pubs_1.xlsx
../raw/USDA_Pubs_2.xlsx
../raw/USDA_Pubs_3.xlsx
../raw/USDA_Pubs_4.xlsx
../raw/USDA_Pubs_5.xlsx
../raw/USDA_Pubs_6.xlsx
../raw/USDA_Pubs_7.xlsx
../raw/USDA_Pubs_8.xlsx
Loaded and kept 23090 publications
(23090, 29)


In [89]:
isplantjournal = utils.mask_plant_journals(usda, plantssns)
print('Plant-specific journal publications:\t',np.sum(isplantjournal))
isplantanz = utils.mask_plant_anzsrc(usda, ANZSRC)
print('Categorized as plant-specific:\t',np.sum(isplantanz))
isplantmesh = utils.count_plant_mesh(usda, meshterms)

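# Union of all three criteria; on boolean masks, `+` acts as a logical OR
isplant = isplantanz + isplantjournal + (isplantmesh > 2)
# Overrides the line above: only the journal and MeSH criteria are used from here on
isplant = isplantjournal + (isplantmesh > 2)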
data = usda.loc[isplant, ['Publication ID', 'Title', 'Source title', 'Authors', 'Corresponding Authors']]
data = data[~pd.isna(data['Corresponding Authors'])]
print(data.shape)

Plant-specific journal publications:	 3914
Categorized as plant-specific:	 4294
(3678, 5)


In [90]:
institutes = ['United States Department of Agriculture', 'Agricultural Research Service', 'Biological Control of Insects Research']
authors, pnum = utils.corresponding_authors_from_institute(data, institutes)
print('Found', len(authors),'different corresponding authors')

authors, pnum = utils.remove_long_corresponding(authors, pnum)
print('Reduced to', len(authors),'corresponding authors')

Found 661 different corresponding authors
Dropped:
['Agricultural Research Service  Midwest Area)'
 'Agricultural Research Service  Northeast Area)'
 'Agricultural Research Service  Pacific West Area)']
--
Reduced to 658 corresponding authors


In [91]:
fz = utils.fuzzy_matrix(authors)
usda_pubs = utils.fuzzymatching_authors(authors, pnum, fz, tol)
filename = src + institute + '_plant_corresponding_authors.csv'
pd.Series(usda_pubs).to_csv(filename, index=True, index_label='Corresponding Authors', header=['Pubs Num'])

Started with:	 658 

Richardson, Kelley Lynne	-->	['Richardson, Kelley L']
Clyde, Douglas Boyette	-->	['C, Douglas Boyette']
Mahalingam, Ramamurthy	-->	['Mahalingam, R']
BajsaHirschel, Joanna	-->	['BajsaHirschel, J']
Holbrook, Carl Corley	-->	['Holbrook, C Corley']
BalintKurti, Peter J	-->	['BalintKurti, Peter' 'BalintKurti, P J']
Heck, Michelle Cilia	-->	['Heck, Michelle L' 'Heck, Michelle']
Ellsworth, Patrick Z	-->	['Ellsworth, PZ']
Handoo, Zafar Ahmad	-->	['Handoo, Zafar A' 'Handoo, Zafar']
Baumgartner, Kendra	-->	['Baumgartner, K']
Munyaneza, Joseph E	-->	['Munyaneza, J E']
Giovannoni, James J	-->	['Giovannoni, James']
Vandemark, George J	-->	['Vandemark, George']
HulseKemp, Amanda M	-->	['HulseKemp, Amanda']
McElrone, Andrew J	-->	['McElrone, A J' 'McElrone, AJ']
Pooler, Margaret R	-->	['Pooler, Margaret']
Fallen, Benjamin D	-->	['Fallen, Benjamin' 'Fallen, Ben']
Polashock, James J	-->	['Polashock, James']
BalintKurti, Peter	-->	['BalintKurti, P J']
Bilyeu, Kristin D	-->	['Bilyeu,

# Ignore all below

In [48]:
iscorr = np.zeros(len(df), dtype=bool)
for i in range(len(df)):
    if not pd.isna(df.iloc[i]['Corresponding Authors']):
        iscorr[i] = 'Libault, Marc' in df.loc[i, 'Authors']
print(np.sum(iscorr))
foo = df.loc[iscorr, ['Publication ID', 'Title', 'Source title', 'MeSH terms', 'Fields of Research (ANZSRC 2020)', 'Authors', 'Corresponding Authors']]
foo.head(20)

16


Unnamed: 0,Publication ID,Title,Source title,MeSH terms,Fields of Research (ANZSRC 2020),Authors,Corresponding Authors
265,pub.1053348523,Identification of microRNAs and their mRNA tar...,New Phytologist,"cluster analysis; gene expression regulation, ...",31 Biological Sciences; 3108 Plant Biology,"Yan, Zhe; Hossain, Md Shakhawat; Arikit, Siwar...","Stacey, Gary (University of MissouriColumbia)"
758,pub.1028862353,"Xyloglucan, galactomannan, glucuronoxylan, and...",Planta,cell wall; galactose; glucans; mannans; pectin...,"30 Agricultural, Veterinary and Food Sciences;...","Muszynski, Artur; ONeill, Malcolm A; Ramasamy,...","ONeill, Malcolm A (University of Georgia)"
1008,pub.1016543825,Identification and functional characterization...,Plant Biotechnology Journal,bradyrhizobium; gene expression profiling; gen...,"30 Agricultural, Veterinary and Food Sciences;...","Yan, Zhe; Hossain, Md Shakhawat; ValdesLopez, ...","Stacey, Gary (University of MissouriColumbia)"
8114,pub.1084216895,The GmFWL1 (FW2‐2‐like) nodulation gene encode...,Plant Cell & Environment,"biomarkers; bradyrhizobium; genes, plant; geno...",31 Biological Sciences; 3101 Biochemistry and ...,"Qiao, Zhenzhen; Brechenmacher, Laurent; Smith,...","Libault, Marc (University of Oklahoma)"
10981,pub.1107640160,Phosphate Deficiency Negatively Affects Early ...,Genes,,31 Biological Sciences; 3108 Plant Biology,"IsidraArellano, Mariel C; del Rocio ReyeroSaav...","ValdesLopez, Oswaldo (National Autonomous Univ..."
34443,pub.1182844420,Building a FAIR data ecosystem for incorporati...,Frontiers in Genetics,,31 Biological Sciences; 3102 Bioinformatics an...,"Kapoor, Muskan; Ventura, Enrique Sapena; Walsh...","Kapoor, Muskan (Iowa State University of Scien..."
34480,pub.1182613577,Soybean genomics research community strategic ...,The Plant Genome,"glycine max; genomics; genome, plant; plant br...",31 Biological Sciences; 3105 Genetics,"Stupar, Robert M; Locke, Anna M; Allen, Doug K...","Stupar, Robert M (University of Minnesota Twin..."
34574,pub.1182049389,A strategy for identification and characteriza...,Plant Direct,,31 Biological Sciences; 3102 Bioinformatics an...,"Hancock, C Nathan; Germany, Tetandianocee; Red...","Hancock, C Nathan (University of South Carolin..."
34880,pub.1175692761,DIRT/µ: automated extraction of root hair trai...,Journal of Experimental Botany,"plant roots; algorithms; image processing, com...","30 Agricultural, Veterinary and Food Sciences;...","Pietrzyk, Peter; PhanUdom, Neen; Chutoe, Chart...","Bucksch, Alexander (University of Arizona)"
35402,pub.1172569567,Single-cell transcriptome atlases of soybean r...,Plant Communications,glycine max; transcriptome; plant root nodulat...,31 Biological Sciences; 3102 Bioinformatics an...,"CervantesPerez, Sergio Alan; Zogli, Prince; Am...","Libault, Marc (University of MissouriColumbia;..."


In [53]:
foo.iloc[isplantmesh[iscorr] == 2]

Unnamed: 0,Publication ID,Title,Source title,MeSH terms,Fields of Research (ANZSRC 2020),Authors,Corresponding Authors
34880,pub.1175692761,DIRT/µ: automated extraction of root hair trai...,Journal of Experimental Botany,"plant roots; algorithms; image processing, com...","30 Agricultural, Veterinary and Food Sciences;...","Pietrzyk, Peter; PhanUdom, Neen; Chutoe, Chart...","Bucksch, Alexander (University of Arizona)"
41856,pub.1189193754,The differential transpiration response of pla...,Philosophical Transactions of the Royal Societ...,"plant transpiration; stress, physiological; cr...",31 Biological Sciences; 3108 Plant Biology,"Sinha, Ranjita; PelaezVico, Maria Angeles; Pas...","Mittler, Ron (University of Missouri System)"


In [41]:
pd.unique(foo['Fields of Research (ANZSRC 2020)'])

array(['31 Biological Sciences; 3102 Bioinformatics and Computational Biology; 3105 Genetics; 3107 Microbiology',
       '31 Biological Sciences; 3101 Biochemistry and Cell Biology; 3107 Microbiology',
       '31 Biological Sciences; 3101 Biochemistry and Cell Biology',
       '34 Chemical Sciences; 3406 Physical Chemistry',
       '31 Biological Sciences; 3107 Microbiology',
       '31 Biological Sciences; 3102 Bioinformatics and Computational Biology; 3107 Microbiology',
       '31 Biological Sciences; 3107 Microbiology; 32 Biomedical and Clinical Sciences; 3207 Medical Microbiology',
       '31 Biological Sciences; 3102 Bioinformatics and Computational Biology; 3103 Ecology; 3105 Genetics; 3107 Microbiology'],
      dtype=object)

In [47]:
pd.unique(foo['Authors'])

array(['Daniel, Jeremy J; Givan, Scott A; Brun, Yves V; Brown, Pamela J B',
       'Williams, Michelle; Hoffman, Michelle D; Daniel, Jeremy J; Madren, Seth M; Dhroso, Andi; Korkin, Dmitry; Givan, Scott A; Jacobson, Stephen C; Brown, Pamela J B',
       'FigueroaCuilan, Wanda; Daniel, Jeremy J; Howell, Matthew; Sulaiman, Aliyah; Brown, Pamela J B',
       'Zhang, Chiqian; Brown, Pamela JB; Hu, Zhiqiang',
       'Attai, Hedieh; Rimbey, Jeanette; Smith, George P; Brown, Pamela J B',
       'Howell, Matthew; Aliashkevich, Alena; Salisbury, Anne K; Cava, Felipe; Bowman, Grant R; Brown, Pamela J B',
       'Zhang, Chiqian; Brown, Pamela J B; Miles, Randall J; White, Tommi A; Grant, DeAna G; Stalla, David; Hu, Zhiqiang',
       'Attai, Hedieh; Boon, Maarten; Phillips, Kenya; Noben, JeanPaul; Lavigne, Rob; Brown, Pamela J B',
       'Liao, Lisheng; Schaefer, Amy L; Coutinho, Bruna G; Brown, Pamela J B; Greenberg, E Peter',
       'FigueroaCuilan, Wanda M; Brown, Pamela J B',
       'Zhang, Chi

In [43]:
isplantmesh[iscorr]

array([0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 2, 0, 0, 1, 0, 0, 0, 1, 2, 0,
       1, 0, 4, 1, 0, 0, 0, 0])

In [25]:
pd.unique(foo.loc[bar.index]['Fields of Research (ANZSRC 2020)'])

array(['30 Agricultural, Veterinary and Food Sciences; 3001 Agricultural Biotechnology',
       '30 Agricultural, Veterinary and Food Sciences; 3008 Horticultural Production'],
      dtype=object)

In [27]:
iscorr = np.zeros(len(usda), dtype=bool)
for i in range(len(usda)):
    if not pd.isna(usda.iloc[i]['Corresponding Authors']):
        iscorr[i] = 'Kang, David' in usda.loc[i, 'Corresponding Authors']
print(np.sum(iscorr))
usda.loc[iscorr, ['Publication ID', 'Title', 'Source title', 'ISSN', 'Authors', 'Corresponding Authors']]

4


Unnamed: 0,Publication ID,Title,Source title,ISSN,Authors,Corresponding Authors
23748,pub.1184931530,Motility genes are associated with the occurre...,ISME Communications,"2730-6151, 2730-6151","Bhandari, Rishi; Robbins, Caleb J; Arora, Arin...","Kang, David S (Biological Control of Insects R..."
23751,pub.1183876975,Efficacy and Fate of RNA Interference Molecule...,Archives of Insect Biochemistry and Physiology,"0739-4462, 1520-6327","Arora, Arinder K; Kang, David S","Kang, David S (Biological Control of Insects R..."
23762,pub.1169213581,Indomethacin and 20‐hydroxyecdysone influence ...,Archives of Insect Biochemistry and Physiology,"0739-4462, 1520-6327","Wang, Yong; Buer, Benjamin; Goodman, Cynthia L...","Kang, David (Biological Control of Insects Res..."
23797,pub.1140486029,Cell Line Platforms Support Research into Arth...,Insects,"2075-4450, 2075-4450","Goodman, Cynthia L; Kang, David S; Stanley, David","Kang, David S (Biological Control of Insects R..."


In [90]:
jkw = ['Plant', 'Botan', 'Phyto', 'Hort', 'Fung', 'Myco']
isjournal = np.zeros(len(df), dtype=bool)
for i in range(len(df)):
    if not pd.isna(df.loc[i, 'Source title']):
        foo = df.loc[i, 'Source title']
        isjournal[i] = any([ kw in foo for kw in jkw ])

print(np.sum(isjournal))
uq = [ x.upper() for x in pd.unique(df.loc[isjournal, 'Source title']) ]
print(len(uq))
uq, idx = np.unique(df.loc[isjournal, 'Source title'].values, return_index=True)
idx = df.loc[isjournal, 'Source title'].index[idx].values
bar = df.loc[idx, ['Source title', 'ISSN']]
bar['Source title'] = bar['Source title'].str.upper()
bar = bar.set_index('Source title')
bar.loc[ np.setdiff1d(bar.index, journals['Journal']) ]

1138
122


Unnamed: 0_level_0,ISSN
Source title,Unnamed: 1_level_1
FUNGAL BIOLOGY,"1878-6146, 1469-8102"
FUNGAL BIOLOGY REVIEWS,"1749-4613, 1878-0253"
FUNGAL GENETICS AND BIOLOGY,"1087-1845, 1096-0937"
JOURNAL OF CLINICAL TUBERCULOSIS AND OTHER MYCOBACTERIAL DISEASES,2405-5794
JOURNAL OF FUNGI,2309-608X
MEDICAL MYCOLOGY,"1369-3786, 1460-2709"
MYCOLOGY: CURRENT AND FUTURE DEVELOPMENTS,"2452-0780, 24520780"
MYCORRHIZA,"0940-6360, 1432-1890"
MYCOTOXIN RESEARCH,"0178-7888, 1867-1632"
WORLD MYCOTOXIN JOURNAL,"1875-0710, 1875-0796"


In [38]:
for i in range(len(df.columns)):
    print(i, df.columns[i], sep='\t')

0	Rank
1	Publication ID
2	DOI
3	PMID
4	PMCID
5	ISBN
6	Title
7	Abstract
8	Acknowledgements
9	Funding
10	Source title
11	Anthology title
12	Book editors
13	Publisher
14	ISSN
15	MeSH terms
16	Publication date
17	PubYear
18	Publication date (online)
19	Publication date (print)
20	Volume
21	Issue
22	Pagination
23	Open Access
24	Publication Type
25	Document Type
26	Authors
27	Authors (Raw Affiliation)
28	Corresponding Authors
29	Authors Affiliations
30	Research Organizations - standardized
31	GRID IDs
32	City of standardized research organization
33	State of standardized research organization
34	Country of standardized research organization
35	Funder
36	Funder Group
37	Funder Country
38	Grant IDs of Supporting Grants
39	Supporting Grants
40	Times cited
41	Recent citations
42	RCR
43	FCR
44	Altmetric
45	Source Linkout
46	Dimensions URL
47	Fields of Research (ANZSRC 2020)
48	RCDC Categories
49	HRCS HC Categories
50	HRCS RAC Categories
51	Cancer Types
52	CSO Categories
53	Units of Assessment
54	

In [46]:
meshs = df.loc[~isplant, 'MeSH terms']
meshs = meshs[~pd.isna(meshs)]
mesh = set()
for i in range(len(meshs)):
    mesh |= set(meshs.iloc[i].split('; '))
mesh = sorted(list(mesh))
#pd.Series(mesh).to_csv(src + 'mesh.txt', index=False, header=False, sep='\n')

In [109]:
meshs = df.loc[isplant, 'MeSH terms']
meshs = meshs[~pd.isna(meshs)]
mesh = set()
for i in range(len(meshs)):
    mesh |= set(meshs.iloc[i].split('; '))
mesh = sorted(list(mesh))
print(len(mesh))
pd.Series(mesh).to_csv(src + 'plant_MU_mesh.txt', index=False, header=False, sep='\n')

1274


In [158]:
foo = df.loc[~isplant]
bar = foo.iloc[ isplantmesh[~isplant] > 2 ][['Publication ID', 'MeSH terms', 'Title', 'Source title', 'Corresponding Authors']]
print(bar.shape)

(99, 5)


In [216]:
foo = pnum.to_frame(name='Pubs Num')
foo['Length'] = np.array(list(map(len,pnum.index.values)))
foo['Names'] = foo.index.values
foo = foo.sort_values(by=['Pubs Num', 'Length', 'Names'], ascending=[False, False, True])
foo

Unnamed: 0,Pubs Num,Length,Names
"Nguyen, Henry T",58,15,"Nguyen, Henry T"
"Mittler, Ron",51,12,"Mittler, Ron"
"Meyers, Blake C",36,15,"Meyers, Blake C"
"Stacey, Gary",34,12,"Stacey, Gary"
"Birchler, James A",28,17,"Birchler, James A"
...,...,...,...
"Li, Song",1,8,"Li, Song"
"Qin, Hua",1,8,"Qin, Hua"
"Song, Li",1,8,"Song, Li"
"Chen, P",1,7,"Chen, P"


In [85]:
name1, name2 = 'Stacey, Minviluz G', 'Stacey, Gary'

lname1, fname1 = name1.split(', ')
inits1 = [x[0] for x in fname1.split(' ')]

lname2, fname2 = name2.split(', ')
inits2 = [x[0] for x in fname2.split(' ')]

#If none of the initials match, then assume that names are not equal and move on
if not any([x in inits1 for x in inits2]):
    print('No initial matching')

fuzzscore = fuzz.ratio(name1.casefold(), name2.casefold())
if fuzzscore >= 98:
    print('Fuzzscore', fuzzscore)

# Add blank spaces for first names reduced to initials
# e.g. Riedell, WE --> Riedell, W E
                
fname1 = utils.add_blanks(fname1)
inits1 = [x[0] for x in fname1.split(' ')]

fname2 = utils.add_blanks(fname2)
inits2 = [x[0] for x in fname2.split(' ')]

print('Last name: ', lname1, ' --\tFirst name: ', fname1, ' --\tInitials: ',inits1, sep='')
print('Last comp: ', lname2, ' --\tFirst comp: ', fname2, ' --\tInitials: ',inits2, sep='')

Last name: Stacey --	First name: Minviluz G --	Initials: ['M', 'G']
Last comp: Stacey --	First comp: Minviluz G --	Initials: ['M', 'G']
