# _**AyeCite**™_ (Training Version) <small>by [_AyeAI_](https://ayeai.xyz)®</small>
### An tool to iterate over research portfolios
### **Research standing search utility**
##### _Copyright 2019-2022 [Abhishek Choudhary](https://bit.ly/cognitist), [Dr Srija Katta](https://www.linkedin.com/in/srija-katta)_
##### [**AyeSPL**](https://github.com/ayevdi/ayevdi/blob/master/LICENSE) License (latest version applies). This software tool is shared without any warranties. This must not be used for life saving or critical purposes.
##### <small>_™AyeCite is claimed as a trademark by Abhishek Choudhary in India and other geographies ®AyeAI is a registered trademark of Abhishek Choudhary in India and other geographies_</small>

In [1]:
#@title Descrption and source [URL](https://en.wikipedia.org/wiki/Url) of the files
#@markdown #### Click _show code_ to see / modify the list
%%writefile /content/seedllist
Authors index 2022
https://elsevier.digitalcommonsdata.com/datasets/btchxktzyw/4
https://prod-dcd-datasets-cache-zipfiles.s3.eu-west-1.amazonaws.com/btchxktzyw-4.zip

Authors index 2021
https://elsevier.digitalcommonsdata.com/datasets/btchxktzyw/3
https://prod-dcd-datasets-cache-zipfiles.s3.eu-west-1.amazonaws.com/btchxktzyw-3.zip

Authors index 2020
https://elsevier.digitalcommonsdata.com/datasets/btchxktzyw/2
https://prod-dcd-datasets-cache-zipfiles.s3.eu-west-1.amazonaws.com/btchxktzyw-2.zip

Authors index 2019
https://elsevier.digitalcommonsdata.com/datasets/btchxktzyw/1
https://prod-dcd-datasets-cache-zipfiles.s3.eu-west-1.amazonaws.com/btchxktzyw-1.zip

Article - 2020
https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3000918



Writing /content/seedllist


In [2]:
#@title Acquire the files ([ETA](https://en.wikipedia.org/wiki/Estimated_time_of_arrival): ~30 seconds) { vertical-output: true }
#@markdown We use [**wget**](https://en.wikipedia.org/wiki/Wget) to download the files as described in the source list. <br>
#@markdown The list is iterated over using a ["for" loop](https://en.wikipedia.org/wiki/For_loop) written in [**bash**](https://en.wikipedia.org/wiki/Bash_(Unix_shell)). <br>
#@markdown The list is filtered using [**grep**](https://en.wikipedia.org/wiki/Estimated_time_of_arrival) to identiy **URLs** pointing to **zip** archives
#@markdown The archive files are extracted using [**unzip**](https://en.wikipedia.org/wiki/ZIP_(file_format))


acquire_cmd = """
  for n in $(cat /content/seedllist | grep https | grep zip)
  do
    file=$(basename $n)
    echo Downloading $file from $n
    rm -f $file
    wget -q $n -O$file
    echo Extracting content from $file
    unzip -o $file
  done
"""

print("Acquiring the files...\nPlease wait a while (around 30 seconds)")
result = !{acquire_cmd}
print("Done")


Acquiring the files...
Please wait a while (around 30 seconds)
Done


In [3]:
#@title Install **tools** (ETA: ~30 seconds) { vertical-output: true }
#@markdown We use **ssconvert** for converting **.xlsx** files to **.csv** <br>
#@markdown The tool is part of [**gnumeric**](https://en.wikipedia.org/wiki/Gnumeric)
#Ref: https://inconsolation.wordpress.com/2014/05/09/ssconvert-an-awful-lot-of-baggage/
install_cmd="""
sudo apt-get install gnumeric
"""

print("Installing gnumeric...")
result = !{install_cmd}
print("Done")

Installing gnumeric...
Done


In [4]:
#@title Move the files to a clean folder (ETA: ~1 second) { vertical-output: true }
#@markdown We use **find** and **grep** to identify xlsx files.<br>
#@markdown The result is **piped** through **xargs** to be used with **mv** command
move_cmd="""
mkdir -p source
find . -type f | grep '.xlsx$' | xargs -n1 -I{} mv "{}" source/
"""

print("Moving the xlsx files to folder named source...")
result=!{move_cmd}
print("Done")

Moving the xlsx files to folder named source...
Done


In [5]:
#@title Convert the files to **csv** (ETA: ~5 minutes) { vertical-output: true }
#@markdown We use a **for** loop to iterate over the files. <br>
#@markdown Each xlsx file is converted to csv on a per sheet basis
convert_cmd = """
cd source
for n in $(ls *.xlsx)
do
  echo $n
  rm -f $n.csv
  ssconvert -S --export-type=Gnumeric_stf:stf_csv $n $n.csv
done
"""

print("Converting the xlsx to csv...")
print("Please wait a while (around 5 minutes)")
result = !{convert_cmd}
print("Done")

Converting the xlsx to csv...
Please wait a while (around 5 minutes)
Done


In [6]:
#@title List the generated CSV files by size (ETA: <1 second)
#@markdown Here we use a for loop in Python
sizelst_cmd = """
cd source
ls -lS | grep csv
"""

print("The generated csv files are listed below by size")
result=!{sizelst_cmd}
for i in result:
  print(i)

The generated csv files are listed below by size
-rw-r--r-- 1 root root 82476600 Oct 19 04:39 Table_1_Authors_career_2021_pubs_since_1788_wopp_extracted_202209.xlsx.csv.1
-rw-r--r-- 1 root root 81606313 Oct 19 04:41 Table_1_Authors_singleyr_2021_pubs_since_1788_wopp_extracted_202209.xlsx.csv.1
-rw-r--r-- 1 root root 77767976 Oct 19 04:38 Table_1_Authors_career_2020_wopp_extracted_202108.xlsx.csv.1
-rw-r--r-- 1 root root 77093290 Oct 19 04:40 Table_1_Authors_singleyr_2020_wopp_extracted_202108.xlsx.csv.1
-rw-r--r-- 1 root root 64709035 Oct 19 04:43 Table-S6-career-2019.xlsx.csv.1
-rw-r--r-- 1 root root 63391746 Oct 19 04:44 Table-S7-singleyr-2019.xlsx.csv.1
-rw-r--r-- 1 root root 42048377 Oct 19 04:42 Table-S4-career-2018.xlsx.csv.1
-rw-r--r-- 1 root root 33005586 Oct 19 04:42 Table-S1-career-2017.xlsx.csv.1
-rw-r--r-- 1 root root 30743882 Oct 19 04:42 Table-S2-singleyr-2017.xlsx.csv.1
-rw-r--r-- 1 root root    44373 Oct 19 04:41 Table_2_field_subfield_thresholds_career_2020_wopp_extrac

In [45]:
#@title Find files for a specific tag { vertical-output: true }
#@markdown You can enter you own tags, or select from the list
search_tag = "_2021_" #@param ["_2021_", "_2020_"]{allow-input: true}

print("You can also type / paste the tags from below",end='\n\n')

tag_cmd = """
cd source
ls | sed 's/[^A-Za-z0-9]/\\n/g' | sort -u
"""
tags_list = !{tag_cmd}
items=0
for n in tags_list:
  print(n.ljust(15, ' '),end='')
  items = items+1
  if(0 == items % 5):
    print('')

print('\n\n')
search_cmd = """
cd source
ls *.csv.* | grep """ + search_tag

print("The following files match your search criteria\n")
result_search = !{search_cmd}
for n in result_search:
  print(n)

You can also type / paste the tags from below

0              1              1788           2              2017           
2018           2019           2020           2021           202108         
202209         3              Authors        career         compare100     
csv            extracted      field          Field          maxlog         
pubs           S1             S2             S3             S4             
S5             S6             S7             S8             S9             
since          singleyr       subfield       Subfield       Table          
thresholds     Thresholds     wopp           xlsx           


The following files match your search criteria

Table_1_Authors_career_2021_pubs_since_1788_wopp_extracted_202209.xlsx.csv.0
Table_1_Authors_career_2021_pubs_since_1788_wopp_extracted_202209.xlsx.csv.1
Table_1_Authors_singleyr_2021_pubs_since_1788_wopp_extracted_202209.xlsx.csv.0
Table_1_Authors_singleyr_2021_pubs_since_1788_wopp_extracted_202209.xlsx.csv.

In [62]:
#@title Find the selected files' schema and entry count by making cat calls { vertical-output: true }
#@markdown The schemas of the selected files are as follows <br>
cat_bin = """
which cat
"""
cat_exe = !{cat_bin}

cat_call = """
cd source
cat filename | filter
"""

for n in result_search:
  print(n)
  print("Schema: ",end='')
  !{cat_call.replace("filename",n).replace("filter","head -1")}
  print("Entries (including header): ",end='')
  !{cat_call.replace("filename",n).replace("filter","wc -l")}
  print('\n')


Table_1_Authors_career_2021_pubs_since_1788_wopp_extracted_202209.xlsx.csv.0
Schema: FIELD,BASIS,DESCRIPTION
Entries (including header): 47


Table_1_Authors_career_2021_pubs_since_1788_wopp_extracted_202209.xlsx.csv.1
Schema: authfull,inst_name,cntry,np6021,firstyr,lastyr,"rank (ns)","nc9621 (ns)","h21 (ns)","hm21 (ns)","nps (ns)","ncs (ns)","cpsf (ns)","ncsf (ns)","npsfl (ns)","ncsfl (ns)","c (ns)","npciting (ns)","cprat (ns)","np6021 cited9621 (ns)",self%,rank,nc9621,h21,hm21,nps,ncs,cpsf,ncsf,npsfl,ncsfl,c,npciting,cprat,"np6021 cited9621",np6021_d,nc9621_d,sm-subfield-1,sm-subfield-1-frac,sm-subfield-2,sm-subfield-2-frac,sm-field,sm-field-frac,"rank sm-subfield-1","rank sm-subfield-1 (ns)","sm-subfield-1 count"
Entries (including header): 195606


Table_1_Authors_singleyr_2021_pubs_since_1788_wopp_extracted_202209.xlsx.csv.0
Schema: FIELD,BASIS,DESCRIPTION
Entries (including header): 47


Table_1_Authors_singleyr_2021_pubs_since_1788_wopp_extracted_202209.xlsx.csv.1
Schema: authfu

In [80]:
#@title Look for specific authors by making cat calls { vertical-output: true }
#@markdown Partial names are acceptable for serach criteria <br>
#@markdown TBD: The current version does a generic search and strings are searched across all fields
Author_last_name = "Balkrishna" #@param {type:"string"}
Author_first_name = "Acharya" #@param {type:"string"}
Author_country = "ind" #@param {type:"string"}
Author_institution = "Patanjali" #@param {type:"string"}
#@markdown The author search result per file <br>
cat_bin = """
which cat
"""
cat_exe = !{cat_bin}

cat_call = """
cd source
cat filename | filter
"""

grep_filter = "cat "
if len(Author_last_name)>0:
  grep_filter = grep_filter + '| grep "' + Author_last_name + '"'
if len(Author_first_name)>0:
  grep_filter = grep_filter + '| grep "' + Author_first_name + '"'
if len(Author_country)>0:
  grep_filter = grep_filter + '| grep "' + Author_country + '"'
if len(Author_institution)>0:
  grep_filter = grep_filter + '| grep "' + Author_institution + '"'

print(grep_filter,end='\n\n')

for n in result_search:
  print(n)
  try:
    !{cat_call.replace("filename",n).replace("filter",grep_filter)}
  except:
    print("Sorry, no matches! Please modify the criteria")
  print('\n')


cat | grep "Balkrishna"| grep "Acharya"| grep "ind"| grep "Patanjali"

Table_1_Authors_career_2021_pubs_since_1788_wopp_extracted_202209.xlsx.csv.0


Table_1_Authors_career_2021_pubs_since_1788_wopp_extracted_202209.xlsx.csv.1


Table_1_Authors_singleyr_2021_pubs_since_1788_wopp_extracted_202209.xlsx.csv.0


Table_1_Authors_singleyr_2021_pubs_since_1788_wopp_extracted_202209.xlsx.csv.1
"Balkrishna, Acharya","Patanjali Yog Peeth Trust, Haridwar",ind,117,2009,2022,367268,189,6,3.436904761904762,0,0,56,100,109,178,2.222103713812799,171,1.105263157894737,68,0.1818,314687,231,7,3.996428571428571,0,0,56,137,109,217,2.343175321908442,193,1.196891191709845,69,2,2,"Medicinal & Biomolecular Chemistry",0.2,"Neurology & Neurosurgery",0.1478260869565217,"Clinical Medicine",0.6,1704,2078,99546


Table_2_field_subfield_thresholds_career_2021_pubs_since_1788_wopp_extracted_202209.xlsx.csv.0


Table_2_field_subfield_thresholds_career_2021_pubs_since_1788_wopp_extracted_202209.xlsx.csv.1


Table_2_field