## Quick ALMA-centric arXiv filter

The goal is to find every paper on the astro-ph arXiv that uses ALMA data.

For now, we search for keywords:
* ALMA
* mm
* millimeter
* millimetre
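
As a minimal sketch of how a keyword list like this can be turned into an arXiv API `search_query` string: the helper name `build_search_query` is hypothetical (it is not part of this notebook or of any library), and the sketch assumes the query is written URL-ready, with `+` for spaces and `%28`/`%29` for parentheses.

```python
def build_search_query(category, keywords, start=None, end=None):
    """Return a URL-ready arXiv search_query string (hypothetical helper)."""
    # One 'all:<keyword>' term per keyword, OR-ed together
    keyword_clause = "+OR+".join("all:%s" % kw for kw in keywords)
    # %28 and %29 are the URL-encoded parentheses around the OR group
    parts = ["cat:%s" % category, "%28+" + keyword_clause + "+%29"]
    # Optional submission-date window, stamps given as yyyymmddhhmm
    if start and end:
        parts.append("submittedDate:[%s+TO+%s]" % (start, end))
    return "+AND+".join(parts)


keywords = ["ALMA", "mm", "millimeter", "millimetre"]
print(build_search_query("astro-ph*", keywords))
```

Because the clause is built with `join`, adding or removing a keyword needs no change to the query template.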

Results are shown as a table for a quick look and are also exported to an Excel spreadsheet.

### Suggestions always welcome! 

In [1]:
%matplotlib inline

In [2]:
import urllib
try:
    # Python 2
    from urllib import quote_plus
    from urllib import urlencode
    from urllib import urlretrieve
except ImportError:
    # Python 3
    from urllib.parse import quote_plus
    from urllib.parse import urlencode
    from urllib.request import urlretrieve


In [3]:
import feedparser
import pandas as pd
import numpy as np
import datetime
import time

In [4]:
#OLDER VERSION:
#url = 'http://export.arxiv.org/api/query?search_query=cat:%s+AND+%%28+all:%s+OR+all:%s+OR+all:%s+OR+all:%s+%%29&start=0&sortBy=submittedDate&sortOrder=descending'%(cat,keywords[0],keywords[1],keywords[2],keywords[3])
#data = urllib.urlopen(url).read()

In [80]:
## Choose the date range to search
enddate = '201812141159' # enter as yyyymmddhhmm (zero-padded) or "today"
startdate = '201811011159' # if empty, search from the first day of enddate's month
stamp = '%Y%m%d%H%M' # strftime zero-pads single-digit months, days, hours and minutes
if enddate == 'today':
    enddate = datetime.datetime.now().strftime(stamp)
enddate_as_date = datetime.datetime(int(enddate[0:4]),int(enddate[4:6]),int(enddate[6:8]))
if startdate == '':
    startdate_as_date = enddate_as_date.replace(day=1)
    startdate = startdate_as_date.strftime(stamp)
print('Search dates: %s:%s' % (startdate,enddate))

Search dates: 201811011159:201812141159
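
The date stamps used here are fixed-width `yyyymmddhhmm` strings, so single-digit months, days, hours and minutes must be zero-padded. A minimal sketch of round-tripping such stamps with the standard library follows; the helper names `to_stamp`, `from_stamp` and `month_start` are hypothetical and exist only for illustration.

```python
import datetime

ARXIV_STAMP = "%Y%m%d%H%M"  # yyyymmddhhmm, every field zero-padded


def to_stamp(moment):
    """Format a datetime as a 12-character stamp."""
    return moment.strftime(ARXIV_STAMP)


def from_stamp(stamp):
    """Parse a 12-character stamp back into a datetime."""
    return datetime.datetime.strptime(stamp, ARXIV_STAMP)


def month_start(stamp):
    """Return the stamp for 00:00 on the first day of the stamp's month."""
    return to_stamp(from_stamp(stamp).replace(day=1, hour=0, minute=0))


example = datetime.datetime(2018, 3, 5, 9, 7)
print(to_stamp(example))            # 201803050907
print(month_start("201812141159"))  # 201812010000
```

Concatenating `str(now.month)`, `str(now.day)` and so on would instead give `2018359 7`-style strings of varying length, which is one way single- and double-digit dates get confused.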


In [77]:
## based on code from https://github.com/lukasschwab/arxiv.py/blob/master/arxiv/arxiv.py
root_url = 'http://export.arxiv.org/api/'

def query(search_query="",
         date_from=None,   # currently unused: put the date range in search_query
         date_until=None,  # currently unused
         id_list=None,     # None instead of a mutable [] default
         prune=True,       # currently unused
         start=0,
         max_results=10,
         sort_by="relevance",
         sort_order="descending"):
    url_args = urlencode({"id_list": ','.join(id_list or []),
                          "start": start,
                          "max_results": max_results,
                          "sortBy": sort_by,
                          "sortOrder": sort_order})
    # search_query must already be URL-ready; a trailing '&' is tolerated
    url = root_url + 'query?search_query=' + search_query.rstrip('&') + '&' + url_args
    results = feedparser.parse(url)
    if results.get('status') != 200:
        # TODO: better error reporting
        raise Exception("HTTP Error " + str(results.get('status', 'no status')) + " in query")
    return results['entries']
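
An alternative way to assemble the request URL is to pass every parameter, including `search_query`, through `urlencode`, so the query can be written with plain spaces and parentheses. This is a sketch only: `build_query_url` is a hypothetical name, the code targets Python 3, and it assumes the API accepts standard percent-encoded parameters. It builds the URL without contacting the server.

```python
from urllib.parse import urlencode, urlparse, parse_qs

ROOT_URL = "http://export.arxiv.org/api/"


def build_query_url(search_query, start=0, max_results=10,
                    sort_by="relevance", sort_order="descending"):
    """Build the full query URL (hypothetical helper, no network access)."""
    params = {
        "search_query": search_query,  # plain text; urlencode escapes it
        "start": start,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    return ROOT_URL + "query?" + urlencode(params)


url = build_query_url("cat:astro-ph* AND (all:ALMA OR all:mm)",
                      max_results=200, sort_by="submittedDate")

# Decode the URL again to check that the query survives the round trip
decoded = parse_qs(urlparse(url).query)
print(decoded["search_query"][0])
```

The resulting string could then be handed to `feedparser.parse` in the same way as the URL built in `query`.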

In [78]:
## Make a query
cat = 'astro-ph*'
## Keywords to match; 'all:' searches every metadata field (title, abstract, comments, ...)
keywords = ['ALMA','millimeter','millimetre','mm']
## You could also query authors using au:authorname
## Build the OR clause from the list so that keywords can be added or removed freely
keyword_clause = '+OR+'.join('all:%s' % kw for kw in keywords)
sq = 'cat:%s+AND+%%28+%s+%%29+AND+submittedDate:[%s+TO+%s]&'\
    %(cat,keyword_clause,startdate,enddate)
num_results = 200
results=query(search_query=sq,sort_by='submittedDate',sort_order='descending',max_results=num_results)

**IMPORTANT NOTE**

These results are not curated. Some may have *nothing* to do with ALMA and match only because they contain a keyword such as 'mm'.

Later I will curate them myself as we go, and save them to a Google spreadsheet (or elsewhere).
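
Until the list is curated by hand, a rough first pass could flag the entries that name ALMA explicitly in the title or abstract. The sketch below is only an illustration: `mentions_alma` is a hypothetical helper, the sample entries are made up, and a whole-word match on "ALMA" is an assumed heuristic, not a test of whether a paper actually uses ALMA data.

```python
import re

# Whole-word, case-sensitive match, so 'ALMANAC' or 'Alma' do not count
ALMA_PATTERN = re.compile(r"\bALMA\b")


def mentions_alma(entry):
    """Return True if the title or summary names ALMA explicitly."""
    text = "%s %s" % (entry["title"], entry["summary"])
    return bool(ALMA_PATTERN.search(text))


# Made-up entries, for illustration only
sample_entries = [
    {"title": "ALMA observations of a disk", "summary": "We present CO maps."},
    {"title": "A 3 mm survey of dense cores", "summary": "Single-dish data."},
    {"title": "The ALMANAC catalogue", "summary": "A list of sources."},
]

for entry in sample_entries:
    print(entry["title"], "->", mentions_alma(entry))
```

With the results in a pandas DataFrame that has `title` and `summary` columns, the same function can be applied row by row with `df.apply(mentions_alma, axis=1)`.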

In [81]:
## NOW convert to dataframe for better table processing
posts = []
columns=['published','title','authors','summary','link','arxiv_primary_category','arxiv_comment']

for entry in results:
    # arxiv_comment is optional, so fall back to an empty string
    comment = entry.get(columns[6], '')
    authorlist = [author['name'] for author in entry[columns[2]]]
    posts.append((entry[columns[0]], entry[columns[1]],
                  authorlist[0], authorlist,
                  entry[columns[3]], entry[columns[4]],
                  entry[columns[5]]['term'], comment))

# 'authors' holds the first author only; 'author list' holds every author
df = pd.DataFrame(posts, columns=[columns[0],columns[1],columns[2],'author list',columns[3],columns[4],columns[5],columns[6]])



In [82]:
## LOOK at the table
df

| | published | title | authors | author list | summary | link | arxiv_primary_category | arxiv_comment |
|---|---|---|---|---|---|---|---|---|
| 0 | 2018-12-13T12:40:02Z | ALMA Observations of the Molecular Gas in the ... | B. Vila-Vilaro | [B. Vila-Vilaro, D. Espada, P. Cortes, S. Leon... | We present the results of CO interferometric o... | http://arxiv.org/abs/1812.05385v1 | astro-ph.GA | |
| 1 | 2018-12-12T01:50:41Z | Dust formation in embryonic pulsar-aided super... | Conor Omand | [Conor Omand, Kazumi Kashiyama, Kohta Murase] | We investigate effects of energetic pulsar win... | http://arxiv.org/abs/1812.04773v1 | astro-ph.HE | 17 pages, 10 figures, submitted to MNRAS, comm... |
| 2 | 2018-12-11T16:50:11Z | The Disk Substructures at High Angular Resolut... | Nicolás Kurtovic | [Nicolás Kurtovic, Laura Pérez, Myriam Benisty... | To characterize the substructures induced in p... | http://arxiv.org/abs/1812.04536v1 | astro-ph.SR | 15 pages, 10 figures, accepted to ApJ Letters |
| 3 | 2018-12-11T12:36:38Z | Warm dust surface chemistry in protoplanetary ... | W. F. Thi | [W. F. Thi, S. Hocuk, I. Kamp, P. Woitke, Ch. ... | The origin of the reservoirs of water on Earth... | http://arxiv.org/abs/1812.04357v1 | astro-ph.GA | accepted to A&A |
| 4 | 2018-12-11T10:20:49Z | Kinetic energy transfer from X-ray ultrafast o... | Misaki Mizumoto | [Misaki Mizumoto, Takuma Izumi, Kotaro Kohno] | UltraFast Outflows (UFOs), seen as X-ray blues... | http://arxiv.org/abs/1812.04316v1 | astro-ph.GA | 13 pages, 7 figures, accepted for publication ... |
| 5 | 2018-12-11T02:40:56Z | The Disk Substructures at High Angular Resolut... | Jane Huang | [Jane Huang, Sean M. Andrews, Laura M. Pérez, ... | We present an analysis of ALMA 1.25 millimeter... | http://arxiv.org/abs/1812.04193v1 | astro-ph.SR | 19 pages, 9 figures, accepted by ApJL |
| 6 | 2018-12-10T22:01:53Z | The Planet Formation Potential Around a 45 Myr... | Kevin M. Flaherty | [Kevin M. Flaherty, A. Meredith Hughes, Eric E... | Debris disk detections around M dwarfs are rar... | http://arxiv.org/abs/1812.04124v1 | astro-ph.SR | 9 pages, 4 figures, accepted to ApJ |
| 7 | 2018-12-10T21:37:51Z | Multi-band Optical and Near-infrared Propertie... | Pallavi Patil | [Pallavi Patil, Kristina Nyland, Mark Lacy, Du... | We present a catalog of 26 faint submillimeter... | http://arxiv.org/abs/1812.04108v1 | astro-ph.GA | Accepted for publication in ApJ, 32 pages, 11 ... |
| 8 | 2018-12-10T20:02:16Z | Gas density perturbations induced by forming p... | Cécile Favre | [Cécile Favre, Davide Fedele, Luke Maud, Richa... | The formation of planets occurs within protopl... | http://arxiv.org/abs/1812.04062v1 | astro-ph.EP | Accepted for publication by ApJ |
| 9 | 2018-12-10T19:37:57Z | The Disk Substructures at High Angular Resolut... | Laura M. Pérez | [Laura M. Pérez, Myriam Benisty, Sean M. Andre... | We present a detailed analysis of new ALMA obs... | http://arxiv.org/abs/1812.04049v1 | astro-ph.SR | 15 pages, 8 figures, accepted for publication ... |


In [83]:
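## WRITE to Excel
## Passing the filename lets pandas open, write and close the file itself
## (ExcelWriter.save() is no longer available in recent pandas versions)
df.to_excel('testoutput.xlsx', sheet_name='Sheet1')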

**TO DO**

Better selection of the date range: https://github.com/Mahdisadjadi/arxivscraper/blob/master/arxivscraper/arxivscraper.py
--> Works, but gets confused between single- and double-digit dates.

Then sync directly with Google Sheets (or GitHub?)
https://github.com/burnash/gspread
https://github.com/robin900/gspread-dataframe