# PDFDataExtractor Demo

PDFDataExtractor is a toolkit for automatically extracting semantic information from PDF files of scientific articles. It uses a template-based architecture and can extract information from articles by the following publishers and journals:
* Elsevier
* Royal Society of Chemistry
* Advanced Material Families (Wiley)
* Angewandte
* Chemistry A European Journal
* American Chemical Society
* Springer (temporarily unavailable)

### Add PDFDataExtractor to your Conda path

Use the following command to add PDFDataExtractor to your current conda environment.

Replace "directory" with the path to the root folder.

In [1]:
conda develop "directory"

added /Users/miao/Desktop/PDFDataExtractor/directory
completed operation for: /Users/miao/Desktop/PDFDataExtractor/directory

Note: you may need to restart the kernel to use updated packages.
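
If `conda develop` is not available in your environment, a similar effect can be had for a single session by putting the folder on `sys.path` before importing. The following is a minimal sketch using only the standard library. The helper name `add_to_path` is an assumption made for illustration and is not part of PDFDataExtractor. Unlike `conda develop`, this change only lasts for the current Python session.

```python
import sys
from pathlib import Path


def add_to_path(directory):
    """Put `directory` at the front of sys.path (hypothetical helper)."""
    # Expand "~" and make the path absolute so duplicates are easy to detect
    resolved = str(Path(directory).expanduser().resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved
```

Call `add_to_path("directory")` with the same root folder you would pass to `conda develop`, then run the import in the next step.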


### Import the necessary module

In [28]:
from pdfdataextractor import Reader

## Pass in a single file

In [29]:
path = r'/Users/miao/Desktop/test/els/1.pdf'

In [30]:
file = Reader()

In [31]:
pdf = file.read_file(path)

Reading:  /Users/miao/Desktop/test/els/1.pdf
*** Elsevier detected ***


### Test whether the PDF was returned successfully

In [32]:
pdf.test()

PDF returned successfully


### Get Caption

In [33]:
pdf.caption()

{'figure 1': 'Fig. 1. The network model.',
 'figure 3': 'Fig. 3. Total travel time indexed for: (a) random, (b) central, (c) peripheral removals, and (d) all the strategies (average values).',
 'figure 2': 'Fig. 2. Example of ﬂow distribution in different network conﬁgurations (thicker lines mean higher ﬂows): (a) full grid, (b) 30% random removal, (c) 30% central removal, and (d) 30% peripheral removal.',
 'figure 4': 'Fig. 4. Average values of: (a) indexed distance travelled, (b) average speed, for the three removal strategies.',
 'figure 5': 'Fig. 5. Average values of: (a) percentage of total travel time spent on intersections, and (b) maximum V/C ratios on intersections, for the three removal strategies.',
 'figure 6': 'Fig. 6. Number and type of maneuvers at intersections (average values) for: (a) random removal, (b) central removal, and (c) peripheral removal, and (d) percentage of movements taking place at 4 leg intersections.',
 'figure 7': 'Fig. 7. Total travel times for rando
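
The keys in the output above are not in figure order ('figure 3' appears before 'figure 2'). If ordered captions are needed, one option is to sort on the number embedded in each key. This is a sketch that assumes every key has the form `'figure N'`, as in this example. `sort_captions` is a hypothetical helper, not part of PDFDataExtractor.

```python
import re


def sort_captions(captions):
    """Return (key, caption) pairs ordered by the number in keys like 'figure 3'."""
    def figure_number(item):
        match = re.search(r'\d+', item[0])
        # Keys without a number are placed last
        return int(match.group()) if match else float('inf')

    return sorted(captions.items(), key=figure_number)


# Excerpt of the captions shown above, deliberately out of order
captions = {
    'figure 4': 'Fig. 4. Average values of: (a) indexed distance travelled, (b) average speed, for the three removal strategies.',
    'figure 1': 'Fig. 1. The network model.',
}
for key, text in sort_captions(captions):
    print(key, '->', text)
```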

### Get Keywords

In [34]:
pdf.keywords()

'Keywords: Link removal Urban pattern Trafﬁc performance Grid'

### Get Title

In [35]:
pdf.title()

'Trafﬁc performance on quasi-grid urban structures'
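
The extracted title contains the typographic ligature "ﬁ" (a single character) rather than the two letters "fi", which can break string matching and search. A possible post-processing step is Unicode NFKC normalization from the standard library, sketched below. `normalize_ligatures` is a hypothetical helper name. Note that NFKC also rewrites other compatibility characters (for example superscript digits), so it may not suit every use.

```python
import unicodedata


def normalize_ligatures(text):
    """Replace compatibility characters such as the 'fi' ligature with plain letters."""
    return unicodedata.normalize('NFKC', text)


# '\ufb01' is the single-character 'fi' ligature seen in the title above
print(normalize_ligatures('Traf\ufb01c performance on quasi-grid urban structures'))
```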

### Get DOI

In [36]:
pdf.doi()

'10.1016/j.cities.2013.08.006'
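
A DOI string like the one above can be turned into a resolvable link by prefixing it with `https://doi.org/`. The sketch below adds a loose sanity check first. The regular expression is a common heuristic rather than a full DOI validator, and `doi_to_url` is a hypothetical helper, not part of PDFDataExtractor.

```python
import re

# Heuristic shape check: "10.", a 4-9 digit registrant code, "/", then a suffix
DOI_PATTERN = re.compile(r'^10\.\d{4,9}/\S+$')


def doi_to_url(doi):
    """Return the doi.org URL for a DOI string, or raise ValueError if it looks malformed."""
    doi = doi.strip()
    if not DOI_PATTERN.match(doi):
        raise ValueError('Not a plausible DOI: ' + repr(doi))
    return 'https://doi.org/' + doi


print(doi_to_url('10.1016/j.cities.2013.08.006'))
```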

### Get Abstract

In [37]:
pdf.abstract()

'Cities across the world are starting to recover space, previously devoted to cars, for other uses. The main purpose of this paper is to better understand the removal of space in urban settings and to provide some analytical results showing that it is possible to remove streets from a city without worsening trafﬁc excessively.'

### Get Journal

In [38]:
pdf.journal()

{'name': 'Cities 36 (2014) 18–27',
 'year': '2014',
 'volume': '36',
 'page': '18-27'}

### Get Journal Name

In [39]:
pdf.journal('name')

'Cities 36 (2014) 18–27'

### Get Journal Year

In [40]:
pdf.journal('year')

'2014'

### Get Journal Volume

In [41]:
pdf.journal('volume')

'36'

### Get Journal Page

In [42]:
pdf.journal('page')

'18-27'
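
In the output above, the 'name' field holds the whole header line ('Cities 36 (2014) 18–27') rather than the journal title alone. If only the title is wanted, it can be split off with a regular expression. This sketch assumes the "title volume (year) first page–last page" layout seen in this Elsevier example; other templates may differ, in which case the input is returned unchanged. `bare_journal_name` is a hypothetical helper.

```python
import re

# Layout assumed from the example: "<title> <volume> (<year>) <first>–<last>"
HEADER_PATTERN = re.compile(
    r'^(?P<title>.+?)\s+\d+\s+\(\d{4}\)\s+\d+\s*[–-]\s*\d+$'
)


def bare_journal_name(header):
    """Return the journal title from a header line, or the input if it does not match."""
    match = HEADER_PATTERN.match(header.strip())
    return match.group('title') if match else header


print(bare_journal_name('Cities 36 (2014) 18–27'))
```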

### Get Plain Text

In [43]:
pdf.plaintext()

'Cities 36 (2014) 18–27\n\nContents lists available at ScienceDirect\n\nCities\n\nj o u r n a l h o m e p a g e : w w w . e l s e v i e r . c o m / l o c a t e / c i t i e s\n\nTrafﬁc performance on quasi-grid urban structures\nJavier Ortigosa a,⇑\n\n, Monica Menendez b\n\na Institute for Transport Planning and Systems, ETH Zurich, HIL F 37.3 Wolfgang-Pauli-Strasse 15, 8093 Zurich, Switzerland\nb Institute for Transport Planning and Systems, ETH Zurich, HIL F 37.2 Wolfgang-Pauli-Strasse 15, 8093 Zurich, Switzerland\n\na r t i c l e\n\ni n f o\n\na b s t r a c t\n\nArticle history:\nReceived 25 December 2012\nReceived in revised form 19 August 2013\nAccepted 30 August 2013\nAvailable online 26 September 2013\n\nKeywords:\nLink removal\nUrban pattern\nTrafﬁc performance\nGrid\n\nCities across the world are starting to recover space, previously devoted to cars, for other uses. The main\npurpose of this paper is to better understand the removal of space in urban settings and to provide som

### Get Section titles and corresponding text

In [44]:
pdf.section()

['Introduction', 'Methodology', 'Conclusions', 'References']


{'Introduction': ['Since the 1950s, with the advent of the automobile, many cities have experienced urban changes to devote more space and infra- structure to cars. This, in turn, has modiﬁed the travel behavior and the activities carried out in cities. This cycle has then become unsustainable: increasing capacity for cars, increases induced de- mand, and negative externalities. Fortunately, in the last decades, cities are trying to revert this cycle and shift toward more sustain- able urban environments. They are aiming at recovering space from cars to ﬁnd a more balanced mode share and higher living standards.',
  'The European Commission (2004) presented the positive expe- rience of many European cities that have restricted car usage in their city centers. The result is that cities not only gain in sustain- ability and livability, but trafﬁc is reduced overall. This phenome- non, called trafﬁc evaporation and based on empirical evidence, describes the reduction of car usage when inf
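
The section text above keeps line-break hyphenation from the PDF layout, for example 'infra- structure' and 'de- mand'. A rough clean-up is to join a lowercase letter, a hyphen, a single space and another lowercase letter. This is only a heuristic sketch: it would also wrongly join legitimate constructions such as 'pre- and post-', so review the results before relying on it. `dehyphenate` is a hypothetical helper, not part of PDFDataExtractor.

```python
import re


def dehyphenate(text):
    """Join words split as 'infra- structure' into 'infrastructure' (heuristic)."""
    # lowercase letter + "- " + lowercase letter -> the two letters joined
    return re.sub(r'([a-z])- ([a-z])', r'\1\2', text)


print(dehyphenate('devote more space and infra- structure to cars'))
```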

### Get References

In [45]:
pdf.reference()

['Introduction', 'Methodology', 'Conclusions', 'References']


{'0': 'Alexander, C., Ishikawa, S., Silverstein, M., Jakobson, M., Fiksdahl-King, I., & Angel, S. (1977). A pattern language: Towns, buildings, constructions. Oxford University Press.Aymerich, O., & Robuste, F. (1990). Fiabilidad de redes de transporte bajo condiciones excepcionales. Revista del Ministerio de Transportes, Turismo y Comunicaciones, 42, 25–37.portal/site/Mobilitat>.~bargera/tntp/>.Bar-Gera, H. (2001). Transportation network test problems. <http://www.bgu.ac.il/Bar-Gera, H.',
 '1': '(2010). Trafﬁc assignment by paired alternative segments.Transportation Research Part B: Methodological, 44(8), 1022–1046.',
 '2': 'Bar-Gera, H., Nie, Y., Boyce, D., Hu, Y., & Liu, Y. (2010). Consistent route ﬂows and the In Transportation research board 89th annual',
 '3': 'condition of proportionality. meeting (pp. 10–1526).Bell, M. G. (2000). A game theory approach to measuring the performance reliability of transport networks. Transportation Research Part B: Methodological, 34(6), 533–545.

## Pass in multiple files at once

In [46]:
import glob

In [47]:
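def read_single(path):
    # Read one PDF and print its abstract
    reader = Reader()
    pdf = reader.read_file(path)
    print(pdf.abstract())


def read_multiple(paths):
    # Process each PDF in turn, printing a separator after each abstract
    for path in paths:
        read_single(path)
        print('-------------------', '\n')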


In [48]:
read_multiple(glob.glob(r'/Users/miao/Desktop/test/els/*.pdf'))

Reading:  /Users/miao/Desktop/test/els/6.pdf
*** Elsevier detected ***
For policymakers, planners, urban design practitioners and city service decision-makers who endeavour to create policies and take decisions to improve the function of cities, developing an understanding of cities, and the particular city in question, is important. However, in the ever-increasing ﬁeld of urban measurement and analysis, the challenges cities face are frequently presumed: crime and fear of crime, social inequality, environmental degradation, economic deterioration and disjointed governance. Although it may be that many cities share similar problems, it is unwise to assume that cities share the same challenges, to the same degree or in the same combination. And yet, diagnosing the challenges a city faces is often overlooked in preference for improving the understanding of known challenges. To address this oversight, this study evidences the need to diagnose urban challenges, introduces a novel mixed-met

*** Elsevier detected ***
Cities are increasingly challenged to improve their competitiveness. Performance indicators stand as an important element to interpret the success of the policy regime adopted by the municipality. Cities with a set of superior economic, social and environmental indicators have the potential to present better living conditions for their inhabitants. In this context, the aim of this research is to analyze whether the in- dicators published by Brazilian cities are aligned with the approach of a smart or sustainable city. The research used a set of 3150 data points regarding the performance of these cities. It analyzed the per- formance of the 150 best cities, divided into three groups of interest identiﬁed as small cities, medium- sized cities and big cities, on a set of 21 indicators. The set of identiﬁed indicators shows the attention of the cities to socioeconomic and information and communication technologies issues, thus revealing that Brazilian city manager
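
Printing abstracts is convenient for a demo, but for larger batches it can help to keep the results and to carry on when one file fails. The sketch below separates the loop from the extraction step so that it runs without any PDFs at hand. `collect` and its `extract` parameter are hypothetical names introduced here for illustration.

```python
def collect(paths, extract):
    """Apply `extract` to each path; return (results, failures) keyed by path."""
    results, failures = {}, {}
    for path in paths:
        try:
            results[path] = extract(path)
        except Exception as exc:
            # Record the error and continue with the remaining files
            failures[path] = repr(exc)
    return results, failures
```

In this notebook, `extract` could be `lambda p: Reader().read_file(p).abstract()`, and `paths` the same `glob.glob(...)` list used above.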

## Known Issues

In ACS
* A few journals use two section title styles at the same time: a numbered style and a ■ style. Because the two styles have very different font sizes, this can confuse the title filtering function. Reference extraction is not affected.
* Extracted references might not be in order
* Parts of an extracted reference could be missing

In Elsevier
* Journal extraction is potentially weak, so journal information can be missing
* Unnumbered references can be messy

In RSC
* Title can be missing
* Journal year, volume and page numbers can be missing in certain articles
* Some section titles can be missed, but the reference section is still extracted reliably


In Advanced Material Families
* Reference entries can be mixed together
* Keywords can be found inside reference entries (roughly 1 in 20)
* Some authors place their bio at the very end; this text is not currently excluded from the references

In Chemistry A European Journal (CAEJ)
* Keywords can be incomplete

In Angewandte
* Keywords might not be in order