# Extracting Properties: Example Code

This notebook is a short tutorial on using the ChemDataExtractor code base to extract records from an input article. It assumes the user has an article containing text-based yield strength information, which will be processed by ChemDataExtractor-StressEng.

To source articles, see the code in the "Webscraping" folder, which can be used to download articles from Elsevier using their API.

## Importing Relevant Modules

The relevant ChemDataExtractor modules for extracting information are:

- `Document`: the input article is converted into a `Document` object consisting of paragraphs, figures, tables and other article elements, which are used for further processing.
- `ElsevierXmlReader`: as the input articles are from Elsevier, this reader processes the input article and handles the formatting tags specific to Elsevier XML files when converting it into a `Document` object.
- `YieldStrength`: the property model that defines how to extract the information. See "chemdataextractor/model/model.py" for a full list of currently supported property models.

In [None]:
from chemdataextractor import Document
from chemdataextractor.reader.elsevier import ElsevierXmlReader
from chemdataextractor.model.model import (
    YieldStrength,
    TableYieldStrength, 
    GrainSize, 
    TableGrainSize
)

## Reading and Processing an Input Article

First, the article needs to be read and converted into a `Document` object for processing. This is done with the `Document.from_file` method, which requires the path to the downloaded input article and, optionally, a list of readers to use. Here, `ElsevierXmlReader()` is used.

Next, a property model is assigned to the `Document` object. Multiple property models can be assigned and these will all be used to extract the associated property from the document. In this case, only the `YieldStrength` property model is assigned.

In [None]:
# Reading Article
downloaded_article_path = "/PATH"
doc = Document.from_file(downloaded_article_path, readers=[ElsevierXmlReader()])

extraction_model = YieldStrength
doc.models = [extraction_model]
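
When several downloaded articles need processing, the same read-and-assign steps can be wrapped in small helpers. The sketch below is an illustration only: `find_articles` and `load_document` are hypothetical names that are not part of ChemDataExtractor, and it assumes the downloaded articles are stored as `.xml` files in a single folder. The ChemDataExtractor import is deferred to inside `load_document` so that the file-discovery part runs without the library installed.

```python
from pathlib import Path


def find_articles(folder):
    """Return the .xml files in `folder`, sorted by path (hypothetical helper)."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".xml")


def load_document(path, models):
    """Read one Elsevier XML article and assign the given property models."""
    # Imported here so find_articles works without ChemDataExtractor installed.
    from chemdataextractor import Document
    from chemdataextractor.reader.elsevier import ElsevierXmlReader

    doc = Document.from_file(str(path), readers=[ElsevierXmlReader()])
    doc.models = list(models)  # e.g. [YieldStrength, GrainSize]
    return doc
```

Passing more than one model, such as `[YieldStrength, GrainSize]`, would assign both to the same `Document`, as described above.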

## Extracting Metadata

Having processed the input article into a `Document` object, the metadata of the article can be extracted. This includes information such as the article title, authors, DOI and journal information.

In [None]:
# Extracting Metadata
try:
    metadata = doc.metadata.serialize()
except Exception:
    # Fall back to a placeholder if the article has no usable metadata
    metadata = "Not Found"

## Parsing Article

The parsers defined in the property models are then used to extract information from the `Document`, resulting in a list of dictionaries, one per extracted record.

In [None]:
# Parsing article
parsed_information = doc.records.serialize()
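
The exact fields depend on the property model, so it can help to inspect what was returned before filtering. The sketch below assumes each serialized record is a dictionary with a single top-level key naming the record type (for example `Compound` or `YieldStrength`). `count_record_types` is a hypothetical helper, and the sample list is invented for illustration rather than taken from a real article.

```python
from collections import Counter


def count_record_types(parsed_information):
    """Count serialized records by top-level key (assumed to name the record type)."""
    return Counter(key for record in parsed_information for key in record)


# Invented sample data; real records will contain more fields.
sample = [
    {"Compound": {"names": ["Ti-6Al-4V"]}},
    {"YieldStrength": {"raw_value": "880", "raw_units": "MPa"}},
    {"YieldStrength": {"raw_value": "910", "raw_units": "MPa"}},
]
print(count_record_types(sample))
```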

## Extracting Complete Records

From the parsed information, the property records are kept and standalone `Compound` records (those whose top-level key is `Compound`) are skipped. Processing and article information are then attached to each kept record, giving a list of dictionaries of material information.


In [None]:
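# Extracting Complete Records
records = []

for record in parsed_information:
    # Skip standalone compound records (top-level key "Compound")
    if "Compound" in record:
        continue
    # Attach article and processing information to the property record
    record["Article Metadata"] = metadata
    record["Extraction Model"] = extraction_model
    records.append(record)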
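
Before the records are passed to a later step, they may need to be serialized, for example as JSON. The sketch below is one possible approach, not the project's own method. `build_records` is a hypothetical helper that repeats the filtering above on copies of the parsed dictionaries and stores the model's name rather than the class object, since a class object cannot be written to JSON. The sample input is invented.

```python
import json


def build_records(parsed_information, metadata, model_name):
    """Keep non-Compound records and attach article and model information."""
    records = []
    for item in parsed_information:
        if "Compound" in item:  # standalone compound record, skipped as above
            continue
        record = dict(item)  # shallow copy so the parsed list is left unchanged
        record["Article Metadata"] = metadata
        record["Extraction Model"] = model_name
        records.append(record)
    return records


# Invented sample data for illustration only.
sample = [
    {"Compound": {"names": ["Ti-6Al-4V"]}},
    {"YieldStrength": {"raw_value": "880", "raw_units": "MPa"}},
]
records = build_records(sample, metadata="Not Found", model_name="YieldStrength")
print(json.dumps(records, indent=2))
```

In the notebook itself, `model_name` could be taken from `extraction_model.__name__`.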

## Next Step

Having extracted information from the article, the next step is to pass the records through the post-processing system, which is described in `PostProcessing.ipynb` in the `Post_Processing` code folder.