# Post Processing

This notebook walks through the processing of records extracted with ChemDataExtractor (CDE). The classes and functions written for this stage are contained in PostProcessing.py.

Here, four different databases will be created using the output from CDE:
1. Yield Strength
2. Grain Size
3. Combined Yield Strength & Grain Size
4. Engineering Ready Yield Strength

The databases will be exported in three formats:

1. JSON
2. CSV
3. MongoDB BSON

In [None]:
# Importing Relevant Modules
from PostProcessing import DataProcessor # Custom Post Processing functions
import pandas as pd
import numpy as np
import pymongo # For MongoDB BSON

## Defining File Paths

The paths to the extracted records and the directories in which the databases will be saved are defined here.

In [None]:
ys_txt_open_path = "./PATH"
ys_tbl_open_path = "./PATH"

ys_txt_path = "./PATH"
ys_tbl_path = "./PATH"


gs_txt_open_path = "./PATH"
gs_tbl_open_path = "./PATH"

gs_txt_path = "./PATH"
gs_tbl_path = "./PATH"

save_dir_gs = "./PATH"
save_dir_ys = "./PATH"
save_dir_combined = "./PATH"
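Because the export cells later append file names directly to the save directories, each `save_dir_*` string has to end with a path separator. A minimal sketch of an alternative, assuming `pathlib` is acceptable in this workflow, is shown below. The temporary base directory and the sub-directory names are hypothetical stand-ins for the real `./PATH` locations.

```python
import tempfile
from pathlib import Path

# Hypothetical base directory standing in for the real "./PATH" locations.
base = Path(tempfile.mkdtemp())

# Build the save directory and create it if it does not exist yet.
save_dir = base / "databases" / "yield_strength"
save_dir.mkdir(parents=True, exist_ok=True)

# Joining with "/" inserts the separator, so no trailing slash is needed.
target = save_dir / "YieldStrength_Database.json"
print(target)
```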


## Reading Text and Table Records for Yield Strength:

The extracted values of yield strength from text and tables are read and combined before being filtered using the functions defined in PostProcessing.py.

The property models used for extraction need to be defined as they are used as keys within the dictionaries storing the extracted records. 

In [None]:
ys_txt_model = "YieldStrength"
ys_tbl_model = "TableYieldStrength"
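To illustrate why the model names matter, the sketch below assumes that each serialized record is a dictionary with a single top-level key equal to the model name. The sample records, their field names, and the `select_records` helper are hypothetical and are not taken from PostProcessing.py.

```python
# Hypothetical serialized records; the inner field names are illustrative only.
records = [
    {"YieldStrength": {"raw_value": "250", "raw_units": "MPa"}},
    {"TableYieldStrength": {"raw_value": "0.3", "raw_units": "GPa"}},
    {"GrainSize": {"raw_value": "12", "raw_units": "um"}},
]


def select_records(records, model):
    """Return the inner dictionaries of all records keyed by `model`."""
    return [rec[model] for rec in records if model in rec]


# Only records produced by the text model for yield strength are returned.
ys_text_records = select_records(records, "YieldStrength")
print(ys_text_records)
```

Selecting by key in this way is why the text model and the table model need separate names.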

### Yield Strength Sub Access Processing:


In [None]:
ys_txt = DataProcessor(model=ys_txt_model, path= ys_txt_path)
ys_tbl = DataProcessor(model=ys_tbl_model, path=ys_tbl_path)

ys_txt.readOutput()
ys_txt.recordReader()
ys_txt.filterMatParser()
ys_txt.filterDict()


ys_tbl.readOutput()
ys_tbl.recordReader()
ys_tbl.filterMatParser()
ys_tbl.filterDict()

# pd.concat is used because DataFrame.append was removed in pandas 2.0
ys_database = pd.concat([ys_txt.filtered_database, ys_tbl.filtered_database])

### Yield Strength Open Access Processing:


In [None]:
ys_txt_open = DataProcessor(model=ys_txt_model, path= ys_txt_open_path)
ys_tbl_open = DataProcessor(model=ys_tbl_model, path=ys_tbl_open_path)

ys_txt_open.readOutput()
ys_txt_open.recordReader()
ys_txt_open.filterMatParser()
ys_txt_open.filterDict()


ys_tbl_open.readOutput()
ys_tbl_open.recordReader()
ys_tbl_open.filterMatParser()
ys_tbl_open.filterDict()

# pd.concat is used because DataFrame.append was removed in pandas 2.0
ys_open_database = pd.concat([ys_txt_open.filtered_database, ys_tbl_open.filtered_database])

### Combining Open Access and Sub Access results


In [None]:
# pd.concat is used because DataFrame.append was removed in pandas 2.0
full_ys_database = pd.concat([ys_database, ys_open_database]).reset_index(drop=True).sort_values('DOI')
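In pandas 2.0 and later `DataFrame.append` is no longer available, and `pd.concat` performs the same row-wise combination. The sketch below shows the combine-and-sort step on toy data. The two small DataFrames are hypothetical stand-ins for the `filtered_database` attributes, and sorting before resetting the index is a design choice made here so that the final index follows the DOI order.

```python
import pandas as pd

# Toy stand-ins for the filtered text and table databases.
txt_db = pd.DataFrame({"DOI": ["10.1/b", "10.1/a"], "Value": [[250.0], [310.0]]})
tbl_db = pd.DataFrame({"DOI": ["10.1/c"], "Value": [[0.4]]})

# Stack the rows, sort by DOI, then renumber the index from zero.
merged = (
    pd.concat([txt_db, tbl_db])
    .sort_values("DOI")
    .reset_index(drop=True)
)
print(merged)
```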

## Engineering Ready Yield Strength Database
For studies on engineering hard materials, yield strength values typically lie between 100 MPa and 1500 MPa; this range is also motivated by the statistical outliers of the full yield strength database. The Engineering Ready Yield Strength database therefore restricts values to this range.

In [None]:
er_ys_database = full_ys_database.explode("Value")
value, units = er_ys_database["Value"], er_ys_database["Units"]
mpa, gpa = '(10^6.0) * Pascal^(1.0)', '(10^9.0) * Pascal^(1.0)'
# Keep values strictly between 100 MPa and 1500 MPa (0.1 GPa and 1.5 GPa)
in_mpa_range = (units == mpa) & (value > 100) & (value < 1500)
in_gpa_range = (units == gpa) & (value > 0.1) & (value < 1.5)
er_ys_database = er_ys_database[in_mpa_range | in_gpa_range]
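The range restriction can also be written by first expressing every value in MPa and then applying a single comparison. The following is a minimal sketch on toy data, not the notebook's own method. The `Value_MPa` column and the conversion table are assumptions introduced for illustration, and only the two unit strings used in the notebook are handled.

```python
import pandas as pd

MPA = "(10^6.0) * Pascal^(1.0)"
GPA = "(10^9.0) * Pascal^(1.0)"

# Toy records: each row holds a list of extracted values and one unit string.
toy = pd.DataFrame({
    "Value": [[50.0], [250.0, 1600.0], [0.3], [2.0]],
    "Units": [MPA, MPA, GPA, GPA],
})

# One row per extracted value; cast to float for numeric comparison.
exploded = toy.explode("Value", ignore_index=True).astype({"Value": float})

# Express every value in MPa (hypothetical helper column).
factor = exploded["Units"].map({MPA: 1.0, GPA: 1000.0})
exploded["Value_MPa"] = exploded["Value"] * factor

# Strict bounds, matching the > and < comparisons used in the notebook.
in_range = exploded["Value_MPa"].between(100, 1500, inclusive="neither")
engineering_ready = exploded[in_range]
print(engineering_ready)
```

Rows with a unit string missing from the conversion table get a missing `Value_MPa` and are dropped by the range test.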


## Reading Text and Table records for Grain Size

The process is the same as for Yield Strength.

In [None]:
gs_txt_model = "GrainSize"
gs_tbl_model = "TableGrainSize"

### Grain Size Sub Access

In [None]:
gs_txt = DataProcessor(model=gs_txt_model, path= gs_txt_path)
gs_tbl = DataProcessor(model=gs_tbl_model, path=gs_tbl_path)

gs_txt.readOutput()
gs_txt.recordReader()
gs_txt.filterMatParser()
gs_txt.filterDict()


gs_tbl.readOutput()
gs_tbl.recordReader()
gs_tbl.filterMatParser()
gs_tbl.filterDict()

# pd.concat is used because DataFrame.append was removed in pandas 2.0
gs_database = pd.concat([gs_txt.filtered_database, gs_tbl.filtered_database])

### Grain Size Open Access

In [None]:
gs_txt_open = DataProcessor(model=gs_txt_model, path= gs_txt_open_path)
gs_tbl_open = DataProcessor(model=gs_tbl_model, path=gs_tbl_open_path)

gs_txt_open.readOutput()
gs_txt_open.recordReader()
gs_txt_open.filterMatParser()
gs_txt_open.filterDict()


gs_tbl_open.readOutput()
gs_tbl_open.recordReader()
gs_tbl_open.filterMatParser()
gs_tbl_open.filterDict()

# pd.concat is used because DataFrame.append was removed in pandas 2.0
gs_open_database = pd.concat([gs_txt_open.filtered_database, gs_tbl_open.filtered_database])

### Combining Open Access and Sub Access results

In [None]:
# pd.concat is used because DataFrame.append was removed in pandas 2.0
full_gs_database = pd.concat([gs_database, gs_open_database]).reset_index(drop=True).sort_values('DOI')

## Combined Yield Strength and Grain Size Database

For further study of the relationship between yield strength and grain size, the two databases are combined. Entries from the two databases are paired when they meet the following conditions:
1. The entries have the same compound names.
2. The entries have the same DOI, i.e. they come from the same article.
3. The article has the same number of grain size and yield strength entries for that compound.

Within each article, the yield strength entries are sorted in ascending order and the grain size entries in descending order. The two lists are then paired element by element, so that smaller grain sizes are assigned to larger yield strength values.

In [None]:
# Join each list of compound names into one string so it can serve as a grouping key
full_ys_database['name'] = full_ys_database['Compound'].apply(','.join)
full_gs_database['name'] = full_gs_database['Compound'].apply(','.join)
combined = []
for name in full_ys_database['name'].unique():

    ys_name = full_ys_database[full_ys_database['name']==name]
    gs_name = full_gs_database[full_gs_database['name']==name]
    # DOIs that report both properties for this compound
    doi = ys_name[ys_name['DOI'].isin(gs_name['DOI'])].DOI.to_numpy()
    for article in np.unique(doi):
        ys = ys_name[ys_name.DOI == article]
        gs = gs_name[gs_name.DOI == article]

        # Pair only when both properties have the same number of entries
        if len(gs) == len(ys):

            # Grain size descending, yield strength ascending
            gs_val = gs.sort_values('Value',ascending=False).Value.to_numpy()
            gs_unit = gs.sort_values('Value',ascending=False).Units.to_numpy()
            ys_val = ys.sort_values('Value',ascending=True).Value.to_numpy()
            ys_unit = ys.sort_values('Value',ascending=True).Units.to_numpy()
            compound = ys['Compound'].to_numpy()
            open_access = ys['Open Access'].to_numpy()
            blacklist = ys['Blacklisted Compound?'].to_numpy()
            for i in range(len(ys)):
                combined.append([compound[i],blacklist[i],ys_val[i],ys_unit[i],gs_val[i],gs_unit[i],article,open_access[i]])

combined_data = pd.DataFrame(combined, columns=['Compound', 'Blacklisted Compound?', 'Yield Strength Value', 'Yield Strength Unit', 'Grain Size Value', 'Grain Size Unit','DOI', 'Open Access'])
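The pairing rule can be illustrated in isolation on plain numbers. The sketch below assumes scalar values that are already expressed in consistent units. The `pair_by_rank` function is a hypothetical helper written for this illustration and is not part of PostProcessing.py.

```python
import numpy as np


def pair_by_rank(ys_values, gs_values):
    """Pair yield strengths with grain sizes by opposite rank.

    Returns None when the counts differ, mirroring the equal-length condition.
    """
    if len(ys_values) != len(gs_values):
        return None
    ys_sorted = np.sort(ys_values)[::-1]  # descending yield strength
    gs_sorted = np.sort(gs_values)        # ascending grain size
    return list(zip(ys_sorted.tolist(), gs_sorted.tolist()))


pairs = pair_by_rank([300.0, 450.0, 380.0], [5.0, 20.0, 1.0])
print(pairs)  # [(450.0, 1.0), (380.0, 5.0), (300.0, 20.0)]
```

The largest yield strength is matched with the smallest grain size, and articles with unequal counts are skipped rather than paired partially.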

## Exporting Databases

The processed data is exported in JSON, CSV and MongoDB BSON formats so that it can be integrated easily into any lookup or data-driven pipeline.

In [None]:
export_full_ys = full_ys_database.drop(columns=['name'])
export_full_gs = full_gs_database.drop(columns=['name'])
export_er_ys = er_ys_database
export_combined = combined_data

### Exporting to CSV & JSON

This is done using the inbuilt functions of pandas.

In [None]:
# To JSON (file names are appended directly, so each save directory must end with a path separator)
export_full_ys.to_json(save_dir_ys+"YieldStrength_Database.json", orient="records")
export_full_gs.to_json(save_dir_gs+"GrainSize_Database.json", orient="records")
export_er_ys.to_json(save_dir_ys+"EngineeringReady_YieldStrength_Database.json", orient="records")
export_combined.to_json(save_dir_combined+"Combined_YieldStrength_GrainSize_Database.json", orient="records")

# To CSV
export_full_ys.to_csv(save_dir_ys+"YieldStrength_Database.csv", index=False)
export_full_gs.to_csv(save_dir_gs+"GrainSize_Database.csv", index=False)
export_er_ys.to_csv(save_dir_ys+"EngineeringReady_YieldStrength_Database.csv", index=False)
export_combined.to_csv(save_dir_combined+"Combined_YieldStrength_GrainSize_Database.csv", index=False)
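The two formats treat list-valued columns differently, which matters when the files are read back. The sketch below writes a toy DataFrame to a temporary directory and reloads it. It assumes that columns such as `Compound` and `Value` hold Python lists, as the joining and exploding steps earlier in the notebook suggest. The file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Toy database with list-valued columns.
toy = pd.DataFrame({
    "Compound": [["Al", "Mg"], ["Cu"]],
    "Value": [[250.0], [310.0]],
    "DOI": ["10.1/a", "10.1/b"],
})

with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "toy.json"
    csv_path = Path(tmp) / "toy.csv"
    toy.to_json(json_path, orient="records")
    toy.to_csv(csv_path, index=False)

    # Read both files back before the temporary directory is removed.
    with open(json_path) as fh:
        from_json = json.load(fh)
    from_csv = pd.read_csv(csv_path)

json_compound = from_json[0]["Compound"]    # still a list of strings
csv_compound = from_csv.loc[0, "Compound"]  # a single string
print(type(json_compound).__name__, type(csv_compound).__name__)
```

JSON keeps the lists as arrays, whereas CSV stores their string representation, so the CSV files need an extra parsing step if the lists are required again.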

### Exporting to MongoDB BSON

This requires MongoDB to be installed and set up alongside the pymongo package.

See:
[1] https://pypi.org/project/pymongo/
[2] https://docs.mongodb.com/

In [None]:
client = pymongo.MongoClient()

db = client["StressDatabases"]

ys_col = db["Yield Strength"]
gs_col  = db["Grain Size"]
combined_col = db["Combined Yield Strength and Grain Size"]
er_ys_col = db["Engineering Ready Yield Strength"]


ys_col.insert_many(export_full_ys.to_dict("records"))
gs_col.insert_many(export_full_gs.to_dict("records"))
combined_col.insert_many(export_combined.to_dict("records"))
er_ys_col.insert_many(export_er_ys.to_dict("records"))
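The BSON encoder used by pymongo does not accept NumPy scalar or array types, so records that still contain them may fail on insertion. The sketch below shows one way to convert records to built-in Python types first. The `to_native` helper and the toy DataFrame are hypothetical, and whether the real databases contain NumPy objects depends on how PostProcessing.py builds them.

```python
import numpy as np
import pandas as pd


def to_native(obj):
    """Recursively convert NumPy arrays and scalars to built-in Python types."""
    if isinstance(obj, np.ndarray):
        return [to_native(x) for x in obj.tolist()]
    if isinstance(obj, np.generic):
        return obj.item()
    if isinstance(obj, (list, tuple)):
        return [to_native(x) for x in obj]
    if isinstance(obj, dict):
        return {str(k): to_native(v) for k, v in obj.items()}
    return obj


# Toy database whose cells hold NumPy scalars.
toy = pd.DataFrame({
    "Value": [[np.float64(250.0)], [np.float64(310.0), np.float64(320.0)]],
    "Count": np.array([1, 2], dtype=np.int64),
})

documents = [to_native(rec) for rec in toy.to_dict("records")]
print(documents)
# The converted documents can then be passed to a collection's insert_many().
```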