# PBA Analysis
Ethan Woodbury

In this notebook, I load a dataset of PBA structures, convert it into valid JSON, and take a first look at its contents.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Data Importing and Processing

Pre-processing:

In [2]:
with open('pba.json', 'r') as file:
    pba_json = file.read()

In [3]:
#Previewing the file:
print(pba_json[:100])

/* 1 */
{
    "_id" : ObjectId("58d009ea48a464edfdfb435d")
}

/* 2 */
{
    "_id" : ObjectId("58e5d1


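As we can see, the file contains multiple JSON objects, each preceded by a "/* n */" comment. Python's json library expects a file to hold a single JSON value and does not accept comments, so the file cannot be loaded as it stands. We have to either read the objects one at a time or reshape the file into one valid JSON document.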
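As an alternative to rewriting the file, the objects could be read one at a time with json.JSONDecoder.raw_decode, which parses a single value starting at a given position and reports where it stopped. The following is a minimal sketch, assuming the "/* n */" comments and ObjectId wrappers have already been removed; iter_json_objects and the sample string are illustrative names and data, not part of this notebook.

```python
import json

def iter_json_objects(text):
    """Yield each top-level JSON value found in a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Hypothetical sample: two objects back to back, separated by blank lines.
sample = '{"_id": "a1"}\n\n{"_id": "b2", "output": {"energy_per_atom": -1.5}}\n'
objects = list(iter_json_objects(sample))
print(len(objects))
```

This approach avoids building a second copy of the file, at the cost of a small helper function.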

We also see that the "\_id" field of each object wraps its value in an ObjectId(...) call. This is not standard JSON (I wasn't able to load the file while ObjectId was there), so let's strip the wrapper and keep only the quoted string inside it.

In [4]:
import re

# Unwrap ObjectId("...") only; replacing every ")" would also alter any other text that contains one.
pba_json = re.sub(r'ObjectId\("([^"]*)"\)', r'"\1"', pba_json)
print(pba_json[:50])

/* 1 */
{
    "_id" : "58d009ea48a464edfdfb435d"
}
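To illustrate why the scope of the replacement matters, the sketch below compares removing every ")" with a pattern that only unwraps ObjectId("..."). The sample record and its "note" field are made up for the comparison and do not come from pba.json.

```python
import re

# Hypothetical record whose "note" value contains parentheses of its own.
sample = '{"_id" : ObjectId("58d009ea48a464edfdfb435d"), "note" : "Fe(CN)6"}'

# Blanket replacement: also removes the ")" inside the "note" value.
blanket = sample.replace("ObjectId(", "").replace(")", "")

# Targeted replacement: only unwraps ObjectId("...") and leaves other text alone.
targeted = re.sub(r'ObjectId\("([0-9a-fA-F]+)"\)', r'"\1"', sample)

print(blanket)
print(targeted)
```

Whether the real file contains other parentheses is not checked here; the targeted pattern simply makes the result independent of that.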


Next we make the objects parseable. Because the file holds multiple JSON objects, we will turn it into a single JSON array: each "/* n */" comment becomes a comma separating the objects, and the whole text is wrapped in square brackets.

In [5]:
import re

# Replace every "/* n */" marker with a comma, whatever the value of n.
pba_json = re.sub(r'/\* \d+ \*/', ',', pba_json)

print(pba_json[:100])

,
{
    "_id" : "58d009ea48a464edfdfb435d"
}

,
{
    "_id" : "58e5d103d95cbb63a64878f0",
    "input


In [6]:
#Adding square brackets:
pba_json = '[\n' + pba_json + '\n]'
print(pba_json[:52])

[
,
{
    "_id" : "58d009ea48a464edfdfb435d"
}

,
{



The array now starts with a stray comma, which is not valid JSON, so we also have to delete it:

In [7]:
# Index 2 is the stray comma that follows the opening '[' and newline.
pba_json = pba_json[:2] + pba_json[3:]
print(pba_json[:50])

[

{
    "_id" : "58d009ea48a464edfdfb435d"
}

,
{


Finally, we save this string as a json file.

In [8]:
# The with-statement closes the file automatically once the write is done.
with open('pba_json_formatted.json', 'w') as out_file:
    out_file.write(pba_json)

Now that the file is a valid JSON array of objects, we can use the json library to load it as a list of dictionaries (one dictionary per object).

In [10]:
import json

In [11]:
with open('pba_json_formatted.json') as pba_json_formatted:
    pba_data = json.load(pba_json_formatted)

In [17]:
pba_data[1]

{'_id': '58e5d103d95cbb63a64878f0',
 'input': {'structure': {'@module': 'pymatgen.core.structure',
   '@class': 'Structure',
   'lattice': {'matrix': [[9.95090252, -0.0003358, -0.0003358],
     [-0.0003358, 9.95090252, 0.0003358],
     [-0.0003358, 0.0003358, 9.95090252]],
    'a': 9.9509025313318,
    'b': 9.9509025313318,
    'c': 9.9509025313318,
    'alpha': 89.9961329643568,
    'beta': 90.0038670356432,
    'gamma': 90.0038670356432,
    'volume': 985.34295115756},
   'sites': [{'species': [{'element': 'Ca', 'occu': 1}],
     'abc': [0.75135993, 0.75127745, 0.75127745],
     'xyz': [7.4762048629286, 7.47588864272739, 7.47588864272739],
     'label': 'Ca'},
    {'species': [{'element': 'Ca', 'occu': 1}],
     'abc': [0.24872255, 0.24864007, 0.75127745],
     'xyz': [2.47467807727261, 2.4743618570714, 7.47588864272739],
     'label': 'Ca'},
    {'species': [{'element': 'Ca', 'occu': 1}],
     'abc': [0.24872255, 0.75127745, 0.24864007],
     'xyz': [2.47467807727261, 7.475888642727

We now have all of the data stored in pba_data, a Python list of dictionaries (one dictionary per structure).
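Before exploring the values, it can help to check which top-level keys the records actually share. The sketch below counts key occurrences across a list of dictionaries; summarize_keys and sample_data are illustrative names, and sample_data is a small stand-in for pba_data that mirrors the two record shapes seen so far.

```python
from collections import Counter

def summarize_keys(records):
    """Count how often each top-level key appears across a list of dicts."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts

# Hypothetical stand-in for pba_data: one record with only an _id,
# one record that also has an 'input' entry.
sample_data = [
    {'_id': '58d009ea48a464edfdfb435d'},
    {'_id': '58e5d103d95cbb63a64878f0', 'input': {}},
]
key_counts = summarize_keys(sample_data)
print(type(sample_data).__name__, len(sample_data))
print(dict(key_counts))
```

Calling summarize_keys(pba_data) would show how many records carry each key, which indicates how many rows to expect with missing values later on.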

## Data Exploration

First let's check how many structures are in the file:

In [33]:
print(len(pba_data))

536


Let's start by looking at the energy_per_atom values for each of the 536 structures. We'll put each of them into a pandas dataframe along with the \_id value.

In [113]:
# One row per record; size the frame from the data rather than hard-coding 536.
pba_df = pd.DataFrame(np.zeros([len(pba_data), 2]), columns=['_id', 'energy_per_atom'])
pba_df.loc[:, 'energy_per_atom'] = np.nan
pba_df.head()

|   | _id | energy_per_atom |
|---|-----|-----------------|
| 0 | 0.0 | NaN |
| 1 | 0.0 | NaN |
| 2 | 0.0 | NaN |
| 3 | 0.0 | NaN |
| 4 | 0.0 | NaN |


In [115]:
for i, record in enumerate(pba_data):
    pba_df.loc[i, '_id'] = record['_id']
    # Records without an 'output' entry keep NaN as their energy.
    if 'output' in record:
        pba_df.loc[i, 'energy_per_atom'] = record['output']['energy_per_atom']
pba_df.head()

|   | _id | energy_per_atom |
|---|-----|-----------------|
| 0 | 58d009ea48a464edfdfb435d | NaN |
| 1 | 58e5d103d95cbb63a64878f0 | -7.947785 |
| 2 | 58e5d318d95cbb63a648790f | -8.352124 |
| 3 | 58e5d53cd95cbb63a6487922 | -8.191138 |
| 4 | 58e5d670d95cbb63a6487926 | -8.412139 |
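The same table can be built in one step from a list of row dictionaries, which avoids pre-filling the "\_id" column with zeros before overwriting it with strings. This is a sketch rather than the method used above; sample_data stands in for pba_data and reuses the first three "\_id" and energy values as displayed in the output.

```python
import numpy as np
import pandas as pd

# Stand-in for pba_data, using the first three rows displayed above;
# the first record has no 'output' key.
sample_data = [
    {'_id': '58d009ea48a464edfdfb435d'},
    {'_id': '58e5d103d95cbb63a64878f0', 'output': {'energy_per_atom': -7.947785}},
    {'_id': '58e5d318d95cbb63a648790f', 'output': {'energy_per_atom': -8.352124}},
]

rows = [
    {
        '_id': record['_id'],
        # Fall back to NaN when a record has no output or no energy value.
        'energy_per_atom': record.get('output', {}).get('energy_per_atom', np.nan),
    }
    for record in sample_data
]
sample_df = pd.DataFrame(rows, columns=['_id', 'energy_per_atom'])
print(sample_df)

# Number of records with no energy value.
print(sample_df['energy_per_atom'].isna().sum())
```

Counting the missing values this way gives a quick check on how many structures lack an energy before any statistics are computed.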
