## Prerequisites

### Google Colab

This notebook can be run in Colab (**this is the fastest way to run the calculation on an unconfigured system**):

In Google Colab (https://colab.research.google.com/), go to File | Open notebook | GitHub, 
paste the path to the current notebook, and open it: https://github.com/alexnkorovin/ocp-airi/blob/dev/airi_utils/our_base_model.ipynb

Before you start:

1. Put the shared folder with the datasets in your Google Drive root folder /drive/MyDrive/

The folder is available via the **sharing** link below:

*   ocp_datasets [[share link to drive](https://drive.google.com/drive/folders/1Nn9t-zTJiRP1-34rdAugv6aY_2-BSQfN?usp=sharing)]<br>

```
Note:
If this folder was added via the sharing link, it should contain the following files:

ocp-datasets/data/is2re/train/all/val_ood_both/data.lmdb
ocp-datasets/data/is2re/train/all/test_ood_both/data.lmdb
ocp-datasets/data/is2re/train/all/test_ood_both/structures.pkl

 ```
2. Enable GPU support in Edit/Notebook Settings
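
A quick way to confirm that the shared folder landed in the right place is to check for the files listed in the note above. The sketch below is only an illustration: the helper name `missing_dataset_files` is hypothetical, the relative paths are copied from the note as written, and `/content/drive/MyDrive` is assumed to be the Colab mount point.

```python
import os

# Paths copied from the note above, relative to the folder that holds "ocp-datasets".
EXPECTED_FILES = [
    "ocp-datasets/data/is2re/train/all/val_ood_both/data.lmdb",
    "ocp-datasets/data/is2re/train/all/test_ood_both/data.lmdb",
    "ocp-datasets/data/is2re/train/all/test_ood_both/structures.pkl",
]


def missing_dataset_files(root, expected=EXPECTED_FILES):
    """Return the expected files that are not present under `root` (hypothetical helper)."""
    return [p for p in expected if not os.path.isfile(os.path.join(root, p))]


# Assumed Colab mount point; pass your own download folder on a local machine.
missing = missing_dataset_files("/content/drive/MyDrive")
print("all files found" if not missing else f"missing: {missing}")
```

If the folder name in your Drive differs from the one in the note, adjust `EXPECTED_FILES` accordingly.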

### On a local PC

Download the specified data files from this [link](https://drive.google.com/drive/folders/1Nn9t-zTJiRP1-34rdAugv6aY_2-BSQfN?usp=sharing) into a local folder.


### Use the cell below to mount your Google Drive with the datasets
 - follow the link
 - log in with your Google account
 - copy the token key
 - paste it into the input line in this notebook

In [1]:
try:
    # available only inside Google Colab
    from google.colab import drive
    drive.mount('/content/drive')
except ImportError:
    # not running in Colab: nothing to mount
    pass

## Environment installation

### On a local PC
```
$ conda install pytorch-geometric -c rusty1s -c conda-forge
```
or via pip wheels. First check the installed PyTorch and CUDA versions:

```
$ python -c "import torch; print(torch.__version__)"
>>> 1.9.0 -> {TORCH}=1.9.0
$ python -c "import torch; print(torch.version.cuda)"
>>> 11.1 -> {CUDA}=cu111
```

Substitute {TORCH} and {CUDA} in the commands below with the values appropriate for your system:
```
pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
pip install torch-geometric
```
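
The substitution can also be done programmatically. The following is a minimal sketch that builds the wheel index URL in the same pattern as the commands above; the function name `wheel_index_url` is hypothetical, and the `cpu` tag for CPU-only builds is an assumption about the wheel naming.

```python
def wheel_index_url(torch_version, cuda_version):
    """Build the URL passed to `pip install ... -f <url>` (hypothetical helper)."""
    # torch.__version__ may carry a local suffix such as "1.9.0+cu111"; keep the base part.
    torch_tag = torch_version.split("+")[0]
    # torch.version.cuda is None on CPU-only builds (assumed tag: "cpu"); "11.1" becomes "cu111".
    cuda_tag = "cpu" if cuda_version is None else "cu" + cuda_version.replace(".", "")
    return f"https://pytorch-geometric.com/whl/torch-{torch_tag}+{cuda_tag}.html"


print(wheel_index_url("1.9.0", "11.1"))
# -> https://pytorch-geometric.com/whl/torch-1.9.0+cu111.html
```

In a session where PyTorch is installed, the arguments would be `torch.__version__` and `torch.version.cuda`.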

#### On Colab and also on a local PC (locally, the conda way is preferable)

In [2]:
# # This might take about 10 min in Colab (needed only in Colab)
# !pip install -q torch-scatter -f https://pytorch-geometric.com/whl/torch-1.4.0+cu101.html
# !pip install -q torch-sparse -f https://pytorch-geometric.com/whl/torch-1.4.0+cu101.html
# !pip install -q torch-geometric

In [1]:
import os
import pickle

import numpy as np
import pandas as pd
import torch

import torch.nn.functional as F
import torch.optim as optim

from datetime import datetime
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torch.utils.tensorboard import SummaryWriter

In [2]:
def simple_preprocessing(system):
    
    tags = system['tags'].long().to(device)
    tags = F.one_hot(tags, num_classes=3)
    
    atom_numbers = system['atomic_numbers'].long().to(device)
    atom_numbers = F.one_hot(atom_numbers, num_classes=100)
    
    pos = system['pos'].to(device)
    
    atom_features = (tags, atom_numbers, pos)#, spherical_radii)
    atom_embeds = torch.cat(atom_features, 1)
                    
    #padding
    pad_value = 0#-float("Inf")
    pads = torch.full((MAX_LEN-atom_embeds.shape[0], atom_embeds.shape[1]), pad_value)
    padding_mask = torch.cat((torch.full((atom_embeds.shape[0], ), False), torch.full((MAX_LEN-atom_embeds.shape[0], ), True)))
    atom_embeds = torch.cat((atom_embeds, pads))
    
    return (atom_embeds, padding_mask)
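
The bookkeeping in `simple_preprocessing` can be checked without tensors. The plain-Python sketch below mirrors only the shapes: it assumes `pos` holds three coordinates per atom and uses `MAX_LEN = 100`, the value the notebook sets in a later cell. The helper names are hypothetical.

```python
NUM_TAGS = 3        # one-hot width used for `tags`
NUM_ELEMENTS = 100  # one-hot width used for `atomic_numbers`
POS_DIM = 3         # assumption: x, y, z coordinates per atom
MAX_LEN = 100       # same value the notebook sets later


def feature_width():
    """Width of one atom row after concatenating tags, atomic numbers and positions."""
    return NUM_TAGS + NUM_ELEMENTS + POS_DIM


def padding_mask(n_atoms, max_len=MAX_LEN):
    """False marks a real atom, True marks a padded row."""
    return [False] * n_atoms + [True] * (max_len - n_atoms)


mask = padding_mask(n_atoms=4)
print(feature_width())       # 106
print(len(mask), sum(mask))  # 100 96
```

The width of 106 matches the default `dim_atom=106` of the model defined below.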

In [3]:
# dataset that can return an element and its own length
class Dataset(Dataset):

    def __init__(self, data, features_fields, target_field, type_='train', preprocessing=simple_preprocessing):
        
        self.data = data[features_fields]
        self.length = len(data)
        self.target = torch.Tensor(data[target_field].values)
        self.type_ = type_
        self.preprocessing = preprocessing
        
        for feature in features_fields:
             self.data[feature] = self.data[feature].apply(lambda x: x[:MAX_LEN])

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        
        system = self.preprocessing(self.data.iloc[index])
        
        if self.type_ == 'train':
            y = self.target[index]
            
            return system, y

        # for any other type_ (e.g. inference) return the features only
        return system

In [4]:
# the neural network itself
class NN(nn.Module):
    
    def __init__(self, dim_atom=106):
        
        super().__init__() 
                
        self.encoder_layer = nn.TransformerEncoderLayer(d_model=dim_atom, nhead=1)
        
        self.transformer_encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=3)
        
        self.lin = torch.nn.Linear(dim_atom, 1, bias=True)
        
    def forward(self, batch):
        
        padded, src_key_padding_mask = batch[0], batch[1]
                                
        padded = padded.permute((1, 0, 2))

        embeds = self.transformer_encoder(padded, src_key_padding_mask=src_key_padding_mask)
                
        embeds = embeds.permute((1, 0, 2))
        
        summed = torch.sum(embeds, 1)
                
        energy = self.lin(summed)
        
        return energy

In [5]:
def send_scalars(lr, loss, writer, step=-1, epoch=-1, type_='train'):
    if type_ == 'train':
        writer.add_scalar('lr per step on train', lr, step) 
        writer.add_scalar('loss per step on train', loss, step)
    if type_ == 'val':
        writer.add_scalar('loss per epoch on val', loss, epoch)

In [6]:
def send_hist(model, writer, step):
    for name, weight in model.named_parameters():
        try:
            writer.add_histogram(name, weight, step)
        except:
            pass

In [7]:
# train: iterate over batches from the iterator, zero the gradients, predict y, compute the loss, compute the gradients, take an optimizer step, record the loss
def train(model, iterator, optimizer, criterion, print_every=10, epoch=0, writer=None):
    
    epoch_loss = 0
    
    model.train()
    
    for i, (systems, ys) in enumerate(iterator):
        
        optimizer.zero_grad()
        predictions = model(systems).squeeze()
        
        loss = criterion(predictions.float(), ys.to(device).float())
        loss.backward()     
        
        optimizer.step()
        
        batch_loss = loss.item() 
        epoch_loss += batch_loss  
        
        if writer is not None:
            
            lr = optimizer.param_groups[0]['lr']
            
            # global step, so that logs from different epochs do not overlap
            step = i + epoch*len(iterator)
            
            send_hist(model, writer, step)
            send_scalars(lr, batch_loss, writer, step=step, epoch=epoch, type_='train')
        
        if not (i+1) % print_every:
            print(f'step {i} from {len(iterator)} at epoch {epoch}')
            print(f'Loss: {batch_loss}')
        
    return epoch_loss / len(iterator)
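
The step index that `train` computes as `i + epoch*len(iterator)` gives every batch a unique position on the TensorBoard x-axis across epochs. A small sketch of that arithmetic, with a hypothetical helper name and the batch count of 157 taken from the training log of this notebook:

```python
def global_step(batch_index, epoch, batches_per_epoch):
    """Position of a batch on a single axis shared by all epochs."""
    return batch_index + epoch * batches_per_epoch


batches_per_epoch = 157  # value shown in the training log of this notebook
print(global_step(0, 0, batches_per_epoch))    # 0
print(global_step(156, 0, batches_per_epoch))  # 156
print(global_step(0, 1, batches_per_epoch))    # 157
```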

In [8]:
def evaluate(model, iterator, criterion, epoch=0, writer=None):
    
    epoch_loss = 0
    
#    model.train(False)
    model.eval()  
    
    with torch.no_grad():
        for systems, ys in iterator:   

            predictions = model(systems).squeeze()
            loss = criterion(predictions.float(), ys.to(device).float())        

            epoch_loss += loss.item()
            
    overall_loss = epoch_loss / len(iterator)

    if writer is not None:
        send_scalars(None, overall_loss, writer, step=None, epoch=epoch, type_='val')
                
    print(f'epoch loss {overall_loss}')
            
    return overall_loss

In [9]:
def inferens(model, iterator):
    y = torch.tensor([])

#    model.train(False)
    model.eval()  
    
    with torch.no_grad():
        for systems in iterator:   
          predictions = model(systems).squeeze()
          y = torch.cat((y, predictions))
      
    return y

## DATA

In [10]:
def read_df(filename):    

    with open(filename, 'rb') as f:
        data_ori = pickle.load(f)
    
    # merge the new features with the features from Data
    for system in data_ori:
        for key in system['data']:
            system[key[0]] = key[1]
        del system['data']
        
    df = pd.DataFrame(data_ori)
    data_ori=[]
    print(df.columns)
    
    return df
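
The flattening loop in `read_df` moves every entry of `system['data']` to the top level of the record, so that each feature becomes its own DataFrame column. The sketch below illustrates the idea on a made-up record, modelling `data` as a list of `(name, value)` pairs; that structure is an assumption based on how the loop indexes `key[0]` and `key[1]`, and `flatten_system` is a hypothetical name.

```python
def flatten_system(system):
    """Copy each (name, value) pair from system['data'] to the top level."""
    flat = {k: v for k, v in system.items() if k != "data"}
    for name, value in system["data"]:
        flat[name] = value
    return flat


# Hypothetical record, for illustration only.
record = {"id": 7, "data": [("natoms", 3), ("tags", [0, 1, 2])]}
print(flatten_system(record))
# {'id': 7, 'natoms': 3, 'tags': [0, 1, 2]}
```

Unlike the loop in `read_df`, this version returns a new dictionary instead of modifying the record in place.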

In [11]:
# for colab
# train_dataset_file_path = "/content/drive/MyDrive/ocp_datasets/data/is2re/10k/train/structures_train.pkl"

# user specific folder
# train_dataset_file_path = os.path.expanduser("~/Downloads/structures_train.pkl")
train_dataset_file_path= "../../ocp_datasets/data/is2re/10k/train/structures_train.pkl"

val_dataset_file_path = os.path.expanduser("../../ocp_datasets/data/is2re/all/val_ood_both/structures_val_ood_both.pkl")
# val_dataset_file_path = os.path.expanduser("~/Downloads/structures_train.pkl")

In [12]:
df_train = read_df(train_dataset_file_path)
df_val = read_df(val_dataset_file_path)

Index(['id', 'voronoi_volumes', 'voronoi_surface_areas',
       'spherical_domain_radii', 'distances_new', 'contact_solid_angles',
       'direct_neighbor', 'edge_index_new', 'atomic_numbers', 'cell',
       'cell_offsets', 'distances', 'edge_index', 'fixed', 'force', 'natoms',
       'pos', 'pos_relaxed', 'sid', 'tags', 'y_init', 'y_relaxed'],
      dtype='object')
Index(['id', 'voronoi_volumes', 'voronoi_surface_areas',
       'spherical_domain_radii', 'distances_new', 'contact_solid_angles',
       'direct_neighbor', 'edge_index_new', 'atomic_numbers', 'cell',
       'cell_offsets', 'distances', 'edge_index', 'fixed', 'force', 'natoms',
       'pos', 'pos_relaxed', 'sid', 'tags', 'y_init', 'y_relaxed'],
      dtype='object')


In [13]:
batch_size = 64
num_workers = 0
MAX_LEN = 100
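
With `batch_size = 64`, the number of batches per epoch follows from the dataset size. The sketch below assumes that the 10k training split holds 10,000 systems and that the `DataLoader` keeps the last, smaller batch (its default behaviour); the helper name is hypothetical.

```python
import math


def batches_per_epoch(n_samples, batch_size):
    """Number of batches when the last, incomplete batch is kept."""
    return math.ceil(n_samples / batch_size)


print(batches_per_epoch(10_000, 64))  # 157
```

Under these assumptions the result agrees with the "from 157" shown in the training log.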

In [14]:
# features_cols = ['atomic_numbers', 'edge_index_new', 'distances_new', 
#                  'contact_solid_angles', 'tags', 'voronoi_volumes', 'spherical_domain_radii']

features_cols = ['pos', 'atomic_numbers', 'tags']

target_col = 'y_relaxed'

In [15]:
# initialize the training dataset and the training iterator
training_set = Dataset(df_train, features_cols, target_col)
training_generator = DataLoader(training_set, batch_size=batch_size, num_workers=num_workers)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.data[feature] = self.data[feature].apply(lambda x: x[:MAX_LEN])


In [16]:
# initialize the validation dataset and the validation iterator
valid_set = Dataset(df_val, features_cols, target_col)
valid_generator = DataLoader(valid_set, batch_size=batch_size, num_workers=num_workers)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.data[feature] = self.data[feature].apply(lambda x: x[:MAX_LEN])


In [17]:
df_train = []
df_val = []

## MODEL

In [18]:
# so that tensors are created on CUDA by default
if torch.cuda.is_available():
    torch.set_default_tensor_type('torch.cuda.FloatTensor')
    print('cuda')

cuda


In [19]:
#set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  
print(device)

cuda


In [20]:
#model
model = NN(dim_atom=next(iter(training_generator))[0][0].shape[2])

#optimizer and loss
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.L1Loss()

# move to CUDA if it is available
model = model.to(device)
criterion = criterion.to(device)
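
`nn.L1Loss` with its default mean reduction is the mean absolute error between predictions and targets. A plain-Python sketch of the same quantity, on made-up numbers and with a hypothetical function name:

```python
def mean_absolute_error(predictions, targets):
    """Average of |prediction - target| over all samples."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


# Made-up values, for illustration only.
print(mean_absolute_error([1.0, 2.0, 3.0], [0.0, 4.0, 3.0]))  # 1.0
```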

In [21]:
timestamp = str(datetime.now().strftime("%Y-%m-%d-%H-%M-%S"))

print(timestamp)

2021-09-06-18-42-16


In [22]:
# TensorBoard writer; on the first run the log folder has to be created by hand

# server
#log_folder_path = "../../ocp_results/logs/tensorboard/out_base_model"

# colab
# log_folder_path = "/content/drive/MyDrive/ocp_results/logs/tensorboard/out_base_model"

# user_specific 
log_file_path = "../logs/tensorboard_airi"

writer = SummaryWriter(log_file_path + '/' + timestamp)
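
If creating the log folder by hand is inconvenient, it can be created from code before the writer is opened. This is a minimal sketch with a hypothetical helper, `make_run_dir`, that reuses the timestamp format from the cell above; the demo writes into a temporary directory rather than the real log path.

```python
import os
import tempfile
from datetime import datetime


def make_run_dir(log_root):
    """Create <log_root>/<timestamp> if needed and return its path (hypothetical helper)."""
    stamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
    run_dir = os.path.join(log_root, stamp)
    os.makedirs(run_dir, exist_ok=True)
    return run_dir


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = make_run_dir(tmp)
    print(os.path.isdir(run_dir))  # True
```

In the notebook, the returned path could be passed to `SummaryWriter` in place of the concatenated string.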

In [23]:
# model graph
trace_system = next(iter(training_generator))[0]
writer.add_graph(model, (trace_system,))

## Training

In [None]:
%%time
loss = []
loss_eval = []
epochs = 50
print(timestamp)
#print(f'Start training model {str(model)}')
for i in range(epochs):
    print(f'epoch {i}')
    loss.append(train(model, training_generator, optimizer, criterion, epoch=i, writer=writer))
    loss_eval.append(evaluate(model, valid_generator, criterion, epoch=i, writer=writer))

2021-09-06-18-42-16
epoch 0
step 9 from 157 at epoch 0
Loss: 60.193321228027344
step 19 from 157 at epoch 0
Loss: 14.562389373779297
step 29 from 157 at epoch 0
Loss: 5.056543350219727
step 39 from 157 at epoch 0
Loss: 6.576595783233643
step 49 from 157 at epoch 0
Loss: 1.9704363346099854
step 59 from 157 at epoch 0
Loss: 2.0375914573669434
step 69 from 157 at epoch 0
Loss: 1.682887077331543
step 79 from 157 at epoch 0
Loss: 1.7606043815612793
step 89 from 157 at epoch 0
Loss: 2.0335230827331543
step 99 from 157 at epoch 0
Loss: 2.151097297668457
step 109 from 157 at epoch 0
Loss: 1.8127589225769043
step 119 from 157 at epoch 0
Loss: 1.8522560596466064
step 129 from 157 at epoch 0
Loss: 2.257084369659424
step 139 from 157 at epoch 0
Loss: 3.031787872314453
step 149 from 157 at epoch 0
Loss: 1.9613006114959717
epoch loss 2.232994241177883
epoch 1
step 9 from 157 at epoch 1
Loss: 1.8941553831100464
step 19 from 157 at epoch 1
Loss: 1.876897931098938
step 29 from 157 at epoch 1
Loss: 1.72

In [None]:
writer.close()

In [None]:
loss_eval
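
Once training has finished, the list of per-epoch validation losses can be reduced to the best epoch. The sketch below uses a hypothetical helper and made-up loss values, not results from this run.

```python
def best_epoch(losses):
    """Return (epoch index, loss) for the smallest loss in the list."""
    if not losses:
        raise ValueError("losses is empty")
    idx = min(range(len(losses)), key=losses.__getitem__)
    return idx, losses[idx]


# Made-up values, for illustration only.
example = [3.0, 1.5, 2.0]
print(best_epoch(example))  # (1, 1.5)
```

In the notebook, the call would be `best_epoch(loss_eval)`.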