## Prerequisites

### Google Colab

This notebook can be run in Colab (**this is the fastest way to run the calculation on an unconfigured system**):

In Google Colab (https://colab.research.google.com/) go to File | Open notebook | GitHub,
insert the path to the current notebook and open it: https://github.com/alexnkorovin/ocp-airi/blob/dev/airi_utils/our_base_model.ipynb

Before you start:

1. Put the shared folder with the datasets in your Google Drive root folder /drive/MyDrive/

The folder is available via the **sharing** link below:

*   ocp_datasets [[share link to drive](https://drive.google.com/drive/folders/1Nn9t-zTJiRP1-34rdAugv6aY_2-BSQfN?usp=sharing)]<br>

```
Note:
if the folder was saved via the sharing link, it should contain the following files

ocp-datasets/data/is2re/train/all/val_ood_both/data.lmdb
ocp-datasets/data/is2re/train/all/test_ood_both/data.lmdb
ocp-datasets/data/is2re/train/all/test_ood_both/structures.pkl

 ```
2. Enable GPU support in Edit/Notebook Settings
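
Whether the shared folder landed in the right place can be checked before running anything else. The snippet below is a minimal sketch: the relative paths are copied verbatim from the note above, `missing_files` is a hypothetical helper, and `/content/drive/MyDrive` is the assumed mount point in Colab.

```python
from pathlib import Path

# Relative paths copied verbatim from the note above
# (adjust the top-level folder name if yours differs)
EXPECTED_FILES = [
    "ocp-datasets/data/is2re/train/all/val_ood_both/data.lmdb",
    "ocp-datasets/data/is2re/train/all/test_ood_both/data.lmdb",
    "ocp-datasets/data/is2re/train/all/test_ood_both/structures.pkl",
]


def missing_files(root, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


# Assumed mount point; an empty list means every expected file was found
print(missing_files("/content/drive/MyDrive"))
```
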

### On a local PC

Download the specified data files via this [link](https://drive.google.com/drive/folders/1Nn9t-zTJiRP1-34rdAugv6aY_2-BSQfN?usp=sharing) into a local folder.


### Use the cell below to mount your Google Drive with the dataset
 - follow the link
 - log in to your Google account
 - copy the token key
 - paste it into the input line in this notebook

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Environment installation

### On a local PC
```
$ conda install pytorch-geometric -c rusty1s -c conda-forge
```
or via pip wheels:

```
$ python -c "import torch; print(torch.__version__)"
>>> 1.9.0 -> {TORCH}=1.9.0
$ python -c "import torch; print(torch.version.cuda)"
>>> 11.1 -> {CUDA}=cu111
```

Substitute {TORCH} and {CUDA} in the commands below with the values appropriate for your system:
```
pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-{TORCH}+{CUDA}.html
pip install torch-geometric
```
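
The substitution can also be scripted. The helper below is a minimal sketch that follows the URL pattern of the commands above; `wheel_index_url` is a hypothetical name, and the `cpu` tag used for CPU-only builds is an assumption. Its two arguments are the strings printed by the `python -c` commands.

```python
def wheel_index_url(torch_version, cuda_version):
    """Build the wheel index URL from torch.__version__ and torch.version.cuda."""
    # torch.__version__ may carry a local suffix such as "1.9.0+cu111"; keep the base part
    base = torch_version.split("+")[0]
    # torch.version.cuda is None on CPU-only builds (the "cpu" tag is an assumption)
    tag = "cpu" if cuda_version is None else "cu" + cuda_version.replace(".", "")
    return f"https://pytorch-geometric.com/whl/torch-{base}+{tag}.html"


print(wheel_index_url("1.9.0", "11.1"))
# https://pytorch-geometric.com/whl/torch-1.9.0+cu111.html
```
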

#### On Colab and also on a local PC (locally, the conda way is preferable)

In [None]:
# This might take about 10 min in Colab
# The wheel URL must match the torch and CUDA versions installed in the runtime
!pip install -q torch-scatter -f https://pytorch-geometric.com/whl/torch-1.4.0+cu101.html
!pip install -q torch-sparse -f https://pytorch-geometric.com/whl/torch-1.4.0+cu101.html
!pip install -q torch-geometric

## Utils and definitions

In [None]:
def my_reshape(tensor):
    return torch.reshape(tensor, (tensor.shape[0], 1))

In [None]:
# build the matrix of atom vectors, the edge list (edge_index) and the matrix of edge vectors from the data
def simple_preprocessing(batch):
    #spherical_radii = torch.Tensor(batch['spherical_domain_radii'])
    #spherical_radii = my_reshape(spherical_radii)
    
    tags = batch['tags'].long()
    tags = F.one_hot(tags, num_classes=3)
    
    atom_numbers = batch['atomic_numbers'].long()
    atom_numbers = F.one_hot(atom_numbers, num_classes=100)
    
    voronoi_volumes = torch.Tensor(batch['voronoi_volumes'])
    voronoi_volumes = my_reshape(voronoi_volumes)
    
    atom_features = (tags, atom_numbers, voronoi_volumes)#, spherical_radii)
    atom_embeds = torch.cat(atom_features, 1)
    
    edge_index = torch.Tensor(batch['edge_index_new']).long()
    
    distances = torch.Tensor(batch['distances_new'])
    distances = my_reshape(distances)
    
    angles = torch.Tensor(batch['contact_solid_angles'])
    angles = my_reshape(angles)
    
    edges_embeds = torch.cat((distances, angles), 1)
    
    
    return Data(x=atom_embeds.to(device), edge_index=edge_index.to(device), edge_attr=edges_embeds.to(device))
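
The layout of a single atom vector built by `simple_preprocessing` can be sketched without torch. This is an illustration only: `one_hot` and `atom_vector` are hypothetical helpers and the input values are made up. The width is 3 + 100 + 1 = 104, which is why the model later reads `dim_atom` from the data instead of relying on a default.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position index."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec


def atom_vector(tag, atomic_number, voronoi_volume):
    # same order as in simple_preprocessing: tags, atomic numbers, Voronoi volume
    return one_hot(tag, 3) + one_hot(atomic_number, 100) + [voronoi_volume]


vec = atom_vector(tag=1, atomic_number=8, voronoi_volume=12.5)
print(len(vec))   # 104
```
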

In [None]:
# dataset that can return an element and its own length
class Dataset(Dataset):

    def __init__(self, data, features_fields, target_field, type_='train', preprocessing=simple_preprocessing):
        
        self.data = data[features_fields]
        self.length = len(data)
        self.target = torch.Tensor(data[target_field].values)
        self.type_ = type_
        self.preprocessing = preprocessing

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        
        system = self.preprocessing(self.data.iloc[index])
        
        if self.type_ == 'train':
            y = self.target[index]
            
            return system, y

        # without a target (e.g. test data) return only the system instead of None
        return system

$$
\mathbf{x}_i^{(k)} = \gamma^{(k)} \left( \mathbf{x}_i^{(k-1)}, \square_{j \in \mathcal{N}(i)} \, \phi^{(k)}\left(\mathbf{x}_i^{(k-1)}, \mathbf{x}_j^{(k-1)},\mathbf{e}_{j,i}\right) \right)
$$

$\gamma$ is implemented in `update`, $\square$ in `aggr`, and $\phi$ in `message`; in this example $\gamma$ and $\phi$ are a matrix multiplication applied after concatenation, and $\square$ is summation.
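
The update rule above can be sketched without any dependencies on a toy graph with scalar node features. `phi`, `gamma` and `message_passing_step` are illustrative stand-ins, not the layer defined in this notebook, and the aggregation $\square$ is a sum. Each edge `(j, i)` sends a message from source `j` to target `i`.

```python
def phi(x_i, x_j, e_ji):
    # message: a toy function of the target, source and edge features
    return x_j + e_ji


def gamma(x_i, aggregated):
    # update: combine the node's own feature with the aggregated messages
    return x_i + aggregated


def message_passing_step(x, edges, edge_attr):
    """x: node features; edges: list of (j, i) pairs; edge_attr: one value per edge."""
    aggregated = [0.0] * len(x)
    for (j, i), e in zip(edges, edge_attr):
        aggregated[i] += phi(x[i], x[j], e)   # sum over the neighbours j of node i
    return [gamma(x_i, a) for x_i, a in zip(x, aggregated)]


x = [1.0, 2.0, 3.0]
edges = [(0, 1), (2, 1), (1, 0)]   # messages 0 -> 1, 2 -> 1, 1 -> 0
edge_attr = [0.5, 0.5, 1.0]
print(message_passing_step(x, edges, edge_attr))   # [4.0, 7.0, 3.0]
```

Node 2 receives no messages, so its feature passes through `gamma` unchanged.
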

In [None]:
class GConv(MessagePassing):
    def __init__(self, dim_atom=103, dim_edge=2, out_channels=2):
        super(GConv, self).__init__(aggr='add')  # "Add" aggregation
        self.phi_output = 3
        self.lin_phi = torch.nn.Linear(dim_atom*2+dim_edge, self.phi_output, bias=False)
        self.lin_gamma = torch.nn.Linear(dim_atom + self.phi_output, out_channels, bias=False)
        self.nonlin = nn.Sigmoid()

    def forward(self, batch):
        x = batch['x']
        edge_index = batch['edge_index']
        edge_attr = batch['edge_attr']
        
        # x has shape [N, in_channels]: N is the number of atoms in the system (batch), in_channels is the size of an atom vector
        # edge_index has shape [2, E]: each edge is given by a pair of nodes

        # Start propagating messages. 
    
        return self.propagate(edge_index, x=x, edge_attr=edge_attr, size=None)  # size=None: the size is inferred automatically and the adjacency is assumed to be square

    def message(self, x, x_i, x_j, edge_attr):
        concatenated = torch.cat((x_i, x_j, edge_attr), 1)
        phi = self.lin_phi(concatenated)
        return self.nonlin(phi)
        
    def update(self, aggr_out, x, edge_attr, edge_index):
                
        concatenated = torch.cat((x, aggr_out), 1)

        return Data(x=self.nonlin(self.lin_gamma(concatenated)), edge_attr=edge_attr, edge_index=edge_index)

In [None]:
# the neural network itself
class ConvNN(nn.Module):
    
    def __init__(self, dim_atom=103, dim_edge=2):
        
        super().__init__()          
        self.conv_1 = GConv(dim_atom=dim_atom, dim_edge=dim_edge, out_channels=dim_atom)
        self.conv_2 = GConv(dim_atom=dim_atom, dim_edge=dim_edge, out_channels=dim_atom)
        self.conv_3 = GConv(dim_atom=dim_atom, dim_edge=dim_edge, out_channels=dim_atom)
        self.conv_4 = GConv(dim_atom=dim_atom, dim_edge=dim_edge, out_channels=dim_atom)
        self.conv_5 = GConv(dim_atom=dim_atom, dim_edge=dim_edge, out_channels=dim_atom)
        self.conv_last = GConv(dim_atom=dim_atom, dim_edge=dim_edge, out_channels=2)
        
        self.lin = torch.nn.Linear(2, 1, bias=True)
        
    def forward(self, batch):
        convoluted_1 = self.conv_1(batch)
        convoluted_2 = self.conv_2(convoluted_1)
        convoluted_3 = self.conv_3(convoluted_2)
        convoluted_4 = self.conv_4(convoluted_3)
        convoluted_5 = self.conv_5(convoluted_4)
        convoluted_last = self.conv_last(convoluted_5)['x']
        scattered = scatter(convoluted_last, batch['batch'], dim=0, reduce='sum')
        summed = scattered
        energy = self.lin(summed)
        
        return energy
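
The readout in `forward` sums the per-atom outputs of each system before the final linear layer. The pure-Python sketch below illustrates that sum-by-index step; `scatter_sum` is a hypothetical stand-in for `scatter(..., dim=0, reduce='sum')`, and the numbers are made up.

```python
def scatter_sum(values, index, num_groups=None):
    """Sum the rows of values that share the same entry in index."""
    if num_groups is None:
        num_groups = max(index) + 1
    width = len(values[0])
    out = [[0.0] * width for _ in range(num_groups)]
    for row, group in zip(values, index):
        for k, v in enumerate(row):
            out[group][k] += v
    return out


# four atoms with two features each; atoms 0-1 belong to system 0, atoms 2-3 to system 1
atom_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
batch_index = [0, 0, 1, 1]
print(scatter_sum(atom_features, batch_index))   # [[4.0, 6.0], [12.0, 14.0]]
```
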

In [None]:
def send_scalars(model, writer):
    pass

In [None]:
def send_hist(model, writer, step):
    for name, weight in model.named_parameters():
        writer.add_histogram(name, weight, step)

In [None]:
# train: iterate over the batches from the iterator, zero the gradients, predict y, compute the loss, compute the gradients, take an optimizer step, record the loss
def train(model, iterator, optimizer, criterion, print_every=10, writer=None):
    
    epoch_loss = 0
    
    model.train()

    for i, (systems, ys) in enumerate(iterator):
        
        optimizer.zero_grad()
        predictions = model(systems).squeeze()
        loss = criterion(predictions.float(), ys.to(device).float())
        loss.backward()     
        
        optimizer.step()      
        
        epoch_loss += loss.item()  
        
        if writer is not None:
            send_hist(model, writer, i)
            send_scalars(model, writer)
        
        if not (i+1) % print_every:
            # i is zero-based, so i+1 batches have been processed so far
            print(f'step {i+1} of {len(iterator)}')
            print(f'Loss: {epoch_loss/(i+1)}')
        
    return epoch_loss / len(iterator)

In [None]:
def evaluate(model, iterator, criterion, writer=None):
    
    epoch_loss = 0
    
#    model.train(False)
    model.eval()  
    
    with torch.no_grad():
        for systems, ys in iterator:   

            predictions = model(systems).squeeze()
            loss = criterion(predictions.float(), ys.to(device).float())        

            epoch_loss += loss.item()
            
            # log only when a writer was passed; use the model argument, not a global
            if writer is not None:
                send_scalars(model, writer)
                
    print(f'epoch loss {epoch_loss / len(iterator)}')
            
    return epoch_loss / len(iterator)

In [None]:
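def inference(model, iterator):
    # collect the predictions of every batch and concatenate them at the end
    outputs = []

#    model.train(False)
    model.eval()  
    
    with torch.no_grad():
        for systems, ys in iterator:   
            # squeeze only the last dimension so that a batch of size 1 stays 1-D
            predictions = model(systems).squeeze(-1)
            outputs.append(predictions.cpu())
      
    return torch.cat(outputs)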

In [None]:
# Run this cell before the definitions above: the classes there need these imports
import pickle

import numpy as np
import pandas as pd
import torch

import torch.nn.functional as F
import torch_geometric.nn as pyg_nn
import torch_geometric.utils as pyg_utils
import torch.optim as optim

from datetime import datetime
from sklearn.model_selection import train_test_split
from torch import nn
from torch_geometric.data import Data, Dataset, DataLoader
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import add_self_loops
from torch_scatter import scatter
from torch.utils.tensorboard import SummaryWriter

## DATA

In [None]:
# %%time
# for colab
# dataset_file_path = "/content/drive/MyDrive/ocp_datasets/data/is2re/10k/train/structures_train.pkl"

import os

# user specific folder
dataset_file_path = "~/Downloads/structures_train.pkl"
# dataset_file_path = "../../ocp_datasets/data/is2re/10k/train/structures_train.pkl"

# open() does not expand "~", so expand it explicitly
with open(os.path.expanduser(dataset_file_path), 'rb') as f:
    data_ori = pickle.load(f)

In [None]:
%%time
# merge the new features with the features from Data
for system in data_ori:
    # iterating over the stored Data object yields (name, value) pairs
    for name, value in system['data']:
        system[name] = value
    del system['data']
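
The merge step can be illustrated on plain Python objects. This is a sketch under one assumption: the object stored under `'data'` yields `(name, value)` pairs when iterated, as the loop above relies on; here a list of tuples stands in for it. `flatten_system` is a hypothetical helper and the values are made up.

```python
def flatten_system(system, nested_key="data"):
    """Return a copy of system with the (name, value) pairs under nested_key moved to the top level."""
    flat = {k: v for k, v in system.items() if k != nested_key}
    for name, value in system[nested_key]:
        flat[name] = value
    return flat


system = {
    "voronoi_volumes": [1.0, 2.0],
    "data": [("tags", [0, 1]), ("y_relaxed", -0.5)],   # stand-in for the nested object
}
print(sorted(flatten_system(system)))   # ['tags', 'voronoi_volumes', 'y_relaxed']
```

Unlike the in-place loop, this version returns a new dictionary and leaves the input untouched.
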

In [None]:
%%time
df = pd.DataFrame(data_ori)

In [None]:
data_ori = []  # drop the raw list to free memory

In [None]:
df

In [None]:
df.columns

In [None]:
# split into training and validation sets
df_train, df_val = train_test_split(df, test_size=0.15)
df = []

In [None]:
# reset the indices
df_train = df_train.reset_index()
df_val = df_val.reset_index()

In [None]:
batch_size = 64
num_workers = 0

In [None]:
# features_cols = ['voronoi_volumes', 'voronoi_surface_areas', 'electronegativity', 
#                  'dipole_polarizability', 'edge_index_new', 'distances_new', 'contact_solid_angles']

features_cols = ['atomic_numbers', 'edge_index_new', 'distances_new', 
                 'contact_solid_angles', 'tags', 'voronoi_volumes', 'spherical_domain_radii']
target_col = 'y_relaxed'

In [None]:
# initialize the training dataset and the training iterator
training_set = Dataset(df_train, features_cols, target_col)
training_generator = DataLoader(training_set, batch_size=batch_size, num_workers=num_workers)

In [None]:
# initialize the validation dataset and the validation iterator
valid_set = Dataset(df_val, features_cols, target_col)
valid_generator = DataLoader(valid_set, batch_size=batch_size, num_workers=num_workers)

In [None]:
df_train = []
df_val = []

## MODEL

In [None]:
# make new tensors be created on CUDA by default
if torch.cuda.is_available():
    torch.set_default_tensor_type('torch.cuda.FloatTensor')
    print('cuda')

In [None]:
#set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  
print(device)

In [None]:
#model
model = ConvNN(dim_atom=training_set[0][0].x.shape[1],dim_edge=training_set[0][0].edge_attr.shape[1])

#optimizer and loss
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.L1Loss()

# move to CUDA if it is available
model = model.to(device)
criterion = criterion.to(device)
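
The criterion chosen above, `nn.L1Loss()`, averages the absolute errors over the batch by default. The pure-Python sketch below shows the same quantity on made-up numbers; `l1_loss` is a hypothetical helper, not the torch implementation.

```python
def l1_loss(predictions, targets):
    """Mean absolute error between two equally long sequences."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(predictions)


# absolute errors are 0.5, 0.0 and 2.0, so the mean is 2.5 / 3
print(l1_loss([1.0, 2.0, 4.0], [1.5, 2.0, 2.0]))
```
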

In [None]:
optimizer.state_dict()

In [None]:
dir(optimizer)

In [None]:
timestamp = str(datetime.now()).split('.')[0]

In [None]:
# tensorboard writer; on the first run the log folder has to be created by hand

# server
# log_folder_path = "../../ocp_results/logs/tensorboard/out_base_model"

# colab
# log_folder_path = "/content/drive/MyDrive/ocp_results/logs/tensorboard/out_base_model"

# user specific
log_folder_path = "tensorboard_logs"

writer = SummaryWriter(log_folder_path + '/' + timestamp)
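
Creating the log folder can also be scripted instead of done by hand. The helper below is a minimal sketch using only the standard library; `make_run_dir` is a hypothetical name, and the timestamp format is an assumption chosen to avoid spaces and colons in folder names.

```python
import os
import tempfile
from datetime import datetime


def make_run_dir(log_root, now=None):
    """Create log_root/<timestamp> if it does not exist and return its path."""
    now = now or datetime.now()
    # no spaces or colons, which are awkward or invalid on some file systems
    run_name = now.strftime("%Y-%m-%d_%H-%M-%S")
    run_dir = os.path.join(log_root, run_name)
    os.makedirs(run_dir, exist_ok=True)
    return run_dir


# demo in a temporary folder so nothing is left behind
with tempfile.TemporaryDirectory() as root:
    run_dir = make_run_dir(root, datetime(2021, 7, 1, 12, 30, 0))
    print(os.path.basename(run_dir))   # 2021-07-01_12-30-00
```

The returned path could then be passed to `SummaryWriter` in place of the concatenated string.
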

In [None]:
# model graph
trace_system = dict(list(next(iter(training_generator))[0]))
writer.add_graph(model, trace_system)

## Training

In [None]:
%%time
loss = []
loss_eval = []
epochs = 10
print(timestamp)
print(f'Start training model {str(model)}')
for i in range(epochs):
    print(f'epoch {i}')
    loss.append(train(model, training_generator, optimizer, criterion, writer=writer))
    loss_eval.append(evaluate(model, valid_generator, criterion, writer=writer))

In [None]:
writer.close()

In [None]:
loss_eval