Table of contents:

[Repository and batch information](#repository-and-batch-information)

[Batch content overview](#batch-content-overview)

[Initial parameters of the simulation](#initial-parameters-of-the-simulation)
  - [Initial parameters overview](#initial-parameters-overview)
  - [Query json-object data](#query-json-object-data)

[Simulated data](#simulated-data)
  - [Topic overview](#topic-overview)
  - [Query json-array data](#query-json-array-data)
  - [Query and rename columns](#query-and-rename-columns)
  - [Data query constraints](#data-query-constraints)
  - [Sampling methods](#sampling-methods)
  - [Separate tables by sid](#separate-tables-by-sid)
  - [Maximum and minimum values](#maximum-and-minimum-values)
  - [Unique values](#unique-values)
  - [Number of unique values](#number-of-unique-values)

[Plots](#plots)
  - [Quick plots](#quick-plots)
    - [Quick x vs. y plot](#quick-x-vs-y-plot)
    - [Time plot](#time-plot)
    - [Plot with pandas](#plot-with-pandas)

  - [Advanced plots](#advanced-plots)
    - [x vs. y](#x-vs-y)
    - [3d plot](#3d-plot)
    - [Subplots with a shared x-axis](#subplots-with-a-shared-x-axis)
    - [Grid of plots](#grid-of-plots)
    - [Ellipse error plot](#ellipse-error-plot)


Import the data_access module, which allows you to retrieve general information about the repositories and batches, as well as to query and visualize data:

In [3]:
from citros_data_analysis import data_access as da

Show the version of the package:

In [4]:
#show the version:
import citros_data_analysis
print(citros_data_analysis.__version__)

Create a CitrosDB object to get access to the database. The repository is set automatically to the current repository:

In [5]:
citros = da.CitrosDB()

Both the batch and the repository can also be set manually:

In [None]:
citros_gal = da.CitrosDB(repo = 'gal_orbits', batch = 'galactic orbits_4')

## Repository and batch information

By default, the repository is set to the current repository.

Show all repositories that were created by me:

In [5]:
citros.repo_info(user='me').print()

Show information about all repositories whose names contain the word 'orbits':

In [6]:
citros.repo_info('orbits').print()

Show information about the repository with the exact name 'gal_orbits':

In [8]:
citros.repo_info('gal_orbits', exact_match = True).print()

In the 'gal_orbits' repository, show the most recent batch I created:

In [5]:
citros.repo('gal_orbits').batch_info(-1, user = 'me').print()

Display a list of all batches that were created by me:

In [10]:
list(citros.batch_info(user='me').keys())

Show information about all batches that have 'galactic orbits' in their names:

In [11]:
citros.batch_info('galactic orbits').print()

Show information about the exact batch named 'galactic orbits':

In [None]:
citros.batch_info('galactic orbits', exact_match = True).print()

Look for batches with 'test' in their names that were created in a specific simulation, 'simulation_gal_orbits':

In [4]:
citros.simulation('simulation_gal_orbits').batch_info('test').print()

To get the sizes of all batches in the current schema, use the `get_batch_size()` method.
Each resulting row contains the name of the batch, the batch size and the total size with indexes:

In [21]:
citros.get_batch_size()

To get the size of the current batch, use `get_current_batch_size()`:

In [22]:
citros.batch('galactic orbits_1').get_current_batch_size()

## Batch content overview

To get general information about the batch content, call the `info()` method.

It returns a dictionary that contains:
   - 'size': size of the selected data
   - 'sid_count': number of sids
   - 'sid_list': list of the sids
   - 'topic_count': number of topics
   - 'topic_list': list of topics
   - 'message_count': number of messages

Show information about topics of the batch 'galactic orbits_1':

In [23]:
citros.batch('galactic orbits_1').info().print()

A specific piece of information can be accessed by its key.

For example, to get the total number of messages:

In [24]:
num = citros.batch('galactic orbits_1').info()['message_count']
print(f'Total number of messages: {num}')

## Initial parameters of the simulation

### Initial parameters overview

The '/config' topic contains information about the parameters of the simulation.

If a topic is specified, the `info()` method adds the dictionary 'topics', which has the following structure:
  - 'topics':
     - str topic name:
        - 'type': type
        - 'data_structure': structure of the data
        - 'message_count': number of messages

To show information about the topic '/config':

In [25]:
citros.batch('galactic orbits_1').topic('/config').info().print()

Under the key 'data_structure' the structure of the data is presented.

To show only the dictionary with structure and types of the parameters:

In [26]:
citros.batch('galactic orbits_1').topic('/config').info()['topics']['/config']['data_structure'].print()

### Query json-object data

As can be seen from the previous output, the parameters are stored in a dictionary under the key 'ros__parameters'.

To retrieve them as a pandas.DataFrame in which each row corresponds to a simulation's sid:

In [27]:
citros.batch('galactic orbits_1').topic('/config').data('data.gal_orbits.ros__parameters')

Query values of the specific parameter, for example 'M_disc':

In [28]:
citros.batch('galactic orbits_1').topic('/config').data('data.gal_orbits.ros__parameters.M_disc')

Or query several parameters, for example 'M_disc' and 'M_sph':

In [29]:
citros.batch('galactic orbits_1').topic('/config').data(['data.gal_orbits.ros__parameters.M_disc','data.gal_orbits.ros__parameters.M_sph'])

Get parameters for the simulation sid = 2:

In [30]:
citros.batch('galactic orbits_1').topic('/config').sid(2).data('data.gal_orbits.ros__parameters')

If the `data()` method is called without parameters, all dictionary items are automatically split into separate columns:

In [31]:
citros.batch('galactic orbits_1').topic('/config').data()
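
The dot-separated column labels resemble the way nested dictionaries are usually flattened. The following sketch reproduces that layout with plain pandas on made-up messages. The parameter values are invented, and `pd.json_normalize` serves only as an analogy here; it is not claimed to be how `data()` works internally.

```python
import pandas as pd

# Hypothetical content of the json 'data' column for two simulations (invented values).
messages = [
    {'gal_orbits': {'ros__parameters': {'M_disc': 1.0, 'M_sph': 0.5}}},
    {'gal_orbits': {'ros__parameters': {'M_disc': 2.0, 'M_sph': 0.7}}},
]

# Flatten the nested dictionaries into dot-separated column labels.
flat = pd.json_normalize(messages)

# Add the 'data.' prefix to mimic the labels used in this notebook.
flat.columns = ['data.' + label for label in flat.columns]

print(flat.columns.tolist())
print(flat)
```

Each leaf of the nested dictionary ends up in its own column, which is the same shape as the table returned above.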

## Simulated data

### Topic overview

Show information about the topic '/gal_orbits':

In [32]:
citros.batch('galactic orbits_1').topic('/gal_orbits').info().print()

If specific sids are set, the dictionary 'sids' is also added to the output, with the following structure:
  - 'sids':
     - int sid value:
        - 'topics': 
           - str topic name:
              - 'message_count': number of messages
              - 'start_time': time when simulation started
              - 'end_time': time when simulation ended
              - 'duration': duration of the simulation process
              - 'frequency': frequency of the simulation process

In [33]:
citros.batch('galactic orbits_1').topic('/gal_orbits').sid([1, 2]).info().print()

To get the number of messages in the topic for each of the sids 1 and 2:

In [34]:
inf = citros.batch('galactic orbits_1').topic('/gal_orbits').sid([1, 2]).info()
num_1 = inf['sids'][1]['topics']['/gal_orbits']['message_count']
num_2 = inf['sids'][2]['topics']['/gal_orbits']['message_count']

print(f'Number of messages in "/gal_orbits" for sid = 1: {num_1}, for sid = 2: {num_2}\n')
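
When many sids are selected, the same lookup can be written as a loop instead of one line per sid. The sketch below uses a hand-written dictionary, `inf_example`, with the nesting described above and invented counts. It assumes that the object returned by `info()` can be traversed like a regular Python dictionary.

```python
# Hypothetical stand-in for the result of info(); message counts are invented.
inf_example = {
    'sids': {
        1: {'topics': {'/gal_orbits': {'message_count': 100}}},
        2: {'topics': {'/gal_orbits': {'message_count': 120}}},
    }
}

topic_name = '/gal_orbits'

# Collect the number of messages per sid without hard-coding the sid values.
counts = {
    sid: content['topics'][topic_name]['message_count']
    for sid, content in inf_example['sids'].items()
}

for sid, num in counts.items():
    print(f'sid = {sid}: {num} messages in "{topic_name}"')
```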

To get the structure of the topic and the total number of messages for this topic:

In [35]:
inf_dyn = citros.batch('galactic orbits_1').topic('/gal_orbits').info()
num = inf_dyn['message_count']

print('Total number of messages in topic: {}'.format(num))
print('topic structure:')
inf_dyn['topics']['/gal_orbits']['data_structure'].print()

Print the table with data structure for the topic '/gal_orbits':

In [36]:
citros.batch('galactic orbits_1').topic('/gal_orbits').get_data_structure()

### Query json-array data

Data in the '/gal_orbits' topic is stored as a json array, which corresponds to a Python list.
Data may be queried with the `data()` method. The output depends on the parameters passed to `data()`. See the difference in the following examples:

Get all data from the json 'data' column by calling `data()` without parameters.

This way all json objects are split into columns. Since the 'data' column holds only one json object, under the key 'data', with a nested json array,
the output is a single column 'data.data' with a list in each row.

In [37]:
print(citros.batch('galactic orbits_1').topic('/gal_orbits').data().head(5))
# Method head(n) shows first n rows of the pandas.DataFrame.

Get all data as a dictionary in one column (download the 'data' column as it is stored in the database):

In [38]:
print(citros.batch('galactic orbits_1').topic('/gal_orbits').data('data').head(5))

Explicitly query content under the key 'data':

In [39]:
print(citros.batch('galactic orbits_1').topic('/gal_orbits').data('data.data').head())

Get the first element of the 'data' list:

In [40]:
print(citros.batch('galactic orbits_1').topic('/gal_orbits').data('data.data[0]'))

Get the first and second elements of the 'data' list:

In [41]:
print(citros.batch('galactic orbits_1').topic('/gal_orbits').data(['data.data[0]', 'data.data[1]']))
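
To make the path notation concrete, the sketch below resolves labels such as 'data.data[0]' against an ordinary nested Python object. The function `resolve` and the sample row are hypothetical; this is only an illustration of how such a label can be read and is not the parser used by the package.

```python
import re

def resolve(obj, path):
    """Follow a label such as 'data.data[0]' through nested dicts and lists."""
    for part in path.split('.'):
        # Split a part like 'data[0]' into the key 'data' and the indices [0].
        key = re.match(r'[^\[]+', part).group(0)
        indices = [int(i) for i in re.findall(r'\[(\d+)\]', part)]
        obj = obj[key]
        for i in indices:
            obj = obj[i]
    return obj

# Hypothetical row: the json 'data' column holds one object with a nested array.
row = {'data': {'data': [0.0, 8.5, -0.1]}}

print(resolve(row, 'data'))          # the whole json object
print(resolve(row, 'data.data'))     # the nested list
print(resolve(row, 'data.data[1]'))  # one element of the list
```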

### Query and rename columns

Let's query the data, split the content of the list in the 'data' column into separate columns and rename them according to the variable names
listed in Readme.md:

In [42]:
column_names = ['t', 'R', 'Vr', 'fi', 'Vfi', 'z', 'Vz', 'E', 'C', 'xg', 'yg']
query = ['data.data['+str(i)+']' for i in range(len(column_names))]

F = citros.batch('galactic orbits_1').topic('/gal_orbits').data(query)
F.rename({query[i]: column_names[i] for i in range(len(query))}, axis = 1, inplace = True)

print(F)
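
If the lists have already been downloaded as a single 'data.data' column, a similar table can be built locally with pandas. This is a minimal sketch on an invented, shortened table (three values per list instead of eleven); the names `raw`, `expanded` and `table` are introduced only for this example.

```python
import pandas as pd

# Hypothetical table: each row of 'data.data' is a list [t, R, Vr] (invented values).
raw = pd.DataFrame({
    'sid': [1, 1, 2],
    'rid': [0, 1, 0],
    'data.data': [[0.0, 8.5, 0.1], [0.1, 8.6, 0.2], [0.0, 9.0, 0.0]],
})

names = ['t', 'R', 'Vr']

# Expand every list into its own columns and keep the original row index.
expanded = pd.DataFrame(raw['data.data'].tolist(), columns=names, index=raw.index)

# Put the identifying columns back next to the expanded values.
table = pd.concat([raw[['sid', 'rid']], expanded], axis=1)
print(table)
```

Querying the elements separately, as in the cell above, avoids transferring whole lists when only a few elements are needed.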

### Data query constraints

To apply constraints on sid, rid and time, functions sid(), rid() and time() may be used.

- sid()\
To query data for specific simulation run ids, pass them in a list to sid().\
For example for sid = 1 or 2: sid([1,2])\
or specify `start`, `end`/`count` parameters:\
For example for 3 <= sid <= 7: sid(start = 3, end = 7)

- rid()\
To set specific rids, pass them in a list:\
For example for rid = 1, 2, 3: rid([1,2,3])\
To set a range of rids, specify `start`, `end`/`count`:\
For 5 <= rid <= 10: rid(start = 5, end = 10) or rid(start = 5, count = 6)

- time()\
To get a range of time, set `start`, `end`/`duration` for time (in nanoseconds).\
For 10 <= time <= 20: time(start = 10, end = 20)\
or set `duration` to exclude the end point: for 10 <= time < 20: time(start = 10, duration = 10)

- set_filter()\
To set constraints on the json-data columns, the `set_filter()` method is used.\
Constraints are provided as a dictionary {key_1: value_1, key_2: value_2, ...}, where:\
key_n - must match labels of the columns,\
value_n:
  - in the case of equality: list of exact values, like set_filter({'data.x': [1,2]})\
  - in the case of inequality: dictionary with the key '>', '>=', '<' or '<=', like set_filter({'data.x': {'>=': 5}})

- set_order()\
By default, the results are ordered by 'sid' and 'rid' in ascending order.\
To sort the result in ascending order, pass the label(s) of the column(s) as a str (or a list of str):\
set_order(['rid', 'time'])\
To specify ascending or descending order, provide a dictionary {'column_label': order}, where 'column_label' is the label of the column and 'order' is 'asc' for ascending or 'desc' for descending order.\
For example, to sort by 'rid' in ascending order and by 'sid' in descending order:\
set_order({'rid': 'asc', 'sid': 'desc'})
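
To illustrate how the `set_filter()` dictionary reads, the sketch below applies the same kind of dictionary to an in-memory pandas table. The helper `apply_filter` is hypothetical and is not part of the package; it assumes that several inequality signs may be combined in one inner dictionary and that conditions on different columns are joined with a logical AND.

```python
import operator
import pandas as pd

# Map the inequality signs used in the filter dictionary to comparison functions.
OPS = {'>': operator.gt, '>=': operator.ge, '<': operator.lt, '<=': operator.le}

def apply_filter(df, filter_by):
    """Keep the rows of df that satisfy every condition in filter_by."""
    mask = pd.Series(True, index=df.index)
    for column, condition in filter_by.items():
        if isinstance(condition, dict):
            # Inequalities: {'>=': 5, '<': 10} means 5 <= value < 10.
            for sign, bound in condition.items():
                mask &= OPS[sign](df[column], bound)
        else:
            # Equality: the value must be one of the listed exact values.
            mask &= df[column].isin(condition)
    return df[mask]

# Invented table with one json-data column.
table = pd.DataFrame({'sid': [1, 1, 2, 2], 'data.x': [1, 5, 7, 12]})

print(apply_filter(table, {'data.x': {'>=': 5, '<': 10}}))
print(apply_filter(table, {'data.x': [1, 12]}))
```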

To get the columns 'data.data[0]' and 'data.data[1]' with sid = 1 or 2, 128 <= rid <= 180 and time >= 5 ns,
where data.data[1] > 1, sorted in descending order by the column 'rid':

In [43]:
citros.batch('galactic orbits_1').topic('/gal_orbits')\
      .sid([1,2]).rid(start = 128, end = 180).time(start = 5)\
      .set_filter({'data.data[1]': {'>': 1}})\
      .set_order({'rid': 'desc'})\
      .data(['data.data[0]', 'data.data[1]'])

### Sampling methods

If the amount of data is too large, sampling functions `skip()`, `avg()` and `move_avg()` may be applied.

`skip(n)` is used to select every nth message.

Limits on sid, rid and time (set by sid(), rid() and time()) are applied before selection,
while limits on the json-data columns, set by set_filter(), are applied after selection.

Selection for each sid is performed separately.

The code below selects every 10th message (the pandas.DataFrame method `head()` is used to show the first 5 rows of the output):

In [44]:
citros.batch('galactic orbits_1').topic('/gal_orbits').skip(10).data().head(5)
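
The per-sid selection can be pictured with plain pandas. The sketch below works on an invented table and assumes that the selection starts from the first message of each sid; the actual starting point used by `skip()` is not stated here.

```python
import pandas as pd

# Invented table with two simulations of five messages each.
table = pd.DataFrame({
    'sid': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'rid': [0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
    'value': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
})

n = 2

# Number the messages inside each sid (0, 1, 2, ...) and keep every n-th one.
position = table.groupby('sid').cumcount()
every_nth = table[position % n == 0]
print(every_nth)
```

Because the counter restarts for every sid, the rows kept in one simulation do not depend on how many messages the other simulations contain.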

Instead of just skipping messages, every n messages may be averaged by `avg(n)`.

Limits on sid, rid and time (set by sid(), rid() and time()) are applied before averaging,
while limits on the json-data columns, set by set_filter(), are applied after averaging.

Averaging for each sid is performed separately.

The value of 'rid' for each averaged range is set to the minimum of the averaged rid values.

Only numeric values may be averaged, and the labels of the json-data columns with numeric content must be listed explicitly in data([]).

To average every 10 messages (the pandas.DataFrame method `head()` is used to show the first 5 rows of the output):

In [45]:
citros.batch('galactic orbits_1').topic('/gal_orbits').avg(10).data(['data.data[0]']).head(5)
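
The averaging rule described above (blocks of n messages per sid, 'rid' set to the block minimum) can be sketched with pandas on an invented table. This is an in-memory illustration under those stated rules, not the implementation of `avg()`; the column name 'block' is a helper introduced for the example.

```python
import pandas as pd

# Invented table with two simulations of four messages each.
table = pd.DataFrame({
    'sid': [1, 1, 1, 1, 2, 2, 2, 2],
    'rid': [0, 1, 2, 3, 0, 1, 2, 3],
    'value': [1.0, 3.0, 5.0, 7.0, 10.0, 20.0, 30.0, 40.0],
})

n = 2

# Consecutive blocks of n messages inside each sid: 0, 0, 1, 1, ...
table['block'] = table.groupby('sid').cumcount() // n

# Average the values of each block and keep the smallest rid of the block.
averaged = (
    table.groupby(['sid', 'block'], as_index=False)
         .agg(rid=('rid', 'min'), value=('value', 'mean'))
         .drop(columns='block')
)
print(averaged)
```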

Use move_avg(n, m) to apply a moving average over n messages and select every m-th row of the result.

Limits on sid, rid and time are applied before averaging and selection,
while limits on the json-data columns, set by set_filter(), are applied after averaging.

Averaging and selection for each sid are performed separately.

The value of 'rid' for each averaged range is set to the minimum of the averaged rid values.

Only numeric values may be averaged, and the labels of the json-data columns with numeric content
must be listed explicitly in data([]).

To apply a moving average over 10 messages and select every second row of the result (the pandas.DataFrame method `head()` is used to show the first 5 rows of the output):

In [46]:
citros.batch('galactic orbits_1').topic('/gal_orbits').move_avg(10,2).data(['data.data[0]']).head(5)
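
A comparable moving average can be sketched with pandas rolling windows. The function `moving_average` below is hypothetical and runs on an invented table. It assumes that each window covers n consecutive messages of one sid, that incomplete windows at the start are dropped, and that every m-th window is kept starting from the first complete one; the package may align its windows differently.

```python
import pandas as pd

def moving_average(df, n, m, value_cols):
    """Rolling mean over n rows per sid, then every m-th row of the result."""
    parts = []
    for sid, group in df.groupby('sid', sort=True):
        group = group.sort_values('rid')
        rolled = group[value_cols].rolling(window=n).mean()
        # The rid of a window is the smallest rid among its rows.
        rolled['rid'] = group['rid'].rolling(window=n).min()
        rolled['sid'] = sid
        # Drop incomplete windows, then keep every m-th remaining row.
        parts.append(rolled.dropna().iloc[::m])
    out = pd.concat(parts, ignore_index=True)
    out['rid'] = out['rid'].astype(int)
    return out[['sid', 'rid'] + value_cols]

# Invented table: six messages for sid 1 and three for sid 2.
table = pd.DataFrame({
    'sid': [1, 1, 1, 1, 1, 1, 2, 2, 2],
    'rid': [0, 1, 2, 3, 4, 5, 0, 1, 2],
    'value': [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0],
})

smoothed = moving_average(table, n=2, m=2, value_cols=['value'])
print(smoothed)
```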

### Separate tables by sid

Get separate tables for different sids with the `get_sid_tables()` method.
The returned dictionary contains sids as keys and tables as values.

[Constraints](#data-query-constraints) and such [sampling methods](#sampling-methods) as skipping, averaging and calculation of moving average may be applied too.

Let's query the 'data.data[0]' and 'data.data[1]' columns with the following requirements:\
sid equals 1, 2 or 3\
11 <= rid <= 1000\
the output is averaged over 10 rows\
only rows with "data.data[1]" >= 1 are selected\
the result is ordered by 'rid' in descending order

In [47]:
dfs = citros.batch('galactic orbits_1').topic('/gal_orbits')\
            .sid([1, 2, 3])\
            .rid(start = 11, end = 1000)\
            .avg(10)\
            .set_filter({"data.data[1]": {'>=': 1}})\
            .set_order({'rid':'desc'})\
            .get_sid_tables(data_query = ['data.data[0]', 'data.data[1]'])

#Let's print the table with sid = 1
print('sid values are: {}\n'.format(list(dfs.keys())))
print('table with sid = 1:')
print(dfs[1].head(5))
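
The shape of the returned dictionary can be reproduced for a table that is already in memory. The sketch below splits an invented pandas table by 'sid'; it illustrates only the resulting structure (sid as key, table as value) and makes no claim about how `get_sid_tables()` obtains it.

```python
import pandas as pd

# Invented table with three simulations.
table = pd.DataFrame({
    'sid': [1, 1, 2, 3],
    'rid': [0, 1, 0, 0],
    'value': [0.1, 0.2, 0.3, 0.4],
})

# One table per sid; keys are converted to plain Python integers.
tables = {
    int(sid): part.reset_index(drop=True)
    for sid, part in table.groupby('sid')
}

print(f'sid values are: {list(tables.keys())}')
print(tables[1])
```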

### Maximum and minimum values

Find maximum and minimum values of the simulation ids (sid):

In [48]:
column_name = 'sid'

result_max = citros.batch('galactic orbits_1').topic('/gal_orbits').get_max_value(column_name)
result_min = citros.batch('galactic orbits_1').topic('/gal_orbits').get_min_value(column_name)

print(f"max value of the column '{column_name}' : {result_max}")
print(f"min value of the column '{column_name}' : {result_min}")

Find maximum and minimum values of the parameter 'R' that is stored at 'data.data[1]' for the simulation with sid = 3:

In [49]:
column_name = 'data.data[1]'

result_max = citros.batch('galactic orbits_1').topic('/gal_orbits').sid(3).get_max_value(column_name)
result_min = citros.batch('galactic orbits_1').topic('/gal_orbits').sid(3).get_min_value(column_name)

print(f"for sid = 3 max value of the column '{column_name}' : {result_max}")
print(f"for sid = 3 min value of the column '{column_name}' : {result_min}")

Find the maximum value of the parameter 'R' together with the time ('data.data[0]') and the simulation ('sid') at which it is reached.
Also determine the initial mass of the disk ('M_disc') of the simulation in which the maximum is reached.

In [21]:
#set return_index = True to get the sid and rid of the max/min value
#if there are several pairs of sid and rid values that correspond to the same max/min value, the sid and rid are returned as lists
R_max, sid_max, rid_max = citros.batch('galactic orbits_1').topic('/gal_orbits').get_max_value('data.data[1]', return_index = True)

#At this step, both sid_max and rid_max are either lists or integers.
#To simplify and unify the code for both cases, we'll first determine if sid_max is an integer (indicating that R_max was reached in just one simulation).
#If that's the case, we'll convert both sid_max and rid_max into lists:
if isinstance(sid_max, int):
    sid_max = [sid_max]
    rid_max = [rid_max]

#get the 'data.data[0]' value (time) that corresponds to sid_max and rid_max:

time = []
for s, r in zip(sid_max, rid_max):
    time.append(citros.batch('galactic orbits_1').topic('/gal_orbits').sid(s).rid(r).data('data.data[0]')['data.data[0]'].iloc[0])

#get the M_disk parameter that corresponds to the sid_max:
init_mass = []
for s in sid_max:
    init_mass.append(citros.batch('galactic orbits_1').topic('/config').sid(s).data(['data.gal_orbits.ros__parameters.M_disc'])['data.gal_orbits.ros__parameters.M_disc'].iloc[0])

print(f"maximum R value = {R_max},\nmaximum is reached at:")

for i in range(len(sid_max)):
    print(f"sid = {sid_max[i]} (M_disk = {init_mass[i]}), time = {time[i]}")
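
The scalar-or-list handling can be wrapped in a small helper. `as_list` below is hypothetical and not part of the package. Unlike an `isinstance(..., int)` check, it would also wrap a scalar of another numeric type (for instance a NumPy integer, should one be returned). The pandas part shows, on invented values, how a tie for the maximum leads to several (sid, rid) pairs.

```python
import pandas as pd

def as_list(value):
    """Wrap a single value in a list; return lists and tuples as lists."""
    return list(value) if isinstance(value, (list, tuple)) else [value]

# Invented table in which the maximum of 'R' is reached in two simulations.
table = pd.DataFrame({
    'sid': [1, 1, 2, 2],
    'rid': [0, 1, 0, 1],
    'R': [8.5, 9.2, 9.2, 7.0],
})

r_max = table['R'].max()
rows = table[table['R'] == r_max]

# Both results are lists here; as_list leaves them unchanged.
sid_max = as_list(rows['sid'].tolist())
rid_max = as_list(rows['rid'].tolist())

for s, r in zip(sid_max, rid_max):
    print(f'maximum R = {r_max} at sid = {s}, rid = {r}')
```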

### Unique values

Get unique values with `get_unique_values()`, for example all topics of the batch:

In [50]:
column_names = ['topic']
result = citros.batch('galactic orbits_1').get_unique_values(column_names)

#print the result
print(result)

To get unique combinations of values, list the columns in `column_names`.
For example, to get the unique topic-type combinations:

In [51]:
column_names = ['topic', 'type']
result = citros.batch('galactic orbits_1').get_unique_values(column_names)

#print the result
from prettytable import PrettyTable
table = PrettyTable(field_names=column_names, align='r')
table.add_rows(result)
print(table)

### Number of unique values

Print the number of unique values in the column `column_name`:

In [52]:
#name of the column:
column_name = 'type'

counts = citros.batch('galactic orbits_1').topic('/gal_orbits').sid([1]).get_unique_counts(column_name)
print(f"number of unique values in column '{column_name}' for topic '/gal_orbits': {counts[0][0]}")

#you may group the result, for example by topics:
group_by = 'topic'
counts = citros.batch('galactic orbits_1').sid([1]).get_unique_counts(column_name, group_by = group_by)

print(f"number of unique values in column '{column_name}':")
table = PrettyTable(field_names= [group_by, 'unique_counts'], align='r')
table.add_rows(counts)
table.border = False
print(table)
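
For data that is already loaded, the same questions (unique combinations, number of unique values, counts per group) can be answered with pandas. The table below is invented, including the topic and type names, and the sketch is only a local analogue of the database queries above.

```python
import pandas as pd

# Hypothetical message table; topic and type names are invented.
messages = pd.DataFrame({
    'topic': ['/config', '/gal_orbits', '/gal_orbits', '/other'],
    'type': ['type_a', 'type_b', 'type_b', 'type_a'],
    'sid': [1, 1, 2, 1],
})

# Unique combinations of the two columns, as a list of [topic, type] pairs.
pairs = messages[['topic', 'type']].drop_duplicates().values.tolist()

# Number of unique 'type' values overall and grouped by 'topic'.
total = messages['type'].nunique()
per_topic = messages.groupby('topic')['type'].nunique()

print(pairs)
print(total)
print(per_topic)
```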

## Plots

### Quick plots

Quick plots may be used to take a quick look at the data without saving it.

#### Quick x vs. y plot

Plot 'data.data[0]' vs. 'data.data[1]' for sids = 1 and 2:

In [53]:
#import matplotlib
import matplotlib.pyplot as plt

#create a figure to plot on
fig, ax = plt.subplots()

citros.batch('galactic orbits_1').topic('/gal_orbits').sid([1,2]).\
       xy_plot(ax, var_x_name = 'data.data[0]',var_y_name = 'data.data[1]',
               x_label = 't', y_label = 'R', title_text = 'R vs. t')
#do not specify `sid` if you would like to plot for all simulations

#### Time plot

Sometimes 'rid' may represent the time scale, for example if the results of the simulation are recorded with the step `time_step`.

Suppose that in the topic '/gal_orbits' the simulations were recorded every 0.5 units of time. Then 'data.data[1]' vs. time may be plotted in the following way:

In [54]:
#import matplotlib
import matplotlib.pyplot as plt

#create a figure to plot on
fig, ax = plt.subplots()

citros.batch('galactic orbits_1').topic('/gal_orbits').sid([1,2]).\
       time_plot(ax, var_name = 'data.data[1]', time_step = 0.5, y_label = 'R', title_text = 'R vs. t')

#### Plot with pandas

The pandas plotting methods may be used too. To plot each simulation id separately, for example 'data.data[0]' vs. 'rid':

In [55]:
citros.batch('galactic orbits_1').topic('/gal_orbits').data(['data.data[0]'])\
      .set_index(['rid','sid']).unstack()['data.data[0]'].plot()

Or plot several columns, 'data.data[0]' and 'data.data[1]', vs. 'rid':

In [56]:
citros.batch('galactic orbits_1').topic('/gal_orbits').data(['data.data[0]', 'data.data[1]'])\
      .set_index(['rid','sid']).unstack()[['data.data[0]', 'data.data[1]']].plot(legend = False)

### Advanced plots

Let's use the DataFrame that we queried and saved under the name *F* [previously](#query-and-rename-columns).

All the following methods return the matplotlib.Figure and matplotlib.Axes objects, which may be used for further plotting.

#### x vs. y

Plot simple graph 'R' vs 't' for different sid:

In [57]:
fig, ax = citros.plot_graph(F, 't', 'R', '-')

#### 3d plot

Plot x vs. y vs. z

In [58]:
# parameter 'scale' = True will set the same ranges for all axes

fig, ax = citros.plot_3dgraph(F, "xg", "yg", "z", '-', 
                                scale = False, title = 'Cluster coordinates', legend = True)

ax.set_box_aspect(aspect=None, zoom=0.9)
fig.tight_layout()
fig.show()

#### Subplots with a shared x-axis

Plot 'xg' vs 't' and 'yg' vs. 't' on adjacent panels on one figure:

In [59]:
fig, ax = citros.multiple_y_plot(F, "t", ["xg", "yg"], '--', legend = True, title = 'x and y vs. t', set_x_label = 'time', set_y_label = ['x','y'])

#### Grid of plots 

Plot a matrix of 3 x 3 graphs, each displaying either the histogram with values distribution (for graphs on the diagonal) or the relationship between variables 'xg', 'yg' and 'z':

In [60]:
# parameter 'scale' = True sets the same ranges for all axes

fig, ax = citros.multiplot(F, ["xg", "yg", "z"], '-' , 
                           legend = True, title = 'coordinates', scale = False, set_x_label = ['x', 'y', 'z' ],
                           set_y_label = ['x', 'y', 'z' ])

#### Ellipse error plot

Plot the error ellipse for the values of the "xg" (data.data[9]) and "yg" (data.data[10]) columns that correspond to the last rid in each sid:

In [61]:
#get all possible sid:
sid_list = citros.batch('galactic orbits_1').topic('/gal_orbits').get_unique_values('sid')
print(f"sid numbers: {sid_list}")

#for each sid get the last rid:
rid_dict = {}
for s in sid_list:
    rid_dict[s] = citros.batch('galactic orbits_1').topic('/gal_orbits').sid(s).get_max_value('rid')
print(f"rid last numbers: {rid_dict}")

# get the values of "xg" (data.data[9]) and "yg" (data.data[10]), that corresponds to the last rid:
# we are creating an empty DataFrame 'df', query for the values of the exact sid and rid and add the result to the 'df'.

import pandas as pd
df = pd.DataFrame()

for s, r in rid_dict.items():
    df = pd.concat([df, citros.batch('galactic orbits_1').topic('/gal_orbits').sid(s).rid(r).data(['data.data[9]', 'data.data[10]'])])

df.rename({'data.data[9]': 'xg', 'data.data[10]': 'yg'}, axis = 1, inplace = True)

fig, ax = citros.plot_sigma_ellipse(df, x_label = 'xg', y_label = 'yg',
                                    n_std = [1,2,3], plot_origin=False, bounding_error=False,
                                    set_x_label='xg', set_y_label = 'yg', title = 'Error ellipse')

Get ellipse parameters by setting `return_ellipse_param` to True. If multiple ellipses are depicted, a list of their parameters is returned. If `bounding_error` is also set to True, the bounding error radius is added to the returned parameters.

In [62]:
bounding_error = True
fig, ax, ellipse_param = citros.plot_sigma_ellipse(df, x_label = 'xg', y_label = 'yg', 
                                    n_std = 3, plot_origin=False, bounding_error=bounding_error,
                                    set_x_label='xg', set_y_label = 'yg', title = 'Error ellipse',
                                    return_ellipse_param = True)

print("ellipse parameters:")
print(f"center: {ellipse_param['x']}, {ellipse_param['y']}")
print(f"width: {ellipse_param['width']}, height: {ellipse_param['height']}")
print(f"angle: {ellipse_param['alpha']}\n")
if bounding_error:
    print(f"radius of the error circle: {ellipse_param['bounding_error']}\n")
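
For intuition about what such parameters describe, an n-sigma covariance ellipse is commonly derived from the eigen-decomposition of the sample covariance matrix. The sketch below follows that textbook construction with NumPy on invented points. The function `sigma_ellipse` is hypothetical and only borrows the key names printed above. It is an assumption that `plot_sigma_ellipse` uses the same conventions, for example that 'width' and 'height' are full axis lengths and that 'alpha' is given in degrees.

```python
import numpy as np

def sigma_ellipse(x, y, n_std=1.0):
    """Parameters of the n_std-sigma covariance ellipse of the points (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.cov(x, y)                      # 2 x 2 sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # put the major axis first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Full axis lengths: 2 * n_std * standard deviation along each principal axis.
    width, height = 2.0 * n_std * np.sqrt(np.clip(eigvals, 0.0, None))
    # Orientation of the major axis in degrees, folded by a modulo of 180.
    alpha = np.degrees(np.arctan2(eigvecs[1, 0], eigvecs[0, 0])) % 180.0
    return {'x': x.mean(), 'y': y.mean(),
            'width': width, 'height': height, 'alpha': alpha}

# Invented points: wider spread along x than along y, no correlation.
params = sigma_ellipse([-2.0, 2.0, 0.0, 0.0], [0.0, 0.0, -1.0, 1.0], n_std=1)
print(params)
```

For these points the major axis lies along x, so the width exceeds the height and the orientation is aligned with the x-axis.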

Add references to all the batches that were used in the current notebook:

In [None]:
ref = da.Ref()
ref.print()