In [1]:
import tally_core as tc
import json
from pprint import pprint
dataset = tc.DataSet("Museums")

# Data and meta-data structure

The main object of Tally is the `tally_core.DataSet`. This represents the data from a survey.

Tally builds upon the `pandas` library. Case-data is represented
by a `pandas.DataFrame` and each column is a `pandas.Series`. The native
format for Tally to save this data is a parquet file.

Tally defines its own meta-data schema to describe the data columns
and provide additional information on the underlying structure of the data.
The meta-data is represented as a nested `dict` and saved as a json file.

We refer to the parquet/json combination as the Tally General Variable Notation (gvn).

The data and meta-data are stored in the private attributes `DataSet._data` and `DataSet._meta`, respectively.

| variable name | stored in memory | stored on file |
| ---------- | ---------- | ----- |
| `DataSet._meta` | Python `dict` | json |
| `DataSet._data` | `pandas.DataFrame` | parquet |
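
To illustrate the "nested `dict` saved as json" half of this pairing, here is a minimal sketch that uses only the Python standard library. It is not how Tally itself writes the file; the trimmed `meta` dict and the file name `example_meta.json` are hypothetical stand-ins.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical, heavily trimmed meta-data dict; only the 'gender' labels
# are copied from the example dataset.
meta = {
    "columns": {
        "gender": {
            "name": "gender",
            "type": "single",
            "text": {"en-US": "Gender of respondent", "ja-JP": "回答者の性別"},
        }
    },
    "lib": {"default text": "en-US"},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example_meta.json"  # hypothetical file name
    # ensure_ascii=False keeps non-ASCII labels readable in the file.
    path.write_text(json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8")
    restored = json.loads(path.read_text(encoding="utf-8"))

# The nested dict survives the round trip unchanged.
print(restored == meta)
```

Because the meta-data contains only strings, numbers, lists and dicts, a plain json round trip preserves it exactly.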


We have already converted our data file, the standard Museum example data file from Dimensions, into a gvn file.

In [2]:
dataset.read_gvn("./data/Example_Museum.json", "./data/Example_Museum.parquet")

## The case-data dataframe
The case-data is stored in a variable called `_data`, but we can retrieve the dataframe with the `DataSet.data` method.


In [3]:
dataset.data().head()

Unnamed: 0,id_HDATA,Respondent.Serial,DataCollection.Status,DataCollection.StartTime,DataCollection.FinishTime,DataCollection.RoutingContext,address,age,before,biology,...,rating_ent[{whales}].Column,rating_ent[{mammals}].Column,rating_ent[{minerals}].Column,rating_ent[{ecology}].Column,rating_ent[{botany}].Column,rating_ent[{origin_of_species}].Column,rating_ent[{human_biology}].Column,rating_ent[{evolution}].Column,rating_ent[{wildlife_in_danger}].Column,@1
0,1,1,177;,2002-07-19 12:42:30.999,2002-07-19 14:52:31,186,"124 Dill Hall Lane, Church Ditton",5,9,10,...,48;,49;,51;,48;,48;,51;,51;,51;,51;,1.0
1,2,2,177;,2002-07-19 12:42:30.999,2002-07-19 16:52:31,186,"22 Southbank Road, Hounslow",4,10,10,...,51;,51;,52;,51;,51;,51;,50;,51;,51;,1.0
2,3,3,177;,2002-07-19 12:42:30.999,2002-07-19 18:52:31,186,"Gatehouse, Church Strarmthorpe",4,10,10,...,51;,48;,48;,51;,48;,51;,50;,51;,52;,1.0
3,4,4,177;,2002-07-19 12:42:30.999,2002-07-19 20:52:31,186,"151 Linacre Road, London SE2",4,9,10,...,48;,48;,51;,48;,48;,51;,48;,51;,48;,1.0
4,5,5,177;,2002-07-19 12:42:30.999,2002-07-19 22:52:31,186,"73 Kings Road, North Ormesby",5,9,10,...,52;,51;,52;,51;,51;,51;,50;,52;,51;,1.0


:::{note} 
When examining pandas dataframes, we often use the `pandas.DataFrame.head(n=5)` method which shows the top `n` rows of the dataframe. 
:::

Tally mimics the pandas `[]` syntax. You can view one or more variables using this bracket syntax.

In [4]:
dataset[['age', 'gender']].head()

Unnamed: 0,age,gender
0,5,23
1,4,24
2,4,24
3,4,23
4,5,24


Tally supports grids/loops, which we call arrays. We can view all the data in an array by using its name, or view part of it by referring to its individual columns.

In [5]:
dataset['rating_ent.Column'].head()

Unnamed: 0,rating_ent[{dinosaurs}].Column,rating_ent[{conservation}].Column,rating_ent[{fish_and_reptiles}].Column,rating_ent[{fossils}].Column,rating_ent[{birds}].Column,rating_ent[{insects}].Column,rating_ent[{whales}].Column,rating_ent[{mammals}].Column,rating_ent[{minerals}].Column,rating_ent[{ecology}].Column,rating_ent[{botany}].Column,rating_ent[{origin_of_species}].Column,rating_ent[{human_biology}].Column,rating_ent[{evolution}].Column,rating_ent[{wildlife_in_danger}].Column
0,51;,48;,51;,48;,51;,51;,48;,49;,51;,48;,48;,51;,51;,51;,51;
1,51;,50;,50;,51;,50;,50;,51;,51;,52;,51;,51;,51;,50;,51;,51;
2,51;,49;,48;,50;,52;,48;,51;,48;,48;,51;,48;,51;,50;,51;,52;
3,52;,49;,51;,51;,48;,51;,48;,48;,51;,48;,48;,51;,48;,51;,48;
4,52;,49;,51;,51;,51;,51;,52;,51;,52;,51;,51;,51;,50;,52;,51;


In [6]:
dataset[['rating_ent[{dinosaurs}].Column', 'rating_ent[{conservation}].Column']].head()

Unnamed: 0,rating_ent[{dinosaurs}].Column,rating_ent[{conservation}].Column
0,51;,48;
1,51;,50;
2,51;,49;
3,52;,49;
4,52;,49;


### Storing case-data
The case-data is stored as a parquet file, created with `pandas.DataFrame.to_parquet` (see the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html">pandas docs</a> for more).

## meta-data

Tally implements a meta-data schema to describe the data columns
and provide additional information on the underlying structure of the data.

The meta-data stores information on each column, e.g. variable type, labels and values. It also stores information on how arrays (grids/loops) are composed, their labels, etc.

### meta-data structure
The meta-data document is saved to file as a json document. In memory it is stored in the `DataSet._meta` variable, a nested `dict`. We can examine the top-level keys like this:

In [7]:
dataset._meta.keys()

dict_keys(['columns', 'info', 'lib', 'masks', 'sets', 'type'])

| element | contains |
| ---------- | ---------- |
| `'columns'` | info on `DataFrame` columns (types, labels, etc.) |
| `'info'` | info on the source data |
| `'lib'` | shared use references |
| `'masks'` | complex variable type definitions (arrays, dichotomous, etc.) |
| `'sets'` | ordered groups of variables pointing to other parts of the meta |
| `'type'` | case-data type |


## `columns`
Every column in the case-data dataset has a corresponding key in this dictionary. Taking the `gender` variable as an example, this is the meta-data stored for it in `columns`.

In [8]:
pprint(dataset._meta['columns']['gender'])

{'name': 'gender',
 'parent': {},
 'properties': {'LookName': 'Contemporary::1 Column Narrow:'},
 'text': {'en-US': 'Gender of respondent',
          'es-ES': 'Género del encuestado',
          'ja-JP': '回答者の性別'},
 'type': 'single',
 'values': [{'properties': {},
             'text': {'en-US': 'Male', 'es-ES': 'Masculino', 'ja-JP': '男性'},
             'value': 23},
            {'properties': {},
             'text': {'en-US': 'Female', 'es-ES': 'Femenino', 'ja-JP': '女性'},
             'value': 24}]}
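
As a rough illustration of how this structure can be used, the sketch below builds a code-to-label lookup from a trimmed copy of the `gender` entry. The helper `value_labels` is hypothetical and not part of Tally; it only shows how the `values` list relates answer codes to labels.

```python
# Trimmed copy of the 'gender' entry printed above.
gender_meta = {
    "name": "gender",
    "type": "single",
    "values": [
        {"text": {"en-US": "Male", "es-ES": "Masculino"}, "value": 23},
        {"text": {"en-US": "Female", "es-ES": "Femenino"}, "value": 24},
    ],
}


def value_labels(column_meta, text_key):
    """Map each answer code to its label in the requested language."""
    return {v["value"]: v["text"][text_key] for v in column_meta["values"]}


labels = value_labels(gender_meta, "en-US")
codes = [23, 24, 24, 23, 24]  # first five 'gender' values shown earlier
print([labels[c] for c in codes])
```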


Columns that are part of an array (grid/loop) are also stored here. The array `rating_ent.Column` has 15 columns: `rating_ent[{dinosaurs}].Column`, `rating_ent[{conservation}].Column`, etc. This is what one of them looks like in the `columns` meta-data.

In [9]:
pprint(dataset._meta['columns']['rating_ent[{dinosaurs}].Column'])

{'name': 'rating_ent[{dinosaurs}].Column',
 'parent': {'masks@rating_ent.Column': {'type': 'array'}},
 'properties': {'DisplayOrientation': 'Vertical'},
 'text': {'en-US': 'Q46 - Dinosaurs',
          'es-ES': 'P46 - Dinosaurios',
          'ja-JP': '問46 - 恐竜'},
 'type': 'delimited set',
 'values': 'lib@values@rating_ent.Column'}
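
The case-data for this `delimited set` column holds strings such as `51;`. A minimal sketch of turning such a cell into a list of integer codes follows. It assumes that several codes are joined by semicolons (the cell `48;52;` is a made-up example), and `parse_delimited_set` is a hypothetical helper, not a Tally function.

```python
def parse_delimited_set(cell, sep=";"):
    """Split a delimited-set cell such as '51;' into a list of integer codes."""
    # Empty fragments (e.g. after the trailing separator) are skipped.
    return [int(part) for part in cell.split(sep) if part.strip()]


print(parse_delimited_set("51;"))
print(parse_delimited_set("48;52;"))  # hypothetical multi-code cell
```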


## `masks`
Arrays are stored with what we call a `mask`. A mask is a "virtual" variable that is composed of one or more columns. In this dataset we have four grid variables, and these are all stored in the `masks` part of the meta-data dict.

In [10]:
dataset._meta['masks'].keys()

dict_keys(['order.Column', 'plan_order.Column', 'rating.Column', 'rating_ent.Column'])

If we examine one of them more closely, we find:

In [11]:
pprint(dataset._meta['masks']['rating_ent.Column'])

{'items': [{'properties': {},
            'source': 'columns@rating_ent[{dinosaurs}].Column',
            'text': {'en-US': 'Dinosaurs',
                     'es-ES': 'Dinosaurios',
                     'ja-JP': '恐竜'}},
           {'properties': {},
            'source': 'columns@rating_ent[{conservation}].Column',
            'text': {'en-US': 'Conservation',
                     'es-ES': 'Conservación',
                     'ja-JP': '保全'}},
           {'properties': {},
            'source': 'columns@rating_ent[{fish_and_reptiles}].Column',
            'text': {'en-US': 'Fish and reptiles',
                     'es-ES': 'Peces y reptiles',
                     'ja-JP': '魚・両生類'}},
           {'properties': {},
            'source': 'columns@rating_ent[{fossils}].Column',
            'text': {'en-US': 'Fossils', 'es-ES': 'Fósiles', 'ja-JP': '化石'}},
           {'properties': {},
            'source': 'columns@rating_ent[{birds}].Column',
            'text': {'en-US': 'Birds', 'es-ES': '

## `lib` - values shared between columns and arrays
Both the `rating_ent.Column` meta-data in the previous section, and the meta-data for one of its variables as explored in the `columns` section above, have a reference to `lib@values@rating_ent.Column`.

This means that the values for the array (and thereby the column in the array) are stored in the `lib` section of the meta-data.
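
One way to picture such a reference is as a path through the nested dict, with `@` separating the keys. The sketch below follows that path by hand. The `resolve` helper is hypothetical, the `meta` dict is trimmed and partly made up, and the approach assumes that no key in the path contains an `@` itself.

```python
def resolve(meta, reference):
    """Follow an '@'-separated reference such as 'lib@values@rating_ent.Column'."""
    node = meta
    # Assumption: none of the keys along the path contains '@'.
    for key in reference.split("@"):
        node = node[key]
    return node


# Trimmed, partly hypothetical meta-data with the same shape as the examples above.
meta = {
    "columns": {
        "rating_ent[{dinosaurs}].Column": {
            "type": "delimited set",
            "values": "lib@values@rating_ent.Column",
        }
    },
    "lib": {"values": {"rating_ent.Column": [{"value": 48}, {"value": 52}]}},
}

column = resolve(meta, "columns@rating_ent[{dinosaurs}].Column")
values = column["values"]
if isinstance(values, str):  # a string here is a reference, not the values themselves
    values = resolve(meta, values)
print([v["value"] for v in values])
```

The same walk works for the `columns@...` strings found in the `source` field of mask items.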

If we examine the `lib` section of the meta-data, we see the following keys.

In [12]:
dataset._meta['lib'].keys()

dict_keys(['ddf_categorymap', 'default text', 'values'])

### `values`
The `values` dict contains information about values that are shared by several masks and/or columns. For example, we previously saw a reference to `lib@values@rating_ent.Column`, which points to this list:

In [13]:
pprint(dataset._meta['lib']['values']['rating_ent.Column'])

[{'factor': 1.0,
  'properties': {},
  'text': {'en-US': 'Not at all interested (1)',
           'es-ES': 'No me interesa para nada (1)',
           'ja-JP': '全く興味がない (1)'},
  'value': 48},
 {'factor': 2.0,
  'properties': {},
  'text': {'en-US': 'Not particularly interested (2)',
           'es-ES': 'No me interesa en particular (2)',
           'ja-JP': '特に興味はない (2)'},
  'value': 49},
 {'factor': 3.0,
  'properties': {},
  'text': {'en-US': 'No opinion (3)',
           'es-ES': 'No tengo opinión',
           'ja-JP': 'どちらでもない (3)'},
  'value': 50},
 {'factor': 4.0,
  'properties': {},
  'text': {'en-US': 'Slightly interested (4)',
           'es-ES': 'Me interesa un poco',
           'ja-JP': '少し興味がある (4)'},
  'value': 51},
 {'factor': 5.0,
  'properties': {},
  'text': {'en-US': 'Very interested (5)',
           'es-ES': 'Me interesa mucho',
           'ja-JP': '非常に興味がある (5)'},
  'value': 52}]
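
Each entry pairs an answer code (`value`) with a numeric `factor`. As a sketch of how the two relate, the snippet below converts codes to factors and averages them. Treating `factor` as the score used for means is an assumption made for this illustration, not a statement about how Tally computes statistics.

```python
# Codes and factors copied from the 'rating_ent.Column' values printed above.
values_meta = [
    {"value": 48, "factor": 1.0},
    {"value": 49, "factor": 2.0},
    {"value": 50, "factor": 3.0},
    {"value": 51, "factor": 4.0},
    {"value": 52, "factor": 5.0},
]
code_to_factor = {v["value"]: v["factor"] for v in values_meta}

# First five answers for rating_ent[{dinosaurs}].Column shown earlier.
codes = [51, 51, 51, 52, 52]
scores = [code_to_factor[c] for c in codes]
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```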


### `ddf_categorymap`
The `ddf_categorymap` is only applicable to Dimensions files, and stores the `<categorymap>` data as a dict.

This categorymap in Dimensions
```
<categorymap>
  <categoryid name="e1116_years" value="1"/>
  <categoryid name="e1720_years" value="2"/>
  <categoryid name="e2124_years" value="3"/>
  ...
```
would map to this dict in `ddf_categorymap`

```
{
  "e1116_years":"1",
  "e1720_years":"2",
  "e2124_years":"3",
  ...
}
```
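
The mapping above can be reproduced with the standard library's XML parser. This is only a sketch of the correspondence, not Tally's own converter; the XML below is a shortened, closed version of the fragment shown, containing just the three listed entries.

```python
import xml.etree.ElementTree as ET

# Shortened, closed version of the categorymap fragment above.
xml_text = """
<categorymap>
  <categoryid name="e1116_years" value="1"/>
  <categoryid name="e1720_years" value="2"/>
  <categoryid name="e2124_years" value="3"/>
</categorymap>
"""

root = ET.fromstring(xml_text)
# Each <categoryid> element becomes one name -> value entry; values stay strings.
category_map = {el.get("name"): el.get("value") for el in root.iter("categoryid")}
print(category_map)
```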

### `default text`

The `default text` key stores the name of the default language, which is used if nothing else is specified. For this dataset, the default text is US English.

In [14]:
dataset._meta['lib']['default text']

'en-US'

So, when we ask for the meta-data (or run a crosstab, or do anything that accepts a `text_key` parameter), we get US English labels by default.

In [15]:
dataset.meta('gender')

single,codes,texts,missing
gender: Gender of respondent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,23,Male,
2,24,Female,


But if we supply another text key, we get a label from another language.

In [16]:
dataset.meta('gender', text_key='es-ES')

single,codes,texts,missing
gender: Género del encuestado,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,23,Masculino,
2,24,Femenino,


## `sets`, `info` and `type`
`sets` holds ordered groups of variables that link to all valid variables. It is used to list the variables, store their order, etc.

`info` stores basic information about the origins of the `DataSet`.

`type` is a legacy key that is not currently used; it only ever shows `pandas.DataFrame`.