# 1. Load and examine data

In [None]:
#
# To run this notebook, you first have to install Tally. Installing Tally requires an access token, which is read here from tally_keys.json.
#
import json
with open('tally_keys.json', 'r') as f:
    keys = json.load(f)

!pip install git+https://{keys['tally_api']}@github.com/datasmoothie/tally-core.git@master

## Load data

The first step in any project is loading the data. Tally supports multiple survey data formats and can convert data from different sources. Here, we will use Tally's native data and metadata format, which we call GVN (General Variable Notation).

For documentation about reading data from different platforms, refer to the chapter about [converting survey data](converting_survey_data).

In [2]:
import tally_core as tc
import pandas as pd
import json

dataset = tc.DataSet('Museum')

# Use a context manager so the metadata file is closed after reading
with open('./data/Example_Museum.json') as f:
    meta = json.load(f)
data = pd.read_parquet('./data/Example_Museum.parquet')
dataset.from_components(meta_dict=meta, data_df=data)


You now have a Tally <a href="API/DataSet.html">`DataSet`</a> object, which is used throughout the rest of this documentation.

:::{note} 
The `DataSet` object is the interface we use for every operation we do on the dataset, both data processing and aggregation. In every example from now on, we will assume we've already loaded data into a variable called `dataset`.
:::

## Viewing and finding variables in a dataset
We can now explore which variables are in the dataset. The main methods for this are:

 - <a href="API/DataSet.html#tally_core.DataSet.variables">`DataSet.variables`</a> to get a list,
 - <a href="API/DataSet.html#tally_core.DataSet.by_type">`DataSet.by_type`</a> to get a dataframe with one column per variable type, and
 - <a href="API/DataSet.html#tally_core.DataSet.find">`DataSet.find`</a> to search for variables by name.

In [3]:
dataset.find('Column')

['order.Column', 'rating.Column', 'plan_order.Column', 'rating_ent.Column']
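The result above is consistent with a plain substring match on variable names. The source does not say how `DataSet.find` matches (for example, whether it is case-sensitive), so the following is only a sketch of that assumed behaviour in plain Python. The helper `find_by_substring` is hypothetical and not part of Tally.

```python
# A subset of the names returned by dataset.variables() in this chapter.
variables = ['gender', 'museums', 'order.Column', 'plan', 'rating.Column',
             'plan_order.Column', 'know_way', 'rating_ent.Column']


def find_by_substring(names, text):
    """Hypothetical helper: return the names that contain `text`, in their original order."""
    return [name for name in names if text in name]


matches = find_by_substring(variables, 'Column')
print(matches)
```

Because the match is done on the name only, this kind of search locates arrays by naming convention, not by variable type.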

In [4]:
dataset.variables()

['address',
 'age',
 'before',
 'biology',
 'expect',
 'gen_ent',
 'gender',
 'museums',
 'visits',
 'visits12',
 'adults',
 'under16s',
 'under11s',
 'galleries',
 'dinosaurs',
 'whales',
 'human',
 'species',
 'mammals',
 'genbalance',
 'agebalance',
 '_name',
 'order.Column',
 'oth_mus',
 'plan',
 'prefer',
 'rating.Column',
 'school',
 'signs',
 'similar',
 'sounds',
 'side_main',
 'serial',
 'entrance',
 'education',
 'certificat',
 'location',
 'who_with',
 'grp_type',
 'group_org',
 'resident',
 'distance',
 'interview',
 'time_spent',
 'long_short',
 'remember',
 'interest',
 'found_way',
 'signs_how',
 'desc_leave',
 'when_decid',
 'why_decid',
 'plan_time',
 'plan_none',
 'plan_view',
 'plan_order.Column',
 'know_way',
 'find_way',
 'rating_ent.Column',
 'desc_enter',
 'visothers']

In [5]:
dataset.by_type()

| size: 602 | single | delimited set | array | int | float | string | date | time | N/A |
|---|---|---|---|---|---|---|---|---|---|
| 0 | age | museums | order.Column | visits | genbalance | Respondent.Serial | | | id_HDATA |
| 1 | before | certificat | plan_order.Column | visits12 | agebalance | DataCollection.Status | | | |
| 2 | biology | location | rating.Column | adults | | DataCollection.StartTime | | | |
| 3 | expect | remember | rating_ent.Column | under16s | | DataCollection.FinishTime | | | |
| 4 | gen_ent | found_way | | under11s | | DataCollection.RoutingContext | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 56 | | rating_ent[{botany}].Column | | | | | | | |
| 57 | | rating_ent[{origin_of_species}].Column | | | | | | | |
| 58 | | rating_ent[{human_biology}].Column | | | | | | | |
| 59 | | rating_ent[{evolution}].Column | | | | | | | |
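The layout of this output, one column per type padded with blanks, can be illustrated without Tally. The sketch below groups a small, hand-copied subset of the variables by type and pads the shorter columns. The mapping `var_types` and the helper `group_by_type` are hypothetical illustrations, not how `by_type` is implemented.

```python
from itertools import zip_longest

# Hypothetical name -> type mapping; the pairs are copied from the by_type() output above.
var_types = {
    'age': 'single',
    'before': 'single',
    'museums': 'delimited set',
    'order.Column': 'array',
    'visits': 'int',
    'visits12': 'int',
    'genbalance': 'float',
}


def group_by_type(types):
    """Collect variable names under their type, keeping insertion order."""
    grouped = {}
    for name, var_type in types.items():
        grouped.setdefault(var_type, []).append(name)
    return grouped


grouped = group_by_type(var_types)

# Pad the shorter lists with empty strings so that every row has one cell per type.
rows = list(zip_longest(*grouped.values(), fillvalue=''))

print(list(grouped))  # column headers
for row in rows:
    print(row)
```

The second row has more blanks than the real output because only a subset of the variables is included here.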


## Examining variables
We can explore the variables with the <a href="API/DataSet.html#tally_core.DataSet.meta">`DataSet.meta`</a> method, which shows a variable's type and label along with its answer codes and the text for each code.

In [6]:
dataset.meta('gender')

| single | codes | texts | missing |
|---|---|---|---|
| gender: Gender of respondent | | | |
| 1 | 23 | Male | |
| 2 | 24 | Female | |


The `meta` method also supports arrays, in which case it shows the array's items as well as its codes.

In [7]:
dataset.meta('rating_ent.Column')

| delimited set | items | item texts | codes | texts | missing |
|---|---|---|---|---|---|
| rating_ent.Column: Q46 | | | | | |
| 1 | rating_ent[{dinosaurs}].Column | Dinosaurs | 48.0 | Not at all interested (1) | |
| 2 | rating_ent[{conservation}].Column | Conservation | 49.0 | Not particularly interested (2) | |
| 3 | rating_ent[{fish_and_reptiles}].Column | Fish and reptiles | 50.0 | No opinion (3) | |
| 4 | rating_ent[{fossils}].Column | Fossils | 51.0 | Slightly interested (4) | |
| 5 | rating_ent[{birds}].Column | Birds | 52.0 | Very interested (5) | |
| 6 | rating_ent[{insects}].Column | Insects | | | |
| 7 | rating_ent[{whales}].Column | Whales | | | |
| 8 | rating_ent[{mammals}].Column | Mammals | | | |
| 9 | rating_ent[{minerals}].Column | Minerals | | | |
| 10 | rating_ent[{ecology}].Column | Ecology | | | |


Tally understands grids/loops, which are collectively referred to as `arrays`. If a grid/loop in Dimensions contains multiple variables, Tally will create one array for each of those variables. In the Museum data, each grid contains only one variable, `Column`.

Tally "flattens" the grid data into columns, so every item in the array, such as `rating_ent[{human_biology}].Column` or `rating_ent[{evolution}].Column`, has its own column in the dataframe.
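The naming pattern of the flattened columns can be reproduced in a few lines. This is a sketch of the convention visible in the output above (`grid[{item}].variable`), not of Tally's internals, and the helper `flattened_columns` is hypothetical.

```python
def flattened_columns(grid, items, variable='Column'):
    """Hypothetical helper: build one column name per item, e.g. rating_ent[{evolution}].Column."""
    # Doubled braces in an f-string produce literal '{' and '}' characters.
    return [f'{grid}[{{{item}}}].{variable}' for item in items]


columns = flattened_columns('rating_ent', ['human_biology', 'evolution'])
print(columns)
```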

## Examining the case data dataframe

Tally supports a bracket syntax, similar to pandas, to examine the case data dataframe. 

In [8]:
dataset['gender'].head()

0    23
1    24
2    24
3    23
4    24
Name: gender, dtype: int64
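The values in the case data are answer codes, not labels. As a minimal sketch, the codes can be translated with the texts that `dataset.meta('gender')` reported earlier (23 is Male, 24 is Female). The dictionary below is typed in by hand for illustration and is not read from Tally.

```python
# Code -> text mapping copied by hand from the dataset.meta('gender') output.
gender_labels = {23: 'Male', 24: 'Female'}

# The five codes shown by dataset['gender'].head().
gender_codes = [23, 24, 24, 23, 24]

# Fall back to a visible marker for any code that has no label.
gender_texts = [gender_labels.get(code, 'unknown code') for code in gender_codes]
print(gender_texts)
```

On a pandas Series, `Series.map(gender_labels)` performs the same lookup and returns `NaN` for codes that are missing from the dictionary.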

In [9]:
dataset[['gender', 'age']].head()

| | gender | age |
|---|---|---|
| 0 | 23 | 5 |
| 1 | 24 | 4 |
| 2 | 24 | 4 |
| 3 | 23 | 4 |
| 4 | 24 | 5 |


The bracket syntax also supports grids: if you pass a grid variable, Tally fetches all of its item columns from the dataset.

In [10]:
dataset['rating_ent.Column'].head()

| | rating_ent[{dinosaurs}].Column | rating_ent[{conservation}].Column | rating_ent[{fish_and_reptiles}].Column | rating_ent[{fossils}].Column | rating_ent[{birds}].Column | rating_ent[{insects}].Column | rating_ent[{whales}].Column | rating_ent[{mammals}].Column | rating_ent[{minerals}].Column | rating_ent[{ecology}].Column | rating_ent[{botany}].Column | rating_ent[{origin_of_species}].Column | rating_ent[{human_biology}].Column | rating_ent[{evolution}].Column | rating_ent[{wildlife_in_danger}].Column |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 51; | 48; | 51; | 48; | 51; | 51; | 48; | 49; | 51; | 48; | 48; | 51; | 51; | 51; | 51; |
| 1 | 51; | 50; | 50; | 51; | 50; | 50; | 51; | 51; | 52; | 51; | 51; | 51; | 50; | 51; | 51; |
| 2 | 51; | 49; | 48; | 50; | 52; | 48; | 51; | 48; | 48; | 51; | 48; | 51; | 50; | 51; | 52; |
| 3 | 52; | 49; | 51; | 51; | 48; | 51; | 48; | 48; | 51; | 48; | 48; | 51; | 48; | 51; | 48; |
| 4 | 52; | 49; | 51; | 51; | 51; | 51; | 52; | 51; | 52; | 51; | 51; | 51; | 50; | 52; | 51; |
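Each cell above holds a code followed by a semicolon, which suggests that delimited sets are stored as semicolon-separated strings of codes. Assuming that encoding (the source does not spell it out), a cell can be decoded as sketched below. The helper `parse_delimited_set` is hypothetical, and the labels are copied by hand from the `dataset.meta('rating_ent.Column')` output.

```python
# Code -> text mapping copied by hand from the meta output for rating_ent.Column.
rating_labels = {
    48: 'Not at all interested (1)',
    49: 'Not particularly interested (2)',
    50: 'No opinion (3)',
    51: 'Slightly interested (4)',
    52: 'Very interested (5)',
}


def parse_delimited_set(value):
    """Hypothetical helper: turn a string such as '48;51;' into a list of integer codes."""
    if not isinstance(value, str):
        return []  # treat missing values (None, NaN) as "no codes selected"
    return [int(code) for code in value.split(';') if code.strip()]


codes = parse_delimited_set('51;')
texts = [rating_labels[code] for code in codes]
print(codes, texts)
```

The same helper would also handle a cell with several codes, such as `'48;51;'`, if the data contained one.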


For grids, we can also use the <a href="API/DataSet.html#tally_core.DataSet.categories">`DataSet.categories`</a> method to explore which items are in the grid. This is especially useful for grids that have more than one level.

In [11]:
dataset.categories('rating_ent.Column')

['rating_ent[{dinosaurs}].Column',
 'rating_ent[{conservation}].Column',
 'rating_ent[{fish_and_reptiles}].Column',
 'rating_ent[{fossils}].Column',
 'rating_ent[{birds}].Column',
 'rating_ent[{insects}].Column',
 'rating_ent[{whales}].Column',
 'rating_ent[{mammals}].Column',
 'rating_ent[{minerals}].Column',
 'rating_ent[{ecology}].Column',
 'rating_ent[{botany}].Column',
 'rating_ent[{origin_of_species}].Column',
 'rating_ent[{human_biology}].Column',
 'rating_ent[{evolution}].Column',
 'rating_ent[{wildlife_in_danger}].Column']
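If only the item keys are needed, they can be pulled out of these column names with a regular expression. The sketch below assumes the single-level pattern `grid[{item}].variable` seen in this dataset. It does not attempt to handle grids with more than one level, whose naming is not shown here. The helper `split_item_column` is hypothetical.

```python
import re

# Matches names such as rating_ent[{dinosaurs}].Column (single-level grids only).
ITEM_PATTERN = re.compile(r'^(?P<grid>[^\[]+)\[\{(?P<item>[^}]+)\}\]\.(?P<variable>.+)$')


def split_item_column(name):
    """Hypothetical helper: split a flattened column name into (grid, item, variable)."""
    match = ITEM_PATTERN.match(name)
    if match is None:
        raise ValueError(f'not a flattened grid column: {name!r}')
    return match['grid'], match['item'], match['variable']


parts = split_item_column('rating_ent[{origin_of_species}].Column')
print(parts)
```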