# Creating derived variables using Tally, the API for market research

Tally can be installed with pip. To install it, run:

```
pip install datasmoothie-tally-client
```

If you are running this in Gitpod, the Python client is already installed.

In [1]:
import tally
import os
import pandas as pd
import pprint as pp

## Working with different data sources

Tally works with SPSS files, CSV files, the Confirmit API and Unicom/Dimensions files (mdd/ddf). Here we demonstrate it using an SPSS file.

You need a Tally API key to run the example. Get in touch at info@datasmoothie.com if you need one.

In [44]:
# we store the tally key in an environment variable, get in touch to get your own key
dataset = tally.DataSet(api_key=os.environ.get('tally_api_key'))
dataset.use_spss('data/Example Data (A).sav')

# also compatible with Confirmit, Nebu, Dimensions.
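
If the environment variable is not set, `os.environ.get` returns `None`, and the problem may only surface later as a failed request. A minimal sketch of a fail-early check follows; `require_api_key` is a hypothetical helper written for this example and is not part of the Tally client.

```python
import os


def require_api_key(name="tally_api_key", environ=None):
    """Return the key stored under `name`, or raise a clear error if it is missing.

    Hypothetical helper, not part of the Tally client. `environ` defaults to
    os.environ and can be replaced with a plain dict for testing.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(name)
    if not key:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "request a key before running the example."
        )
    return key
```

With this in place, the dataset could be created with `tally.DataSet(api_key=require_api_key())`.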

## Create a variable derived from two other variables

We will create a derived variable from `q2__1` and `q2__2`, two questions in a series that asks which sports people do. We want to know who does either sky diving or base jumping, so we use Tally's derive function ([documented here](https://tally.datasmoothie.com/#tag/Data-Processing/operation/derive)).

First, we look at the crosstab of the two source variables:

In [36]:
dataset.crosstab(x=['q2__1', 'q2__2'], ci=['c%'])

| Question | Values | Total |
|---|---|---|
| q2__1. Sky diving | Base | 2999.0 |
| q2__1. Sky diving | No | 62.4 |
| q2__1. Sky diving | Yes | 37.6 |
| q2__2. Base jumping | Base | 2999.0 |
| q2__2. Base jumping | No | 54.5 |
| q2__2. Base jumping | Yes | 45.5 |
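
To make the output above easier to read, here is a pandas-only sketch of how a base and column percentages can be computed for one variable. It assumes that the base counts respondents with a non-missing answer and that percentages are taken over that base, without weighting; this is an assumption about the output, not a description of Tally's internals. The data is a hypothetical toy series.

```python
import numpy as np
import pandas as pd

# Hypothetical answers: 1 = Yes, 0 = No, NaN = question not answered.
answers = pd.Series([1.0, 0.0, np.nan, 0.0, 1.0, 0.0, np.nan, 0.0])

# Base: number of respondents with a non-missing answer.
base = int(answers.notna().sum())

# Column percentages over the base (value_counts drops NaN by default).
col_pct = answers.value_counts(normalize=True).mul(100).round(1)

print("Base:", base)
print(col_pct)
```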


Next, we create our condition map. Anyone who answered "yes" to question 1 or question 2 (a "union") gets the value "Yes". Anyone who answered "no" to both questions (an "intersection") gets the value "No".

Then we create the derived variable (in this case, a single-choice variable).

In [51]:
# each entry: (value code, label, logic, {variable: [codes to match]})
cond_map = [
    (1, "Yes", 'union', {'q2__1':[1], 'q2__2':[1]}),
    (0, "No", 'intersection', {'q2__1':[0], 'q2__2':[0]}),
]
result = dataset.derive(
    name='q2_1_or_2', 
    label='Extreme sports?', 
    cond_map=cond_map, 
    qtype="single"
)
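
The union and intersection rules in the condition map can be sketched locally with pandas boolean masks. This is an illustration of the logic on hypothetical toy data, not Tally's implementation. How Tally treats rows that match neither rule is not shown here; the sketch assumes they stay missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 1 = Yes, 0 = No, NaN = question not answered.
df = pd.DataFrame({
    "q2__1": [1.0, 0.0, np.nan, 0.0, 1.0],
    "q2__2": [1.0, 0.0, np.nan, 1.0, 0.0],
})

# "union": at least one of the conditions holds.
any_yes = (df["q2__1"] == 1) | (df["q2__2"] == 1)
# "intersection": all of the conditions hold.
both_no = (df["q2__1"] == 0) & (df["q2__2"] == 0)

# Start with missing values, then fill in rows that match a rule.
df["q2_1_or_2"] = np.nan
df.loc[both_no, "q2_1_or_2"] = 0.0
df.loc[any_yes, "q2_1_or_2"] = 1.0

print(df)
```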


As a sanity check, we look at the crosstab of the new variable.

In [52]:
dataset.crosstab(x='q2_1_or_2', ci=['c%'])


| Question | Values | Total |
|---|---|---|
| q2_1_or_2. Extreme sports? | Base | 2999.0 |
| q2_1_or_2. Extreme sports? | Yes | 56.5 |
| q2_1_or_2. Extreme sports? | No | 43.5 |


Finally, we check the underlying data to see whether our logic was correct.

In [53]:
dataset.get_dataframe()[['q2__1', 'q2__2', 'q2_1_or_2']]

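|  | q2__1 | q2__2 | q2_1_or_2 |
|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 |
| 1 | 0.0 | 0.0 | 0.0 |
| 2 | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN |
| ... | ... | ... | ... |
| 8250 | 0.0 | 1.0 | 1.0 |
| 8251 | NaN | NaN | NaN |
| 8252 | NaN | NaN | NaN |
| 8253 | NaN | NaN | NaN |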
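
With thousands of rows, scanning the frame by eye only goes so far. One option is to reduce it to its distinct combinations of values, which gives a small truth table for the derived variable. The sketch below uses a hypothetical stand-in for the output of `dataset.get_dataframe()`, with values mirroring the rows shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dataset.get_dataframe()[['q2__1', 'q2__2', 'q2_1_or_2']].
df = pd.DataFrame({
    "q2__1":     [1.0, 0.0, np.nan, np.nan, 0.0, np.nan],
    "q2__2":     [1.0, 0.0, np.nan, np.nan, 1.0, np.nan],
    "q2_1_or_2": [1.0, 0.0, np.nan, np.nan, 1.0, np.nan],
})

# Keep one row per distinct combination (pandas treats NaN as equal here).
truth_table = df.drop_duplicates().reset_index(drop=True)

print(truth_table)
```

Each row of `truth_table` can then be compared against the condition map by hand.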
