# Importing data from survey platforms and files

Tally can import data from several survey platforms and file formats. The supported files and APIs include:

- Dimensions
- SPSS
- CSV
- Excel
- Confirmit

## Reading Unicom/Dimensions data

The Unicom/Dimensions converter reads a pair of mdd (metadata) and ddf (case data) files into a dataset.

The available parameters are:
- `path_meta` - file location of the mdd file
- `path_data` - file location of the ddf file
- `map_values` - whether to remap the answer codes from the data file (default: `True`)
- `multi_process` - whether to use all of the machine's processors (default: `False`)

In [1]:
import tally_core as tc
dataset = tc.DataSet('Museum')
dataset.read_dimensions(
  path_meta='./data/Example_Museum.mdd',
  path_data='./data/Example_Museum.ddf'
)

b'Variables identified as part of a blacklist: [\'name\']. \nThey have been renamed by adding "_" as prefix'




The `map_values` parameter controls whether the answer codes in the data file are remapped to make them easier to read. Let's look at the metadata of the dataset we've just created.

In [2]:
dataset.meta('gender')

| single | codes | texts | missing |
| --- | --- | --- | --- |
| gender: Gender of respondent | | | |
| 1 | 1 | Male | |
| 2 | 2 | Female | |


The codes are not the same as in the original ddf file: they have been mapped to 1 and 2. This is convenient if we rely on Unicom/Dimensions only for collecting the data, but it can be risky if we need to stay compatible with older scripts that depend on the original codes. In that case we set `map_values` to `False`.

In [3]:
dataset.read_dimensions(
  path_meta='./data/Example_Museum.mdd',
  path_data='./data/Example_Museum.ddf',
  map_values=False
)

b'Variables identified as part of a blacklist: [\'name\']. \nThey have been renamed by adding "_" as prefix'


If we look at the metadata for the `gender` variable now, we see that the Dimensions codes are unchanged.

In [4]:
dataset.meta('gender')

| single | codes | texts | missing |
| --- | --- | --- | --- |
| gender: Gender of respondent | | | |
| 1 | 23 | Male | |
| 2 | 24 | Female | |
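
To make the effect of `map_values` concrete, here is a minimal sketch in plain Python of what remapping answer codes to a 1-based sequence could look like. It is an illustration only, not Tally's implementation: the helper `remap_codes` and the sample responses are hypothetical, and only the codes 23 and 24 come from the output above.

```python
def remap_codes(code_labels, responses):
    """Renumber answer codes to 1..n, keeping the original code order.

    code_labels: dict of original code -> label (e.g. from the metadata)
    responses:   list of original codes, one per respondent (None = no answer)
    """
    # Hypothetical helper: sort the original codes and number them from 1.
    mapping = {old: new for new, old in enumerate(sorted(code_labels), start=1)}
    new_labels = {mapping[old]: label for old, label in code_labels.items()}
    new_responses = [mapping[r] if r is not None else None for r in responses]
    return mapping, new_labels, new_responses


# Codes as stored by Dimensions (taken from the metadata shown above).
original_labels = {23: "Male", 24: "Female"}
# Made-up responses, for illustration only.
original_responses = [23, 24, 24, None, 23]

mapping, labels, responses = remap_codes(original_labels, original_responses)
print(mapping)    # {23: 1, 24: 2}
print(labels)     # {1: 'Male', 2: 'Female'}
print(responses)  # [1, 2, 2, None, 1]
```

A script written against the original export would filter on 23 and 24, which is why keeping the codes unchanged matters when older scripts must keep working.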


## Reading SPSS
Tally supports reading SPSS files and provides two engines for doing so: savReaderWriter and readstat.

:::{warning}
The default engine for `read_spss` is savReaderWriter, which can be unstable on some systems, so this guide uses readstat instead. readstat is faster, but it doesn't support SPSS's multi-choice variables.
:::
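
If a file contains multi-choice variables, one option is to try the default engine first and fall back to readstat only when it fails. The sketch below is a hypothetical helper, not part of Tally. It accepts any reader callable that takes `path` and `engine`. The engine name `'savReaderWriter'` is assumed to be a valid `engine` value, and catching a broad `Exception` is also an assumption, since the errors raised by the default engine are not documented here. It only helps when the failure surfaces as a Python exception rather than a hard crash.

```python
def read_with_fallback(read, path, engines=("savReaderWriter", "readstat")):
    """Call read(path, engine=...) for each engine until one succeeds.

    Returns the name of the engine that worked. Raises RuntimeError,
    chained to the last error, if every engine fails.
    """
    last_error = None
    for engine in engines:
        try:
            read(path, engine=engine)
            return engine
        except Exception as error:  # assumption: the failure type is unknown
            last_error = error
    raise RuntimeError(f"no engine could read {path!r}") from last_error


# Stand-in reader for demonstration. With Tally this would be
# dataset.read_spss, assumed here to accept the same two arguments.
def fake_read(path, engine):
    if engine == "savReaderWriter":
        raise OSError("engine unavailable on this system")


used = read_with_fallback(fake_read, "./data/Example Data (A).sav")
print(used)  # readstat
```

Returning the engine name makes it easy to log which reader was used, which matters because multi-choice variables are not supported under readstat.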

In [5]:
dataset = tc.DataSet('Example data')
dataset.read_spss('./data/Example Data (A).sav', engine='readstat')

In [6]:
dataset.meta('gender')

| single | codes | texts | missing |
| --- | --- | --- | --- |
| gender: What is your gender? | | | |
| 1 | 1 | Male | |
| 2 | 2 | Female | |


## Reading Excel

:::{note}
To read Excel files, the dependency `openpyxl` needs to be installed with `pip install openpyxl`.
:::

In [11]:
dataset = tc.DataSet('excel')
dataset.read_excel('./data/Example Data (A).xlsx')

Inferring meta data from pd.DataFrame.columns (13)...
Converted 13 columns!


Because categorical data is often stored as strings in Excel, we need to convert these columns to `single` (single-choice) variables. This is what the variables `gender` and `q1` look like before we convert.

In [14]:
dataset[['gender', 'q1']].head()

| | gender | q1 |
| --- | --- | --- |
| 0 | Male | Aerobics |
| 1 | Female | Aerobics |
| 2 | Male | Football (soccer) |
| 3 | Male | Yoga |
| 4 | Male | Running/jogging |


We use the `DataSet.strings` method to fetch all string variables and the `DataSet.convert` method to convert each of them into a `single` variable.

In [17]:
for string_var in dataset.strings():
  dataset.convert(string_var, to='single')

In [18]:
dataset[['gender', 'q1']].head()

| | gender | q1 |
| --- | --- | --- |
| 0 | 2 | 1 |
| 1 | 1 | 1 |
| 2 | 2 | 3 |
| 3 | 2 | 12 |
| 4 | 2 | 10 |
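
The codes in this output are consistent with categories being numbered in alphabetical order (`Female` becomes 1 and `Male` becomes 2). The following sketch shows that idea in plain Python. It assumes alphabetical, 1-based numbering and is not Tally's actual implementation; the helper `strings_to_single` is hypothetical.

```python
def strings_to_single(values):
    """Convert a list of category strings to integer codes.

    Returns (codes, labels), where labels maps each code to its text.
    Assumption: codes are assigned 1..n in alphabetical order of the text.
    """
    categories = sorted(set(values))
    code_of = {text: code for code, text in enumerate(categories, start=1)}
    codes = [code_of[text] for text in values]
    labels = {code: text for text, code in code_of.items()}
    return codes, labels


# First five gender values from the Excel file shown earlier.
gender = ["Male", "Female", "Male", "Male", "Male"]
codes, labels = strings_to_single(gender)
print(codes)   # [2, 1, 2, 2, 2]
print(labels)  # {1: 'Female', 2: 'Male'}
```

Keeping the code-to-text mapping alongside the codes is what lets a later crosstab show `Female` and `Male` rather than 1 and 2.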


In [19]:
dataset.crosstab('q1', 'gender')

Unnamed: 0_level_0,Question,gender.,gender.
Unnamed: 0_level_1,Values,Female,Male
Question,Values,Unnamed: 2_level_2,Unnamed: 3_level_2
q1.,Base,4303.0,3952.0
q1.,Aerobics,1561.0,1438.0
q1.,Basketball,66.0,65.0
q1.,Football (soccer),447.0,447.0
q1.,Hockey,1.0,3.0
q1.,I regularly change my fitness activity,65.0,39.0
q1.,Lifting weights,1204.0,1094.0
q1.,Not applicable - I don't exercise,188.0,181.0
q1.,Other,50.0,41.0
q1.,Pilates,278.0,199.0
