# 4. Weight data

Suppose census data shows that, in the population we are surveying, the distribution by gender and location is:

- Gender: 51% women, 49% men.
- Location: 17.1% in rural areas, 82.9% in densely populated areas.

We want our survey to represent these groups correctly, so we weight the data.
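
To make the idea concrete, here is a minimal sketch of weighting on a single variable: each respondent's weight factor is the target share of their group divided by the group's share in the sample. The sample shares used here are the unweighted gender percentages reported in the crosstab at the end of this section. The variable names are illustrative only and are not part of Tally.

```python
# Census targets and unweighted sample shares for gender, in percent.
target = {"Male": 49.0, "Female": 51.0}
sample = {"Male": 47.9, "Female": 52.1}

# Weight factor per group: target share divided by sample share.
factors = {group: target[group] / sample[group] for group in target}

for group, factor in factors.items():
    # Applying the factor to the sample share recovers the target share.
    print(f"{group}: factor {factor:.3f}, weighted share {sample[group] * factor:.1f}%")
```

Men are under-represented in the sample, so their factor is slightly above 1, and women's is slightly below 1. With more than one weighting variable the factors interact, which is why an iterative method is used further down.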


In [None]:
#
# In order to run this notebook, you first have to install Tally. To install tally you need a token that gives you access.
#
from google.colab import files
import json
import os
# Check if the file 'tally_keys.json' exists
if not os.path.exists('tally_keys.json'):
  uploaded = files.upload()
  # Assuming only one file is uploaded, get its filename and content
  filename = list(uploaded.keys())[0]
  file_content = uploaded[filename]
  # Load JSON directly from the uploaded content
  keys = json.loads(file_content.decode('utf-8'))
else:
  # If the file already exists, just load its content
  with open('tally_keys.json', 'r') as f:
      keys = json.load(f)

try:
  # Try to import the package
  import tally_core
except ImportError:
  # If the import fails, the package is not installed. Install it.
  !pip install git+https://{keys['tally_api']}@github.com/datasmoothie/tally-core.git@master

In [1]:
import tally_core as tc
import pandas as pd
import json

# Create the dataset and load its metadata (JSON) and case data (Parquet).
dataset = tc.DataSet("Sports stores")
with open('./data/Example Data (A).json') as f:
    meta = json.load(f)
data = pd.read_parquet('./data/Example Data (A).parquet')
dataset.from_components(meta_dict=meta, data_df=data)

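Our data doesn't match the census categories directly: `locality` has more than two categories, and age is stored as a numerical variable. The targets above cover only gender and location, so we only need to collapse the localities into two groups, which we do by creating a new variable with `DataSet.derive`.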

In [2]:
dataset.derive(name="urban_rural", label="Urban/rural", qtype="single", cond_map=[
        (2, "Rural/remote", {'locality': [4,5]}),
        (1, "Urban/sub-urban", {'locality': [1,2,3]}),
])
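
For readers more familiar with pandas, the recoding can be sketched without Tally. This is only an illustration of the mapping, not how `DataSet.derive` is implemented. The data frame is hypothetical, and leaving unmatched rows as missing is an assumption about the behavior.

```python
import numpy as np
import pandas as pd

# Hypothetical locality codes; None stands for a respondent with no locality.
df = pd.DataFrame({"locality": [1, 2, 3, 4, 5, None, 2]})

# Start with every row missing, then fill in the rows that match a condition,
# mirroring the two entries of the cond_map above.
df["urban_rural"] = np.nan
df.loc[df["locality"].isin([4, 5]), "urban_rural"] = 2      # Rural/remote
df.loc[df["locality"].isin([1, 2, 3]), "urban_rural"] = 1   # Urban/sub-urban

print(df)
print("missing:", int(df["urban_rural"].isna().sum()))
```

In this sketch, a respondent whose locality matches neither condition has no value for the derived variable and cannot be matched to a weighting target.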

# Run the weight algorithm
Now that we have the variables we need, we create a `Rim` object and set our weighting targets.

In [3]:
from tally_core.core.weights.rim import Rim

gender_targets = {'gender':{1:49, 2:51}}
locality_targets = {'urban_rural':{1:82.9 , 2:17.1}}

scheme = Rim('gender_age_locality')
scheme.set_targets(targets=[gender_targets, locality_targets])
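
RIM weighting is also known as raking or iterative proportional fitting: the weights are adjusted to match one variable's targets, then the next variable's, and the cycle repeats until the adjustments become negligible. The following is a minimal sketch of that idea in plain pandas, not Tally's implementation. The `rake` function, its convergence rule, and the sample counts are all hypothetical, while the targets are the ones set above.

```python
import pandas as pd

def rake(df, targets, max_iter=100, tol=1e-6):
    """Iterative proportional fitting; targets maps column -> {code: percent}."""
    w = pd.Series(1.0, index=df.index)
    for iteration in range(1, max_iter + 1):
        max_shift = 0.0
        for col, shares in targets.items():
            # Current weighted share (in percent) of each category.
            current = w.groupby(df[col]).sum() / w.sum() * 100
            # Factor that moves each category onto its target share.
            factors = pd.Series(shares) / current
            adjust = df[col].map(factors)
            w = w * adjust
            max_shift = max(max_shift, (adjust - 1).abs().max())
        if max_shift < tol:
            break  # a full pass changed no weight by more than tol
    # Rescale so the mean weight factor is 1.
    return w * len(w) / w.sum(), iteration

# Hypothetical sample: counts per (gender, urban_rural) cell.
cells = {(1, 1): 380, (1, 2): 100, (2, 1): 430, (2, 2): 90}
rows = [(g, u) for (g, u), n in cells.items() for _ in range(n)]
sample = pd.DataFrame(rows, columns=["gender", "urban_rural"])

targets = {"gender": {1: 49, 2: 51}, "urban_rural": {1: 82.9, 2: 17.1}}
weights, n_iter = rake(sample, targets)

for col in targets:
    print((weights.groupby(sample[col]).sum() / weights.sum() * 100).round(2))
print("iterations:", n_iter)
```

Fitting one variable slightly disturbs the other, which is why several passes are needed before both sets of targets hold at once.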

We run the algorithm with `DataSet.weight`. Unless we set the `report` parameter to `False`, a weight report is printed to the screen.

In [4]:
weight = dataset.weight(
        weight_name='weight_c', 
        unique_key='unique_id',
        weight_scheme=scheme
)

np.NaN found in weight variables:
gender           0
urban_rural    177
dtype: int64
Please check if weighted results are acceptable!

Weight variable       weights_gender_age_locality
Weight group                       _default_name_
Weight filter                                None
Total: unweighted                     8255.000000
Total: weighted                       8255.000000
Weighting efficiency                    99.696868
Iterations required                     16.000000
Mean weight factor                       1.000000
Minimum weight factor                    0.877281
Maximum weight factor                    1.046117
Weight factor ratio                      1.192454
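
Two of the report's figures can be unpacked with a short sketch. The weight factor ratio is the maximum weight factor divided by the minimum, which can be checked against the numbers in the report. Weighting efficiency is commonly defined with Kish's formula, 100 * (sum of weights)^2 / (n * sum of squared weights). Whether Tally's report uses exactly this definition is an assumption, and the `kish_efficiency` helper and its example weights are hypothetical.

```python
import numpy as np

def kish_efficiency(weights):
    """Kish's weighting efficiency in percent."""
    weights = np.asarray(weights, dtype=float)
    return 100 * weights.sum() ** 2 / (len(weights) * (weights ** 2).sum())

# Maximum and minimum weight factors copied from the report above.
reported_ratio = 1.046117 / 0.877281
print(round(reported_ratio, 4))

# Hypothetical weights: equal weights lose no efficiency, while
# uneven weights reduce it.
print(kish_efficiency([1.0, 1.0, 1.0, 1.0]))
print(kish_efficiency([0.5, 1.5]))
```

An efficiency close to 100, as in the report, means the weights are nearly uniform and the weighted sample retains almost all of its statistical precision.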



We've now run the RIM weighting algorithm and created a new variable, `weight_c`. We can run a crosstab to check whether it has worked.

In [5]:
weighted = dataset.crosstab(
        x=['gender', 'urban_rural'], 
        ci=['c%'], 
        w='weight_c').rename(columns={"Total":"Weighted"}, level=1)

unweighted = dataset.crosstab(
        x=['gender', 'urban_rural'], 
        ci=['c%']).rename(columns={"Total":"Unweighted"}, level=1)

pd.concat([unweighted, weighted], axis=1)

| Question | Values | Total, Unweighted | Total, Weighted |
|---|---|---|---|
| gender. What is your gender? | Base | 8255.0 | 8255.0 |
| gender. What is your gender? | Male | 47.9 | 48.9 |
| gender. What is your gender? | Female | 52.1 | 51.1 |
| urban_rural. Urban/rural | Base | 8078.0 | 8078.0 |
| urban_rural. Urban/rural | Rural/remote | 19.2 | 17.1 |
| urban_rural. Urban/rural | Urban/sub-urban | 80.8 | 82.9 |


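We run one crosstab with the weight applied and another without, combine the two into one dataframe with `pandas.concat`, and compare. The weighted percentages match the urban/rural targets exactly (17.1% and 82.9%) and come within 0.1 percentage points of the gender targets (48.9% men and 51.1% women against targets of 49% and 51%). The urban/rural base is 8078 rather than 8255 because 177 respondents have no `urban_rural` value, as the weight report noted.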