# Introduction to the `survey_tools` Package

In this short vignette, I demonstrate the four primary functions available in the package: `tabs`, `rake_weight`, `recode`, and `get_names`.

In [5]:
# Load in Packages
from survey_tools.survey_tools import tabs, rake_weight, recode, get_names
import pandas as pd
import numpy as np
import os

I am importing a survey dataset I have worked on in the past, the [American Family Survey](https://csed.byu.edu/american-family-survey), a national panel survey (N ≈ 3000) that studies trends in American family life over time.

In [30]:
link = 'https://csed.byu.edu/00000183-a4c5-d2da-abe3-feed7be30001/2021data'
data = pd.read_stata(link)
print(data.shape)
data.head()

(3000, 413)


| | caseid | weight | PAR006_treat | FAMTAX007_treat | s21_MSC001 | s21_MSC003 | s21_MSC003_b_1 | s21_MSC003_b_2 | s21_MSC003_b_3 | s21_MSC003_c | ... | votereg | ideo5 | newsint | religpew | pew_churatd | pew_bornagain | pew_religimp | pew_prayer | starttime | endtime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1492039695 | 0.698217 | Show rows: The coronavirus pandemic and Racial... | Treatment 1 ("pull parents away") | Not currently in a committed relationship | | not selected | not selected | selected | | ... | Yes | Very liberal | Most of the time | Protestant | Once a week | No | Very important | Once a day | 1940257000000.0 | 1940258000000.0 |
| 1 | 1492042119 | 1.195809 | Show rows: The coronavirus pandemic and Racial... | Treatment 2 ("encourage poverty") | Not currently in a committed relationship | | selected | not selected | not selected | 2005.0 | ... | Yes | Conservative | Most of the time | Roman Catholic | Never | No | Somewhat important | Seldom | 1940257000000.0 | 1940258000000.0 |
| 2 | 1492870805 | 1.155043 | Show rows: The coronavirus pandemic and Racial... | Control | Married | 7 years | not selected | not selected | selected | | ... | Yes | Moderate | Don't know | Nothing in particular | Never | No | Not at all important | Never | 1940258000000.0 | 1940258000000.0 |
| 3 | 1492850287 | 0.771161 | No extra rows on PAR006 | Treatment 2 ("encourage poverty") | Not currently in a committed relationship | | not selected | selected | not selected | | ... | Yes | Moderate | Most of the time | Roman Catholic | Seldom | No | Somewhat important | A few times a week | 1940257000000.0 | 1940258000000.0 |
| 4 | 1492863669 | 0.810394 | No extra rows on PAR006 | Treatment 1 ("pull parents away") | Currently in a committed relationship, but not... | | selected | not selected | not selected | 2005.0 | ... | Don't know | Conservative | Some of the time | Nothing in particular | Seldom | No | Not at all important | A few times a week | 1940257000000.0 | 1940258000000.0 |


Let's start by looking at a few tabs.

In [31]:
tabs(data, 'newsint', dropna=False)

Most of the time     1510
Some of the time      759
Only now and then     385
Hardly at all         211
Don't know            135
NaN                     0
dtype: int64

The output above tabulates news interest. Because `dropna=False` keeps a row for missing values and the `NaN` count is 0, we can see that this variable has no missing values.
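For readers who want to cross-check a tab with plain pandas, here is a minimal sketch on made-up answers. It is only an illustration of the idea, not how `tabs` is implemented: `value_counts(dropna=False)` counts missing values alongside the observed answers, and `isna().sum()` reports the number of missing responses directly.

```python
import pandas as pd

# Made-up answers for illustration; None marks a missing response.
answers = pd.Series(
    ["Most of the time", "Some of the time", None, "Most of the time"]
)

# dropna=False keeps missing values in the tabulation.
counts = answers.value_counts(dropna=False)

# Explicit count of missing responses.
n_missing = int(answers.isna().sum())

print(counts)
print("missing:", n_missing)
```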

Let's collapse this variable into fewer, broader categories.

In [32]:
data['newsint'] = data.newsint.cat.codes
data['newsint_rc'] = recode(data, 'newsint', "0=1;1:5=0")
data['newsint_rc']

0       1
1       1
2       0
3       1
4       0
       ..
2995    1
2996    1
2997    1
2998    0
2999    1
Name: newsint_rc, Length: 3000, dtype: int8

We've now recoded the variable so that the top-box (T1B) response, "Most of the time", is 1 and every other response is 0.
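The same top-box collapse can be sketched in plain pandas. This is an illustration on made-up responses, and it assumes the category order matches the order shown in the tab above, so that "Most of the time" receives code 0. Note that `cat.codes` gives missing values the code -1, which the comparison below would also map to 0.

```python
import pandas as pd

# Assumed category order, matching the order shown in the tab above.
levels = [
    "Most of the time",
    "Some of the time",
    "Only now and then",
    "Hardly at all",
    "Don't know",
]

# Made-up responses for illustration.
newsint = pd.Series(
    pd.Categorical(
        ["Most of the time", "Don't know", "Some of the time", "Most of the time"],
        categories=levels,
    )
)

# Integer codes follow the category order: 0 for the first level, 1 for the next, ...
codes = newsint.cat.codes

# Top-box indicator: 1 where the code is 0, otherwise 0.
top_box = (codes == 0).astype("int8")

print(top_box.tolist())  # [1, 0, 0, 1]
```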

Now let's look at the recoded variable tabulated by religion.

In [33]:
tabs(data, 'religpew', dropna=False)

Protestant                   830
Roman Catholic               559
Mormon                        37
Eastern or Greek Orthodox     17
Jewish                        92
Muslim                        29
Buddhist                      31
Hindu                          9
Atheist                      244
Agnostic                     209
Nothing in particular        667
Something else               276
NaN                            0
dtype: int64

In [38]:
tabs(data, 'religpew', 'newsint_rc', display="row").sort_values(1, ascending=False)

| | 1 | 0 |
|---|---|---|
| Buddhist | 67.7 | 32.3 |
| Atheist | 63.1 | 36.9 |
| Jewish | 62.0 | 38.0 |
| Agnostic | 57.4 | 42.6 |
| Protestant | 53.0 | 47.0 |
| Eastern or Greek Orthodox | 52.9 | 47.1 |
| Roman Catholic | 52.1 | 47.9 |
| Something else | 44.9 | 55.1 |
| Nothing in particular | 40.2 | 59.8 |
| Muslim | 37.9 | 62.1 |
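The row percentages above can be approximated in plain pandas with `pd.crosstab`. The sketch below uses made-up data and hypothetical column names (`religion`, `interested`); it shows the general idea of a row-normalized cross-tab and is not a claim about how `tabs` computes its result.

```python
import pandas as pd

# Made-up data: group A has 3 of 4 interested, group B has 1 of 2.
df = pd.DataFrame({
    "religion":   ["A", "A", "A", "A", "B", "B"],
    "interested": [1, 1, 1, 0, 1, 0],
})

# normalize="index" divides each row by its row total.
row_pct = pd.crosstab(df["religion"], df["interested"], normalize="index") * 100
row_pct = row_pct.round(1).sort_values(1, ascending=False)

print(row_pct)
```

Each row sums to 100, so the table reads as "of the respondents in this religious group, what share is news interested".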


Next, I look for a weighting variable to use in the tabs by searching the column names for `'w'` with `get_names`.

In [41]:
get_names(data,'w')

['weight',
 'faminc_new',
 'newsint',
 'religpew',
 'pew_churatd',
 'pew_bornagain',
 'pew_religimp',
 'pew_prayer',
 'newsint_rc']
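The output above suggests that `get_names` matches on substrings of the column names. A rough sketch of that idea, assuming plain case-sensitive substring matching (the package may behave differently, for example by using regular expressions), could look like this. `names_containing` is a hypothetical helper, not part of `survey_tools`.

```python
import pandas as pd

# Empty frame with a few illustrative column names.
df = pd.DataFrame(columns=["caseid", "weight", "gender", "newsint", "pew_prayer"])


def names_containing(frame, text):
    """Return the column names that contain `text` (hypothetical helper)."""
    return [name for name in frame.columns if text in name]


print(names_containing(df, "w"))       # ['weight', 'newsint', 'pew_prayer']
print(names_containing(df, "gender"))  # ['gender']
```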

In [42]:
tabs(data, 'religpew', 'newsint_rc', display="row", wts='weight').sort_values(1, ascending=False)

| | 1 | 0 |
|---|---|---|
| Buddhist | 62.0 | 38.0 |
| Atheist | 59.3 | 40.7 |
| Jewish | 58.6 | 41.4 |
| Agnostic | 54.2 | 45.8 |
| Protestant | 52.6 | 47.4 |
| Roman Catholic | 50.6 | 49.4 |
| Eastern or Greek Orthodox | 47.3 | 52.7 |
| Something else | 44.0 | 56.0 |
| Muslim | 43.5 | 56.5 |
| Nothing in particular | 37.8 | 62.2 |
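To see why the weighted percentages differ from the unweighted ones, here is a small pandas sketch of weighted row percentages. The data and column names are made up, and this is one common way to compute such a table rather than a description of what `tabs` does internally: sum the weights in each cell, then divide by the total weight of the row.

```python
import pandas as pd

# Made-up data: in group A, the one interested respondent carries a larger weight.
df = pd.DataFrame({
    "religion":   ["A", "A", "A", "B", "B"],
    "interested": [1, 0, 0, 1, 0],
    "weight":     [2.0, 1.0, 1.0, 1.0, 3.0],
})

# Sum of weights in each religion x interest cell.
cell_wts = df.groupby(["religion", "interested"])["weight"].sum().unstack(fill_value=0)

# Divide each row by its total weight to get weighted row percentages.
row_pct = (cell_wts.div(cell_wts.sum(axis=1), axis=0) * 100).round(1)

print(row_pct)
```

In group A, one of three respondents is interested (33.3% unweighted), but that respondent holds half of the group's total weight, so the weighted share is 50%.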


In [44]:
tabs(data, 'religpew')

Protestant                   830
Roman Catholic               559
Mormon                        37
Eastern or Greek Orthodox     17
Jewish                        92
Muslim                        29
Buddhist                      31
Hindu                          9
Atheist                      244
Agnostic                     209
Nothing in particular        667
Something else               276
dtype: int64

Next, I'll test out the raking function, starting with a search for the gender variable.

In [46]:
get_names(data,"gender")

['s21_MSC014_gender_child1',
 's21_MSC014_gender_child2',
 's21_MSC014_gender_child3',
 's21_MSC014_gender_child4',
 's21_MSC014_gender_child5',
 's21_MSC014_gender_child6',
 's21_MSC014_gender_child7',
 's21_MSC014_gender_child8',
 'Pick5to18Child_gender',
 'gender']

In [58]:
data['age'] = 2021 - data.birthyr
data['age_rc'] = recode(data, 'age', '0:30=1;31:45=2;46:65=3;66:120=4')
tabs(data, 'age_rc', display='column')

3    35.1
2    26.8
4    19.6
1    18.6
dtype: float64
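The age bands in the recode above can also be expressed with `pd.cut`. This is a sketch on made-up ages, assuming integer ages so that the bands 0 to 30, 31 to 45, 46 to 65, and 66 to 120 line up with right-inclusive bin edges.

```python
import pandas as pd

# Made-up ages chosen to sit on both sides of each boundary.
ages = pd.Series([18, 30, 31, 45, 46, 65, 66, 90])

# Bin edges are right-inclusive: (-1, 30], (30, 45], (45, 65], (65, 120].
age_rc = pd.cut(ages, bins=[-1, 30, 45, 65, 120], labels=[1, 2, 3, 4]).astype(int)

print(age_rc.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4]
```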

In [60]:
data['gender_rc'] = recode(data, 'gender', '"Male"=1;"Female"=2')
tabs(data, 'gender_rc', display='column')



1    46.8
2    53.2
dtype: float64

In [61]:
true_props = pd.DataFrame({
    'Names':['gender','gender','age_rc','age_rc','age_rc','age_rc',],
    'Levels':['Male', 'Female',1,2,3,4],
    'Proportions':[0.5,0.5,0.2,0.25,0.35,0.2],
})

data_w_new_wts = rake_weight(data, true_props, weight_nm='new_weight')

Variable:  gender
Male      50.0
Female    50.0
dtype: float64
Variable:  age_rc
3    35.0
2    25.0
4    20.0
1    20.0
dtype: float64

            Iterations: 1
            Max Weight: 1.1487352180792596
            Min Weight: 0.876682464644851
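For intuition about what raking does, here is a minimal sketch of the classic approach (iterative proportional fitting) in plain pandas. It is an assumption on my part that this resembles what `rake_weight` does; the function name `rake_sketch`, its parameters, and the toy data are all hypothetical. Each pass rescales the weights so one variable's weighted shares match its targets, then moves to the next variable, and repeats until the shares stop changing.

```python
import pandas as pd


def rake_sketch(df, targets, max_iter=100, tol=1e-8):
    """Rake weights to target proportions (hypothetical helper).

    targets maps a column name to {level: target proportion}.
    """
    weights = pd.Series(1.0, index=df.index)  # start from equal weights
    for _ in range(max_iter):
        largest_gap = 0.0
        for column, props in targets.items():
            target = pd.Series(props)
            # Current weighted share of each level of this variable.
            current = weights.groupby(df[column]).sum() / weights.sum()
            largest_gap = max(largest_gap, float((current - target).abs().max()))
            # Scale every row by target share / current share for its level.
            weights = weights * df[column].map(target / current)
        if largest_gap < tol:
            break
    # Normalize so the weights average to 1.
    return weights * len(weights) / weights.sum()


# Toy sample: 5 men and 3 women; 3 respondents in age group 1, 5 in group 2.
df = pd.DataFrame({
    "gender": ["Male"] * 5 + ["Female"] * 3,
    "age_rc": [1, 1, 2, 2, 2, 1, 2, 2],
})
targets = {
    "gender": {"Male": 0.5, "Female": 0.5},
    "age_rc": {1: 0.4, 2: 0.6},
}

weights = rake_sketch(df, targets)
gender_share = weights.groupby(df["gender"]).sum() / weights.sum()
age_share = weights.groupby(df["age_rc"]).sum() / weights.sum()
print(gender_share.round(3))
print(age_share.round(3))
```

The sketch starts from equal weights and assumes every level in `targets` appears in the data; a production routine would also want to guard against empty levels and extreme weights.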
            
