# PewRel2019

<img src='images/PF_20.09.12_teens_featured.webp'>
<i>Image from <a href="https://www.pewresearch.org/religion/wp-content/uploads/sites/7/2020/09/PF_20.09.12_teens_featured.jpg?resize=640,360">Pew Research</a></i>

The Pew Research Center is a nonpartisan think tank that conducts research and polling on a variety of issues, including religion. In this notebook we conceptually replicate analyses from their report on a 2019 survey of teens and parents, ["U.S. Teens Take After Their Parents Religiously, Attend Services Together and Enjoy Family Rituals"](https://www.pewresearch.org/religion/2020/09/10/u-s-teens-take-after-their-parents-religiously-attend-services-together-and-enjoy-family-rituals/).

You can read about their methodology [here.](https://www.pewresearch.org/religion/2020/09/10/methodology-34/)  You will need to download their questions [here.](https://www.pewresearch.org/religion/wp-content/uploads/sites/7/2020/09/PF_09.10.20_teens.religion.topline.pdf) 

You will be provided the data and codebook, but you can also download them, along with other Pew data, by [registering for an account.](https://www.pewresearch.org/profile/registration/) (They have a lot of great stuff!)

First, download the notebook and SPSS file titled `US Teens and their parents - 2019 Pew Research Survey - FOR RELEASE.sav` and place them in the same folder.

Note: there will be some slight differences between our calculations and those in the research center's report. This is due to weightings we are not privy to, but we'll get pretty close regardless. Additionally, for educational purposes we will sometimes not do things in the most efficient way. If you can do it more quickly, great!

### Initial Install

You will need to install `pyreadstat`, which pandas uses to read SPSS files:

`pip install pyreadstat`

or

`conda install pyreadstat`

In [1]:
pip install pyreadstat

Collecting pyreadstat
  Downloading pyreadstat-1.1.9-cp38-cp38-macosx_10_9_x86_64.whl (574 kB)
     |████████████████████████████████| 574 kB 4.1 MB/s
Collecting pandas>=1.2.0
  Downloading pandas-1.4.4-cp38-cp38-macosx_10_9_x86_64.whl (11.4 MB)
     |████████████████████████████████| 11.4 MB 12.5 MB/s
Installing collected packages: pandas, pyreadstat
  Attempting uninstall: pandas
    Found existing installation: pandas 1.0.5
    Uninstalling pandas-1.0.5:
      Successfully uninstalled pandas-1.0.5
Successfully installed pandas-1.4.4 pyreadstat-1.1.9
Note: you may need to restart the kernel to use updated packages.


### Initial Imports

In [1]:
#import common libraries

import numpy as np
import pandas as pd

In [2]:
#import data from SPSS

pewdata = pd.read_spss('US Teens and their parents - 2019 Pew Research Survey - FOR RELEASE.sav')

Let's explore `pewdata`

In [3]:
# view shape 

pewdata.shape

(1811, 162)

In [4]:
# view head

pewdata.head()

Unnamed: 0,CaseID,parent_wt,teen_wt,XSPANISH,XPPRACEM,PPETHM,PPAGEREC,PPEDUCAT,PPGENDER,PPINCIMPREC,...,tparsame3oe,TGODMORAL,TSEXASK,thisp,TRACECMB,tracethn,TPARTY,TPARTYLN,tpartysum,TIDEO
0,12.0,0.4651,0.337,English,White,"White, Non-Hispanic",50-64,Bachelor's degree or higher,Female,"$75,000 or more",...,,It is NOT necessary to believe in God in order...,Female,No,White,White non-Hispanic,Independent,The Democratic Party,Democrat/lean Democrat,Moderate
1,15.0,4.4208,4.4733,English,White,Hispanic,30-49,Less than high school,Female,"$75,000 or more",...,One parent is religious or cares more (e.g. be...,It is NOT necessary to believe in God in order...,Male,Yes,White,Hispanic,Something else,The Republican Party,Republican/lean Republican,Moderate
2,16.0,1.2422,1.8688,English,White,"White, Non-Hispanic",30-49,Bachelor's degree or higher,Male,"$75,000 or more",...,,It is NOT necessary to believe in God in order...,Male,No,White,White non-Hispanic,Something else,The Republican Party,Republican/lean Republican,Conservative
3,19.0,1.1299,1.0989,English,White,"White, Non-Hispanic",30-49,Some college,Male,"$75,000 or more",...,,It is NOT necessary to believe in God in order...,Female,No,White,White non-Hispanic,Something else,The Democratic Party,Democrat/lean Democrat,Very liberal
4,21.0,0.2653,0.1712,Spanish,White,Hispanic,50-64,Some college,Female,"$30,000 to less than $75,000",...,,It is necessary to believe in God in order to ...,Female,Yes,White,Hispanic,Democrat,,Democrat/lean Democrat,Liberal


In [5]:
# view columns

pewdata.columns

Index(['CaseID', 'parent_wt', 'teen_wt', 'XSPANISH', 'XPPRACEM', 'PPETHM',
       'PPAGEREC', 'PPEDUCAT', 'PPGENDER', 'PPINCIMPREC',
       ...
       'tparsame3oe', 'TGODMORAL', 'TSEXASK', 'thisp', 'TRACECMB', 'tracethn',
       'TPARTY', 'TPARTYLN', 'tpartysum', 'TIDEO'],
      dtype='object', length=162)

In [6]:
# print all columns

for col in pewdata.columns:
    print(col)

CaseID
parent_wt
teen_wt
XSPANISH
XPPRACEM
PPETHM
PPAGEREC
PPEDUCAT
PPGENDER
PPINCIMPREC
PPMARIT
PPHHSIZEREC
PPMSACAT
PPREG4
DOV_ACSLANG
DOV_TACSLANG
S1
S2
PTIME
PMOTHER
PFATHER
PSPRELAT1
PSPRELAT2
PTACTVa
PTACTVb
PTACTVc
PTACTVd
PTACTVe
preligrec
PBORN
Preltrad
spreligrec
sphisp
SPRACECMB
spracethn
SPBORN
SPreltrad
PATTEND
PRELIMP
PPRAY
PGOD1
PGOD2
PEXCLUS
KIDTRAIT1a
KIDTRAIT1b
kidtrait1c
kidtrait1d
PTEENRELIG2
pteenrelig3oe
PTEENIMP
pshare1
pshare2
PGODMORAL
KIDTRAIT2a
KIDTRAIT2b
KIDTRAIT2c
KIDTRAIT2d
KIDTRAIT2e
KIDTRAIT2f
PPARTY
PPARTYLN
ppartysum
PIDEO
HOUSEEDUCREC
S4
Teen_lang
TAGE
TSATISFY
TFITIN
TGUIDE1a
TGUIDE1b
TGUIDE2a
TGUIDE2b
TGUIDE2c
TGUIDE2d
tschool
TGRADEREC
TEXPa
TEXPb
TEXPc
TEXPd
TEXPe
TEXPf
treligrec
tborn
treltrad
TATTEND
TATTDWTH
TATTDRSN
TRELIMP
TPRAY
TGOD1
TGOD2
TEXCLUS
TPRACTICES1
TPRACTICES2
TPRACTICES3
TPRACTICES4
TOBSERV1
TOBSERV2
TOBSERV3
TOBSERV4
TOBSERV5
TENJOY
TOBLIG
TFAMTALK
TRACESURV21
TRACESURV22
TPARRELIG1
TPARRELIG2
TPARRELIG3
tsimpar2oe
TPARIMP
TFRTA

The `describe()` method provides summary statistics for numeric variables.

In [7]:
# use describe method

pewdata.describe()

Unnamed: 0,CaseID,parent_wt,teen_wt
count,1811.0,1811.0,1811.0
mean,1286.601325,1.0,1.0
std,756.782246,0.889595,0.919161
min,12.0,0.1232,0.1345
25%,630.5,0.42505,0.39355
50%,1252.0,0.7198,0.687
75%,1943.5,1.28905,1.2145
max,2679.0,4.9789,4.7234


In [8]:
# print all data types

for i in pewdata.dtypes:
    print(i)

float64
float64
float64
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
category
...

Only the first three columns are numeric (we'll come back to the categorical ones).
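Rather than reading a long list of dtypes, the split between numeric and categorical columns can be checked directly with `select_dtypes`. The sketch below uses a small made-up frame (`toy`) in place of `pewdata`, so the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for pewdata: two numeric columns and two categorical ones.
toy = pd.DataFrame({
    "CaseID": [12.0, 15.0, 16.0],
    "parent_wt": [0.4651, 4.4208, 1.2422],
    "PPGENDER": pd.Categorical(["Female", "Female", "Male"]),
    "XSPANISH": pd.Categorical(["English", "English", "English"]),
})

# Names of the numeric columns
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Number of categorical columns
n_categorical = toy.select_dtypes(include="category").shape[1]

print(numeric_cols)
print(n_categorical)
```

The same two calls should work unchanged on `pewdata`.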

But are they useful?

Let's look at `XPPRACEM` and `PPMARIT`, both here and in the codebook.

In [9]:
pewdata.XPPRACEM

0       White
1       White
2       White
3       White
4       White
        ...  
1806    White
1807    White
1808    White
1809    White
1810    White
Name: XPPRACEM, Length: 1811, dtype: category
Categories (6, object): ['2 + Races', 'American Indian Or Alaska Native', 'Asian', 'Black or African American', 'Native Hawaiian/Pacific Islander', 'White']

When called on a non-numeric column, `describe()` returns a different set of statistics: the count, the number of unique values, the most common value, and its frequency.

In [10]:
# describe() XPPRACEM

pewdata.XPPRACEM.describe()

count      1811
unique        6
top       White
freq       1526
Name: XPPRACEM, dtype: object

To include all variables, add `include = 'all'`.

In [11]:
# describe() all columns

pewdata.describe(include = 'all')

Unnamed: 0,CaseID,parent_wt,teen_wt,XSPANISH,XPPRACEM,PPETHM,PPAGEREC,PPEDUCAT,PPGENDER,PPINCIMPREC,...,tparsame3oe,TGODMORAL,TSEXASK,thisp,TRACECMB,tracethn,TPARTY,TPARTYLN,tpartysum,TIDEO
count,1811.0,1811.0,1811.0,1811,1811,1811,1811,1811,1811,1811,...,233,1811,1811,1811,1811,1811,1811,868,1811,1811
unique,,,,2,6,5,4,4,2,3,...,9,3,3,3,6,5,5,3,3,6
top,,,,English,White,"White, Non-Hispanic",30-49,Bachelor's degree or higher,Female,"$75,000 or more",...,Made own decision based on what felt right or ...,It is NOT necessary to believe in God in order...,Male,No,White,White non-Hispanic,Democrat,The Democratic Party,Democrat/lean Democrat,Moderate
freq,,,,1618,1526,1150,1280,786,1232,838,...,57,1115,912,1306,1376,1022,484,419,903,767
mean,1286.601325,1.0,1.0,,,,,,,,...,,,,,,,,,,
std,756.782246,0.889595,0.919161,,,,,,,,...,,,,,,,,,,
min,12.0,0.1232,0.1345,,,,,,,,...,,,,,,,,,,
25%,630.5,0.42505,0.39355,,,,,,,,...,,,,,,,,,,
50%,1252.0,0.7198,0.687,,,,,,,,...,,,,,,,,,,
75%,1943.5,1.28905,1.2145,,,,,,,,...,,,,,,,,,,


Use `value_counts()` to count the rows in each category of a categorical variable.

In [12]:
# value_counts() for XPPRACEM

pewdata.XPPRACEM.value_counts()

White                               1526
Black or African American            154
Asian                                 62
2 + Races                             45
American Indian Or Alaska Native      21
Native Hawaiian/Pacific Islander       3
Name: XPPRACEM, dtype: int64

In [13]:
# view PPMARIT

pewdata.PPMARIT

0       Married
1       Married
2       Married
3       Married
4       Married
         ...   
1806    Married
1807    Married
1808    Married
1809    Married
1810    Married
Name: PPMARIT, Length: 1811, dtype: category
Categories (6, object): ['Divorced', 'Living with partner', 'Married', 'Never married', 'Separated', 'Widowed']

In [14]:
# value_counts() for PPMARIT

pewdata.PPMARIT.value_counts()

Married                1401
Divorced                164
Never married           103
Living with partner      67
Separated                48
Widowed                  28
Name: PPMARIT, dtype: int64

#### Student Practice
Pause the video and try to perform the following tasks on the `pewdata` dataset. Then check your answers as I walk through the solutions. 

**Exercise**: How many parents are agnostic while their children are Roman Catholic? 


I am intentionally not telling you which column names to look for, because I want you to use the codebook to figure this out. This will be great practice in reading a codebook or data dictionary. Be careful with lowercase and uppercase letters!

In [15]:
q = pewdata[(pewdata.preligrec=='Agnostic (not sure if there is a God)') & (pewdata.treligrec=='Roman Catholic')]
q
# only 3 parents match

Unnamed: 0,CaseID,parent_wt,teen_wt,XSPANISH,XPPRACEM,PPETHM,PPAGEREC,PPEDUCAT,PPGENDER,PPINCIMPREC,...,tparsame3oe,TGODMORAL,TSEXASK,thisp,TRACECMB,tracethn,TPARTY,TPARTYLN,tpartysum,TIDEO
350,499.0,1.6294,2.052,English,Black or African American,"Black, Non-Hispanic",30-49,Bachelor's degree or higher,Male,"$75,000 or more",...,"Chose to follow one parent, no specific reason...",It is necessary to believe in God in order to ...,Female,No,Black or African American,Black non-Hispanic,Something else,The Democratic Party,Democrat/lean Democrat,Moderate
358,509.0,1.0166,0.8487,English,White,"White, Non-Hispanic",30-49,Some college,Male,"$75,000 or more",...,Don't know/refused,It is NOT necessary to believe in God in order...,Male,No,White,White non-Hispanic,Something else,The Republican Party,Republican/lean Republican,Moderate
378,536.0,3.9922,3.1924,English,White,"White, Non-Hispanic",30-49,Some college,Male,"$75,000 or more",...,,It is necessary to believe in God in order to ...,Female,No,White,White non-Hispanic,Independent,The Democratic Party,Democrat/lean Democrat,Moderate


In [16]:
len(q)

3
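If only the count is needed, the filtered DataFrame does not have to be built at all: summing the boolean mask gives the same number, because `True` counts as 1. This is a minimal sketch on made-up rows; the labels other than the two used in the exercise are placeholders, not codebook values.

```python
import pandas as pd

# Toy stand-in for pewdata; rows are invented for illustration.
toy = pd.DataFrame({
    "preligrec": [
        "Agnostic (not sure if there is a God)",
        "Roman Catholic",
        "Agnostic (not sure if there is a God)",
        "Placeholder label",
    ],
    "treligrec": [
        "Roman Catholic",
        "Roman Catholic",
        "Placeholder label",
        "Roman Catholic",
    ],
})

# Boolean mask: True where both conditions hold
mask = (toy["preligrec"] == "Agnostic (not sure if there is a God)") & (
    toy["treligrec"] == "Roman Catholic"
)

# Summing a boolean Series counts the True values
n_matches = int(mask.sum())
print(n_matches)
```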

In [17]:
pewdata.size

293382

**Exercise:** Which five parent/child denomination combinations are largest?  In other words, group by the parent's denomination and then the teen's denomination and see which five combinations are the largest.

In [18]:
pewdata.groupby(['Preltrad', 'treltrad']).size().nlargest(5)  
# size() is the simplest way to count rows per group: it returns a Series holding the number of rows in each group

Preltrad                                       treltrad                                     
Catholic                                       Catholic                                         398
Theologically Evangelical Protestant Churches  Theologically Evangelical Protestant Churches    353
Unaffiliated                                   Unaffiliated                                     344
Historic Mainline Protestant Churches          Historic Mainline Protestant Churches            145
Catholic                                       Unaffiliated                                      79
dtype: int64
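As an alternative sketch, `DataFrame.value_counts()` on the two columns counts each combination and sorts the result in descending order, so `head()` returns the largest pairs. The frame below is made up for illustration and is not the Pew data.

```python
import pandas as pd

# Toy parent/teen tradition pairs
toy = pd.DataFrame({
    "Preltrad": ["Catholic", "Catholic", "Catholic",
                 "Unaffiliated", "Unaffiliated", "Catholic"],
    "treltrad": ["Catholic", "Catholic", "Unaffiliated",
                 "Unaffiliated", "Unaffiliated", "Catholic"],
})

# Count each (parent, teen) combination, largest first, and keep the top two
top_pairs = toy[["Preltrad", "treltrad"]].value_counts().head(2)
print(top_pairs)
```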

In [70]:
#Example:
#pewdata['Preltrad'].groupby(pewdata.Preltrad).count()

#or
#pewdata.groupby('Preltrad').size()

# Let's recreate this!

<img src='images/PF_09.10.20_religion.teens-00-0.webp' width = "400">
<i>Image from <a href="https://www.pewresearch.org/religion/2020/09/10/u-s-teens-take-after-their-parents-religiously-attend-services-together-and-enjoy-family-rituals/">Pew Research</a></i>


Look at the codebook

In [46]:
# value counts for Preltrad

pewdata.Preltrad.value_counts()

Catholic                                         498
Theologically Evangelical Protestant Churches    441
Unaffiliated                                     404
Historic Mainline Protestant Churches            246
Historically Black Protestant Churches            71
Mormon                                            51
Jewish                                            21
Other Faiths                                      16
Muslim                                            14
Hindu                                             11
Jehovah's Witness                                  9
Buddhist                                           8
Orthodox Christian                                 8
DK/REF                                             7
Other Christian                                    4
Other World Religions                              2
Name: Preltrad, dtype: int64

We want only 'Evangelical Protestant', 'Mainline Protestant', 'Catholic', and 'Unaffiliated', the four groups shown in the chart.

In [19]:
# recode values as numbers that match the order in the chart

pewdata.loc[pewdata['Preltrad'] == 'Theologically Evangelical Protestant Churches', 'prelid'] = 1
pewdata.loc[pewdata['Preltrad'] == 'Historic Mainline Protestant Churches', 'prelid'] = 2
pewdata.loc[pewdata['Preltrad'] == 'Catholic', 'prelid'] = 3
pewdata.loc[pewdata['Preltrad'] == 'Unaffiliated', 'prelid'] = 4

pewdata.loc[pewdata['treltrad'] == 'Theologically Evangelical Protestant Churches', 'trelid'] = 1
pewdata.loc[pewdata['treltrad'] == 'Historic Mainline Protestant Churches', 'trelid'] = 2
pewdata.loc[pewdata['treltrad'] == 'Catholic', 'trelid'] = 3
pewdata.loc[pewdata['treltrad'] == 'Unaffiliated', 'trelid'] = 4
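The eight assignments above can also be written as a single dictionary lookup per column. The sketch below is one possible shortcut, shown on a small made-up column: `map` returns `NaN` for any label missing from the dictionary, which matches what the `.loc` approach leaves behind for the other traditions. The name `order` is just an illustrative choice.

```python
import pandas as pd

# Chart order for the four groups we care about
order = {
    "Theologically Evangelical Protestant Churches": 1,
    "Historic Mainline Protestant Churches": 2,
    "Catholic": 3,
    "Unaffiliated": 4,
}

# Toy stand-in for the Preltrad column
toy = pd.DataFrame({"Preltrad": pd.Categorical([
    "Catholic",
    "Mormon",
    "Unaffiliated",
    "Theologically Evangelical Protestant Churches",
])})

# Convert to plain objects first so map does a simple label lookup;
# labels not in the dictionary (here, Mormon) become NaN
toy["prelid"] = toy["Preltrad"].astype(object).map(order)
print(toy)
```

On the real data the same pattern would apply to both `Preltrad` and `treltrad`.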

In [20]:
# crosstabs using our two variables

pd.crosstab(pewdata.prelid, pewdata.trelid)

trelid,1.0,2.0,3.0,4.0
prelid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1.0,353,29,4,47
2.0,27,145,4,61
3.0,6,9,398,79
4.0,8,20,18,344


Include `normalize = 'index'` to get row proportions (each row sums to 1) instead of counts.

In [21]:
# add `normalize = 'index'` to get percentages, and round to two decimals

our_table = pd.crosstab(pewdata.prelid, pewdata.trelid, normalize = 'index').round(2)
our_table

trelid,1.0,2.0,3.0,4.0
prelid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1.0,0.82,0.07,0.01,0.11
2.0,0.11,0.61,0.02,0.26
3.0,0.01,0.02,0.81,0.16
4.0,0.02,0.05,0.05,0.88
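To see what `normalize = 'index'` does, here is a small sketch on invented codes: each count is divided by its row total, so every row of the result adds up to 1.

```python
import pandas as pd

# Toy parent (prelid) and teen (trelid) codes
toy = pd.DataFrame({
    "prelid": [1, 1, 1, 1, 2, 2],
    "trelid": [1, 1, 1, 2, 2, 1],
})

# Row-normalized crosstab: share of teens in each group, per parent group
shares = pd.crosstab(toy["prelid"], toy["trelid"], normalize="index")

# Each row should sum to 1
row_totals = shares.sum(axis=1)
print(shares)
print(row_totals)
```

Because rounding is applied after normalizing, rounded rows in the table above can add up to slightly more or less than 1.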


How accurate are we?

<img src='images/PF_09.10.20_religion.teens-00-0.webp' width = "300" align = "left">


In [None]:
# recreate the table in a DF
# keys are the teen groups (columns); list positions are the parent groups (rows)

pew_numbers = {1: [.80, .12, .01, .02], 2: [.06, .55, .01, .03],
    3: [.01, .04, .81, .05], 4: [.12, .24, .15, .86]}

pew_table = pd.DataFrame(data = pew_numbers, index = [1, 2, 3, 4])
pew_table

In [None]:
# subtract their values from ours to two decimals

(our_table - pew_table).round(2)

## Categorical data

Read more about categorical data from the [Pandas documentation](https://pandas.pydata.org/docs/user_guide/categorical.html)

Categorical data is used extensively in statistics.

Features:
* categorical variables take on a limited, usually fixed, number of possible values
* categorical variables can have an order (ordinal data)
* all values are either in categories or np.nan
* if an order is present (e.g., "strongly agree", "agree", "neutral", etc.), sorting will use the logical order instead of the lexical order
* especially useful in survey research
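The features listed above can be seen in a short sketch. The answer scale here is made up for illustration; the point is that an ordered categorical sorts and compares by its declared order, not alphabetically.

```python
import pandas as pd

# Declared logical order, lowest to highest (hypothetical scale)
levels = ["disagree", "neutral", "agree", "strongly agree"]

answers = pd.Series(["agree", "strongly agree", "neutral", "agree"])

# Convert to an ordered categorical with the declared levels
ordered_answers = answers.astype(pd.CategoricalDtype(categories=levels, ordered=True))

# Sorting follows the logical order, and max() is available because it is ordered
sorted_list = ordered_answers.sort_values().tolist()
highest = ordered_answers.max()

print(sorted_list)
print(highest)
```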

In [22]:
# view annual household income of participating parent, PPINCIMPREC 

pewdata['PPINCIMPREC']

0                    $75,000 or more
1                    $75,000 or more
2                    $75,000 or more
3                    $75,000 or more
4       $30,000 to less than $75,000
                    ...             
1806    $30,000 to less than $75,000
1807               Less than $30,000
1808                 $75,000 or more
1809                 $75,000 or more
1810                 $75,000 or more
Name: PPINCIMPREC, Length: 1811, dtype: category
Categories (3, object): ['$30,000 to less than $75,000', '$75,000 or more', 'Less than $30,000']

In [23]:
# PPINCIMPREC value counts

pewdata['PPINCIMPREC'].value_counts()

$75,000 or more                 838
$30,000 to less than $75,000    627
Less than $30,000               346
Name: PPINCIMPREC, dtype: int64

In [24]:
# sort PPINCIMPREC

pewdata['PPINCIMPREC'].sort_values()
#notice it's out of logical order

487     $30,000 to less than $75,000
460     $30,000 to less than $75,000
810     $30,000 to less than $75,000
463     $30,000 to less than $75,000
809     $30,000 to less than $75,000
                    ...             
1489               Less than $30,000
1164               Less than $30,000
373                Less than $30,000
1522               Less than $30,000
977                Less than $30,000
Name: PPINCIMPREC, Length: 1811, dtype: category
Categories (3, object): ['$30,000 to less than $75,000', '$75,000 or more', 'Less than $30,000']

To fix this, we use `reorder_categories`.

In our case, we call `pewdata['PPINCIMPREC'].cat.reorder_categories()` and pass a list of the category labels in the order we want.

In [25]:
# reorder categories

pewdata['PPINCIMPREC'] = pewdata['PPINCIMPREC'].cat.reorder_categories(
    ['Less than $30,000', '$30,000 to less than $75,000', '$75,000 or more'])

In [26]:
# check values

pewdata['PPINCIMPREC'].values

['$75,000 or more', '$75,000 or more', '$75,000 or more', '$75,000 or more', '$30,000 to less than $75,000', ..., '$30,000 to less than $75,000', 'Less than $30,000', '$75,000 or more', '$75,000 or more', '$75,000 or more']
Length: 1811
Categories (3, object): ['Less than $30,000', '$30,000 to less than $75,000', '$75,000 or more']

In [27]:
# check sorted values

pewdata['PPINCIMPREC'].sort_values()

606     Less than $30,000
893     Less than $30,000
1362    Less than $30,000
357     Less than $30,000
354     Less than $30,000
              ...        
786       $75,000 or more
784       $75,000 or more
783       $75,000 or more
775       $75,000 or more
1810      $75,000 or more
Name: PPINCIMPREC, Length: 1811, dtype: category
Categories (3, object): ['Less than $30,000', '$30,000 to less than $75,000', '$75,000 or more']
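Once the categories are reordered, other outputs can follow the same order. As a sketch on a few made-up rows, `value_counts()` sorts by frequency by default, but chaining `sort_index()` arranges the counts by category order instead.

```python
import pandas as pd

income_order = ["Less than $30,000", "$30,000 to less than $75,000", "$75,000 or more"]

# Toy income column; categories start out in lexical order
income = pd.Series(pd.Categorical([
    "$75,000 or more",
    "Less than $30,000",
    "$75,000 or more",
    "$30,000 to less than $75,000",
]))

# Put the categories in logical order, as in the cell above
income = income.cat.reorder_categories(income_order)

# Counts arranged by category order rather than by frequency
counts_in_order = income.value_counts().sort_index()
print(counts_in_order)
```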

# Importance

Let's turn to the importance of religion and try to match the values in this chart:

<img src='images/PF_09.10.20_religion.teens-00-1.jpg' width = "400">
<i>Image from <a href="https://www.pewresearch.org/religion/2020/09/10/u-s-teens-take-after-their-parents-religiously-attend-services-together-and-enjoy-family-rituals/">Pew Research</a></i>

In [28]:
# importance value counts

pewdata.PRELIMP.value_counts()

Very important          824
Somewhat important      525
Not at all important    252
Not too important       203
Refused                   7
Name: PRELIMP, dtype: int64

In [29]:
# normalize value counts

pewdata.PRELIMP.value_counts(normalize = True)

Very important          0.454997
Somewhat important      0.289895
Not at all important    0.139150
Not too important       0.112093
Refused                 0.003865
Name: PRELIMP, dtype: float64

In [30]:
# remove 'Refused' responses

df3 = pewdata.loc[pewdata['PRELIMP'] != 'Refused']

In [31]:
# normalized value counts, no refused

df3.PRELIMP.value_counts(normalize = True)

Very important          0.456763
Somewhat important      0.291020
Not at all important    0.139690
Not too important       0.112528
Refused                 0.000000
Name: PRELIMP, dtype: float64
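Notice that `Refused` still appears with a share of 0: filtering the rows does not remove the label from the column's list of categories. One way to drop it, sketched here on made-up rows, is `cat.remove_unused_categories()`.

```python
import pandas as pd

# Toy stand-in for the PRELIMP column
toy = pd.DataFrame({"PRELIMP": pd.Categorical([
    "Very important",
    "Refused",
    "Somewhat important",
    "Very important",
])})

# Drop the 'Refused' rows; copy so the assignment below does not touch toy
kept = toy.loc[toy["PRELIMP"] != "Refused"].copy()

# Remove category labels that no longer occur in the data
kept["PRELIMP"] = kept["PRELIMP"].cat.remove_unused_categories()

shares = kept["PRELIMP"].value_counts(normalize=True)
print(shares)
```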

In [32]:
# remove 'Refused' for teens

df4 = pewdata.loc[pewdata['TRELIMP'] != 'Refused']

In [33]:
# normalized value counts for teens, no refused

df4.TRELIMP.value_counts(normalize = True)

Somewhat important      0.367574
Very important          0.247640
Not too important       0.204886
Not at all important    0.179900
Refused                 0.000000
Name: TRELIMP, dtype: float64

# Percent of teens holding same religious beliefs as parents



<img src='images/PF_09.10.20_religion.teens-00-2.webp' width = "400">
<i>Image from <a href="https://www.pewresearch.org/religion/2020/09/10/u-s-teens-take-after-their-parents-religiously-attend-services-together-and-enjoy-family-rituals/">Pew Research</a></i>

Let's start with the first set of percentages.

To the codebook!

In [34]:
# value counts 

pewdata.TPARRELIG1.value_counts(normalize = True).round(2)

All the same religious beliefs        0.49
Some of the same religious beliefs    0.42
Quite different religious beliefs     0.08
Refused                               0.01
Name: TPARRELIG1, dtype: float64

"...but among teens who say their beliefs differ, a third say the parent is unaware"

How should we examine this?

1. Select only those who say beliefs differ
2. Find question on awareness, and split.
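The two steps can be sketched on made-up rows as follows. This assumes, for illustration, that `TPARRELIG1` holds the similarity answer and `TPARRELIG2` holds the awareness answer; check the codebook to confirm before relying on it.

```python
import pandas as pd

# Toy stand-in for pewdata; None marks teens who were not asked the follow-up
toy = pd.DataFrame({
    "TPARRELIG1": [
        "All the same religious beliefs",
        "Some of the same religious beliefs",
        "Quite different religious beliefs",
        "Some of the same religious beliefs",
        "All the same religious beliefs",
    ],
    "TPARRELIG2": [None, "Yes", "No", "Yes", None],
})

# Step 1: keep only teens who say their beliefs differ at least somewhat
differ_labels = [
    "Some of the same religious beliefs",
    "Quite different religious beliefs",
]
differ = toy[toy["TPARRELIG1"].isin(differ_labels)]

# Step 2: split that group by the awareness answer
awareness = differ["TPARRELIG2"].value_counts(normalize=True)
print(awareness)
```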

To the codebook!

In [35]:
# value counts

pewdata.TPARRELIG2.value_counts()

Yes        611
No         296
Refused      6
Name: TPARRELIG2, dtype: int64

In [36]:
# normalized value counts

pewdata['TPARRELIG2'].value_counts(normalize=True).round(2)

Yes        0.67
No         0.32
Refused    0.01
Name: TPARRELIG2, dtype: float64

## Further Analysis

As we have said many times, this course is meant for you to have time to practice a lot on your own.  This truly is how you are going to grow in your data manipulation knowledge.

I suggest that you take time to do further analysis with this data and see if you can replicate some of the other Pew analyses from the article. This would also be a great time to practice your data visualization skills. Uploading the work you do with this dataset to GitHub will show your current or prospective employers the quality of work that you can do!