# PewRel2019

<img src='images/PF_20.09.12_teens_featured.webp'>
<i>Image from <a href="https://www.pewresearch.org/religion/wp-content/uploads/sites/7/2020/09/PF_20.09.12_teens_featured.jpg?resize=640,360">Pew Research</a></i>

The Pew Research Center is a nonpartisan think tank that conducts research and polling on a variety of issues, including religion. In this notebook we conceptually replicate analyses from their report on a 2019 survey, ["U.S. Teens Take After Their Parents Religiously, Attend Services Together and Enjoy Family Rituals"](https://www.pewresearch.org/religion/2020/09/10/u-s-teens-take-after-their-parents-religiously-attend-services-together-and-enjoy-family-rituals/), published in September 2020.

You can read about their methodology [here.](https://www.pewresearch.org/religion/2020/09/10/methodology-34/)  You will need to download their questions [here.](https://www.pewresearch.org/religion/wp-content/uploads/sites/7/2020/09/PF_09.10.20_teens.religion.topline.pdf) 

You will be provided with the data and codebook, but you can also download them, or other Pew data, by [registering for an account.](https://www.pewresearch.org/profile/registration/) (They have a lot of great stuff!)

First, download the notebook and SPSS file titled `US Teens and their parents - 2019 Pew Research Survey - FOR RELEASE.sav` and place them in the same folder.

Note: our calculations will differ slightly from those in the Pew report. This is due to weightings we are not privy to, but we'll get pretty close regardless. Additionally, for educational purposes we will sometimes not do things in the most efficient way. If you can do it more quickly, great!

### Initial Install

You will need to install pyreadstat:

`pip install pyreadstat`

or

`conda install pyreadstat`

<span style=color:red>Note: A few students have had issues installing pyreadstat. If that is the case for you, we created a CSV file with this data, and you can use the following code to load the data instead:</span>
```
pewdata = pd.read_csv('pew_data.csv')
```
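
If you would like one cell that works either way, here is a minimal sketch. It assumes the CSV is named `pew_data.csv` as above; the helper name `load_pew` and its `have_pyreadstat` flag are our own inventions, not part of pandas. Keep in mind that `read_csv` returns text columns as plain strings rather than the `category` dtype, so the sketch converts every non-numeric column to a categorical so that the `.cat` calls later in the notebook still work.

```python
import importlib.util

import pandas as pd

SAV_FILE = 'US Teens and their parents - 2019 Pew Research Survey - FOR RELEASE.sav'
CSV_FILE = 'pew_data.csv'


def load_pew(sav_path=SAV_FILE, csv_path=CSV_FILE, have_pyreadstat=None):
    """Load from SPSS when pyreadstat is installed, otherwise from the CSV."""
    if have_pyreadstat is None:
        # find_spec returns None when the package cannot be imported
        have_pyreadstat = importlib.util.find_spec('pyreadstat') is not None
    if have_pyreadstat:
        return pd.read_spss(sav_path)

    df = pd.read_csv(csv_path)
    # CSV files carry no dtype information, so rebuild the categoricals
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].astype('category')
    return df


# pewdata = load_pew()
```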

### Initial Imports

In [11]:
#import common libraries

import numpy as np
import pandas as pd

In [15]:
!pip install pyreadstat




In [17]:
#import data from SPSS

pewdata = pd.read_spss('US Teens and their parents - 2019 Pew Research Survey - FOR RELEASE.sav')

If this cell raises `ImportError: Missing optional dependency 'pyreadstat'`, install pyreadstat as described above (or load the CSV file instead) and rerun the cell.

Let's explore `pewdata`

In [19]:
# view shape 

pewdata.shape


In [None]:
# view head

pewdata.head()

In [None]:
# view columns

pewdata.columns

In [None]:
# print all columns

for col in pewdata.columns:
    print(col)

The `describe()` method provides summary statistics for numeric variables.

In [None]:
# use describe method

pewdata.describe()

In [None]:
# print all data types

for i in pewdata.dtypes:
    print(i)
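
Printing one dtype per column gets long quickly. As a small sketch on a made-up frame (the column names below are invented, not taken from the Pew file), we can tally how many columns share each dtype instead. We convert the dtypes to strings first because two categorical columns with different category lists count as different dtypes.

```python
import pandas as pd

# Toy frame with invented columns: two numeric, two categorical
toy = pd.DataFrame({
    'weight': [1.2, 0.8, 1.0],
    'age': [14, 15, 16],
    'religion': pd.Categorical(['Catholic', 'Unaffiliated', 'Catholic']),
    'marital': pd.Categorical(['Married', 'Married', 'Divorced']),
})

# str() of any categorical dtype is 'category', so they group together
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# List just the numeric columns, i.e. the ones describe() summarizes by default
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
```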

We only have three (we'll come back to the categories).  

But are they useful?

Let's look at `XPPRACEM` and `PPMARIT`, both here and in the codebook.

In [None]:
pewdata.XPPRACEM

When called on a non-numeric column, `describe()` returns different statistics: the count, the number of unique values, the most frequent value, and its frequency.

In [None]:
# describe() XPPRACEM

pewdata.XPPRACEM.describe()

To include all variables, add `include = 'all'` 

In [None]:
# describe() all columns

pewdata.describe(include = 'all')

Use `value_counts()` to count how often each value of a categorical variable occurs.

In [None]:
# value_counts() for XPPRACEM

pewdata.XPPRACEM.value_counts()

In [None]:
# view PPMARIT

pewdata.PPMARIT

In [None]:
# value_counts() for PPMARIT

pewdata.PPMARIT.value_counts()

#### Student Practice
Pause the video and try to perform the following tasks on the `pewdata` dataset. Then check your answers as I walk through the solutions. 

**Exercise**: How many parents are agnostic while their children are Roman Catholic? 


I am intentionally not telling you the column names to look for as I want you to try to use the code book to figure this out.  This will be great practice in reading a code book or data dictionary.  Be careful with your lower and upper cases!

In [None]:
### ENTER CODE HERE ###

**Exercise:** Which five parent/child denomination combinations are largest?  In other words, group by the parent's denomination and then the teen's denomination and see which five combinations are the largest.

In [None]:
### ENTER CODE HERE ###

# Let's recreate this!

<img src='images/PF_09.10.20_religion.teens-00-0.webp' width = "400">
<i>Image from <a href="https://www.pewresearch.org/religion/2020/09/10/u-s-teens-take-after-their-parents-religiously-attend-services-together-and-enjoy-family-rituals/">Pew Research</a></i>


Look at the codebook

In [None]:
# value counts for Preltrad

pewdata.Preltrad.value_counts()

We want 'Evangelical Protestant', 'Mainline Protestant', 'Catholic', 'Unaffiliated' only

In [None]:
# recode values to numbers that match the order in the chart

pewdata.loc[pewdata['Preltrad'] == 'Theologically Evangelical Protestant Churches', 'prelid'] = 1
pewdata.loc[pewdata['Preltrad'] == 'Historic Mainline Protestant Churches', 'prelid'] = 2
pewdata.loc[pewdata['Preltrad'] == 'Catholic', 'prelid'] = 3
pewdata.loc[pewdata['Preltrad'] == 'Unaffiliated', 'prelid'] = 4

pewdata.loc[pewdata['treltrad'] == 'Theologically Evangelical Protestant Churches', 'trelid'] = 1
pewdata.loc[pewdata['treltrad'] == 'Historic Mainline Protestant Churches', 'trelid'] = 2
pewdata.loc[pewdata['treltrad'] == 'Catholic', 'trelid'] = 3
pewdata.loc[pewdata['treltrad'] == 'Unaffiliated', 'trelid'] = 4
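
The eight `.loc` assignments above can be written more compactly with a dictionary and `map`. This is only a sketch on a toy column: the four labels are copied from the cell above, while `'Jewish'` is a hypothetical extra label we added to show what happens to values outside the four groups (they become `NaN`, just as they do with the `.loc` approach).

```python
import pandas as pd

# Toy stand-in for the parent column
toy = pd.DataFrame({'Preltrad': [
    'Catholic',
    'Unaffiliated',
    'Historic Mainline Protestant Churches',
    'Theologically Evangelical Protestant Churches',
    'Jewish',  # hypothetical label outside the four groups we want
]})

chart_order = {
    'Theologically Evangelical Protestant Churches': 1,
    'Historic Mainline Protestant Churches': 2,
    'Catholic': 3,
    'Unaffiliated': 4,
}

# astype(object) makes the lookup behave the same for categorical columns;
# labels missing from the dictionary map to NaN
toy['prelid'] = toy['Preltrad'].astype(object).map(chart_order)
print(toy)
```

The same dictionary could then be reused for the teen column, which avoids typing each label twice.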

In [None]:
# crosstabs using our two variables

pd.crosstab(pewdata.prelid, pewdata.trelid)

Include `normalize = 'index'` to get row proportions, so that each parent group's row sums to 1

In [None]:
# add `normalize = 'index'` to get row proportions, and round to two decimals

our_table = pd.crosstab(pewdata.prelid, pewdata.trelid, normalize = 'index').round(2)
our_table
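
To see what `normalize = 'index'` does, here is a small sketch with made-up codes (the `parent` and `teen` columns are invented for illustration). Each count is divided by its row total, so every row of the normalized table sums to 1.

```python
import pandas as pd

# Six invented parent/teen pairs
toy = pd.DataFrame({
    'parent': [1, 1, 1, 1, 2, 2],
    'teen':   [1, 1, 1, 2, 2, 1],
})

counts = pd.crosstab(toy.parent, toy.teen)
row_share = pd.crosstab(toy.parent, toy.teen, normalize='index')

print(counts)
print(row_share)
print(row_share.sum(axis=1))  # each row total is 1.0
```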

How accurate are we?

<img src='images/PF_09.10.20_religion.teens-00-0.webp' width = "300" align = "left">


In [None]:
# recreate the table in a DF
# keys are the teen columns (trelid 1-4); each list runs down the parent rows (prelid 1-4)

pew_numbers = {1: [.80, .12, .01, .02], 2: [.06, .55, .01, .03],
    3: [.01, .04, .81, .05], 4: [.12, .24, .15, .86]}

pew_table = pd.DataFrame(data = pew_numbers, index = [1, 2, 3, 4])
pew_table

In [None]:
# subtract their values from ours to two decimals

(our_table - pew_table).round(2)
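
Subtracting one DataFrame from another lines the cells up by row and column labels, not by position. The sketch below uses two small invented tables (not the Pew numbers) with the columns in a different order to show this, and then boils the comparison down to the single largest gap.

```python
import pandas as pd

ours = pd.DataFrame({1: [0.79, 0.10], 2: [0.21, 0.90]}, index=[1, 2])
# Same labels, but the columns are listed in the opposite order
theirs = pd.DataFrame({2: [0.20, 0.88], 1: [0.80, 0.12]}, index=[1, 2])

# Cells are matched by label, so column order does not matter
diff = (ours - theirs).round(2)
print(diff)

# Largest absolute difference anywhere in the table
largest_gap = diff.abs().max().max()
print(largest_gap)
```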

## Categorical data

Read more about categorical data from the [Pandas documentation](https://pandas.pydata.org/docs/user_guide/categorical.html)

Categorical data is used extensively in statistics.

Features:
* categorical variables take on a limited, usually fixed, number of possible values
* categorical variables can have an order (ordinal data)
* all values are either in categories or np.nan
* if an order is present (e.g., "strongly agree", "agree", "neutral", etc.), sorting will use the logical order instead of the lexical (alphabetical) order
* especially useful in survey research
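
A minimal sketch of the ordering point, using an invented three-level agreement scale rather than a Pew column: plain strings sort alphabetically, while an ordered categorical sorts by the order we give it and also supports `min()` and `max()`.

```python
import pandas as pd

answers = pd.Series(['agree', 'strongly agree', 'neutral', 'agree'])

# Plain strings sort in lexical (alphabetical) order
lexical = answers.sort_values().tolist()

# Declare the logical order, lowest to highest
scale = ['neutral', 'agree', 'strongly agree']
ordered = answers.astype(pd.CategoricalDtype(categories=scale, ordered=True))
logical = ordered.sort_values().tolist()

print(lexical)
print(logical)
print(ordered.max())
```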

In [None]:
# view annual household income of participating parent, PPINCIMPREC 

pewdata['PPINCIMPREC']

In [None]:
# PPINCIMPREC value counts

pewdata['PPINCIMPREC'].value_counts()

In [None]:
# sort PPINCIMPREC

pewdata['PPINCIMPREC'].sort_values()

If the sorted output above is not in logical income order, we can fix that with `reorder_categories`.

In our case, we call `pewdata['PPINCIMPREC'].cat.reorder_categories()` with a list of our values in the desired order.

In [None]:
# reorder categories

pewdata['PPINCIMPREC'] = pewdata['PPINCIMPREC'].cat.reorder_categories(
    ['Less than $30,000', '$30,000 to less than $75,000', '$75,000 or more'])
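
One thing to watch for: `reorder_categories` expects exactly the existing categories, just in a new order, and raises a `ValueError` if one is missing or misspelled. The sketch below shows this on a toy Series we built from the three labels above; passing `ordered=True` is optional and our own addition here.

```python
import pandas as pd

income = pd.Series(pd.Categorical(
    ['$75,000 or more', 'Less than $30,000', '$30,000 to less than $75,000']))

# By default the categories are stored in sorted (alphabetical) order
print(income.cat.categories.tolist())

error_message = None
try:
    # Leaving out a category is an error
    income.cat.reorder_categories(['Less than $30,000', '$75,000 or more'])
except ValueError as err:
    error_message = str(err)
    print(error_message)

fixed = income.cat.reorder_categories(
    ['Less than $30,000', '$30,000 to less than $75,000', '$75,000 or more'],
    ordered=True)
print(fixed.sort_values().tolist())
```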

In [None]:
# check values

pewdata['PPINCIMPREC'].values

In [None]:
# check sorted values

pewdata['PPINCIMPREC'].sort_values()

# Importance

Let's turn to importance, and match the values from this:

<img src='images/PF_09.10.20_religion.teens-00-1.jpg' width = "400">
<i>Image from <a href="https://www.pewresearch.org/religion/2020/09/10/u-s-teens-take-after-their-parents-religiously-attend-services-together-and-enjoy-family-rituals/">Pew Research</a></i>

In [None]:
# importance value counts

pewdata.PRELIMP.value_counts()

In [None]:
# normalize value counts

pewdata.PRELIMP.value_counts(normalize = True)

In [None]:
# remove 'Refused' responses

df3 = pewdata.loc[pewdata['PRELIMP'] != 'Refused']

In [None]:
# normalized value counts, no refused

df3.PRELIMP.value_counts(normalize = True)

In [None]:
# remove 'Refused' for teens

df4 = pewdata.loc[pewdata['TRELIMP'] != 'Refused']

In [None]:
# normalized value counts for teens, no refused

df4.TRELIMP.value_counts(normalize = True)
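
When a column is categorical, filtering out rows does not remove the category itself, so `'Refused'` can still appear in `value_counts()` with a count of zero. Here is a sketch on a toy Series; the two importance labels are placeholders we made up, so check the codebook for the real ones.

```python
import pandas as pd

relimp = pd.Series(pd.Categorical(
    ['Very important', 'Somewhat important', 'Refused', 'Very important']))

# Drop the 'Refused' rows; the category is still registered
kept = relimp[relimp != 'Refused']
print(kept.value_counts(normalize=True))  # 'Refused' is listed with 0.0

# Remove categories that no longer occur in the data
cleaned = kept.cat.remove_unused_categories()
shares = cleaned.value_counts(normalize=True)
print(shares)
```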

# Percent of teens holding same religious beliefs as parents



<img src='images/PF_09.10.20_religion.teens-00-2.webp' width = "400">
<i>Image from <a href="https://www.pewresearch.org/religion/2020/09/10/u-s-teens-take-after-their-parents-religiously-attend-services-together-and-enjoy-family-rituals/">Pew Research</a></i>

Let's start with the first set of percentages.

To the codebook!

In [None]:
# value counts 

pewdata.TPARRELIG1.value_counts(normalize = True).round(2)

"...but among teens who say their beliefs differ, a third say the parent is unaware"

How should we examine this?

1. Select only those who say beliefs differ
2. Find question on awareness, and split.

To the codebook!

In [None]:
# value counts

pewdata.TPARRELIG2.value_counts()

In [None]:
# normalized value counts

pewdata['TPARRELIG2'].value_counts(normalize=True).round(2)
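
The two steps listed earlier (select the teens whose beliefs differ, then split by awareness) can also be written out explicitly. This is a sketch on made-up rows: the column names come from the notebook, but every answer label below is a placeholder, so look up the real wording in the codebook.

```python
import pandas as pd

# Invented rows; None marks teens who were not asked the follow-up
toy = pd.DataFrame({
    'TPARRELIG1': ['All the same', 'Quite different', 'Quite different',
                   'All the same', 'Quite different'],
    'TPARRELIG2': [None, 'Parent knows', 'Parent does not know',
                   None, 'Parent knows'],
})

# Step 1: keep only teens who say their beliefs differ
differ = toy.loc[toy['TPARRELIG1'] == 'Quite different']

# Step 2: split those teens by whether the parent is aware
awareness = differ['TPARRELIG2'].value_counts(normalize=True).round(2)
print(awareness)
```

Because `value_counts()` skips missing values by default, calling it on the whole column gives the same shares whenever the follow-up question was only asked of teens whose beliefs differ. That is an assumption about the survey design worth confirming in the codebook.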

## Further Analysis

As we have said many times, this course is meant for you to have time to practice a lot on your own.  This truly is how you are going to grow in your data manipulation knowledge.

I suggest that you take time to do further analysis with this data and see if you can replicate some of the other Pew analyses from the article. This would also be a great time to practice your data visualization skills. Uploading the work you do with this dataset to GitHub will show your current or prospective employers the type of quality work that you can do!