<a href="https://colab.research.google.com/github/atoothman/DATA-70500/blob/main/Activity_2b_IntroProbability.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Probability

In this notebook, we'll use probability to answer some sociological questions. This will give us an opportunity to practice some code for recoding and counting cases as well as computing odds.

For reference, I've copied the material on basic probability into this notebook.

We'll use Downey's probability function, which we reviewed here: https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/notebooks/chap01.ipynb#scrollTo=6L0EX3-337a2&line=2&uniqifier=1



In [2]:
def prob(A):
    """Computes the probability of a proposition, A."""
    return A.mean()
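
To see why `A.mean()` works as a probability: a comparison on a pandas Series returns a Boolean Series, and the mean of a Boolean Series is the fraction of `True` values. Here is a minimal sketch with made-up values (the `toy` Series is hypothetical and not taken from the Baylor data):

```python
import pandas as pd

def prob(A):
    """Computes the probability of a proposition, A."""
    return A.mean()

# Hypothetical response codes, not taken from the Baylor data
toy = pd.Series([34, 12, 16, 12, 46, 16, 12, 45])

is_twelve = (toy == 12)    # Boolean Series: True where the code is 12
print(is_twelve.sum())     # number of True cases
print(prob(is_twelve))     # fraction of True cases
```

Counting cases and computing a probability are therefore the same operation with a different final step: `.sum()` counts, `.mean()` divides the count by the number of rows.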

## Data for Computing Probability

We'll use the Baylor survey dataset from The Association of Religion Data Archives that we saw in an earlier notebook.

Here's the link to get the data file: https://www.thearda.com/data-archive?fid=BRSW5ED&tab=3

I downloaded the Excel file and the codebook.

Since we can't read the data from ARDA into a DataFrame directly, I saved a copy of the Excel spreadsheet as a comma-delimited file (CSV) in a place where we can access it. (You could also download the file from ARDA, put it on your Google Drive, and read it in that way.)

We'll need the codebook to make sense of the variables and values. Here's the link: http://data.shortell.nyc/files/BaylorReligionSurveyWaveV2017InstructionalDatasetcb_data.TXT


In [3]:
# Code block 1a: Installing some libraries we'll need
!pip install researchpy

Collecting researchpy
  Downloading researchpy-0.3.6-py3-none-any.whl.metadata (1.2 kB)
Downloading researchpy-0.3.6-py3-none-any.whl (34 kB)
Installing collected packages: researchpy
Successfully installed researchpy-0.3.6


In [4]:
# Code block 1b. Libraries
import pandas as pd
import numpy as np
import researchpy as rp


In [5]:
# Code Block 2: Importing data
Baylor2017 = pd.read_csv('http://data.shortell.nyc/files/BaylorReligionSurveyWaveV2017InstructionalDataset.csv')

# We can inspect the top of the file to make sure that the data were read in correctly.
Baylor2017.head()

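|  | MOTHERLODE_ID | RESPONDENT_DATE | RESPONDENT_LANGUAGE | ENTITY_ID | SCAN_RESPONDENT_ID | LANG1 | Q1 | Q2_DK | Q3 | Q3_1 | ... | AGER | LIBCONR | PARTYIDR | CHILDSR | HRSWORKD | EDUCR | I_AGE | I_EDUC | I_RACE | I_RELIGION |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 165167557 | 2/14/2017 | en-US | 4221710666 | 4221710666 | en-US | 10.0 | NaN | 4.0 | 4.0 | ... | 5.0 | 1.0 | 1.0 | 1.0 | NaN | 3.0 | 6.0 | 3.0 | 1.0 | 1.0 |
| 1 | 165172207 | 2/21/2017 | en-US | 4221711095 | 4221711095 | en-US | 45.0 | 1.0 | 4.0 | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN | 4.0 | 1.0 | 1.0 |
| 2 | 165167589 | 2/14/2017 | en-US | 4221711129 | 4221711129 | en-US | 45.0 | NaN | 4.0 | 4.0 | ... | 6.0 | 1.0 | 1.0 | 1.0 | NaN | 4.0 | 6.0 | 4.0 | 1.0 | 1.0 |
| 3 | 165167427 | 2/10/2017 | en-US | 4221709180 | 4221709180 | en-US | 20.0 | NaN | 4.0 | 4.0 | ... | 2.0 | 1.0 | 1.0 | 1.0 | NaN | 4.0 | 2.0 | 4.0 | 1.0 | 1.0 |
| 4 | 165171895 | 2/14/2017 | en-US | 4221707213 | 4221707213 | en-US | 12.0 | NaN | 3.0 | 3.0 | ... | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 4.0 | 2.0 | 1.0 | 1.0 |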


We can use the religious identity or affiliation variable as an example for computing probabilities.



```
7) Q1
[I. Religious Behaviors and Attitudes] With what religious family, if any, do you most closely identify? (Please mark only one box.)
RANGE: 1 to 46
	N	Mean	Std. Deviation
Total	1431	26.779	13.946
1) Other	39	2.7
6) Adventist	3	0.2
7) African Methodist	5	0.3
8) Anabaptist	1	0.1
9) Asian Folk Religion	1	0.1
10) Assemblies of God	13	0.9
11) Baha'i	1	0.1
12) Baptist	194	13.6
13) Bible Church	15	1.0
14) Brethren	1	0.1
15) Buddhist	11	0.8
16) Catholic/Roman Catholic	376	26.3
17) Christian & Missionary Alliance	6	0.4
18) Christian Reformed	3	0.2
19) Christian Science	3	0.2
20) Church of Christ	22	1.5
21) Church of God	9	0.6
22) Church of the Nazarene	5	0.3
23) Congregational	3	0.2
25) Episcopal/Anglican	30	2.1
26) Hindu	1	0.1
27) Holiness	6	0.4
28) Jehovah's Witnesses	11	0.8
29) Jewish	29	2.0
30) Latter-day Saints	30	2.1
31) Lutheran	66	4.6
32) Mennonite	1	0.1
33) Methodist	67	4.7
34) Muslim	7	0.5
35) Orthodox (Eastern, Russian, Greek)	5	0.3
36) Pentecostal	20	1.4
37) Presbyterian	38	2.7
38) Quaker/Friends	5	0.3
39) Reformed Church in America/Dutch Reformed	2	0.1
40) Salvation Army	1	0.1
41) Seventh-Day Adventist	5	0.3
42) Sikh	1	0.1
43) Unitarian Universalist	11	0.8
44) United Church of Christ	11	0.8
45) Non-denominational Christian	153	10.7
46) No religion	220	15.4
Missing	70
```

Using the function that Downey created to compute probabilities, we can ask some questions about probabilities, including conditional probabilities.


Let's begin with the probability that a randomly selected respondent identified as Muslim:

In [6]:
Muslim = (Baylor2017['Q1'] == 34)
prob(Muslim)

0.004663557628247834

Now, let's compute the probability that a person identifies as religious. Here, we'll use Q3, religiosity.



```
9) Q3
How religious do you consider yourself to be?
RANGE: 1 to 4
	N	Mean	Std. Deviation
Total	1429	2.675	1.07
1) Not religious	298	20.9
2) Slightly religious	231	16.2
3) Moderately religious	537	37.6
4) Very religious	363	25.4
Missing	72
```

We can compute probabilities to answer questions about being religious in a number of ways. To begin, let's think about a comparison between people who responded as slightly, moderately, or very religious as one group, and people who said they were not religious as the other group.

In [7]:
religious = (Baylor2017['Q3'] >= 2)
prob(religious)

0.7534976682211859
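
One detail worth checking: in pandas, a comparison against a missing value evaluates to `False`, so `prob(religious)` divides by all rows, including respondents who skipped Q3. That is likely why 0.753 is somewhat lower than the roughly 79% implied by the codebook's valid-case percentages for categories 2 through 4. A minimal sketch with made-up values (the `q3` Series and the `prob_valid` helper are hypothetical) shows both denominators:

```python
import numpy as np
import pandas as pd

def prob(A):
    """Computes the probability of a proposition, A."""
    return A.mean()

def prob_valid(series, condition):
    """Hypothetical helper: probability among non-missing responses only."""
    valid = series.notna()
    return condition[valid].mean()

# Made-up religiosity codes with two missing answers
q3 = pd.Series([1, 2, 3, 4, 4, 3, np.nan, np.nan])

religious = (q3 >= 2)    # NaN >= 2 evaluates to False

print(prob(religious))              # 5 of 8 rows
print(prob_valid(q3, religious))    # 5 of 6 valid answers
```

Which denominator is appropriate depends on the question being asked; the rest of this notebook keeps all rows in the denominator.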

Not surprisingly (for sociologists of religion, at least), people in the contemporary US tend to identify as religious to some degree.

We could also ask a question about the probability of being very religious, since this group is often a factor in US politics and in our communities (whether geographic or on social media).

In [8]:
very_religious = (Baylor2017['Q3'] == 4)
prob(very_religious)

0.2418387741505663

Now let's look at a compound probability. How likely is a respondent in this sample to identify as Baptist and as very religious?

In [9]:
Baptist = (Baylor2017['Q1'] == 12)

# The compound probability of being Baptist AND very religious
prob(Baptist & very_religious)

0.057961359093937376

Now we might ask about conditional probability. If a person is Baptist, what is the likelihood that they will identify as very religious?

Again, we'll use Downey's function for conditional probability: https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/notebooks/chap01.ipynb#scrollTo=GbbG7bsh37bG&line=1&uniqifier=1



In [10]:
def conditional(proposition, given):
    """Probability of A conditioned on given."""
    return prob(proposition[given])

In [11]:
conditional(very_religious, given=Baptist)

0.4484536082474227
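
The compound and conditional probabilities are linked by the product rule, P(A and B) = P(A) × P(B | A). With the Baylor propositions, `prob(Baptist) * conditional(very_religious, given=Baptist)` should reproduce `prob(Baptist & very_religious)`. The sketch below checks the rule on made-up Boolean data (the Series `A` and `B` are hypothetical):

```python
import pandas as pd

def prob(A):
    """Computes the probability of a proposition, A."""
    return A.mean()

def conditional(proposition, given):
    """Probability of a proposition conditioned on given."""
    return prob(proposition[given])

# Made-up Boolean propositions for ten hypothetical respondents
A = pd.Series([True, True, True, True, False, False, False, False, False, False])
B = pd.Series([True, True, True, False, True, False, False, False, False, False])

joint = prob(A & B)                           # P(A and B)
product = prob(A) * conditional(B, given=A)   # P(A) * P(B | A)

print(joint, product)   # equal up to floating-point rounding
```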

Let's think sociologically about what we've learned from these probability computations.



We can now return to a question we've asked before, about the relationship between religiosity and attendance at religious services. In the codebook, attendance is measured with Q4.

```
11) Q4
How often do you attend religious services at a place of worship?
RANGE: 0 to 7
	N	Mean	Std. Deviation
Total	1445	3.322	2.565
0) Never - Skip to Question 12	356	24.6
1) Less than once a year	102	7.1
2) Once or twice a year	160	11.1
3) Several times a year	164	11.3
4) Once a month	51	3.5
5) Two to three times a month	124	8.6
6) About once a week	354	24.5
7) Several times a week	134	9.3
Missing	56
```

For this question, we'll define high attendance as attending at least weekly, which includes answers 6 and 7.

We can compute the conditional probability of high attendance given identifying as very religious. We can then compute the same conditional probability for those who identify as moderately religious (here, slightly or moderately religious) and as not religious. If the probability of high attendance is larger for the very religious than for the moderately religious or the nonreligious, it would suggest a relationship between religiosity and attendance.


In [12]:
high_attendance = (Baylor2017['Q4'] >= 6)

# Here, we want to keep cases where Q3 equals 2 or equals 3, so we use '|' (element-wise OR)
moderately_religious = ((Baylor2017['Q3'] == 2) | (Baylor2017['Q3'] == 3))
nonreligious = (Baylor2017['Q3'] == 1)

print("The probability of high attendance given very religious: %3.2f" % conditional(high_attendance, given=very_religious))
print("The probability of high attendance given moderately religious: %3.2f" % conditional(high_attendance, given=moderately_religious))
print("The probability of high attendance given nonreligious: %3.2f" % conditional(high_attendance, given=nonreligious))


The probability of high attendance given very religious: 0.79
The probability of high attendance given moderately religious: 0.23
The probability of high attendance given nonreligious: 0.06
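
The three conditional probabilities can also be read from a single row-normalized cross-tabulation. This sketch uses `pd.crosstab` with `normalize='index'` on made-up data; the `religiosity` and `attendance` columns are hypothetical stand-ins for recoded versions of Q3 and Q4:

```python
import pandas as pd

# Made-up recoded data for ten hypothetical respondents
toy = pd.DataFrame({
    'religiosity': ['very', 'very', 'very', 'very',
                    'moderate', 'moderate', 'moderate', 'moderate',
                    'none', 'none'],
    'attendance': ['high', 'high', 'high', 'low',
                   'high', 'low', 'low', 'low',
                   'low', 'low'],
})

# normalize='index' makes each row sum to 1, so the 'high' column
# holds P(high attendance | religiosity) for each group
ct = pd.crosstab(toy['religiosity'], toy['attendance'], normalize='index')
print(ct)
```

The advantage of the table form is that all groups and all outcomes appear side by side, which makes it harder to overlook a category.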


In [13]:
print("The probability of being very religious: %3.2f" % prob(very_religious))
print("The probability of being moderately religious: %3.2f" % prob(moderately_religious))
print("The probability of being nonreligious: %3.2f" % prob(nonreligious))

The probability of being very religious: 0.24
The probability of being moderately religious: 0.51
The probability of being nonreligious: 0.20


The results suggest that those who are very religious are more likely to be high attenders than those who are moderately religious or nonreligious. There could be mediating factors involved, but this result is consistent with the sociological assertion that religiosity contributes to attendance at religious services.
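
The introduction mentions computing odds. Odds re-express a probability p as p / (1 - p), and the ratio of two odds compares groups. The sketch below applies this to the rounded conditional probabilities printed above; the `odds` helper is hypothetical, and because the inputs are rounded to two decimals the results are approximate:

```python
def odds(p):
    """Hypothetical helper: convert a probability to odds in favor."""
    return p / (1 - p)

# Rounded conditional probabilities of high attendance from the output above
p_very, p_moderate, p_non = 0.79, 0.23, 0.06

print("Odds of high attendance, very religious: %3.2f" % odds(p_very))
print("Odds of high attendance, moderately religious: %3.2f" % odds(p_moderate))
print("Odds of high attendance, nonreligious: %3.2f" % odds(p_non))
print("Odds ratio, very vs. moderately religious: %3.2f" % (odds(p_very) / odds(p_moderate)))
```

Odds above 1 mean the outcome is more likely than not; an odds ratio far from 1 indicates a large difference between the two groups.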

## Bayes's Theorem

One of the ways that probability can help us analyze data is through a framework to understand relationships (variables that co-vary) in terms of differential probability. As we saw in the previous example, if we compute the probability of high attendance for those who are very religious and compare this to the probability of high attendance for those who are nonreligious, we can use the result to draw sociological insight. If the pattern is consistent with what we expect from a theoretical perspective (since expected outcomes derive from theory), we can use it as evidence in support of a causal argument.

That is, our theory predicts that those with high religiosity will be more likely to have high attendance. If the empirical results are consistent with this expectation, we can say our theoretical assertion is supported. To be clear, this computation of probabilities is not definitive proof of a causal relationship. Rather, the pattern is necessary but not sufficient to make a causal argument.

In the usual context for data analysis, we have sample data, not population data. Here, we're using "sample" and "population" in a technical sense. A population is a theoretically defined collection of elements. In this sense, it needn't be a collection of people, though it often is in the social and behavioral sciences. A population need not be large, though, again, it often is in the social and behavioral sciences.

A sample is a subset taken from a particular population that is used to represent the population as a whole, using inference. This is where Bayes's Theorem is a useful tool.





In [14]:
table = pd.DataFrame(index=['Very Religious', 'Moderately Religious', 'Nonreligious'])
table['prior'] = 1/3, 1/3, 1/3
table

|  | prior |
|---|---|
| Very Religious | 0.333333 |
| Moderately Religious | 0.333333 |
| Nonreligious | 0.333333 |


In [15]:
table['likelihood'] = 0.24, 0.51, 0.20
table

|  | prior | likelihood |
|---|---|---|
| Very Religious | 0.333333 | 0.24 |
| Moderately Religious | 0.333333 | 0.51 |
| Nonreligious | 0.333333 | 0.2 |


In [16]:
table['unnorm'] = table['prior'] * table['likelihood']
table

|  | prior | likelihood | unnorm |
|---|---|---|---|
| Very Religious | 0.333333 | 0.24 | 0.08 |
| Moderately Religious | 0.333333 | 0.51 | 0.17 |
| Nonreligious | 0.333333 | 0.2 | 0.066667 |


In [17]:
prob_data = table['unnorm'].sum()
table['posterior'] = table['unnorm'] / prob_data
table

|  | prior | likelihood | unnorm | posterior |
|---|---|---|---|---|
| Very Religious | 0.333333 | 0.24 | 0.08 | 0.252632 |
| Moderately Religious | 0.333333 | 0.51 | 0.17 | 0.536842 |
| Nonreligious | 0.333333 | 0.2 | 0.066667 | 0.210526 |


In this example, we use Bayes's Theorem to guide our decisions about inference. The value for the prior is an expectation derived from theory or from a hypothetical (as in this case, the unrealistic assumption that all options on a survey item are equally likely) before we've examined any data. The posterior reflects our understanding of the likelihood of outcomes after examining sample data. We recognize that the specific values from our sample data are subject to some sampling and measurement error but also reflect relationships in the population.
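
In Bayes's Theorem, the likelihood is usually read as the probability of the observed data under each hypothesis, P(data | hypothesis). As an illustration of that reading, the sketch below treats "the respondent is a high attender" as the data and uses the rounded conditional probabilities printed earlier (0.79, 0.23, 0.06) as the likelihood column. This is an assumption made for illustration rather than the choice made in the cells above, and the `bayes_table` helper is hypothetical:

```python
import pandas as pd

def bayes_table(hypotheses, prior, likelihood):
    """Hypothetical helper: build a Bayes table and normalize the posterior."""
    table = pd.DataFrame(index=hypotheses)
    table['prior'] = prior
    table['likelihood'] = likelihood
    table['unnorm'] = table['prior'] * table['likelihood']
    table['posterior'] = table['unnorm'] / table['unnorm'].sum()
    return table

groups = ['Very Religious', 'Moderately Religious', 'Nonreligious']

# Data: the respondent is a high attender.
# Likelihood: rounded P(high attendance | group) from the earlier output.
alt = bayes_table(groups, prior=[1/3, 1/3, 1/3], likelihood=[0.79, 0.23, 0.06])
print(alt)
```

Under equal priors, the posterior column here can be read as the probability of each religiosity group given that a respondent is a high attender, and most of the weight falls on the very religious group.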

We get different values for the posterior probabilities if we begin with different priors, but the pattern--the differences in probabilities reflected in the sample data--holds and is the foundation for making inferential claims.

For example, if we set the priors based on an expectation that survey respondents tend to favor the middle values of the scale rather than the extremes, we could compute the probabilities under a different hypothesis. (There is some evidence that social demand effects create such a pattern in surveys. Realistically, there is probably reason to believe that social demand favors expression of greater religiosity, since there are still stigmas attached to nonbelief in the contemporary US.)

In [18]:
table['prior2'] = 1/4, 1/2, 1/4
table['unnorm2'] = table['prior2'] * table['likelihood']
prob_data = table['unnorm2'].sum()
table['posterior2'] = table['unnorm2'] / prob_data
table

|  | prior | likelihood | unnorm | posterior | prior2 | unnorm2 | posterior2 |
|---|---|---|---|---|---|---|---|
| Very Religious | 0.333333 | 0.24 | 0.08 | 0.252632 | 0.25 | 0.06 | 0.164384 |
| Moderately Religious | 0.333333 | 0.51 | 0.17 | 0.536842 | 0.5 | 0.255 | 0.69863 |
| Nonreligious | 0.333333 | 0.2 | 0.066667 | 0.210526 | 0.25 | 0.05 | 0.136986 |


### Activity

Identify a sociological question that you can ask of the Baylor survey data. This will require you to browse the codebook to see the kinds of variables available. Start with a simple question of the form "Is Y related to X?" where X and Y are sociological concepts. (In the example in this notebook, we asked if attendance at religious services was related to religiosity.)

Select variables that operationalize the concepts in your question.

Compute probabilities that allow you to formulate an answer to your question.

Explain your answer and why you computed the specific probabilities you chose.



1. Identify a sociological question that you can ask of the Baylor survey data.
Is a respondent's answer to "How religious do you consider yourself to be?" related to how often one spends time praying outside of religious services? Do people who consider themselves more religious pray more?

2. Select variables that operationalize the concepts in your question.

X = How religious do you consider yourself to be?  (9/Q3)

Y = About how often do you spend time alone praying outside of religious services? (Q10)

In [23]:
# 3. Compute probabilities that allow you to formulate an answer to your question.

# Probability of being at least slightly religious (Q3 of 2, 3, or 4)
religious = (Baylor2017['Q3'] >= 2)
print(prob(religious))

# Probability of being very religious (Q3 of 4)
very_religious = (Baylor2017['Q3'] == 4)
print(prob(very_religious))

0.7534976682211859
0.2418387741505663


In [26]:
pray = (Baylor2017['Q10'] >= 5)

# Here, we want to keep cases where Q3 equals 2 or equals 3, so we use '|' (element-wise OR)
moderately_religious = ((Baylor2017['Q3'] == 2) | (Baylor2017['Q3'] == 3))
nonreligious = (Baylor2017['Q3'] == 1)

print("The probability of praying outside services given very religious: %3.2f" % conditional(pray, given=very_religious))
print("The probability of praying outside services given moderately religious: %3.2f" % conditional(pray, given=moderately_religious))
print("The probability of praying outside services given nonreligious: %3.2f" % conditional(pray, given=nonreligious))

The probability of praying outside services given very religious: 0.60
The probability of praying outside services given moderately religious: 0.18
The probability of praying outside services given nonreligious: 0.07


4. Explain your answer and why you computed the specific probabilities you chose.

About 75% of respondents consider themselves at least slightly religious, while only 24% consider themselves very religious. The probability of praying outside of services is high for very religious respondents, at 60%. It is much lower for slightly or moderately religious respondents, at 18%, and lowest for nonreligious respondents, at 7%. Someone who identifies as very religious is therefore more likely to pray outside of services, but, surprisingly, the rate drops sharply for those who are only slightly or moderately religious.

I computed the probabilities using the function created by Downey, as well as conditional probability. I chose this approach because it is based on how religious people see themselves and the likelihood that they would pray based on their previous response about rating their religiousness.


