# hw 10
### due April 27, 12:30 pm

### data
We will work with survey data on marital infidelity from a 1969 survey conducted by Psychology Today. The dataset is distributed through Rdatasets, a collection of datasets from packages for the R statistical programming language (this one comes from the AER package).

You can download the data at

https://vincentarelbundock.github.io/Rdatasets/csv/AER/Affairs.csv

Excel users should be able to open this csv directly. If Excel prompts you to convert it to xlsx format, accept.

#### more info
Each observation represents one respondent's answers.

**See here for more information** on this dataset
https://vincentarelbundock.github.io/Rdatasets/doc/AER/Affairs.html

### research question

Do people who are less than or equal to 34 years old engage in some extramarital sex (any amount) at different rates than those who are older?

In [1]:
import pandas as pd
import scipy.stats as stats

In [2]:
d = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/AER/Affairs.csv')

In [3]:
d.head()

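|   | Unnamed: 0 | affairs | gender | age | yearsmarried | children | religiousness | education | occupation | rating |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 0 | male | 37.0 | 10.0 | no | 3 | 18 | 7 | 4 |
| 1 | 5 | 0 | female | 27.0 | 4.0 | no | 4 | 14 | 6 | 4 |
| 2 | 11 | 0 | female | 32.0 | 15.0 | yes | 1 | 12 | 1 | 4 |
| 3 | 16 | 0 | male | 57.0 | 15.0 | yes | 5 | 18 | 6 | 5 |
| 4 | 23 | 0 | male | 22.0 | 0.75 | no | 2 | 17 | 6 | 3 |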


# Q1
Look at the data description linked above to answer the questions below.

- **State clearly, with a brief explanation** whether the `age` variable is numeric or categorical. 
- **State clearly, with a brief explanation** whether the `affairs` variable is numeric or categorical.

This is a little tricky. **Do not solely rely on the variable description's determination of numeric/categorical type**, which gives variable type from a programming language perspective. Use the definition of numeric/categorical discussed in class.

# Q2

- **create** a new variable `young` that is 1 (or in python you can use True) if the respondent's age is less than or equal to 34, and 0 (or False) otherwise.
- **create** a pivot table (excel) or grouped data frame summary (python) showing the **number of observations** by groups defined by pairs of `gender` and `young` variables

In [4]:
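# threshold from the research question: age <= 34
# (age is coded in 5-year bins, e.g. 32 = 30-34 and 37 = 35-39, so this selects bins up to 32)
d.loc[:, 'young'] = d.loc[:, 'age'].le(34)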
d.groupby(['gender', 'young']).size()

gender  young
female  False     85
        True     230
male    False    125
        True     161
dtype: int64

# Q3
You may do Q2 and Q3 together in one table/summary data frame if you choose.

- **create** a new variable `some_affairs` that is 1 (or in python you can use True) if the `affairs` variable is greater than zero, and 0 (or False) otherwise. This records whether or not the respondent reported any extramarital sex in the past year.
- **create** a pivot table (excel) or grouped data frame summary (python) showing the **average of** `some_affairs` by groups defined by the `gender` and `young` variables
- **describe briefly** (1 sentence is fine) what the summary table shows

In [5]:
d.loc[:, 'some_affairs'] = d.loc[:, 'affairs'].gt(0)
# select the column before averaging so non-numeric columns (gender, children) are not aggregated
d.groupby('young')['some_affairs'].mean()

young
False    0.271429
True     0.237852
Name: some_affairs, dtype: float64
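
Q3 asks for the average by `gender` and `young` together, while the cell above groups by `young` alone. A minimal sketch of the two-way version follows. It runs on a few made-up rows (the `toy` data frame is hypothetical and is not the survey data) so that it is self-contained; on the real data the same `groupby` call would be applied to `d`.

```python
import pandas as pd

# Toy rows for illustration only (NOT the survey data); column names mirror `d`.
toy = pd.DataFrame({
    'gender':  ['male', 'male', 'female', 'female', 'female', 'male'],
    'age':     [37.0, 27.0, 32.0, 57.0, 22.0, 42.0],
    'affairs': [0, 3, 0, 1, 0, 0],
})
toy['young'] = toy['age'].le(34)            # True if age <= 34
toy['some_affairs'] = toy['affairs'].gt(0)  # True if any affairs reported

# Share reporting any affair in each gender-by-young group, plus the group sizes.
summary = toy.groupby(['gender', 'young'])['some_affairs'].agg(['mean', 'size'])
print(summary)
```

Including `size` next to `mean` combines the Q2 counts and the Q3 averages in one summary data frame.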

# Q4

- **calculate** `mhat_young` and `mhat_old`, point estimates for the average of `some_affairs` for the groups `young == 1` (or True in python) and `young == 0` respectively.
- **calculate** `shat_young` and `shat_old`, point estimates for the standard deviations of `some_affairs` for the groups `young == 1` (or True in python) and `young == 0` respectively.
- **calculate** `serr_young` and `serr_old`, point estimates for the standard deviations **of the means** of `some_affairs` (that is, the standard errors) for the groups `young == 1` (or True in python) and `young == 0` respectively, for samples of the appropriate sizes for each group.

Python note: You don't need to create separate objects with those given names if you don't want to. The best way to do this is with a `groupby().agg(['f1', 'f2', 'f3'])` where `f1` etc are appropriate function names. This will give a data frame with all of the relevant information. You can then calculate a new column called `serr` using `assign()` or whatever method you prefer.

Hint: For the final calculation you will first need to count the number of observations within each group.
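
In symbols, the last calculation can be written as follows, where $n_{\text{young}}$ and $n_{\text{old}}$ denote the group sizes (notation assumed here for illustration, it does not appear elsewhere in the assignment):

$$\text{serr}_{\text{young}} = \frac{\text{shat}_{\text{young}}}{\sqrt{n_{\text{young}}}}, \quad \quad \text{serr}_{\text{old}} = \frac{\text{shat}_{\text{old}}}{\sqrt{n_{\text{old}}}}$$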

In [6]:
dhat = d.loc[:, ['young', 'some_affairs']].groupby('young').agg(['mean', 'std', 'size'])

In [7]:
# column names are actually ('some_affairs', 'mean'), ('some_affairs', 'std') etc.
# the extra 'some_affairs' level isn't needed, so it is dropped in the next cell
dhat.columns

MultiIndex([('some_affairs', 'mean'),
            ('some_affairs',  'std'),
            ('some_affairs', 'size')],
           )

In [8]:
dhat.columns = dhat.columns.droplevel(0)

In [9]:
dhat

| young | mean | std | size |
|---|---|---|---|
| False | 0.271429 | 0.445759 | 210 |
| True | 0.237852 | 0.426313 | 391 |


In [10]:
# careful using the shortcuts dhat.std and dhat.size for those columns: these names already belong to
# DataFrame itself (std is a method, size is a property), so they won't return the columns
# use .loc[] instead
dhat = dhat.assign(serr = dhat.loc[:, 'std'] / dhat.loc[:, 'size']**.5)

In [11]:
dhat

| young | mean | std | size | serr |
|---|---|---|---|---|
| False | 0.271429 | 0.445759 | 210 | 0.03076 |
| True | 0.237852 | 0.426313 | 391 | 0.02156 |


# Q5
Suppose the assumptions needed for a two-sample difference of means test are valid here.

- **Calculate the p-value** for the following hypothesis test using the point estimates from Q4:

$$H_0: m_{\text{young}} = m_{\text{old}}, \quad \quad H_1: m_{\text{young}} \neq m_{\text{old}}$$
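
Assuming the large-sample normal approximation for the difference of two independent sample means, the test statistic and two-sided p-value can be written with the Q4 point estimates as follows, where $\Phi$ denotes the standard normal CDF (a symbol introduced here for illustration):

$$z = \frac{\left| \text{mhat}_{\text{young}} - \text{mhat}_{\text{old}} \right|}{\sqrt{\text{serr}_{\text{young}}^2 + \text{serr}_{\text{old}}^2}}, \quad \quad p = 2\,\Phi(-z)$$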

In [12]:
s = (dhat.loc[True, 'serr']**2 + dhat.loc[False, 'serr']**2)**.5
z = abs(dhat.loc[True, 'mean'] - dhat.loc[False, 'mean'])/s

In [13]:
s, z

(0.037563448774103364, 0.893871838722849)

In [14]:
p = 2*stats.norm.cdf(-z)
p

0.37139046679850773

# Q6

- **Explain briefly:** Can you reject the null hypothesis with 90 percent confidence? Be specific in your justification and use concepts from class, but the answer need not be more than one sentence.
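
One way to check the decision is sketched below, assuming the `z` value from the Q5 output and a two-sided test at the 10 percent significance level. It uses Python's standard-library `statistics.NormalDist` so that it runs on its own; `stats.norm.cdf` and `stats.norm.ppf` from `scipy.stats` would serve equally well. The names `alpha`, `z_crit`, `reject_by_p`, and `reject_by_z` are illustrative choices, not part of the assignment.

```python
from statistics import NormalDist

std_normal = NormalDist()                    # standard normal: mean 0, sd 1

z = 0.893871838722849                        # |z| statistic copied from the Q5 output
p = 2 * std_normal.cdf(-z)                   # two-sided p-value

alpha = 0.10                                 # 90 percent confidence = 10 percent significance level
z_crit = std_normal.inv_cdf(1 - alpha / 2)   # two-sided critical value

reject_by_p = p < alpha                      # rule 1: p-value below alpha
reject_by_z = z > z_crit                     # rule 2: |z| beyond the critical value
print(round(p, 4), round(z_crit, 4), reject_by_p, reject_by_z)
```

The two rules are equivalent and both come out False here: the p-value of about 0.37 is above 0.10, and the statistic of about 0.89 is below the critical value of about 1.645.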

# Q7
We don't have much information about how this study was performed. Without spending time to research how it was done (you can for your own purposes if you want), answer the following:

- **Write one or two sentences** about what you should ask the study creators to evaluate whether your hypothesis test assumptions are valid. Be specific and use concepts from class.

# Q8

- **explain in words** your answer in Q7 using concepts from class
- **in particular:**
    - how does the result from Q7 compare with your answer to Q6?
    - how does the small-sample CI compare to the big-sample CI?
    - if they are different, why might they be different? If not, why not?