# Using the Cochran-Mantel-Haenszel (CMH) statistic to learn about masculinity in America

FiveThirtyEight, a web site about polling, sports, and statistics, worked with a few other parties to run a survey of men's relation to masculinity in the #MeToo era. The discussion is here:
https://fivethirtyeight.com/features/what-do-men-think-it-means-to-be-a-man/ .

The analysis is noticeably univariate: it mostly discusses each question's responses individually, without many cross-tabulations like "How does the likelihood of self-perceived masculinity differ between married and single?" This is not to fault the authors, who are writing a short article for generalists.

But the staff at FiveThirtyEight have graciously and openly posted the data. So we can do further analysis on these kinds of questions. We can look at the one posed above, or whether men are more likely to pay on dates when they self-report as high-masculinity or when they want to be _perceived_ as masculine, or how reporting as high-masculinity changes the odds of having kids.

This page supplements <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3197362">An Analysis of U.S. Domestic Migration via Subset-stable Measures of Administrative Data</a>, a paper analyzing 82 million moves made by members of the U.S. formal economy, 2001-2015.

To carry out that analysis properly, the paper develops the Cochran-Mantel-Haenszel (CMH) statistic as a tool for answering questions about the relationship between factors. For example,

Q: What is the likelihood of moving for people who are married, relative to singles?  
A: Marrieds are less likely to move than singles.  
Q: What if we control for having kids and a mortgage?  
A: By adding controls to the same CMH calculation, we find that marrieds are much more likely to move than singles.

This page will:

* Ask lots of interesting questions about the relationship between masculinity and other opinions and factors.
* Introduce you to the CMH statistic and the `cmh.py` package.

Let's get on with the analysis.

## First, download the data and prep some variables.

The next three cells provide and run the functions to do so, producing a Pandas data frame named `d`. This is standard data prep.

Many of your questions about the survey questions may be answered by the full instrument itself, at https://github.com/fivethirtyeight/data/blob/master/masculinity-survey/masculinity-survey.pdf.

In [1]:
from urllib.request import urlopen
import pandas as pd  
                                            
Data_URL = "https://github.com/fivethirtyeight/data/blob/master/masculinity-survey/raw-responses.csv?raw=true" 
 
def get_data():
    """Download a copy of the survey if we don't already have it. Return a data frame with the observations."""
    try:
        return pd.read_csv("survey.csv")
    except FileNotFoundError:
        # No local copy yet: fetch the CSV once and cache it on disk.
        in_csv = urlopen(Data_URL).read().decode('utf-8')
        with open("survey.csv", 'w', encoding='utf-8') as f:
            f.write(in_csv)
        return pd.read_csv("survey.csv")

In [2]:
def prep_data(d):
    """The columns have generic names; give them something more readable.
       Then recode some of the codes to numeric values."""
    d.rename(columns= {
        'q0001': 'masc_self_rate',        
        'q0002': 'perception_importance', 
        'q0005': 'societal_pressure',
        'q0010_0007': 'advantage_at_work',
        'q0011_0004': 'disadvantage_at_work',
        'q0018': 'pays_on_dates',         
        'q0021': 'concerns_about_too_far',
        'q0024': 'married',
        'q0025': 'kids',  
        'q0026': 'sexual_orientation',  
        'q0027': 'age',   
        'q0028': 'race',  
        }, inplace = True)

    """Encode text values to numeric. For control variables this will not actually be necessary."""
    likert_λ = (lambda x:
        1 if (x.startswith('Somewhat') or x.startswith('Very')) else 
        0 if (x.startswith('Not very') or x.startswith('Not at all'))
        else -1)
    none_λ = lambda x: 0 if str(x).startswith("None") else 1
    
    """Encode the masculinity self-rating, and listwise delete those that did not reply to the key questions."""
    d.loc[:,"masc_self_rate"] = d["masc_self_rate"].apply(likert_λ)
    d.loc[:, "perception_importance"] = d["perception_importance"].apply(likert_λ)
    d = d.loc[d["masc_self_rate"]>=0]
    # Take an explicit copy so the column assignments below modify this frame, not a view.
    d = d.loc[d["perception_importance"]>=0].copy()

    d.loc[:, "advantage_at_work"] = d["advantage_at_work"].apply(none_λ)
    d.loc[:, "disadvantage_at_work"] = d["disadvantage_at_work"].apply(none_λ)
    d.loc[:, "sexual_orientation"] = d["sexual_orientation"].apply(lambda x: 0 if x == "Straight" else 1)
    d.loc[:, "married"] = d["married"].apply(lambda x: 1 if x == "Married" else 0)
    d.loc[:, "age3"] = d["age3"].apply(lambda x: 0 if x == "18 - 34" else 1 if x == "35 - 64" else 3)
    return d

In [3]:
d = prep_data(get_data())

## Now for some risk ratios

Now that we have the data set, we can start asking about how variables relate.

For example, how does being married relate to the chance of self-describing as "somewhat" or "very" masculine? The first way to answer this is a simple crosstab, as shown below. The survey has weighted results to adjust the sample of respondents to more closely match the population at large, so it is preferable to get an aggregate total (`agg`) using the sum of weights, not simple observation counts.

As per the encodings above, unmarried=0 and married=1; a high-masculinity self-report is 1 and a low one is 0.

In [4]:
d.groupby(["married","masc_self_rate"]).agg({"weight": sum})

| married | masc_self_rate | weight |
|--------:|---------------:|-----------:|
| 0 | 0 | 127.765599 |
| 0 | 1 | 525.307171 |
| 1 | 0 | 43.740838 |
| 1 | 1 | 450.619175 |


Among the married, the likelihood or _risk_ of self-reporting as somewhat/very masculine is (weighted count of marrieds who claim high masculinity) divided by (weighted count of marrieds), which the table shows to be about 450.62/(450.62 + 43.74).
There is a similar ratio for the unmarried. These are the two ratios to compare.

It is worth writing this down in general notation to make clear what the CMH statistic is doing.

The table of possibilities, including married=no and high masculinity self-rate=yes, married=yes and high masculinity-self-rate=no, and so on looks like this:

|               |single     |  married  | 
|---------------|-----------:|-----------:|
|  high  | h$_y$ m$_n$ = 525.3 | h$_y$ m$_y$ = 450.62 |
|  low  | h$_n$ m$_n$ = 127.8 | h$_n$ m$_y$ = 43.74 |

The risk of high self-rate given married is the weight of marrieds reporting as high masculinity over the total weight of marrieds: $$ \frac{h_y m_y}{(h_y m_y + h_n m_y)}= \frac{450.62}{450.62 + 43.74} = 91\%$$

Similarly for the chance that a single person claims high masculinity. Then the risk ratio is indeed the ratio of the two risks. The denominator has the same form as the numerator, but with the $m_y$s replaced by $m_n$s:

$$ \frac{\frac{h_y m_y}{(h_y m_y + h_n m_y)}}{\frac{h_y m_n}{(h_y m_n + h_n m_n)}} = \frac{
\frac{450.62}{450.62 + 43.74}}
{\frac{525.3}{525.3 + 127.8}} = \frac{91\%}{80.4\%} =113\%$$

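Multiplying by the reciprocal of the denominator reduces this fraction-of-fractions to a more legible (and more useful) simple fraction. The values .224 and .197 are the two products after scaling both by the same constant, which cancels in the ratio:
$$ \frac{(h_y m_y)(h_y m_n + h_n m_n)}{(h_y m_n)(h_y m_y + h_n m_y)} = \frac{.224}{.197} = 113\%$$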
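The same arithmetic can be checked in a few lines of plain Python. This is only a sketch using the four weighted totals copied from the crosstab; the variable names are illustrative and it does not use the `cmh.py` package.

```python
# Weighted cell totals copied from the crosstab above.
high_married, low_married = 450.619175, 43.740838
high_single, low_single = 525.307171, 127.765599

# Risk of a high-masculinity self-report within each marital group.
risk_married = high_married / (high_married + low_married)
risk_single = high_single / (high_single + low_single)

# Risk ratio as a fraction of fractions.
risk_ratio = risk_married / risk_single

# The same ratio in its de-compounded form: one product over another.
num = high_married * (high_single + low_single)
den = high_single * (high_married + low_married)

print(f"married: {risk_married:.1%}, single: {risk_single:.1%}")
print(f"risk ratio: {risk_ratio:.1%}, de-compounded: {num / den:.1%}")
```

Both forms give the same number, because the second is just the first with the inner denominators multiplied out.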


The CMH calculation with no controls gives exactly that risk ratio:

In [5]:
from cmh import cmh

mar_to_masc = cmh(d, "married", "masc_self_rate", "weight", [])
print(f"The risk of self-reporting as somewhat/very masculine for marrieds relative to singles: {mar_to_masc:.1%}")

The risk of self-reporting as somewhat/very masculine for marrieds relative to singles: 113.3%


So, the married are about 13% more likely to report as somewhat/very masculine.

The arguments to the `cmh` function are ordered "independent" followed by "dependent"; picture an arrow `married → masc_self_rate`. Is it a measure of causality? That is for you to decide, because like all statistical measures, the risk ratio and its generalization via the CMH statistic can inform a causal claim but cannot prove one.

We can ask the CMH calculator to be a little more verbose about how it did the math. It will present a table giving the total weight for dependent=yes and independent=yes (`dyiy`), then dependent=yes and independent=no (`dyin`), and so on. These numbers will match the crosstab above. The numerator and denominator of the de-compounded fraction above are also given.

In [6]:
cmh(d, "married", "masc_self_rate", "weight", [], verbose=True)

Independent: married, dependent: masc_self_rate, controls: []
         dyiy        dyin       dniy        dnin       weight      num  \
1                                                                        
1  450.619175  525.307171  43.740838  127.765599  1147.432783  0.22352   

        den  
1            
1  0.197243  


1.1332209122528545

## Like the real world, the CMH is asymmetric

The masculine → married and married → masculine questions are distinct and not necessarily numerically related. Correlation-based statistics, like a univariate linear regression, are symmetric and give effectively the same answer for these two different questions. Because the CMH statistic gives different answers to the question _does $A$ influence $B$?_ and the question _does $B$ influence $A$?_, it can give more information pertinent to a causality inquiry.

Indeed, the chance of being married given a claim of higher masculinity, relative to those with a claim of lower masculinity, is noticeably larger than the likelihood of claiming higher masculinity given being married, relative to single (13% as above):

In [7]:
print(f"""
masculinity self-rate → married:
{cmh(d, "masc_self_rate", "married", "weight", []):.1%}
""")


masculinity self-rate → married:
181.0%
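To see where the asymmetry comes from, here is a small sketch that computes both directions from the same four weighted totals in the first crosstab. The helper `risk_ratio` is hypothetical and written for this illustration; it is not part of `cmh.py`.

```python
def risk_ratio(dyiy, dyin, dniy, dnin):
    """Risk of dependent=yes when independent=yes, relative to independent=no.

    Argument names follow the verbose table: d = dependent, i = independent,
    y = yes, n = no.
    """
    return (dyiy / (dyiy + dniy)) / (dyin / (dyin + dnin))

# Weighted totals from the married x masc_self_rate crosstab.
married_high, married_low = 450.619175, 43.740838
single_high, single_low = 525.307171, 127.765599

# married -> masculinity: the dependent variable is the self-rating.
married_to_masc = risk_ratio(married_high, single_high, married_low, single_low)

# masculinity -> married: same cells, but the two off-diagonal cells swap roles.
masc_to_married = risk_ratio(married_high, married_low, single_high, single_low)

print(f"married -> masc: {married_to_masc:.1%}")
print(f"masc -> married: {masc_to_married:.1%}")
```

The two calls use identical data; only the roles of the off-diagonal cells change, and that alone moves the ratio from about 113% to about 181%.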



To give another example of how asymmetric the CMH statistic can be, let's ask these two questions about reported masculinity and the second item on the survey, "How important is it to you that others see you as masculine?":

perception → masculinity:
How does the chance of rating as somewhat/very masculine rise for those who report high importance to being perceived as masculine relative to others?


masculinity → perception: 
How does the chance of giving high importance to being perceived as masculine rise for those with high masculinity self-reports relative to others?

In [8]:
print(f"""
perception → masculinity self-rate
{cmh(d, "perception_importance", "masc_self_rate", "weight", []):.1%}

masculinity self-rate → perception
{cmh(d, "masc_self_rate", "perception_importance", "weight", []):.1%}
""")


perception → masculinity self-rate
139.0%

masculinity self-rate → perception
192.9%



So: men who say it is important that others perceive them as masculine are 39% more likely to report as masculine than men who don't consider it important.

But men who report as high-masculinity are nearly twice as likely (93% more likely) to say it is important that others see them as masculine than men who don't report as high-masculinity.

## Controlling for confounding factors

But there are all sorts of other issues. Maybe the marriage-based relations are really about having children. Married people are more likely to have kids, and maybe fathers feel more masculine.

Above, the controls were empty (the last argument to each call to `cmh` was `[]`), and the married → claim of high masculinity risk ratio, calculated via `cmh`, was 113%. Let's control for kids. Now, the CMH calculation generates a row in its grouped table for those with kids and those without, and finds the elements of the risk ratio for both subsets, with each row's calculation exactly as above. Then the CMH statistic, the aggregate risk ratio, is the sum of numerators over the sum of denominators.

In [9]:
print(f'{cmh(d, "married", "masc_self_rate", "weight", ["kids"], verbose=True):.1%}')

Independent: married, dependent: masc_self_rate, controls: ['kids']
                    dyiy        dyin       dniy        dnin      weight  \
kids                                                                      
Has children  356.378401  201.500139  22.272502   23.115162  603.266204   
No children    93.523366  322.633410  21.468335  103.605786  541.230897   

                   num       den  
kids                              
Has children  0.115642  0.110224  
No children   0.064189  0.059740  
105.8%
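The aggregation can be reproduced by hand from the four weight columns of the verbose table. The sketch below uses the textbook Mantel-Haenszel form, dividing each stratum's products by that stratum's total weight. That per-stratum scaling is an assumption about what `cmh.py` does internally; any further constant factor common to all strata cancels in the final ratio.

```python
# (dyiy, dyin, dniy, dnin) per stratum, copied from the verbose table above.
strata = {
    "Has children": (356.378401, 201.500139, 22.272502, 23.115162),
    "No children": (93.523366, 322.633410, 21.468335, 103.605786),
}

num_total = 0.0
den_total = 0.0
for label, (dyiy, dyin, dniy, dnin) in strata.items():
    n = dyiy + dyin + dniy + dnin      # total weight in this stratum
    num = dyiy * (dyin + dnin) / n     # stratum numerator
    den = dyin * (dyiy + dniy) / n     # stratum denominator
    num_total += num
    den_total += den
    print(f"{label}: within-stratum risk ratio {num / den:.1%}")

# Aggregate risk ratio: sum of numerators over sum of denominators.
cmh_ratio = num_total / den_total
print(f"aggregate: {cmh_ratio:.1%}")
```

Because the numerators and denominators are summed before dividing, heavier strata count for more than lighter ones.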


Indeed, the increase in the risk of self-reporting as somewhat/very masculine for married versus non-married men is smaller, falling from 13% above to only 5.8% here. Kids really do have an effect.

We can look at that effect by looking at CMH statistics regarding kids, controlling for marriage.

A technical detail regarding the `cmh` function: above, we converted many columns of the data set to numbers, but did not do so with the "has children" column. The `cmh` function allows you to provide a lambda (or other function) to evaluate whether the value of some column has or does not have the property you are asking questions about. Here, the function is simple, `1 if kids=="Has children" else 0`. But the option to provide a coding function is general enough that it can be used for less obvious situations.

In [10]:
kid_λ = lambda kids: 1 if kids=="Has children" else 0

k_to_m = cmh(d, "kids", "masc_self_rate", "weight", [], indep_c=kid_λ)
m_to_k = cmh(d, "masc_self_rate", "kids", "weight", [], dep_c=kid_λ)

k_to_m_ctrl = cmh(d, "kids", "masc_self_rate", "weight", ["married"], indep_c=kid_λ)
m_to_k_ctrl = cmh(d, "masc_self_rate", "kids", "weight", ["married"], dep_c=kid_λ)

print(f"""
kids → high-masculinity self-rate:
{k_to_m:.1%}
     
high-masculinity self-rate → kids:
{m_to_k:.1%}


kids → high-masculinity self-rate, controlling for marital status:
{k_to_m_ctrl:.1%}
     
high-masculinity self-rate → kids, controlling for marital status:
{m_to_k_ctrl:.1%}
""")


kids → high-masculinity self-rate:
120.4%
     
high-masculinity self-rate → kids:
216.0%


kids → high-masculinity self-rate, controlling for marital status:
117.5%
     
high-masculinity self-rate → kids, controlling for marital status:
182.4%



I'm avoiding causal language, but if you wanted to claim that having kids changes self-perception of masculinity somewhat (up 20%), but self-perceiving as high masculinity has a big effect on the chance of having kids (more than double the chance!), the statistics would support—or at least fail to reject—your claim.

But somebody might point to the numbers above that marriage and masculinity self-perception are related, so maybe it's really marriage that's the causal factor. We could then recalculate the aggregate risk ratio controlling for marriage, and find that marriage does explain part of the effect, but kids are still a very big part of the story, and claims of asymmetric causation are still somewhat supported.

As more people raise other possible confounders (sexual orientation, for example), you can add those to the control list.

## Stability given subsets

For example, this is a survey of attitudes in the United States, so we have to take race into account. We can add race to the list of controls, so our controls are now `["race", "kids"]`.

Controlling for race has a minimal effect on the marriage-to-masculinity report relationships, but by producing a verbose aggregation table, we see why: the weighted survey is heavily White. The White-to-Asian ratio in the general U.S. population is about ten-to-one, and here is 19-to-one; similarly for other minorities. There are a lot of choices to be made in selecting survey weights, especially in a situation where only one weights column can be provided, and it seems the designers chose weights that focus on other aspects, perhaps age or sexual orientation.

In [11]:
print(f"""
Controlling for race and kids:
{cmh(d, "married", "masc_self_rate", "weight", ["race", "kids"], verbose=True):.1%}

Controlling only for kids:
{cmh(d, "married", "masc_self_rate", "weight", ["kids"]):.1%}
""")

Independent: married, dependent: masc_self_rate, controls: ['race', 'kids']
                             dyiy        dyin       dniy       dnin  \
race     kids                                                         
Asian    Has children    6.498116    4.043315   0.000000   0.000000   
         No children     0.248532   21.255771   1.889128   3.811718   
Black    Has children   35.701493   34.346141   1.047586   8.643674   
         No children     3.045584   49.421789   0.000000  21.180150   
Hispanic Has children   56.601465   39.839144   0.495502   0.215332   
         No children    24.232309   43.663806   3.509484  15.419392   
Other    Has children   14.644500   15.771335   1.708185   0.097685   
         No children     3.066757   13.747157   0.102593   1.027961   
White    Has children  242.932826  107.500204  19.021230  14.158471   
         No children    62.930184  194.544887  15.967130  62.166563   

                           weight       num       den  
race     kids  

In fact, if we do the same check only on rows of the data set where `d["race"]=="White"`, almost nothing changes. Let's compare only Whites, everybody, then only non-Whites. Throughout, we'll keep race as a control, though it is irrelevant in the first case with only one race.

In [12]:
print(f"""
{cmh(d.loc[d["race"]=="White"], "married", "masc_self_rate", "weight", ["race","kids"]):.1%} ← only Whites
{cmh(d, "married", "masc_self_rate", "weight", ["race","kids"]):.1%} ← Everybody
{cmh(d.loc[d["race"]!="White"], "married", "masc_self_rate", "weight", ["race","kids"]):.1%} ← only non-Whites
""")


105.1% ← only Whites
105.6% ← Everybody
106.6% ← only non-Whites



We broke the full population into two parts, White and non-White, and got two CMH statistics, and the aggregate CMH statistic was in between. The migration study linked at the top of this analysis refers to this as _subset stability_. That is, if I start with the married → high-masculinity self-rate CMH statistic (105.6%), and then I tell you that the person is White, then you will adjust your expectation of the effect of married → masc downward to 105.1%. If I tell you that the person is non-White, then you will adjust your expectations upward to 106.6%.

This is extremely natural, and the above paper proves that the CMH statistic is the _only_ aggregate risk ratio among those admissible that guarantees this behavior. Without it, we could have the awkward situation where you could start with a baseline, and then I give you more information about the observation, and then your estimate shifts upward _no matter what new information you have_. It would not be hard to find examples like this for regression parameters, simple averages of risk ratios, or other popular statistics.
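The arithmetic behind this behavior can be seen in miniature. Because the aggregate is a sum of numerators over a sum of denominators, it always lands between the ratios of its parts. The sketch below illustrates this with the `num` and `den` values printed earlier for the two kids strata; it is an illustration of the in-between property, not the paper's proof of uniqueness.

```python
# (num, den) per subset, copied from the verbose table with kids as the control.
parts = {
    "Has children": (0.115642, 0.110224),
    "No children": (0.064189, 0.059740),
}

# Risk ratio within each subset on its own.
subset_ratios = {label: num / den for label, (num, den) in parts.items()}

# Aggregate: sum of numerators over sum of denominators.
aggregate = (sum(num for num, _ in parts.values())
             / sum(den for _, den in parts.values()))

for label, ratio in subset_ratios.items():
    print(f"{label}: {ratio:.1%}")
print(f"Everybody: {aggregate:.1%}")

# With positive denominators the aggregate cannot fall outside the subset values.
assert min(subset_ratios.values()) <= aggregate <= max(subset_ratios.values())
```

Learning which subset an observation belongs to therefore moves the estimate up for one subset and down for the other, never in the same direction for both.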

That concludes the methodological part of this essay: the CMH statistic is an excellent option for asking questions about how variables relate, controlling for others. The controls are literal, comparing only those married with kids to others married with kids, and subset statistics are subset stable. An appendix to the above paper lists several other manners in which the CMH statistic improves over correlation-based statistics like regression coefficients.

Other issues, such as bootstrapping confidence intervals or more elaborate uses of the lambdas to handle categorical data, are omitted due to space considerations. There is also far more detail in the appendix of the above-linked paper.

# A few more masculinity stats

Here is another issue we can tackle using the CMH statistic: Which has better support in the data: men are more likely to pay on dates because they perceive themselves as masculine, or because they want _others_ to perceive them as masculine? There is more support for the self-perception:

In [13]:
pay_λ=lambda x: 1 if x=="Always" else 0

print(f"""
others' perception is important → always pays on dates, controlling for masculinity self-report
{cmh(d, "perception_importance", "pays_on_dates", "weight", ["masc_self_rate"], dep_c=pay_λ):.1%}

masculinity self-rate → always pays on dates, controlling for importance of perception
{cmh(d, "masc_self_rate", "pays_on_dates", "weight", ["perception_importance"], dep_c=pay_λ):.1%}
""")


others' perception is important → always pays on dates, controlling for masculinity self-report
109.2%

masculinity self-rate → always pays on dates, controlling for importance of perception
160.6%



The survey asked respondents whether being male had any advantages or disadvantages at work. Several options were given, but a large percentage marked _None_ for both questions. How do the likelihoods of marking some not-none value change for high-masculinity self-raters versus others?

In [14]:
print(f"""
masculinity self-rate → Being male has advantages at work
{cmh(d, "masc_self_rate", "advantage_at_work", "weight", []):.1%}

masculinity self-rate → Being male has disadvantages at work
{cmh(d, "masc_self_rate", "disadvantage_at_work", "weight", []):.1%}
""")


masculinity self-rate → Being male has advantages at work
91.4%

masculinity self-rate → Being male has disadvantages at work
112.1%



There are a lot more questions you could ask of the data, by modifying the `cmh` calls above. Please do clone the repository and modify the code to ask more questions, add more controls, check reliability via bootstrap, or compare the results to your favorite alternative measures of controlled relationships.
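As a starting point for the bootstrap check, here is a minimal percentile-bootstrap sketch in plain Python. Everything in it is an assumption made for illustration: the helper names `weighted_risk_ratio` and `bootstrap_interval` are hypothetical, the rows are synthetic rather than survey data, and resampling whole rows with their weights is only one of several reasonable designs for weighted surveys. To use it on the real data, you would swap in a call to `cmh` on a resampled data frame.

```python
import random

def weighted_risk_ratio(rows):
    """rows: (indep, dep, weight) triples with indep and dep in {0, 1}.
    Returns the uncontrolled risk ratio, or None if a group is empty."""
    w = {(i, dp): 0.0 for i in (0, 1) for dp in (0, 1)}
    for indep, dep, weight in rows:
        w[(indep, dep)] += weight
    with_indep = w[(1, 1)] + w[(1, 0)]
    without_indep = w[(0, 1)] + w[(0, 0)]
    if with_indep == 0 or without_indep == 0 or w[(0, 1)] == 0:
        return None
    return (w[(1, 1)] / with_indep) / (w[(0, 1)] / without_indep)

def bootstrap_interval(rows, draws=500, level=0.95, seed=0):
    """Percentile interval from resampling rows with replacement."""
    rng = random.Random(seed)
    rows = list(rows)
    estimates = []
    for _ in range(draws):
        resample = rng.choices(rows, k=len(rows))
        estimate = weighted_risk_ratio(resample)
        if estimate is not None:  # skip degenerate resamples
            estimates.append(estimate)
    estimates.sort()
    lo_index = int((1 - level) / 2 * len(estimates))
    hi_index = int((1 + level) / 2 * len(estimates)) - 1
    return estimates[lo_index], estimates[hi_index]

# Synthetic rows for illustration only (NOT the survey data), all with weight 1.
rows = ([(1, 1, 1.0)] * 45 + [(1, 0, 1.0)] * 5
        + [(0, 1, 1.0)] * 40 + [(0, 0, 1.0)] * 10)

point = weighted_risk_ratio(rows)
low, high = bootstrap_interval(rows)
print(f"risk ratio {point:.1%}, 95% interval roughly {low:.1%} to {high:.1%}")
```

A wide interval, or one that straddles 100%, is a sign that the corresponding risk ratio should be read with caution.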