# **Pride and Joy**
### *An investigation of mental health correlates in LGBQ+ people*
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|Emily K. Sanders| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |Capstone Project|
|DSB-318| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |June 13, 2024|
---

## Problem Statement

My goal is to create a statistical model from which to draw inferences about different variables that predict better or worse psychological outcomes for people with minority sexual identities. In my conclusions section, I will interpret these findings and offer suggestions for social, political, and personal actions that we can all take to improve these outcomes.

### Deliverables

- A written report of my work and findings, including interpretation and recommendations.
- A slideshow to accompany the presentation of this report.

## Table of Contents

- [Introduction](#introduction)
  - [Terminology](#terminology)
  - [Definition of Success](#definition-of-success)
- [Method](#method)
  - [Apparatus](#apparatus)
  - [Data Acquisition](#data-acquisition)
  - [Sample](#sample)
  - [Strategy](#strategy)
- [Notebook Summary](#notebook-summary)

## Introduction

Prior research has established that people with minority sexual identities experience higher rates of psychological distress and mental illness, and strongly suggests that, in addition to universal stressors like poverty or abuse, these experiences of distress may be driven by unique factors related to social oppression ([Meyer, 1995](https://www.jstor.org/stable/2137286?origin=crossref)).  Prior research has also found, however, that these same populations can draw on unique sources of strength and protection from psychological harm.  In particular, there is some evidence to suggest that feelings of connection, belonging, and pride in one's sexual minority identity and community may buffer - or even eliminate - the harmful effects of social stigma and oppression ([Fingerhut, Peplau, & Gable, 2010](https://www.tandfonline.com/doi/abs/10.1080/19419899.2010.484592)).  In the current work, I aim to replicate and expand upon these prior findings, so as to better understand the drivers of psychological distress in this population, as well as potential protective factors.

### Terminology

The terminology used for sexual orientations and identities is a perennial source of both discourse and discord.  Although it is always best practice to refer to individuals by the terms they choose for themselves, the practical logistics of a project such as this one, in which many individuals' responses are considered in the aggregate, require researchers to choose an umbrella term.  In the original documentation accompanying the dataset I used, the authors referred to their participants as "LGB," which stands for lesbian, gay, and bisexual. However, when reviewing their results, they discovered that many participants identified with terms other than these three (see pp. 14-15 of *37166-Documentation-methodology.pdf*, in the download folder with the data).  In order to represent these participants more explicitly, and to avoid the use of acronyms that readers may or may not be familiar with, I opted to use the word "queer" to describe the sample.  Although this term is not preferred by all non-heterosexual or non-cisgender people, it is an established, simple term that is broadly accepted in academia and social movements.  For a brief overview, please see [this excellent guide](https://guides.libraries.indiana.edu/c.php?g=995240&p=8361766) from the Indiana University at Bloomington Library.

### Definition of Success

Because my goal is to create a model for inference, not prediction, I am not overly concerned with its ability to explain all of the variance in the data, nor with its generalizability to other datasets.  I will, of course, attempt to create the best model I can, as measured by $R^2$ and several loss functions.  However, many methods to improve a model's performance (e.g., regularization) impair its interpretability, and would therefore be inappropriate in the current context.  For this reason, I do not expect, and do not require, the exceptional metrics that many of my peers in the field of data science are able to obtain with predictive models.  Instead, I will interpret my results on their own terms, drawing what conclusions I can, relating them to prior findings where possible, and identifying areas that remain ambiguous for follow-up research.  In this work, the definition of success is simply to understand queer mental health better - even if that understanding is limited to a wish list for future work.

## Method

To answer the question of what factors predict positive or negative mental health in queer people, I created a linear regression model of participants' responses to a general mental health inventory, based on their responses to questions about themselves, their experiences, and their social contexts.

My target variable, scores from a general mental health inventory known as [Kessler-6](https://pubmed.ncbi.nlm.nih.gov/12578436/) (Kessler et al., 2003, as cited in the original documentation), consists of participants' reports of how often in the past month they experienced six psychological symptoms.  As explained on pp. 22-23 of *37166-Documentation-methodology.pdf*, part of the original documentation:
>\[The Kessler-6 inventory is] a 6-item scale from the National Comorbidity Survey (Kessler et al., 2003). Scale items (w1q77A- w1q77F) asked respondents how often, in the past 30 days, they had felt “nervous,” “hopeless,” “restless or fidgety,” “so depressed that nothing could cheer you up,” “that everything was an effort,” and “worthless.” Responses were recorded on a 5-point scale ranging from “all of the time” to “none of the time.”

The overall score of this scale was calculated by summing up the responses to each item, resulting in scores from 0-24, with higher scores indicating greater mental distress.
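As a rough illustration of that scoring rule, the sketch below sums six item responses into a total. It is not the original authors' code. The column names are hypothetical, modeled on the item IDs quoted above, and the 0-4 coding per item is an assumption inferred from the 0-24 total range (the raw file may code the responses differently and need recoding first).

```python
import pandas as pd

# Hypothetical column names, modeled on the item IDs w1q77A-w1q77F quoted above.
K6_ITEMS = ["w1q77a", "w1q77b", "w1q77c", "w1q77d", "w1q77e", "w1q77f"]


def kessler6_total(df: pd.DataFrame, items=K6_ITEMS) -> pd.Series:
    """Sum six items, each assumed coded 0 (none of the time) to 4 (all of the time)."""
    responses = df[items]
    # Every non-missing value must fall inside the assumed 0-4 coding.
    in_range = responses.isna() | ((responses >= 0) & (responses <= 4))
    if not in_range.all().all():
        raise ValueError("K6 items are expected to be coded 0-4")
    # skipna=False: a respondent missing any item gets a missing total.
    return responses.sum(axis=1, skipna=False)


# Three made-up respondents: no symptoms, maximal symptoms, and a mixed pattern.
example = pd.DataFrame([[0] * 6, [4] * 6, [1, 2, 0, 3, 4, 2]], columns=K6_ITEMS)
print(kessler6_total(example).tolist())
```

Returning a missing total when any item is missing is one possible design choice; a project could instead impute or prorate, depending on how much missingness the data contain.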

Using such a general scale as a target variable allowed me to look for factors that predict overall distress or protection.  Much previous research has used more specific scales to assess specific psychological outcomes, but this approach necessarily requires much more complex interpretation than is feasible for the current project (e.g., in 2016, Sanders and Chalk found that, among gay and lesbian participants, experiencing overt homophobic discrimination predicted many negative outcomes, but internalized homophobia only predicted stress).  Furthermore, using a general scale like Kessler-6 allows researchers to capture distress, even if it manifests differently in different people.  If a suitable model can be found for this variable, the inferences drawn from it may prove to be more generalizable than those related to any one specific psychological symptom.

### Apparatus

Throughout the course of this project, I used `python` to manipulate, analyze, and model the data.  Some of the modules I used include `pandas`, `numpy`, `matplotlib`, `seaborn`, `time`, `datetime`, `os`, `string`, `statsmodels`, and several libraries within `sklearn`.  The developers of these modules, and of `python` itself, have graciously made their work open source. 

### Data Acquisition

For my report, I found and used a dataset called *Generations*, published by Ilan H. Meyer, Ph.D., in *Data Sharing for Demographic Research* ([DSDR](https://www.icpsr.umich.edu/web/pages/DSDR/index.html)), a data archive of the *Inter-university Consortium for Political and Social Research* ([ICPSR](https://www.icpsr.umich.edu/web/pages/index.html)) at the University of Michigan.  This dataset, like all datasets in the archive, is only available to download with an ICPSR account, and members are not permitted to redistribute the datasets.  Therefore, **I have not included the dataset in my public report**, and I urge anyone reproducing my work to similarly avoid posting the data anywhere publicly.  However, the data is publicly available and free to access, so any reader wishing to follow along with or replicate my work may do so by following the instructions below.  **Please note that none of the notebooks will run without this step.**

1. Visit the [DSDR study page](https://www.icpsr.umich.edu/web/DSDR/studies/37166).
2. Click "Download" and select "Delimited."
3. Select one of the "Sign In" options.
4. Click "Set up a new account." (Or sign in, if you happen to already have an account.)
5. Fill out the information and click "Next."
6. Follow the instructions on screen to verify your account.  Once your account is verified, you should be automatically redirected back to the study page.
7. Click "Download" again and select "Delimited" again.
8. Answer the questions on screen and click "Save."
9. Click "Agree" on the Terms of Use.
10. The file will either download automatically into your Downloads folder, or a pop-up window will appear for you to choose the destination.
11. Locate the file on your computer.
12. Unzip the file `ICPSR_37166-v2.zip` to the location of your choice.
13. Within the unzipped folder, locate the folder `DS0007`.
14. Within `DS0007`, locate `37166-0007-Data.tsv`.  This is the data file.
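
Once the file is in place, it can be read into `pandas`. The following is a minimal sketch, assuming the folder layout described in the steps above. The function name `load_generations` and its `unzipped_root` parameter are hypothetical, chosen for illustration, and the later notebooks may load the file differently.

```python
from pathlib import Path

import pandas as pd


def load_generations(unzipped_root: str) -> pd.DataFrame:
    """Read the tab-delimited data file from the unzipped ICPSR folder."""
    data_file = Path(unzipped_root) / "DS0007" / "37166-0007-Data.tsv"
    if not data_file.exists():
        raise FileNotFoundError(f"Expected the data file at {data_file}")
    # sep="\t" because the "Delimited" download is tab-separated;
    # low_memory=False reads each column in one pass so dtypes are inferred consistently.
    return pd.read_csv(data_file, sep="\t", low_memory=False)


# Usage (the path is a placeholder for wherever you unzipped the download):
# df = load_generations("path/to/unzipped/folder")
# print(df.shape)
```

Taking the folder as an argument keeps the data location out of the code, which helps when the data must stay outside a public repository.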

Curious readers may also be interested in the other file in that folder, `DS0007/37166-0007-Codebook-ICPSR.pdf`, which is part of the documentation for the original data.  It includes frequency counts for every variable, as well as some archival notes from DSDR.  If you exit out of `DS0007` back to the root directory of the unzipped folder, you'll find another documentation file: `37166-Documentation-methodology.pdf`.  This file came from the original author, and includes instructions, descriptions, and Cronbach's alpha scores for their computed variables.  I referred to both documents extensively during my work.

As indicated in the name of the zip file above, I used version 2 of this dataset, which was the only one available when I downloaded it on May 23, 2024.  Version 2 was then, and still is as of this writing, the most recent and up-to-date version of the materials, having been published on January 5, 2023.  However, it should be noted that the datasets on ICPSR's website are occasionally updated. Should future replications encounter errors or results that differ from my own, I humbly request that they first ensure that the ICPSR files have not changed since that time.

Finally, I would like to extend my gratitude to Dr. Meyer for sharing this data with the public, and to the teams at DSDR and ICPSR for maintaining the archive.

Citation:
Meyer, Ilan H. Generations: A Study of the Life and Health of LGB People in a Changing Society, United States, 2016-2019. Inter-university Consortium for Political and Social Research \[distributor\], 2023-01-05. https://doi.org/10.3886/ICPSR37166.v2

### Sample

In the original documentation accompanying the dataset, the authors referred to their participants as "LGB," the first three letters of the common acronym LGBT, which stands for lesbian, gay, bisexual, and transgender.  The authors opted to drop the T because transgender participants were routed to a different study, although participants who identified as neither cisgender nor transgender were retained in *Generations*:
>Respondents who were transgender, regardless of their sexual orientation, were screened for participation in a companion study, TransPop (see [www.TransPop.org](http://www.transpop.org/)), which included questions to address issues that are specific to transgender people (e.g., transitioning). Respondents who were sexual minorities and gender nonbinary, but did not identify as transgender, were included in the Generations study.

### Strategy

My Method section is continued in the next two notebooks.  In these notebooks, I have included the `python` code I used and demonstrated how I cleaned and transformed the data in preparation for modeling.  Once those processes were complete, I conducted exploratory data analysis to reveal the relationships between the different variables, and then modeled the participants' Kessler-6 scores on their other responses.  I then interpreted this model to look for insights on how to improve mental health outcomes for queer people.
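
To make the modeling step concrete, here is a small sketch of an ordinary least squares fit on synthetic data. It is not the model from the later notebooks. The predictor names `stigma` and `connectedness` are hypothetical stand-ins, the data are randomly generated, and the fit uses `numpy`'s least-squares solver rather than `statsmodels` purely to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for two predictors (hypothetical names, not dataset columns).
stigma = rng.normal(size=n)
connectedness = rng.normal(size=n)
# Synthetic distress score built from known coefficients plus random noise.
distress = 8.0 + 2.0 * stigma - 1.5 * connectedness + rng.normal(scale=1.0, size=n)

# Design matrix with an explicit intercept column.
X = np.column_stack([np.ones(n), stigma, connectedness])
coefs, *_ = np.linalg.lstsq(X, distress, rcond=None)

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
fitted = X @ coefs
ss_res = np.sum((distress - fitted) ** 2)
ss_tot = np.sum((distress - distress.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(dict(zip(["intercept", "stigma", "connectedness"], coefs.round(2))))
print(f"R^2 = {r_squared:.3f}")
```

In an inferential setting, the sign and size of each coefficient are the quantities of interest: here, a positive coefficient would indicate a predictor associated with more distress, and a negative one a potentially protective factor.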

## Notebook Summary

In this notebook, I have introduced the purpose and background of this project, and begun to explain my methods.

In the next notebook, I will begin cleaning the data in preparation for modeling.