# **Pride and Joy**
### *An investigation of mental health correlates in LGBQ+ people*
| | |
|-|-|
|Emily K. Sanders|Capstone Project|
|DSB-318|June 13, 2024|
---

## Prior Notebooks Summary

In the previous notebook, I ingested the data, changed the column names, conducted a first round of feature selection, and resolved all missing values.

In this notebook, I will demonstrate how I used `python` to finish preparing the data for modeling through feature engineering.  Specifically, I will recode columns as necessary and create new features that are composite scores of existing features.

## Table of Contents

- [Data Preparation, Continued](#data-preparation-continued)
  - [Imports](#imports)
  - [Feature Engineering](#feature-engineering)
    - [Recoding Variables](#recoding-variables)
    - [Creating Composite Variables](#creating-composite-variables)
- [Notebook Summary](#notebook-summary)  

## Data Preparation, Continued

### Imports

In [1]:
# Import modules
import pandas as pd
import numpy as np
import os
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from warnings import simplefilter

In [2]:
# Settings preferences
pd.set_option('display.max_rows', None)
pd.options.mode.chained_assignment = None 
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)
# Thanks to daydaybroskii and KingOtto at Stack Overflow for that one
# https://stackoverflow.com/questions/68292862/performancewarning-dataframe-is-highly-fragmented-this-is-usually-the-result-o

In [3]:
# Import the data - mine
meyer = pd.read_csv('../02_data/df_after_data_preparation_part_1.csv') 
meyer.shape

(1494, 225)

### Feature Engineering

In [4]:
# Define a dictionary to hold groups of columns and some instructions - more on this later
feat_eng_dict = {}

#### Recoding Variables

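Several survey items were coded in ways that would not work well in a model: some scales ran in the opposite direction from the construct they measure, and some started at 1 rather than 0.  In this section, I recode those columns, saving each recoded version in a new column with an `_r` suffix and dropping the original.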

In [5]:
# 'w1q89_ei' - 7s like 97s
# This is the question about how often people smoke cigarettes, and the 7s are people who 
# said on the previous question that they do not smoke. Therefore, I imputed "not at all" here.
cond = meyer['w1q89_ei']==7.0
meyer.loc[cond, 'w1q89_ei']=3

In [6]:
# All of the 123 questions are about outness, but they're coded unintuitively. "Are you out to..."
# 1 = All, 2 = Most, 3 = Some, 4 = None, 5 = don't know/does not apply/[missing value]
# I recoded it like this:
# 4 -> 0 = I can confidently say I am out to "None" of these people. (LOWEST OUTNESS)
# 5 -> 1 = I'm being wishy-washy about how out I am to these people.
# 3 -> 2 = Some
# 2 -> 3 = Most
# 1 -> 4 = All (HIGHEST OUTNESS)

# Create new columns
meyer[['w1q123a_ei_r', 'w1q123b_ei_r', 'w1q123c_ei_r', 'w1q123d_ei_r']] = meyer[[
  'w1q123a_ei', 'w1q123b_ei', 'w1q123c_ei', 'w1q123d_ei']]

# Create a recoding dictionary
recode_123s = {4: 0, 5: 1, 3: 2, 2: 3, 1: 4}

# Create lists
old_cols = ['w1q123a_ei', 'w1q123b_ei', 'w1q123c_ei', 'w1q123d_ei']
new_cols = ['w1q123a_ei_r', 'w1q123b_ei_r', 'w1q123c_ei_r', 'w1q123d_ei_r']

# Recode
for old, new in list(zip(old_cols, new_cols)):
  meyer[new] = meyer[old].map(recode_123s)

# Drop the old versions
meyer.drop(columns = old_cols, inplace = True)
meyer.shape

(1494, 225)
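
The dictionary-based recode above relies on `Series.map`, which replaces any value that is not a key in the dictionary with `NaN`. The following is a minimal sketch of that behavior on made-up responses (the toy data is an assumption, not drawn from the survey), including a check that nothing was left unmapped.

```python
import pandas as pd

# Hypothetical responses on the original 1-5 scale (not real survey data)
responses = pd.Series([1, 2, 3, 4, 5, 4], name='w1q123a_ei')

# Same mapping as recode_123s in the cell above
recode_123s = {4: 0, 5: 1, 3: 2, 2: 3, 1: 4}

# Recode into a new Series; values missing from the dictionary would become NaN
recoded = responses.map(recode_123s)

# Count unmapped values so a stray code cannot slip through silently
unmapped = int(recoded.isna().sum())

print(recoded.tolist())
print(unmapped)
```

If `unmapped` were greater than 0, the dictionary would be missing a code that appears in the data.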

In [7]:
# 124, about visibility, needs to be reverse coded so that higher numbers = more visibly queer

# Adjust the values in a new column
meyer[['w1q124_ei_r']] = 5-meyer[['w1q124_ei']]

# Drop the original
meyer.drop(columns = ['w1q124_ei'], inplace = True)

# Add an entry to the dictionary - more on this later
feat_eng_dict['outness'] = ['mean', 'w1q123a_ei_r', 'w1q123b_ei_r', 
  'w1q123c_ei_r', 'w1q123d_ei_r', 'w1q124_ei_r']

Rather than typing out the syntax above over and over for each column, I wrote a function to generate most of the code I would need.  All I had to do was tinker with it to adjust the values correctly.

In [8]:
def recode(dictry, name, cols):
  '''cols is a list whose first entry is the combination method (e.g., 'sum') 
  and whose remaining entries are the columns that need to be recoded.
  name is the name I want to give the composite column based on cols
  dictry is the STRING name of the dictionary (I need that for the func to modify it)
    this function will spit out the appropriate syntax, but only for THIS df.'''
  for i in cols[1:]:
    print(f"meyer[['{i}_r']] = meyer[['{i}']]")
  print('')
  for i in cols[1:]:
    # Note: These counts won't show in the Jupyter output, but they were helpful when I was working.
    print(f"meyer['{i}'].value_counts(dropna = False).sort_index()")
    print(f"meyer['{i}_r'].value_counts(dropna = False).sort_index()")
  print('')
  print(f"meyer.drop(columns = {cols[1:]}).shape")
  print(f"meyer.drop(columns = {cols[1:]}, inplace = True)")
  print('')
  cols_r = cols[:1] + [''.join([x, '_r']) for x in cols[1:]]
  print(f"{dictry}['{name}'] = {cols_r}")
# I then ran this a bunch of times to generate the code seen below. 
# One example is provided, but the rest are omitted for space.

In [9]:
recode('feat_eng_dict', 'bad_neighbhd', ['sum', 'w1q19a_ei', 'w1q19b_ei', 'w1q19c_ei', 'w1q19d_ei'])

meyer[['w1q19a_ei_r']] = meyer[['w1q19a_ei']]
meyer[['w1q19b_ei_r']] = meyer[['w1q19b_ei']]
meyer[['w1q19c_ei_r']] = meyer[['w1q19c_ei']]
meyer[['w1q19d_ei_r']] = meyer[['w1q19d_ei']]

meyer['w1q19a_ei'].value_counts(dropna = False).sort_index()
meyer['w1q19a_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q19b_ei'].value_counts(dropna = False).sort_index()
meyer['w1q19b_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q19c_ei'].value_counts(dropna = False).sort_index()
meyer['w1q19c_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q19d_ei'].value_counts(dropna = False).sort_index()
meyer['w1q19d_ei_r'].value_counts(dropna = False).sort_index()

meyer.drop(columns = ['w1q19a_ei', 'w1q19b_ei', 'w1q19c_ei', 'w1q19d_ei']).shape
meyer.drop(columns = ['w1q19a_ei', 'w1q19b_ei', 'w1q19c_ei', 'w1q19d_ei'], inplace = True)

feat_eng_dict['bad_neighbhd'] = ['sum', 'w1q19a_ei_r', 'w1q19b_ei_r', 'w1q19c_ei_r', 'w1q19d_ei_r']
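
An alternative to generating code and then editing it by hand would be a function that applies the recode directly. The sketch below is one possible shape for that, not the approach used in this notebook: `recode_columns`, its parameters, and the toy dataframe are all hypothetical names introduced for illustration.

```python
import pandas as pd

def recode_columns(df, spec, name, method, cols, transform):
    """Apply `transform` to each column in `cols`, store the result in a new
    '<col>_r' column, drop the originals, and record the composite definition
    in `spec`. Returns a modified copy of `df` (hypothetical helper)."""
    out = df.copy()
    for col in cols:
        out[f'{col}_r'] = transform(out[col])
    out = out.drop(columns=cols)
    # Same structure as feat_eng_dict: method first, then the component columns
    spec[name] = [method] + [f'{col}_r' for col in cols]
    return out

# Toy data standing in for the survey columns (assumed values)
toy = pd.DataFrame({'q19a': [1, 2, 2], 'q19b': [2, 2, 1], 'other': [9, 9, 9]})
toy_spec = {}

# 0-base both columns by subtracting 1
toy = recode_columns(toy, toy_spec, 'bad_neighbhd', 'sum',
                     ['q19a', 'q19b'], lambda s: s - 1)

print(list(toy.columns))
print(toy_spec)
```

The trade-off is that the intermediate `value_counts()` checks are no longer written out for each column, so they would need to be run separately.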


In [10]:
# Question 19 is about whether their neighborhood is a good place to live for people of different
# identities.  I recoded it (by subtracting 1) so that 1=bad neighborhood and 0=fine
meyer[['w1q19a_ei_r']] = meyer[['w1q19a_ei']]-1
meyer[['w1q19b_ei_r']] = meyer[['w1q19b_ei']]-1
meyer[['w1q19c_ei_r']] = meyer[['w1q19c_ei']]-1
meyer[['w1q19d_ei_r']] = meyer[['w1q19d_ei']]-1

meyer['w1q19a_ei'].value_counts(dropna = False).sort_index()
meyer['w1q19a_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q19b_ei'].value_counts(dropna = False).sort_index()
meyer['w1q19b_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q19c_ei'].value_counts(dropna = False).sort_index()
meyer['w1q19c_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q19d_ei'].value_counts(dropna = False).sort_index()
meyer['w1q19d_ei_r'].value_counts(dropna = False).sort_index()

meyer.drop(columns = ['w1q19a_ei', 'w1q19b_ei', 'w1q19c_ei', 'w1q19d_ei']).shape  # See note above
meyer.drop(columns = ['w1q19a_ei', 'w1q19b_ei', 'w1q19c_ei', 'w1q19d_ei'], inplace = True)

feat_eng_dict['bad_neighbhd'] = ['sum', 'w1q19a_ei_r', 'w1q19b_ei_r', 'w1q19c_ei_r', 'w1q19d_ei_r']

In [11]:
# Questions 75 and 76 ask if the participant has a disability, with 1=disabled, 2=non-disabled
# I reverse coded these so that 1=disabled, 0=non-disabled
meyer[['w1q75_ei_r']] = abs(meyer[['w1q75_ei']]-2)
meyer[['w1q76_ei_r']] = abs(meyer[['w1q76_ei']]-2)

meyer['w1q75_ei'].value_counts(dropna = False, sort = True, ascending = True)
meyer['w1q75_ei_r'].value_counts(dropna = False, sort = True, ascending = True)
meyer['w1q76_ei'].value_counts(dropna = False, sort = True, ascending = True)
meyer['w1q76_ei_r'].value_counts(dropna = False, sort = True, ascending = True)

meyer.drop(columns = ['w1q75_ei', 'w1q76_ei']).shape
meyer.drop(columns = ['w1q75_ei', 'w1q76_ei'], inplace = True)

feat_eng_dict['disabled'] = ['binarize', 'w1q75_ei_r', 'w1q76_ei_r']
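
The `abs(x - 2)` trick above turns 1 into 1 and 2 into 0, which matches an explicit dictionary recode as long as the column contains only 1s and 2s. The sketch below, on made-up values, shows the equivalence and also what each approach would do with an unexpected code such as 3.

```python
import pandas as pd

# Hypothetical yes/no responses coded 1 = yes, 2 = no
codes = pd.Series([1, 2, 2, 1])

via_abs = (codes - 2).abs()          # arithmetic recode used in the cell above
via_map = codes.map({1: 1, 2: 0})    # explicit dictionary recode

print(via_abs.tolist() == via_map.tolist())

# An unexpected code behaves differently under the two approaches:
stray = pd.Series([3])
stray_abs = (stray - 2).abs().tolist()             # silently becomes 1
stray_is_nan = stray.map({1: 1, 2: 0}).isna().tolist()  # becomes NaN instead
print(stray_abs, stray_is_nan)
```

This is why the `value_counts()` checks before and after each recode are worth keeping: the arithmetic version will not flag an out-of-range code on its own.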

In [12]:
# Questions 'w1q101_ei', 'w1q105_ei', 'w1q109_ei', and 'w1q114_ei' are about suicidal ideation and
# behavior, with 1=never, 2=once, and 3=more than once.  I adjusted the first 3 so that 0=never, 1=once, 
# and 2=more than once.  I did this for a lot of variables, and refer to it as "0-basing" throughout.
meyer[['w1q101_ei_r']] = meyer[['w1q101_ei']]-1
meyer[['w1q105_ei_r']] = meyer[['w1q105_ei']]-1
meyer[['w1q109_ei_r']] = meyer[['w1q109_ei']]-1

meyer['w1q101_ei'].value_counts(dropna = False).sort_index()
meyer['w1q101_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q105_ei'].value_counts(dropna = False).sort_index()
meyer['w1q105_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q109_ei'].value_counts(dropna = False).sort_index()
meyer['w1q109_ei_r'].value_counts(dropna = False).sort_index()

meyer.drop(columns = ['w1q101_ei', 'w1q105_ei', 'w1q109_ei']).shape
meyer.drop(columns = ['w1q101_ei', 'w1q105_ei', 'w1q109_ei'], inplace = True)

feat_eng_dict['suicidality'] = ['sum', 'w1q101_ei_r', 'w1q105_ei_r', 'w1q109_ei_r', 'w1q114_ei']

In [13]:
# Question 135 has sub-questions, all asking about different types of victimization the participant 
# may have experienced in their life.  I 0-based these so that 0="Never."
meyer[['w1q135a_ei_r']] = meyer[['w1q135a_ei']]-1
meyer[['w1q135b_ei_r']] = meyer[['w1q135b_ei']]-1
meyer[['w1q135c_ei_r']] = meyer[['w1q135c_ei']]-1
meyer[['w1q135d_ei_r']] = meyer[['w1q135d_ei']]-1
meyer[['w1q135e_ei_r']] = meyer[['w1q135e_ei']]-1
meyer[['w1q135f_ei_r']] = meyer[['w1q135f_ei']]-1

meyer['w1q135a_ei'].value_counts(dropna = False).sort_index()
meyer['w1q135a_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q135b_ei'].value_counts(dropna = False).sort_index()
meyer['w1q135b_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q135c_ei'].value_counts(dropna = False).sort_index()
meyer['w1q135c_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q135d_ei'].value_counts(dropna = False).sort_index()
meyer['w1q135d_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q135e_ei'].value_counts(dropna = False).sort_index()
meyer['w1q135e_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q135f_ei'].value_counts(dropna = False).sort_index()
meyer['w1q135f_ei_r'].value_counts(dropna = False).sort_index()

meyer.drop(columns = ['w1q135a_ei', 'w1q135b_ei', 'w1q135c_ei', 'w1q135d_ei', 'w1q135e_ei', 'w1q135f_ei']).shape
meyer.drop(columns = ['w1q135a_ei', 'w1q135b_ei', 'w1q135c_ei', 'w1q135d_ei', 'w1q135e_ei', 'w1q135f_ei'], inplace = True)

feat_eng_dict['abusive_treatment'] = ['sum', 'w1q135a_ei_r', 'w1q135b_ei_r', 'w1q135c_ei_r', 'w1q135d_ei_r', 'w1q135e_ei_r', 'w1q135f_ei_r']

In [14]:
# Questions 137 and 138 ask about workplace discrimination.  I 0-based these so that 0="Never."
meyer[['w1q137_ei_r']] = meyer[['w1q137_ei']]-1
meyer[['w1q138_ei_r']] = meyer[['w1q138_ei']]-1

meyer['w1q137_ei'].value_counts(dropna = False).sort_index()
meyer['w1q137_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q138_ei'].value_counts(dropna = False).sort_index()
meyer['w1q138_ei_r'].value_counts(dropna = False).sort_index()

meyer.drop(columns = ['w1q137_ei', 'w1q138_ei']).shape
meyer.drop(columns = ['w1q137_ei', 'w1q138_ei'], inplace = True)

feat_eng_dict['work_neg_outcomes'] = ['sum', 'w1q137_ei_r', 'w1q138_ei_r']

In [15]:
# Question 142 has sub-questions, all asking about different types of stress the participant 
# may have experienced in the past year.  I recoded them so that 0 = "No" and 1 = "Yes."
# Note: I have broken these across different cells because they are part of different sub-scales of 142.
meyer[['w1q142a_ei_r']] = abs(meyer[['w1q142a_ei']]-2)
meyer[['w1q142h_ei_r']] = abs(meyer[['w1q142h_ei']]-2)
meyer[['w1q142i_ei_r']] = abs(meyer[['w1q142i_ei']]-2)

meyer['w1q142a_ei'].value_counts(dropna = False).sort_index()
meyer['w1q142a_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q142h_ei'].value_counts(dropna = False).sort_index()
meyer['w1q142h_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q142i_ei'].value_counts(dropna = False).sort_index()
meyer['w1q142i_ei_r'].value_counts(dropna = False).sort_index()

meyer.drop(columns = ['w1q142a_ei', 'w1q142h_ei', 'w1q142i_ei']).shape
meyer.drop(columns = ['w1q142a_ei', 'w1q142h_ei', 'w1q142i_ei'], inplace = True)

feat_eng_dict['stress_past_year_gen'] = ['sum', 'w1q142a_ei_r', 'w1q142h_ei_r', 'w1q142i_ei_r']

In [16]:
meyer[['w1q142b_ei_r']] = abs(meyer[['w1q142b_ei']]-2)
meyer[['w1q142c_ei_r']] = abs(meyer[['w1q142c_ei']]-2)
meyer[['w1q142e_ei_r']] = abs(meyer[['w1q142e_ei']]-2)

meyer['w1q142b_ei'].value_counts(dropna = False).sort_index()
meyer['w1q142b_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q142c_ei'].value_counts(dropna = False).sort_index()
meyer['w1q142c_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q142e_ei'].value_counts(dropna = False).sort_index()
meyer['w1q142e_ei_r'].value_counts(dropna = False).sort_index()

meyer.drop(columns = ['w1q142b_ei', 'w1q142c_ei', 'w1q142e_ei']).shape
meyer.drop(columns = ['w1q142b_ei', 'w1q142c_ei', 'w1q142e_ei'], inplace = True)

feat_eng_dict['stress_past_year_work'] = ['sum', 'w1q142b_ei_r', 'w1q142c_ei_r', 'w1q142e_ei_r']

In [17]:
meyer[['w1q142d_ei_r']] = abs(meyer[['w1q142d_ei']]-2)
meyer[['w1q142f_ei_r']] = abs(meyer[['w1q142f_ei']]-2)
meyer[['w1q142g_ei_r']] = abs(meyer[['w1q142g_ei']]-2)

meyer['w1q142d_ei'].value_counts(dropna = False).sort_index()
meyer['w1q142d_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q142f_ei'].value_counts(dropna = False).sort_index()
meyer['w1q142f_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q142g_ei'].value_counts(dropna = False).sort_index()
meyer['w1q142g_ei_r'].value_counts(dropna = False).sort_index()

meyer.drop(columns = ['w1q142d_ei', 'w1q142f_ei', 'w1q142g_ei']).shape
meyer.drop(columns = ['w1q142d_ei', 'w1q142f_ei', 'w1q142g_ei'], inplace = True)

feat_eng_dict['stress_past_year_interpersonal'] = ['sum', 'w1q142d_ei_r', 'w1q142f_ei_r', 'w1q142g_ei_r']

In [18]:
meyer[['w1q142j_ei_r']] = abs(meyer[['w1q142j_ei']]-2)
meyer[['w1q142k_ei_r']] = abs(meyer[['w1q142k_ei']]-2)

meyer['w1q142j_ei'].value_counts(dropna = False).sort_index()
meyer['w1q142j_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q142k_ei'].value_counts(dropna = False).sort_index()
meyer['w1q142k_ei_r'].value_counts(dropna = False).sort_index()

meyer.drop(columns = ['w1q142j_ei', 'w1q142k_ei']).shape
meyer.drop(columns = ['w1q142j_ei', 'w1q142k_ei'], inplace = True)

feat_eng_dict['stress_past_year_crime'] = ['sum', 'w1q142j_ei_r', 'w1q142k_ei_r']

In [19]:
# Question 179 asks which, if any, religion the participant practices, and question 180 asks which they 
# were raised in.  The original survey offered 13 options (see p. 261-262 of 37166-0007-Codebook-ICPSR.pdf), 
# but many categories had very few people in them.  Given how many features I had relative to observations,
# I opted to consolidate these categories into 3 options: 
# 0 = Atheist, Agnostic, or "Nothing in particular" (roughly: not religious)
# 1 = Protestant, Roman Catholic, Mormon, Orthodox (roughly: religious, Christian)
# 5 = Jewish, Muslim, Buddhist, Hindu, Spiritual, "Something else" (roughly: religious, not Christian)
# I kept 5, rather than recoding to 2, because it was the original code for Jewish, the most populous 
# original category in the new group.  All three categories would be One-Hot Encoded later anyway.

# To see the original counts:
print(meyer['w1q179_ei'].value_counts(dropna = False).sort_index())
print(meyer['w1q180_ei'].value_counts(dropna = False).sort_index())

w1q179_ei
1.0     291
2.0     133
3.0      10
4.0       6
5.0      38
6.0       2
7.0      30
8.0       1
9.0     190
10.0    153
11.0    260
12.0     94
13.0    286
Name: count, dtype: int64
w1q180_ei
1.0     689
2.0     435
3.0      29
4.0       7
5.0      51
6.0       2
7.0       3
8.0       1
9.0      23
10.0     16
11.0     25
12.0     50
13.0    163
Name: count, dtype: int64


In [20]:
# Recode the religion columns
relig_recode = {1: 1, 2: 1, 3: 1, 4: 1, 
  5: 5, 6: 5, 7: 5, 8: 5, 11: 5, 12: 5, 
  9: 0, 10: 0, 13: 0} 

# Recode in new columns
meyer['w1q179_ei_r'] = meyer['w1q179_ei'].map(relig_recode)
meyer['w1q180_ei_r'] = meyer['w1q180_ei'].map(relig_recode)

meyer['w1q179_ei'].value_counts(dropna = False).sort_index()
meyer['w1q179_ei_r'].value_counts(dropna = False).sort_index()
meyer['w1q180_ei'].value_counts(dropna = False).sort_index()
meyer['w1q180_ei_r'].value_counts(dropna = False).sort_index()

meyer.drop(columns = ['w1q179_ei', 'w1q180_ei']).shape
meyer.drop(columns = ['w1q179_ei', 'w1q180_ei'], inplace = True)

# No dictionary entry - it doesn't make sense to combine these columns
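
One way to confirm that a consolidation like this loses no respondents is to sum the original counts within each new category and compare them to the recoded column's counts. The sketch below does this for `w1q179_ei`, using the counts printed in the output above and the same `relig_recode` mapping; the variable names `counts_179` and `expected` are introduced here for illustration.

```python
import pandas as pd

# Counts for w1q179_ei, copied from the printed output above (codes 1-13)
counts_179 = pd.Series(
    [291, 133, 10, 6, 38, 2, 30, 1, 190, 153, 260, 94, 286],
    index=range(1, 14))

# Same mapping as in the recode cell
relig_recode = {1: 1, 2: 1, 3: 1, 4: 1,
                5: 5, 6: 5, 7: 5, 8: 5, 11: 5, 12: 5,
                9: 0, 10: 0, 13: 0}

# Sum the original counts within each consolidated category
expected = counts_179.groupby(counts_179.index.map(relig_recode)).sum()

print(expected.to_dict())
print(int(expected.sum()))
```

These totals are what `meyer['w1q179_ei_r'].value_counts()` should show if the recode worked, and they should add up to the 1,494 rows in the dataframe.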

In [21]:
# Check the shape - make sure I have the same number of columns I started with
meyer.shape

(1494, 225)

In [22]:
# One-Hot Encode religion columns

# Instantiate the transformers
ohe = OneHotEncoder(drop = None, # I want to manually drop specific ones
  handle_unknown = 'ignore', sparse_output = False) 

ctx = ColumnTransformer(transformers=[('one_hot', ohe, ['w1q179_ei_r', 'w1q180_ei_r'])],
    remainder = 'passthrough', verbose_feature_names_out=False)

# Encode
meyer_ohe = pd.DataFrame(data = ctx.fit_transform(meyer), 
  columns = ctx.get_feature_names_out())

# Make sure the shape matches
print(meyer_ohe.shape)
meyer_ohe.shape == (meyer.shape[0], meyer.shape[1]+4)

(1494, 229)


True
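
For readers less familiar with scikit-learn's transformers, the same kind of indicator columns can be sketched with `pd.get_dummies`. This is a swapped-in technique shown on toy data, not the encoder used above; the column names `relig_r` and `age` are assumptions for illustration.

```python
import pandas as pd

# Toy data: one consolidated religion column (0, 1, 5) plus an unrelated column
toy = pd.DataFrame({'relig_r': [0, 1, 5, 1], 'age': [30, 41, 25, 52]})

# One indicator column per category; other columns pass through untouched
encoded = pd.get_dummies(toy, columns=['relig_r'], dtype=int)

# Drop the 0 (not religious) indicator so it serves as the baseline
encoded = encoded.drop(columns=['relig_r_0'])

print(list(encoded.columns))
print(encoded['relig_r_5'].tolist())
```

Three categories become three indicator columns, and dropping the baseline leaves two, which is the same column arithmetic checked in the cells around this point.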

In [23]:
# Drop the non-religious ones so they're the baseline
meyer_ohe.drop(columns = ['w1q179_ei_r_0', 'w1q180_ei_r_0']).shape
meyer_ohe.drop(columns = ['w1q179_ei_r_0', 'w1q180_ei_r_0'], inplace = True)
meyer_ohe.shape

(1494, 227)

In [24]:
# Give the new columns more semantic names
relig_rename = {'w1q179_ei_r_1': 'w1q179_ei_r_relig_christ', 'w1q179_ei_r_5': 'w1q179_ei_r_relig_other', 
  'w1q180_ei_r_1': 'w1q180_ei_r_relig_christ', 'w1q180_ei_r_5': 'w1q180_ei_r_relig_other'}

meyer_ohe.rename(columns = relig_rename, inplace = True)

In [25]:
# Put it back under its own name
meyer = meyer_ohe.copy(deep = True)
del meyer_ohe
meyer.shape

(1494, 227)

In [26]:
# Question 181 is about how often participants attend religious services.  I reverse coded and 
# 0-based it, so that 0 = "Never" and 5 = "More than once a week"
meyer[['w1q181_ei_r']] = 6-(meyer[['w1q181_ei']])

meyer['w1q181_ei'].value_counts(dropna = False).sort_index()
meyer['w1q181_ei_r'].value_counts(dropna = False).sort_index()

meyer.drop(columns = ['w1q181_ei']).shape
meyer.drop(columns = ['w1q181_ei'], inplace = True)
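
Most of the arithmetic recodes in this section come down to two moves: shifting a scale so it starts at 0, and flipping a scale so its direction reverses. The sketch below expresses both as small helpers. The function names and the 1-6 example scale are assumptions made for illustration; the actual scale endpoints should be taken from the codebook.

```python
import pandas as pd

def zero_base(series, low=1):
    """Shift a scale so that its lowest code becomes 0."""
    return series - low

def reverse_code(series, low, high):
    """Flip a scale so that `low` becomes `high` and vice versa."""
    return (low + high) - series

# Hypothetical item on a 1-6 scale
attendance = pd.Series([1, 2, 3, 4, 5, 6])

# Reverse coding followed by 0-basing...
combined = zero_base(reverse_code(attendance, low=1, high=6))

# ...is the same as subtracting from the highest code in one step
shortcut = 6 - attendance

print(combined.tolist())
print(shortcut.tolist())
```

On a 1-6 scale, reversing and then 0-basing collapses to `6 - x`, which is the form of the expression used for question 181.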

#### Creating Composite Variables

Once each of those columns was recoded to work seamlessly in the model, I began creating composite variables.  These are similar to the scale scores that Meyer and his team included in the dataset, in that they mathematically combine multiple columns into one.  However, these combinations are of my own design, and were not laid out in the codebook.  When choosing which columns to condense into one, I relied on cues from the structure of the questions, as well as my own experience and domain knowledge.

Because there were quite a few composite columns I wanted to create, I used a dictionary to organize them, rather than writing out each combination individually.  This is the dictionary that was referenced in the `recode()` function above.  Dynamically updating it while recoding the columns allowed me to iteratively improve upon it, but for the sake of readability, I will simply display the most up-to-date version here.

Each entry in the dictionary has the same structure.  The keys are the names I want to assign to the new columns that will be created from a combination of existing columns.  The values are lists.  The first entry in each of those lists is the method by which I wanted to combine the existing columns, and the rest of the entries are the names of those columns.  That is why, when working iteratively, it was so important to update this dictionary every time I recoded, renamed, or dropped a column: the column names in the most up-to-date version of the dictionary had to exactly match the column names in the most up-to-date version of the dataframe.
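
Because the dictionary and the dataframe have to stay in sync, a quick check for stale column names can catch mismatches before the combining step. The following is a minimal sketch under assumed names: `missing_components` is a hypothetical helper, and the toy dataframe and dictionary stand in for `meyer` and `feat_eng_dict`.

```python
import pandas as pd

def missing_components(df, spec):
    """Return {composite_name: [columns not found in df]} for every entry
    in `spec` that references at least one missing column."""
    problems = {}
    for name, (method, *cols) in spec.items():
        absent = [c for c in cols if c not in df.columns]
        if absent:
            problems[name] = absent
    return problems

# Toy stand-ins: 'c' was never recoded, so 'c_r' does not exist
toy = pd.DataFrame({'a_r': [0, 1], 'b_r': [1, 1], 'c': [2, 0]})
toy_spec = {'ok': ['sum', 'a_r', 'b_r'],
            'stale': ['sum', 'c_r', 'a_r']}

print(missing_components(toy, toy_spec))
```

An empty dictionary returned from this check would mean every component column named in the specification is present in the dataframe.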

In [27]:
feat_eng_dict = {
  'health_insurance': [
    'binarize', 'w1q64_2_ei', 'w1q64_3_ei', 'w1q64_4_ei', 'w1q64_5_ei', 'w1q64_6_ei', 'w1q64_7_ei', 
    'w1q64_8_ei', 'w1q64_9_ei', 'w1q64_10_ei', 'w1q64_11_ei', 'w1q64_12_ei', 'w1q64_13_ei', 'w1q64_t_num'], 
  'serious_health_cond': [
    'binarize', 'w1q74_5_ei', 'w1q74_6_ei', 'w1q74_10_ei', 'w1q74_11_ei', 
    'w1q74_14_ei', 'w1q74_17_ei', 'w1q74_18_ei', 'w1q74_20_ei'], 
  'disabled': [
    'binarize', 'w1q75_ei_r', 'w1q76_ei_r'], 
  'outness': [
    'mean', 'w1q123a_ei_r', 'w1q123b_ei_r', 'w1q123c_ei_r', 'w1q123d_ei_r', 'w1q124_ei_r'], 
  'abusive_treatment': [
    'sum', 'w1q135a_ei_r', 'w1q135b_ei_r', 'w1q135c_ei_r', 'w1q135d_ei_r', 'w1q135e_ei_r', 'w1q135f_ei_r'], 
  'work_neg_outcomes': [
    'sum', 'w1q137_ei_r', 'w1q138_ei_r'], 
  'abus_treat_non_queer': [
    'sum', 'w1q136_1_ei', 'w1q136_5_ei', 'w1q136_6_ei', 'w1q136_8_ei', 'w1q136_9_ei', 'w1q136_10_ei'], 
  'stress_past_year_gen': [
    'sum', 'w1q142a_ei_r', 'w1q142h_ei_r', 'w1q142i_ei_r'], 
  'stress_past_year_work': [
    'sum', 'w1q142b_ei_r', 'w1q142c_ei_r', 'w1q142e_ei_r'], 
  'stress_past_year_interpersonal': [
    'sum', 'w1q142d_ei_r', 'w1q142f_ei_r', 'w1q142g_ei_r'], 
  'stress_past_year_crime': [
    'sum', 'w1q142j_ei_r', 'w1q142k_ei_r'], 
  'work_disc_non_queer': [
    'sum', 'w1q139_1_ei', 'w1q139_5_ei', 'w1q139_6_ei', 'w1q139_8_ei', 'w1q139_9_ei', 'w1q139_10_ei'], 
  'housing_disc_non_queer': [
    'sum', 'w1q141_1_ei', 'w1q141_5_ei', 'w1q141_6_ei', 'w1q141_8_ei', 'w1q141_9_ei', 'w1q141_10_ei'], 
  'stress_past_year_non_queer': [
    'sum', 'w1q143_1_ei', 'w1q143_5_ei', 'w1q143_6_ei', 'w1q143_8_ei', 'w1q143_9_ei', 'w1q143_10_ei'], 
  'daily_discr_non_queer': [
    'sum', 'w1q145_1_ei', 'w1q145_5_ei', 'w1q145_6_ei'], 
  'childhd_bullying_non_queer': [
    'sum', 'w1q163_1_ei', 'w1q163_5_ei', 'w1q163_6_ei', 'w1q163_8_ei', 'w1q163_9_ei', 'w1q163_10_ei'], 
  'abus_treat_sex_gender': [
    'sum', 'w1q136_2_ei', 'w1q136_3_ei', 'w1q136_4_ei'], 
  'work_disc_sex_gender': [
    'sum', 'w1q139_2_ei', 'w1q139_3_ei', 'w1q139_4_ei'], 
  'housing_disc_sex_gender': [
    'sum', 'w1q141_2_ei', 'w1q141_3_ei', 'w1q141_4_ei'], 
  'stress_past_year_sex_gender': [
    'sum', 'w1q143_2_ei', 'w1q143_3_ei', 'w1q143_4_ei'], 
  'daily_discr_sex_gender': [
    'sum', 'w1q145_2_ei', 'w1q145_3_ei', 'w1q145_4_ei', 'w1q145_8_ei', 'w1q145_9_ei', 'w1q145_10_ei'], 
  'childhd_bullying_sex_gender': [
    'sum', 'w1q163_2_ei', 'w1q163_3_ei', 'w1q163_4_ei'], 
  'chronic_strain': [
    'sum', 'w1q146a_ei', 'w1q146b_ei', 'w1q146c_ei', 'w1q146d_ei', 'w1q146e_ei', 'w1q146f_ei', 'w1q146g_ei', 
    'w1q146h_ei', 'w1q146i_ei', 'w1q146j_ei', 'w1q146k_ei', 'w1q146l_ei'], 
  'bad_neighbhd': [
    'sum', 'w1q19a_ei_r', 'w1q19b_ei_r', 'w1q19c_ei_r', 'w1q19d_ei_r'], 
  'suicidality': ['sum', 'w1q101_ei_r', 'w1q105_ei_r', 'w1q109_ei_r', 'w1q114_ei']}

In [28]:
# Check the shape before combining
meyer.shape

(1494, 227)

I used the following for loop to go through each entry in the dictionary and create a new column, with the appropriate name, consisting of the indicated existing columns, combined in the indicated way.  I also included two "counters" in the loop to help me keep track of its progress and make sure the resulting dataframe had the correct dimensions.

In [29]:
# Keep some records
starting_cols = meyer.shape[1]
component_cols = sum([len(v[1:]) for v in feat_eng_dict.values()])
new_cols = len([k for k in feat_eng_dict.keys()])
print(f'Before feature engineering: {starting_cols} columns')
print(f'Feature engineering should add {new_cols} columns, based on {component_cols} columns')

Before feature engineering: 227 columns
Feature engineering should add 25 columns, based on 121 columns


In [30]:
# Create a counter for every (composite) column added
cols_added = 0

# Create a list to record every (existing) column condensed into a composite column
cols_done = []

# For each item in the dictionary
for k, v in feat_eng_dict.items():
  meyer[k] = meyer[v[1]]       # Copy the first component column into a new column with the indicated name. 
  cols_done.append(v[1])       # Add the first component column to the "done" list.
  cols_added += 1              # Count the newly added column
  for i in v[2:]:              # For every subsequent component column on the list in the dictionary entry
    meyer[k] += meyer[i]       # Add its values to those already in the new column (calc a sum)
    cols_done.append(i)        # Add it to the "done" list - if the method is "sum", it's done!
  if v[0]=='mean':             # If the method (first entry in the list) is "mean"...
    meyer[k] = meyer[k]/len(v[1:])     # Divide the new column (sum) by the number of component columns
  elif v[0]=='binarize':               # If the method is "binarize"...
    meyer[k] = np.where(meyer[k]>1, 1, meyer[k])  # Change any values >1 to 1. Leave current 0s and 1s alone.

# Print out the progress numbers at the end
print(f'Columns Added: {cols_added}')
print(f'New Dimensions: {meyer.shape}')
print(f"That's correct? {(starting_cols + cols_added) == meyer.shape[1]}")
print(f"Everyone accounted for? {component_cols == len(set(cols_done))}")

Columns Added: 25
New Dimensions: (1494, 252)
That's correct? True
Everyone accounted for? True
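
The same three combination methods can also be written with pandas' row-wise operations instead of an inner loop. The sketch below is an alternative formulation on toy data, not the code that produced the results above; `build_composites` and the toy column names are hypothetical.

```python
import pandas as pd

def build_composites(df, spec):
    """Add one composite column per entry in `spec`, where each value is
    [method, col1, col2, ...] and method is 'sum', 'mean', or 'binarize'."""
    out = df.copy()
    for name, (method, *cols) in spec.items():
        # skipna=False propagates missing values, as repeated += would
        total = out[cols].sum(axis=1, skipna=False)
        if method == 'sum':
            out[name] = total
        elif method == 'mean':
            out[name] = total / len(cols)
        elif method == 'binarize':
            out[name] = total.clip(upper=1)   # values above 1 become 1
        else:
            raise ValueError(f'Unknown method: {method}')
    return out

toy = pd.DataFrame({'a': [0, 1, 1], 'b': [0, 1, 0], 'c': [2, 4, 0]})
toy_spec = {'flag': ['binarize', 'a', 'b'],
            'total': ['sum', 'a', 'b', 'c'],
            'avg': ['mean', 'a', 'c']}

result = build_composites(toy, toy_spec)
print(result[['flag', 'total', 'avg']])
```

Raising an error on an unrecognized method is a deliberate difference from the loop above, which would quietly treat any unknown method as a sum.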


In [31]:
# Examples for verification

# Binarization calculated by hand
meyer['check_disabled'] = (meyer['w1q75_ei_r'] + meyer['w1q76_ei_r'])
meyer.loc[(meyer['check_disabled']==2), 'check_disabled'] = 1 
# Make sure it matches the automatically generated column; should be two 0s
print(sum((meyer['check_disabled']-meyer['disabled'])!=0))
print(sum(meyer['check_disabled']!=meyer['disabled']))

# Sum calculated by hand
meyer['check_suicidality'] = (meyer['w1q101_ei_r'] + meyer['w1q105_ei_r'] + meyer[
                              'w1q109_ei_r'] + meyer['w1q114_ei'])
# Make sure it matches the automatically generated column; should be two 0s
print(sum((meyer['check_suicidality']-meyer['suicidality'])!=0))
print(sum(meyer['check_suicidality']!=meyer['suicidality']))

# Mean calculated by hand
meyer['check_outness'] = (meyer['w1q123a_ei_r'] + meyer['w1q123b_ei_r'] + meyer[
  'w1q123c_ei_r'] + meyer['w1q123d_ei_r'] + meyer['w1q124_ei_r'])/len([
  'mean', 'w1q123a_ei_r', 'w1q123b_ei_r', 'w1q123c_ei_r', 'w1q123d_ei_r', 'w1q124_ei_r'][1:])
# Make sure it matches the automatically generated column; should be two 0s
print(sum((meyer['check_outness']-meyer['outness'])!=0))
print(sum(meyer['check_outness']!=meyer['outness']))

0
0
0
0
0
0
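
The exact comparisons above work because the hand calculation repeats the same operations in the same order as the loop. When a mean is computed a different way (for example, with `.mean(axis=1)`), floating-point results can differ in the last digits, so a tolerance-based comparison is safer. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Toy columns with non-integer values (assumed for illustration)
toy = pd.DataFrame({'a': [0.1, 1.0], 'b': [0.2, 2.0], 'c': [0.3, 4.0]})

# Mean computed by hand, as in the verification cell
by_hand = (toy['a'] + toy['b'] + toy['c']) / 3

# Mean computed by pandas
by_method = toy[['a', 'b', 'c']].mean(axis=1)

# Compare within a small tolerance rather than requiring exact equality
matches = bool(np.allclose(by_hand, by_method))
print(matches)
```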


In [32]:
# Update the drop list
cols_done.append('check_outness')
cols_done.append('check_suicidality')
cols_done.append('check_disabled')

# Calculate how many columns should be left post-drop
# started with + new composites + verifications - components - verifications
goal = starting_cols + new_cols + 3 - component_cols - 3
print(f"We're ready to drop? {goal==(meyer.drop(columns = cols_done).shape[1])}")

We're ready to drop? True


In [33]:
# Assuming that's True
meyer.drop(columns = cols_done, inplace = True)
meyer.shape

(1494, 131)

In [34]:
# Reorder the columns
ordered_cols = sorted(list(meyer.columns))
ordered_cols.remove('studyid')
ordered_cols.remove('w1kessler6_i')
ordered_cols = ['studyid', 'w1kessler6_i'] + ordered_cols
print(f"Everyone accounted for? {len(ordered_cols)==meyer.shape[1]}")

Everyone accounted for? True


In [35]:
meyer = meyer[ordered_cols]
meyer.shape

(1494, 131)

## Notebook Summary

In this notebook, I finished preparing the data for modeling through feature engineering.  

In the following notebook, I will conduct exploratory data analysis to guide my choices in modeling.

Any readers who are following along or attempting to reproduce my work should use the cell below to save a copy of the dataframe as it exists now.  A cell is provided at the top of the next notebook in which to import that copy.

In [36]:
# Save a copy of the dataframe to use in the next notebook
meyer.to_csv('../02_data/df_after_data_preparation_part_2.csv', index = False)
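
Readers reproducing this work may want to confirm that the saved file reloads to an identical dataframe. The sketch below shows the idea with an in-memory buffer and a toy dataframe, so it does not touch any files; the column values are made up for illustration.

```python
import io
import pandas as pd

# Toy dataframe standing in for the saved data
toy = pd.DataFrame({'studyid': [1, 2], 'outness': [0.5, 3.25]})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# True if values, column names, and dtypes all survived the round trip
same = toy.equals(reloaded)
print(same)
```

Passing `index = False` when saving, as in the cell above, is what keeps an extra unnamed index column from appearing on reload.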