# **Pride and Joy**
### *An investigation of mental health correlates in LGBQ+ people*
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|Emily K. Sanders| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |Capstone Project|
|DSB-318| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |June 13, 2024|
---

## Prior Notebooks Summary

In the previous notebook, I introduced the background and purpose of this project, and gave an overview of my intended methods.

In this notebook, I will begin preparing the data for modeling, and demonstrate the `python` code I used.  Specifically, this notebook addresses preliminary feature selection and the imputation of missing values.

## Table of Contents

- [Data Preparation](#data-preparation)
  - [Imports](#imports)
  - [Renaming Columns](#renaming-columns) 
  - [Feature Selection](#feature-selection)
  - [Missing Values and Imputation](#missing-values-and-imputation)
- [Notebook Summary](#notebook-summary)  

## Data Preparation

In the cells below, I have reported the steps I took to prepare the dataset for modeling.  Readers should recall from the *Introduction and Methods* notebook that none of the code will run unless they download the dataset from DSDR's archive, and import it when prompted below.  **Anyone wishing to reproduce my work must first obtain a copy of the dataset from DSDR.**

Where necessary, I have provided commentary on the code below.  Otherwise, I trust that readers will find the code self-explanatory.

**Note:** Because I add and drop many columns throughout this notebook, the following `PerformanceWarning` may appear, particularly if a reader tries to reproduce my work:
>PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`

Should this happen, readers need not worry about it.  The last cell of this notebook saves a new CSV to be imported into the next notebook, at which point, the dataframe will no longer be fragmented.
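For readers who would rather avoid the warning than silence it, the following is a minimal sketch of the approach the warning message itself suggests: build the new columns first, then attach them in a single `pd.concat` call.  The toy dataframe and its column names are hypothetical stand-ins, not part of the real dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical columns)
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Build every derived column first, then attach them all in one concat
new_cols = {f"{col}_ei": toy[col].fillna(0) for col in toy.columns}
toy = pd.concat([toy, pd.DataFrame(new_cols, index=toy.index)], axis=1)

# .copy() returns a consolidated (de-fragmented) frame, as the warning notes
toy = toy.copy()
print(toy.columns.tolist())
```

This pattern is optional; the code in this notebook adds columns one batch at a time and simply suppresses the warning.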

### Imports

In [1]:
# Import modules
import pandas as pd
import numpy as np
import os
from sklearn.impute import SimpleImputer
from warnings import simplefilter

In [2]:
# Settings preferences
pd.set_option('display.max_rows', None)
pd.options.mode.chained_assignment = None 
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)
# Thanks to daydaybroskii and KingOtto at Stack Overflow for that one
# https://stackoverflow.com/questions/68292862/performancewarning-dataframe-is-highly-fragmented-this-is-usually-the-result-o

While preparing this report on my local computer, I used the absolute path in the cell below, which has since been commented out, to import my copy of the dataset.  Any readers wishing to follow along should uncomment the code in the cell after that, and modify it to point to the location of the file on their own computer.  I named the resulting dataframe `meyer`, in honor of Ilan H. Meyer, the author of the original study and a pioneer in the study of minority stress, especially as it pertains to queer people.  Readers should note that if they change that name in their import cell, they will also have to change it throughout the code.

In [73]:
# # Import the data - mine
# meyer = pd.read_csv(
#   '../../potential_datasets/2024_05_23_download_ICPSR_Meyer_2023_generations_data_attempt_2/ICPSR_37166/DS0007/37166-0007-Data.tsv', 
#   sep = '\t', low_memory=False, na_values = ' ') # Many thanks to ibrahim rupawala for highlighting the na_values argument
#   # https://stackoverflow.com/questions/13445241/replacing-blank-values-white-space-with-nan-in-pandas/47105408#47105408

In [4]:
# Import the data - yours
# meyer = pd.read_csv('your_path/37166-0007-Data.tsv', 
#   sep = '\t', low_memory=False, na_values = ' ')

In [5]:
# Ours
meyer.shape

(1518, 1329)
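The `na_values = ' '` argument in the import cells above matters because blank responses in the raw file are stored as a single space rather than as an empty field.  The sketch below illustrates the effect on a tiny, made-up tab-separated sample; the values are invented for illustration and do not come from the real file.

```python
import io
import pandas as pd

# A tiny tab-separated sample standing in for the real file (made-up values)
sample = "STUDYID\tW1Q02\tW1Q03\n1\t3\t \n2\t \t5\n"

# na_values=' ' tells pandas to read single-space cells as NaN
toy = pd.read_csv(io.StringIO(sample), sep="\t", low_memory=False, na_values=" ")
print(toy.isna().sum())
```

Without that argument, the affected columns would be read as strings containing spaces, and `isna()` would not flag them as missing.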

### Renaming Columns

The variable names in the original dataset are in all uppercase letters.  Per `python` norms (and, conveniently, my own preference), I converted them to all lowercase letters.  This was the only change necessary to achieve `snake_case`.

In [6]:
# Rename the columns
meyer.columns = [c.lower() for c in list(meyer.columns)]

### Feature Selection

Meyer's original study was longitudinal, so the dataset he provided included responses from the participants at up to three different times of measurement, referred to as waves.  Many authors have done excellent analyses on this time-series element.  However, as is often the case, Meyer's longitudinal study suffered from attrition effects, and only 616 participants responded to all three rounds of data collection.  In the interest of preserving a larger sample, I chose to cross-sectionally investigate the wave 1 responses only, for an initial sample size of 1518.  Therefore, I dropped the columns pertaining to wave 2 or 3.

In [7]:
# Display the n per wave; note that 3 = waves 1 and 2, but not wave 3, & 4 = all 3 waves
meyer['waveparticipated'].value_counts(dropna = False).sort_index()

waveparticipated
1    533
2    278
3     91
4    616
Name: count, dtype: int64

In [8]:
# Eliminate W2 and W3 variables from the list
cols = list(meyer.columns)
w1_cols = [c for c in cols if c[:2]!='w2']
w1_cols = [c for c in w1_cols if c[:2]!='w3']
len(w1_cols)

# Drop columns
meyer = meyer[w1_cols]
meyer.shape

(1518, 505)
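As an aside, the same prefix filter can be written in one pass, because `str.startswith` accepts a tuple of prefixes.  This is only an alternative sketch on a toy frame with hypothetical column names; it is not the code that produced the shape above.

```python
import pandas as pd

# Toy frame with hypothetical wave-prefixed columns
toy = pd.DataFrame(
    columns=["studyid", "w1q01", "w2q01", "w3q01", "waveparticipated", "wave3"]
)

# str.startswith accepts a tuple, so both prefixes are tested at once
w1_cols = [c for c in toy.columns if not c.startswith(("w2", "w3"))]
toy = toy[w1_cols]
print(toy.columns.tolist())
```

Note that names such as `wave3` survive this filter, since they begin with `wa` rather than `w3`; columns like that have to be dropped by name.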

Additionally, as described on page 17 of *37166-Documentation-methodology.pdf*, several participants identified as straight/heterosexual on the survey instrument, despite indicating that they were LGB on the screening questionnaire.  Although these people likely have a more nuanced identity than either of those options can fully capture, I opted to err on the side of caution and exclude them from my analyses.

In [9]:
# Exclude anyone marking themselves as straight/heterosexual
meyer = meyer[meyer['w1sexualid']!=1]
meyer.shape

(1507, 505)

Next, I sought to eliminate the columns corresponding to items in scales, in favor of using the scale scores themselves, which were kindly already calculated and included as their own columns.

In [10]:
# Make lists of the items that comprise scales
soc_supp_items = ['w1q164a', 'w1q164b', 'w1q164c', 'w1q164d', 'w1q164e', 
  'w1q164f', 'w1q164g', 'w1q164h', 'w1q164i', 'w1q164j', 'w1q164k', 'w1q164l']

ace_items = ['w1q151', 'w1q152', 'w1q153', 'w1q154', 'w1q155', 
  'w1q156', 'w1q157', 'w1q158', 'w1q159', 'w1q160', 'w1q161']

childhd_gnc_items = ['w1q147', 'w1q148', 'w1q149', 'w1q150']

daily_discr_items = ['w1q144a', 'w1q144b', 'w1q144c','w1q144d', 
  'w1q144e', 'w1q144f', 'w1q144g', 'w1q144h', 'w1q144i']

int_homo_items = ['w1q128', 'w1q129', 'w1q130', 'w1q131', 'w1q132']

felt_stigma_items = ['w1q125', 'w1q126', 'w1q127']

drug_items = ['w1q90', 'w1q91', 'w1q92', 'w1q93', 'w1q94', 
  'w1q95', 'w1q96', 'w1q97', 'w1q98', 'w1q99', 'w1q100']

alc_items = ['w1q85', 'w1q86', 'w1q87']

hc_ster_threat_items = ['w1q60', 'w1q61', 'w1q62', 'w1q63']

comm_conn_items = ['w1q53', 'w1q54', 'w1q55', 'w1q56', 'w1q57', 'w1q58', 'w1q59']

lgbis_items = ['w1q40', 'w1q41', 'w1q42', 'w1q43', 'w1q44']

meim_items = ['w1q21', 'w1q22', 'w1q23', 'w1q24', 'w1q25', 'w1q26']

swl_items = ['w1q186', 'w1q187', 'w1q188', 'w1q189', 'w1q190']

swb_items = ['w1q04', 'w1q05', 'w1q06', 'w1q07', 'w1q08', 'w1q09', 'w1q10', 
  'w1q11', 'w1q12', 'w1q13', 'w1q14', 'w1q15', 'w1q16', 'w1q17', 'w1q18']

# This one is the y variable, so I'm not dropping them just yet
ment_dis_items = ['w1q77a', 'w1q77b', 'w1q77c', 'w1q77d', 'w1q77e', 'w1q77f']

# I will drop the others though
drop_cols = (soc_supp_items + ace_items + childhd_gnc_items + daily_discr_items + 
  int_homo_items + felt_stigma_items + drug_items + alc_items + comm_conn_items + 
  hc_ster_threat_items + lgbis_items + meim_items + swl_items + swb_items)
len(drop_cols)

100

In [11]:
# Drop the scale component columns
meyer.drop(columns = drop_cols, inplace = True)
meyer.shape

(1507, 405)

In [12]:
# These are the scales - more on them later
combo_features = ['w1socialwb', 'w1socialwb_i', 'w1lifesat', 'w1lifesat_i', 'w1meim', 'w1meim_i', 
  'w1idcentral', 'w1idcentral_i', 'w1connectedness', 'w1connectedness_i', 'w1hcthreat', 
  'w1hcthreat_i', 'w1kessler6', 'w1kessler6_i', 'w1auditc', 'w1auditc_i', 'w1dudit', 'w1dudit_i', 
  'w1feltstigma', 'w1feltstigma_i', 'w1internalized', 'w1internalized_i', 'w1everyday', 
  'w1everyday_i', 'w1childgnc', 'w1childgnc_i', 'w1ace', 'w1ace_i', 'w1ace_emo', 'w1ace_emo_i', 
  'w1ace_inc', 'w1ace_inc_i', 'w1ace_ipv', 'w1ace_ipv_i', 'w1ace_men', 'w1ace_men_i', 'w1ace_phy', 
  'w1ace_phy_i', 'w1ace_sep', 'w1ace_sep_i', 'w1ace_sex', 'w1ace_sex_i', 'w1ace_sub', 'w1ace_sub_i', 
  'w1socsupport', 'w1socsupport_fam', 'w1socsupport_fam_i', 'w1socsupport_fr', 'w1socsupport_fr_i', 
  'w1socsupport_i', 'w1socsupport_so', 'w1socsupport_so_i']

Keen-eyed readers may have noticed a pattern in the list above: each variable name appears twice, with `'_i'` added to the end of the second occurrence.  This is because Meyer and his team provided two versions of the computed score column for each scale - one for the original values, and one in which missing values have been imputed.  I discuss imputation - these, and my own - in the next section.
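One way to exploit this naming pattern is to pair each original column with its `'_i'` counterpart and count how many values were filled in.  The sketch below does this on a toy frame; the column names mirror the real ones, but the values are made up.

```python
import numpy as np
import pandas as pd

# Toy scale scores (made-up values); each '_i' column is the imputed version
toy = pd.DataFrame({
    "w1lifesat": [4.0, np.nan, 5.2, np.nan],
    "w1lifesat_i": [4.0, 3.6, 5.2, 4.4],
    "w1meim": [2.5, 3.0, np.nan, 1.5],
    "w1meim_i": [2.5, 3.0, 2.0, 1.5],
})

# Pair each original column with its imputed counterpart and count
# the rows where the original is missing but the imputed one is not
base_cols = [c for c in toy.columns if not c.endswith("_i")]
imputed_counts = {
    c: int((toy[c].isna() & toy[f"{c}_i"].notna()).sum()) for c in base_cols
}
print(imputed_counts)
```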

Finally, before investigating any missing values, I dropped any columns that were not pertinent to my problem statement.

In [13]:
# Identify excess columns
drop_cols_2 = ['gemployment2010', 'gmethod_type', 'gmethod_type_w2', 'gmethod_type_w3', 'gmilesaway',
  'gmsaname', 'gp1', 'gp2', 'grace', 'grespondent_date_w2', 'grespondent_date_w3', 'gruca', 'gruca_i',
  'gsurvey', 'gurban', 'gzipcode', 'gzipstate', 'nopolicecontact', 'w1cumulative_wt_nr1', 
  'w1cumulative_wt_nr2', 'w1cumulative_wt_nr3', 'w1cumulative_wt_sampling', 'w1frame_wt', 'w1hinc', 
  'w1pinc', 'w1poverty', 'w1povertycat', 'w1q02', 'w1q102', 'w1q103', 'w1q104', 'w1q106', 'w1q107', 
  'w1q108', 'w1q110', 'w1q111', 'w1q112', 'w1q115', 'w1q116', 'w1q117', 'w1q118', 'w1q120', 'w1q121',
  'w1q122', 'w1q133', 'w1q133_1', 'w1q133_2', 'w1q133_3', 'w1q134', 'w1q165', 'w1q170_1', 'w1q170_2',
  'w1q170_3', 'w1q170_4', 'w1q172', 'w1q173', 'w1q174', 'w1q176', 'w1q177_1', 'w1q177_10', 
  'w1q177_11', 'w1q177_12', 'w1q177_2', 'w1q177_3', 'w1q177_4', 'w1q177_5', 'w1q177_6', 'w1q177_7', 
  'w1q177_8', 'w1q177_9', 'w1q178', 'w1q182', 'w1q183', 'w1q184', 'w1q185', 'w1q20_1', 'w1q20_2', 
  'w1q20_3', 'w1q20_4', 'w1q20_5', 'w1q20_6', 'w1q20_7', 'w1q20_t_verb', 'w1q27', 'w1q28', 'w1q29', 
  'w1q29_t_verb', 'w1q31a', 'w1q31b', 'w1q31c', 'w1q31d', 'w1q39_1', 'w1q39_10', 'w1q39_11', 
  'w1q39_12', 'w1q39_2', 'w1q39_3', 'w1q39_4', 'w1q39_5', 'w1q39_6', 'w1q39_7', 'w1q39_8', 'w1q39_9',
  'w1q39_t_verb', 'w1q45', 'w1q46', 'w1q47', 'w1q48', 'w1q49', 'w1q50', 'w1q51', 'w1q66_1', 
  'w1q66_2', 'w1q66_3', 'w1q66_4', 'w1q66_5', 'w1q66_t_verb', 'w1q67', 'w1q68_1', 'w1q68_2', 
  'w1q68_3', 'w1q70', 'w1q71', 'w1q73', 'w1q74_1', 'w1q74_12', 'w1q74_13', 'w1q74_15', 'w1q74_16', 
  'w1q74_19', 'w1q74_2', 'w1q74_3', 'w1q74_4', 'w1q74_7', 'w1q74_8', 'w1q74_9', 'w1q80', 'w1q81', 
  'w1q82', 'w1q83', 'w1q84', 'w1q88', 'w1sample', 'w1weight_orig', 'w1weighting_cell_nr1', 
  'w1weighting_cell_nr2and3', 'wave3']

In [14]:
meyer.drop(columns = drop_cols_2, inplace = True)
meyer.shape

(1507, 258)

### Missing Values and Imputation

Because this dataset only has about 1500 rows, I felt it important to preserve as many of them as possible.  Out of this necessity, I opted to allow a larger proportion of imputed data than I would if I had rows to spare.  The implications of this decision will be discussed at greater length in the *Limitations* section, but suffice it to say that it will require me to interpret my models cautiously and conservatively.

Fortunately, Meyer and his team used thorough, theoretically sound methods for their imputations, which gave me confidence in the quality of the imputed data.  Where possible, they filled missing values with logical inferences based on that participant's other responses.  Where that was not possible, they predicted the missing value based on a regression of the participant's other responses, then imputed a randomly selected value from the 5 extant values nearest to that prediction.  See page 30 of *37166-Documentation-methodology.pdf* for more details.
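To make the second step concrete, here is a rough sketch of the idea as summarized above: regress, predict, then borrow a real observed value near the prediction.  It is an illustration on simulated data only, not the original team's code; the single predictor, the simple linear regression, and the random seed are all assumptions, and the documented procedure may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: x is fully observed, y has five missing entries
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)
missing = np.zeros(50, dtype=bool)
missing[:5] = True
y_obs = np.where(missing, np.nan, y)

# Step 1: regress the observed y on x and predict y for every case
slope, intercept = np.polyfit(x[~missing], y_obs[~missing], deg=1)
predicted = intercept + slope * x

# Step 2: for each missing case, find the 5 observed (extant) values
# closest to its prediction and impute one of them at random
donor_vals = y_obs[~missing]
y_imputed = y_obs.copy()
for i in np.flatnonzero(missing):
    nearest = np.argsort(np.abs(donor_vals - predicted[i]))[:5]
    y_imputed[i] = donor_vals[rng.choice(nearest)]
```

Because every imputed value is copied from a real response, the filled-in data can never fall outside the range of values that participants actually gave.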

Satisfied with this methodology, I was inclined to retain the imputed versions and drop the original columns.  Before doing so, however, I wanted to assess how many values had originally been missing.  Features for which a high proportion of values had to be imputed may not be suitable for modeling, and observations for which a high proportion of the Kessler items had to be imputed may also be inappropriate for use.  

For all of the reasons outlined above, however, I opted not to immediately drop any features based solely on their original missing proportions, but simply to monitor their performance in the model(s) with an extra watchful eye.  I also decided to tolerate up to one imputed value among the 6 items on the Kessler scale.  However, in the interest of preserving the integrity of my target variable, I sacrificed any observations for which two or more of these items had required imputation.

In [15]:
# Create a subsetted dataframe to identify any rows to drop
meyer_kessler = meyer[ment_dis_items + ['w1kessler6', 'w1kessler6_i', 'studyid']]
# The original scale score column was NA wherever any items were NA
meyer_kessler = meyer_kessler[meyer_kessler['w1kessler6'].isna()]
meyer_kessler

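| |w1q77a|w1q77b|w1q77c|w1q77d|w1q77e|w1q77f|w1kessler6|w1kessler6_i|studyid|
|-|-|-|-|-|-|-|-|-|-|
|39|NaN|NaN|NaN|NaN|NaN|NaN|NaN|1|151876818|
|50|1.0|NaN|NaN|NaN|NaN|NaN|NaN|9|152304751|
|96|5.0|5.0|NaN|4.0|3.0|4.0|NaN|4|152640441|
|155|5.0|NaN|5.0|4.0|5.0|5.0|NaN|3|153039829|
|214|NaN|NaN|NaN|NaN|NaN|NaN|NaN|0|153390973|
|227|NaN|NaN|NaN|NaN|NaN|NaN|NaN|9|153535970|
|415|NaN|NaN|NaN|NaN|NaN|NaN|NaN|8|154944576|
|447|1.0|NaN|3.0|3.0|4.0|2.0|NaN|16|155106769|
|457|NaN|NaN|NaN|NaN|NaN|NaN|NaN|12|155163167|
|577|NaN|NaN|NaN|NaN|NaN|NaN|NaN|2|155814388|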


In [16]:
# Extract the index numbers of rows with 2+ NAs

# Get the unique identifier
drop_rows = []
for i in meyer_kessler.index:
  if meyer_kessler.loc[i, ment_dis_items].isna().sum() > 1:
    drop_rows.append(meyer_kessler.loc[i, 'studyid'])

# Get the matching index number from the full dataframe
drop_index = []
for i in drop_rows:
  drop_index.append(meyer[meyer['studyid']==i].index[0])

# Drop those rows
meyer.drop(index = drop_index, inplace = True)
meyer.shape

(1494, 258)
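The "two or more missing items" rule can also be expressed without loops by counting missing values across each row.  The sketch below shows the idea on a toy frame with three made-up items; in the real data, the item list would be `ment_dis_items`.

```python
import numpy as np
import pandas as pd

# Toy Kessler-style items (made-up values)
items = ["w1q77a", "w1q77b", "w1q77c"]
toy = pd.DataFrame({
    "w1q77a": [1.0, np.nan, 5.0, np.nan],
    "w1q77b": [2.0, np.nan, np.nan, 3.0],
    "w1q77c": [3.0, 4.0, 4.0, np.nan],
})

# Count missing items per row, then keep rows with at most one missing item
n_missing = toy[items].isna().sum(axis=1)
kept = toy[n_missing <= 1]
print(kept.index.tolist())
```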

In [17]:
# Reset the index (thanks to pandas documentation for this syntax)
meyer.reset_index(drop = True, inplace = True)
meyer.head(3)

| |studyid|waveparticipated|w1weight_full|w1survey_yr|cohort|geduc1|geduc2|geducation|gurban_i|gcendiv|...|w1socialwb|w1socialwb_i|w1socsupport|w1socsupport_fam|w1socsupport_fam_i|w1socsupport_fr|w1socsupport_fr_i|w1socsupport_i|w1socsupport_so|w1socsupport_so_i|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|151339842|4|0.309003|2016|3|3|2|5|1|1|...|5.4|5.4|4.916667|5.0|5.0|5.0|5.0|4.916667|4.75|4.75|
|1|151351600|2|0.931862|2016|2|2|2|4|1|7|...|4.2|4.2|5.916667|6.0|6.0|5.75|5.75|5.916667|6.0|6.0|
|2|151396232|1|0.656089|2016|2|3|2|5|1|4|...|NaN|5.466667|7.0|7.0|7.0|7.0|7.0|7.0|7.0|7.0|


In [18]:
# Drop the individual scale items
meyer.drop(columns = ment_dis_items, inplace = True)
meyer.shape

(1494, 252)

With those rows removed, I next assessed the missing values in the non-imputed scale score columns.

In [19]:
meyer[combo_features].isna().sum().sort_values(ascending = False)

w1ace                 267
w1ace_ipv             130
w1ace_emo              83
w1ace_sex              69
w1dudit                56
w1ace_phy              54
w1socialwb             51
w1connectedness        39
w1socsupport           37
w1everyday             31
w1meim                 22
w1socsupport_fr        22
w1socsupport_fam       21
w1socsupport_so        20
w1lifesat              19
w1internalized         19
w1hcthreat             16
w1ace_inc              16
w1ace_sub              15
w1ace_men              14
w1idcentral            14
w1kessler6             13
w1ace_sep              10
w1feltstigma            8
w1auditc                6
w1ace_men_i             0
w1socsupport_fam_i      0
w1ace_sub_i             0
w1ace_sex_i             0
w1ace_sep_i             0
w1socsupport_fr_i       0
w1ace_phy_i             0
w1socsupport_i          0
w1lifesat_i             0
w1idcentral_i           0
w1ace_emo_i             0
w1ace_ipv_i             0
w1ace_inc_i             0
w1meim_i    

As mentioned above, I was impressed enough by the original authors' imputation methods not to drop any of these features over their missing values alone.  However, quite a few columns had troublingly high numbers of missing values.  Those with approximately 1% missing data or more are displayed below.

|Variable Name| Missing Values|Variable Name| Missing Values|Variable Name| Missing Values|
|--|--|--|--|--|--|
|w1ace        |267 |w1connectedness  | 39 |w1lifesat       | 19 |
|w1ace_ipv    |130 |w1socsupport     | 37 |w1internalized  | 19 |
|w1ace_emo    | 83 |w1everyday       | 31 |w1hcthreat      | 16 |
|w1ace_sex    | 69 |w1meim           | 22 |w1ace_inc       | 16 |
|w1dudit      | 56 |w1socsupport_fr  | 22 |w1ace_sub       | 15 |
|w1ace_phy    | 54 |w1socsupport_fam | 21 |w1ace_men       | 14 |
|w1socialwb   | 51 |w1socsupport_so  | 20 |w1idcentral     | 14 |

As can be seen above, many of the scale scores have large numbers of missing values.  The implications of this will be discussed in the *Limitations* section, but in the interest of not sacrificing too much data, I went forward with my plan to drop these original columns and use the imputed versions.

In [20]:
# Divide that combination list between imputed and not
combo_imputed = [c for c in combo_features if c[-2:]=='_i']
combo_not_imputed = [c for c in combo_features if c[-2:]!='_i']

# Drop the ones that aren't
meyer.drop(columns = combo_not_imputed, inplace = True)
meyer.shape

(1494, 226)

Although many of the scale scores came pre-imputed, other variables did not, so I imputed those myself.  I chose to use the entire dataset for this, prior to the train-test split, because my primary interest lies in an inferential model.  This means that it is less crucial that my workflow and model be generalizable to new data gathered in the future.  Furthermore, because this data concerns sensitive personal and political issues, the median or mode responses would likely change if the survey were readministered, certainly over a long enough timeframe.  (Note: The obvious implications of this temporal drift for a 2016 dataset will be further discussed in the *Limitations* and *Recommendations* sections.)  It would be illogical to fit the imputer on a subset of data from 2016 and then continue projecting those values forward into new data indefinitely.  Therefore, I prioritized fidelity to the current data over adaptability to hypothetical future data, and calculated/fit the imputation values on my entire dataset.
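In practical terms, fitting on the entire dataset means calling the imputer on every row before any split is made.  The sketch below shows what that could look like with the `SimpleImputer` imported earlier, assuming a mode-based (`most_frequent`) strategy; the toy columns and values are hypothetical, and the strategy shown is an illustration rather than a record of the choices made for any particular variable.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy survey responses (made-up values, hypothetical column names)
toy = pd.DataFrame({
    "item_a": [1.0, 1.0, 2.0, np.nan, 1.0],
    "item_b": [3.0, np.nan, 3.0, 2.0, 3.0],
})

# Fit on every row (no train/test split) and fill with each column's mode
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(filled)
```

In a predictive workflow, the imputer would instead be fit on the training rows only and then applied to the test rows.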

In [21]:
# See how many missing values remain
meyer.isna().sum().sort_values(ascending = False)

w1q64_12              1494
w1q64_5               1483
w1q74_6               1481
w1q74_5               1480
w1q74_20              1475
w1q64_11              1474
w1q64_10              1470
w1q74_18              1467
w1q74_14              1463
w1q74_17              1460
w1q64_13              1451
w1q171_8              1451
w1q30_3               1450
w1q64_t_verb          1449
w1q64_7               1448
w1q30_4               1445
w1q171_6              1442
w1q171_4              1440
w1q64_8               1423
w1q74_10              1419
w1q74_21              1417
w1q74_11              1409
w1q171_5              1406
w1q64_6               1405
w1q64_3               1393
w1q64_1               1375
w1q171_9              1362
w1q30_5               1339
w1q171_3              1320
w1q74_22              1309
w1q64_9               1305
w1q145_3              1297
w1q163_3              1264
w1q136_3              1260
w1q145_9              1227
w1q171_2              1222
w1q143_3              1213
w

As seen in the cell above, the columns with the most missing values tend to be those with a `w1qx_x` naming convention.  These are questions with a matrix-style response prompt, from which participants selected only a few - usually one - answer(s).  For example, question 135 asks if the participant has ever experienced certain types of violent or abusive treatment, and then question 136 asks if they think they were targeted for that treatment based on an identity they hold (e.g., age, gender, sexual orientation, disability, etc.).  Each sub-question of 136 is one of these identities.

In [22]:
meyer[['w1q136_1', 'w1q136_2', 'w1q136_3', 'w1q136_4', 'w1q136_5',
    'w1q136_6', 'w1q136_7', 'w1q136_8', 'w1q136_9', 'w1q136_10']].head()

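| |w1q136_1|w1q136_2|w1q136_3|w1q136_4|w1q136_5|w1q136_6|w1q136_7|w1q136_8|w1q136_9|w1q136_10|
|-|-|-|-|-|-|-|-|-|-|-|
|0|NaN|NaN|NaN|4.0|NaN|NaN|7.0|NaN|NaN|NaN|
|1|NaN|NaN|NaN|NaN|NaN|NaN|7.0|8.0|NaN|NaN|
|2|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|
|3|NaN|NaN|NaN|NaN|NaN|NaN|7.0|NaN|NaN|NaN|
|4|NaN|2.0|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|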


Each of these columns had up to 3 values: NaN, 7 or 97, and the number corresponding to the position of the response in the matrix (e.g., the row with index 4 above has a **2** in the `w1q136_2` column).  NaN, 97, and sometimes 7 all referred to missing data; NaNs appeared where participants declined to answer a question that was asked of them, and 7 or 97 appeared when participants were not asked the question at all, usually due to their response to a previous question (e.g., answering "no" to "Do you currently have health insurance?", and then not being asked "What kind of health insurance do you have?").  For all such columns, I converted the NaNs and 97s to 0s, and the target values to 1s.  The 7s required more careful attention, because they were legitimate values in some columns and essentially NAs in others.
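The recoding rule can be summarized on a toy example before applying it to the real columns.  In the sketch below, the column names and values are made up: `q_7` stands in for a column where 7 is a legitimate answer, and `q_2` for one where 7 is just another "not asked" code.

```python
import numpy as np
import pandas as pd

# Toy matrix-style columns (made-up values). In 'q_7' a 7 is a real answer;
# in 'q_2' a 7 is a "not asked" code, like 97.
toy = pd.DataFrame({
    "q_2": [2.0, np.nan, 7.0, 97.0],
    "q_7": [7.0, np.nan, 97.0, 7.0],
})
legit_7_cols = ["q_7"]  # hypothetical stand-in for columns where 7 is valid

recoded = pd.DataFrame(index=toy.index)
for col in toy.columns:
    missing_codes = [97] if col in legit_7_cols else [7, 97]
    # Selected -> 1; NaN or a missing code -> 0
    selected = toy[col].notna() & ~toy[col].isin(missing_codes)
    recoded[f"{col}_ei"] = selected.astype(int)
print(recoded)
```

The cells that follow reach the same end state in several smaller steps, which makes it easier to check the intermediate values along the way.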

In [23]:
# Make a list of all such columns
diy_ohe = ["w1q145_3", "w1q163_3", "w1q136_3", "w1q145_9", "w1q143_3", "w1q136_10", "w1q145_10", 
  "w1q136_9", "w1q163_10", "w1q163_9", "w1q143_9", "w1q143_4", "w1q136_6", "w1q143_10", "w1q143_5", 
  "w1q145_4", "w1q136_5", "w1q143_8", "w1q163_5", "w1q136_4", "w1q163_6", "w1q145_6", "w1q136_1", 
  "w1q143_7", "w1q163_1", "w1q143_2", "w1q143_1", "w1q145_5", "w1q143_6", "w1q163_2", "w1q163_4", 
  "w1q136_8", "w1q145_8", "w1q145_1", "w1q145_7", "w1q163_7", "w1q136_2", "w1q145_2", "w1q139_3", 
  "w1q136_7", "w1q139_9", "w1q139_10", "w1q139_5", "w1q139_4", "w1q139_8", "w1q139_6", "w1q139_2", 
  "w1q139_1", "w1q139_7", "w1q163_8", "w1q141_3", "w1q141_9", "w1q141_10", "w1q141_8", "w1q141_4", 
  "w1q141_1", "w1q141_5", "w1q141_2", "w1q141_7", "w1q141_6", "w1q64_12", "w1q64_5", "w1q74_5", 
  "w1q74_6", "w1q74_20", "w1q64_11", "w1q64_10", "w1q74_18", "w1q74_14", "w1q74_17", "w1q171_8", 
  "w1q30_3", "w1q64_13", "w1q64_7", "w1q30_4", "w1q171_6", "w1q171_4", "w1q64_8", "w1q74_10", 
  "w1q74_21", "w1q64_6", "w1q74_11", "w1q171_5", "w1q64_3", "w1q64_1", "w1q171_9", "w1q30_5", 
  "w1q171_3", "w1q74_22", "w1q64_9", "w1q171_2", "w1q171_7", "w1q74_23", "w1q64_4", "w1q64_2", 
  "w1q30_1", "w1q171_1", "w1q30_2"]
print(len(diy_ohe))

98


In [24]:
# Make distinct column names to put them into - don't overwrite the originals until I'm done!
diy_ohe_new = [''.join([x, '_ei']) for x in diy_ohe]
print(len(diy_ohe_new))

98


In [25]:
# Convert the 97s to 0s
meyer.loc[:, diy_ohe_new] = np.where(meyer.loc[:, diy_ohe]==97, 0, meyer.loc[:, diy_ohe])

In [26]:
# Identify columns with legitimate 7s
nat_7s = [
  'w1q143_7', 'w1q145_7', 'w1q163_7', 'w1q136_7', 'w1q139_7', 'w1q141_7', 'w1q64_7', 'w1q171_7']

In [27]:
# Get the new names for these columns (same as in diy_ohe_new)
nat_7s_ei = [''.join([x, '_ei']) for x in nat_7s]

# Turn those 7s into 1s
meyer.loc[:, nat_7s_ei] = np.where(meyer.loc[:, nat_7s]==7, 1, meyer.loc[:, nat_7s_ei])

In [28]:
# Convert the remaining 7s, where they are equivalent to 97 or NaN, to 0s
meyer.loc[:, diy_ohe_new] = np.where(meyer.loc[:, diy_ohe_new]==7, 0, meyer.loc[:, diy_ohe_new])

In [29]:
# Fill the NAs with 0s
meyer[diy_ohe_new] = meyer[diy_ohe_new].fillna(0)

In [30]:
# Check that all the values are correct
value_list_x = []
check_7s = []
for i in diy_ohe_new:
  a = list(meyer[i].unique())
  if 7 in a:
    check_7s.append(i)
  value_list_x += a
value_list_x = list(set(value_list_x))

print(sorted(value_list_x)) # should be 0-23, no 7s, no NaNs
print(check_7s) # should be blank

[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 9.0, 10.0, 11.0, 13.0, 14.0, 17.0, 18.0, 20.0, 21.0, 22.0, 23.0]
[]


In [31]:
# Reduce to 0s and 1s
meyer.loc[:, diy_ohe_new] = np.where(meyer.loc[:, diy_ohe_new]!=0, 1, meyer.loc[:, diy_ohe_new])

In [32]:
# Check that all the values are correct
value_list_x = []
check_7s = []
for i in diy_ohe_new:
  a = list(meyer[i].unique())
  if 7 in a:
    check_7s.append(i)
  value_list_x += a
value_list_x = list(set(value_list_x))

print(sorted(value_list_x)) # should be 0-1
print(check_7s) # should be blank

[0.0, 1.0]
[]


In [33]:
# Are all the NAs in those columns gone?
meyer[diy_ohe_new].isna().sum().sum()

0

In [34]:
# Drop the old versions
meyer.drop(columns = diy_ohe, inplace = True)

In [35]:
# w1q64_t_verb has a similar structure to the questions above, but it's a write-in field
meyer[['w1q64_t_num']] = meyer[['w1q64_t_verb']].fillna('0')
# thanks to this SO article for advice about an error in a previous version of this code
# https://stackoverflow.com/questions/68292862/performancewarning-dataframe-is-highly-fragmented-this-is-usually-the-result-o

# Reduce to 0s and 1s
meyer.loc[:, 'w1q64_t_num'] = np.where(meyer.loc[:, 'w1q64_t_num']!='0', 1, 0)

# Drop the original
meyer.drop(columns = ['w1q64_t_verb'], inplace = True)
meyer.shape

(1494, 226)

In [36]:
# Are we back to the pre-adding-columns shape?
meyer.shape

(1494, 226)

In [37]:
# Reassess the NAs
meyer.isna().sum().sort_values(ascending = False)

w1q03                 39
w1q175                36
w1q19c                21
w1povertycat_i        20
w1poverty_i           20
w1q72                 19
gmilesaway2           18
w1q179                18
w1q142d               16
w1q119                16
w1q146d               16
w1q109                16
w1q167                15
w1q166                15
w1q180                15
w1q19d                15
w1q52                 14
w1q142i               13
w1q142g               13
w1q75                 13
w1q181                13
w1q162                13
w1q65                 12
w1q142h               11
w1q69                 11
w1q146c               11
w1q135d               11
w1q78                 11
w1q105                10
w1q146f               10
w1q142b               10
w1q142j               10
w1q142f               10
w1q146g               10
w1q142e               10
w1q169                10
w1q32                 10
w1q19b                10
w1q168                10
w1q146e                9


The next batch of columns I tackled included `'w1q101'`, `'w1q105'`, `'w1q109'`, `'w1q113'`, and `'w1q114'`, which are about suicidal ideation and behavior.  Among the first four, each question increases in severity - thoughts, intent, planning, and attempts, respectively.  If someone failed to answer an earlier question, then said yes to a later question, I imputed "yes, once" (2) for the earlier question.  Otherwise I imputed "no" (1).  "Yes, more than once," was also an option, but I chose to err on the conservative side.

In [38]:
# Create new columns for the imputed values
meyer['w1q101_ei'] = meyer['w1q101'].copy()
meyer['w1q105_ei'] = meyer['w1q105'].copy()
meyer['w1q109_ei'] = meyer['w1q109'].copy()

# Define the row conditions - NA in a given column
cond0 = meyer['w1q101'].isna()
cond1 = meyer['w1q105'].isna()
cond2 = meyer['w1q109'].isna()

# Define the column conditions - they gave an answer other than "no"
y1 = ((meyer['w1q105'].notna()) & (meyer['w1q105']!=1))
y2 = ((meyer['w1q109'].notna()) & (meyer['w1q109']!=1))
y3 = ((meyer['w1q113'].notna()) & (meyer['w1q113']!=1))

# Use more severe questions to impute less severe questions
meyer.loc[((y1 | y2 | y3) & cond0), 'w1q101_ei'] = 2
meyer.loc[((y2 | y3) & cond1), 'w1q105_ei'] = 2
meyer.loc[(y3 & cond2), 'w1q109_ei'] = 2

In [39]:
# Impute all others - 'w1q119' is a similarly structured question about non-suicidal self-harm
meyer[['w1q101_ei', 'w1q105_ei', 'w1q109_ei', "w1q113_ei", "w1q119_ei"]] = meyer[[
    'w1q101_ei', 'w1q105_ei', 'w1q109_ei', "w1q113", "w1q119"]].fillna(1)

In [40]:
# Drop the originals
meyer.drop(columns = ['w1q101', 'w1q105', 'w1q109', "w1q113", "w1q119"], inplace = True)
meyer.shape

(1494, 226)
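To see the severity rule in isolation, consider a toy version with only two questions.  The column names and responses below are made up; the coding (1 = no, 2 = yes, once, 3 = yes, more than once) follows the description above.

```python
import numpy as np
import pandas as pd

# Toy responses (made-up): 1 = no, 2 = yes, once, 3 = yes, more than once
toy = pd.DataFrame({
    "thoughts": [np.nan, np.nan, 1.0],
    "attempts": [2.0, 1.0, np.nan],
})

# A missing 'thoughts' answer becomes 2 when the more severe item is a "yes"
later_yes = toy["attempts"].notna() & (toy["attempts"] != 1)
toy["thoughts_ei"] = toy["thoughts"].mask(toy["thoughts"].isna() & later_yes, 2)

# Everything still missing is filled with 1 ("no")
toy["thoughts_ei"] = toy["thoughts_ei"].fillna(1)
print(toy["thoughts_ei"].tolist())
```

The first participant skipped the less severe question but reported an attempt, so a "yes, once" is inferred; the second skipped it and reported no attempt, so "no" is imputed.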

Column `'w1q114'` asks how many suicide attempts someone has made, and includes 97s for "planned missing" values (questions a participant was never asked), as discussed above.  First I converted those 97s to 0s, then imputed 0 for all remaining missing values, although some of those 0s were changed again in a later step.  I also recoded some values on the scale to make it more (approximately) linear.  In the original question, participants could select a value from 1-5 to indicate, respectively, 1-5 lifetime suicide attempts.  Starting at 6, however, each value represented a range of attempts:

|Value|Number of Lifetime Suicide Attempts|
|-|-|
|6|6-10|
|7|11-15|
|8|16-20|
|9|21 or more|

Having individual values and ranges on the same scale presents challenges for linear modeling.  Although there is no way to know the specific number of attempts for any participant who selected a range, I mitigated the deleterious effects by changing each value to the minimum number in its range.  This replaced the original set of possible responses (1-9) with one that more accurately represented the distance between any two options (1-6, 11, 16, 21).
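The recoding amounts to a small lookup table from response code to the minimum of its range.  The sketch below expresses it as a dictionary passed to `Series.replace`, using made-up responses; it is an equivalent illustration rather than the exact code in the next cell.

```python
import pandas as pd

# Response code -> minimum number of attempts in that code's range
# (codes 1-6 already equal their minimum, so only 7-9 change)
range_minimums = {7: 11, 8: 16, 9: 21}

toy = pd.Series([1, 5, 6, 7, 8, 9], name="w1q114")  # made-up responses
recoded = toy.replace(range_minimums)
print(recoded.tolist())
```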

In [41]:
# Replace the 97s with 0s
meyer.loc[:, 'w1q114_ei'] = np.where(meyer.loc[:, 'w1q114']==97, 0, meyer.loc[:, 'w1q114'])

# Replace the range values
cond7 = meyer['w1q114_ei']==7.0
meyer.loc[cond7, 'w1q114_ei']=11

cond8 = meyer['w1q114_ei']==8.0
meyer.loc[cond8, 'w1q114_ei']=16

cond9 = meyer['w1q114_ei']==9.0
meyer.loc[cond9, 'w1q114_ei']=21

In [42]:
# Impute all others
meyer['w1q114_ei'] = meyer['w1q114_ei'].fillna(0)

In [43]:
# Drop the original
meyer.drop(columns = ['w1q114'], inplace = True)
meyer.shape

(1494, 226)

Features `'w1q113'` ("Did you ever make a suicide attempt?"), and `'w1q114'` ("If yes, how many different suicide attempts did you ever make?") are redundant, and using both in the model would introduce unnecessary collinearity.  Therefore, wherever someone was marked as a 0 in `'w1q114'` (would have originally been NA; the scale did not include 0), but a "yes" in `'w1q113'`, I imputed a value into `'w1q114'` to rectify the anomaly.  I then dropped `'w1q113'`.

In [44]:
# Define the conditions
cond1 = meyer['w1q113_ei']>1
cond2 = meyer['w1q114_ei']==0

# See the observations
check = meyer.loc[(cond1 & cond2), ['w1q113_ei', 'w1q114_ei']]
check

| |w1q113_ei|w1q114_ei|
|-|-|-|
|443|3.0|0.0|
|959|3.0|0.0|


In both cases, the participant said they attempted suicide more than once, and then declined to say how many times.  I imputed 2 attempts for each of them.

In [45]:
# Imputation
meyer.loc[(cond1 & cond2), 'w1q114_ei'] = 2

In [46]:
# Drop 'w1q113_ei'
meyer.drop(columns = ['w1q113_ei'], inplace = True)
meyer.shape

(1494, 225)

Questions 166-168 asked participants if they and/or their parents were born in the United States, or if they had primarily lived in the United States as a child.  Where possible, I imputed missing values in any one of these columns based on deductions from the other two.  Where that was not possible, I imputed the most common responses.

In [47]:
# Create new columns for the imputed values
meyer[['w1q166_ei']] = meyer[['w1q166']]
meyer[['w1q167_ei']] = meyer[['w1q167']]
meyer[['w1q168_ei']] = meyer[['w1q168']]

# If they left all 3 blank, impute the mode
a = meyer['w1q166'].isna()
b = meyer['w1q167'].isna()
c = meyer['w1q168'].isna()

meyer.loc[(a & b & c), 'w1q166_ei'] = 1 # Yes, I was born in the US (92.7%)
meyer.loc[(a & b & c), 'w1q167_ei'] = 1 # Yes, I primarily grew up in the US (94.5%)
meyer.loc[(a & b & c), 'w1q168_ei'] = 3 # Both of my parents were born in the US (79.2%)

# Check the rest
test = meyer[['w1q166_ei', 'w1q167_ei', 'w1q168_ei']]
test1 = test[test['w1q166_ei'].isna()]
test2 = test[test['w1q167_ei'].isna()]
test3 = test[test['w1q168_ei'].isna()]

In [48]:
test1

| |w1q166_ei|w1q167_ei|w1q168_ei|
|-|-|-|-|
|150| |1.0|3.0|
|196| |1.0|3.0|
|892| |1.0|3.0|
|967| | |3.0|
|1067| |1.0|2.0|


In [49]:
# Looking at the pattern of associated 167s and 168s, 
# I decided it made the most sense to impute 1s for the remaining NAs in 166
meyer[['w1q166_ei']] = meyer[['w1q166_ei']].fillna(1)

In [50]:
test2

| |w1q166_ei|w1q167_ei|w1q168_ei|
|-|-|-|-|
|647|1.0| |3.0|
|881|2.0| |2.0|
|933|1.0| |3.0|
|967| | |3.0|
|1125|1.0| |3.0|


In [51]:
# Looking at the pattern of associated 166s and 168s, I decided it made the most 
# sense to impute values for the remaining NAs in 167 according to this logic

# Define some conditions
a = meyer['w1q166']==2 # I wasn't born here
b = meyer['w1q167'].isna() # The target values
c = ((meyer['w1q168'].notna()) & (meyer['w1q168']<3)) # 1+ parent not born here

# Under those conditions, impute "I didn't primarily live in the US as a child"
meyer.loc[(a & b & c), 'w1q167_ei'] = 2 
# Otherwise impute "yes"
meyer[['w1q167_ei']] = meyer[['w1q167_ei']].fillna(1)

In [52]:
test3

| |w1q166_ei|w1q167_ei|w1q168_ei|
|-|-|-|-|


In [53]:
# 168 is all done!

# Drop the originals
meyer.drop(columns = ['w1q166', 'w1q167', 'w1q168'], inplace = True)
meyer.shape

(1494, 225)
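The deduction rules for questions 166-168 are spread across several cells above.  The sketch below gathers them into one hypothetical helper, `impute_birthplace`, applied to invented toy rows.  It is meant only to make the order of the rules explicit, under the assumption that the response codes are the ones given in the comments above (1 = yes, 2 = no for 166 and 167; 3 = both parents born in the US for 168).

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame({
    'w1q166': [np.nan, np.nan, 2.0, 1.0, np.nan],
    'w1q167': [np.nan, 1.0, np.nan, np.nan, np.nan],
    'w1q168': [np.nan, 3.0, 2.0, 3.0, 3.0],
})

def impute_birthplace(df):
    """Return imputed copies of questions 166-168 (hypothetical helper)."""
    out = pd.DataFrame({
        'w1q166_ei': df['w1q166'],
        'w1q167_ei': df['w1q167'],
        'w1q168_ei': df['w1q168'],
    })

    # Rule 1: all three blank -> impute the most common responses
    all_blank = df[['w1q166', 'w1q167', 'w1q168']].isna().all(axis=1)
    out.loc[all_blank, 'w1q166_ei'] = 1
    out.loc[all_blank, 'w1q167_ei'] = 1
    out.loc[all_blank, 'w1q168_ei'] = 3

    # Rule 2: born abroad, 1+ parent born abroad, 167 blank -> 167 is "no" (2)
    abroad = (df['w1q166'] == 2) & df['w1q167'].isna() & (df['w1q168'] < 3)
    out.loc[abroad, 'w1q167_ei'] = 2

    # Rule 3: anything still missing in 166 or 167 -> "yes" (1)
    out['w1q166_ei'] = out['w1q166_ei'].fillna(1)
    out['w1q167_ei'] = out['w1q167_ei'].fillna(1)
    return out

imputed = impute_birthplace(toy)
print(imputed)
```

Keeping the rules in one function makes it easier to see that the more specific deduction (Rule 2) is applied before the general fallback (Rule 3), which is the same ordering the cells above rely on.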

With the "batches" of columns handled, I next devised imputation plans for the rest of the columns with missing values on a more or less case-by-case basis.  First, wherever I could logically deduce a reasonable imputation, I did so.  Comments in the code briefly explain that logic.

In [54]:
# gmilesaway2 is about their proximity to an LGBT-specific health center
meyer[['gmilesaway2_ei']] = meyer[['gmilesaway2']].fillna(0)

# Reverse code so 1=close
meyer[['gmilesaway2_ei_r']] = 1-meyer[['gmilesaway2_ei']]

# Drop the original and interim
meyer.drop(columns = ['gmilesaway2', 'gmilesaway2_ei'], inplace = True)
meyer.shape

(1494, 225)

In [55]:
# w1q123 is about outness in different social contexts.
# They had an option (5) for "don't know/doesn't apply."  I don't know what 
# the truth is for these missing values, so I'm recoding them to "don't know."
meyer[['w1q123a_ei', 'w1q123b_ei', 'w1q123c_ei', 'w1q123d_ei']] = meyer[[
  'w1q123a', 'w1q123b', 'w1q123c', 'w1q123d']].fillna(5)

# Drop the originals
meyer.drop(columns = ['w1q123a', 'w1q123b', 'w1q123c', 'w1q123d'], inplace = True)
meyer.shape

(1494, 225)

In [56]:
# w1q179 is about religion
# Option 13 means "nothing in particular", which seems logical for missing values
meyer[['w1q179_ei']] = meyer[['w1q179']].fillna(13)
meyer[['w1q179', 'w1q179_ei']]

meyer.drop(columns = 'w1q179', inplace = True)
meyer.shape

(1494, 225)

In [57]:
# w1q32 asks if they are in a relationship.  The subsequent questions are about that 
# relationship, and people who said "no" or didn't answer were not shown those questions.
# Given that NAs were treated as "no", I imputed them as "no."
meyer[['w1q32_ei']] = meyer[['w1q32']].fillna(2)

# Drop the original
meyer.drop(columns = ['w1q32'], inplace = True)
meyer.shape

(1494, 225)

In [58]:
# 'w1q33' and 'w1q34'
# These questions are about romantic partner relationships.
# The 97s are people who said they don't have partners.  
# The logical imputation is 0.  The original values did not include 0s.

# Fill the NAs
meyer[['w1q33_ei']] = meyer[['w1q33']].fillna(0)
meyer[['w1q34_ei']] = meyer[['w1q34']].fillna(0)

# Recode the 97s and 7s
cond1 = meyer['w1q33_ei']==97.0
meyer.loc[cond1, 'w1q33_ei']=0

cond2 = meyer['w1q34_ei']==7.0
meyer.loc[cond2, 'w1q34_ei']=0

# Drop the originals
meyer.drop(columns = ['w1q33', 'w1q34'], inplace = True)
meyer.shape

(1494, 225)

In [59]:
# Questions 137 and 138 ask about poor treatment in the workplace, and question 139 asks 
# if they believe they were targeted for that treatment because of an identity they hold.
# If they didn't answer 137 and/or 138, but did provide an answer in 139, I imputed "Once" (2)
# into both 137 and 138.  Otherwise, I imputed "Never" (1).

# Create a new column to sum up the sub-parts of 139
meyer['sum_w1q139'] = meyer["w1q139_1_ei"] + meyer[
  "w1q139_2_ei"] + meyer["w1q139_3_ei"] + meyer[
  "w1q139_4_ei"] + meyer["w1q139_5_ei"] + meyer[
  "w1q139_6_ei"] + meyer["w1q139_7_ei"] + meyer[
  "w1q139_8_ei"] + meyer["w1q139_9_ei"] + meyer["w1q139_10_ei"]

# Create new columns for 137 and 138
meyer[['w1q137_ei']] = meyer[['w1q137']]
meyer[['w1q138_ei']] = meyer[['w1q138']]

# Define conditions
cond1 = meyer['w1q137'].isna()
cond2 = meyer['w1q138'].isna()

# In the new columns, fill the NAs with 2 or 1 based on the sum column
meyer.loc[cond1, 'w1q137_ei'] = np.where(meyer.loc[cond1, 'sum_w1q139']>0, 2, 1)
meyer.loc[cond2, 'w1q138_ei'] = np.where(meyer.loc[cond2, 'sum_w1q139']>0, 2, 1)

# Drop the originals and the sum
meyer.drop(columns = ['w1q137', 'w1q138', 'sum_w1q139'], inplace = True)
meyer.shape

(1494, 225)

In [60]:
# Similarly, question 140 asks about housing discrimination, and question 141 asks why
# the participant thinks they had that experience.

# Create a new column to sum up the sub-parts of 141
meyer['sum_w1q141'] = meyer["w1q141_1_ei"] + meyer[
  "w1q141_2_ei"] + meyer["w1q141_3_ei"] + meyer[
  "w1q141_4_ei"] + meyer["w1q141_5_ei"] + meyer[
  "w1q141_6_ei"] + meyer["w1q141_7_ei"] + meyer[
  "w1q141_8_ei"] + meyer["w1q141_9_ei"] + meyer["w1q141_10_ei"]

# Create a new column for 140
meyer[['w1q140_ei']] = meyer[['w1q140']]

# Define the condition
cond1 = meyer['w1q140'].isna()

# In the new column, fill the NAs with 2 or 1 based on the sum column
meyer.loc[cond1, 'w1q140_ei'] = np.where(meyer.loc[cond1, 'sum_w1q141']>0, 2, 1)

# Drop the original and the sum
meyer.drop(columns = ['w1q140', 'sum_w1q141'], inplace = True)
meyer.shape

(1494, 225)
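The two cells above (137/138 from 139, and 140 from 141) follow the same pattern: a missing answer is filled with "Once" (2) if any follow-up item was endorsed, and "Never" (1) otherwise.  As a hedged sketch, that pattern could be wrapped in one small helper.  The function name `fill_from_followup`, its parameters, and the toy rows are all hypothetical; the sketch also assumes the follow-up columns share a common name prefix and contain no missing values, as the `_ei` columns do at this point in the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame({
    'w1q140': [1.0, np.nan, np.nan, 3.0],
    'w1q141_1_ei': [0, 1, 0, 0],
    'w1q141_2_ei': [0, 0, 0, 1],
})

def fill_from_followup(df, target, followup_prefix):
    """Fill NAs in `target` with 2 if any follow-up item is set, else 1."""
    followup_cols = [c for c in df.columns if c.startswith(followup_prefix)]
    any_followup = df[followup_cols].sum(axis=1) > 0
    fallback = pd.Series(np.where(any_followup, 2, 1), index=df.index)
    # fillna aligns on the index, so only the missing rows take the fallback
    return df[target].fillna(fallback)

toy['w1q140_ei'] = fill_from_followup(toy, 'w1q140', 'w1q141_')
print(toy['w1q140_ei'].tolist())
```

Selecting the follow-up columns by prefix avoids typing out each sub-item by hand, which is where a typo would be easiest to miss.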

In [61]:
# Question 146b asks if the participant agrees with the statement, "you don't have enough
# money to make ends meet."  'w1poverty_i' categorized people as below or above the poverty 
# line, based on their reported income and household size.  If someone was categorized as 
# below the poverty line, but didn't answer 146b, I imputed "somewhat true" (1) for 146b.
# Otherwise I imputed "Not true" (0).  This was a difficult decision for me, because although
# "Not true" was the most common response among the populated values (47.4%), most participants 
# did indicate agreement with the statement ("Somewhat true": 29.2%, "Very true": 22.4%, 
# combined: 51.6%).  However, I chose to err on the side of caution, and use 0.

# Create a new column for 146b
meyer[['w1q146b_ei']] = meyer[['w1q146b']]

# The np.where got messy, so I'm doing the same thing with a for loop here.
for i in list(range(meyer.shape[0])):
  if pd.notna(meyer.loc[i, 'w1q146b_ei'])==True:
    continue
  elif meyer.loc[i, 'w1poverty_i']==1:  # census poverty yes
    meyer.loc[i, 'w1q146b_ei']=1
  else:
    meyer.loc[i, 'w1q146b_ei']=0

# Drop the original
meyer.drop(columns = 'w1q146b', inplace = True)
meyer.shape

(1494, 225)

In [62]:
# Question 146d asks about unemployment, but 142 also asked about unemployment.
# I used those responses to infer "Not true" (0) or "Somewhat true" (1) for 146d.

# Create a new column for 146d
meyer[['w1q146d_ei']] = meyer[['w1q146d']]

# The np.where got messy, so I'm doing the same thing with a for loop here.
for i in list(range(meyer.shape[0])):
  if pd.notna(meyer.loc[i, 'w1q146d_ei'])==True:  # skip the ones that are already populated
    continue 
  elif meyer.loc[i, 'w1q142c']==1:
    meyer.loc[i, 'w1q146d_ei']=1
  elif meyer.loc[i, 'w1q142b']==1:
    meyer.loc[i, 'w1q146d_ei']=1
  else:
    meyer.loc[i, 'w1q146d_ei']=0

# Drop the original
meyer.drop(columns = 'w1q146d', inplace = True)
meyer.shape

(1494, 225)
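The two `for` loops above fill each missing response with 1 when a related flag is set and 0 otherwise.  One possible vectorized equivalent is sketched below on invented toy rows.  The helper name `fill_missing_by_flag` is hypothetical, and the value 2 is used in the toy data simply as "any response other than 1"; the actual response codes for question 142 are not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame({
    'w1q146b': [2.0, np.nan, np.nan, 0.0],
    'w1poverty_i': [0.0, 1.0, np.nan, 1.0],
    'w1q146d': [np.nan, np.nan, 1.0, np.nan],
    'w1q142b': [1.0, 2.0, 2.0, 2.0],
    'w1q142c': [2.0, 2.0, 1.0, np.nan],
})

def fill_missing_by_flag(responses, flag):
    """Fill missing responses with 1 where `flag` is True, else 0."""
    return responses.fillna(flag.astype(int))

# 146b: below the poverty line -> "Somewhat true" (1), otherwise "Not true" (0)
toy['w1q146b_ei'] = fill_missing_by_flag(
    toy['w1q146b'], toy['w1poverty_i'] == 1)

# 146d: either unemployment item endorsed -> 1, otherwise 0
toy['w1q146d_ei'] = fill_missing_by_flag(
    toy['w1q146d'], (toy['w1q142c'] == 1) | (toy['w1q142b'] == 1))

print(toy[['w1q146b_ei', 'w1q146d_ei']])
```

Unlike the loops, which look rows up by the integers 0 through `meyer.shape[0] - 1`, this version does not depend on the dataframe having a default integer index.  A missing flag compares as `False`, so it falls through to 0 just as it does in the `else` branch of the loops.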

When the only remaining features were ones for which no logical imputation value presented itself, I opted to use either the median or the mode of their populated counterparts.  I chose which measure to use for each column based on my own best judgment of which made the most logical sense.  I will discuss how I would like to improve upon my imputation process in the *Limitations* and *Recommendations* sections, but I deemed that this expedient approximation would suffice for the time being.

In [63]:
# Fix some 98s (like 97s) on this one
meyer.loc[:, 'w1q01'] = np.where(meyer.loc[:, 'w1q01']==98, np.nan, meyer.loc[:, 'w1q01'])

# Define the lists for imputation
s_median = ["w1q146c", "w1q162", "w1q181", "w1q146f", "w1q146g", "w1q146e", 
  "w1q146a", "w1q146h", "w1q146i", "w1q146j", "w1q146k", "w1q146l", "w1q169"]

s_mode = ["w1q175", "w1q72", "w1q142d", "w1q65", "w1q52", "w1q142i", "w1q142g", 
  "w1q180", "w1q69", "w1q135d", "w1q142h", "w1q135b", "w1q135c", "w1q135e", 
  "w1q142b", "w1q142e", "w1q142f", "w1q142j", "w1q142a", "w1q142k", "w1q19c", 
  "w1q19d", "w1q19b", "w1q19a", "w1q75", "w1q78", "w1q79", "w1q76", "w1q135a", 
  "w1q135f", "w1q142c", "w1q35", "w1q36", "w1q37", "w1q38", "w1q124", "w1q89",
  "w1q03", 'w1q01', 'w1poverty_i', 'w1povertycat_i']
  # re the _i variables: Meyer imputed many of these values by imputing
  # household income, but left NAs where household size was missing.
  # See p. 18-19 of 37166-Documentation-methodology.pdf

# Define new column names
s_median_new = [''.join([x, '_ei']) for x in s_median]
s_mode_new = [''.join([x, '_ei']) for x in s_mode]

In [64]:
# Make extra sure they're in the same order
print(sorted(s_mode)==['_'.join(s.split('_')[:-1]) for s in sorted(s_mode_new)])
print(sorted(s_median)==['_'.join(s.split('_')[:-1]) for s in sorted(s_median_new)])

True
True


In [65]:
# Split the dataframe up to use 2 different imputers
meyer_median = meyer[s_median+['studyid']]
meyer_mode = meyer[s_mode+['studyid']]

# Rename the columns in the new dataframes
s_median_new.append('studyid')
s_mode_new.append('studyid')
meyer_median.columns = s_median_new
meyer_mode.columns = s_mode_new

In [66]:
# Instantiate the imputers
si_med = SimpleImputer(strategy = 'median')
si_mode = SimpleImputer(strategy = 'most_frequent')

# Do the imputation
meyer_median = pd.DataFrame(
  si_med.fit_transform(meyer_median), columns = si_med.get_feature_names_out())

meyer_mode = pd.DataFrame(
  si_mode.fit_transform(meyer_mode), columns = si_mode.get_feature_names_out())

# Check it out - should be 0s
print(meyer_median.isna().sum().sum())
print(meyer_mode.isna().sum().sum())

0
0


In [67]:
# Merge them back together
meyer_m = meyer.merge(
  meyer_median, left_on = 'studyid', right_on = 'studyid', how = 'left').merge(
    meyer_mode, left_on = 'studyid', right_on = 'studyid', how = 'left')

# All good?
meyer_m.shape[1]==(meyer.shape[1]+len(s_median)+len(s_mode))

True

In [68]:
# Put it back under its own name
meyer = meyer_m.copy(deep = True)
del meyer_m

# Drop the originals
drop_cols = s_median + s_mode
meyer.drop(columns = drop_cols, inplace = True)
meyer.shape

(1494, 225)
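For readers who want to see the median and mode imputation in miniature, the sketch below reproduces the idea with plain `pandas` on invented toy rows.  It is not the method used above (which relies on `SimpleImputer` and a merge on `'studyid'`); it is a simplified illustration in which `median_cols`, `mode_cols`, and `imputed` are hypothetical names, and only one column from each of the two lists is included.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame({
    'studyid': [101, 102, 103, 104, 105],
    'w1q162': [1.0, 2.0, np.nan, 4.0, 10.0],  # a median-imputed column
    'w1q175': [3.0, 3.0, 1.0, np.nan, 2.0],   # a mode-imputed column
})

median_cols = ['w1q162']
mode_cols = ['w1q175']

# One fill value per column; NAs are ignored when computing each statistic
medians = toy[median_cols].median()
modes = toy[mode_cols].mode().iloc[0]  # first row holds the smallest mode

# Build all new columns at once and join them in a single concat
imputed = pd.concat(
    [toy[['studyid']],
     toy[median_cols].fillna(medians).add_suffix('_ei'),
     toy[mode_cols].fillna(modes).add_suffix('_ei')],
    axis=1)

print(imputed)
```

Joining the new columns in one `pd.concat(axis=1)` call is also the approach suggested by the `PerformanceWarning` quoted at the top of this notebook.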

In [69]:
# Recheck the NAs
meyer.isna().sum().sort_values(ascending = False)

studyid               0
w1q74_6_ei            0
w1q64_4_ei            0
w1q64_2_ei            0
w1q30_1_ei            0
w1q171_1_ei           0
w1q30_2_ei            0
w1q64_t_num           0
w1q101_ei             0
w1q105_ei             0
w1q109_ei             0
w1q119_ei             0
w1q114_ei             0
w1q166_ei             0
w1q167_ei             0
w1q168_ei             0
gmilesaway2_ei_r      0
w1q123a_ei            0
w1q123b_ei            0
w1q123c_ei            0
w1q123d_ei            0
w1q179_ei             0
w1q32_ei              0
w1q33_ei              0
w1q34_ei              0
w1q137_ei             0
w1q138_ei             0
w1q74_23_ei           0
w1q171_7_ei           0
w1q171_2_ei           0
w1q171_4_ei           0
w1q64_11_ei           0
w1q64_10_ei           0
w1q74_18_ei           0
w1q74_14_ei           0
w1q74_17_ei           0
w1q171_8_ei           0
w1q30_3_ei            0
w1q64_13_ei           0
w1q64_7_ei            0
w1q30_4_ei            0
w1q171_6_ei           0

In [70]:
# Reorder the columns
ordered_cols = sorted(list(meyer.columns))
ordered_cols.remove('studyid')
ordered_cols = ['studyid'] + ordered_cols
len(ordered_cols)==meyer.shape[1]

True

In [71]:
meyer = meyer[ordered_cols]
meyer.shape

(1494, 225)

## Notebook Summary

In this notebook, I ingested the data, gave the columns more convenient names, conducted a first round of feature selection, and resolved all missing values.

In the following notebook, I will finish preparing the data for modeling by creating composite score features, recoding columns as necessary, and performing any other necessary feature engineering.

Any readers who are following along or attempting to reproduce my work should use the cell below to save a copy of the dataframe as it exists now.  A cell is provided at the top of the next notebook in which to import that copy.

In [72]:
# Save a copy of the dataframe to use in the next notebook
meyer.to_csv('../02_data/df_after_data_preparation_part_1.csv', index = False)