# **Pride and Joy**
### *An investigation of mental health correlates in LGBQ+ people*
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|Emily K. Sanders| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |Capstone Project|
|DSB-318| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |June 13, 2024|
---

In [492]:
#is there any way to space that text out so that it fits the edges of whatever screen it's on?


## Prior Notebooks Summary

In the previous notebook, I 

- **gave a problem statement!**
- **listed deliverables!**
- **explained that I will be using the word queer!!!**
- **explained that it was the original author's idea to exclude binary trans people, not mine**

In this notebook, I will prepare the dataset for modeling, including `python` code.

## Table of Contents

- [Data Preparation](#Data-Preparation)
  - [Imports](#Imports)
  - [Renaming Columns](#Renaming-Columns)
  - [Feature Selection, Part 1](#Feature-Selection-Part-1)
  - [Missing Values and Imputation](#Missing-Values-and-Imputation)
  - [Feature Engineering](#Feature-Engineering)
  - [Feature Selection, Part 2](#Feature-Selection-Part-2)
- [Notebook Summary](#Notebook-Summary)

## Data Preparation

In the cells below, I have reported the steps I took to prepare the dataset for modeling.  Readers should recall from the Introduction and Methods notebook that none of the code will run unless they download the dataset from DSDR's archive, and import it when prompted below.  **Anyone wishing to reproduce my work must first obtain a copy of the dataset from DSDR.**

Where necessary, I have provided commentary on the code below.  Otherwise, I trust that readers will find the code self-explanatory.

### Imports

In [493]:
# Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from emilys_functions import my_date, autoplots

In [494]:
# Settings preferences
pd.set_option('display.max_rows', None)
pd.options.mode.chained_assignment = None 

While preparing this report on my local computer, I used the absolute path in the cell below, which has since been commented out, to import my copy of the dataset.  Any readers wishing to follow along should uncomment the code in the cell after that, and modify it to point to the location of the file on their own computer.  I named the resulting dataframe `meyer`, in honor of Ilan H. Meyer, the author of the original study.  Readers should note that if they change that name in their import cell, they will also have to change it throughout the code.

In [495]:
# Import the data - mine
meyer = pd.read_csv(
  '../../potential_datasets/2024_05_23_download_ICPSR_Meyer_2023_generations_data_attempt_2/ICPSR_37166/DS0007/37166-0007-Data.tsv', 
  sep = '\t', low_memory=False, na_values = ' ') # Many thanks to ibrahim rupawala for highlighting the na_values argument
  # https://stackoverflow.com/questions/13445241/replacing-blank-values-white-space-with-nan-in-pandas/47105408#47105408

In [496]:
# Import the data - yours
# meyer = pd.read_csv('your_path/37166-0007-Data.tsv', 
#   sep = '\t', low_memory=False, na_values = ' ')
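
For readers unfamiliar with the `na_values` argument, the sketch below shows its effect on a made-up two-column TSV.  The toy text and the name `toy` are assumptions for illustration only, not part of the dataset.  A field that holds a single space is read as NaN, so the column can still be parsed as numeric.

```python
import io
import pandas as pd

# Toy tab-separated text: the second field of the first data row is a single space
toy_tsv = "a\tb\n1\t \n2\t3\n"

# With na_values=' ', the lone space is parsed as a missing value
toy = pd.read_csv(io.StringIO(toy_tsv), sep='\t', na_values=' ')
print(toy['b'].isna().sum())
```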

In [497]:
# Ours
meyer.shape

(1518, 1329)

### Renaming Columns

The variable names in the original dataset are in all uppercase letters.  Per `python` norms (and, conveniently, my own preference), I converted them to all lowercase letters.  

In [498]:
# Rename the columns
meyer.columns = [c.lower() for c in list(meyer.columns)]
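
An equivalent and slightly more compact option is the vectorized string accessor on the column index.  The sketch below uses a toy dataframe with made-up uppercase column names, so it runs without the real data.

```python
import pandas as pd

# Toy dataframe with uppercase column names (hypothetical, for illustration)
toy = pd.DataFrame({'STUDYID': [1, 2], 'W1AGE': [30, 41]})

# Lowercase every column name in one step
toy.columns = toy.columns.str.lower()
print(list(toy.columns))
```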

### Feature Selection, Part 1

Meyer's original study was longitudinal, so the dataset he provided included responses from the participants at up to three different times of measurement, referred to as waves.  Many authors have done excellent analyses on this time-series element.  However, as is often the case, Meyer's study suffered from attrition effects, and only 616 participants responded to all three rounds of data collection.  In the interest of preserving a larger sample, I chose to cross-sectionally investigate the wave 1 responses only, for an initial sample size of 1518.  Therefore, I dropped the columns pertaining to wave 2 or 3.

In [499]:
# Display the n per wave; note that 4 = all 3 waves
meyer['waveparticipated'].value_counts(dropna = False).sort_index()

waveparticipated
1    533
2    278
3     91
4    616
Name: count, dtype: int64

In [500]:
# Eliminate W2 and W3 variables from the list
cols = list(meyer.columns)
w1_cols = [c for c in cols if c[:2]!='w2']
w1_cols = [c for c in w1_cols if c[:2]!='w3']
len(w1_cols)

# Drop columns
meyer = meyer[w1_cols]
meyer.shape

(1518, 505)
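
The same prefix filter can be written in a single pass, because `str.startswith` accepts a tuple of prefixes.  The column names below are a short hypothetical sample, not the full list.  As in the cell above, only names that *begin* with `w2` or `w3` are removed, so columns such as `wave3` or those ending in `_w2` are kept.

```python
# Hypothetical column names standing in for the real ones
cols = ['studyid', 'w1age', 'w2age', 'w3age', 'wave3', 'gmethod_type_w2']

# Keep every column whose name does not start with 'w2' or 'w3'
w1_cols = [c for c in cols if not c.startswith(('w2', 'w3'))]
print(w1_cols)
```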

Additionally, as described on page 17 of *37166-Documentation-methodology.pdf*, several participants identified as heterosexual on the survey instrument, despite indicating that they were LGB on the screening questionnaire.  Although these people likely have a more complicated identity than either of those options can fully capture, I opted to err on the side of caution and exclude them from my analyses.

In [501]:
# Exclude anyone marking themselves as straight/heterosexual
meyer = meyer[meyer['w1sexualid']!=1]
meyer.shape

(1507, 505)

Next, I sought to eliminate the columns corresponding to items in scales, in favor of using the scale scores themselves, which were kindly already calculated and included as their own columns.

In [502]:
# Make lists of the items that comprise scales
soc_supp_items = ['w1q164a', 'w1q164b', 'w1q164c', 'w1q164d', 'w1q164e', 
  'w1q164f', 'w1q164g', 'w1q164h', 'w1q164i', 'w1q164j', 'w1q164k', 'w1q164l']

ace_items = ['w1q151', 'w1q152', 'w1q153', 'w1q154', 'w1q155', 
  'w1q156', 'w1q157', 'w1q158', 'w1q159', 'w1q160', 'w1q161']

childhd_gnc_items = ['w1q147', 'w1q148', 'w1q149', 'w1q150']

daily_discr_items = ['w1q144a', 'w1q144b', 'w1q144c','w1q144d', 
  'w1q144e', 'w1q144f', 'w1q144g', 'w1q144h', 'w1q144i']

int_homo_items = ['w1q128', 'w1q129', 'w1q130', 'w1q131', 'w1q132']

felt_stigma_items = ['w1q125', 'w1q126', 'w1q127']

drug_items = ['w1q90', 'w1q91', 'w1q92', 'w1q93', 'w1q94', 
  'w1q95', 'w1q96', 'w1q97', 'w1q98', 'w1q99', 'w1q100']

alc_items = ['w1q85', 'w1q86', 'w1q87']

hc_ster_threat_items = ['w1q60', 'w1q61', 'w1q62', 'w1q63']

comm_conn_items = ['w1q53', 'w1q54', 'w1q55', 'w1q56', 'w1q57', 'w1q58', 'w1q59']

lgbis_items = ['w1q40', 'w1q41', 'w1q42', 'w1q43', 'w1q44']

meim_items = ['w1q21', 'w1q22', 'w1q23', 'w1q24', 'w1q25', 'w1q26']

swl_items = ['w1q186', 'w1q187', 'w1q188', 'w1q189', 'w1q190']

swb_items = ['w1q04', 'w1q05', 'w1q06', 'w1q07', 'w1q08', 'w1q09', 'w1q10', 
  'w1q11', 'w1q12', 'w1q13', 'w1q14', 'w1q15', 'w1q16', 'w1q17', 'w1q18']

# This one is the y variable, so I'm not dropping them just yet
ment_dis_items = ['w1q77a', 'w1q77b', 'w1q77c', 'w1q77d', 'w1q77e', 'w1q77f']

# I will drop the others though
drop_cols = (soc_supp_items + ace_items + childhd_gnc_items + daily_discr_items + 
  int_homo_items + felt_stigma_items + drug_items + alc_items + comm_conn_items + 
  hc_ster_threat_items + lgbis_items + meim_items + swl_items + swb_items)
len(drop_cols)

100

In [503]:
# Drop the scale component columns
meyer.drop(columns = drop_cols, inplace = True)
meyer.shape

(1507, 405)
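
With a hand-typed list of 100 column names, a quick check for duplicates and misspellings before dropping can save some debugging.  The helper below, `check_drop_list`, is a hypothetical name introduced here for illustration, and the column lists are a toy sample.

```python
def check_drop_list(drop_cols, frame_cols):
    """Return names listed more than once and names absent from the dataframe."""
    duplicates = sorted({c for c in drop_cols if drop_cols.count(c) > 1})
    missing = sorted(set(drop_cols) - set(frame_cols))
    return duplicates, missing

# Toy example: 'w1q86' is listed twice and 'w1q99' is not in the frame
frame_cols = ['studyid', 'w1q85', 'w1q86', 'w1q87']
duplicates, missing = check_drop_list(['w1q85', 'w1q86', 'w1q86', 'w1q99'], frame_cols)
print(duplicates, missing)
```

On the real data, this would be called as `check_drop_list(drop_cols, list(meyer.columns))`, with both returned lists expected to be empty.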

In [504]:
# These are the scales - more on them later
combo_features = ['w1socialwb', 'w1socialwb_i', 'w1lifesat', 'w1lifesat_i', 'w1meim', 'w1meim_i', 
  'w1idcentral', 'w1idcentral_i', 'w1connectedness', 'w1connectedness_i', 'w1hcthreat', 
  'w1hcthreat_i', 'w1kessler6', 'w1kessler6_i', 'w1auditc', 'w1auditc_i', 'w1dudit', 'w1dudit_i', 
  'w1feltstigma', 'w1feltstigma_i', 'w1internalized', 'w1internalized_i', 'w1everyday', 
  'w1everyday_i', 'w1childgnc', 'w1childgnc_i', 'w1ace', 'w1ace_i', 'w1ace_emo', 'w1ace_emo_i', 
  'w1ace_inc', 'w1ace_inc_i', 'w1ace_ipv', 'w1ace_ipv_i', 'w1ace_men', 'w1ace_men_i', 'w1ace_phy', 
  'w1ace_phy_i', 'w1ace_sep', 'w1ace_sep_i', 'w1ace_sex', 'w1ace_sex_i', 'w1ace_sub', 'w1ace_sub_i', 
  'w1socsupport', 'w1socsupport_fam', 'w1socsupport_fam_i', 'w1socsupport_fr', 'w1socsupport_fr_i', 
  'w1socsupport_i', 'w1socsupport_so', 'w1socsupport_so_i']

Keen-eyed readers may have noticed a pattern in the list above: each variable name appears twice, with `'_i'` added to the end of the second occurrence.  This is because Meyer and his team provided two versions of the computed column for each scale - one for the original values, and one in which missing values have been imputed.  I discuss imputation - these, and my own - in the next section.
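
The pairing can also be confirmed programmatically.  The sketch below runs on a short excerpt of `combo_features` and lists any original column that lacks an `_i` partner; the variable names `scales`, `originals`, and `unpaired` are introduced here for illustration.

```python
# Short excerpt of combo_features
scales = ['w1lifesat', 'w1lifesat_i', 'w1meim', 'w1meim_i', 'w1ace', 'w1ace_i']

# Original (non-imputed) columns are the ones without the '_i' suffix
originals = [c for c in scales if not c.endswith('_i')]

# Any original column without a matching imputed version would appear here
unpaired = [c for c in originals if c + '_i' not in scales]
print(unpaired)
```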

Finally, before investigating any missing values, I dropped any columns that were not pertinent to my problem statement.

In [505]:
# Identify excess columns
drop_cols_2 = ['w1q165', 'w1q27', 'w1q28', 'w1q20_1', 'w1q20_2', 'w1q20_3', 
  'w1q20_4', 'w1q20_5', 'w1q20_6', 'w1q20_7', 'w1q29', 'w1q29_t_verb', 'gruca', 
  'gruca_i', 'gurban', 'gzipstate',  'gzipcode', 'w1hinc', 'w1poverty', 
  'w1povertycat', 'w1q133', 'w1q133_1', 'w1q133_2', 'w1q133_3', 
  'gmsaname', 'gmethod_type', 'gmethod_type_w2', 'gmethod_type_w3', 
  'w1cumulative_wt_nr1', 'w1cumulative_wt_nr2', 'w1cumulative_wt_nr3', 
  'w1cumulative_wt_sampling', 'w1weighting_cell_nr1', 'w1weight_orig', 
  'w1weighting_cell_nr2and3', 'w1frame_wt', 'w1q31a', 'w1q31b', 'w1q31c', 
  'w1q31d', 'w1q66_1', 'w1q66_2', 'w1q66_3', 'w1q66_4', 'w1q66_5', 
  'w1q66_t_verb', 'w1q67', 'w1q68_1', 'w1q68_2', 'w1q68_3', 'w1q70', 'w1q71', 
  'w1q73', 'w1q74_1', 'w1q74_2', 'w1q74_3', 'w1q74_4', 'w1q74_7', 'w1q74_8', 
  'w1q74_9', 'w1q74_12', 'w1q74_13', 'w1q74_15', 'w1q74_16', 'w1q74_19', 
  'w1q80', 'w1q81', 'w1q82', 'w1q83', 'w1q84', 'w1q88', 'w1q118', "gp1",
  "w1q20_t_verb", "w1q39_7", "w1q39_12", "w1q39_t_verb", "w1q39_6", 
  "w1q39_5", "w1q39_2", "w1q39_3", "w1q39_4", "w1q39_1", "w1q39_9", "w1q39_8", 
  "w1q39_10", "w1q39_11", "gemployment2010", "gmilesaway"]
meyer.drop(columns = drop_cols_2, inplace = True)
meyer.shape

(1507, 316)

### Missing Values and Imputation

Because this dataset only has about 1500 rows, I felt it important to preserve as many of them as possible.  Out of this necessity, I opted to allow a larger proportion of imputed data than I would be inclined to allow if I had rows to spare.  The implications of this decision will be discussed at further length in the *Limitations* section, but suffice it to say that it will require me to interpret my models cautiously and conservatively.

Fortunately, however, Meyer and his team used thorough, theoretically sound methods for their imputations, which gives me confidence in the quality of the imputed data.  Where possible, they filled missing values with logical inferences based on that participant's other responses.  Where that was not possible, they predicted the missing value based on a regression of the participant's other responses, then imputed a randomly selected value from the 5 extant values nearest to that prediction.  See page 30 of *37166-Documentation-methodology.pdf* for more details.
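
To make the second procedure more concrete, here is a rough sketch of that style of imputation in `numpy`.  It is not the original team's code: the function name `impute_nearest_observed`, the ordinary least squares regression, and the simulated data are all assumptions made for illustration, and the methodology PDF remains the authoritative description.

```python
import numpy as np

def impute_nearest_observed(X, y, k=5, seed=0):
    """Fill NaNs in y with a random draw from the k observed values nearest the regression prediction."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    # Add an intercept column to the predictors
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    observed = ~np.isnan(y)
    # Fit the regression on rows where y was observed
    coef, *_ = np.linalg.lstsq(X[observed], y[observed], rcond=None)
    observed_values = y[observed]
    y_filled = y.copy()
    for i in np.flatnonzero(~observed):
        prediction = X[i] @ coef
        # Indices of the k observed values closest to the prediction
        nearest = np.argsort(np.abs(observed_values - prediction))[:k]
        y_filled[i] = rng.choice(observed_values[nearest])
    return y_filled

# Simulated data (assumed for illustration): two predictors, three missing outcomes
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=50)
y_demo[[3, 17, 29]] = np.nan
y_imputed = impute_nearest_observed(X_demo, y_demo)
```

Because every imputed value is drawn from values that were actually observed, this approach never produces an out-of-range or impossible response, which is part of its appeal for survey scales.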

# **MAYBE TAKE THIS OUT???  I CAN'T TELL IF IT'S CONVINCING**  vv

For a quick-and-dirty test of the face validity of these imputations, I compared the value counts of `w1ace_ipv` (before imputation) and `w1ace_ipv_i` (after imputation).  This variable corresponds to questions from the Adverse Childhood Experiences (ACE) scale regarding intimate partner violence (IPV) in participants' homes when they were children.  IPV is notoriously underreported, and not surprisingly, this variable had one of the highest rates of missing data.

In [506]:
print(meyer['w1ace_ipv'].value_counts(dropna = False))
print(meyer['w1ace_ipv_i'].value_counts(dropna = False))

w1ace_ipv
0.0    945
1.0    426
NaN    136
Name: count, dtype: int64
w1ace_ipv_i
0    1018
1     489
Name: count, dtype: int64
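
A cross-tabulation shows the same before-and-after comparison in a single table, including how the originally missing values were split between 0 and 1.  The sketch below uses two short made-up series rather than the real columns.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for an original column and its imputed counterpart
original = pd.Series([0, 1, np.nan, np.nan, 1, 0, np.nan], name='original')
imputed = pd.Series([0, 1, 1, 0, 1, 0, 1], name='imputed')

# Label the original values as strings so that missing entries get their own row
original_label = original.map({0.0: '0', 1.0: '1'}).fillna('missing')
table = pd.crosstab(original_label, imputed)
print(table)
```

The `missing` row of the table gives the number of imputed 0s and 1s directly, with no need to subtract one set of value counts from the other.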


In [507]:
1010-940, 484-424

(70, 60)

In [508]:
len(meyer[(meyer['w1ace_ipv']==1) & (meyer['w1ace_sex'].isna())])

19

In [509]:
len(meyer[(meyer['w1ace_ipv'].isna()) & (meyer['w1ace_sex']==1)])

50

In [510]:
meyer[(meyer['w1ace_ipv'].isna()) & (meyer['w1ace_sex']==1)]['w1ace_ipv_i'].value_counts(dropna = False)

w1ace_ipv_i
1    27
0    23
Name: count, dtype: int64

In [511]:
meyer[(meyer['w1ace_ipv'].isna()) & (meyer['w1ace_ipv_i']==0)]['w1ace_sex'].value_counts(dropna = False, normalize = True)

w1ace_sex
0.0    0.438356
1.0    0.315068
NaN    0.246575
Name: proportion, dtype: float64

In [512]:
meyer[(meyer['w1ace_ipv'].isna()) & (meyer['w1ace_ipv_i']==1)]['w1ace_sex'].value_counts(dropna = False, normalize = True)

w1ace_sex
1.0    0.428571
0.0    0.380952
NaN    0.190476
Name: proportion, dtype: float64

In [513]:
meyer['w1ace_ipv'].value_counts(dropna = True, normalize = True)

w1ace_ipv
0.0    0.689278
1.0    0.310722
Name: proportion, dtype: float64

In [514]:
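# Sanity check: blanks were parsed as NaN on import (na_values = ' ') and both columns are numeric,
# so comparing against the strings ' ' and '1' matches nothing
len(meyer[(meyer['w1ace_ipv']==' ') & (meyer['w1ace_sex']=='1')])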

0

Of the 130 missing values in the original column, 

# ^^ **MAYBE TAKE THIS OUT???  I CAN'T TELL IF IT'S CONVINCING**

Heartened by this level of care - and lacking any better ideas for imputing those values - I was inclined to retain the imputed versions and drop the original columns.  Before doing so, however, I wanted to assess how many values had originally been missing.  Features for which a high proportion of values had to be imputed may not be suitable for modeling, and observations for which a high proportion of the Kessler-6 items had to be imputed may also be inappropriate for use.  

For all of the reasons outlined above, I opted not to immediately drop any features based solely on their original missing proportions, but simply to monitor their performance in the model(s) with an extra watchful eye.  I also decided to keep observations for which only one value out of the 6 items on the Kessler scale had to be imputed, but I did opt to drop any observations with two or more originally missing values among the Kessler items.

In [515]:
# Create a subsetted dataframe to identify any rows to drop
meyer_kessler = meyer[ment_dis_items + ['w1kessler6', 'w1kessler6_i', 'studyid']]
meyer_kessler = meyer_kessler[meyer_kessler['w1kessler6'].isna()]
meyer_kessler

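| |w1q77a|w1q77b|w1q77c|w1q77d|w1q77e|w1q77f|w1kessler6|w1kessler6_i|studyid|
|--|--|--|--|--|--|--|--|--|--|
|39|NaN|NaN|NaN|NaN|NaN|NaN|NaN|1|151876818|
|50|1.0|NaN|NaN|NaN|NaN|NaN|NaN|9|152304751|
|96|5.0|5.0|NaN|4.0|3.0|4.0|NaN|4|152640441|
|155|5.0|NaN|5.0|4.0|5.0|5.0|NaN|3|153039829|
|214|NaN|NaN|NaN|NaN|NaN|NaN|NaN|0|153390973|
|227|NaN|NaN|NaN|NaN|NaN|NaN|NaN|9|153535970|
|415|NaN|NaN|NaN|NaN|NaN|NaN|NaN|8|154944576|
|447|1.0|NaN|3.0|3.0|4.0|2.0|NaN|16|155106769|
|457|NaN|NaN|NaN|NaN|NaN|NaN|NaN|12|155163167|
|577|NaN|NaN|NaN|NaN|NaN|NaN|NaN|2|155814388|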


In [516]:
# Extract the index numbers of rows with 2+ NAs

# Get the unique identifier
drop_rows = []
for i in meyer_kessler.index:
  if meyer_kessler.loc[i, ment_dis_items].isna().sum() > 1:
    drop_rows.append(meyer_kessler.loc[i, 'studyid'])

# Get the matching index number from the full dataframe
drop_index = []
for i in drop_rows:
  drop_index.append(meyer[meyer['studyid']==i].index[0])

# Drop those rows
meyer.drop(index = drop_index, inplace = True)
meyer.shape

(1494, 316)
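
The two loops above can also be expressed as a single row-wise count of missing items.  The sketch below runs on a toy dataframe with three hypothetical items (`item_a` to `item_c`) standing in for the six Kessler items, and keeps only the rows with fewer than two missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the six Kessler items
items = ['item_a', 'item_b', 'item_c']
toy = pd.DataFrame({
    'item_a': [1.0, np.nan, 3.0, np.nan],
    'item_b': [2.0, np.nan, np.nan, 4.0],
    'item_c': [0.0, 1.0, 2.0, np.nan],
})

# Count missing items per row, then keep rows with at most one missing item
n_missing = toy[items].isna().sum(axis=1)
kept = toy[n_missing < 2].reset_index(drop=True)
print(kept.shape)
```

On the real data, the same idea would use `ment_dis_items` in place of `items`, and it would not need the intermediate lookup by `studyid`.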

In [517]:
# Reset the index (thanks to pandas documentation for this syntax)
meyer.reset_index(drop = True, inplace = True)
meyer.head(3)

| |studyid|waveparticipated|w1weight_full|w1survey_yr|cohort|geduc1|geduc2|geducation|gurban_i|gcendiv|...|w1socsupport_i|w1socsupport_so|w1socsupport_so_i|grespondent_date_w2|gsurvey|gp2|grace|grespondent_date_w3|wave3|nopolicecontact|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|0|151339842|4|0.309003|2016|3|3|2|5|1|1|...|4.916667|4.75|4.75|20958.0|2.0|2.0|1.0|21306.0|1.0|1.0|
|1|151351600|2|0.931862|2016|2|2|2|4|1|7|...|5.916667|6.0|6.0|21070.0|NaN|NaN|NaN|NaN|NaN|NaN|
|2|151396232|1|0.656089|2016|2|3|2|5|1|4|...|7.0|7.0|7.0|NaN|NaN|NaN|NaN|NaN|NaN|NaN|


With those rows removed, I next checked for missing values in the non-imputed scale score columns.

In [518]:
meyer[combo_features].isna().sum().sort_values(ascending = False)

w1ace                 267
w1ace_ipv             130
w1ace_emo              83
w1ace_sex              69
w1dudit                56
w1ace_phy              54
w1socialwb             51
w1connectedness        39
w1socsupport           37
w1everyday             31
w1meim                 22
w1socsupport_fr        22
w1socsupport_fam       21
w1socsupport_so        20
w1lifesat              19
w1internalized         19
w1hcthreat             16
w1ace_inc              16
w1ace_sub              15
w1ace_men              14
w1idcentral            14
w1kessler6             13
w1ace_sep              10
w1feltstigma            8
w1auditc                6
w1ace_men_i             0
w1socsupport_fam_i      0
w1ace_sub_i             0
w1ace_sex_i             0
w1ace_sep_i             0
w1socsupport_fr_i       0
w1ace_phy_i             0
w1socsupport_i          0
w1lifesat_i             0
w1idcentral_i           0
w1ace_emo_i             0
w1ace_ipv_i             0
w1ace_inc_i             0
w1meim_i    

As mentioned above, I was impressed enough by the original authors' imputation methods not to drop any of these features over their missing values alone.  However, quite a few columns had troublingly high numbers of missing values.  Those with approximately 1% missing data or more are displayed below.

|Variable Name| Missing Values|Variable Name| Missing Values|Variable Name| Missing Values|
|--|--|--|--|--|--|
|w1ace        |267 |w1connectedness  | 39 |w1lifesat       | 19 |
|w1ace_ipv    |130 |w1socsupport     | 37 |w1internalized  | 19 |
|w1ace_emo    | 83 |w1everyday       | 31 |w1hcthreat      | 16 |
|w1ace_sex    | 69 |w1meim           | 22 |w1ace_inc       | 16 |
|w1dudit      | 56 |w1socsupport_fr  | 22 |w1ace_sub       | 15 |
|w1ace_phy    | 54 |w1socsupport_fam | 21 |w1ace_men       | 14 |
|w1socialwb   | 51 |w1socsupport_so  | 20 |w1idcentral     | 14 |

As can be seen above, many of my most theory-backed potential predictors have large numbers of missing values.  The implications of this will be discussed in the Limitations section, but in the interest of not sacrificing too much data, I went forward with my plan to drop these original columns and use the imputed versions.
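
A threshold like the 1% used for the table above can be applied directly to the share of missing values per column.  The sketch below uses a toy dataframe with made-up column names; on the real data, the same two lines would be applied to `meyer[combo_not_imputed]` or a similar subset.

```python
import numpy as np
import pandas as pd

# Toy dataframe with different amounts of missing data per column
toy = pd.DataFrame({
    'scale_a': [1.0, np.nan, 3.0, 4.0],
    'scale_b': [1.0, 2.0, 3.0, 4.0],
    'scale_c': [np.nan, np.nan, 3.0, 4.0],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Keep only the columns at or above the 1% threshold
flagged = missing_share[missing_share >= 0.01]
print(flagged)
```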

In [519]:
# Divide that combination list between imputed and not
combo_imputed = [c for c in combo_features if c[-2:]=='_i']
combo_not_imputed = [c for c in combo_features if c[-2:]!='_i']

# Drop the ones that aren't
meyer.drop(columns = combo_not_imputed, inplace = True)
meyer.shape

(1494, 290)

In [520]:
# See how many missing values remain
# Only the last expression in a cell is displayed, so sort by column name here
meyer.isna().sum().sort_index()

cohort                    0
gcendiv                   0
gcenreg                   0
geduc1                    0
geduc2                    0
geducation                0
gmilesaway2              18
gp2                    1289
grace                   794
grespondent_date_w2     614
grespondent_date_w3     794
gsurvey                 794
gurban_i                  0
nopolicecontact        1156
screen_race               0
studyid                   0
w1ace_emo_i               0
w1ace_i                   0
w1ace_inc_i               0
w1ace_ipv_i               0
w1ace_men_i               0
w1ace_phy_i               0
w1ace_sep_i               0
w1ace_sex_i               0
w1ace_sub_i               0
w1age                     0
w1auditc_i                0
w1childgnc_i              0
w1connectedness_i         0
w1conversion              0
w1conversionhc            0
w1conversionrel           0
w1dudit_i                 0
w1everyday_i              0
w1feltstigma_i            0
w1gender            

As seen in the cell above, the columns with the most missing values tend to be those with a `w1qx_x` naming convention.  These are questions with a matrix-style response prompt, from which participants selected only a few - usually one - answer(s).  For example:

In [521]:
meyer[[
    'w1q136_1', 'w1q136_2', 'w1q136_3', 'w1q136_4', 'w1q136_5',
    'w1q136_6', 'w1q136_7', 'w1q136_8', 'w1q136_9', 'w1q136_10']].head()

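| |w1q136_1|w1q136_2|w1q136_3|w1q136_4|w1q136_5|w1q136_6|w1q136_7|w1q136_8|w1q136_9|w1q136_10|
|--|--|--|--|--|--|--|--|--|--|--|
|0|NaN|NaN|NaN|4.0|NaN|NaN|7.0|NaN|NaN|NaN|
|1|NaN|NaN|NaN|NaN|NaN|NaN|7.0|8.0|NaN|NaN|
|2|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|
|3|NaN|NaN|NaN|NaN|NaN|NaN|7.0|NaN|NaN|NaN|
|4|NaN|2.0|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|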


Each of these columns had up to 3 values - NaN, 7 or 97, and the number corresponding to the position of the response in the matrix (e.g., row 4 above has a 2 in the `w1q136_2` column).  NaN, 97 (float or int), and sometimes 7, all referred to missing data; NaNs appeared where participants declined to answer a question that was asked of them, and 7 or 97 appeared when participants were not asked the question at all, usually due to their response to a previous question (e.g., answering "no" to "Do you currently have health insurance?", and then not being asked "What kind of health insurance do you have?").  For all such columns, I converted the NaNs and 97s to 0s, and the target values to 1s.  The 7s required more careful attention, because they were legitimate values in some columns and essentially NAs in others.

In [522]:
# Make a list of all such columns
diy_ohe = ["w1q145_3", "w1q163_3", "w1q136_3", "w1q145_9", "w1q143_3", "w1q136_10", "w1q145_10", 
  "w1q136_9", "w1q163_10", "w1q163_9", "w1q143_9", "w1q143_4", "w1q136_6", "w1q143_10", "w1q143_5", 
  "w1q145_4", "w1q136_5", "w1q143_8", "w1q163_5", "w1q136_4", "w1q163_6", "w1q145_6", "w1q136_1", 
  "w1q143_7", "w1q163_1", "w1q143_2", "w1q143_1", "w1q145_5", "w1q143_6", "w1q163_2", "w1q163_4", 
  "w1q136_8", "w1q145_8", "w1q145_1", "w1q145_7", "w1q163_7", "w1q136_2", "w1q145_2", "w1q139_3", 
  "w1q136_7", "w1q139_9", "w1q139_10", "w1q139_5", "w1q139_4", "w1q139_8", "w1q139_6", "w1q139_2", 
  "w1q139_1", "w1q139_7", "w1q163_8", "w1q141_3", "w1q141_9", "w1q141_10", "w1q141_8", "w1q141_4", 
  "w1q141_1", "w1q141_5", "w1q141_2", "w1q141_7", "w1q141_6", "w1q64_12", "w1q64_5", "w1q74_5", 
  "w1q74_6", "w1q74_20", "w1q64_11", "w1q64_10", "w1q74_18", "w1q74_14", "w1q74_17", "w1q171_8", 
  "w1q30_3", "w1q64_13", "w1q64_7", "w1q30_4", "w1q171_6", "w1q171_4", "w1q64_8", "w1q74_10", 
  "w1q74_21", "w1q64_6", "w1q74_11", "w1q171_5", "w1q64_3", "w1q64_1", "w1q171_9", "w1q30_5", 
  "w1q171_3", "w1q74_22", "w1q64_9", "w1q171_2", "w1q171_7", "w1q74_23", "w1q64_4", "w1q64_2", 
  "w1q30_1", "w1q171_1", "w1q30_2"]
print(len(diy_ohe))

98


In [523]:
# Make distinct column names to put them into - don't overwrite them until I'm done
diy_ohe_new = [''.join([x, '_ei']) for x in diy_ohe]
print(len(diy_ohe_new))

98


In [524]:
# Convert the int 97s to 0s
meyer.loc[:, diy_ohe_new] = np.where(meyer.loc[:, diy_ohe]==97, 0, meyer.loc[:, diy_ohe])

In [525]:
# Identify columns with legitimate 7s
nat_7s = [
  'w1q143_7', 'w1q145_7', 'w1q163_7', 'w1q136_7', 'w1q139_7', 'w1q141_7', 'w1q64_7', 'w1q171_7']

In [526]:
# Get the new names for these columns (same as in diy_ohe_new)
nat_7s_ei = [''.join([x, '_ei']) for x in nat_7s]

# Turn those 7s into 1s
meyer.loc[:, nat_7s_ei] = np.where(meyer.loc[:, nat_7s]==7, 1, meyer.loc[:, nat_7s_ei])

In [527]:
# Replace the remaining 7s (the "not asked" code in the other columns) with 0s
meyer.loc[:, diy_ohe_new] = np.where(meyer.loc[:, diy_ohe_new]==7, 0, meyer.loc[:, diy_ohe_new])

In [528]:
# Fill the NAs with 0s
meyer[diy_ohe_new] = meyer[diy_ohe_new].fillna(0)

In [529]:
# Check that all the values are correct
value_list_x = []
check_7s = []
for i in diy_ohe_new:
  a = list(meyer[i].unique())
  if 7 in a:
    check_7s.append(i)
  value_list_x += a
value_list_x = list(set(value_list_x))

print(sorted(value_list_x)) # should be 0-23, no 7s, no NaNs
print(check_7s) # should be blank

[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 9.0, 10.0, 11.0, 13.0, 14.0, 17.0, 18.0, 20.0, 21.0, 22.0, 23.0]
[]


In [530]:
# reduce to 0s and 1s
meyer.loc[:, diy_ohe_new] = np.where(meyer.loc[:, diy_ohe_new]!=0, 1, meyer.loc[:, diy_ohe_new])

In [531]:
# Check that all the values are correct
value_list_x = []
check_7s = []
for i in diy_ohe_new:
  a = list(meyer[i].unique())
  if 7 in a:
    check_7s.append(i)
  value_list_x += a
value_list_x = list(set(value_list_x))

print(sorted(value_list_x)) # should be 0-1
print(check_7s) # should be blank

[0.0, 1.0]
[]


In [532]:
# Are all the NAs in those columns gone?
meyer[diy_ohe_new].isna().sum().sum()

0

In [533]:
# Drop the old versions
meyer.drop(columns = diy_ohe, inplace = True)

In [534]:
# Are we back to the pre-add number of columns?
meyer.shape

(1494, 290)
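
The recoding steps above can be condensed into one helper, sketched below on a toy dataframe.  The function name `recode_checkbox` and the toy column names are hypothetical; the logic is meant to mirror the cells above (NaN and 97 become 0, 7 becomes 0 unless the column is one where 7 is a legitimate answer, and anything else becomes 1), but it has not been run against the real dataset.

```python
import numpy as np
import pandas as pd

def recode_checkbox(df, cols, legit_7_cols):
    """Return 0/1 indicator columns (suffix '_ei') for checkbox-style items."""
    out = pd.DataFrame(index=df.index)
    for c in cols:
        # NaN (declined to answer) and 97 (not asked) both count as "not selected"
        skipped = df[c].isna() | (df[c] == 97)
        if c not in legit_7_cols:
            # In these columns, 7 is also a "not asked" code
            skipped = skipped | (df[c] == 7)
        out[c + '_ei'] = (~skipped).astype(int)
    return out

# Toy data: 'q1_7' is a column where 7 is a real answer, 'q1_2' is not
toy = pd.DataFrame({
    'q1_2': [np.nan, 2, 7, 97],
    'q1_7': [7, np.nan, 97, 7],
})
recoded = recode_checkbox(toy, ['q1_2', 'q1_7'], legit_7_cols=['q1_7'])
print(recoded)
```

On the real data, the call would look like `recode_checkbox(meyer, diy_ohe, nat_7s)`, with the result joined back onto `meyer` before dropping the original columns.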

### Feature Engineering

- creating new columns
- polynomial terms
- standardization

### Feature Selection, Part 2

## Notebook Summary

In this notebook, I have __________.  

In the following notebook, I will ___________.

Any readers who are following along or attempting to reproduce my work should run the cell below to save a copy of the dataframe as it exists now.  A cell is provided at the top of the next notebook in which to import that copy.

In [535]:
# Save a copy of the dataframe to use in the next notebook
meyer.to_csv('../02_data/df_after_data_preparation.csv', index = False)