Final Project - GSS

1. What is in your data?

Our data comes from the General Social Survey (GSS) and covers respondents' demographic, economic, social, and health-related characteristics, along with job satisfaction, political views, family dynamics, and more. It includes both categorical variables (e.g., marital status, race, political views) and numerical variables (e.g., income, years of education). The specific variables are listed below:

1. Demographic Variables:
	•	marital: Marital status of the respondent (e.g., married, divorced, never married).
	•	divorce: Whether the respondent has experienced a divorce.
	•	childsinhh: Number of children living in the household.
	•	childs: Number of children the respondent has.
	•	hompop_exp: Total number of people in the household.
	•	rspgndr: How responsibility for home and childcare is divided by gender (e.g., women and men take equal responsibility).
	•	educ: Years of education completed by the respondent.
	•	degree: Type of degree held (e.g., high school, college).
	•	paeduc: Father's education level.
	•	maeduc: Mother's education level.
	•	wrkstat: Respondent's current work status (e.g., employed, unemployed, retired).
	•	race, raceacs1, raceacs2, raceacs3, raceacs16: Respondent's race or ethnicity and detailed race categories.
	•	sex: Gender of the respondent.
	•	sexornt: Sexual orientation of the respondent.
2. Economic Variables:
	•	sei: Socioeconomic index (measure of occupational status).
	•	realrinc, realinc: Respondent's real income and real household income, adjusted for inflation.
	•	rincome: Respondent's income from their occupation, recorded in dollar brackets (e.g., $10,000 to $14,999).
	•	class: Social class of the respondent.
3. Family and Household:
	•	babies, preteen, teens, adults: Number of individuals in various age groups within the household.
	•	partners: Number of sexual partners the respondent had in the past year.
4. Health-Related Variables:
	•	health: Respondent's self-reported health status.
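	•	stress: How often the respondent reports stress (e.g., never, sometimes, always).
	•	neisafe: Safety perceived in the neighborhood.
	•	physact: How often the respondent is physically active (e.g., daily, several times a week).
	•	smokeday: Respondent's smoking habits (e.g., do not smoke and never did).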
5. Social and Political Views:
	•	polviews: Political views of the respondent (e.g., liberal, conservative).
	•	relig: Religious affiliation.
	•	pray: Frequency of prayer.
	•	attend: Frequency of religious service attendance.
	•	postlife: Beliefs regarding life after death.
	•	fear: Whether the respondent reports being afraid in their local area (yes/no).
	•	gunlaw: Opinion on gun laws.
	•	trust: Level of trust in government, people, etc.
6. Job and Work-Related Variables:
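	•	joblose: How likely the respondent thinks they are to lose their job (e.g., not likely, very likely).
	•	jobfind: How easy the respondent thinks it would be to find a job (e.g., not easy, very easy).
	•	spwrksta: Work status of the respondent's spouse.
	•	cowrksta: Work status of the respondent's cohabiting partner.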
	•	satjob: Job satisfaction (e.g., very satisfied, somewhat satisfied).
7. Well-being and Happiness:
	•	happy: Respondent's overall happiness.
	•	hapmar: Happiness in marriage.
	•	hapcohab: Happiness in cohabitation.
	•	satfin: Satisfaction with financial situation.
8. Safety and Security:
	•	vaxsafe: Perception of vaccine safety.
	•	covid12: Opinion or impact of COVID-19.
	•	evidu: Whether the respondent has ever injected drugs (yes/no).
	•	helpful: Perception of helpfulness in society or community.
	•	arrest: Whether the respondent has been arrested.
9. Other Variables:
	•	instype01: Type of health insurance.
	•	condom: Use of contraception, specifically condoms.


2. How will these data be useful for studying the phenomenon you're interested in?

These GSS variables let our group study how social structures, personal well-being, cultural beliefs, and economic conditions interact, particularly in the context of current political and social dynamics, which is the phenomenon we are most interested in. By examining relationships among demographics, economic factors, and emotional well-being, we aim to uncover how issues like job insecurity, income inequality, and family life relate to overall happiness and stress levels. By examining political and religious views, we also want to explore how these factors shape perceptions of trust, safety, and societal issues, and whether the political polarization and social divisions currently evident in our country are reflected in the data. More broadly, we want to see how the data reflects social science concerns about inequality, societal well-being, and the evolving nature of family and work life. Identifying these trends can inform policymakers and social scientists: understanding how different factors influence well-being and societal outcomes allows for more targeted interventions that promote a more equitable and less polarized society.

3. What are the challenges you've resolved or expect to face in using them?
Although the dataset appears relatively clean, data completeness is a concern because many variables have missing or incomplete values. For example, vaxsafe has only 1,232 non-missing responses, while educ has 72,127. Handling these missing values (NaNs), especially when they are spread across different columns, requires deciding whether to impute values or drop rows entirely. Data standardization may also be necessary, since variables like political views, religious beliefs, and income may be recorded in different formats or categories over time. Another challenge we expect is multicollinearity: some variables may be highly correlated with each other, which would make their separate effects hard to distinguish and could lead us to report the same pattern more than once. We can reduce this by dropping variables that measure the same thing. The last challenge concerns interpretation. Some relationships between variables may be non-linear and hard to read from a plot, so we need to be careful in how we read graphs and interpret trends in order to understand how different factors interact.
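
The two checks described above, the share of missing values per column and pairs of highly correlated numeric variables, could be sketched as follows. This is only an illustration under assumptions: the `toy` frame, the helper names `missing_share` and `correlated_pairs`, and the 0.8 threshold are hypothetical and are not part of our notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the GSS extract (values are made up).
toy = pd.DataFrame({
    'educ':    [16, 10, 12, 17, 12, np.nan],
    'paeduc':  [10, 8, 8, 16, 8, 12],
    'realinc': [30000, 12000, np.nan, 52000, 18000, np.nan],
    'happy':   ['very happy', None, 'pretty happy', 'very happy', None, None],
})

def missing_share(df):
    """Fraction of missing values in each column, largest first."""
    return df.isna().mean().sort_values(ascending=False)

def correlated_pairs(df, threshold=0.8):
    """Pairs of numeric columns whose absolute Pearson correlation exceeds threshold."""
    corr = df.select_dtypes('number').corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > threshold:
                pairs.append((a, b, float(corr.loc[a, b])))
    return pairs

shares = missing_share(toy)
pairs = correlated_pairs(toy, threshold=0.8)
print(shares)
print(pairs)
```

Columns near the top of `shares` are candidates for dropping or imputing, and each pair returned by `correlated_pairs` is a candidate for keeping only one of the two variables. The threshold is a judgment call, not a rule.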

In [262]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [263]:
var_list = ['marital', 'divorce', 'childsinhh', 'childs', 'hompop_exp','rspgndr', 'rprnt18',
'educ', 'degree', 'paeduc', 'maeduc', 'wrkstat','hrs1', 'sei', 'realrinc', 'realinc', 'rincome',
'babies','preteen','teens','adults','race', 'raceacs1', 'raceacs2', 'raceacs3',
'raceacs16','racecen1', 'sex', 'sexornt','vaxsafe','covid12','happy','hapmar','hapcohab', 'satfin',
'health', 'stress', 'neisafe','physact', 'instype01', 'partners', 'condom', 'evidu',
 'smokeday', 'arrest', 'trust', 'helpful','joblose', 'jobfind', 'spwrksta', 'cowrksta', 'class',
'polviews', 'relig', 'pray', 'attend', 'postlife', 'fear','gunlaw', 'satjob'] # List of variables you want to save

output_filename = 'selected_gss_data.csv' # Name of the file you want to save the data to

for k in range(3): # for each chunk of the data
    url = 'https://github.com/DS3001/project_gss/raw/main/gss_chunk_' + str(1+k) + '.parquet' # Create url to the chunk to be processed
    print(url) # Check the url is correct
    df = pd.read_parquet(url) # Download this chunk of data
    print(df.head()) # Visually inspect the first few rows
    first_chunk = (k == 0) # The first chunk creates the file; later chunks are appended to it
    df.loc[:, var_list].to_csv(output_filename, # specifies target file to save the chunk to
                               mode='w' if first_chunk else 'a', # write on the first chunk, append afterwards
                               header=first_chunk, # write the variable names only once
                               index=False) # no row index saved


https://github.com/DS3001/project_gss/raw/main/gss_chunk_1.parquet
   year  id            wrkstat  hrs1  hrs2 evwork    occ  prestige  \
0  1972   1  working full time   NaN   NaN    NaN  205.0      50.0   
1  1972   2            retired   NaN   NaN    yes  441.0      45.0   
2  1972   3  working part time   NaN   NaN    NaN  270.0      44.0   
3  1972   4  working full time   NaN   NaN    NaN    1.0      57.0   
4  1972   5      keeping house   NaN   NaN    yes  385.0      40.0   

         wrkslf wrkgovt  ...  agehef12 agehef13 agehef14  hompoph wtssps_nea  \
0  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
1  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
2  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
3  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
4  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   

   wtssnrps_nea  wtssps_next wt
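
One way to confirm that the write-then-append pattern keeps every row and writes the header only once is to run it on small local frames. The sketch below assumes hypothetical stand-in chunks and a temporary file instead of the real parquet downloads; the names `chunks`, `keep`, and `out_path` are illustrative only.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-ins for the three downloaded chunks (values are made up).
chunks = [
    pd.DataFrame({'year': [1972, 1972], 'marital': ['married', 'divorced'], 'extra': [1, 2]}),
    pd.DataFrame({'year': [1990], 'marital': ['widowed'], 'extra': [3]}),
    pd.DataFrame({'year': [2022, 2022], 'marital': ['married', None], 'extra': [4, 5]}),
]
keep = ['year', 'marital']  # plays the role of var_list

out_path = os.path.join(tempfile.mkdtemp(), 'selected_toy.csv')
for k, chunk in enumerate(chunks):
    first = (k == 0)
    # Create the file with a header on the first chunk, then append without one.
    chunk.loc[:, keep].to_csv(out_path, mode='w' if first else 'a', header=first, index=False)

combined = pd.read_csv(out_path)
expected_rows = sum(len(c) for c in chunks)
print(combined.shape, expected_rows)
```

If the header were written on every chunk, the column names would show up as extra data rows and the row count would no longer match `expected_rows`.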

In [264]:
gss = pd.read_csv('selected_gss_data.csv')



In [265]:
gss.head()

         marital  divorce  childsinhh  childs  hompop_exp  rspgndr  rprnt18  educ  \
0  never married      NaN         NaN     0.0         NaN      NaN      NaN  16.0
1        married       no         NaN     5.0         NaN      NaN      NaN  10.0
2        married       no         NaN     4.0         NaN      NaN      NaN  12.0
3        married       no         NaN     0.0         NaN      NaN      NaN  17.0
4        married       no         NaN     2.0         NaN      NaN      NaN  12.0

                  degree  paeduc  ...  cowrksta          class  polviews       relig  \
0             bachelor's    10.0  ...       NaN   middle class       NaN      jewish
1  less than high school     8.0  ...       NaN   middle class       NaN    catholic
2            high school     8.0  ...       NaN  working class       NaN  protestant
3             bachelor's    16.0  ...       NaN   middle class       NaN       other
4            high school     8.0  ...       NaN  working class       NaN  protestant

   pray                      attend  postlife  fear  gunlaw                 satjob
0   NaN  about once or twice a year       NaN   NaN   favor  a little dissatisfied
1   NaN                  every week       NaN   NaN   favor                    NaN
2   NaN          about once a month       NaN   NaN   favor   moderately satisfied
3   NaN                       never       NaN   NaN   favor         very satisfied
4   NaN                       never       NaN   NaN   favor                    NaN


In [266]:
gss['degree'].unique()

array(["bachelor's", 'less than high school', 'high school', 'graduate',
       'associate/junior college', nan], dtype=object)

In [267]:
gss[['educ','paeduc','maeduc']].describe()

            educ     paeduc     maeduc
count    72127.0    51529.0    60605.0
mean   13.034633  10.905296  11.034024
std     3.182372    4.33044   3.763997
min          0.0        0.0        0.0
25%         12.0        8.0        8.0
50%         12.0       12.0       12.0
75%         16.0       13.0       12.0
max         20.0       20.0       20.0


In [268]:
gss[['realrinc','realinc','sei']].describe()

            realrinc       realinc        sei
count        42333.0       64912.0    31277.0
mean    23064.143938  32537.399981   48.42357
std     29175.569814  30883.226094  19.183154
min            218.0         218.0       17.1
25%           8308.0     12080.625       32.4
50%          16604.5       24139.0       39.0
75%          28156.5       40756.5       63.5
max    480144.472857      162607.0       97.2


In [269]:
print(gss['rincome'].unique())
print(gss['class'].unique())
gss = gss.drop(columns='rincome')
# we do not need another measure of occupation income - total income will be fine

[nan '$1,000 to $2,999' '$15,000 to $19,999' '$7,000 to $7,999'
 '$8,000 to $9,999' '$20,000 to $24,999' '$4,000 to $4,999'
 '$10,000 to $14,999' '$25,000 or more' '$3,000 to $3,999' 'under $1,000'
 '$5,000 to $5,999' '$6,000 to $6,999']
['middle class' 'working class' 'upper class' 'lower class' nan 'no class']


In [270]:
gss[['babies','preteen','teens','adults','childs','childsinhh']].describe()

         babies   preteen     teens    adults    childs  childsinhh
count   66210.0   66181.0   66269.0   68289.0   72129.0      1768.0
mean   0.223954  0.275094  0.211698   1.91369  1.916538    0.464932
std    0.564219  0.652489  0.553078  0.812289  1.759511    0.982537
min         0.0       0.0       0.0       1.0       0.0         0.0
25%         0.0       0.0       0.0       1.0       0.0         0.0
50%         0.0       0.0       0.0       2.0       2.0         0.0
75%         0.0       0.0       0.0       2.0       3.0         0.0
max         6.0       8.0       8.0       9.0       8.0         8.0


In [271]:
print(gss['hompop_exp'].describe())
# total # of people in household
gss = gss.drop(columns='hompop_exp')

print(gss['rspgndr'].unique())
# how the home/childcare labor is divided by gender

print(gss['rprnt18'].unique())
# are you the parent of another child 18 or older in your household?
gss = gss.drop(columns='rprnt18')

count    3538.000000
mean        1.678067
std         1.269275
min         0.000000
25%         1.000000
50%         1.000000
75%         2.000000
max        11.000000
Name: hompop_exp, dtype: float64
[nan 'women and men take equal responsibility'
 'men take much more responsibility'
 'men take somewhat more responsibility'
 'women take somewhat more responsibility'
 'women take much more responsibility']
[nan 'yes' 'no']


In [272]:
print(gss['racecen1'].unique())
print(gss['race'].unique())

[nan 'white' 'black or african american' 'hispanic' 'other asian' 'samoan'
 'filipino' 'american indian or alaska native' 'other pacific islander'
 'chinese' 'some other race' 'vietnamese' 'asian indian' 'japanese'
 'korean' 'native hawaiian' 'guamanian or chamorro']
['white' 'black' 'other' nan]


In [273]:
gss = gss.drop(columns=['raceacs1', 'raceacs2', 'raceacs3','raceacs16'])
# these race dummy variables are not needed and can be created by us if we want them

In [274]:
print(gss['sex'].unique())
print(gss['sexornt'].unique())

['female' 'male' nan]
[nan 'heterosexual or straight' 'gay, lesbian, or homosexual' 'bisexual']


In [275]:
print(gss['divorce'].unique())
print(gss['marital'].unique())
print(gss['hapmar'].unique())
print(gss['hapcohab'].unique())
print(gss['partners'].describe())

[nan 'no' 'yes']
['never married' 'married' 'divorced' 'widowed' 'separated' nan]
[nan 'very happy' 'pretty happy' 'not too happy']
[nan 'very happy' 'pretty happy' 'not too happy']
count         37672
unique           10
top       1 partner
freq          24059
Name: partners, dtype: object


In [276]:
print(gss['happy'].unique())
print(gss['satfin'].unique())
print(gss['health'].unique())
print(gss['stress'].unique())
print(gss['physact'].unique())

['not too happy' 'pretty happy' 'very happy' nan]
['not satisfied at all' 'more or less satisfied' 'pretty well satisfied'
 nan]
['good' 'fair' 'excellent' 'poor' nan]
[nan 'sometimes' 'often' 'hardly ever' 'never' 'always']
[nan 'once a month or less often' 'several times a month'
 'several times a week' 'daily' 'never']
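
The response scales printed above are ordered (for example, 'not too happy' < 'pretty happy' < 'very happy'), but pandas reads them as plain strings. A minimal sketch of encoding that order is shown below. The level lists are taken from the labels printed above, while the `toy` frame and the helper `to_ordered` are hypothetical and only for illustration.

```python
import pandas as pd

# Category order taken from the response labels printed above; the toy values are made up.
happy_levels = ['not too happy', 'pretty happy', 'very happy']
health_levels = ['poor', 'fair', 'good', 'excellent']

toy = pd.DataFrame({
    'happy': ['very happy', 'not too happy', None, 'pretty happy'],
    'health': ['good', 'excellent', 'poor', None],
})

def to_ordered(series, levels):
    """Convert text responses to an ordered categorical; labels not in levels become NaN."""
    return pd.Categorical(series, categories=levels, ordered=True)

toy['happy'] = to_ordered(toy['happy'], happy_levels)
toy['health'] = to_ordered(toy['health'], health_levels)

# Integer codes (0 = lowest level, -1 = missing) are convenient for rank-based summaries.
toy['happy_code'] = toy['happy'].cat.codes
print(toy)
print(toy['happy'].min())
```

With the order encoded, comparisons, sorting, and `min`/`max` follow the survey scale instead of alphabetical order. Missing responses get code -1 and should be filtered out before averaging codes.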


In [277]:
print(gss['wrkstat'].unique())
print(gss['joblose'].unique())
print(gss['jobfind'].unique())
print(gss['spwrksta'].unique())
print(gss['cowrksta'].unique())
print(gss['satjob'].unique())

['working full time' 'retired' 'working part time' 'keeping house'
 'in school' 'unemployed, laid off, looking for work'
 'with a job, but not at work because of temporary illness, vacation, strike'
 'other' nan]
[nan 'not likely' 'not too likely' 'very likely' 'fairly likely'
 'leaving labor force']
[nan 'not easy' 'very easy' 'somewhat easy']
[nan 'keeping house' 'working full time'
 'with a job, but not at work because of temporary illness, vacation, strike'
 'working part time' 'retired' 'unemployed, laid off, looking for work'
 'in school' 'other']
[nan 'working full time' 'keeping house' 'retired'
 'unemployed, laid off, looking for work' 'in school' 'working part time'
 'other'
 'with a job, but not at work because of temporary illness, vacation, strike']
['a little dissatisfied' nan 'moderately satisfied' 'very satisfied'
 'very dissatisfied']


In [278]:
gss['hrs1'].describe()

            hrs1
count    41560.0
mean   41.183951
std    14.125299
min          0.0
25%         37.0
50%         40.0
75%         48.0
max         89.0


In [279]:
print(gss['evidu'].describe())
print(gss['smokeday'].describe())
print(gss['arrest'].describe())

count     23790
unique        2
top          no
freq      23077
Name: evidu, dtype: object
count                           1128
unique                             7
top       do not smoke and never did
freq                             665
Name: smokeday, dtype: object
count     12342
unique        3
top          no
freq      10781
Name: arrest, dtype: object


In [280]:
print(gss['vaxsafe'].describe())
print(gss['covid12'].describe())
print(gss['instype01'].describe())
print(gss['condom'].describe())

count      1232
unique        5
top       agree
freq        430
Name: vaxsafe, dtype: object
count     1226
unique       2
top        yes
freq       999
Name: covid12, dtype: object
count                         981
unique                          5
top       public health insurance
freq                          363
Name: instype01, dtype: object
count        27537
unique           2
top       not used
freq         21606
Name: condom, dtype: object


In [281]:
gss = gss.drop(columns=['helpful','trust'])
# not necessarily interesting

In [282]:
print(gss['pray'].describe())
print(gss['attend'].describe())
print(gss['relig'].describe())
print(gss['postlife'].describe())

count          43269
unique             6
top       once a day
freq           12083
Name: pray, dtype: object
count     71690
unique        9
top       never
freq      13855
Name: attend, dtype: object
count          71953
unique            13
top       protestant
freq           40125
Name: relig, dtype: object
count     44148
unique        2
top         yes
freq      35337
Name: postlife, dtype: object


In [283]:
print(gss['neisafe'].describe())
print(gss['fear'].describe())
print(gss['gunlaw'].describe())
print(gss['polviews'].describe())

count          6646
unique            4
top       very safe
freq           3464
Name: neisafe, dtype: object
count     45781
unique        2
top          no
freq      27793
Name: fear, dtype: object
count     48307
unique        2
top       favor
freq      36367
Name: gunlaw, dtype: object
count                            62718
unique                               7
top       moderate, middle of the road
freq                             23992
Name: polviews, dtype: object


In [284]:
# Rename columns in the GSS for clarity
gss = gss.rename(columns={
    'childsinhh': 'children_in_household',
    'childs': 'number_of_children',
    'rspgndr': 'gender',
    'paeduc': 'father_educ',
    'maeduc': 'mother_educ',
    'wrkstat': 'work_status',
    'hrs1': 'hours_worked',
    'realrinc': 'real_income',
    'realinc': 'household_income',
    'race': 'race_ethnicity',
    'racecen1': 'race_detailed',
    'vaxsafe': 'vaccine_safety',
    'covid12': 'covid19_impact',
    'hapmar': 'marriage_happiness',
    'hapcohab': 'cohab_happiness',
    'satfin': 'financial_satis',
    'neisafe': 'neighborhood_safety',
    'instype01': 'health_insurance',
    'partners': 'number_partners',
    'evidu': 'inj_drugs',
    'spwrksta': 'spouse_work_status',
    'cowrksta': 'cohab_work_status',
    'postlife': 'beliefs_afterlife',
    'fear': 'area_fear',
    'satjob': 'job_satisfaction'
})


After exploring the variables, we arrived at the final list of variables below. We dropped some variables and renamed many for clarity.

'marital', 'divorce', 'children_in_household', 'number_of_children', 'gender', 'educ',
 'degree', 'father_educ', 'mother_educ', 'work_status', 'hours_worked', 'sei', 'real_income',
 'household_income', 'babies', 'preteen', 'teens', 'adults', 'race_ethnicity', 'race_detailed',
 'sex', 'sexornt', 'vaccine_safety', 'covid19_impact', 'happy', 'marriage_happiness',
 'cohab_happiness', 'financial_satis', 'health', 'stress', 'neighborhood_safety', 'physact',
 'health_insurance', 'number_partners', 'condom', 'inj_drugs', 'smokeday', 'arrest', 'joblose',
 'jobfind', 'spouse_work_status', 'cohab_work_status', 'class', 'polviews', 'relig', 'pray', 'attend',
 'beliefs_afterlife', 'area_fear', 'gunlaw', 'job_satisfaction'
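
Because `DataFrame.rename` ignores keys that do not match any column by default, a misspelled key in the rename dictionary would pass silently. One way to catch this is to compare the expected final names against the actual columns. The sketch below uses a hypothetical helper `column_diff` and a toy frame in place of `gss`.

```python
import pandas as pd

def column_diff(df, expected):
    """Return (missing, unexpected): expected names absent from df, and df columns not expected."""
    actual = list(df.columns)
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    return missing, unexpected

# Hypothetical toy frame; 'rspgndr' stands for a column whose rename was skipped.
toy = pd.DataFrame(columns=['marital', 'divorce', 'rspgndr', 'educ'])
expected = ['marital', 'divorce', 'gender', 'educ']

missing, unexpected = column_diff(toy, expected)
print(missing, unexpected)
```

Two empty lists would mean the frame matches the final variable list exactly.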