## EDA LAB

The General Social Survey (GSS) is a nationally representative survey of Americans, now conducted every other year, with almost 7000 different questions asked since the survey began in the 1970s. It has straightforward questions about respondents' demographic information, but also questions like "Does your job regularly require you to perform repetitive or forceful hand movements or involve awkward postures?" or "How often do the demands of your job interfere with your family life?" It also includes a number of questions on controversial topics. No matter what you're curious about, there's something interesting in here to check out. The codebook is 904 pages (use CTRL+F to search it).

The data and codebook are available at:
https://gss.norc.org/us/en/gss/get-the-data.html

The datasets are so large that it might make sense to pick the variables you want, and then download just those variables from:
https://gssdataexplorer.norc.org/variables/vfilter

Here is your task:
1. Download a small (5-15) set of variables of interest.
2. Write a short description of the data you chose, and why. (1 page)
3. Load the data using Pandas. Clean them up for EDA. Do this in a notebook with comments or markdown chunks explaining your choices.
4. Produce some numeric summaries and visualizations. (1-3 pages)
5. Describe your findings in 1-2 pages.
6. If you have other content that you think absolutely must be included, you can include it in an appendix of any length.

For example, you might want to look at how aspects of a person's childhood family are correlated or not with their career or family choices as an adult. Or how political or religious affiliations correlate with drug use or sexual practices. It's an extremely wide-ranging survey.

Feel free to work with other people in groups, and ask questions!

In [4]:
! git clone https://github.com/ds4e/EDA


Cloning into 'EDA'...
remote: Enumerating objects: 76, done.
remote: Counting objects: 100% (31/31), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 76 (delta 17), reused 12 (delta 12), pack-reused 45 (from 2)
Receiving objects: 100% (76/76), 26.04 MiB | 21.77 MiB/s, done.
Resolving deltas: 100% (22/22), done.


In [9]:
import pandas as pd
#
var_list = ['year','educ','age','sex','agekdbrn','wrkstat','hrs1',
            'prestg10','marital','cowrksta','sppres10',
            'earnrs','income'] # List of variables you want to save
output_filename = 'selected_gss_data.csv' # Name of the file you want to save the data to
#
for k in range(3): # For each of the three chunks of the data
    url = 'https://github.com/DS3001/project_gss/raw/main/gss_chunk_' + str(1+k) + '.parquet' # Create url to the chunk to be processed
    print(url) # Check the url is correct
    df = pd.read_parquet(url) # Download this chunk of data
    print(df.head()) # Visually inspect the first few rows
    first_chunk = (k == 0) # Only the first chunk creates the file and writes the header
    df.loc[:,var_list].to_csv(output_filename, # Target file to save the chunk to
                              mode='w' if first_chunk else 'a', # Write for the first chunk, append afterwards
                              header=first_chunk, # Column names are written once, with the first chunk
                              index=False) # No row index saved

https://github.com/DS3001/project_gss/raw/main/gss_chunk_1.parquet
   year  id            wrkstat  hrs1  hrs2 evwork    occ  prestige  \
0  1972   1  working full time   NaN   NaN    NaN  205.0      50.0   
1  1972   2            retired   NaN   NaN    yes  441.0      45.0   
2  1972   3  working part time   NaN   NaN    NaN  270.0      44.0   
3  1972   4  working full time   NaN   NaN    NaN    1.0      57.0   
4  1972   5      keeping house   NaN   NaN    yes  385.0      40.0   

         wrkslf wrkgovt  ...  agehef12 agehef13 agehef14  hompoph wtssps_nea  \
0  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
1  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
2  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
3  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
4  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   

   wtssnrps_nea  wtssps_next wt

In [10]:
new_df = pd.read_csv('selected_gss_data.csv')
new_df.head(30)

Unnamed: 0,year,educ,age,sex,agekdbrn,wrkstat,hrs1,prestg10,marital,cowrksta,sppres10,earnrs,income
0,1972,16.0,23.0,female,,working full time,,45.0,never married,,,1.0,
1,1972,10.0,70.0,male,,retired,,50.0,married,,,0.0,
2,1972,12.0,48.0,female,,working part time,,49.0,married,,41.0,2.0,
3,1972,17.0,27.0,female,,working full time,,60.0,married,,39.0,2.0,
4,1972,12.0,61.0,female,,keeping house,,31.0,married,,38.0,1.0,
5,1972,14.0,26.0,male,,working full time,,45.0,never married,,,1.0,
6,1972,13.0,28.0,male,,working full time,,43.0,divorced,,,1.0,
7,1972,16.0,27.0,male,,working full time,,33.0,never married,,,1.0,
8,1972,12.0,21.0,female,,working part time,,33.0,never married,,,1.0,
9,1972,12.0,30.0,female,,working full time,,25.0,married,,25.0,2.0,


In [12]:
cleaned_df = new_df.dropna()
cleaned_df.head(30)
# Cleaned the data by dropping every row that has a NaN in any column

Unnamed: 0,year,educ,age,sex,agekdbrn,wrkstat,hrs1,prestg10,marital,cowrksta,sppres10,earnrs,income
62476,2018,19.0,59.0,male,30.0,working full time,40.0,72.0,divorced,working full time,24.0,1.0,"$25,000 or more"
62541,2018,14.0,33.0,female,25.0,working full time,32.0,32.0,never married,working full time,40.0,4.0,"$25,000 or more"
62548,2018,12.0,37.0,female,22.0,working full time,40.0,51.0,divorced,working full time,38.0,3.0,"$25,000 or more"
62550,2018,12.0,70.0,male,25.0,working part time,20.0,35.0,separated,keeping house,35.0,1.0,"$25,000 or more"
62595,2018,18.0,39.0,female,32.0,working full time,40.0,67.0,never married,"unemployed, laid off, looking for work",58.0,2.0,"$25,000 or more"
62691,2018,15.0,29.0,female,25.0,working part time,30.0,33.0,never married,working full time,38.0,2.0,"$25,000 or more"
62695,2018,12.0,24.0,female,19.0,working part time,25.0,21.0,never married,working full time,35.0,2.0,"$25,000 or more"
62702,2018,10.0,62.0,female,17.0,working full time,40.0,25.0,widowed,retired,80.0,1.0,"$25,000 or more"
62712,2018,12.0,29.0,male,24.0,working full time,40.0,24.0,never married,working full time,47.0,3.0,"$25,000 or more"
62722,2018,12.0,36.0,male,28.0,working full time,48.0,50.0,never married,working full time,52.0,4.0,"$25,000 or more"
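
Calling `dropna()` with no arguments removes a row if any one of its columns is missing, which is why only recent survey years remain in the output above. Before dropping, it can help to check how much of each column is missing and to drop only on the columns a given analysis needs. The sketch below uses a small made-up frame (`toy`) rather than the real GSS extract; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the GSS extract; the values are made up for illustration
toy = pd.DataFrame({
    "year": [1972, 1972, 2018, 2018],
    "educ": [16.0, 10.0, 19.0, 14.0],
    "hrs1": [np.nan, np.nan, 40.0, 32.0],
    "cowrksta": [None, None, "working full time", None],
})

# Share of missing values in each column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Rows kept when every column must be present ...
rows_all = len(toy.dropna())
# ... versus rows kept when only the columns used in one analysis must be present
rows_subset = len(toy.dropna(subset=["educ", "hrs1"]))
print(rows_all, rows_subset)
```

On the real data, the same two checks would show which variables drive most of the row loss and how many more observations survive a column-specific drop.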


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_df['age'] = pd.to_numeric(cleaned_df['age'], errors='coerce')  # Ensured age is numeric, coerce invalid values to NaN
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_df['income'] = pd.to_numeric(cleaned_df['income'], errors='coerce')  # Ensured income is numeric
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returni
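
The warnings above come from assigning columns on a frame that pandas treats as a possible copy of `new_df`. One common way to avoid them is to take an explicit `.copy()` after `dropna()`. Separately, `income` holds bracket labels such as "$25,000 or more", so `pd.to_numeric(..., errors='coerce')` would turn every value into NaN; treating it as a categorical variable is one alternative. The sketch below uses a small made-up frame (`raw`), and the non-numeric age label is hypothetical.

```python
import pandas as pd

# Toy stand-in for the extract; "89 or older" is a hypothetical non-numeric label
raw = pd.DataFrame({
    "age": ["59.0", "33.0", "89 or older"],
    "income": ["$25,000 or more", "$10,000 to $14,999", "$25,000 or more"],
})

# .copy() makes an independent frame, so the assignments below are unambiguous
cleaned = raw.dropna().copy()

# Numeric strings convert; anything else becomes NaN and should be inspected
cleaned["age"] = pd.to_numeric(cleaned["age"], errors="coerce")

# Coercing bracket labels to numbers loses all information ...
income_as_number = pd.to_numeric(raw["income"], errors="coerce")
print(income_as_number.isna().all())

# ... so keep income as a categorical variable instead
cleaned["income"] = cleaned["income"].astype("category")
print(cleaned["income"].value_counts())
```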

The variables I chose from the General Social Survey (GSS) offer a broad look at how different aspects of life (education, work, income, and family) are connected. By including year, I can track changes over time and see how people’s experiences have evolved. Education (educ) is important because it often shapes job opportunities, income, and overall social status. Age helps show how different stages of life affect work, family, and financial stability.

Sex is useful for understanding gender differences in education, work, and income. Age at first birth (agekdbrn) gives insight into how starting a family earlier or later in life might affect career and financial situations. Work status (wrkstat) and hours worked last week (hrs1) help analyze employment trends and work-life balance, especially when compared with income and occupational prestige (prestg10), which can reveal patterns in social class.

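Family life also plays a role in shaping these outcomes. Marital status (marital) can affect work and income. Judging by its values in the data (for example, "working full time" and "keeping house"), cowrksta records the work status of a cohabiting partner rather than anything about co-workers, and sppres10 is the spouse's occupational prestige score, the counterpart of prestg10. Together they show how a partner's employment relates to the respondent's own. The number of earners in the family (earnrs) and family income (income) provide key measures of financial well-being.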

By plotting these variables and the ways they interact with each other, I hope to reveal broader trends in American society.
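
As a starting point for the numeric summaries, grouped means and a simple correlation already say a lot about how these variables relate. The sketch below is a minimal example on a made-up frame (`toy`) that reuses three of the column names from the extract; the numbers are illustrative and are not GSS results.

```python
import pandas as pd

# Toy stand-in for the cleaned extract; the values are made up for illustration
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "educ": [16.0, 10.0, 12.0, 14.0],
    "prestg10": [45.0, 50.0, 49.0, 45.0],
})

# Mean years of education and mean occupational prestige by sex
summary = toy.groupby("sex")[["educ", "prestg10"]].mean()
print(summary)

# Pearson correlation between education and occupational prestige
corr = toy["educ"].corr(toy["prestg10"])
print(round(corr, 3))
```

The same `groupby` pattern extends to other grouping variables such as `marital` or `wrkstat`, and `summary.plot(kind="bar")` would turn the table into a chart if matplotlib is installed.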