## Getting the GSS Data

Since the data files are about 40GB zipped, we can't store a compressed or uncompressed version on GitHub, and the entire dataset can't really be loaded into memory with Colab.

One option is to use Rivanna: download the data, unzip it, and work on it in a persistent environment.

The other option is to avoid opening the entire file at once, and instead work with chunks of the data. That's what this code does for you.

On GitHub, the data are broken into three smaller files, saved in .parquet format. The code below loads these chunks into memory one at a time. You specify the variables you want in `var_list`, and the results are saved in `selected_gss_data.csv`.

You can add more cleaning instructions between the line where the data are loaded (`df = pd.read_parquet(url)`) and the line where the data are saved (`df.loc...`). It's probably easiest to use this code to get only the variables you want, and then clean that subset of the data.
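As a minimal sketch of where a cleaning step could go, the loop can be wrapped in a helper that accepts an optional cleaning function. The names `save_selected`, `clean`, and `toy_chunks` are hypothetical and not part of the project repository. The toy data frames use made-up values and stand in for the parquet chunks so that the example runs without downloading anything.

```python
import os
import tempfile

import pandas as pd


def save_selected(chunks, var_list, output_filename, clean=None):
    """Write the chosen columns of each chunk to a single CSV file.

    `chunks` is any iterable of DataFrames; `clean` is an optional
    function applied to each selected subset before it is saved.
    """
    for k, chunk in enumerate(chunks):
        subset = chunk.loc[:, var_list]
        if clean is not None:
            subset = clean(subset)  # hypothetical cleaning hook
        subset.to_csv(output_filename,
                      mode='w' if k == 0 else 'a',  # overwrite first, then append
                      header=(k == 0),              # write column names only once
                      index=False)


# Toy chunks standing in for the parquet files (made-up values)
toy_chunks = [
    pd.DataFrame({'wrkstat': ['working full time', None],
                  'prestige': [50.0, 32.0],
                  'year': [1972, 1972]}),
    pd.DataFrame({'wrkstat': ['retired'],
                  'prestige': [41.0],
                  'year': [1973]}),
]

toy_path = os.path.join(tempfile.mkdtemp(), 'toy_selected.csv')
save_selected(toy_chunks, ['wrkstat', 'prestige'], toy_path,
              clean=lambda d: d.dropna(subset=['wrkstat']))
toy_result = pd.read_csv(toy_path)
print(toy_result)
```

Passing the cleaning step as a function keeps the load and save logic unchanged while the cleaning rules evolve. For the real data, `chunks` could be a generator that yields `pd.read_parquet(url)` for each chunk URL.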

In [27]:
! git clone https://github.com/DS3001/project_gss
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

Cloning into 'project_gss'...
remote: Enumerating objects: 23, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 23 (delta 6), reused 1 (delta 1), pack-reused 15
Receiving objects: 100% (23/23), 23.94 MiB | 20.72 MiB/s, done.
Resolving deltas: 100% (6/6), done.


In [180]:
var_list = ['wrkstat', 'prestige'] # List of variables you want to save
output_filename = 'selected_gss_data.csv' # Name of the file you want to save the data to

modes = ['w','a'] # Has write mode and append mode
phase = 0 # Starts in write mode; after one iteration of loop, switches to append mode

for k in range(3): # for each chunk of the data
    url = 'https://github.com/DS3001/project_gss/raw/main/gss_chunk_' + str(1+k) + '.parquet' # Create url to the chunk to be processed
    #print(url) # Check the url is correct
    df = pd.read_parquet(url) # Download this chunk of data
    #print(df.head()) # Visually inspect the first few rows
    df.loc[:,var_list].to_csv(output_filename, # specifies target file to save the chunk to
                              mode=modes[phase], # control write versus append
                              header=(phase == 0), # write variable names only for the first chunk, so appended chunks do not repeat the header row
                              index=False) # no row index saved
    phase = 1 # Switch from write mode to append mode

print(df.shape)
df.columns
# 6694 columns/variables (df holds only the last chunk at this point)

(24130, 6694)


Index(['year', 'id', 'wrkstat', 'hrs1', 'hrs2', 'evwork', 'occ', 'prestige',
       'wrkslf', 'wrkgovt',
       ...
       'agehef12', 'agehef13', 'agehef14', 'hompoph', 'wtssps_nea',
       'wtssnrps_nea', 'wtssps_next', 'wtssnrps_next', 'wtsscomp',
       'wtsscompnr'],
      dtype='object', length=6694)

In [183]:
# Creating new data frame with certain, potentially relevant variables
df2 = df[['age', 'sex', 'race', 'dipged', 'degree', 'educ', 'padeg', 'madeg',
          'major1', 'major2', 'health', 'happy',
          'marital', 'martype', 'agewed', 'divorce', 'widowed',
          'occ10', 'wrkstat', 'hrs1', 'hrs2', 'wrkslf', 'wrkgovt1',
          'wrkgovt2', 'whatslf2', 'indus10',
          'spwrksta', 'sphrs1', 'sphrs2', 'spwrkslf', 'spwrkslffam', 'spocc10',
          'sppres10', 'spind10', 'whatsp2',
          'agekdbrn', 'childs', 'class', 'income16', 'pres16', 'pres20']].copy() # copy so later edits to df2 do not trigger SettingWithCopyWarning
#print(df2.dtypes, '\n')
print(df2.tail())
print(list(df2.columns))
# age (N)
# sex (C) : male or female
# race (C)
# dipged (N??) : high school education/degree
# degree (C)
# educ (N) : 10-20 ????
# padeg/madeg (C) : father/mother degree/education level
# major1/major2 (C) : 2 majors inputtable per person
# health (C) : poor, fair, good, excellent
# happy (C) : level
# marital (C) : status
# martype (C) : type of marriage
# agewed (N) : age married
# divorce (C) : yes/no
# widowed (C) : yes/no
# occ10 (C) : job/occ
# wrkstat : part-time, full-time, school, keeping house
# hrs1
# hrs2 (N) : typical weekly hrs worked
# wrkslf (C) : self employed?
# wrkgovt1 (C) : government employed
# wrkgovt2 (C) : private employed
# whatslf2 (C) : work place classification
# indus10 (N) : work industry????
# ----------- ALL ABOUT SPOUSE--------------
# spwrksta : spouse part-time, full-time, school, keeping house
# sphrs1 (N) : IF working full/part-time, hrs worked last week
# sphrs2 (N) : spouse's typical weekly hrs worked
# spwrkslffam : spouse work for family farm?
# spocc10 (C) : spouse's job/occ
# sppres10 (N) : prestige of spouse's job??????
# spind10 (C) : spouse's work industry
# whatsp2 (C) : spouse's work place classification
# -------------------------------------------
# agekdbrn (N) : age when 1st kid born
# childs (N) : # of kids
# class (C) : economic (self-evaluated)
# income16 (N) : range total family income
# pres16 (C) : voted for elections, if voting at all (Hillary/Trump)
# pres20 (C) : voted for elections, if voting at all (Trump/Biden)

        age     sex   race  dipged                    degree  educ  \
24125  22.0  female  white     1.0               high school  12.0   
24126  29.0  female  white     1.0                  graduate  19.0   
24127  32.0    male  white     1.0  associate/junior college  15.0   
24128  49.0  female  white     1.0                  graduate  17.0   
24129  50.0    male  white     1.0                  graduate  20.0   

             padeg                     madeg             major1     major2  \
24125  high school               high school                NaN        NaN   
24126   bachelor's                  graduate  special education  education   
24127  high school  associate/junior college             health        NaN   
24128     graduate                  graduate     home economics        NaN   
24129     graduate                  graduate            biology  chemistry   

       ...                               spocc10 sppres10  \
24125  ...                                   NaN 
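The (N) and (C) labels in the comments above were assigned by hand. As a rough sketch, pandas can also report which columns it currently stores as numeric and which as categorical, which helps to double-check the uncertain entries. The toy data frame below uses made-up values and only mimics a few GSS-style columns; on the real data the same two `select_dtypes` calls would be applied to `df2`.

```python
import pandas as pd

# Toy frame with made-up values, mimicking a few GSS-style columns
toy = pd.DataFrame({
    'age': [22.0, 29.0, 32.0],
    'educ': [12.0, 19.0, 15.0],
    'sex': pd.Categorical(['female', 'female', 'male']),
    'degree': pd.Categorical(['high school', 'graduate', 'graduate']),
})

# Columns stored with a numeric dtype (candidates for the "N" label)
numeric_cols = toy.select_dtypes(include='number').columns.tolist()

# Columns stored with a categorical dtype (candidates for the "C" label)
categorical_cols = toy.select_dtypes(include='category').columns.tolist()

print('numeric:', numeric_cols)
print('categorical:', categorical_cols)
```

The stored dtype is only a hint: a variable such as `dipged` can be stored as a number while still representing a coded category, so the codebook remains the final reference.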

In [194]:
# Cleaning data to make sense in context

# Checking sex data
for value, count in df2['sex'].value_counts().items():
    print(f"{value}: {count} times")
nan_count = df2['sex'].isna().sum()
print(f"nans:{nan_count}")

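condition = df2['sex'].isin(['male', 'female'])
df2.loc[~condition, 'sex'] = np.nan # treat every label other than male/female as missing
df2['sex'] = df2['sex'].cat.remove_unused_categories() # drop the zero-count labels
# df2.sex should only show male & female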

female: 13234 times
male: 10784 times
don't know: 0 times
iap: 0 times
I don't have a job: 0 times
dk, na, iap: 0 times
no answer: 0 times
not imputable_(2147483637): 0 times
not imputable_(2147483638): 0 times
refused: 0 times
skipped on web: 0 times
uncodeable: 0 times
not available in this release: 0 times
not available in this year: 0 times
see codebook: 0 times
nans:112
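The zero-count labels in the output above (for example `iap` and `refused`) appear because the column is categorical and keeps every declared category. The same kind of check is likely to be needed for other categorical variables, so one possible approach is a small reusable helper. The name `keep_valid_labels` is hypothetical, and the toy series uses made-up values.

```python
import numpy as np
import pandas as pd


def keep_valid_labels(series, valid):
    """Return a copy of a categorical Series in which every label
    outside `valid` is set to missing and unused categories are dropped."""
    out = series.copy()
    out[~out.isin(valid)] = np.nan          # anything not in `valid` becomes NaN
    return out.cat.remove_unused_categories()


# Toy categorical column (made-up values) with extra declared categories
toy_sex = pd.Series(pd.Categorical(
    ['female', 'male', 'refused', None],
    categories=['female', 'male', 'refused', 'iap']))

cleaned = keep_valid_labels(toy_sex, ['male', 'female'])
print(cleaned.value_counts(dropna=False))
```

Because the helper returns a new Series, it could be applied column by column, for example `df2['sex'] = keep_valid_labels(df2['sex'], ['male', 'female'])`, without modifying the original column until the result has been inspected.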
