## Getting the GSS Data

Since the data files are about 40GB zipped, we can't store a compressed or uncompressed version on GitHub, and the entire dataset can't really be loaded into memory with Colab.

One option is to use Rivanna: download the data, unzip it, and work on it in a persistent environment.

The other option is to avoid opening the entire file at once, and instead work with chunks of the data. That's what this code does for you.

On GitHub, the data are split into three smaller files saved in .parquet format. The code below loads these chunks into memory one at a time. You specify the variables you want in `var_list`, and the results are saved to `selected_gss_data.csv`.

You can add more cleaning instructions between the line where the data are loaded (`df = pd.read_parquet(url)`) and the lines where the data are saved (`df.loc...`). It's probably easiest to use this code to get only the variables you want, and then clean that subset of the data.
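As a rough illustration of where such a cleaning step could go, here is a minimal sketch. The helper `clean_chunk` is hypothetical (it is not part of the project code), the demo rows are made up, and coercing `age` to numeric is only an example of a cleaning choice you might make.

```python
import pandas as pd

def clean_chunk(df, var_list):
    """Hypothetical helper: keep the selected variables and tidy them before saving."""
    out = df.loc[:, var_list].copy()  # work on a copy of the selected columns
    # Example cleaning step (an assumption, adjust to your own variables):
    # coerce 'age' to numeric so text labels become NaN
    if 'age' in out.columns:
        out['age'] = pd.to_numeric(out['age'], errors='coerce')
    return out

# Tiny made-up chunk standing in for one downloaded .parquet chunk
demo = pd.DataFrame({'year': [1972, 1972],
                     'age': ['23', '89 or older'],
                     'id': [1, 2]})
cleaned = clean_chunk(demo, ['year', 'age'])
print(cleaned)
```

Inside the loop, you could then call `clean_chunk(df, var_list).to_csv(...)` in place of `df.loc[:,var_list].to_csv(...)`.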

Here is your task:

1. Download a small (5-15) set of variables of interest.
2. Write a short description of the data you chose, and why. (1 page)
3. Load the data using Pandas. Clean them up for EDA. Do this in a notebook with comments or markdown chunks explaining your choices.
4. Produce some numeric summaries and visualizations. (1-3 pages)
5. Describe your findings in 1-2 pages.
6. If you have other content that you think absolutely must be included, you can include it in an appendix of any length.
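For the numeric summaries, a few standard Pandas calls go a long way. The sketch below uses a small made-up data frame with some of the variable names from this project; the values are illustrative only and are not real GSS results.

```python
import pandas as pd

# Made-up rows using a few variable names from var_list; not real GSS values
demo = pd.DataFrame({
    'year': [1972, 1972, 1974, 1974],
    'sex': ['female', 'male', 'female', 'male'],
    'educ': [16.0, 10.0, 12.0, None],
    'wrkstat': ['working full time', 'retired',
                'working part time', 'working full time'],
})

summary = demo['educ'].describe()                   # count, mean, quartiles for a numeric variable
counts = demo['wrkstat'].value_counts()             # frequency table for a categorical variable
educ_by_sex = demo.groupby('sex')['educ'].mean()    # numeric variable summarized by group
table = pd.crosstab(demo['sex'], demo['wrkstat'])   # two-way table of two categorical variables

print(summary, counts, educ_by_sex, table, sep='\n\n')
```

Missing values are skipped by `describe()` and `mean()`, so it is worth checking how many observations each summary is actually based on.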

In [29]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [30]:

#
var_list = ['year','educ','age','sex','agekdbrn','wrkstat','hrs1','indus10',
            'prestg10','prestg105plus','marital','martype','cowrksta','sppres10',
            'earnrs','income'] # List of variables you want to save

output_filename = 'selected_gss_data.csv' # Name of the file you want to save the data to
#
phase = 0 # Starts in write mode; after one iteration of loop, switches to append mode
#
for k in range(3): # for each chunk of the data
    url = 'https://github.com/DS3001/project_gss/raw/main/gss_chunk_' + str(1+k) + '.parquet' # Create url to the chunk to be processed
    print(url) # Check the url is correct
    df = pd.read_parquet(url) # Download this chunk of data
    print(df.head()) # Visually inspect the first few rows
    if phase == 0 :
        df.loc[:,var_list].to_csv(output_filename, # specifies target file to save the chunk to
                                mode='w', # control write versus append
                                header=var_list, # variable names
                                index=False) # no row index saved
        phase = 1 # Switch from write mode to append mode
    elif phase == 1 :
        df.loc[:,var_list].to_csv(output_filename, # specifies target file to save the chunk to
                                mode='a', # append to the existing file
                                header=False, # no header row when appending
                                index=False) # no row index saved

https://github.com/DS3001/project_gss/raw/main/gss_chunk_1.parquet
   year  id            wrkstat  hrs1  hrs2 evwork    occ  prestige  \
0  1972   1  working full time   NaN   NaN    NaN  205.0      50.0   
1  1972   2            retired   NaN   NaN    yes  441.0      45.0   
2  1972   3  working part time   NaN   NaN    NaN  270.0      44.0   
3  1972   4  working full time   NaN   NaN    NaN    1.0      57.0   
4  1972   5      keeping house   NaN   NaN    yes  385.0      40.0   

         wrkslf wrkgovt  ...  agehef12 agehef13 agehef14  hompoph wtssps_nea  \
0  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
1  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
2  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
3  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
4  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   

   wtssnrps_nea  wtssps_next wt
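If memory is tight, one possible variation is to ask `pd.read_parquet` for only the columns you need through its `columns` argument, instead of loading every column and subsetting afterwards. This is a sketch, not the project's code: `chunk_url` and `load_selected` are hypothetical helpers, and it assumes every name in `var_list` exists in every chunk, which has not been checked here.

```python
import pandas as pd

# Same base address as in the loop above
BASE_URL = 'https://github.com/DS3001/project_gss/raw/main/gss_chunk_'

def chunk_url(k):
    """Hypothetical helper: build the URL of chunk k (1-based)."""
    return BASE_URL + str(k) + '.parquet'

def load_selected(url, var_list):
    """Hypothetical helper: read only the requested columns from one chunk.

    Assumes all names in var_list are present in the file.
    """
    return pd.read_parquet(url, columns=var_list)

urls = [chunk_url(k) for k in range(1, 4)]  # the three chunks
print(urls)
```

Calling `load_selected(urls[0], var_list)` would then take the place of `pd.read_parquet(url)` followed by `df.loc[:,var_list]`.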

In [31]:
new_df = pd.read_csv('selected_gss_data.csv')



In [32]:
new_df.head(5)

Unnamed: 0,year,educ,age,sex,agekdbrn,wrkstat,hrs1,indus10,prestg10,prestg105plus,marital,martype,cowrksta,sppres10,earnrs,income
0,1972,16.0,23.0,female,,working full time,,5170.0,45.0,49.0,never married,,,,1.0,
1,1972,10.0,70.0,male,,retired,,6470.0,50.0,62.0,married,,,,0.0,
2,1972,12.0,48.0,female,,working part time,,7070.0,49.0,69.0,married,,,41.0,2.0,
3,1972,17.0,27.0,female,,working full time,,5170.0,60.0,85.0,married,,,39.0,2.0,
4,1972,12.0,61.0,female,,keeping house,,6680.0,31.0,21.0,married,,,38.0,1.0,


In [33]:
# indus10 does not give the information I wanted
# do not need two measures of prestige - dropping prestg105plus
# marital is enough information - dropping martype
new_df.drop(columns=['indus10','prestg105plus','martype'], inplace=True)

In [34]:
new_df.head()

Unnamed: 0,year,educ,age,sex,agekdbrn,wrkstat,hrs1,prestg10,marital,cowrksta,sppres10,earnrs,income
0,1972,16.0,23.0,female,,working full time,,45.0,never married,,,1.0,
1,1972,10.0,70.0,male,,retired,,50.0,married,,,0.0,
2,1972,12.0,48.0,female,,working part time,,49.0,married,,41.0,2.0,
3,1972,17.0,27.0,female,,working full time,,60.0,married,,39.0,2.0,
4,1972,12.0,61.0,female,,keeping house,,31.0,married,,38.0,1.0,
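In the five rows shown above, `agekdbrn`, `hrs1`, `cowrksta`, and `income` are all empty for 1972, so a natural next step is to measure how much of each variable is missing, overall and by year. The sketch below uses a made-up data frame in place of `new_df`; the numbers are illustrative, not real GSS values.

```python
import pandas as pd

# Made-up stand-in for new_df; values are illustrative only
demo = pd.DataFrame({
    'year': [1972, 1972, 1974, 1974],
    'hrs1': [None, None, 40.0, None],
    'educ': [16.0, 10.0, 12.0, 17.0],
})

# Share of missing values in each column
missing_share = demo.isna().mean()

# Share of missing values in each column, separately for each survey year
missing_by_year = demo.drop(columns='year').isna().groupby(demo['year']).mean()

print(missing_share)
print(missing_by_year)
```

A variable that is entirely missing in some years may simply not have been asked in those years, which matters when comparing results over time.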
