# General Social Survey

The General Social Survey (GSS) is an extensive survey of the United States public, conducted every 2 years, that attempts to record the state of society over time.

There are over 6000 columns in this dataset, many of which apply only to a single year (these are known as 'Modules'). The metadata for every column can be found in the accompanying codebook files, and all variables should be present in the master codebook ('GSS_Codebook.pdf'). The column names in the dataset are abbreviated codes, so it is worth spending some time looking through the codebook for variables of interest, rather than trying to pick useful variables out of the 6000+ columns directly.
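Once a few variables have been chosen from the codebook, one option is to load only those columns. The sketch below is a minimal illustration, assuming the chosen variables sit in a single CSV file; the inline sample data and the `wanted` list are hypothetical stand-ins, not taken from the real data files.

```python
import io
import pandas as pd

# Hypothetical stand-in for one of the data files; the real files have far more columns.
sample_csv = io.StringIO(
    "CASEID,YEAR,ID,AGE,SEX,RACE\n"
    "1972 1,1972,1,23,2,1\n"
    "1972 2,1972,2,70,1,1\n"
)

# Variables picked out after reading the codebook (an illustrative selection).
wanted = ["YEAR", "AGE", "SEX"]

# usecols tells pandas to parse only the listed columns and skip the rest.
subset = pd.read_csv(sample_csv, usecols=wanted)
print(subset.columns.tolist())
```

Because the variables are spread across several data files, a list passed to `usecols` raises an error for any file that lacks one of the names. Passing a callable such as `usecols=lambda c: c in wanted` avoids this by keeping whichever of the wanted columns each file happens to contain.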

Alternatively, the metadata for each variable can be found at https://sda.berkeley.edu/D3/GSS18/Doc/hcbk.htm

Setting up the dataframe for this dataset is slightly more involved, because of how Pandas imports categorical data (data that can take only a fixed set of discrete values). The data has been encoded as numbers for compression, and the encoded data is held in the `data_X.csv` files. The decoding information is held in the `ddi_X.xml` files and is needed to translate the numeric codes in the main data files.
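To make the idea of decoding concrete, here is a minimal sketch on toy data. The codes and labels are illustrative only and are not read from the `ddi_X.xml` files. It swaps numeric codes for labels and then stores the result with the pandas `category` dtype, which holds each distinct label once.

```python
import pandas as pd

# Toy codified column; values and labels are illustrative, not taken from the DDI files.
coded = pd.DataFrame({"SEX": [2, 1, 2, 2, 1]})
labels = {1.0: "MALE", 2.0: "FEMALE"}

# Cast to float so the codes match the float keys, then swap codes for labels.
decoded = coded["SEX"].astype(float).map(labels)

# Store the labels as a categorical column rather than as repeated strings.
decoded = decoded.astype("category")
print(decoded.tolist())
print(decoded.cat.categories.tolist())
```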

In [10]:
import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET
import os
from timeit import default_timer as dtimer

In [7]:
# Load the data for each datafile into a dataframe
frames = []
# Sort the filenames so the column order is the same on every run
for filename in sorted(os.listdir("data")):
    if "data_" in filename:
        t1 = dtimer()
        frames.append(pd.read_csv(os.path.join("data", filename)))
        t2 = dtimer()
        print("Added {}, time={}s".format(filename, round(t2-t1, 2)))

# Concatenate once, side by side, rather than growing the dataframe inside the loop
df = pd.concat(frames, axis=1)

print("Complete.")
df.head()

Added data_1.csv, time=4.43s
Added data_2.csv, time=2.61s
Added data_2000-2002_modules.csv, time=5.06s
Added data_2004_modules.csv, time=5.29s
Added data_2006-2010_modules.csv, time=6.12s
Added data_2012-2014_modules.csv, time=3.47s
Added data_3.csv, time=7.27s


Added data_4.csv, time=8.24s
Added data_80s-90s_issp_modules.csv, time=8.19s
Added data_80s-90s_modules.csv, time=7.6s
Complete.


Unnamed: 0,CASEID,YEAR,ID,AGE,SEX,RACE,RACECEN1,RACECEN2,RACECEN3,HISPANIC,...,WORKUNDC,OBRESPCT,ECONPAST,PASTUP,PASTDOWN,ECONFUTR,FUTRUP,FUTRDOWN,RDISCAFF,RIMMDISC
0,1972 1,1972,1,23,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1972 2,1972,2,70,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1972 3,1972,3,48,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1972 4,1972,4,27,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1972 5,1972,5,61,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
def compile_variable_dict(ddi_filepath):
    root = ET.parse(ddi_filepath)
    data = root.find("dataDscr")
    
    var_dict = {}
    for var in data:
        # Get the ID Code of the variable
        id_code = var.get("ID")
        # Get the Full Name of the variable
        full_title = var.find("./labl").text.strip()
        # Get the Question Text (if applicable)
        qstn = var.find("qstn")
        if qstn is not None and qstn.find("qstnLit") is not None:
            question_text = qstn.find("qstnLit").text.strip()
        else:
            question_text = None
        # Get the categories, mapping each numeric code to its label
        category_dict = {}
        category_list = var.findall("catgry")
        for category in category_list:
            value = float(category.find("catValu").text.strip())
            label = category.find("*[@level='category']").text.strip()
            category_dict[value] = label
            # To keep the 'missing' flag as well, store a dict per code instead:
            # category_dict[value] = {"label": label, "missing": category.get("missing")}
            
        var_dict[id_code] = {
            "title": full_title,
            "question_text": question_text,
            "categories": category_dict
        }
    
    return var_dict
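
For reference, the sketch below shows the XML layout that the function above appears to expect, judging by the element names it looks up (`dataDscr`, `var`, `labl`, `qstn`, `qstnLit`, `catgry`, `catValu`). The fragment is a hypothetical, heavily reduced example rather than an excerpt from the real `ddi_X.xml` files, and it assumes the elements carry no XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical, reduced DDI-style fragment using the element names the function reads.
ddi_xml = """
<codeBook>
  <dataDscr>
    <var ID="SEX">
      <labl>Respondent's sex</labl>
      <qstn><qstnLit>Code respondent's sex</qstnLit></qstn>
      <catgry>
        <catValu>1</catValu>
        <labl level="category">Male</labl>
      </catgry>
      <catgry missing="Y">
        <catValu>0</catValu>
        <labl level="category">IAP</labl>
      </catgry>
    </var>
  </dataDscr>
</codeBook>
"""

root = ET.fromstring(ddi_xml)
parsed = {}
for var in root.find("dataDscr"):
    categories = {}
    missing_codes = []
    for catgry in var.findall("catgry"):
        value = float(catgry.find("catValu").text.strip())
        # The category label is the child element carrying level="category"
        categories[value] = catgry.find("*[@level='category']").text.strip()
        # Record codes flagged with a 'missing' attribute separately
        if catgry.get("missing") is not None:
            missing_codes.append(value)
    parsed[var.get("ID")] = {
        "title": var.find("./labl").text.strip(),
        "categories": categories,
        "missing_codes": missing_codes,
    }

print(parsed)
```

Tracking the flagged codes separately is one way to treat values such as "inapplicable" as missing data later on, instead of as ordinary categories.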

In [12]:
for filename in os.listdir("data"):
    if "ddi_" in filename:
        t1 = dtimer()
        filepath = os.path.join("data", filename)
        var_dict = compile_variable_dict(filepath)
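        for col in df.columns:
            if col in var_dict:
                try:
                    # replace() returns a new Series here; adding inplace=True would make it
                    # return None, and the assignment would overwrite the column with None
                    df[col] = df[col].astype(float).replace(to_replace=var_dict[col]['categories'])
                except Exception as e:
                    # Raised for columns that cannot be cast to float (e.g. CASEID)
                    print(e)
                    print("Variable '{}' not found in dataframe".format(col))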
        t2 = dtimer()
        print("Translated {}, time={}s".format(filename, round(t2-t1, 2)))
print("Complete.")
df.head()

could not convert string to float: '1972 1  '
Variable 'CASEID' not found in dataframe


KeyboardInterrupt: 
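
The run above was interrupted before it finished, and the only error reported came from a text column (`CASEID`) that cannot be cast to float. One possible alternative, sketched here on toy data, is to skip non-numeric columns and look up labels with `Series.map`, which may be faster than a dictionary-based `replace` on large columns. The helper `translate_column` and the toy frame are hypothetical names introduced for illustration; this is not the notebook's own method.

```python
import pandas as pd

def translate_column(series, categories):
    """Return a decoded copy of `series`, or the series unchanged if it is not numeric."""
    codes = pd.to_numeric(series, errors="coerce").astype(float)
    # If coercion produced new NaNs, the column holds non-numeric text: leave it alone
    if codes.isna().sum() > series.isna().sum():
        return series
    decoded = codes.map(categories)
    # Keep the original value wherever a code has no label
    return decoded.where(decoded.notna(), series)

# Toy data standing in for the full dataframe and the parsed variable dictionary
toy = pd.DataFrame({"CASEID": ["1972 1  ", "1972 2  "], "SEX": [2, 1]})
toy_dict = {"CASEID": {}, "SEX": {1.0: "MALE", 2.0: "FEMALE"}}

for col in toy.columns:
    if col in toy_dict:
        toy[col] = translate_column(toy[col], toy_dict[col])

print(toy)
```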