# General Social Survey

"The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes.  Hundreds of trends have been tracked since 1972. In addition, since the GSS adopted questions from earlier surveys, trends can be followed for up to 80 years.
The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.
Altogether the GSS is the single best source for sociological and attitudinal trend data covering the United States. It allows researchers to examine the structure and functioning of society in general as well as the role played by relevant subgroups and to compare the United States to other nations.
The General Social Survey is an extensive survey conducted on the public of the United States of America every 2 years, to attempt to record the state of society over time." - https://gss.norc.org/about-the-gss


#### Metadata and Variable Translations

The categorical variables in this dataset are contracted into abbreviated labels within the datafile, and the values themselves are numerically encoded. The corresponding labels are contained within the accompanying `gss_data_codes.json` file, and a couple of pre-built functions for decoding the data are included below.
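As a rough illustration of what decoding involves, the sketch below maps numeric codes to labels using a small hand-made dictionary. The column name `EXAMPLE`, its labels, and its missing-value code are invented for this example and are not taken from `gss_data_codes.json`.

```python
import numpy as np
import pandas as pd

# Hypothetical codes for illustration only; the real mappings live in gss_data_codes.json
example_codes = {1.0: "yes", 2.0: "no", 9.0: "no answer"}
example_missing = [9.0]

# Encoded values as they might appear in the datafile
encoded = pd.Series([1.0, 2.0, 2.0, 9.0, 1.0], name="EXAMPLE")

# Map each numeric code to its label, then blank out codes flagged as missing
decoded = encoded.map(example_codes)
decoded = decoded.mask(encoded.isin(example_missing), np.nan)

print(decoded.tolist())
```

Treating the missing-value codes separately from the ordinary labels makes it easy to choose whether "no answer" style responses show up as text or as `NaN`.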

The General Social Survey contains over 6000 variables, some of which are only available for a single year (these are known as Modules). The metadata for these variables (such as a more detailed explanation and the exact wording of the question they respond to) can be found in the accompanying files `GSS_Codebook_index.pdf` and `GSS_Codebook_mainbody.pdf`. Alternatively, the metadata can be found at the following website:
* https://sda.berkeley.edu/D3/GSS18/Doc/hcbk.htm

Due to the large number of variables in the dataset (over 6000), it may be beneficial to look at the grouping of variables in either of these resources, and select some variables to pull into a smaller sub-DataFrame instead of attempting to analyse the entire dataset at once.
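One possible way to build such a subset is to ask pandas to parse only the columns of interest. The sketch below uses a tiny in-memory CSV as a stand-in for the real datafile; the column names (`YEAR`, `AGE`, `SIBS`, `POLVIEWS`) are placeholders chosen for illustration, so check the codebook for the names actually present in the dataset.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV; the column names here are placeholders
csv_text = (
    "CASEID,YEAR,AGE,SIBS,POLVIEWS\n"
    "1,1980,34,2,4\n"
    "2,1980,51,0,6\n"
    "3,2010,27,5,2\n"
)

wanted = ["CASEID", "YEAR", "SIBS", "POLVIEWS"]

# usecols tells read_csv to parse only the listed columns
sub_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=wanted,
    index_col="CASEID",
    dtype={"CASEID": str},
)

print(sub_df.columns.tolist())
```

If the full DataFrame is already in memory, selecting `df[["YEAR", "SIBS", "POLVIEWS"]].copy()` gives a similar result without re-reading the file.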


#### Weighting Variables

In some sections of the GSS, certain subgroups are over- or underrepresented. For example, although the survey was designed so that each US household had an equal chance of being selected, only one member of each household was chosen to answer the questionnaire, so people from larger households are underrepresented in person-level variables. In the 1982 and 1987 surveys, Black respondents were sampled at a higher rate and are therefore overrepresented in the data for those years. Further information on the weighting variables included, and on how to compensate for these effects, can be found at the following websites:
* https://sda.berkeley.edu/D3/GSS18/Doc/hcbk.htm
* http://gss.norc.org/Lists/gssFAQs/DispForm.aspx?ID=11
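To give a feel for how a weight changes an estimate, here is a minimal sketch on toy data. It assumes the household-size effect described above is corrected by weighting each respondent by the number of adults in their household; the column names `ADULTS` and `ANSWER` are placeholders, and the resources above should be consulted for the weighting variables that apply to a given year.

```python
import numpy as np
import pandas as pd

# Toy data: ADULTS is the number of adults in the respondent's household,
# ANSWER is a 0/1 response. Both column names are placeholders.
toy = pd.DataFrame({
    "ADULTS": [1, 1, 2, 4],
    "ANSWER": [1, 0, 1, 1],
})

# Unweighted share of respondents answering 1
unweighted = toy["ANSWER"].mean()

# Weighted share: a respondent from a k-adult household stands in for k adults
weighted = np.average(toy["ANSWER"], weights=toy["ADULTS"])

print(round(unweighted, 3), round(weighted, 3))
```

The gap between the two figures shows why person-level estimates can shift once household size is taken into account.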


## Starting Questions:
(These are just suggestions, feel free to pursue your own questions)

* What variables are correlated with people's political views?
    - Do people's pets correlate with their political views?
    - Does the number of siblings people have correlate with their political views?
    - How do these effects compare to those of Age and State?


* How have working hours and conditions changed over time in the US?


* Are more religious people more likely to occupy certain types of jobs?


* Is a person's Occupational Prestige score more strongly correlated with their father's Occupational Prestige score or their mother's? Does this differ between 1980 and 2010?


* Almost 1500 respondents have been surveyed at least 30 times (try `df['ID'].value_counts().value_counts().sort_index()`). Which variables have changed the most within the same respondent over time?
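
The chained `value_counts()` call in the last question can be hard to read at first glance. The sketch below runs it on a toy `ID` column to show what each step produces. It assumes, as that question does, that a repeated `ID` refers to the same respondent; it is worth confirming this against the codebook before relying on it.

```python
import pandas as pd

# Toy ID column: IDs 1 and 2 each appear three times, ID 3 appears once
toy_ids = pd.Series([1, 1, 1, 2, 2, 2, 3], name="ID")

# First value_counts: number of rows for each ID
rows_per_id = toy_ids.value_counts()

# Second value_counts: how many IDs share each row count, ordered by row count
histogram = rows_per_id.value_counts().sort_index()

print(histogram.to_dict())
```

The resulting index is the number of appearances, and each value is the number of IDs with that many appearances.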

In [None]:
# Import required libraries
import pandas as pd
import xml.etree.ElementTree as ET
import os
from timeit import default_timer as dtimer
from numpy import nan as NAN  # the numpy.NAN alias was removed in NumPy 2.0
import json

In [None]:
# Read in the datafile. Takes just over 2 minutes on my HP Elitebook.
start_time = dtimer()
df = pd.read_csv("data/gss_data_encoded.csv", index_col="CASEID", dtype={"CASEID": str})
end_time = dtimer()
print("Time: {}s".format(round(end_time - start_time, 2)))
df.head()

In [None]:
# Load the variable codes, closing the file once it has been read
with open("data/gss_data_codes.json", "r") as codes_file:
    variable_codes = json.load(codes_file)

# JSON object keys are always strings. Convert them back into floats,
# as they will match both float and integer values in the data.
for var in variable_codes:
    cvd = variable_codes[var]['categories']
    variable_codes[var]['categories'] = {float(code): label for code, label in cvd.items()}

        
def translate_categories(df, show_missing_values=False, alert_missing_columns=True):
    global variable_codes
    
    # NOTE: Using this function on larger datasets (with len(df.columns) > ~100) may take a long time.
    # Use smaller subsets of the data for more efficient processing.
        
    for col in df.columns:
        if col not in variable_codes:
            if alert_missing_columns:
                print("The variable '{}' is not included in variable codes. It will not be decoded.".format(col))
            continue
        
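        # Copy the mapping so that adding missing-value codes below does not
        # permanently alter the global variable_codes dictionary
        categories = dict(variable_codes[col]['categories'])
        
        if not show_missing_values:
            for val in variable_codes[col]['missing_values']:
                categories[val] = NAN
        
        # Assign the column directly so its dtype can change from numeric to object
        df[col] = df[col].replace(categories)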
    
    return df


def var_name(codename):
    global variable_codes
    return variable_codes[codename]['title']


def var_codes(codename):
    global variable_codes
    return variable_codes[codename]['categories']


def var_question(codename):
    global variable_codes
    return variable_codes[codename]['question_text']


def var_missing_values(codename):
    global variable_codes
    return variable_codes[codename]['missing_values']
    

In [None]:
# Take an explicit copy so that decoding does not modify (or warn about) the original df
test_df = df.loc[:, df.columns[:50]].copy()
tdf = translate_categories(test_df)
tdf.head()