## EDA LAB

The General Social Survey (GSS) is a biennial, nationally representative survey of Americans, with almost 7000 different questions asked since the survey began in the 1970s. It has straightforward questions about respondents' demographic information, but also questions like "Does your job regularly require you to perform repetitive or forceful hand movements or involve awkward postures?" or "How often do the demands of your job interfere with your family life?" There are a variety of controversial questions. No matter what you're curious about, there's something interesting in here to check out. The codebook is 904 pages (use CTRL+F to search it).

The data and codebook are available at:
https://gss.norc.org/us/en/gss/get-the-data.html

The datasets are so large that it might make sense to pick the variables you want, and then download just those variables from:
https://gssdataexplorer.norc.org/variables/vfilter

Here is your task:
1. Download a small (5-15) set of variables of interest.
2. Write a short description of the data you chose, and why. (1 page)
3. Load the data using Pandas. Clean them up for EDA. Do this in a notebook with comments or markdown chunks explaining your choices.
4. Produce some numeric summaries and visualizations. (1-3 pages)
5. Describe your findings in 1-2 pages.
6. If you have other content that you think absolutely must be included, you can include it in an appendix of any length.

For example, you might want to look at how aspects of a person's childhood family are correlated or not with their career or family choices as an adult. Or how political or religious affiliations correlate with drug use or sexual practices. It's an extremely wide-ranging survey.

Feel free to work with other people in groups, and ask questions!

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

1. Download a small (5-15) set of variables of interest.

In [None]:
# Variables to keep from each chunk
var_list = ['year', 'id', 'age', 'sex', 'race', 'degree', 'happy', 'health', 'satfin', 'partyid']
output_filename = 'selected_gss_data.csv'

# Download the data in chunks (adapted from get_gss.ipynb) and append the
# selected columns to one CSV, writing the header only once.
header_written = False
failed_chunks = []
for k in range(3):
    url = 'https://github.com/DS3001/project_gss/raw/main/gss_chunk_' + str(1+k) + '.parquet'
    try:
        df = pd.read_parquet(url)
        df.loc[:, var_list].to_csv(output_filename,
                                   mode='a' if header_written else 'w',
                                   header=not header_written,
                                   index=False)
        header_written = True
    except Exception as e:
        # Report the actual error instead of silently swallowing it
        failed_chunks.append(k + 1)
        print('error downloading chunk', k + 1, ':', e)

if failed_chunks:
    print('download incomplete; failed chunks:', failed_chunks)
else:
    print('data download complete')

2. Write a short description of the data you chose, and why.

I chose these variables to explore how demographics (age, sex, race, education, and party identification) relate to self-reported happiness, health, and financial satisfaction.

variables:
- year: survey year
- id: respondent id
- age: age of respondent
- sex: sex of respondent
- race: race of respondent
- degree: highest degree
- happy: general happiness
- health: condition of health
- satfin: financial satisfaction
- partyid: political party affiliation

3. Load the data using Pandas. Clean them up for EDA.

In [None]:
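# Load the selected variables
df = pd.read_csv('selected_gss_data.csv')

# Check the number of rows and columns, and preview the data
print(df.shape)
print(df.head())

# Count missing values per column before deciding how to handle them
print(df.isnull().sum())

# Drop rows with any missing value, for simplicity in this lab.
# Note: this discards a row if even one of the selected variables is missing.
df = df.dropna()
print(df.shape)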
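
Dropping every row with any missing value can remove a large share of a survey extract, because not every question is asked of every respondent. As a hedged alternative, the sketch below measures the share of missing values per column and drops rows only for the variables a given analysis needs. The `toy` frame and the `analysis_vars` choice are made up for illustration and are not taken from the GSS data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the GSS extract (values are invented for illustration)
toy = pd.DataFrame({
    'age': [25, 40, np.nan, 63],
    'happy': ['very happy', None, 'pretty happy', 'not too happy'],
    'partyid': [None, None, 'independent', None],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Drop rows only when the variables needed for this analysis are missing
analysis_vars = ['age', 'happy']  # hypothetical choice of variables
subset_clean = toy.dropna(subset=analysis_vars)

# For comparison: dropping on every column at once
fully_clean = toy.dropna()

print(len(toy), len(subset_clean), len(fully_clean))
```

In this toy example the mostly-empty `partyid` column wipes out every row under a blanket `dropna()`, while restricting to `analysis_vars` keeps the rows that are usable for an age-by-happiness comparison.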

4. Produce some numeric summaries and visualizations.

In [None]:
# Numeric summary (describe() covers numeric columns only by default)
print(df.describe())

# Distribution of respondent age
sns.histplot(df['age'])
plt.title('Age distribution')
plt.show()

# Counts of each happiness response
sns.countplot(x='happy', data=df)
plt.title('Happiness counts')
plt.show()

# Happiness responses within each health category
sns.countplot(x='health', hue='happy', data=df)
plt.title('Health vs. happiness')
plt.xticks(rotation=45)
plt.show()
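
The count plots above show categories in whatever order they come out of the data, which may not match the natural ordering of the responses. One possible approach, sketched below, is to convert the columns to ordered categoricals. The label lists are assumptions: 'pretty happy', 'excellent', 'good', 'fair', and 'poor' appear in this notebook, while 'very happy' and 'not too happy' are assumed and should be checked against `df['happy'].unique()`.

```python
import pandas as pd

# Assumed label sets; confirm against the actual values in the data
happy_order = ['not too happy', 'pretty happy', 'very happy']
health_order = ['poor', 'fair', 'good', 'excellent']

# Toy data invented for illustration
toy = pd.DataFrame({
    'happy': ['pretty happy', 'very happy', 'not too happy', 'pretty happy'],
    'health': ['good', 'excellent', 'poor', 'fair'],
})

# Ordered categoricals make sorting and comparisons follow the response scale
toy['happy'] = pd.Categorical(toy['happy'], categories=happy_order, ordered=True)
toy['health'] = pd.Categorical(toy['health'], categories=health_order, ordered=True)

# Sorting now follows the scale rather than alphabetical order
print(toy.sort_values('health'))

# Counts indexed in category order
print(toy['happy'].value_counts().sort_index())
```

The same lists can also be passed to `sns.countplot` through its `order` and `hue_order` arguments if the columns are left as plain strings.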

5. Describe your findings.

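Findings:
1. The age distribution is fairly spread out, with fewer very young and very old respondents.
2. 'Pretty happy' is the most common happiness response.
3. There appears to be a relationship between health and happiness: respondents reporting 'excellent' or 'good' health tend to report being happier than those reporting 'fair' or 'poor' health. The plot shows raw counts and the health groups may differ in size, so this is a visual impression rather than a tested result.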
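
A comparison like finding 3 is easier to read as shares within each health group than as raw counts. The sketch below shows one way to do that with a row-normalized cross-tabulation. The `toy` data is invented for illustration, so the numbers it prints say nothing about the actual GSS responses.

```python
import pandas as pd

# Toy data invented for illustration; not GSS results
toy = pd.DataFrame({
    'health': ['excellent', 'excellent', 'excellent', 'good', 'good',
               'poor', 'poor', 'poor', 'poor'],
    'happy': ['very happy', 'very happy', 'pretty happy', 'pretty happy', 'very happy',
              'not too happy', 'not too happy', 'pretty happy', 'not too happy'],
})

# Raw counts: rows are health categories, columns are happiness responses
counts = pd.crosstab(toy['health'], toy['happy'])

# Row-normalized shares: each row sums to 1, so groups of different
# sizes can be compared directly
shares = pd.crosstab(toy['health'], toy['happy'], normalize='index')

print(counts)
print(shares.round(2))
```

Applied to the cleaned data, the same call would be `pd.crosstab(df['health'], df['happy'], normalize='index')`.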

6. Appendix

No extra content needed.