## EDA LAB

The General Social Survey (GSS) is a biennial, nationally representative survey of Americans, with almost 7000 different questions asked since the survey began in the 1970s. It has straightforward questions about respondents' demographic information, but also questions like "Does your job regularly require you to perform repetitive or forceful hand movements or involve awkward postures?" or "How often do the demands of your job interfere with your family life?" There are a variety of controversial questions. No matter what you're curious about, there's something interesting in here to check out. The codebook is 904 pages (use CTRL+F to search it).

The data and codebook are available at:
https://gss.norc.org/us/en/gss/get-the-data.html

The datasets are so large that it might make sense to pick the variables you want, and then download just those variables from:
https://gssdataexplorer.norc.org/variables/vfilter

Here is your task:
1. Download a small (5-15) set of variables of interest. (You can also check out `get_gss.ipynb` for some processed data.)
2. Write a short description of the data you chose, and why. (~500 words)
3. Load the data using Pandas. Clean them up for EDA. Do this in this notebook with comments or markdown chunks explaining your choices.
4. Produce some numeric summaries and visualizations.
5. Describe your findings. (500 - 1000 words, or more)

For example, you might want to look at how aspects of a person's childhood family are correlated or not with their career or family choices as an adult. Or how political or religious affiliations correlate with drug use or sexual practices. It's an extremely wide-ranging survey.


2. For this lab, we chose a small subset of variables from the General Social Survey (GSS), all related to work and employment. Work touches almost everyone's life in some form, and looking at it from several angles can reveal a lot about individuals as well as broader social trends. The variables we use are:

* year – the year in which the respondent completed the GSS survey.
* wrkstat – the respondent’s current labor force status (e.g., working full time, part time, unemployed, retired, in school).
* evwork – indicates whether the respondent has ever held a job for at least one year.
* hrs1 – the number of hours the respondent actually worked during the week prior to the survey.
* hrs2 – the number of hours the respondent usually works in a typical week.

We like this set of variables because it gives us several viewpoints on the topic of work. The wrkstat variable records a respondent's current employment status. Combining it with hrs1 (hours worked in the previous week) and hrs2 (hours worked in a typical week) lets us check whether reported status and reported hours line up. It will be informative to see whether that relationship holds across the entire dataset, or whether there are cases in which status and hours do not align.
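One way to run that check, sketched on a tiny made-up frame rather than the real GSS extract, is to summarize hours within each status and flag rows where the label and the hours disagree. The sample rows and the 35-hour cutoff are assumptions for illustration only; the wrkstat labels are the ones used in the cleaning code in this notebook.

```python
import pandas as pd

# Toy rows for illustration only; these are not GSS records.
toy = pd.DataFrame({
    "wrkstat": ["Working full time", "Working full time",
                "Working part time", "Working part time", "Retired"],
    "hrs1": [40.0, 20.0, 15.0, 45.0, None],
})

FULL_TIME_CUTOFF = 35  # hypothetical threshold, not a GSS definition

# Median hours within each reported status (NaN when nobody reported hours).
median_by_status = toy.groupby("wrkstat")["hrs1"].median()

# Flag rows where the status label and the reported hours disagree.
full_but_short = (toy["wrkstat"] == "Working full time") & (toy["hrs1"] < FULL_TIME_CUTOFF)
part_but_long = (toy["wrkstat"] == "Working part time") & (toy["hrs1"] >= FULL_TIME_CUTOFF)
toy["mismatch"] = full_but_short | part_but_long

print(median_by_status)
print(toy[toy["mismatch"]])
```

Rows with missing hours are never flagged, because comparisons against NaN evaluate to False.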

The evwork variable adds further information: it tells us whether the respondent has ever worked for one year or more. This lets us separate individuals who have never worked from those who are not working now but did work previously, such as retirees or people between jobs. The distinction matters because "not working" alone does not tell the whole story of a respondent's experience.
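A minimal sketch of that split, again on made-up rows: the `work_history` column and the "YES"/"NO" labels are assumptions for illustration, and the actual evwork labels in the downloaded extract should be checked with `value_counts()` first.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; label spellings are assumed, not verified.
toy = pd.DataFrame({
    "wrkstat": ["Working full time", "Retired", "Keeping house", "In school"],
    "evwork": [np.nan, "YES", "NO", "NO"],
})

working = toy["wrkstat"].isin(["Working full time", "Working part time"])

# Start from "unknown", then overwrite group by group.
toy["work_history"] = "unknown"
toy.loc[toy["evwork"] == "NO", "work_history"] = "never worked"
toy.loc[toy["evwork"] == "YES", "work_history"] = "worked before"
toy.loc[working, "work_history"] = "currently working"

print(toy["work_history"].value_counts())
```

Assigning the currently-working group last means current status takes precedence over any evwork answer.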

Comparing hrs1 and hrs2 to each other is also useful. As defined above, hrs1 captures a single week while hrs2 captures a typical week, so wherever both are reported, a gap between them flags respondents whose previous week was unusual. The hours variables also let us check whether people classified as part-time really report part-time hours, or whether their hours come closer to full-time work.

The year variable lets us put all of this into broader historical perspective, since it identifies the survey year for each respondent. Economic fluctuations, cultural shifts, and policy changes all shape patterns of work, so those patterns do not remain constant over time. With this variable, we can see whether average hours worked are changing, or whether some categories, such as "retired" or "in school," have grown or shrunk over the decades.
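A sketch of how those year-level comparisons could be computed, using invented rows so the numbers carry no meaning: `pd.crosstab` with `normalize="index"` gives each status's share within a year, and a grouped mean gives average hours per year.

```python
import pandas as pd

# Toy rows for illustration only; the years and values are made up.
toy = pd.DataFrame({
    "year": [1980, 1980, 1980, 1980, 2020, 2020, 2020, 2020],
    "wrkstat": ["Working full time", "Working full time", "Retired", "In school",
                "Working full time", "Retired", "Retired", "In school"],
    "hrs1": [40.0, 50.0, None, None, 38.0, None, None, None],
})

# Share of each work status within each survey year (rows sum to 1).
share_by_year = pd.crosstab(toy["year"], toy["wrkstat"], normalize="index")

# Average reported hours per year; missing hours are skipped by mean().
mean_hours_by_year = toy.groupby("year")["hrs1"].mean()

print(share_by_year)
print(mean_hours_by_year)
```

Shares are more comparable across years than raw counts, because the number of respondents differs from one survey wave to the next.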

We believe these variables provide a solid basis for our analysis. They let us compare how individuals classify their work situation with the hours they report, and track changes over time. What makes this particularly interesting is being able to observe not just where the data match expectations, but where they do not. Those discrepancies will give us a better sense of how work life in America has been captured in the survey.

In [1]:
!pip install pandas numpy matplotlib seaborn openpyxl



In [2]:
#3
import pandas as pd
import numpy as np

# loading dataset
df = pd.read_excel("./data/GSS.xlsx")

# keeping only variables of interest (.copy() avoids SettingWithCopyWarning on later assignments)
df = df[['year', 'wrkstat', 'evwork', 'hrs1', 'hrs2']].copy()

# cleaning categorical variables
df['wrkstat'] = df['wrkstat'].astype('category')
df['evwork'] = df['evwork'].replace({ # replacing non-yes/no with NaN
    '.i:  Inapplicable': None,
    '.n:  No answer': None,
    '.s:  Skipped on Web': None,
    '.d:  Do not Know/Cannot Choose': None
})
df['evwork'] = df['evwork'].astype('category')

# cleaning numeric variables
def clean_hours(x): # helper for stripping labels and converting to numbers
    if pd.isna(x):
        return None
    if isinstance(x, str):
        if x.startswith('.') or not any(c.isdigit() for c in x):
            return None
        try:
            return float(x)
        except ValueError:  # only swallow failed numeric conversions
            return None
    return x

df['hrs1'] = df['hrs1'].apply(clean_hours)
df['hrs2'] = df['hrs2'].apply(clean_hours)

# hrs1 is hours worked last week and hrs2 is usual weekly hours, so adding them
# would double-count; use hrs1 and fall back to hrs2 when hrs1 is missing
df['total_hrs'] = df['hrs1'].fillna(df['hrs2'])

# NaN for non-workers
df.loc[~df['wrkstat'].isin(['Working full time','Working part time']), 'total_hrs'] = np.nan

# checks
print(df.head())
df.info()  # info() prints directly and returns None, so no print() needed

print(df['wrkstat'].value_counts().head())
print(df['evwork'].value_counts())

print(df[['hrs1','hrs2','total_hrs']].describe())

FileNotFoundError: [Errno 2] No such file or directory: './data/GSS.xlsx'
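The error above means the Excel extract was not at `./data/GSS.xlsx` relative to the notebook's working directory. A small guard like the one below, where `resolve_data_file` is a hypothetical helper and not part of pandas, can make that failure easier to diagnose before calling `pd.read_excel`.

```python
from pathlib import Path


def resolve_data_file(path_str):
    """Return the path if the file exists, else raise with a more helpful message."""
    path = Path(path_str)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found (working directory: {Path.cwd()}). "
            "Download the extract and place it there, or update the path."
        )
    return path


# Example: report the problem instead of failing deep inside the loader.
try:
    data_path = resolve_data_file("./data/GSS.xlsx")
    print(f"Found data file at {data_path}")
except FileNotFoundError as err:
    print(err)
```

The returned path can then be passed straight to `pd.read_excel`.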

In [None]:
# 4

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

# 1. Distribution of Work Status
plt.figure(figsize=(8,5))
sns.countplot(y="wrkstat", data=df, order=df['wrkstat'].value_counts().index, palette="viridis")
plt.title("Distribution of Work Status")
plt.xlabel("Count")
plt.ylabel("Work Status")
plt.show()

# 2. Ever Worked Distribution
plt.figure(figsize=(6,4))
sns.countplot(x="evwork", data=df, palette="magma")
plt.title("Have you ever worked?")
plt.xlabel("Response")
plt.ylabel("Count")
plt.show()

# 3. Histogram of Weekly Hours
plt.figure(figsize=(8,5))
sns.histplot(df['total_hrs'], bins=30, kde=True, color="steelblue")
plt.title("Distribution of Weekly Working Hours")
plt.xlabel("Weekly Hours Worked")
plt.ylabel("Frequency")
plt.show()

# 4. Boxplot: Hours by Work Status
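# total_hrs is NaN for non-workers, so keep only rows with hours to avoid empty boxes
workers = df.dropna(subset=['total_hrs']).copy()
workers['wrkstat'] = workers['wrkstat'].cat.remove_unused_categories()
plt.figure(figsize=(10,6))
sns.boxplot(x="wrkstat", y="total_hrs", data=workers, palette="Set2")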
plt.title("Weekly Working Hours by Work Status")
plt.xlabel("Work Status")
plt.ylabel("Weekly Hours")
plt.xticks(rotation=30)
plt.show()

# 5. Trend Over Time (Year vs Avg Hours)
avg_hours_by_year = df.groupby("year")['total_hrs'].mean()

plt.figure(figsize=(10,5))
avg_hours_by_year.plot(marker="o", linestyle="-", color="darkgreen")
plt.title("Average Weekly Hours Worked Over Time")
plt.xlabel("Year")
plt.ylabel("Average Weekly Hours")
plt.grid(True)
plt.show()

# 6. Numeric Summaries
print("\n--- Summary Statistics ---")
print(df[['hrs1','hrs2','total_hrs']].describe())

print("\n--- Work Status Counts ---")
print(df['wrkstat'].value_counts())

print("\n--- Ever Worked Counts ---")
print(df['evwork'].value_counts())
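The overall `describe()` above pools every worker together. A per-status summary may be more informative. Here is a minimal sketch on made-up rows, reusing the `total_hrs` column name from the cleaning step:

```python
import pandas as pd

# Toy rows for illustration only; these are not GSS records.
toy = pd.DataFrame({
    "wrkstat": ["Working full time"] * 3 + ["Working part time"] * 2,
    "total_hrs": [40.0, 45.0, 50.0, 20.0, 24.0],
})

# One row per work status, with a few summary statistics side by side.
summary = toy.groupby("wrkstat")["total_hrs"].agg(["count", "mean", "median"])
print(summary)
```

The same `groupby(...).agg(...)` call should work on the cleaned `df` once the data file loads.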
