## EDA LAB

The General Social Survey (GSS) is a biennial, nationally representative survey of Americans, with almost 7,000 different questions asked since the survey began in the 1970s. It has straightforward questions about respondents' demographic information, but also questions like "Does your job regularly require you to perform repetitive or forceful hand movements or involve awkward postures?" or "How often do the demands of your job interfere with your family life?" It also includes a variety of controversial questions. No matter what you're curious about, there's something interesting in here to check out. The codebook is 904 pages (use CTRL+F to search it).

The data and codebook are available at:
https://gss.norc.org/us/en/gss/get-the-data.html

The datasets are so large that it might make sense to pick the variables you want, and then download just those variables from:
https://gssdataexplorer.norc.org/variables/vfilter

Here is your task:
1. Download a small (5-15) set of variables of interest.
   - respondent ID
   - ballot
   - year
   - realinc
   - hrs2
   - marital
   - sphrs2
   - happy

2. Write a short description of the data you chose, and why. (1 page)
3. Load the data using Pandas. Clean them up for EDA. Do this in a notebook with comments or markdown chunks explaining your choices.
4. Produce some numeric summaries and visualizations. (1-3 pages)
5. Describe your findings in 1-2 pages.
6. If you have other content that you think absolutely must be included, you can include it in an appendix of any length.

For example, you might want to look at how aspects of a person's childhood family are correlated or not with their career or family choices as an adult. Or how political or religious affiliations correlate with drug use or sexual practices. It's an extremely wide-ranging survey.

Feel free to work with other people in groups, and ask questions!

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("https://raw.githubusercontent.com/ashleynguyen04/EDA/refs/heads/main/lab/GSS.csv")

df.head()

Unnamed: 0,year,id_,hrs2,marital,sphrs2,educ,polviews,happy,trustsci,realinc
0,1972,1,.i: Inapplicable,Never married,.i: Inapplicable,4 years of college,.i: Inapplicable,Not too happy,.i: Inapplicable,18951.0
1,1972,2,.i: Inapplicable,Married,.i: Inapplicable,10th grade,.i: Inapplicable,Not too happy,.i: Inapplicable,24366.0
2,1972,3,.i: Inapplicable,Married,.i: Inapplicable,12th grade,.i: Inapplicable,Pretty happy,.i: Inapplicable,24366.0
3,1972,4,.i: Inapplicable,Married,.i: Inapplicable,5 years of college,.i: Inapplicable,Not too happy,.i: Inapplicable,30458.0
4,1972,5,.i: Inapplicable,Married,.i: Inapplicable,12th grade,.i: Inapplicable,Pretty happy,.i: Inapplicable,50763.0
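
The preview above shows real values mixed with GSS missing-data codes such as `.i: Inapplicable`. Before cleaning, it can help to count how many such codes each column holds. The following is a minimal sketch on a made-up toy frame, not the real extract. It assumes every missing-data code is a string starting with `.`; the helper name `missing_code_counts` and the code labels other than `.i: Inapplicable` are illustrative.

```python
import pandas as pd

# Toy stand-in for the GSS extract (made-up rows, not real survey data).
toy = pd.DataFrame({
    "hrs2": [".i: Inapplicable", "40", ".n: No answer", "55"],
    "happy": ["Pretty happy", ".d: Do not know", "Very happy", "Not too happy"],
})

def missing_code_counts(frame):
    """Count, per column, the values whose text begins with a '.' code prefix."""
    as_text = frame.astype(str)  # numeric columns become text so .str works everywhere
    is_code = as_text.apply(lambda col: col.str.startswith("."))
    return is_code.sum()

code_counts = missing_code_counts(toy)
print(code_counts)
```

Running the same helper on `df` right after loading would show which columns need the most attention during cleaning.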


In [4]:
# Cleaning numeric variables
# Convert to floats; non-numeric entries such as ".i: Inapplicable" become NaN
df['hrs2'] = pd.to_numeric(df['hrs2'], errors='coerce')
print(df['hrs2'].unique())
df['sphrs2'] = pd.to_numeric(df['sphrs2'], errors='coerce')
print(df['sphrs2'].unique())

#print(df['realinc'].unique())

[nan 40. 55. 15. 56. 50. 10. 35. 70. 16. 37. 30. 48. 20.  5. 25. 60. 32.
 52. 45. 75. 24.  4. 44. 34. 13.  8. 12.  6. 80. 39. 21. 47. 36. 38. 33.
 46. 42. 43. 11. 66. 23.  7. 58. 18. 65. 84. 17. 68. 41.  1.  2.  0. 72.
 28. 57.  3. 22. 27. 26.  9.]
[nan 73. 20. 40. 35. 48. 14. 50. 16. 84. 44. 56. 13. 32.  8. 37. 72. 24.
 12. 60. 54. 22. 43. 25. 38. 49. 30. 45. 36. 65. 57. 55. 52. 70. 51. 21.
 26. 10.  5. 46. 80. 47.  1. 42. 39. 33. 27. 18. 15. 34.  9.]
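
Since `errors='coerce'` silently turns every non-numeric entry into `NaN`, it is worth checking how much of a column survives the conversion before summarizing it. A small sketch on made-up values (the toy column below is illustrative, not taken from the real data):

```python
import pandas as pd

# Toy column mixing numeric strings with missing-data codes (made-up values).
toy = pd.DataFrame({"hrs2": [".i: Inapplicable", "40", "55", ".n: No answer"]})

hours = pd.to_numeric(toy["hrs2"], errors="coerce")

# Fraction of rows that became NaN after coercion
share_missing = hours.isna().mean()
print(f"Share missing: {share_missing:.0%}")

# describe() skips NaN, so these statistics cover only the valid responses
print(hours.describe())
```

If the missing share is very high for a variable, any summary of it describes only the subset of respondents who were asked and answered that question.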


In [22]:
# Cleaning categorical variables

# marital: keep the valid labels and replace missing-data codes (.n, .s, .d) with 'Unknown'
valid_marital = ['Never married', 'Married', 'Divorced', 'Widowed', 'Separated']
df['marital'] = df['marital'].where(df['marital'].isin(valid_marital), 'Unknown')
print(df['marital'].value_counts())

# happy: keep the valid labels and replace missing-data codes (.n, .d, .i, .s) with 'Unknown'
valid_happy = ['Pretty happy', 'Very happy', 'Not too happy']
df['happy'] = df['happy'].where(df['happy'].isin(valid_happy), 'Unknown')
print(df['happy'].value_counts())

# educ
#print(df['educ'].value_counts())
# Note: educ records years of schooling completed, not credentials, so these labels are approximations.
# In particular, 9th-11th grade are grouped under 'High school graduate' even though those
# respondents did not complete 12th grade.
df['educ'] = df['educ'].replace({
    # Replace high school categories
    '12th grade': 'High school graduate',
    '11th grade': 'High school graduate',
    '10th grade': 'High school graduate',
    '9th grade': 'High school graduate',
    '8th grade': 'No high school diploma',
    '7th grade': 'No high school diploma',
    '6th grade': 'No high school diploma',
    '5th grade': 'No high school diploma',
    '4th grade': 'No high school diploma',
    '3rd grade': 'No high school diploma',
    '2nd grade': 'No high school diploma',
    '1st grade': 'No high school diploma',
    'No formal schooling': 'No high school diploma',

    # Replace college categories
    '4 years of college': "Bachelor's degree",
    '3 years of college': 'Some college',
    '2 years of college': 'Some college',
    '1 year of college': 'Some college',
    '6 years of college': "Master's degree",
    '5 years of college': "Master's degree",
    '7 years of college': 'Doctorate degree',
    '8 or more years of college': 'Doctorate degree',
})

# Replace remaining missing-data codes (.n, .d) with 'Unknown'
valid_educ = ['High school graduate', 'No high school diploma', "Bachelor's degree", 'Some college', "Master's degree", 'Doctorate degree']
df['educ'] = df['educ'].where(df['educ'].isin(valid_educ), 'Unknown')
print(df['educ'].value_counts())

marital
Married          37596
Never married    15904
Divorced          9642
Widowed           6756
Separated         2441
Unknown             51
Name: count, dtype: int64
happy
Pretty happy     37813
Very happy       20385
Not too happy     9390
Unknown           4802
Name: count, dtype: int64
educ
High school graduate      30525
Some college              17420
Bachelor's degree          9994
No high school diploma     5936
Master's degree            5337
Doctorate degree           2915
Unknown                     263
Name: count, dtype: int64
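
With the categories cleaned, one possible numeric summary for step 4 is a row-normalized cross-tabulation, for example the distribution of `happy` within each `marital` group. The sketch below uses a made-up toy frame so it runs on its own; dropping the `'Unknown'` rows first is one possible choice, not the only reasonable one.

```python
import pandas as pd

# Toy stand-in for the cleaned columns (made-up rows, not real survey data).
toy = pd.DataFrame({
    "marital": ["Married", "Married", "Divorced", "Divorced", "Never married", "Unknown"],
    "happy": ["Very happy", "Pretty happy", "Not too happy", "Pretty happy", "Pretty happy", "Unknown"],
})

# Drop rows where either variable is 'Unknown' so shares reflect valid answers only
known = toy[(toy["marital"] != "Unknown") & (toy["happy"] != "Unknown")]

# normalize="index" makes each row sum to 1: shares of happiness levels per marital status
happy_by_marital = pd.crosstab(known["marital"], known["happy"], normalize="index")
print(happy_by_marital.round(2))
```

On the real data, the same call with `df` in place of `toy` gives a table that can be passed to `sns.heatmap` or plotted as stacked bars.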
