## EDA LAB

The General Social Survey (GSS) is a biennial, nationally representative survey of Americans, with almost 7000 different questions asked since the survey began in the 1970s. It has straightforward questions about respondents' demographic information, but also questions like "Does your job regularly require you to perform repetitive or forceful hand movements or involve awkward postures?" or "How often do the demands of your job interfere with your family life?" There are a variety of controversial questions. No matter what you're curious about, there's something interesting in here to check out. The codebook is 904 pages (use CTRL+F to search it).

The data and codebook are available at:
https://gss.norc.org/us/en/gss/get-the-data.html

The datasets are so large that it might make sense to pick the variables you want, and then download just those variables from:
https://gssdataexplorer.norc.org/variables/vfilter

Here is your task:
1. Download a small (5-15) set of variables of interest.
2. Write a short description of the data you chose, and why. (1 page)
3. Load the data using Pandas. Clean them up for EDA. Do this in a notebook with comments or markdown chunks explaining your choices.
4. Produce some numeric summaries and visualizations. (1-3 pages)
5. Describe your findings in 1-2 pages.
6. If you have other content that you think absolutely must be included, you can include it in an appendix of any length.

For example, you might want to look at how aspects of a person's childhood family are correlated or not with their career or family choices as an adult. Or how political or religious affiliations correlate with drug use or sexual practices. It's an extremely wide-ranging survey.

Feel free to work with other people in groups, and ask questions!

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
from google.colab import files
uploaded = files.upload()

In [14]:
file_path = r"GSS.xlsx"
df = pd.read_excel(file_path)
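
Before recoding anything, it can help to list exactly which labels each column contains, because the recoding step has to match those strings character for character (including the double space in codes such as `.i:  Inapplicable`). The following is a minimal sketch, assuming the extract loads as text labels; `inspect_labels` and the toy frame are hypothetical stand-ins, not part of the GSS files.

```python
import pandas as pd

def inspect_labels(df: pd.DataFrame, max_levels: int = 10) -> dict:
    """Return {column: {label: count}} for columns with few distinct values."""
    labels = {}
    for col in df.columns:
        # dropna=False so genuinely empty cells are counted too
        counts = df[col].value_counts(dropna=False)
        if len(counts) <= max_levels:
            labels[col] = counts.to_dict()
    return labels

# Toy stand-in for the GSS extract (hypothetical rows, not real responses)
toy = pd.DataFrame({
    "id_": range(1, 13),
    "jazz": ["LIKE IT", "LIKE IT", ".i:  Inapplicable"] * 4,
})

# id_ has 12 distinct values, so it is skipped; jazz is reported
print(inspect_labels(toy))
```

On the real data, calling `inspect_labels(df)` right after loading would show every label that needs a recoding rule.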

The data I chose from the GSS measures music taste across different genres, as well as respondents' experience with attending music events and playing musical instruments. I love listening to all kinds of music and know how to play some instruments, so I'm curious whether certain preferences tend to cluster together, or whether people who play music tend to like certain types of music. The data is from 1993, 1998, and 2002 (the only years these questions were asked). Sentiment towards a genre is measured on a five-point scale from "dislike very much" to "like very much".

Variables:

year: survey year

gomusic: whether the respondent has attended a classical music or opera event (non-school) in the past year

plymusic: whether the respondent has played a musical instrument in the past year

country: respondent's feelings towards country music

blues: respondent's feelings towards blues/R&B music

classicl: respondent's feelings towards classical music

folk: respondent's feelings towards folk music

gospel: respondent's feelings towards gospel music

jazz: respondent's feelings towards jazz music

rap: respondent's feelings towards rap music

oldies: respondent's feelings towards older/classic rock music

hvymetal: respondent's feelings towards heavy metal music

In [15]:
#First I will drop ballot because I don't need it for this analysis
df.drop(columns='ballot', inplace=True)

In [21]:
#Next I will replace the preference values with ordinal values 1-5, from dislike very much to like very much
#Do not know/cannot choose, no answer, and inapplicable will be replaced with null values
likert = {'DISLIKE VERY MUCH': 1, 'DISLIKE IT': 2, 'MIXED FEELINGS': 3, 'LIKE IT': 4, 'LIKE VERY MUCH': 5}
missing = {'.d:  Do not Know/Cannot Choose': None, '.n:  No answer': None, '.i:  Inapplicable': None}
df.replace({**likert, **missing}, inplace=True)
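
One way to check the recoding is to count how many values ended up missing in each column. The sketch below uses a different technique from the cell above: a per-column `Series.map` rather than a frame-wide `replace`. The helper `recode_likert` and the toy rows are hypothetical. With `map`, any label that is not in the dictionary becomes NaN, so the missing-value codes do not need to be listed, but a misspelled label would also be silently lost.

```python
import numpy as np
import pandas as pd

LIKERT = {
    "DISLIKE VERY MUCH": 1,
    "DISLIKE IT": 2,
    "MIXED FEELINGS": 3,
    "LIKE IT": 4,
    "LIKE VERY MUCH": 5,
}

def recode_likert(series: pd.Series) -> pd.Series:
    """Map text labels to 1-5; labels not in LIKERT become NaN."""
    return series.map(LIKERT).astype("float64")

# Toy rows (hypothetical, not real responses)
toy = pd.DataFrame({
    "jazz": ["LIKE IT", ".i:  Inapplicable", "DISLIKE VERY MUCH"],
    "rap": ["MIXED FEELINGS", "LIKE VERY MUCH", ".n:  No answer"],
})

recoded = toy.apply(recode_likert)
missing_counts = recoded.isna().sum()

print(recoded)
print(missing_counts)
```

Applying the helper only to the genre columns would leave `gomusic` and `plymusic` untouched, and comparing `missing_counts` against the label counts in the raw data is a quick check that nothing was dropped by accident.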


In [24]:
df.head()

Unnamed: 0,year,id_,country,blues,classicl,folk,gospel,jazz,rap,oldies,hvymetal,gomusic,plymusic
0,1993,1,3.0,4.0,5.0,3.0,2.0,4.0,1.0,2.0,1.0,NO,NO
1,1993,2,3.0,5.0,5.0,4.0,5.0,5.0,2.0,5.0,1.0,NO,NO
2,1993,3,3.0,3.0,5.0,4.0,3.0,3.0,2.0,5.0,2.0,YES,NO
3,1993,4,3.0,3.0,5.0,1.0,1.0,4.0,1.0,4.0,1.0,NO,NO
4,1993,5,2.0,5.0,5.0,5.0,3.0,5.0,,1.0,,YES,NO
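
With the ratings on a 1-5 scale, the numeric summaries for the task could start from something like the sketch below. It uses only the five rows shown in the `df.head()` output above (a subset of columns), so the numbers it prints illustrate the mechanics and say nothing about the full sample. Spearman rank correlation is used here as an assumption on my part, since the ratings are ordinal; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# First five rows from the df.head() output (subset of columns)
sample = pd.DataFrame({
    "blues": [4, 5, 3, 3, 5],
    "folk": [3, 4, 4, 1, 5],
    "jazz": [4, 5, 3, 4, 5],
    "rap": [1, 2, 2, 1, np.nan],
    "gomusic": ["NO", "NO", "YES", "NO", "YES"],
})
genres = ["blues", "folk", "jazz", "rap"]

# Count, mean, quartiles, etc. for each genre (NaN is ignored)
summary = sample[genres].describe()

# Pairwise rank correlation between genre ratings
corr = sample[genres].corr(method="spearman")

# Mean rating per genre, split by concert attendance
means_by_attendance = sample.groupby("gomusic")[genres].mean()

print(summary)
print(corr)
print(means_by_attendance)
```

On the full data, the same three calls on `df` would give a first look at which genres tend to be liked together and whether ratings differ by `gomusic` or `plymusic`.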
