# **Summary**

For Project 1, our group worked with data from the General Social Survey (GSS), which has been conducted repeatedly since 1972 and is provided by NORC at the University of Chicago. We found an abundance of interesting variables, but we particularly wanted to explore whether and how mental health and different types of employment conditions are correlated. Our method consisted of choosing the dataset (gss_chunk_3.parquet), inspecting and cleaning the data for consistency, wrangling the data to prepare it for analysis, visualizing the data with a variety of graphs, and finally interpreting the results.

Our results showed weak or near-zero correlations between our variables, which suggests that other factors are in play and that there is more to test and verify beyond this project. Judging from our visuals, respondents who worked for someone else reported more days of poor mental health per month than those who were self-employed (the wrkslf variable records whether the respondent was self-employed or worked for someone else). However, there was almost no correlation between hours worked and days of poor mental health in a month when grouped by wrkslf. Across income groups, respondents in the lower income group reported more days of poor mental health than those in the higher income groups. Here too, the correlation between hours worked and days of poor mental health was weak when grouped by income.

# **Data**

For our dataset, we chose to use the provided file, gss_chunk_3.parquet. As we approached this dataset, we wanted to focus on the latest data available, so we decided to only use data from 2018, 2021, and 2022. There was no available data for the years 2019 and 2020, which is why there is a gap between 2018 and 2021.

Because we were exploring the correlation between mental health and employment conditions, we used the following variables:
- mntlhlth (numerical): days of poor mental health in the past 30 days ("For how many days during the past 30 days was your mental health not good?", where poor mental health includes stress, depression, and problems with emotions)
- hrs2 (numerical): number of hours usually worked in a week
- income16 (categorical): total family income
- wrkslf (categorical): self employed or working for someone else

As we began to read, clean, and prepare the data for analysis, we came across two challenges. The first was a large number of NA values. Because our dataset comes from a social survey, respondents often left questions unanswered or did not give meaningful answers. Our group considered imputing missing values with the median; however, because so many values were missing, imputation would have greatly influenced our results, so we decided to remove all rows with NA values. The second challenge followed from the first: removing NA values significantly reduced the amount of data we could work with (120 complete rows out of 9,924). This influenced our decision to use multiple years of data instead of just the latest year, 2022, which allowed us to use more of the dataset while still preserving our purpose.
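The trade-off between dropping and imputing can be made concrete with a small sketch. The toy values below are made up for illustration and are not taken from the GSS file; the idea is simply to measure the share of missing values per column before deciding how to handle them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy survey extract; the values are invented for illustration.
toy = pd.DataFrame({
    "mntlhlth": [0.0, np.nan, 3.0, 1.0, np.nan, 30.0],
    "hrs2":     [40.0, np.nan, np.nan, 45.0, np.nan, 20.0],
    "wrkslf":   ["someone else", "someone else", None,
                 "self-employed", "someone else", "someone else"],
})

# Share of missing values in each column.
missing_share = toy.isna().mean()
print(missing_share)

# Option 1: complete-case analysis, which is what dropna() does.
complete = toy.dropna()
print("Rows kept after dropna():", len(complete), "of", len(toy))

# Option 2: median imputation for a numeric column (considered, not used).
imputed = toy["hrs2"].fillna(toy["hrs2"].median())
print("Share of hrs2 values that would be imputed:", toy["hrs2"].isna().mean())
```

When half of a column would have to be filled with a single median value, the imputed column mostly reflects that constant, which is the concern described above.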

The following cells include our data cleaning and visualization process.


In [None]:
! git clone "https://github.com/gdbwoo/DS-3001-Projects"

Cloning into 'DS-3001-Projects'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 24 (delta 3), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (24/24), 7.89 MiB | 5.33 MiB/s, done.
Resolving deltas: 100% (3/3), done.


In [None]:
# Import all packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Observe all unfiltered data
df_all = pd.read_parquet('DS-3001-Projects/Project 1/gss_chunk_3.parquet')
print(df_all.shape, '\n')
df_all.head()

(24130, 6694) 



Unnamed: 0,year,id,wrkstat,hrs1,hrs2,evwork,occ,prestige,wrkslf,wrkgovt,...,agehef12,agehef13,agehef14,hompoph,wtssps_nea,wtssnrps_nea,wtssps_next,wtssnrps_next,wtsscomp,wtsscompnr
0,2006,1751,working full time,40.0,,,,,someone else,government,...,,,,,,,,,1.079141,0.96115
1,2006,1752,in school,,,yes,,,someone else,private,...,,,,,,,,,7.673834,6.641571
2,2006,1753,working full time,35.0,,,,,someone else,government,...,,,,,,,,,0.584663,0.512145
3,2006,1754,working full time,50.0,,,,,someone else,government,...,,,,,,,,,0.715731,0.637592
4,2006,1755,working full time,40.0,,,,,someone else,private,...,,,,,,,,,1.094831,0.956094


In [None]:
year = df_all['year']
print(year.unique(), '\n')

[2006 2008 2010 2012 2014 2016 2018 2021 2022] 



In [None]:
# Make new dataframe with years 2022, 2021, 2018

#df = df_all.loc[(df_all['year'] == 2022)]
df = df_all.loc[df_all['year'].isin([2018, 2021, 2022])]
print(df.shape, '\n')
df.head()

(9924, 6694) 



Unnamed: 0,year,id,wrkstat,hrs1,hrs2,evwork,occ,prestige,wrkslf,wrkgovt,...,agehef12,agehef13,agehef14,hompoph,wtssps_nea,wtssnrps_nea,wtssps_next,wtssnrps_next,wtsscomp,wtsscompnr
14206,2018,1,"with a job, but not at work because of tempora...",,41.0,,,,someone else,private,...,,,,,,,,,1.908104,2.244275
14207,2018,2,retired,,,yes,,,someone else,private,...,,,,,,,,,0.91455,1.095217
14208,2018,3,working full time,40.0,,,,,someone else,private,...,,,,,,,,,0.609109,0.740432
14209,2018,4,working full time,40.0,,,,,someone else,private,...,,,,,,,,,0.642403,0.769342
14210,2018,5,retired,,,yes,,,someone else,private,...,,,,,,,,,0.396347,0.462239


In [None]:
# Confirm all data is from correct years: 2022, 2021, 2018
year = df['year']
print(year.unique(), '\n')

[2018 2021 2022] 



In [None]:
# Select the columns we want to use: mntlhlth, hrs2, income16, wrkslf
df = df.loc[:,['mntlhlth','hrs2','income16', 'wrkslf']]
print(df.shape, '\n')
print(df.head())


(9924, 4) 

       mntlhlth  hrs2              income16        wrkslf
14206      20.0  41.0                   NaN  someone else
14207       NaN   NaN    $30,000 to $34,999  someone else
14208       3.0   NaN  $150,000 to $169,999  someone else
14209       1.0   NaN      $170,000 or over  someone else
14210       NaN   NaN      $170,000 or over  someone else


In [None]:
# Remove all nan/missing values
df = df.dropna()
print(df.shape, '\n')
print(df.head())

(120, 4) 

       mntlhlth  hrs2              income16        wrkslf
14293       0.0  40.0      $170,000 or over  someone else
14296       1.0  45.0   $90,000 to $109,999  someone else
14303       0.0  50.0  $130,000 to $149,999  someone else
14310       0.0  24.0    $60,000 to $74,999  someone else
14444       0.0  15.0  $150,000 to $169,999  someone else


In [None]:
# Make a copy of the mntlhlth column
# mntlhlth: days of poor mental health past 30 days, for how many days during the past 30 days was your mental health not good?
# mental health: stress, depression, problems with emotions
mh = df['mntlhlth']
print(mh.unique(), '\n')
mh.value_counts()

[ 0.  1.  2. 10. 14.  5. 30.  3. 15. 25.  4. 20.  6.  7.  8. 12.] 



0.0     51
30.0    13
5.0     11
2.0      9
10.0     7
15.0     6
1.0      4
3.0      4
14.0     3
4.0      3
20.0     3
25.0     2
6.0      1
7.0      1
8.0      1
12.0     1
Name: mntlhlth, dtype: int64

In [None]:
# Confirm there are no missing values
print('Total missing: ', sum(mh.isnull()))

Total missing:  0


In [None]:
# number of hours worked per week
hoursworked = df["hrs2"]
hoursworked.value_counts()

40.0    49
50.0    16
45.0     5
35.0     5
30.0     4
60.0     4
25.0     4
48.0     3
10.0     3
55.0     3
38.0     2
80.0     2
36.0     2
70.0     2
6.0      2
15.0     2
24.0     2
84.0     1
12.0     1
46.0     1
9.0      1
42.0     1
32.0     1
1.0      1
66.0     1
21.0     1
52.0     1
Name: hrs2, dtype: int64

In [None]:
print('Total missing: ', sum(hoursworked.isnull()))

Total missing:  0


In [None]:
# income brackets
income = df["income16"]
income.value_counts()
#print(income.unique(), '\n')
#income.dtype

$60,000 to $74,999               19
$170,000 or over                 14
$90,000 to $109,999              14
$50,000 to $59,999               12
$75,000 to $89,999                7
$40,000 to $49,999                7
$30,000 to $34,999                7
$35,000 to $39,999                6
$110,000 to $129,999              5
$25,000 to $29,999                4
$150,000 to $169,999              4
$130,000 to $149,999              4
$20,000 to $22,499                4
$17,500 to $19,999                3
under $1,000                      2
$7,000 to $7,999                  2
$22,500 to $24,999                1
$12,500 to $14,999                1
$10,000 to $12,499                1
$1,000 to $2,999                  1
$5,000 to $5,999                  1
$15,000 to $17,499                1
no answer                         0
not available in this year        0
not available in this release     0
uncodeable                        0
skipped on web                    0
refused                     

In [None]:
# Remove all unused categories
income = income.cat.remove_unused_categories()  # source: https://stackoverflow.com/questions/62090972/why-does-pandas-value-counts-show-a-count-of-zero-for-some-values
income.value_counts()

$60,000 to $74,999      19
$170,000 or over        14
$90,000 to $109,999     14
$50,000 to $59,999      12
$40,000 to $49,999       7
$75,000 to $89,999       7
$30,000 to $34,999       7
$35,000 to $39,999       6
$110,000 to $129,999     5
$20,000 to $22,499       4
$25,000 to $29,999       4
$130,000 to $149,999     4
$150,000 to $169,999     4
$17,500 to $19,999       3
$7,000 to $7,999         2
under $1,000             2
$22,500 to $24,999       1
$1,000 to $2,999         1
$15,000 to $17,499       1
$12,500 to $14,999       1
$10,000 to $12,499       1
$5,000 to $5,999         1
Name: income16, dtype: int64

In [None]:
print('Total missing: ', sum(income.isnull()))

Total missing:  0


In [None]:
income = income.replace(['under $1,000', '$1,000 to $2,999', '$5,000 to $5,999', '$7,000 to $7,999', '$10,000 to $12,499',
                         '$12,500 to $14,999', '$15,000 to $17,499', '$17,500 to $19,999', '$20,000 to $22,499', '$22,500 to $24,999',
                         '$25,000 to $29,999', '$30,000 to $34,999', '$35,000 to $39,999', '$40,000 to $49,999', '$50,000 to $59,999'],'Lower')
income = income.replace(['$60,000 to $74,999', '$75,000 to $89,999', '$90,000 to $109,999', '$110,000 to $129,999', '$130,000 to $149,999'],'Middle')
income = income.replace(['$150,000 to $169,999', '$170,000 or over'],'Upper')
income.value_counts()

Lower     53
Middle    49
Upper     18
Name: income16, dtype: int64

In [None]:
# Replace original income column with cleaned income column for updated dataframe
df['income16'] = income
df['income16'].value_counts()

Lower     53
Middle    49
Upper     18
Name: income16, dtype: int64
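As an alternative sketch of the income recoding, the bracket-to-group assignment can be kept in a single dictionary and applied with `map`. Only a handful of the bracket labels from the output above are listed here to keep the example short, and the `sample` series is hypothetical.

```python
import pandas as pd

# Group labels follow the notebook; only a few brackets are listed for brevity.
groups = {
    "Lower": ["under $1,000", "$30,000 to $34,999", "$50,000 to $59,999"],
    "Middle": ["$60,000 to $74,999", "$90,000 to $109,999"],
    "Upper": ["$150,000 to $169,999", "$170,000 or over"],
}

# Invert to bracket -> group so that one map() call does the recoding.
bracket_to_group = {b: g for g, brackets in groups.items() for b in brackets}

# Hypothetical sample of raw bracket labels.
sample = pd.Series(["$60,000 to $74,999", "$170,000 or over",
                    "under $1,000", "$1,000 to $2,999"])
recoded = sample.map(bracket_to_group)
print(recoded.tolist())

# Brackets missing from the mapping become NaN, so omissions are easy to spot.
print("Unmapped:", sample[recoded.isna()].tolist())
```

One reason to prefer this form is that `replace` leaves any unlisted bracket unchanged, whereas `map` turns it into a missing value that shows up in a null check.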

In [None]:
# self employed or not
employment = df["wrkslf"]
employment.value_counts()

someone else                     106
self-employed                     14
don't know                         0
iap                                0
I don't have a job                 0
dk, na, iap                        0
no answer                          0
not imputable_(2147483637)         0
not imputable_(2147483638)         0
refused                            0
skipped on web                     0
uncodeable                         0
not available in this release      0
not available in this year         0
see codebook                       0
Name: wrkslf, dtype: int64

In [None]:
# Remove all unused categories
employment = employment.cat.remove_unused_categories()
employment.value_counts()

someone else     106
self-employed     14
Name: wrkslf, dtype: int64

In [None]:
print('Total missing: ', sum(employment.isnull()))

Total missing:  0


In [None]:
# Replace original self employed column with cleaned self-employed column for updated dataframe
df['wrkslf'] = employment
df['wrkslf'].value_counts()

someone else     106
self-employed     14
Name: wrkslf, dtype: int64

In [None]:
# Changed categorical columns into object types for easier visualization and interaction manipulation
df["wrkslf"] = df["wrkslf"].astype('object')
df["income16"] = df["income16"].astype('object')

In [None]:
pd.crosstab(df['wrkslf'],df['income16'], normalize = 'all')

Most respondents are lower (0.358) or middle (0.392) class employees. The self-employed middle and upper class were the least represented in the data (0.016 each).
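The proportions above are shares of the whole table because of `normalize = 'all'`. A minimal sketch on made-up rows shows how that differs from normalizing within each employment type (`normalize='index'`); the toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: three employees and one self-employed respondent.
toy = pd.DataFrame({
    "wrkslf": ["someone else", "someone else", "someone else", "self-employed"],
    "income16": ["Lower", "Middle", "Lower", "Upper"],
})

# Each cell is a share of ALL respondents; the whole table sums to 1.
share_all = pd.crosstab(toy["wrkslf"], toy["income16"], normalize="all")
print(share_all)

# Each cell is a share of its ROW; every row sums to 1.
share_row = pd.crosstab(toy["wrkslf"], toy["income16"], normalize="index")
print(share_row)
```

With table-wide shares, a small group such as the self-employed will always have small cells, so row-wise shares can be easier to read when comparing income mixes between the two employment types.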

In [None]:
# Data Visualization

In [None]:
sns.kdeplot(df,x="mntlhlth")
df['mntlhlth'].describe()

Looking at the mental health variable, the density plot is skewed to the right: most respondents reported few or no days of poor mental health, with a long tail reaching 30 days. On average, respondents reported around 7 (6.97) days of poor mental health in the past month.
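The direction of the skew can also be checked numerically by comparing the mean with the median. The sketch below rebuilds the 120 mntlhlth responses from the value counts printed earlier in the notebook, using only the standard library.

```python
from statistics import mean, median

# Value counts of mntlhlth copied from the notebook output (value -> respondents).
counts = {0: 51, 30: 13, 5: 11, 2: 9, 10: 7, 15: 6, 1: 4, 3: 4,
          14: 3, 4: 3, 20: 3, 25: 2, 6: 1, 7: 1, 8: 1, 12: 1}

# Expand the counts back into one entry per respondent.
days = [value for value, n in counts.items() for _ in range(n)]

print(len(days))             # 120
print(round(mean(days), 2))  # 6.97
print(median(days))          # 2.0
```

A mean well above the median is what a right-skewed distribution looks like: a large group at or near zero, with a smaller group of high values pulling the average up.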

In [None]:
sns.kdeplot(df,x="mntlhlth",hue="wrkslf")
df.loc[:,["wrkslf","mntlhlth"]].groupby("wrkslf").describe()

On average, self-employed workers reported around 5 (4.93) days of poor mental health in a month, while workers employed by someone else reported around 7 (7.24).

In [None]:
sns.kdeplot(df,x="mntlhlth",hue="income16")
df.loc[:,["income16","mntlhlth"]].groupby("income16").describe()

On average, workers in the lower income group reported around 9 (9.24) days of poor mental health in a month, workers in the middle income group reported around 6 (5.84), and workers in the upper income group reported around 3 (3.33).




In [None]:
df["WorkStatusXIncomeLevel"] = df["wrkslf"]+df["income16"]
sns.kdeplot(df,x="mntlhlth",hue="WorkStatusXIncomeLevel")
df.loc[:,['wrkslf','income16','mntlhlth']].groupby(['wrkslf','income16']).describe()

Looking at the interaction between income level and work status, the group with the most days of poor mental health on average was lower-income workers employed by someone else (9.93). The group with the fewest days on average was self-employed middle-income workers (0.00).

In [None]:
# kernel density plot
ax = sns.kdeplot(data=df,x='hrs2',hue='wrkslf')
ax.get_legend().set_title('Work Status')  # keep seaborn's own labels so colors and groups stay matched
plt.show()
df.groupby('wrkslf')['hrs2'].describe()

The kernel density plot shows an approximately normal distribution for the hours worked variable, which is helpful for future analyses that assume normality. Self-employed workers work 31.3 hours a week on average, while workers employed by someone else work 41.7 hours on average.

In [None]:
ax = sns.kdeplot(data=df,x='hrs2',hue='income16',hue_order=["Lower","Middle","Upper"])
ax.get_legend().set_title('Income')  # keep seaborn's own labels so colors and groups stay matched
plt.show()
df.groupby('income16')['hrs2'].describe()


The kernel density plot shows an approximately normal distribution of hours worked within each income group. Again, the normal shape is helpful for future analyses that assume normality. The lower income group works 38 hours a week on average (38.01), the middle income group works 42 hours a week (41.65), and the upper income group works around 45 hours a week (44.78).

In [None]:
sns.scatterplot(data=df,x='hrs2',y='mntlhlth', hue='wrkslf')
correlation_employment = df.groupby('wrkslf').apply(lambda x: x['hrs2'].corr(x['mntlhlth']))
print(correlation_employment)  # show the per-group correlations quoted below

In the scatterplot comparing days of poor mental health to hours worked (grouped by employment), the correlation between the two variables was slightly higher for self-employed workers (0.03) than for workers employed by someone else (0.01). Both values are close to zero, so there seems to be almost no correlation between hours worked and days of poor mental health in a month when grouped by employment.
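For reference, the statistic reported here is the Pearson correlation coefficient, which is what `Series.corr` computes by default. A minimal standard-library sketch of the formula follows; the `pearson_r` helper and the toy lists are hypothetical and only illustrate what values near 1 and near 0 mean.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data, not taken from the GSS.
hours = [20, 30, 40, 50, 60]
rising = [2, 4, 6, 8, 10]   # increases in lockstep with hours
unrelated = [5, 0, 10, 0, 5]  # no linear trend with hours

print(pearson_r(hours, rising))     # 1.0
print(pearson_r(hours, unrelated))  # 0.0
```

Values such as 0.01 and 0.03 sit very close to the second case, meaning hours worked says almost nothing about the other variable in a linear sense.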

In [None]:
sns.scatterplot(data=df,x='hrs2',y='mntlhlth', hue='income16')
correlation_income = df.groupby('income16').apply(lambda x: x['hrs2'].corr(x['mntlhlth']))
print(correlation_income)  # show the per-group correlations quoted below

In the scatterplot comparing days of poor mental health to hours worked (grouped by income), the correlation between the two variables was highest in the lowest income group (0.16) and lowest in the middle income group (0.05). Overall, there is only a weak correlation between hours worked and days of poor mental health in a month when grouped by income.
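Because the income groups differ in size, it can help to report the number of respondents next to each correlation. The sketch below does this on made-up rows with an explicit loop over the groups; the toy values and the `summary` table are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: four lower-income and three upper-income respondents.
toy = pd.DataFrame({
    "income16": ["Lower"] * 4 + ["Upper"] * 3,
    "hrs2": [20, 30, 40, 50, 40, 45, 50],
    "mntlhlth": [2, 4, 6, 8, 3, 2, 1],
})

# Collect the group size and the within-group correlation side by side.
rows = []
for group, part in toy.groupby("income16"):
    rows.append({
        "income16": group,
        "n": len(part),
        "r": part["hrs2"].corr(part["mntlhlth"]),
    })

summary = pd.DataFrame(rows).set_index("income16")
print(summary)
```

A correlation computed from a handful of rows can swing widely, so showing `n` alongside `r` makes it easier to judge how much weight each group's value deserves.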

# **Results**

To get a better grasp of our data, we looked at the distributions of the predictor variables (hrs2, income16, wrkslf) as well as the response variable (mntlhlth). We mainly used grouped kernel density plots to show the distributions of the numerical variables and a crosstab to show the joint distribution of the categorical variables. In addition, we used grouped scatter plots to see whether there was a relationship between the number of days of poor mental health in a month and the number of hours worked in a week.

Based on our crosstab of the two categorical predictors (wrkslf, income16) we found most respondents of our filtered dataset were lower (0.358) or middle (0.392) class employees. The self-employed middle and upper class were the least represented in the data (0.016 each).  

Looking at the mental health variable, the density plot was skewed to the right. On average, respondents reported around 7 (6.97) days of poor mental health in a month. When taking employment into consideration, self-employed workers reported around 5 (4.93) days on average, while workers employed by someone else reported around 7 (7.24). Across income groups, workers in the lower income group reported around 9 (9.24) days of poor mental health in a month on average, workers in the middle income group reported around 6 (5.84), and workers in the upper income group reported around 3 (3.33).

The kernel density plot for the hours worked variable showed an approximately normal distribution. Self-employed workers work 31.3 hours a week on average, while workers employed by someone else work 41.7 hours on average. The kernel density plots of hours worked by income group also showed approximately normal distributions. The lower income group works 38 hours a week on average (38.01), the middle income group works 42 hours a week (41.65), and the upper income group works around 45 hours a week (44.78).

In the scatterplot comparing days of poor mental health to hours worked (grouped by employment), the correlation between the two variables was slightly higher for self-employed workers (0.03) than for workers employed by someone else (0.01). Both values are close to zero, so there seems to be almost no correlation between hours worked and days of poor mental health in a month when grouped by employment. In the scatterplot grouped by income, the correlation was highest in the lowest income group (0.16) and lowest in the middle income group (0.05). Overall, there is only a weak correlation between hours worked and days of poor mental health when grouped by income.

# **Conclusion**

To summarize, our project was a way for our team to practically use the skills we’ve learned in class: data wrangling, exploratory data analysis, and visualization. Our dataset came from the General Social Survey (GSS), which has been conducted repeatedly for over fifty years. After examining the data, we chose the research question: How do different kinds of employment conditions affect mental health? We ran into a few problems with the dataset, which required us to clean the data by removing missing values, filtering to the last three years in which the survey was conducted, and selecting certain variables from the vast dataset (mntlhlth, hrs2, income16, and wrkslf).

We then used the cleaned dataset to make a few visualizations, including kernel density plots and scatter plots. The kernel density plots showed the distributions of the numerical variables, and a crosstab displayed the joint distribution of the categorical variables. The grouped scatterplots were used to look for a relationship between mntlhlth and hrs2. From the density plots, we can see that, on average, self-employed respondents and those with higher incomes reported fewer days of poor mental health in a month. In the scatterplots, however, the correlation between hours worked and days of poor mental health was close to zero for both self-employed respondents (0.03) and those who worked for someone else (0.01). When grouped by income, the correlation was weak overall, with the lowest income group showing the highest value (0.16).

Readers of our paper may have various points of criticism. For example, they could ask about our variable selection process, cleaning process, and grouping choices. We acknowledge that the variables we chose are limited in showing the full story of employment conditions and their correlation with mental health. However, we chose the variables we judged most relevant among those available in the dataset. If we could expand the scope of our project, there are certainly more data points and even other datasets that could better illustrate that relationship. Furthermore, there may be concerns regarding how we chose to group income16 into different income levels. These income levels were chosen based on Pew Research Center’s income brackets listed on their website. This did result in an uneven number of data points per income level; however, given the many narrow income ranges left after cleaning the data, it made the most intuitive sense to group them by income bracket. Ideally, if the scope of the project were bigger and the dataset contained fewer missing values, we would have data points for each income range and more total data points to support better analyses. To conclude, our analysis was limited by the sparse responses (only 120 complete rows remained after cleaning) and by the thoroughness of the questions asked in the GSS. Given more time and resources, we could look for more datasets outside the scope of the project to enhance our analyses.