# Analyzing Mental Health in America


Andy Alba, Andy Garcia, David Wang, Nada Dalloul, Allyson Ling

## Introduction


Over the past few years, mental health has become a less taboo topic in the United States, and many people are beginning to talk openly about their own psychological state. However, there has also been a well-documented increase in mental illness, specifically depression, across the country. To shed light on the potential causes of this trend, we use survey data to explore how different factors affect the mental health status of the American population. Given the overwhelming impact of technology on today’s society and the increasing use of social media as a source of community, we also hope to compare the scope of information and the topics covered in online mental health forums with the survey data.

## Data

We used data from the National Health Interview Survey, an annual cross-sectional survey by the National Center for Health Statistics that aims to illustrate the health status of those living in the United States. The survey covers a wide range of factors, from health insurance to annual income. Data is available for every year from 2004 to 2017, and each year is split into multiple files. We began by looking only at the “Person” file from the most recent data, 2017, because it contained a clear indicator of whether a subject had depression. The original CSV contained 700 features, but after going through the features individually in the data documentation, we selected 46 possible features of interest. These features are listed below:


We initially believed that the “Person” file also included the physical weight of each individual. Upon further inspection, however, we discovered that key physical characteristics such as weight and height were in the “Sample Adult” and “Sample Child” files. According to the data documentation, these values have been modified to protect the privacy of the individuals surveyed. Height was limited to 59-70 inches for women and 63-76 inches for men. Similarly, weight was limited to 100-274 pounds for women and 126-299 pounds for men. BMI was calculated even for persons whose data was altered, so we made note of that as well. Although we could have applied for access to some of the confidential data, we did not pursue this further because of the cost.

## Current Progress

One of the first things we had to do was recode the values in the data. The data was saved with numerical keys representing the possible answers to a given survey question. The key-answer pairings were stored in a separate PDF, so we converted these pairings to a Python dictionary and replaced the numeric keys in the data with their corresponding answers.
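
The recoding step described above can be sketched roughly as follows. This is a minimal illustration using pandas, not the exact code we ran: the column names, codes, and labels in `codebook` are hypothetical stand-ins for the pairings in the documentation PDF.

```python
import pandas as pd

# Hypothetical codebook: column name -> {numeric key: answer label}.
# The columns, codes, and labels are invented for illustration only.
codebook = {
    "SEX": {1: "Male", 2: "Female"},
    "DEPRESSED": {1: "Yes", 2: "No", 9: "Don't know"},
}

# Tiny stand-in for the raw survey data, which stores numeric keys.
raw = pd.DataFrame({"SEX": [1, 2, 2], "DEPRESSED": [2, 1, 7]})

decoded = raw.copy()
for column, mapping in codebook.items():
    # Look up each key in the mapping; keep the original key when it is
    # missing from the codebook so unknown codes are not silently dropped.
    decoded[column] = raw[column].map(lambda code: mapping.get(code, code))

print(decoded)
```

Keeping unmapped keys (the `7` in this toy data) instead of turning them into missing values makes it easier to spot gaps in the dictionary when comparing against the documentation.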

Next, for our exploratory analysis, we wanted to see the overall spread and distribution of those who responded to mental health questions. We visualized these distributions as interactive bar plots in plot.ly, so that a single plot could show the distributions of several groups at once or of one group at a time.


After subsetting our data to only those who answered mental health questions, the size of our dataset went from around 17,000 observations to 1,500.
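
As a rough sketch of this subsetting step, the snippet below drops respondents with no answer to a mental health question and then tabulates the counts that a grouped bar plot would display. The column names `age_group` and `depression_length` and the toy rows are assumptions made for illustration; they are not the survey's variable names.

```python
import pandas as pd

# Toy data with hypothetical column names. A missing value in
# "depression_length" stands for a respondent who did not answer.
people = pd.DataFrame({
    "age_group": ["18-39", "40+", "40+", "18-39", "40+"],
    "depression_length": [
        None,
        "More than 1 year",
        "Less than 1 year",
        "More than 1 year",
        "More than 1 year",
    ],
})

# Keep only respondents who answered the depression question.
answered = people.dropna(subset=["depression_length"])

# Count respondents per age group and depression length; these are the
# bar heights of a grouped bar plot.
counts = pd.crosstab(answered["age_group"], answered["depression_length"])

print(f"{len(people)} rows before subsetting, {len(answered)} after")
print(counts)
```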


In [7]:
from IPython.display import IFrame

In [3]:
IFrame('depressionXincome.html', width=950, height=600)

The above graph, Depression Length X Income, shows the income distribution of the people surveyed who answered the survey question on depression. We can see that the majority of people who are depressed have been depressed for more than 1 year, with very few people identifying as being depressed for less than a year. This may be because depression tends to develop over a long period of time rather than over a few days. In addition, we can see that a majority of people who are depressed have a salary of less than $5,000. This suggests that income may play a large role in a person's depression, or vice versa.

In [1]:
IFrame('depressionXage.html', width=950, height=600)

The above graph compares the length of a person's depression with their age. From this we can see that a majority of the people with depression tend to be above 40. We can also see that there are more people in this plot than in the plot of depression length and income. This suggests that a majority of people who have been depressed for more than a year did not report their income or did not have their income surveyed.

In [8]:
IFrame('depressionXageXincome<5k.html', width=950, height=600)

## Future Plan


Our next steps are to combine the survey data from 2004-2017 and to bring together multiple components of the survey. We are interested in the physical characteristics of the people identified as having depression, which are in the “Sample Adult” and “Sample Child” files. Thus, we want to merge the “Person”, “Sample Adult”, and “Sample Child” data for every year. We will also merge in the “Family” data in case we want to use it in the future. All the files from 2006-2014 are currently in ASCII format, and we will use the SAS files provided to convert and merge them into CSVs, which can be read easily into Python for analysis.
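
Once the files are available as CSVs, the planned merge might look something like the sketch below. It is only an outline under stated assumptions: the identifier columns in `KEYS` (`SRVY_YR`, `HHX`, `FMX`, `FPX`) and the `DEPRESSED` and `WEIGHT` columns are assumed names that would need to be checked against each year's documentation, and the small data frames stand in for the real files.

```python
import pandas as pd

# Assumed identifier columns shared by the files (survey year, household,
# family, person). Verify these against each year's documentation.
KEYS = ["SRVY_YR", "HHX", "FMX", "FPX"]

# Toy stand-ins for two years of the "Person" file.
person_2016 = pd.DataFrame({
    "SRVY_YR": [2016, 2016],
    "HHX": ["000001", "000002"],
    "FMX": ["01", "01"],
    "FPX": ["01", "01"],
    "DEPRESSED": ["No", "Yes"],
})
person_2017 = pd.DataFrame({
    "SRVY_YR": [2017],
    "HHX": ["000001"],
    "FMX": ["01"],
    "FPX": ["01"],
    "DEPRESSED": ["Yes"],
})

# Toy stand-in for the "Sample Adult" file; only some people appear in it.
sample_adult = pd.DataFrame({
    "SRVY_YR": [2016, 2017],
    "HHX": ["000002", "000001"],
    "FMX": ["01", "01"],
    "FPX": ["01", "01"],
    "WEIGHT": [150, 182],
})

# Stack the yearly "Person" files, then attach the adult measurements.
# A left join keeps every person; validate raises an error if the keys
# do not identify rows uniquely on both sides.
person = pd.concat([person_2016, person_2017], ignore_index=True)
merged = person.merge(sample_adult, on=KEYS, how="left", validate="1:1")

print(merged)
```

Including the survey year in the keys matters because household and person numbers may repeat from one year to the next. The “Sample Child” and “Family” files could be attached the same way, although the “Family” file would presumably be joined on the household and family identifiers only.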


The files for 2004-2005 are currently unavailable, and we have contacted the Centers for Disease Control and Prevention, the agency that the National Center for Health Statistics is part of, to see if we can obtain this data.
