Make sure to go to the **Edit** menu and click **Clear all outputs** before and after running this notebook.

# Load CSV files from S3 bucket into Spark dataframes

## Start Spark Session

In [0]:
# Install Java, Spark, and Findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

# Start a SparkSession
import findspark
findspark.init()

In [0]:
!wget https://jdbc.postgresql.org/download/postgresql-42.2.9.jar

In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MentalHealthETL").config("spark.driver.extraClassPath","/content/postgresql-42.2.9.jar").getOrCreate()

## Mount Google Drive into this runtime

To access the CSV data files in the S3 bucket, you need to mount your Google Drive into this runtime, because the notebook reads its AWS credentials from a file stored there. To do that, run the following cells.

The first cell displays a URL. Open it, sign in, and paste the authentication code it gives you into the input box. Your Google Drive is then mounted at `/content/gdrive`.

In [0]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

In [0]:
%cd /content/gdrive/My Drive/data_final_project/mental_health_ML

## Create config.py file

In the **mental_health_ML** directory, create a file called **config.py** and add the following contents:

```python
ACCESS_ID='AWS_ACCESS_KEY_ID'
ACCESS_KEY='AWS_SECRET_ACCESS_KEY'
BUCKET_NAME='S3_BUCKET_NAME'
```

Replace `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `S3_BUCKET_NAME` with their actual values. This file is listed in `.gitignore`, so it won't be committed to GitHub.
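
A quick sanity check can catch a **config.py** that still contains the placeholder strings. The sketch below is only an illustration: `find_unset_values` is a hypothetical helper that is not part of this notebook, and it assumes the placeholders are exactly the three strings shown above.

```python
# Placeholder strings from the config.py template above.
PLACEHOLDERS = {"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "S3_BUCKET_NAME"}


def find_unset_values(settings):
    """Return the names of settings that are empty or still hold a placeholder."""
    return sorted(
        name for name, value in settings.items()
        if not value or value in PLACEHOLDERS
    )


# Example with made-up values: only BUCKET_NAME has been filled in.
example = {
    "ACCESS_ID": "AWS_ACCESS_KEY_ID",
    "ACCESS_KEY": "",
    "BUCKET_NAME": "my-survey-bucket",
}
unset = find_unset_values(example)
print(unset)
```

In the notebook, the same function could be called with the three values imported from `config` before the `boto3` resource is created.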

## Access S3 bucket where csv files are stored

In [0]:
# Install the AWS SDK for Python
!pip install boto3

In [0]:
import boto3
from config import ACCESS_ID, ACCESS_KEY, BUCKET_NAME

# Connect to Amazon S3 with the credentials from config.py
s3 = boto3.resource('s3', aws_access_key_id=ACCESS_ID, aws_secret_access_key=ACCESS_KEY)
bucket_name = BUCKET_NAME

# Bucket where the CSV files are stored
bucket = s3.Bucket(bucket_name)

## Read in data from S3 bucket and load into Spark dataframes

In [0]:
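# Read in data from S3 bucket
import os
from pyspark import SparkFiles

original_dataframes = {}

for obj in bucket.objects.all():
  key = obj.key
  # Assumes each key ends in "<year>.csv": the last four characters before the extension are the survey year
  key_without_extension = key[:-4]
  year = key_without_extension[-4:]
  url = f"https://{bucket_name}.s3.amazonaws.com/{key}"
  # Download the file through Spark, then read the local copy (SparkFiles stores it under its base name)
  spark.sparkContext.addFile(url)
  original_dataframes[year] = spark.read.csv(SparkFiles.get(os.path.basename(key)), sep=",", header=True, inferSchema=True)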

# Show DataFrame
original_dataframes["2019"].show(n=5)
# original_dataframes["2018"].show(n=5)
# original_dataframes["2017"].show(n=5)
# original_dataframes["2016"].show(n=5)
# original_dataframes["2014"].show(n=5)
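
The loop above takes the year from the last four characters before the file extension, which silently produces a wrong dictionary key if an object is not named like `<name>2019.csv`. A slightly more defensive version might look like the sketch below. `year_from_key` and the example keys are hypothetical; the real object names in the bucket are not shown in this notebook.

```python
import os
import re


def year_from_key(key):
    """Return the four-digit year at the end of an S3 key's base name, or None."""
    base_name = os.path.basename(key)               # drop any "folder/" prefix
    stem, _extension = os.path.splitext(base_name)  # drop the file extension
    match = re.search(r"(\d{4})$", stem)
    return match.group(1) if match else None


print(year_from_key("survey_2019.csv"))      # 2019
print(year_from_key("raw/survey-2016.csv"))  # 2016
print(year_from_key("README.txt"))           # None
```

Objects for which the function returns `None` could then be skipped instead of being loaded under a meaningless key.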

## Transform

### 2019 Survey Data

In [0]:
# Import dependencies
import pandas as pd
import numpy as np

# Convert the 2019 Spark dataframe to a pandas dataframe for transformation
nineteen_pandas_df = original_dataframes["2019"].toPandas()
nineteen_pandas_df.head()

In [0]:
#Drop unwanted columns
nineteen_df = nineteen_pandas_df.drop(columns=['*Are you self-employed?*',
                 'Is your primary role within your company related to tech/IT?',
                 'Do you have medical coverage (private insurance or state-provided) that includes treatment of mental health disorders?',
                 'Do you know local or online resources to seek help for a mental health issue?',
                 'If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to clients or business contacts?',
                 'If you have revealed a mental health disorder to a client or business contact, how has this affected you or the relationship?',
                 'If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to coworkers or employees?',
                 'If you have revealed a mental health disorder to a coworker or employee, how has this impacted you or the relationship?',
                 'Do you believe your productivity is ever affected by a mental health issue?',
                 'If yes, what percentage of your work time (time performing primary or secondary job functions) is affected by a mental health issue?',
                 '*What disorder(s) have you been diagnosed with?*',
                 '*If possibly, what disorder(s) do you believe you have?*',
                 '*If so, what disorder(s) were you diagnosed with?*',
                 'Has being identified as a person with a mental health issue affected your career?',
                 'How has it affected your career?',
                 'Would you be willing to talk to one of us more extensively about your experiences with mental health issues in the tech industry? (Note that all interview responses would be used _anonymously_ and only with your permission.)',
                 'What US state or territory do you *live* in?',
                 'What US state or territory do you *work* in?'])

#nineteen_df.head()

In [0]:
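# Split the dataframe by column position: the first 59 columns are survey answers, the rest are demographics
# .copy() makes independent frames so columns can be added later without a SettingWithCopyWarning
nineteen_survey_df = nineteen_df.iloc[:, :59].copy()
nineteen_demographics_df = nineteen_df.iloc[:, 59:].copy()
#nineteen_survey_df.head()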

In [0]:
#Rename survey columns
nineteen_survey_df.columns = ["number_employees",
               "is_tech_company",
               "employer_provides_mental_health",
               "knows_options_available",
               "employer_formally_discussed_mental_health",
               "employer_offers_resources",
               "is_anonymity_protected_by_employer",
               "level_difficulty_asking_for_leave",
               "comfortable_talking_physical_mental_coworkers",
               "comfortable_discussing_with_supervisor",
               "has_discussed_with_employer",
               "conversation_with_employer",
               "comfortable_discussing_with_coworkers",
               "has_discussed_with_coworkers",
               "conversation_with_coworkers",
               "had_coworker_discuss_mental_health",
               "conversation_coworker_had",
               "employer_physical_health_importance",
               "employer_mental_health_importance",
               "has_previous_employers",
               "is_previous_employer_tech_company",
               "previous_employer_provided_mental_health_benefits",
               "aware_options_from_previous_employer",
               "previous_employer_formally_discussed_mental_health",
               "previous_employer_offered_resources",
               "is_anonymity_protected_by_previous_employer",
               "comfortable_talking_physical_mental_previous_employer",
               "comfortable_discussing_with_previous_supervisor",
               "has_discussed_with_previous_employer",
               "conversation_with_previous_employer",
               "willing_discuss_with_previous_coworkers",
               "has_discussed_with_previous_coworkers",
               "conversation_with_previous_coworkers",
               "had_previous_coworker_discuss_mental_health",
               "conversation_previous_coworker_had",
               "previous_employer_physical_health_importance",
               "previous_employer_mental_health_importance",
               "currently_has_mental_health_disorder",
               "has_been_diagnosed",
               "had_disorder_in_past",
               "sought_treatment_for_mental_health",
               "has_family_history",
               "interferes_with_work_treated",
               "interferes_with_work_not_treated",
               "observations_of_others",
               "willingness_to_share",
               "physical_health_in_interview",
               "comments_physical_health_in_interview",
               "mental_health_in_interview",
               "comments_mental_health_in_interview",
               "is_openly_identified",
               "how_think_coworkers_would_react",
               "experienced_unsupportive_response",
               "comments_unsupportive_response",
               "experienced_supportive_response",
               "comments_supportive_response",
               "tech_industry_level_support",
               "comments_improve_mental_health_support",
               "additional_comments"]

nineteen_survey_df.head()
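
Assigning a list to `.columns` only works when the list has exactly one name per column, and a duplicated name would make later column lookups ambiguous. A small check along these lines could be run before the assignment. `check_new_names` is a hypothetical helper, and the short list stands in for the 59 names above.

```python
from collections import Counter


def check_new_names(new_names, expected_count):
    """Return a list of problems found in a list of replacement column names."""
    problems = []
    if len(new_names) != expected_count:
        problems.append(f"expected {expected_count} names, got {len(new_names)}")
    duplicates = sorted(name for name, n in Counter(new_names).items() if n > 1)
    if duplicates:
        problems.append(f"duplicate names: {duplicates}")
    return problems


# Made-up example: one name is missing and one is repeated.
names = ["number_employees", "is_tech_company", "is_tech_company"]
problems = check_new_names(names, expected_count=4)
print(problems)
```

An empty result means the list is safe to assign. With the real data, `expected_count` would be the number of columns in the dataframe being renamed.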

In [0]:
#add blank columns and year
nineteen_survey_df["interferes_with_work"] = ""
nineteen_survey_df["discussing_mental_health_has_consequences"] = ""
nineteen_survey_df["discussing_physical_health_has_consequences"] = ""
nineteen_survey_df["employer_takes_mental_seriously_as_physical"] = ""
nineteen_survey_df["observed_consequences_for_coworkers"] = ""
nineteen_survey_df["discussing_mental_has_consequences_previous_employer"] = ""
nineteen_survey_df["discussiong_physical_has_consequences_previous_employer"] = ""
nineteen_survey_df["previous_employer_took_mental_seriously_as_physical"] = ""
nineteen_survey_df["observed_consequences_for_previous_coworkers"] = ""
nineteen_survey_df["feels_mental_health_hurts_career"] = ""
nineteen_survey_df["thinks_coworkers_view_negatively"] = ""
nineteen_survey_df["year"] = "2019"
nineteen_survey_df.head()
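
The same set of blank columns is added for each survey year, so the step could be written once as a function. The sketch below assumes pandas is available, as it is elsewhere in this notebook. `add_blank_columns` and the shortened `BLANK_COLUMNS` list are illustrative, not the notebook's actual code.

```python
import pandas as pd

# Shortened stand-in for the full list of blank columns used above.
BLANK_COLUMNS = [
    "interferes_with_work",
    "discussing_mental_health_has_consequences",
]


def add_blank_columns(df, blank_columns, year):
    """Return a copy of df with empty-string columns and a 'year' column added."""
    out = df.copy()  # leave the caller's dataframe untouched
    for name in blank_columns:
        out[name] = ""
    out["year"] = str(year)
    return out


# Tiny made-up frame standing in for a survey dataframe.
demo = pd.DataFrame({"number_employees": ["26-100", "6-25"]})
result = add_blank_columns(demo, BLANK_COLUMNS, 2019)
print(list(result.columns))
```

Because the function works on a copy, it can be applied to each year's survey dataframe in turn without modifying the original frames.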

In [0]:
#rename 2019 demographics columns
nineteen_demographics_df.columns = [
               "age",
               "gender",
               "country_living_in",
               "race",
               "country_working_in"]

#nineteen_demographics_df.head()

In [0]:
#Add blank column for working remotely 
nineteen_demographics_df["works_remotely"] = ""
nineteen_demographics_df.head()

### 2018 Survey Data

In [0]:
#convert 2018 dataframe to pandas dataframe for transformation
eighteen_pandas_df = original_dataframes["2018"].toPandas() 
eighteen_pandas_df.head(2)

In [0]:
 eighteen_df = eighteen_pandas_df.drop(columns=['#',
                                                '<strong>Are you self-employed?</strong>',
                                                'Is your primary role within your company related to tech/IT?',
                                                'Do you have medical coverage (private insurance or state-provided) that includes treatment of mental health disorders?',
                                                'Do you know local or online resources to seek help for a mental health issue?',
                                                '<strong>If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to clients or business contacts?</strong>',
                                                'If you have revealed a mental health disorder to a client or business contact, how has this affected you or the relationship?',
                                                '<strong>If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to coworkers or employees?</strong>',
                                                'If you have revealed a mental health disorder to a coworker or employee, how has this impacted you or the relationship?',
                                                'Do you believe your productivity is ever affected by a mental health issue?',
                                                'If yes, what percentage of your work time (time performing primary or secondary job functions) is affected by a mental health issue?',
                                                'Anxiety Disorder (Generalized, Social, Phobia, etc)50',
                                                'Mood Disorder (Depression, Bipolar Disorder, etc)51',
                                                'Psychotic Disorder (Schizophrenia, Schizoaffective, etc)52', 
                                                'Eating Disorder (Anorexia, Bulimia, etc)53', 
                                                'Attention Deficit Hyperactivity Disorder54', 
                                                'Personality Disorder (Borderline, Antisocial, Paranoid, etc)55', 
                                                'Obsessive-Compulsive Disorder56',
                                                'Post-Traumatic Stress Disorder57',
                                                'Stress Response Syndromes58',
                                                'Dissociative Disorder59',
                                                'Substance Use Disorder60',
                                                'Addictive Disorder61',
                                                'Other62',
                                                'Anxiety Disorder (Generalized, Social, Phobia, etc)63',
                                                'Mood Disorder (Depression, Bipolar Disorder, etc)64',
                                                'Psychotic Disorder (Schizophrenia, Schizoaffective, etc)65',
                                                'Eating Disorder (Anorexia, Bulimia, etc)66',
                                                'Attention Deficit Hyperactivity Disorder67',
                                                'Personality Disorder (Borderline, Antisocial, Paranoid, etc)68',
                                                'Obsessive-Compulsive Disorder69',
                                                'Post-traumatic Stress Disorder70',
                                                'Stress Response Syndromes71',
                                                'Dissociative Disorder72',
                                                'Substance Use Disorder73',
                                                'Addictive Disorder74',
                                                'Other75',
                                                'Anxiety Disorder (Generalized, Social, Phobia, etc)76',
                                                'Mood Disorder (Depression, Bipolar Disorder, etc)77',
                                                'Psychotic Disorder (Schizophrenia, Schizoaffective, etc)78',
                                                'Eating Disorder (Anorexia, Bulimia, etc)79',
                                                'Attention Deficit Hyperactivity Disorder80',
                                                'Personality Disorder (Borderline, Antisocial, Paranoid, etc)81',
                                                'Obsessive-Compulsive Disorder82',
                                                'Post-traumatic Stress Disorder83',
                                                'Stress Response Syndromes84',
                                                'Dissociative Disorder85',
                                                'Substance Use Disorder86',
                                                'Addictive Disorder87',
                                                'Other88',
                                                'Has being identified as a person with a mental health issue affected your career?',
                                                'How has it affected your career?',
                                                'Would you be willing to talk to one of us more extensively about your experiences with mental health issues in the tech industry? (Note that all interview responses would be used <em>anonymously</em> and only with your permission.)',
                                                'What US state or territory do you <strong>live</strong> in?',
                                                'Other117',
                                                'What US state or territory do you <strong>work</strong> in?',
                                                'Start Date (UTC)',
                                                'Submit Date (UTC)',
                                                'Network ID'
                                                ])
 eighteen_df.head()
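
The 2018 export wraps some question headers in HTML tags such as `<strong>`, while the 2019 headers use `*` for emphasis, so the same question needs a different string in each drop list. One possible way to reduce that, sketched below with a hypothetical `clean_header` helper, is to strip the markup before matching. The 2018 headers also carry numeric suffixes (for example `Other62`), which this sketch deliberately leaves alone.

```python
import re


def clean_header(header):
    """Remove HTML tags and asterisk emphasis markers from a column header."""
    without_tags = re.sub(r"</?\w+[^>]*>", "", header)
    return without_tags.replace("*", "").strip()


print(clean_header("<strong>Are you self-employed?</strong>"))
print(clean_header("*Are you self-employed?*"))
print(clean_header("What US state or territory do you <strong>live</strong> in?"))
```

Applied to a pandas dataframe, this could look like `df.columns = [clean_header(c) for c in df.columns]`, after which the 2018 and 2019 versions of a question share one spelling.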

In [0]:
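# Split the dataframe by column position: the first 59 columns are survey answers, the rest are demographics
# .copy() makes independent frames so columns can be added later without a SettingWithCopyWarning
eighteen_survey_df = eighteen_df.iloc[:, :59].copy()
eighteen_demographics_df = eighteen_df.iloc[:, 59:].copy()
eighteen_survey_df.head()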

In [0]:
#Rename survey columns
eighteen_survey_df.columns = ["number_employees",
               "is_tech_company",
               "employer_provides_mental_health",
               "knows_options_available",
               "employer_formally_discussed_mental_health",
               "employer_offers_resources",
               "is_anonymity_protected_by_employer",
               "level_difficulty_asking_for_leave",
               "comfortable_talking_physical_mental_coworkers",
               "comfortable_discussing_with_supervisor",
               "has_discussed_with_employer",
               "conversation_with_employer",
               "comfortable_discussing_with_coworkers",
               "has_discussed_with_coworkers",
               "conversation_with_coworkers",
               "had_coworker_discuss_mental_health",
               "conversation_coworker_had",
               "employer_physical_health_importance",
               "employer_mental_health_importance",
               "has_previous_employers",
               "is_previous_employer_tech_company",
               "previous_employer_provided_mental_health_benefits",
               "aware_options_from_previous_employer",
               "previous_employer_formally_discussed_mental_health",
               "previous_employer_offered_resources",
               "is_anonymity_protected_by_previous_employer",
               "comfortable_talking_physical_mental_previous_employer",
               "comfortable_discussing_with_previous_supervisor",
               "has_discussed_with_previous_employer",
               "conversation_with_previous_employer",
               "willing_discuss_with_previous_coworkers",
               "has_discussed_with_previous_coworkers",
               "conversation_with_previous_coworkers",
               "had_previous_coworker_discuss_mental_health",
               "conversation_previous_coworker_had",
               "previous_employer_physical_health_importance",
               "previous_employer_mental_health_importance",
               "currently_has_mental_health_disorder",
               "has_been_diagnosed",
               "had_disorder_in_past",
               "sought_treatment_for_mental_health",
               "has_family_history",
               "interferes_with_work_treated",
               "interferes_with_work_not_treated",
               "observations_of_others",
               "willingness_to_share",
               "physical_health_in_interview",
               "comments_physical_health_in_interview",
               "mental_health_in_interview",
               "comments_mental_health_in_interview",
               "is_openly_identified",
               "how_think_coworkers_would_react",
               "experienced_unsupportive_response",
               "comments_unsupportive_response",
               "experienced_supportive_response",
               "comments_supportive_response",
               "tech_industry_level_support",
               "comments_improve_mental_health_support",
               "additional_comments"]

eighteen_survey_df.head()

In [0]:
#add blank columns and year
eighteen_survey_df["interferes_with_work"] = ""
eighteen_survey_df["discussing_mental_health_has_consequences"] = ""
eighteen_survey_df["discussing_physical_health_has_consequences"] = ""
eighteen_survey_df["employer_takes_mental_seriously_as_physical"] = ""
eighteen_survey_df["observed_consequences_for_coworkers"] = ""
eighteen_survey_df["discussing_mental_has_consequences_previous_employer"] = ""
eighteen_survey_df["discussiong_physical_has_consequences_previous_employer"] = ""
eighteen_survey_df["previous_employer_took_mental_seriously_as_physical"] = ""
eighteen_survey_df["observed_consequences_for_previous_coworkers"] = ""
eighteen_survey_df["feels_mental_health_hurts_career"] = ""
eighteen_survey_df["thinks_coworkers_view_negatively"] = ""
eighteen_survey_df["year"] = "2018"
eighteen_survey_df.head()

In [0]:
#rename 2018 demographics columns
eighteen_demographics_df.columns = [
               "age",
               "gender",
               "country_living_in",
               "race",
               "country_working_in"]

#eighteen_demographics_df.head()

In [0]:
#Add blank column for working remotely 
eighteen_demographics_df["works_remotely"] = ""
eighteen_demographics_df.head()

### 2017 Survey Data

In [0]:
#convert 2017 dataframe to pandas dataframe for transformation
seventeen_pandas_df = original_dataframes["2017"].toPandas() 
seventeen_pandas_df.head(2)

In [0]:
seventeen_df = seventeen_pandas_df.drop(columns=['#',
                                                '<strong>Are you self-employed?</strong>',
                                                'Is your primary role within your company related to tech/IT?',
                                                'Do you have medical coverage (private insurance or state-provided) that includes treatment of mental health disorders?',
                                                'Do you know local or online resources to seek help for a mental health issue?',
                                                '<strong>If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to clients or business contacts?</strong>',
                                                'If you have revealed a mental health disorder to a client or business contact, how has this affected you or the relationship?',
                                                '<strong>If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to coworkers or employees?</strong>',
                                                'If you have revealed a mental health disorder to a coworker or employee, how has this impacted you or the relationship?',
                                                'Do you believe your productivity is ever affected by a mental health issue?',
                                                'If yes, what percentage of your work time (time performing primary or secondary job functions) is affected by a mental health issue?',
                                                'Anxiety Disorder (Generalized, Social, Phobia, etc)50',
                                                'Mood Disorder (Depression, Bipolar Disorder, etc)51',
                                                'Psychotic Disorder (Schizophrenia, Schizoaffective, etc)52', 
                                                'Eating Disorder (Anorexia, Bulimia, etc)53', 
                                                'Attention Deficit Hyperactivity Disorder54', 
                                                'Personality Disorder (Borderline, Antisocial, Paranoid, etc)55', 
                                                'Obsessive-Compulsive Disorder56',
                                                'Post-Traumatic Stress Disorder57',
                                                'Stress Response Syndromes58',
                                                'Dissociative Disorder59',
                                                'Substance Use Disorder60',
                                                'Addictive Disorder61',
                                                'Other62',
                                                'Anxiety Disorder (Generalized, Social, Phobia, etc)63',
                                                'Mood Disorder (Depression, Bipolar Disorder, etc)64',
                                                'Psychotic Disorder (Schizophrenia, Schizoaffective, etc)65',
                                                'Eating Disorder (Anorexia, Bulimia, etc)66',
                                                'Attention Deficit Hyperactivity Disorder67',
                                                'Personality Disorder (Borderline, Antisocial, Paranoid, etc)68',
                                                'Obsessive-Compulsive Disorder69',
                                                'Post-traumatic Stress Disorder70',
                                                'Stress Response Syndromes71',
                                                'Dissociative Disorder72',
                                                'Substance Use Disorder73',
                                                'Addictive Disorder74',
                                                'Other75',
                                                'Anxiety Disorder (Generalized, Social, Phobia, etc)76',
                                                'Mood Disorder (Depression, Bipolar Disorder, etc)77',
                                                'Psychotic Disorder (Schizophrenia, Schizoaffective, etc)78',
                                                'Eating Disorder (Anorexia, Bulimia, etc)79',
                                                'Attention Deficit Hyperactivity Disorder80',
                                                'Personality Disorder (Borderline, Antisocial, Paranoid, etc)81',
                                                'Obsessive-Compulsive Disorder82',
                                                'Post-traumatic Stress Disorder83',
                                                'Stress Response Syndromes84',
                                                'Dissociative Disorder85',
                                                'Substance Use Disorder86',
                                                'Addictive Disorder87',
                                                'Other88',
                                                'Has being identified as a person with a mental health issue affected your career?',
                                                'How has it affected your career?',
                                                'What US state or territory do you <strong>live</strong> in?',
                                                'Other117',
                                                'What US state or territory do you <strong>work</strong> in?',
                                                'Start Date (UTC)',
                                                'Submit Date (UTC)',
                                                'Network ID'
                                                ])
seventeen_df.head()

In [0]:
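# Split the dataframe by column position: the first 59 columns are survey answers, the rest are demographics
# .copy() makes independent frames so columns can be added later without a SettingWithCopyWarning
seventeen_survey_df = seventeen_df.iloc[:, :59].copy()
seventeen_demographics_df = seventeen_df.iloc[:, 59:].copy()
seventeen_survey_df.head()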

In [0]:
#Rename survey columns
seventeen_survey_df.columns = ["number_employees",
               "is_tech_company",
               "employer_provides_mental_health",
               "knows_options_available",
               "employer_formally_discussed_mental_health",
               "employer_offers_resources",
               "is_anonymity_protected_by_employer",
               "level_difficulty_asking_for_leave",
               "comfortable_talking_physical_mental_coworkers",
               "comfortable_discussing_with_supervisor",
               "has_discussed_with_employer",
               "conversation_with_employer",
               "comfortable_discussing_with_coworkers",
               "has_discussed_with_coworkers",
               "conversation_with_coworkers",
               "had_coworker_discuss_mental_health",
               "conversation_coworker_had",
               "employer_physical_health_importance",
               "employer_mental_health_importance",
               "has_previous_employers",
               "is_previous_employer_tech_company",
               "previous_employer_provided_mental_health_benefits",
               "aware_options_from_previous_employer",
               "previous_employer_formally_discussed_mental_health",
               "previous_employer_offered_resources",
               "is_anonymity_protected_by_previous_employer",
               "comfortable_talking_physical_mental_previous_employer",
               "comfortable_discussing_with_previous_supervisor",
               "has_discussed_with_previous_employer",
               "conversation_with_previous_employer",
               "willing_discuss_with_previous_coworkers",
               "has_discussed_with_previous_coworkers",
               "conversation_with_previous_coworkers",
               "had_previous_coworker_discuss_mental_health",
               "conversation_previous_coworker_had",
               "previous_employer_physical_health_importance",
               "previous_employer_mental_health_importance",
               "currently_has_mental_health_disorder",
               "has_been_diagnosed",
               "had_disorder_in_past",
               "sought_treatment_for_mental_health",
               "has_family_history",
               "interferes_with_work_treated",
               "interferes_with_work_not_treated",
               "observations_of_others",
               "willingness_to_share",
               "physical_health_in_interview",
               "comments_physical_health_in_interview",
               "mental_health_in_interview",
               "comments_mental_health_in_interview",
               "is_openly_identified",
               "how_think_coworkers_would_react",
               "experienced_unsupportive_response",
               "comments_unsupportive_response",
               "experienced_supportive_response",
               "comments_supportive_response",
               "tech_industry_level_support",
               "comments_improve_mental_health_support",
               "additional_comments"]

seventeen_survey_df.head()

In [0]:
#add blank columns and year
seventeen_survey_df["interferes_with_work"] = ""
seventeen_survey_df["discussing_mental_health_has_consequences"] = ""
seventeen_survey_df["discussing_physical_health_has_consequences"] = ""
seventeen_survey_df["employer_takes_mental_seriously_as_physical"] = ""
seventeen_survey_df["observed_consequences_for_coworkers"] = ""
seventeen_survey_df["discussing_mental_has_consequences_previous_employer"] = ""
seventeen_survey_df["discussiong_physical_has_consequences_previous_employer"] = ""
seventeen_survey_df["previous_employer_took_mental_seriously_as_physical"] = ""
seventeen_survey_df["observed_consequences_for_previous_coworkers"] = ""
seventeen_survey_df["feels_mental_health_hurts_career"] = ""
seventeen_survey_df["thinks_coworkers_view_negatively"] = ""
seventeen_survey_df["year"] = "2017"
seventeen_survey_df.head()

In [0]:
# The first column after the split is not one of the five demographic fields,
# so set it aside (seventeen_garbage_df) and keep the remaining columns
seventeen_garbage_df = seventeen_demographics_df.iloc[:, :1]
seventeen_demographics2_df = seventeen_demographics_df.iloc[:, 1:].copy()
seventeen_demographics2_df.head()


In [0]:
#rename 2017 demographics columns
seventeen_demographics2_df.columns = [
               "age",
               "gender",
               "country_living_in",
               "race",
               "country_working_in"]

seventeen_demographics2_df.head()

In [0]:
#Add blank column for working remotely 
seventeen_demographics2_df["works_remotely"] = ""
seventeen_demographics2_df.head()

### 2016 Survey Data

In [0]:
#convert 2016 dataframe to pandas dataframe for transformation
sixteen_pandas_df = original_dataframes["2016"].toPandas() 
sixteen_pandas_df.head(2)

In [0]:
#remove columns in 2016 df
sixteen_df = sixteen_pandas_df.drop(columns=['Are you self-employed?',
                                             'Is your primary role within your company related to tech/IT?',
                                             'If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to clients or business contacts?',
                                             'If you have revealed a mental health issue to a client or business contact, do you believe this has impacted you negatively?',
                                             'If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to coworkers or employees?',
                                             'If you have revealed a mental health issue to a coworker or employee, do you believe this has impacted you negatively?',
                                             'Do you believe your productivity is ever affected by a mental health issue?',
                                             'If yes, what percentage of your work time (time performing primary or secondary job functions) is affected by a mental health issue?',
                                             'If yes, what condition(s) have you been diagnosed with?',
                                             'If maybe, what condition(s) do you believe you have?',
                                             'If so, what condition(s) were you diagnosed with?',
                                             'What US state or territory do you live in?',
                                             'What US state or territory do you work in?',
                                             'Which of the following best describes your work position?'
                                             ])
sixteen_df.head(2)
                

In [0]:
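#rename columns in 2016 dataframe
#note: four names appear twice in this list; the second occurrence of each sits among the previous-employer questions:
#discussing_mental_health_has_consequences, discussing_physical_health_has_consequences,
#employer_takes_mental_seriously_as_physical, observed_consequences_for_coworkers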
sixteen_df.columns = ['number_employees',
                      'is_tech_company',
                      'employer_provides_mental_health',
                      'knows_options_available',
                      'employer_formally_discussed_mental_health',
                      'employer_offers_resources',
                      'is_anonymity_protected_by_employer',
                      'level_difficulty_asking_for_leave',
                      'discussing_mental_health_has_consequences',
                      'discussing_physical_health_has_consequences',
                      'comfortable_talking_physical_mental_coworkers',
                      'comfortable_discussing_with_supervisor',
                      'employer_takes_mental_seriously_as_physical',
                      'observed_consequences_for_coworkers',
                      'employer_provides_mental_health2',
                      'employer_offers_resources2',
                      'has_previous_employers',
                      'previous_employer_provided_mental_health_benefits',
                      'aware_options_from_previous_employer',
                      'previous_employer_formally_discussed_mental_health',
                      'previous_employer_offered_resources',
                      'is_anonymity_protected_by_previous_employer',
                      'discussing_mental_health_has_consequences',
                      'discussing_physical_health_has_consequences',
                      'willing_discuss_with_previous_coworkers',
                      'comfortable_discussing_with_previous_supervisor',
                      'employer_takes_mental_seriously_as_physical',
                      'observed_consequences_for_coworkers',
                      'physical_health_in_interview',
                      'comments_physical_health_in_interview',
                      'mental_health_in_interview',
                      'comments_mental_health_in_interview',
                      'feels_mental_health_hurts_career',
                      'thinks_coworkers_view_negatively',
                      'willingness_to_share',
                      'experienced_unsupportive_response',
                      'observations_of_others',
                      'has_family_history',
                      'had_disorder_in_past',
                      'currently_has_mental_health_disorder',
                      'has_been_diagnosed',
                      'sought_treatment_for_mental_health',
                      'interferes_with_work_treated',
                      'interferes_with_work_not_treated',
                      'age',
                      'gender',
                      'country_living_in',
                      'country_working_in',
                      'works_remotely'
                      ]

sixteen_df.head()
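
Several names in the list above are assigned to two columns each (for example `discussing_mental_health_has_consequences`). The following is a minimal sketch of how pandas treats repeated column labels, using a small made-up frame; `toy_df` and its values are hypothetical and are not survey data.

```python
import pandas as pd

# Hypothetical toy frame with three positional columns
toy_df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

# Assigning a list with a repeated name is allowed and raises no error
toy_df.columns = ["a", "b", "a"]

# Selecting the repeated name returns every column carrying that label
selected = toy_df[["a"]]
print(selected.shape)

# List the labels that occur more than once
repeated = toy_df.columns[toy_df.columns.duplicated()].tolist()
print(repeated)
```

Running a check like `columns.duplicated()` right after a rename is one way to notice repeated labels before they affect later column selections.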

In [0]:
#adding in blank columns & year
sixteen_df["has_discussed_with_employer"] = ""
sixteen_df["conversation_with_employer"] = ""
sixteen_df["comfortable_discussing_with_coworkers"] = ""
sixteen_df["has_discussed_with_coworkers"] = ""
sixteen_df["conversation_with_coworkers"] = ""
sixteen_df["had_coworker_discuss_mental_health"] = ""
sixteen_df["conversation_coworker_had"] = ""
sixteen_df["employer_physical_health_importance"] = ""
sixteen_df["employer_mental_health_importance"] = ""
sixteen_df["is_previous_employer_tech_company"] = ""
sixteen_df["comfortable_talking_physical_mental_previous_employer"] = ""
sixteen_df["has_discussed_with_previous_employer"] = ""
sixteen_df["conversation_with_previous_employer"] = ""
sixteen_df["has_discussed_with_previous_coworkers"] = ""
sixteen_df["conversation_with_previous_coworkers"] = ""
sixteen_df["had_previous_coworker_discuss_mental_health"] = ""
sixteen_df["conversation_previous_coworker_had"] = ""
sixteen_df["previous_employer_physical_health_importance"] = ""
sixteen_df["previous_employer_mental_health_importance"] = ""
sixteen_df["is_openly_identified"] = ""
sixteen_df["how_think_coworkers_would_react"] = ""
sixteen_df["comments_unsupportive_response"] = ""
sixteen_df["experienced_supportive_response"] = ""
sixteen_df["comments_supportive_response"] = ""
sixteen_df["tech_industry_level_support"] = ""
sixteen_df["comments_improve_mental_health_support"] = ""
sixteen_df["additional_comments"] = ""
sixteen_df["interferes_with_work"] = ""
sixteen_df["discussing_mental_has_consequences_previous_employer"] = ""
sixteen_df["discussiong_physical_has_consequences_previous_employer"] = ""
sixteen_df["previous_employer_took_mental_seriously_as_physical"] = ""
sixteen_df["observed_consequences_for_previous_coworkers"] = ""
sixteen_df["race"] = ""
sixteen_df["year"] = "2016"

sixteen_df.head()

In [0]:
sixteen_reordered_df = sixteen_df[['number_employees',
                                     'is_tech_company',
                                     'employer_provides_mental_health',
                                     'knows_options_available',
                                     'employer_formally_discussed_mental_health',
                                     'employer_offers_resources',
                                     'is_anonymity_protected_by_employer',
                                     'level_difficulty_asking_for_leave',
                                     'comfortable_talking_physical_mental_coworkers',
                                     'comfortable_discussing_with_supervisor',
                                     'has_discussed_with_employer',
                                     'conversation_with_employer',
                                     'comfortable_discussing_with_coworkers',
                                     'has_discussed_with_coworkers',
                                     'conversation_with_coworkers',
                                     'had_coworker_discuss_mental_health',
                                     'conversation_coworker_had',
                                     'employer_physical_health_importance',
                                     'employer_mental_health_importance',
                                     'has_previous_employers',
                                     'is_previous_employer_tech_company',
                                     'previous_employer_provided_mental_health_benefits',
                                     'aware_options_from_previous_employer',
                                     'previous_employer_formally_discussed_mental_health',
                                     'previous_employer_offered_resources',
                                     'is_anonymity_protected_by_previous_employer',
                                     'comfortable_talking_physical_mental_previous_employer',
                                     'comfortable_discussing_with_previous_supervisor',
                                     'has_discussed_with_previous_employer',
                                     'conversation_with_previous_employer',
                                     'willing_discuss_with_previous_coworkers',
                                     'has_discussed_with_previous_coworkers',
                                     'conversation_with_previous_coworkers',
                                     'had_previous_coworker_discuss_mental_health',
                                     'conversation_previous_coworker_had',
                                     'previous_employer_physical_health_importance',
                                     'previous_employer_mental_health_importance',
                                     'currently_has_mental_health_disorder',
                                     'has_been_diagnosed',
                                     'had_disorder_in_past',
                                     'sought_treatment_for_mental_health',
                                     'has_family_history',
                                     'interferes_with_work_treated',
                                     'interferes_with_work_not_treated',
                                     'observations_of_others',
                                     'willingness_to_share',
                                     'physical_health_in_interview',
                                     'comments_physical_health_in_interview',
                                     'mental_health_in_interview',
                                     'comments_mental_health_in_interview',
                                     'is_openly_identified',
                                     'how_think_coworkers_would_react',
                                     'experienced_unsupportive_response',
                                     'comments_unsupportive_response',
                                     'experienced_supportive_response',
                                     'comments_supportive_response',
                                     'tech_industry_level_support',
                                     'comments_improve_mental_health_support',
                                     'additional_comments',
                                     'interferes_with_work',
                                     'discussing_mental_health_has_consequences',
                                     'discussing_physical_health_has_consequences',
                                     'employer_takes_mental_seriously_as_physical',
                                     'observed_consequences_for_coworkers',
                                     'discussing_mental_has_consequences_previous_employer',
                                     'discussiong_physical_has_consequences_previous_employer',
                                     'previous_employer_took_mental_seriously_as_physical',
                                     'observed_consequences_for_previous_coworkers',
                                     'feels_mental_health_hurts_career',
                                     'thinks_coworkers_view_negatively',
                                     'year',
                                     'employer_provides_mental_health2',
                                     'employer_offers_resources2',
                                     'age',
                                     'gender',
                                     'country_living_in',
                                     'race',
                                     'country_working_in',
                                     'works_remotely'
                                     ]]


sixteen_reordered_df.head()

In [0]:
#split 2016 data into survey and demographics dataframes
#the four repeated column names each select two columns, so the survey part is 75 columns wide instead of 71
dfs = np.split(sixteen_reordered_df, [75], axis=1)
sixteen_survey_df = dfs[0]
sixteen_demographics_df = dfs[1]
sixteen_survey_df.head()

In [0]:
#split off the two leading columns (employer_provides_mental_health2, employer_offers_resources2)
dfs2 = np.split(sixteen_demographics_df, [2], axis=1)
sixteen_garbage_df = dfs2[0]
sixteen_demographics2_df = dfs2[1]
sixteen_demographics2_df.head()

2014 Survey Data

In [0]:
#convert 2014 dataframe to pandas dataframe for transformation
fourteen_pandas_df = original_dataframes["2014"].toPandas() 
fourteen_pandas_df.head()

In [0]:
#remove columns in 2014 df
fourteen_df = fourteen_pandas_df.drop(columns=['Are you self-employed?',
                                               'If you live in the United States, which state or territory do you live in?',
                                               'Timestamp'])
fourteen_df.head()

In [0]:
#rename columns in 2014 dataframe
fourteen_df.columns = ["age",
                              "gender",
                              "country_living_in",
                              "has_family_history",
                              "sought_treatment_for_mental_health",
                              "interferes_with_work",
                              "number_employees",
                              "works_remotely",
                              "is_tech_company",
                              "employer_provides_mental_health",
                              "knows_options_available",
                              "employer_formally_discussed_mental_health",
                              "employer_offers_resources",
                              "is_anonymity_protected_by_employer",
                              "level_difficulty_asking_for_leave",
                              "discussing_mental_health_has_consequences",
                              "discussing_physical_health_has_consequences",
                              "comfortable_discussing_with_coworkers",
                              "comfortable_discussing_with_supervisor",
                              "mental_health_in_interview",
                              "physical_health_in_interview",
                              "employer_takes_mental_seriously_as_physical",
                              "observed_consequences_for_coworkers",
                              "additional_comments"]

fourteen_df.head()

In [0]:
#adding in blank columns & year
fourteen_df["comfortable_talking_physical_mental_coworkers"] = ""
fourteen_df["has_discussed_with_employer"] = ""
fourteen_df["conversation_with_employer"] = ""
fourteen_df["has_discussed_with_coworkers"] = ""
fourteen_df["conversation_with_coworkers"] = ""
fourteen_df["had_coworker_discuss_mental_health"] = ""
fourteen_df["conversation_coworker_had"] = ""
fourteen_df["employer_physical_health_importance"] = ""
fourteen_df["employer_mental_health_importance"] = ""
fourteen_df["has_previous_employers"] = ""
fourteen_df["is_previous_employer_tech_company"] = ""
fourteen_df["previous_employer_provided_mental_health_benefits"] = ""
fourteen_df["aware_options_from_previous_employer"] = ""
fourteen_df["previous_employer_formally_discussed_mental_health"] = ""
fourteen_df["previous_employer_offered_resources"] = ""
fourteen_df["is_anonymity_protected_by_previous_employer"] = ""
fourteen_df["comfortable_talking_physical_mental_previous_employer"] = ""
fourteen_df["comfortable_discussing_with_previous_supervisor"] = ""
fourteen_df["has_discussed_with_previous_employer"] = ""
fourteen_df["conversation_with_previous_employer"] = ""
fourteen_df["willing_discuss_with_previous_coworkers"] = ""
fourteen_df["has_discussed_with_previous_coworkers"] = ""
fourteen_df["conversation_with_previous_coworkers"] = ""
fourteen_df["had_previous_coworker_discuss_mental_health"] = ""
fourteen_df["conversation_previous_coworker_had"] = ""
fourteen_df["previous_employer_physical_health_importance"] = ""
fourteen_df["previous_employer_mental_health_importance"] = ""
fourteen_df["currently_has_mental_health_disorder"] = ""
fourteen_df["has_been_diagnosed"] = ""
fourteen_df["had_disorder_in_past"] = ""
fourteen_df["interferes_with_work_treated"] = ""
fourteen_df["interferes_with_work_not_treated"] = ""
fourteen_df["observations_of_others"] = ""
fourteen_df["willingness_to_share"] = ""
fourteen_df["comments_physical_health_in_interview"] = ""
fourteen_df["comments_mental_health_in_interview"] = ""
fourteen_df["is_openly_identified"] = ""
fourteen_df["how_think_coworkers_would_react"] = ""
fourteen_df["experienced_unsupportive_response"] = ""
fourteen_df["comments_unsupportive_response"] = ""
fourteen_df["experienced_supportive_response"] = ""
fourteen_df["comments_supportive_response"] = ""
fourteen_df["tech_industry_level_support"] = ""
fourteen_df["comments_improve_mental_health_support"] = ""
fourteen_df["discussing_mental_has_consequences_previous_employer"] = ""
fourteen_df["discussiong_physical_has_consequences_previous_employer"] = ""
fourteen_df["previous_employer_took_mental_seriously_as_physical"] = ""
fourteen_df["observed_consequences_for_previous_coworkers"] = ""
fourteen_df["feels_mental_health_hurts_career"] = ""
fourteen_df["thinks_coworkers_view_negatively"] = ""
fourteen_df["race"] = ""
fourteen_df["country_working_in"] = ""
fourteen_df["year"] = "2014"

fourteen_df.head()

In [0]:
fourteen_reordered_df = fourteen_df[['number_employees',
                                     'is_tech_company',
                                     'employer_provides_mental_health',
                                     'knows_options_available',
                                     'employer_formally_discussed_mental_health',
                                     'employer_offers_resources',
                                     'is_anonymity_protected_by_employer',
                                     'level_difficulty_asking_for_leave',
                                     'comfortable_talking_physical_mental_coworkers',
                                     'comfortable_discussing_with_supervisor',
                                     'has_discussed_with_employer',
                                     'conversation_with_employer',
                                     'comfortable_discussing_with_coworkers',
                                     'has_discussed_with_coworkers',
                                     'conversation_with_coworkers',
                                     'had_coworker_discuss_mental_health',
                                     'conversation_coworker_had',
                                     'employer_physical_health_importance',
                                     'employer_mental_health_importance',
                                     'has_previous_employers',
                                     'is_previous_employer_tech_company',
                                     'previous_employer_provided_mental_health_benefits',
                                     'aware_options_from_previous_employer',
                                     'previous_employer_formally_discussed_mental_health',
                                     'previous_employer_offered_resources',
                                     'is_anonymity_protected_by_previous_employer',
                                     'comfortable_talking_physical_mental_previous_employer',
                                     'comfortable_discussing_with_previous_supervisor',
                                     'has_discussed_with_previous_employer',
                                     'conversation_with_previous_employer',
                                     'willing_discuss_with_previous_coworkers',
                                     'has_discussed_with_previous_coworkers',
                                     'conversation_with_previous_coworkers',
                                     'had_previous_coworker_discuss_mental_health',
                                     'conversation_previous_coworker_had',
                                     'previous_employer_physical_health_importance',
                                     'previous_employer_mental_health_importance',
                                     'currently_has_mental_health_disorder',
                                     'has_been_diagnosed',
                                     'had_disorder_in_past',
                                     'sought_treatment_for_mental_health',
                                     'has_family_history',
                                     'interferes_with_work_treated',
                                     'interferes_with_work_not_treated',
                                     'observations_of_others',
                                     'willingness_to_share',
                                     'physical_health_in_interview',
                                     'comments_physical_health_in_interview',
                                     'mental_health_in_interview',
                                     'comments_mental_health_in_interview',
                                     'is_openly_identified',
                                     'how_think_coworkers_would_react',
                                     'experienced_unsupportive_response',
                                     'comments_unsupportive_response',
                                     'experienced_supportive_response',
                                     'comments_supportive_response',
                                     'tech_industry_level_support',
                                     'comments_improve_mental_health_support',
                                     'additional_comments',
                                     'interferes_with_work',
                                     'discussing_mental_health_has_consequences',
                                     'discussing_physical_health_has_consequences',
                                     'employer_takes_mental_seriously_as_physical',
                                     'observed_consequences_for_coworkers',
                                     'discussing_mental_has_consequences_previous_employer',
                                     'discussiong_physical_has_consequences_previous_employer',
                                     'previous_employer_took_mental_seriously_as_physical',
                                     'observed_consequences_for_previous_coworkers',
                                     'feels_mental_health_hurts_career',
                                     'thinks_coworkers_view_negatively',
                                     'year',
                                     'age',
                                     'gender',
                                     'country_living_in',
                                     'race',
                                     'country_working_in',
                                     'works_remotely'
                                     ]]


fourteen_reordered_df.head()
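
The blank-column cell and the reorder cell above can also be expressed as a single step. This is a sketch only, assuming a short hypothetical target list `target_columns` and a toy frame `toy_df` (the real list has many more names). `DataFrame.reindex` adds any column missing from the frame, fills it with `fill_value`, and returns the columns in the requested order. It requires the frame's column labels to be unique.

```python
import pandas as pd

# Hypothetical toy frame standing in for one year's renamed survey data
toy_df = pd.DataFrame({"age": [31, 45], "gender": ["F", "M"]})

# Hypothetical target layout: includes columns the toy frame does not have
target_columns = ["gender", "race", "age", "year"]

# Missing columns ("race", "year") are created and filled with ""
aligned_df = toy_df.reindex(columns=target_columns, fill_value="")

# The year is still set explicitly, as in the cells above
aligned_df["year"] = "2014"

print(aligned_df.columns.tolist())
```

Keeping one shared target list for all years would mean each year's frame ends up with the same columns in the same order without listing the blank columns by hand.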

In [0]:
#split 2014 data into survey and demographics dataframes
dfs = np.split(fourteen_reordered_df, [71], axis=1)
fourteen_survey_df = dfs[0]
fourteen_demographics_df = dfs[1]
fourteen_survey_df.head()

In [0]:
fourteen_demographics_df.head()
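
The split position used for 2014 above (71) is the position of `age`, the first demographics column in the reordered list. As a sketch, that position can be looked up by name rather than hard-coded. The frame `toy_df` below is hypothetical, and the lookup assumes unique column labels (with repeated labels `get_loc` does not return a single integer).

```python
import pandas as pd

# Hypothetical toy frame: survey columns first, then demographics
toy_df = pd.DataFrame({
    "q1": ["yes"],
    "q2": ["no"],
    "year": ["2014"],
    "age": [31],
    "gender": ["F"],
})

# Position of the first demographics column
split_at = toy_df.columns.get_loc("age")

# Positional slicing gives the same two halves as a split at that index
survey_part = toy_df.iloc[:, :split_at]
demographics_part = toy_df.iloc[:, split_at:]

print(split_at)
print(survey_part.columns.tolist())
print(demographics_part.columns.tolist())
```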

In [0]:
#appending demographics dataframes
demographics_df1 = nineteen_demographics_df.append(eighteen_demographics_df)
#demographics_df1.head()
demographics_df2 = demographics_df1.append(seventeen_demographics2_df)
demographics_df3 = demographics_df2.append(fourteen_demographics_df)
demographics_df4 = demographics_df3.append(sixteen_demographics2_df)
demographics_df4.tail()
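
`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the chained appends above only run on older pandas versions. Below is a sketch of the same stacking with `pd.concat`, using hypothetical two-row frames (`frame_2019`, `frame_2018`) in place of the per-year demographics frames.

```python
import pandas as pd

# Hypothetical stand-ins for two per-year demographics frames
frame_2019 = pd.DataFrame({"age": [30, 41], "gender": ["F", "M"]})
frame_2018 = pd.DataFrame({"age": [25, 52], "gender": ["M", "F"]})

# Plain concat keeps each frame's own index, so labels repeat
stacked = pd.concat([frame_2019, frame_2018])
print(stacked.index.tolist())

# ignore_index=True renumbers the rows 0..n-1 across all frames
renumbered = pd.concat([frame_2019, frame_2018], ignore_index=True)
print(renumbered.index.tolist())
```

The difference matters whenever the index is later used as a row identifier: only the renumbered result has one distinct index value per row.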

In [0]:
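#creating unique id
#each appended dataframe keeps its own 0-based index, so renumber the rows first
demographics_df4 = demographics_df4.reset_index(drop=True)
demographics_df4['id'] = demographics_df4.index
demographics_df4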

In [0]:
#appending survey dataframes
survey_df1 = nineteen_survey_df.append(eighteen_survey_df)
survey_df2 = survey_df1.append(seventeen_survey_df)
survey_df3 = survey_df2.append(fourteen_survey_df)
survey_df3.tail()
#survey_df4 = survey_df3.append(sixteen_survey_df)
#survey_df4.tail()

In [0]:
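#creating unique id
#each appended dataframe keeps its own 0-based index, so renumber the rows first
survey_df3 = survey_df3.reset_index(drop=True)
survey_df3['id'] = survey_df3.index
survey_df3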