
##**Mental Health Survey Data Cleaning Project**

In this project, we look at the 'Mental Health Survey' dataset. The main goal is to inspect the data and clean it, so that the cleaned dataset can be used in a follow-up exploratory data analysis (EDA) project.

The project follows these steps:
1.   Loading the libraries and the dataset
2.   Performing a light data exploration on the dataset
3.   Cleaning the data and performing some feature engineering where needed
4.   Exporting the dataset into a new CSV file

Our thanks to the provider of this dataset:

https://www.kaggle.com/datasets/osmi/mental-health-in-tech-survey


##**1. Preparation**

###1a. Libraries

In [20]:
# pandas for working with the dataset in tabular form
import pandas as pd

# numpy for mathematical operations
import numpy as np

# termcolor for coloring printed strings
from termcolor import colored

###1b. Dataset

In [4]:
data = pd.read_csv('Mental Health Survey.csv')

###1c. Functions

In [25]:
# A function to get the unique values in a feature and each of its counts
def uniquevaluescount(data, feature):
    # Storing each unique value of feature in a list (missing values are left out)
    features = data[feature]
    listUV = features.unique()
    listUV = [i for i in listUV if pd.notna(i)]

    # Storing the count of each unique value of feature in a list
    listUVcount = []
    for i in listUV:
        UVcount = (data[feature] == i).sum()
        listUVcount.append(UVcount)
    
    # Displaying the report in a table
    table = pd.DataFrame(list(zip(listUV, listUVcount)))
    table.columns = [feature, 'Count']
    print(colored(('Unique value count of ' + feature), 'blue'))
    print(table.
          sort_values(['Count', feature], 
                      ascending = [0,1]).
          to_string(index = False))
    print('Total : ' + str(sum(listUVcount)) + '\n')

# A function to get the percentage of missing values in a feature
def missingvaluepercentage(data, feature):
    # Count of missing value
    missingvaluecount = data[feature].isnull().sum()
    
    # Count of total data
    totalcount = len(data[feature])

    # Missing value percentage calculation
    missingvalue = missingvaluecount / totalcount * 100
    missingvalue = "{:.2f}".format(missingvalue)

    # Displaying the missing value report
    print('Missing value : ' + str(missingvalue) + '%')

# A function to calculate the probability of each unique value in a feature
def valueprobability(data, feature):
    # Storing each unique value of feature in a list (missing values are left out)
    # The lists are global because later cells reuse listUV and listUVprob
    features = data[feature]
    global listUV
    listUV = features.unique()
    listUV = [i for i in listUV if pd.notna(i)]

    # Storing the count of each unique value of feature in a list
    global listUVcount
    listUVcount = []
    for i in listUV:
        UVcount = (data[feature] == i).sum()
        listUVcount.append(UVcount)
    
    # Calculating the total count of the feature
    totalcount = sum(listUVcount)

    # Calculating the probability of each unique value in the feature
    global listUVprob
    listUVprob = []
    for i in listUVcount:
        UVprob = i / totalcount
        listUVprob.append(UVprob)
    
    # Displaying the report in a table
    table = pd.DataFrame(list(zip(listUV, listUVcount, listUVprob)))
    table.columns = [feature, 'Count', 'Probabiity']
    print(table.
          sort_values(['Count', feature], 
                      ascending = [0,1]).
          to_string(index = False))
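
The counting and probability helpers above could also be built on pandas' own `value_counts`. The following is only a sketch on made-up toy data; the helper name `value_summary` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def value_summary(data: pd.DataFrame, feature: str) -> pd.DataFrame:
    """Hypothetical helper: counts and probabilities of the non-missing values."""
    counts = data[feature].value_counts()                # NaN is dropped by default
    probs = data[feature].value_counts(normalize=True)   # shares of non-missing values
    table = pd.DataFrame({'Count': counts, 'Probability': probs})
    table.index.name = feature
    return table.reset_index()

# Toy data, made up for illustration (not taken from the survey)
toy = pd.DataFrame({'answer': ['Yes', 'No', 'No', None, 'No']})
summary = value_summary(toy, 'answer')
print(summary.to_string(index=False))
```

Returning a dataframe instead of printing it (and instead of writing to global variables) makes the result easier to reuse in later cells.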

##**2. Light Data Exploration**

###2a. General overview

In [5]:
data.head()

| | Timestamp | Age | Gender | Country | state | self_employed | family_history | treatment | work_interfere | no_employees | ... | leave | mental_health_consequence | phys_health_consequence | coworkers | supervisor | mental_health_interview | phys_health_interview | mental_vs_physical | obs_consequence | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-08-27 11:29:31 | 37 | Female | United States | IL | | No | Yes | Often | 6-25 | ... | Somewhat easy | No | No | Some of them | Yes | No | Maybe | Yes | No | |
| 1 | 2014-08-27 11:29:37 | 44 | M | United States | IN | | No | No | Rarely | More than 1000 | ... | Don't know | Maybe | No | No | No | No | No | Don't know | No | |
| 2 | 2014-08-27 11:29:44 | 32 | Male | Canada | | | No | No | Rarely | 6-25 | ... | Somewhat difficult | No | No | Yes | Yes | Yes | Yes | No | No | |
| 3 | 2014-08-27 11:29:46 | 31 | Male | United Kingdom | | | Yes | Yes | Often | 26-100 | ... | Somewhat difficult | Yes | Yes | Some of them | No | Maybe | Maybe | No | Yes | |
| 4 | 2014-08-27 11:30:22 | 31 | Male | United States | TX | | No | No | Never | 100-500 | ... | Don't know | No | No | Some of them | Yes | Yes | Yes | Don't know | No | |


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Timestamp                  1259 non-null   object
 1   Age                        1259 non-null   int64 
 2   Gender                     1259 non-null   object
 3   Country                    1259 non-null   object
 4   state                      744 non-null    object
 5   self_employed              1241 non-null   object
 6   family_history             1259 non-null   object
 7   treatment                  1259 non-null   object
 8   work_interfere             995 non-null    object
 9   no_employees               1259 non-null   object
 10  remote_work                1259 non-null   object
 11  tech_company               1259 non-null   object
 12  benefits                   1259 non-null   object
 13  care_options               1259 non-null   object
 14  wellness

The provider of this dataset describes what each feature represents. We will display these descriptions in a pandas dataframe so that they are easier to read.

In [7]:
# Storing all the feature description in a variable
RawFeatures = 'Timestamp;Age;Gender;Country;state: If you live in the United States,which state or territory do you live in?;self_employed: Are you self-employed?;family_history: Do you have a family history of mental illness?;treatment: Have you sought treatment for a mental health condition?;work_interfere: If you have a mental health condition,do you feel that it interferes with your work?;no_employees: How many employees does your company or organization have?;remote_work: Do you work remotely (outside of an office) at least 50% of the time?;tech_company: Is your employer primarily a tech company/organization?;benefits: Does your employer provide mental health benefits?;care_options: Do you know the options for mental health care your employer provides?;wellness_program: Has your employer ever discussed mental health as part of an employee wellness program?;seek_help: Does your employer provide resources to learn more about mental health issues and how to seek help?;anonymity: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?;leave: How easy is it for you to take medical leave for a mental health condition?;mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences?;phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences?;coworkers: Would you be willing to discuss a mental health issue with your coworkers?;supervisor: Would you be willing to discuss a mental health issue with your direct supervisor(s)?;mental_health_interview: Would you bring up a mental health issue with a potential employer in an interview?;phys_health_interview: Would you bring up a physical health issue with a potential employer in an interview?;mental_vs_physical: Do you feel that your employer takes mental health as seriously as physical health?;obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?;comments: Any additional notes or comments'

# Splitting each feature into a separate item in a list
Features = RawFeatures.split(';')

# Separating each feature and its description into two separate lists
ListFeature = []
ListDesc = []
for i in Features:
    # partition(':') returns (name, separator, description)
    name, _, desc = i.partition(':')
    ListFeature.append(name.strip())
    ListDesc.append(desc.strip())

# Creating the description dataframe
FeaturesTable = pd.DataFrame(list(zip(ListFeature, ListDesc))) # Creating the dataframe
pd.set_option("display.max_colwidth", 0)                       # Displaying the description's column in a full length
FeaturesTable.columns = ['Feature', 'Description']             # Giving each dataframe's column a name

# Displaying the description dataframe
print('Feature Description')
FeaturesTable

Feature Description


| | Feature | Description |
|---|---|---|
| 0 | Timestamp | |
| 1 | Age | |
| 2 | Gender | |
| 3 | Country | |
| 4 | state | If you live in the United States,which state or territory do you live in? |
| 5 | self_employed | Are you self-employed? |
| 6 | family_history | Do you have a family history of mental illness? |
| 7 | treatment | Have you sought treatment for a mental health condition? |
| 8 | work_interfere | If you have a mental health condition,do you feel that it interferes with your work? |
| 9 | no_employees | How many employees does your company or organization have? |


###2b. Missing values

In [8]:
# Calculating the missing values for each feature
MissingValues = data.isnull().sum()

# Turning the previous output into a dataframe format
MissingValues = pd.DataFrame(MissingValues)
MissingValues = MissingValues.reset_index()
MissingValues.columns = ['Feature', 'Count']

# Calculating each feature's missing values percentage
ListMissingVPerc = []
for i in MissingValues['Count']:
    missingperc = i / len(data) * 100 # Total entries = len(data), 1259 for this dataset
    ListMissingVPerc.append(missingperc)
ListMissingVPerc = [round(num, 2) for num in ListMissingVPerc]

# Adding the missing values percentage column to the report
MissingValues['Percentage (%)'] = ListMissingVPerc

# Printing the missing values report
print('Missing values on each feature: ')
print(MissingValues.to_string(index = False))

Missing values on each feature: 
                  Feature  Count  Percentage (%)
                Timestamp      0            0.00
                      Age      0            0.00
                   Gender      0            0.00
                  Country      0            0.00
                    state    515           40.91
            self_employed     18            1.43
           family_history      0            0.00
                treatment      0            0.00
           work_interfere    264           20.97
             no_employees      0            0.00
              remote_work      0            0.00
             tech_company      0            0.00
                 benefits      0            0.00
             care_options      0            0.00
         wellness_program      0            0.00
                seek_help      0            0.00
                anonymity      0            0.00
                    leave      0            0.00
mental_health_consequence      0    
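
A more compact way to build this kind of report is to let pandas compute the share of missing values directly, since the mean of a boolean mask is the fraction of `True` values. The sketch below runs on made-up toy data, and the name `missing_report` is an assumption used only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not taken from the survey)
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan],
                    'b': ['x', 'y', None, 'z']})

# isnull().sum() counts the missing values per column;
# isnull().mean() gives the missing fraction per column
missing_report = pd.DataFrame({
    'Count': toy.isnull().sum(),
    'Percentage (%)': (toy.isnull().mean() * 100).round(2),
})
missing_report.index.name = 'Feature'
missing_report = missing_report.reset_index()
print(missing_report.to_string(index=False))
```

This avoids hard-coding the number of rows, so the report stays correct if the dataset changes.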

###2c. Duplicated values

In [9]:
DuplicatedValues = data.duplicated().sum()
print('Duplicated data : ' + str(DuplicatedValues))

Duplicated data : 0


###2d. Unique values

In [10]:
# Getting the feature names in a list
DataCols = list(data.columns.values)

# Calculating the number of each feature's unique values
ListUniqueC = []
for i in data.columns:
    UCount = len(data[i].unique())
    ListUniqueC.append(UCount)

# Creating the unique values table
UniqueCTable = pd.DataFrame(list(zip(DataCols, ListUniqueC)))
UniqueCTable.columns = ['Feature', 'Unique Values']

# Printing the report
print('Unique values in each feature:')
print(UniqueCTable.to_string(index = False))

Unique values in each feature:
                  Feature  Unique Values
                Timestamp           1246
                      Age             53
                   Gender             49
                  Country             48
                    state             46
            self_employed              3
           family_history              2
                treatment              2
           work_interfere              5
             no_employees              6
              remote_work              2
             tech_company              2
                 benefits              3
             care_options              3
         wellness_program              3
                seek_help              3
                anonymity              3
                    leave              5
mental_health_consequence              3
  phys_health_consequence              3
                coworkers              3
               supervisor              3
  mental_health_interview 
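
Note that `len(data[i].unique())` counts a missing value as a value of its own, which is why a Yes/No feature such as 'self_employed' is reported with 3 unique values. A small sketch on made-up toy data shows the difference, using `DataFrame.nunique` with and without `dropna`:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not taken from the survey)
toy = pd.DataFrame({'Gender': ['M', 'Male', 'male', 'F'],
                    'state': ['CA', np.nan, 'CA', np.nan]})

unique_with_nan = toy.nunique(dropna=False)   # NaN counted as one more value
unique_without_nan = toy.nunique()            # NaN ignored (the default)

print(pd.DataFrame({'With NaN': unique_with_nan,
                    'Without NaN': unique_without_nan}))
```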

##**3. Data Cleaning and Feature Engineering**

###Checking unique values of each feature

As the first step of the data cleaning process, we look at the unique values in each feature of the dataset. The 'Timestamp' and 'comments' features are skipped here and handled later: 'Timestamp' has too many unique values that are irrelevant for further analysis, and 'comments' has so many null values that it needs a different approach. The 'Age' feature is listed below but will also be treated separately, since it holds numeric data that should logically lie within a specific range.

In [27]:
# Checking unique values of each feature, except 'Timestamp' and 'comments'
for column in data.columns[1:-1]:
    uniquevaluescount(data, column)

Unique value count of Age
        Age  Count
         29     85
         32     82
         26     75
         27     71
         33     70
         28     68
         31     67
         34     65
         30     63
         25     61
         35     55
         23     51
         24     46
         37     43
         38     39
         36     37
         39     33
         40     33
         43     28
         22     21
         41     21
         42     20
         21     16
         45     12
         46     12
         44     11
         19      9
         18      7
         20      6
         48      6
         50      6
         51      5
         49      4
         56      4
         54      3
         55      3
         57      3
         47      2
         60      2
      -1726      1
        -29      1
         -1      1
          5      1
          8      1
         11      1
         53      1
         58      1
         61      1
         62      1
         65    

Several features appear to need cleaning or engineering, namely:


1.   Timestamp: irrelevant for further analysis
2.   Age: contains abnormal age values
3.   Gender: no data validation and contains too many categories
4.   state: contains missing values
5.   self_employed: contains missing values
6.   work_interfere: contains missing values
7.   no_employees: needs a little change to avoid data auto-formatting
8.   comments: no data validation and too many missing values



###Timestamp

The 'Timestamp' feature does not carry any meaning or importance for the analysis planned with this dataset, so we will drop it.

In [28]:
# Dropping the 'Timestamp' feature
data.drop('Timestamp', axis = 1, inplace = True)

# Checking the features of the dataset
print(data.columns)

Index(['Age', 'Gender', 'Country', 'state', 'self_employed', 'family_history',
       'treatment', 'work_interfere', 'no_employees', 'remote_work',
       'tech_company', 'benefits', 'care_options', 'wellness_program',
       'seek_help', 'anonymity', 'leave', 'mental_health_consequence',
       'phys_health_consequence', 'coworkers', 'supervisor',
       'mental_health_interview', 'phys_health_interview',
       'mental_vs_physical', 'obs_consequence', 'comments'],
      dtype='object')


###Age

We will check whether any entries have an abnormal 'Age' value. Since the survey targets people who work in the tech industry, we assume that the youngest respondents should be around 18 (the US legal age) and the oldest around 80 (still possible).

In [29]:
# Storing the age feature in a variable
Age = data['Age']

# Age feature range
AgeMin = min(Age)
AgeMax = max(Age)

# Entries with abnormal age value count
AgeLO = (Age < 18).sum()
AgeUO = (Age > 80).sum()
AgeOCount = AgeLO + AgeUO
AgeOCountP = AgeOCount / 1259

# Displaying the information
print('Highest age in the entries : ' + str(AgeMax))
print('Lowest age in the entries  : ' + str(AgeMin))
print('People younger than 18     : ' + str(AgeLO) + ' people')
print('People older than 80       : ' + str(AgeUO) + ' people')
print('Abnormal entries count     : ' + str(AgeOCount))
print('Abnormal entries percentage: ' + str(AgeOCountP))

Highest age in the entries : 99999999999
Lowest age in the entries  : -1726
People younger than 18     : 6 people
People older than 80       : 2 people
Abnormal entries count     : 8
Abnormal entries percentage: 0.006354249404289118


Luckily, only 8 entries have an abnormal 'Age' value, which is around 0.64% of the total entries (the value 0.0064 printed above is a fraction, not a percentage). We need to replace these values with a more plausible one. A simple method is to impute them with the mean of the 'Age' feature. However, since the mean is heavily affected by the extreme outlier (99999999999), we will use the median instead. Let us first check what the median of the 'Age' feature is.



In [30]:
AgeMedian = data['Age'].median()
print('Median of age feature : ' + str(AgeMedian))

Median of age feature : 31.0


The median looks plausible and can be used to impute the abnormal values.

In [31]:
# Replacing the abnormal age values (younger than 18 or older than 80) with the median
data['Age'] = data['Age'].mask((data['Age'] < 18) | (data['Age'] > 80),
                               int(AgeMedian))

# Checking if the values are replaced correctly
Age2 = data['Age']
AgeMin2 = min(Age2)
AgeMax2 = max(Age2)
AgeCount2 = len(Age2)
print('Lowest age in the entries       : ' + str(AgeMin2))
print('Highest age in the entries      : ' + str(AgeMax2))
print('Total counts of the age entries : ' + str(AgeCount2))

Lowest age in the entries       : 18
Highest age in the entries      : 72
Total counts of the age entries : 1259
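
To illustrate why the median is the safer choice here, the following sketch uses a few made-up ages together with two extreme values like the ones found above. The variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy ages, made up for illustration, including two extreme entries
ages = pd.Series([25, 31, 29, -1726, 40, 99999999999, 33])

# The mean is dominated by the extreme entry, the median is not
print('Mean   : ' + str(ages.mean()))
print('Median : ' + str(ages.median()))

# Keep values inside the plausible range, replace the rest with the median
valid = ages.between(18, 80)            # inclusive on both ends
robust_median = int(ages.median())
cleaned = ages.where(valid, robust_median)
print(cleaned.tolist())
```

`Series.where` keeps the values where the condition holds, which is the mirror image of `Series.mask` used in the cell above.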


###Gender

We have seen that the 'Gender' feature has too many unique values because the input was not validated. For the sake of simplicity, we will do some feature engineering and group the 'Gender' feature into only three values: 'Male', 'Female', and 'Other'.

In [34]:
# Defining the Male and Female categories
male = ['M', 'Male', 'male', 'm', 'maile', 'Mal', 'Make', 'Male ', 'Man', 'msle', 'Mail', 'Malr']
female = ['Female', 'female', 'F', 'Woman', 'f', 'Femake', 'woman', 'Female ', 'femail']

# Storing the unique values in a list
GenderUV = data['Gender'].unique()

# Defining the Other category: every value that is in neither list
other = [x for x in GenderUV if x not in male and x not in female]

# Replacing the values in Gender feature into three categories only
data['Gender'] = data['Gender'].replace(to_replace = male, value = 'Male')
data['Gender'] = data['Gender'].replace(to_replace = female, value = 'Female')
data['Gender'] = data['Gender'].replace(to_replace = other, value = 'Other')

# Displaying the report
print('Number of unique values in Gender feature : ' + 
      str(len(data['Gender'].unique())))
print('Unique values in Gender feature           : ' + 
      str(data['Gender'].unique()))

Number of unique values in Gender feature : 3
Unique values in Gender feature           : ['Female' 'Male' 'Other']
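
The same grouping can be sketched by normalizing each answer first (trimming whitespace and lowercasing) and then looking it up in a set. The sets below are the lowercase forms of the spellings listed above; the function name `normalize_gender` is hypothetical. Note that this variant also catches capitalization variants that are not in the original lists, and it assumes there are no missing values in the feature.

```python
import pandas as pd

# Lowercase forms of the spellings used in the lists above
MALE = {'m', 'male', 'man', 'maile', 'mal', 'make', 'msle', 'mail', 'malr'}
FEMALE = {'f', 'female', 'woman', 'femake', 'femail'}

def normalize_gender(value: str) -> str:
    """Hypothetical helper: map a free-text answer to Male / Female / Other."""
    key = value.strip().lower()
    if key in MALE:
        return 'Male'
    if key in FEMALE:
        return 'Female'
    return 'Other'

# Toy answers, made up for illustration
answers = pd.Series(['M', 'Male ', 'femail', 'Woman', 'non-binary'])
grouped = answers.map(normalize_gender)
print(grouped.tolist())
```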


###State

In the 'state' feature, the number of missing values is quite large, accounting for 40.91% of the total number of entries. Before deciding what to do with them, we first need to look for a possible explanation of why so many values are missing.

'state' only applies to countries that are divided into states, for example the United States with its 50 states. For entries from other countries, the 'state' feature is most likely empty. Let us check if this is the case.

In [38]:
# Count: Entries outside US
NotUS = (data['Country'] != 'United States').sum()

# Count: Entries outside US that have empty state feature
NotUSNullState = data[data['Country'] != 'United States']['state'].isnull().sum()

# Count: Entries in US
US = (data['Country'] == 'United States').sum()

# Count: Entries in US that have empty state feature
USNullState = data[data['Country'] == 'United States']['state'].isnull().sum()

# Total entries with empty state
NullState = data['state'].isnull().sum()

# Displaying the report
print('Entries Outside US                  : ' + str(NotUS))
print('Entries Outside US with empty state : ' + str(NotUSNullState))
print('Entries In US                       : ' + str(US))
print('Entries In US with empty state      : ' + str(USNullState))
print('Total entries with empty state      : ' + str(NullState))

Entries Outside US                  : 508
Entries Outside US with empty state : 504
Entries In US                       : 751
Entries In US with empty state      : 11
Total entries with empty state      : 515


From the report above, it can be seen that non-empty 'state' values occur almost exclusively in entries whose 'Country' is 'United States'. Only 4 entries outside of 'United States' have a non-empty 'state' value. When we specifically analyze the data inside 'United States', the 'state' feature can be quite useful. Because of this, we will not discard the entire 'state' feature despite the many missing values, but will instead do a little cleaning on it.
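
The four counts in the report can also be read from a single cross-tabulation of "inside or outside the US" against "state missing or present". The sketch below uses made-up toy rows; the labels and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows, made up for illustration (not taken from the survey)
toy = pd.DataFrame({
    'Country': ['United States', 'United States', 'Canada', 'Latvia', 'United States'],
    'state': ['CA', np.nan, np.nan, 'NY', 'TX'],
})

# Label each row by region and by whether 'state' is missing
region = pd.Series(np.where(toy['Country'] == 'United States', 'US', 'Non-US'),
                   name='Region')
state_status = pd.Series(np.where(toy['state'].isnull(), 'missing', 'present'),
                         name='state')

# One table with all four combinations
report = pd.crosstab(region, state_status)
print(report)
```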

####Inside of United States

Let us check the unique values in the 'state' feature of the 'United States' entries.

In [None]:
uniquevaluescount(data[data['Country'] == 'United States'], 'state')

state  Count
   CA    138
   WA     70
   NY     56
   TN     45
   TX     44
   OH     30
   OR     29
   PA     29
   IL     28
   IN     27
   MI     22
   MN     21
   MA     20
   FL     15
   NC     14
   VA     14
   GA     12
   MO     12
   WI     12
   UT     10
   CO      9
   AL      8
   AZ      7
   MD      7
   NJ      6
   OK      6
   KY      5
   SC      5
   CT      4
   DC      4
   IA      4
   KS      3
   NH      3
   NV      3
   SD      3
   VT      3
   NE      2
   NM      2
   WY      2
   ID      1
   LA      1
   ME      1
   MS      1
   RI      1
   WV      1
Total : 740


There seems to be no problem with the unique values: no typos or abnormal formats. Now, let us calculate the percentage of missing values.

In [36]:
# Calculating the 'state' missing value percentage of the 'United States' entries
print("'United States' entries with missing 'State' value")
missingvaluepercentage(data[data['Country'] == 'United States'], 'state')

'United States' entries with missing 'State' value
Missing value : 1.46%


The percentage of missing values in the 'state' feature of the 'United States' entries is relatively small, around 1.46%, so the feature can still be used for further analysis. We will simply replace the missing values with 'No Data' and see whether deeper analysis can extract some information from them.

In [39]:
# Replacing the 'state' missing values of the 'United States' entries with 'No Data'
data.loc[data.Country == 'United States', 'state'] = data.loc[data.Country == 'United States', 'state'].fillna('No Data')

# Checking if the missing values have been replaced correctly
USNoDataState = ((data['Country'] == 'United States') & (data['state'] == 'No Data')).sum()
USNullState = data[data['Country'] == 'United States']['state'].isnull().sum()
NullState = data['state'].isnull().sum()

# Displaying the report
print("Entries In US With 'No Data' State : " + str(USNoDataState))
print('Entries in US with empty state     : ' + str(USNullState))
print('Total entries with empty state     : ' + str(NullState))
print('Entries Outside US                 : ' + str(NotUS))

Entries In US With 'No Data' State : 11
Entries in US with empty state     : 0
Total entries with empty state     : 504
Entries Outside US                 : 508



The 'state' missing values of the 'United States' entries have been successfully replaced with 'No Data'. We will now proceed to the entries outside of the 'United States'.

####Outside of United States

Let us check the unique values in the 'state' feature of the entries outside of the 'United States'.

In [40]:
uniquevaluescount(data[data['Country'] != 'United States'], 'state')

Unique value count of state
state  Count
   IL      1
   MD      1
   NY      1
   UT      1
Total : 4



In [43]:
# Getting the unique 'state' values of entries outside 'United States'
liststateOutsideUS = list(data[data['Country'] != 'United States']['state'].unique())

# Omitting the nan numerical value
liststateOutsideUS = [i for i in liststateOutsideUS if type(i) is not float]

# Getting the 'Country' for each of the 'state'
listcountryOutsideUS = []
for i in liststateOutsideUS:
    countryOutsideUS = data[data['Country'] != 'United States'].loc[data['state'] == i, 'Country'].iloc[0]
    listcountryOutsideUS.append(str(countryOutsideUS))

# Displaying the report
Table = pd.DataFrame(list(zip(liststateOutsideUS, listcountryOutsideUS)))
Table.columns = ['State', 'Country']
print("Unique 'state' value of entries outside of 'United States' and their corresponding 'Country :'")
print(Table.to_string(index = False))

Unique 'state' value of entries outside of 'United States' and their corresponding 'Country :'
State      Country
   NY       Latvia
   MD       Israel
   IL Bahamas, The
   UT     Bulgaria


Each of these 'state' values belongs to a different country. These values might not be used in further analysis, so we will leave them as they are for now.

###self_employed

The percentage of missing values in the 'self_employed' feature is relatively small, around 1.43%, so the feature can still be used for further analysis. We will fill in the missing values randomly, based on the observed probability of each 'Yes' or 'No' value.

In [44]:
# Calculating the probability of each value
valueprobability(data, 'self_employed')

self_employed  Count  Probabiity
           No   1095    0.882353
          Yes    146    0.117647


Filling in the missing values of the 'self_employed' feature based on the probability of each unique value in the feature.

In [45]:
# Filling in the missing values
data['self_employed'] = data['self_employed'].fillna(pd.Series(np.random.choice(listUV,
                                                                               p = listUVprob,
                                                                               size = len(data))))
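
Because the fill above draws random values without a seed, each run of the notebook gives slightly different counts. The sketch below shows one way to make such an imputation reproducible with a seeded NumPy generator. It runs on made-up toy data, the function name `fill_by_observed_distribution` is hypothetical, and unlike the cell above it draws values only for the missing positions.

```python
import numpy as np
import pandas as pd

def fill_by_observed_distribution(series: pd.Series, seed: int = 0) -> pd.Series:
    """Hypothetical helper: fill NaN by sampling from the observed value shares."""
    rng = np.random.default_rng(seed)                 # seeded for reproducibility
    probs = series.value_counts(normalize=True)       # shares of non-missing values
    missing = series.isnull()
    draws = rng.choice(probs.index.to_numpy(),
                       size=int(missing.sum()),
                       p=probs.to_numpy())
    filled = series.copy()
    filled[missing] = draws                           # only missing positions change
    return filled

# Toy data, made up for illustration (not taken from the survey)
toy = pd.Series(['No', 'No', 'Yes', None, 'No', None])
filled = fill_by_observed_distribution(toy, seed=0)
print(filled.tolist())
```

Calling the function twice with the same seed returns the same result, which makes the cleaned dataset repeatable.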

Checking if the missing values have been filled in, then the count of unique values of the updated feature.

In [46]:
# Checking if the missing values have been filled in
missingvaluepercentage(data, 'self_employed')

Missing value : 0.00%


In [47]:
# Checking the unique values in the 'self_employed' feature
uniquevaluescount(data, 'self_employed')

Unique value count of self_employed
self_employed  Count
           No   1111
          Yes    148
Total : 1259


###work_interfere

The number of missing values in the 'work_interfere' feature is quite large (20.97%). However, since this feature may provide important insight when we analyze the dataset further, we will fill in the missing values instead of dropping the entire feature.

In [48]:
# Calculating the probability of each value
valueprobability(data, 'work_interfere')

work_interfere  Count  Probabiity
     Sometimes    465    0.467337
         Never    213    0.214070
        Rarely    173    0.173869
         Often    144    0.144724


In [49]:
# Filling in the missing values based on the probability of each unique value
data['work_interfere'] = data['work_interfere'].fillna(pd.Series(np.random.choice(listUV,
                                                                                  p = listUVprob,
                                                                                  size = len(data))))

Checking if the missing values have been filled in, then the count of unique values of the updated feature.

In [50]:
# Checking if the missing values have been filled in
missingvaluepercentage(data, 'work_interfere')

Missing value : 0.00%


In [51]:
# Checking the unique values in the 'work_interfere' feature
uniquevaluescount(data, 'work_interfere')

Unique value count of work_interfere
work_interfere  Count
     Sometimes    577
         Never    282
        Rarely    218
         Often    182
Total : 1259


###no_employees

We will rename the unique value 'More than 1000' to '>1000' to make it shorter, and change the '-' character to ' to ' to avoid automatic reformatting after we export the data to a CSV file (for example, spreadsheet software may interpret a value such as '1-5' as a date).

In [52]:
# Renaming the 'More than 1000' data
data['no_employees'] = data['no_employees'].replace(['More than 1000'], '>1000')

# Finding and replacing the character
data['no_employees'] = [w.replace('-', ' to ') for w in data['no_employees']]

# Checking if the data has been properly changed
uniquevaluescount(data, 'no_employees')

Unique value count of no_employees
no_employees  Count
     6 to 25    290
   26 to 100    289
       >1000    282
  100 to 500    176
      1 to 5    162
 500 to 1000     60
Total : 1259
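
The two renaming steps can also be written with pandas' vectorized string methods instead of a list comprehension. This is a sketch on made-up toy values; passing `regex=False` treats '-' as a literal character.

```python
import pandas as pd

# Toy values, made up for illustration, in the same format as 'no_employees'
sizes = pd.Series(['1-5', '6-25', 'More than 1000', '26-100'])

relabelled = (sizes
              .replace({'More than 1000': '>1000'})      # rename one category
              .str.replace('-', ' to ', regex=False))    # literal character swap
print(relabelled.tolist())
```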



###comments

The 'comments' feature has 161 unique values and a large number of missing values. We will check the percentage of the missing values.

In [53]:
# Checking the missing value percentage of 'comments' feature
missingvaluepercentage(data, 'comments')

Missing value : 86.97%


Since the feature has a large share of missing values, we will omit this feature.

In [54]:
# Dropping the 'comments' feature
data.drop('comments', axis = 1, inplace = True)

# Checking the features of the dataset
print(data.columns)

Index(['Age', 'Gender', 'Country', 'state', 'self_employed', 'family_history',
       'treatment', 'work_interfere', 'no_employees', 'remote_work',
       'tech_company', 'benefits', 'care_options', 'wellness_program',
       'seek_help', 'anonymity', 'leave', 'mental_health_consequence',
       'phys_health_consequence', 'coworkers', 'supervisor',
       'mental_health_interview', 'phys_health_interview',
       'mental_vs_physical', 'obs_consequence'],
      dtype='object')


The 'comments' feature has been successfully omitted while retaining the other features.

##**4. Data Exporting and Next Actions**

We have now finished cleaning the initial dataset and engineering some of its features. What is left is to export the result to a new CSV file.

In [55]:
# Exporting without the row index so that no extra unnamed column is written
data.to_csv('Mental Health Survey (Cleaned).csv', index = False)
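
When exporting, it matters whether the row index is written: by default `to_csv` writes it as an extra column without a name, which pandas shows as 'Unnamed: 0' when the file is read back. The sketch below checks this with an in-memory buffer and made-up toy data, so no file is created.

```python
import io
import pandas as pd

# Toy data, made up for illustration (not taken from the survey)
toy = pd.DataFrame({'Age': [31, 44], 'no_employees': ['6 to 25', '>1000']})

# Export without the index and read the result back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Export with the default settings for comparison
reloaded_with_index = pd.read_csv(io.StringIO(toy.to_csv()))

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```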