---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: Web Analytics

### 📋 **Topic**: Pandas (self-study)

### 🔗 **Link**: https://bit.ly/WA_LEC9_SEABORN

🚫 **Note**: This notebook was developed by Jie Lu. You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.

---

# 1. Data Visualization with Seaborn

The purpose of this notebook is to teach you some basic elements of visualization in Python. 

### Data Source: Survey on Mental Health in the Tech Workplace in 2014

- Our data comes from a 2014 survey that measures attitudes towards mental health and the frequency of mental health disorders in tech workplaces.
- Survey data from 2014 onward can be found on [OSMI](https://osmihelp.org/research).
- Here are the features of the data set:

| Feature | Description |
| --- | --- |
| Timestamp | The time when the survey was conducted |
| Age | Age |
| Gender | Gender (free-text response, cleaned later in this notebook) |
| Country | Country |
| state | If you live in the United States, which state or territory do you live in? |
| self_employed | Are you self-employed? |
| family_history | Do you have a family history of mental illness? |
| treatment | Have you sought treatment for a mental health condition? |
| work_interfere | If you have a mental health condition, do you feel that it interferes with your work? |
| no_employees | How many employees does your company or organization have? |
| remote_work | Do you work remotely (outside of an office) at least 50% of the time? |
| tech_company | Is your employer primarily a tech company/organization? |
| benefits | Does your employer provide mental health benefits? |
| care_options | Do you know the options for mental health care your employer provides? |
| wellness_program | Has your employer ever discussed mental health as part of an employee wellness program? |
| seek_help | Does your employer provide resources to learn more about mental health issues and how to seek help? |
| anonymity | Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources? |
| leave | How easy is it for you to take medical leave for a mental health condition? |
| mentalhealthconsequence | Do you think that discussing a mental health issue with your employer would have negative consequences? |
| physhealthconsequence | Do you think that discussing a physical health issue with your employer would have negative consequences? |
| coworkers | Would you be willing to discuss a mental health issue with your coworkers? |
| supervisor | Would you be willing to discuss a mental health issue with your direct supervisor(s)? |
| mentalhealthinterview | Would you bring up a mental health issue with a potential employer in an interview? |
| physhealthinterview | Would you bring up a physical health issue with a potential employer in an interview? |
| mentalvsphysical | Do you feel that your employer takes mental health as seriously as physical health? |
| obs_consequence | Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace? |
| comments | Any additional notes or comments |

In [None]:
# load data, data source: mental health in tech survey
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib

# ignore future warning
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.options.mode.chained_assignment = None

# set up global figure and font size
matplotlib.rcParams.update({'font.size': 18})
matplotlib.rcParams['figure.figsize'] = (16.0, 10.0)

In [None]:
# you can also get the data here: https://drive.google.com/file/d/1JFvtWmu9qV68dZpJEPEethYiNGh_U-uE/view?usp=sharing
df = pd.read_csv("files/mental_survey.csv")
df.head()

In [None]:
# a quick view of num_missing and num_unique

pd.concat([df.apply(lambda x: sum(x.isnull())).rename("num_missing"),
          df.apply(lambda x: len(x.unique())).rename("num_unique")], axis=1)

## Example: Bar Chart for Missing Values

Plot the share of missing values in each column as a horizontal bar chart. In this example, it's useful for examining the missing value pattern.
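
As a minimal sketch of the computation behind this chart, the snippet below uses a small made-up frame (the values are illustrative assumptions, not survey data). It relies on the fact that the mean of a boolean column is the fraction of `True` values, so `isnull().mean()` gives the same result as dividing the null count by the number of rows.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Age": [30, None, 25, 41],
    "state": ["NY", None, None, "CA"],
    "Country": ["US", "UK", "US", "CA"],
})

# Fraction of rows that are missing in each column
missing = toy.isnull().mean()

# Keep only columns with at least one missing value, largest first
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Here `state` is missing in half of the rows and `Age` in a quarter, while `Country` is dropped from the result because it has no missing values.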

In [None]:
# let's take a look at the missing values
missing_values = df.isnull().sum() / len(df)   # fraction of rows that are missing in each column
missing_values = missing_values[missing_values > 0]
missing_values = missing_values.sort_values(ascending=False)
missing_values

In [None]:
missing_values = missing_values.reset_index()
missing_values.columns = ['name', 'fraction']   # values are fractions between 0 and 1, not counts
sns.barplot(y='name', x='fraction', data=missing_values)
plt.title('Missing Values')
plt.ylabel('Features')
plt.xlabel('Fraction of Rows Missing')

## Example: barplot

This example avoids countplot on purpose, to show how seaborn can be used in a matplotlib-like way. It also works well with fig and ax if you want to customize the details.

### Some background:
- **Counter**: A Counter is a container that keeps track of how many times equivalent values are added.
- **List comprehensions**: They provide a concise way to create lists. A list comprehension consists of brackets containing an expression followed by a for clause, then zero or more for or if clauses. The expression can be anything, meaning you can put all kinds of objects in lists.
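
To make the two ideas concrete, here is a short sketch on a made-up list of country names (an assumption for illustration, not the survey data). `most_common(n)` returns the `n` most frequent values as `(value, count)` pairs, and list comprehensions split those pairs into labels and counts.

```python
from collections import Counter

# Hypothetical responses, for illustration only
countries = ["US", "UK", "US", "Canada", "UK", "US"]

# The two most frequent values as (value, count) pairs
top_two = Counter(countries).most_common(2)
print(top_two)  # [('US', 3), ('UK', 2)]

# Split the pairs into two parallel lists with list comprehensions
names = [name for name, _ in top_two]
counts = [count for _, count in top_two]
```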

In [None]:
# quick example: list comprehension
[y for y in range(10) if y % 2 == 0]

In [None]:
from collections import Counter

country_count = Counter(df['Country'].dropna().tolist()).most_common(10)
country_idx = [country[0] for country in country_count]
country_val = [country[1] for country in country_count]

In [None]:
sns.barplot(y = country_idx, x=country_val)
plt.title('Top Ten Countries in the Survey')
plt.ylabel('Country')
plt.xlabel('Count')
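
The fig and ax pattern mentioned earlier could look like the sketch below. The country labels and counts are made up for illustration. Passing `ax=ax` tells seaborn which matplotlib axes to draw on, so every later customization goes through the usual `ax.set_*` methods.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical counts, for illustration only
countries = ["Country A", "Country B", "Country C"]
counts = [5, 3, 1]

# Create the figure and axes explicitly, then let seaborn draw on them
fig, ax = plt.subplots(figsize=(8, 4))
sns.barplot(x=counts, y=countries, ax=ax)

# Customize through the matplotlib axes object
ax.set_title("Respondents per country (made-up data)")
ax.set_xlabel("Count")
ax.set_ylabel("Country")
fig.tight_layout()
```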

## Example: Catplot 

This function provides access to several axes-level functions that show the relationship between a numerical and one or more categorical variables using one of several visual representations.

* self_employed - Are you self-employed?
* remote_work - Do you work remotely (outside of an office) at least 50% of the time?
* tech_company - Is your employer primarily a tech company/organization?

In [None]:
sns.catplot(x='self_employed', hue='remote_work', col='tech_company', kind='count', data=df)

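## Example: displot
`displot` is a figure-level function for visualizing distributions. By default it draws a histogram with an automatically chosen bin size; here we set the number of bins explicitly. It replaces the deprecated `distplot`, which combined the matplotlib hist function with the seaborn kdeplot() and rugplot() functions.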

In [None]:
# use describe to display basic stats
df.Age.describe()

The summary statistics (check the minimum and maximum) indicate that the Age feature contains some outliers.

In [None]:
# keep only ages from 0 to 100; copy so that later column assignments act on an independent frame
df = df[(df['Age'] <= 100) & (df['Age'] >= 0)].copy()
sns.displot(df['Age'], bins=24)

## Example: Box Plot

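A box plot shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset, while the whiskers extend to show the rest of the distribution, except for points that are classified as outliers.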

In [None]:
sns.boxplot(x=df['Age'])   # pass the data as x to draw a horizontal box plot

How to read a (horizontal) box plot:
- The minimum (the smallest value that is not an outlier) is shown at the far left of the chart, at the end of the left “whisker.”
- The first quartile, Q1, is the left edge of the box (where the left whisker meets the box).
- The median is shown as a line inside the box.
- The third quartile, Q3, is the right edge of the box (where the right whisker meets the box).
- The maximum (the largest value that is not an outlier) is shown at the far right of the chart, at the end of the right whisker.
- Points drawn beyond the whiskers are outliers. By default, these are values more than 1.5 times the interquartile range (Q3 - Q1) away from the box.
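
The quantities in this list can be computed by hand. The sketch below uses only the standard library and a made-up list of ages (an assumption for illustration). It assumes the default 1.5 times IQR whisker rule, and it uses the `inclusive` quantile method on the assumption that this matches the linear interpolation matplotlib applies by default.

```python
import statistics

# Hypothetical ages, for illustration only
ages = [18, 23, 25, 27, 29, 31, 32, 35, 38, 41, 72]

# Quartiles: left edge of the box, median line, right edge of the box
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

# Values outside these fences are drawn as individual outlier points
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
inside = [a for a in ages if lower_fence <= a <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

print(q1, median, q3)             # 26.0 31.0 36.5
print(whisker_low, whisker_high)  # 18 41
print(outliers)                   # [72]
```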

## Example: FacetGrid

This class maps a dataset onto multiple axes arrayed in a grid of rows and columns that correspond to levels of variables in the dataset. The plots it produces are often called “lattice”, “trellis”, or “small-multiple” graphics.

* treatment: Have you sought treatment for a mental health condition?

In [None]:
g = sns.FacetGrid(df, col='treatment', height=5)
g = g.map(sns.histplot, "Age", kde=True)   # histplot with kde=True replaces the deprecated distplot

## Example: Catplot with FacetGrid

In [None]:
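# map Age to age_range (bins include their right edge; include_lowest=True also keeps Age == 0)
# map treatment to 0/1 (Yes=1, No=0) so that its mean is the share of respondents who sought treatment

df['age_range'] = pd.cut(df['Age'], [0,20,30,65,100], labels=["0-20", "21-30", "31-65", "66-100"], include_lowest=True)
df['treatment'] = df['treatment'].map(dict(Yes=1, No=0))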
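
To see how `pd.cut` treats the bin edges, here is a small sketch on made-up ages (illustrative values, not survey data) using the same bins and labels as above. Each bin includes its right edge, and `include_lowest=True` makes the first bin include 0 as well.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and next to the bin edges
ages = pd.Series([0, 20, 21, 30, 31, 65, 66])

age_range = pd.cut(ages, [0, 20, 30, 65, 100],
                   labels=["0-20", "21-30", "31-65", "66-100"],
                   include_lowest=True)

print(age_range.tolist())
# ['0-20', '0-20', '21-30', '21-30', '31-65', '31-65', '66-100']
```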

The code block below can be skipped. It just shows that a feature can sometimes be very noisy and needs to be cleaned manually.

In [None]:
# TL;DR --- clean 'Gender', map it into ['female', 'male', 'trans']

# lowercase once so that matching is case-insensitive
gender_lower = df['Gender'].str.lower()

# manually coded spellings (all lowercase, because they are compared with gender_lower)
male_str = ["male", "m", "male-ish", "maile", "mal", "male (cis)",
            "make", "male ", "man","msle", "mail", "malr","cis man", "cis male"]
trans_str = ["trans-female", "something kinda male?", "queer/she/they", 
             "non-binary","nah", "all", "enby", "fluid", "genderqueer", "androgyne", "agender", "male leaning androgynous", "guy (-ish) ^_^", "trans woman", "neuter", "female (trans)", "queer", "ostensibly male, unsure what that really means"]           
female_str = ["cis female", "f", "female", "woman",  "femake", "female ",
              "cis-female/femme", "female (cis)", "femail"]

# overwrite every matching row with its cleaned label
df.loc[gender_lower.isin(male_str), 'Gender'] = 'male'
df.loc[gender_lower.isin(female_str), 'Gender'] = 'female'
df.loc[gender_lower.isin(trans_str), 'Gender'] = 'trans'

# drop the responses that cannot be mapped to any group
rest = ['A little about you', 'p']
df = df[~df['Gender'].isin(rest)]

print(df['Gender'].unique())

In [None]:
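# catplot creates its own figure; its size is controlled by height and aspect
# errorbar=None hides the error bars (seaborn >= 0.12; use ci=None on older versions)
sns.catplot(x="age_range", y="treatment", hue="Gender", data=df,
            kind="bar", errorbar=None, height=5, aspect=2, legend_out=True)

plt.title('Share of respondents who sought treatment')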

Similarly, we can examine how treatment varies with family history and gender, and play with some other parameters.

By default, the error bars show a 95% confidence interval (`ci=95` in older seaborn versions, `errorbar=('ci', 95)` from version 0.12 onward).
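
The height of each bar in a `kind="bar"` catplot is the mean of the `y` variable within the group. Because treatment is coded as 0/1, that mean is the share of respondents in the group who sought treatment. A minimal sketch with made-up rows (an assumption for illustration) reproduces the bar heights with a plain groupby:

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "family_history": ["Yes", "Yes", "No", "No", "No"],
    "treatment": [1, 1, 0, 1, 0],
})

# Mean of a 0/1 column per group = share of 1s in that group
share = toy.groupby("family_history")["treatment"].mean()
print(share)
```

In this toy frame, every respondent with a family history sought treatment, compared with one in three of those without.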

In [None]:
sns.catplot(x="family_history", y="treatment", hue="Gender", data=df, kind="bar", height=5, aspect=3, legend_out = True)

## Example: Time series plot

Seaborn can also plot multiple time series easily with a line chart.

In [None]:
# example using US Treasury interest rates
import pandas as pd

# read the HTML tables from the US Treasury daily yield curve page (needs an HTML parser such as lxml)
dfs = pd.read_html('https://home.treasury.gov/resource-center/data-chart-center/interest-rates/TextView?type=daily_treasury_yield_curve&field_tdr_date_value_month=202210')
# copy the selected columns so that the assignments below act on an independent frame
df_line = dfs[0][['Date','1 Mo', '2 Mo', '3 Mo', '6 Mo']].copy()

df_line['Date'] = pd.to_datetime(df_line['Date'])
df_line = df_line.set_index("Date")
df_line.head()

In [None]:
sns.lineplot(data=df_line, dashes=False)

In [None]:
# regplot draws a scatter plot of 1 month vs. 2 month yields with a fitted linear regression line
sns.regplot(x='1 Mo', y='2 Mo', data=df_line)

## Example: Correlation Matrix

When exploring the data, it's natural to check what features are correlated with each other. 

In [None]:
# some preprocessing work: drop the columns we do not want in the correlation matrix

df = df.drop(columns=['Country', 'comments', 'state', 'Timestamp'])
df = df.dropna()       # dropna to make life easier

In [None]:
# use labelencoder to transform every column from str to numeric
from sklearn.preprocessing import LabelEncoder

for feature in df:
    le = LabelEncoder()
    df[feature] = le.fit_transform(df[feature])
    
df.head()

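**LabelEncoder**: This encoder, which scikit-learn designs for target labels, maps each distinct value in a column to an integer between 0 and n_classes-1, assigned in sorted order. For example, a column with the values female, male and trans would be mapped to 0, 1 and 2. Numeric columns are re-coded in the same way: each distinct age is replaced by its rank among the distinct ages, which preserves the order but not the original values.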
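
A short sketch on made-up values (illustrative assumptions, not survey data) shows the mapping for a string column and a numeric column:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values, for illustration only
genders = ["male", "female", "trans", "female"]
ages = [25, 30, 25, 41]

# Classes are sorted, then numbered from 0
gender_codes = LabelEncoder().fit_transform(genders)
age_codes = LabelEncoder().fit_transform(ages)

print(gender_codes.tolist())  # [1, 0, 2, 0]
print(age_codes.tolist())     # [0, 1, 0, 2]
```

Note that the ages 25, 30 and 41 become 0, 1 and 2, so the encoded column keeps the ordering of the ages but loses the distances between them.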

In [None]:
# correlation matrix
corrmat = df.corr()
fig, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True, ax=ax)   # draw on the axes created above
plt.show()

We can see benefits, care_options, wellness_program, seek_help and anonymity are closely correlated with each other.

* benefits: Does your employer provide mental health benefits?

* care_options: Do you know the options for mental health care your employer provides?

* wellness_program: Has your employer ever discussed mental health as part of an employee wellness program?

* seek_help: Does your employer provide resources to learn more about mental health issues and how to seek help?

* anonymity: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?

## Which features are most closely correlated with treatment?

In [None]:
# treatment correlation matrix
# pick the k=5 variables with the largest correlation with treatment (treatment itself is one of them)

import numpy as np
k = 5
cols = corrmat.nlargest(k, 'treatment')['treatment'].index
cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

* work_interfere: If you have a mental health condition, do you feel that it interferes with your work?
* family_history: Do you have a family history of mental illness?
* benefits: Does your employer provide mental health benefits?
* care_options: Do you know the options for mental health care your employer provides?

## There is a lot of fun stuff to explore in seaborn. Enjoy!

In [None]:
# dogplot - post-credit scene

sns.dogplot()