## Data Analysis & Visualization with Python: Analysis of US Citizens by Income Levels

<a id="toc"></a>

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Content</p>

* [Introduction](#0)
* [Dataset Info](#1)
* [Importing Related Libraries](#2)
* [Recognizing & Understanding Data](#3)
* [Univariate & Multivariate Analysis](#4)    
* [Other Specific Analysis Questions](#5)
* [Dropping Similar & Unnecessary Features](#6)
* [Handling Missing Values](#7)
* [Handling Outliers](#8)
* [Final Step to Prepare the Dataset for ML Models](#9)
* [The End of the Project](#10)

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Introduction</p>

<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

**``Exploratory Data Analysis (EDA)``** is one of the most important parts of any data science project, yet it rarely gets the attention it deserves. In short, EDA is **``"a first look at the data"``**. It is used to understand and summarize the content of a dataset so that the features we feed to our machine learning algorithms are refined and the results are valid and correctly interpreted.
Scanning a column of numbers or a whole spreadsheet to find the important characteristics of the data can be tedious and error-prone. It is therefore good practice to understand the problem statement and the data before you get your hands dirty, which in turn helps to gain a lot of insights.

The Adult dataset was extracted from the 1994 Census database. It is also known as the “Census Income” dataset. Details of this dataset can be found at the **[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Adult)**. The problem statement here is how to explore and prepare the dataset for Machine Learning modelling, where we may predict whether a person's income exceeds 50K a year based on the census data.

# Aim of the Project

Apply Exploratory Data Analysis (EDA) and prepare the data for Machine Learning algorithms:
1. Analyze the characteristics of individuals according to income groups.
2. Prepare the data for a model that will predict people's income levels from their characteristics (so the "salary" feature is the target feature).

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Dataset Info</p>

<a id="1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

The Census Income dataset has 32561 entries. Each entry contains the following information about an individual:

- **salary (target feature/label):** whether or not an individual makes more than $50,000 annually. (<= 50K, >50K)
- **age:** the age of an individual. (Integer greater than 0)
- **workclass:** a general term to represent the employment status of an individual. (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked)
- **fnlwgt:** the final weight, i.e. the number of people the census believes the entry represents. People with similar demographic characteristics should have similar weights. There is one important caveat: since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, this statement only applies within a state. (Integer greater than 0)
- **education:** the highest level of education achieved by an individual. (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.)
- **education-num:** the highest level of education achieved in numerical form. (Integer greater than 0)
- **marital-status:** marital status of an individual. Married-civ-spouse corresponds to a civilian spouse while Married-AF-spouse is a spouse in the Armed Forces. Married-spouse-absent includes married people living apart because either the husband or wife was employed and living at a considerable distance from home (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse)
- **occupation:** the general type of occupation of an individual. (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces)
- **relationship:** represents what this individual is relative to others. For example an individual could be a Husband. Each entry only has one relationship attribute. (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried)
- **race:** Descriptions of an individual’s race. (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black)
- **sex:** the biological sex of the individual. (Male, Female)
- **capital-gain:** capital gains for an individual. (Integer greater than or equal to 0)
- **capital-loss:** capital loss for an individual. (Integer greater than or equal to 0)
- **hours-per-week:** the hours an individual has reported to work per week. (continuous)
- **native-country:** country of origin for an individual (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands)

Having examined the features above, we can conclude that some variables, such as **``fnlwgt``**, are not directly related to the target variable **``salary``** and are not self-explanatory. They can therefore be removed from the dataset if a Machine Learning model is to be built. The continuous variable **``fnlwgt``** represents the final weight, which is the number of units in the target population that the responding unit represents. The variable **``education_num``** is an ordered numeric encoding of the discrete variable education, often read as an approximate number of years of education. The variable **``relationship``** represents the responding unit’s role in the family. **``capital_gain``** and **``capital_loss``** are gains and losses from investment sources other than wage/salary.
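
Whether two features such as education and education_num really carry the same information can be checked directly. The following is a minimal sketch on a small, made-up DataFrame (the rows are illustrative, not taken from the file); on the real data the same check would be run on ``df`` once it is loaded.

```python
import pandas as pd

# Toy rows standing in for the real data (illustrative values only)
toy = pd.DataFrame({
    "education":     ["Bachelors", "HS-grad", "Bachelors", "Masters", "HS-grad"],
    "education_num": [13, 9, 13, 14, 9],
})

# Count how many distinct numeric codes each education label maps to
codes_per_label = toy.groupby("education")["education_num"].nunique()

# If every label maps to exactly one code, one of the two columns is redundant
is_one_to_one = bool((codes_per_label == 1).all())

print(codes_per_label)
print("One code per label:", is_one_to_one)
```

If the check holds on the full dataset, one of the two columns can be dropped without losing information.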

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">How to Install/Enable Intellisense or Autocomplete in Jupyter Notebook</p>

### Installing [jupyter_contrib_nbextensions](https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/install.html)

**To install the current version from The Python Package Index (PyPI), which is a repository of software for the Python programming language, simply type:**

!pip install jupyter_contrib_nbextensions

**Alternatively, you can install directly from the current master branch of the repository:**

!pip install https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tarball/master

### Enabling [Intellisense or Autocomplete in Jupyter Notebook](https://botbark.com/2019/12/18/how-to-enable-intellisense-or-autocomplete-in-jupyter-notebook/)


### Installing hinterland for jupyter without anaconda

**``STEP 1:``** ``Open cmd prompt and run the following commands``
 1) pip install jupyter_contrib_nbextensions<br>
 2) pip install jupyter_nbextensions_configurator<br>
 3) jupyter contrib nbextension install --user<br> 
 4) jupyter nbextensions_configurator enable --user<br>

**``STEP 2:``** ``Open jupyter notebook``
 - click on nbextensions tab<br>
 - uncheck "disable configuration for nbextensions without explicit compatibility"<br>
 - put a check on Hinterland<br>

**``STEP 3:``** ``Open a new Python notebook and check the autocomplete feature``

[VIDEO SOURCE](https://www.youtube.com/watch?v=DKE8hED0fow)

![Image_Assignment](https://i.ibb.co/RbmDmD6/E8-EED4-F3-B3-F4-4571-B6-A0-1-B3224-AAB060-4-5005-c.jpg)

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Importing Related Libraries</p>

<a id="2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

Once you've installed NumPy, Pandas, Matplotlib, Seaborn and the other libraries you need (including ``termcolor``, e.g. with ``!pip install termcolor``), you can import them:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from matplotlib.lines import Line2D
import seaborn as sns
from termcolor import colored

import warnings
warnings.filterwarnings("ignore")
warnings.warn("this will not show")

plt.rcParams["figure.figsize"] = (10, 6)

sns.set_style("whitegrid")
# to disable scientific notation 
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# to reset to the original settings we can use
# pd.reset_option('display.float_format', silent=True)

# Set it None to display all rows in the dataframe
# pd.set_option('display.max_rows', None)

# Set it to None to display all columns in the dataframe
pd.set_option('display.max_columns', None)

# Plotly Express can be used as a Pandas .plot() backend.
# pd.options.plotting.backend = "plotly"

In [None]:
def show_nulls(data):
    '''
    Plots the share of missing values for each column in the dataset.
    '''
    # displot is a figure-level function and creates its own figure,
    # so no separate plt.figure() call is needed (it would leave an empty figure)
    sns.displot(data=data.isnull().melt(value_name="missing"),
                y="variable",
                hue="missing",
                multiple="fill",
                height=9.25)

    # Reference line at 20% missing values
    plt.axvline(0.2, color="r")
    plt.show()

In [None]:
def show_density(col):
    '''
    Plots a density plot, i.e. a kernel density estimate of the probability density
    function of a numeric variable, together with its mean, median, and mode.
    '''
    fig = plt.figure(figsize=(15, 5))

    # Plot density
    col.plot.density()

    # Add titles and labels
    plt.title('Data Density', fontsize=15)

    # Show the mean, median, and mode
    plt.axvline(x=col.mean(),    color='cyan',   linestyle='dashed', linewidth=2, label='Mean')
    plt.axvline(x=col.median(),  color='red',    linestyle='dashed', linewidth=2, label='Median')
    plt.axvline(x=col.mode()[0], color='yellow', linestyle='dashed', linewidth=2, label='Mode')
    plt.legend()

    # Show the figure
    plt.show()

In [None]:
def show_distribution(col):
    
    '''
    Prints summary statistics and plots a histogram and a box plot of a numeric variable.
    It aims to describe the data and explore the central tendency and variability
    before using advanced statistical analysis techniques.
    '''
    # Get statistics
    from termcolor import colored

    print(colored('Statistical Calculations :', 'red', attrs=['bold']))
    print(colored('-'*26, 'red', attrs=['bold']))    
    min_val = col.min()
    max_val = col.max()
    mean_val = col.mean()
    med_val = col.median()
    mod_val = col.mode()[0]

    print(colored('Minimum:{:>7.2f}\nMean:{:>10.2f}\nMedian:{:>8.2f}\nMode:{:>10.2f}\nMaximum:{:>7.2f}\n'.format(min_val,
                                                                                             mean_val,
                                                                                             med_val,
                                                                                             mod_val,
                                                                                             max_val), 'blue', attrs=['bold']))
    
    
    # Create a figure for 2 subplots (2 rows, 1 column)
    fig, ax = plt.subplots(2, 1, figsize=(15, 15))

    # Plot the histogram   
    ax[0].hist(col, bins=30)
    ax[0].set_ylabel('Frequency', fontsize=10)

    # Add lines for the minimum, mean, median, mode, and maximum
    ax[0].axvline(x=min_val,  color='yellow',     linestyle='dashed', linewidth=2, label='Minimum')
    ax[0].axvline(x=mean_val, color='lightgreen', linestyle='dashed', linewidth=2, label='Mean')
    ax[0].axvline(x=med_val,  color='cyan',       linestyle='dashed', linewidth=2, label='Median')
    ax[0].axvline(x=mod_val,  color='purple',     linestyle='dashed', linewidth=2, label='Mode')
    ax[0].axvline(x=max_val,  color='red',        linestyle='dashed', linewidth=2, label='Maximum')
    ax[0].legend(loc='upper right')

    # Plot the boxplot   
    medianprops = dict(linestyle='-', linewidth=3, color='m')
    boxprops=dict(linestyle='-', linewidth=1.5)
    meanprops={"marker":"d", "markerfacecolor":"red", "markeredgecolor":"black", "markersize":"10"}
    flierprops={'marker': 'o', 'markersize': 8, 'markerfacecolor': 'fuchsia'}
    
    ax[1].boxplot(col, 
                  vert=False,
                  notch=True, 
                  patch_artist=False,
                  medianprops=medianprops,
                  flierprops=flierprops,
                  showmeans=True,
                  meanprops=meanprops)
    
    ax[1].set_xlabel('value', fontsize=10)
    

    # Add a title to the Figure
    fig.suptitle('Data Distribution', fontsize=15)

In [None]:
def show_compare(df, col1, col2):
    
    '''
    Compares the distribution of a numeric variable (col1) across the two subcategories
    of another variable (col2), e.g. the target variable.
    ''' 
    from matplotlib.patches import Patch
    from matplotlib.lines import Line2D
    
    # Get statistics
    from termcolor import colored

    print(colored('Statistical Calculations :', 'red', attrs=['bold']))
    print(colored('-'*26, 'red', attrs=['bold']))
    min_val = df[col1].min()
    max_val = df[col1].max()
    mean_val = df[col1].mean()
    med_val = df[col1].median()
    mod_val = df[col1].mode()[0]

    print(colored('Minimum:{:>7.2f}\nMean:{:>10.2f}\nMedian:{:>8.2f}\nMode:{:>10.2f}\nMaximum:{:>7.2f}\n'.format(min_val,
                                                                                             mean_val,
                                                                                             med_val,
                                                                                             mod_val,
                                                                                             max_val), 'blue', attrs=['bold']))

    fig, ax = plt.subplots(figsize=(12, 6))

    ax = sns.kdeplot(data=df, x=col1, hue=col2, fill=True)
    
    plt.title("Data Density", fontsize=20, color="darkblue")
    ax.ticklabel_format(style='plain')

    # Group labels in order of first appearance (seaborn's default hue order for string data)
    groups = df[col2].dropna().unique()

    legend_elements1 = [Line2D([0], [0], marker='s', color='lightblue', label=groups[0], markersize=15),
                        Line2D([0], [0], marker='s', color='orange', label=groups[1], markersize=15)]
    l1 = plt.legend(handles=legend_elements1, title='Salary Type', bbox_to_anchor=(0.84, 1))

    # Compute the statistics once and reuse them for the legend title and the lines
    overall_mean = df[col1].mean()
    overall_median = df[col1].median()
    overall_mode = df[col1].mode()[0]
    group_means = df.groupby(col2)[col1].mean()
    group_mean1 = group_means.loc[groups[0]]  # label-based lookup instead of positional [0]
    group_mean2 = group_means.loc[groups[1]]

    legend_elements2 = [Line2D([0], [0], color='green',  label='Overall Mean',   linestyle='dashed'),
                        Line2D([0], [0], color='blue',   label='Group Mean',     linestyle='-'),
                        Line2D([0], [0], color='orange', label='Group Mean',     linestyle='-'),
                        Line2D([0], [0], color='red',    label='Overall Median', linestyle='dashed'),
                        Line2D([0], [0], color='yellow', label='Overall Mode',   linestyle='dashed')]
    l2 = plt.legend(handles=legend_elements2,
                    title=(f"Overall Mean {round(overall_mean, 2)}\n"
                           f"Group Mean {round(group_mean1, 2)}\n"
                           f"Group Mean {round(group_mean2, 2)}\n"
                           f"Overall Median {round(overall_median, 2)}\n"
                           f"Overall Mode {round(overall_mode, 2)}"),
                    bbox_to_anchor=(0.9, 0.81))

    plt.axvline(x=overall_mean,   color='green',  linestyle='dashed', linewidth=2, label='Overall Mean')
    plt.axvline(x=overall_median, color='red',    linestyle='dashed', linewidth=2, label='Overall Median')
    plt.axvline(x=overall_mode,   color='yellow', linestyle='dashed', linewidth=2, label='Overall Mode')
    plt.axvline(x=group_mean1,    color='blue',   linestyle='-',      linewidth=2, label='Group Mean')
    plt.axvline(x=group_mean2,    color='orange', linestyle='-',      linewidth=2, label='Group Mean')

    ax.add_artist(l1)  # re-add the first legend, because the second call to legend() replaces it

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:left; border-radius:10px 10px;">Reading the data from file</p>

In [None]:
df0 = pd.read_csv("adult_eda.csv", sep=",")
df = df0.copy()

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Recognizing and Understanding Data</p>

<a id="3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

### 1. Try to understand what the data looks like
- Check the head, shape, and data types of the features.
- Check whether there are any duplicate rows. If there are, drop them.
- Check the statistical values of the features.
- If needed, rename the columns for easier use.
- Check the missing values.

📝 **[head()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html)** function is used to access the first n rows of a dataframe or series.

In [None]:
df.head()

📝 The **[shape](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.shape.html)** attribute returns the number of rows and columns of the given DataFrame as a tuple.

In [None]:
df.shape

**There are 32561 samples in the dataset. There are 15 columns (features) in the dataset including both categorical and numerical ones.**

📝 The **[info()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html)** method prints information about the DataFrame. The information contains the number of columns, column labels, column data types, memory usage, range index, and the number of cells in each column (non-null values).

In [None]:
# Dataset information

print(colored('US Citizens Income Levels Dataset Information:\n', 'blue', attrs=['bold']))
df.info()

**The education-num and relationship columns seem to have missing values. However, we are not yet sure whether the other columns contain inappropriate values.**

📝 **[duplicated()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html)** method is used to find duplicate rows in a DataFrame. It returns a boolean series which identifies whether a row is duplicate or unique.

In [None]:
# Checking and detecting the duplicated rows

df.duplicated().value_counts()  

📝 Pandas **[drop_duplicates()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html)** method helps in removing duplicates from the data frame.

In [None]:
# Drop duplicated rows in place, keeping the first occurrence of each

df.drop_duplicates(keep='first', inplace=True)  

In [None]:
# Checking the shape of the dataset after dropping the duplicated rows

df.shape

In [None]:
# Checking and detecting the duplicated rows

df.duplicated().value_counts()  

What happened❓ Even though the ``drop_duplicates()`` method worked here, you should be careful when some features are similar to each other, such as "education" vs "education-num" and "relationship" vs "marital-status" in our case. We highly recommend tackling duplicates only after gaining enough insight to decide which of the similar features to keep for the analysis, and after handling NaN values. Why❓ 🤔
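
One way to see why: the number of duplicated rows depends on which columns are still in the DataFrame and on how NaN values are treated. The sketch below uses a small, made-up DataFrame (not rows from the actual file) to illustrate the effect.

```python
import numpy as np
import pandas as pd

# Toy data: the first two rows describe the same person,
# but one of them is missing the "relationship" value
toy = pd.DataFrame({
    "age":            [39, 39, 50],
    "marital-status": ["Married-civ-spouse", "Married-civ-spouse", "Divorced"],
    "relationship":   ["Husband", np.nan, "Unmarried"],
})

# With the overlapping column present, the first two rows differ ("Husband" vs NaN)
dups_before = int(toy.duplicated().sum())

# Once the overlapping column is dropped, they become exact duplicates
dups_after = int(toy.drop(columns="relationship").duplicated().sum())

print("Duplicates before:", dups_before)
print("Duplicates after: ", dups_after)
```

Dropping duplicates too early can therefore miss rows that only become identical after similar features are removed or NaN values are filled.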

In [None]:
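def check_obj_columns(df):
    '''
    Prints the name of each object column that holds more than one Python type.
    Prints nothing if there is no such column.
    '''
    # Series.map(type) works across pandas versions
    # (DataFrame.applymap is deprecated in recent releases)
    for col in df.select_dtypes(include=['object']).columns:
        if df[col].map(type).nunique() > 1:
            print("Column {} has mixed object types.".format(col))

check_obj_columns(df)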

In [None]:
# Show the Python type of every value in the object columns
df.select_dtypes(include=['object']).apply(lambda s: s.map(type))

Even though the dtype of the "relationship" feature in our DataFrame is "object", the feature holds more than one type of data. For example, it has ``<class 'str'>`` at index 0 and ``<class 'float'>`` at index 32559.

In [None]:
df.info()

In [None]:
display(df["workclass"].apply(type).value_counts())
display(df["relationship"].apply(type).value_counts())

# As seen here, while "workclass" feature has 1 type (<class 'str'>) of data,  
# "relationship" feature has 2 types (<class 'str'> and <class 'float'>) of data even if its dtype is "Object"

In this case, the problem is that the "relationship" column holds mixed types. As you may remember from our sessions, the object dtype in Pandas can hold values of different Python types, which can lead to tricky situations.
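
The mixed types usually come from missing values: NaN is a ``float``, so an object column of strings that also contains NaN reports two Python types. A minimal sketch on a made-up Series (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy object column: two strings and one missing value
s = pd.Series(["Husband", np.nan, "Wife"], name="relationship", dtype="object")

# Name of the Python type of each element, then how often each type occurs
type_names = s.map(lambda v: type(v).__name__).value_counts()

# The float entries are exactly the missing values
n_missing = int(s.isna().sum())

print(type_names)
print("Missing values:", n_missing)
```

So a ``float`` inside a string column is typically a sign of NaN values, not of numeric data.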

In [None]:
df["relationship"]

In [None]:
df["relationship"].value_counts(dropna=False)

In [None]:
print("The value at the index of '32559' is ", end='')
print(colored(df.loc[32559]["relationship"], 'blue', attrs=['bold']))
print("The type of element: ", end='')
print(colored(type(df.loc[32559]["relationship"]), 'blue', attrs=['bold']))
print(colored("*"*50, 'green', attrs=['bold']))

print("The value at the index of '0' is ", end='')
print(colored(df.loc[0]["relationship"], 'red', attrs=['bold']))
print("The type of element: ", end='')
print(colored(type(df.loc[0]["relationship"]), 'red', attrs=['bold']))
print(colored("*"*50, 'green', attrs=['bold']))

print("The dtype of 'relationship' feature is ", end='')
print(colored(df["relationship"].dtypes, 'magenta', attrs=['bold']))

**Some Remarks on Duplicated Values:** Duplicated values will NOT be handled further for now, since some features are similar to each other, such as "relationship" and "marital-status", when their unique values are compared. Once we have decided which one to keep for the analysis and have handled the NaN values, duplicates will be considered again.

📝 The **[describe()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)** method computes and displays summary statistics for a Pandas DataFrame or Series, such as the percentiles, mean, and standard deviation of the numerical values. By default only numeric columns are summarized; object columns and DataFrames with mixed data types can be analyzed through the ``include`` parameter.

In [None]:
# Descriptive Statistics of the dataset

df.describe().T

**Rename the features:**

**``"education-num"``**, **``"marital-status"``**, **``"capital-gain"``**, **``"capital-loss"``**, **``"hours-per-week"``**, **``"native-country"``**, and **``"sex"``** as
**``"education_num"``**, **``"marital_status"``**, **``"capital_gain"``**, **``"capital_loss"``**, **``"hours_per_week"``**, **``"native_country"``**, and **``"gender"``**, respectively and permanently.

📝 The Pandas **[rename()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)** method is used to rename index or column labels. Columns can also be renamed by assigning new labels to the DataFrame's ``columns`` attribute.

In [None]:
# Changing the names of some columns for convenience

df.rename(columns={"education-num" : "education_num",
                   "marital-status" : "marital_status",
                   "capital-gain" : "capital_gain",
                   "capital-loss": "capital_loss",
                   "hours-per-week" : "hours_per_week",
                   "native-country" : "native_country",
                   "sex" : "gender"},
          inplace=True)
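
Since most of these renames only swap a hyphen for an underscore, they can also be written more compactly. This is an alternative sketch on an empty, made-up DataFrame with a few of the column names, not the approach used in the cell above:

```python
import pandas as pd

# Empty toy DataFrame that only carries some of the original column names
toy = pd.DataFrame(columns=["age", "education-num", "marital-status", "sex"])

# Replace every hyphen with an underscore in one step
toy.columns = toy.columns.str.replace("-", "_", regex=False)

# "sex" -> "gender" is not a hyphen fix, so it is still renamed explicitly
toy = toy.rename(columns={"sex": "gender"})

print(list(toy.columns))
```

The explicit dictionary in the cell above is more verbose but documents each change, which can be preferable in a shared notebook.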

📝 Pandas **[isnull()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html)** & **[sum()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html)** together return the number of missing values in the columns of any given dataframe.

In [None]:
# Checking the sum of Missing Values per column

df.isnull().sum() 

In [None]:
# Check the Percentage of Missing Values for each column. len(df): 32537

df.isnull().sum() / df.shape[0] * 100  

In [None]:
print(colored('Missing Value Information Per Column:\n', 'blue', attrs=['bold']))
missing_count = df.isnull().sum()
missing_per = df.isnull().sum()/df.shape[0]*100

missing_df = pd.concat({"missing_count": missing_count, "missing_percentage": missing_per}, axis=1)
missing_df
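
The same summary can be built slightly more directly, because the mean of a boolean column is the fraction of ``True`` values. A minimal sketch on a made-up DataFrame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy data with a few missing entries
toy = pd.DataFrame({
    "education_num": [13.0, np.nan, 9.0, 14.0],
    "relationship":  ["Husband", np.nan, np.nan, "Wife"],
    "age":           [39, 50, 38, 53],
})

# isnull() gives booleans: their sum is the count, their mean is the fraction missing
summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_percentage": toy.isnull().mean() * 100,
}).sort_values("missing_count", ascending=False)

print(summary)
```

Sorting by the count puts the most affected columns at the top, which helps when the DataFrame has many features.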

In [None]:
show_nulls(df)

**Some Remarks on Missing Values:** There is a high number of missing values in the relationship column. People might prefer not to disclose their role in the family, or might hesitate to answer "not-in-family", which can be a sensitive/personal issue. Although the share is small, the education_num column is also missing about 2.5% of its values. Some individuals may never have attended school and found no matching option in the survey, or they may have hesitated to disclose their educational background.

### 2. Look at the value counts of columns that have object datatype and detect strange values apart from the NaN Values

📝 The Pandas **[columns](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html)** attribute returns the column labels of the given DataFrame.

In [None]:
df.columns

In [None]:
# Descriptive Statistics of Categorical Features in the Dataset
# So we selected the categorical columns using the select_dtypes() function.

print(colored('Descriptive Statistics of Categorical Features:\n', 'blue', attrs=['bold']))
df.describe(include="object").T

📝 The **[select_dtypes()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html)** method returns a new DataFrame that includes/excludes columns of the specified dtype(s). Use the include parameter to specify the included columns, or use the exclude parameter to specify which columns to exclude.

In [None]:
object_col = df.select_dtypes(include='object').columns
object_col

In [None]:
# Checking uniques and their numbers at Categorical Features in the Dataset

for col in object_col:
    print(col)
    print("--"*13)
    print(df[col].value_counts(dropna=False))
    print("--"*20)

📝 The **[isin()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html)** method checks if the Dataframe contains the specified value(s).

It returns a DataFrame similar to the original DataFrame, but the original values have been replaced with True if the value was one of the specified values, otherwise False.

In [None]:
# Checking if any column includes "?" as an input

df.isin(['?']).any()

# df.isin(["?"]).sum(axis=0)

📝 **Some Remarks on "?" Mark:** If the questions for the workclass, occupation, and native country columns were drop-down or multiple-choice items with pre-determined answer choices, some people might not have found a fitting option and/or might have submitted a question mark as their answer. Another possibility is that the workclass and occupation items were not clear enough for some people.

📝 **The important point** is that incomplete/misleading data can only be discarded/excluded if a test suggests that they are **[Missing Completely At Random (MCAR)](https://towardsdatascience.com/handling-missing-data-in-machine-learning-d7acac44bef9)**; otherwise, as Data Analysts, we have to treat the data rather than discard them, so that the underlying patterns in the dataset are preserved. This decision should be part of our data screening process.
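
As a rough first look (not a formal MCAR test), the "?" entries can be converted to NaN and their share compared across the target levels. The sketch below uses a small, made-up DataFrame; the values and the resulting shares are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data with "?" placeholders (illustrative values only)
toy = pd.DataFrame({
    "workclass":  ["Private", "?", "State-gov", "?", "Private", "Private"],
    "occupation": ["Sales", "?", "Adm-clerical", "?", "Tech-support", "Sales"],
    "salary":     ["<=50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K"],
})

# Count the "?" entries per column
question_marks = toy.isin(["?"]).sum()

# Treat "?" as a proper missing value
cleaned = toy.replace("?", np.nan)

# Share of missing "workclass" values within each salary level;
# clearly different shares hint that the values are not missing completely at random
missing_share = cleaned["workclass"].isnull().groupby(cleaned["salary"]).mean()

print(question_marks)
print(missing_share)
```

A large gap between the groups would be a reason to impute or keep a separate "Unknown" category instead of dropping the rows.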

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Univariate & Multivariate Analysis</p>

<a id="4"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

Examine all features separately (first the target feature "salary", then the numeric ones, and lastly the categorical ones) from different aspects with respect to the target feature.

**To-do list for numeric features:**
1. Check the boxplot to see extreme values
2. Check the histplot/kdeplot to see the distribution of the feature
3. Check the statistical values
4. Check the boxplot and histplot/kdeplot by "salary" levels
5. Check the statistical values by "salary" levels
6. Write down the conclusions you draw from your analysis
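
The statistical steps of this list can be sketched as follows. The DataFrame is made up for illustration, and the 1.5 × IQR rule is the usual boxplot convention for flagging extreme values, shown here as an assumption about how "extreme" is judged.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "age":    [25, 38, 52, 44, 31, 60],
    "salary": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"],
})

# Step 3: overall statistical values
overall = toy["age"].describe()

# Step 5: statistical values by "salary" level
by_salary = toy.groupby("salary")["age"].describe()

# Step 1 in numbers: the fences a boxplot uses to flag extreme values
q1, q3 = toy["age"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
extremes = toy[(toy["age"] < lower) | (toy["age"] > upper)]

print(overall)
print(by_salary)
print("Fences:", lower, upper, "| extreme rows:", len(extremes))
```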

**To-do list for categorical features:**
1. Find the features that contain similar values, examine the similarities, and analyze them together
2. Check the count/percentage of people in each category and visualize it with a suitable plot
3. If needed, decrease the number of categories by combining similar categories
4. Check the count of people in each "salary" level by category and visualize it with a suitable plot
5. Check the percentage distribution of people in each "salary" level by category and visualize it with a suitable plot
6. Check the count of people in each category by "salary" level and visualize it with a suitable plot
7. Check the percentage distribution of people in each category by "salary" level and visualize it with a suitable plot
8. Write down the conclusions you draw from your analysis
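
The counting and percentage steps of this list can be sketched with ``pd.crosstab``. The DataFrame below is made up for illustration, and which direction to normalize in each step is one possible reading of the list.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "salary": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"],
})

# Step 2: count of people in each category
counts = toy["gender"].value_counts()

# Steps 4 and 6: counts for every category / salary level combination
table = pd.crosstab(toy["gender"], toy["salary"])

# Row percentages: how each category splits across the salary levels
within_category = pd.crosstab(toy["gender"], toy["salary"], normalize="index") * 100

# Column percentages: how each salary level splits across the categories
within_salary = pd.crosstab(toy["gender"], toy["salary"], normalize="columns") * 100

print(table)
print(within_category)
print(within_salary)
```

Each of these tables can be passed to a bar plot (for example with ``.plot(kind="bar")``) for the visualization part of the steps.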

📝 **Note :** Instruction/direction for each feature is available under the corresponding feature in detail, as well.

## Salary (Target Feature)

**Check the count of people in each "salary" level and visualize it with a countplot**

📝 The **[value_counts()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html)** function is used to get a Series containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.

In [None]:
# Checking the counts of unique values in "salary" feature

df.salary.value_counts(dropna=False)

In [None]:
# Visualizing the number of people in each category of "salary"

fig, ax = plt.subplots()

sns.countplot(data=df, x="salary")

ax.set_title("Total Number of People by Income Level", fontsize=18)

for p in ax.patches:
    ax.annotate((p.get_height()), (p.get_x()+0.35, p.get_height()+100))

📝 The Axes object returned by the plot has an attribute called ``containers`` that lists the groups of bars that were drawn. Iterate through the items of ``containers`` and pass each one to the **[bar_label](https://matplotlib.org/3.5.0/gallery/lines_bars_and_markers/bar_label_demo.html)** method. ``This will extract and display the bar value in the bar plot.``

In [None]:
fig, ax = plt.subplots()

ax = sns.countplot(data=df, x="salary")

ax.set_title("Total Number of People by Income Level", fontsize=18)

for container in ax.containers:
    ax.bar_label(container);

**Check the percentage of people in each "salary" level and visualize it with a pie plot**

📝 With **[normalize](https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.Series.value_counts.html#:~:text=With%20normalize%20set%20to%20True,by%20the%20sum%20of%20values.&text=Bins%20can%20be%20useful%20for,number%20of%20half%2Dopen%20bins.)** parameter set to ``True``, returns the ``relative frequency`` by dividing all values by the sum of values. 
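
As a small sketch of what ``normalize=True`` does, the relative frequencies can be reproduced by hand by dividing the counts by their sum. The Series `s` below is hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy column with three low-income rows and one high-income row
s = pd.Series(["<=50K", "<=50K", "<=50K", ">50K"])

counts = s.value_counts()                  # absolute counts
relative = s.value_counts(normalize=True)  # relative frequencies
manual = counts / counts.sum()             # the same result computed by hand

# Relative frequencies expressed as percentages
percent = (relative * 100).round(1)
print(percent)
```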

In [None]:
# Checking the proportion of people who make above 50K and under 50K

df.salary.value_counts(normalize=True)

In [None]:
# Visualizing the percentage distribution of the salary feature

fig, ax = plt.subplots(figsize=(6, 6))

ax.pie(x=df.salary.value_counts().values, 
       labels=['<=50K', '>50K'], 
       autopct='%.1f%%',
       explode=(0, 0.1),
       colors=['lightskyblue', 'gold'],
       textprops={'fontsize': 12},
       shadow=True
       )
plt.title("Percentage of Income-Levels", fontdict={'fontsize': 14})
plt.show()

**Write down the conclusions you draw from your analysis**

**Some Remarks on "salary" Feature:** As seen above, the number of people who earn 50K or less per year is much higher than the number of people who earn more than 50K per year in our sample. There is a lot more to investigate, such as the educational background of our sample, the number of hours worked per week, or the age span, to understand the results better. However, it can be concluded that roughly 25% of the individuals in the dataset are at the high-income level while the others (roughly 75%) are at the low-income level.

## Numeric Features

## age

**Check the boxplot to see extreme values**

📝 A **``boxplot``** is a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell you about your outliers and what their values are.
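
The quantities behind a boxplot can also be computed directly. The sketch below uses hypothetical ages and assumes the common 1.5 × IQR rule for the whiskers (the default in matplotlib and seaborn), under which the quoted "minimum" and "maximum" are whisker limits rather than the raw extremes:

```python
import numpy as np

# Hypothetical ages, chosen so that one value lies far from the rest
ages = np.array([17, 22, 25, 28, 31, 35, 37, 41, 48, 52, 60, 90])

# Quartiles and median (NumPy's default linear interpolation)
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Fences of the 1.5 * IQR rule; points beyond them are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```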

In [None]:
# Displaying the distribution of "age" feature with a box plot

sns.boxplot(data=df, 
            x="age",
            showmeans=True,
            meanprops={"marker":"o",
                       "markerfacecolor":"white", 
                       "markeredgecolor":"black",
                       "markersize":"10"})

plt.title("Age Distribution", fontsize=18, color="b");

**Check the histplot/kdeplot to see distribution of feature**

In [None]:
# Displaying the distribution of age feature with a histogram

sns.histplot(data=df, x="age", bins=20, kde=True, color="g")

plt.title("Age Distribution", fontsize=18, color="fuchsia");

**Check the statistical values**

In [None]:
# Descriptive Statistics of "age" Feature

print(colored('Descriptive Statistics of the Age Feature:\n', 'blue', attrs=['bold']))
df.age.describe()

In [None]:
show_distribution(df["age"])

In [None]:
show_density(df["age"])

**Check the boxplot and histplot/kdeplot by "salary" levels**

In [None]:
# Checking the extreme values in "age" feature by Salary with box plot

sns.boxplot(data=df, 
            y="salary", 
            x="age",
            showmeans=True,
            meanprops={"marker":"o",
                       "markerfacecolor":"white", 
                       "markeredgecolor":"black",
                       "markersize":"10"})

# Add in points to show each observation
sns.stripplot(data=df,
              y="salary", 
              x="age",
              size=4, 
              color="purple", 
              linewidth=0)

# Tweaking the visual presentation
ax = plt.gca()  # use the axes that were just drawn on, not a leftover "ax" from an earlier cell
ax.xaxis.grid(True)
ax.set(ylabel="")
sns.despine(trim=True, left=True)

plt.title("Age Distribution by Salary", fontsize=20, color="darkblue");

In [None]:
# Checking Density Distribution of "age" feature by Salary 

sns.histplot(data=df, x="age", bins=20, kde=True, hue="salary")

plt.title("Age Distribution by Salary", fontsize=20, color="darkblue");

In [None]:
# # Display age distribution across salary levels with a Box Plot, Histogram and KDE Plot

# fig, ax = plt.subplots(3,1, figsize=(12, 24))
# fig.suptitle("Age Distribution Across Salary-Levels", fontsize=20, color="darkblue")

# sns.boxplot(data=df, x="salary", y="age", ax=ax[0])
# sns.histplot(data=df, x="age", hue="salary", kde=True, bins=20, ax=ax[1])
# sns.kdeplot(data=df, x="age", hue="salary", shade=True, ax=ax[2])

# plt.subplots_adjust(hspace=0.3)
# plt.show()

**Check the statistical values by "salary" levels**

In [None]:
# Descriptive Statistics of age with respect to salary levels

print(colored('Descriptive Statistics of the Age by Salary:\n', 'blue', attrs=['bold']))
df.groupby("salary").age.describe()

In [None]:
# # Get statistics
# from termcolor import colored

# print(colored('Statistical Calculations', 'red', attrs=['bold']))
# min_val = df["age"].min()
# max_val = df["age"].max()
# mean_val = df["age"].mean()
# med_val = df["age"].median()
# mod_val = df["age"].mode()[0]

# print('Minimum:{:>7.2f}\nMean:{:>10.2f}\nMedian:{:>8.2f}\nMode:{:>10.2f}\nMaximum:{:>7.2f}\n'.format(min_val,
#                                                                                          mean_val,
#                                                                                          med_val,
#                                                                                          mod_val,
#                                                                                          max_val))

# fig, ax = plt.subplots(figsize=(10, 5))

# ax = sns.kdeplot(data=df, x="age", hue="salary", fill=True)


# h, l = ax.get_legend_handles_labels()

# legend_elements1 = [Line2D([0], [0], marker='s', color='lightblue', label=df.salary.unique()[0], markersize=15),
#                    Line2D([0], [0], marker='s', color='orange', label=df.salary.unique()[1], markersize=15)]
# l1 = plt.legend(handles=legend_elements1, title='Salary Type', bbox_to_anchor=(0.69, 1))

# legend_elements2 = [Line2D([0], [0], color='cyan',   label='Mean',   markersize=15, linestyle='dashed'),
#                    Line2D([0], [0],  color='red',    label='Median', markersize=15, linestyle='dashed'),
#                    Line2D([0], [0],  color='yellow', label="Mode",   markersize=15, linestyle='dashed')]
# l2 = plt.legend(handles=legend_elements2, title="Mean, Median, Mode", bbox_to_anchor=(0.72, 0.81))
    

# plt.axvline(x=df["age"].mean(),    color='cyan',    linestyle='dashed', linewidth=2, label='Mean')
# plt.axvline(x=df["age"].median(),  color='red',     linestyle='dashed', linewidth=2, label='Median')
# plt.axvline(x=df["age"].mode()[0], color='yellow',  linestyle='dashed', linewidth=2, label='Mode')

# ax.add_artist(l1); # we need this because the 2nd call to legend() erases the first one

In [None]:
show_compare(df, "age", "salary")

**Write down the conclusions you draw from your analysis**

**Some Remarks on "age" Feature:** 

**``First``**, ``the characteristics of the age column were investigated. Here are the findings:``

- There are some extreme values between the ages of 78 and 90.
- The mean is 38, and the majority of the people are between 28 and 48 years old.
- The age data is right-skewed, which means that our sample is a young population, as expected. Most people over the age of 65 are retired and no longer work.

**``Second``**, ``the age component was explored through the salary levels (<=50K and >50K):``

- Visuals revealed that the people who make less than or equal to 50K are a younger population than the people who make above 50K per year.
- The mean and median age of the high-income group are higher than those of the low-income group, which suggests that older people tend to earn more than younger ones. In other words, Americans who earn 50K or less are, on average, younger than those who earn more than 50K. However, this does not mean that older people always earn more than younger people, or vice versa: the extreme values show people aged 80-90 who make 50K or less, as well as people under 30 who make more than 50K.
- The histogram and KDE plot show that the age distributions of both populations (<=50K and >50K) are right-skewed, while the >50K population is closer to a normal distribution.
- A stronger right skew is expected for the population that makes 50K or less. Younger people are less experienced or are new graduates, so they tend to earn less, as seen in the graphs.
- Besides, at almost every point of the statistical summary, the people who earn 50K or less are younger than those in the high-income group. The standard deviations of the two groups do not differ much, and the mean and median are close to each other within each group, so it can be stated that average income tends to increase with age. There is a gap between the 75% values and the max values for each income group, which may indicate extreme values (outlier candidates). The boxplots for salary and age should therefore be examined carefully, separately and together, with the help of domain knowledge.

## fnlwgt

**Check the boxplot to see extreme values**

In [None]:
# Checking the extreme values in "fnlwgt" feature by means of box plot

sns.boxplot(data=df, 
            x="fnlwgt",
            showmeans=True,
            meanprops={"marker":"o",
                       "markerfacecolor":"white", 
                       "markeredgecolor":"black",
                       "markersize":"10"})

plt.title("Fnlwgt Distribution by Boxplot", fontsize=20, color="darkblue");

**Check the histplot/kdeplot to see distribution of feature**

In [None]:
# Checking Density Distribution of "fnlwgt" feature 

sns.kdeplot(data=df, x="fnlwgt", fill=True)

plt.ticklabel_format(style='plain', axis='y')
plt.ticklabel_format(style='plain', axis='x')
plt.title("Fnlwgt Density", fontsize=20, color="darkblue");

**[Prevent/Suppress Scientific Notation in Matplotlib](https://stackoverflow.com/questions/28371674/prevent-scientific-notation-in-matplotlib-pyplot)**

In [None]:
# The First Approach to Prevent/Suppress Scientific Notation in Matplotlib

fig, ax = plt.subplots()

sns.kdeplot(data=df, x="fnlwgt", fill=True)

plt.title("Fnlwgt Density", fontsize=20, color="darkblue")

ax.ticklabel_format(style='plain');

In [None]:
# The Second Approach to Prevent/Suppress Scientific Notation in Matplotlib

sns.kdeplot(data=df, x="fnlwgt", fill=True)

plt.title("Fnlwgt Density", fontsize=20, color="darkblue")

plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y');

**Check the statistical values**

In [None]:
# Descriptive Statistics of "fnlwgt" Feature

print(colored('Descriptive Statistics of the "fnlwgt" Feature:\n', 'blue', attrs=['bold']))
df.fnlwgt.describe()

**Check the boxplot and histplot/kdeplot by "salary" levels**

In [None]:
# Checking the extreme values in "fnlwgt" feature by Salary with box plot

plt.figure(figsize=(14, 8))

sns.boxplot(data=df, 
            y="salary", 
            x="fnlwgt",
            showmeans=True,
            meanprops={"marker":"o",
                       "markerfacecolor":"white", 
                       "markeredgecolor":"black",
                       "markersize":"10"})

# Adding in points to show each observation
sns.stripplot(data=df,
              y="salary", 
              x="fnlwgt",
              size=4, 
              color="purple", 
              linewidth=0)

# Tweaking the visual presentation
ax = plt.gca()  # use the axes that were just drawn on, not a leftover "ax" from an earlier cell
ax.xaxis.grid(True)
ax.set(ylabel="")
sns.despine(trim=True, left=True)

plt.title("Fnlwgt Distribution by Salary", fontsize=20, color="darkblue")
plt.ticklabel_format(style='plain', axis='x');

If one of the sections is longer than another, it indicates a wider range in the values of data in that section (meaning the data are more spread out). A smaller section of the boxplot indicates the data are more condensed (closer together).

In [None]:
# Checking Density Distribution of "fnlwgt" feature by Salary

ax = sns.kdeplot(data=df, x="fnlwgt", hue="salary", fill=True)

plt.title("Fnlwgt Density by Salary", fontsize=20, color="darkblue")

ax.ticklabel_format(style='plain');

**Check the statistical values by "salary" levels**

In [None]:
# Descriptive Statistics of "fnlwgt" with respect to salary levels

print(colored('Descriptive Statistics of the "fnlwgt" by Salary:\n', 'blue', attrs=['bold']))
df.groupby("salary").fnlwgt.describe()

In [None]:
show_compare(df, "fnlwgt", "salary")

**Write down the conclusions you draw from your analysis**

**Some Remarks on "fnlwgt" Feature:** It looks like there is no significant difference between high and low-income groups according to the "fnlwgt" feature.

The standard deviations of the two groups do not differ much, and the mean and median are close to each other within each group. There is a gap between the 75% values and the max values for each income group, which may indicate extreme values (outlier candidates). The boxplots for salary and fnlwgt should therefore be examined carefully, separately and together, with the help of domain knowledge.

## capital_gain

### 📝[Domain Knowledge About Capital Gain](https://en.wikipedia.org/wiki/Capital_gain)

**What is 'Capital Gain?'**

Capital gain is an economic concept defined as the profit earned on the sale of an asset which has increased in value over the holding period. An asset may include tangible property, a car, a business, or intangible property such as shares.

A capital gain is only possible when the selling price of the asset is greater than the original purchase price. In the event that the purchase price exceeds the sale price, a capital loss occurs. Capital gains are often subject to taxation, of which rates and exemptions may differ between countries. 


In [None]:
# We can build a function that highlights the maximum value across rows, cols, and the DataFrame all at once.

def highlight_max(s, props=''):
    return np.where(s == np.nanmax(s.values), props, '')

In [None]:
# Descriptive Statistics of the "capital_gain" by "workclass" & "occupation"

print(colored('Descriptive Statistics "capital_gain" by "workclass" & "occupation":\n', 'blue', attrs=['bold']))

df.groupby(["workclass", "occupation"])["capital_gain"].describe()\
                             .style.apply(highlight_max, props='color: white; background-color: #33FFF9;', axis=0)\
                             .apply(highlight_max, props='color: white; background-color: pink;', axis=1)\
                             .apply(highlight_max, props='color: white; background-color: purple', axis=None)

**Check the boxplot to see extreme values**

Use the IQR to assess the variability where most of your values lie. Larger values indicate that the central portion of your data spread out further. Conversely, smaller values show that the middle values cluster more tightly.
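
To make the IQR concrete, here is a minimal sketch on hypothetical, zero-heavy values (the numbers are assumptions chosen for illustration). When most entries are zero, both quartiles are zero and the IQR collapses to 0, so the box of a boxplot degenerates to a line:

```python
import pandas as pd

def iqr(values: pd.Series) -> float:
    """Interquartile range: the spread of the middle 50% of the values."""
    return values.quantile(0.75) - values.quantile(0.25)

# Hypothetical capital gains: eight zeros and two non-zero amounts
gains = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 3000, 7000])

iqr_all = iqr(gains)                  # dominated by the zeros
iqr_nonzero = iqr(gains[gains != 0])  # spread among the non-zero amounts only

print(iqr_all, iqr_nonzero)
```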

In [None]:
# Checking the extreme values in the "capital_gain" feature with box plot

sns.boxplot(data=df, x="capital_gain")

plt.suptitle("Capital Gain Distribution", fontsize=20, color="darkblue");

**Check the histplot/kdeplot to see distribution of feature**

In [None]:
# Checking Density Distribution of the "capital_gain" feature 

sns.kdeplot(data=df, x="capital_gain", fill=True)

plt.suptitle("Capital Gain Density", fontsize=20, color="darkblue");

**Check the statistical values**

In [None]:
# Descriptive Statistics of "capital_gain" Feature

print(colored('Descriptive Statistics of the "capital_gain" Feature:\n', 'blue', attrs=['bold']))
df.capital_gain.describe()

**Check the boxplot and histplot/kdeplot by "salary" levels**

In [None]:
# Checking the extreme values in the "capital_gain" feature by Salary with box plot

plt.figure(figsize=(14, 8))

sns.boxplot(data=df, 
            y="salary", 
            x="capital_gain")

# # Adding in points to show each observation
# sns.stripplot(data=df,
#               y="salary", 
#               x="capital_gain",
#               size=4, 
#               color="purple", 
#               linewidth=0)

# # Tweaking the visual presentation
# ax.xaxis.grid(True)
# ax.set(ylabel="")
# sns.despine(trim=True, left=True)

plt.title("Capital Gain Distribution By Salary", fontsize=20, color='darkblue');

In [None]:
# Checking Density Distribution of the "capital_gain" feature by Salary 

sns.kdeplot(data=df, x="capital_gain", hue="salary", fill=True)

plt.title("Capital Gain Density By Salary", fontsize=20, color='darkblue');

**Check the statistical values by "salary" levels**

In [None]:
# Descriptive Statistics of "capital_gain" with respect to Salary levels

print(colored('Descriptive Statistics of the "capital_gain" by Salary:\n', 'blue', attrs=['bold']))
df.groupby("salary").capital_gain.describe()

In [None]:
show_compare(df, "capital_gain", "salary")

**Check the statistical values by "salary" levels for capital_gain not equal to zero**

In [None]:
# Descriptive Statistics of "capital_gain != 0" by Salary

print(colored('Descriptive Statistics of the "capital_gain != 0" by Salary :\n', 'blue', attrs=['bold']))
df[df.capital_gain != 0].groupby("salary")["capital_gain"].describe()

In [None]:
# Checking Density Distribution of "capital_gain" feature by Salary 

df[df.capital_gain != 0].groupby('salary')['capital_gain']\
                        .apply(lambda x: sns.kdeplot(x, label=x.name))

plt.title("A Close Look at Capital Gain Density By Salary", fontsize=18, color='darkblue');

**Write down the conclusions you draw from your analysis**

**Some Remarks on "capital_gain" Feature :** Capital gain is usually known as the profits made from the sale of real estate, investments, and personal property.

``The dataset indicates the following findings when all rows are included:``

- The average capital gain of the people who make above 50K is almost 27 times (roughly 4007/148.885 ≈ 26.9) larger than that of the people who make 50K or less.
- There are extreme values, such as about 41K in the "<=50K income-level" and 99999 (about 100K) in the ">50K income-level", and they affect the average values.
- The capital gain values in the ">50K income-level" vary far more than those in the "<=50K income-level". The latter has a mean of only about 148.88, while the former varies mostly between 0 and 20K.
- When the min, median, and interquartile ranges are checked, it becomes clear that the capital gain column contains a lot of zeros. That's why it might be better to check the statistical description of the capital gain column excluding the rows with zero values.

``The dataset indicates the following findings when the rows with zero values are excluded:``

- The average capital gain of the people who make above 50K is almost 5 times (18731.165/3552.813 ≈ 5.27) larger than that of those who make 50K or less. This ratio is much lower than the one computed with all rows.
- The standard deviation increases in both salary levels, which shows that the capital gain values are more spread out once the zeros, which made up most of the values in the previous scenario, are removed.
- The median and IQRs vary a lot between two salary levels. While the capital gain of "<=50K population" is mostly clustered below 5K, the capital gain of ">50K population" varies mostly between 7K-15K.

**Final Thoughts:**

People who earn more tend to invest more than people who earn less. In other words, higher capital gain goes together with higher income. However, almost 79% of the people who earn >50K reported "0" as their capital gain. This can be the start of another story about why people don't or can't invest.
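
The effect of the zero values described above can be checked with a few lines. This is a sketch on hypothetical toy data (the `toy` frame is an assumption, so its numbers do not match the real dataset) that computes the share of zeros per salary level and the mean with and without them:

```python
import pandas as pd

# Hypothetical toy data; the real columns are df["salary"] and df["capital_gain"]
toy = pd.DataFrame({
    "salary": ["<=50K"] * 5 + [">50K"] * 5,
    "capital_gain": [0, 0, 0, 0, 2000, 0, 0, 0, 7000, 15000],
})

# Share of rows reporting a capital gain of zero, per salary level
zero_share = (toy["capital_gain"] == 0).groupby(toy["salary"]).mean()

# Mean capital gain with all rows and with the zero rows excluded
mean_all = toy.groupby("salary")["capital_gain"].mean()
mean_nonzero = toy[toy["capital_gain"] != 0].groupby("salary")["capital_gain"].mean()

print(zero_share)
print(mean_all[">50K"] / mean_all["<=50K"])          # ratio with zeros included
print(mean_nonzero[">50K"] / mean_nonzero["<=50K"])  # ratio with zeros excluded
```

As in the findings above, the ratio between the two groups shrinks once the zero rows are dropped, because the zeros pull the low-income mean down much more strongly.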

## capital_loss

### 📝 [Domain Knowledge About Capital Loss](https://www.investopedia.com/terms/c/capitalloss.asp)

**What Is a Capital Loss?**

A capital loss is the loss incurred when a capital asset, such as an investment or real estate, decreases in value. This loss is not realized until the asset is sold for a price that is lower than the original purchase price. In simple terms, the difference between the selling price and cost/purchase price of an investment can be described as capital gain/loss.

KEY TAKEAWAYS
- A capital loss is a loss incurred when a capital asset is sold for less than the price it was purchased for.
- In regards to taxes, capital gains can be offset by capital losses, reducing taxable income by the amount of the capital loss.
- Capital gains and capital losses are reported on Form 8949.
- The Internal Revenue Service (IRS) puts measures around wash sales to prevent investors from taking advantage of the tax benefits of capital losses.

In [None]:
df[df.capital_loss < 0]

# As seen, there are no values lower than zero, since a negative value would contradict the meaning of "Capital Loss"

In [None]:
# Descriptive Statistics of the "capital_loss" by "workclass" & "occupation"

print(colored('Descriptive Statistics "capital_loss" by "workclass" & "occupation":\n', 'blue', attrs=['bold']))

df.groupby(["workclass", "occupation"])["capital_loss"].describe()\
                             .style.apply(highlight_max, props='color: white; background-color: #33FFF9;', axis=0)\
                             .apply(highlight_max, props='color: white; background-color: pink;', axis=1)\
                             .apply(highlight_max, props='color: white; background-color: purple', axis=None)

Indeed, a capital loss is not realized until the asset is sold, since the loss only materializes at the time of sale. So there are no values below zero (0), as seen above. Even though there are some values greater than zero, they cannot be interpreted as capital gain. To see whether there has been any capital gain, we have to look at the "capital_gain" feature.

**Check the boxplot to see extreme values**

In [None]:
# Checking the extreme values in the "capital_loss" feature with box plot

sns.boxplot(data=df, x="capital_loss")

plt.title("Capital Loss Distribution", fontsize=20, color='darkblue');

**Check the histplot/kdeplot to see distribution of feature**

In [None]:
# Checking Density Distribution of the "capital_loss" feature 

sns.kdeplot(data=df, x="capital_loss", fill=True)

plt.title("Capital Loss Density", fontsize=20, color='darkblue');

**Check the statistical values**

In [None]:
# Descriptive Statistics of "capital_loss" Feature

print(colored('Descriptive Statistics of the "capital_loss" Feature:\n', 'blue', attrs=['bold']))
df.capital_loss.describe()

**Check the boxplot and histplot/kdeplot by "salary" levels**

In [None]:
# Checking the extreme values in the "capital_loss" feature by Salary with box plot

plt.figure(figsize=(14, 8))

sns.boxplot(data=df, 
            y="salary", 
            x="capital_loss")

# # Adding in points to show each observation
# sns.stripplot(data=df,
#               y="salary", 
#               x="capital_loss",
#               size=4, 
#               color="purple", 
#               linewidth=0)

# # Tweaking the visual presentation
# ax.xaxis.grid(True)
# ax.set(ylabel="")
# sns.despine(trim=True, left=True)

plt.title("Capital Loss Distribution By Salary", fontsize=20, color='darkblue');

In [None]:
# Checking Density Distribution of the "capital_loss" feature by Salary 

sns.kdeplot(data=df, x="capital_loss", hue="salary", fill=True)

plt.title("Capital Loss Density By Salary", fontsize=20, color='darkblue');

**Check the statistical values by "salary" levels**

In [None]:
# Descriptive Statistics of "capital_loss" by Salary

print(colored('Descriptive Statistics of the "capital_loss" by Salary :\n', 'blue', attrs=['bold']))
df.groupby("salary").capital_loss.describe()

In [None]:
show_compare(df, "capital_loss", "salary")

**Check the statistical values by "salary" levels for capital_loss not equal to zero**

In [None]:
# Descriptive Statistics of "capital_loss != 0" by Salary

print(colored('Descriptive Statistics of the "capital_loss != 0" by Salary :\n', 'blue', attrs=['bold']))
df[df.capital_loss != 0].groupby("salary").capital_loss.describe()

In [None]:
# Checking Density Distribution of the "capital_loss != 0" by Salary 

df[df.capital_loss != 0].groupby('salary')['capital_loss']\
                        .apply(lambda x: sns.kdeplot(x, label=x.name, fill=True))

plt.title("A Close Look at Capital Loss Density By Salary", fontsize=18, color='darkblue');

**Write down the conclusions you draw from your analysis**

**Some Remarks on "capital_loss" Feature:** 

📝 ``A capital loss occurs when an investment is sold for less than the original purchase price.``

Here are the findings from the dataset:

- Unlike the capital gain, the capital loss data shows a similar pattern for both salary-level populations.
- The graphs (boxplots and density plots) show similar distributions among the people who make <=50K and those who make >50K. Both are close to a normal distribution and most of the values are clustered around "0". This again led us to consider the "capital_loss" column with nonzero values only, to better understand the data.
- When the "0" values are eliminated, the patterns of the data in both salary levels get closer to each other. The statistical description of the capital loss data supports this claim with close mean, median, and IQR values.

## hours_per_week

**Check the boxplot to see extreme values**

In [None]:
# Checking the extreme values in the "hours_per_week" feature with box plot

sns.boxplot(data=df, 
            x="hours_per_week",
            showmeans=True,
            meanprops={"marker":"o",
                       "markerfacecolor":"white", 
                       "markeredgecolor":"black",
                       "markersize":"10"})

plt.title("Working-Hours Distribution", fontsize=18, color='darkblue');

**Check the histplot/kdeplot to see distribution of feature**

In [None]:
# Checking Density Distribution of the "hours_per_week" Feature 

sns.kdeplot(data=df, x="hours_per_week", fill=True)

plt.title("Working-Hours Density", fontsize=18, color='darkblue');

**Check the statistical values**

In [None]:
# Descriptive Statistics of the "hours_per_week" Feature

print(colored('Descriptive Statistics of the "hours_per_week" :\n', 'blue', attrs=['bold']))
df.hours_per_week.describe()

**Check the boxplot and histplot/kdeplot by "salary" levels**

In [None]:
# Checking the extreme values in the "hours_per_week" by Salary with box plot

plt.figure(figsize=(14, 8))

sns.boxplot(data=df, 
            y="salary", 
            x="hours_per_week",
            showmeans=True,
            meanprops={"marker":"o",
                       "markerfacecolor":"white", 
                       "markeredgecolor":"black",
                       "markersize":"10"})

# Adding in points to show each observation
sns.stripplot(data=df,
              y="salary", 
              x="hours_per_week",
              size=4, 
              color="purple", 
              linewidth=0)

# Tweaking the visual presentation
ax = plt.gca()  # use the axes that were just drawn on, not a leftover "ax" from an earlier cell
ax.xaxis.grid(True)
ax.set(ylabel="")
sns.despine(trim=True, left=True)

plt.title("Working-Hours by Salary", fontsize=18, color='darkblue');            

In [None]:
# Checking Density Distribution of the "hours_per_week" by Salary 

sns.kdeplot(data=df, x="hours_per_week", hue="salary", fill=True)

plt.title("Working-Hours by Salary", fontsize=18, color='darkblue');            

**Check the statistical values by "salary" levels**

In [None]:
# Descriptive Statistics of "hours_per_week" by Salary

print(colored('Descriptive Statistics of the "hours_per_week" by Salary :\n', 'blue', attrs=['bold']))
df.groupby("salary").hours_per_week.describe()

In [None]:
show_compare(df, "hours_per_week", "salary")

**Write down the conclusions you draw from your analysis**

**Remarks on "hours_per_week" Feature:** 

- We don't recognize a distinguishable difference between two salary levels in terms of the number of working hours. Both salary-level graphs indicate that the number of hour values clustered at 40 hours in general as expected. Most of the full time employees work 8 hours per day. Since the number of working hours is standardized, it doesn't change much across the salary levels.
- However, it is observed that the "<=50K population" data shows a peaky normal distribution ([leptokurtic distribution](https://www.analyticsvidhya.com/blog/2021/05/shape-of-data-skewness-and-kurtosis/#:~:text=Summary,pushed%20towards%20the%20left%20side).)) while the ">50K population" data is slightly right-skewed and its distribution looks bimodal, which should be taken into consideration when determining whether there are any underlying patterns. The statistical measurements support the visuals, with closer mean & median values in the former and a higher mean compared to the median in the latter.
- Nevertheless, people in the high-income group work about 45 hours per week on average, and most of those who work less than 39 hours per week are in the low-income group.
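
The last remark can be verified with a short check. The sketch below uses hypothetical toy rows (an assumption, not the real data) to compute the salary split among people who work less than 39 hours and the mean working hours per salary level:

```python
import pandas as pd

# Hypothetical toy data; the real columns are df["hours_per_week"] and df["salary"]
toy = pd.DataFrame({
    "hours_per_week": [20, 30, 35, 40, 40, 40, 45, 50, 60, 38],
    "salary": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K",
               "<=50K", ">50K", ">50K", ">50K", ">50K"],
})

# Salary split among the people who work less than 39 hours per week
short_week = toy[toy["hours_per_week"] < 39]
short_share = short_week["salary"].value_counts(normalize=True)

# Average working hours per salary level
mean_hours = toy.groupby("salary")["hours_per_week"].mean()

print(short_share)
print(mean_hours)
```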

🧐 **[What is skewness and kurtosis for normal distribution?](https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics)**

🧐 **[Kurtosis() & Skew() Function In Pandas](https://medium.com/@atanudan/kurtosis-skew-function-in-pandas-aa63d72e20de)**


In [None]:
# Checking the skewness of all features in the dataset

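# numeric_only=True skips the categorical columns (required in recent pandas versions)
df.skew(axis=0, numeric_only=True)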

In [None]:
# Checking the kurtosis of all features in the dataset

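# numeric_only=True skips the categorical columns (required in recent pandas versions)
df.kurtosis(axis=0, numeric_only=True)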

📝 **``Mesokurtic:``** This distribution has a kurtosis similar to that of the normal distribution, which means that its extreme values behave like those of a normal distribution. Under the definition used here, the standard normal distribution has a kurtosis of three. Note that pandas' ``kurtosis()`` reports excess kurtosis (Fisher's definition), under which the normal distribution scores 0, so its output should be compared with 0 rather than 3.

📝 **``Leptokurtic (Kurtosis > 3):``** The distribution is longer and its tails are fatter. The peak is higher and sharper than in the mesokurtic case, which means that the data are heavy-tailed or contain a profusion of outliers.
Outliers stretch the horizontal axis of the histogram, which makes the bulk of the data appear in a narrow (“skinny”) vertical range, thereby giving the “skinniness” of a leptokurtic distribution.

📝 **``Platykurtic (Kurtosis < 3):``** The distribution is shorter and its tails are thinner than those of the normal distribution. The peak is lower and broader than in the mesokurtic case, which means that the data are light-tailed or lack outliers.
This is because the extreme values are less extreme than those of the normal distribution.
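
The three cases can be illustrated with synthetic samples. The sketch below draws hypothetical data from a normal, a Laplace, and a uniform distribution (these choices are assumptions made for illustration) and relies on the fact that pandas returns excess kurtosis, so 3 has to be added to obtain the values quoted in the definitions above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 200_000

samples = pd.DataFrame({
    "normal": rng.normal(size=n),    # mesokurtic reference
    "laplace": rng.laplace(size=n),  # heavier tails, leptokurtic
    "uniform": rng.uniform(size=n),  # lighter tails, platykurtic
})

# pandas reports excess kurtosis (Fisher's definition): normal is about 0
excess = samples.kurtosis()

# Adding 3 gives the convention in which the normal distribution scores about 3
pearson = excess + 3

print(pd.DataFrame({"excess": excess, "pearson": pearson}))
```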

🧐 **[Why is high positive kurtosis problematic for hypothesis tests?](https://stats.stackexchange.com/questions/193117/why-is-high-positive-kurtosis-problematic-for-hypothesis-tests)**

🧐 **[Skew and Kurtosis: 2 Important Statistics terms you need to know in Data Science](https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa)**

### See the relationships among the numeric features by the target feature (salary) in a single plot

A pairplot plots the pairwise relationships in a dataset.

In [None]:
g = sns.pairplot(df, hue="salary", palette="viridis", corner=True)

g.fig.suptitle("Pairwise Relationships Among Features", fontsize=20, color='darkblue');            

## Categorical Features

## education & education_num

**Detect the similarities between these features by comparing unique values**

In [None]:
# Checking the uniques of "education" feature and determining their numbers

df.education.value_counts(dropna=False)

In [None]:
# Checking the uniques of "education_num" feature and determining their numbers 

df.education_num.value_counts(dropna=False)

In [None]:
# Comparing the uniques of "education" with those of "education_num"

df.groupby('education').education_num.value_counts(dropna=False)

**Visualize the number of people in each category of these features (education, education_num) separately**

In [None]:
# Visualization of "education" feature

ax = sns.countplot(data=df, x="education")

plt.xticks(rotation=60)

plt.title("The Number of People by Educational Levels", fontsize=16, color="darkblue")

for container in ax.containers:
    ax.bar_label(container);

In [None]:
# Visualization of "education_num" feature

ax = sns.countplot(data=df, x="education_num")

plt.title("The Number of People by Educational Levels", fontsize=16, color="darkblue")

for container in ax.containers:
    ax.bar_label(container);

In [None]:
# Visualization of "education" and "education_num" features by order

education_count = df["education"].value_counts().index
education_num_count = df["education_num"].value_counts().index

fig, ax = plt.subplots(2, 1, figsize=(12, 20))

# Passing the column as x= (a positional Series is no longer interpreted as x by recent seaborn versions)
sns.countplot(x=df["education"], order=education_count, ax=ax[0])
for container in ax[0].containers:
    ax[0].bar_label(container);

sns.countplot(x=df["education_num"], order=education_num_count, ax=ax[1])
for container in ax[1].containers:
    ax[1].bar_label(container);

ax[0].tick_params(axis='x', rotation=60)

plt.subplots_adjust(hspace=0.6);

The counts follow almost the same pattern in both features. It turns out that "9.0" represents "HS-grad", "10.0" represents "Some-college", "1.0" represents "Preschool", and so on. There are slight differences between the counts, which might be caused by human error during the survey.
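
The pairing between the two features can also be checked programmatically. The following is a minimal sketch on toy rows (not the real data), in which one "HS-grad" row deliberately carries a conflicting code; the variable names are introduced here only for illustration.

```python
import pandas as pd

# Toy rows (not the census data); one "HS-grad" row carries a conflicting code.
toy = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "HS-grad", "Some-college", "Preschool"],
    "education_num": [9.0, 9.0, 10.0, 10.0, 1.0],
})

# How many distinct codes does each label map to? Exactly 1 means a clean pairing.
codes_per_label = toy.groupby("education")["education_num"].nunique()
inconsistent_labels = codes_per_label[codes_per_label > 1].index.tolist()

# Most frequent code per label, used as the "expected" pairing.
expected_code = toy.groupby("education")["education_num"].agg(lambda s: s.mode().iloc[0])

# Rows whose code disagrees with the expected pairing for their label.
mismatched = toy[toy["education_num"] != toy["education"].map(expected_code)]

print(inconsistent_labels)
print(mismatched)
```

Rows with a missing `education_num` would also be reported as mismatches by this comparison, since a missing value never equals the expected code.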

**Check the number of people at each "salary" level by these features (education and education_num) separately and visualize them with a countplot**

In [None]:
# Checking "education" feature by Salary in detail 

df.groupby("education").salary.value_counts()

In [None]:
# Visualizing the number of people in each "education" level by Salary

ax = sns.countplot(data=df, x="education", hue="salary")

plt.title("The Number of People by Educational Level and Salary", fontsize=16, color="darkblue")

plt.xticks(rotation = 60)

for container in ax.containers:
    ax.bar_label(container);

In [None]:
# Checking "education_num" feature by Salary in detail 

df.groupby("education_num").salary.value_counts()

In [None]:
# Visualizing the number of people in each "education_num" level by Salary

ax = sns.countplot(data=df, 
                   x="education_num", 
                   hue="salary")

plt.title("The Number of People by Educational Level and Salary", fontsize=16, color="darkblue")

for container in ax.containers:
    ax.bar_label(container);

**Visualize the boxplot of "education_num" feature by "salary" levels**

In [None]:
# Checking the extreme values in the "education_num" by Salary with box plot

plt.figure(figsize=(14, 8))

# Keeping the returned Axes so that the tweaks below act on this plot (not on an earlier one)
ax = sns.boxplot(data=df, 
                 y="salary", 
                 x="education_num")

# Adding in points to show each observation
sns.stripplot(data=df,
              y="salary", 
              x="education_num",
              size=4, 
              color="purple", 
              linewidth=0)

# Tweaking the visual presentation
ax.xaxis.grid(True)
ax.set(ylabel="")
sns.despine(trim=True, left=True)

plt.title("The Distribution of Education", fontsize=16, color="darkblue");
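
The points a box plot draws beyond its whiskers follow the usual 1.5 × IQR rule, so the same extreme values can be listed numerically. Below is a minimal sketch on toy values (not the census data); `iqr_bounds` is a hypothetical helper introduced for illustration, and `k=1.5` is assumed to match the default whisker length.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return the lower and upper fences used to flag box-plot outliers."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy values (not the census data).
toy = pd.DataFrame({
    "salary": ["<=50K"] * 6 + [">50K"] * 6,
    "education_num": [9, 9, 10, 10, 9, 1, 13, 13, 14, 13, 14, 13],
})

# Collect the values outside the fences separately for each salary level.
outliers_by_level = {}
for level, group in toy.groupby("salary")["education_num"]:
    low, high = iqr_bounds(group)
    outliers_by_level[level] = group[(group < low) | (group > high)].tolist()

print(outliers_by_level)
```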

**By means of "Feature Engineering", reducing the categories of the "education" feature to "elementary_primary_school_degree", "middle_secondary school_degree", and "high_level_degree", and creating a new feature from this categorical data, will make the data easier to interpret. So, let us do that in accordance with the domain knowledge below:**


### 📝 Domain Knowledge About Education in the US

**[Structure of U.S. Education at U.S. Department of Education](https://www2.ed.gov/about/offices/list/ous/international/usnei/us/edlite-structure-us.html#:~:text=Education%20in%20the%20United%20States%20follows%20a%20pattern%20similar%20to,then%20postsecondary%20(tertiary)%20education.)**

Education in the United States follows a pattern similar to that in many systems. Early childhood education is followed by **primary school** (called elementary school in the United States), **middle school, secondary school** (called high school in the United States), and then **postsecondary (tertiary) education**. Postsecondary education includes non-degree programs that lead to certificates and diplomas plus six degree levels: associate, bachelor, first professional, master, advanced intermediate, and research doctorate.

**Primary or elementary education** ranges from grade 1 to grades 4-7, depending on state and school district policy ([Elementary schools in the United States](https://en.wikipedia.org/wiki/Elementary_schools_in_the_United_States)). **Middle schools** serve pre-adolescent and young adolescent students between grades 5 and 9, with most in the grade 6-8 range. **Secondary education** in the United States covers the last seven years of statutory formal education, from grade 6 (age 11–12) through grade 12 (age 17–18). It occurs in two phases. The first is the ISCED lower secondary phase, a middle school or junior high school for students in grade 6 (age 11–12) through grade 8 (age 13–14). The second is the ISCED upper secondary phase, a high school or senior high school for students in grade 9 (age 14–15) through grade 12 (age 17–18) ([Secondary education in the United States](https://en.wikipedia.org/wiki/Secondary_education_in_the_United_States)).

**High school graduate** means an individual who has received a high school diploma from a high school or passed the general educational development (GED) diploma test or any other high school graduate equivalency examination approved by the state board of education.

**Community colleges** and **vocational schools** are two pathways to career success. Traditionally, **community colleges** offer two-year degrees on a wide variety of topics. Graduates earn associate degrees, many of which offer increased earning potential and the possibility of transferring to a four-year institution to finish their degree. In other words, they serve as pipelines for four-year degrees and beyond, and cost less than traditional college programs. **Vocational schools**, historically, focus on on-the-job training and apprenticeships for industries like manufacturing. Some describe vocational programs as “career and technical” education.

**A bachelor’s degree** is an undergraduate degree in which you study a subject of your choice at an academic institution, and is commonly known as a college degree.

**Higher (postsecondary) education** in the United States is an optional final stage of formal learning that follows the completion of secondary education ([Higher education](https://en.wikipedia.org/wiki/Higher_education)). It is also referred to as post-secondary, third-stage, third-level, or tertiary education, and it leads to the award of an academic degree. It takes place at universities and Further Education colleges and normally includes undergraduate and postgraduate study. Higher education gives you the chance to study a subject you are interested in and can boost your career prospects and earning potential ([Higher education in the United States](https://en.wikipedia.org/wiki/Higher_education_in_the_United_States)).

**[Academic degrees in the World](https://en.wikipedia.org/wiki/Academic_degree)**

**[5 Types of Academic Degrees in the US:](https://www.indeed.com/career-advice/career-development/types-of-academic-degrees)**

<img src=https://i.ibb.co/4Th6MRC/academic-degree.png width="400" height="100" align="left">



In [None]:
def mapping_education(x):
    if x in ["Preschool", "1st-4th", "5th-6th"]:
        return "elementary_primary_school_degree"
    elif x in ["7th-8th", "9th", "10th", "11th", "12th", "HS-grad"]:
        return "middle_secondary school_degree"
    elif x in ["Some-college", "Assoc-voc", "Assoc-acdm", "Bachelors", "Masters", "Prof-school", "Doctorate"]:
        return "high_level_degree"

In [None]:
df.education.apply(mapping_education).value_counts(dropna=False)

In [None]:
df["education_summary"] = df.education.apply(mapping_education)
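
The same grouping can also be written as a lookup table and applied with `Series.map`, which makes labels that the table does not cover easy to spot. This is only a sketch of an alternative, not the approach used in this notebook; `EDUCATION_GROUPS`, `label_to_group` and the toy values are names and data introduced for illustration.

```python
import pandas as pd

# Same three groups as mapping_education, written as a lookup table.
EDUCATION_GROUPS = {
    "elementary_primary_school_degree": ["Preschool", "1st-4th", "5th-6th"],
    "middle_secondary school_degree": ["7th-8th", "9th", "10th", "11th", "12th", "HS-grad"],
    "high_level_degree": ["Some-college", "Assoc-voc", "Assoc-acdm", "Bachelors",
                          "Masters", "Prof-school", "Doctorate"],
}

# Invert to {raw label: group} so that it can be passed to Series.map.
label_to_group = {label: group
                  for group, labels in EDUCATION_GROUPS.items()
                  for label in labels}

# Toy values (not the real column); "HS-Grad" is a deliberately misspelled label.
toy_education = pd.Series(["Bachelors", "HS-grad", "5th-6th", "HS-Grad"])
summary = toy_education.map(label_to_group)

# Labels missing from the table come back as NaN and can be listed explicitly.
unmapped = toy_education[summary.isna()].unique().tolist()

print(summary.tolist())
print(unmapped)
```

Both versions leave unknown labels as missing values; the lookup table just keeps the grouping in one place and makes the leftovers easy to list.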

**Visualize the number of people in each category of the new education levels (high, medium, low)**

In [None]:
# Visualizing the number of persons in each sub-categories of "education_summary" (high, medium, low)

ax = sns.countplot(data=df, x="education_summary")

plt.title("The Distribution of Education", fontsize=16, color="darkblue")

for container in ax.containers:
    ax.bar_label(container);

**Check the number of people at each "salary" level by the new education levels (high, medium, low) and visualize it with a countplot**

In [None]:
# The number of persons in each "education_summary" levels by Salary (high, medium, low)

df.groupby("education_summary").salary.value_counts()

In [None]:
# Visualizing the number of persons in each sub-categories of "education_summary" (high, medium, low) by Salary

ax = sns.countplot(data=df, x="education_summary", hue="salary")

plt.title("The Distribution of Education by Salary", fontsize=16, color="darkblue")

for container in ax.containers:
    ax.bar_label(container);

**Check the percentage distribution of people across "salary" levels within each new education level (high, medium, low) and visualize each with a separate pie plot**

In [None]:
# The Proportional Distribution of persons in each sub-categories of "education_summary" (high, medium, low) by Salary 

edu = df.groupby(["education_summary"]).salary.value_counts(normalize=True)
edu

In [None]:
# df2 = df.groupby("education_summary").salary.value_counts(normalize=True).unstack(level=0)
# df2.plot(kind="pie", 
#          subplots=True,
#          layout=(1, 3),
#          labels=["<=50K", ">50K"],
#          title=["high_level_grade", "low_level_grade", "medium_level_grade"],
#          ylabel="salary",
#          autopct="%.2f%%", legend=False, figsize=(15, 5),
#          colors=["#c3eba9","#34d8eb"]);

In [None]:
# Visualizing the percentages of "education_summary" levels by Salary

plt.figure(figsize=(18, 6))
index=1
for i in [0, 2, 4]:
    plt.subplot(1, 3, index)
    edu[i:i+2].plot.pie(subplots=True,
                        labels=["<=50K", ">50K"],
                        autopct="%.2f%%",
                        textprops={'fontsize': 12},
                        colors=['pink', 'lightskyblue'],
                        )
    plt.title(edu.index[i][0], fontdict = {'fontsize': 14})
#    plt.legend()
    index += 1

**Check the number of people at each new education level (high, medium, low) by "salary" level and visualize it with a countplot**

In [None]:
# The number of persons in each "salary" by "education_summary" levels (high, medium, low)

df.groupby("salary").education_summary.value_counts()

In [None]:
# Visualizing the number of persons in "salary" feature by "education_summary" (high, medium, low) 

ax = sns.countplot(data=df, x="salary", hue="education_summary")

plt.title("The Distribution of Education by Salary", fontsize=16, color="darkblue")

for container in ax.containers:
    ax.bar_label(container);

**Check the percentage distribution of people across the new education levels (high, medium, low) within each "salary" level and visualize each with a separate pie plot**

In [None]:
# The proportion of persons in each "salary" by "education_summary" levels (high, medium, low)

edu = df.groupby(["salary"]).education_summary.value_counts(normalize=True)
edu

In [None]:
# Visualizing the percentages of persons in each "salary" group by Education levels (high, medium, low) 

plt.figure(figsize = (18, 6))
index = 1
for i in [0, 3]:
    plt.subplot(1, 2, index)
    edu[i:i+3].plot.pie(subplots=True,
#                        labels=["secondary education", "higher education", "primary education"],
                        autopct="%.2f%%",
                        textprops={'fontsize': 12},
                        colors=['pink', 'lightskyblue', 'lightgreen'],
                        )
    plt.title(edu.index[i][0], fontdict = {'fontsize': 14})
#    plt.legend()
    index += 1

In [None]:
# Creating a dataframe demonstrating the percentages of education level by Salary 

# Naming the values "percentage" directly (the Series' default name can differ between pandas versions)
edu_df = edu.rename("percentage").reset_index()
edu_df.sort_values(by=["salary", "education_summary"], inplace=True)
edu_df
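
Since this "proportions to dataframe" step is repeated for several features, it could be wrapped in a small helper. The sketch below uses a hypothetical function name, `proportions_frame`, and toy data; it names the value column explicitly because the default name of a grouped `value_counts(normalize=True)` result is not the same in every pandas version.

```python
import pandas as pd


def proportions_frame(data, by, target, value_name="percentage"):
    """Return one row per (by, target) pair with the share of `target` within `by`."""
    shares = data.groupby(by)[target].value_counts(normalize=True)
    # Naming the Series explicitly avoids relying on its default name.
    return (shares.rename(value_name)
                  .reset_index()
                  .sort_values([by, target], ignore_index=True))


# Toy data (not the census data).
toy = pd.DataFrame({
    "salary": ["<=50K", "<=50K", "<=50K", ">50K", ">50K"],
    "education_summary": ["high", "low", "low", "high", "high"],
})

tidy = proportions_frame(toy, by="salary", target="education_summary")
print(tidy)
```

Because the rows come back sorted by both columns, hard-coded label lists (as in the pie plots) stay aligned with the data as long as they follow the same alphabetical order.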

In [None]:
# Visualizing the percentages of persons in each "salary" group by Education levels (high, medium, low)

plt.figure(figsize = (18, 6))
index = 1
for i in [0, 3]:
    plt.subplot(1, 2, index)
    edu_df["percentage"][i:i+3].plot.pie(subplots=True,
                                         labels=["primary education", "higher education", "secondary education"],
                                         autopct="%.2f%%",
                                         textprops={'fontsize': 12},
                                         colors=['pink', 'lightskyblue', 'lightgreen'],
                                         )
    plt.title(edu_df.salary[i], fontdict = {'fontsize': 14})
#    plt.legend()
    index += 1

**Write down the conclusions you draw from your analysis**

**Some Remarks on "education" Feature:** 

- Apparently, individuals with higher education levels have higher incomes. In other words, the more education, the higher the income.
- An individual with a lower (primary) educational background has a very low probability (2.13%) of earning more than 50K.

To sum up, the majority of participants who earn more than 50K (75.53%) have a high level of education. For those who make 50K or less, the shares of higher and secondary educational attainment are very close to each other, 48.08% and 49.79% respectively. Still, the educational attainment level of the high-income group is higher. Education and salary therefore appear to be strongly associated.
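
The remarks above mix two kinds of percentages: the share of salary levels within an education level, and the share of education levels within a salary level. A minimal sketch with `pd.crosstab` on toy data (not the census data, so the numbers differ from those quoted above) shows how the `normalize` argument switches between the two views.

```python
import pandas as pd

# Toy data (not the census data), chosen so that the percentages are easy to verify.
toy = pd.DataFrame({
    "education_summary": ["high"] * 6 + ["low"] * 4,
    "salary": [">50K"] * 3 + ["<=50K"] * 3 + [">50K"] * 1 + ["<=50K"] * 3,
})

# Share of each salary level WITHIN an education level (each row sums to 100).
within_education = pd.crosstab(toy["education_summary"], toy["salary"],
                               normalize="index") * 100

# Share of each education level WITHIN a salary level (each column sums to 100).
within_salary = pd.crosstab(toy["education_summary"], toy["salary"],
                            normalize="columns") * 100

print(within_education.round(2))
print(within_salary.round(2))
```

Reading a conclusion such as "X% of high earners are highly educated" requires the second table, while "X% of the highly educated are high earners" requires the first.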

## marital_status & relationship

**Detect the similarities between these features by comparing unique values**

In [None]:
# Checking the uniques of "marital_status" feature and determining their numbers

df.marital_status.value_counts(dropna=False)

In [None]:
# Checking the uniques of "relationship" feature and determining their numbers

df.relationship.value_counts(dropna=False)

In [None]:
# Filling the missing values with "Unknown" in the column of "relationship"

# Assigning the result back instead of calling an in-place method on the column
df["relationship"] = df["relationship"].fillna("Unknown")

In [None]:
# Checking the uniques of "relationship" and "marital_status" features and determining their numbers

df.groupby("relationship").marital_status.value_counts(dropna=False)

**Assessment :** These features carry almost the same information, but the "relationship" feature has 15% missing values. So I have decided to continue with the "marital_status" feature.
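
The share of missing values behind an assessment like this one can be computed directly. Here is a minimal sketch on toy rows (not the census data, so the percentage differs from the 15% above); it looks at the column before the missing values are replaced with "Unknown".

```python
import pandas as pd

# Toy rows (not the census data); None marks a missing "relationship" value.
toy = pd.DataFrame({
    "marital_status": ["Married-civ-spouse", "Married-civ-spouse", "Never-married",
                       "Divorced", "Never-married"],
    "relationship": ["Husband", "Wife", None, "Unmarried", "Own-child"],
})

# Percentage of missing values per column.
missing_pct = toy[["marital_status", "relationship"]].isna().mean() * 100

# Cross-tabulation of the two features, keeping missing values visible as "Unknown".
overlap = pd.crosstab(toy["relationship"].fillna("Unknown"), toy["marital_status"])

print(missing_pct)
print(overlap)
```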

**Visualize the number of people in each category**

In [None]:
# Visualizing the number of persons in "marital_status" feature

ax = sns.countplot(data=df, x="marital_status")

plt.title("The Distribution of Marital Status", fontsize=16, color="darkblue")
plt.xticks(rotation=45)

for container in ax.containers:
    ax.bar_label(container);

**Check the number of people at each "salary" level by category and visualize it with a countplot**

In [None]:
# Checking the number of persons in each sub-category of "marital_status" by Salary

df.groupby("marital_status").salary.value_counts()

In [None]:
# Visualizing the number of persons in each "marital_status" category by Salary

ax = sns.countplot(data=df, x="marital_status", hue="salary")

plt.title("The Number of People in Each Salary Level by Marital Status", fontsize=16, color="darkblue")
plt.xticks(rotation=45)

for container in ax.containers:
    ax.bar_label(container);

The categories of the "marital_status" feature will be regrouped into two classes, married and single (unmarried), for a more detailed investigation. The result will be added to the dataset as a new categorical feature.

**Reduce the categories of the "marital_status" feature to married and single, and create a new feature from this new categorical data**

In [None]:
def mapping_marital_status(x):
    if x in ["Never-married", "Divorced", "Separated", "Widowed"]:
        return "single"
    elif x in ["Married-civ-spouse", "Married-AF-spouse", "Married-spouse-absent"]:
        return "married"

In [None]:
df.marital_status.apply(mapping_marital_status).value_counts(dropna=False)

In [None]:
df["marital_status_summary"] = df.marital_status.apply(mapping_marital_status)

**Visualize the number of people in each category of the new marital status feature (married, single)**

In [None]:
# Visualizing the number of persons in each sub-categories of "marital_status_summary" (single, married) 

ax = sns.countplot(data=df, x="marital_status_summary")

plt.title("The Distribution of Marital Status", fontsize=16, color="darkblue")

for container in ax.containers:
    ax.bar_label(container);

**Check the number of people at each "salary" level by the new marital status (married, single) and visualize it with a countplot**

In [None]:
# Checking the number of persons in each sub-categories of "marital_status_summary" (single, married) by Salary

df.groupby("marital_status_summary").salary.value_counts()

In [None]:
# Visualizing the number of persons in each sub-categories of "marital_status_summary" (single, married) by Salary

ax = sns.countplot(data=df, x="marital_status_summary", hue="salary")

plt.title("The Distribution of Marital Status by Salary", fontsize=16, color="darkblue")

for container in ax.containers:
    ax.bar_label(container);

**Check the percentage distribution of people across "salary" levels within each new marital status (married, single) and visualize each with a separate pie plot**

In [None]:
# Checking the proportion of persons in each "marital_status_summary" by "salary" levels (<=50K, >50K)

marital = df.groupby(["marital_status_summary"]).salary.value_counts(normalize=True)
marital

In [None]:
# Visualizing the percentages of persons in each "marital_status_summary" group by "salary" levels (<=50K, >50K)

plt.figure(figsize = (18, 6))
index = 1
for i in [0, 2]:
    plt.subplot(1, 2,index)
    marital[i:i+2].plot.pie(subplots=True,
                            labels=["<=50K", ">50K"],
                            autopct="%.2f%%",
                            textprops={'fontsize': 12},
                            colors=['pink', 'lightskyblue'],
                            )
    plt.title(marital.index[i][0], fontdict = {'fontsize': 14})
#    plt.legend()
    index += 1

**Check the number of people in each new marital status (married, single) by "salary" level and visualize it with a countplot**

In [None]:
# Checking the number of persons in each sub-categories of Salary by "marital_status_summary" levels (single, married) 

df.groupby("salary").marital_status_summary.value_counts()

In [None]:
# Visualizing the number of persons in each sub-categories of Salary by "marital_status_summary" (single, married)

ax = sns.countplot(data=df, x="salary", hue="marital_status_summary")

plt.title("The Distribution of Marital Status by Salary", fontsize=16, color="darkblue")

for container in ax.containers:
    ax.bar_label(container);

**Check the percentage distribution of people across the new marital statuses (married, single) within each "salary" level and visualize each with a separate pie plot**

In [None]:
# Checking the proportion of persons in each "salary" levels (<=50K, >50K) by "marital_status_summary" (single, married)

marital = df.groupby("salary").marital_status_summary.value_counts(normalize=True)
marital

In [None]:
# Creating a dataframe demonstrating the proportions at Salary by "marital_status_summary" levels (single, married) 

# Naming the values "percentage" directly (the Series' default name can differ between pandas versions)
marital_df = marital.rename("percentage").reset_index()
marital_df.sort_values(by=["salary", "marital_status_summary"], inplace=True)
marital_df

In [None]:
# Visualizing the percentages of persons in each "salary" levels (<=50K, >50K) by "marital_status_summary" levels (single, married) 

plt.figure(figsize = (18, 6))
index = 1
for i in [0, 2]:
    plt.subplot(1,2,index)
    marital_df["percentage"][i:i+2].plot.pie(subplots=True,
                                             labels=["married", "unmarried"],
                                             autopct="%.2f%%",
                                             textprops={'fontsize': 12},
                                             colors=['pink', 'lightskyblue'],
                                             )
    plt.title(marital_df.salary[i], fontdict = {'fontsize': 14})
#    plt.legend()
    index += 1

**Write down the conclusions you draw from your analysis**

**Some Remarks on "marital_status" Feature :** It can be stated that married persons earn more than single persons. The majority (85.9%) of those who make more than 50K are married. Marital status seems to be associated with the income of individuals.

To sum up, this may imply either that individuals with a high salary are more likely to get married, or that being married contributes to earning a high salary.

## workclass

**Check the number of people in each category and visualize it with a countplot**

In [None]:
# Checking the counts of unique values in "workclass" feature

df.workclass.value_counts(dropna=False)

In [None]:
# Visualizing the number of people in each category of "workclass"

ax = sns.countplot(data=df, x="workclass")

plt.title("The Distribution of Workclass", fontsize=18, color="darkblue")
plt.xticks(rotation=60)

for container in ax.containers:
    ax.bar_label(container);

**Replace the value "?" with the value "Unknown"**

In [None]:
# Replacing "?" values with "Unknown"

# Assigning the result back instead of calling an in-place method on the column
df["workclass"] = df["workclass"].replace("?", "Unknown")

**Check the number of people at each "salary" level by workclass group and visualize it with a countplot**

In [None]:
# Checking the number of persons in each sub-categories of "workclass" by "salary" levels  

df.groupby("workclass").salary.value_counts()

In [None]:
# Visualizing the number of "workclass" by Salary 

ax = sns.countplot(data = df, x="workclass", hue="salary")

plt.title("The Distribution of Workclass by Salary", fontsize=18, color="darkblue")
plt.xticks(rotation=60)

for container in ax.containers:
    ax.bar_label(container);

**Domain Knowledge:** ``Self-employment income`` is income that arises from the performance of personal services, but which cannot be classified as wages because an employer-employee relationship does not exist between the payer and the payee.

**Check the percentage distribution of people across "salary" levels within each workclass group and visualize it with a bar plot**

In [None]:
# The proportion of persons in each "workclass" levels by "salary" (<=50K, >50K)

workclass = df.groupby("workclass").salary.value_counts(normalize=True)
workclass

In [None]:
# Creating a dataframe demonstrating the proportions at "workclass" by "salary" levels 

# Naming the values "percentage" directly (the Series' default name can differ between pandas versions)
workclass_df = workclass.rename("percentage").reset_index()
workclass_df.sort_values(by=["workclass", "salary"], inplace=True)
workclass_df

In [None]:
# Visualizing the proportion of persons in each sub-category of "workclass" by "salary" 

fig, ax = plt.subplots()

ax = sns.barplot(data=workclass_df, x="workclass", y="percentage", hue="salary")

plt.title("The Distribution of Workclass", fontsize=18, color="darkblue")
plt.xticks(rotation=60)

for container in ax.containers:
    ax.bar_label(container, fmt="%.2f");

**Check the number of people in each workclass group by "salary" level and visualize it with a countplot**

In [None]:
# Checking the number of persons in each sub-categories of "salary" by "workclass" levels 

df.groupby("salary").workclass.value_counts()

In [None]:
# Visualizing the number of persons in each sub-categories of "salary" by "workclass" levels 

ax = sns.countplot(data=df, x="salary", hue="workclass")

plt.title("The Number of People in Each Workclass by Salary", fontsize=16)
for container in ax.containers:
    ax.bar_label(container);

**Check the percentage distribution of people across workclass groups within each "salary" level and visualize it with a bar plot**

In [None]:
# The proportion of persons in each sub-categories of "salary" by "workclass" levels 

workclass = df.groupby("salary").workclass.value_counts(normalize=True)
workclass

In [None]:
# Creating a dataframe demonstrating the proportions at "salary" by "workclass" levels 

# Naming the values "percentage" directly (the Series' default name can differ between pandas versions)
workclass_df = workclass.rename("percentage").reset_index()
workclass_df.sort_values(by=["salary", "percentage"], ascending=False, inplace=True)
workclass_df

In [None]:
# Visualizing the proportion of persons in each sub-categories of "salary" by "workclass" levels  

fig, ax = plt.subplots(figsize=(14, 6))

ax = sns.barplot(data=workclass_df, x="salary", y="percentage", hue="workclass")

plt.title("The Percentage of People in Each Workclass by Salary", fontsize=16)
for container in ax.containers:
    ax.bar_label(container, fmt='%.2f', label_type='center', color="w");
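
To compare the two salary groups directly, the share of each work-class in the two groups can be put side by side and subtracted. The following is a minimal sketch on toy data (not the census data); the column name `difference` is introduced here only for illustration.

```python
import pandas as pd

# Toy data (not the census data): 10 people in each salary group.
toy = pd.DataFrame({
    "workclass": (["Private"] * 7 + ["Local-gov"] * 2 + ["Self-emp-inc"] * 1
                  + ["Private"] * 6 + ["Local-gov"] * 2 + ["Self-emp-inc"] * 2),
    "salary": ["<=50K"] * 10 + [">50K"] * 10,
})

# Share (in %) of each work-class within each salary group; every column sums to 100.
share = pd.crosstab(toy["workclass"], toy["salary"], normalize="columns") * 100

# Positive values mark work-classes that are over-represented among high earners.
share["difference"] = share[">50K"] - share["<=50K"]
ranked = share.sort_values("difference", ascending=False)

print(ranked.round(2))
```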

**Write down the conclusions you draw from your analysis**

**Result :** The shares of federal, state and local government workers and of self-employed individuals are relatively higher in the income category above 50K than in the group earning 50K or less.

The "Self-emp-inc" work-class has a high ratio of high-income individuals within its own group. The "Private" work-class has a high ratio in both income groups; however, its share of the high-income group is slightly lower than its share of the low-income group.

## occupation

**Check the number of people in each sub-category of occupation and visualize it with a countplot**

In [None]:
# Checking the counts of unique values in "occupation" feature

df.occupation.value_counts(dropna=False)

In [None]:
# Visualizing the number of people in each category of "occupation"

ax = sns.countplot(data=df, x="occupation")

plt.title("The Number of People in Each Occupation", fontsize=16)
plt.xticks(rotation=60)

for container in ax.containers:
    ax.bar_label(container);

**Replace the value "?" with the value "Unknown"**

In [None]:
# Replacing "?" values with "Unknown"

# Assigning the result back instead of calling an in-place method on the column
df["occupation"] = df["occupation"].replace("?", "Unknown")

**Check the number of people at each "salary" level by occupation group and visualize it with a countplot**

In [None]:
# Checking the number of persons in each sub-categories of "occupation" by "salary" levels  

df.groupby("occupation").salary.value_counts()

In [None]:
# Visualizing the number of "occupation" by Salary 

ax = sns.countplot(data=df, x="occupation", hue="salary")

plt.title("The Number of People in Each Salary Level by Occupation", fontsize=16)
plt.xticks(rotation=90)

for container in ax.containers:
    ax.bar_label(container);

**Check the percentage distribution of people across "salary" levels within each occupation group and visualize it with a bar plot**

In [None]:
# Checking the proportion of persons in each "salary" levels (<=50K, >50K) by "occupation" 

occupation = df.groupby("occupation").salary.value_counts(normalize=True)
occupation

In [None]:
# Creating a dataframe demonstrating the proportions at "occupation" by "salary" levels 

# Naming the values "percentage" directly (the Series' default name can differ between pandas versions)
occupation_df = occupation.rename("percentage").reset_index()
occupation_df.sort_values(by=["occupation", "salary"], inplace=True)
occupation_df

In [None]:
# Visualizing the proportion of persons in each sub-category of "occupation" by "salary" 

fig, ax = plt.subplots(figsize=(14, 6))

ax = sns.barplot(data=occupation_df, x="occupation", y="percentage", hue="salary")

plt.title("The Percentage of People in Each Salary Level by Occupation", fontsize=16)
plt.xticks(rotation=60)
plt.legend(loc=(1.01, 0.9))

for container in ax.containers:
    ax.bar_label(container, fmt='%.2f', color="r");

**Check the number of people in each occupation group by "salary" level and visualize it with a countplot**

In [None]:
# Checking the number of persons in each "salary" levels by "occupation" 

df.groupby("salary").occupation.value_counts()

In [None]:
# Visualizing the number of persons in each "occupation" group by "salary" levels 

fig, ax = plt.subplots(figsize=(14, 6))

plt.title("The Number of People in Each Occupation by Salary Levels", fontsize=16);
ax = sns.countplot(data=df, x="salary", hue="occupation")

for container in ax.containers:
    ax.bar_label(container, color="#1927BD");

**Check the percentage of people in each occupation group by "salary" level and visualize it with a bar plot**

In [None]:
# Checking the percentage of persons in each "occupation" group by "salary" levels 

occupation = df.groupby("salary").occupation.value_counts(normalize=True)*100
occupation

In [None]:
# Creating a dataframe demonstrating the percentages of people at "occupation" by "salary" levels 

# Naming the values "percentage" directly (the Series' default name can differ between pandas versions)
occupation_df = occupation.rename("percentage").reset_index()
occupation_df.sort_values(by=["salary", "occupation"], inplace=True)
occupation_df

In [None]:
# Visualizing the percentage of persons at "occupation" by "salary" levels 

fig, ax = plt.subplots(figsize=(20, 6))

ax = sns.barplot(data=occupation_df, x="salary", y="percentage", hue="occupation")

plt.title("The Percentage of People in Each Occupation by Salary Levels", fontsize=16)
for container in ax.containers:
    ax.bar_label(container, fmt="%.1f%%", color="#00010A", fontsize=12);

**Write down the conclusions you draw from your analysis**

**Some Remarks on "occupation" Feature:** The "Exec-managerial" and "Prof-specialty" occupations have a high ratio of high-income individuals, both within their own groups and as a share of the high-income group.

## race

**Check the number of people in each category and visualize it with a countplot**

In [None]:
# Checking the counts of unique values in "race" feature

df.race.value_counts(dropna=False)

In [None]:
# Visualizing the number of people in each category of "race"

ax = sns.countplot(data=df, x="race")

plt.title("The Number of People by Race", fontsize=16)
plt.xticks(rotation=45)

for container in ax.containers:
    ax.bar_label(container, fontsize=13);

We see that White individuals are in the majority in our dataset.

**Check the number of people at each "salary" level by race and visualize it with a countplot**

In [None]:
# Checking the number of persons in each sub-categories of "race" by "salary"   

df.groupby("race").salary.value_counts()

We see that the majority of White and African-American individuals in our dataset make 50K or less. However, White individuals far outnumber the other races among those making more than 50K. We are curious whether this is because of their much larger number in the dataset or because race matters.
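
One way to separate the effect of group size from the effect of group membership is to look at rates instead of counts. The sketch below uses toy data with two hypothetical groups, "A" and "B" (not the census categories): group A has far more high earners in absolute terms, yet both groups have the same rate.

```python
import pandas as pd

# Toy data (not the census data): group A is four times larger than group B.
toy = pd.DataFrame({
    "race": ["A"] * 80 + ["B"] * 20,
    "salary": ([">50K"] * 20 + ["<=50K"] * 60) + ([">50K"] * 5 + ["<=50K"] * 15),
})

# Group size and number of high earners per group.
summary = toy.groupby("race")["salary"].agg(
    total="size",
    high_income=lambda s: (s == ">50K").sum(),
)

# The rate removes the effect of group size from the comparison.
summary["high_income_rate_pct"] = summary["high_income"] / summary["total"] * 100

print(summary)
```

If the rates differ between groups in the real data, size alone does not explain the gap in counts; the rate comparison still says nothing about causes.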

In [None]:
# Visualizing the number of people in each category of "salary" by "race"  

ax = sns.countplot(data=df, x="race", hue="salary")

plt.title("The Number of People in Each Race Category by Salary Levels", fontsize=16);
plt.xticks(rotation=45)

for container in ax.containers:
    ax.bar_label(container, fontsize=12);

**Check the percentage distribution of people at each "salary" level for each race and visualize it with pie plots**

In [None]:
# Checking the proportion of people in each "race" group by "salary" levels

race = df.groupby("race").salary.value_counts(normalize=True)
race

In [None]:
# Visualizing the percentage of people in each "race" group by "salary" levels

plt.figure(figsize = (18, 12))
index=1
for i in [0, 2, 4, 6, 8]:
    plt.subplot(2, 3, index)
    race[i:i+2].plot.pie(subplots=True,
                         labels=["<=50K", ">50K"],
                         autopct="%.2f%%",
                         textprops={'fontsize': 12},
                         colors=['pink', 'lightskyblue'],
                         )
    plt.title(race.index[i][0], fontdict = {'fontsize': 14})
#    plt.legend()
    index += 1

**Check the number of people of each race by "salary" level and visualize it with a countplot**

In [None]:
# Checking the number of persons in each sub-categories of "race" by "salary" levels  

df.groupby("salary").race.value_counts()

In [None]:
# Visualizing the number of persons in each sub-categories of "race" by "salary" levels  

ax = sns.countplot(data=df, x="salary", hue="race")

plt.title("The Number of People in Salary Levels by Race Category", fontsize=16);
for container in ax.containers:
    ax.bar_label(container, fontsize=12);

**Check the percentage distribution of people of each race by "salary" level and visualize it with a bar plot**

In [None]:
# Checking the percentage of persons at "race" by "salary" levels 

race = df.groupby("salary").race.value_counts(normalize=True)*100
race

🤔 ❓ **[What race in America has the most money?](https://www.federalreserve.gov/econres/notes/feds-notes/disparities-in-wealth-by-race-and-ethnicity-in-the-2019-survey-of-consumer-finances-20200928.html)**

White families have more wealth than Black, Hispanic, and other or multiple race families in the 2019 SCF. Source: Federal Reserve Board, 2019 Survey of Consumer Finances. Notes: Figures displays median (top panel) and mean (bottom panel) wealth by race and ethnicity, expressed in thousands of 2019 dollars... [Source](https://www.federalreserve.gov/econres/notes/feds-notes/disparities-in-wealth-by-race-and-ethnicity-in-the-2019-survey-of-consumer-finances-20200928.html)

🤔 ❓ **[Racial, gender wage gaps persist in U.S. despite some progress](https://www.pewresearch.org/fact-tank/2016/07/01/racial-gender-wage-gaps-persist-in-u-s-despite-some-progress/)**

Large racial and gender wage gaps in the U.S. remain, even as they have narrowed in some cases over the years. Among full- and part-time workers in the U.S., blacks in 2015 earned just 75% as much as whites in median hourly earnings and women earned 83% as much as men... [Source](https://www.pewresearch.org/fact-tank/2016/07/01/racial-gender-wage-gaps-persist-in-u-s-despite-some-progress/)

In [None]:
# Creating a dataframe demonstrating the percentage of persons at "race" by "salary" levels 

# Series.rename names the value column "percentage" regardless of the pandas version
race_df = race.rename("percentage").reset_index()
race_df.sort_values(by=["salary", "race"], inplace=True)
race_df

In [None]:
# Visualizing the percentage of persons at "race" by "salary" levels 

fig, ax = plt.subplots(figsize=(14, 6))

ax = sns.barplot(data=race_df, x="salary", y="percentage", hue="race")

plt.title("The Percentage of People in Salary Levels by Race", fontsize=16);

for container in ax.containers:
    ax.bar_label(container, fmt="%.2f%%", fontsize=12);

**Write down the conclusions you draw from your analysis**

**Result :** "Asian-Pac-Islander" and "White" have the highest share of high-income earners within their own group. "White" also accounts for a high share of the high-income group.

25.6% of whites and 26.6% of Asian-Pac-Islanders have income over 50K, compared to 12.4% of Blacks, 11.6% of Amer-Indian-Eskimo and 9.2% of Others. Race seems to be a factor in salary.

## gender

**Check the number of people of each gender and visualize it with a countplot**

In [None]:
# Checking the counts of unique values in "gender" feature

df.gender.value_counts(dropna=False)

In [None]:
# Visualizing the number of people in each category of "gender"

ax = sns.countplot(data=df, x="gender")

plt.title("The Number of People by Gender", fontsize=16);

for container in ax.containers:
    ax.bar_label(container, fontsize=14);

**Check the number of people at each "salary" level by gender and visualize it with a countplot**

In [None]:
# Checking the number of people in each sub-categories of "gender" by "salary" levels   

df.groupby("gender").salary.value_counts()

In [None]:
# Visualizing the number of people in each sub-categories of "gender" by "salary" levels  

ax = sns.countplot(data=df, x="gender", hue="salary")

plt.title("Gender Classification by Salary", fontsize=16);

for container in ax.containers:
    ax.bar_label(container, fontsize=12);

### 📝 [Domain Knowledge About Gender Pay Gap in the US](https://www.etax.com.au/gender-pay-gap/)

Over half a century after pay discrimination became illegal in the United States, a persistent pay gap between men and women continues to hurt our nation’s workers and our national economy.

The gender pay gap is the result of many factors, including race and ethnicity, disability, access to education and age. As a result, different groups of women experience very different gaps in pay. The gender pay gap is a complex issue that will require robust and inclusive solutions.

🧐 **[The Simple Truth About the Gender Pay Gap](https://www.aauw.org/resources/research/simple-truth/)**

The main reason that women hold part-time jobs: they cannot find full-time jobs. Child care and work in the home are the other main factors. "Women still tend to be the last to be hired and the first to be fired," says Ms. Chinery-Hesse, ILO Deputy Director-General.

🧐 **["Women Work More, But are Still Paid Less" the Article at International Labor Organization](https://www.ilo.org/global/about-the-ilo/newsroom/news/WCMS_008091/lang--en/index.htm)**

🧐 **[Do women make less money than men in America?](https://www.cnbc.com/2022/05/19/women-are-still-paid-83-cents-for-every-dollar-men-earn-heres-why.html)**

In 2020, women made 83 cents for every dollar earned by men, according to the U.S. Census Bureau. Women of color are at an even greater disadvantage. The gender wage gap was much larger in 1960, when women's pay was 61% of men's. But progress has stalled over the last 15 or more years, according to researchers.

**Check the percentage distribution of people at each "salary" level for each gender and visualize it with pie plots**

In [None]:
# Checking the percentage of people in each "salary" level by "gender"  

gender = df.groupby("gender").salary.value_counts(normalize=True)*100
gender

In [None]:
# Visualizing the percentage of people in each "salary" level by "gender"

plt.figure(figsize = (18, 6))
index = 1
for i in [0, 2]:
    plt.subplot(1, 2, index)
    gender[i:i+2].plot.pie(subplots=True,
                           labels=["<=50K", ">50K"],
                           autopct="%.2f%%",
                           textprops={'fontsize': 12},
                           colors=['pink', 'lightskyblue'],
                           )
    plt.title(gender.index[i][0], fontdict = {'fontsize': 14})
#    plt.legend()
    index += 1

**Some Remarks on "gender" Feature:** 11% of females earn more than 50K compared to 30.6% of males, so the rate for females is roughly one third of the rate for males. From the salary perspective, among those earning more than 50K, males make up 85% and females only 15%.
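The two statements above answer different questions: the first conditions on gender, the second conditions on salary level. As a rough illustration, `pd.crosstab` can produce both views by switching the `normalize` argument. The toy frame below is made up for the example and does not reproduce the dataset's figures:

```python
import pandas as pd

# Hypothetical toy data, not the real census figures.
toy = pd.DataFrame({
    "gender": ["Male"] * 6 + ["Female"] * 4,
    "salary": [">50K"] * 3 + ["<=50K"] * 3 + [">50K"] * 1 + ["<=50K"] * 3,
})

# Each row sums to 100: share of each salary level within a gender.
within_gender = pd.crosstab(toy.gender, toy.salary, normalize="index") * 100

# Each column sums to 100: share of each gender within a salary level.
within_salary = pd.crosstab(toy.gender, toy.salary, normalize="columns") * 100

print(within_gender)
print(within_salary)
```

Reading the wrong table is an easy mistake to make, so it helps to check which axis sums to 100 before quoting a percentage.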

**Check the number of people of each gender by "salary" level and visualize it with a countplot**

In [None]:
# Checking the number of people in each sub-categories of "gender" by "salary" levels  

df.groupby("salary").gender.value_counts()

In [None]:
# Visualizing the number of persons in each sub-categories of "gender" by "salary" levels  

ax = sns.countplot(data=df, x="salary", hue="gender")

plt.title("The Number of People in Salary Levels by Gender", fontsize=16);

for container in ax.containers:
    ax.bar_label(container, fontsize=12);

In [None]:
df.gender.value_counts(normalize=True)

In [None]:
df.groupby("salary").gender.value_counts(normalize=True)*100

**Check the percentage distribution of people of each gender by "salary" level and visualize it with pie plots**

In [None]:
# Checking the percentage of persons at "gender" by "salary" levels 

gender = df.groupby("salary").gender.value_counts(normalize=True)*100
gender

In [None]:
# Visualizing the percentage of persons at "gender" by "salary" levels 

plt.figure(figsize=(18, 6))
index = 1
for i in [0, 2]:
    plt.subplot(1, 2, index)
    gender[i:i+2].plot.pie(subplots=True,
                         labels=["Male", "Female"],
                         autopct="%.2f%%",
                         textprops={'fontsize': 12},
                         colors=['pink', 'lightskyblue'],
                         )
    plt.title(gender.index[i][0], fontdict = {'fontsize': 14})
#    plt.legend()
    index += 1

**Write down the conclusions you draw from your analysis**

**Some Remarks on "gender" Feature :** In this dataset, males are clearly more likely than females to earn more than 50K.



## native_country

**Check the number of people in each category and visualize it with a countplot**

In [None]:
# Checking the counts of unique values in "native_country" feature

df.native_country.value_counts(dropna=False)

In [None]:
# Visualizing the number of people in each category of "native_country"

plt.figure(figsize=(14, 6))

ax = sns.countplot(data=df, x="native_country")

plt.title("The Number of People by Country", fontsize=16)
plt.xticks(rotation=90)

for container in ax.containers:
    ax.bar_label(container);

**Replace the value "?" with the value "Unknown"**

In [None]:
# Replacing "?" values with "Unknown"

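# Assigning the result back avoids in-place modification through attribute access,
# which newer pandas versions warn about or silently ignore
df["native_country"] = df["native_country"].replace("?", "Unknown")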

**Reduce the categories of the "native_country" feature to "US" and "Others", and create a new feature from this summarized data**
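The collapsing can also be done without a Python-level function. The snippet below is a minimal sketch using `numpy.where` on a made-up toy Series (the `countries` values are illustrative only). Like the mapping function used in this notebook, it sends "Unknown" to "Others":

```python
import numpy as np
import pandas as pd

# Hypothetical sample of "native_country" values.
countries = pd.Series(["United-States", "Mexico", "Unknown", "United-States"])

# Vectorized two-way mapping: "US" where the condition holds, "Others" elsewhere.
summary = pd.Series(
    np.where(countries == "United-States", "US", "Others"),
    index=countries.index,
)

print(summary.value_counts())
```

Both approaches give the same categories; the vectorized form mainly matters for speed on large frames.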

In [None]:
def mapping_native_country(x):
    if x == "United-States":
        return "US"
    else:
        return "Others"

In [None]:
# Decreasing the number of categories in "native_country" feature as "US", and "Others"   

df.native_country.apply(mapping_native_country).value_counts(dropna=False)

In [None]:
# Creating a new feature named "native_country_summary"

df["native_country_summary"] = df.native_country.apply(mapping_native_country)
df["native_country_summary"]

**Visualize the number of people in each new category (US, Others)**

In [None]:
# Visualizing the number of people in each category of "native_country_summary"

ax = sns.countplot(data=df, x="native_country_summary")

plt.title("The Number of People From US & Other Countries", fontsize=16);

for container in ax.containers:
    ax.bar_label(container, fontsize=12);

**Check the number of people at each "salary" level by the new native-country categories (US, Others) and visualize it with a countplot**

In [None]:
# Checking the number of persons in each sub-categories of "salary" by "native_country_summary"   

df.groupby("native_country_summary").salary.value_counts()

In [None]:
# Visualizing the number of persons in each sub-categories of "salary" by "native_country_summary"   

ax = sns.countplot(data=df, x="native_country_summary", hue="salary")

plt.title("The Number of People From US and Other Countries by Salary Levels", fontsize=16);

for container in ax.containers:
    ax.bar_label(container, fontsize=12);

**Check the percentage distribution of people at each "salary" level for each new native-country category (US, Others) and visualize it with separate pie plots**

In [None]:
# Checking the percentage of persons in each "salary" levels by "native_country_summary" 

country = df.groupby(["native_country_summary"]).salary.value_counts(normalize=True)*100
country

In [None]:
# Visualizing the percentage of persons in each "salary" levels by "native_country_summary" 

plt.figure(figsize = (18, 6))
index = 1
for i in [0, 2]:
    plt.subplot(1, 2, index)
    country[i:i+2].plot.pie(subplots=True,
                            labels=["<=50K", ">50K"],
                            autopct="%.2f%%",
                            textprops={'fontsize': 12},
                            colors=['pink', 'lightskyblue'],
                            )
    plt.title(country.index[i][0], fontdict = {'fontsize': 14})
#    plt.legend()
    index += 1

**Some Remarks on "Salary" and "Nationality" Features :** While 19.8% of foreign nationals earn more than 50K, this percentage is 24.6% for US nationals.

**Check the number of people in each new native-country category (US, Others) by "salary" level and visualize it with a countplot**

In [None]:
# Checking the number of persons in each sub-categories of "native_country_summary" by "salary" levels  

df.groupby("salary").native_country_summary.value_counts()

In [None]:
# Visualizing the number of persons in each sub-categories of "native_country_summary" by "salary" levels  

ax = sns.countplot(data=df, x="salary", hue="native_country_summary")

plt.title("The Number of People From Each Country by Salary", fontsize=16);

for container in ax.containers:
    ax.bar_label(container, fontsize=12);

**Check the percentage distribution of people in each new native-country category (US, Others) by "salary" level and visualize it with separate pie plots**

In [None]:
# Checking the percentage of persons in each sub-categories of "native_country_summary" by "salary" levels  

country = df.groupby(["salary"]).native_country_summary.value_counts(normalize=True)*100
country

In [None]:
# Visualizing the percentage of persons in each sub-categories of "native_country_summary" by "salary" levels  

plt.figure(figsize = (18, 6))
index = 1
for i in [0, 2]:
    plt.subplot(1, 2, index)
    country[i:i+2].plot.pie(subplots=True,
                            labels=["US", "Others"],
                            autopct="%.2f%%",
                            textprops={'fontsize': 12},
                            colors=['pink', 'lightskyblue'],
                            )
    plt.title(country.index[i][0], fontdict = {'fontsize': 14})
#    plt.legend()
    index += 1

**Write down the conclusions you draw from your analysis**

**Some Remarks on "native_country" Feature :** Individuals from the "United-States" have a higher share of high-income earners within their own group, and they also make up the larger part of the high-income group.

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Other Specific Analysis Questions</p>

<a id="5"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

### 1. What is the average age of males and females by income level?

In [None]:
# Determining the average age of people in each sub-category of "gender" by "salary"

df.groupby(["salary", "gender"]).age.mean()

In [None]:
# With Pandas

# Visualizing the average age of people in each sub-category of "gender" by "salary" with Pandas

fig, ax = plt.subplots()

ax = df.groupby(["salary", "gender"]).age.mean().plot.bar()
plt.xticks(rotation=0)

for container in ax.containers:
    ax.bar_label(container, fmt="%.2f", fontsize=12);

In [None]:
age = df.groupby(["salary", "gender"])[["age"]].mean().reset_index()
age

In [None]:
# With Seaborn

# Visualizing the average age of people in each sub-category of "gender" by "salary" with Seaborn

fig, ax = plt.subplots()

ax = sns.barplot(data=age, x="salary", y="age", hue="gender")

plt.title("The Average Age by Income Level", fontsize=16)

for container in ax.containers:
    ax.bar_label(container, fmt="%.2f", fontsize=12);

### 2. What are the workclass percentages of Americans in the high-level income group?

In [None]:
# Determining the workclass percentages of US citizens in the high-level income group

workclass_US = df[(df.salary == ">50K") & (df.native_country_summary == "US")]\
                                                    .workclass.value_counts(dropna=False, normalize=True)*100
workclass_US

In [None]:
# Visualizing the workclass percentages of US citizens in the high-level income group

fig, ax = plt.subplots()

ax = sns.barplot(x=workclass_US.index, y=workclass_US.values)

plt.title("Workclass Percentages in High-Level Income", fontsize=16)
plt.xticks(rotation=0)

for container in ax.containers:
    ax.bar_label(container, fmt="%.2f%%", fontsize=12);

The majority of US nationals (63%) who earn more than 50K work in the private sector.

### 3. What are the occupation percentages of Americans who work in the "Private" workclass in the high-level income group?

In [None]:
# Determining the occupation percentages of US citizens who work in the "Private" workclass in the high-level income group

occupation_US = df[(df.salary == ">50K") & (df.native_country_summary == "US") & (df.workclass == "Private")]\
                  .occupation.value_counts(dropna=False, normalize=True)*100 
occupation_US

In [None]:
# Visualizing the occupation percentages of US citizens who work in the "Private" workclass in the high-level income group

fig, ax = plt.subplots()

ax = sns.barplot(x=occupation_US.index, y=occupation_US.values)

plt.title("Occupation Percentages in Private Workclass with High-Level Income", fontsize=16)
plt.xticks(rotation=60)

for container in ax.containers:
    ax.bar_label(container, fmt="%.2f%%", fontsize=10);

More than half (about 60%) of the high-income US nationals working in the private sector are clustered in three occupation groups: "Exec-managerial", "Prof-specialty", and "Craft-repair". The last three places are taken by "Farming-fishing", "Protective-serv", and "Priv-house-serv", which may require relatively less education but more working hours per week.
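A combined share such as the one quoted for the top occupation groups can be read off a cumulative sum of the sorted percentages. The sketch below uses made-up counts (the `occupations` Series is hypothetical and does not reproduce the dataset's numbers):

```python
import pandas as pd

# Hypothetical occupation labels with made-up frequencies.
occupations = pd.Series(
    ["Exec-managerial"] * 5 + ["Prof-specialty"] * 3 + ["Craft-repair"] * 2
)

# value_counts sorts in descending order, so head(k) gives the top-k categories.
shares = occupations.value_counts(normalize=True) * 100
top2_share = shares.head(2).sum()

# Running total of the shares, from the most to the least frequent category.
cumulative = shares.cumsum()

print(shares)
print(cumulative)
```

Applied to `occupation_US`, the same `cumsum` call would show how many categories are needed to pass any given share of the group.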

### 4. What are the education level percentages of the Asian-Pac-Islander race group in the high-level income group?

In [None]:
# Determining the education level percentages of people whose race is Asian-Pac-Islander and whose income level is high

Asian_Pac_Islander_Edu = df[(df.salary == ">50K") & (df.race == "Asian-Pac-Islander")]\
                     .education.value_counts(dropna=False, normalize=True)*100 
Asian_Pac_Islander_Edu

In [None]:
# Visualizing the education level percentages of people whose race is Asian-Pac-Islander and whose income level is high

fig, ax = plt.subplots()

ax = sns.barplot(x=Asian_Pac_Islander_Edu.index, y=Asian_Pac_Islander_Edu.values)

plt.title("Education Level Percentages of Asian-Pac-Islanders with High-Level Income", fontsize=16)
plt.xticks(rotation=90)

for container in ax.containers:
    ax.bar_label(container, fmt="%.2f%%", fontsize=10);

Over 56% of Asian-Pac-Islanders in the high-level income group have a Bachelors degree or higher.

### 5. What are the occupation percentages of the Asian-Pac-Islander race group with a Bachelors degree in the high-level income group?

In [None]:
# Determining the occupation percentages of people
# whose race is Asian-Pac-Islander and who have a Bachelors degree in the high-level income group

Asian_Pac_Islander_Occupation = df[(df.salary == ">50K") & (df.race == "Asian-Pac-Islander") & (df.education == "Bachelors")]\
                                 .occupation.value_counts(dropna=False, normalize=True)*100 
Asian_Pac_Islander_Occupation

In [None]:
# Visualizing the occupation percentages of people
# whose race is Asian-Pac-Islander and who have a Bachelors degree in the high-level income group

fig, ax = plt.subplots()

ax = sns.barplot(x=Asian_Pac_Islander_Occupation.index, y=Asian_Pac_Islander_Occupation.values)

plt.title("Occupation Percentages of Asian-Pac-Islanders with Bachelors Degree and High-Level Income", fontsize=14)
plt.xticks(rotation=60)

for container in ax.containers:
    ax.bar_label(container, fmt="%.2f%%", fontsize=10);

About 66 percent of Asian-Pac-Islanders with a Bachelors degree in the high-level income group work in three occupational groups: "Exec-managerial" with 27.84%, "Prof-specialty" with 25.77%, and "Adm-clerical" with 12.37%.

### 6. What is the mean of working hours per week by gender for education level, workclass and marital status? Try to plot all required in one figure.

In [None]:
# Visualizing the average of working hours per week by gender for education level, workclass and marital status 

g = sns.catplot(x="education_summary",
                y="hours_per_week",
                data=df,
                kind="bar",
                estimator= np.mean,
                hue="gender",
                col="marital_status_summary",
                row="native_country_summary",
                ci=None,
                palette=sns.color_palette(['skyblue', 'pink']));

g.fig.set_size_inches(15, 8)
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle('Working Hours Per Week by Gender for Education, Workclass, Marital Status', fontsize=16)

# iterate through axes
for ax in g.axes.ravel():

    # add annotations
    for container in ax.containers:
        ax.bar_label(container, fmt="%.2f", fontsize=12);
    
    ax.margins(y=0.2)

plt.show()

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Dropping Similar & Unnecessary Features</p>

<a id="6"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

In [None]:
df.info()

In [None]:
df.drop(columns=["education", "education_num", "relationship", "marital_status", "native_country"], inplace=True)

In [None]:
df.shape

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Handling with Missing Values</p>

<a id="7"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

**Check the missing values for all features basically**

In [None]:
df.isnull().sum()

**Besides, it's time to detect duplicated rows now and, if there are any, drop them permanently. Remember that we took care of them at the beginning of our analysis. Let's see what happens:**

In [None]:
df.shape

In [None]:
df.duplicated().value_counts()  

In [None]:
df.drop_duplicates(keep='first', inplace=True)  

In [None]:
df.shape

In [None]:
# As you remember, we checked this at the beginning of the analysis, and the "relationship" feature had more than one type of data

def check_obj_columns(df):
    '''
    Prints the name of each object column that holds mixed Python types; prints nothing otherwise.
    '''

    # Column-wise Series.map(type) avoids DataFrame.applymap, which is deprecated in recent pandas
    tdf = df.select_dtypes(include=['object']).apply(lambda col: col.map(type))
    for col in tdf:
        if len(set(tdf[col].values)) > 1:
            print("Column {} has mixed object types.".format(col))

check_obj_columns(df)

**1. It seems that there are no missing values. But we know that the "workclass" and "occupation" features have missing (inappropriate) values stored as the string "Unknown". Let's examine these features in more detail.**

**2. Decide if it's necessary to drop the "Unknown" string values or not?**

In [None]:
df.workclass.value_counts()

In [None]:
df.occupation.value_counts()

In [None]:
# Let's examine which sub-categories of "workclass" occur where "occupation" is "Unknown"

df[df.occupation == "Unknown"].workclass.value_counts()

In [None]:
# As seen, it is almost impossible to extract any insight from the output

df[df.occupation == "Unknown"]

In [None]:
df[df.occupation == "Unknown"].groupby(["occupation", "workclass"])["salary"].agg(pd.Series.mode)

In [None]:
df[df.occupation == "Unknown"].groupby(["occupation", "workclass"])\
.agg({"marital_status_summary": pd.Series.mode, "salary": pd.Series.mode, "native_country_summary": pd.Series.mode, "education_summary": pd.Series.mode, "race": pd.Series.mode, "gender": pd.Series.mode})

# df[df.occupation == "Unknown"].groupby(["occupation", "workclass"])["marital_status_summary", "salary", "native_country_summary", "education_summary", "race", "gender"]\
#                               .apply(lambda x : x.agg(pd.Series.mode))

**You may prefer to fill them using the findings above; however, we will NOT take this approach, since we assume that we have enough observations to continue the analysis.**
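Dropping instead of filling amounts to turning the "Unknown" placeholder into a real missing value and then removing the affected rows. A small sketch on a hypothetical toy frame (not the real data) shows the effect:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the "Unknown" placeholder in two columns.
toy = pd.DataFrame({
    "workclass": ["Private", "Unknown", "Private", "State-gov"],
    "occupation": ["Sales", "Unknown", "Unknown", "Adm-clerical"],
})

# Convert the placeholder to NaN, then drop every row that contains a NaN.
cleaned = toy.replace("Unknown", np.nan).dropna().reset_index(drop=True)
n_dropped = len(toy) - len(cleaned)

print(cleaned)
print("rows dropped:", n_dropped)
```

Note that `dropna` removes a row if any column is missing, so a row with a known "workclass" but unknown "occupation" is dropped as well.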

In [None]:
# So, let's assign these "Unknown" values to "NaN"

df.replace("Unknown", np.nan, inplace=True)

In [None]:
df.isnull().sum()

In [None]:
# Then, we can exclude them from our DataFrame 

df.dropna(inplace=True)

In [None]:
df.isnull().sum()

In [None]:
df.info()

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Handling with Outliers</p>

<a id="8"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

### Boxplot and Histplot for all numeric features

**Plot boxplots for each numeric feature in the same figure as subplots**

In [None]:
plt.boxplot((df[df.select_dtypes('number').columns]), 
            labels=df.select_dtypes('number').columns,
            showmeans=True,
            meanprops={"marker":"o",
                       "markerfacecolor":"white", 
                       "markeredgecolor":"black",
                       "markersize":"10"})
plt.ticklabel_format(style='plain', axis='y')
plt.show()

In [None]:
index = 0
plt.figure(figsize=(20, 10))

for feature in df.select_dtypes('number').columns:
    index += 1
    plt.subplot(2, 3, index)
    sns.boxplot(x=feature, 
                data=df, 
                whis=1.5,
                showmeans=True,
                meanprops={"marker":"o",
                           "markerfacecolor":"white", 
                           "markeredgecolor":"black",
                           "markersize":"10"})
    plt.ticklabel_format(style='plain', axis='x')

In [None]:
numerical_features_df = df.select_dtypes('number')
numerical_features_df

inputs = numerical_features_df.columns
num_inputs = len(inputs)

fig, ax = plt.subplots(1, num_inputs, figsize=(20, 10))

for i, (ax, curve) in enumerate(zip(ax.flat, inputs)):
    sns.boxplot(y=numerical_features_df[curve], 
                ax=ax, 
                color='cornflowerblue', 
                showmeans=True,
                meanprops={"marker":"o",
                           "markerfacecolor":"white",
                           "markeredgecolor":"black",
                           "markersize":"10"},
                flierprops={'marker':'o',
                            'markerfacecolor':'darkgreen',
                            'markeredgecolor':'darkgreen'})
    
    ax.set_title(inputs[i])
    ax.set_ylabel('')
    
plt.subplots_adjust(hspace=0.15, wspace=1.25)
plt.show()

**Plot both boxplots and histograms for each numeric feature in the same figure as subplots**

In [None]:
list(zip(df.select_dtypes('number').columns, [(1, 2),(3, 4),(5, 6),(7, 8),(9, 10)]))

In [None]:
plt.figure(figsize=(20, 40))

sns.set(style="whitegrid", font_scale=1)

for i, j in list(zip(df.select_dtypes('number').columns, [(1, 2),(3, 4),(5, 6),(7, 8),(9, 10)])):
    plt.subplot(5, 2, j[0])
    sns.boxplot(x=df.select_dtypes('number')[i])
    plt.subplot(5, 2, j[1])
    sns.histplot(x=df.select_dtypes('number')[i])

In [None]:
# Alternative Code Block Plotting Both "boxplots" and "histograms" for Each Numeric Features at the Same Figure

# index = 0
# plt.figure(figsize=(20, 40))

# for feature in df.select_dtypes('number').columns:
#     index += 1
#     plt.subplot(6, 2, index)
#     sns.boxplot(x=feature, data=df, whis=1.5)
#     index += 1
#     plt.subplot(6,2,index)
#     sns.histplot(x=feature, data=df)
#     plt.ticklabel_format(style='plain', axis='x')

**Check the statistical values for all numeric features**

In [None]:
df.describe().T

In [None]:
df[df["hours_per_week"] > 70]

**1. After analyzing all numerical features, we have decided not to treat the extreme values in the "fnlwgt", "capital_gain" and "capital_loss" features as outliers.**

**2. So let's examine the "age" and "hours_per_week" features and detect extreme values that could be outliers by using the IQR rule.**
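As a reminder, the IQR rule (Tukey's fences) flags values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where IQR = Q3 - Q1. The following is a minimal sketch; the helper name `tukey_fences` and the sample values are made up for illustration:

```python
import pandas as pd


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) limits of the IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical sample with one obviously extreme value.
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

lower, upper = tukey_fences(sample)
flagged = sample[(sample < lower) | (sample > upper)]

print(lower, upper)
print(flagged)
```

The cells below compute the same limits step by step for each feature, which makes the intermediate quartiles easier to inspect.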

### age

In [None]:
plt.figure(figsize=(20, 7))

plt.subplot(1, 2, 1)
sns.boxplot(data=df.age,
            orient="h",
            showmeans=True,
            meanprops={"marker":"o",
                       "markerfacecolor":"white", 
                       "markeredgecolor":"black",
                       "markersize":"10"})

plt.subplot(1, 2, 2)
sns.histplot(data=df.age);

In [None]:
low = df.age.quantile(0.25)
high = df.age.quantile(0.75)
IQR = high - low
low, high, IQR

In [None]:
lower_lim = low - (1.5 * IQR)
upper_lim = high + (1.5 * IQR)
lower_lim, upper_lim

In [None]:
df[df.age > upper_lim].age.value_counts()

In [None]:
df[df.age > upper_lim].sort_values(by="age", ascending=False)

### hours_per_week

In [None]:
plt.figure(figsize=(20, 7))

plt.subplot(1, 2, 1)
sns.boxplot(data=df.hours_per_week,
            orient="h",
            showmeans=True,
            meanprops={"marker":"o",
                       "markerfacecolor":"white", 
                       "markeredgecolor":"black",
                       "markersize":"10"})

plt.subplot(1, 2, 2)
sns.histplot(data=df.hours_per_week, bins=10);

In [None]:
low = df.hours_per_week.quantile(0.25)
high = df.hours_per_week.quantile(0.75)
IQR = high - low
low, high, IQR

In [None]:
lower_lim = low - (1.5 * IQR)
upper_lim = high + (1.5 * IQR)
lower_lim, upper_lim

In [None]:
df[df.hours_per_week > upper_lim].hours_per_week.value_counts().sort_index(ascending=False)

In [None]:
df[df.hours_per_week > upper_lim].sort_values(by="hours_per_week", ascending=False)

In [None]:
df[df.hours_per_week < lower_lim].hours_per_week.value_counts().sort_index()

In [None]:
df[df.hours_per_week < lower_lim].groupby("salary").hours_per_week.describe()

In [None]:
df[df.hours_per_week < lower_lim].groupby("salary").age.describe()

**Result :** As seen above, there are a number of extreme values in both the "age" and "hours_per_week" features. But how can we know whether these extreme values are outliers or not? At this point, **domain knowledge** comes to the fore.

📝 **Domain Knowledge for this dataset:**
1. In this dataset, all values are created according to the statements of individuals. So there can be some "data entry errors".
2. In addition, we aim to build an ML model, and we apply some restrictions to the data to get better performance from it.
3. In this respect, our sample space ranges for some features are as follows.
    - age : 17 to 80
    - hours_per_week : 7 to 70
    - if somebody's age is more than 60, he/she can't work more than 60 hours in a week
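The three restrictions listed above can also be expressed as a single boolean mask, which should keep the same rows as applying them one at a time. The sketch below runs on a made-up toy frame; the values are hypothetical and chosen so that each rule removes at least one row:

```python
import pandas as pd

# Hypothetical rows; only the second and fifth satisfy all three rules.
toy = pd.DataFrame({
    "age":            [16, 25, 45, 65, 65, 85, 30],
    "hours_per_week": [20, 40, 80, 65, 40, 30, 5],
})

keep = (
    toy.age.between(17, 80)                          # age within 17..80 (inclusive)
    & toy.hours_per_week.between(7, 70)              # weekly hours within 7..70 (inclusive)
    & ~((toy.age > 60) & (toy.hours_per_week > 60))  # nobody over 60 working more than 60 hours
)

filtered = toy[keep].reset_index(drop=True)
print(filtered)
```

Working with a mask also makes it easy to count how many rows each rule removes before anything is dropped.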

### Dropping rows according to the domain knowledge

In [None]:
df[(df.age < 17) |  (df.age > 80)].sort_values(by="age", ascending=False)

In [None]:
df[(df.age < 17) | (df.age > 80)].shape

In [None]:
drop_index = df[(df.age < 17) | (df.age > 80)].sort_values(by="age", ascending=False).index
drop_index

In [None]:
df.drop(drop_index, inplace=True)

In [None]:
df[(df.hours_per_week < 7) | (df.hours_per_week > 70)].sort_values(by="hours_per_week", ascending=False)

In [None]:
df[(df.hours_per_week < 7) | (df.hours_per_week > 70)].shape

In [None]:
drop_index = df[(df.hours_per_week < 7) | (df.hours_per_week > 70)].sort_values(by="hours_per_week", ascending=False).index
drop_index

In [None]:
df.drop(drop_index, inplace=True)

In [None]:
df[(df.age > 60) & (df.hours_per_week > 60)]

In [None]:
df[(df.age > 60) & (df.hours_per_week > 60)].shape

In [None]:
drop_index = df[(df.age > 60) & (df.hours_per_week > 60)].index
drop_index

In [None]:
df.drop(drop_index, inplace=True)

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.reset_index(drop=True, inplace=True)
df

In [None]:
df.info()

**Having dropped some of the extreme values flagged by "Tukey's Fence" and by our domain knowledge, let us check the box-and-whisker plots again:**

In [None]:
index = 0
plt.figure(figsize=(20, 10))

for feature in df.select_dtypes('number').columns:
    index += 1
    plt.subplot(2, 3, index)
    sns.boxplot(x=feature, data=df)
    plt.ticklabel_format(style='plain', axis='x')

**``Boxplots``** are a great way to summarize the distribution of a dataset. But they become less informative as the dataset grows: more and more points fall outside the whiskers and are drawn as outliers, while the shape of the tails stays hidden. **``Letter-Value Plots``** (or boxenplots) were developed to overcome this problem by showing additional quantiles beyond the quartiles.

📝 **[Letter Value Plot — The Easy to Understand Boxplot for Large Datasets](https://towardsdatascience.com/letter-value-plot-the-easy-to-understand-boxplot-for-large-datasets-12d6c1279c97)**
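
To make the idea of "letter values" concrete, the sketch below computes successive pairs of tail quantiles (fourths, eighths, sixteenths, ...) for a toy sample. It is only an approximation of the concept: `letter_values` is a hypothetical helper, the sample is synthetic, and seaborn's `boxenplot` uses its own rules for how many levels to draw.

```python
import numpy as np


def letter_values(values, depth=4):
    """Return a list of (tail_fraction, lower, upper) for successive letter values."""
    values = np.asarray(values, dtype=float)
    levels = []
    for i in range(2, depth + 2):
        p = 0.5 ** i   # 1/4, 1/8, 1/16, ...
        levels.append((p, np.quantile(values, p), np.quantile(values, 1 - p)))
    return levels


# Synthetic sample for illustration only (not the census data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=40, scale=10, size=10_000)

levels = letter_values(sample)
for p, lo, hi in levels:
    print(f"tail fraction {p:.5f}: {lo:.1f} to {hi:.1f}")
```

Each level covers a wider interval than the previous one while leaving half as much data in each tail, which is what the progressively narrower boxes in a boxenplot represent.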

In [None]:
index = 0
plt.figure(figsize=(20, 10))

for feature in df.select_dtypes('number').columns:
    index += 1
    plt.subplot(2, 3, index)
    sns.boxenplot(x=feature, data=df)
    plt.ticklabel_format(style='plain', axis='x')

In [None]:
df.describe()

We can see that the boxenplot gives us much more information about the tails of our dataset’s distribution. In the boxplot above, we can’t tell what the data looks like beyond certain points for the numerical features:

For example, are people older than about 75 years extreme values/outliers❓ 🤔 

In the boxplot, it is also quite hard to grasp what is going on beyond the box. There is a large gap between the 75th percentile (around the age of 47) and the maximum value, which can be as high as 80 after the filtering above.

For example, are people working more than about 52 hours per week extreme values/outliers❓ 🤔 

According to the boxplot, they seem to be extreme values, and some of them are candidates for being outliers. However, it is quite hard to decide what exactly they are. There is a large gap between the 75th percentile and the maximum value, and similarly between the 25th percentile and the minimum value.

With respect to working hours, the boxenplot, on the other hand, provides more insight into how the data is distributed beyond the quartiles. Contrary to what the boxplot suggests, we can assume that there are no extreme values. Moreover, we can see that the next box ends shortly after the 50 mark and is therefore relatively short compared to the following one. The following box, which begins at approximately 50 and ends at about 60, is quite stretched, indicating a higher variance in that range.

To wrap up, boxenplots can be more straightforward to interpret. The idea that thicker boxes represent a bigger share of the total population is easy to comprehend and facilitates discussion. Hence, letter-value plots are often a better choice when presenting data to your audience.

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Final Step to Make the Dataset Ready for ML Models</p>

<a id="9"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

### 1. Convert all features to numeric

**Convert the target feature (salary) to numeric (0 and 1) using the map function**

In [None]:
df["salary"] = df.salary.map({"<=50K": 0, ">50K": 1})
df["salary"]

In [None]:
df.salary.value_counts(dropna=False)
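
The `value_counts(dropna=False)` check above matters because `Series.map` with a dictionary returns `NaN` for any label that is not a key of the dictionary. A minimal sketch with made-up label variants (the stray period and leading space are hypothetical examples of dirty labels):

```python
import pandas as pd

# Hypothetical labels: the last two differ slightly from the dictionary keys
labels = pd.Series(["<=50K", ">50K", ">50K.", " <=50K"])

mapped = labels.map({"<=50K": 0, ">50K": 1})

# Labels that were not found in the dictionary end up as NaN
unmapped = labels[mapped.isna()]
print(mapped)
print(unmapped.tolist())
```

If `NaN` shows up in the counts, the labels need cleaning (for example with `str.strip()`) before mapping.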

**Convert the remaining categorical features to numeric using the get_dummies function**

In [None]:
df_dummy = pd.get_dummies(df, drop_first=True)
df_dummy

In [None]:
df.shape

In [None]:
df_dummy.shape
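
The difference between the two shapes comes from how `get_dummies` expands each categorical column. The sketch below uses a toy frame (hypothetical values, with column names chosen only for illustration) to show what `drop_first=True` does to the column count.

```python
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    "age": [25, 40, 60],
    "sex": ["Male", "Female", "Male"],
    "workclass": ["Private", "Gov", "Self"],
})

full = pd.get_dummies(toy)                       # one column per category
reduced = pd.get_dummies(toy, drop_first=True)   # first category of each feature dropped

print(list(full.columns))
print(list(reduced.columns))
```

Dropping the first level of each categorical feature removes a column that is fully determined by the others, which avoids perfectly collinear dummy columns.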

### 2. Take a look at the correlation between features by visualizing it

In [None]:
# Function that sets the text color of correlation values in a styled DataFrame

def color_correlation(val):
    """
    Takes a scalar and returns a CSS color property string chosen
    according to the strength of the correlation.
    """

    if (0.6 <= val < 0.99999) or (-0.99999 < val <= -0.6):
        color = 'red'      # strong correlation
    elif (0.3 <= val < 0.6) or (-0.6 < val <= -0.3):
        color = 'blue'     # moderate correlation
    elif -0.3 < val < 0.3:
        color = 'pink'     # weak correlation
    elif val == 1:
        color = 'cyan'     # perfect correlation (e.g. the diagonal)
    else:
        color = 'black'
    return 'color: %s' % color

# Styler.set_precision is no longer available in recent pandas releases; format(precision=2) replaces it.
# In newer pandas versions, Styler.map is the preferred name for Styler.applymap.
df_dummy.corr().style.applymap(color_correlation).format(precision=2)

In [None]:
plt.figure(figsize=(20, 20))

sns.heatmap(df_dummy.corr().round(2), annot=True, cmap="YlGnBu");

In [None]:
# Move the 'salary' column to the last position
target_column = df_dummy.pop('salary')

# insert(position, column_name, values): using the current column count as the
# position appends the column at the end without hard-coding an index
df_dummy.insert(len(df_dummy.columns), 'salary', target_column)

In [None]:
df_dummy.columns

In [None]:
plt.figure(figsize=(20, 20))

sns.heatmap(df_dummy.corr().round(2), annot=True, cmap="YlGnBu");

In [None]:
df_dummy_corr_salary = df_dummy.corr()[["salary"]].drop("salary").sort_values(by="salary", ascending=False)
df_dummy_corr_salary
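
When only the correlations with the target are needed, they can also be computed directly with `DataFrame.corrwith` instead of slicing the full correlation matrix. A minimal sketch on synthetic data (the frame `toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "age": rng.integers(17, 81, size=200),
    "hours_per_week": rng.integers(7, 71, size=200),
})
# A 0/1 target loosely tied to age, just to have something to correlate with
toy["salary"] = (toy["age"] + rng.normal(0, 15, size=200) > 50).astype(int)

# Approach used in the cell above: full matrix, then keep the 'salary' column
via_matrix = toy.corr()["salary"].drop("salary")

# Alternative: correlate every feature with the target only
via_corrwith = toy.drop(columns="salary").corrwith(toy["salary"])

print(via_corrwith.sort_values(ascending=False))
```

Both approaches give the same Pearson coefficients; `corrwith` simply skips the feature-to-feature pairs.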

In [None]:
plt.figure(figsize = (3, 9))
sns.heatmap(df_dummy_corr_salary, 
            annot=True, 
            cmap="YlGnBu", 
            vmin=-1, 
            vmax=1, 
            annot_kws={"size": 11}, 
            cbar_kws={'shrink': 1})
plt.show()

In [None]:
plt.figure(figsize = (10, 14))

ax = sns.barplot(x=df_dummy_corr_salary.values.flatten(), y=df_dummy_corr_salary.index, palette='plasma')

for container in ax.containers:
    ax.bar_label(container, fmt="%.2f", fontsize=12);

In [None]:
# Alternative Code Block for Plotting the Correlation of Dummied Variables with "salary" Feature

# plt.figure(figsize = (10, 14))

# ax = df_dummy.corr()["salary"].drop("salary").sort_values().plot(kind='barh', colormap='Paired')

# for container in ax.containers:
#     ax.bar_label(container, fmt="%.2f", fontsize=12);

In [None]:
df_dummy.to_csv("adult_dummy.csv", index=False)

In [None]:
pd.read_csv('adult_dummy.csv')
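
Reading the file back is a useful check that nothing was lost on export. The sketch below does the same round trip in memory on a toy frame (hypothetical values; `io.StringIO` stands in for the CSV file) and compares the result with the original.

```python
import io

import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    "age": [25, 40],
    "sex_Male": [True, False],
    "salary": [0, 1],
})

buffer = io.StringIO()           # in-memory stand-in for the CSV file
toy.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)

# Raises an AssertionError if values, column names or dtypes differ
pd.testing.assert_frame_equal(toy, restored)
print(restored.dtypes)
```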