# Task 0: Get Started

<div class="markdownViewer select-text  markdown-default markdown-table markdown-viewer markdown-viewer-project markdown-viewer-heading" role="none"><p>In this project, you’ll use Python’s <code>sidetable</code> library to analyze data and the <code>Bokeh</code> library to create meaningful visualizations, all inside a Jupyter Notebook environment. The dataset is provided in the <code>/usercode/dataset.data</code> file.</p>
<p>You’ll work in the <code>/usercode/solution.ipynb</code> file throughout the project. Each task in the project has one or more associated cells in the notebook that can be identified by their headings.</p>
</div>

# Task 1: Import Libraries

<div class="markdownViewer select-text  markdown-default markdown-table markdown-viewer markdown-viewer-project markdown-viewer-heading" role="none"><p>In this task, import all libraries that will be used in the project. Since the project involves data analysis and visualization, import the following libraries:</p>
<ul>
<li>Use the <code>numpy</code> library to handle all the numerical values in our dataset.</li>
<li>Use the <code>pandas</code> library to store and manipulate data.</li>
<li>Use the <code>sidetable</code> package to perform data analysis.</li>
<li>Use <code>output_notebook</code> and <code>show</code> from the <code>bokeh.io</code> module to render visualizations inline in the Jupyter Notebook.</li>
<li>Import <code>ColumnDataSource</code> from <code>bokeh.models</code> to pass data to the Bokeh graphs.</li>
<li>Import <code>figure</code> from <code>bokeh.plotting</code> for plot creation.</li>
<li>Configure the default output state for Bokeh plots using the <code>output_notebook()</code> command.</li>
</ul>
</div>

In [2]:
!pip install sidetable

Collecting sidetable
  Downloading sidetable-0.9.1-py3-none-any.whl (19 kB)
Installing collected packages: sidetable
Successfully installed sidetable-0.9.1


In [3]:
from numpy import nan
import pandas as pd
import sidetable
from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
output_notebook()

# Task 2: Load the Dataset

<p>This project uses the <a href="https://archive-beta.ics.uci.edu/dataset/2/adult" target="_blank" rel="noopener noreferrer">Census Income</a> dataset, a multivariate dataset for predicting the income class of adults. The copy used here has been modified to include the features of interest.</p>

In [4]:
# Load the dataset
data = pd.read_csv('dataset.data', sep=", ", engine='python')
# Display the first few rows of the dataset
print(data.head(5))

   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital-status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital-gain  capital-loss  hours-per-week native-country  class  
0          2174             0              40  United-States  <=50K  
1             0             0             

# Task 3: Explore the Dataset

<p>The dataset you’re working on in this project is a classification dataset that has 14 attributes and one target variable. In this task, gain insight into the data by printing a summary of the dataset.</p>

In [5]:
#Print a quick summary of the dataset
data.stb.counts()

                count  unique           most_freq  most_freq_count          least_freq  least_freq_count
sex             32561       2                Male            21790              Female             10771
class           32561       2               <=50K            24720                >50K              7841
race            32561       5               White            27816               Other               271
relationship    32561       6             Husband            13193      Other-relative               981
marital-status  32561       7  Married-civ-spouse            14976   Married-AF-spouse                23
workclass       32561       9             Private            22696        Never-worked                 7
occupation      32561      15      Prof-specialty             4140        Armed-Forces                 9
education       32561      16             HS-grad            10501           Preschool                51
education-num   32561      16                   9            10501                   1                51
native-country  32561      42       United-States            29170  Holand-Netherlands                 1
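
To make the columns of this summary concrete, the sketch below rebuilds four of them (`count`, `unique`, `most_freq`, `most_freq_count`) with plain pandas on a small made-up DataFrame. The `toy` frame and the `summary` name are illustrative assumptions. This only approximates what `stb.counts()` reports and says nothing about how sidetable computes it internally.

```python
import pandas as pd

# Hypothetical toy data standing in for the census DataFrame
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Male"],
    "race": ["White", "Black", "White", "Other"],
})

# One row per column of `toy`, one statistic per column of `summary`
summary = pd.DataFrame({
    "count": toy.count(),      # non-null values per column
    "unique": toy.nunique(),   # distinct values per column
    "most_freq": toy.apply(lambda col: col.value_counts().idxmax()),
    "most_freq_count": toy.apply(lambda col: col.value_counts().max()),
})

print(summary)
```

A column with few unique values and a dominant most frequent value, such as `native-country` in the real dataset, is a candidate for closer inspection before modeling.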


# Task 4: Treat Missing Values

<div class="markdownViewer select-text  markdown-default markdown-table markdown-viewer markdown-viewer-project markdown-viewer-heading" role="none"><p>The missing values in this dataset are represented by the <code>?</code> symbol. However, the missing-value functions of <code>pandas</code> only recognize values represented by <code>NaN</code>.</p>
<p>In this task, look for missing values in the dataset, indicated by <code>?</code>, and clean the dataset by performing the following operations:</p>
<ol>
<li>Replace every instance of <code>?</code> with <code>NaN</code> in the DataFrame.</li>
<li>Explore which features in the dataset have missing values, along with their counts.</li>
<li>Drop the rows with any missing values.</li>
<li>Confirm the cleaning of the dataset by printing the missing value count of each feature.</li>
</ol>
</div>

In [6]:
# Replace every ? with NaN in the dataset
data = data.replace('?', nan)
# View the missing values for each column;
# clip_0=True displays only the columns with a non-zero missing count
data.stb.missing(clip_0=True)

                missing  total   percent
occupation         1843  32561  5.660146
workclass          1836  32561  5.638647
native-country      583  32561  1.790486


In [7]:
# Drop the rows using the following command:
data.dropna(inplace=True)
# View the updated number of missing items for each column in the dataset using the following piece of code:
data.stb.missing()

                missing  total  percent
age                   0  30162      0.0
workclass             0  30162      0.0
fnlwgt                0  30162      0.0
education             0  30162      0.0
education-num         0  30162      0.0
marital-status        0  30162      0.0
occupation            0  30162      0.0
relationship          0  30162      0.0
race                  0  30162      0.0
sex                   0  30162      0.0
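
As an alternative to replacing `?` after loading, the marker can be mapped to `NaN` while the file is read. The sketch below assumes a four-row in-memory sample in the same comma-plus-space layout; the `sample`, `df`, `missing`, and `clean` names are made up for illustration, and plain pandas calls stand in for the sidetable summaries.

```python
import io
import pandas as pd

# Hypothetical four-row sample in the same ", "-separated layout as the dataset
sample = io.StringIO(
    "age, workclass, occupation, class\n"
    "39, State-gov, Adm-clerical, <=50K\n"
    "50, ?, Exec-managerial, <=50K\n"
    "38, Private, ?, >50K\n"
    "28, Private, Prof-specialty, >50K\n"
)

# skipinitialspace drops the blank after each comma; na_values maps ? to NaN
df = pd.read_csv(sample, skipinitialspace=True, na_values="?")

missing = df.isna().sum()   # missing-value count per column
clean = df.dropna()         # keep only the complete rows

print(missing[missing > 0])
print(len(clean))
```

Handling the marker at load time keeps the cleaning logic in one place, at the cost of hiding the raw `?` values from any later inspection.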


# Task 5: Inspect Class Distribution

<div class="markdownViewer select-text  markdown-default markdown-table markdown-viewer markdown-viewer-project markdown-viewer-heading" role="none"><p>An important check in any classification problem is class imbalance: if one class heavily outnumbers the others, the dataset needs to be treated accordingly.</p>
<p>In this task, analyze class distribution in the dataset for the variable <code>class</code> as follows:</p>
<ol>
<li>Display the count of all unique classes in the dataset using the <code>sidetable</code> package.</li>
<li>Use the <code>Bokeh</code> package to display the class distribution in a visual format.</li>
</ol>
</div>

In [8]:
# View the class distribution in the dataset.
# style=True returns a formatted table; Jupyter renders it only when
# this call is the last expression in a cell.
data.stb.freq(['class'], style=True)
distribution = data.stb.freq(['class'])
# Display specific columns of the output
distribution[["class", "count", "percent"]]

   class  count    percent
0  <=50K  22654  75.107751
1   >50K   7508  24.892249
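
The `percent` column and the degree of imbalance can be checked by hand from the two counts above. The following is a minimal sketch in plain Python; the `counts`, `percent`, and `imbalance_ratio` names are illustrative, and the majority-to-minority ratio is just one informal gauge of imbalance, not something sidetable reports.

```python
# Class counts taken from the output above
counts = {"<=50K": 22654, ">50K": 7508}

total = sum(counts.values())

# Share of each class, as in the `percent` column
percent = {label: 100 * n / total for label, n in counts.items()}

# Majority-to-minority ratio: a quick, informal gauge of imbalance
imbalance_ratio = max(counts.values()) / min(counts.values())

print(total)
for label, share in percent.items():
    print(f"{label}: {share:.2f}%")
print(f"ratio: {imbalance_ratio:.2f}")
```

Roughly three `<=50K` records exist for every `>50K` record, which is worth keeping in mind when interpreting any model trained on this data.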


# Task 6: Display Class Imbalance as Histogram

<div class="markdownViewer select-text  markdown-default markdown-table markdown-viewer markdown-viewer-project markdown-viewer-heading" role="none"><p>As a continuation of the previous task, view the class distribution of the dataset as a histogram.
To complete this task:</p>
<ol>
<li>Convert the <code>class</code> and <code>count</code> columns of the output created in the previous task to list format.</li>
<li>Initialize a <code>ColumnDataSource</code> object with a dictionary of <code>class</code> and <code>count</code>.
<ul>
<li>The <code>Bokeh</code> library uses the dictionary’s keys as column names, and values are used as the data values.</li>
</ul>
</li>
<li>Create a <code>figure</code> object for the <code>Bokeh</code> graph.</li>
<li>Create a vertical bar graph using the <code>vbar</code> renderer method.</li>
<li>Provide the labels to the x-axis and y-axis.</li>
<li>Display the graph.</li>
</ol>
</div>

In [9]:
from bokeh.transform import factor_cmap
from bokeh.palettes import Spectral6

#Convert the columns to list to be able to use for ColumnDataSource
output = distribution['class'].to_list()
count = distribution['count'].to_list()

#Create a ColumnDataSource object
source = ColumnDataSource(data=dict(output=output, count=count))
   
#Create a figure object
p = figure(x_range=output, toolbar_location=None, tools="hover", tooltips="@output: @count", title="Adult Income Graph")

# Use the vbar renderer function
p.vbar(x='output', top='count', width=0.9, source=source, line_color='white', fill_color=factor_cmap('output', palette=Spectral6, factors=output))
   
#Add the axes labels
p.xaxis.axis_label = 'Income Status'
p.yaxis.axis_label = 'No. of Adults'
   
#Display the graph
show(p)

# Task 7: Explore the Categorical Variables

<p>This dataset has a number of categorical variables, such as <code>race</code>, <code>relationship</code>, <code>marital-status</code>, <code>workclass</code>, <code>occupation</code>, <code>education</code>, and <code>education-num</code>. In this task, look at the unique values of each of these variables and how often each value occurs.</p>

In [10]:
race_dist = data.stb.freq(['race'])
rel_dist = data.stb.freq(['relationship'])
marital_dist = data.stb.freq(['marital-status'])
workclass_dist = data.stb.freq(['workclass'])
occupation_dist = data.stb.freq(['occupation'])
education_dist = data.stb.freq(['education'])
ed_num_dist = data.stb.freq(['education-num'])

#Print these variables to get a gist of data distribution
print(race_dist, '\n', rel_dist, '\n', marital_dist, '\n',workclass_dist, '\n',occupation_dist,'\n', education_dist, '\n', ed_num_dist)

                 race  count    percent  cumulative_count  cumulative_percent
0               White  25933  85.979046             25933           85.979046
1               Black   2817   9.339566             28750           95.318613
2  Asian-Pac-Islander    895   2.967310             29645           98.285923
3  Amer-Indian-Eskimo    286   0.948213             29931           99.234136
4               Other    231   0.765864             30162          100.000000 
      relationship  count    percent  cumulative_count  cumulative_percent
0         Husband  12463  41.320204             12463           41.320204
1   Not-in-family   7726  25.615012             20189           66.935216
2       Own-child   4466  14.806710             24655           81.741927
3       Unmarried   3212  10.649161             27867           92.391088
4            Wife   1406   4.661495             29273           97.052583
5  Other-relative    889   2.947417             30162          100.000000 
           

# Task 8: Visualize Data Distribution in Categorical Variables

<div class="markdownViewer select-text  markdown-default markdown-table markdown-viewer markdown-viewer-project markdown-viewer-heading" role="none"><p>In this task, create a grid of bar charts for all the variables that you explored in the previous task.
To complete this task:</p>
<ol>
<li>Import the <code>gridplot</code> function.</li>
<li>Create a bar chart of every categorical variable one by one, as you did in <strong>Task 6</strong>.</li>
<li>Create a grid plot object with two plots per row.</li>
<li>Display the grid plot.</li>
</ol>
</div>

In [11]:
#Import gridplot module
from bokeh.layouts import gridplot

#For Race
race_output = race_dist['race'].to_list()
race_count = race_dist['count'].to_list()
source1 = ColumnDataSource(data=dict(output=race_output, count=race_count))
plot1 = figure(x_range=race_output, toolbar_location=None, tools="hover", tooltips="@output: @count", title="Race's Categorical Analysis")
plot1.vbar(x='output', top='count', width=0.9, source=source1, line_color='white', color='paleturquoise')
plot1.xaxis.major_label_orientation = 1.1


#For Relationship
rel_output = rel_dist['relationship'].to_list()
rel_count = rel_dist['count'].to_list()
source2 = ColumnDataSource(data=dict(output=rel_output, count=rel_count))
plot2 = figure(x_range=rel_output, toolbar_location=None, tools="hover", tooltips="@output: @count", title="Relationship's Categorical Analysis")
plot2.vbar(x='output', top='count', width=0.9, source=source2, line_color='white', color='turquoise')
plot2.xaxis.major_label_orientation = 1.1


#Marital Status
marital_output = marital_dist['marital-status'].to_list()
marital_count = marital_dist['count'].to_list()
source3 = ColumnDataSource(data=dict(output=marital_output, count=marital_count))
plot3 = figure(x_range=marital_output, toolbar_location=None, tools="hover", tooltips="@output: @count", title="Marital Status's Categorical Analysis")
plot3.vbar(x='output', top='count', width=0.9, source=source3, line_color='white', color='mediumturquoise')
plot3.xaxis.major_label_orientation = 1.1


#Work Class
workclass_output = workclass_dist['workclass'].to_list()
workclass_count = workclass_dist['count'].to_list()
source4 = ColumnDataSource(data=dict(output=workclass_output, count=workclass_count))
plot4 = figure(x_range=workclass_output, toolbar_location=None, tools="hover", tooltips="@output: @count", title="Workclass's Categorical Analysis")
plot4.vbar(x='output', top='count', width=0.9, source=source4, line_color='white', color='darkturquoise')
plot4.xaxis.major_label_orientation = 1.1


#Occupations
occupation_output = occupation_dist['occupation'].to_list()
occupation_count = occupation_dist['count'].to_list()
source5 = ColumnDataSource(data=dict(output=occupation_output, count=occupation_count))
plot5 = figure(x_range=occupation_output, toolbar_location=None, tools="hover", tooltips="@output: @count", title="Occupation's Categorical Analysis")
plot5.vbar(x='output', top='count', width=0.9, source=source5, line_color='white', color='lightseagreen')
plot5.xaxis.major_label_orientation = 1.1


#Education Levels
education_output = education_dist['education'].to_list()
education_count = education_dist['count'].to_list()
source6 = ColumnDataSource(data=dict(output=education_output, count=education_count))
plot6 = figure(x_range=education_output, toolbar_location=None, tools="hover", tooltips="@output: @count", title="Education's Categorical Analysis")
plot6.vbar(x='output', top='count', width=0.9, source=source6, line_color='white', color='cadetblue')
plot6.xaxis.major_label_orientation = 1.1


#Educational Years in numbers
#Convert category names to str first
ed_num_dist['education-num'] = ed_num_dist['education-num'].apply(str)
ed_num_output = ed_num_dist['education-num'].to_list()
ed_num_count = ed_num_dist['count'].to_list()
source7 = ColumnDataSource(data=dict(output=ed_num_output, count=ed_num_count))
plot7 = figure(x_range=ed_num_output, toolbar_location=None, tools="hover", tooltips="@output: @count", title="Educational Years' Categorical Analysis")
plot7.vbar(x='output', top='count', width=0.9, source=source7, line_color='white', color='darkcyan')
plot7.xaxis.major_label_orientation = 1.1


#Create a grid plot of all
gridplot_output = gridplot([[plot1,plot2], [plot3,plot4], [plot5,plot6], [plot7]], toolbar_location=None)


#Display the plot
show(gridplot_output)
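
The seven blocks above differ only in the frequency table, the column name, the title, and the color, so they can be folded into one helper. The sketch below is one possible refactoring, not part of the original solution: `bar_chart`, `toy_race`, and `toy_years` are hypothetical names, and the two toy tables stand in for outputs such as `race_dist` and `ed_num_dist`.

```python
import pandas as pd
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure


def bar_chart(dist, column, title, color):
    """Build one bar chart from a frequency table with `column` and `count`."""
    # Categorical x_range factors must be strings (matters for education-num)
    labels = dist[column].astype(str).to_list()
    counts = dist["count"].to_list()
    source = ColumnDataSource(data=dict(output=labels, count=counts))
    fig = figure(x_range=labels, toolbar_location=None, tools="hover",
                 tooltips="@output: @count", title=title)
    fig.vbar(x="output", top="count", width=0.9, source=source,
             line_color="white", color=color)
    fig.xaxis.major_label_orientation = 1.1
    return fig


# Hypothetical frequency tables standing in for race_dist and ed_num_dist
toy_race = pd.DataFrame({"race": ["White", "Black"], "count": [3, 1]})
toy_years = pd.DataFrame({"education-num": [13, 9], "count": [2, 2]})

plots = [
    bar_chart(toy_race, "race", "Race", "paleturquoise"),
    bar_chart(toy_years, "education-num", "Educational Years", "darkcyan"),
]

# ncols=2 lays out a flat list of plots two per row
grid = gridplot(plots, ncols=2, toolbar_location=None)
```

In the notebook, `show(grid)` would then display the result. Passing a flat list with `ncols=2` avoids building the nested row lists by hand.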

# Task 9: Explore Trends via Continuous Variables

<p>Scatter plots are a natural choice for exploring correlation trends between continuous variables. In this task, draw a scatter plot of the <code>capital-gain</code> variable against the <code>hours-per-week</code> variable.</p>

In [12]:
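#Create a figure object
p = figure()

#Create the scatter plot: hours-per-week on the x-axis, capital-gain on the y-axis.
#scatter() draws circle markers by default; circle() with a size argument is deprecated in recent Bokeh releases.
p.scatter(data['hours-per-week'], data['capital-gain'], size=10, color='green')

#Label the axes
p.xaxis.axis_label = 'Hours per Week'
p.yaxis.axis_label = 'Capital Gain'

#Specify the number format
p.yaxis.formatter.use_scientific = False

#Show the plot.
show(p)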
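
A scatter plot shows the trend visually; a correlation coefficient puts a number on it. The sketch below computes the Pearson coefficient with pandas on four made-up rows. The `toy` frame and its values are assumptions for illustration only and say nothing about the actual relationship in the census data.

```python
import pandas as pd

# Hypothetical values standing in for two numeric columns of the dataset
toy = pd.DataFrame({
    "hours-per-week": [20, 40, 40, 60],
    "capital-gain": [0, 0, 2000, 4000],
})

# Pearson correlation coefficient between the two columns (pandas' default method)
r = toy["hours-per-week"].corr(toy["capital-gain"])

print(round(r, 3))
```

On the real data, the same call would be `data['hours-per-week'].corr(data['capital-gain'])`. A value near 0 suggests little linear relationship, while values near 1 or -1 suggest a strong one.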