# PHASE 1 GRADED CHALLENGE 2

## <span style="color:red"> I. INTRODUCTION </span>

Welcome to my notebook. Here is a short introduction:

* Name : Alexander Prasetyo Christianto
* Age : 23
* Last Education Background : Electrical Engineering
* Occupation : Full Time Data Science Student Batch-001

### DATA DESCRIPTION

The data used in this task comes from the `ml_datasets` dataset in the Google BigQuery Public Datasets. It is retrieved by querying the `census_adult_income` table.

The columns taken from `census_adult_income` are `age`, `workclass`, `education`, `education_num`, `marital_status`, `occupation`, `relationship`, `race`, `sex`, `capital_gain`, `capital_loss`, `hours_per_week`, `native_country`, and `income_bracket`. Two filters are applied to the rows: `workclass` must not contain the `?` character, and `hours_per_week` must be less than 100.

Here is the dataset link: [Dataset Link](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=ml_datasets&t=census_adult_income&page=table)

Here is the data dictionary :

- age : the age of a person
- workclass : type of work agency
- education : the last education grade
- education_num : estimated years of education completed, based on the value of the education field
- marital_status : the state of being married or not married
- occupation : current job
- relationship : status in the family
- race : race of a person
- sex : gender of a person
- capital_gain : profit earned on the sale of an asset which has increased in value over the holding period
- capital_loss : the loss incurred when a capital asset, such as an investment or real estate, decreases in value
- hours_per_week : number of hours worked per week
- native_country : one's country of origin
- income_bracket : the income category a person falls into, defined by upper and lower limits

### OBJECTIVES

The objective of this task is to build classification models using Logistic Regression and Support Vector Machine that predict `income_bracket` from the dataset described above.

## <span style="color:red"> II. IMPORT LIBRARIES </span>

The following libraries are used for analysis, processing, modeling, and model evaluation.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import sklearn

## <span style="color:red"> III. DATA LOADING </span>

Here is the SQL Query that I used to obtain the data:

~~~~sql
SELECT age, workclass, education, education_num, marital_status, occupation, relationship, race, sex, capital_gain, capital_loss, hours_per_week, native_country, income_bracket FROM `bigquery-public-data.ml_datasets.census_adult_income` 
WHERE workclass NOT LIKE '%?%' AND hours_per_week < 100
LIMIT 1999
~~~~
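For readers who already have the table in a DataFrame, the same two conditions can be sketched in pandas. This is only an illustration: the small `raw` frame below contains made-up rows, not rows from the actual table.

```python
import pandas as pd

# Hypothetical sample rows, used only to illustrate the filter
raw = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "Private"],
    "hours_per_week": [40, 35, 100, 99],
})

# Same conditions as the WHERE clause:
# workclass does not contain '?', and hours_per_week is below 100
no_question_mark = ~raw["workclass"].str.contains("?", regex=False)
under_100_hours = raw["hours_per_week"] < 100
filtered = raw[no_question_mark & under_100_hours]

print(filtered)
```

Passing `regex=False` matters here because `?` is a special character in regular expressions.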

To begin, I load the query result into my workspace and store it in a variable named `data`.

In [2]:
data = pd.read_csv('h8dsft_P1G2_AlexanderPrasetyoC.csv')

In [3]:
data.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,Private,9th,5,Married-civ-spouse,Other-service,Wife,Black,Female,3411,0,34,United-States,<=50K
1,72,Private,9th,5,Married-civ-spouse,Exec-managerial,Wife,Asian-Pac-Islander,Female,0,0,48,United-States,>50K
2,45,Private,9th,5,Married-civ-spouse,Machine-op-inspct,Wife,White,Female,0,0,40,United-States,>50K
3,31,Private,9th,5,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
4,55,Private,9th,5,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,23,United-States,<=50K


In [4]:
data.tail()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
1994,43,Private,11th,7,Married-civ-spouse,Sales,Husband,White,Male,0,0,60,United-States,<=50K
1995,40,Private,11th,7,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,40,United-States,<=50K
1996,47,Private,11th,7,Married-civ-spouse,Adm-clerical,Husband,White,Male,0,1485,58,United-States,<=50K
1997,38,Self-emp-not-inc,11th,7,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K
1998,32,Private,11th,7,Married-civ-spouse,Adm-clerical,Husband,White,Male,0,0,45,United-States,<=50K


In [5]:
# Duplication of original dataset for safety purposes

data_duplicate = data.copy()

I made a copy of the original dataset for safety. All processing from here on starts from the `data` variable, so if I need the untouched original again I can use `data_duplicate` directly.
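A minimal sketch of why `.copy()` is enough for this purpose, using a hypothetical two-row frame rather than the real data: changes made to the working frame do not reach the copy.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
original = pd.DataFrame({"age": [39, 72]})
backup = original.copy()          # DataFrame.copy() makes a deep copy by default

# Modify the working frame only
original.loc[0, "age"] = 0

print(original.loc[0, "age"])     # value in the working frame after the change
print(backup.loc[0, "age"])       # value in the backup, unaffected by the change
```

Plain assignment (`backup = original`) would not give this protection, because both names would refer to the same object.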

## <span style="color:red"> IV. DATA CLEANING </span>

Checking and cleaning the data is a necessary step before starting any analysis, because it prevents unexpected problems later during processing and modeling. In this section I check for null values, handle them if any are present, and finally check for duplicated rows.

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999 entries, 0 to 1998
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             1999 non-null   int64 
 1   workclass       1999 non-null   object
 2   education       1999 non-null   object
 3   education_num   1999 non-null   int64 
 4   marital_status  1999 non-null   object
 5   occupation      1999 non-null   object
 6   relationship    1999 non-null   object
 7   race            1999 non-null   object
 8   sex             1999 non-null   object
 9   capital_gain    1999 non-null   int64 
 10  capital_loss    1999 non-null   int64 
 11  hours_per_week  1999 non-null   int64 
 12  native_country  1999 non-null   object
 13  income_bracket  1999 non-null   object
dtypes: int64(5), object(9)
memory usage: 218.8+ KB


The command above gives general information about the dataset, which contains 14 columns and 1999 rows. Every column reports 1999 non-null entries, which matches the number of rows, so the dataset appears to have no null values. To make sure, I run one more check.

In [7]:
data.isnull().sum()

age               0
workclass         0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income_bracket    0
dtype: int64

This confirms that there are no null values in my dataset.

In [9]:
data.duplicated().sum()

62

There are 62 duplicated entries in my dataset. Before deciding whether to drop them, I take a look at the duplicated rows.

In [13]:
duplicateRows = data[data.duplicated()]
duplicateRows

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
112,60,Private,7th-8th,4,Married-civ-spouse,Machine-op-inspct,Wife,White,Female,0,0,40,United-States,<=50K
245,39,Private,HS-grad,9,Married-civ-spouse,Adm-clerical,Wife,White,Female,0,0,40,United-States,>50K
284,25,Private,HS-grad,9,Married-civ-spouse,Adm-clerical,Wife,White,Female,0,0,40,United-States,<=50K
288,24,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Wife,White,Female,0,0,40,United-States,<=50K
302,43,Private,HS-grad,9,Married-civ-spouse,Adm-clerical,Wife,White,Female,0,0,40,United-States,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1871,31,Private,10th,6,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,<=50K
1906,34,Private,10th,6,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K
1908,30,Private,10th,6,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K
1917,36,Private,10th,6,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K


Normally I would drop duplicated entries like these. However, this is census data, and the dataset has no unique identifier, so two different people can legitimately share the same values in every column. I therefore assume that each row represents a distinct person, and I do not drop the duplicated entries.
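The duplicate check can be sketched on a small example. The `toy` frame below is made up for illustration. It shows the difference between counting repeats, viewing every copy of a repeated row, and the dropping option that I chose not to use.

```python
import pandas as pd

# Hypothetical rows: three people share identical values, one does not
toy = pd.DataFrame({
    "age": [30, 30, 41, 30],
    "occupation": ["Craft-repair", "Craft-repair", "Sales", "Craft-repair"],
})

# Counts only the repeats after the first occurrence
n_repeats = toy.duplicated().sum()

# keep=False marks every copy, including the first occurrence
all_copies = toy[toy.duplicated(keep=False)]

# The option not taken in this notebook: keep one row per unique combination
deduped = toy.drop_duplicates()

print(n_repeats, len(all_copies), len(deduped))
```

Note that `data.duplicated().sum()` counts repeats only, so the number of rows involved in duplication is larger than the count it reports.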

## <span style="color:red"> V. EXPLORATORY DATA ANALYSIS (EDA) </span>

As a reminder, the objective is to build Logistic Regression and SVM models with `income_bracket` as the target. In this section, I explore the features contained in the dataset.

In [18]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,1999.0,41.493247,11.938314,17.0,32.0,40.0,50.0,90.0
education_num,1999.0,9.181591,3.000758,1.0,6.0,9.0,12.0,16.0
capital_gain,1999.0,1429.310155,8445.784759,0.0,0.0,0.0,0.0,99999.0
capital_loss,1999.0,101.638319,430.426136,0.0,0.0,0.0,0.0,2415.0
hours_per_week,1999.0,38.856428,11.290898,1.0,36.0,40.0,40.0,99.0


Here are the observations from the summary above:

* Ages in the dataset range from 17 to 90 years.
* The years spent on education (`education_num`) range from 1 to 16, with a median of 9.
* The minimum, median, and even the 75th percentile of capital gain are 0, while the maximum is 99999. A capital gain of exactly 99999 looks odd, so my assumption is that it is an incorrect input in the dataset.
* The number of hours worked per week ranges from 1 to 99, with a mean of about 38.9 hours.
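One way to examine the suspicious capital gain value is to count how many rows sit exactly at the maximum and how many are zero. The sketch below uses a made-up `toy` frame, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the check
toy = pd.DataFrame({"capital_gain": [0, 0, 3411, 99999, 99999]})

max_gain = toy["capital_gain"].max()

# How many rows sit exactly at the maximum value
n_at_max = (toy["capital_gain"] == max_gain).sum()

# Fraction of rows with no capital gain at all
share_zero = (toy["capital_gain"] == 0).mean()

print(max_gain, n_at_max, share_zero)
```

If many rows share the exact maximum, that would suggest a placeholder or capped value rather than a single typing mistake, but this remains an assumption until checked against the real data.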

In [16]:
num_cols = list(data.select_dtypes(exclude='object').columns)
num_cols

['age', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']
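The categorical columns can be collected in the same way. The sketch below is an alternative that selects by `"number"` instead of by `object`, assuming a small made-up frame. The name `cat_cols` is my own choice and does not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical frame with one numerical and two categorical columns
toy = pd.DataFrame({
    "age": [39, 72],
    "workclass": ["Private", "Private"],
    "income_bracket": ["<=50K", ">50K"],
})

# "number" matches all numeric dtypes, whichever way strings are stored
num_cols = list(toy.select_dtypes(include="number").columns)
cat_cols = list(toy.select_dtypes(exclude="number").columns)

print(num_cols)
print(cat_cols)
```

With this split, the target `income_bracket` ends up in `cat_cols`, so it would need to be separated from the features before modeling.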

## VI. DATA PREPROCESSING

## VII. MODEL DEFINITION

## VIII. MODEL TRAINING

## IX. MODEL EVALUATION

## X. MODEL INFERENCE

## XI. CONCLUSION