
# Income Classification Model
In this project we evaluate a census dataset and predict whether the income of a person with the given features is greater than $50K or not.
We use the **Logistic Regression** algorithm. The dataset is publicly available; see the References section.

## Importing all the required libraries
We import the Python libraries needed to load, process, and model the data.

In [23]:
import numpy as np  # array operations used for the calculations
import pandas as pd # the data is not preprocessed, so pandas is used to process it
from sklearn.model_selection import train_test_split # splits the data into training and testing sets
from sklearn.linear_model import LogisticRegression # the logistic regression classifier
from sklearn.metrics import accuracy_score # checks the accuracy of the model
import matplotlib.pyplot as plt # plots for analysis
import seaborn as sns # visualisation of data

In [24]:
# Loading the dataset
income_data = pd.read_csv('/content/adult.csv')

In [25]:
# Overview of the data using the head and tail functions
income_data.head()

Unnamed: 0,Age,Worclass,Unique ID,Education,Education Number,Marital-status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per wee,Native Country,Income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Blac,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Blac,Female,0,0,40,Cuba,0


In [26]:
income_data.tail()

Unnamed: 0,Age,Worclass,Unique ID,Education,Education Number,Marital-status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per wee,Native Country,Income
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,0
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,1
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,0
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,0
32560,52,Self-emp-inc,287927,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,1


The dataset has 14 **features** and an Income label, which indicates:
- 0 - Income of 50K or less
- 1 - Income of more than 50K

In [27]:
# Inspect the non-null count and dtype of each column using the info function
income_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Age               32561 non-null  int64 
 1   Worclass          32561 non-null  object
 2   Unique ID         32561 non-null  int64 
 3   Education         32561 non-null  object
 4   Education Number  32561 non-null  int64 
 5   Marital-status    32561 non-null  object
 6   Occupation        32561 non-null  object
 7   Relationship      32561 non-null  object
 8   Race              32561 non-null  object
 9   Sex               32561 non-null  object
 10  Capital Gain      32561 non-null  int64 
 11  Capital Loss      32561 non-null  int64 
 12  Hours per wee     32561 non-null  int64 
 13  Native Country    32561 non-null  object
 14  Income            32561 non-null  int64 
dtypes: int64(7), object(8)
memory usage: 3.7+ MB


## Data Preprocessing


In [28]:
# Checking the number of null values in each column
income_data.isnull().sum()

Age                 0
Worclass            0
Unique ID           0
Education           0
Education Number    0
Marital-status      0
Occupation          0
Relationship        0
Race                0
Sex                 0
Capital Gain        0
Capital Loss        0
Hours per wee       0
Native Country      0
Income              0
dtype: int64

In [29]:
# If the dataset had null values, dropna would return a copy without those rows; the result is not assigned here, so income_data itself is unchanged
income_data.dropna(how='any')

Unnamed: 0,Age,Worclass,Unique ID,Education,Education Number,Marital-status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per wee,Native Country,Income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Blac,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Blac,Female,0,0,40,Cuba,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,0
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,1
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,0
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,0
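
One caveat with the null check above: `isnull` only detects real missing values (`NaN`/`None`). If a file encodes missing entries as a placeholder string, those rows pass the check unnoticed. The sketch below uses a toy frame and assumes a hypothetical `"?"` placeholder; whether `adult.csv` as loaded here contains such a marker is not established by the outputs shown, so treat this as an illustration rather than a required step.

```python
import numpy as np
import pandas as pd

# Toy frame; the values and the "?" placeholder are assumptions for illustration.
toy = pd.DataFrame({
    "Occupation": ["Sales", "?", "Tech-support"],
    "Age": [39, 50, 38],
})

# Placeholder strings are ordinary values, so no nulls are reported.
null_count = int(toy.isnull().sum().sum())
print("nulls before replace:", null_count)

# Convert the placeholder to NaN, then drop incomplete rows.
# dropna returns a new DataFrame, so the result must be assigned.
cleaned = toy.replace("?", np.nan).dropna(how="any")
print(cleaned)
```

Note that `toy` still has all three rows afterwards; only `cleaned` has the incomplete row removed.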


In [30]:
# Now let us see the count of labels
income_data['Income'].value_counts()

0    24720
1     7841
Name: Income, dtype: int64

The classes are split roughly 76:24 (24,720 vs. 7,841 records). This is imbalanced but still workable, so we can proceed with this dataset.
Because the dataset contains both categorical and numerical columns, and scikit-learn's logistic regression accepts only numerical input, we need to convert the categorical data to numerical values.
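
The class shares can be checked directly from the counts printed by `value_counts()`. The following is a minimal sketch in plain Python; the variable names are illustrative and not part of the notebook. It also computes the accuracy of a trivial model that always predicts the majority class, which is a useful reference point for the accuracy scores reported later.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 24720, 1: 7841}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any trained model should be judged against this baseline rather than against 50%.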


In [31]:
# Education and Education Number carry the same information, so we drop the categorical column Education
income_data = income_data.drop('Education',axis = 1)


In [32]:
# Dividing the dataset into Categorical and Numerical values
numeric_data = income_data.select_dtypes(include=[np.number])
categorical_data = income_data.select_dtypes(exclude=[np.number])

In [33]:
# Let us view each dataset one by one
numeric_data

Unnamed: 0,Age,Unique ID,Education Number,Capital Gain,Capital Loss,Hours per wee,Income
0,39,77516,13,2174,0,40,0
1,50,83311,13,0,0,13,0
2,38,215646,9,0,0,40,0
3,53,234721,7,0,0,40,0
4,28,338409,13,0,0,40,0
...,...,...,...,...,...,...,...
32556,27,257302,12,0,0,38,0
32557,40,154374,9,0,0,40,1
32558,58,151910,9,0,0,40,0
32559,22,201490,9,0,0,20,0


In [34]:
categorical_data

Unnamed: 0,Worclass,Marital-status,Occupation,Relationship,Race,Sex,Native Country
0,State-gov,Never-married,Adm-clerical,Not-in-family,White,Male,United-States
1,Self-emp-not-inc,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States
2,Private,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States
3,Private,Married-civ-spouse,Handlers-cleaners,Husband,Blac,Male,United-States
4,Private,Married-civ-spouse,Prof-specialty,Wife,Blac,Female,Cuba
...,...,...,...,...,...,...,...
32556,Private,Married-civ-spouse,Tech-support,Wife,White,Female,United-States
32557,Private,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,United-States
32558,Private,Widowed,Adm-clerical,Unmarried,White,Female,United-States
32559,Private,Never-married,Adm-clerical,Own-child,White,Male,United-States


In [35]:
# Importing warnings and converting the categorical columns to numerical ones using get_dummies
import warnings
warnings.filterwarnings('ignore') # Suppress all warnings
# Using one-hot encoding to turn the categorical data into numerical values
categorical_data_updated = pd.get_dummies(categorical_data, columns = ['Worclass', 'Marital-status','Occupation','Relationship','Race','Sex','Native Country'])



In [36]:
categorical_data_updated

Unnamed: 0,Worclass_,Worclass_ Federal-gov,Worclass_ Local-gov,Worclass_ Never-wored,Worclass_ Private,Worclass_ Self-emp-inc,Worclass_ Self-emp-not-inc,Worclass_ State-gov,Worclass_ Without-pay,Marital-status_ Divorced,...,Native Country_ Portugal,Native Country_ Puerto-Rico,Native Country_ Scotland,Native Country_ South,Native Country_ Taiwan,Native Country_ Thailand,Native Country_ Trinadad&Tobago,Native Country_ United-States,Native Country_ Vietnam,Native Country_ Yugoslavia
0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32557,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32558,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32559,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
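
To see what one-hot encoding does on a small scale, here is a sketch on a toy frame (the rows are illustrative, not taken from `adult.csv`). It also shows the `drop_first=True` option, which the notebook does not use: it drops one level per feature, so the remaining dummy columns are not perfectly collinear. Whether that matters for a regularised logistic regression is a judgment call.

```python
import pandas as pd

# Toy frame with two categorical columns; values are illustrative only.
toy = pd.DataFrame({
    "Sex": ["Male", "Female", "Male"],
    "Relationship": ["Husband", "Wife", "Husband"],
})

# Full one-hot encoding: one 0/1 column per category level.
full = pd.get_dummies(toy, columns=["Sex", "Relationship"], dtype=int)
print(full)

# drop_first=True removes the first (alphabetically sorted) level of each feature.
reduced = pd.get_dummies(toy, columns=["Sex", "Relationship"], dtype=int, drop_first=True)
print(reduced)
```

Each original column expands into as many columns as it has distinct values, which is how the notebook's feature count grows to 92.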


In [37]:
income_data_updated = pd.concat([categorical_data_updated,numeric_data], axis = 1)
income_data_updated

Unnamed: 0,Worclass_,Worclass_ Federal-gov,Worclass_ Local-gov,Worclass_ Never-wored,Worclass_ Private,Worclass_ Self-emp-inc,Worclass_ Self-emp-not-inc,Worclass_ State-gov,Worclass_ Without-pay,Marital-status_ Divorced,...,Native Country_ United-States,Native Country_ Vietnam,Native Country_ Yugoslavia,Age,Unique ID,Education Number,Capital Gain,Capital Loss,Hours per wee,Income
0,0,0,0,0,0,0,0,1,0,0,...,1,0,0,39,77516,13,2174,0,40,0
1,0,0,0,0,0,0,1,0,0,0,...,1,0,0,50,83311,13,0,0,13,0
2,0,0,0,0,1,0,0,0,0,1,...,1,0,0,38,215646,9,0,0,40,0
3,0,0,0,0,1,0,0,0,0,0,...,1,0,0,53,234721,7,0,0,40,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,28,338409,13,0,0,40,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,0,0,0,0,1,0,0,0,0,0,...,1,0,0,27,257302,12,0,0,38,0
32557,0,0,0,0,1,0,0,0,0,0,...,1,0,0,40,154374,9,0,0,40,1
32558,0,0,0,0,1,0,0,0,0,0,...,1,0,0,58,151910,9,0,0,40,0
32559,0,0,0,0,1,0,0,0,0,0,...,1,0,0,22,201490,9,0,0,20,0


In [38]:
# Splitting the dataset into features (X) and labels (Y)
X = income_data_updated.drop(columns = 'Income')
Y = income_data_updated['Income']

In [39]:
# Feature data
X

Unnamed: 0,Worclass_,Worclass_ Federal-gov,Worclass_ Local-gov,Worclass_ Never-wored,Worclass_ Private,Worclass_ Self-emp-inc,Worclass_ Self-emp-not-inc,Worclass_ State-gov,Worclass_ Without-pay,Marital-status_ Divorced,...,Native Country_ Trinadad&Tobago,Native Country_ United-States,Native Country_ Vietnam,Native Country_ Yugoslavia,Age,Unique ID,Education Number,Capital Gain,Capital Loss,Hours per wee
0,0,0,0,0,0,0,0,1,0,0,...,0,1,0,0,39,77516,13,2174,0,40
1,0,0,0,0,0,0,1,0,0,0,...,0,1,0,0,50,83311,13,0,0,13
2,0,0,0,0,1,0,0,0,0,1,...,0,1,0,0,38,215646,9,0,0,40
3,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,53,234721,7,0,0,40
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,28,338409,13,0,0,40
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,27,257302,12,0,0,38
32557,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,40,154374,9,0,0,40
32558,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,58,151910,9,0,0,40
32559,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,22,201490,9,0,0,20


In [40]:
# Label
Y

0        0
1        0
2        0
3        0
4        0
        ..
32556    0
32557    1
32558    0
32559    0
32560    1
Name: Income, Length: 32561, dtype: int64

### Splitting the Dataset
Now we split the whole dataset into a training set and a testing set, so that the model is trained on one part and evaluated on the other.

In [41]:
# using train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.2, stratify = Y, random_state = 3)

In [42]:
# Checking the shape of the data
print(X.shape,X_train.shape,X_test.shape)

(32561, 92) (26048, 92) (6513, 92)
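
The `stratify = Y` argument keeps the class proportions the same in the training and testing sets. A minimal sketch on synthetic labels (the data below is made up to mimic a 76:24 imbalance; it is not the census data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 76:24 class imbalance, for illustration only.
y = np.array([0] * 760 + [1] * 240)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=3
)

# With stratification, the share of positives is preserved in both splits.
print("positive share, full :", y.mean())
print("positive share, train:", y_tr.mean())
print("positive share, test :", y_te.mean())
```

Without `stratify`, the shares would only match approximately and could drift further on small datasets.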


## Training the model
We will train the model using the Logistic Regression algorithm.

In [43]:
# Creating the model
model = LogisticRegression()

In [44]:
# Training the model
model.fit(X_train,Y_train)
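
The features here span very different scales: columns such as Unique ID and Capital Gain reach the tens or hundreds of thousands, while the dummy columns are 0/1. With unscaled inputs the default solver may stop at its iteration limit before converging, and because warnings were suppressed earlier, any such warning would not be shown. One common remedy, sketched below on synthetic data, is to standardise the features in a pipeline. This is a suggestion under those assumptions, not the notebook's method, and `max_iter=1000` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: one large-scale column and one 0/1 column.
large = rng.normal(200_000, 50_000, size=500)
flag = rng.integers(0, 2, size=500)
X_demo = np.column_stack([large, flag])
y_demo = flag  # the label depends only on the 0/1 column

# StandardScaler rescales each column to zero mean and unit variance
# before the classifier sees it.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

demo_accuracy = scaled_model.score(X_demo, y_demo)
print("training accuracy on synthetic data:", demo_accuracy)
```

Fitting the scaler inside the pipeline means it learns its statistics from the training data only, which avoids leaking information from the test set.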

Now that the model is trained on the training data, let us examine its performance using the accuracy score.

In [45]:
# Predicting & checking accuracy on the training data
X_train_predic = model.predict(X_train)
training_accuracy = accuracy_score(Y_train, X_train_predic) # arguments are (y_true, y_pred)
print("The Accuracy Score on the training data using model is :",training_accuracy)

The Accuracy Score on the training data using model is : 0.7979499385749386


In [46]:
# Predicting & checking accuracy on the testing data
X_test_predic = model.predict(X_test)
test_accuracy = accuracy_score(Y_test, X_test_predic) # arguments are (y_true, y_pred)
print("The Accuracy Score on the testing data using model is :",test_accuracy)

The Accuracy Score on the testing data using model is : 0.797328420082911
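
Accuracy alone can be misleading on imbalanced data: about 76% of the records belong to class 0, so a model that always predicts 0 would already score about 0.76. A confusion matrix, precision, and recall show how well the minority class is handled. The sketch below uses toy labels to illustrate the calls; in the notebook the same functions could be applied to `Y_test` and `X_test_predic`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels, illustrative only (not the notebook's predictions).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print(cm)
print("precision:", precision)
print("recall   :", recall)
```

In this toy case accuracy is 0.6, yet only one of the four positive records is found, which the recall of 0.25 makes visible.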


## Prediction
We will take one record from the dataset as input and classify its income.

In [56]:
input_row = X.loc[2] # Taking an input row from the given data (named input_row to avoid shadowing the built-in input)
print(input_row)
input_data = np.asarray(input_row) # Converting to a NumPy array
input_data_reshape = input_data.reshape(1,-1) # Reshaping to a single instance
prediction = model.predict(input_data_reshape) # Predicting
print(prediction) # Output: 0 means 50K or less, 1 means more than 50K

Worclass_                     0
Worclass_ Federal-gov         0
Worclass_ Local-gov           0
Worclass_ Never-wored         0
Worclass_ Private             1
                          ...  
Unique ID                215646
Education Number              9
Capital Gain                  0
Capital Loss                  0
Hours per wee                40
Name: 2, Length: 92, dtype: int64
[0]
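
An alternative to converting the row to a NumPy array and reshaping it is to select it with double brackets, which keeps it as a one-row DataFrame with the same column names used during fitting. `predict_proba` then gives the estimated class probabilities instead of only the hard label. The sketch below uses a small toy model; the columns and values are illustrative, not the notebook's 92 features.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy training data; column names and values are illustrative.
X_toy = pd.DataFrame({
    "Age": [22, 25, 47, 52, 46, 56],
    "Education Number": [9, 10, 13, 14, 13, 16],
})
y_toy = [0, 0, 1, 1, 1, 1]

clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Double brackets return a one-row DataFrame, so no reshape is needed
# and the feature names match those seen during fit.
row = X_toy.loc[[2]]
label = clf.predict(row)
proba = clf.predict_proba(row)  # columns follow clf.classes_

print("predicted label:", label)
print("class probabilities:", proba)
```

The probability columns are ordered as in `clf.classes_`, so the second value is the estimated probability of class 1.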


## References

*  Becker, Barry and Kohavi, Ronny. (1996). Adult. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.

