**1. The problem statement**

In this kernel, I predict whether a person makes over 50K a year. To answer this question, I build Random Forest, Logistic Regression, Decision Tree and SVM classifiers with Python and Scikit-Learn.

**2. Import libraries**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
sns.set(style="whitegrid")

In [None]:
import warnings

warnings.filterwarnings('ignore')

**3. Import dataset**

In [None]:
data = '/kaggle/input/income-classification/income_evaluation.csv'

df = pd.read_csv(data)

**4. Exploratory data analysis**

Now, I will explore the data to gain insights about it.

In [None]:
# print the shape
print('The shape of the dataset : ', df.shape)

We can see that there are 32561 instances and 15 attributes in the data set.

In [None]:
df.head()

**Rename column names**

We can see that the dataset does not have clean column names. I will give the columns consistent names, with words separated by underscores, as follows:

In [None]:
col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
             'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']

df.columns = col_names

df.columns

In [None]:
df.dtypes

**Findings**
    
* We can see that the dataset contains 9 character variables and 6 numerical variables.
* income is the target variable.

In [None]:
df.describe()

* By default, the df.describe() command above summarizes only the numerical columns: count, mean, standard deviation, minimum, quartiles and maximum.

In [None]:
# check for missing values

df.isnull().sum()

**Types of variables**

* In this section, I segregate the dataset into categorical and numerical variables.

* There is a mixture of categorical and numerical variables in the dataset.
 
* Categorical variables have data type object. Numerical variables have data type int64.
 
* First of all, I will explore categorical variables.

In [None]:
categorical = [var for var in df.columns if df[var].dtype=='O']

print('There are {} categorical variables\n'.format(len(categorical)))

print('The categorical variables are :\n\n', categorical)

* There are 9 categorical variables in the dataset.

* The categorical variables are given by workclass, education, marital_status, occupation, relationship, race, sex, native_country and income.

In [None]:
df[categorical].head()

* Now, we will check the frequency distribution of categorical variables.

In [None]:
for var in categorical: 
    
    print(df[var].value_counts())

* Percentage of frequency distribution of values
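
The percentage view mentioned above can be sketched with `value_counts(normalize=True)`. This is a minimal example on a hypothetical toy column (`toy` is an assumption, not part of the dataset); in the notebook, the same call could be applied to each variable in `categorical`.

```python
import pandas as pd

# Hypothetical toy column standing in for one categorical variable
toy = pd.Series(["Private", "Private", "Private", "State-gov"], name="workclass")

# normalize=True returns relative frequencies instead of raw counts
shares = toy.value_counts(normalize=True)

# multiply by 100 to read the shares as percentages
print((shares * 100).round(1))
```
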

**Explore income target variable**

In [None]:
# check for missing values

df['income'].isnull().sum()

**We can see that there are no missing values in the income target variable.**

In [None]:
# view number of unique values

df['income'].nunique()

**There are 2 unique values in the income variable.**

In [None]:
# view the unique values

df['income'].unique()

**The two unique values are <=50K and >50K.**

In [None]:
# view the frequency distribution of values

df['income'].value_counts()

In [None]:
# view percentage of frequency distribution of values

df['income'].value_counts()/len(df)

In [None]:
# visualize frequency distribution of income variable

f, ax = plt.subplots(1, 2, figsize=(18, 8))

# pie chart of class shares on the left axis
df['income'].value_counts().plot.pie(explode=[0, 0], autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('Income Share')

# count plot on the right axis (pass ax explicitly so it does not depend on the current axes)
sns.countplot(x="income", data=df, palette="Set1", ax=ax[1])
ax[1].set_title("Frequency distribution of income variable")

plt.show()

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
ax = sns.countplot(x="income", hue="sex", data=df, palette="Set1")
ax.set_title("Frequency distribution of income variable wrt sex")
plt.show()

**We can see that there are more males than females in both income categories. These are counts, so they partly reflect that the dataset contains more males than females.**

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
ax = sns.countplot(x="income", hue="race", data=df, palette="Set1")
ax.set_title("Frequency distribution of income variable wrt race")
plt.show()

**We can see that there are more whites than non-whites in both income categories. These are counts, so they partly reflect that whites are the largest group in the dataset.**
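
Raw counts in a count plot depend on how many people each group contains, so it can help to compare the share of each group that falls in each income class. Below is a minimal sketch using `pd.crosstab` with row normalization. The data in `toy` is made up purely for illustration and does not come from the income dataset.

```python
import pandas as pd

# Made-up illustration data: 6 males (2 above 50K) and 2 females (1 above 50K)
toy = pd.DataFrame({
    "sex": ["Male"] * 6 + ["Female"] * 2,
    "income": [">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K"],
})

# raw counts per group and income class
counts = pd.crosstab(toy["sex"], toy["income"])

# normalize="index" divides each row by its total, giving within-group shares
rates = pd.crosstab(toy["sex"], toy["income"], normalize="index")

print(counts)
print(rates.round(3))
```

In this toy example, males have the higher count in the >50K class but the lower within-group share, which is the kind of difference a count plot alone cannot show. On the real data, the same two calls could be run with `df['sex']` or `df['race']` against `df['income']`.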

In [None]:
# check number of unique labels 

df.workclass.nunique()

In [None]:
# view frequency distribution of values

df.workclass.value_counts()

In [None]:
# replace '?' values in workclass variable with `NaN`

df['workclass'] = df['workclass'].replace(' ?', np.nan)

In [None]:
# again check the frequency distribution of values in workclass variable

df.workclass.value_counts()

In [None]:
f, ax = plt.subplots(figsize=(10, 6))
ax = df.workclass.value_counts().plot(kind="bar", color="green")
ax.set_title("Frequency distribution of workclass variable")
ax.set_xticklabels(df.workclass.value_counts().index, rotation=30)
plt.show()

We can see that there are many more private workers than workers in any other category.

In [None]:
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.countplot(x="workclass", hue="income", data=df, palette="Set1")
ax.set_title("Frequency distribution of workclass variable wrt income")
ax.legend(loc='upper right')
plt.show()

In [None]:
# check number of unique labels

df.occupation.nunique()

In [None]:
# view unique labels

df.occupation.unique()

In [None]:
# view frequency distribution of values

df.occupation.value_counts()

In [None]:
# replace '?' values in occupation variable with `NaN`

df['occupation'] = df['occupation'].replace(' ?', np.nan)

In [None]:
# again check the frequency distribution of values

df.occupation.value_counts()

In [None]:
# visualize frequency distribution of `occupation` variable

f, ax = plt.subplots(figsize=(12, 8))
# fix the bar order so that the tick labels set below match the bars
order = df.occupation.value_counts().index
ax = sns.countplot(x="occupation", data=df, order=order, palette="Set1")
ax.set_title("Frequency distribution of occupation variable")
ax.set_xticklabels(order, rotation=30)
plt.show()

In [None]:
# check number of unique labels

df.native_country.nunique()

In [None]:
# view unique labels 

df.native_country.unique()

In [None]:
# check frequency distribution of values

df.native_country.value_counts()

In [None]:
# replace '?' values in native_country variable with `NaN`

df['native_country'] = df['native_country'].replace(' ?', np.nan)

In [None]:
# again check the frequency distribution of values

df.native_country.value_counts()

In [None]:
# visualize frequency distribution of `native_country` variable

f, ax = plt.subplots(figsize=(16, 12))
# fix the bar order so that the tick labels set below match the bars
order = df.native_country.value_counts().index
ax = sns.countplot(x="native_country", data=df, order=order, palette="Set1")
ax.set_title("Frequency distribution of native_country variable")
ax.set_xticklabels(order, rotation=90)
plt.show()

In [None]:
df[categorical].isnull().sum()

In [None]:
# check for cardinality in categorical variables

for var in categorical:
    
    print(var, ' contains ', len(df[var].unique()), ' labels')

**We can see that the native_country column contains a relatively large number of labels compared to the other columns. I will check for cardinality after the train-test split.**

**Find numerical variables**

In [None]:
numerical = [var for var in df.columns if df[var].dtype!='O']

print('There are {} numerical variables\n'.format(len(numerical)))

print('The numerical variables are :\n\n', numerical)

In [None]:
df[numerical].head()

In [None]:
df[numerical].isnull().sum()

* We can see that there are no missing values in the numerical variables.

In [None]:
df['age'].nunique()

In [None]:
f, ax = plt.subplots(figsize=(10,8))
x = df['age']
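# histplot with kde=True replaces distplot, which is deprecated since seaborn 0.11
ax = sns.histplot(x, bins=10, kde=True, color='blue', ax=ax)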
ax.set_title("Distribution of age variable")
plt.show()

**Explore relationship between age and income variables**

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
ax = sns.boxplot(x="income", y="age", data=df)
ax.set_title("Visualize income wrt age variable")
plt.show()

* As expected, the boxplot shows that people in the >50K group tend to be older than people in the <=50K group.

In [None]:
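# plot correlation heatmap to find out correlations
# restrict to the numerical columns: in pandas 2.0+, DataFrame.corr() raises an error on string columns

df[numerical].corr().style.format("{:.4}").background_gradient(cmap=plt.get_cmap('coolwarm'), axis=1)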

**We can see that there is no strong correlation between the numerical variables.**

**Declare feature vector and target variable**

In [None]:
X = df.drop(['income'], axis=1)

y = df['income']

**Split data into separate training and test set**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

**Scikit-Learn (sklearn) → Commonly used open source machine learning library**

In [None]:
# check the shape of X_train and X_test

X_train.shape, X_test.shape


**I will do feature engineering on different variables.**

**First, I will show the categorical and numerical variables separately in the training set.**

In [None]:
categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']

categorical

In [None]:
numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']

numerical

In [None]:
# print fraction of missing values in the categorical variables in training set

X_train[categorical].isnull().mean()

In [None]:
# print categorical variables with missing data

for col in categorical:
    if X_train[col].isnull().mean()>0:
        print(col, (X_train[col].isnull().mean()))

In [None]:
# impute missing categorical values with the most frequent value from the training set
for col in ['workclass', 'occupation', 'native_country']:
    mode = X_train[col].mode()[0]
    X_train[col] = X_train[col].fillna(mode)
    X_test[col] = X_test[col].fillna(mode)
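
The cell above fills missing values in both sets with the mode computed on the training set, so no information from the test set leaks into preprocessing. The same idea can be sketched with scikit-learn's `SimpleImputer`, which is a swapped-in alternative and not what this notebook uses. The frames `train` and `test` are hypothetical toy data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for X_train and X_test
train = pd.DataFrame({"workclass": ["Private", "Private", np.nan, "State-gov"]}, dtype=object)
test = pd.DataFrame({"workclass": [np.nan, "Self-emp"]}, dtype=object)

# strategy="most_frequent" learns the mode of each column during fit
imputer = SimpleImputer(strategy="most_frequent")

# fit on the training data only, then reuse the learned mode for the test data
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(train_filled)
print(test_filled)
```
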

In [None]:
# check missing values in categorical variables in X_train

X_train[categorical].isnull().sum()

In [None]:
# check missing values in categorical variables in X_test

X_test[categorical].isnull().sum()

**As a final check, I will check for missing values in X_train and X_test.**

In [None]:
# check missing values in X_train

X_train.isnull().sum()

In [None]:
# check missing values in X_test

X_test.isnull().sum()

**We can see that there are no missing values in X_train and X_test.**

In [None]:
# preview categorical variables in X_train

X_train[categorical].head()

In [None]:
# import category encoders

import category_encoders as ce

**One-hot encoding represents each category of a categorical variable as its own binary (0/1) column.**
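
As a small illustration of the idea, here is a sketch using `pd.get_dummies` on a hypothetical toy column. Note that this is a pandas stand-in chosen for brevity; the notebook itself encodes with `category_encoders`, whose output column names differ.

```python
import pandas as pd

# Hypothetical toy frame with a single categorical column
toy = pd.DataFrame({"sex": ["Male", "Female", "Male"]})

# each category becomes its own 0/1 column
encoded = pd.get_dummies(toy, columns=["sex"], dtype=int)

print(encoded)
```

Each row has exactly one 1 across the new columns, marking the category that row belonged to.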

In [None]:
# encode categorical variables with one-hot encoding

encoder = ce.OneHotEncoder(cols=['workclass', 'education', 'marital_status', 'occupation', 'relationship', 
                                 'race', 'sex', 'native_country'])

X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)

In [None]:
X_train.head()

In [None]:
X_train.shape

**Similarly, I will take a look at the X_test set.**

In [None]:
X_test.head()

In [None]:
X_test.shape

In [None]:
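from sklearn.preprocessing import RobustScaler

cols = X_train.columns

scaler = RobustScaler()

# fit the scaler on the training set only, then apply it to both sets
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# restore the column names (pass cols directly; wrapping it in a list creates a MultiIndex)
X_train = pd.DataFrame(X_train, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)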


**We now have the X_train dataset ready to be fed into the Random Forest classifier.**
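
Before reaching the classifier, the features were scaled with `RobustScaler`. With its default settings, it subtracts the median of each column and divides by the interquartile range, which makes it less sensitive to outliers than mean and variance scaling. The sketch below checks this on a hypothetical toy column `x` by comparing the scaler's output with a manual calculation.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy column with one large outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# manual version: (x - median) / (75th percentile - 25th percentile)
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(scaled.ravel())
print(manual.ravel())
```
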

In [None]:
# import Random Forest classifier and accuracy metric
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# instantiate the classifier with 10 trees
# (set explicitly: the default n_estimators changed from 10 to 100 in scikit-learn 0.22)
rfc = RandomForestClassifier(n_estimators=10, random_state=0)

# fit the model
rfc.fit(X_train, y_train)

# predict the test set results
y_pred = rfc.predict(X_test)

# check accuracy score
print('Model accuracy score with 10 decision-trees : {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

In [None]:
# instantiate the classifier with n_estimators = 100
rfc_100 = RandomForestClassifier(n_estimators=100, random_state=0)

# fit the model to the training set
rfc_100.fit(X_train, y_train)

# predict on the test set
y_pred_100 = rfc_100.predict(X_test)

# check accuracy score
print('Model accuracy score with 100 decision-trees : {0:0.4f}'.format(accuracy_score(y_test, y_pred_100)))

**Accuracy generally improves as the number of decision trees in the model increases, though the gains level off.**
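
One way to look at the effect of forest size in isolation is to train forests of several sizes on the same split. The sketch below does this on synthetic data from `make_classification`, which is an assumption standing in for the income dataset. All names (`X_demo`, `y_demo`, `scores`) are hypothetical, and the resulting numbers say nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the income dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=12,
                                     n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=42)

# train forests of increasing size on the same split and record test accuracy
scores = {}
for n_trees in (1, 10, 100):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(X_tr, y_tr)
    scores[n_trees] = model.score(X_te, y_te)

for n_trees, acc in scores.items():
    print(f"{n_trees:>3} trees: accuracy = {acc:.4f}")
```

A single split is noisy, so a larger forest is not guaranteed to score higher on every run; repeating the comparison with cross-validation would give a steadier picture.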

In [None]:
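# max_iter is raised from the default of 100 to give the solver room to converge
logr = LogisticRegression(max_iter=1000)
logr.fit(X_train, y_train)
y_predict_lr = logr.predict(X_test)
# accuracy_score expects (y_true, y_pred)
acc_log = metrics.accuracy_score(y_test, y_predict_lr)
print('The accuracy of the Logistic Regression is', acc_log)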

In [None]:
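# fixed seed so that the result is reproducible
dt = DecisionTreeClassifier(random_state=0)
dt.fit(X_train, y_train)
y_predict_dt = dt.predict(X_test)
# accuracy_score expects (y_true, y_pred)
acc_dt = metrics.accuracy_score(y_test, y_predict_dt)
print('The accuracy of the Decision Tree is', acc_dt)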

In [None]:
sv = SVC()  # select the algorithm
sv.fit(X_train, y_train)  # train the algorithm on the training data and training labels
y_predict_svm = sv.predict(X_test)  # predict on the test data with the trained model
# accuracy_score expects (y_true, y_pred)
acc_svm = metrics.accuracy_score(y_test, y_predict_svm)
print('The accuracy of the SVM is:', acc_svm)

In [None]:
# Print the Confusion Matrix and slice it into four pieces

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

print('Confusion matrix\n\n', cm)

In [None]:
# visualize confusion matrix with seaborn heatmap

# scikit-learn puts actual classes in rows and predicted classes in columns, with labels in sorted order
cm_matrix = pd.DataFrame(data=cm, columns=['Predicted <=50K', 'Predicted >50K'], 
                                 index=['Actual <=50K', 'Actual >50K'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
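
The layout of the matrix is easy to misread, so here is a minimal sketch on hypothetical toy labels. In scikit-learn's `confusion_matrix`, entry `[i, j]` counts samples whose actual class is `i` and whose predicted class is `j`. Treating `>50K` as the positive class is an assumption made for this example.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels (not taken from the notebook's predictions)
y_true = ["<=50K", "<=50K", ">50K", ">50K", ">50K"]
y_hat = ["<=50K", ">50K", ">50K", ">50K", "<=50K"]

# rows are actual classes, columns are predicted classes, in the order given by labels
cm_toy = confusion_matrix(y_true, y_hat, labels=["<=50K", ">50K"])

# with ">50K" as the positive class, ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = cm_toy.ravel()

print(cm_toy)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```
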

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))