# Dataset about adults

### About the dataset

This dataset contains demographic and employment information about adults. The goal is to predict whether a person earns more than 50K a year.

### Attributes

- `age`: continuous.
- `workclass`: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- `fnlwgt`: continuous.
- `education`: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. We drop this attribute because it is redundant with `education-num`.
- `education-num`: continuous.
- `marital-status`: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- `occupation`: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- `relationship`: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- `race`: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- `sex`: Female, Male.
- `capital-gain`: continuous.
- `capital-loss`: continuous.
- `hours-per-week`: continuous.
- `native-country`: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.



In [182]:
# import statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer


def load_data(file_name, name_of_columns):
    """
    Load data from a file into a pandas dataframe
    :param file_name: name of file to load
    :param name_of_columns: column names to assign, since the file has no header row
    :return: pandas dataframe with missing values as NaN and the education column dropped
    """
    dataset = pd.read_csv(file_name, header=None, names=name_of_columns)

    # replace the cells containing ' ?' with NaN
    # (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
    dataset = dataset.replace(' ?', np.nan)

    # replace cells that are empty or contain only whitespace with NaN
    dataset = dataset.replace(r'^\s*$', np.nan, regex=True)

    # drop the education column as it is redundant with education-num
    dataset = dataset.drop('education', axis=1)

    return dataset

name_of_columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation',
                   'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

dataset = load_data('adult.data', name_of_columns)

test_dataset = load_data('adult.test', name_of_columns)

# print the 15th row of the dataset
print(dataset.iloc[14])

print(dataset.shape)

print(dataset.info())

age                                40
workclass                     Private
fnlwgt                         121772
education-num                      11
marital-status     Married-civ-spouse
occupation               Craft-repair
relationship                  Husband
race               Asian-Pac-Islander
sex                              Male
capital-gain                        0
capital-loss                        0
hours-per-week                     40
native-country                    NaN
income                           >50K
Name: 14, dtype: object
(32561, 14)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       30725 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education-num   32561 non-null  int64 
 4   marital-status  32561 non-null  object
 5   occupation

### Checking the dataset for missing values

- `?` is used to represent missing values in this dataset
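
As an alternative to replacing `' ?'` after loading, `pd.read_csv` can mark these cells as missing while it parses the file. The sketch below is not part of the notebook. It uses three made-up rows and a shortened column list, and it assumes that `skipinitialspace=True` is acceptable, which also strips the leading space from every value.

```python
import io
import pandas as pd

# three made-up rows in the same comma-plus-space layout as the raw files
sample = io.StringIO(
    "39, State-gov, Bachelors\n"
    "54, ?, HS-grad\n"
    "28, Private, ?\n"
)

# shortened, illustrative column list (the real files have 15 columns)
columns = ['age', 'workclass', 'education']

# skipinitialspace strips the space after each comma, so '?' matches na_values
sample_df = pd.read_csv(sample, header=None, names=columns,
                        na_values='?', skipinitialspace=True)

print(sample_df.isnull().sum())
```

With this approach the values no longer carry a leading space, so any later lookup that relies on strings such as `' <=50K'` would need to use the stripped form instead.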

In [183]:
# a notebook cell displays only its last expression,
# so the output below is for the test set
dataset.isnull().sum()
test_dataset.isnull().sum()

age                 0
workclass         963
fnlwgt              0
education-num       0
marital-status      0
occupation        966
relationship        0
race                0
sex                 0
capital-gain        0
capital-loss        0
hours-per-week      0
native-country    274
income              0
dtype: int64

In [184]:
# function to find the columns that contain missing values
def find_missing_columns(dataset):
    dataset_columns_length = len(dataset.columns)
    # Store the column number of the columns with missing values in a list called missing_cols
    missing_cols = [i for i in range(dataset_columns_length) if dataset.iloc[:, i].isnull().any()]
    
    # Print columns index and names with missing values and the number of missing values
    for i in missing_cols:
        print(i, dataset.columns[i], dataset.iloc[:, i].isnull().sum())
            
    return missing_cols

def impute_missing_values(dataset, missing_columns):
    # fill object and integer columns with the most frequent value,
    # and float columns with the mean
    for column in missing_columns:
        column_name = dataset.columns[column]
        column_dtype = dataset[column_name].dtype
        if column_dtype == 'object' or column_dtype == 'int64':
            strategy = 'most_frequent'
        elif column_dtype == 'float64':
            strategy = 'mean'
        else:
            # leave columns of any other type untouched
            continue
        imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
        dataset[column_name] = imputer.fit_transform(dataset[column_name].values.reshape(-1, 1)).ravel()
    return dataset

# find the columns with missing values in the dataset
missing_cols = find_missing_columns(dataset)

print(missing_cols)

dataset = impute_missing_values(dataset, missing_cols)

missing_cols_test = find_missing_columns(test_dataset)

print(missing_cols_test)

test_dataset = impute_missing_values(test_dataset, missing_cols_test)

print(dataset.isnull().sum())       # verify
print(test_dataset.isnull().sum())  # verify

1 workclass 1836
5 occupation 1843
12 native-country 583
[1, 5, 12]
1 workclass 963
5 occupation 966
12 native-country 274
[1, 5, 12]
age               0
workclass         0
fnlwgt            0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64
age               0
workclass         0
fnlwgt            0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64
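
The cell above fits a separate imputer on each split, so the test set is filled using its own most frequent values. A common alternative is to fit the imputer on the training set only and reuse it on the test set. The sketch below illustrates that idea on two made-up frames. It is an assumption about how the step could be arranged, not what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# made-up toy frames with one categorical column
toy_train = pd.DataFrame({'workclass': [' Private', ' Private', np.nan, ' State-gov']})
toy_test = pd.DataFrame({'workclass': [np.nan, ' Local-gov']})

# learn the most frequent value from the training set only
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer.fit(toy_train[['workclass']])

# apply the same fitted imputer to both splits
toy_train['workclass'] = imputer.transform(toy_train[['workclass']]).ravel()
toy_test['workclass'] = imputer.transform(toy_test[['workclass']]).ravel()

print(toy_test)
```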


### Converting the label to a binary value

- `<=50K` is converted to `0`
- `>50K` is converted to `1`
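
The training labels are written as `' <=50K'` and `' >50K'`, while the test labels carry a trailing period, which is why the cell below passes two different mappings. A possible variation is to normalize the strings first and use a single mapping. The helper name `normalize_income_label` is hypothetical and the label values are small made-up samples.

```python
import pandas as pd

def normalize_income_label(labels):
    # strip surrounding whitespace and the trailing '.' found in the test file
    cleaned = labels.str.strip().str.rstrip('.')
    return cleaned.map({'<=50K': 0, '>50K': 1})

sample_train_labels = pd.Series([' <=50K', ' >50K'])
sample_test_labels = pd.Series([' <=50K.', ' >50K.'])

print(normalize_income_label(sample_train_labels).tolist())
print(normalize_income_label(sample_test_labels).tolist())
```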

In [185]:
# convert the label column to binary values

def convert_label_to_binary(dataset, label_column_name, options):
    dataset[label_column_name] = dataset[label_column_name].map(options)
    return dataset

dataset = convert_label_to_binary(dataset, 'income', {' <=50K': 0, ' >50K': 1})
test_dataset = convert_label_to_binary(test_dataset, 'income', {' <=50K.': 0, ' >50K.': 1})

# print the 15th row of the dataset
# print(dataset.iloc[14])

print(dataset.dtypes)
print(dataset.isnull().sum())

print(test_dataset.dtypes)
print(test_dataset.isnull().sum())

age                int64
workclass         object
fnlwgt             int64
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income             int64
dtype: object
age               0
workclass         0
fnlwgt            0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64
age                int64
workclass         object
fnlwgt             int64
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            

### Converting the categorical attributes to numerical attributes

To convert the categorical values to numerical values, we use the `OneHotEncoder` class from the `sklearn.preprocessing` module.

In [186]:
# function to convert the categorical values to numerical values using one hot encoding
def convert_categorical(dataset, column_names_to_convert):
    column_index = []
    
    # find the index of the categorical columns
    for column_name in column_names_to_convert:
        column_index.append(dataset.columns.get_loc(column_name))
        
    # convert the dataset to a numpy array
    dataset_array = dataset.values
    
    # one hot encoder object 
    one_hot_encoder = OneHotEncoder(dtype=np.int64, handle_unknown='ignore')

    # apply the one hot encoder object on the independent variable dataset
    encoded_x = one_hot_encoder.fit_transform(dataset_array[:, column_index]).toarray()
    
    # drop the original column from the dataset
    dataset_array = np.delete(dataset_array, column_index, axis = 1)
    
    # add the new columns to the dataset
    dataset_array = np.concatenate((dataset_array, encoded_x), axis = 1)
    
    
    # get the column names of the new columns
    encoded_x_column_names = one_hot_encoder.get_feature_names_out(input_features=column_names_to_convert)
    
    # drop the old column from the dataset
    dataset = dataset.drop(column_names_to_convert, axis = 1)
    
    # record the data types of each column
    original_data_types = dataset.dtypes.to_dict()
    # all the data types are int64 for encoded columns
    
    
    # record the last column number of the dataset
    last_column_number = len(dataset.columns)
    
    # reconstruct the new dataset column names
    new_column_names = list(dataset.columns[0:last_column_number-1]) + list(encoded_x_column_names)
    new_column_names.append(dataset.columns[last_column_number-1])
    
    # rearrange the columns of the dataset_array:
    # move the label column (at index last_column_number-1) to the end, after the encoded columns
    dataset_array = np.concatenate((dataset_array[:, 0:last_column_number-1], dataset_array[:, last_column_number:], dataset_array[:, last_column_number-1:last_column_number]), axis = 1)
    
    # convert the dataset to a dataframe
    # Here the column names are the original column names and the one hot encoded column names
    # and the values are the values of the dataset array
    dataset = pd.DataFrame(data=dataset_array, columns = new_column_names)
    
    # restore the original data types of the columns
    for column_name in dataset.columns:
        if column_name in original_data_types:
            dataset[column_name] = dataset[column_name].astype(original_data_types[column_name])
        else:
            dataset[column_name] = dataset[column_name].astype('int64')
    
    return dataset

def align_columns(train, test):
    # find the columns that are in the training set but missing from the test set
    missing_cols = set(train.columns) - set(test.columns)

    # add each missing column to the test set, filled with 0
    for c in missing_cols:
        test[c] = 0

    # reorder the test set columns to match the training set
    # (this also drops any column that appears only in the test set)
    test = test[train.columns]

    return train, test


categorical_col_names = ['workclass', 'marital-status', 'relationship', 'occupation', 'race', 'sex', 'native-country']

dataset = convert_categorical(dataset, categorical_col_names)

test_dataset = convert_categorical(test_dataset, categorical_col_names)

dataset, test_dataset = align_columns(dataset, test_dataset)

print(dataset.columns)

print(dataset.shape)

print(test_dataset.shape)

print(test_dataset.dtypes)

Index(['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
       'hours-per-week', 'workclass_ Federal-gov', 'workclass_ Local-gov',
       'workclass_ Never-worked', 'workclass_ Private',
       'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc',
       'workclass_ State-gov', 'workclass_ Without-pay',
       'marital-status_ Divorced', 'marital-status_ Married-AF-spouse',
       'marital-status_ Married-civ-spouse',
       'marital-status_ Married-spouse-absent',
       'marital-status_ Never-married', 'marital-status_ Separated',
       'marital-status_ Widowed', 'relationship_ Husband',
       'relationship_ Not-in-family', 'relationship_ Other-relative',
       'relationship_ Own-child', 'relationship_ Unmarried',
       'relationship_ Wife', 'occupation_ Adm-clerical',
       'occupation_ Armed-Forces', 'occupation_ Craft-repair',
       'occupation_ Exec-managerial', 'occupation_ Farming-fishing',
       'occupation_ Handlers-cleaners', 'occupation_ Machine-

### Scale the dataset

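To scale the dataset, we apply min-max scaling to each numerical column, which maps its values to the range [0, 1]. The scaling is implemented by hand rather than with `sklearn.preprocessing.minmax_scale`.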
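
Min-max scaling computes `(x - min) / (max - min)` for each column. The notebook scales the training and test sets separately, each with its own minimum and maximum. The sketch below shows a possible variation in which both statistics come from the training set and are reused on the test set. The helper names `fit_min_max` and `apply_min_max` and the toy values are made up for illustration.

```python
import pandas as pd

def fit_min_max(train, column_names):
    # record the minimum and the range of each column on the training set only
    minimum = train[column_names].min()
    value_range = train[column_names].max() - minimum
    return minimum, value_range

def apply_min_max(dataset, column_names, minimum, value_range):
    # scale a copy so the input frame is left unchanged
    scaled = dataset.copy()
    scaled[column_names] = (dataset[column_names] - minimum) / value_range
    return scaled

toy_train = pd.DataFrame({'age': [20, 40, 60], 'hours-per-week': [10, 40, 50]})
toy_test = pd.DataFrame({'age': [30, 70], 'hours-per-week': [40, 20]})

numerical = ['age', 'hours-per-week']
minimum, value_range = fit_min_max(toy_train, numerical)
train_scaled = apply_min_max(toy_train, numerical, minimum, value_range)
test_scaled = apply_min_max(toy_test, numerical, minimum, value_range)

print(test_scaled)
```

With training statistics, a test value outside the training range falls outside [0, 1], as the age of 70 does here.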

In [187]:
# function to scale the numerical columns to the range [0, 1] (min-max scaling)
def scale_numerical_values(dataset, column_names_to_scale):
    for column_name in column_names_to_scale:
        # max and min values
        max_value = dataset[column_name].max()
        min_value = dataset[column_name].min()

        # skip constant columns to avoid dividing by zero
        if max_value == min_value:
            continue

        # scale the values
        dataset[column_name] = (dataset[column_name] - min_value)/(max_value - min_value)
    return dataset

numerical_col_names = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

dataset = scale_numerical_values(dataset, numerical_col_names)

test_dataset = scale_numerical_values(test_dataset, numerical_col_names)

print(dataset.head(2))

print(test_dataset.head(2))

        age    fnlwgt  education-num  capital-gain  capital-loss  \
0  0.301370  0.044302            0.8       0.02174           0.0   
1  0.452055  0.048238            0.8       0.00000           0.0   

   hours-per-week  workclass_ Federal-gov  workclass_ Local-gov  \
0        0.397959                       0                     0   
1        0.122449                       0                     0   

   workclass_ Never-worked  workclass_ Private  ...  \
0                        0                   0  ...   
1                        0                   0  ...   

   native-country_ Puerto-Rico  native-country_ Scotland  \
0                            0                         0   
1                            0                         0   

   native-country_ South  native-country_ Taiwan  native-country_ Thailand  \
0                      0                       0                         0   
1                      0                       0                         0   

   native-c

### Divide the dataset into dependent and independent variables

- The dependent variable is `income`
- The remaining variables are the independent variables
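
The function below selects the label by position and relies on `income` being the last column. A sketch of the same split done by column name, on a made-up frame in which `income` is deliberately not last:

```python
import pandas as pd

# made-up toy frame; 'income' sits in the middle on purpose
toy = pd.DataFrame({'age': [0.30, 0.45], 'income': [0, 1], 'sex_ Male': [1, 0]})

# every column except the label becomes a feature
toy_x = toy.drop(columns='income').to_numpy()
toy_y = toy['income'].to_numpy()

print(toy_x.shape, toy_y.shape)
```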

In [188]:
# function to divide the dataset into dependent and independent variables

def divide_dataset(dataset):
    # divide the dataset into x and y
    dataset_columns_length = len(dataset.columns)

    x = dataset.iloc[:, 0:(dataset_columns_length-1)].values
    y = dataset.iloc[:, (dataset_columns_length-1)].values
    
    return x, y

x_train, y_train = divide_dataset(dataset)

x_test, y_test = divide_dataset(test_dataset)

print(x_train.shape)
print(y_train.shape)

print(x_train[0])
print(y_train[0])

print(x_test.shape)
print(y_test.shape)

print(x_test[0])
print(y_test[0])

(32561, 89)
(32561,)
[0.30136986 0.0443019  0.8        0.02174022 0.         0.39795918
 0.         0.         0.         0.         0.         0.
 1.         0.         0.         0.         0.         0.
 1.         0.         0.         0.         1.         0.
 0.         0.         0.         1.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         1.         0.         1.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         1.         0.         0.        ]
0
(16281, 89)
(16281,)
[0.10958904 0.14443012 0.4        0.         0.         0.3979591

### Model

- We use the `CustomLogisticRegression` class from the `custom_logistic_regression` module
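
The `custom_logistic_regression` module is not shown here, so its internals and the exact meaning of parameters such as `early_stopping_threshold` and `num_features` are unknown. For orientation only, the sketch below shows a generic logistic regression trained with batch gradient descent in NumPy. The function names, the `loss_threshold` stopping rule, and the toy data are all assumptions, and this is not the author's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(x, y, learning_rate=0.5, num_iterations=2000, loss_threshold=None):
    """Batch gradient descent on the mean cross-entropy loss."""
    num_samples, num_features = x.shape
    weights = np.zeros(num_features)
    bias = 0.0
    eps = 1e-12  # keeps log() away from zero

    for _ in range(num_iterations):
        probabilities = sigmoid(x @ weights + bias)

        # mean cross-entropy loss for the current parameters
        loss = -np.mean(y * np.log(probabilities + eps)
                        + (1 - y) * np.log(1 - probabilities + eps))

        # assumed early-stopping rule: stop once the loss is small enough
        if loss_threshold is not None and loss < loss_threshold:
            break

        # gradient of the loss with respect to the weights and the bias
        error = probabilities - y
        weights -= learning_rate * (x.T @ error) / num_samples
        bias -= learning_rate * error.mean()

    return weights, bias

def predict(x, weights, bias):
    # label 1 when the predicted probability is at least 0.5
    return (sigmoid(x @ weights + bias) >= 0.5).astype(int)

# made-up, linearly separable toy data with a single feature
toy_x = np.array([[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

weights, bias = fit_logistic_regression(toy_x, toy_y)
toy_pred = predict(toy_x, weights, bias)
print((toy_pred == toy_y).mean())
```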

In [189]:
# import the logistic regression class

from custom_logistic_regression import CustomLogisticRegression


# create an object of the class
classifier = CustomLogisticRegression(early_stopping_threshold=0.01, learning_rate=0.01, num_iterations=20000, verbose=True, num_features=30)

# fit the model to the training data
classifier.fit(x_train, y_train)

# predict the test set results
y_pred = classifier.predict(x_test)

# evaluate the predictions: accuracy, confusion matrix and per-class report
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print(accuracy_score(y_test, y_pred))

print(confusion_matrix(y_test, y_pred))

print(classification_report(y_test, y_pred))



loss: 0.6921680563465841 	
loss: 0.4026967250859418 	
0.8173330876481789
[[11524   911]
 [ 2063  1783]]
              precision    recall  f1-score   support

           0       0.85      0.93      0.89     12435
           1       0.66      0.46      0.55      3846

    accuracy                           0.82     16281
   macro avg       0.76      0.70      0.72     16281
weighted avg       0.80      0.82      0.81     16281
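
The figures in the report can be reproduced by hand from the confusion matrix above, where rows are true classes and columns are predicted classes. The short check below recomputes the accuracy and the precision, recall and F1 score of class `1`.

```python
import numpy as np

# confusion matrix copied from the output above
cm = np.array([[11524, 911],
               [2063, 1783]])

tn, fp = cm[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm[1]  # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
```

The low recall for class `1` means the model misses more than half of the people who earn over 50K (2063 of 3846), even though the overall accuracy is about 0.82.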

