# ML Answers

- **Answer Set**: No. 05
- **Full Name**: Mohammad Hosein Nemati
- **Student Code**: `610300185`

---

## Basics

In this section, we go through some basic setup steps:

### Libraries

Before we begin, we import the required libraries:

In [39]:
import numpy as np
import pandas as pd
import sklearn as sk

import sklearn.utils as skutils
import sklearn.dummy as skdummy
import sklearn.impute as skimpute
import sklearn.compose as skcompose
import sklearn.pipeline as skpipeline
import sklearn.naive_bayes as skbayes
import sklearn.linear_model as sklinear
import sklearn.preprocessing as skprocessing
import sklearn.model_selection as skselection

sk.set_config(display="diagram")

### Dataset

Now we can load our dataset:

In [40]:
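data_frame = pd.read_csv("../lib/HW5.csv")
# Shuffle the rows; no random_state is set, so the row order changes between runs
data_frame = skutils.shuffle(data_frame)

# Separate the target column; "Education" is dropped too, "Education-num" is kept
data_label = data_frame["Income"]
data_frame = data_frame.drop(["Income", "Education"], axis=1)

data_features = data_frame.to_numpy()
# Copy the labels so the in-place relabeling below never writes into the frame
data_labels = data_label.to_numpy(copy=True)

# Encode the income classes as 0/1 (the raw values carry a leading space)
data_labels[data_labels == " <=50K"] = 0
data_labels[data_labels == " >50K"] = 1

data_labels = np.array(data_labels, dtype=np.int32)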

train_features, test_features, train_labels, test_labels = skselection.train_test_split(
    data_features, 
    data_labels, 
    test_size=0.3, 
    random_state=313
)

data_frame

Unnamed: 0,Age,WorkClass,FinancialWeight,Education-num,MaritalStatus,Occupation,Relationship,Race,Sex,CapitalGain,CapitalLoss,HourPerWeek,NativeCountry
9396,54,Private,234938,9,Married-civ-spouse,Sales,Husband,White,Male,4064,0,55,United-States
28150,46,Private,208067,9,Divorced,Craft-repair,Other-relative,White,Male,0,0,40,United-States
4245,27,Federal-gov,469705,9,Never-married,Craft-repair,Not-in-family,Black,Male,0,1980,40,United-States
4133,55,Private,147098,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,50,United-States
13174,29,Private,179498,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,Germany
...,...,...,...,...,...,...,...,...,...,...,...,...,...
17193,34,Private,238305,10,Married-civ-spouse,Other-service,Wife,White,Female,0,1628,12,
19993,21,Private,232591,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,40,United-States
31416,42,Private,195584,12,Separated,Adm-clerical,Unmarried,Black,Female,0,0,40,United-States
31529,23,Private,288771,10,Never-married,Adm-clerical,Not-in-family,White,Female,0,0,30,United-States


---

## Problem (1)

### Part (a)

In [41]:
print("Column Types:")
print("")

# Split the columns by dtype instead of inspecting the value of a single row
heading_numbers = [label for label in data_frame if pd.api.types.is_numeric_dtype(data_frame[label])]
heading_strings = [label for label in data_frame if not pd.api.types.is_numeric_dtype(data_frame[label])]

heading_numbers_index = [data_frame.columns.get_loc(label) for label in heading_numbers]
heading_strings_index = [data_frame.columns.get_loc(label) for label in heading_strings]

print(f"Numbers: {', '.join(heading_numbers)}")
print(f"Strings: {', '.join(heading_strings)}")

Column Types:

Numbers: Age, FinancialWeight, Education-num, CapitalGain, CapitalLoss, HourPerWeek
Strings: WorkClass, MaritalStatus, Occupation, Relationship, Race, Sex, NativeCountry


---

### Part (b)

In [42]:
print("Column Nulls:")
print("")

null_columns = []

for label in data_frame:
    # Count the missing cells in this column
    nulls = data_frame[label].isna().sum()

    if nulls > 0:
        null_columns.append(label)
        print(f"{label}: {nulls}")

Column Nulls:

WorkClass: 1836
Occupation: 1843
NativeCountry: 583


---

### Part (c)

Because all of the columns that contain null records are string typed, a mean or median is not defined for them, so we fill the null cells with the **Mode of Column** (the most frequent value).
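As a minimal sketch of mode imputation, the snippet below fills missing cells in a toy categorical column. The values are made up for illustration and are not taken from `HW5.csv`; only `SimpleImputer` with `strategy="most_frequent"` matches what the pipeline in the next part uses.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy categorical column with missing entries (illustrative values only).
work_class = np.array(
    [["Private"], ["Private"], [np.nan], ["Federal-gov"], [np.nan]],
    dtype=object,
)

# strategy="most_frequent" replaces each missing cell with the column mode.
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled = imputer.fit_transform(work_class)

print(imputer.statistics_)  # the mode learned for each column
print(filled.ravel())       # the column with missing cells filled in
```

The learned mode is stored in `statistics_`, so the same value is reused when the fitted imputer transforms new data.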

---

### Part (d)

Now we define a **Data Pipeline** that handles the entire workflow. The pipeline contains three steps:

1. **Imputer**: Fills empty cells with the `column mode`
2. **Transformer**: Converts columns
   1. **Numbers**: Scales number typed columns to the `[0, 1]` range using `MinMaxScaler`
   2. **Strings**: Converts string typed columns to numbers using `OneHotEncoder`
3. **Classifier**: Classification using a `DummyClassifier` baseline that always predicts class `0`
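To make the transformer step concrete, here is a minimal sketch on a toy array. The rows and column positions are hypothetical and not taken from `HW5.csv`; the sketch only shows how a numeric column is scaled while a string column is one-hot encoded.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy rows: column 0 is numeric (an age), column 1 is categorical (a sex label).
rows = np.array(
    [[20, "Male"], [40, "Female"], [60, "Male"]],
    dtype=object,
)

transformer = ColumnTransformer(transformers=[
    ("numbers", MinMaxScaler(), [0]),
    ("strings", OneHotEncoder(handle_unknown="ignore"), [1]),
])

encoded = transformer.fit_transform(rows)
# The result can be sparse or dense depending on its density; densify for display.
encoded = encoded.toarray() if hasattr(encoded, "toarray") else np.asarray(encoded)

# Columns: scaled age, one-hot "Female", one-hot "Male".
print(encoded)
```

With `handle_unknown="ignore"`, a category that was not seen during fitting is encoded as all zeros instead of raising an error.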

In [43]:
pipe = skpipeline.Pipeline(steps=[
    ("imputer", skimpute.SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ("transformer", skcompose.ColumnTransformer(transformers=[
        ("numbers", skprocessing.MinMaxScaler(), heading_numbers_index),
        ("strings", skprocessing.OneHotEncoder(handle_unknown="ignore"), heading_strings_index),
    ])),
    ("classifier", skdummy.DummyClassifier(strategy="constant", constant=0))
]).fit(train_features, train_labels)

pipe

In [44]:
train_accuracy = skselection.cross_val_score(pipe, train_features, train_labels, cv=10)
test_accuracy = skselection.cross_val_score(pipe, test_features, test_labels, cv=10)

print(f"Train Accuracy Mean: {train_accuracy.mean()}")
print(f"Test Accuracy Mean: {test_accuracy.mean()}")
print("")
print(f"Train Accuracy STD: {train_accuracy.std()}")
print(f"Test Accuracy STD: {test_accuracy.std()}")

Train Accuracy Mean: 0.7579413870349414
Test Accuracy Mean: 0.762104636139403

Train Accuracy STD: 0.00019142755976423408
Test Accuracy STD: not recorded (the original run printed the train STD on this line)
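Note that `cross_val_score` refits a clone of the estimator on every fold, so calling it on the test split also trains on test rows. A more conventional protocol is to cross-validate on the training split only and to score the test split once with a model fitted on the training split. The sketch below illustrates this on synthetic stand-in data (the arrays and the 25% positive rate are assumptions, not `HW5.csv`):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (not HW5.csv): 1000 rows, roughly 75% of labels are 0.
rng = np.random.default_rng(313)
features = rng.normal(size=(1000, 3))
labels = (rng.random(1000) < 0.25).astype(np.int32)

train_x, test_x, train_y, test_y = train_test_split(
    features, labels, test_size=0.3, random_state=313
)

model = DummyClassifier(strategy="constant", constant=0)

# Cross-validation on the training split only: each fold refits a clone of the model.
cv_scores = cross_val_score(model, train_x, train_y, cv=10)

# Held-out score: fit once on the training split, then score the untouched test split.
held_out = model.fit(train_x, train_y).score(test_x, test_y)

print(f"CV Accuracy Mean: {cv_scores.mean():.3f}")
print(f"CV Accuracy STD: {cv_scores.std():.3f}")
print(f"Held-out Accuracy: {held_out:.3f}")
```

The same pattern applies to any of the pipelines in this notebook by swapping `model` for the pipeline object.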


---

## Problem (2)

Now we define a **Data Pipeline** that handles the entire workflow. The pipeline contains three steps:

1. **Imputer**: Fills empty cells with the `column mode`
2. **Transformer**: Converts columns
   1. **Numbers**: Scales number typed columns to the `[0, 1]` range using `MinMaxScaler`
   2. **Strings**: Converts string typed columns to numbers using `OneHotEncoder`
3. **Classifier**: Classification using `LogisticRegression`

In [45]:
pipe = skpipeline.Pipeline(steps=[
    ("imputer", skimpute.SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ("transformer", skcompose.ColumnTransformer(transformers=[
        ("numbers", skprocessing.MinMaxScaler(), heading_numbers_index),
        ("strings", skprocessing.OneHotEncoder(handle_unknown="ignore"), heading_strings_index),
    ])),
    ("classifier", sklinear.LogisticRegression(max_iter=1000))
]).fit(train_features, train_labels)

pipe

In [46]:
train_accuracy = skselection.cross_val_score(pipe, train_features, train_labels, cv=10)
test_accuracy = skselection.cross_val_score(pipe, test_features, test_labels, cv=10)

print(f"Train Accuracy Mean: {train_accuracy.mean()}")
print(f"Test Accuracy Mean: {test_accuracy.mean()}")
print("")
print(f"Train Accuracy STD: {train_accuracy.std()}")
print(f"Test Accuracy STD: {test_accuracy.std()}")

Train Accuracy Mean: 0.8485871573404771
Test Accuracy Mean: 0.8472704162961223

Train Accuracy STD: 0.004899943692374601
Test Accuracy STD: not recorded (the original run printed the train STD on this line)


---

## Problem (3)

Now we define a **Data Pipeline** that handles the entire workflow. The pipeline contains three steps:

1. **Imputer**: Fills empty cells with the `column mode`
2. **Transformer**: Converts columns
   1. **Numbers**: Scales number typed columns to the `[0, 1]` range using `MinMaxScaler`
   2. **Strings**: Converts string typed columns to numbers using `OneHotEncoder`
3. **Classifier**: Classification using Naive Bayes (`MultinomialNB`)

In [47]:
pipe = skpipeline.Pipeline(steps=[
    ("imputer", skimpute.SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ("transformer", skcompose.ColumnTransformer(transformers=[
        ("numbers", skprocessing.MinMaxScaler(), heading_numbers_index),
        ("strings", skprocessing.OneHotEncoder(handle_unknown="ignore"), heading_strings_index),
    ])),
    ("classifier", skbayes.MultinomialNB())
]).fit(train_features, train_labels)

pipe

In [48]:
train_accuracy = skselection.cross_val_score(pipe, train_features, train_labels, cv=10)
test_accuracy = skselection.cross_val_score(pipe, test_features, test_labels, cv=10)

print(f"Train Accuracy Mean: {train_accuracy.mean()}")
print(f"Test Accuracy Mean: {test_accuracy.mean()}")
print("")
print(f"Train Accuracy STD: {train_accuracy.std()}")
print(f"Test Accuracy STD: {test_accuracy.std()}")

Train Accuracy Mean: 0.7631190773115325
Test Accuracy Mean: 0.7630255088343374

Train Accuracy STD: 0.006856513610407651
Test Accuracy STD: not recorded (the original run printed the train STD on this line)


### Results

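As we can see, the accuracies of **Naive Bayes** (about 0.763) and the **Baseline** (about 0.758) are close to each other. A likely reason is that Naive Bayes treats the features as conditionally independent given the label, and the relation between the label and the features is too complicated for that assumption to hold.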
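To show what Naive Bayes computes, the sketch below reproduces the predictions of `MultinomialNB` by hand: each class is scored as its log prior plus a sum of per-feature terms, with no interaction between features. The data is synthetic and only illustrative, so the numbers say nothing about `HW5.csv`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative features (illustrative only, not HW5.csv).
rng = np.random.default_rng(0)
features = rng.random((200, 4))
labels = (features[:, 0] + features[:, 1] > 1.0).astype(np.int32)

model = MultinomialNB().fit(features, labels)

# Per-class score: log P(y) + sum_j x_j * log(theta_yj),
# where theta_yj is the smoothed feature probability learned for class y.
manual_scores = model.class_log_prior_ + features @ model.feature_log_prob_.T
manual_pred = model.classes_[np.argmax(manual_scores, axis=1)]

# The hand-computed predictions match the library's predictions.
print(np.array_equal(manual_pred, model.predict(features)))
```

Each feature contributes to the score separately, which is where the independence assumption enters.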

---

## Problem (4)

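- Most people in the dataset have an income of `<=50K` (about 76% of the rows), so the **Dummy Baseline** classifier, which always predicts that class, already reaches a high accuracy.
- The relation between the features and the label is complicated and does not fit the independence assumption of **Naive Bayes**, so its accuracy stays close to the `baseline`.
- **Logistic Regression** can find a simple linear relation between the features and the label, so its accuracy is higher (about 0.85).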
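The first point can be checked directly: a classifier that always predicts the majority class scores exactly the share of that class. The label vector below is hypothetical (76 zeros and 24 ones, chosen to mimic the roughly 76% share seen above) and is not the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 0 stands for "<=50K", 1 for ">50K".
labels = np.array([0] * 76 + [1] * 24, dtype=np.int32)
features = np.zeros((labels.size, 1))  # a constant classifier ignores the features

baseline = DummyClassifier(strategy="constant", constant=0).fit(features, labels)

majority_share = np.mean(labels == 0)
baseline_accuracy = baseline.score(features, labels)

print(f"Majority share: {majority_share:.2f}")
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Any model worth keeping should beat this number, which is why the baseline is a useful reference for the other two classifiers.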

---