# ML Answers

- **Answer Set**: No. 05
- **Full Name**: Mohammad Hosein Nemati
- **Student Code**: `610300185`

---

## Basics

In this section, we perform some basic setup steps:

### Libraries

Before we begin, we import the required libraries:

In [204]:
import numpy as np
import pandas as pd
import sklearn as sk

import sklearn.utils as skutils
import sklearn.dummy as skdummy
import sklearn.impute as skimpute
import sklearn.metrics as skmetrics
import sklearn.compose as skcompose
import sklearn.pipeline as skpipeline
import sklearn.naive_bayes as skbayes
import sklearn.linear_model as sklinear
import sklearn.preprocessing as skprocessing
import sklearn.model_selection as skselection

import matplotlib.pyplot as plt

sk.set_config(display="diagram")

### Dataset

Now we can load our dataset:

In [227]:
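# Load the dataset and shuffle the rows
data_frame = pd.read_csv("../lib/HW5.csv")
data_frame = skutils.shuffle(data_frame)

# Separate the target column; "Education" is dropped as well,
# while the numeric "Education-num" column is kept
data_label = data_frame["Income"]
data_frame = data_frame.drop(["Income", "Education"], axis=1)

data_features = data_frame.to_numpy()
data_labels = data_label.to_numpy()

# Encode the labels: " <=50K" -> 0, " >50K" -> 1 (note the leading space)
data_labels[data_labels == " <=50K"] = 0
data_labels[data_labels == " >50K"] = 1

data_labels = np.array(data_labels, dtype=np.int32)

# Hold out 30% of the records as the test split
train_features, test_features, train_labels, test_labels = skselection.train_test_split(
    data_features,
    data_labels,
    test_size=0.3,
    random_state=313
)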

data_frame

| | Age | WorkClass | FinancialWeight | Education-num | MaritalStatus | Occupation | Relationship | Race | Sex | CapitalGain | CapitalLoss | HourPerWeek | NativeCountry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24050 | 24 | Private | 456460 | 12 | Married-civ-spouse | Adm-clerical | Wife | White | Female | 0 | 0 | 40 | United-States |
| 23009 | 31 | NaN | 283531 | 9 | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 20 | United-States |
| 1583 | 39 | Private | 128715 | 9 | Divorced | Adm-clerical | Not-in-family | White | Male | 10520 | 0 | 40 | United-States |
| 21878 | 37 | Private | 176073 | 10 | Divorced | Craft-repair | Not-in-family | White | Male | 0 | 0 | 50 | United-States |
| 17669 | 37 | Private | 83880 | 10 | Never-married | Craft-repair | Not-in-family | White | Male | 0 | 0 | 40 | United-States |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 22175 | 35 | Self-emp-inc | 187046 | 14 | Married-civ-spouse | Adm-clerical | Husband | White | Male | 0 | 0 | 50 | United-States |
| 26552 | 54 | Private | 189607 | 13 | Never-married | Other-service | Own-child | Black | Female | 0 | 0 | 36 | United-States |
| 3369 | 61 | Private | 149653 | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 50 | United-States |
| 17866 | 21 | Private | 147655 | 9 | Never-married | Other-service | Own-child | White | Female | 0 | 0 | 35 | United-States |


---

## Problem (1)

### Part (a)

In [229]:
print("Column Types:")

# A column counts as numeric if its value in the row labeled 0 consists of digits only
heading_numbers = [label for label in data_frame if str(data_frame.loc[0, label]).isdigit()]
heading_strings = [label for label in data_frame if not str(data_frame.loc[0, label]).isdigit()]

heading_numbers_index = [data_frame.columns.get_loc(label) for label in heading_numbers]
heading_strings_index = [data_frame.columns.get_loc(label) for label in heading_strings]

print(f"Numbers: {', '.join(heading_numbers)}")
print(f"Strings: {', '.join(heading_strings)}")

Column Types:
Numbers: Age, FinancialWeight, Education-num, CapitalGain, CapitalLoss, HourPerWeek
Strings: WorkClass, MaritalStatus, Occupation, Relationship, Race, Sex, NativeCountry
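
The check above inspects a single row, and `str.isdigit()` returns `False` for negative or decimal values. A dtype-based split is one possible alternative. The sketch below uses a hypothetical `toy_frame` in place of the real dataset, and all variable names are illustrative only:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy_frame = pd.DataFrame({
    "Age": [24, 31, 39],
    "WorkClass": ["Private", None, "Private"],
    "HourPerWeek": [40, 20, 40],
    "Sex": ["Female", "Female", "Male"],
})

# Split columns by dtype instead of inspecting the values of a single row.
numeric_columns = toy_frame.select_dtypes(include="number").columns.tolist()
string_columns = toy_frame.select_dtypes(exclude="number").columns.tolist()

# Positional indices, as needed once the frame is converted to a NumPy array.
numeric_index = [toy_frame.columns.get_loc(name) for name in numeric_columns]
string_index = [toy_frame.columns.get_loc(name) for name in string_columns]

print(f"Numbers: {', '.join(numeric_columns)}")
print(f"Strings: {', '.join(string_columns)}")
```

This variant does not depend on which row happens to carry the label `0` after shuffling.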


---

### Part (b)

In [230]:
print("Column Nulls:")
print("")

null_columns = []

for label in data_frame:
    # Count the missing cells in this column
    nulls = data_frame[label].isna().sum()

    if nulls > 0:
        null_columns.append(label)
        print(f"{label}: {nulls}")

Column Nulls:

WorkClass: 1836
Occupation: 1843
NativeCountry: 583


---

### Part (c)

Because all columns that contain null records are string typed, we can fill the null cells with the **Mode of the Column** (its most frequent value).
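
As a minimal sketch of mode imputation, the example below fills one missing entry in a hypothetical single-column array (`work_class` is an illustrative stand-in, not the real data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing entry.
work_class = np.array(
    [["Private"], [np.nan], ["Private"], ["State-gov"]],
    dtype=object,
)

# "most_frequent" replaces each missing cell with the mode of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled = imputer.fit_transform(work_class)

print(filled.ravel())
```

The missing cell receives `Private`, the most frequent value in the column.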

---

### Part (d)

Now we define a **Data Pipeline** that handles the entire workflow. The pipeline contains three steps:

1. **Imputer**: Fills empty cells with the column mode (`SimpleImputer` with `strategy="most_frequent"`)
2. **Transformer**: Converts columns
   1. **Numbers**: Scales numeric columns to the range `[0, 1]` using `MinMaxScaler`
   2. **Strings**: Converts string columns to numbers using `OneHotEncoder`
3. **Classifier**: Classifies using a constant baseline (`DummyClassifier` that always predicts class `0`)
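
The effect of the transformer step can be sketched on a small example. The array `toy_features` is hypothetical, and `sparse_threshold=0` is set here only to force a dense result for printing:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data: column 0 is numeric (age), column 1 is a string (work class).
toy_features = np.array(
    [[20, "Private"], [30, "State-gov"], [40, "Private"]],
    dtype=object,
)

transformer = ColumnTransformer(
    transformers=[
        ("numbers", MinMaxScaler(), [0]),
        ("strings", OneHotEncoder(handle_unknown="ignore"), [1]),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = transformer.fit_transform(toy_features)
print(encoded)
```

The first output column holds the scaled ages, and the remaining columns hold one indicator per distinct string value.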

In [263]:
pipe = skpipeline.Pipeline(steps=[
    ("imputer", skimpute.SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ("transformer", skcompose.ColumnTransformer(remainder="passthrough", transformers=[
        ("numbers", skprocessing.MinMaxScaler(), heading_numbers_index),
        ("strings", skprocessing.OneHotEncoder(handle_unknown="ignore"), heading_strings_index),
    ])),
    ("classifier", skdummy.DummyClassifier(strategy="constant", constant=0))
]).fit(train_features, train_labels)

pipe

In [264]:
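# cross_val_score clones the pipeline and refits it on each fold,
# so both lines report 10-fold cross-validation scores within the given split
train_accuracy = skselection.cross_val_score(pipe, train_features, train_labels, cv=10)
test_accuracy = skselection.cross_val_score(pipe, test_features, test_labels, cv=10)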

print(f"Train Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")

Train Accuracy: [0.76052632 0.76008772 0.76042124 0.76042124 0.76042124 0.76042124
 0.76042124 0.76042124 0.76042124 0.76042124]
Test Accuracy: [0.75639713 0.75639713 0.75639713 0.75639713 0.75639713 0.75639713
 0.75639713 0.75639713 0.75639713 0.75614754]
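
A classifier that always predicts class `0` scores exactly the share of class `0` in the data it is evaluated on, which is why the fold scores above are nearly constant. A minimal sketch with hypothetical labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 6 of 8 samples belong to class 0.
labels = np.array([0, 0, 1, 0, 0, 1, 0, 0])
features = np.zeros((len(labels), 1))  # the dummy model ignores the features

baseline = DummyClassifier(strategy="constant", constant=0).fit(features, labels)

accuracy = baseline.score(features, labels)
class_zero_share = np.mean(labels == 0)

print(accuracy, class_zero_share)
```

Any real classifier should beat this value to be considered useful.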


---

## Problem (2)

Now we define a **Data Pipeline** that handles the entire workflow. The pipeline contains three steps:

1. **Imputer**: Fills empty cells with the column mode (`SimpleImputer` with `strategy="most_frequent"`)
2. **Transformer**: Converts columns
   1. **Numbers**: Scales numeric columns to the range `[0, 1]` using `MinMaxScaler`
   2. **Strings**: Converts string columns to numbers using `OneHotEncoder`
3. **Classifier**: Classifies using `LogisticRegression`

In [265]:
pipe = skpipeline.Pipeline(steps=[
    ("imputer", skimpute.SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ("transformer", skcompose.ColumnTransformer(remainder="passthrough", transformers=[
        ("numbers", skprocessing.MinMaxScaler(), heading_numbers_index),
        ("strings", skprocessing.OneHotEncoder(handle_unknown="ignore"), heading_strings_index),
    ])),
    ("classifier", sklinear.LogisticRegression(max_iter=1000))
]).fit(train_features, train_labels)

pipe

In [266]:
train_accuracy = skselection.cross_val_score(pipe, train_features, train_labels, cv=10)
test_accuracy = skselection.cross_val_score(pipe, test_features, test_labels, cv=10)

print(f"Train Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")

Train Accuracy: [0.85350877 0.85614035 0.85563844 0.84598508 0.84247477 0.84686266
 0.85870996 0.85432207 0.8411584  0.84159719]
Test Accuracy: [0.8444217  0.84032753 0.83623337 0.84339816 0.84646878 0.84749232
 0.84032753 0.83930399 0.85158649 0.85348361]
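
Because `cross_val_score` refits a clone of the estimator on every fold, calling it on the test split trains new models on test folds. One possible alternative is to score the pipeline fitted on the training split directly on the held-out data. The sketch below uses hypothetical synthetic data and a simplified pipeline, so its numbers are not comparable to the results above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical synthetic data standing in for the real features and labels.
rng = np.random.default_rng(313)
features = rng.normal(size=(400, 3))
labels = (features[:, 0] + 0.5 * features[:, 1] > 0).astype(int)

train_x, test_x, train_y, test_y = train_test_split(
    features, labels, test_size=0.3, random_state=313
)

model = Pipeline(steps=[
    ("scaler", MinMaxScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Cross-validation on the training split estimates generalization.
cv_scores = cross_val_score(model, train_x, train_y, cv=10)

# The held-out score uses the model fitted on the training split only.
model.fit(train_x, train_y)
held_out_accuracy = model.score(test_x, test_y)

print(f"CV mean accuracy: {cv_scores.mean():.3f}")
print(f"Held-out accuracy: {held_out_accuracy:.3f}")
```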


---

## Problem (3)

Now we define a **Data Pipeline** that handles the entire workflow. The pipeline contains three steps:

1. **Imputer**: Fills empty cells with the column mode (`SimpleImputer` with `strategy="most_frequent"`)
2. **Transformer**: Converts columns
   1. **Numbers**: Scales numeric columns to the range `[0, 1]` using `MinMaxScaler`
   2. **Strings**: Converts string columns to numbers using `OneHotEncoder`
3. **Classifier**: Classifies using `MultinomialNB` (multinomial naive Bayes)

In [267]:
pipe = skpipeline.Pipeline(steps=[
    ("imputer", skimpute.SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ("transformer", skcompose.ColumnTransformer(remainder="passthrough", transformers=[
        ("numbers", skprocessing.MinMaxScaler(), heading_numbers_index),
        ("strings", skprocessing.OneHotEncoder(handle_unknown="ignore"), heading_strings_index),
    ])),
    ("classifier", skbayes.MultinomialNB())
]).fit(train_features, train_labels)

pipe

In [268]:
train_accuracy = skselection.cross_val_score(pipe, train_features, train_labels, cv=10)
test_accuracy = skselection.cross_val_score(pipe, test_features, test_labels, cv=10)

print(f"Train Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")

Train Accuracy: [0.77938596 0.75131579 0.75822729 0.74506362 0.74813515 0.74286968
 0.76129882 0.75866608 0.77007459 0.74769636]
Test Accuracy: [0.76663255 0.75537359 0.76560901 0.76356192 0.73797339 0.76151484
 0.75127943 0.77379734 0.75230297 0.76741803]
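
`MultinomialNB` expects non-negative feature values, which both `MinMaxScaler` and `OneHotEncoder` produce. The sketch below, on hypothetical toy data, contrasts this with standardization, which yields negative values that the classifier rejects:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: one numeric feature and binary labels.
ages = np.array([[20.0], [30.0], [40.0], [50.0]])
labels = np.array([0, 0, 1, 1])

# Min-max scaling keeps every value in [0, 1], which MultinomialNB accepts.
scaled = MinMaxScaler().fit_transform(ages)
MultinomialNB().fit(scaled, labels)

# Standardization produces negative values, which MultinomialNB rejects.
standardized = StandardScaler().fit_transform(ages)
try:
    MultinomialNB().fit(standardized, labels)
    rejected = False
except ValueError:
    rejected = True

print(scaled.min(), standardized.min(), rejected)
```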


---

## Problem (4)

First, we standardize the data records:
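
A minimal sketch of standardization, assuming it means rescaling each feature to zero mean and unit variance. The array `records` is a hypothetical stand-in for the data:

```python
import numpy as np

# Hypothetical records: two features on very different scales.
records = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

# Subtract the per-feature mean and divide by the per-feature standard deviation.
standardized = (records - records.mean(axis=0)) / records.std(axis=0)

print(standardized.mean(axis=0))
print(standardized.std(axis=0))
```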

### Results

As we can see, the accuracies of the **Vanilla Perceptron** and the **LD1 Perceptron** are `100%` because the records are **linearly separable**. The accuracy of the **PCA Perceptron** is only about `50%`, because PCA projects the data onto the direction of greatest variance, not the direction along which the classes are separable.
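
The difference between the two projections can be illustrated on synthetic data. The sketch below assumes that LD1 refers to the first linear discriminant (Fisher's direction) and uses a simple threshold on the projected values in place of a perceptron. The data and all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D data: large spread along x, classes separated along y.
n = 200
class_a = np.column_stack([rng.normal(0, 10, n), rng.normal(-1, 0.2, n)])
class_b = np.column_stack([rng.normal(0, 10, n), rng.normal(+1, 0.2, n)])
points = np.vstack([class_a, class_b])
labels = np.array([0] * n + [1] * n)

# PCA: leading eigenvector of the covariance matrix (labels are not used).
centered = points - points.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered.T))
pca_direction = eigenvectors[:, np.argmax(eigenvalues)]

# Fisher discriminant: inverse within-class scatter times the mean difference.
scatter_within = np.cov(class_a.T) + np.cov(class_b.T)
mean_difference = class_b.mean(axis=0) - class_a.mean(axis=0)
lda_direction = np.linalg.solve(scatter_within, mean_difference)
lda_direction /= np.linalg.norm(lda_direction)


def threshold_accuracy(direction):
    """Accuracy when thresholding the 1-D projection at its mean (best sign)."""
    projected = points @ direction
    predicted = (projected > projected.mean()).astype(int)
    accuracy = np.mean(predicted == labels)
    return max(accuracy, 1 - accuracy)


pca_accuracy = threshold_accuracy(pca_direction)
lda_accuracy = threshold_accuracy(lda_direction)

print(f"PCA direction: {pca_direction}, accuracy: {pca_accuracy:.2f}")
print(f"LDA direction: {lda_direction}, accuracy: {lda_accuracy:.2f}")
```

In this setup the PCA direction follows the high-variance x axis and discards the class information, while the discriminant direction follows the y axis along which the classes differ.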

---