This dataset is based on census data and is used to predict whether a person's income exceeds $50K/yr. It consists of 14 attributes and one binary class variable:

- income: >50K, <=50K

- age: continuous.

- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

- fnlwgt: continuous.

- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

- education-num: continuous.

- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

- sex: Female, Male.

- capital-gain: continuous.

- capital-loss: continuous.

- hours-per-week: continuous.

- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

The binary class (`income`) takes the value `>50K` or `<=50K`.

**NOTE**
- Unlike the labs, each function you make here will be **graded**, so it is important to *strictly* follow the instructions.
- **Import** all necessary libraries yourself whenever needed. Failure to run any code can affect your grade.

#### Basic libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
RANDOM_STATE = 13579 #Do not change it!
np.random.seed(RANDOM_STATE) #Do not change it!
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from collections import Counter
import math
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, f1_score





#### Load the dataset

Load the **Adult** dataset here using Pandas.

In [3]:
adult = pd.read_csv("adult.data", sep=",", header=None, skipinitialspace=True)

You can run the line below to give the dataframe proper column names.

In [4]:
adult.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race',  'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

Here you can find out some basic information by calling *info(), head()*, and *describe()*.

In [5]:
adult.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [6]:
adult.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [7]:
adult.describe()

,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


There is no null data, as we checked. However, the dataset description says that missing values are represented as "?". You can count them using the same technique we used for checking nulls in the previous lab. The missing values occur in three columns only, and they affect about 7% of the data records (2,399 of 32,561 rows).

In [8]:
missing_values = (adult == "?").sum(axis=0)

In [9]:
print(missing_values)

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
income               0
dtype: int64
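
The per-column counts above do not directly tell us how many rows are affected, because one row can contain "?" in several columns. A minimal sketch of counting affected rows, using a small made-up frame in place of `adult` (the values are illustrative only):

```python
import pandas as pd

# Toy frame standing in for `adult`; the values are illustrative only.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "Private"],
    "occupation": ["Sales", "?", "Tech-support", "?"],
    "age": [39, 50, 38, 53],
})

# True for every row that has "?" in at least one column.
rows_with_missing = (toy == "?").any(axis=1)

n_missing_rows = int(rows_with_missing.sum())
share_missing = n_missing_rows / len(toy)
print(n_missing_rows, share_missing)
```

In this notebook the three per-column counts sum to 4,262, while only 2,399 rows are dropped later (32,561 minus 30,162), which is consistent with many rows having more than one "?".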


In [10]:
def drop_missing_values(df, miss):
    """
    Input: 
      df: the dataframe (adult in our case)
      miss: a character to represent missing value ("?" in our case)
      
    Output: the dataframe without the missing values

    Step 1: Replace the value 'miss' with np.nan.
    Step 2: Drop the rows having nan values and store the result in data_dropped.
    Step 3: Return data_dropped
    
    """
    ## Step 1: replace on a copy so the caller's dataframe is not modified
    df_replaced = df.replace(miss, np.nan)

    ## Step 2
    data_dropped = df_replaced.dropna()

    ## Step 3
    return data_dropped


- Apply the `drop_missing_values` function to our dataset `adult` and save the result to `adult_dropped`. This part must be done correctly to earn the point. Pass in the dataset and the indicator for missing values.

In [11]:
adult_dropped = drop_missing_values(adult,"?")

- The output of the function should have the same attributes but only a smaller number of rows. Check how many rows are removed. Your dataset should have 30,162 rows!

In [12]:
adult_dropped.shape

(30162, 15)

In [13]:
X = adult_dropped.drop("income", axis=1)

y = adult_dropped.iloc[:, -1]

- Check the type and size here. We expect (30162, 14) for attributes (`X`) and (30162, ) for labels (`y`).

In [14]:
(X.shape, y.shape, type(X), type(y))

((30162, 14), (30162,), pandas.core.frame.DataFrame, pandas.core.series.Series)

Unfortunately, scikit-learn does not support categorical attributes very well, even for decision trees, which means we need to convert them into a reasonable numeric form to fit the algorithms. One way to do this is one-hot encoding, which transforms each categorical attribute into multiple numeric columns, one for each possible value. There are various ways to apply this, especially using [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) or [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html), but here we will use the Pandas function to keep the dataframe structure.
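
To make the effect of one-hot encoding concrete, here is a small sketch on a made-up frame (column values are illustrative only). It compares plain `pd.get_dummies` with the `drop_first=True` variant, which drops the first category of every encoded attribute:

```python
import pandas as pd

# Toy frame; column names mirror the dataset, values are illustrative only.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "sex": ["Male", "Female", "Male"],
    "race": ["White", "Black", "Other"],
})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category of each attribute dropped

print(list(full.columns))
print(list(reduced.columns))
```

The dropped category is still recoverable: a row belongs to it exactly when all remaining indicator columns of that attribute are zero (or `False`). Numeric columns such as `age` pass through unchanged.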

- Finish one_hot_encoding function which applies one-hot encoding to a given dataframe.


In [15]:
def one_hot_encoding(df):
    """
    Input:
        df: the attributes (X in our case)
    Output: one-hot encoded dataframe
    
    Step 1: Use pd.get_dummies to convert df to a one-hot-encoded form. 
            Enable an option called 'drop_first' to remove duplication.
    Step 2: Return the one-hot-encoded dataframe.
    
    * Those steps and suggested method are just for your convenience. You can use your own choice of methods.
      However, the result should be the same as the one created with the steps above.
    """
    
    df_onehot = pd.get_dummies(df, drop_first=True)
    
    return df_onehot

- Create `X_onehot` by calling `one_hot_encoding` function with `X`. 

In [16]:
X_onehot = one_hot_encoding(X)

- Check your result by calling any methods you learned. If you successfully followed the instruction, the output (`X_onehot`) should have 96 columns.

In [17]:
X_onehot.head()

,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_Local-gov,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,39,77516,13,2174,0,40,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1,50,83311,13,0,0,13,False,False,False,True,...,False,False,False,False,False,False,False,True,False,False
2,38,215646,9,0,0,40,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False
3,53,234721,7,0,0,40,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False
4,28,338409,13,0,0,40,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False


We also need to split our dataset further into four parts for evaluation.

- Use scikit-learn's `train_test_split` function to divide the dataset into four parts.
- Follow the instructions below carefully to get the point!
    - Use `X_onehot` and `y`.
    - Assign 20% to a test set.
    - Use our random state (`RANDOM_STATE`)
    - Enable stratify option.

In [18]:
# Remove the assigned values and write the train_test_split function

X_train, X_test, y_train, y_test = train_test_split(X_onehot, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y)


In [19]:
(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

((24129, 96), (6033, 96), (24129,), (6033,))
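
The stratify option keeps the class proportions (nearly) identical in the training and test parts, which matters for an imbalanced label such as `income`. A minimal sketch with made-up labels (75 negatives and 25 positives; `random_state=0` is an arbitrary choice for this illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 75 negatives, 25 positives (illustrative only).
y_toy = np.array([0] * 75 + [1] * 25)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# With stratify, both parts keep the 25% positive rate.
print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```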

In [20]:
def standardize(X_train, X_test, numeric):
    """
    Input:
        - X_train: A split training set from Task 4
        - X_test: A split test set from Task 5
        - numeric: Numeric columns that should be standardized
    Output:
        - X_train_st: Standardized numeric attributes of the training set (ndarray)
        - X_test_st: Standardized numeric attributes of the test set (ndarray)

    Step 1: Initialize StandardScaler into the variable 'sc'.
    Step 2: Create X_train_numeric, X_test_numeric by selecting numeric columns from original X_train and X_test.
            Use the input 'numeric' to choose the columns.
    Step 3: Fit StandardScaler on X_train_numeric. Use the numeric columns only.
    Step 4: Use trained StandardScaler and run the transform function both on X_train_numeric (for the training set) 
            and X_test_numeric (for the test set). This job will standardize both training and test sets based on
            the statistics of training set. You should only use numeric attributes. Save the outputs to X_train_st and X_test_st.
    Step 5: Return X_train_st, X_test_st.
    
    """
    
    ## Step 1
    sc = StandardScaler()
    
    ## Step 2
    X_train_numeric = X_train[numeric]
    X_test_numeric = X_test[numeric]
    
    ## Step 3

    sc.fit(X_train_numeric)
    
    
    ## Step 4
    
    X_train_st = sc.transform(X_train_numeric)
    X_test_st = sc.transform(X_test_numeric)
    
    ## Step 5
    
    return X_train_st, X_test_st

In [21]:
def merge(X_train, X_test, X_train_numeric, X_test_numeric, numeric):
    # DO NOT CHANGE THIS FUNCTION
    # This function is to ensure that the datasets keep the Pandas DataFrame format.
    if X_train.shape == (0, 0): return pd.DataFrame([0]), pd.DataFrame([0])
    
    X_train_st_df = X_train.copy()
    X_train_st_df[numeric] = X_train_numeric
    X_test_st_df = X_test.copy()
    X_test_st_df[numeric] = X_test_numeric
    
    return X_train_st_df, X_test_st_df

- Find numeric columns first and assign the column names into the variable `numeric`. You can use `.info()` or `.describe()` function to find numeric columns.
- This `numeric` should contain the list of column names as strings, e.g., `['a', 'b', 'c']`.

- Call `standardize` function to standardize numeric attributes. In this case, the output should only contain numeric attributes. We will merge the categorical features later on.

In [22]:
adult.describe()

,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [23]:
 

numeric = ["age", "fnlwgt", "education-num", "capital-gain","capital-loss","hours-per-week"]




In [24]:
X_train_numeric, X_test_numeric = standardize(X_train, X_test, numeric)

- Check the mean and standard deviation values of the standardized dataset by running the blocks below. The dataset should now have near-zero mean and unit standard deviation.

In [25]:
X_train_numeric.mean(axis=0)

array([ 1.82575530e-16, -3.29813861e-17,  8.42203251e-17, -2.82697595e-17,
       -4.47604525e-17,  1.41348797e-17])

In [26]:
X_train_numeric.std(axis=0)

array([1., 1., 1., 1., 1., 1.])
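
Only the training set is guaranteed to have zero mean and unit standard deviation, because the scaler's statistics come from the training data alone. A minimal sketch on made-up numbers shows that the transformed test set generally does not have zero mean:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-column data (illustrative only).
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

sc = StandardScaler()
sc.fit(train)                 # statistics come from the training data only
train_st = sc.transform(train)
test_st = sc.transform(test)  # reuses the training mean and standard deviation

print(sc.mean_, sc.scale_)
print(train_st.mean(), test_st.mean())
```

Fitting on the training data only avoids leaking information from the test set into preprocessing.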

- Unfortunately, scikit-learn's StandardScaler returns a NumPy array here rather than a DataFrame. Run the block below to recover the DataFrame format and the categorical features.

In [27]:
X_train_st, X_test_st = merge(X_train, X_test, X_train_numeric, X_test_numeric, numeric)

- The final outcome (`X_train_st`) should have 96 columns again, where the numeric attributes have zero mean and one standard deviation.

In [28]:
X_train_st.describe()

,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,24129.0,24129.0,24129.0,24129.0,24129.0,24129.0
mean,1.825755e-16,-3.2981390000000004e-17,8.422033e-17,-2.826976e-17,-4.4760450000000004e-17,1.413488e-17
std,1.000021,1.000021,1.000021,1.000021,1.000021,1.000021
min,-1.63786,-1.667683,-3.574146,-0.1471472,-0.2153033,-3.333406
25%,-0.7991874,-0.6804452,-0.4392511,-0.1471472,-0.2153033,-0.07805663
50%,-0.113001,-0.1067059,-0.04738916,-0.1471472,-0.2153033,-0.07805663
75%,0.6494285,0.4495231,1.128197,-0.1471472,-0.2153033,0.3392959
max,3.927875,12.29248,2.303782,13.73371,10.7032,4.846703
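
The `std` row above shows 1.000021 rather than exactly 1. A likely explanation is that `StandardScaler` and NumPy's `std` divide by n (population standard deviation), while pandas' `describe()` and `std()` divide by n - 1 by default. The sketch below illustrates the difference on made-up values:

```python
import numpy as np
import pandas as pd

values = np.array([1.0, 2.0, 3.0, 4.0])
z = (values - values.mean()) / values.std()  # NumPy std uses ddof=0 by default

population_std = z.std()            # ddof=0, divides by n
sample_std = pd.Series(z).std()     # pandas default is ddof=1, divides by n - 1

n = len(z)
print(population_std, sample_std, np.sqrt(n / (n - 1)))
```

For the 24,129 training rows, the same ratio sqrt(n / (n - 1)) is approximately 1.000021, which matches the table.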


After finishing this simple data preprocessing, let's proceed to our main task, classification.

# 1. Classification

In [29]:
rf = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE)

In [30]:
rf_cross_val_score = cross_val_score(rf, X_onehot, y, cv=5)

rf_avg_cross_val_score = np.mean(rf_cross_val_score)

In [31]:
rf_avg_cross_val_score

0.85163467089276


3. Run grid search `gs` with a single dictionary `grid_dict` with two keys.
    1) max_depth from 3 to 6 (inclusive).
    2) min_samples_split = `[3, 5, 7]`.
4. Report the best classifier into the variable `rf_best_classifier`. 
   - Set **cv=5** for grid search cross-validation. Use our training set (`X_train_st` and `y_train`) to perform the grid search. 
   - This task can take a few minutes depending on computing power (0.3 pt).

In [32]:
grid_dict = {'max_depth': [3, 4, 5, 6],
             'min_samples_split': [3, 5, 7]}
   


In [33]:
gs = GridSearchCV(estimator=rf, param_grid=grid_dict, cv=5)

gs.fit(X_train_st, y_train)


- Report your best classifier instance to `rf_best_classifier` (not the best score or average score).

In [34]:
rf_best_classifier = gs.best_estimator_

In [35]:
rf_best_classifier

In [36]:
svc = SVC()

In [37]:
   
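# A list of dictionaries: each grid is searched separately
# (kernels with the default C, then C values with the default kernel).
grid_param = [{"kernel": ["linear", "poly", "rbf"]},
              {"C": [1, 10, 100]}]

gs = GridSearchCV(estimator=svc, param_grid=grid_param, cv=5)
gs.fit(X_train_st, y_train)
svm_best_classifier = gs.best_estimator_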

In [38]:
print(svm_best_classifier)

SVC()
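
Note that passing a list of two dictionaries is not the same as passing one dictionary with two keys. scikit-learn's `ParameterGrid` can be used to count the candidates in each case. The sketch below only enumerates parameter combinations and does not train any model; whether the task intended the joint grid is not stated here.

```python
from sklearn.model_selection import ParameterGrid

# The grid used above: a list of two separate dictionaries.
separate = ParameterGrid([{"kernel": ["linear", "poly", "rbf"]},
                          {"C": [1, 10, 100]}])

# A single dictionary with both keys, for comparison.
joint = ParameterGrid({"kernel": ["linear", "poly", "rbf"],
                       "C": [1, 10, 100]})

print(len(separate), len(joint))
```

The list form yields 3 + 3 candidates, while the single dictionary yields all 3 x 3 kernel and C combinations.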


In [39]:
svm_gs_score = gs.best_estimator_.score(X_test_st, y_test)


In [40]:
print(svm_gs_score)

0.8486656721365822


In [41]:
playgolf = pd.read_csv("playgolf.csv")

In [42]:
playgolf.columns

Index(['Outlook', 'Temp', 'Humidity', 'Windy', 'Play Golf'], dtype='object')

In [43]:
playgolf.head()

,Outlook,Temp,Humidity,Windy,Play Golf
0,Rainy,Hot,High,False,No
1,Rainy,Hot,High,True,No
2,Overcast,Hot,High,False,Yes
3,Sunny,Mild,High,False,Yes
4,Sunny,Cool,Normal,False,Yes


In [44]:
def gini(dataset):
    """
    A function that calculates the Gini index of a given list.
    
    Input
     - dataset: a list of labels.
    Output
     - impurity: The Gini index of the list.
    
    You do not need to keep the output name of this function. The grade only depends on the correct outputs.
    
    """
    label_count = Counter(dataset)
    total_samples = len(dataset)

    gini_score = 1.0
    for label in label_count:
        probability = label_count[label] / total_samples
        gini_score = gini_score - probability**2

    
    return gini_score

- Your Gini index is expected to have the following results:
  - `0.5` for `[0,0,1,1]`
  - `0.4082` for `[0,0,0,0,0,1,1]`

In [45]:
gini([0,0,0,0,0,1,1])

0.40816326530612246
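
The value above can be checked by hand. For `[0,0,0,0,0,1,1]` the class proportions are 5/7 and 2/7, so the Gini index is 1 - (5/7)^2 - (2/7)^2. A short sketch using exact fractions:

```python
from fractions import Fraction

# Labels [0,0,0,0,0,1,1]: five 0s and two 1s out of seven.
p0 = Fraction(5, 7)
p1 = Fraction(2, 7)

gini_exact = 1 - p0**2 - p1**2  # 1 - 25/49 - 4/49
print(gini_exact, float(gini_exact))
```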

- Report the Gini score of the `Windy` attribute of **playgolf** to `gini_score` (0.2 pt).

In [46]:



gini_score = gini(playgolf["Windy"])



- Print your score here!

In [47]:
gini_score

0.48979591836734704

In [48]:
def entropy(dataset):
    """
    A function that calculates the entropy of a given list.
    
    Input
     - dataset: a list of labels.
    Output
     - impurity: entropy value of the list.
    
    You do not need to keep the output name of this function. The grade only depends on the correct outputs.
    
    """
    label_count = Counter(dataset)

    total_samples = len(dataset)
    entropy_val=0.0
    for label in label_count:
        probability = label_count[label]/total_samples
        entropy_val -= probability * math.log2(probability)
    return entropy_val

- Your entropy is expected to have the following results:
  - `1.0` for `[0,0,1,1]`
  - `0.8631` for `[0,0,0,0,0,1,1]`

In [49]:
entropy([0,0,1,1])

1.0

In [50]:
entropy([0,0,0,0,0,1,1])

0.863120568566631
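
The same list can be worked through term by term for entropy: each class contributes -p * log2(p), and the contributions are summed. A minimal sketch:

```python
import math

# Labels [0,0,0,0,0,1,1]: class probabilities 5/7 and 2/7.
probabilities = [5 / 7, 2 / 7]

# Each term is -p * log2(p); the terms are summed over the classes.
terms = [-p * math.log2(p) for p in probabilities]
entropy_value = sum(terms)
print(terms, entropy_value)
```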

In [51]:
entropy_score = entropy(playgolf["Play Golf"])

- Print your score here!

In [52]:
entropy_score

0.9402859586706311

In [53]:
def information_gain(labels_start, labels_split):
    """
    Calculate information gain when we have information on label distribution before and after the split operation.
    This information gain function receives two values:
    
    Input:
      - labels_start: A single list of all current labels
        e.g.) [0,0,0,0,1,1,1,1]
      - labels_split: A list of lists representing the split 
        e.g.) [ [0,0,1,1], [1,1,0,0] ]
    
    Then, we can calculate information gain by calculating the Gini index before splitting,
    and subtract (Gini index of the subset * proportion of the subset) for each list after splitting from there.
    
    Output:
      - info_gain: Information gain
    
    You do not need to keep the output name of this function. The grade only depends on the correct outputs.
    
    """
    total_samples_start = len(labels_start)
    gini_start = gini(labels_start)  
    
    gini_after_split = 0.0
    for subset in labels_split:
        total_samples_subset = len(subset)
        gini_subset = gini(subset)  
        gini_after_split += (total_samples_subset / total_samples_start) * gini_subset

    info_gain = gini_start - gini_after_split

    
    return info_gain

In [54]:
information_gain([0,0,0,0,1,1,1,1], [[0,0,1,0],[1,0,1,1]])

0.125

- Your information gain is expected to have the following results:
  - `0.0` for `[0,0,0,0,1,1,1,1], [[0,0,1,1],[0,0,1,1]]`
  - `0.5` for `[0,0,0,0,1,1,1,1], [[0,0,0,0],[1,1,1,1]]`
  - `0.125` for `[0,0,0,0,1,1,1,1], [[0,0,1,0],[1,0,1,1]]`
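
The third expected result can be traced step by step. Note that this "information gain" is computed from the Gini index rather than from entropy. The helper `gini_of` below is a stand-in written for this sketch so that it runs on its own:

```python
from collections import Counter

def gini_of(labels):
    # Gini index: 1 minus the sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

parent = [0, 0, 0, 0, 1, 1, 1, 1]
children = [[0, 0, 1, 0], [1, 0, 1, 1]]

parent_gini = gini_of(parent)                                # 0.5
child_ginis = [gini_of(child) for child in children]         # 0.375 each
weights = [len(child) / len(parent) for child in children]   # 0.5 each
weighted = sum(w * g for w, g in zip(weights, child_ginis))  # 0.375

gain = parent_gini - weighted
print(parent_gini, child_ginis, weighted, gain)
```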

In [55]:
labels_start = [1,2,1,2,2,1,2,1,3,3,3]
labels_split = [[3,3,3],[1,2,1,1],[2,2,1,2]]

In [56]:
info_gain_score = information_gain(labels_start, labels_split)

- Print your score here!

In [57]:
info_gain_score

0.38842975206611574

In [58]:
def split(X, y, attr):
    split_attrs = []
    split_labels = []
    
    for val in X[attr].unique():
        attr_subset = []
        label_subset = []
        
        for idx, row in X.iterrows():
            
            if row[attr] == val:
                attr_subset.append(row)
                label_subset.append(y[idx])
                
        split_attrs.append(pd.DataFrame(attr_subset))
        split_labels.append(label_subset)
        
    return split_attrs, split_labels

Check out the result by running the function below, and also check the `Windy` column to understand what the function does.

In [59]:
split(playgolf.drop('Play Golf', axis=1), playgolf['Play Golf'], 'Windy')

([     Outlook  Temp Humidity  Windy
  0      Rainy   Hot     High  False
  2   Overcast   Hot     High  False
  3      Sunny  Mild     High  False
  4      Sunny  Cool   Normal  False
  7      Rainy  Mild     High  False
  8      Rainy  Cool   Normal  False
  9      Sunny  Mild   Normal  False
  12  Overcast   Hot   Normal  False,
       Outlook  Temp Humidity  Windy
  1      Rainy   Hot     High   True
  5      Sunny  Cool   Normal   True
  6   Overcast  Cool   Normal   True
  10     Rainy  Mild   Normal   True
  11  Overcast  Mild     High   True
  13     Sunny  Mild     High   True],
 [['No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes'],
  ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']])

In [60]:
playgolf['Windy']

0     False
1      True
2     False
3     False
4     False
5      True
6      True
7     False
8     False
9     False
10     True
11     True
12    False
13     True
Name: Windy, dtype: bool
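
The provided `split` function loops over every row with `iterrows`, which becomes slow on larger data. As an illustration only, the hypothetical helper `split_with_masks` below should produce the same groupings with boolean masks. It is not part of the assignment and uses made-up data:

```python
import pandas as pd

def split_with_masks(X, y, attr):
    # Hypothetical alternative to `split`: one sub-frame and one label list per attribute value.
    split_attrs, split_labels = [], []
    for val in X[attr].unique():  # same value order as `split`
        mask = X[attr] == val
        split_attrs.append(X[mask])
        split_labels.append(y[mask].tolist())
    return split_attrs, split_labels

# Toy data (illustrative only).
X_toy = pd.DataFrame({"Windy": [False, True, False, True],
                      "Temp": ["Hot", "Hot", "Mild", "Cool"]})
y_toy = pd.Series(["No", "No", "Yes", "Yes"])

parts, labels = split_with_masks(X_toy, y_toy, "Windy")
print(labels)
```

Because the mask is aligned on the index, this sketch assumes `X` and `y` share the same index, as they do for `playgolf`.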

In [61]:
def select_attributes(X, strategy):
    """
    Input
        - X: Attributes of the node.
        - strategy: a strategy for the number of attributes the algorithm chooses.
    Output
        - attributes: a list of selected attributes

    Step 1: Check the strategy. If the type of strategy is an integer, the number of attributes to choose will be that number.
            If it's "sqrt", then it will be the square root of the column size (rounded down if it's a floating point number). 
            If "max", it's the size of the dataset's columns. Put the appropriate value into 'num_attr'.
    Step 2: Choose 'num_attr' column names from X.columns. Ensure that 'num_attr' is not greater than the number of columns.
            You can use np.random.choice or equivalent. Assign the result to 'attributes'.
    Step 3: Return 'attributes'.

    """
    ## step 1
    if isinstance(strategy, int):
        num_attr = strategy
    elif strategy == "sqrt":
        num_attr = int(np.sqrt(X.shape[1]))
    elif strategy == "max":
        num_attr = X.shape[1]
    else:
        raise ValueError("Invalid strategy. Use an integer, 'sqrt', or 'max'.")

    ## Step 2
    if num_attr > X.shape[1]:
        num_attr = X.shape[1]  
    
    attributes = np.random.choice(X.columns, num_attr, replace=False)

    ## Step 3
    return attributes

- You can test your method here!

In [62]:
select_attributes(playgolf.drop('Play Golf', axis=1), 2)

array(['Outlook', 'Humidity'], dtype=object)

In [63]:
select_attributes(playgolf.drop('Play Golf', axis=1), "max")

array(['Temp', 'Windy', 'Humidity', 'Outlook'], dtype=object)

In [64]:

def check_info_gain_per_attribute(X, y, attributes):
    """
    Input
        - X: Attributes of the node.
        - y: dataset labels.
        - attributes: the selected attributes to test.
    Output
        - best_attr: The best attribute in terms of information gain.
        - best_info_gain: The information gain value when the dataset is split by the best attribute.
        
    Step 1: Initialize two variables: Set best_info_gain to zero and best_attr to None.
    Step 2: Iterate over the attributes we get as input.
            For each chosen attribute, 'split' the dataset using the split function we have offered.
            This will return sets of attributes and labels. Save the split attributes and labels.
    Step 3: Calculate the information gain of the current split in the iteration. 
            Use the information_gain function you created and the label information from Step 2.
    Step 4: Compare it to the current best gain. If the new gain is strictly higher (not merely equal), update best_info_gain and best_attr.
    Step 5: Return best_attr, best_info_gain.
    
    
    """
    ## Step 1
    best_info_gain = 0 
    best_attr = None
    
    ## Steps 2-4
    for attr in attributes:
        ## Step 2: split the labels by the current attribute
        _, split_labels = split(X, y, attr)
        ## Step 3: information gain of this split
        current_information_gain = information_gain(y, split_labels)
        ## Step 4: keep the attribute only if it is strictly better
        if current_information_gain > best_info_gain:
            best_info_gain = current_information_gain
            best_attr = attr
    ## Step 5
    return best_attr, best_info_gain

In [65]:
attributes = ["Windy", "Outlook"]
best_attr, best_gain = check_info_gain_per_attribute(X=playgolf.drop("Play Golf", axis=1), y=playgolf["Play Golf"], attributes=attributes)

In [66]:
print("The attribute with the highest information gain is:", best_attr)
print("The information gain value is:", best_gain)

The attribute with the highest information gain is: Outlook
The information gain value is: 0.11632653061224485


In [67]:
# 0.2 pt
def best_split(X, y, strategy):
    """
    Input
        - X: Attributes of the node.
        - y: dataset labels.
        - strategy: a strategy for the number of attributes the algorithm chooses.
    Output
        - best_feature: The best feature in terms of information gain.
        - best_gain: The information gain value when the dataset is split by the best feature.

    Complete the function following the instructions above
    """
    selected_attributes = select_attributes(X,strategy)
    best_attr, best_info_gain = check_info_gain_per_attribute(X, y,selected_attributes )
    
    return best_attr, best_info_gain

In [68]:
np.random.seed(RANDOM_STATE)
strategy = "sqrt"
best_attr_playgolf, best_gain_playgolf = best_split(X = playgolf.drop("Play Golf", axis = 1), y = playgolf["Play Golf"], strategy = strategy)

In [69]:
# TEST YOUR RESULT HERE
best_attr_playgolf, best_gain_playgolf

('Windy', 0.030612244897959162)

In [70]:
# 0.5 pt


def build(X, y, strategy, max_depth = 5, min_samples_leaf = 5, tol=0.00001, _depth = 0):
    """
    Input
        - X: Attributes of the data
        - y: dataset labels
        - strategy: a strategy for the number of attributes the algorithm chooses.
        - max_depth: maximum allowed depth of the tree
        - min_samples_leaf: minimum number of data instances required to continue
        - tol: information gain tolerance value.
        - _depth: current depth of tree starting from zero (root). Only controlled by the algorithm.
    Output
        - node: a leaf or middle node.
    
    Step 0: Consider some stopping criteria. We do not continue this function if ONE of the following conditions is met:
        1. if the current depth number is bigger than max_depth
        2. if the current sample size is smaller than min_samples_leaf
      Check these conditions and terminate the function if required. When terminating, return {"type": "leaf", "majority": the most common class label (y)}.
    Step 1: Run the best split function to get the best attributes and the best information gain for the node.
    Step 2: Examine the best information gain value. If it is lower than the tolerance value (tol), 
            return the node with the best information gain value. The node should be a dictionary form 
            {"type": "leaf", "gain": the best information gain, "majority": the most common class label (y)}.
    Step 3: If the best information gain is higher, split the dataset with the chosen best attribute.
    Step 4: Create an empty list called "branches" to save all the branches of the current node. 
            This branches list will contain the returned values of the recursive calls of the build function.
            Remember to increase the depth value so we can trace the max_depth.
            Note that, depending on your implementation, the indices of X and y may mismatch. Always make sure that your features and labels are correctly aligned.
    Step 5: Run this 'build' function recursively for each split attribute and label and store the result
            to the "branches" list **only if the returned value is not False (from termination)**.
    Step 6: After all the recursion process is done, return the root node with its best attribute, branch information (i.e., "branches" list),
            and the best information gain (fill in the right variables into the dictionary!).
    """
    ## Step 0
    if _depth > max_depth or len(X) < min_samples_leaf:
        majority_class = y.mode().iloc[0]
        return {"type": "leaf", "majority": majority_class}

    ## Step 1
    best_attribute, best_gain = best_split(X, y, strategy)
    print(best_attribute,best_gain)

    ## Step 2
    ## Step 3
    if best_gain < tol:
        majority_class = y.mode().iloc[0]
        return {"type": "leaf", "gain": best_gain, "majority": majority_class}
    
    ## step4
    ## step 5
    branches = []

    for value in set(X[best_attribute]):
        X_subset = X[X[best_attribute] == value]
        y_subset = y[X[best_attribute] == value]
        print('2')

        subtree = build(X_subset, y_subset, strategy, max_depth, min_samples_leaf, tol, _depth + 1)
        

        if subtree:
            branches.append({"value": value, "subtree": subtree})

    # Step 6: Return the Root Node
    return {"type": "node", "attribute": best_attribute, "branches": branches, "gain":best_gain}
 

In [71]:
np.random.seed(RANDOM_STATE)
single_tree = build(playgolf.iloc[:,:-1], playgolf.iloc[:,-1],"sqrt")

Windy 0.030612244897959162
2
Humidity 0.125
2
2
2
Outlook 0.33333333333333337
2
2
2


- Print your result here

In [72]:
single_tree

{'type': 'node',
 'attribute': 'Windy',
 'branches': [{'value': False,
   'subtree': {'type': 'node',
    'attribute': 'Humidity',
    'branches': [{'value': 'High',
      'subtree': {'type': 'leaf', 'majority': 'No'}},
     {'value': 'Normal', 'subtree': {'type': 'leaf', 'majority': 'Yes'}}],
    'gain': 0.125}},
  {'value': True,
   'subtree': {'type': 'node',
    'attribute': 'Outlook',
    'branches': [{'value': 'Overcast',
      'subtree': {'type': 'leaf', 'majority': 'Yes'}},
     {'value': 'Sunny', 'subtree': {'type': 'leaf', 'majority': 'No'}},
     {'value': 'Rainy', 'subtree': {'type': 'leaf', 'majority': 'No'}}],
    'gain': 0.33333333333333337}}],
 'gain': 0.030612244897959162}
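
A tree in this dictionary format can be used for prediction by following, at each node, the branch whose value matches the sample until a leaf is reached. The helper `predict_one` and the hand-written `toy_tree` below are hypothetical illustrations, not part of the assignment:

```python
def predict_one(tree, row, default=None):
    # Hypothetical helper: follow branches matching the row's values until a leaf is reached.
    node = tree
    while node["type"] == "node":
        value = row[node["attribute"]]
        matches = [b["subtree"] for b in node["branches"] if b["value"] == value]
        if not matches:  # value not seen when the tree was built
            return default
        node = matches[0]
    return node["majority"]

# A small tree in the same dictionary format as `single_tree` (hand-written for illustration).
toy_tree = {
    "type": "node", "attribute": "Windy", "gain": 0.03,
    "branches": [
        {"value": False, "subtree": {"type": "leaf", "majority": "Yes"}},
        {"value": True, "subtree": {
            "type": "node", "attribute": "Outlook", "gain": 0.33,
            "branches": [
                {"value": "Overcast", "subtree": {"type": "leaf", "majority": "Yes"}},
                {"value": "Rainy", "subtree": {"type": "leaf", "majority": "No"}},
            ]}},
    ],
}

print(predict_one(toy_tree, {"Windy": True, "Outlook": "Rainy"}))
print(predict_one(toy_tree, {"Windy": False, "Outlook": "Sunny"}))
```

Returning a default for unseen attribute values is one possible design choice. Falling back to the majority class of the current node would be another, but the nodes built here do not store it.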

# 2. Evaluation 

In [73]:
# Subtask 1: 0.1 pt
y_train_numeric = y_train.map({"<=50K": 0, ">50K":1})
y_test_numeric = y_test.map({"<=50K":0,">50K":1})


Check if you successfully replaced the values here.

In [74]:
y_train_numeric.unique(), y_test_numeric.unique()

(array([1, 0], dtype=int64), array([0, 1], dtype=int64))

In [75]:
# Subtask 2: 0.2 pt

svc_classification = SVC(kernel="poly")
svc_classification.fit(X_train_st,y_train_numeric)
y_pred_svc = svc_classification.predict(X_test_st)
precision_score_svc = precision_score(y_test_numeric, y_pred_svc, average="macro")
recall_score_svc = recall_score(y_test_numeric, y_pred_svc, average="macro")
f1_score_svc = f1_score(y_test_numeric, y_pred_svc, average="macro")

Print three scores here.

In [76]:
precision_score_svc, recall_score_svc, f1_score_svc

(0.8110794247077231, 0.7462773096476087, 0.7695547365499339)
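The scores above use `average="macro"`, which is the unweighted mean of the per-class scores. A small sketch on toy labels (the arrays `y_true_demo` and `y_pred_demo` are made up for illustration and are not the census data) shows the relationship:

```python
import numpy as np
from sklearn.metrics import precision_score

# Toy labels, chosen only for illustration.
y_true_demo = np.array([0, 0, 0, 0, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 0])

# average=None returns one precision value per class, ordered by label.
per_class_precision = precision_score(y_true_demo, y_pred_demo, average=None)

# average="macro" is the plain mean of those values, ignoring class sizes.
macro_precision = precision_score(y_true_demo, y_pred_demo, average="macro")

print(per_class_precision, macro_precision)
```

Because every class counts equally, a weak minority class lowers the macro score more than it would lower a support-weighted average.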

In [77]:
rf_10 = RandomForestClassifier(random_state=RANDOM_STATE)


rf_10.fit(X_train_st, y_train_numeric)

In [78]:

# Accuracy measured on the training data (X_train_st), not on the held-out test set.
y_pred_rf = rf_10.predict(X_train_st)
accuracy_rf = accuracy_score(y_train_numeric, y_pred_rf)
accuracy_rf

0.999875668282979

In [79]:

y_prob_rf = rf_10.predict_proba(X_test_st)[:, 1] 

auc_rf = roc_auc_score(y_test_numeric, y_prob_rf, average="weighted")
auprc_rf = average_precision_score(y_test_numeric, y_prob_rf, average="weighted")

Print your scores here!

In [80]:
(accuracy_rf, auc_rf, auprc_rf)

(0.999875668282979, 0.9001796618706874, 0.7680117192095133)
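Note that `accuracy_rf` above is computed from predictions on `X_train_st`, while the AUC and AUPRC use `X_test_st`. A fully grown random forest usually fits its training data almost perfectly, so the two accuracies can differ a lot. The sketch below illustrates the gap on synthetic data; the data and all variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with some label noise (not the census data used above).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Accuracy on data the model has seen versus data it has not.
train_accuracy = accuracy_score(y_tr, forest.predict(X_tr))
test_accuracy = accuracy_score(y_te, forest.predict(X_te))

# Ranking metrics need scores for the positive class, not hard labels.
proba_pos = forest.predict_proba(X_te)[:, 1]
test_auc = roc_auc_score(y_te, proba_pos)
test_auprc = average_precision_score(y_te, proba_pos)

print(train_accuracy, test_accuracy, test_auc, test_auprc)
```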

In [81]:

knn_classifier = KNeighborsClassifier()


param_grid = {"n_neighbors": list(range(1, 11))}


scoring_measures = ["average_precision", "precision"]


auprc_best_classifier = {}
precision_best_classifier = {}
for scoring in scoring_measures:
    grid_search = GridSearchCV(knn_classifier, param_grid, scoring=scoring, cv=3)
    grid_search.fit(X_train_st, y_train_numeric)
    
    # Get the best classifier and its parameters
    best_classifier = grid_search.best_estimator_
    best_params = grid_search.best_params_
    
    if scoring == "average_precision":
        auprc_best_classifier[scoring] = (best_classifier, best_params)
    elif scoring == "precision":
        precision_best_classifier[scoring] = (best_classifier, best_params)

print("Best kNN Classifiers for AUPRC:")
for scoring, (classifier, params) in auprc_best_classifier.items():
    print(f"Scoring measure: {scoring}")
    print(f"Best kNN Classifier: {classifier}")
    print(f"Best Parameters: {params}")
    print()

print("Best kNN Classifiers for Precision:")
for scoring, (classifier, params) in precision_best_classifier.items():
    print(f"Scoring measure: {scoring}")
    print(f"Best kNN Classifier: {classifier}")
    print(f"Best Parameters: {params}")
    print()

Traceback (most recent call last):
  File "C:\Users\glout\AppData\Roaming\Python\Python311\site-packages\sklearn\metrics\_scorer.py", line 459, in _score
    y_pred = method_caller(clf, "decision_function", X, pos_label=pos_label)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\glout\AppData\Roaming\Python\Python311\site-packages\sklearn\metrics\_scorer.py", line 86, in _cached_call
    result, _ = _get_response_values(
                ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\glout\AppData\Roaming\Python\Python311\site-packages\sklearn\utils\_response.py", line 73, in _get_response_values
    prediction_method = _check_response_method(estimator, response_method)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\glout\AppData\Roaming\Python\Python311\site-packages\sklearn\utils\validation.py", line 1940, in _check_response_method
    raise AttributeError(
AttributeError: KNeighborsClassifier has none o

Best kNN Classifiers for AUPRC:
Scoring measure: average_precision
Best kNN Classifier: KNeighborsClassifier(n_neighbors=1)
Best Parameters: {'n_neighbors': 1}

Best kNN Classifiers for Precision:
Scoring measure: precision
Best kNN Classifier: KNeighborsClassifier(n_neighbors=1)
Best Parameters: {'n_neighbors': 1}



Check your best classifiers here!

In [82]:
(auprc_best_classifier, precision_best_classifier)

({'average_precision': (KNeighborsClassifier(n_neighbors=1),
   {'n_neighbors': 1})},
 {'precision': (KNeighborsClassifier(n_neighbors=1), {'n_neighbors': 1})})
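The traceback in the output above indicates that at least one scoring call raised an error. By default `GridSearchCV` records NaN when a fit or a score fails (`error_score=np.nan`), so it may be worth inspecting `cv_results_` before trusting the selected `n_neighbors`. The sketch below shows one way to do that on synthetic data; the data, the smaller grid, and the names `demo_search` and `summary` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the census data used above).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_grid = {"n_neighbors": [1, 3, 5]}
demo_search = GridSearchCV(
    KNeighborsClassifier(), demo_grid, scoring="average_precision", cv=3
)
demo_search.fit(X_demo, y_demo)

# One row per candidate: its parameter, mean CV score, and rank.
summary = pd.DataFrame(demo_search.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "rank_test_score"]
]
print(summary)

# A non-zero count here means some candidates were never really scored.
n_failed = int(summary["mean_test_score"].isna().sum())
print("candidates with NaN score:", n_failed)
```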

In [83]:
def accuracy_manual(truth, predicted):
    accurate= np.sum(truth==predicted)
    return accurate/len(truth)

In [84]:
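def precision_manual(truth, predicted):
    tp = np.sum((truth == 1) & (predicted == 1))
    fp = np.sum((truth == 0) & (predicted == 1))
    # Guard on the denominator so that no positive predictions yields 0, not a division by zero.
    return tp / (tp + fp) if (tp + fp) > 0 else 0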

In [85]:
def recall_manual(truth, predicted):
    tp = np.sum((truth == 1) & (predicted == 1))
    # Actual positives predicted as negative are false negatives.
    fn = np.sum((truth == 1) & (predicted == 0))
    return tp / (tp + fn) if (tp + fn) > 0 else 0

In [86]:
def f1_manual(truth, predicted):
    precision = precision_manual(truth, predicted)
    recall = recall_manual(truth, predicted)
    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    return f1_score

Assign the results of your four functions on the two arrays (`truth`, `predicted`).

In [87]:
truth     = np.array([0,1,0,1,1,1,1,0,0,1,1,0,0,1,1,0,1])
predicted = np.array([1,0,0,0,1,0,0,1,1,0,1,1,1,0,1,1,1])

In [88]:
accuracy_score_manual = accuracy_manual(truth,predicted)
precision_score_manual = precision_manual(truth,predicted)
recall_score_manual = recall_manual(truth,predicted)
f1_score_manual = f1_manual(truth,predicted)

Show your results here!

In [89]:
(accuracy_score_manual, precision_score_manual, recall_score_manual, f1_score_manual)

(0.29411764705882354, 0.4, 0.4, 0.4000000000000001)
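As a sanity check, the four values above can be reproduced from the confusion counts and compared with scikit-learn's own metrics. This is a sketch for verification only; it recomputes the counts directly instead of calling the graded functions, and the names ending in `_check` are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

truth = np.array([0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1])
predicted = np.array([1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1])

# Confusion counts for the positive class (label 1).
tp = int(np.sum((truth == 1) & (predicted == 1)))
fp = int(np.sum((truth == 0) & (predicted == 1)))
fn = int(np.sum((truth == 1) & (predicted == 0)))
tn = int(np.sum((truth == 0) & (predicted == 0)))

accuracy_check = (tp + tn) / len(truth)
precision_check = tp / (tp + fp)
recall_check = tp / (tp + fn)
f1_check = 2 * precision_check * recall_check / (precision_check + recall_check)

# scikit-learn's binary metrics should agree with the hand-computed values.
print(np.isclose(accuracy_check, accuracy_score(truth, predicted)))
print(np.isclose(precision_check, precision_score(truth, predicted)))
print(np.isclose(recall_check, recall_score(truth, predicted)))
print(np.isclose(f1_check, f1_score(truth, predicted)))
```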

In [90]:
diabetes = pd.read_csv("diabetes.csv")

In [91]:
diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


- Split the dataset into two parts: attributes (`X`) and labels (Outcome, `y`).

In [92]:
X = diabetes.drop("Outcome",axis=1)
y = diabetes["Outcome"]

Your task is as follows:

1. Use scikit-learn's `train_test_split` function to divide the dataset into four parts (`X_train`, `X_test`, `y_train`, `y_test`).
- Follow the instructions below carefully to get the point!
    - Use `X` and `y`.
    - Assign 10% to a test set.
    - Use our random state (`RANDOM_STATE`).
    - Enable the stratify option.


In [93]:
# Remove the assigned values and write the train_test_split function
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size= 0.10,random_state=RANDOM_STATE,stratify=y)
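Passing `stratify=y` asks `train_test_split` to keep the class proportions roughly equal in both parts. The sketch below checks this on a made-up imbalanced label vector; the arrays and the seed `0` are assumptions and stand in for `X`, `y`, and `RANDOM_STATE`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 30% positives, 70% negatives.
y_demo = np.array([1] * 60 + [0] * 140)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=0, stratify=y_demo
)

# With stratification the positive rate should be close to 0.30 in both parts.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(len(y_tr), len(y_te), train_rate, test_rate)
```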

In [94]:
svc_classifier = SVC()
# Two sub-grids: the polynomial kernel varies the degree; linear and RBF kernels vary C.
param_grid = [
    {"kernel": ["poly"], "degree": [3, 4, 5]},
    {"kernel": ["linear", "rbf"], "C": [5, 10, 100]},
]
grid_search = GridSearchCV(svc_classifier, param_grid, cv=3, scoring="average_precision")
grid_search.fit(X_train, y_train)

In [95]:
svm_best_classifier_dash= grid_search.best_estimator_

- Show your classifier here!

In [96]:
svm_best_classifier_dash

5. Using the best classifier, compute the **test** score and assign it to `svm_test_score`.

In [97]:
svm_test_score=grid_search.score(X_test, y_test)
svm_test_score

0.7275277524867343
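Because the grid search was created with `scoring="average_precision"`, `grid_search.score` reports average precision rather than accuracy. For an `SVC` without probability estimates, that score is based on `decision_function` values. The sketch below checks this equivalence on synthetic data; the data, the smaller grid, and the names `demo_search`, `score_from_search`, and `score_by_hand` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (not the diabetes data used above).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

demo_search = GridSearchCV(
    SVC(), {"kernel": ["linear", "rbf"], "C": [1, 10]}, cv=3, scoring="average_precision"
)
demo_search.fit(X_tr, y_tr)

# Score reported by the search object, using its configured scorer.
score_from_search = demo_search.score(X_te, y_te)

# The same quantity computed by hand from the refitted best estimator.
decision_values = demo_search.best_estimator_.decision_function(X_te)
score_by_hand = average_precision_score(y_te, decision_values)

print(score_from_search, score_by_hand)
```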

In [98]:
import pickle

model_name = "model_diabetes.pickle"
data_to_save = svm_best_classifier_dash
file_path = model_name
# Serialize the best classifier to disk; pickle requires binary mode.
with open(file_path, "wb") as writeFile:
    pickle.dump(data_to_save, writeFile)
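To confirm that a saved model can be used later, it can be loaded back and compared with the original. The sketch below does a round trip in a temporary directory so that it does not touch `model_diabetes.pickle`; the synthetic data, the small `SVC`, and the file name `model_demo.pickle` are assumptions for illustration. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Small stand-in model trained on synthetic data.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = SVC(kernel="linear").fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_demo.pickle")
    # Write in binary mode, then read the same file back.
    with open(path, "wb") as write_file:
        pickle.dump(model, write_file)
    with open(path, "rb") as read_file:
        restored = pickle.load(read_file)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```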