<a href="https://colab.research.google.com/github/fatemakotha/1804-Applied-Machine-Learning/blob/main/Lab3_PartA_Data_Pre_processing_with_testing_revised.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

___
# COMP1804 Lab 3 - Data Pre-processing Revised



**Learning Objectives:**
 *  Understand the stages of data pre-processing.
 *  Use Python for pre-processing. 
 *  Practice the different stages of data pre-processing in Python.
___


Note: this is a smaller version of week 2's tutorial. The added exercise is the application of all the data pre-processing steps to both the training and the test dataset.

## 1. Loading the Dataset


For this Lab, I have used a subset of the Loan Prediction dataset. You can download the training and testing dataset from Moodle: [Download Data](https://moodlecurrent.gre.ac.uk/mod/resource/view.php?id=1730057)


Now, let's get started by importing the necessary packages and the dataset.

**1.1 Import the necessary Python modules**

In [1]:
PY = False # change this to True if you're running the .py version of this notebook

In [2]:
# Load python modules

import numpy as np # numpy is a library that allows us to work with vectors and matrices
import matplotlib.pyplot as plt # visualisation library
import pandas as pd # pandas is a library that allows us to work with DataFrames 

# On some level, dataframes are enhanced matrices where we have assigned names to each
# row and each column.


In [3]:
from IPython.display import HTML, display # display is imported explicitly so the .py version also works
def pretty_print_df(value_counts_):
  "Quick function to display value counts more nicely"
  display(HTML(pd.DataFrame(value_counts_).to_html()))


**1.2 Load Dataset**

Note: Download the csv files from the URL to your local drive and load from there as shown in the code below.

We are using pandas to load the data. We will also use pandas next to explore the data both with descriptive statistics and data visualization.


In [4]:
# Load dataset from local drive (for colab notebook)
if PY:
    # note: the lines below assume that the data is found in the same folder as the script.
    # If that's not the case, you need to change the argument of read_csv to the full
    # path of each dataset
    X = pd.read_csv('X_train.csv')
    Y = pd.read_csv('Y_train.csv')
    
    #train_dataset = pd.read_csv('X_train.csv')
    #train_labels = pd.read_csv('Y_train.csv')
else:
    from google.colab import files
    import io
    # Remember to only upload one file at a time
    uploaded = files.upload()    # Will prompt you to select file: remember to choose the right one!
    X = pd.read_csv(io.BytesIO(uploaded['X_train.csv'])) # python will expect the first file to be called X_train

    uploaded = files.upload()    # Will prompt you to select file
    Y = pd.read_csv(io.BytesIO(uploaded['Y_train.csv']))

    #uploaded = files.upload()    # Will prompt you to select file
    #test_dataset = pd.read_csv(io.BytesIO(uploaded['X_test.csv']))

    #uploaded = files.upload()    # Will prompt you to select file
    #test_labels = pd.read_csv(io.BytesIO(uploaded['Y_test.csv']))

# We need to upload the files in the exact order as we see above!


Saving X_train.csv to X_train.csv


Saving Y_train.csv to Y_train.csv


In [7]:
X.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001031,Male,No,0,Graduate,No,,0.0,125.0,360.0,1.0,Urban
1,LP001032,Male,No,0,Graduate,No,4950.0,0.0,125.0,360.0,1.0,Urban
2,LP001824,Male,Yes,1,Graduate,No,2882.0,1843.0,123.0,480.0,1.0,Semiurban
3,LP002928,Male,Yes,0,Graduate,No,3000.0,3416.0,56.0,180.0,1.0,Semiurban
4,LP001814,Male,Yes,2,Graduate,No,9703.0,0.0,112.0,360.0,1.0,Urban


In [8]:
Y.head()

Unnamed: 0,Target
0,Y
1,Y
2,Y
3,Y
4,Y


First, an observation. This dataset has labels, as made clear by the fact that there is a dedicated dataset storing them (typically, the variable Y holds the labels, or outputs, and the variable X holds the input features). Since we have labels, we can use them to train our model: this is a **supervised** ML task.

**1.2.0 Dataset Split**

Let's split both the input features and the labels into a training set and a test set. Scikit-learn has a helpful function for this: "train_test_split".

The function can split multiple datasets at the same time. Note that if the datasets you need to split are related (that is, each row from one dataset corresponds to the same row from another dataset), you **need** to split them at the same time. Otherwise, the correspondence will get lost.

Exercise 1 below will help you understand this point.

In [9]:
from sklearn.model_selection import train_test_split


train_dataset, test_dataset, train_labels, test_labels = train_test_split(X, Y, test_size= 0.2)


### Exercise 1
We'll create a "toy" dataset, just to showcase what can happen if we don't split things properly.
It's a simple dataset with only 5 datapoints. The labels are the English names of the numbers, and the input features are the same numbers written as digits and in other languages.
(Apologies, I have limited knowledge of other languages! I could only incorporate a few languages, and I probably got something wrong, because I relied on Google Translate. Feel free to make corrections and/or add your own language.)

Your task is to split the input features and the labels **separately** using the `train_test_split` function (use test_size=0.2) and then print out the results.

What do you notice? Are the labels still in the right place? If not, what do you think is happening? (Remember that `train_test_split` splits the dataset randomly, so if we split datasets separately we get different outputs from two different random processes).

In [13]:

# Here we are creating a dataframe by giving it a list of data points.
toy_X = pd.DataFrame([[1, 'daya','एक'],
    [2, 'biyu', 'दो'],
    [3, 'uku', 'तीन'],
    [4, 'hudu', 'चार'],
    [5, 'biyar', 'पांच']], index = [1,2,3,4,5])
toy_Y = pd.DataFrame(['one','two','three','four','five'], index= [1,2,3,4,5])

# YOUR CODE HERE
train_X, test_X = train_test_split(toy_X, test_size = 0.2)
train_y, test_y = train_test_split(toy_Y, test_size = 0.2)

print('Training inputs:')
print(train_X)
print('Training labels') 
print(train_y)

print('Test set inputs')
print(test_X)
print('Test set labels')
print(test_y)

Training inputs:
   0      1     2
2  2   biyu    दो
1  1   daya    एक
4  4   hudu   चार
5  5  biyar  पांच
Training labels
       0
1    one
5   five
3  three
2    two
Test set inputs
   0    1    2
3  3  uku  तीन
Test set labels
      0
4  four


In [11]:
toy_X.head()

Unnamed: 0,0,1,2
1,1,daya,एक
2,2,biyu,दो
3,3,uku,तीन
4,4,hudu,चार
5,5,biyar,पांच


In [12]:
toy_Y.head()

Unnamed: 0,0
1,one
2,two
3,three
4,four
5,five


There is technically another way: you can pass the same `random_state` to every call of `train_test_split`, so that each call shuffles the rows in the same order (this only works if the datasets have the same number of rows). `random_state` is the argument that controls the random process. This works because random processes in Python are only 'pseudo-random': the numbers are generated in a sequence that is very close to random, but the sequence is fully determined by an initial "seed". Using the same seed generates the same sequence, and the seed can be set via the `random_state` argument: `train_test_split(X, y, test_size=0.2, random_state=0)`.
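
To see the idea of a seed in isolation, here is a minimal sketch using NumPy's own generator. It only illustrates pseudo-randomness in general; the names `rng_a`, `rng_b`, `rng_c` are made up for this example, and scikit-learn's internal shuffling may produce different orderings than the ones printed here.

```python
import numpy as np

# Two generators created with the same seed produce identical "random" sequences.
rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(0)
perm_a = rng_a.permutation(5)
perm_b = rng_b.permutation(5)
print(perm_a, perm_b)

# A generator created with a different seed follows its own sequence.
rng_c = np.random.default_rng(1)
perm_c = rng_c.permutation(5)
print(perm_c)
```

The first two permutations are always identical to each other, which is exactly what lets two separate `train_test_split` calls with the same `random_state` pick the same rows.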

Try doing the previous exercise again, but this time add `random_state=0` to all `train_test_split` functions. Do you notice any difference?

In [14]:
# Here we are creating a dataframe by giving it a list of data points.
toy_X = pd.DataFrame([[1, 'daya','एक'],
    [2, 'biyu', 'दो'],
    [3, 'uku', 'तीन'],
    [4, 'hudu', 'चार'],
    [5, 'biyar', 'पांच']], index = [1,2,3,4,5])
toy_Y = pd.DataFrame(['one','two','three','four','five'], index= [1,2,3,4,5])

# YOUR CODE HERE
train_X, test_X = train_test_split(toy_X, test_size = 0.2, random_state=0)
train_y, test_y = train_test_split(toy_Y, test_size = 0.2, random_state=0)

print('Training inputs:')
print(train_X)
print('Training labels') 
print(train_y)

print('Test set inputs')
print(test_X)
print('Test set labels')
print(test_y)


Training inputs:
   0      1     2
1  1   daya    एक
2  2   biyu    दो
4  4   hudu   चार
5  5  biyar  पांच
Training labels
      0
1   one
2   two
4  four
5  five
Test set inputs
   0    1    2
3  3  uku  तीन
Test set labels
       0
3  three
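
Rather than checking the printed rows by eye, the alignment can be tested in code. The sketch below uses a reduced toy dataset (the `demo_` names and column names are assumptions for this example) and splits inputs and labels in a single call, which keeps the rows matched without needing `random_state` at all.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

demo_X = pd.DataFrame({"digit": [1, 2, 3, 4, 5]}, index=[1, 2, 3, 4, 5])
demo_Y = pd.DataFrame({"name": ["one", "two", "three", "four", "five"]},
                      index=[1, 2, 3, 4, 5])

# Splitting both dataframes in ONE call applies the same shuffle to both.
demo_train_X, demo_test_X, demo_train_y, demo_test_y = train_test_split(
    demo_X, demo_Y, test_size=0.2
)

# The row labels (index) of inputs and labels should match in both subsets.
rows_aligned = (demo_train_X.index.equals(demo_train_y.index)
                and demo_test_X.index.equals(demo_test_y.index))
print(rows_aligned)
```

Comparing the index of the inputs with the index of the labels is a quick sanity check worth running after any split.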


**1.2.1 Inspect Dataset**

**1.2.1.1 Dimensions of Dataset**


In [15]:
# Training data
# shape of input:
# The number of rows is the number of data points
# The number of columns is the number of features
print(train_dataset.shape)
# shape of output:
# The number of rows is the number of data points (should be the same as before!)
# The number of columns is the number of labels we want to predict
print(train_labels.shape)

(312, 12)
(312, 1)


In [16]:
# Test data
# shape of input:
# The number of rows is the number of data points
# The number of columns is the number of features (should be the same as for the training!)
print(test_dataset.shape)
# shape of output:
# The number of rows is the number of data points (should be the same as before!)
# The number of columns is the number of labels we want to predict
print(test_labels.shape)

(79, 12)
(79, 1)


In [17]:
# list of column titles 
print(train_dataset.columns)
print(train_labels.columns)

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area'],
      dtype='object')
Index(['Target'], dtype='object')


In [18]:
# list of column (field) data types
print(train_dataset.dtypes)
print()
print(train_labels.dtypes)

# Note: object is a Pandas data type for pretty much anything that is not a number


Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome      float64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
dtype: object

Target    object
dtype: object


**1.2.1.2 Take a peek at the Dataset**

Note that "NaN" stands for "Not a Number". It is not the same as 0: pandas uses NaN to represent empty or missing fields in the data.
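
A minimal sketch of how NaN behaves, using a small made-up Series (the values and the name `demo_incomes` are assumptions for this example):

```python
import numpy as np
import pandas as pd

demo_incomes = pd.Series([4950.0, np.nan, 0.0])

print(demo_incomes.isna().tolist())  # which entries are missing
print(np.nan == 0)                   # NaN is not equal to zero
print(np.nan == np.nan)              # NaN is not even equal to itself
print(demo_incomes.mean())           # pandas skips NaN by default when averaging
```

Because NaN never compares equal to anything, missing values should be detected with `isna()` rather than with `==`.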


In [19]:
# you can show the first N rows in a dataframe with the function "head"
print(train_dataset.head(10))
# note: in the unshuffled data (see X.head() above) the first row has a missing ApplicantIncome! We'll get back to it.
# Also notice how the first two rows of X are almost identical, aside from the Loan_ID and the missing value
# Could this be the same application submitted twice because the first time someone forgot to add the income?

      Loan_ID  Gender Married Dependents     Education Self_Employed  \
299  LP002555    Male     Yes          2      Graduate           Yes   
238  LP002097    Male      No          1      Graduate            No   
144  LP002659    Male     Yes         3+      Graduate            No   
242  LP001028    Male     Yes          2      Graduate            No   
139  LP002693    Male     Yes          2      Graduate           Yes   
355  LP002984    Male     Yes          2      Graduate            No   
83   LP001404  Female     Yes          0      Graduate            No   
93   LP001493    Male     Yes          2  Not Graduate            No   
363  LP001003    Male     Yes          1      Graduate            No   
114  LP001430  Female      No          0      Graduate            No   

     ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
299           4583.0             2083.0       160.0             360.0   
238           4384.0             1793.0       117.0          

## 2. Data cleaning

(If you want to check again the EDA steps, check the notebook from last week.)


In [20]:
# Data cleaning steps:
# 1. replace 'no' with 'No' for 'Self_Employed'
# remember that .loc is a way to access a subset of the dataframe
# We can write df.loc[condition_A, condition_B] where condition_A determines
# which rows we select and condition_B determines which columns we select
train_dataset.loc[train_dataset.Self_Employed=='no','Self_Employed'] = 'No'

# 2. replace negative loan amount values with NaN 
# (we use np.nan because that's how the other missing values are represented in this dataset)
train_dataset.loc[train_dataset.LoanAmount<0,'LoanAmount'] = np.nan

# 3. replace out of scope Property_Area values with NaN
accepted_property_areas = ['Urban','Rural','Semiurban']
# remember the lambda notation: it is an anonymous function that lets us specify
# a given (and easy) transformation for its input (in this case, x)
train_dataset.loc[train_dataset.Property_Area.map(lambda x: x not in accepted_property_areas),'Property_Area'] = np.nan


### Exercise 2
Repeat all the steps above for the test dataset

In [21]:
# YOUR CODE HERE
# 1. replace 'no' with 'No' for 'Self_Employed'
test_dataset.loc[test_dataset.Self_Employed=='no','Self_Employed'] = 'No'

# 2. replace negative loan amount values with NaN 
test_dataset.loc[test_dataset.LoanAmount<0,'LoanAmount'] = np.nan

# 3. replace out of scope Property_Area values with NaN
accepted_property_areas = ['Urban','Rural','Semiurban']
test_dataset.loc[test_dataset.Property_Area.map(lambda x: x not in accepted_property_areas),'Property_Area'] = np.nan
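
Since the same three steps are applied to both datasets, one option is to wrap them in a function and call it twice. The sketch below is one possible way to do that, not part of the original lab: `clean_loan_data` and the small `demo` dataframe are hypothetical, and `isin` is used in place of the `map`/`lambda` combination.

```python
import numpy as np
import pandas as pd

ACCEPTED_PROPERTY_AREAS = ["Urban", "Rural", "Semiurban"]

def clean_loan_data(df):
    """Return a cleaned copy of df (hypothetical helper, not from the lab)."""
    cleaned = df.copy()
    # 1. replace 'no' with 'No' for Self_Employed
    cleaned.loc[cleaned["Self_Employed"] == "no", "Self_Employed"] = "No"
    # 2. replace negative loan amounts with NaN
    cleaned.loc[cleaned["LoanAmount"] < 0, "LoanAmount"] = np.nan
    # 3. replace out-of-scope Property_Area values with NaN
    out_of_scope = ~cleaned["Property_Area"].isin(ACCEPTED_PROPERTY_AREAS)
    cleaned.loc[out_of_scope, "Property_Area"] = np.nan
    return cleaned

# Made-up rows that trigger each of the three rules.
demo = pd.DataFrame({
    "Self_Employed": ["no", "Yes", "No"],
    "LoanAmount": [125.0, -50.0, 56.0],
    "Property_Area": ["Urban", "Mars", "Rural"],
})
demo_clean = clean_loan_data(demo)
print(demo_clean)
```

With such a helper, the training and test sets would be cleaned with `clean_loan_data(train_dataset)` and `clean_loan_data(test_dataset)`, so the two can never drift apart.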

## 3. Managing Missing Data

We will fill in numerical values with the "average" value and categorical values with the "most common" category (if anything is unclear, refer to week 2 resources).

First, though, let's find out how many missing values (or NaN values) there are in each feature, using the pandas `isna()` method.

In [22]:
# Number of missing values per column
# isna() returns a True/False value for each element in the dataframe:
# True if that value is a NaN value, False otherwise
# calling .sum() sums the number of True values in each column
# So, the output is the number of missing values in each column
train_dataset.isna().sum()

Loan_ID              0
Gender               1
Married              1
Dependents           1
Education            1
Self_Employed        0
ApplicantIncome      3
CoapplicantIncome    2
LoanAmount           2
Loan_Amount_Term     2
Credit_History       1
Property_Area        4
dtype: int64

### Filling in Missing Data

Remember the typical workflow with sklearn transformers. They are classes whose instances are used in three steps:
1. create an instance of a given class (like the imputer class)
2. fit this instance on the training dataset. This is done via the method `.fit()`.
3. use the instance on all the relevant datasets (the training set, plus validation and test sets, as applicable). This is done via the method `.transform()`

Note that sklearn also allows you to fit and transform all at once by calling the method `.fit_transform()`.
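
The three steps can be sketched on a tiny made-up example (the arrays and `demo_` names are assumptions for illustration). Note how the missing value in the test array is filled with the mean learned from the training array, not with the mean of the test array.

```python
import numpy as np
from sklearn.impute import SimpleImputer

demo_train = np.array([[1.0], [3.0], [np.nan]])
demo_test = np.array([[np.nan], [10.0]])

demo_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")  # 1. create
demo_imputer.fit(demo_train)                             # 2. fit on training data only
demo_train_filled = demo_imputer.transform(demo_train)   # 3. transform the training set...
demo_test_filled = demo_imputer.transform(demo_test)     #    ...and the test set

print(demo_imputer.statistics_)   # the mean learned from demo_train
print(demo_test_filled.ravel())   # test NaN replaced by the training mean
```

The learned mean is 2.0 (the average of 1.0 and 3.0), so the test array becomes `[2.0, 10.0]`.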


<br/><br/>

<div>
<img src="https://drive.google.com/uc?export=view&id=1p-R8TJ6xCBaEw4XKDD6F1dCTJp5Vcb8z" width="500"/>
</div>


In [23]:
# handling missing data
from sklearn.impute import SimpleImputer 

train_dataset_no_nans =  train_dataset.copy()

# NUMERICAL FEATURES
# 1. Imputer
imptr_num = SimpleImputer(missing_values = np.nan, strategy = 'mean')  


# 2. Fit the imputer object to the feature matrix (only for numeric features)
numerical_columns = ['ApplicantIncome', 'CoapplicantIncome',
                'LoanAmount', 'Loan_Amount_Term', 'Credit_History']
imptr_num = imptr_num.fit(train_dataset_no_nans[numerical_columns]) # fit the data to estimate the parameters (here, the average value)

# 3. Call transform to replace missing data in train_dataset (on specific columns) with the mean of the column to which that missing data belongs
train_dataset_no_nans[numerical_columns] = \
  imptr_num.transform(train_dataset_no_nans[numerical_columns]) # apply the transformation using the parameters estimated above

# note: any ApplicantIncome value that was missing before has now been replaced by the column mean
print(train_dataset_no_nans.head(5))

# CATEGORICAL FEATURES
# Let's fill all the missing categories with the most common category

# let's list all categorical features
categorical_columns= ['Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'Property_Area']

# 1. Imputer
imptr_cat = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent')  

# 2. Fit the imputer object to the feature matrix (only for categorical features)
imptr_cat = imptr_cat.fit(train_dataset_no_nans[categorical_columns])

# 3. Call transform to replace missing data in train_dataset (on specific columns) with the most frequent category of the column to which that missing data belongs
train_dataset_no_nans[categorical_columns] = imptr_cat.transform(train_dataset_no_nans[categorical_columns])  


      Loan_ID Gender Married Dependents Education Self_Employed  \
299  LP002555   Male     Yes          2  Graduate           Yes   
238  LP002097   Male      No          1  Graduate            No   
144  LP002659   Male     Yes         3+  Graduate            No   
242  LP001028   Male     Yes          2  Graduate            No   
139  LP002693   Male     Yes          2  Graduate           Yes   

     ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
299           4583.0             2083.0       160.0             360.0   
238           4384.0             1793.0       117.0             360.0   
144           3466.0             3428.0       150.0             360.0   
242           3073.0             8106.0       200.0             360.0   
139           7948.0             7166.0       480.0             360.0   

     Credit_History Property_Area  
299             1.0     Semiurban  
238             1.0         Urban  
144             1.0         Rural  
242           

Note: if you want to try more advanced imputation methods, check out [other scikit-learn methods](https://scikit-learn.org/stable/modules/impute.html) or the [autoimpute package](https://kearnz.github.io/autoimpute-tutorials/).

## 4. Encoding categorical Data

Most machine learning algorithms require their inputs in numerical form. Therefore, text values in the columns of a dataset must be converted into numbers.

We will use OneHotEncoder for all the categorical variables.

In [24]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# define the transformation
# ColumnTransformer takes a list of transformations. Each transformation is expressed
# as a tuple (name, transformer, columns).
ct_cat = ColumnTransformer(
    [
        (
            "onehot_categorical", # --> name of the transformation
            OneHotEncoder(), # --> main function to apply 
            categorical_columns, #-->columns to apply it to (we can give more than one column at once!)
        ),
    ],
    remainder="passthrough", #--> what to do with the non-transformed columns. passthrough=keep them
    verbose_feature_names_out=False #--> this keeps columns names simple. Try what happens if you set it as True
)

# the output is a NumPy array with the encoded columns.
encoded_array= ct_cat.fit_transform(train_dataset_no_nans) 

# What if we want a dataframe back? We can combine the array with the info about
# the original and transformed column names.
# This is stored in the ColumnTransformer object, which we called "ct_cat"
# We can access it via the "get_feature_names_out()" method like this:
encoded_col_names= ct_cat.get_feature_names_out() #remember python's dot notation
print(encoded_col_names) #note the combined name: original column + category (e.g. Gender_Female)

train_dataset_no_nans = pd.DataFrame(encoded_array, columns=encoded_col_names)

# make sure numerical columns are of type float
train_dataset_no_nans= train_dataset_no_nans.astype(dtype={col: "float64" for col in numerical_columns})


['Gender_Female' 'Gender_Male' 'Gender_other' 'Married_No' 'Married_Yes'
 'Dependents_0' 'Dependents_1' 'Dependents_2' 'Dependents_3+'
 'Education_Graduate' 'Education_Not Graduate' 'Self_Employed_No'
 'Self_Employed_Yes' 'Property_Area_Rural' 'Property_Area_Semiurban'
 'Property_Area_Urban' 'Loan_ID' 'ApplicantIncome' 'CoapplicantIncome'
 'LoanAmount' 'Loan_Amount_Term' 'Credit_History']


## Label encoding
Let's perform label encoding for the labels.
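
As a minimal sketch of what `LabelEncoder` does, here it is applied to a short made-up list of labels (the `demo_` names are assumptions for this example). The classes are sorted alphabetically and numbered from 0.

```python
from sklearn.preprocessing import LabelEncoder

demo_label_encoder = LabelEncoder()
demo_codes = demo_label_encoder.fit_transform(["Y", "N", "Y", "Y"])

print(demo_label_encoder.classes_)  # classes in sorted order: index 0 is 'N', index 1 is 'Y'
print(demo_codes)                   # each label replaced by the index of its class
print(demo_label_encoder.inverse_transform(demo_codes))  # back to the original labels
```

`inverse_transform` is handy later on, when numerical predictions need to be turned back into readable labels.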

In [None]:
# First check: what are the target categories?
train_labels.value_counts()

In [25]:
from sklearn.preprocessing import LabelEncoder

# create an object of the LabelEncoder class
lblEncoder_Y = LabelEncoder()   

# apply LblEncoder object to our target variables
train_encoded_labels = lblEncoder_Y.fit_transform(train_labels['Target']) 
print(train_encoded_labels)


[1 1 1 1 1 1 1 0 0 1 1 0 0 0 0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 1 1 1 1 0 1 1
 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 0 0 0 1 1 1 1 0 1 1 1 0 1 0
 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 0 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0
 0 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 1 1 0 1 1 1 0 1 1
 1 0 1 1 0 0 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 0 0 1
 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1
 1 0 0 0 1 0 1 1 0 1 1 0 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 0 0 1 0 1 1 1 1 1
 1 1 1 1 1 0 0 1 1 1 1 0 1 1 0 0 1 0 0 1 0 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1
 1 0 1 0 0 1 0 1 1 1 1 0 1 1 1 1]


## Feature Scaling

When the data is comprised of feature values with varying scales, many machine learning algorithms can benefit from rescaling the attributes to all have the same scale. 

We will be using Standardisation (that is, subtract the mean and divide by the standard deviation). This way we control both the average value of each feature and its spread: after scaling, each feature has mean 0 and standard deviation 1 on the training data.
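
As a quick check of the formula, the sketch below standardises a made-up column by hand (subtracting the mean and dividing by the standard deviation) and compares the result with `StandardScaler`. The array and variable names are assumptions for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

demo_values = np.array([[100.0], [200.0], [300.0], [400.0]])

# Manual standardisation: z = (x - mean) / standard deviation
manual = (demo_values - demo_values.mean(axis=0)) / demo_values.std(axis=0)

# The same transformation via scikit-learn
scaled = StandardScaler().fit_transform(demo_values)

print(np.allclose(manual, scaled))              # both approaches agree
print(scaled.mean(axis=0), scaled.std(axis=0))  # mean close to 0, standard deviation close to 1
```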


In [26]:
# create the Scaling object
from sklearn.preprocessing import StandardScaler
my_scaler_num= StandardScaler()

# fit and transform the appropriate columns in the training data
train_dataset_no_nans[numerical_columns]= my_scaler_num.fit_transform(train_dataset_no_nans[numerical_columns]) 

# show results
print(train_dataset_no_nans[numerical_columns].head())


   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0        -0.125450           0.229490    0.250068          0.271329   
1        -0.160523           0.117721   -0.330462          0.271329   
2        -0.322317           0.747866    0.115061          0.271329   
3        -0.391582           2.550814    0.790096          0.271329   
4         0.467619           2.188529    4.570294          0.271329   

   Credit_History  
0        0.384807  
1        0.384807  
2        0.384807  
3        0.384807  
4        0.384807  


###  Exercise 3

Apply all the previous data pre-processing steps to the test data and labels.
In order, we:
1. filled numerical missing values with SimpleImputer (`imptr_num`)
2. filled categorical missing values with SimpleImputer (`imptr_cat`)
3. encoded all categorical values with ColumnTransformer (`ct_cat`)
4. scaled all numerical features with StandardScaler (`my_scaler_num`)

Also, we encoded the labels with LabelEncoder (`lblEncoder_Y`)

When transforming the test dataset (both input and labels) we should use transformer objects that have been **fit on the training data**. This is because in real-world applications the test data would not be available during training. 

So, we call the `.transform` method alone on the previously fitted objects, without fitting them again.

In [29]:
test_dataset.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
121,LP001489,Female,Yes,0,Graduate,No,4583.0,0.0,84.0,360.0,1.0,Rural
256,LP002894,other,Yes,0,Graduate,No,3166.0,0.0,36.0,360.0,1.0,Semiurban
188,LP001155,Female,Yes,0,Not Graduate,No,1928.0,1644.0,100.0,360.0,1.0,Semiurban
325,LP001664,Male,No,0,Graduate,No,4191.0,0.0,120.0,360.0,1.0,Rural
278,LP001917,Female,No,0,Graduate,No,1811.0,1666.0,54.0,360.0,1.0,Urban


In [33]:
# Your code here
test_dataset_no_nans= test_dataset.copy()

# filling nans
test_dataset_no_nans[numerical_columns] = imptr_num.transform(test_dataset_no_nans[numerical_columns])
test_dataset_no_nans[categorical_columns] = imptr_cat.transform(test_dataset_no_nans[categorical_columns])

# encoding categorical values
encoded_test_array= ct_cat.transform(test_dataset_no_nans) #ct_cat was the ColumnTransformer

# rebuild a dataframe, reusing the column names produced by the fitted ColumnTransformer
test_dataset_no_nans= pd.DataFrame(encoded_test_array, columns=ct_cat.get_feature_names_out())

# make sure numerical columns are of type float
test_dataset_no_nans= test_dataset_no_nans.astype(dtype={col: "float64" for col in numerical_columns})

# scaling
test_dataset_no_nans[numerical_columns] = my_scaler_num.transform(test_dataset_no_nans[numerical_columns])

# encode labels
test_encoded_labels = lblEncoder_Y.transform(test_labels['Target']) 

NotFittedError: ignored
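
scikit-learn raises `NotFittedError` when `.transform` is called on an object whose `.fit` has not been run. One common cause (an assumption here, since the traceback is not shown) is running this cell in a fresh session, or after a runtime restart, without first re-running the cells that fit the imputers, encoder and scaler. A minimal sketch with a made-up array:

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler

demo_data = np.array([[1.0], [2.0], [3.0]])
demo_unfitted = StandardScaler()

# Calling transform before fit raises NotFittedError.
try:
    demo_unfitted.transform(demo_data)
    raised = False
except NotFittedError:
    raised = True
print(raised)

# After fitting, the same call works.
demo_unfitted.fit(demo_data)
demo_result = demo_unfitted.transform(demo_data)
print(demo_result.shape)
```

If this is the cause, re-running the notebook from the top, in order, should make the error go away.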

Here I provide a standard classifier so that you can run some tests if you want.

In [34]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


# we will use a classifier (we will skip details and what it does for now, so just use it as it is)
knn=KNeighborsClassifier(n_neighbors=5)
# we train the classifier with the training data and labels: train_dataset_no_nans should be the training dataframe after
# i) filling in all missing values, ii) encoding all categorical features and (maybe) iii) feature scaling
# Loan_ID is dropped because it is an identifier, not a numerical feature
knn.fit(train_dataset_no_nans.drop(columns=['Loan_ID']), train_encoded_labels)

# Checking the model's accuracy (performance); this should be performed on the test set and thus we use the test_dataset_no_nans and the test labels (after encoding)
performance = knn.score(test_dataset_no_nans.drop(columns=['Loan_ID']), test_encoded_labels)
print(f'Performance is {performance:.3f}')



NameError: ignored

## [Stretch goal] Final Exercise 

The steps performed above are essential for ML applications. For example, check what happens if we replace `train_dataset_no_nans` with `train_dataset`: what error message do we get? Why?

Try commenting out some of the pre-processing steps we've seen and rerun the algorithm in the cell above without them. Keep track of the final performance number you obtain in each case (if the code still runs). Without which steps does the code break, and without which ones does the code still run, but perhaps less optimally? What do you think is happening?

Note that the notebook so far can be used as a base for ML projects with appropriate modifications (at the very least, don't forget about the EDA part and hyperparameter optimization).
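
For such projects, scikit-learn's `Pipeline` offers one way to bundle the pre-processing steps with the classifier, so that everything is fit on the training data with one call and applied to the test data with another. The sketch below is only an illustration under stated assumptions: the rows are made up, only three of the columns are used, `handle_unknown="ignore"` and `n_neighbors=1` are choices made for this tiny example, and the `demo_` names do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

demo_num_cols = ["ApplicantIncome", "LoanAmount"]
demo_cat_cols = ["Property_Area"]

# Numerical columns: fill with the mean, then standardise.
# Categorical columns: fill with the most frequent category, then one-hot encode.
demo_preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), demo_num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), demo_cat_cols),
])
demo_model = Pipeline([("preprocess", demo_preprocess),
                       ("knn", KNeighborsClassifier(n_neighbors=1))])

# Made-up training and test rows, including missing values.
demo_X_train = pd.DataFrame({
    "ApplicantIncome": [4950.0, 2882.0, np.nan, 9703.0],
    "LoanAmount": [125.0, 123.0, 56.0, np.nan],
    "Property_Area": ["Urban", "Semiurban", np.nan, "Urban"],
})
demo_y_train = ["Y", "Y", "N", "N"]
demo_X_test = pd.DataFrame({
    "ApplicantIncome": [3000.0], "LoanAmount": [np.nan], "Property_Area": ["Rural"],
})

demo_model.fit(demo_X_train, demo_y_train)          # fits every step on the training data
demo_predictions = demo_model.predict(demo_X_test)  # only transforms, then predicts
print(demo_predictions)
```

Because the fitted pipeline only ever calls `.transform` on new data, it enforces the rule from Exercise 3 automatically.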