<a href="https://colab.research.google.com/github/dlkt101101/AMATH-445/blob/main/AMATH445_A1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AMATH 445
## Assignment 1
Prepared by: Darren Alexander Lam Kin Teng (20977843)

# Question 1


## 1a

1. From a given confusion matrix, we will get values for true positive (TP), true negative (TN), false positive (FP) and false negative (FP).
$$Precision = \frac{TP}{TP+FP}$$
$$Recall = \frac{TP}{TP + FN}$$

**Precision** is a measure of accuracy of the positive predictions out of all the instances that the model interpreted as positive (i.e. TP + FP).\
\
**Recall** is a measure of accuracy of the positive predictions out of all actual positive cases (i.e. TP + FN).\
\
2. $$Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$$
Accuracy alone can be a misleading metric for a classification problem as it does not represent the model's true performance well in the case of imbalanced data.\
For instance, if we have an imbalanced dataset where true labels are 5% of the data and 95% are negative labels. A model could predict all data points as negative. Therefore, accuracy = 95%, precision = $\frac{0}{0}$ and recall = 0%. Model performs poorly at predicting the true labels, based on the recall and precision, despite accuracy showing favourable results.

## 1b Data Preprocessing

In [21]:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

Importing the dataset

In [22]:
data = pd.read_csv('/content/Cell viability and extrusion dataset V1.csv')

In [23]:
data.columns

Index(['Reference', 'DOI', 'Final_Alginate_Conc_(%w/v)',
       'Final_Gelatin_Conc_(%w/v)', 'Final_GelMA_Conc_(%w/v)',
       'Final_Hyaluronic_Acid_Conc_(%w/v)', 'Final_MeHA_Conc_(%w/v)',
       'Final_NorHA_Conc_(%w/v)', 'Final_Fibroin/Fibrinogen_Conc_(%w/v)',
       'Final_P127_Conc_(%w/v)', 'Final_Collagen_Conc_(%w/v)',
       'Final_Chitosan_Conc_(%w/v)', 'Final_CS-AEMA_Conc_(%w/v)',
       'Final_TCP_Conc_(%w/v)', 'Final_Gellan_Conc_(%w/v)',
       'Final_Nano/Methycellulose_Conc_(%w/v)', 'Final_PEGTA_Conc_(%w/v)',
       'Final_PEGMA_Conc_(%w/v)', 'Final_PEGDA_Conc_(%w/v)',
       'Final_Agarose_Conc_(%w/v)', 'CaCl2_Conc_(mM)', 'NaCl2_Conc_(mM)',
       'BaCl2_Conc_(mM)', 'SrCl2_Conc_(mM)',
       'Physical_Crosslinking_Durantion_(s)', 'Photocrosslinking_Duration_(s)',
       'Extrusion_Pressure (kPa)', 'Extrusion_Rate_Lengthwise_(mm/s)',
       'Extrusion_Rate_Volume-wise_(mL/s)', 'Nozzle_Movement_Speed_(mm/s)',
       'Inner_Nozzle_Outer_Diameter_(µm)', 'Outer_Nozzle_Inner_Di

In [24]:
df = data.copy()
df.drop(labels=['Reference', 'DOI'], axis=1, inplace=True)
df.head()

Unnamed: 0,Final_Alginate_Conc_(%w/v),Final_Gelatin_Conc_(%w/v),Final_GelMA_Conc_(%w/v),Final_Hyaluronic_Acid_Conc_(%w/v),Final_MeHA_Conc_(%w/v),Final_NorHA_Conc_(%w/v),Final_Fibroin/Fibrinogen_Conc_(%w/v),Final_P127_Conc_(%w/v),Final_Collagen_Conc_(%w/v),Final_Chitosan_Conc_(%w/v),...,Saline_Solution_Used?,EtOH_Solution_Used?,Photoinitiator_Used?,Enzymatic_Crosslinker_Used?,Matrigel_Used?,Conical_or_Straight_Nozzle,Primary/Not_Primary,Viability_at_time_of_observation_(%),Acceptable_Viability_(Yes/No),Acceptable_Pressure_(Yes/No)
0,2.0,0.0,0.0,0.0,0,0,0.0,0,0.0,0,...,N,N,N,N,N,S,Primary,96.0,Y,Y
1,2.0,0.0,0.0,0.0,0,0,0.0,0,0.0,0,...,N,N,N,N,N,S,Primary,72.0,N,N
2,2.0,0.0,0.0,0.0,0,0,0.0,0,0.0,0,...,N,N,N,N,N,S,Primary,80.0,Y,Y
3,0.0,0.0,0.0,0.0,0,0,0.0,0,0.0,0,...,N,N,N,N,N,S,Primary,96.0,Y,Y
4,0.0,0.0,0.0,0.0,0,0,0.0,0,0.0,0,...,N,N,N,N,N,S,Primary,77.0,N,N


In [25]:
df.shape

(617, 49)

### Handling missing values for bioink temperature as described in the paper

In [26]:
df['Syringe_Temperature_(°C)'] = df['Syringe_Temperature_(°C)'].fillna(22)

### Removing features with more than 50% missing or zero values

In [27]:
null_percentage = round(df.isnull().sum()/df.shape[0] * 100,2)
null_percentage_50_names = null_percentage[null_percentage > 50].index
df.drop(labels=null_percentage_50_names, axis=1, inplace=True)
print("""Dropped columns with more than > 50% null values.
{}""".format(list(null_percentage_50_names)))


Dropped columns with more than > 50% null values. 
['Extrusion_Rate_Lengthwise_(mm/s)', 'Extrusion_Rate_Volume-wise_(mL/s)', 'Nozzle_Movement_Speed_(mm/s)', 'Fiber_Spacing_(µm)']


In [28]:
cols_object = df.select_dtypes(include=['object']).columns.tolist()
cols_numericals = df.select_dtypes(include=['number']).columns.tolist()

print('''All columns included: {}'''.format(len(cols_numericals) + len(cols_object) == df.shape[-1]))

All columns included: True


In [None]:
df[cols_object] = df[cols_object]

### Imputing remaining missing values using KNNImputer from scikit-learn

In [34]:
ohe = OneHotEncoder(drop='first',sparse_output=False, handle_unknown='ignore')
df[remaing_missing_cols_object] = ohe.fit(df[remaing_missing_cols_object])

In [29]:
remaining_missing = df.isna().sum()
remaining_missing_cols = list(remaining_missing[remaining_missing > 0].index)
print('''Here are the remaining columns with missing values:
{}'''.format(remaining_missing_cols))

Here are the remaining columns with missing values: 
['Final_Collagen_Conc_(%w/v)', 'Final_PEGMA_Conc_(%w/v)', 'CaCl2_Conc_(mM)', 'NaCl2_Conc_(mM)', 'Physical_Crosslinking_Durantion_(s)', 'Extrusion_Pressure (kPa)', 'Outer_Nozzle_Inner_Diameter_(µm)', 'Fiber_Diameter_(µm)', 'Cell_Density_(cells/mL)', 'Substrate_Temperature_(°C)', 'Conical_or_Straight_Nozzle']


In [30]:
remaining_missing_cols_object = list(df[remaining_missing_cols].dtypes[df[remaining_missing_cols].dtypes == 'object'].index)
remaining_missing_cols_object

['Fiber_Diameter_(µm)', 'Conical_or_Straight_Nozzle']

`Fiber_Diameter_(µm)` is supposed to be numerical but is type `object`.

In [31]:
df['Fiber_Diameter_(µm)'] = df['Fiber_Diameter_(µm)'].str.extract(r'(\d+\.?\d*)').astype(float)

In [32]:
remaing_missing_cols_object = df[remaining_missing_cols].select_dtypes(include=['object']).columns.tolist()
remaing_missing_numerical_cols = df[remaining_missing_cols].select_dtypes(include=['number']).columns.tolist()

print('''Remaining missing columns object:
{}
Remaining missing columns numerical:
{}'''.format(remaing_missing_cols_object, remaing_missing_numerical_cols))

Remaining missing columns object: 
['Conical_or_Straight_Nozzle']
Remaining missing columns numerical:
['Final_Collagen_Conc_(%w/v)', 'Final_PEGMA_Conc_(%w/v)', 'CaCl2_Conc_(mM)', 'NaCl2_Conc_(mM)', 'Physical_Crosslinking_Durantion_(s)', 'Extrusion_Pressure (kPa)', 'Outer_Nozzle_Inner_Diameter_(µm)', 'Fiber_Diameter_(µm)', 'Cell_Density_(cells/mL)', 'Substrate_Temperature_(°C)']


Imputing numerical columns with kNN Imputer

In [33]:
knnimputer = KNNImputer(n_neighbors=30)
df[remaing_missing_numerical_cols] = knnimputer.fit_transform(df[remaing_missing_numerical_cols])

In [35]:
df.head()

Unnamed: 0,Final_Alginate_Conc_(%w/v),Final_Gelatin_Conc_(%w/v),Final_GelMA_Conc_(%w/v),Final_Hyaluronic_Acid_Conc_(%w/v),Final_MeHA_Conc_(%w/v),Final_NorHA_Conc_(%w/v),Final_Fibroin/Fibrinogen_Conc_(%w/v),Final_P127_Conc_(%w/v),Final_Collagen_Conc_(%w/v),Final_Chitosan_Conc_(%w/v),...,Saline_Solution_Used?,EtOH_Solution_Used?,Photoinitiator_Used?,Enzymatic_Crosslinker_Used?,Matrigel_Used?,Conical_or_Straight_Nozzle,Primary/Not_Primary,Viability_at_time_of_observation_(%),Acceptable_Viability_(Yes/No),Acceptable_Pressure_(Yes/No)
0,2.0,0.0,0.0,0.0,0,0,0.0,0,0.0,0,...,N,N,N,N,N,OneHotEncoder(drop='first'),Primary,96.0,Y,Y
1,2.0,0.0,0.0,0.0,0,0,0.0,0,0.0,0,...,N,N,N,N,N,OneHotEncoder(drop='first'),Primary,72.0,N,N
2,2.0,0.0,0.0,0.0,0,0,0.0,0,0.0,0,...,N,N,N,N,N,OneHotEncoder(drop='first'),Primary,80.0,Y,Y
3,0.0,0.0,0.0,0.0,0,0,0.0,0,0.0,0,...,N,N,N,N,N,OneHotEncoder(drop='first'),Primary,96.0,Y,Y
4,0.0,0.0,0.0,0.0,0,0,0.0,0,0.0,0,...,N,N,N,N,N,OneHotEncoder(drop='first'),Primary,77.0,N,N


In [36]:
cols_object = df.select_dtypes(include=['object']).columns.tolist()
df[cols_object].head()

Unnamed: 0,Cell_Culture_Medium_Used?,DI_Water_Used?,Precrosslinking_Solution_Used?,Saline_Solution_Used?,EtOH_Solution_Used?,Photoinitiator_Used?,Enzymatic_Crosslinker_Used?,Matrigel_Used?,Conical_or_Straight_Nozzle,Primary/Not_Primary,Acceptable_Viability_(Yes/No),Acceptable_Pressure_(Yes/No)
0,Y,N,N,N,N,N,N,N,OneHotEncoder(drop='first'),Primary,Y,Y
1,Y,N,N,N,N,N,N,N,OneHotEncoder(drop='first'),Primary,N,N
2,Y,N,N,N,N,N,N,N,OneHotEncoder(drop='first'),Primary,Y,Y
3,Y,N,N,N,N,N,N,N,OneHotEncoder(drop='first'),Primary,Y,Y
4,Y,N,N,N,N,N,N,N,OneHotEncoder(drop='first'),Primary,N,N


One Hot Encode categroical columns

## 1c Decision Tree Classifier

## 1d Support Vector Machine (SVM)

# Question 3

# Question 4