## Writing basic machine learning models ##
This Jupyter notebook uses an example diabetes dataset to demonstrate linear regression, and the Iris dataset to demonstrate decision trees and random forests. It also outlines the fundamentals of programming for machine learning. If there are any problems with this notebook, please email me at ab550@st-andrews.ac.uk.

### Linear Regression ###

In [1]:
# scikit-learn is an ML library with various algorithms
from sklearn import datasets
X, y = datasets.load_diabetes(return_X_y = True, as_frame = True)

# Check the shape of input and output
# Observe the structure
print("Shape of input: " + str(X.shape))
print("Shape of output: " + str(y.shape))

Shape of input: (442, 10)
Shape of output: (442,)


In [2]:
# Print the name and data type of each column
print(X.columns)
print(X.dtypes)

Index(['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], dtype='object')
age    float64
sex    float64
bmi    float64
bp     float64
s1     float64
s2     float64
s3     float64
s4     float64
s5     float64
s6     float64
dtype: object


In [3]:
# check if columns have any NaN values
X.isna().any()

age    False
sex    False
bmi    False
bp     False
s1     False
s2     False
s3     False
s4     False
s5     False
s6     False
dtype: bool

Checking the shape of the input and output shows the general structure of the dataset. A shape is written as (number of rows, number of columns). The number of rows is the number of cases in the dataset, and it must be the same for the input and the output.

The number of columns in the input is the number of features, and the number of columns in the output is the number of target variables. In general, the output has a single column; here y is a one-dimensional Series, so its shape is reported as (442,). If the dataset is not already in a format that suits your model, reformat it and do any further data processing with pandas and NumPy.
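
As a minimal sketch of the shape checks and reformatting described above, the example below uses a small made-up table rather than the diabetes data. The names `X_toy`, `y_toy`, `X_arr`, `y_arr` and `y_col` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset (not the diabetes data)
X_toy = pd.DataFrame({"bmi": [0.06, -0.05, 0.04], "bp": [0.02, -0.03, -0.01]})
y_toy = pd.Series([151.0, 75.0, 141.0], name="target")

# Input and output must describe the same cases, so the row counts must match
assert X_toy.shape[0] == y_toy.shape[0]

# Convert to NumPy arrays if a model or library expects them
X_arr = X_toy.to_numpy()      # 2-D: one row per case, one column per feature
y_arr = y_toy.to_numpy()      # 1-D: one target value per case

# Some APIs expect a 2-D target with an explicit single column
y_col = y_arr.reshape(-1, 1)

print(X_arr.shape, y_arr.shape, y_col.shape)
```

Whether the 1-D or the single-column form is needed depends on the library; scikit-learn estimators such as LinearRegression accept the 1-D form used in this notebook.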

If the shapes of both input and output look right, the dataset can be split into a training set and a testing set. The model learns from the training set, and we evaluate its performance on the testing set.

In [35]:
# Training dataset gets ~70% of cases, Testing gets ~30%
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.70, random_state = 32)

print("Shape of training input: " + str(X_train.shape))
print("Shape of testing input: " + str(X_test.shape))
print("Shape of training output: " + str(y_train.shape))
print("Shape of testing output: " + str(y_test.shape))

Shape of training input: (309, 10)
Shape of testing input: (133, 10)
Shape of training output: (309,)
Shape of testing output: (133,)


In [36]:
# Create a linear regression model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lin_reg = LinearRegression()

# Model learns from training dataset
lin_reg.fit(X_train, y_train)

# Model tries to predict testing dataset
y_predictions_lr = lin_reg.predict(X_test)

# Find the error of the model (RMSE, R2)
# The square root is taken manually because newer scikit-learn releases no longer accept squared = False
rmse = mean_squared_error(y_test, y_predictions_lr) ** 0.5
print("RMSE: " + str(rmse))
r2 = r2_score(y_test, y_predictions_lr)
print("R2: " + str(r2))

RMSE: 52.30111625121456
R2: 0.49670590024012307


An R-squared score of about 0.50 means that the model explains roughly half of the variance in the target, so its predictions are only moderately accurate. Let's move on to decision trees and use one to solve a classification problem.
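
For reference, the two metrics reported for the linear regression model can be computed by hand. The sketch below shows the arithmetic on a made-up set of four values; the function names `rmse_by_hand` and `r2_by_hand` and the toy numbers are hypothetical, and the code is meant as an illustration of the formulas rather than a replacement for the scikit-learn functions.

```python
import numpy as np

def rmse_by_hand(y_true, y_pred):
    """Root mean squared error: square root of the average squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_by_hand(y_true, y_pred):
    """R-squared: 1 minus (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical toy values, not taken from the diabetes dataset
y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.0, 5.0, 8.0, 9.0]

print("RMSE:", rmse_by_hand(y_true_toy, y_pred_toy))  # sqrt(0.5), about 0.707
print("R2:", r2_by_hand(y_true_toy, y_pred_toy))      # 1 - 2/20, about 0.9
```

RMSE is expressed in the units of the target, while R-squared compares the model against always predicting the mean of the target.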

### Decision Tree ###
This section uses the Iris flower dataset.

In [41]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, f1_score

X, y = datasets.load_iris(return_X_y = True, as_frame = True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.70, random_state = 32)

In [44]:
from sklearn.tree import DecisionTreeClassifier

# Create a decision tree classifier
# max_features = 'sqrt' matches the old 'auto' setting, which newer scikit-learn releases reject
dt_clf = DecisionTreeClassifier(min_samples_split = 80, max_features = 'sqrt', random_state = 32)
dt_clf.fit(X_train, y_train)

y_predictions_dt = dt_clf.predict(X_test)

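# Find the error of the model (RMSE, F1)
# RMSE treats the class labels as numbers, so F1 is the more meaningful metric here
# The square root is taken manually because newer scikit-learn releases no longer accept squared = False
rmse = mean_squared_error(y_test, y_predictions_dt) ** 0.5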
print("RMSE: " + str(rmse))
f1 = f1_score(y_test, y_predictions_dt, average = 'weighted')
print("F1: " + str(f1))

RMSE: 0.5577733510227171
F1: 0.5906432748538012


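An F1 score of 0.59 means that the model is only moderately accurate. This is probably a consequence of the restrictive min_samples_split = 80 setting, which allows very few splits on a training set of 105 samples. Let's try a random forest.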
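
To make the weighted F1 score reported for the decision tree more concrete, the sketch below computes it by hand: a per-class F1 is calculated, then the classes are averaged with weights equal to the number of true samples in each class. The function name `weighted_f1_by_hand` and the toy labels are hypothetical and not taken from the Iris data.

```python
import numpy as np

def weighted_f1_by_hand(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's true sample count."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    total = 0.0
    for label in np.unique(y_true):
        tp = np.sum((y_pred == label) & (y_true == label))  # correctly predicted as label
        fp = np.sum((y_pred == label) & (y_true != label))  # wrongly predicted as label
        fn = np.sum((y_pred != label) & (y_true == label))  # label that was missed
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        support = np.sum(y_true == label)
        total += f1 * support
    return float(total / len(y_true))

# Hypothetical toy labels with three classes
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 0, 1, 2, 2, 2]

# Per-class F1 is 1.0, 2/3 and 0.8, so the weighted average is about 0.822
print("Weighted F1:", weighted_f1_by_hand(y_true_toy, y_pred_toy))
```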

### Random Forest ###

In [54]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(random_state = 32)
rf_clf.fit(X_train, y_train)

y_predictions_rf = rf_clf.predict(X_test)

# Find the error of the model (RMSE, F1)
# The square root is taken manually because newer scikit-learn releases no longer accept squared = False
rmse = mean_squared_error(y_test, y_predictions_rf) ** 0.5
print("RMSE: " + str(rmse))
f1 = f1_score(y_test, y_predictions_rf, average = 'weighted')
print("F1: " + str(f1))

RMSE: 0.0
F1: 1.0


An F1 score of 1.00 means that the model classified every test sample correctly. A perfect score is unusual, and here it can largely be explained by the fact that the testing dataset is small (<50 samples). As a rough guide, a good F1 score to target for classification is 0.70.
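
Because a single small test set can give an overly optimistic score, one common way to get a steadier estimate is k-fold cross-validation. The sketch below is one possible approach, assuming 5 folds and the same weighted F1 metric; the names `X_iris`, `y_iris`, `cv_model` and `cv_scores` are hypothetical and not part of the cells above.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Reload the Iris data so this cell can run on its own
X_iris, y_iris = datasets.load_iris(return_X_y=True, as_frame=True)

cv_model = RandomForestClassifier(random_state=32)

# Each of the 5 folds is used once for testing while the other 4 are used for training
cv_scores = cross_val_score(cv_model, X_iris, y_iris, cv=5, scoring="f1_weighted")

print("F1 per fold: " + str(cv_scores))
print("Mean F1: " + str(cv_scores.mean()))
```

The spread of the per-fold scores gives a sense of how much the result depends on which samples end up in the test set.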

We encourage you to experiment with these models. You can also find datasets online, perform exploratory data analysis (EDA), and try one of these models on them.
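
As a starting point for EDA, the sketch below shows a few first-look checks with pandas, using the Iris data as a stand-in for a dataset of your own. The variable names `iris` and `df` are hypothetical, and these checks are only a small sample of what EDA can involve.

```python
from sklearn import datasets

# Load Iris as a single DataFrame holding the features plus a 'target' column
iris = datasets.load_iris(as_frame=True)
df = iris.frame

print(df.shape)                             # number of rows and columns
print(df.describe())                        # summary statistics per column
print(df["target"].value_counts())          # how balanced the classes are
print(df.drop(columns="target").corr())     # pairwise feature correlations
```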

Enjoy!