Linear Regression

In [3]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the Diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and training the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predicting the test set results
y_pred = model.predict(X_test)

# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("MSE is:", mse)
print("R2 score is:", r2)

MSE is: 2900.1936284934814
R2 score is: 0.4526027629719195


Evaluation Metrics
Mean Squared Error (MSE): Measures the average of the squares of the errors. Lower values are better.
R-squared: The proportion of the variance in the dependent variable that the independent variables explain. Closer to 1 is better; a value of 0 means the model does no better than always predicting the mean.
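To make the two definitions concrete, here is a minimal sketch that computes both metrics by hand and compares them with scikit-learn's functions. The toy arrays `y_true` and `y_hat` are made up for illustration and are not taken from the Diabetes dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values chosen only for illustration (not from the Diabetes dataset)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE manual:", mse_manual, "| sklearn:", mean_squared_error(y_true, y_hat))
print("R2 manual:", r2_manual, "| sklearn:", r2_score(y_true, y_hat))
```

Note that MSE is expressed in squared units of the target, so its size only has meaning relative to the scale of y.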

Applying with scikit-learn
1. Load the Diabetes Dataset: This dataset contains ten baseline variables (age, sex, BMI, average blood pressure, and six blood serum measurements) for diabetes patients.
2. Split the Dataset: Divide it into training and testing sets.
3. Create and Train the Linear Regression Model: Build the model using the training set.
4. Predict and Evaluate: Use the test set to make predictions and then evaluate the model using MSE and R-squared.
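As a rough illustration of what the fitted model contains, the sketch below refits the same model and shows that a prediction is just a weighted sum of the ten features plus an intercept. The variable names `lin_model` and `manual_pred` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=42
)
lin_model = LinearRegression().fit(X_train, y_train)

# One learned coefficient per feature, plus a single intercept
for name, coef in zip(diabetes.feature_names, lin_model.coef_):
    print(f"{name:>4}: {coef:10.2f}")
print("intercept:", lin_model.intercept_)

# A prediction is the weighted sum of the features plus the intercept
manual_pred = X_test @ lin_model.coef_ + lin_model.intercept_
print("Matches model.predict:", np.allclose(manual_pred, lin_model.predict(X_test)))
```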

Logistic Regression

In [4]:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


# Load the Breast Cancer dataset
breast_cancer = load_breast_cancer()
X, y = breast_cancer.data, breast_cancer.target

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and training the Logistic Regression model
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Predicting the test set results
y_pred = model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the results
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Accuracy: 0.956140350877193
Precision: 0.9459459459459459
Recall: 0.9859154929577465
F1 Score: 0.9655172413793104


Evaluation Metrics
Accuracy: Accuracy is the ratio of correctly predicted observations to total observations.
Precision and Recall: Precision is the ratio of correctly predicted positive observations to all observations predicted as positive. Recall is the ratio of correctly predicted positive observations to all observations that are actually positive.
F1 Score: The harmonic mean of precision and recall, which balances the two.
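The four metrics can all be derived from the counts in a confusion matrix. The sketch below does this for a small hand-made example; the label lists `y_true` and `y_hat` are invented for illustration and are unrelated to the Breast Cancer results above.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Invented labels for illustration: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_hat = [1, 1, 1, 0, 1, 1, 0, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)   # of everything predicted positive, how much was right
recall_manual = tp / (tp + fn)      # of everything actually positive, how much was found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print("Accuracy:", accuracy_manual, "| sklearn:", accuracy_score(y_true, y_hat))
print("Precision:", precision_manual, "| sklearn:", precision_score(y_true, y_hat))
print("Recall:", recall_manual, "| sklearn:", recall_score(y_true, y_hat))
print("F1 Score:", f1_manual, "| sklearn:", f1_score(y_true, y_hat))
```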

The code above applies logistic regression in the following steps.
1. Load the Breast Cancer Dataset: This dataset contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass, and the goal is to classify each mass as benign or malignant.
2. Split the Dataset: Divide it into training and testing sets.
3. Create and Train the Logistic Regression Model: Build the model using the training set.
4. Predict and Evaluate: Use the test set to make predictions and then evaluate the model using Accuracy, Precision, Recall, and F1 Score.
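Behind predict, logistic regression produces a probability for each class. The sketch below refits the same model and checks that thresholding the class-1 probability at 0.5 gives the same labels as predict. The names `clf`, `proba_benign`, and `labels_from_threshold` are introduced here for illustration. In this dataset, class 1 is "benign", so the precision and recall reported above treat benign as the positive class.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
clf = LogisticRegression(max_iter=10000).fit(X_train, y_train)

# target_names maps label 0 and label 1 to their meaning
print("Classes:", list(data.target_names))

# Column 1 of predict_proba holds the estimated probability of class 1
proba_benign = clf.predict_proba(X_test)[:, 1]
labels_from_threshold = (proba_benign > 0.5).astype(int)

print("Same as predict:", np.array_equal(labels_from_threshold, clf.predict(X_test)))
```

Moving the threshold away from 0.5 is one way to trade precision against recall.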

Decision Trees

In [5]:
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

# Load the Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and training the Decision Tree model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predicting the test set results
y_pred = model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

# Print the results
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Accuracy: 0.9444444444444444
Precision: 0.953968253968254
Recall: 0.9345238095238096
F1 Score: 0.9424740010946907


Evaluation Metrics
For classification: Accuracy, precision, recall, and F1 score.
For Regression: Mean Squared Error (MSE), R-squared.

Here are the steps followed in the code above.
1. Load the Wine Dataset
2. Split the Dataset
The dataset is divided into training and testing sets so that the model can be trained on one part of the data (training set) and tested on unseen data (testing set). We used 80% of the data for training and 20% for testing.
3. Create and Train the Decision Tree Model
A Decision Tree Classifier is created. This model learns from the training data by building a tree of decisions: each internal node tests one feature of the dataset against a threshold, each branch corresponds to an outcome of that test, and each leaf assigns a class.
4. Predict and Evaluate
The model is used to predict the classifications of the test set. The performance of the model is then assessed by contrasting these predictions with the actual labels.
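The learned decision rules can be inspected directly. The following sketch refits the same tree and prints it as text with scikit-learn's export_text helper; the names `tree_model` and `rules` are introduced here for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)
tree_model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Each indented line is a threshold test on one feature; "class:" lines are leaves
rules = export_text(tree_model, feature_names=list(wine.feature_names))
print(rules)

print("Depth:", tree_model.get_depth(), "| Leaves:", tree_model.get_n_leaves())
```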

Random Forest

In [6]:
from sklearn.ensemble import RandomForestClassifier

breast_cancer = load_breast_cancer()
X, y = breast_cancer.data, breast_cancer.target

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Creating and training the Random Forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Predicting the test set results
y_pred_rf = rf_model.predict(X_test)

# Evaluating the model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf, average='macro')
recall_rf = recall_score(y_test, y_pred_rf, average='macro')
f1_rf = f1_score(y_test, y_pred_rf, average='macro')

# Print the results
print("Accuracy:", accuracy_rf)
print("Precision:", precision_rf)
print("Recall:", recall_rf)
print("F1 Score:", f1_rf)


Evaluation Metrics
Classification: Accuracy, Precision, Recall, F1 Score.
Regression: Mean Squared Error (MSE), R-squared.

Applying with scikit-learn
1. Create and Train the Random Forest Model:
Initialize a Random Forest Classifier.
Fit (train) the model on the training data.
2. Predict:
Use the trained model to predict the labels of the test data.
3. Evaluate:
Assess the model’s performance on the test data using Accuracy, Precision, Recall, and F1 Score.
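One detail to keep in mind when comparing scores across sections: the Random Forest cell passes average='macro' on a binary problem, while the Logistic Regression cell on the same dataset uses the default (average='binary'). The two settings generally give different numbers, as this small sketch with invented labels suggests.

```python
from sklearn.metrics import precision_score

# Invented binary labels for illustration
y_true = [1, 1, 1, 1, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0]

# Default average='binary': precision of the positive class (label 1) only -> 3/4
p_binary = precision_score(y_true, y_hat)

# average='macro': unweighted mean of the per-class precisions -> (3/4 + 1/2) / 2
p_macro = precision_score(y_true, y_hat, average='macro')

print("binary:", p_binary)
print("macro:", p_macro)
```

For a like-for-like comparison between models, the same averaging setting should be used throughout.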

K-Means Clustering

In [7]:
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Load the Iris dataset
iris = load_iris()
X = iris.data

# Applying K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Predicting the cluster for each data point
y_pred_clusters = kmeans.predict(X)

# Evaluating the model
inertia = kmeans.inertia_
silhouette = silhouette_score(X, y_pred_clusters)

print("Inertia:", inertia)
print("Silhouette:", silhouette)

Inertia: 78.8556658259773
Silhouette: 0.5511916046195919


Evaluation Metrics
Inertia: The sum of squared distances from each sample to its nearest cluster center. Lower values are better, although inertia tends to decrease as the number of clusters grows.
Silhouette Score: Measures how close a sample is to the other members of its own cluster compared with how close it is to the nearest neighboring cluster. The score ranges from -1 to 1; a high score means the sample is well matched to its own cluster and poorly matched to nearby clusters.

Applying with scikit-learn
1. Load the Iris Dataset:
2. Apply K-Means Clustering:
We initialize a K-Means clustering algorithm with n_clusters=3, as there are three species of iris in the dataset. However, the algorithm is unaware of these species; it will simply try to find the best way to group the data into three clusters.
We fit the model to the data X, which includes our four features. The K-Means algorithm iteratively assigns each data point to one of the three clusters based on the distance of the data point to the cluster centroids.
3. Predict Clusters:
The predict method assigns each data point in X to one of the three clusters. With K-Means this step is largely redundant when predicting on the training data, because fitting already computes the assignments (available as kmeans.labels_); either way, each data point is now labeled with a cluster number.
4. Evaluate the Clustering:
We evaluate our clustering using two metrics:
• Inertia: This is the sum of squared distances of samples to their closest cluster center. It’s a measure of how internally coherent clusters are. We aim for lower inertia.
• Silhouette Score: This measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette score ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
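The inertia definition above can be checked by hand. The sketch below refits K-Means and recomputes inertia from the cluster centers and assignments. Passing n_init=10 explicitly is an assumption made here to keep the behavior the same across scikit-learn versions; the original cell relies on the library default. The names `km`, `assigned_centers`, and `manual_inertia` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# n_init is set explicitly (assumption; the original cell uses the default)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Look up, for every sample, the center of the cluster it was assigned to
assigned_centers = km.cluster_centers_[km.labels_]

# Inertia: sum of squared distances from each sample to its assigned center
manual_inertia = np.sum((X - assigned_centers) ** 2)
print("Manual inertia:", manual_inertia)
print("kmeans.inertia_:", km.inertia_)

# The assignments stored during fit should agree with predict on the same data
print("labels_ equal predict(X):", np.array_equal(km.labels_, km.predict(X)))
```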

Support Vector Machines (SVM)

In [8]:
from sklearn.svm import SVC

breast_cancer = load_breast_cancer()
X, y = breast_cancer.data, breast_cancer.target

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and training the SVM model
svm_model = SVC()
svm_model.fit(X_train, y_train)

# Predicting the test set results
y_pred_svm = svm_model.predict(X_test)

# Evaluating the model
accuracy_svm = accuracy_score(y_test, y_pred_svm)
precision_svm = precision_score(y_test, y_pred_svm, average='macro')
recall_svm = recall_score(y_test, y_pred_svm, average='macro')
f1_svm = f1_score(y_test, y_pred_svm, average='macro')

# Print the results
print("Accuracy:", accuracy_svm)
print("Precision:", precision_svm)
print("Recall:", recall_svm)
print("F1 Score:", f1_svm)

Accuracy: 0.9473684210526315
Precision: 0.961038961038961
Recall: 0.9302325581395349
F1 Score: 0.9422297297297297


Evaluation Metrics
Classification: Accuracy, Precision, Recall, F1 Score.
Regression: Mean Squared Error (MSE), R-squared.

Applying with scikit-learn
1. Create and Train the SVM Model:
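A Support Vector Machine (SVM) model is created using the default settings, which for scikit-learn's SVC means an RBF kernel. An SVM looks for the hyperplane that separates the classes with as wide a margin as possible; with a non-linear kernel such as RBF, that hyperplane lives in a transformed feature space, so the boundary in the original feature space can be curved.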
2. Predict:
The trained SVM model is then used to predict the class labels of the test data. It does this by determining on which side of the hyperplane each data point falls.
3. Evaluate:
The model’s predictions are evaluated against the actual labels of the test set to assess its performance.
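The "side of the hyperplane" idea can be seen through decision_function, which returns a signed score for each sample. The sketch below refits the same default SVC and checks that the sign of the score matches the predicted label; the names `svc`, `scores`, and `side` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
svc = SVC().fit(X_train, y_train)
print("Kernel used by default:", svc.kernel)

# Signed score per sample: positive means the class-1 side of the boundary
scores = svc.decision_function(X_test)
side = (scores > 0).astype(int)

print("Sign of score matches predict:", np.array_equal(side, svc.predict(X_test)))
print("Number of support vectors per class:", svc.n_support_)
```

The magnitude of each score gives a rough sense of how far a sample lies from the boundary.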

Naive Bayes

In [9]:
from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB

# Load the Digits dataset
digits = load_digits()
X, y = digits.data, digits.target

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and training the Naive Bayes model
model = GaussianNB()
model.fit(X_train, y_train)

# Predicting the test set results
y_pred = model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

# Print the results
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Accuracy: 0.8472222222222222
Precision: 0.8649844547206135
Recall: 0.8476479221745045
F1 Score: 0.8437352605469787


Evaluation Metrics
Accuracy: Measures overall correctness of the model.
Precision, Recall, and F1 Score: Especially important in cases where class distribution is imbalanced.

1. Load the Digits Dataset:
2. Split the Dataset:
Similar to previous examples, the dataset is divided into training and testing sets. We use 80% of the data for training and 20% for testing. This helps in training the model on a large portion of the data and then evaluating its performance on a separate set that it hasn’t seen before.
3. Create and Train the Naive Bayes Model:
A Gaussian Naive Bayes classifier is created. This variant of Naive Bayes assumes that, within each class, the continuous values of each feature follow a Gaussian (normal) distribution and that the features are independent of one another given the class.
The model is then trained (fitted) on the training data. It learns to associate the input features (pixel values) with the target values (digit classes).
4. Predict and Evaluate:
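After training, the model is used to predict the class labels of the test data. These predictions are then compared with the actual labels using Accuracy, Precision, Recall, and F1 Score (macro-averaged across the ten digit classes).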
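What the Gaussian Naive Bayes model learns can be inspected after fitting. The sketch below refits the same model and looks at the per-class feature means and the class probabilities for a few test images; the names `nb`, `proba`, and `predicted` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
nb = GaussianNB().fit(X_train, y_train)

# theta_ stores one mean per (class, feature): 10 digit classes x 64 pixels
print("Shape of per-class means:", nb.theta_.shape)
print("Class priors:", np.round(nb.class_prior_, 3))

# predict_proba returns one probability per class; each row sums to 1
proba = nb.predict_proba(X_test[:5])
predicted = nb.classes_[proba.argmax(axis=1)]
print("Most probable class:", predicted)
print("model.predict:      ", nb.predict(X_test[:5]))
```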

K-Nearest Neighbors (KNN)

In [10]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and training the KNN model
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train, y_train)

# Predicting the test set results
y_pred_knn = knn_model.predict(X_test)

# Evaluating the model
accuracy_knn = accuracy_score(y_test, y_pred_knn)
precision_knn = precision_score(y_test, y_pred_knn, average='macro')
recall_knn = recall_score(y_test, y_pred_knn, average='macro')
f1_knn = f1_score(y_test, y_pred_knn, average='macro')

# Print the results
print("Accuracy:", accuracy_knn)
print("Precision:", precision_knn)
print("Recall:", recall_knn)
print("F1 Score:", f1_knn)

Accuracy: 0.8055555555555556
Precision: 0.7912698412698412
Recall: 0.7976190476190476
F1 Score: 0.78998778998779


Evaluation Metrics
Classification: Accuracy, Precision, Recall, F1 Score.
Regression: Mean Squared Error (MSE), R-squared.

Applying with scikit-learn
1. Create and Train the KNN Model:
A K-Nearest Neighbors (KNN) model is created with n_neighbors=3. This means the model looks at the three nearest neighbors of a data point to make a prediction.
The model is trained (fitted) with the training data. KNN does not learn parameters during training; it simply stores the training set for use at prediction time.
2. Predict:
The trained KNN model is then used to predict the class labels (types of wine) of the test data. For each point in the test set, the model finds the three nearest points in the training set and predicts the most common class among them.
3. Evaluate:
The model’s predictions are evaluated against the actual labels of the test set.
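The neighbor lookup and majority vote can be reproduced by hand for a single test point. The sketch below refits the same model and uses kneighbors to retrieve the three nearest training samples; the names `knn`, `neighbor_labels`, and `vote` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Distances to, and training-set indices of, the 3 nearest neighbors of the first test point
distances, indices = knn.kneighbors(X_test[:1])
neighbor_labels = y_train[indices[0]]
print("Neighbor distances:", distances[0])
print("Neighbor labels:", neighbor_labels)

# Majority vote among the neighbors (ties go to the smallest class label)
vote = np.bincount(neighbor_labels).argmax()
print("Vote:", vote, "| model.predict:", knn.predict(X_test[:1])[0])
```

Because the vote depends on raw distances, features with large numeric ranges dominate the result unless the data is scaled first.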