
#Unsupervised Learning Algorithms


##Overview

Unsupervised learning is a category of machine learning algorithms where the model is trained on a dataset without explicit supervision or labeled output. In other words, the algorithm is left to find patterns, structures, or relationships within the data on its own. Unlike supervised learning, where the algorithm is guided by labeled examples to make predictions, unsupervised learning deals with raw, unlabeled data.

K-means clustering is one of the most widely used unsupervised learning algorithms. It is a partition-based algorithm that divides the data into K clusters, where each data point belongs to the cluster with the nearest mean (centroid). The algorithm iteratively assigns data points to the nearest centroid and recomputes each centroid as the mean of its assigned points until the assignments stop changing. K-means is used for applications such as customer segmentation, image compression, and anomaly detection.
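
The assign-and-update loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the random initialization, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain K-means loop: assign points to the nearest centroid, then update centroids."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct points sampled from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs (illustrative data only)
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(centroids_demo)
```

In practice, `sklearn.cluster.KMeans` adds a smarter initialization and multiple restarts, so the sketch is mainly useful for seeing the two alternating steps.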

Another essential unsupervised learning technique is hierarchical clustering. Hierarchical clustering builds a tree-like structure called a dendrogram that records how data points are grouped at each level of similarity. In its agglomerative (bottom-up) form, it starts with every data point as its own cluster and repeatedly merges the two most similar clusters until all data points belong to a single cluster. This hierarchy lets users visualize the data's natural grouping and choose the number of clusters from the dendrogram's structure.
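
As a rough illustration of the merge hierarchy, the sketch below uses SciPy's `scipy.cluster.hierarchy` module. SciPy is an assumption here (it is not used elsewhere in this notebook), and the ten synthetic points are made up for the example. The dendrogram layout is computed with `no_plot=True` so that nothing is drawn; removing that argument draws the tree with Matplotlib.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
# Synthetic data: two tight groups of five points each (illustrative only)
points = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(4, 0.3, (5, 2))])

# Each row of Z records one merge: the two clusters joined,
# the distance between them, and the size of the new cluster
Z = linkage(points, method="ward")
print(Z.shape)  # (n_samples - 1, 4)

# Cut the tree so that at most two flat clusters remain
flat_labels = fcluster(Z, t=2, criterion="maxclust")
print(flat_labels)

# Compute the dendrogram layout without drawing it
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])  # leaf order along the x-axis
```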

Principal Component Analysis (PCA) is a dimensionality reduction technique commonly used in unsupervised learning. PCA aims to transform the original high-dimensional data into a lower-dimensional space while preserving the most important information. It identifies the principal components, which are orthogonal vectors representing the directions of maximum variance in the data. By projecting the data onto a reduced set of principal components, PCA can simplify the data representation, making it easier to analyze and visualize complex datasets.
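
To make the idea of "directions of maximum variance" concrete, here is a small NumPy sketch that derives principal components from the eigenvectors of the covariance matrix. The synthetic three-column data is an assumption for illustration; scikit-learn's `PCA` class uses a singular value decomposition rather than this exact route.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 3-D data in which the first two columns are strongly correlated
base = rng.normal(size=(200, 1))
data = np.hstack([base, 2 * base, rng.normal(size=(200, 1))])
data = data + 0.1 * rng.normal(size=data.shape)

# Center the data, then eigendecompose its covariance matrix
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; put the largest variance first
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project the data onto the top two principal components
projected = centered @ eigenvectors[:, :2]

print(eigenvalues / eigenvalues.sum())             # share of variance per component
print(np.round(eigenvectors.T @ eigenvectors, 6))  # close to the identity: components are orthonormal
```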

Anomaly detection is another crucial aspect of unsupervised learning, where the goal is to identify rare and abnormal data points or events in a dataset. Various unsupervised anomaly detection algorithms exist, such as K-nearest neighbors (KNN) and clustering-based approaches. In KNN-based anomaly detection, data points are identified as anomalies based on their distance to their K-nearest neighbors. In clustering-based approaches, anomalies are detected by considering data points that do not fit well into any cluster or have significantly different properties compared to other clusters.
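
A distance-based score of the kind described above can be sketched with scikit-learn's `NearestNeighbors`. The scoring rule used here (mean distance to the k nearest neighbors) is one common choice rather than the only one, and the synthetic data with a single injected outlier is an assumption for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# 100 inliers around the origin plus one obvious outlier at index 100 (synthetic data)
inliers = rng.normal(0, 1, (100, 2))
outlier = np.array([[8.0, 8.0]])
X_knn = np.vstack([inliers, outlier])

k = 5
# Ask for k + 1 neighbors because each point's nearest neighbor is itself
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_knn)
distances, _ = nn.kneighbors(X_knn)

# Anomaly score: mean distance to the k nearest neighbors, excluding the point itself
scores = distances[:, 1:].mean(axis=1)
print(scores.argmax())  # index of the most isolated point
```

Points with unusually large scores are candidates for anomalies; where to place the threshold depends on the application.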

Python provides a rich ecosystem of libraries for implementing unsupervised learning algorithms. The scikit-learn library is particularly popular and offers a wide range of tools and functions for K-means clustering, hierarchical clustering, PCA, and various anomaly detection techniques. By leveraging these unsupervised learning algorithms, data scientists and researchers can gain valuable insights from their data, discover patterns, and make data-driven decisions without the need for labeled data.

##K-means clustering

K-means clustering is a popular unsupervised machine learning algorithm used to partition data into distinct groups or clusters based on their similarities. Scikit-Learn is a powerful machine learning library in Python that provides an implementation of the K-means clustering algorithm.

Here's an example of how to perform K-means clustering on the Pima Indian Diabetes dataset using Scikit-Learn:


In [None]:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Prepare the data for clustering
X = dataset.drop('Outcome', axis=1)  # Features
y = dataset['Outcome']  # Target variable

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform K-means clustering
k = 2  # Number of clusters
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
kmeans.fit(X_scaled)

# Get the cluster labels for each data point
labels = kmeans.labels_

# Visualize the clusters (using the first two features)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis')
plt.xlabel('Pregnancies (standardized)')
plt.ylabel('Glucose (standardized)')
plt.title('K-means Clustering')
plt.show()


In this example, we start by loading the Pima Indian Diabetes dataset using Pandas. We then separate the features (X) from the target variable (y). Only X is used for clustering; y is set aside because K-means does not use labels.

Next, we standardize the features using the `StandardScaler` from Scikit-Learn to ensure that all features are on the same scale. Standardizing the features is important for K-means clustering since it is based on the Euclidean distance between data points.
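
The effect of scale on Euclidean distance can be seen on a tiny made-up example. The four rows below are hypothetical values, not taken from the dataset: the first column varies in the hundreds and the second in fractions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is on a large scale, column 1 on a small scale
X_raw = np.array([
    [100.0, 0.2],
    [110.0, 0.9],
    [300.0, 0.2],
    [310.0, 0.9],
])
X_std = StandardScaler().fit_transform(X_raw)

# Compare rows 0 and 1: close in column 0, far apart in column 1
diff_raw = X_raw[0] - X_raw[1]
diff_std = X_std[0] - X_std[1]

# Share of the squared Euclidean distance contributed by each column
share_raw = diff_raw**2 / np.sum(diff_raw**2)
share_std = diff_std**2 / np.sum(diff_std**2)
print(share_raw)  # dominated by column 0 before scaling
print(share_std)  # dominated by column 1 after scaling
```

Before scaling, the distance between the two rows is driven almost entirely by the large-scale column, even though the rows sit at opposite ends of the small-scale column. After scaling, the picture reverses.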

Afterward, we create an instance of the `KMeans` class with the desired number of clusters (k=2 in this case) and fit it to the standardized data. The `fit()` method performs the actual clustering.

We then obtain the cluster labels for each data point using the `labels_` attribute of the fitted KMeans object.

Finally, we visualize the clusters by plotting the first two features of the standardized data and coloring the points based on their assigned cluster labels.

Note that this example is for illustrative purposes. The Pima Indian Diabetes dataset is typically used for binary classification, and there is no guarantee that the two clusters K-means finds correspond to the two `Outcome` classes.
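
The value k=2 above was chosen by hand. One common way to compare candidate values of k is to look at the inertia (within-cluster sum of squares) and the silhouette score for each. The sketch below uses synthetic blobs from `make_blobs` with three hand-picked centers, an assumption made so that the expected answer is known in advance.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative only)
X_blobs, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.6,
    random_state=42,
)

results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_blobs)
    # Inertia always tends to fall as k grows; silhouette peaks at a well-fitting k
    results[k] = (model.inertia_, silhouette_score(X_blobs, model.labels_))
    print(k, round(results[k][0], 1), round(results[k][1], 3))

best_k = max(results, key=lambda k: results[k][1])
print("Best k by silhouette:", best_k)
```

On real data the peak is usually less clear-cut, so these scores are best treated as a guide rather than a definitive answer.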


##Hierarchical clustering


Hierarchical clustering groups data points based on their similarity and forms a hierarchy of clusters. In Scikit-Learn, it can be performed using the `AgglomerativeClustering` class from the `sklearn.cluster` module.

Here's an example of performing hierarchical clustering on the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Select the features for clustering
features = dataset[['Glucose', 'BMI']]

# Standardize the features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Perform hierarchical clustering
cluster = AgglomerativeClustering(n_clusters=2)
cluster_labels = cluster.fit_predict(features_scaled)

# Plot the clusters
plt.scatter(features_scaled[:, 0], features_scaled[:, 1], c=cluster_labels, cmap='viridis')
plt.xlabel('Glucose (standardized)')
plt.ylabel('BMI (standardized)')
plt.title('Hierarchical Clustering on Pima Indian Diabetes Dataset')
plt.show()


In this example, we load the Pima Indian Diabetes dataset using Pandas. We select the 'Glucose' and 'BMI' features for clustering. Before clustering, we standardize the features using the `StandardScaler` from Scikit-Learn so that both are on the same scale.

Next, we create an instance of the AgglomerativeClustering class with the parameter `n_clusters=2` to specify that we want to form 2 clusters. We then fit and predict the cluster labels using the `fit_predict` method.

Finally, we plot the clusters using a scatter plot, where the x-axis represents the standardized 'Glucose' values and the y-axis represents the standardized 'BMI' values. Each point is colored according to its assigned cluster label.
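
`AgglomerativeClustering` also accepts a `linkage` parameter (`"ward"` is the default), and the choice can change the result noticeably. The sketch below compares four linkage options on synthetic two-moons data from `make_moons`. The dataset and the use of the adjusted Rand index against the known generating labels are assumptions for illustration only; in a truly unsupervised setting such labels would not be available.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: a shape that is not well described by compact blobs
X_moons, y_moons = make_moons(n_samples=200, noise=0.05, random_state=0)

ari = {}
for method in ("ward", "complete", "average", "single"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=method).fit_predict(X_moons)
    # Adjusted Rand index: 1.0 means the clusters match the generating labels exactly
    ari[method] = adjusted_rand_score(y_moons, labels)
    print(method, round(ari[method], 3))
```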


##Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique used to identify patterns and reduce the number of variables in a dataset while preserving the most important information. It transforms the original variables into a new set of uncorrelated variables called principal components.

Scikit-Learn provides a straightforward implementation of PCA through the `PCA` class in its `decomposition` module. Here's an example of how to use PCA with the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Separate the features and target variable
X = dataset.drop('Outcome', axis=1)
y = dataset['Outcome']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Create a new DataFrame with the principal components
principal_df = pd.DataFrame(data=X_pca, columns=['Principal Component 1', 'Principal Component 2'])
principal_df['Outcome'] = y

# Print the resulting DataFrame
print(principal_df.head())


In this example, we first load the Pima Indian Diabetes dataset using Pandas. Then, we separate the features (`X`) and the target variable (`y`).

Next, we standardize the features using `StandardScaler` to ensure that all variables have a mean of 0 and a standard deviation of 1. Standardization is important before applying PCA to avoid variables with larger scales dominating the principal components.

We then create an instance of the `PCA` class and specify the number of components we want to retain (in this case, 2). The `fit_transform` method is used to apply PCA and obtain the transformed principal components (`X_pca`).

Finally, we create a new DataFrame (`principal_df`) with the principal components and the original target variable. We print the first few rows of the resulting DataFrame to see the output.

Note: It's common to visualize the results of PCA, but the example above focuses on the implementation. You can explore further by visualizing the principal components using scatter plots or other techniques.
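
Before visualizing, it often helps to check how much variance each component retains. The sketch below reads this from the fitted model's `explained_variance_ratio_` attribute. To keep it self-contained it uses the Iris dataset bundled with scikit-learn instead of the Pima data, which is a substitution made for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Bundled dataset used as a stand-in so that no download is needed
X_iris = StandardScaler().fit_transform(load_iris().data)

# Fit PCA with all components to inspect the full variance breakdown
pca_full = PCA().fit(X_iris)
ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)
print(np.round(ratios, 3))
print(np.round(cumulative, 3))

# Smallest number of components that retains at least 95% of the variance
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_keep)
```

The `PCA` class can also make this selection itself: passing a float between 0 and 1 as `n_components` keeps enough components to reach that share of variance.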


##Anomaly detection


Scikit-Learn provides several algorithms for anomaly detection. One commonly used option is the Isolation Forest algorithm, which isolates observations using random splits; anomalies tend to be separated from the rest of the data in fewer splits than normal points.

Here's an example of anomaly detection using the Isolation Forest algorithm with the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd
from sklearn.ensemble import IsolationForest

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Prepare the data
X = dataset.drop('Outcome', axis=1)

# Train the Isolation Forest model
# contamination is the expected proportion of anomalies in the data;
# random_state makes the randomized trees reproducible between runs
model = IsolationForest(contamination=0.1, random_state=42)
model.fit(X)

# Predict anomalies
predictions = model.predict(X)

# Add the anomaly predictions to the dataset
dataset['Anomaly'] = predictions

# Print the dataset with anomaly predictions
print(dataset)


In this example, we load the Pima Indian Diabetes dataset using Pandas. We then prepare the data by dropping the 'Outcome' column, as it represents the class labels and we want to perform unsupervised anomaly detection. Next, we train the Isolation Forest model by initializing an instance of the `IsolationForest` class with a contamination parameter of 0.1, indicating that we expect 10% of the data to be anomalies.

After training the model, we use it to predict anomalies in the dataset by calling the `predict()` method on the data. The resulting predictions are assigned to the `predictions` variable. We then add the anomaly predictions as a new column named 'Anomaly' to the dataset.

Finally, we print the dataset with the added 'Anomaly' column to see the output, where the values of -1 indicate anomalies and 1 indicate normal data points.
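
Beyond the -1/1 labels, a fitted `IsolationForest` exposes continuous scores through `score_samples()`, which can be used to rank points from most to least anomalous. The sketch below uses synthetic data with two injected outliers (at indices 200 and 201), an assumption made so the result is easy to check.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 ordinary points plus two injected outliers at indices 200 and 201
X_if = np.vstack([rng.normal(0, 1, (200, 2)), [[6.0, 6.0], [-7.0, 5.0]]])

forest = IsolationForest(contamination=0.05, random_state=42).fit(X_if)

# score_samples returns lower values for more anomalous points
scores_if = forest.score_samples(X_if)
pred = forest.predict(X_if)

# Indices of the five most anomalous points
ranked = np.argsort(scores_if)[:5]
print(ranked)
print((pred == -1).sum())  # number of points flagged as anomalies
```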


#Reflection points

1. **K-means Clustering**:
   - Explain the concept of K-means clustering and its primary objective.
     - Sample answer: K-means clustering is an unsupervised machine learning technique used to partition a dataset into K clusters, where each data point belongs to the cluster with the nearest mean. The objective is to minimize the within-cluster sum of squares, effectively grouping similar data points together.
   - Discuss the role of the K parameter in K-means clustering.
     - Sample answer: The K parameter represents the number of clusters to create. It determines the granularity of the clustering. Choosing a good value of K is important and typically involves techniques such as the elbow method (plotting the within-cluster sum of squares against K) or the silhouette score to balance model complexity against cluster quality.
   - Explain the steps involved in the K-means clustering algorithm.
     - Sample answer: The steps in the K-means clustering algorithm include initializing K centroids, assigning data points to the nearest centroid, updating the centroids based on the assigned data points, and repeating the assignment and update steps until convergence is achieved or a predefined stopping criterion is met.

2. **Hierarchical Clustering**:
   - Define hierarchical clustering and its key characteristics.
     - Sample answer: Hierarchical clustering is an unsupervised clustering algorithm that builds a hierarchy of clusters. It does not require a pre-defined number of clusters. Instead, it generates a tree-like structure called a dendrogram, which represents the merging and splitting of clusters based on the similarity of data points.
   - Discuss the differences between agglomerative and divisive hierarchical clustering.
     - Sample answer: Agglomerative hierarchical clustering starts with each data point as a separate cluster and progressively merges similar clusters until a single cluster remains. Divisive hierarchical clustering starts with all data points in a single cluster and recursively splits it into smaller clusters until each data point forms its own cluster.
   - Explain the linkage criteria used in hierarchical clustering.
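     - Sample answer: Linkage criteria define how the distance between two clusters is measured and therefore which clusters are merged next. Common linkage methods include single-linkage (minimum distance between points in the two clusters), complete-linkage (maximum distance), average-linkage (average distance), and Ward's method (the merge that least increases within-cluster variance).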

3. **Principal Component Analysis (PCA)**:
   - Describe the purpose and benefits of PCA.
     - Sample answer: Principal Component Analysis (PCA) is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while retaining most of the relevant information. It helps uncover the underlying structure and patterns in the data, simplifies data visualization, and can improve model performance by reducing noise and multicollinearity.
   - Explain the concept of principal components and their interpretation.
     - Sample answer: Principal components are new variables obtained by linearly combining the original variables in a way that captures the maximum variance in the data. Each principal component is orthogonal to the others and represents a unique direction in the data space. The first principal component explains the most significant variation, followed by subsequent components in decreasing order of importance.
   - Discuss the steps involved in performing PCA.
     - Sample answer: The steps in PCA include standardizing the data, computing the covariance or correlation matrix, calculating the eigenvectors and eigenvalues of that matrix, selecting the desired number of principal components based on explained variance or a cumulative variance threshold, and transforming the data into the new lower-dimensional space.

4. **Anomaly Detection**:
   - Define anomaly detection and its applications.
     - Sample answer: Anomaly detection is the identification of rare or abnormal instances that deviate from the expected behavior within a dataset. It finds applications in fraud detection, network intrusion detection, equipment failure prediction, and outlier detection in various domains.
   - Discuss the types of anomalies and their detection techniques.
     - Sample answer: Anomalies can be classified as point anomalies (individual instances), contextual anomalies (anomalies within a specific context), or collective anomalies (anomalies that occur as a group). Detection techniques include statistical methods (e.g., z-score, Mahalanobis distance), clustering-based approaches, and machine learning algorithms (e.g., isolation forest, one-class SVM).
   - Explain the evaluation metrics for anomaly detection.
     - Sample answer: When labeled anomalies are available for validation, evaluation metrics for anomaly detection include precision, recall, F1-score, and the area under the receiver operating characteristic curve (AUC-ROC). The choice of the appropriate metric depends on the specific problem and the desired balance between false positives and false negatives.
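
The anomaly detection metrics mentioned in the last sample answer can be computed with `sklearn.metrics` whenever ground-truth labels exist. The sketch below is a hypothetical setup: it injects ten known anomalies into synthetic data so that the labels are available, which is rarely the case in real unsupervised problems.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

rng = np.random.default_rng(1)
# 190 ordinary points around the origin
normal_points = rng.normal(0, 1, (190, 2))
# 10 injected anomalies spread on a circle of radius 8
angles = np.linspace(0, 2 * np.pi, 10, endpoint=False)
injected = 8 * np.column_stack([np.cos(angles), np.sin(angles)])

X_eval = np.vstack([normal_points, injected])
y_true = np.array([0] * 190 + [1] * 10)  # 1 marks an injected anomaly

detector = IsolationForest(contamination=0.05, random_state=0).fit(X_eval)

# Convert the -1/1 predictions to 1/0 so that 1 means "anomaly"
y_pred = (detector.predict(X_eval) == -1).astype(int)
# Negate score_samples so that higher values mean "more anomalous"
anomaly_score = -detector.score_samples(X_eval)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, anomaly_score)
print(precision, recall, f1, auc)
```

Precision, recall, and F1 depend on the threshold implied by `contamination`, while AUC-ROC is computed from the continuous scores and does not.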


#A quiz on Unsupervised Learning Algorithms


**Quiz on Clustering and Dimensionality Reduction Techniques in Python**

**1. K-means clustering is a method used for:**
<br>a) Classification
<br>b) Regression
<br>c) Clustering
<br>d) Dimensionality reduction

**2. In K-means clustering, what is the objective of the algorithm?**
<br>a) Minimize the variance within each cluster
<br>b) Maximize the variance within each cluster
<br>c) Minimize the distance between clusters
<br>d) Maximize the distance between clusters

**3. The number of clusters in K-means is determined by:**
<br>a) The user-specified value
<br>b) Random initialization
<br>c) The algorithm itself
<br>d) None of the above

**4. Which of the following are two linkage criteria used in hierarchical clustering?**
<br>a) Single-linkage and complete-linkage
<br>b) K-means and Ward's linkage
<br>c) Mean-shift and DBSCAN
<br>d) Linear regression and logistic regression

**5. In hierarchical clustering, the dendrogram is used to:**
<br>a) Visualize the data points in clusters
<br>b) Determine the number of clusters
<br>c) Evaluate the quality of clustering
<br>d) Store the clusters

**6. Principal Component Analysis (PCA) is used for:**
<br>a) Clustering similar data points together
<br>b) Reducing the number of dimensions in the data
<br>c) Adding more features to the dataset
<br>d) Performing regression tasks

**7. In PCA, the principal components are orthogonal to each other, meaning:**
<br>a) They have equal magnitudes
<br>b) They are parallel to each other
<br>c) They are perpendicular to each other
<br>d) They are not correlated

**8. Anomaly detection is used for:**
<br>a) Identifying outliers in the data
<br>b) Creating clusters of similar data points
<br>c) Dimensionality reduction
<br>d) Fitting a line to the data

**9. Which of the following is a commonly used unsupervised anomaly detection algorithm?**
<br>a) K-nearest neighbors (KNN)
<br>b) Decision trees
<br>c) Support Vector Machines (SVM)
<br>d) Linear regression

**10. In Python, which library can be used for various clustering and anomaly detection techniques?**
<br>a) Matplotlib
<br>b) Pandas
<br>c) Scikit-learn
<br>d) NumPy

---
**Answers:**
1. c) Clustering
2. a) Minimize the variance within each cluster
3. a) The user-specified value
4. a) Single-linkage and complete-linkage
5. b) Determine the number of clusters
6. b) Reducing the number of dimensions in the data
7. c) They are perpendicular to each other
8. a) Identifying outliers in the data
9. a) K-nearest neighbors (KNN)
10. c) Scikit-learn
---