# Welcome to the Notebook
---
In this notebook, we will explore high-dimensional data analysis. Our agenda includes:
- Importing the Dataset
- Normalization
- K-means Clustering
- Scatterplot Matrix
- Parallel Coordinate Plot (PCP)
- Data Reduction

### Tasks:
- Correlation Analysis
- Outlier Detection
- Cluster Analysis

## Importing Modules
Let's start by importing the necessary libraries for data manipulation, clustering, preprocessing, and visualization. 📚

## Loading the Dataset
Let's load our dataset and take a look at its structure. 🗂️

The dataset contains various metrics for different countries or regions, including GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity, and perceptions of corruption. Nordic countries like Finland, Denmark, Norway, and Iceland rank high in these metrics, indicating strong social support systems and high levels of life satisfaction. 🌍

### Data Overview
We'll check the data types and get a summary of our data to understand its distribution and characteristics. 📊

The dataset consists of seven columns, with 'Country or region' being categorical, while the remaining columns are continuous numerical data. This structure suggests the dataset is used for analyzing socio-economic and well-being indicators across different countries or regions.

The dataset contains 156 entries for each of the six numerical variables. Mean values include 0.905 for GDP per capita, 1.209 for Social support, and 0.725 for Healthy life expectancy. The variables span a wide range; GDP per capita, for example, runs from 0.000 to 1.684.
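As a rough sketch of the overview step, the snippet below reads a tiny in-memory CSV and prints the column types and summary statistics. The rows and country names are invented stand-ins (the real notebook reads its own file), and only three of the six numeric columns are included to keep it short.

```python
import io

import pandas as pd

# Made-up stand-in rows; the real notebook loads its own CSV file instead.
csv_text = """Country or region,GDP per capita,Social support,Healthy life expectancy
Country A,1.30,1.50,0.95
Country B,0.90,1.20,0.70
Country C,0.40,0.80,0.45
Country D,0.00,0.60,0.30
"""

df = pd.read_csv(io.StringIO(csv_text))

# Column types: the identifier column holds text, the metrics are floats.
print(df.dtypes)

# Count, mean, std, min, quartiles and max for every numeric column.
summary = df.describe()
print(summary)
```

`describe()` skips the text column by default, so the summary covers only the numeric metrics.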

## Data Normalization
To ensure all features contribute equally to the analysis, we'll normalize the data to a range between 0 and 1. 🔄

The dataset has been normalized using Min-Max scaling, transforming the features to a range between 0 and 1, while retaining the 'Country or region' column as a categorical identifier. This scaling allows for a direct comparison of these features across different countries.

After normalization, all six variables lie between 0 and 1 across the 156 entries. Social support has the highest mean (0.744) and Perceptions of corruption the lowest (0.244).
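The Min-Max scaling described above can be sketched in plain pandas as follows. The values and country names are invented for illustration, and `feature_cols` is a helper name introduced here; scikit-learn's `MinMaxScaler` with its default range gives the same result.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "Country or region": ["Country A", "Country B", "Country C", "Country D"],
    "GDP per capita": [1.30, 0.90, 0.40, 0.00],
    "Social support": [1.50, 1.20, 0.80, 0.60],
})

# Every column except the identifier is treated as a numeric feature.
feature_cols = [c for c in df.columns if c != "Country or region"]

# Min-Max scaling: x' = (x - min) / (max - min), applied column by column.
col_min = df[feature_cols].min()
col_max = df[feature_cols].max()
normalized = df.copy()
normalized[feature_cols] = (df[feature_cols] - col_min) / (col_max - col_min)

print(normalized)
```

Note that a constant column would make `max - min` zero, so this sketch assumes every feature takes at least two distinct values.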

## K-means Clustering
Let's apply K-means clustering to group the data into clusters. 🎯

The KMeans algorithm has clustered the dataset into three groups. The first five countries are assigned to cluster '0', suggesting they share similar characteristics across the metrics.
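A minimal sketch of this clustering step with scikit-learn's `KMeans` might look like the following. The feature matrix is synthetic, and `n_init` and `random_state` are assumptions chosen for reproducibility; only `n_clusters=3` comes from the text.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, already-normalized feature matrix: three loose groups of
# 20 points in six dimensions. These numbers are invented for illustration.
rng = np.random.default_rng(0)
centers = np.array([[0.2] * 6, [0.5] * 6, [0.8] * 6])
X = np.vstack([c + rng.normal(scale=0.03, size=(20, 6)) for c in centers])

# Three clusters, as in the text; n_init and random_state are assumptions.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels[:5])                     # cluster ids of the first five rows
print(kmeans.cluster_centers_.shape)  # one centroid per cluster
```

Cluster ids are arbitrary labels: the number assigned to a group carries no ranking, so "cluster 0" only means "the group the algorithm happened to number 0".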

## Scatter Plot Matrix
A scatter plot matrix helps visualize the relationships between variables and identify patterns or outliers. 📈

### Scatterplot Matrix Explanation

The scatterplot matrix displays pairwise relationships between various factors contributing to the Happiness Index, categorized by class. Notable positive correlations exist between factors like "Social support" and "Healthy life expectancy." The class distributions indicate distinct clustering patterns, especially in parameters like "Social support" and "Healthy life expectancy," suggesting their influence on happiness levels. Class 0 tends to show higher values across most indicators.
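One way to draw such a matrix is `pandas.plotting.scatter_matrix`, coloring each point by its class. This is a sketch on random placeholder data with placeholder labels, not the notebook's actual plotting code, and it uses only three of the features.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Random placeholder data standing in for the normalized features.
rng = np.random.default_rng(1)
features = ["GDP per capita", "Social support", "Healthy life expectancy"]
df = pd.DataFrame(rng.random((30, 3)), columns=features)
df["class"] = rng.integers(0, 3, size=30)  # placeholder cluster labels

# One panel per feature pair; histograms on the diagonal.
# Extra keyword arguments (c, cmap) are forwarded to the scatter calls.
axes = scatter_matrix(
    df[features],
    c=df["class"].to_numpy(),
    cmap="viridis",
    figsize=(7, 7),
    diagonal="hist",
)
```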

## Parallel Coordinate Plot
This plot allows us to visualize multi-dimensional data and observe the clustering results. 📊

The parallel coordinates plot displays the relationships among several features, categorized by three clusters. Each line represents a data point, and the clusters exhibit distinct patterns, with cluster 0 generally showing higher values across most features.
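A parallel coordinate plot of this kind can be sketched with `pandas.plotting.parallel_coordinates`. The data and cluster labels below are random placeholders, and the three colors are an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook

import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

# Random placeholder data: 12 rows, three features, three clusters of four.
rng = np.random.default_rng(2)
features = ["GDP per capita", "Social support", "Healthy life expectancy"]
df = pd.DataFrame(rng.random((12, 3)), columns=features)
df["cluster"] = np.repeat([0, 1, 2], 4)

# Each row becomes one polyline across the feature axes, colored by cluster.
ax = parallel_coordinates(
    df,
    class_column="cluster",
    cols=features,
    color=["tab:blue", "tab:orange", "tab:green"],
)
ax.set_ylabel("Normalized value")
```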

## Data Reduction
To handle large datasets, we can reduce the data size through sampling and reduce the number of dimensions using techniques like PCA or UMAP. 🔍
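The sampling step can be sketched with `DataFrame.sample`. The 50% fraction and the seed are assumptions made for this example, and the values are random placeholders; only the row count of 156 comes from the text.

```python
import numpy as np
import pandas as pd

# 156 rows to match the row count in the text; the values are random.
rng = np.random.default_rng(3)
df = pd.DataFrame(
    rng.random((156, 2)),
    columns=["GDP per capita", "Social support"],
)
df["cluster"] = rng.integers(0, 3, size=156)

# Draw half of the rows without replacement; random_state fixes the draw.
sampled = df.sample(frac=0.5, random_state=0)

print(len(df), "->", len(sampled))
```

Because the draw is uniform, small clusters can end up underrepresented; sampling within each cluster is one alternative when that matters.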

### Scatter Plot Matrix of Sampled Data
Let's visualize the sampled data to see if the patterns hold. 📉

The scatterplot matrix visualizes relationships between variables of the Happiness Index across three classes. Key patterns show a strong correlation between "GDP per capita" and "Healthy life expectancy," especially for class 2.

## PCA Projection
We'll apply PCA projection to visualize the data in a reduced dimensional space. 🌐

The dataset has been projected into a two-dimensional space using PCA, which keeps the two directions of greatest variance in the data. The first five rows of the projection are shown.
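A minimal sketch of the projection with scikit-learn's `PCA` is shown below. The input matrix is random placeholder data with the same shape as the six numeric features (156 rows, 6 columns).

```python
import numpy as np
from sklearn.decomposition import PCA

# Random placeholder matrix with the same shape as the six numeric features.
rng = np.random.default_rng(4)
X = rng.random((156, 6))

# Keep the two directions of largest variance.
pca = PCA(n_components=2)
projected = pca.fit_transform(X)

print(projected[:5])                  # first five projected rows
print(pca.explained_variance_ratio_)  # share of variance kept per component
```

`explained_variance_ratio_` is worth checking before reading too much into the plot: if the two components retain only a small share of the variance, the 2D view hides most of the structure.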

Let's visualize the projected data.

The scatter plot shows the projection of the Happiness Index data, with points grouped into three classes (0, 1, and 2) distinguished by color. Class 2 clusters around the origin, Class 1 lies mostly on the left side, and Class 0 is more dispersed across the right, suggesting that the classes differ along the projected dimensions.
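A plot like this can be sketched with matplotlib by drawing one scatter call per class. The 2D points and labels here are random placeholders, and the axis names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook

import matplotlib.pyplot as plt
import numpy as np

# Placeholder 2D projection and class labels (ten points per class).
rng = np.random.default_rng(5)
projected = rng.normal(size=(30, 2))
labels = np.repeat([0, 1, 2], 10)

fig, ax = plt.subplots(figsize=(6, 5))
for cls in np.unique(labels):
    mask = labels == cls
    # One scatter call per class, so each class gets its own color and legend entry.
    ax.scatter(projected[mask, 0], projected[mask, 1], label=f"Class {cls}")

ax.set_xlabel("Component 1")
ax.set_ylabel("Component 2")
ax.legend()
```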