<h4>Variable combinations, feature generation and interaction features: Using Cluster Profiling</h4>

In [1]:
# importing all the necessary modules
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("white_wine_cleaned.csv")
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


In [2]:
# K-Means clustering groups the wines into 6 clusters based on standardized features.
# The target variable quality is excluded from the features, and
# StandardScaler puts every feature on the same scale so that each contributes equally.
# K-Means assigns each data point to a cluster,
# and the cluster labels are then added to the DataFrame as a new column, cluster.
# The labels make it possible to analyze how the clusters relate to other variables,
# and I think they may improve predictions in supervised learning.
# The clusters can also be evaluated and visualized for deeper insight.
# Note: X still contains the leftover index column "Unnamed: 0", so it takes part in the clustering.
# X/y split

X = df.drop("quality", axis=1)

# Set a random seed for reproducibility
random_seed = 42

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# create the KMeans instance (it is fitted in the next step)
kmeans = KMeans(n_clusters=6, random_state=random_seed)

# fit the model and place the cluster labels back into the DataFrame
df['cluster'] = kmeans.fit_predict(X_scaled)

df.head(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,cluster
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,4
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,0
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,5
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,1
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,1
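
<h4>The preparation step above can be sketched on a small synthetic table. This is a minimal illustration, not the notebook's actual run: the values are made up, and excluding the index-like column "Unnamed: 0" from the features is an assumption about what is intended, since the original cell keeps it.</h4>

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine table (hypothetical values, not the real data)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Unnamed: 0": np.arange(200),              # leftover row index from a CSV export
    "alcohol": rng.normal(10.5, 1.2, 200),
    "residual sugar": rng.normal(6.0, 5.0, 200),
    "quality": rng.integers(3, 10, 200),
})

# Drop the target and the index-like column before scaling
feature_cols = [c for c in demo.columns if c not in ("quality", "Unnamed: 0")]
X_demo = StandardScaler().fit_transform(demo[feature_cols])

# After standardization every feature has mean ~0 and standard deviation ~1
print(X_demo.mean(axis=0).round(6), X_demo.std(axis=0).round(6))

# n_init is set explicitly so the result does not depend on the library default
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)
print(np.bincount(labels))
```

<h4>Dropping the index column changes the cluster assignments, so the counts shown in this notebook would not be reproduced exactly with that change.</h4>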


In [3]:
# Count the number of rows in each cluster.
# The cluster labels (0 to 5) represent the six clusters identified by K-Means.
# Cluster 5 has the most data points (1094), so it is the largest group.
# Cluster 2 has the fewest data points (107), suggesting it captures a much smaller
# or distinct subset of the dataset.
# The other clusters (0, 1, 3, 4) are moderately sized,
# ranging between 848 and 1005 data points.
df['cluster'].value_counts()

cluster
5    1094
0    1005
1     929
3     915
4     848
2     107
Name: count, dtype: int64
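
<h4>Cluster sizes are easier to compare as percentages. The sketch below re-enters the counts printed above by hand (an assumption made so the example runs on its own) and converts them to shares of the whole dataset.</h4>

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({5: 1094, 0: 1005, 1: 929, 3: 915, 4: 848, 2: 107}, name="count")

# Percentage of all rows that fall into each cluster
shares = (counts / counts.sum() * 100).round(1)

print(counts.sum())   # total number of rows
print(shares)
```

<h4>On the DataFrame itself, df['cluster'].value_counts(normalize=True) gives the same shares as fractions.</h4>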

<h4>Group the DataFrame df by the quality and cluster columns, then count the occurrences of each combination</h4>

In [4]:
# calculating the counts
cluster_counts = df.groupby(['quality', 'cluster']).size().unstack(fill_value=0)
print(cluster_counts)

# remember to lock down the random seed when you find a good clustering result

cluster    0    1   2    3    4    5
quality                             
3          1    7   1    3    3    5
4         22   26   4   34   19   58
5        190  455  50   83  337  342
6        524  400  48  370  368  488
7        228   34   2  344  103  169
8         40    7   2   77   18   31
9          0    0   0    4    0    1


<h4><b>The Cluster Distribution</b></h4>
<h4>I think the clusters are not evenly distributed across qualities. For example, cluster 1 is the largest group at quality 5 (455 wines) and cluster 0 at quality 6 (524 wines), while cluster 5 is large at both levels with counts of 342 and 488, respectively. Cluster 2 is small at every quality level, with only 107 wines in total.</h4>
<h4>Some quality levels are sparse: quality 3 and quality 9 have far fewer data points than the other qualities (20 and 5 wines, respectively), so they are poorly represented in the dataset.</h4>
<h4>Higher-quality wines (7, 8, 9) are concentrated in clusters 3 and 0, which hold 344 and 228 of the quality-7 wines and 77 and 40 of the quality-8 wines. Clusters 1 and 2 contain very few of them. Lower-quality wines (3, 4) are spread more evenly across clusters, although their counts are small.</h4>
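
<h4>One way to check which clusters each quality level concentrates in is to normalize the count table row by row. The sketch below re-enters the counts printed by cell In [4] by hand (an assumption made so the example runs on its own) and reports the share of each cluster within every quality level.</h4>

```python
import pandas as pd

# Counts copied from the quality-by-cluster table printed by cell In [4]
cluster_counts = pd.DataFrame(
    {
        0: [1, 22, 190, 524, 228, 40, 0],
        1: [7, 26, 455, 400, 34, 7, 0],
        2: [1, 4, 50, 48, 2, 2, 0],
        3: [3, 34, 83, 370, 344, 77, 4],
        4: [3, 19, 337, 368, 103, 18, 0],
        5: [5, 58, 342, 488, 169, 31, 1],
    },
    index=pd.Index([3, 4, 5, 6, 7, 8, 9], name="quality"),
)
cluster_counts.columns.name = "cluster"

# Share of each cluster within a quality level (each row sums to 1)
row_share = cluster_counts.div(cluster_counts.sum(axis=1), axis=0)
print(row_share.round(2))

# Cluster holding the most wines at each quality level
print(row_share.idxmax(axis=1))
```

<h4>Row shares remove the effect of the very different sizes of the quality levels, which raw counts cannot do.</h4>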

<h4><b>Cluster profiling: creating the cluster average</b></h4>

In [5]:
# Then calculate the average of quality for each cluster
cluster_quality_average = df.groupby('cluster')['quality'].mean()

# place the averages based on clusters back into DataFrame
df['cluster_average_quality'] = df['cluster'].map(cluster_quality_average)

df.head(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,cluster,cluster_average_quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,4,5.711085
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,0,6.070647
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,5,5.781536
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,1,5.483315
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,1,5.483315
5,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,5,5.781536
6,6.2,0.32,0.16,7.0,0.045,30.0,136.0,0.9949,3.18,0.47,9.6,6,1,5.483315
7,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,4,5.711085
8,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,0,6.070647
9,8.1,0.22,0.43,1.5,0.044,28.0,129.0,0.9938,3.22,0.45,11.0,6,5,5.781536


<h4><b>The Cluster_Average_Quality Feature</b></h4>
<h4>The first 10 rows show how cluster_average_quality follows from each row's assigned cluster.
For example:
<ul>
<li>Row 0 belongs to cluster 4 and has a cluster_average_quality of 5.711</li>
<li>Row 1 belongs to cluster 0 and has a cluster_average_quality of 6.071</li>
</ul>
</h4>
<h4>I think the cluster_average_quality column is useful for
<ul>
<li>revealing the average quality of each cluster, and</li>
<li>checking whether the clustering aligns with the target variable (quality).</li>
</ul>
The dataset now has a feature (cluster_average_quality) that could be used in further analysis or modeling, particularly if the clusters carry information about quality. Because the feature is computed from quality itself, it should be derived from training data only when it is used to predict quality.
In short, the cluster_average_quality column summarizes each cluster's overall quality and provides a simple way to inspect the relationship between clusters and the target.</h4>
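
<h4>If cluster_average_quality is later used as a model input, the cluster means should come from training rows only, because the feature is built from the target. The sketch below shows one way to do that, assuming synthetic data, an 80/20 split, and a global-mean fallback for unseen clusters. None of these choices come from the notebook.</h4>

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (hypothetical values): cluster labels plus a target
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "cluster": rng.integers(0, 4, 500),
    "quality": rng.integers(3, 10, 500),
})

# Hold out 20% of the rows; only the remaining 80% may see the target
test_idx = demo.sample(frac=0.2, random_state=0).index
train = demo.drop(test_idx).copy()
test = demo.loc[test_idx].copy()

# Cluster means are estimated on the training rows only
train_means = train.groupby("cluster")["quality"].mean()
fallback = train["quality"].mean()   # used if a cluster never appears in train

train["cluster_average_quality"] = train["cluster"].map(train_means)
test["cluster_average_quality"] = test["cluster"].map(train_means).fillna(fallback)

print(train_means.round(3))
print(test.head())
```

<h4>The test rows never contribute to the means they receive, so a model evaluated on them is not helped by its own targets.</h4>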

In [6]:
df['cluster'].value_counts()

cluster
5    1094
0    1005
1     929
3     915
4     848
2     107
Name: count, dtype: int64

<h4><b>Evaluating the new feature</b></h4>

In [7]:
# one way is to check the correlation between the target and the cluster-profiled variable
correlation = df['quality'].corr(df['cluster_average_quality'])
correlation

np.float64(0.3480011533470964)

<h4>Correlation measures the strength and direction of a linear relationship between two variables. The value ranges from:
<ul>
<li>-1: Perfect negative linear relationship</li>
<li>0: No linear relationship</li>
<li>+1: Perfect positive linear relationship</li>
</ul>
The value 0.348 is therefore a positive correlation: when cluster_average_quality increases, quality also tends to increase.
The correlation is weak to moderate (closer to 0 than to 1), so the relationship is not strong but still meaningful.</h4>
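
<h4>Because cluster_average_quality is a per-cluster mean of quality, its correlation with quality has a direct reading: the squared correlation equals the share of the variance of quality that the clusters explain. For 0.348 that is roughly 12%. The sketch below checks this identity on synthetic data (hypothetical values, not the wine dataset).</h4>

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (hypothetical values): a target that depends weakly on the group
rng = np.random.default_rng(1)
group = rng.integers(0, 6, 2000)
y = pd.Series(5.5 + 0.2 * group + rng.normal(0, 1, 2000))

# Group-mean feature, built the same way as cluster_average_quality
group_mean = y.groupby(group).transform("mean")

# Pearson correlation between the target and its group means
r = y.corr(group_mean)

# Share of the variance of y explained by the group means (population variances)
explained = group_mean.var(ddof=0) / y.var(ddof=0)

print(round(r, 4), round(r ** 2, 4), round(explained, 4))

# Applied to the value reported above: 0.348 squared is roughly 0.121
print(round(0.3480011533470964 ** 2, 3))
```

<h4>The second and third printed numbers agree, which is why the correlation can be read as a measure of how much of the quality variation the clusters capture.</h4>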

<h4>The correlation coefficient suggests that the clustering approach adds some value in understanding the target variable (quality) but cannot fully capture its variability. I think it may be necessary to incorporate additional features to improve the clustering strategy and obtain better results.</h4>