## **1. Selection of data.**

Importing modules, defining helper functions, and reading the data.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.cluster import KMeans

In [None]:
# Function to calculate entropy
def calculate_entropy(data):
    class_counts = data['class'].value_counts()
    probabilities = class_counts / len(data)
    entropy = -np.sum(probabilities * np.log2(probabilities))
    return entropy

# Function to calculate information gain
def calculate_information_gain(data, attribute):
    entropy_before = calculate_entropy(data)
    values = data[attribute].unique()
    entropy_after = 0

    for value in values:
        subset = data[data[attribute] == value]
        entropy_after += len(subset) / len(data) * calculate_entropy(subset)

    information_gain = entropy_before - entropy_after
    return information_gain

In [None]:
df = pd.read_csv("/content/sample_data/car.data", names=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'])
print(df)

     buying  maint  doors persons lug_boot safety  class
0     vhigh  vhigh      2       2    small    low  unacc
1     vhigh  vhigh      2       2    small    med  unacc
2     vhigh  vhigh      2       2    small   high  unacc
3     vhigh  vhigh      2       2      med    low  unacc
4     vhigh  vhigh      2       2      med    med  unacc
...     ...    ...    ...     ...      ...    ...    ...
1723    low    low  5more    more      med    med   good
1724    low    low  5more    more      med   high  vgood
1725    low    low  5more    more      big    low  unacc
1726    low    low  5more    more      big    med   good
1727    low    low  5more    more      big   high  vgood

[1728 rows x 7 columns]


***Comments on reading. Short explanations about the need to transform and supplement data in certain columns:***

The resulting data is a tabular DataFrame with seven columns: '***buying***', '***maint***', '***doors***', '***persons***', '***lug_boot***', '***safety***' and '***class***'. Each row corresponds to a specific car.
All seven columns hold categorical string values, so they must be converted to numbers before modeling. To prepare the data, we can use the **Label Encoding** technique to replace categorical values with numeric ones.

## **2. Transformation of data**

In [None]:
# Encode categorical variables using LabelEncoder
label_encoder = LabelEncoder()
for col in df.columns:
    df[col] = label_encoder.fit_transform(df[col])

print(df)

      buying  maint  doors  persons  lug_boot  safety  class  cluster
0          3      3      0        0         2       1      2        1
1          3      3      0        0         2       2      2        2
2          3      3      0        0         2       0      2        1
3          3      3      0        0         1       1      2        1
4          3      3      0        0         1       2      2        2
...      ...    ...    ...      ...       ...     ...    ...      ...
1723       1      1      3        2         1       2      1        2
1724       1      1      3        2         1       0      3        1
1725       1      1      3        2         0       1      2        0
1726       1      1      3        2         0       2      1        0
1727       1      1      3        2         0       0      3        0

[1728 rows x 8 columns]


***Comments on the conversion. Explanation of the need for standardization and normalization of data:***


The data were **Label Encoded**: the categorical values of each attribute were replaced by integer codes. For example, '***buying***' has four possible categories (***vhigh, high, med, low***), which are now represented by the integers 0 to 3. Note that `LabelEncoder` numbers the labels in alphabetical order rather than by their natural ranking, so in the output above '***vhigh***' is encoded as 3 and '***low***' as 1. This transformation makes the data suitable for machine learning algorithms that require numerical input.
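The alphabetical numbering can be checked on a small example. The sketch below is illustrative only: the toy `buying` series and the `buying_order` mapping are assumptions introduced here, not part of the original program. It compares the codes produced by `LabelEncoder` with a hand-written mapping that keeps the natural ranking.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with the same labels as the 'buying' attribute
buying = pd.Series(['vhigh', 'high', 'med', 'low'])

# LabelEncoder sorts the labels alphabetically before numbering them
encoder = LabelEncoder()
alphabetical_codes = encoder.fit_transform(buying)
label_to_code = {str(label): code for code, label in enumerate(encoder.classes_)}

# Hypothetical explicit mapping that preserves the natural ranking
buying_order = {'low': 0, 'med': 1, 'high': 2, 'vhigh': 3}
ordinal_codes = buying.map(buying_order)

print(label_to_code)           # alphabetical numbering
print(ordinal_codes.tolist())  # ranking-preserving numbering
```

Whether the ranking-preserving codes help depends on the downstream model; for the entropy calculations later in this work the numbering does not matter, because only the grouping of equal values is used.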

Regarding standardization and normalization for the simple multilayer perceptron (**MLP**) (part 2):
In the case of a simple **MLP**, standardization or normalization of the data may be important because the algorithm is sensitive to the values of the input attributes and their scales.
Standardization can improve the speed of convergence of the algorithm and avoid the disproportionate impact of large values of individual attributes.


In [None]:
# Standardization and normalization of data
scaler = StandardScaler()
min_max_scaler = MinMaxScaler()

df_scaled = pd.DataFrame(min_max_scaler.fit_transform(scaler.fit_transform(df)), columns=df.columns)

print(df_scaled)

        buying     maint  doors  persons  lug_boot  safety     class  cluster
0     1.000000  1.000000    0.0      0.0       1.0     0.5  0.666667      0.5
1     1.000000  1.000000    0.0      0.0       1.0     1.0  0.666667      1.0
2     1.000000  1.000000    0.0      0.0       1.0     0.0  0.666667      0.5
3     1.000000  1.000000    0.0      0.0       0.5     0.5  0.666667      0.5
4     1.000000  1.000000    0.0      0.0       0.5     1.0  0.666667      1.0
...        ...       ...    ...      ...       ...     ...       ...      ...
1723  0.333333  0.333333    1.0      1.0       0.5     1.0  0.333333      1.0
1724  0.333333  0.333333    1.0      1.0       0.5     0.0  1.000000      0.5
1725  0.333333  0.333333    1.0      1.0       0.0     0.5  0.666667      0.0
1726  0.333333  0.333333    1.0      1.0       0.0     1.0  0.333333      0.0
1727  0.333333  0.333333    1.0      1.0       0.0     0.0  1.000000      0.0

[1728 rows x 8 columns]


**Standardization:**

'***buying***', '***maint***', '***doors***', '***persons***', '***lug_boot***', '***safety***': After **Label Encoding** these attributes are numeric, but their ranges differ (for example, 0 to 3 for '***buying***' and 0 to 2 for '***safety***').
Standardization rescales each attribute so that its values have a mean of 0 and a standard deviation of 1.
Formula: Standardized_Value = (X - Mean(X)) / StdDev(X), where X is the original value of the attribute, Mean(X) is the average value of the attribute, StdDev(X) is the standard deviation of the attribute.
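As a quick check of the formula, here is a minimal sketch on a toy array (the values in `x` are an assumption chosen to resemble encoded codes 0 to 3). It applies the formula by hand and compares the result with `StandardScaler`, which uses the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy attribute with encoded values 0..3
x = np.array([0.0, 1.0, 2.0, 3.0])

# Standardized_Value = (X - Mean(X)) / StdDev(X), with population std (ddof=0)
manual = (x - x.mean()) / x.std()

# StandardScaler expects a 2-D array of shape (n_samples, n_features)
sklearn_result = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(manual)
print(sklearn_result)
```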

**Normalization (optional):**

Normalization can also be used to bring values between 0 and 1.
Formula: Normalized_Value = (X - Min(X)) / (Max(X) - Min(X)), where X is the original value of the attribute, Min(X) is the minimum value of the attribute, and Max(X) is the maximum value of the attribute.
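The sketch below applies the normalization formula to a toy column (the values in `x` are an assumption). It also illustrates a property worth knowing about the scaling cell above: because min-max scaling is unaffected by a prior shift and positive rescaling, chaining `StandardScaler` and `MinMaxScaler` should give the same result as `MinMaxScaler` alone.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy attribute with encoded values 0..3, shaped (n_samples, 1)
x = np.array([[0.0], [1.0], [2.0], [3.0]])

# Normalized_Value = (X - Min(X)) / (Max(X) - Min(X))
manual = (x - x.min()) / (x.max() - x.min())

# Min-max scaling applied directly
direct = MinMaxScaler().fit_transform(x)

# Min-max scaling applied after standardization, as in the scaling cell
chained = MinMaxScaler().fit_transform(StandardScaler().fit_transform(x))

print(manual.ravel())
print(direct.ravel())
print(chained.ravel())
```

The values 0, 1/3, 2/3 and 1 match the pattern seen in the '***buying***' and '***maint***' columns of the scaled output.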

## **3. Dividing samples into clusters and creating a sample**

***Discussion of possible options for clustering:***

**K-Means clustering:**

**Pros**: Simple and computationally efficient, works well with spherical clusters.

**Cons**: Sensitive to the initial placement of centroids, assumes clusters have similar variances.

**Hierarchical clustering:**

**Pros**: Captures hierarchical relationships in the data, does not assume a fixed number of clusters.

**Cons**: Can be computationally expensive, difficult to interpret for large datasets.

**DBSCAN (Density-Based Spatial Clustering of Applications with Noise):**

**Pros**: Can detect clusters of arbitrary shape, resistant to outliers.

**Cons**: Sensitive to the choice of hyperparameters, can struggle with clusters of different densities.

I will use **K-Means** clustering in the program.
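To make the three options concrete, here is a minimal sketch that runs each algorithm on synthetic data. Everything in it is an assumption for illustration: the data comes from `make_blobs`, and the `eps` and `min_samples` values for DBSCAN are arbitrary choices, not tuned settings for the car data.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three compact groups (illustrative only)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=42)

# K-Means: the number of clusters must be given in advance
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Hierarchical (agglomerative) clustering, cut at three clusters
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# DBSCAN: no cluster count, but eps and min_samples must be chosen;
# points labelled -1 are treated as noise
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

print("K-Means clusters:     ", len(set(kmeans_labels)))
print("Hierarchical clusters:", len(set(hier_labels)))
print("DBSCAN clusters:      ", len(set(dbscan_labels) - {-1}))
```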

In [None]:
# Apply K-means clustering
num_clusters = 3
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(df_scaled.drop('class', axis=1))

# Display clusters in the original DataFrame
print("\nDataFrame with Clusters:")
print(df.head())


DataFrame with Clusters:
   buying  maint  doors  persons  lug_boot  safety  class  cluster
0       3      3      0        0         2       1      2        2
1       3      3      0        0         2       2      2        0
2       3      3      0        0         2       0      2        2
3       3      3      0        0         1       1      2        2
4       3      3      0        0         1       2      2        0


***Comments on the obtained clusters and sample:***

K-Means was fit on the scaled attributes with the '***class***' column dropped, and the resulting "***cluster***" column has been added to the DataFrame. Each row is assigned a cluster label (in this case 0, 1, or 2); the label numbers are arbitrary identifiers and carry no ordering. The resulting DataFrame now contains both the original attributes and the assigned cluster labels.



## **4. Data compression using the method of principal components**

***Description of the data compression algorithm using the method of principal components:***

• Data standardization

• Construction of the covariance matrix

• Finding eigenvalues and eigenvectors of the covariance matrix

• Sorting of eigenvalues in descending order

• Selection of ***k<d*** eigenvectors for the ***k*** largest eigenvalues

• Construction of the projection matrix
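The steps above can be sketched in NumPy as follows. This is a minimal illustration on synthetic data, not the program's own cell: the random matrix `X`, the choice `k = 2`, and the variable names are assumptions.

```python
import numpy as np

# Synthetic data: 200 samples, 4 attributes, two of them correlated
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += 2.0 * X[:, 0]

# 1. Standardize each attribute (mean 0, standard deviation 1)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix (columns are variables)
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvalues and eigenvectors; eigh is intended for symmetric matrices
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort eigenvalues (and matching eigenvector columns) in descending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 5. Keep the k eigenvectors with the largest eigenvalues
k = 2
projection_matrix = eigenvectors[:, :k]

# 6. Project the standardized data onto the k components
X_reduced = X_std @ projection_matrix

# Share of total variance kept by the first k components
explained = eigenvalues[:k].sum() / eigenvalues.sum()
print(X_reduced.shape, round(float(explained), 3))
```

The ratio `explained` is one common way to judge how much information a given `k` retains.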

In [None]:
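# Covariance matrix of the scaled data (columns are variables)
covariance_matrix = np.cov(df_scaled, rowvar=False)

# Eigenvalues and eigenvectors; eigh is intended for symmetric matrices
# and returns real values
eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)

# Sort the components by eigenvalue, largest first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Print eigenvectors (each eigenvector is a column of the matrix)
print("\nEigenvectors:")
for i, eigenvector in enumerate(eigenvectors.T):
    print(f"PC{i + 1}:", eigenvector)

# Perform PCA using the first two principal components
pca_result = df_scaled.dot(eigenvectors[:, :2])

# Display PCA result
print("\nPCA Result (First Two Principal Components):")
print(pca_result.head())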


Eigenvectors:
PC1: [-3.91493594e-04 -5.56935493e-03  7.65372078e-02 -4.81091183e-02
  6.74767000e-01  5.20258323e-01  5.15574602e-01  3.62181907e-17]
PC2: [-2.66127672e-02  3.47391179e-02  6.64939076e-02 -3.82243408e-02
  5.98432772e-01 -2.41675976e-12 -7.96291800e-01 -3.40151553e-16]
PC3: [ 2.38496097e-04  3.39282542e-03 -4.66261151e-02  2.93078537e-02
 -4.11064954e-01  8.54008945e-01 -3.14085677e-01 -1.58476289e-15]
PC4: [-9.17857653e-02  1.22224005e-01 -3.31780639e-01  9.19516361e-01
  9.69523263e-02  4.88545134e-14  9.41712420e-03 -1.07515115e-01]
PC5: [-6.40132993e-01  6.30039912e-01  1.34065972e-01 -1.40472302e-01
 -4.53859394e-02  9.43518823e-14  3.27095407e-02 -3.90440150e-01]
PC6: [-2.84145052e-01  2.83414272e-01  1.82355692e-02  4.81401059e-02
 -7.98033837e-03  4.72961311e-14  1.50751049e-02  9.14328710e-01]
PC7: [-1.16142472e-02  1.10527874e-01 -9.24031152e-01 -3.57473991e-01
  7.75580569e-02  2.80887523e-14  3.49592545e-03  1.65479274e-16]
PC8: [-7.07264713e-01 -7.03080085

***Comments on compression results:***

Principal Component Analysis (**PCA**) is applied to the car evaluation dataset to reduce its dimensionality. The program computes the covariance matrix of the scaled data and then its eigenvalues and eigenvectors. The eigenvectors are the principal components; those with the largest eigenvalues capture the most variance in the data. The program prints the eigenvectors and projects the data onto the first two principal components, giving a two-dimensional representation of the original dataset.

## **5. Calculation of entropy and assessment of informativeness of sample attributes**


***Discussion of possible options for selecting the target attribute.***
***Formulas for calculating entropy and information gain for processed data:***

**Calculation of entropy:**

The entropy of a set S with c classes, written Entropy(S) or H(S), is the negative sum over all classes of the fraction of instances p_i in class i multiplied by the base-2 logarithm of p_i:

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

**Calculation of information gain:**

The information gain IG(S, A) for the attribute A and the set S is calculated by subtracting the sum of the weighted entropies of the subsets S_v from the entropy of the original set:

$$IG(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

**Where:**

• Values(A) are the unique values of attribute A.

• |S| is the total number of instances in the set S.

• |S_v| is the number of instances in set S with value v for attribute A.

• Entropy(S_v) is the entropy of the subset S_v.
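A small worked example may help with the two formulas. The sketch below uses a hypothetical four-row table (the `toy` DataFrame and the helper names `entropy` and `information_gain` are assumptions for illustration). One attribute separates the classes perfectly and the other not at all, so the gains can be checked by hand.

```python
import numpy as np
import pandas as pd

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in labels
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(data, attribute, target):
    # IG(S, A) = Entropy(S) - sum(|S_v| / |S| * Entropy(S_v))
    before = entropy(data[target])
    after = sum(
        len(subset) / len(data) * entropy(subset[target])
        for _, subset in data.groupby(attribute)
    )
    return before - after

# Hypothetical toy table: two classes, two instances each
toy = pd.DataFrame({
    'safety': ['low', 'low', 'high', 'high'],
    'doors':  ['2', '4', '2', '4'],
    'class':  ['unacc', 'unacc', 'acc', 'acc'],
})

print(entropy(toy['class']))                     # two equally likely classes
print(information_gain(toy, 'safety', 'class'))  # split gives pure subsets
print(information_gain(toy, 'doors', 'class'))   # split leaves classes mixed
```

With two equally likely classes the entropy is 1 bit. Splitting on 'safety' yields two pure subsets, so the gain equals the full 1 bit; splitting on 'doors' leaves each subset as mixed as before, so the gain is 0.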


**Selecting the target attribute:**

The choice of the target attribute depends on the goal of the problem. In the car evaluation data set, the '***class***' attribute is a suitable target because it represents the car's acceptability rating (for example, ***unacc*** or ***vgood***).

**Entropy and information gain:**

A lower entropy indicates a purer, more homogeneous set. A high information gain indicates that splitting on the attribute produces subsets that are more homogeneous with respect to the target attribute.

In summary, with '***class***' as the target, the goal is to find the attributes whose splits give the most homogeneous subsets, resulting in a model that effectively classifies instances.

In [None]:
# Display information gain for each original attribute
target_attribute = 'class'
# Skip the target itself and the derived 'cluster' column added in section 3
features = [col for col in df.columns if col not in (target_attribute, 'cluster')]
for attribute in features:
    information_gain = calculate_information_gain(df, attribute)
    print(f'Information Gain for {attribute}: {information_gain}')


Information Gain for buying: 0.09644896916961399
Information Gain for maint: 0.07370394692148596
Information Gain for doors: 0.004485716626632108
Information Gain for persons: 0.2196629633399082
Information Gain for lug_boot: 0.030008141247605424
Information Gain for safety: 0.262184356554264


***Comments on the value of sample attributes:***

Information Gain for buying: 0.0964

Information Gain for maint: 0.0737

Information Gain for doors: 0.0045

Information Gain for persons: 0.2197

Information Gain for lug_boot: 0.0300

Information Gain for safety: 0.2622

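'***safety***' (0.2622) and '***persons***' (0.2197) have by far the highest information gain, so these attributes are the most informative for distinguishing the classes. '***buying***' and '***maint***' contribute moderately, while '***doors***' (0.0045) carries almost no information about the class.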
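The ranking can also be produced programmatically. This is a small sketch that hard-codes the rounded values reported above; the `info_gain` dictionary is an illustrative stand-in for collecting the results of `calculate_information_gain` in a loop.

```python
# Rounded information gain values taken from the output above
info_gain = {
    'buying': 0.0964,
    'maint': 0.0737,
    'doors': 0.0045,
    'persons': 0.2197,
    'lug_boot': 0.0300,
    'safety': 0.2622,
}

# Sort attributes from most to least informative
ranked = sorted(info_gain.items(), key=lambda item: item[1], reverse=True)

for name, gain in ranked:
    print(f"{name:<10}{gain:.4f}")
```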

## **6. General conclusions of the first part of work:**

**K-means** clustering assigned each sample to one of three clusters.

**PCA** was used to reduce the dimensionality of the data set by projecting it onto the first two principal components.

**Information gain** analysis identifies the attributes that contribute most to the classification task, here '***safety***' and '***persons***'.

The '***cluster***' feature provides an additional view of the data set by grouping similar samples.

Overall, the program offers an understanding of the structure of the car evaluation data set, providing a framework for further analysis and decision-making. **PCA** and **K-means** clustering complement each other: the first gives a compact representation of the data, the second groups similar samples.