In [54]:
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Step 2: Load the dataset (Upload your dataset to Google Colab first)
from google.colab import files
uploaded = files.upload()

# Load the dataset (make sure the file is properly uploaded)
df = pd.read_excel('EastWestAirlines.xlsx')  # Adjust the filename as necessary
print("Data Loaded:")
print(df.head())  # Print the first few rows to ensure the data is loaded

# Step 3: Check data types of all columns
print("Data Types:")
print(df.dtypes)

# Check for missing values
print("Checking for missing values...")
print(df.isnull().sum())  # Show missing values count

# Drop rows with missing values
df.dropna(inplace=True)

# Convert object columns to numeric if possible
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors='ignore')  # Convert to numeric, ignore errors
    if df[col].dtype == 'object':
        print(f"Column '{col}' could not be converted to numeric and remains as object.")

# Select only numeric columns for scaling
numeric_columns = df.select_dtypes(include=[np.number]).columns
print(f"Numeric columns to be scaled: {numeric_columns}")

if len(numeric_columns) == 0:
    print("No numeric columns available for scaling.")
else:
    # Scale the numeric features using StandardScaler
    scaler = StandardScaler()
    scaled_df = scaler.fit_transform(df[numeric_columns])  # Now scaled_df is defined
    print(f"Scaled data shape: {scaled_df.shape}")  # Check shape to ensure scaling worked

    # Check if scaled_df is defined
    print("Scaled Data Sample:")
    print(scaled_df[:5])  # Print first 5 rows of scaled data

# Proceed to K-Means Clustering only if scaled_df exists
if 'scaled_df' in locals():
    # Determine the optimal number of clusters using the Elbow Method
    sse = []  # List to store sum of squared errors
    k_values = range(1, 11)  # K values from 1 to 10

    # Loop to calculate SSE for different K values
    for k in k_values:
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(scaled_df)  # Use scaled_df (scaled data)
        sse.append(kmeans.inertia_)  # Append SSE to the list

    # Plot the Elbow Curve
    plt.figure(figsize=(10, 6))
    plt.plot(k_values, sse, 'bo-')
    plt.xlabel('Number of clusters (K)')
    plt.ylabel('Sum of Squared Errors (SSE)')
    plt.title('Elbow Curve for K-Means')
    plt.show()

    # Choosing K=4 based on the elbow curve
    kmeans = KMeans(n_clusters=4, random_state=42)
    kmeans_labels = kmeans.fit_predict(scaled_df)
    print(f"KMeans Labels:\n{kmeans_labels}")

    # Step 6: Calculate the silhouette score for K-Means
    silhouette_avg = silhouette_score(scaled_df, kmeans_labels)
    print(f'Silhouette Score for K-Means: {silhouette_avg}')
else:
    print("Cannot proceed to K-Means clustering as scaled_df is not defined.")


Saving EastWestAirlines.xlsx to EastWestAirlines (10).xlsx
Data Loaded:
  East-West Airlines is trying to learn more about its customers.  Key issues are their  \
0  flying patterns, earning and use of frequent f...                                      
1  card.  The task is to identify customer segmen...                                      
2                                                NaN                                      
3                                                NaN                                      
4  Source: Based upon real business data; company...                                      

  Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4  
0        NaN        NaN        NaN        NaN  
1        NaN        NaN        NaN        NaN  
2        NaN        NaN        NaN        NaN  
3        NaN        NaN        NaN        NaN  
4        NaN        NaN        NaN        NaN  
Data Types:
East-West Airlines is trying to learn more about its customers.  Key issues are t

  df[col] = pd.to_numeric(df[col], errors='ignore')  # Convert to numeric, ignore errors


Step 1: Import Necessary Libraries

pandas: A library for data manipulation and analysis, particularly useful for working with structured data.

numpy: A library for numerical computations in Python, providing support for arrays and matrices.

matplotlib: A plotting library for creating static, interactive, and animated visualizations in Python.

seaborn: A data visualization library based on Matplotlib that provides a high-level interface for drawing statistical graphics. It is imported here but not used in this cell.

sklearn (Scikit-learn): A machine learning library that provides simple and efficient tools for data mining and data analysis.

StandardScaler: A class for scaling features to have a mean of 0 and a standard deviation of 1.

KMeans: An algorithm for clustering data into K distinct groups based on feature similarity.

silhouette_score: A metric that evaluates cluster quality by comparing how close each sample is to its own cluster with how close it is to the nearest other cluster. Values range from -1 to 1.

Step 2: Load the Dataset

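The files.upload() call opens a file picker so you can upload files from your local system to Google Colab. Colab renames a repeat upload instead of overwriting the existing file: the output shows this upload saved as "EastWestAirlines (10).xlsx", while the code reads "EastWestAirlines.xlsx", which is an earlier upload.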

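Here, the code reads an Excel file (EastWestAirlines.xlsx) into a DataFrame named df. Because no sheet is specified, read_excel loads the first sheet of the workbook. The head() output is there to confirm that the data loaded correctly, and in this run it shows the opposite: the rows contain descriptive text and the remaining columns are unnamed and empty, so the sheet that was read is not the customer data.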
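
When the first sheet of a workbook is not the one holding the records, one option is to load every sheet and choose among them. The sketch below relies on pd.read_excel(path, sheet_name=None) returning a dict that maps sheet names to DataFrames. The helper name pick_data_sheet, the sheet names "notes" and "records", and the rule "most numeric columns wins" are assumptions made for illustration, and the dict is an in-memory stand-in for the real workbook.

```python
import pandas as pd


def pick_data_sheet(sheets):
    """Return the name of the sheet with the most numeric columns (hypothetical helper)."""
    def numeric_count(frame):
        return frame.select_dtypes(include="number").shape[1]

    return max(sheets, key=lambda name: numeric_count(sheets[name]))


# Stand-in for: sheets = pd.read_excel("EastWestAirlines.xlsx", sheet_name=None)
sheets = {
    "notes": pd.DataFrame({"text": ["about the data", "source info"]}),
    "records": pd.DataFrame({"id": [1, 2, 3], "balance": [100.0, 250.5, 80.0]}),
}

chosen = pick_data_sheet(sheets)
df_demo = sheets[chosen]
print(chosen, df_demo.shape)
```

Once the right sheet name is known, passing it directly as sheet_name to read_excel is simpler than loading every sheet.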

Step 3: Check Data Types and Missing Values

The dtypes attribute holds the data type of each column in the DataFrame.

The expression isnull().sum() counts the missing values in each column, which shows whether any data needs to be cleaned.

Data Cleaning: Drop Missing Values

This line removes every row that contains at least one missing value, so a single empty cell is enough to discard a row. In the output above, the "Unnamed" columns are NaN in every displayed row, so all of those rows are dropped.
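
The effect of an entirely empty column on dropna() can be seen on a small made-up frame. The column names "miles" and "unused" are hypothetical. Dropping all-empty columns first with dropna(axis=1, how="all") is one possible remedy, not necessarily the right one for every dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame: one useful column with a gap, one column that is entirely empty
demo = pd.DataFrame({
    "miles": [1200.0, np.nan, 560.0],
    "unused": [np.nan, np.nan, np.nan],
})

# Plain dropna() removes every row, because each row has a NaN in "unused"
rows_left_naive = len(demo.dropna())

# Dropping all-empty columns first keeps the rows that hold real values
cleaned = demo.dropna(axis=1, how="all").dropna()

print(rows_left_naive, cleaned.shape)
```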

Convert Object Columns to Numeric

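This loop attempts to convert each column in the DataFrame to a numeric type with pd.to_numeric(..., errors='ignore'). If a column cannot be converted (e.g., it contains non-numeric strings), it is returned unchanged as an object column, and a message is printed. Recent pandas versions deprecate errors='ignore'; the code line echoed at the end of the output above appears to come from that deprecation warning.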
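
A minimal sketch of the same "convert if possible" behavior without errors='ignore' is to call pd.to_numeric with its default error handling and catch the exception. The helper name to_numeric_if_possible and the sample series are hypothetical.

```python
import pandas as pd


def to_numeric_if_possible(series):
    """Return a numeric version of the series, or the series unchanged if parsing fails."""
    try:
        return pd.to_numeric(series)
    except (ValueError, TypeError):
        return series


converted = to_numeric_if_possible(pd.Series(["1", "2", "3"]))
kept = to_numeric_if_possible(pd.Series(["1", "x"]))

print(converted.dtype, kept.dtype)
```

If partially numeric columns should be converted anyway, errors='coerce' turns unparseable entries into NaN instead of leaving the column untouched.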

Select Numeric Columns for Scaling

The call select_dtypes(include=[np.number]) returns only the numeric columns of the DataFrame, and .columns extracts their names. Only these columns are passed to the scaler, because StandardScaler cannot process text columns.

Check for Numeric Columns and Scale

If any numeric columns are available, the code creates a StandardScaler instance and calls fit_transform on the numeric data, so that each column ends up with a mean of 0 and a standard deviation of 1. The result, stored in scaled_df, is a NumPy array rather than a DataFrame, despite its name.

It then prints the shape of the scaled data and a sample of the first five rows to confirm that the scaling worked.
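
The scaling itself is the z-score formula z = (x - mean) / std applied per column, where std is the population standard deviation (ddof=0). The sketch below checks that on a small made-up array by comparing a manual computation with StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two columns on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

# Manual z-score; np.std uses ddof=0 by default, matching StandardScaler
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
```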

Step 5: K-Means Clustering

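The code checks whether scaled_df exists by looking for its name in locals(); the variable is created only when the scaling branch runs. In a notebook this check can be misleading, because a scaled_df left over from an earlier execution of the cell also satisfies it.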

It initializes an empty list sse to store the sum of squared errors for different values of K (the number of clusters).

It then iterates over K values from 1 to 10, fitting a K-Means model for each value and appending the inertia (the sum of squared distances of samples to their closest cluster center) to the sse list.
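
To make the definition of inertia concrete, the sketch below recomputes it by hand as the sum of squared distances from each sample to its assigned cluster center, then compares it with the inertia_ attribute. The two-blob data is synthetic, and n_init=10 is set explicitly only so the result does not depend on the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(30, 2)),
    rng.normal(5.0, 0.5, size=(30, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Center assigned to each sample, then the sum of squared distances to it
assigned_centers = km.cluster_centers_[km.labels_]
manual_sse = float(((X - assigned_centers) ** 2).sum())

print(manual_sse, km.inertia_)
```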

Plot the Elbow Curve

This section plots the Elbow Curve using Matplotlib, showing the relationship between the number of clusters (K) and the corresponding SSE values.

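SSE always decreases as K grows, so the curve is read for its "elbow": the value of K after which adding clusters reduces SSE only slightly. This is a visual heuristic rather than a strict criterion, and the elbow is not always clear-cut.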
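
Reading the elbow by eye can be supplemented with a simple numeric rule. The sketch below picks the K where the curve bends most sharply, measured by the largest second difference of the SSE values. This rule, the helper name elbow_by_second_difference, and the SSE numbers are all assumptions for illustration; the numbers are made up and are not results from this dataset.

```python
import numpy as np


def elbow_by_second_difference(k_values, sse):
    """Return the K with the largest second difference in SSE (hypothetical heuristic)."""
    second_diff = np.diff(np.asarray(sse, dtype=float), n=2)
    # second_diff[i] is centered on k_values[i + 1]
    return list(k_values)[int(np.argmax(second_diff)) + 1]


# Made-up SSE values for K = 1..7
k_values = range(1, 8)
sse = [1000, 700, 400, 150, 130, 115, 105]

elbow_k = elbow_by_second_difference(k_values, sse)
print(elbow_k)
```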

Final K-Means Clustering

Here, the number of clusters is hard-coded to K=4, a value the code comment attributes to the elbow curve; it is not computed from the sse list. The K-Means algorithm is then fit to the scaled data with fit_predict, and the cluster label of each data point is stored in kmeans_labels.
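
The labels become useful once they are tied back to the original columns. A minimal sketch, assuming a small made-up frame and a hand-written label array standing in for kmeans_labels, counts the members of each cluster and averages each feature per cluster. The column names "balance" and "flights" are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up customers and a stand-in for kmeans_labels
demo = pd.DataFrame({
    "balance": [100.0, 120.0, 900.0, 950.0],
    "flights": [1, 2, 10, 12],
})
labels = np.array([0, 0, 1, 1])

# Number of samples in each cluster
sizes = np.bincount(labels)

# Mean of every feature within each cluster, in original (unscaled) units
profile = demo.groupby(labels).mean()

print(sizes)
print(profile)
```

Profiling on the unscaled columns keeps the averages in their original units, which makes the segments easier to describe.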

Step 6: Calculate Silhouette Score

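Finally, the silhouette score is calculated using the silhouette_score() function, which averages the silhouette coefficient over all samples. The score ranges from -1 to 1: values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that many samples sit closer to a neighboring cluster than to their own.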
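
The silhouette score can also serve as a cross-check on the choice of K by computing it for several candidate values. The sketch below does this on synthetic data with three well-separated blobs; the data and the range of K are assumptions for illustration. The loop starts at K=2 because the score is undefined for a single cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs of 40 points each
rng = np.random.default_rng(1)
centers = [(0.0, 0.0), (6.0, 0.0), (0.0, 6.0)]
X = np.vstack([rng.normal(c, 0.4, size=(40, 2)) for c in centers])

# Silhouette score for each candidate K
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```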

If scaled_df was never defined, the else branch prints a message indicating that clustering cannot proceed.

Summary

This code walks through loading data, preprocessing it, applying K-Means clustering, and evaluating the result. It covers data cleaning, scaling, choosing the number of clusters with the Elbow method, and measuring clustering quality with the silhouette score. The print statements at each step give feedback on the process and help locate where a run went wrong.
