# Data Analysis Group Assignment

This notebook performs an exploratory data analysis (EDA) and applies clustering and regression techniques to the provided dataset.

## Analysis Steps:

1.  **Data Loading:** The dataset is loaded from an Excel file.
2.  **Data Structure:** The shape, data types, and summary statistics of the dataset are displayed.
3.  **Missing Values:** The number of missing values per column is identified.
4.  **Outlier Detection:** Boxplots are generated to identify and visualize outliers in numerical variables.
5.  **Correlation Analysis:** A correlation matrix and heatmap are computed and displayed to show the relationships between numerical variables. Strong positive (above 0.7) and strong negative (below -0.5) correlations are listed.
6.  **Variable Selection for Clustering:** Variables with notable correlations are considered for clustering analysis.
7.  **K-Means Clustering:** K-Means clustering is applied to the selected variables, and the Elbow method is used to determine a suitable number of clusters.
8.  **Cluster Visualization:** Principal Component Analysis (PCA) is used to reduce the dimensionality of the data, and the clusters are visualized in a 2D scatter plot.
9.  **Cluster Interpretation:** The mean values of the clustering variables are calculated for each cluster to understand their characteristics.
10. **Variable Selection for Regression:** Independent and dependent variables are selected for regression analysis.
11. **Linear Regression:** A linear regression model is fitted to predict 'Level' (encoded as Low = 0, Medium = 1, High = 2) from the selected independent variables. The data is split into training and testing sets, and the independent variables are standardized.
12. **Regression Interpretation:** The intercept and coefficients of the linear regression model are displayed and interpreted to understand the impact of each independent variable on the dependent variable. Regression evaluation metrics (MAE, MSE, RMSE, R2) are computed and displayed.
13. **Data Export:** The cleaned dataset is exported to a new Excel file.

This notebook describes the dataset's structure and the relationships between its variables, identifies potential clusters, and explores the factors influencing the 'Level' variable through regression.

## Load the data

In [None]:
import pandas as pd

file_path = "/content/drive/MyDrive/Colab Data/Group_Project.xlsx"

df = pd.read_excel(file_path)

display(df.head())

## Show dataset structure

In [None]:
print("DataFrame shape (rows, columns):", df.shape)

print("\nData types of each column:")
print(df.dtypes)

print("\nConcise summary of the DataFrame:")
df.info()

print("\nDescriptive statistics for numerical columns:")
display(df.describe())

## Check for missing values

In [None]:
missing_values = df.isnull().sum()
print("Number of missing values per column:")
print(missing_values)
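
The cell above only counts missing values. If any column did contain gaps, one common option is median imputation for numeric columns combined with dropping rows that lack a target label. The sketch below illustrates this on a small made-up frame; the values and the choice of strategy are assumptions for illustration, not something the dataset is known to need.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "Age": [33.0, np.nan, 45.0, 52.0],
    "Smoking": [3.0, 7.0, np.nan, 2.0],
    "Level": ["Low", "High", None, "Medium"],
})

# Count missing values per column, as in the cell above.
print(toy.isnull().sum())

# Fill numeric gaps with each column's median.
numeric_cols = toy.select_dtypes(include=np.number).columns
filled = toy.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())

# Rows without a target label cannot be used for regression, so drop them.
filled = filled.dropna(subset=["Level"])
print(filled)
```

Median imputation is less sensitive to extreme values than mean imputation, which matters when outliers are present.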

## Identify and describe outliers

In [None]:
import matplotlib.pyplot as plt
import numpy as np

numerical_cols = df.select_dtypes(include=np.number).columns.tolist()
# Exclude the row identifier and Gender; Age gets its own plot below.
numerical_cols = [col for col in numerical_cols if col not in ('index', 'Gender', 'Age')]

plt.figure(figsize=(12, 6))
df[numerical_cols].boxplot(patch_artist=True, boxprops=dict(facecolor="lightcoral"))
plt.xticks(rotation=45)
plt.ylabel("Values")
plt.title("Boxplots of Numerical Variables")
plt.grid(False)
plt.show()

plt.figure(figsize=(4, 6))
df[['Age']].boxplot(patch_artist=True, boxprops=dict(facecolor="lightcoral"))
plt.title("Boxplot of Age")
plt.grid(False)
plt.show()
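
Boxplots show outliers visually; the same rule can also be applied numerically. With matplotlib's default settings, points beyond 1.5 times the interquartile range (IQR) from the quartiles are drawn as outliers. The sketch below applies that rule to a made-up series; the helper name `iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical ages; 95 lies far above the rest.
ages = pd.Series([25, 30, 32, 35, 37, 40, 42, 95], name="Age")
outliers = iqr_outliers(ages)
print(outliers)
```

Applied per column (for example via `df[numerical_cols].apply(lambda s: len(iqr_outliers(s)))`), this would give an outlier count for each variable.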

## Compute and interpret correlation coefficients

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

numerical_cols_forr = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_cols_forr = [col for col in numerical_cols_forr if col not in ['index', 'Patient Id', 'Gender']]

correlation_matrix_forr = df[numerical_cols_forr].corr()

print("Correlation Matrix:")
display(correlation_matrix_forr)

# Visualize the correlation matrix using a heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(correlation_matrix_forr, annot=False, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Numerical Variables')
plt.show()


print("\nStrong Positive Correlations (Correlation > 0.7):")
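# Filter on the signed value: abs() would also let strong negative correlations through.
strong_positive_corr_forr = correlation_matrix_forr[correlation_matrix_forr > 0.7].stack().sort_values(ascending=False)
# Drop the diagonal (each variable correlates perfectly with itself).
strong_positive_corr_forr = strong_positive_corr_forr[strong_positive_corr_forr < 1]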
if not strong_positive_corr_forr.empty:
    print(strong_positive_corr_forr)
else:
    print("No strong positive correlations (above 0.7) found.")

print("\nStrong Negative Correlations (Correlation < -0.5):")
strong_negative_corr_forr = correlation_matrix_forr[correlation_matrix_forr < -0.5].stack().dropna().sort_values()
if not strong_negative_corr_forr.empty:
    print(strong_negative_corr_forr)
else:
    print("No strong negative correlations (below -0.5) found.")
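
Because a correlation matrix is symmetric, stacking the full matrix lists every pair twice, once as (A, B) and once as (B, A). A possible refinement is to keep only the cells above the diagonal before filtering. The sketch below shows this on made-up data; the frame `toy` and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" tracks "a", "c" moves against it, "d" is unrelated.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 5, 8, 10],
    "c": [5, 4, 3, 2, 1],
    "d": [1, -1, 0, -1, 1],
})
corr = toy.corr()

# Keep only the cells strictly above the diagonal so each pair appears once.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack()

# Comparisons with NaN are False, so masked cells never pass the filters.
strong_positive = pairs[pairs > 0.7].sort_values(ascending=False)
strong_negative = pairs[pairs < -0.5].sort_values()
print(strong_positive)
print(strong_negative)
```

Masking with `k=1` also removes the diagonal, so no separate filter on values equal to 1 is needed and genuinely perfect correlations between different variables are kept.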

## Select variables for clustering

In [None]:
print("Review of Strong Positive Correlations:")
display(strong_positive_corr_forr)
clustering_vars = [
    'Age',
    'Air Pollution',
    'Alcohol use',
    'Dust Allergy',
    'OccuPational Hazards',
    'Genetic Risk',
    'chronic Lung Disease',
    'Balanced Diet',
    'Obesity',
    'Smoking',
    'Passive Smoker',
    'Chest Pain',
    'Coughing of Blood',
    'Fatigue',
    'Weight Loss',
    'Shortness of Breath',
    'Wheezing',
    'Swallowing Difficulty',
    'Clubbing of Finger Nails',
    'Frequent Cold',
    'Dry Cough',
    'Snoring'
]

## Apply k-means clustering and determine optimal clusters

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X = df[clustering_vars]

inertia = []
k_range = range(1, 11)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(k_range, inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.xticks(k_range)
plt.grid(True)
plt.show()
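
The elbow in an inertia curve can be hard to read. A complementary check is the silhouette score, which is higher when clusters are compact and well separated. The sketch below runs on synthetic data rather than the project dataset and standardizes the features first, since K-Means relies on Euclidean distances; the names `X_demo` and `scores` and the three synthetic groups are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df[clustering_vars]: three well-separated groups.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# Standardize so that no single feature dominates the distance computation.
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# The silhouette score needs at least two clusters, so k starts at 2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo_scaled)
    scores[k] = silhouette_score(X_demo_scaled, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("k with the highest silhouette score:", best_k)
```

On real data the two criteria may disagree, in which case the interpretability of the resulting clusters is a reasonable tie-breaker.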

## Visualize clusters

In [None]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import seaborn as sns
import matplotlib.pyplot as plt

X = df[clustering_vars]

pca = PCA(n_components=2)

X_pca = pca.fit_transform(X)

X_pca = pd.DataFrame(X_pca, columns=['PCA1', 'PCA2'])

# k=3 is assumed here; adjust it to match the elbow seen in the plot above.
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)

kmeans.fit(X)

X_pca['cluster'] = kmeans.labels_

plt.figure(figsize=(10, 8))
sns.scatterplot(x='PCA1', y='PCA2', hue='cluster', data=X_pca, palette='viridis', legend='full')

plt.title('K-Means Clustering Results (PCA Reduced)')

plt.show()
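
A 2D PCA plot is only as faithful as the share of variance its two components retain. One way to check this is the `explained_variance_ratio_` attribute of a fitted `PCA` object. The sketch below uses synthetic data; `X_demo` and `pca_demo` are hypothetical names, and the printed values say nothing about the project dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for df[clustering_vars]: six features driven by two latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

pca_demo = PCA(n_components=2)
pca_demo.fit(X_demo)

# Fraction of the total variance captured by each retained component.
ratios = pca_demo.explained_variance_ratio_
print("Variance explained per component:", ratios.round(3))
print("Total variance explained by 2 components:", ratios.sum().round(3))
```

If the total is low, clusters that overlap in the scatter plot may still be well separated in the original feature space.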

## Interpret clusters

In [None]:
df_clustered = df.copy()
df_clustered['cluster'] = kmeans.labels_

cluster_means = df_clustered.groupby('cluster')[clustering_vars].mean()

print("\nMean values of variables per cluster:")
display(cluster_means)
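
Beyond per-cluster means, it can help to see how the clusters line up with the 'Level' categories. A cross-tabulation is one way to do that. The sketch below uses made-up labels; the frame `toy` and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical cluster labels and severity levels for eight patients.
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 2, 2, 2],
    "Level": ["Low", "Low", "Medium", "Medium", "Medium", "High", "High", "Medium"],
})

# Rows: clusters; columns: Level categories; cells: patient counts.
counts = pd.crosstab(toy["cluster"], toy["Level"])
print(counts)

# Row-normalized shares make clusters of different sizes comparable.
shares = pd.crosstab(toy["cluster"], toy["Level"], normalize="index")
print(shares.round(2))
```

With the notebook's own objects, the equivalent call would be `pd.crosstab(df_clustered['cluster'], df_clustered['Level'])`.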

## Select variables for regression analysis

In [None]:
dependent_variable = 'Level'
independent_variables = [
    'Age',
    'Air Pollution',
    'Alcohol use',
    'Dust Allergy',
    'OccuPational Hazards',
    'Genetic Risk',
    'chronic Lung Disease',
    'Smoking',
    'Passive Smoker',
    'Chest Pain',
    'Coughing of Blood',
    'Fatigue',
    'Weight Loss',
    'Shortness of Breath',
    'Wheezing',
    'Swallowing Difficulty',
    'Clubbing of Finger Nails',
    'Frequent Cold',
    'Dry Cough',
    'Snoring'
]

## Apply regression model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import numpy as np

X = df[independent_variables]
y = df[dependent_variable]

level_mapping = {'Low': 0, 'Medium': 1, 'High': 2}
y_encoded = y.map(level_mapping)

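# Split first so that the scaler only sees the training rows (avoids data leakage).
X_train, X_test, y_train_encoded, y_test_encoded = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=independent_variables, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=independent_variables, index=X_test.index)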

model = LinearRegression()
model.fit(X_train, y_train_encoded)

print("Linear Regression model fitted")
print(f"Independent variables used: {independent_variables}")
print(f"Dependent variable used: {dependent_variable} (Encoded)")
print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")
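
Scaling statistics should come from the training rows only, so that the test set stays unseen. One way to make that automatic is a scikit-learn pipeline, which fits the scaler on whatever is passed to `fit` and reuses those statistics at prediction time. The sketch below runs on synthetic data; `X_demo`, `y_demo`, and the coefficients used to generate them are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and the encoded target.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = 0.5 * X_demo[:, 0] - 0.2 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# fit() learns the scaling from X_tr only; score() applies it unchanged to X_te.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

print("Test R^2:", round(pipe.score(X_te, y_te), 4))
```

The fitted regression step is available as `pipe[-1]`, so its `coef_` and `intercept_` can still be inspected.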

## Interpret regression results

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

print("Regression Output:")
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")

coefficients_df = pd.Series(model.coef_, index=independent_variables)
print("\nCoefficients with variable names:")
display(coefficients_df)

for var, coef in coefficients_df.items():
    if abs(coef) > 0.1:
        print(f"{var}: Notable impact on 'Level' (coef = {coef:.4f})")

y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test_encoded, y_pred)
mse = mean_squared_error(y_test_encoded, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test_encoded, y_pred)

print("\nRegression Evaluation Metrics on Test Set:")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R-squared (R2) Score: {r2:.4f}")
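
To make the four metrics concrete, they can be computed by hand on a few values. The sketch below uses made-up targets and predictions (not results from this model) and follows the standard definitions: MAE is the mean absolute error, MSE the mean squared error, RMSE its square root, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Hypothetical encoded targets and model predictions.
y_true_demo = np.array([0, 1, 2, 2, 1])
y_pred_demo = np.array([0.2, 0.9, 1.7, 2.1, 1.4])

errors = y_true_demo - y_pred_demo

mae_demo = np.mean(np.abs(errors))   # average size of the errors
mse_demo = np.mean(errors ** 2)      # penalizes large errors more strongly
rmse_demo = np.sqrt(mse_demo)        # back in the units of the target

# R^2 compares the model with always predicting the mean of y.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

print(f"MAE={mae_demo:.4f}, MSE={mse_demo:.4f}, RMSE={rmse_demo:.4f}, R2={r2_demo:.4f}")
```

Because 'Level' is encoded as 0, 1, and 2, an RMSE of about 0.25 in this toy case would mean predictions are typically off by a quarter of one severity step.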


## Export the data

In [None]:
output_path = "/content/drive/MyDrive/Colab Data/cleaned_Group_Project.xlsx"
df.to_excel(output_path, index=False)
print(f"Cleaned dataset exported to: {output_path}")