# <span style="color:darkblue;">[LDATS2350] - DATA MINING</span>

### <span style="color:darkred;">Python08 - Data Exploration</span>

**Prof. Robin Van Oirbeek**  

<br/>

**<span style="color:darkgreen;">Guillaume Deside</span>** (<span style="color:gray;">guillaume.deside@uclouvain.be</span>)

---
## **Goal of Data Exploration**

Data exploration is a critical first step in any data analysis or machine learning workflow. It involves examining the dataset to understand its structure, detect patterns, identify anomalies, and prepare it for further processing. Proper exploration helps ensure that the data is suitable for the analysis and reveals key insights that may guide your modeling choices.

In this section, we will use the **Breast Cancer Wisconsin (Diagnostic) dataset** shipped with Scikit-Learn to demonstrate the process of data exploration. It contains 569 samples described by 30 numeric features of cell nuclei, such as radius, texture, and perimeter, which can be used to classify a tumor as malignant or benign.

---

### **Steps in Data Exploration**

1. **Loading the Dataset**:
   - Use `sklearn.datasets.load_breast_cancer` to load the Breast Cancer dataset.
   - Understand the structure of the dataset, including features, target variables, and metadata.

2. **Inspecting Data**:
   - Check the size and shape of the dataset.
   - Examine feature names, target labels, and any descriptive statistics.

3. **Descriptive Statistics**:
   - Summarize the data using statistical measures like mean, median, and standard deviation to understand feature distributions.

4. **Checking for Missing Values**:
   - Identify any missing or incomplete data, which might require preprocessing.

5. **Visualizing Data**:
   - Use visualization techniques (e.g., histograms, scatter plots, boxplots) to understand distributions, correlations, and potential outliers.
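
The non-visual steps above can be sketched in a few lines. This is a minimal sketch, not the course solution; the variable names (`data`, `df`, `summary`, `n_missing`, `class_counts`) are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the dataset: a Bunch object holding data, target and metadata
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Inspect: size, and the target labels encoded as 0 and 1
print(df.shape)                  # (n_samples, n_features)
print(list(data.target_names))

# Descriptive statistics; the "50%" row of describe() is the median
summary = df.describe().T[["mean", "50%", "std", "min", "max"]]
print(summary.head())

# Missing values: count NaNs per column, then sum over all columns
n_missing = int(df.isna().sum().sum())
print("missing values:", n_missing)

# Class balance, worth knowing before plotting anything by class
label_of = dict(enumerate(data.target_names))
class_counts = pd.Series(data.target).map(label_of).value_counts()
print(class_counts)
```

Transposing the output of `describe()` puts one feature per row, which is easier to scan when there are many columns.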

---

### **Why Explore Data?**

1. **Detect Patterns**:
   - Identify relationships and trends that might be useful for modeling.

2. **Handle Missing or Outlier Data**:
   - Ensure data integrity by handling anomalies or filling in missing values.

3. **Understand Feature Importance**:
   - Determine which features might contribute the most to predictions.

4. **Validate Assumptions**:
   - Confirm that the dataset aligns with the problem you're trying to solve.

---


# Load a dataset

In [1]:
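import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer

# Load the dataset (a Bunch with data, target, feature_names, ...)
dataset = load_breast_cancer()

# Build a DataFrame with one named column per feature
dataset_df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
dataset_df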

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | ... | 25.380 | 17.33 | 184.60 | 2019.0 | 0.16220 | 0.66560 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 | ... | 24.990 | 23.41 | 158.80 | 1956.0 | 0.12380 | 0.18660 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 | ... | 23.570 | 25.53 | 152.50 | 1709.0 | 0.14440 | 0.42450 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 | ... | 14.910 | 26.50 | 98.87 | 567.7 | 0.20980 | 0.86630 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 | ... | 22.540 | 16.67 | 152.20 | 1575.0 | 0.13740 | 0.20500 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | 0.1726 | 0.05623 | ... | 25.450 | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 |
| 565 | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | 0.1752 | 0.05533 | ... | 23.690 | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 |
| 566 | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | 0.1590 | 0.05648 | ... | 18.980 | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 |
| 567 | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | 0.2397 | 0.07016 | ... | 25.740 | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 |


<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">

### **Exercise - Exploring the Breast Cancer Dataset**

#### **Objective**
Learn how to explore a dataset by examining its shape and statistical properties using `describe()` and `shape`.

---

#### **Instructions**

1. **Load the Dataset**:
   - Use Scikit-Learn's `load_breast_cancer` to load the Breast Cancer dataset.

2. **Convert to a DataFrame**:
   - Convert the dataset into a Pandas DataFrame for easier exploration.

3. **Print the Shape**:
   - Use `.shape` to determine the number of rows (samples) and columns (features) in the dataset.

4. **Use `describe()`**:
   - Apply the `describe()` function to generate descriptive statistics for the features (e.g., mean, min, max, std).

5. **Interpret the Output**:
   - Reflect on the range of values, central tendencies, and variations in the features.


</div>

**Expected output**

# **Remove Low Variance Variables (Be Careful!!!)**

### **Why Remove Low Variance Columns?**

In some cases, **columns with very low variance** can be removed from the dataset because they contribute very little to the model's predictive power. A column with nearly constant values across all observations does not provide useful information to distinguish between different target classes or predict outcomes.

---

### **When Can Low Variance Columns Be Removed?**
- If a column has almost the same value for all observations, it **does not contribute useful variability** to the dataset.
- Features with near-zero variance **have little influence on a model's decisions**, since they are almost constant across samples.
- Removing such columns can **reduce computational cost** and help simplify the dataset.

---

### **When Should You Be Careful?**

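- Some low-variance features might still contain useful **signal** for certain models, especially in high-dimensional datasets.
- If a feature has **low variance globally** but is meaningful within certain groups or classes, removing it may negatively impact model performance.
- For imbalanced datasets, a low-variance feature might be critical for the minority class, so always check **class-specific statistics** before removal.
- Variance depends on the **unit of measurement**: a feature with small values (e.g., `mean smoothness`, around 0.1) has a small variance even when it differs clearly between classes, so compare variances only across features on a common scale.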
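
The caution above can be made concrete with a short sketch. It assumes the raw, unscaled Breast Cancer features and a threshold of `0.02`; the names `kept`, `removed`, and `gap` are illustrative. The point it tries to show is that a fixed variance threshold can discard a feature such as `mean concave points`, which takes small values but separates the classes well (it appears among the top ANOVA features later in this notebook).

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import VarianceThreshold

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)

# Fit the selector on the raw, unscaled features
selector = VarianceThreshold(threshold=0.02)
selector.fit(X)

kept = X.columns[selector.get_support()]
removed = X.columns[~selector.get_support()]
print(f"kept {len(kept)} of {X.shape[1]} features")

# Variance depends on the measurement scale: 'mean area' takes values in
# the hundreds or thousands, while 'mean concave points' stays below 1.
print(X[["mean area", "mean concave points"]].var())

# Class-specific check before dropping a feature: distance between the
# class means, expressed in units of the feature's overall std.
group_means = X[removed].groupby(data.target).mean()
gap = (group_means.loc[0] - group_means.loc[1]).abs() / X[removed].std()
print(gap.sort_values(ascending=False).head())
```

A removed feature with a large `gap` value is a candidate to keep despite its low raw variance; rescaling the features before thresholding is one way to avoid the problem, at the cost of having to choose the threshold on the new scale.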




<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">

### **Exercise - Identifying and Removing Low Variance Features**

#### **Objective**
Learn how to:
1. Compute descriptive statistics of a dataset.
2. Identify features with low variance using `VarianceThreshold` from Scikit-Learn.
3. Remove low-variance features from a dataset.
4. Visualize a selected feature distribution.

---

#### **Instructions**

1. **Explore the Dataset**:
   - Use `.describe()` to analyze feature statistics.

2. **Apply Variance Thresholding**:
   - Use `VarianceThreshold(threshold=0.02)` to identify and remove features with very low variance.

3. **Analyze the Results**:
   - Print the shape of the dataset before and after feature selection.
   - Identify which columns were retained after applying the variance threshold.

4. **Visualize a Feature**:
   - Plot a histogram of `"mean smoothness"` to examine its distribution.


#### **Bonus Challenge**
- Experiment with different threshold values (e.g., `0.01`, `0.05`) and observe how many features are removed at each level.
- Compare the distributions of a retained feature and a removed feature.



</div>

**Expected output**

![meanSmoothness.png](attachment:90def0f0-3839-4534-9c39-de88d57ca02e.png)

# **Univariate Distributions**

## **Goal of Analyzing Univariate Distributions**

A **univariate distribution** refers to the probability distribution of a single variable. Analyzing univariate distributions is a crucial step in **exploratory data analysis** because it helps us understand the **patterns, central tendency, spread, and shape** of individual variables in a dataset.

### **Why Analyze Univariate Distributions?**
1. **Understand the Data Distribution**  
   - Identify whether a variable follows a normal, skewed, or uniform distribution.
   - Detect **symmetry or asymmetry** in data.

2. **Detect Outliers**  
   - Outliers can be identified by observing extreme values in histograms, boxplots, or density plots.

3. **Assess Data Variability**  
   - Measures like **variance, standard deviation, and interquartile range (IQR)** help understand the spread of data.

4. **Identify Potential Data Transformations**  
   - If data is **skewed**, transformations like **log, square root, or Box-Cox** may improve model performance.

5. **Feature Engineering & Selection**  
   - Highly **skewed** or **low variance** features may need preprocessing before applying machine learning models.
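
The points above can be checked numerically before drawing any plot. The sketch below is only an illustration on one feature (`mean area`, chosen arbitrarily); the 1.5 × IQR rule used for outliers is a common convention, not the only possible definition.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
x = df["mean area"]

# Shape: positive skewness indicates a long right tail
skew_raw = x.skew()

# Spread: interquartile range (the standard deviation is printed below)
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1

# Outliers by the 1.5 * IQR rule (the rule boxplots use by default)
is_outlier = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
n_outliers = int(is_outlier.sum())

# Candidate transformation: a log transform (valid here because x > 0)
skew_log = np.log(x).skew()

print(f"std={x.std():.1f}, IQR={iqr:.1f}, outliers={n_outliers}")
print(f"skewness raw={skew_raw:.2f}, after log={skew_log:.2f}")
```

Comparing the skewness before and after the transform is a quick way to decide whether the transformation is worth keeping.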

---

## **Common Ways to Visualize Univariate Distributions**
1. **Histogram** 📊  
   - A graphical representation of the distribution of a dataset.
   - Shows frequency counts of data values in bins.

2. **Boxplot** 📦  
   - Visualizes the median, quartiles, and outliers.
   - Helps detect skewness and extreme values.

3. **Summary Statistics** 📑  
   - Mean, median, mode, variance, and standard deviation summarize data properties.
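
As a minimal sketch of the three tools above, the snippet below draws a histogram and a boxplot for one feature and prints its summary statistics. It uses plain Matplotlib to keep dependencies small (Seaborn's `histplot()` and `boxplot()` would work similarly); the feature `mean texture` and the bin count of 30 are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
feature = "mean texture"  # arbitrary choice for illustration
values = df[feature].to_numpy()

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: frequency counts of the values in 30 equal-width bins
ax_hist.hist(values, bins=30, edgecolor="black")
ax_hist.set_xlabel(feature)
ax_hist.set_ylabel("count")

# Boxplot: median, quartiles, whiskers and points beyond the whiskers
ax_box.boxplot(values)
ax_box.set_ylabel(feature)

fig.tight_layout()

# Summary statistics for the same feature
print(df[feature].agg(["mean", "median", "std", "var"]))
```

In a notebook the figure is rendered at the end of the cell; in a script, call `plt.show()` or `fig.savefig(...)` to see it.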


<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">

### **Exercise - Visualizing Feature Distributions by Class**

#### **Objective**
Learn how to:
- Compare the distributions of different features across target classes.
- Use **Seaborn’s `histplot()`** to overlay histograms for binary classification problems.
- Create subplots to visualize multiple feature distributions simultaneously.

---

#### **Instructions**

1. **Separate Data by Class**:
   - Create two subsets:  
     - `X0`: Contains data where `target == 0`.
     - `X1`: Contains data where `target == 1`.

2. **Plot Histograms for Each Feature**:
   - Create a **grid of subplots** (5 columns, 6 rows) to visualize all features.
   - Use **Seaborn’s `histplot()`** to overlay the distributions:
     - **Blue** for `target = 0`
     - **Red** for `target = 1`

3. **Plot Boxplots for Each Feature**:
   - Create a **grid of subplots** (5 columns, 6 rows) to visualize all features.
   - Use **Seaborn’s `boxplot()`** to compare the two classes side by side:
     - **Blue** for `target = 0`
     - **Red** for `target = 1`

4. **Analyze the Distributions**:
   - Observe whether features **separate the two classes well**.
   - Identify features that show a **clear distinction** between classes.

</div>


![distribution_class.png](attachment:d4ce3445-df82-45f2-85c4-29a5103e0c7b.png)

![boxplot_target.png](attachment:a6eec902-09ba-4dc4-8bdb-d301e20b9b50.png)

# **Univariate Feature Selection (BE CAREFUL!)**

## **Goal of This Code**
Feature selection is a crucial step in the **data preprocessing pipeline** of machine learning. This code demonstrates how to use **univariate feature selection** to identify and retain the most relevant features based on statistical tests.

### **Why Perform Feature Selection?**
- **Improves Model Performance** 🚀: Reducing the number of features can remove noise and lower the risk of overfitting.
- **Reduces Computational Cost** ⚡: Working with fewer features speeds up training and inference.
- **Enhances Interpretability** 🔍: Helps focus on the most important variables influencing predictions.

---

## **How It Works**
The code applies **two different univariate tests** to rank and select the **top 5 most relevant features** from a dataset:

1. **Chi-Square (`chi2`)**:
   - Used for **classification problems** where features are categorical or non-negative numerical values (such as counts or frequencies).
   - Measures dependency between features and target variable.
   - Higher chi-square scores indicate stronger relationships.

2. **ANOVA F-test (`f_classif`)**:
   - Used for **classification** with numerical features.
   - Compares variance **between classes** vs. variance **within classes**.
   - Higher F-values suggest a feature contributes significantly to class separation.
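
To make the "between classes vs. within classes" comparison concrete, the sketch below recomputes the one-way ANOVA F-value by hand for a single feature and compares it with the value returned by `f_classif`. It is an illustration under the usual one-way ANOVA definition, with $k$ classes and $n$ samples: $F = \frac{SS_{between}/(k-1)}{SS_{within}/(n-k)}$. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif

data = load_breast_cancer()
X, y = data.data, data.target
x = X[:, 0]  # first feature, 'mean radius'

classes = np.unique(y)
n, k = len(x), len(classes)
grand_mean = x.mean()

# Between-class sum of squares: distance of each class mean from the
# grand mean, weighted by the class size
ss_between = sum(
    (y == c).sum() * (x[y == c].mean() - grand_mean) ** 2 for c in classes
)
# Within-class sum of squares: spread of the samples around their class mean
ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)

# F = (between-class variance) / (within-class variance)
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

f_sklearn, p_values = f_classif(X, y)
print(f"manual F = {f_manual:.2f}, f_classif F = {f_sklearn[0]:.2f}")
```

A large F-value means the class means are far apart relative to the scatter inside each class, which is what makes the feature useful for separating the classes.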



In [25]:
from sklearn import feature_selection
from sklearn.feature_selection import SelectKBest

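# Select the feature columns by name, so that no feature is dropped by
# position and an added 'target' column (if any) is excluded
X = dataset_df[dataset.feature_names]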

y = dataset.target

# Select the 5 best features using Chi-Square and ANOVA F-test
selector_chi = SelectKBest(feature_selection.chi2, k=5)
selector_f = SelectKBest(feature_selection.f_classif, k=5)

# Transform the dataset to retain selected features
X_chi = pd.DataFrame(selector_chi.fit_transform(X, y), columns=X.columns[selector_chi.get_support()])
X_f = pd.DataFrame(selector_f.fit_transform(X, y), columns=X.columns[selector_f.get_support()])

In [27]:
print("Top 5 features selected using Chi-Square test:", X.columns[selector_chi.get_support()])
print("Top 5 features selected using ANOVA F-test:", X.columns[selector_f.get_support()])

Top 5 features selected using Chi-Square test: Index(['mean perimeter', 'mean area', 'area error', 'worst perimeter',
       'worst area'],
      dtype='object')
Top 5 features selected using ANOVA F-test: Index(['mean perimeter', 'mean concave points', 'worst radius',
       'worst perimeter', 'worst concave points'],
      dtype='object')


In [29]:
for i in range(len(selector_chi.scores_)):
    print(f'Feature {i} ({X.columns[i]}): {selector_chi.scores_[i]:.4f}')

Feature 0 (mean radius): 266.1049
Feature 1 (mean texture): 93.8975
Feature 2 (mean perimeter): 2011.1029
Feature 3 (mean area): 53991.6559
Feature 4 (mean smoothness): 0.1499
Feature 5 (mean compactness): 5.4031
Feature 6 (mean concavity): 19.7124
Feature 7 (mean concave points): 10.5440
Feature 8 (mean symmetry): 0.2574
Feature 9 (mean fractal dimension): 0.0001
Feature 10 (radius error): 34.6752
Feature 11 (texture error): 0.0098
Feature 12 (perimeter error): 250.5719
Feature 13 (area error): 8758.5047
Feature 14 (smoothness error): 0.0033
Feature 15 (compactness error): 0.6138
Feature 16 (concavity error): 1.0447
Feature 17 (concave points error): 0.3052
Feature 18 (symmetry error): 0.0001
Feature 19 (fractal dimension error): 0.0064
Feature 20 (worst radius): 491.6892
Feature 21 (worst texture): 174.4494
Feature 22 (worst perimeter): 3665.0354
Feature 23 (worst area): 112598.4316
Feature 24 (worst smoothness): 0.3974
Feature 25 (worst compactness): 19.3149
Feature 26 (worst concav

## **Important Notes ⚠️**

### 🔹 **Use the Right Test for the Right Task!**
- **Classification**: `chi2`, `f_classif`, `mutual_info_classif`
- **Regression**: `f_regression`, `mutual_info_regression`

---

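### 🔹 **Chi-Square Assumptions**
- Requires **non-negative feature values** (counts or frequencies); Scikit-Learn's `chi2` raises an error on negative inputs.
- **Not suitable for continuous numerical features** without binning: the score depends on the scale of the feature, which is why the large-valued area features dominate the ranking above.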

---

### 🔹 **F-test Assumptions**
- Assumes the feature is **normally distributed within each class**, with **equal variance** across classes.
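
The notes on `chi2` can be verified with a small experiment. This is a sketch under the assumption that rescaling all features by a constant stands in for a change of measurement unit; the names `X_rescaled` and `X_centered` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import chi2, f_classif

data = load_breast_cancer()
X, y = data.data, data.target

# Same data expressed in a unit ten times smaller (values are 10x larger)
X_rescaled = X * 10.0

chi2_raw, _ = chi2(X, y)
chi2_rescaled, _ = chi2(X_rescaled, y)
f_raw, _ = f_classif(X, y)
f_rescaled, _ = f_classif(X_rescaled, y)

# The chi2 score grows with the unit of measurement; the F-value does not
print("chi2 ratio:", chi2_rescaled[0] / chi2_raw[0])
print("F ratio:   ", f_rescaled[0] / f_raw[0])

# chi2 rejects negative inputs, for example mean-centered features
X_centered = X - X.mean(axis=0)
try:
    chi2(X_centered, y)
    raised = False
except ValueError as err:
    raised = True
    print("chi2 failed:", err)
```

Because the chi-square score changes with the unit of a continuous feature, a ranking based on it mostly reflects feature magnitude; binning the features first, or using `f_classif` or `mutual_info_classif`, avoids that dependence.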


# **Bivariate Analysis**

## **Goal of Bivariate Analysis**
Bivariate analysis is the statistical analysis of two variables to determine the **relationship** between them. Unlike **univariate analysis**, which examines a single variable, bivariate analysis shows how two variables vary together; an association alone does not establish that one variable influences the other.

### **Why Perform Bivariate Analysis?**
- To **identify correlations** between features.
- To detect **associations** between categorical and numerical variables.
- To examine **trends and patterns** that can inform predictive modeling.
- To detect **outliers** and relationships that may require feature transformations.

---

## **Types of Bivariate Analysis**
The type of analysis depends on the nature of the two variables being compared:

### **1. Numerical vs. Numerical**
- **Correlation Analysis**: Measures how strongly two numerical variables are related.
  - Pearson correlation (linear relationship)
  - Spearman correlation (monotonic relationship)
  - Kendall correlation (ordinal association)
- **Scatter Plots**: Visualizes the relationship between two numerical features.

### **2. Numerical vs. Categorical**
- **Boxplots**: Compare the distribution of a numerical variable across categories.
- **Violin Plots**: Show the shape of the distribution (a kernel density estimate) of a numerical variable across categories.
- **T-tests & ANOVA**: Statistical tests to compare means of a numerical variable across categorical groups.

### **3. Categorical vs. Categorical**
- **Contingency Tables**: Show the frequency distribution between two categorical variables.
- **Chi-Square Test**: Measures whether two categorical variables are independent.
- **Stacked Bar Charts**: Visual representation of categorical relationships.

By performing bivariate analysis, we gain valuable insights into feature relationships, which helps in feature selection, transformation, and predictive modeling! 🚀
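
The three cases above can each be illustrated in a few lines. This is a minimal sketch: the feature pair (`mean radius`, `mean perimeter`) is an arbitrary choice, Welch's t-test is one of several possible tests, and `radius_group` is a hypothetical binned column created only to obtain a second categorical variable.

```python
import pandas as pd
from scipy import stats
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target  # 0 = malignant, 1 = benign

# 1. Numerical vs. numerical: three correlation coefficients
a, b = df["mean radius"], df["mean perimeter"]
correlations = {m: a.corr(b, method=m) for m in ("pearson", "spearman", "kendall")}
print(correlations)

# 2. Numerical vs. categorical: Welch's t-test on the two class means
malignant = df.loc[df["target"] == 0, "mean radius"]
benign = df.loc[df["target"] == 1, "mean radius"]
t_stat, p_value = stats.ttest_ind(malignant, benign, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")

# 3. Categorical vs. categorical: contingency table and chi-square test.
# 'radius_group' is a hypothetical binned feature made for illustration.
df["radius_group"] = pd.qcut(
    df["mean radius"], q=3, labels=["small", "medium", "large"]
)
table = pd.crosstab(df["radius_group"], df["target"])
chi2_stat, chi2_p, dof, expected = stats.chi2_contingency(table)
print(table)
print(f"chi2 = {chi2_stat:.1f}, dof = {dof}, p = {chi2_p:.3g}")
```

A Pearson coefficient close to 1 between two features, as is typical for radius and perimeter, suggests that one of them is largely redundant for modeling.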


<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">

### **Exercise - Visualizing Relationships with Pairplot**

#### **Objective**
- Learn how to use **Seaborn's `pairplot()`** to visualize relationships between multiple numerical features.
- Use the **`hue` parameter** to distinguish different classes.
- Improve feature selection by analyzing feature distributions and correlations.

---

#### **Instructions**

1. **Create a Subset of the Dataset**:
   - Select only the **first 6 numerical columns** from `dataset_df`.
   - Include the `target` column to use as the grouping variable.

2. **Generate a Pairplot**:
   - Use **`sns.pairplot()`** to plot pairwise relationships between selected features.
   - Set the `hue='target'` to color-code the data points by class.
   - Improve the visualization by adjusting aesthetics.

3. **Analyze the Relationships**:
   - Look for **clusters** and **class separations**.
   - Identify **highly correlated features** that might be redundant.

</div>


![bivariate.png](attachment:f2fbff3c-8e2c-4023-971a-80764e99ae93.png)