# <span style="color:darkblue;">[LDATS2350] - DATA MINING</span>

### <span style="color:darkred;">Python04 - Data Incompleteness</span>

**Prof. Robin Van Oirbeek**  

<br/>

**<span style="color:darkgreen;">Guillaume Deside</span>** (<span style="color:gray;">guillaume.deside@uclouvain.be</span>)

---

Missing data is a **common challenge** in data mining workflows. If not managed properly, missing values can **distort analyses**, **reduce model accuracy**, or **bias** the results entirely. In this session, we focus on **detecting**, **understanding**, and **handling** missing data through various techniques—such as **imputation**, **removal**, or **model-based approaches**—in order to ensure more **reliable** and **robust** data mining outcomes.


## Inspect Data

The first step in any data mining project is to **import and inspect your data**. It's best practice to save the data used in this course in a subfolder named `Data` (as provided on Moodle). This helps keep your project organized and makes it easier to locate your datasets.

### Steps to Inspect Your Data

1. **Importing the Data**  
   Use the `pandas` library to load your data. Depending on the file format, you might use functions such as `pd.read_csv()` or `pd.read_excel()`.
2. **Viewing the Data**  
   Check the first few rows with `data.head()` to understand the structure of your dataset.
3. **Understanding the Data Structure**  
   Use `data.shape` to get the dimensions and `data.info()` to get the data types and non-null counts.
4. **Statistical Summary**  
   For numerical columns, use the `describe()` method to view summary statistics such as the mean, standard deviation, minimum, and maximum.
5. **Inspecting Column Names and Data Types**  
   Review the column names (`data.columns`) and their corresponding data types (`data.dtypes`) to ensure everything is as expected.
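
The steps above can be sketched on a small example. The DataFrame below is hypothetical toy data standing in for a file loaded with `pd.read_csv()`; the column names are borrowed from the glass dataset, but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data (assumption): stands in for a file loaded with pd.read_csv()
df = pd.DataFrame(
    {
        "RI": [1.521, 1.518, 1.516, 1.518],
        "Na": [13.64, 13.89, 13.53, 13.21],
        "Type": [1, 1, 2, 2],
    }
)

print(df.head(3))            # first three rows
print(df.shape)              # (number of rows, number of columns)
df.info()                    # data types and non-null counts (prints directly)
print(df.describe())         # summary statistics for the numerical columns
print(df.columns.tolist())   # column names
print(df.dtypes)             # data type of each column
```

Note that `info()` prints its report itself, so it does not need to be wrapped in `print()`.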



In [128]:
import pandas as pd
data = pd.read_csv("../Data/glass.csv")
data

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
0,1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.0,1
1,1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.0,1
2,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.00,0.0,1
3,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.00,0.0,1
4,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.00,0.0,1
...,...,...,...,...,...,...,...,...,...,...
209,1.51623,14.14,0.00,2.88,72.61,0.08,9.18,1.06,0.0,7
210,1.51685,14.92,0.00,1.99,73.06,0.00,8.40,1.59,0.0,7
211,1.52065,14.36,0.00,2.02,73.42,0.00,8.44,1.64,0.0,7
212,1.51651,14.38,0.00,1.94,73.61,0.00,8.48,1.57,0.0,7


<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">

##### **Exercise**

Create a function that prints the following information about a DataFrame:

1. Its shape
2. Its type
3. The names of its columns
4. Its index



</div>


In [130]:
def information_dataset(dataset):
    print("dataset")
    return 

information_dataset(data)

dataset
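
One possible way to complete the function is sketched below. It is not the only valid answer: here "type" is read as the Python type of the object, which is an assumption (the per-column types in `dataset.dtypes` would be another reasonable reading). The `toy` DataFrame is hypothetical and used only to make the example self-contained.

```python
import pandas as pd


def information_dataset(dataset):
    """Print the shape, type, column names, and index of a DataFrame."""
    print("Shape:", dataset.shape)
    print("Type:", type(dataset))              # assumption: type of the object itself
    print("Columns:", list(dataset.columns))
    print("Index:", dataset.index)


# Hypothetical toy data, used instead of glass.csv so the sketch runs on its own
toy = pd.DataFrame({"RI": [1.52, 1.51], "Type": [1, 7]})
information_dataset(toy)
```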


**Expected output**

## Duplicate Values

Calling `help(pd.DataFrame.duplicated)` displays the documentation for the **`duplicated()`** method of a pandas DataFrame. This method flags **duplicate rows** in a dataset.

In [135]:
help(pd.DataFrame.duplicated)

Help on function duplicated in module pandas.core.frame:

duplicated(self, subset: 'Hashable | Sequence[Hashable] | None' = None, keep: 'DropKeep' = 'first') -> 'Series'
    Return boolean Series denoting duplicate rows.

    Considering certain columns is optional.

    Parameters
    ----------
    subset : column label or sequence of labels, optional
        Only consider certain columns for identifying duplicates, by
        default use all of the columns.
    keep : {'first', 'last', False}, default 'first'
        Determines which duplicates (if any) to mark.

        - ``first`` : Mark duplicates as ``True`` except for the first occurrence.
        - ``last`` : Mark duplicates as ``True`` except for the last occurrence.
        - False : Mark all duplicates as ``True``.

    Returns
    -------
    Series
        Boolean series for each duplicated rows.

    See Also
    --------
    Index.duplicated : Equivalent method on index.
    Series.duplicated : Equivalent method on Series.

<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">

### **Exercise: Identifying Duplicate Rows in a DataFrame**

Your task is to **identify and remove rows with duplicate values** in the given DataFrame.


</div>

In [1]:
# exercise
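
As a starting point, here is a minimal sketch of how `duplicated()` and `drop_duplicates()` can be combined. The `toy` DataFrame is a hypothetical example in which the third row repeats the first; it is not the glass dataset.

```python
import pandas as pd

# Hypothetical toy data: row 2 is an exact copy of row 0
toy = pd.DataFrame({"RI": [1.52, 1.51, 1.52, 1.53], "Type": [1, 2, 1, 7]})

dup_mask = toy.duplicated()            # True for rows that repeat an earlier row
n_duplicates = int(dup_mask.sum())     # number of duplicate rows
deduplicated = toy.drop_duplicates()   # keeps the first occurrence of each row

print(toy[dup_mask])                   # show the duplicate rows
print(n_duplicates, deduplicated.shape)
```

With the default `keep='first'`, the first occurrence is not marked as a duplicate; passing `keep=False` would mark every copy.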


## Missing Values

For the exercises, we introduce missing values into the dataset ourselves.

In [140]:
import numpy as np
data.iloc[::3, -1] = np.nan  # set every third row of the last column (Type) to NaN (not a number)
data.iloc[::4, 0] = np.nan   # set every fourth row of the first column (RI) to NaN

data.head(10)

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
0,,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,
1,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,1.0
2,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,1.0
3,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.0,0.0,
4,,13.27,3.62,1.24,73.08,0.55,8.07,0.0,0.0,1.0
5,1.51596,12.79,3.61,1.62,72.97,0.64,8.07,0.0,0.26,1.0
6,1.51743,13.3,3.6,1.14,73.09,0.58,8.17,0.0,0.0,
7,1.51756,13.15,3.61,1.05,73.24,0.57,8.24,0.0,0.0,1.0
8,,14.04,3.58,1.37,72.08,0.56,8.3,0.0,0.0,1.0
9,1.51755,13.0,3.6,1.36,72.99,0.57,8.4,0.0,0.11,


In [141]:
data.isna().any()

RI       True
Na      False
Mg      False
Al      False
Si      False
K       False
Ca      False
Ba      False
Fe      False
Type     True
dtype: bool

<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">

### **Exercise: Identifying Missing Values in a DataFrame**

Using **Pandas** functions `isna()` and `notna()`, determine which columns in a DataFrame contain **missing values** (NaN values).



</div>
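
A minimal sketch of one approach follows, assuming a hypothetical `toy` DataFrame rather than the glass dataset. `isna().any()` flags columns with at least one missing value, while `notna().all()` flags columns that are complete.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in RI and Type
toy = pd.DataFrame(
    {
        "RI": [np.nan, 1.51, 1.52],
        "Na": [13.6, 13.9, 13.5],
        "Type": [1.0, np.nan, np.nan],
    }
)

cols_with_na = toy.columns[toy.isna().any()].tolist()    # at least one NaN
complete_cols = toy.columns[toy.notna().all()].tolist()  # no NaN at all
na_counts = toy.isna().sum()                             # NaN count per column

print(cols_with_na)
print(complete_cols)
print(na_counts)
```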

## How to Deal with Missing Values

Handling missing values is a crucial step in **data preprocessing**. The presence of missing values can lead to **biased results**, **incorrect analysis**, or even **errors in model training**. Below are different techniques to deal with missing values, depending on the context and the impact of missing data on the dataset.

### Drop


In [165]:
data = pd.read_csv("../Data/glass.csv")
data.iloc[::3, -1] = np.nan  # set every third row of the last column (Type) to NaN

<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">

##### **Exercise**

Make two copies of the dataset. In one, drop the rows that contain missing values; in the other, drop the columns that contain missing values. Print the shape of each result to compare.



</div>
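
The two ways of dropping can be sketched as follows, using a hypothetical `toy` DataFrame in place of the glass dataset. `dropna()` returns a new DataFrame by default, so the explicit copies are made only to mirror the wording of the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the Type column has missing values
toy = pd.DataFrame(
    {
        "RI": [1.52, 1.51, 1.52, 1.53],
        "Na": [13.6, 13.9, 13.5, 13.2],
        "Type": [np.nan, 1.0, 1.0, np.nan],
    }
)

rows_dropped = toy.copy().dropna(axis=0)  # drop rows containing any NaN
cols_dropped = toy.copy().dropna(axis=1)  # drop columns containing any NaN

print("original:", toy.shape)
print("rows dropped:", rows_dropped.shape)
print("columns dropped:", cols_dropped.shape)
```

Dropping rows loses observations, whereas dropping columns loses whole variables; comparing the shapes shows which of the two costs more in a given dataset.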

### Substitution (be careful with this approach!)

In [203]:
data = pd.read_csv("../Data/glass.csv")
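data.iloc[::4, 0] = np.nan   # set every fourth row of the first column (RI) to NaN
data.iloc[::3, -1] = np.nan  # set every third row of the last column (Type) to NaN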

<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">

### **Exercise: Custom Imputation with `fillna()`**

The goal of this exercise is to practice **handling missing values** by using the `fillna()` method with **custom replacement values** for different columns. Use the following dictionary as replacement values: `values = {'Type': 999, "RI": -1}`.

</div>
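
A minimal sketch of per-column substitution, assuming a hypothetical `toy` DataFrame instead of the glass dataset. When `fillna()` receives a dictionary, each key names a column and each value is the replacement used in that column; columns not listed are left untouched.

```python
import numpy as np
import pandas as pd

values = {"Type": 999, "RI": -1}  # replacement values given in the exercise

# Hypothetical toy data with missing values in RI and Type
toy = pd.DataFrame(
    {
        "RI": [np.nan, 1.51, 1.52],
        "Na": [13.6, 13.9, 13.5],
        "Type": [1.0, np.nan, 2.0],
    }
)

filled = toy.fillna(value=values)  # returns a new DataFrame; toy is unchanged
print(filled)
```

Sentinel values such as `999` or `-1` are easy to spot, but they enter later computations (means, distances, model fits) as if they were real measurements, which is one reason this approach calls for care.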

<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">

### **Exercise: Custom Imputation with `mean`**

The goal of this exercise is to practice **handling missing values** with **mean imputation**: use the `fillna()` method to replace each missing value with the mean of its column.

</div>
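
Mean imputation can be sketched as below, again on a hypothetical `toy` DataFrame. Passing a Series of column means to `fillna()` fills each column with its own mean. For a class label such as `Type`, the mean may not be a meaningful value, so this is shown only to illustrate the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in RI and Type
toy = pd.DataFrame(
    {
        "RI": [np.nan, 1.5, 1.7],
        "Na": [13.0, 14.0, 15.0],
        "Type": [1.0, np.nan, 3.0],
    }
)

column_means = toy.mean(numeric_only=True)  # NaN values are skipped by default
imputed = toy.fillna(column_means)          # each column is filled with its own mean

print(column_means)
print(imputed)
```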