<a href="https://colab.research.google.com/github/appliedcode/mthree-c422/blob/main/Exercises/day-6/Data_Validation/Data_Validation_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Data Validation and Preparation Using the Iris Dataset in Colab
This lab gives you hands-on practice with data validation and preparation techniques, using Python and pandas in Google Colab.

### Objectives
- Ingest a classic dataset
- Validate data types and values
- Detect and handle missing or anomalous data
- Prepare the data for downstream machine learning tasks



In [20]:
# Step 1: Set Up the Environment
# Install pandas if not available (usually pre-installed in Colab)
!pip install pandas -q

In [21]:
# Step 2: Load the IRIS Dataset
import pandas as pd

url = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
df = pd.read_csv(url)
print(df.head())



   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [22]:
# Step 3: Data Validation
# 3.1 Check Data Types and Summary
print(df.info())
print(df.describe())



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
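
Reading `df.info()` by eye works for five columns, but the same check can be automated. The following is a minimal sketch, assuming the expected column types match the `df.info()` output above. The names `EXPECTED_CHECKS` and `check_schema` are hypothetical helpers, not part of the lab, and a small stand-in frame replaces the downloaded data so the sketch runs offline.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_string_dtype

# Hypothetical expected schema: column name -> dtype predicate.
# The columns mirror the df.info() output shown above.
EXPECTED_CHECKS = {
    "sepal_length": is_float_dtype,
    "sepal_width": is_float_dtype,
    "petal_length": is_float_dtype,
    "petal_width": is_float_dtype,
    "species": is_string_dtype,
}

def check_schema(frame, expected):
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    for col, dtype_ok in expected.items():
        if col not in frame.columns:
            problems.append(f"missing column: {col}")
        elif not dtype_ok(frame[col]):
            problems.append(f"{col}: unexpected dtype {frame[col].dtype}")
    return problems

# Small stand-in frame (first two rows of the dataset) so this runs offline.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "sepal_width": [3.5, 3.0],
    "petal_length": [1.4, 1.4],
    "petal_width": [0.2, 0.2],
    "species": ["setosa", "setosa"],
})
schema_problems = check_schema(sample, EXPECTED_CHECKS)
print(schema_problems)
```

In the notebook you could call `check_schema(df, EXPECTED_CHECKS)` instead of using the stand-in frame.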

In [23]:
# 3.2 Validate Expected Value Ranges
print(df['species'].unique())
print(df.describe(include='all'))


['setosa' 'versicolor' 'virginica']
        sepal_length  sepal_width  petal_length  petal_width species
count     150.000000   150.000000    150.000000   150.000000     150
unique           NaN          NaN           NaN          NaN       3
top              NaN          NaN           NaN          NaN  setosa
freq             NaN          NaN           NaN          NaN      50
mean        5.843333     3.054000      3.758667     1.198667     NaN
std         0.828066     0.433594      1.764420     0.763161     NaN
min         4.300000     2.000000      1.000000     0.100000     NaN
25%         5.100000     2.800000      1.600000     0.300000     NaN
50%         5.800000     3.000000      4.350000     1.300000     NaN
75%         6.400000     3.300000      5.100000     1.800000     NaN
max         7.900000     4.400000      6.900000     2.500000     NaN
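
The summary above shows the observed ranges, but it does not flag anything by itself. One possible way to turn it into an explicit check is sketched below. The numeric bounds in `BOUNDS` are assumed plausibility limits in centimetres chosen for illustration, not values from the lab; the allowed species come from the `unique()` output above. The helper name `find_range_violations` is hypothetical.

```python
import pandas as pd

# Assumed plausibility bounds (cm); illustrative only, adjust to your domain knowledge.
BOUNDS = {
    "sepal_length": (0.0, 10.0),
    "sepal_width": (0.0, 10.0),
    "petal_length": (0.0, 10.0),
    "petal_width": (0.0, 10.0),
}
# The three labels printed by df['species'].unique() above.
ALLOWED_SPECIES = {"setosa", "versicolor", "virginica"}

def find_range_violations(frame, bounds, allowed_species):
    """Return a boolean Series that is True for rows breaking any rule."""
    bad = pd.Series(False, index=frame.index)
    for col, (low, high) in bounds.items():
        # between() is inclusive; NaN is never "between", so missing values are flagged too.
        bad |= ~frame[col].between(low, high)
    bad |= ~frame["species"].isin(allowed_species)
    return bad

# Stand-in frame with two deliberately broken rows so the sketch runs offline.
sample = pd.DataFrame({
    "sepal_length": [5.1, 49.0, 6.3],   # 49.0 looks like a misplaced decimal point
    "sepal_width": [3.5, 3.0, 3.3],
    "petal_length": [1.4, 1.4, 6.0],
    "petal_width": [0.2, 0.2, 2.5],
    "species": ["setosa", "setosa", "virginca"],  # misspelled label
})
violations = find_range_violations(sample, BOUNDS, ALLOWED_SPECIES)
print(sample[violations])
```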


In [24]:
# 3.3 Detect Missing or Anomalous Data
print(df.isnull().sum())

# Example: Check for any negative lengths/widths
for col in ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']:
    print(f"{col} has values < 0:", (df[col] < 0).any())


sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
sepal_length has values < 0: False
sepal_width has values < 0: False
petal_length has values < 0: False
petal_width has values < 0: False
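
Negative values are only one kind of anomaly. As a sketch of two further checks, the code below flags outliers with the common 1.5 × IQR rule and counts fully duplicated rows. The IQR rule is a swapped-in technique, not something the lab prescribes, and `iqr_outliers` is a hypothetical helper. The stand-in data is invented so the example runs offline.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Flag values more than k * IQR below the first or above the third quartile."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Invented stand-in data: the last width is an obvious outlier,
# and the row at index 4 repeats the row at index 0.
sample = pd.DataFrame({
    "sepal_width": [3.0, 3.1, 2.9, 3.2, 3.0, 9.9],
    "species": ["setosa"] * 6,
})

outlier_mask = iqr_outliers(sample["sepal_width"])
n_duplicates = int(sample.duplicated().sum())  # counts repeats after the first occurrence

print(sample[outlier_mask])
print("duplicate rows:", n_duplicates)
```

A flagged value is a candidate for inspection, not automatically an error: real measurements can sit outside the IQR fences.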


In [25]:
# Step 4: Data Preparation
# 4.1 Handle Missing Values (if found)
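# Step 3.3 found no missing values, so this call changes nothing on this dataset.
# It is kept as an example: gaps in each numeric column are filled with that column's median.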
df.fillna(df.median(numeric_only=True), inplace=True)
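
Because the Iris data has no gaps, the cell above has no visible effect. The sketch below applies the same median fill to a small invented frame that does contain missing values, to show what the call does. The data is made up for illustration.

```python
import pandas as pd

nan = float("nan")

# Invented frame with one gap per column.
sample = pd.DataFrame({
    "petal_length": [1.4, nan, 4.5, 5.1],
    "petal_width": [0.2, 0.3, nan, 1.8],
    "species": ["setosa", "setosa", None, "virginica"],
})

# Medians are computed per numeric column, ignoring missing values.
medians = sample.median(numeric_only=True)

# fillna with a Series fills each column using the entry with the matching name.
filled = sample.fillna(medians)

print(filled)
print(filled.isna().sum())
```

Note that the missing `species` value is left untouched, since there is no median for a text column. Categorical gaps need a separate decision, such as dropping the row or filling with the most frequent label.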


In [26]:
# 4.2 Create New Feature (example: sepal area)
df['sepal_area'] = df['sepal_length'] * df['sepal_width']
print(df[['sepal_length', 'sepal_width', 'sepal_area']].head())


   sepal_length  sepal_width  sepal_area
0           5.1          3.5       17.85
1           4.9          3.0       14.70
2           4.7          3.2       15.04
3           4.6          3.1       14.26
4           5.0          3.6       18.00


In [27]:
# 4.3 Encode Categorical Variables
# cat.codes numbers the categories in sorted order: setosa=0, versicolor=1, virginica=2
df['species_encoded'] = df['species'].astype('category').cat.codes
print(df[['species', 'species_encoded']].head())


  species  species_encoded
0  setosa                0
1  setosa                0
2  setosa                0
3  setosa                0
4  setosa                0
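
The first five rows only show code 0, so the full mapping is not visible in the output. The sketch below, using a short invented Series, shows one way to recover the code-to-label mapping and, as an alternative, a one-hot encoding with `pd.get_dummies`. One-hot encoding is not part of the lab; it is shown here because integer codes imply an ordering that some models may misread.

```python
import pandas as pd

# Invented stand-in for df['species'], with the labels in mixed order.
species = pd.Series(["virginica", "setosa", "versicolor", "setosa"])

as_category = species.astype("category")
codes = as_category.cat.codes

# Categories are sorted, so the position of each label is its integer code.
code_to_label = dict(enumerate(as_category.cat.categories))

# Alternative encoding: one indicator column per label, with no implied order.
one_hot = pd.get_dummies(species, prefix="species")

print(code_to_label)
print(codes.tolist())
print(list(one_hot.columns))
```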


In [28]:
# Step 5: Save Prepared Data (Optional)
df.to_csv("iris_prepared.csv", index=False)
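
After saving, it can be worth confirming that the file reads back as expected. The sketch below does a round trip through an in-memory buffer instead of a file on disk, using a small invented frame shaped like the prepared data. In the notebook you would read `iris_prepared.csv` back with `pd.read_csv` instead.

```python
import io
import pandas as pd

# Invented stand-in for the prepared frame.
prepared = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "sepal_width": [3.5, 3.0],
    "species": ["setosa", "setosa"],
    "species_encoded": [0, 0],
})

# Write to an in-memory text buffer, then read it back.
buffer = io.StringIO()
prepared.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises AssertionError if values, columns, or dtypes differ (floats compared with a tolerance).
pd.testing.assert_frame_equal(prepared, reloaded)
print("round trip OK:", reloaded.shape)
```

CSV does not store dtypes, so narrow integer types such as the `int8` produced by `cat.codes` come back as `int64`, and a categorical column comes back as plain text.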


### Step 6: Reflection Questions
At the end of your notebook, answer:

- What kind of data problems could you catch with validation?

- Why is preparation crucial before training machine learning models?

- What issues would occur if you skip validation and preparation?

### Deliverables:

- All code outputs for each step above.

- Written answers to the reflection questions.

### Additional Exercises (Datasets)
- Student Performance (Portuguese School) - https://raw.githubusercontent.com/selva86/datasets/master/StudentPerformance.csv
- Credit Card Fraud Detection - https://raw.githubusercontent.com/datasets/credit-card-fraud-detection/master/data/creditcard.csv