# Building First Predictive Models  
### Iris Dataset: Data Foundations, Grouping, and Vector Scoring

#### 1. Data Inspection and Quality Check
In this step, I inspect the dataset structure, data types, and missing values using the `.info()` method.
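
As a minimal sketch of this kind of check, the snippet below runs it on a tiny hand-made frame. The column names and values are illustrative and are not the Iris data. `DataFrame.info()` prints dtypes and non-null counts, while `isna().sum()` returns the number of missing values per column directly.

```python
import pandas as pd

# Illustrative toy data (not the Iris dataset); one value is deliberately missing.
toy = pd.DataFrame(
    {
        "length_cm": [5.1, 4.9, None],
        "label": ["a", "b", "c"],
    }
)

toy.info()                  # prints dtypes and non-null counts per column
missing = toy.isna().sum()  # Series: number of missing values per column
print(missing)
```

In the Iris output further down, every column reports 150 non-null values out of 150 entries, so there is nothing to impute or drop.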

#### 2. Summary Statistics
The `.describe()` method reports descriptive statistics for each numerical feature: count, mean, standard deviation, minimum, maximum, and quartiles. Together these give a compact overview of each feature's range and spread.
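
A small sketch on made-up numbers (the column name and values are illustrative, not taken from the Iris data) shows what the summary contains. Note that pandas reports the sample standard deviation (`ddof=1`) and, by default, summarizes only numeric columns.

```python
import pandas as pd

# Illustrative toy column, chosen so the statistics are easy to check by hand.
toy = pd.DataFrame({"petal_length_cm": [1.0, 2.0, 3.0, 4.0]})

summary = toy.describe()  # DataFrame indexed by count, mean, std, min, 25%, 50%, 75%, max
print(summary)

# Individual statistics can be read back by row label.
mean_value = summary.loc["mean", "petal_length_cm"]
std_value = summary.loc["std", "petal_length_cm"]
```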

#### 3. Grouping and Aggregation
I group the dataset by Iris species and compute average feature measurements for each group. This condenses the dataset into a high-level statistical summary useful for modeling.
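
The grouping step can be sketched with `groupby` on a toy frame. The values below are illustrative, and this is only an assumption about what a helper such as `grouped_feature_means` might do internally, not its actual implementation.

```python
import pandas as pd

# Illustrative toy data: two species, two flowers each.
toy = pd.DataFrame(
    {
        "species_name": ["setosa", "setosa", "virginica", "virginica"],
        "petal length (cm)": [1.4, 1.6, 5.0, 6.0],
    }
)

# One row per species, one column per numeric feature.
group_means = toy.groupby("species_name").mean(numeric_only=True)
print(group_means)
```

The result has one row per group, which is why the 150-row Iris frame collapses to three rows in the output further down.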

#### 4. Vector Scoring Using the Dot Product
A single flower is treated as a vector of features. By computing the dot product between this vector and a weight vector, I generate a scalar score, which is an early form of a linear predictive model.
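
As a worked sketch of the arithmetic, the snippet below uses the measurements of the first sample in the standard Iris dataset and the weights from the notebook cell. It assumes `first_flower_vector` returns the four measurements in this order; `dot_score` itself is not shown here, so `np.dot` stands in for it.

```python
import numpy as np

# First Iris sample: sepal length, sepal width, petal length, petal width (cm).
x0 = np.array([5.1, 3.5, 1.4, 0.2])
weights = np.array([1.5, 0.5, 1.5, 0.5])  # same weights as in the notebook cell

# Dot product: the sum of element-wise products,
# 5.1*1.5 + 3.5*0.5 + 1.4*1.5 + 0.2*0.5 = 7.65 + 1.75 + 2.1 + 0.1 = 11.6
score = float(np.dot(x0, weights))
manual = float(sum(x * w for x, w in zip(x0, weights)))

print(score)
```

The exact result is 11.6; a printed value such as `11.599999999999998` is the same number up to floating-point rounding.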


```python
import sys
from pathlib import Path

import numpy as np
from IPython.display import display

# Add the project root (the parent of notebooks/) to the Python path
# so that the `src` package can be imported from this notebook.
sys.path.append(str(Path("..").resolve()))

from src.iris_utils import (
    load_iris_df,
    describe_features,
    mean_sepal_length,
    grouped_feature_means,
    first_flower_vector,
    dot_score,
    export_csv,
)

# Load the data, including a readable species_name column.
df = load_iris_df(add_species_name=True)

# 1. Structure, dtypes, and non-null counts.
df.info()

# 2. Summary statistics for the numeric features.
display(describe_features(df))

print("Mean sepal length (cm):", mean_sepal_length(df))

# 3. Per-species feature means.
display(grouped_feature_means(df, by="species_name"))

# 4. Score the first flower against a weight vector.
weights = np.array([1.5, 0.5, 1.5, 0.5])
x0 = first_flower_vector(df)
print("Dot score (first flower):", dot_score(x0, weights))

# Save the frame for later steps.
export_csv(df, "../data/iris_clean.csv")
```



```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   species            150 non-null    int64  
 5   species_name       150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
```


| statistic | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


Mean sepal length (cm): 5.843333333333334


| species_name | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| setosa | 5.006 | 3.428 | 1.462 | 0.246 |
| versicolor | 5.936 | 2.77 | 4.26 | 1.326 |
| virginica | 6.588 | 2.974 | 5.552 | 2.026 |


Dot score (first flower): 11.599999999999998



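### Why this matters
We’re now practicing the real “notebook → application” pathway: the notebook stays thin and calls reusable functions from the `src.iris_utils` module in the `src/` directory, so the same code can later be imported by an application.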