Dataset:
https://archive.ics.uci.edu/dataset/492/metro+interstate+traffic+volume

## Inferring Relationships Between Traffic Volume and Weather and Temporal Data

In [11]:
# Package Imports
import pandas as pd
import altair as alt
# from ucimlrepo import fetch_ucirepo
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.pipeline import Pipeline

## Summary

## Introduction

## Methods

#### Data

#### Analysis

In this step, we load the Iris dataset from an online source into a pandas DataFrame. We check for missing values to confirm the dataset is complete, view the first and last few rows to get a sense of its structure, and inspect the shape, data types, and summary statistics to understand what kind of data we are working with.

In [3]:
# Data Import
iris = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")

# Check NA values, head, tail, shape, dtypes, and summary statistics
display(
   iris.isna().sum(),
   iris.head(),
   iris.tail(),
   iris.shape,
   iris.dtypes,
   iris.describe()
)

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |


(150, 5)

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


### Insights from Data

From the output above, together with the plots that follow, we can see:

- The dataset has 150 rows and 5 columns.

- There are no missing values, so the data is complete.

- The dataset contains three species: Setosa, Versicolor, and Virginica.

- Petal measurements show the strongest separation between species.

- Setosa is clearly distinct, while Versicolor and Virginica overlap somewhat.

- Sepal measurements show smaller differences and are less useful for distinguishing species.
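These observations can be checked numerically with a per-species summary. The sketch below is an illustration only: it assumes scikit-learn's bundled copy of Iris as an offline stand-in for the CSV loaded earlier (its values may differ slightly from that file), and the names `iris_df`, `species_counts`, and `species_means` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the CSV used in the notebook
bunch = load_iris(as_frame=True)
iris_df = bunch.frame.rename(columns={
    "sepal length (cm)": "sepal_length",
    "sepal width (cm)": "sepal_width",
    "petal length (cm)": "petal_length",
    "petal width (cm)": "petal_width",
})

# Map the integer target codes (0, 1, 2) to species names
species_names = [str(name) for name in bunch.target_names]
iris_df["species"] = iris_df["target"].map(dict(enumerate(species_names)))
iris_df = iris_df.drop(columns="target")

# Number of rows per species and mean of each measurement per species
species_counts = iris_df["species"].value_counts()
species_means = iris_df.groupby("species").mean()

print(species_counts)
print(species_means.round(2))
```

Comparing the rows of `species_means` shows how far apart the species are on each feature, which is the same comparison the plots below make visually.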

Scatter plot: 

This plot shows how the species separate on petal length and petal width. Setosa forms a clearly separate cluster, while Versicolor and Virginica sit close together with some overlap. This indicates that these two features are useful for classification.

In [4]:
alt.Chart(iris).mark_circle(size=120).encode(
    x=alt.X("petal_length", title="Petal Length (cm)"),
    y=alt.Y("petal_width", title="Petal Width (cm)"),
    color=alt.Color("species", title="Species"),
    tooltip=["species", "petal_length", "petal_width"]
).properties(
    width=500,
    height=400,
    title="Petal Length vs Petal Width by Species"
).interactive()

Boxplots: 

These plots show the distribution of each feature for each species. Comparing medians and ranges, Setosa has the smallest petals and the shortest sepals, although its sepals are the widest of the three species, while Virginica has the largest petals and the longest sepals. This helps explain why some features are better than others for classification.

In [5]:
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
# EDA Boxplot
iris_melt = iris.melt(id_vars="species", var_name="feature", value_name="value")
alt.Chart(iris_melt).mark_boxplot(size=40).encode(
    y=alt.Y("species:N", title="Species"),
    x=alt.X("value:Q", title="Measurement (cm)"),
    color=alt.Color("species:N", legend=None)
).properties(
    width=400,
    height=150
).facet(
    row=alt.Row("feature:N", title="Feature")
)

Correlation Heatmap: 

This plot shows how the numeric features are related to each other. Petal length and petal width are highly correlated, while sepal length and sepal width are only weakly (and slightly negatively) correlated. Together with the scatter plot and boxplots, this suggests that petal measurements may be more useful for distinguishing species.

In [6]:

corr = iris.drop(columns="species").corr().stack().reset_index()
corr.columns = ["feature1", "feature2", "correlation"]

# Heatmap
heatmap = alt.Chart(corr).mark_rect().encode(
    x=alt.X("feature1:N", title="Feature"),
    y=alt.Y("feature2:N", title="Feature"),
    color=alt.Color("correlation:Q", scale=alt.Scale(scheme="redblue"), title="Correlation")
).properties(
    width=300,
    height=300,
    title="Correlation Heatmap of Iris Features"
)

# Add correlation values
text = alt.Chart(corr).mark_text(size=14).encode(
    x="feature1:N",
    y="feature2:N",
    text=alt.Text("correlation:Q", format=".2f"),
    color=alt.condition(
        "datum.correlation > 0.5",
        alt.value("white"),
        alt.value("black")
    )
)

heatmap + text
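The `.corr()` call above uses Pearson's correlation coefficient by default. As a minimal sketch of what that computes, the function below (a hypothetical helper named `pearson`) applies the formula to the ten petal rows visible in the `head()` and `tail()` output. Because those rows cover only Setosa and Virginica, the result will not match the full-dataset value shown in the heatmap.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Petal measurements from rows 0-4 and 145-149 of the output shown earlier
petal_length = [1.4, 1.4, 1.3, 1.5, 1.4, 5.2, 5.0, 5.2, 5.4, 5.1]
petal_width = [0.2, 0.2, 0.2, 0.2, 0.2, 2.3, 1.9, 2.0, 2.3, 1.8]

r = pearson(petal_length, petal_width)
print(round(r, 3))
```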

## Results & Discussion

In [None]:
# Drop the species column to create the feature matrix; use it as the target vector

X = iris.drop(columns=["species"])
y = iris["species"]

The next step is to split the data into training and test sets. Here we use an 80/20 split.

In [9]:
# Train-Test Split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=522)
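The split above is purely random, so the class proportions in the test set can drift from those in the full data. One optional variation, not used in this notebook, is to pass `stratify` so each species keeps its share in both sets. The sketch below assumes scikit-learn's bundled Iris data as a stand-in, and the `_demo`, `X_tr`, and `X_te` names are hypothetical.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled Iris data stands in for the notebook's X and y
X_demo, y_demo = load_iris(return_X_y=True, as_frame=True)

# Same 80/20 split and seed as above, but stratified on the target
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=522, stratify=y_demo
)

# Count how many test rows fall in each class
test_counts = Counter(y_te)
print(len(X_tr), len(X_te), test_counts)
```

With 50 rows per species and a 20% test size, stratification places an equal number of each species in the test set.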

As a precautionary step, it is useful to fit a dummy model to the data to obtain a baseline accuracy. Because the dummy classifier ignores the features, any real classifier should beat its score, which gives a reference point when tuning the model and checking that train and validation scores are balanced.

In [17]:
# Use a dummy classifier to obtain a baseline accuracy
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

dummy = DummyClassifier(strategy="most_frequent")   
cv_dummyscore = cross_validate(dummy, X_train, y_train, cv=5, return_train_score=True)
cv_dummyscore_df = pd.DataFrame(cv_dummyscore)
cv_dummyscore_df

| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 0 | 0.000625 | 0.000404 | 0.333333 | 0.333333 |
| 1 | 0.00049 | 0.000351 | 0.333333 | 0.34375 |
| 2 | 0.000417 | 0.00031 | 0.333333 | 0.34375 |
| 3 | 0.000352 | 0.00099 | 0.333333 | 0.34375 |
| 4 | 0.000311 | 0.000277 | 0.333333 | 0.34375 |
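Scores near one third are what a `most_frequent` strategy should give on three roughly balanced classes: it always predicts the majority class of the training labels, so its accuracy equals that class's share of the labels it is scored on. The sketch below illustrates the idea in plain Python. The helper `most_frequent_baseline` and the label counts are hypothetical, chosen only for illustration, and are not the actual fold contents.

```python
from collections import Counter


def most_frequent_baseline(train_labels, eval_labels):
    """Return the majority training label and its accuracy on eval_labels."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in eval_labels if label == majority_label)
    return majority_label, correct / len(eval_labels)


# Hypothetical fold: 96 training labels and 24 validation labels
train_labels = ["setosa"] * 33 + ["versicolor"] * 32 + ["virginica"] * 31
valid_labels = ["setosa"] * 8 + ["versicolor"] * 8 + ["virginica"] * 8

majority, valid_accuracy = most_frequent_baseline(train_labels, valid_labels)
_, train_accuracy = most_frequent_baseline(train_labels, train_labels)
print(majority, valid_accuracy, train_accuracy)
```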


To apply column transformations such as StandardScaler(), we first collect the names of all feature columns in the dataset.

In [21]:
features = X_train.columns.tolist()
features

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

Defining a preprocessor before building the pipeline is good practice. The preprocessor specifies each required transformation and the features it is applied to.

In [None]:
from sklearn.preprocessing import StandardScaler

preprocessor = make_column_transformer(
    (StandardScaler(), features)
)
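To see what this preprocessor does, the sketch below fits an equivalent column transformer and checks that each scaled column has mean 0 and standard deviation 1. It assumes scikit-learn's bundled Iris data as a stand-in for the notebook's features, and the `demo_` names are hypothetical. In the real workflow the scaler should be fit on the training split only, which the pipeline takes care of.

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris features stand in for the notebook's X_train
X_demo, _ = load_iris(return_X_y=True, as_frame=True)
demo_features = X_demo.columns.tolist()

# Same construction as above: scale every feature column
demo_preprocessor = make_column_transformer((StandardScaler(), demo_features))
X_scaled = demo_preprocessor.fit_transform(X_demo)

# After standardization each column is centered with unit variance
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(np.round(col_means, 6), np.round(col_stds, 6))
```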

In [None]:
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])
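The pipeline is defined above but not yet fitted. One possible next step, sketched here under assumptions, is to cross-validate it on the training split for comparison with the dummy baseline and then score it on the held-out test set. The bundled Iris data stands in for the notebook's data, the `demo_` names are hypothetical, and this is an illustration rather than the notebook's final analysis.

```python
from sklearn.compose import make_column_transformer
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris data stands in for the notebook's X and y
X_demo, y_demo = load_iris(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=522
)

# Same structure as the pipeline above: scaling followed by logistic regression
demo_pipeline = Pipeline([
    ("preprocessor", make_column_transformer((StandardScaler(), X_demo.columns.tolist()))),
    ("classifier", LogisticRegression()),
])

# 5-fold cross-validation on the training split only
cv_results = cross_validate(demo_pipeline, X_tr, y_tr, cv=5, return_train_score=True)
mean_cv_accuracy = cv_results["test_score"].mean()

# Fit on the full training split, then evaluate once on the test split
demo_pipeline.fit(X_tr, y_tr)
test_accuracy = demo_pipeline.score(X_te, y_te)
print(mean_cv_accuracy, test_accuracy)
```

Because the scaler lives inside the pipeline, it is refit on each training fold during cross-validation, so no information from the validation folds leaks into preprocessing.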

## References