# Iris Classification

## EDA Notebook

This notebook uses the training data to conduct exploratory data analysis (EDA).

The feature and target files are loaded and combined into a single DataFrame, and the numeric target values are mapped to species names.

Summary statistics and visualizations then offer insight into patterns and relationships among the features.

Because the data is already clean (e.g., no missing values), some steps are included for illustration only.

**NOTE:** EDA is performed on the training data only, because that is the data the model will be built with. If the test data is not withheld, assumptions or associations drawn from it can carry into modeling and make the evaluation unreliable.
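
For context, a hold-out split of the kind the note describes could look like the sketch below. It uses pandas on a made-up frame. How the `X_train.csv` and `y_train.csv` files in this project were actually produced is not shown in this notebook, and the 20% fraction, the seed, and the names `full`, `train_part`, and `test_part` are assumptions for illustration.

```python
import pandas as pd

# Made-up labelled data: 10 rows for each of 3 classes
full = pd.DataFrame({
    "feature": range(30),
    "target": [0] * 10 + [1] * 10 + [2] * 10,
})

# Stratified hold-out: sample 20% of the rows within each class
test_part = full.groupby("target").sample(frac=0.2, random_state=0)
# Everything not held out remains available for EDA and model building
train_part = full.drop(test_part.index)

print(len(train_part), len(test_part))
```

Sampling within each class keeps the class balance the same in both parts, which matters for a small dataset.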

In [None]:
import os
import pandas as pd
import seaborn as sns
import plotly.express as px
from skimpy import skim_get_figure

In [None]:
# Set data path
data_path = os.path.join('..', 'data')
# Set img path
img_path = os.path.join('..', 'imgs')
# Create img dir if not there
os.makedirs(img_path, exist_ok=True)

In [None]:
# Create X and y file paths
X_train_data_path = os.path.join(data_path, 'X_train.csv')
y_train_data_path = os.path.join(data_path, 'y_train.csv')

In [None]:
# Read in data
X_train = pd.read_csv(X_train_data_path)
y_train = pd.read_csv(y_train_data_path)

In [None]:
# Combine X and y so labels can be used
df = pd.concat([X_train, y_train], axis=1)
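
`pd.concat(..., axis=1)` aligns rows by index label rather than by position, so the combined frame is only correct when both inputs share the same index. Below is a minimal sketch of a sanity check, using small made-up frames in place of the real CSVs. The helper `combine_features_and_target` is hypothetical and not part of this notebook.

```python
import pandas as pd


def combine_features_and_target(X, y):
    """Concatenate features and target column-wise after checking alignment."""
    # Equal row counts are a necessary condition for a clean merge
    if len(X) != len(y):
        raise ValueError(f"Row count mismatch: {len(X)} features vs {len(y)} targets")
    # concat(axis=1) joins on index labels, so the indexes must match exactly
    if not X.index.equals(y.index):
        raise ValueError("Index mismatch: reset or align indexes before concatenating")
    return pd.concat([X, y], axis=1)


# Toy stand-ins for X_train and y_train (values are illustrative only)
X_demo = pd.DataFrame({"petal length (cm)": [1.4, 4.7, 6.0]})
y_demo = pd.DataFrame({"target": [0, 1, 2]})

combined = combine_features_and_target(X_demo, y_demo)
print(combined.shape)
```

Both files here are read with the default integer index, so the check would pass. It becomes useful if either frame is filtered or shuffled before the merge.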

In [None]:
# Create dictionary mapping target values to actual species names
target_mapping = {
    0: 'setosa',
    1: 'versicolor',
    2: 'virginica'
}
# Create new column where value comes from target_mapping dict based on target value
df['species'] = df['target'].map(target_mapping)
# Check species values and counts
df.species.value_counts()
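
`Series.map` with a dictionary returns `NaN` for any value that has no entry, and `value_counts()` drops `NaN` by default, so an incomplete mapping can go unnoticed. The sketch below shows one way to surface unmapped values on a toy series; the value `3` is deliberately invalid and the variable names are illustrative.

```python
import pandas as pd

target_mapping = {0: "setosa", 1: "versicolor", 2: "virginica"}

# Toy target column; the value 3 is deliberately outside the mapping
targets = pd.Series([0, 1, 2, 3], name="target")
species = targets.map(target_mapping)

# Any target value without a dictionary entry shows up as NaN
unmapped = targets[species.isna()].unique().tolist()
print(unmapped)  # [3]
```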

## Summary Statistics

- Pandas methods:
    - [info](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html)
    - [describe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)
- Skim report

In [None]:
# Built-in pandas method; prints a summary (dtypes, non-null counts) and returns None
df.info()

In [None]:
# Built-in pandas method; returns a DataFrame of summary statistics
df_desc = df.describe()
df_desc
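
`info` shows missing values only indirectly, through non-null counts, and neither method reports duplicate rows. Explicit checks might look like the sketch below, run here on a toy frame with one missing value and one duplicated row (the frame and variable names are illustrative).

```python
import pandas as pd

# Toy frame with one missing value and one duplicated row (illustrative only)
demo = pd.DataFrame({
    "sepal width (cm)": [3.5, None, 3.5],
    "target": [0, 1, 0],
})

# Count of missing values in each column
missing_per_column = demo.isna().sum()
# Number of rows that are identical to an earlier row
duplicate_rows = int(demo.duplicated().sum())

print(missing_per_column.to_dict(), duplicate_rows)
```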

In [None]:
# Set save path and generate skim report
skim_img_path = os.path.join(img_path, 'skim_summary.svg')
skim_get_figure(df, save_path=skim_img_path)

## Visualizations

In [None]:
# List the column names to identify the feature columns
df.columns

In [None]:
# Measurement columns to summarize and plot
feat_cols = ['sepal length (cm)', 'sepal width (cm)',
             'petal length (cm)', 'petal width (cm)']

In [None]:
# Mean of each measurement per species
df_grouped = df.groupby('species')[feat_cols].mean()
df_grouped

In [None]:
# Move 'species' from the index back into a column so Plotly can use it on the x-axis
test = df_grouped.reset_index()
test
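
The grouped table is in wide form, with one column per measurement. Some plots are easier to build from long form, with one row per species and measurement, which `melt` can produce. Below is a sketch on a made-up table; the column names `measurement` and `mean_cm` are hypothetical choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# Toy wide-form table shaped like the grouped means (values are made up)
wide = pd.DataFrame({
    "species": ["setosa", "versicolor"],
    "sepal length (cm)": [5.0, 5.9],
    "petal length (cm)": [1.5, 4.3],
})

# One row per (species, measurement) pair
long = wide.melt(id_vars="species", var_name="measurement", value_name="mean_cm")
print(long)
```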

In [None]:
def apply_default_plotly_styling(fig, title, xaxis_title=None,
                                 yaxis_title=None, legend_title=None):
    """Apply consistent title and font styling to a Plotly figure.

    Args:
        fig (plotly.graph_objects.Figure): Figure to format
        title (str): Main title for the graph
        xaxis_title (str, optional): Title for the x-axis; left unchanged if None
        yaxis_title (str, optional): Title for the y-axis; left unchanged if None
        legend_title (str, optional): Title for the legend; left unchanged if None

    Returns:
        plotly.graph_objects.Figure: The same figure with updated formatting
    """
    # Update layout for title and fonts
    fig.update_layout(
        title={
            'text': title,
            'y': 0.95,
            'x': 0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        font=dict(
            family="Arial, sans-serif",
            size=14,
            color="black"
        ),
        title_font=dict(size=24)
    )

    # Optional titles are applied only when provided
    if xaxis_title is not None:
        fig.update_layout(xaxis_title=xaxis_title)

    if yaxis_title is not None:
        fig.update_layout(yaxis_title=yaxis_title)

    if legend_title is not None:
        fig.update_layout(legend_title_text=legend_title)
    
    return fig

In [None]:
# Grouped bar chart of the mean of each measurement per species
fig = px.bar(test, x="species", y=feat_cols, barmode='group',
             color_discrete_sequence=px.colors.qualitative.Bold)

fig = apply_default_plotly_styling(
    fig,
    title="Average by Species",
    xaxis_title="Species",
    yaxis_title="Average (cm)",
    legend_title="Measurement"
)

fig.show()
# Save a static copy; write_image requires the kaleido package to be installed
bar_path = os.path.join(img_path, 'Iris_Average_Features_by_Species.png')
fig.write_image(bar_path)