# Daily Task: Predicting Iris Flower Species

#### Step 1: Load the Iris Dataset
- Load the Iris dataset from sklearn.datasets.
- Convert the dataset into a Pandas DataFrame with feature names and target species.

In [8]:
from sklearn.datasets import load_iris
import pandas as pd

raw_data = load_iris()

df = pd.DataFrame(raw_data.data, columns=raw_data.feature_names)
df['Species'] = raw_data.target

print(df.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   Species  
0        0  
1        0  
2        0  
3        0  
4        0  
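
The `Species` column above holds integer codes (0, 1, 2). A minimal sketch of one way to attach readable species names, assuming the names come from `raw_data.target_names`; the `Species_name` column is a hypothetical addition and not part of the original task:

```python
from sklearn.datasets import load_iris
import pandas as pd

raw_data = load_iris()
df = pd.DataFrame(raw_data.data, columns=raw_data.feature_names)
df['Species'] = raw_data.target

# Hypothetical extra column: look up each integer code in target_names
df['Species_name'] = raw_data.target_names[raw_data.target]

# Show one row per (code, name) pair to confirm the mapping
print(df[['Species', 'Species_name']].drop_duplicates())
```

Keeping the integer `Species` column alongside the names means the later steps can continue to use it as the target unchanged.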


#### Step 2: Data Exploration
- Check for missing values in the dataset.
- Describe the basic statistics of the dataset.
- Check the distribution of the target variable (Species).

In [15]:
## Missing Values
df.isnull().sum()

## Basic Statistics
df.describe()

## Target Distribution
df['Species'].value_counts()

0    50
1    50
2    50
Name: Species, dtype: int64
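
A notebook cell displays only the value of its last expression, which is why only the target distribution appears in the output above. The sketch below is one way to see all three results at once, by storing each in a variable (the names `missing_counts`, `summary_stats`, and `class_counts` are illustrative) and printing them explicitly:

```python
from sklearn.datasets import load_iris
import pandas as pd

raw_data = load_iris()
df = pd.DataFrame(raw_data.data, columns=raw_data.feature_names)
df['Species'] = raw_data.target

missing_counts = df.isnull().sum()           # missing values per column
summary_stats = df.describe()                # count, mean, std, min, quartiles, max
class_counts = df['Species'].value_counts()  # rows per species code

print("Missing values per column:")
print(missing_counts)
print("\nBasic statistics:")
print(summary_stats)
print("\nTarget distribution:")
print(class_counts)
```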

#### Step 3: Feature Selection and Data Splitting
- Select the feature columns and the target column (Species).
- Split the dataset into training and testing sets (80% train, 20% test) using train_test_split from sklearn.

In [16]:
from sklearn.model_selection import train_test_split
import numpy as np

data = df.drop(columns = 'Species')
target = df['Species']

data_train, data_test, target_train, target_test = train_test_split(data, target, test_size = 0.2)

print(f'data_train shape = {data_train.shape}')
print(f'data_test shape = {data_test.shape}')

data_train shape = (120, 4)
data_test shape = (30, 4)
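
The split above is random on every run, so the accuracies reported later may differ between executions. A hedged sketch of a reproducible, class-balanced variant follows; the `stratify` argument and the `random_state=42` seed are assumptions added for illustration and were not used to produce the results in this notebook:

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions equal in both subsets;
# random_state fixes the shuffle so the split is repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f'Train shape = {X_train.shape}, test shape = {X_test.shape}')
print(f'Test class counts = {Counter(y_test.tolist())}')
```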


#### Step 4: Train a Logistic Regression Model
- Create and train a Logistic Regression model using LogisticRegression from sklearn.
- Evaluate the model's accuracy on the test set.

In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

log_reg = LogisticRegression(max_iter = 1000)
log_reg.fit(data_train, target_train)

score = log_reg.score(data_test, target_test)

print(f'Model accuracy = {score:.4f}')

Model accuracy = 0.9333
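
A single accuracy number hides which species get confused with each other. As a rough sketch, a confusion matrix can be computed alongside `accuracy_score` (imported above but otherwise unused). The split here uses an assumed `random_state=0`, so its numbers will not necessarily match the 0.9333 reported above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0  # assumed seed, for repeatability only
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
acc = accuracy_score(y_test, y_pred)

print(cm)
print(f'Model accuracy = {acc:.4f}')
```

The diagonal of the matrix counts correct predictions, so accuracy equals the diagonal sum divided by the total number of test samples.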


#### Step 5: Model Interpretation
- Print the coefficients of the trained Logistic Regression model.
- Interpret the features with the strongest impact on the model.

In [31]:
co_effs = log_reg.coef_
inter = log_reg.intercept_

# Check whether the fitted model is binary or multi-class
if len(log_reg.classes_) == 2:
    # Binary Classification
    print("Binary Classification Problem")
    
    # Get the coefficients
    co_effs = log_reg.coef_[0]  # Only one set of coefficients for binary classification
    feature_names = data_train.columns
    
    # Create a DataFrame to interpret coefficients
    coef_df = pd.DataFrame(co_effs.T, index=feature_names, columns=['Coefficient'])
    
    # Sort by the absolute value of coefficients to see the most important features
    coef_df['Abs_Coefficient'] = coef_df['Coefficient'].abs()
    coef_df_sorted = coef_df.sort_values(by='Abs_Coefficient', ascending=False)
    
    print("\nFeature Importance based on Coefficients for Binary Classification:")
    print(coef_df_sorted)

else:
    # Multi-Class Classification
    print(f"Multi-Class Classification Problem with {len(log_reg.classes_)} classes")

    # Loop over classes and extract coefficients for each class
    for i in range(len(log_reg.classes_)):
        print(f"\nCoefficients for class {log_reg.classes_[i]}:")
        
        # Extracting coefficients for the ith class
        co_effs = log_reg.coef_[i]

        # Create a DataFrame to interpret coefficients
        coef_df = pd.DataFrame(co_effs.T, index=data_train.columns, columns=['Coefficient'])

        # Sort by the absolute value of coefficients to see the most important features
        coef_df['Abs_Coefficient'] = coef_df['Coefficient'].abs()
        coef_df_sorted = coef_df.sort_values(by='Abs_Coefficient', ascending=False)

        print(coef_df_sorted)

# Print the intercepts (one per class)
print(f"\nIntercepts for each class: {log_reg.intercept_}")

# Evaluate the model's accuracy
score = log_reg.score(data_test, target_test)
print(f'\nModel accuracy = {score:.4f}')

Multi-Class Classification Problem with 3 classes

Coefficients for class 0:
                   Coefficient  Abs_Coefficient
petal length (cm)    -2.335806         2.335806
petal width (cm)     -0.990408         0.990408
sepal width (cm)      0.839192         0.839192
sepal length (cm)    -0.518536         0.518536

Coefficients for class 1:
                   Coefficient  Abs_Coefficient
petal width (cm)     -0.783642         0.783642
sepal width (cm)     -0.469044         0.469044
sepal length (cm)     0.271644         0.271644
petal length (cm)    -0.177481         0.177481

Coefficients for class 2:
                   Coefficient  Abs_Coefficient
petal length (cm)     2.513287         2.513287
petal width (cm)      1.774050         1.774050
sepal width (cm)     -0.370148         0.370148
sepal length (cm)     0.246892         0.246892

Intercepts for each class: [ 10.05716565   3.71657699 -13.77374264]

Model accuracy = 0.9333
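
In the output above, petal length and petal width have the largest absolute coefficients for classes 0 and 2, while every class 1 coefficient is below 1 in magnitude. Raw coefficients depend on each feature's scale, so one hedged way to compare features more fairly is to standardize them first. The sketch below assumes a `StandardScaler` plus `LogisticRegression` pipeline, which is not part of the original notebook, and fits it on the full dataset purely for illustration:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

raw_data = load_iris()

# Scale every feature to zero mean and unit variance before fitting
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(raw_data.data, raw_data.target)

# coef_ has one row per class; transpose so features are rows
coefs = pipe.named_steps['logisticregression'].coef_
coef_table = pd.DataFrame(
    coefs.T, index=raw_data.feature_names, columns=raw_data.target_names
)

print(coef_table.round(3))
```

Each column shows how a one-standard-deviation increase in a feature shifts the log-odds for that species, which puts the four features on a comparable footing.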


#### Step 6: Train a Decision Tree Classifier
- Create and train a Decision Tree Classifier.
- Evaluate the Decision Tree model on the test set.

In [32]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()
dtc.fit(data_train, target_train)

dtc_score = dtc.score(data_test, target_test)

print(f'Model accuracy = {dtc_score:.4f}')

Model accuracy = 0.9333
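
A decision tree can be inspected in the same spirit as the logistic regression coefficients. The following is a minimal sketch, assuming a tree fitted on the full dataset with an arbitrary `random_state=0`, so it is not the exact tree trained above. It prints the impurity-based feature importances and the top levels of the learned rules:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

raw_data = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(raw_data.data, raw_data.target)

# Importances are normalized to sum to 1 across features
importances = dict(zip(raw_data.feature_names, tree.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {value:.3f}')

# Text rendering of the first few levels of the tree
rules = export_text(tree, feature_names=list(raw_data.feature_names), max_depth=2)
print(rules)
```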


#### Bonus Task: Cross-Validation
- Perform 5-fold Cross-Validation on both Logistic Regression and Decision Tree models. Compare the average accuracy.

In [35]:
from sklearn.model_selection import cross_val_score

log_reg_cross_val_results = cross_val_score(log_reg, data, target, cv=5)
dtc_cross_val_results = cross_val_score(dtc, data, target, cv=5)

print(f'Logistic Regression Cross-Validation Results (Accuracy): {log_reg_cross_val_results}')
print(f'Decision Tree Classifier Results (Accuracy): {dtc_cross_val_results}')

## Additional:
# Mean and standard deviation for Logistic Regression
log_reg_mean = np.mean(log_reg_cross_val_results)
log_reg_std = np.std(log_reg_cross_val_results)

# Mean and standard deviation for Decision Tree Classifier
dtc_mean = np.mean(dtc_cross_val_results)
dtc_std = np.std(dtc_cross_val_results)

print(f'Logistic Regression - Mean Accuracy: {log_reg_mean:.4f}, Std Dev: {log_reg_std:.4f}')
print(f'Decision Tree Classifier - Mean Accuracy: {dtc_mean:.4f}, Std Dev: {dtc_std:.4f}')


Logistic Regression Cross-Validation Results (Accuracy): [0.96666667 1.         0.93333333 0.96666667 1.        ]
Decision Tree Classifier Results (Accuracy): [0.96666667 0.96666667 0.9        1.         1.        ]
Logistic Regression - Mean Accuracy: 0.9733, Std Dev: 0.0249
Decision Tree Classifier - Mean Accuracy: 0.9667, Std Dev: 0.0365
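
Passing `cv=5` with a classifier uses stratified folds without shuffling. As a hedged alternative, the folds can be shuffled with a fixed seed by passing an explicit `StratifiedKFold` object. The seed values and the `models` dictionary below are illustrative assumptions, and the scores will generally differ from those printed above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Shuffled, stratified 5-fold splitter with an assumed seed
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

# Both models are scored on the same folds, which keeps the comparison fair
cv_scores = {name: cross_val_score(model, X, y, cv=cv) for name, model in models.items()}

for name, scores in cv_scores.items():
    print(f'{name} - Mean Accuracy: {scores.mean():.4f}, Std Dev: {scores.std():.4f}')
```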


#### Notes:

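- Mean Accuracy: The average accuracy of the model across the 5 folds; higher is better. Here Logistic Regression (0.9733) edges out the Decision Tree (0.9667), but the gap is smaller than either standard deviation, so the two models perform about the same on this dataset.
- Standard Deviation: How much the accuracy varies from fold to fold; a lower value means the model is more consistent. Logistic Regression (0.0249) is slightly more consistent than the Decision Tree (0.0365).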
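
The two statistics described in these notes can be reproduced directly from the fold accuracies printed earlier. This sketch copies those values by hand (as rounded in the output) and also shows the sample standard deviation (`ddof=1`) as an optional alternative; `np.std` uses the population form (`ddof=0`) by default, which is what the notebook reports:

```python
import numpy as np

# Fold accuracies copied from the cross-validation output above
log_reg_folds = np.array([0.96666667, 1.0, 0.93333333, 0.96666667, 1.0])
dtc_folds = np.array([0.96666667, 0.96666667, 0.9, 1.0, 1.0])

for name, folds in [('Logistic Regression', log_reg_folds),
                    ('Decision Tree Classifier', dtc_folds)]:
    mean = folds.mean()
    pop_std = folds.std()           # ddof=0, the np.std default
    sample_std = folds.std(ddof=1)  # divides by n - 1 instead of n
    print(f'{name} - Mean: {mean:.4f}, Std (ddof=0): {pop_std:.4f}, '
          f'Std (ddof=1): {sample_std:.4f}')
```

With only five folds the sample form is noticeably larger than the population form, so it is worth stating which one a reported figure uses.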