<a href="https://colab.research.google.com/github/aaaraafaat/ML--Learning_Base/blob/main/ML_Logistic_Regression_using_SciKit_Learn_on_IRIS_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Iris Dataset: ML Classification using a Logistic Regression Model (scikit-learn)

Task: Use the iris flower dataset from `sklearn.datasets` to train a logistic regression model. Measure the accuracy of your model, then use the model to predict the species of samples in your test dataset. The iris dataset contains 150 samples, each with the following features:

1. Sepal Length
2. Sepal Width
3. Petal Length
4. Petal Width

Using the four features above, you will classify a flower into one of three categories:

1. Setosa
2. Versicolor
3. Virginica

The explanatory text for each step was added by Gemini.

### 1. Import necessary libraries

First, we import the required libraries: `pandas` for data manipulation, `sklearn.linear_model` for Logistic Regression, `sklearn.model_selection` for splitting data, and `sklearn.metrics` for evaluating the model.

In [12]:
import pandas as pd
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

from sklearn.metrics import accuracy_score

### 2. Load the Iris dataset

We load the Iris dataset, which is a built-in dataset in scikit-learn, suitable for classification tasks.

In [13]:
from sklearn.datasets import load_iris
iris = load_iris()

### 3. Explore the dataset

Let's inspect the keys available in the `iris` dataset object to understand its structure.

In [14]:
iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

Next, we look at the `target` array, which contains the numerical labels for each flower species (0, 1, or 2).

In [15]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

To understand what these numerical labels represent, we check the `target_names`.

In [16]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
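As a small illustration of how the numeric labels and `target_names` fit together, the sketch below indexes `target_names` with the `target` array to get one species name per sample, then counts the samples in each class. The variable names `species`, `names`, and `counts` are introduced here for illustration only and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Index the array of names with the integer labels: one name per sample.
species = iris.target_names[iris.target]

# Count how many samples belong to each species.
names, counts = np.unique(species, return_counts=True)
for name, count in zip(names, counts):
    print(f"{name}: {count}")
# setosa: 50
# versicolor: 50
# virginica: 50
```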

We also inspect the `feature_names` to know what each column in our dataset represents.

In [17]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

### 4. Create a Pandas DataFrame

We convert the `iris.data` into a pandas DataFrame, using the `feature_names` as column headers, for easier manipulation and viewing.

In [18]:
df = pd.DataFrame(iris.data, columns=iris.feature_names)

Display the DataFrame to get a glimpse of the data. For a long frame, pandas shows only the first and last few rows.

In [None]:
df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


### 5. Initialize the Logistic Regression Model

We instantiate the Logistic Regression model from `sklearn.linear_model`. We'll use this model for our classification task.

In [19]:
model = linear_model.LogisticRegression()

### 6. Train the model

We train the Logistic Regression model using the `fit` method. The features (`df`) are used as input, and the `iris.target` (flower species) serves as the output variable.

In [20]:
model.fit(df, iris.target)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### 7. Make a prediction

Now, let's use our trained model to predict the species of a flower from its sepal and petal measurements. The input `[[5.1, 3.5, 1.4, 0.2]]` gives the sepal length, sepal width, petal length, and petal width, in that order. Note that these values match the first row of the dataset, so the model has already seen this sample during training.

In [21]:
model.predict([[5.1, 3.5, 1.4, 0.2]])



array([0])
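Besides the predicted class index, it can help to see how confident the model is. The sketch below uses `predict_proba` to print one probability per species for the same sample. The separate model `clf` and the setting `max_iter=1000` are assumptions made so the block runs on its own; they are not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()

# Assumed stand-alone model; max_iter is raised so the solver can converge.
clf = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

sample = [[5.1, 3.5, 1.4, 0.2]]

# predict_proba returns one row per sample and one column per class.
probabilities = clf.predict_proba(sample)[0]
for name, probability in zip(iris.target_names, probabilities):
    print(f"{name}: {probability:.3f}")
```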

The model has been trained and can now make predictions. To further evaluate its performance, we would typically split the data into training and testing sets, train on the training set, and evaluate accuracy on the test set. We can also look at the confusion matrix to understand the model's performance on each class.
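The evaluation described above could look like the following sketch. The 80/20 split, the stratification, `random_state=42`, `max_iter=1000`, and the name `eval_model` are all assumed choices for illustration; the original notebook does not specify them.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold out 20% of the samples for testing (assumed split), keeping the
# class proportions equal in both parts via stratify.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, stratify=iris.target, random_state=42
)

# Fit on the training part only, so the test samples stay unseen.
eval_model = LogisticRegression(max_iter=1000)
eval_model.fit(X_train, y_train)

y_pred = eval_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)

print("Test accuracy:", test_accuracy)
# Rows are the true classes, columns are the predicted classes.
print(matrix)
```

Off-diagonal entries in the confusion matrix show which species the model confuses with each other.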

In [23]:
print(iris.target_names[model.predict([[5.1, 3.5, 1.4, 0.2]])])

['setosa']




Because the model was fit on a DataFrame with named columns, passing a plain list to `predict` triggers the 'X does not have valid feature names' warning. To prevent it, we should provide the input for prediction as a pandas DataFrame with the same feature names that were used in training.

In [24]:
new_flower_data = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=iris.feature_names)
predicted_species_index_df = model.predict(new_flower_data)[0]
print(f"Predicted species (using DataFrame input): {iris.target_names[predicted_species_index_df]}")

Predicted species (using DataFrame input): setosa
