
First, let's load the dataset and examine its structure to understand what needs pre-processing.

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv("iris.csv")

# Display the first few rows of the dataset
df.head()


   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


Explanation of Steps:

Install scikit-learn: The package is installed at the beginning so that the imports that follow do not fail.
Load the Data: We load the dataset and inspect the first few rows.
Handle Missing Values: Missing values are replaced with the column mean (for numerical data) or the most frequent value (for categorical data).
Encode Categorical Variables: Categorical data is converted to integer codes using label encoding.
Feature Scaling: The numerical features are standardized to zero mean and unit variance so that no feature dominates because of its units.
Train-Test Split: The data is split into training and testing sets (80% for training, 20% for testing).

After running this, your data will be pre-processed and ready for model training.
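
To make the imputation, encoding, and scaling steps concrete, here is a minimal plain-Python sketch of what each one computes. It is an illustration only, not scikit-learn's implementation. The helper names (`impute_mean`, `label_encode`, `standardize`) and the sample values are hypothetical, and the sketch assumes missing entries are represented as `None`.

```python
from statistics import fmean, pstdev


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = fmean(observed)
    return [mean if v is None else v for v in values]


def label_encode(values):
    """Map each distinct label to an integer, using the sorted order of the labels."""
    classes = sorted(set(values))
    index = {label: i for i, label in enumerate(classes)}
    return [index[v] for v in values], classes


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Made-up sample values, loosely modelled on the iris columns.
sepal_width = impute_mean([3.5, None, 3.1])
codes, classes = label_encode(["setosa", "virginica", "versicolor", "setosa"])
scaled = standardize([5.1, 4.9, 4.7, 4.6, 5.0])

print(sepal_width)
print(codes, classes)
print(scaled)
```

In the notebook itself, `SimpleImputer`, `LabelEncoder`, and `StandardScaler` do this work on whole DataFrame columns.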

In [5]:
!pip install scikit-learn


^C
Collecting scikit-learn
  Downloading scikit_learn-1.5.2-cp312-cp312-win_amd64.whl.metadata (13 kB)
Collecting scipy>=1.6.0 (from scikit-learn)
  Downloading scipy-1.14.1-cp312-cp312-win_amd64.whl.metadata (60 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.5.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.5.2-cp312-cp312-win_amd64.whl (11.0 MB)
   ---------------------------------------- 0.0/11.0 MB ? eta -:--:--

ERROR: Exception:
Traceback (most recent call last):
  File "F:\Anaconda\envs\ANN\Lib\site-packages\pip\_vendor\urllib3\response.py", line 438, in _error_catcher
    yield
  File "F:\Anaconda\envs\ANN\Lib\site-packages\pip\_vendor\urllib3\response.py", line 561, in read
    data = self._fp_read(amt) if not fp_closed else b""
           ^^^^^^^^^^^^^^^^^^
  File "F:\Anaconda\envs\ANN\Lib\site-packages\pip\_vendor\urllib3\response.py", line 527, in _fp_read
    return self._fp.read(amt) if amt is not None else self._fp.read()
           ^^^^^^^^^^^^^^^^^^
  File "F:\Anaconda\envs\ANN\Lib\site-packages\pip\_vendor\cachecontrol\filewrapper.py", line 98, in read
    data: bytes = self.__fp.read(amt)
                  ^^^^^^^^^^^^^^^^^^^
  File "F:\Anaconda\envs\ANN\Lib\http\client.py", line 479, in read
    s = self.fp.read(amt)
        ^^^^^^^^^^^^^^^^^
  File "F:\Anaconda\envs\ANN\Lib\socket.py", line 720, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
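
The traceback above ends inside a socket read, and the output starts with `^C`, so the download appears to have been interrupted rather than rejected. Before retrying the install, it can help to check whether the package is already importable. The following is a small sketch using only the standard library; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# scikit-learn is imported under the name "sklearn".
if is_installed("sklearn"):
    print("scikit-learn is already available; no install needed.")
else:
    print("scikit-learn not found; retry the install in a notebook cell.")
```

If the check reports that the package is missing, rerunning the install cell on a stable connection is usually enough.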
 

In [10]:
# Install scikit-learn if not already installed
# (%pip installs into the environment of the running kernel)
%pip install scikit-learn

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

# Load the dataset
df = pd.read_csv("iris.csv")

# Step 1: Understanding the data
print("First 5 rows of the dataset:")
print(df.head())

print("\nData types and missing values:")
print(df.info())

# Step 2: Handling missing values
# If there are missing values, we will replace them with the mean for numerical columns
# and the most frequent value for categorical columns

# Check for missing values
print("\nMissing values in each column:")
print(df.isnull().sum())

# Handling missing numerical values (if any)
imputer_num = SimpleImputer(strategy='mean')
num_cols_to_impute = df.select_dtypes(include=['float64', 'int64']).columns
df[num_cols_to_impute] = imputer_num.fit_transform(df[num_cols_to_impute])

# Handling missing categorical values (if any)
imputer_cat = SimpleImputer(strategy='most_frequent')
cat_cols_to_impute = df.select_dtypes(include=['object']).columns
df[cat_cols_to_impute] = imputer_cat.fit_transform(df[cat_cols_to_impute])

# Step 3: Encoding categorical variables (if any)
# Let's check if there are any categorical columns that need to be encoded
categorical_cols = df.select_dtypes(include=['object']).columns
print(f"\nCategorical columns: {categorical_cols}")

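# Label encoding for categorical variables (if necessary)
# Fit one encoder per column and keep it, so the codes can be mapped back to labels later
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])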

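# Step 4: Feature Scaling
# Scaling the numerical features to ensure they are on the same scale.
# The label-encoded columns are now integer-typed and may be picked up by the dtype
# filter, so they are excluded here; otherwise the class labels would be standardized too.
scaler = StandardScaler()
numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns.difference(categorical_cols)
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])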

print("\nScaled features preview:")
print(df.head())

# Step 5: Splitting the dataset into training and testing sets
# Assuming the last column is the target variable
X = df.iloc[:, :-1]  # Features
y = df.iloc[:, -1]   # Target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("\nShapes of training and test sets:")
print(f"X_train: {X_train.shape}, X_test: {X_test.shape}")
print(f"y_train: {y_train.shape}, y_test: {y_test.shape}")

# Now you can proceed with model training using X_train, y_train, X_test, and y_test.


First 5 rows of the dataset:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Data types and missing values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None

Missing values in each column:
sepal_length    0
sepal_width     0
petal_length    0
petal_w