<h1><span style="color: lightblue;">Performing Data Wrangling ⚡️</span></h1>

## 🌟 Step 1
<h2><span style="color: yellow;">Data Gathering 🌎 </span></h2> 

In [1]:
# Import the necessary modules
import pandas as pd
import numpy as np

In [9]:
# Reading CSV
iris_data = pd.read_csv('Iris.csv')
iris_data

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


## 🌟 Step 2
<h2><span style="color: yellow;">Data Assessment 🌎 </span></h2>  
- In this step, you get to know the data more deeply. Before applying any cleaning methods, you need a clear idea of what the data contains.<br>
- The goal is a complete summary of the data.<br>
- Data assessment is often an iterative process.<br>
    
<h2><span style="color: red;"> 🛑 Step 1 Discover </span></h2>  
- View the dataset.<br>
- Check the shape of the data.<br>
    
### 🔥 Automatic Assessments 
- Programmatic.<br>
- Using Pandas.<br>
  - `head` and `tail`
  - `describe`
  - `sample`
  - `info`
  - `isnull`
  - `duplicated`
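
The cells below cover `sample`, `info`, `duplicated`, and `describe`. As a minimal sketch of the remaining checks (`shape`, `head`, `tail`, and `isnull`), the snippet here uses a small hypothetical stand-in DataFrame called `demo`, with one missing value added on purpose, so it runs without `Iris.csv`. On the real data you would call the same methods on `iris_data`.

```python
import pandas as pd

# Hypothetical stand-in for iris_data (not the real dataset);
# one value is deliberately missing to show what isnull() reports
demo = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, None, 6.3],
    "PetalLengthCm": [1.4, 1.4, 1.3, 5.0],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

n_rows, n_cols = demo.shape              # (number of rows, number of columns)
first_rows = demo.head(2)                # first 2 rows
last_rows = demo.tail(2)                 # last 2 rows
nulls_per_column = demo.isnull().sum()   # count of missing values per column

print(n_rows, n_cols)
print(nulls_per_column)
```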

In [5]:
# Display a random sample of 50 rows
iris_data.sample(50)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
31,32,5.4,3.4,1.5,0.4,Iris-setosa
27,28,5.2,3.5,1.5,0.2,Iris-setosa
53,54,5.5,2.3,4.0,1.3,Iris-versicolor
123,124,6.3,2.7,4.9,1.8,Iris-virginica
97,98,6.2,2.9,4.3,1.3,Iris-versicolor
9,10,4.9,3.1,1.5,0.1,Iris-setosa
42,43,4.4,3.2,1.3,0.2,Iris-setosa
36,37,5.5,3.5,1.3,0.2,Iris-setosa
7,8,5.0,3.4,1.5,0.2,Iris-setosa
101,102,5.8,2.7,5.1,1.9,Iris-virginica


In [6]:
# Show column dtypes and non-null counts (reveals missing values)
iris_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [10]:
# Count the duplicated rows in iris_data
iris_data.duplicated().sum()

0
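
One thing to keep in mind: `duplicated()` compares whole rows, so a unique `Id` column means no row can ever be flagged, even when the measurements repeat. The sketch below shows this on a small hypothetical DataFrame (`demo` is an illustration, not the real data) by passing the `subset` parameter. Whether repeated measurements should be treated as duplicates at all is a judgment call for your analysis.

```python
import pandas as pd

# Hypothetical example: rows 0 and 1 have identical measurements but different Ids
demo = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLengthCm": [4.9, 4.9, 5.8],
    "SepalWidthCm": [3.1, 3.1, 2.7],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

# Full-row comparison: the unique Id makes every row distinct
full_row_dupes = demo.duplicated().sum()

# Compare on every column except Id
measurement_cols = [col for col in demo.columns if col != "Id"]
measurement_dupes = demo.duplicated(subset=measurement_cols).sum()

print(full_row_dupes, measurement_dupes)
```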

In [11]:
# Statistical summary of all numerical columns in the dataset
iris_data.describe()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.054,3.758667,1.198667
std,43.445368,0.828066,0.433594,1.76442,0.763161
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


### 🔥 Manual Assessments 
- Export the data to an Excel sheet and inspect it by eye, or look through it manually in Google Sheets.

In [12]:
# Export the data to an Excel sheet (writing .xlsx needs an Excel engine such as openpyxl)
with pd.ExcelWriter('iris_dataset.xlsx') as writer:
    iris_data.to_excel(writer, sheet_name='iris_data')

<h2><span style="color: red;"> 🛑 Step 2 Document </span></h2> 
- Summary (**refer to readme.md**) <br>
- Collect the issues found in the dataset and document them together.<br>

- **The Iris Flower Dataset**
- The Iris dataset is a classic dataset in the field of machine learning and statistics, first introduced by the British biologist and statistician Ronald A. Fisher in 1936. It is often used for pattern recognition and classification tasks.
- **Dataset Overview**: Contains information about 150 iris flowers, categorized into three species: Setosa, Versicolor, and Virginica.
- **Species Distribution**: The dataset is evenly distributed, with 50 samples from each species.
- **Flower Measurements**: The dataset includes four features of the iris flowers: sepal length, sepal width, petal length, and petal width, measured in centimeters. These measurements provide valuable information for differentiating between the three species of iris flowers.
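
The species distribution claim is easy to check programmatically. Here is a minimal sketch using `value_counts()` on a tiny hypothetical DataFrame (`demo`, with 2 rows per species instead of 50). On the real data the same call on `iris_data['Species']` should report 50 for each species if the summary above holds.

```python
import pandas as pd

# Hypothetical miniature of the Species column: 2 rows per species
demo = pd.DataFrame({
    "Species": ["Iris-setosa"] * 2 + ["Iris-versicolor"] * 2 + ["Iris-virginica"] * 2
})

# Number of rows per species
counts = demo["Species"].value_counts()

# The classes are balanced when every species has the same count
is_balanced = counts.nunique() == 1

print(counts)
print(is_balanced)
```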

## 🌟 Step 3
<h2><span style="color: yellow;">Data Cleaning or Data Quality Dimensions✨</span></h2> 

In [13]:
# Making a copy of dataset for analysis
iris_data_df = iris_data.copy()
iris_data_df

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


`Always create a copy of your pandas DataFrame before you start the cleaning process.`
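
To see why the copy matters, here is a small sketch with hypothetical data (`original` and `working` are illustrative names, not part of the notebook). `DataFrame.copy()` makes a deep copy by default, so changes to the working copy leave the original untouched.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the raw data
original = pd.DataFrame({"Id": [1, 2], "PetalWidthCm": [0.2, 0.4]})

# copy() is deep by default: the data is duplicated, not shared
working = original.copy()

# Modify only the working copy
working["PetalWidthCm"] = working["PetalWidthCm"] * 10
working = working.drop(columns=["Id"])

print(original)  # still has the Id column and the original values
print(working)
```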

In [15]:
# Drop Duplicates
iris_data_df = iris_data_df.drop_duplicates()

In [16]:
# Drop Irrelevant Columns
iris_data_df = iris_data_df.drop(columns=['Id'])

## 🌟 Step 4
<h2><span style="color: yellow;">Feature Engineering ✨</span></h2> 

In [17]:
# Creating Petal Area and Sepal Area
iris_data_df['PetalArea'] = iris_data_df['PetalLengthCm'] * iris_data_df['PetalWidthCm']
iris_data_df['SepalArea'] = iris_data_df['SepalLengthCm'] * iris_data_df['SepalWidthCm']

In [19]:
# Creating Petal to Sepal Ratio features
iris_data_df['PetalLengthToSepalLength'] = iris_data_df['PetalLengthCm'] / iris_data_df['SepalLengthCm']
iris_data_df['PetalWidthToSepalWidth'] = iris_data_df['PetalWidthCm'] / iris_data_df['SepalWidthCm']

In [20]:
iris_data_df

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,PetalArea,SepalArea,PetalLengthToSepalLength,PetalWidthToSepalWidth
0,5.1,3.5,1.4,0.2,Iris-setosa,0.28,17.85,0.274510,0.057143
1,4.9,3.0,1.4,0.2,Iris-setosa,0.28,14.70,0.285714,0.066667
2,4.7,3.2,1.3,0.2,Iris-setosa,0.26,15.04,0.276596,0.062500
3,4.6,3.1,1.5,0.2,Iris-setosa,0.30,14.26,0.326087,0.064516
4,5.0,3.6,1.4,0.2,Iris-setosa,0.28,18.00,0.280000,0.055556
...,...,...,...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica,11.96,20.10,0.776119,0.766667
146,6.3,2.5,5.0,1.9,Iris-virginica,9.50,15.75,0.793651,0.760000
147,6.5,3.0,5.2,2.0,Iris-virginica,10.40,19.50,0.800000,0.666667
148,6.2,3.4,5.4,2.3,Iris-virginica,12.42,21.08,0.870968,0.676471
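
As a quick sanity check, the engineered values can be reproduced by hand for the first row of the table above (5.1, 3.5, 1.4, 0.2). The sketch below rebuilds that single row in a hypothetical DataFrame called `row` and applies the same formulas. Note that length times width is the area of the bounding rectangle, so `PetalArea` and `SepalArea` are rough size proxies rather than true surface areas.

```python
import pandas as pd

# First row of the dataset, values copied from the table above
row = pd.DataFrame({
    "SepalLengthCm": [5.1],
    "SepalWidthCm": [3.5],
    "PetalLengthCm": [1.4],
    "PetalWidthCm": [0.2],
})

# Same formulas as in the feature engineering cells
row["PetalArea"] = row["PetalLengthCm"] * row["PetalWidthCm"]
row["SepalArea"] = row["SepalLengthCm"] * row["SepalWidthCm"]
row["PetalLengthToSepalLength"] = row["PetalLengthCm"] / row["SepalLengthCm"]
row["PetalWidthToSepalWidth"] = row["PetalWidthCm"] / row["SepalWidthCm"]

# Round to 6 decimals to match the notebook display
print(row.round(6).iloc[0])
```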


In [21]:
# Write the DataFrame to a CSV file
iris_data_df.to_csv('iris_cleaned_dataset.csv', index=False)
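
The `index=False` argument keeps the row index out of the file, so reading the CSV back does not produce an extra `Unnamed: 0` column. A minimal sketch of that round trip follows, using an in-memory buffer and a hypothetical two-row DataFrame (`demo`) instead of writing to disk.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the cleaned dataset
demo = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9],
    "Species": ["Iris-setosa", "Iris-setosa"],
})

# Write to an in-memory text buffer instead of a file
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))
```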