## Lab 4
### Task 1: Data Inspection and Missing Value Handling 
- Inspect the Dataset: Examine the dataset and identify which columns contain 
missing values. Report how many missing values are present in each column. 
- Handle Missing Values in Numeric Columns: Replace any missing values in the 
numeric columns (sepal_length, sepal_width, petal_length, and petal_width) 
with the average (mean) value of the respective column. 
- Handle Missing Values in Categorical Column: Impute any missing values in the 
species column by replacing them with the most frequent value in that column.

In [1]:
import pandas as pd
df = pd.read_csv("Iris.csv")
df

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


In [2]:
df.isnull().sum()

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64
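
Every count above is zero, so this copy of Iris.csv needs no imputation. For a dataset that does contain gaps, the two imputation steps from the task could be sketched as below. The frame `toy` and its values are hypothetical and exist only to make the example self-contained; the column names are assumed to match Iris.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps (not the real Iris data).
toy = pd.DataFrame({
    "SepalLengthCm": [5.0, np.nan, 7.0],
    "SepalWidthCm": [3.0, 3.4, np.nan],
    "PetalLengthCm": [1.4, 4.5, np.nan],
    "PetalWidthCm": [np.nan, 1.5, 2.5],
    "Species": ["Iris-setosa", None, "Iris-setosa"],
})

numeric_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

# Report only the columns that actually contain missing values.
missing_counts = toy.isnull().sum()
print(missing_counts[missing_counts > 0])

# Numeric columns: fill each gap with the mean of its own column.
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

# Categorical column: fill gaps with the most frequent value (the mode).
toy["Species"] = toy["Species"].fillna(toy["Species"].mode()[0])
```

Note that `mode()` returns a Series because ties are possible; taking element `[0]` picks the first of the tied values.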

### Task 2: Data Cleaning and Transformation 
- Remove Duplicate Entries: Check if there are any duplicate rows in the dataset and 
remove them if found, ensuring the dataset only contains unique entries. 
- Create a New Column by Modifying Existing Ones: Create a new column that 
calculates the petal area by multiplying the petal_length and petal_width 
columns. Add this column to the dataset. 
- Drop Rows with Any Remaining Missing Values: After handling missing data in 
the previous step, drop any rows that still contain missing values
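
As a rough sketch of the three steps above, the example below runs them on a small hypothetical frame (`toy` and its values are made up for illustration). One point worth checking first: when a dataset carries a unique `Id` column, no full row can ever be a duplicate, so the sketch also counts duplicates with `Id` ignored. Whether identical measurements from different flowers should be treated as duplicates is a judgment call, not something the task states.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 repeat the same measurements under
# different Id values, and the last row has a missing petal width.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "PetalLengthCm": [1.4, 1.4, 4.5, 5.0],
    "PetalWidthCm": [0.2, 0.2, 1.5, np.nan],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Full-row duplicates: always zero here because Id is unique.
full_row_dupes = int(toy.duplicated().sum())

# Duplicates when Id is ignored.
value_cols = [c for c in toy.columns if c != "Id"]
value_dupes = int(toy.duplicated(subset=value_cols).sum())
print(full_row_dupes, value_dupes)

# Keep the first occurrence of each repeated measurement row.
deduped = toy.drop_duplicates(subset=value_cols)

# Petal area as the product of petal length and petal width.
deduped = deduped.assign(
    petal_area=deduped["PetalLengthCm"] * deduped["PetalWidthCm"]
)

# dropna() returns a new frame, so the result has to be assigned.
cleaned = deduped.dropna()
```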

In [3]:
df['petal_area'] = df.PetalLengthCm * df.PetalWidthCm
df

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,petal_area
0,1,5.1,3.5,1.4,0.2,Iris-setosa,0.28
1,2,4.9,3.0,1.4,0.2,Iris-setosa,0.28
2,3,4.7,3.2,1.3,0.2,Iris-setosa,0.26
3,4,4.6,3.1,1.5,0.2,Iris-setosa,0.30
4,5,5.0,3.6,1.4,0.2,Iris-setosa,0.28
...,...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica,11.96
146,147,6.3,2.5,5.0,1.9,Iris-virginica,9.50
147,148,6.5,3.0,5.2,2.0,Iris-virginica,10.40
148,149,6.2,3.4,5.4,2.3,Iris-virginica,12.42


In [4]:
df = df.drop_duplicates()

In [5]:
# dropna() returns a new DataFrame, so assign the result to keep it.
df = df.dropna()
df

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,petal_area
0,1,5.1,3.5,1.4,0.2,Iris-setosa,0.28
1,2,4.9,3.0,1.4,0.2,Iris-setosa,0.28
2,3,4.7,3.2,1.3,0.2,Iris-setosa,0.26
3,4,4.6,3.1,1.5,0.2,Iris-setosa,0.30
4,5,5.0,3.6,1.4,0.2,Iris-setosa,0.28
...,...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica,11.96
146,147,6.3,2.5,5.0,1.9,Iris-virginica,9.50
147,148,6.5,3.0,5.2,2.0,Iris-virginica,10.40
148,149,6.2,3.4,5.4,2.3,Iris-virginica,12.42


### Task 3: Aggregation and Transformation
- Convert Categorical Data to Numeric: Convert the species column (which is
categorical) into numeric values by assigning each unique species a distinct integer
value.
- Aggregation: Calculate the mean of each numeric column (sepal_length,
sepal_width, petal_length, petal_width) grouped by the species of the flowers.
This will give you insights into the average measurements for each species.

In [6]:
df["species_code"] = pd.Categorical(df['Species']).codes
df

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,petal_area,species_code
0,1,5.1,3.5,1.4,0.2,Iris-setosa,0.28,0
1,2,4.9,3.0,1.4,0.2,Iris-setosa,0.28,0
2,3,4.7,3.2,1.3,0.2,Iris-setosa,0.26,0
3,4,4.6,3.1,1.5,0.2,Iris-setosa,0.30,0
4,5,5.0,3.6,1.4,0.2,Iris-setosa,0.28,0
...,...,...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica,11.96,2
146,147,6.3,2.5,5.0,1.9,Iris-virginica,9.50,2
147,148,6.5,3.0,5.2,2.0,Iris-virginica,10.40,2
148,149,6.2,3.4,5.4,2.3,Iris-virginica,12.42,2


In [7]:
# Mean of each measurement column, grouped by species.
numeric_col = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
agg_col = df.groupby('Species')[numeric_col].mean()
print(agg_col)

                 SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
Species                                                                  
Iris-setosa              5.006         3.418          1.464         0.244
Iris-versicolor          5.936         2.770          4.260         1.326
Iris-virginica           6.588         2.974          5.552         2.026
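
Integer codes are only useful if they can be traced back to the species they stand for. The sketch below, on a short hypothetical Series, shows one way to recover that mapping from `pd.Categorical`, which assigns codes by the sorted order of the category labels. The names `species`, `code_to_species`, and `decoded` are illustrative and not part of the lab.

```python
import pandas as pd

# Hypothetical sample of species labels, deliberately out of order.
species = pd.Series(
    ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]
)

cat = pd.Categorical(species)

# Codes index into cat.categories, which is sorted for string labels.
codes = cat.codes

# Lookup table from integer code back to species name.
code_to_species = dict(enumerate(cat.categories))
print(code_to_species)

# Decoding the codes should reproduce the original labels.
decoded = pd.Series(codes).map(code_to_species)
```

Keeping such a lookup table alongside the encoded column makes it straightforward to label results again after modelling or aggregation.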


### Task 4: Advanced Reshaping 
- Reshape the Data: Reshape the dataset from a wide format to a long format. The 
goal is to create a new version of the dataset where each row corresponds to a single 
measurement (sepal length, sepal width, petal length, or petal width) for each flower. 
You should also create a column that identifies the type of measurement. 
 

In [8]:
reshaped_df = pd.melt(
    df,
    id_vars=['Species'],
    value_vars=['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'],
    var_name='measurement_type',
    value_name='measurement_value',
)
reshaped_df

Unnamed: 0,Species,measurement_type,measurement_value
0,Iris-setosa,SepalLengthCm,5.1
1,Iris-setosa,SepalLengthCm,4.9
2,Iris-setosa,SepalLengthCm,4.7
3,Iris-setosa,SepalLengthCm,4.6
4,Iris-setosa,SepalLengthCm,5.0
...,...,...,...
595,Iris-virginica,PetalWidthCm,2.3
596,Iris-virginica,PetalWidthCm,1.9
597,Iris-virginica,PetalWidthCm,2.0
598,Iris-virginica,PetalWidthCm,2.3
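
With `Species` as the only identifier, the long table above cannot say which flower a given measurement came from. If that link matters, one option is to keep `Id` among the identifier columns. The sketch below does this on a two-row hypothetical frame and pivots back to wide format as a sanity check. The frame `toy` and its values are made up, and passing a list as `index` to `pivot` is assumed to be available (pandas 1.1 or later).

```python
import pandas as pd

# Hypothetical two-flower frame with two measurement columns.
toy = pd.DataFrame({
    "Id": [1, 2],
    "Species": ["Iris-setosa", "Iris-virginica"],
    "SepalLengthCm": [5.1, 6.2],
    "PetalLengthCm": [1.4, 5.4],
})

# Wide to long: one row per (flower, measurement type).
long_df = toy.melt(
    id_vars=["Id", "Species"],
    var_name="measurement_type",
    value_name="measurement_value",
)

# Long back to wide, which only works because Id identifies each flower.
wide_again = long_df.pivot(
    index=["Id", "Species"],
    columns="measurement_type",
    values="measurement_value",
).reset_index()

print(long_df)
print(wide_again)
```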
