<h1>Assignment 2</h1>
This assignment provides hands-on practice in data analysis with pandas, covering data loading, exploration, selection, filtering, manipulation, and grouping. The goal is to help students navigate and analyze data effectively, laying a solid foundation for more advanced data science tasks.

**Instructions:**
  - Do not remove comments such as `# Task 1.1`, `# Task 1.2`, etc. Provide your answers below these comments.
  - Use the provided variable names for your answers.
  - For questions involving pandas operations, do not use `for` loops.
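
The no-loops rule refers to pandas' vectorized style, where an operation is applied to a whole column at once. A minimal sketch on a small made-up DataFrame (the name `toy_df`, its columns, and its values are illustrative assumptions, not part of the assignment data):

```python
import pandas as pd

# Hypothetical toy data, not the assignment dataset
toy_df = pd.DataFrame({"length": [5.0, 6.0, 7.5], "width": [2.0, 3.0, 2.5]})

# Vectorized arithmetic: the multiplication is applied to every row at once
toy_df["area"] = toy_df["length"] * toy_df["width"]

# Vectorized filtering: a boolean mask selects the matching rows
wide_rows = toy_df[toy_df["width"] > 2.0]

print(toy_df)
print(wide_rows)
```

Neither step needs an explicit `for` loop, because the column-level operators handle the row-by-row work internally.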

## Task 1: Data Loading and Exploration (20 points)

1. **Loading Dataset**: Download the Iris dataset from [here](https://drive.google.com/file/d/13s4W8dEaeV_fyOZ4uPf-dLkF9smw4zdW/view?usp=sharing) and load it into a Pandas DataFrame named `iris_df`. (5pts)

2. **Displaying Data**: Print the first 10 rows of the DataFrame to get an initial overview of the dataset. (5pts)

3. **Checking Data Integrity**: Check the data types and null values in each column to ensure the dataset's completeness and consistency. (5pts)

4. **Summary Statistics**: Print the summary statistics for the numeric columns (mean, min, max, etc.) and for the categorical columns (count, unique, top, freq) providing a statistical snapshot of the dataset. (5pts)

In [1]:
import pandas as pd
# Task 1.1
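# Convert the Google Drive sharing link into a direct-download URL;
# the file ID is the second-to-last path segment of the sharing link.
url = 'https://drive.google.com/file/d/13s4W8dEaeV_fyOZ4uPf-dLkF9smw4zdW/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id=' + url.split('/')[-2]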
iris_df = pd.read_csv(path)

# Task 1.2
print(iris_df.head(10))

# Task 1.3
print(iris_df.dtypes)
print(iris_df.isnull().sum())

# Task 1.4
print(iris_df.describe(include='all'))



   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
5           5.4          3.9           1.7          0.4  setosa
6           4.6          3.4           1.4          0.3  setosa
7           5.0          3.4           1.5          0.2  setosa
8           4.4          2.9           1.4          0.2  setosa
9           4.9          3.1           1.5          0.1  setosa
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
        sepal_length  sepal_width  petal_length  petal_widt
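
Summary statistics can also be requested separately for numeric and non-numeric columns, which avoids the `NaN` entries that `include='all'` produces where a statistic does not apply. A minimal sketch, assuming a small made-up DataFrame (`sample_df` and its values are hypothetical stand-ins for `iris_df`):

```python
import pandas as pd

# Hypothetical sample data standing in for iris_df
sample_df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

# Default describe(): numeric columns only (count, mean, std, min, quartiles, max)
numeric_summary = sample_df.describe()

# Excluding numeric columns leaves the categorical summary (count, unique, top, freq)
categorical_summary = sample_df.describe(exclude="number")

print(numeric_summary)
print(categorical_summary)
```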

## Task 2: Data Selection and Filtering (20 points)

1. **Creating Subset**: Create a new DataFrame, `versicolor_df`, containing only rows where the species is 'versicolor'. (10pts)

2. **Column Selection**: Select and display only the 'sepal_length' and 'sepal_width' columns for the 'versicolor' species, narrowing down the focus to specific attributes. (5pts)

3. **Maximum Sepal Length**: Find the maximum sepal length for each species, offering insights into the range of sepal lengths across different species. (5pts)

In [None]:
# Task 2.1
versicolor_df = iris_df[iris_df['species'] == 'versicolor']
print(versicolor_df)

# Task 2.2
# Select the two sepal columns without overwriting versicolor_df
print(versicolor_df[['sepal_length', 'sepal_width']])

# Task 2.3
max_sepal_length = iris_df.groupby('species')['sepal_length'].max()
print(max_sepal_length)
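
Row filtering and column selection can also be combined in a single `.loc` call, which leaves the source DataFrame untouched. A minimal sketch on made-up data (`sample_df`, `is_versicolor`, and `versicolor_sepals` are hypothetical names used only for illustration):

```python
import pandas as pd

# Hypothetical sample data standing in for iris_df
sample_df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.4, 6.3],
    "sepal_width": [3.5, 3.2, 3.2, 3.3],
    "species": ["setosa", "versicolor", "versicolor", "virginica"],
})

# Boolean mask marking the rows to keep
is_versicolor = sample_df["species"] == "versicolor"

# .loc takes the row mask and the list of columns in one step
versicolor_sepals = sample_df.loc[is_versicolor, ["sepal_length", "sepal_width"]]

print(versicolor_sepals)
```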

## Task 3: Data Manipulation (30 points)

1. **Adding New Column**: Add a new column to `iris_df` named 'sepal_ratio', representing the ratio of 'sepal_length' to 'sepal_width'. (10pts)

2. **Sorting Data**: Sort `iris_df` based on the 'petal_length' column in descending order, aiding in identifying patterns related to petal lengths. (10pts)

3. **Replacing Value**: Replace the 'versicolor' species with 'Iris-versicolor' in the 'species' column. (10pts)

In [3]:
# Task 3.1
iris_df['sepal_ratio'] = iris_df['sepal_length'] / iris_df['sepal_width']
print(iris_df.head())

# Task 3.2
sorted_df = iris_df.sort_values(by='petal_length', ascending=False)
print(sorted_df.head())

# Task 3.3
iris_df['species'] = iris_df['species'].replace('versicolor', 'Iris-versicolor')
print(iris_df['species'].unique())


   sepal_length  sepal_width  petal_length  petal_width species  sepal_ratio
0           5.1          3.5           1.4          0.2  setosa     1.457143
1           4.9          3.0           1.4          0.2  setosa     1.633333
2           4.7          3.2           1.3          0.2  setosa     1.468750
3           4.6          3.1           1.5          0.2  setosa     1.483871
4           5.0          3.6           1.4          0.2  setosa     1.388889
     sepal_length  sepal_width  petal_length  petal_width    species  \
118           7.7          2.6           6.9          2.3  virginica   
122           7.7          2.8           6.7          2.0  virginica   
117           7.7          3.8           6.7          2.2  virginica   
105           7.6          3.0           6.6          2.1  virginica   
131           7.9          3.8           6.4          2.0  virginica   

     sepal_ratio  
118     2.961538  
122     2.750000  
117     2.026316  
105     2.533333  
131     2.
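
Both `sort_values` and `Series.replace` return new objects by default rather than modifying the original, which is why their results are assigned to a name in the cell above. A minimal sketch on made-up data (`sample_df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical sample data standing in for iris_df
sample_df = pd.DataFrame({
    "petal_length": [1.4, 6.9, 4.7],
    "species": ["setosa", "virginica", "versicolor"],
})

# Returns a new, sorted DataFrame; sample_df keeps its original row order
sorted_sample = sample_df.sort_values(by="petal_length", ascending=False)

# Returns a new Series; the 'species' column in sample_df is unchanged
renamed = sample_df["species"].replace("versicolor", "Iris-versicolor")

print(sorted_sample)
print(sample_df)
print(renamed.tolist())
```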

## Task 4: Data Aggregation and Grouping (30 points)

1. **Grouping and Calculating Means**: Group `iris_df` by the 'species' column and calculate the mean for each numeric column within each group, providing insights into the average characteristics of each species. (10pts)

2. **Counting Species Occurrences**: Count the number of occurrences of each species in the original `iris_df`, giving an overview of the dataset's species distribution. (10pts)

3. **Identifying Highest Averages**: Find the species with the highest average 'petal_length' and 'sepal_length', offering insights into which species tends to have longer petals and sepals on average. (10pts)

In [2]:
# Task 4.1
grouped_means = iris_df.groupby('species').mean(numeric_only=True)
print(grouped_means)

# Task 4.2
species_counts = iris_df['species'].value_counts()
print(species_counts)

# Task 4.3
max_avg_petal_length_species = grouped_means['petal_length'].idxmax()
max_avg_sepal_length_species = grouped_means['sepal_length'].idxmax()

print("Species with the highest average petal length:", max_avg_petal_length_species)
print("Species with the highest average sepal length:", max_avg_sepal_length_species)

            sepal_length  sepal_width  petal_length  petal_width
species                                                         
setosa             5.006        3.428         1.462        0.246
versicolor         5.936        2.770         4.260        1.326
virginica          6.588        2.974         5.552        2.026
species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
Species with the highest average petal length: virginica
Species with the highest average sepal length: virginica
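
The same grouping pattern extends to several statistics at once through `agg`, and `idxmax` then returns the group label with the largest value. A minimal sketch on made-up data (`sample_df`, `petal_stats`, and the values are hypothetical and not taken from the Iris dataset):

```python
import pandas as pd

# Hypothetical sample data standing in for iris_df
sample_df = pd.DataFrame({
    "petal_length": [1.4, 1.6, 4.5, 4.7, 5.9, 6.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# One row per species, one column per requested statistic
petal_stats = sample_df.groupby("species")["petal_length"].agg(["mean", "min", "max"])

# idxmax returns the index label (here, the species) of the largest mean
longest_petals = petal_stats["mean"].idxmax()

print(petal_stats)
print(longest_petals)
```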
