# Introduction to NumPy & Pandas - Iris dataset

## 1. Import packages

In [44]:
import numpy as np
import pandas as pd

## 2. Import the Iris dataset into a Pandas DataFrame

In [45]:
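# Local path to the CSV file; adjust it to your own setup
url = 'C:/Users/annika/Nextcloud/Dokumente/_My_Workshops_Trainings/Github/Quickstart_Python_Jupyter/Datasets/Iris.csv'
# Alternative GitHub location. Note: this 'blob' URL points to an HTML page; pd.read_csv needs the raw file URL instead
#url = 'https://github.com/anolte-DSC/Python_for_Earth_Sciences/blob/main/Quickstart_Python_Jupyter/Datasets/Iris.csv'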
data = pd.read_csv(url)
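
If the CSV file is not at hand, the loading step can be tried on an in-memory string. The sketch below assumes a made-up three-row excerpt that reuses the column names of the Iris file; `csv_text` and `sample` are hypothetical names chosen for illustration.

```python
import io
import pandas as pd

# Made-up three-row excerpt using the same column names as Iris.csv
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
"""

# pd.read_csv accepts a path, a URL, or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # numeric columns are parsed automatically
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the file was parsed as expected.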

## 3. Explore the data

Display the first 5 rows of the dataset.

In [15]:
data.head(5) # note: indexing starts at 0!

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa


How many samples per species?

In [17]:
data['Species'].value_counts()

Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64
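
Relative frequencies are often more useful than absolute counts. A minimal sketch, assuming a hypothetical toy label series rather than the Iris data:

```python
import pandas as pd

# Hypothetical toy labels, not the Iris data
labels = pd.Series(["a", "b", "a", "c", "a", "b"], name="label")

counts = labels.value_counts()                # absolute counts, sorted descending
shares = labels.value_counts(normalize=True)  # relative frequencies summing to 1

print(counts)
print(shares)
```

Applied to the dataset above, `data['Species'].value_counts(normalize=True)` would report the share of each species.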

Display summary statistics of the numerical columns.

In [19]:
data.describe()

               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000
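
Summary statistics can also be computed per group. The following sketch uses a made-up two-group DataFrame (`toy`, `group`, and `length` are hypothetical names) to show the pattern:

```python
import pandas as pd

# Made-up measurements for two hypothetical groups
toy = pd.DataFrame({
    "group": ["x", "x", "y", "y"],
    "length": [1.0, 3.0, 10.0, 14.0],
})

# Mean and sample standard deviation of 'length' within each group
per_group = toy.groupby("group")["length"].agg(["mean", "std"])
print(per_group)
```

On the Iris DataFrame, the same pattern would read `data.groupby('Species')['PetalLengthCm'].agg(['mean', 'std'])`.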


## 4. Data manipulation

Check for missing values.

In [21]:
data.isnull().sum()

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64
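
The Iris dataset has no missing values, so the check returns zeros. To see how the same call behaves when values are missing, and two common ways to handle them, here is a sketch on a made-up DataFrame (all names are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing entry per column
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["u", "v", None],
})

missing_per_column = toy.isnull().sum()      # number of missing values per column
complete_rows = toy.dropna()                 # keep only rows without any missing value
filled = toy.fillna({"a": toy["a"].mean()})  # fill column 'a' with its mean (NaN is skipped)

print(missing_per_column)
print(complete_rows)
print(filled)
```

Whether to drop or fill depends on the analysis; neither is needed for the Iris data.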

Create a new feature as a function of other features.

In [46]:
data['PetalAreaCm2'] = data['PetalLengthCm'] * data['PetalWidthCm'] 
data.head(5)

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species  PetalAreaCm2
0   1            5.1           3.5            1.4           0.2  Iris-setosa          0.28
1   2            4.9           3.0            1.4           0.2  Iris-setosa          0.28
2   3            4.7           3.2            1.3           0.2  Iris-setosa          0.26
3   4            4.6           3.1            1.5           0.2  Iris-setosa          0.30
4   5            5.0           3.6            1.4           0.2  Iris-setosa          0.28
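
Column arithmetic in Pandas is element-wise, so no explicit loop is needed. As an alternative to assigning into the DataFrame in place, `DataFrame.assign` returns a new DataFrame. A small sketch with made-up values (`toy`, `area`, and `ratio` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({"length": [2.0, 4.0], "width": [0.5, 1.5]})

# assign returns a new DataFrame and leaves 'toy' unchanged
with_area = toy.assign(
    area=toy["length"] * toy["width"],         # element-wise product
    ratio=lambda d: d["length"] / d["width"],  # callables receive the DataFrame
)
print(with_area)
```

Leaving the original DataFrame untouched can make notebooks easier to re-run cell by cell.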


## 5. Data analysis with NumPy

Convert DataFrame to NumPy array.

In [56]:
# Factorize the species column to convert it into integer codes:
data_int = data.copy()
data_int['Species'], unique_species = pd.factorize(data['Species'])
species_mapping = dict(enumerate(unique_species)) # mapping of integer codes to species names
print("Species to integer mapping:", species_mapping)
# Get a NumPy array from the DataFrame (all columns except 'Id')
data_array = data_int.iloc[:, 1:].to_numpy()
data_array

Species to integer mapping: {0: 'Iris-setosa', 1: 'Iris-versicolor', 2: 'Iris-virginica'}


array([[ 5.1 ,  3.5 ,  1.4 ,  0.2 ,  0.  ,  0.28],
       [ 4.9 ,  3.  ,  1.4 ,  0.2 ,  0.  ,  0.28],
       [ 4.7 ,  3.2 ,  1.3 ,  0.2 ,  0.  ,  0.26],
       [ 4.6 ,  3.1 ,  1.5 ,  0.2 ,  0.  ,  0.3 ],
       [ 5.  ,  3.6 ,  1.4 ,  0.2 ,  0.  ,  0.28],
       [ 5.4 ,  3.9 ,  1.7 ,  0.4 ,  0.  ,  0.68],
       [ 4.6 ,  3.4 ,  1.4 ,  0.3 ,  0.  ,  0.42],
       [ 5.  ,  3.4 ,  1.5 ,  0.2 ,  0.  ,  0.3 ],
       [ 4.4 ,  2.9 ,  1.4 ,  0.2 ,  0.  ,  0.28],
       [ 4.9 ,  3.1 ,  1.5 ,  0.1 ,  0.  ,  0.15],
       [ 5.4 ,  3.7 ,  1.5 ,  0.2 ,  0.  ,  0.3 ],
       [ 4.8 ,  3.4 ,  1.6 ,  0.2 ,  0.  ,  0.32],
       [ 4.8 ,  3.  ,  1.4 ,  0.1 ,  0.  ,  0.14],
       [ 4.3 ,  3.  ,  1.1 ,  0.1 ,  0.  ,  0.11],
       [ 5.8 ,  4.  ,  1.2 ,  0.2 ,  0.  ,  0.24],
       [ 5.7 ,  4.4 ,  1.5 ,  0.4 ,  0.  ,  0.6 ],
       [ 5.4 ,  3.9 ,  1.3 ,  0.4 ,  0.  ,  0.52],
       [ 5.1 ,  3.5 ,  1.4 ,  0.3 ,  0.  ,  0.42],
       [ 5.7 ,  3.8 ,  1.7 ,  0.3 ,  0.  ,  0.51],
       ...
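
To make the encoding step easier to follow, here is a sketch of `pd.factorize` on a made-up label series, including how to decode the integers again (all variable names are hypothetical):

```python
import pandas as pd

kinds = pd.Series(["b", "a", "b", "c"])  # made-up labels

codes, uniques = pd.factorize(kinds)
# codes: one integer per element, numbered in order of first appearance
# uniques: the distinct labels, positioned by their code
code_to_label = dict(enumerate(uniques))
label_to_code = {label: code for code, label in code_to_label.items()}

# Decoding: take the label at each code position to recover the original values
decoded = uniques.take(codes)

print(codes)
print(code_to_label)
print(list(decoded))
```

Because a NumPy array holds a single data type, encoding the string column as numbers is what allows all columns to share one float array.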

Similar calculations are possible with NumPy as with Pandas, for example counting the samples per species:

In [57]:
species = data_array[:, -2]  # select the column with the species
unique_species, counts = np.unique(species, return_counts=True)
print("Species and their counts:", dict(zip(unique_species, counts)))

Species and their counts: {0.0: 50, 1.0: 50, 2.0: 50}
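
The keys above are floats because the whole array has a float data type. A sketch on made-up codes shows how the NumPy result can be cast back to integers and compared with the Pandas equivalent (variable names are hypothetical):

```python
import numpy as np
import pandas as pd

codes = np.array([0.0, 2.0, 1.0, 0.0, 2.0, 0.0])  # made-up integer codes stored as floats

# NumPy: sorted unique values and how often each occurs
values, counts = np.unique(codes, return_counts=True)
numpy_counts = {int(v): int(c) for v, c in zip(values, counts)}

# Pandas: the same counts, sorted by value for comparison
pandas_counts = pd.Series(codes).value_counts().sort_index()

print(numpy_counts)
print(pandas_counts)
```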


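Compute the mean and standard deviation of each column. The fifth column holds the integer species codes, so its values (mean 1.0, standard deviation 0.82) carry no physical meaning. Note also that np.std computes the population standard deviation (ddof=0) by default, whereas data.describe() reports the sample standard deviation (ddof=1).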

In [58]:
mean_values = np.mean(data_array, axis=0)
std_deviation = np.std(data_array, axis=0)
mean_values = mean_values.round(2)
std_deviation = std_deviation.round(2)
print("Mean of each feature:", mean_values)
print("Standard deviation of each feature:", std_deviation)

Mean of each feature: [5.84 3.05 3.76 1.2  1.   5.79]
Standard deviation of each feature: [0.83 0.43 1.76 0.76 0.82 4.7 ]
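
NumPy and Pandas use different defaults for the standard deviation, which can lead to small differences between the two libraries. A sketch on made-up values illustrates this (variable names are hypothetical):

```python
import numpy as np
import pandas as pd

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up values

population_std = np.std(x)       # NumPy default ddof=0: divide by n
sample_std = np.std(x, ddof=1)   # ddof=1: divide by n - 1
pandas_std = pd.Series(x).std()  # Pandas default is ddof=1

print(population_std)
print(sample_std)
print(pandas_std)
```

For 150 samples the difference is small, but it grows as the sample size shrinks.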


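Calculate the Pearson correlation between all columns. Note that the fifth column holds the integer species codes, a categorical variable, so its correlation coefficients should be interpreted with caution.

In [60]:
correlation_matrix = np.corrcoef(data_array.T)  # np.corrcoef expects variables (features) as rows and observations as columns, hence the transpose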
print("Correlation matrix:\n", correlation_matrix)

Correlation matrix:
 [[ 1.         -0.10936925  0.87175416  0.81795363  0.78256123  0.85732586]
 [-0.10936925  1.         -0.4205161  -0.35654409 -0.4194462  -0.28061173]
 [ 0.87175416 -0.4205161   1.          0.9627571   0.94904254  0.95847224]
 [ 0.81795363 -0.35654409  0.9627571   1.          0.95646382  0.98022894]
 [ 0.78256123 -0.4194462   0.94904254  0.95646382  1.          0.95014238]
 [ 0.85732586 -0.28061173  0.95847224  0.98022894  0.95014238  1.        ]]
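
Instead of transposing, `np.corrcoef` can be told that variables are stored in columns via `rowvar=False`. A sketch on a made-up array (5 observations, 3 variables) suggests both forms give the same matrix:

```python
import numpy as np

# Made-up data: 5 observations (rows) of 3 variables (columns)
X = np.array([
    [1.0, 2.0, 9.0],
    [2.0, 4.1, 7.5],
    [3.0, 6.2, 6.0],
    [4.0, 7.9, 3.5],
    [5.0, 10.1, 1.0],
])

corr_transposed = np.corrcoef(X.T)          # variables as rows
corr_rowvar = np.corrcoef(X, rowvar=False)  # variables as columns

print(np.allclose(corr_transposed, corr_rowvar))
print(corr_rowvar.round(3))
```

The result is a symmetric matrix with ones on the diagonal, one row and one column per variable.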


Select data with a boolean mask.

In [69]:
filtered_data = data_array[data_array[:, 2] > 2.0] # Example: keep rows where petal length (third column) is greater than 2.0 cm
print('Full data sample size:', data_array.shape[0]) # .shape[0] is the number of rows (samples); .size would count all elements
print('Filtered data sample size:', filtered_data.shape[0])

Full data sample size: 150
Filtered data sample size: 100
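
The selection works through a boolean mask with one entry per row. The sketch below uses a made-up 3x2 array to show the mask itself, how to combine conditions, and the difference between `shape` and `size` (all names are hypothetical):

```python
import numpy as np

arr = np.array([
    [1.0, 10.0],
    [3.0, 20.0],
    [5.0, 30.0],
])

mask = arr[:, 0] > 2.0  # one boolean per row
selected = arr[mask]    # keeps the rows where the mask is True

# Conditions are combined element-wise with & and |, each wrapped in parentheses
both = arr[(arr[:, 0] > 2.0) & (arr[:, 1] < 30.0)]

print(mask)
print(selected.shape)  # (rows, columns)
print(selected.size)   # rows * columns
print(both)
```

Here `selected.shape[0]` gives the number of selected rows, while `selected.size` counts every element in the array.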
