# Introduction to NumPy & Pandas - Iris dataset
![Iris](Pictures/iris-machinelearning.png)
Figure from: https://www.datacamp.com/tutorial/machine-learning-in-r

## 1. Import packages

In [None]:
import numpy as np
import pandas as pd

## 2. Import Iris dataset using Pandas

Import the Iris dataset into a Pandas DataFrame. Pandas reads many file formats (CSV, Excel, JSON, and others) and is the usual choice for loading tabular data.

In [None]:
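# Remote copy (a GitHub 'blob' URL returns an HTML page, so pd.read_csv needs the raw file URL instead):
#url = 'https://github.com/anolte-DSC/Python_for_Earth_Sciences/blob/main/Quickstart_Python_Jupyter/Datasets/Iris.csv'
# Local copy; adjust this path to your own machine:
url = 'C:/Users/annika/Nextcloud/Dokumente/3_Trainings/C_Datasets/Iris/Iris.csv'
data_df = pd.read_csv(url)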
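
A quick check after loading is to look at the shape, the column types, and the first rows. The sketch below uses a small hand-made DataFrame in place of the CSV so that it runs on its own. The column names follow those used later in this notebook; the `Id` column, the name `sample_df`, and the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for Iris.csv (illustrative values, not the full dataset)
sample_df = pd.DataFrame({
    'Id': [1, 2, 3],
    'SepalLengthCm': [5.1, 7.0, 6.3],
    'SepalWidthCm': [3.5, 3.2, 3.3],
    'PetalLengthCm': [1.4, 4.7, 6.0],
    'PetalWidthCm': [0.2, 1.4, 2.5],
    'Species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
})

print(sample_df.shape)    # (number of rows, number of columns)
print(sample_df.dtypes)   # data type of each column
sample_df.head()          # first rows (displayed when run as the last line of a cell)
```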

## 3. Data in Pandas series vs. NumPy array

Extract only the petal length from the dataset. The result is a Pandas Series. A DataFrame consists of multiple Series that share one index.

In [None]:
petal_length_series = data_df['PetalLengthCm']
petal_length_series

Convert the Pandas Series to a NumPy array and compare the result. A Pandas Series is built on top of a NumPy array. Unlike a plain NumPy array, a Series carries an index, so values can be looked up by index label.

In [None]:
petal_length_array = petal_length_series.values
petal_length_array
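
The difference between label-based and positional access can be sketched with a tiny Series. The index labels `'a'`, `'b'`, `'c'` and the values are made up for this example; the Iris data itself uses the default integer index.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a named (string) index
lengths = pd.Series([1.4, 4.7, 6.0], index=['a', 'b', 'c'], name='PetalLengthCm')

by_label = lengths.loc['b']   # look up a value by its index label
arr = lengths.to_numpy()      # the values as a plain NumPy array
by_position = arr[1]          # a NumPy array is addressed by position only

print(by_label, by_position)
print(type(lengths), type(arr))
```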

## 4. The Iris dataset as a NumPy two-dimensional array

The Iris dataset, when represented as a two-dimensional NumPy array, is a matrix. However, a multidimensional NumPy array can also have more than two dimensions.

In [None]:
data_array = data_df.iloc[:, 1:].values
data_array
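
As a minimal sketch of what "dimensions" means here, the arrays below are built from made-up numbers; `ndim` gives the number of dimensions and `shape` the size along each of them. The last lines show that an array holding both numbers and strings, like the Iris table, can be stored with dtype `object`.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)     # two-dimensional: 2 rows x 3 columns
cube = np.arange(24).reshape(2, 3, 4)   # three-dimensional: two stacked 3 x 4 matrices

print(matrix.ndim, matrix.shape)
print(cube.ndim, cube.shape)

# An array that holds both numbers and strings can use dtype object
mixed = np.array([[5.1, 'Iris-setosa']], dtype=object)
print(mixed.dtype)
```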

## 5. Data selection

Select the first five samples from the Iris dataset. Note that indexing starts at 0. 

In [None]:
# Select from dataframe (.loc slices by label and includes the end label, so 0:4 gives five rows):
data_df.loc[0:4, :]
# or
data_df.head(5)

In [None]:
# Select from array:
data_array[0:5,:] 
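
Label-based and positional slicing treat the end of a slice differently, which is easy to overlook. The following sketch uses a hypothetical seven-row DataFrame to show that `.loc` includes the end label while `.iloc` (like NumPy slicing) excludes the end position.

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 11, 12, 13, 14, 15, 16]})  # default index 0..6

by_label = df.loc[0:5, :]      # label slice: the end label 5 is included
by_position = df.iloc[0:5, :]  # positional slice: the end position 5 is excluded

print(len(by_label), len(by_position))
```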

Select species Iris-versicolor.

In [None]:
# Select from dataframe:
data_df.loc[data_df['Species'] == 'Iris-versicolor',:]

In [None]:
# Select from array:
print(data_array[:, 4] == 'Iris-versicolor')
mask = (data_array[:, 4] == 'Iris-versicolor')
data_array[mask]  # Boolean-mask indexing returns a copy; only basic slices such as data_array[0:5] are views of the original array.
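
Whether a selection shares memory with the source array depends on how it is made: basic slicing returns a view, while boolean-mask indexing returns a copy. A small sketch with made-up numbers, using `np.shares_memory` to check:

```python
import numpy as np

original = np.array([1.0, 2.0, 3.0, 4.0])

view = original[0:2]              # basic slice: a view on the same memory
masked = original[original > 2]   # boolean mask: a new, independent array

print(np.shares_memory(original, view))
print(np.shares_memory(original, masked))

view[0] = 99.0      # also changes original[0]
masked[0] = -1.0    # leaves original untouched
print(original)
```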

## 6. Data analysis

How many samples per species?

In [None]:
# In dataframe:
data_df['Species'].value_counts()

In [None]:
# In array:
np.unique(data_array[:, 4], return_counts=True)  # column 4 holds the species
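
`np.unique` with `return_counts=True` returns two parallel arrays (the sorted unique values and their counts). One possible way to make that easier to read is to pair them up in a dictionary; the species labels below are a made-up toy sample.

```python
import numpy as np

species = np.array(['setosa', 'virginica', 'setosa', 'versicolor', 'setosa'])

labels, counts = np.unique(species, return_counts=True)

# Pair each unique label with its count
counts_by_label = {str(label): int(count) for label, count in zip(labels, counts)}
print(counts_by_label)
```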

A built-in Pandas function for basic descriptive statistics:

In [None]:
data_df.describe()

Summary statistics (mean, standard deviation, minimum, and maximum) for each feature within each species group, using the built-in Pandas functions `groupby` and `agg`:

In [None]:
grouped = data_df.groupby('Species')
summary = grouped.agg({
    'SepalLengthCm': ['mean', 'std', 'min', 'max'],
    'SepalWidthCm': ['mean', 'std', 'min', 'max'],
    'PetalLengthCm': ['mean', 'std', 'min', 'max'],
    'PetalWidthCm': ['mean', 'std', 'min', 'max']
})
print(summary)
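
When every feature gets the same statistics, the dictionary can be replaced by selecting the columns first and passing a single list to `agg`. This is a sketch on a hypothetical four-row table with made-up group labels and values, not the Iris data; the result has a two-level column index (feature, statistic).

```python
import pandas as pd

df = pd.DataFrame({
    'Species': ['a', 'a', 'b', 'b'],
    'PetalLengthCm': [1.0, 2.0, 4.0, 6.0],
    'PetalWidthCm': [0.2, 0.4, 1.0, 2.0],
})

features = ['PetalLengthCm', 'PetalWidthCm']
compact = df.groupby('Species')[features].agg(['mean', 'std', 'min', 'max'])
print(compact)

# Single values are addressed by (feature, statistic)
mean_b = compact.loc['b', ('PetalLengthCm', 'mean')]
print(mean_b)
```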

## 7. Data operations

Calculate the petal area from the other features and add it as a new feature.

In [None]:
# In dataframe:
data_df['PetalAreaCm2'] = data_df['PetalLengthCm'] * data_df['PetalWidthCm'] 
data_df.head(5)

In [None]:
# In array:
petal_area = data_array[:, 2] * data_array[:, 3]
data_array = np.hstack((data_array, petal_area.reshape(-1, 1)))
data_array[0:5,:] 

Some built-in NumPy functions for basic data operations:

In [None]:
# The mixed-type array has dtype object, so cast the petal-area column to float first
petal_area_values = data_array[:, 5].astype(float)
mean_value = np.round(np.mean(petal_area_values), 2)
std_deviation = np.round(np.std(petal_area_values), 2)  # population standard deviation (ddof=0)
print("Mean of petal area:", mean_value)
print("Standard deviation of petal area:", std_deviation)
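
Standard deviations from NumPy and Pandas can differ slightly for the same data, because `np.std` divides by n by default (`ddof=0`) while the Pandas `std` method divides by n - 1 (`ddof=1`). A short sketch with made-up values:

```python
import numpy as np
import pandas as pd

values = [1.0, 2.0, 3.0, 4.0]

population_std = np.std(values)          # ddof=0: divide by n
sample_std = np.std(values, ddof=1)      # ddof=1: divide by n - 1
pandas_std = pd.Series(values).std()     # Pandas uses ddof=1 by default

print(population_std, sample_std, pandas_std)
```

Passing `ddof=1` to `np.std` makes the NumPy result comparable to the `std` values reported by `describe` and `agg`.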

## 8. Visualizing data with Matplotlib

Pandas uses the Matplotlib plotting library for its built-in plotting methods, so DataFrame and Series data can be visualized directly. Matplotlib offers many plot types and customization options. The following is a histogram with default settings.

In [None]:
data_df['PetalLengthCm'].plot.hist()
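
As one possible customization, the plotting method returns a Matplotlib `Axes` object that can be used to set labels. The sketch below uses a few made-up petal lengths; the bin count, colors, and label texts are arbitrary choices. The `matplotlib.use('Agg')` line is an assumption so the example runs without a display and should be omitted inside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; leave this out in a Jupyter notebook
import pandas as pd

# Hypothetical sample of petal lengths
petal_lengths = pd.Series([1.4, 1.5, 4.5, 4.7, 5.1, 6.0], name='PetalLengthCm')

# Histogram with five bins, outlined bars, and a title
ax = petal_lengths.plot.hist(bins=5, edgecolor='black', title='Petal length')
ax.set_xlabel('Petal length (cm)')
ax.set_ylabel('Number of samples')
```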