<img src="../Images/DSC_Logo.png" style="width: 400px;">

# Python Libraries

Python relies on libraries: collections of pre-written code that provide useful functions and tools for various tasks. This notebook contains a quickstart to two fundamental Python libraries for data analysis: NumPy (`numpy`) and Pandas (`pandas`).

<img src="../Images/pandas_numpy.png" style="width: 300px;">

Data analysis with `pandas` works with tabular data using **DataFrames** (similar to Excel spreadsheets or database tables). A DataFrame is a 2-dimensional data structure with labeled axes (rows and columns). The axes are indexed for efficient data retrieval.

<img src="../Images/dataframe.png" style="width: 300px;">
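
As a minimal sketch of the structure described above, the following cell builds a small DataFrame by hand. The column names, row labels, and values are hypothetical and are not taken from the datasets used later in this notebook.

```python
import pandas as pd

# Hypothetical example data, not taken from the notebook's datasets
df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Cairo"], "temp_c": [4.5, 19.0, 28.5]},
    index=["a", "b", "c"],  # row labels
)

print(df.shape)                # (number of rows, number of columns)
print(df.columns)              # column labels
print(df.loc["b", "temp_c"])   # look up one value by row and column label
```

The last line shows why labeled axes are convenient: a single value is retrieved by its row and column label rather than by counting positions.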

`numpy` is the foundational library for numerical computing, supporting large and multi-dimensional **arrays** and vectorized operations. An array is a structure for storing elements of the same type. Arrays can be one-dimensional (like a list) or multi-dimensional (like a matrix).

<img src="../Images/array.png" style="width: 600px;">
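
The following sketch illustrates one- and two-dimensional arrays and a vectorized operation. The values are made up for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])          # one-dimensional array
m = np.array([[1, 2, 3], [4, 5, 6]])   # two-dimensional array (2 rows, 3 columns)

print(a.ndim, a.shape, a.dtype)
print(m.ndim, m.shape, m.dtype)

# Vectorized operation: applied to every element without a Python loop
doubled = a * 2
print(doubled)
```

Multiplying a plain Python list by 2 would repeat the list instead, which is one reason arrays are preferred for numerical work.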

## 1. Predefined functions

Python packages provide predefined functions for mathematical operations. Some examples:

In [None]:
import numpy as np
import statistics as stat
import math

In [None]:
x = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]

In [None]:
stat.mean(x)

In [None]:
np.mean(x)

In [None]:
stat.median(x)

In [None]:
stat.variance(x)

In [None]:
math.exp(3.2)

In [None]:
math.sqrt(3)

In [None]:
math.sin(2)
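
Functions with similar names in different libraries do not always use the same definition. As a small illustration, `stat.variance` computes the sample variance (dividing by n - 1), whereas `np.var` divides by n unless `ddof=1` is passed. The sketch below reuses the list `x` from above.

```python
import statistics as stat
import numpy as np

x = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]

sample_var = stat.variance(x)        # sample variance: divides by n - 1
population_var = np.var(x)           # NumPy default (ddof=0): divides by n
sample_var_np = np.var(x, ddof=1)    # ddof=1 gives the sample variance

print(sample_var, population_var, sample_var_np)
```

When results from two libraries differ slightly, checking defaults such as `ddof` is a good first step.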

## 2. NumPy & Pandas - Temperature time series
![sky](../Images/temperature.jpg)

*Image modified from Gerd Altmann, Pixabay*

**Original dataset:**

NOAA National Centers for Environmental Information: Climate at a Glance: Global Time Series [Data set]. https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series, retrieved on August 23, 2024.

## 2.1 NumPy

In [None]:
import numpy as np 

Import data as a 2D numpy array:

In [None]:
path = '../Datasets/NOAA_time_series.csv' # Relative path to dataset file
time_series = np.loadtxt(path, skiprows=5, delimiter=',')
print(time_series)
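
To see what `skiprows` does without opening the dataset file, the sketch below reads a small in-memory text instead. The file content is hypothetical: it assumes four metadata lines followed by one header line, which may differ from the actual layout of the NOAA file.

```python
import io
import numpy as np

# Hypothetical file content: four metadata lines, one header line, then data
text = """Title: example series
Units: Degrees Celsius
Base Period: example
Missing: -999
Year,Anomaly
2001,0.10
2002,0.20
2003,0.30
"""

# skiprows=5 skips the metadata and the header, so only numbers remain
demo = np.loadtxt(io.StringIO(text), skiprows=5, delimiter=",")
print(demo.shape)
print(demo)
```

`np.loadtxt` expects every remaining value to be numeric, which is why the header line has to be skipped as well.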

Data selection: Selecting the last 30 years of data from the time series corresponds to selecting the last 30 rows of the array (one row per year).

In [None]:
time_series[-30:, :]
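
The slicing syntax can be tried on a small array first. This is a sketch with made-up values; the first index selects rows and the second selects columns.

```python
import numpy as np

# Hypothetical 2D array: first column years, second column values
demo = np.array([[2001, 0.1], [2002, 0.2], [2003, 0.3], [2004, 0.4]])

last_two = demo[-2:, :]   # last two rows, all columns
years = demo[:, 0]        # all rows, first column only

print(last_two)
print(years)
```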

Rows can also be selected based on the values in a specific column, using boolean indexing:

In [None]:
mask = time_series[:, 0] > 1995
print(mask)

In [None]:
filtered_data = time_series[mask]
print(filtered_data)
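
Several conditions can be combined into one mask. The sketch below uses a small made-up array; the threshold values are arbitrary.

```python
import numpy as np

# Hypothetical 2D array: first column years, second column values
demo = np.array([[1994, 0.3], [1995, 0.4], [1996, 0.3], [1997, 0.5]])

# Combine conditions with & (and) or | (or); each condition needs parentheses
mask = (demo[:, 0] >= 1995) & (demo[:, 1] > 0.35)

print(mask)
print(demo[mask])
```

The parentheses are required because `&` binds more tightly than comparison operators such as `>=`.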

Mathematical operations:

In [None]:
print(np.mean(time_series[:,1]))
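
Many NumPy functions accept an `axis` argument to work along rows or columns. A short sketch on made-up values:

```python
import numpy as np

# Hypothetical 2D array: first column years, second column values
demo = np.array([[2001, 0.1], [2002, 0.2], [2003, 0.6]])

col_means = demo.mean(axis=0)                 # one mean per column
largest = demo[:, 1].max()                    # largest value in the second column
year_of_max = demo[demo[:, 1].argmax(), 0]    # year in the row where it occurs

print(col_means)
print(largest, year_of_max)
```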

**Exercise:** Convert the temperature unit from °C to Kelvin in the array.

Plot:

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.figure()
plt.plot(time_series[:, 0], time_series[:, 1])  # Plot Year vs. Anomaly
plt.xlabel('Year')
plt.ylabel('Anomaly (°C)')
plt.show()

## 2.2 Pandas

In [None]:
import pandas as pd

(Re)import the data into a DataFrame:

In [None]:
path = '../Datasets/NOAA_time_series.csv' # Relative path to dataset file
time_series = pd.read_csv(path, skiprows=4, delimiter=',')
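
Unlike `np.loadtxt`, `pd.read_csv` uses the first line it reads as column names, so one row fewer is skipped here. The sketch below shows this on a hypothetical in-memory file with four metadata lines and one header line; the actual layout of the NOAA file may differ.

```python
import io
import pandas as pd

# Hypothetical file content: four metadata lines, one header line, then data
text = """Title: example series
Units: Degrees Celsius
Base Period: example
Missing: -999
Year,Anomaly
2001,0.10
2002,0.20
2003,0.30
"""

# skiprows=4 skips only the metadata; the header line becomes the column names
demo_df = pd.read_csv(io.StringIO(text), skiprows=4)

print(demo_df.columns.tolist())
print(demo_df.dtypes)
```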

Investigate DataFrame structure:

In [None]:
print(time_series.head())

In [None]:
print(time_series.index)

In [None]:
print(time_series.columns)

In [None]:
print(time_series.dtypes)

Data selection:

In [None]:
print(time_series.loc[0:4,:])
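
Note that `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes the end position. A small sketch with a made-up DataFrame:

```python
import pandas as pd

# Hypothetical DataFrame with the default integer index 0..5
df = pd.DataFrame({"Year": [2001, 2002, 2003, 2004, 2005, 2006]})

by_label = df.loc[0:4, :]      # label-based: the end label 4 is included
by_position = df.iloc[0:4, :]  # position-based: the end position 4 is excluded

print(len(by_label), len(by_position))
```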

In [None]:
time_series_selection = time_series.loc[time_series['Year'] >= 1994,:]
print(time_series_selection)

Reset index:

In [None]:
time_series_selection.reset_index(drop=True, inplace=True)
print(time_series_selection)
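
An alternative to `inplace=True` is to take an explicit copy of the selection and reassign the result, which keeps the original DataFrame untouched. This is a sketch on made-up data, not the only way to do it.

```python
import pandas as pd

# Hypothetical data with the same column names as the time series
df = pd.DataFrame({"Year": [1993, 1994, 1995], "Anomaly": [0.2, 0.3, 0.4]})

# .copy() makes the selection independent of the original DataFrame
selection = df.loc[df["Year"] >= 1994, :].copy()
selection = selection.reset_index(drop=True)  # new index starts at 0 again

print(selection)
print(df.index.tolist())  # the original index is unchanged
```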

Mathematical operations:

In [None]:
print(time_series.describe())

In [None]:
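# Add the 273.15 offset that converts °C to Kelvin (shown to illustrate column arithmetic;
# an anomaly is a temperature difference and has the same numerical value in Kelvin)
time_series['Anomaly'] = time_series['Anomaly'] + 273.15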
print(time_series.head())

## 3. Pandas - Iris dataset
![Iris](../Images/Iris.png)

*Image modified from Steve Dorand, Pixabay*

**Original dataset:** https://www.kaggle.com/datasets/uciml/iris/data

In [None]:
import pandas as pd

Import the data into a DataFrame:

In [None]:
path = '../Datasets/Iris.csv' # relative path to dataset file
iris = pd.read_csv(path)

Investigate DataFrame structure and data:

In [None]:
print(iris.head())
print(iris.index)
print(iris.columns)
print(iris.dtypes)
print(iris.describe())

Data selection:

In [None]:
print(iris.loc[0:4,:])

In [None]:
print(iris.loc[iris['Species'] == 'Iris-versicolor'])

Data in Pandas DataFrame vs. NumPy array:

In [None]:
array = iris.values
print(array)
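
When a DataFrame mixes numbers and text, the resulting array has to use the generic `object` type, because an array stores elements of a single type. The sketch below uses a made-up two-column DataFrame and `to_numpy()`, which the pandas documentation recommends over `.values`.

```python
import pandas as pd

# Hypothetical DataFrame with one numeric and one text column
df = pd.DataFrame({"length": [1.4, 4.7], "species": ["setosa", "versicolor"]})

mixed = df.to_numpy()                 # mixed column types -> dtype object
numeric = df[["length"]].to_numpy()   # numeric columns only -> float64

print(mixed.dtype, numeric.dtype)
```

Selecting only the numeric columns before converting keeps the array suitable for fast numerical operations.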

Sort & reset index:

In [None]:
sorted_iris = iris.sort_values(by='PetalLengthCm', ascending=False)
print(sorted_iris.head(5))

In [None]:
sorted_iris.reset_index(drop=True, inplace=True)
print(sorted_iris.head(5))

Histogram plot showing the value distribution of petal length:

In [None]:
iris['PetalLengthCm'].plot.hist()

**Exercise:** Calculate the area of the petals by multiplying petal length and petal width (no need for a precise ellipse area), store it in a new column, and print the first rows of the DataFrame.