# Working with structured data - Part 1

This notebook accompanies the script <strong><span style="color:red;">05_pandas_part_A.pdf</span></strong>  and provides practical examples related to its content.

In [None]:
import pandas as pd

<hr style="border: none; height: 20px; background-color: green;">

# Pandas Data Structures: Series and DataFrames

A Pandas **Series** is a 1D array with an associated index, making it more flexible than a standard NumPy array


In [None]:
series = pd.Series([0.25, 0.5, 0.75, 1.0])
series

#### Each value in a Series has an associated index

Label-based indexing makes Pandas Series more powerful than NumPy arrays

In [None]:
series = pd.Series([0.25, 0.5, 0.75, 1.0], 
                   index=['a','b','c','d'])
series

In [None]:
print(series['b'])

## Duplicate Index Example

In Pandas, Series index labels do not have to be unique, meaning multiple entries can share the same label.   
This differs from dictionaries, where each key must be unique. Selecting a duplicated label returns all matching entries.


In [None]:
series = pd.Series([0.25, 0.5, 0.75, 1.0], 
                   index=['a','b','b','d'])
print(series)

In [None]:
print(series['b'])
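
The difference between a unique and a duplicated label shows up in the return type. A minimal sketch, assuming the same Series as above (the names `unique_hit` and `dup_hit` are illustrative only):

```python
import pandas as pd

series = pd.Series([0.25, 0.5, 0.75, 1.0],
                   index=['a', 'b', 'b', 'd'])

unique_hit = series['a']   # unique label -> a single scalar value
dup_hit = series['b']      # duplicated label -> a Series with both entries

print(type(unique_hit))
print(type(dup_hit))
print(series.index.is_unique)  # False, because 'b' appears twice
```

Checking `index.is_unique` first is one way to find out whether a label lookup can return more than one value.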

## Values and Index

Series values are stored as a NumPy array (`series.values`), allowing fast numerical operations

In [None]:
print(type(series.values))
print(series.values)

The index (`series.index`) is a Pandas Index object, which supports advanced indexing

In [None]:
print(type(series.index))
print(series.index)

## Changing Index

In Pandas, we can change the index of a Series without modifying the values.  
This allows for more flexible data access and better organization.

In [None]:
series = pd.Series([0.25, 0.5, 0.75, 1.0])
print("Before:")
print(series)

series.index = ['A','B','C','D']
print("\nAfter:")
print(series)

## Reset Index

- In Pandas, we can reset the index of a Series to remove custom labels and replace them with the default integer index.  
- Useful for handling duplicate index labels or converting a labeled Series back to a simple list-like structure.  
- If `drop=False` (default), the old index is kept as a new column, so the result is a DataFrame instead of a Series.

In [None]:
# Creating a Series with duplicate index labels
series = pd.Series([0.25, 0.5, 0.75, 1.0], 
                 index=['a','b','b','d'])

print("Before resetting index:")
print(series)

# Resetting the index
series = series.reset_index(drop=True)
print("\nAfter resetting index:")
print(series)
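
The effect of the `drop` parameter can be compared side by side. A small sketch, assuming the same Series with duplicate labels (the names `kept` and `dropped` are illustrative):

```python
import pandas as pd

series = pd.Series([0.25, 0.5, 0.75, 1.0],
                   index=['a', 'b', 'b', 'd'])

kept = series.reset_index()              # drop=False: old labels become a column
dropped = series.reset_index(drop=True)  # old labels are discarded

print(type(kept))      # DataFrame with the old index as a column
print(kept)
print(type(dropped))   # still a Series, now with a default integer index
print(dropped)
```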

## iloc vs loc

Pandas provides two powerful indexing methods:
- Positional indexing with `.iloc` (excludes stop)
- Label-based indexing with `.loc` (includes stop)

In [None]:
series = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

# Accessing by POSITION using .iloc[]
print("Using .iloc[]:")
print(series.iloc[1])
print(series.iloc[1:3])

# Accessing by LABEL using .loc[]
print("\nUsing .loc[]:")
print(series.loc['b'])
print(series.loc['b':'c'])  # (includes 'c')

### **Caution:** Labels that are integers can cause confusion

`data.loc[start:stop]` is label-based indexing, meaning Pandas selects elements based on their explicit index labels, not their position

In [None]:
# Integer index
series = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
                 index=[1, 2, 9, 4, 3])

print(series.loc[1:3])

`data[start:stop]` without `.loc` or `.iloc`: here Pandas performs label-based slicing when the slice bounds are strings, but position-based slicing when they are integers

In [None]:
# Positional slicing
print(series[1:3])

In [None]:
# String index
series = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
                 index=['1', '2', '9', '4', '3'])

print(series['1':'3'])
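
Being explicit avoids the ambiguity. The following sketch, assuming the integer-labelled Series from above, contrasts the same slice bounds under `.loc` and `.iloc` (the names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

series = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
                   index=[1, 2, 9, 4, 3])

by_label = series.loc[1:3]      # from label 1 through label 3, stop included
by_position = series.iloc[1:3]  # positions 1 and 2 only, stop excluded

print(by_label)
print(by_position)
```

Because the labels are not sorted, the label-based slice runs from where label `1` sits to where label `3` sits, which here covers the whole Series.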

**Direct Access:**  
In older Pandas versions, `data[0]` could refer to either the first position or a label 0 (if present)  

With an integer index, `data[0]` is always a label lookup and raises a KeyError if 0 is not an index label; recent Pandas versions also deprecate or remove the positional fallback for non-integer indexes  

Future-proof code should always use `.iloc[0]` for positional indexing  

In [None]:
series = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[1, 2, 3, 4])

print(series[1])

try:
    print(series[0])
except KeyError as e:
    print(f"KeyError: index {e} not found")

print(series.iloc[0])

## Dictionary to Series

Pandas allows you to create a Series from a Python dictionary, where:  
- Keys become the index
- Values become the data


In [None]:
population_dict = {
    'California': 38332521,
    'Texas': 26448193,
    'New York': 19651127,
    'Florida': 19552860,
    'Illinois': 12882135
}

# Convert dictionary to a Series
population = pd.Series(population_dict)
population

## Sorting

`sort_index()` Sorts the Series by its index labels in ascending order by default

In [None]:
population.sort_index()

`sort_values()` Sorts the Series by its values, arranging the data from smallest to largest (or vice versa with `ascending=False`)

In [None]:
population.sort_values()
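
Both methods return a new, sorted Series and leave the original untouched. A short sketch of the descending variants, assuming the same `population` data (the names `largest_first` and `reverse_alpha` are illustrative):

```python
import pandas as pd

population = pd.Series({
    'California': 38332521,
    'Texas': 26448193,
    'New York': 19651127,
    'Florida': 19552860,
    'Illinois': 12882135
})

largest_first = population.sort_values(ascending=False)  # largest value first
reverse_alpha = population.sort_index(ascending=False)   # labels from Z to A

print(largest_first)
print(reverse_alpha)
print(population)  # original order is unchanged
```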

## Exploring Data


#### `.info()`

Provides an overview of the Series, including:
- Data type (dtype)
- Number of non-null values
- Name of the Series
- Memory usage


In [None]:
series = pd.Series([10, 50, 30, None, 50, 40, 10], 
                   dtype="Int16", 
                   name="my_series")

# Display basic info
series.info()

#### `.describe()`   

Provides statistical insights into numerical data

- `count` → Number of non-null values.
- `mean` → Average of the values
- `std` → Standard deviation (spread of data)
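- `min`, `max`, and the quartiles (`25%`, `50%`, `75%`)
- These statistics are reported for numerical data; for non-numeric data, `.describe()` reports `count`, `unique`, `top`, and `freq` instead (use `.value_counts()` for full frequency counts)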



In [None]:
# Summary statistics
series.describe()
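
The result of `.describe()` is itself a Series, so individual statistics can be read by label. A minimal sketch, assuming the same `my_series` data (the name `summary` is illustrative):

```python
import pandas as pd

series = pd.Series([10, 50, 30, None, 50, 40, 10],
                   dtype="Int16",
                   name="my_series")

summary = series.describe()

print(summary["count"])               # counts non-null values only
print(summary["50%"])                 # the median
print(summary.loc[["min", "max"]])    # several statistics at once
```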

#### `.unique()`  

Returns an array of all unique values in the Series, preserving the original order of appearance.

In [None]:
# Lists all unique values
series.unique()

#### `.value_counts()`  

Counts the occurrences of each unique value in the Series and returns a sorted Series (by frequency)

In [None]:
# Counts occurrences of each value
series.value_counts()
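
By default, missing values are left out of the counts. The sketch below, assuming the same `my_series` data, uses the `dropna` and `normalize` parameters of `value_counts()` to include them or to return relative frequencies (variable names are illustrative):

```python
import pandas as pd

series = pd.Series([10, 50, 30, None, 50, 40, 10],
                   dtype="Int16",
                   name="my_series")

counts = series.value_counts()                      # missing value excluded
counts_with_na = series.value_counts(dropna=False)  # missing value counted too
shares = series.value_counts(normalize=True)        # fractions instead of counts

print(counts)
print(counts_with_na)
print(shares)
```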

#### Detecting Missing Values

Missing values (NaN - Not a Number) are common in real-world data, and Pandas provides built-in tools to handle them efficiently.   
Use `.isna()` or `.notna()` to check for missing values

Later, we will explore different strategies for detecting, filling, and removing missing values in more detail


In [None]:
import numpy as np

series = pd.Series([10, np.nan, 30, None, 50])

# Detect missing values
print(series.isna())
print(series.notna())
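
Because the result of `.isna()` and `.notna()` is a boolean Series, it can be summed or used as a mask. A small sketch, assuming the same data (the names `n_missing` and `present` are illustrative):

```python
import numpy as np
import pandas as pd

series = pd.Series([10, np.nan, 30, None, 50])

n_missing = series.isna().sum()    # True counts as 1, so this counts missing values
present = series[series.notna()]   # keep only the entries that are not missing

print(n_missing)
print(present)
```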

<hr style="border: none; height: 10px; background-color: orange;">

# Pandas DataFrames

#### We can create a DataFrame by combining multiple series objects

In [None]:
# Creating two Pandas Series
population = pd.Series({
    'California': 38332521,
    'Texas': 26448193,
    'New York': 19651127,
    'Florida': 19552860,
    'Illinois': 12882135
})

area = pd.Series({
    'California': 423967,
    'Texas': 695662,
    'New York': 141297,
    'Florida': 170312,
    'Illinois': 149995
})

# Creating a DataFrame using the two Series
states = pd.DataFrame({
    'population': population,
    'area': area
})

states

#### Dictionary of Lists

Each key represents a column, and values are stored in lists

In [None]:
# Creating a DataFrame from a dictionary of lists
states = pd.DataFrame(
    {
        'population': [38332521, 26448193, 19651127, 19552860, 12882135],
        'area': [423967, 695662, 141297, 170312, 149995]
    },
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois']
)
states

#### List of Dictionaries

Each dictionary represents a row, and keys act as column names

In [None]:
states = pd.DataFrame(
    [
        {"population": 38332521, "area": 423967},
        {"population": 26448193, "area": 695662},
        {"population": 19651127, "area": 141297},
        {"population": 19552860, "area": 170312},
        {"population": 12882135, "area": 149995}
    ],
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois']
)
states

#### Creating a DataFrame from a 2D NumPy array

This is useful when working with numerical data, as NumPy arrays are highly efficient for computation, and Pandas provides enhanced data manipulation tools

In [None]:
# Creating a 2D NumPy array
np_array = np.array([
    [0.1, 0.3],
    [0.4, 0.2],
    [0.5, 0.7]
])
print(np_array)

In [None]:
# Converting the NumPy array to a DataFrame
df = pd.DataFrame(
    np_array,
    columns=['foo', 'bar'],
    index=['a', 'b', 'c']
)
print(df)

#### DataFrame as a Dictionary of Series

For example, if we access the area column, a Pandas Series is returned

In [None]:
print(type(states["area"]))
states["area"]

#### What if some data are missing?

If some keys are missing from a dictionary, they will be filled in with `NaN`

In [None]:
pd.DataFrame([
    {'a': 1, 'b': 2},
    {'b': 3, 'c': 4}
])
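
The filled-in `NaN` values also affect the column data types. A brief sketch, assuming the same two dictionaries (the name `missing_per_column` is illustrative):

```python
import pandas as pd

df = pd.DataFrame([
    {'a': 1, 'b': 2},
    {'b': 3, 'c': 4}
])

missing_per_column = df.isna().sum()  # number of NaN entries in each column

print(df.dtypes)           # columns containing NaN become float, 'b' stays integer
print(missing_per_column)
```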

<hr style="border: none; height: 10px; background-color: orange;">

## DataFrame Attributes

Pandas DataFrames have attributes that provide important metadata about their structure.   
These attributes help us inspect column types, dimensions, and data values efficiently, without modifying the DataFrame

In [None]:
data = {
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [25, 32, 37, 29, 41],
    "salary": [52000, 64000, 72000, 58000, 81000],
    "score": [88.5, 92.0, np.nan, 79.5, 91.0]
}

df = pd.DataFrame(data)
df

#### `head(n)` and `tail(n)`

Show the first or last `n` rows.

In [None]:
df.head(3)

In [None]:
df.tail(2)

### Descriptive Statistics

#### `describe()`

Generate descriptive statistics for numeric columns.

In [None]:
df.describe()

#### `max()` and `min()`

Return maximum and minimum values.

In [None]:
df.max(numeric_only=True)

In [None]:
df.min(numeric_only=True)

#### `mean()` and `median()`

Return mean and median values.

In [None]:
df.mean(numeric_only=True)

In [None]:
df.median(numeric_only=True)

#### `std()`

Return the standard deviation.

In [None]:
df.std(numeric_only=True)

### Sampling Data

#### `sample(n)`

Return `n` random rows.

In [None]:
df.sample(2, random_state=1)

### Handling Missing Values

#### `dropna()`

Remove rows with missing values.


In [None]:
df.dropna()

### Data Types and Structure

#### `dtypes`

Show column data types.

In [None]:
df.dtypes

#### `columns`

List column names.

In [None]:
df.columns

#### `axes`

Return row and column labels.

In [None]:
df.axes

#### `ndim`

Number of dimensions.

In [None]:
df.ndim

#### `size`

Total number of elements.

In [None]:
df.size

#### `shape`

Return `(rows, columns)`.

In [None]:
df.shape

### Accessing Underlying Data

#### `values`

Return NumPy representation of the DataFrame.

In [None]:
df.values

In [None]:
type(df.values)

<hr style="border: none; height: 20px; background-color: green;">

# Selecting Data in Series and DataFrames

Data selection can be performed using different methods

## Selecting Data in `Series`

In [None]:
series = pd.Series([0.25, 0.5, 0.75, 1.0],
              index=["a", "b", "c", "d"])

In [None]:
series["a"]

In [None]:
series.a

#### Checking for Membership

Like a Dictionary, we can check if a key exists in a Series

In [None]:
print(f'"a" in series: {"a" in series}')
print(f'"x" in series: {"x" in series}')

#### Accessing Index Keys and Items

Like a Dictionary, a Pandas Series has keys (index labels) and values.   
You can retrieve all index labels using `.keys()` and extract all key-value pairs using `.items()`

In [None]:
series.keys()

In [None]:
list(series.items())

#### Selecting Items Like a 1D NumPy Array

Use `.iloc` for positional access. In older Pandas versions, `series[0]` could refer either to the first element by position or to an index label `0` if present.

In [None]:
print(series.iloc[0])  # First element

In [None]:
print(series.iloc[:2]) # First two elements

In [None]:
try:
    print(series[0])   # Label-based indexing (may fail)
except KeyError:
    print("KeyError: label 0 not found in index")

#### Label-based slicing

`series["b":"d"]` Includes the stop label

In [None]:
print(series["b":"d"])   # Slices using explicit index labels

#### Position-based slicing

`series.iloc[1:3]` Excludes the stop index 

In [None]:
print(series.iloc[1:3])  # Slices using integer positions

#### Reversing
`series[::-1]` Works regardless of label or position


In [None]:
print(series[::-1])      # Reverses the Series

#### Full slice

`series[:]` Selects all elements and returns a new Series object; use `series.copy()` when an independent copy of the data is needed


In [None]:
print(series[:])         # Selects all elements
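
When an independent copy is needed, `Series.copy()` makes that intent explicit. A minimal sketch, assuming the same Series (the name `independent` is illustrative):

```python
import pandas as pd

series = pd.Series([0.25, 0.5, 0.75, 1.0],
                   index=["a", "b", "c", "d"])

independent = series.copy()   # explicit copy of values and index
independent["a"] = 99.0       # modify the copy only

print(series["a"])       # the original value is unchanged
print(independent["a"])
```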

#### Boolean Masking
We can filter elements based on conditions

In [None]:
series[series > 0.3]

#### Fancy Indexing
Selecting multiple specific indices

In [None]:
series[["a", "c"]]

#### Modifying Elements
Series are mutable, meaning we can modify values

In [None]:
series["b"] = 0.6
series

#### Pandas DataFrame 

By default, Pandas assigns an integer index.  
If you need a custom index, you must specify it explicitly using the index parameter.

In [None]:
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4, 5, 6]
})
df

#### Selecting Columns
We can access columns in multiple ways

In [None]:
df["A"]          # Access column A

In [None]:
df[["A", "B"]]   # Access multiple columns

#### Using attribute-style access 

Only column names that are valid Python identifiers and do not clash with existing DataFrame attributes or methods can be referenced this way

In [None]:
df.A
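
Attribute-style access has limits worth seeing once. The sketch below uses a hypothetical DataFrame with a column named `count` (which collides with the `DataFrame.count` method) and a column name containing a space:

```python
import pandas as pd

# Hypothetical column names chosen to show where attribute access breaks down
df = pd.DataFrame({
    "A": [1, 2, 3],
    "count": [4, 5, 6],
    "my col": [7, 8, 9],
})

print(type(df.A))          # works: "A" is a valid identifier with no name clash
print(callable(df.count))  # True: this is the DataFrame method, not the column
print(df["count"])         # bracket access always returns the column
print(df["my col"])        # names with spaces need bracket access
```

Bracket access works for every column name, which is why it is the safer default.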

#### Selecting Rows

Using `.loc[]` (label-based) and `.iloc[]` (position-based)


In [None]:
df.loc[0]     # First row using label

In [None]:
df.iloc[0]    # First row using position

#### Selecting Rows & Columns Together

Using `.loc[]` (label-based indexing)

In [None]:
print(df.loc[0, "A"])     # First row, column A

In [None]:
print(df.loc[:, "A"])    # All rows, column A

Using `.iloc[]` (position-based indexing)

In [None]:
print(df.iloc[0, 0])     # First row, first column

#### Boolean Indexing
Selecting rows based on a condition

In [None]:
df[df["A"] > 1]

#### Using `.query()` for Selection
A more readable alternative for filtering


In [None]:
df.query("A > 1")
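
Conditions can be combined in both styles. A small sketch, assuming the same DataFrame, that compares a combined `.query()` expression with the equivalent boolean mask (the names `by_query` and `by_mask` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4, 5, 6]
})

by_query = df.query("A > 1 and B < 6")       # conditions written as one string
by_mask = df[(df["A"] > 1) & (df["B"] < 6)]  # same filter with boolean masks

print(by_query)
print(by_query.equals(by_mask))
```

With boolean masks, each condition needs its own parentheses because `&` binds more tightly than the comparison operators.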

#### Selecting Data with `.isin()`
Filtering based on multiple values

In [None]:
df[df["A"].isin([1, 3])]

#### `.filter()` for Column Selection
Useful for selecting columns by name patterns

In [None]:
# Selects all columns containing 'A'
print(df.filter(like="A", axis=1))

#### Some useful `.iloc[]` Patterns


In [None]:
print(df.iloc[:3])        # First 3 rows
print(df.iloc[:, :2])     # First 2 columns
print(df.iloc[[0, 2], [1]])  # Specific rows and columns

## Modifying DataFrames


In [None]:
states

#### Adding a New Column

We can create a new column density by dividing population by area

In [None]:
states["density"] = states["population"] / states["area"]
states

#### Modifying Values Using `.iloc[]`

We can modify a specific value by position using `.iloc[]` 

In [None]:
states.iloc[1, 0] = 27000000  # Update population of Texas
states

#### Updating Values Based on a Condition

Increase the population by 5% for all states where area is greater than 200,000

In [None]:
# First, we must cast population to float; otherwise, Pandas may raise a TypeError 
states["population"] = states["population"].astype(float)

states.loc[states["area"] > 200000, "population"] *= 1.05
states

In [None]:
pd.options.display.float_format = "{:.2f}".format
states

#### Transpose `states.T`

Flips the structure, making states the columns and variables the rows

In [None]:
states.T

<hr style="border: none; height: 20px; background-color: green;">

## Ufuncs in Series and DataFrames  

#### Ufuncs in Series
You can apply NumPy Ufuncs directly to Pandas Series.   
The result preserves the index of the Series.

In [None]:
s = pd.Series([1, 4, 9, 16, 25], 
              index=["a", "b", "c", "d", "f"])

np.sqrt(s)   # Element-wise square root

When an operation combines two Series, Pandas aligns them on their index labels and returns NaN wherever a label is missing in one of the Series

In [None]:
population = pd.Series({
    "New York": 19651127,
    "Florida": 19552860,
    "Illinois": 12882135,
    "California": 38332521,
    "Texas": 26448193,
})

area = pd.Series({
    "Texas": 695662,
    "Illinois": 149995,
    "Florida": 170312,
    "California": 423967,
})

np.divide(population, area)
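
The same alignment happens with the ordinary `/` operator. A short sketch, assuming the same two Series, that shows which label ends up as NaN and how to keep only the complete results (the names `density`, `missing`, and `complete` are illustrative):

```python
import pandas as pd

population = pd.Series({
    "New York": 19651127,
    "Florida": 19552860,
    "Illinois": 12882135,
    "California": 38332521,
    "Texas": 26448193,
})

area = pd.Series({
    "Texas": 695662,
    "Illinois": 149995,
    "Florida": 170312,
    "California": 423967,
})

density = population / area         # labels are matched, not positions
missing = density[density.isna()]   # labels present in only one Series
complete = density.dropna()         # results where both values existed

print(missing)
print(complete)
```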

#### Ufuncs in DataFrames

Ufuncs can also be applied to DataFrames

In [None]:
df = pd.DataFrame({
    "A": [1, 4, 9, 16, 25],
    "B": [2, 3, 5, 7, 11],
}, index=["a", "b", "c", "d", "f"])

np.sqrt(df)

In [None]:
df1 = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [10, 20, 30, 40],
    "C": [100, 200, 300, 400],
})

df2 = pd.DataFrame({
    "A": [1, 4, 8, 16],
    "B": [10, 40, 80, 160],
})

np.divide(df1, df2)

#### Using `.apply()` for Complex Functions

The `.apply()` method is used to apply a function to each row or column of a DataFrame when vectorized operations (NumPy ufuncs) are not possible

In [None]:
df

In [None]:
def col_mean(col):
    return col.mean()

df.apply(col_mean, axis=0) # apply to columns

In [None]:
def row_sum(row):
    return row["A"] + row["B"]

df.apply(row_sum, axis=1) # apply to rows
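
For simple cases like these two, a built-in or vectorized expression gives the same result without `.apply()`. A sketch for comparison, assuming the same DataFrame (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 4, 9, 16, 25],
    "B": [2, 3, 5, 7, 11],
}, index=["a", "b", "c", "d", "f"])

col_means_apply = df.apply(lambda col: col.mean(), axis=0)  # one call per column
col_means_builtin = df.mean()                               # built-in equivalent

row_sums_apply = df.apply(lambda row: row["A"] + row["B"], axis=1)  # one call per row
row_sums_vectorized = df["A"] + df["B"]                             # vectorized equivalent

print(col_means_builtin)
print(row_sums_vectorized)
```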

#### Apply a function to a single, specific column

In [None]:
def custom_function(x):
    return np.log(x) if x > 2 else x

df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [10, 20, 30, 40],
})

# apply() used on a Series df['A']
df["A_transformed"] = df["A"].apply(custom_function)
df

#### Using a lambda function directly in .apply()

It keeps the code compact and is ideal for **simple**, **one-time** transformations.  
**Rule of thumb:** If you can't understand a lambda function immediately, you should write a regular function instead

In [None]:
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [10, 20, 30, 40],
})

df["A_transformed"] = df["A"].apply(lambda x: np.log(x) if x > 2 else x)
df
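
A conditional transformation like this one can often be vectorized as well. The sketch below is one possible alternative using `np.where`, not a replacement prescribed by the notebook; the column names `A_apply` and `A_where` are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [10, 20, 30, 40],
})

# Element-wise version with a lambda
df["A_apply"] = df["A"].apply(lambda x: np.log(x) if x > 2 else x)

# Vectorized version: np.where picks from the second or third argument per element
df["A_where"] = np.where(df["A"] > 2, np.log(df["A"]), df["A"])

print(df)
```

Note that `np.where` evaluates `np.log` for every element before selecting, which is fine here because all values are positive.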

#### Apply a function row-wise

This allows for more flexible operations that depend on multiple values in a row

In [None]:
def custom_function(x):
    if x["A"] > 2:
        return x["A"] + x["B"]
    return x["A"]

df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [10, 20, 30, 40],
})

df["C"] = df.apply(custom_function, axis=1)
df

### NumPy vs. Pandas Built-in Functions

Pandas also provides built-in methods that rely on NumPy computations internally

In [None]:
# Equivalent approaches
print(df["A"].mean())      # Pandas built-in method
print(np.mean(df["A"]))    # NumPy function (a reduction, not a ufunc)
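
The two approaches differ once missing values are involved. A small sketch with a hypothetical Series `s` containing one `NaN`, comparing the Pandas method with NumPy functions applied to the raw array:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])  # hypothetical data with one missing value
raw = s.to_numpy()                 # plain NumPy array without the Pandas wrapper

pandas_mean = s.mean()         # Pandas skips NaN by default
numpy_mean = np.mean(raw)      # plain NumPy propagates NaN
numpy_nanmean = np.nanmean(raw)  # NumPy variant that ignores NaN

print(pandas_mean)
print(numpy_mean)
print(numpy_nanmean)
```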

#### Performance comparison

In [None]:
df = pd.DataFrame({
    "A": np.random.randint(low=0, high=1000, size=1_000_000),
    "B": np.random.randint(low=0, high=1000, size=1_000_000),
})

%timeit np.sqrt(df["A"])                 # Fast NumPy ufunc
%timeit df["A"].apply(np.sqrt)           # apply with NumPy ufunc
%timeit df["A"].apply(lambda x: x**0.5)  # Much slower


#### Pandas is built on NumPy

- By default, Pandas stores numeric data in NumPy arrays internally
- Many Pandas functions are wrappers around NumPy operations
- Vectorized, NumPy-based operations are usually much faster than element-wise Python loops

In [None]:
# A Pandas Series is basically a NumPy array with extra metadata
series = pd.Series([1, 2, 3])
type(series.values)

#### How do we align indexes in DataFrames?

In [None]:
df1 = pd.DataFrame(
    {
        "A": [1, 2, 3, 4],
        "B": [10, 20, 30, 40],
        "C": [100, 200, 300, 400],
    },
    index=[1, 2, 4, 5],
)
df1

In [None]:
df2 = pd.DataFrame({
    "A": [1, 4, 8, 16],
    "B": [10, 40, 80, 160],
})
df2

In [None]:
np.add(df1, df2)
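
The same label alignment applies to the `+` operator and to `DataFrame.add`. A sketch, assuming the two DataFrames above, in which `fill_value=0` treats a value missing on one side as zero (the names `total` and `filled` are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "A": [1, 2, 3, 4],
        "B": [10, 20, 30, 40],
        "C": [100, 200, 300, 400],
    },
    index=[1, 2, 4, 5],
)

df2 = pd.DataFrame({
    "A": [1, 4, 8, 16],
    "B": [10, 40, 80, 160],
})

total = df1 + df2                   # NaN wherever a row or column label is unmatched
filled = df1.add(df2, fill_value=0)  # a value missing on one side counts as 0

print(total)
print(filled)
```

Only the row labels 1 and 2 exist in both DataFrames, so only those rows have values in columns `A` and `B` of `total`.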

<hr style="border: none; height: 20px; background-color: green;">

## Recap – Direct Indexing 

In [None]:
df = pd.DataFrame({
    "col0": [10, 20, 30, 40],
    "col1": [50, 60, 70, 80],
    "col2": [90, 100, 110, 120],
}, index=["row0", "row1", "row2", "row3"])
df

In [None]:
df.loc['row1']     # Returns a row as a Series

In [None]:
df.iloc[1]          # Returns a row as a Series

In [None]:
df["col1"]          # Returns column "col1" as a Series

In [None]:
df.loc[:, ["col1"]] # Returns column "col1" as a DataFrame

In [None]:
df.iloc[:, [0]]     # Returns the first column as a DataFrame

In [None]:
df[["col1"]]        # Returns column "col1" as a DataFrame

In [None]:
series = pd.Series([100, 200, 300, 400],
              index=["item0", "item1", "item2", "item3"])

series

In [None]:
series.loc["item1"]    # Returns the element with label "item1"

In [None]:
series.iloc[1]         # Returns the second element

In [None]:
series["item1"]        # Returns the element with label "item1"

## Recap – Slicing (Range Query)

In [None]:
df = pd.DataFrame({
    "col0": [10, 20, 30, 40],
    "col1": [50, 60, 70, 80],
    "col2": [90, 100, 110, 120],
}, index=["row0", "row1", "row2", "row3"])

df

In [None]:
# Values from a single column
df.loc["row1":"row3", "col0"]   # Includes 'row3'

In [None]:
df.iloc[1:3, 0]   # Selects rows 1 and 2 from column 'col0'

In [None]:
# Get rows by label (inclusive)
df.loc["row1":"row2"]   # Includes 'row2'

In [None]:
# Get rows by position (exclusive end)
df.iloc[1:2]   # 'row2' is excluded

In [None]:
# Implicit slicing by label (inclusive)
df["row1":"row2"]   # Includes 'row2'

In [None]:
# Columns by label (inclusive)
df.loc[:, "col1":"col2"]   # Includes "col2"

In [None]:
# Columns by position (exclusive end)
df.iloc[:, 0:2]            # Columns at position 0 and 1, not 2

In [None]:
series = pd.Series(
    [100, 200, 300, 400],
    index=["row0", "row1", "row2", "row3"]
)

In [None]:
# Series slicing by label (inclusive)
series.loc["row1":"row3"]       # Includes "row3"

In [None]:
# Series slicing by position (exclusive end)
series.iloc[1:3]                # Position 3 is not included

In [None]:
# Implicit label slicing (inclusive)
series["row1":"row3"]           # "row3" is included

In [None]:
# Implicit positional slicing (exclusive end)
series[1:3]                     # Position 3 is not included