Copyright (c) 2025 aamirmd. All Rights Reserved.

This work is licensed under the MIT License. See LICENSE file for details.

# Pandas tutorial

Welcome!

Pre-requisites:
- Basic Python (variables, types, lists, indexing, dicts, etc.)
- NumPy (preferred)

## Installing and importing Pandas

Please view official documentation here: https://pandas.pydata.org/

In [None]:
!pip install pandas

This notebook can be run on Google Colab without installing Python on your computer.

Please follow instructions here: [Google Colab Instructions](https://medium.com/@jessica0greene/running-your-notebooks-in-the-cloud-with-google-colab-4387529bfad4)

In [None]:
import pandas as pd

## Series and Dataframes

- Series and DataFrames are the two major data structures in pandas.
- A Series can be thought of as a 1-dimensional labeled list or array.
- A DataFrame can be thought of as a 2-dimensional table.

In [None]:
# Creating a Series
s = pd.Series([1, 2, 3])
print("Series example:")
display(s)

# Creating a DataFrame from a dict of columns
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})
print("DataFrame example:")
display(df)
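The two structures are closely related: selecting a single column of a DataFrame gives back a Series. The following is a minimal sketch of that relationship; the names `table` and `col` are illustrative only and are not used elsewhere in this tutorial.

```python
import pandas as pd

# Small illustrative DataFrame (same values as the example above)
table = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9],
})

# Selecting a single column returns a Series
col = table['A']
print(type(col).__name__)   # Series

# shape is (rows, columns) for a DataFrame and (length,) for a Series
print(table.shape)          # (3, 3)
print(col.shape)            # (3,)

# Both structures carry an index; here it is the default 0..n-1
print(list(table.index))    # [0, 1, 2]
```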

## Indexing and Slicing

- Unlike regular Python lists, pandas data structures are indexed by their index labels rather than by position.
    * _Caveat_: As of this tutorial, Python-list-style (positional) indexing is still allowed with a warning.
- If no index is provided, as in the previous example, the Series/DataFrame uses the default index, which runs from 0 to n-1, where n is the number of elements.

In [None]:
# Indexing example
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print("Series: ")
display(s)
print(f"Element 'b': {s['b']}")

# Another indexing example (integer labels, not positions)
s = pd.Series(['d', 'e', 'f'], index=[215, 168, 900])
print("Series: ")
display(s)
print(f"Element 900: {s[900]}")

# Slicing example
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print("Slice from 'b' to 'd' (label-based, end label included):")
display(s['b':'d'])
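Because plain `[]` can be ambiguous when the labels are themselves integers, one common way to be explicit is to use `.loc` for labels and `.iloc` for positions. The snippet below is a small sketch of that idea; the variable name `letters` is made up for illustration.

```python
import pandas as pd

# Series whose integer labels are NOT the positions 0..n-1
letters = pd.Series(['d', 'e', 'f'], index=[215, 168, 900])

# .loc always looks up by label
print(letters.loc[900])    # f

# .iloc always looks up by position (0-based)
print(letters.iloc[0])     # d

# Asking .loc for a label that does not exist raises KeyError
try:
    letters.loc[0]
except KeyError:
    print("0 is not a label in this index")
```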

## Data loading

- Pandas can load data from many formats, such as:
    * csv
    * json
    * excel
- Of these, 'csv' is typically the most commonly used.

In [None]:
path_to_data = "data/iris.csv"
df = pd.read_csv(path_to_data)
df
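The readers for other formats follow the same pattern as `read_csv`. As a rough sketch that does not depend on any file on disk, the example below feeds made-up CSV and JSON text through `io.StringIO`; the data and the names `csv_text`, `json_text`, `df_csv`, and `df_json` are assumptions for illustration. `pd.read_excel` works along the same lines, but it needs an extra Excel engine (for example openpyxl) installed, so it is left out here.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = "name,score\nann,90\nbob,85\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# The same records as JSON (a list of objects)
json_text = '[{"name": "ann", "score": 90}, {"name": "bob", "score": 85}]'
df_json = pd.read_json(io.StringIO(json_text))

# Both readers return a DataFrame with the same columns
print(df_csv)
print(df_json)
print(df_csv.columns.tolist())
print(df_json.columns.tolist())
```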

Common operations to get a glimpse of the dataset.

In [None]:
# To get the first 4 rows
print("First 4 rows")
print(df.head(4))
print()

# To get the last 4 rows
print("Last 4 rows")
print(df.tail(4))
print()

# To get the names of the columns as an Index object
columns = df.columns
print("Columns: ", columns)
print()

Getting metadata and summary statistics of the data.

In [None]:
# To get metadata about the dataset
print("Metadata about dataset")
df.info()
print()

# To get statistics about columns
print("Summary statistics of columns/features")
df.describe()
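A few related attributes are often checked alongside `info()` and `describe()`. The sketch below uses a tiny made-up DataFrame (`sample`, not the iris file) so that the values are easy to verify by hand.

```python
import pandas as pd

# Hypothetical measurements, not taken from the iris file
sample = pd.DataFrame({
    'length': [5.0, 6.0, 7.0],
    'species': ['a', 'b', 'c'],
})

# Number of rows and columns
print(sample.shape)   # (3, 2)

# Data type of each column
print(sample.dtypes)

# describe() returns a DataFrame, so individual statistics can be looked up
stats = sample.describe()
print(stats.loc['mean', 'length'])   # 6.0

# include='all' also summarizes non-numeric columns
print(sample.describe(include='all'))
```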

## Data Selection and Indexing

In [None]:
path_to_data = "data/iris.csv"
df = pd.read_csv(path_to_data)
print(f"Columns are {df.columns.tolist()}")

- Some common ways to retrieve data from DataFrames are given below.

In [None]:
# To get a column as a Series object
column = df['sepal.length']
print("Sepal length column: ")
print(column)
print()

# To get a single element by row label and column label
item = df.loc[4, "petal.width"]
print(f"Item at row label 4 and column 'petal.width': {item}")
print()

# To get a single element using position-based indexing (like Python lists)
item = df.iloc[6, 3]
print(f"Element at row position 6 and column position 3 (0-based): {item}")
print()
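Beyond single columns and single elements, the same tools select several columns, filter rows by a condition, or take rectangular slices. Here is a minimal sketch on a hypothetical three-row DataFrame; the `name` column and all values are made up and are not claimed to match the iris file.

```python
import pandas as pd

# Hypothetical rows; only the dotted column names mirror the tutorial's data
flowers = pd.DataFrame({
    'sepal.length': [5.1, 7.0, 6.3],
    'petal.width': [0.2, 1.4, 2.5],
    'name': ['first', 'second', 'third'],
})

# Several columns at once: pass a list of names, get a DataFrame back
subset = flowers[['sepal.length', 'name']]
print(subset)

# Boolean filtering: keep only the rows where the condition is True
wide = flowers[flowers['petal.width'] > 1.0]
print(wide['name'].tolist())   # ['second', 'third']

# .loc with a row-label slice and a list of columns (end label included)
by_label = flowers.loc[0:1, ['petal.width', 'name']]
print(by_label)

# .iloc with position slices (end position excluded)
by_position = flowers.iloc[0:1, 0:2]
print(by_position)
```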

## Data Manipulation

- Pandas allows easy manipulation of columns including insertion and deletion.

In [None]:
path_to_data = "data/iris.csv"
df = pd.read_csv(path_to_data)
print(f"Columns are {df.columns.tolist()}")

In [None]:
# Create a new column called sepal.area = sepal.length * sepal.width
df['sepal.area'] = df['sepal.length'] * df['sepal.width']
print("Iris dataframe (part) after adding 'sepal.area' column")
print(df[["sepal.length", "sepal.width", "sepal.area"]])
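Deleting a column, and inserting one at a specific position, can be sketched in the same way. The example below uses a small made-up DataFrame (`frame`) and a hypothetical `row.id` column; `drop` and `insert` are standard DataFrame methods.

```python
import pandas as pd

# Two made-up rows with the same column names as above
frame = pd.DataFrame({
    'sepal.length': [5.1, 4.9],
    'sepal.width': [3.5, 3.0],
})
frame['sepal.area'] = frame['sepal.length'] * frame['sepal.width']

# drop() returns a new DataFrame without the column; the original is unchanged
trimmed = frame.drop(columns=['sepal.area'])
print(trimmed.columns.tolist())   # ['sepal.length', 'sepal.width']
print(frame.columns.tolist())     # ['sepal.length', 'sepal.width', 'sepal.area']

# insert() places a new column at a given position and modifies the DataFrame in place
frame.insert(0, 'row.id', [1, 2])
print(frame.columns.tolist())     # ['row.id', 'sepal.length', 'sepal.width', 'sepal.area']
```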

## Sorting Data in Pandas

Pandas provides powerful sorting capabilities through the `sort_values()` method. Here are the key features:

- Sort by one or multiple columns using the `by` parameter
- Control sort order with `ascending=True/False` (default is ascending)
- Handle missing values with `na_position='first'/'last'`
- Sort in place with `inplace=True`, or get back a new sorted DataFrame (the default)

Example syntax:
```python
df.sort_values(by='column_name')  # Sort by single column
df.sort_values(by=['col1', 'col2'])  # Sort by multiple columns
```

In [None]:
# Load the iris dataset
path_to_data = "data/iris.csv"
df = pd.read_csv(path_to_data)

# Sort the dataframe by sepal.length
df_sorted = df.sort_values(by='sepal.length')

print("Iris dataset sorted by sepal length:")
display(df_sorted)
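The other options listed for `sort_values()` can be tried on a small example. The following sketch uses a made-up `scores` DataFrame with one missing value so that the effect of `ascending` and `na_position` is visible; the data is purely illustrative.

```python
import pandas as pd

# Made-up data with one missing value in 'points'
scores = pd.DataFrame({
    'team': ['b', 'a', 'b', 'a'],
    'points': [3.0, None, 1.0, 2.0],
})

# Descending order; missing values go last by default
desc = scores.sort_values(by='points', ascending=False)
print(desc['points'].tolist())    # [3.0, 2.0, 1.0, nan]

# Ascending order with missing values placed first
na_first = scores.sort_values(by='points', na_position='first')
print(na_first.index.tolist())    # [1, 2, 3, 0]

# Multiple columns with a separate direction for each
multi = scores.sort_values(by=['team', 'points'], ascending=[True, False])
print(multi.index.tolist())       # [3, 1, 0, 2]
```

Note that sorting keeps the original index labels attached to each row, which is why the printed indices are out of order.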