# 🐼 Pandas in Python

**Created by**: [@adigasuhas](https://github.com/adigasuhas)  
**Contact**: suhasadiga@jncasr.ac.in

---

Welcome! 🙌  
This notebook is part of the **Python4MS** tutorial series, carefully crafted to teach **Python from scratch** — in a simple, clear, and beginner-friendly way.  
We’ll be using examples especially relevant to **Materials Science 🔬** to make learning both practical and engaging.

Python is one of the most popular and versatile programming languages today. Whether you're:

- analyzing data 📊
- automating repetitive tasks 🔁
- or running scientific simulations 🧮

— Python is becoming an essential tool in modern research.

---

## 📘 What You'll Learn in This Notebook

In this notebook, you'll explore:

- ✨ **How to perform data manipulation and analysis using Pandas**
- 📊 How to create a DataFrame and analyze data relevant to machine learning and materials science applications

---

> 📝 **Note**: This tutorial assumes **no prior programming experience**.  
> Each concept is introduced step-by-step with simple explanations, real-world analogies, and hands-on examples.

Let's get started! 🚀


## 🐼 Pandas — Your Data Analysis Superpower!

**Pandas** is a powerful software library written for the Python programming language, specifically designed for **data manipulation and analysis**.  
It offers versatile data structures and rich operations for working with:

- numerical tables 📊
- labeled data 📋
- time series data ⏳

The name **Pandas** is derived from *"panel data"*, an econometrics term referring to datasets that include observations over multiple time periods for the same individuals — and also serves as a playful reference to **Python Data Analysis**. 🐼

---

### 🧑‍💻 A Brief History

- 📅 **Wes McKinney** started developing what would later become Pandas during his time at AQR Capital (2007–2010).
- Pandas introduced into Python many features for working with **DataFrames** — a concept that was already well-established in the R programming language.
- Pandas is built on top of another core library called **NumPy**, which provides high-performance array operations.

---

✨ In short:  
Pandas makes it **ridiculously easy** to work with tabular data — something we’ll do a lot in **Materials Science** and **Machine Learning**! 🔬🤖



### 📦 Installing Pandas

Let's first install the **Pandas** library (if it's not already installed).


In [1]:
# Installing Pandas (pip leaves an existing installation untouched); -q keeps the output quiet
!pip install pandas -q

In [2]:
# 📦 Importing necessary libraries

import pandas as pd  # Pandas for data manipulation
import numpy as np    # NumPy for numerical operations

### 🏷️ Core Data Structures in Pandas

Pandas offers two primary classes for handling data:

1️⃣ **Series**  
A **one-dimensional** labeled array that can hold data of any type — numbers, strings, Python objects, etc.

2️⃣ **DataFrame**  
A **two-dimensional** labeled data structure that holds data in rows and columns — very similar to a spreadsheet or SQL table.

📊 These two classes form the foundation of almost everything you'll do with Pandas!


## 🛠️ Creating a Pandas Series

Let's create a simple **Series** containing the first 8 elements from the periodic table. 🔬

In [3]:
# Creating a Series of first 8 elements
elements_series = pd.Series(['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O'])

# Display the Series
elements_series


0     H
1    He
2    Li
3    Be
4     B
5     C
6     N
7     O
dtype: object
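
The numbers on the left of the output are the Series **index** (its labels). As a minimal sketch, assuming we would rather label each symbol by its atomic number (the variable name `symbols` and the choice of index are illustrative, not part of the notebook above):

```python
import pandas as pd

# Illustrative Series: element symbols labelled by atomic number instead of 0, 1, 2, ...
symbols = pd.Series(['H', 'He', 'Li', 'Be'], index=[1, 2, 3, 4], name='symbol')

# Label-based lookup: the entry whose index label is 3
print(symbols.loc[3])

# Position-based lookup: the first entry, whatever its label is
print(symbols.iloc[0])

# The labels themselves
print(symbols.index.tolist())
```

With the default index, labels and positions coincide; once you set your own index they differ, which is why Pandas offers both kinds of lookup.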

## 🧪 Creating a DataFrame of First 8 Elements with Their Atomic Number and Mass

In [4]:
# Defining element properties
elements = ['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O']
atomic_numbers = [1, 2, 3, 4, 5, 6, 7, 8]
atomic_masses = [1.008, 4.003, 7.000, 9.012, 10.81, 12.011, 14.077, 15.999]

# Creating a NumPy array
periodic_table_array = np.array([elements, atomic_numbers, atomic_masses])

# Checking dimensions
print(f"Dimensions of periodic_table_array: {periodic_table_array.shape}")  
# Output: (3, 8) => 3 rows (attributes), 8 columns (elements)

# Transposing to switch rows to columns: (8, 3)
periodic_table_array.T


Dimensions of periodic_table_array: (3, 8)


array([['H', '1', '1.008'],
       ['He', '2', '4.003'],
       ['Li', '3', '7.0'],
       ['Be', '4', '9.012'],
       ['B', '5', '10.81'],
       ['C', '6', '12.011'],
       ['N', '7', '14.077'],
       ['O', '8', '15.999']], dtype='<U32')

📌 **Note:**  
We apply `.T` (transpose) because we want **rows as elements** and **columns as attributes** (Element, Atomic Number, Atomic Mass).
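
A small sketch of what the transpose does, using a shortened, illustrative version of the lists above (the names `symbols` and `stacked` are assumptions made for this example):

```python
import numpy as np

symbols = ['H', 'He', 'Li']
atomic_numbers = [1, 2, 3]

# Each list becomes one ROW of the array: 2 attributes x 3 elements
stacked = np.array([symbols, atomic_numbers])
print(stacked.shape)        # (2, 3)

# Transposing swaps the axes: 3 elements x 2 attributes
print(stacked.T.shape)      # (3, 2)

# A NumPy array holds one data type, so the integers were converted to text
print(stacked.dtype.kind)   # 'U' means a Unicode string type
print(stacked.T[0])         # first element: symbol and atomic number, both as strings
```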

## 📊 Creating a DataFrame from the Periodic Table Array

In [5]:
# Creating a DataFrame using the transposed periodic table array
periodic_table_df = pd.DataFrame(periodic_table_array.T, columns=['Element', 'Atomic Number', 'Atomic Mass'])

# Display the DataFrame
periodic_table_df

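|   | Element | Atomic Number | Atomic Mass |
|---|---------|---------------|-------------|
| 0 | H | 1 | 1.008 |
| 1 | He | 2 | 4.003 |
| 2 | Li | 3 | 7.0 |
| 3 | Be | 4 | 9.012 |
| 4 | B | 5 | 10.81 |
| 5 | C | 6 | 12.011 |
| 6 | N | 7 | 14.077 |
| 7 | O | 8 | 15.999 |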


## 👀 Viewing the DataFrame

We can quickly peek into the data using the `.head(n)` function, which displays the first `n` rows of the DataFrame.


In [6]:
# Displaying the first 6 rows of the DataFrame
periodic_table_df.head(6)

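|   | Element | Atomic Number | Atomic Mass |
|---|---------|---------------|-------------|
| 0 | H | 1 | 1.008 |
| 1 | He | 2 | 4.003 |
| 2 | Li | 3 | 7.0 |
| 3 | Be | 4 | 9.012 |
| 4 | B | 5 | 10.81 |
| 5 | C | 6 | 12.011 |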


In [7]:
# Using tail(n) displays last 'n' entries in the dataframe

periodic_table_df.tail(2)

|   | Element | Atomic Number | Atomic Mass |
|---|---------|---------------|-------------|
| 6 | N | 7 | 14.077 |
| 7 | O | 8 | 15.999 |


In [8]:
# sample(n) returns n randomly chosen rows, so the output changes from run to run

periodic_table_df.sample(3)

|   | Element | Atomic Number | Atomic Mass |
|---|---------|---------------|-------------|
| 5 | C | 6 | 12.011 |
| 6 | N | 7 | 14.077 |
| 3 | Be | 4 | 9.012 |


In [9]:
# Printing column names and index of the DataFrame
print(f'The column names of the DataFrame are: {periodic_table_df.columns}')
print(f'The index of the DataFrame is: {periodic_table_df.index}')


The column names of the DataFrame are: Index(['Element', 'Atomic Number', 'Atomic Mass'], dtype='object')
The index of the DataFrame is: RangeIndex(start=0, stop=8, step=1)


In [10]:
# Checking the data types of each column
periodic_table_df.dtypes

Element          object
Atomic Number    object
Atomic Mass      object
dtype: object

📌 **Note:**  
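Currently, Pandas reports every column as type `object`. A NumPy array holds a single data type, so when we mixed strings and numbers in `np.array`, NumPy converted every value to a string (note the `dtype='<U32'` in the array output above).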

🛠️ Since `Atomic Number` and `Atomic Mass` are numerical values, we’ll convert them into proper numeric types using `astype(float)` to allow mathematical operations.
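
One way to sidestep the conversion altogether is to build the DataFrame from a dictionary of columns, so each column keeps its own type. This is a sketch rather than the approach used in this notebook; `elements_df` and the four elements chosen are illustrative:

```python
import pandas as pd

# Each dictionary key becomes a column; each list keeps its own data type
elements_df = pd.DataFrame({
    'Element': ['H', 'He', 'Be', 'B'],
    'Atomic Number': [1, 2, 4, 5],
    'Atomic Mass': [1.008, 4.003, 9.012, 10.81],
})

# Integers stay integers and floats stay floats; no astype() call is needed
print(elements_df.dtypes)
```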


## 🔧 Converting Columns to Numeric Types

In [11]:
# Selecting only the numeric columns and converting them to float
periodic_table_df_stat = periodic_table_df[['Atomic Number', 'Atomic Mass']].astype(float)

# Verifying the data types after conversion
periodic_table_df_stat.dtypes

Atomic Number    float64
Atomic Mass      float64
dtype: object

✅ Now, both **Atomic Number** and **Atomic Mass** are correctly stored as `float64`, which allows us to perform numerical operations like statistics, sorting, filtering, etc.


## 📊 Statistical Summary of the DataFrame

In [12]:
# Generating summary statistics for Atomic Number and Atomic Mass
periodic_table_df_stat.describe()

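|       | Atomic Number | Atomic Mass |
|-------|---------------|-------------|
| count | 8.0 | 8.0 |
| mean | 4.5 | 9.24 |
| std | 2.44949 | 5.063674 |
| min | 1.0 | 1.008 |
| 25% | 2.75 | 6.25075 |
| 50% | 4.5 | 9.911 |
| 75% | 6.25 | 12.5275 |
| max | 8.0 | 15.999 |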


📌 **Note:**  
The `.describe()` function gives us a quick statistical overview of our numeric columns:

- `count` ➔ total number of entries  
- `mean` ➔ average value  
- `std` ➔ standard deviation (spread of data)  
- `min` ➔ minimum value  
- `25%`, `50%`, `75%` ➔ quartiles  
- `max` ➔ maximum value

Super useful for a quick sense-check of your data! ✅
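
Each row of the summary can also be computed on its own. The sketch below uses a short illustrative Series (`masses`) rather than the full DataFrame, and checks that the individual methods agree with `describe()`:

```python
import pandas as pd

# Illustrative data: atomic masses of H, He, Be and B
masses = pd.Series([1.008, 4.003, 9.012, 10.81], name='Atomic Mass')

print(masses.count())        # number of non-missing entries
print(masses.mean())         # average value
print(masses.std())          # sample standard deviation (divides by n - 1)
print(masses.quantile(0.5))  # the 50% quartile, i.e. the median

# describe() bundles the same numbers into one labelled Series
summary = masses.describe()
print(summary['mean'] == masses.mean())
```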


## 🔽 Sorting the DataFrame in Descending Order by Atomic Number

In [13]:
# Sorting the DataFrame based on 'Atomic Number' in descending order
periodic_table_df_sort = periodic_table_df.sort_values(by='Atomic Number', ascending=False)

# Display the sorted DataFrame
periodic_table_df_sort


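|   | Element | Atomic Number | Atomic Mass |
|---|---------|---------------|-------------|
| 7 | O | 8 | 15.999 |
| 6 | N | 7 | 14.077 |
| 5 | C | 6 | 12.011 |
| 4 | B | 5 | 10.81 |
| 3 | Be | 4 | 9.012 |
| 2 | Li | 3 | 7.0 |
| 1 | He | 2 | 4.003 |
| 0 | H | 1 | 1.008 |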


📌 **Note:**  
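- `sort_values()` sorts a DataFrame by any column.
- Setting `ascending=False` sorts in descending order (largest value first).
- Here `Atomic Number` is still stored as text in `periodic_table_df`, so the values are compared as strings. The order comes out right only because every atomic number has a single digit; convert the column to a numeric type before sorting larger tables.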
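
Sorting a column that is stored as text compares values character by character, which can give surprising results once numbers have more than one digit. A minimal sketch with an illustrative four-element table (`df` is not part of the notebook above):

```python
import pandas as pd

# Atomic numbers deliberately stored as strings
df = pd.DataFrame({
    'Element': ['H', 'O', 'Ne', 'Na'],
    'Atomic Number': ['1', '8', '10', '11'],
})

# String comparison: '8' sorts above '11' and '10'
text_sorted = df.sort_values(by='Atomic Number', ascending=False)
print(text_sorted['Element'].tolist())   # ['O', 'Na', 'Ne', 'H']

# After converting to integers the order is numerically correct
df['Atomic Number'] = df['Atomic Number'].astype(int)
num_sorted = df.sort_values(by='Atomic Number', ascending=False)
print(num_sorted['Element'].tolist())    # ['Na', 'Ne', 'O', 'H']
```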


## 🔎 Locating Entries in the DataFrame

We can use **`loc`** and **`iloc`** to access specific entries in the DataFrame:

- 📌 **`loc[row_label, column_label]`**  
  ➔ Used when you want to select data by **labels** (i.e., row index name & column name).

- 📌 **`iloc[row_index, column_index]`**  
  ➔ Used when you want to select data by **position** (i.e., integer-based row & column positions).

✅ To select a single value, both take two inputs inside square brackets:  
- The row (label or index)
- The column (label or index)

In [14]:
# Creating the DataFrame
materials_data = pd.DataFrame(
    [
        [5, 225],
        [1.46, 194],
        [0, 14],
        [0, 14],
        [np.nan, 225],
        [0, np.nan]
    ],
    index=['NaCl', 'MoS2', 'AgF2', 'Ag2Te', 'NbN', 'Nb3Sn'],
    columns=['band_gap', 'space_group_number']
)

# Display the DataFrame
materials_data


|       | band_gap | space_group_number |
|-------|----------|--------------------|
| NaCl | 5.0 | 225.0 |
| MoS2 | 1.46 | 194.0 |
| AgF2 | 0.0 | 14.0 |
| Ag2Te | 0.0 | 14.0 |
| NbN | NaN | 225.0 |
| Nb3Sn | 0.0 | NaN |


## 🔍 Accessing Data Using `loc`

We can access specific entries using their **row label** and **column name**.


In [15]:
# Accessing the band gap value of NaCl
materials_data.loc['NaCl', 'band_gap']

5.0

📌 **Explanation:**  
- `'NaCl'` is the **index (row label)**.
- `'band_gap'` is the **column name**.
- This directly retrieves the value stored at the intersection of that row and column.
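
`loc` is not limited to single cells. As a sketch on a three-row copy of the table above (the name `demo` is illustrative), it also accepts a single label, lists of labels, or a True/False mask:

```python
import pandas as pd

# Small illustrative table with formulas as row labels
demo = pd.DataFrame(
    {'band_gap': [5.0, 1.46, 0.0], 'space_group_number': [225, 194, 14]},
    index=['NaCl', 'MoS2', 'AgF2'],
)

# One row label only: returns the whole row as a Series
row = demo.loc['MoS2']
print(row)

# Lists of labels: returns a smaller DataFrame
subset = demo.loc[['NaCl', 'AgF2'], ['band_gap']]
print(subset)

# Boolean mask: keeps only the rows where the condition is True
nonzero_gap = demo.loc[demo['band_gap'] > 0]
print(nonzero_gap)
```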


## 🔍 Accessing Data Using `iloc`

We can also access data using **integer-based indexing** with `iloc`. Like `loc`, it is used with square brackets rather than parentheses.


In [16]:
# Accessing the space_group_number of MoS2 using positional indexing
materials_data.iloc[1, 1]

194.0

📌 **Explanation:**  
- `iloc[1, 1]` means:  
   - Row at position `1` → corresponds to **MoS2** (since indexing starts at 0).
   - Column at position `1` → corresponds to **space_group_number**.
   
So, this returns the space group number for MoS2. ✅


In [17]:
# Accessing the band gap of NaCl using positional indexing
materials_data.iloc[0, 0]  # Output: 5 (before updating)

5.0

📌 **Note:**  
- Position indexing starts from `0` (not `1`).
- So `iloc[0, 0]` refers to the first row and first column of the DataFrame.


In [18]:
# Updating the band gap value of NaCl from 5 eV to 4.86 eV
materials_data.iloc[0, 0] = 4.86

# Display the updated DataFrame
materials_data

|       | band_gap | space_group_number |
|-------|----------|--------------------|
| NaCl | 4.86 | 225.0 |
| MoS2 | 1.46 | 194.0 |
| AgF2 | 0.0 | 14.0 |
| Ag2Te | 0.0 | 14.0 |
| NbN | NaN | 225.0 |
| Nb3Sn | 0.0 | NaN |


In [19]:
# Selecting rows indexed 0 to 1 (inclusive of 0, exclusive of 2), and only the first column
materials_data.iloc[0:2, 0:1]


|       | band_gap |
|-------|----------|
| NaCl | 4.86 |
| MoS2 | 1.46 |


📌 **Note:**  
- `iloc[0:2, 0:1]` returns rows 0 and 1, and only the first column (`band_gap`).
- Remember: slicing works like `start:stop` where `stop` is exclusive.
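
One difference worth knowing: slices with `iloc` exclude the stop position, but slices with `loc` **include** the stop label. A short sketch on an illustrative three-row table (`demo` is an assumed name):

```python
import pandas as pd

demo = pd.DataFrame(
    {'band_gap': [4.86, 1.46, 0.0]},
    index=['NaCl', 'MoS2', 'AgF2'],
)

# Position-based slice: positions 0 and 1, position 2 is excluded
by_position = demo.iloc[0:2]
print(by_position.index.tolist())   # ['NaCl', 'MoS2']

# Label-based slice: both end labels are included
by_label = demo.loc['NaCl':'AgF2']
print(by_label.index.tolist())      # ['NaCl', 'MoS2', 'AgF2']
```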


## 🚩 Handling Missing Values

Let’s check if our `materials_data` DataFrame contains any missing values and then learn how to handle them.

In [20]:
# Checking for missing values
pd.isna(materials_data)

|       | band_gap | space_group_number |
|-------|----------|--------------------|
| NaCl | False | False |
| MoS2 | False | False |
| AgF2 | False | False |
| Ag2Te | False | False |
| NbN | True | False |
| Nb3Sn | False | True |


📌 **Note:**  
- `True` indicates a missing value (`NaN`) in that position.
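- As a first attempt, we fill the missing values with the mean of the `band_gap` column. Because `fillna(value=...)` applies this single number to every column, the missing `space_group_number` of Nb3Sn also becomes 1.264, which is not a valid space group number.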


In [21]:
# Attempt to fill NaN values with the mean of 'band_gap'
materials_data.fillna(value=materials_data['band_gap'].mean())

|       | band_gap | space_group_number |
|-------|----------|--------------------|
| NaCl | 4.86 | 225.0 |
| MoS2 | 1.46 | 194.0 |
| AgF2 | 0.0 | 14.0 |
| Ag2Te | 0.0 | 14.0 |
| NbN | 1.264 | 225.0 |
| Nb3Sn | 0.0 | 1.264 |
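
To restrict the fill to one column, `fillna` can be called on that column alone. The sketch below rebuilds the table under the illustrative name `demo`; whether a mean is a sensible replacement for a missing band gap is a scientific judgment, not something Pandas decides for you:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        'band_gap': [4.86, 1.46, 0.0, 0.0, np.nan, 0.0],
        'space_group_number': [225, 194, 14, 14, 225, np.nan],
    },
    index=['NaCl', 'MoS2', 'AgF2', 'Ag2Te', 'NbN', 'Nb3Sn'],
)

# Work on a copy so the original table stays untouched
filled = demo.copy()

# Fill only the band_gap column with its own mean
filled['band_gap'] = filled['band_gap'].fillna(filled['band_gap'].mean())

# The space group number is a category label, so its gap is left as NaN
print(filled)
```

Note that `fillna` returns a new object and leaves the original unchanged unless the result is assigned back, as done for the `band_gap` column here.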


## 🚮 Correctly Dropping Missing Values


In [22]:
# Dropping rows where ANY column has a missing value
materials_data_clean = materials_data.dropna(how='any')

# Displaying the cleaned DataFrame
materials_data_clean

|       | band_gap | space_group_number |
|-------|----------|--------------------|
| NaCl | 4.86 | 225.0 |
| MoS2 | 1.46 | 194.0 |
| AgF2 | 0.0 | 14.0 |
| Ag2Te | 0.0 | 14.0 |


📌 **Explanation:**

- `how='any'` ➔ Drops a row **if any column** has a missing (`NaN`) value.
- `how='all'` ➔ Drops a row **only if all columns** have missing values.
- Use `axis=1` if you want to drop columns instead of rows.

⚠️ In real-world datasets, always analyze which strategy is appropriate before removing data!  
We don't want to accidentally lose useful information 🔬🤖.
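
The options above can be compared side by side. This sketch rebuilds the table as `demo` (an illustrative name) and also uses the `subset` parameter of `dropna`, which limits the check to chosen columns:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        'band_gap': [4.86, 1.46, 0.0, 0.0, np.nan, 0.0],
        'space_group_number': [225, 194, 14, 14, 225, np.nan],
    },
    index=['NaCl', 'MoS2', 'AgF2', 'Ag2Te', 'NbN', 'Nb3Sn'],
)

rows_any = demo.dropna(how='any')             # drops NbN and Nb3Sn
rows_all = demo.dropna(how='all')             # no row is entirely NaN, so nothing is dropped
by_subset = demo.dropna(subset=['band_gap'])  # only checks band_gap, so only NbN is dropped
cols_any = demo.dropna(axis=1, how='any')     # both columns contain a NaN, so both are dropped

print(len(rows_any), len(rows_all), len(by_subset), cols_any.shape)
```

The last line shows how aggressive `axis=1` can be: a single missing value is enough to remove an entire column.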


## 🔢 Using `value_counts()` for Quick Frequency Counts

In [23]:
# Count how many times each unique row appears
materials_data.value_counts()

band_gap  space_group_number
0.00      14.0                  2
1.46      194.0                 1
4.86      225.0                 1
dtype: int64

In [24]:
# Filtering rows where space_group_number equals 194 and then counting
materials_data[materials_data['space_group_number'] == 194].value_counts()


band_gap  space_group_number
1.46      194.0                 1
dtype: int64

📌 **Explanation:**

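- Applied to a DataFrame, `value_counts()` counts how often each unique **row** (combination of column values) appears. Rows containing `NaN` are excluded by default, which is why NbN and Nb3Sn do not appear in the output.  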
- The second example first filters rows where `space_group_number` is `194`, and then counts how many times each combination of values appears.

✅ Very useful when analyzing duplicates or frequency of identical data entries!
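
`value_counts()` is more often applied to a single column. A sketch on an illustrative Series of space group numbers (`space_groups` is an assumed name), including the `dropna=False` option that keeps missing values in the count:

```python
import numpy as np
import pandas as pd

space_groups = pd.Series(
    [225, 194, 14, 14, 225, np.nan],
    index=['NaCl', 'MoS2', 'AgF2', 'Ag2Te', 'NbN', 'Nb3Sn'],
    name='space_group_number',
)

# Default: missing values are ignored
counts = space_groups.value_counts()
print(counts)

# dropna=False adds a separate entry for NaN
counts_with_nan = space_groups.value_counts(dropna=False)
print(counts_with_nan)
```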


## 🔗 Concatenation: Adding More Data to Our DataFrame

In [25]:
# Creating another DataFrame containing metal property and atomic number
mat_data = pd.DataFrame(
    [
        ['yes', 11],
        ['yes', 79],
        ['yes', 47],
        ['yes', 26]
    ],
    index=['Na', 'Au', 'Ag', 'Fe'],
    columns=['is_metal', 'atomic_number']
)

# Display the new DataFrame
mat_data

|    | is_metal | atomic_number |
|----|----------|---------------|
| Na | yes | 11 |
| Au | yes | 79 |
| Ag | yes | 47 |
| Fe | yes | 26 |


📌 **Note:**  
We have now created a second DataFrame `mat_data` containing information on:

- Whether the element is a metal (`is_metal`)
- Its atomic number (`atomic_number`)

This will allow us to practice **concatenation** operations next. 🔧🐼


In [26]:
# Creating another DataFrame with the same columns, to concatenate with mat_data
mat_data_1 = pd.DataFrame(
    [
        ['no', 5],
        ['no', 16],
        ['no', 36],
        ['yes', 55]
    ],
    index=['B', 'S', 'Kr', 'Cs'],
    columns=['is_metal', 'atomic_number']
)
mat_data_1

|    | is_metal | atomic_number |
|----|----------|---------------|
| B | no | 5 |
| S | no | 16 |
| Kr | no | 36 |
| Cs | yes | 55 |


## 🔗 Concatenating DataFrames by Rows and Columns

In [27]:
# Concatenating the two DataFrames along rows (stacking one below another)
concat_row = pd.concat([mat_data, mat_data_1], axis=0)
concat_row

|    | is_metal | atomic_number |
|----|----------|---------------|
| Na | yes | 11 |
| Au | yes | 79 |
| Ag | yes | 47 |
| Fe | yes | 26 |
| B | no | 5 |
| S | no | 16 |
| Kr | no | 36 |
| Cs | yes | 55 |


In [28]:
# Concatenating the two DataFrames along columns (side-by-side)
concat_columns = pd.concat([mat_data, mat_data_1], axis=1)
concat_columns

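|    | is_metal | atomic_number | is_metal | atomic_number |
|----|----------|---------------|----------|---------------|
| Na | yes | 11.0 | NaN | NaN |
| Au | yes | 79.0 | NaN | NaN |
| Ag | yes | 47.0 | NaN | NaN |
| Fe | yes | 26.0 | NaN | NaN |
| B | NaN | NaN | no | 5.0 |
| S | NaN | NaN | no | 16.0 |
| Kr | NaN | NaN | no | 36.0 |
| Cs | NaN | NaN | yes | 55.0 |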


📌 **Explanation:**

- `axis=0` ➔ Stacks DataFrames vertically (rows added).  
- `axis=1` ➔ Stacks DataFrames horizontally (columns added).  
- When indexes don't match, Pandas automatically aligns them and fills missing values with `NaN`.

✅ Be cautious when concatenating on `axis=1`: the row indexes must match, otherwise the result fills up with unintended `NaN` entries and duplicated column names.  
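
Column-wise concatenation is most useful when the two tables describe the **same rows** with **different columns**. A sketch with two illustrative tables (`metal_info` and `number_info` are assumed names), also showing the `join='inner'` option that keeps only rows present in both:

```python
import pandas as pd

metal_info = pd.DataFrame({'is_metal': ['yes', 'yes', 'no']}, index=['Na', 'Fe', 'S'])
number_info = pd.DataFrame({'atomic_number': [11, 26, 16, 79]}, index=['Na', 'Fe', 'S', 'Au'])

# Default (join='outer'): keeps every row label; Au has no is_metal entry, so it gets NaN
outer = pd.concat([metal_info, number_info], axis=1)
print(outer)

# join='inner': keeps only the labels found in both tables, so no NaN appears
inner = pd.concat([metal_info, number_info], axis=1, join='inner')
print(inner)
```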



## 📚 Some Useful Resources to Learn Pandas Further

1️⃣ [**Complete Python Pandas Data Science Tutorial!** (2025 Updated Edition)](https://www.youtube.com/watch?v=2uvysYbKdjM)  
A beginner-friendly video tutorial that covers the basics of Pandas.
