# 🔍 Regular Expressions in Python

**Created by**: [@adigasuhas](https://github.com/adigasuhas)  
**Contact**: suhasadiga@jncasr.ac.in

---

Welcome! 🙌  
This notebook is part of a tutorial series crafted to teach **Python from the ground up** — in a simple, clear, and beginner-friendly way. We'll use examples especially relevant to **Materials Science** to make learning practical and engaging.

Python is one of the most popular and versatile programming languages today. Whether you're analyzing data, automating repetitive tasks, or running scientific simulations, Python is an essential tool in modern research.

---

## 📘 What You'll Learn in This Notebook

In this notebook, you'll learn:

- ✨ **How to match patterns and process text using Regular Expressions (RegEx)**  
  Understand how to search, extract, and manipulate patterns in raw text — a skill that’s incredibly useful in data cleaning and scientific text mining.

---

> 📝 **Note**: This tutorial assumes **no prior programming experience**. Each concept is introduced step-by-step with simple explanations, real-world analogies, and hands-on examples.

Let's get started! 🚀


# 🔍 Regular Expressions (Regex)

A **regular expression** (commonly abbreviated as *regex* or *regexp*) is a sequence of characters that defines a search pattern in text. It serves a similar purpose to the "Find" or "Find and Replace" functions that we often use in word processors, PDF readers, or other software tools.

The concept of regular expressions was introduced in the 1950s by the American mathematician **Stephen Cole Kleene**. It later entered everyday computing through early text editors and Unix text-processing utilities such as `grep`.

## 📄 Relevance in Materials Science

In the context of materials science, regular expressions can be particularly useful when working with text files that contain repeated patterns. For example, they can help extract chemical compositions or other structured information from raw data files. This becomes especially important in the application of machine learning to materials science, where preprocessing and extracting relevant information from text is a key step in building accurate models.

---
✅ Python provides a built-in module called `re`, which contains all the essential utilities for working with regular expressions.


In [1]:
import re

## 🛠️ Functions in `re`

Python's `re` module provides several functions to work with regular expressions. Below are some of the most commonly used ones:

- `findall(<pattern>, string)`  
  → Searches through the string and returns a list of all substrings that match the specified pattern.

- `search(<pattern>, string)`  
  → Scans through the string and returns a match object for the **first** occurrence of the pattern. Returns `None` if no match is found.

- `split(<pattern>, string)`  
  → Splits the string at each occurrence of the pattern. This is especially useful when the splitting criterion is not a simple space or delimiter.

- `sub(<pattern>, <replacement>, string)`  
  → Identifies all occurrences of the pattern in the string and replaces them with the specified `replacement`.
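
As a quick preview, the four functions can be tried on one short string. This is only a minimal sketch: the string `sample` and the variable names are made up for illustration.

```python
import re

sample = "NaCl, KCl; MgO"  # hypothetical sample string, not from a real dataset

# findall: every non-overlapping match, returned as a list of strings
all_cl = re.findall(r"Cl", sample)

# search: a match object for the first match only (or None if nothing matches)
first_cl = re.search(r"Cl", sample)

# split: cut the string wherever the pattern matches
# (here: a comma or semicolon followed by optional whitespace)
parts = re.split(r"[,;]\s*", sample)

# sub: return a new string with every match replaced
swapped = re.sub(r"Cl", "Br", sample)

print(all_cl)                              # ['Cl', 'Cl']
print(first_cl.group(), first_cl.start())  # Cl 2
print(parts)                               # ['NaCl', 'KCl', 'MgO']
print(swapped)                             # NaBr, KBr; MgO
```

Note that `sub` does not modify `sample`; Python strings are immutable, so the result is a new string.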

While the functionality of `sub` is generally self-explanatory, the behavior of the other three functions (`findall`, `search`, and `split`) may seem less intuitive at first. These will become clearer through examples.

Before we proceed to those examples, we will first explore **metacharacters**, which are essential for defining regex patterns in a way that can be interpreted by the machine.


## 🤖 Metacharacters in Regular Expressions

Metacharacters are symbols used in regex to define search patterns more precisely. Below is a table summarizing commonly used metacharacters:

| Character(s) | Function | Example |
|--------------|----------|---------|
| `[]`         | Identifies a set or range of characters | `[u-w]` matches any one character in the range *u* to *w*. |
| `()`         | Captures and groups characters | `(u-w)` matches the literal text `u-w` and captures it as a group; inside parentheses the hyphen does not define a range. |
| `\`          | Indicates a special sequence | `\s` matches any whitespace character. |
| `.`          | Matches any single character (except newline) | `d..d` matches any four-character sequence that starts and ends with `d`, like `dead` or `died`. |
| `^`          | Asserts the start of a string | `^Na` matches strings starting with `Na`, like `NaCl` or `Nascent`. |
| `$`          | Asserts the end of a string | `.$` matches the last character of a string, whatever it is; use `\.$` to match a string that ends with a literal period. |
| `*`          | Matches zero or more occurrences | `Na.*C` matches strings where `Na` is followed by any characters (including none) and then a `C`. |
| `+`          | Matches one or more occurrences | Similar to `*`, but requires at least one occurrence: `Na.+C` matches `NaxC` but not `NaC`. |
| `?`          | Matches zero or one occurrence | Useful for optional characters or groups: `ions?` matches both `ion` and `ions`. |
| `{n}`        | Matches exactly *n* occurrences | `Na.{2}Cl` matches `Na` followed by any 2 characters, and then `Cl`. |
| `\|`         | Logical OR (either this or that) | `metal\|non-metal` matches either `metal` or `non-metal`. |

These metacharacters are building blocks for defining patterns in a flexible and powerful way. They become especially useful in extracting structured data from unstructured text.
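
Several of these metacharacters can be checked directly in Python. The sketch below uses short made-up strings (they are assumptions for illustration, not data from this tutorial) and stores each result in a variable before printing it.

```python
import re

# [] : any one character from the range u to w
in_range = re.findall(r"[u-w]", "vacuum wave")

# .  : any single character, so d..d is 'd', two characters, then 'd'
four_chars = re.findall(r"d..d", "dead died dud")

# ^  : anchor at the start of the string
starts_with_na = [s for s in ["NaCl", "KNaCl"] if re.search(r"^Na", s)]

# $  : anchor at the end; a literal period has to be escaped as \.
last_char = re.search(r".$", "TiO2").group()
ends_with_period = [s for s in ["TiO2", "TiO2."] if re.search(r"\.$", s)]

# *  allows zero characters between Na and C, + needs at least one
star_ok = bool(re.fullmatch(r"Na.*C", "NaC"))
plus_ok = bool(re.fullmatch(r"Na.+C", "NaC"))

# {n}: exactly n repetitions of the preceding item
exactly_two = bool(re.fullmatch(r"Na.{2}Cl", "Na--Cl"))

# |  : either alternative
either = re.findall(r"metal|non-metal", "metal and non-metal")

print(in_range)          # ['v', 'u', 'u', 'w', 'v']
print(four_chars)        # ['dead', 'died']
print(starts_with_na)    # ['NaCl']
print(last_char)         # 2
print(ends_with_period)  # ['TiO2.']
print(star_ok, plus_ok)  # True False
print(exactly_two)       # True
print(either)            # ['metal', 'non-metal']
```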


## 🤫 Special Sequences in Regex

Regular expressions provide several **special sequences**—shortcuts that simplify pattern matching using a backslash (`\`). These are useful for identifying specific positions or character types in a string.

| Character(s) | Function | Example |
|--------------|----------|---------|
| `\A`         | Matches if the specified characters are at the **beginning** of the string | `\AMaterial` matches only if the string starts with `Material`. |
| `\b`         | Matches at a **word boundary**, i.e., the beginning or end of a word | `r"\bNaCl"` matches `NaCl` at the start of a word; `r"NaCl\b"` matches it at the end. |
| `\B`         | Matches where `\b` does **not** match, i.e., **not** at the beginning or end of a word | Useful for identifying substrings inside a word, like `r"\BCl"` in `NaCl`. |
| `\d`         | Matches **any digit** (equivalent to `[0-9]` for ASCII text) | In `Na0.5Cl0.5`, using `\d+` would return `['0', '5', '0', '5']`. |
| `\D`         | Matches **any non-digit character** | In `NaCl`, `\D` would return `['N', 'a', 'C', 'l']`. |
| `\s`         | Matches any **whitespace** character (spaces, tabs, newlines) | In `'Ti O2'`, `\s` would match the space between `Ti` and `O2`. |
| `\S`         | Matches any **non-whitespace** character | In `'Ti O2'`, `\S` would match `['T', 'i', 'O', '2']`. |
| `\w`         | Matches any **word character** (letters, digits, or underscore) | In `BaTi_O3`, `\w` matches all characters. |
| `\W`         | Matches any **non-word character** | In `BaTi_O3!`, `\W` matches only the `!`. |
| `\Z`         | Matches if the specified characters are at the **end** of the string | `O3\Z` matches only if the string ends with `O3`. |

These sequences make regex patterns more concise and powerful, especially when dealing with structured or semi-structured text.
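
The examples in the table can be reproduced with `re.findall` and `re.search`. This is a small sketch; the strings `"pure NaCl crystal"` and `"Materials Science"` are made-up inputs. One practical detail: patterns containing `\b` should be written as raw strings (`r"..."`), because in an ordinary Python string literal `"\b"` means a backspace character.

```python
import re

digits = re.findall(r"\d+", "Na0.5Cl0.5")  # runs of digits
non_digits = re.findall(r"\D", "NaCl")     # every non-digit character
non_space = re.findall(r"\S", "Ti O2")     # every non-whitespace character
non_word = re.findall(r"\W", "BaTi_O3!")   # every non-word character

# \b matches at a word boundary, \B everywhere else
word_start = re.search(r"\bNaCl", "pure NaCl crystal")
cl_at_boundary = re.search(r"\bCl", "NaCl")  # 'Cl' sits inside the word
cl_inside = re.search(r"\BCl", "NaCl")

# \A and \Z anchor to the very start and very end of the string
starts = bool(re.search(r"\AMaterial", "Materials Science"))
ends = bool(re.search(r"O3\Z", "BaTiO3"))

print(digits)             # ['0', '5', '0', '5']
print(non_digits)         # ['N', 'a', 'C', 'l']
print(non_space)          # ['T', 'i', 'O', '2']
print(non_word)           # ['!']
print(word_start.group()) # NaCl
print(cl_at_boundary)     # None
print(cl_inside.start())  # 2
print(starts, ends)       # True True
```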


## 📝 Exercises

### 🔍 a) Finding the word *computation*

In [2]:
# Sample paragraph to analyze
intro_text = (
    """Topological quantum computation remains one of the most exciting and pursued paths towards fault-tolerant 
    quantum computation. In general, the framework for topological quantum computation relies on the emergence, 
    spatial manipulation, and measurement of anyonic excitations known as Majorana zero modes (MZMs), exotic states 
    of matter which form on the boundary of topological superconductors [1–3]. These obey non-Abelian statistics 
    under mutual spatial exchange which allows for encoding of unitary gates on the many-body ground state manifold 
    [4–6]. Crucially, MZM based processes have been theorized to encode a universal gate set, thus providing a 
    direct platform to encode a universal quantum computer [7–16]. Recent work in full many-body simulation did not
    only verify this, but were also able to quantify the physical constraints of braiding in a full dynamic context."""
)


In [3]:
# Word to search for
search_word = 'computation'

# Using regular expression to find all occurrences
occurrences = re.findall(search_word, intro_text)

# Displaying the result
print(f"The word '{search_word}' occurs {len(occurrences)} times.")

The word 'computation' occurs 3 times.


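The word `computation` appears **three times** in the paragraph. 🔎 Keep in mind that `re.findall` counts substring matches and is case-sensitive by default, so a word such as `computational` would also be counted, while `Computation` would not.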
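
If only the whole word should be counted, regardless of case, one option is to combine the word boundary `\b` with the `re.IGNORECASE` flag. The sketch below compares both counts on a made-up sentence (`snippet` is an assumption for illustration, not part of the paragraph above).

```python
import re

# Hypothetical sentence with several variants of the word
snippet = "Computation, computational cost, computations, and quantum computation."

# Plain substring search: case-sensitive, also hits 'computational' and 'computations'
loose = re.findall(r"computation", snippet)

# Whole-word search: \b on both sides, case-insensitive
whole = re.findall(r"\bcomputation\b", snippet, flags=re.IGNORECASE)

print(len(loose), loose)  # 3 ['computation', 'computation', 'computation']
print(len(whole), whole)  # 2 ['Computation', 'computation']
```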


### 🔍 b) Check if the text contains the word *superconductivity*

In [4]:
# Sample texts
text_superconductivity = '''
We consider a type I superconducting body that contains one or more holes in its interior that 
undergoes a transition between normal and superconducting states in the presence of a magnetic field. 
We argue that unlike other thermodynamic systems that undergo first order phase transitions 
the system cannot reach its equilibrium thermodynamic state, and that this sheds new light on the physics 
of the Meissner effect. How the Meissner effect occurs has not been addressed within the conventional theory 
of superconductivity, BCS. The situation considered in this paper indicates that expulsion of magnetic field 
requires physical elements absent from Hamiltonians assumed to describe superconductors within BCS theory. 
These physical elements are essential components of the alternative theory of hole superconductivity.
'''

text_non_superconductivity = '''
Galaxy appearances reveal the physics of how they formed and evolved. Machine learning models can 
now exploit galaxies’ information-rich morphologies to predict physical properties directly from image cutouts. 
Learning the relationship between pixel-level features and galaxy properties is essential for building a physical 
understanding of galaxy evolution, but we are still unable to explicate the details of how deep neural networks 
represent image features. To address this lack of interpretability, we present a novel neural network architecture 
called a Sparse Feature Network (SFNet). SFNets produce interpretable features that can be linearly combined in 
order to estimate galaxy properties like optical emission line ratios or gas-phase metallicity. We find that SFNets
do not sacrifice accuracy in order to gain interpretability, and that they perform comparably well to cutting-edge 
models on astronomical machine learning tasks. Our novel approach is valuable for finding physical patterns in large
datasets and helping astronomers interpret machine learning results.
'''

# Pattern to search for superconductivity-related terms (case-insensitive)
pattern = r'superconductivity|superconductor'

# Check first text
if re.search(pattern, text_superconductivity, re.IGNORECASE):
    print("The first text is related to superconductivity research. 🔬")
else:
    print("The first text is not related to superconductivity research.")

# Check second text
if re.search(pattern, text_non_superconductivity, re.IGNORECASE):
    print("The second text is related to superconductivity research. 🔬")
else:
    print("The second text is not related to superconductivity research.")

The first text is related to superconductivity research. 🔬
The second text is not related to superconductivity research.
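
`re.search` returns more than a yes/no answer: the match object also records what was matched and where. The following sketch assumes a made-up sentence and a slightly more compact pattern with a non-capturing group `(?:...)`; it is one possible way to write the check, not the only one.

```python
import re

# Compile once so the same pattern can be reused on many texts
pattern = re.compile(r"superconduct(?:ivity|or)", re.IGNORECASE)

sentence = "BCS theory describes Superconductivity in metals."  # hypothetical input

match = pattern.search(sentence)
if match:
    start, end = match.span()  # character positions of the match
    print(match.group())       # the matched text: Superconductivity
    print(sentence[start:end]) # slicing with the span gives the same text

# When nothing matches, search returns None, which is why it works inside `if`
no_match = pattern.search("Galaxies evolve over time.")
print(no_match)                # None
```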


### 🔬 c) Extracting elements and their composition from a chemical formula

We know that chemical formulas are mostly of the form:  
`[Element1][Composition1][Element2][Composition2]...`

- The **element** starts with a capital letter (A–Z), optionally followed by a lowercase letter (a–z).
- The **composition** (subscript) should be numeric; it may be a decimal, and it is usually omitted when the count is 1.


In [5]:
# Sample chemical formulas
formula_valid_1 = 'NaCl'      # ✅ Valid chemical formula
formula_invalid_1 = 'NaxCl1'  # ❌ Invalid - variable 'x' used as composition
formula_valid_2 = 'AgTe2'     # ✅ Valid chemical formula
formula_invalid_2 = 'Au@Tn1'  # ❌ Invalid - special character '@' present

# Regex pattern: element followed by optional numeric composition
formula_pattern = r'([A-Z][a-z]?)(\d*\.?\d*)'

# Applying regex extraction
matches_valid_1 = re.findall(formula_pattern, formula_valid_1)
matches_invalid_1 = re.findall(formula_pattern, formula_invalid_1)
matches_valid_2 = re.findall(formula_pattern, formula_valid_2)
matches_invalid_2 = re.findall(formula_pattern, formula_invalid_2)

# Displaying the outputs
print(f"Formula: {formula_valid_1} ➔ {matches_valid_1}")
print(f"Formula: {formula_invalid_1} ➔ {matches_invalid_1}")
print(f"Formula: {formula_valid_2} ➔ {matches_valid_2}")
print(f"Formula: {formula_invalid_2} ➔ {matches_invalid_2}")

Formula: NaCl ➔ [('Na', ''), ('Cl', '')]
Formula: NaxCl1 ➔ [('Na', ''), ('Cl', '1')]
Formula: AgTe2 ➔ [('Ag', ''), ('Te', '2')]
Formula: Au@Tn1 ➔ [('Au', ''), ('Tn', '1')]


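Our regex pattern identifies the elements and extracts their compositions.  
However:
- An empty string (`''`) in the composition slot may mean an implied count of **1**, or it may hide characters the pattern skipped: `re.findall` silently ignored the `x` in `NaxCl1` and the `@` in `Au@Tn1`, so the pattern extracts but does not validate.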
- Handling such cases will be discussed in more detail in upcoming tutorials. 🚧
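
Until then, here is one possible way to reject malformed formulas, sketched under a few assumptions: the helper `parse_formula` and the constants `ELEMENT`, `AMOUNT`, `FORMULA`, and `TOKEN` are hypothetical names, a missing composition is read as 1, and only the *shape* of the formula is checked (a made-up symbol such as `Tn` would still pass). The idea is to use `re.fullmatch`, which succeeds only if the entire string fits the pattern.

```python
import re

ELEMENT = r"[A-Z][a-z]?"      # capital letter, optional lowercase letter
AMOUNT = r"\d+(?:\.\d+)?"     # integer or decimal, e.g. 2 or 0.5

# Whole formula: one or more element symbols, each with an optional amount
FORMULA = re.compile(rf"(?:{ELEMENT}(?:{AMOUNT})?)+")
# Single token: captures the element and, if present, its amount
TOKEN = re.compile(rf"({ELEMENT})({AMOUNT})?")

def parse_formula(formula):
    """Return [(element, amount), ...] or None if the formula is malformed."""
    if not FORMULA.fullmatch(formula):
        return None  # stray characters such as 'x' or '@' end up here
    # An empty amount is interpreted as 1 (an assumption of this sketch)
    return [(el, float(amt) if amt else 1.0) for el, amt in TOKEN.findall(formula)]

for f in ["NaCl", "NaxCl1", "AgTe2", "Au@Tn1", "Na0.5Cl0.5"]:
    print(f, "➔", parse_formula(f))
```

With this approach `NaxCl1` and `Au@Tn1` return `None` instead of a partial result, so invalid input can no longer be mistaken for an implied count of 1.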


### 🔄 d) Replacing words using regular expressions

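We will now replace all instances of  
`superconductor / Superconductor / superconductivity / Superconductivity`. Ideally these would become `Non-superconductor` or `Non-superconductivity` respectively; to keep things simple, the code below uses the single replacement `Non-superconductivity` for all of them.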


In [6]:
# Define the replacement pattern
superconduct_pattern = r'Superconductivity|superconductivity|Superconductor|superconductor'

# Perform substitution
text_modified = re.sub(superconduct_pattern, "Non-superconductivity", text_superconductivity)

# Display the modified text
print(text_modified)


We consider a type I superconducting body that contains one or more holes in its interior that 
undergoes a transition between normal and superconducting states in the presence of a magnetic field. 
We argue that unlike other thermodynamic systems that undergo first order phase transitions 
the system cannot reach its equilibrium thermodynamic state, and that this sheds new light on the physics 
of the Meissner effect. How the Meissner effect occurs has not been addressed within the conventional theory 
of Non-superconductivity, BCS. The situation considered in this paper indicates that expulsion of magnetic field 
requires physical elements absent from Hamiltonians assumed to describe Non-superconductivitys within BCS theory. 
These physical elements are essential components of the alternative theory of hole Non-superconductivity.



⚠️ Note:  
- Here, we committed a small *linguistic crime* by replacing everything with `"Non-superconductivity"`, even if it originally was `"superconductor"`.  
- Technically, this isn't entirely correct — but for now, we plead guilty in the court of Regex Justice ⚖️😂  
- A more careful replacement would keep each word's original ending (for example, by using capture groups) and avoid upsetting the grammar police! 🚓📚
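
One way to make amends is to capture the word ending in a group and reuse it in the replacement through the backreference `\1`. This is a minimal sketch on a made-up sentence; the names `ending_pattern`, `sentence`, and `fixed` are assumptions for illustration.

```python
import re

# Group 1 captures the ending: 'ivity', 'or', or 'ors'
ending_pattern = re.compile(r"superconduct(ivity|ors?)", re.IGNORECASE)

sentence = "Superconductors and superconductivity differ from a superconductor."

# \1 in the replacement inserts whatever group 1 matched
fixed = ending_pattern.sub(r"Non-superconduct\1", sentence)

print(fixed)
# Non-superconductors and Non-superconductivity differ from a Non-superconductor.
```

Because the ending is carried over, plurals such as `superconductors` no longer turn into `Non-superconductivitys`.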

### ✂️ e) Breaking a paragraph into sentences

Now we will try to break the abstract paragraph into individual sentences.


In [7]:
# Sample paragraph
abstract_text = '''
The discovery of high-temperature superconducting materials holds great significance for human industry
and daily life. In recent years, research on predicting superconducting transition temperatures using artificial 
intelligence (AI) has gained popularity, with most of these tools claiming to achieve remarkable accuracy. However, 
the lack of widely accepted benchmark datasets in this field has severely hindered fair comparisons between different 
AI algorithms and impeded further advancement of these methods. In this work, we present HTSC-2025, an 
ambient-pressure high-temperature superconducting benchmark dataset. This comprehensive compilation encompasses 
theoretically predicted superconducting materials discovered by theoretical physicists from 2023 to 2025 based on 
BCS superconductivity theory, including the renowned X2YH6 system, perovskite MXH3 system, M3XH8 system, 
cage-like BCN-doped metal atomic systems derived from LaH10 structural evolution, and two-dimensional honeycomb-
structured systems evolving from MgB2. In addition, we note a range of approaches inspired by physical intuition 
for designing high-temperature superconductors, such as hole doping, the introduction of light elements to form 
strong covalent bonds, and the tuning of spin–orbit coupling. The HTSC-2025 benchmark has been open-sourced at and 
will be continuously updated. This benchmark holds significant importance for accelerating the discovery of 
superconducting materials using AI-based methods.
'''

# Split paragraph into sentences using period as delimiter
sentences = re.split(r'\.', abstract_text)

# Display each sentence
for idx, sentence in enumerate(sentences, 1):
    print(f"Sentence {idx}: {sentence.strip()}")

Sentence 1: The discovery of high-temperature superconducting materials holds great significance for human industry
and daily life
Sentence 2: In recent years, research on predicting superconducting transition temperatures using artificial 
intelligence (AI) has gained popularity, with most of these tools claiming to achieve remarkable accuracy
Sentence 3: However, 
the lack of widely accepted benchmark datasets in this field has severely hindered fair comparisons between different 
AI algorithms and impeded further advancement of these methods
Sentence 4: In this work, we present HTSC-2025, an 
ambient-pressure high-temperature superconducting benchmark dataset
Sentence 5: This comprehensive compilation encompasses 
theoretically predicted superconducting materials discovered by theoretical physicists from 2023 to 2025 based on 
BCS superconductivity theory, including the renowned X2YH6 system, perovskite MXH3 system, M3XH8 system, 
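cage-like BCN-doped metal atomic systems derived from LaH10 structural evolution, and two-dimensional honeycomb-
structured systems evolving from MgB2
Sentence 6: In addition, we note a range of approaches inspired by physical intuition 
for designing high-temperature superconductors, such as hole doping, the introduction of light elements to form 
strong covalent bonds, and the tuning of spin–orbit coupling
Sentence 7: The HTSC-2025 benchmark has been open-sourced at and 
will be continuously updated
Sentence 8: This benchmark holds significant importance for accelerating the discovery of 
superconducting materials using AI-based methods
Sentence 9: 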

⚠️ Note:  
- This is a simple sentence splitter that uses just `.` as the delimiter. It drops the periods themselves and leaves an empty final piece after the last one.
- In real-world NLP tasks, sentence splitting gets much more complicated because of abbreviations, decimals, etc.  
- Specialized libraries like `nltk` or `spaCy` 🤓📚 handle such advanced cases, but unfortunately they won't be covered in this series.
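
A modest improvement that stays within `re` is to split on the whitespace that *follows* sentence-ending punctuation, using a lookbehind `(?<=...)`. The sketch below runs on a made-up text and is only a heuristic: it keeps decimals such as `39.5` intact, but abbreviations like `e.g.` followed by a space would still cause a wrong split.

```python
import re

# Hypothetical text containing a decimal number and different end marks
text = "Tc rose to 39.5 K in MgB2. Is that surprising? Not really."

# Naive split on every period: breaks the decimal and leaves an empty last piece
naive = re.split(r"\.", text)

# Split on whitespace that comes right after '.', '!' or '?';
# the lookbehind checks the punctuation without removing it
better = re.split(r"(?<=[.!?])\s+", text.strip())

print(naive)
# ['Tc rose to 39', '5 K in MgB2', ' Is that surprising? Not really', '']
print(better)
# ['Tc rose to 39.5 K in MgB2.', 'Is that surprising?', 'Not really.']
```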


## 📚 Some Useful Resources to Learn Python Further

1️⃣ You can use [regex101](https://regex101.com/) — an excellent interactive website to test your regular expression patterns instantly without writing full Python code. Very handy for quick debugging and learning! 🔎

2️⃣ Check out the excellent tutorial by **Prof. Schwaller's group at EPFL**:  👉 [Practical Programming in Chemistry - Regular Expressions](https://schwallergroup.github.io/practical-programming-in-chemistry/tutorials/lecture_06/01_regex.html)  
A fantastic resource for anyone working at the intersection of chemistry and programming. 🧪💻
