# Cleaning and validating ISBN numbers with Python

**Description**  
This notebook demonstrates how to clean and validate ISBN numbers using Python.  
It introduces a simple workflow to identify and correct formatting inconsistencies in ISBN-10 and ISBN-13 numbers.  
You will learn to use string methods, regular expressions, and conditional logic to ensure that ISBN data is accurate and standardized.

**Learning goals**  
After completing this notebook, you will be able to:
- Identify and handle typical issues in ISBN datasets (missing digits, dashes, inconsistent lengths)  
- Understand the difference between ISBN-10 and ISBN-13 formats  
- Implement Python functions to clean and normalize ISBN strings   

**Authors**  
- Konrad Föstner, ORCID: [https://orcid.org/0000-0002-1481-2996](https://orcid.org/0000-0002-1481-2996)  
- Rabea Müller, ORCID: [https://orcid.org/0000-0002-3096-8237](https://orcid.org/0000-0002-3096-8237)  

**Affiliation**  
Data Science and Services, ZB MED – Information Centre for Life Sciences, Cologne, NRW, Germany  
Correspondence: [muellerr@zbmed.de](mailto:muellerr@zbmed.de)

**Date**  
Created: 2019-11-28  
Last updated: 2025-10-23  

**Requirements**  
- Python ≥ 3.10  
- Jupyter Notebook or JupyterLab environment  

**Licence for this notebook**  
Creative Commons Attribution 4.0 International (CC BY 4.0)  
https://creativecommons.org/licenses/by/4.0/


First, we create a list with three entries. Each entry is a string that holds one ISBN, but the entries are formatted inconsistently: some have leading or trailing spaces, and some carry an ISBN prefix or underscores.

In [1]:
# Dirty ISBNs: stray spaces, "ISBN" prefixes, and underscores
dirty_isbns = [
    " 978-3-518-58773-7 ", 
    "ISBN___978-1-491-94600-8 ",
    "ISBN 978-1-59327-603-4"]
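
Leading and trailing spaces are hard to spot in printed output, so it can help to inspect the raw strings first. The following is a minimal sketch that uses the built-in repr() function to show each entry together with its quotes and its length. The list is repeated here only so that the snippet runs on its own.

```python
# Same list as above, repeated so that this snippet is self-contained
dirty_isbns = [
    " 978-3-518-58773-7 ",
    "ISBN___978-1-491-94600-8 ",
    "ISBN 978-1-59327-603-4"]

# repr() includes the quotes, which makes surrounding spaces visible
for isbn in dirty_isbns:
    print(repr(isbn), len(isbn))
```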

Next, we define a function. Defining a function does not run it, so there is no output yet.

In [4]:
def clean_isbn():
    isbn = " 978-3-518-58773-7 "
    print(isbn)
# Nothing happens yet, because the function has not been called

To obtain an output, we must call the defined function.

In [5]:
clean_isbn()

 978-3-518-58773-7 


Now let's clean up the ISBN within the function. It has spaces before and after it, which the strip() method removes.

In [8]:
def clean_isbn():
    isbn = " 978-3-518-58773-7 "
    isbn = isbn.strip()
    print(isbn)
clean_isbn()

978-3-518-58773-7
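
Note that strip() only touches the two ends of a string; whitespace inside the string stays where it is. The short sketch below illustrates this together with the related lstrip() and rstrip() methods. The variable name padded and the string with inner spaces are made up for this illustration.

```python
padded = " 978-3-518-58773-7 "

print(repr(padded.strip()))   # removes whitespace at both ends
print(repr(padded.lstrip()))  # removes whitespace at the left end only
print(repr(padded.rstrip()))  # removes whitespace at the right end only

# Whitespace inside the string is not affected by strip()
print(repr(" 978 3 518 ".strip()))
```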


At the moment, the function only works for one specific ISBN, because that ISBN is hard-coded in the function body. To change this, we give the function a parameter so that it accepts any ISBN as input.

In [9]:
def clean_isbn(isbn):
    isbn = isbn.strip()    
    print(isbn) 
clean_isbn("ISBN 978-1-59327-603-4")

ISBN 978-1-59327-603-4


Now we want to remove ‘ISBN’ as well. We can do this with the replace() method. 

In [10]:
def clean_isbn(isbn):
    isbn = isbn.replace("ISBN", "")
    isbn = isbn.strip()    
    print(isbn) 
clean_isbn("ISBN 978-1-59327-603-4")

978-1-59327-603-4
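
One caveat is worth knowing: replace() is case-sensitive, so a lower-case prefix such as isbn would not be removed by the call above. The sketch below shows this and one possible workaround. The lower-case input is a made-up example and not part of the dataset.

```python
text = "isbn 978-1-59327-603-4"  # hypothetical lower-case variant

# No match: "ISBN" and "isbn" are different strings, so nothing is removed
unchanged = text.replace("ISBN", "").strip()
print(unchanged)

# One possible workaround: convert to upper case before replacing
cleaned = text.upper().replace("ISBN", "").strip()
print(cleaned)
```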


So far, the function only prints the processed ISBN to the console. Because it has no explicit return statement, its return value is None, so the result cannot be stored in a variable for later use. To change this, we use return in the next step.

In [11]:
def clean_isbn(isbn):
    isbn = isbn.replace("ISBN", "")
    isbn = isbn.strip()    
    return isbn 
cleaned_isbn = clean_isbn("ISBN 978-1-59327-603-4")
print(cleaned_isbn)

978-1-59327-603-4


The returned ISBN is now stored in the cleaned_isbn variable and then printed. Because it is stored as a value, we can reuse it in other parts of our code.

The difference lies in the use of print versus return. The first version only writes the processed ISBN to the console, while the second version hands it back to the caller so that it can be used elsewhere.
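
To make the difference concrete, the following sketch places both variants side by side and stores what each call gives back. The function names clean_and_print and clean_and_return are chosen only for this comparison.

```python
def clean_and_print(isbn):
    # Prints the result, but has no return statement
    print(isbn.replace("ISBN", "").strip())

def clean_and_return(isbn):
    # Hands the result back to the caller
    return isbn.replace("ISBN", "").strip()

printed = clean_and_print("ISBN 978-1-59327-603-4")
returned = clean_and_return("ISBN 978-1-59327-603-4")

print(printed)   # None
print(returned)  # 978-1-59327-603-4
```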

Now let's apply the function to all our ISBNs. We can do this using a for loop that iterates through our list of ISBNs.

In [12]:
for isbn in dirty_isbns:
    cleaned_isbn = clean_isbn(isbn)
    print(cleaned_isbn)

978-3-518-58773-7
___978-1-491-94600-8
978-1-59327-603-4


This looks quite good, but the second entry still contains underscores, so we extend the function to remove them as well.

In [13]:
def clean_isbn(isbn):
    isbn = isbn.replace("ISBN", "")
    isbn = isbn.replace("_", "")
    isbn = isbn.strip()  
    return isbn 

for isbn in dirty_isbns:
    cleaned_isbn = clean_isbn(isbn)
    print(cleaned_isbn)

978-3-518-58773-7
978-1-491-94600-8
978-1-59327-603-4
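
The chained replace() calls can also be expressed with a regular expression. The sketch below is one possible variant, not a replacement for the function above: it assumes the prefix always sits at the start of the string, and it removes only that prefix (in any letter case) together with the spaces or underscores that follow it. The names clean_isbn_regex and cleaned_isbns are chosen for this illustration. A list comprehension collects the results so that they can be reused instead of only printed.

```python
import re

# Same list as above, repeated so that this snippet is self-contained
dirty_isbns = [
    " 978-3-518-58773-7 ",
    "ISBN___978-1-491-94600-8 ",
    "ISBN 978-1-59327-603-4"]

def clean_isbn_regex(isbn):
    # Remove an optional leading "ISBN" (any case) and any spaces
    # or underscores that directly follow it
    isbn = re.sub(r"^\s*ISBN[\s_]*", "", isbn, flags=re.IGNORECASE)
    return isbn.strip()

# Collect the cleaned values in a new list
cleaned_isbns = [clean_isbn_regex(isbn) for isbn in dirty_isbns]
print(cleaned_isbns)
```

Unlike the replace() version, this pattern leaves underscores or the letters ISBN untouched if they appear later in the string, which may or may not be desirable depending on the data.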


## Summary

In this notebook you learned to:
- Detect and correct formatting inconsistencies in ISBN numbers  
- Differentiate between ISBN-10 and ISBN-13 formats  
- Clean ISBN data using string manipulation and regular expressions  

Accurate ISBN validation is a crucial step in preparing bibliographic or publishing metadata for analysis, cataloging, or data integration tasks.
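
As a pointer towards the validation step, the sketch below checks the check digit of an already cleaned ISBN. It relies on the standard check-digit rules: for ISBN-13, the digits are weighted alternately with 1 and 3 and the sum must be divisible by 10; for ISBN-10, the weights run from 10 down to 1, a final X stands for the value 10, and the sum must be divisible by 11. The function names normalize, is_valid_isbn13, and is_valid_isbn10 are assumptions made for this example, and the ISBN-10 shown is a commonly cited sample value rather than an entry from the dataset above.

```python
import re

def normalize(isbn):
    # Drop hyphens and spaces so that only digits (and a possible X) remain
    return isbn.replace("-", "").replace(" ", "").upper()

def is_valid_isbn13(isbn):
    digits = normalize(isbn)
    if not re.fullmatch(r"[0-9]{13}", digits):
        return False
    # Weights alternate 1, 3, 1, 3, ...; the sum must be divisible by 10
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0

def is_valid_isbn10(isbn):
    chars = normalize(isbn)
    if not re.fullmatch(r"[0-9]{9}[0-9X]", chars):
        return False
    # Weights run from 10 down to 1; a final "X" counts as 10
    values = [10 if c == "X" else int(c) for c in chars]
    total = sum(v * w for v, w in zip(values, range(10, 0, -1)))
    return total % 11 == 0

print(is_valid_isbn13("978-3-518-58773-7"))  # True
print(is_valid_isbn13("978-3-518-58773-8"))  # False (wrong check digit)
print(is_valid_isbn10("0-306-40615-2"))      # True
```

The length check is also the simplest way to tell the two formats apart: after normalization, an ISBN-13 has 13 characters and an ISBN-10 has 10.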

---

## Further reading


- [Python regular expressions documentation](https://docs.python.org/3/library/re.html)  
- [pandas string handling guide](https://pandas.pydata.org/docs/user_guide/text.html)  
- [Real Python – Regular Expressions in Python](https://realpython.com/regex-python/)

---

## Citation and licence

Föstner, K.; Müller, R. (2025). *Cleaning and validating ISBN numbers with Python* [Jupyter Notebook].  
Creative Commons Attribution 4.0 International (CC BY 4.0).
