# Extracting annotations from a Kobo file

Annotations made in a Kobo ebook are stored in a `.epub.annot` file. To get this file, connect your Kobo to your computer and open it in your file browser (you might need to press the `Connect` button on the Kobo). Mine was in a folder called `Digital Editions`.
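
If the folder is named differently on your device, searching the whole mounted drive may be quicker than browsing. Below is a minimal sketch using only the standard library; `find_annotation_files` is a hypothetical helper, and the mount point in the comment is an assumption that depends on your operating system.

```python
from pathlib import Path


def find_annotation_files(mount_point):
    """Return every *.annot file below mount_point, sorted by path."""
    root = Path(mount_point)
    # rglob searches all subfolders, so the folder name does not matter
    return sorted(p for p in root.rglob("*.annot") if p.is_file())


# Example call (hypothetical mount point, replace with where your Kobo shows up):
# for path in find_annotation_files("/media/KOBOeReader"):
#     print(path)
```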

The file from Kobo is a simple XML file with the following structure:

```
<annotationSet ...>
    <publication>...</publication>
    <annotation>
        <dc:identifier>urn:uuid:4e5f9ac1-f3e2-4b64-82fb-f8ddeab6d835</dc:identifier>
        <dc:date>2024-08-27T17:26:06Z</dc:date>
        <dc:creator>urn:uuid:dd417800-d0d1-471f-acaf-74a460c73b28</dc:creator>
        <target>
            <fragment start="Barbery,Muriel-L'elegance%20du%20herisson(2006).French.ebook.AlexandriZ_split_005.html#point(/1/4/34/1:115)" end="Barbery,Muriel-L'elegance%20du%20herisson(2006).French.ebook.AlexandriZ_split_005.html#point(/1/4/34/1:183)" progress="0.049763">
                <text>Mais le monde tel qu’il
est n’est pas fait pour les princesses. </text>
            </fragment>
        </target>
    </annotation>
    <annotation>
        ...
    </annotation>
...
```
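
The `start` and `end` attributes in the sample above appear to combine a percent-encoded file name with a `point(...)` locator that ends in a character offset. That reading is my interpretation of the sample, not a documented format. Under that assumption, a small sketch for splitting such a value into its parts could look like this (`parse_pointer` and `POINTER_RE` are hypothetical names):

```python
import re
from urllib.parse import unquote

# Pattern for values like "some_file.html#point(/1/4/34/1:115)"
POINTER_RE = re.compile(r'^(?P<file>.*)#point\((?P<path>[^:)]*):(?P<offset>\d+)\)$')


def parse_pointer(value):
    """Split a fragment start/end value into (file, path, offset), or return None."""
    match = POINTER_RE.match(value)
    if match is None:
        return None
    # File names are percent-encoded (%20 for a space), so decode them
    return unquote(match.group('file')), match.group('path'), int(match.group('offset'))


start = "Barbery,Muriel-L'elegance%20du%20herisson(2006).French.ebook.AlexandriZ_split_005.html#point(/1/4/34/1:115)"
print(parse_pointer(start))
```

Values that do not match the pattern return `None` rather than raising, so a file with a different locator style would not crash the extraction.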

In [18]:
%pip install -q \
    pandas

Note: you may need to restart the kernel to use updated packages.


In [3]:
import xml.etree.ElementTree as ET
import re

In [4]:
with open('annotations.epub.annot', 'r', encoding='utf-8') as file:
    annotations_file_content = file.read()

In [23]:
# Parse XML
root = ET.fromstring(annotations_file_content)

# Define namespace dictionary to access elements with namespaces
ns = {
    'dc': 'http://purl.org/dc/elements/1.1/',
    '': 'http://ns.adobe.com/digitaleditions/annotations'
}

# Book title and author come from the single <publication> element,
# so look them up once rather than on every iteration
publication = root.find('.//publication', ns)
book_title = publication.find('dc:title', ns).text
book_author = publication.find('dc:creator', ns).text
book = f"{book_title} ({book_author})"

# Extract each quote and the index of the EPUB split file it came from
quotes = []
for annotation in root.findall('.//annotation', ns):
    fragment = annotation.find('.//target/fragment', ns)
    text = fragment.find('text', ns).text.strip()
    # Join hard-wrapped lines: replace a newline with a space unless another newline follows
    text = re.sub(r'\n(?!\n)', ' ', text)
    fragment_start = fragment.attrib['start']

    # The "page" is the number in the split file name (e.g. split_005 -> 5),
    # not a printed page number
    page_match = re.search(r'split_(\d+)', fragment_start)
    page = str(int(page_match.group(1))) if page_match else "unknown"

    # Append the book, quote and page as one row
    quotes.append([book, text, page])

print("Here's a sample \n")

# Display a sample of quotes
for quote in quotes[:5]:
    print(quote)

Here's a sample 

["L'élégance du hérisson (Barbery,Muriel)", 'Mais le monde tel qu’il est n’est pas fait pour les princesses.', '5']
["L'élégance du hérisson (Barbery,Muriel)", 'Il ne reste plus qu’à s’anesthésier comme on peut en tentant de se masquer le fait qu’on ne trouve aucun sens à sa vie et on trompe ses propres enfants pour tenter de mieux se convaincre soi-même.', '5']
["L'élégance du hérisson (Barbery,Muriel)", 'Apparemment, de temps en temps, les adultes prennent le temps de s’asseoir et de contempler le désastre qu’est leur vie.', '5']
["L'élégance du hérisson (Barbery,Muriel)", 'en cherchant toujours la même chose\xa0: des moments compacts où un joueur devenait son propre mouvement sans avoir besoin de se fragmenter en se dirigeant vers.', '8']
["L'élégance du hérisson (Barbery,Muriel)", 'les coucougnettes.', '17']
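
To see what the `re.sub(r'\n(?!\n)', ' ', text)` step does to a quote, here is a small standalone sketch. `clean_quote` is a hypothetical wrapper around the same two operations used in the cell above, and the first input is the `<text>` content from the sample XML.

```python
import re


def clean_quote(text):
    """Strip surrounding whitespace and join hard-wrapped lines."""
    text = text.strip()
    # Replace a newline with a space unless it is directly followed by another newline
    return re.sub(r'\n(?!\n)', ' ', text)


raw = "Mais le monde tel qu’il\nest n’est pas fait pour les princesses. "
print(clean_quote(raw))

# With a blank line, only the second newline is followed by text and gets replaced
print(repr(clean_quote("first paragraph\n\nsecond paragraph")))
```

Note that a blank line between paragraphs is only partly preserved: the first newline stays and the second becomes a space. If quotes spanning several paragraphs matter to you, that may be worth adjusting.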


In [None]:
# No need for CSV, copy to clipboard instead

# import csv

# columns = ['Book', 'Quote', 'Page']
# def create_csv():
#     with open('quotes.csv', 'w', newline='', encoding='utf-8') as csvfile:
#         writer = csv.writer(csvfile)
#         writer.writerow(columns)
#         writer.writerows(quotes)

# # Example usage
# create_csv()

In [22]:
import pandas as pd

columns = ['Book', 'Quote', 'Page']

# Create a DataFrame from the quotes
df_quotes = pd.DataFrame(quotes, columns=columns)

# Copy the DataFrame to the clipboard
df_quotes.to_clipboard(index=False)
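
`to_clipboard` relies on a system clipboard, which may be missing when the kernel runs on a remote or headless machine. As a fallback, the same rows can be turned into tab-separated text with the standard library and pasted into a spreadsheet by hand. This is only a sketch: `rows_to_tsv` is a hypothetical helper, and `sample_quotes` stands in for the `quotes` list built earlier.

```python
import csv
import io


def rows_to_tsv(rows, columns):
    """Return the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-in row in the same shape as the `quotes` list built earlier
sample_quotes = [["L'élégance du hérisson (Barbery,Muriel)", 'les coucougnettes.', '17']]
print(rows_to_tsv(sample_quotes, ['Book', 'Quote', 'Page']))
```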