
# Academy data pipeline -- Data extraction from text to CSV

<h3>Extracting data from module text and saving it in a DataFrame: approach</h3>
<ol type="1">
  <li>Read the document and split it into parts (sections, slides, texts, diagrams/photos).</li>
  <li>Structure the data: parse it following the hierarchical structure of the document.</li>
  <li>Create a DataFrame: convert the structured data into a pandas DataFrame.</li>
</ol>

In [1]:
# Import libraries 
from docx import Document 
import pandas as pd
import os
import re

Read the document

In [2]:
# read the test .docx file
docx_path = 'Modules/sea-test.docx'

# load the document
document = Document(docx_path)

# extract text from the document
doc_text = []
for para in document.paragraphs:
    doc_text.append(para.text)

# print the text
print('\n'.join(doc_text))

#5.3.2 Energy Transition

*Slide* [Innovation in the Just Energy Transition]

**Photo**
img=access-1

**Text-narrow**

Energy lorem ipsum alo vil cu li li oppo energy sustainable etc Energy lorem ipsum alo vil cu li li oppo energy sustainable etc Energy lorem ipsum alo vil cu li li oppo energy sustainable 


**Text-wide**
Energy lorem ipsum alo vil cu li li oppo 
Energy lorem ipsum alo vil cu li li oppo energy sustainable etc
Energy lorem ipsum alo vil cu li li oppo energy sustainable etc

*Slide* [Energy Access]

**Photo-Box**

img=access-3

**Text**

Energy lorem ipsum alo vil 

**Diagram**

dgm=transition-2

Here the diagram shows that renewable energy is good.


**Text-dark-6**

Energy lorem ipsum alo vil cu li li oppo energy sustainable etc
Energy lorem ipsum alo vil cu li li oppo energy sustainable etc
Energy lorem ipsum alo vil cu li li oppo energy sustainable etc


**Text-dark-4**

Energy lorem ipsum alo vil cu li li oppo energy sustainable etc Energy lorem ipsum alo vil cu li 
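
The printed text follows a few markup conventions: `#` opens a section, `*Slide* [title]` opens a slide, `**Type**` names a content block, and `img=` / `dgm=` lines carry asset codes. A minimal sketch of how each line could be classified with `re` is shown here. The patterns are inferred from the sample text only, and `classify_line` and the pattern names are hypothetical, not part of the notebook.

```python
import re

# Hypothetical patterns, inferred from the sample module text.
SECTION_RE = re.compile(r"^#\s*(?P<number>[\d.]+)\s+(?P<title>.+)$")
SLIDE_RE = re.compile(r"^\*Slide\*\s*\[(?P<title>[^\]]*)\]$")
MARKER_RE = re.compile(r"^\*\*(?P<type>[^*]+)\*\*$")
CODE_RE = re.compile(r"^(?P<prefix>img|dgm)=(?P<code>\S+)$")


def classify_line(line):
    """Return (kind, fields) for one line of module text."""
    line = line.strip()
    for kind, pattern in (
        ("section", SECTION_RE),
        ("slide", SLIDE_RE),
        ("marker", MARKER_RE),
        ("code", CODE_RE),
    ):
        match = pattern.match(line)
        if match:
            return kind, match.groupdict()
    # Anything that is not markup is treated as body text.
    return "body", {"text": line}


kind, fields = classify_line("*Slide* [Energy Access]")
```

Keeping the patterns in one place makes it easier to adjust the parser if the authoring conventions change, for example if further asset prefixes are introduced.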

Structure the data

In [3]:
def parse_docx(doc_path):
    doc = Document(doc_path)
    data = []
    current_section = None
    section_title = None
    slide_count = 0
    current_slide = None
    slide_title = None

    for para in doc.paragraphs:
        text = para.text.strip()
        if not text:
            continue

        if text.startswith('#'):
            # Section header
            current_section = text[1:].split()[0]  # Full section number e.g., 5.3.2
            section_title = ' '.join(text[1:].split()[1:])
            data.append({
                "Type": "Text",
                "Code": "",
                "Section Number": current_section,
                "Section Title": section_title,
                "Slide Number": "",
                "Slide Title": "",
                "Text": section_title,
                "Order": len(data) + 1
            })
        elif '*Slide*' in text:
            # Slide header
            slide_count += 1
            current_slide = slide_count
            slide_title = text.split('[', 1)[1].split(']')[0]
            data.append({
                "Type": "Text",
                "Code": "",
                "Section Number": current_section,
                "Section Title": section_title,
                "Slide Number": current_slide,
                "Slide Title": slide_title,
                "Text": slide_title,
                "Order": len(data) + 1
            })
        else:
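            # Identify the content type from a **Type** marker and handle image codes.
            # Note: each docx paragraph is processed on its own, so when a marker is
            # its own paragraph, para.text has no '\n\n' and content stays the marker text.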
            text_type = "Text"
            code = ""
            content = text
            if '**' in text:
                text_type = text.split('**')[1]
                content = para.text.split('\n\n', 1)[1].strip() if '\n\n' in para.text else text
                if 'img=' in content or 'dgm=' in content:
                    code = content.split('\n')[0].split('=')[1].strip()
                    content = '\n'.join(content.split('\n')[1:])
            
            data.append({
                "Type": text_type,
                "Code": code,
                "Section Number": current_section,
                "Section Title": section_title,
                "Slide Number": current_slide,
                "Slide Title": slide_title,
                "Text": content,
                "Order": len(data) + 1
            })

    return pd.DataFrame(data)

# Parse the test .docx file into a DataFrame and display it
df = parse_docx(docx_path)
df


|   | Type | Code | Section Number | Section Title | Slide Number | Slide Title | Text | Order |
|---|------|------|----------------|---------------|--------------|-------------|------|-------|
| 0 | Text |  | 5.3.2 | Energy Transition |  |  | Energy Transition | 1 |
| 1 | Text |  | 5.3.2 | Energy Transition | 1.0 | Innovation in the Just Energy Transition | Innovation in the Just Energy Transition | 2 |
| 2 | Photo |  | 5.3.2 | Energy Transition | 1.0 | Innovation in the Just Energy Transition | `**Photo**` | 3 |
| 3 | Text |  | 5.3.2 | Energy Transition | 1.0 | Innovation in the Just Energy Transition | img=access-1 | 4 |
| 4 | Text-narrow |  | 5.3.2 | Energy Transition | 1.0 | Innovation in the Just Energy Transition | `**Text-narrow**` | 5 |
| 5 | Text |  | 5.3.2 | Energy Transition | 1.0 | Innovation in the Just Energy Transition | Energy lorem ipsum alo vil cu li li oppo energ... | 6 |
| 6 | Text-wide |  | 5.3.2 | Energy Transition | 1.0 | Innovation in the Just Energy Transition | `**Text-wide**` | 7 |
| 7 | Text |  | 5.3.2 | Energy Transition | 1.0 | Innovation in the Just Energy Transition | Energy lorem ipsum alo vil cu li li oppo | 8 |
| 8 | Text |  | 5.3.2 | Energy Transition | 1.0 | Innovation in the Just Energy Transition | Energy lorem ipsum alo vil cu li li oppo energ... | 9 |
| 9 | Text |  | 5.3.2 | Energy Transition | 1.0 | Innovation in the Just Energy Transition | Energy lorem ipsum alo vil cu li li oppo energ... | 10 |
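
In the table above, the `**Photo**` marker and the `img=access-1` line land in separate rows, so the `Code` column stays empty and the body paragraphs fall back to type `Text`. One possible refinement is to remember the most recent marker and apply it to the paragraphs that follow. The sketch below works on a plain list of strings rather than a `.docx` file; `parse_lines` is a hypothetical helper, and giving each asset code its own row is an assumption about the desired output, not the notebook's method.

```python
def parse_lines(lines):
    """Turn marked-up lines into row dicts, carrying the block type forward."""
    rows = []
    state = {"Section Number": "", "Section Title": "",
             "Slide Number": "", "Slide Title": ""}
    slide_count = 0
    block_type = "Text"

    def add_row(row_type, text, code=""):
        rows.append({"Type": row_type, "Code": code, **state,
                     "Text": text, "Order": len(rows) + 1})

    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        if text.startswith("#"):
            # Section header, e.g. "#5.3.2 Energy Transition"
            number, _, title = text[1:].strip().partition(" ")
            state.update({"Section Number": number, "Section Title": title.strip(),
                          "Slide Number": "", "Slide Title": ""})
            block_type = "Text"
            add_row("Text", title.strip())
        elif text.startswith("*Slide*"):
            # Slide header, e.g. "*Slide* [Energy Access]"
            slide_count += 1
            title = text.partition("[")[2].partition("]")[0]
            state.update({"Slide Number": slide_count, "Slide Title": title})
            block_type = "Text"
            add_row("Text", title)
        elif text.startswith("**") and text.endswith("**") and len(text) > 4:
            # Block marker: no row of its own, only remember the type
            block_type = text[2:-2].strip()
        elif text.startswith(("img=", "dgm=")):
            # Asset code: a row with Code filled in and no text
            add_row(block_type, "", code=text.split("=", 1)[1].strip())
        else:
            add_row(block_type, text)
    return rows


sample = [
    "#5.3.2 Energy Transition",
    "*Slide* [Innovation in the Just Energy Transition]",
    "**Photo**",
    "img=access-1",
    "**Text-narrow**",
    "Energy lorem ipsum",
]
rows = parse_lines(sample)
```

The row dicts use the same column names as `parse_docx`, so `pd.DataFrame(parse_lines(doc_text))` should give a comparable frame for the paragraphs read earlier.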


Use an NLP library to identify sentences and paragraphs

In [5]:
# Requires the model to be installed first: python -m spacy download en_core_web_sm
import spacy


In [6]:
# Load the spaCy model for natural language processing
nlp = spacy.load('en_core_web_sm')
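
With a pipeline loaded, sentence boundaries are available through `doc.sents`. The sketch below shows how a text cell could be split into sentences. It assumes spaCy v3 and, so that it runs without the model download, uses a blank English pipeline with the rule-based `sentencizer` component instead of `en_core_web_sm`. `split_sentences` and `nlp_rules` are hypothetical names.

```python
import spacy

# Assumption: a blank pipeline plus the rule-based sentencizer stands in
# for en_core_web_sm, so no model download is needed for this sketch.
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")


def split_sentences(text, pipeline=nlp_rules):
    """Return the sentences found in `text` as plain strings."""
    return [sent.text.strip() for sent in pipeline(text).sents]


sentences = split_sentences(
    "Here the diagram shows that renewable energy is good. Energy access matters."
)
```

Passing `pipeline=nlp` would use the model loaded above instead, which derives sentence boundaries from its parser rather than from punctuation alone.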

