In [1]:
from bs4 import BeautifulSoup

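def clean_markdown_for_notion(markdown_input):
    """Convert a small HTML/Markdown mix into plain Markdown text for Notion.

    Only top-level <b> and <ol> tags get special handling. Every other tag,
    including a top-level <ul>, is flattened to its text via get_text().
    """
    soup = BeautifulSoup(markdown_input, 'html.parser')
    output_lines = []

    def convert_list(tag, indent=''):
        # Return the output lines for an <ol> or a single <li>.
        lines = []
        if tag.name == 'ol':
            # Number only the direct <li> children; nesting is handled per item.
            for i, li in enumerate(tag.find_all('li', recursive=False), start=1):
                subtext = convert_list(li, indent + '   ')
                lines.append(f"{indent}{i}. {subtext[0].strip()}")
                lines.extend(subtext[1:])
        elif tag.name == 'li':
            parts = []
            for child in tag.children:
                if isinstance(child, str):
                    parts.append(child.strip())
                elif child.name == 'br':
                    # A line break continues the item on an indented new line.
                    parts.append('\n' + indent + '   ')
                elif child.name in ['ol', 'ul']:
                    # Note: a nested <ul> yields no lines here, because
                    # convert_list has no 'ul' branch.
                    sublist = convert_list(child, indent + '   ')
                    parts.append('\n' + '\n'.join(sublist))
                # Any other inline tag inside <li> is silently skipped.
            return [''.join(parts)]
        return lines

    for element in soup:
        if element.name is None:
            # Bare text between tags.
            output_lines.append(element.strip())
        elif element.name == 'b':
            output_lines.append(f"**{element.text.strip()}**")
        elif element.name == 'ol':
            output_lines.extend(convert_list(element))
        else:
            # Fallback for all other tags, <ul> included: plain text only.
            output_lines.append(element.get_text().strip())

    return '\n'.join(output_lines)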
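
The function above has no branch for `<ul>`, so unordered lists come out as plain text without bullet markers or nesting. The block below is a minimal sketch of one way to render both `<ul>` and `<ol>`, including nested lists, as Markdown. It is not a fix to the function above: it swaps BeautifulSoup for the standard library's `html.parser.HTMLParser`, and the names `ListMarkdownParser` and `html_lists_to_markdown` are hypothetical. It assumes reasonably well-formed input and a four-space indent per nesting level.

```python
from html.parser import HTMLParser


class ListMarkdownParser(HTMLParser):
    """Hypothetical helper: renders <b>, <br>, <ul>, <ol>, <li> as Markdown lines."""

    def __init__(self):
        super().__init__()
        self.lines = []   # finished output lines
        self.buf = []     # text fragments of the line being built
        self.prefix = ''  # indent plus list marker for the current line
        self.stack = []   # one [tag, item_count] pair per open list

    def _flush(self):
        # Emit the pending line unless it is empty or whitespace only.
        text = ''.join(self.buf).strip()
        if text:
            self.lines.append(self.prefix + text)
            # Continuation lines keep the indent but drop the marker.
            self.prefix = ' ' * len(self.prefix)
        self.buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ('ul', 'ol'):
            self._flush()
            self.stack.append([tag, 0])
        elif tag == 'li' and self.stack:
            self._flush()
            self.stack[-1][1] += 1
            kind, count = self.stack[-1]
            marker = f'{count}.' if kind == 'ol' else '-'
            # Assumption: four spaces of indent per nesting level.
            self.prefix = '    ' * (len(self.stack) - 1) + marker + ' '
        elif tag == 'b':
            self.buf.append('**')
        elif tag == 'br':
            self._flush()

    def handle_endtag(self, tag):
        if tag in ('ul', 'ol') and self.stack:
            self._flush()
            self.stack.pop()
            self.prefix = '    ' * len(self.stack)
        elif tag == 'li':
            self._flush()
        elif tag == 'b':
            self.buf.append('**')

    def handle_data(self, data):
        # A newline in the source ends the current output line.
        first, *rest = data.split('\n')
        self.buf.append(first)
        for piece in rest:
            self._flush()
            self.buf.append(piece)


def html_lists_to_markdown(html):
    parser = ListMarkdownParser()
    parser.feed(html)
    parser.close()    # deliver any text still buffered by the parser
    parser._flush()   # emit the last pending line
    return '\n'.join(parser.lines)
```

Calling `html_lists_to_markdown(html_md)` on an input with nested `<ul>` blocks should give each `<li>` a `- ` marker, with nested items indented one level deeper. Blank lines in the source are dropped rather than preserved.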


In [7]:
html_md = """
You should only drop whole columns if most entries in the column are empty. In the data set, none of the columns are empty enough to drop entirely.
You have some freedom in choosing which method to replace data; however, some methods may seem more reasonable than others. Apply each method to different columns:

<b>Replace by mean:</b>
<ul>
    <li>"normalized-losses": 41 missing data, replace them with mean</li>
    <li>"stroke": 4 missing data, replace them with mean</li>
    <li>"bore": 4 missing data, replace them with mean</li>
    <li>"horsepower": 2 missing data, replace them with mean</li>
    <li>"peak-rpm": 2 missing data, replace them with mean</li>
</ul>

<b>Replace by frequency:</b>
<ul>
    <li>"num-of-doors": 2 missing data, replace them with "four". 
        <ul>
            <li>Reason: 84% sedans are four doors. Since four doors is most frequent, it is most likely to occur</li>
        </ul>
    </li>
</ul>

<b>Drop the whole row:</b>
<ul>
    <li>"price": 4 missing data, simply delete the whole row
        <ul>
            <li>Reason: You want to predict price. You cannot use any data entry without price data for prediction; therefore any row now without price data is not useful to you.</li>
        </ul>
    </li>
</ul>


"""

In [8]:
print(clean_markdown_for_notion(html_md))

You should only drop whole columns if most entries in the column are empty. In the data set, none of the columns are empty enough to drop entirely.
You have some freedom in choosing which method to replace data; however, some methods may seem more reasonable than others. Apply each method to different columns:
**Replace by mean:**

"normalized-losses": 41 missing data, replace them with mean
"stroke": 4 missing data, replace them with mean
"bore": 4 missing data, replace them with mean
"horsepower": 2 missing data, replace them with mean
"peak-rpm": 2 missing data, replace them with mean

**Replace by frequency:**

"num-of-doors": 2 missing data, replace them with "four". 
        
Reason: 84% sedans are four doors. Since four doors is most frequent, it is most likely to occur

**Drop the whole row:**

"price": 4 missing data, simply delete the whole row
        
Reason: You want to predict price. You cannot use any data entry without price data for prediction; therefore any row now wi