# Extract Text from a Webpage

We presented this [tool](https://crosscompute.com/t/ymxigXFmpg2YWlXCov1jaMTBIc1m5liU) and [notebook](https://crosscompute.com/n/BhkK4AlpPD4Hmn8O0mbDtrH0q4HzphfN/-/extract-text-from-a-webpage) as part of our workshop on [Computational Approaches to Fight Human Trafficking](https://www.meetup.com/spatiotemporal-analysis-for-community-health-and-safety/events/244179401).

We use [requests](http://docs.python-requests.org) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup) to perform the following steps:

1. Get HTML
2. Extract the title
3. Extract the body text, dropping script content and HTML tags
4. Replace multiple newlines with a single newline
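
The four steps above can be sketched as a single function that works on an HTML string, which makes it possible to try the logic without a network request. This is a minimal sketch, not the notebook's exact code: the function name `extract_title_and_text` and the sample HTML are hypothetical, and it assumes the built-in `html.parser` backend rather than `lxml`, so that it runs without extra dependencies beyond BeautifulSoup.

```python
import re

from bs4 import BeautifulSoup


def extract_title_and_text(html):
    """Return (title, body_text) for an HTML string or bytes (hypothetical helper)."""
    # Assumption: use the parser bundled with Python instead of lxml
    soup = BeautifulSoup(html, 'html.parser')
    # Extract the title, tolerating pages that have no <title> element
    title_tag = soup.find('title')
    title = title_tag.get_text().strip() if title_tag else ''
    body = soup.find('body')
    if body is None:
        return title, ''
    # Remove script content so that JavaScript does not leak into the text
    for tag in body.find_all('script'):
        tag.decompose()
    # Extract text without tags, placing a newline between text nodes
    text = body.get_text(separator='\n').strip()
    # Replace multiple newlines with a single newline
    text = re.sub(r'\n+', '\n', text)
    return title, text


# Made-up sample page for illustration
sample_html = (
    '<html><head><title>Example</title></head>'
    '<body><h1>Hello</h1><script>var x = 1;</script>'
    '<p>World</p></body></html>')
title, text = extract_title_and_text(sample_html)
print('title = ' + title)
print(text)
```

For a live page, the same function could be called with `requests.get(url).content` in place of `sample_html`.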

In [None]:
# CrossCompute
url = 'https://www.unodc.org'
target_folder = '/tmp'

In [None]:
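import requests

# Fetch the page; fail on timeouts and HTTP error statuses
response = requests.get(url, timeout=30)
response.raise_for_status()
# Keep the raw bytes so that BeautifulSoup can detect the encoding
html = response.content
html[:200]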

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
title = soup.find('title')
# Some pages have no <title> element
print('title = ' + (title.text.strip() if title else ''))

In [None]:
body = soup.find('body')
# Remove script content
for x in body.find_all('script'):
    x.decompose()
# Extract text without tags
text = body.get_text(separator='\n').strip()
print(text[:70])

In [None]:
import re

# Replace multiple newlines with a single newline
pattern = re.compile(r'\n+')
text = pattern.sub('\n', text)
print(text[:89])

In [None]:
target_path = target_folder + '/raw.txt'
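# Use a context manager so that the file is flushed and closed
with open(target_path, 'wt', encoding='utf-8') as target_file:
    target_file.write(text)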
print('body_text_path = ' + target_path)
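
Text extracted with a newline separator often contains lines that hold only spaces or tabs, and a pattern such as `\n+` does not collapse those. A possible variant, sketched here as a hypothetical helper `collapse_blank_lines` with made-up sample text, drops every whitespace-only line. Note that it also trims leading and trailing whitespace on each remaining line, which goes slightly beyond the newline replacement step.

```python
def collapse_blank_lines(text):
    """Drop lines that are empty or contain only whitespace (hypothetical helper)."""
    # splitlines() also treats \r\n and a few other separators as line breaks
    lines = (line.strip() for line in text.splitlines())
    # Keep only the lines that still have content after trimming
    return '\n'.join(line for line in lines if line)


# Made-up sample resembling text extracted from a page
sample_text = 'Home\n\n   \n\t\nAbout us  \n\n\nContact'
print(collapse_blank_lines(sample_text))
```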