# Web Scraping and Cleaning

In this notebook we scrape the text of any given URL and then clean it with regular expressions. We remove unwanted whitespace (extra spaces, \n, \t) and citation marks such as [6] or [a], so that only the useful text remains.

In [10]:
import bs4 as bs
import requests
import re

In [11]:
URL = 'https://en.wikipedia.org/wiki/Artificial_intelligence'
html_page = requests.get(URL).text
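
The cell above reads `.text` without checking whether the request succeeded, so an error page would be parsed like a normal one. A minimal sketch of a more defensive fetch follows. The name `fetch_html`, the injectable `get` parameter, and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
def fetch_html(url, get=None, timeout=10):
    """Return the page body as text, raising if the server reports an error."""
    if get is None:
        # Imported lazily so the function can be exercised offline with a fake `get`.
        import requests
        get = requests.get
    response = get(url, timeout=timeout)
    # For a real requests.Response this raises requests.HTTPError on 4xx/5xx.
    response.raise_for_status()
    return response.text
```

With this helper, `html_page = fetch_html(URL)` could stand in for the direct `requests.get(URL).text` call.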

In [12]:
soup = bs.BeautifulSoup(html_page, 'lxml')

In [13]:
main_heading = soup.find('h1').text
main_heading

'Artificial intelligence'

In [14]:
heading_tags = soup.find_all('h2')
headings = []
for heading in heading_tags:
    h = heading.text
    h = h.replace("\n", "")
    headings.append(h)
headings

['Contents',
 'History',
 'Goals',
 'Tools',
 'Applications',
 'Philosophy',
 'Future',
 'In fiction',
 'Scientific diplomacy',
 'See also',
 'Explanatory notes',
 'Citations',
 'References']

In [15]:
subheading_tags = soup.find_all('h3')
subheadings = []
for subheading in subheading_tags:
    s = subheading.text
    s = s.replace("\n", "")
    subheadings.append(s)
subheadings

['Reasoning, problem-solving',
 'Knowledge representation',
 'Planning',
 'Learning',
 'Natural language processing',
 'Perception',
 'Motion and manipulation',
 'Social intelligence',
 'General intelligence',
 'Search and optimization',
 'Logic',
 'Probabilistic methods for uncertain reasoning',
 'Classifiers and statistical learning methods',
 'Artificial neural networks',
 'Specialized languages and hardware',
 'Defining artificial intelligence',
 'Evaluating approaches to AI',
 'Machine consciousness, sentience and mind',
 'Superintelligence',
 'Risks',
 'Ethical machines',
 'Regulation',
 'Warfare',
 'Cybersecurity',
 'Election security',
 'Future of work',
 'AI and foreign policy',
 'AI textbooks',
 'History of AI',
 'Other sources']

In [16]:
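para_tags = soup.find_all('p')
content = []
# Matches an opening bracket plus any word characters, or a closing bracket, e.g. "[6" and "]"
citation_pattern = r"([\[]\w*|[\]])"
for para in para_tags:
    p = para.text
    p = p.replace("\n", "")
    p = re.sub(citation_pattern, "", p)
    if p:  # skip paragraphs that are empty after cleaning
        content.append(p)
print(content[18])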

Many of these algorithms proved to be insufficient for solving large reasoning problems because they experienced a "combinatorial explosion": they became exponentially slower as the problems grew larger.Even humans rarely use the step-by-step deduction that early AI research could model. They solve most of their problems using fast, intuitive judgments.
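
The output above runs two sentences together ("larger.Even"). One likely cause is that `p.replace("\n", "")` deletes a newline that was the only separator between them. The sketch below compares that approach with collapsing whitespace into a single space instead. The sample string is made up for illustration and is not taken from the scraped page.

```python
import re

# Hypothetical paragraph text: a citation followed by a newline instead of a space.
sample = "they grew larger.[7]\nEven humans rarely use it."
citation = r"\[\w*\]"

# Deleting the newline glues the two sentences together.
glued = re.sub(citation, "", sample.replace("\n", ""))

# Replacing each run of whitespace with one space keeps them apart.
spaced = re.sub(citation, "", re.sub(r"\s+", " ", sample)).strip()

print(glued)   # they grew larger.Even humans rarely use it.
print(spaced)  # they grew larger. Even humans rarely use it.
```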


## Making it into a function

In [19]:
def get_content(URL):
    html_page = requests.get(URL).text
    soup = bs.BeautifulSoup(html_page, 'lxml')
    # Any run of whitespace (spaces, \r, \n, \t) is collapsed into a single space
    space_pattern = r'\s+'
    # Bracketed citation marks such as [6] or [a]
    citation_pattern = r'\[[0-9a-zA-Z]*\]'

    main_tags = soup.find_all('h1')
    main_headings = []
    for main in main_tags:
        m = re.sub(space_pattern, " ", main.text).strip()
        main_headings.append(m)

    heading_tags = soup.find_all('h2')
    headings = []
    for heading in heading_tags:
        h = re.sub(space_pattern, " ", heading.text).strip()
        headings.append(h)

    subheading_tags = soup.find_all('h3')
    subheadings = []
    for subheading in subheading_tags:
        s = re.sub(space_pattern, " ", subheading.text).strip()
        subheadings.append(s)

    para_tags = soup.find_all('p')
    content = []
    for para in para_tags:
        p = re.sub(space_pattern, " ", para.text)
        p = re.sub(citation_pattern, "", p).strip()
        if p:  # skip paragraphs that are empty after cleaning
            content.append(p)

    return (main_headings, headings, subheadings, content)

In [20]:
data = get_content('https://www.frontiersin.org/articles/10.3389/fpubh.2020.00014/full')

In [21]:
data[3][9]

'Air pollution has various health effects. The health of susceptible and sensitive individuals can be impacted even on low air pollution days. Short-term exposure to air pollutants is closely related to COPD (Chronic Obstructive Pulmonary Disease), cough, shortness of breath, wheezing, asthma, respiratory disease, and high rates of hospitalization (a measurement of morbidity).'
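
Indexing the result as `data[3][9]` works, but the meaning of position 3 is easy to forget. One option is to wrap the tuple in a named tuple so each part can be read by name. This is only a sketch: `ScrapedPage` is a hypothetical name, and the stand-in values below are made up so the example runs without a network request.

```python
from collections import namedtuple

# Hypothetical container; the fields follow the order returned by get_content.
ScrapedPage = namedtuple(
    "ScrapedPage", ["main_headings", "headings", "subheadings", "content"]
)

# Stand-in tuple with made-up values, shaped like get_content's return value.
raw_result = (
    ["Title"],
    ["Introduction"],
    ["Background"],
    ["First paragraph.", "Second paragraph."],
)

page = ScrapedPage(*raw_result)
print(page.content[1])  # same element as raw_result[3][1]
```

In the notebook this would look like `page = ScrapedPage(*get_content(URL))`, after which `page.content[9]` reads the same paragraph as `data[3][9]`.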

In [22]:
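def clean_text(text):
    # Collapse any run of whitespace (spaces, \r, \n, \t) into a single space
    space_pattern = r'\s+'
    # Remove bracketed citation marks such as [6] or [a]
    citation_pattern = r'\[[0-9a-zA-Z]*\]'
    text = re.sub(space_pattern, " ", text)
    text = re.sub(citation_pattern, "", text)

    return text.strip()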
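
The citation pattern in `clean_text` behaves differently from the one in the first paragraph-scraping cell, and the difference shows up on marks that contain a space. The comparison below uses a made-up sentence. The third pattern, `any_bracket_pattern`, is an assumption added for illustration and does not appear in the notebook.

```python
import re

early_pattern = r"([\[]\w*|[\]])"      # from the first paragraph-scraping cell
strict_pattern = r"\[[0-9a-zA-Z]*\]"   # used in clean_text
any_bracket_pattern = r"\[[^\]]*\]"    # hypothetical: anything between [ and ]

sample = "AI is broad.[6] Some claims lack sources.[citation needed]"

early = re.sub(early_pattern, "", sample)
strict = re.sub(strict_pattern, "", sample)
any_bracket = re.sub(any_bracket_pattern, "", sample)

print(early)        # AI is broad. Some claims lack sources. needed
print(strict)       # AI is broad. Some claims lack sources.[citation needed]
print(any_bracket)  # AI is broad. Some claims lack sources.
```

The early pattern leaves a stray word behind, the strict one keeps multi-word marks intact, and the broader one removes every bracketed span, including brackets that may be real content.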

In [25]:
with open('./SampleData/sample_text4.txt', encoding="utf8") as f:
    # Read the whole file into one string
    final_text = f.read()

final_text



In [26]:
clean_text(final_text)

