https://www.lexico.com

This notebook shows web scraping examples built with `urllib` and `lxml`.
The data is extracted from Lexico, an online English dictionary.



## Example page






<!--First column name -->  | <!--Second column name--> 
-------------------|------------------
<img src="https://user-images.githubusercontent.com/79875767/126273408-96951cf5-cc3b-40f4-8f1a-b886f2a35e3a.png" height=400/>      | <img src="https://user-images.githubusercontent.com/79875767/126273421-06699614-0115-4d0b-9d38-31bbb5806989.png" height=400/>
<img src="https://user-images.githubusercontent.com/79875767/126273428-feba8255-01cd-4eb1-b403-9eb414d091e3.png" height=400/>      | <img src="https://user-images.githubusercontent.com/79875767/128669591-27d43bf5-778a-409b-9cbd-e928259834f3.png" height=400/>

## Downloading and processing

In [1]:
from urllib.request import urlopen
from lxml.html import fromstring, tostring

In [2]:
# download the page and decode it in one step
with urlopen('https://www.lexico.com/definition/word') as response:
    broken_html = response.read().decode('utf-8')

In [3]:
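# lxml repairs malformed markup while parsing
tree = fromstring(broken_html)
# serialize the repaired tree and parse it again to work on clean HTML
fixed_html = tostring(tree, pretty_print=True).decode('utf-8')
tree=fromstring(fixed_html)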

In [4]:
tree

<Element html at 0x7f03368619b0>
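
The round trip in cell 3 (parse, serialize, parse again) works because lxml's HTML parser repairs malformed markup instead of rejecting it. A minimal offline sketch of that behaviour, using a made-up snippet with unclosed tags rather than the real page:

```python
from lxml.html import fromstring, tostring

# Hypothetical malformed snippet: neither <p> nor <div> is ever closed.
broken = '<div class="entryWrapper"><p>first<p>second'

tree = fromstring(broken)  # the parser closes the open tags itself
fixed = tostring(tree, pretty_print=True).decode('utf-8')
print(fixed)

# Both paragraphs are now separate, properly closed elements.
paragraphs = tree.xpath('.//p')
print(len(paragraphs))  # 2
```

Because the repair already happens during the first `fromstring` call, the second parse mainly guarantees that later queries run on the serialized, cleaned-up markup.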

## View page source




<img src="https://user-images.githubusercontent.com/79875767/126277852-44af137b-8acf-462f-b7f1-0e4d4b21613b.png"/> 

In [5]:
# selecting only part of the page for further search
# a div with class "entryWrapper"
# it's a parent node for elements containing information about given word

entry_tree=tree.xpath('//div[@class="entryWrapper"]')
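
One detail matters when searching from a selected node: in lxml, an XPath expression that starts with `//` is evaluated against the whole document even when it is called on a sub-element, while `.//` stays inside that element. A small sketch on a made-up page (the `sidebar` div is an assumption for illustration only):

```python
from lxml.html import fromstring

# Hypothetical page: one definition inside the wrapper, one outside it.
html = '''
<html><body>
  <div class="entryWrapper"><span class="ind">inside</span></div>
  <div class="sidebar"><span class="ind">outside</span></div>
</body></html>
'''
tree = fromstring(html)
wrapper = tree.xpath('//div[@class="entryWrapper"]')[0]

# '//' ignores the context node, './/' searches only below it.
absolute = [s.text_content() for s in wrapper.xpath('//span[@class="ind"]')]
relative = [s.text_content() for s in wrapper.xpath('.//span[@class="ind"]')]
print(absolute)  # ['inside', 'outside']
print(relative)  # ['inside']
```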

## Extracting definitions

<img src="https://user-images.githubusercontent.com/79875767/126279489-f884dbfe-9bf4-4d79-a604-4c830f092620.png"/> 



### Main definitions (broad meanings)

In [6]:
# the leading "." keeps the search inside the selected entryWrapper node
main_defs = entry_tree[0].xpath('.//section[@class="gramb"]/ul[@class="semb"]/li/div[@class="trg"]/p/span[@class="ind one-click-content"]')

In [7]:
for x in main_defs:
    print(x.text_content()) 

A single distinct meaningful element of speech or writing, used with others (or sometimes alone) to form a sentence and typically shown with a space on either side when written or printed.
A command, password, or signal.
One's account of the truth, especially when it differs from that of another person.
The text or spoken part of a play, opera, or other performed piece; a script.
A basic unit of data in a computer, typically 16 or 32 bits long.
Express (something spoken or written) in particular words.
Used to express agreement or affirmation.
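
The loop in cell 7 uses `text_content()` rather than `.text`. The difference shows up as soon as a definition contains nested tags: `.text` stops at the first child element, while `text_content()` also collects the text of all descendants. A short sketch with a made-up `span` (the `<em>` tag is an assumption, not taken from the real page):

```python
from lxml.html import fromstring

# Hypothetical definition with inline markup inside the span.
span = fromstring('<span class="ind">Express (<em>something spoken</em>) in words.</span>')

print(span.text)            # Express (
print(span.text_content())  # Express (something spoken) in words.
```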


### All definitions

In [8]:
# all definitions (without specifying the full path, looks for any span with the specified class inside the entry)
all_defs = entry_tree[0].xpath('.//span[@class="ind one-click-content"]')

In [9]:
len(all_defs)

46

In [10]:
for x in all_defs[:15]:
    print(x.text_content())

A single distinct meaningful element of speech or writing, used with others (or sometimes alone) to form a sentence and typically shown with a space on either side when written or printed.
A single distinct conceptual unit of language, comprising inflected and variant forms.
Something spoken or written; a remark or statement.
Even the smallest amount of something spoken or written.
Angry talk.
Speech as distinct from action.
A command, password, or signal.
Communication; news.
One's account of the truth, especially when it differs from that of another person.
A promise or assurance.
The text or spoken part of a play, opera, or other performed piece; a script.
A basic unit of data in a computer, typically 16 or 32 bits long.
Express (something spoken or written) in particular words.
Used to express agreement or affirmation.
Interpret a person's words literally, especially by believing them or doing as they suggest.
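
The predicate `@class="ind one-click-content"` compares the whole attribute string, so it only matches elements whose `class` is written exactly that way. If the class list might carry extra tokens or a different order, a token-based test is a common alternative. The sketch below uses made-up markup and is only one possible way to write such a test:

```python
from lxml.html import fromstring

# Hypothetical spans with different class attributes.
html = ('<div>'
        '<span class="ind one-click-content">a</span>'
        '<span class="ind">b</span>'
        '<span class="index">c</span>'
        '</div>')
root = fromstring(html)

# Exact string comparison: matches only the first span.
exact = root.xpath('.//span[@class="ind one-click-content"]')

# Token test: matches every span whose class list contains the token "ind".
token = root.xpath('.//span[contains(concat(" ", normalize-space(@class), " "), " ind ")]')

print([s.text for s in exact])  # ['a']
print([s.text for s in token])  # ['a', 'b']
```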


## Function for getting main meanings

In [11]:
def get_defs(word):
    b_html=""
    with urlopen(f'https://www.lexico.com/definition/{word}') as response:
        for line in response:
            b_html+=line.decode('utf-8')
    t = fromstring(b_html)
    f_html = tostring(t, pretty_print=True).decode('utf-8')
    t = fromstring(f_html)
    entry_t=t.xpath('//div[@class="entryWrapper"]')
    # the leading "." keeps the search inside the selected entryWrapper node
    main_defs = entry_t[0].xpath('.//section[@class="gramb"]/ul[@class="semb"]/li/div[@class="trg"]/p/span[@class="ind one-click-content"]')
    print(f"{word}:")
    for d in main_defs:
        print("\n", d.text_content())

### Examples

In [12]:
get_defs('subsense')

subsense:

 A subsidiary sense of a word defined in a dictionary.


In [13]:
get_defs('subsidiary')

subsidiary:

 Less important than but related or supplementary to something.

 A company controlled by a holding company.
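
`get_defs` assumes that every request succeeds. A hedged sketch of a download helper that reports failures instead of raising is shown below. The names `fetch_html`, `BASE_URL` and the `opener` parameter are hypothetical and not part of the notebook; `opener` only exists so that the function can be tried without a network connection.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = 'https://www.lexico.com/definition/'  # same address as in the notebook


def fetch_html(word, opener=urlopen):
    """Return the page for `word` as text, or None if the request fails."""
    # quote() percent-encodes characters that are not valid in a URL path
    url = BASE_URL + quote(word)
    try:
        with opener(url) as response:
            return response.read().decode('utf-8')
    except HTTPError as err:   # the server answered with an error status
        print(f'{word}: HTTP {err.code}')
    except URLError as err:    # no connection, unknown host, ...
        print(f'{word}: {err.reason}')
    return None
```

`HTTPError` is a subclass of `URLError`, so it has to be caught first. A caller can then skip a word when the helper returns `None` instead of stopping at the first failure.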




## With parts of speech, subsenses, synonyms, examples...

In [14]:
def get_all(word):
    print(word,"\n")
    broken_html=""
    try:
        with urlopen(f'https://www.lexico.com/definition/{word}') as response:
            for line in response:
                line = line.decode('utf-8')
                broken_html+=line
    except Exception as err:  # e.g. network failure or HTTP error
        print('error:', err)
        return
    try:
        tree = fromstring(broken_html)
        fixed_html = tostring(tree, pretty_print=True).decode('utf-8')
        tree = fromstring(fixed_html)
        entry_tree=tree.xpath('//div[@class="entryWrapper"]')
    except Exception as err:  # the page could not be parsed
        print("error:", err)
        return
    
    # the leading "." keeps the search inside the selected entryWrapper node
    grambs=entry_tree[0].xpath('.//section[@class="gramb"]')
    for gramb in grambs: # for every part of speech
        part_of_speech=gramb.xpath('h3[@class="ps pos"]/span[@class="pos"]')
        print(part_of_speech[0].text_content().upper())

        get_single_text= lambda x: x[0].text_content() if len(x)>0 else "" # preventing errors when no results
        get_p1_text = lambda p1: ' '.join([f'({x.text_content()})' for x in p1]) # (result1) (result2)
        get_p2_text = lambda p2: ' '.join([f'[{x.text_content()}]' for x in p2]) # [result1] [result2]
        get_syn_text = lambda s: "Synonyms: %s" % ''.join([x.text_content() for x in s]) if len(s)>0 else "" # Synonyms: a, b

        sense_reg_gramb = gramb.xpath('span[@class="sense-registers"]')
        sense_reg_gramb = get_p2_text(sense_reg_gramb)
        form_groups_gramb = gramb.xpath('span[@class="form-groups"]')
        form_groups_gramb = get_p1_text(form_groups_gramb)
        print(sense_reg_gramb, form_groups_gramb)
        senses = gramb.xpath('ul[@class="semb"]/li')
        for sense in senses: # for every general meaning / broad definition

            
            iteration= sense.xpath('div[@class="trg"]/p/span[@class="iteration"]')
            iteration=get_single_text(iteration)
            form_groups = sense.xpath('div[@class="trg"]/p/span[@class="form-groups"]')
            fg_text=get_p1_text(form_groups) # for example: (one's word) before the definition
            grammatical_notes = sense.xpath('div[@class="trg"]/p/span[@class="grammatical_note"]')
            gn_text=get_p2_text(grammatical_notes) # example: [mass noun]
            
            sense_reg = sense.xpath('span[@class="sense-registers"]')
            sense_reg = get_p2_text(sense_reg)

            main_def = sense.xpath('div[@class="trg"]/p/span[@class="ind one-click-content"]')
            main_def = get_single_text(main_def)
            cross_ref = sense.xpath('div[@class="trg"]/div[@class="crossReference"]')
            cross_ref = get_single_text(cross_ref)

            example =  sense.xpath('div[@class="trg"]//div[@class="exg"]')
            example = get_single_text(example)
            synonyms = sense.xpath('div[@class="trg"]/div[@class="synonyms"]/div[@class="exg"]/div')
            syn_text = get_syn_text(synonyms)

            print(iteration, sense_reg, fg_text, gn_text, main_def, cross_ref)

            print(example)
            print(syn_text)

            subsenses = sense.xpath('div[@class="trg"]/ol[@class="subSenses"]/li[@class="subSense"]')
            for subsense in subsenses:
                iteration=subsense.xpath('span[@class="subsenseIteration"]')
                iteration=get_single_text(iteration)
                form_groups = subsense.xpath('span[@class="form-groups"]')
                fg_text=get_p1_text(form_groups)
                grammatical_notes = subsense.xpath('span[@class="grammatical_note"]')
                gn_text=get_p2_text(grammatical_notes)

                sense_reg = subsense.xpath('span[@class="sense-registers"]')
                sense_reg = get_p2_text(sense_reg)

                sub_def = subsense.xpath('span[@class="ind one-click-content"]')
                sub_def = get_single_text(sub_def)
                example = subsense.xpath('div[@class="exg"]')
                example = get_single_text(example)
                synonyms = subsense.xpath('div[@class="trg"]/div[@class="synonyms"]/div[@class="exg"]/div')
                syn_text = get_syn_text(synonyms)

                print(iteration,sense_reg, fg_text, gn_text, sub_def)
                print(example)
                print(syn_text)
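
The traversal in `get_all` follows three nested levels: part of speech (`gramb`), main sense (`li` inside `ul.semb`) and subsense (`li.subSense`). The sketch below replays that structure on heavily simplified, made-up markup that only reuses the class names queried above, and returns nested dictionaries instead of printing. `first_text` and `parse_entry` are hypothetical helper names.

```python
from lxml.html import fromstring

# Hypothetical, simplified markup; the real page contains much more.
html = '''
<div class="entryWrapper">
  <section class="gramb">
    <h3 class="ps pos"><span class="pos">noun</span></h3>
    <ul class="semb">
      <li>
        <div class="trg">
          <p><span class="iteration">1</span><span class="ind one-click-content">Main sense.</span></p>
          <ol class="subSenses">
            <li class="subSense"><span class="subsenseIteration">1.1</span><span class="ind one-click-content">Sub sense.</span></li>
          </ol>
        </div>
      </li>
    </ul>
  </section>
</div>
'''


def first_text(nodes):
    """Text of the first node, or '' when the query matched nothing."""
    return nodes[0].text_content().strip() if nodes else ''


def parse_entry(wrapper):
    entry = []
    for gramb in wrapper.xpath('.//section[@class="gramb"]'):  # every part of speech
        pos = first_text(gramb.xpath('h3[@class="ps pos"]/span[@class="pos"]')).upper()
        senses = []
        for sense in gramb.xpath('ul[@class="semb"]/li'):  # every main sense
            subs = sense.xpath('div[@class="trg"]/ol[@class="subSenses"]/li[@class="subSense"]')
            senses.append({
                'definition': first_text(sense.xpath('div[@class="trg"]/p/span[@class="ind one-click-content"]')),
                'subsenses': [first_text(s.xpath('span[@class="ind one-click-content"]')) for s in subs],
            })
        entry.append({'part_of_speech': pos, 'senses': senses})
    return entry


entry = parse_entry(fromstring(html))
print(entry)
```

Returning data instead of printing keeps the parsing step reusable, for example as input for the table built in the next section.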

    

### Examples

In [15]:
get_all('word')

word 

NOUN
 
1    A single distinct meaningful element of speech or writing, used with others (or sometimes alone) to form a sentence and typically shown with a space on either side when written or printed. 
‘I don't like the word ‘unofficial’’
Synonyms: 
term, name, expression, designation, locution

1.1    A single distinct conceptual unit of language, comprising inflected and variant forms.
‘He is a knowledge worker in all senses of the word and carries a message everyone involved in best practise in education should hear.’

1.2  (usually words)  Something spoken or written; a remark or statement.
‘his grandfather's words had been meant kindly’

1.3  (a word) [with negative] Even the smallest amount of something spoken or written.
‘don't believe a word of it’

1.4  (words)  Angry talk.
‘her father would have had words with her about that’

1.5   [mass noun] Speech as distinct from action.
‘he conforms in word and deed to the values of a society that he rejects’

2    A command, pas

In [16]:
get_all('subsense')

subsense 

NOUN
 
    A subsidiary sense of a word defined in a dictionary. 
‘Dictionaries usually put polysemous words with all their senses in one article and homonymous words in two or more articles, dividing each into senses and subsenses as appropriate.’



In [17]:
get_all('communication')

communication 

NOUN
 
1   [mass noun] The imparting or exchanging of information by speaking, writing, or using some other medium. 
‘television is an effective means of communication’
Synonyms: 
transmission, imparting, conveying, reporting, presenting, passing on, handing on, relay, conveyance, divulgence, divulgation, disclosure

1.1   [count noun] A letter or message containing information or news.
‘a telephone communication’

1.2    The successful conveying or sharing of ideas and feelings.
‘there was a lack of communication between Pamela and her parents’

1.3    Social contact.
‘she gave him some hope of her return, or at least of their future communication’

2  (communications)  Means of sending or receiving information, such as phone lines or computers. 
‘satellite communications’

2.1   [treated as singular] The field of study concerned with the transmission of information.
‘After studying communications and political science, he was soon ready for more wanderings.’

3  (comm

## Obtaining information in a specified format (a table)




In [18]:
import pandas as pd

In [19]:
def get_df(words):
    df_rows=[]
    for word in words:
        broken_html=""
        try:
            with urlopen(f'https://www.lexico.com/definition/{word}') as response:
                for line in response:
                    line = line.decode('utf-8')
                    broken_html+=line
        except Exception as err:  # e.g. network failure or HTTP error
            print('error:', err)
            return
        try:
            tree = fromstring(broken_html)
            fixed_html = tostring(tree, pretty_print=True).decode('utf-8')
            tree = fromstring(fixed_html)
            entry_tree=tree.xpath('//div[@class="entryWrapper"]')
        except Exception as err:  # the page could not be parsed
            print("error:", err)
            return
        
        # the leading "." keeps the search inside the selected entryWrapper node
        grambs=entry_tree[0].xpath('.//section[@class="gramb"]')
        for gramb in grambs: # for every part of speech
            part_of_speech=gramb.xpath('h3[@class="ps pos"]/span[@class="pos"]')
            part_of_speech = part_of_speech[0].text_content().upper()

            get_single_text= lambda x: x[0].text_content() if len(x)>0 else "" # preventing errors when no results
            get_p1_text = lambda p1: ' '.join([f'({x.text_content()})' for x in p1]) # (result1) (result2)
            get_p2_text = lambda p2: ' '.join([f'[{x.text_content()}]' for x in p2]) # [result1] [result2]
            get_syn_text = lambda s: ''.join([x.text_content() for x in s]) if len(s)>0 else "" # a, b (no "Synonyms:" prefix in the table)

            sense_reg_gramb = gramb.xpath('span[@class="sense-registers"]')
            sense_reg_gramb = get_p2_text(sense_reg_gramb)
            form_groups_gramb = gramb.xpath('span[@class="form-groups"]')
            form_groups_gramb = get_p1_text(form_groups_gramb)

 
            senses = gramb.xpath('ul[@class="semb"]/li')
            for sense in senses: 

                form_groups = sense.xpath('div[@class="trg"]/p/span[@class="form-groups"]')
                fg_text=get_p1_text(form_groups) # for example: (one's word) before the definition
                grammatical_notes = sense.xpath('div[@class="trg"]/p/span[@class="grammatical_note"]')
                gn_text=get_p2_text(grammatical_notes) # example: [mass noun]
                
                sense_reg = sense.xpath('span[@class="sense-registers"]')
                sense_reg = get_p2_text(sense_reg)

                main_def = sense.xpath('div[@class="trg"]/p/span[@class="ind one-click-content"]')
                main_def = get_single_text(main_def)
                cross_ref = sense.xpath('div[@class="trg"]/div[@class="crossReference"]')
                cross_ref = get_single_text(cross_ref)

                example =  sense.xpath('div[@class="trg"]//div[@class="exg"]')
                example = get_single_text(example)
                synonyms = sense.xpath('div[@class="trg"]/div[@class="synonyms"]/div[@class="exg"]/div')
                syn_text = get_syn_text(synonyms)


                def_text = sense_reg_gramb + form_groups_gramb+ sense_reg + fg_text + gn_text + main_def +cross_ref
                row = [word, part_of_speech, def_text.strip(), example.strip(), syn_text.strip()]
                df_rows.append(row)
    return pd.DataFrame(df_rows, columns=['word', 'part of speech', 'meaning', 'example', 'synonyms'])

### Examples

In [20]:
get_df(['morning','evening'])

|   | word | part of speech | meaning | example | synonyms |
|---|------|----------------|---------|---------|----------|
| 0 | morning | NOUN | The period of time between midnight and noon, ... | ‘I've got a meeting this morning’ | before noon, before lunch, before lunchtime, a.m. |
| 1 | morning | ADVERB | \[informal \](mornings)Every morning. | ‘mornings, she'd sleep late’ |  |
| 2 | morning | EXCLAMATION | \[informal \]short for good morning | ‘Morning mate. I trust you are feeling a whole... |  |
| 3 | evening | NOUN | The period of time at the end of the day, usua... | ‘it was seven o'clock in the evening’ | night, late afternoon, end of day, close of day |
| 4 | evening | ADVERB | \[informal \]In the evening; every evening. | ‘Saturday evenings he invariably fell asleep’\\... |  |
| 5 | evening | EXCLAMATION | \[informal \]short for good evening |  |  |


In [21]:
get_df(['early','late'])

|   | word | part of speech | meaning | example | synonyms |
|---|------|----------------|---------|---------|----------|
| 0 | early | ADJECTIVE | Happening or done before the usual or expected... | ‘we ate an early lunch’ | untimely, premature\n\nprompt, timely, quick, ... |
| 1 | early | ADJECTIVE | Belonging or happening near the beginning of a... | ‘an early goal secured victory’ | advance, forward, prior |
| 2 | early | ADVERB | Before the usual or expected time. | ‘I was planning to finish work early today’ | early in the day, in the early morning\n\nbefo... |
| 3 | early | ADVERB | Near the beginning of a particular time or per... | ‘we lost a couple of games early in the season’ | beginning, opening, commencing, starting, ince... |
| 4 | early | NOUN | (earlies)Potatoes which are ready to be harves... | ‘The versatile early potato Solanum tuberosum ... |  |
| 5 | early | NOUN | (earlies)Early shifts. | ‘she is on earlies’\n‘He asked to be put on ea... |  |
| 6 | late | ADJECTIVE | Doing something or taking place after the expe... | ‘his late arrival’ | behind time, behind schedule, behind, behindhand |
| 7 | late | ADJECTIVE | Belonging or taking place far on in a particul... | ‘they won the game with a late goal’ |  |
| 8 | late | ADJECTIVE | (the/one's late)(of a specified person) no lon... | ‘the late Francis Bacon’ | dead, deceased, departed, lamented, passed awa... |
| 9 | late | ADVERB | After the expected, proper, or usual time. | ‘she arrived late’ | behind schedule, behind time, behindhand, unpu... |
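
Once the rows are in a DataFrame, the usual pandas tools apply. The sketch below uses made-up rows in the same column layout as `get_df` (the texts are placeholders, not dictionary content) to count main senses per part of speech and to filter by part of speech. The `display.max_colwidth` option is included on the assumption that the `...` endings in the outputs above come from pandas shortening long cells for display.

```python
import pandas as pd

columns = ['word', 'part of speech', 'meaning', 'example', 'synonyms']

# Hypothetical rows in the layout produced by get_df (placeholder texts).
rows = [
    ['early', 'ADJECTIVE', 'first adjective sense', '', ''],
    ['early', 'ADJECTIVE', 'second adjective sense', '', ''],
    ['early', 'ADVERB', 'first adverb sense', '', ''],
    ['late', 'ADJECTIVE', 'first adjective sense', '', ''],
]
df = pd.DataFrame(rows, columns=columns)

# Show full cell contents instead of shortening long strings.
pd.set_option('display.max_colwidth', None)

# Number of main senses per word and part of speech.
counts = df.groupby(['word', 'part of speech']).size()
print(counts)

# Keep only the adverb rows.
adverbs = df[df['part of speech'] == 'ADVERB']
print(adverbs[['word', 'meaning']])
```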
