# Search and Scraping

This notebook relies on two imports:
- bs4 (a.k.a. BeautifulSoup) for parsing the HTML of the downloaded pages
- requests for sending the search query and downloading the pages

In [17]:
from bs4 import BeautifulSoup
import requests

## Search

The cell below asks for the search query through `input()`.  
For now this manual input is only there so you can search for something and check that it works.

In [18]:
url = "https://www.bing.com/search?q=" + input()
print(url)
page = requests.get(url)

# Name the parser explicitly; "html" alone makes BeautifulSoup guess one
soup = BeautifulSoup(page.text, "html.parser")


https://www.bing.com/search?q=something
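
The cell above appends the raw input to the URL, so a query containing characters such as `&` or `#` would change the meaning of the URL. A minimal sketch of encoding the query first, using only the standard library; the helper name `build_search_url` is an assumption and not part of the notebook:

```python
from urllib.parse import urlencode

BING_SEARCH_URL = "https://www.bing.com/search"  # same endpoint as the cell above


def build_search_url(query: str) -> str:
    """Return the search URL for `query`, with the query percent-encoded."""
    # urlencode turns spaces into '+' and escapes characters such as '&' and '#'
    return f"{BING_SEARCH_URL}?{urlencode({'q': query})}"


print(build_search_url("something"))
print(build_search_url("rock & roll"))
```

For a plain word such as `something` the result is identical to the URL printed above; only queries with special characters change.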


Bing wraps each result link in an `h2` HTML tag.  
So we collect every `h2` and look for an `a` tag with an `href` inside it; if there is one, the heading holds a link.

In [19]:
bing_headings = soup.find_all("h2")
links = []

for heading in bing_headings:
    # Skip headings without a link; find() returns None for those
    a = heading.find("a", href=True)
    if a is not None:
        links.append(a["href"])

Below we can see which links the Bing search returned.

In [20]:
links

['https://www.youtube.com/watch?v=UelDrZ1aFeY',
 'https://nl.wikipedia.org/wiki/Something_(The_Beatles)',
 'https://en.wikipedia.org/wiki/Something_(Beatles_song)',
 'https://dictionary.cambridge.org/dictionary/english/something',
 'https://www.youtube.com/watch?v=VWO3nEuWo4k',
 'https://nl.bab.la/woordenboek/engels-nederlands/something',
 'https://www.youtube.com/watch?v=cNavPZ8GA6I',
 'https://en.wiktionary.org/wiki/something',
 'https://www.thebeatles.com/something',
 'https://genius.com/The-beatles-something-lyrics',
 'https://www.wordreference.com/definition/something',
 'https://www.oxfordlearnersdictionaries.com/definition/english/something_1',
 'https://www.mijnwoordenboek.nl/vertaal/EN/NL/something',
 'https://dictionary.cambridge.org/us/dictionary/english/something',
 'https://www.dictionary.com/browse/something',
 'https://www.youtube.com/watch?v=uXRvmkQLyTc',
 'https://www.merriam-webster.com/dictionary/something',
 'https://context.reverso.net/vertaling/engels-nederlands/s
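
The `h2` lookup described above can be tried offline on a small snippet. The sketch below is only an illustration: `SAMPLE_HTML` is a hypothetical stand-in for a results page (real Bing markup is more complex) and `extract_result_links` is an assumed helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mimics the h2 > a structure described above
SAMPLE_HTML = """
<ol>
  <li><h2><a href="https://example.com/first">First result</a></h2></li>
  <li><h2>Heading without a link</h2></li>
  <li><h2><a href="https://example.com/second">Second result</a></h2></li>
</ol>
"""


def extract_result_links(html: str) -> list[str]:
    """Collect the href of the first <a> inside every <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for heading in soup.find_all("h2"):
        anchor = heading.find("a", href=True)
        if anchor is not None:  # some headings carry no link at all
            links.append(anchor["href"])
    return links


print(extract_result_links(SAMPLE_HTML))
```

The heading without a link is skipped, so only the two `example.com` URLs are returned.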

Next I build a list with one row per link, holding the link and the paragraphs found on that website.  
I search for the `p` HTML tag because most websites keep their text inside it.

In [21]:
# One [link, paragraphs] pair per link; '0' is a placeholder until the page is scraped
paragraphs = [['0', '0'] for _ in links]

## Scraper

Now we scrape every website the search returned.  
This needs two extra exception imports:
- ConnectTimeout, because sometimes the website does not respond in time.
- TooManyRedirects, for when a page keeps redirecting and the request gives up.

A general exception handler catches anything else that goes wrong.

In [22]:
from requests import ConnectTimeout, TooManyRedirects

for i, link in enumerate(links):
    print(link)
    # Store the link first so the row is filled even when the request fails
    paragraphs[i][0] = link
    try:
        page = requests.get(link, allow_redirects=True, timeout=100)
        soup = BeautifulSoup(page.text, "html.parser")
        paragraphs[i][1] = soup.find_all("p")
    except TooManyRedirects:
        print("Failed for ", link)
    except ConnectTimeout:
        print("Failed for ", link)
    except Exception:
        # Anything else, for example a dropped connection
        print("Unknown error, for ", link)


https://www.youtube.com/watch?v=UelDrZ1aFeY
https://nl.wikipedia.org/wiki/Something_(The_Beatles)
https://en.wikipedia.org/wiki/Something_(Beatles_song)
https://dictionary.cambridge.org/dictionary/english/something
Unknown error, for  https://dictionary.cambridge.org/dictionary/english/something
https://www.youtube.com/watch?v=VWO3nEuWo4k
https://nl.bab.la/woordenboek/engels-nederlands/something
https://www.youtube.com/watch?v=cNavPZ8GA6I
https://en.wiktionary.org/wiki/something
https://www.thebeatles.com/something
https://genius.com/The-beatles-something-lyrics
https://www.wordreference.com/definition/something
https://www.oxfordlearnersdictionaries.com/definition/english/something_1
Unknown error, for  https://www.oxfordlearnersdictionaries.com/definition/english/something_1
https://www.mijnwoordenboek.nl/vertaal/EN/NL/something
https://dictionary.cambridge.org/us/dictionary/english/something
Unknown error, for  https://dictionary.cambridge.org/us/dictionary/english/something
https:/
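
The run above reports several pages as "Unknown error" without saying why. One possible sketch that returns the exception name alongside the result follows. `fetch_html` and its `get` parameter are hypothetical names, and catching `requests.exceptions.RequestException` (the base class of both `ConnectTimeout` and `TooManyRedirects`) is a swapped-in simplification of the separate handlers in the cell above.

```python
import requests


def fetch_html(link, get=requests.get, timeout=10):
    """Return (html, error); exactly one of the two is None."""
    try:
        page = get(link, allow_redirects=True, timeout=timeout)
        page.raise_for_status()  # treat 4xx/5xx responses as failures too
    except requests.exceptions.RequestException as exc:
        # Keep the exception name so the log says what actually went wrong
        return None, f"{type(exc).__name__}: {exc}"
    return page.text, None
```

The `get` parameter only exists so the function can be exercised without network access; in the notebook it would stay at its default.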

### Making into a DataFrame
I convert the result into a DataFrame so it is easier to manipulate and to export to other formats such as a CSV file.  
For this I use pandas, as it's one of the best-known tools for working with DataFrames.

I also import warnings to silence a pandas `PerformanceWarning`, because I didn't find an easier way of doing something.  
More information follows further down, in the cell where it happens.

In [23]:
import pandas as pd
import warnings

warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

In [24]:
df = pd.DataFrame(columns=["Link","Paragraphs"],data=paragraphs)
df

Unnamed: 0,Link,Paragraphs
0,https://www.youtube.com/watch?v=UelDrZ1aFeY,[]
1,https://nl.wikipedia.org/wiki/Something_(The_B...,"[[[<b>Something</b>], is een nummer van de , ..."
2,https://en.wikipedia.org/wiki/Something_(Beatl...,"[[\n\n\n], ["", [Something], "" is a song by the..."
3,https://dictionary.cambridge.org/dictionary/en...,0
4,https://www.youtube.com/watch?v=VWO3nEuWo4k,[]
5,https://nl.bab.la/woordenboek/engels-nederland...,[[This website is using a security service to ...
6,https://www.youtube.com/watch?v=cNavPZ8GA6I,[]
7,https://en.wiktionary.org/wiki/something,"[[From , [<a class=""extiw"" href=""https://en.wi..."
8,https://www.thebeatles.com/something,"[[Song], [[Release date:], 06 October 1969], ..."
9,https://genius.com/The-beatles-something-lyrics,"[[Produced by], [], [How to Format Lyrics:], [..."


The cell below finds the largest number of paragraphs held by any row of the DataFrame.

In [25]:
# Longest paragraph list across all rows
max_para: int = max(len(ps) for ps in df["Paragraphs"])

print(max_para)

67


The following cell creates one empty column per paragraph position, so the paragraphs can be separated.

In [26]:
df_add = pd.DataFrame({"List_Link": df["Paragraphs"]})

# One empty column per paragraph position
for x in range(max_para):
    column = f"Paragraphs{x}"
    df_add.loc[:, column] = None


The cell below does something that is not recommended: a double for loop over the DataFrame, so that each paragraph lands in the correct row and column.

In [27]:
for x, ps in enumerate(df["Paragraphs"]):
    for n, p in enumerate(ps):
        column = f"Paragraphs{n}"
        # .loc avoids chained assignment, which pandas may apply to a copy
        df_add.loc[x, column] = str(p)

df_add

Unnamed: 0,List_Link,Paragraphs0,Paragraphs1,Paragraphs2,Paragraphs3,Paragraphs4,Paragraphs5,Paragraphs6,Paragraphs7,Paragraphs8,...,Paragraphs57,Paragraphs58,Paragraphs59,Paragraphs60,Paragraphs61,Paragraphs62,Paragraphs63,Paragraphs64,Paragraphs65,Paragraphs66
0,[],,,,,,,,,,...,,,,,,,,,,
1,"[[[<b>Something</b>], is een nummer van de , ...",<p><i><b>Something</b></i> is een nummer van d...,"<p>De openingszin ""Something in the way she mo...",<p>Het lied werd door Harrison geschreven als ...,<p>De bijbehorende promotieclip is vlak na de ...,<p><i>Something</i> is een van de meest gecove...,,,,,...,,,,,,,,,,
2,"[[\n\n\n], ["", [Something], "" is a song by the...","<p class=""mw-empty-elt"">\n\n\n</p>","<p>""<b>Something</b>"" is a song by the English...",<p>The track is generally considered a love so...,"<p>""Something"" received the <a class=""mw-redir...","<p><a href=""/wiki/George_Harrison"" title=""Geor...",<p>The opening lyric was taken from the title ...,<p>Having begun to write love songs that were ...,<p>In the version issued on the Beatles' 1969 ...,"<p>Leng considers that, lyrically and musicall...",...,<p>In honour of Harrison's fondness for the in...,"<p><a href=""/wiki/Bob_Dylan"" title=""Bob Dylan""...","<p>Harrison played ""Something"" at the two <a c...","<p>Harrison included ""Something"" in all of his...",<p>A version from Harrison's December 1991 tou...,"<p>According to <a href=""/wiki/Walter_Everett_...",<p><b>The Beatles</b>\n</p>,<p><b>Additional musicians</b>\n</p>,<p><small><sup>^</sup> Shipments figures based...,<p> \n</p>
3,0,0,,,,,,,,,...,,,,,,,,,,
4,[],,,,,,,,,,...,,,,,,,,,,
5,[[This website is using a security service to ...,"<p data-translate=""blocked_why_detail"">This we...","<p data-translate=""blocked_resolve_detail"">You...","<p class=""text-13"">\n <span class=""cf-foote...",,,,,,,...,,,,,,,,,,
6,[],,,,,,,,,,...,,,,,,,,,,
7,"[[From , [<a class=""extiw"" href=""https://en.wi...","<p>From <span class=""etyl""><a class=""extiw"" hr...","<p><span class=""headword-line""><strong class=""...","<p><span class=""headword-line""><strong class=""...","<p><span class=""headword-line""><strong class=""...","<p><span class=""headword-line""><strong class=""...","<p><span class=""headword-line""><strong class=""...",,,,...,,,,,,,,,,
8,"[[Song], [[Release date:], 06 October 1969], ...","<p class=""pre-title"">Song</p>","<p class=""subtitle""><span class=""font-heavy"">R...",<p></p>,<p>Something in the way she moves<br/>\nAttrac...,<p>Somewhere in her smile she knows<br/>\nThat...,"<p>You're asking me will my love grow,<br/>\nI...",<p>Something in the way she knows<br/>\nAnd al...,<p></p>,"<p>B-side, in custom Apple Records sleeve</p>",...,,,,,,,,,,
9,"[[Produced by], [], [How to Format Lyrics:], [...","<p class=""HeaderCredits__Label-wx7h8g-2 ghcavQ...","<p class=""SongPage__HeaderSpace-sc-19xhmoi-1 g...",<p>How to Format Lyrics:</p>,"<p>To learn more, check out our <a class=""Styl...","<p>“Something” is the first <a data-api_path=""...",<p>The song is likely about Pattie Boyd – Harr...,<p>George later denied that the song was about...,"<p>Well no, I didn’t [write it about her]. I j...",<p>The song was released on a single with “Com...,...,,,,,,,,,,
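
The double loop above could possibly be avoided, because pandas pads rows of unequal length with missing values when it builds a DataFrame from a list of lists. The sketch below is an assumption about how that might look; `paragraph_lists` is a hypothetical stand-in for the scraped data, which in the notebook would be something like `[[str(p) for p in ps] for ps in df["Paragraphs"]]`.

```python
import pandas as pd

# Hypothetical stand-in: one list of paragraph strings per link
paragraph_lists = [
    ["<p>first</p>", "<p>second</p>", "<p>third</p>"],
    [],
    ["<p>only one</p>"],
]

# Shorter rows are padded with missing values, so no maximum length is needed up front
wide = pd.DataFrame(paragraph_lists)
wide.columns = [f"Paragraphs{n}" for n in wide.columns]

print(wide.shape)
```

This builds all columns in one step instead of adding them one at a time.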


Now I combine the two DataFrames and drop the extra `List_Link` column.

In [28]:
df = pd.concat([df, df_add], axis=1)
df=df.drop("List_Link",axis=1)

The following cell removes every HTML tag from the DataFrame, column by column.

In [29]:
for i in range(max_para):
    column = f"Paragraphs{i}"
    df[column] = df[column].str.replace(r'<[^<>]*>', '', regex=True)

df

Unnamed: 0,Link,Paragraphs,Paragraphs0,Paragraphs1,Paragraphs2,Paragraphs3,Paragraphs4,Paragraphs5,Paragraphs6,Paragraphs7,...,Paragraphs57,Paragraphs58,Paragraphs59,Paragraphs60,Paragraphs61,Paragraphs62,Paragraphs63,Paragraphs64,Paragraphs65,Paragraphs66
0,https://www.youtube.com/watch?v=UelDrZ1aFeY,[],,,,,,,,,...,,,,,,,,,,
1,https://nl.wikipedia.org/wiki/Something_(The_B...,"[[[<b>Something</b>], is een nummer van de , ...",Something is een nummer van de Britse groep Th...,"De openingszin ""Something in the way she moves...",Het lied werd door Harrison geschreven als een...,De bijbehorende promotieclip is vlak na de laa...,Something is een van de meest gecoverde Beatle...,,,,...,,,,,,,,,,
2,https://en.wikipedia.org/wiki/Something_(Beatl...,"[[\n\n\n], ["", [Something], "" is a song by the...",\n\n\n,"""Something"" is a song by the English rock band...",The track is generally considered a love song ...,"""Something"" received the Ivor Novello Award fo...","George Harrison began writing ""Something"" in S...",The opening lyric was taken from the title of ...,Having begun to write love songs that were dir...,In the version issued on the Beatles' 1969 alb...,...,In honour of Harrison's fondness for the instr...,Bob Dylan also played the song live during his...,"Harrison played ""Something"" at the two Concert...","Harrison included ""Something"" in all of his su...",A version from Harrison's December 1991 tour o...,"According to Walter Everett,[52] Bruce Spizer[...",The Beatles\n,Additional musicians\n,^ Shipments figures based on certification alo...,\n
3,https://dictionary.cambridge.org/dictionary/en...,0,0,,,,,,,,...,,,,,,,,,,
4,https://www.youtube.com/watch?v=VWO3nEuWo4k,[],,,,,,,,,...,,,,,,,,,,
5,https://nl.bab.la/woordenboek/engels-nederland...,[[This website is using a security service to ...,This website is using a security service to pr...,You can email the site owner to let them know ...,\n Cloudflare Ray ID: 863378d18de166a5\n ...,,,,,,...,,,,,,,,,,
6,https://www.youtube.com/watch?v=cNavPZ8GA6I,[],,,,,,,,,...,,,,,,,,,,
7,https://en.wiktionary.org/wiki/something,"[[From , [<a class=""extiw"" href=""https://en.wi...","From Middle English somþyng, some-thing, som t...",something (indefinite pronoun)\n,something (not comparable)\n,something (not comparable)\n,something (third-person singular simple presen...,something (plural somethings)\n,,,...,,,,,,,,,,
8,https://www.thebeatles.com/something,"[[Song], [[Release date:], 06 October 1969], ...",Song,Release date: 06 October 1969,,Something in the way she moves\nAttracts me li...,Somewhere in her smile she knows\nThat I don't...,"You're asking me will my love grow,\nI don't k...",Something in the way she knows\nAnd all I have...,,...,,,,,,,,,,
9,https://genius.com/The-beatles-something-lyrics,"[[Produced by], [], [How to Format Lyrics:], [...",Produced by,,How to Format Lyrics:,"To learn more, check out our transcription gui...",“Something” is the first George Harrison penne...,The song is likely about Pattie Boyd – Harriso...,George later denied that the song was about her.,"Well no, I didn’t [write it about her]. I just...",...,,,,,,,,,,
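
The regular expression above removes the tags but leaves HTML entities (a serialized tag writes `&` as `&amp;`) and the trailing newlines visible in the output. A minimal standard-library sketch that also handles those; `strip_tags` is an assumed helper name and the sample strings are made up for illustration:

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^<>]*>")  # same pattern as the cell above


def strip_tags(markup: str) -> str:
    """Remove tags, decode entities such as &amp;, and trim surrounding whitespace."""
    return html.unescape(TAG_PATTERN.sub("", markup)).strip()


print(strip_tags("<p><b>Something</b> is a song by the <a href='/wiki/The_Beatles'>Beatles</a></p>"))
print(strip_tags("<p>Simon &amp; Garfunkel\n</p>"))
```

An alternative would be to call BeautifulSoup's `get_text()` on each paragraph tag before it is converted to a string, which skips the regular expression altogether.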


## What's next?
Next comes the part where the data needs to be cleaned and sent to something.

## Improvements

- Connect to the platform and get the search query from it.
- Remove links that will not deliver anything useful.
- Set a maximum size for the paragraphs.
- Look into speed improvements.
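
A rough sketch of how three of these improvements might fit together, using only the standard library. Everything here is an assumption: `SKIPPED_HOSTS` is based on the YouTube links returning no paragraphs in the run above, "maximum size" is read as a cap on the number of paragraphs per page, `MAX_PARAGRAPHS` is an arbitrary value, and `fetch` is a hypothetical callable that maps a link to its list of paragraphs.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

# Assumption: hosts that produced no <p> text in the run above
SKIPPED_HOSTS = {"www.youtube.com"}
MAX_PARAGRAPHS = 20  # hypothetical cap per page


def keep_link(link: str) -> bool:
    """Drop links whose host is known to yield no useful paragraphs."""
    return urlparse(link).netloc not in SKIPPED_HOSTS


def cap_paragraphs(paragraphs: list, limit: int = MAX_PARAGRAPHS) -> list:
    """Keep at most `limit` paragraphs per page."""
    return paragraphs[:limit]


def scrape_all(links, fetch, workers: int = 8) -> dict:
    """Fetch the kept links concurrently and cap the paragraphs of each page."""
    kept = [link for link in links if keep_link(link)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as `kept`
        results = pool.map(fetch, kept)
        return {link: cap_paragraphs(ps) for link, ps in zip(kept, results)}
```

Threads suit this workload because the time is spent waiting on network responses rather than on computation.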