# Search and Scraping

The imports below are:
- bs4 (a.k.a. BeautifulSoup) for parsing the HTML of the websites
- requests for fetching the search results and the pages from the internet

In [385]:
from bs4 import BeautifulSoup
import requests

## Search

The cell below asks for a search query as input.  
For now this is just a manual input method, so you can search for something and check that it works.

In [386]:
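from urllib.parse import quote_plus

# URL-encode the query so spaces and special characters survive in the URL
url = "https://www.bing.com/search?q=" + quote_plus(input())
print(url)
page = requests.get(url)

soup = BeautifulSoup(page.text, "html")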


https://www.bing.com/search?q=mastodon


Bing puts each result link inside an `h2` HTML tag.  
So if we collect every `h2` and then look for an `a` tag inside it, we can see whether it holds a link or not.

In [387]:
BingLinks = soup.find_all("h2")
links = []

for link in BingLinks:
    a = link.find("a", href=True)
    # Skip headings that do not wrap a link; find() returns None for those
    if a is not None:
        links.append(a["href"])
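The same `h2` then `a` lookup can be tried offline on a small HTML string, which makes it easier to see what happens when a heading has no link. This is only a sketch: the markup below is a hypothetical, simplified stand-in (real Bing pages are more complex and change over time), and it uses a CSS selector instead of the nested `find` calls.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for a search result page
html = """
<ol>
  <li><h2><a href="https://example.org/first">First result</a></h2></li>
  <li><h2>Heading without a link</h2></li>
  <li><h2><a href="https://example.org/second">Second result</a></h2></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

# "h2 a[href]" matches every <a> that has an href and sits inside an <h2>
links = [a["href"] for a in soup.select("h2 a[href]")]
print(links)
```

Unlike `find`, the selector returns every matching anchor inside a heading, not just the first one, so the two approaches can differ when an `h2` holds several links.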

Below we can see which links it found with the Bing search engine.

In [388]:
links

['https://mastodon.social/',
 'https://joinmastodon.org/nl-NL',
 'https://joinmastodon.org/',
 'https://mastodon.nl/',
 'https://en.wikipedia.org/wiki/Mastodon_(social_network)',
 'https://joinmastodon.org/nl-NL/servers',
 'https://nl.wikipedia.org/wiki/Mastodon_(software)',
 'https://joinmastodon.org/nl-NL/apps',
 'https://joinmastodon.org/about',
 'https://joinmastodon.org/servers',
 'https://techcrunch.com/2023/07/24/what-is-mastodon/',
 'https://www.consumentenbond.nl/internet-privacy/mastodon',
 'https://androidworld.nl/tips/mastodon-voor-beginners-uitleg',
 'https://mastodon.social/about',
 'https://social.overheid.nl/',
 'https://mastodon.help/',
 'https://joinmastodon.org/apps',
 'https://www.dutchcowboys.nl/socialmedia/wat-is-mastodon-en-waarom-praat-iedereen-erover',
 'https://github.com/mastodon/mastodon',
 'https://mastodon.nl/@nrc_nl',
 'https://www.vrt.be/vrtnws/nl/2022/11/01/mastodon-twitter/',
 'https://nl.wikipedia.org/wiki/Mastodon',
 'https://en.wikipedia.org/wiki/Ma

Now I'm going to make a matrix that pairs each link with the paragraphs found on that website.  
I have chosen to search for the `p` HTML tag because most websites put their text inside this tag.

In [389]:
# One [link, paragraphs] pair per link; the empty list stays in place if scraping fails
paragraphs = [["", []] for _ in range(len(links))]

## Scraper

Now we are going to scrape every website we found.  
For this we need some extra imports:
- ConnectTimeout, because sometimes the website can't be loaded.
- TooManyRedirects, for when a page redirects more often than requests is willing to follow.

I also have a general exception catcher, so if anything else goes wrong it is caught as well.

In [390]:
from requests import ConnectTimeout, TooManyRedirects


for i, link in enumerate(links):
    print(link)
    # Record the link first so the row is labelled even if the request fails
    paragraphs[i][0] = link
    try:
        page = requests.get(link, allow_redirects=True, timeout=100)
        soup = BeautifulSoup(page.text, "html")
        paragraphs[i][1] = soup.find_all("p")
    except TooManyRedirects:
        print("Too many redirects for", link)
    except ConnectTimeout:
        print("Connection timed out for", link)
    except Exception as e:
        # Catch-all so one bad website does not stop the whole run
        print("Unknown error for", link, "-", e)


https://mastodon.social/
https://joinmastodon.org/nl-NL
https://joinmastodon.org/
https://mastodon.nl/
https://en.wikipedia.org/wiki/Mastodon_(social_network)
https://joinmastodon.org/nl-NL/servers
https://nl.wikipedia.org/wiki/Mastodon_(software)
https://joinmastodon.org/nl-NL/apps
https://joinmastodon.org/about
https://joinmastodon.org/servers
https://techcrunch.com/2023/07/24/what-is-mastodon/
https://www.consumentenbond.nl/internet-privacy/mastodon
https://androidworld.nl/tips/mastodon-voor-beginners-uitleg
https://mastodon.social/about
https://social.overheid.nl/
https://mastodon.help/
https://joinmastodon.org/apps
https://www.dutchcowboys.nl/socialmedia/wat-is-mastodon-en-waarom-praat-iedereen-erover
https://github.com/mastodon/mastodon
https://mastodon.nl/@nrc_nl
https://www.vrt.be/vrtnws/nl/2022/11/01/mastodon-twitter/
https://nl.wikipedia.org/wiki/Mastodon
https://en.wikipedia.org/wiki/Mastodon
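The fetch-and-parse step can also be wrapped in a small function, which keeps the error handling in one place. The sketch below is an assumption about how that could look, not the notebook's own code: `fetch_paragraphs` is a hypothetical helper, the 10 second timeout is an arbitrary choice, and it catches `requests.RequestException`, the base class that `ConnectTimeout` and `TooManyRedirects` both inherit from.

```python
import requests
from bs4 import BeautifulSoup


def fetch_paragraphs(url: str, timeout: float = 10.0) -> list[str]:
    """Return the <p> elements of a page as strings, or [] if the request fails.

    Hypothetical helper for illustration only.
    """
    try:
        response = requests.get(url, allow_redirects=True, timeout=timeout)
        # Treat 4xx/5xx answers as failures instead of parsing an error page
        response.raise_for_status()
    except requests.RequestException as error:
        print("Failed for", url, "-", error)
        return []
    soup = BeautifulSoup(response.text, "html.parser")
    return [str(p) for p in soup.find_all("p")]


# A URL without a scheme is rejected before any network traffic,
# which makes the error path easy to try out
print(fetch_paragraphs("not-a-valid-url"))
```

Returning plain strings instead of BeautifulSoup tags means later steps do not need to convert each paragraph again.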


### Making into a DataFrame
I want to turn it into a DataFrame so it will be easier to manipulate and to convert into other formats such as a CSV file.  
For this I will use pandas, as it's one of the best-known tools for working with DataFrames.

I also import warnings, because I needed to silence a pandas PerformanceWarning for which I didn't find an easier workaround.  
More information follows further down, in the block where it happens.

In [391]:
import pandas as pd
import warnings

warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

In [392]:
df = pd.DataFrame(columns=["Link","Paragraphs"],data=paragraphs)
df

Unnamed: 0,Link,Paragraphs
0,https://mastodon.social/,[]
1,https://joinmastodon.org/nl-NL,[[Jouw tijdlijn zou vol moeten staan met wat v...
2,https://joinmastodon.org/,[[Your home feed should be filled with what ma...
3,https://mastodon.nl/,[]
4,https://en.wikipedia.org/wiki/Mastodon_(social...,"[[\n], [[Mastodon], is a , [free and open-sou..."
5,https://joinmastodon.org/nl-NL/servers,[[Mastodon is niet één website. Om het te gebr...
6,https://nl.wikipedia.org/wiki/Mastodon_(software),"[[[Android:], 2.3.0 , [(13 februari 2024)], [..."
7,https://joinmastodon.org/nl-NL/apps,[[De beste manier om aan de slag te gaan met M...
8,https://joinmastodon.org/about,"[[Free, open-source decentralized social media..."
9,https://joinmastodon.org/servers,"[[Mastodon is not a single website. To use it,..."


In the block below I look for the largest number of paragraphs found for any single link currently in the DataFrame.

In [393]:
# Largest number of paragraphs found for a single link (0 if there are no rows)
max_para: int = max((len(ps) for ps in df["Paragraphs"]), default=0)

print(max_para)

131


In the following block I create one column per paragraph position, so the paragraphs can be separated.

In [394]:
df_add = pd.DataFrame({"List_Link": df["Paragraphs"]})

for x in range(max_para):
    column: str = f"Paragraphs{x}"
    df_add.loc[:, column] = None


In this block I do something that is not recommended: a double for loop that iterates through the DataFrame, so I can put each paragraph in the correct column and row.

In [395]:
for x, ps in enumerate(df["Paragraphs"]):
    for n, p in enumerate(ps):
        column: str = f"Paragraphs{n}"
        # .at writes a single cell directly and avoids chained assignment
        df_add.at[x, column] = str(p)

df_add

Unnamed: 0,List_Link,Paragraphs0,Paragraphs1,Paragraphs2,Paragraphs3,Paragraphs4,Paragraphs5,Paragraphs6,Paragraphs7,Paragraphs8,...,Paragraphs121,Paragraphs122,Paragraphs123,Paragraphs124,Paragraphs125,Paragraphs126,Paragraphs127,Paragraphs128,Paragraphs129,Paragraphs130
0,[],,,,,,,,,,...,,,,,,,,,,
1,[[Jouw tijdlijn zou vol moeten staan met wat v...,"<p class=""sh1 mb-11 max-w-[50ch]"">Jouw tijdlij...","<p class=""sh1 mb-8 text-gray-1"">Je weet zelf h...","<p class=""sh1 mb-8 text-gray-1"">Mastodon biedt...","<p class=""sh1 mb-8 text-gray-1"">Mastodon legt ...","<p class=""sh1 mb-8 text-gray-1"">Mastodon onder...","<p class=""b2 text-gray-1"">Directe wereldwijde ...","<p class=""b2 text-gray-1"">Mastodon is vrije en...","<p class=""b2 text-gray-1"">We respecteren jouw ...","<p class=""b2 text-gray-1"">Mastodon is gebouwd ...",...,,,,,,,,,,
2,[[Your home feed should be filled with what ma...,"<p class=""sh1 mb-11 max-w-[50ch]"">Your home fe...","<p class=""sh1 mb-8 text-gray-1"">You know best ...","<p class=""sh1 mb-8 text-gray-1"">Mastodon provi...","<p class=""sh1 mb-8 text-gray-1"">Mastodon puts ...","<p class=""sh1 mb-8 text-gray-1"">Mastodon suppo...","<p class=""b2 text-gray-1"">Instant global commu...","<p class=""b2 text-gray-1"">Mastodon is free and...","<p class=""b2 text-gray-1"">We respect your agen...","<p class=""b2 text-gray-1"">Built on open web pr...",...,,,,,,,,,,
3,[],,,,,,,,,,...,,,,,,,,,,
4,"[[\n], [[Mastodon], is a , [free and open-sou...","<p class=""mw-empty-elt"">\n</p>","<p><b>Mastodon</b> is a <a href=""/wiki/Free_an...",<p>Each user is a member of a specific Mastodo...,"<p>Mastodon was created by <a href=""/wiki/Euge...",<p>The project is maintained by German <a href...,<p>Mastodon servers run social networking soft...,"<p>Since version 2.9.0, Mastodon has offered a...","<p>Users join a specific Mastodon server, rath...",<p>Mastodon includes a number of specific priv...,...,,,,,,,,,,
5,[[Mastodon is niet één website. Om het te gebr...,"<p class=""sh1 mb-14 max-w-[36ch]"">Mastodon is ...","<p class=""b2 text-gray-1"">De eerste stap is te...","<p class=""b2 text-gray-1"">Met een account op j...","<p class=""b2 text-gray-1"">Zoek je een andere s...","<p class=""b2 text-gray-1"">We hebben geen contr...","<p class=""b2 mb-8 text-gray-1"">Alle hier verme...","<p class=""b3 mb-4 text-gray-2"">Waar de aanbied...","<p class=""b3 mb-4 text-gray-2"">Sommige aanbied...","<p class=""b3 mt-4 text-gray-2""><span class=""in...",...,,,,,,,,,,
6,"[[[Android:], 2.3.0 , [(13 februari 2024)], [...","<p><b>Android:</b> 2.3.0 <span style=""font-siz...","<p><b>Mastodon</b> is sinds 2016 <a href=""/wik...",<p>Een Mastodon-site is te vergelijken met <a ...,<p>Het netwerk kreeg meer bekendheid na de ove...,<p>Net zoals Twitter gebruikt de Mastodon web-...,<p>Mastodongebruikers kunnen net zoals op ande...,<p>Openbare tijdlijnen kunnen ook op taal gefi...,"<p>Als gevolg van de <a href=""/wiki/Open_stand...",<p>Het maakt net zoals bij e-mail in principe ...,...,,,,,,,,,,
7,[[De beste manier om aan de slag te gaan met M...,"<p class=""sh1"">De beste manier om aan de slag ...","<p class=""mt-2 max-w-[28ch]"">Vrij, open source...",,,,,,,,...,,,,,,,,,,
8,"[[Free, open-source decentralized social media...","<p class=""sh1"">Free, open-source decentralized...","<p class=""b1 mb-4""><strong>Mastodon gGmbH is a...","<p class=""b1 mb-4"">Believing that instant glob...","<p class=""b1 mb-4"">The first public launch occ...","<p class=""b1 mb-6"">The project was officially ...","<p class=""b1 mb-6""><a class=""inline-flex items...","<p class=""sh1 mb-8 text-gray-2"">What others wr...","<p class=""mt-2 max-w-[28ch]"">Free, open-source...",,...,,,,,,,,,,
9,"[[Mastodon is not a single website. To use it,...","<p class=""sh1 mb-14 max-w-[36ch]"">Mastodon is ...","<p class=""b2 text-gray-1"">The first step is de...","<p class=""b2 text-gray-1"">With an account on y...","<p class=""b2 text-gray-1"">Find a different ser...","<p class=""b2 text-gray-1"">We can't control the...","<p class=""b2 mb-8 text-gray-1"">All servers lis...","<p class=""b3 mb-4 text-gray-2"">Where the provi...","<p class=""b3 mb-4 text-gray-2"">Some providers ...","<p class=""b3 mt-4 text-gray-2""><span class=""in...",...,,,,,,,,,,
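As an alternative to pre-creating the columns and filling them cell by cell, pandas can expand a column of lists in one step. The sketch below is not the notebook's method: it uses a hypothetical three-row stand-in for the scraped data, with plain strings instead of BeautifulSoup tags (real tags would first need `str(p)`).

```python
import pandas as pd

# Hypothetical stand-in for the scraped data: one list of paragraph strings per link
df = pd.DataFrame({
    "Link": ["https://example.org/a", "https://example.org/b", "https://example.org/c"],
    "Paragraphs": [["<p>one</p>", "<p>two</p>"], [], ["<p>three</p>"]],
})

# Building a DataFrame from a list of lists pads the shorter rows with missing values
wide = pd.DataFrame(df["Paragraphs"].tolist(), index=df.index)

# The new columns are numbered 0, 1, ...; the prefix turns them into Paragraphs0, Paragraphs1, ...
wide = wide.add_prefix("Paragraphs")

result = pd.concat([df[["Link"]], wide], axis=1)
print(result)
```

Because all columns are created at once, this route does not need the maximum length to be computed beforehand.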


Now I will combine the 2 datasets with each other and drop the extra column.

In [396]:
df = pd.concat([df, df_add], axis=1)
df=df.drop("List_Link",axis=1)

In the following block I remove every HTML tag from the DataFrame, column by column.

In [397]:
for i in range(max_para):
    column: str = f"Paragraphs{i}"
    # Remove anything that looks like an HTML tag: '<', any characters except '<' or '>', then '>'
    df[column] = df[column].str.replace(r'<[^<>]*>', '', regex=True)

df

Unnamed: 0,Link,Paragraphs,Paragraphs0,Paragraphs1,Paragraphs2,Paragraphs3,Paragraphs4,Paragraphs5,Paragraphs6,Paragraphs7,...,Paragraphs121,Paragraphs122,Paragraphs123,Paragraphs124,Paragraphs125,Paragraphs126,Paragraphs127,Paragraphs128,Paragraphs129,Paragraphs130
0,https://mastodon.social/,[],,,,,,,,,...,,,,,,,,,,
1,https://joinmastodon.org/nl-NL,[[Jouw tijdlijn zou vol moeten staan met wat v...,Jouw tijdlijn zou vol moeten staan met wat voo...,Je weet zelf het beste wat je op jouw tijdlijn...,Mastodon biedt je een unieke mogelijkheid om j...,Mastodon legt de besluitvorming weer in jouw h...,"Mastodon ondersteunt audio-, video- en fotober...",Directe wereldwijde communicatie is te belangr...,Mastodon is vrije en opensourcesoftware. Wij g...,We respecteren jouw handelsbekwaamheid. Jouw t...,...,,,,,,,,,,
2,https://joinmastodon.org/,[[Your home feed should be filled with what ma...,Your home feed should be filled with what matt...,You know best what you want to see on your hom...,Mastodon provides you with a unique possibilit...,Mastodon puts decision making back in your han...,"Mastodon supports audio, video and picture pos...",Instant global communication is too important ...,Mastodon is free and open-source software. We ...,We respect your agency. Your feed is curated a...,...,,,,,,,,,,
3,https://mastodon.nl/,[],,,,,,,,,...,,,,,,,,,,
4,https://en.wikipedia.org/wiki/Mastodon_(social...,"[[\n], [[Mastodon], is a , [free and open-sou...",\n,Mastodon is a free and open-source software fo...,Each user is a member of a specific Mastodon s...,Mastodon was created by Eugen Rochko and annou...,The project is maintained by German non-profit...,Mastodon servers run social networking softwar...,"Since version 2.9.0, Mastodon has offered a si...","Users join a specific Mastodon server, rather ...",...,,,,,,,,,,
5,https://joinmastodon.org/nl-NL/servers,[[Mastodon is niet één website. Om het te gebr...,Mastodon is niet één website. Om het te gebrui...,De eerste stap is te bepalen op welke server j...,Met een account op jouw server kun je elke and...,Zoek je een andere server waar je de voorkeur ...,We hebben geen controle over de servers maar w...,Alle hier vermelde servers hebben zich akkoord...,Waar de aanbieder wettelijk is gevestigd.,Sommige aanbieders zijn gespecialiseerd in het...,...,,,,,,,,,,
6,https://nl.wikipedia.org/wiki/Mastodon_(software),"[[[Android:], 2.3.0 , [(13 februari 2024)], [...",Android: 2.3.0 (13 februari 2024)[3] \niOS: 1....,Mastodon is sinds 2016 opensourcesoftware om z...,Een Mastodon-site is te vergelijken met Twitte...,Het netwerk kreeg meer bekendheid na de overna...,Net zoals Twitter gebruikt de Mastodon web-app...,Mastodongebruikers kunnen net zoals op andere ...,Openbare tijdlijnen kunnen ook op taal gefilte...,Als gevolg van de open standaarden bestaan er ...,...,,,,,,,,,,
7,https://joinmastodon.org/nl-NL/apps,[[De beste manier om aan de slag te gaan met M...,De beste manier om aan de slag te gaan met Mas...,"Vrij, open source gedecentraliseerd sociaal me...",,,,,,,...,,,,,,,,,,
8,https://joinmastodon.org/about,"[[Free, open-source decentralized social media...","Free, open-source decentralized social media",Mastodon gGmbH is a non-profit from Germany th...,Believing that instant global communications w...,The first public launch occurred in October 20...,The project was officially incorporated as a g...,Join the team,What others write about us.,"Free, open-source decentralized social media p...",...,,,,,,,,,,
9,https://joinmastodon.org/servers,"[[Mastodon is not a single website. To use it,...","Mastodon is not a single website. To use it, y...",The first step is deciding which server you’d ...,"With an account on your server, you can follow...",Find a different server you'd prefer? With Mas...,"We can't control the servers, but we can contr...",All servers listed here have committed to the ...,Where the provider is legally based.,Some providers specialize in hosting accounts ...,...,,,,,,,,,,
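The tag-stripping pattern used above can be checked on a few strings with the standard `re` module, without pandas. This is a minimal sketch: `strip_tags` is a hypothetical helper name and the sample strings are shortened versions of the scraped paragraphs.

```python
import re

# Same pattern as in the DataFrame step: '<', then anything except '<' or '>', then '>'
TAG_PATTERN = re.compile(r"<[^<>]*>")


def strip_tags(html: str) -> str:
    """Remove everything that looks like an HTML tag and keep the text in between."""
    return TAG_PATTERN.sub("", html)


samples = [
    '<p class="sh1">Free, open-source decentralized social media</p>',
    '<p><b>Mastodon</b> is a <a href="/wiki/X">free</a> tool</p>',
]
for sample in samples:
    print(strip_tags(sample))
```

The pattern only looks at the angle brackets, so plain text that contains a `<` followed later by a `>` is removed as well. BeautifulSoup's `get_text()` is an alternative that works on the parsed tags instead.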


## Cleaning

In [398]:
for i in range(max_para):
    column: str = f"Paragraphs{i}"
    # regex=False treats each pattern as literal text, so '[' and ']' are not parsed as regex syntax
    df[column] = df[column].str.replace('[', '', regex=False)
    df[column] = df[column].str.replace(']', '', regex=False)
    df[column] = df[column].str.replace('', '', regex=False)  # empty pattern: no effect as written
    df[column] = df[column].str.replace('&amp;', '', regex=False)
    df[column] = df[column].str.replace('\n', '', regex=False)
df

Unnamed: 0,Link,Paragraphs,Paragraphs0,Paragraphs1,Paragraphs2,Paragraphs3,Paragraphs4,Paragraphs5,Paragraphs6,Paragraphs7,...,Paragraphs121,Paragraphs122,Paragraphs123,Paragraphs124,Paragraphs125,Paragraphs126,Paragraphs127,Paragraphs128,Paragraphs129,Paragraphs130
0,https://mastodon.social/,[],,,,,,,,,...,,,,,,,,,,
1,https://joinmastodon.org/nl-NL,[[Jouw tijdlijn zou vol moeten staan met wat v...,Jouw tijdlijn zou vol moeten staan met wat voo...,Je weet zelf het beste wat je op jouw tijdlijn...,Mastodon biedt je een unieke mogelijkheid om j...,Mastodon legt de besluitvorming weer in jouw h...,"Mastodon ondersteunt audio-, video- en fotober...",Directe wereldwijde communicatie is te belangr...,Mastodon is vrije en opensourcesoftware. Wij g...,We respecteren jouw handelsbekwaamheid. Jouw t...,...,,,,,,,,,,
2,https://joinmastodon.org/,[[Your home feed should be filled with what ma...,Your home feed should be filled with what matt...,You know best what you want to see on your hom...,Mastodon provides you with a unique possibilit...,Mastodon puts decision making back in your han...,"Mastodon supports audio, video and picture pos...",Instant global communication is too important ...,Mastodon is free and open-source software. We ...,We respect your agency. Your feed is curated a...,...,,,,,,,,,,
3,https://mastodon.nl/,[],,,,,,,,,...,,,,,,,,,,
4,https://en.wikipedia.org/wiki/Mastodon_(social...,"[[\n], [[Mastodon], is a , [free and open-sou...",,Mastodon is a free and open-source software fo...,Each user is a member of a specific Mastodon s...,Mastodon was created by Eugen Rochko and annou...,The project is maintained by German non-profit...,Mastodon servers run social networking softwar...,"Since version 2.9.0, Mastodon has offered a si...","Users join a specific Mastodon server, rather ...",...,,,,,,,,,,
5,https://joinmastodon.org/nl-NL/servers,[[Mastodon is niet één website. Om het te gebr...,Mastodon is niet één website. Om het te gebrui...,De eerste stap is te bepalen op welke server j...,Met een account op jouw server kun je elke and...,Zoek je een andere server waar je de voorkeur ...,We hebben geen controle over de servers maar w...,Alle hier vermelde servers hebben zich akkoord...,Waar de aanbieder wettelijk is gevestigd.,Sommige aanbieders zijn gespecialiseerd in het...,...,,,,,,,,,,
6,https://nl.wikipedia.org/wiki/Mastodon_(software),"[[[Android:], 2.3.0 , [(13 februari 2024)], [...",Android: 2.3.0 (13 februari 2024)3 iOS: 1.5.2 ...,Mastodon is sinds 2016 opensourcesoftware om z...,Een Mastodon-site is te vergelijken met Twitte...,Het netwerk kreeg meer bekendheid na de overna...,Net zoals Twitter gebruikt de Mastodon web-app...,Mastodongebruikers kunnen net zoals op andere ...,Openbare tijdlijnen kunnen ook op taal gefilte...,Als gevolg van de open standaarden bestaan er ...,...,,,,,,,,,,
7,https://joinmastodon.org/nl-NL/apps,[[De beste manier om aan de slag te gaan met M...,De beste manier om aan de slag te gaan met Mas...,"Vrij, open source gedecentraliseerd sociaal me...",,,,,,,...,,,,,,,,,,
8,https://joinmastodon.org/about,"[[Free, open-source decentralized social media...","Free, open-source decentralized social media",Mastodon gGmbH is a non-profit from Germany th...,Believing that instant global communications w...,The first public launch occurred in October 20...,The project was officially incorporated as a g...,Join the team,What others write about us.,"Free, open-source decentralized social media p...",...,,,,,,,,,,
9,https://joinmastodon.org/servers,"[[Mastodon is not a single website. To use it,...","Mastodon is not a single website. To use it, y...",The first step is deciding which server you’d ...,"With an account on your server, you can follow...",Find a different server you'd prefer? With Mas...,"We can't control the servers, but we can contr...",All servers listed here have committed to the ...,Where the provider is legally based.,Some providers specialize in hosting accounts ...,...,,,,,,,,,,
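The cleaning step above deletes `&amp;` outright. A possible alternative, sketched here as an assumption and not as the notebook's approach, is to decode HTML entities with the standard library so the `&` is kept, and to replace newlines with a space so words do not get glued together. `clean_text` is a hypothetical helper name.

```python
import html


def clean_text(text: str) -> str:
    """Decode HTML entities and collapse all whitespace runs into single spaces."""
    text = html.unescape(text)     # "&amp;" becomes "&"
    return " ".join(text.split())  # newlines and repeated spaces become one space


print(clean_text("Tom &amp; Jerry\n  rocks"))
```

On a DataFrame column this could be applied with `df[column].map(clean_text, na_action="ignore")`, which skips the missing cells.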


Replace the empty strings with None values so we can delete them later if necessary.

In [399]:
df=df.replace(r'^\s*$', None, regex=True)
df

Unnamed: 0,Link,Paragraphs,Paragraphs0,Paragraphs1,Paragraphs2,Paragraphs3,Paragraphs4,Paragraphs5,Paragraphs6,Paragraphs7,...,Paragraphs121,Paragraphs122,Paragraphs123,Paragraphs124,Paragraphs125,Paragraphs126,Paragraphs127,Paragraphs128,Paragraphs129,Paragraphs130
0,https://mastodon.social/,[],,,,,,,,,...,,,,,,,,,,
1,https://joinmastodon.org/nl-NL,[[Jouw tijdlijn zou vol moeten staan met wat v...,Jouw tijdlijn zou vol moeten staan met wat voo...,Je weet zelf het beste wat je op jouw tijdlijn...,Mastodon biedt je een unieke mogelijkheid om j...,Mastodon legt de besluitvorming weer in jouw h...,"Mastodon ondersteunt audio-, video- en fotober...",Directe wereldwijde communicatie is te belangr...,Mastodon is vrije en opensourcesoftware. Wij g...,We respecteren jouw handelsbekwaamheid. Jouw t...,...,,,,,,,,,,
2,https://joinmastodon.org/,[[Your home feed should be filled with what ma...,Your home feed should be filled with what matt...,You know best what you want to see on your hom...,Mastodon provides you with a unique possibilit...,Mastodon puts decision making back in your han...,"Mastodon supports audio, video and picture pos...",Instant global communication is too important ...,Mastodon is free and open-source software. We ...,We respect your agency. Your feed is curated a...,...,,,,,,,,,,
3,https://mastodon.nl/,[],,,,,,,,,...,,,,,,,,,,
4,https://en.wikipedia.org/wiki/Mastodon_(social...,"[[\n], [[Mastodon], is a , [free and open-sou...",,Mastodon is a free and open-source software fo...,Each user is a member of a specific Mastodon s...,Mastodon was created by Eugen Rochko and annou...,The project is maintained by German non-profit...,Mastodon servers run social networking softwar...,"Since version 2.9.0, Mastodon has offered a si...","Users join a specific Mastodon server, rather ...",...,,,,,,,,,,
5,https://joinmastodon.org/nl-NL/servers,[[Mastodon is niet één website. Om het te gebr...,Mastodon is niet één website. Om het te gebrui...,De eerste stap is te bepalen op welke server j...,Met een account op jouw server kun je elke and...,Zoek je een andere server waar je de voorkeur ...,We hebben geen controle over de servers maar w...,Alle hier vermelde servers hebben zich akkoord...,Waar de aanbieder wettelijk is gevestigd.,Sommige aanbieders zijn gespecialiseerd in het...,...,,,,,,,,,,
6,https://nl.wikipedia.org/wiki/Mastodon_(software),"[[[Android:], 2.3.0 , [(13 februari 2024)], [...",Android: 2.3.0 (13 februari 2024)3 iOS: 1.5.2 ...,Mastodon is sinds 2016 opensourcesoftware om z...,Een Mastodon-site is te vergelijken met Twitte...,Het netwerk kreeg meer bekendheid na de overna...,Net zoals Twitter gebruikt de Mastodon web-app...,Mastodongebruikers kunnen net zoals op andere ...,Openbare tijdlijnen kunnen ook op taal gefilte...,Als gevolg van de open standaarden bestaan er ...,...,,,,,,,,,,
7,https://joinmastodon.org/nl-NL/apps,[[De beste manier om aan de slag te gaan met M...,De beste manier om aan de slag te gaan met Mas...,"Vrij, open source gedecentraliseerd sociaal me...",,,,,,,...,,,,,,,,,,
8,https://joinmastodon.org/about,"[[Free, open-source decentralized social media...","Free, open-source decentralized social media",Mastodon gGmbH is a non-profit from Germany th...,Believing that instant global communications w...,The first public launch occurred in October 20...,The project was officially incorporated as a g...,Join the team,What others write about us.,"Free, open-source decentralized social media p...",...,,,,,,,,,,
9,https://joinmastodon.org/servers,"[[Mastodon is not a single website. To use it,...","Mastodon is not a single website. To use it, y...",The first step is deciding which server you’d ...,"With an account on your server, you can follow...",Find a different server you'd prefer? With Mas...,"We can't control the servers, but we can contr...",All servers listed here have committed to the ...,Where the provider is legally based.,Some providers specialize in hosting accounts ...,...,,,,,,,,,,


Removing rows that are mostly empty.  
I'm doing this because a row that fills less than 10% of the columns adds little data, so it makes no sense to keep it.  
It also removes rows for which nothing could be scraped.

In [400]:
threshold: int =round(len(df.columns)*0.10)+2 # 2 is added so it doesn't include the columns `Link` and `Paragraphs`
print(threshold)
df= df.mask(df.eq('None')).dropna(axis=0, thresh=threshold)
df=df.reset_index()
df

15


Unnamed: 0,index,Link,Paragraphs,Paragraphs0,Paragraphs1,Paragraphs2,Paragraphs3,Paragraphs4,Paragraphs5,Paragraphs6,...,Paragraphs121,Paragraphs122,Paragraphs123,Paragraphs124,Paragraphs125,Paragraphs126,Paragraphs127,Paragraphs128,Paragraphs129,Paragraphs130
0,1,https://joinmastodon.org/nl-NL,[[Jouw tijdlijn zou vol moeten staan met wat v...,Jouw tijdlijn zou vol moeten staan met wat voo...,Je weet zelf het beste wat je op jouw tijdlijn...,Mastodon biedt je een unieke mogelijkheid om j...,Mastodon legt de besluitvorming weer in jouw h...,"Mastodon ondersteunt audio-, video- en fotober...",Directe wereldwijde communicatie is te belangr...,Mastodon is vrije en opensourcesoftware. Wij g...,...,,,,,,,,,,
1,2,https://joinmastodon.org/,[[Your home feed should be filled with what ma...,Your home feed should be filled with what matt...,You know best what you want to see on your hom...,Mastodon provides you with a unique possibilit...,Mastodon puts decision making back in your han...,"Mastodon supports audio, video and picture pos...",Instant global communication is too important ...,Mastodon is free and open-source software. We ...,...,,,,,,,,,,
2,4,https://en.wikipedia.org/wiki/Mastodon_(social...,"[[\n], [[Mastodon], is a , [free and open-sou...",,Mastodon is a free and open-source software fo...,Each user is a member of a specific Mastodon s...,Mastodon was created by Eugen Rochko and annou...,The project is maintained by German non-profit...,Mastodon servers run social networking softwar...,"Since version 2.9.0, Mastodon has offered a si...",...,,,,,,,,,,
3,6,https://nl.wikipedia.org/wiki/Mastodon_(software),"[[[Android:], 2.3.0 , [(13 februari 2024)], [...",Android: 2.3.0 (13 februari 2024)3 iOS: 1.5.2 ...,Mastodon is sinds 2016 opensourcesoftware om z...,Een Mastodon-site is te vergelijken met Twitte...,Het netwerk kreeg meer bekendheid na de overna...,Net zoals Twitter gebruikt de Mastodon web-app...,Mastodongebruikers kunnen net zoals op andere ...,Openbare tijdlijnen kunnen ook op taal gefilte...,...,,,,,,,,,,
4,10,https://techcrunch.com/2023/07/24/what-is-mast...,"[[As Twitter users fret over the , [direction]...",As Twitter users fret over the direction that ...,"Since October 27, when the SpaceX and Tesla CE...",If you’re a Twitter purist who likes to use ba...,Mastodon was founded in 2016 by German softwar...,“Unlike the past 5 years that I’ve been runnin...,Mastodon might look like a Twitter clone at fi...,"When you first create your account, you choose...",...,,,,,,,,,,
5,12,https://androidworld.nl/tips/mastodon-voor-beg...,"[[\n , [\n ...",...,Mastodon is ineens volop in het nieuws. Di...,Mastodon is opensourcesoftware waarmee ied...,"Nee, Mastodon is in 2016 ontworpen door de...",Nog een groot verschil met Twitter is het ...,Mastodon is dus opensourcesoftware dat bes...,Wees overigens niet bang dat je belangrijk...,...,,,,,,,,,,
6,15,https://mastodon.help/,"[[Hello :-), [], After about two months since ...",Hello :-)After about two months since we repla...,Mastodon is a Free and Open Source microbloggi...,But Mastodon is not a Twitter clone: it is str...,"This page, that was last updated at the end of...",The site also features a search engine for Mas...,There is no such thing as a social network cal...,"Every Instance has its own independent server,...",...,"Or, if you just want to follow the account tha...",This will cause a popup menu to appear from wh...,Once you enter your Mastodon Address and press...,That’s it! Now you can open the PeerTube accou...,,Mastodon is by far the most widely used platfo...,"At the same time, however, its development tea...","Moreover, the implementation of content views ...","Finally, to make the current trend of reproduc...",We think that in a social context that is cult...
7,18,https://github.com/mastodon/mastodon,"[[We read every piece of feedback, and take yo...","We read every piece of feedback, and take your...","To see all available qualifiers, s...","Your self-hosted, globally interconnec...",,"Mastodon is a free, open-source social network...",Click below to learn more in a video:,,...,,,,,,,,,,
8,20,https://www.vrt.be/vrtnws/nl/2022/11/01/mastod...,"[[""Mastodon"" wordt door sommigen naar voren ge...","""Mastodon"" wordt door sommigen naar voren gesc...",Waarover gaat het? Techmiljardair Elon Musk he...,Platformen met minimale moderatie worden heel ...,"Alleen is dat niet ideaal op het internet. ""De...",Ook het wispelturige karakter van Musk speelt ...,"Beluister hier het gesprek in ""De ochtend"":",Veel Twittergebruikers zouden daarom een overs...,...,,,,,,,,,,
9,22,https://en.wikipedia.org/wiki/Mastodon,"[[\n], [A , [mastodon], (, [<a class=""extiw"" ...",,A mastodon (mastós 'breast' + odoús 'tooth') i...,"M. americanum, known as an ""American mastodon""...","Taxonomically, M. americanum was first recogni...","As a member of the Mammutidae, it is defined b...",Mastodons disappeared along with many other No...,"In a letter dating to 1813, Edward Hyde, 3rd E...",...,,,,,,,,,,
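To make the threshold concrete: with 131 paragraph columns plus `Link` and `Paragraphs` there are 133 columns, so the threshold is `round(133 * 0.10) + 2`, which is 13 + 2 = 15. The `thresh` argument of `dropna` is the minimum number of non-missing cells a row needs in order to be kept. The small DataFrame below is a made-up example to show that behaviour.

```python
import pandas as pd

# Threshold arithmetic for the 133 columns of the scraped DataFrame
threshold = round(133 * 0.10) + 2
print(threshold)

# Made-up data: row "a" has 4 filled cells, row "b" has 1, row "c" has 2
df = pd.DataFrame({
    "Link": ["a", "b", "c"],
    "P0": ["x", None, "x"],
    "P1": ["y", None, None],
    "P2": ["z", None, None],
})

# Keep only the rows with at least 3 non-missing cells
kept = df.dropna(axis=0, thresh=3)
print(kept)
```

Only row "a" survives here, because the other two rows have fewer than 3 filled cells.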


In the block below I delete the columns where more cells are empty than filled.  
I do this second, after removing the mostly empty rows, because otherwise too many columns would be removed.

In [401]:
for i in range(max_para):
    column: str = f"Paragraphs{i}"
    # Count cells that are empty strings or missing (None or NaN)
    empty_count = (df[column] == '').sum() + df[column].isna().sum()

    # Drop the column when the empty cells outnumber the non-missing ones
    if empty_count > df[column].count():
        df.drop(column, axis=1, inplace=True)
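The same idea can be written without a loop by building a boolean table of empty cells and comparing the counts per column. This is a sketch on made-up data, not the notebook's code, and it compares empty cells against filled cells directly, which is slightly stricter than comparing against `count()` when empty strings are present.

```python
import pandas as pd

# Made-up data with four rows
df = pd.DataFrame({
    "Link": ["a", "b", "c", "d"],
    "P0": ["x", "y", "z", None],    # 1 of 4 empty: kept
    "P1": ["x", None, None, None],  # 3 of 4 empty: dropped
    "P2": ["x", "y", None, None],   # 2 of 4 empty: a tie, so kept
})

# True where a cell is missing or an empty string
empty = df.isna() | df.eq("")

# A column is mostly empty when it has more empty cells than filled ones
mostly_empty = empty.sum() > (~empty).sum()

trimmed = df.loc[:, ~mostly_empty]
print(trimmed.columns.tolist())
```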

In [402]:
df.drop("index",axis=1,inplace=True)
df

Unnamed: 0,Link,Paragraphs,Paragraphs0,Paragraphs1,Paragraphs2,Paragraphs3,Paragraphs4,Paragraphs5,Paragraphs6,Paragraphs7,...,Paragraphs30,Paragraphs31,Paragraphs32,Paragraphs33,Paragraphs34,Paragraphs36,Paragraphs38,Paragraphs39,Paragraphs40,Paragraphs41
0,https://joinmastodon.org/nl-NL,[[Jouw tijdlijn zou vol moeten staan met wat v...,Jouw tijdlijn zou vol moeten staan met wat voo...,Je weet zelf het beste wat je op jouw tijdlijn...,Mastodon biedt je een unieke mogelijkheid om j...,Mastodon legt de besluitvorming weer in jouw h...,"Mastodon ondersteunt audio-, video- en fotober...",Directe wereldwijde communicatie is te belangr...,Mastodon is vrije en opensourcesoftware. Wij g...,We respecteren jouw handelsbekwaamheid. Jouw t...,...,,,,,,,,,,
1,https://joinmastodon.org/,[[Your home feed should be filled with what ma...,Your home feed should be filled with what matt...,You know best what you want to see on your hom...,Mastodon provides you with a unique possibilit...,Mastodon puts decision making back in your han...,"Mastodon supports audio, video and picture pos...",Instant global communication is too important ...,Mastodon is free and open-source software. We ...,We respect your agency. Your feed is curated a...,...,,,,,,,,,,
2,https://en.wikipedia.org/wiki/Mastodon_(social...,"[[\n], [[Mastodon], is a , [free and open-sou...",,Mastodon is a free and open-source software fo...,Each user is a member of a specific Mastodon s...,Mastodon was created by Eugen Rochko and annou...,The project is maintained by German non-profit...,Mastodon servers run social networking softwar...,"Since version 2.9.0, Mastodon has offered a si...","Users join a specific Mastodon server, rather ...",...,Following the Mastodon suspension and ban on M...,Rochko claimed that at least five venture capi...,"By the start of January 2023, Mastodon had 1.8...","In 2017, Pixiv launched a Mastodon-based socia...","In April 2019, computer manufacturer Purism re...","In October 2019, the Fourth Estate Public Bene...","In October 2021, former US President Donald Tr...",While Mastodon's decentralized structure is on...,Since many Mastodon instances are run by volun...,Volunteer-run instances may not have resources...
3,https://nl.wikipedia.org/wiki/Mastodon_(software),"[[[Android:], 2.3.0 , [(13 februari 2024)], [...",Android: 2.3.0 (13 februari 2024)3 iOS: 1.5.2 ...,Mastodon is sinds 2016 opensourcesoftware om z...,Een Mastodon-site is te vergelijken met Twitte...,Het netwerk kreeg meer bekendheid na de overna...,Net zoals Twitter gebruikt de Mastodon web-app...,Mastodongebruikers kunnen net zoals op andere ...,Openbare tijdlijnen kunnen ook op taal gefilte...,Als gevolg van de open standaarden bestaan er ...,...,,,,,,,,,,
4,https://techcrunch.com/2023/07/24/what-is-mast...,"[[As Twitter users fret over the , [direction]...",As Twitter users fret over the direction that ...,"Since October 27, when the SpaceX and Tesla CE...",If you’re a Twitter purist who likes to use ba...,Mastodon was founded in 2016 by German softwar...,“Unlike the past 5 years that I’ve been runnin...,Mastodon might look like a Twitter clone at fi...,"When you first create your account, you choose...",Mastodon users generally refer to individual c...,...,"You can add up to four images to a post, up to...",Similar to how Twitter now allows users to lim...,Yep! But this functionality isn’t built into M...,No. There’s no universal verification system l...,Some servers are having fun with the idea of v...,Image Credits: @stux@mstdn.social (opens in a ...,Mastodon is experiencing a massive influx of n...,"Yes, this is possible by way of third-party to...","Yes, this is also possible with third-party to...",Nope — not unless Bluesky chooses to adopt the...
5,https://androidworld.nl/tips/mastodon-voor-beg...,"[[\n , [\n ...",...,Mastodon is ineens volop in het nieuws. Di...,Mastodon is opensourcesoftware waarmee ied...,"Nee, Mastodon is in 2016 ontworpen door de...",Nog een groot verschil met Twitter is het ...,Mastodon is dus opensourcesoftware dat bes...,Wees overigens niet bang dat je belangrijk...,Wil je dus beginnen met Mastodon? Dan kun ...,...,"Veel mensen die net 'op Mastodon' zitten, ...",Ga je Mastodon of een van de andere Fedive...,Met dank aan Gerhard voor de hulp!,"Every Mastodon explanation is like ""It's very ...",@Ciaraioch@mastodon.ie,...,"donderdag, 14 maart 20...",...,...,"donderdag, 14 maart 20..."
6,https://mastodon.help/,"[[Hello :-), [], After about two months since ...",Hello :-)After about two months since we repla...,Mastodon is a Free and Open Source microbloggi...,But Mastodon is not a Twitter clone: it is str...,"This page, that was last updated at the end of...",The site also features a search engine for Mas...,There is no such thing as a social network cal...,"Every Instance has its own independent server,...",From any Instance it is possible to interact w...,...,You can read more about this topic in our Crit...,Considering the large number of tools Mastodon...,"In contrast, reality is quite different, and i...",While it is technically possible for those who...,"Since its inception, Mastodon has been adopted...","Each Mastodon Instance is independent, has its...","However, due to the federated nature of the pl...","On smartphones, Mastodon can be used from any ...","On Android, the best known Apps are Mastodon (...","On iOS you have a similar amount of choice, wi..."
7,https://github.com/mastodon/mastodon,"[[We read every piece of feedback, and take yo...","We read every piece of feedback, and take your...","To see all available qualifiers, s...","Your self-hosted, globally interconnec...",,"Mastodon is a free, open-source social network...",Click below to learn more in a video:,,,...,,,,,,,,,,
8,https://www.vrt.be/vrtnws/nl/2022/11/01/mastod...,"[[""Mastodon"" wordt door sommigen naar voren ge...","""Mastodon"" wordt door sommigen naar voren gesc...",Waarover gaat het? Techmiljardair Elon Musk he...,Platformen met minimale moderatie worden heel ...,"Alleen is dat niet ideaal op het internet. ""De...",Ook het wispelturige karakter van Musk speelt ...,"Beluister hier het gesprek in ""De ochtend"":",Veel Twittergebruikers zouden daarom een overs...,Het is een project dat is opgezet door vrijwil...,...,,,,,,,,,,
9,https://en.wikipedia.org/wiki/Mastodon,"[[\n], [A , [mastodon], (, [<a class=""extiw"" ...",,A mastodon (mastós 'breast' + odoús 'tooth') i...,"M. americanum, known as an ""American mastodon""...","Taxonomically, M. americanum was first recogni...","As a member of the Mammutidae, it is defined b...",Mastodons disappeared along with many other No...,"In a letter dating to 1813, Edward Hyde, 3rd E...",Abeel reported in a later that he went to the ...,...,"In 1930, Matthew erected a second species for ...","In 1933, Childs Frick named the species Mastod...","In 1963, J. Arnold Shotwell and Donald E. Russ...",The genus Pliomastodon was synonymized with Ma...,"In 2019, Alton C. Dooley Jr. et. al. establish...",Several mammutid species outside of North Amer...,Although the separation of the Mammutida and E...,"In the early Neogene phase of evolution, Eozyg...",Mammut as currently defined sensu lato (in a l...,The oldest evidence of mammutids in North Amer...


## Send information back

In [403]:
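# Drop the nested "Paragraphs" column; the flattened Paragraphs0..N columns keep the text.
# drop() returns a new frame, so df itself is left untouched.
df_cleaned = df.drop(columns="Paragraphs")
df_cleaned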

Unnamed: 0,Link,Paragraphs0,Paragraphs1,Paragraphs2,Paragraphs3,Paragraphs4,Paragraphs5,Paragraphs6,Paragraphs7,Paragraphs8,...,Paragraphs30,Paragraphs31,Paragraphs32,Paragraphs33,Paragraphs34,Paragraphs36,Paragraphs38,Paragraphs39,Paragraphs40,Paragraphs41
0,https://joinmastodon.org/nl-NL,Jouw tijdlijn zou vol moeten staan met wat voo...,Je weet zelf het beste wat je op jouw tijdlijn...,Mastodon biedt je een unieke mogelijkheid om j...,Mastodon legt de besluitvorming weer in jouw h...,"Mastodon ondersteunt audio-, video- en fotober...",Directe wereldwijde communicatie is te belangr...,Mastodon is vrije en opensourcesoftware. Wij g...,We respecteren jouw handelsbekwaamheid. Jouw t...,Mastodon is gebouwd op open webprotocollen en ...,...,,,,,,,,,,
1,https://joinmastodon.org/,Your home feed should be filled with what matt...,You know best what you want to see on your hom...,Mastodon provides you with a unique possibilit...,Mastodon puts decision making back in your han...,"Mastodon supports audio, video and picture pos...",Instant global communication is too important ...,Mastodon is free and open-source software. We ...,We respect your agency. Your feed is curated a...,"Built on open web protocols, Mastodon can spea...",...,,,,,,,,,,
2,https://en.wikipedia.org/wiki/Mastodon_(social...,,Mastodon is a free and open-source software fo...,Each user is a member of a specific Mastodon s...,Mastodon was created by Eugen Rochko and annou...,The project is maintained by German non-profit...,Mastodon servers run social networking softwar...,"Since version 2.9.0, Mastodon has offered a si...","Users join a specific Mastodon server, rather ...",Mastodon includes a number of specific privacy...,...,Following the Mastodon suspension and ban on M...,Rochko claimed that at least five venture capi...,"By the start of January 2023, Mastodon had 1.8...","In 2017, Pixiv launched a Mastodon-based socia...","In April 2019, computer manufacturer Purism re...","In October 2019, the Fourth Estate Public Bene...","In October 2021, former US President Donald Tr...",While Mastodon's decentralized structure is on...,Since many Mastodon instances are run by volun...,Volunteer-run instances may not have resources...
3,https://nl.wikipedia.org/wiki/Mastodon_(software),Android: 2.3.0 (13 februari 2024)3 iOS: 1.5.2 ...,Mastodon is sinds 2016 opensourcesoftware om z...,Een Mastodon-site is te vergelijken met Twitte...,Het netwerk kreeg meer bekendheid na de overna...,Net zoals Twitter gebruikt de Mastodon web-app...,Mastodongebruikers kunnen net zoals op andere ...,Openbare tijdlijnen kunnen ook op taal gefilte...,Als gevolg van de open standaarden bestaan er ...,Het maakt net zoals bij e-mail in principe nie...,...,,,,,,,,,,
4,https://techcrunch.com/2023/07/24/what-is-mast...,As Twitter users fret over the direction that ...,"Since October 27, when the SpaceX and Tesla CE...",If you’re a Twitter purist who likes to use ba...,Mastodon was founded in 2016 by German softwar...,“Unlike the past 5 years that I’ve been runnin...,Mastodon might look like a Twitter clone at fi...,"When you first create your account, you choose...",Mastodon users generally refer to individual c...,Choosing which server to register your account...,...,"You can add up to four images to a post, up to...",Similar to how Twitter now allows users to lim...,Yep! But this functionality isn’t built into M...,No. There’s no universal verification system l...,Some servers are having fun with the idea of v...,Image Credits: @stux@mstdn.social (opens in a ...,Mastodon is experiencing a massive influx of n...,"Yes, this is possible by way of third-party to...","Yes, this is also possible with third-party to...",Nope — not unless Bluesky chooses to adopt the...
5,https://androidworld.nl/tips/mastodon-voor-beg...,...,Mastodon is ineens volop in het nieuws. Di...,Mastodon is opensourcesoftware waarmee ied...,"Nee, Mastodon is in 2016 ontworpen door de...",Nog een groot verschil met Twitter is het ...,Mastodon is dus opensourcesoftware dat bes...,Wees overigens niet bang dat je belangrijk...,Wil je dus beginnen met Mastodon? Dan kun ...,Elke Instance is volledig onafhankelijk en...,...,"Veel mensen die net 'op Mastodon' zitten, ...",Ga je Mastodon of een van de andere Fedive...,Met dank aan Gerhard voor de hulp!,"Every Mastodon explanation is like ""It's very ...",@Ciaraioch@mastodon.ie,...,"donderdag, 14 maart 20...",...,...,"donderdag, 14 maart 20..."
6,https://mastodon.help/,Hello :-)After about two months since we repla...,Mastodon is a Free and Open Source microbloggi...,But Mastodon is not a Twitter clone: it is str...,"This page, that was last updated at the end of...",The site also features a search engine for Mas...,There is no such thing as a social network cal...,"Every Instance has its own independent server,...",From any Instance it is possible to interact w...,Those who administer an Instance may limit or ...,...,You can read more about this topic in our Crit...,Considering the large number of tools Mastodon...,"In contrast, reality is quite different, and i...",While it is technically possible for those who...,"Since its inception, Mastodon has been adopted...","Each Mastodon Instance is independent, has its...","However, due to the federated nature of the pl...","On smartphones, Mastodon can be used from any ...","On Android, the best known Apps are Mastodon (...","On iOS you have a similar amount of choice, wi..."
7,https://github.com/mastodon/mastodon,"We read every piece of feedback, and take your...","To see all available qualifiers, s...","Your self-hosted, globally interconnec...",,"Mastodon is a free, open-source social network...",Click below to learn more in a video:,,,It doesn't have to be Mastodon; whatever imple...,...,,,,,,,,,,
8,https://www.vrt.be/vrtnws/nl/2022/11/01/mastod...,"""Mastodon"" wordt door sommigen naar voren gesc...",Waarover gaat het? Techmiljardair Elon Musk he...,Platformen met minimale moderatie worden heel ...,"Alleen is dat niet ideaal op het internet. ""De...",Ook het wispelturige karakter van Musk speelt ...,"Beluister hier het gesprek in ""De ochtend"":",Veel Twittergebruikers zouden daarom een overs...,Het is een project dat is opgezet door vrijwil...,"""Het lijkt op het eerste zicht wel wat op Twit...",...,,,,,,,,,,
9,https://en.wikipedia.org/wiki/Mastodon,,A mastodon (mastós 'breast' + odoús 'tooth') i...,"M. americanum, known as an ""American mastodon""...","Taxonomically, M. americanum was first recogni...","As a member of the Mammutidae, it is defined b...",Mastodons disappeared along with many other No...,"In a letter dating to 1813, Edward Hyde, 3rd E...",Abeel reported in a later that he went to the ...,"In 1739, a French military expedition under th...",...,"In 1930, Matthew erected a second species for ...","In 1933, Childs Frick named the species Mastod...","In 1963, J. Arnold Shotwell and Donald E. Russ...",The genus Pliomastodon was synonymized with Ma...,"In 2019, Alton C. Dooley Jr. et. al. establish...",Several mammutid species outside of North Amer...,Although the separation of the Mammutida and E...,"In the early Neogene phase of evolution, Eozyg...",Mammut as currently defined sensu lato (in a l...,The oldest evidence of mammutids in North Amer...


In [404]:
df_cleaned.to_json("./ResultOfSearchAndScrapeBot.json")
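
With no `orient` argument, `to_json` writes a DataFrame column by column: each column name maps to a dictionary of row label to value, and empty (NaN) cells become `null`. A consumer that wants one record per link has to regroup that structure. The sketch below shows one way to do this with only the standard library. The helper name `rows_from_column_json` and the tiny sample are made up for illustration and are not part of the notebook.

```python
import json

def rows_from_column_json(text):
    """Turn pandas' default column-oriented JSON into one dict per row."""
    columns = json.loads(text)  # {"column": {"row label": value, ...}, ...}
    rows = {}
    for column, cells in columns.items():
        for label, value in cells.items():
            if value is None:  # NaN cells are written as null; skip them
                continue
            rows.setdefault(label, {})[column] = value
    return list(rows.values())

# Hand-written stand-in for the contents of ResultOfSearchAndScrapeBot.json
sample = json.dumps({
    "Link": {"0": "https://joinmastodon.org/", "1": "https://mastodon.help/"},
    "Paragraphs0": {"0": "Your home feed ...", "1": None},
})

records = rows_from_column_json(sample)
print(records)
```

If the receiving side prefers this shape directly, `to_json(orient="records")` is an alternative worth considering, at the cost of losing the row labels.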

## What's next?
Next, the scraped results need to be sent on to whatever will consume them.
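
The notebook does not say yet where the results should go, so the following is only a rough sketch of what sending could look like, assuming the receiver accepts a JSON body over HTTP POST. The URL `https://platform.example/api/results`, the `results` key, and the function name `build_result_request` are placeholders. The sketch uses the standard library's `urllib.request` instead of `requests` so that it runs without extra dependencies, and it only builds the request without sending it.

```python
import json
import urllib.request

# Hypothetical endpoint; the notebook does not name the real platform URL.
PLATFORM_URL = "https://platform.example/api/results"

def build_result_request(records, url=PLATFORM_URL):
    """Package scraped records as a JSON POST request (nothing is sent yet)."""
    body = json.dumps({"results": records}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_result_request(
    [{"Link": "https://mastodon.help/", "Paragraphs0": "Hello :-)"}]
)
print(request.get_method(), request.full_url)
# Sending is left out on purpose; urllib.request.urlopen(request) would perform the call.
```

Keeping the request construction separate from the actual network call makes it possible to inspect the payload before anything leaves the machine.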

## Improvements

- Connect to the platform and receive the search query from it.
- Send the data back.