# Web Scraping

To extract data from websites, we can use the pandas **read_html** function, which returns a list of DataFrames (one per HTML table found) directly from a **url**.

In [1]:
import pandas as pd
url = "https://en.wikipedia.org/wiki/World_population"
dataframe_list = pd.read_html(url, flavor='bs4')
display(dataframe_list[5])

Unnamed: 0,Rank,Country,Population,Area(km2),Density(pop/km2)
0,1,Singapore,5704000,710,8033
1,2,Bangladesh,172240000,143998,1196
2,3,Palestine,5266785,6020,847
3,4,Lebanon,6856000,10452,656
4,5,Taiwan,23604000,36193,652
5,6,South Korea,51781000,99538,520
6,7,Rwanda,12374000,26338,470
7,8,Haiti,11578000,27065,428
8,9,Netherlands,17690000,41526,426
9,10,Israel,9480000,22072,429


However, sometimes the server refuses the request and responds with **HTTP Error 403: Forbidden**.

In [2]:
url = 'https://www.worldometers.info/world-population'
dataframe_list = pd.read_html(url, flavor='bs4')

HTTPError: HTTP Error 403: Forbidden

To overcome this issue, we can download the page with requests and use BeautifulSoup to scrape its HTML tables into a DataFrame.

In [4]:
from bs4 import BeautifulSoup  # module for web scraping
import requests  # module to download a web page

In [5]:
url = 'https://www.worldometers.info/world-population'
data = requests.get(url).text
soup = BeautifulSoup(data, "html.parser")
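
Some servers also reject requests that do not look like they come from a browser. If the plain `requests.get` call is refused as well, one common workaround is to send a browser-like `User-Agent` header. The sketch below is an illustration only: the `fetch_html` helper, the `BROWSER_HEADERS` constant, and the header string itself are assumptions, not part of the original notebook.

```python
import requests

# Hypothetical browser-like header; the exact string is an assumption
# and may need adjusting for a given site.
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def fetch_html(url, headers=BROWSER_HEADERS, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise an exception for 4xx/5xx responses (such as 403)
    # instead of silently parsing an error page.
    response.raise_for_status()
    return response.text
```

The returned text can be passed to `BeautifulSoup(..., "html.parser")` in the same way as `data` above. Whether scraping is permitted at all depends on the site's terms of use.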

In [6]:
# Find all HTML tables in the web page
tables = soup.find_all('table')

In [7]:
# We can see how many tables were found by checking the length of the tables list
number_of_tables = len(tables)
number_of_tables

5

**Function to get the column names**

In [8]:
def get_column_names(my_table):
    # find_all is the current name; findAll is a deprecated alias
    columns = my_table.find_all('th')
    columns_name = [x.text.strip() for x in columns]
    return columns_name

**Function to create the dataframe corresponding to a table**

In [9]:
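def create_dataframe(table):
    columns_name = get_column_names(table)
    rows = []
    # Fall back to the table itself when the HTML has no explicit <tbody>
    body = table.tbody if table.tbody is not None else table
    for row in body.find_all("tr"):
        col = row.find_all("td")
        if col:  # skip rows without <td> cells, such as header rows
            my_data = [x.text.strip() for x in col]
            rows.append(dict(zip(columns_name, my_data)))
    # DataFrame.append was removed in pandas 2.0, so build the frame once from all rows
    return pd.DataFrame(rows, columns=columns_name)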
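
To check the table-parsing logic without a network call, the same steps can be run on a small inline table. This is a minimal sketch: the HTML string is a hypothetical stand-in for a scraped page, with two values borrowed from the regional table shown later in this notebook.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical two-row table, used here instead of a live page.
html = """
<table>
  <thead><tr><th>Region</th><th>Population</th></tr></thead>
  <tbody>
    <tr><td>Asia</td><td>4641054775</td></tr>
    <tr><td>Africa</td><td>1340598147</td></tr>
  </tbody>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")

# Header cells give the column names.
columns = [th.get_text(strip=True) for th in table.find_all("th")]

# Each row with <td> cells becomes one list of strings;
# the header row has no <td> cells and is skipped.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find_all("td")
]

# Build the DataFrame once from all collected rows.
df = pd.DataFrame(rows, columns=columns)
print(df)
```

Note that every cell is read as a string, so numeric columns still need converting before any arithmetic.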

**Loop over all tables in the website**: create a DataFrame for each table and display it. Additionally, we can save each table in CSV format. If the table has a caption, the CSV file is saved under the caption name; otherwise it is saved as 'table_X', where X is the index of the table.
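
Captions can contain characters that are awkward or invalid in file names, such as slashes or colons. One possible precaution, which the loop below does not apply, is to normalise the caption first. The `safe_filename` helper and the example caption are hypothetical, shown only as a sketch.

```python
import re


def safe_filename(caption, fallback):
    """Turn a table caption into a conservative file stem.

    Runs of characters other than word characters (letters, digits,
    underscore), '.' or '-' are replaced by a single underscore.
    If nothing usable remains, `fallback` is returned.
    """
    stem = re.sub(r"[^\w.-]+", "_", caption.strip()).strip("._")
    return stem or fallback


# Hypothetical caption, for illustration only.
print(safe_filename("World Population: Past/Present", "table_0"))
# prints World_Population_Past_Present
```

With such a helper, `save_name` could be set to `safe_filename(title.text, 'table_' + str(index))` when a caption is present.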

In [10]:
my_list = []
for index, table in enumerate(tables):
    title = table.find('caption')
    if title is None:
        save_name = 'table_' + str(index)
    else:
        save_name = title.text.strip()
    print(save_name)
    my_df = create_dataframe(table)
    my_list.append(my_df)
    display(my_df)
    # Save df to csv
    my_df.to_csv(save_name + '.csv')



table_0


Unnamed: 0,Year (July 1),Population,Yearly % Change,Yearly Change,Median Age,Fertility Rate,Density (P/Km²),Urban Pop %,Urban Population
0,2018,7631091040,1.10 %,83232115,29.8,2.51,51,55.3 %,4219817318
1,2017,7547858925,1.12 %,83836876,29.8,2.51,51,54.9 %,4140188594
2,2016,7464022049,1.14 %,84224910,29.8,2.51,50,54.4 %,4060652683
3,2015,7379797139,1.19 %,84594707,30.0,2.52,50,54.0 %,3981497663
4,2010,6956823603,1.24 %,82983315,28.0,2.58,47,51.7 %,3594868146
5,2005,6541907027,1.26 %,79682641,27.0,2.65,44,49.2 %,3215905863
6,2000,6143493823,1.35 %,79856169,26.0,2.78,41,46.7 %,2868307513
7,1995,5744212979,1.52 %,83396384,25.0,3.01,39,44.8 %,2575505235
8,1990,5327231061,1.81 %,91261864,24.0,3.44,36,43.0 %,2290228096
9,1985,4870921740,1.79 %,82583645,23.0,3.59,33,41.2 %,2007939063


table_1


Unnamed: 0,Year (July 1),Population,Yearly % Change,Yearly Change,Median Age,Fertility Rate,Density (P/Km²),Urban Pop %,Urban Population
0,2020,7794798739,1.10 %,83000320,31,2.47,52,56.2 %,4378993944
1,2025,8184437460,0.98 %,77927744,32,2.54,55,58.3 %,4774646303
2,2030,8548487400,0.87 %,72809988,33,2.62,57,60.4 %,5167257546
3,2035,8887524213,0.78 %,67807363,34,2.7,60,62.5 %,5555833477
4,2040,9198847240,0.69 %,62264605,35,2.77,62,64.6 %,5938249026
5,2045,9481803274,0.61 %,56591207,35,2.85,64,66.6 %,6312544819
6,2050,9735033990,0.53 %,50646143,36,2.95,65,68.6 %,6679756162


table_2


0


table_3


Unnamed: 0,#,Region,Population(2020),YearlyChange,NetChange,Density(P/Km²),Land Area(Km²),Migrants(net),Fert.Rate,Med.Age,UrbanPop %,WorldShare
0,1,Asia,4641054775,0.86 %,39683577,150,31033131,-1729112,2.2,32,0 %,59.5 %
1,2,Africa,1340598147,2.49 %,32533952,45,29648481,-463024,4.4,20,0 %,17.2 %
2,3,Europe,747636026,0.06 %,453275,34,22134900,1361011,1.6,43,0 %,9.6 %
3,4,Latin America and the Caribbean,653962331,0.9 %,5841374,32,20139378,-521499,2.0,31,0 %,8.4 %
4,5,Northern America,368869647,0.62 %,2268683,20,18651660,1196400,1.8,39,0 %,4.7 %
5,6,Oceania,42677813,1.31 %,549778,5,8486460,156226,2.4,33,0 %,0.5 %


table_4


Unnamed: 0,#,Country (or dependency),Population(2020),YearlyChange,NetChange,Density (P/Km²),Land Area (Km²),Migrants(net),Fert.Rate,Med.Age,UrbanPop %,WorldShare
0,1,China,1439323776,0.39 %,5540090,153,9388211,-348399,1.69,38,60.8 %,18.5 %
1,2,India,1380004385,0.99 %,13586631,464,2973190,-532687,2.2402,28,35 %,17.7 %
2,3,United States,331002651,0.59 %,1937734,36,9147420,954806,1.7764,38,82.8 %,4.2 %
3,4,Indonesia,273523615,1.07 %,2898047,151,1811570,-98955,2.3195,30,56.4 %,3.5 %
4,5,Pakistan,220892340,2 %,4327022,287,770880,-233379,3.55,23,35.1 %,2.8 %
...,...,...,...,...,...,...,...,...,...,...,...,...
230,231,Montserrat,4992,0.06 %,3,50,100,,N.A.,N.A.,9.6 %,0 %
231,232,Falkland Islands,3480,3.05 %,103,0,12170,,N.A.,N.A.,66 %,0 %
232,233,Niue,1626,0.68 %,11,6,260,,N.A.,N.A.,46.4 %,0 %
233,234,Tokelau,1357,1.27 %,17,136,10,,N.A.,N.A.,0 %,0 %
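
All values scraped this way are strings (for example `0.39 %` or `N.A.`), so they usually need converting before analysis. The sketch below shows one possible approach; the `to_number` helper is hypothetical, and the sample rows reuse a few values from the last table above.

```python
import pandas as pd


def to_number(series):
    """Strip percent signs, commas and whitespace, then convert to numbers.

    Values that still cannot be parsed (such as 'N.A.') become NaN.
    """
    cleaned = series.astype(str).str.replace(r"[%,\s]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


# Small sample with values taken from the country table.
sample = pd.DataFrame({
    "Population(2020)": ["1439323776", "4992"],
    "YearlyChange": ["0.39 %", "0.06 %"],
    "Fert.Rate": ["1.69", "N.A."],
})

# Apply the conversion column by column.
numeric = sample.apply(to_number)
print(numeric.dtypes)
```

After this step the percentage columns hold plain numbers (0.39 rather than `0.39 %`), so the unit has to be remembered from the column name.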
