<a href="https://colab.research.google.com/github/Zaheer-Aswath/DataScience-Python/blob/main/Edureka/mohammed_zaheer_day18_diy_solution_doc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Q1. Problem Statement: Web Scraping using BeautifulSoup**

Write a Python program that extracts data from a website using web scraping to perform the following tasks:

1. Use the `requests` library and the link below to fetch the page.

2. Use BeautifulSoup to parse the page's source code, then find the table on the page.

3. After finding the table, extract the data from all available columns and store it in a DataFrame.

**Website Link:**

Use the link below to get the data from the table on the website.

https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population
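
Before running against the live page, the three tasks can be tried on a small inline table. The snippet below is only a sketch: `SAMPLE_HTML`, `extract_table`, and its parameters are hypothetical names introduced for illustration, and the sample rows are made-up placeholders that mimic a layout in which the rank sits in a `<th>` cell and the other values sit in `<td>` cells. It assumes the first row holds the headers, the second row is a totals row to skip, and the last column is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (placeholder values, not real data).
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th></th><th>Country</th><th>Population</th><th>Notes</th></tr>
  <tr><th>–</th><td>World</td><td>8,000</td><td>n/a</td></tr>
  <tr><th>1</th><td>Country A</td><td>5,000</td><td>x</td></tr>
  <tr><th>2</th><td>Country B</td><td>3,000</td><td>y</td></tr>
</table>
"""


def extract_table(html, skip_rows=2, drop_last=True):
    """Return (header, records) for the first table carrying the 'wikitable' class."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:
        raise ValueError("no table with class 'wikitable' found")

    rows = table.find_all("tr")
    # Assumption: the first row holds the column headers.
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]

    records = []
    for row in rows[skip_rows:]:
        # Collect <th> and <td> cells together, in document order.
        cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
        records.append(cells)

    if drop_last:
        # Drop the trailing column from the header and from every row.
        header = header[:-1]
        records = [r[:-1] for r in records]
    return header, records


header, records = extract_table(SAMPLE_HTML)
print(header)
for record in records:
    print(record)
```

Passing `records` and `header` to `pd.DataFrame(records, columns=header)` would complete the third task. Matching on the single class `wikitable` is a design choice here: it still finds the table if the page adds further classes to it.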

In [None]:
#Step - 1: Import libraries

import requests  # requests performs the HTTP request that fetches the web content
from bs4 import BeautifulSoup  # BeautifulSoup4 parses and navigates the HTML content
import pandas as pd  # pandas stores the extracted rows in a dataframe

#Step - 2: Fetch the page that contains the table from the given URL, using the requests library

web = requests.get("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population")
#print(web.status_code)  # status code 200 (HTTP status codes) means success - OK

#Step - 3: Parse the response with Beautiful Soup, a Python library for parsing HTML and XML that makes it easy to scrape information from web pages

soup = BeautifulSoup(web.content, "html.parser")
#print(soup.prettify())  # prints the parsed HTML with indentation (readable)

#Step - 4: After parsing the page, find the <table> element, specifying the table's class as well
my_table = soup.find("table", {"class": "wikitable sortable"})
#my_table

#Step - 5: Once the table is found, extract all of its rows with find_all (findAll is a deprecated alias), which returns a list of every <tr> in the table
table = my_table.find_all("tr")

#Step - 6: Declare a list that will hold the table data as a list of lists (one inner list per row)
data_need = []

#Step - 7: In the table on the website:
#  - the first <tr> row (index 0) sits under <thead> and holds the column names; it is used later when creating the dataframe
#  - the second <tr> row (index 1) holds the world data, so it is skipped
#  - individual countries start at the third row (index 2), so iteration starts at index 2
for row in table[2:]:

    #Step - 7.1: For each row, gather the text of every <td> cell using a list comprehension
    data = [i.text.strip() for i in row.find_all('td')]

    #Step - 7.2: Drop the last column, as it is not required
    data.pop()

    #Step - 7.3: The serial numbers (first column) are stored in a <th> tag rather than <td>,
    #so extract that value separately and insert it at position 0 of the row's list
    data.insert(0, row.find('th').text.strip())

    #print(row.find('th').text)
    #print(data)

    #Step - 7.4: Append this row's list to the main list
    data_need.append(data)

#print(data_need)

#Step - 8: Print the values from the main list in comma-separated format, as requested in the DIY exercise
for row in data_need:
    for value in row:
        print(value, end=", ")
    print("\n")

#Step - 9: Display the data as a dataframe with column names
#As mentioned in Step - 7, the column names are in row 0 of the table inside <th> (table header) tags, so a list comprehension extracts all header values
column_names = [d.text.strip() for d in table[0].find_all('th')]

#Remove the last column header, since the last column's data was removed in Step - 7.2
column_names.pop()

#print(column_names)

#Create the dataframe from the list of lists built in Step - 7 and the column names found above
df = pd.DataFrame(data_need, columns=column_names)
display(df)

#The dataframe can also be saved to Excel/CSV if required
#df.to_csv("Countries_Population_Wiki.csv", index=False)

1, China, 1,411,750,000, 17.5%, 31 Dec 2022, Official estimate[4], 

2, India, 1,392,329,000, 17.3%, 1 Mar 2023, Official projection[5], 

3, United States, 335,141,000, 4.2%, 28 Jul 2023, National population clock[7], 

4, Indonesia, 277,749,853, 3.5%, 31 Dec 2022, Official estimate[8], 

5, Pakistan, 220,425,254, 2.7%, 1 Jul 2020, Official projection[9], 

6, Nigeria, 216,783,400, 2.7%, 21 Mar 2022, Official projection[10], 

7, Brazil, 203,062,512, 2.5%, 1 Aug 2022, 2022 census result[11], 

8, Bangladesh, 169,828,911, 2.1%, 15 Jun 2022, 2022 census result[12], 

9, Russia, 146,424,729, 1.8%, 1 Jan 2023, Official estimate[13], 

10, Mexico, 129,035,733, 1.6%, 31 Mar 2023, National quarterly estimate[14], 

11, Japan, 124,500,000, 1.5%, 1 May 2023, Official estimate[15], 

12, Philippines, 110,659,000, 1.4%, 28 Jul 2023, National population clock[16], 

13, Ethiopia, 105,163,988, 1.3%, 1 Jul 2022, National annual projection[17], 

14, Egypt, 102,060,688, 1.3%, 1 Jul 2021, Official es

| | | Country / Dependency | Population | % of world | Date | Source (official or from the United Nations) |
|---|---|---|---|---|---|---|
| 0 | 1 | China | 1411750000 | 17.5% | 31 Dec 2022 | Official estimate[4] |
| 1 | 2 | India | 1392329000 | 17.3% | 1 Mar 2023 | Official projection[5] |
| 2 | 3 | United States | 335141000 | 4.2% | 28 Jul 2023 | National population clock[7] |
| 3 | 4 | Indonesia | 277749853 | 3.5% | 31 Dec 2022 | Official estimate[8] |
| 4 | 5 | Pakistan | 220425254 | 2.7% | 1 Jul 2020 | Official projection[9] |
| ... | ... | ... | ... | ... | ... | ... |
| 236 | – | Tokelau (NZ) | 1647 | 0% | 1 Jan 2019 | 2019 Census [227] |
| 237 | – | Niue | 1549 | 0% | 1 Jul 2021 | National annual projection[167] |
| 238 | – | Cocos (Keeling) Islands (Australia) | 593 | 0% | 30 Jun 2020 | 2021 Census[228] |
| 239 | 195 | Vatican City | 246 | 0% | 26 Jun 2023 | Monthly national estimate[229] |
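
The scraped cells are plain strings: the printed rows show populations with thousands separators, shares with a percent sign, and sources with footnote markers such as `[4]`. If numeric work is planned, a cleaning pass along these lines might help. This is a minimal sketch using only the standard library; the helper names `strip_footnotes`, `parse_population`, and `parse_share` are hypothetical and not part of the original solution.

```python
import re

# Matches a Wikipedia-style footnote marker such as "[4]" or " [227]".
FOOTNOTE = re.compile(r"\s*\[\d+\]")


def strip_footnotes(text):
    """Remove footnote markers like '[4]' from a cell."""
    return FOOTNOTE.sub("", text).strip()


def parse_population(text):
    """Convert '1,411,750,000' to the integer 1411750000."""
    return int(text.replace(",", ""))


def parse_share(text):
    """Convert a percentage string such as '17.5%' to a fraction."""
    return float(text.rstrip("%")) / 100


# One row as printed in Step - 8 of the cell output.
row = ["1", "China", "1,411,750,000", "17.5%", "31 Dec 2022", "Official estimate[4]"]

cleaned = {
    "country": row[1],
    "population": parse_population(row[2]),
    "share": parse_share(row[3]),
    "source": strip_footnotes(row[5]),
}
print(cleaned)
```

The same helpers could be applied column by column to the dataframe, for example with `df["Population"].map(parse_population)`, assuming the column still holds comma-separated strings.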
