### Web Scraping with Beautiful Soup

#### And analysis of the data saved to a CSV file

We will use this URL: https://www.populationu.com/gen/countries-by-gdp. The site may change its data or the HTML structure that holds it, so check the page with the browser developer tools (Inspect). If anything has changed, update the tag and class names in the extraction steps that follow.

BeautifulSoup [https://beautiful-soup-4.readthedocs.io/en/latest/] is a popular Python library for web scraping and data extraction. Here's a basic guide on how to use BeautifulSoup for web data extraction:

Install BeautifulSoup: Before we can use BeautifulSoup, we need to install it. You can install it using pip, the Python package manager. Open a terminal or command prompt and enter the following command: `pip install beautifulsoup4`

Review the HTML content: Go to the website, open the browser developer tools, and inspect the HTML elements that hold the data you want to extract.

Import BeautifulSoup: Once you have installed BeautifulSoup, you need to import it in your Python script. Here's how you can do it: `from bs4 import BeautifulSoup`

Get the HTML content: Before we can extract data from a web page, we need to get the HTML content of the page. There are several ways to do this, but one common way is to use the requests library to make an HTTP request and get the response. Here's an example: 

`import requests`

`response = requests.get('https://www.example.com')`

`html_content = response.content`

Create a BeautifulSoup object: Once you have the HTML content, you can create a BeautifulSoup object that you can use to extract data. Here's how you can create a BeautifulSoup object: `soup = BeautifulSoup(html_content, 'html.parser')`

Extract data: Once you have a BeautifulSoup object, you can use its methods to extract data. For example, you can use the `find` method to find the first occurrence of a tag, or the `find_all` method to find all occurrences of a tag. Here are some examples:

__Find the first occurrence of the title tag__
`title_tag = soup.find('title')`

__Find all the links in the page__
`link_tags = soup.find_all('a')`

__Find all the paragraphs with class 'description'__
`description_tags = soup.find_all('p', class_='description')`
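The three lookups above can be tried offline on a small inline document. The following is a minimal sketch; the HTML string and everything in it are made up for illustration and do not come from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example only.
html_content = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p class="description">First description</p>
    <p>Plain paragraph</p>
    <p class="description">Second description</p>
    <a href="/one">One</a>
    <a href="/two">Two</a>
  </body>
</html>
"""

soup = BeautifulSoup(html_content, "html.parser")

# find returns the first matching tag, or None if nothing matches.
title_tag = soup.find("title")
title_text = title_tag.get_text()

# find_all returns a list-like ResultSet of every matching tag.
link_targets = [a["href"] for a in soup.find_all("a")]

# class_ (with a trailing underscore) filters on the CSS class attribute.
descriptions = [p.get_text() for p in soup.find_all("p", class_="description")]

print(title_text)
print(link_targets)
print(descriptions)
```

Working on a fixed string like this makes it easier to see what each method returns before pointing the same calls at a live page.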

Additionally, you can use browser automation tools to log in automatically, click a button, and so on.
Here is an informative tutorial: https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3


In [29]:
# kept blank

__Install all necessary packages__

In [30]:
# install the required libraries
!pip install requests
!pip install html5lib
!pip install beautifulsoup4



### Import all required packages

In [31]:
import requests
from bs4 import BeautifulSoup
import csv
import re
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

### Request the URL (page) from which the data will be extracted

In [32]:
URL = "https://www.populationu.com/gen/countries-by-gdp"
r = requests.get(URL, verify=False)
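The call above passes `verify=False`, which turns off TLS certificate checking. Where the site's certificate validates correctly, a more defensive fetch might look like the sketch below. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration; they are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw HTML bytes for url, raising on HTTP errors.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    # Certificate verification stays at its default (enabled).
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.content
```

The bytes returned by `fetch_html(URL)` could then be passed to `BeautifulSoup` in place of `r.content`.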

#### Now the BeautifulSoup library will parse the HTML data from that page
The printed result can be a very large block of HTML.

In [33]:
soup = BeautifulSoup(r.content, 'html.parser') # 'html.parser' ships with Python; html5lib is only needed if you pass 'html5lib' as the parser
# print(soup.prettify())

In [34]:
# Find the div with table
table_div = soup.find("div", {"class": "pdiv"})

Find the main table in the HTML after carefully inspecting the page with the browser developer tools.

First, store the header columns.

Then fetch the contents for the columns collected in the previous step.

__Verify the data before extracting at a deeper level__

Comment out each print command after running it. The prints are only for inspection.

In [35]:
# print(table_div)

In [36]:
# find the table within the div
table = table_div.find('table', {"class": "ptable2"})
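Both `find` calls above return `None` when nothing matches, so a layout change on the page only surfaces later as an `AttributeError`. A small guard can make that failure explicit. This is a sketch only: the helper `find_required` is hypothetical, and the inline HTML merely reuses the `pdiv` and `ptable2` class names from the cells above.

```python
from bs4 import BeautifulSoup


def find_required(parent, name, class_name):
    """Return the first matching tag or raise a descriptive error.

    Hypothetical helper written for this example.
    """
    tag = parent.find(name, class_=class_name)
    if tag is None:
        raise ValueError(
            f"<{name} class='{class_name}'> not found; the page layout may have changed"
        )
    return tag


# Made-up markup that reuses the class names from the cells above.
sample_html = '<div class="pdiv"><table class="ptable2"><tr><td>1</td></tr></table></div>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_div = find_required(sample_soup, "div", "pdiv")
sample_table = find_required(sample_div, "table", "ptable2")
print(sample_table.name)

# A selector that does not exist now fails with a clear message.
try:
    find_required(sample_soup, "div", "missing-class")
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```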

__Collect the header columns__

In [37]:
headers = []

### Loop through the header row and extract the column names

In [38]:
thead = table.find('thead')
for row in thead.find_all('tr'):
    # loop through the cells in the row and collect the header text
    for cell in row.find_all('th'):
        headers.append(cell.text)

In [39]:
print(headers)

['Rank', 'Country', '2022', '2023 (Billions)', '2024', '2025', '2026', '2027']


In [40]:
data = []

### Loop through the rows in the table and extract the data

In [41]:
tbody = table.find('tbody')
for row in tbody.find_all('tr'):
    # pair each cell in the row with its column header
    temp = {}
    for header, cell in zip(headers, row.find_all('td')):
        temp[header] = cell.text
    data.append(temp)
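The header and row extraction can be checked end to end on a tiny table before running it against the live page. The sketch below uses a made-up fragment (the figures are copied from the first two rows of the DataFrame output shown later) and assumes the same `thead`/`tbody` layout. It uses `get_text(strip=True)` so that stray whitespace around a cell does not end up in the CSV.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mirrors the thead/tbody layout assumed above.
sample_html = """
<table class="ptable2">
  <thead><tr><th>Rank</th><th>Country</th><th>2022</th></tr></thead>
  <tbody>
    <tr><td>1</td><td> United States </td><td>25035.164</td></tr>
    <tr><td>2</td><td>China</td><td>18321.197</td></tr>
  </tbody>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

# Column names come from the <th> cells of the header row.
sample_headers = [
    th.get_text(strip=True) for th in sample_table.find("thead").find_all("th")
]

# Each body row becomes a dict keyed by column name.
sample_rows = []
for tr in sample_table.find("tbody").find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    sample_rows.append(dict(zip(sample_headers, cells)))

print(sample_rows)
```

Note that `zip` stops at the shorter of its inputs, so a row with more cells than headers is truncated silently; whether that is acceptable depends on the page.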

In [49]:
# print(data)

### Save the data into a csv file

In [50]:
filename = 'data/gdp_year_wise.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, headers)
    w.writeheader()
    for row in data:
        w.writerow(row)
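Opening `data/gdp_year_wise.csv` for writing fails with `FileNotFoundError` if the `data` folder does not exist yet. A sketch of the same save step that creates the folder first is shown below; it writes to a temporary directory and uses two made-up rows so it can run without the scraped data.

```python
import csv
import os
import tempfile

# Stand-in values; in the notebook these come from the scraping cells.
sample_headers = ["Rank", "Country", "2022"]
sample_data = [
    {"Rank": "1", "Country": "United States", "2022": "25035.164"},
    {"Rank": "2", "Country": "China", "2022": "18321.197"},
]

# A temporary directory keeps the example from touching real files.
base_dir = tempfile.mkdtemp()
out_dir = os.path.join(base_dir, "data")
os.makedirs(out_dir, exist_ok=True)  # no error if the folder already exists
out_path = os.path.join(out_dir, "gdp_year_wise.csv")

with open(out_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=sample_headers)
    writer.writeheader()
    writer.writerows(sample_data)  # writes every dict in one call

# Read the file back to confirm the round trip.
with open(out_path, newline="", encoding="utf-8") as f:
    rows_read = list(csv.DictReader(f))

print(rows_read)
```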

### Start the analysis

In [44]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
import os
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Define the data folder path once

In [51]:
data_path = 'data/'

In [52]:
# fetch all files inside the data folder
os.listdir(data_path)

['income.csv',
 'car_data.csv',
 'survey_results_public.csv',
 '.DS_Store',
 'CAR_DETAILS_FROM_CAR_DEKHO.csv',
 'car_details_v4.csv',
 'car_details_v3.csv',
 'gdp_year_wise.csv',
 'survey_results_schema.csv',
 '.ipynb_checkpoints']
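The listing above also contains entries such as `.DS_Store` and `.ipynb_checkpoints`. If only the CSV files are of interest, `pathlib` can filter by pattern. A minimal sketch, run against a temporary folder with made-up file names:

```python
import tempfile
from pathlib import Path

# Build a throwaway folder with a mix of files for demonstration.
demo_dir = Path(tempfile.mkdtemp())
for name in ["gdp_year_wise.csv", "income.csv", ".DS_Store", "notes.txt"]:
    (demo_dir / name).touch()

# glob("*.csv") matches only names ending in .csv; sorted() gives a stable order.
csv_files = sorted(p.name for p in demo_dir.glob("*.csv"))
print(csv_files)
```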

### Read the CSV into a pandas DataFrame

In [53]:
gdp_df = pd.read_csv(os.path.join(data_path, 'gdp_year_wise.csv'))
gdp_df

|  | Rank | Country | 2022 | 2023 (Billions) | 2024 | 2025 | 2026 | 2027 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | United States | 25035.164 | 26185.210 | 27057.202 | 28045.305 | 29165.531 | 30281.524 |
| 1 | 2.0 | China | 18321.197 | 19243.974 | 20699.148 | 22404.019 | 24295.368 | 26437.719 |
| 2 | 3.0 | Japan | 4300.621 | 4365.976 | 4568.729 | 4811.640 | 5009.999 | 5172.103 |
| 3 | 4.0 | Germany | 4031.149 | 4120.242 | 4337.385 | 4546.514 | 4740.723 | 4925.000 |
| 4 | 5.0 | India | 3468.566 | 3820.573 | 4170.220 | 4547.164 | 4947.391 | 5365.546 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 191 |  | Lebanon |  |  |  |  |  |  |
| 192 |  | Pakistan | 376.493 |  |  |  |  |  |
| 193 |  | Sri Lanka | 73.739 |  |  |  |  |  |
| 194 |  | Syria |  |  |  |  |  |  |


### You can continue with EDA on the data set

This includes cleaning, preparing, and analyzing the data.
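As a possible starting point for the cleaning step, the sketch below works on a handful of rows copied from the DataFrame output above instead of the full file. The choice of columns and the derived `growth_pct` column are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# A few rows copied from the DataFrame output shown above.
sample_df = pd.DataFrame(
    {
        "Rank": [1.0, 2.0, 3.0, np.nan, np.nan],
        "Country": ["United States", "China", "Japan", "Lebanon", "Pakistan"],
        "2022": [25035.164, 18321.197, 4300.621, np.nan, 376.493],
        "2027": [30281.524, 26437.719, 5172.103, np.nan, np.nan],
    }
)

# Keep only countries that have both endpoints of the period.
clean_df = sample_df.dropna(subset=["2022", "2027"]).copy()

# Ranks were read as floats because of the missing values; restore integers.
clean_df["Rank"] = clean_df["Rank"].astype(int)

# Hypothetical derived column: projected growth from 2022 to 2027 in percent.
clean_df["growth_pct"] = (clean_df["2027"] / clean_df["2022"] - 1) * 100

print(clean_df.sort_values("growth_pct", ascending=False))
```

Dropping incomplete rows is only one option; depending on the question being asked, keeping them and handling the missing values per column may be more appropriate.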