# **GDP Data Extraction and Processing**
An international company that wants to expand its business into countries around the world has recruited you. You have been hired as a junior Data Engineer and tasked with writing a script that extracts the list of the world's 10 largest economies, in descending order of GDP in billion USD (rounded to 2 decimal places), as recorded by the International Monetary Fund (IMF).

<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/images/pandas_wbs_3.png">

# **Objectives**

After completing this lab, you will be able to:

- Use web scraping to extract the required information from a website.
- Use Pandas to load and process tabular data as a dataframe.
- Use NumPy to manipulate the information contained in a dataframe.
- Write the updated dataframe to a CSV file.

# **1. Web Scraping**

### **Version 1: Scraping with BeautifulSoup**

In [None]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

# URL of the archived Wikipedia page to scrape
URL = "https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"

# Fetch the page content
page = requests.get(URL)

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(page.content, "html.parser")

# Find the GDP table (the first table with class "wikitable")
table = soup.find("table", {"class": "wikitable"})

# Extract the rows of the table into a list of dictionaries
data = []
for row in table.find_all("tr")[1:]:
    cols = row.find_all("td")
    if len(cols) >= 8:
        # Column 0: country name (the Version 2 output shows no rank column)
        country = cols[0].text.strip()
        # Column 1: region
        region = cols[1].text.strip()
        # Column 2: GDP estimate from the IMF
        gdpIMF = cols[2].text.strip()
        # Column 4: GDP estimate from the World Bank
        gdpWB = cols[4].text.strip()
        # Column 6: GDP estimate from the United Nations
        gdpUN = cols[6].text.strip()
        # Combine the fields into one record
        data.append({"Country": country, "Region": region, "GDP IMF": gdpIMF, "GDP World Bank": gdpWB, "GDP United Nations": gdpUN})

# Build a dataframe from the extracted records
df = pd.DataFrame(data)

# Print the dataframe
print(df)
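
The values scraped with `.text.strip()` are strings, so they usually need cleaning before any arithmetic. The following is a minimal sketch of one way to do that with Pandas. The sample strings (thousands separators and an `N/A` marker for missing data) are assumptions made for illustration, not text copied from the page, and the third row is a hypothetical placeholder.

```python
import pandas as pd

# Hypothetical sample of scraped cell text; the real strings may differ.
raw = pd.DataFrame({
    "Country": ["United States", "China", "Placeholder (no data)"],
    "GDP IMF": ["26,854,599", "19,373,586", "N/A"],
})

# Remove thousands separators, then convert to numbers.
# errors="coerce" turns cells that cannot be parsed into NaN instead of raising.
raw["GDP IMF"] = pd.to_numeric(
    raw["GDP IMF"].str.replace(",", "", regex=False), errors="coerce"
)

print(raw)
```

Coercing unparseable cells to `NaN` keeps the row in the dataframe, so missing estimates can be inspected or dropped later rather than stopping the script.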

### **Version 2: Pandas Only, Without BeautifulSoup**

In [123]:
import pandas as pd

URL = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'
tables = pd.read_html(URL)
# The GDP table is at index 3 in the list of tables found on the page
df = tables[3]

# Replace the column headers with column numbers
df.columns = range(df.shape[1])

# Retain columns 0, 1 and 2 (country name, region, and the GDP value quoted by the IMF)
df = df[[0,1,2]]

# Retain the rows with index 1 to 10, indicating the top 10 economies of the world.
df = df.iloc[1:11,:]

# Assign column names as "Country", "Region" and "GDP (Million USD)"
df.columns = ['Country','Region','GDP (Million USD)']

print(df)

           Country    Region GDP (Million USD)
1    United States  Americas          26854599
2            China      Asia          19373586
3            Japan      Asia           4409738
4          Germany    Europe           4308854
5            India      Asia           3736882
6   United Kingdom    Europe           3158938
7           France    Europe           2923489
8            Italy    Europe           2169745
9           Canada  Americas           2089672
10          Brazil  Americas           2081235
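
The cell above relies on the source table already being sorted by IMF GDP, so taking the first ten rows yields the ten largest economies. If that ordering cannot be assumed, the sort can be made explicit. The sketch below uses a few figures copied from the output above, deliberately shuffled; the name `sample` and the choice of three rows are only for illustration.

```python
import pandas as pd

# Figures copied from the output above, deliberately out of order.
sample = pd.DataFrame({
    "Country": ["Japan", "United States", "Germany", "China"],
    "GDP (Million USD)": [4409738, 26854599, 4308854, 19373586],
})

# Sort by GDP, largest first, and keep the top rows (3 here for brevity).
top = sample.sort_values("GDP (Million USD)", ascending=False).head(3)

# Renumber the rows so the index reflects the new order.
top = top.reset_index(drop=True)

print(top["Country"].tolist())  # ['United States', 'China', 'Japan']
```

For the full table, replacing `head(3)` with `head(10)` would give the ten largest economies regardless of the order in which the rows were scraped.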


# **2. Convert the GDP Values from Million to Billion USD**
After obtaining the values in million USD, convert them to billion USD and round them to 2 decimal places using the round() function from the NumPy library.

In [129]:
import numpy as np
import pandas as pd

URL = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'
tables = pd.read_html(URL)
# The GDP table is at index 3 in the list of tables found on the page
df = tables[3]

# Replace the column headers with column numbers
df.columns = range(df.shape[1])

# Retain columns 0, 1 and 2 (country name, region, and the GDP value quoted by the IMF)
df = df[[0,1,2]]

# Retain the rows with index 1 to 10, indicating the top 10 economies of the world.
# copy() makes the later column assignments act on an independent dataframe.
df = df.iloc[1:11,:].copy()

# Assign column names as "Country", "Region" and "GDP (Million USD)"
df.columns = ['Country','Region','GDP (Million USD)']

# Change the data type of the 'GDP (Million USD)' column to integer. Use astype() method.
df['GDP (Million USD)'] = df['GDP (Million USD)'].astype(int)

# Convert the GDP value in Million USD to Billion USD
df[['GDP (Million USD)']] = df[['GDP (Million USD)']]/1000

# Use numpy.round() method to round the value to 2 decimal places.
df[['GDP (Million USD)']] = np.round(df[['GDP (Million USD)']], 2)

# Rename the column header from 'GDP (Million USD)' to 'GDP (Billion USD)'
df.rename(columns = {'GDP (Million USD)' : 'GDP (Billion USD)'}, inplace=True)

print(df)

           Country    Region  GDP (Billion USD)
1    United States  Americas           26854.60
2            China      Asia           19373.59
3            Japan      Asia            4409.74
4          Germany    Europe            4308.85
5            India      Asia            3736.88
6   United Kingdom    Europe            3158.94
7           France    Europe            2923.49
8            Italy    Europe            2169.74
9           Canada  Americas            2089.67
10          Brazil  Americas            2081.24
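
The conversion itself is a division by 1000 followed by rounding. As a small offline sketch, it can be isolated in a helper and checked against the first three rows of the table. The function name `million_to_billion` is hypothetical and does not appear in the lab.

```python
import numpy as np
import pandas as pd


def million_to_billion(values: pd.Series) -> pd.Series:
    """Convert million-USD figures to billion USD, rounded to 2 decimals."""
    return np.round(values / 1000, 2)


# IMF figures in million USD for the United States, China and Japan,
# copied from the scraped table.
gdp_million = pd.Series([26854599, 19373586, 4409738])
gdp_billion = million_to_billion(gdp_million)

print(gdp_billion.tolist())
```

For example, 26854599 million USD divided by 1000 is 26854.599, which rounds to 26854.60 billion USD, matching the first row of the output above.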


# **3. Save the Data to a CSV File**

In [132]:
import numpy as np
import pandas as pd

URL = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'
tables = pd.read_html(URL)
# The GDP table is at index 3 in the list of tables found on the page
df = tables[3]

# Replace the column headers with column numbers
df.columns = range(df.shape[1])

# Retain columns 0, 1 and 2 (country name, region, and the GDP value quoted by the IMF)
df = df[[0,1,2]]

# Retain the rows with index 1 to 10, indicating the top 10 economies of the world.
# copy() makes the later column assignments act on an independent dataframe.
df = df.iloc[1:11,:].copy()

# Assign column names as "Country", "Region" and "GDP (Million USD)"
df.columns = ['Country','Region','GDP (Million USD)']

# Change the data type of the 'GDP (Million USD)' column to integer. Use astype() method.
df['GDP (Million USD)'] = df['GDP (Million USD)'].astype(int)

# Convert the GDP value in Million USD to Billion USD
df[['GDP (Million USD)']] = df[['GDP (Million USD)']]/1000

# Use numpy.round() method to round the value to 2 decimal places.
df[['GDP (Million USD)']] = np.round(df[['GDP (Million USD)']], 2)

# Rename the column header from 'GDP (Million USD)' to 'GDP (Billion USD)'
df.rename(columns = {'GDP (Million USD)' : 'GDP (Billion USD)'}, inplace=True)

# Write the DataFrame to the CSV file named "Largest_economies.csv"
df.to_csv('./Largest_economies.csv')
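
To confirm that the file was written as intended, it can be read back with `pd.read_csv`. The sketch below does this on a two-row sample in a temporary directory, so it runs without network access and leaves no files behind. The name `df_small` is illustrative, and passing `index=False` is an optional choice, not part of the original cell.

```python
import os
import tempfile

import pandas as pd

# Two rows taken from the converted table, used as a stand-in for df.
df_small = pd.DataFrame({
    "Country": ["United States", "China"],
    "GDP (Billion USD)": [26854.60, 19373.59],
})

# Write to a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Largest_economies.csv")
    df_small.to_csv(path, index=False)  # index=False omits the row labels
    restored = pd.read_csv(path)

print(restored)
```

By default `to_csv` also writes the dataframe index as an unnamed first column. That is harmless, but `index=False` keeps the file limited to the data columns, which some downstream tools may prefer.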