# Project: Indonesia Demography (Part 1: Web Scraping & Data Preparation)

In this project, I will do a simple exploration of Indonesian demographic data. The data comes from https://id.wikipedia.org/wiki/Demografi_Indonesia. The project is divided into two parts: **Web Scraping & Data Preparation** and **Data Exploration**.
For this part of the project, we will follow these steps:
1. Importing the needed libraries
2. Making a request to the website
3. Retrieving the data from the table that we need
4. Creating a DataFrame with the data
5. Preparing the DataFrame
6. Exporting the DataFrame to CSV format

## Import Library

For this part of the project, we will need these libraries:
1. **Pandas** to create and prepare the DataFrame
2. **Requests** to make a request to the website
3. **BeautifulSoup** to parse the HTML and scrape the data

In [1]:
import pandas as pd
from requests import get
from bs4 import BeautifulSoup

## Make Request to Website

Making a request means retrieving the page content as an HTML document. From this HTML, we can extract the data that we need.

In [2]:
url = 'https://id.wikipedia.org/wiki/Demografi_Indonesia'
response = get(url)
print(response.text[:500])


<!DOCTYPE html>
<html class="client-nojs" lang="id" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Demografi Indonesia - Wikipedia bahasa Indonesia, ensiklopedia bebas</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":[",\t.",".\t,"],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","Januari","Februari","Maret","April","Mei","Juni","Juli","Agustus","September","Oktober","November","Desember"],"wgR
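
The request above assumes that the server answered successfully. As a minimal sketch, the download can be wrapped in a small function that fails loudly on an error response. The name `fetch_html` and the 10-second timeout are my own assumptions for illustration; they are not part of the original notebook.

```python
from requests import get


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising if the request failed."""
    # timeout stops the call from waiting forever on an unresponsive server
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Usage (needs network access):
# html = fetch_html('https://id.wikipedia.org/wiki/Demografi_Indonesia')
# print(html[:500])
```

With this in place, a failed download stops the notebook at the request step instead of producing an empty table later on.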


## Retrieving Data from Table

Using the *Inspect Element* feature in Google Chrome, I found that the table I need has the class "wikitable sortable". So, I search the content retrieved earlier for tables with that class. `find_all` returns a *list*-like result, so I take its first element to get the table itself.

Once I have the table, I need to retrieve its data. Again using *Inspect Element*, I found that all the data I need sits inside `<td>` tags. So, I search for every `td` tag in the table.

In [3]:
html_soup = BeautifulSoup(response.text, 'html.parser')
# find_all returns a list-like ResultSet; the table we need is the first match
table_in_page = html_soup.find_all('table', class_='wikitable sortable')
demografi_table = table_in_page[0]

# A Tag can be searched directly, so the table does not need to be parsed again
data_in_table = demografi_table.find_all('td')
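
To see the table lookup in isolation, here is a small sketch that runs on an inline HTML snippet instead of the live page. The snippet `sample_html` is made up for illustration. It also uses a CSS selector as an alternative to `class_='wikitable sortable'`: passing a multi-class string to `class_` matches only that exact class attribute value, while the selector `table.wikitable.sortable` matches any table that carries both classes.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: one sortable table and one plain table
sample_html = """
<table class="wikitable sortable">
  <tr><th>kode</th><th>nama</th></tr>
  <tr><td>11</td><td>Aceh</td></tr>
  <tr><td>12</td><td>Sumatra Utara</td></tr>
</table>
<table class="wikitable"><tr><td>other</td></tr></table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# select_one returns the first element matching the CSS selector
table = soup.select_one('table.wikitable.sortable')

# Collect the text of every data cell, with surrounding whitespace stripped
cells = [td.get_text(strip=True) for td in table.find_all('td')]
print(cells)
```

Header cells (`<th>`) are skipped because only `td` tags are collected, which is why the column names have to be supplied by hand later.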

## Create DataFrame

When I retrieved the data, I got a flat list of cells ordered from the top left of the table, moving to the right. So, I need to build a list of lists in which each inner list holds 9 values (one per table column).

To do this, I create 2 empty lists: one to store the finished rows as a tidy list of lists, and a temporary one that collects exactly 9 values at a time.

Once the list of lists is complete, I create the DataFrame and add the column names.

In [4]:
demografi_list = []  # finished rows
temp_list = []       # row currently being filled

counter = 0
for data in data_in_table:
    if counter < 9:
        # The current row still has room: add the cell text to it
        temp_list.append(data.text)
        counter += 1
    else:
        # The current row is full: store it and start a new row with this cell
        demografi_list.append(temp_list)
        temp_list = []
        temp_list.append(data.text)
        counter = 1
# Store the last row, which the loop never appends
demografi_list.append(temp_list)

column_names = ['kode_bps', 'lambang', 'nama', 'kode_iso', 'ibu_kota', 'populasi', 'luas_km', 'status_khusus', 'pulau']
demografi_df = pd.DataFrame(demografi_list, columns=column_names)

demografi_df.head()

Unnamed: 0,kode_bps,lambang,nama,kode_iso,ibu_kota,populasi,luas_km,status_khusus,pulau
0,11,,Aceh,ID-AC,Banda Aceh,4.494.410,"56.500,51",Daerah khusus,Sumatra\n
1,12,,Sumatra Utara,ID-SU,Medan,12.982.204,"72.427,81",,Sumatra\n
2,13,,Sumatra Barat,ID-SB,Padang,4.846.909,"42.224,65",,Sumatra\n
3,14,,Riau,ID-RI,Pekanbaru,5.538.367,"87.844,23",,Sumatra\n
4,15,,Jambi,ID-JA,Jambi,3.092.265,"45.348,49",,Sumatra\n
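
The grouping of a flat list into fixed-width rows can also be sketched with list slicing. This is an alternative formulation of the same idea, not the loop used in this notebook; the helper name `chunk` and the three-column sample data are assumptions for illustration.

```python
def chunk(flat, width):
    """Split a flat list into consecutive rows of `width` items each."""
    # Step through the list in jumps of `width` and slice out one row per jump
    return [flat[i:i + width] for i in range(0, len(flat), width)]


# Hypothetical flat cell list for a table with 3 columns
flat_cells = ['11', 'Aceh', 'Banda Aceh', '12', 'Sumatra Utara', 'Medan']
rows = chunk(flat_cells, 3)
print(rows)
```

If the number of cells is not a multiple of the width, the last row simply comes out shorter, which makes a malformed table easy to spot.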


The DataFrame looks good, but I only need these columns:
1. kode_bps
2. nama
3. ibu_kota
4. populasi
5. luas_km
6. pulau

So, I will drop these columns:

1. lambang
2. kode_iso
3. status_khusus

In [5]:
demografi_df = demografi_df.drop(columns=['lambang', 'kode_iso', 'status_khusus'])

demografi_df.head()

Unnamed: 0,kode_bps,nama,ibu_kota,populasi,luas_km,pulau
0,11,Aceh,Banda Aceh,4.494.410,"56.500,51",Sumatra\n
1,12,Sumatra Utara,Medan,12.982.204,"72.427,81",Sumatra\n
2,13,Sumatra Barat,Padang,4.846.909,"42.224,65",Sumatra\n
3,14,Riau,Pekanbaru,5.538.367,"87.844,23",Sumatra\n
4,15,Jambi,Jambi,3.092.265,"45.348,49",Sumatra\n
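
Dropping the unwanted columns and selecting the wanted ones are two ways to reach the same result. The sketch below compares them on a one-row sample DataFrame that I made up for illustration, with fewer columns than the real table.

```python
import pandas as pd

# Hypothetical one-row sample with a subset of the real columns
sample = pd.DataFrame(
    [['11', '', 'Aceh', 'ID-AC', 'Banda Aceh']],
    columns=['kode_bps', 'lambang', 'nama', 'kode_iso', 'ibu_kota'],
)

# Option 1: name the columns to remove
dropped = sample.drop(columns=['lambang', 'kode_iso'])

# Option 2: name the columns to keep, in the order we want them
selected = sample[['kode_bps', 'nama', 'ibu_kota']]

print(dropped.equals(selected))
```

Selecting is usually easier to read when only a few columns are kept; dropping is shorter when only a few are removed, as is the case here.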


## Data Manipulation

The columns `populasi` and `luas_km` currently have the *object* type. They need to be floats so that we can explore them properly. To convert them, we first need to remove the thousands separator (`.`) and change the decimal separator from `,` to `.`.

Also, there is a `\n` at the end of each `pulau` value, which is unnecessary and needs to be removed.

In [6]:
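# Restrict the separator clean-up to the numeric columns so that
# periods or commas in the text columns are left alone
numeric_cols = ['populasi', 'luas_km']
demografi_df[numeric_cols] = demografi_df[numeric_cols].replace(to_replace=r'\.', value='', regex=True)
demografi_df[numeric_cols] = demografi_df[numeric_cols].replace(to_replace=',', value='.', regex=True)
# Remove the newline left at the end of the cell text
demografi_df = demografi_df.replace(to_replace='\n', value='', regex=True)

demografi_df[numeric_cols] = demografi_df[numeric_cols].astype(float)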

demografi_df.dtypes

kode_bps     object
nama         object
ibu_kota     object
populasi    float64
luas_km     float64
pulau        object
dtype: object
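
The number clean-up can be checked on a tiny sample before running it on the full table. The sketch below is one possible variant that uses the `.str` accessor with literal (non-regex) replacement and `pd.to_numeric`; the helper name `parse_id_number` is hypothetical. Note that `pd.to_numeric` yields integers for `populasi` rather than the floats used in this notebook.

```python
import pandas as pd

# Two sample rows in the same format as the scraped table
sample = pd.DataFrame({
    'populasi': ['4.494.410', '12.982.204'],
    'luas_km': ['56.500,51', '72.427,81'],
    'pulau': ['Sumatra\n', 'Sumatra\n'],
})


def parse_id_number(series):
    """Convert strings like '56.500,51' (dot = thousands, comma = decimal) to numbers."""
    # regex=False treats '.' as a literal dot instead of "any character"
    cleaned = series.str.replace('.', '', regex=False).str.replace(',', '.', regex=False)
    return pd.to_numeric(cleaned)


sample['populasi'] = parse_id_number(sample['populasi'])
sample['luas_km'] = parse_id_number(sample['luas_km'])
# strip() removes the trailing newline (and any other surrounding whitespace)
sample['pulau'] = sample['pulau'].str.strip()

print(sample)
```

The order of the two replacements matters: the thousands separators must be removed before the decimal comma becomes a dot, otherwise the new decimal point would be removed as well.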

## Export Data to CSV

Lastly, we can export the data to CSV format so that we can use it later in Part 2.

In [7]:
demografi_df.to_csv('demografi_indonesia.csv', index=False)
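
When the file is read back, pandas infers column types again, so a code column such as `kode_bps` would come back as integers unless told otherwise. The following is a minimal round-trip sketch on made-up sample data, written to a temporary directory so that it does not touch `demografi_indonesia.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample with the same kind of columns as the prepared DataFrame
sample = pd.DataFrame({
    'kode_bps': ['11', '12'],
    'nama': ['Aceh', 'Sumatra Utara'],
    'populasi': [4494410.0, 12982204.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'sample.csv')
    # index=False leaves the row index out of the file
    sample.to_csv(csv_path, index=False)
    # dtype keeps kode_bps as text instead of letting pandas infer integers
    reloaded = pd.read_csv(csv_path, dtype={'kode_bps': str})

print(reloaded.dtypes)
```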