## Data Scenario 
As a Data Engineer, I will extract information from https://id.wikipedia.org/wiki/Demografi_Indonesia using the Python programming language. I will write a Python script that saves the "Population by Province" table from that page in CSV format.

Web scraping is one method for collecting data from the internet: we extract information from a website directly over the HTTP protocol.
It is a practical way to get information from a site that does not provide an API for retrieving it. The steps are: request the page, parse the returned HTML, pick out the table cells, and save them.
The libraries we need are:
1. Pandas, a flexible, expressive and fast library for processing tabular data. Here I use it to hold the scraped data in a DataFrame, which simplifies later analysis, and to write that DataFrame to CSV.
2. Requests, used to send HTTP requests; each request returns a Response object.
3. BeautifulSoup, a parser that breaks an HTML document into a tree of elements that is easy to read and navigate.
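
To make the parsing step concrete without touching the network, here is a minimal sketch that pulls the `td` cells out of a small inline HTML snippet. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup, so it only illustrates the idea of turning HTML into elements. `SAMPLE_HTML`, `CellCollector` and `parse_rows` are hypothetical names of my own, and the snippet is only assumed to resemble the real Wikipedia table.

```python
from html.parser import HTMLParser

# Hypothetical snippet shaped like the target table (assumption, not the real page).
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Provinsi</th><th>Luas (km2)</th><th>2010</th><th>2020</th></tr>
<tr><td>Aceh</td><td>56.500,51</td><td>4.494.410</td><td>5.274.871</td></tr>
<tr><td>Riau</td><td>87.844,23</td><td>5.538.367</td><td>6.394.087</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell texts per <tr>
        self._in_td = False   # True while the parser is inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_td = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.rows[-1][-1] += data


def parse_rows(html):
    """Return the data rows of the table as lists of stripped strings."""
    parser = CellCollector()
    parser.feed(html)
    # Rows without any <td> (the header row) are dropped.
    return [[cell.strip() for cell in row] for row in parser.rows if row]


for row in parse_rows(SAMPLE_HTML):
    print(row)
```

Grouping cells by row, rather than by their position in one flat list, keeps the columns aligned even if a row has a different number of cells.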
 


## About This Dataset

The columns to retrieve are as follows.

* Province name
* Area (km²)
* Population (2010)
* Population (2020)







## Web Scraping

In [12]:
# import the required libraries
import csv

import pandas as pd
import requests
from bs4 import BeautifulSoup

# request the page and parse the returned HTML
html = requests.get('https://id.wikipedia.org/wiki/Demografi_Indonesia').text
soup = BeautifulSoup(html, 'lxml')

# take the first table with class 'wikitable sortable'
my_table = soup.find('table', {'class': 'wikitable sortable'})

# collect every 'td' cell of that table
cells = my_table.find_all('td')

# create one empty list per column
nama = []
luas_km = []
populasi10 = []
populasi20 = []

# each row has four cells, so the cell position modulo 4 identifies the column
for i, cell in enumerate(cells):
    text = cell.get_text().strip()  # strip the trailing newline and any padding
    if i % 4 == 0:
        nama.append(text)
    elif i % 4 == 1:
        luas_km.append(text)
    elif i % 4 == 2:
        populasi10.append(text)
    else:
        populasi20.append(text)

# build the DataFrame and write it to CSV with every field quoted
df = pd.DataFrame()
df['Nama Provinsi'] = nama
df['Luas km'] = luas_km
df['Populasi 2010'] = populasi10
df['Populasi 2020'] = populasi20
df.to_csv('Indonesia_Demography_by_Province.csv', index=False, encoding='utf-8', quoting=csv.QUOTE_ALL)
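
The `quoting` argument of `to_csv` takes the constants of the standard `csv` module, where `csv.QUOTE_ALL` has the value 1. Quoting matters here because values such as `56.500,51` contain a comma. The sketch below shows the effect with the standard library only; the `rows` list is a hand-typed sample, not the scraped data.

```python
import csv
import io

# Hand-typed sample rows (assumption: same layout as the DataFrame in this notebook).
rows = [
    ["Nama Provinsi", "Luas km", "Populasi 2010", "Populasi 2020"],
    ["Aceh", "56.500,51", "4.494.410", "5.274.871"],
]

# Write to an in-memory buffer with every field wrapped in double quotes.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original four fields per row,
# because the comma inside "56.500,51" is protected by the quotes.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed[1])
```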


In [13]:
print(df.head())

   Nama Provinsi    Luas km Populasi 2010 Populasi 2020
0           Aceh  56.500,51     4.494.410     5.274.871
1  Sumatra Utara  72.427,81    12.982.204    14.799.361
2  Sumatra Barat  42.224,65     4.846.909     5.534.472
3           Riau  87.844,23     5.538.367     6.394.087
4          Jambi  45.348,49     3.092.265     3.548.228
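
The scraped values are strings in Indonesian number format: a dot separates thousands and a comma marks the decimals. If the columns are needed as numbers, a conversion along these lines could be applied. This is a minimal sketch, and `parse_id_number` is a hypothetical helper that is not part of the original script.

```python
def parse_id_number(text):
    """Convert an Indonesian-formatted number string such as '56.500,51' to a float."""
    # Remove the thousands separators first, then turn the decimal comma into a dot.
    cleaned = text.strip().replace(".", "").replace(",", ".")
    return float(cleaned)


print(parse_id_number("56.500,51"))  # 56500.51
print(parse_id_number("4.494.410"))  # 4494410.0
```

With pandas, the same helper could be applied to a column, for example `df['Luas km'].map(parse_id_number)`.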


In [14]:
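# save again without the index column, so the file keeps only the four requested columns
df.to_csv('Indonesia_Demography_by_Province.csv', index=False)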

## Function and Regular Expression

As a Data Engineer, I was asked to create a function called "email_check" that filters a list of emails using the Python programming language. The function accepts one parameter named "input", which is an email address, and prints either "Pass" or "Not Pass".

For this I use Python's regular expression module, re. A regular expression (regex) is a sequence of characters that defines a search pattern. Such patterns are commonly used by string-searching algorithms for "find" or "find and replace" operations on strings, or to validate input strings.
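
The three uses named above can be sketched with the `re` module as follows. The patterns here are simple illustrations of my own and are not the validation pattern used in this notebook.

```python
import re

text = "Contact: my.name@someemail.com"

# "find": locate the first run of word characters and dots around an '@'
found = re.search(r"[\w.]+@[\w.]+", text)
print(found.group())  # my.name@someemail.com

# "find and replace": mask every digit with '#'
masked = re.sub(r"\d", "#", "my.name2019@someemail.com")
print(masked)  # my.name####@someemail.com

# "validate": check that the whole string consists of lowercase letters only
is_lowercase_word = re.fullmatch(r"[a-z]+", "myname") is not None
print(is_lowercase_word)  # True
```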

In [15]:
# import the required library
import re

# define the email_check function
def email_check(input):
    # rules encoded by the pattern: no hyphen anywhere, the first character is not a digit,
    # and no dot followed by a digit unless the address contains an underscore
    # (raw string, so \. and \d reach the regex engine unchanged)
    match = re.search(r'(?=^((?!-).)*$)(?=[^0-9])((?=^((?!\.\d).)*$)|(?=.*_))', input)
    if match:
        print('Pass')
    else:
        print('Not Pass')

# put the email data into a list
emails = ['my-name@someemail.com', 'myname@someemail.com', 'my.name@someemail.com', 'my.name2019@someemail.com', 'my.name.2019@someemail.com', 'somename.201903@someemail.com', 'my_name.201903@someemail.com', '201903myname@someemail.com', '201903.myname@someemail.com']

# loop over emails, using the variable email, and check each one for Pass or Not Pass
for email in emails:
    email_check(email)

Not Pass
Pass
Pass
Pass
Not Pass
Not Pass
Pass
Not Pass
Not Pass
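
Because the lookahead pattern is dense, it can help to compare it with the same rules written as plain Python checks. The sketch below is my reading of the pattern for single-line addresses, not a general email validator. `email_status` and `email_status_plain` are hypothetical names, and both return the result instead of printing it so that it can be reused or tested.

```python
import re

# The same pattern as in email_check, compiled once.
PATTERN = re.compile(r'(?=^((?!-).)*$)(?=[^0-9])((?=^((?!\.\d).)*$)|(?=.*_))')


def email_status(address):
    """Return 'Pass' or 'Not Pass' according to the regex."""
    return 'Pass' if PATTERN.search(address) else 'Not Pass'


def email_status_plain(address):
    """The same rules spelled out step by step (assumes a single-line address)."""
    no_hyphen = '-' not in address
    starts_with_non_digit = address != '' and address[0] not in '0123456789'
    no_dot_before_digit = re.search(r'\.\d', address) is None
    has_underscore = '_' in address
    passed = no_hyphen and starts_with_non_digit and (no_dot_before_digit or has_underscore)
    return 'Pass' if passed else 'Not Pass'


emails = ['my-name@someemail.com', 'myname@someemail.com', 'my.name@someemail.com',
          'my.name2019@someemail.com', 'my.name.2019@someemail.com',
          'somename.201903@someemail.com', 'my_name.201903@someemail.com',
          '201903myname@someemail.com', '201903.myname@someemail.com']

for email in emails:
    print(email, email_status(email), email_status_plain(email))
```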
