This notebook demonstrates how to scrape information from a website using Python's `requests` and `BeautifulSoup`, then look up the latitude and longitude of each address with `geocoder`. `Folium` is used to plot the results on a map.

The website used is the [Top 10 Shopping Malls in Penang](http://www.penang.ws/shopping/top10-shopping.htm) page. =)

---

In [7]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

# Scrape from website

- Request the page and inspect the response; a status code of 200 means the request succeeded

In [2]:
resp = requests.get('http://www.penang.ws/shopping/top10-shopping.htm')
resp

<Response [200]>
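
The status can also be checked in code rather than by eye. The following is a minimal sketch, assuming a hypothetical helper name `fetch_html` and an arbitrary 10-second timeout (neither is part of the original notebook); `raise_for_status()` and the `timeout` argument are standard `requests` features.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as bytes, raising on an error status.

    `fetch_html` and the 10-second default are illustrative choices.
    """
    resp = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of failing silently
    resp.raise_for_status()
    return resp.content


# Example usage (performs a real HTTP request):
# data = fetch_html('http://www.penang.ws/shopping/top10-shopping.htm')
```

Failing early here keeps an error page from being parsed as if it were the listing.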

- Get the content of the response, which is the raw HTML of the page (as bytes)

In [3]:
data = resp.content
data

b'\n\n\n\n\n\n\n\n            <!DOCTYPE html>\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n    \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" itemscope itemtype="http://schema.org/Article" lang="en-US" class="js flexbox flexboxlegacy responsive-css">\n<head>\n  <title>Top 10 Shopping Malls In Penang - Best places to shop in Penang</title>\n  <meta name="description" content="From malls with exciting basement dance clubs and live puzzle rooms to marina frontage shopping centres and onsite museums, our list of the Top 10 Shopping Malls in Penang certainly has something for everyone. Though the street market scene in Penang receives a lot of media coverage, we are fans of these air conditioned retail-havens because more often than not, there are many deals to be had for the serious bargain hunter.">\n  <meta name="keywords" content="Top

- Use BeautifulSoup to parse the HTML into a tree 
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [4]:
soup = BeautifulSoup(data, 'html.parser')
soup


<!DOCTYPE html>

<html class="js flexbox flexboxlegacy responsive-css" itemscope="" itemtype="http://schema.org/Article" lang="en-US" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml">
<head>
<title>Top 10 Shopping Malls In Penang - Best places to shop in Penang</title>
<meta content="From malls with exciting basement dance clubs and live puzzle rooms to marina frontage shopping centres and onsite museums, our list of the Top 10 Shopping Malls in Penang certainly has something for everyone. Though the street market scene in Penang receives a lot of media coverage, we are fans of these air conditioned retail-havens because more often than not, there are many deals to be had for the serious bargain hunter." name="description">
<meta content="Top 10 Shopping Malls In Penang, Best places to shop in Penang" name="keywords">
<meta content="noodp" name="googlebot">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=edge" htt
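
Once the HTML is parsed, the tree can be searched by tag and attribute. The sketch below shows `find_all` and `find_next` on a small, made-up snippet; the markup, class names and the `soup_demo` variable are assumptions for illustration only and are much simpler than the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of a listing page; the real markup is more involved
html = """
<div class="item"><span class="title">Mall A</span><p>Open: 10:00</p></div>
<div class="item"><span class="title">Mall B</span><p>Open: 11:00</p></div>
"""

soup_demo = BeautifulSoup(html, "html.parser")

names = []
for div in soup_demo.find_all("div", {"class": "item"}):
    # find_next returns the first matching element after this tag in document order
    title = div.find_next("span", {"class": "title"})
    hours = div.find_next("p")
    names.append((title.text, hours.text))

print(names)
```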

- Extract the information using `find_all`, `find_next` and regular expressions

In [37]:
df_records = []
# Each mall sits in its own <div class="top10_item_wrapper">
for div in soup.find_all('div', {'class': 'top10_item_wrapper'}):
    paragraph = div.find_next('p')
    info = paragraph.text
    row = {}
    title = div.find_next('span', {'class': 'top_itemListing_title'})
    try:
        # Fall back to the link text when the span itself has no text
        row['Name'] = title.text.strip() or title.find_next('a').text.strip()
    except AttributeError:
        # No title span (or no link) was found; leave the name empty
        pass
    row['Open'] = re.search(r'Open: (.+) Address', info).group(1)
    row['Address'] = re.search(r'Address: (.+) Tel', info).group(1)
    # Groups: optional '+', leading digits, then the rest of the number
    tel = re.search(r'Tel: (\+|)(\d+)(\s\d+\s\d+)', info).groups()
    row['Tel'] = ''.join(tel[:2]) + '-' + tel[2].replace(' ', '')
    row['link'] = paragraph.find_next('a')['href']
    df_records.append(row)

df = pd.DataFrame.from_records(df_records, columns=['Name', 'Open', 'Tel', 'Address', 'link'])
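
To see what the three regular expressions capture, they can be run on a single string. This is a small sketch; the `info` text below is invented to match the `Open: ... Address: ... Tel: ...` shape the loop expects and is not taken from the site.

```python
import re

# Hypothetical text, shaped like the paragraph the loop above expects
info = "Open: 10:00 – 22:00 Address: 1 Example Road, Georgetown Tel: +604 123 4567"

opening = re.search(r"Open: (.+) Address", info).group(1)
address = re.search(r"Address: (.+) Tel", info).group(1)

# Groups: optional "+", leading digits, then two space-separated digit runs
plus, code, rest = re.search(r"Tel: (\+|)(\d+)(\s\d+\s\d+)", info).groups()
tel = plus + code + "-" + rest.replace(" ", "")

print(opening)  # 10:00 – 22:00
print(address)  # 1 Example Road, Georgetown
print(tel)      # +604-1234567
```

Note that `re.search` returns `None` when a pattern does not match, so a paragraph with a different layout would raise an `AttributeError` at `.group(1)`.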

In [42]:
df.head()

| | Name | Open | Tel | Address | link |
|---|---|---|---|---|---|
| 0 | Gurney Paragon | 10:00 – 22:00 | +604-2288266 | 163 C & D, Persiaran Gurney, Georgetown | http://www.penang.ws/shopping/gurney-paragon-m... |
| 1 | 1st Avenue Mall | 10:00 – 22:00 | +604-2611121 | Jalan Magazine, Georgetown, 10300 Georgetown | http://www.penang.ws/shopping/gurney-plaza.htm |
| 2 | Gurney Plaza | 10:00 – 22:00 | +604-2281111 | 170-06-01, Persiaran Gurney, 10250 Georgetown | http://www.penang.ws/shopping/gurney-plaza.htm |
| 3 | Queensbay Mall | 10:00 – 22:00 | +604-6198989 | 100, Persiaran Bayan Indah, Queens Bay, 14300 ... | http://www.penang.ws/shopping/queensbay-mall.htm |
| 4 | Straits Quay | 10:30 – 22:30 | +604-8918000 | 3F-G-1 Straits Quay, Jalan Seri Tanjung Pinang... | http://www.penang.ws/penang-attractions/komtar... |


# Running Geocoder to get latitude and longitude

In [48]:
import geocoder

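def get_lat_lon(address):
    """Return [lat, lng] for an address, or None if geocoding fails."""
    try:
        # Note: Google's geocoding service may require an API key
        result = geocoder.google(address)
        if result.latlng:
            return result.latlng
    except Exception:
        pass
    return None

df['lat_lng'] = df['Address'].map(get_lat_lon)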
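
Geocoding services are typically rate-limited, so it can help to avoid looking up the same address twice when a cell is re-run. Below is a minimal sketch of a caching wrapper; `make_cached_geocoder` and `fake_lookup` are hypothetical names, and the coordinates returned by `fake_lookup` are made up so that the example runs without any network access.

```python
def make_cached_geocoder(lookup):
    """Wrap a lookup(address) -> [lat, lng] or None function with a dict cache."""
    cache = {}

    def cached(address):
        if address not in cache:
            try:
                cache[address] = lookup(address)
            except Exception:
                # Treat any lookup failure as "no result"
                cache[address] = None
        return cache[address]

    return cached


# Stand-in for a real geocoding call; the coordinates are made up
calls = []


def fake_lookup(address):
    calls.append(address)
    return [1.0, 2.0] if address == "known place" else None


geocode = make_cached_geocoder(fake_lookup)
results = [geocode("known place"), geocode("known place"), geocode("nowhere")]
print(results)  # [[1.0, 2.0], [1.0, 2.0], None]
print(calls)    # ['known place', 'nowhere']
```

In the notebook this could be applied as `df['Address'].map(make_cached_geocoder(get_lat_lon))`. One trade-off of this design is that failed lookups are cached as well, so a temporary failure is not retried until the wrapper is rebuilt.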

# Plot on map

In [None]:
import folium
from folium.features import DivIcon

# Approximate centre of George Town, Penang
m = folium.Map([5.4141, 100.3288], zoom_start=13)

for i in df.to_dict(orient='records'):
    if not i['lat_lng']:
        continue  # skip addresses that could not be geocoded
    folium.map.Marker(
        i['lat_lng'],
        icon=DivIcon(
            icon_size=(150, 36),
            icon_anchor=(0, 0),
            html='<pre style="font-size: 24pt; color: red">\
            Shopping Mall:{0}\
            Open:{1}\
            Tel:{2}\
            </pre>'.format(i['Name'], i['Open'], i['Tel']),
        ),
    ).add_to(m)

In [53]:
m
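
The label HTML in the plotting cell is built with backslash line continuations, which join the lines without newlines and carry the source indentation into the string. One possible alternative is sketched below, assuming a hypothetical helper `marker_html` that puts each field on its own line and escapes the scraped values (names or addresses may contain characters such as `&`). The sample values are invented.

```python
from html import escape


def marker_html(name, opening_hours, tel):
    """Build the label HTML for one marker, escaping the scraped values."""
    lines = [
        "Shopping Mall: " + escape(name),
        "Open: " + escape(opening_hours),
        "Tel: " + escape(tel),
    ]
    # Newlines are preserved inside <pre>, so each field gets its own line
    return '<pre style="font-size: 24pt; color: red">' + "\n".join(lines) + "</pre>"


label = marker_html("A & B Mall", "10:00 – 22:00", "+604-1234567")
print(label)
```

The returned string could then be passed as the `html` argument of `DivIcon` in place of the inline `.format(...)` call.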