<a href="https://colab.research.google.com/github/alexislintw/Python-Web-Crawler-Tutorial/blob/main/web_crawler_ETtodayFinance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python Web Crawler Tutorial

### Demonstrates how to crawl the ETtoday website to collect text data

- Author: Alexis
- Created: 2021/11/10
- Updated: 2022/3/2
- Data Source: https://finance.ettoday.net/

### Helper function for creating and emptying folders

In [None]:
import os

def prepare_folder(dir_path):
    """Create dir_path if it is missing, then delete the files directly inside it."""

    # Check if the folder exists; if not, create it
    if os.path.exists(dir_path) and os.path.isdir(dir_path):
        print(f'{dir_path} folder already exists')
    else:
        os.makedirs(dir_path)
        print(f'{dir_path} folder created')

    # Empty the folder
    for file_name in os.listdir(dir_path):
        file_path = os.path.join(dir_path,file_name)
        if os.path.isfile(file_path):
            os.unlink(file_path)
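
The same create-then-empty behaviour can also be written with `pathlib`. The following is only a sketch: the name `prepare_folder_pathlib` and the temporary directory used for the demonstration are assumptions made for illustration and are not part of the original notebook.

```python
import tempfile
from pathlib import Path


def prepare_folder_pathlib(dir_path):
    """Create dir_path if needed, then delete the files directly inside it."""
    folder = Path(dir_path)
    # exist_ok=True makes the call a no-op when the folder is already there
    folder.mkdir(parents=True, exist_ok=True)
    for entry in folder.iterdir():
        # Only regular files are removed; subfolders are left untouched
        if entry.is_file():
            entry.unlink()
    return folder


# Try it in a throwaway directory so no real data is touched
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'WebCrawler' / 'ArticleList'
    prepare_folder_pathlib(target)          # first call creates the folder
    (target / 'list_1.htm').write_text('<html></html>', encoding='utf-8')
    prepare_folder_pathlib(target)          # second call empties it
    remaining = list(target.iterdir())
    print(remaining)
```

Because the folder is emptied on every call, rerunning a crawl cell discards the pages saved by the previous run.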

### Crawl article lists from ETtoday search result pages

In [None]:
import os
import time
import requests

# Create a folder
list_folder_path = './WebCrawler/ETtodayFinance/ArticleList'
prepare_folder(list_folder_path)

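# Set a browser-like user-agent header for the requests.
# Adjacent string literals are concatenated, so the header value
# contains no stray runs of spaces from line continuations.
header = {
    'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/88.0.4324.96 '
                   'Safari/537.36')
}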

# Set parameters: number of pages and search keyword
total_pages = 5
# Percent-encoded (UTF-8) form of the keyword '台積電'
keyword = '%E5%8F%B0%E7%A9%8D%E9%9B%BB'

# The full URL
#https://finance.ettoday.net/search.php7?keyword=%E5%8F%B0%E7%A9%8D%E9%9B%BB&page=7

# URL prefix: the part before the page number
url_prefix = 'https://finance.ettoday.net/search.php7?keyword='+keyword

# Crawl each page in turn
for i in range(total_pages):
    # URL prefix and page number combined into full URL
    page = str(i+1)
    url = url_prefix + '&page=' + page
    print(url)

    # Get the web page (give up if the server does not answer within 10 seconds)
    resp = requests.get(url,headers=header,timeout=10)

    # Save the page source to a file if the request succeeded
    if resp.status_code == 200:
        file_name = 'list_'+page+'.htm'
        list_file_path = os.path.join(list_folder_path,file_name)
        with open(list_file_path,'w',encoding='utf-8') as f:
            f.write(resp.text)
        print('ok')
    else:
        print(resp.status_code)
    
    # Pause 2 seconds
    time.sleep(2)

print('done')

./WebCrawler/ETtodayFinance/ArticleList folder created
https://finance.ettoday.net/search.php7?keyword=%E5%8F%B0%E7%A9%8D%E9%9B%BB&page=1
ok
https://finance.ettoday.net/search.php7?keyword=%E5%8F%B0%E7%A9%8D%E9%9B%BB&page=2
ok
https://finance.ettoday.net/search.php7?keyword=%E5%8F%B0%E7%A9%8D%E9%9B%BB&page=3
ok
https://finance.ettoday.net/search.php7?keyword=%E5%8F%B0%E7%A9%8D%E9%9B%BB&page=4
ok
https://finance.ettoday.net/search.php7?keyword=%E5%8F%B0%E7%A9%8D%E9%9B%BB&page=5
ok
done
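
The percent-encoded keyword in the search URL does not have to be typed by hand. As a minimal sketch, the standard library's `urllib.parse.urlencode` can build the query string from the raw keyword. The helper name `build_search_url` and the constant `SEARCH_URL` are assumptions introduced for this example; the base URL and the `keyword` and `page` parameters are the ones used above.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://finance.ettoday.net/search.php7'


def build_search_url(keyword, page):
    """Return the search URL for a raw (unencoded) keyword and a 1-based page."""
    # urlencode percent-encodes the UTF-8 bytes of each value
    query = urlencode({'keyword': keyword, 'page': page})
    return f'{SEARCH_URL}?{query}'


print(build_search_url('台積電', 1))
# https://finance.ettoday.net/search.php7?keyword=%E5%8F%B0%E7%A9%8D%E9%9B%BB&page=1
```

Building the URL this way keeps the readable keyword in the code and avoids mistakes when the search term changes.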


### Parse web page for article list

In [None]:
import bs4

# Create a container to hold article links
urls = []

for file_name in os.listdir(list_folder_path):
    # Skip files whose extension is not .htm
    if not file_name.endswith('.htm'):
        continue

    # Read the saved HTML source
    file_path = os.path.join(list_folder_path,file_name)
    with open(file_path,encoding='utf-8') as f:
        html = f.read()

    # Parse web page/html source code
    soup = bs4.BeautifulSoup(html,'lxml')

    # Get the parent node of the target nodes
    top_node = soup.find('div',class_='part_pictxt_3')
    if top_node is not None:

        # Get all link nodes
        a_nodes = top_node.find_all('a')

        for a_node in a_nodes:
            # Skip links that lack a title or an href attribute
            txt = a_node.get('title')
            url = a_node.get('href')
            if txt is None or url is None:
                continue

            # Print the article title
            print(txt.strip())

            # Print the article URL
            url = url.strip()
            print(url)

            # Put the URL in the container
            urls.append(url)

print(len(urls))

# Save all parsed URLs to a file, one per line
url_file_path = os.path.join(list_folder_path,'..','all_article_urls.txt')
with open(url_file_path,'w',encoding='utf-8') as f:
    for url in urls:
        f.write(url+'\n')
        
print('done')

快訊／台積電證實「到高雄設晶圓廠」！動工時間曝光
https://finance.ettoday.net/news/2120293
台積電董事會通過四大議案　Q3將發放每股2.75元現金股利
https://finance.ettoday.net/news/2120236
台積電、Sony宣布在日本熊本設廠　初期資本支出1948億元
https://finance.ettoday.net/news/2120233
台積電「交出數據」了！專家質疑「美國看底牌」痛批：吃相難看
https://finance.ettoday.net/news/2119691
護國神山不再與世隔絕　台積電升級公務手機「可以使用LINE了」
https://finance.ettoday.net/news/2120176
光洋科經營權之爭！董座馬堅勇批台鋼盟違法　爆料台積電高度關切
https://finance.ettoday.net/news/2120144
台積電平均月薪「比聯發科少3萬」　內行揭1關鍵沒有輸：正常
https://finance.ettoday.net/news/2119667
台股漲逾百點！指數站上17500　台積電上漲10元
https://finance.ettoday.net/news/2119621
上繳大限到！美國：台積電等晶片商今天一定會交出「關鍵資料」
https://finance.ettoday.net/news/2119555
台積電薪水慘輸這兩家公司　平均薪資只排第3名！
https://finance.ettoday.net/news/2119385
台積電對美方要求完成交卷　強調並未揭露「特定客戶」資料
https://finance.ettoday.net/news/2119358
台積電職缺「文組也搶翻」！網揭「1殘酷現實」：根本爽缺
https://finance.ettoday.net/news/2118927
台積電薪水只排業界第7！半導體薪資10強揭密　第一名換算日薪高達萬元
https://finance.ettoday.net/news/2111445
全球半導體業唯一獲獎　台積電獲英國查爾斯王子Terra Carta Seal首屆獎項
https://finance.ettoday.net/news/2118902
坐擁1
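
If a search result links to the same article more than once, `urls` would contain repeats and the article would be downloaded twice in the next step. This is an assumption: the output above does not show whether it happens. A minimal sketch of an order-preserving de-duplication follows; the helper name `unique_in_order` is made up for this example, and the sample URLs are taken from the output above.

```python
def unique_in_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


sample_urls = [
    'https://finance.ettoday.net/news/2120293',
    'https://finance.ettoday.net/news/2120293',
    'https://finance.ettoday.net/news/2120236',
]
print(unique_in_order(sample_urls))
# ['https://finance.ettoday.net/news/2120293', 'https://finance.ettoday.net/news/2120236']
```

Applying such a helper to `urls` before writing `all_article_urls.txt` would keep the crawl order while avoiding repeated requests.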

### Crawl article content pages

In [None]:
import os
import re
import time
import requests

# Create a folder
content_folder_path = './WebCrawler/ETtodayFinance/ArticleContent'
prepare_folder(content_folder_path)

# Read previously saved file containing article URLs
with open(url_file_path,'r') as f:
    urls = f.read().strip().split('\n')
    print(f'There are {len(urls)} URLs in total')

# Set a browser-like user-agent header for the requests
header = {'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/88.0.4324.96 '
                         'Safari/537.36')}

for url in urls:
    # Skip blank lines and anything that does not start with http
    if not re.match('http',url):
        print(f'Skipped: {url}')
        continue

    file_name = url.split('/')[-1]+'.htm'
    print(file_name)

    # Get the web page (give up if the server does not answer within 10 seconds)
    resp = requests.get(url,headers=header,timeout=10)
    # Save the page source to a file if the request succeeded
    if resp.status_code == 200:
        content_file_path = os.path.join(content_folder_path,file_name)
        with open(content_file_path,'w',encoding='utf-8') as f:
            f.write(resp.text)
        print('ok')
    else:
        print(resp.status_code)

    # Pause 2 seconds
    time.sleep(2)

print('done')
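
The file name above is taken from the last `/`-separated piece of the URL, which works for links of the form `https://finance.ettoday.net/news/2120293`. If a link carried a trailing slash or a query string, that piece would be empty or would include the query. The sketch below handles those cases with `urllib.parse.urlparse`; the helper name `article_file_name` and the second sample URL are hypothetical and serve only as an illustration.

```python
from urllib.parse import urlparse


def article_file_name(url, extension='.htm'):
    """Derive a local file name from the last path segment of an article URL."""
    # urlparse separates the path from any query string or fragment
    path = urlparse(url).path
    # Dropping a trailing slash keeps the last segment from being empty
    last_segment = path.rstrip('/').split('/')[-1]
    return last_segment + extension


print(article_file_name('https://finance.ettoday.net/news/2120293'))
# 2120293.htm

# Hypothetical URL shape with a trailing slash and a query string
print(article_file_name('https://finance.ettoday.net/news/2120293/?from=search'))
# 2120293.htm
```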

### Parse web pages for article content

In [None]:
import os
import bs4

# Create a container to hold article content
docs = []

for file_name in os.listdir(content_folder_path):    
    print(file_name)
    # Initialize the fields to extract (publish_date is not filled in below)
    title, publish_date, body = '', '', ''

    # Read the saved HTML source
    file_path = os.path.join(content_folder_path,file_name)
    with open(file_path,encoding='utf-8') as f:
        html = f.read()

    # Parse web page/html source code
    soup = bs4.BeautifulSoup(html,'lxml')

    # Get the article title
    head_node = soup.find('h1',class_='title')
    if head_node is not None:
        title = head_node.text
        print(title)

    # Get the article paragraphs
    story_node = soup.find(class_='story')
    if story_node is not None:
        p_nodes = story_node.find_all('p')
        content = []
        for p_node in p_nodes:
            content.append(p_node.text.strip().replace('\n',''))
        # Join the paragraphs into one string
        if len(content)>0:
            body = ''.join(content)
            print(body)
        else:
            print('There is no content')
    else:
        print('There is no story')

    # Put the title and body in the container
    if title != '' and body != '':
        docs.append((title,body))
    else:
        print('Incomplete data')

# Save all parsed titles and bodies to a TSV file, one article per line
content_file_path = os.path.join(content_folder_path,'..','all_content.tsv')
with open(content_file_path,'w',encoding='utf-8') as f:
    for doc in docs:
        f.write(doc[0]+'\t'+doc[1]+'\n')

print('done')

2117963.htm
蘋果正在研發第三代晶片　外媒曝性能有「倍數飛躍」
▲蘋果正在研發下一代晶片。（圖／取自MacRumors）記者陳俐穎／綜合報導外媒The Information透露未來蘋果自研晶片細節，指出新的晶片將繼承第一代 M1、M1 Pro 和 M1 Max，並且採用台積電的5nm製程製造。目前消息透露，新研發的晶片將具有相當40個計算核心的效能，對比目前的晶片，可以說是倍數成長。報導稱，蘋果和台積電計劃使用台積電 5nm製程的增強版製造第二代蘋果晶片，允許更多的核心。報告指出，這些晶片可能會用於下一代 MacBook Pro 機型和桌上型Mac。蘋果希望第三代晶片上能有「更大的飛躍」，部分晶片將採用台積電的 3nm 工藝製造，報告指出晶片內有多達 40 個計算核心。作為對比，M1 晶片有 8 核 CPU，M1 Pro 和 M1 Max 晶片有 10核  CPU，而蘋果的高階 Mac Pro 機則最多可配置 28 核 Intel Xeon W 處理器。也就是說新研發的晶片將超越目前最高階的機種。報告中提到，內部消息人士指出，預計台積電能夠在 2023 年之前製造用於 Mac 和 iPhone 的 3nm 晶片。該報告稱，第三代晶片代號為 Ibiza、Lobos 和 Palma，它們很可能首先在高階 Mac 上亮相，例如未來的 14 吋和 16 吋 MacBook Pro 機型。據說還計劃在未來的 MacBook Air 中使用功能較弱的第三代晶片。
2107024.htm
台積電美國亞利桑那新廠計畫　千億資金到位　
▲台積電美國設廠計畫，千億資金到位。（示意圖／路透社）記者高兆麟／綜合報導晶圓代工龍頭台積電（2330）美國亞利桑那州新廠45億美元（約新台幣1260億元）公司債在21日完成定價。公司債預計於2021年10月25日完成交割，資金用作一般營業用途，而業界也看好新廠千億資金到位，相關建設也將如期推進於2024年量產，屆時將能達到5奈米製程月產兩萬片。根據台積電公告，由台積電100%持股的亞利桑那子公司已於紐約時間2021年10月20日完成45億美元無擔保公司債定價，公司債由台積公司無條件且不可撤回地提供保證，將註冊登記且公開發行，由承銷商進行承銷。台積電說明，此次募集包括2026年10月25日到期的12.5億美元公司債，利率1.75%；2031