# Ari's Scrapeyard: Straits Times
The Straits Times is one of Singapore's most popular newspapers. Given its popularity and its need to stay relevant in a world ruled by tech, The Straits Times has been running [its own website](https://straitstimes.com) for quite some time. Using its [sitemap](https://straitstimes.com/sitemap.xml), I was able to scrape, well, a lot from it.

Let's take a look at the notes.

In [6]:
!pip install selenium tqdm
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!rm -rf sample_data

# for computing total time taken, and also time taken for each model's training
import time
import pytz
from datetime import timedelta, datetime
time_alpha = time.time()

# for filestuffs, and some pretty printing
import os
import sys
from tqdm.notebook import tqdm

# for data scraping
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup

# data scraping webdriver
from selenium import webdriver
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
# chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver',options=chrome_options)
# driver.maximize_window()

# basic data manipulation & model training libraries
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.utils import resample 
from typing import List, Tuple # for types
# from keras import layers
# from keras.models import Model
# from keras.preprocessing.text import Tokenizer
# from keras.utils.np_utils import to_categorical
# from keras_preprocessing.sequence import pad_sequences

# lastly, this is for visualization
from keras.callbacks import TensorBoard
import matplotlib.pyplot as plt


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Reading package lists... Done
Building dependency tree       
Reading state information... Done
chromium-chromedriver is already the newest version (108.0.5359.71-0ubuntu0.18.04.5).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 20 not upgraded.
cp: '/usr/lib/chromium-browser/chromedriver' and '/usr/bin/chromedriver' are the same file


## Saving the URL & datetime of each ST article

In [9]:
# go british (init the dataframe)
df = pd.DataFrame()

save_url_only = True

# from what I saw, today news & shinmin news also have sitemaps, though I
# would definitely have to change the scraping methods here for them
target_site = "https://www.straitstimes.com/"

# get how many pages there are in straitstimes.com/sitemap.xml
total_sitemap_pages = len([x for x in BeautifulSoup(urlopen(f"{target_site}sitemap.xml"),'lxml').get_text().split('\n') if "page=" in x])

# save file name
full_raw_name = "/content/straitstimes_sitemap.xml_full-raw.csv"

if not(os.path.isfile(full_raw_name)):
  for i in tqdm(range(total_sitemap_pages)):
    # assert i == 0 # for debugging

    soup_text = BeautifulSoup(urlopen(f"{target_site}sitemap.xml?page={str(i+1)}"), 'lxml').get_text()
    soup_url = [x for x in soup_text.split("\n") if "https://" in x and len(x) >= len(target_site)+1] # list of urls
    soup_datetime = [x for x in soup_text.split("\n") if "+08:00" in x] # list of datetimes
    
    if save_url_only:
      df = pd.concat([df, pd.DataFrame(
          {"url": soup_url, "datetime": soup_datetime}
      )], ignore_index=True)
    else:
      # Taking the text of each article (the bulk of the time is spent here)
      soup_article = []
      for url in tqdm(soup_url, leave=False):
        os.system(f"wget {url}")
        with open(url.split("/")[-1], 'r') as f:        
          soup_article.append('\n'.join([y[3:-4] for y in [x.strip() for x in f.read().split("\n")] if "<p>" == y[:3] and "</p>" == y[-4:] and " " != y[3]]))
        os.system("rm "+url.split("/")[-1])
      
      df = pd.concat([df, pd.DataFrame(
          {"url": soup_url, "datetime": soup_datetime, "article": soup_article}
      )], ignore_index=True)
    
  df.drop_duplicates(inplace=True)
  df.dropna(inplace=True)

  # backup to runtime if error in notebook occurs
  df.to_csv(full_raw_name)

  
else:
  print("Already got the data, proceeding with it..")
  # load the saved CSV, otherwise df stays empty on this branch
  df = pd.read_csv(full_raw_name)

df

  0%|          | 0/36 [00:00<?, ?it/s]

Saved all 36 pages!

Unnamed: 0,url,datetime
0,https://www.straitstimes.com/singapore/singapo...,2016-01-20T01:58:03+08:00
1,https://www.straitstimes.com/world/americas/cu...,2016-01-31T12:16:09+08:00
2,https://www.straitstimes.com/world/middle-east...,2016-01-24T06:51:24+08:00
3,https://www.straitstimes.com/world/united-stat...,2016-01-22T07:42:48+08:00
4,https://www.straitstimes.com/world/united-stat...,2016-01-24T06:13:27+08:00
...,...,...
176168,https://www.straitstimes.com/singapores-best-c...,2022-08-24T12:29:58+08:00
176169,https://www.straitstimes.com/singapore-best-cu...,2022-09-05T11:40:25+08:00
176170,https://www.straitstimes.com/cortina-50th-anni...,2022-11-23T10:01:23+08:00
176171,https://www.straitstimes.com/world-cup-2022,2022-12-01T16:50:24+08:00
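
The cell above pulls URLs and datetimes out of the page text as two separate lists, which only line up if every entry has both. Here's a rough sketch of another way to do it, reading each entry's `<loc>` and `<lastmod>` together so a URL always stays paired with its own datetime. It assumes the pages follow the standard sitemaps.org format (I haven't checked every ST page against it), and `parse_sitemap_page`, `SAMPLE_PAGE` and the example.com URLs are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# namespace used by the standard sitemaps.org format (assumed to apply here)
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# made-up sample page, NOT real Straits Times data
SAMPLE_PAGE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/world/story-a</loc>
    <lastmod>2016-01-20T01:58:03+08:00</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/world/story-b</loc>
  </url>
</urlset>"""


def parse_sitemap_page(xml_text):
    """Return a list of (url, lastmod) pairs; lastmod is None if the tag is missing."""
    root = ET.fromstring(xml_text)
    rows = []
    for url_el in root.findall("sm:url", SITEMAP_NS):
        loc = url_el.findtext("sm:loc", default=None, namespaces=SITEMAP_NS)
        lastmod = url_el.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc is None:
            continue  # skip entries without a URL
        # fromisoformat keeps the +08:00 offset, so the result is timezone-aware
        parsed = datetime.fromisoformat(lastmod.strip()) if lastmod else None
        rows.append((loc.strip(), parsed))
    return rows
```

Something like `pd.DataFrame(parse_sitemap_page(page_text), columns=["url", "datetime"])` would then give the same two columns as before, with missing datetimes showing up as empty values instead of shifting the rows.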


If you're running this on Colab, I'd highly recommend creating some way to save the outputs in case this notebook finishes execution while you're away. The total estimated time taken to go through all 36 pages is at least **50 hours**. Google Colab does not allow sessions that long, so in the end I'd recommend executing this notebook on your own machine.

In [None]:
# note to ari: work out a way to add waypoints, so that you can stop and
# resume this at any point in time.
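
One possible shape for those waypoints, as a minimal sketch and not something this notebook actually runs yet: save one CSV per sitemap page, and skip any page whose CSV already exists when the loop is restarted. `scrape_with_waypoints`, `fake_fetch_page` and the `page_001.csv` naming are all hypothetical; the real `fetch_page` would wrap the per-page scraping from the cell above.

```python
import os
import pandas as pd


def scrape_with_waypoints(total_pages, fetch_page, checkpoint_dir):
    """Fetch pages 1..total_pages, saving one CSV per page so a run can resume."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    frames = []
    for page in range(1, total_pages + 1):
        path = os.path.join(checkpoint_dir, f"page_{page:03d}.csv")
        if os.path.isfile(path):
            # waypoint already saved: reuse it instead of fetching again
            page_df = pd.read_csv(path)
        else:
            page_df = fetch_page(page)
            # write under a temporary name first, then rename, so an
            # interrupted write never leaves a half-finished waypoint
            tmp_path = path + ".tmp"
            page_df.to_csv(tmp_path, index=False)
            os.replace(tmp_path, path)
        frames.append(page_df)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


def fake_fetch_page(page):
    """Stand-in for the real per-page scraping (made-up data)."""
    return pd.DataFrame({
        "url": [f"https://www.example.com/page-{page}"],
        "datetime": ["2016-01-20T01:58:03+08:00"],
    })
```

Calling `scrape_with_waypoints(3, fake_fetch_page, "waypoints")` twice should only fetch on the first call; the second one just reads the saved CSVs back. On Colab the checkpoint folder would need to live somewhere that survives the runtime, which is a separate problem.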