
```
                                       █████╗   █████╗  ██████╗
                                      ██╔══██╗ ██╔══██╗ ██╔══██╗
                                      ███████║ ███████║ ██████╔╝
                                      ██╔══██║ ██╔══██║ ██╔═══╝
                                      ██║  ██║ ██║  ██║ ██║
                                      ╚═╝  ╚═╝ ╚═╝  ╚═╝ ╚═╝
```
# A.I. APPLICATIONS PROJECT: PROJECT MILESTONE REPORT
MODULE_GRP: IT3100-02<br>
NRIC_NAME: MUHAMMAD ARIF BIN HAMED<br>
ADMIN_NO: null


<div style="text-align:center;">
  <img src="https://media1.giphy.com/media/L3Pp1dmjj8alAquswA/giphy.gif">
</div>

# <b>Classify [Straits Times](https://www.straitstimes.com/) articles as housing-related or not</b>
Each rubric point is addressed in specific cells; the important cells (rubrics fulfillment) are marked with this emoji -> 👩🏾‍💻

## Importing the libraries
Only 1 library needs to be pip installed for Colab: `sentencepiece`. More details on it when we get into subword tokenization.


In [None]:
!pip install sentencepiece

# for computing total time taken, and also time taken for each model's training
import time
import pytz
from datetime import timedelta 
from datetime import datetime
time_alpha = time.time()

# for filestuffs, and some pretty printing
import os
import sys
import matplotlib.backends.backend_pdf as mplpdf


# for data scraping
from urllib.request import urlopen
from bs4 import BeautifulSoup

# basic data manipulation & model training libraries
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.utils import resample 
from typing import List, Tuple # for types
from keras import layers
from keras.models import Model
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical # keras.utils.np_utils is gone in newer Keras
from keras_preprocessing.sequence import pad_sequences

# lastly, this is for visualization
from keras.callbacks import TensorBoard
import matplotlib.pyplot as plt

%load_ext tensorboard

# prevent scientific notations
pd.set_option('display.float_format', lambda x: '%.3f' % x)

## Scrape data directly from straitstimes.com/sitemap.xml
Instead of scraping and storing CSV files online for this notebook to use, I was able to use `beautifulsoup` & `urllib.request` to scrape every single Straits Times article **_url_** and **_datetime_** since 2013. This adds to the total time taken to run this notebook, and there is a very small chance of your Google account getting locked out of Google Colab. **Use at your own risk!**

As of 15 Dec 2022, it takes about ~6hr~ ~1hr~ ~30min~ **6min** to run this notebook. Settings such as the batch size and the number of epochs can be changed, but that may lead to longer wait times.

<br>

## 👩🏾‍💻 Details about data collected from https://www.straitstimes.com/sitemap.xml
Since my task (as of now) only involves classification using the **headlines** of Straits Times articles, data collection can be done on the fly by running the cell below. The Straits Times sitemap contains almost every known link under https://www.straitstimes.com.

Fortunately, unlike some other news sites like [Shin Min Daily](https://www.shinmin.sg/sitemap.xml), Straits Times uses a _slugified_ version of each article's headline as the webpage URL (Shin Min uses numbered date IDs).

With this discovery, I was able to utilize Python's `beautifulsoup` & `urllib.request` libraries to extract virtually every single Straits Times article that has been uploaded to their website, spanning all the way from **1 Jan 2013** to **about a week ago**.
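
To make the slug idea concrete, here is a minimal sketch of how a category and a headline could be recovered from one article URL. The helper name `split_article_url` and the sample URL are made up for illustration; the notebook itself does this step with pandas string methods in a later cell.

```python
from urllib.parse import urlsplit


def split_article_url(url):
    """Return (category, headline) recovered from a slugified article URL."""
    # drop the scheme and host, keep only the path without outer slashes
    path = urlsplit(url).path.strip("/")
    # everything before the last slash is the category, the rest is the slug
    category, _, slug = path.rpartition("/")
    # hyphens in the slug stand in for spaces in the headline
    return category, slug.replace("-", " ")


# hypothetical URL, not a real article
example = "https://www.straitstimes.com/singapore/housing/sample-headline-about-new-flats"
category, headline = split_article_url(example)
print(category)
print(headline)
```

A URL with no path at all (like the site root) comes back as two empty strings, which is one reason the later cells have to watch out for missing headlines.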

Irregularities in the data & the data preparation steps are covered in the next markdown cell.

In [None]:
# go british (init the dataframe)
df = pd.DataFrame()

# get how many pages there are in straitstimes.com/sitemap.xml
total_sitemap_pages = len([x for x in BeautifulSoup(urlopen("https://www.straitstimes.com/sitemap.xml"),'lxml').get_text().split('\n') if "page=" in x])

full_raw_name = "/content/straitstimes_sitemap.xml_full-raw.csv"

if not(os.path.isfile(full_raw_name)):
  for i in range(total_sitemap_pages):
    # assert i == 0 # for debugging
    url = "https://www.straitstimes.com/sitemap.xml?page="+str(i+1)
    html = urlopen(url)

    # note that using 'lxml' may not be available if you're running this 
    # notebook on a local runtime, which i would not recommend imo, as
    # there could be a chance that straitstimes would time you out
    soup = BeautifulSoup(html, 'lxml') 
    soup_as_text = soup.get_text()
    soup_url = [x for x in soup_as_text.split("\n") if "https://" in x]
    soup_datetime = [x for x in soup_as_text.split("\n") if "+08:00" in x]
    
    # first row in first page does not have datetime, hence remove url 
    # (which is just straitstimes.com)
    if i == 0:
      soup_url.remove(soup_url[0])
      # in the first page, soup_datetime has the length of 4999, so soup_url will add up with soup_datetime

    # appending to df
    df = pd.concat([df, pd.DataFrame(
        {"url": soup_url, "datetime": soup_datetime}
    )], ignore_index=True)
    
    # pretty printing
    if i == total_sitemap_pages - 1:
      sys.stdout.write("\rSaved all %i pages!" % (i+1))
    else:
      sys.stdout.write("\rSaving page %i" % (i+1))
    sys.stdout.flush()
    
  # backup to runtime if error in notebook occurs
  df.to_csv(full_raw_name)
else:
  print("Already got the data, proceeding..")

# as of 10 Dec 2022, this saved dataframe saves to a CSV that ends 
# up being around 23MB, so let's check it out here
df

## 👩🏾‍💻 Pre-process data
Even though I have a vast dataset of **at least 170000 rows** (as of time of writing), there are quite a lot of irregularities in the data.
* Drop cells containing NoneType & drop duplicate cells
  * With a dataset this large, a few headlines end up as _NoneType_ (or NaN) instead of _String_ at several points in this code. A likely cause is that some URLs have no headline segment left after splitting, and that pandas turns empty fields into missing values when the CSV is reloaded. Left alone, these lead to a variety of errors along the way.
  * The only way that I could counter this is by enforcing a check that drops rows containing _NoneType_ right before any line that could hit a string error (like writing text to a file or running text through a function, etc.)
  * This is not easy, as I also have to drop the matching rows elsewhere depending on the task that the cell is undertaking.
    * For example, if there is a _NoneType_ in the texts list at index 2423, I must also drop index 2423 from the label list.
* The `url` contains both the `category` and the `headline`, so it is split into **2 different columns** that are added to the dataframe. The `category` is not used; only the `headline` column is used as the **feature**.
* The `housing` column, which is the **target** column, is based on whether the `url` column contains certain keywords, such as _housing_, _hdb_, and a few others.
* **IMPORTANT** BALANCING DATA:
  * A simple list comprehension showed how many articles were related to housing: initially around **1000-3000** articles (depending on the keyword list), but I was using that against **172000-174000** non-housing articles. Even if I were generous with the keywords and got _5000_ articles, that is about **2.857% of the whole dataset**. The model would not be proficient at identifying housing articles against the ocean of ST articles.
  * There are 2 methods to tackle this:
    * _sklearn_'s resample function, which oversamples housing articles to raise their percentage against the non-housing articles, or
    * just **cut off a large portion of non-housing articles before splitting**. I did this because I don't want to risk overfitting the model on the minuscule number of housing articles that exist on Straits Times.
* Also do basic data cleaning.
  * For NLP in particular, I just replaced the _dashes_ with _spaces_ in the `headline` column.
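
The labelling and balancing steps above can be sketched without pandas as follows. This is only an illustration under stated assumptions: `label_housing`, `undersample` and the sample URLs are hypothetical, the keyword lists are a small subset of the ones in the cell below, and the exclusion list is applied here even though the notebook currently leaves it disabled.

```python
import random

INCLUDE = ["housing", "hdb", "bto"]   # subset of the notebook's housing keywords
EXCLUDE = ["white-house"]             # assumption: exclusion list switched on


def label_housing(url):
    """Return 1 if the URL looks housing-related, else 0."""
    if any(keyword in url for keyword in EXCLUDE):
        return 0
    return int(any(keyword in url for keyword in INCLUDE))


def undersample(rows, ratio=3, seed=42):
    """Keep every housing row and at most `ratio` times as many other rows."""
    rng = random.Random(seed)
    minority = [row for row in rows if row[1] == 1]
    majority = [row for row in rows if row[1] == 0]
    kept = rng.sample(majority, min(len(majority), ratio * len(minority)))
    balanced = minority + kept
    rng.shuffle(balanced)  # mix the two groups again before splitting
    return balanced


# made-up URLs: 2 housing articles against 20 others
urls = ["https://example.com/singapore/housing/flat-%d" % i for i in range(2)]
urls += ["https://example.com/sport/match-%d" % i for i in range(20)]
rows = [(url, label_housing(url)) for url in urls]
balanced = undersample(rows)
print(len(balanced), sum(label for _, label in balanced))
```

With a ratio of 3, non-housing rows make up 3 / (3 + 1) = 75% of the balanced data, which matches the comment in the cell below.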



In [None]:
# this line is for debugging this cell, though it doesn't hurt to leave it here
df = pd.read_csv('/content/straitstimes_sitemap.xml_full-raw.csv')

full_prepared_name = "/content/straitstimes_sitemap.xml_full-prepared.csv"

  # HOUSING DICTIONARY (more like array)
housing_array = [
  [ # First, mark all rows that has these
    "housing",
    # "house",
    "hdb",
    "real-estate",
    "private-estate",
    "public-estate",
    "home-approval",
    "housing-estate",
    "columbarium",
    # "business",
    "bto"
  ],
  [ # This part of the array could be implemented, however for some reason it's not possible now
    "white-house",
    "united-states",
    "middle-east",
    "israel",
    "aye",
    "interview-movie",
    "syria",
    "world/americas",
    "entertainment",
    "sph",
    "mom",
    "royal-infant"
  ]
  ]

# # This function is meant to be used in apply(lambda x: funct(x)) thing
# # disabled for now
# def label_housing(row):
#   out_label = 0
  
#   for i in housing_array[0]:
#     if i in row["url"]:
#       out_label = 1
#   for i in housing_array[1]:
#     if i in row["url"]:
#       out_label = 0
#   return out_label

if os.path.isfile(full_prepared_name):
  df = pd.read_csv(full_prepared_name)
  print("Prepared data exists, proceeding..")
else:
  # drop duplicates & nonetypes, though there are likely no dupes
  df.drop_duplicates(inplace=True)
  df.dropna(inplace=True)

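  # remove the starting part of the url ("https://www.straitstimes.com/", 29 chars)
  # then split what is left into category and headline at the last slash
  # (n is passed by keyword, newer pandas no longer accepts it positionally)
  df[["category","headline"]] = df["url"].str.slice(29).str.rsplit('/', n=1, expand=True)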

  # add in the target column (simple filtering based of some keywords for now)
  df['housing'] = df.apply(lambda row: 1 if  any(sbs in row['url'] for sbs in housing_array[0]) else 0, axis=1)

  # replace hyphens with spaces in headline (category is left as it is)
  # this is the only data cleaning needed as the url is already slugified
  df["headline"] = df["headline"].str.replace("-", " ")

  print("Current number of articles that are about housing: %i\nDataset will be balanced to it." % sum(df["housing"].tolist()))

  # VERY IMPORTANT
  # This part will cut off an amount of non housing articles
  # this is decided by the ratio between housing & non-housing times a fixed no
  # For example, at 3, 75% of the articles is now non-housing (4, 80%)
  df_majority = df[df["housing"]==0]                      # 175000 -> n * 3   
  df_minority = df[df["housing"]==1] # based              #      n 
  throwaway, df_majority = train_test_split(
      df_majority, 
      test_size=len(df_minority.index)/len(df_majority.index)*3, 
      random_state=42, 
      shuffle=True
  )
  # combine both back again & randomize again, resetting index
  df = pd.concat([df_majority, df_minority])
  df = df.sample(frac=1).reset_index(drop=True) 

  ########## ##

  # IMPORTANT: Force headlines to be string, else, cut them off
  # iterations of this will unfortunately appear in later cells 😥
  # IMPORTANT: THIS SHOULD BE FINAL BEFORE SPLITTING
  # keep only the rows whose headline is a real string (drops None/NaN headlines)
  # note: df.drop() without reassigning does nothing, so filter and reassign
  df = df[df["headline"].apply(lambda h: isinstance(h, str))].reset_index(drop=True)

  print("Total df rows before real splitting: "+str(len(df.index)))

  # backup prepared data to hosted runtime 
  df.to_csv(full_prepared_name)

# SPLIT
df_train, df_test = train_test_split(
    df, 
    test_size=0.2, 
    random_state=42, 
    stratify=df["housing"], 
    shuffle=True
)

df_train 

## Preparing data for training and validation
Further splitting here for training & validation. 

Random state was chosen as _42_. An in-depth explanation can be found in [this article](https://grsahagian.medium.com/what-is-random-state-42-d803402ee76b).
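
As a rough worked example of what the two successive splits do to the row counts, and of why a fixed seed matters, here is a small standard-library sketch. `split_sizes` and `shuffled` are hypothetical helpers; the rounding assumes the held-out part is rounded up (which I believe is what scikit-learn does for float sizes), so real counts may differ by a row, especially after non-string rows are dropped.

```python
import math
import random


def split_sizes(n_rows, test_size=0.2, val_split=0.2):
    """Return (train, validation, test) row counts for two successive splits."""
    # assumption: the held-out part is rounded up
    n_test = math.ceil(n_rows * test_size)
    n_remaining = n_rows - n_test
    n_val = math.ceil(n_remaining * val_split)
    return n_remaining - n_val, n_val, n_test


def shuffled(items, seed):
    """Shuffle a copy of items with a seeded generator."""
    rng = random.Random(seed)
    out = list(items)
    rng.shuffle(out)
    return out


print(split_sizes(10000))
# the same seed gives the same order on every run
print(shuffled(range(10), 42) == shuffled(range(10), 42))
```

So with both splits at 0.2, roughly 64% of the balanced rows are used for training, 16% for validation and 20% for testing.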

In [None]:
random_state = 42
val_split = 0.2

df_train_copy = df_train.copy()
texts = list(df_train_copy["headline"]) # x
labels = list(df_train_copy["housing"]) # y

# ensure that NOTHING IS NONETYPE. This could still happen after splitting with
# sklearn's train_test_split, and i don't know why
offaxis_removal = 0
for i in range(len(texts)):
  if not(isinstance(texts[i-offaxis_removal], str)):
    del texts[i-offaxis_removal]
    del labels[i-offaxis_removal]
    offaxis_removal += 1 

print('sample text: ', texts[0])
print('corresponding label:', labels[0])

# there should only be [1. 0.] & [0. 1.] representing 0 & 1 respectively
# technically since this is a binary problem for now, i don't need to categorize
# the labels, but just in case i need to use this notebook for further categorization
labels = to_categorical(labels)

# this function could have some problems (next comment)
texts_train, texts_val, y_train, y_val = train_test_split(
    texts, labels,
    test_size=val_split,
    random_state=random_state,
    stratify=labels,
    shuffle=True)

## FURTHER EVICTING OF NONETYPES this is ridiculous
# texts_train and y_train should be of same size, vice versa
offaxis_removal = 0
for i in range(len(texts_train)):
  if not(isinstance(texts_train[i-offaxis_removal], str)):
    del texts_train[i-offaxis_removal]
    y_train = np.delete(y_train, i-offaxis_removal, axis=0) # axis=0 removes the whole row, without it the array gets flattened
    offaxis_removal += 1
offaxis_removal = 0
for i in range(len(texts_val)):
  if not(isinstance(texts_val[i-offaxis_removal], str)):
    del texts_val[i-offaxis_removal]
    y_val = np.delete(y_val, i-offaxis_removal, axis=0) # axis=0 removes the whole row, without it the array gets flattened
    offaxis_removal += 1 

print('labels shape:', labels.shape)
print('train size: ', len(texts_train))
print('validation size: ', len(texts_val))
# If train size + validation size = labels shape[0], 
# then it's all set for training 🙂

# Using 1D CNN Model for training

In both subword-level and word-level, we will use the same CNN sequential model to compare which one would be more suitable for application
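
To see where the shape comments in the model cell come from, the sequence length can be traced by hand. This is a small sketch assuming the Keras defaults of `'valid'` padding, stride 1 for `Conv1D`, and a pooling stride equal to the pool size; the helper names are made up.

```python
def conv1d_out(length, kernel_size):
    """Output length of a 'valid' Conv1D with stride 1."""
    return length - kernel_size + 1


def maxpool1d_out(length, pool_size):
    """Output length of a 'valid' MaxPooling1D with stride equal to pool_size."""
    return length // pool_size


length = 16                       # max_sequence_len used in this notebook
trace = [length]
length = conv1d_out(length, 5)    # Conv1D_1
trace.append(length)
length = maxpool1d_out(length, 1) # MaxPool1D_1 (pool size 1 changes nothing)
trace.append(length)
length = conv1d_out(length, 5)    # Conv1D_2
trace.append(length)
length = maxpool1d_out(length, 2) # MaxPool1D_2
trace.append(length)
flattened = length * 128          # 128 filters in the last conv layer

print(trace)
print(flattened)
```

The trace goes 16, 12, 12, 8, 4 and the flattened size is 512, which is why `max_sequence_len` cannot shrink much further without also changing the kernel sizes.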

In [None]:
def text_cnn(max_sequence_len: int, max_features: int, num_classes: int, 
              optimizer: str='adam', metrics: List[str]=['acc']) -> Model:
    
    sequence_input = layers.Input(shape=(max_sequence_len,), dtype='int32', name="Input") # [(None, 16)]
    embedded_sequences = layers.Embedding(
        max_features, 
        256, 
        trainable=True, 
        name="Embedding"
    )(sequence_input)                                                                     # (None, 16, 256)
    # LSTM could not be used due to some input compatibility problem, but its 
    # not so necessary as the size of each sample is very small
    # lstm_embedded = layers.Bidirectional(layers.LSTM(128, return_sequences=True), name="BidirectionalLSTM")(embedded_sequences)
    conv1 = layers.Conv1D(128, 5, activation='relu', name="Conv1D_1")(embedded_sequences) # (None, 12, 128)
    pool1 = layers.MaxPooling1D(1, name="MaxPool1D_1")(conv1)                             # (None, 12, 128)
    conv2 = layers.Conv1D(128, 5, activation='relu', name="Conv1D_2")(pool1)              # (None, 8, 128)
    pool2 = layers.MaxPooling1D(2, name="MaxPool1D_2")(conv2)                             # (None, 4, 128)
    flatten = layers.Flatten(name="Flatten")(pool2)                                       # (None, 512)
    dens1 = layers.Dense(128, activation='relu', name="Dense_1")(flatten)                 # (None, 128)
    dens2 = layers.Dense(128, activation='relu', name="Dense_2")(dens1)                   # (None, 128)
    dens3 = layers.Dense(32, activation='relu', name="Dense_3")(dens2)                    # (None, 32)
    preds = layers.Dense(num_classes, activation='sigmoid', name="Dense_Preds")(dens3)    # (None, 2)

    model = Model(sequence_input, preds)
    # binary crossentropy, cuz its yes/no if article is about housing
    model.compile(loss='binary_crossentropy',
                  optimizer=optimizer,
                  metrics=metrics)
    return model

## Prepare test data for possible use before training (such as callbacks)
Testing dataset is already cleaned before splitting away from training dataset.

Cell below prepares the data for testing

This is just in case I want to make future improvements to the model creation process here.
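
The cell below has to drop non-string headlines while keeping the URLs and targets lined up with them. A compact way to express that idea is sketched here; `drop_non_string_rows` is a hypothetical helper and not what the notebook actually runs.

```python
def drop_non_string_rows(texts, *aligned):
    """Drop every position where texts holds a non-string, in all lists at once."""
    keep = [i for i, text in enumerate(texts) if isinstance(text, str)]
    # rebuild each list from the surviving positions so they stay the same length
    return tuple([seq[i] for i in keep] for seq in (texts, *aligned))


# made-up data with a None and a NaN where headlines should be
texts = ["flat prices rise", None, "new mrt line", float("nan")]
urls = ["u1", "u2", "u3", "u4"]
targets = [1, 0, 0, 1]

texts, urls, targets = drop_non_string_rows(texts, urls, targets)
print(texts, urls, targets)
```

Collecting the positions to keep first avoids the index bookkeeping that comes with deleting from a list while looping over it.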

In [None]:
url_col = 'url' # url is used as IDs
prediction_col = 'housing'
output_dir = "/content/output"

# WHAT WILL END UP IN OUTPUT.CSV
texts_test = df_test["headline"].tolist()   # FEATURES  (X)
urls = df_test[url_col].tolist()            # ID
housing_test = df_test["housing"].tolist()  # TARGET    (Y)

# ensure that NOTHING IS NONETYPE. this still happens here
# and its inconvenient, and its frustrating, but its not impossible to fix 😮‍💨
offaxis_removal = 0
for i in range(len(texts_test)):
  if not(isinstance(texts_test[i-offaxis_removal], str)):
    del texts_test[i-offaxis_removal] # FEATURES
    del urls[i-offaxis_removal] # IDS 
    del housing_test[i-offaxis_removal] # TARGETS
    offaxis_removal += 1 


## Prepare Tensorboard


In [None]:
root_logdir = os.path.join(os.curdir, "ari_aap_logs")
def get_run_logdir(model): # use a new directory for each run
  run_id = datetime.now(pytz.timezone("Asia/Singapore")).strftime(model+'_run_%Y-%m-%d_%H-%M-%S_SGT')
  return os.path.join(root_logdir, run_id)

run_logdir_sentencepiece = get_run_logdir("sentencepiece")
run_logdir_word = get_run_logdir("word")

tensorboard_cb_sentencepiece = TensorBoard(run_logdir_sentencepiece)
tensorboard_cb_word = TensorBoard(run_logdir_word)

# model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
#   filepath="bestcheckpoint",
#   save_weights_only=True,
#   monitor='val_accuracy',
#   mode='max',
#   save_best_only=True)

## <em>Subword-level</em> tokenization
Using `sentencepiece`, a third-party Python library, we will encode the headlines into subwords and build a subword vocab file from them. This is what each cell does until the next markdown cell:
- Create `train.txt`, where all `headlines` are temporarily stored for the next cell.
- Use `sentencepiece` to train a subword tokenizer on that file, then load it.
- The 4th cell from here converts the text: it encodes each headline into token IDs and pads them to a fixed length.
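
For intuition about what a `bpe` model learns, here is a toy byte-pair-encoding sketch in plain Python. It is not how `sentencepiece` is implemented (for one, this toy marks word ends with `</w>`, which is my own choice for the example); it only shows the core idea of repeatedly merging the most frequent adjacent pair of symbols. All function names are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    # every word starts as single characters plus an end-of-word marker
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # most frequent pair wins, ties are broken alphabetically
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["hdb", "hdb", "hdb", "hub"], num_merges=2)
print(merges)
print(encode("hdb", merges))
```

Frequent character sequences become single tokens while rare words fall back to smaller pieces, which is why a subword vocab can cover unseen headlines better than a fixed word list.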

In [None]:
temp_file = 'train.txt'
with open(temp_file, 'w') as f:
    f.write("\n".join(texts)) 
    # if this says NoneType error btw, may as well defenestrate yourself

In [None]:
from sentencepiece import SentencePieceTrainer, SentencePieceProcessor

max_num_words = 12000
model_type = 'bpe' # 'unigram' is default, i would go with bpe cuz subwords
model_prefix = model_type
pad_id = 0 # padding to make samples same size
unk_id = 1 # unknown words
bos_id = 2 # beginning of sentence
eos_id = 3 # end of sentence

def rpjmnw(max_num_words_infunct):
  # build the argument string that SentencePieceTrainer.train() expects
  return ' '.join([
    '--input={}'.format(temp_file),
    '--model_type={}'.format(model_type),
    '--model_prefix={}'.format(model_prefix), # same prefix that sp.load() uses later
    '--vocab_size={}'.format(max_num_words_infunct),
    '--pad_id={}'.format(pad_id),
    '--unk_id={}'.format(unk_id),
    '--bos_id={}'.format(bos_id),
    '--eos_id={}'.format(eos_id)
  ])

print(rpjmnw(max_num_words))
SentencePieceTrainer.train(rpjmnw(max_num_words))

    


In [None]:
# create SentencePieceProcessor. Tokenizes a string to subwords
sp = SentencePieceProcessor()
sp.load("{}.model".format(model_prefix))
print('Found %s unique tokens.' % sp.get_piece_size())

In [None]:
# this var is what you know about the data
# since headlines aren't as long as movie reviews, I set it as 16 for now
# also adjust the model when adjusting this too
max_sequence_len = 16

# line below will tokenize (convert words to numbers)
sequences_train1 = [sp.encode_as_ids(str(text)) for text in texts_train]
# pads tokenized sequence to make it compatible with model
x_train1 = pad_sequences(sequences_train1, maxlen=max_sequence_len)

# line below will tokenize (convert words to numbers)
sequences_val1 = [sp.encode_as_ids(str(text)) for text in texts_val]
# pads tokenized sequence to make it compatible with model
x_val1 = pad_sequences(sequences_val1, maxlen=max_sequence_len)


print(sequences_train1[0][:5])
print(x_train1[0])

In [None]:
print('sample text: ', texts_train[0])
print('sample text: ', sp.encode_as_pieces(sp.decode_ids(x_train1[0].tolist())))

In [None]:
num_classes = 2
model1 = text_cnn(max_sequence_len, max_num_words + 1, num_classes)
model1.summary()

### TRAINING

In [None]:
# TRAINING HERE
start = time.time()
history1 = model1.fit(x_train1, y_train,
                      validation_data=(x_val1, y_val),
                      batch_size=16, 
                      epochs=8,
                      callbacks=[tensorboard_cb_sentencepiece])#, model_checkpoint_callback])
print("\nTime taken to train subword-level model: "+str(timedelta(seconds=time.time() - start)))

In [None]:
# check out more details after training

plt.plot(model1.history.history['acc'])
plt.plot(model1.history.history['val_acc'])
plt.title('subword-token model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left') # second curve is the validation split, not the test set
plt.show()

plt.plot(model1.history.history['loss'])
plt.plot(model1.history.history['val_loss'])
plt.title('subword-token model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left') # second curve is the validation split, not the test set
plt.show()

## <em>Word-level</em> tokenization
This follows the same steps as the subword-level section, but uses word-level tokenization instead. Combining subword-level tokenization & word-level tokenization can help capture the meaning behind the words more reliably. More details can be found here -> https://aclanthology.org/2020.acl-srw.10/

However, all I'm doing for now is finding out which model is better at classifying ST articles as housing or not.

<br><br>
btw both model trainings use the same batch size (16) and number of epochs (8), as set in the training cells.
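
As a rough picture of what the word-level cells below do, here is a plain-Python sketch of a word index with an out-of-vocabulary token, followed by front padding. It mimics the behaviour of the Keras `Tokenizer` and `pad_sequences` defaults as I understand them (index 0 for padding, index 1 for `<unk>`, padding and truncating at the front), but it skips details such as punctuation filtering, and both function names are made up.

```python
from collections import Counter


def build_word_index(texts, oov_token="<unk>"):
    """Map each word to an integer, most frequent words first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    index = {oov_token: 1}  # 0 is left free for padding
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index


def texts_to_padded(texts, index, maxlen, num_words=None):
    """Convert texts to fixed-length lists of word ids."""
    oov_id = 1
    rows = []
    for text in texts:
        ids = [index.get(word, oov_id) for word in text.lower().split()]
        if num_words is not None:
            # words ranked outside the vocabulary limit count as unknown
            ids = [i if i < num_words else oov_id for i in ids]
        ids = ids[-maxlen:]                            # truncate from the front
        rows.append([0] * (maxlen - len(ids)) + ids)   # pad at the front
    return rows


index = build_word_index(["new bto flats launched", "bto prices rise"])
print(index)
print(texts_to_padded(["bto flats sold"], index, maxlen=5))
```

The unseen word "sold" maps to the `<unk>` id here, whereas a subword tokenizer would break it into known pieces instead.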

In [None]:
tokenizer = Tokenizer(num_words=max_num_words, oov_token='<unk>')
tokenizer.fit_on_texts(texts_train)
print('Found %s unique tokens.' % len(tokenizer.word_index))

In [None]:
# line below will tokenize (convert words to numbers)
sequences_train2 = tokenizer.texts_to_sequences(texts_train)
# pads tokenized sequence to make it compatible with model
x_train2 = pad_sequences(sequences_train2, maxlen=max_sequence_len)

# line below will tokenize (convert words to numbers)
sequences_val2 = tokenizer.texts_to_sequences(texts_val)
# pads tokenized sequence to make it compatible with model
x_val2 = pad_sequences(sequences_val2, maxlen=max_sequence_len)

In [None]:
num_classes = 2
model2 = text_cnn(max_sequence_len, max_num_words + 1, num_classes)
model2.summary()

### TRAINING

In [None]:
# TRAINING HERE
start = time.time()
history2 = model2.fit(x_train2, y_train,
                      validation_data=(x_val2, y_val),
                      batch_size=16, 
                      epochs=8,
                      callbacks=[tensorboard_cb_word])#, model_checkpoint_callback])
print("\nTime taken to train word-level model: "+str(timedelta(seconds=time.time() - start)))

In [None]:
# check out more details after training

plt.plot(model2.history.history['acc'])
plt.plot(model2.history.history['val_acc'])
plt.title('word-token model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left') # second curve is the validation split, not the test set
plt.show()

plt.plot(model2.history.history['loss'])
plt.plot(model2.history.history['val_loss'])
plt.title('word-token model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left') # second curve is the validation split, not the test set
plt.show()

# Initiate Testing

## Create test sequences

In [None]:
# word-level test sequences
# line below will tokenize (convert words to numbers)
word_sequences_test = tokenizer.texts_to_sequences(texts_test)
# pads tokenized sequence to make it compatible with model
word_x_test = pad_sequences(word_sequences_test, maxlen=max_sequence_len)
print("Word-level x_test: "+str(len(word_x_test)))

# subword-level test sequences
# line below will tokenize (convert words to numbers)
sentencepiece_sequences_test = [sp.encode_as_ids(str(text)) for text in texts_test]
# pads tokenized sequence to make it compatible with model
sentencepiece_x_test = pad_sequences(sentencepiece_sequences_test, maxlen=max_sequence_len)
print("Subword-level x_test: "+str(len(sentencepiece_x_test)))

# both should be of same length, else there's something horribly horribly wrong 

### Output function for ease of use

In [None]:
# ease things up using functions
def create_output(urls, predictions, url_col, prediction_col, output_path) -> pd.DataFrame:
    df_output = pd.DataFrame({
        url_col: urls,
        prediction_col: predictions.round(6) * 100
    }, columns=[url_col, prediction_col])

    ## NOTE: Predictions will come out as percentage, not by 1

    if output_path is not None:
        # create the directory if need be, e.g. if the output_path = output/output.csv
        # we'll create the output directory first if it doesn't exist
        directory = os.path.split(output_path)[0]
        # the old check used `or`, which is always True, so '' could reach makedirs
        if directory not in ('', '.') and not os.path.isdir(directory):
            os.makedirs(directory, exist_ok=True)

        df_output.to_csv(output_path, index=False, header=True)

    return df_output


## Predictions
This is where the bulk of prediction happens; it shouldn't take more than 10min.

Note the first line, the `predictions_dict` variable declaration.
  * note the `[:, 1]`, it's there because of how the output is formed
  * each row has 2 values, with `[1. 0.]` & `[0. 1.]` representing 0 & 1 respectively
  * `:` means all the rows, `, 1` means taking only the second value (the housing score)

Next line generates output to CSVs for accuracy comparison later.
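
Here is a tiny NumPy sketch of what `[:, 1]` does to a two-column prediction array. The numbers are made up, and the 0.5 cut-off used to turn scores into labels is an assumption for this example, not something the notebook defines.

```python
import numpy as np

# made-up model output for three headlines: one row per headline,
# column 0 = "not housing" score, column 1 = "housing" score
fake_predictions = np.array([[0.9, 0.1],
                             [0.2, 0.8],
                             [0.5, 0.5]])

housing_scores = fake_predictions[:, 1]   # all rows, second column only
as_percent = housing_scores.round(6) * 100  # same scaling as create_output
predicted_labels = (housing_scores > 0.5).astype(int)  # assumed 0.5 threshold

print(housing_scores.shape)
print(as_percent)
print(predicted_labels)
```

The result is a flat array with one housing score per headline, which is the shape `create_output` needs for its prediction column.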


In [None]:
predictions_dict = {
    'sentencepiece_cnn': model1.predict(sentencepiece_x_test)[:, 1], 
    'word_cnn': model2.predict(word_x_test)[:, 1] 
}

# right after prediction, write outputs to CSVs based on their models
for model_name, predictions in predictions_dict.items():
    print('generating output for: ', model_name)
    output_path = '/content/{}_output.csv'.format(model_name)
    df_output = create_output(urls, predictions, url_col, prediction_col, output_path)

# sanity check to make sure the size and the output of the output makes sense
print(df_output.shape)
df_output.head()

## 👩🏾‍💻 Evaluate & launch TensorBoard

In [None]:
# Evaluate the model on the test data using `evaluate`

print("Evaluate on test data\n") # predicted results, truth 
# the models output 2 columns, so the test labels are one-hot encoded like y_train
y_test = to_categorical(housing_test, num_classes=2)
# each model must be evaluated on the sequences made by its own tokenizer
results1 = model1.evaluate(sentencepiece_x_test, y_test, batch_size=512)
print("SentencePiece:\n\ttest loss, test acc: "+ str(results1) +"\n")
results2 = model2.evaluate(word_x_test, y_test, batch_size=512)
print("Word:\n\ttest loss, test acc: "+ str(results2)+"\n")
print("\nThe tokenization method that performed better is: "+str("Sub-word tokenization" if results1[1] > results2[1] else "Word tokenization"), end="\n\n\n\n")

%tensorboard --logdir /content/ari_aap_logs/

Everything below this line is not relevant to the project; rather, it's for convenience.

---

# Save it all
Saves the models, the scraped data & the outputs from each model.

Oh and also check how long it took to run this notebook

In [None]:
model1.save("/content/sentencepiece_model")
model2.save("/content/word_model")
notebookname = "AAP_shortlist-articles"
%notebook AAP_shortlist-articles_history_no-markdown.ipynb
currentdatetime = datetime.now(pytz.timezone("Asia/Singapore")).strftime('%Y-%m-%d_%H-%M-%S_SGT')
outputzip = "["+currentdatetime+"]_"+notebookname+".zip"
os.system("zip -r "+outputzip+r" /content/* -x /content/sample_data/\* /content/.config/\* /content/.ipynb_checkpoints/\*") # raw string so \* is not treated as a Python escape

%cd /content

# your details here to make a zipped copy of this notebook project
username = ""
email = ""
ghp_key = ""
colab_output_repo = "colab-outputs"

os.system("git config --global user.name \""+username+"\"")
os.system("git config --global user.email \""+email+"\"")
os.system("git clone --depth=1 https://"+username+":"+ghp_key+"@github.com/"+username+"/"+colab_output_repo+"")
os.system("mv /content/"+outputzip+" /content/"+colab_output_repo+"/"+outputzip+" && cd "+colab_output_repo+" && git add . && git commit -m "+currentdatetime+"_"+notebookname+" && git push")


In [None]:
print("Total time taken to run this notebook: "+str(timedelta(seconds=time.time() - time_alpha)))

In [None]:
# wait for 30 seconds to ensure that everything is executed before google colab commit sepukku
# time.sleep(30)

# from google.colab import runtime
# runtime.unassign()

# Summary
This was written after the final training and the milestone report.

There were many things that I learnt throughout the creation of this notebook, both before & after the project milestone report.

1. **The importance of getting balanced data**. The greatest revelation was that I was training on very imbalanced data until about 2 days after the project milestone report. Initially, I had about _3000_ articles about housing, which is about **1.71%** of the _175000_ articles that I had in training, and even with 3000 articles I was already being generous with the keywords. It wasn't until then that I started considering simply cutting off some of the non-housing articles, because it felt like the model would identify housing articles better with a closer balance. Little did I know that imbalance is a legit problem that can cause the model to severely overfit. I figured that simply cutting off non-housing articles would be the best choice, as I had about _175000_ rows to deal with. Cutting off a digit or two from that number helped tremendously in training, and the model is finally able to give good predictions.
1. **The importance of knowing how to build a model**. Not much to say about this one, because it's something that I had a lot of trouble with when trying different layers and working out what each number does. Using **model.summary()** after model.compile() is very helpful in understanding the existing model. It also absolutely helps to read about layers from other sources too.
1. **The importance of using the best & most efficient methods of data collection**. Initially I used UiPath Studio to create a CSV file that contained only 5000 rows of data. As versatile & powerful as UiPath Studio is, I was interested in making everything available in this notebook itself instead of updating and uploading the CSV file once in a while. Not only was I able to increase my dataset to 175000 unique rows, I am also able to constantly get updated results according to Straits Times' latest articles (that are uploaded to their main site).
1. **The importance of experimentation**. This is essential in A.I. Science; it's what makes A.I. science a science. I cross-referenced my subword-level and word-level tokenization processes against a blog online, and even though I did somewhat understand why it showed both types of tokenization, only after much painstaking training and modification of my own notebook to fit my purpose did it really hit me that creating A.I. is not easy. It requires experimentation, documentation, and a lot more. This whole experience really made me appreciate the trailblazers and masters of A.I., and admittedly, really encourages me to potentially pursue A.I. as a career.
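
As a quick check on the imbalance figures in point 1, and as one way to see why imbalance is misleading, here is a small arithmetic sketch. The helper names are hypothetical, and the "always answer not housing" baseline is my own illustration rather than something measured in this notebook.

```python
def minority_share(n_minority, n_total):
    """Percentage of rows that belong to the minority class."""
    return 100 * n_minority / n_total


def majority_baseline_accuracy(n_minority, n_total):
    """Accuracy (in %) of a model that always answers 'not housing'."""
    return 100 * (n_total - n_minority) / n_total


print(round(minority_share(3000, 175000), 2))              # before balancing
print(round(majority_baseline_accuracy(3000, 175000), 2))  # trivial model, before
print(majority_baseline_accuracy(1, 4))                     # trivial model at a 3:1 ratio
```

Before balancing, a model that never predicts housing would still look very accurate, so accuracy alone says little; at a 3:1 ratio that trivial baseline drops to 75%.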
