# Text classification using TensorFlow/Keras on AI Platform

This notebook illustrates:

1. Creating datasets for AI Platform using BigQuery
2. Creating a text classification model using the Estimator API with a Keras model
3. Training on Cloud AI Platform
4. Rerunning with a pre-trained embedding

In [None]:
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst

In [None]:
!pip install --user google-cloud-bigquery==1.25.0

In [None]:
# change these to try this notebook out
BUCKET = "cloud-training-demos-ml"
PROJECT = "cloud-training-demos"
REGION = "us-central1"

In [None]:
import os
os.environ["BUCKET"] = BUCKET
os.environ["PROJECT"] = PROJECT
os.environ["REGION"] = REGION
os.environ["TFVERSION"] = "2.1"

if "COLAB_GPU" in os.environ:  # this is always set on Colab; the value is 0 or 1 depending on whether a GPU is attached
    from google.colab import auth
    auth.authenticate_user()
    # download "sidecar files" since on Colab, this notebook will be on Drive
    !rm -rf txtclsmodel
    !git clone --depth 1 https://github.com/GoogleCloudPlatform/training-data-analyst
    !mv training-data-analyst/courses/machine_learning/deepdive/09_sequence/txtclsmodel/ .
    !rm -rf training-data-analyst
    # downgrade TensorFlow to the version this notebook has been tested with
    !pip install --upgrade tensorflow==$TFVERSION

In [None]:
import tensorflow as tf
print(tf.__version__)

We will look at the titles of articles and figure out whether the article came from the New York Times, TechCrunch or GitHub.

We will use [Hacker News](https://news.ycombinator.com/) as our data source. It is an aggregator that displays tech-related headlines from various sources.

## Creating dataset from BigQuery

Hacker News headlines are available as a BigQuery public dataset. The [dataset](https://bigquery.cloud.google.com/table/bigquery-public-data:hacker_news.stories?tab=details) contains all headlines from the site's inception in October 2006 until October 2015.

Here is a sample of the dataset.

In [None]:
%load_ext google.cloud.bigquery

In [None]:
%%bigquery --project $PROJECT
SELECT
    url, title, score
FROM
    `bigquery-public-data.hacker_news.stories`
WHERE
    LENGTH(title) > 0
    AND score > 10
    AND LENGTH(url) > 0
LIMIT 10

Let's do some regular expression parsing in BigQuery to get the source of each article from its URL. For example, if the URL is http://mobile.nytimes.com/..., we want to be left with _nytimes_.

In [None]:
%%bigquery --project $PROJECT
SELECT
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, ".*://(.[^/]+)/"), "."))[OFFSET(1)] AS source,
    COUNT(title) AS num_articles
FROM
    `bigquery-public-data.hacker_news.stories`
WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, ".*://(.[^/]+)/"), ".com$")
    AND LENGTH(title) > 10
GROUP BY
    source
ORDER BY num_articles DESC
LIMIT 10
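To make the parsing logic in the query above easier to follow, here is a rough Python sketch of the same steps: extract the host, keep only hosts ending in `.com`, split on dots, reverse, and take the element at offset 1. The function name `extract_source` is hypothetical and not part of the notebook, and returning `None` for non-matching URLs is a choice made for this sketch.

```python
import re

# Same pattern as in the query: capture the host between "://" and the next "/".
HOST_PATTERN = re.compile(r".*://(.[^/]+)/")


def extract_source(url):
    """Return the second-to-last host label (e.g. 'nytimes'), or None.

    Hypothetical helper that mirrors the SQL expression for illustration only.
    """
    match = HOST_PATTERN.search(url)
    if match is None:
        return None
    host = match.group(1)
    # Mirrors the REGEXP_CONTAINS(..., ".com$") filter in the WHERE clause.
    if re.search(r".com$", host) is None:
        return None
    labels = host.split(".")[::-1]  # like ARRAY_REVERSE(SPLIT(host, "."))
    # like [OFFSET(1)]; this sketch returns None if there is no second label
    return labels[1] if len(labels) > 1 else None


print(extract_source("http://mobile.nytimes.com/2015/story.html"))
print(extract_source("https://github.com/tensorflow/tensorflow"))
```

Note that the pattern requires a `/` after the host, so a bare URL such as `http://techcrunch.com` yields no match in this sketch.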

Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for Cloud AI Platform.

In [None]:
from google.cloud import bigquery
bq = bigquery.Client(project=PROJECT)

query = """
SELECT source, LOWER(REGEXP_REPLACE(title, "[^a-zA-Z0-9 $.-]", " ")) AS title FROM
    (SELECT
        ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, ".*://(.[^/]+)/"), "."))[OFFSET(1)] AS source,
        title
    FROM
        `bigquery-public-data.hacker_news.stories`
    WHERE
        REGEXP_CONTAINS(REGEXP_EXTRACT(url, ".*://(.[^/]+)/"), ".com$")
        AND LENGTH(title) > 10
    )
WHERE (source = "github" OR source = "nytimes" OR source = "techcrunch")
"""

df = bq.query(query + "LIMIT 5").to_dataframe()
df.head()
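The title cleanup in the outer `SELECT` replaces every character outside a small allowed set with a space and then lowercases the result. A minimal Python sketch of that transformation, assuming the same character class behaves equivalently in Python's `re` module (the helper name `clean_title` is hypothetical):

```python
import re


def clean_title(title):
    """Hypothetical helper mirroring LOWER(REGEXP_REPLACE(title, "[^a-zA-Z0-9 $.-]", " "))."""
    # Keep letters, digits, spaces, "$", "." and "-"; everything else becomes a space.
    return re.sub(r"[^a-zA-Z0-9 $.-]", " ", title).lower()


print(clean_title("A&B"))
print(clean_title("$5.99 - Deal"))
```

Because each disallowed character is replaced rather than removed, the cleaned title has the same length as the original and may contain runs of spaces.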

For ML training, we will need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset).

A simple, repeatable way to do this is to use the hash of a well-distributed column in our data (see [O'Reilly repeatable sampling](https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning)).

In [None]:
traindf = bq.query(query + " AND ABS(MOD(FARM_FINGERPRINT(title), 4)) > 0").to_dataframe()
evaldf = bq.query(query + " AND ABS(MOD(FARM_FINGERPRINT(title), 4)) = 0").to_dataframe()
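The idea behind the hash-based split can be sketched in plain Python. This is only an illustration: it uses MD5 from `hashlib` as a stand-in for `FARM_FINGERPRINT`, so it will not assign titles to the same buckets as BigQuery, and the names `bucket` and `split` are hypothetical.

```python
import hashlib


def bucket(title, num_buckets=4):
    """Map a title to a stable bucket index in [0, num_buckets)."""
    # MD5 is a stand-in for FARM_FINGERPRINT (an assumption of this sketch).
    digest = hashlib.md5(title.encode("utf-8")).hexdigest()
    # Python's % is non-negative for a positive modulus, so no ABS is needed here.
    return int(digest, 16) % num_buckets


def split(titles):
    """Send bucket 0 to evaluation and buckets 1 to 3 to training."""
    train = [t for t in titles if bucket(t) > 0]
    evaluation = [t for t in titles if bucket(t) == 0]
    return train, evaluation


titles = ["title {}".format(i) for i in range(1000)]
train, evaluation = split(titles)
print(len(train), len(evaluation))
```

Since the bucket depends only on the title text, rerunning the split gives the same assignment every time, and identical titles always land on the same side.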

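Below we can see that roughly 75% of the data is used for training and 25% for evaluation, because three of the four hash buckets (values 1, 2, and 3) go to training and one (value 0) goes to evaluation.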

We can also see that within each dataset, the classes are roughly balanced.

In [None]:
traindf["source"].value_counts()

In [None]:
evaldf["source"].value_counts()
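Raw counts can be hard to compare at a glance, so it may help to look at fractions instead. The sketch below uses only the standard library; the helper names `class_fractions` and `split_fraction` and the toy labels are made up for illustration and are not the notebook's data.

```python
from collections import Counter


def class_fractions(labels):
    """Return the share of each class label as a fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def split_fraction(num_train, num_eval):
    """Return the fraction of all rows that ended up in the training set."""
    return num_train / (num_train + num_eval)


# Toy labels for illustration only.
toy_labels = ["github", "github", "nytimes", "techcrunch"]
print(class_fractions(toy_labels))
print(split_fraction(3, 1))
```

With the DataFrames above, one would pass `traindf["source"]` to `class_fractions` and `len(traindf)`, `len(evaldf)` to `split_fraction`.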

Finally, we'll save our data, which is currently in-memory, to disk.

In [None]:
import os, shutil
DATADIR = "data/txtcls"
shutil.rmtree(DATADIR, ignore_errors=True)
os.makedirs(DATADIR)
traindf.to_csv(os.path.join(DATADIR, "train.tsv"), header=False, index=False, encoding="utf-8", sep="\t")
evaldf.to_csv(os.path.join(DATADIR, "eval.tsv"), header=False, index=False, encoding="utf-8", sep="\t")

In [None]:
!head -3 data/txtcls/train.tsv

In [None]:
!wc -l data/txtcls/*.tsv
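Because the files are written without a header row, any code that reads them back has to know the column order (source first, then title). A minimal sketch of reading such a file with the standard `csv` module follows; the function name `read_tsv` is hypothetical, and this is not necessarily how the model code loads the data.

```python
import csv


def read_tsv(path):
    """Read a headerless, tab-separated file into (source, title) tuples."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        # Skip blank or malformed rows that lack both columns.
        return [(row[0], row[1]) for row in reader if len(row) >= 2]
```

For example, `read_tsv("data/txtcls/train.tsv")` would return a list of `(source, title)` pairs, assuming the file was written by the cell above.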

## TensorFlow/Keras code