# Create Data Splits

In [1]:
%%time
!pip3 freeze | grep -E 'boto3|s3fs|scikit-learn|distributed|dask==|dask-m|black==|jupyter-server|pandas'
!conda list -n spark | grep -E 'ipykernel'

black==22.6.0
boto3==1.24.56
dask==2022.8.0
dask-ml==2022.5.27
distributed==2022.8.0
nb-black==1.0.7
pandas==1.4.3
s3fs==0.4.2
scikit-learn==1.1.2
ipykernel                 6.15.1             pyh210e3f2_0    conda-forge
CPU times: user 24.9 ms, sys: 22.8 ms, total: 47.6 ms
Wall time: 2.22 s


In [2]:
%load_ext lab_black

In [3]:
import os
from glob import glob
from datetime import datetime
import zipfile

import boto3
import dask.dataframe as dd
import pandas as pd
from dask_ml.model_selection import train_test_split
from sklearn.model_selection import train_test_split as sk_train_test_split

## About

### Objective
This notebook splits the processed data into training, validation and test splits that can be used to train a machine learning model for Twitter sentiment classification.

### ML Model Development
A random sample of the training split will be further divided into three smaller splits to support fine-tuning an NLP (transformer) model that predicts sentiment. This NLP model will then be used to label the processed tweets with a sentiment (i.e. to extract the sentiment), which produces the labeled data used during ML model development. The ML model will be trained using this labeled data and then deployed.

### ML Model Usage in Production
In production, the deployed ML model will be used to predict sentiment of incoming tweets on-demand. These predictions will be served to customers.

### ML Model Drift Monitoring
After the volume of inference predictions matches the size of the validation or test split used during ML model development, the following will be performed:
- all tweets predicted during inference are labeled
  - manually
  - using the **previously trained NLP model**
- deployed ML model predictions, made during inference, will be scored against these labels (previous bullet point) in order to determine if ML model performance has
  - drifted (the ML training pipeline will be triggered)
    - scores are not within some threshold of the scores on the test split during ML model development
    - predictions made by the **previously trained NLP model** will be served to the customer
      - the other option here is to serve the same (poorly scoring) predictions made by the ML model to the customers
    - updated training, validation and testing splits will be created using *all available data*
    - a new ML model will be trained using *all available data* and will then be deployed to production
  - not drifted
    - scores are within some threshold of the scores on the test split during ML model development
    - the currently used ML model will continue to serve inference

*All available data* here will include
- the original data used for ML model development
- the new data used to make inference with the originally trained ML model
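
The drift decision described above can be sketched as a simple threshold comparison. This is only a minimal illustration: the metric names, the `tolerance` value and the function names `has_drifted` and `next_action` are assumptions made for this example and are not defined elsewhere in this notebook.

```python
def has_drifted(inference_scores, test_scores, tolerance=0.05):
    """Return True if any metric moved more than `tolerance` from its test-split value."""
    return any(
        abs(inference_scores[metric] - test_score) > tolerance
        for metric, test_score in test_scores.items()
    )


def next_action(inference_scores, test_scores, tolerance=0.05):
    """Map the drift decision to the follow-up step (retrain or keep serving)."""
    if has_drifted(inference_scores, test_scores, tolerance):
        return "retrain"
    return "keep_serving"


# Hypothetical scores, for illustration only
test_scores = {"f1": 0.80, "accuracy": 0.82}
print(next_action({"f1": 0.79, "accuracy": 0.81}, test_scores))  # keep_serving
print(next_action({"f1": 0.65, "accuracy": 0.70}, test_scores))  # retrain
```

In practice, the threshold and the choice of metrics would depend on how the ML model is scored on the test split during development.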

## User Inputs

In [4]:
path_to_folder = "/datasets/twitter/kinesis-demo/"

# processed data
processed_data_dir = "data/processed"
processed_file_name = "processed_text"

# train-test split
test_split_frac = 0.125

# sampling data
num_sampled_tweets = 10_000
sampled_fname = "sampled_data.csv.zip"

# inference
inference_start_date = "2022-01-10 00:00:00"

In [5]:
s3_bucket_name = os.getenv("AWS_S3_BUCKET_NAME")
session = boto3.Session(profile_name="default")
s3_client = session.client("s3")

dtypes_dict = {
    "id": pd.StringDtype(),
    "geo": pd.StringDtype(),
    "coordinates": pd.StringDtype(),
    "place": pd.StringDtype(),
    "contributors": pd.StringDtype(),  # pd.BooleanDtype(),
    "is_quote_status": pd.StringDtype(),  # pd.BooleanDtype(),
    "quote_count": pd.Int32Dtype(),
    "reply_count": pd.Int32Dtype(),
    "retweet_count": pd.Int32Dtype(),
    "favorite_count": pd.Int32Dtype(),
    "favorited": pd.StringDtype(),  # pd.BooleanDtype(),
    "retweeted": pd.StringDtype(),  # pd.BooleanDtype(),
    "source": pd.StringDtype(),
    "in_reply_to_user_id": pd.StringDtype(),
    "in_reply_to_screen_name": pd.StringDtype(),
    "source_text": pd.StringDtype(),
    "place_id": pd.StringDtype(),
    "place_url": pd.StringDtype(),
    "place_place_type": pd.StringDtype(),
    "place_name": pd.StringDtype(),
    "place_full_name": pd.StringDtype(),
    "place_country_code": pd.StringDtype(),
    "place_country": pd.StringDtype(),
    "place_bounding_box_type": pd.StringDtype(),
    "place_bounding_box_coordinates": pd.StringDtype(),
    "place_attributes": pd.StringDtype(),
    "coords_type": pd.StringDtype(),
    "coords_lon": pd.StringDtype(),
    "coords_lat": pd.StringDtype(),
    "geo_type": pd.StringDtype(),
    "geo_lon": pd.StringDtype(),
    "geo_lat": pd.StringDtype(),
    "user_name": pd.StringDtype(),
    "user_screen_name": pd.StringDtype(),
    "user_followers": pd.Int32Dtype(),
    "user_friends": pd.Int32Dtype(),
    "user_listed": pd.Int32Dtype(),
    "user_favourites": pd.Int32Dtype(),
    "user_statuses": pd.Int32Dtype(),
    "user_protected": pd.StringDtype(),  # pd.BooleanDtype(),
    "user_verified": pd.StringDtype(),  # pd.BooleanDtype(),
    "user_contributors_enabled": pd.StringDtype(),
    "user_location": pd.StringDtype(),
    "retweeted_tweet": pd.StringDtype(),
    "tweet_text_urls": pd.StringDtype(),
    "tweet_text_hashtags": pd.StringDtype(),
    "tweet_text_usernames": pd.StringDtype(),
    "num_urls_in_tweet_text": pd.Int32Dtype(),
    "num_users_in_tweet_text": pd.Int32Dtype(),
    "num_hashtags_in_tweet_text": pd.Int32Dtype(),
    "text": pd.StringDtype(),
    "contains_wanted_text": pd.BooleanDtype(),
    "contains_wanted_text_case_sensitive": pd.BooleanDtype(),
    "contains_multi_word_wanted_text": pd.BooleanDtype(),
    "contains_crypto_terms": pd.BooleanDtype(),
    "contains_religious_terms": pd.BooleanDtype(),
    "contains_inappropriate_terms": pd.BooleanDtype(),
    "contains_video_games_terms": pd.BooleanDtype(),
    "contains_misc_unwanted_terms": pd.BooleanDtype(),
    "contains_non_english_terms": pd.BooleanDtype(),
    "text_trimmed": pd.StringDtype(),
    "text_stripped": pd.StringDtype(),
    "text_processed": pd.StringDtype(),
    "words": pd.StringDtype(),
    "num_words": pd.Int32Dtype(),
}

proc_text_zip_fname = f"{processed_file_name}.zip"

val_split_frac = test_split_frac / (1 - test_split_frac)

In [6]:
def highlight_cols(df_cols, cols_to_use):
    """Highlight a list of columns in a DataFrame."""
    # copy the selected columns so the original data is not changed
    df = df_cols[cols_to_use].copy()
    # set the style of every cell to a yellow background
    df.loc[:, :] = "background-color: yellow"
    # return the DataFrame of styles
    return df


def download_file_from_s3(
    s3_bucket_name: str,
    path_to_folder: str,
    data_dir: str,
    fname: str,
    aws_region: str,
    prefix: str,
) -> None:
    """Download the first file matching a prefix from an S3 bucket."""
    dest_filepath = os.path.join(data_dir, fname)
    s3_filepath_key = s3_client.list_objects_v2(
        Bucket=s3_bucket_name,
        Delimiter="/",
        Prefix=prefix,
    )["Contents"][0]["Key"]
    start = datetime.now()
    print(
        f"Started downloading processed data zip file from {s3_filepath_key} to "
        f"{dest_filepath} at {start.strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}..."
    )
    s3 = boto3.resource("s3", region_name=aws_region)
    s3.meta.client.download_file(
        s3_bucket_name,
        s3_filepath_key,
        dest_filepath,
    )
    duration = (datetime.now() - start).total_seconds()
    print(f"Done downloading in {duration:.3f} seconds.")


def extract_zip_file(dest_filepath: str, data_dir: str) -> None:
    """Extract all contents of a zip file into a directory."""
    start = datetime.now()
    print(
        "Started extracting filtered data parquet files from "
        f"processed data zip file to {data_dir} at "
        f"{start.strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}..."
    )
    # the context manager closes the archive even if extraction fails
    with zipfile.ZipFile(dest_filepath) as zip_ref:
        zip_ref.extractall(data_dir)
    duration = (datetime.now() - start).total_seconds()
    print(f"Done extracting in {duration:.3f} seconds.")

## Get Data

We will start by downloading the processed and filtered `.zip` file from S3 and extracting all the contained `.parquet` files into a `.parquet.gzip` directory.

In [7]:
%%time
if not os.path.exists(os.path.join(processed_data_dir, proc_text_zip_fname)):
    download_file_from_s3(
        s3_bucket_name,
        path_to_folder,
        processed_data_dir,
        proc_text_zip_fname,
        session.region_name,
        f"{path_to_folder[1:]}processed/{os.path.splitext(proc_text_zip_fname)[0]}",
    )
    extract_zip_file(
        os.path.join(processed_data_dir, proc_text_zip_fname),
        f"{processed_data_dir}/{os.path.splitext(proc_text_zip_fname)[0]}.parquet.gzip",
    )
proc_files = glob(f"{processed_data_dir}/*.parquet.gzip")

CPU times: user 1.16 ms, sys: 0 ns, total: 1.16 ms
Wall time: 849 µs


Use Dask to load the `.parquet.gzip` directory (consisting of multiple `.parquet` files) into a single Dask DataFrame.

In [8]:
%%time
ddf = dd.read_parquet(proc_files).astype(dtypes_dict)
with pd.option_context("display.max_colwidth", None):
    display(ddf.head())
display(ddf.dtypes.rename("dtype").to_frame())

Unnamed: 0,id,geo,coordinates,place,contributors,is_quote_status,quote_count,reply_count,retweet_count,favorite_count,...,contains_religious_terms,contains_inappropriate_terms,contains_video_games_terms,contains_misc_unwanted_terms,contains_non_english_terms,text_stripped,text_processed,text_trimmed,words,num_words
0,1479845397946380290,,,,,False,0,0,0,0,...,False,False,False,False,False,LIVE from mission control: experts give real-time updates as the telescope's golden honeycomb-like mirror takes its final shape in space. This marks the end of an unprecedented 14-day deployment process! Use for questions.,live from mission control experts give real time updates as the telescope s golden honeycomb like mirror takes its final shape in space this marks the end of an unprecedented day deployment process use for questions,LIVE from mission control: experts give real-time updates as the telescope's golden honeycomb-like mirror takes its final shape in space. This marks the end of an unprecedented 14-day deployment process! Use for questions.,"[LIVE, from, mission, control:, experts, give, real-time, updates, as, the, telescope's, golden, honeycomb-like, mirror, takes, its, final, shape, in, space., This, marks, the, end, of, an, unprecedented, 14-day, deployment, process!, Use, for, questions.]",33
1,1479845401289179139,,,,,False,0,0,0,0,...,False,False,False,False,False,This was taken when we covered the NASA Night Launch! It was a WoW Experience! CapeCanaveral Florida travel luxurytravel adventuretravel,this was taken when we covered the nasa night launch it was a wow experience capecanaveral florida travel luxurytravel adventuretravel,This was taken when we covered the NASA Night Launch! It was a WoW Experience! CapeCanaveral Florida travel luxurytravel adventuretravel,"[This, was, taken, when, we, covered, the, NASA, Night, Launch!, It, was, a, WoW, Experience!, CapeCanaveral, Florida, travel, luxurytravel, adventuretravel]",20
2,1479845404762152969,,,,,False,0,0,0,0,...,False,False,False,False,False,"Nearly halfway through its flight to L2 (vs time required), will fully deploy its primary mirror hopefully marking the end of the most complex space telescope deployment in history. Watch live coverage from mission control at 14:00 UTC:",nearly halfway through its flight to l vs time required will fully deploy its primary mirror hopefully marking the end of the most complex space telescope deployment in history watch live coverage from mission control at utc,"Nearly halfway through its flight to L2 (vs time required), will fully deploy its primary mirror hopefully marking the end of the most complex space telescope deployment in history. Watch live coverage from mission control at 14:00 UTC:","[Nearly, halfway, through, its, flight, to, L2, (vs, time, required),, will, fully, deploy, its, primary, mirror, hopefully, marking, the, end, of, the, most, complex, space, telescope, deployment, in, history., Watch, live, coverage, from, mission, control, at, 14:00, UTC:]",38
3,1479845407085629445,,,,,False,0,0,0,0,...,False,False,False,False,False,Time and life of Stephen Hawking | News18 Tamil Nadu via,time and life of stephen hawking news tamil nadu via,Time and life of Stephen Hawking | News18 Tamil Nadu via,"[Time, and, life, of, Stephen, Hawking, |, News18, Tamil, Nadu, via]",11
4,1479845409014964230,,,,,False,0,0,0,0,...,False,False,False,False,False,"""The scientists at National Aeronautics and Space Administration (NASA) have acknowledged that Sanskrit is the most scientific language",the scientists at national aeronautics and space administration nasa have acknowledged that sanskrit is the most scientific language,"""The scientists at National Aeronautics and Space Administration (NASA) have acknowledged that Sanskrit is the most scientific language","[""The, scientists, at, National, Aeronautics, and, Space, Administration, (NASA), have, acknowledged, that Sanskrit, is, the, most, scientific, language]",17


Unnamed: 0,dtype
id,string
geo,string
coordinates,string
place,string
contributors,string
...,...
text_stripped,string
text_processed,string
text_trimmed,string
words,string


CPU times: user 710 ms, sys: 142 ms, total: 852 ms
Wall time: 1.21 s


## Split Data and Create Sample for Training NLP Labeling (Transformer) Model

### Create Data Splits for ML Model Development

Create randomized training, validation and test splits (both calls to `train_test_split()` use `shuffle=True`) from data pre-dating the first inference `datetime`

In [9]:
%%time
df_all = ddf[ddf["created_at"] < inference_start_date]
df_train_val, df_test = train_test_split(
    df_all, test_size=test_split_frac, random_state=88, shuffle=True
)
df_train, df_val = train_test_split(
    df_train_val, test_size=val_split_frac, random_state=88, shuffle=True
)

CPU times: user 129 ms, sys: 246 µs, total: 129 ms
Wall time: 160 ms


Get split lengths

In [10]:
%%time
df_rand_dates = (
    pd.DataFrame([len(df_train)], columns=['train_len'])
    .assign(val_len=len(df_val))
    .assign(test_len=len(df_test))
    .assign(
        total=lambda df: df["train_len"]
        + df["val_len"]
        + df["test_len"]
    )
    .assign(desired_val_frac=val_split_frac)
    .assign(desired_test_frac=test_split_frac)
    .assign(val_frac=lambda df: df["val_len"] / df["total"])
    .assign(test_frac=lambda df: df["test_len"] / df["total"])
)
df_rand_dates

CPU times: user 14.1 s, sys: 1.86 s, total: 16 s
Wall time: 10.3 s


Unnamed: 0,train_len,val_len,test_len,total,desired_val_frac,desired_test_frac,val_frac,test_frac
0,170037,28361,28249,226647,0.142857,0.125,0.125133,0.124639


**Notes**
1. The columns ending in `_frac` are the fraction of rows in the
   - validation split
   - testing split

   relative to the total number of rows across the combination of
   - training split
   - validation split
   - test split

**Observations**
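1. By design, the validation and testing splits are nearly equal in size (each is close to 12.5% of all rows). This is a direct consequence of how `test_split_frac` and `val_split_frac` are specified when creating the splits with `train_test_split()`: `val_split_frac` is rescaled to the rows that remain after the test split has been removed.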
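
The arithmetic behind this observation can be checked directly. The sketch below reuses `test_split_frac` and the `val_split_frac` formula from the user inputs, together with the split lengths reported in the table above; the remaining variable names are introduced only for this illustration.

```python
test_split_frac = 0.125
# validation fraction, expressed relative to the rows left after removing the test split
val_split_frac = test_split_frac / (1 - test_split_frac)

train_val_frac = 1 - test_split_frac  # fraction of all rows left after the test split
val_frac_of_total = val_split_frac * train_val_frac
train_frac_of_total = train_val_frac - val_frac_of_total

print(f"val_split_frac      = {val_split_frac:.6f}")
print(f"val_frac_of_total   = {val_frac_of_total:.6f}")
print(f"train_frac_of_total = {train_frac_of_total:.6f}")

# observed split lengths, copied from the table above
train_len, val_len, test_len = 170_037, 28_361, 28_249
total = train_len + val_len + test_len
print(f"observed val_frac   = {val_len / total:.6f}")
print(f"observed test_frac  = {test_len / total:.6f}")
```

The desired validation fraction of about 0.142857 applies to the 87.5% of rows that remain after the test split, which works out to 12.5% of all rows. The observed fractions differ slightly from 0.125 because the split is random rather than an exact row count.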

### Create Data Splits from Training Split, for NLP Model Development (to assign sentiment labels)

We will now draw a random sample from the training split to use in NLP (transformer) model fine-tuning in order to label the tweets with a sentiment (i.e. in order to extract the sentiment of the text in the tweets).

First, we'll define the fraction of the training split to be used in NLP model fine-tuning

In [11]:
%%time
nlp_sample_size = num_sampled_tweets / len(df_train)

CPU times: user 4.98 s, sys: 648 ms, total: 5.63 s
Wall time: 3.55 s


Next, we'll extract a sample of the training data corresponding to this fraction

In [12]:
%%time
df_train_sample = (
    df_train.sample(frac=nlp_sample_size, random_state=88)
    .compute()
    .sort_values(by=["created_at"])
)

CPU times: user 5.32 s, sys: 568 ms, total: 5.89 s
Wall time: 3.7 s


**Notes**
1. The sample is small enough that it fits in memory and so we don't need to use a big data framework to hold its contents. So, we call `.compute()` to bring this sample into memory and we can use in-memory tools (below) for creating data splits from this sample.
2. Before creating the data splits for ML model development, the data was sorted by the `datetime` at which each tweet was posted (i.e. sorted by the `created_at` column). To keep the splits for NLP model development consistent with how the data splits were created for ML model development, we sort the sampled data by the same `created_at` column after drawing the random sample and before creating the random splits in the next step.

Multiple tweets might be created at the same timestamp, so duplicated values are possible in the `created_at` column. This means the number of unique values in this column will be smaller than the total number of tweets in the data (including in the sampled data). Tweet IDs are unique for each tweet, so the number of unique values in the `id` column will match the number of tweets in the data. These counts are shown below

In [13]:
%%time
print(
    f"Number of unique IDs in sampled data = {df_train_sample['id'].nunique():,}\n"
    f"Number of unique creation datetimes in sampled data = {df_train_sample['created_at'].nunique():,}\n"
    f"Number of rows in sampled data = {len(df_train_sample):,}"
)

Number of unique IDs in sampled data = 10,001
Number of unique creation datetimes in sampled data = 9,912
Number of rows in sampled data = 10,001
CPU times: user 6.13 ms, sys: 281 µs, total: 6.41 ms
Wall time: 5.28 ms


We'll now create random training, validation and testing splits from the sampled data, which will be used during NLP model training and evaluation

In [14]:
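df_nlp_train_val, df_nlp_test = sk_train_test_split(
    df_train_sample, test_size=test_split_frac, random_state=88
)
# NOTE: this split reuses test_split_frac (not val_split_frac), so the
# validation split is 12.5% of the train+val rows, not of all sampled rows
df_nlp_train, df_nlp_val = sk_train_test_split(
    df_nlp_train_val, test_size=test_split_frac, random_state=88
)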

## Export All Data Splits to S3 Bucket

Get the start date for making inference with the trained ML model

In [15]:
inference_start_date_str = (
    inference_start_date.replace("-", "").replace(":", "").replace(" ", "_")
)
print(inference_start_date_str)

20220110_000000


**Notes**
1. The inference start date is used in file naming as a crude way to version data used in each round of ML model training.
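
As a rough illustration of this naming convention, the sketch below builds a split file name from the inference start date and recovers the date from the name again. The helper names `make_split_fname` and `parse_inference_start` are hypothetical and are not used elsewhere in this notebook.

```python
from datetime import datetime


def make_split_fname(split_name, inference_start_date, ext="parquet.gzip"):
    """Build a versioned file name such as train__inference_starts_<stamp>.<ext>."""
    start = datetime.strptime(inference_start_date, "%Y-%m-%d %H:%M:%S")
    return f"{split_name}__inference_starts_{start.strftime('%Y%m%d_%H%M%S')}.{ext}"


def parse_inference_start(fname):
    """Recover the inference start datetime embedded in a split file name."""
    stamp = fname.split("__inference_starts_")[1].split(".")[0]
    return datetime.strptime(stamp, "%Y%m%d_%H%M%S")


fname = make_split_fname("train", "2022-01-10 00:00:00")
print(fname)  # train__inference_starts_20220110_000000.parquet.gzip
print(parse_inference_start(fname))  # 2022-01-10 00:00:00
```

Parsing the date back out of the file name is one way a later training round could pick the most recent set of splits, assuming all rounds follow the same naming pattern.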

Each data split for ML model development will now be saved to a separate `.parquet.gzip` file

In [18]:
%%time
for split_name, df_split_to_export in zip(
    ["train", "val", "test"], [df_train, df_val, df_test]
):
    fname = f"{split_name}__inference_starts_{inference_start_date_str}.parquet.gzip"
    prefix = f"{path_to_folder[1:]}processed/splits/{fname}"
    split_filepath = f"s3://{s3_bucket_name}/{prefix}"
    df_split_to_export.to_parquet(
        split_filepath,
        write_index=False,
        compression='gzip',
        storage_options={
            "key": session.get_credentials().access_key,
            "secret": session.get_credentials().secret_key,
        },
    )
    print(f"Exported {len(df_split_to_export):,} rows to {prefix} on S3")

Exported 170,037 rows to datasets/twitter/kinesis-demo/processed/splits/train__inference_starts_20220110_000000.parquet.gzip on S3
Exported 28,361 rows to datasets/twitter/kinesis-demo/processed/splits/val__inference_starts_20220110_000000.parquet.gzip on S3
Exported 28,249 rows to datasets/twitter/kinesis-demo/processed/splits/test__inference_starts_20220110_000000.parquet.gzip on S3
CPU times: user 43.7 s, sys: 3.47 s, total: 47.2 s
Wall time: 26 s


Each sampled data split for NLP model development will now be saved to a separate `.csv.zip` file

In [19]:
%%time
for split_name, df_split_to_export in zip(
    ["train_nlp", "val_nlp", "test_nlp"],
    [df_nlp_train, df_nlp_val, df_nlp_test],
):
    fname = f"{split_name}__inference_starts_{inference_start_date_str}.csv.zip"
    prefix = f"{path_to_folder[1:]}processed/nlp_splits/{fname}"
    split_filepath = f"s3://{s3_bucket_name}/{prefix}"
    df_split_to_export.to_csv(
        split_filepath,
        index=False,
        storage_options={
            "key": session.get_credentials().access_key,
            "secret": session.get_credentials().secret_key,
        },
    )
    print(f"Exported {len(df_split_to_export):,} rows to {prefix} on S3")

Exported 7,656 rows to datasets/twitter/kinesis-demo/processed/nlp_splits/train_nlp__inference_starts_20220110_000000.csv.zip on S3
Exported 1,094 rows to datasets/twitter/kinesis-demo/processed/nlp_splits/val_nlp__inference_starts_20220110_000000.csv.zip on S3
Exported 1,251 rows to datasets/twitter/kinesis-demo/processed/nlp_splits/test_nlp__inference_starts_20220110_000000.csv.zip on S3
CPU times: user 885 ms, sys: 4.4 ms, total: 890 ms
Wall time: 1.36 s


---

<span style="float:left;">
    <a href="./4_process_data.ipynb"><< 4 - Data Processing</a>
</span>

<span style="float:right;">
    <a href="./6_nlp_labeling.ipynb">6 - NLP-based Labeling >></a>
</span>