# Create Data Splits

In [1]:
%%time
!pip3 freeze | grep -E 'boto3|s3fs|scikit-learn|distributed|dask==|dask-m|black==|jupyter-server|pandas|openpyxl'
!conda list -n spark | grep -E 'ipykernel'

black==22.6.0
boto3==1.24.61
dask==2022.8.0
dask-ml==2022.5.27
distributed==2022.8.0
nb-black==1.0.7
openpyxl==3.0.10
pandas==1.4.3
s3fs==0.4.2
scikit-learn==1.1.2
ipykernel                 6.15.1             pyh210e3f2_0    conda-forge
CPU times: user 32.2 ms, sys: 19.1 ms, total: 51.3 ms
Wall time: 2.41 s


In [2]:
%load_ext lab_black

In [3]:
import os
from glob import glob
from datetime import datetime
import shutil
import zipfile

import boto3
import dask.dataframe as dd
import pandas as pd
from dask_ml.model_selection import train_test_split
from sklearn.model_selection import train_test_split as sk_train_test_split

## About

### Objective
This notebook splits the processed data into training, validation and test splits that can be used to train a machine learning model for Twitter sentiment classification. The splits are created with `dask-ml`, which provides a convenient method (`train_test_split()`, [link](https://ml.dask.org/modules/generated/dask_ml.model_selection.train_test_split.html#dask-ml-model-selection-train-test-split)) for this purpose.

### ML Model Development
A random sample of the training split will be further divided into three smaller splits in order to support training an NLP (transformers) model to predict sentiment. This NLP model will be used to label the processed tweets with sentiment (i.e. to extract the sentiment), producing the labeled data used during ML model development. The ML model will be trained using this labeled data and then deployed.

### ML Model Usage in Production
In production, the deployed ML model will be used to predict sentiment of incoming tweets on-demand. These predictions will be served to customers.

### ML Model Drift Monitoring
Once the volume of inference predictions is comparable to the size of the validation or test split used during ML model development, the following will be performed:
- all tweets predicted during inference are labeled
  - manually
  - using the **previously trained NLP model**
- deployed ML model predictions, made during inference, will be scored against these labels (previous bullet point) in order to determine if ML model performance has
  - drifted (the ML training pipeline will be triggered)
    - scores are not within some threshold of the scores on the test split during ML model development
    - predictions made by the **previously trained NLP model** will be served to the customer
      - the other option here is to serve the same (poorly scoring) predictions made by the ML model to the customers
    - updated training, validation and testing splits will be created using *all available data*
    - a new ML model will be trained using *all available data* and will then be deployed to production
  - not drifted
    - scores are within some threshold of the scores on the test split during ML model development
    - the currently used ML model will continue to serve inference

*All available data* here will include
- the original data used for ML model development
- the new data used to make inference with the originally trained ML model
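
The drift decision described above can be sketched as a small function. This is only an illustration: the names `has_drifted` and `choose_action`, the `tolerance` value and the example scores are all hypothetical, since this notebook does not specify the scoring metric or the threshold.

```python
def has_drifted(test_score: float, production_score: float, tolerance: float) -> bool:
    """Return True if the production score falls outside the tolerance band
    around the score obtained on the test split during development."""
    return abs(production_score - test_score) > tolerance


def choose_action(test_score: float, production_score: float, tolerance: float = 0.05) -> str:
    """Map the drift check onto the two branches described above."""
    if has_drifted(test_score, production_score, tolerance):
        # drifted: trigger the ML training pipeline on all available data
        return "retrain"
    # not drifted: the current ML model continues to serve inference
    return "keep"


# Hypothetical scores, for illustration only
action_when_far = choose_action(test_score=0.80, production_score=0.70)
action_when_close = choose_action(test_score=0.80, production_score=0.78)
```

A symmetric band is used here because the text only says scores must be "within some threshold"; a one-sided check (flagging only lower scores) would be an equally plausible reading.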

## User Inputs

In [4]:
path_to_folder = "/datasets/twitter/kinesis-demo/"

# processed data
processed_data_dir = "data/processed"
processed_file_name = "processed_text"

# train-test split
test_split_frac = 0.125

# sampling data
nlp_sample_size = 0.3333
sampled_fname = "sampled_data.csv.zip"
nlp_cols = ["id", "created_at", "text"]

# inference
inference_start_date = "2022-01-10 00:00:00"

upload_to_s3 = True
create_nlp_splits = True
cleanup_local_files = True

In [5]:
s3_bucket_name = os.getenv("AWS_S3_BUCKET_NAME", "sagemakertestwillz3s")
session = boto3.Session(profile_name="default")
s3_client = session.client("s3")

dtypes_dict = {
    "id": pd.StringDtype(),
    "geo": pd.StringDtype(),
    "coordinates": pd.StringDtype(),
    "place": pd.StringDtype(),
    "contributors": pd.StringDtype(),  # pd.BooleanDtype(),
    "is_quote_status": pd.StringDtype(),  # pd.BooleanDtype(),
    "quote_count": pd.Int32Dtype(),
    "reply_count": pd.Int32Dtype(),
    "retweet_count": pd.Int32Dtype(),
    "favorite_count": pd.Int32Dtype(),
    "favorited": pd.StringDtype(),  # pd.BooleanDtype(),
    "retweeted": pd.StringDtype(),  # pd.BooleanDtype(),
    "source": pd.StringDtype(),
    "in_reply_to_user_id": pd.StringDtype(),
    "in_reply_to_screen_name": pd.StringDtype(),
    "source_text": pd.StringDtype(),
    "place_id": pd.StringDtype(),
    "place_url": pd.StringDtype(),
    "place_place_type": pd.StringDtype(),
    "place_name": pd.StringDtype(),
    "place_full_name": pd.StringDtype(),
    "place_country_code": pd.StringDtype(),
    "place_country": pd.StringDtype(),
    "place_bounding_box_type": pd.StringDtype(),
    "place_bounding_box_coordinates": pd.StringDtype(),
    "place_attributes": pd.StringDtype(),
    "coords_type": pd.StringDtype(),
    "coords_lon": pd.StringDtype(),
    "coords_lat": pd.StringDtype(),
    "geo_type": pd.StringDtype(),
    "geo_lon": pd.StringDtype(),
    "geo_lat": pd.StringDtype(),
    "user_name": pd.StringDtype(),
    "user_screen_name": pd.StringDtype(),
    "user_followers": pd.Int32Dtype(),
    "user_friends": pd.Int32Dtype(),
    "user_listed": pd.Int32Dtype(),
    "user_favourites": pd.Int32Dtype(),
    "user_statuses": pd.Int32Dtype(),
    "user_protected": pd.StringDtype(),  # pd.BooleanDtype(),
    "user_verified": pd.StringDtype(),  # pd.BooleanDtype(),
    "user_contributors_enabled": pd.StringDtype(),
    "user_location": pd.StringDtype(),
    "retweeted_tweet": pd.StringDtype(),
    "tweet_text_urls": pd.StringDtype(),
    "tweet_text_hashtags": pd.StringDtype(),
    "tweet_text_usernames": pd.StringDtype(),
    "num_urls_in_tweet_text": pd.Int32Dtype(),
    "num_users_in_tweet_text": pd.Int32Dtype(),
    "num_hashtags_in_tweet_text": pd.Int32Dtype(),
    "text": pd.StringDtype(),
    "contains_wanted_text": pd.BooleanDtype(),
    "contains_wanted_text_case_sensitive": pd.BooleanDtype(),
    "contains_multi_word_wanted_text": pd.BooleanDtype(),
    "contains_crypto_terms": pd.BooleanDtype(),
    "contains_religious_terms": pd.BooleanDtype(),
    "contains_inappropriate_terms": pd.BooleanDtype(),
    "contains_video_games_terms": pd.BooleanDtype(),
    "contains_misc_unwanted_terms": pd.BooleanDtype(),
    "contains_non_english_terms": pd.BooleanDtype(),
    "text_trimmed": pd.StringDtype(),
    "text_stripped": pd.StringDtype(),
    "text_processed": pd.StringDtype(),
    "words": pd.StringDtype(),
    "num_words": pd.Int32Dtype(),
}

proc_text_zip_fname = f"{processed_file_name}.zip"

val_split_frac = test_split_frac / (1 - test_split_frac)

In [6]:
def highlight_cols(df_cols, cols_to_use):
    """Highlight a list of columns in a DataFrame."""
    # copy df to new - original data is not changed
    df = df_cols[cols_to_use].copy()
    # select all values to yellow color
    df.loc[:, :] = "background-color: yellow"
    # return color df
    return df


def download_file_from_s3(
    s3_bucket_name: str,
    path_to_folder: str,
    data_dir: str,
    fname: str,
    aws_region: str,
    prefix: str,
) -> None:
    """Download file from S3."""
    dest_filepath = os.path.join(data_dir, fname)
    s3_filepath_key = s3_client.list_objects_v2(
        Bucket=s3_bucket_name,
        Delimiter="/",
        Prefix=prefix,
    )["Contents"][0]["Key"]
    start = datetime.now()
    print(
        f"Started downloading processed data zip file from {s3_filepath_key} to "
        f"{dest_filepath} at {start.strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}..."
    )
    s3 = boto3.resource("s3", region_name=aws_region)
    s3.meta.client.download_file(
        s3_bucket_name,
        s3_filepath_key,
        dest_filepath,
    )
    duration = (datetime.now() - start).total_seconds()
    print(f"Done downloading in {duration:.3f} seconds.")


def extract_zip_file(dest_filepath: str, data_dir: str) -> None:
    """Extract all files from a zip archive into a directory."""
    start = datetime.now()
    print(
        "Started extracting filtered data parquet files from "
        f"processed data zip file to {data_dir} at "
        f"{start.strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]}..."
    )
    zip_ref = zipfile.ZipFile(dest_filepath)
    zip_ref.extractall(data_dir)
    zip_ref.close()
    duration = (datetime.now() - start).total_seconds()
    print(f"Done extracting in {duration:.3f} seconds.")
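
As an aside, the extraction step can also be written with a context manager, so that the archive is closed even if extraction raises an error. The sketch below is a stand-alone illustration that builds a throwaway archive in a temporary directory; the helper name `extract_zip` and the file names are hypothetical and are not part of this notebook.

```python
import os
import tempfile
import zipfile


def extract_zip(zip_path: str, dest_dir: str) -> list:
    """Extract every member of zip_path into dest_dir and return the member names."""
    # the with-block closes the archive automatically
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()


with tempfile.TemporaryDirectory() as tmp:
    # build a small demo archive (stand-in for the processed data zip file)
    zip_path = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("part-0000.txt", "hello")

    out_dir = os.path.join(tmp, "out")
    extracted = extract_zip(zip_path, out_dir)
    extracted_exists = os.path.exists(os.path.join(out_dir, "part-0000.txt"))
```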

## Get Data

We will start by downloading the processed and filtered `.zip` file from S3 and extracting all the contained `.parquet` files into a `.parquet.gzip` folder.

In [7]:
%%time
if not os.path.exists(os.path.join(processed_data_dir, proc_text_zip_fname)):
    download_file_from_s3(
        s3_bucket_name,
        path_to_folder,
        processed_data_dir,
        proc_text_zip_fname,
        session.region_name,
        f"{path_to_folder[1:]}processed/{os.path.splitext(proc_text_zip_fname)[0]}",
    )
    extract_zip_file(
        os.path.join(processed_data_dir, proc_text_zip_fname),
        f"{processed_data_dir}/{os.path.splitext(proc_text_zip_fname)[0]}.parquet.gzip",
    )
proc_files = glob(f"{processed_data_dir}/*.parquet.gzip")

CPU times: user 1.03 ms, sys: 254 µs, total: 1.29 ms
Wall time: 797 µs


Find the number of individual `.parquet` files inside the `.parquet.gzip` folder

In [8]:
proc_files_all = glob(f"{processed_data_dir}/*.parquet.gzip/*.gz.parquet")
print(len(proc_files_all))

8


Use Dask to load the `.parquet.gzip` folder (consisting of multiple `.parquet` files) into a single Dask DataFrame

In [9]:
# # TRYING dask.delayed with pd.read_parquet (.sort_values() errored out)
# # %%time
# from collections import OrderedDict
# from dask import delayed
# delayed_dfs = [
#     delayed(pd.read_parquet)(f).astype(dtypes_dict).sort_values(by=["created_at"])
#     for f in proc_files_all
# ]
# ddf = (
#     dd.from_delayed(delayed_dfs)
#     .set_index('created_at')
#     # .sort_values(by=["created_at"])
#     .reset_index(drop=True)
#     .repartition(npartitions=len(proc_files_all))
# )
# print(ddf.npartitions)
# with pd.option_context("display.max_columns", None):
#     display(ddf.head())
# with pd.option_context("display.max_colwidth", None, "display.max_rows", None):
#     display(ddf.dtypes.rename("dtype").to_frame())

In [10]:
# # TRYING dask.concat with pd.read_parquet (.sort_values() errored out)
# # %%time
# ddf = dd.multi.concat(
#     [
#         dd.from_pandas(
#             pd.read_parquet(f).astype(dtypes_dict).sort_values(by=["created_at"]),
#             npartitions=1
#         )
#         for f in proc_files_all
#     ], axis=1, interleave_partitions=False
# ).sort_values(by=["created_at"]).repartition(npartitions=len(proc_files_all))
# print(ddf.npartitions)
# with pd.option_context("display.max_columns", None):
#     display(ddf.head())
# with pd.option_context("display.max_colwidth", None, "display.max_rows", None):
#     display(ddf.dtypes.rename("dtype").to_frame())

In [11]:
%%time
ddf = (
    dd.read_parquet(proc_files)
    .reset_index(drop=True)
    .astype(dtypes_dict)
    .set_index("created_at")  # sorts DataFrame based on this column
    .reset_index()
    .repartition(npartitions=len(proc_files_all))
)
print(
    f"Loaded processed data from *.parquet.gzip files into Dask DataFrame "
    f"with {ddf.npartitions:,} partitions"
)
with pd.option_context("display.max_columns", None):
    display(ddf.head())
with pd.option_context("display.max_colwidth", None, "display.max_rows", None):
    display(ddf.dtypes.rename("dtype").to_frame())

Loaded processed data from *.parquet.gzip files into Dask DataFrame with 8 partitions


Unnamed: 0,created_at,id,geo,coordinates,place,contributors,is_quote_status,quote_count,reply_count,retweet_count,favorite_count,favorited,retweeted,source,in_reply_to_user_id,in_reply_to_screen_name,source_text,place_id,place_url,place_place_type,place_name,place_full_name,place_country_code,place_country,place_bounding_box_type,place_bounding_box_coordinates,place_attributes,coords_type,coords_lon,coords_lat,geo_type,geo_lon,geo_lat,user_name,user_screen_name,user_followers,user_friends,user_listed,user_favourites,user_statuses,user_protected,user_verified,user_contributors_enabled,user_joined,user_location,retweeted_tweet,tweet_text_urls,tweet_text_hashtags,tweet_text_usernames,num_urls_in_tweet_text,num_users_in_tweet_text,num_hashtags_in_tweet_text,text,contains_wanted_text,contains_wanted_text_case_sensitive,contains_multi_word_wanted_text,contains_crypto_terms,contains_religious_terms,contains_inappropriate_terms,contains_video_games_terms,contains_misc_unwanted_terms,contains_non_english_terms,text_stripped,text_processed,text_trimmed,words,num_words
0,2021-12-30 17:35:58,1476608009669201922,,,,,True,0,0,0,0,False,False,"<a href=""https://mobile.twitter.com"" rel=""nofo...",,,Twitter Web App,,,,,,,,,[[]],{},,,,,,,Radio Justice 📻🎙⚖,justiceputnam,2599,1395,84,93424,193934,False,False,False,2009-07-14 05:10:36,"Rogue River, Oregon #homebase",no,,,,0,0,0,"NASA: It wasn't a strike, it was just a work s...",False,True,False,False,False,False,False,False,False,"NASA: It wasn't a strike, it was just a work s...",nasa it wasn t a strike it was just a work s...,"NASA: It wasn't a strike, it was just a work s...","[NASA:, It, wasn't, a, strike,, it, was, just,...",11
1,2021-12-30 17:36:01,1476608024521453573,,,,,False,0,0,0,0,False,False,"<a href=""http://twitter.com/download/iphone"" r...",,,Twitter for iPhone,,,,,,,,,[[]],{},,,,,,,Fabricio F. Costa,ffalconi,1623,5002,303,23,30126,False,False,False,2009-03-06 05:49:14,San Francisco Area,no,https://t.co/9NeCWrTrIp,,,1,0,0,NASA just dropped an exciting update about the...,False,True,True,False,False,False,False,False,False,NASA just dropped an exciting update about the...,nasa just dropped an exciting update about the...,NASA just dropped an exciting update about the...,"[NASA, just, dropped, an, exciting, update, ab...",11
2,2021-12-30 17:36:03,1476608030330724357,,,,,False,0,0,0,0,False,False,"<a href=""http://twitter.com/download/android"" ...",,,Twitter for Android,,,,,,,,,[[]],{},,,,,,,Dr. James O'Donoghue,physicsJ,148322,1081,1251,22069,21049,False,True,False,2010-12-22 22:01:32,Between Mount Fuji and Tokyo,no,https://t.co/pncsLEDASF#WebbFliesAriane|https:...,Webb|WebbFliesAriane|VA256|JWST,RealtraSpace|Arianespace|ariane5|ESA_Webb|NASA...,2,6,4,Great footage from camera of NASA/ESA/CSA J...,False,True,True,False,False,False,False,False,False,Great footage from camera of NASA/ESA/CSA Jame...,great footage from camera of nasa esa csa jame...,Great footage from camera of NASA/ESA/CSA Jam...,"[Great, footage, from, camera, of, NASA/ESA/CS...",24
3,2021-12-30 17:36:04,1476608036873682952,,,,,False,0,0,0,0,False,False,"<a href=""http://twitter.com/download/iphone"" r...",,,Twitter for iPhone,,,,,,,,,[[]],{},,,,,,,Beyond Blue Aerospace,beyondblueaero,680,1157,65,5313,21178,False,False,False,2013-11-03 02:35:32,Canada,no,https://t.co/OigaIuriN4|https://t.co/0mebmz62i0,SpaceX,,2,0,1,Photo of James Webb before he transformed into...,False,False,True,False,False,False,False,False,False,Photo of James Webb before he transformed into...,photo of james webb before he transformed into...,Photo of James Webb before he transformed into...,"[Photo, of, James, Webb, before, he, transform...",13
4,2021-12-30 17:36:05,1476608040698884106,,,,,False,0,0,0,0,False,False,"<a href=""http://twitter.com/download/iphone"" r...",,,Twitter for iPhone,,,,,,,,,[[]],{},,,,,,,Diego A Pava O,dapavao2706,364,714,1,413925,37967,False,False,False,2014-10-11 22:53:00,,no,https://t.co/LAUwDoGfP3,,,1,0,0,Your internet connection over fiber optic cabl...,False,True,False,False,False,False,False,False,False,Your internet connection over fiber optic cabl...,your internet connection over fiber optic cabl...,Your internet connection over fiber optic cabl...,"[Your, internet, connection, over, fiber, opti...",33


Unnamed: 0,dtype
created_at,datetime64[ns]
id,string
geo,string
coordinates,string
place,string
contributors,string
is_quote_status,string
quote_count,Int32
reply_count,Int32
retweet_count,Int32


CPU times: user 7.61 s, sys: 1.21 s, total: 8.82 s
Wall time: 5.96 s


## Exploratory Data Analysis

Show duplicated tweets (identical text)

In [12]:
%%time
df = ddf[
    [
        'id',
        'created_at',
        # 'contributors',
        'is_quote_status',
        # 'source_text',
        'retweeted',
        'favorited',
        'retweet_count',
        'quote_count',
        'favorite_count',
        # 'in_reply_to_user_id',
        'in_reply_to_screen_name',
        'user_name',
        'user_screen_name',
        'user_joined',
        'user_verified',
        'text',
    ]
].compute()
with pd.option_context('display.max_colwidth', 1000):
    display(
        df[df.duplicated(subset=['text'], keep=False)]
        .sort_values(by=['text', 'created_at'])
        .head(25)
        .set_index('id')
    )

id,created_at,is_quote_status,retweeted,favorited,retweet_count,quote_count,favorite_count,in_reply_to_screen_name,user_name,user_screen_name,user_joined,user_verified,text
1478901017387446275,2022-01-06 01:27:33,False,False,False,0,0,0,DickMackintosh,Kenneth Richard,Kenneth72712993,2019-01-27 06:11:16,False,"""Why don't you discuss it with NASA?""NASA says climate models must be improved by a factor of 100 to be able to project what CO2 might do to the planetary temperature in the future."
1478901328789262341,2022-01-06 01:28:48,False,False,False,0,0,0,,Assoe,FnAssoe,2020-08-13 11:54:06,False,"""Why don't you discuss it with NASA?""NASA says climate models must be improved by a factor of 100 to be able to project what CO2 might do to the planetary temperature in the future."
1478901795686547456,2022-01-06 01:30:39,False,False,False,0,0,0,,Rob Meekel,RobMeekel,2013-12-03 21:59:37,False,"""Why don't you discuss it with NASA?""NASA says climate models must be improved by a factor of 100 to be able to project what CO2 might do to the planetary temperature in the future."
1478909404690931712,2022-01-06 02:00:53,False,False,False,0,0,0,,King Arthur & Excalibur,OscarsWild1,2019-01-14 20:35:43,False,"""Why don't you discuss it with NASA?""NASA says climate models must be improved by a factor of 100 to be able to project what CO2 might do to the planetary temperature in the future."
1478913452202766339,2022-01-06 02:16:58,False,False,False,0,0,0,DickMackintosh,Kenneth Richard,Kenneth72712993,2019-01-27 06:11:16,False,"""Yes you are very confused aren't you?""The ethnicity of the CO2 molecule was a great mystery, yes.Then NASA cleared it up for us."
1478921500010897412,2022-01-06 02:48:57,False,False,False,0,0,0,,rmack2x,rmack2x,2009-01-06 13:02:23,False,"""Yes you are very confused aren't you?""The ethnicity of the CO2 molecule was a great mystery, yes.Then NASA cleared it up for us."
1476609108627410949,2021-12-30 17:40:20,False,False,False,0,0,0,WarwickisaHunt,Qam Yasharahla,chosenachwath,2021-08-05 15:41:01,False,&amp; this is straight from the NASA website so let’s talk about it. This information has been out there for years. Go look up how old this interview is.
1476610692014743552,2021-12-30 17:46:37,False,False,False,0,0,0,,VVS4,VVS413,2021-08-22 07:19:21,False,&amp; this is straight from the NASA website so let’s talk about it. This information has been out there for years. Go look up how old this interview is.
1479545398192656385,2022-01-07 20:08:06,False,False,False,0,0,0,Resist_dwp,Cockwomblethrombosis,Cockwomblethro1,2021-04-18 22:06:51,False,"&amp;I as I said ,NASA is funded by federal money &amp;the federal gvt is controlled by the money power that wants to replace real money with carbon linked CBDCs for complete &amp;total control &amp; monetization of every person&amp; commodity on earth"
1479555351250931716,2022-01-07 20:47:39,False,False,False,0,0,0,,International Society of Anglo-Celts,OldSport87,2021-07-12 09:21:20,False,"&amp;I as I said ,NASA is funded by federal money &amp;the federal gvt is controlled by the money power that wants to replace real money with carbon linked CBDCs for complete &amp;total control &amp; monetization of every person&amp; commodity on earth"


CPU times: user 5.18 s, sys: 544 ms, total: 5.73 s
Wall time: 4.46 s


**Notes**
1. Since we are only focusing on a subset of all available metadata, the subset is small enough to fit in memory so we can call `.compute()` in order to hold the subset in an in-memory (`pandas`) DataFrame.

**Observations**
1. These do not appear as retweets, even though the text of the tweet is identical.
2. For each duplicate, the first of the duplicated occurrences is in response to a known Twitter user (see the `in_reply_to_screen_name` column). However, subsequent occurrences do not list a user in the `in_reply_to_screen_name` column.
3. For ML or NLP model training that
   - only uses the text of the tweet to engineer features
     - duplicates must be dropped
   - uses the text of the tweet as well as tweet metadata to engineer features
     - duplicates must be kept
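
The two `pandas` operations involved here can be illustrated on a toy DataFrame (the rows below are made up for illustration). `duplicated(..., keep=False)` flags every member of a duplicated group, which is what the inspection above relies on, while `drop_duplicates(..., keep="first")` retains a single row per distinct text.

```python
import pandas as pd

# Toy data: three tweets share the same text
toy = pd.DataFrame(
    {
        "id": ["1", "2", "3", "4"],
        "text": ["same text", "same text", "unique text", "same text"],
    }
)

# keep=False marks all rows of each duplicated group (useful for inspection)
all_dupes = toy[toy.duplicated(subset=["text"], keep=False)]

# keep="first" retains one row per distinct text (useful for text-only features)
deduped = toy.drop_duplicates(subset=["text"], keep="first")

print(all_dupes["id"].tolist())
print(deduped["id"].tolist())
```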

## Split Data and Create Sample for Training NLP Labeling (Transformer) Model

### Create Data Splits for ML Model Development

Create non-randomized training, validation, test and train+validation splits from data pre-dating the first inference `datetime`

In [13]:
%%time
df_all = ddf[ddf["created_at"] < inference_start_date]

CPU times: user 1.94 ms, sys: 67 µs, total: 2 ms
Wall time: 1.96 ms


In [14]:
%%time
df_train_val, df_test = train_test_split(
    df_all, test_size=test_split_frac, random_state=88, shuffle=False
)
df_train, df_val = train_test_split(
    df_train_val, test_size=val_split_frac, random_state=88, shuffle=False
)

CPU times: user 14.8 ms, sys: 3.83 ms, total: 18.6 ms
Wall time: 24 ms


**Notes**
1. When the deployed ML model is evaluated in production, it will make inference on data that arrives in chronological order. The predictions themselves do not have to be made in chronological order, so newly arrived data can be randomized before making inference with the trained (and deployed) model. The data splits should follow the same ordering during model development: the splits are created from data that is sorted in chronological order and are then randomized. Since the tweets were streamed, they accumulated in the cloud storage bucket in chronological order. When these files were loaded into a single `dask.DataFrame` above, the index was set to `created_at`, which sorts the data chronologically (setting the index sorts the data by the new index). So, we only need to create the splits before shuffling (randomizing) the data. For this reason, the keyword `shuffle=False` was specified in the `train_test_split()` calls used to create the training, validation and test splits.
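
The "split first, then shuffle" idea from the note can be illustrated with plain Python lists. This is a simplified stand-in for the idea only, not the `dask-ml` call used above; the toy records and the 25% test fraction are made up for illustration.

```python
import random

# Toy records, already sorted by creation time (ISO date strings sort chronologically)
records = [{"id": i, "created_at": f"2022-01-{i + 1:02d}"} for i in range(8)]

test_frac = 0.25
n_test = int(len(records) * test_frac)

# Step 1: split by position, so the test split holds the most recent records
train, test = records[:-n_test], records[-n_test:]

# Step 2: shuffle within each split; split membership does not change
rng = random.Random(88)
rng.shuffle(train)
rng.shuffle(test)
```

After shuffling, every record in `test` is still newer than every record in `train`; only the row order inside each split has changed.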

Randomize the splits

In [15]:
df_train, df_val, df_test, df_train_val = [
    df_train.sample(frac=1.0, random_state=88),
    df_val.sample(frac=1.0, random_state=88),
    df_test.sample(frac=1.0, random_state=88),
    df_train_val.sample(frac=1.0, random_state=88),
]

Get split lengths

In [16]:
%%time
df_rand_dates = (
    pd.DataFrame([len(df_train)], columns=['train_len'])
    .assign(val_len=len(df_val))
    .assign(test_len=len(df_test))
    .assign(
        total=lambda df: df["train_len"]
        + df["val_len"]
        + df["test_len"]
    )
    .assign(desired_val_frac=val_split_frac)
    .assign(desired_test_frac=test_split_frac)
    .assign(val_frac=lambda df: df["val_len"] / df["total"])
    .assign(test_frac=lambda df: df["test_len"] / df["total"])
)
df_rand_dates

CPU times: user 18.2 s, sys: 2.3 s, total: 20.5 s
Wall time: 16.4 s


Unnamed: 0,train_len,val_len,test_len,total,desired_val_frac,desired_test_frac,val_frac,test_frac
0,126772,21044,21114,168930,0.142857,0.125,0.124572,0.124987


**Notes**
1. The columns ending in `_frac` are the fraction of rows in the
   - validation split
   - testing split

   relative to the total number of rows across the combination of
   - training split
   - validation split
   - test split

**Observations**
1. By design the validation and testing splits are nearly equivalent in size. This is a direct consequence of specifying the `test_split_frac` and `val_split_frac` when creating the splits with `train_test_split()`.
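
The relationship between the two fractions can be checked with a few lines of arithmetic. `val_split_frac` is expressed relative to the train+validation data (what remains after the test split is removed), so it has to be larger than `test_split_frac` for the validation split to be the same share of the full data set.

```python
import math

test_split_frac = 0.125

# Fraction of the train+validation data to hold out for validation: 0.125 / 0.875
val_split_frac = test_split_frac / (1 - test_split_frac)

# Resulting share of the full data set that ends up in the validation split
val_frac_of_total = val_split_frac * (1 - test_split_frac)

print(f"val_split_frac = {val_split_frac:.6f}")
print(f"validation share of all rows = {val_frac_of_total:.6f}")
```

This matches the `desired_val_frac` value of 0.142857 in the table above. The observed `val_frac` and `test_frac` values are close to, but not exactly, 0.125.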

### Create Data Splits from Training Split, for NLP Model Development (to assign sentiment labels)

We will now draw a random sample from the training split to use in NLP (transformer) model fine-tuning in order to label the tweets with a sentiment (i.e. in order to extract the sentiment of the text in the tweets).

Next, we'll extract a sample of the training data corresponding to the earlier specified fraction of the training split to be used in NLP model fine-tuning, while dropping duplicated tweets

In [17]:
%%time
df_train_sample = (
    df_train.drop_duplicates(subset=["text"], keep="first")
    .sample(frac=nlp_sample_size, random_state=88)
    .compute()
    .sort_values(by=["created_at"])
)

CPU times: user 6.64 s, sys: 924 ms, total: 7.56 s
Wall time: 6.29 s


**Notes**
1. The sample is small enough that it fits in memory and so we don't need to use a big data framework to hold its contents. So, we call `.compute()` to bring this sample into memory and we can use in-memory tools (below) for creating data splits from this sample.
2. Before creating the data splits for ML model development, the data was sorted by the `datetime` when the tweet was posted (i.e. sorted by the `created_at` column). To create the splits for NLP model development in a consistent way, we sort the sampled data by the same `created_at` column after drawing the random sample. Non-random splits are then created from the sorted sample and randomized afterwards.

Several tweets can share the same creation timestamp, so duplicated values are possible in the `created_at` column. The number of unique values in this column can therefore be smaller than the total number of tweets in the data (including in the sampled data). Tweet IDs are unique for each tweet, so the number of unique values in the `id` column will match the number of tweets in the data. These are shown below

In [18]:
%%time
print(
    f"Number of unique IDs in sampled data = {df_train_sample['id'].nunique():,}\n"
    f"Number of unique creation datetimes in sampled data = {df_train_sample['created_at'].nunique():,}\n"
    f"Number of rows in sampled data = {len(df_train_sample):,}"
)

Number of unique IDs in sampled data = 9,969
Number of unique creation datetimes in sampled data = 9,890
Number of rows in sampled data = 9,969
CPU times: user 5.59 ms, sys: 205 µs, total: 5.8 ms
Wall time: 4.46 ms


We'll now create non-randomized training, validation, testing and train+validation splits from the sampled data, which will be used during NLP model training and evaluation

In [19]:
df_nlp_train_val, df_nlp_test = sk_train_test_split(
    df_train_sample, test_size=test_split_frac, shuffle=False, random_state=88
)
df_nlp_train, df_nlp_val = sk_train_test_split(
    df_nlp_train_val, test_size=test_split_frac, shuffle=False, random_state=88
)

Randomize the NLP splits and select the necessary columns

In [20]:
%%time
df_nlp_train, df_nlp_val, df_nlp_test, df_nlp_train_val = [
    df_nlp_train[nlp_cols].sample(frac=1.0, random_state=88),
    df_nlp_val[nlp_cols].sample(frac=1.0, random_state=88),
    df_nlp_test[nlp_cols].sample(frac=1.0, random_state=88),
    df_nlp_train_val[nlp_cols].sample(frac=1.0, random_state=88),
]

CPU times: user 16.9 ms, sys: 0 ns, total: 16.9 ms
Wall time: 17.2 ms


Get NLP split lengths

In [21]:
%%time
df_nlp_rand_dates = (
    pd.DataFrame([len(df_nlp_train)], columns=['train_len'])
    .assign(val_len=len(df_nlp_val))
    .assign(test_len=len(df_nlp_test))
    .assign(
        total=lambda df: df["train_len"]
        + df["val_len"]
        + df["test_len"]
    )
    .assign(val_frac=lambda df: df["val_len"] / df["total"])
    .assign(test_frac=lambda df: df["test_len"] / df["total"])
)
df_nlp_rand_dates

CPU times: user 4.34 ms, sys: 42 µs, total: 4.38 ms
Wall time: 3.67 ms


Unnamed: 0,train_len,val_len,test_len,total,val_frac,test_frac
0,7631,1091,1247,9969,0.109439,0.125088
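
In the table above, `val_frac` is about 0.109 rather than 0.125. This is consistent with the second `sk_train_test_split()` call being given `test_split_frac` rather than `val_split_frac`, so that the validation split is 12.5% of the remaining 87.5% of the sample. The arithmetic below reproduces the reported row counts, using the fact that scikit-learn rounds a fractional test size up to the next whole row.

```python
import math

n_rows = 9969  # rows in the sampled data, as reported above
test_split_frac = 0.125

# First call: hold out the test split from the full sample
n_test = math.ceil(test_split_frac * n_rows)
n_train_val = n_rows - n_test

# Second call: the same fraction is applied to the train+validation rows
n_val = math.ceil(test_split_frac * n_train_val)
n_train = n_train_val - n_val

val_frac = n_val / n_rows
print(n_train, n_val, n_test, round(val_frac, 6))
```

If equally sized validation and test splits were intended here, as for the ML splits, passing `val_split_frac` in the second call would be the likely change. That is an assumption about intent, not something the notebook states.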


## Export All Data Splits to S3 Bucket

Get the start date for making inference with the trained ML model

In [22]:
inference_start_date_str = (
    inference_start_date.replace("-", "").replace(":", "").replace(" ", "_")
)
print(inference_start_date_str)

20220110_000000


**Notes**
1. The inference start date is used in file naming as a crude way to version data used in each round of ML model training.
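
An alternative way to build the same file name suffix is to parse the date string first, which also validates its format. This is a minimal sketch, assuming the date is always given as `YYYY-MM-DD HH:MM:SS`; the variable names `stamp` and `fname` are used for illustration only.

```python
from datetime import datetime

inference_start_date = "2022-01-10 00:00:00"

# strptime raises ValueError if the string does not match the expected format
parsed = datetime.strptime(inference_start_date, "%Y-%m-%d %H:%M:%S")
stamp = parsed.strftime("%Y%m%d_%H%M%S")

fname = f"train__inference_starts_{stamp}.parquet.gzip"
print(fname)
```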

Each data split for ML model development will now be saved to a separate `.parquet` file

In [23]:
%%time
for split_name, df_split_to_export in zip(
    ["train", "val", "test"], [df_train, df_val, df_test]
):
    fname = f"{split_name}__inference_starts_{inference_start_date_str}.parquet.gzip"
    if upload_to_s3:
        storage_options = {
            "key": session.get_credentials().access_key,
            "secret": session.get_credentials().secret_key,
        }
        prefix = f"{path_to_folder[1:]}processed/splits/{fname}"
        split_filepath = f"s3://{s3_bucket_name}/{prefix}"
    else:
        storage_options = None
        prefix = f"{processed_data_dir}/{fname}"
        split_filepath = prefix
    df_split_to_export.to_parquet(
        split_filepath,
        write_index=False,
        compression='gzip',
        storage_options=storage_options,
    )
    print(f"Exported {len(df_split_to_export):,} rows to {prefix}")

Exported 126,772 rows to datasets/twitter/kinesis-demo/processed/splits/train__inference_starts_20220110_000000.parquet.gzip
Exported 21,044 rows to datasets/twitter/kinesis-demo/processed/splits/val__inference_starts_20220110_000000.parquet.gzip
Exported 21,114 rows to datasets/twitter/kinesis-demo/processed/splits/test__inference_starts_20220110_000000.parquet.gzip
CPU times: user 45.4 s, sys: 4.78 s, total: 50.2 s
Wall time: 38 s


Each sampled data split for NLP model development will now be saved to a separate
- `.csv.zip` file on S3
- local `.xlsx` file (for use in the next notebook)

In [24]:
%%time
if create_nlp_splits:
    for split_name, df_split_to_export in zip(
        ["train_nlp", "val_nlp", "test_nlp"],
        [df_nlp_train, df_nlp_val, df_nlp_test],
    ):
        fname = f"{split_name}__inference_starts_{inference_start_date_str}.csv.zip"
        if upload_to_s3:
            storage_options={
                "key": session.get_credentials().access_key,
                "secret": session.get_credentials().secret_key,
            }
            prefix = f"{path_to_folder[1:]}processed/nlp_splits/{fname}"
            split_filepath = f"s3://{s3_bucket_name}/{prefix}"
        else:
            storage_options = None
            prefix = f"{processed_data_dir}/{fname}"
            split_filepath = prefix
        df_split_to_export.to_csv(
            split_filepath,
            index=False,
            storage_options=storage_options,
        )
        df_split_to_export.to_excel(
            f"{processed_data_dir}/{fname.replace('.csv.zip', '.xlsx')}",
            index=False,
        )
        print(f"Exported {len(df_split_to_export):,} rows to {prefix}")

Exported 7,631 rows to datasets/twitter/kinesis-demo/processed/nlp_splits/train_nlp__inference_starts_20220110_000000.csv.zip
Exported 1,091 rows to datasets/twitter/kinesis-demo/processed/nlp_splits/val_nlp__inference_starts_20220110_000000.csv.zip
Exported 1,247 rows to datasets/twitter/kinesis-demo/processed/nlp_splits/test_nlp__inference_starts_20220110_000000.csv.zip
CPU times: user 1.72 s, sys: 2.39 ms, total: 1.72 s
Wall time: 1.9 s


## Cleanup

We'll now
- delete the local `.zip` file (containing the individual `.parquet` files of prepared data) that was downloaded from S3
- delete the local `.parquet` folder with prepared data that was processed in the previous notebook (`5_*.ipynb`) and extracted above

In [25]:
%%time
if cleanup_local_files:
    # delete locally exported (by PySpark) parquet files
    shutil.rmtree(proc_files[0])
    print("Deleted local .parquet.gzip files with filtered data.")

    # delete local zip file
    os.remove(os.path.join(processed_data_dir, proc_text_zip_fname))
    print("Deleted local .zip file created from all filtered data files.")

Deleted local .parquet.gzip files with filtered data.
Deleted local .zip file created from all filtered data files.
CPU times: user 0 ns, sys: 10.5 ms, total: 10.5 ms
Wall time: 9.18 ms


---

<span style="float:left;">
    <a href="./5_process_data.ipynb"><< 5 - Big Data Processing</a>
</span>

<span style="float:right;">
    <a href="./7_nlp_labeling.ipynb">7 - NLP-based Labeling >></a>
</span>