In [1]:
import numpy as np
import pandas as pd
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px
from gensim.downloader import load
import nltk
from nltk.tokenize import word_tokenize
import re
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD

## GloVe
GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm for obtaining vector representations for words. It is based on the idea that the meaning of a word can be inferred from the context in which it appears. GloVe constructs a word vector space such that the dot product of two word vectors equals the logarithm of the probability of their co-occurrence.
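
For reference, the relationship described above corresponds to the weighted least-squares objective given by Pennington et al. (2014). The notation is taken from that paper and is not used elsewhere in this notebook: $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, $f$ is a weighting function, and $V$ is the vocabulary size.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

The bias terms absorb the normalization that turns raw counts into co-occurrence probabilities, which is why the description above can be phrased in terms of log-probabilities.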

For example, if two words such as economic and finance never appear together in the same document, a method like tf-idf treats them as unrelated terms and assigns them a low similarity score, while GloVe assigns them a high similarity score because its vectors are learned from co-occurrence statistics over the whole corpus.

For more information, you can refer to the [GloVe Website](https://nlp.stanford.edu/projects/glove/).

In [2]:
glove_model = load("glove-wiki-gigaword-100")

## Natural Language Toolkit (NLTK)

The Natural Language Toolkit (NLTK) is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) in Python. It provides easy-to-use interfaces to over 50 corpora and lexical resources, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

In this assignment, we will use NLTK to preprocess the news articles. Only tokenization is used here; however, NLTK also provides other functionality that is useful for NLP tasks, such as stemming, lemmatization, and part-of-speech tagging.

### Tokenization
Tokenization is the process of splitting a text into individual words or tokens. NLTK provides a simple way to tokenize text using the `word_tokenize` function. This function splits the text into words and punctuation, returning a list of tokens.

In [3]:
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /Users/mark/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/mark/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [4]:
CONFIG = {
    "train_path": "./learn-ai-bbc/BBC_News_Train.csv",
    "test_path": "./learn-ai-bbc/BBC_News_Test.csv",
    "sample_solution_path": "./learn-ai-bbc/BBC_News_Sample_Solution.csv",
    "RANDOM_STATE": 42,
    "DEFAULT_TEST_SIZE": 0.2,
    "DEFAULT_TRAIN_SIZE": 0.8
}

In [5]:
def load_data(train_path: str, test_path: str, sample_solution_path: str) -> tuple:
    """
    Load the train, test, and sample solution datasets.

    Args:
        train_path (str): Path to the training dataset.
        test_path (str): Path to the test dataset.
        sample_solution_path (str): Path to the sample solution dataset.

    Returns:
        tuple: A tuple containing the train, test, and sample solution DataFrames.
    """
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)
    sample_solution_df = pd.read_csv(sample_solution_path)

    return train_df, test_df, sample_solution_df

In [6]:
train_df, test_df, sample_solution_df = load_data(
    train_path=CONFIG["train_path"],
    test_path=CONFIG["test_path"],
    sample_solution_path=CONFIG["sample_solution_path"]
)

In [7]:
print(train_df.shape)
print(test_df.shape)
print(sample_solution_df.shape)

(1490, 3)
(735, 2)
(735, 2)


`train_df` and `test_df` are the DataFrames containing the training and test data, respectively. `sample_solution_df` is the DataFrame containing the sample solution, which pairs each article ID in the test set with a news category label.

In [8]:
train_df.head()

| | ArticleId | Text | Category |
|---|---|---|---|
| 0 | 1833 | worldcom ex-boss launches defence lawyers defe... | business |
| 1 | 154 | german business confidence slides german busin... | business |
| 2 | 1101 | bbc poll indicates economic gloom citizens in ... | business |
| 3 | 1976 | lifestyle governs mobile choice faster bett... | tech |
| 4 | 917 | enron bosses in $168m payout eighteen former e... | business |


In [9]:
test_df.head()

| | ArticleId | Text |
|---|---|---|
| 0 | 1018 | qpr keeper day heads for preston queens park r... |
| 1 | 1319 | software watching while you work software that... |
| 2 | 1138 | d arcy injury adds to ireland woe gordon d arc... |
| 3 | 459 | india s reliance family feud heats up the ongo... |
| 4 | 1020 | boro suffer morrison injury blow middlesbrough... |


In [10]:
sample_solution_df.head()

| | ArticleId | Category |
|---|---|---|
| 0 | 1018 | sport |
| 1 | 1319 | tech |
| 2 | 1138 | business |
| 3 | 459 | entertainment |
| 4 | 1020 | politics |


In [11]:
train_df['Category'].unique()

array(['business', 'tech', 'politics', 'sport', 'entertainment'],
      dtype=object)

In [12]:
train_df.describe()

| | ArticleId |
|---|---|
| count | 1490.0 |
| mean | 1119.696644 |
| std | 641.826283 |
| min | 2.0 |
| 25% | 565.25 |
| 50% | 1112.5 |
| 75% | 1680.75 |
| max | 2224.0 |


Let's take a look at the distribution of the news categories in the training set. This will help us understand how balanced the dataset is and whether we need to apply any techniques to handle class imbalance.

In [13]:
category_counts = train_df['Category'].value_counts(normalize=True)

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Count of Categories", "Proportion of Categories")
)

fig.add_trace(
    go.Histogram(
        x=train_df['Category'],
        name="Count"
    ),
    row=1, col=1
)

fig.add_trace(
    go.Bar(
        x=category_counts.index,
        y=category_counts.values,
        name="Proportion"
    ),
    row=1, col=2
)

fig.show()

It appears that the dataset is relatively balanced, with a slight imbalance towards the 'sport' category. 

In [14]:
# mean, min, max, and std of the length of the text in the training set
text_length = train_df['Text'].apply(len)
print(text_length.describe())

# Visualize the distribution of text lengths in the training set
fig = px.histogram(text_length, title='Distribution of Text Lengths in Training Set')
fig.update_xaxes(title='Text Length')
fig.update_yaxes(title='Count')
fig.show()

count     1490.000000
mean      2233.461745
std       1205.153358
min        501.000000
25%       1453.000000
50%       1961.000000
75%       2751.250000
max      18387.000000
Name: Text, dtype: float64


In [15]:
train_df.duplicated().sum()

0

## Data Preprocessing

I'll do some simple preprocessing steps to clean the text data. This includes:
- Lowercasing the text
- Removing punctuation
- Tokenizing the text using NLTK



In [16]:

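def clean_text(text):
    """Lowercase the text and delete every character except a-z, 0-9, and whitespace."""
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return text

def tokenize(text):
    """Split cleaned text into tokens with NLTK's word_tokenize."""
    return word_tokenize(text)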

df = train_df.copy(deep=True)
df['text'] = df['Text'].astype(str) 

df['clean_text'] = df['text'].apply(clean_text)
df['tokens'] = df['clean_text'].apply(tokenize)
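
To make the effect of the cleaning step concrete, here is a small sketch applied to two headline fragments from the data preview above. It re-declares `clean_text` so that it runs on its own, and it uses `str.split()` as a stand-in for `word_tokenize`. That substitution is an assumption made to avoid the NLTK data download; for punctuation-free input like this, the two should produce the same tokens.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase the text and delete every character except a-z, 0-9, and whitespace."""
    return re.sub(r'[^a-z0-9\s]', '', text.lower())

# Headline fragments taken from the training data preview above.
samples = ["worldcom ex-boss launches defence", "enron bosses in $168m payout"]

cleaned = [clean_text(s) for s in samples]
# str.split() is a stand-in for word_tokenize (assumption, see lead-in).
tokens = [c.split() for c in cleaned]

for c, t in zip(cleaned, tokens):
    print(c, "->", t)
```

Because punctuation is deleted rather than replaced with a space, hyphenated words are merged ("ex-boss" becomes "exboss"). Merged tokens like this may be missing from the GloVe vocabulary, in which case they are skipped when document vectors are built.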

In [17]:
df

Unnamed: 0,ArticleId,Text,Category,text,clean_text,tokens
0,1833,worldcom ex-boss launches defence lawyers defe...,business,worldcom ex-boss launches defence lawyers defe...,worldcom exboss launches defence lawyers defen...,"[worldcom, exboss, launches, defence, lawyers,..."
1,154,german business confidence slides german busin...,business,german business confidence slides german busin...,german business confidence slides german busin...,"[german, business, confidence, slides, german,..."
2,1101,bbc poll indicates economic gloom citizens in ...,business,bbc poll indicates economic gloom citizens in ...,bbc poll indicates economic gloom citizens in ...,"[bbc, poll, indicates, economic, gloom, citize..."
3,1976,lifestyle governs mobile choice faster bett...,tech,lifestyle governs mobile choice faster bett...,lifestyle governs mobile choice faster bett...,"[lifestyle, governs, mobile, choice, faster, b..."
4,917,enron bosses in $168m payout eighteen former e...,business,enron bosses in $168m payout eighteen former e...,enron bosses in 168m payout eighteen former en...,"[enron, bosses, in, 168m, payout, eighteen, fo..."
...,...,...,...,...,...,...
1485,857,double eviction from big brother model caprice...,entertainment,double eviction from big brother model caprice...,double eviction from big brother model caprice...,"[double, eviction, from, big, brother, model, ..."
1486,325,dj double act revamp chart show dj duo jk and ...,entertainment,dj double act revamp chart show dj duo jk and ...,dj double act revamp chart show dj duo jk and ...,"[dj, double, act, revamp, chart, show, dj, duo..."
1487,1590,weak dollar hits reuters revenues at media gro...,business,weak dollar hits reuters revenues at media gro...,weak dollar hits reuters revenues at media gro...,"[weak, dollar, hits, reuters, revenues, at, me..."
1488,1587,apple ipod family expands market apple has exp...,tech,apple ipod family expands market apple has exp...,apple ipod family expands market apple has exp...,"[apple, ipod, family, expands, market, apple, ..."


## Unique Tokens

Let's take a look at the unique tokens in the training set after preprocessing. This will give us an idea of the vocabulary size and the diversity of the text data.

In [18]:
# number of unique tokens in the training set
unique_tokens = set()
for tokens in df['tokens']:
    unique_tokens.update(tokens)

print(f"Number of unique tokens in the training set: {len(unique_tokens)}")

Number of unique tokens in the training set: 27278


## Document Vectors

Now I use GloVe to convert the tokens into document vectors. GloVe provides pre-trained word vectors that represent words in a continuous vector space. However, for our unsupervised learning task we need one vector per document, not one per word. To achieve this, I average the word vectors of all tokens in a document to create a single vector representation for that document.

In [19]:
def get_document_vector(tokens, embedding_index, dim=100):
    """
    Get the document vector for a list of tokens using the GloVe embedding index.
    If a token is not found in the embedding index, it is ignored.
    If no tokens are found, a zero vector of the specified dimension is returned.

    Args:
        tokens (list): List of tokens (words) from the document.
        embedding_index: Mapping from tokens to their GloVe vectors (here, the loaded gensim model).
        dim (int): Dimension of the GloVe vectors (default is 100).

    Returns:
        np.ndarray: A vector representing the document, averaged from the GloVe vectors of the tokens.
        If no tokens are found in the embedding index, returns a zero vector of the specified dimension.
    """
    vecs = []
    for token in tokens:
        if token in embedding_index:
            vecs.append(embedding_index[token])
    if len(vecs) > 0:
        return np.mean(vecs, axis=0)
    else:
        return np.zeros(dim)

df['doc_vector'] = df['tokens'].apply(lambda x: get_document_vector(x, glove_model, dim=100))
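
The averaging logic is easier to follow on a tiny example. The sketch below uses a compact re-implementation of the function and a hypothetical 3-dimensional embedding dictionary (the vectors are made up for illustration and are not real GloVe values) to show how out-of-vocabulary tokens are handled.

```python
import numpy as np

def get_document_vector(tokens, embedding_index, dim):
    """Average the vectors of all in-vocabulary tokens; zero vector if none."""
    vecs = [embedding_index[t] for t in tokens if t in embedding_index]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Hypothetical 3-dimensional embeddings, made up for illustration only.
toy_embeddings = {
    "stock": np.array([1.0, 0.0, 2.0]),
    "market": np.array([3.0, 2.0, 0.0]),
}

# "exboss" is not in the toy vocabulary, so it is skipped.
doc_vec = get_document_vector(["stock", "market", "exboss"], toy_embeddings, dim=3)
# No token is known here, so the fallback zero vector is returned.
empty_vec = get_document_vector(["exboss"], toy_embeddings, dim=3)

print(doc_vec)
print(empty_vec)
```

Note that a plain average ignores word order and gives frequent function words the same weight as content words, which is a known limitation of this kind of document representation.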

In [20]:
df

Unnamed: 0,ArticleId,Text,Category,text,clean_text,tokens,doc_vector
0,1833,worldcom ex-boss launches defence lawyers defe...,business,worldcom ex-boss launches defence lawyers defe...,worldcom exboss launches defence lawyers defen...,"[worldcom, exboss, launches, defence, lawyers,...","[0.0881027, -0.072838455, 0.22837983, -0.15146..."
1,154,german business confidence slides german busin...,business,german business confidence slides german busin...,german business confidence slides german busin...,"[german, business, confidence, slides, german,...","[0.011770573, 0.14082459, 0.2959648, -0.077304..."
2,1101,bbc poll indicates economic gloom citizens in ...,business,bbc poll indicates economic gloom citizens in ...,bbc poll indicates economic gloom citizens in ...,"[bbc, poll, indicates, economic, gloom, citize...","[-0.055196114, 0.19540092, 0.33485386, -0.1419..."
3,1976,lifestyle governs mobile choice faster bett...,tech,lifestyle governs mobile choice faster bett...,lifestyle governs mobile choice faster bett...,"[lifestyle, governs, mobile, choice, faster, b...","[-0.105643936, 0.13858151, 0.27540573, -0.1671..."
4,917,enron bosses in $168m payout eighteen former e...,business,enron bosses in $168m payout eighteen former e...,enron bosses in 168m payout eighteen former en...,"[enron, bosses, in, 168m, payout, eighteen, fo...","[0.11647683, 0.012706946, 0.2649862, -0.147849..."
...,...,...,...,...,...,...,...
1485,857,double eviction from big brother model caprice...,entertainment,double eviction from big brother model caprice...,double eviction from big brother model caprice...,"[double, eviction, from, big, brother, model, ...","[-0.058606397, 0.020778334, 0.30748582, -0.302..."
1486,325,dj double act revamp chart show dj duo jk and ...,entertainment,dj double act revamp chart show dj duo jk and ...,dj double act revamp chart show dj duo jk and ...,"[dj, double, act, revamp, chart, show, dj, duo...","[-0.1355775, 0.105269626, 0.33229253, -0.25106..."
1487,1590,weak dollar hits reuters revenues at media gro...,business,weak dollar hits reuters revenues at media gro...,weak dollar hits reuters revenues at media gro...,"[weak, dollar, hits, reuters, revenues, at, me...","[0.00079907634, 0.03381798, 0.24186508, -0.139..."
1488,1587,apple ipod family expands market apple has exp...,tech,apple ipod family expands market apple has exp...,apple ipod family expands market apple has exp...,"[apple, ipod, family, expands, market, apple, ...","[-0.050727844, 0.12790911, 0.3013645, -0.20063..."


## Principal Component Analysis (PCA)

I will use PCA to extract the most important features from the document vectors. The following plot shows the first two principal components of the document vectors, which helps us visualize how the documents are distributed in the vector space.

In [34]:
X = np.stack(df['doc_vector'].values)
print(X.shape)
pca = PCA(n_components=5, random_state=CONFIG['RANDOM_STATE'])
X_reduced = pca.fit_transform(X)

fig = px.scatter(
    x=X_reduced[:, 0],
    y=X_reduced[:, 1],
    color=df['Category'],
    labels={'x': 'PCA Component 1', 'y': 'PCA Component 2', 'color': 'Category'},
    title='PCA of GloVe-based Document Vectors',
    width=800,
    height=600
)
fig.show()

(1490, 100)


## Truncated SVD

For this assignment, I will use Truncated SVD (Singular Value Decomposition) to reduce the dimensionality of the document vectors. Truncated SVD is a linear dimensionality reduction technique that, unlike PCA, does not center the data before decomposition. This makes it particularly useful for sparse matrices such as tf-idf features. The averaged GloVe document vectors used here are dense, but Truncated SVD applies to them just as well.
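
The practical difference between PCA and Truncated SVD is whether the data is centered first. The following NumPy sketch illustrates this on synthetic data (a stand-in for the document matrix, not the real vectors); the helper `project` is a hypothetical name introduced here, and results may differ from scikit-learn's output in the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for the document matrix: 200 "documents", 10 dimensions,
# shifted away from the origin so that centering makes a visible difference.
X_demo = rng.normal(loc=3.0, scale=1.0, size=(200, 10))

def project(X, k):
    """Project X onto its top-k right singular vectors (no centering inside)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T

k = 2
svd_scores = project(X_demo, k)                        # truncated SVD: raw data
pca_scores = project(X_demo - X_demo.mean(axis=0), k)  # PCA: centered data

# PCA scores have (numerically) zero mean per component. The SVD scores do
# not, because the leading singular vector mostly captures the mean offset.
pca_mean = np.abs(pca_scores.mean(axis=0)).max()
svd_mean = np.abs(svd_scores.mean(axis=0)).max()
print(pca_mean, svd_mean)
```

When the data has a large mean offset, the first SVD component can end up describing that offset rather than the variation between documents, which is worth keeping in mind when reading SVD plots.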

### Why not NMF?

NMF (Non-negative Matrix Factorization) requires all elements of the input matrix to be ≥ 0.
It’s designed to find additive, parts-based decompositions (like how an image is built from positive pixel intensities or how a document is built from positive word counts). The GloVe document vectors, however, contain negative values.

Word embeddings like GloVe or Word2Vec are real-valued with no non-negativity constraint, so their components are spread around zero and naturally include negative numbers, as the `doc_vector` column above shows.
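
A quick way to check this precondition is to test whether every entry of a matrix is non-negative. The sketch below does so for the first three values of the first document vector shown earlier and for a hypothetical word-count vector; `nmf_compatible` is a helper name introduced here for illustration.

```python
import numpy as np

# First three components of the first document vector printed above.
doc_vector_head = np.array([0.0881027, -0.072838455, 0.22837983])
# Hypothetical word counts, made up for illustration.
word_counts = np.array([3, 0, 1])

def nmf_compatible(X) -> bool:
    """Return True when every entry is >= 0, the precondition NMF needs."""
    return bool((np.asarray(X) >= 0).all())

print(nmf_compatible(doc_vector_head))
print(nmf_compatible(word_counts))
```

Count-based features such as bag-of-words or tf-idf pass this check, which is why NMF is usually paired with them rather than with dense embeddings.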

In [22]:
svd = TruncatedSVD(n_components=5, random_state=CONFIG['RANDOM_STATE'])
document_topic_matrix = svd.fit_transform(X)
# Rows are SVD components; columns are the 100 embedding dimensions (not individual words)
topic_word_matrix = svd.components_

print("Document-topic matrix shape:", document_topic_matrix.shape)
print("Topic-word matrix shape:", topic_word_matrix.shape)

Document-topic matrix shape: (1490, 5)
Topic-word matrix shape: (5, 100)


## SVD Components of the Document Vectors

The following plot shows the five SVD components of the document vectors and their relationship with the news categories. This helps us understand how the documents are distributed in the SVD space and how that distribution relates to the different news categories.

In [23]:
fig = px.scatter_matrix(
    document_topic_matrix,
    dimensions=[0, 1, 2, 3, 4],
    color=df['Category'],
    labels={i: f'Component {i + 1}' for i in range(5)},
    title='SVD Components of Document Vectors'
)
fig.update_traces(diagonal_visible=False)
fig.show()

## Training Unsupervised Learning Models

Now, I will train unsupervised learning models on the document vectors. This includes clustering algorithms like K-Means to identify patterns and group similar documents together.


In [24]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    np.vstack(df['doc_vector'].values),
    df['Category'],
    test_size=CONFIG['DEFAULT_TEST_SIZE'],
    random_state=CONFIG['RANDOM_STATE'],
)



Here I employed TruncatedSVD to reduce the dimensionality of our feature space to five components. The SVD model was first fitted on the training data to learn the optimal lower-dimensional representation and then used to consistently transform both the training and test datasets. This approach helps capture the most important patterns in the data while reducing noise and computational cost. By applying dimensionality reduction, we aim to improve model performance and mitigate potential overfitting.

In [25]:
svd = TruncatedSVD(n_components=5, random_state=CONFIG['RANDOM_STATE'])
X_train_svd = svd.fit_transform(X_train)
X_test_svd = svd.transform(X_test)
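
The fit-on-train, transform-both pattern can be illustrated with plain NumPy. This is a sketch on synthetic data, not a reimplementation of scikit-learn's `TruncatedSVD`: the projection axes are learned from the training matrix only and then reused unchanged for the test matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
X_tr = rng.normal(size=(80, 10))  # synthetic "training" vectors
X_te = rng.normal(size=(20, 10))  # synthetic "test" vectors

# "Fit": learn the top-k right singular vectors from the training data only.
k = 5
_, _, Vt = np.linalg.svd(X_tr, full_matrices=False)
components = Vt[:k]               # shape (k, n_features)

# "Transform": project both sets onto the same training-derived axes.
X_tr_reduced = X_tr @ components.T
X_te_reduced = X_te @ components.T
print(X_tr_reduced.shape, X_te_reduced.shape)
```

Keeping the test data out of the fitting step means the reduced test features reflect only what was learned from the training set, so the later evaluation is not influenced by the test documents.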


Next, I applied KMeans clustering to the reduced data to explore inherent groupings. I initialized the KMeans model with the number of clusters equal to the number of unique categories in our dataset, so that each cluster can in principle correspond to one category. The model was fitted on the transformed training data (`X_train_svd`) to learn cluster centroids and then used to predict cluster assignments for both the training and test sets. This clustering step lets us investigate how well the unsupervised groupings correspond to the original categories.

In [26]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=df['Category'].nunique(), random_state=CONFIG['RANDOM_STATE'])
train_clusters = kmeans.fit_predict(X_train_svd)
test_clusters = kmeans.predict(X_test_svd)
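
For intuition about what fitting and predicting mean here, the following is a minimal sketch of Lloyd's algorithm in NumPy on two synthetic blobs. It is not scikit-learn's implementation (which uses a smarter initialization, among other refinements); `kmeans_sketch` and `assign` are hypothetical helper names.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=42):
    """Plain Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Random initialization from the data points (an illustrative choice).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

def assign(X, centroids):
    """Assign points to the nearest learned centroid (conceptually what predict does)."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

# Two well-separated synthetic blobs of 50 points each.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
X_demo = np.vstack([blob_a, blob_b])

labels, centroids = kmeans_sketch(X_demo, k=2)
# New points are assigned using the centroids learned above, without refitting.
new_labels = assign(np.array([[0.2, -0.1], [9.8, 10.3]]), centroids)
print(labels[:3], labels[-3:], new_labels)
```

The cluster ids themselves are arbitrary: which blob ends up as cluster 0 depends on the initialization, which is why a separate step is needed to map clusters to category names.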


K-Means has grouped the documents into clusters; however, we do not yet know what these clusters represent in terms of the original categories. To interpret the clusters, we need to map them back to the original labels.


To map the discovered clusters back to the actual categories, we assigned each cluster the most frequent true label (mode) among the training samples within that cluster. This was done by iterating over all clusters and building a cluster_to_label dictionary linking clusters to their majority labels. Using this mapping, we predicted the final labels for both the training and test data based on their cluster assignments. This simple post-clustering labeling strategy enables us to evaluate how well the unsupervised clustering aligns with the original supervised targets.

In [27]:
cluster_to_label = {}
for cluster in np.unique(train_clusters):
    mask = train_clusters == cluster
    majority_label = pd.Series(y_train[mask]).mode()[0]
    cluster_to_label[cluster] = majority_label
print(cluster_to_label)
# Predict final labels
y_train_pred = np.array([cluster_to_label[c] for c in train_clusters])
y_test_pred = np.array([cluster_to_label[c] for c in test_clusters])


{0: 'politics', 1: 'entertainment', 2: 'sport', 3: 'business', 4: 'tech'}
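
The majority-vote mapping can be traced by hand on a toy example. The cluster ids and labels below are hypothetical and were chosen only to keep the arithmetic easy to follow.

```python
from collections import Counter

# Hypothetical cluster ids and true labels for eight training documents.
clusters = [0, 0, 0, 1, 1, 1, 1, 0]
labels = ["sport", "sport", "tech", "business",
          "business", "politics", "business", "sport"]

cluster_to_label = {}
for c in sorted(set(clusters)):
    # Collect the true labels of all documents assigned to cluster c.
    members = [lab for cl, lab in zip(clusters, labels) if cl == c]
    # The most frequent label becomes the cluster's name.
    cluster_to_label[c] = Counter(members).most_common(1)[0][0]

predicted = [cluster_to_label[c] for c in clusters]
accuracy = sum(p == t for p, t in zip(predicted, labels)) / len(labels)
print(cluster_to_label, accuracy)
```

Nothing in this procedure forces the mapping to be one-to-one: two clusters could receive the same majority label, in which case some category would never be predicted. In the run above, the five clusters map to five distinct categories, so this did not occur.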


## Confusion Matrix

Finally, I will evaluate the performance of the unsupervised learning models using a confusion matrix. The confusion matrix will show the true labels vs the predicted labels for the test set. This will help us understand how well the unsupervised learning models performed and how they relate to the original categories.

In [28]:
from sklearn.metrics import accuracy_score, confusion_matrix

print("Train accuracy:", accuracy_score(y_train, y_train_pred))
print("Test accuracy:", accuracy_score(y_test, y_test_pred))

cm = confusion_matrix(y_test, y_test_pred, labels=np.unique(y_test))


# plot the confusion matrix using plotly
fig = px.imshow(
    cm,
    text_auto=True,
    labels=dict(x="Predicted Category", y="True Category", color="Count"),
    x=np.unique(y_test),
    y=np.unique(y_test),
    color_continuous_scale='YlGn',
    # aspect="auto",
    title="Confusion Matrix",
    width=600,
    height=600
)
fig.show()

Train accuracy: 0.886744966442953
Test accuracy: 0.889261744966443


We evaluated the clustering-based classification using a confusion matrix and calculated overall accuracy. The confusion matrix reveals that most samples are correctly grouped into their true categories, with a few misclassifications primarily between related classes such as business and politics. The approach achieved a training accuracy of approximately $88.7\%$ and a test accuracy of approximately $88.9\%$, indicating that the cluster-to-label mapping generalizes well to unseen data. These results demonstrate that even an unsupervised technique like KMeans, when combined with majority label assignment, can effectively approximate the original categories in this dataset.
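
As a reminder of how these numbers relate to a confusion matrix, here is a short sketch that derives accuracy, per-class recall, and per-class precision from a hypothetical 3-class matrix. The counts are made up for illustration and are not the notebook's results.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
class_names = ["business", "politics", "sport"]
cm_demo = np.array([
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
])

accuracy = np.trace(cm_demo) / cm_demo.sum()        # correct / total
recall = np.diag(cm_demo) / cm_demo.sum(axis=1)     # correct / row total (true class)
precision = np.diag(cm_demo) / cm_demo.sum(axis=0)  # correct / column total (predicted class)

for name, r, p in zip(class_names, recall, precision):
    print(f"{name}: recall={r:.2f}, precision={p:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Off-diagonal cells show which pairs of classes are confused with each other, which overall accuracy alone does not reveal.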

## Supervised Learning Models

Let's train a supervised learning model on the original document vectors and evaluate its performance. Many supervised learning algorithms can be used for text classification; for the sake of simplicity, I will use logistic regression. However, you can experiment with other algorithms such as SVM, Random Forest, or XGBoost to see whether they achieve better performance.

In [29]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=1000, random_state=CONFIG['RANDOM_STATE'])
clf.fit(X_train, y_train)

# Predictions
y_train_pred_supervised = clf.predict(X_train)
y_test_pred_supervised = clf.predict(X_test)

In [30]:
from sklearn.metrics import accuracy_score, classification_report

print("Train accuracy:", accuracy_score(y_train, y_train_pred_supervised))
print("Test accuracy:", accuracy_score(y_test, y_test_pred_supervised))
print(classification_report(y_test, y_test_pred_supervised))

Train accuracy: 0.9605704697986577
Test accuracy: 0.9731543624161074
               precision    recall  f1-score   support

     business       0.96      0.99      0.97        75
entertainment       1.00      0.96      0.98        46
     politics       0.93      0.98      0.96        56
        sport       1.00      1.00      1.00        63
         tech       0.98      0.93      0.96        58

     accuracy                           0.97       298
    macro avg       0.98      0.97      0.97       298
 weighted avg       0.97      0.97      0.97       298



I trained a logistic regression classifier on the original document vectors to perform text classification. The model achieved impressive results, with a training accuracy of approximately $96.1\%$ and a test accuracy of approximately $97.3\%$, indicating excellent generalization. The detailed classification report shows consistently high precision, recall, and F1-scores across all categories, with perfect scores for the sport class and very strong results for business, entertainment, politics, and tech. These findings demonstrate that even a relatively simple supervised model can effectively capture the distinctions among categories in this dataset.

The following plot shows the confusion matrix for the supervised learning model. The confusion matrix reveals that most samples are correctly classified into their true categories, with only a few misclassifications primarily between related classes such as business and politics. The model's high accuracy and strong performance across all categories indicate its effectiveness in capturing the underlying patterns in the text data.

In [31]:
cm = confusion_matrix(y_test, y_test_pred_supervised, labels=np.unique(y_test))


# plot the confusion matrix using plotly
fig = px.imshow(
    cm,
    text_auto=True,
    labels=dict(x="Predicted Category", y="True Category", color="Count"),
    x=np.unique(y_test),
    y=np.unique(y_test),
    color_continuous_scale='YlGn',
    # aspect="auto",
    title="Confusion Matrix",
    width=600,
    height=600
)
fig.show()

## Learning Curve Analysis

To investigate how the amount of training data influences model performance, we trained the logistic regression classifier on progressively larger fractions of the training set, ranging from $1\%$ to nearly $100\%$. The results show that the test accuracy increased rapidly with the addition of more data: starting at ~$46\%$ with just $1\%$ of the data, surpassing $90\%$ at $5\%$, and reaching about $97\%$ at $30\%$. Beyond this point the accuracy plateaued, fluctuating between roughly $96.6\%$ and $97.7\%$, indicating that the model had effectively learned the patterns in the data. This analysis highlights the significant impact of training set size on classifier performance, especially in the early stages, and confirms that our dataset is sufficiently large to achieve strong generalization.



In [32]:
train_sizes = []
accuracies = []
train_fracs = [0.01, 0.03, 0.05, 0.08, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.90, 0.95, 0.99, 0.999, 0.9999, 0.99999]

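for frac in train_fracs:
    # Subsample the training set; the held-out test set stays fixed
    X_partial, _, y_partial, _ = train_test_split(X_train, y_train, train_size=frac, random_state=CONFIG['RANDOM_STATE'])
    clf.fit(X_partial, y_partial)  # refits the same classifier on the subset
    # Use a separate name so the K-Means predictions in y_test_pred are not overwritten
    y_test_pred_partial = clf.predict(X_test)
    acc = accuracy_score(y_test, y_test_pred_partial)
    train_sizes.append(frac * 100)  # percent
    accuracies.append(acc)
    print(f"Train size: {frac*100:.3f}% -> Test accuracy: {acc:.6f}")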


Train size: 1.000% -> Test accuracy: 0.463087
Train size: 3.000% -> Test accuracy: 0.758389
Train size: 5.000% -> Test accuracy: 0.912752
Train size: 8.000% -> Test accuracy: 0.932886
Train size: 10.000% -> Test accuracy: 0.936242
Train size: 20.000% -> Test accuracy: 0.953020
Train size: 30.000% -> Test accuracy: 0.969799
Train size: 40.000% -> Test accuracy: 0.969799
Train size: 50.000% -> Test accuracy: 0.966443
Train size: 60.000% -> Test accuracy: 0.969799
Train size: 70.000% -> Test accuracy: 0.966443
Train size: 80.000% -> Test accuracy: 0.966443
Train size: 90.000% -> Test accuracy: 0.969799
Train size: 95.000% -> Test accuracy: 0.973154
Train size: 99.000% -> Test accuracy: 0.976510
Train size: 99.900% -> Test accuracy: 0.976510
Train size: 99.990% -> Test accuracy: 0.976510
Train size: 99.999% -> Test accuracy: 0.976510
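
To put the fractions in perspective, the sketch below converts them into approximate article counts. It assumes the training split holds 1192 articles (1490 in total minus the 298 test articles reported above) and that a float `train_size` is converted to a sample count by rounding down. The rounding rule is an assumption about `train_test_split`, not something this notebook verifies.

```python
import math

n_train_total = 1192  # 1490 articles minus the 298 held out for testing
train_fracs = [0.01, 0.05, 0.3, 0.99, 0.999, 0.9999, 0.99999]

# Assumption: a float train_size becomes floor(train_size * n_samples) samples.
sizes = {frac: math.floor(frac * n_train_total) for frac in train_fracs}

for frac, n in sizes.items():
    print(f"{frac:>8}: about {n} articles")
```

Under these assumptions, the $1\%$ setting trains on only about a dozen articles spread over five categories, which is consistent with the low accuracy at that point, and the largest fractions select nearly identical subsets, which is consistent with their identical scores.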


In [33]:
df_plot = pd.DataFrame({
    'Train Size (%)': train_sizes,
    'Test Accuracy': accuracies
})

fig = px.line(
    df_plot, 
    x='Train Size (%)', 
    y='Test Accuracy', 
    markers=True,
    title='Test Accuracy vs Training Data Size (Supervised Learning)',
    labels={'Train Size (%)': 'Training Data Size (%)', 'Test Accuracy': 'Test Accuracy'},
    width=800,
    height=600
)
fig.update_layout(yaxis=dict(range=[0,1]))  # fix scale to [0,1]
fig.show()

## References
- Pennington et al. "GloVe: Global Vectors for Word Representation." EMNLP 2014.
- [sklearn documentation](https://scikit-learn.org/stable/documentation.html)
- [NLTK documentation](https://www.nltk.org/howto/tokenize.html)
- [DTSA 5747: Fundamentals of Natural Language Processing](https://www.colorado.edu/program/data-science/coursera/curriculum/dtsa5747)