# [Homework 3][link] - Analyzing the Zulip Chat
[link]: https://github.com/onefact/datathinking.org/issues/132

_Benjamin Eckhardt — DataThinking — Spring 2023 — University of Tartu_

## Outline

We analyze the Zulip chat data by 

0. loading and cleaning the data 
1. computing high dimensional embeddings from a pretrained model 
2. applying linear regression to predict message lengths from those
embeddings 
3. applying logistic regression to predict message subjects from those embeddings

We support our analysis with visualizations based on dimensionality
reduction. The linear regression task partly fails because of non-linearities in the data distribution. The logistic regression task succeeds well on the training data, but we suspect overfitting because the input dimensionality is of the same order of magnitude as the number of samples.

## Setup

We import the required Python packages and set up the notebook environment.

In [177]:
#from itables import init_notebook_mode; init_notebook_mode(all_interactive=True)
from IPython.core.interactiveshell import InteractiveShell; InteractiveShell.ast_node_interactivity = "all"
from myst_nb import glue

import json

import altair as alt   # used for all charts below
import pandas as pd    # NOTE: somehow my kernel crashes with polars, so I use the almost identical pandas interface

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

import spacy

## Cleaning and summary statistics

### Load the data and keep only the columns relevant to our analysis

In [141]:
# read the raw export; the context manager closes the file handle again
with open('../data/zulip-000001.json', 'r') as f:
    D = json.load(f)
D = pd.json_normalize(D['zerver_message']).drop(columns = 'id realm rendered_content rendered_content_version sending_client last_edit_time edit_history sender recipient date_sent has_image has_link search_tsvector has_attachment'.split(' '))
D.head()

| | subject | content |
|---|---|---|
| 0 | topic demonstration | This is a message on stream #**general** with ... |
| 1 | topic demonstration | Topics are a lightweight tool to keep conversa... |
| 2 | swimming turtles | This is a message on stream #**general** with ... |
| 3 | intros | Hi! I am Jaan. I am teaching this course and t... |
| 4 | intros | Hello Jaan,\nI am Nesma, a Ph.D. student at th... |


### Compute summary statistics of the subset

In [142]:
stats = dict(
    n_messages = len(D),
    #n_users = len(D['sender'].unique()),
    #n_streams = len(D['recipient'].unique()),
    n_subjects = len(D['subject'].unique()),
    messages_per_subject = D.groupby('subject')['content'].count(),
)

# print a summary of all the stats
print(f"Loaded {stats['n_messages']} messages from {stats['n_subjects']} subjects.")
print(f"\nAverage number of messages per subject: {stats['messages_per_subject'].mean():.0f}")
print("Number of messages per subject:")
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for topic, n_messages in stats['messages_per_subject'].sort_values(ascending=False).items():
    print(f"  {topic:34s} {n_messages:.0f}")


Loaded 240 messages from 14 subjects.

Average number of messages per subject: 17
Number of messages per subject:
  main                               66
  plotting                           55
  intros                             44
  linkedin profile                   25
  Anaconda installation              19
  random philosophy materials        8
  (no topic)                         6
  Zoom Link                          6
  Course starting date               3
  GitHub repository                  2
  My Linkedin profile:               2
  topic demonstration                2
  Random Interesting Links           1
  swimming turtles                   1


### Drop the subjects with too few messages

In [160]:
# drop the subjects with fewer than 8 messages
D = D.groupby('subject').filter(lambda x: len(x) >= 8)
print(f"Only {len(D['subject'].unique())} subjects are left: {D['subject'].unique()}.")

Only 6 subjects are left: ['intros' 'main' 'Anaconda installation' 'linkedin profile' 'plotting'
 'random philosophy materials'].


### More summaries: message lengths in words per subject

In [147]:
stats |= dict(
    avg_words = D['content'].str.split().str.len().mean(),
    words_per_subject = D.groupby('subject')['content'].apply(lambda x: x.str.split().str.len().mean()),
)

print(f"\nAverage number of words per message across all subjects: {stats['avg_words']:.0f}")
print("Average number of words per subject:")
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for topic, n_words in stats['words_per_subject'].sort_values(ascending=False).items():
    print(f"  {topic:34s} {n_words:.0f}")


Average number of words per message across all subjects: 59
Average number of words per subject:
  main                               72
  plotting                           65
  intros                             64
  Anaconda installation              55
  random philosophy materials        42
  linkedin profile                   8


We observe that most subjects have a similar average number of words per message. Only the subject "linkedin profile" scores much lower.

## Compute embeddings from message content

### High dimensional embeddings 


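We compute the high dimensional message embeddings using spaCy and its small English pipeline `en_core_web_sm`. Copilot suggested that this model is based on GloVe vectors trained on Common Crawl, but according to the spaCy documentation the small pipelines ship without static word vectors, so `Doc.vector` is derived from the pipeline's context-sensitive tensors instead.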

In [152]:
# compute embeddings from the messages using spacy
nlp = spacy.load('en_core_web_sm')
D['embed'] = D['content'].apply(lambda x: nlp(x).vector)

D.head(); f"each embedding is {len(D['embed'][3])} dimensional"

| | subject | content | embed |
|---|---|---|---|
| 3 | intros | Hi! I am Jaan. I am teaching this course and t... | [0.029341862, -0.113542706, -0.1256793, -0.058... |
| 4 | intros | Hello Jaan,\nI am Nesma, a Ph.D. student at th... | [0.018824322, 0.02156415, -0.052159272, -0.119... |
| 5 | intros | welcome @**Nesma Mahmoud** ! thank you so much!! | [-0.40377116, -0.25813308, -0.35395262, -0.587... |
| 6 | intros | Hello, I am Indrek, a MSc student at Universit... | [0.035889816, -0.16209325, -0.13394497, -0.072... |
| 7 | intros | Hi, I'm Uku. I'm a research software engineer@... | [0.20111616, 0.032792266, -0.117814206, 0.0300... |


'each embedding is 96 dimensional'
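
spaCy documents `Doc.vector` as defaulting to an average of the token vectors, so every message is reduced to one fixed-size vector regardless of how many tokens it contains. The following dependency-free sketch illustrates only that pooling step. The token vectors are made-up toy values and `mean_pool` is a hypothetical helper, not part of spaCy.

```python
def mean_pool(token_vectors):
    """Average a list of equally sized token vectors into one document vector."""
    n_tokens = len(token_vectors)
    n_dims = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(n_dims)]


# Toy 2-dimensional "token vectors" for a three-token message (made-up values).
tokens = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
doc_vector = mean_pool(tokens)
print(doc_vector)  # [2.0, 2.0]
```

One plausible consequence of averaging is that the token count itself is not stored in the vector, so message length is at best indirectly encoded.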

### Low dimensional embeddings

We compute a 2D PCA projection of the embeddings.

In [163]:
_embeds = D['embed'].apply(pd.Series)


pca = PCA(n_components=2)
pca.fit(_embeds);
_embeds2d = pca.transform(_embeds)

D['PC-1'] = _embeds2d[:,0]
D['PC-2'] = _embeds2d[:,1]


alt.Chart(D, title="Messages in low dimensional embedding space"
).mark_circle(size=60).encode(
    x='PC-1',
    y='PC-2',
    tooltip='PC-1 PC-2 subject content'.split(' ')
).interactive().properties(width=400, height=400)

print(f"The explained variance ratios for each component are {pca.explained_variance_ratio_} (0 is bad, 1 is perfect).")


PCA(n_components=2)

The explained variance ratios for each component are [0.20808811 0.18072775] (0 is bad, 1 is perfect).


We can observe that the subject "linkedin profile" in particular is separated from the rest of the subjects by consistently high values in both embedding dimensions. It would be advisable to remove these messages from the dataset because they clearly have very different characteristics and also distort the distribution of the other samples. Nonetheless, we keep them because this very fact is instructive.
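
As a side note on the explained variance ratios printed above, the two components together capture well under half of the total variance, so the 2D plot shows only part of the structure in the embeddings. The small calculation below simply adds up the two printed values.

```python
# Explained variance ratios as printed by the PCA cell above.
explained = [0.20808811, 0.18072775]

captured = sum(explained)   # share of variance visible in the 2D plot
lost = 1.0 - captured       # share of variance in the discarded components
print(f"captured: {captured:.1%}, not shown in the plot: {lost:.1%}")
# captured: 38.9%, not shown in the plot: 61.1%
```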

## Predict Message Length from Embedding using Linear Regression

The mathematical formula for linear regression is

\begin{align*}
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n \\
&= \sum_{i=0}^{n} \beta_i x_i, \qquad x_0 = 1
\end{align*}

In this formula, $x_1, x_2, ..., x_n$ represent the independent variables (predictor variables, here the embedding components), $\beta_1, \beta_2, ..., \beta_n$ represent the coefficients associated with each independent variable, and $\beta_0$ is the intercept. Setting $x_0 = 1$ allows the intercept to be written as part of the sum. The equation sums up the product of each coefficient and its respective independent variable.
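
To make the formula concrete, here is a minimal sketch that evaluates it for a single feature vector. It uses the common convention of prepending a constant $x_0 = 1$ so that the intercept is handled by the same sum. The coefficients and features are hypothetical values chosen for easy arithmetic, and `predict_linear` is an illustrative helper rather than the scikit-learn implementation.

```python
def predict_linear(beta, x):
    """Return beta_0 + beta_1*x_1 + ... + beta_n*x_n for one feature vector x."""
    x_with_bias = [1.0] + list(x)  # prepend x_0 = 1 for the intercept
    return sum(b * xi for b, xi in zip(beta, x_with_bias))


beta = [2.0, 0.5, -1.0]  # hypothetical coefficients: intercept, beta_1, beta_2
x = [4.0, 3.0]           # hypothetical 2-dimensional feature vector
y_hat = predict_linear(beta, x)
print(y_hat)  # 2.0 + 0.5*4.0 - 1.0*3.0 = 1.0
```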

I choose linear regression for this task because we want to predict a (more or less) continuous variable (message length) from a set of features (embeddings). Linear regression is a simple model that is easy to interpret, easy to implement, and computationally efficient. It is also a good baseline for more complex models.

### Get message lengths

In [169]:
D['length'] = D['content'].apply(lambda x: len(x.split(' ')))  # precomputes the lengths of the messages

# plot a histogram of the message lengths, limit the x axis to 0 - 1000 words 
alt.Chart(D, title="Distribution of Message Lengths"
).mark_bar().encode(
    alt.X('length', bin=alt.Bin(maxbins=100)),
    y='count()',
    tooltip='length:Q'.split(' ')
).interactive().properties(width=600, height=400) | \
    \
alt.Chart(D, title="Message Lengths in 2D PCA-Space"
).mark_circle(size=60).encode(
    x='PC-1',
    y='PC-2',
    color=alt.Color('length:Q', scale=alt.Scale(domain=[0, 162])),  # we zoom in on most common lengths
    tooltip='PC-1 PC-2 length content'.split(' ')
).interactive().properties(width=400, height=400)

We observe that most messages are short. Only a few exceptions exceed 162 words. The longest message by far is almost 1000 words long; it is an essay for a university subject. We can also observe that longer messages lie along a straight baseline with low values in both embedding dimensions.

### Apply linear regression

In [176]:
# use linear regression to predict the length of a message from its embedding
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

reg = LinearRegression().fit(_embeds, D['length'])
lengths_linreg = reg.predict(_embeds)

# print the quality of the fit
print('The coefficient of determination of the prediction R^2 =', reg.score(_embeds, D['length']))

# plot the predicted lengths_linreg in comparison to the true lengths using altair
D['lengths_linreg'] = lengths_linreg
alt.Chart(D).mark_circle(size=60).encode(
    x='length',
    y='lengths_linreg',
    tooltip='length lengths_linreg subject content'.split(' '),
).interactive().properties(width=400, height=400) 

The coefficient of determination of the prediction R^2 = 0.5671495397703836


The coefficient of determination is moderate: the model explains about 57% of the variance in message length, measured on the same data it was fitted on. The best possible score is 1.0, and the score can be negative (because the
model can be arbitrarily worse). A constant model that always predicts
the expected value of $y$, disregarding the input features, would get
an $R^2$ score of 0.0 (source: scikit-learn docs).
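
The three reference points mentioned here (1.0, 0.0, and negative values) can be reproduced by computing $R^2 = 1 - SS_{res} / SS_{tot}$ by hand. This is a dependency-free sketch with made-up target values; `r_squared` is a hypothetical helper, not the scikit-learn function.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]                        # made-up targets, mean 6.0

perfect = r_squared(y_true, y_true)                  # exact predictions
constant = r_squared(y_true, [6.0] * 4)              # always predict the mean
worse = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])      # predictions in reversed order
print(perfect, constant, worse)  # 1.0 0.0 -3.0
```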

However, we can clearly see that our model fails to predict the very long outlier messages and also predicts negative message lengths for some messages. The output of a linear model is unbounded, and the model is not able to capture the non-linearities in the data.
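
Both failure modes can be reproduced on a toy problem. The sketch below fits an ordinary least squares line to one made-up feature with a single extreme outlier. The data and the helper `fit_line` are illustrative assumptions and have nothing to do with the Zulip embeddings.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


# Made-up data: four short "messages" and one extreme outlier.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [5.0, 6.0, 7.0, 8.0, 200.0]

intercept, slope = fit_line(xs, ys)
predictions = [intercept + slope * x for x in xs]

print(predictions[0] < 0)        # True: a negative "length" for the first point
print(predictions[-1] < ys[-1])  # True: the outlier is still underestimated
```

The outlier tilts the line so strongly that the short messages at the low end receive negative predictions, while the line still cannot reach the outlier itself.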

A regression line cannot be plotted in the 96-dimensional embedding space.

## Predict message subject from embedding using logistic regression

The mathematical formula for logistic regression is

\begin{align*}
P(y = 1 | \mathbf{x}) &= \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}} \\
&= \frac{1}{1 + e^{-(\boldsymbol{\beta} \cdot \mathbf{x})}}, \qquad x_0 = 1
\end{align*}


In this formula, $x_1, x_2, ..., x_n$ represent the independent variables (predictor variables), $\beta_1, \beta_2, ..., \beta_n$ represent the coefficients associated with each independent variable, and $\beta_0$ is the intercept, which is absorbed into the dot product by setting $x_0 = 1$. The logistic regression equation applies the logistic function (sigmoid function) to the linear combination of the coefficients and independent variables. The formula describes the binary case; with more than two classes, as in our task, the model estimates one probability per class.
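
The formula above covers two classes. With six subjects, one common generalization is the softmax function, which turns one linear score per class into probabilities that sum to 1. The sketch below is illustrative only: the scores are made up, and whether scikit-learn's `LogisticRegression` uses exactly this formulation depends on its version and settings, so treat that as an assumption.

```python
import math


def sigmoid(z):
    """Logistic function used in the binary formula."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn one linear score per class into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])   # made-up scores for three classes
binary = softmax([0.0, 1.5])[1]    # two classes with scores 0 and z = 1.5

print(abs(sum(probs) - 1.0) < 1e-12)        # True: probabilities sum to 1
print(abs(binary - sigmoid(1.5)) < 1e-12)   # True: softmax reduces to the sigmoid
```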

I choose logistic regression for this task, because we want to predict a categorical variable (message subject) from a set of features (embeddings). 
Logistic regression is chosen when the outcome variable is binary or categorical. It models the probability of the outcome belonging to a particular category based on the input variables. It is widely used for classification tasks and provides interpretable results.

### Visualize subjects by embeddings in 2D PCA space

In [182]:
# Visualize topics by embeddings in 2D (using PCA)
alt.Chart(D, title="Topics in 2D PCA-Space"
).mark_circle(size=60).encode(
    x='PC-1',
    y='PC-2',
    color='subject',
    tooltip='PC-1 PC-2 subject content'.split(' ')
).interactive().properties(width=400, height=400)

### Apply logistic regression

In [188]:
def to_numeric(col):
    classes = col.unique()
    class2int = {c:i for i,c in enumerate(classes)}
    return col.apply(lambda x: class2int[x]), classes, len(classes)


_subjects, subjects, _n_subjects = to_numeric(D['subject']) 

print(f"There are {_n_subjects} different topics to predict.")

# do the logistic regression
clf = LogisticRegression(random_state=0).fit(_embeds, _subjects)
_subjects_logreg = clf.predict(_embeds)

# print the quality of the fit (for a classifier, score() returns the mean accuracy, not R^2)
print('The mean accuracy of the prediction on the training data =', clf.score(_embeds, _subjects))

There are 6 different topics to predict.
The mean accuracy of the prediction on the training data = 0.8341013824884793


We observe a high accuracy, although it is measured on the same data the model was trained on. Now we plot the __normalized__ confusion matrix.

### Confusion matrix (normalized)

In [193]:
# plot the confusion matrix from the logistic regression using altair
from sklearn.metrics import confusion_matrix

confusion = confusion_matrix(_subjects, _subjects_logreg)
confusion = pd.DataFrame(confusion, index=subjects, columns=subjects)
confusion.index.name = 'true'
confusion.columns.name = 'predicted'

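# normalize each row (true subject) so that it sums to 1;
# axis=0 aligns the row sums with the index instead of the columns
confusion = confusion.div(confusion.sum(axis=1), axis=0)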

alt.Chart(confusion.reset_index().melt(id_vars='true'), title="Confusion Matrix of Subjects"
).mark_rect().encode(
    x='predicted:O',
    y=alt.Y('true:O', sort=alt.EncodingSortField(field='true', order='descending')),  # reverse the y axis
    color='value:Q',
    tooltip='true predicted value'.split(' ')
    
).interactive().properties(width=400, height=400)
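
For reference, row normalization of a confusion matrix means dividing every entry by the total of its row, so that each row (one true class) sums to 1 and its diagonal entry becomes the share of that class predicted correctly. The sketch below shows this without pandas on made-up counts. In pandas, note that dividing a DataFrame by a Series with `/` aligns the Series with the columns, whereas `DataFrame.div(..., axis=0)` aligns it with the rows.

```python
def normalize_rows(matrix):
    """Divide every entry by the sum of its row, so each row sums to 1."""
    return [[value / sum(row) for value in row] for row in matrix]


# Made-up counts: rows are true classes, columns are predicted classes.
counts = [[8, 2],
          [1, 9]]
normalized = normalize_rows(counts)
print(normalized)  # [[0.8, 0.2], [0.1, 0.9]]
```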


We can see solid prediction quality on the training data. The diagonal is very bright and the off-diagonal elements are very dark. This means that the model predicts the correct subject in most cases. The "linkedin profile" topic in particular is predicted very well. Most confusion happens with "main", which can be explained intuitively by the fact that "main" doesn't define a very specific topic. It is more of a fallback topic for messages that don't fit into any other topic.

Accuracy is lower for "Anaconda installation" and "random philosophy materials". This is probably because these topics have very few samples and are therefore not well represented in the dataset (analysis omitted).

Because the dimensionality of the embeddings (96) is of the same order of magnitude as the number of samples (~200), we can expect that the model is overfitting. Future work could reduce the dimensionality of the embeddings and apply stronger regularization to the model.
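
To put the training accuracy in context, it helps to compare it with a trivial baseline that always predicts the most frequent subject. The sketch below computes that baseline from the per-subject message counts listed in the summary statistics above. It assumes those counts still describe the data used for the classifier.

```python
# Messages per remaining subject, copied from the summary statistics above.
counts = {
    "main": 66,
    "plotting": 55,
    "intros": 44,
    "linkedin profile": 25,
    "Anaconda installation": 19,
    "random philosophy materials": 8,
}

n_samples = sum(counts.values())
majority_subject = max(counts, key=counts.get)
baseline_accuracy = counts[majority_subject] / n_samples
print(f"{n_samples} samples, always predicting '{majority_subject}': {baseline_accuracy:.3f}")
# 217 samples, always predicting 'main': 0.304
```

A classifier is only informative to the extent that it beats this value, ideally on data it has not seen during training.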

## Conclusion

We have seen that the message length can be partly predicted from the message embeddings with a linear regression model. The model is not able to capture the non-linearities in the data. As a result, it predicts negative message lengths for some messages and fails to predict the very long outlier messages.

In contrast, subjects are predicted very well. The model could be overfitting because of the high dimensionality of the embeddings. Even though the input dimensionality is the same as for the linear length prediction, we assume that logistic regression is more susceptible to overfitting because it fits a separate set of coefficients for each of the 6 subjects instead of a single one.
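
A rough parameter count illustrates this argument. The numbers below come from the text (96 embedding dimensions, 6 subjects) and from the per-subject message counts (217 messages in total). The count assumes one weight vector plus one intercept per class for the logistic model, which is an assumption about the parametrization rather than a statement about the fitted model.

```python
n_features = 96   # embedding dimensionality stated in the text
n_classes = 6     # subjects left after filtering
n_samples = 217   # 66 + 55 + 44 + 25 + 19 + 8 messages

linear_params = n_features + 1                    # weights plus intercept
logistic_params = n_classes * (n_features + 1)    # one such set per class (assumed)
print(linear_params, logistic_params)  # 97 582

# Samples available per fitted parameter in each model.
print(round(n_samples / linear_params, 2), round(n_samples / logistic_params, 2))
```

Under this assumption the logistic model has more parameters than there are messages, while the linear model has roughly two messages per parameter.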

Compared to previous runs, the fits turn out much better when low-data topics are excluded.