<a href="https://colab.research.google.com/github/ckruckenberg/Test/blob/master/Computer_assisted_open_coding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# **Tutorial** - Computer assisted open coding


This is a tutorial on computer-assisted open coding.

We will use the 20 Newsgroups dataset, a collection of posts from Usenet newsgroups, as our test data.

We will roughly follow the workflow from "Computational Grounded Theory Revisited", focusing on the phase of computer-assisted discovery.

The phases will be:

1. Discover themes using a BERTopic model

2. Inspect words and documents from relevant themes

3. Expand query sets for document retrieval using word2vec
  - Using netnography and the topic model output, build an initial keyword list for each relevant topic

  - Search for similar words using word2vec in order to extend the keyword/query list.


You are free to use models other than BERTopic and word2vec. The models should, however, serve the purpose of exploring different themes and supporting extensive sampling of relevant documents per theme.






## 1. Discover themes using BERTopic model

## BERTopic
BERTopic is a topic modeling technique that leverages 🤗 transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

This part of the tutorial is taken from the BERTopic tutorial.
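
To make the "class-based TF-IDF" idea more concrete, here is a minimal sketch in plain Python. It is not BERTopic's own code: the function name `class_tf_idf` and the toy token lists are made up for illustration. The weighting used (normalised term frequency within a topic, multiplied by `log(1 + A / f_t)`, where `A` is the average number of tokens per topic and `f_t` is the frequency of the term across all topics) is my reading of the formula described in the BERTopic documentation.

```python
import math
from collections import Counter

def class_tf_idf(docs_per_topic):
    """Toy c-TF-IDF. docs_per_topic maps a topic label to a list of token lists."""
    # Treat all documents of one topic as a single "class document".
    class_counts = {
        topic: Counter(token for doc in docs for token in doc)
        for topic, docs in docs_per_topic.items()
    }
    # Frequency of each term across all topics.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # Average number of tokens per topic.
    avg_tokens = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for topic, counts in class_counts.items():
        n_tokens = sum(counts.values())
        scores[topic] = {
            term: (freq / n_tokens) * math.log(1 + avg_tokens / total_counts[term])
            for term, freq in counts.items()
        }
    return scores

# Hypothetical, already tokenised documents grouped by topic.
toy = {
    "religion": [["god", "belief", "the"], ["atheism", "god", "the"]],
    "sports": [["game", "team", "the"], ["team", "hit", "the"]],
}
scores = class_tf_idf(toy)
top_religion = max(scores["religion"], key=scores["religion"].get)
print(top_religion)
```

In this toy example "the" is as frequent as "god" within the religion topic, but it scores lower because it also appears in the other topic. This is the property that keeps topic descriptions interpretable.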

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# **Installing BERTopic**

We start by installing BERTopic from PyPI:

In [None]:
%%capture
!pip install bertopic

## Restart the Notebook
Installing BERTopic updates some packages that were already loaded. To use the updated versions correctly, we should now restart the notebook.

From the Menu:

Runtime → Restart Runtime

# Data
For this example, we use the popular 20 Newsgroups dataset, which contains roughly 18,000 newsgroup posts.

In [None]:
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']


## Training

*This section is taken from BERTopic tutorial*

We start by instantiating BERTopic. We set language to `english` since our documents are in the English language. If you would like to use a multi-lingual model, please use `language="multilingual"` instead.

We will also calculate the topic probabilities. However, this can slow down BERTopic significantly on large datasets (more than 100,000 documents). If you want to speed up the model, it is advisable to turn this off.


In [None]:
from bertopic import BERTopic

topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(docs)

2025-05-12 07:41:48,431 - BERTopic - Embedding - Transforming documents to embeddings.


2025-05-12 07:42:44,305 - BERTopic - Embedding - Completed ✓
2025-05-12 07:42:44,306 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-05-12 07:43:21,512 - BERTopic - Dimensionality - Completed ✓
2025-05-12 07:43:21,514 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-05-12 07:44:13,245 - BERTopic - Cluster - Completed ✓
2025-05-12 07:44:13,262 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-05-12 07:44:16,614 - BERTopic - Representation - Completed ✓


**NOTE**: Use `language="multilingual"` to select a model that supports 50+ languages.

## Exploring Topics
After fitting our model, we can start by looking at the results. The model has located many topics.

As an example, I will focus on religious topics.

The task is then to look across the many proposed topics to see which topics, and which subsections of topics, are relevant to the study. For each relevant theme you can build a keyword list that fits that topic.

Remember that you can build one keyword list from words that belong to multiple different topics. You do not have to respect the boundaries of the topic model.

You can also read documents from topics that are not of central importance to your study but that might give important knowledge of the discursive context of the data site.
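
Reading documents from a given topic can be sketched as follows. This is only an illustration: the helper `docs_by_topic` and the toy lists are made up, and the sketch assumes that the `topics` list returned by `fit_transform` contains one topic id per document in the same order as `docs`, with `-1` marking outliers.

```python
from collections import defaultdict

def docs_by_topic(docs, topics):
    """Group documents by their assigned topic id."""
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        grouped[topic].append(doc)
    return grouped

# Toy stand-ins for `docs` and for the `topics` list returned by fit_transform.
toy_docs = [
    "God does not exist.",
    "The game went to extra innings.",
    "Atheism is not a belief.",
    "ok",
]
toy_topics = [3, 0, 3, -1]

grouped = docs_by_topic(toy_docs, toy_topics)

# Print the start of the first few documents assigned to topic 3.
for doc in grouped[3][:5]:
    print(doc[:200])
```

With the real data you would pass `docs` and `topics` instead of the toy lists and replace `3` with the id of the topic you want to read.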

In [None]:
topic_model.get_topic_info().head(30)

# Topics of potential relevance



| Topic | Count | Name | Representation | Representative_Docs |
|-------|-------|------|----------------|---------------------|
| -1 | 7291 | `-1_to_the_is_and` | `[to, the, is, and, of, for, you, it, in, that]` | `[[This is a co-authored report from two of us ...` |
| 0 | 700 | `0_he_year_game_hit` | `[he, year, game, hit, baseball, players, team,...` | `[I thought I'd post my predicted standings sin...` |
| 1 | 557 | `1_key_clipper_chip_encryption` | `[key, clipper, chip, encryption, keys, escrow,...` | `[I have an idea as to why the encryption algor...` |
| 2 | 527 | `2_ites_cheek_yall_reversed` | `[ites, cheek, yall, reversed, yep, huh, ken, i...` | `[\nYep.\n, \n\n Y'all got the first two...` |
| 3 | 469 | `3_israel_israeli_jews_arab` | `[israel, israeli, jews, arab, jewish, arabs, p...` | `[From: Center for Policy Research <cpr>\nSubje...` |
| 4 | 256 | `4_bike_riding_ride_lane` | `[bike, riding, ride, lane, my, you, passenger,...` | `[\nI'll tell you my story as an example of wha...` |
| 5 | 255 | `5_health_cancer_disease_tobacco` | `[health, cancer, disease, tobacco, email, news...` | `[------------- cut here -----------------\n \n...` |
| 6 | 246 | `6_ram_drive_sale_price` | `[ram, drive, sale, price, card, pc, monitor, o...` | `[I have an Altos 2000 System V.3 Unix system f...` |
| 7 | 199 | `7_car_cars_mustang_ford` | `[car, cars, mustang, ford, engine, v8, convert...` | `[\nA Yugo that will go 1/4mi in 7.7 seconds wi...` |
| 8 | 175 | `8_jpeg_image_gif_format` | `[jpeg, image, gif, format, images, file, files...` | `[Archive-name: jpeg-faq\nLast-modified: 16 May...` |


There are many relevant topics to investigate given my interest in the theme of religious talk and discussion. For now I'll illustrate the steps with a focus on "atheism".

In [None]:
# Look at top words for a given topic for search terms
topic_model.get_topic(15)


[('atheists', np.float64(0.019080797378681874)),
 ('atheism', np.float64(0.01563451669984829)),
 ('god', np.float64(0.014224656846575017)),
 ('atheist', np.float64(0.011300653945306884)),
 ('belief', np.float64(0.010277368912649099)),
 ('argument', np.float64(0.009823012110236645)),
 ('theism', np.float64(0.008485751249042106)),
 ('believe', np.float64(0.008387619483953494)),
 ('beliefs', np.float64(0.0077390566069442095)),
 ('fallacy', np.float64(0.0076504391509183926))]

In [None]:
# Below is my search term list after reviewing topic 15 for relevant words
topic_atheism_search_words = ['atheists','atheism','atheist','religion','fallacy']

# **Search Topics**
After training our model, we can use `find_topics` to search for topics that are similar
to an input search term. This is an alternative way to extend keywords, as opposed to word2vec (introduced below).

Here, we search for topics that closely relate to the
search term "atheism". We then look at the most similar topics and inspect one of them:


In [None]:
similar_topics, similarity = topic_model.find_topics("atheism", top_n=5); similar_topics

[15, 69, 103, 201, 162]
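
As far as I understand it, `find_topics` embeds the search term with the same embedding model that was used for the documents, and then ranks topics by the cosine similarity between the query embedding and each topic embedding. The sketch below illustrates that ranking step only. The functions `cosine` and `rank_topics` and the three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_topics(query_vec, topic_vecs, top_n=2):
    """Return the top_n (topic, similarity) pairs, most similar first."""
    scored = [(topic, cosine(query_vec, vec)) for topic, vec in topic_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical topic embeddings and query embedding.
toy_topic_vecs = {
    "religion": [0.9, 0.1, 0.0],
    "sports": [0.0, 0.2, 0.9],
    "philosophy": [0.7, 0.6, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

ranked = rank_topics(toy_query, toy_topic_vecs)
print(ranked)
```

A high similarity only says that the topic lies close to the search term in embedding space. Whether the topic is substantively relevant still has to be judged by reading its words and documents.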

In [None]:
topic_model.get_topic(162)

[('truth', np.float64(0.022142171124167435)),
 ('arrogance', np.float64(0.019179677132954227)),
 ('bible', np.float64(0.018463341046561055)),
 ('ra', np.float64(0.01446112452696658)),
 ('arrogant', np.float64(0.013881599138833989)),
 ('god', np.float64(0.013163224275234128)),
 ('belief', np.float64(0.012284369773923772)),
 ('absolute', np.float64(0.012070015127345539)),
 ('beliefs', np.float64(0.010164258553855075)),
 ('jesus', np.float64(0.009526913998065288))]

After exploring the top words of the different topics, I was unsure how these topics relate to "atheism". They might be related because they share words indicating that core religious beliefs and positions are being discussed. I would have to read some of the documents to find out whether the relation to "atheism" is direct enough that it makes sense to use some of these words to sample documents, and maybe even to update my understanding of the "atheism" theme.

### Extend keywords using word2vec

Now we will extend our keyword list by using word2vec to search for words that are similar to the ones we already have.

In [None]:
# prompt: install gensim

# Beware that gensim is using an older version of numpy

!pip uninstall numpy -y
!pip install numpy==1.23.5 # Downgrade to a compatible version
!pip install gensim

Found existing installation: numpy 1.23.5
Uninstalling numpy-1.23.5:
  Successfully uninstalled numpy-1.23.5
Collecting numpy==1.23.5
  Using cached numpy-1.23.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.3 kB)
Using cached numpy-1.23.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Installing collected packages: numpy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jaxlib 0.5.1 requires numpy>=1.25, but you have numpy 1.23.5 which is incompatible.
treescope 0.1.9 requires numpy>=1.25.2, but you have numpy 1.23.5 which is incompatible.
bigframes 2.1.0 requires numpy>=1.24.0, but you have numpy 1.23.5 which is incompatible.
imbalanced-learn 0.13.0 requires numpy<3,>=1.24.3, but you have numpy 1.23.5 which is incompatible.
jax 0.5.2 requires numpy>=1.25, but you have numpy 1.23.5 which is incompatible.
albume



In [None]:
# prompt: train a word2vec model on the fetch_20newsgroups data

from gensim.models import Word2Vec
import nltk
nltk.download('punkt_tab')


# Preprocess the documents
tokenized_docs = []
for doc in docs:
  sentences = nltk.sent_tokenize(doc)
  for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)
    tokenized_docs.append(tokens)

# Train the Word2Vec model
model = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=1, workers=4)



[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:
## Now we have a model where each word has a similarity score to other words
## we can exploit this to search for new words relevant for our theme.

model.wv.most_similar("atheists", topn=10)

[('Christians', 0.884204089641571),
 ('theists', 0.8332191109657288),
 ('muslims', 0.7910083532333374),
 ('homosexuals', 0.7871225476264954),
 ('humans', 0.7809191942214966),
 ('religions', 0.7733612060546875),
 ('Catholics', 0.7622609734535217),
 ('Protestants', 0.7546778917312622),
 ('believers', 0.7504454255104065),
 ('conservatives', 0.7489081621170044)]

In [None]:
### We can also see which words are similar to a list of words

model.wv.most_similar(topic_atheism_search_words, topn=10)

[('false', 0.8651381134986877),
 ('objective', 0.8579715490341187),
 ('morality', 0.8552519083023071),
 ('conscience', 0.8470013737678528),
 ('belief', 0.8448202610015869),
 ('Christianity', 0.8444059491157532),
 ('god', 0.837492823600769),
 ('substance', 0.8347828388214111),
 ('argument', 0.8331798315048218),
 ('principle', 0.8322663903236389)]
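
One possible way to turn such similarity lists into candidate keywords is sketched below. The helper `expand_keywords` and the threshold of 0.8 are assumptions for illustration, not part of gensim. The `neighbours` dictionary stands in for the result of calling `model.wv.most_similar(seed, topn=...)` once per seed word; the scores are copied (rounded) from the `most_similar("atheists")` output above.

```python
def expand_keywords(seeds, neighbours, min_similarity=0.8):
    """Return candidate words that are close to at least one seed word."""
    seen = {word.lower() for word in seeds}
    candidates = {}
    for seed in seeds:
        for word, score in neighbours.get(seed, []):
            key = word.lower()
            # Skip words we already have and words below the threshold.
            if key in seen or score < min_similarity:
                continue
            candidates[key] = max(score, candidates.get(key, 0.0))
    # Most similar candidates first.
    return sorted(candidates.items(), key=lambda pair: pair[1], reverse=True)

# Stand-in for word2vec output: seed word -> list of (word, similarity).
neighbours = {
    "atheists": [
        ("Christians", 0.884),
        ("theists", 0.833),
        ("muslims", 0.791),
        ("believers", 0.750),
    ],
}

candidates = expand_keywords(["atheists", "atheism"], neighbours)
print(candidates)
```

The result is a list of candidates for manual review, not a final keyword list. Words such as "christians" are close to "atheists" in the vector space because they appear in similar contexts, which does not by itself make them good search terms for the "atheism" theme.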

## Retrieve documents for reading

In [None]:
# Retrieve documents that contain at least one of the search terms.

import pandas as pd

# Create a DataFrame from the documents
df = pd.DataFrame({'document': docs})

# Convert the search terms to lowercase for case-insensitive matching
search_terms_lower = [term.lower() for term in topic_atheism_search_words]

# Check whether any search term occurs in a document.
# Note: this is plain substring matching, so 'atheist' also matches 'atheistic'.
def contains_search_term(doc, search_terms):
    doc_lower = doc.lower()
    for term in search_terms:
        if term in doc_lower:
            return True
    return False


# Apply the function to create a boolean column indicating matches;
# the search terms are passed explicitly through `args`
df['contains_search_term'] = df['document'].apply(contains_search_term, args=(search_terms_lower,))

# Filter the DataFrame to include only documents that match at least one search term
matching_docs_df = df[df['contains_search_term']]




In [None]:
# Remember to check number of docs and their relevance before exporting

len(matching_docs_df)

In [None]:
## inspecting different docs

matching_docs_df.document.iloc[8]

'\n\nOh, and us with the big degrees don\'t got imagination, huh?\n\nThe alleged dichotomy between imagination and knowledge is one of the most\npernicious fallacys of the New Age.  Michael, thanks for the generous\noffer, but we have quite enough dreams of our own, thank you.\n\nYou, on the other hand, are letting your own dreams go to waste by\nfailing to get the maths/thermodynamics/chemistry/(your choices here)\nwhich would give your imagination wings.\n\nJust to show this isn\'t a flame, I leave you with a quote from _Invasion of \nthe Body Snatchers_:\n\n"Become one of us; it\'s not so bad, you know"'
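
The document shown above contains none of the search terms as a separate word. It was retrieved because "fallacys" contains "fallacy" as a substring. If that kind of match is unwanted, a stricter alternative is whole-word matching with a regular expression. The sketch below is one possible variant; the helper name `compile_terms` and the example sentences are made up for illustration.

```python
import re

def compile_terms(search_terms):
    """Build one case-insensitive pattern that matches any term as a whole word."""
    escaped = (re.escape(term) for term in search_terms)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", flags=re.IGNORECASE)

pattern = compile_terms(["atheists", "atheism", "atheist", "religion", "fallacy"])

examples = [
    "Atheism is not a religion.",
    "one of the most pernicious fallacys of the New Age",
]
matches = [bool(pattern.search(text)) for text in examples]
print(matches)
```

On the full data this could be applied with `df['document'].apply(lambda d: bool(pattern.search(d)))`. The trade-off is that whole-word matching also misses inflected forms such as "religions" unless they are added to the keyword list, so which variant fits best depends on how broad the sample should be.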

## Below is some extra code from the tutorial

Find the full tutorial here:

https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing



## Attributes

There are a number of attributes that you can access after having trained your BERTopic model:


| Attribute | Description |
|------------------------|---------------------------------------------------------------------------------------------|
| topics_               | The topics that are generated for each document after training or updating the topic model. |
| probabilities_ | The probabilities that are generated for each document if HDBSCAN is used. |
| topic_sizes_           | The size of each topic                                                                      |
| topic_mapper_          | A class for tracking topics and their mappings anytime they are merged/reduced.             |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values.                             |
| c_tf_idf_              | The topic-term matrix as calculated through c-TF-IDF.                                       |
| topic_labels_          | The default labels for each topic.                                                          |
| custom_labels_         | Custom labels for each topic as generated through `.set_topic_labels`.                                                               |
| topic_embeddings_      | The embeddings for each topic if `embedding_model` was used.                                                              |
| representative_docs_   | The representative documents for each topic if HDBSCAN is used.                                                |

For example, to access the predicted topics for the first 10 documents, we simply run the following: