### Virtual Environment
The environment is defined in `tweet_topic_modeling_environment.yml`, which was exported from the command line with:

* !`conda env export --from-history > tweet_topic_modeling_environment.yml`

See [creating-an-environment-from-an-environment-yml-file](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-from-an-environment-yml-file) on Conda's site to load the environment.

Side note:
As I was adding packages to the environment, I found it useful to add conda-forge to my channels:
* `conda config --add channels conda-forge`
* `conda config --set channel_priority strict`
  
### Kernel
After activating the conda environment, make the kernel accessible (see [stackoverflow](https://stackoverflow.com/a/44786736)):
* !`python -m ipykernel install --user --name tweet_topic_modeling --display-name "tweet_topic_modeling"`

### IDE
Jupyter Lab is used and the "tweet_topic_modeling" kernel is selected. 

* !`jupyter lab`

### Data
Post Elon, the site formerly known as Twitter no longer makes tweet data freely available, so I am using data from one of my topic modeling projects from 2015 (see my [topic-modeling-201](https://github.com/DrSkippy/Data-Science-45min-Intros/tree/master/topic-modeling-201) repo). The data is 5000 tweets matching the search term "golden retriever", pulled from Twitter's free public API in 2015.

### Pip

I had to use `pip` to install a few libs that conda wouldn't give me:

* `python -m pip install langid`
* `python -m pip install langdetect`
  
### Method
The goal here was to see what ChatGPT might recommend for the topic modeling I did previously in [topic-modeling-201](https://github.com/DrSkippy/Data-Science-45min-Intros/tree/master/topic-modeling-201). I started with this prompt:

**prompt1**: "I have 5000 tweet's text in a python list from a 2015 version of Twitter's free public API using "golden retriever" as the search term. Can you split this list into a train and test set, train a topic model on the training set, then label the test set using the topic model?"

In [1]:
!conda env export --from-history

name: tweet_topic_modeling
channels:
  - conda-forge
  - defaults
dependencies:
  - python==3.11
  - pandas
  - jupyter
  - matplotlib
  - gensim
  - nltk
  - scikit-learn
prefix: /Users/lehman/opt/anaconda3/envs/tweet_topic_modeling


### Load Data

In [2]:
import pickle as pkl
with open('data/tweet_text.pkl', 'rb') as file:
    tweet_text = pkl.load(file)


### Explore data

In [None]:
type(tweet_text)

In [None]:
tweet_text[0:10]

### Initial Thoughts on the data
1. I **think** the "RT" in the text was an optional, and probably now outdated, way of marking a tweet as a retweet. I plan to remove it from the text b/c, for these purposes, I'm not concerned whether something is a retweet.
2. In the first 10 tweets, 8 had links, but only 1 link took me to a non-deleted tweet from a non-suspended user.
    * What should I do with links in general?
        * I think that I will remove them to focus purely on the text. 
    * What should I do with text that links to deleted or suspended content?
        * For the purpose of this topic modeling exercise, I'm not concerned that automated bots or humans made the content.
3. The @ mentions should be removed to clean the text
4. Punctuation could be removed to clean the text.
5. Should we start with stemming? I'm keen to get to the root of the topics and want to make things like "snuggler", "snuggles", "snuggling" into "snug" rather than each being an individual topic. So I may just start here.  
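
The cleaning steps listed above can be sketched with the standard library alone. This is a minimal illustration rather than the notebook's final pipeline: `clean_tweet` and the sample tweet are made up for the example, and stemming and stop-word removal are left out because they need NLTK.

```python
import re

def clean_tweet(text):
    """Strip URLs, @ mentions, the RT marker, and punctuation, then lowercase."""
    text = re.sub(r'http\S+', '', text)        # links
    text = re.sub(r'@\w+', '', text)           # user mentions
    text = re.sub(r'\bRT\b', '', text)         # retweet marker
    text = re.sub(r'[^a-zA-Z\s]', '', text)    # punctuation, digits, emoji
    return ' '.join(text.lower().split())      # lowercase and collapse whitespace

# Hypothetical tweet, invented for illustration
sample = "RT @dog_fan: My Golden Retriever loves snuggling!! http://t.co/abc123"
print(clean_tweet(sample))
```

The order matters a little: "RT" is removed before lowercasing so that an ordinary lowercase "rt" inside other text is not affected.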


### ChatGPT Prompt Update
**prompt2**: "I have text from 5000 tweets in a python list variable named "tweet_text".  The text was pulled from a 2015 version of Twitter's free public API using "golden retriever" as the search term. I'd like to remove urls, remove the word "RT", remove punctuation from the text, and remove common stop words. I'd also like to lowercase all of the text, and use stemming. Can you clean this data accordingly, split this list into a train and test set, train a topic model on the training set, and then label the test set using the topic model?"

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

In [None]:
import random
import re
import gensim
from gensim import corpora
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split

# Sample list of 5000 tweet texts
tweet_texts = tweet_text  # Your list of tweet texts here

# Preprocessing
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove user mentions
    text = re.sub(r'@\w+', '', text)
    # Remove "RT"
    text = re.sub(r'\bRT\b', '', text)
    # Remove punctuation and non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenization and lowercase
    words = word_tokenize(text.lower())
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    # Stemming
    words = [stemmer.stem(word) for word in words]
    return ' '.join(words)

processed_texts = [preprocess_text(text) for text in tweet_texts]

# Split data into train and test sets
train_texts, test_texts = train_test_split(processed_texts, test_size=0.2, random_state=42)

# Tokenize again for training LDA
train_tokens = [text.split() for text in train_texts]

# Create dictionary and corpus for topic modeling
dictionary = corpora.Dictionary(train_tokens)
train_corpus = [dictionary.doc2bow(tokens) for tokens in train_tokens]

# Perform topic modeling using LDA
lda_model = gensim.models.ldamodel.LdaModel(corpus=train_corpus,
                                            id2word=dictionary,
                                            num_topics=5,
                                            random_state=42,
                                            passes=10,
                                            per_word_topics=True)

# Label test set using the trained topic model
test_corpus = [dictionary.doc2bow(text.split()) for text in test_texts]

# Predict topics for test set
test_topic_labels = []
for doc in test_corpus:
    topic_distribution = lda_model.get_document_topics(doc)
    dominant_topic = max(topic_distribution, key=lambda x: x[1])[0]
    test_topic_labels.append(dominant_topic)

# Print example of test tweet with its topic label
for i in range(5):  # Print labels for the first 5 test tweets
    print("Test tweet:", test_texts[i])
    print("Topic label:", test_topic_labels[i])
    print("")

### Thoughts on topics
1. I'll remove tweets that are not unique and see what sort of dataset I have left.
2. I'm going to want to understand what the topics mean as well.
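
One detail worth noting when removing duplicates: a `set` of strings does not keep a stable order across Python sessions, so a later `train_test_split(..., random_state=42)` may split the data differently on each run. A small sketch of an order-preserving alternative, using made-up strings in place of the cleaned tweets (`unique_in_order` is a hypothetical helper):

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))

# Hypothetical cleaned tweets, invented for illustration
cleaned = ["golden retriev puppi", "love golden retriev", "golden retriev puppi"]
unique_cleaned = unique_in_order(cleaned)
print(len(cleaned), "->", len(unique_cleaned))
```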

### ChatGPT Prompt Update

**prompt3**: "I have text from 5000 tweets in a python list variable named "tweet_text".  The text was pulled from a 2015 version of Twitter's free public API using "golden retriever" as the search term. I'd like to remove urls, remove the words that start with @ (these are the user mentions), remove the word "RT", remove punctuation from the text, and remove common stop words. I'd also like to lowercase all of the text, and use stemming. I would then like to unique the list of tweets so we only train on unique tweets. Can you clean this data accordingly, split this list into a train and test set, train a topic model on the training set, and then label the test set using the topic model?"

In [None]:
import random
import re
import gensim
from gensim import corpora
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split

# Sample list of 5000 tweet texts
tweet_texts = tweet_text  # Your list of tweet texts here

# Preprocessing
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove user mentions
    text = re.sub(r'@\w+', '', text)
    # Remove "RT"
    text = re.sub(r'\bRT\b', '', text)
    # Remove punctuation and non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenization and lowercase
    words = word_tokenize(text.lower())
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    # Stemming
    words = [stemmer.stem(word) for word in words]
    return ' '.join(words)

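# Clean and unique the tweet texts, keeping first-seen order so the
# train/test split below is reproducible (set() order varies between runs)
unique_tweet_texts = list(dict.fromkeys(preprocess_text(text) for text in tweet_texts))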

# Split data into train and test sets
train_texts, test_texts = train_test_split(unique_tweet_texts, test_size=0.2, random_state=42)

# Tokenize again for training LDA
train_tokens = [text.split() for text in train_texts]

# Create dictionary and corpus for topic modeling
dictionary = corpora.Dictionary(train_tokens)
train_corpus = [dictionary.doc2bow(tokens) for tokens in train_tokens]

# Perform topic modeling using LDA
lda_model = gensim.models.ldamodel.LdaModel(corpus=train_corpus,
                                            id2word=dictionary,
                                            num_topics=5,
                                            random_state=42,
                                            passes=10,
                                            per_word_topics=True)

# Label test set using the trained topic model
test_corpus = [dictionary.doc2bow(text.split()) for text in test_texts]

# Predict topics for test set
test_topic_labels = []
for doc in test_corpus:
    topic_distribution = lda_model.get_document_topics(doc)
    dominant_topic = max(topic_distribution, key=lambda x: x[1])[0]
    test_topic_labels.append(dominant_topic)

# Print example of test tweet with its topic label
for i in range(5):  # Print labels for the first 5 test tweets
    print("Test tweet:", test_texts[i])
    print("Topic label:", test_topic_labels[i])
    print("")

# Print topics and their meanings
print("Topics and their meanings:")
for idx, topic in lda_model.print_topics(-1):
    print("Topic {}: {}".format(idx, topic))

### Thoughts on output 
1. What size is our dataset including train/test?
   * We went from 5000 tweets to 300 unique tweets.
1. Should I try an alternative method?
   * I'm going to explore this idea with ChatGPT in a few prompts, but BERT might be fun to try.
1. Are the topics very distinct? Meaning, are the probabilities for the tweets predicting one clear winner or are the probabilities roughly equal across all topics?
1. The meaning of the topics is opaque. How can we visualize the results?
   * I recall pyLDAvis was cool, but I'm keen to explore new ways so I'll also ask ChatGPT for recs.
1. Language? Do we need to remove tweets that are not in English?
   * For my own purposes, choosing English or Spanish would make the output more understandable to me. I'll choose English.
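
To check whether a tweet has one clear winning topic, one option is to look at the gap between its two highest topic probabilities. The sketch below assumes a distribution in the `(topic_id, probability)` form that `get_document_topics` returns; `top_two_margin` and the sample numbers are hypothetical.

```python
def top_two_margin(topic_distribution):
    """Return (dominant_topic, margin), where margin is the gap between the
    highest and second-highest topic probabilities."""
    ranked = sorted(topic_distribution, key=lambda pair: pair[1], reverse=True)
    best_topic, best_prob = ranked[0]
    runner_up_prob = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_topic, best_prob - runner_up_prob

# Hypothetical distributions, invented for illustration
clear_winner = [(0, 0.05), (1, 0.85), (2, 0.10)]
near_tie = [(0, 0.34), (1, 0.33), (2, 0.33)]

print(top_two_margin(clear_winner))
print(top_two_margin(near_tie))
```

A margin near zero suggests the model is not separating topics for that tweet. Note that gensim omits topics below its `minimum_probability` threshold, so the list for a tweet may be shorter than the number of topics.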

In [None]:
print(f"""------------------------------------------------

initial tweets:  {len(tweet_text)}
unique tweets:   {len(unique_tweet_texts)}
training tweets: {len(train_texts)}
test tweets:     {len(test_texts)}

total topics:    {lda_model.num_topics}
------------------------------------------------
""")

In [None]:
# Print 5 random test tweets with their topic labels and probabilities.
# Sample indices (not documents) so each tweet is paired with its own distribution.
for i in random.sample(range(len(test_corpus)), 5):
    print("Test tweet:", test_texts[i])
    topic_distribution = lda_model.get_document_topics(test_corpus[i])
    for topic, prob in topic_distribution:
        print("Topic label:", topic, "Probability:", prob)
    print("")

### Choose a language library
I want the library that leaves me with the most tweets. I'm not going to stress about accuracy at this point; I just want something to remove tweets that are not in English.
* I picked langid b/c it ran faster and left me with more tweets.
* The spot check didn't surface anything suspicious in either group.
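
A sketch of the comparison described above: run each detector over the same tweets and record how many it keeps and how long it takes. The `compare_filters` helper and the toy detectors are hypothetical stand-ins; in this notebook the callables would wrap `langdetect.detect` and `langid.classify`.

```python
import time

def compare_filters(tweets, detectors):
    """For each named detector (a callable returning a language code),
    report how many tweets are labeled 'en' and the elapsed seconds."""
    results = {}
    for name, detect_language in detectors.items():
        start = time.perf_counter()
        kept = [t for t in tweets if detect_language(t) == 'en']
        results[name] = {'kept': len(kept), 'seconds': time.perf_counter() - start}
    return results

# Toy stand-ins for real detectors, invented for illustration
toy_detectors = {
    'always_english': lambda text: 'en',
    'ascii_only': lambda text: 'en' if text.isascii() else 'other',
}
toy_tweets = ["my golden retriever", "mi perro es un golden retriever", "ゴールデンレトリバー"]
summary = compare_filters(toy_tweets, toy_detectors)
print(summary)
```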

In [None]:
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

# Filter out non-English tweets
english_tweets = []
for tweet in tweet_text:
    try:
        if detect(tweet) == 'en':
            english_tweets.append(tweet)
    except LangDetectException:
        pass  # Skip tweets with no detectable language features (e.g., empty tweets)

In [None]:
len(english_tweets)

In [None]:
import random
random.sample(english_tweets, 20)

In [None]:
import langid

# Filter out non-English tweets
english_tweets_langid = []
for tweet in tweet_text:
    lang, _ = langid.classify(tweet)
    if lang == 'en':
        english_tweets_langid.append(tweet)

print(len(english_tweets_langid))

In [None]:
random.sample(english_tweets_langid, 20)

### Run Model with updates

In [None]:
import random
import re
import gensim
import langid
from gensim import corpora
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split

# Sample list of 5000 tweet texts
tweet_texts = tweet_text  # Your list of tweet texts here

# Keep only tweets classified as english
english_tweet_texts = []
for tweet in tweet_texts:
    lang, _ = langid.classify(tweet)
    if lang == 'en':
        english_tweet_texts.append(tweet)

# Preprocessing
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove user mentions
    text = re.sub(r'@\w+', '', text)
    # Remove "RT"
    text = re.sub(r'\bRT\b', '', text)
    # Remove punctuation and non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenization and lowercase
    words = word_tokenize(text.lower())
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    # Stemming
    words = [stemmer.stem(word) for word in words]
    
    # Remove stopwords and specific terms
    #words = [word for word in words if word not in stop_words and word not in ['golden', 'retriev']]
    return ' '.join(words)

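# Clean and unique the tweet texts, keeping first-seen order so the
# train/test split below is reproducible (set() order varies between runs)
unique_english_tweet_texts = list(dict.fromkeys(preprocess_text(text) for text in english_tweet_texts))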

# Split data into train and test sets
train_texts, test_texts = train_test_split(unique_english_tweet_texts, test_size=0.2, random_state=42)

# Tokenize again for training LDA
train_tokens = [text.split() for text in train_texts]

# Create dictionary and corpus for topic modeling
dictionary = corpora.Dictionary(train_tokens)
train_corpus = [dictionary.doc2bow(tokens) for tokens in train_tokens]

# Perform topic modeling using LDA
lda_model = gensim.models.ldamodel.LdaModel(corpus=train_corpus,
                                            id2word=dictionary,
                                            num_topics=5,
                                            random_state=42,
                                            passes=10,
                                            per_word_topics=True)

# Label test set using the trained topic model
test_corpus = [dictionary.doc2bow(text.split()) for text in test_texts]

# Predict topics for test set
test_topic_labels = []
for doc in test_corpus:
    topic_distribution = lda_model.get_document_topics(doc)
    dominant_topic = max(topic_distribution, key=lambda x: x[1])[0]
    test_topic_labels.append(dominant_topic)

# Print example of test tweet with its topic label
for i in range(5):  # Print labels for the first 5 test tweets
    print("Test tweet:", test_texts[i])
    print("Topic label:", test_topic_labels[i])
    print("")

# Print topics and their meanings
print("Topics and their meanings:")
for idx, topic in lda_model.print_topics(-1):
    print("Topic {}: {}".format(idx, topic))

### Visualize Results
* My aim here is to explore how distinct the topics are and what they mean in a human-readable way.

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Generate word clouds for each topic
for idx, topic in lda_model.show_topics(formatted=False):
    word_freq = {word: freq for word, freq in topic}
    wordcloud = WordCloud(background_color='white').generate_from_frequencies(word_freq)
    plt.figure()
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.title('Topic {}'.format(idx))
    plt.axis('off')
    plt.show()

# Create bar plots of the highest-weighted terms in each topic
for idx, topic in lda_model.show_topics(formatted=False):
    terms, weights = zip(*topic)
    plt.figure(figsize=(8, 6))
    plt.barh(range(len(terms)), weights, align='center', color='skyblue')
    plt.yticks(range(len(terms)), terms)
    plt.gca().invert_yaxis()
    plt.xlabel('Term weight (probability within topic)')
    plt.title('Topic {}'.format(idx))
    plt.show()

### Thoughts on Visuals
1. 'golden' and 'retriev' may not need to be in the visuals; I'll add code to remove them but leave it commented out for now.
2. These topics may fit into a customer journey, with the Golden Retriever as the product taken from awareness to acquisition to loyalty, etc.
3. These visuals don't show how much the topics overlap.
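
One rough way to put a number on topic overlap is the Jaccard similarity between each pair of topics' top terms. This is a sketch under assumptions: `topic_overlap` is a hypothetical helper, and the term lists are invented rather than taken from the fitted model.

```python
from itertools import combinations

def topic_overlap(topic_terms):
    """Jaccard similarity between the top-term sets of every pair of topics."""
    overlaps = {}
    for (a, terms_a), (b, terms_b) in combinations(topic_terms.items(), 2):
        set_a, set_b = set(terms_a), set(terms_b)
        overlaps[(a, b)] = len(set_a & set_b) / len(set_a | set_b)
    return overlaps

# Hypothetical top terms per topic, invented for illustration
example_terms = {
    0: ['golden', 'retriev', 'puppi', 'love'],
    1: ['golden', 'retriev', 'adopt', 'rescu'],
    2: ['dog', 'cute', 'photo', 'love'],
}
overlaps = topic_overlap(example_terms)
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

Scores close to 1 mean two topics share most of their top terms. Because every tweet matched the same search term, some shared terms are expected.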

In [None]:
# import plotly.graph_objects as go

# # Extract topics and associated terms from the LDA model
# topics_terms = lda_model.show_topics(formatted=False)

# # Extract topic-term distributions for each topic
# topic_terms = {idx: [term for term, _ in topic] for idx, topic in topics_terms}

# # Create bar plots for each topic showing the top terms
# fig_terms = go.Figure()
# for idx, terms in topic_terms.items():
#     fig_terms.add_trace(go.Bar(x=terms, y=[1]*len(terms), name=f'Topic {idx}', orientation='h'))

# fig_terms.update_layout(title='Top Terms in Each Topic', barmode='stack', xaxis_title='Term', yaxis_title='Topic')
# fig_terms.show()

# # Extract topic-document distributions for the test set
# topic_distribution_test = [lda_model.get_document_topics(doc) for doc in test_corpus]

# # Create stacked bar plot showing topic distribution in test documents
# fig_distribution = go.Figure()
# for idx, topic_dist in enumerate(topic_distribution_test):
#     probs = [prob for _, prob in topic_dist]
#     fig_distribution.add_trace(go.Bar(x=[f'Topic {i}' for i in range(len(probs))], y=probs, name=f'Document {idx}'))

# fig_distribution.update_layout(title='Topic Distribution in Test Documents', barmode='stack', xaxis_title='Topic', yaxis_title='Probability')
# fig_distribution.show()

In [None]:
topic_terms = {}
for idx, topic in lda_model.print_topics(-1):
    terms = [term.split("*")[1].strip().strip('"') for term in topic.split("+")]
    topic_terms[f"Topic {idx}"] = terms

# Print topic-term distributions
for topic, terms in topic_terms.items():
    print(f"{topic}: {terms}")

topic_terms_with_counts = [{ 'topic': topic, 'count': len(terms) } for topic, terms in topic_terms.items()]


In [None]:
import random
import re
import gensim
import langid
from gensim import corpora
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
#from IPython.display import display, HTML
#display(HTML("<style>.container { width:100% !important; }</style>"))
# Sample list of 5000 tweet texts
tweet_texts = tweet_text  # Your list of tweet texts here

# Keep only tweets classified as english
english_tweet_texts = []
for tweet in tweet_texts:
    lang, _ = langid.classify(tweet)
    if lang == 'en':
        english_tweet_texts.append(tweet)

# Preprocessing
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove user mentions
    text = re.sub(r'@\w+', '', text)
    # Remove "RT"
    text = re.sub(r'\bRT\b', '', text)
    # Remove punctuation and non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenization and lowercase
    words = word_tokenize(text.lower())
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    # Stemming
    words = [stemmer.stem(word) for word in words]
    
    # Remove stopwords and specific terms
    #words = [word for word in words if word not in stop_words and word not in ['golden', 'retriev']]
    return ' '.join(words)

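# Clean and unique the tweet texts, keeping first-seen order so the
# train/test split below is reproducible (set() order varies between runs)
unique_english_tweet_texts = list(dict.fromkeys(preprocess_text(text) for text in english_tweet_texts))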

# Split data into train and test sets
train_texts, test_texts = train_test_split(unique_english_tweet_texts, test_size=0.2, random_state=42)

# Tokenize again for training LDA
train_tokens = [text.split() for text in train_texts]

# Create dictionary and corpus for topic modeling
dictionary = corpora.Dictionary(train_tokens)
train_corpus = [dictionary.doc2bow(tokens) for tokens in train_tokens]

# Perform topic modeling using LDA
lda_model = gensim.models.ldamodel.LdaModel(corpus=train_corpus,
                                            id2word=dictionary,
                                            num_topics=5,
                                            random_state=42,
                                            passes=10,
                                            per_word_topics=True)

# Prepare the visualization data
lda_display = gensimvis.prepare(lda_model, train_corpus, dictionary, sort_topics=False)

# Display the visualization
pyLDAvis.display(lda_display)

### pyLDAvis Thoughts
1. A few of the clusters seem to overlap on the topic map; I'll run LDA a few more times to see if randomness changes the positions.
2. I'm not seeing a rich "theme" in the topic content. I'm going to try to use BERT embeddings to represent the text data and then apply LDA to find topics.
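
One caveat before mixing the two approaches: LDA models word counts, so gensim's `LdaModel` takes each document as a bag of words, a list of `(token_id, count)` pairs, whereas BERT produces one dense float vector per tweet. The sketch below builds that bag-of-words shape by hand to make the difference concrete; `to_bag_of_words` is a hypothetical helper that mimics the output shape of `Dictionary.doc2bow`, not gensim's implementation.

```python
from collections import Counter

def to_bag_of_words(tokens, token_ids):
    """Convert a token list into sorted (token_id, count) pairs,
    assigning new ids to unseen tokens."""
    for token in tokens:
        token_ids.setdefault(token, len(token_ids))
    counts = Counter(token_ids[token] for token in tokens)
    return sorted(counts.items())

token_ids = {}
bow = to_bag_of_words(['golden', 'retriev', 'golden', 'puppi'], token_ids)
print(bow)
```

Dense embeddings carry no such counts, so they are usually paired with a clustering method rather than passed to LDA directly.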

In [None]:
import random
import re
import gensim
import langid
from gensim import corpora
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertModel
import torch

# Sample list of 5000 tweet texts
tweet_texts = tweet_text  # Your list of tweet texts here

# Keep only tweets classified as English
english_tweet_texts = [tweet for tweet in tweet_texts if langid.classify(tweet)[0] == 'en']

# Preprocessing
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove user mentions
    text = re.sub(r'@\w+', '', text)
    # Remove "RT"
    text = re.sub(r'\bRT\b', '', text)
    # Remove punctuation and non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenization and lowercase
    words = word_tokenize(text.lower())
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    # Stemming
    words = [stemmer.stem(word) for word in words]
    return ' '.join(words)

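# Clean and unique the tweet texts, keeping first-seen order so the
# train/test split below is reproducible (set() order varies between runs)
unique_english_tweet_texts = list(dict.fromkeys(preprocess_text(text) for text in english_tweet_texts))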

# Tokenize using BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Obtain BERT embeddings for each tweet
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_model.eval()

bert_embeddings = []

for text in unique_english_tweet_texts:
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():
        outputs = bert_model(**inputs)
    embeddings = torch.mean(outputs.last_hidden_state, dim=1).squeeze().numpy()
    bert_embeddings.append(embeddings)

# gensim's LdaModel expects a bag-of-words corpus of (token_id, count) pairs,
# so it cannot be trained on dense BERT vectors. As a swapped-in alternative,
# cluster the embeddings with k-means and treat each cluster as a "topic".
from sklearn.cluster import KMeans

# Each embedding is already a 1-D vector of length 768
embedding_inputs = [emb.flatten() for emb in bert_embeddings]

# Split data into train and test sets
train_inputs, test_inputs = train_test_split(embedding_inputs, test_size=0.2, random_state=42)

# Fit k-means on the training embeddings, then label the test embeddings
kmeans_model = KMeans(n_clusters=5, random_state=42, n_init=10)
kmeans_model.fit(train_inputs)
test_cluster_labels = kmeans_model.predict(test_inputs)


In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
from gensim.models import CoherenceModel

# Visualization 1: Word Clouds for each topic
def visualize_wordclouds(lda_model):
    topics = lda_model.show_topics(num_topics=-1, formatted=False)
    for topic_id, words in topics:
        word_freq = {word: freq for word, freq in words}
        wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq)
        plt.figure(figsize=(10, 6))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.title(f'Topic {topic_id}')
        plt.axis('off')
        plt.show()

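# Visualization 2: pyLDAvis
def visualize_pyldavis(lda_model, corpus, dictionary):
    from IPython.display import display
    lda_display = gensimvis.prepare(lda_model, corpus, dictionary, sort_topics=False)
    # pyLDAvis.display returns an HTML object; render it explicitly because
    # a value created inside a function is not auto-displayed by the notebook
    display(pyLDAvis.display(lda_display))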

# Visualization 3: Topic Coherence
def compute_coherence(lda_model, corpus, dictionary):
    coherence_model_lda = CoherenceModel(model=lda_model, texts=train_tokens, dictionary=dictionary, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    print(f'Topic Coherence Score: {coherence_lda}')

# Visualize word clouds for each topic
visualize_wordclouds(lda_model)

# Visualize topics using pyLDAvis
visualize_pyldavis(lda_model, train_corpus, dictionary)

# Compute and print topic coherence score
compute_coherence(lda_model, train_corpus, dictionary)

In [3]:
1+1

2

In [None]:
import random
import re
import gensim
import langid
from gensim import corpora
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertModel
import torch
import time

start = time.time()

# Sample list of 5000 tweet texts
tweet_texts = tweet_text  # Your list of tweet texts here

# Keep only tweets classified as English
english_tweet_texts = [tweet for tweet in tweet_texts if langid.classify(tweet)[0] == 'en']

end = time.time()
print("Time taken for data preprocessing:", end - start, "seconds")

start = time.time()

# Preprocessing
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove user mentions
    text = re.sub(r'@\w+', '', text)
    # Remove "RT"
    text = re.sub(r'\bRT\b', '', text)
    # Remove punctuation and non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenization and lowercase
    words = word_tokenize(text.lower())
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    # Stemming
    words = [stemmer.stem(word) for word in words]
    return ' '.join(words)

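# Clean and unique the tweet texts, keeping first-seen order so the
# train/test split below is reproducible (set() order varies between runs)
unique_english_tweet_texts = list(dict.fromkeys(preprocess_text(text) for text in english_tweet_texts))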

end = time.time()
print("Time taken for text preprocessing:", end - start, "seconds")

start = time.time()

# Tokenize using BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Obtain BERT embeddings for each tweet
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_model.eval()

bert_embeddings = []

for text in unique_english_tweet_texts:
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():
        outputs = bert_model(**inputs)
    embeddings = torch.mean(outputs.last_hidden_state, dim=1).squeeze().numpy()
    bert_embeddings.append(embeddings)

end = time.time()
print("Time taken for BERT embeddings:", end - start, "seconds")

start = time.time()

# gensim's LdaModel expects a bag-of-words corpus of (token_id, count) pairs,
# so it cannot be trained on dense BERT vectors. As a swapped-in alternative,
# cluster the embeddings with k-means and treat each cluster as a "topic".
from sklearn.cluster import KMeans

# Each embedding is already a 1-D vector of length 768
embedding_inputs = [emb.flatten() for emb in bert_embeddings]

end = time.time()
print("Time taken for data conversion:", end - start, "seconds")

start = time.time()

# Split data into train and test sets
train_inputs, test_inputs = train_test_split(embedding_inputs, test_size=0.2, random_state=42)

end = time.time()
print("Time taken for data splitting:", end - start, "seconds")

start = time.time()

# Fit k-means on the training embeddings, then label the test embeddings
kmeans_model = KMeans(n_clusters=5, random_state=42, n_init=10)
kmeans_model.fit(train_inputs)
test_cluster_labels = kmeans_model.predict(test_inputs)

end = time.time()
print("Time taken for k-means clustering:", end - start, "seconds")

import matplotlib.pyplot as plt
from wordcloud import WordCloud
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
from gensim.models import CoherenceModel

start = time.time()

# Visualization 1: Word Clouds for each topic
def visualize_wordclouds(lda_model):
    topics = lda_model.show_topics(num_topics=-1, formatted=False)
    for topic_id, words in topics:
        word_freq = {word: freq for word, freq in words}
        wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq)
        plt.figure(figsize=(10, 6))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.title(f'Topic {topic_id}')
        plt.axis('off')
        plt.show()

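# Visualization 2: pyLDAvis
def visualize_pyldavis(lda_model, corpus, dictionary):
    from IPython.display import display
    lda_display = gensimvis.prepare(lda_model, corpus, dictionary, sort_topics=False)
    # pyLDAvis.display returns an HTML object; render it explicitly because
    # a value created inside a function is not auto-displayed by the notebook
    display(pyLDAvis.display(lda_display))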

# Visualization 3: Topic Coherence
def compute_coherence(lda_model, corpus, dictionary):
    coherence_model_lda = CoherenceModel(model=lda_model, texts=train_tokens, dictionary=dictionary, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    print(f'Topic Coherence Score: {coherence_lda}')

# Visualize word clouds for each topic
visualize_wordclouds(lda_model)

end = time.time()
print("Time taken for word cloud visualization:", end - start, "seconds")

start = time.time()

# Visualize topics using pyLDAvis
visualize_pyldavis(lda_model, train_corpus, dictionary)

end = time.time()
print("Time taken for pyLDAvis visualization:", end - start, "seconds")

start = time.time()

# Compute and print topic coherence score
compute_coherence(lda_model, train_corpus, dictionary)

end = time.time()
print("Time taken for coherence score computation:", end - start, "seconds")

Time taken for data preprocessing: 3.6213109493255615 seconds
Time taken for text preprocessing: 0.5432548522949219 seconds
