<a href="https://colab.research.google.com/github/davidelgas/DataSciencePortfolio/blob/main/Language_Models/NLP_with_LDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Topic Modeling with Latent Dirichlet Allocation (LDA)
Reference (Sievert & Shirley, 2014, the LDAvis paper): https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf

# Introduction

In this notebook I use LDA to identify distinct topics in a corpus of forum thread titles.

# Provision Environment

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Install all necessary packages
!pip install numpy==1.24.3
!pip install pandas
!pip install nltk
!pip install gensim==4.3.0
!pip install pyLDAvis
!pip install sumy
!pip install snowflake-connector-python
!pip install requests
!pip install beautifulsoup4

Collecting numpy==1.24.3
  Using cached numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Using cached numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.4
    Uninstalling numpy-1.24.4:
      Successfully uninstalled numpy-1.24.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pyfume 0.3.4 requires numpy==1.24.4, but you have numpy 1.24.3 which is incompatible.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 1.5.3 which is incompatible.
xarray 2025.1.2 requires pandas>=2.1, but you have pandas 1.5.3 which is incompatible.
treescope 0.1.9 requires numpy>=1.25.2, but you have numpy 1.24.3 which is incompatible.
mizani 0.13.1 requires pandas>=2.2.0, but you have pa

Collecting numpy>=1.18.5 (from gensim==4.3.0)
  Using cached numpy-1.24.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Using cached numpy-1.24.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.3
    Uninstalling numpy-1.24.3:
      Successfully uninstalled numpy-1.24.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 1.5.3 which is incompatible.
xarray 2025.1.2 requires pandas>=2.1, but you have pandas 1.5.3 which is incompatible.
treescope 0.1.9 requires numpy>=1.25.2, but you have numpy 1.24.4 which is incompatible.
mizani 0.13.1 requires pandas>=2.2.0, but you have pandas 1.5.3 which is incompatible.
pymc 5.21.1 requires numpy>=1.2

Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl.metadata (4.2 kB)
Collecting pandas>=2.0.0 (from pyLDAvis)
  Downloading pandas-2.2.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
Collecting funcy (from pyLDAvis)
  Downloading funcy-2.0-py2.py3-none-any.whl.metadata (5.9 kB)
INFO: pip is looking at multiple versions of pyfume to determine which version is compatible with other requirements. This could take a while.
Collecting pyfume (from FuzzyTM>=0.4.0->gensim->pyLDAvis)
  Downloading pyFUME-0.3.1-py3-none-any.whl.metadata (9.7 kB)
Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Downloading pandas-2.2.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)

In [3]:
# Import all necessary libraries
import os
import numpy as np
import pandas as pd
import nltk
import random
import re
import requests
from bs4 import BeautifulSoup
import snowflake.connector

# Download NLTK resources
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

# NLTK imports
from nltk.tokenize import RegexpTokenizer, sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet, stopwords

# Gensim imports
from gensim import corpora
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

# PyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Sumy libraries
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.nlp.tokenizers import Tokenizer

print("All imports successful!")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


All imports successful!
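
The install log above shows pip replacing the pinned numpy 1.24.3 with 1.24.4 while installing gensim, and pyLDAvis pulling in pandas 2.2.3, so the versions actually in use may differ from the ones requested. A minimal sketch for checking what ended up installed, using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package_name: version string, or None if the package is missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the install cell; extend the list as needed.
for name, ver in installed_versions(["numpy", "pandas", "gensim", "pyLDAvis"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this right after the import cell makes it easier to tell whether a later failure comes from the code or from a version that pip silently changed.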


# Corpus

The corpus was assembled by using Beautiful Soup to scrape a public forum dedicated to the BMW E9 (www.e9coupe.com). The data was compiled and stored in a Snowflake database for use in multiple NLP projects, including LDA, RAG, GRU and LSTM. Future ideas include supplementing the forum text with an existing user's guide specific to this model.

In [4]:
path_to_credentials = '/content/drive/Othercomputers/My Mac/Git/credentials/snowflake_credentials.txt'
lda_model_path = '/content/drive/Othercomputers/My Mac/Git/Language_Models/lda_model'
dictionary_path = '/content/drive/Othercomputers/My Mac/Git/Language_Models/datasets'



## Load data

In [5]:
# Data is stored in Snowflake

# Main sequence

def load_credentials(path_to_credentials):
    with open(path_to_credentials, 'r') as file:
        for line_num, line in enumerate(file, start=1):
            line = line.strip()
            if line and '=' in line:
                # Split on the first '=' only, so values that contain '=' stay intact
                key, value = line.split('=', 1)
                os.environ[key] = value
            else:
                print(f"Issue with line {line_num} in {path_to_credentials}: '{line}'")

def fetch_data_from_snowflake():
    conn = snowflake.connector.connect(
        user=os.environ.get('USER'),
        password=os.environ.get('PASSWORD'),
        account=os.environ.get('ACCOUNT'),
    )

    cur = conn.cursor()

    query = """
    SELECT THREAD_TITLE, THREAD_FIRST_POST FROM "E9_CORPUS"."E9_CORPUS_SCHEMA"."E9_FORUM_CORPUS";
    """
    cur.execute(query)
    e9_forum_corpus = cur.fetch_pandas_all()

    cur.close()
    conn.close()

    return e9_forum_corpus




# Load credentials
# load_credentials(path_to_credentials)

# Fetch data from Snowflake
# df_raw = fetch_data_from_snowflake()

# df_tensorboard = df_raw.copy()

# df_tensorboard.to_csv('/content/drive/MyDrive/Colab Notebooks/NLP/NLP_with_Tensorboard/df_tensorboard.csv', index=False)
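
`load_credentials` above writes each `KEY=VALUE` pair into `os.environ`, which means a key such as `USER` overwrites an environment variable that usually already exists. As an alternative, here is a minimal sketch that parses the same kind of file into a plain dictionary. The function name `parse_credentials`, the sample values, and the decision to skip `#` comment lines are assumptions for illustration; the real credentials file is not shown in this notebook.

```python
def parse_credentials(lines):
    """Parse KEY=VALUE lines into a dict, splitting only on the first '='."""
    credentials = {}
    for line_num, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"Line {line_num} is not KEY=VALUE: {line!r}")
        credentials[key.strip()] = value.strip()
    return credentials


# Hypothetical sample content; a real file would be read with open(path).
sample = ["USER=example_user", "PASSWORD=abc=123==", "", "ACCOUNT=example_account"]
creds = parse_credentials(sample)
print(creds["PASSWORD"])  # abc=123==
```

The resulting dictionary could then be passed to the connector directly, for example `user=creds["USER"]`, instead of going through environment variables.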


In [6]:
#Process data for LDA

lda_num_topics = 10

# Create a longer form "TITLE" for potential use as a QUESTION
def engineer_data(e9_forum_corpus):
    e9_forum_corpus['THREAD_TITLE_EXP'] = e9_forum_corpus['THREAD_TITLE'] + " " + e9_forum_corpus['THREAD_FIRST_POST']
    return e9_forum_corpus

# Drop NA and Stopwords and Lemmatize
def preprocess_data(df):
    #df = df[['THREAD_TITLE_EXP']].copy()
    df = df[['THREAD_TITLE']].copy()
    df.dropna(inplace=True)
    df['THREAD_TITLE'] = df['THREAD_TITLE'].astype(str)

    #df['THREAD_TITLE_EXP'] = df['THREAD_TITLE_EXP'].astype(str)

    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english')).union({'car', 'csi', 'cs', 'csl', 'e9', 'coupe', 'http', 'https', 'www', 'ebay', 'bmw', 'html'})

    # Remove URLs
    def remove_urls(text):
        url_pattern = re.compile(r'https?://\S+|www\.\S+')
        return url_pattern.sub(r'', text)

    # Function to preprocess text
    def preprocess(text):
        text = remove_urls(text)
        return [lemmatizer.lemmatize(word) for word in text.lower().split() if word not in stop_words]

    #df['PROCESSED'] = df['THREAD_TITLE_EXP'].map(preprocess)
    df['PROCESSED'] = df['THREAD_TITLE'].map(preprocess)

    return df

def vectorize_data(df):
    dictionary = Dictionary(df['PROCESSED'])
    corpus = [dictionary.doc2bow(doc) for doc in df['PROCESSED']]
    return df, dictionary, corpus

def train_lda_model(corpus, dictionary, num_topics=lda_num_topics, random_state=42, passes=10):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=random_state, passes=passes)
    return lda

def review_topics(lda):
    for idx, topic in lda.print_topics(-1):
        print(f"Topic: {idx} \nWords: {topic}\n")

def assign_topics(lda, corpus, df):
    topics = [lda[doc] for doc in corpus]
    df['TOPICS'] = [[(int(topic[0]), float(topic[1])) for topic in doc] for doc in topics]  # Ensure topics are JSON serializable
    return df

def prepare_visualization_data(lda, corpus, dictionary):
    vis_data = gensimvis.prepare(lda, corpus, dictionary)
    return vis_data
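
The `preprocess` helper above tokenizes with `text.lower().split()`, so standalone punctuation such as `-`, `?` or `&` survives as a token and can end up among a topic's top words. A minimal sketch of a regex-based alternative is shown below. The name `tokenize_title` and the token pattern are assumptions, not the notebook's method, and lemmatization is left out to keep the example free of NLTK downloads.

```python
import re

# A token starts with a letter or digit and may contain inner '.' or "'" (e.g. "3.0").
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:[.'][a-z0-9]+)*")
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def tokenize_title(text, stop_words=frozenset()):
    """Lowercase, strip URLs, and keep only tokens that contain letters or digits."""
    text = URL_PATTERN.sub("", text.lower())
    return [tok for tok in TOKEN_PATTERN.findall(text) if tok not in stop_words]


print(tokenize_title("FS: 3.0 CS steering wheel - see www.example.com ?", {"cs"}))
# ['fs', '3.0', 'steering', 'wheel', 'see']
```

Model designations such as `3.0` stay intact, while punctuation-only tokens are dropped before they reach the dictionary.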



In [7]:
# Generate LDA

# New UDF to save the LDA model and dictionary
def save_lda_model_and_dictionary(lda, dictionary, lda_model_path, dictionary_path):
    # Create directories if they don't exist (os is already imported at the top of the notebook)
    os.makedirs(os.path.dirname(lda_model_path), exist_ok=True)
    os.makedirs(dictionary_path, exist_ok=True)

    # Save model - append a filename if lda_model_path is a directory
    if os.path.isdir(lda_model_path):
        lda_model_path = os.path.join(lda_model_path, 'lda_model')
    lda.save(lda_model_path)

    # Save dictionary - create a specific filename in the dictionary_path
    dictionary_file_path = os.path.join(dictionary_path, 'dictionary.dict')
    dictionary.save(dictionary_file_path)

    print(f"LDA model saved to: {lda_model_path}")
    print(f"Dictionary saved to: {dictionary_file_path}")

# Load credentials
load_credentials(path_to_credentials)

# Fetch data from Snowflake
e9_forum_corpus = fetch_data_from_snowflake()

# Engineer the data
e9_forum_corpus = engineer_data(e9_forum_corpus)

# Preprocess the data
df = preprocess_data(e9_forum_corpus)

# Vectorize the data
df, dictionary, corpus = vectorize_data(df)

# Train the LDA Model
lda = train_lda_model(corpus, dictionary)

# Save the LDA model and dictionary
save_lda_model_and_dictionary(lda, dictionary, lda_model_path, dictionary_path)

# Review the Topics
review_topics(lda)

# Redefine assign_topics (overrides the version from the previous cell):
# keep only the single most probable topic for each document
def assign_topics(lda, corpus, df):
    topic_assignments = []
    for row in lda[corpus]:
        row = sorted(row, key=lambda x: x[1], reverse=True)
        topic_assignments.append(row[0][0])
    df['Topic'] = topic_assignments
    return df

# Assign Documents to Topics
df = assign_topics(lda, corpus, df)

# Prepare the visualization data
vis_data = prepare_visualization_data(lda, corpus, dictionary)

# Visualize
pyLDAvis.display(vis_data)


LDA model saved to: /content/drive/Othercomputers/My Mac/Git/Language_Models/lda_model
Dictionary saved to: /content/drive/Othercomputers/My Mac/Git/Language_Models/datasets/dictionary.dict
Topic: 0 
Words: 0.032*"-" + 0.029*"sale" + 0.020*"3.0" + 0.016*"removal" + 0.014*"uk" + 0.014*"anyone" + 0.013*"2" + 0.010*"2000" + 0.010*"looking" + 0.010*"lock"

Topic: 1 
Words: 0.033*"question" + 0.018*"-" + 0.016*"3.0" + 0.015*"spotting" + 0.015*"light" + 0.014*"mirror" + 0.013*"factory" + 0.012*"head" + 0.011*"year" + 0.011*"fuel"

Topic: 2 
Words: 0.029*"front" + 0.029*"steering" + 0.029*"manual" + 0.019*"wheel" + 0.015*"question" + 0.013*"trunk" + 0.011*"3.0" + 0.011*"air" + 0.011*"&" + 0.010*"box"

Topic: 3 
Words: 0.017*"question" + 0.015*"watkins" + 0.013*"?" + 0.011*"glen" + 0.011*"tool" + 0.010*"sale" + 0.010*"vintage" + 0.009*"wiper" + 0.009*"friday" + 0.009*"w/"

Topic: 4 
Words: 0.024*"3.0" + 0.020*"rear" + 0.016*"1973" + 0.016*"need" + 0.014*"trim" + 0.014*"looking" + 0.012*"area" 
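
Assigning each document to a topic comes down to picking the highest-probability entry from a list of `(topic_id, probability)` pairs. A minimal sketch of that step in plain Python follows; the helper name `dominant_topic` and the sample distributions are hypothetical, and the empty-list guard is an added precaution rather than something the notebook's data is known to need.

```python
def dominant_topic(topic_probs):
    """Return the topic id with the highest probability, or None for an empty list."""
    if not topic_probs:
        return None
    return max(topic_probs, key=lambda pair: pair[1])[0]


# Hypothetical per-document distributions in (topic_id, probability) form.
doc_topics = [
    [(0, 0.05), (3, 0.80), (7, 0.15)],
    [(2, 0.55), (4, 0.45)],
    [],
]
print([dominant_topic(row) for row in doc_topics])  # [3, 2, None]
```

Using `max` avoids sorting the whole list when only the top entry is needed.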

In [8]:
#Create representative sentences for each topic


representative_sentences = []

def score_text_block(text, topic_words):
    if isinstance(text, str):
        return sum(1 for word in topic_words if word in text)
    return 0


def create_representative_sentence(lda, df, topic_id, top_n=20):
    topic_words = [word for word, prob in lda.show_topic(topic_id, topn=top_n)]
    #df['SCORE'] = df['THREAD_TITLE_EXP'].map(lambda x: score_text_block(x, topic_words))
    #representative_sentence = df.loc[df['SCORE'].idxmax(), 'THREAD_TITLE_EXP']
    df['SCORE'] = df['THREAD_TITLE'].map(lambda x: score_text_block(x, topic_words))
    representative_sentence = df.loc[df['SCORE'].idxmax(), 'THREAD_TITLE']

    return representative_sentence


for topic_id in range(lda_num_topics):
    try:
        sentence = create_representative_sentence(lda, e9_forum_corpus, topic_id)
        representative_sentences.append({'Topic': topic_id, 'Representative Sentence': sentence})
        print(f"Topic {topic_id}: {sentence}")
    except (IndexError, ValueError) as e:  # idxmax raises ValueError on an empty frame
        print(f"Error with topic {topic_id}: {e}")

# Save to CSV
output_df = pd.DataFrame(representative_sentences)
output_df.to_csv('/content/drive/MyDrive/Colab Notebooks/datasets/e9/representative_sentences.csv', index=False)


Topic 0: '72 CSL for sale on ebay.co.uk
Topic 1: newbie questions - squeak and fuel opinion
Topic 2: BMW E9 3.0cs 3.0csi 3.0csl sport spring set ??
Topic 3: Yet another one on the UK Bay - That old value thing!!!
Topic 4: FS: Set of outer trims front door/rear window ! new chrome !
Topic 5: new member - alpina refinish wheels paint or powder coat?
Topic 6: NOS BMW Roundel front raised letter for sale Merry Christmas
Topic 7: '69 2800cs batmobile wanna-be bucket back for sale
Topic 8: OC Bimmerfest Caravan Early Sat. a.m. 405 NB to 101 NB
Topic 9: FS: Set of outer trims front door/rear window ! new chrome !
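
Topics 4 and 9 return the same title above. One likely reason is that `create_representative_sentence` scores every title in the corpus, not only those assigned to the topic, and `score_text_block` counts case-sensitive substring matches against the raw title. A minimal sketch of a stricter selection is shown below, assuming a list of titles and a parallel list of dominant-topic assignments. The function name `pick_representative` and the toy data are hypothetical.

```python
def pick_representative(titles, assigned_topics, topic_id, topic_words):
    """Return the title assigned to topic_id that shares the most tokens with topic_words."""
    vocab = {word.lower() for word in topic_words}
    best_title, best_score = None, -1
    for title, assigned in zip(titles, assigned_topics):
        if assigned != topic_id or not isinstance(title, str):
            continue  # only consider titles whose dominant topic matches
        score = len(vocab & set(title.lower().split()))
        if score > best_score:
            best_title, best_score = title, score
    return best_title


# Hypothetical toy data standing in for the thread titles and their 'Topic' column.
titles = ["Rear window trim needed", "Steering wheel for sale", "Front trim and rear trim", None]
assigned = [4, 2, 4, 4]
print(pick_representative(titles, assigned, 4, ["rear", "trim", "front"]))
# Front trim and rear trim
```

Because candidates are limited to a topic's own documents, two topics can no longer share a representative title, and a topic with no assigned documents yields `None` instead of a misleading match.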
