<a href="https://colab.research.google.com/github/davidelgas/DataSciencePortfolio/blob/main/Language_Models/NLP_with_LDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Topic Modeling with Latent Dirichlet Allocation (LDA)
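Reference: Sievert, C. and Shirley, K. (2014), "LDAvis: A method for visualizing and interpreting topics": https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf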

# Introduction

In this notebook I use Latent Dirichlet Allocation (LDA) to identify distinct topics in a corpus of forum thread titles.

# Provision Environment

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Install all necessary packages
!pip install numpy==1.24.3
!pip install pandas
!pip install nltk
!pip install gensim==4.3.0
!pip install pyLDAvis
!pip install sumy
!pip install snowflake-connector-python
!pip install requests
!pip install beautifulsoup4

Collecting numpy==1.24.3
  Using cached numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Using cached numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.4
    Uninstalling numpy-1.24.4:
      Successfully uninstalled numpy-1.24.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 2.2.3 which is incompatible.
albumentations 2.0.5 requires numpy>=1.24.4, but you have numpy 1.24.3 which is incompatible.
jax 0.5.2 requires numpy>=1.25, but you have numpy 1.24.3 which is incompatible.
jax 0.5.2 requires scipy>=1.11.1, but you have scipy 1.10.1 which is incompatible.
tensorflow 2.18.0 requires numpy<2.1.0,>=1.26.0, but you 

In [1]:
# Import all necessary libraries
import os
import numpy as np
import pandas as pd
import nltk
import random
import re
import requests
from bs4 import BeautifulSoup
import snowflake.connector

# Download NLTK resources
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

# NLTK imports
from nltk.tokenize import RegexpTokenizer, sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet, stopwords

# Gensim imports
from gensim import corpora
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

# PyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Sumy libraries
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.nlp.tokenizers import Tokenizer

print("All imports successful!")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


All imports successful!


# Corpus

The corpus was assembled by scraping a public forum dedicated to the BMW E9 (www.e9coupe.com) with Beautiful Soup. The data was compiled and stored in a Snowflake database for use in multiple NLP projects, including LDA, RAG, GRU, and LSTM. Future ideas include supplementing the forum text with an existing user's guide specific to this model.

## Load data

In [2]:
# Data is stored in Snowflake

# Main sequence
path_to_credentials = '/content/drive/MyDrive/Colab Notebooks/credentials/snowflake_credentials.txt'



def load_credentials(path_to_credentials):
    """Read KEY=VALUE lines from a text file into os.environ."""
    with open(path_to_credentials, 'r') as file:
        for line_num, line in enumerate(file, start=1):
            line = line.strip()
            if line and '=' in line:
                # Split on the first '=' only, so values containing '=' stay intact
                key, value = line.split('=', 1)
                os.environ[key.strip()] = value.strip()
            else:
                print(f"Issue with line {line_num} in {path_to_credentials}: '{line}'")

def fetch_data_from_snowflake():
    conn = snowflake.connector.connect(
        user=os.environ.get('USER'),
        password=os.environ.get('PASSWORD'),
        account=os.environ.get('ACCOUNT'),
    )

    cur = conn.cursor()

    query = """
    SELECT THREAD_TITLE, THREAD_FIRST_POST FROM "E9_CORPUS"."E9_CORPUS_SCHEMA"."E9_FORUM_CORPUS";
    """
    cur.execute(query)
    e9_forum_corpus = cur.fetch_pandas_all()

    cur.close()
    conn.close()

    return e9_forum_corpus




# Load credentials
# load_credentials(path_to_credentials)

# Fetch data from Snowflake
# df_raw = fetch_data_from_snowflake()

# df_tensorboard = df_raw.copy()

# df_tensorboard.to_csv('/content/drive/MyDrive/Colab Notebooks/NLP/NLP_with_Tensorboard/df_tensorboard.csv', index=False)
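
The `load_credentials` helper above writes every `KEY=VALUE` pair into `os.environ`, which may overwrite variables such as `USER` that the runtime already defines. A minimal sketch of an alternative, assuming the same `KEY=VALUE` file layout, parses the lines into a plain dictionary instead. The function name `parse_credentials` and the sample values are hypothetical and not part of the original notebook.

```python
def parse_credentials(lines):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and '#' comments."""
    credentials = {}
    for line_num, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith('#'):
            continue
        if '=' not in line:
            print(f"Skipping malformed line {line_num}: {line!r}")
            continue
        # Split on the first '=' only, so values may themselves contain '='
        key, value = line.split('=', 1)
        credentials[key.strip()] = value.strip()
    return credentials


# Hypothetical sample input; a real file would be read with open(path).
sample = ["USER=demo_user", "PASSWORD=abc=123", "", "ACCOUNT = demo_account"]
creds = parse_credentials(sample)
print(creds)
```

The resulting dictionary could then be passed explicitly, for example `snowflake.connector.connect(user=creds['USER'], password=creds['PASSWORD'], account=creds['ACCOUNT'])`, which keeps the credentials out of the process environment.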


In [3]:
#Process data for LDA

# Create a longer form "TITLE" for potential use as a QUESTION
def engineer_data(e9_forum_corpus):
    e9_forum_corpus['THREAD_TITLE_EXP'] = e9_forum_corpus['THREAD_TITLE'] + " " + e9_forum_corpus['THREAD_FIRST_POST']
    return e9_forum_corpus

# Drop NA and Stopwords and Lemmatize
def preprocess_data(df):
    #df = df[['THREAD_TITLE_EXP']].copy()
    df = df[['THREAD_TITLE']].copy()
    df.dropna(inplace=True)
    df['THREAD_TITLE'] = df['THREAD_TITLE'].astype(str)

    #df['THREAD_TITLE_EXP'] = df['THREAD_TITLE_EXP'].astype(str)

    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english')).union({'car', 'csi', 'cs', 'csl', 'e9', 'coupe', 'http', 'https', 'www', 'ebay', 'bmw', 'html'})

    # Remove URLs
    def remove_urls(text):
        url_pattern = re.compile(r'https?://\S+|www\.\S+')
        return url_pattern.sub(r'', text)

    # Function to preprocess text
    def preprocess(text):
        text = remove_urls(text)
        return [lemmatizer.lemmatize(word) for word in text.lower().split() if word not in stop_words]

    #df['PROCESSED'] = df['THREAD_TITLE_EXP'].map(preprocess)
    df['PROCESSED'] = df['THREAD_TITLE'].map(preprocess)

    return df

def vectorize_data(df):
    dictionary = Dictionary(df['PROCESSED'])
    corpus = [dictionary.doc2bow(doc) for doc in df['PROCESSED']]
    return df, dictionary, corpus

def train_lda_model(corpus, dictionary, num_topics=10, random_state=42, passes=10):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=random_state, passes=passes)
    return lda

def review_topics(lda):
    for idx, topic in lda.print_topics(-1):
        print(f"Topic: {idx} \nWords: {topic}\n")

def assign_topics(lda, corpus, df):
    topics = [lda[doc] for doc in corpus]
    df['TOPICS'] = [[(int(topic[0]), float(topic[1])) for topic in doc] for doc in topics]  # Ensure topics are JSON serializable
    return df

def prepare_visualization_data(lda, corpus, dictionary):
    vis_data = gensimvis.prepare(lda, corpus, dictionary)
    return vis_data

def score_text_block(text, topic_words):
    if isinstance(text, str):
        return sum(1 for word in topic_words if word in text)
    return 0

def create_representative_sentence(lda, df, topic_id, top_n=20):
    topic_words = [word for word, prob in lda.show_topic(topic_id, topn=top_n)]
    #df['SCORE'] = df['THREAD_TITLE_EXP'].map(lambda x: score_text_block(x, topic_words))
    #representative_sentence = df.loc[df['SCORE'].idxmax(), 'THREAD_TITLE_EXP']
    df['SCORE'] = df['THREAD_TITLE'].map(lambda x: score_text_block(x, topic_words))
    representative_sentence = df.loc[df['SCORE'].idxmax(), 'THREAD_TITLE']

    return representative_sentence

# New UDF to save the LDA model and dictionary
def save_lda_model_and_dictionary(lda, dictionary, lda_model_path, dictionary_path):
    lda.save(lda_model_path)  # Use gensim's save method for the LDA model
    dictionary.save(dictionary_path)  # Use gensim's save method for the dictionary
    print(f"LDA model saved to {lda_model_path}")
    print(f"Dictionary saved to {dictionary_path}")
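
`vectorize_data` relies on gensim's `Dictionary.doc2bow`, which turns each token list into `(token_id, count)` pairs. The standard-library sketch below illustrates that representation only. It is not gensim's implementation, and the helper names `build_vocab` and `to_bow` and the sample tokens are made up for illustration.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens found in vocab."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Hypothetical preprocessed documents (lists of tokens).
docs = [["brake", "front", "brake"], ["seat", "front"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

Each document becomes a sparse vector: only tokens that actually occur are listed, which is why short thread titles yield very small bag-of-words entries.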


In [4]:
# Generate LDA

# Main sequence
path_to_credentials = '/content/drive/MyDrive/Colab Notebooks/credentials/snowflake_credentials.txt'

lda_model_path = '/content/drive/Othercomputers/My Mac/Git Portfolio/Language_Models/lda_model'
dictionary_path = '/content/drive/Othercomputers/My Mac/Git Portfolio/Language_Models/lda_dictionary'

# Load credentials
load_credentials(path_to_credentials)

# Fetch data from Snowflake
e9_forum_corpus = fetch_data_from_snowflake()

# Engineer the data
e9_forum_corpus = engineer_data(e9_forum_corpus)

# Preprocess the data
df = preprocess_data(e9_forum_corpus)

# Vectorize the data
df, dictionary, corpus = vectorize_data(df)

# Train the LDA Model
lda = train_lda_model(corpus, dictionary)

# Save the LDA model and dictionary
save_lda_model_and_dictionary(lda, dictionary, lda_model_path, dictionary_path)

# Review the Topics
review_topics(lda)

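# Assign each document its single most probable topic.
# Note: this redefines the assign_topics function from the previous cell.
def assign_topics(lda, corpus, df):
    topic_assignments = []
    for doc_topics in lda[corpus]:
        # doc_topics is a list of (topic_id, probability) pairs
        dominant_topic = max(doc_topics, key=lambda pair: pair[1])[0]
        topic_assignments.append(dominant_topic)
    df['Topic'] = topic_assignments
    return df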

# Assign Documents to Topics
df = assign_topics(lda, corpus, df)

# Prepare the visualization data
vis_data = prepare_visualization_data(lda, corpus, dictionary)

# Visualize
pyLDAvis.display(vis_data)


LDA model saved to /content/drive/Othercomputers/My Mac/Git Portfolio/Language_Models/lda_model
Dictionary saved to /content/drive/Othercomputers/My Mac/Git Portfolio/Language_Models/lda_dictionary
Topic: 0 
Words: 0.043*"dob" + 0.029*"pictures:" + 0.029*"question" + 0.029*"link" + 0.015*"seat" + 0.015*"looking" + 0.015*"company" + 0.015*"posted" + 0.015*"making" + 0.015*"italian"

Topic: 1 
Words: 0.021*"question" + 0.021*"front" + 0.021*"2800" + 0.021*"brake" + 0.021*"end" + 0.021*"diego" + 0.021*"contest" + 0.021*"san" + 0.021*"clean" + 0.021*"sale:"

Topic: 2 
Words: 0.032*"old" + 0.032*"dear" + 0.032*"bucket" + 0.017*"need" + 0.017*"nice" + 0.017*"ratio" + 0.017*"box" + 0.017*"close" + 0.017*"gear" + 0.017*"opinion"

Topic: 3 
Words: 0.028*"back" + 0.028*"good" + 0.014*"swap" + 0.014*"welcome" + 0.014*"interior" + 0.014*"bmwcca" + 0.014*"local" + 0.014*"," + 0.014*"2002" + 0.014*"thanks"

Topic: 4 
Words: 0.054*"series" + 0.054*"6" + 0.036*"available" + 0.036*"new" + 0.036*"now." 
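
Several topic terms above carry punctuation (`pictures:`, `sale:`, `now.`, and a bare `,`) because `preprocess` splits on whitespace only. One possible remedy, sketched below with the standard library, extracts alphanumeric runs before stop-word filtering. The function name `tokenize_title`, the regular expressions, and the abbreviated stop-word set are assumptions for illustration, and lemmatization is omitted.

```python
import re

# Alphanumeric runs, keeping decimals such as "3.0" together (assumed pattern).
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:\.[0-9]+)?")
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# Abbreviated, hypothetical stop-word set; the notebook uses NLTK's list plus extras.
STOP_WORDS = {"the", "a", "for", "of", "car", "bmw", "e9", "coupe"}


def tokenize_title(text):
    """Lowercase, drop URLs, and return punctuation-free tokens minus stop words."""
    text = URL_PATTERN.sub("", text.lower())
    return [tok for tok in TOKEN_PATTERN.findall(text) if tok not in STOP_WORDS]


print(tokenize_title("For sale: 1973 BMW 3.0 CS, pictures: www.example.com/e9"))
# ['sale', '1973', '3.0', 'cs', 'pictures']
```

With a tokenizer along these lines, `sale:` and `sale` would collapse into one vocabulary entry, which may concentrate probability mass on fewer, cleaner terms. Retraining would be needed to see the effect on the topics.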

In [5]:
#Create representative sentences for each topic

num_topics = lda.num_topics
representative_sentences = []

for topic_id in range(num_topics):
    try:
        sentence = create_representative_sentence(lda, e9_forum_corpus, topic_id)
        representative_sentences.append({'Topic': topic_id, 'Representative Sentence': sentence})
        print(f"Topic {topic_id}: {sentence}")
    except IndexError as e:
        print(f"Error with topic {topic_id}: {e}")

# Save to CSV
output_df = pd.DataFrame(representative_sentences)
output_df.to_csv('/content/drive/MyDrive/Colab Notebooks/datasets/e9/representative_sentences.csv', index=False)


Topic 0: Looking for link I posted about italian company making seats
Topic 1: 74 e9 coupe body
Topic 2: Opinions wanted: Close ratio or overdrive gear box?
Topic 3: Complete tan interior up on German Ebay
Topic 4: E24 6 series seats, is it possible to fit them?
Topic 5: Does anyone know of a 3.64 lsd for sale?
Topic 6: Seattle Craigslist -!! RARE !! 1973 BMW 3.0 CS / One owner
Topic 7: Antitheft or: How I Learned to Stop Worrying and Love Lojack
Topic 8: Coupe Link for speaker grille?
Topic 9: So glad to be back
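
`score_text_block`, defined earlier, counts a topic word whenever it appears as a substring of the raw title, so `Brake` does not match `brake`, while `end` also matches inside `weekend`. A hedged alternative, assuming whole-word, case-insensitive matching is what is wanted, scores by token overlap. The name `score_tokens` and the sample data are hypothetical.

```python
import re


def score_tokens(text, topic_words):
    """Count topic words that occur as whole, case-insensitive tokens in text."""
    if not isinstance(text, str):
        return 0
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(1 for word in topic_words if word in tokens)


# Hypothetical topic words and titles, including a missing value.
topic_words = ["brake", "end", "front"]
titles = ["Front Brake question", "Weekend drive", None]
scores = [score_tokens(title, topic_words) for title in titles]
print(scores)  # [2, 0, 0]
```

Whole-token matching is stricter than substring matching: a plural such as `seats` no longer matches `seat` unless the titles are lemmatized first, so the representative sentences chosen this way may differ from those listed above.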
