<a href="https://colab.research.google.com/github/davidelgas/DataSciencePortfolio/blob/main/nlp/lda/notebooks/NLP_with_LDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Topic Modeling with Latent Dirichlet Allocation (LDA)
https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf

## Corpus Creation

The corpus was assembled by scraping a public forum dedicated to the BMW E9 (www.e9coupe.com) with Beautiful Soup. This active forum has been in existence since 2003. The data was compiled and stored in a Snowflake database for use in multiple NLP projects, including LDA, GRU, and LSTM models. Future ideas include supplementing the forum text with an existing user's guide specific to this model.
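The scraper itself is not part of this notebook, so the following is only a rough sketch of the extraction step. To keep it runnable without extra installs it uses the standard library's `html.parser` in place of Beautiful Soup, and the `<div class="post">` markup, the `PostTextExtractor` class, and the `extract_posts` helper are all assumptions made for illustration, not the actual structure of e9coupe.com or the original scraping code.

```python
from html.parser import HTMLParser


class PostTextExtractor(HTMLParser):
    """Collect the text inside every <div class="post"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.posts = []    # finished post texts
        self._depth = 0    # > 0 while inside a post div (counts nested divs)
        self._chunks = []  # text fragments of the post currently being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            self._depth += 1  # nested div inside a post
        elif "post" in (dict(attrs).get("class") or "").split():
            self._depth = 1   # entering a post div

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:  # leaving the post div
                self.posts.append(" ".join(self._chunks))
                self._chunks = []

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self._chunks.append(data.strip())


def extract_posts(html):
    """Return the plain text of each post found in an HTML page."""
    parser = PostTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts


# Made-up page fragment standing in for a downloaded forum thread
sample_html = """
<div class="post"><p>Replaced the <b>fuel pump</b> today.</p></div>
<div class="sidebar">Login</div>
<div class="post">Which relay drives the wipers?</div>
"""
posts = extract_posts(sample_html)
print(posts)
```

In a real scrape, the posts of one thread would then be joined into a single string, which is presumably how a column such as `THREAD_ALL_POSTS` was built.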

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
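# Pinning joblib to 1.0.1 works around a conflict between some of the libraries in this
# Colab environment. It seems to help, although pip reports dependency warnings.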

!pip install joblib==1.0.1

Collecting joblib==1.0.1
  Downloading joblib-1.0.1-py3-none-any.whl (303 kB)
     303.1/303.1 kB 2.7 MB/s eta 0:00:00
Installing collected packages: joblib
  Attempting uninstall: joblib
    Found existing installation: joblib 1.3.2
    Uninstalling joblib-1.3.2:
      Successfully uninstalled joblib-1.3.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
imbalanced-learn 0.10.1 requires joblib>=1.1.1, but you have joblib 1.0.1 which is incompatible.
scikit-learn 1.2.2 requires joblib>=1.1.1, but you have joblib 1.0.1 which is incompatible.
Successfully installed joblib-1.0.1


In [1]:
# The [pandas] extra installs the dependencies that fetch_pandas_all() relies on
!pip install "snowflake-connector-python[pandas]"
import snowflake.connector

import pandas as pd
import os

from gensim import corpora
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('wordnet')

!pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis



[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!




In [2]:
# Step 1: Load the Data

# Set the snowflake account and login information
path_to_credentials = '/content/drive/MyDrive/credentials/snowflake_credentials'

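# Load the credentials (one KEY=value pair per line)
with open(path_to_credentials, 'r') as file:
    for line in file:
        line = line.strip()
        if not line or '=' not in line:
            continue  # skip blank or malformed lines
        key, value = line.split('=', 1)  # split once so values may contain '='
        os.environ[key] = value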

conn = snowflake.connector.connect(
    user=os.environ.get('USER'),
    password=os.environ.get('PASSWORD'),
    account=os.environ.get('ACCOUNT'),
)

# Create a cursor object
cur = conn.cursor()

# Select source data
query = """
SELECT * FROM "E9_CORPUS"."E9_CORPUS_SCHEMA"."E9_FORUM_CORPUS";
"""
cur.execute(query)

# Load data into a df.
e9_forum_corpus = cur.fetch_pandas_all()

# Close the cursor and the connection
cur.close()
conn.close()

# Step 2: Preprocess the Data
df = e9_forum_corpus[['THREAD_ALL_POSTS']].copy()
df.dropna(inplace=True)

# Combine Gensim's STOPWORDS with additional, corpus-specific stopwords
additional_stopwords = {
    'car', 'csi', 'cs', 'csl', 'e9', 'coupe',
    'http', 'https', 'www', 'ebay', 'bmw', 'html',
}
all_stopwords = STOPWORDS.union(additional_stopwords)

# Build the tokenizer and lemmatizer once rather than on every call
tokenizer = RegexpTokenizer(r'\w+')
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Lowercase, tokenize, drop stopwords and 1-character tokens, then lemmatize."""
    tokens = tokenizer.tokenize(text.lower())
    return [
        lemmatizer.lemmatize(token)
        for token in tokens
        if token not in all_stopwords and len(token) > 1
    ]

df['processed'] = df['THREAD_ALL_POSTS'].map(preprocess)

# Step 3: Vectorization
dictionary = Dictionary(df['processed'])
corpus = [dictionary.doc2bow(doc) for doc in df['processed']]

# Step 4: Train the LDA Model
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, random_state=42, passes=10)

# Step 5: Review the Topics
for idx, topic in lda.print_topics(-1):
    print(f"Topic: {idx} \nWords: {topic}\n")

# Step 6: Assign Documents to Topics
topics = [lda[doc] for doc in corpus]
df['topics'] = topics

# Step 7: Prepare the visualization data
vis_data = gensimvis.prepare(lda, corpus, dictionary)

# Visualize
pyLDAvis.display(vis_data)



Topic: 0 
Words: 0.040*"image" + 0.036*"jpg" + 0.034*"broken" + 0.034*"external" + 0.031*"org" + 0.030*"font" + 0.029*"craigslist" + 0.016*"cto" + 0.013*"html" + 0.010*"url"

Topic: 1 
Words: 0.008*"hole" + 0.008*"com" + 0.007*"bolt" + 0.007*"thanks" + 0.007*"said" + 0.007*"part" + 0.007*"expand" + 0.006*"tool" + 0.006*"need" + 0.006*"like"

Topic: 2 
Words: 0.014*"door" + 0.013*"rear" + 0.012*"seat" + 0.011*"window" + 0.008*"bumper" + 0.008*"new" + 0.008*"good" + 0.008*"paint" + 0.008*"trim" + 0.007*"panel"

Topic: 3 
Words: 0.057*"light" + 0.048*"switch" + 0.017*"wiper" + 0.016*"turn" + 0.014*"lens" + 0.013*"bulb" + 0.012*"signal" + 0.012*"relay" + 0.011*"beam" + 0.011*"column"

Topic: 4 
Words: 0.028*"engine" + 0.008*"weber" + 0.008*"jet" + 0.008*"fuel" + 0.007*"carbs" + 0.007*"head" + 0.007*"manifold" + 0.006*"set" + 0.006*"stock" + 0.006*"carb"

Topic: 5 
Words: 0.027*"wire" + 0.015*"plug" + 0.012*"coil" + 0.012*"ignition" + 0.011*"battery" + 0.011*"relay" + 0.009*"motor" + 0.009*
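
Step 6 stores, for each thread, the list of `(topic_id, probability)` pairs returned by `lda[doc]`. A common follow-up is to reduce that list to the single most probable topic. The sketch below shows one way to do so in plain Python; the `dominant_topic` helper and the `example_topics` values are made up for illustration and are not output from this model.

```python
def dominant_topic(topic_dist):
    """Return the (topic_id, probability) pair with the highest probability.

    Returns None when the distribution is empty.
    """
    if not topic_dist:
        return None
    return max(topic_dist, key=lambda pair: pair[1])


# Hypothetical per-document topic distributions, shaped like the lda[doc] output
example_topics = [
    [(2, 0.71), (4, 0.27)],
    [(3, 0.95)],
    [],
]
dominant = [dominant_topic(dist) for dist in example_topics]
print(dominant)
```

Applied to the notebook's DataFrame, this would look something like `df['topics'].map(dominant_topic)`, assuming the `topics` column holds the lists produced in Step 6.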

I plan to use these results in the development of language models.
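
For downstream use, it can help to turn the printed topic strings into structured data. The sketch below parses the `weight*"word"` format shown in the output above into `(word, weight)` pairs; `parse_topic_string` and `TERM_PATTERN` are hypothetical names introduced here. When the trained model is still in memory, gensim's `show_topic` method should give similar pairs directly, so string parsing is mainly useful when only saved output is available.

```python
import re

# Matches one weight*"word" term in a printed topic string
TERM_PATTERN = re.compile(r'([0-9]*\.?[0-9]+)\*"([^"]+)"')


def parse_topic_string(topic):
    """Convert a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic)]


# First three terms of Topic 3 as printed above
topic_3 = '0.057*"light" + 0.048*"switch" + 0.017*"wiper"'
terms = parse_topic_string(topic_3)
print(terms)
```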