Topic Model Creator

Developed by John Crissman

Focus (reason I created this project)

Predicting social determinants of health in patients. Another focus was using Latent Dirichlet Allocation topic models as input to classification algorithms.

Data

The data describe Chinatown patients and were produced by patient navigators at Northwestern University. The data include visit information, demographics, comments left by patient navigators, and the barrier (social determinant of health).

Algorithms

Latent Dirichlet Allocation (Topic Modeling) is used to convert text to numerical data. Random Forests, Artificial Neural Networks, Logistic Regression, Support Vector Machines, and Gaussian Naive Bayes were used for classification.
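
The text-to-numbers step can be sketched as follows. This is a minimal illustration that assumes scikit-learn's CountVectorizer and LatentDirichletAllocation; the project's own classes may be built differently, and the toy documents are invented for the example.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for navigator comments (invented for illustration).
documents = [
    "patient needs interpreter for appointment",
    "interpreter requested for clinic visit",
    "patient has no transportation to clinic",
    "bus pass provided for transportation",
    "insurance paperwork not completed",
    "helped patient complete insurance application",
]

# Step 1: text -> document-to-word count matrix.
vectorizer = CountVectorizer(stop_words="english")
doc_to_word = vectorizer.fit_transform(documents)

# Step 2: word counts -> document-to-topic matrix (one row per document,
# one column per topic). The topic count of 3 is arbitrary here.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_to_topic = lda.fit_transform(doc_to_word)

print(doc_to_topic.shape)  # (6, 3)
```

Each row of doc_to_topic is a numeric description of one document, which is what makes it usable as classifier input.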

Other uses for this program

Users can turn text into topic models and concatenate the resulting topic data with other numerical data for prediction (classification) purposes.
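
As a rough sketch of that concatenation, the example below stacks a document-to-topic matrix next to other numeric columns and fits a classifier on the combined features. All values are made up, the variable names are hypothetical, and scikit-learn's LogisticRegression is only one possible choice of classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical inputs: 8 documents with 3 topic weights each (rows sum to 1),
# plus 2 extra numeric columns (for example an age and a 0/1 flag).
doc_to_topic = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=8)
other_numeric = np.array(
    [[34, 1], [52, 0], [47, 1], [29, 0], [61, 1], [38, 0], [45, 1], [56, 0]],
    dtype=float,
)
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Concatenate column-wise: each row is still one data point.
features = np.hstack([doc_to_topic, other_numeric])

clf = LogisticRegression(max_iter=1000).fit(features, labels)
predictions = clf.predict(features)

print(features.shape)  # (8, 5)
```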

Setup

Dependencies

This application was developed in Python 3 and HTML, using Visual Studio Code.

Python 3.6.4
Visual Studio Code. February 2020 (version 1.43)
Python packages (including several sub-packages)
- pandas
- numpy
- pickle
- time
- pprint
- glob
- gensim
- sklearn (installed as scikit-learn)
- spacy
- statistics
- seaborn
- itertools
- matplotlib
- re
- webbrowser
- heapq
- csv
- math

API Reference

Classes

CorpusProcessor() (corpus_processor.py): This class prepares a collection of documents for a vectorizer, which produces a document-to-word matrix. The document-to-word matrix is the input to LDA.
LDAProcessor() (lda_processor.py): This class creates a Latent Dirichlet Allocation (LDA) model and transforms the LDA output so that it can be used as input for supervised learning.
ClassifierProcessor() (classifier_processor.py): This class takes a pandas dataframe such that each row is a data point and the columns are attributes of the data point. The values in the column with the column name "Classification" are the labels associated with that respective row/data point.
DisplayNotes() (display_notes.py): This class displays one document and highlights the words associated with topics, using a different color for each topic.
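
The input layout described for ClassifierProcessor() can be illustrated with plain scikit-learn. This sketch does not call ClassifierProcessor() itself, since its methods are not documented here. It only shows a DataFrame with a "Classification" label column being split into features and labels and scored with the five classifier families named under Algorithms. The synthetic data and the hyperparameters are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic data: each row is a data point, each column an attribute.
X, y = make_classification(n_samples=120, n_features=6, random_state=0)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])
df["Classification"] = y  # label column, named as in the class description

# Separate the attributes from the labels.
features = df.drop(columns=["Classification"])
labels = df["Classification"]

classifiers = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Artificial Neural Network": MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=2000, random_state=0
    ),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Mean 5-fold cross-validated accuracy for each classifier.
scores = {
    name: cross_val_score(model, features, labels, cv=5).mean()
    for name, model in classifiers.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```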

Demo files that use the above classes

demo_ALL_patients_all.py: Using data from df_each_visit_one_hot_encoding.csv, this demo file creates LDA models (topic counts are 5, 10, 15, 20, 25, 30) and the appropriate matrices and saves them into china_ALL_patients_ALL_5_10_15_20_25_30.pkl. Each visit from a patient is treated as a data point, and the barrier given by the patient navigator is the label for that data point.
demo_classify_ALL_patients_all.py: This demo file loads the objects from the pickle file china_ALL_patients_ALL_5_10_15_20_25_30.pkl and concatenates the data with the demographics data associated with each patient. It also runs machine learning algorithms from ClassifierProcessor() and displays the text with DisplayNotes().
demo_each_patient_is_a_data_point.py: Using data from df_each_visit_one_hot_encoding_sorted_by_id.csv, this demo file transforms the data from an "each visit is a data point" strategy to an "each patient is a data point" strategy. It creates LDA models (topic counts are 5, 10, 15, 20, 25, 30) and the appropriate matrices and saves them into china_ALL_patients_each_patient_one_data_point_ALL_5_10_15_20_25_30.pkl. Each patient is treated as a data point, using all of that patient's visits up to the first occurrence of a barrier that is not language/interpreter.
demo_classify_each_patient_is_a_data_point.py: This demo file loads the objects from the pickle file china_ALL_patients_each_patient_one_data_point_ALL_5_10_15_20_25_30.pkl and concatenates the data with the demographics data associated with each patient. It also runs machine learning algorithms from ClassifierProcessor() and displays the text with DisplayNotes().
demo_processor_load.py: This demo file loads the objects from the pickle file c_finalized_model.pkl. The objects in c_finalized_model.pkl represent movie review data. Each movie review is labelled as either a positive review or a negative review. 1000 positive reviews and 1000 negative reviews were used. Only the document-to-topic matrix and the label (positive or negative) were used for classification. This file uses ClassifierProcessor() and DisplayNotes().
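
The change from one data point per visit to one data point per patient can be sketched with pandas. This is not the code from the demo file. The column names other than record_id are hypothetical, and keeping the visit that contains the first non-language barrier (an inclusive cutoff) is an assumption about what "up to the first occurrence" means.

```python
import pandas as pd

LANGUAGE = "language/interpreter"

# Hypothetical visit-level data, already sorted by record_id and visit order.
visits = pd.DataFrame({
    "record_id": [1, 1, 1, 2, 2, 3],
    "comment": [
        "needs interpreter", "no ride to clinic", "follow up call",
        "needs interpreter", "interpreter booked", "cannot pay copay",
    ],
    "barrier": [
        LANGUAGE, "transportation", LANGUAGE,
        LANGUAGE, LANGUAGE, "financial",
    ],
})


def collapse_patient(group):
    """Merge one patient's visits into a single document and label."""
    non_language = (group["barrier"] != LANGUAGE).to_numpy()
    if non_language.any():
        cutoff = int(non_language.argmax())  # position of the first True
        kept = group.iloc[: cutoff + 1]      # assumption: cutoff is inclusive
        label = kept["barrier"].iloc[-1]
    else:
        # Only language/interpreter barriers: keep every visit.
        kept = group
        label = LANGUAGE
    return {"document": " ".join(kept["comment"]), "barrier": label}


rows = [
    {"record_id": record_id, **collapse_patient(group)}
    for record_id, group in visits.groupby("record_id", sort=True)
]
patients = pd.DataFrame(rows)
print(patients)
```

Patient 2 has only language/interpreter barriers, so that patient keeps the language/interpreter label and all of their comments.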

Data used for project

df_each_visit_one_hot_encoding.csv: This data shows each row as a visit from a patient and each column is an attribute. Because of one-hot-encoding, there will be many columns/attributes.
PN_demographics_neiu.xlsx: This data has some demographics information for each patient such as age, occupational status, marital status, education level, whether or not they were born in the United States, in what year they came to the U.S., how well they speak English, where they are from, their current zip code, and household income.
df_each_visit_one_hot_encoding_sorted_by_id.csv: Same as df_each_visit_one_hot_encoding.csv above, except that this file is sorted by record_id.
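
To show why one-hot encoding produces many columns, here is a small sketch using pandas.get_dummies. The column names and values are invented and are not taken from the project's CSV files.

```python
import pandas as pd

# Hypothetical visit rows with two categorical attributes.
visits = pd.DataFrame({
    "record_id": [1, 2, 3],
    "marital_status": ["married", "single", "married"],
    "visit_type": ["phone", "in_person", "phone"],
})

# Each distinct category value becomes its own indicator column.
encoded = pd.get_dummies(visits, columns=["marital_status", "visit_type"])
print(list(encoded.columns))
```

Two categorical columns with two values each become four indicator columns, so attributes with many categories widen the table quickly.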

Pickle files and their contents

china_ALL_patients_ALL_5_10_15_20_25_30.pkl: These objects represent the data when each visit from a patient is treated as a data point and the barrier given by the patient navigator is the label for that data point.
- vectorizer
- all_lda_processors
- all_lda_models
- all_doc_to_topic_matrices
- list_of_documents
- list_of_barriers
- document_to_word_matrix
- other_patient_visit_data
china_ALL_patients_each_patient_one_data_point_ALL_5_10_15_20_25_30.pkl: These objects represent the data when each patient is treated as a data point. The focus is on barriers that are not language/interpreter. If a patient only has language/interpreter barriers across their visits, they are labeled as language/interpreter.
- vectorizer
- all_lda_processors
- all_lda_models
- all_doc_to_topic_matrices
- list_of_documents
- list_of_barriers
- document_to_word_matrix
- other_patient_visit_data
c_finalized_model.pkl: These objects are representative of movie review data. Each movie review is labelled as either a positive review or a negative review. 1000 positive reviews and 1000 negative reviews were used. Only the document to topic matrix and the label (positive or negative) were used for classification.
- vectorizer
- doc_to_word_matrix
- lda_model
- doc_to_topic_matrix
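
Below is a minimal sketch of how several LDA models (one per topic count) might be bundled and saved with pickle, then loaded again. It assumes scikit-learn and stores everything in a single dictionary. How the project's .pkl files are actually laid out (one object, a tuple, or several sequential dumps) is not stated here, so the bundle layout, the file name, and the toy documents are all assumptions.

```python
import os
import pickle
import tempfile

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (invented for illustration).
list_of_documents = [
    "patient needs interpreter for appointment",
    "no transportation to clinic",
    "insurance paperwork not completed",
    "interpreter requested for follow up visit",
]

vectorizer = CountVectorizer()
document_to_word_matrix = vectorizer.fit_transform(list_of_documents)

# One LDA model and one document-to-topic matrix per topic count.
all_lda_models = {}
all_doc_to_topic_matrices = {}
for n_topics in (5, 10, 15, 20, 25, 30):
    model = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    all_doc_to_topic_matrices[n_topics] = model.fit_transform(document_to_word_matrix)
    all_lda_models[n_topics] = model

# Assumption: everything is stored as one dictionary in one pickle file.
bundle = {
    "vectorizer": vectorizer,
    "all_lda_models": all_lda_models,
    "all_doc_to_topic_matrices": all_doc_to_topic_matrices,
    "list_of_documents": list_of_documents,
    "document_to_word_matrix": document_to_word_matrix,
}

path = os.path.join(tempfile.mkdtemp(), "example_bundle.pkl")
with open(path, "wb") as f:
    pickle.dump(bundle, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(sorted(restored["all_lda_models"]))  # [5, 10, 15, 20, 25, 30]
```

Only unpickle files from sources you trust, because loading a pickle file can execute arbitrary code.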

