
How to use the embedding on new categorical data #24

Open
isaac2lord opened this issue Mar 6, 2021 · 4 comments

isaac2lord commented Mar 6, 2021

Hello,

I have a general, rudimentary question (sorry in advance).

I have reviewed (not fully) many parts of the code here. I'd like to test the proposed embedding on new data, but I'm not sure where to begin.

I have simple 2-column data: the first column is a patient ID (assume 1M unique patients) and the second is an ICD10 diagnosis code (assume 10K categories). The data contains repeated measurements, meaning that diagnoses can repeat within a given patient and across many patients.

I tested Multiple Correspondence Analysis with categorical data from this link, but the results are not very useful.

Similar to the German States example in the repo, my goal is to perform (unsupervised) dimensionality reduction (such as what you'd see in a denoising AE that minimizes reconstruction error).

  • Where should I start? Do I need to run one-hot encoding beforehand?
  • Which functions should I use after loading my raw data to generate such an embedding?

Appreciate any words of wisdom you may be able to share.

entron commented Mar 7, 2021

Hi isaac2lord,

This is a very interesting topic! I have little knowledge of medicine, but I will try my best to answer your question, so please correct me if I make some wrong assumptions.

To learn the embeddings of patients and ICD10 codes from the 2-column data you have, you need to give the network a meaningful problem to answer. One problem I can think of is: "If the patient has diagnoses A, B, and C, what is the probability that he also has diagnosis D?" (I assume you don't have the dates when the diagnoses were made, so there is no order to A, B, C, D. If you do have the dates, together with other information such as age/gender, etc., that will make the learned embeddings more meaningful.)

I would solve the problem with the following steps:

  1. Filter the data so that each patient has at least two different ICD10 codes, and leave out the rest.
  2. If the majority of patients have at most 4 diagnoses, to make things simpler I would also leave out patients with more diagnoses (see the pandas sketch after this list).
  3. The network I would use is the same as the CBOW architecture in the natural language processing context. It will have 3 inputs and one output; there is a Keras sketch below. You can read the paper "Efficient Estimation of Word Representations in Vector Space" for more information; Fig. 1 shows the CBOW architecture. (Feel free to comment on this thread if you have any questions.)
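Concretely, steps 1 and 2 might look like this in pandas (a minimal sketch; the file and column names patient_id and icd10 are assumptions about your data):

```python
import pandas as pd

# Hypothetical file and column names; adjust to your data.
df = pd.read_csv("diagnoses.csv", dtype=str)

# Step 1: keep only patients with at least two different ICD10 codes.
distinct = df.groupby("patient_id")["icd10"].transform("nunique")
df = df[distinct >= 2]

# Step 2: to keep things simple, drop patients with more than 4 distinct codes.
distinct = df.groupby("patient_id")["icd10"].transform("nunique")
df = df[distinct <= 4]
```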

Now you can just feed in the data you have and let the network try to guess the held-out diagnosis. If the patient has fewer than 4 diagnoses, you can use a special code to represent the missing ones. For example, instead of using [A, B, C] as input you use [A, nd, nd].
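Here is a minimal Keras sketch of that CBOW-style network (the vocabulary size is an assumption, and the embedding dimension M=20 follows the suggestion in the next comment):

```python
from tensorflow.keras.layers import Input, Embedding, Flatten, Average, Dense
from tensorflow.keras.models import Model

n_codes = 10_001  # ~10K ICD10 codes plus the special "nd" code (assumption)
emb_dim = 20      # M, as suggested in the next comment

# One embedding table shared by all three context inputs.
shared_emb = Embedding(n_codes, emb_dim, name="icd10_embedding")

inputs = [Input(shape=(1,), dtype="int32", name=f"diag_{i}") for i in range(3)]
vectors = [Flatten()(shared_emb(x)) for x in inputs]
context = Average()(vectors)  # CBOW: average the context vectors
output = Dense(n_codes, activation="softmax")(context)  # guess the held-out code

model = Model(inputs, output)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```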

One way your data differs from the NLP context is that there is no order, as I assumed above. So you can permute your data to generate more training samples. For example, for a patient with diagnoses A, B, and C you will have the following training data points:

A, B, nd -> C
A, nd, B -> C
B, A, nd -> C
nd, B, A -> C
C, B, nd -> A
B, nd, C -> A
C, A, nd -> B
.....
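A minimal sketch of generating these permuted samples, assuming each patient's diagnoses are distinct codes and nd pads the context to length 3:

```python
from itertools import permutations

ND = "nd"  # special padding code

def training_samples(codes, context_size=3):
    """Yield (context, target) pairs for one patient's diagnoses."""
    for target in codes:
        rest = [c for c in codes if c != target]
        context = (rest + [ND] * context_size)[:context_size]
        for perm in set(permutations(context)):  # every ordering of the context
            yield list(perm), target

for x, y in training_samples(["A", "B", "C"]):
    print(x, "->", y)  # e.g. ['B', 'C', 'nd'] -> C
```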

entron commented Mar 7, 2021

You don't need one-hot encodings. You just map each patient to an N-dimensional vector and map each ICD10 code (plus the extra nd code) to an M-dimensional vector. You can try different N and M to get the best prediction results. I would start with something meaningful: if you wanted to describe a patient or a diagnosis, how many factors would you normally need? Of course it also depends on how much data you have; the more data you have, the higher the dimension you can afford to try. I would start with something like N=20 and M=20.
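As a sketch of those sizes (the layer names and the model variable are illustrative), each lookup table is just a Keras Embedding layer, and after training the learned vectors can be read back from the layer weights:

```python
from tensorflow.keras.layers import Embedding

patient_emb = Embedding(1_000_000, 20, name="patient_embedding")  # N = 20
icd10_emb = Embedding(10_001, 20, name="icd10_embedding")         # M = 20, +1 for nd

# After training, each lookup table is the layer's weight matrix, e.g.:
# vectors = model.get_layer("icd10_embedding").get_weights()[0]
# vectors[code_index] is the learned 20-dimensional vector for that code.
```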

isaac2lord (Author) commented

Thanks much @entron for sharing your thoughts.

I should say the data contains Diag_Date as well, but not a lot of patient info that we can really leverage (we are not allowed to use any member demographic fields in our modeling efforts).

Here is a view of the data:

[image: sample rows of the data]

There are about 80K patients with a max of 23 diagnosis codes.


Thanks for pointing me to the paper; yes, I am familiar with word2vec and its many variants.

So assuming I reformat my data based on the steps above (with ordering), would the entity embedding work similarly to generative NLP models, trying to predict the next word in a sequence?

entron commented Mar 7, 2021

If the dates are not really usable, I think it is fine to not use them and still get meaningful embeddings.
Entity embedding just applies the word2vec idea to general entities and tabular data, so what I described above is the EE approach.
