## Cluster Analysis with Multinomial EM (Text Clustering)

#### In the K-means clustering algorithm we forced each document to be assigned to exactly one cluster. This is known as hard clustering. This approach is not really accurate when working with language. When someone writes about a topic they use a certain terminology to describe that topic. We assume that when someone else writes about the same topic they are likely to use the same terminology or words. But words may also be used across multiple topics, and a clinical note may describe multiple topics. It may be more appropriate to assign words and documents to multiple topics (or clusters) with a certain probability based on their use.

#### This example illustrates how to group a set of different clinical notes based on their topics and the terminology used to describe the topics. We will model our clinical notes as a collection of unigram and bigram words. Specifically, we will represent each clinical note as a feature set of unigram and bigram frequencies found in the clinical note. We will use a matrix where each row will represent a clinical note and each column a feature (i.e. distinct unigram or bigram).
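
#### To make the matrix representation concrete, here is a minimal sketch that builds a small document-term count matrix from bigrams. The two toy notes and the helper name `bigram_counts` are made up for illustration; the notebook itself uses scikit-learn's `CountVectorizer` for this step.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent token pairs (bigrams) in one tokenized document."""
    return Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))

# Two toy "notes", already lowercased and tokenized (hypothetical data).
docs = [
    "pleural effusion left pleural effusion".split(),
    "chest pain pleural effusion".split(),
]

counts = [bigram_counts(d) for d in docs]
# One column per distinct bigram, in sorted order.
vocab = sorted(set().union(*counts))
# One row per document; entry [i][j] is the count of bigram j in document i.
T = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in T:
    print(row)
```

#### Each row of `T` corresponds to one note and each column to one distinct bigram, which is the layout described above.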

### Mixture of Multinomial Distributions

#### We model text as a mixture of multinomial distributions, where each topic has its own multinomial distribution over terms and each document is a mixture of topics.

#### Formally, let $p(c) = \pi_c$ be the prior probability of a document containing topic $c$, and let each topic $c$ be represented as a multinomial distribution $p(D_i|c)$ with parameters $\mu_{jc}$. With $T_{ij}$ the count of term $w_j$ in document $D_i$, $n_c$ the number of topics, and $n_w$ the number of distinct terms, each document becomes a mixture over topics:

#### $p(D_i)\space=\space \displaystyle\sum_{c=1}^{n_c} p(D_i|c)p(c) = \displaystyle\sum_{c=1}^{n_c} \pi_c \prod_{j=1}^{n_w}\mu_{jc}^{T_{ij}}$
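
#### As a small worked instance of the mixture formula, the sketch below evaluates $p(D_i)$ for one toy document. All numbers are invented for illustration, and, as in the formula, the multinomial coefficient (which does not depend on the topic) is left out.

```python
import numpy as np

# Toy setup (hypothetical numbers): 2 topics, 3 vocabulary terms.
pi = np.array([0.6, 0.4])                 # pi_c, prior over topics
mu = np.array([[0.7, 0.1],                # mu_jc: rows are terms j, columns are topics c
               [0.2, 0.3],
               [0.1, 0.6]])
t_i = np.array([2, 1, 0])                 # T_ij: term counts for one document D_i

# p(D_i | c) = prod_j mu_jc ** T_ij, evaluated for every topic c at once.
p_doc_given_topic = np.prod(mu ** t_i[:, None], axis=0)
# p(D_i) = sum_c pi_c * p(D_i | c)
p_doc = float(np.sum(pi * p_doc_given_topic))

print(p_doc_given_topic, p_doc)
```

#### Here the first topic explains the document far better than the second (0.098 versus 0.003), so it dominates the mixture.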

### Expectation Maximization for Mixtures of Multinomials

#### The expectation maximization algorithm will allow us to fit a multinomial mixture model to our data. Our goal is to identify which documents belong to which topics and what words (unigrams and bigrams) are used to describe the topic. 

#### 1. E-Step. Compute the expectation that document $D_i$ belongs to topic (cluster) $c$

####         $\gamma_{ic} \propto \pi_c \displaystyle\prod_{j = 1}^{n_w} \mu_{jc}^{T_{ij}}$

#### Note: normalize the expectations over $c$ so that $\sum_{c=1}^{n_c} \gamma_{ic} = 1$ for each document.

#### where,
#### $\pi_c \equiv \text{prior probability of a document containing topic } c$
#### $\mu_{jc} \equiv \text{probability of } w_j \text{ in topic } c$
#### $T_{ij} \equiv \text{count of } w_j \text{ in document } D_i$
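
#### The E-step can be sketched directly from the formula. This is a toy illustration with made-up numbers and plain NumPy arrays, not the notebook's implementation.

```python
import numpy as np

# Toy inputs (hypothetical numbers): 2 documents, 3 terms, 2 topics.
T = np.array([[2, 1, 0],      # T_ij: count of term j in document i
              [0, 1, 3]])
mu = np.array([[0.7, 0.1],    # mu_jc: probability of term j in topic c
               [0.2, 0.3],
               [0.1, 0.6]])
pi = np.array([0.6, 0.4])     # pi_c: prior probability of topic c

# Unnormalized expectations: pi_c * prod_j mu_jc ** T_ij.
# Broadcasting: mu[None] is (1, n_w, n_c), T[:, :, None] is (n_d, n_w, 1).
unnormalized = pi * np.prod(mu[None, :, :] ** T[:, :, None], axis=1)
# Normalize over topics so that each document's expectations sum to 1.
gamma = unnormalized / unnormalized.sum(axis=1, keepdims=True)
print(gamma.round(4))
```

#### Each row of `gamma` is a probability distribution over topics for one document.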


#### 2. M-Step. Update the mixture parameters. 

#### $\mu_{jc} = \frac{\sum_{i = 1}^{n_d} \gamma_{ic}T_{ij}}{\sum_{i = 1}^{n_d}\sum_{l = 1}^{n_w} \gamma_{ic}T_{il}} \equiv \text{probability of a word being } w_j \text{ in topic (cluster) } c$

#### $\pi_c = \frac{1}{n_d} \displaystyle \sum_{i = 1}^{n_d} \gamma_{ic} \equiv \text{prior probability of each cluster}$

#### Note: initialize the priors uniformly, and initialize the $\mu$'s to multinomials generated from a uniform Dirichlet distribution such that $\sum_{j = 1}^{n_w} \mu_{jc} = 1$
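
#### A minimal sketch of the M-step updates and of the initialization mentioned in the note (uniform priors, Dirichlet-sampled $\mu$). The function name `m_step`, the toy counts, and the toy expectations are assumptions made for illustration.

```python
import numpy as np

def m_step(T, gamma):
    """Re-estimate mu (terms x topics) and pi (topics,) from the expectations."""
    # Numerator of mu_jc: sum_i gamma_ic * T_ij, for all j and c at once.
    weighted_counts = T.T @ gamma                       # (n_w, n_c)
    # Denominator: total weighted count per topic, so each column sums to 1.
    mu = weighted_counts / weighted_counts.sum(axis=0, keepdims=True)
    # pi_c: average expectation of topic c over all documents.
    pi = gamma.mean(axis=0)
    return mu, pi

rng = np.random.default_rng(0)
n_w, n_c = 3, 2
# Initialization: uniform priors and one Dirichlet draw per topic.
pi0 = np.full(n_c, 1.0 / n_c)
mu0 = rng.dirichlet(np.ones(n_w), size=n_c).T           # columns sum to 1

T = np.array([[2, 1, 0],
              [0, 1, 3]])
gamma = np.array([[0.98, 0.02],
                  [0.01, 0.99]])
mu, pi = m_step(T, gamma)
print(mu.sum(axis=0), pi)
```

#### Note that this sketch divides by the total weighted count per topic, so a topic that receives no weight at all would need separate handling.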

### Implementation

#### Our implementation will consist of manipulating four matrices as described below.

![title](./images/MultinomialEMClustering.png)
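
#### Judging from the code later in this notebook, the four matrices are presumably `T`, `Mu`, `Pi`, and `Gamma`; that reading is an assumption, since it is inferred from the code rather than from the figure. The sketch below only illustrates their shapes, using the sizes that appear in this notebook (1500 documents, 342 features, 10 topics) and plain NumPy arrays filled with placeholder values.

```python
import numpy as np

n_d, n_w, n_c = 1500, 342, 10   # documents, features, topics (sizes from this notebook)

T = np.zeros((n_d, n_w), dtype=int)       # document-term counts T_ij (placeholder zeros)
Mu = np.full((n_w, n_c), 1.0 / n_w)       # term probabilities per topic, columns sum to 1
Pi = np.full((1, n_c), 1.0 / n_c)         # topic priors, sum to 1
Gamma = np.full((n_d, n_c), 1.0 / n_c)    # per-document topic expectations, rows sum to 1

for name, m in [("T", T), ("Mu", Mu), ("Pi", Pi), ("Gamma", Gamma)]:
    print(name, m.shape)
```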


### Environment Setup

#### First we will import some Python packages that we will use.

In [1]:
import nltk
import re
import pandas as pd
import numpy as np 
import numpy.matlib as ml
import pickle as pkl
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

### Syntactic NLP Processing

#### We need a function to tokenize our text and remove noise like dates, ages, etc.

In [2]:
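def tokenize(text):
    # Split the text into sentences, then each sentence into word tokens.
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    # Keep purely alphabetic tokens; this drops dates, ages, times, and punctuation.
    filtered_tokens = [token for token in tokens if re.search(r'^[a-zA-Z]+$', token)]
    return filtered_tokens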
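
#### To see what the regular-expression filter keeps, the sketch below applies the same pattern to a hand-tokenized line. The sample tokens are made up, and the NLTK sentence and word splitting is skipped so the example runs without NLTK data.

```python
import re

# Same pattern as in tokenize(): keep tokens made only of ASCII letters.
ALPHA_ONLY = re.compile(r'^[a-zA-Z]+$')

# Hand-tokenized sample (hypothetical); NLTK would normally produce the tokens.
tokens = ['CT', 'CHEST', 'W/O', 'CONTRAST', '10:25', 'AM', '72', 'year', 'old', 'man', '.']
kept = [t for t in tokens if ALPHA_ONLY.search(t)]
print(kept)
```

#### Tokens containing digits, slashes, colons, or punctuation are dropped, so abbreviations such as `W/O` disappear along with dates and ages.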

### Retrieving our Corpus

#### Let's pull in our corpus that we had serialized out to disk.  

In [3]:
# Load the pickled corpus; the context manager closes the file for us.
with open('differential-corpus.pkl', 'rb') as file:
    corpus = pkl.load(file)
corpus.head()

Unnamed: 0,text,label
0,[**2996-12-2**] 10:25 AM\n CT CHEST W/O CONTRA...,PNA
1,[**3201-9-21**] 4:50 PM\n CT CHEST W/CONTRAST ...,PNA
2,[**3299-6-23**] 5:06 PM\n CT CHEST W/CONTRAST ...,PNA
3,[**3186-6-14**] 2:54 PM\n CT CHEST W/CONTRAST ...,PNA
4,[**2500-1-17**] 9:41 PM\n CT CHEST W/O CONTRAS...,PNA


### More Syntactic Processing

#### We will want to get rid of stop words that are essentially noise.

In [4]:
cachedStopWords = stopwords.words("english") 
noisywords = ['year', 'old', 'man', 'woman', 'ap', 'am', 'pm', 'portable', 'pa', 'lat', 'admitting', 'diagnosis', 'lateral']
cachedStopWords.extend(noisywords)
print(cachedStopWords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

### Generate Document-Term Frequency Counts

#### In this step we tokenize our text and remove stop words in addition to generating our frequency counts. Note that `ngram_range=(2, 2)` in the code below extracts bigrams only.

#### 1) How many documents are we working with, and how many features do we have?

#### 2) Can you figure out what max_df and min_df are doing to our feature count?

In [5]:
corpusList = corpus['text'].tolist()
labels = []
idx = 1
for label in corpus['label'].tolist():
    labels.append(label + "-" + str(idx))
    idx = idx + 1
rank = list(range(len(labels)))

cv = CountVectorizer(lowercase=True, max_df=0.80, max_features=None, min_df=0.033,
                     ngram_range=(2, 2), preprocessor=None, stop_words=cachedStopWords,
                     strip_accents=None, tokenizer=tokenize, vocabulary=None)
SparseT = cv.fit_transform(corpusList)
print("The dimensions of our document-term matrix")
print(SparseT.shape)
print()



The dimensions of our document-term matrix
(1500, 342)
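
#### As a rough illustration of document-frequency filtering, the sketch below keeps only terms whose document count lies between a lower and an upper bound, which is how float values of `min_df` and `max_df` behave in `CountVectorizer` as far as I understand them. The tiny corpus and the thresholds are invented for the example.

```python
from collections import Counter

# Each document is represented by the set of terms it contains (hypothetical data).
docs = [
    {"chest ct", "pleural effusion"},
    {"chest ct", "heart failure"},
    {"chest ct", "pleural effusion"},
    {"chest ct", "rare finding"},
]
min_df, max_df = 0.3, 0.8   # proportions of documents, not absolute counts

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in doc)
n_docs = len(docs)

kept = sorted(
    term for term, df in doc_freq.items()
    if min_df * n_docs <= df <= max_df * n_docs
)
print(kept)
```

#### In this toy corpus `chest ct` appears in every document and is dropped as too common, while the terms seen only once are dropped as too rare.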



### Feature Set

#### Let's take a look at our feature set.

#### 1) Do we have a lot of noise in our features?

In [6]:
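# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
lexicon = cv.get_feature_names_out().tolist()
print(lexicon)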
print()

['abdomen pelvis', 'administration cc', 'administration iv', 'adrenal glands', 'air bronchograms', 'airways patent', 'also noted', 'amt final', 'amt underlying', 'approximately cm', 'areas consolidation', 'artery calcifications', 'atelectasis left', 'atelectasis right', 'axial images', 'axillary lymphadenopathy', 'bibasilar atelectasis', 'bilateral lower', 'bilateral pleural', 'bone windows', 'c recons', 'cardiac mediastinal', 'cardiac silhouette', 'cardiomediastinal silhouette', 'cc optiray', 'central venous', 'centrilobular emphysema', 'change final', 'chest clip', 'chest compared', 'chest comparison', 'chest contrast', 'chest ct', 'chest history', 'chest indication', 'chest obtained', 'chest pain', 'chest performed', 'chest radiograph', 'chest radiographs', 'chest single', 'chest tube', 'chest w', 'chest wall', 'chest without', 'chf final', 'clinical history', 'clinical indication', 'cm carina', 'compared location', 'compared prior', 'comparison chest', 'comparison ct', 'comparison 

## Define EM Algorithm (E-Step & M-Step)

#### Now let's define our EM Algorithm.

#### 1) In the ExpD function, why are we multiplying by $1 \times 10^2$?

In [7]:
  
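def ExpD(T, Mu, Pi):
    """E-step: expectations Gamma (documents x clusters); each row sums to 1."""
    C_n = Pi.shape[1]
    D_n = T.shape[0]
    Gamma = ml.zeros((D_n, C_n))  # float64 by default
    for c in range(C_n):
        # pi_c * prod_j (mu_jc * 1e2) ** T_ij, for every document at once
        Gamma[:, c] = Pi[0][c] * ((Mu[:, c].A[:, 0] * 1e2) ** T).prod(1)
    # Normalize over clusters
    Gamma = Gamma / Gamma.sum(axis=1)
    return Gamma

def updateMu(T, Gamma):
    """M-step: term probabilities Mu (terms x clusters); each column sums to 1."""
    C_n = Gamma.shape[1]
    W_n = T.shape[1]
    Mu = ml.zeros((W_n, C_n))
    for c in range(C_n):
        weighted = np.multiply(Gamma[:, c], T)   # gamma_ic * T_ij
        numerator = weighted.sum(axis=0).T       # sum over documents, W_n x 1
        denominator = weighted.sum()             # sum over documents and terms
        Mu[:, c] = numerator / denominator
    return Mu

def updatePi(Gamma):
    """M-step: cluster priors Pi (1 x clusters) as the mean expectation per cluster."""
    D_n = Gamma.shape[0]
    Pi = Gamma.sum(axis=0) / D_n
    return Pi.A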
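
#### The product over terms can underflow for long documents. One common alternative to rescaling, sketched here as an assumption rather than as this notebook's method, is to compute the E-step with logarithms. The function name `e_step_log` and the `eps` guard are illustrative choices, and plain NumPy arrays are used instead of `numpy.matlib` matrices.

```python
import numpy as np

def e_step_log(T, mu, pi, eps=1e-300):
    """E-step computed with logarithms; returns gamma with rows summing to 1.

    T: (n_d, n_w) counts, mu: (n_w, n_c), pi: (n_c,).
    `eps` guards against log(0) when a term has zero probability in a cluster.
    """
    # log(pi_c * prod_j mu_jc ** T_ij) = log pi_c + sum_j T_ij * log mu_jc
    log_gamma = T @ np.log(mu + eps) + np.log(pi)
    # Shift each row by its maximum so the largest term becomes exp(0) = 1.
    log_gamma -= log_gamma.max(axis=1, keepdims=True)
    gamma = np.exp(log_gamma)
    return gamma / gamma.sum(axis=1, keepdims=True)

# A long toy document: 2000 tokens spread over 4 terms (hypothetical numbers).
T = np.array([[500, 500, 500, 500]])
mu = np.array([[0.25, 0.10],
               [0.25, 0.20],
               [0.25, 0.30],
               [0.25, 0.40]])
pi = np.array([0.5, 0.5])

direct = pi * np.prod(mu ** T.T, axis=0)   # the raw product underflows to 0.0
gamma = e_step_log(T, mu, pi)
print(direct, gamma)
```

#### The direct product is zero for both clusters and cannot be normalized, while the log-space version still returns a valid distribution for the document.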
    

## Let's Run it !

#### Let's start with 10 topics (clusters) and iterate 100 times. EM converges quickly.

#### 1) Can you determine at what iteration we are starting to reach convergence?

In [8]:
T = SparseT.todense()
D_n, W_n = T.shape
C_n = 10
Pi = ml.repmat(1/C_n, 1, C_n)
Mu = ml.mat(np.random.dirichlet(np.ones(W_n), C_n).T)
for i in range(1,101):
    print('Iteration: ' + str(i)) 
    Gamma = ExpD(T, Mu, Pi)
    print(Gamma.sum(0))
    Mu = updateMu(T, Gamma)
    Pi = updatePi(Gamma)


Iteration: 1
[[121.75067503 171.1386253  145.86204974 142.45061097 148.30546522
  169.62830556 255.14354441  95.40172592 146.83477235 103.48422551]]
Iteration: 2
[[251.88707529 135.16964172 117.57342788  54.30583127 126.27407856
  397.81014615 121.74130565  89.33094445  96.89266888 109.01488015]]
Iteration: 3
[[269.64503146 126.72248468 121.49612997  57.37545349 133.24068274
  375.06316511 128.66176899  78.71713164  84.276291   124.80186091]]
Iteration: 4
[[274.99211262 128.759225   133.59194161  60.43856844 153.18485612
  331.62782427 134.42126088  73.51112846  81.35991586 128.11316674]]
Iteration: 5
[[278.09587613 129.27524128 144.16811806  59.04421774 169.74015614
  292.44579443 141.10374311  73.1938938   81.74826857 131.18469076]]
Iteration: 6
[[275.82740439 129.52226755 153.73280825  58.10351501 178.61301774
  263.46482463 153.26436385  70.50239946  84.649572   132.31982712]]
Iteration: 7
[[272.42807155 130.9693023  163.82268483  57.30837204 183.48471542
  243.33352407 161.8076196

Iteration: 56
[[233.32572211 132.31683804 191.21126253  51.85519366 197.33647982
  223.36847256 180.45642396  49.00378793 100.83769422 140.28812517]]
Iteration: 57
[[233.21749917 132.31616719 191.21230129  51.85611139 197.45582249
  223.39539941 180.41867342  49.00353881 100.83796485 140.28652197]]
Iteration: 58
[[233.15210855 132.31567084 191.19593094  51.85650051 197.56466252
  223.47711806 180.31116856  49.00331381 100.83806584 140.28546036]]
Iteration: 59
[[233.11648361 132.31530942 191.17342593  51.85679662 197.67291655
  223.61724168 180.12197435  49.00311545 100.83806769 140.28466869]]
Iteration: 60
[[233.09578245 132.31504853 191.15015643  51.85718897 197.78731166
  223.82015839 179.84936183  49.00294928 100.8379943  140.28404816]]
Iteration: 61
[[233.0701318  132.31484431 191.12725418  51.85772964 197.91435113
  224.06261928 179.52890111  49.00280446 100.83785591 140.28350818]]
Iteration: 62
[[232.98745362 132.31461989 191.10240974  51.85828137 198.06151185
  224.29078938 179.
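
#### One way to judge convergence more precisely than by reading the cluster sizes is to track the log-likelihood and stop once it no longer improves. The sketch below is a self-contained toy version, not the notebook's code: the function name `em`, the tolerance, the `eps` guards, and the toy counts are all assumptions.

```python
import numpy as np

def em(T, n_c, max_iter=100, tol=1e-6, seed=0, eps=1e-300):
    """Run EM for a multinomial mixture; return gamma and the log-likelihood history."""
    rng = np.random.default_rng(seed)
    n_w = T.shape[1]
    pi = np.full(n_c, 1.0 / n_c)                      # uniform priors
    mu = rng.dirichlet(np.ones(n_w), size=n_c).T      # (n_w, n_c), columns sum to 1
    history = []
    for _ in range(max_iter):
        # E-step in log space, normalized with the log-sum-exp trick.
        log_joint = T @ np.log(mu + eps) + np.log(pi + eps)
        row_max = log_joint.max(axis=1, keepdims=True)
        log_norm = row_max + np.log(np.exp(log_joint - row_max).sum(axis=1, keepdims=True))
        gamma = np.exp(log_joint - log_norm)
        # Log-likelihood under the current parameters (multinomial coefficient omitted).
        history.append(float(log_norm.sum()))
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
        # M-step.
        weighted = T.T @ gamma
        mu = weighted / np.maximum(weighted.sum(axis=0, keepdims=True), eps)
        pi = gamma.mean(axis=0)
    return gamma, history

# Toy counts: two documents dominated by terms 0-1, two by terms 2-3 (hypothetical).
T = np.array([[5, 3, 0, 0],
              [4, 6, 0, 1],
              [0, 1, 7, 4],
              [0, 0, 3, 6]])
gamma, history = em(T, n_c=2)
print(len(history), history[-1])
```

#### The log-likelihood should never decrease from one iteration to the next, so a flat tail in `history` is a reasonable stopping signal.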

## Examination of Clusters and Terminology

#### Let's take a look at the top cluster for each clinical note and the top 20 terms that characterize each topic.

#### 1) Is there noise in the terminology? If there is, how can we get rid of it?


In [9]:
# Hard-assign each note to its most probable cluster.
clusters = Gamma.argmax(1).A
clusters = [i[0] for i in clusters]
clinicalDocuments = { 'labels': labels, 'rank': rank, 'corpus': corpusList, 'cluster': clusters }
# Index the frame by cluster so that frame.loc[i] selects the notes in cluster i.
frame = pd.DataFrame(clinicalDocuments, index = clusters , columns = ['rank', 'labels', 'corpus', 'cluster'])
grouped = frame['rank'].groupby(frame['cluster'])
# For each cluster (row of Mu.T), term indices sorted by descending probability.
topWords = Mu.T.argsort()[:, ::-1]
for i in range(C_n):
    clusterLabels = frame.loc[i]['labels'].values.tolist()
    print('Cluster %d (%d):,' % (i+1, len(clusterLabels)), end='')
    for label in clusterLabels:
        print(' %s,' % label, end='')
    print()
    print()
    print('           Words:', end='')
    # Print the 20 most probable terms of cluster i with their probabilities.
    for index in list(topWords[i, :20].A[0]):
        print(' %s (%.5f)' % (lexicon[index], Mu[index,i]), end=',')
    print()
    print()
    print()

Cluster 1 (225):, PNA-133, PNA-134, PNA-148, PNA-205, PNA-223, PNA-241, PNA-250, PNA-269, PNA-291, PNA-343, PNA-426, PNA-434, PNA-439, PNA-465, PNA-472, PNA-489, CHF-546, CHF-554, CHF-598, CHF-646, CHF-662, CHF-669, CHF-732, CHF-763, CHF-789, CHF-790, CHF-791, CHF-795, CHF-796, CHF-811, CHF-813, CHF-816, CHF-832, CHF-864, CHF-872, CHF-899, CHF-909, CHF-910, CHF-937, CHF-951, CHF-989, COPD-1015, COPD-1018, COPD-1019, COPD-1031, COPD-1034, COPD-1035, COPD-1046, COPD-1050, COPD-1051, COPD-1052, COPD-1053, COPD-1054, COPD-1056, COPD-1060, COPD-1061, COPD-1064, COPD-1067, COPD-1071, COPD-1079, COPD-1080, COPD-1085, COPD-1086, COPD-1087, COPD-1088, COPD-1089, COPD-1090, COPD-1091, COPD-1094, COPD-1096, COPD-1097, COPD-1098, COPD-1100, COPD-1104, COPD-1108, COPD-1112, COPD-1118, COPD-1120, COPD-1121, COPD-1122, COPD-1123, COPD-1128, COPD-1130, COPD-1133, COPD-1135, COPD-1136, COPD-1140, COPD-1143, COPD-1144, COPD-1145, COPD-1147, COPD-1158, COPD-1163, COPD-1164, COPD-1166, COPD-1167, COPD-117
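
#### The two operations used in this cell, hard assignment by `argmax` and ranking terms by probability, can be seen in isolation on a toy example. All values below are invented, and plain NumPy arrays stand in for the notebook's matrices.

```python
import numpy as np

# Toy values (hypothetical): 3 documents, 4 terms, 2 clusters.
lexicon = ['chest pain', 'heart failure', 'pleural effusion', 'ground glass']
Gamma = np.array([[0.90, 0.10],
                  [0.20, 0.80],
                  [0.55, 0.45]])
Mu = np.array([[0.10, 0.40],
               [0.60, 0.05],
               [0.25, 0.15],
               [0.05, 0.40]])

# Hard assignment: the cluster with the largest expectation for each document.
clusters = Gamma.argmax(axis=1)

# Top-k terms per cluster: sort each column of Mu in descending order.
k = 2
top = np.argsort(-Mu, axis=0, kind='stable')[:k].T     # shape (n_clusters, k)
for c, idx in enumerate(top):
    print(c + 1, [(lexicon[j], float(Mu[j, c])) for j in idx])
print(clusters)
```

#### The third toy document is split 0.55 to 0.45, yet `argmax` reports only the first cluster, which is the information a hard assignment throws away.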

### Soft Clustering Document Examination

#### 1) Can you find clinical notes that belong to more than one cluster?


In [10]:
N, C = Gamma.shape
for i in range(N):
    print('%s: ' % labels[i], end='')
    for j in range(C):
        print('C%d (%.7f): ' % (j+1, Gamma[i, j]), end='')
    print()
    print()
print()

PNA-1: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (1.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-2: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (1.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-3: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (1.0000000): C10 (0.0000000): 

PNA-4: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000071): C9 (0.0005326): C10 (0.9994603): 

PNA-5: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (1.0000000): 

PNA-6: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (1.0000000):

PNA-78: C1 (0.0000000): C2 (1.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-79: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (1.0000000): 

PNA-80: C1 (0.0000000): C2 (0.0003439): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.9996561): C9 (0.0000000): C10 (0.0000000): 

PNA-81: C1 (0.0000000): C2 (1.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-82: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000656): C9 (0.0000000): C10 (0.9999344): 

PNA-83: C1 (0.0000000): C2 (0.0000103): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.999

PNA-203: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (1.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-204: C1 (0.0000000): C2 (1.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-205: C1 (0.9999997): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000003): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-206: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (1.0000000): 

PNA-207: C1 (0.0000000): C2 (0.0000508): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.9999492): C10 (0.0000000): 

PNA-208: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000005): C6 (0.0000000): C7 (0.9999995): C8 (0.0000000): C9 

PNA-319: C1 (0.0000000): C2 (0.0000094): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000499): C9 (0.9999407): C10 (0.0000000): 

PNA-320: C1 (0.0000000): C2 (1.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-321: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000546): C9 (0.0000002): C10 (0.9999452): 

PNA-322: C1 (0.0025501): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.9974499): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-323: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (1.0000000): 

PNA-324: C1 (0.0003131): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.9996869): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 

PNA-411: C1 (0.0000000): C2 (1.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-412: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.9997320): C6 (0.0000000): C7 (0.0002680): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-413: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.9999994): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000006): C9 (0.0000000): C10 (0.0000000): 

PNA-414: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0317264): C6 (0.0030732): C7 (0.9652004): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-415: C1 (0.0000000): C2 (0.0000000): C3 (0.9999884): C4 (0.0000000): C5 (0.0000000): C6 (0.0000116): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-416: C1 (0.0415331): C2 (0.0000000): C3 (0.0002266): C4 (0.0000000): C5 (0.0009132): C6 (0.0000000): C7 (0.9573270): C8 (0.0000000): C9 

CHF-536: C1 (0.0002250): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.9997750): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-537: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (1.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-538: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (1.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-539: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.9999997): C9 (0.0000000): C10 (0.0000003): 

CHF-540: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000001): C9 (0.9999999): C10 (0.0000000): 

CHF-541: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (1.0000000): C7 (0.0000000): C8 (0.0000000): C9 

CHF-661: C1 (0.0001610): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000931): C6 (0.9997459): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-662: C1 (0.9946851): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0019489): C7 (0.0033661): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-663: C1 (0.0000002): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.9990295): C7 (0.0009703): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-664: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (1.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-665: C1 (0.0000012): C2 (0.0000000): C3 (0.0000008): C4 (0.0000000): C5 (0.9754828): C6 (0.0243121): C7 (0.0002030): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-666: C1 (0.0000000): C2 (0.0000000): C3 (0.0051264): C4 (0.0000000): C5 (0.9913135): C6 (0.0035601): C7 (0.0000000): C8 (0.0000000): C9 

CHF-786: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0577504): C6 (0.9422496): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-787: C1 (0.0177768): C2 (0.0000000): C3 (0.0095495): C4 (0.0000000): C5 (0.9137829): C6 (0.0582493): C7 (0.0006409): C8 (0.0000000): C9 (0.0000001): C10 (0.0000005): 

CHF-788: C1 (0.0000000): C2 (0.0053154): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000003): C9 (0.9946843): C10 (0.0000000): 

CHF-789: C1 (0.9970367): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0027549): C7 (0.0002084): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-790: C1 (1.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-791: C1 (0.7865305): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.2128238): C7 (0.0006457): C8 (0.0000000): C9 

CHF-911: C1 (0.0000002): C2 (0.0000000): C3 (0.0717361): C4 (0.0000000): C5 (0.0002049): C6 (0.8459296): C7 (0.0821293): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-912: C1 (0.0000000): C2 (0.0000000): C3 (0.0000091): C4 (0.0000000): C5 (0.0000000): C6 (0.9981967): C7 (0.0017942): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-913: C1 (0.0000000): C2 (0.0000000): C3 (0.0070481): C4 (0.0000000): C5 (0.0035316): C6 (0.7113204): C7 (0.2781000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-914: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (1.0000000): 

CHF-915: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (1.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-916: C1 (0.0000000): C2 (1.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 

COPD-1036: C1 (0.0000000): C2 (0.9978078): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0021922): C10 (0.0000000): 

COPD-1037: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (1.0000000): C10 (0.0000000): 

COPD-1038: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (1.0000000): C10 (0.0000000): 

COPD-1039: C1 (0.0000000): C2 (0.3053439): C3 (0.0002977): C4 (0.3712860): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0002852): C9 (0.0163914): C10 (0.3063957): 

COPD-1040: C1 (0.0000000): C2 (0.0000000): C3 (1.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1041: C1 (0.0000001): C2 (0.0000000): C3 (0.0031594): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.9968406): C8 (0.0

COPD-1161: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (1.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1162: C1 (0.0007441): C2 (0.0000000): C3 (0.9991167): C4 (0.0000000): C5 (0.0000637): C6 (0.0000755): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1163: C1 (0.9998790): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0001210): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1164: C1 (0.9999911): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000089): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1165: C1 (0.0157120): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.9842880): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1166: C1 (0.9999993): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000007): C7 (0.0000000): C8 (0.0

COPD-1286: C1 (0.0000000): C2 (0.0000000): C3 (0.9999692): C4 (0.0000000): C5 (0.0000000): C6 (0.0000281): C7 (0.0000027): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1287: C1 (0.0000000): C2 (0.0000000): C3 (0.9985935): C4 (0.0000000): C5 (0.0000021): C6 (0.0014044): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1288: C1 (0.0084124): C2 (0.0000000): C3 (0.9915858): C4 (0.0000000): C5 (0.0000001): C6 (0.0000017): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1289: C1 (0.0000111): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0051291): C6 (0.0000000): C7 (0.9948598): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1290: C1 (0.0052035): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.9946488): C8 (0.0000000): C9 (0.0000000): C10 (0.0001477): 

COPD-1291: C1 (0.0000001): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000038): C6 (0.0542295): C7 (0.9457666): C8 (0.0

COPD-1411: C1 (0.0000111): C2 (0.0000000): C3 (0.9999889): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1412: C1 (0.0000000): C2 (0.0000000): C3 (1.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1413: C1 (0.9004259): C2 (0.0000000): C3 (0.0995741): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1414: C1 (1.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1415: C1 (1.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1416: C1 (0.9589436): C2 (0.0000000): C3 (0.0410562): C4 (0.0000000): C5 (0.0000001): C6 (0.0000000): C7 (0.0000000): C8 (0.0
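
#### Scanning the full printout by eye is tedious. A small sketch of one way to flag notes that are shared between clusters is shown below; the three rows are copied from the output above, while the 0.05 threshold is an arbitrary assumption.

```python
import numpy as np

# Three rows of Gamma copied from the output above (clusters C1..C10).
labels = ['PNA-1', 'CHF-913', 'COPD-1039']
Gamma = np.array([
    [0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0],
    [0, 0, 0.0070481, 0, 0.0035316, 0.7113204, 0.2781000, 0, 0, 0],
    [0, 0.3053439, 0.0002977, 0.3712860, 0, 0, 0, 0.0002852, 0.0163914, 0.3063957],
])

threshold = 0.05   # hypothetical cut-off for "meaningfully belongs to a cluster"
shared = {}
for label, row in zip(labels, Gamma):
    members = [f'C{c + 1}' for c in np.flatnonzero(row > threshold)]
    if len(members) > 1:
        shared[label] = members
print(shared)
```

#### With this threshold, CHF-913 is shared between two clusters and COPD-1039 between three, while PNA-1 belongs to a single cluster.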

### Soft Clustering Term Examination

#### 1) Can you find terms that belong to more than one cluster?

In [11]:
W, C = Mu.shape
for i in range(W):
    print('%s: ' % lexicon[i], end='')
    for j in range(C):
        print('C%d (%.5f): ' % (j+1, Mu[i, j]), end='')
    print()
    print()
print()

abdomen pelvis: C1 (0.00000): C2 (0.00143): C3 (0.00000): C4 (0.00904): C5 (0.00000): C6 (0.00000): C7 (0.00000): C8 (0.00392): C9 (0.00502): C10 (0.00000): 

administration cc: C1 (0.00000): C2 (0.00331): C3 (0.00000): C4 (0.00000): C5 (0.00000): C6 (0.00000): C7 (0.00000): C8 (0.00140): C9 (0.00370): C10 (0.00000): 

administration iv: C1 (0.00000): C2 (0.00320): C3 (0.00000): C4 (0.00069): C5 (0.00000): C6 (0.00000): C7 (0.00000): C8 (0.00056): C9 (0.00161): C10 (0.00145): 

adrenal glands: C1 (0.00000): C2 (0.00187): C3 (0.00029): C4 (0.00768): C5 (0.00000): C6 (0.00000): C7 (0.00000): C8 (0.00532): C9 (0.00592): C10 (0.00342): 

air bronchograms: C1 (0.00000): C2 (0.00100): C3 (0.00029): C4 (0.00102): C5 (0.00044): C6 (0.00074): C7 (0.00184): C8 (0.00168): C9 (0.00207): C10 (0.00328): 

airways patent: C1 (0.00000): C2 (0.00341): C3 (0.00000): C4 (0.00068): C5 (0.00000): C6 (0.00000): C7 (0.00000): C8 (0.00252): C9 (0.00208): C10 (0.00407): 

also noted: C1 (0.00000): C2 (0.00155)

glass opacities: C1 (0.00000): C2 (0.00209): C3 (0.00000): C4 (0.00270): C5 (0.00000): C6 (0.00000): C7 (0.00023): C8 (0.00168): C9 (0.00383): C10 (0.00223): 

great vessels: C1 (0.00022): C2 (0.00276): C3 (0.00000): C4 (0.00234): C5 (0.00000): C6 (0.00000): C7 (0.00000): C8 (0.00111): C9 (0.00266): C10 (0.00263): 

greater left: C1 (0.00045): C2 (0.00242): C3 (0.00000): C4 (0.00100): C5 (0.00709): C6 (0.00057): C7 (0.00000): C8 (0.00028): C9 (0.00163): C10 (0.00328): 

greater right: C1 (0.00043): C2 (0.00033): C3 (0.00000): C4 (0.00301): C5 (0.00620): C6 (0.00154): C7 (0.00000): C8 (0.00056): C9 (0.00192): C10 (0.00276): 

ground glass: C1 (0.00000): C2 (0.00407): C3 (0.00000): C4 (0.00336): C5 (0.00000): C6 (0.00000): C7 (0.00045): C8 (0.00336): C9 (0.00711): C10 (0.00591): 

heart enlarged: C1 (0.00091): C2 (0.00077): C3 (0.00195): C4 (0.00000): C5 (0.00180): C6 (0.00057): C7 (0.00113): C8 (0.00056): C9 (0.00118): C10 (0.00158): 

heart failure: C1 (0.00627): C2 (0.00242): C3 (0.00

pulmonary vascularity: C1 (0.00134): C2 (0.00000): C3 (0.00454): C4 (0.00033): C5 (0.00348): C6 (0.00136): C7 (0.00119): C8 (0.00000): C9 (0.00000): C10 (0.00000): 

pulmonary vasculature: C1 (0.00193): C2 (0.00099): C3 (0.00536): C4 (0.00000): C5 (0.00361): C6 (0.00447): C7 (0.00316): C8 (0.00000): C9 (0.00030): C10 (0.00000): 

radiograph chest: C1 (0.00095): C2 (0.00000): C3 (0.00346): C4 (0.00000): C5 (0.00197): C6 (0.00345): C7 (0.00378): C8 (0.00000): C9 (0.00000): C10 (0.00000): 

radiology ct: C1 (0.00000): C2 (0.00276): C3 (0.00000): C4 (0.01136): C5 (0.00000): C6 (0.00000): C7 (0.00000): C8 (0.00588): C9 (0.00917): C10 (0.00000): 

reason assess: C1 (0.00388): C2 (0.00110): C3 (0.00187): C4 (0.00200): C5 (0.00463): C6 (0.00459): C7 (0.00229): C8 (0.00140): C9 (0.00074): C10 (0.00197): 

reason eval: C1 (0.00549): C2 (0.00342): C3 (0.01034): C4 (0.00367): C5 (0.00536): C6 (0.00817): C7 (0.00324): C8 (0.00140): C9 (0.00192): C10 (0.00500): 

reason evaluate: C1 (0.00135): C2 (0
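
#### A similar check can be sketched for terms. The three rows below are copied from the output above; the 0.001 threshold and the row-normalized "share" are illustrative assumptions, and the share ignores the cluster priors, so it is only a rough indicator of how spread out a term is.

```python
import numpy as np

# Three rows of Mu copied from the output above (clusters C1..C10).
terms = ['abdomen pelvis', 'radiology ct', 'reason eval']
Mu = np.array([
    [0, 0.00143, 0, 0.00904, 0, 0, 0, 0.00392, 0.00502, 0],
    [0, 0.00276, 0, 0.01136, 0, 0, 0, 0.00588, 0.00917, 0],
    [0.00549, 0.00342, 0.01034, 0.00367, 0.00536, 0.00817, 0.00324, 0.00140, 0.00192, 0.00500],
])

# Share of each term's probability mass per cluster (each row sums to 1).
share = Mu / Mu.sum(axis=1, keepdims=True)
# Number of clusters in which the term has non-negligible probability.
n_active = (Mu > 0.001).sum(axis=1)   # hypothetical threshold
for term, n, row in zip(terms, n_active, share):
    print(f'{term}: {n} clusters, largest share {row.max():.2f}')
```

#### A term such as `reason eval` that is active in every cluster carries little topical information, which makes it a candidate for the noise list.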