## Cluster Analysis with Multinomial EM (Text Clustering)

#### In the K-means clustering algorithm we forced each document to be assigned to exactly one cluster. This is known as hard clustering. This approach is not really accurate when working with language. When someone writes about a topic, they use a certain terminology to describe that topic. We assume that when someone else writes about the same topic, they are likely to use the same terminology or words. But words may also be used across multiple topics, and a clinical note may describe multiple topics. It may be more appropriate to assign words and documents to multiple topics (or clusters) with a certain probability based on their use.

#### This example illustrates how to group a set of different clinical notes based on their topics and the terminology used to describe the topics. We will model our clinical notes as a collection of unigram and bigram words. Specifically, we will represent each clinical note as a feature set of unigram and bigram frequencies found in the clinical note. We will use a matrix where each row will represent a clinical note and each column a feature (i.e. distinct unigram or bigram).

### Mixture of Multinomial Distributions

#### Text can be modeled as a mixture of multinomial distributions, where each topic has its own multinomial distribution over terms and each document is a mixture of topics.

#### Formally, let $p(c)\space=\space\pi_c$ be the prior probability of a document containing topic $c$, and let each topic $c$ be represented by a multinomial distribution $p(D_i|c)$ with parameters $\mu_{jc}$. Each document then becomes a mixture over topics:

#### $p(D_i)\space=\space \displaystyle\sum_{c=1}^{n_c} p(D_i|c)p(c) = \displaystyle\sum_{c=1}^{n_c} \pi_c \prod_{j=1}^{n_w}\mu_{jc}^{T_{ij}}$
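#### The mixture formula can be checked by hand on a toy example. The sketch below is only an illustration: the two topics, three terms, and all numeric values are made up, and the variable names (`pi`, `mu`, `t_i`) are assumptions that mirror the symbols above rather than names used later in this notebook.

```python
import numpy as np

# Toy setup (illustrative values, not taken from the corpus):
# 2 topics, 3 vocabulary terms.
pi = np.array([0.6, 0.4])        # pi_c: topic priors, sum to 1
mu = np.array([[0.7, 0.1],       # mu[j, c]: probability of term j in topic c
               [0.2, 0.3],
               [0.1, 0.6]])      # each column sums to 1
t_i = np.array([2, 0, 1])        # T_ij: count of each term in document i

# p(D_i | c) = prod_j mu_jc ** T_ij, evaluated for every topic at once
likelihood_per_topic = np.prod(mu ** t_i[:, None], axis=0)

# p(D_i) = sum_c pi_c * p(D_i | c)
p_doc = float(np.sum(pi * likelihood_per_topic))
print(likelihood_per_topic, p_doc)
```

#### Here the document is far more likely under the first topic (0.049 versus 0.006), so the first topic contributes most of the mixture probability.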

### Expectation Maximization for Mixtures of Multinomials

#### The expectation maximization algorithm will allow us to fit a multinomial mixture model to our data. Our goal is to identify which documents belong to which topics and what words (unigrams and bigrams) are used to describe the topic. 

#### 1. E-Step. Compute the expectation that document $D_i$ belongs to topic (cluster) $c$

####         $\gamma_{ic} \propto \pi_c \displaystyle\prod_{j = 1}^{n_w} \mu_{jc}^{T_{ij}}$

#### $Note: \space normalize \space expectations \space over \space c$

#### where,
#### $\pi_c \equiv prior{\space}probability{\space}of{\space}document{\space}containing{\space}topic{\space}c$
#### $\mu_{jc} \equiv probability{\space}of{\space}w_j{\space}in{\space}topic{\space}c$
#### $T_{ij} \equiv count \space of \space w_j \space in \space document \space D_i$
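#### Products of many small probabilities can underflow to zero for long documents. A common alternative, sketched below under stated assumptions, is to evaluate the E-step in log space. The function name `e_step_log`, the `eps` guard against `log(0)`, and the toy inputs are all hypothetical and are not part of this notebook's implementation.

```python
import numpy as np

def e_step_log(T, mu, pi, eps=1e-12):
    """Responsibilities gamma[i, c], computed in log space.

    T  : (n_d, n_w) array of term counts
    mu : (n_w, n_c) array, each column a distribution over terms
    pi : (n_c,) array of topic priors
    eps is a small constant (an assumption of this sketch) that avoids log(0).
    """
    # log(pi_c) + sum_j T_ij * log(mu_jc)
    log_gamma = np.log(pi + eps) + T @ np.log(mu + eps)
    # Subtract the row maximum so the largest term becomes exp(0) = 1
    log_gamma -= log_gamma.max(axis=1, keepdims=True)
    gamma = np.exp(log_gamma)
    # Normalize over c so each row sums to 1
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy inputs (illustrative values only)
T = np.array([[2, 0, 1],
              [0, 3, 1]])
mu = np.array([[0.7, 0.1],
               [0.2, 0.3],
               [0.1, 0.6]])
pi = np.array([0.6, 0.4])

gamma = e_step_log(T, mu, pi)
print(gamma)
```

#### Subtracting the row maximum changes every cluster's score for a document by the same amount, so it cancels in the normalization and leaves the responsibilities unchanged.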


#### 2. M-Step. Update the mixture parameters. 

#### $\mu_{jc} = \frac{\sum_{i = 1}^{n_d} \gamma_{ic}T_{ij}}{\sum_{i = 1}^{n_d}\sum_{l = 1}^{n_w} \gamma_{ic}T_{il}} \equiv probability \space of \space a \space word \space being \space w_j \space in \space topic \space (cluster) \space c$ 

#### $\pi_c = \frac{1}{n_d} \displaystyle \sum_{i = 1}^{n_d} \gamma_{ic} \equiv prior \space probability \space of \space each \space cluster$

#### $Note: \space initialize \space priors \space uniformly, \space initialize \space \mu's \space to \space multinomials \space generated \space from \space a \space uniform \space Dirichlet \space distribution \space such \space that \sum_{j = 1}^{n_w} \mu_{jc} = 1$
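#### The two update formulas and the initialization described in the note can be sketched with plain NumPy arrays. This is a minimal illustration, not the notebook's implementation: `m_step`, the fixed random seed, and the toy `T` and `gamma` values are assumptions made for the example.

```python
import numpy as np

def m_step(T, gamma):
    """Update mu (n_w, n_c) and pi (n_c,) from responsibilities gamma (n_d, n_c)."""
    # Numerator: sum_i gamma_ic * T_ij, arranged as an (n_w, n_c) array
    weighted_counts = T.T @ gamma
    # Denominator: total weighted count per topic, so each column of mu sums to 1
    mu = weighted_counts / weighted_counts.sum(axis=0, keepdims=True)
    # pi_c = (1 / n_d) * sum_i gamma_ic
    pi = gamma.mean(axis=0)
    return mu, pi

# Initialization as described in the note above
rng = np.random.default_rng(0)                 # fixed seed, only for repeatability
n_w, n_c = 3, 2
pi0 = np.full(n_c, 1.0 / n_c)                  # uniform priors
mu0 = rng.dirichlet(np.ones(n_w), size=n_c).T  # each column drawn from a uniform Dirichlet

# Toy counts and responsibilities (illustrative values only)
T = np.array([[2, 0, 1],
              [0, 3, 1]])
gamma = np.array([[0.9, 0.1],
                  [0.2, 0.8]])

mu, pi = m_step(T, gamma)
print(mu.sum(axis=0), pi)
```

#### Both `mu0` and the updated `mu` have columns that sum to 1, which is the constraint stated in the note.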

### Implementation

#### Our implementation will consist of manipulating 4 matrices as described below.

![title](./images/MultinomialEMClustering.png)


### Environment Setup

#### First we will import some Python packages that we will use.

In [1]:
import ipdb
import nltk
import re
import pandas as pd
import numpy as np 
import numpy.matlib as ml
import pickle as pkl
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

### Syntactic NLP Processing

#### We need a function to tokenize our text and remove noise like dates, ages, etc.

In [2]:
def tokenize(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = [ token for token in tokens if re.search('(^[a-zA-Z]+$)', token) ]
    return filtered_tokens
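
#### To see what the regular expression in `tokenize` keeps and drops, the filter can be tried in isolation. The token list below is hypothetical and pre-split by hand. In the notebook the tokens come from NLTK, whose splitting may differ.

```python
import re

# Same pattern as in tokenize(): keep tokens made only of ASCII letters
ALPHA_ONLY = re.compile(r'^[a-zA-Z]+$')

# Hypothetical, pre-split tokens (the notebook obtains them from NLTK instead)
tokens = ['CT', 'CHEST', 'W/O', 'CONTRAST', '10:25', 'AM', '78',
          'year-old', 'effusion', '.']

kept = [tok for tok in tokens if ALPHA_ONLY.search(tok)]
print(kept)
```

#### Times, numbers, punctuation, and tokens containing slashes or hyphens are dropped, while purely alphabetic tokens such as `AM` survive and have to be handled by the stop word list.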

### Retrieving our Corpus

#### Let's pull in our corpus that we had serialized out to disk.  

In [3]:
with open('differential-corpus.pkl', 'rb') as file:
    corpus = pkl.load(file)
corpus.head()

Unnamed: 0,text,label
0,[**2996-12-2**] 10:25 AM\n CT CHEST W/O CONTRA...,PNA
1,[**3201-9-21**] 4:50 PM\n CT CHEST W/CONTRAST ...,PNA
2,[**3299-6-23**] 5:06 PM\n CT CHEST W/CONTRAST ...,PNA
3,[**3186-6-14**] 2:54 PM\n CT CHEST W/CONTRAST ...,PNA
4,[**2500-1-17**] 9:41 PM\n CT CHEST W/O CONTRAS...,PNA


### More Syntactic Processing

#### We will want to get rid of stop words that are essentially noise.

In [4]:
cachedStopWords = stopwords.words("english") 
noisywords = ['year', 'old', 'man', 'woman', 'ap', 'am', 'pm', 'portable', 'pa', 'lat', 'admitting', 'diagnosis', 'lateral']
cachedStopWords.extend(noisywords)
print(cachedStopWords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

### Generate Document-Term Frequency Counts

#### In this step we tokenize our text and remove stop words in addition to generating our frequency counts.

#### 1) How many documents are we working with, and how many features (unigrams & bigrams)?

#### 2) Can you figure out what max_df and min_df are doing to our feature count?

In [5]:
corpusList = corpus['text'].tolist()
labels = []
idx = 1
for label in corpus['label'].tolist():
    labels.append(label + "-" + str(idx))
    idx = idx + 1
rank = list(range(len(labels)))

# ngram_range=(2, 2) extracts bigrams only; (1, 2) would also include unigrams
cv = CountVectorizer(lowercase=True, max_df=0.80, max_features=None, min_df=0.033,
                     ngram_range=(2, 2), preprocessor=None, stop_words=cachedStopWords,
                     strip_accents=None, tokenizer=tokenize, vocabulary=None)
SparseT = cv.fit_transform(corpusList)
print("The dimensions of our document-term matrix")
print(SparseT.shape)
print()



The dimensions of our document-term matrix
(1500, 342)
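
#### The effect of document-frequency thresholds can be illustrated without scikit-learn. The sketch below assumes that fractional `min_df` and `max_df` values are read as proportions of the number of documents, which is how we understand `CountVectorizer` to treat floats. The helper `filter_by_document_frequency` and the four tiny documents are hypothetical.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms whose document frequency lies within [min_df, max_df] (as proportions)."""
    n_docs = len(docs)
    # Count each term once per document it appears in
    df = Counter(term for doc in docs for term in set(doc))
    return sorted(t for t, n in df.items()
                  if min_df * n_docs <= n <= max_df * n_docs)

# Hypothetical bigram lists for four tiny documents
docs = [
    ['chest ct', 'pleural effusion'],
    ['chest ct', 'pleural effusion', 'air bronchograms'],
    ['chest ct', 'pleural effusion'],
    ['chest ct', 'central venous'],
]

vocab = filter_by_document_frequency(docs, min_df=0.5, max_df=0.8)
print(vocab)
```

#### In this toy case `chest ct` is dropped for appearing in every document and the rare bigrams are dropped for appearing in only one. Under the same reading, with 1500 documents, `min_df=0.033` corresponds to a floor of 49.5 documents and `max_df=0.80` to a ceiling of 1200 documents.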



### Feature Set

#### Let's take a look at our feature set.

#### 1) Do we have a lot of noise in our features?

In [6]:
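# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use cv.get_feature_names()
lexicon = cv.get_feature_names_out().tolist()
print(lexicon)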
print()

['abdomen pelvis', 'administration cc', 'administration iv', 'adrenal glands', 'air bronchograms', 'airways patent', 'also noted', 'amt final', 'amt underlying', 'approximately cm', 'areas consolidation', 'artery calcifications', 'atelectasis left', 'atelectasis right', 'axial images', 'axillary lymphadenopathy', 'bibasilar atelectasis', 'bilateral lower', 'bilateral pleural', 'bone windows', 'c recons', 'cardiac mediastinal', 'cardiac silhouette', 'cardiomediastinal silhouette', 'cc optiray', 'central venous', 'centrilobular emphysema', 'change final', 'chest clip', 'chest compared', 'chest comparison', 'chest contrast', 'chest ct', 'chest history', 'chest indication', 'chest obtained', 'chest pain', 'chest performed', 'chest radiograph', 'chest radiographs', 'chest single', 'chest tube', 'chest w', 'chest wall', 'chest without', 'chf final', 'clinical history', 'clinical indication', 'cm carina', 'compared location', 'compared prior', 'comparison chest', 'comparison ct', 'comparison 

## Define EM Algorithm (E-Step & M-Step)

#### Now let's define our EM Algorithm.

#### 1) In the ExpD function, why are we multiplying by $1 \times 10^2$?

In [8]:
  
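def ExpD(T, Mu, Pi):
    """E-step: expected membership of each document in each cluster."""
    C_n = Pi.shape[1]
    D_n = T.shape[0]
    Gamma = ml.zeros((D_n, C_n))  # float64 by default
    for c in range(0, C_n):
        # Pi_c * prod_j (Mu_jc * 1e2) ** T_ij, computed for all documents at once
        Gamma[:, c] = Pi[0][c] * ((Mu[:,c].A[:,0]*1e2)**T).prod(1)
    # Normalize over clusters so each row of Gamma sums to 1
    Gamma = Gamma / Gamma.sum(axis=1)
    return Gamma

def updateMu(T, Gamma):
    """M-step: update the term probabilities for each cluster."""
    C_n = Gamma.shape[1]
    W_n = T.shape[1]
    Mu = ml.zeros((W_n, C_n))
    for c in range(0, C_n):
        # Gamma-weighted count of each term, and the total over all terms
        numerator = sum(np.multiply(Gamma[:,c],T)).T
        denominator = sum(np.multiply(Gamma[:,c],T).sum(1))
        Mu[:,c] = numerator / denominator
    # Note: this rescales each term (row) to sum to 1 across clusters, which
    # differs from the M-step formula above, where each cluster (column) sums
    # to 1. The printed results in this notebook were produced with this line.
    Mu = Mu / Mu.sum(axis=1)
    return Mu

def updatePi(Gamma):
    """M-step: update the cluster priors as the mean expected membership."""
    D_n = Gamma.shape[0]
    Pi = sum(Gamma) / D_n
    return Pi.A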
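
#### One way to explore the scaling question is to compare responsibilities computed with and without a constant factor. The sketch below uses plain NumPy arrays and a hypothetical helper `responsibilities`. It is not the `ExpD` function itself, and the toy values are made up.

```python
import numpy as np

def responsibilities(T, mu, pi, scale=1.0):
    """E-step on a count matrix, with an optional constant applied to every mu_jc."""
    # prod_j (scale * mu_jc) ** T_ij for every document i and topic c
    prods = np.stack([np.prod((scale * mu[:, c]) ** T, axis=1)
                      for c in range(mu.shape[1])], axis=1)
    gamma = pi * prods
    # Normalize over topics so each row sums to 1
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy inputs (illustrative values only)
T = np.array([[2, 0, 1],
              [0, 3, 1]])
mu = np.array([[0.7, 0.1],
               [0.2, 0.3],
               [0.1, 0.6]])
pi = np.array([0.6, 0.4])

plain = responsibilities(T, mu, pi)
scaled = responsibilities(T, mu, pi, scale=1e2)
print(np.allclose(plain, scaled))
```

#### The factor multiplies every cluster's product for document $i$ by the same amount, the scale raised to the document's total term count, so it cancels in the normalization. Its only role is numerical: it keeps the products of small probabilities away from zero, although a factor that is too large could overflow instead.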
    

## Let's Run it !

#### Let's start with 10 topics (clusters) and iterate 100 times. EM converges quickly.

#### 1) Can you determine at what iteration we are starting to reach convergence?

In [9]:
T = SparseT.todense()
D_n, W_n = T.shape
C_n = 10
Pi = ml.repmat(1/C_n, 1, C_n)
Mu = ml.mat(np.random.dirichlet(np.ones(W_n), C_n).T)
#ipdb.set_trace()
for i in range(1,101):
    print('Iteration: ' + str(i)) 
    Gamma = ExpD(T, Mu, Pi)
    print(Gamma.sum(0))
    Mu = updateMu(T, Gamma)
    Pi = updatePi(Gamma)


Iteration: 1
[[105.18612807 197.80101965 128.63343704  80.30029956 100.91433314
  338.52814496 148.87384491 159.03459982 179.66507511  61.06311775]]
Iteration: 2
[[ 64.23754153 160.56738063  83.75640679 203.03923549 173.08233883
  109.11233881 133.6056593  379.98403004 110.50986886  82.10519973]]
Iteration: 3
[[ 69.13483811 130.44497789  77.08146786 227.39446618 184.59853098
   90.57093711 173.15869942 351.80560405  90.60406219 105.20641621]]
Iteration: 4
[[ 75.69643362 115.68178966  78.19575565 233.94838505 188.07468207
   95.75630705 188.92625311 317.89985516  85.78888508 120.03165355]]
Iteration: 5
[[ 81.52493336 108.35130966  82.99324177 234.76322252 191.33738748
   98.396271   195.10669945 291.39142949  83.8063451  132.32916016]]
Iteration: 6
[[ 86.40931926 101.51399001  88.04656399 232.30351488 193.53537431
  104.42739292 199.07662972 270.00529778  81.69122274 142.99069438]]
Iteration: 7
[[ 88.57337213  98.63324354  89.31461652 232.892029   189.19616191
  105.78117208 201.6190423

Iteration: 56
[[ 95.79683168 123.92660351  97.73545862 216.65516323 186.57646185
  124.06208421 182.52910725 221.73191435  67.91765229 183.06872301]]
Iteration: 57
[[ 95.82113079 124.89808136  98.00963554 216.8675917  186.84572538
  124.05572642 182.3797135  221.07890005  66.91353387 183.1299614 ]]
Iteration: 58
[[ 95.83783114 125.05470967  97.84364011 217.26120743 187.02942574
  124.05992442 182.05005163 220.71933392  66.96513734 183.17873861]]
Iteration: 59
[[ 95.86077592 125.02462816  97.42472605 217.74769718 187.16668079
  124.07712769 181.58942184 220.4961333   67.40506061 183.20774846]]
Iteration: 60
[[ 95.89503134 124.99317259  97.23863787 217.8023979  187.41420879
  124.11215636 181.42713768 220.3534857   67.55405083 183.20972093]]
Iteration: 61
[[ 95.93409846 124.96056431  96.74788652 217.81777823 188.08787057
  124.12506933 181.3530366  220.25981811  67.61663089 183.09724697]]
Iteration: 62
[[ 95.99625965 124.90232465  96.44521039 217.81656903 188.80896185
  124.11767778 181.
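
#### A rough way to judge convergence from this output is to track how much the expected cluster sizes move between iterations. The helper `max_mass_change` and the tolerance below are hypothetical. Monitoring the log-likelihood is the more standard criterion, and this sketch is only a proxy for it.

```python
def max_mass_change(previous, current):
    """Largest absolute change in expected cluster size between two iterations."""
    return max(abs(p - c) for p, c in zip(previous, current))

# Expected cluster sizes printed above for iterations 56 and 57
iter_56 = [95.79683168, 123.92660351, 97.73545862, 216.65516323, 186.57646185,
           124.06208421, 182.52910725, 221.73191435, 67.91765229, 183.06872301]
iter_57 = [95.82113079, 124.89808136, 98.00963554, 216.8675917, 186.84572538,
           124.05572642, 182.3797135, 221.07890005, 66.91353387, 183.1299614]

change = max_mass_change(iter_56, iter_57)
tolerance = 0.5  # hypothetical threshold, measured in documents
print(change, change < tolerance)
```

#### Between these two iterations the largest shift is still about one document, so under this particular tolerance the run would not yet count as converged.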

## Examination of Clusters and Terminology

#### Let's take a look at the top cluster for each clinical note and the top 20 words to distinguish this topic.

#### 1) Is there noise in the terminology? If there is, how can we get rid of it?


In [10]:
clusters = Gamma.argmax(1).A
clusters = [i[0] for i in clusters]
clinicalDocuments = { 'labels': labels, 'rank': rank, 'corpus': corpusList, 'cluster': clusters }
frame = pd.DataFrame(clinicalDocuments, index = clusters , columns = ['rank', 'labels', 'corpus', 'cluster'])
grouped = frame['rank'].groupby(frame['cluster'])
topWords = Mu.T.argsort()[:, ::-1]
for i in range(C_n):
    n = len(frame.loc[i]['labels'].values.tolist())
    print('Cluster %d (%d):,' % (i+1, n), end='')
    for label in frame.loc[i]['labels'].values.tolist():
        print(' %s,' % label, end='')
    print()
    print()
    print('           Words:', end='')
    for indice in list(topWords[i, :20].A[0]):
        print(' %s (%.5f)' % (lexicon[indice], Mu[indice,i]), end=',')
    print()
    print()
    print()

Cluster 1 (97):, PNA-53, PNA-83, PNA-86, PNA-91, PNA-113, PNA-120, PNA-122, PNA-125, PNA-142, PNA-147, PNA-159, PNA-170, PNA-177, PNA-183, PNA-188, PNA-191, PNA-204, PNA-207, PNA-215, PNA-239, PNA-268, PNA-271, PNA-272, PNA-284, PNA-300, PNA-305, PNA-315, PNA-318, PNA-325, PNA-330, PNA-332, PNA-353, PNA-360, PNA-373, PNA-382, PNA-408, PNA-453, PNA-459, PNA-461, PNA-473, PNA-479, PNA-486, PNA-488, CHF-523, CHF-585, CHF-607, CHF-615, CHF-696, CHF-715, CHF-778, CHF-788, CHF-873, CHF-888, CHF-916, CHF-946, CHF-970, CHF-982, CHF-984, CHF-999, COPD-1002, COPD-1003, COPD-1004, COPD-1008, COPD-1009, COPD-1011, COPD-1020, COPD-1033, COPD-1036, COPD-1039, COPD-1065, COPD-1074, COPD-1106, COPD-1107, COPD-1113, COPD-1119, COPD-1125, COPD-1134, COPD-1141, COPD-1196, COPD-1199, COPD-1203, COPD-1210, COPD-1215, COPD-1219, COPD-1228, COPD-1231, COPD-1251, COPD-1257, COPD-1311, COPD-1312, COPD-1352, COPD-1364, COPD-1448, COPD-1486, COPD-1487, COPD-1494, COPD-1497,

           Words: recons ct (0.82929)

### Soft Clustering Document Examination

#### 1) Can you find clinical notes that belong to more than 1 cluster?


In [12]:
N, C = Gamma.shape
for i in range(N):
    print('%s: ' % labels[i], end='')
    for j in range(C):
        print('C%d (%.7f): ' % (j+1, Gamma[i, j]), end='')
    print()
    print()
print()

PNA-1: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (1.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-2: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (1.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-3: C1 (0.0000000): C2 (0.0000000): C3 (1.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-4: C1 (0.0000000): C2 (0.0000000): C3 (1.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-5: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (1.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-6: C1 (0.0000000): C2 (0.0000000): C3 (1.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000):


PNA-147: C1 (0.0000000): C2 (0.0000000): C3 (1.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-148: C1 (0.0000000): C2 (0.9999918): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000082): C9 (0.0000000): C10 (0.0000000): 

PNA-149: C1 (0.0000217): C2 (0.0352518): C3 (0.0000000): C4 (0.0000000): C5 (0.9545276): C6 (0.0000000): C7 (0.0000000): C8 (0.0101988): C9 (0.0000000): C10 (0.0000000): 

PNA-150: C1 (0.0000000): C2 (0.0000000): C3 (0.0000111): C4 (0.0000000): C5 (0.0000000): C6 (0.9999889): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-151: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (1.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-152: C1 (0.0000000): C2 (0.0000000): C3 (1.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9

PNA-234: C1 (0.0000000): C2 (0.0000000): C3 (1.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-235: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (1.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-236: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000010): C6 (0.0000000): C7 (0.0000000): C8 (0.9999990): C9 (0.0000000): C10 (0.0000000): 

PNA-237: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (1.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-238: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (1.0000000): C10 (0.0000000): 

PNA-239: C1 (0.0000000): C2 (0.0000000): C3 (1.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 

PNA-355: C1 (0.6645863): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.2193789): C6 (0.0000000): C7 (0.0000000): C8 (0.1160348): C9 (0.0000000): C10 (0.0000000): 

PNA-356: C1 (0.0271714): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.9728286): C9 (0.0000000): C10 (0.0000000): 

PNA-357: C1 (1.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-358: C1 (0.9998362): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0001637): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-359: C1 (0.0000000): C2 (0.0000000): C3 (1.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-360: C1 (0.0000000): C2 (0.0000000): C3 (1.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 

PNA-480: C1 (0.0000000): C2 (0.0000000): C3 (0.0001476): C4 (0.0000000): C5 (0.0000000): C6 (0.9998524): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

PNA-481: C1 (0.0000014): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.9999472): C6 (0.0000000): C7 (0.0000000): C8 (0.0000514): C9 (0.0000000): C10 (0.0000000): 

PNA-482: C1 (0.0000000): C2 (0.0667021): C3 (0.0000001): C4 (0.0000000): C5 (0.0000001): C6 (0.0000004): C7 (0.0000002): C8 (0.9332971): C9 (0.0000000): C10 (0.0000000): 

PNA-483: C1 (0.9993884): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0002022): C6 (0.0000000): C7 (0.0000000): C8 (0.0004095): C9 (0.0000000): C10 (0.0000000): 

PNA-484: C1 (0.1774751): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.7832234): C6 (0.0000000): C7 (0.0000000): C8 (0.0393016): C9 (0.0000000): C10 (0.0000000): 

PNA-485: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 

CHF-605: C1 (0.0000897): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.9999103): C9 (0.0000000): C10 (0.0000000): 

CHF-606: C1 (0.0002728): C2 (0.0000044): C3 (0.0000000): C4 (0.0000000): C5 (0.0179588): C6 (0.0000000): C7 (0.0000000): C8 (0.9817640): C9 (0.0000000): C10 (0.0000000): 

CHF-607: C1 (0.0000000): C2 (0.0000000): C3 (1.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-608: C1 (0.9890005): C2 (0.0109995): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-609: C1 (0.9986006): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000001): C6 (0.0000000): C7 (0.0000000): C8 (0.0013993): C9 (0.0000000): C10 (0.0000000): 

CHF-610: C1 (0.9932073): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0067927): C9 

CHF-730: C1 (0.9990172): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0009828): C9 (0.0000000): C10 (0.0000000): 

CHF-731: C1 (0.9799437): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000001): C6 (0.0000000): C7 (0.0000000): C8 (0.0200562): C9 (0.0000000): C10 (0.0000000): 

CHF-732: C1 (0.0714173): C2 (0.9285827): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-733: C1 (0.9293875): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0706125): C9 (0.0000000): C10 (0.0000000): 

CHF-734: C1 (0.9999986): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000014): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-735: C1 (0.0000623): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.9999377): C9 

CHF-855: C1 (0.3778739): C2 (0.0797220): C3 (0.0000000): C4 (0.0000000): C5 (0.0360990): C6 (0.0000000): C7 (0.0000000): C8 (0.5063051): C9 (0.0000000): C10 (0.0000000): 

CHF-856: C1 (0.9998552): C2 (0.0000040): C3 (0.0000000): C4 (0.0000000): C5 (0.0000197): C6 (0.0000000): C7 (0.0000000): C8 (0.0001211): C9 (0.0000000): C10 (0.0000000): 

CHF-857: C1 (0.9160739): C2 (0.0839261): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-858: C1 (0.9999992): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000008): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-859: C1 (1.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-860: C1 (0.0044770): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.9955230): C9 

CHF-980: C1 (0.9987451): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0012549): C9 (0.0000000): C10 (0.0000000): 

CHF-981: C1 (0.0000758): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.9999242): C9 (0.0000000): C10 (0.0000000): 

CHF-982: C1 (0.0000000): C2 (0.0000000): C3 (1.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-983: C1 (0.7172825): C2 (0.2775119): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0052055): C9 (0.0000000): C10 (0.0000000): 

CHF-984: C1 (0.0000000): C2 (0.0000000): C3 (1.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

CHF-985: C1 (0.9999598): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000352): C6 (0.0000000): C7 (0.0000000): C8 (0.0000050): C9 


COPD-1105: C1 (0.0000000): C2 (0.9961632): C3 (0.0000189): C4 (0.0000000): C5 (0.0000249): C6 (0.0000000): C7 (0.0000204): C8 (0.0037725): C9 (0.0000000): C10 (0.0000000): 

COPD-1106: C1 (0.0000000): C2 (0.0000000): C3 (1.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1107: C1 (0.0000000): C2 (0.0000000): C3 (1.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1108: C1 (0.0000566): C2 (0.9999429): C3 (0.0000000): C4 (0.0000000): C5 (0.0000004): C6 (0.0000000): C7 (0.0000000): C8 (0.0000001): C9 (0.0000000): C10 (0.0000000): 

COPD-1109: C1 (0.0009008): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.9990992): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1110: C1 (0.1319795): C2 (0.1077529): C3 (0.0000000): C4 (0.0000000): C5 (0.7572050): C6 (0.0000000): C7 (0.0000084): C8 (0.

COPD-1229: C1 (0.0111074): C2 (0.9880120): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0008805): C9 (0.0000000): C10 (0.0000000): 

COPD-1230: C1 (0.0000000): C2 (0.0000000): C3 (0.0030040): C4 (0.0000417): C5 (0.0000000): C6 (0.0025398): C7 (0.1676805): C8 (0.0000000): C9 (0.8267339): C10 (0.0000000): 

COPD-1231: C1 (0.0000000): C2 (0.0000000): C3 (1.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1232: C1 (0.9479812): C2 (0.0062098): C3 (0.0000000): C4 (0.0000000): C5 (0.0065558): C6 (0.0000000): C7 (0.0000000): C8 (0.0392533): C9 (0.0000000): C10 (0.0000000): 

COPD-1233: C1 (1.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1234: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.9955889): C6 (0.0000000): C7 (0.0000000): C8 (0.0

COPD-1354: C1 (0.0000117): C2 (0.0000032): C3 (0.0000000): C4 (0.0000000): C5 (0.9999830): C6 (0.0000000): C7 (0.0000000): C8 (0.0000021): C9 (0.0000000): C10 (0.0000000): 

COPD-1355: C1 (0.0002894): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.9942706): C6 (0.0000000): C7 (0.0000000): C8 (0.0054400): C9 (0.0000000): C10 (0.0000000): 

COPD-1356: C1 (0.0013109): C2 (0.0000000): C3 (0.0000314): C4 (0.0000000): C5 (0.7882414): C6 (0.0000000): C7 (0.0017628): C8 (0.2086536): C9 (0.0000000): C10 (0.0000000): 

COPD-1357: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.3471250): C6 (0.0000000): C7 (0.0705713): C8 (0.5823036): C9 (0.0000000): C10 (0.0000000): 

COPD-1358: C1 (0.0000055): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.9999945): C6 (0.0000000): C7 (0.0000000): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1359: C1 (0.0001303): C2 (0.9774238): C3 (0.0000000): C4 (0.0000000): C5 (0.0224261): C6 (0.0000000): C7 (0.0000000): C8 (0.0

COPD-1479: C1 (0.0005435): C2 (0.0427607): C3 (0.0000177): C4 (0.0000000): C5 (0.9565735): C6 (0.0000000): C7 (0.0000531): C8 (0.0000515): C9 (0.0000000): C10 (0.0000000): 

COPD-1480: C1 (0.0000000): C2 (0.0000000): C3 (0.0451185): C4 (0.0000000): C5 (0.0000000): C6 (0.0359430): C7 (0.0289383): C8 (0.0000000): C9 (0.8900002): C10 (0.0000000): 

COPD-1481: C1 (0.0000000): C2 (0.0000000): C3 (0.2662516): C4 (0.0030333): C5 (0.0000000): C6 (0.0000000): C7 (0.7307151): C8 (0.0000000): C9 (0.0000000): C10 (0.0000000): 

COPD-1482: C1 (0.0000006): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (0.0045582): C6 (0.0000000): C7 (0.0000000): C8 (0.9954412): C9 (0.0000000): C10 (0.0000000): 

COPD-1483: C1 (0.0000000): C2 (0.3072115): C3 (0.0972152): C4 (0.0001619): C5 (0.5670844): C6 (0.0000000): C7 (0.0253537): C8 (0.0029732): C9 (0.0000000): C10 (0.0000000): 

COPD-1484: C1 (0.0000000): C2 (0.0000000): C3 (0.0000000): C4 (0.0000000): C5 (1.0000000): C6 (0.0000000): C7 (0.0000000): C8 (0.0
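
#### Scanning 1500 rows by eye is tedious, so a small filter can flag notes with meaningful membership in several clusters. The function `mixed_membership` and its threshold of 0.05 are assumptions chosen for illustration. The two rows are copied from the output above.

```python
def mixed_membership(gamma_row, threshold=0.05):
    """Return 1-based cluster indices whose responsibility exceeds the threshold."""
    return [c + 1 for c, g in enumerate(gamma_row) if g > threshold]

# Rows copied from the output above
pna_355 = [0.6645863, 0.0, 0.0, 0.0, 0.2193789, 0.0, 0.0, 0.1160348, 0.0, 0.0]
pna_357 = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(mixed_membership(pna_355))
print(mixed_membership(pna_357))
```

#### With this threshold, PNA-355 is spread over clusters 1, 5, and 8, while PNA-357 sits entirely in cluster 1. Applying the same function to each row of `Gamma` would list every note with more than one returned index.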

### Soft Clustering Term Examination

#### 1) Can you find terms that belong to more than 1 cluster?

In [13]:
W, C = Mu.shape
for i in range(W):
    print('%s: ' % lexicon[i], end='')
    for j in range(C):
        print('C%d (%.5f): ' % (j+1, Mu[i, j]), end='')
    print()
    print()
print()

abdomen pelvis: C1 (0.00000): C2 (0.00000): C3 (0.00869): C4 (0.47638): C5 (0.00000): C6 (0.12511): C7 (0.01129): C8 (0.00000): C9 (0.37854): C10 (0.00000): 

administration cc: C1 (0.00000): C2 (0.00000): C3 (0.26702): C4 (0.18234): C5 (0.00000): C6 (0.31650): C7 (0.00000): C8 (0.00000): C9 (0.00000): C10 (0.23414): 

administration iv: C1 (0.00000): C2 (0.00000): C3 (0.28424): C4 (0.09320): C5 (0.00000): C6 (0.23863): C7 (0.13843): C8 (0.00000): C9 (0.08594): C10 (0.15956): 

adrenal glands: C1 (0.00000): C2 (0.00000): C3 (0.07635): C4 (0.27673): C5 (0.00000): C6 (0.11777): C7 (0.12756): C8 (0.00000): C9 (0.30358): C10 (0.09802): 

air bronchograms: C1 (0.01080): C2 (0.02080): C3 (0.09059): C4 (0.11890): C5 (0.00000): C6 (0.09172): C7 (0.22030): C8 (0.11246): C9 (0.02909): C10 (0.30534): 

airways patent: C1 (0.00000): C2 (0.00000): C3 (0.26959): C4 (0.16511): C5 (0.00000): C6 (0.21832): C7 (0.27974): C8 (0.00000): C9 (0.06725): C10 (0.00000): 

also noted: C1 (0.05867): C2 (0.08681)

initial pre: C1 (0.29792): C2 (0.00000): C3 (0.01786): C4 (0.00000): C5 (0.62940): C6 (0.00000): C7 (0.01160): C8 (0.00000): C9 (0.04322): C10 (0.00000): 

internal jugular: C1 (0.23322): C2 (0.41043): C3 (0.01544): C4 (0.09007): C5 (0.12058): C6 (0.02776): C7 (0.02020): C8 (0.05743): C9 (0.02487): C10 (0.00000): 

interstitial edema: C1 (0.27594): C2 (0.02743): C3 (0.02193): C4 (0.05116): C5 (0.20632): C6 (0.03946): C7 (0.04749): C8 (0.19889): C9 (0.00000): C10 (0.13138): 

interval change: C1 (0.14163): C2 (0.10800): C3 (0.02281): C4 (0.00998): C5 (0.42047): C6 (0.00000): C7 (0.08698): C8 (0.16829): C9 (0.04183): C10 (0.00000): 

interval improvement: C1 (0.35549): C2 (0.21042): C3 (0.03810): C4 (0.00000): C5 (0.11056): C6 (0.03429): C7 (0.04126): C8 (0.20989): C9 (0.00000): C10 (0.00000): 

interval increase: C1 (0.13767): C2 (0.09877): C3 (0.19028): C4 (0.06342): C5 (0.00000): C6 (0.00000): C7 (0.10598): C8 (0.36001): C9 (0.04386): C10 (0.00000): 

intravenous contrast: C1 (0.00000

pleural effusions: C1 (0.31591): C2 (0.11648): C3 (0.06325): C4 (0.07969): C5 (0.04618): C6 (0.05237): C7 (0.10135): C8 (0.07885): C9 (0.07769): C10 (0.06822): 

pleural thickening: C1 (0.10802): C2 (0.02267): C3 (0.13784): C4 (0.00000): C5 (0.55752): C6 (0.00000): C7 (0.10371): C8 (0.00000): C9 (0.07024): C10 (0.00000): 

pneumonia final: C1 (0.04155): C2 (0.06710): C3 (0.02957): C4 (0.09054): C5 (0.11067): C6 (0.01996): C7 (0.21134): C8 (0.06894): C9 (0.16102): C10 (0.19931): 

pneumonia underlying: C1 (0.05477): C2 (0.11642): C3 (0.01426): C4 (0.03322): C5 (0.15301): C6 (0.01281): C7 (0.13570): C8 (0.15266): C9 (0.08055): C10 (0.24661): 

pneumothorax identified: C1 (0.20528): C2 (0.30217): C3 (0.00641): C4 (0.02242): C5 (0.18066): C6 (0.00000): C7 (0.04995): C8 (0.23312): C9 (0.00000): C10 (0.00000): 

pre number: C1 (0.32127): C2 (0.00000): C3 (0.00000): C4 (0.00000): C5 (0.67873): C6 (0.00000): C7 (0.00000): C8 (0.00000): C9 (0.00000): C10 (0.00000): 

prior chest: C1 (0.07687): 

without iv: C1 (0.00000): C2 (0.00000): C3 (0.05050): C4 (0.09815): C5 (0.00000): C6 (0.03030): C7 (0.16771): C8 (0.00000): C9 (0.50210): C10 (0.15125): 

x cm: C1 (0.00435): C2 (0.00000): C3 (0.10581): C4 (0.15292): C5 (0.00000): C6 (0.32288): C7 (0.08963): C8 (0.00000): C9 (0.20039): C10 (0.12402): 

zone redistribution: C1 (0.50766): C2 (0.00000): C3 (0.00000): C4 (0.00000): C5 (0.19131): C6 (0.00000): C7 (0.00000): C8 (0.30103): C9 (0.00000): C10 (0.00000): 


