## Representing text as numerical data for clustering algorithms

### Some useful links

- http://www.learndatasci.com/k-means-clustering-algorithms-python-intro/
- https://andhint.github.io/machine-learning/nlp/Feature-Extraction-From-Text/
- https://github.com/justmarkham/pycon-2016-tutorial/blob/master/tutorial.ipynb

### Define the imports:
- NumPy for numerics
- pandas to represent data in a table
- CountVectorizer from scikit-learn for converting text to a document-term matrix (DTM)
- TfidfVectorizer, also from scikit-learn, for converting text to a TF-IDF-weighted DTM

In [3]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

### Define the text data:
- We will use a list to define a simple corpus, i.e. a collection of documents
- Each document will be an item in the list


In [4]:
corpus_data = ['DiscountCurve_USD DefaultCuve BondDefaultCurve', 'FXSpot_EUR DiscountCurve_GBP EQAsset', 
               'DiscountCurve_EUR FXSpot_USD, BondDefaultCurve']

### Use the CountVectorizer to convert the text data to a DTM

- Import and instantiate CountVectorizer, keeping the default parameters except for lowercase=False, which preserves the original case of each token

In [7]:
count_vec = CountVectorizer(lowercase=False)

# learn the 'vocabulary' of the training data (occurs in-place)
count_vec.fit(corpus_data)

# examine the fitted vocabulary
# (scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2)
count_vec.get_feature_names()

[u'BondDefaultCurve',
 u'DefaultCuve',
 u'DiscountCurve_EUR',
 u'DiscountCurve_GBP',
 u'DiscountCurve_USD',
 u'EQAsset',
 u'FXSpot_EUR',
 u'FXSpot_USD']

### Transform training data into a 'document-term matrix'

- transform() returns a SciPy sparse matrix; toarray() converts it to a dense NumPy array so it can be displayed

In [11]:
corpus_data_dtm = count_vec.transform(corpus_data)
corpus_data_dtm.toarray()

array([[1, 1, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 1, 0],
       [1, 0, 1, 0, 0, 0, 0, 1]])
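
The sparse result can also be inspected without converting it to a dense array, which matters once the vocabulary grows large. The sketch below is one possible way to do this, assuming the same three-document corpus; `docs_with_term` is a hypothetical helper variable, and `fit_transform()` is used here simply as a shortcut for calling `fit()` followed by `transform()`.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

corpus_data = ['DiscountCurve_USD DefaultCuve BondDefaultCurve',
               'FXSpot_EUR DiscountCurve_GBP EQAsset',
               'DiscountCurve_EUR FXSpot_USD, BondDefaultCurve']

count_vec = CountVectorizer(lowercase=False)
corpus_data_dtm = count_vec.fit_transform(corpus_data)

# the DTM has one row per document and one column per vocabulary term
print(sparse.issparse(corpus_data_dtm))
print(corpus_data_dtm.shape)

# nnz is the number of stored non-zero counts
print(corpus_data_dtm.nnz)

# vocabulary_ maps each term to its column index in the DTM
col = count_vec.vocabulary_['BondDefaultCurve']

# row indices of the documents that contain the term
docs_with_term = sorted(corpus_data_dtm[:, col].nonzero()[0].tolist())
print(docs_with_term)
```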

### Examine the features and document-term matrix together in a pandas DataFrame

In [13]:
pd.DataFrame(corpus_data_dtm.toarray(), columns=count_vec.get_feature_names())

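| | BondDefaultCurve | DefaultCuve | DiscountCurve_EUR | DiscountCurve_GBP | DiscountCurve_USD | EQAsset | FXSpot_EUR | FXSpot_USD |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |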
