# Term Frequency - Inverse Document Frequency (TF-IDF)

The TF-IDF matrix can be the input data for machine learning algorithms
- The source of the TF-IDF is the Document-Term Matrix (DTM)
- The Term Frequencies are derived from the DTM 
- The Inverse Document Frequency vector (IDF) is derived from the DTM
- The term frequency inverse document frequency (TF-IDF) is created by multiplying the TF with the IDF

The **TF-IDF** matrix is used as an input for machine learning to:
  - Characterize writing styles
  - Find plagiarism
  - Curate legal documents
  - Identify fake news
  - Determine sentiment

## Document-Term Matrix (DTM)

Further information on the DTM can be found here, **[Document-term matrix (DTM)](https://en.wikipedia.org/wiki/Document-term_matrix)**<br/>   
The transpose of the DTM is the Term Document Matrix (TDM)

When we have a corpus with some basic text pre-processing applied, we can create a **document-term matrix (DTM)**. The DTM is a representation of the **[Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model)** model. The DTM has the following properties:

- The DTM values are the number of occurrences of a given term (word) in a given document
- The DTM is a sparse matrix, as most documents do not include most terms. Sparse matrix coding should be used for efficiency. 
- The DTM is used to create the **Term Frequency** matrix
- The DTM is used to create the **Inverse Document Frequency** vector

Let's look at an example of a DTM. The figure in the lecture slides shows a corpus of text documents on the left. This corpus is transformed into the document term matrix shown on the right. Notice that the matrix is sparse as any given document may not contain a term.  Some documents may contain a term multiple times. 
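
To make the idea of sparse matrix coding concrete, here is a minimal sketch in plain Python that stores only the non-zero cells of a DTM. The two-document corpus `docs` and the dictionary layout `{(doc_index, term): count}` are assumptions made for this illustration; a real project would more likely use a dedicated sparse matrix type from a numerical library.

```python
from collections import Counter

# Hypothetical two-document corpus, used only for this illustration
docs = ['i think learning is fun', 'i think i can i can']

# Store only the non-zero cells: {(doc_index, term): count}
sparse_dtm = {}
for doc_index, text in enumerate(docs):
    for term, count in Counter(text.split(' ')).items():
        sparse_dtm[(doc_index, term)] = count

# Compare with the number of cells a dense matrix would need
vocab = sorted({term for _, term in sparse_dtm})
dense_cells = len(docs) * len(vocab)
print('stored cells:', len(sparse_dtm), 'of', dense_cells)
```

Even in this tiny case only 8 of the 12 cells need to be stored. The saving grows quickly as the vocabulary grows, because each document uses only a small part of it.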

## Create an example DTM

In [1]:
# Some libraries
import numpy as np
import pandas as pd

### Corpus
First we need a corpus.  A corpus is a collection of **documents**.  A document is any independent piece of text, like a Shakespeare play or a tweet.  The following list is a corpus with 5 documents.  

In [2]:
# A corpus with 5 documents
corpus = [
    'i think machine learning is much fun',
    'i think learning is fun',
    'i think machines can learn to learn',
    'i think coding is fun fun fun',
    'i think i can i can'
]

### Vocabulary
Next we need to establish the vocabulary because each unique term will be a column in the DTM.  The vocabulary is the set of unique terms.  A term is like a word in a text.  Terms and words are often referred to as tokens.  Splitting a text into its words or terms is called tokenizing.

In [3]:
# Create the vocabulary
vocabulary = set()
for text in corpus:
    terms = text.split(' ')
    vocabulary.update(set(terms))
    
# cast set as list, which establishes order and allows indexing
vocabulary = list(vocabulary)
print('Vocabulary Size: {} distinct terms:'.format(len(vocabulary)))
print(vocabulary)

Vocabulary Size: 12 distinct terms:
['can', 'think', 'machines', 'is', 'to', 'much', 'learning', 'i', 'fun', 'learn', 'machine', 'coding']
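
The order of the terms printed above comes from iterating over a Python `set`, and for strings that order can change from one Python session to the next. If a reproducible column order matters, one option is to sort the vocabulary. The sketch below assumes alphabetical order is acceptable and uses the new name `sorted_vocabulary` so that it does not overwrite the `vocabulary` used in the rest of this notebook.

```python
# Same corpus as above, repeated so that this cell runs on its own
corpus = [
    'i think machine learning is much fun',
    'i think learning is fun',
    'i think machines can learn to learn',
    'i think coding is fun fun fun',
    'i think i can i can'
]

# Collect the unique terms, then sort them for a stable order
sorted_vocabulary = sorted({term for text in corpus for term in text.split(' ')})
print(sorted_vocabulary)
```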


### Filling in the DTM
1. The matrix is initialized with zeros.  The matrix has one row for each document and one column for each term.
2. Each document of the corpus is tokenized and each token is compared to the vocabulary.  A match of the token with the vocabulary term increases the value for the cell specified by term and document

#### Bag of Words
The DTM is considered a kind of "Bag of Words" (BOW) model of the corpus.  The more traditional BOW model would be the Boolean representation of the DTM. 
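
As a small illustration of the Boolean representation, the sketch below turns the counts of one document into 0/1 flags that only record whether a term is present. The short list `example_vocabulary` is an assumption chosen for this example and is not the full vocabulary of the corpus.

```python
# Hypothetical small vocabulary, used only for this illustration
example_vocabulary = ['can', 'coding', 'fun', 'i', 'is', 'think']

tokens = 'i think coding is fun fun fun'.split(' ')

# Count representation: how often each term occurs in the document
counts = [tokens.count(term) for term in example_vocabulary]

# Boolean representation: 1 if the term occurs at all, else 0
boolean = [int(count > 0) for count in counts]

print(counts)   # [0, 1, 3, 1, 1, 1]
print(boolean)  # [0, 1, 1, 1, 1, 1]
```

The Boolean version discards the fact that "fun" occurs three times, which is why the count-based DTM is used for the term frequencies.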

In [4]:
# initialize the document-term matrix (DTM) with zeros:
DTM = np.zeros((len(corpus), len(vocabulary)), dtype=np.intc)

# Document term matrix example
# fill the Document term matrix where
# each row is for a different document and
# each column is for a different term
for doc_index, text in enumerate(corpus):
    tokens = text.split(' ')
    term_index = [vocabulary.index(token) for token in tokens if token in vocabulary]
    for term_ix_col in term_index:
        DTM[doc_index, term_ix_col] = DTM[doc_index, term_ix_col] + 1

print("Example Document Term Matrix (DTM)")
DTM_df = pd.DataFrame(data=DTM, index=corpus, columns=vocabulary)
display(DTM_df)

Example Document Term Matrix (DTM)


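| Document | can | think | machines | is | to | much | learning | i | fun | learn | machine | coding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i think machine learning is much fun | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| i think learning is fun | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| i think machines can learn to learn | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 2 | 0 | 0 |
| i think coding is fun fun fun | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 1 |
| i think i can i can | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |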


### Term Frequencies
Term Frequencies (TF) are the DTM counts normalized by document length.  The reasoning for the normalization is:  A long document is more likely to contain any given term, including unusual ones, simply because it has more words.  In that case an unusual term would not be a marker for a machine learning outcome but rather a marker for document length.  To correct for this issue, we divide the raw counts in each row by the length of the document, that is, its total number of terms.  The term frequencies for every document will then sum up to 1.

In [7]:
# Get the total number of terms (tokens) within each document,
# counting repeated terms every time they occur
NumberOfTerms = DTM_df.sum(axis = 1)
print("Number of terms:")
display(NumberOfTerms)

Number of terms:


i think machine learning is much fun    7
i think learning is fun                 5
i think machines can learn to learn     7
i think coding is fun fun fun           7
i think i can i can                     6
dtype: int64

In [8]:
# Term Frequencies (TF) is the normalized DTM
# Divide every value by the number of terms in that document
TF = DTM_df.mul(1/NumberOfTerms, axis=0)
display(TF.round(2))

| Document | can | think | machines | is | to | much | learning | i | fun | learn | machine | coding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i think machine learning is much fun | 0.0 | 0.14 | 0.0 | 0.14 | 0.0 | 0.14 | 0.14 | 0.14 | 0.14 | 0.0 | 0.14 | 0.0 |
| i think learning is fun | 0.0 | 0.2 | 0.0 | 0.2 | 0.0 | 0.0 | 0.2 | 0.2 | 0.2 | 0.0 | 0.0 | 0.0 |
| i think machines can learn to learn | 0.14 | 0.14 | 0.14 | 0.0 | 0.14 | 0.0 | 0.0 | 0.14 | 0.0 | 0.29 | 0.0 | 0.0 |
| i think coding is fun fun fun | 0.0 | 0.14 | 0.0 | 0.14 | 0.0 | 0.0 | 0.0 | 0.14 | 0.43 | 0.0 | 0.0 | 0.14 |
| i think i can i can | 0.33 | 0.17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 |
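
The claim that the term frequencies of a document sum to 1 can be checked by hand for a single document. The following is a minimal sketch in plain Python, independent of the pandas code above, for the last document of the corpus.

```python
from collections import Counter
import math

tokens = 'i think i can i can'.split(' ')

# Raw counts per term, then divide by the document length (6 tokens)
counts = Counter(tokens)
tf = {term: count / len(tokens) for term, count in counts.items()}

print({term: round(value, 2) for term, value in tf.items()})
# {'i': 0.5, 'think': 0.17, 'can': 0.33}

# The term frequencies of one document add up to 1
print(math.isclose(sum(tf.values()), 1.0))  # True
```

The rounded values match the last row of the TF table.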


### Discussion

#### TDM vs DTM
DTM is the transpose of TDM.  Both have the same information.  DTM is organized in the scikit-learn way, with one row per document (sample) and one column per term (feature), which suits predicting a document attribute, like sentiment or topic.
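
The relationship between the two layouts can be shown with a tiny matrix. This sketch uses plain nested lists and a made-up 2 x 3 matrix `dtm_example`; with the DataFrame used in this notebook the same step would be a transpose of `DTM_df`.

```python
# Hypothetical DTM: 2 documents (rows) x 3 terms (columns)
dtm_example = [
    [0, 1, 3],
    [2, 1, 0],
]

# Transposing swaps rows and columns: 3 terms (rows) x 2 documents (columns)
tdm_example = [list(column) for column in zip(*dtm_example)]
print(tdm_example)  # [[0, 2], [1, 1], [3, 0]]
```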

#### Text cleaning
We should have cleaned the texts first.  Some text cleaning methods:
- Reduce number of terms (dimensions) by stemming or lemmatization:
    - We could stem "learning" and "learn" to make a single term called "learn"
    - We could stem "machine" and "machines" to make a single term called "machine" 
- Remove stop words.  Some words (terms) are very common.  Such words are called stop words and have little meaning.  Examples are: "i", "to", and "is".
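
The two cleaning steps above can be sketched as follows. The stop-word set `STOP_WORDS`, the lookup table `STEMS`, and the helper `clean` are all hypothetical and written only for this corpus; a real project would typically rely on a stemmer or lemmatizer and a stop-word list from an NLP library.

```python
# Hypothetical stop-word list and stem lookup, for illustration only
STOP_WORDS = {'i', 'to', 'is'}
STEMS = {'learning': 'learn', 'machines': 'machine'}

def clean(text):
    """Drop stop words, then map each remaining token to its stem."""
    tokens = text.split(' ')
    kept = [token for token in tokens if token not in STOP_WORDS]
    return [STEMS.get(token, token) for token in kept]

print(clean('i think machines can learn to learn'))
# ['think', 'machine', 'can', 'learn', 'learn']
```

After cleaning, "machine" and "machines" share one column, and so do "learn" and "learning", which reduces the number of dimensions in the DTM.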

## Create IDF

If a term occurs in nearly every document, it does little to distinguish one document from another, so that term will not help in machine learning.  Somewhat rare terms are more interesting.
To that end, we determine the inverse document frequency (IDF) for every term.  For each term:
- Determine number of documents with the term
- Divide total number of documents by the number of documents with the term
- Take the natural log of the ratio

In [9]:
# Number of documents in which the term appears
NumberOfDocsWithTerm = (DTM_df > 0).sum(axis = 0)

# Inverse Document Frequency (IDF) for each term
TotalNumberOfDocs = DTM_df.shape[0]
IDF = np.log(TotalNumberOfDocs/NumberOfDocsWithTerm)

# Present results
pd.DataFrame(data=(NumberOfDocsWithTerm,IDF), index=['#Docs', 'IDF']).T.round(2)

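| Term | #Docs | IDF |
|---|---|---|
| can | 2.0 | 0.92 |
| think | 5.0 | 0.0 |
| machines | 1.0 | 1.61 |
| is | 3.0 | 0.51 |
| to | 1.0 | 1.61 |
| much | 1.0 | 1.61 |
| learning | 2.0 | 0.92 |
| i | 5.0 | 0.0 |
| fun | 3.0 | 0.51 |
| learn | 1.0 | 1.61 |
| machine | 1.0 | 1.61 |
| coding | 1.0 | 1.61 |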


The above formula works well in our dataset.  It is:
$$IDF = \log\left(\dfrac{T}{N}\right)$$
where:
- T is Total Number Of Docs in corpus
- N is vector of Number Of Docs that contain a Term  
<br/><br/>

For a robust deployment we need to prepare for cases where the NumberOfDocsWithTerm (N) is zero:
$$IDF = \log\left(\dfrac{T + 1}{N + 1}\right)$$
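
The difference between the two formulas can be tried out on single values. This is a minimal sketch in plain Python using the natural log; the function names `idf` and `idf_smoothed` are assumptions made for this example, and `T = 5` is the size of the corpus above.

```python
import math

T = 5  # total number of documents in the corpus

def idf(n):
    """Unsmoothed IDF: log(T / N). Fails when N is zero."""
    return math.log(T / n)

def idf_smoothed(n):
    """Smoothed IDF: log((T + 1) / (N + 1)). Defined for N = 0."""
    return math.log((T + 1) / (n + 1))

print(round(idf(2), 2))           # 0.92, as for "can" in the table above
print(round(idf_smoothed(0), 2))  # 1.79, a term that appears in no document
print(idf_smoothed(5))            # 0.0, a term that appears in every document

# In plain Python the unsmoothed formula raises an error for N = 0
try:
    idf(0)
except ZeroDivisionError:
    print('unsmoothed IDF is undefined for N = 0')
```

A term that appears in every document still gets an IDF of zero with the smoothed formula, so the smoothing only changes how rare and unseen terms are handled.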

## Create a TF-IDF matrix
We multiply the IDF vector into the TF matrix to create a TF-IDF matrix.  The IDF vector is multiplied element-by-element with each row (document) of the TF matrix.

$$\text{TF-IDF} = TF \odot IDF$$
where:
- TF is the term-frequency matrix
- IDF is the inverse document frequency vector
- TF-IDF is the Term frequency - Inverse Document Frequency matrix

In [10]:
TF_IDF = TF.mul(IDF, axis=1)
display(TF_IDF.round(decimals=2))

| Document | can | think | machines | is | to | much | learning | i | fun | learn | machine | coding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i think machine learning is much fun | 0.0 | 0.0 | 0.0 | 0.07 | 0.0 | 0.23 | 0.13 | 0.0 | 0.07 | 0.0 | 0.23 | 0.0 |
| i think learning is fun | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.18 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 |
| i think machines can learn to learn | 0.13 | 0.0 | 0.23 | 0.0 | 0.23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.46 | 0.0 | 0.0 |
| i think coding is fun fun fun | 0.0 | 0.0 | 0.0 | 0.07 | 0.0 | 0.0 | 0.0 | 0.0 | 0.22 | 0.0 | 0.0 | 0.23 |
| i think i can i can | 0.31 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### Discussion
- The TF-IDF matrix contains no nulls (nan).  How can we see that the TF-IDF is a sparse matrix?
- Assume that each document in the training and test data sets is additionally labeled with a sentiment.  How would we combine the sentiment labels with the TF-IDF matrix?
- How could the TF-IDF matrix be used in sentiment analysis?  