## Statistical Natural Language Processing 

#### <I>Translation: Machine Learning with Unique Features and Feature Representations</I>

### Guiding Principles
#### We need to turn the meaning of text into numerical values that represent its semantic importance.
#### Language is very noisy and contains a great deal of ambiguity, so we need ways to reduce both the noise and the ambiguity.


## n-Gram Model

   <img src="./images/n-gram.png" width="400px"> 


### Transforming Text to n-gram Frequency Counts

##### 1) What does playing with ngram_range do?

##### 2) What do max_df and min_df do?

##### 3) What does max_features do?

##### 4) What does tokenizer do?

##### 5) What does stop_words do?

##### 6) What is the issue with larger n-grams?

##### 7) Why are frequency counts considered a biased representation?


In [31]:
from sklearn.feature_extraction.text import CountVectorizer

text = ["There was an old woman who swallowed a cow",
        "I do not know how she swallowed a cow",
        "She swallowed the cow to catch the goat",
        "She swallowed the goat to catch the dog",
        "She swallowed the dog to catch the cat",
        "She swallowed the cat to catch the bird",
        "She swallowed the bird to catch the spider"]
                
# max_df - ignore terms with a document frequency greater than the threshold
# min_df - ignore terms with a document frequency lower than the threshold
# (a float is read as a proportion of documents, an int as an absolute document count)

# What does token_pattern do by default?

cv = CountVectorizer(lowercase=True, stop_words=None, max_features=None, ngram_range=(1, 1), max_df=1.0, min_df=0.5)

cv_fit=cv.fit_transform(text)

print("Vocabulary Size: ", len(cv.get_feature_names_out()), "\n")
print("Vocabulary: ", cv.get_feature_names_out().tolist(), "\n")
print("Vectorized Count Matrix")
print(cv_fit.toarray(), "\n")
print("Totals: ", cv_fit.toarray().sum(axis=0), "\n")

Vocabulary Size:  5 

Vocabulary:  ['catch', 'she', 'swallowed', 'the', 'to'] 

Vectorized Count Matrix
[[0 0 1 0 0]
 [0 1 1 0 0]
 [1 1 1 2 1]
 [1 1 1 2 1]
 [1 1 1 2 1]
 [1 1 1 2 1]
 [1 1 1 2 1]] 

Totals:  [ 5  6  7 10  5] 
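
Question 6 above (the issue with larger n-grams) can be explored without scikit-learn. The sketch below is a plain-Python illustration, not `CountVectorizer`'s implementation: it assumes simple lowercase whitespace tokenization, and the helper name `word_ngrams` is hypothetical. For two of the sentences above it counts how many distinct n-grams exist for n = 1, 2, 3 and how many of them occur in only one document.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return the n-grams in a token list as space-joined strings (hypothetical helper)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = [
    "She swallowed the cow to catch the goat",
    "She swallowed the goat to catch the dog",
]
# Assumption: lowercase + whitespace split stands in for a real tokenizer.
tokenized = [d.lower().split() for d in docs]

summary = {}
for n in (1, 2, 3):
    # Document frequency: the number of documents each n-gram appears in.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(word_ngrams(tokens, n)))
    singletons = sum(1 for df in doc_freq.values() if df == 1)
    summary[n] = (len(doc_freq), singletons)
    print(f"n={n}: {len(doc_freq)} distinct n-grams, "
          f"{singletons} seen in only one document")
```

As n grows, the vocabulary grows while a larger share of the features occurs in a single document, so the count matrix becomes wider and sparser.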



### Transforming Text to a Co-occurrence Matrix

   <img src="./images/co-occurance-matrix.png" width="300px"> 

In [2]:
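Xc = (cv_fit.T @ cv_fit) # co-occurrence matrix in sparse CSR format; entry (i, j) sums count_i * count_j over documents
Xc.setdiag(0) # zero the diagonal so a word is not counted as co-occurring with itself
# Note: the output below lists 21 terms, so it appears to come from a run where cv was fitted
# without the min_df=0.5 filter used above (e.g. with the default min_df=1).
print("Vocabulary: ", cv.get_feature_names_out().tolist(), "\n")
print(Xc.todense()) # print the matrix in dense format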

Vocabulary:  ['an', 'bird', 'cat', 'catch', 'cow', 'do', 'dog', 'goat', 'how', 'know', 'not', 'old', 'she', 'spider', 'swallowed', 'the', 'there', 'to', 'was', 'who', 'woman'] 

[[ 0  0  0  0  1  0  0  0  0  0  0  1  0  0  1  0  1  0  1  1  1]
 [ 0  0  1  2  0  0  0  0  0  0  0  0  2  1  2  4  0  2  0  0  0]
 [ 0  1  0  2  0  0  1  0  0  0  0  0  2  0  2  4  0  2  0  0  0]
 [ 0  2  2  0  1  0  2  2  0  0  0  0  5  1  5 10  0  5  0  0  0]
 [ 1  0  0  1  0  1  0  1  1  1  1  1  2  0  3  2  1  1  1  1  1]
 [ 0  0  0  0  1  0  0  0  1  1  1  0  1  0  1  0  0  0  0  0  0]
 [ 0  0  1  2  0  0  0  1  0  0  0  0  2  0  2  4  0  2  0  0  0]
 [ 0  0  0  2  1  0  1  0  0  0  0  0  2  0  2  4  0  2  0  0  0]
 [ 0  0  0  0  1  1  0  0  0  1  1  0  1  0  1  0  0  0  0  0  0]
 [ 0  0  0  0  1  1  0  0  1  0  1  0  1  0  1  0  0  0  0  0  0]
 [ 0  0  0  0  1  1  0  0  1  1  0  0  1  0  1  0  0  0  0  0  0]
 [ 1  0  0  0  1  0  0  0  0  0  0  0  0  0  1  0  1  0  1  1  1]
 [ 0  2  2  5  2  1  2  2  1  
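
To make the matrix product in the cell above less opaque, here is a minimal plain-Python sketch of the same document-level co-occurrence count. It assumes a tiny made-up corpus and whitespace tokenization (both are illustrative assumptions, not part of the original notebook). Entry (a, b) sums, over all documents, the count of `a` times the count of `b`, which is what the transpose-times-matrix product computes.

```python
from collections import Counter

# Hypothetical three-document corpus, chosen only for illustration.
docs = [
    "she swallowed the cat",
    "she swallowed the dog",
    "the dog barked",
]
counts = [Counter(d.split()) for d in docs]      # per-document term counts
vocab = sorted(set().union(*counts))             # alphabetical vocabulary
index = {term: i for i, term in enumerate(vocab)}

size = len(vocab)
cooc = [[0] * size for _ in range(size)]
for doc_counts in counts:
    for a, count_a in doc_counts.items():
        for b, count_b in doc_counts.items():
            if a != b:                           # leave the diagonal at zero
                cooc[index[a]][index[b]] += count_a * count_b

print(vocab)
for term, row in zip(vocab, cooc):
    print(f"{term:10s}", row)
```

The result is symmetric, and words that never share a document (here `cat` and `dog`) get a zero.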

### Term-Frequency-Inverse Document Frequency

#### Inverse Document Frequency  $(idf_t)$
* <I>Measure of <B>informativeness</B> of a term: its rarity across the whole corpus, where $N$ is the number of documents and $df_t$ is the number of documents that contain term <B>t</B>.</I>

     ###   $idf_t = \log_{10}(N / df_t)$<BR>
     
* <I>Assign a <B>tf.idf</B> weight to each term <B>t</B> in each document <B>d</B>, where $tf_{t,d}$ is the number of times <B>t</B> occurs in <B>d</B>.</I>

    ### $w_{t,d} = tf_{t,d} \times \log_{10}(N / df_t)$<BR>

#### <I><U><B>Intuitively</B></U></I>
* <I>Weight increases with the number of occurrences within a document</I>
* <I>Weight increases with the rarity of the term across the whole corpus</I>
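
The two intuitions can be checked by hand on the seven-sentence corpus used earlier. The sketch below applies the $\log_{10}$ formula directly, assuming lowercase whitespace tokenization; the function names `idf` and `tf_idf` are illustrative. Note that scikit-learn's `TfidfVectorizer` uses a smoothed natural-log variant of idf by default and L2-normalizes each row, so its numbers will not match these.

```python
import math
from collections import Counter

text = ["There was an old woman who swallowed a cow",
        "I do not know how she swallowed a cow",
        "She swallowed the cow to catch the goat",
        "She swallowed the goat to catch the dog",
        "She swallowed the dog to catch the cat",
        "She swallowed the cat to catch the bird",
        "She swallowed the bird to catch the spider"]

# Assumption: lowercase + whitespace split stands in for a real tokenizer.
tokenized = [doc.lower().split() for doc in text]
n_docs = len(tokenized)

# df_t: number of documents that contain each term.
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

def idf(term):
    """idf_t = log10(N / df_t)."""
    return math.log10(n_docs / doc_freq[term])

def tf_idf(term, tokens):
    """w_{t,d} = tf_{t,d} * idf_t, with tf as the raw count in the document."""
    return tokens.count(term) * idf(term)

doc = tokenized[2]  # "she swallowed the cow to catch the goat"
for term in ("swallowed", "the", "cow", "goat"):
    print(f"{term:10s} tf={doc.count(term)} df={doc_freq[term]} "
          f"w={tf_idf(term, doc):.4f}")
```

"swallowed" appears in every document, so its weight is zero. "the" occurs twice in this sentence but is common across the corpus, so it still ends up with a lower weight than the rarer "cow" and "goat".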


In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

# max_df=1.0 keeps terms even if they appear in 100% of the documents; min_df=0.1 drops terms
# that appear in fewer than 10% of the documents in the corpus.
# ngram_range=(1, 2) generates unigrams and bigrams, which is about right for clinical notes. Beyond bigrams the matrix gets very sparse.

tfidfVectorizer = TfidfVectorizer(max_df=1.0, max_features=200000,
                                  min_df=0.1, stop_words=None,
                                  use_idf=True, tokenizer=None, ngram_range=(1,2))   

tfidfMatrix = tfidfVectorizer.fit_transform(text)

print("**** STORED AS A SPARSE MATRIX ****\n")
print("Tf-idf Matrix Size: ", tfidfMatrix.shape, "\n")
print("Tf-Idf Matrix")
print(tfidfMatrix.toarray())
print()

**** STORED AS A SPARSE MATRIX ****

Tf-idf Matrix Size:  (7, 47) 

Tf-Idf Matrix
[[0.27350507 0.27350507 0.         0.         0.         0.
  0.         0.         0.19406002 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.27350507 0.27350507
  0.         0.         0.         0.11461498 0.2270327  0.
  0.         0.         0.         0.         0.         0.
  0.         0.27350507 0.27350507 0.         0.         0.27350507
  0.27350507 0.27350507 0.27350507 0.27350507 0.27350507]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.22642736 0.         0.31912307 0.31912307
  0.         0.         0.         0.         0.31912307 0.31912307
  0.31912307 0.31912307 0.31912307 0.31912307 0.         0.
  0.15158902 0.15158902 0.         0.13373164 0.26489955 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.

### Other Feature Representations . . .

#### <U>Sentence Level Annotation</U>
##### Example: 
* (+) There is a good deal of opacification at the left base.
* (-) Right lung is clear
* (?) CHEST ONE VIEW PORTABLE
<BR>
   
* <B>Advantages:</B> Finer grained accuracy
* <B>Disadvantages:</B> Requires annotation and a voting method
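
The notes do not say which voting method is meant. As one possible reading, the sketch below combines sentence-level labels into a document-level label by simple majority vote. The function name `vote`, the rule that `?` sentences abstain, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter

def vote(sentence_labels):
    """Majority vote over sentence labels; '?' sentences abstain (assumed rule)."""
    votes = Counter(label for label in sentence_labels if label in ("+", "-"))
    if votes["+"] > votes["-"]:
        return "+"
    if votes["-"] > votes["+"]:
        return "-"
    return "?"  # tie, or no informative sentences at all

# Hypothetical report with two positive, one negative, and one uninformative sentence.
report_labels = ["+", "-", "?", "+"]
print(vote(report_labels))
```

A clinical application might prefer a different rule, for example letting any positive sentence decide the document label; the majority rule here is only the simplest option.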

#### <U>Concept-Assertion Pairs</U>
##### Example: reported fever present, wheezing absent, . . .
* <B>Advantages:</B> Very accurate and accounts for negation
* <B>Disadvantages:</B> Requires very fine-grained annotation and a voting method

#### <U>Noun, Verb, and Prepositional Phrases</U>
##### Example: no pleural effusion, bilateral infiltrates, lungs are clear . . . 
* <B>Advantages:</B> Accurate and can account for negation
* <B>Disadvantages:</B> Requires a good dependency parser, determination of phrase length, and a voting method

#### <U>IOB (Inside-Outside-Beginning) Format</U>
##### Example:
* I_O complained_O to_O Microsoft_B-ORG about_O Bill_B-PER Gates_I-PER
* They_O told_O me_O to_O see_O the_O mayor_O of_O New_B-LOC York_I-LOC
<BR>

* <B>Advantages:</B> Very fine-grained annotation
* <B>Disadvantages:</B> Requires annotation!!!
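
To show how IOB tags encode multi-token entities, the sketch below reads sentences written in the `token_TAG` form used in the examples above and groups them into entity spans. The function name `iob_to_entities` and the handling of malformed tag sequences are assumptions for illustration, not a standard API.

```python
def iob_to_entities(tagged_sentence):
    """Group 'token_TAG' items into (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity being built, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for item in tagged_sentence.split():
        token, tag = item.rsplit("_", 1)   # split on the last underscore
        if tag.startswith("B-"):           # beginning of a new entity
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)   # continuation of the open entity
        else:
            # 'O' tags, and I- tags that do not continue an open entity of the
            # same type (assumed here to be treated as outside), end the entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities

print(iob_to_entities(
    "I_O complained_O to_O Microsoft_B-ORG about_O Bill_B-PER Gates_I-PER"))
print(iob_to_entities(
    "They_O told_O me_O to_O see_O the_O mayor_O of_O New_B-LOC York_I-LOC"))
```

The `B-` tag marks where an entity starts, which is what lets two adjacent entities of the same type be told apart.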

