# Getting information out of text

This notebook shows general techniques for extracting information from text, which is especially useful for analyzing the language model output from your experiments. See [this lecture](https://www.youtube.com/watch?v=7YacOe4XwhY&ab_channel=MachineLearningTV) for an overview of the common approaches and concepts in natural-language processing.

See these tutorials for related tasks:
- [Classifying texts](https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py)
- [Clustering texts into different topics](https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#sphx-glr-auto-examples-text-plot-document-clustering-py)
- [Topic extraction from a text collection (corpus)](https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py)

## Count frequency of terms

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample sentences.
sentences = [
    "This is a sample sentence",
    "I am interested in good politics",
    "You are a very good software engineer, engineer.",
]

# Create a CountVectorizer, which builds a bag-of-words model.
# stop_words: language whose stop words should be removed.
vectorizer = CountVectorizer(stop_words='english')

# Learn the vocabulary from the sentences.
vectorizer.fit(sentences)

# Get the vocabulary as a sorted list of terms.
# (get_feature_names() was removed in scikit-learn 1.2.)
vectorizer.get_feature_names_out().tolist()



['engineer',
 'good',
 'interested',
 'politics',
 'sample',
 'sentence',
 'software']

## Evaluate term frequency–inverse document frequency (TF-IDF)

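TF-IDF multiplies how often a term occurs in a document (term frequency) by a weight that shrinks as the term appears in more documents (inverse document frequency). Terms that are frequent in one document but rare across the collection score highest, while words that occur everywhere are down-weighted.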

Read more here:
- [Understanding TF-IDF](https://monkeylearn.com/blog/what-is-tf-idf/)
- [TF-IDF from scratch](https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089)
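
To make the weighting concrete, the following is a minimal from-scratch sketch. It assumes the scikit-learn default settings: raw counts for term frequency, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The pre-tokenized `docs` list (stop words removed by hand) and the helper `tfidf` are illustrative names, not part of any library.

```python
import math
from collections import Counter

# Tokenized documents with English stop words already removed (by hand).
docs = [
    ["sample", "sentence"],
    ["interested", "politics"],
    ["good", "software", "engineer", "engineer"],
]


def tfidf(docs):
    """Return one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    scores = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts
        weights = {t: tf[t] * idf[t] for t in tf}
        # L2-normalize so that each document vector has length 1.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        scores.append({t: w / norm for t, w in weights.items()})
    return scores


result = tfidf(docs)
for doc_scores in result:
    print({t: round(s, 4) for t, s in doc_scores.items()})
```

Because every term in this toy corpus occurs in exactly one document, all IDF weights are equal, so `engineer` outscores `good` and `software` only because it occurs twice in its sentence.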

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "This is a sample sentence",
    "I am interested in politics",
    "You are a very good software engineer, engineer.",
]

# Create a TfidfVectorizer.
# stop_words: remove English stop words.
vectorizer = TfidfVectorizer(stop_words='english')

# Learn the vocabulary and IDF weights from the sentences.
vectorizer.fit(sentences)

# Get the vocabulary: a mapping from each term to its column index.
vectorizer.vocabulary_

{'sample': 4,
 'sentence': 5,
 'interested': 2,
 'politics': 3,
 'good': 1,
 'software': 6,
 'engineer': 0}

In [None]:
# Transform the sentences into a document-term matrix (rows: sentences, columns: terms).
vector_spaces = vectorizer.transform(sentences)
vector_spaces.toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.70710678,
        0.70710678, 0.        ],
       [0.        , 0.        , 0.70710678, 0.70710678, 0.        ,
        0.        , 0.        ],
       [0.81649658, 0.40824829, 0.        , 0.        , 0.        ,
        0.        , 0.40824829]])

In [None]:
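# Show each sentence with its sparse TF-IDF vector.
#
# Each printed entry has the form (A, B) C:
# A : row index (always 0 here, because each v is a single-row matrix)
# B : column index of the term in the vocabulary
# C : TF-IDF score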
for i, v in zip(sentences, vector_spaces):
    print(i)
    print(v)

This is a sample sentence
  (0, 5)	0.7071067811865476
  (0, 4)	0.7071067811865476
I am interested in politics
  (0, 3)	0.7071067811865476
  (0, 2)	0.7071067811865476
You are a very good software engineer, engineer.
  (0, 6)	0.40824829046386296
  (0, 1)	0.40824829046386296
  (0, 0)	0.8164965809277259


## Calculate embedding of words in the text

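You can use word embeddings to compare how close words are in meaning (semantics). Imagine every word as a point in a 2D coordinate system: `Royalty` and `Queen` would lie closer together than `Royalty` and `Dog`. Real embeddings use many more dimensions (100 in the example below), but the idea is the same.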

Read more here:
- [Introduction to Word Embedding and Word2Vec](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
- [Word Embedding](https://en.wikipedia.org/wiki/Word_embedding)
- [12.1: What is word2vec? - Programming with Text](https://www.youtube.com/watch?v=LSS_bos_TPI&t=293s)
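
Closeness between embeddings is commonly measured with cosine similarity. The sketch below uses made-up 2D vectors purely for illustration (they are not taken from any trained model) and a hypothetical helper `cosine_similarity`:

```python
import math

# Hypothetical 2D "embeddings", invented for illustration only.
vectors = {
    "royalty": (0.9, 0.8),
    "queen": (0.8, 0.9),
    "dog": (0.9, -0.3),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


royalty_queen = cosine_similarity(vectors["royalty"], vectors["queen"])
royalty_dog = cosine_similarity(vectors["royalty"], vectors["dog"])
print(f"royalty vs queen: {royalty_queen:.3f}")
print(f"royalty vs dog:   {royalty_dog:.3f}")
```

With these invented vectors, `royalty` and `queen` point in nearly the same direction, so their similarity is close to 1, while `royalty` and `dog` score noticeably lower.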

In [None]:
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec

# Inspect the sample corpus: a list of tokenized documents.
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [None]:
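# Train a Word2Vec model on the corpus.
# Note: gensim >= 4.0 names this parameter vector_size; older versions call it size.
model = Word2Vec(common_texts, vector_size=100, window=5, min_count=1, workers=4)

# Get the vector for a specific word.
model.wv["computer"]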



array([-2.6313041e-03,  1.6538294e-03,  4.6289808e-04, -7.5920334e-04,
       -3.6283568e-03, -2.6506723e-03,  4.3202299e-03,  3.9012234e-03,
        5.4888282e-04,  3.4330352e-03,  1.7413682e-03, -1.4366965e-03,
       -2.1586919e-03, -2.5640775e-03, -4.6938271e-03, -3.0624846e-04,
        1.7438885e-03, -4.2526140e-03, -3.7460853e-03,  3.2590211e-03,
        3.0387915e-03, -3.2021464e-03,  1.8485422e-03, -4.9046022e-03,
       -4.1450812e-03, -1.5228360e-03, -3.4953225e-03, -1.6744386e-03,
        1.7121847e-04,  4.1132323e-03,  4.3965699e-03, -3.7769366e-03,
        4.3228897e-03, -1.7645077e-03,  2.8583778e-03, -3.2301315e-03,
       -2.6300168e-03, -1.3739102e-03, -2.4289147e-03,  2.9276155e-03,
        2.3178912e-03, -3.5281274e-03,  6.2511890e-04, -4.9543008e-03,
        1.6058442e-03,  1.6891175e-04, -4.9381829e-03,  2.9473209e-03,
        4.7533633e-03,  3.0923262e-03, -2.5990247e-03,  1.5995167e-03,
        1.3188375e-03, -3.1465988e-03, -1.3249181e-03,  4.7159302e-03,
      

In [None]:
# Get the words most similar to "computer".
# (similarity() compares exactly two words; most_similar() ranks the vocabulary.)
model.wv.most_similar("computer")

[('time', 0.13495305180549622),
 ('survey', 0.05568737909197807),
 ('minors', 0.042579300701618195),
 ('human', 0.03648962825536728),
 ('trees', 0.029978659003973007),
 ('graph', -0.026356622576713562),
 ('interface', -0.061476483941078186),
 ('user', -0.09933125227689743),
 ('system', -0.10445401072502136),
 ('response', -0.10940004885196686)]