### Text Feature Extraction

Machine learning algorithms expect numerical feature vectors of fixed length.

Raw text is a sequence of symbols of variable length, so it cannot be fed directly to these algorithms.

Text feature extraction requires:
- Tokenizing: splitting each string into words or tokens and giving each distinct token an integer id
- Counting the occurrences of tokens in each document
- Normalizing and weighting, with diminishing importance, tokens that occur in the majority of documents
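
The three steps above can be sketched by hand in plain Python. This is only an illustration: the toy `docs` list, the `\w+` tokenizer and the unsmoothed `log(N / df)` weighting are assumptions made for this sketch, and scikit-learn's own weighting uses a different, smoothed formula.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus, not taken from the notebook.
docs = ["the cat sat", "the dog sat down", "the bird flew"]

# Step 1: tokenize and give every distinct token an integer id.
tokenized = [re.findall(r"\w+", d.lower()) for d in docs]
vocab = {tok: i for i, tok in enumerate(sorted({t for doc in tokenized for t in doc}))}

# Step 2: count token occurrences per document (one row per document).
counts = [[0] * len(vocab) for _ in docs]
for row, doc in zip(counts, tokenized):
    for tok, n in Counter(doc).items():
        row[vocab[tok]] = n

# Step 3: down-weight tokens that occur in most documents.
# Plain idf = log(N / df) is an assumption chosen for simplicity.
n_docs = len(docs)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log(n_docs / df) for df in doc_freq]
weighted = [[c * w for c, w in zip(row, idf)] for row in counts]

print(vocab)
print(counts)
print(weighted)
```

With this weighting, a token that appears in every document (here `the`) ends up with weight zero, which is the "diminishing importance" idea taken to its extreme.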

**Documents represented as a matrix**: one row per document and one column per token occurring in the corpus

**Vectorization**: general process of turning a collection of text documents into numerical feature vectors 

**Bag of Words representation**:
- Tokenization, counting and normalization together 
- Ignores relative position information of the words in the documents
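- Most documents use only a small subset of the vocabulary, so the resulting matrix has many feature values that are zero
- Stored in a sparse representation, which keeps only the non-zero entries in memory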
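
A small sketch of two of the properties listed above: word order is discarded, and only non-zero counts are stored. The three sentences are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences: the first two contain the same words in a different order.
docs = [
    "the dog bites the man",
    "the man bites the dog",
    "a completely different sentence",
]
bow = CountVectorizer().fit_transform(docs)

# Position information is ignored, so the first two rows are identical.
same_rows = bool((bow[0].toarray() == bow[1].toarray()).all())

# The sparse matrix stores only non-zero entries.
stored = bow.nnz
total = bow.shape[0] * bow.shape[1]

print(same_rows)
print(stored, "stored values out of", total, "cells")
```

On a real corpus the vocabulary is far larger than any single document, so the gap between stored values and total cells grows much wider than in this toy case.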

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

`CountVectorizer` implements both tokenization and occurrence counting in a single class.

This model has many parameters, but the default values are quite reasonable.
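
As a rough illustration, a few of those defaults can be inspected with `get_params()`, and one of them overridden. The choice of which parameters to print, and the sample sentence, are arbitrary picks for this sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Inspect a handful of default parameter values.
params = CountVectorizer().get_params()
for name in ("lowercase", "token_pattern", "ngram_range", "stop_words"):
    print(name, "=", repr(params[name]))

# Overriding a default: extract bigrams in addition to single words.
bigram_analyze = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
bigram_tokens = bigram_analyze("Bi-grams are cool!")
print(bigram_tokens)
```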

In [3]:
vectorizer = CountVectorizer()
vectorizer

Let's use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:

In [4]:
corpus = [
  "This is the first document",
  "This is the second document",
  "And the third one",
  "Is this the first document?"
]
X = vectorizer.fit_transform( corpus )
X

<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

The default configuration lowercases the string and tokenizes it by extracting words of at least 2 alphanumeric characters; punctuation is treated as a token separator. The specific function that does this step can be requested explicitly:

In [10]:
analyze = vectorizer.build_analyzer()

result = analyze( "Is this the first document?" ) 
result

['is', 'this', 'the', 'first', 'document']

In [6]:
isinstance(result, list)  # the analyzer returns a plain Python list of tokens

True
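
To see the two-character minimum in action, the sketch below feeds the default analyzer a sentence with one-letter words, then repeats the call with a custom `token_pattern` that keeps them. The sample sentence and the alternative pattern are assumptions made for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default analyzer: one-character tokens such as "I" and "a" are dropped.
default_analyze = CountVectorizer().build_analyzer()
default_tokens = default_analyze("I saw a cat")

# A custom pattern (an assumption for this sketch) that also keeps single characters.
keep_short = CountVectorizer(token_pattern=r"(?u)\b\w+\b").build_analyzer()
all_tokens = keep_short("I saw a cat")

print(default_tokens)
print(all_tokens)
```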

In [7]:
vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. The feature names listed above give the interpretation of each column, in column order. The counts themselves can be viewed as a dense array:

In [8]:
X.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
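
One way to read a row of this matrix is to pair each non-zero count with its feature name. The sketch below copies the feature names and counts from the outputs above; `row_as_dict` is a hypothetical helper written for this example, not part of scikit-learn.

```python
import numpy as np

# Values copied from the notebook outputs above.
feature_names = np.array(
    ["and", "document", "first", "is", "one", "second", "the", "third", "this"]
)
counts = np.array([
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
])

def row_as_dict(row):
    """Return {token: count} for the non-zero entries of one document row."""
    return {str(name): int(n) for name, n in zip(feature_names, row) if n > 0}

third_doc = row_as_dict(counts[2])
print(third_doc)  # tokens of "And the third one"
```

Rows 0 and 3 are identical even though the fourth document is a question with a different word order, which is the loss of position information mentioned earlier.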

The converse mapping from feature name to column index is stored in the `vocabulary_` attribute of the vectorizer:

In [9]:
vectorizer.vocabulary_.get( 'document' )

1

In [5]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
# 'spanish' is not a package id in the NLTK index, so this call fails (see the output below)
nltk.download('spanish')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Zkorpion\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Error with downloaded zip file
[nltk_data] Error loading spanish: Package 'spanish' not found in
[nltk_data]     index


False

In [7]:
text = "Este es un ejemplo de texto para tokenización. Posee algunas cosas, como tildes y comas-puntos"
tokens = word_tokenize(text)

print(tokens)

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\\Users\\Zkorpion/nltk_data'
    - 'c:\\Users\\Zkorpion\\AppData\\Local\\Programs\\Python\\Python312\\nltk_data'
    - 'c:\\Users\\Zkorpion\\AppData\\Local\\Programs\\Python\\Python312\\share\\nltk_data'
    - 'c:\\Users\\Zkorpion\\AppData\\Local\\Programs\\Python\\Python312\\lib\\nltk_data'
    - 'C:\\Users\\Zkorpion\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************
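
While the `punkt` resource is unavailable, the same sentence can still be split with a plain regular expression. This is a swapped-in technique, not NLTK's algorithm, and its output is not guaranteed to match `word_tokenize`; the pattern below is an assumption chosen for this sketch.

```python
import re

text = "Este es un ejemplo de texto para tokenización. Posee algunas cosas, como tildes y comas-puntos"

# \w+ matches runs of word characters (accented letters included, since Python 3
# patterns are Unicode-aware by default); [^\w\s] matches one punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", text)

print(tokens)
```

Note that this pattern splits `comas-puntos` into three tokens (`comas`, `-`, `puntos`), which may or may not be the behaviour wanted for hyphenated words.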
