# Basics of CountVectorizer and TfidfVectorizer

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

## 1) CountVectorizer

For reading: https://www.kaggle.com/adamschroeder/countvectorizer-tfidfvectorizer-predict-comments

<b>Theory</b>

- CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
- Convert a collection of text documents to a matrix of token counts

- This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

- If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data.
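
As a quick check of the last two points, the sketch below fits a CountVectorizer on a small made-up corpus and inspects the result. The `docs` list is an assumption for illustration only (it is not part of the original notebook), and the snippet assumes scikit-learn and SciPy are installed.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = ["red apple", "green apple", "red red grape"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# The result is a SciPy sparse matrix, not a dense array.
is_sparse = sparse.issparse(X)

# One row per document, one column per vocabulary term found in the data.
n_docs, n_features = X.shape
vocab_size = len(vectorizer.vocabulary_)

print(is_sparse, n_docs, n_features, vocab_size)
```

With no fixed vocabulary and no feature selection, the number of columns should equal the number of distinct tokens found in `docs`.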


Parameters
input : string {'filename', 'file', 'content'}

    If 'filename', the sequence passed as an argument to fit is  
    expected to be a list of filenames that need reading to fetch  
    the raw content to analyze.  

    If 'file', the sequence items must have a 'read' method (file-like  
    object) that is called to fetch the bytes in memory.  

    Otherwise the input is expected to be a sequence of items that  
    can be of type string or byte.  
encoding : string, 'utf-8' by default.

    If bytes or files are given to analyze, this encoding is used to  
    decode.  
decode_error : {'strict', 'ignore', 'replace'}

    Instruction on what to do if a byte sequence is given to analyze that  
    contains characters not of the given `encoding`. By default, it is  
    'strict', meaning that a UnicodeDecodeError will be raised. Other  
    values are 'ignore' and 'replace'.  
strip_accents : {'ascii', 'unicode', None}

    Remove accents and perform other character normalization  
    during the preprocessing step.  
    'ascii' is a fast method that only works on characters that have  
    a direct ASCII mapping.  
    'unicode' is a slightly slower method that works on any characters.  
    None (default) does nothing.  

    Both 'ascii' and 'unicode' use NFKD normalization from  
    `unicodedata.normalize`.  
lowercase : boolean, True by default

    Convert all characters to lowercase before tokenizing.  
preprocessor : callable or None (default)

    Override the preprocessing (string transformation) stage while  
    preserving the tokenizing and n-grams generation steps.  
    Only applies if `analyzer is not callable`.  
tokenizer : callable or None (default)

    Override the string tokenization step while preserving the  
    preprocessing and n-grams generation steps.  
    Only applies if `analyzer == 'word'`.  
stop_words : string {'english'}, list, or None (default)

    If 'english', a built-in stop word list for English is used.  
    There are several known issues with 'english' and you should  
    consider an alternative (see the stop words section of the  
    scikit-learn user guide).  

    If a list, that list is assumed to contain stop words, all of which  
    will be removed from the resulting tokens.  
    Only applies if `analyzer == 'word'`.  

    If None, no stop words will be used. max_df can be set to a value  
    in the range [0.7, 1.0) to automatically detect and filter stop  
    words based on intra corpus document frequency of terms.  
token_pattern : string

    Regular expression denoting what constitutes a "token", only used  
    if `analyzer == 'word'`. The default regexp selects tokens of 2  
    or more alphanumeric characters (punctuation is completely ignored  
    and always treated as a token separator).  
ngram_range : tuple (min_n, max_n), default=(1, 1)

    The lower and upper boundary of the range of n-values for different  
    word n-grams or char n-grams to be extracted. All values of n such  
    that min_n <= n <= max_n will be used. For example an  
    `ngram_range` of `(1, 1)` means only unigrams, `(1, 2)` means  
    unigrams and bigrams, and `(2, 2)` means only bigrams.  
    Only applies if `analyzer is not callable`.  
analyzer : string, {'word', 'char', 'char_wb'} or callable

    Whether the feature should be made of word n-gram or character  
    n-grams.  
    Option 'char_wb' creates character n-grams only from text inside  
    word boundaries; n-grams at the edges of words are padded with space.  

    If a callable is passed it is used to extract the sequence of features  
    out of the raw, unprocessed input.  


    Since v0.21, if `input` is `filename` or `file`, the data is  
    first read from the file and then passed to the given callable  
    analyzer.  
max_df : float in range [0.0, 1.0] or int, default=1.0

    When building the vocabulary ignore terms that have a document  
    frequency strictly higher than the given threshold (corpus-specific  
    stop words).  
    If float, the parameter represents a proportion of documents, integer  
    absolute counts.  
    This parameter is ignored if vocabulary is not None.  
min_df : float in range [0.0, 1.0] or int, default=1

    When building the vocabulary ignore terms that have a document  
    frequency strictly lower than the given threshold. This value is also  
    called cut-off in the literature.  
    If float, the parameter represents a proportion of documents, integer  
    absolute counts.  
    This parameter is ignored if vocabulary is not None.  
max_features : int or None, default=None

    If not None, build a vocabulary that only considers the top  
    max_features terms, ordered by term frequency across the corpus.  

    This parameter is ignored if vocabulary is not None.  
vocabulary : Mapping or iterable, optional

    Either a Mapping (e.g., a dict) where keys are terms and values are  
    indices in the feature matrix, or an iterable over terms. If not  
    given, a vocabulary is determined from the input documents. Indices  
    in the mapping should not be repeated and should not have any gap  
    between 0 and the largest index.  
binary : boolean, default=False

    If True, all non zero counts are set to 1. This is useful for discrete  
    probabilistic models that model binary events rather than integer  
    counts.  
dtype : type, optional

    Type of the matrix returned by fit_transform() or transform().  
Attributes
vocabulary_ : dict

    A mapping of terms to feature indices.  
fixed_vocabulary_: boolean

    True if a fixed vocabulary of term to indices mapping  
    is provided by the user  
stop_words_ : set

    Terms that were ignored because they either:  

      - occurred in too many documents (`max_df`)  
      - occurred in too few documents (`min_df`)  
      - were cut off by feature selection (`max_features`).  

    This is only available if no vocabulary was given.  
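
To make the `token_pattern` and `ngram_range` descriptions above more concrete, here is a minimal sketch of word n-gram extraction using only the standard library. The helper `word_ngrams` is hypothetical and written for illustration; it is not scikit-learn's actual implementation, only an approximation of the default word analyzer (lowercasing, the default token pattern, then joining adjacent tokens).

```python
import re

def word_ngrams(text, ngram_range=(1, 1)):
    """Hypothetical helper: lowercase, tokenize, then join adjacent tokens."""
    # Default token_pattern from the signature above: 2+ word characters.
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    min_n, max_n = ngram_range
    ngrams = []
    # Use every n with min_n <= n <= max_n.
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

unigrams = word_ngrams("This is the first document.", (1, 1))
bigrams = word_ngrams("This is the first document.", (2, 2))
print(unigrams)
print(bigrams)
```

Passing `(1, 2)` to the same helper would return the unigrams followed by the bigrams.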

* Examples

      from sklearn.feature_extraction.text import CountVectorizer
      corpus = [
          'This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?',
      ]

      vectorizer = CountVectorizer()
      X = vectorizer.fit_transform(corpus)

      print(vectorizer.get_feature_names())
      ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

      print(X.toarray())

      [[0 1 1 1 0 0 1 0 1]
       [0 2 0 1 0 1 1 0 1]
       [1 0 0 1 1 0 1 1 1]
       [0 1 1 1 0 0 1 0 1]]

      vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
      X2 = vectorizer2.fit_transform(corpus)

      print(vectorizer2.get_feature_names())
      ['and this', 'document is', 'first document', 'is the', 'is this', 'second document', 'the first', 'the second', 'the third', 'third one', 'this document', 'this is', 'this the']

      print(X2.toarray())

      [[0 0 1 1 0 0 1 0 0 0 0 1 0]
       [0 1 0 1 0 1 0 1 0 0 1 0 0]
       [1 0 0 1 0 0 0 0 1 1 0 1 0]
       [0 0 1 0 1 0 1 0 0 0 0 0 1]]


Notes
The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling.


### 1.1) Examples : 

In [0]:
sam = ["Hello there, my name is A","Your name is B","you have my notebook","I have your pen"]

In [0]:
cv1 = CountVectorizer()
s = cv1.fit_transform(sam)

In [32]:
cv1.get_feature_names()  # removed in scikit-learn 1.2; newer versions use get_feature_names_out()

['have',
 'hello',
 'is',
 'my',
 'name',
 'notebook',
 'pen',
 'there',
 'you',
 'your']
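
Note that 'a', 'b' and 'i' are missing from the feature names. A likely explanation is the default `token_pattern`, which only keeps tokens of two or more word characters. The sketch below reproduces the vocabulary with the standard library `re` module; it approximates the default analyzer and is not scikit-learn's own code.

```python
import re

# Default token_pattern quoted in the CountVectorizer signature.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sam = ["Hello there, my name is A", "Your name is B",
       "you have my notebook", "I have your pen"]

# Lowercase first (lowercase=True), then keep tokens of 2+ word characters.
tokens_per_doc = [re.findall(TOKEN_PATTERN, doc.lower()) for doc in sam]

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted({tok for toks in tokens_per_doc for tok in toks})
print(vocabulary)
```

Single-character words such as "A" and "I" never match the pattern, so they never reach the vocabulary.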

In [33]:
cv2 = CountVectorizer(min_df=0.6)
s = cv2.fit_transform(sam)

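ValueError: ignored

With 4 documents, min_df=0.6 means a term must appear in at least 0.6 * 4 = 2.4 documents. No term in `sam` appears in more than 2 documents, so every term is pruned and fit_transform raises a ValueError.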

In [0]:
cv2 = CountVectorizer(min_df=0.5)
s = cv2.fit_transform(sam)

In [35]:
cv2.get_feature_names()

['have', 'is', 'my', 'name', 'your']
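
The effect of a float `min_df` can be checked by hand. The sketch below counts document frequencies with the standard library and applies the "ignore terms strictly lower than the threshold" rule from the parameter description. The helper `surviving_terms` is hypothetical and only approximates what the vectorizer does internally.

```python
import re
from collections import Counter

sam = ["Hello there, my name is A", "Your name is B",
       "you have my notebook", "I have your pen"]

# Document frequency: in how many documents does each term occur?
doc_freq = Counter()
for doc in sam:
    doc_freq.update(set(re.findall(r"(?u)\b\w\w+\b", doc.lower())))

def surviving_terms(min_df):
    """Hypothetical helper: keep terms with doc frequency >= min_df * n_docs."""
    threshold = min_df * len(sam)
    return sorted(t for t, count in doc_freq.items() if count >= threshold)

kept_half = surviving_terms(0.5)  # threshold is 2 documents
kept_06 = surviving_terms(0.6)    # threshold is 2.4 documents
print(kept_half)
print(kept_06)
```

With `min_df=0.5` the five terms that occur in two documents survive; with `min_df=0.6` nothing survives, which is consistent with the ValueError seen earlier.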

In [0]:
cv2 = CountVectorizer(stop_words=['is']) # stop_words takes a list of words to ignore
s = cv2.fit_transform(sam)

In [37]:
cv2.get_feature_names()

['have', 'hello', 'my', 'name', 'notebook', 'pen', 'there', 'you', 'your']

In [38]:
sam

['Hello there, my name is A',
 'Your name is B',
 'you have my notebook',
 'I have your pen']

In [39]:
s.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 1],
       [1, 0, 1, 0, 1, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 1, 0, 0, 1]])
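
Each row of this array is one sentence of `sam`, and each column is one feature name in sorted order. As a sanity check, the same matrix can be rebuilt with `collections.Counter`. This is a minimal sketch that mimics the vectorizer with `stop_words=['is']`; it is not how scikit-learn builds its sparse matrix.

```python
import re
from collections import Counter

sam = ["Hello there, my name is A", "Your name is B",
       "you have my notebook", "I have your pen"]
stop_words = {"is"}

# Tokenize each document and drop the stop words.
docs_tokens = [
    [t for t in re.findall(r"(?u)\b\w\w+\b", doc.lower()) if t not in stop_words]
    for doc in sam
]

# Column order follows the sorted vocabulary.
features = sorted({t for toks in docs_tokens for t in toks})

# One row of counts per document; Counter returns 0 for absent terms.
matrix = []
for toks in docs_tokens:
    counts = Counter(toks)
    matrix.append([counts[f] for f in features])

for row in matrix:
    print(row)
```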

## 2) TfidfVectorizer

<b>Theory</b>


- TfidfVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
- Convert a collection of raw documents to a matrix of TF-IDF features.

- Equivalent to CountVectorizer followed by TfidfTransformer.
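
The equivalence stated in the last bullet can be checked directly. The sketch below assumes scikit-learn and NumPy are installed; the `docs` corpus is made up for illustration and is not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only for illustration.
docs = ["red apple", "green apple", "red red grape"]

# One step: raw text straight to TF-IDF features.
one_step = TfidfVectorizer().fit_transform(docs)

# Two steps: raw text to counts, then counts to TF-IDF.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```

Both routes use default parameters here; if you change `norm`, `use_idf`, `smooth_idf` or `sublinear_tf` on one side, the other side needs the same settings.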

Parameters
input : str {'filename', 'file', 'content'}

    If 'filename', the sequence passed as an argument to fit is  
    expected to be a list of filenames that need reading to fetch  
    the raw content to analyze.  

    If 'file', the sequence items must have a 'read' method (file-like  
    object) that is called to fetch the bytes in memory.  

    Otherwise the input is expected to be a sequence of items that  
    can be of type string or byte.  
encoding : str, default='utf-8'

    If bytes or files are given to analyze, this encoding is used to  
    decode.  
decode_error : {'strict', 'ignore', 'replace'} (default='strict')

    Instruction on what to do if a byte sequence is given to analyze that  
    contains characters not of the given `encoding`. By default, it is  
    'strict', meaning that a UnicodeDecodeError will be raised. Other  
    values are 'ignore' and 'replace'.  
strip_accents : {'ascii', 'unicode', None} (default=None)

    Remove accents and perform other character normalization  
    during the preprocessing step.  
    'ascii' is a fast method that only works on characters that have  
    a direct ASCII mapping.  
    'unicode' is a slightly slower method that works on any characters.  
    None (default) does nothing.  

    Both 'ascii' and 'unicode' use NFKD normalization from  
    `unicodedata.normalize`.  
lowercase : bool (default=True)

    Convert all characters to lowercase before tokenizing.  
preprocessor : callable or None (default=None)

    Override the preprocessing (string transformation) stage while  
    preserving the tokenizing and n-grams generation steps.  
    Only applies if `analyzer is not callable`.  
tokenizer : callable or None (default=None)

    Override the string tokenization step while preserving the  
    preprocessing and n-grams generation steps.  
    Only applies if `analyzer == 'word'`.  
analyzer : str, {'word', 'char', 'char_wb'} or callable

    Whether the feature should be made of word or character n-grams.  
    Option 'char_wb' creates character n-grams only from text inside  
    word boundaries; n-grams at the edges of words are padded with space.  

    If a callable is passed it is used to extract the sequence of features  
    out of the raw, unprocessed input.  


    Since v0.21, if `input` is `filename` or `file`, the data is  
    first read from the file and then passed to the given callable  
    analyzer.  
stop_words : str {'english'}, list, or None (default=None)

    If a string, it is passed to _check_stop_list and the appropriate stop  
    list is returned. 'english' is currently the only supported string  
    value.  
    There are several known issues with 'english' and you should  
    consider an alternative (see the stop words section of the  
    scikit-learn user guide).  

    If a list, that list is assumed to contain stop words, all of which  
    will be removed from the resulting tokens.  
    Only applies if `analyzer == 'word'`.  

    If None, no stop words will be used. max_df can be set to a value  
    in the range [0.7, 1.0) to automatically detect and filter stop  
    words based on intra corpus document frequency of terms.  
token_pattern : str

    Regular expression denoting what constitutes a "token", only used  
    if `analyzer == 'word'`. The default regexp selects tokens of 2  
    or more alphanumeric characters (punctuation is completely ignored  
    and always treated as a token separator).  
ngram_range : tuple (min_n, max_n), default=(1, 1)

    The lower and upper boundary of the range of n-values for different  
    n-grams to be extracted. All values of n such that min_n <= n <= max_n  
    will be used. For example an `ngram_range` of `(1, 1)` means only  
    unigrams, `(1, 2)` means unigrams and bigrams, and `(2, 2)` means  
    only bigrams.  
    Only applies if `analyzer is not callable`.  
max_df : float in range [0.0, 1.0] or int (default=1.0)

    When building the vocabulary ignore terms that have a document  
    frequency strictly higher than the given threshold (corpus-specific  
    stop words).  
    If float, the parameter represents a proportion of documents, integer  
    absolute counts.  
    This parameter is ignored if vocabulary is not None.  
min_df : float in range [0.0, 1.0] or int (default=1)

    When building the vocabulary ignore terms that have a document  
    frequency strictly lower than the given threshold. This value is also  
    called cut-off in the literature.  
    If float, the parameter represents a proportion of documents, integer  
    absolute counts.  
    This parameter is ignored if vocabulary is not None.  
max_features : int or None (default=None)

    If not None, build a vocabulary that only considers the top  
    max_features terms, ordered by term frequency across the corpus.  

    This parameter is ignored if vocabulary is not None.  
vocabulary : Mapping or iterable, optional (default=None)

    Either a Mapping (e.g., a dict) where keys are terms and values are  
    indices in the feature matrix, or an iterable over terms. If not  
    given, a vocabulary is determined from the input documents.  
binary : bool (default=False)

    If True, all non-zero term counts are set to 1. This does not mean  
    outputs will have only 0/1 values, only that the tf term in tf-idf  
    is binary. (Set use_idf=False and norm=None to get 0/1 outputs).  
dtype : type, optional (default=float64)

    Type of the matrix returned by fit_transform() or transform().  
norm : 'l1', 'l2' or None, optional (default='l2')

    Each output row will have unit norm, either:  
    * 'l2': Sum of squares of vector elements is 1. The cosine  
    similarity between two vectors is their dot product when l2 norm has  
    been applied.  
    * 'l1': Sum of absolute values of vector elements is 1.  
    See `sklearn.preprocessing.normalize`.  
use_idf : bool (default=True)

    Enable inverse-document-frequency reweighting.  
smooth_idf : bool (default=True)

    Smooth idf weights by adding one to document frequencies, as if an  
    extra document was seen containing every term in the collection  
    exactly once. Prevents zero divisions.  
sublinear_tf : bool (default=False)

    Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).  
Attributes
vocabulary_ : dict

    A mapping of terms to feature indices.  
fixed_vocabulary_: bool

    True if a fixed vocabulary of term to indices mapping  
    is provided by the user  
idf_ : array, shape (n_features)

    The inverse document frequency (IDF) vector; only defined  
    if `use_idf` is True.  
stop_words_ : set

    Terms that were ignored because they either:  

      - occurred in too many documents (`max_df`)  
      - occurred in too few documents (`min_df`)  
      - were cut off by feature selection (`max_features`).  

    This is only available if no vocabulary was given.  


- TfidfTransformer : Performs the TF-IDF transformation from a provided matrix of counts.  

- Notes:

The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling.

Examples

        from sklearn.feature_extraction.text import TfidfVectorizer

        corpus = [
            'This is the first document.',
            'This document is the second document.',
            'And this is the third one.',
            'Is this the first document?',
        ]

        vectorizer = TfidfVectorizer()
        X = vectorizer.fit_transform(corpus)

        print(vectorizer.get_feature_names())

        ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

        print(X.shape)

        (4, 9)

### 2.1) Examples:

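formula used: tf-idf(t, d) = tf(t, d) * idf(t)

            * tf(t, d) = the term frequency: the number of times term 't' appears in document 'd'
            * idf(t) = the inverse document frequency: with the default smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the total number of documents and df(t) is the number of documents that contain term 't'
            * with the default norm='l2', each document's row of tf-idf values is then scaled to unit length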
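
To see the numbers behind the weighting, here is a worked sketch using only the standard library. It assumes scikit-learn's default settings as described in the parameter list above (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1 for n documents and document frequency df(t). The helper `tfidf_row` is hypothetical and written for illustration only.

```python
import math
import re

sam = ["Hello there, my name is A", "Your name is B",
       "you have my notebook", "I have your pen"]

docs = [re.findall(r"(?u)\b\w\w+\b", d.lower()) for d in sam]
n_docs = len(docs)
vocab = sorted({t for d in docs for t in d})

# df(t): number of documents that contain term t.
df = {t: sum(t in d for d in docs) for t in vocab}

# Smoothed idf, assuming smooth_idf=True.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Hypothetical helper: tf * idf per term, then scale the row to unit l2 norm."""
    raw = [tokens.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

row0 = tfidf_row(docs[0])
print([round(v, 8) for v in row0])
```

For the first sentence, 'hello' and 'there' occur in one document only, so they receive a larger weight than 'is', 'my' and 'name', which each occur in two documents.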

In [0]:
sam = ["Hello there, my name is A","Your name is B","you have my notebook","I have your pen"]

In [0]:
tf1 = TfidfVectorizer()
c = tf1.fit_transform(sam)

In [42]:
tf1.get_feature_names()

['have',
 'hello',
 'is',
 'my',
 'name',
 'notebook',
 'pen',
 'there',
 'you',
 'your']

In [43]:
c.toarray()

array([[0.        , 0.50867187, 0.40104275, 0.40104275, 0.40104275,
        0.        , 0.        , 0.50867187, 0.        , 0.        ],
       [0.        , 0.        , 0.57735027, 0.        , 0.57735027,
        0.        , 0.        , 0.        , 0.        , 0.57735027],
       [0.43779123, 0.        , 0.        , 0.43779123, 0.        ,
        0.55528266, 0.        , 0.        , 0.55528266, 0.        ],
       [0.52640543, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.66767854, 0.        , 0.        , 0.52640543]])
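
Because the default `norm='l2'` scales every row to unit length, the dot product of two rows is their cosine similarity. The sketch below checks this with the values copied from the output above, using only the standard library; the `dot` helper is a small illustrative function.

```python
import math

# Rows copied from the c.toarray() output above.
rows = [
    [0.0, 0.50867187, 0.40104275, 0.40104275, 0.40104275,
     0.0, 0.0, 0.50867187, 0.0, 0.0],
    [0.0, 0.0, 0.57735027, 0.0, 0.57735027,
     0.0, 0.0, 0.0, 0.0, 0.57735027],
    [0.43779123, 0.0, 0.0, 0.43779123, 0.0,
     0.55528266, 0.0, 0.0, 0.55528266, 0.0],
    [0.52640543, 0.0, 0.0, 0.0, 0.0,
     0.0, 0.66767854, 0.0, 0.0, 0.52640543],
]

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

# Every row should have l2 norm 1 (up to rounding of the printed values).
norms = [math.sqrt(dot(r, r)) for r in rows]

# Sentences 1 and 2 share 'is' and 'name'; sentences 2 and 3 share no token.
sim_01 = dot(rows[0], rows[1])
sim_12 = dot(rows[1], rows[2])
print(norms, sim_01, sim_12)
```

Sentences with no tokens in common get a similarity of exactly 0, while shared tokens push the value towards 1.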

How it assigns weights to words:

Basically,

          The more times a token appears in a document, the more weight it has in that document.
          However, the more documents a token appears in, the more it is 'penalized' and the lower its weight.

Removing all stop words 

In [0]:
sam = ["Hello there, my name is A","Your name is B","you have my notebook","I have your pen"]

In [0]:
tf2 = TfidfVectorizer(stop_words='english')  # removes every word found in scikit-learn's built-in English stop word list
c = tf2.fit_transform(sam)

In [46]:
tf2.get_feature_names()

['hello', 'notebook', 'pen']
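
Only three features remain, which may look surprising: words such as 'name' and 'have' are also treated as stop words. The sketch below checks each token against `ENGLISH_STOP_WORDS`, the frozenset that scikit-learn exposes in `sklearn.feature_extraction.text`. It assumes scikit-learn is installed, and the exact contents of the list may differ between versions.

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

sam = ["Hello there, my name is A", "Your name is B",
       "you have my notebook", "I have your pen"]

# All distinct tokens under the default token pattern.
tokens = sorted({t for d in sam for t in re.findall(r"(?u)\b\w\w+\b", d.lower())})

# Split tokens into those on the built-in list and those that survive.
removed = [t for t in tokens if t in ENGLISH_STOP_WORDS]
kept = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
print("removed:", removed)
print("kept:", kept)
```

This is one reason the parameter description warns about the 'english' list: it can drop words that carry meaning in your corpus.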