#**NATURAL LANGUAGE PROCESSING (NLP)**

**What will you learn?**
1. **NLP**: Introduction to NLP
2. **NLTK**: NLTK library
3. **Tokenization**: Converting paragraphs into sentences and words.
4. **StopWords**: The most common words in a language.
5. **Stemming**: Reducing words to their stem, which need not be a real word.
6. **POS tag**: Part of speech of each word.
7. **Lemmatization**: Reducing words to a meaningful dictionary form (lemma).
8. **Implementation on Movie Review Dataset**
9. **Count-vectorizer**: counts the frequency of words
10. **N-grams** & **TF-IDF**

Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

Let's use an example to show how powerful NLP is in a practical situation. When you type on your phone, as many of us do every day, you see word suggestions based on what you have already typed and what you are currently typing. That's natural language processing in action.


NLP includes many different techniques for interpreting human language, ranging from statistical and machine learning methods to rules-based and algorithmic approaches. We need a broad array of approaches because the text- and voice-based data varies widely, as do the practical applications. 

Basic NLP tasks include tokenization and parsing, lemmatization/stemming, part-of-speech tagging, language detection and identification of semantic relationships. 

In general terms, NLP tasks break down language into shorter, elemental pieces, try to understand relationships between the pieces and explore how the pieces work together to create meaning.

We will study each of these in detail.

##**Natural Language Toolkit (NLTK)**

The Natural Language Toolkit (NLTK) is a platform for building Python programs that work with human language data, particularly for statistical natural language processing (NLP).

It contains text processing libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning. It also includes graphical demonstrations and sample data sets, and it is accompanied by a cookbook and a book that explains the principles behind the language processing tasks NLTK supports.


###**Installing NLTK**

Type `!pip install nltk` in a Jupyter Notebook cell. If that doesn't work, run `conda install -c conda-forge nltk` in the command prompt (cmd). This should work in most cases.

To check that NLTK has installed correctly, open your Python terminal and type `import nltk`. If no error is raised, the NLTK library has been installed successfully.
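The same check can be scripted so that it reports the result instead of raising an error. This is only a minimal sketch; the helper name `is_installed` is our own and not part of NLTK.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the package can be located, without importing it."""
    return importlib.util.find_spec(package_name) is not None

# "nltk" is the package installed above; the lookup itself never raises.
if is_installed("nltk"):
    import nltk
    print("NLTK version:", nltk.__version__)
else:
    print("NLTK is not installed in this environment.")
```

If the second message appears, repeat the installation step in the same environment that the notebook kernel uses.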

##**Tokenization**

**Tokenization** is the process of breaking complex data such as paragraphs into simple units called tokens. Tokens can be individual words, phrases or even whole sentences. Some tokenizers discard characters such as punctuation marks; NLTK's word_tokenize() keeps them as separate tokens, so they are removed in a later cleaning step.
1. **Sentence tokenization**: split a paragraph into a list of sentences using the sent_tokenize() method
2. **Word tokenization**: split a sentence into a list of words using the word_tokenize() method
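To get an intuition for what the two steps do, here is a rough approximation using only regular expressions. It is not NLTK's algorithm (NLTK's tokenizers handle abbreviations, contractions and many other cases differently), and the function names are made up for this sketch.

```python
import re

def naive_sent_tokenize(text):
    # Split wherever ., ! or ? is followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_word_tokenize(sentence):
    # A word (optionally with an apostrophe part such as "'s"),
    # or any single punctuation character as its own token.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\w\s]", sentence)

text = "Rules matter. Friends matter more!"
print(naive_sent_tokenize(text))
print(naive_word_tokenize("That's true, right?"))
```

Note that this sketch keeps "That's" as one token, whereas word_tokenize() splits it into "That" and "'s".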

Import all the libraries required to perform tokenization on input data.

In [1]:
import nltk

In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [3]:
sample_text = "In the ninja world, those who break the rules are trash. That's true, but those who abandon their friends are worse than trash."

In [4]:
sent_tokenize(sample_text) #converted the sample text into sentences


['In the ninja world, those who break the rules are trash.',
 "That's true, but those who abandon their friends are worse than trash."]

In [5]:
words = word_tokenize(sample_text)
print(words,"\ncount = ",len(words))

['In', 'the', 'ninja', 'world', ',', 'those', 'who', 'break', 'the', 'rules', 'are', 'trash', '.', 'That', "'s", 'true', ',', 'but', 'those', 'who', 'abandon', 'their', 'friends', 'are', 'worse', 'than', 'trash', '.'] 
count =  28


Data cleaning plays an important role in NLP because it removes noise from the data.

##**Stop Words**

**Stop words** are the most common words in a language (such as "the", "a", "an", "in"). They help a sentence make grammatical sense but carry little significance for language processing, so they are usually removed.

In computing, stop words are words which are filtered out before or after processing of natural language data (text). 

You can view the list of stop words by running the code snippet below.

In [6]:
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [7]:
clean_words = [w for w in words if w.lower() not in stop]
print(clean_words,"\ncount = ",len(clean_words))

['ninja', 'world', ',', 'break', 'rules', 'trash', '.', "'s", 'true', ',', 'abandon', 'friends', 'worse', 'trash', '.'] 
count =  15


Words like "in", "the" and "who" have been removed from the list of words.

**Remove Punctuation**

To remove punctuation from the list of words, take the punctuation characters from string.punctuation and add them to the stop word list.

In [8]:
import string
punctuations = list(string.punctuation)
stop = stop + punctuations

In [9]:
clean_words = [w for w in words if w.lower() not in stop]
print(clean_words,"\ncount = ",len(clean_words))

['ninja', 'world', 'break', 'rules', 'trash', "'s", 'true', 'abandon', 'friends', 'worse', 'trash'] 
count =  11
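The two cleaning steps can also be wrapped in one helper. The sketch below is only an illustration: `SAMPLE_STOP_WORDS` is a deliberately tiny, hypothetical stop list (the notebook would use stopwords.words('english')), and it is stored as a set because membership tests on a set do not slow down as the list of stop words grows.

```python
import string

# Hypothetical miniature stop list, for illustration only.
SAMPLE_STOP_WORDS = {"in", "the", "those", "who", "are", "that", "but", "their", "than"}

def remove_noise(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Drop stop words and tokens made up only of punctuation characters."""
    cleaned = []
    for token in tokens:
        lowered = token.lower()
        if lowered in stop_words:
            continue
        # strip() removes leading/trailing punctuation; an empty result means
        # the token was pure punctuation such as "," or "...".
        if not lowered.strip(string.punctuation):
            continue
        cleaned.append(lowered)
    return cleaned

tokens = ["In", "the", "ninja", "world", ",", "those", "who",
          "break", "the", "rules", "are", "trash", "."]
print(remove_noise(tokens))
```

Unlike adding single characters to the stop list, the strip() check also catches multi-character punctuation tokens such as "...".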


##**Stemming**

**Stemming** is a normalization technique in which a list of tokenized words is converted into shortened root forms to remove redundancy. It is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form.

A computer program that stems words may be called a stemmer.

A stemmer may reduce words like fishing, fished, and fisher to the stem fish. The stem need not be a word: for example, the Porter algorithm reduces argue, argued, argues, arguing, and argus to the stem argu.

It removes suffixes such as "ing", "ly" and "s" using a simple rule-based approach. This shrinks the vocabulary of the corpus, but the resulting stems are often not actual words.
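To make the rule-based idea concrete, here is a toy suffix stripper. It is not the Porter algorithm; the suffix list and the minimum stem length are assumptions chosen only for this illustration, and real stemmers apply many more rules and conditions.

```python
# Hypothetical suffix rules, tried in this order.
SUFFIXES = ("ing", "ed", "ly", "s")

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

for w in ["playing", "played", "plays", "quickly", "sing", "argues"]:
    print(w, "->", toy_stem(w))
```

The minimum-length check is what keeps a short word like "sing" from being cut down to "s".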

**Various Stemming algorithms**
1. **Porter stemming algorithm**: NLTK's PorterStemmer class knows several regular word forms and suffixes, which it uses to transform the input word into a final stem.
2. **Lancaster stemming algorithm**: Developed at Lancaster University, this is another very common stemming algorithm.
NLTK has a LancasterStemmer class with which we can easily apply the Lancaster algorithm to the word we want to stem.
3. **Regular Expression stemming algorithm**: With this stemming algorithm, we can construct our own stemmer.
NLTK has a RegexpStemmer class with which we can easily implement regular-expression stemming. It takes a single regular expression and removes any prefix or suffix that matches the expression.
4. **Snowball stemming algorithm**: NLTK has a SnowballStemmer class with which we can easily apply the Snowball stemming algorithms. It supports 15 non-English languages. To use this stemming class, we create an instance with the name of the language we are using and then call the stem() method.
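The other three stemmers are used in much the same way as the Porter stemmer. The snippet below is a small sketch; the regular expression and the `min=4` setting passed to RegexpStemmer are illustrative choices, not recommended values.

```python
from nltk.stem import LancasterStemmer, RegexpStemmer, SnowballStemmer

words = ["running", "cars", "was"]

lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")   # the language name is required
# Words shorter than `min` characters are returned unchanged.
regexp = RegexpStemmer("ing$|s$|e$|able$", min=4)

for w in words:
    print(w, lancaster.stem(w), snowball.stem(w), regexp.stem(w))

print(SnowballStemmer.languages)  # names of the supported languages
```

None of these stemmers needs an additional nltk.download() call, since they are purely rule based.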

In [10]:
stem_words = ["play", "played", "playing", "player", "happier", "happiness", "universe", "universal"]
from nltk.stem import PorterStemmer  # here we use the Porter stemming algorithm
ps = PorterStemmer()
for w in stem_words:
    print(ps.stem(w))

play
play
play
player
happier
happi
univers
univers


##**POS Tag**

Part-of-speech (POS) tagging reads text in a language and assigns a part-of-speech tag to each word.
A POS tag conveys grammatical information about each word of a sentence by assigning it a category (determiner, noun, adjective, adverb, verb, personal pronoun, etc.) in the form of a tag (DT, NN, JJ, RB, VB, PRP, etc.).

A word can have more than one POS depending on the context in which it is used. POS tags are useful in statistical NLP tasks: they distinguish the senses of a word, which is very helpful in text realization and in inferring semantic information from a given text for sentiment analysis.

Steps Involved:

1. Tokenize text (word_tokenize)
2. Apply pos_tag to the output of the previous step, that is, nltk.pos_tag(tokenized_text)

POS tag list:

| **Abbreviation** | **Meaning** |
|---|---|
| CC | coordinating conjunction |
| CD | cardinal number |
| DT | determiner |
| EX | existential there |
| FW | foreign word |
| IN | preposition/subordinating conjunction |
| JJ | adjective (large) |
| JJR | adjective, comparative (larger) |
| JJS | adjective, superlative (largest) |
| LS | list marker |
| MD | modal (could, will) |
| NN | noun, singular (cat, tree) |
| NNS | noun, plural (desks) |
| NNP | proper noun, singular (Sarah) |
| NNPS | proper noun, plural (Indians or Americans) |
| PDT | predeterminer (all, both, half) |
| POS | possessive ending (parent's) |
| PRP | personal pronoun (hers, herself, him, himself) |
| PRP\$ | possessive pronoun (her, his, mine, my, our) |
| RB | adverb (occasionally, swiftly) |
| RBR | adverb, comparative (greater) |
| RBS | adverb, superlative (biggest) |
| RP | particle (about) |
| TO | infinitive marker (to) |
| UH | interjection (goodbye) |
| VB | verb (ask) |
| VBG | verb, gerund (judging) |
| VBD | verb, past tense (pleaded) |
| VBN | verb, past participle (reunified) |
| VBP | verb, present tense, not 3rd person singular (wrap) |
| VBZ | verb, present tense, 3rd person singular (bases) |
| WDT | wh-determiner (that, what) |
| WP | wh-pronoun (who) |
| WRB | wh-adverb (how) |

In [11]:
nltk.download('state_union')

from nltk.corpus import state_union
text = state_union.raw("2006-GWBush.txt")

[nltk_data] Downloading package state_union to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package state_union is already up-to-date!


In [12]:
from nltk import pos_tag
import numpy as np
pos = pos_tag(word_tokenize(text.lower()))
pos2=np.array(pos)
pos2

array([['president', 'NN'],
       ['george', 'NN'],
       ['w.', 'VBD'],
       ...,
       ['applause', 'IN'],
       ['.', '.'],
       [')', ')']], dtype='<U18')

In [13]:
print(pos_tag(["One"]),
      pos_tag(["legendary"]),
      pos_tag(["flying"]),
      pos_tag(["person"]))

[('One', 'CD')] [('legendary', 'JJ')] [('flying', 'VBG')] [('person', 'NN')]
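Once a text has been tagged, a quick way to get a feel for the result is to count how often each tag occurs. The sketch below uses a few hand-written (word, tag) pairs in the same shape that pos_tag returns; in the notebook, the `pos` list from the earlier cell could be passed instead. The helper name `tag_counts` is our own.

```python
from collections import Counter

# Hand-written sample in the (word, tag) format returned by pos_tag.
sample_tagged = [("One", "CD"), ("legendary", "JJ"), ("flying", "VBG"),
                 ("person", "NN"), ("world", "NN")]

def tag_counts(tagged_tokens):
    """Count how many tokens received each POS tag."""
    return Counter(tag for _word, tag in tagged_tokens)

print(tag_counts(sample_tagged).most_common())
```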


##**Lemmatization**

A major drawback of stemming is that it produces an intermediate representation of the word; a stemmer may or may not return a meaningful word.

Lemmatization addresses this problem.
A stemming algorithm works by cutting a suffix or prefix off the word. Lemmatization, by contrast, performs a morphological analysis of the word and returns a meaningful word in its proper form.

The output of lemmatization is called a ‘lemma’, which is a root word rather than a root stem.

NLTK provides the WordNetLemmatizer class, which is a thin wrapper around the WordNet corpus.

In [14]:
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()

In [15]:

  
lem.lemmatize("good", pos = 'a')
    

'good'

In [16]:
lem.lemmatize("better", pos = 'a')

'good'

In [17]:
lem.lemmatize("painting", pos = 'n')
# As a noun, "painting" is already a lemma and is returned unchanged,
# e.g. "This painting is beautiful".

'painting'

In [18]:
lem.lemmatize("painting", pos = 'v')
# As a verb, "painting" is reduced to its lemma "paint",
# e.g. "She is painting the fence".

'paint'

##**Working on Movie Reviews Dataset**

**STEP 1**: Importing the data

In [19]:
from nltk.corpus import movie_reviews
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [20]:
len(movie_reviews.fileids())  # the dataset contains 2000 movie reviews: 1000 positive and 1000 negative

2000

In [21]:
movie_reviews.fileids('pos')

['pos/cv000_29590.txt',
 'pos/cv001_18431.txt',
 'pos/cv002_15918.txt',
 'pos/cv003_11664.txt',
 'pos/cv004_11636.txt',
 'pos/cv005_29443.txt',
 'pos/cv006_15448.txt',
 'pos/cv007_4968.txt',
 'pos/cv008_29435.txt',
 'pos/cv009_29592.txt',
 'pos/cv010_29198.txt',
 'pos/cv011_12166.txt',
 'pos/cv012_29576.txt',
 'pos/cv013_10159.txt',
 'pos/cv014_13924.txt',
 'pos/cv015_29439.txt',
 'pos/cv016_4659.txt',
 'pos/cv017_22464.txt',
 'pos/cv018_20137.txt',
 'pos/cv019_14482.txt',
 'pos/cv020_8825.txt',
 'pos/cv021_15838.txt',
 'pos/cv022_12864.txt',
 'pos/cv023_12672.txt',
 'pos/cv024_6778.txt',
 'pos/cv025_3108.txt',
 'pos/cv026_29325.txt',
 'pos/cv027_25219.txt',
 'pos/cv028_26746.txt',
 'pos/cv029_18643.txt',
 'pos/cv030_21593.txt',
 'pos/cv031_18452.txt',
 'pos/cv032_22550.txt',
 'pos/cv033_24444.txt',
 'pos/cv034_29647.txt',
 'pos/cv035_3954.txt',
 'pos/cv036_16831.txt',
 'pos/cv037_18510.txt',
 'pos/cv038_9749.txt',
 'pos/cv039_6170.txt',
 'pos/cv040_8276.txt',
 'pos/cv041_21113.txt',
 

In [22]:
movie_reviews.words(movie_reviews.fileids()[5])

['capsule', ':', 'in', '2176', 'on', 'the', 'planet', ...]

In [23]:
documents = []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append([movie_reviews.words(fileid), category])
documents[0:5]  # each review is stored as a list of tokens paired with its category, giving one training example

[[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...], 'neg'],
 [['the', 'happy', 'bastard', "'", 's', 'quick', 'movie', ...], 'neg'],
 [['it', 'is', 'movies', 'like', 'these', 'that', 'make', ...], 'neg'],
 [['"', 'quest', 'for', 'camelot', '"', 'is', 'warner', ...], 'neg'],
 [['synopsis', ':', 'a', 'mentally', 'unstable', 'man', ...], 'neg']]

In [24]:
import random
random.seed(2)
random.shuffle(documents)  # shuffle the training examples

In [25]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [26]:
from nltk.corpus import wordnet
def get_simple_pos(tag): #creating simple tags to pass into the lemmatizer
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [27]:
from nltk.corpus import stopwords
import string
stops = stopwords.words('english') + list(string.punctuation)

In [28]:
from nltk import pos_tag

In [29]:
def clean_review(words):
    output_words = []
    for w in words:
        if w.lower() not in stops:
            pos = pos_tag([w])                                 
            clean_word = lemmatizer.lemmatize(w, get_simple_pos(pos[0][1]))
            output_words.append(clean_word.lower())
    return output_words

In [30]:
import time
start = time.time()
documents = [(clean_review(document), category) for document, category in documents]
end = time.time()
print("Cleaning time: ", end - start)

Cleaning time:  309.48702025413513


In [52]:
documents

[(['cold',
   'molecule',
   'move',
   'everything',
   'clean',
   'essential',
   'word',
   'mikey',
   'carver',
   'elijah',
   'wood',
   'young',
   'teenage',
   'boy',
   'living',
   '1973',
   'new',
   'canaan',
   'connecticut',
   'ice',
   'storm',
   'mikey',
   'delivers',
   'word',
   'bore',
   'science',
   'class',
   'unlikely',
   'anyone',
   'realizes',
   'much',
   'parallel',
   'mikey',
   'life',
   'life',
   'surround',
   'father',
   'jim',
   'jamey',
   'sheridan',
   'rarely',
   'see',
   'mother',
   'janey',
   'sigourney',
   'weaver',
   'affair',
   'married',
   'neighbor',
   'ben',
   'hood',
   'kevin',
   'kline',
   'ben',
   'wife',
   'elena',
   'joan',
   'allen',
   'suspect',
   'affair',
   'say',
   'anything',
   'meanwhile',
   'ben',
   '14',
   'year',
   'old',
   'daughter',
   'wendy',
   'christina',
   'ricci',
   'continuously',
   'lure',
   'mikey',
   'young',
   'brother',
   'sandy',
   'adam',
   'hann',
   'byr

In [32]:
training_documents = documents[0:1500]
testing_documents = documents[1500:]

In [33]:
all_words = []
for doc in documents:
    all_words += doc[0]

In [34]:
freq = nltk.FreqDist(all_words)    # returns a frequency distribution object
common = freq.most_common(3000)
features = [i[0] for i in common]  # keep the 3000 most frequent words as features

In [35]:
def get_feature_dict(words): # maps each feature word to True/False depending on whether it is present in the document
    current_features = {}
    words_set = set(words)
    for w in features:
        current_features[w] = w in words_set
    return current_features
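To see what such a feature dictionary looks like, here is a scaled-down version. It follows the same idea as the cell above but takes the feature list as a parameter so that it runs on its own; `toy_features` and `toy_review` are made-up values, not taken from the dataset.

```python
def get_feature_dict(words, features):
    """Map each feature word to True/False depending on its presence in `words`."""
    words_set = set(words)  # set lookup avoids rescanning the review per feature
    return {w: (w in words_set) for w in features}

# Hypothetical miniature feature list; the notebook uses the 3000 most
# frequent words instead.
toy_features = ["outstanding", "ludicrous", "movie"]
toy_review = ["outstanding", "movie", "great", "cast"]

print(get_feature_dict(toy_review, toy_features))
```

Words that are not in the feature list ("great", "cast") are simply ignored, which is why the choice of features matters for the classifier.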

In [36]:
training_data = [(get_feature_dict(doc), category) for doc, category in training_documents] 
testing_data = [(get_feature_dict(doc), category) for doc, category in testing_documents]

In [37]:
#Classification using NLTK Naive Bayes

In [38]:
from nltk import NaiveBayesClassifier 

In [39]:
classifier = NaiveBayesClassifier.train(training_data)

In [40]:
nltk.classify.accuracy(classifier, testing_data)

0.776

In [41]:
classifier.show_most_informative_features(15)

Most Informative Features
               ludicrous = True              neg : pos    =     20.2 : 1.0
             outstanding = True              pos : neg    =     13.5 : 1.0
                    anna = True              pos : neg    =     10.0 : 1.0
                   damon = True              pos : neg    =      9.3 : 1.0
              schumacher = True              neg : pos    =      9.3 : 1.0
                 idiotic = True              neg : pos    =      8.4 : 1.0
             wonderfully = True              pos : neg    =      8.2 : 1.0
            breathtaking = True              pos : neg    =      8.2 : 1.0
               stupidity = True              neg : pos    =      7.6 : 1.0
                   anger = True              pos : neg    =      7.6 : 1.0
                lifeless = True              neg : pos    =      7.3 : 1.0
                 balance = True              pos : neg    =      7.1 : 1.0
                   inept = True              neg : pos    =      7.0 : 1.0

##**Sklearn Classifiers within NLTK**

NLTK provides a SklearnClassifier wrapper that lets NLTK users call an underlying scikit-learn classifier from their NLTK code in Python.

First construct a scikit-learn estimator object, then use it to construct a SklearnClassifier. E.g., to wrap a linear SVM with default settings:

`from sklearn.svm import LinearSVC`

`from nltk.classify.scikitlearn import SklearnClassifier`

`classifier = SklearnClassifier(LinearSVC())`

In [42]:
#Using Sklearn Classifier within Nltk
from sklearn.svm import SVC
from nltk.classify.scikitlearn import SklearnClassifier

In [43]:
svc = SVC()
classifier_sklearn = SklearnClassifier(svc)

In [44]:
classifier_sklearn.train(training_data)

<SklearnClassifier(SVC())>

In [45]:
nltk.classify.accuracy(classifier_sklearn, testing_data)

0.812

In [46]:
from sklearn.ensemble import RandomForestClassifier

In [47]:
rfc = RandomForestClassifier()
classifier_sklearn1 = SklearnClassifier(rfc)

In [48]:
classifier_sklearn1.train(training_data)

<SklearnClassifier(RandomForestClassifier())>

In [49]:
nltk.classify.accuracy(classifier_sklearn1, testing_data)

0.81

##**Count Vectorizer**

Count Vectorizer transforms a given text into a vector based on the frequency (count) of each word that occurs in the text.

It converts a collection of text documents into a matrix of term/token counts. It also enables pre-processing of the text data before the vector representation is generated.

CountVectorizer creates a matrix in which each unique word is represented by a column and each text sample from the collection is a row. The value of each cell is simply the count of that word in that particular text sample.
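The row/column layout described above can be reproduced by hand for a tiny corpus. This is a plain-Python sketch of the idea, not scikit-learn's implementation: it splits on whitespace only and keeps every word, and the function name `count_matrix` is our own.

```python
from collections import Counter

def count_matrix(texts):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in texts[i]."""
    tokenized = [text.lower().split() for text in texts]   # naive whitespace split
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)          # missing words count as 0
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, rows = count_matrix(["the sky sky is blue", "the sun is bright"])
print(vocabulary)
for row in rows:
    print(row)
```

Each printed row corresponds to one text sample and each position in the row to one word of the vocabulary.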


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
train_set = ["the sky sky is blue", "the sun is bright"]  # a list keeps the row order fixed
count_vec = CountVectorizer(max_features = 3)
a = count_vec.fit_transform(train_set)
a.todense() # converts the sparse count matrix to a dense one; the columns are the 3 most frequent words

In [None]:
count_vec.get_feature_names_out() # features with the highest frequency (get_feature_names() was removed in scikit-learn 1.2)

In [None]:
categories = [category for document, category in documents]

In [None]:
text_documents = [" ".join(document) for document, category in documents]

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(text_documents, categories)

In [None]:
count_vec = CountVectorizer(max_features = 2000)
x_train_features = count_vec.fit_transform(x_train)
x_train_features.todense()

In [None]:
count_vec.get_feature_names_out()

In [None]:
x_test_features = count_vec.transform(x_test)

In [None]:
x_test_features


In [None]:
#Sklearn classifier on countvectorized data
from sklearn.svm import SVC


In [None]:
svc = SVC()  # fresh, untrained classifier for the count-vectorized features

In [None]:
svc.fit(x_train_features,y_train)

In [None]:
svc.score(x_test_features , y_test)

##**N-Grams**

An N-gram is a sequence of N tokens (words): a 2-gram (called a bigram) is a two-word sequence such as “really good”, “not good”, or “your homework”, and a 3-gram (trigram) is a three-word sequence such as “not at all” or “turn off light”.

Set the parameter ngram_range=(a,b) where a is the minimum and b is the maximum size of ngrams you want to include in your features. The default ngram_range is (1,1).

Instead of using a single word as a feature, we can use a pair or a triple of words as one feature for our model.
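To see which features this produces, n-grams can be generated by hand from a token list. The two helpers below are a plain-Python sketch with names of our own choosing; `ngram_range` only mirrors the idea of the CountVectorizer parameter and is not scikit-learn code.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_range(tokens, a, b):
    """All n-grams for n = a..b inclusive, like ngram_range=(a, b)."""
    result = []
    for n in range(a, b + 1):
        result.extend(ngrams(tokens, n))
    return result

tokens = "this is not at all good".split()
print(ngrams(tokens, 2))
print(ngram_range(tokens, 2, 3))
```

A sentence with six tokens yields five bigrams and four trigrams, so the number of candidate features grows quickly as the range widens.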

In [None]:
count_vec = CountVectorizer(max_features = 2000, ngram_range=(2,3))
x_train_features = count_vec.fit_transform(x_train)
x_train_features.todense()

In [None]:
count_vec.get_feature_names() #Here the features are a pair of words or set of three words

##**TF-IDF**

TF-IDF is an abbreviation for Term Frequency-Inverse Document Frequency. It is a very common technique for transforming text into a meaningful numerical representation that can be used to fit a machine learning algorithm for prediction.

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

TF-IDF for a word in a document is calculated by multiplying two different metrics:
1. The term frequency of a word in a document. There are several ways of calculating this frequency, with the simplest being a raw count of instances a word appears in a document. Then, there are ways to adjust the frequency, by length of a document, or by the raw frequency of the most frequent word in a document.
2. The inverse document frequency of the word across a set of documents. This means, how common or rare a word is in the entire document set. The closer it is to 0, the more common a word is. This metric can be calculated by taking the total number of documents, dividing it by the number of documents that contain a word, and calculating the logarithm.

Multiplying these two numbers results in the TF-IDF score of a word in a document. The higher the score, the more relevant that word is in that particular document.
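The calculation described above can be worked through in a few lines. This sketch uses the simplest variant (raw count for TF, log of total documents over documents containing the word for IDF); scikit-learn applies a smoothed and normalised version, so its numbers will differ. The function name and the three-document toy corpus are assumptions made for the example.

```python
import math

def tf_idf(term, document, corpus):
    """Raw-count TF multiplied by log(N / df) IDF."""
    tf = document.count(term)                      # raw count in this document
    df = sum(1 for doc in corpus if term in doc)   # documents containing the term
    if df == 0:
        return 0.0                                 # term never occurs in the corpus
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [
    "the sky sky is blue".split(),
    "the sun is bright".split(),
    "the sun in the sky".split(),
]
print(tf_idf("the", corpus[0], corpus))   # in every document, so IDF is log(1) = 0
print(tf_idf("blue", corpus[0], corpus))  # in one of the three documents
```

A word that occurs in every document gets a score of 0 no matter how often it appears, which is exactly the behaviour that makes TF-IDF discount very common words.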

**min_df, max_df**: These are the minimum and maximum document frequencies that words/n-grams must have to be used as features. If either parameter is set to an integer, it is used as a bound on the number of documents a term must appear in; if it is set to a float between 0.0 and 1.0, as in the cell below, it is interpreted as a proportion of the documents.

In [None]:
count_vec = CountVectorizer(max_features = 2000, ngram_range=(2,3),min_df=0.1,max_df=0.7)
x_train_features = count_vec.fit_transform(x_train)
