# Introduction to NLP in Python
## Vectorization

### Bag-of-Words Model (BOW)


In [1]:
from sklearn.feature_extraction.text import CountVectorizer

Create an instance of the CountVectorizer class:

In [3]:
count_vectorizer = CountVectorizer()

Load your text data into a list or an array.


In [4]:
sample_text = [
    "Supervised learning is a type of machine learning",
    "Unsupervised learning is another type of machine learning",
    "Machine learning algorithms can be used for NLP",
    "NLP is a subfield of machine learning",
    "Sentiment analysis is a type of NLP application"
]

Next, we fit the vectorizer to the text data and transform it. Fitting builds the vocabulary from the text, and transforming converts each sentence into its BoW representation; `fit_transform` does both in one call.

In [5]:
bow = count_vectorizer.fit_transform(sample_text)

The resulting `bow` object is a sparse matrix that holds the BoW representation of the text data. Go ahead and print its contents as a dense array to see how often each vocabulary word occurs in each sentence.
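To make the term "sparse" concrete, here is a minimal sketch that uses plain Python rather than scikit-learn's actual storage format. It takes the counts for the first sentence and keeps only the non-zero cells; the names `dense_row` and `sparse_row` are illustrative, not part of any library.

```python
# Counts for the first sentence, one value per vocabulary word (dense form).
dense_row = [0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 1, 0, 0, 1, 1, 0, 0]

# A sparse representation stores only the non-zero cells,
# here as {column index: count}.
sparse_row = {col: count for col, count in enumerate(dense_row) if count != 0}
print(sparse_row)

# Rebuilding the dense row shows that no information was lost.
rebuilt = [sparse_row.get(col, 0) for col in range(len(dense_row))]
print(rebuilt == dense_row)
```

Only 6 of the 18 cells are stored, which is why sparse storage pays off once the vocabulary grows to thousands of words.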

In [6]:
print(bow.toarray())

[[0 0 0 0 0 0 0 1 2 1 0 1 0 0 1 1 0 0]
 [0 0 1 0 0 0 0 1 2 1 0 1 0 0 0 1 1 0]
 [1 0 0 0 1 1 1 0 1 1 1 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 1 1 1 1 1 0 1 0 0 0 0]
 [0 1 0 1 0 0 0 1 0 0 1 1 1 0 0 1 0 0]]


Each row in the matrix corresponds to a sentence in the sample text, and each column corresponds to a word in the vocabulary. The values in the matrix represent the frequency of each word in each sentence.
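The mapping from words to columns can be reproduced by hand. The sketch below is an approximation of what `CountVectorizer` does with its default settings, not its actual implementation: it lowercases the text, keeps tokens of two or more characters, sorts the unique tokens to form the vocabulary, and counts occurrences per sentence. The helper names `tokenize`, `column_of`, and `bow_rows` are made up for this illustration.

```python
import re
from collections import Counter

sample_text = [
    "Supervised learning is a type of machine learning",
    "Unsupervised learning is another type of machine learning",
    "Machine learning algorithms can be used for NLP",
    "NLP is a subfield of machine learning",
    "Sentiment analysis is a type of NLP application",
]

def tokenize(sentence):
    # Lowercase, then keep runs of two or more word characters,
    # so the one-letter word "a" is dropped.
    return re.findall(r"\b\w\w+\b", sentence.lower())

tokenized = [tokenize(sentence) for sentence in sample_text]

# The vocabulary is the sorted set of unique tokens; a word's position
# in this list is its column index in the matrix.
vocabulary = sorted({token for tokens in tokenized for token in tokens})
column_of = {word: index for index, word in enumerate(vocabulary)}

# One row per sentence, one column per vocabulary word.
bow_rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow_rows.append([counts[word] for word in vocabulary])

print(column_of)
print(bow_rows[0])
```

The first row should agree with the first row of the printed matrix: "learning" sits in column 8 and appears twice in the first sentence.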

<br></br>
### Term Frequency-Inverse Document Frequency (TF-IDF)

The scikit-learn library provides a convenient implementation of the TF-IDF algorithm.

1. First, run the cell below to import the `TfidfVectorizer` class:

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

2. Define the text that you want to analyze. This can be a list or array of strings, or a column of text from a pandas DataFrame for a larger dataset.

In [8]:
text_data = [
    "Supervised learning is a type of machine learning.",
    "Unsupervised learning is another type of machine learning.",
    "Machine learning algorithms can be used for a wide range of applications, such as image recognition and natural language processing.",
    "One of the most popular machine learning algorithms is the decision tree.",
    "Random forest is an ensemble learning method that combines multiple decision trees.",
    "Support vector machines are a powerful class of machine learning algorithms used for classification and regression tasks.",
    "Neural networks are a type of machine learning algorithm inspired by the structure and function of the human brain.",
    "Deep learning is a subfield of machine learning that involves neural networks with many layers.",
    "Transfer learning is a technique in machine learning where a model trained on one task is used to improve performance on another related task.",
    "Natural Language Processing (NLP) is a subfield of machine learning that deals with the interaction between computers and humans in natural language.",
    "Text classification is an important task in NLP that involves assigning a document to one or more predefined categories.",
    "Named Entity Recognition (NER) is another important task in NLP that involves identifying and extracting entities such as people, organizations, and locations from text data."
]


3. Create an instance of the TfidfVectorizer class. You can pass any parameters you need. Some common ones are:
- `max_features`: the maximum number of features (unique words) to keep in the TF-IDF matrix; when set, only the most frequent terms across the corpus are kept
- `stop_words`: words to exclude from the TF-IDF calculation, such as "the" and "and"; pass a list of words or `'english'` to use scikit-learn's built-in English list
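The effect of these two parameters can be sketched in plain Python. This is an illustration of the idea only, not scikit-learn's code: the toy corpus, the two-word stop list, and the limit of three features are all made up for the example.

```python
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
stop_words = {"the", "on"}
max_features = 3

# Drop stop words, then count how often each remaining word occurs overall.
term_counts = Counter(
    word for doc in docs for word in doc.split() if word not in stop_words
)

# Keep only the max_features most frequent terms as the vocabulary.
vocabulary = sorted(word for word, _ in term_counts.most_common(max_features))
print(vocabulary)
```

Here "cat", "dog", and "sat" each occur twice and survive, while the words that occur once are cut by the feature limit.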


In [9]:
tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')

4. Fit and transform the text into a TF-IDF matrix:

In [10]:
tfidf = tfidf_vectorizer.fit_transform(text_data)

5. Print the resulting matrix:

In [11]:
print(tfidf.toarray())

[[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.52146104 0.         0.28202368 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.64158678
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.48673138 0.         0.
  0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.   

The resulting matrix is printed using the `toarray()` method, which converts the sparse matrix to a dense matrix for easier viewing.
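The four non-zero values in the first row printed above can be reproduced by hand. The sketch below assumes the default `TfidfVectorizer` settings, namely smoothed IDF computed as `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row. The term and document counts were tallied manually from `text_data` after removing English stop words, so treat them as part of the illustration.

```python
import math

n_documents = 12  # number of sentences in text_data

# Term counts in the first sentence after English stop words are removed,
# and the number of sentences that contain each term (counted by hand).
term_frequency = {"learning": 2, "machine": 1, "supervised": 1, "type": 1}
document_frequency = {"learning": 10, "machine": 9, "supervised": 1, "type": 3}

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {
    term: math.log((1 + n_documents) / (1 + df)) + 1
    for term, df in document_frequency.items()
}

# Raw TF-IDF weights, then L2 normalization so the row has unit length.
raw = {term: tf * idf[term] for term, tf in term_frequency.items()}
norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
tfidf_row = {term: weight / norm for term, weight in raw.items()}

for term, weight in tfidf_row.items():
    print(f"{term}: {weight:.8f}")
```

"supervised" gets the largest weight because it appears in only one sentence, while "machine" appears in most sentences and is weighted down accordingly.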


<br></br>
### Word Embeddings

Here, we will use the Gensim Python library to create word embeddings.

1. If it has not been installed, run the following line to install Gensim:

In [14]:
pip install gensim

If the installation fails with a permissions error such as `OSError: [WinError 5] Access is denied`, pip suggests retrying with the `--user` option or checking the permissions of the install directory.



2. Import the necessary libraries:

In [15]:
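import gensim
from gensim.models import Word2Vec
import nltk
from nltk import word_tokenize

# word_tokenize needs the 'punkt' tokenizer data (newer NLTK releases may ask for 'punkt_tab')
nltk.download('punkt')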

3. Train a Word2Vec model on a tokenized text corpus. Word2Vec takes lists of tokens as its input, so the `word_tokenize` function from the previous quest is used here. We will reuse the `text_data` list from the TF-IDF section:

In [16]:
text_data = [
    "Supervised learning is a type of machine learning.",
    "Unsupervised learning is another type of machine learning.",
    "Machine learning algorithms can be used for a wide range of applications, such as image recognition and natural language processing.",
    "One of the most popular machine learning algorithms is the decision tree.",
    "Random forest is an ensemble learning method that combines multiple decision trees.",
    "Support vector machines are a powerful class of machine learning algorithms used for classification and regression tasks.",
    "Neural networks are a type of machine learning algorithm inspired by the structure and function of the human brain.",
    "Deep learning is a subfield of machine learning that involves neural networks with many layers.",
    "Transfer learning is a technique in machine learning where a model trained on one task is used to improve performance on another related task.",
    "Natural Language Processing (NLP) is a subfield of machine learning that deals with the interaction between computers and humans in natural language.",
    "Text classification is an important task in NLP that involves assigning a document to one or more predefined categories.",
    "Named Entity Recognition (NER) is another important task in NLP that involves identifying and extracting entities such as people, organizations, and locations from text data."
]

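# Lowercase each sentence and split it into tokens
tokens_in_text = [word_tokenize(sentence.lower()) for sentence in text_data]

# Train Word2Vec on the tokenized sentences; min_count=1 keeps every word
model = Word2Vec(tokens_in_text, min_count=1)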

If the `punkt` tokenizer data is not available locally, this cell raises `LookupError: Resource punkt not found`. Running `nltk.download('punkt')` once and then rerunning the cell resolves it. For more information, see https://www.nltk.org/data.html


The cell above converts each sentence to lowercase, splits it into individual tokens, and passes the resulting list of token lists to the Word2Vec model.

The `min_count` parameter ignores all words with a total frequency lower than the given value. With `min_count=1`, nothing is ignored, so even words that appear only once are kept.
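The thresholding rule itself can be sketched without Gensim. The example below is an illustration of the rule, not Gensim's implementation: it counts tokens across two of the sentences, using `str.split()` as a simple stand-in for `word_tokenize`, and shows which words survive for different thresholds. The helper `kept_vocabulary` is hypothetical.

```python
from collections import Counter

# Two sentences from text_data, tokenized with str.split() as a simple
# stand-in for word_tokenize.
sentences = [
    "supervised learning is a type of machine learning".split(),
    "unsupervised learning is another type of machine learning".split(),
]

# Total frequency of every token across the whole corpus.
word_counts = Counter(token for sentence in sentences for token in sentence)

def kept_vocabulary(min_count):
    # Keep words whose total frequency is at least min_count.
    return sorted(word for word, count in word_counts.items() if count >= min_count)

print(kept_vocabulary(1))  # every word survives
print(kept_vocabulary(2))  # words seen only once are dropped
```

On a corpus this small, a threshold of 2 already removes "supervised" and "unsupervised", which is why `min_count=1` is a sensible choice for toy data.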

4. Once your model is trained, you can access the word embeddings for any word like this:

In [None]:
vector = model.wv['class']
print(vector)

This returns the vector that represents the word 'class' in the embedding space. The `wv` attribute of a Word2Vec model (short for "word vectors") holds the trained embeddings, indexed by word.
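Embedding vectors are usually compared with cosine similarity, which measures the angle between two vectors regardless of their length. The sketch below uses hypothetical three-dimensional vectors in place of real model output, and the `cosine_similarity` helper is written here for illustration rather than taken from Gensim.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors standing in for trained word embeddings.
vec_class = [1.0, 2.0, 0.0]
vec_category = [2.0, 4.0, 0.0]  # same direction as vec_class
vec_brain = [0.0, 0.0, 3.0]     # perpendicular to vec_class

print(cosine_similarity(vec_class, vec_category))
print(cosine_similarity(vec_class, vec_brain))
```

Vectors that point in the same direction score close to 1 and perpendicular vectors score 0. With a model trained on only twelve sentences, similarities between real word vectors should not be expected to be meaningful.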