# Converting Corpus to Features


To convert a corpus to a feature matrix, the following steps are used:
1. Loading your own corpus
2. Pre-processing corpus: Normalization, Tokenization, Stop-word Removal, Stemming
3. Converting pre-processed corpus to feature matrix (TDM/DTM or TTM)
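
The three steps above can be sketched end to end in plain Python. This is a minimal illustration rather than the notebook's actual pipeline: the two documents are taken from the corpus used later, the `STOP_WORDS` set is a small hand-picked assumption (NLTK's list is much longer), stemming is left out, and the helper name `preprocess` is hypothetical.

```python
from collections import Counter

# Two documents borrowed from the corpus used later in this notebook.
docs = [
    "Data Science is an important field of science .",
    "The cars are driven on the roads .",
]

# Assumption: a tiny hand-picked stop-word list, only for illustration.
STOP_WORDS = {"is", "an", "of", "the", "are", "on"}

def preprocess(doc):
    """Lower-case, split on whitespace, keep alphabetic non-stop-words."""
    tokens = doc.lower().split()
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]

tokenized = [preprocess(d) for d in docs]

# Vocabulary: sorted set of all remaining terms (the DTM columns).
vocab = sorted({t for doc in tokenized for t in doc})

# Document-term matrix: one row per document, one count per vocabulary term.
dtm = []
for doc in tokenized:
    counts = Counter(doc)
    dtm.append([counts[term] for term in vocab])

print(vocab)
for row in dtm:
    print(row)
```

Each row of `dtm` is the feature vector of one document. Transposing it would give the term-document matrix (TDM).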

# Loading your own corpus
A user-defined corpus can be imported in Python using two methods:
1. Using the nltk.corpus readers PlaintextCorpusReader or CategorizedPlaintextCorpusReader
2. Using Python's built-in file handling (the open() function)

# CorpusReader

![20.png](attachment:20.png)



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
from nltk.corpus import PlaintextCorpusReader
#path='C:/Users/Desktop/dataset/'
path='/content/drive/MyDrive/CONVAI/DATA/'


dataset=PlaintextCorpusReader(path,'.*')


In [None]:
dataset.fileids()

['1.txt', '2.txt', '3.txt', '4.txt']

In [None]:
dataset.raw(fileids='1.txt')

'Data Science is an important field of science .'

In [None]:
#Converting the dataset to a list (where each string represents a separate document)
corpus=[]
for i  in dataset.fileids():
    corpus.append(dataset.raw(fileids=i))
corpus

['Data Science is an important field of science .',
 'This is an important data science course .',
 'The cars are driven on the roads .',
 'The trucks are driven on the highways .']

# Loading Corpus Using File Method of Python

Using files:  
File_object=open(r"File_Name","Access_Mode")

Access Modes :
1. Read Only (‘r’)
2. Read and Write (‘r+’)
3. Write Only (‘w’)
4. Write and Read (‘w+’)
5. Append Only (‘a’)
6. Append and Read (‘a+’)
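
The effect of the most common access modes can be tried out safely in a temporary directory. The sketch below is only an illustration: the file name `sample.txt` is hypothetical, and the `encoding="utf-8"` argument is an assumption about the text files.

```python
import os
import tempfile

# Work in a temporary directory so that no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "sample.txt")  # hypothetical file name

    # 'w' creates the file (or truncates an existing one) for writing.
    with open(file_path, "w", encoding="utf-8") as f:
        f.write("Data Science is an important field of science .\n")

    # 'a' appends to the end without discarding the existing content.
    with open(file_path, "a", encoding="utf-8") as f:
        f.write("The cars are driven on the roads .\n")

    # 'r' opens for reading only; the file must already exist.
    with open(file_path, "r", encoding="utf-8") as f:
        lines = f.read().splitlines()

    # 'r+' reads and writes without truncating; the position starts at 0.
    with open(file_path, "r+", encoding="utf-8") as f:
        first_line = f.readline().rstrip("\n")

print(lines)
print(first_line)
```

Opening files in a `with` block closes them automatically, so no explicit `close()` call is needed.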

In [None]:
import os
filenames=os.listdir(path)
filenames

['1.txt', '2.txt', '3.txt', '4.txt']

In [None]:
path

'/content/drive/MyDrive/CONVAI/DATA/'

In [None]:

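filenames=sorted(os.listdir(path)) #os.listdir returns names in arbitrary order
corpus=[]
for name in filenames:
    #the with statement closes the file even if reading fails
    with open(os.path.join(path,name),'r') as f:
        corpus.append(f.read())
corpus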

['Data Science is an important field of science .',
 'This is an important data science course .',
 'The cars are driven on the roads .',
 'The trucks are driven on the highways .']
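
The same loading step can also be written with `pathlib`. This is a sketch under stated assumptions: the helper name `load_corpus` is hypothetical, the files are assumed to be UTF-8 encoded, and the demonstration writes its own files into a temporary folder instead of reading the Drive folder used above.

```python
import tempfile
from pathlib import Path

def load_corpus(folder):
    """Read every .txt file in `folder` (sorted by name) into a list of strings."""
    return [p.read_text(encoding="utf-8") for p in sorted(Path(folder).glob("*.txt"))]

# Demonstration on a temporary folder with hypothetical files.
with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, "2.txt").write_text("This is an important data science course .", encoding="utf-8")
    Path(tmp_dir, "1.txt").write_text("Data Science is an important field of science .", encoding="utf-8")
    Path(tmp_dir, "notes.md").write_text("not part of the corpus", encoding="utf-8")
    demo_corpus = load_corpus(tmp_dir)

print(demo_corpus)
```

The `*.txt` pattern skips files that are not part of the corpus. Sorting is by name as a string, so `10.txt` would come before `2.txt`.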

# Pre-processing Step 1: Normalization

Normalization of text includes the following steps:
1. Converting the text to a single case (lower, upper, or proper/title case)
2. Removing numbers, special symbols, and URLs from the text


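The string methods used for case conversion in the cells below are:

text = "Hello World"

lower_text = text.lower()    # 'hello world'

split_text = text.split()    # ['Hello', 'World']

words = ["This", "is", "a", "sentence"]

sentence = ' '.join(words)   # 'This is a sentence'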


In [None]:
corpus

['Data Science is an important field of science .',
 'This is an important data science course .',
 'The cars are driven on the roads .',
 'The trucks are driven on the highways .']

In [None]:
for i in corpus:
  print(i)

Data Science is an important field of science .
This is an important data science course .
The cars are driven on the roads .
The trucks are driven on the highways .


In [None]:
lower=[]
for i in corpus:
  s=""
  s=' '.join(x.lower() for x in i.split())
  #print(s)
  lower.append(s)
lower


['data science is an important field of science .',
 'this is an important data science course .',
 'the cars are driven on the roads .',
 'the trucks are driven on the highways .']

In [None]:
#Converting text to lower case using the .lower() method of Python strings
lower=[]
for i in corpus:
    lower.append(' '.join([word.lower() for word in i.split()]))
lower

['data science is an important field of science .',
 'this is an important data science course .',
 'the cars are driven on the roads .',
 'the trucks are driven on the highways .']

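Removing special symbols, punctuation, numbers, and URLs

Each of these is done with re.sub(pattern, replacement, text) from Python's re module:

cleaned_text = re.sub(r'[^\w\s]', '', text)

text = re.sub(r'http\S+|www\S+', '', text)

text = re.sub(r'\d+', '', text)

The pattern elements used above are:

^ as the first character inside [...] negates the character class, so [^\w\s] matches any character that is neither a word character nor whitespace.

The + symbol means one or more occurrences of the preceding element.

The | symbol means alternation: http\S+|www\S+ matches either pattern.

\w+ matches sequences of one or more word characters (letters, digits, and underscores).

\d+ matches sequences of one or more digits. It finds all sequences of digits in the text.

\s: Matches any whitespace character (space, tab, newline, etc.).

\S: Matches any non-whitespace character (letters, digits, punctuation, etc.).

\W: Matches any character that is not a word character. A word character is defined as a letter (a-z, A-Z), a digit (0-9), or an underscore (_).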
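
The patterns described above can be applied one after another to see what each one removes. This is a small sketch on a made-up string (the URL and the text are hypothetical). It also contrasts `[^\w\s]`, which keeps digits and underscores, with the stricter `[^a-zA-Z\s]`.

```python
import re

# Hypothetical sample string containing a URL, digits, an underscore, and punctuation.
text = "Visit https://example.com for 25 NLP_tips!!"

no_urls = re.sub(r'http\S+|www\S+', '', text)     # drop URL tokens
no_digits = re.sub(r'\d+', '', no_urls)           # drop runs of digits
no_punct = re.sub(r'[^\w\s]', '', no_digits)      # drop non-word, non-space characters
cleaned = re.sub(r'\s+', ' ', no_punct).strip()   # collapse leftover whitespace

# The stricter class also removes the underscore, because "_" is not a letter.
letters_only = re.sub(r'\s+', ' ', re.sub(r'[^a-zA-Z\s]', '', no_digits)).strip()

print(cleaned)       # Visit for NLP_tips
print(letters_only)  # Visit for NLPtips
```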

In [None]:
import re
def clean_text(text):
    # Remove URLs (http\S+ also covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Remove special characters (everything except letters and whitespace)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Collapse runs of whitespace and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Clean each document in the corpus
cleaned_corpus = [clean_text(doc) for doc in corpus]

print(cleaned_corpus)


In [None]:
# Removing numbers, special symbols, urls using the .isalpha() method of Python strings
alpha=[]
for i in lower:
    alpha.append(' '.join([word for word in i.split() if word.isalpha()]))
alpha

['data science is an important field of science',
 'this is an important data science course',
 'the cars are driven on the roads',
 'the trucks are driven on the highways']
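
The .isalpha() filter works on whole tokens, so a token that contains even one non-letter is dropped entirely. In the corpus above this is harmless because the full stop is already a separate token. The sketch below uses a made-up sentence to show the difference from a character-level regex filter.

```python
import re

# Hypothetical sentence where punctuation and digits are attached to words.
doc = "nlp-based course costs 100usd, enroll today."

# Token-level filter: any token containing a non-letter is dropped whole.
kept_by_isalpha = [w for w in doc.split() if w.isalpha()]

# Character-level filter: only the offending characters are removed.
kept_by_regex = re.sub(r'[^a-zA-Z\s]', '', doc).split()

print(kept_by_isalpha)  # ['course', 'costs', 'enroll']
print(kept_by_regex)    # ['nlpbased', 'course', 'costs', 'usd', 'enroll', 'today']
```

Which behaviour is preferable depends on the data: the token-level filter loses "today", while the regex filter keeps fragments such as "usd".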

# Pre-processing Step 2: Tokenization

Tokenization converts each document into a list of words (tokens). It can be done in two ways:
1. The .split() method of Python strings
2. The word_tokenize function of nltk.tokenize
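
The two approaches differ when punctuation is attached to words. The sketch below compares .split() with a simple regular-expression tokenizer on a made-up sentence. The regex is only an illustrative stand-in that needs no downloads; it is not the rule set used by word_tokenize, which handles contractions and punctuation with its own rules.

```python
import re

# Hypothetical sentence with punctuation attached to words.
sentence = "Data science isn't easy, but it's fun."

# str.split() only breaks on whitespace, so punctuation stays attached.
whitespace_tokens = sentence.split()

# Simple regex tokenizer: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(regex_tokens)
```

On the cleaned corpus of this notebook both methods give the same tokens, because punctuation has already been removed.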

In [None]:
tokenize=[]
for i in alpha:
  s=i.split()
  tokenize.append(s)
tokenize

[['data', 'science', 'is', 'an', 'important', 'field', 'of', 'science'],
 ['this', 'is', 'an', 'important', 'data', 'science', 'course'],
 ['the', 'cars', 'are', 'driven', 'on', 'the', 'roads'],
 ['the', 'trucks', 'are', 'driven', 'on', 'the', 'highways']]

In [None]:
#Tokenization using .split()
tokenize=[]
for i in alpha:
    tokenize.append([word for word in i.split()])
tokenize

[['data', 'science', 'is', 'an', 'important', 'field', 'of', 'science'],
 ['this', 'is', 'an', 'important', 'data', 'science', 'course'],
 ['the', 'cars', 'are', 'driven', 'on', 'the', 'roads'],
 ['the', 'trucks', 'are', 'driven', 'on', 'the', 'highways']]

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
#Tokenization using word_tokenize
tokenize=[]
from nltk.tokenize import word_tokenize
for i in alpha:
    tokenize.append(word_tokenize(i))
tokenize

[['data', 'science', 'is', 'an', 'important', 'field', 'of', 'science'],
 ['this', 'is', 'an', 'important', 'data', 'science', 'course'],
 ['the', 'cars', 'are', 'driven', 'on', 'the', 'roads'],
 ['the', 'trucks', 'are', 'driven', 'on', 'the', 'highways']]

# Pre-processing Step 3: Stop-word Removal
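A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that carries little meaning on its own, so it is usually removed in NLP applications.

NLTK (Natural Language Toolkit) in Python has lists of stop words stored in its stopwords corpus for 16 different languages (the exact number depends on the version of the NLTK data).

The file IDs of the stopwords corpus are the language names, so stopwords.words('english') returns the English list.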
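
The filtering idea itself needs nothing more than a membership test. The sketch below uses a small hand-written stop-word set as a stand-in for the NLTK list (an assumption made so that the example runs without downloads) and shows why the text should be lower-cased first.

```python
# Assumption: a tiny hand-written stop-word set; NLTK's English list is far longer.
stop_words = {"the", "a", "an", "in", "is", "are", "on", "of", "this"}

tokens = ["the", "cars", "are", "driven", "on", "the", "roads"]

# Membership tests on a set take constant time on average,
# whereas a list is scanned element by element.
filtered = [t for t in tokens if t not in stop_words]

# The comparison is case-sensitive: "The" is not in the lower-case set.
not_lowered = [t for t in ["The", "cars"] if t not in stop_words]

print(filtered)     # ['cars', 'driven', 'roads']
print(not_lowered)  # ['The', 'cars']
```

With NLTK, the list returned by stopwords.words('english') can be wrapped in set(...) to get the same fast lookup.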

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords
stopword=stopwords.words('english') #stopword will contain list of all stopwords of english language
no_stop=[]
for i in tokenize:
    no_stop.append([word for word in i if word not in stopword])
no_stop

[['data', 'science', 'important', 'field', 'science'],
 ['important', 'data', 'science', 'course'],
 ['cars', 'driven', 'roads'],
 ['trucks', 'driven', 'highways']]

In [None]:
for i in tokenize:
  print(i)
  for word in i:
    if word not in stopword:
      print(word)

#     no_stop.append([word for word in i if word not in stopword])
# no_stop

['data', 'science', 'is', 'an', 'important', 'field', 'of', 'science']
data
science
important
field
science
['this', 'is', 'an', 'important', 'data', 'science', 'course']
important
data
science
course
['the', 'cars', 'are', 'driven', 'on', 'the', 'roads']
cars
driven
roads
['the', 'trucks', 'are', 'driven', 'on', 'the', 'highways']
trucks
driven
highways


In [None]:
no_stop=[] #reset the list so that re-running the cell does not append duplicates
for i in tokenize:
  l=[]
  for word in i:
    if word not in stopword:
      l.append(word)
  no_stop.append(l) #append inside the loop, once per document
no_stop


[['data', 'science', 'important', 'field', 'science'],
 ['important', 'data', 'science', 'course'],
 ['cars', 'driven', 'roads'],
 ['trucks', 'driven', 'highways']]

In [None]:
no_stop=[]
for i in tokenize:
  no_stop.append([word for word in i if word not in stopword])
no_stop

[['data', 'science', 'important', 'field', 'science'],
 ['important', 'data', 'science', 'course'],
 ['cars', 'driven', 'roads'],
 ['trucks', 'driven', 'highways']]

# Pre-processing Step 4: Stemming

Stemming is a process that maps variant word forms to a common base form (for example, play, plays, playing, and played all map to play).

nltk.stem provides a number of stemming algorithms, such as PorterStemmer and LancasterStemmer. Each stemmer's .stem() method takes a single token and returns its stem, so it is applied to every word of a tokenized document. The stem is not necessarily a dictionary word.
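
The idea of suffix stripping can be illustrated with a deliberately naive sketch. This is not the Porter or Lancaster algorithm, which apply many ordered rules with extra conditions. The function name `naive_stem`, the suffix list, and the minimum stem length are all assumptions chosen for illustration.

```python
def naive_stem(word, suffixes=("ing", "ed", "s"), min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

variants = ["play", "plays", "playing", "played"]
stems = [naive_stem(w) for w in variants]

print(stems)                  # ['play', 'play', 'play', 'play']
print(naive_stem("running"))  # runn
```

The result for "running" shows why real stemmers need additional rules, for example for doubled consonants.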


In [None]:
#Stemming Example
from nltk.stem import PorterStemmer #Importing porter stemmer class
ps=PorterStemmer() #Creating an object of PorterStemmer Class
ps.stem('unhappy') #stemming a word using .stem method

'unhappi'

In [None]:
#Stemming the corpus
final=[] #will contain final pre-processed documents
from nltk.stem import PorterStemmer
ps=PorterStemmer()
for i in no_stop:
    final.append(' '.join([ps.stem(word) for word in i]))
final

['data scienc import field scienc',
 'import data scienc cours',
 'car driven road',
 'truck driven highway']

In [None]:
from nltk.stem import LancasterStemmer

# Initialize the Lancaster Stemmer
ls = LancasterStemmer()
final = []
# Stemming the corpus
for i in no_stop:
   final.append(' '.join([ls.stem(word) for word in i]))

print(final)

['dat sci import field sci', 'import dat sci cours', 'car driv road', 'truck driv highway']

# Lemmatization

Lemmatization also reduces a word to a base form, but it looks the word up in a vocabulary (WordNet here) and returns a dictionary word (the lemma). WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed.


In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ['running', 'flies', 'better', 'cats']

# Lemmatize each word
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)

['running', 'fly', 'better', 'cat']


In [None]:
words_with_pos = [('running', 'v'), ('flies', 'n'), ('better', 'a'), ('cats', 'n')]
# Lemmatize each word with its POS tag
lemmatized_words_with_pos = [lemmatizer.lemmatize(word, pos) for word, pos in words_with_pos]
print(lemmatized_words_with_pos)

['run', 'fly', 'good', 'cat']


In [None]:
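import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

#nltk.pos_tag needs the perceptron tagger model; newer NLTK releases
#name this resource 'averaged_perceptron_tagger_eng'
nltk.download('averaged_perceptron_tagger')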

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Define a function to convert POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# Define words to be lemmatized
words = ['running', 'flies', 'better', 'cats']

# POS tagging
pos_tagged_words = nltk.pos_tag(words)

# Lemmatize each word with its POS tag
lemmatized_words_with_pos = [
    lemmatizer.lemmatize(word, get_wordnet_pos(pos_tag))
    for word, pos_tag in pos_tagged_words
]

print(lemmatized_words_with_pos)
