# Text classification/categorization

    What is text classification?

Text classification is the process of assigning text documents to one or more classes or categories drawn from a predefined set.

A document here is any piece of text; it can be as short as a single sentence or as long as a paragraph.

## Two types of text classification

    What types of text classifications are available?

- content-based classification
- request-based classification

__Content-based classification__ is the type of text classification where priorities or weights are given to specific subjects or topics in the text content, and those weights help determine the class of the document.

E.g., a book with more than 30 percent of its content about food preparations can be classified under cooking/recipes. 
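The threshold idea in the example above can be sketched in a few lines. This is only an illustration, not a method taken from the notebook: the `FOOD_TERMS` vocabulary, the function names `topic_share` and `classify_by_content`, and the `threshold` parameter are all hypothetical.

```python
import re

# Hypothetical topic vocabulary; a real system would use a much larger one.
FOOD_TERMS = {"recipe", "bake", "boil", "flour", "sugar", "oven", "simmer"}

def topic_share(text, topic_terms):
    """Return the fraction of word tokens in `text` that belong to `topic_terms`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in topic_terms)
    return hits / len(tokens)

def classify_by_content(text, topic_terms, label, threshold=0.3):
    """Assign `label` when the topic share exceeds `threshold`, else 'other'."""
    return label if topic_share(text, topic_terms) > threshold else "other"

print(classify_by_content("bake the flour and sugar in the oven",
                          FOOD_TERMS, "cooking/recipes"))   # cooking/recipes
print(classify_by_content("the sky is blue and beautiful today",
                          FOOD_TERMS, "cooking/recipes"))   # other
```

In the first sentence, 4 of the 8 tokens are food terms, so the share is 0.5 and the 30 percent threshold is exceeded.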

__Request-based classification__ is influenced by user requests and targeted towards specific user groups and audiences. This type of classification is governed by specific policies and ideals.

## Text classification blueprint

1. prepare test, train and validation (optional) datasets
2. text normalization
3. feature extraction
4. model training
5. model prediction and evaluation
6. model deployment
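Step 1 of the blueprint can be sketched with the standard library alone. The function name `split_dataset` and its parameters (`test_size`, `val_size`, `seed`) are assumptions made for illustration; the notebook does not define how the datasets are prepared.

```python
import random

def split_dataset(documents, labels, test_size=0.2, val_size=0.0, seed=42):
    """Shuffle (document, label) pairs and split them into train/val/test lists."""
    pairs = list(zip(documents, labels))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(pairs)
    n_test = int(len(pairs) * test_size)
    n_val = int(len(pairs) * val_size)
    test = pairs[:n_test]
    val = pairs[n_test:n_test + n_val]
    train = pairs[n_test + n_val:]
    return train, val, test

# Toy data, made up for this example.
docs = [f"document {i}" for i in range(10)]
labels = ["weather" if i % 2 == 0 else "food" for i in range(10)]

train, val, test = split_dataset(docs, labels, test_size=0.2, val_size=0.1)
print(len(train), len(val), len(test))  # 7 1 2
```

Leaving `val_size` at 0.0 yields an empty validation set, which matches the "optional" note in step 1.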

## Text normalization

- expanding contractions
- text standardization through lemmatization
- removing special characters and symbols
- removing stopwords

Others:
- correcting spelling
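Spelling correction is not implemented in this notebook. As a rough sketch of one possible approach, the standard library's `difflib.get_close_matches` can map a misspelled token to the closest word in a known vocabulary. The `VOCABULARY` list, the function name `correct_spelling`, and the `cutoff` value are assumptions chosen for this example.

```python
from difflib import get_close_matches

# Hypothetical vocabulary of known-good words.
VOCABULARY = ["sky", "blue", "beautiful", "cheese", "love", "today"]

def correct_spelling(token, vocabulary=VOCABULARY, cutoff=0.7):
    """Return the closest vocabulary word, or the token itself if none is close enough."""
    if token in vocabulary:
        return token
    matches = get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

print([correct_spelling(t) for t in "the sky is bleu and beautifull".split()])
```

Tokens with no sufficiently similar vocabulary entry, such as `the`, pass through unchanged, so a small vocabulary does not corrupt words it does not know.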

In [11]:
# To import these helpers, create a directory named `module` containing an __init__.py file.
# Note: keeping the helper .py files in the same folder as the .ipynb can raise an exception
# (for example, a top-level tokenize.py would shadow the standard-library tokenize module).
from module.contractions import expand_contractions 
from module.tokenize import tokenize_text
from module.lemmatize import lemmatize_text, pos_tag_text
from module.feature_extractor import bow_extractor

In [2]:
expand_contractions("this isn't good")

'this is not good'
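The source of `module.contractions` is not shown here. A minimal sketch of how such a function might work, assuming a small lookup table and a regular expression, is given below. `CONTRACTION_MAP`, `CONTRACTION_PATTERN`, and `expand_contractions_sketch` are hypothetical names, and the real module presumably covers far more contractions.

```python
import re

# Hypothetical, deliberately tiny contraction map.
CONTRACTION_MAP = {
    "isn't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "i'm": "i am",
    "it's": "it is",
}

# One alternation that matches any known contraction as a whole word.
CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions_sketch(text, contraction_map=CONTRACTION_MAP):
    """Replace each known contraction in `text` with its expanded form."""
    def expand(match):
        contraction = match.group(0)
        expanded = contraction_map[contraction.lower()]
        # Keep a leading capital letter, e.g. "It's" -> "It is".
        if contraction[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return CONTRACTION_PATTERN.sub(expand, text)

print(expand_contractions_sketch("this isn't good"))  # this is not good
```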

In [3]:
# tokenize_text splits text into tokens; the other normalization functions build on it.
tokenize_text('hello world')

['hello', 'world']
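The notebook imports `tokenize_text` from a module whose source is not shown. As a stand-in, a regular-expression tokenizer along the following lines produces the same result for simple input. `TOKEN_PATTERN` and `tokenize_text_sketch` are hypothetical names, and a library tokenizer may split contractions and punctuation differently.

```python
import re

# A word (optionally with one apostrophe part, e.g. "isn't"),
# or a single non-space punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize_text_sketch(text):
    """Split `text` into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize_text_sketch("hello world"))   # ['hello', 'world']
print(tokenize_text_sketch("isn't good!"))   # ["isn't", 'good', '!']
```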

In [4]:
import re

# Match any hello.
pattern = re.compile('hello')

# Define a substitution function that allows us access to the matched word.
def subfn(m):
    match = m.group(0)
    return f'[{match}]'
    
pattern.sub(subfn, 'hello world')

'[hello] world'

In [8]:
lemmatize_text('where are you playing football')

'where be you play football'

In [31]:
import string
import re
from nltk.corpus import stopwords

stopword_list = stopwords.words('english')

def remove_special_characters(text):
    tokens = tokenize_text(text)
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_tokens = filter(None, [pattern.sub('', token) 
                                    for token in tokens])
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

def remove_stopwords(text):
    tokens = tokenize_text(text)
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

def normalize_corpus(corpus, tokenize=False):
    normalized_corpus = []
    for text in corpus:
        text = expand_contractions(text)
        text = lemmatize_text(text)
        text = remove_special_characters(text)
        text = remove_stopwords(text)
        # Store each document once: as a token list if requested, else as a string.
        if tokenize:
            text = tokenize_text(text)
        normalized_corpus.append(text)
    return normalized_corpus

In [32]:
CORPUS = [
    'the sky is blue',
    'sky is blue and sky is beautiful',
    'the beautiful sky is blue',
    'i love blue cheese'
]
new_doc = ['loving this blue sky today']

In [33]:
normalize_corpus(CORPUS, True)

[['sky', 'blue'],
 ['sky', 'blue', 'sky', 'beautiful'],
 ['beautiful', 'sky', 'blue'],
 ['love', 'blue', 'cheese']]

## Feature extraction