# Text Classification Methods in Python

In [12]:
import sklearn
import nltk # Natural Language Toolkit

## Step 1: Load a Dataset

To train an initial text classifier, I use the 20 Newsgroups sample dataset, which scikit-learn fetches through `sklearn.datasets` (it is downloaded on first use).

The dataset consists of roughly 20,000 newsgroup posts, each labeled with one of 20 topics.

In [3]:
from sklearn.datasets import fetch_20newsgroups

# we are using the training subset of the data here
newsgroups_data = fetch_20newsgroups(subset="train")

# newsgroups_data.data: list of strings, one per newsgroup post
# newsgroups_data.target: array of integer category labels, one per post
X, y = newsgroups_data.data, newsgroups_data.target

To see which attributes the returned object exposes, list them with `dir`:

In [4]:
attributes = dir(newsgroups_data)
print(attributes)

['DESCR', 'data', 'filenames', 'target', 'target_names']


## Step 2: Preprocess the Data

The raw text must be cleaned and standardized before a model can make good use of it.

### Tokenization

Tokenization breaks a longer string of text into smaller units called tokens, which can be individual characters or whole words.

EX:

'Hello World' --> 'H','e','l','l','o',' ','W','o','r','l','d'

Here we use word-level tokenization, so:

'Hello World' --> 'Hello','World'
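
The two granularities can be compared in a minimal sketch. The regular expression is an assumption made for illustration only; the notebook itself relies on NLTK's `word_tokenize`, which treats punctuation and contractions more carefully.

```python
import re

text = "Hello World"

# character-level: every character, including the space, becomes a token
char_tokens = list(text)

# word-level: runs of word characters become tokens (a simplified stand-in
# for nltk's word_tokenize, which is what the notebook actually uses)
word_tokens = re.findall(r"\w+", text)

print(char_tokens)
print(word_tokens)
```

Word-level tokens are the natural unit for the stopword removal and stemming steps that follow.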

### Stopword Removal

Stopword removal drops words that carry little to no semantic content in the text.

EX: and, is, that
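
As a rough sketch of the idea, the filter below uses a small hand-written stopword set. That set is hypothetical and far shorter than NLTK's English list, which the notebook uses.

```python
# Hypothetical mini stopword list for illustration only; NLTK's English
# list (used in the notebook) is much longer.
stop_words = {"and", "is", "that", "the", "a"}

tokens = ["The", "cat", "is", "small", "and", "that", "matters"]

# lowercase each token, then keep it only if it is not a stopword
filtered = [t.lower() for t in tokens if t.lower() not in stop_words]

print(filtered)
```

Lowercasing before the lookup matters because the stopword set is all lowercase, so "The" would otherwise slip through.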

### Stemming

Stemming strips affixes to reduce a word to its root form. The Porter stemmer used below mainly removes suffixes, so 'running' becomes 'run'.
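
The effect of stemming can be illustrated with a deliberately naive suffix stripper. This is not the Porter algorithm; the suffix list and the minimum root length are assumptions chosen only to show how different inflections collapse to one token.

```python
# Toy suffix stripper, NOT the Porter algorithm: it only removes a few
# common endings and ignores Porter's rules about what may remain.
SUFFIXES = ("ing", "ed", "ly", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        # only strip when a root of at least 3 letters is left
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["jumping", "jumped", "jumps", "jump"]
stems = [naive_stem(w) for w in words]

for w, s in zip(words, stems):
    print(w, "->", s)
```

All four inputs map to the same stem, so the vectorizer later counts them as one feature instead of four.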

In [11]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

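import nltk # re-imported so this cell also works when run on its own
nltk.download("punkt") # Punkt is a pre-trained sentence tokenizer; word_tokenize uses it before splitting sentences into words
nltk.download("stopwords") # downloads the stopword lists
# note: newer NLTK releases may ask for nltk.download("punkt_tab") instead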


# Preprocess one document: tokenize, lowercase, drop stopwords and
# non-alphabetic tokens, then stem what remains.
# function name: preprocess_text
# input: text (a string); returns: one space-separated string of stems

STOP_WORDS = set(stopwords.words('english')) # NLTK's English stopword list, built once
STEMMER = PorterStemmer() # reused for every document

def preprocess_text(text):
    tokens = word_tokenize(text)
    # keep tokens that are purely alphabetic and not stopwords, lowercased
    filtered_tokens = [token.lower() for token in tokens if token.isalpha() and token.lower() not in STOP_WORDS]
    # reduce each remaining token in 'filtered_tokens' to its stem
    stemmed_tokens = [STEMMER.stem(token) for token in filtered_tokens]
    # join the stems back into one string so the vectorizer can consume it
    return " ".join(stemmed_tokens)

X_preprocessed = [preprocess_text(text) for text in X]

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/KeerthiStanley/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/KeerthiStanley/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Step 3: Numerical Conversion of the Stemmed Tokens

Machine learning models operate on numerical features, so the stemmed tokens must be converted into numeric inputs.

One common approach is Term Frequency-Inverse Document Frequency (TF-IDF), which weights a term by how often it occurs in a document and down-weights terms that occur in many documents.

This can be implemented with scikit-learn's `TfidfVectorizer`:

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_transformed = vectorizer.fit_transform(X_preprocessed)
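
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on a toy corpus. It assumes the formula scikit-learn documents for the vectorizer's default settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalization); the three toy documents are made up for illustration.

```python
import math
from collections import Counter

# hypothetical, already-preprocessed toy documents
docs = ["car engin car", "engin oil", "space launch"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# document frequency: in how many documents each term appears
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    counts = Counter(doc)  # raw term frequency
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
        for t, c in counts.items()
    }
    # L2-normalize so the document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[1])
print(vec)
```

In the second document, "oil" occurs in only one document of the corpus while "engin" occurs in two, so "oil" ends up with the larger weight.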

## Step 4: Train the Text Classification Model

The classifier trained here is a support vector machine with a linear kernel (`SVC(kernel='linear')`). `MultinomialNB` is also imported so that Naive Bayes can be swapped in for comparison.

**Further discussion: which ML method works best for text classification of scientific papers? Compare accuracy scores.**

In [16]:
from sklearn.naive_bayes import MultinomialNB # Naive Bayes alternative, not used in this run
from sklearn.svm import SVC

# linear-kernel SVM; swap in MultinomialNB() here to compare
classifier = SVC(kernel='linear')
classifier.fit(X_transformed, y)
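
Since Naive Bayes comes up as a candidate method here, the following sketch shows how a multinomial Naive Bayes classifier scores a document. It is a toy written from scratch on raw word counts with add-one smoothing, not scikit-learn's `MultinomialNB`; the training texts and labels are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (preprocessed text, label)
train = [
    ("engin car wheel", "autos"),
    ("car brake oil", "autos"),
    ("orbit launch rocket", "space"),
    ("rocket engin orbit", "space"),
]

class_docs = Counter(label for _, label in train)  # documents per class
word_counts = defaultdict(Counter)                 # word counts per class
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def predict(text, alpha=1.0):
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        # log prior: share of training documents with this label
        score = math.log(class_docs[label] / len(train))
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # log likelihood with add-alpha (Laplace) smoothing
            score += math.log(
                (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
            )
        scores[label] = score
    # the class with the highest log score wins
    return max(scores, key=scores.get)

print(predict("car oil"))
print(predict("rocket launch"))
```

Working in log space avoids multiplying many small probabilities, and the smoothing term keeps a word that is unseen in one class from zeroing out that class entirely.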

## Step 5: Evaluate the Text Classification Model

The held-out test subset goes through the same preprocessing and the already fitted vectorizer, and the predictions are then scored for accuracy.

In [18]:
test_data = fetch_20newsgroups(subset="test")
X_test, y_test = test_data.data, test_data.target

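# apply the same preprocessing as for the training data
X_test_preprocessed = [preprocess_text(text) for text in X_test]
# reuse the fitted vectorizer: transform only, never fit on the test data
X_test_transformed = vectorizer.transform(X_test_preprocessed)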

y_pred = classifier.predict(X_test_transformed)

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy for  SVM:", accuracy)

Accuracy for  SVM: 0.8254115772703133
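
For reference, accuracy is simply the share of predictions that match the true label. The sketch below computes it by hand and also tallies which classes get confused; the label lists are hypothetical and are not results from this notebook.

```python
from collections import Counter

# Hypothetical labels for illustration, not results from the notebook
y_true = [0, 1, 2, 2, 1, 0]
y_hat = [0, 1, 2, 1, 1, 2]

# accuracy: share of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_hat))
toy_accuracy = correct / len(y_true)

# count (true, predicted) pairs for the misclassified examples
errors = Counter((t, p) for t, p in zip(y_true, y_hat) if t != p)

print(toy_accuracy)
print(errors)
```

A single accuracy number hides which topics are mixed up, so a per-pair error count like this (or a full confusion matrix) is a useful next step when comparing classifiers.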
