# Pre-Processing Text Documents

`HashingVectorizer` is the better choice when memory and other resources are limited, or when we need incremental (out-of-core) learning. `CountVectorizer` is the better choice when we need access to the actual tokens.
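A minimal sketch of that trade-off, assuming a hypothetical two-sentence corpus (`docs`) that is not part of the original notebook: `CountVectorizer` learns a vocabulary it can report back, while `HashingVectorizer` keeps no state and can transform text without being fit.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = ["the cat sat on the mat", "the dog ate the bone"]

# CountVectorizer learns a vocabulary during fit, so the tokens stay accessible.
count_vec = CountVectorizer()
X_count = count_vec.fit_transform(docs)
tokens = sorted(count_vec.vocabulary_)

# HashingVectorizer is stateless: transform() works without a prior fit,
# the output width is fixed by n_features, and no vocabulary is stored.
hash_vec = HashingVectorizer(n_features=16)
X_hash = hash_vec.transform(docs)
has_vocabulary = hasattr(hash_vec, "vocabulary_")

print(X_count.shape, tokens)
print(X_hash.shape, has_vocabulary)
```

Because nothing is stored between calls, the hashed representation uses constant memory regardless of corpus size, at the cost of not being able to map a column back to a word.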

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

from sklearn.linear_model import SGDClassifier

In [1]:
text = [
    'One way to get a sense of the daring of this personal statement, written by a student who aims to study film at Columbia University, is simply to consider the allusions he makes throughout his statement. With neither apology nor obvious humility, this writer makes references to Steven Spielberg, Woody Allen, Jean-Luc Godard, Jean Vigo, Terrence Malick, and David Gordon Green. Further, this writer takes the unusual step of using section headings in his personal statement, including, on his first page “Poetry,” “Plastics,” and “Children.” But no matter how creative this writer is, of course, we must ultimately judge him on his evidenced ability as a filmmaker. In that regard, he showcases his ease with talking about films and directors, posits an analogy about student filmmaking (“directing your own material is like parenting”), and discusses the success of his nineteen-minute senior project, “Burying Dvorak”—a film he promoted by taking a year off after graduation, successfully landing it in more than 20 film festivals. As he closes his essay, he makes a specific pitch for Columbia University, where he hopes to continue “to discover my own voice, my own poetry.”',
    'For the lengthy sample essay from the student in biological science, the extensive length and scientific depth are necessary because the student is applying for the highly competitive STAR Fellowship. The STAR (Science to Achieve Results) program offers graduate fellowships through the US Environmental Protection Agency (EPA), funding several years of study. Given the competitiveness of the process and the EPA’s mission of environmental protection, it is vital that this student presents a viable, environmentally important project in a persuasive, professional manner. '
    'To achieve this, the writer successfully approaches the essay as she would a thesis proposal, using science-related section heads, providing original figures and data, focusing heavily on future research goals, and essentially performing a literature review, citing 19 sources ranging from basic textbooks to refereed journals. The result is a powerful essay with scientific depth.',
    'In the first sample essay from mechanical engineering, what stands out immediately are the length and the photographs. In this case, the student was applying for an engineering scholarship, so he was given room to flesh out technical material as well as address issues such as personal motivations one would expect to read in a personal statement. Much of the essay is given to a discussion of his thesis work, which involves the examination of “the propagation of a flame in a small glass tube.” The figures depict the experimental work and represent the success of preliminary thesis results, visually indicating the likely point at which the flame reached detonation.',
]

### Using CountVectorizer

In [3]:
c_vectorizer = CountVectorizer()
X_c = c_vectorizer.fit_transform(text)
X_c.shape

(3, 247)

In [8]:
c_vectorizer.vocabulary_ # Maps each token to its column index; every token is one feature of the dataset
print(X_c) # Prints the sparse matrix as (document, column) count entries

  (0, 139)	1
  (0, 231)	1
  (0, 218)	6
  (0, 78)	1
  (0, 185)	1
  (0, 135)	5
  (0, 213)	4
  (0, 37)	1
  (0, 215)	4
  (0, 146)	2
  (0, 197)	3
  (0, 243)	1
  (0, 25)	2
  (0, 200)	2
  (0, 237)	1
  (0, 8)	1
  (0, 201)	1
  (0, 65)	3
  (0, 19)	1
  (0, 30)	2
  (0, 221)	2
  (0, 103)	3
  (0, 189)	1
  (0, 33)	1
  (0, 10)	1
  :	:
  (2, 204)	1
  (2, 125)	1
  (2, 58)	1
  (2, 168)	1
  (2, 126)	1
  (2, 47)	1
  (2, 240)	2
  (2, 236)	2
  (2, 102)	1
  (2, 57)	1
  (2, 162)	1
  (2, 70)	2
  (2, 190)	1
  (2, 80)	1
  (2, 219)	1
  (2, 40)	1
  (2, 59)	1
  (2, 173)	1
  (2, 155)	1
  (2, 227)	1
  (2, 101)	1
  (2, 113)	1
  (2, 152)	1
  (2, 167)	1
  (2, 42)	1
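Each `(row, column) count` entry above can be translated back into a word by inverting `vocabulary_`. The sketch below does this on a hypothetical two-sentence corpus (`docs` and the helper `token_counts` are illustrative names, not part of the original notebook).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = ["the cat sat", "the cat ate the fish"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# vocabulary_ maps token -> column index; invert it to go the other way.
index_to_token = {idx: tok for tok, idx in vec.vocabulary_.items()}

def token_counts(matrix, row):
    """Return {token: count} for one document row of a sparse count matrix."""
    coo = matrix.tocoo()
    counts = {}
    for r, c, v in zip(coo.row, coo.col, coo.data):
        if r == row:
            counts[index_to_token[int(c)]] = int(v)
    return counts

counts_doc1 = token_counts(X, 1)
print(counts_doc1)
```

This kind of lookup is only possible with `CountVectorizer`, because it is the stored vocabulary that links a column index to a token.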


### Using HashingVectorizer

In [9]:
h_vectorizer = HashingVectorizer(n_features=50)

# n_features needs to be chosen with care: large enough that distinct words rarely collide in the same bucket, but not so large that memory is wasted
X_h = h_vectorizer.fit_transform(text)
X_h.shape

(3, 50)

In [10]:
print(X_h[0]) # For the first element in the list

  (0, 1)	-0.04789131426105757
  (0, 2)	-0.09578262852211514
  (0, 3)	0.09578262852211514
  (0, 4)	0.23945657130528783
  (0, 5)	-0.04789131426105757
  (0, 6)	0.04789131426105757
  (0, 7)	0.04789131426105757
  (0, 8)	-0.04789131426105757
  (0, 9)	0.04789131426105757
  (0, 10)	0.04789131426105757
  (0, 11)	-0.23945657130528783
  (0, 12)	0.04789131426105757
  (0, 13)	0.0
  (0, 14)	-0.09578262852211514
  (0, 15)	0.04789131426105757
  (0, 16)	-0.04789131426105757
  (0, 17)	-0.23945657130528783
  (0, 18)	0.0
  (0, 19)	0.04789131426105757
  (0, 20)	-0.09578262852211514
  (0, 21)	0.09578262852211514
  (0, 22)	-0.19156525704423027
  (0, 23)	-0.04789131426105757
  (0, 24)	0.23945657130528783
  (0, 25)	0.0
  (0, 26)	0.04789131426105757
  (0, 27)	0.04789131426105757
  (0, 28)	0.04789131426105757
  (0, 29)	0.0
  (0, 30)	0.0
  (0, 31)	-0.04789131426105757
  (0, 32)	0.1436739427831727
  (0, 33)	0.0
  (0, 34)	0.09578262852211514
  (0, 35)	0.04789131426105757
  (0, 36)	0.0
  (0, 37)	0.04789131426105757
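The negative and fractional values above follow from how the hashing trick works with scikit-learn's defaults (`alternate_sign=True`, `norm='l2'`): each token is hashed to one of `n_features` buckets, a sign derived from the hash is added to that bucket, and the row is then scaled to unit length. The sketch below is a simplified pure-Python illustration of that idea, not scikit-learn's implementation: it assumes MD5 as a stand-in for the MurmurHash3 function the library uses, and `hash_vectorize` is a hypothetical helper name.

```python
import hashlib
import math
import re

def hash_vectorize(doc, n_features=50):
    """Toy hashing vectorizer: signed bucket counts followed by L2 normalisation."""
    vec = [0.0] * n_features
    # Tokens of two or more word characters, lowercased.
    for token in re.findall(r"\b\w\w+\b", doc.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Interpret the first 4 bytes as a signed 32-bit integer (assumed stand-in hash).
        h = int.from_bytes(digest[:4], "little", signed=True)
        bucket = abs(h) % n_features
        sign = 1.0 if h >= 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

example = hash_vectorize("hello hello world", n_features=8)
print(example)
```

With signed hashing, two tokens that collide in the same bucket with opposite signs cancel out, which is one plausible source of the explicit `0.0` entries in the output above.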


## Classifying text documents

### HashingVectorizer with SGDClassifier

In [11]:
import numpy as np
import pandas as pd
from io import StringIO, BytesIO, TextIOWrapper
from zipfile import ZipFile
import urllib.request

resp = urllib.request.urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip')
zipfile = ZipFile(BytesIO(resp.read()))

data = TextIOWrapper(zipfile.open('sentiment labelled sentences/amazon_cells_labelled.txt'), encoding='utf-8')

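# Note: the file has no header row, so read_csv uses its first line as the column names.
# Renaming the columns on the next line discards that line, leaving 999 rows; pass header=None to keep every row.
df = pd.read_csv(data, sep='\t')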
df.columns = ['review', 'sentiment']
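When a file has no header row, `pd.read_csv` treats the first line as column names by default. The sketch below shows the difference on a hypothetical three-line, tab-separated string (`raw` is made up for illustration), using the `header=None` and `names` parameters.

```python
from io import StringIO

import pandas as pd

# Hypothetical headerless, tab-separated data, used only for illustration.
raw = "first review\t1\nsecond review\t0\nthird review\t1\n"

# Default behaviour: the first line becomes the header and is lost as data.
df_default = pd.read_csv(StringIO(raw), sep="\t")

# header=None keeps every line as data; names supplies the column labels.
df_no_header = pd.read_csv(
    StringIO(raw), sep="\t", header=None, names=["review", "sentiment"]
)

print(len(df_default), list(df_default.columns))
print(len(df_no_header), list(df_no_header.columns))
```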

In [12]:
df.head()

                                              review  sentiment
0                        Good case, Excellent value.          1
1                             Great for the jawbone.          1
2  Tied to charger for conversations lasting more...          0
3                                  The mic is great.          1
4  I have to jiggle the plug to get it to line up...          0


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     999 non-null    object
 1   sentiment  999 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 15.7+ KB


In [14]:
df.loc[:, 'sentiment'].unique()

array([1, 0], dtype=int64)

In [15]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X_train.shape, y_train.shape

((799,), (799,))

In [18]:
h_vectorizer = HashingVectorizer()
classifier = SGDClassifier(penalty='l2', loss='hinge')

Now, let's train the classifier in parts using `partial_fit()`. The training set is split into two roughly equal parts: the first 400 rows and the remaining 399.

On Iteration-1:

In [21]:
X_train_part1_hashed = h_vectorizer.fit_transform(X_train[0:400])
y_train_part1 = y_train[0:400]

all_classes = np.unique(df['sentiment'])
print(all_classes)

classifier.partial_fit(X_train_part1_hashed, y_train_part1, classes=all_classes)

# Use the trained classifier on test data
X_test_hashed = h_vectorizer.transform(X_test) # HashingVectorizer is stateless, so transform() needs no prior fit
test_score = classifier.score(X_test_hashed, y_test)
print(test_score)

[0 1]
0.775


On Iteration-2:

In [22]:
X_train_part2_hashed = h_vectorizer.fit_transform(X_train[400:])
y_train_part2 = y_train[400:]

classifier.partial_fit(X_train_part2_hashed, y_train_part2)

# Use the updated classifier on test data
X_test_hashed = h_vectorizer.transform(X_test) # HashingVectorizer is stateless, so the same transform applies regardless of which part was hashed last
test_score = classifier.score(X_test_hashed, y_test)
print(test_score)

0.805


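**Note that after the second `partial_fit` call, the test score rose from 0.775 to 0.805.** Because `train_test_split` was called without a `random_state`, the exact scores will vary from run to run.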
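The two manual iterations generalise to a loop over any number of batches. The following is a minimal sketch under stated assumptions: the toy `texts` and `labels`, the batch size, and the helper `iter_batches` are all hypothetical and stand in for the review dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy data, used only for illustration.
texts = ["great phone", "terrible battery", "love it", "waste of money",
         "works great", "very bad quality", "excellent value", "poor sound"]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

vectorizer = HashingVectorizer(n_features=2**10)
clf = SGDClassifier(loss="hinge", penalty="l2", random_state=0)

# partial_fit must be told every class up front, since a batch may lack some.
all_classes = np.unique(labels)

def iter_batches(texts, labels, batch_size):
    """Yield consecutive (texts, labels) slices of at most batch_size items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size], labels[start:start + batch_size]

for batch_texts, batch_labels in iter_batches(texts, labels, batch_size=4):
    # The stateless vectorizer hashes each batch independently.
    X_batch = vectorizer.transform(batch_texts)
    clf.partial_fit(X_batch, batch_labels, classes=all_classes)

predictions = clf.predict(vectorizer.transform(["great value", "bad battery"]))
print(predictions)
```

In a real out-of-core setting the batches would be read from disk or a stream, so the full dataset never has to fit in memory at once.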