In [1]:
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
# TfidfVectorizer is used when the pipeline is built, so it must be imported too
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
import numpy

In [2]:
train = pd.read_csv("train.csv")

In [3]:
train.head(5)

Unnamed: 0,label,text
0,1,"Henry Thomas showed a restraint, even when the..."
1,1,"This movie starts out brisk, has some slow mom..."
2,1,Castle of Blood is a good example of the quali...
3,1,I viewed the movie together with a homophobic ...
4,1,"The ""Men in White"" movie is definitely one of ..."


In [4]:
test = pd.read_csv("test.csv")

In [5]:
test.head(5)

Unnamed: 0,Id,text
0,0,I cannot believe I actually sat through the wh...
1,1,I saw this one remastered on DVD. It had a big...
2,2,"Irrespective of the accuracy of facts, Bandit ..."
3,3,"Significant Spoilers! This is a sick, disturbi..."
4,4,If there are people that don't like this movie...


There are several ways to get started. First, think of a way to transform the text of a review into a feature vector, such that each dimension represents a word and the value represents the weight of that word in the review. You can also try different TF-IDF variants to adjust the weights, and you may consider adding bigram features as well. The `sklearn` package offers several ways to extract features from text, so let's play with one of them.
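To make the idea of word and bigram features concrete, here is a minimal sketch on a toy corpus. The three short reviews are made up for illustration and stand in for `train.text`; `ngram_range` and `sublinear_tf` are two of the `TfidfVectorizer` options you might experiment with, not a recommended configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy reviews, invented for illustration only (not from train.csv).
docs = [
    "good movie",
    "bad movie",
    "not good movie",
]

# ngram_range=(1, 2) keeps single words and adjacent word pairs as features.
# sublinear_tf=True uses 1 + log(tf) instead of the raw term count.
vec = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
X = vec.fit_transform(docs)

# Each column is one unigram or bigram; each row is one review.
print(X.shape)
print(sorted(vec.vocabulary_))
```

Bigrams such as "not good" let a linear model give a negated phrase its own weight, at the cost of a much larger vocabulary.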

In [7]:
pipe = Pipeline([('count', TfidfVectorizer(binary=True)),
                 ('tfid', TfidfTransformer())]).fit(train.text)

In [8]:
pipe['count'].transform(train.text)

<10000x51704 sparse matrix of type '<class 'numpy.float64'>'
	with 1381935 stored elements in Compressed Sparse Row format>

In [9]:
pipe['tfid'].idf_

array([6.59952245, 5.62547289, 9.11182808, ..., 9.51729319, 9.51729319,
       9.51729319])

In [10]:
X_train = pipe.fit_transform(train.text)

In [11]:
X_train

<10000x51704 sparse matrix of type '<class 'numpy.float64'>'
	with 1381935 stored elements in Compressed Sparse Row format>

As the name suggests, `fit_transform` turns the reviews in the training set into a 10000 x 51704 matrix. The number 51704 means that there are 51704 unique words in the training reviews. We can also limit the number of features in the matrix by setting `max_features` when constructing the vectorizer (both `CountVectorizer` and `TfidfVectorizer` accept it).
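As a small illustration of `max_features`, the sketch below fits `CountVectorizer` twice on a made-up corpus, once without a limit and once keeping only the two most frequent terms. The documents are hypothetical and chosen so that the two most frequent words are unambiguous.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration: "good" and "movie" each occur
# three times, every other word occurs once.
docs = ["good movie good plot", "bad movie", "good movie acting"]

# Without a limit, every distinct token becomes a feature.
full = CountVectorizer().fit(docs)

# max_features=2 keeps only the two terms with the highest corpus frequency.
small = CountVectorizer(max_features=2).fit(docs)

print(len(full.vocabulary_), sorted(full.vocabulary_))
print(len(small.vocabulary_), sorted(small.vocabulary_))
```

A smaller vocabulary makes the matrix narrower and the model faster to fit, but rare words that might be informative are discarded.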

In [12]:
pipe['count'].transform(test.text)

<5000x51704 sparse matrix of type '<class 'numpy.float64'>'
	with 649911 stored elements in Compressed Sparse Row format>

In [13]:
pipe['tfid'].idf_

array([6.59952245, 5.62547289, 9.11182808, ..., 9.51729319, 9.51729319,
       9.51729319])

In [14]:
X_test = pipe.transform(test.text)

In [15]:
y_train = train.label

You may notice that we no longer construct X_train, X_test, y_train, and y_test with train_test_split. The train-test split is now provided by a third party, and y_test is hidden from you.
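Because y_test is hidden, one option for estimating accuracy before submitting is to cross-validate on the labelled training data. The following is a minimal sketch under that assumption, not part of the original workflow; the texts and labels are made up and stand in for `train.text` and `train.label`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy labelled reviews, invented for illustration (1 = positive, 0 = negative).
texts = [
    "great film with a great cast",
    "wonderful story and great acting",
    "a wonderful and moving film",
    "great fun from start to finish",
    "terrible plot and awful acting",
    "awful film, a terrible waste of time",
    "boring story and terrible cast",
    "awful and boring from start to finish",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Putting the vectorizer inside the pipeline means it is refit on each
# training fold, so the held-out fold never leaks into the vocabulary.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(model, texts, labels, cv=2, scoring="accuracy")

print(scores.mean())
```

On the real data you would pass `train.text` and `train.label` and likely use more folds, for example `cv=5`.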

## Building a Dummy Classifier

In [16]:
clf = LogisticRegression()

In [17]:
# LogisticRegression accepts sparse matrices, so the memory-heavy .toarray() call is unnecessary
y_pred = clf.fit(X_train, y_train).predict(X_test)

In [18]:
pd.DataFrame({"Id": test.Id, "Category": y_pred}).to_csv("EvanKagglePrediction.csv", index=False)