### Text Classification With Machine Learning and spaCy
+ Text categorization (also called text classification) is the task of assigning predefined categories to documents.
+ Sentiment Analysis
+ Multilabel classification
+ Dataset source: http://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

##### The aim is to classify each review as positive or negative


In [1]:
# Load EDA packages
import pandas as pd

In [2]:
# Load our dataset (three tab-separated files, one per review source)

# Set path to the data files
path = './input/'

# Add column names
col_names = ['Text', 'Label']

# Import data files as Pandas DataFrames (tab-separated, no header row)
df_yelp = pd.read_table(path+'yelp_labelled.txt', names=col_names)
df_amz = pd.read_table(path+'amazon_cells_labelled.txt', names=col_names)
df_imdb = pd.read_table(path+'imdb_labelled.txt', names=col_names)


In [3]:
# Collect the DataFrames in a list so they can be concatenated later
frames = [df_yelp,df_imdb,df_amz]

In [4]:
# Rename the column headers of each DataFrame
for frame in frames:
    frame.columns = ["Message","Target"]

In [5]:
# Column names of each DataFrame
for frame in frames:
    print(frame.columns)

Index(['Message', 'Target'], dtype='object')
Index(['Message', 'Target'], dtype='object')
Index(['Message', 'Target'], dtype='object')


In [6]:
# Assign a Key to Make it Easier
keys = ['Yelp','IMDB','Amazon']

In [7]:
# Merge or Concat our Datasets
df = pd.concat(frames,keys=keys)

In [8]:
# Shape: (rows, columns)
df.shape

(2748, 2)
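
The row count is worth a second look. The dataset description lists 500 positive and 500 negative sentences per source, which would give 3,000 rows rather than 2,748. One possible cause (an assumption, not verified against the files here) is that pandas treats a double quote at the start of a field as opening a quoted field, so lines with unbalanced quotes get merged. The sketch below reproduces that effect on a small made-up sample and shows that `quoting=csv.QUOTE_NONE` keeps one row per line.

```python
import csv
import io

import pandas as pd

# Made-up sample in the same layout as the data files: text<TAB>label.
# Lines 2 and 3 each start with an unbalanced double quote.
sample = 'Good film.\t1\n"Bad acting.\t0\n"Not funny.\t0\nNice.\t1\n'
col_names = ['Message', 'Target']

# Default quoting: the two quote characters are read as one quoted field,
# so the lines between them collapse into a single row.
df_default = pd.read_csv(io.StringIO(sample), sep='\t', names=col_names)

# QUOTE_NONE: quote characters are treated as ordinary text.
df_raw = pd.read_csv(io.StringIO(sample), sep='\t', names=col_names,
                     quoting=csv.QUOTE_NONE)

print(len(df_default), len(df_raw))
```

If the same effect applies to the real files, passing `quoting=csv.QUOTE_NONE` to `pd.read_table` in cell 2 would be the corresponding change.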

In [9]:
df.head()

| | | Message | Target |
|---|---|---|---|
| Yelp | 0 | Wow... Loved this place. | 1 |
| Yelp | 1 | Crust is not good. | 0 |
| Yelp | 2 | Not tasty and the texture was just nasty. | 0 |
| Yelp | 3 | Stopped by during the late May bank holiday of... | 1 |
| Yelp | 4 | The selection on the menu was great and so wer... | 1 |


In [10]:
# df.to_csv("sentimentdataset.csv")

In [11]:
# Data Cleaning
df.columns

Index(['Message', 'Target'], dtype='object')

In [12]:
# Checking for Missing Values
df.isnull().sum()

Message    0
Target     0
dtype: int64

### Working with spaCy
+ Removing Stopwords
+ Lemmatizing

In [13]:
from  spacy.lang.en.stop_words import STOP_WORDS

# Build a list of stopwords to use to filter
stopwords = list(STOP_WORDS)

In [14]:
stopwords

['every',
 'per',
 'part',
 'none',
 "n't",
 'if',
 'where',
 'not',
 'been',
 'move',
 'somehow',
 'with',
 'being',
 'thereupon',
 'whoever',
 'what',
 'only',
 '‘m',
 'somewhere',
 'within',
 'some',
 'towards',
 'keep',
 'mostly',
 '‘s',
 'how',
 'above',
 'toward',
 'up',
 'though',
 'afterwards',
 'few',
 'about',
 'are',
 'used',
 'of',
 'meanwhile',
 'would',
 'else',
 'made',
 'therein',
 'twelve',
 'back',
 'two',
 'most',
 'these',
 'without',
 'everything',
 'off',
 'again',
 'were',
 'wherever',
 'could',
 'three',
 'go',
 'too',
 'alone',
 'often',
 'nowhere',
 'among',
 'seems',
 'must',
 'formerly',
 'although',
 'six',
 'get',
 'serious',
 'my',
 '’re',
 'much',
 'them',
 'our',
 'eleven',
 'thereafter',
 'see',
 'please',
 'below',
 '‘ve',
 '‘d',
 'we',
 '’d',
 'various',
 'during',
 '’ll',
 'into',
 'first',
 'five',
 'show',
 'or',
 'side',
 'neither',
 'using',
 'whereas',
 'next',
 'either',
 'between',
 'whose',
 'therefore',
 'and',
 'former',
 'was',
 'us',
 'h
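
Note that the list includes negation words such as 'not', "n't", 'none' and 'neither'. For sentiment analysis, removing them can flip the meaning of a review. Below is a minimal sketch of keeping negations while filtering the rest. It uses a small hand-picked subset of spaCy's English stop words; the subset and the helper name `drop_stopwords` are illustrative only.

```python
# Illustrative subset of spaCy's English stop words, not the full list
stopwords_subset = {"not", "n't", "none", "neither", "is", "was", "with", "of"}

# Negation words we may want to keep for sentiment analysis
negations = {"not", "n't", "none", "neither"}
sentiment_stopwords = stopwords_subset - negations


def drop_stopwords(tokens, stop_set):
    """Return the tokens that are not in stop_set."""
    return [tok for tok in tokens if tok not in stop_set]


tokens = ["crust", "is", "not", "good"]
print(drop_stopwords(tokens, stopwords_subset))     # ['crust', 'good']
print(drop_stopwords(tokens, sentiment_stopwords))  # ['crust', 'not', 'good']
```

With the real list, the same idea would be `stopwords = list(STOP_WORDS - negations)`.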

##### Getting Lemmas and Stop Words

In [15]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [16]:
docx = nlp("This is how John Walker was walking. He was also running beside the lawn.")
type(docx)

spacy.tokens.doc.Doc

In [17]:
# Lemmatizing of tokens
for word in docx:
    print(word.text,"Lemma =>",word.lemma_)
    

This Lemma => this
is Lemma => be
how Lemma => how
John Lemma => John
Walker Lemma => Walker
was Lemma => be
walking Lemma => walk
. Lemma => .
He Lemma => -PRON-
was Lemma => be
also Lemma => also
running Lemma => run
beside Lemma => beside
the Lemma => the
lawn Lemma => lawn
. Lemma => .


In [18]:
# Lemma that are not pronouns
for word in docx:
    if word.lemma_ != "-PRON-":
        print(word.lemma_.lower().strip())

this
be
how
john
walker
be
walk
.
be
also
run
beside
the
lawn
.


In [19]:
# List Comprehensions of our Lemma
[word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in docx]

['this',
 'be',
 'how',
 'john',
 'walker',
 'be',
 'walk',
 '.',
 'he',
 'be',
 'also',
 'run',
 'beside',
 'the',
 'lawn',
 '.']
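
The `-PRON-` placeholder appears to be specific to spaCy 2.x. The sketch below assumes that newer versions return an ordinary word for pronouns and may return an empty `lemma_` when no lemmatizer is in the pipeline. It shows a version-tolerant fallback; the helper `normalize` and the `(text, lemma)` pairs are stand-ins for spaCy tokens, not part of the spaCy API.

```python
def normalize(text, lemma):
    """Return a lowercase lemma, falling back to the token text.

    The fallback covers the "-PRON-" placeholder and an empty lemma.
    """
    if not lemma or lemma == "-PRON-":
        return text.lower().strip()
    return lemma.lower().strip()


# (text, lemma) pairs taken from the lemmatizer output above
pairs = [("He", "-PRON-"), ("was", "be"), ("running", "run"), ("John", "John")]
print([normalize(text, lemma) for text, lemma in pairs])
# ['he', 'be', 'run', 'john']
```

With real tokens, the call would be `normalize(word.text, word.lemma_)`.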

In [20]:
# Filtering out Stopwords and Punctuation
for word in docx:
    if not word.is_stop and not word.is_punct:
        print(word)

John
Walker
walking
running
lawn


In [21]:
# Stop words and Punctuation In List Comprehension
[ word for word in docx if not word.is_stop and not word.is_punct ]

[John, Walker, walking, running, lawn]

In [22]:
# Use the punctuations of string module
import string
punctuations = string.punctuation

In [23]:
# Creating a spaCy parser (blank English pipeline: tokenizer only, no trained components)
from spacy.lang.en import English
parser = English()

In [24]:
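def spacy_tokenizer(sentence):
    """Tokenize a sentence, normalise each token, and drop stop words and punctuation."""
    mytokens = parser(sentence)
    # Lowercase each lemma; pronouns (lemma "-PRON-") fall back to their lowercased text
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    # Remove stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    return mytokens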

In [25]:
spacy_tokenizer("This is how John Walker was walking. He was also running beside the lawn.")

['john', 'walker', 'walking', 'running', 'lawn']

In [26]:
type(spacy_tokenizer("This is how John Walker was walking. He was also running beside the lawn."))

list
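
One subtlety in `spacy_tokenizer`: `punctuations` is a string, so `word not in punctuations` is a substring test. A multi-character token such as `...` is not a substring of `string.punctuation` and would therefore be kept. A sketch of a character-wise check, with `is_punctuation` as a hypothetical helper:

```python
import string

PUNCT_CHARS = set(string.punctuation)


def is_punctuation(token):
    """True if the token is non-empty and made up only of punctuation characters."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)


print("..." in string.punctuation)  # False: the substring test misses the ellipsis
print(is_punctuation("..."))        # True
print(is_punctuation("lawn"))       # False
```

Inside the tokenizer, the filter condition would then read `not is_punctuation(word)` instead of `word not in punctuations`.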

#### Machine Learning With scikit-learn

In [27]:
# ML Packages
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.metrics import accuracy_score 
from sklearn.base import TransformerMixin 
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

In [28]:
# Custom transformer that cleans the raw text before vectorization
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]
    def fit(self, X, y=None, **fit_params):
        return self
    def get_params(self, deep=True):
        return {}

# Basic function to clean the text 
def clean_text(text):     
    return text.strip().lower()
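
Returning `{}` from `get_params` works here, but an alternative is to also inherit from `BaseEstimator`, which supplies `get_params` and `set_params`. A sketch of that variant; the class name `TextCleaner` is hypothetical and not used elsewhere in this notebook:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class TextCleaner(TransformerMixin, BaseEstimator):
    """Strip surrounding whitespace and lowercase each document."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step works inside a Pipeline
        return self

    def transform(self, X):
        return [text.strip().lower() for text in X]


cleaner = TextCleaner()
print(cleaner.fit_transform(["  Great Product. ", "Crust is NOT good."]))
# ['great product.', 'crust is not good.']
print(cleaner.get_params())  # {}
```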

In [29]:
# Vectorization
vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1)) 
classifier = LinearSVC()

In [30]:
# Using Tfidf
tfvectorizer = TfidfVectorizer(tokenizer = spacy_tokenizer)
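
`CountVectorizer` stores raw term counts, whereas `TfidfVectorizer` down-weights terms that occur in many documents and, with its default settings, scales each row to unit L2 norm. A small sketch on a made-up three-sentence corpus, using the default tokenizer rather than `spacy_tokenizer` to keep it self-contained:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus for illustration
corpus = ["great food great service", "bad food", "great phone"]

counts = CountVectorizer().fit_transform(corpus)
tfidf = TfidfVectorizer().fit_transform(corpus)

# Both produce a sparse documents x vocabulary matrix of the same shape
print(counts.shape, tfidf.shape)

# Raw counts: "great" appears twice in the first sentence
print(counts.toarray().max())

# TF-IDF rows are L2-normalised by default, so each row norm is close to 1.0
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(row_norms)
```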

In [31]:
# Splitting Data Set
from sklearn.model_selection import train_test_split

In [32]:
# Features and Labels
X = df['Message']
ylabels = df['Target']

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.2, random_state=42)

In [34]:
# Create the pipeline to clean, tokenize, vectorize, and classify
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', vectorizer),
                 ('classifier', classifier)])

In [35]:
# Fit our data
pipe.fit(X_train,y_train)

Pipeline(steps=[('cleaner', <__main__.predictors object at 0x000001DDC83E7F40>),
                ('vectorizer',
                 CountVectorizer(tokenizer=<function spacy_tokenizer at 0x000001DDC00D0E50>)),
                ('classifier', LinearSVC())])

In [36]:
# Predicting with a test dataset
sample_prediction = pipe.predict(X_test)

In [37]:
# Prediction Results
# 1 = Positive review
# 0 = Negative review
for (sample,pred) in zip(X_test,sample_prediction):
    print(sample,"Prediction=>",pred)

Great product. Prediction=> 1
This product is very High quality Chinese CRAP!!!!!! Prediction=> 0
Let's start with all the problemsthe acting, especially from the lead professor, was very, very bad.   Prediction=> 0
It's too bad that everyone else involved didn't share Crowe's level of dedication to quality, for if they did, we'd have a far better film on our hands than this sub-par mess.   Prediction=> 0
It always cuts out and makes a beep beep beep sound then says signal failed. Prediction=> 0
Generous portions and great taste. Prediction=> 1
Whatever prompted such a documentary is beyond me!   Prediction=> 0
This phone might well be the worst I've ever had in any brand. Prediction=> 0
Went for lunch - service was slow. Prediction=> 0
Also, the fries are without a doubt the worst fries I've ever had. Prediction=> 0
The refried beans that came with my meal were dried out and crusty and the food was bland. Prediction=> 0
The only possible way this movie could be redeemed would be as M

In [38]:
# Accuracy
print("Train Accuracy: ",pipe.score(X_train,y_train))
print("Test Accuracy: ",pipe.score(X_test,y_test))

Train Accuracy:  0.9872611464968153
Test Accuracy:  0.7509090909090909
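
The gap between train and test accuracy suggests the model fits the training vocabulary much more closely than it generalises. Accuracy alone does not show which class the errors fall on; a confusion matrix does. A sketch using scikit-learn's `confusion_matrix` and `classification_report` on made-up labels (the `y_true` and `y_pred` values are illustrative, not results from this notebook):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for illustration: 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes (order: 0, 1)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
```

On the real data, the equivalent call would be `confusion_matrix(y_test, sample_prediction)`.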


In [39]:
# Another random review
pipe.predict(["This was a great movie"])

array([1], dtype=int64)

In [40]:
example = ["I do enjoy my job",
 "What a poor product!,I will have to get a new one",
 "I feel amazing!"]
       

In [41]:
pipe.predict(example)

array([1, 0, 1], dtype=int64)

In [42]:
#### Using Tfidf

In [43]:
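# Create the pipeline to clean, tokenize, vectorize, and classify
# A fresh LinearSVC is used so that fitting this pipeline does not refit the classifier inside `pipe`
pipe_tfid = Pipeline([("cleaner", predictors()),
                 ('vectorizer', tfvectorizer),
                 ('classifier', LinearSVC())])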

In [44]:
pipe_tfid.fit(X_train,y_train)

Pipeline(steps=[('cleaner', <__main__.predictors object at 0x000001DDC84238B0>),
                ('vectorizer',
                 TfidfVectorizer(tokenizer=<function spacy_tokenizer at 0x000001DDC00D0E50>)),
                ('classifier', LinearSVC())])

In [45]:
sample_prediction1 = pipe_tfid.predict(X_test)

In [46]:
for (sample,pred) in zip(X_test,sample_prediction1):
    print(sample,"Prediction=>", pred)

Great product. Prediction=> 1
This product is very High quality Chinese CRAP!!!!!! Prediction=> 0
Let's start with all the problemsthe acting, especially from the lead professor, was very, very bad.   Prediction=> 0
It's too bad that everyone else involved didn't share Crowe's level of dedication to quality, for if they did, we'd have a far better film on our hands than this sub-par mess.   Prediction=> 0
It always cuts out and makes a beep beep beep sound then says signal failed. Prediction=> 0
Generous portions and great taste. Prediction=> 1
Whatever prompted such a documentary is beyond me!   Prediction=> 0
This phone might well be the worst I've ever had in any brand. Prediction=> 0
Went for lunch - service was slow. Prediction=> 0
Also, the fries are without a doubt the worst fries I've ever had. Prediction=> 0
The refried beans that came with my meal were dried out and crusty and the food was bland. Prediction=> 0
The only possible way this movie could be redeemed would be as M

In [47]:
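print("Test Accuracy: ",pipe_tfid.score(X_test,y_test))
# Note: scoring the predictions against themselves always yields 1.0; it is not a measure of model quality
print("Self-consistency check: ",pipe_tfid.score(X_test,sample_prediction1))

Test Accuracy:  0.7763636363636364
Self-consistency check:  1.0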


In [48]:
### Jesse JCharis
### J-Secur1ty
### Jesus Saves @ JCharisTech