<a href="https://colab.research.google.com/github/deepika5419/Text-Classification-in-Python-Using-spaCy/blob/master/Text_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
# Word tokenization
from spacy.lang.en import English

In [0]:
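# Create a blank English pipeline. English() includes only the tokenizer;
# the tagger, parser, NER, and word vectors require a trained model such as en_core_web_sm.
nlp = English()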

In [0]:
text="""When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""
print(text)

When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!


In [0]:
my_doc=nlp(text)

In [0]:
# Create a list of word tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)
print(token_list)

['When', 'learning', 'data', 'science', ',', 'you', 'should', "n't", 'get', 'discouraged', '!', '\n', 'Challenges', 'and', 'setbacks', 'are', "n't", 'failures', ',', 'they', "'re", 'just', 'part', 'of', 'the', 'journey', '.', 'You', "'ve", 'got', 'this', '!']


In [0]:
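# Sentence tokenization
nlp = English()
# Create the 'sentencizer' pipeline component (spaCy 2.x API; in spaCy 3.x,
# nlp.add_pipe takes the component name as a string instead)
sbd = nlp.create_pipe('sentencizer')
# Add the component to the pipeline
nlp.add_pipe(sbd)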

In [0]:
text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

# The "nlp" object is used to create documents with linguistic annotations.
doc = nlp(text)

In [0]:
# Create a list of sentences
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
print(sents_list)

["When learning data science, you shouldn't get discouraged!", "\nChallenges and setbacks aren't failures, they're just part of the journey.", "You've got this!"]
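The cells above use the spaCy 2.x calls `create_pipe` and `add_pipe(component)`. A minimal sketch of the same sentence split under spaCy 3.x follows, assuming that version is installed; the names `nlp_v3` and `sample` are made up for this illustration.

```python
# Sketch for spaCy 3.x (an assumption; the cells above use the 2.x API).
from spacy.lang.en import English

nlp_v3 = English()
# In spaCy 3.x, add_pipe takes the registered component name as a string.
nlp_v3.add_pipe("sentencizer")

# Shortened sample text, invented for this sketch.
sample = "You shouldn't get discouraged! Challenges aren't failures. You've got this!"
sentences = [sent.text for sent in nlp_v3(sample).sents]
print(sentences)
```

The sentencizer splits on punctuation rules only, so it needs no trained model.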


In [0]:
# Cleaning Text Data: Removing Stopwords
# Most text data contains many words that carry little information for analysis. These words, called stopwords, are useful in human speech but contribute little to data analysis.
# Removing stopwords eliminates noise from our text data and also speeds up analysis, since there are fewer words to process.

In [0]:
import spacy
spacy_stopwords=spacy.lang.en.stop_words.STOP_WORDS

In [0]:
# Print the total number of stop words
print("Number of stop words: %d" % len(spacy_stopwords))

Number of stop words:326


In [0]:
# Print the first twenty stop words (STOP_WORDS is a set, so the order is arbitrary):
print('First twenty stop words: %s' % list(spacy_stopwords)[:20])

First twenty stop words: ['wherever', 'they', 'sixty', 'hers', 'neither', 'get', 'but', 'formerly', 'none', 'everything', 'seem', 'us', 'yet', 'from', 'will', 'everywhere', 'three', 'into', 'amongst', 'after']


In [0]:
# Removing stopwords from our data
from spacy.lang.en.stop_words import STOP_WORDS
# List that will hold the tokens that are not stop words
filtered_sent = []
# The "nlp" object is used to create documents with linguistic annotations.
doc = nlp(text)


In [0]:
# Keep every token that spaCy does not flag as a stop word
for word in doc:
    if not word.is_stop:
        filtered_sent.append(word)
print("Filtered Sentence:", filtered_sent)

Filtered Sentence: [learning, data, science, ,, discouraged, !, 
, Challenges, setbacks, failures, ,, journey, ., got, !]


In [0]:
# Lexicon Normalization
# Lexicon normalization is another step in cleaning text data. Broadly, normalization collapses the different surface forms of a word into a single form, which reduces the number of distinct features a machine learning model has to handle. Here we look only at lemmatization, which reduces each word to its dictionary form (its lemma).

In [0]:
# Implementing lemmatization
lem = nlp("run runs running runner")
# Print each word next to its lemma
for word in lem:
    print(word.text, word.lemma_)

run run
runs run
running run
runner runner


In [0]:
# Part of Speech (POS) Tagging
# A word's part of speech defines its function within a sentence. A noun, for example, identifies an object. An adjective describes an object. A verb describes an action. Identifying and tagging each word's part of speech in the context of a sentence is called part-of-speech tagging, or POS tagging.
# We need to import the en_core_web_sm model because it contains the vocabulary and grammatical information required for this analysis.
# Then we load the model with .load() and loop through the docs variable, reading each word's part of speech from .pos_.
# The u prefix in u"All is well that ends well." marks a Unicode string; in Python 3 every str is Unicode, so the prefix is optional.

In [0]:
# Import the English model en_core_web_sm for vocabulary, syntax, and entities
import en_core_web_sm
# Load the model
nlp = en_core_web_sm.load()
# The "nlp" object is used to create documents with linguistic annotations.
docs = nlp(u"All is well that ends well.")
# Print each word with its coarse-grained part-of-speech tag
for word in docs:
    print(word.text, word.pos_)

All DET
is VERB
well ADV
that DET
ends VERB
well ADV
. PUNCT


In [0]:
# Entity Detection
# Entity detection, also called entity recognition, is a more advanced form of language processing that identifies important elements such as places, people, organizations, and languages within an input string of text.
# This is helpful for extracting information from text, since you can quickly pick out important topics or identify key sections.

In [0]:
# Import displacy from spaCy to visualize the detected entities:
from spacy import displacy

nytimes= nlp(u"""New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases.

At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday.

The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.""")


In [0]:
# For each entity: the span, its readable label (label_), and the label's integer ID (label)
entities = [(i, i.label_, i.label) for i in nytimes.ents]
entities

[(New York City, 'GPE', 384),
 (Tuesday, 'DATE', 391),
 (At least 285, 'CARDINAL', 397),
 (September, 'DATE', 391),
 (Brooklyn, 'GPE', 384),
 (Williamsburg, 'GPE', 384),
 (four, 'CARDINAL', 397),
 (Bill de Blasio, 'PERSON', 380),
 (Tuesday, 'DATE', 391),
 (Orthodox Jews, 'NORP', 381),
 (6 months old, 'DATE', 391),
 (up to $1,000, 'MONEY', 394)]

In [0]:
# Using displaCy we can also visualize our input text, with each identified entity highlighted by color and labeled. We use style="ent" to tell displaCy that we want to visualize entities.

In [0]:
displacy.render(nytimes,style = "ent",jupyter = True)

In [0]:
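# Dependency Parsing
# Dependency parsing is a language processing technique that helps determine the meaning of a sentence by analyzing how it is constructed, that is, how the individual words relate to each other.

docp = nlp(" In pursuit of a wall, President Trump ran into one.")
# For each noun chunk, print its text, its root word, the root's dependency label, and the root's head.
# chunk.root.dep is the integer ID of the dependency label; chunk.root.dep_ gives the readable string.
for chunk in docp.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep, chunk.root.head.text)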

pursuit pursuit 439 In
a wall wall 439 of
President Trump Trump 429 ran


In [0]:
displacy.render(docp,style="dep",jupyter=True)

In [0]:
# Word Vector Representation
# When we look at words alone, it is difficult for a machine to understand connections that a human would understand immediately. "Engine" and "car", for example, have an obvious connection for us (cars run using engines), but that link is not obvious to a computer.
# A word vector is a numeric representation of a word that communicates its relationship to other words.
# Each word is represented as a lengthy array of numbers.
# Note: en_core_web_sm ships without static word vectors, so .vector here is derived from the model's
# context-sensitive tensors (96 dimensions, as the output shows); the larger md/lg models include true word vectors.

import en_core_web_sm
nlp = en_core_web_sm.load()
mango = nlp(u'mango')
print(mango.vector.shape)
print(mango.vector)



(96,)
[ 1.0466383  -1.5323697  -0.72177905 -2.4700649  -0.2715162   1.1589639
  1.7113379  -0.31615403 -2.0978343   1.837553    1.4681302   2.728043
 -2.3457408  -5.17184    -4.6110015  -0.21236466 -0.3029521   4.220028
 -0.6813917   2.4016762  -1.9546705  -0.85086954  1.2456163   1.5107994
  0.4684736   3.1612053   0.15542296  2.0598564   3.780035    4.6110964
  0.6375268  -1.078107   -0.96647096 -1.3939928  -0.56914186  0.51434743
  2.3150034  -0.93199825 -2.7970662  -0.8540115  -3.4250052   4.2857723
  2.5058174  -2.2150877   0.7860181   3.496335   -0.62606215 -2.0213525
 -4.47421     1.6821622  -6.0789204   0.22800982 -0.36950028 -4.5340714
 -1.7978683  -2.080299    4.125556    3.1852438  -3.286446    1.0892276
  1.017115    1.2736416  -0.10613725  3.5102775   1.1902348   0.05483437
 -0.06298041  0.8280688   0.05514218  0.94817173 -0.49377063  1.1512338
 -0.81374085 -1.6104267   1.8233354  -2.278403   -2.1321895   0.3029334
 -1.4510616  -1.0584296  -3.5698352  -0.13046083 -0.266833
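To make "relationship to other words" concrete: vectors are usually compared with cosine similarity. The sketch below uses NumPy and three toy 3-dimensional vectors invented for illustration (they are not spaCy output); `cosine_similarity` is a helper defined here, not a spaCy function.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors, invented for illustration only.
car = np.array([0.9, 0.8, 0.1])
engine = np.array([0.8, 0.9, 0.2])
mango_toy = np.array([0.1, 0.2, 0.9])

# Related words should point in similar directions, unrelated ones should not.
print("car vs engine:", cosine_similarity(car, engine))
print("car vs mango: ", cosine_similarity(car, mango_toy))
```

With real vectors from a model that includes them, the same comparison is available through spaCy's `similarity` methods.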

In [0]:
# Text Classification
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.base import TransformerMixin


In [0]:
from google.colab import files
uploaded = files.upload()

Saving amazon_alexa.tsv to amazon_alexa.tsv


In [0]:
# Load the TSV file
df_amazon = pd.read_csv("amazon_alexa.tsv", sep="\t")

In [0]:
# Top 5 records
df_amazon.head()


Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1
4,5,31-Jul-18,Charcoal Fabric,Music,1


In [0]:
# shape of dataframe
df_amazon.shape

(3150, 5)

In [0]:
#view information
df_amazon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
rating              3150 non-null int64
date                3150 non-null object
variation           3150 non-null object
verified_reviews    3150 non-null object
feedback            3150 non-null int64
dtypes: int64(2), object(3)
memory usage: 123.2+ KB


In [0]:
#Feedback Value Counts
df_amazon.feedback.value_counts()

1    2893
0     257
Name: feedback, dtype: int64
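The counts above show a strongly imbalanced dataset, which matters when reading accuracy later. As a quick sanity check, this sketch computes the accuracy of a trivial model that always predicts the majority class, using only the two counts printed above.

```python
# Class counts copied from the value_counts() output above.
positive, negative = 2893, 257
total = positive + negative

# Accuracy of a trivial model that always predicts the majority class (1).
baseline_accuracy = positive / total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any classifier trained on this data should be judged against this baseline of roughly 0.918, not against 0.5.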

In [0]:
# Tokenizing the Data With spaCy
import spacy
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline

In [0]:
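# Create our string of punctuation marks
punctuations = string.punctuation

In [0]:
# Load the small English model ('en' was a spaCy 2.x shortcut, typically linked to en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
# Create our set of stop words
stop_words = spacy.lang.en.stop_words.STOP_WORDS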

In [0]:
# Create a blank English pipeline; English() provides the tokenizer only
parser = English()

In [0]:
# Creating our tokenizer function
def spacy_tokenizer(sentence):
    # Parse the sentence into a Doc, a sequence of tokens with linguistic annotations
    mytokens = parser(sentence)
    # Lemmatize and lowercase each token. "-PRON-" is the spaCy 2.x placeholder lemma
    # for pronouns, so pronouns keep their lowercased text instead.
    # Note: in spaCy 3.x a blank English() pipeline has no lemmatizer, so lemma_ is
    # empty unless a lemmatizer component is added.
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens]
    # Remove stop words and punctuation
    mytokens = [word for word in mytokens if word not in stop_words and word not in punctuations]
    # Return the preprocessed list of tokens
    return mytokens

In [0]:
# Defining a Custom Transformer
# Custom transformer that cleans raw text before vectorization
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Clean each text in the input
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn, so fitting is a no-op
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    # Strip leading and trailing whitespace and convert the text to lowercase
    return text.strip().lower()
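The `predictors` class defines `get_params` by hand so that scikit-learn can inspect it. One common alternative, sketched here as an assumption rather than the notebook's method, is to also inherit from `BaseEstimator`, which supplies `get_params` and `set_params` automatically. The class name `TextCleaner` is hypothetical.

```python
from sklearn.base import BaseEstimator, TransformerMixin

class TextCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical alternative to `predictors`: strips and lowercases text."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained in a Pipeline.
        return self

    def transform(self, X):
        # Same cleaning as clean_text: trim whitespace, then lowercase.
        return [text.strip().lower() for text in X]

cleaned = TextCleaner().fit_transform(["  Love my Echo!  ", "Loved it!"])
print(cleaned)
```

Because the class takes no constructor arguments, the inherited `get_params()` returns an empty dict, matching the hand-written version.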




In [0]:
# Vectorization and Feature Engineering (bag of words and TF-IDF)
# bow_vector counts how often each token occurs; tfidf_vector additionally down-weights
# tokens that occur in many documents. The pipeline below uses bow_vector.
bow_vector = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1, 1))
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)
# Splitting the Data into Training and Test Sets
from sklearn.model_selection import train_test_split
X = df_amazon['verified_reviews']  # the features we want to analyze
ylabels = df_amazon['feedback']  # the labels, or answers, we want to test against
X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)
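To show how the two vectorizers differ, here is a small sketch on a toy corpus invented for illustration (not taken from the Alexa reviews). It uses scikit-learn's default tokenization rather than `spacy_tokenizer`, so that it runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus invented for illustration; "the" appears in every document.
corpus = ["love the echo", "love the sound", "the echo stopped working"]

bow = CountVectorizer()
counts = bow.fit_transform(corpus)      # raw token counts per document
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)   # counts reweighted by inverse document frequency

first_counts = counts[0].toarray().ravel()
first_weights = weights[0].toarray().ravel()
the_idx = tfidf.vocabulary_["the"]
love_idx = tfidf.vocabulary_["love"]

# Both words occur once in the first document, but TF-IDF ranks "the" lower.
print("counts: ", first_counts[the_idx], first_counts[love_idx])
print("weights:", first_weights[the_idx], first_weights[love_idx])
```

Swapping `bow_vector` for `tfidf_vector` in a pipeline would apply the same reweighting to the review data.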



In [0]:
# Creating a Pipeline and Generating the Model
# Logistic regression classifier
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
# Create a pipeline: clean the text, vectorize it as a bag of words, then classify
pipe = Pipeline([("cleaner", predictors()),
                 ("vectorizer", bow_vector),
                 ("classifiers", classifier)])
# Fit the model on the training data
pipe.fit(X_train, y_train)





Pipeline(memory=None,
         steps=[('cleaner', <__main__.predictors object at 0x7f1c32b93940>),
                ('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 t...\b\\w\\w+\\b',
                                 tokenizer=<function spacy_tokenizer at 0x7f1c39bbba60>,
                                 vocabulary=None)),
                ('classifiers',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
            

In [0]:
# Evaluating the Model
from sklearn import metrics
# Predict labels for the test set
predicted = pipe.predict(X_test)
# Model accuracy, precision, and recall (precision and recall are for the positive class, 1)
print("Logistic Regression Accuracy:", metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:", metrics.precision_score(y_test, predicted))
print("Logistic Regression Recall:", metrics.recall_score(y_test, predicted))



Logistic Regression Accuracy: 0.9365079365079365
Logistic Regression Precision: 0.9389978213507625
Logistic Regression Recall: 0.9953810623556582
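The scores above describe the positive class only. With imbalanced labels, it can help to also look at the confusion matrix and at recall for the negative class. The sketch below uses toy labels invented for illustration, not the notebook's predictions.

```python
from sklearn import metrics

# Toy labels invented for illustration: 8 positive and 2 negative reviews.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# A model that predicts "positive" for everything except one negative review.
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("Accuracy:", metrics.accuracy_score(y_true, y_pred))
print("Recall for class 1:", metrics.recall_score(y_true, y_pred))
print("Recall for class 0:", metrics.recall_score(y_true, y_pred, pos_label=0))
```

In this toy case accuracy and positive-class recall look strong while half of the negative reviews are missed, which is the pattern to watch for with the Alexa data.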


In [0]:
from google.colab import drive
drive.mount('/content/drive')