spaCy is a free, open-source library for natural language processing (NLP) in Python. It is written in Cython and is designed for building information extraction and natural language understanding systems. It is built for production use and provides a concise, user-friendly API.

A core difference between NLTK and spaCy stems from the way in which these libraries were built. NLTK is essentially a string processing library, where each function takes strings as input and returns a processed string. Though this seems like a simple way to use the library, in practice, you’ll often find yourself going back to the documentation to discover new functions.

In contrast, spaCy takes an object-oriented approach. Each function returns objects instead of strings or arrays. This allows for easy exploration of the tool. Developers don’t need to constantly check with documentation to understand context because the object itself provides it.
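
As a minimal sketch of this object-oriented style, the snippet below uses `spacy.blank("en")`, which builds a tokenizer-only English pipeline and therefore needs no model download. The example sentence is made up for illustration; the attributes shown (`token.i`, `token.is_alpha`, `token.is_punct`, `span.doc`) are only a small sample of what the objects expose.

```python
import spacy

# A blank English pipeline contains only the tokenizer (no trained components).
nlp = spacy.blank("en")
doc = nlp("spaCy returns objects, not strings.")

# A Doc is a sequence of Token objects; each token carries its own attributes.
for token in doc:
    print(token.i, token.text, token.is_alpha, token.is_punct)

# Slicing a Doc gives a Span object that still refers back to its parent Doc.
span = doc[0:2]
print(type(doc).__name__, type(span).__name__, span.text)
```

Because every result is an object, attributes can be explored interactively (for example with `dir(token)`) instead of being looked up function by function.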

In [1]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.0.0/en_core_web_lg-3.0.0-py3-none-any.whl (778.8 MB)



In [1]:
import spacy


In [2]:
MyVectorizer = spacy.load('en_core_web_lg')  # load the large English pipeline, which includes word vectors


In [3]:
sentence1 = "I love the book"
sentence2 = "This is a great book ?"
sentence3 = "The fit is great"
sentence4 = "I love the shoes"
sentence5 = "Do you hate driving"
sentence6 = "I love boxing"
sentence7 = "Beggars can't be chooser"
sentence8 = "Early to bed and early to rise, makes a man healthy, wealthy and wise"
sentence9 = "Playing sports makes you stronger"
sentence10 = "Any competitive physical activity providing a sense of enjoyment is Sport"
sentence11 = "play a major part in improving our physical and mental fitness" 
sentence12 = " It helps in developing Self Confidence, Team Spirit, and Mental & Physical toughness.There are two types of sports, Indoor and Outdoor"
s13="Collisions between trains and vehicles standing foul of the line ."
s14="People used it to go to one place to another."
s15="Its is a means of transport.It is a four-wheeled vehicle.It has five comfortable seats.It also has windows and AC.Its used to carry things to drop from one place to another."
s16="She should be looking for a replacement vehicle, but having another car in the garage would only be a reminder that there was no one left to drive it."

train_x = [sentence1,sentence2,sentence3,sentence4,sentence5,sentence6,sentence7,sentence8,sentence9,sentence10,sentence11,sentence12,s13,s14,s15,s16]
train_y = ["books", "books", "clothes", "clothes", "vehicle", "games", "proverb", "proverb", "games", "games", "games", "games", "vehicle", "vehicle", "vehicle", "vehicle"]


# Run each training sentence through the spaCy pipeline to get a Doc object
MyTextVectors = [MyVectorizer(text) for text in train_x]

# Take each Doc's vector (a numeric representation of the sentence) so it can be fed to a classifier
MyNumericalVectors = [x.vector for x in MyTextVectors]


# SVM is a classifier that learns from the labelled vectors and later decides whether a text is about clothes, books, or another category
from sklearn import svm
MySVM = svm.SVC(kernel='linear')
MySVM.fit(MyNumericalVectors, train_y)  # fit the model on the training set


SVC(kernel='linear')
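
The `x.vector` values used above are sentence-level vectors. By default, spaCy computes a Doc's vector as the average of its token vectors. The sketch below illustrates that averaging in plain Python. The 3-dimensional `toy_vectors` table and the `average_vector` helper are hypothetical and exist only for this illustration (the real `en_core_web_lg` vectors have 300 dimensions).

```python
# Toy 3-dimensional word vectors; the values are made up for illustration.
toy_vectors = {
    "i": [0.0, 0.2, 0.1],
    "love": [0.9, 0.1, 0.0],
    "books": [0.3, 0.8, 0.4],
}


def average_vector(tokens, table, dims=3):
    """Average token vectors; in this sketch unknown tokens count as zero vectors."""
    if not tokens:
        return [0.0] * dims
    zero = [0.0] * dims
    vectors = [table.get(token, zero) for token in tokens]
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dims)]


sentence_vector = average_vector("i love books".split(), toy_vectors)
print(sentence_vector)
```

Averaging discards word order, so sentences built from similar words end up with similar vectors. That is enough for a linear SVM to separate coarse topics such as the ones in `train_y`.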

In [4]:
test_x = ["I love the books and the stories"]
MyTextVectors = [MyVectorizer(text) for text in test_x]
MyNumericalVectors = [x.vector for x in MyTextVectors]
MySVM.predict(MyNumericalVectors)

array(['books'], dtype='<U7')

In [5]:
test_x = ["I dont have fuel"]
MyTextVectors = [MyVectorizer(text) for text in test_x]
MyNumericalVectors = [x.vector for x in MyTextVectors]
MySVM.predict(MyNumericalVectors)


array(['vehicle'], dtype='<U7')
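
The two prediction cells above repeat the same three steps: run the text through the pipeline, take the vector, and call `predict`. One way to avoid the repetition is a small helper. The name `predict_topics` and its parameters are hypothetical, and the function only assumes that `nlp` returns objects with a `.vector` attribute and that `classifier` has a `predict` method.

```python
def predict_topics(texts, nlp, classifier):
    """Vectorize each text with a spaCy-style pipeline and classify the vectors."""
    # nlp(text) is expected to return a Doc-like object exposing .vector
    vectors = [nlp(text).vector for text in texts]
    # classifier is expected to follow the scikit-learn predict() convention
    return list(classifier.predict(vectors))
```

With the objects defined earlier, it would be called as `predict_topics(["I dont have fuel"], MyVectorizer, MySVM)`.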

In [None]:
# Tokenization in spaCy

In [8]:
nlp = spacy.load('en_core_web_lg')  # load the large English pipeline

In [9]:
introduction_text = ('This tutorial is about Natural'
                     ' Language Processing in Spacy.')
introduction_doc = nlp(introduction_text)
# Extract tokens for the given doc
print([token.text for token in introduction_doc])

['This', 'tutorial', 'is', 'about', 'Natural', 'Language', 'Processing', 'in', 'Spacy', '.']
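
Tokens carry more than their text. As a small sketch, the snippet below reads each token's character offset (`token.idx`) and uses the `is_punct` flag to drop punctuation. It assumes a tokenizer-only pipeline from `spacy.blank("en")` rather than `en_core_web_lg`, so it runs without a model download.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; no trained components
doc = nlp("This tutorial is about Natural Language Processing in Spacy.")

# token.idx is the character offset of the token within the original text.
offsets = [(token.text, token.idx) for token in doc]

# Lexical flags such as is_punct make it easy to filter tokens.
words = [token.text for token in doc if not token.is_punct]

print(offsets)
print(words)
```

Because offsets are preserved, each token can always be mapped back to its position in the original string.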
