In [5]:
import spacy

# Load English tokenizer, tagger, parser and NER (the sm model ships without word vectors)
nlp = spacy.load("en_core_web_sm")

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)


Noun phrases: ['Sebastian Thrun', 'self-driving cars', 'Google', 'few people', 'the company', 'him', 'I', 'you', 'very senior CEOs', 'major American car companies', 'my hand', 'I', 'Thrun', 'an interview', 'Recode']
Verbs: ['start', 'work', 'drive', 'take', 'tell', 'shake', 'turn', 'talk', 'say']
Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun ORG
Recode PRODUCT
earlier this week DATE


In [17]:
spacy.explain("VBG")

'verb, gerund or present participle'

In [4]:
print(doc.ents)

(Sebastian Thrun, Google, 2007, American, Thrun, Recode, earlier this week)


###### Install models (only md and lg include word vectors)

python -m spacy download en_core_web_lg

python -m spacy download en_core_web_md

python -m spacy download en_core_web_sm
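
Each download command above installs the model as an ordinary Python package, so its presence can be checked before calling `spacy.load`. The following is a minimal sketch using only the standard library; the helper name `is_model_installed` is hypothetical and not part of spaCy.

```python
import importlib.util


def is_model_installed(package_name):
    """Return True if a top-level Python package with this name can be imported.

    Hypothetical helper: spaCy models are pip-installed packages, so looking
    the package up by name is enough for a quick availability check.
    """
    return importlib.util.find_spec(package_name) is not None


for name in ("en_core_web_sm", "en_core_web_md", "en_core_web_lg"):
    status = "installed" if is_model_installed(name) else "missing"
    print(name, status)
```

Note that the package names use underscores, which is also the form `spacy.load` expects.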

In [26]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

Apple Apple PROPN NNP nsubj Xxxxx True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP compound X.X. False False
startup startup NOUN NN dobj xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False
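
The `shape_` column in the output above encodes the orthographic pattern of each token. As a rough sketch of the idea (an approximation written for this notebook, not spaCy's own code): letters map to `x` or `X`, digits map to `d`, other characters are kept, and a run of the same shape character is cut off after four. The function name `approx_word_shape` is hypothetical.

```python
def approx_word_shape(text):
    """Approximate a token's orthographic shape, e.g. 'Apple' -> 'Xxxxx'.

    Assumption: runs of the same shape character are truncated after four,
    which is consistent with the output printed above.
    """
    shape = []
    last = ""
    run_length = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept unchanged
        if shape_char == last:
            run_length += 1
        else:
            run_length = 1
            last = shape_char
        if run_length <= 4:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Apple", "is", "looking", "U.K.", "$", "1", "billion"]:
    print(word, approx_word_shape(word))
```

For the tokens listed, this sketch reproduces the shapes shown in the cell output.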


In [3]:
import spacy

nlp = spacy.load("en_core_web_md")
# Use a model with word vectors (md or lg); package names use underscores, not hyphens
tokens1 = nlp("dog cat banana")
tokens2 = nlp("monkey tiger plantain")

for token1 in tokens1:
    for token2 in tokens2:
        print(token1.text, token2.text, token1.similarity(token2))

dog monkey 0.47752646
dog tiger 0.43654656
dog plantain 0.13001473
cat monkey 0.5351813
cat tiger 0.5413389
cat plantain 0.15484892
banana monkey 0.45207787
banana tiger 0.2851668
banana plantain 0.6150555


In [15]:
import spacy
nlp = spacy.load("en_core_web_md")
text1 = ("""
Machine learning (ML) is the scientific study of algorithms and statistical models 
that computer systems use to perform a specific task without using explicit instructions, 
relying on patterns and inference instead. It is seen as a subset of artificial intelligence. 
Machine learning algorithms build a mathematical model based on sample data, known as "training data", 
in order to make predictions or decisions without being explicitly programmed to perform the task.
Machine learning algorithms are used in a wide variety of applications, such as email 
filtering and computer vision, where it is difficult or infeasible to develop a conventional 
algorithm for effectively performing the task.
        """)
doc1 = nlp(text1)

text2 = ("""
Machine learning is closely related to computational statistics, 
which focuses on making predictions using computers. 
The study of mathematical optimization delivers methods, 
theory and application domains to the field of machine learning. 
Data mining is a field of study within machine learning, and 
focuses on exploratory data analysis through unsupervised learning.
In its application across business problems, machine learning is also referred to as predictive analytics.
""")
doc2 = nlp(text2)

text3 = ("""
The name machine learning was coined in 1959 by Arthur Samuel. 
Tom M. Mitchell provided a widely quoted, more formal definition of the algorithms 
studied in the machine learning field: "A computer program is said to learn from experience E 
with respect to some class of tasks T and performance measure P if its performance at tasks in T, 
as measured by P, improves with experience E." This definition of the tasks in which machine 
learning is concerned offers a fundamentally operational definition rather than defining the 
field in cognitive terms. This follows Alan Turing's proposal in his paper "Computing Machinery and Intelligence", 
in which the question "Can machines think?" is replaced with the question 
"Can machines do what we (as thinking entities) can do?". In Turing's proposal the various characteristics 
that could be possessed by a thinking machine and the various implications in constructing one are exposed.
""")
doc3 = nlp(text3)

text4 = """Titanic is a 1997 American epic romance and disaster film directed, written, co-produced, and co-edited by James Cameron. Incorporating both historical and fictionalized aspects, the film is based on accounts of the sinking of the RMS Titanic, and stars Leonardo DiCaprio and Kate Winslet as members of different social classes who fall in love aboard the ship during its ill-fated maiden voyage."""
doc4 = nlp(text4)

text5 = """Cameron's inspiration for the film came from his fascination with shipwrecks; he felt a love story interspersed with the human loss would be essential to convey the emotional impact of the disaster. Production began in 1995, when Cameron shot footage of the actual Titanic wreck. The modern scenes on the research vessel were shot on board the Akademik Mstislav Keldysh, which Cameron had used as a base when filming the wreck. Scale models, computer-generated imagery, and a reconstruction of the Titanic built at Baja Studios were used to re-create the sinking. The film was co-financed by Paramount Pictures and 20th Century Fox; the former handled distribution in North America while the latter released the film internationally. It was the most expensive film ever made at the time, with a production budget of $200 million."""
doc5 = nlp(text5)

# Compare two texts on unrelated topics (machine learning vs. the film Titanic)
print("Similarity:", doc2.similarity(doc5))

Similarity: 0.8676397831908912
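
A score this high for two unrelated texts is less surprising once you recall that, by default, a document vector is the average of its token vectors. Common words shared by both texts can dominate that average. The sketch below illustrates the effect with invented two-dimensional vectors; all names and numbers are assumptions made for the example, not spaCy internals.

```python
import math


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical vectors: one shared "common word" and two opposite topic words.
common_word = [1.0, 0.0]
topic_a = [0.0, 1.0]
topic_b = [0.0, -1.0]

# Each toy document is three common words plus a single topic word.
doc_a = mean_vector([common_word, common_word, common_word, topic_a])
doc_b = mean_vector([common_word, common_word, common_word, topic_b])

print("topic words:", cosine(topic_a, topic_b))
print("documents:  ", round(cosine(doc_a, doc_b), 3))
```

In this toy setup the two topic words are exact opposites (cosine -1), yet the averaged documents score 0.8. One way to make the comparison more topical, under the same assumption, is to average only content words, for example by skipping tokens where `token.is_stop` is true.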
