In [10]:
import spacy
nlp = spacy.load("en_core_web_lg")
with open("/Users/alp/Desktop/Spacey/Gervain.txt", "r") as f:
    text = f.read()
doc = nlp(text)

In [3]:
print(doc)

The emergence of language has intrigued scientists and the general public alike, but it was only in the second half of the twentieth century that a systematic empirical investigation of language acquisition began. This work was greatly inspired by the suggestion that the environment is mainly a trigger rather than a tutor for language acquisition, at least during the first years of life (Chomsky 1959). Consequently, to explain the uniquely human capacity of language, scholars proposed innate acquisition mechanisms, specific to language (Chomsky 1959). A few years later, research into the biological foundations of language was expanded, giving a better grasp of the innate dispositions
for language acquisition (Lenneberg 1967). 
By contrast, other researchers suggested that classical learning mechanisms, ones that humans share with other animals, may be sufficient to acquire language (Elman 1996, Tomasello 2000). Under this view, the human specificity of language arises from quantitative

A list of all the sentences in the corpus

In [4]:
# List every sentence in the document together with its index
for i, sent in enumerate(doc.sents):
    print(i, sent)


0 The emergence of language has intrigued scientists and the general public alike, but it was only in the second half of the twentieth century that a systematic empirical investigation of language acquisition began.
1 This work was greatly inspired by the suggestion that the environment is mainly a trigger rather than a tutor for language acquisition, at least during the first years of life (Chomsky 1959).
2 Consequently, to explain the uniquely human capacity of language, scholars proposed innate acquisition mechanisms, specific to language (Chomsky 1959).
3 A few years later, research into the biological foundations of language was expanded, giving a better grasp of the innate dispositions
for language acquisition (Lenneberg 1967). 

4 By contrast, other researchers suggested that classical learning mechanisms, ones that humans share with other animals, may be sufficient to acquire language (Elman 1996, Tomasello 2000).
5 Under this view, the human specificity of language arises from

The code below measures the similarity between every pair of sentences

In [5]:
for i, sent1 in enumerate(doc.sents):
    for j, sent2 in enumerate(doc.sents):
        if i == j:
            continue
        print(i, j, sent1.similarity(sent2))
# Average similarity across all sentence pairs in the document
total = 0
comparisons = 0
for i, sent1 in enumerate(doc.sents):
    for j, sent2 in enumerate(doc.sents):
        if i == j:
            continue
        total += sent1.similarity(sent2)
        comparisons += 1
print(total/comparisons)

0 1 0.9297234416007996
0 2 0.8441457748413086
0 3 0.9266768097877502
0 4 0.7899712324142456
0 5 0.925125002861023
0 6 0.8560572862625122
0 7 0.9392214417457581
0 8 0.8979479670524597
0 9 0.8055269718170166
0 10 0.7147533297538757
0 11 0.9197076559066772
0 12 0.8509125709533691
0 13 0.904289960861206
0 14 0.9377974271774292
0 15 0.9032106399536133
0 16 0.9278295636177063
0 17 0.9142584204673767
1 0 0.9297234416007996
1 2 0.8482668399810791
1 3 0.934273898601532
1 4 0.8224436640739441
1 5 0.8933923840522766
1 6 0.8096685409545898
1 7 0.8789430856704712
1 8 0.8916823863983154
1 9 0.7799533009529114
1 10 0.6866056323051453
1 11 0.8543907999992371
1 12 0.7966938018798828
1 13 0.846263587474823
1 14 0.8611082434654236
1 15 0.8446451425552368
1 16 0.8605772852897644
1 17 0.8662689924240112
2 0 0.8441457748413086
2 1 0.8482668399810791
2 3 0.8904719352722168
2 4 0.9118756055831909
2 5 0.8857656717300415
2 6 0.8349946737289429
2 7 0.8378738164901733
2 8 0.8772842884063721
2 9 0.7767167687416077
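
The nested loops visit every ordered pair, so each score is computed twice (the output lists `0 1` and `1 0` with the same value). Because the score is symmetric, averaging over unordered pairs gives the same mean with half the calls. The sketch below uses `itertools.combinations`; `mean_pairwise` is a hypothetical helper, and `overlap` is a toy stand-in for `sent1.similarity(sent2)` so the example runs without a language model.

```python
from itertools import combinations

def mean_pairwise(items, similarity):
    """Average a symmetric similarity over all unordered pairs of items."""
    pairs = list(combinations(items, 2))
    if not pairs:
        raise ValueError("need at least two items to compare")
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

def overlap(a, b):
    """Toy stand-in for Span.similarity: Jaccard overlap of the word sets."""
    set_a, set_b = set(a.split()), set(b.split())
    return len(set_a & set_b) / len(set_a | set_b)

# Made-up mini corpus, not taken from the document.
sentences = [
    "language acquisition began",
    "language acquisition mechanisms",
    "other animals",
]
average = mean_pairwise(sentences, overlap)
print(average)
```

With spaCy, passing `list(doc.sents)` and `lambda a, b: a.similarity(b)` should reproduce the average printed by the loop, assuming the similarity is symmetric as the output suggests.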

How to render the dependency parse of a sentence

In [6]:
from spacy import displacy
#displacy.serve(doc, style='dep', port=5002)
# In a Jupyter notebook, .render works much better than .serve
sentence_lists = list(doc.sents)
displacy.render(sentence_lists[9], style='dep')


The code below builds a list of all sentences

In [7]:
sentence_lists = list(doc.sents)

A way to render long docs

In [None]:

#displacy.render(sentence_lists, style='dep')
displacy.serve(sentence_lists, style='dep', port=5002)

Exporting results as visuals

In [8]:
from pathlib import Path
sentence1 = sentence_lists[9]
svg = displacy.render(sentence1, style='dep', jupyter=False)
filename = 'sentence1.svg'
# Join with "/" so a path separator is placed between the folder and the file name
output_path = Path('/Users/alp/Desktop/Spacey/images') / filename
output_path.open('w', encoding='utf-8').write(svg)

10567
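
When building an output path, gluing a folder string and a file name together with `+` produces the wrong file name if the folder string has no trailing slash. A minimal sketch of a safer pattern with `pathlib` follows; `save_svg` is a hypothetical helper, and the SVG string is a placeholder for the markup returned by `displacy.render(..., jupyter=False)`.

```python
import tempfile
from pathlib import Path

def save_svg(svg, directory, filename):
    """Write SVG markup to directory/filename and return the full path."""
    output_dir = Path(directory)
    output_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    output_path = output_dir / filename            # "/" inserts the separator
    output_path.write_text(svg, encoding="utf-8")
    return output_path

# Placeholder markup; in the notebook this would come from displacy.render.
svg = "<svg xmlns='http://www.w3.org/2000/svg'></svg>"

# A temporary folder keeps the example from touching real files.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_svg(svg, Path(tmp) / "images", "sentence1.svg")
    saved_name, parent_name = saved.name, saved.parent.name
    round_trip = saved.read_text(encoding="utf-8")

print(parent_name, saved_name)
```

`Path.write_text` opens, writes, and closes the file in one call, so no file handle is left open.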

List of all tokens in a sentence

In [9]:
sentence1 = sentence_lists[9]
for token in sentence1:
    print(token)

Our
review
discusses
landmarks
in
language
acquisition
as
well
as
their
biological
underpinnings
.


“Tokenization simply means splitting the sentence into its tokens. A token is a unit of semantics. You can think of a token as the smallest meaningful part of a piece of text. Tokens can be words, numbers, punctuation, currency symbols, and any other meaningful symbols that are the building blocks of a sentence.”
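
To make the idea concrete, here is a deliberately simplified tokenizer built on a regular expression. It is not how spaCy tokenizes (spaCy applies language-specific rules and exceptions); `simple_tokenize` is a hypothetical helper that only separates runs of word characters from individual punctuation marks.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = (
    "Our review discusses landmarks in language acquisition "
    "as well as their biological underpinnings."
)
tokens = simple_tokenize(sentence)
print(tokens)
print(len(tokens))
```

For this sentence the sketch yields the same 14 tokens as the spaCy output above. The two approaches diverge on harder input such as contractions or abbreviations, where a rule-based tokenizer is needed.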

Part-of-speech tags matched with tokens
[POS link](https://universaldependencies.org/u/pos/)

In [13]:
sentence1 = sentence_lists[9]
print(sentence1)
# Create a list of the tokens in sentence1
tokens = [token.text for token in sentence1]
print(tokens)
# Create a list of the part-of-speech tags in sentence1
pos_tags = [token.pos_ for token in sentence1]
# Pair each token with its tag
zipped = zip(tokens, pos_tags)
print(list(zipped))

Our review discusses landmarks in language acquisition as well as their biological underpinnings.
['Our', 'review', 'discusses', 'landmarks', 'in', 'language', 'acquisition', 'as', 'well', 'as', 'their', 'biological', 'underpinnings', '.']
[('Our', 'PRON'), ('review', 'NOUN'), ('discusses', 'VERB'), ('landmarks', 'NOUN'), ('in', 'ADP'), ('language', 'NOUN'), ('acquisition', 'NOUN'), ('as', 'ADV'), ('well', 'ADV'), ('as', 'ADP'), ('their', 'PRON'), ('biological', 'ADJ'), ('underpinnings', 'NOUN'), ('.', 'PUNCT')]
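
Once tokens are paired with their tags, grouping them by tag makes it easy to see, for example, all the nouns at once. The sketch below works on the pairs printed above, copied in as plain tuples so it runs without spaCy; `tokens_by_pos` is a name chosen for this illustration.

```python
from collections import defaultdict

# (token, tag) pairs copied from the printed output.
tagged = [
    ("Our", "PRON"), ("review", "NOUN"), ("discusses", "VERB"),
    ("landmarks", "NOUN"), ("in", "ADP"), ("language", "NOUN"),
    ("acquisition", "NOUN"), ("as", "ADV"), ("well", "ADV"),
    ("as", "ADP"), ("their", "PRON"), ("biological", "ADJ"),
    ("underpinnings", "NOUN"), (".", "PUNCT"),
]

# Collect the tokens under each part-of-speech tag, keeping sentence order.
tokens_by_pos = defaultdict(list)
for text, pos in tagged:
    tokens_by_pos[pos].append(text)

for pos, words in tokens_by_pos.items():
    print(pos, words)
```

Note that the two occurrences of "as" land in different groups (ADV and ADP), because the tag depends on the token's role in the sentence rather than on its spelling.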


Named Entity Recognition

In [14]:
for ent in doc.ents:
    print(ent.text, ent.label_)

the second half of the twentieth century DATE
the first years DATE
Chomsky PERSON
1959 DATE
Chomsky PERSON
1959 DATE
A few years later DATE
Lenneberg 1967 ORG
Elman PERSON
Tomasello 2000 PERSON
the first year DATE
the next decades DATE
first ORDINAL
the past decades DATE


Alternatively, the entities can be visualized inline

In [15]:
displacy.render(doc, style='ent')
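
The printed entity list can also be summarized by label. The sketch below copies the (text, label) pairs from the output as plain tuples so it runs without a model; with spaCy the same list could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter

# (text, label) pairs copied from the printed entity list.
entities = [
    ("the second half of the twentieth century", "DATE"),
    ("the first years", "DATE"),
    ("Chomsky", "PERSON"),
    ("1959", "DATE"),
    ("Chomsky", "PERSON"),
    ("1959", "DATE"),
    ("A few years later", "DATE"),
    ("Lenneberg 1967", "ORG"),
    ("Elman", "PERSON"),
    ("Tomasello 2000", "PERSON"),
    ("the first year", "DATE"),
    ("the next decades", "DATE"),
    ("first", "ORDINAL"),
    ("the past decades", "DATE"),
]

# How many entities received each label.
label_counts = Counter(label for _, label in entities)
print(label_counts.most_common())

# Unique entity texts labeled as people, in alphabetical order.
people = sorted({text for text, label in entities if label == "PERSON"})
print(people)
```

A summary like this makes labeling slips easier to spot: in the output, the citation "Lenneberg 1967" was tagged ORG, and "Tomasello 2000" kept the year inside a PERSON entity.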

Noun chunks with their label and root token

In [16]:
for obj in doc.noun_chunks:
    print(obj.text, obj.label_, obj.root.text)
    

The emergence NP emergence
language NP language
scientists NP scientists
the general public NP public
it NP it
the second half NP half
the twentieth century NP century
a systematic empirical investigation NP investigation
language acquisition NP acquisition
This work NP work
the suggestion NP suggestion
the environment NP environment
a trigger NP trigger
a tutor NP tutor
language acquisition NP acquisition
the first years NP years
life NP life
Chomsky NP Chomsky
the uniquely human capacity NP capacity
language NP language
scholars NP scholars
innate acquisition mechanisms NP mechanisms
language NP language
Chomsky NP Chomsky
research NP research
the biological foundations NP foundations
language NP language
a better grasp NP grasp
the innate dispositions NP dispositions
language acquisition NP acquisition
Lenneberg NP Lenneberg
contrast NP contrast
other researchers NP researchers
classical learning mechanisms NP mechanisms
ones NP ones
that NP that
humans NP humans
other animals NP an

A similarity matrix for the tokens in a sentence

In [21]:
sentence1 = sentence_lists[9]
for token1 in sentence1:
    for token2 in sentence1:
        if token1 == token2:
            continue
        print(token1.text, "->", token2.text, "->", token1.similarity(token2))

Our -> review -> 0.1353009194135666
Our -> discusses -> 0.09265071153640747
Our -> landmarks -> -0.0053463284857571125
Our -> in -> 0.030499719083309174
Our -> language -> 0.02306048572063446
Our -> acquisition -> 0.12261588126420975
Our -> as -> 0.06830637156963348
Our -> well -> 0.08253715187311172
Our -> as -> 0.06830637156963348
Our -> their -> 0.31158027052879333
Our -> biological -> 0.11669295281171799
Our -> underpinnings -> 0.16561120748519897
Our -> . -> 0.14620058238506317
review -> Our -> 0.1353009194135666
review -> discusses -> 0.4760911166667938
review -> landmarks -> 0.16541312634944916
review -> in -> 0.03355959430336952
review -> language -> 0.26037272810935974
review -> acquisition -> 0.43741652369499207
review -> as -> 0.05098135396838188
review -> well -> 0.1828746646642685
review -> as -> 0.05098135396838188
review -> their -> 0.10308544337749481
review -> biological -> 0.26503872871398926
review -> underpinnings -> 0.3863060176372528
review -> . -> 0.0980100035667
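
spaCy's default token similarity is the cosine similarity of the two word vectors, which explains why the scores are symmetric and can be slightly negative. The sketch below computes cosine similarity from scratch; the three-dimensional vectors are made up for illustration (real `en_core_web_lg` vectors are much longer), and returning 0.0 for a zero vector is a choice made for this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # sketch-level convention for vectors with no direction
    return dot / (norm_u * norm_v)

# Made-up toy vectors, not real spaCy embeddings.
vectors = {
    "review": [1.0, 2.0, 0.0],
    "discusses": [2.0, 1.0, 0.0],
    "in": [0.0, 0.0, 3.0],
}

# Print the similarity for every ordered pair of distinct words.
for word1, vec1 in vectors.items():
    for word2, vec2 in vectors.items():
        if word1 == word2:
            continue
        print(word1, "->", word2, "->", cosine_similarity(vec1, vec2))

review_discusses = cosine_similarity(vectors["review"], vectors["discusses"])
review_in = cosine_similarity(vectors["review"], vectors["in"])
```

Vectors pointing in similar directions score close to 1, while orthogonal vectors score 0, regardless of their lengths.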