<h1 style="text-align: center;" markdown="1"> Text Preprocessing: Python and Bash </h1>

<img src="images/skyclear.jpg" alt="Drawing" style="width: 700px;"/>

Natural Language Processing (NLP) uses tools, techniques and algorithms to process and understand natural-language data, which is usually unstructured: text, speech and so on.
Computers are great at working with structured data like spreadsheets and database tables. But we humans usually communicate in words, not in tables. That's unfortunate for computers.
A lot of the information in the world is unstructured: raw text in English or another human language. How can we get a computer to understand unstructured text and extract data from it?

The first step is to preprocess and clean the raw text. Which steps to apply depends on the final task.

After preprocessing, we transform the text from human language into a machine-readable format.


<img src="images/nlp_pipeline.jpg" alt="Drawing" style="width: 700px;"/>


In [1]:
import spacy
from spacy import displacy
#import textacy
import string
# If not already installed
#!python3 -m spacy download en_core_web_sm
#!python -m spacy download it_core_news_sm
#!pip3 install -U textacy
spacy.about.__version__

'2.0.18'

In [None]:
!python3 -m spacy download en_core_web_sm
#!python -m spacy download it_core_news_sm

<h2 style="text-align: center;" markdown="1"> Playing with strings in Python</h2>

### Basic Operations with Strings

Strings behave like lists in many ways: they have a length and can be indexed and sliced.

In [2]:
ll = [1,2,3,4,5,'c']
ll[3]

4

In [4]:
ex1 = 'They say that Banksy is Robin Gunningham'
ex2 = 'where\'s the revolution'

print(f'{ex1[2]}    {ex1[6:20]}')
print('How long is our string?', len(ex2))

e    ay that Banksy
How long is our string? 22


In [11]:
print(ex1 + ' ' + 'ciao')

They say that Banksy is Robin Gunningham ciao


In [8]:
print('xxxxxx: ', ex1)

xxxxxx:  They say that Banksy is Robin Gunningham


In [12]:
print('xxxxxx: %s' %(ex1))

xxxxxx: They say that Banksy is Robin Gunningham


Did you know that you can add and multiply strings?

In [13]:
name, surname ='Richard', 'Feynman'
print(name + ' ' + surname)
print(name*3)

Richard Feynman
RichardRichardRichard


### Title

In [14]:
title = 'lA viTa è beLla'
title.title()

'La Vita È Bella'

### Find Something

If you are lost...

In [15]:
findit = 'where is my mind'
print(findit.find('mind'))
print(findit.index('mind'))
print(findit.find('kiwi'))

12
12
-1
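
`find` and `index` return the same position when the substring is present, but they differ when it is missing. A minimal sketch of that difference, reusing the `findit` string from the cell above (the `position`, `found` and `contains_mind` names are introduced here for illustration):

```python
findit = 'where is my mind'

# find() signals "not found" by returning -1 ...
position = findit.find('kiwi')
print(position)

# ... while index() raises ValueError, so it needs a try/except.
try:
    findit.index('kiwi')
    found = True
except ValueError:
    found = False
print(found)

# When only a yes/no answer is needed, the `in` operator is the simplest test.
contains_mind = 'mind' in findit
print(contains_mind)
```

Use `find` when a missing substring is an expected case, and `index` when it should be treated as an error.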


In [6]:
findit.endswith('mind'), findit.startswith('mind')

(True, False)

### Replace

In [16]:
dedica = 'Cara Marina, sei l\'unica al mondo'
print(dedica + '\n')
dedica = dedica.replace('Marina', 'Giulia')
print(dedica + '\n')

Cara Marina, sei l'unica al mondo

Cara Giulia, sei l'unica al mondo



### Partition

In [21]:
part = 'Take away it'
part.partition('it')

('Take away ', 'it', '')

### Format String

In [22]:
pi = 3.14159
method1 = "The value of pi is " + str(pi)
method2 = "The value of pi is {}".format(pi)
method3 = "The value of pi is {0:.4f}".format(pi)
print(method1 + '\n' + method2 + '\n' + method3)
method4 = """Flight Number: {0}. Flight Number: {1}.""".format('AZA86E', 'AZA151')
print(method4)


The value of pi is 3.14159
The value of pi is 3.14159
The value of pi is 3.1416
Flight Number: AZA86E. Flight Number: AZA151.
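
The same strings can also be built with f-strings (Python 3.6+), which this notebook already uses in some `print` calls. A minimal sketch; the `flights` list is a name introduced here for illustration:

```python
pi = 3.14159
flights = ['AZA86E', 'AZA151']

# Expressions inside {} are evaluated in place; a format spec follows the colon.
method2 = f'The value of pi is {pi}'
method3 = f'The value of pi is {pi:.4f}'
method4 = f'Flight Number: {flights[0]}. Flight Number: {flights[1]}.'

print(method2)
print(method3)
print(method4)
```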


### Some Common Methods

In [23]:
'put it upper'.upper()

'PUT IT UPPER'

In [24]:
'    strip it  '.strip()

'strip it'

In [25]:
'tannutuva    strip it  tannutuva'.strip('tannutuva')

'    strip it  '

In [26]:
'more is better than less'.split()

['more', 'is', 'better', 'than', 'less']

In [27]:
'one,two,three,four,five'.split(',')

['one', 'two', 'three', 'four', 'five']

In [72]:
' '.join(['Less', 'is', 'better', 'than', 'boh!?!'])

'Less is better than boh!?!'

In [73]:
' , '.join(['Less', 'is', 'better', 'than', 'boh!?!'])

'Less , is , better , than , boh!?!'

### Regular Expressions 

A regular expression, or regex, is a sequence of characters that defines a search pattern. Some useful special characters:

* **"\d"**	Match any digit		
* **"\D"**	Match any non-digit
* **"\s"**	Match any whitespace		
* **"\S"**	Match any non-whitespace
* **"\w"**	Match any alphanumeric char		
* **"\W"**	Match any non-alphanumeric char


* **?** Match zero or one repetitions of preceding: "ab?" matches "a" or "ab"
* **\*** Match zero or more repetitions of preceding: "ab*" matches "a", "ab", "abb", "abbb"...
* **+** Match one or more repetitions of preceding: "ab+" matches "ab", "abb", "abbb"... but not "a"
* **{n}** Match n repetitions of preceding: "ab{2}" matches "abb"
* **{m,n}** Match between m and n repetitions of preceding: "ab{2,3}" matches "abb" or "abbb"
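
The quantifiers listed above can be checked directly in Python. This is a small sketch that uses `re.fullmatch`, which succeeds only when the whole string fits the pattern; the `checks` and `results` dictionaries are names introduced here for illustration:

```python
import re

# Each pattern is paired with candidate strings to test against it.
checks = {
    r'ab?': ['a', 'ab', 'abb'],
    r'ab*': ['a', 'ab', 'abbb'],
    r'ab+': ['a', 'ab', 'abbb'],
    r'ab{2}': ['ab', 'abb', 'abbb'],
    r'ab{2,3}': ['ab', 'abb', 'abbb', 'abbbb'],
}

# Keep only the candidates that the pattern matches in full.
results = {
    pattern: [s for s in candidates if re.fullmatch(pattern, s)]
    for pattern, candidates in checks.items()
}

for pattern, matched in results.items():
    print(f'{pattern!r:12} matches {matched}')
```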

In [29]:
import re

line = 'more is better than      less'
regex1 = re.compile(r'\s')
regex2 = re.compile(r'\s+')
# "\s" is a special character that matches any whitespace (space, tab, newline, etc.), 
# and the "+" is a character that indicates one or more of the entity preceding it
print(regex1.split(line))
print(regex2.split(line))

['more', 'is', 'better', 'than', '', '', '', '', '', 'less']
['more', 'is', 'better', 'than', 'less']


In [49]:
email = re.compile(r'\w+@\w+\.[a-z]{3}')
mails = 'I have several mails. My personal mails are denobili16@gmail.com and cristiano.denobili@gmail.com'

email.findall(mails)

['denobili16@gmail.com', 'denobili@gmail.com']

In [53]:
email = re.compile(r'[\w.]+@\w+\.[a-z]{3}')
email.findall(mails)

['denobili16@gmail.com', 'cristiano.denobili@gmail.com']

In [58]:
regex = re.compile(r'\w\s\w')
# \w is a special marker matching any alphanumeric character.
regex.findall('deep learning is cool but flying is cooler y or n')

['p l', 'g i', 's c', 'l b', 't f', 'g i', 's c', 'r y', 'r n']

If you'd like to match any of these characters literally

. ^ $ * + ? { } [ ] \ | ( )

you can escape them with a backslash!

In [41]:
regex = re.compile(r'\$')
regex.findall("the cost is $20, but if you jump twice it is $15.")

['$', '$']
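
When the text to match contains several special characters, escaping them by hand is error-prone. The standard library's `re.escape` adds the backslashes automatically. A minimal sketch, where the `literal` and `text` strings are invented for illustration:

```python
import re

# Hypothetical literal text that contains regex metacharacters: $ ( . )
literal = '$20 (approx.)'
text = 'the cost is $20 (approx.), not $200 (approx.)'

# Used as-is, '$' is read as an end-of-string anchor, so nothing matches.
unescaped = re.findall(literal, text)
print(unescaped)

# re.escape() backslash-escapes the special characters for us.
pattern = re.compile(re.escape(literal))
matches = pattern.findall(text)
print(matches)
```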


<h2 style="text-align: center;" markdown="1"> Text Normalization</h2>

### Convert text to lower or upper case

In [44]:
input_str = 'The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil.'
input_str_lower = input_str.lower()

print(input_str_lower)

the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.


Using tr in bash:

```console
$: cat data/text_to_norm.txt | tr '[:upper:]' '[:lower:]'
$: cat data/text_to_norm.txt | tr '[:lower:]' '[:upper:]'

or

$: cat data/text_to_norm.txt | awk '{print tolower($0)}' 
$: cat data/text_to_norm.txt | awk '{print toupper($0)}'
```

If you want to time them:

```console
$: time for i in `seq 100000`; do cat data/text_to_norm.txt | tr '[:upper:]' '[:lower:]' > /dev/null 2>&1; done

or

$: time for i in `seq 100000`; do cat data/text_to_norm.txt | awk '{print toupper($0)}' > /dev/null 2>&1; done
```

In [59]:
! cat ./data/text_to_norm.txt  | tr '[:upper:]' '[:lower:]'

the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.
why do   mosquitoes exist?
one two three 4 five!
la cosa bella 56700 di una battuta a doppio senso è che può significare solo una cosa
is the universe @ quantum computer?
la vita è meravigli!osa! senza & saresti morto...



In [39]:
! cat ./data/text_to_norm.txt  | tr '[:lower:]' '[:upper:]'

THE 5 BIGGEST COUNTRIES BY POPULATION IN 2017 ARE CHINA, INDIA, UNITED STATES, INDONESIA, AND BRAZIL.
WHY DO   MOSQUITOES EXIST?
ONE TWO THREE 4 FIVE!
LA COSA BELLA 56700 DI UNA BATTUTA A DOPPIO SENSO È CHE PUÒ SIGNIFICARE SOLO UNA COSA
IS THE UNIVERSE @ QUANTUM COMPUTER?
LA VITA È MERAVIGLI!OSA! SENZA & SARESTI MORTO...



In [60]:
! cat data/text_to_norm.txt | awk '{print tolower($0)}'

the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.
why do   mosquitoes exist?
one two three 4 five!
la cosa bella 56700 di una battuta a doppio senso è che può significare solo una cosa
is the universe @ quantum computer?
la vita è meravigli!osa! senza & saresti morto...



In [61]:
! cat data/text_to_norm.txt | awk '{print toupper($0)}'

THE 5 BIGGEST COUNTRIES BY POPULATION IN 2017 ARE CHINA, INDIA, UNITED STATES, INDONESIA, AND BRAZIL.
WHY DO   MOSQUITOES EXIST?
ONE TWO THREE 4 FIVE!
LA COSA BELLA 56700 DI UNA BATTUTA A DOPPIO SENSO è CHE PUò SIGNIFICARE SOLO UNA COSA
IS THE UNIVERSE @ QUANTUM COMPUTER?
LA VITA è MERAVIGLI!OSA! SENZA & SARESTI MORTO...



In [42]:
! time for i in `seq 1000`; do cat data/text_to_norm.txt  | tr '[:upper:]' '[:lower:]' > /dev/null 2>&1; done


real	0m3.195s
user	0m2.322s
sys	0m2.998s


In [148]:
! time for i in `seq 1000`; do cat data/text_to_norm.txt | awk '{print toupper($0)}' > /dev/null 2>&1; done


real	0m2.753s
user	0m2.104s
sys	0m2.780s


### Remove numbers

[Regular Expression](https://docs.python.org/3.6/howto/regex.html)

[GNU/Linux Command-Line Tools Summary](http://tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm)

In [62]:
import re
input_str = 'Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls.'
result = re.sub(r'\d+', '', input_str)
#  \d Matches any decimal digit; this is equivalent to the class [0-9].
#  + which matches one or more times.
print(result)

Box A contains  red and  white balls, while Box B contains  red and  blue balls.
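
Digits can also be deleted without a regular expression by using a translation table. The sketch below is one possible variant, not the only way to do it; it also collapses the double spaces that deleting the digits leaves behind. The names `digit_table`, `no_digits` and `tidy` are introduced here for illustration:

```python
import string

input_str = 'Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls.'

# Third argument of maketrans: characters to delete (here the ASCII digits 0-9).
digit_table = str.maketrans('', '', string.digits)
no_digits = input_str.translate(digit_table)
print(no_digits)

# split() with no argument splits on any run of whitespace,
# so joining with a single space collapses the gaps.
tidy = ' '.join(no_digits.split())
print(tidy)
```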


In [63]:
! cat data/text_to_norm.txt  | tr -d '0123456789'

The  biggest countries by population in  are China, India, United States, Indonesia, and Brazil.
Why do   mosquitoes exist?
one two three  five!
La cosa bella  di una battuta a doppio senso è che può significare solo una cosa
Is the Universe @ quantum computer?
La vita è meravigli!osa! Senza & saresti morto...



In [68]:
! sed -e 's/[0-9]//g' data/text_to_norm.txt

The  biggest countries by population in  are China, India, United States, Indonesia, and Brazil.
Why do   mosquitoes exist?
one two three  five!
La cosa bella  di una battuta a doppio senso è che può significare solo una cosa
Is the Universe @ quantum computer?
La vita è meravigli!osa! Senza & saresti morto...



In [69]:
! cat data/text_to_norm.txt

The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil.
Why do   mosquitoes exist?
one two three 4 five!
La cosa bella 56700 di una battuta a doppio senso è che può significare solo una cosa
Is the Universe @ quantum computer?
La vita è meravigli!osa! Senza & saresti morto...



### Select Sentence Length

In [70]:
! awk 'NF<=5' data/text_to_norm.txt

Why do   mosquitoes exist?
one two three 4 five!



In [71]:
! awk 'NF>5' data/text_to_norm.txt

The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil.
La cosa bella 56700 di una battuta a doppio senso è che può significare solo una cosa
Is the Universe @ quantum computer?
La vita è meravigli!osa! Senza & saresti morto...
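
The `NF` variable in awk is the number of whitespace-separated fields on a line. A rough Python equivalent is sketched below; `select_by_length` is a hypothetical helper, and the sample lines are copied from `data/text_to_norm.txt` so the sketch runs without the file:

```python
# A few lines copied from data/text_to_norm.txt.
lines = [
    'Why do   mosquitoes exist?',
    'one two three 4 five!',
    'Is the Universe @ quantum computer?',
]

def select_by_length(lines, max_tokens=5):
    """Split lines into (short, long) by whitespace-separated token count."""
    short = [line for line in lines if len(line.split()) <= max_tokens]
    long_ = [line for line in lines if len(line.split()) > max_tokens]
    return short, long_

short, long_ = select_by_length(lines)
print(short)
print(long_)
```

Like awk, `str.split()` with no argument treats any run of whitespace as a single separator, so the triple space in the first line does not create empty tokens.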


### Grep (select) Specific Words

In [74]:
! grep -v '\bcosa bella\b' data/text_to_norm.txt

The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil.
Why do   mosquitoes exist?
one two three 4 five!
Is the Universe @ quantum computer?
La vita è meravigli!osa! Senza & saresti morto...



In [51]:
! grep '\bIndia\b' data/text_to_norm.txt

The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil.


In [52]:
! grep -v '\bIndia\b' data/text_to_norm.txt

Why do   mosquitoes exist?
one two three 4 five!
La cosa bella 56700 di una battuta a doppio senso è che può significare solo una cosa
Is the Universe @ quantum computer?
La vita è meravigli!osa! Senza & saresti morto...



In [76]:
! grep -P '(\bvita\b).*(\bmorto\b)' data/text_to_norm.txt

La vita è meravigli!osa! Senza & saresti morto...
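
The grep calls above can be approximated in Python with `re.search`. In this sketch, `grep` is a hypothetical helper written for illustration (its `invert` flag mimics `grep -v`), and the sample lines are copied from `data/text_to_norm.txt`:

```python
import re

lines = [
    'The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil.',
    'Why do   mosquitoes exist?',
    'La vita è meravigli!osa! Senza & saresti morto...',
]

def grep(pattern, lines, invert=False):
    """Return the lines that match `pattern`, or the ones that do not if invert=True."""
    regex = re.compile(pattern)
    return [line for line in lines if bool(regex.search(line)) != invert]

india = grep(r'\bIndia\b', lines)                  # like: grep '\bIndia\b'
not_india = grep(r'\bIndia\b', lines, invert=True) # like: grep -v '\bIndia\b'
vita_morto = grep(r'\bvita\b.*\bmorto\b', lines)   # both words, in this order

print(india)
print(not_india)
print(vita_morto)
```

The word boundary `\b` is what prevents `India` from matching inside `Indonesia`.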


### Remove punctuation

The following code removes the ASCII punctuation characters listed in `string.punctuation`:

In [77]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [84]:
input_str = 'This &is [an] example? {of} string. with.? punctuation!!!!'

# 1st method
table = str.maketrans({key: None for key in string.punctuation})
input_str_nopunct = input_str.translate(table)
print(input_str_nopunct)

# 2nd method: remove every character that is neither a word character nor whitespace
input_str_nopunct = re.sub(r'[^\w\s]','',input_str)
print(input_str_nopunct)




This is an example of string with punctuation
This is an example of string with punctuation


In [83]:
! cat data/text_to_norm.txt | tr -d '[:punct:]'

The 5 biggest countries by population in 2017 are China India United States Indonesia and Brazil
Why do   mosquitoes exist
one two three 4 five
La cosa bella 56700 di una battuta a doppio senso è che può significare solo una cosa
Is the Universe  quantum computer
La vita è meravigliosa Senza  saresti morto



In [151]:
! awk '{ gsub(/[[:punct:]]/, "", $0) }1;' data/text_to_norm.txt

The 5 biggest countries by population in 2017 are China India United States Indonesia and Brazil
Why do   mosquitoes exist
one two three 4 five
La cosa bella di una battuta a doppio senso � che può significare solo una cosa
Is the Universe  quantum computer
La vita � meravigliosa Senza  saresti morto



In [88]:
# [] defines a set of characters: m[auo]m -> mam, mum, mom
# [^...] negates the set: it matches any character NOT listed
! sed -e "s/[^a-zA-Z0-9àèéìò\!'?\* ]//g" data/text_to_norm.txt

The 5 biggest countries by population in 2017 are China India United States Indonesia and Brazil
Why do   mosquitoes exist?
one two three 4 five!
La cosa bella 56700 di una battuta a doppio senso è che può significare solo una cosa
Is the Universe  quantum computer?
La vita è meravigli!osa! Senza  saresti morto



In [171]:
! sed -e "s/[a-zA-Z0-9àèéìò\!'?\* ]//g" data/text_to_norm.txt

,,,,.



@
&...



### Remove white spaces

In [94]:
line = 'Don\'t take too much       space'
re.sub(r'\s+', ' ', line)

"Don't take too much space"

In [95]:
! echo "don't take too much    space" | sed 's/  */ /g'

don't take too much space
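
The normalization steps from this section (lowercasing, removing numbers, removing punctuation, collapsing white space) can be chained into a single function. This is only a sketch under the assumption that all four steps are wanted, which depends on the final task; `normalize` and `_PUNCT_TABLE` are hypothetical names:

```python
import re
import string

# Translation table that deletes the ASCII punctuation in string.punctuation.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def normalize(text):
    """Lowercase, drop digits and ASCII punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)           # remove numbers
    text = text.translate(_PUNCT_TABLE)       # remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # collapse and trim whitespace
    return text

sample = 'The 5 biggest countries   by population in 2017 are China, India...'
normalized = normalize(sample)
print(normalized)
```

Whitespace is collapsed last because the earlier removals can leave gaps behind.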


In [60]:
#### Tokenization 

In [61]:
#### Stop words

In [62]:
#### Stemming and Lemmatization

In [63]:
#### Part of Speech Tagging & Chunking

<h2 style="text-align: center;" markdown="1"> Spacy: Python Library for Text Preprocessing </h2>


[Spacy Official Page](https://spacy.io)

**spaCy** is an open-source library for advanced Natural Language Processing, written in Python and Cython. It is published under the MIT license and currently offers statistical neural network models for English, German, Spanish, Portuguese, French, Italian, Dutch and multi-language NER, as well as tokenization for various other languages.

<h3 style="text-align: center;" markdown="1"> Language Models </h3>


[Spacy Language Models Official Page](https://spacy.io/models/en)

spaCy can be used to extract linguistic features such as part-of-speech tags, dependency labels and named entities, to customise the tokenizer, and to work with the rule-based matcher.

### English: en_core_web_sm


English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities.

'_sm' stands for small. Medium ('_md') and large ('_lg') versions can also be downloaded.

In [15]:
#!python3 -m spacy download en_core_web_sm
#!python -m spacy download it_core_news_sm

In [97]:
ll = ['a', 'c', 'c']

for tokin in ll:
    print(tokin)

a
c
c


In [99]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

Apple apple PROPN NNP nsubj Xxxxx True False
is be VERB VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. u.k. PROPN NNP compound X.X. False False
startup startup NOUN NN dobj xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False


* **Text**: The original word text.
* **Lemma**: The base form of the word.
* **POS**: The simple part-of-speech tag.
* **Tag**: The detailed part-of-speech tag.
* **Dep**: Syntactic dependency, i.e. the relation between tokens.
* **Shape**: The word shape – capitalisation, punctuation, digits.
* **is alpha**: Does the token consist of alphabetic characters?
* **is stop**: Is the token part of a stop list, i.e. the most common words of the language?

In [100]:
doc = nlp(u'Time flies')

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

Time time NOUN NN nsubj Xxxx True False
flies fly VERB VBZ ROOT xxxx True False


### Part-of-speech Tagging (POS)

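*Ci sei alle sei?* (Italian for "Are you there at six?": the word *sei* is first a verb, "you are", and then a numeral, "six".)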

In corpus linguistics, part-of-speech tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context—i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. 

Parts of speech: Noun (N), Verb (V), Adjective (ADJ), Adverb (ADV), Preposition (P), Conjunction (CON), Pronoun (PRO), Interjection (INT)

After tokenization, spaCy can parse and tag a given doc. This is where the statistical model comes in, which enables spaCy to predict which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalise across the language. For instance, a word following "the" in English is most likely a noun.

In [123]:
doc = nlp(u'Time flies')

for token in doc:
    print(token.text, token.pos_)

Time NOUN
flies VERB


In [101]:
doc = nlp(u'The journey of a thousand miles begins with one step.')

for token in doc:
    print(token.text, token.pos_)

The DET
journey NOUN
of ADP
a DET
thousand NUM
miles NOUN
begins VERB
with ADP
one NUM
step NOUN
. PUNCT


In [102]:
# In Jupyter; otherwise use displacy.serve
displacy.render(doc, style='dep', jupyter=True)

### Lemmatization

Let's build a lemmatizer...

In [104]:
text = 'She went home because she was tired'
print(text)

def lemmatizer(corpus):
    doc = nlp(corpus)
    lmtz = [token.lemma_ for token in doc]
    return ' '.join(lmtz)

lemmatizer('She went home because she was tired')

She went home because she was tired


'-PRON- go home because -PRON- be tired'

... and apply the map function

In [123]:
def marameo(x):
    x_str = str(x)
    x_str_up = x_str.upper()
    return x_str_up

marameo('banksy')

marameo_lmd = lambda x: str(x).upper()

marameo_lmd(7)

'7'

In [109]:
corpus = ['She went home because she was tired', 'I don\'t want to be uncolored', 'She is the one']
corpus_upper = list(map(lambda x: x.upper(), corpus))
corpus_upper

['SHE WENT HOME BECAUSE SHE WAS TIRED',
 "I DON'T WANT TO BE UNCOLORED",
 'SHE IS THE ONE']

The same pattern works with any function, for example lowercasing the result again:

In [124]:
list(map(lambda x: x.lower(), corpus_upper))

['she went home because she was tired',
 "i don't want to be uncolored",
 'she is the one']

### Named Entities

The default model identifies a variety of named and numeric entities, including companies, locations, organizations and products.
A named entity is a "real-world object" that's assigned a name. For example, a person, a country, a product or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.

In [127]:
# Load the small English model (the large one, en_core_web_lg, is left commented out)
#nlp = spacy.load('en_core_web_lg')
nlp = spacy.load('en_core_web_sm')

# The text we want to examine
text = """London is the capital and most populous city of England and 
the United Kingdom.  Standing on the River Thames in the south east 
of the island of Great Britain, London has been a major settlement 
for two millennia. It was founded by the Romans, who named it Londinium.
"""

# Parse the text with spaCy. This runs the entire pipeline.
doc = nlp(text)

# 'doc' now contains a parsed version of text. We can use it to do anything we want!
# For example, this will print out all the named entities that were detected:
for entity in doc.ents:
    #print(spacy.explain(entity.label_))
    spacy_expl=spacy.explain(entity.label_)
    print(f"{entity.text} ({entity.label_} : {spacy_expl} )")

London (GPE : Countries, cities, states )
England (GPE : Countries, cities, states )

 (GPE : Countries, cities, states )
the United Kingdom (GPE : Countries, cities, states )
  (ORDINAL : "first", "second", etc. )
the River Thames (ORG : Companies, agencies, institutions, etc. )

 (GPE : Countries, cities, states )
Great Britain (GPE : Countries, cities, states )
London (GPE : Countries, cities, states )

 (GPE : Countries, cities, states )
two (CARDINAL : Numerals that do not fall under another type )
Romans (NORP : Nationalities or religious or political groups )
Londinium (PERSON : People, including fictional )

 (GPE : Countries, cities, states )


In [20]:
displacy.render(doc, style='ent', jupyter=True)
#displacy.render(doc, style='dep', jupyter=True)

### Noun Chunks

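A chunk is a phrase or group of words which can be learnt as a unit by somebody who is learning a language. Examples of chunks are ‘Can I have the bill, please?’ and ‘Pleased to meet you’. In spaCy, noun chunks are "base noun phrases": flat phrases that have a noun as their head, such as "autonomous cars".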

In [128]:

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")

for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

Autonomous cars cars nsubj shift
insurance liability liability dobj shift
manufacturers manufacturers pobj toward


### Word vectors and similarity

<img src="images/wordembeding.png" alt="Drawing" style="width: 500px;"/>

What are **word embeddings** exactly? Loosely speaking, they are vector representations of a particular word. 

https://spacy.io/usage/vectors-similarity
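
Similarity between word vectors is commonly measured with cosine similarity, which is also spaCy's default estimate. The sketch below computes it by hand; the 3-dimensional vectors are invented purely for illustration and are not real embeddings (real ones typically have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings", made up for this example.
toy_vectors = {
    'dog': [0.9, 0.8, 0.1],
    'cat': [0.8, 0.9, 0.2],
    'banana': [0.1, 0.2, 0.9],
}

sim_dog_cat = cosine_similarity(toy_vectors['dog'], toy_vectors['cat'])
sim_dog_banana = cosine_similarity(toy_vectors['dog'], toy_vectors['banana'])

print(f'dog vs cat:    {sim_dog_cat:.3f}')
print(f'dog vs banana: {sim_dog_banana:.3f}')
```

Vectors that point in similar directions score close to 1, so in this toy setup "dog" is closer to "cat" than to "banana".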

In [129]:
nlp = spacy.load('en_core_web_sm')  # small models lack real word vectors; use en_core_web_md or en_core_web_lg for meaningful similarities
tokens = nlp(u'dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

dog dog 1.0
dog cat 0.53906965
dog banana 0.28761008
cat dog 0.53906965
cat cat 1.0
cat banana 0.48752153
banana dog 0.28761008
banana cat 0.48752153
banana banana 1.0


In [4]:
#TODO
#import sputnik
#sputnik.install('spacy', spacy.about.__version__, 'en_core_web_sm', data_path='~/my_datasets')

### English: en_core_web_sm

In [28]:
nlp = spacy.load('en_core_web_sm')

# The text we want to examine
text = """London is the capital and most populous city of England and the United Kingdom."""
doc = nlp(text)

print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)

London is the capital and most populous city of England and the United Kingdom.
London PROPN nsubj
is VERB ROOT
the DET det
capital NOUN attr
and CCONJ cc
most ADV advmod
populous ADJ amod
city NOUN conj
of ADP prep
England PROPN pobj
and CCONJ cc
the DET det
United PROPN compound
Kingdom PROPN conj
. PUNCT punct


### Italian: it_core_news_sm

Assigns context-specific token vectors, POS tags, dependency parse and named entities. Supports identification of PER, LOC, ORG and MISC entities.

In [130]:
nlp = spacy.load('it_core_news_sm')

# The text we want to examine
text = """L'aperitivo è una bevanda alcolica o analcolica che si beve prima dei pasti per stimolare l'appetito. 
Può essere un cocktail o una bevanda non miscelata accompagnata o meno a stuzzichini."""
doc = nlp(text)

print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)

L'aperitivo è una bevanda alcolica o analcolica che si beve prima dei pasti per stimolare l'appetito. 
Può essere un cocktail o una bevanda non miscelata accompagnata o meno a stuzzichini.
L'aperitivo NOUN nsubj
è VERB cop
una DET det
bevanda NOUN ROOT
alcolica ADJ amod
o CONJ cc
analcolica ADJ conj
che PRON nsubj
si PRON expl:pass
beve VERB acl:relcl
prima ADV advmod
dei DET det
pasti NOUN nsubj:pass
per ADP mark
stimolare VERB advcl
l'appetito VERB obj
. PUNCT punct

 SPACE 
Può AUX aux
essere VERB cop
un DET det
cocktail NOUN ROOT
o CONJ cc
una DET det
bevanda NOUN conj
non ADV advmod
miscelata VERB amod
accompagnata VERB acl
o CONJ cc
meno ADV conj
a ADP case
stuzzichini NOUN obl
. PUNCT punct


In [98]:
text = 'ci sei alle sei?'

doc = nlp(text)

print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)

ci sei alle sei?
ci PRON expl
sei NUM nummod
alle NOUN ROOT
sei NUM nsubj
? PUNCT punct
