# Natural Language Processing project with spaCy

- **Created by Andrés Segura Tinoco**
- **Created on June 04, 2019**

In [1]:
# Load Python libraries
import io

In [2]:
# Load NLP libraries
import spacy
from spacy import displacy
from spacy.lang.en.stop_words import STOP_WORDS

In [3]:
def read_text_file(file_path):
    # Read the whole file into one string, decoding it as ISO-8859-1 (Latin-1)
    with io.open(file_path, 'r', encoding='ISO-8859-1') as f:
        return f.read()

In [4]:
# Get text sample
file_path = "../data/citeseer/abberley99thisl"
text_sample = read_text_file(file_path)
text_sample

'The THISL Broadcast News Retrieval System This paper described the THISL spoken document retrieval system for British and North American Broadcast News. The system is based on the ABBOT large vocabulary speech recognizer, using a recurrent network acoustic model, and a probabilistic text retrieval system. We discuss the development of a realtime British English Broadcast News system, and its integration into a spoken document retrieval system. Detailed evaluation is performed using a similar North American Broadcast News system, to take advantage of the TREC SDR evaluation methodology. We report results on this evaluation, with particular reference to the effect of query expansion and of automatic segmentation algorithms. 1. INTRODUCTION  THISL is an ESPRIT Long Term Research project in the area of speech retrieval. It is concerned with the construction of a system which performs good recognition of broadcast speech from television and radio news programmes, from which it can produce 

In [5]:
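# Create NLP model for the English language
# Note: the 'en' shortcut only works in spaCy 2.x; spaCy 3+ needs the full package name (e.g. 'en_core_web_sm')
nlp = spacy.load('en')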
doc = nlp(text_sample)

In [6]:
# Show words: alphabetic tokens that are not stop words
words = [token for token in doc if not token.is_stop and token.is_alpha]
print(words)

[THISL, Broadcast, News, Retrieval, System, paper, described, THISL, spoken, document, retrieval, system, British, North, American, Broadcast, News, system, based, ABBOT, large, vocabulary, speech, recognizer, recurrent, network, acoustic, model, probabilistic, text, retrieval, system, discuss, development, realtime, British, English, Broadcast, News, system, integration, spoken, document, retrieval, system, Detailed, evaluation, performed, similar, North, American, Broadcast, News, system, advantage, TREC, SDR, evaluation, methodology, report, results, evaluation, particular, reference, effect, query, expansion, automatic, segmentation, algorithms, INTRODUCTION, THISL, ESPRIT, Long, Term, Research, project, area, speech, retrieval, concerned, construction, system, performs, good, recognition, broadcast, speech, television, radio, news, programmes, produce, multimedia, indexing, data]
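The filtered tokens above can be summarised as a term-frequency table. The following is a minimal sketch, not part of the original notebook: it assumes a plain list of strings (in the notebook this would be something like `[token.text for token in words]`) and uses only the first few tokens copied from the output above. The names `filtered`, `freq` and `top_terms` are illustrative.

```python
from collections import Counter

# A short sample of the filtered tokens, copied from the output above.
# In the notebook these strings would come from the spaCy tokens' .text attribute.
filtered = [
    "THISL", "Broadcast", "News", "Retrieval", "System", "paper",
    "described", "THISL", "spoken", "document", "retrieval", "system",
    "British", "North", "American", "Broadcast", "News",
]

# Lowercase so that "Retrieval" and "retrieval" count as the same term.
freq = Counter(word.lower() for word in filtered)

# The five most frequent terms with their counts.
top_terms = freq.most_common(5)
print(top_terms)
```

Lowercasing is a design choice here: it merges title-case and body-text variants of the same word, at the cost of also lowercasing acronyms such as THISL.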


In [7]:
# Show stop-words
stop_words = [token for token in doc if token.is_stop]
print(stop_words)

[The, This, the, for, and, The, is, on, the, using, a, and, a, We, the, of, a, and, its, into, a, is, using, a, to, take, of, the, We, on, this, with, to, the, of, and, of, is, an, in, the, of, It, is, with, the, of, a, which, of, from, and, from, which, it, can]


In [8]:
# Print out named entities with their character offsets and labels
for ent in doc.ents:
    print(ent.text, '[', ent.start_char, ',', ent.end_char, ']:', ent.label_)

The THISL Broadcast News Retrieval System [ 0 , 41 ]: ORG
THISL [ 67 , 72 ]: ORG
British [ 110 , 117 ]: NORP
North American Broadcast News [ 122 , 151 ]: ORG
ABBOT [ 180 , 185 ]: ORG
British English Broadcast News [ 348 , 378 ]: ORG
North American [ 497 , 511 ]: NORP
Broadcast News [ 512 , 526 ]: ORG
1 [ 732 , 733 ]: CARDINAL
ESPRIT Long Term Research [ 761 , 786 ]: ORG
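The `start_char` and `end_char` values printed above are character offsets into the original text, so slicing the text with them recovers each entity. The sketch below illustrates this and groups the entities by label. It is an illustration only: it assumes the first sentence of the sample and the first four offset/label triples copied from the outputs above, and the names `text`, `entities` and `by_label` are hypothetical.

```python
from collections import defaultdict

# Opening sentence of the sample text, copied from the cell output above.
text = ("The THISL Broadcast News Retrieval System This paper described the "
        "THISL spoken document retrieval system for British and North "
        "American Broadcast News.")

# (start_char, end_char, label) triples copied from the printed entities.
entities = [
    (0, 41, "ORG"),
    (67, 72, "ORG"),
    (110, 117, "NORP"),
    (122, 151, "ORG"),
]

# Group entity strings by label; slicing the text recovers each entity.
by_label = defaultdict(list)
for start, end, label in entities:
    by_label[label].append(text[start:end])

for label, spans in by_label.items():
    print(label, spans)
```

With a real `doc`, the same grouping could be built directly from `ent.label_` and `ent.text` while iterating over `doc.ents`.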


In [9]:
# Render the named entities highlighted inline in the notebook
displacy.render(doc, style='ent', jupyter=True)
