# 1. Import text data

This notebook introduces the basics of importing text data.
You'll learn about the data structures that spaCy and NLTK use to represent text.

## 1.1. Tools for Text Analysis

Because text analysis techniques are primarily applied machine learning, a language that has rich scientific and numeric computing libraries is necessary. When it comes to tools for performing machine learning on text, Python has a powerhouse suite that includes NLTK, Gensim, and spaCy:
    
- **NLTK**, the Natural Language Tool-Kit, is a “batteries included” resource for NLP written in Python by experts in academia. Originally a pedagogical tool for teaching NLP, it contains corpora, lexical resources, grammars, language processing algorithms, and pretrained models that allow Python programmers to quickly get started processing text data in a variety of languages. 👉 https://www.nltk.org/

- **Gensim** is a robust, efficient, and hassle-free library that focuses on unsupervised semantic modeling of text. Originally designed to find similarity between documents (generate similarity), it now exposes topic modeling methods for latent semantic techniques, and includes other unsupervised libraries such as word2vec. 👉 https://radimrehurek.com/gensim/

- **spaCy** provides production-grade language processing by implementing the academic state-of-the-art into a simple and easy-to-use API. In particular, spaCy focuses on preprocessing text for deep learning or to build information extraction or natural language understanding systems on large volumes of text. 👉 https://spacy.io/

📕 Bengfort, B., Bilbro, R., & Ojeda, T. (2018). *Applied text analysis with python: Enabling language-aware data products with machine learning.* O'Reilly Media, Inc.

🌍 https://course.spacy.io/en/chapter1

🌍 https://www.nltk.org/book/ch02.html


### 1.1.2. Introduction to spaCy

We'll create a blank English language object and store it in a variable called *nlp*.

In [2]:
# Import spaCy
import spacy
from spacy.lang.en import English

# Create the nlp object
nlp = English()


When you process a text with the nlp object, spaCy creates a Doc object – short for "document". The Doc lets you access information about the text in a structured way, and no information is lost.

In [3]:
# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

Hello
world
!
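
The structured access described above can be sketched a little further. The example below is a minimal illustration, assuming the same blank English pipeline as in the previous cell; the variable names `first_token`, `span`, and `rebuilt` are introduced here for illustration only. It shows indexing a Doc to get a token, slicing it to get a span, and rebuilding the original string from the tokens, which is one way to see that no information is lost.

```python
from spacy.lang.en import English

# Blank English pipeline: tokenizer only, no trained components
nlp = English()

text = "Hello world!"
doc = nlp(text)

# Index a Doc to get a single Token
first_token = doc[0]
print(first_token.text)

# Slice a Doc to get a Span (a view over several tokens)
span = doc[0:2]
print(span.text)

# Lexical attributes such as is_punct work without a trained model
print([token.is_punct for token in doc])

# Each token keeps its trailing whitespace, so the original
# string can be rebuilt exactly from the tokens
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```

Because the blank pipeline only tokenizes, attributes that need a trained model (for example part-of-speech tags) are not filled in here.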


🤖📝 **Your turn:**

Try out some of the 55+ available languages: https://spacy.io/usage/models#languages.

- Import the language class for a language of your choice from its <tt>spacy.lang</tt> module (the English example above used <tt>spacy.lang.en</tt>) and create a new <tt>nlp</tt> object.
- Create a <tt>doc</tt> and print its text.


In [None]:
# Import the language class
from spacy.lang.____ import ____ 

# Create the nlp object
nlp = ____

# Process a text
doc = nlp("Write your sentence in your language here.")

# Print the document text
print(____.text)

### 1.1.3. Introduction to NLTK

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/. We begin by getting the Python interpreter to load the NLTK package, then ask to see nltk.corpus.gutenberg.fileids(), the file identifiers in this corpus:

In [11]:
# Import NLTK
import nltk
# Download Gutenberg package
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to
[nltk_data]     /home/avaldivia/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


In [12]:
# Pick out the first of these texts, Emma by Jane Austen, give it the short name emma, then count its tokens (words and punctuation marks):
emma = nltk.corpus.gutenberg.words('austen-emma.txt')

In [10]:
len(emma)

192427