# 1. Import text data

This notebook will introduce you to the basics of importing texts. 
You'll learn about two kinds of data structures: corpora (unstructured text) and datasets (structured text).







Legend of symbols:

- 🤓: Tips

- 🤖📝: Your turn

- ❓: Question

- 💫: Extra exercise 

## 1.1. Very Basic Tutorial for Jupyter Notebook

Let's begin this tutorial by printing the "Hello World!" example. To do so, we will use the **<tt>print</tt>** function:

In [None]:
print("Hello World!")

Let's try printing **<tt>Hello World!</tt>** several times:

In [None]:
print("Hello World!")
print()
print("Hello World!")

Now, let's print a list of numbers:

In [None]:
int_list = [1,2,3,4,5,6]
print(int_list)

In Python, variables can hold values of different types:

In [None]:
sent = "Hello World!" # This is a string
int_list = [1,2,3,4,5,6] # This is a list of integers

Use the **<tt>type</tt>** function to get a variable's type.

In [None]:
type(sent)

In [None]:
type(int_list)

And finally, we can also have a list of strings:

In [None]:
city_list = ["London", "Granada", "Bagdad", "Lang Tang", "Lucca", "Budapest"] # This is a list of strings

In [None]:
type(city_list)

🤓 We use **<tt>list[x]</tt>** to get the element at index **<tt>x</tt>** of a list. Indexes start at 0, so **<tt>city_list[0]</tt>** is the first element:

In [None]:
city_list[0]

In [None]:
type(city_list[0])
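
Indexing also works from the end of the list, and a range of indexes returns a new list. The cell below is a small sketch of these cases, reusing the same `city_list`; the names `first_city`, `last_city`, `first_three` and `n_cities` are only illustrative.

```python
city_list = ["London", "Granada", "Bagdad", "Lang Tang", "Lucca", "Budapest"]

first_city = city_list[0]      # indexing starts at 0
last_city = city_list[-1]      # negative indexes count from the end
first_three = city_list[:3]    # a slice returns a new list
n_cities = len(city_list)      # number of elements in the list

print(first_city, last_city, first_three, n_cities)
```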

## 1.2. Importing unstructured data (Corpus)

Because text analysis techniques are primarily applied machine learning, a language that has rich scientific and numeric computing libraries is necessary. When it comes to tools for performing machine learning on text, Python has a powerhouse suite that includes NLTK, Gensim, and spaCy:
    
- **NLTK**, the Natural Language Tool-Kit, is a “batteries included” resource for NLP written in Python by experts in academia. Originally a pedagogical tool for teaching NLP, it contains corpora, lexical resources, grammars, language processing algorithms, and pretrained models that allow Python programmers to quickly get started processing text data in a variety of languages. 👉 https://www.nltk.org/

- **Gensim** is a robust, efficient, and hassle-free library that focuses on unsupervised semantic modeling of text. Originally designed to find similarity between documents (generate similarity), it now exposes topic modeling methods for latent semantic techniques, and includes other unsupervised libraries such as word2vec. 👉 https://radimrehurek.com/gensim/

- **spaCy** provides production-grade language processing by implementing the academic state-of-the-art into a simple and easy-to-use API. In particular, spaCy focuses on preprocessing text for deep learning or to build information extraction or natural language understanding systems on large volumes of text. 👉 https://spacy.io/


### 1.2.1. Introduction to spaCy

We'll create an English language processing object called **<tt>nlp</tt>**.

In [None]:
# Import spaCy
import spacy
from spacy.lang.en import English

# Create the nlp object
nlp = English()


When you process a text with the nlp object, spaCy creates a Doc object – short for "document". The Doc lets you access information about the text in a structured way, and no information is lost.

In [None]:
# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)
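
To make the idea of "structured access with no information lost" more concrete, here is a minimal sketch using the same blank English pipeline. It indexes and slices the Doc, rebuilds the original string from the tokens, and reads one lexical attribute. The variable names (`first_token`, `span`, `rebuilt`) are just for illustration.

```python
from spacy.lang.en import English

nlp = English()  # blank English pipeline: tokenizer only, no trained model needed
text = "Hello world!"
doc = nlp(text)

first_token = doc[0]   # indexing a Doc gives a Token
span = doc[1:3]        # slicing a Doc gives a Span

print(len(doc))            # number of tokens
print(first_token.text)
print(span.text)

# Each token remembers its trailing whitespace, so the text can be rebuilt
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)

# Lexical attributes such as is_punct are available without a trained model
print([token.is_punct for token in doc])
```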

### 🤖📝 **Your turn**



Try out some of the 55+ available languages: https://spacy.io/usage/models#languages.

- Import the language class for your chosen language from its <tt>spacy.lang</tt> module (for example, <tt>Spanish</tt> from <tt>spacy.lang.es</tt>) and create a new <tt>nlp</tt> object.
- Create a <tt>doc</tt> and print its text.


In [None]:
# Import the language class
from spacy.lang.____ import ____ 

# Create the nlp object
nlp = ____

# Process a text
doc = nlp("Write your sentence in your language here.")

# Print the document text
print(____.text)

### 1.2.2. Introduction to NLTK

#### Gutenberg Corpus

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/. We begin by getting the Python interpreter to load the NLTK package, then ask to see nltk.corpus.gutenberg.fileids(), the file identifiers in this corpus:

In [None]:
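# Import NLTK
import nltk
# Download the Gutenberg corpus data (only needed the first time)
nltk.download('gutenberg')
# Import the Gutenberg corpus reader
from nltk.corpus import gutenberg
# List the file identifiers in this corpus
gutenberg.fileids()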

Let's pick out the first of these texts (Emma by Jane Austen):

In [None]:
emma_raw = nltk.corpus.gutenberg.raw('austen-emma.txt')

And print it:

In [None]:
print(emma_raw)

Now, let's load the same text (Emma by Jane Austen) as a sequence of words, give it a short name, **<tt>emma_words</tt>**, and then find out how many words it contains:

In [None]:
emma_words = nltk.corpus.gutenberg.words('austen-emma.txt')

In [None]:
print(emma_words)

❓ Which is the first element of the list?

In [None]:
emma_words[0]  # index 0 is the first element

🤓 **<tt>emma_words</tt>** is an NLTK corpus view, a list-like sequence whose elements are strings:

In [None]:
type(emma_words)

In [None]:
type(emma_words[0])

❓ How many words does this corpus have?

In [None]:
len(emma_words)

🤓 The previous example, **<tt>nltk.corpus.gutenberg.words</tt>**, also showed how we can access the raw text split up into tokens.

Now, let's try another function for sentences:

In [None]:
emma_sents = nltk.corpus.gutenberg.sents('austen-emma.txt')

In [None]:
print(emma_sents)

❓ How many sentences does this corpus have?

In [None]:
len(emma_sents)

🤓 In this case, **<tt>nltk.corpus.gutenberg.sents</tt>** showed how we can get the text split up into sentences.
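
The three views seen so far differ in shape: the raw text is a single string, the words form a sequence of strings, and the sentences form a sequence of word lists. The sketch below mimics those shapes with a tiny hand-written example in plain Python. The `toy_*` values are invented for illustration and are not taken from the corpus.

```python
# Hand-made stand-ins for the three views; not taken from the corpus
toy_raw = "I like tea. You like coffee!"
toy_words = ["I", "like", "tea", ".", "You", "like", "coffee", "!"]
toy_sents = [["I", "like", "tea", "."], ["You", "like", "coffee", "!"]]

print(type(toy_raw), len(toy_raw))          # one string, length in characters
print(type(toy_words[0]), len(toy_words))   # strings, length in tokens
print(type(toy_sents[0]), len(toy_sents))   # lists of strings, length in sentences

# Flattening the sentences gives back the word list
flattened = [word for sent in toy_sents for word in sent]
avg_sent_len = len(toy_words) / len(toy_sents)
print(flattened == toy_words, avg_sent_len)
```

Note that punctuation marks count as tokens, so word counts from a tokenized corpus are higher than a count of dictionary words would be.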

### 🤖📝 **Your turn**

Import **<tt>melville-moby_dick.txt</tt>** and extract (1) the number of words and (2) the number of sentences in this corpus.

In [None]:
mobydick_raw = nltk.corpus.gutenberg.raw(_____)

#...

❓ Do you think that the Gutenberg corpus is annotated or unannotated?

#### Brown Corpus

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on. This table gives an example of each genre (for a complete list, see http://icame.uib.no/brown/bcm-los.html):

<img src="table_brown.png">

In [None]:
nltk.download('brown')  # only needed the first time
from nltk.corpus import brown

In [None]:
brown.categories()


❓ Do you think that the Brown Corpus is annotated or unannotated?

#### 💫 Counting Words by Genre

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics. 
Let's compare genres in their usage of modal verbs.
The first step is to produce the counts for a particular genre. 

In [None]:
news_text = brown.words(categories='news')

In [None]:
print(news_text)

In [None]:
fdist = nltk.FreqDist(w.lower() for w in news_text)

In [None]:
modals = ['can', 'could', 'may', 'might', 'must', 'will']

In [None]:
for m in modals:
    print(m + ':', fdist[m], end=' ')
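
If you want to see how a frequency distribution behaves without loading a corpus, the following sketch builds one from a short hand-written token list. The `toy_tokens` list is hypothetical and only stands in for `brown.words(categories='news')`.

```python
import nltk

# Hypothetical token list standing in for brown.words(categories='news')
toy_tokens = ["I", "can", "swim", "and", "she", "CAN", "run", ".",
              "Can", "you", "?", "We", "must", "go"]

# Lowercasing first means 'can', 'CAN' and 'Can' are counted together
toy_fdist = nltk.FreqDist(w.lower() for w in toy_tokens)

print(toy_fdist["can"])          # count of one word
print(toy_fdist["might"])        # words that never occur count as 0
print(toy_fdist.N())             # total number of tokens counted
print(toy_fdist.most_common(1))  # the most frequent token with its count
```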

Next, we need to obtain counts for each genre of interest. We'll use NLTK's support for conditional frequency distributions. These are presented systematically in Section 2 of the NLTK book chapter listed under Resources, which also unpicks the following code line by line. For the moment, you can ignore the details and just concentrate on the output.

In [None]:
cfd = nltk.ConditionalFreqDist(
           (genre, word)
           for genre in brown.categories()
           for word in brown.words(categories=genre))

In [None]:
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)
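
A conditional frequency distribution can be thought of as one frequency distribution per condition (here, per genre). The sketch below shows this on a few hand-written pairs. The `toy_pairs` data is hypothetical and only stands in for the (genre, word) pairs generated from the Brown Corpus.

```python
import nltk

# Hypothetical (genre, word) pairs standing in for the Brown Corpus
toy_pairs = [
    ("news", "can"), ("news", "will"), ("news", "will"),
    ("romance", "could"), ("romance", "could"), ("romance", "can"),
]

toy_cfd = nltk.ConditionalFreqDist(toy_pairs)

print(sorted(toy_cfd.conditions()))   # the genres that were seen
print(toy_cfd["news"]["will"])        # count of 'will' within 'news'
print(toy_cfd["romance"]["will"])     # unseen combinations count as 0

# Same kind of table as above, restricted to the toy data
toy_cfd.tabulate(conditions=["news", "romance"], samples=["can", "could", "will"])
```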

### 🤖📝 **Your turn**

Download the Reuters Corpus and count the words in each of 6 pre-selected categories.

## 1.3. Importing structured text (datasets)

In this new section, we will analyse structured text. To begin with, we need to import pandas, which is the package used in Python to analyse dataframes (datasets):

In [None]:
import pandas as pd

Next, we will read the news dataset, which is inside the data folder. We will name it <tt>df</tt> (short for dataframe):

In [None]:
df = pd.read_csv('../data/news.csv')

Take a look at the first five rows of <tt>df</tt>:

In [None]:
df.head(5)

Let's analyse the <tt>text</tt> column:

In [None]:
df['text']
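
As a first step in analysing a text column, you might count the words in each row. The sketch below does this on a tiny in-memory CSV so that it runs without the data folder. The column names `title` and `text` and the rows themselves are assumptions made for illustration; the real `news.csv` may be laid out differently.

```python
import io
import pandas as pd

# Hypothetical CSV content; the real news.csv may have different columns
csv_text = io.StringIO(
    "title,text\n"
    "Headline A,First short article.\n"
    "Headline B,Second article with a few more words.\n"
)

toy_df = pd.read_csv(csv_text)

print(toy_df.columns.tolist())  # column names
print(type(toy_df["text"]))     # a single column is a pandas Series

# Number of whitespace-separated words in each row of the text column
toy_df["n_words"] = toy_df["text"].str.split().str.len()
print(toy_df[["title", "n_words"]])
```

The same `.str` methods can be applied to `df['text']` once you have checked which columns your dataset actually contains.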

### Resources

📕 Bengfort, B., Bilbro, R., & Ojeda, T. (2018). *Applied text analysis with python: Enabling language-aware data products with machine learning.* O'Reilly Media, Inc.

🌍 https://course.spacy.io/en/chapter1

🌍 https://www.nltk.org/book/ch02.html