# 1. Import text data

This notebook will introduce you to the basics of importing texts. 
You'll learn about two kinds of data structures: corpora and datasets.







Legend of symbols:

- 🤓: Tips

- 🤖📝: Your turn

- ❓: Question

- 💫: Extra exercise

## 1.1. Very Basic Tutorial for Jupyter Notebook

Let's begin this tutorial by printing the "Hello World!" example. To do so, we will use the **<tt>print</tt>** function:

In [1]:
print("Hello World!")

Hello World!


Let's try writing **<tt>Hello World!</tt>** several times:

In [2]:
print("Hello World!")
print()
print("Hello World!")

Hello World!

Hello World!


Now, let's print a list of numbers:

In [3]:
int_list = [1,2,3,4,5,6]
print(int_list)

[1, 2, 3, 4, 5, 6]


In Python, variables can hold values of different types:

In [4]:
sent = "Hello World!" # This is a string
int_list = [1,2,3,4,5,6] # This is a list of integers

Use the **<tt>type</tt>** function to get a variable's type.

In [5]:
type(sent)

str

In [6]:
type(int_list)

list

And finally, we can also have a list of strings:

In [7]:
city_list = ["London", "Granada", "Bagdad", "Lang Tang", "Lucca", "Budapest"] # This is a list of strings

In [8]:
type(city_list)

list

🤓 We use **<tt>list[x]</tt>** to get the element at position **<tt>x</tt>** of a list (positions start at 0):

In [9]:
city_list[0]

'London'

In [10]:
type(city_list[0])

str
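🤓 Indexing also works from the end of a list and over ranges of positions. The short sketch below reuses <tt>city_list</tt> from above; the other variable names are only illustrative.

```python
city_list = ["London", "Granada", "Bagdad", "Lang Tang", "Lucca", "Budapest"]

first_city = city_list[0]      # positions start at 0
last_city = city_list[-1]      # negative positions count from the end
first_three = city_list[0:3]   # a slice excludes its end position (3)
n_cities = len(city_list)      # number of elements in the list

print(first_city, last_city, first_three, n_cities)
```

The same "end position excluded" rule applies later in this notebook when rows of a dataframe are sliced by position.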

## 1.2. Importing unstructured data (Corpus)

Because text analysis techniques are primarily applied machine learning, a language that has rich scientific and numeric computing libraries is necessary. When it comes to tools for performing machine learning on text, Python has a powerhouse suite that includes NLTK, Gensim, and spaCy:
    
- **NLTK**, the Natural Language Tool-Kit, is a “batteries included” resource for NLP written in Python by experts in academia. Originally a pedagogical tool for teaching NLP, it contains corpora, lexical resources, grammars, language processing algorithms, and pretrained models that allow Python programmers to quickly get started processing text data in a variety of languages. 👉 https://www.nltk.org/

- **Gensim** is a robust, efficient, and hassle-free library that focuses on unsupervised semantic modeling of text. Originally designed to find similarity between documents (generate similarity), it now exposes topic modeling methods for latent semantic techniques, and includes other unsupervised libraries such as word2vec. 👉 https://radimrehurek.com/gensim/

- **spaCy** provides production-grade language processing by implementing the academic state-of-the-art into a simple and easy-to-use API. In particular, spaCy focuses on preprocessing text for deep learning or to build information extraction or natural language understanding systems on large volumes of text. 👉 https://spacy.io/


### 1.2.1. Introduction to spaCy

We'll load an English pipeline and store it in a variable called **<tt>nlp</tt>**.

In [11]:
# Install spaCy and download the small English model
! pip install spacy
! python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')




When you process a text with the nlp object, spaCy creates a Doc object (short for "document"). The Doc lets you access information about the text in a structured way, and no information is lost.

In [12]:
# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

Hello
world
!
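To get a feel for what splitting a text into tokens means, here is a rough stand-in written with Python's built-in <tt>re</tt> module. It is not spaCy's tokenizer, which handles many more cases (abbreviations, contractions, URLs and so on); <tt>simple_tokenize</tt> is a hypothetical helper name used only for this illustration.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+       matches a run of letters, digits or underscores
    # [^\w\s]   matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Hello world!")
print(tokens)  # ['Hello', 'world', '!']
```

For this simple sentence the result matches the three tokens printed by spaCy above; on real text the two will often differ.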


### ü§ñüìù **Your turn**

Try out some of the 55+ available languages: https://spacy.io/usage/models#languages.

- Import a language class from its <tt>spacy.lang</tt> module (for example, <tt>Spanish</tt> from <tt>spacy.lang.es</tt>) and create a new <tt>nlp</tt> object.
- Create a <tt>doc</tt> and print its text.


In [16]:
# Import the language class
from spacy.lang.es import Spanish

# Create the nlp object from the imported class
nlp = Spanish()

# Process a text
doc = nlp("Escribo mi frase en castellano.")

# Print the document text
print(doc.text)

Escribo mi frase en castellano.


### 1.2.2. Introduction to NLTK

#### Gutenberg Corpus

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/. We begin by loading the NLTK package and downloading the Gutenberg corpus; afterwards, <tt>nltk.corpus.gutenberg.fileids()</tt> lists the file identifiers in this corpus:

In [17]:
# Import NLTK
! pip install nltk
import nltk
# Download the Gutenberg package, then import the corpus reader
nltk.download('gutenberg')
from nltk.corpus import gutenberg



[nltk_data] Downloading package gutenberg to
[nltk_data]     /home/avaldivia/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

Let's pick out the first of these texts (Emma by Jane Austen):

In [18]:
emma_raw = nltk.corpus.gutenberg.raw('austen-emma.txt')

And print it:

In [19]:
print(emma_raw)

[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.  Between _them_ it was more the intimacy
of sisters.  Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness o

Now let's load the same text, Emma by Jane Austen, as a sequence of words, give it a short name, **<tt>emma_words</tt>**, and then find out how many words it contains:

In [20]:
emma_words = nltk.corpus.gutenberg.words('austen-emma.txt')

In [21]:
print(emma_words)

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]


❓ Which is the first element of the list?

In [22]:
emma_words[0]

'['

🤓 **<tt>emma_words</tt>** is an NLTK corpus view whose elements are strings:

In [23]:
type(emma_words)

nltk.corpus.reader.util.StreamBackedCorpusView

In [24]:
type(emma_words[0])

str

❓ How many words does this corpus have?

In [25]:
len(emma_words)

192427

🤓 The previous example, **<tt>nltk.corpus.gutenberg.words</tt>**, also showed how we can access the raw text split up into tokens.

Now, let's try another function for sentences:

In [26]:
emma_sents = nltk.corpus.gutenberg.sents('austen-emma.txt')

In [27]:
print(emma_sents)

[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I'], ...]


❓ How many sentences does this corpus have?

In [28]:
len(emma_sents)

7752

🤓 In this case, **<tt>nltk.corpus.gutenberg.sents</tt>** showed how we can get the text split up into sentences.
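The two views are related: the sentence view is a list of sentences, and each sentence is itself a list of word tokens. The sketch below mimics that nested shape with the first two sentences printed above, so it runs without NLTK; <tt>toy_sents</tt> and the other names are illustrative only.

```python
# First two "sentences" of Emma, copied from the output above
toy_sents = [
    ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'],
    ['VOLUME', 'I'],
]

# Flatten the nested list to get back the word-level view
toy_words = [word for sent in toy_sents for word in sent]

n_sents = len(toy_sents)
n_words = len(toy_words)
avg_sent_len = n_words / n_sents   # average number of tokens per sentence

print(n_sents, n_words, avg_sent_len)  # 2 9 4.5
```

Applied to the counts obtained above for Emma, the same ratio is 192427 / 7752, or roughly 24.8 tokens per sentence.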

### ü§ñüìù **Your turn**

Import **<tt>melville-moby_dick.txt</tt>** and extract (1) the number of words and (2) the number of sentences of this corpus.

In [41]:
mobydick_raw = nltk.corpus.gutenberg.raw("melville-moby_dick.txt")
print(mobydick_raw)

[Moby Dick by Herman Melville 1851]


ETYMOLOGY.

(Supplied by a Late Consumptive Usher to a Grammar School)

The pale Usher--threadbare in coat, heart, body, and brain; I see him
now.  He was ever dusting his old lexicons and grammars, with a queer
handkerchief, mockingly embellished with all the gay flags of all the
known nations of the world.  He loved to dust his old grammars; it
somehow mildly reminded him of his mortality.

"While you take in hand to school others, and to teach them by what
name a whale-fish is to be called in our tongue leaving out, through
ignorance, the letter H, which almost alone maketh the signification
of the word, you deliver that which is not true." --HACKLUYT

"WHALE. ... Sw. and Dan. HVAL.  This animal is named from roundness
or rolling; for in Dan. HVALT is arched or vaulted." --WEBSTER'S
DICTIONARY

"WHALE. ... It is more immediately from the Dut. and Ger. WALLEN;
A.S. WALW-IAN, to roll, to wallow." --RICHARDSON'S DICTIONARY


In [36]:
# Counting number of words
mobydick_words = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
len(mobydick_words)

260819

In [39]:
# Counting number of sentences
mobydick_sent = nltk.corpus.gutenberg.sents("melville-moby_dick.txt")
len(mobydick_sent)

10059

❓ Do you think that the Gutenberg corpus is annotated or unannotated?

Unannotated

#### Brown Corpus

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on. This table gives an example of each genre (for a complete list, see http://icame.uib.no/brown/bcm-los.html):

<img src="table_brown.png">

In [46]:
nltk.download('brown')  # fetch the corpus data if it is not installed yet
from nltk.corpus import brown

In [47]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

❓ Do you think that the Brown Corpus is annotated or unannotated?

#### 💫 Counting Words by Genre

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics. 
Let's compare genres in their usage of modal verbs.
The first step is to produce the counts for a particular genre. 

In [48]:
news_text = brown.words(categories='news')

In [49]:
print(news_text)

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]


In [50]:
fdist = nltk.FreqDist(w.lower() for w in news_text)

In [51]:
modals = ['can', 'could', 'may', 'might', 'must', 'will']

In [52]:
for m in modals:
    print(m + ':', fdist[m], end=' ')

can: 94 could: 87 may: 93 might: 38 must: 53 will: 389 
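A frequency distribution is simply a table of how often each item occurs. In NLTK 3, <tt>FreqDist</tt> is built on Python's <tt>collections.Counter</tt>, so the idea can be sketched with the standard library alone. The token list below is made up for illustration and is not taken from the Brown Corpus.

```python
from collections import Counter

# Hypothetical toy tokens (not real corpus data)
tokens = ["The", "jury", "said", "it", "will", "meet", "and", "Will", "may", "attend"]

# Lowercase each token before counting, as in the cell above
counts = Counter(w.lower() for w in tokens)

modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    # A Counter returns 0 for items it has never seen
    print(m + ':', counts[m], end=' ')
print()
```

Lowercasing first means that "will" and "Will" are counted together, which is why the loop reports 2 for <tt>will</tt> here.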

Next, we need to obtain counts for each genre of interest. We'll use NLTK's support for conditional frequency distributions. These are presented systematically in Section 2 of the NLTK book chapter listed under Resources, which also unpicks the following code line by line. For the moment, you can ignore the details and just concentrate on the output.

In [53]:
cfd = nltk.ConditionalFreqDist(
           (genre, word)
           for genre in brown.categories()
           for word in brown.words(categories=genre))

In [54]:
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)

                  can could   may might  must  will 
           news    93    86    66    38    50   389 
       religion    82    59    78    12    54    71 
        hobbies   268    58   131    22    83   264 
science_fiction    16    49     4    12     8    16 
        romance    74   193    11    51    45    43 
          humor    16    30     8     8     9    13 
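A conditional frequency distribution keeps one frequency table per condition (here, per genre). As a rough mental model, and not NLTK's actual implementation, it behaves like a dictionary that maps each genre to its own counter. The (genre, word) pairs below are invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical (genre, word) pairs standing in for the Brown Corpus
pairs = [
    ("news", "will"), ("news", "can"), ("news", "will"),
    ("romance", "could"), ("romance", "could"), ("romance", "will"),
]

# One Counter per genre, created on first use
cfd_sketch = defaultdict(Counter)
for genre, word in pairs:
    cfd_sketch[genre][word] += 1

# Build one row of counts per genre, similar in spirit to cfd.tabulate(...)
samples = ["can", "could", "will"]
rows = {genre: [cfd_sketch[genre][s] for s in samples] for genre in ["news", "romance"]}

for genre, row in rows.items():
    print(genre, row)
```

Each row of the table printed by <tt>cfd.tabulate</tt> above corresponds to one such per-genre counter, restricted to the chosen samples.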


### ü§ñüìù **Your turn**

Download the Reuters Corpus and count the words in each of 6 pre-selected categories.

In [55]:
# Download the Reuters corpus, then import the corpus reader
nltk.download('reuters')
from nltk.corpus import reuters

In [58]:
# How many categories are there?
reuters.categories()

['acq',
 'alum',
 'barley',
 'bop',
 'carcass',
 'castor-oil',
 'cocoa',
 'coconut',
 'coconut-oil',
 'coffee',
 'copper',
 'copra-cake',
 'corn',
 'cotton',
 'cotton-oil',
 'cpi',
 'cpu',
 'crude',
 'dfl',
 'dlr',
 'dmk',
 'earn',
 'fuel',
 'gas',
 'gnp',
 'gold',
 'grain',
 'groundnut',
 'groundnut-oil',
 'heat',
 'hog',
 'housing',
 'income',
 'instal-debt',
 'interest',
 'ipi',
 'iron-steel',
 'jet',
 'jobs',
 'l-cattle',
 'lead',
 'lei',
 'lin-oil',
 'livestock',
 'lumber',
 'meal-feed',
 'money-fx',
 'money-supply',
 'naphtha',
 'nat-gas',
 'nickel',
 'nkr',
 'nzdlr',
 'oat',
 'oilseed',
 'orange',
 'palladium',
 'palm-oil',
 'palmkernel',
 'pet-chem',
 'platinum',
 'potato',
 'propane',
 'rand',
 'rape-oil',
 'rapeseed',
 'reserves',
 'retail',
 'rice',
 'rubber',
 'rye',
 'ship',
 'silver',
 'sorghum',
 'soy-meal',
 'soy-oil',
 'soybean',
 'strategic-metal',
 'sugar',
 'sun-meal',
 'sun-oil',
 'sunseed',
 'tea',
 'tin',
 'trade',
 'veg-oil',
 'wheat',
 'wpi',
 'yen',
 'zinc']

In [63]:
# Count words per 6 pre-selected categories: 

#cocoa
cocoa_text = reuters.words(categories='cocoa')
print(cocoa_text)
len(cocoa_text)

['COCOA', 'EXPORTERS', 'EXPECTED', 'TO', 'LIMIT', ...]


11306

In [64]:
#gas
gas_text = reuters.words(categories='gas')
print(gas_text)
len(gas_text)

['FINNS', 'AND', 'CANADIANS', 'TO', 'STUDY', 'MTBE', ...]


11306

In [65]:
#jet
jet_text = reuters.words(categories='jet')
print(jet_text)
len(jet_text)

['BANGLADESH', 'TENDERS', 'FOR', 'TWO', 'MLN', ...]


548

In [66]:
#lei
lei_text = reuters.words(categories='lei')
print(lei_text)
len(lei_text)

['CANADA', 'LEADING', 'INDICATOR', 'UP', '0', '.', '4', ...]


2020

In [67]:
#rubber
rubber_text = reuters.words(categories='rubber')
print(rubber_text)
len(rubber_text)

['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]


13452

In [68]:
#sugar
sugar_text = reuters.words(categories='sugar')
print(sugar_text)
len(sugar_text)

['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]


40047

## 1.3. Importing structured text (datasets)

In this new section, we will analyse structured text. To begin with, we need to import pandas, which is the Python package used to analyse dataframes or datasets:

In [69]:
! pip install pandas
import pandas as pd

Next, we will read the news dataset, which is inside the data folder. We will name it <tt>df</tt> (short for dataframe):

In [111]:
df = pd.read_csv('../data/news.csv')

Take a look at the first five rows of df:

In [71]:
df.head(5)

|   | topic | media | corpus | headline | link |
|---|---|---|---|---|---|
| 0 | climatic | The Guardian | The reindeer is the emblematic Christmas anima... | Weatherwatch: reindeer adapted to snow but not... | https://www.theguardian.com/world/2019/dec/23/... |
| 1 | climatic | The Guardian | The European parliament is split over whether ... | European parliament split on declaring climate... | https://www.theguardian.com/world/2019/nov/26/... |
| 2 | climatic | The Guardian | Fisayo Soyombo was eating an evening snack in ... | ‘Climate of fear’: Nigeria intensifies crackdo... | https://www.theguardian.com/world/2019/nov/14/... |
| 3 | climatic | The Guardian | The European Union considers itself as a leade... | EU's soaring climate rhetoric not always match... | https://www.theguardian.com/world/2019/dec/11/... |
| 4 | climatic | The Guardian | Good morning, we’re now exactly two weeks out ... | Thursday briefing: Political climate too hot f... | https://www.theguardian.com/world/2019/nov/28/... |


Let's analyse the text column, <tt>corpus</tt>:

In [72]:
df['corpus']

0       The reindeer is the emblematic Christmas anima...
1       The European parliament is split over whether ...
2       Fisayo Soyombo was eating an evening snack in ...
3       The European Union considers itself as a leade...
4       Good morning, we’re now exactly two weeks out ...
5       Global heating is “supercharging” an increasin...
6       In the arid lands that have seen one of the mo...
7       Hundreds of demonstrators call for internation...
8       Country will become first to make study of glo...
9       The European parliament has declared a global ...
10      Climate breakdown played a key role in at leas...
11      Capacity rose by 42.9GW in 18 months, far outp...
12      The EU’s trade deal with four South American c...
13      Christian Porter accuses Market Forces of tryi...
14      This article is more than 7 months oldThis art...
15      Bank president indicates she will move bank be...
16      A sample group of 150 French citizens — from u...
17  

### 1.3.1. iloc

The **<tt>iloc</tt>** indexer for a pandas DataFrame is used for integer-location based indexing, that is, selection by position.

The iloc indexer syntax is <tt>data.iloc[row selection, column selection]</tt>, which is sure to be a source of confusion for R users. “iloc” in pandas is used to **select rows and columns by number**, in the order that they appear in the data frame. You can imagine that each row has a row number from 0 to the total number of rows minus one (**<tt>data.shape[0] - 1</tt>**), and **<tt>iloc[]</tt>** allows selections based on these numbers. The same applies to columns (ranging from 0 to **<tt>data.shape[1] - 1</tt>**).

There are two “arguments” to iloc: a row selector and a column selector. For example:

In [73]:
# Single selections using iloc and DataFrame

# Rows:
df.iloc[0] # first row of data frame.
df.iloc[1] # second row of data frame.
df.iloc[-1] # last row of data frame.

# Columns:
df.iloc[:,0] # first column of data frame.
df.iloc[:,1] # second column of data frame.
df.iloc[:,-1] # last column of data frame.

0       https://www.theguardian.com/world/2019/dec/23/...
1       https://www.theguardian.com/world/2019/nov/26/...
2       https://www.theguardian.com/world/2019/nov/14/...
3       https://www.theguardian.com/world/2019/dec/11/...
4       https://www.theguardian.com/world/2019/nov/28/...
5       https://www.theguardian.com/global-development...
6       https://www.theguardian.com/world/2019/dec/18/...
7       https://www.theguardian.com/world/2019/dec/14/...
8       https://www.theguardian.com/global-development...
9       https://www.theguardian.com/world/2019/nov/28/...
10      https://www.theguardian.com/world/2019/dec/27/...
11      https://www.theguardian.com/world/2019/nov/20/...
12      https://www.theguardian.com/world/2019/dec/09/...
13      https://www.theguardian.com/world/2019/nov/05/...
14      https://www.theguardian.com/world/2019/nov/04/...
15      https://www.theguardian.com/world/2019/dec/02/...
16      https://www.theguardian.com/world/2019/oct/02/...
17      https:
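Only the last expression of a notebook cell is displayed, which is why the output above shows just the last column. To see several iloc selections side by side, here is a small self-contained sketch. It uses a made-up three-column dataframe (<tt>toy</tt>) rather than the news dataset, and all values in it are placeholders.

```python
import pandas as pd

# Hypothetical toy data standing in for the news dataset
toy = pd.DataFrame({
    "topic": ["climatic", "climatic", "modern slavery", "modern slavery"],
    "media": ["The Guardian"] * 4,
    "headline": ["h0", "h1", "h2", "h3"],
})

n_rows, n_cols = toy.shape           # number of rows and columns
first_row = toy.iloc[0]              # one row, returned as a Series
first_two_rows = toy.iloc[0:2]       # rows 0 and 1; the end position 2 is excluded
picked = toy.iloc[[0, 3], [0, 2]]    # rows 0 and 3, columns 0 and 2
last_cell = toy.iloc[-1, -1]         # last row, last column

print(first_two_rows)
print(picked)
print(last_cell)
```

Note that a slice such as <tt>0:2</tt> returns two rows, not three, so "the first five rows" is written <tt>0:5</tt>.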

### ü§ñüìù **Your turn**

Extract:
- First five rows of df
- First two columns of df with all rows
- 1st, 4th, 7th, 25th row + 1st 2nd 4th columns
- First 5 rows and 3rd and 4th columns of data frame

In [105]:
# First five rows of df
df.iloc[0:5]

|   | topic | media | corpus | headline | link |
|---|---|---|---|---|---|
| 0 | climatic | The Guardian | The reindeer is the emblematic Christmas anima... | Weatherwatch: reindeer adapted to snow but not... | https://www.theguardian.com/world/2019/dec/23/... |
| 1 | climatic | The Guardian | The European parliament is split over whether ... | European parliament split on declaring climate... | https://www.theguardian.com/world/2019/nov/26/... |
| 2 | climatic | The Guardian | Fisayo Soyombo was eating an evening snack in ... | ‘Climate of fear’: Nigeria intensifies crackdo... | https://www.theguardian.com/world/2019/nov/14/... |
| 3 | climatic | The Guardian | The European Union considers itself as a leade... | EU's soaring climate rhetoric not always match... | https://www.theguardian.com/world/2019/dec/11/... |
| 4 | climatic | The Guardian | Good morning, we’re now exactly two weeks out ... | Thursday briefing: Political climate too hot f... | https://www.theguardian.com/world/2019/nov/28/... |


In [106]:
# First two columns of df with all rows
df.iloc[:,0:2]

|   | topic | media |
|---|---|---|
| 0 | climatic | The Guardian |
| 1 | climatic | The Guardian |
| 2 | climatic | The Guardian |
| 3 | climatic | The Guardian |
| 4 | climatic | The Guardian |
| 5 | climatic | The Guardian |
| 6 | climatic | The Guardian |
| 7 | climatic | The Guardian |
| 8 | climatic | The Guardian |
| 9 | climatic | The Guardian |


In [107]:
# 1st, 4th, 7th, 25th row + 1st 2nd 4th columns
df.iloc[[0,3,6,24],[0,1,3]]

|   | topic | media | headline |
|---|---|---|---|
| 0 | climatic | The Guardian | Weatherwatch: reindeer adapted to snow but not... |
| 3 | climatic | The Guardian | EU's soaring climate rhetoric not always match... |
| 6 | climatic | The Guardian | How water is helping to end 'the first climate... |
| 24 | climatic | The Guardian | School strikers try to unite divided Belgium o... |


In [108]:
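# First 5 rows and 3rd and 4th columns of data frame
# (the slice end is excluded, so 0:5 gives rows 0 to 4; positions 2 and 3 are the 3rd and 4th columns)
df.iloc[0:5,[2,3]]

|   | corpus | headline |
|---|---|---|
| 0 | The reindeer is the emblematic Christmas anima... | Weatherwatch: reindeer adapted to snow but not... |
| 1 | The European parliament is split over whether ... | European parliament split on declaring climate... |
| 2 | Fisayo Soyombo was eating an evening snack in ... | ‘Climate of fear’: Nigeria intensifies crackdo... |
| 3 | The European Union considers itself as a leade... | EU's soaring climate rhetoric not always match... |
| 4 | Good morning, we’re now exactly two weeks out ... | Thursday briefing: Political climate too hot f... |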


### 1.3.2. loc

The pandas **<tt>loc</tt>** indexer can be used with DataFrames for two different use cases:

- i) Selecting rows by label/index
- ii) Selecting rows with a boolean / conditional lookup

The **<tt>loc</tt>** indexer is used with the same syntax as iloc: **<tt>data.loc[row selection, column selection]</tt>**.

In [112]:
df.loc[0, 'corpus']

'The reindeer is the emblematic Christmas animal and, while not exactly magical, it is among the best adapted to snowy conditions.For a start, a reindeer’s feet have four toes with dewclaws that spread out to distribute its weight like snowshoes, and are equipped with sharp hooves for digging in snow.A reindeer’s nose warms the air on its way to the lungs, cooling it again before it is exhaled. As well as retaining heat, this helps prevent water from being lost as vapour. This is why reindeer breath does not steam like human and horse breath.A reindeer’s thick double-layered coat is so efficient that it is more likely to overheat than get too cold, especially when running. When this happens, reindeer pant like dogs to cool down, bypassing the nasal heat exchanger.Snowfields may be featureless to human eyes, but reindeer are sensitive to ultraviolet light, an evolutionary development that only occurred after the animals moved to Arctic regions. Snow reflects ultraviolet, so this u

We can also pass a conditional expression to <tt>loc</tt>, for example to extract all news items whose topic is "modern slavery":

In [113]:
df.loc[df['topic'] == 'modern slavery']

|   | topic | media | corpus | headline | link |
|---|---|---|---|---|---|
| 184 | modern slavery | The Guardian | CCLA says UK firms should develop anti-slavery... | Charity fund manager moves to tackle modern sl... | https://www.theguardian.com/world/2019/nov/17/... |
| 185 | modern slavery | The Guardian | Mayflower 400 is commemorating the Mayflower v... | Mayflower 400 is ignoring slavery \| Letters | https://www.theguardian.com/world/2019/nov/08/... |
| 186 | modern slavery | The Guardian | Arrests and prosecutions remain thin on the gr... | UK modern slavery helpline receives over 7,000... | https://www.theguardian.com/global-development... |
| 187 | modern slavery | The Guardian | This article is more than 7 months oldThis art... | Uniqlo accused of mocking wartime sexual slave... | https://www.theguardian.com/world/2019/oct/21/... |
| 188 | modern slavery | The Guardian | Thank you for the article by Afua Hirsch (Brit... | Britain’s despicable history of slavery needs ... | https://www.theguardian.com/world/2019/oct/24/... |
| 189 | modern slavery | The Guardian | The government is facing legal action to try a... | Lawyers challenge UK imports of 'slavery-taint... | https://www.theguardian.com/global-development... |
| 190 | modern slavery | The Guardian | A charity attempting to have a memorial built ... | UK government refuses to fund slavery memorial... | https://www.theguardian.com/world/2019/dec/10/... |
| 191 | modern slavery | The Guardian | While forced labour and slavery in the fishing... | We can't allow Myanmar’s slavery-tainted shrim... | https://www.theguardian.com/global-development... |
| 192 | modern slavery | The Guardian | A survivor’s graphic memoir and a feature film... | 'Such brutality': tricked into slavery in the ... | https://www.theguardian.com/world/2019/sep/21/... |
| 193 | modern slavery | The Guardian | Sadiq Khan has endorsed proposals for a Britis... | Sadiq Khan backs London slavery museum to chal... | https://www.theguardian.com/world/2019/aug/11/... |
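Under the hood, the condition inside the brackets is evaluated first and produces one True/False value per row; <tt>loc</tt> then keeps the rows marked True. The sketch below makes that intermediate step visible on a small made-up dataframe (<tt>toy</tt>, with placeholder headlines), not on the news dataset.

```python
import pandas as pd

# Hypothetical toy data; headlines are placeholders
toy = pd.DataFrame({
    "topic": ["climatic", "modern slavery", "climatic", "modern slavery"],
    "headline": ["h0", "h1", "h2", "h3"],
})

mask = toy["topic"] == "modern slavery"   # boolean Series, one value per row
selected = toy.loc[mask]                  # keeps only the rows where mask is True
headlines = toy.loc[mask, "headline"]     # same rows, restricted to one column

print(mask.tolist())
print(selected)
print(headlines.tolist())
```

The selected rows keep their original index labels (1 and 3 here), just as the modern slavery rows above keep labels 184 to 193.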


### ü§ñüìù **Your turn**

- Extract the first row of the modern slavery topic.
- Extract the media of the first row of the modern slavery topic.

In [118]:
# Extract the first row of modern slavery topic.

# create a new df with only modern slavery news
df_modern_slavery = df.loc[df['topic'] == 'modern slavery']


# 184 is the index label of the first modern slavery row (see the table above)
df_modern_slavery.loc[184]

topic                                          modern slavery
media                                            The Guardian
corpus      CCLA says UK firms should develop anti-slavery...
headline    Charity fund manager moves to tackle modern sl...
link        https://www.theguardian.com/world/2019/nov/17/...
Name: 184, dtype: object

In [122]:
# Extract the media of the first row of modern slavery topic.
df_modern_slavery.loc[184, "media"]

'The Guardian'
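The solution above relies on knowing that the first matching row carries the label 184. If the label is not known in advance, the first row can be looked up instead of hard-coded. This is a sketch on a made-up dataframe (<tt>toy</tt>); the variable names are illustrative.

```python
import pandas as pd

# Hypothetical toy data standing in for the news dataset
toy = pd.DataFrame({
    "topic": ["climatic", "climatic", "modern slavery", "modern slavery"],
    "media": ["The Guardian"] * 4,
})

subset = toy.loc[toy["topic"] == "modern slavery"]

first_label = subset.index[0]                  # filtering keeps the original labels
by_label = subset.loc[first_label, "media"]    # label-based lookup with loc
by_position = subset.iloc[0]["media"]          # position-based lookup with iloc

# Optional: renumber the rows from 0 after filtering
renumbered = subset.reset_index(drop=True)

print(first_label, by_label, by_position)
print(list(renumbered.index))
```

Both lookups return the same value; <tt>iloc</tt> is convenient when only the position matters, while <tt>loc</tt> is the better fit when the labels carry meaning.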

### Resources

📕 Bengfort, B., Bilbro, R., & Ojeda, T. (2018). *Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning.* O'Reilly Media, Inc.

🌍 https://course.spacy.io/en/chapter1

🌍 https://www.nltk.org/book/ch02.html

🌍 https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/

