## Exploring some uses of regex with the Sherlock Holmes stories

In [24]:
# The re module is part of Python's standard library and works on plain strings, which is why it is widely used for text processing
import re

In [25]:
# for linux and mac, uncomment and run the following line:
# !wget https://sherlock-holm.es/stories/plain-text/cnus.txt
# for windows, use your browser to download txt file and place in working directory

# Load the text into memory

In [26]:
text = ''
with open('cnus.txt','r') as f:
    text = " ".join([l.strip() for l in f.readlines()])
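Joining stripped lines with a single space leaves a run of two or more spaces wherever the file had a blank line. The sketch below shows one optional way to collapse those runs with `re.sub`. The `lines` list is a made-up stand-in for `f.readlines()`, and normalizing the real text would shift the character offsets used in later slices, so treat it as an illustration rather than a required step.

```python
import re

# Hypothetical stand-in for the list returned by f.readlines()
lines = [
    "and at once entered upon my new duties.\n",
    "\n",
    "The campaign brought honours\n",
]

# Same joining strategy as the loading cell: strip each line, join with a space
joined = " ".join(l.strip() for l in lines)

# Collapse every run of whitespace into a single space
normalized = re.sub(r"\s+", " ", joined).strip()

print(repr(joined))
print(repr(normalized))
```

Keeping the double spaces can also be useful, since they mark where paragraphs were separated in the file.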

In [6]:
text[2611:3000]

"On landing at Bombay, I learned that my corps had advanced through the passes, and was already deep in the enemy's country. I followed, however, with many other officers who were in the same situation as myself, and succeeded in reaching Candahar in safety, where I found my regiment, and at once entered upon my new duties.  The campaign brought honours and promotion to many, but for me "

## Question 1

One of Sherlock Holmes's famous catchphrases is the word "undoubtedly".

* How many times is the word 'undoubtedly' used?

In [7]:
# the word 'undoubtedly' appears only 43 times
p = re.compile('undoubtedly') # matching is case-sensitive by default
len(p.findall(text))

43
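Because the match is case-sensitive, a sentence that starts with "Undoubtedly" is not counted. A minimal sketch of a case-insensitive count follows, using the `re.IGNORECASE` flag and `\b` word boundaries. The `sample` string is invented for illustration, so the numbers say nothing about the real text.

```python
import re

# Invented sample; not taken from the stories
sample = "Undoubtedly so. It is undoubtedly true, and UNDOUBTEDLY odd."

# Case-sensitive: only the all-lowercase form matches
case_sensitive = re.findall(r"undoubtedly", sample)

# Case-insensitive, with \b so the word is not matched inside a longer word
case_insensitive = re.findall(r"\bundoubtedly\b", sample, flags=re.IGNORECASE)

print(len(case_sensitive), len(case_insensitive))
```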

## Question 2

Characters are addressed formally, in keeping with the Victorian England setting. We can use this later to find characters in the book. For now, let's practice on a character we already know.

How many times is Sherlock Holmes referred to as 'Mr. Sherlock Holmes' vs 'Sherlock Holmes' vs 'Mr. Holmes' vs 'Sherlock'?

In [8]:
# let's start with the simplest thing
# and just find the occurrences of Sherlock Holmes
p = re.compile('Sherlock Holmes')
len(p.findall(text))

361

In [9]:
# an easy way to solve this
# is to use the 'or' operator | with every pattern we want to match;
# the first alternative that matches at a given position wins, so longer forms are listed first
p = re.compile(r'Mr\. Sherlock Holmes|Sherlock Holmes|Mr\. Holmes|Sherlock|Holmes')
results = p.findall(text)
counts = {}
for r in results:
    if r in counts.keys():
        counts[r] += 1
    else:
        counts[r] = 1
        
counts

{'Holmes': 1646,
 'Mr. Holmes': 496,
 'Mr. Sherlock Holmes': 93,
 'Sherlock': 22,
 'Sherlock Holmes': 268}
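The order of the alternatives matters: at each position Python's `re` tries them left to right and keeps the first one that succeeds. The sketch below compares a longest-first and a shortest-first ordering on an invented sentence, and uses `collections.Counter` as a shorter alternative to the manual counting loop. The `sample` string and the variable names are illustrative assumptions.

```python
import re
from collections import Counter

# Invented sample; not taken from the stories
sample = "Mr. Sherlock Holmes met Mr. Holmes. Sherlock Holmes smiled. Holmes left."

longest_first = re.compile(
    r"Mr\. Sherlock Holmes|Sherlock Holmes|Mr\. Holmes|Sherlock|Holmes"
)
shortest_first = re.compile(
    r"Holmes|Sherlock|Mr\. Holmes|Sherlock Holmes|Mr\. Sherlock Holmes"
)

# Counter tallies the matches without an explicit if/else
counts_long = Counter(longest_first.findall(sample))
counts_short = Counter(shortest_first.findall(sample))

print(counts_long)
print(counts_short)
```

With the shortest-first ordering, "Sherlock Holmes" is split into a "Sherlock" match followed by a "Holmes" match, so the full name is never counted on its own.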

How fitting.

One thing to remember with regex is that there is rarely a single correct way to do things.

Another strategy we could have tried is to build groups of optional matches.

In [15]:
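p = re.compile(r'((Mr\.\s)?(Sherlock\s)?(Holmes)?)')
results = p.findall(text)
counts = {}
for r in results:
    # every group is optional, so the pattern also matches the empty string; skip those
    if r[0]:
        # caveat: the test uses the unstripped match while the key is stripped, so a match
        # that ends in whitespace (e.g. 'Sherlock ') never finds its key and is reset to 1
        if r[0] in counts.keys():
            counts[r[0].strip()] += 1 # strip removes whitespace at the beginning and end of the string
        else:
            counts[r[0].strip()] = 1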
        
counts

{'Holmes': 1646,
 'Mr.': 1,
 'Mr. Holmes': 496,
 'Mr. Sherlock': 1,
 'Mr. Sherlock Holmes': 93,
 'Sherlock': 1,
 'Sherlock Holmes': 268}
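The counts of 1 for 'Mr.', 'Mr. Sherlock' and 'Sherlock' deserve a second look. In the loop, the membership test uses the raw match while the dictionary key is the stripped match, so a match with a trailing space never finds its key and is set back to 1 each time. The sketch below reproduces that behaviour on an invented sentence and compares it with a version that strips before counting. The `sample` string and the names `loop_counts` and `stripped_counts` are illustrative assumptions.

```python
import re
from collections import Counter

# Invented sample; not taken from the stories
sample = (
    "Mr. Wilson called. Mr. Holmes and Sherlock Holmes agreed. "
    "Sherlock smiled at Mr. Wilson."
)
p = re.compile(r"((Mr\.\s)?(Sherlock\s)?(Holmes)?)")

# Same logic as the loop in the cell: test the raw match, store the stripped key
loop_counts = {}
for r in p.findall(sample):
    if r[0]:
        if r[0] in loop_counts:
            loop_counts[r[0].strip()] += 1
        else:
            loop_counts[r[0].strip()] = 1

# Strip first, then count, so the test and the key always agree
stripped_counts = Counter(r[0].strip() for r in p.findall(sample) if r[0])

print(loop_counts)
print(stripped_counts)
```

In this sample 'Mr. ' occurs twice before a name other than Holmes: the loop reports 1 while the stripped count reports 2.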

## Question 3
Find all the doctors in the collection

Make a list of all the characters who appear in the collection (hint: Mrs., Miss, Dr., etc.)

In [16]:
# a title built from character classes (Mr., Mrs., Miss, Dr.), a space, then a capitalized word
p = re.compile(r'[MD][irs][s\.]?[s\.]? [A-Z]\w*')
set(p.findall(text))
    

{'Dr. Ainstree',
 'Dr. Armstrong',
 'Dr. Barnicot',
 'Dr. Becher',
 'Dr. Ferrier',
 'Dr. Fordham',
 'Dr. Grimesby',
 'Dr. Horsom',
 'Dr. Huxtable',
 'Dr. James',
 'Dr. Leon',
 'Dr. Leslie',
 'Dr. Moore',
 'Dr. Mortimer',
 'Dr. Percy',
 'Dr. Richards',
 'Dr. Roylott',
 'Dr. Shlessinger',
 'Dr. Somerton',
 'Dr. Sterndale',
 'Dr. Thorneycroft',
 'Dr. Trevelyan',
 'Dr. Watson',
 'Dr. Willows',
 'Dr. Wood',
 'Miss Adler',
 'Miss Alice',
 'Miss Brenda',
 'Miss Burnet',
 'Miss Cushing',
 'Miss Dobney',
 'Miss Doran',
 'Miss Edith',
 'Miss Ettie',
 'Miss Flora',
 'Miss Fraser',
 'Miss Harrison',
 'Miss Hatty',
 'Miss Helen',
 'Miss Holder',
 'Miss Honoria',
 'Miss Hunter',
 'Miss Irene',
 'Miss M',
 'Miss Marie',
 'Miss Mary',
 'Miss Miles',
 'Miss Morrison',
 'Miss Morstan',
 'Miss Nancy',
 'Miss Rachel',
 'Miss Roylott',
 'Miss Rucastle',
 'Miss S',
 'Miss Sarah',
 'Miss Smith',
 'Miss Stapleton',
 'Miss Stoner',
 'Miss Stoper',
 'Miss Susan',
 'Miss Sutherland',
 'Miss Turner',
 'Miss Viole
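The character-class pattern is compact, but it also admits combinations that are not titles, such as "Ds." followed by a capitalized word. A possible alternative is to spell out the titles in a group, which also makes it easy to keep only the doctors. This is a sketch on an invented sentence; the title list (Mr, Mrs, Miss, Dr) follows the hint above and may not cover every title used in the stories.

```python
import re

# Invented sample; not taken from the stories
sample = (
    "Dr. Watson greeted Miss Adler, Mrs. Hudson and Mr. Holmes. "
    "Dr. Watson sat down."
)

# Group 1 is the title, group 2 the capitalized name that follows it
titles = re.compile(r"\b(Mr|Mrs|Miss|Dr)\.? ([A-Z]\w*)")

# Every titled person, deduplicated and sorted
people = sorted({m.group(0) for m in titles.finditer(sample)})

# Only the matches whose title is "Dr"
doctors = sorted({m.group(0) for m in titles.finditer(sample) if m.group(1) == "Dr"})

print(people)
print(doctors)
```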

## Question 4

* Find all the years and dates that appear in the stories

In [17]:
# we can use \d to match any digit
p = re.compile(r'1[89]\d\d')
p.findall(text)

['1878',
 '1860',
 '1857',
 '1871',
 '1878',
 '1878',
 '1882',
 '1882',
 '1882',
 '1888',
 '1858',
 '1890',
 '1890',
 '1869',
 '1870',
 '1878',
 '1883',
 '1883',
 '1869',
 '1869',
 '1884',
 '1887',
 '1846',
 '1855',
 '1875',
 '1891',
 '1890',
 '1891',
 '1894',
 '1894',
 '1840',
 '1881',
 '1884',
 '1887',
 '1894',
 '1901',
 '1895',
 '1900',
 '1888',
 '1872',
 '1883',
 '1884',
 '1883',
 '1883',
 '1883',
 '1883',
 '1894',
 '1884',
 '1882',
 '1882',
 '1884',
 '1882',
 '1883',
 '1876',
 '1800',
 '1865',
 '1875',
 '1872',
 '1874',
 '1875',
 '1892',
 '1895',
 '1897',
 '1914',
 '1911',
 '1915']

In [23]:
# widening the second digit to [6789] also picks up years in the 1600s and 1700s
p = re.compile(r'1[6789]\d\d') # \d matches any decimal digit
p.findall(text)

['1878',
 '1642',
 '1860',
 '1857',
 '1871',
 '1878',
 '1878',
 '1882',
 '1882',
 '1882',
 '1888',
 '1858',
 '1890',
 '1890',
 '1869',
 '1870',
 '1878',
 '1883',
 '1883',
 '1869',
 '1869',
 '1884',
 '1887',
 '1846',
 '1855',
 '1607',
 '1875',
 '1891',
 '1890',
 '1891',
 '1894',
 '1894',
 '1840',
 '1881',
 '1884',
 '1887',
 '1894',
 '1901',
 '1895',
 '1900',
 '1888',
 '1872',
 '1883',
 '1884',
 '1883',
 '1883',
 '1883',
 '1883',
 '1894',
 '1884',
 '1882',
 '1882',
 '1884',
 '1882',
 '1883',
 '1730',
 '1742',
 '1742',
 '1876',
 '1647',
 '1750',
 '1800',
 '1865',
 '1750',
 '1644',
 '1875',
 '1872',
 '1874',
 '1875',
 '1892',
 '1895',
 '1897',
 '1914',
 '1911',
 '1915']
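A pattern such as `1[89]\d\d` matches any four matching digits, even inside a longer number. Adding `\b` word boundaries restricts it to standalone four-digit tokens. The sketch below also tries one possible pattern for full dates. The "Month day, year" layout is an assumption about how dates are written, and the `sample` string is invented, so the real text may need other patterns.

```python
import re

# Invented sample; not taken from the stories
sample = (
    "He left on March 4th, 1881, and box 21875 arrived in 1894; "
    "room 1887a was empty."
)

# Without boundaries, digits inside longer tokens are picked up too
loose = re.findall(r"1[89]\d\d", sample)

# With \b on both sides, only standalone four-digit years match
years = re.findall(r"\b1[89]\d\d\b", sample)

# Assumed date layout: month name, day with optional ordinal suffix, comma, year
months = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
date_pattern = re.compile(rf"\b({months}) (\d{{1,2}})(?:st|nd|rd|th)?, (1[89]\d\d)\b")
dates = date_pattern.findall(sample)

print(loose)
print(years)
print(dates)
```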

## Question 5

Sherlock Holmes is often smoking his pipe. But, like many English verbs, "to smoke" takes several forms depending on the context.

* capture every sentence about smoking (smoke, smokes, smoking, smoked)
* capture the two words that appear after the word for smoking
* capture the two words that appear before the word for smoking

In [28]:
# a full stop, letters and spaces, 'smok', more letters and spaces, then the closing full stop
p = re.compile(r'\.[ A-Za-z]+smok[ A-Za-z]+\.')
p.findall(text)

['. I am going to smoke and to think over this queer business to which my fair client has introduced us.',
 '. I would have thought no more of knifing him than of smoking this cigar.',
 '. The smoke and shouting were enough to shake nerves of steel.',
 '. He had even smoked there.',
 '. Then I went into the back yard and smoked a pipe and wondered what it would be best to do.',
 '. As we rolled into Eyford Station we saw a gigantic column of smoke which streamed up from behind a small clump of trees in the neighbourhood and hung like an immense ostrich feather over the landscape.',
 '. Then he lit his pipe and sat for some time smoking and turning them over.',
 '. I had smoked two cigarettes before he moved.',
 '.  We had breakfasted and were smoking our morning pipe on the day after the remarkable experience which I have recorded when Mr.',
 '. I observed that he was smoking with extraordinary rapidity.',
 '. He does smoke something terrible.',
 '. From over a distant rise there float

In [29]:
# from 'smok' to the end of the sentence
p = re.compile(r'smok[ A-Za-z]+\.')
p.findall(text)

['smoked a Trichinopoly cigar.',
 'smoking his pipe.',
 'smoke and to think over this queer business to which my fair client has introduced us.',
 'smoking this cigar.',
 'smoking in silence.',
 'smoke curled through the room and out at the open window.',
 'smoke and shouting were enough to shake nerves of steel.',
 'smoked there.',
 'smoked a cigar and waited behind a tree until he should be alone.',
 'smoke like so many pistol shots.',
 'smoked a pipe and wondered what it would be best to do.',
 'smoke.',
 'smoke which streamed up from behind a small clump of trees in the neighbourhood and hung like an immense ostrich feather over the landscape.',
 'smoking and turning them over.',
 'smoking pistol in his hand at his elbow.',
 'smoke that we could not see across the table.',
 'smoke thinned away there was no sign left of the Gloria Scott.',
 'smoking when the alarm was given.',
 'smoke a pipe with you with pleasure.',
 'smoked for some time in silence.',
 'smoked.',
 'smoke of his ci

In [30]:
# from the previous full stop up to 'smok'
p = re.compile(r'\.[ A-Za-z]+smok')
p.findall(text)

['. I am going to smok',
 '. I would have thought no more of knifing him than of smok',
 '. There was a group of shabbily dressed men smok',
 '. The smok',
 '. I have a caseful of cigarettes here which need smok',
 '. The drawn blinds and the smok',
 '. He had even smok',
 '. He drank a great deal of brandy and smok',
 '. Then I went into the back yard and smok',
 '. As we rolled into Eyford Station we saw a gigantic column of smok',
 '. If you care to smok',
 '. Then he lit his pipe and sat for some time smok',
 '. Suddenly as we looked at her we saw a dense black cloud of smok',
 '. Alec was smok',
 '. Seems to have smok',
 '.  His brother Mycroft was sitting smok',
 '. I had smok',
 '.  We had breakfasted and were smok',
 '. I observed that he was smok',
 '. He does smok',
 '. I therefore smok',
 '. I should not sit here smok',
 '. From over a distant rise there floated a gray plume of smok',
 '. She had come from the direction in which the plume of smok',
 '. Both of them were smok
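The patterns above return the rest of the sentence on either side of "smok". To capture exactly two words before or after, one option is to use capturing groups with `\w+` for each word. This is a sketch on an invented pair of sentences; it treats a "word" as a run of `\w` characters separated by whitespace, which is a simplifying assumption.

```python
import re

# Invented sample; not taken verbatim from the stories
sample = (
    "Then he lit his pipe and sat for some time smoking and turning them over. "
    "He had even smoked there."
)

# Two words after any form starting with 'smok'
after = re.findall(r"\bsmok\w*\s+(\w+)\s+(\w+)", sample)

# Two words before any form starting with 'smok'
before = re.findall(r"\b(\w+)\s+(\w+)\s+smok\w*", sample)

print(after)
print(before)
```

The second sentence has only one word after "smoked" before the full stop, so it contributes to `before` but not to `after`.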

    
## Question 6

Often we are handed a block of unstructured text and want to use regex to give it some structure. In this case, we can split the book by chapter.

* use the re.split() function to split the books by chapter.

In [31]:
# 'CHAPTER', one whitespace character, then a Roman numeral
p = re.compile(r'CHAPTER\s[IVX]+')
p.findall(text)

['CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER VIII',
 'CHAPTER IX',
 'CHAPTER X',
 'CHAPTER XI',
 'CHAPTER XII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER VIII',
 'CHAPTER IX',
 'CHAPTER X',
 'CHAPTER XI',
 'CHAPTER XII',
 'CHAPTER XIII',
 'CHAPTER XIV',
 'CHAPTER XV',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER VIII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER I',
 'CHAPTER II']

In [41]:
# 'CHAPTER' followed by letters and spaces; the match runs on until the first other character
p = re.compile(r'CHAPTER\s[ A-Za-z]+')
p.findall(text)

['CHAPTER I Mr',
 'CHAPTER II The Science Of Deduction   We met next day as he had arranged',
 'CHAPTER III The Lauriston Garden Mystery   I confess that I was considerably startled by this fresh proof of the practical nature of my companion',
 'CHAPTER IV What John Rance Had To Tell   It was one o',
 'CHAPTER V Our Advertisement Brings A Visitor   Our morning',
 'CHAPTER VI Tobias Gregson Shows What He Can Do   The papers next day were full of the ',
 'CHAPTER VII Light In The Darkness   The intelligence with which Lestrade greeted us was so momentous and so unexpected',
 'CHAPTER I On The Great Alkali Plain   In the central portion of the great North American Continent there lies an arid and repulsive desert',
 'CHAPTER II The Flower Of Utah   This is not the place to commemorate the trials and privations endured by the immigrant Mormons before they came to their final haven',
 'CHAPTER III John Ferrier Talks With The Prophet   Three weeks had passed since Jefferson Hope and his comr

In [42]:
p.split(text)[3]

'\'s theories. My respect for his powers of analysis increased wondrously. There still remained some lurking suspicion in my mind, however, that the whole thing was a pre-arranged episode, intended to dazzle me, though what earthly object he could have in taking me in was past my comprehension. When I looked at him he had finished reading the note, and his eyes had assumed the vacant, lack-lustre expression which showed mental abstraction.  "How in the world did you deduce that?" I asked.  "Deduce what?" said he, petulantly.  "Why, that he was a retired sergeant of Marines."  "I have no time for trifles," he answered, brusquely; then with a smile, "Excuse my rudeness. You broke the thread of my thoughts; but perhaps it is as well. So you actually were not able to see that that man was a sergeant of Marines?"  "No, indeed."  "It was easier to know it than to explain why I knew it. If you were asked to prove that two and two made four, you might find some difficulty, and yet you are quit
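The chunk above starts mid-sentence because the pattern `CHAPTER\s[ A-Za-z]+` consumes the chapter title and the first words of the chapter, and `split` discards whatever the pattern matched. One way to keep the headings is to split on the shorter numeral pattern wrapped in a capturing group, since `re.split` returns captured groups as part of the result. This is a sketch on an invented excerpt; the pairing logic and the names `parts` and `chapters` are illustrative assumptions.

```python
import re

# Invented excerpt in the same layout as the joined text
sample = (
    "Preface text. CHAPTER I Mr. Sherlock Holmes   In the year 1878 I took my degree. "
    "CHAPTER II The Science Of Deduction   We met next day."
)

# The capturing group makes re.split keep each heading in the output list
parts = re.split(r"(CHAPTER\s[IVX]+)", sample)

# parts[0] is the text before the first heading; after that, headings and bodies alternate
chapters = [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts), 2)]

for heading, body in chapters:
    print(heading, "->", body[:30])
```

A list of pairs is used rather than a dictionary because chapter numbers repeat across the books in the collection.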