# Python Applied to Big Data
## Day 8 - 15/03/2021
### Instructor: Leonardo Galler

# Introduction to NLP

## Reading a File

In [1]:
# If the package is not installed, use this cell to install it
## If you are using Conda
#import sys
#!conda install --yes --prefix {sys.prefix} matplotlib

In [2]:
import os
import nltk

#### Two ways to read a file
1. File in the same directory as the notebook
2. File in a different directory

In [3]:
# Reading a text file located in the same directory as the notebook
# (the explicit encoding keeps accented characters intact on any platform)
with open("frases.txt", "r", encoding="utf-8") as f:
    text = f.read()
    print(text)

Ainda pior que a convicção do não e a incerteza do talvez é a desilusão de um quase. É o quase que me incomoda, que me entristece, que me mata trazendo tudo que poderia ter sido e não foi. Quem quase ganhou ainda joga, quem quase passou ainda estuda, quem quase morreu está vivo, quem quase amou não amou. Basta pensar nas oportunidades que escaparam pelos dedos, nas chances que se perdem por medo, nas ideias que nunca sairão do papel por essa maldita mania de viver no outono. 


In [None]:
# Reading a text file located in a different directory
# (os.path.join builds the path with the correct separator for the OS)
with open(os.path.join("data","frases.txt"), "r", encoding="utf-8") as f:
    text = f.read()
    print(text)
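
The two reads above differ only in how the path is built. The following is a minimal sketch, not part of the course material: it assumes a throwaway temporary directory and a hypothetical helper named `read_text`, and it writes a short excerpt of the text shown earlier so that the example runs without `frases.txt` being present.

```python
import os
import tempfile

def read_text(*path_parts, encoding="utf-8"):
    """Join the path parts and return the file's full content as a string."""
    path = os.path.join(*path_parts)
    with open(path, "r", encoding=encoding) as f:
        return f.read()

# Assumed setup for the sketch: create data/frases.txt in a temporary directory.
base_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(base_dir, "data"))
sample = "É o quase que me incomoda."
with open(os.path.join(base_dir, "data", "frases.txt"), "w", encoding="utf-8") as f:
    f.write(sample)

# Read it back through the helper; the path parts are joined for us.
text = read_text(base_dir, "data", "frases.txt")
print(text)
```

Passing the encoding explicitly is a design choice here: the platform default is not UTF-8 everywhere, and the sample text contains accented characters.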

## Tabulating a File

In [5]:
# We can use pandas to load a file as a table
import pandas as pd

In [9]:
# Load the CSV into a DataFrame and list its columns
df = pd.read_csv("news.csv")
print(df.columns)
print()
# Convert the title text to lowercase
df['title'] = df['title'].str.lower()
# Keep only the publisher and title columns of the first five rows
resultado = df.head()[['publisher', 'title']]
resultado

Index(['id', 'title', 'url', 'publisher', 'category', 'story', 'hostname',
       'timestamp'],
      dtype='object')



|   | publisher | title |
|---|-----------|-------|
| 0 | Livemint | fed's charles plosser sees high bar for change... |
| 1 | IFA Magazine | us open: stocks fall after fed official hints ... |
| 2 | IFA Magazine | fed risks falling 'behind the curve', charles ... |
| 3 | Moneynews | fed's plosser: nasty weather has curbed job gr... |
| 4 | NASDAQ | plosser: fed may have to accelerate tapering pace |
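
The lowercase step and the column selection can be tried without `news.csv`. The sketch below builds a tiny DataFrame in memory; its rows are illustrative placeholders, not data from the course file.

```python
import pandas as pd

# Illustrative rows only; the real data comes from news.csv.
df = pd.DataFrame({
    "publisher": ["Publisher A", "Publisher B"],
    "title": ["Fed Official Hints At Change", "STOCKS FALL After Open"],
    "category": ["b", "b"],
})

# .str.lower() applies lowercase element-wise to a string column.
df["title"] = df["title"].str.lower()

# Indexing with a list of names returns a DataFrame with only those columns.
result = df[["publisher", "title"]].head()
print(result)
```

Normalizing case early is a common first step in text processing, since it makes "Fed" and "fed" count as the same token in later steps.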


## Building a Web Scraper

In [10]:
# Importing libraries for web scraping
import requests
import json

In [17]:
# Fetching data from a REST API
r = requests.get("https://quotes.rest/qod.json")
res = r.json()
print(json.dumps(res, indent=4))

{
    "success": {
        "total": 1
    },
    "contents": {
        "quotes": [
            {
                "quote": "Every day you have a choice to be honest or deceptive. If you commit to telling the truth, you will win. You'll win more trust, you'll win more business, and you'll win more peace of mind. You'll break the system and be even more successful.",
                "length": "269",
                "author": "Dale Patridge",
                "tags": [
                    "honest",
                    "inspire",
                    "success",
                    "truth",
                    "win"
                ],
                "category": "inspire",
                "language": "en",
                "date": "2021-03-15",
                "permalink": "https://theysaidso.com/quote/dale-patridge-every-day-you-have-a-choice-to-be-honest-or-deceptive-if-you-commi",
                "id": "ALTQKodcrj3X6ypM8lGjnAeF",
                "background": "https://theysaidso.com/img/qod/

In [18]:
# Extracting the relevant object and fields
q = res["contents"]["quotes"][0]
print(q["quote"], "\n--", q["author"])

Every day you have a choice to be honest or deceptive. If you commit to telling the truth, you will win. You'll win more trust, you'll win more business, and you'll win more peace of mind. You'll break the system and be even more successful. 
-- Dale Patridge
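
Indexing with `res["contents"]["quotes"][0]` raises an exception if the API returns a different shape, for example an error body. A more defensive variant is sketched below under stated assumptions: `extract_quote` is a hypothetical helper, the sample payload is trimmed from the response printed above, and the error payload is invented purely for illustration.

```python
def extract_quote(payload):
    """Return (quote, author) from a quotes-style payload, or None if absent."""
    quotes = payload.get("contents", {}).get("quotes", [])
    if not quotes:
        return None
    first = quotes[0]
    return first.get("quote"), first.get("author")

# Sample payload trimmed from the response shown above.
sample = {
    "success": {"total": 1},
    "contents": {
        "quotes": [
            {
                "quote": "Every day you have a choice to be honest or deceptive.",
                "author": "Dale Patridge",
            }
        ]
    },
}

quote, author = extract_quote(sample)
print(quote, "\n--", author)

# Invented error-shaped payload: the helper returns None instead of raising.
print(extract_quote({"error": {"code": 429}}))
```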


## Accessing the Corpus

In [19]:
# Import
from nltk.corpus import reuters
nltk.download('reuters')

[nltk_data] Downloading package reuters to
[nltk_data]     /home/leogaller/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


True

In [22]:
# Listing the files in the corpus
files = reuters.fileids()
print(len(files))

10788


In [21]:
# Accessing a file
words16097 = reuters.words(['test/16097'])
print(words16097)

['UGANDA', 'PULLS', 'OUT', 'OF', 'COFFEE', 'MARKET', ...]


In [54]:
# Accessing a specific number of words in a file
words20 = reuters.words(['test/16097'])[:20]
print(words20)

['UGANDA', 'PULLS', 'OUT', 'OF', 'COFFEE', 'MARKET', '-', 'TRADE', 'SOURCES', 'Uganda', "'", 's', 'Coffee', 'Marketing', 'Board', '(', 'CMB', ')', 'has', 'stopped']


In [55]:
# The corpus is not just a list of files, but a categorized hierarchy of 90 topics.
# Each topic has several files associated with it.
# When we access a topic, we are actually accessing the files associated with that topic.
reutersGenres = reuters.categories()
print(reutersGenres)

['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']


In [23]:
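# Accessing 2 topics and printing their words, breaking the line after each '.'
for w in reuters.words(categories=['bop','cocoa']):
    print(w + ' ', end='')
    # Compare by value with ==; `is` tests object identity and is unreliable for strings
    if w == '.':
        print()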

SOUTH KOREA MOVES TO SLOW GROWTH OF TRADE SURPLUS South Korea ' s trade surplus is growing too fast and the government has started taking steps to slow it down , Deputy Prime Minister Kim Mahn - je said . 
He told a press conference the government planned to increase investment , speed up the opening of the local market to foreign imports and gradually adjust its currency to hold the surplus " at a proper level ." But he said the government would not allow the won to appreciate too much in a short period of time . 
South Korea has been under pressure from Washington to revalue the won . 
The U . 
S . 
Wants South Korea to cut its trade surplus with the U . 
S ., Which rose to 7 . 
4 billion dlrs in 1986 from 4 . 
3 billion dlrs in 1985 . 
Kim , who is also economic planning minister , said prospects were bright for the South Korean economy , but the government would try to hold the current account surplus to around five billion dlrs a year for the next five years . 
" Our government pr

" The cost of this approach is that the much - needed revival of business investment will be further postponed ," it said . 
The economy was now on a modest growth upswing boosted by export and import - replacement industries which had created a false suggestion that the worst adjustments to the balance of payments crisis were past . 
" Unfortunately , successful adjustment to Australia ' s deep - seated economic problems remains a long - term process ," it said . 
In its economic forecasts , ANZ said it expected moderate overall economic growth with gross domestic product ( GDP ) rising 2 . 
7 pct this year and 2 . 
4 pct in 1988 . 
The current account deficit would narrow to five pct of GDP this year and 4 . 
3 pct in 1988 and net foreign debt would grow strongly from 81 billion at the end of 1986 to 97 . 
2 billion at end - 1987 and 110 . 
3 billion a year later . 
Inflation would fall to 8 . 
5 pct in 1987 and 7 . 
5 pct in 1988 from 8 . 
9 pct in 1986 and further falls in real wag

Cavaco Silva has accused the left - wing opposition parties of blocking key economic reforms . 
The left - wingers said Portugal ' s positive economic results were more the product of favourable international conditions such as cheaper oil and raw material imports , than of PSD policies . 
SOUTH KOREA TO CHANGE POLICIES TO AVERT TRADE WAR South Korea has decided on major changes in its trade , investment and finance policies aimed at reducing the growth of its balance of payments surplus and avoiding a trade war with the United States , Deputy Prime Minister Kim Mahn - je said . 
Kim told reporters the excessively fast rise in exports could make South Korea too reliant on exports , increase nflation and produce trade friction . 
The policy shift , which means abandoning Seoul ' s goal of rapidly reducing its foreign debt , was worked out at a series of ministerial meetings . 
Kim , who is also Economic Planning Minister , said the current account surplus , previously expected to exceed

London traders say terminal market prices would have to gain around 300 stg a tonne to take the ICCO 10 - day average indicator to its 1 , 935 sdr per tonne midway point ( or reference price ). However , little progress has been made in that direction , and the 10 - day average is still well below the 1 , 600 sdr lower intervention level at 1 , 562 . 
87 from 1 , 569 . 
46 previously . 
The buffer stock manager may announce today he will be making purchases tomorrow , although under the rules of the agreement such action is not automatic , traders said . 
Complaints about the inaction of the buffer stock manager are not confined to West African producers , they observed . 
A Reuter report from Rotterdam quoted industry sources there saying Dutch cocoa processors also are unhappy with the intermittent buffer stock buying activities . 
In London , traders expressed surprise that no more than 21 , 000 tonnes cocoa has been bought so far against total potential purchases under the new agre

ITALIAN BALANCE OF PAYMENTS IN DEFICIT IN MAY Italy ' s overall balance of payments showed a deficit of 3 , 211 billion lire in May compared with a surplus of 2 , 040 billion in April , provisional Bank of Italy figures show . 
The May deficit compares with a surplus of 1 , 555 billion lire in the corresponding month of 1986 . 
For the first five months of 1987 , the overall balance of payments showed a surplus of 299 billion lire against a deficit of 2 , 854 billion in the corresponding 1986 period . 
OECD SEES GERMAN GROWTH HIT BY LOW DOMESTIC DEMAND West German economic growth will slow to 1 . 
5 pct this year from 2 . 
4 pct in 1986 due to weak domestic demand and tougher competition from abroad , the Organisation for Economic Cooperation and Development ( OECD ) said in its semi - annual review of the world economy . 
This view is less favourable than the West German government ' s forecast of a growth rate of under two pct this year , but is in line with forecasts by independent 

It forecast real growth of three pct for the world economy and four pct for Japan by 2 , 000 if the adjustments were made . 
S . 
AFRICAN RESERVE BANK SAYS GROWTH RATE ON TARGET South Africa recorded annualised real growth in GDP of 3 . 
25 pct in the first quarter of this year and the economy should achieve the government ' s target of three pct growth for 1987 , the Reserve Bank said . 
The South African central bank said in its quarterly bulletin that confidence in the economy improved from January to May 31 because of the higher gold price , a rise in the nation ' s gold and foreign currency reserves and an improvement in the rand ' s exchange rate to just under 50 U . 
S . 
Cents . 
It noted the growth rate had slowed from 4 . 
5 pct in the third and fourth quarters of last year . 
It also cited a three year debt recheduling agreement reached with international creditors in March as evidence of improved foreign perceptions of the South African economy . 
The accord effectively ext

ITALY SHOWS SEPTEMBER OVERALL PAYMENTS SURPLUS Italy ' s overall balance of payments showed a 919 billion lire surplus in September against a deficit of 1 , 026 billion in August , provisional Bank of Italy figures showed . 
The September surplus compared with a shortfall of 1 , 697 billion lire in September 1986 . 
For the first nine months of 1987 , the overall balance of payments showed a deficit of 1 , 921 billion lire against a 1 , 725 billion deficit in the same 1986 period . 
The central bank said Italy ' s one billion dlr Eurobond launched last month contributed to September ' s surplus . 
TURKEY CURRENT ACCOUNT DEFICIT WIDENS IN JULY Turkey ' s current account deficit widened in July to 674 mln dlrs from 454 mln in June but fell from 1 . 
22 billion in July last year , the State Statistics Institute said . 
The cumulative trade position in July showed a 1 . 
85 billion dlr deficit after 1 . 
33 billion in June and 1 . 
89 billion a year earlier , with exports at 4 . 
91 billio

Liquor sales were limited with March / April selling at 2 , 325 and 2 , 380 dlrs , June / July at 2 , 375 dlrs and at 1 . 
25 times New York July , Aug / Sept at 2 , 400 dlrs and at 1 . 
25 times New York Sept and Oct / Dec at 1 . 
25 times New York Dec , Comissaria Smith said . 
Total Bahia sales are currently estimated at 6 . 
13 mln bags against the 1986 / 87 crop and 1 . 
06 mln bags against the 1987 / 88 crop . 
Final figures for the period to February 28 are expected to be published by the Brazilian Cocoa Trade Commission after carnival which ends midday on February 27 . 
COFFEE , SUGAR AND COCOA EXCHANGE NAMES CHAIRMAN The New York Coffee , Sugar and Cocoa Exchange ( CSCE ) elected former first vice chairman Gerald Clancy to a two - year term as chairman of the board of managers , replacing previous chairman Howard Katz . 
Katz , chairman since 1985 , will remain a board member . 
Clancy currently serves on the Exchange board of managers as chairman of its appeals , executive , 

The invisible trade deficit fell to 617 mln dlrs in February from 693 mln a year earlier , but was up from a 527 mln deficit in January . 
Figures do not tally exactly because of rounding . 
Transfer payments narrowed to a 140 mln dlr deficit last month from a 185 mln deficit a year earlier and a 225 mln deficit in January . 
The basic balance of payments deficit in February fell to 4 . 
02 billion dlrs from 4 . 
17 billion in February 1986 and 7 . 
37 billion in January . 
Short - term capital account payments swung to a 1 . 
28 billion dlr deficit in February from a 1 . 
60 billion surplus a year earlier and a 1 . 
44 billion dlr surplus in January . 
Errors and omissions were 2 . 
65 billion dlrs in surplus , compared with a 1 . 
27 billion surplus a year earlier and a 1 . 
10 billion deficit in January . 
The overall balance of payments deficit rose to 2 . 
65 billion dlrs from 1 . 
30 billion a year earlier but was down from 7 . 
04 billion in January . 
COCOA BUFFER STOCK COMPROM

He said in a statement welcoming the agreement on buffer stock rules reached last week in London that it resulted in large part from initiatives taken by the EC Commission after consumers and producers had reached deadlock in initial negotiations . 
COCOA DEAL SEEN POSITIVE , BUT NO PRICE GUARANTEE The buffer stock rules agreement reached on Friday by the International Cocoa Organization ( ICCO ) is an improvement on previous arrangements but the price - support mechanism is unlikely to do more than stem the decline in cocoa prices , many ICCO delegates and trade sources said . 
The accord was reached between producers and consumers of the 35 - member ICCO council after two weeks of talks . 
European chocolate manufacturers and delegates said the accord may boost cocoa prices immediately , but world surpluses overhanging the market will pull prices down again before long . 
" If the buffer stock operation is successful , I doubt it will do anything more than stop the price from falling

The cocoa pact establishes precise differentials the Buffer Stock Manager must use when purchasing varying grades . 
A new International Natural Rubber Agreement ( INRA ) was adopted earlier this month in Geneva . 
Importing and exporting countries agreed several changes to make the reference price more responsive to market trends and they eliminated provisions under which the buffer stock could borrow from banks to finance operations . 
Direct cash contributions from members will fund buffer stock purchases . 
Bank financing was a particular feature of the failed ITC buffer stock which suffered losses running into hundreds of millions of sterling . 
Legal wrangles continue . 
Recent International Coffee Organization ( ICO ) negotiations in London exemplified the degree to which consumers insist that agreements reflect market reality , commodity analysts said . 
Consumers and a small group of producers argued that " objective criteria " should be used to define export quota shares , wh

Despite the low loads , trees are said to be in excellent condition and recent flowering and pod setting - which will lead to late temporao / early main crop beans - has been good . 
CSCE ALTERS RULES ON TRADING LIMITS The Coffee , Sugar and Cocoa Exchange amended regulations governing expanded trading limits on coffee , cocoa and sugar contracts to provide uniformity . 
Effective today , the exchange will permit normal daily price limits in those commodities to expand whenever the first two limited contract months move the limit in the same direction for two consecutive sessions . 
The normal daily limits will be reinstated once the first two limited deliveries close by less than the normal limit for two successive trading days . 
Previously exchange rules required the first three limited months to move the limit in coffee and cocoa . 
It had required the first two limited sugar deliveries to make such moves for three consecutive sessions . 
GHANA COCOA PURCHASES SLOW The Ghana Cocoa 

PORTUGAL ' S GDP FORECAST TO GROW FOUR PCT THIS YEAR Portugal ' s Gross Domestic Product ( GDP ) will grow around four pct this year , the same rate as in 1986 , according to a Bank of Portugal forecast . 
Total investment this year , the country ' s second as a member of the European Community ( EC ), will rise nearly 10 pct , again the same rate as last year , the central bank study said . 
It added that Portugal ' s current account was forecast to show a surplus of 400 mln dlrs this year compared with 1 . 
13 billion in 1986 and 369 mln the previous year . 
Last year ' s high surplus was attributed to cheaper oil and raw materials , lower world interest rates and a weaker dollar . 
Imports by volume were forecast to grow 10 pct this year and exports four pct compared with increases of 16 . 
5 pct and 6 . 
6 pct respectively in 1986 , the bank said . 
The forecasts were calculated on the assumption that the non - expansionary monetary policy carried out by the current government woul

79 billion in January , but was down on the 7 . 
26 billion figure for February 1986 . 
Seasonally adjusted , the February current account surplus narrowed against January . 
While exports in February fell a half pct against the same month last year , imports fell 10 - 1 / 2 pct largely due to the drop in prices . 
Exports grew three pct in volume and imports two pct . 
In the balance of services , a fall in net investment income led to a 300 mln mark deficit in February after a 300 mln mark surplus in January . 
The deficit in transfer payments widened to 3 . 
70 billion marks from 2 . 
69 billion , largely due to a sharp jump to 2 . 
3 billion marks from 200 mln in payments to the European Community budget . 
TURKEY SEES 1 . 
5 BILLION DLR DEFICIT IN 1986 Turkey expects a 1986 balance of payments deficit of 1 . 
5 billion dlrs , well over target , but is taking steps to improve its performance in this and other fields , Ali Tigrel , director of economic planning at the State Planning

The report forecasts that in calendar 1987 , Indonesia ' s CTC ( crushed , torn and curled ) tea exports will increase significantly with the coming on stream of at least eight new CTC processing plants . 
Indonesia plans to diversify its tea products by producing more CTC tea , the main component of tea bags . 
Production of black and green teas is forecast in the embassy report to rise to 125 , 000 tonnes in calendar 1987 from 123 , 000 tonnes in 1986 . 
Exports of these teas are likely to rise to 95 , 000 tonnes in 1987 from 85 , 000 in 1986 and around 90 , 000 in 1985 . 
The embassy noted the ministry of trade tightened quality controls on tea in October 1986 in an effort to become more competititve in the world market . 
OECD TRADE , GROWTH SEEN SLOWING IN 1987 The 24 nations of the Organisation for Economic Cooperation and Development ( OECD ), hampered by sluggish industrial output and trade , face slower economic growth , and their joint balance of payments will swing into defi

The EIU said in its World Trade Forecast it revised OECD economic growth downwards to 2 . 
5 pct this year , compared with a 2 . 
8 pct growth forecast in December . 
It said the new areas of weakness are West Germany and the smaller European countries it influences , and Japan , hardest hit by currency appreciation this year . 
The independent research organisation cut its 1987 growth rate forecasts for West Germany to 2 . 
2 pct from 3 . 
2 pct in December and to 2 . 
3 pct from three pct for Japan . 
It said it expected the OECD to post a current account deficit of some 13 billion dlrs in both 1987 and 1988 , due in large part to a 1 . 
50 dlrs a barrel rise in 1987 oil prices . 
It said the U . 
S . 
Current account deficit looked likely to fall even more slowly than forecast , to 125 billion dlrs in 1987 and 115 billion in 1988 from 130 billion in 1986 . 
It said it expected West Germany to post a 31 billion dlr payments surplus and Japan a 76 billion dlr surplus this year . 
The 

67 billion dlrs . 
For the full year 1986 , the merchandise trade deficit was a record 147 . 
7 billion dlrs , up from 124 . 
4 billion dlrs in 1985 , the department said . 
During the final quarter last year imports rose 2 . 
78 billion dlrs or three pct to 95 . 
7 billion dlrs , while exports rose 1 . 
56 billion dlrs or three pct to 57 . 
33 billion dlrs . 
The trade report on a balance of payments basis excludes such factors as military sales and the costs of shipping and insurance . 
The Commerce Department said non - petroleum imports in the quarter were up 2 . 
7 billion dlrs or three pct to 87 . 
7 billion dlrs , with the largest increases in consumer goods , which rose 1 . 
2 billion dlrs , and in non - monetary gold and passenger cars from Canada , up 900 mln dlrs each . 
Lumber imports from Canada fell 300 mln dlrs or 33 pct because of a 15 pct duty on imports from Canada , the department said . 
Passenger car imports fell 600 mln dlrs because of an 18 pct decrease in the nu

1 mln bags , and over seven weeks still to go to the end of the year , the total outturn should be at least a record 6 . 
5 mln bags if all production is declared , the sources said . 
This would compare with the previous record set last year of 6 . 
03 mln . 
However , there is no way of telling how many current crop beans will be declared after the May 1 start of the temporao and thus the true size of the 1986 / 87 harvest may never be officially registered . 
AUSTRALIAN TERMS OF TRADE WORSEN IN LAST QUARTER Australia ' s terms of trade fell by a further 3 . 
5 pct in the fourth quarter of 1986 after declining 0 . 
8 pct in the third quarter and 2 . 
7 pct a year earlier , the Statistics Bureau said . 
It said the seasonally adjusted current account deficit of 3 . 
22 billion dlrs in the quarter would have dropped to 912 mln if not for the terms of trade decline . 
The fourth quarter decline followed a 1 . 
1 pct fall in export prices and a 2 . 
4 pct rise in import prices , it said 

72 billion ) in January and 2 . 
54 billion a year earlier while FOB imports fell to 2 . 
77 billion from 2 . 
99 billion ( revised from 3 . 
01 billion ) against 2 . 
70 billion a year earlier , the Bureau said . 
It said a four pct decline in rural exports , despite an 11 pct rise in wheat exports , was more than offset by a seven pct rise in non - rural exports , notably minerals and fuels . 
On the import side , the main decreases were falls of 17 pct in machinery and transport equipment and 21 pct in fuels , the Bureau said . 
The net services deficit narrowed to 146 mln dlrs from 253 mln ( revised from 268 mln ) in January and 192 mln a year earlier , the Bureau said . 
This made a sharply lower deficit of 104 mln dlrs on the balance of goods and services against deficits of 499 mln in January and 354 mln a year earlier . 
Deficit on net income and unrequited transfers was 646 mln dlrs against 736 mln in January and 543 mln a year earlier . 
Official capital transactions in Febru

The current account showed an adjusted surplus of 6 . 
1 billion francs in January last year , and an unadjusted deficit of one billion . 
The full year 1986 current account surplus was reported last month at 25 . 
8 billion francs . 
COCOA WORKING GROUP MEETING DELAYED The International Cocoa Organization ( ICCO ) buffer stock working group meeting set for 1130 GMT today was rescheduled for 1430 , ICCO delegates said . 
The meeting was delayed so a draft compromise proposal on buffer stock rules could be completed , they said . 
ICCO Executive Director Kobena Erbynn was preparing the plan in consultation with other delegates for presentation to the full working group , they added . 
U . 
S . 
CURRENT ACCOUNT DEFICIT RECORD 36 . 
84 BILLION DLRS IN 4TH QTR 1986 U . 
S . 
CURRENT ACCOUNT DEFICIT RECORD 36 . 
84 BILLION DLRS IN 4TH QTR 1986 U . 
S . 
CURRENT ACCOUNT DEFICIT 36 . 
84 BILLION DLRS The U . 
S . 
current account deficit widened to a record 36 . 
84 billion dlrs on a balance 

The main worry from today ' s speech is the outlook for inflation , given the signs of relaxed monetary policy contained in it , Scrimgeour Vickers economist Richard Holt said . 
Holt noted the " rather loose " inflation forecast of 4 . 
0 pct at end - 1987 , and said the lower interest rates likely to result from the tough fiscal stance could cause longer term concern . 
" A higher PSBR target could be preferable in the long term ," he said , although lower mortgage interest rates on the back of falling base rates would have an offsetting impact on inflation . 
The Budget will inspire a lot of short - term confidence but it was " not a good budget for inflation ," he said Jeffrey said he would have liked Lawson to say more about the dangers of excessive liquidity build - up but overall was not too concerned about a revival of inflation . 
Fellner noted that the exchange rate was to remain the " leading edge " of monetary policy , but said the authorities were likely to be extremely ca

2 billion baht in February from 4 . 
6 billion the previous month but was higher than 1 . 
5 billion a year ago . 
The bank said the balance of payments surplus for the first two months of 1987 widened to 7 . 
8 billion baht from 4 . 
6 billion from the same period in 1986 , while the net capital inflow rose to five billion baht from 3 . 
1 billion . 
N . 
Y . 
COCOA TRADERS STILL CAUTIOUS ON ICCO New York cocoa traders reacted with caution to today ' s developments at the International Cocoa Organization talks in London , saying there is still time for negotiations to break down . 
" I would be extremely cautious to go either long or short at this point ," said Jack Ward , president of the cocoa trading firm Barretto Peat . 
" If and when a final position comes out ( of the ICCO talks ) one will still have time to put on positions . 
The risk at the moment is not commensurate with the possible gain ." ICCO producer and consumer delegates this morning accepted the outlines of a comprom

Soviet and East German delegates did not attend the council session because of a conflicting International Sugar Organization meeting today , but could arrive this afternoon , delegates said . 
ICCO TO EXAMINE BUFFER STOCK DETAILS TOMORROW The International Cocoa Council , ICCO , adjourned for the day after a detailed proposal on buffer stock rules was distributed and executive committee officials were elected , delegates said . 
Producers , EC consumers and all consumers are scheduled to hold separate meetings tomorrow to review the proposal , written by ICCO executive director Kobena Erbynn , they said . 
The buffer stock working group is to meet again on rules Monday morning , and the full council is to reconvene Tuesday , delegates said . 
Heinz Hofer of Switzerland was elected executive committee chairman and Mette Mogstad of Norway vice chairman , they added . 
VOLCKER SAYS U . 
S . 
TRADE DEFICIT IS MAJOR CHALLENGE Federal Reserve Board Chairman Paul Volcker said the U . 
S . 
T

20 billion against 11 . 
36 billion . 
Government borrowing stood at 9 . 
26 billion dlrs for calendar 1986 against 3 . 
15 billion for 1985 . 
Borrowing in the December quarter rose to 3 . 
92 billion from 1 . 
79 in the September quarter and 611 mln a year earlier . 
Repayments stood at 5 . 
5 billion for the year , up from 3 . 
1 billion in 1985 . 
Repayments in the December quarter accounted for 1 . 
4 billion dlrs against 260 mln in the September quarter and 334 mln a year earlier . 
Official reserves totalled 7 . 
205 billion dlrs at end December compared with 4 . 
723 billion at end September and 3 . 
255 billion one year earlier . 
GHANA COCOA PURCHASES STILL AHEAD OF LAST YEAR The Ghana Cocoa Board said it purchased 456 tonnes of cocoa in the 23rd week , ended March 12 , of the 1986 / 87 main crop season , compared with 684 tonnes the previous week and 784 tonnes in the 23rd week ended March 20 of the 1985 / 86 season . 
Cumulative purchases so far this season stand at 217 , 2

The provisional January trade surplus narrowed to 7 . 
2 billion marks from a record 11 . 
6 billion marks the month before . 
In February 1986 the current account had shown a 6 . 
85 billion mark surplus and the trade account a 6 . 
84 billion surplus . 
German Feb current account surplus 6 . 
6 billion marks ( Jan surplus 4 . 
8 billion ) - official German Feb current account surplus 6 . 
6 billion marks ( Jan surplus 4 . 
8 billion ) - official GERMAN CURRENT ACCOUNT SURPLUS WIDENS IN FEBRUARY West Germany ' s current account surplus widened to a provisional 6 . 
6 billion marks in February from a slightly downwards revised 4 . 
8 billion in January , a spokeswoman for the Federal Statistics Office said . 
The trade surplus in February widened to a provisional 10 . 
4 billion marks from 7 . 
2 billion in January , she added . 
The Statistics Office had originally put the January current account surplus at 4 . 
9 billion marks . 
The February trade surplus was well up on the 6 . 
84 

BAKER SEES 15 TO 20 BILLION DLR DROP IN TRADE GAP Treasury Secretary James Baker said he expected the U . 
S . 
Trade deficit to fall by 15 billion to 20 billion dlrs in 1987 . 
Commenting on the deficit during an interview on Cable News Network , Baker said " I think you ' re going to see a 15 to 20 billion dlr reduction this year ." The deficit was 170 billion dlrs in 1986 . 
Baker noted that the benefits of a weaker currency take 12 to 18 months to affect the trade balance , and said it is now 18 months since the Plaza agreement to lower the dollar ' s value . 
U . 
K . 
VISIBLE TRADE DEFICIT NARROWS IN FEBRUARY Britain ' s visible trade deficit narrowed to a seasonally adjusted provisional 224 mln stg in February from 527 mln in January , The Trade and Industry Department said . 
The current account balance of payments in February showed a seasonally adjusted provisional surplus of 376 mln stg compared with a surplus of 73 mln in January . 
Invisibles in February were put provision

COCOA COUNCIL HEAD TO PRESENT BUFFER COMPROMISE International Cocoa Organization , ICCO , council chairman Denis Bra Kanon will present a compromise proposal on buffer stock rules to producer and consumer delegates either later today or tomorrow morning , delegates said . 
Bra Kanon held private bilateral consultations with major producers and consumers this morning to resolve outstanding differences , mostly on the issues of how much non - member cocoa the buffer stock can purchase and price differentials for different varieties . 
Delegates were fairly confident the differences could be worked out in time to reach agreement tomorrow . 
Some consuming member nations , including Britain and Belgium , favour the buffer stock buying more than 10 pct non - member cocoa , delegates have said . 
The consumers argue that buying cheaper , lower quality non - member cocoas , particularly Malaysian , will most effectively support prices because that low quality cocoa is currently pressuring the

# Accessing a Corpus from the Web
### Dataset download: http://www.cs.cornell.edu/people/pabo/movie-review-data/mix20_rand700_tokens_cleaned.zip



In [25]:
# Imports
# This corpus is already categorized, so we load and read it directly with CategorizedPlaintextCorpusReader
from nltk.corpus import CategorizedPlaintextCorpusReader
from random import randint

In [27]:
# Load the files and print the categories and file ids
# Regular expressions match the file names and extract each category from the path
reader = CategorizedPlaintextCorpusReader(r'movie_reviews_tokens/tokens', r'.*\.txt', cat_pattern=r'(\w+)/*')
print(reader.categories())
print(reader.fileids())

['neg', 'pos']
['neg/cv000_tok-9611.txt', 'neg/cv001_tok-19324.txt', 'neg/cv002_tok-3321.txt', 'neg/cv003_tok-13044.txt', 'neg/cv004_tok-25944.txt', 'neg/cv005_tok-24602.txt', 'neg/cv006_tok-29539.txt', 'neg/cv007_tok-11669.txt', 'neg/cv008_tok-11555.txt', 'neg/cv009_tok-19587.txt', 'neg/cv010_tok-2188.txt', 'neg/cv011_tok-7845.txt', 'neg/cv012_tok-26965.txt', 'neg/cv013_tok-14854.txt', 'neg/cv014_tok-12391.txt', 'neg/cv015_tok-23730.txt', 'neg/cv016_tok-16970.txt', 'neg/cv017_tok-27221.txt', 'neg/cv018_tok-11502.txt', 'neg/cv019_tok-2003.txt', 'neg/cv020_tok-13096.txt', 'neg/cv021_tok-29141.txt', 'neg/cv022_tok-25633.txt', 'neg/cv023_tok-25625.txt', 'neg/cv024_tok-22867.txt', 'neg/cv025_tok-12991.txt', 'neg/cv026_tok-23590.txt', 'neg/cv027_tok-20123.txt', 'neg/cv028_tok-25883.txt', 'neg/cv029_tok-27815.txt', 'neg/cv030_tok-23788.txt', 'neg/cv031_tok-25886.txt', 'neg/cv032_tok-9567.txt', 'neg/cv033_tok-13710.txt', 'neg/cv034_tok-25395.txt', 'neg/cv035_tok-22978.txt', 'neg/cv036_tok-970
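
The way `cat_pattern` turns a file id into a category can be illustrated with the standard `re` module. This is a minimal sketch, assuming the reader matches the pattern at the start of each file id and uses the first captured group as the category. The helper name `category_from_fileid` is hypothetical and is not part of NLTK.

```python
import re

# Same pattern passed as cat_pattern in the cell above
CAT_PATTERN = r'(\w+)/*'

def category_from_fileid(fileid, pattern=CAT_PATTERN):
    """Return the first captured group when the pattern matches at the start of fileid."""
    match = re.match(pattern, fileid)
    return match.group(1) if match else None

# Two file ids taken from the outputs in this notebook
sample_ids = ['neg/cv000_tok-9611.txt', 'pos/cv112_tok-19726.txt']
categories = {fid: category_from_fileid(fid) for fid in sample_ids}
print(categories)
```

Because `\w+` stops at the first `/`, the captured group is the directory name, which is why the reader reports `neg` and `pos` as categories.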

In [28]:
# Split the file ids by category
posFiles = reader.fileids(categories='pos')
negFiles = reader.fileids(categories='neg')

In [35]:
# Randomly select one file from each category
fileP = posFiles[randint(0, len(posFiles) - 1)]
fileN = negFiles[randint(0, len(negFiles) - 1)]
print(fileP)
print(fileN)

pos/cv112_tok-19726.txt
neg/cv474_tok-20168.txt
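
An alternative to indexing with `randint` is `random.choice`, which picks an element directly and avoids having to pass the right list length. The sketch below uses short hypothetical lists in place of `posFiles` and `negFiles`, and fixes the seed only to make the run repeatable.

```python
import random

# Hypothetical stand-ins for posFiles and negFiles
pos_files = ['pos/cv000.txt', 'pos/cv001.txt', 'pos/cv002.txt']
neg_files = ['neg/cv000.txt', 'neg/cv001.txt']

random.seed(0)  # only for reproducibility in this sketch

# random.choice selects one element uniformly from a non-empty sequence
file_p = random.choice(pos_files)
file_n = random.choice(neg_files)
print(file_p)
print(file_n)
```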


In [37]:
# Print the words of the chosen file, starting a new line after each period
for w in reader.words(fileP):
    print(w + ' ', end='')
    if w == '.':
        print()

the always over - the - top underrated don knotts kicked off his first of a string of family films with this delightful comedy with a dash of horror . 
luther heggs ( knotts ) is a bug - eyed type - setter for the local newspaper , he ' s always dreamed of becoming a first - class reporter but newspaper manager ollie ( skip homeier ) makes fun of him and won ' t let him write any stories . 
luther is driving along when a man is hit over the head in front of the town ' s spooky old simmons mansion that everyone says is haunted . 
convinced that they can scare him away from the paper for good , ollie and george beckett ( dick sargent ) decide to let him write a story , what ' s the catch you ask ? he has to write it about his overnite stay in the simmons house . 
much to his horror he finds that things go bump in the night , the organ plays by itself , and a pair of garden shears are stabbed in the neck of a bleeding portrait . 
he writes of his experiences and the town declares him a he

In [38]:
# Same loop for the negative review
for w in reader.words(fileN):
    print(w + ' ', end='')
    if w == '.':
        print()

for more reviews and movie screensavers , visit http : // www . 
joblo . 
com / this movie is written by the man who is deemed to be " one of the hottest writers in hollywood " . 
he wrote the groundbreaking screenplay for scream ( 8 / 10 ) , then added the successful i know what you did last summer ( 7 . 
5 / 10 ) script to his mix , and also created the popular tv series " dawson ' s creek " . 
so when he asked to direct his first movie , based on his first ever script written , everyone and their grandma said " sure , go for it ! " . 
uhhm , my question is . 
. 
. 
did anyone bother reading this stupid script ? ? ? plot : ace student leigh ann watson is mistakenly caught with some cheating papers by the bitchiest teacher in the west , mrs . 
tingle , and set to lose her scholarship to college . 
when she and her friends visit the teacher at home in order to explain their side of the story , they end up tying her up , and slowly trying to talk some sense into the hardheaded woman . 


# Exploring Frequency Distribution in Chat Data

In [39]:
# Imports
import nltk
from nltk.corpus import webtext
nltk.download('webtext')

[nltk_data] Downloading package webtext to
[nltk_data]     /home/leogaller/nltk_data...
[nltk_data]   Package webtext is already up-to-date!


True

In [40]:
# Fileids
print("\n")
print(webtext.fileids())



['firefox.txt', 'grail.txt', 'overheard.txt', 'pirates.txt', 'singles.txt', 'wine.txt']


In [51]:
# Frequency distribution of a single file
fileid = 'singles.txt'
wbt_words = webtext.words(fileid)
fdist = nltk.FreqDist(wbt_words)
fdist

FreqDist({',': 539, '.': 353, '/': 110, 'for': 99, 'and': 74, 'to': 74, 'lady': 68, '-': 66, 'seeks': 60, 'a': 52, ...})

In [52]:
# Report (tokens are explained in more detail below)
# fdist.N() is the total number of tokens counted; fdist.B() would give the number of distinct tokens
print('\nNumber of occurrences of the most frequent token "', fdist.max(), '" : ', fdist[fdist.max()])
print('\nTotal number of tokens : ', fdist.N())
print('\nThe 10 most common tokens are')
print(fdist.most_common(10))
print("\n")


Number of occurrences of the most frequent token " , " :  539

Total number of tokens :  4867

The 10 most common tokens are
[(',', 539), ('.', 353), ('/', 110), ('for', 99), ('and', 74), ('to', 74), ('lady', 68), ('-', 66), ('seeks', 60), ('a', 52)]
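
`FreqDist` is a subclass of `collections.Counter`, so the difference between the total number of tokens and the number of distinct tokens can be checked on a small list with the standard library alone. The token list below is made up for illustration.

```python
from collections import Counter

# Hypothetical token list, not taken from the webtext corpus
tokens = ['lady', 'seeks', 'a', 'lady', ',', 'a', 'lady', '.']
counts = Counter(tokens)

total_tokens = sum(counts.values())  # corresponds to FreqDist.N()
distinct_tokens = len(counts)        # corresponds to FreqDist.B()
top_token, top_count = counts.most_common(1)[0]

print(total_tokens, distinct_tokens, top_token, top_count)
```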




# Tokenization
## The process of splitting a string into a list of pieces, or "tokens".
## A token is a whole unit. For example, a word is a token in a sentence, and a sentence is a token in a paragraph.
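
To make the two levels of tokens concrete, here is a naive sketch with regular expressions. It is not the algorithm NLTK uses: the split rules are assumptions chosen for this example, and they would break on abbreviations such as "U.S.".

```python
import re

text = "Hora de começar com o processamento de linguagem natural. Python vai facilitar nossa vida!"

# Naive sentence split: break on whitespace that follows ., ! or ?
sentences = re.split(r'(?<=[.!?])\s+', text)

# Naive word split: runs of word characters, or single punctuation marks
words = re.findall(r"\w+|[^\w\s]", sentences[0])

print(sentences)
print(words)
```

Each sentence is a token of the paragraph, and each word or punctuation mark is a token of its sentence.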



In [53]:
# Imports
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/leogaller/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [54]:
# Text
frase = "Hora de começar com o processamento de linguagem natural. Python vai facilitar nossa vida!"

In [55]:
# Sentence tokenization
sent_tokens = sent_tokenize(frase)
print(sent_tokens)

['Hora de começar com o processamento de linguagem natural.', 'Python vai facilitar nossa vida!']


In [56]:
# Word tokenization
word_tokens = word_tokenize(frase)
print(word_tokens)
print(word_tokenize("can't"))

['Hora', 'de', 'começar', 'com', 'o', 'processamento', 'de', 'linguagem', 'natural', '.', 'Python', 'vai', 'facilitar', 'nossa', 'vida', '!']
['ca', "n't"]


In [58]:
# Using alternative tokenizers
from nltk.tokenize import TreebankWordTokenizer

In [59]:
# TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize('Hello World.'))

['Hello', 'World', '.']


In [60]:
# Tokenization by punctuation
from nltk.tokenize import WordPunctTokenizer

In [61]:
tokenizer = WordPunctTokenizer()
print(tokenizer.tokenize("Can't is a contraction."))

['Can', "'", 't', 'is', 'a', 'contraction', '.']


In [62]:
# Tokenization with regular expressions
from nltk.tokenize import RegexpTokenizer

In [63]:
# RegexpTokenizer (raw string so that \w is not treated as a string escape)
tokenizer = RegexpTokenizer(r"[\w']+")
print(tokenizer.tokenize("Can't is a contraction."))

["Can't", 'is', 'a', 'contraction']


In [64]:
# Tokenization with regular expressions, using the helper function
from nltk.tokenize import regexp_tokenize

In [65]:
# regexp_tokenize (raw string so that \w is not treated as a string escape)
print(regexp_tokenize("Can't is a contraction.", r"[\w']+"))

["Can't", 'is', 'a', 'contraction']
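
The effect of the pattern on contractions can be reproduced with `re.findall` alone. This sketch assumes that, for these patterns, the tokenizers above behave like a plain `findall` over the text; the second pattern is the one usually associated with `WordPunctTokenizer`.

```python
import re

sentence = "Can't is a contraction."

# Apostrophes stay inside tokens; other punctuation is dropped
keep_contractions = re.findall(r"[\w']+", sentence)

# Word characters and punctuation runs become separate tokens
split_punctuation = re.findall(r"\w+|[^\w\s]+", sentence)

print(keep_contractions)
print(split_punctuation)
```

Which pattern to use depends on whether contractions and punctuation should survive as tokens for the later steps.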


# Remoção de Stopwords

### Stopwords are common words that normally contribute little to the meaning of a sentence, at least for the purposes of information and
### natural language processing. Examples are "the" and "a" (in English) or "o/a" and "um/uma" (in Portuguese).
### Many search engines filter out these words (stopwords) to save space in their search indexes.



In [67]:
# Imports
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/leogaller/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [68]:
# English stop words
english_stops = set(stopwords.words('english'))

In [69]:
# Word list
words = ["Can't", 'is', 'a', 'contraction']

In [70]:
# List comprehension that drops the English stop words from the word list
[word for word in words if word not in english_stops]

["Can't", 'contraction']

In [73]:
# Portuguese stop words
portuguese_stops = set(stopwords.words('portuguese'))

In [74]:
# Word list
palavras = ["Estou", 'estudando', 'um', 'tema', 'interesante', 'em', 'PLN']

In [75]:
# List comprehension that drops the Portuguese stop words from the word list
[ palavra for palavra in palavras if palavra not in portuguese_stops ]

['Estou', 'estudando', 'tema', 'interesante', 'PLN']
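
Note that 'Estou' survives the filter above even though 'estou' is in the Portuguese list: the membership test is case-sensitive and the list is lowercase. A minimal sketch of a case-insensitive filter follows, using a small hand-picked subset of the stopword list (an assumption made so the example runs without NLTK data).

```python
# Hypothetical subset of the Portuguese stopword list, for illustration only
stop_subset = {'estou', 'um', 'em'}
palavras = ['Estou', 'estudando', 'um', 'tema', 'interesante', 'em', 'PLN']

# Case-sensitive comparison keeps the capitalized 'Estou'
case_sensitive = [p for p in palavras if p not in stop_subset]

# Lowercasing each word before the test removes it as well
case_insensitive = [p for p in palavras if p.lower() not in stop_subset]

print(case_sensitive)
print(case_insensitive)
```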

In [76]:
# Languages with an available stopword list
stopwords.fileids()

['arabic',
 'azerbaijani',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

In [77]:
# Portuguese stop words and how many there are
print(stopwords.words('portuguese'))
len(stopwords.words('portuguese'))

['de', 'a', 'o', 'que', 'e', 'é', 'do', 'da', 'em', 'um', 'para', 'com', 'não', 'uma', 'os', 'no', 'se', 'na', 'por', 'mais', 'as', 'dos', 'como', 'mas', 'ao', 'ele', 'das', 'à', 'seu', 'sua', 'ou', 'quando', 'muito', 'nos', 'já', 'eu', 'também', 'só', 'pelo', 'pela', 'até', 'isso', 'ela', 'entre', 'depois', 'sem', 'mesmo', 'aos', 'seus', 'quem', 'nas', 'me', 'esse', 'eles', 'você', 'essa', 'num', 'nem', 'suas', 'meu', 'às', 'minha', 'numa', 'pelos', 'elas', 'qual', 'nós', 'lhe', 'deles', 'essas', 'esses', 'pelas', 'este', 'dele', 'tu', 'te', 'vocês', 'vos', 'lhes', 'meus', 'minhas', 'teu', 'tua', 'teus', 'tuas', 'nosso', 'nossa', 'nossos', 'nossas', 'dela', 'delas', 'esta', 'estes', 'estas', 'aquele', 'aquela', 'aqueles', 'aquelas', 'isto', 'aquilo', 'estou', 'está', 'estamos', 'estão', 'estive', 'esteve', 'estivemos', 'estiveram', 'estava', 'estávamos', 'estavam', 'estivera', 'estivéramos', 'esteja', 'estejamos', 'estejam', 'estivesse', 'estivéssemos', 'estivessem', 'estiver', 'estiv

204

# Lemmatization

#### In linguistics, lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item.
#### In computational linguistics, lemmatization is the algorithmic process of determining the lemma of a given word.
#### Because the process can involve complex tasks, such as understanding context and determining the part of speech of a word in a sentence
#### (which requires, for example, knowledge of the grammar of a language), implementing a lemmatizer for a new language can be difficult.

#### In many languages, words appear in several inflected forms.
#### For example, in English, the verb 'to walk' may appear as 'walk', 'walked', 'walks', 'walking'.
#### The base form, 'walk', which one would look up in a dictionary, is called the lemma of the word.
#### The combination of the base form with the part of speech is often called the lexeme of the word.

#### Lemmatization is closely related to stemming.
#### The difference is that a stemmer operates on a single word without knowledge of the context and therefore cannot discriminate between words
#### that have different meanings depending on the part of speech. However, stemmers are generally easier to implement and run
#### faster, and the reduced accuracy may not matter for some applications.

#### Stemming and lemmatization are similar operations. The main difference between them is that stemming can produce strings that are not real words,
#### whereas lemmas are real words.

#### So a root stem may not be something you can look up in a dictionary, but a lemma is.
#### Sometimes you will end up with a very similar word, and sometimes with a completely different one. Let's look at some examples.
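
The contrast between a stem and a lemma can be shown with a toy example. This is only a sketch: the suffix list and the lookup table are assumptions invented for illustration, and real stemmers and lemmatizers are considerably more elaborate.

```python
# Toy suffix list, checked in this order (hypothetical, not a real stemming algorithm)
SUFFIXES = ('ing', 'es', 's')

# Toy dictionary standing in for a real lexical database
LEMMAS = {'studies': 'study', 'walking': 'walk', 'churches': 'church'}

def toy_stem(word):
    """Strip the first matching suffix without consulting any dictionary."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_lemma(word):
    """Look the word up in the dictionary and fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ('studies', 'walking', 'churches'):
    print(w, toy_stem(w), toy_lemma(w))
```

For 'studies' the toy stemmer returns 'studi', which is not a dictionary word, while the lookup returns the lemma 'study'. For 'walking' and 'churches' both approaches agree.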

In [78]:
# Imports
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/leogaller/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [79]:
# Lemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

In [80]:
# With default arguments (words are treated as nouns)
print(wordnet_lemmatizer.lemmatize('cooking'))
print(wordnet_lemmatizer.lemmatize('dogs'))
print(wordnet_lemmatizer.lemmatize('churches'))
print(wordnet_lemmatizer.lemmatize('are'))
print(wordnet_lemmatizer.lemmatize('is'))

cooking
dog
church
are
is


In [93]:
# pos='v' treats each word as a verb
print(wordnet_lemmatizer.lemmatize('is', pos='v'))
print(wordnet_lemmatizer.lemmatize('are', pos='v'))
print(wordnet_lemmatizer.lemmatize('cooking', pos='v'))

be
be
cook
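
Since the result depends on the `pos` argument, a common step before lemmatizing is to convert part-of-speech tags into the single letters WordNet expects ('n', 'v', 'a', 'r'). The sketch below assumes Penn Treebank style tags, which are not used elsewhere in this notebook, and the helper name `penn_to_wordnet_pos` is hypothetical.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank style tag to a WordNet part-of-speech letter."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    # Nouns and everything else fall back to 'n', the lemmatizer's default
    return 'n'

# Hypothetical (word, tag) pairs for illustration
tagged = [('cooking', 'VBG'), ('dogs', 'NNS'), ('quick', 'JJ'), ('quickly', 'RB')]
mapped = [(word, penn_to_wordnet_pos(tag)) for word, tag in tagged]
print(mapped)
```

Each letter could then be passed as `pos` to `wordnet_lemmatizer.lemmatize`, for example `lemmatize('cooking', pos='v')` as in the cell above.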
