If this is your first time working with a Jupyter Notebook, here are some basic instructions. <br>
1) The notebook is made of explanatory segments of text followed by snippets of code. It's kind of like an essay that's alive. <br>
2) You're meant to read down the document and run the code cells in order. To run a cell, click on it (it will be highlighted in blue), then click the 'run cell' button in the toolbar or choose Cell > Run Cells from the dropdown menu. <br>
3) If you run into errors, start again from the beginning (or start debugging and share the corrections with the author). <br>
That's all there is to it.<br>

This is a short exploration of the [Natural Language Toolkit (NLTK)](https://www.nltk.org/), which "is a leading platform for building Python programs to work with human language data."
<br>



To begin with, let's get some text. I have chosen Indira Gandhi's [speech at the Stockholm Conference on the Human Environment in 1972](http://lasulawsenvironmental.blogspot.com/2012/07/indira-gandhis-speech-at-stockholm.html).

In [36]:
text = '''Man And Environment 
 
Smt. Indira Gandhi 
(late Prime Minister of India) 
Plenary Session of United Nations Conference on Human Environment 
Stockholm 14th June, 1972 

It is indeed an honour to address this Conference-in itself a fresh expression of the spirit which created the United Nations-concern for the present and future welfare of humanity. It does not aim merely at securing limited agreements but at establishing peace and harmony in life-among all races and with Nature. This gathering represents man's earnest endeavour to understand his own condition and to prolong his tenancy of this planet. A vast amount of detailed preparatory work has gone into the convening of this Conference guided by the dynamic personality of Mr. Maurice Strong the Secretary General.

I have had the good fortune of growing up with a sense of kinship with nature in all its manifestations. Birds, plants, stones were companions and, sleeping under the star-strewn sky, I became familiar with the names and movements of the constellations. But my deep interest in this our `only earth' was not for itself but as a fit home for man.

One cannot be truly human and civilized unless one looks upon not only all fellow-men but all creation with the eyes of a friend. Throughout India, edicts carved on rocks and iron pillars are reminders that 22 centuries ago the Emperor Ashoka defined a King's duty as not merely to protect citizens and punish wrongdoers but also to preserve animal life and forest trees. Ashoka was the first and perhaps the only monarch until very recently, to forbid the killing of a large number of species of animals for sport or food, foreshadowing some of the concerns of this Conference. He went further, regretting the carnage of his military conquests and enjoining upon his successors to find "their only pleasure in the peace that comes through righteousness".

Along with the rest of mankind, we in India--in spite of Ashoka have been guilty of wanton disregard for the sources of our sustenance. We share you concern at the rapid deterioration of flora and fauna. Some of our own wildlife has been wiped out, miles of forests with beautiful old trees, mute witnesses of history, have been destroyed. Even though our industrial development is in its infancy, and at its most difficult stage, we are taking various steps to deal with incipient environmental imbalances. The more so because of our concern for the human being--a species which is also imperiled. In poverty he is threatened by malnutrition and disease, in weakness by war, in richness by the pollution brought about by his own prosperity.

It is said that in country after country, progress should become synonymous with an assault on nature. We who are a part of nature and dependent on her for very need, speak constantly about "exploiting" nature. When the highest mountain in the world was climber in 1953, Jawaharlal Nehru objected to the phrase "conquest of Everest" which he thought was arrogant. It is surprising that this lack of consideration and the constant need to prove one's superiority should be projected onto our treatment of our fellowmen? I remember Edward Thompson, a British writer and a good friend of India, once telling Mr. Gandhi that wildlife was fast disappearing. Remarked the Mahatma--"It is decreasing in the jungles but it is increasing in the town".

We are gathered here under the aegis of the United Nations. We are supposed to belong to the same family sharing common traits and impelled by the same basic desires, yet we inhabit a divided world.

How can it be otherwise? There is still no recognition of the equality of man or respect for him as an individual. In matters of colour and race, religion and custom, society is governed by prejudice. Tensions arise because of man's aggressiveness and notions of superiority. The power of the big stick prevails and it is used not in favour of fair play or beauty, but to chase imaginary windmills--to assume the right to interfere in the affairs of others, and to arrogate authority for action which would not normally be allowed. Many of the advanced countries of today have reached their present affluence by their domination over other races and countries, the exploitation of their own natural resources. They got a head start through sheer ruthlessness, undisturbed by feelings of compassion or by abstract theories of freedom, equality or justice. The stirrings of demands for the political rights of citizens, and the economic rights of the toiler came after considerable advance had been made. The riches and the labour of the colonized countries played no small part in the industrialization and prosperity of the West. Now, as we struggle to create a better life for our people, it is in vastly different circumstances, for obviously in today's eagle-eyed watchfulness we cannot indulge in such practices even for a worthwhile purpose. We are bound by our own ideals. We owe allegiance to the principles of the rights of workers and the norms enshrined in the charters of international organizations. Above all we are answerable to the millions of politically awakened citizens in our countries. All these make progress costlier and more complicated.

On the one hand the rich look askance at our continuing poverty--on the other, they warn us against their own methods. We do not wish to impoverish the environment any further and yet we cannot for a moment forget the grim poverty of large numbers of people. Are not poverty and need the greatest polluters? For instance, unless we are in a position to provide employment and purchasing power for the daily necessities of the tribal people and those who live in or around our jungles, we cannot prevent them from combing the forest for food and livelihood; from poaching and from despoiling the vegetation. When they themselves feel deprived, how can we urge the preservation of animals? How can we speak to those who live in villages and in slums about keeping the oceans, the rivers and the air clean when their own lives are contaminated at the source? The environment cannot be improved in conditions of poverty. Nor can poverty be eradicated without the use of science and technology.

Must there be conflict between technology and a truly better world or between enlightenment of the spirit and a higher standard of living? Foreigners sometimes ask what to us seems a very strange question, whether progress in India would not mean diminishing of her spirituality or her values. Is spiritual quality so superficial as to be dependent upon the lack of material comfort? As a country we are not more or less spiritual than any other but traditionally our people have respected the spirit of detachment and renunciation. Historically, our great spiritual discoveries were made during periods of comparative affluence. The doctrines of detachment from possessions were developed not as rationalization of deprivation but to prevent comfort and ease from dulling the senses. Spirituality means the enrichment of the spirit, the strengthening of ones inner resources and the stretching of one's range of experience. It is the ability to be still in the midst of activity and vibrantly alive in moments of calm; to separate the essence from circumstances; to accept joy and sorrow with some equanimity. Perception and compassion are the marks of true spirituality.

I am reminded of an incident in one of our tribal areas. The vociferous demand of elder tribal chiefs that their customs should be left undisturbed found support from noted anthropologists. In its anxiety that the majority should not submerge the many ethnic, racial and cultural groups in our country, the Government of India largely accepted this advice. I was amongst those who entirely approved. However, a visit to remote part of our north-east frontier brought me in touch with a different point of view-the protest of the younger elements that while the rest of India was on the way to modernization they were being preserved as museum pieces. Could we not say the same to the affluent nations?

For the last quarter of a century, we have been engaged in an enterprise unparalled in human history--the provision of basic needs to one-sixth of mankind within the span of one or two generations. When we launched on that effort our early planners had more than the usual gaps to fill. There was not enough data and no helpful books. No guidance could be sought from the experience of other countries whose conditions--political, economic, social and technological--were altogether different. Planning in the sense we were innovating, had never been used in the context of a mixed economy. But we could not wait. The need to improve the conditions of our people was pressing. Planning and action, the improvement of data leading to better planning and better action, all this was a continuous and overlapping process. Our industrialization tended to follow the paths which the more advanced countries had traversed earlier. With the advance of the 60's and particularly during the last five years, we have encountered a bewildering collection of problems, some due to our shortcomings but many inherent in the process and in existing attitudes. The feeling is growing that we should re-order our priorities and move away from the single-dimensional model which has viewed growth from certain limited angles, which seems to have given a higher place to things rather than to persons and which has increased our wants rather than our enjoyment. We should have a more comprehensive approach to life, centred on man not as a statistic but an individual with many sides to his personality. The solution of these problems cannot be isolated phenomena of marginal importance but must be an integral part of the unfolding of the very process of development.

The extreme forms in which questions of population or environmental pollution are posed, obscure the total view of political, economic and social situations. The Government of India is one of the few which has an officially sponsored programme of family planning and this is making some progress. We believe that planned families will make for a healthier and more conscious population. But we know also that no programme of population control can be effective without education and without a visible rise in the standard of living. Our own programmes have succeeded in the urban or semi-urban areas. To the very poor, every child is an earner and a helper. We ar experimenting with new approaches and the family planning programme is being combined with those of maternity and child welfare, nutrition and development in general.

It is an over--simplification to blame all the world's problems on increasing population. Countries with but a small fraction of the world population consume the bulk of the world's production of minerals, fossil fuels and so on. Thus we see that when it comes to the depletion of natural resources and environmental pollution, the increase of one inhabitant in an affluent country., at his level of living, is equivalent to an increase of many Asian, Africans or Latin Americans at their current material levels of living.

The inherent conflict is not between conservation and development, but between environment and reckless exploitation of man and earth in the name of efficiency. Historians tell us that the modern age began with the will to freedom of the individual. And the individual came to believe that the had rights with no corresponding obligations. The man who got ahead was the one who commanded admiration. No questions were asked as to the methods employed or the price which others had to pay. The industrial civilization has promoted the concept of the efficient man, he whose entire energies are concentrated on producing more in a given unit of time and from a given unit of manpower. Groups or individuals who ar less competitive and according to this test, less efficient are regarded as lesser breeds--for example the older civilizations, the black and brown peoples, women and certain professions. Obsolescence is built into production, and efficiency is based on the creation of goods which are not really needed and which cannot be disposed of when discarded. What price such efficiency now, and is not recklessness a more appropriate term for such a behaviour?

All the `isms' of the modern age--even those which in theory disown the private profit principle--assume that man's cardinal interest is acquisition. The profit motive, individual or collectives, seems to overshadow all else. This overriding concern with self and Today is the basic cause of the ecological crisis.

Pollution is not a technical problem. The fault lies not in science and technology as such but in the sense of values of the contemporary world which ignores the rights of others and is oblivious of the longer perspective.

There are grave misgivings that the discussion on ecology may be designed to distract attention from the problems of war and poverty. We have to prove to the disinherited majority of the world that ecology and conservation will not work against their interest but will bring an improvement in their lives. To withhold technology from them would deprive them of vast resources of energy and knowledge. This is no longer feasible not will it be acceptable.

The environmental problems of developing countries are not the side effects of excessive industrialization but reflect the inadequacy of development. The rich countries may look upon development as the cause of environmental destruction, but to us it is one of the primary means of improving the environment for living, or providing food, water, sanitation and shelter; of making the deserts green and the mountains habitable. The research and perseverance of dedicated people have given us an insight which is likely to play an important part in the shaping of our future plans. We see that however much man hankers after material goods, they can never give him full satisfaction. Thus the higher standard of living must be achieved without alienating people from their heritage and without despoiling nature of its beauty, freshness and purity so essential to our lives.

The most urgent and basic question is that of peace. Nothing is so pointless as modern warfare. Nothing destroys so instantly, so completely as the diabolic weapons which not only kill but maim and deform the living and the yet to be born; which poison the land, leaving long trails of ugliness, barrenness and hopeless desolation. What ecological projects can survive a war? The Prime Minister of Sweden, Mr. Olof Palme, has already drawn the attention of the Conference to this in powerful words.

It is clear that the environmental crisis which is confronting the world, will profoundly alter the future destiny or our planet. No one among us, whatever our status, strength or circumstance can remain unaffected. The process of change challenges present international policies. Will the growing awareness of "one earth" and "one environment' guide us to the concept of "one humanity"? Will there be a more equitable sharing of environmental costs and greater international interest in the accelerated progress of the less developed world? Or, will it remain confined to a narrow concern, based on exclusive self-sufficiency?

The first essays in narrowing economic and technological disparities have not succeeded because the policies of aid were made to subserve the equations of power. We hope that the renewed emphasis on self-reliance, brought a about by the change in the climate for aid, will also promote search for new criteria of human satisfaction. In the meantime, the ecological crises should not add to the burdens of the weaker nations by introducing new considerations in the political and trade policies of rich nations. It would be ironic if the fight against pollution were to be converted into another business, out of which a few companies, corporations, or nations would make profits at the cost of the many. Here is a branch of experimentation and discovery in which scientist of all nations should take interest. They should ensure that their findings are available to all nations, unrestricted by patents. I am glad that the Conference has given thought on this aspect of the problem.

Life is one and the world is one, and all these questions are inter-linked. The population explosion; poverty; ignorance and disease, the pollution of our surroundings, the stockpiling of nuclear weapons and biological and chemical agents of destruction are all parts of a vicious circle. Each is important and urgent but dealing with them one by one would be wasted effort.
It serves little purpose to dwell on the past or to apportion blame, no one of us is blameless. If some are able to dominate over others, it is at least partially due to the weakness, the lack of unity and the temptation of gaining some advantage on the part of those who submit. If the prosperous have been exploiting the needy, can we honestly claim that in our own societies people do not take advantage of the weaker sections? We must re-evaluate the fundamentals on which our respective civic societies are based and the ideals by which they are sustained. If there is to be a change of heart, a change of direction and methods of functioning, it is not an organization or a country-no matter how well intentioned--which can achieve it. While each country must deal with that aspect of the problem which is most relevant to it, it is obvious that all countries must unite in an overall endeavour. There is no alternative to a cooperative approach on a global scale to the entire spectrum of our problems.

I have referred to some problems which seem to me to be the underlying causes of the present crises in our civilization. This is not in the expectation that this Conference can achieve miracles or solve all the world's difficulties, but in the hope that the opinions of each national will be kept in focus, that these problems will be viewed in perspective and each project devised as part of the whole.

On a previous occasion I have spoken of the unfinished revolution in our countries I am now convinced that this can be taken to its culmination when it is accompanied by a revolution in social thinking. In 1968 at the 14th General Conference of UNESCO the Indian delegation, along with others, proposed a new and major programme entitled "a design for living". This is essential to grasp the full implications of technical advance and its impact on different sections and groups. We do not want to put the clock back or resign ourselves to a simplistic natural state. We want new directions in the wiser use of the knowledge and tools with which science has equipped us. And this cannot be just one upsurge but a continuous search into cause and effect and an unending effort to match technology with higher levels of thinking. We must concern ourselves not only with the kind of world we want but also with what kind of man should inhabit it. Surely we do not desire a society divided into those who condition and those who are conditioned. We want thinking people capable of spontaneous self-directed activity, people who are interested and interesting, and who are imbued with compassion and concern for others.

It will not be easy for large societies to change their style of living. They cannot be coerced to do so, nor can governmental action suffice. People can be motivated and urged to participate in better alternatives.

It has been my experience that people who are at cross purposes with nature are cynical about mankind and ill-at-ease with themselves. Modern man must re-establish an unbroken link with nature and with life. He must again learn to invoke the energy of growing things and to recognize, as did the ancients in India centuries ago, that one can take from the Earth and the atmosphere only so much as one puts back into them. In their hymn to Earth, the sages of the Atharva Veda chanted-I quote,
"What of thee I dig out, let that quickly grow over, Let me
not hit thy vitals, or thy heart".
So can man himself be vital and of good heart and
conscious of his responsibility'''

Following on last week's discussion, I wanted to start with the creation of tokens. Below you'll find a few lines of code that remove all the punctuation from the text and give us a list of all the words in the text. A "list" is a common way of working with data in Python. Each word is a string of characters (hence the quotation marks around 'Indira' and 'Gandhi'). The list begins and ends with square brackets [ and ]. Each value in the list is separated by a comma. Because the pattern `\w+` only matches letters, digits, and underscores, a word such as "man's" is split into 'man' and 's'.

In [38]:
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)
print(tokens)
#nltk.pos_tag(tokens)

['Man', 'And', 'Environment', 'Smt', 'Indira', 'Gandhi', 'late', 'Prime', 'Minister', 'of', 'India', 'Plenary', 'Session', 'of', 'United', 'Nations', 'Conference', 'on', 'Human', 'Environment', 'Stockholm', '14th', 'June', '1972', 'It', 'is', 'indeed', 'an', 'honour', 'to', 'address', 'this', 'Conference', 'in', 'itself', 'a', 'fresh', 'expression', 'of', 'the', 'spirit', 'which', 'created', 'the', 'United', 'Nations', 'concern', 'for', 'the', 'present', 'and', 'future', 'welfare', 'of', 'humanity', 'It', 'does', 'not', 'aim', 'merely', 'at', 'securing', 'limited', 'agreements', 'but', 'at', 'establishing', 'peace', 'and', 'harmony', 'in', 'life', 'among', 'all', 'races', 'and', 'with', 'Nature', 'This', 'gathering', 'represents', 'man', 's', 'earnest', 'endeavour', 'to', 'understand', 'his', 'own', 'condition', 'and', 'to', 'prolong', 'his', 'tenancy', 'of', 'this', 'planet', 'A', 'vast', 'amount', 'of', 'detailed', 'preparatory', 'work', 'has', 'gone', 'into', 'the', 'convening', 'of
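
To make the effect of the `\w+` pattern concrete, here is a minimal sketch that uses Python's built-in `re` module as a stand-in for `RegexpTokenizer` (an assumption made so the snippet runs without NLTK). The sample sentence is stitched together from phrases in the speech, and the second, apostrophe-preserving pattern is a hypothetical alternative that is not part of the original notebook.

```python
import re

sample = "This gathering represents man's earnest endeavour under the star-strewn sky."

# \w+ matches runs of letters, digits, and underscores, so punctuation is
# dropped and "man's" is split into 'man' and 's'.
word_tokens = re.findall(r"\w+", sample)
print(word_tokens)

# Hypothetical alternative: allow one internal apostrophe so that
# possessives and contractions stay in one piece.
apostrophe_tokens = re.findall(r"\w+(?:'\w+)?", sample)
print(apostrophe_tokens)
```

This splitting is where the stray 's' tokens in the word list come from. Whether to keep possessives together depends on what you plan to count.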

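If we want to remove stop words, we can filter them out of the list. Calling the list's .remove() method inside a loop over that same list can skip the token that follows each removal, so the cell below builds a new list instead.

In [40]:
stop_words = ['the','of','and','to', 'a', 'is', 'not', 'that','be','our','with','are','which','for','The','or','we','We']
# Keep every token that is not a stop word; building a new list avoids
# modifying `tokens` while it is being iterated over.
tokens = [token for token in tokens if token not in stop_words]
print(tokens)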

['Man', 'And', 'Environment', 'Smt', 'Indira', 'Gandhi', 'late', 'Prime', 'Minister', 'India', 'Plenary', 'Session', 'United', 'Nations', 'Conference', 'on', 'Human', 'Environment', 'Stockholm', '14th', 'June', '1972', 'It', 'indeed', 'an', 'honour', 'address', 'this', 'Conference', 'in', 'itself', 'fresh', 'expression', 'spirit', 'created', 'United', 'Nations', 'concern', 'present', 'future', 'welfare', 'humanity', 'It', 'does', 'aim', 'merely', 'at', 'securing', 'limited', 'agreements', 'but', 'at', 'establishing', 'peace', 'harmony', 'in', 'life', 'among', 'all', 'races', 'Nature', 'This', 'gathering', 'represents', 'man', 's', 'earnest', 'endeavour', 'understand', 'his', 'own', 'condition', 'prolong', 'his', 'tenancy', 'this', 'planet', 'A', 'vast', 'amount', 'detailed', 'preparatory', 'work', 'has', 'gone', 'into', 'convening', 'this', 'Conference', 'guided', 'by', 'dynamic', 'personality', 'Mr', 'Maurice', 'Strong', 'Secretary', 'General', 'I', 'have', 'had', 'good', 'fortune', '
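
Removing items from a list while looping over it is a common pitfall. The sketch below uses a four-word sample (an illustrative assumption, not the full token list) to compare an in-place removal loop with a filtered copy.

```python
stop_words = {"the", "of"}

# In-place removal: after 'of' is removed, the list shifts left and the loop
# moves on to the next index, so 'the' is never examined.
in_place = ["expression", "of", "the", "spirit"]
for token in in_place:
    if token in stop_words:
        in_place.remove(token)
print(in_place)

# Filtering into a new list examines every token exactly once.
original = ["expression", "of", "the", "spirit"]
filtered = [token for token in original if token not in stop_words]
print(filtered)
```

The in-place loop leaves 'the' behind, while the filtered copy drops both stop words. Using a set for the stop words also makes each membership check fast, which matters for longer texts.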

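We can also cut out the shorter and less frequent words. The cell below keeps only words that are longer than three characters and appear more than five times in the text. It relies on `fdist1`, the frequency distribution built in the `FreqDist` cell that follows, so run that cell first. Try changing the 3 and 5 in the script to see how the results vary.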

In [45]:
sorted(w for w in set(fdist1) if len(w) > 3 and fdist1[w] > 5)

['Conference',
 'India',
 'been',
 'cannot',
 'concern',
 'countries',
 'country',
 'development',
 'environmental',
 'from',
 'have',
 'into',
 'living',
 'more',
 'must',
 'nations',
 'nature',
 'only',
 'others',
 'part',
 'people',
 'population',
 'poverty',
 'problems',
 'should',
 'some',
 'their',
 'this',
 'those',
 'were',
 'will',
 'world',
 'would']
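
The length-and-frequency filter can be tried on a small scale without NLTK. This sketch assumes `collections.Counter` as a stand-in for `FreqDist` (which is itself a `Counter` subclass). The helper name `frequent_long_words` and the sample sentence are hypothetical.

```python
from collections import Counter


def frequent_long_words(counts, min_len, min_count):
    """Return, sorted, the words longer than min_len that occur more than min_count times."""
    return sorted(w for w in counts if len(w) > min_len and counts[w] > min_count)


sample_tokens = "the world and the people of the world share one world".split()
sample_counts = Counter(sample_tokens)

# Only 'world' is longer than three characters and appears more than once.
print(frequent_long_words(sample_counts, 3, 1))

# Loosening the length threshold lets the short but frequent 'the' through.
print(frequent_long_words(sample_counts, 2, 2))
```

Both comparisons are strict, mirroring the `>` signs in the cell above, so a threshold of 3 keeps words of four or more characters.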

In [46]:
import nltk
from nltk.probability import FreqDist

fdist1 = FreqDist(tokens)
fdist1.most_common(50)

[('in', 70),
 ('one', 23),
 ('but', 22),
 ('on', 21),
 ('an', 19),
 ('by', 17),
 ('as', 17),
 ('can', 16),
 ('have', 16),
 ('from', 15),
 ('this', 15),
 ('it', 15),
 ('all', 14),
 ('their', 14),
 ('man', 13),
 ('who', 13),
 ('at', 13),
 ('world', 13),
 ('It', 12),
 ('will', 11),
 ('s', 11),
 ('I', 11),
 ('people', 11),
 ('was', 11),
 ('more', 10),
 ('should', 10),
 ('countries', 10),
 ('has', 10),
 ('so', 9),
 ('India', 9),
 ('living', 9),
 ('were', 9),
 ('cannot', 9),
 ('own', 9),
 ('us', 9),
 ('no', 9),
 ('those', 8),
 ('poverty', 8),
 ('Conference', 8),
 ('his', 8),
 ('been', 8),
 ('must', 8),
 ('problems', 8),
 ('its', 7),
 ('environmental', 7),
 ('part', 7),
 ('only', 7),
 ('concern', 7),
 ('nature', 7),
 ('country', 7)]
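
The list above counts 'It' and 'it' as separate words because the tokens are case-sensitive. One possible refinement, sketched here with `collections.Counter` and a made-up token list (both assumptions, not part of the original notebook), is to lowercase the tokens before counting.

```python
from collections import Counter

sample_tokens = ["It", "is", "said", "that", "it", "is", "The", "world", "the", "World"]

# Case-sensitive counting keeps 'It' and 'it' apart.
case_sensitive = Counter(sample_tokens)
print(case_sensitive["It"], case_sensitive["it"])

# Lowercasing first merges the capitalized and lowercase variants.
case_folded = Counter(token.lower() for token in sample_tokens)
print(case_folded["it"], case_folded["world"])
```

Lowercasing also means the stop-word list no longer needs separate entries such as 'The' and 'We', though it does merge proper nouns with ordinary words that share their spelling.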

In [57]:
#Big and frequent words
sorted(w for w in set(fdist1) if len(w) > 7 and fdist1[w] > 2)

['Conference',
 'citizens',
 'compassion',
 'conditions',
 'countries',
 'development',
 'different',
 'ecological',
 'economic',
 'efficiency',
 'environment',
 'environmental',
 'experience',
 'individual',
 'industrialization',
 'interest',
 'international',
 'material',
 'planning',
 'policies',
 'political',
 'pollution',
 'population',
 'problems',
 'programme',
 'progress',
 'questions',
 'resources',
 'societies',
 'spiritual',
 'standard',
 'technology',
 'thinking']

In [32]:
import nltk
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ajanco/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [33]:
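import gensim
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize

# Split the speech into a list of sentence strings.
sentences = sent_tokenize(text) 
#print(sentences)
# NOTE: Word2Vec expects each sentence as a list of tokens. Each item here is a
# plain string, so the model treats every character as a token.
model = Word2Vec(sentences)
# gensim < 4.0 API; in gensim 4 and later use list(model.wv.key_to_index).
words = list(model.wv.vocab)
# Save the model to disk and load it back.
model.save('word2vec.model')
model = Word2Vec.load('word2vec.model')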


In [34]:
# Query through model.wv; calling most_similar on the model itself is deprecated.
model.wv.most_similar('a', topn=5)

[('I', 0.9996715188026428),
 ('y', 0.999638020992279),
 ('i', 0.9996201992034912),
 ('-', 0.9996087551116943),
 ('"', 0.9995754361152649)]
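
The nearest neighbours of 'a' above are single characters and punctuation marks, which suggests the model was trained on characters, not words. Word2Vec expects each sentence as a list of tokens. The sketch below uses only the standard library and two sample sentences from the speech to show the difference; the variable names are hypothetical, and the final commented-out call is an assumption about how the result would be passed to gensim.

```python
import re

sample_sentences = [
    "We are gathered here under the aegis of the United Nations.",
    "How can it be otherwise?",
]

# Iterating over a string yields single characters, which is what a model
# sees when it is handed raw sentence strings.
print(list(sample_sentences[1])[:6])

# One list of lowercase word tokens per sentence.
tokenized_sentences = [re.findall(r"\w+", s.lower()) for s in sample_sentences]
print(tokenized_sentences[1])

# Assumption: with gensim installed, this list would be passed as
# Word2Vec(tokenized_sentences, min_count=1)
```

Even with word-level input, a single speech is a very small corpus for training word embeddings, so any similarity scores should be read with caution.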

In [35]:
# Print only the tokens that are at least three characters long.
for token in tokens:
    if len(token) >= 3:
        print(token)

Man
And
Environment
Smt
Indira
Gandhi
late
Prime
Minister
India
Plenary
Session
United
Nations
Conference
Human
Environment
Stockholm
14th
June
1972
indeed
honour
address
this
Conference
itself
fresh
expression
spirit
created
United
Nations
concern
present
future
welfare
humanity
does
aim
merely
securing
limited
agreements
but
establishing
peace
harmony
life
among
all
races
Nature
This
gathering
represents
man
earnest
endeavour
understand
his
own
condition
prolong
his
tenancy
this
planet
vast
amount
detailed
preparatory
work
has
gone
into
convening
this
Conference
guided
dynamic
personality
Maurice
Strong
Secretary
General
have
had
good
fortune
growing
sense
kinship
nature
all
its
manifestations
Birds
plants
stones
were
companions
sleeping
under
star
strewn
sky
became
familiar
names
movements
constellations
But
deep
interest
this
only
earth
was
itself
but
fit
home
man
One
cannot
truly
human
civilized
unless
one
looks
upon
only
all
fellow
men
but
all
creation
eyes
friend
Throughout
Indi