## Interactive Python (IPython) with Jupyter Notebooks


Let's get our environment ready:

- Set up with [Anaconda - https://www.continuum.io/downloads](https://www.continuum.io/downloads). It's free.


## Import the NLTK library and example data
```python
import nltk
from nltk.book import *  
texts()
```

### Exploring vocabulary:  Useful functions
```len(text1)``` gives the number of tokens in the text. This is the total count of words and punctuation marks, duplicates included.

```set(text2)``` gives the distinct tokens in the text, with duplicates removed. The result is a set, not a list, so it has no fixed order.

```sorted(text4)``` returns a new list with the items in alphabetical order, with punctuation symbols and capitalised words first.
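
These three calls behave the same way on any Python list, so they can be tried without loading the NLTK corpora. A minimal sketch, assuming a small hand-made token list (`tokens` is a hypothetical stand-in for `text1`). The last step computes lexical diversity, the ratio of distinct tokens to total tokens:

```python
# Hypothetical stand-in for an NLTK text: a short list of tokens.
tokens = ['Call', 'me', 'Ishmael', '.', 'Call', 'me', 'again', '.']

total = len(tokens)            # total number of tokens, duplicates included
vocabulary = set(tokens)       # distinct tokens (a set, so unordered)
ordered = sorted(vocabulary)   # new list: punctuation, capitalised words, then lowercase

# Lexical diversity: distinct tokens divided by total tokens.
diversity = len(vocabulary) / total

print(total, len(vocabulary), ordered, diversity)
```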


### Exploring text - useful methods to search inside text

```text1.concordance("monstrous")``` shows every occurrence of a word together with its surrounding context. It is useful if you want to discuss the ways in which a word is used in a text.

```text2.similar(...)``` finds words used in similar contexts. It is not looking for synonyms, although the results may include synonyms.

```text2.common_contexts([..., ...])``` lets us examine just the contexts that are shared by two or more words, such as monstrous and very. We have to enclose these words in square brackets as well as parentheses, and separate them with a comma.

```text2.collocations()``` lists the collocations in the text. A collocation is a sequence of words that occur together unusually often.
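
To make the idea of a concordance concrete, here is a rough keyword-in-context sketch in plain Python. It is not NLTK's implementation: `kwic`, its `width` parameter, and the `sample` sentence are hypothetical names and data introduced only for illustration.

```python
def kwic(tokens, word, width=3):
    """Return (left context, hit, right context) for each occurrence of `word`."""
    lines = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively so 'Monstrous' and 'monstrous' both match.
        if token.lower() == word.lower():
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Hypothetical sample text, already split into tokens.
sample = 'the monstrous whale and the very monstrous size of the sea'.split()

for left, hit, right in kwic(sample, 'monstrous'):
    print(f'{left:>20} [{hit}] {right}')
```

Each printed row lines up the search word in the same column, which is the layout that makes a concordance easy to scan.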

### Exploring text: Plotting dispersion of words

```text1.dispersion_plot(words=['sea','whale'])``` draws a **dispersion plot**, which shows where the given words occur across a text. Plotting requires Matplotlib to be installed.

## Data structures: Texts as Lists of Words
Python treats a text as a long list of words.

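```sent1 = ['Call', 'me', 'Ishmael', '.']``` defines a list. Note that we use square brackets to define it.

```sent1.count('me')``` counts how many times a particular word occurs in the list.

```sent2.append('tomorrow')``` adds a single item to the end of a list. Evaluate ```sent2``` afterwards to see the change (```sent2``` is one of the example sentences loaded from ```nltk.book```).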
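
A short sketch of these list operations on `sent1`. The `combined` variable and the extra words are assumptions added for illustration, to contrast `append` with list concatenation:

```python
sent1 = ['Call', 'me', 'Ishmael', '.']

occurrences = sent1.count('me')        # how many times 'me' appears in the list
sent1.append('tomorrow')               # modifies the list in place and returns None
combined = sent1 + ['and', 'after']    # + builds a new list and leaves sent1 unchanged

print(occurrences, sent1, combined)
```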

#### Indexing Lists
We can navigate this list with the help of indexes. Just as we can find out the number of times a word occurs in a text, we can also find where a word first occurs. We can navigate to different points in a text without restriction, so long as we can describe where we want to be.

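```text4.index('awaken')``` returns the position of the first occurrence of the word.

```text4[158]``` gets the item at a given position.

```text5[16715:16735]``` slices the list, returning the items from the first position up to, but not including, the second.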
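
The same operations can be tried on a short list, where the positions are easy to check by eye. A minimal sketch, assuming a hypothetical token list named `words`:

```python
# Hypothetical token list; positions start at 0.
words = ['We', 'hold', 'these', 'truths', 'to', 'be', 'self', '-', 'evident']

position = words.index('truths')   # first occurrence; raises ValueError if the word is absent
item = words[position]             # look the item up again by its position
window = words[2:5]                # items at positions 2, 3 and 4 (the end is excluded)
last_two = words[-2:]              # negative positions count from the end

print(position, item, window, last_two)
```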

## Strings: A useful object to store texts

A string is a sequence of characters, so in many ways it behaves like a list. For example, we can assign a string to a variable, index a string, and slice a string.

```' '.join(['I','love', 'NLTK'])``` joins the items of a list into one string, using the given string as the separator.

```'I love python'.split()``` splits the string into a list of words. With no argument, it splits on whitespace.
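
A small sketch of indexing, slicing, splitting and joining strings. The sample strings and variable names are assumptions chosen for illustration:

```python
name = 'Monty Python'

first = name[0]                  # indexing a string gives a single character
part = name[6:]                  # slicing gives a substring
words = name.split()             # no argument: split on runs of whitespace
fields = 'a,b,,c'.split(',')     # explicit separator: empty fields are kept
rejoined = '-'.join(words)       # the separator string goes between the items

print(first, part, words, fields, rejoined)
```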

## Let computers do the repetitive work: Python Loops

Teach the computer how to repeat things. 

The loop below produces the same result as ```len('Python')```:

```python
length = 0
for char in 'Python':
    length = length + 1
print('There are', length, 'letters in this word')
```

## Frequency Distributions: Counting for analysis
We can use Python's ability to perform statistical analysis of data to do further exploration of vocabulary.
```python
from nltk.probability import FreqDist
fdist1 = FreqDist(text1)  # create the frequency distribution
fdist1.most_common(10)  # the 10 most common tokens with their counts
fdist1['like']  # count of a particular item
fdist1.max()  # the item with the highest count
fdist1.freq('a')  # relative frequency of an item (count / total)
fdist1.plot(50, cumulative=False)  # plot the 50 most common tokens
fdist2 = FreqDist(len(word) for word in text1)  # explore other features of the text, here word lengths
```
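
`FreqDist` is a subclass of Python's `collections.Counter`, so the counting calls can be tried on a plain `Counter` without loading any NLTK data. A minimal sketch on a hypothetical token list. The relative frequency is computed by hand here, because `Counter` has no `freq` method:

```python
from collections import Counter

# Hypothetical sample, standing in for text1.
tokens = 'the whale and the sea and the ship'.split()

counts = Counter(tokens)
top_two = counts.most_common(2)                  # (item, count) pairs, highest count first
the_count = counts['the']                        # count of one item
missing = counts['harpoon']                      # unseen items count as 0, no KeyError
relative = counts['the'] / len(tokens)           # share of all tokens that are 'the'
length_counts = Counter(len(w) for w in tokens)  # distribution of word lengths

print(top_two, the_count, missing, relative, length_counts)
```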

## Explore your own text: Accessing a corpus   

**Corpus** - Structured set of texts. 
*Corpora* is the plural of this. Example: A collection of medical journals.

Get a text from your hard drive:
```python
import os
text_path = os.path.join('books', 'pg1080.txt')
# The with block closes the file automatically once reading is done
with open(text_path, "r", encoding='UTF-8') as file:
    text = file.read()
```

Get a text from the web
```python
from urllib import request
url = "http://www.gutenberg.org/cache/epub/1080/pg1080.txt"
response = request.urlopen(url)
raw = response.read().decode('utf-8-sig')
```

Tokenize the text
```python
from nltk import word_tokenize, sent_tokenize, wordpunct_tokenize
from nltk.text import Text
tokens = wordpunct_tokenize(raw)  # use the tokenizer you need
my_text = Text(tokens)  # wrap the tokens so methods such as concordance() are available
```
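
To see roughly what a word-and-punctuation tokenizer does, the sketch below approximates it with a regular expression from the standard library. It is an illustration of the idea and not a replacement for NLTK's tokenizers. `simple_tokenize` and the sample sentence are hypothetical.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    # \w+ matches letters, digits and underscores; [^\w\s]+ matches punctuation.
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = simple_tokenize("Don't panic: it's only 3.5 miles!")
print(tokens)
```

Note how the apostrophes and the decimal point become separate tokens. If that is not what you want for your text, one of the other tokenizers, such as `word_tokenize`, may suit it better.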