## 4.1 Back to basics

### Assignment

`bar = foo` assigns the value of `foo` to `bar`. For a string, `bar` behaves like a **copy** of `foo`: reassigning `foo` afterwards leaves `bar` unchanged.

In [1]:
foo = 'Monty'
bar = foo
foo = 'Python'
bar

'Monty'

Assignment always copies the value of an expression, but a value is not always what you might expect it to be. The value of a list is actually a **reference** to the object, so `bar = foo` assigns the reference held by `foo` to `bar`, and both names then refer to the same list.

In [4]:
foo = ['Monty', 'Python']
bar = foo
foo[1] = 'Bodkin'
bar

['Monty', 'Bodkin']

In [5]:
empty = []
nested = [empty, empty, empty]
nested

[[], [], []]

In [6]:
nested[1].append('Python')
nested

[['Python'], ['Python'], ['Python']]

In [12]:
nested = [[]] * 3
nested

[[], [], []]

In [13]:
nested[1].append('Python')
nested[1] = ['Monty']
nested

[['Python'], ['Monty'], ['Python']]

Important: To copy the items from a list `foo` to a new list `bar`, you can write `bar = foo[:]`. This copies the object references inside the list, so any nested objects are still shared between the two lists. To copy a structure without copying any object references, use `copy.deepcopy()`.
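The difference between the two kinds of copy can be sketched as follows. The names `shallow` and `deep` are illustrative and do not come from the original notes.

```python
import copy

foo = [['Monty'], ['Python']]
shallow = foo[:]            # new outer list, same inner lists
deep = copy.deepcopy(foo)   # new outer list and new inner lists

foo[0].append('Bodkin')     # mutate an inner list through foo

print(shallow[0])  # shares the inner list with foo, so it sees the change
print(deep[0])     # independent copy, unaffected by the change
print(shallow is foo, shallow[0] is foo[0], deep[0] is foo[0])
```

The slice gives a distinct outer list, but only `deepcopy` makes the inner lists independent.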

### Equality

The `is` operator tests for object identity, whereas `==` tests whether two objects have equal values.

In [14]:
size = 5
python = ['Python']
snake_nest = [python] * size
snake_nest

[['Python'], ['Python'], ['Python'], ['Python'], ['Python']]

In [15]:
snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]

True

In [16]:
snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]

True

In [17]:
import random
position = random.choice(range(size))
snake_nest[position] = ['Python']
snake_nest

[['Python'], ['Python'], ['Python'], ['Python'], ['Python']]

In [18]:
snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]

True

In [19]:
snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]

False

In [20]:
[id(snake) for snake in snake_nest]

[2329783298824, 2329783298824, 2329783298824, 2329793632392, 2329783298824]

### Conditionals

In [21]:
mixed = ['cat', '', ['dog'], []]
for element in mixed:
    if element:
        print(element)

cat
['dog']


In [22]:
sent = ['No', 'good', 'fish', 'goes', 'anywhere', 'without', 'a', 'porpoise', '.']
all(len(w) > 4 for w in sent)

False

In [23]:
any(len(w) > 4 for w in sent)

True

## 4.2 Sequences

In [24]:
t = 'walk', 'fem', 3
t

('walk', 'fem', 3)

In [25]:
t[0]

'walk'

In [26]:
t[1:]

('fem', 3)

In [27]:
len(t)

3

In [28]:
raw = 'I turned off the spectroroute'
text = ['I', 'turned', 'off', 'the', 'spectroroute']
pair = (6, 'turned')
raw[2], text[3], pair[1]

('t', 'the', 'turned')

In [29]:
raw[-3:], text[-3:], pair[-3:]

('ute', ['off', 'the', 'spectroroute'], (6, 'turned'))

In [30]:
len(raw), len(text), len(pair)

(29, 5, 2)

### Operating on sequence types

In [2]:
import nltk
from nltk import word_tokenize
raw = 'Red lorry, yellow lorry, red lorry, yellow lorry.'
text = word_tokenize(raw)
fdist = nltk.FreqDist(text)
sorted(fdist)

[',', '.', 'Red', 'lorry', 'red', 'yellow']

In [36]:
for key in fdist:
    print(key + ':', fdist[key], end='; ')

Red: 1; lorry: 4; ,: 3; yellow: 2; red: 1; .: 1; 

In [3]:
words = ['I', 'turned', 'off', 'the', 'spectroroute']
words[2], words[3], words[4] = words[3], words[4], words[2]
words

['I', 'turned', 'the', 'spectroroute', 'off']

In [6]:
words = ['I', 'turned', 'off', 'the', 'spectroroute']
tags = ['noun', 'verb', 'prep', 'det', 'noun']
list(zip(words, tags))

[('I', 'noun'),
 ('turned', 'verb'),
 ('off', 'prep'),
 ('the', 'det'),
 ('spectroroute', 'noun')]

In [7]:
list(enumerate(words))

[(0, 'I'), (1, 'turned'), (2, 'off'), (3, 'the'), (4, 'spectroroute')]

In [2]:
text = nltk.corpus.nps_chat.words()
cut = int(0.9 * len(text))
training_data, test_data = text[:cut], text[cut:]
text == training_data + test_data

True

In [3]:
len(training_data) / len(test_data)

9.0

### Combining different sequence types

In [4]:
words = 'I turned off the spectroroute'.split()
wordlens = [(len(word), word) for word in words]
wordlens

[(1, 'I'), (6, 'turned'), (3, 'off'), (3, 'the'), (12, 'spectroroute')]

In [5]:
wordlens.sort()
' '.join(w for (_, w) in wordlens)

'I off the turned spectroroute'

In [6]:
lexicon = [('the', 'det', ['Di:', 'D@']),
          ('off', 'prep', ['Qf', 'O:f'])]
lexicon

[('the', 'det', ['Di:', 'D@']), ('off', 'prep', ['Qf', 'O:f'])]

In [7]:
lexicon.sort()
lexicon[1] = ('turned', 'VBD', ['t3:nd', 't3`nd'])
del lexicon[0]
lexicon

[('turned', 'VBD', ['t3:nd', 't3`nd'])]

In [8]:
lexicon = tuple(lexicon)
lexicon

(('turned', 'VBD', ['t3:nd', 't3`nd']),)

### Generator expressions

In [10]:
text = '''"When I use a word," Humpty Dumpty said in rather a scornful tone,
        "it means just what I choose it to mean - neither more nor less."'''
[w.lower() for w in word_tokenize(text)]

['``',
 'when',
 'i',
 'use',
 'a',
 'word',
 ',',
 "''",
 'humpty',
 'dumpty',
 'said',
 'in',
 'rather',
 'a',
 'scornful',
 'tone',
 ',',
 '``',
 'it',
 'means',
 'just',
 'what',
 'i',
 'choose',
 'it',
 'to',
 'mean',
 '-',
 'neither',
 'more',
 'nor',
 'less',
 '.',
 "''"]

In [11]:
max([w.lower() for w in word_tokenize(text)])

'word'

In [12]:
max(w.lower() for w in word_tokenize(text))

'word'
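Both calls above return the same result, but the second passes a generator expression, so no intermediate list is built. A rough sketch of the difference, using a made-up word list in place of the tokenized text and `sys.getsizeof` as an approximate measure of the container's size:

```python
import sys

words = ['When', 'I', 'use', 'a', 'word'] * 1000  # illustrative data

as_list = [w.lower() for w in words]   # materializes every item at once
as_gen = (w.lower() for w in words)    # produces items one at a time, on demand

# The generator object stays small regardless of how many items it will yield.
print(sys.getsizeof(as_list) > sys.getsizeof(as_gen))

# max() consumes either form and gives the same answer.
print(max(as_list) == max(as_gen))
```

Note that a generator can be consumed only once: after `max(as_gen)` it is exhausted.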

## 4.3 Questions of style

### Procedural vs declarative style

In [3]:
tokens = nltk.corpus.brown.words(categories='news')
count = 0
total = 0
for token in tokens:
    count += 1
    total += len(token)
total / count

4.401545438271973

In [5]:
total = sum(len(t) for t in tokens)
print(total / len(tokens))

4.401545438271973


In [None]:
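# Procedural version: build a sorted list of unique tokens by insertion.
word_list = []
i = 0
while i < len(tokens):
    j = 0
    # Advance j past every entry that sorts at or before the current token.
    while j < len(word_list) and word_list[j] <= tokens[i]:
        j += 1
    # Insert only if the token is not already present just before position j.
    if j == 0 or tokens[i] != word_list[j-1]:
        word_list.insert(j, tokens[i])
    i += 1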

In [4]:
word_list = sorted(set(tokens))

In [9]:
fd = nltk.FreqDist(nltk.corpus.brown.words())
cumulative = 0.0
most_common_words = [word for (word, count) in fd.most_common()]
for rank, word in enumerate(most_common_words):
    cumulative += fd.freq(word)
    print("%3d %6.2f%% %s" % (rank + 1, cumulative * 100, word))
    if cumulative > 0.25:
        break


  1   5.40% the
  2  10.42% ,
  3  14.67% .
  4  17.78% of
  5  20.19% and
  6  22.40% to
  7  24.29% a
  8  25.97% in


In [10]:
text = nltk.corpus.gutenberg.words('milton-paradise.txt')
longest = ''
for word in text:
    if len(word) > len(longest):
        longest = word
longest

'unextinguishable'

In [11]:
maxlen = max(len(word) for word in text)
[word for word in text if len(word) == maxlen]

['unextinguishable',
 'transubstantiate',
 'inextinguishable',
 'incomprehensible']

### Some legitimate uses for counters

In [14]:
import pprint

In [12]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
n = 3
[sent[i:i+n] for i in range(len(sent)-n+1)]

[['The', 'dog', 'gave'],
 ['dog', 'gave', 'John'],
 ['gave', 'John', 'the'],
 ['John', 'the', 'newspaper']]
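The sliding-window comprehension above works for any window size. A minimal sketch wrapping it in a function; the helper name `sliding_windows` is hypothetical and not part of NLTK:

```python
def sliding_windows(seq, n):
    """Return every run of n consecutive items in seq, as lists."""
    return [seq[i:i+n] for i in range(len(seq) - n + 1)]

sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
print(sliding_windows(sent, 2))
print(sliding_windows(sent, 7))  # n longer than the sentence: no windows
```

Here the loop variable `i` is needed because each window depends on its starting position, not just on the item itself.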

In [15]:
m, n = 3, 7
array = [[set() for i in range(n)] for j in range(m)]
array[2][5].add('Alice')
pprint.pprint(array)

[[set(), set(), set(), set(), set(), set(), set()],
 [set(), set(), set(), set(), set(), set(), set()],
 [set(), set(), set(), set(), set(), {'Alice'}, set()]]
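The nested comprehension above matters because it creates a distinct set for every cell. A sketch of what goes wrong with list multiplication instead, which repeats the aliasing seen earlier with `[[]] * 3` (the name `shared` is illustrative):

```python
m, n = 3, 7

# List multiplication repeats references: every cell is the same set object.
shared = [[set()] * n] * m
shared[2][5].add('Alice')
print(all('Alice' in cell for row in shared for cell in row))

# The nested comprehension evaluates set() once per cell.
array = [[set() for i in range(n)] for j in range(m)]
array[2][5].add('Alice')
print(sum(len(cell) for row in array for cell in row))
```

In the first version a single `add` shows up in all 21 cells; in the second it affects only `array[2][5]`.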
