# Counting words in a text
 
Suppose we wanted to count how often each word occurs in the English Wikipedia page for Monty Python. Then a natural language description of the algorithm (the method) could look approximately like this:

* Split the text into words.
* Start a sheet of counts.
* Start at the first word, and for each word, do the following:
  * If the word is already on the counts sheet, add a dash next to it.
  * If the word is not on the counts sheet, put it there, and put a single dash next to it.

This algorithm uses:

* Different data types:
  strings (words), integers (counts), a list of words (the text, after splitting), and a mapping from words to counts, a data type that in Python is called a "dictionary"
* Operations on numbers, in this case, adding
* Operations on strings, in this case, splitting a text into words and comparing a word to others on the sheet
* Conditions: "If the word is already on the counts sheet..."
* Repetition: "for each word, do the following"
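
As a preview, the steps above could be sketched in Python roughly as follows. This is only a sketch: the sample sentence and the names `text`, `words`, and `counts` are made up for illustration, and it uses a dictionary, which this worksheet does not cover yet.

```python
# Illustrative sample text (an assumption, not the Wikipedia page itself)
text = "the cat saw the dog and the dog saw the cat"

words = text.split()      # split the text into words
counts = {}               # start an empty sheet of counts (a dictionary)

for word in words:        # for each word, do the following
    if word in counts:    # already on the sheet: add one more dash
        counts[word] = counts[word] + 1
    else:                 # not on the sheet yet: put it there with one dash
        counts[word] = 1

print(counts)  # {'the': 4, 'cat': 2, 'saw': 2, 'dog': 2, 'and': 1}
```

Each line of the sketch corresponds to one ingredient from the list above: a string method, a condition, a repetition, and adding numbers.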

Although this is a very simple algorithm, it uses almost all the most important data types and types of commands in programming languages like Python. This worksheet introduces three of them:

* Conditions ("if the word is already on the counts sheet...")
* Lists (for example, lists of words), and
* Loops ("for each word, do the following").

## Boolean expressions

Boolean expressions are expressions whose value is either True or False. They play an important role in conditions. You have already encountered some of them when we were discussing strings: ```in```, ```startswith```, and ```endswith```. Here they are again: 

In [1]:
"eros" in "rhinoceros"

True

In [2]:
"nose" in "rhinoceros"

False

In [3]:
"truism".endswith("ism")

True

In [4]:
"inconsequential".startswith("pro")

False

Here are some Boolean expressions that compare numbers:

In [5]:
3.1 >= 2.9

True

In [6]:
3.1 < 2.9

False

In [7]:
3.2 - 0.1 == 3.1

True

In [8]:
3.2 != 3.1 

True

We can compare expressions that yield a number using <, >, <=, and >=. "Equals" is expressed using ==, and "does not equal" using !=. Note that a comparison needs "==": a single "=" assigns a value.
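
To make the difference between "=" and "==" concrete, here is a minimal sketch. The variable names `x`, `is_five`, and `is_six` are made up for this illustration.

```python
x = 5             # one "=": assignment, x now holds the value 5
is_five = x == 5  # two "=": comparison, the result is a Boolean
is_six = x == 6

print(is_five)    # True
print(is_six)     # False
```

The comparison does not change `x`; it only asks a question about it.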

The same operators work for strings as well. Try them out. What do you think they do?

In [9]:
"rhinoceros" != "rhino"

True

In [10]:
"rhinoceros" == "rhino"

False

In [11]:
# uses alphabetical order to decide which string is smaller or greater

"armadillo" < "rhinoceros"

True

In [12]:
#e comes before m so it is considered smaller
#making this false

"elephant" > "mouse"

False

The operators < and >, when used on strings, test for alphabetic, or rather lexicographic, order: "elephant" is before "mouse" in the lexicon. 

Check what lowercase and uppercase letters do: how will they be ordered? Test, for example, the order of "elephant" and "Elephant", of "ant" and "Zebra", of "Ant" and "zebra":

In [13]:
"Zebra" < "ant"

True
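
One way to see where this ordering comes from is to look at the character codes that Python compares. The sketch below uses the built-in functions `ord()` and `sorted()`; the names `code_Z`, `code_a`, and `ordered` are made up for this illustration.

```python
code_Z = ord("Z")   # character code of uppercase Z
code_a = ord("a")   # character code of lowercase a
print(code_Z, code_a)   # 90 97

# sorted() puts strings into the same order that < and > use
ordered = sorted(["elephant", "Elephant", "ant", "Zebra"])
print(ordered)      # ['Elephant', 'Zebra', 'ant', 'elephant']
```

All uppercase letters have smaller codes than all lowercase letters, so `"Zebra" < "ant"` is True because 90 is less than 97.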

The type of such expressions is 'bool'. Remember, the function ```type()``` gives you the type of any Python expression:

In [14]:
type("elephant" > "mouse")

bool

**Try it for yourself:**
Make Boolean expressions for the following, and see if they are true:
* ```6 * 7``` is the same as 42
* ```"ant"``` is before ```"armadillo"``` in the dictionary
* ```"ant"``` is before ```"Ant"``` in the dictionary
* ```"123"``` as a string is after ```"Zebra"``` in the dictionary

In [15]:
# box for code
print(6*7 == 42)
print("ant" < "armadillo")
print("ant" < "Ant")
print("123" > "Zebra")

True
True
False
False


## Conditions

In natural language, you express conditions using "if": "If this word is not yet on our count sheet, put it there". In Python, it is very similar:

In [16]:
if "mad" in "armadillo":
    print("yes, found it")

yes, found it


As you can see from the output, the `print("yes, found it")` line was executed. If the Boolean expression in the condition is False, the indented code is not executed, like here:

In [17]:
if "crazy" in "armadillo":
    print("Can anyone hear me?")

As you can see, there is no output of "Can anyone hear me?". 

The formatting here is quite important to look at closely. We have:

* "if"
* then a Boolean expression
* then a colon
* then a linebreak
* then, indented, what to do if the Boolean expression is True.

Or, to put it another way:

```
if <Boolean>:
    block
```

You will see the same notation again in other contexts.  

The "block" can contain more than a single line, like here:

In [18]:
if "mad" in "armadillo":
    print("yes, found it")
    print("position of the 'm':", "armadillo".index("m"))

yes, found it
position of the 'm': 2


The important point here is: There are *two* lines that should only be executed if the Boolean is True, both of them print-statements. Python groups them together into a *block* because they share the same level of indentation on the left: one tab-stop, which is four spaces in these examples.

Here is another example with a "block" of 2 statements that are only executed if the Boolean is True, followed by a line that is "unindented", and hence outside the block. So that last line is executed, no matter if the Boolean is true or not: It does not belong with the if-block.

In [19]:
if "crazy" in "armadillo":
    print("can anyone hear me?")
    print("and now?")
print("I am outside the indented block.")

I am outside the indented block.


As you can see, the statements in the if-block are not executed, because the Boolean is False. The unindented line, however, does not depend on the "if" and is executed either way.

Python is quite particular about its whitespace on the left-hand side. The lines that together make a block all have to be indented by exactly the same amount of whitespace. For example, un-comment the following piece of code (that is, remove the hash marks) to get an "IndentationError". This error happens because the second print-statement is indented one more space than the first print-statement.

In [20]:
# if 3 < 4:
#    print("this line is fine")
#     print("this is going to get us a nice error message")

Sometimes we have two different things that we want to do, depending on whether a Boolean expression is True or False. For example, say we want to guess at a word's part of speech ("POS") based on its suffix. If the word ends in "able", we'll print "adjective", otherwise we will just blindly guess "noun". Here is how to say this in Python.

In [21]:
word = "lovable" # or try: word = "table"
if word.endswith("able"):
    print("adjective")
else:
    print("noun")

adjective


That is, we can use "if"... "else" to tell Python two different pieces of code to execute, based on whether a Boolean is true or false. The general shape of if...else... commands is

```
if <Boolean>:
    block
else:
    block
```
    
So we start un-indented for the "if", then comes an indented block, then we un-indent for the "else", then there is another indented block. Watch the indentation levels! Python is very particular about them.

If you want to use more than two cases, the command is ```if... elif... elif... else...``` 

Here is how to extend the part-of-speech guessing code a bit: We now guess "noun" for words ending in "ion" or "ism", "verb" for words ending in "ed", "adjective" if a word ends in "able", and if we have no better idea, we will just guess "noun".

In [22]:
word = "schism" # also try: "wasted", "red", "undeniable", "table"
# Can you come up with other words that would make this code 
# produce the right result,
# or a wrong result?

if word.endswith("ion"):
    pos = "N"
elif word.endswith("ism"):
    pos = "N"
elif word.endswith("ed"):
    pos = "V"
elif word.endswith("able"):
    pos = "A"
else: 
    pos = "N"
    
pos

'N'

**Try it for yourself**:

* Write code that compares two numbers, stored in variables ```numval1``` and ```numval2```, and stores the greater of the two in the variable ```greaterval```. Use "if" to do this.
* Write code that tries to identify nouns: It should inspect a variable ```word``` that contains a string. If that string starts with an uppercase letter, or if it ends in "ity" or "hood", it should print "This is a noun", otherwise "This is not a noun". 

In [23]:
numval1 = 10

numval2 = 15

greaterval = 0

if numval1 > numval2:
    greaterval = numval1
else:
    greaterval = numval2

print(greaterval)

15


In [24]:
word = "Linguistics"

if (65 <= ord(word[0]) <= 90) or word.endswith("ity") or word.endswith("hood"):
    print("This is a noun")
else:
    print("This is not a noun")

This is a noun


### And, or, not
Sometimes you need to combine multiple Boolean expressions. For example, in the part of speech tagging code above, we have two conditions where we set the part of speech to be "N" for noun. We can combine them by saying that if we find an ending of "ion" *or* "ism", we want to call it a noun:

In [25]:
word = "schism" 

if word.endswith("ion") or word.endswith("ism"):
    pos = "N"
elif word.endswith("ed"):
    pos = "V"
elif word.endswith("able"):
    pos = "A"
else: 
    pos = "N"
    
pos

'N'

This was an example of "or".  There is also "and": For example, suppose you wanted to define medium-length words as words with at least 4 *and* at most 10 letters. Here is some code that says "medium-length" for medium-length words, and "not medium length" for others:

In [26]:
word = "antidisestablishmentarianism" # also try: "Python", "to"

#word = "Python"

if len(word) >= 4 and len(word) <= 10:
    print("medium-length")
else:
    print("not medium-length")
    

not medium-length


The central point in the code above is: ```if len(word) >= 4 and len(word) <= 10```

This combines two Boolean expressions. In general,
```<expression1> or <expression2>``` is true if at least one of expression1 and expression2 is true. ```<expression1> and <expression2>``` is true only if both expressions are true.

The reserved word "not" flips the value of a Boolean expression:

In [27]:
"apple" == "orange"

False

In [28]:
not("apple" == "orange")

True
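
To see the three operators side by side, here is a small sketch. The variable names `long_enough` and `short_enough` are made up for this illustration; the word "to" is one of the test words suggested above.

```python
word = "to"

long_enough = len(word) >= 4     # False: "to" has only 2 letters
short_enough = len(word) <= 10   # True

print(long_enough and short_enough)        # False: not both are true
print(long_enough or short_enough)         # True: at least one is true
print(not (long_enough and short_enough))  # True: "not" flips False to True
```

Storing each comparison in its own variable first is optional, but it can make a long condition easier to read.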

**Try it for yourself:** Write another version of the function that filters out function words. Instead of using ```if...elif..elif..else```, use a single "if" condition, but combine Boolean expressions using "and" or "or".

# Lists

A list is a sequence of items, like a shopping list. In Python, you write them with square brackets around them, and with commas between items:

In [29]:
shopping_list = ["eggs", "milk", "broccoli"]

A list has a length:

In [30]:
len(shopping_list)

3

A list consists of a sequence of items, each of which has an index. We can access list members by their index, in a similar way to strings:

In [31]:
shopping_list[0]

'eggs'

In [32]:
shopping_list[2]

'broccoli'

In [33]:
# Un-comment the next line to get an "IndexError": the list has no item at index 10
# shopping_list[10]
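
Asking for an index that is not on the list raises an IndexError. One possible way to guard against that, sketched here with the hypothetical variables `wanted_index` and `message`, is to compare the index with the length of the list first. The sketch only considers indices of zero or greater.

```python
shopping_list = ["eggs", "milk", "broccoli"]
wanted_index = 10   # hypothetical index, deliberately too large

# valid indices run from 0 to len(shopping_list) - 1
if wanted_index < len(shopping_list):
    message = shopping_list[wanted_index]
else:
    message = "no item at index " + str(wanted_index)

print(message)  # no item at index 10
```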


## Lists and strings

In fact, lists and strings have many operations in common. 

### Indices and slices

You have seen above that you can use indices on lists. You can also use slices:

In [34]:
mylist = ["acrimonious", "acarus", "caucus"]
mylist[1:3]

['acarus', 'caucus']

In [35]:
mylist[1:2]

['acarus']

In [36]:
mylist[2:]

['caucus']

In [37]:
mylist[0]

'acrimonious'

In [38]:
mylist[0:1]

['acrimonious']

In [39]:
mylist[-1]

'caucus'

In [40]:
mylist[-2:]

['acarus', 'caucus']

**Try it for yourself:**
Assume that
```ismlist = ["absurdism", "antiferromagnetism", "bipedalism", "bimetallism"]```

Can you access just the word "bipedalism" by using an index on ismlist? Can you access just the sublist 
```["absurdism", "antiferromagnetism"]```?

In [41]:
ismlist = ["absurdism", "antiferromagnetism", "bipedalism", "bimetallism"]

print(ismlist[-2])

print(ismlist[0:2])


bipedalism
['absurdism', 'antiferromagnetism']


### Concatenation
Using "+" on strings concatenates them, for example

In [42]:
"arti" + "fact"

'artifact'

Here is what happens when you use "+" on lists: It also does concatenation, just list concatenation.

In [43]:
longerlist = mylist + ["decorous", "arborous"]

longerlist

['acrimonious', 'acarus', 'caucus', 'decorous', 'arborous']

### Length
```len()``` works on lists  and on strings in the same way:

In [44]:
print("length of the string 'abcd':", len("abcd"))
print("length of the list [1,2,3,4,5]:", len([1,2,3,4,5]))

length of the string 'abcd': 4
length of the list [1,2,3,4,5]: 5


### in

You have used ```in``` to check for substrings:

In [45]:
"tall" in "bimetallism"

True

You can also use ```in``` to check for items on a list:

In [46]:
"eohippus" in mylist

False

In [47]:
"caucus" in mylist

True

## A list method that is not a string method: append

Here is a method that works on lists but not on strings: You can append another item to the end of a list using the method ```append()```.

In [48]:
mylist.append("nexus")
mylist

['acrimonious', 'acarus', 'caucus', 'nexus']

There is an important difference between concatenation with "+", and "append": When we used concatenation above, it did not change `mylist`. We captured the result of the concatenation in `longerlist`, but `mylist` stayed the same. `append()` is different: It does change `mylist`.
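
The difference can be checked directly. The following sketch uses made-up names (`original`, `combined`, `result`) so that it does not interfere with `mylist`:

```python
original = ["acarus", "caucus"]

# "+" builds a new list and leaves the original untouched
combined = original + ["nexus"]
print(original)   # ['acarus', 'caucus']
print(combined)   # ['acarus', 'caucus', 'nexus']

# append() changes the list itself and returns None
result = original.append("nexus")
print(original)   # ['acarus', 'caucus', 'nexus']
print(result)     # None
```

Because `append()` returns None, writing something like `original = original.append("nexus")` would lose the list.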

## Splitting

An important string method that we have mentioned before is ```split()```. It works on strings and splits them up into a list of substrings. If given no arguments, it splits on whitespace.

In [49]:
"this is a sentence".split()

['this', 'is', 'a', 'sentence']

## Lists with only one item, zero items

A list with only one item is a list, too:

In [50]:
shortlist = ["broccoli"]
len(shortlist)

1

The list with "broccoli" on it is not the same as the string "broccoli":

In [51]:
['broccoli'] =='broccoli'

False

In [52]:
print("A list is a ", type(["broccoli"]), "and a string is a ", type("broccoli"))

A list is a  <class 'list'> and a string is a  <class 'str'>


You can even make a list with no items on it -- it will turn out to be strangely useful:

In [53]:
veryshortlist = [ ]
len(veryshortlist)

0

**Try it for yourself**:

* Access the item "nexus" from `mylist` using indices. Also access the list slice that only contains the item "nexus". Watch out: the first time, you get a string, and the second time, a list. 


* Use Python to determine how many words there are in the following first lines of a poem by Lewis Carroll: First use `split()` to make a list, then determine the length of the list. 

    """They told me you had been to her
    
    And mentioned me to him;
   
    She gave me a good character,
    
    But said I could not swim. """

In [54]:
mylist[-1]

'nexus'

In [72]:
l_carroll =  """They told me you had been to her
    
    And mentioned me to him;
   
    She gave me a good character,
    
    But said I could not swim. """

lc = l_carroll.split()

len(lc)


25

# Loops

Up to now, we have only accessed individual items on a list by using their indices. But one of the most natural things to do with a list is to repeat some action for each item on the list, for example: 

“For each word in the given list of words: print it”.

Here is how to say this in Python:

In [56]:
my_list = [ "ngram", "isogram", "cladogram", "pangram"]
for word in my_list:
    print( word )

ngram
isogram
cladogram
pangram


Here is what this code does: 
First, a list is established that we want to work with.

The for-loop has this form: `for`, then a new variable, then `in`, then your list name.

Python fills this new variable, here `word`, for you. You never explicitly store anything in it. But Python goes through the list `my_list`. It first stores the first list entry, "ngram", in `word`, then executes `print(word)`. It then stores the second list entry, "isogram", in `word`, then executes `print(word)`. So for this particular value of `my_list`, the code above is equivalent to:

In [57]:
word = "ngram"
print(word)
word = "isogram"
print(word)
word = "cladogram"
print(word)
word = "pangram"
print(word)

ngram
isogram
cladogram
pangram


Now let's look more closely at the shape of the for-loop-command. There are several things to note here. 
* First, the reserved word that signals repetition is "for".
* Second, the overall shape of the "for" loop is very similar to that of a condition: It uses a colon at the end of the line, and an indented block.

```
for <variable> in <something>:
    block
```

* Third, “word” is a variable. You could have chosen a different variable name, of course, for example: 

In [58]:
for abcd123 in my_list:
    print( abcd123 )

ngram
isogram
cladogram
pangram


The variable in the loop is like the variable in a function definition: You don't need to specify its contents beforehand. In fact, whatever was in ```abcd123``` before the loop gets erased:

In [59]:
abcd123 = "hello"
for abcd123 in ["ngram", "isogram", "cladogram", "pangram"]:
    print( "list member", abcd123 )
print("at this point, abcd123 is", abcd123)

list member ngram
list member isogram
list member cladogram
list member pangram
at this point, abcd123 is pangram


Summing up, the standard way to use a for-loop is:
* You have a list ```somelist```, and you want to run some code for each item on the list in turn
* Then you make a for-loop saying 

```
for some_new_variable_name in somelist:
    do whatever you want done to each list item, treating some_new_variable_name as containing the item in question
```


Typically you will choose as the loop variable one that you haven't used before. In the loop, Python fills it with each item on the list in turn. In the example above, it first puts "ngram" in abcd123. This, then, is printed within the block. Then it puts "isogram" into abcd123, and the block is executed with this value of abcd123. In the third execution of the loop, abcd123 is "cladogram", and the fourth time, it is "pangram". Then the list is exhausted, and the loop is done.

Here is another example of a for-loop.

In [60]:
numberlist = [345, 52, 1034, 79421]
mysum = 0
for number in numberlist:
    mysum = mysum + number

mysum

80852

This code does the same as sum(numberlist). It illustrates a general pattern that you will see very often: 
* You initialize a counter (here: to zero), 
* then you iterate over the list, and change the counter. 

This is an example of a "programming idiom", a standard way of solving problems that you can use across a lot of tasks. Let's call this one the "aggregate data over list" idiom. 

Here is another example of this idiom. It counts the "long words" in a given sentence. Note that this code example puts a condition, `if len(word) >= 10`, inside a for-loop. In a for-loop, the code to be repeated needs to be indented. In a condition, the code to be executed conditionally needs to be indented. If you put the two together, you get doubly indented code.

In [61]:
mysent = "Reliefpfeiler is the longest German palindrome"
mywords = mysent.split()
longwords = 0
for word in mywords:
    if len(word) >=10:
        longwords = longwords + 1
longwords

2


***Try it for yourself:***

* Here is a list of words:
    
    ```[ "candygram", "preprogram", "picogram"]```
    
    How many a's do the words on this list contain, taken together?
    
    For this problem, remember a string method that we used a few worksheets ago: ```"candygram".count("a")``` counts the a's in "candygram".

* Let's use again the poem lines we used before:

    """They told me you had been to her
    And mentioned me to him;
    She gave me a good character,
    But said I could not swim. """

    Can you count how often the word "me" occurs in this text? To solve this, first split the poem into words, then iterate over the words. If the current word is "me", then count it. 

In [75]:
listOfWords = [ "candygram", "preprogram", "picogram"]

# Do not call the variable "dict": that would hide Python's built-in dict type
a_counts = {}

for word in listOfWords:
    a_counts[word] = word.count("a")

print(a_counts)
print(sum(a_counts.values()))

{'candygram': 2, 'preprogram': 1, 'picogram': 1}
4


In [63]:
poem = """They told me you had been to her
    And mentioned me to him;
    She gave me a good character,
    But said I could not swim. """

# split() without an argument splits on any whitespace, including line breaks
poem_list = poem.split()

poem_list.count("me")

3

Here is another variant of the same programming idiom. Like above, we initialize a container to be empty, then fill it while iterating over a list. Only this time the container is not a numeric value, but a list. With this variant of the idiom, we can for example collect all capitalized words from a given text. The following text is from the Wikipedia page on Monty Python.

In [64]:
mytext = """The Python programming language by Guido van Rossum is named after the troupe, 
and Monty Python references are often found in sample code created for that language. 
Additionally, a 2001 April Fool's Day joke by van Rossum and Larry Wall involving the 
merger of Python with Perl was dubbed "Parrot" after the Dead Parrot Sketch. 
The name "Parrot" was later used for a project to develop a virtual machine for running 
bytecode for interpreted languages such as Perl and Python. 
Also, the Jet Propulsion Laboratory wrote some spacecraft navigation software in Python, 
which they dubbed "Monty". There is also a python refactoring tool called 
bicyclerepair ( [1] ), named after Bicycle Repair Man sketch."""

words = mytext.split()

uppercase_words = [ ]
for word in words:
    if word.istitle():
        uppercase_words.append(word)

uppercase_words

['The',
 'Python',
 'Guido',
 'Rossum',
 'Monty',
 'Python',
 'Additionally,',
 'April',
 'Day',
 'Rossum',
 'Larry',
 'Wall',
 'Python',
 'Perl',
 '"Parrot"',
 'Dead',
 'Parrot',
 'Sketch.',
 'The',
 '"Parrot"',
 'Perl',
 'Python.',
 'Also,',
 'Jet',
 'Propulsion',
 'Laboratory',
 'Python,',
 '"Monty".',
 'There',
 'Bicycle',
 'Repair',
 'Man']

This code first splits the text into words. It then initializes a list `uppercase_words`, in which we want to collect results. Initially, that list is empty. We then iterate through all words of our text and check whether each one is title-cased, using the string method `istitle()`: every run of letters in the word has to start with an uppercase letter and continue in lowercase (see http://docs.python.org/3/library/stdtypes.html#sequence-types-str-unicode-list-tuple-bytearray-buffer-xrange for string methods such as `istitle()` and `isupper()`). If that is the case, we append `word` to our collector list `uppercase_words`. This is why "Fool's" is missing from the output: its "s" is a lowercase letter that follows an apostrophe instead of another letter.
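
The two possible tests, `istitle()` on the whole word and `isupper()` on the first character, do not always agree. Here is a small sketch that compares them on a handful of sample words; the list `samples` and the two collector lists are made up for this illustration.

```python
samples = ["Python", "Fool's", "NASA", "python", '"Parrot"']

title_cased = []    # words for which istitle() is True
starts_upper = []   # words whose first character is an uppercase letter

for word in samples:
    if word.istitle():
        title_cased.append(word)
    if word[0].isupper():
        starts_upper.append(word)

print(title_cased)    # ['Python', '"Parrot"']
print(starts_upper)   # ['Python', "Fool's", 'NASA']
```

Which test is the better one depends on what you want to collect: `istitle()` rejects acronyms like "NASA", while checking `word[0]` rejects words that start with a quotation mark.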

**Try it for yourself:**
Collect all words that end in "t" in the same text mytext that we just used.

In [73]:
t_list = []

for word in words:
    if word.endswith("t"):
        t_list.append(word)

t_list

['that', 'Parrot', 'project', 'Jet', 'spacecraft']

Similarly, here is some code that goes through the same text, lowercases each word and stores the result in a result list:

In [65]:
mytext = """The Python programming language by Guido van Rossum is named after the troupe, 
and Monty Python references are often found in sample code created for that language. 
Additionally, a 2001 April Fool's Day joke by van Rossum and Larry Wall involving the 
merger of Python with Perl was dubbed "Parrot" after the Dead Parrot Sketch. 
The name "Parrot" was later used for a project to develop a virtual machine for running 
bytecode for interpreted languages such as Perl and Python. 
Also, the Jet Propulsion Laboratory wrote some spacecraft navigation software in Python, 
which they dubbed "Monty". There is also a python refactoring tool called 
bicyclerepair ( [1] ), named after Bicycle Repair Man sketch."""

words = mytext.split()

result_list = [ ]
for word in words:
    lowered = word.lower()
    result_list.append(lowered)
        

result_list[:10]

['the',
 'python',
 'programming',
 'language',
 'by',
 'guido',
 'van',
 'rossum',
 'is',
 'named']
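
Lowercasing alone still leaves punctuation attached to some words, so that "Python," and "Python" would count as different words. One possible way to deal with this, sketched here as an assumption and not as part of the worksheet's own code, is to strip punctuation from both ends of each word with `strip()` and the constant `string.punctuation` from the standard library. The names `sample` and `cleaned` are made up.

```python
import string

sample = 'Also, the "Parrot" sketch.'

cleaned = []
for word in sample.split():
    # remove punctuation characters from both ends, then lowercase
    cleaned.append(word.strip(string.punctuation).lower())

print(cleaned)   # ['also', 'the', 'parrot', 'sketch']
```

Note that `strip()` only removes characters at the ends of a string, so the apostrophe inside a word like "Fool's" stays in place.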

# Ranges

Suppose you wanted to have a list of the form ```[0, 1, 2, 3, 4]```, that is, a series of consecutive numbers. Then you can do that as before:

my_list = [0,1,2,3,4]

But since this is a kind of data structure that is needed relatively often, Python has a shortcut for this:

In [66]:
my_range = range(5)
list(my_range)

[0, 1, 2, 3, 4]

```range(n)``` yields a data type called a *range* (something similar to a list) that starts at 0 and ends at n-1 (*not n!*). This is like with list slices: ```my_list[1:4]``` gives you the part of the list that starts at index 1 and ends at index 3.

You can also use ```range()``` with two parameters instead of one. ```range(j, k)``` yields the numbers from j to k-1:

In [67]:
my_range = range(2,5)
list(my_range)

[2, 3, 4]

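```range()``` also accepts a third parameter, the step size. ```range(j, k, s)``` starts at j and counts up in steps of s, stopping before it reaches k: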

In [68]:
list(range(20, 30, 2))

[20, 22, 24, 26, 28]

Here is how to use ```range()``` in a loop to count to ten:

In [69]:
for num in range(1, 11):
    print( num )

1
2
3
4
5
6
7
8
9
10
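
Another common use of ```range()``` is to walk through the indices of a list, for cases where you need the position of each item as well as the item itself. This is a minimal sketch; the collector list `positions` is made up for this illustration.

```python
my_list = ["ngram", "isogram", "cladogram", "pangram"]

positions = []
for i in range(len(my_list)):   # i takes the values 0, 1, 2, 3
    positions.append(i)
    print(i, my_list[i])

# prints:
# 0 ngram
# 1 isogram
# 2 cladogram
# 3 pangram
```

Since `range(len(my_list))` stops at `len(my_list) - 1`, the index `i` never runs past the end of the list.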


**Try it for yourself:**

* How can you use range() to sum up the numbers from 1 to 20?

In [74]:
mySum = 0

for i in range(1, 21):
    mySum += i

mySum

210