# Lists and Dictionaries for Text Processing

## MDS/DH

## Lists

In this notebook, we will focus on working with lists, but with an emphasis on lists of strings, which is often how we represent textual data (e.g. a list of words).  

Strings and lists are sequences, which is to say that they are **ordered** collections of things.  Strings are ordered sequences of characters; lists are ordered sequences of any kind of object.  Strings, however, are immutable once created, whereas lists are mutable.
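
The difference between mutable and immutable sequences can be seen in a minimal sketch like the one below. The variable names (`text`, `string_changed`, `new_text`) are only illustrative and do not appear elsewhere in this notebook.

```python
# Lists are mutable: an element can be replaced in place.
words = ['The', 'quick', 'brown', 'fox']
words[1] = 'slow'
print(words)  # ['The', 'slow', 'brown', 'fox']

# Strings are immutable: item assignment raises a TypeError.
text = 'fox'
try:
    text[0] = 'b'
    string_changed = True
except TypeError:
    string_changed = False
print(string_changed)  # False

# To "change" a string, build a new one instead.
new_text = 'b' + text[1:]
print(new_text)  # box
```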

A list has a clearly defined order, and its elements may be strings, numbers, lists, dictionaries, or pretty much any Python object, including collections of objects.  So you can have a list of lists of strings.  Etc.

Lists may mix different kinds of objects (though this is not common in practice, as it makes processing the list more complex). A given element may appear multiple times in a list.

## Constructing lists

Lists are constructed with square brackets, and elements are separated by commas.  Because the elements here are strings (not numbers), each string must be separately surrounded by quotes:

In [None]:
words = ['The', 'quick', 'brown', 'fox']
words

More often, we build a list of words by splitting a string on whitespace with the `split()` method:

In [None]:
string = 'The quick brown fox jumped over the lazy dogs.'
words = string.split()
words

## Slicing lists

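You can access the elements of a list by using index and slice notation, just as we accessed the characters in strings.  You can specify an individual item (counting from 0, or from the end with negative numbers), or a range separated by a colon; the range includes the start position but stops just before the end position.  When there is a colon, you can leave out one of the numbers to mean the start/end of the list.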

In [None]:
words[-1]

In [None]:
words[3:6]

In [None]:
words[:3]

In [None]:
words[6:]

## Striding

Slice ranges can have a third number in addition to the start and end, which specifies the step: a step of 2 takes every second element, a step of 3 every third, and so on.  In my experience, this is rarely useful for textual data.

In [None]:
words[::2]

One possibly useful thing is that a negative step value causes the slice to be constructed backwards, from end to beginning, so `[::-1]` provides a handy way to get a reversed copy of a list. (If you want to reverse the order of the list itself, in place, you can do that with the `reverse()` method.)
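
As a small illustration of the two ways of reversing, the sketch below compares a reversed slice with the `reverse()` method. The names `backwards` and `result` are made up for this example.

```python
words = ['The', 'quick', 'brown', 'fox']

# Slicing with a negative step builds a new, reversed list.
backwards = words[::-1]
print(backwards)  # ['fox', 'brown', 'quick', 'The']
print(words)      # ['The', 'quick', 'brown', 'fox'] (unchanged)

# reverse() changes the list itself and returns None.
result = words.reverse()
print(result)  # None
print(words)   # ['fox', 'brown', 'quick', 'The']
```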

## Nested lists

Lists can be nested.  You can have a list of lists, or a list of lists of lists ...

So we might have a text as a list of sentences where each sentence is represented as a list of words.  That would facilitate going through a text sentence by sentence.
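
A text stored as a list of sentences might look something like the sketch below. The two sentences are invented for illustration, and the names `sentences`, `second_sentence` and `last_word` are assumptions rather than anything defined elsewhere in this notebook.

```python
# A hypothetical text: each inner list is one sentence, split into words.
sentences = [
    ['The', 'fox', 'jumped'],
    ['The', 'dogs', 'slept'],
]

# Go through the text sentence by sentence.
for sentence in sentences:
    print(len(sentence), sentence)

# The first index picks a sentence; the second picks a word within it.
second_sentence = sentences[1]
last_word = sentences[1][-1]
print(second_sentence)  # ['The', 'dogs', 'slept']
print(last_word)        # slept
```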

In [None]:
list_of_lists = [
    ['one', 'two', 'three'],
    ['a', 'b', 'c'],
    ['α', 'β', 'γ']
]
list_of_lists

In [None]:
list_of_lists[0]

In [None]:
list_of_lists[2][2]

In [None]:
# A slice whose start and end are the same is empty, so this returns [].
list_of_lists[1:1]

In [None]:
string = 'The quick brown fox jumped over the lazy dogs.'
words = string.split()
words

In [None]:
words[1][0]

## Sequence methods

Some of the "string methods" are really "sequence methods", which do analogous things to both strings and lists. 

The category of sequence encompasses lists, strings, tuples (immutable sequences that, like lists, can hold any kind of object) and ranges (immutable sequences of integers generated on demand, such as the numbers from 1 to 100).  
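
To make the shared behaviour concrete, here is a short sketch that applies the same operations to a string, a list, a tuple and a range. The example values and variable names are made up for illustration.

```python
word = 'banana'                            # string
letters = ['b', 'a', 'n', 'a', 'n', 'a']   # list
pair = ('a', 'n')                          # tuple
numbers = range(1, 101)                    # range: the integers 1 to 100

# Length and membership work on every sequence type.
print(len(word), len(letters), len(pair), len(numbers))         # 6 6 2 100
print('a' in word, 'a' in letters, 'a' in pair, 50 in numbers)  # True True True True

# So do count() and index().
print(word.count('a'), letters.count('a'), pair.count('a'))     # 3 3 1
print(word.index('n'), letters.index('n'), numbers.index(50))   # 2 2 49

# Indexing and slicing also behave the same way.
print(word[1:3], letters[1:3], pair[0], numbers[-1])  # an ['a', 'n'] a 100
```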


In [None]:
string = 'The quick brown fox jumped over the lazy dogs.'
words = string.lower().split()
words

In [None]:
len(words)

In [None]:
'jumped' in words

In [None]:
words.count('the')

In [None]:
words.index('the')

In [None]:
# Start searching at position 1, so this finds the second 'the'.
words.index('the', 1)

In [None]:
for word in words:
    print(word)

## Built-in list methods

There are a number of other useful [list methods](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists) (functions that work on lists).

In [None]:
string = 'the quick brown fox jumped over the lazy dogs'
words = string.split()
words.append('again')
words

In [None]:
string

Remember that lists, unlike strings, are mutable.  The `append` method changes the original list in place and returns `None`; it does not return a new list.  This is unlike string methods, which return a new string with the change implemented and leave the original string untouched.
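
The contrast can be sketched as follows. The names `result`, `title` and `shouted` are only illustrative.

```python
# A list method changes the list in place and returns None.
words = ['the', 'quick', 'fox']
result = words.append('again')
print(result)  # None
print(words)   # ['the', 'quick', 'fox', 'again']

# A string method returns a new string and leaves the original alone.
title = 'the quick fox'
shouted = title.upper()
print(title)    # the quick fox
print(shouted)  # THE QUICK FOX
```

One practical consequence: writing `words = words.append('again')` would replace your list with `None`, so call `append` on its own line instead.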

What if you want to build up a new list, element by element?

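In [None]:
# This fails with a NameError: new_new_words has not been created yet.
new_new_words.append('Hello')

In [None]:
# Still a NameError: the failed call above did not create the list.
new_new_words

In [None]:
# Create an empty list first, then append to it.
new_new_words = []
new_new_words.append('Hello')
new_new_words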

In [None]:
new_new_words.append('World')
new_new_words
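
In practice, a list is often built up inside a loop. The sketch below collects the longer words of a sentence; the four-character cutoff and the name `long_words` are arbitrary choices for illustration.

```python
string = 'the quick brown fox jumped over the lazy dogs'

# Start with an empty list, then append the words that pass the test.
long_words = []
for word in string.split():
    if len(word) > 4:
        long_words.append(word)

print(long_words)  # ['quick', 'brown', 'jumped']
```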

In [None]:
words.remove('quick')
words

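You can also remove elements with the `del` statement, which takes an index or a slice to delete.  Whereas `remove` deletes the first element equal to a given value, `del` deletes by position.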

In [None]:
del words[-1]
words
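
The two ways of removing elements can be compared side by side in a sketch like this one. The flag `missing` is a made-up name used only to show what happens when the value is not in the list.

```python
words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dogs']

# remove() deletes the first element equal to the given value.
words.remove('the')
print(words)  # ['quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dogs']

# del deletes by position: a single index...
del words[0]
print(words)  # ['brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dogs']

# ...or a whole slice.
del words[1:3]
print(words)  # ['brown', 'over', 'the', 'lazy', 'dogs']

# remove() raises a ValueError if the value is not in the list.
try:
    words.remove('cat')
    missing = False
except ValueError:
    missing = True
print(missing)  # True
```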

In [None]:
words.insert(1, 'frisky')
words

Finally, you can turn a string into a list of characters.

In [None]:
string = "Hello world!"
my_list = list(string)
my_list
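
One reason to do this is that the resulting list, unlike the string, can be modified. The sketch below changes one character and then glues the characters back together with the string method `join()`; the names `chars` and `rejoined` are only illustrative.

```python
string = "Hello world!"

# Turn the string into a mutable list of characters.
chars = list(string)
chars[0] = 'J'

# Join the characters back into a new string, with nothing between them.
rejoined = ''.join(chars)
print(rejoined)  # Jello world!
print(string)    # Hello world! (the original string is unchanged)
```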