
# Week 3: Strings, Lists, and Files

In this section of the lecture, we explore some new methods (pun) and a new data type in Python.

- [string methods](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/06-String-Methods.html)
- [lists and loops](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/09-Lists-Loops-Part1.html)
- [files and character encoding](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/07-Files-Character-Encoding.html)

## Lab 

As always, your weekly lab is due Friday at 10pm. We recommend attempting the lab before the tutorial on Thursday so that you can get some help completing it if necessary.

# A Bit of Review

- variables
- data types
- operators

In [None]:
# How do you explain these potentially surprising results?

print(3 + 3)
print("3" + "3")

In [None]:
name = "Walsh"
print(name * 3)
print(name * 3.0)

In [None]:
print("True" + "False")
print(True + False)

In [None]:
True * 3.14159

In [None]:
True / False

What are the numeric values for `True` and `False`?

# Our Coding Task Today

Today we're going to learn how to do a big part of our Type/Token Ratio experiment: 
* Load a text
* Break that text into words (**"tokenize"** it)
* Count the number of words (or "tokens") in the text

Let's start by doing this task manually. 

Our sample text will be a poem from [*The Policeman's Beard is Half Constructed*](https://q.utoronto.ca/courses/400278/files/38406481?wrap=1):

> Awareness is like consciousness. Soul is like spirit. But soft is not like hard and weak is not like strong. This is called philosophy or a world-view.

How many words are in this text? 

# 3a: Indexing and Slicing strings

In [None]:

text = "Awareness is like consciousness. Soul is like spirit. But soft is not like hard and weak is not like strong. This is called philosophy or a world-view."

In [None]:
type(text)

In [None]:
text

In [None]:
print(text)

## Indexing

Let's start with the below. It will display "item number one" from the `text` variable. What do you expect to see?

In [None]:
text

In [None]:
text[1]

This would be a good place to stop and make sure that you understand some key features of strings and indexing.
- A string is ...
- Indexing means ...
- We start counting sequences at ...  (0 or 1?)

## Slicing

If you want to pull multiple characters out of a string, that's called **slicing**, and the syntax is as follows: 

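> `string[start:stop:step]`, where `start` and `stop` are "index positions" in the string you want to slice, and `step` is how many positions to move forward each time (1 by default). The character at `stop` itself is not included.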
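
As a quick sketch of how the three parts of a slice behave (the variable `word` and its value are made up for illustration and are not part of our running example):

```python
# Hypothetical example string, chosen only for illustration.
word = "philosophy"

first_four = word[0:4]     # start at index 0, stop just before index 4
every_other = word[::2]    # the whole string, taking every 2nd character
backwards = word[::-1]     # a step of -1 walks through the string in reverse

print(first_four)
print(every_other)
print(backwards)
```

Leaving `start` or `stop` empty means "from the very beginning" or "to the very end," respectively.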


In [None]:
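# What if we want to extract the first word, "Awareness"?
# "Awareness" has 9 characters (indexes 0 to 8), and the stop index is not included, so we stop at 9.
text[0:9]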

What other fun things can we do with slicing? 

In [None]:
text[9:]

# 3b: String methods

Let's now meet **methods**, which are a *special kind of function* that only belong to certain **data types** and which are written out differently than functions.

The syntax for methods is as follows: 

> `data.method(argument)`

As you can see, they look quite a bit like functions, but they come **after** the data that you want to perform some action on, and they are "attached" to that data, as it were, by a period (`.`). 
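
To make the contrast concrete, here is a minimal side-by-side sketch. The variable `shout` and its value are invented for this example only.

```python
# Illustrative string; `shout` is a made-up variable name.
shout = "Hello, Toronto"

# A function wraps around the data: function(data)
length = len(shout)

# A method is attached to the data with a period: data.method()
lowered = shout.lower()

print(length)
print(lowered)
```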

Python has lots of cool string-specific methods. Let's explore a few:
* `.lower()`: make lowercase
* `.upper()`: make uppercase
* `.title()`: make title case
* `.replace()`: replace some text with other text (like "Find and Replace" in Word)
* `.split()`: break a string into separate units, such as words

Let's try out the make-it-lower case method, `.lower()`

In [None]:
"AAAAAAAAAAAAAAAAAAAAAH".lower()

If we **wanted** to replace the value of `text` with a fully lowercased version of itself, how would we do it?

In [None]:
#
text = text.lower()
text

Below is a look at the `.replace()` method, which as you can see takes **two arguments**, the text to replace with something else, and the "something else" to replace it with.

`string.replace("text to remove", "text to insert in its place")`

Note that the **two arguments** are separated by a comma.

In [None]:
"Now approaching Ossington... Ossington Station".replace("Ossington", "Christie")

You can use `names_of_string_variables` rather than `"directly inputted text"` at any position here. For example,

In [None]:
analogy = "consciousness"
new_analogy = "???"

text.replace(analogy, new_analogy)

As exciting as the above undoubtedly is, perhaps the most exciting string method for the purposes of our TTR experiments is 

> `.split()`

— which takes a string and breaks it up into chunks.

`.split()` doesn't **require** an argument. What does it do below?

In [None]:
"April is the cruelest month, breeding / Lilacs out of the dead land".split()

The default, argument-less version of `split()` **"splits on whitespace."** That is to say, any time it encounters one or more consecutive characters that Python (or the people who made Python) interpret as "empty," it sharpens its fangs and cuts them out.

Whitespace characters include:
* "` `": spaces (yes, spaces are characters!)
* "`\t`": tabs (yes, tabs are characters, and are represented as `\t`)
* "`\n`": newlines or "Return"s (yes, "Return" or "Enter" is a character, and is represented as `\n` — among other ways!)
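
Here is a small sketch of that behaviour, using a made-up string (`messy`) that mixes several spaces, a tab, and a newline:

```python
# Made-up string mixing three spaces, a tab, and a newline.
messy = "April   is\tthe\ncruelest month"

# Each escape sequence really is a single character.
print(len("\t"), len("\n"))

# With no argument, any run of whitespace counts as one separator.
print(messy.split())
```

However much whitespace sits between two words, and whatever kind it is, the resulting list contains only the words.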

But `.split()` will **accept** an argument, if we want to split a string up by something other than whitespace.

For instance, we could split Eliot up by "`/`" to divide this poem into lines...

In [None]:
waste_land = "April is the cruelest month, breeding / Lilacs out of the dead land"

waste_land.split("/")

In [None]:
text.split()

In [None]:
text_words = text.split()

In [None]:
print(text_words)

In [None]:
type(text_words)

# 3c: Lists!

Lists have names, and they contain multiple items.

**Grocery List**:
 * Chocolate bar
 * Chips
 * Chocolate milk
 * Another bag of chips

If I wanted to create a Python equivalent of the above list, I might do something like...

In [None]:
grocery_list = ["Chocolate bar", "Chips", "Chocolate milk", "Another bag of chips"]

In [None]:
grocery_list

In [None]:
print(grocery_list)

In [None]:
type(grocery_list)

In [None]:
other_list = [2, "chocolate bar", 3.14159, True]

## Indexing and Slicing Lists

If you want to pull out an individual item or **element** from a list, you can **index** it just like a string.

In [None]:
grocery_list[0]

What do you think will happen when we do the below?

In [None]:
type(grocery_list[0])

In [None]:
grocery_list[0] + grocery_list[3]

You can also **slice** `list`s in the same ways as you can `str`s.

In [None]:
grocery_list[:2]

In [None]:
type(grocery_list[:2])

In [None]:
grocery_list[-2:]

In [None]:
grocery_list[::-1]

In [None]:
text_words

## `len()`

Believe it or not, we're one mere tiny function away from being able to do something really exciting and absolutely essential to our TTR experiment: **count the number of tokens in a text**. 

`.split()` **"tokenized"** the text. (It didn't do a perfect job, of course, but it did pretty well.)


All we need now is a way to actually count the number of words in our list, for which there is a Python function: **`len()`**.

In [None]:
print(text)
len(text)

In [None]:
print(text_words)
len(text_words)

Sadly, `len()` doesn't know what it means to count the length of an `int`, a `float`, or a `bool`.

In [None]:
len(232)

In [None]:
len(3.14159)

In [None]:
len(True)

When `len()` meets a `list`, he doesn't count anything within the actual items in the list; he only counts the number of elements or items in the list. We could, of course, ask him to count how long an individual element is, too, if it's the sort of thing that can be counted...

In [None]:
print(text_words[4])
len(text_words[4])

## The `.join()` method

Before we move on to counting the actual number of words in an actual novel (!!!!), there's one more string method to introduce, now that we know about `list`s: the `.join()` method.

`string.join(list)`, where the `string` is whatever you want to mash **between** the items being joined and `list` is the list that you want to collapse into a single string.


In [None]:
" Mississippi, ".join(["One", "two", "three", "four"])

Notice that the `string` is **only** stuck in **between** items, so that in my application I'm left with a weirdly hanging four.
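
One way to convince yourself of the "only in between" rule is to count the separators. This is just a sketch, and the variable names `numbers` and `joined` are invented for it:

```python
numbers = ["One", "two", "three", "four"]
joined = " Mississippi, ".join(numbers)

print(joined)

# Four items leave only three gaps between them, so three separators.
print(joined.count("Mississippi"))

# A list with a single item has no gaps, so no separator appears at all.
print(" Mississippi, ".join(["One"]))
```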

Here's a more practical application of `.join()`:

In [None]:
" ".join(text_words)

... can we make the robot sound like a modern teenager?

In [None]:
", like, ".join(text_words)

# 3e: Loading a real-life text file into Python... and tokenizing it!

We now have pretty much all the tools we need to perform an important part of our TTR exercise:
* We can **tokenize** a string by `.split()`ting it into words (more or less) and producing lists of words (more or less)
* And we can count how many words there are in those lists with `len()`

What we haven't learned to do, however, is actually load a novel into Python.

Well, believe it or not, for that we only need a single line of code. 

It starts with the `open()` function, which we will provide with two arguments:
* The **path** to the file we want to open, entered as a `str` (so with `""`s around it). This tells Python *where* that file is, relative to the notebook you currently have open. For this course, we will generally assume that the data files we open are in the same folder as the notebook that is using them. In that case, the path is pretty simple: the name of the file, in quotation marks.
* The type of **character encoding** that that file uses, a topic which requires its own subject heading.
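
As a rough sketch of those two arguments in action: the file name `tiny_poem.txt` is hypothetical, and the first few lines only create a throwaway file in a temporary folder so that the example can run anywhere. In the course notebooks the file would already be sitting next to the notebook, so only the last two lines would be needed.

```python
import os
import tempfile

# Set-up only: create a small UTF-8 file so the example can run anywhere.
# "tiny_poem.txt" is a hypothetical file name, not one of the course files.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "tiny_poem.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Soul is like spirit.")

# The part that matters: a path, plus an encoding, then .read() to get a str.
poem = open(path, encoding="utf-8").read()
print(poem)
```

The `with` block in the set-up is one common way to make sure a file is closed once we are done writing to it.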


## Character encoding

Although it would be handy if there were only one way of encoding text, there are in fact many. Two of these are:
* "[ASCII](https://en.wikipedia.org/wiki/ASCII)" — first devised in the 1960s! — which has pretty limited support for anything beyond English characters A-Z, a-z, 0-9, with some punctuation and special characters allowed. 
* "[UTF-8](https://en.wikipedia.org/wiki/UTF-8)" (an implementation of [Unicode](https://en.wikipedia.org/wiki/Unicode)), which covers far more characters and languages, and Unicode continues to add more.

Let's load a file. It's called `"sample_character_encoding.txt"` and it is encoded in the UTF-8 standard.

In [None]:
sample = open("sample_character_encoding.txt", encoding="utf-8").read()
print(sample)

If we try to open this same exact file with a different encoding system called ISO-8859-1, we get a bit of a mess:

In [None]:
sample = open("sample_character_encoding.txt", encoding="iso-8859-1").read()
print(sample)

If we try to open this same exact file using Ye Olde Fashionede ASCII encoding, we just get an error.

In [None]:
sample = open("sample_character_encoding.txt", encoding="ascii").read()
print(sample)
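
The same three outcomes can be reproduced without any file at all. The sketch below is only an illustration: the string `"café"` is made up, and it uses the `.encode()` and `.decode()` methods plus a `try`/`except` block, which we have not covered; the `try`/`except` is only there so the example keeps running after the error.

```python
word = "café"   # illustrative string with one non-English character

# Turn the string into raw bytes using UTF-8.
utf8_bytes = word.encode("utf-8")
print(utf8_bytes)   # the é takes up two bytes in UTF-8

# Reading those bytes back with the wrong encoding gives garbled text.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)

# ASCII has no way to represent é at all, so Python raises an error.
try:
    word.encode("ascii")
except UnicodeEncodeError as error:
    print("ASCII cannot encode this:", error)
```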

Now let's try loading a whole novel, such as Arthur Conan Doyle's ***The Sign of the Four***!

We've already conveniently put a copy of it — `sign-of-four.txt`, sourced from Project Gutenberg and encoded in UTF-8 — in your JupyterHubs, in the same folder as where this notebook lives. Let's load it into a variable called `sot4`.

In [None]:
sot4 = open("sign-of-four.txt", encoding="utf-8").read()

In [None]:
type(sot4)

In [None]:
len(sot4)

Let's have a look inside!

In [None]:
sot4

In [None]:
sot4[:200]

Notice that if we `print()` this out, we get quite different output!

In [None]:
print(sot4[:200])

Anyway, let's try **tokenizing** this as is and look at the output.

In [None]:
sot4_words = sot4.split()

In [None]:
sot4_words

And then... just one more step to get our number of tokens!

In [None]:
len(sot4_words)

In terms of the TTR project, we now know how to:
* load a file
* split it into words
* count the number of words
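
Put together, those three steps fit in a few lines. This is only a sketch: it uses our short sample poem in place of a file so that it runs on its own, and the names `tokens` and `token_count` are made up for the example.

```python
# Step 1 (load): in the notebook, this would be something like
#   text = open("sign-of-four.txt", encoding="utf-8").read()
# Here we use the sample poem so the sketch needs no files.
text = "Awareness is like consciousness. Soul is like spirit. But soft is not like hard and weak is not like strong. This is called philosophy or a world-view."

# Step 2 (tokenize): split on whitespace to get a list of words.
tokens = text.split()

# Step 3 (count): the length of the list is the number of tokens.
token_count = len(tokens)
print(token_count)
```

Note that punctuation is still glued to some tokens (for example, `"consciousness."`), which is one of the gaps listed next.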

We don't have the tools yet to:
* remove punctuation 
* automatically go into a folder full of lots and lots of files, load them all up, and count their lengths
* count unique types
* automatically standardize our sample size

To do all the above, we'll need to learn a bit about iteration and loops. Which we'll do next class...