# Week 5: Finishing the TTR Experiment

This week, we put together everything we've learned so far this semester to tackle the full TTR experiment.

## Part 1: Removing Punctuation with Regular Expressions

Here we learn about regular expressions, to help us with the super-important task of removing all punctuation from our texts.

## Part 2: Iterating through Files in a Folder

Here we use the `Path()` function to help load a whole folder of texts and analyze them one-by-one.

## Part 3: Automatically Determining Sample Size and Producing Standardized Results

Here we learn how to determine the total length of the shortest text, and then calculate the TTR only of a sample of the full text.


## Part 4: Writing CSV files

Here we use `open()` and `.write()` to get Python to spit out spreadsheet files with our results all ready to use!


## Links

* Melanie Walsh discusses regular expressions in [her chapter on web scraping](https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/03-Web-Scraping-Part2.html?highlight=regular%20expressions#regular-expressions). Our discussion is probably a bit gentler, but this is here if you're looking for more explanations.
* Walsh also covers opening and saving files in [her chapter on files and character encoding](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/07-Files-Character-Encoding.html)



# Taking Stock on the TTR Task

We now know how to do most of what we need to write code that will quickly and accurately calculate non-standardized and standardized TTRs for a folder full of text files.

During Week Three, we learned how to
* Load a file
* Split it into words
* Count the number of words (tokens)

In Week Four, we used conditionals and iteration to:
* Count unique words (types)

Today, we will learn how to:
* Remove punctuation for more accurate type counts
* Iterate through a folder of text files
* Automatically determine our sample size (the total length of the shortest text)
* Calculate the standardized TTR for each text file in our folder
* Output our results in the form of CSV spreadsheet files

# 1. Removing Punctuation with Regular Expressions

Regular expressions — also known by their cooler *nom de guerre* **Regex** — are a whole language of their own. And don't let that term "regular" fool you — they are wild and charismatic and (for humanities people, anyway) **extremely cool**.

Note that regular expressions aren't only used in Python. They can be used in most programming languages, and you can even use them in good text editors like Sublime Text. Trust me, they come in extremely handy once you get a handle on them!

Imagine them as a super-sophisticated version of a find-and-replace command. We've already met one of those — the `string.replace()` method. Regular expressions go way, way further. We're only going to explore a tiny fraction of what they can do...

Let's explore a scenario where `string.replace()` doesn't exactly get us what we want, and we need something more...

In [118]:
sot4 = open("sign-of-four.txt", encoding="utf-8").read()

In [None]:
sot4[1475:1843]

In [None]:
cocaine_exchange = sot4[1475:1843]
print(cocaine_exchange)

Let's say we would like to do three things to clean up this string:
* Replace the too-risqué word "cocaine" with "strawberry soda"
* Remove all punctuation
* Extract only the dialogue from this exchange, storing each piece of dialogue as an item in a list

Our old friend `string.replace()` can easily do the first, can do the second with a lot of effort, and cannot do the third at all.

In [None]:
cocaine_exchange = cocaine_exchange.replace("cocaine", "strawberry soda")
print(cocaine_exchange)

In [None]:
cocaine_exchange = cocaine_exchange.replace("?", "")
print(cocaine_exchange)

As you can see, `string.replace()` can remove punctuation... but we need to specify each piece of punctuation one-by-one, which is rather laborious.

In [None]:
cocaine_exchange = cocaine_exchange.replace("“", "")
print(cocaine_exchange)

As for extracting all passages of dialogue, `string.replace()` offers nothing at all. It can only take particular strings and replace them with other strings.

This is where **regular expressions** come in...

## Python libraries

Python has a bunch of built-in functionality that you need to explicitly call on to "activate." Think of it as a resource issue. Not everyone cares about fancy find-and-replace functions (boring people, to be specific), so they don't want their Python programs loaded with lots of unnecessary commands they won't call on. There are tons of domain-specific commands (astronomy commands, economics commands, biology commands) that are probably cool but that we don't intend to use, and we don't want them bogging down our stuff, either.

Due to our exquisite taste, we need regular expressions. 

This requires us to **load a Python library**: a set of commands that are lying politely in wait, waiting for their number to be called, ready to slide down the fireman's pole from the realm of mere potentiality into the world of the actual. The Library we seek is **named `re`.**

![Fireman's pole](firemanspole.gif)

The command below calls `re` down the fireman's pole. We can now use Regex!

In [124]:
import re

`re` has a bunch of different **functions** bundled into it, all of whose names begin with `re`, then `.`, and then the name of the command.

We're going to start with `re.sub()` which does more or less what `string.replace()` does, though its syntax is a bit different.

Since the `string` in `string.replace()` is really a variable containing some text, let's call it `text_variable.replace()` to make it more clear how it differs from `re.sub()`.

Recall that `text_variable.replace()` takes two arguments, `("the string to replace", "the string to replace it with")`, and `text_variable` is the variable containing the text to be modified. In contrast, `re.sub()` takes three arguments: `re.sub("the string to replace", "the string to replace it with", the_variable_containing_the_text)`.

Here's how we would do our first two tasks:

In [None]:
cocaine_exchange = sot4[1475:1843] # Start over from the original, unmodified passage
print(cocaine_exchange)

In [None]:
cocaine_exchange = re.sub("cocaine", "strawberry soda", cocaine_exchange)
print(cocaine_exchange)

In [None]:
cocaine_exchange = re.sub("“", "", cocaine_exchange)
print(cocaine_exchange)

Note that **the `re.sub()` function is NOT a mutating function** — so if you want to store its output, you need to explicitly stick it into a variable.

In [128]:
strawberry_soda_exchange = re.sub("cocaine", "strawberry soda", cocaine_exchange)

In [None]:
print(strawberry_soda_exchange)

## Regex 👸🤴

Regex is capable of doing so much more than this.

**Regex — androgynous, post-gender QueenKing 👸🤴, Regina/Rex — is capable of doing so much more than this!!**

Regex is flush with power.

Let's explore some of what Regex can do [at the website Regular Expressions 101](https://regex101.com).

**Don't worry: you don't need to memorize all of this. The only thing you really need to know is how to use Regex to remove punctuation for our TTR task. But we thought you would enjoy seeing a demonstration of Regex's immense power!**

We'll try the following:

* `a`: the character `a`
* `[aeiou]`: any one of `a`, `e`, `i`, `o`, *or* `u` (the square brackets `[]` mean "any one of what's between me")
* `[aeiouAEIOU]`: same as above, but adding capital letters
* `[a-z]`: any character in the **range** `a-z` — so, any lowercase letter
* `[a-zA-Z]`: any lowercase or uppercase letter — so, any letter
* `[^a-z]`: anything that is **not** `a-z` (the `^` means "not")

Then we'll meet these fellows:
* `\w`: any "word character": a letter, a digit, or the underscore `_`
* `\d`: any digit
* `\s`: any whitespace
* `\W`: anything *not* a word character
* `\D`: anything *not* a digit
* `\S`: anything *not* whitespace
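
If you'd rather try these patterns in Python than on the website, here is a small sketch using `re.sub()` to "mask" whatever each pattern matches. The `sample` sentence and the variable names are made up for illustration. One assumption worth flagging: the patterns containing a backslash are written with an `r` prefix (a "raw string") so that Python leaves the backslash alone.

```python
import re

sample = "Holmes took 3 cabs!"

# Square brackets: any ONE of the listed characters
vowels_hidden = re.sub("[aeiou]", "*", sample)

# ^ inside square brackets means "not": everything except lowercase letters
non_lowercase_hidden = re.sub("[^a-z]", "*", sample)

# \d is a digit, \s is whitespace (the r prefix keeps the backslash literal)
digits_hidden = re.sub(r"\d", "#", sample)
spaces_hidden = re.sub(r"\s", "_", sample)

# \w matches letters, digits AND the underscore; \W matches everything else
word_chars_hidden = re.sub(r"\w", "*", "a_1!")
non_word_hidden = re.sub(r"\W", "*", sample)

print(vowels_hidden)         # H*lm*s t**k 3 c*bs!
print(non_lowercase_hidden)  # *olmes*took***cabs*
print(digits_hidden)         # Holmes took # cabs!
print(spaces_hidden)         # Holmes_took_3_cabs!
print(word_chars_hidden)     # ***!
print(non_word_hidden)       # Holmes*took*3*cabs*
```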

## Removing Punctuation with Regex

Our adventures at Regular Expressions 101 will have shown us how we can quite simply remove "punctuation," which we will define as:
- **any character that isn't a lowercase letter a-z, an uppercase letter A-Z, or a number 0-9.**

In the language of Regex, you would express that same definition as follows: 
- **[^a-zA-Z0-9]**

We want to replace all of those with **spaces** rather than simply deleting them, since deleting the `-` in a word like `seven-per-cent` would turn it into a non-word like `sevenpercent`. Replacing each `-` with a space instead gives us three separate words, `seven per cent`, which our `string.split()` method will be able to easily "tokenize."
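
Here is a tiny sketch of that difference, using just the one hyphenated word (the variable names are only for illustration):

```python
import re

phrase = "seven-per-cent"

# Deleting the punctuation outright glues the pieces together
deleted = re.sub("[^a-zA-Z0-9]", "", phrase)

# Replacing the punctuation with a space keeps the pieces separate
spaced = re.sub("[^a-zA-Z0-9]", " ", phrase)

print(deleted)         # sevenpercent
print(spaced)          # seven per cent
print(spaced.split())  # ['seven', 'per', 'cent']
```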

In [None]:
cocaine_exchange = sot4[1475:1843]
print(cocaine_exchange)

In [None]:
cocaine_exchange = re.sub("[^a-zA-Z0-9]", " ", cocaine_exchange)
print(cocaine_exchange)

**That does the trick!** 

So now the question becomes: **WHEN** should we remove punctuation? When `sot4` is a string, or after we've used `.split()` to split it into words?

Let's try it both ways, first removing punctuation *after* `.split()`ting, and then removing it *before*.

In [132]:
ce_words = cocaine_exchange.split() # First we split the text into words

ce_unique_words = []

for word in ce_words:
    word = word.lower()
    word = re.sub("[^a-zA-Z0-9]", " ", word) # Then we remove punctuation in the for loop that counts unique words
    if word not in ce_unique_words:
        ce_unique_words.append(word)

In [None]:
ce_unique_words[:10]

In [134]:
cocaine_exchange_nopunct = re.sub("[^a-zA-Z0-9]", " ", cocaine_exchange) # First we remove punctuation

cenp_words = cocaine_exchange_nopunct.split() # Then we split the text into words

cenp_unique_words = []

for word in cenp_words: # By the time we enter this for loop, the punctuation is already gone
    word = word.lower()
    if word not in cenp_unique_words:
        cenp_unique_words.append(word)

In [None]:
cenp_unique_words[:10]

In [None]:
(len(ce_unique_words) / len(ce_words)) * 100

In [None]:
(len(cenp_unique_words) / len(cenp_words)) * 100

So **we want to remove punctuation *before* using `.split()`,** because otherwise we'll end up with a bunch of funky "unique words" with spaces where their punctuation once was. It doesn't solve the problem we initially had.

If we remove punctuation before tokenizing with `.split()` we get what we want, because `.split()` splits whenever it meets any number of consecutive whitespace characters. So it will easily turn `asked    morphine` — with its lengthy separating whitespace — into two tokens, `asked` and `morphine`.
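
A quick sketch of that behaviour, on a made-up messy string. The comparison with `.split(" ")` is an extra not used elsewhere in this notebook, included only to show why calling `.split()` with no argument is the convenient choice here.

```python
messy = "asked    morphine \n or\tcocaine"

# With no argument, .split() treats any run of spaces, tabs, or newlines as ONE separator
tokens = messy.split()
print(tokens)  # ['asked', 'morphine', 'or', 'cocaine']

# With an explicit " " argument, every single space is a separator,
# so a run of four spaces leaves three empty strings behind
fussy_tokens = "asked    morphine".split(" ")
print(fussy_tokens)  # ['asked', '', '', '', 'morphine']
```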

### Now, let's do this for real with our old friend *The Sign of the Four*.

Let's start with how we did it last class, before we knew how to remove punctuation...

In [138]:
# Old method without removing punctuation

sot4_words = sot4.split()

sot4_unique_words = []

for word in sot4_words:
    word = word.lower()
    if word not in sot4_unique_words:
        sot4_unique_words.append(word)

... And then do it our fancy new way, using Regex to remove all punctuation.

In [139]:
# New method that removes punctuation

sot4np = re.sub("[^a-zA-Z0-9]", " ", sot4) # In these variable names, "np" signals "no punctuation"

sot4np_words = sot4np.split()

sot4np_unique_words = []

for word in sot4np_words:
    word = word.lower()
    if word not in sot4np_unique_words:
        sot4np_unique_words.append(word)

In [None]:
sot4[:500]

In [None]:
sot4np[:500]

In [None]:
sot4_words[:20]

In [None]:
sot4np_words[:20]

In [None]:
sot4_unique_words.sort()
sot4_unique_words[:20]

In [None]:
sot4np_unique_words.sort()
sot4np_unique_words[:20]

The tokenization without punctuation works a lot better. 

### How much do you think this will affect the TTR?

In [None]:
(len(sot4_unique_words) / len(sot4_words)) * 100

In [None]:
(len(sot4np_unique_words) / len(sot4np_words)) * 100

## 👸🤴 Digression, Part I: Extracting Quotations with Regex

We probably won't have time to cover this in lecture, but it's potentially fun for those of you who are interested.

### Note: None of this Digression (Part 1 or Part 2) will be on the midterm or the exam. It is not something we require you to know. It's just for fun.

For this task, we will need some cool new Regex characters:
* `.`: "any character except a newline"

And this "quantifier" which you add to a regular expression to signal
* `*`: "zero or more occurrences of the thing immediately to my left"

So that the expression
* `.*` means "zero or more characters other than a newline"

To catch one-line quotations, one could try...
* `".*"`: a `"` character, followed by zero or more occurrences of any character except a newline, followed by a `"` character

This actually won't work on *The Sign of the Four*, because it's from Project Gutenberg, and PG files use “curly quotes.” **Yes: `"` and `“` and `”` are all actually different characters**!

So, for a Project Gutenberg file like *The Sign of the Four*, we need:
* `“.*”`: a `“` character, followed by zero or more occurrences of any character except a newline, followed by a `”` character

We need one added complexity: the searches are "greedy" (they grab as much as they can), so they sometimes add narration in between the opening and closing quotes. We can fix this by specifying that if you see a close-quote character, immediately stop. The regular expression we need is:

* `“[^”]*”` — where the `[^”]` means "any character except a close quote"

The Regex command we want to use to grab all the quotations in `sot4` is `re.findall`, which takes two arguments:
* The Regex pattern you want to find (expressed as a string, so surround it with `"`s)
* The string variable in which you want to look for this pattern

In [None]:
re.findall("“[^”]*”", sot4)

As you can see, `re.findall` returns a `list` in which every match is provided as an item.

That looks good, but it isn't stored anywhere, so let's grab that output and put it into a variable!

In [149]:
sot4_quotations = re.findall("“[^”]*”", sot4)

In [None]:
sot4_quotations[-15:]
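
To see why the "greedy" pattern needed fixing, here is a small sketch comparing the two patterns on a single made-up line containing two quotations (the line is invented for the demo, not copied from the novel):

```python
import re

line = "“Which is it?” I asked. “Morphine or cocaine?”"

# Greedy: .* runs all the way to the LAST close quote on the line
greedy = re.findall("“.*”", line)

# Careful: [^”]* stops at the FIRST close quote it meets
careful = re.findall("“[^”]*”", line)

print(greedy)   # one item, with the narration swallowed up inside it
print(careful)  # two items, one per quotation
```

A side effect worth knowing: unlike `.`, the class `[^”]` also matches newline characters, so this pattern can catch quotations that run across several lines.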

## 👸🤴 Digression, Part II: Creating a Literary Mashup

### Note: Like Part I of this Digression, this will not be on the midterm or the exam.

Now, let's say you want to do something really fun with these quotations you just extracted... like, say, stick them into *Pride and Prejudice* so that all of Austen's dialogue is replaced with Conan Doyle's!

Here's how I'm going to do it:
* Load up P&P
* Replace all the dialogue in P&P with the phrase "QUOTE_HERE" so that I know where to stick my replacement quotations.
* Then iterate through my list of SOT4 quotations, popping them into P&P one by one

In [None]:
# This loads P&P, a copy of which I have conveniently placed in the same folder as this notebook.

pandp = open("pride_prejudice.txt", encoding="utf-8").read()

print(pandp[470:773])

In [152]:
# This replaces all dialogue in P&P with the phrase "QUOTE_HERE", 
# creating targets I can then replace one-by-one with the SOT4 quotations

pride_of_the_four = re.sub("“[^”]*”", "QUOTE_HERE", pandp)

In [None]:
print(pride_of_the_four[470:650])

In [154]:
## Now I will iterate through the list of SOT4 quotations, using each one to replace one "QUOTE_HERE" in pride_of_the_four

for quotation in sot4_quotations:
    pride_of_the_four = re.sub("QUOTE_HERE", quotation, pride_of_the_four, count=1) # "count=1" tells re.sub() to make only one replacement for each item in sot4_quotations

In [None]:
# This is the amusing result...

print(pride_of_the_four[470:778])

In [None]:
# Below we will learn how to write files. But here's a little preview! This line saves our amazing new mashup novel as "pride-of-the-four.txt"

open("pride-of-the-four.txt", mode="w", encoding="utf-8").write(pride_of_the_four)

# 2. Iterating through Files in a Folder

Okay, we now know how to remove punctuation, which is essential to our TTR task: we are now getting accurate token and type counts.

We only need to be able to do three more things:
* Learn how to automatically create a standardized sample size
* Automate the loading of files so that we can give Python a folder full of text files and let it do its thing, with no additional help from us
* Have Python spit out some nice spreadsheet files for us: one with non-standardized values and one with standardized values.

First, let's handle the folder-loading task.

## The `Path` function

For this, we need to pull in another Python Library: `pathlib`, from which we are going to extract the function `Path`. We can coax it down the fireman's pole with this command:

In [157]:
from pathlib import Path

`Path` will help us look through a folder and find the... **pathways** that Python needs to follow to reach the files whose TTR values we want it to calculate.

First, `Path` needs to know where we've stored our plain text files. We will let it know by passing it a string variable that contains the name of the folder we want it to look in. In this case (check your JupyterHubs!) we've put all the individual chapters of *The Sign of the Four* in a folder called `sot4chaps`.

In [None]:
folder_path = "sot4chaps" # This variable can be named anything as long as it matches the variable name in the Path() command below
from pathlib import Path
foo = Path(folder_path).glob("*")
sorted(foo)

The below command asks `Path` to look in a folder called `sot4chaps` (which it expects to find **in the same folder as the Jupyter Notebook we are currently using — which it is!**), and to print out the paths of **absolutely everything in that folder** (the `"*"` as the argument to the `.glob()` method is what instructs it to look for everything).

In [None]:
for file_path in Path(folder_path).glob("*"):
    print(file_path)

Each of those things is a **file path**: a path or route that Python will need to follow — relative to the place from which it's receiving its commands; namely, this Jupyter Notebook — to get to the files we want it to analyze.

As you can see, I have deviously inserted the dreaded `firemanspole.gif` into that folder. We do not want to calculate the TTR of `firemanspole.gif`!! Luckily we can tell `Path` to only look for particular kinds of files. In this case, we only want it to look at plain text files, which all end `.txt`. So we can replace the `*` in the above command ("I want paths of everything in that folder") with `*.txt` ("I only want paths to the plain text files").

(NOTE: The `*` in the `Path().glob()` method means something different from the `*` in Regex. Such is life in the world of computer programming, where one must learn to speak multiple dialects to communicate...)

In [None]:
for file_path in Path(folder_path).glob('*.txt'):
    print(file_path)

As you'll notice, the above list of files is not sorted, which is both aesthetically annoying and will also make it more difficult to interpret our results. (We can always sort things later using Microsoft Excel, etc., but we may as well do as much as we can right here in Python.)

Thankfully, we can wrap the Python function `sorted()` around the `Path().glob()` call, and our loop will then go through the file paths in alphabetical order.

In [None]:
for file_path in sorted(Path(folder_path).glob('*.txt')):
    print(file_path)

Now, rather than just spewing out the file paths with the `print()` function... Let's actually **load** each of these files, and print out the first 100 characters of each, shall we?

In [None]:
for file_path in sorted(Path(folder_path).glob('*.txt')):
    text = open(file_path, encoding='utf-8').read()
    print(text[:100])

Now that we can do this, we can take a major step: we can load an entire folder of files and, for each one, calculate its overall TTR. The code below does just that!

In [None]:
import re
from pathlib import Path

folder_path = "sot4chaps"

for file_path in sorted(Path(folder_path).glob('*.txt')):
    
    text = open(file_path, encoding='utf-8').read()
    text = re.sub("[^a-zA-Z0-9]", " ", text)
    
    text_words = text.split()
    tokens = len(text_words)
    
    unique_words = []
    
    for word in text_words:
        word = word.lower()
        if word not in unique_words:
            unique_words.append(word)
            
    types = len(unique_words)
    
    ttr = (types / tokens) * 100
    
    print(f"'{file_path.stem}' has {types} types, {tokens} tokens, and a TTR of {ttr}")
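
The `print()` line in the cell above uses `file_path.stem`, which we haven't formally met. As a minimal sketch (the file name `chapter-01.txt` is hypothetical and doesn't need to exist for this to run), here are three handy attributes that every `Path` object carries:

```python
from pathlib import Path

example_path = Path("sot4chaps") / "chapter-01.txt"  # hypothetical file name

file_name = example_path.name      # the file name, including its extension
file_stem = example_path.stem      # the file name WITHOUT its extension
file_suffix = example_path.suffix  # just the extension

print(file_name)    # chapter-01.txt
print(file_stem)    # chapter-01
print(file_suffix)  # .txt
```

Using `.stem` gives us a tidy label for each text in our output, without the folder name or the `.txt` on the end.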

# 3. Automatically Determining Sample Size and Producing Standardized Results

So we're done, right??

### **NO!**

Why not? What do we still need to be able to do?

That's right: calculate *standardized* TTRs!

To be specific, we still need to figure out
* How to automatically determine our sample size (i.e., the total length of the shortest text)
* And then how to calculate TTRs for that sample.

## Calculating the Total Length of the Shortest Item in a List

Before we do this with full-length texts, let's try with a small-scale experiment.

Let's say we create a list with a bunch of strings in it. How could we automatically determine the total length of the shortest of these strings?

In [None]:
bunch_o_strings = ["Adam", "Marta", "Rosie", "Jazz", "Adamillo", "Anna", "Stephen", "Richard", "Ernest"]

There are surely a bunch of ways to do it. But let's use this method.
1. Create a new variable called `length_of_shortest_string` where we'll record the length of the shortest string. We'll initially set its value to `0`, because that's the minimum length that a string (or a list) can be.
2. Iterate through all the strings in the list using a `for` loop.
3. For each item in `bunch_o_strings`, grab its length and store it in `string_length`.
4. Check whether that `string_length` is the shortest we've seen so far, in which case we'll save that information in `length_of_shortest_string`; otherwise, we'll ignore it. **OR,** if `length_of_shortest_string` is still set to 0 (its initial value), we'll put whatever is in `string_length` in there, since that means this is our first time through the loop.

In [None]:
length_of_shortest_string = 0 

for string in bunch_o_strings:
    string_length = len(string)
    if length_of_shortest_string == 0 or string_length < length_of_shortest_string:
        length_of_shortest_string = string_length

In [None]:
length_of_shortest_string
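
As one of those "other ways to do it," here is a sketch using Python's built-in `min()` function. This is an alternative for comparison, not the method we use in the rest of this notebook.

```python
bunch_o_strings = ["Adam", "Marta", "Rosie", "Jazz", "Adamillo", "Anna", "Stephen", "Richard", "Ernest"]

# Measure every string, then let min() pick out the smallest length
length_of_shortest_string = min(len(string) for string in bunch_o_strings)

print(length_of_shortest_string)  # 4
```

One difference to be aware of: `min()` raises an error if the list is empty, whereas the `for` loop version would quietly leave the value at `0`. The loop version also can't tell "no strings seen yet" apart from a genuinely empty string of length `0`, which probably won't matter for our novels but is worth keeping in mind.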

## Creating Standardized Sample Sizes

Let's stick with our toy example for a moment.

We know that the `sample_size` we want is 4. So how do we actually *slice off* just the first 4 characters of each of those strings?

I'll show you the code, you tell me what the commented line does...

In [None]:
for string in bunch_o_strings:
    string_standardized = string[:length_of_shortest_string] # What's happening here?
    print(string_standardized)

That's right: our old friend **slicing** comes to our rescue here. Remember that the syntax for slicing is `[start:stop:step]`. If we only want the first four characters of a string, we write `string[:4]`. Since we won't know our sample size until we load all the files in a folder, we don't want to hard-code a number in there. So we can use a variable name instead.

Now, we're going to be standardizing a sample of `lists` (texts broken up into words) rather than `strings`... but as we know, slicing lists and strings works exactly the same way. If we want the first four items of a list, we would also use `list[:4]`...
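
Here's a quick sketch of that, with a made-up sentence standing in for a full text:

```python
words = "the science of deduction is an exact science".split()
sample_size = 4

# Slicing a list gives us its first four ITEMS (here, words)
first_words = words[:sample_size]
print(first_words)  # ['the', 'science', 'of', 'deduction']

# Slicing a string gives us its first four CHARACTERS
first_characters = "deduction"[:sample_size]
print(first_characters)  # dedu

# If the list is shorter than the sample size, the slice just returns all of it
short_list = ["too", "short"][:sample_size]
print(short_list)  # ['too', 'short']
```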

## We now have ALMOST all the skills we need to quickly and accurately calculate TTRs for any number of plain text files stored together in a folder

We already know how to:
* Automatically load files in a folder, one-by-one, as strings
* Remove punctuation, split them into words, and calculate their total number of words (tokens)
* Record the number of unique words (types) with a for loop that also lowercases all words
* Automatically standardize the sample size, looking only at the first x words in each text, where x is the total length of the shortest text

The only things we still don't know how to do are:
* Output the results of our analysis of the total texts (non-standardized results)
* Output the standardized results

Below I've written out the code that does all the things we've learned so far. It doesn't store the results anywhere yet, but it does calculate them and print them out.

This first cell records information for overall, non-standardized values and also determines our sample size (the total length of the shortest text). 

In [None]:
import re
from pathlib import Path

folder_path = "sot4chaps"

sample_size = 0  # Note this line and figure out what it's doing!

for file_path in sorted(Path(folder_path).glob('*.txt')):
    
    text = open(file_path, encoding='utf-8').read()
    text = re.sub("[^a-zA-Z0-9]", " ", text)
    
    text_words = text.split()
    tokens = len(text_words)
    
    if sample_size == 0 or tokens < sample_size: # The line I noted above connects to this one and the next
        sample_size = tokens
    
    unique_words = []
    
    for word in text_words:
        word = word.lower()
        if word not in unique_words:
            unique_words.append(word)
            
    types = len(unique_words)
    
    ttr = (types / tokens) * 100
    
    print(f"'{file_path.stem}' has {types} types, {tokens} tokens, and a TTR of {ttr}")
    print(f"So far, the shortest text is {sample_size} words in length.\n")

The cell below uses the sample size determined in the previous step (1769 words; stored in the `sample_size` variable) to calculate the standardized TTR for the first 1769 words of each text.

In [None]:
for file_path in sorted(Path(folder_path).glob('*.txt')):
    text = open(file_path, encoding='utf-8').read()
    text = re.sub("[^a-zA-Z0-9]", " ", text)
    
    text_words = text.split()
    text_words_standardized = text_words[:sample_size] # This is the key new line in this block of code. Figure out what it does! Where did we define the variable sample_size?
    tokens_standardized = len(text_words_standardized)

    unique_words_standardized = []
    
    for word in text_words_standardized:
        word = word.lower()
        if word not in unique_words_standardized:
            unique_words_standardized.append(word)
            
    types_standardized = len(unique_words_standardized)
    
    ttr_standardized = (types_standardized / tokens_standardized) * 100
    
    print(f"'{file_path.stem}' has {types_standardized} types in the standardized sample of {tokens_standardized} tokens, and a TTR of {ttr_standardized}.\n")

# 4. Writing CSV files

Okay — that's a lot to read through and think about. But let's just get this beast of a TTR task finished by taking the final step: doing all this but also storing the results in a file.

We'll do this with the `open()` function, the `.write()` method, and the CSV file format.

## What is a CSV?

A CSV is a very simple file format for spreadsheets. The name stands for "Comma Separated Values" — and that's really all it is: a plain text file in which values (whatever would go into a cell in a spreadsheet) are separated by commas! The end of a row in a CSV file is signalled by the newline character `\n`. Text cells should be wrapped in quotation marks (`""`) just like strings in Python. Other than that, no tricks!

A pretty table like the following:

| Types | Tokens |
|:-----:|:------:|
| 12    | 24     |
| 33    | 100    |
| 75    | 500    |

Would be expressed in a CSV file as:

```
"Types","Tokens"
12,24
33,100
75,500
```

... and a spreadsheet program like Excel would see that and know exactly what to do with it.

(By the way, when Quercus gives your final grades at the end of the year to be uploaded to the eMarks system... it will be in the form of a CSV file!)

## Writing files with `open()` and `.write()`

We already know how to **load** a file into Python and stick it into a variable. We do it like this, with the `open()` function and the `.read()` method.

In [None]:
sot4 = open("sign-of-four.txt", encoding="utf-8").read()

Now, this might not be the most intuitive thing in the world... but actually **writing** or **creating** a file in Python happens pretty much the same way.
* First, you **`open()`** a file. Because it's not a file that already exists — it's one you're creating out of nothing — you need to set the *argument* `mode` to `"w"` (write) rather than the default `"r"` (read).
* Then you use the `.write()` method to write something into that file.

In [None]:
open("my-new-file.txt", mode="w", encoding="utf-8").write("Hey, look at this, I made a file in Python!")

We can read this back in to make sure it actually worked...

In [None]:
open("my-new-file.txt", mode="r", encoding="utf-8").read()

That's all you need if you want to create your new file in one shot. But we want to build ours up slowly, text by text, as we iterate through our folder of text files. 

For that, we'll use the following format.

1. First, we `open()` a file and assign that "file object" to a variable called `file`.
2. We then write whatever we want into that `file` variable, applying the `.write()` method. We can do this as many times as we like.
3. Then we "close" our file using the `.close()` method.

In [None]:
file = open("another-new-file.txt", mode="w", encoding="utf-8")

file.write("Here's some text\n")
file.write("Here's some more!\n")
file.write("And that's all I want to write for now!")

file.close()

In [None]:
print(open("another-new-file.txt", mode="r", encoding="utf-8").read())
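
You will often see the same open-write-close pattern written with Python's `with` statement, which closes the file for you at the end of the indented block. Here is a minimal sketch of that alternative; the file name `with-example.txt` is made up, and the rest of this notebook sticks with explicit `.close()` calls.

```python
# Everything indented under "with" can use the file; Python closes it afterwards
with open("with-example.txt", mode="w", encoding="utf-8") as file:
    file.write("Here's some text\n")
    file.write("Here's some more!\n")

# No file.close() needed. Reading it back works the same way.
with open("with-example.txt", encoding="utf-8") as file:
    contents = file.read()

print(contents)
```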

Thankfully, this all works exactly the same way for a CSV file. Except... we put in commas between values and wrap text in `""`s.

In [None]:
file = open("babys-first-spreadsheet.csv", mode="w", encoding="utf-8")

file.write('"Types","Tokens"\n') # Note that if you want to write double quotes, you have to wrap them in single quotes, or Python will get confused.
file.write("12,24\n")
file.write("33,100\n")
file.write("75,500")

file.close()
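
Writing the commas and quotation marks by hand works fine for our purposes. For comparison, here is a sketch using Python's built-in `csv` library, which decides for itself when a value needs quotation marks (for example, when the text contains a comma). The file name and the values are made up for the demo, and we won't use this library in the rest of the notebook.

```python
import csv

rows = [
    ["Text", "Types", "Tokens"],
    ["Chapter 1, part one", 12, 24],  # made-up values; note the comma inside the text
    ["Chapter 2", 33, 100],
]

# newline="" is what the csv library asks for when opening files
with open("csv-module-example.csv", mode="w", encoding="utf-8", newline="") as file:
    writer = csv.writer(file)
    writer.writerows(rows)  # writes every row, adding commas and quotes as needed

# Read the file back in to check that the comma inside the text survived
with open("csv-module-example.csv", encoding="utf-8", newline="") as file:
    rows_read_back = list(csv.reader(file))

print(rows_read_back)
```

Notice that everything comes back as a string, including the numbers: a CSV file has no way of recording the difference.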

Now that we know how to write CSV files, we are well and truly finished with the TTR task (well... as much as anything is ever finished! I can imagine a few ways to improve it, still. Can you??)

Here is the whole process, in a single cell. The major part of your Lab this week is to comment this big block of code, line by line.

In [None]:
import re
from pathlib import Path

folder_path = "sot4chaps"

sample_size = 0

file = open("ttr-overall.csv", mode="w", encoding="utf-8")

file.write('"Text","Types","Tokens","TTR"\n')

for file_path in sorted(Path(folder_path).glob('*.txt')):
    
    text = open(file_path, encoding='utf-8').read()
    text = re.sub("[^a-zA-Z0-9]", " ", text)
    
    text_words = text.split()
    tokens = len(text_words)
    
    if sample_size == 0 or tokens < sample_size:
        sample_size = tokens
    
    unique_words = []
    
    for word in text_words:
        word = word.lower()
        if word not in unique_words:
            unique_words.append(word)
            
    types = len(unique_words)
    
    ttr = (types / tokens) * 100
    
    file.write(f'"{file_path.stem}",{types},{tokens},{ttr}\n')

file.close()



file = open("ttr-standardized.csv", mode="w", encoding="utf-8")

file.write('"Text","Types","Tokens","TTR"\n')

for file_path in sorted(Path(folder_path).glob('*.txt')):
    text = open(file_path, encoding='utf-8').read()
    text = re.sub("[^a-zA-Z0-9]", " ", text)
    
    text_words = text.split()
    text_words_standardized = text_words[:sample_size]
    tokens_standardized = len(text_words_standardized)

    unique_words_standardized = []
    
    for word in text_words_standardized:
        word = word.lower()
        if word not in unique_words_standardized:
            unique_words_standardized.append(word)
            
    types_standardized = len(unique_words_standardized)
    
    ttr_standardized = (types_standardized / tokens_standardized) * 100
    
    file.write(f'"{file_path.stem}",{types_standardized},{tokens_standardized},{ttr_standardized}\n')

file.close()
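
As one example of the kind of improvement hinted at above: the two loops in that big cell repeat almost the same steps. The sketch below bundles those steps into a single function. It is only one possible approach and relies on two things we haven't covered in this notebook, namely defining our own function with `def` and Python's `set` type (a collection that automatically discards duplicates). The name `count_types_and_tokens` is made up.

```python
import re

def count_types_and_tokens(text, sample_size=None):
    """Return (types, tokens) for a text, optionally using only its first sample_size words."""
    words = re.sub("[^a-zA-Z0-9]", " ", text).split()
    if sample_size is not None:
        words = words[:sample_size]  # keep only the standardized sample
    unique_words = set(word.lower() for word in words)  # a set drops duplicates for us
    return len(unique_words), len(words)

# Overall counts for a made-up mini "text"
types, tokens = count_types_and_tokens("The cat saw the other cat. The end!")
print(types, tokens)  # 5 8

# Standardized counts for the first 4 words only
types_standardized, tokens_standardized = count_types_and_tokens(
    "The cat saw the other cat. The end!", sample_size=4
)
print(types_standardized, tokens_standardized)  # 3 4
```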