# Topic 3 - Diving into unstructured and structured data

This week, we will talk about working with unstructured and structured data in Python. So what is the difference between them? Structured data is information with a high degree of organization, which can easily be ordered and processed by machines. You can compare it with a perfectly organized filing cabinet where everything is identified, labeled and easy to access. Unstructured data, however, is not organized in a pre-defined manner and therefore lacks structure. We will look at the following formats, starting with unstructured plain text and then moving on to structured formats:

- plain text
- CSV and TSV
- JSON
- XML

## Working with files: plain text

When doing text analysis, you will often work with files that contain plain text. These files typically end with the `.txt` extension. In Python, you can read the content of a file, store it as the type of object that you need (string, list, etc.) and manipulate it (e.g. replacing or removing words). You can also write new content to an existing or a new file.

### Reading (and closing) files
Let's start with opening a file in Python. To do this, we need to associate the file on disk with a variable in Python. First, we tell Python where the file is stored on your disk. The location of your file is often referred to as the file path. Python will start looking in the 'working' or 'current' directory (which often will be where your Python script is). If it's in the working directory, you only have to tell Python the name of the file (e.g. `colors.txt`). If it's not in the working directory, you have to tell Python the exact path to your file. We will create a string variable to store this information:

In [None]:
filename = "./texts/charlie.txt"  

Note the single dot at the beginning of the file path; this means 'the current directory'. When writing a file path, you can use the following:

- /     means the root of the current drive; 
- ./    means the current directory;
- ../   means the parent of the current directory.  	 
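
If you are not sure which directory Python treats as the current one, you can ask Python itself. The following is a minimal sketch using the standard `pathlib` module; the path `./texts/charlie.txt` is the one used in this topic, and whether it exists depends on where you run the code.

```python
from pathlib import Path

# The directory that Python treats as the 'current' (working) directory.
working_dir = Path.cwd()
print(working_dir)

# A relative path is interpreted relative to the working directory.
relative_path = Path("./texts/charlie.txt")
absolute_path = relative_path.resolve()
print(absolute_path)

# '..' refers to the parent of the current directory.
parent_dir = Path("..").resolve()
print(parent_dir)

# Check whether the file exists before trying to open it.
print(relative_path.exists())
```

If the last line prints `False`, Python cannot find the file at that location, and opening it for reading would fail.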


Now that Python knows where your file is stored, we can open the file by using the built-in function `open()`:

In [None]:
infile = open(filename, "r")

The `open()` function requires the file path as its first argument. The second argument specifies the *mode* in which the file is opened. The mode you choose will depend on what you wish to do with the file. Here are some of our mode options:

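- 'r' : use for reading (this is the default mode)
- 'w' : use for writing; an existing file with the same name is overwritten
- 'x' : use for creating and writing to a new file; fails if the file already exists
- 'a' : use for appending to the end of a file
- 'r+' : use for reading and writing to the same file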
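
To get a feeling for the difference between these modes, you could experiment with a throwaway file. The sketch below is only an illustration: the file name `modes_demo.txt` is made up, the file is created in a temporary folder so that none of your own files are touched, and the `read()`, `write()` and `close()` operations it uses are explained later in this topic.

```python
import os
import tempfile

# Work in a temporary folder so that no real files are touched.
folder = tempfile.mkdtemp()
demo_path = os.path.join(folder, "modes_demo.txt")  # hypothetical file name

# 'x' creates a new file and fails if the file already exists.
demo_file = open(demo_path, "x")
demo_file.write("first line\n")
demo_file.close()

# 'a' keeps the existing content and adds new text at the end.
demo_file = open(demo_path, "a")
demo_file.write("second line\n")
demo_file.close()

demo_file = open(demo_path, "r")
after_append = demo_file.read()
demo_file.close()
print(after_append)

# 'w' empties the file before writing, so the old content is lost.
demo_file = open(demo_path, "w")
demo_file.write("only line\n")
demo_file.close()

demo_file = open(demo_path, "r")
after_write = demo_file.read()
demo_file.close()
print(after_write)

# Opening an existing file with 'x' raises a FileExistsError.
x_failed = False
try:
    open(demo_path, "x")
except FileExistsError:
    x_failed = True
    print("'x' refuses to overwrite an existing file")
```

Be careful with 'w': it does not ask for confirmation before emptying an existing file.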

Let's now print `infile`. What do you think will happen?

In [None]:
print(infile)

"Hey! That's not what I expected to happen!", you might think. Python is not printing the contents of the file but only some mysterious mention of some `TextIOWrapper`. This `TextIOWrapper` thing is Python's way of saying it has *opened* a connection to the file `charlie.txt`. In order to *read* the contents of the file, Python provides three related operations. The first operation is `read()`:

In [None]:
content = infile.read()
print(content)

The variable `content` now holds the entire content of the file `charlie.txt` as a single string and we can access and manipulate it just like any other string. 

The second operation is `readlines()`, which returns a list of the lines in the file, where each item of the list represents a single line:

In [None]:
lines = infile.readlines()
print(lines)

Oops, why does this print an empty list? Something to keep in mind when you are reading from files is that Python keeps track of how far it has read. After `read()` has returned the entire content, the position is at the end of the file, so there is nothing left for `readlines()` to return. The simplest way to read the content again is to open the file again. Let's try again:

In [None]:
infile = open(filename, "r")
lines = infile.readlines()
print(lines)
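
Opening the file again is not the only option. A file object remembers its position in the file, and its `seek()` method can move that position back to the start. The sketch below illustrates this with a small throwaway file; the file name and its content are made up for the example.

```python
import os
import tempfile

# Create a small throwaway file (hypothetical content) to experiment with.
folder = tempfile.mkdtemp()
demo_path = os.path.join(folder, "position_demo.txt")
demo_file = open(demo_path, "w")
demo_file.write("line one\nline two\n")
demo_file.close()

demo_file = open(demo_path, "r")
first_read = demo_file.read()        # reads everything; the position is now at the end
second_read = demo_file.readlines()  # nothing left to read, so this is an empty list
print(second_read)

demo_file.seek(0)                    # move the position back to the start of the file
third_read = demo_file.readlines()   # now the lines can be read again
print(third_read)
demo_file.close()
```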

Now you can, for example, use a for-loop to print each line in the file:

In [None]:
for line in lines:
    s = "LINE: " + line
    print(s)

The third operation `readline()` returns the next line of the file, returning the text up to and including the next newline character (*\n*). More simply put, this operation will read a file line-by-line. So if you call this operation again, it will return the next line in the file. Try it out below!

In [None]:
infile = open(filename, "r")
print(infile.readline())

In [None]:
print(infile.readline())

In [None]:
print(infile.readline())
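
Two details of reading line by line are worth seeing in isolation: `readline()` returns an empty string once the end of the file is reached, and you can also loop over the file object itself to get one line at a time. The following sketch uses a made-up two-line file rather than `charlie.txt`.

```python
import os
import tempfile

# Create a small throwaway file (hypothetical content) to experiment with.
folder = tempfile.mkdtemp()
demo_path = os.path.join(folder, "readline_demo.txt")
demo_file = open(demo_path, "w")
demo_file.write("first\nsecond\n")
demo_file.close()

demo_file = open(demo_path, "r")
line_1 = demo_file.readline()  # the first line, including the newline character
line_2 = demo_file.readline()  # the second line
line_3 = demo_file.readline()  # an empty string signals the end of the file
print(repr(line_1), repr(line_2), repr(line_3))
demo_file.close()

# Looping over the file object also reads it line by line.
demo_file = open(demo_path, "r")
stripped_lines = []
for demo_line in demo_file:
    stripped_lines.append(demo_line.rstrip("\n"))  # remove the trailing newline
demo_file.close()
print(stripped_lines)
```

Because each line still ends with a newline character, printing it with `print()` produces an extra empty line; `rstrip("\n")` is one way to avoid that.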

After reading the contents of a file, the `TextIOWrapper` no longer needs to be open since we have stored the content as a variable. In fact, it is good practice to close the file as soon as you do not need it anymore. Now, lo and behold, we can achieve that with the following:

In [None]:
infile.close()
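
Because it is easy to forget to close a file, Python also offers the `with` statement, which closes the file automatically as soon as the indented block ends. This is a common alternative to calling `close()` yourself. The sketch below uses a made-up file in a temporary folder.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
demo_path = os.path.join(folder, "with_demo.txt")  # hypothetical file name

# The file is closed automatically when the indented block ends.
with open(demo_path, "w") as demo_out:
    demo_out.write("Hello from a with-block\n")

with open(demo_path, "r") as demo_in:
    demo_content = demo_in.read()

print(demo_content)
print(demo_in.closed)  # the file object knows that it has been closed
```

The rest of this topic keeps using `open()` and `close()` explicitly, but both styles achieve the same result.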

### Manipulating the content of text files: NLTK toolkit
Last week, we did several exercises on manipulating strings. Let's recap. We learned that some of the most common preprocessing steps are casefolding/lowercasing, punctuation removal and stemming/lemmatization. Did you know that there are some very useful NLP toolkits and modules to do some of these steps? One of these toolkits is NLTK (the Natural Language Toolkit). You can simply import this toolkit by running:

In [None]:
import nltk

#### Tokenization and sentence splitting
Amongst other things, the NLTK toolkit allows you to tokenize texts with the function `nltk.word_tokenize()`. To be able to use this function, we first need to download the NLTK Tokenizer Models. Run the following command to open an installation window. Go to the `Models` tab and select `punkt` from under the `Identifier` column. Then click `Download` and it will install the necessary files. Also download `averaged_perceptron_tagger` from the `Models` tab, and `wordnet` from the `Corpora` tab; we will use these later. Close the installation window once you are done.

In [None]:
nltk.download()

Now, let's try tokenizing our Charlie story!

In [None]:
tokens = nltk.word_tokenize(content)
print(tokens)

As you can see, we now have a list of all words in the text. The punctuation marks are also in the list, but as separate tokens. Another thing that NLTK can do for you is to split a text into sentences:

In [None]:
sentences = nltk.sent_tokenize(content)
print(sentences)

#### POS tagging
We can now do all sorts of cool things with these lists. For example, we can search for all words that have certain letters in them and add them to a list. Let's say we want to find all present participles in the text. We know that present participles end with *-ing*, so we can do something like this:

In [None]:
present_participles = []
for token in tokens:
    if token.endswith("ing"):
        present_participles.append(token)
print(present_participles)

This looks good! We now have a list of words like *boiling*, *sizzling*, etc. But wait... Oops, there is one word in the list that actually is not a present participle! Of course, other words can end with *-ing* as well. So if we want to find all present participles, we have to come up with a smarter solution. Once again, NLTK comes to the rescue with its Part of Speech (POS) tagger:

In [None]:
pos_tagged = nltk.pos_tag(tokens)
print(pos_tagged)

We now have a list of tuples. The first element of the tuple is the token, the second element indicates the part of speech of the token. This POS tagger uses the POS tag set of the Penn Treebank Project, which can be found [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). In this tag set, the `VBG` tag is used for present participles and gerunds. Now let's try to make a list of all present participles in `charlie.txt` using the POS tags:

In [None]:
# Finish the following code:
present_participles = []
for token in pos_tagged:
    if token[1] == "VBG":
        # append the word to the list
print(present_participles)

You should get the following list: ['boiling', 'bubbling', 'hissing', 'sizzling', 'clanking', 'running', 'hopping', 'knowing', 'rubbing', 'cackling', 'going']

Now finish the following code to get *all* inflected verbs. We already provided you with the tags for the inflected verb forms; the tag `VB` (base form) is not in the list.

In [None]:
# Finish the following code:
verb_tags = ["VBD", "VBG", "VBN", "VBP", "VBZ"]
verbs = []
# Use a for-loop! 

print(verbs)      

#### Lemmatization
We now have a list of all inflected forms of the verbs. We can also use NLTK to lemmatize words. We will use the WordNetLemmatizer for this. In the code below, we loop through the list of verbs, lemmatize each of the verbs, and add them to a new list called `verb_lemmas`.

In [None]:
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
verb_lemmas = []
for verb in verbs:
    # For this lemmatizer, we need to indicate the POS of the word (in this case, v = verb)
    lemma = lmtzr.lemmatize(verb, "v")
    verb_lemmas.append(lemma)
print(verb_lemmas)

The resulting list contains a lot of duplicates. Do you remember how you can get rid of these duplicates? Create a new list in which each verb occurs only once (give the list a different name than `verb_lemmas`):

Now use a for-loop to count the number of times that each of these verb lemmas occurs in the text! For each verb in the list you just created, get the count of this verb in `charlie.txt` using the `count()` method. Create a dictionary that contains the lemmas of the verbs as keys, and the counts of these verbs as values.  


In [None]:
verb_counts = {}
# Write your for-loop here

print(verb_counts)

# The following test should print True if your code is correct 
print(verb_counts["bubble"] == 1 and verb_counts["be"] == 9)
    

### Writing files

So far, we have seen how to open a file, and how to read and manipulate its content. But you can also use Python to write files. Let's first slightly adapt our Charlie story by replacing the names in the text:

In [None]:
your_name = "" #type in your name 
friends_name = "" #type in the name of a friend 
new_content = content.replace("Charlie Bucket", your_name)
content = new_content.replace("Mr Wonka", friends_name)

Now, we can open a new file and write the text to this file as follows:

In [None]:
filename = "./texts/charlie_new.txt"
outfile = open(filename, "w")
outfile.write(content)
outfile.close()

Open the file in the folder 'texts' in any text editor and read a personalized version of the story!

Let's try something else. Remember that we have a list of verb lemmas that occurred in `charlie.txt`. We want to write these lemmas to a file, with each lemma on a separate line. What do you think will happen if you run the following code?

In [None]:
filename = "./texts/charlie_verbs.txt"
outfile = open(filename, "w")
outfile.write(verb_lemmas)
outfile.close()

Do you understand why you get this error?

... exactly, you can only write strings to files. If you want to write the verbs that are stored within the list `verb_lemmas` to a file, you'll need to use a for-loop or create one string from the list using the `join()`-method. Investigate the output of the following code in `charlie_verbs.txt` and try to understand the differences between the three writing methods.

In [None]:
filename = "./texts/charlie_verbs.txt"
outfile = open(filename, "w")

# Writing method 1
for verb in verb_lemmas:
    outfile.write(verb)
outfile.write("\n\n")

# Writing method 2 
for verb in verb_lemmas:
    s = verb + "\n"
    outfile.write(s)
outfile.write("\n\n")

# Writing method 3
s = " ".join(verb_lemmas)
outfile.write(s)
    
outfile.close()
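
A further variant, shown here only as a sketch, joins the items with a newline character instead of a space, so that each item ends up on its own line with a single call to `write()`. The list `demo_verbs` is a made-up stand-in for `verb_lemmas`, and the file is written to a temporary folder.

```python
import os
import tempfile

demo_verbs = ["boil", "bubble", "hiss"]  # hypothetical stand-in for verb_lemmas

folder = tempfile.mkdtemp()
demo_path = os.path.join(folder, "verbs_demo.txt")

demo_out = open(demo_path, "w")
# join() puts a newline between the items; we add one more at the very end
demo_out.write("\n".join(demo_verbs) + "\n")
demo_out.close()

demo_in = open(demo_path, "r")
# splitlines() splits the text on newlines and drops the newline characters
verbs_read_back = demo_in.read().splitlines()
demo_in.close()

print(verbs_read_back)
```

Reading the file back with `splitlines()` returns the same list that was written, which is a quick way to check the result.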

#### Optional exercise:
Are you or is your friend female? Then you are probably not happy with the personal pronouns used in the personalized version of the story. Can you think of a way to change the pronouns as well? Remember: we stored the paragraphs in the variable `lines` when we used the `readlines()` operation. We already made a start with the code:

In [None]:
# Replace the pronouns in the text
your_gender = "male"
friends_gender = "female"

if your_gender == "female":
    # replace the pronouns in the first paragraph
    pass  # placeholder: remove this line once you have added your code

if friends_gender == "female":
    # replace the pronouns in the second paragraph
    pass  # placeholder: remove this line once you have added your code

new_content = "".join(lines)

# Write the content to a new file
filename = "./texts/charlie_new.txt"
outfile = open(filename, "w")
outfile.write(new_content)
outfile.close()

## CSV and TSV (tables)
Now let's move on to structured data. There are several data structures, the most common of which is probably the *table*. A table represents a set of data points as a series of rows, with a column for each of the data points' properties. Tabular data can be encoded as CSV (comma-separated values) or TSV (tab-separated values).

CSV and TSV files are simply plain text files in which each line represents a row and, within each line, a comma (for CSV) or a tab character (for TSV) separates the cells in the row (the columns). We can read and write CSV and TSV files in a similar way as we have seen with plain text files, but we use a module built into Python that simplifies the parsing of CSV and TSV files. The module is called `csv` and we can import it as follows:

In [None]:
import csv

Let's take a look at an example. The folder `texts` contains a CSV file with the name `debate.csv`. This file contains transcripts of the 2016 (vice-)presidential debates from 26 September to 9 October. If you'd like, you can open this file in a text editor or Excel (convert text to columns by using the comma as delimiter) to see its content. Using the `csv` module, we can open and read the file as follows in Python:

In [None]:
filename = "./texts/debate.csv"
csvfile = open(filename, "r", encoding="utf8")
csvreader = csv.reader(csvfile, delimiter=",")
print(csvreader)

As you can see from the output, `csv.reader()` creates a reader object that we assigned to the variable `csvreader`. A reader object lets you iterate over the rows in the CSV file. In a way, it's similar to a list that contains all rows, where each row is itself a list containing the properties of the data point (the columns). We can make this explicit by converting `csvreader` to a list and assigning it to a new variable `rows`:

In [None]:
rows = list(csvreader)
for row in rows:
    print(row)
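
Reading a TSV file works the same way; only the delimiter changes. The sketch below uses a small made-up table stored in a string, and `io.StringIO` to let Python treat that string as if it were an opened file. With a real TSV file you would pass the file object returned by `open()` instead.

```python
import csv
import io

# Hypothetical tab-separated data; io.StringIO lets us treat a string like an open file.
demo_tsv = "id\tspeaker\ttext\n1\tA\tHello, everyone\n2\tB\tGood evening\n"
demo_tsvfile = io.StringIO(demo_tsv)

# The only difference with CSV is the delimiter: a tab character.
demo_reader = csv.reader(demo_tsvfile, delimiter="\t")
demo_rows = list(demo_reader)
demo_tsvfile.close()

for demo_row in demo_rows:
    print(demo_row)
```

Note that the comma in `Hello, everyone` stays inside one cell, because the comma has no special meaning when the delimiter is a tab.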

Now that we have the data stored in Python as the variable `rows`, we can close the CSV file:

In [None]:
csvfile.close()

We can now access and manipulate each of (the properties of) the data points in the CSV file. For example, we can select only the first 3 rows:

In [None]:
print(rows[0:3])

Or print only those rows where Trump is the speaker:

In [None]:
for row in rows:
    speaker = row[1]
    if speaker == "Trump":
        print(row)

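You can also combine conditions, for example to print only the text of those rows where Trump is the speaker and the text mentions Obama:

In [None]: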
for row in rows:
    speaker = row[1]
    text = row[2]
    if speaker == "Trump" and "Obama" in text:
        print(text)
        print("\n")

Now write code that gets all rows from the debate on the 9th of October 2016 whose speaker is neither Trump nor Clinton. Print both the speaker and the text.

In [None]:
for row in rows:
    speaker = row[1]
    text = row[2]
    date = row[3]
    if date == "2016-10-09" and speaker != "Trump" and speaker != "Clinton":
        print(speaker, text)
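
The `csv` module can also write rows back to a file, which is useful for saving a selection like the ones above. The following is a minimal sketch under a few assumptions: the rows, the speaker names, the header names and the output file name are all made up, and the columns are only arranged in the same order as the indices used above (speaker at index 1, text at index 2, date at index 3). It also shows `csv.DictReader`, which uses the header row so that you can refer to columns by name instead of by index.

```python
import csv
import os
import tempfile

# Hypothetical rows: a line number, a speaker, a text and a date.
demo_rows = [
    ["1", "Speaker A", "Hello, everyone.", "2016-10-09"],
    ["2", "Speaker B", "Good evening.", "2016-10-09"],
    ["3", "Speaker A", "Thank you.", "2016-09-26"],
]

# Keep only the rows of one speaker.
selected_rows = []
for demo_row in demo_rows:
    if demo_row[1] == "Speaker A":
        selected_rows.append(demo_row)

folder = tempfile.mkdtemp()
demo_path = os.path.join(folder, "selection_demo.csv")

# The csv documentation recommends newline="" for files used with this module.
demo_out = open(demo_path, "w", encoding="utf8", newline="")
demo_writer = csv.writer(demo_out, delimiter=",")
demo_writer.writerow(["line", "speaker", "text", "date"])  # header row (hypothetical names)
demo_writer.writerows(selected_rows)
demo_out.close()

# csv.DictReader uses the header row to turn every row into a dictionary.
demo_in = open(demo_path, "r", encoding="utf8", newline="")
dict_rows = list(csv.DictReader(demo_in))
demo_in.close()

for dict_row in dict_rows:
    print(dict_row["speaker"], dict_row["text"])
```

The writer puts quotes around cells that contain a comma, such as `Hello, everyone.`, so the file can be read back without the cell being split in two.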

## JSON
More information will follow

## XML
More information will follow

## Exercise?
- use os.listdir
- make function to read file