# Topic 2 - Diving into unstructured and structured data

This week, we will talk about working with unstructured and structured data in Python. So what is the difference between them? Structured data is information with a high degree of organization, which can easily be ordered and processed by machines. You can compare it with a perfectly organized filing cabinet where everything is identified, labeled and easy to access. Unstructured data, by contrast, has no pre-defined organization; plain text is a typical example. This week covers the following formats:

- plain text
- CSV and TSV
- JSON
- XML

## Working with files: plain text

When doing text analysis, you will often work with files that contain plain text. These files typically end with the `.txt` extension. In Python, you can read the content of a file, store it as the type of object that you need (string, list, etc.) and manipulate it (e.g. replacing or removing words). You can also write new content to an existing or a new file.

### Reading (and closing) files
Let's start with opening a file in Python. To do this, we need to associate the file on disk with a variable in Python. First, we tell Python where the file is stored on your disk. The location of your file is often referred to as the file path. Python will start looking in the 'working' or 'current' directory (which often will be where your Python script is). If it's in the working directory, you only have to tell Python the name of the file (e.g. `colors.txt`). If it's not in the working directory, you have to tell Python the exact path to your file. We will create a string variable to store this information:

In [None]:
filename = "./texts/charlie.txt"  

Note the single dot in the beginning of the file path; this means 'the current directory'. When writing a file path, you can use the following:
- `/` means the root of the current drive;
- `./` means the current directory;
- `../` means the parent of the current directory.
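
To see how these prefixes resolve, here is a small sketch that uses the standard library's `pathlib` module. The path `./texts/charlie.txt` is the one used above; the absolute paths that get printed depend on where you run the code, so no output is shown.

```python
from pathlib import Path

# The current working directory: where Python starts looking for relative paths.
cwd = Path.cwd()
print(cwd)

# "./" is resolved against the working directory.
relative = Path("./texts/charlie.txt")
absolute = relative.resolve()
print(absolute)

# "../" points at the parent of the working directory.
parent = Path("..").resolve()
print(parent)
```

Note that `resolve()` only computes the full path; it does not check whether the file actually exists.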


Now that Python knows where your file is stored, we can open the file by using the built-in function `open()`:

In [None]:
infile = open(filename, "r")

The `open()` function requires the file path as its first argument. The second argument specifies the *mode* in which the file is opened. The mode you choose depends on what you wish to do with the file. Here are some of the available modes:

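- 'r' : use for reading (the file must already exist)
- 'w' : use for writing (creates the file, or overwrites its content if it already exists)
- 'x' : use for creating and writing to a new file (fails if the file already exists)
- 'a' : use for appending to the end of a file
- 'r+' : use for reading and writing to the same file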

Let's now print `infile`. What do you think will happen?

In [None]:
print(infile)

"Hey! That's not what I expected to happen!", you might think. Python is not printing the contents of the file but only some mysterious mention of some `TextIOWrapper`. This `TextIOWrapper` thing is Python's way of saying it has *opened* a connection to the file `charlie.txt`. In order to *read* the contents of the file we must add the function `read()` as follows:

In [None]:
content = infile.read()
print(content)

The variable `content` now holds the contents of the file `charlie.txt`, and we can access and manipulate it just like any other string. After we have read the contents of a file, the `TextIOWrapper` no longer needs to be open. In fact, it is good practice to close it as soon as you no longer need it. We can do that as follows:

In [None]:
infile.close()
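
Forgetting to call `close()` is an easy mistake to make. A common alternative is the `with` statement, which closes the file automatically when the indented block ends. The sketch below is self-contained: it first writes a small throwaway file (the name `example.txt`, its text and the variable names are made up for this illustration) and then reads it back.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on charlie.txt.
demo_path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(demo_path, "w") as demo_out:
    demo_out.write("Hello, Charlie!")

# The file is open only inside the indented block.
with open(demo_path, "r") as demo_in:
    demo_text = demo_in.read()

# Outside the block, Python has already closed the file for us.
print(demo_text)
print(demo_in.closed)
```

The second `print` shows `True`, which confirms that the file was closed without an explicit call to `close()`.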

### Manipulating the content of text files: NLTK toolkit
Last week, we did several exercises on manipulating strings. Let's recap. We learned that some of the most common preprocessing steps are casefolding/lowercasing, punctuation removal and stemming/lemmatization. Did you know that there are some very useful NLP toolkits and modules for these steps? One of these toolkits is NLTK (the Natural Language Toolkit). You can import it by running:

In [None]:
import nltk

#### Tokenization and sentence splitting
Amongst other things, the NLTK toolkit allows you to tokenize texts with the function `nltk.word_tokenize()`. To be able to use this function, we first need to download the NLTK Tokenizer Models. Run the following command to open an installation window. Go to the `Models` tab and select `punkt` from under the `Identifier` column. Then click `Download` and it will install the necessary files. Also download `averaged_perceptron_tagger` from the `Models` tab, and `wordnet` from the `Corpora` tab; we will use these later. Close the installation window once you are done.

In [None]:
nltk.download()

Now, let's try tokenizing our Charlie story!

In [None]:
tokens = nltk.word_tokenize(content)
print(tokens)

As you can see, we now have a list of all words in the text. The punctuation marks are also in the list, but as separate tokens. Another thing that NLTK can do for you is to split a text into sentences:

In [None]:
sentences = nltk.sent_tokenize(content)
print(sentences)

#### POS tagging
We can now do all sorts of cool things with these lists. For example, we can search for all words that have certain letters in them and add them to a list. Let's say we want to find all present participles in the text. We know that present participles end with *-ing*, so we can do something like this:

In [None]:
present_participles = []
for token in tokens:
    if token.endswith("ing"):
        present_participles.append(token)
print(present_participles)

This looks good! We now have a list of words like *boiling*, *sizzling*, etc. But wait... Oops, there is one word in the list that is not actually a present participle! Of course, other words can also end with *-ing*. So if we want to find all present participles, we have to come up with a smarter solution. Once again, NLTK comes to the rescue with its Part of Speech (POS) tagger:

In [None]:
pos_tagged = nltk.pos_tag(tokens)
print(pos_tagged)

We now have a list of tuples. The first element of the tuple is the token, the second element indicates the part of speech of the token. This POS tagger uses the POS tag set of the Penn Treebank Project, which can be found [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). In this tag set, the `VBG` tag is used for present participles and gerunds.
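
To get a feel for this list-of-tuples shape before working with the real output, here is a small sketch on a hand-written list. The sentence and its tags are made up for illustration and were not produced by the tagger; the example filters on the `NN` (singular noun) tag.

```python
# A hand-written list in the same shape as the output of nltk.pos_tag().
example_tagged = [
    ("The", "DT"), ("cat", "NN"), ("is", "VBZ"), ("sleeping", "VBG"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN"),
]

# Indexing: position 0 is the token, position 1 is the tag.
first_pair = example_tagged[0]
print(first_pair[0], first_pair[1])

# Tuple unpacking gives both elements a name in one step.
nouns = []
for word, tag in example_tagged:
    if tag == "NN":
        nouns.append(word)
print(nouns)
```

Both styles work: `token[0]` and `token[1]` are used in the cells below, while unpacking into `word, tag` is often easier to read.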

In [None]:
# Finish the following code:
present_participles = []
for token in pos_tagged:
    # if the POS (second element) is "VBG"...
        # append the word (first element) to the list
print(present_participles)

You should get the following list: ['boiling', 'bubbling', 'hissing', 'sizzling', 'clanking', 'running', 'hopping', 'knowing', 'rubbing', 'cackling', 'going']

The following code collects *all* verbs in the same way, using a list of the Penn Treebank verb tags.

In [None]:
# Penn Treebank verb tags, including "VB" for the base form
verb_tags = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]
verbs = []
for token in pos_tagged:
    # token is a (word, tag) tuple
    if token[1] in verb_tags:
        verbs.append(token[0])
print(verbs)

#### Lemmatization
We now have a list of all inflected forms of the verbs. We can also use NLTK to lemmatize words. We will use the WordNetLemmatizer for this. In the code below, we loop through the list of verbs, lemmatize each of the verbs, and add them to a new list called `verb_lemmas`.

In [None]:
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
verb_lemmas = []
for verb in verbs:
    # "v" tells the lemmatizer to treat the word as a verb
    lemma = lmtzr.lemmatize(verb, "v")
    verb_lemmas.append(lemma)
print(verb_lemmas)

The resulting list contains a lot of duplicates. Do you remember how you can get rid of these duplicates? Create a new list called `unique_verbs` in which each verb only occurs once:

In [None]:
# Create the list unique_verbs here

Now loop over each verb in `unique_verbs` and, for each verb, print the lemma and the count of this verb in `charlie.txt`:

In [None]:
for verb in unique_verbs:
    # print the lemma and the count of the verb in charlie.txt

### Writing files
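
As a minimal sketch of how the modes listed earlier behave when writing, the code below works on a throwaway file in a temporary directory. The file name `demo.txt`, its contents and the variable names are made up for illustration; `write()` is the counterpart of `read()`.

```python
import os
import tempfile

# Work in a temporary directory so no real files are touched.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt")

# 'w' creates the file, or empties it if it already exists.
outfile = open(demo_path, "w")
outfile.write("first line\n")
outfile.close()

# 'a' keeps the existing content and adds to the end.
outfile = open(demo_path, "a")
outfile.write("second line\n")
outfile.close()

# 'r' opens the file for reading only.
demo_in = open(demo_path, "r")
demo_content = demo_in.read()
demo_in.close()
print(demo_content)

# 'x' refuses to open a file that already exists.
x_failed = False
try:
    open(demo_path, "x")
except FileExistsError:
    x_failed = True
    print("'x' failed: the file already exists")
```

Note that `write()` does not add a line break by itself, which is why each string ends with `\n`.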



## CSV and TSV (tables)
Now let's move on to structured data. Structured data comes in several forms, the most common of which is probably the table. A table represents a set of data points as a series of rows, with a column for each of the data points' properties. Tabular data can be encoded as CSV (comma-separated values) or TSV (tab-separated values).

CSV and TSV files are simply plain text files in which each line represents a row and, within each line, a comma (for CSV) or a tab character (for TSV) separates the cells in the row (the columns). 
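
The sketch below illustrates this layout with Python's built-in `csv` module. The table contents are made up, and `io.StringIO` is used so that the example runs without any file on disk; with a real file you would pass the opened file object to `csv.reader()` instead.

```python
import csv
import io

# Made-up table: a header row followed by two data rows.
csv_text = "name,colour\nCharlie,brown\nLola,white\n"
tsv_text = "name\tcolour\nCharlie\tbrown\nLola\twhite\n"

# io.StringIO lets us treat a string like an open file.
# csv.reader splits each line into a list of cells.
csv_rows = list(csv.reader(io.StringIO(csv_text)))

# For TSV, the only change is the delimiter.
tsv_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

print(csv_rows)
print(tsv_rows)
```

Both calls produce the same list of rows, which shows that CSV and TSV differ only in the character that separates the cells.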




## JSON
More information will follow.

## XML
More information will follow.