# Topic 2 - Diving into unstructured and structured data

This week, we will talk about working with unstructured and structured data in Python. So what is the difference between them? Structured data is information with a high degree of organization, which can easily be ordered and processed by machines. You can compare it with a perfectly organized filing cabinet where everything is identified, labeled and easy to access. Unstructured data, however, is not organized in a pre-defined manner and therefore lacks structure. We will look at the following formats:

- plain text
- CSV and TSV
- JSON
- XML

## Working with files: plain text

When doing text analysis, you will often work with files that contain plain text. These files typically end with the `.txt` extension. In Python, you can read the content of a file, store it as the type of object that you need (string, list, etc.) and manipulate it (e.g. replacing or removing words). You can also write new content to an existing or a new file.

### Reading files
Let's start with opening a file in Python. To do this, we need to associate the file on disk with a variable in Python. First, we tell Python where the file is stored on your disk. The location of your file is often referred to as the file path. Python will start looking in the 'working' or 'current' directory (which often will be where your Python script is). If it's in the working directory, you only have to tell Python the name of the file (e.g. `colors.txt`). If it's not in the working directory, you have to tell Python the exact path to your file. We will create a string variable to store this information:

In [5]:
filename = "./texts/charlie.txt"  

Note the single dot at the beginning of the file path; this means 'the current directory'. When writing a file path, you can use the following:
- /     means the root of the current drive; 
- ./    means the current directory;
- ../   means the parent of the current directory.  	 
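
If you are unsure which directory Python treats as the working directory, you can simply ask it. The following is a minimal sketch using the standard `os` module; it reuses the relative path from the example above and does not require the file to exist.

```python
import os

# The directory that relative paths are resolved against.
working_directory = os.getcwd()
print(working_directory)

# Turn the relative path into an absolute one to see where Python will look.
filename = "./texts/charlie.txt"
absolute_path = os.path.abspath(filename)
print(absolute_path)
```

The printed paths will differ from computer to computer, since they depend on where you started Python.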


Now that Python knows where your file is stored, we can open the file by using the built-in function `open()`:

In [6]:
infile = open(filename, "r")

The `open()` function requires the file path as its first argument. The second argument specifies the *mode* in which the file is opened. The mode you choose will depend on what you wish to do with the file. Here are some of the mode options:

- 'r' : use for reading
- 'w' : use for writing
- 'x' : use for creating and writing to a new file
- 'a' : use for appending to a file
- 'r+' : use for reading and writing to the same file
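
To make the difference between some of these modes concrete, here is a minimal sketch that works in a temporary folder so that none of your own files are touched. The file name `colors.txt` and the variable names are chosen for illustration only; `write()`, `read()` and `close()` are standard methods of file objects.

```python
import os
import tempfile

# Work in a temporary folder so no existing files are touched.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "colors.txt")  # hypothetical file name

# 'w' creates the file (or empties it if it already exists) and writes to it.
handle = open(path, "w")
handle.write("red\n")
handle.close()

# 'a' keeps the existing content and adds the new text at the end.
handle = open(path, "a")
handle.write("blue\n")
handle.close()

# 'r' opens the file for reading only.
handle = open(path, "r")
colors = handle.read()
handle.close()
print(colors)

# 'x' refuses to overwrite: it raises an error if the file already exists.
try:
    open(path, "x")
except FileExistsError:
    print("colors.txt already exists")
```

Be careful with `'w'`: it silently discards whatever was in the file before.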

Let's now print `infile`. What do you think will happen?

In [7]:
print(infile)

<_io.TextIOWrapper name='./texts/charlie.txt' mode='r' encoding='UTF-8'>


"Hey! That's not what I expected to happen!", you might think. Python is not printing the contents of the file but only some mysterious mention of some `TextIOWrapper`. This `TextIOWrapper` thing is Python's way of saying it has *opened* a connection to the file `charlie.txt`. In order to *read* the contents of the file we must call the method `read()` on it as follows:

In [8]:
content = infile.read()
print(content)

Charlie Bucket stared around the gigantic room in which he now found himself. The place was like a witch’s kitchen! All about him black metal pots were boiling and bubbling on huge stoves, and kettles were hissing and pans were sizzling, and strange iron machines were clanking and spluttering, and there were pipes running all over the ceiling and walls, and the whole place was filled with smoke and steam and delicious rich smells. 

Mr Wonka himself had suddenly become even more excited than usual, and anyone could see that this was the room he loved best of all. He was hopping about among the saucepans and the machines like a child among his Christmas presents, not knowing which thing to look at first. He lifted the lid from a huge pot and took a sniff; then he rushed over and dipped a finger into a barrel of sticky yellow stuff and had a taste; then he skipped across to one of the machines and turned half a dozen knobs this way and that; then he peered anxiously through the glass doo

The variable `content` now holds the contents of the file `charlie.txt` and we can access and manipulate it just like any other string. After we have read the contents of a file, the `TextIOWrapper` no longer needs to be open. In fact, it is good practice to close it as soon as you do not need it anymore. We can do that as follows:

In [9]:
infile.close()
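
As an alternative to calling `close()` yourself, Python offers the `with` statement, which closes the file automatically once the indented block has finished. The sketch below first creates a small throwaway file so that it can run on its own; the file name `example.txt` is an assumption made for illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example runs on its own.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")  # hypothetical file name
with open(path, "w", encoding="utf-8") as outfile:
    outfile.write("Charlie Bucket stared around the gigantic room.")

# The file is closed automatically when the indented block ends.
with open(path, "r", encoding="utf-8") as infile:
    text = infile.read()

print(text)
print(infile.closed)  # True
```

Because the `with` statement closes the file for you, even when an error occurs inside the block, many Python programmers prefer it over closing files by hand.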

Last week, we did several exercises with manipulating strings. Let's recap. We have learned that some of the most common preprocessing steps are casefolding/lowercasing, punctuation removal and stemming/lemmatization. Did you know that there are some very useful NLP toolkits and modules to do some of these steps? One of these toolkits is NLTK (the Natural Language Toolkit). You can simply import this toolkit by running:

In [10]:
import nltk

Amongst other things, the NLTK toolkit allows you to tokenize texts with the function `nltk.word_tokenize()`. To be able to use this function, we first need to download the NLTK Tokenizer Models. Run the following command to open an installation window. Go to the `Models` tab and select `punkt` from under the `Identifier` column. Also select `averaged_perceptron_tagger`; we will use that later. Then click `Download` and it will install the necessary files. Close the installation window once you are done.

In [18]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

Now, let's try tokenizing our Charlie story!

In [11]:
tokens = nltk.word_tokenize(content)
print(tokens)

['Charlie', 'Bucket', 'stared', 'around', 'the', 'gigantic', 'room', 'in', 'which', 'he', 'now', 'found', 'himself', '.', 'The', 'place', 'was', 'like', 'a', 'witch’s', 'kitchen', '!', 'All', 'about', 'him', 'black', 'metal', 'pots', 'were', 'boiling', 'and', 'bubbling', 'on', 'huge', 'stoves', ',', 'and', 'kettles', 'were', 'hissing', 'and', 'pans', 'were', 'sizzling', ',', 'and', 'strange', 'iron', 'machines', 'were', 'clanking', 'and', 'spluttering', ',', 'and', 'there', 'were', 'pipes', 'running', 'all', 'over', 'the', 'ceiling', 'and', 'walls', ',', 'and', 'the', 'whole', 'place', 'was', 'filled', 'with', 'smoke', 'and', 'steam', 'and', 'delicious', 'rich', 'smells', '.', 'Mr', 'Wonka', 'himself', 'had', 'suddenly', 'become', 'even', 'more', 'excited', 'than', 'usual', ',', 'and', 'anyone', 'could', 'see', 'that', 'this', 'was', 'the', 'room', 'he', 'loved', 'best', 'of', 'all', '.', 'He', 'was', 'hopping', 'about', 'among', 'the', 'saucepans', 'and', 'the', 'machines', 'like', 'a

As you can see, we now have a list of all words in the text. The punctuation marks are also in the list, but as separate tokens. Another thing that NLTK can do for you is to split a text into sentences:

In [12]:
sentences = nltk.sent_tokenize(content)
print(sentences)

['Charlie Bucket stared around the gigantic room in which he now found himself.', 'The place was like a witch’s kitchen!', 'All about him black metal pots were boiling and bubbling on huge stoves, and kettles were hissing and pans were sizzling, and strange iron machines were clanking and spluttering, and there were pipes running all over the ceiling and walls, and the whole place was filled with smoke and steam and delicious rich smells.', 'Mr Wonka himself had suddenly become even more excited than usual, and anyone could see that this was the room he loved best of all.', 'He was hopping about among the saucepans and the machines like a child among his Christmas presents, not knowing which thing to look at first.', 'He lifted the lid from a huge pot and took a sniff; then he rushed over and dipped a finger into a barrel of sticky yellow stuff and had a taste; then he skipped across to one of the machines and turned half a dozen knobs this way and that; then he peered anxiously throug

We can now do all sorts of cool things with these lists. For example, we can search for all words that have certain letters in them and add them to a list. Let's say we want to find all present participles in the text. We know that present participles end with *-ing*, so we can do something like this:

In [13]:
present_participles = []
for token in tokens:
    if token.endswith("ing"):
        present_participles.append(token)
print(present_participles)

['boiling', 'bubbling', 'hissing', 'sizzling', 'clanking', 'spluttering', 'running', 'ceiling', 'hopping', 'knowing', 'thing', 'rubbing', 'cackling', 'going']


This looks good! We now have a list of words like *boiling*, *sizzling*, etc. But wait... Oops, some words in the list, such as *ceiling* and *thing*, are not present participles at all! Of course, other words can end with *-ing* too. So if we want to find all present participles, we have to come up with a smarter solution. Once again, NLTK comes to the rescue with its Part of Speech (POS) tagger:

In [14]:
pos_tagged = nltk.pos_tag(tokens)
print(pos_tagged)

[('Charlie', 'NNP'), ('Bucket', 'NNP'), ('stared', 'VBD'), ('around', 'IN'), ('the', 'DT'), ('gigantic', 'JJ'), ('room', 'NN'), ('in', 'IN'), ('which', 'WDT'), ('he', 'PRP'), ('now', 'RB'), ('found', 'VBD'), ('himself', 'PRP'), ('.', '.'), ('The', 'DT'), ('place', 'NN'), ('was', 'VBD'), ('like', 'IN'), ('a', 'DT'), ('witch’s', 'JJ'), ('kitchen', 'NN'), ('!', '.'), ('All', 'DT'), ('about', 'IN'), ('him', 'PRP'), ('black', 'JJ'), ('metal', 'NN'), ('pots', 'NNS'), ('were', 'VBD'), ('boiling', 'VBG'), ('and', 'CC'), ('bubbling', 'VBG'), ('on', 'IN'), ('huge', 'JJ'), ('stoves', 'NNS'), (',', ','), ('and', 'CC'), ('kettles', 'NNS'), ('were', 'VBD'), ('hissing', 'VBG'), ('and', 'CC'), ('pans', 'NNS'), ('were', 'VBD'), ('sizzling', 'VBG'), (',', ','), ('and', 'CC'), ('strange', 'JJ'), ('iron', 'NN'), ('machines', 'NNS'), ('were', 'VBD'), ('clanking', 'VBG'), ('and', 'CC'), ('spluttering', 'NN'), (',', ','), ('and', 'CC'), ('there', 'EX'), ('were', 'VBD'), ('pipes', 'NNS'), ('running', 'VBG'), 

We now have a list of tuples. The first element of the tuple is the token, the second element indicates the part of speech of the token. In this tagset, the `VBG` tag is used for present participles.
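
If you have not worked with tuples much, the short sketch below shows two common ways of getting at their elements. The three pairs are copied by hand from the tagger output above, so the example runs without NLTK; the variable names are chosen for illustration.

```python
# A hand-copied sample of (token, tag) tuples from the tagger output.
sample = [("Charlie", "NNP"), ("stared", "VBD"), ("boiling", "VBG")]

# Indexing: position 0 holds the token, position 1 holds the tag.
first_pair = sample[0]
print(first_pair[0])
print(first_pair[1])

# Unpacking: give both elements a name directly in the for loop.
for word, tag in sample:
    print(word, "is tagged as", tag)
```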

In [16]:
# Finish the following code:
present_participles = []
for token in pos_tagged:
    # if the PoS (second element) is "VBG"...
        # append the word (first element) to the list
print(present_participles)

['boiling', 'bubbling', 'hissing', 'sizzling', 'clanking', 'running', 'hopping', 'knowing', 'rubbing', 'cackling', 'going']


You should get the following list: ['boiling', 'bubbling', 'hissing', 'sizzling', 'clanking', 'running', 'hopping', 'knowing', 'rubbing', 'cackling', 'going']
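
If you get stuck, here is one possible way to finish the exercise; try it yourself first. To keep the sketch self-contained it uses a short hand-copied sample of the tagger output under the assumed name `pos_tagged_sample`; in the notebook you would loop over the full `pos_tagged` list instead.

```python
# A hand-copied sample of the tagger output; in the notebook, use the
# full `pos_tagged` list instead.
pos_tagged_sample = [
    ("pots", "NNS"), ("were", "VBD"), ("boiling", "VBG"),
    ("and", "CC"), ("bubbling", "VBG"), ("spluttering", "NN"),
]

present_participles = []
for word, tag in pos_tagged_sample:
    # Keep the word only if the tagger labelled it as a present participle.
    if tag == "VBG":
        present_participles.append(word)

print(present_participles)  # ['boiling', 'bubbling']
```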


## CSV and TSV (tables)
Now let's move on to structured data. Structured data comes in several forms, the most common of which is probably the table. A table represents a set of data points as a series of rows, with a column for each of the data points' properties. Tabular data can be encoded as CSV (comma-separated values) or TSV (tab-separated values).

CSV and TSV files are simply plain text files in which each line represents a row and, within each line, a comma (for CSV) or a tab character (for TSV) separates the cells in the row (the columns). 
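
As a first impression of how such files can be read, the sketch below uses Python's built-in `csv` module. The table contents are made up for illustration, and `io.StringIO` is used so that the example runs without any files on disk; with a real file you would pass the opened file to `csv.reader()` instead.

```python
import csv
import io

# Hypothetical table: the same three rows, once as CSV and once as TSV.
csv_text = "name,role\nCharlie,visitor\nMr Wonka,owner\n"
tsv_text = "name\trole\nCharlie\tvisitor\nMr Wonka\towner\n"

# io.StringIO lets us treat a string as if it were an opened file.
csv_rows = list(csv.reader(io.StringIO(csv_text)))
tsv_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

# Each row is a list of cells; both formats give the same result.
print(csv_rows)
print(tsv_rows)
```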




## JSON
More information will follow

## XML
More information will follow