<h2>Introduction to Python</h2>

<h3>Session 2</h3>
    <ul>
        <li>Input/Output</li>
        <li>Modules</li>
        <li>LXML</li>
        <li>Pandas</li>
        <li>Plotting</li>
    </ul>

<h2>Input/Output</h2> <br>
<h4>Reading files</h4>

To read simple txt files you can use the built-in function open().
In its basic form open() takes two arguments. The first one defines the path to the file
you want to open. The second argument determines the mode in which the file is opened.
Modes:

"r" - Read - Default value. Opens a file for reading, error if the file does not exist <br>

"a" - Append - Opens a file for appending, creates the file if it does not exist <br>

"w" - Write - Opens a file for writing, creates the file if it does not exist, overwrites its content if it does <br>

"x" - Create - Creates the specified file, returns an error if the file exists <br>

"t" - Text - Default value. Text mode <br>

"b" - Binary - Binary mode (e.g. images) <br>

After reading/appending/writing a file it is important to close it again. Closing releases
the file handle and makes sure that everything written is actually saved to disk.
It is done with the close() method.
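The difference between the modes is easiest to see on a throwaway file. The following is a minimal sketch, assuming a temporary folder and a made-up file name (modes_demo.txt); it is not part of the course material.

```python
import os
import tempfile

# Work in a temporary folder so no real files are touched.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "modes_demo.txt")  # made-up file name

# "x" creates the file and fails if it already exists.
file = open(path, "xt")
file.write("first line\n")
file.close()

try:
    open(path, "xt")
    created_twice = True
except FileExistsError:
    created_twice = False

# "a" keeps the existing content and adds to the end.
file = open(path, "at")
file.write("second line\n")
file.close()

file = open(path, "rt")
after_append = file.read()
file.close()

# "w" empties an existing file before writing.
file = open(path, "wt")
file.write("replaced\n")
file.close()

file = open(path, "rt")
after_write = file.read()
file.close()

print(created_twice)
print(repr(after_append))
print(repr(after_write))
```

Every open() that succeeds is paired with a close(), in line with the rule above.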

In [None]:
data = open("article.txt","rt")
print(data)
print(type(data))
data.close()

When we print data, we might expect to see the content of article.txt.
Instead we get an object of the io.TextIOWrapper class. To actually read its text
we have to use the read() method.

In [None]:
data = open("article.txt","rt")
text = data.read()
print(text)
data.close()

It is not always necessary to read the whole file at once; large files
take up a lot of memory and can slow down our code. By looping over the
io.TextIOWrapper object in a for-loop we can process each line separately.

In [None]:
data = open("article.txt","rt")
for line in data:
    print(line)
data.close()
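When looping over a file, each line still carries its trailing newline character, which is why print(line) leaves an empty line between the lines. A small sketch, assuming a made-up sample file in a temporary folder, shows one way to strip it:

```python
import os
import tempfile

# Create a small sample file; name and content are made up for this example.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
file = open(path, "wt")
file.write("first line\nsecond line\nthird line\n")
file.close()

data = open(path, "rt")
lines = []
for line in data:
    # Each line still ends with "\n"; rstrip("\n") removes it.
    lines.append(line.rstrip("\n"))
data.close()

print(lines)
```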

It is common to open files within a with-block. This has the advantage that the file is automatically closed as soon as the block is left. If we store the file content in a variable, we can keep working with it even after the file has been closed.

In [None]:
with open("article.txt","rt") as file:
    text = file.read()
print(text)
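That the with-block really closes the file can be checked with the closed attribute of the file object. This is a minimal sketch, assuming a made-up file in a temporary folder:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.txt")  # made-up file name

with open(path, "wt") as file:
    file.write("some text")

with open(path, "rt") as file:
    text = file.read()
    open_inside_block = not file.closed  # still open inside the block

print(open_inside_block)
print(file.closed)  # closed once the block is left
print(text)         # the content is still available in the variable
```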

<h4>Writing files</h4>

If we want to write something to our file system, we can use the open() function
in a similar way. Instead of read() we use write().

In [None]:
text = ""
file = open("my_first_file.txt","wt")
file.write(text)
file.close()

Check if your first file was created correctly!
Let's append another line of text to it.

In [None]:
text = ""
file = open("my_first_file.txt","at")
file.write(text)
file.close()
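Note that write() does not add a line break on its own. The following sketch, with made-up example sentences and a temporary folder, writes one line, appends a second one and reads the result back:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "my_first_file.txt")

# "w" creates the file; the "\n" ends the line explicitly.
file = open(path, "wt")
file.write("This is the first line.\n")
file.close()

# "a" adds to the end of the existing file.
file = open(path, "at")
file.write("This is the second line.\n")
file.close()

file = open(path, "rt")
content = file.read()
file.close()

print(content)
```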

<h2>Modules</h2>

Fortunately, there is a large community of Python users and many of them offer their code for further use. Foreign classes and functions are organized in so-called modules or libraries. To use them in our own projects, we place import statements at the beginning of a Python script.<br>
We start with importing the numpy library.

In [None]:
import numpy
print(type(numpy))

We could talk about numpy forever. Roughly speaking, the core of numpy is the numpy array, a data type that is very similar to the Python list, but has an extensive repertoire of functions to perform very fast linear algebra calculations.
It is built like this:

In [None]:
my_array = numpy.array([1,2,3,4,5])
print(my_array)
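The difference to a plain list shows up as soon as we calculate with it. A minimal sketch, using the same example values:

```python
import numpy

my_list = [1, 2, 3, 4, 5]
my_array = numpy.array(my_list)

doubled_list = my_list * 2     # repeats the list
doubled_array = my_array * 2   # multiplies every element by 2

print(doubled_list)
print(doubled_array)
print(my_array.sum(), my_array.mean())
```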

If you plan on using a lot of numpy in your code, it gets annoying to type "numpy" every single time.
For this reason Python lets you define an alias for a module like this:

In [None]:
import numpy as np

my_array = np.array([1,2,3,4,5])
print(my_array)

Following the same logic, it is possible to import a single function or class from a module, so that neither the module name nor an alias is needed anymore.

In [None]:
from numpy import array

my_array = array([1,2,3,4,5])
print(my_array)
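The three import styles work for any module and all lead to the same objects. As a sketch, here they are applied to math from the standard library (chosen only because it needs no installation):

```python
import math
import math as m
from math import sqrt

# All three names refer to the same function object.
print(math.sqrt is m.sqrt)
print(m.sqrt is sqrt)
print(sqrt(16))
```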

<h2>lxml</h2>

lxml is a Python package designed to handle xml data. It is useful for accessing elements,
attributes or text within xml and handles xpath expressions. Let's start with opening an
xml file without lxml:

In [None]:
file = open("ENG18482_Gaskell.xml","r")
xml = file.read()
file.close()
print(xml)

This is okay, at least we can read a file this way. But basically that's all. lxml gives us the opportunity to work with xml. We start with the import statement.

In [None]:
from lxml import etree

lxml provides a class for xml files, the ElementTree class. First we need to load our xml as an instance of this class:

In [None]:
xml = etree.parse("ENG18482_Gaskell.xml")
print(type(xml))

Maybe we are interested in headings. These should be found inside head elements (at least that's what the elte-tei standard promises).

In [None]:
headings = xml.findall("head")
print(headings)

Okay, surprisingly we got zero headings. <br>
But looking at the xml file, it's clear that there
have to be some of them somewhere.<br>
We forgot to define namespaces...

In [None]:
namespace = {"TEI": "http://www.tei-c.org/ns/1.0"}
headings = xml.xpath("//TEI:head", namespaces=namespace)
print(headings)

We have again received rather unpleasant output. We should check the data type of headings:

In [None]:
print(type(headings))

Okay, it's a list. But what is inside this list?

In [None]:
print(type(headings[0]))

We got instances of the class lxml.etree._Element. <br>
If you stumble upon something you have no idea how to handle, it's
never a bad idea to ask at stackoverflow.com:
https://stackoverflow.com/questions/30772943/get-inner-text-from-lxml

In [None]:
print(headings[0].text)

Alright, now we can print every heading within a for loop!!

In [None]:
for heading in headings:
    
    print(heading.text)
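The namespace problem from above can be reproduced on a small inline document. The following is a sketch under two assumptions: the xml snippet is made up, and it uses xml.etree.ElementTree from the standard library instead of lxml, so that it runs without extra packages. The prefix mapping is passed to findall() in the same way.

```python
import xml.etree.ElementTree as ET

# A tiny made-up TEI-like document; the real file is much larger.
source = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div><head>Chapter I</head><p>First paragraph.</p></div>
      <div><head>Chapter II</head><p>Second paragraph.</p></div>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(source)
namespace = {"TEI": "http://www.tei-c.org/ns/1.0"}

# Without the namespace, no element is called just "head".
without_namespace = root.findall(".//head")

# With the prefix mapped to the namespace URI, the headings are found.
with_namespace = root.findall(".//TEI:head", namespace)
titles = [heading.text for heading in with_namespace]

print(root.tag)  # the namespace is part of the element name
print(without_namespace)
print(titles)
```

The printed root tag shows why the plain search fails: internally, the namespace URI is part of every element name.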

<h2>pandas</h2>

One can imagine pandas as a Python variant of MS Excel. It is useful
for organising, storing and analysing your data along with its
metadata. The basic element of pandas is the DataFrame.

In [None]:
import pandas as pd

In [None]:
frame = pd.DataFrame()
print(frame)

A DataFrame contains an index (rows, axis 0) and columns (axis 1), along with the actual data. We start by creating an empty frame and add three named columns.

In [None]:
frame = pd.DataFrame()
frame["author_lastname"] = ["Gaskell","Dickens","Eliot"]
frame["author_forename"] = ["Elizabeth","Charles","George"]
frame["title"] = ["Mary Barton","Hard Times","Middlemarch"]
frame

Now we can access values by selecting fields via conditions in brackets. This expression reads:
Give me all titles of novels written by Gaskell

In [None]:
frame[frame.author_lastname == "Gaskell"]["title"]
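It can help to look at the condition on its own: it produces one True/False value per row, and only the rows marked True are kept. A minimal sketch with a shortened version of the table, using .loc to select rows and column in one step:

```python
import pandas as pd

frame = pd.DataFrame({
    "author_lastname": ["Gaskell", "Dickens", "Eliot"],
    "title": ["Mary Barton", "Hard Times", "Middlemarch"],
})

# The condition alone gives one boolean per row.
mask = frame.author_lastname == "Gaskell"
print(mask.tolist())

# .loc takes the row selection and the column name together.
titles = frame.loc[mask, "title"]
print(titles.tolist())
```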

We can save DataFrames to our disk with to_csv().
The sep argument determines the character that will be
used to separate values. Ideally this character does not
occur in any field of your DataFrame: pandas quotes such
fields when writing, but tools that simply split each line
at the separator will misread them.

In [None]:
frame.to_csv("my_first_table.tsv", sep="\t")
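To see what ends up in such a file, the table can be written to a string instead of the disk and read back from there. This is a sketch with made-up data; io.StringIO and pd.read_csv() are additions not used in the text up to this point.

```python
import io
import pandas as pd

frame = pd.DataFrame({"title": ["Mary Barton", "Hard Times"]},
                     index=["Gaskell", "Dickens"])

# Without a file name, to_csv() returns the table as a string.
tsv = frame.to_csv(sep="\t")
print(repr(tsv))

# Reading it back with the same separator restores index and values.
restored = pd.read_csv(io.StringIO(tsv), sep="\t", index_col=0)
print(restored.index.tolist())
print(restored["title"].tolist())
```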

The counterpart to to_csv() is read_csv(). <br>
Let's try to load an example file:

In [None]:
frame = pd.read_csv("dtm_example.tsv", sep="\t", index_col=0)
frame

This table holds token counts for 57 English novels. Document names are stored in
the columns and words in the index. If we want to know how often a word occurs
in a certain document we can just type:

In [None]:
frame.loc["the","d01"]

Just as well we can compare the word frequencies of several documents and words.

In [None]:
frame.loc[["he","she"],["d01","d22","d34"]]

In addition we can select values by numerical indexing:

In [None]:
frame.iloc[[1,2,4],[2,4,6]]

In both loc and iloc ":" can be used as a wildcard (i.e. every row/column).

In [None]:
frame.iloc[[1,2],:]
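The difference between loc (labels) and iloc (positions) is easiest to check on a table small enough to read at a glance. A sketch with made-up counts:

```python
import pandas as pd

# Made-up counts; rows are words, columns are documents.
frame = pd.DataFrame(
    {"d01": [10, 4, 2], "d02": [8, 1, 5], "d03": [9, 3, 3]},
    index=["the", "he", "she"],
)

by_label = frame.loc["he", "d02"]   # row "he", column "d02"
by_position = frame.iloc[1, 1]      # second row, second column
whole_row = frame.loc["she", :]     # ":" selects every column

print(by_label, by_position)
print(whole_row.tolist())
```

Both lookups address the same cell, once by name and once by position.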

Sorting is also possible:

In [None]:
frame.sort_values(["d01"], ascending=False)

Sometimes it's useful to flip rows and columns:

In [None]:
frame = frame.T
frame

In [None]:
frame.sort_values("she",ascending=False)["she"]

We can calculate the sum of a whole row or column with
the sum() method. The same applies to mean().

In [None]:
frame.loc["d01"].sum()

In [None]:
frame["and"].sum()

In [None]:
frame["and"].mean()

If we really want to compare word counts in novels, it
is not reasonable to look at absolute values, because
the novels differ in length.
Instead we should transform our data into relative frequencies.
To do so we have to get the total count of words in each document
and divide every single word count by that value.

In [None]:
frame = frame.div(frame.sum(axis=1), axis=0)
frame
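The two steps can be followed on a small table where the arithmetic is easy to check by hand. This is a sketch with made-up counts, where rows are documents and columns are words, as in the transposed frame:

```python
import pandas as pd

# Made-up counts; rows are documents, columns are words.
frame = pd.DataFrame(
    {"the": [4, 1], "he": [2, 1], "she": [2, 2]},
    index=["d01", "d02"],
)

totals = frame.sum(axis=1)             # total word count per document
relative = frame.div(totals, axis=0)   # divide each row by its own total

print(totals.tolist())
print(relative.loc["d01"].tolist())
print(relative.sum(axis=1).tolist())   # every row now adds up to 1
```

Here axis=1 in sum() adds up along the columns (one total per row), and axis=0 in div() matches those totals to the rows by their index labels.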

<h2>Plotting</h2>

There are several libraries for plotting in Python.
You can look at some galleries:

https://bokeh.pydata.org/en/latest/docs/gallery.html
https://plot.ly/python/
https://dash-gallery.plotly.host/Portal/
https://seaborn.pydata.org/examples/index.html
https://matplotlib.org/3.1.1/gallery/index.html

bokeh and plotly are quite advanced and strongly
geared towards web-app development, so we stick
with matplotlib-based solutions.

We want to show the relative frequency of "he" in
the first 5 documents and choose a bar plot as
visualization.

In [None]:
import matplotlib.pyplot as plt

To actually plot a figure we need to type plt.show(). Let's try:

In [None]:
plt.show()

Of course we can't see anything here, since we did not define
a figure that could have been plotted.<br>
To change this, we use the function subplots() and
configure the size of our plot.

In [None]:
fig, ax = plt.subplots(figsize=(16,9))
plt.show()

Now we see something, but it's still far from a bar plot.
We can create bar plots using plt.bar().

In [None]:
fig, ax = plt.subplots(figsize=(16,9))
plt.bar()
plt.show()

This error message is quite handy.
We just need to add x (the names of the bars)
and height:

In [None]:
fig, ax = plt.subplots(figsize=(16,9))
plt.bar(frame.index[:5], frame["he"].iloc[:5]) # "he" in the first five documents
plt.show()

Basically that's it! But if you want to use this plot in any form of
publication you should invest a bit of time to enhance its
appearance.

In [None]:
fig, ax = plt.subplots(figsize=(16,9))
plt.bar(frame.index[:5], frame["he"].iloc[:5]) # "he" in the first five documents
plt.ylabel("Relative frequency of the word he", size=18) # Label of y axis
plt.xlabel("Documents", size = 18) # Label of x axis
ax.tick_params(labelsize=15) # Size of tick labels
plt.show()
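For a publication the figure usually has to end up in a file. The following sketch makes several assumptions: the frequencies and the file name he_frequencies.png are made up, the "Agg" backend is selected so that no window is needed, and all calls go through the fig and ax objects instead of plt.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: render to a file, no window needed
import matplotlib.pyplot as plt

# Made-up relative frequencies for five documents.
documents = ["d01", "d02", "d03", "d04", "d05"]
frequencies = [0.012, 0.015, 0.009, 0.011, 0.014]

fig, ax = plt.subplots(figsize=(16, 9))
bars = ax.bar(documents, frequencies)
ax.set_ylabel("Relative frequency of the word he", size=18)  # Label of y axis
ax.set_xlabel("Documents", size=18)                          # Label of x axis
ax.tick_params(labelsize=15)                                 # Size of tick labels

# Save the figure to a temporary folder; the file name is hypothetical.
out_path = os.path.join(tempfile.mkdtemp(), "he_frequencies.png")
fig.savefig(out_path)
plt.close(fig)

print(out_path)
```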