<a href="https://colab.research.google.com/github/esohman/EADH/blob/main/2_EADH_beginner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#EADH Beginner Notebook

Welcome to the Python for Digital Humanities Beginner Notebook!
You should have a general idea of many programming concepts now and understand what comments, variables, and datatypes are and what can be done with some basic data types.

In this notebook we will set a solid practical base that will allow you to move onto intermediate tasks. We will examine the concepts you learned in the previous notebook in more detail, learn about some new datatypes, and combine all the different things you learned previously to see how they can be used together. We will also explore other Python libraries such as NLTK and learn more about regular expressions.

Like the introductory notebook, this one too links to existing sources. Don't hesitate to look at those other sources whenever you are feeling confused.

#Lists, dictionaries, sets, and tuples!

A **List** is a collection which is ordered and changeable. Allows duplicate members.

A **Tuple** is a collection which is ordered and unchangeable. Allows duplicate members.

A **Set** is a collection which is unordered and unindexed. No duplicate members.

A **Dictionary** is a collection of key: value pairs which is ordered (insertion order is kept as of Python 3.7) and changeable. No duplicate keys.
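
As a quick illustration of the four definitions above, here is a minimal sketch that builds one of each. The variable names and the fruit data are made up for this example.

```python
# One small example of each of the four collection types
fruits_list = ["apple", "pear", "apple"]          # ordered, changeable, duplicates allowed
fruits_tuple = ("apple", "pear", "apple")         # ordered, unchangeable, duplicates allowed
fruits_set = {"apple", "pear", "apple"}           # unordered, the duplicate is dropped
fruit_colors = {"apple": "red", "pear": "green"}  # unique keys that map to values

print(len(fruits_list))      # 3
print(len(fruits_tuple))     # 3
print(len(fruits_set))       # 2
print(fruit_colors["pear"])  # green
```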


##Lists
We looked briefly at lists in the previous notebook. Now we'll take a much closer look at them.

NB! Lists are their own data type unrelated to the data type they contain.


This data type is used to group values together. You can store practically any objects in lists. Lists are defined using square brackets, and elements are separated using a comma. List indexes start from 0, so the first element of a list has index 0 and not 1, the second element has index 1, etc.

Lists are also good when you don't know how big your output will be, as you can always add more elements to your list. A list can consist of any type of data. You can have a list of integers, strings or a combination of both. You can also have a list of lists or a list of dictionaries etc.

You can read more about lists and how you can manipulate them here:
https://www.w3schools.com/python/python_lists_add.asp

or [watch the Python for DH video](https://pythonhumanities.com/lesson-06-python-lists/) on lists.

In [None]:
# This is a list of integers
l = [1, 2, 3]

# This is a list of lists (that was unintentionally made to look like a matrix)
ll = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(len(ll))
# And this is a list of all kinds of things!
misc = [0, 1 == 1.0, type('cat')]
print(misc)

Lists also support indexing, just like strings, and you can access elements of a list by number using square brackets.

In [None]:
print(l[0]) #printing the first element of list l (an integer)
print(ll[-1]) #printing the last element of list ll (a list)

# In the following statement [0, 0, 0] is a list and [0] is an index 
print([0, 0, 0][0]) #we're printing the first element of the list with three zeroes

print(ll[2][1]) #we are printing the second element (index 1) of the third element (the list with index 2) of ll

You can add elements to existing lists using the .append() and .extend() methods:

In [None]:
# Modify list l by adding another object. If you run this cell more than once, more numbers will be added.
l.append(4)
print('Modified list: ', l)

# Here we work on copies so that the existing list ll is not modified.
# (A plain ll_app = ll would only give ll a second name, and appending would change ll too.)
ll_app = ll.copy()
ll_app.append([0, 0, 0])
print('Appended: ', ll_app)
#Can you spot the difference between extend and append?
ll_ext = ll.copy()
ll_ext.extend([0, 0, 0])
print('Extended: ', ll_ext)

In [None]:
l.append(ll)
print(l)

In [None]:
l = [1, 2, 3]
ll = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
l.append(ll)
print(l)

In [None]:
l = [1, 2, 3]
ll = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
l.extend(ll)
print(l)

Append adds an object at the end of the specified list. If that object is another list, it adds that list as a single element. Say the original length of a list is 3, and you append another list with three elements to the first list, the length of the list is now 4 (not 6).

Extend adds the specified values to the end of the list to be extended. So with a list with three elements, extending it with another list with three elements, gets us a list with six elements. 

***Extend appends elements from an iterable. Append appends the whole iterable.***
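
One caveat when experimenting with append and extend: assigning a list to a second name does not create a new list. Both names refer to the same object, so a change made through either name is visible through the other. The sketch below uses made-up names (`original`, `alias`, `copied`) to show the difference between a second name and a copy made with .copy().

```python
original = [1, 2, 3]
alias = original          # no copy: both names point to the same list
copied = original.copy()  # a new list with the same elements

alias.append(4)           # also changes what original refers to
copied.extend([5, 6])     # changes only the copy

print(original)           # [1, 2, 3, 4]
print(copied)             # [1, 2, 3, 5, 6]
print(alias is original)  # True
```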

We can also remove and insert items from/to lists. In the code cell here you can see the effect of most of these list methods.


In [None]:
myList = [1,2,3,4,5,6,7,8]
item = "item"
print(f'myList in the beginning: {myList}')

myList[2] = "juice" #replaces the value at index 2 with the new value "juice". You can also replace multiple values at the same time.
print(f'myList after replacing index: {myList}')

myList.append(item) #adds an item at the end of the list
print(f'myList after append: {myList}')

myList.extend(item) #merges two iterables adding the second iterable’s items  at the end of the first list’s items
print(f'myList after extension: {myList}') 
#note that the item is a string, i.e. a sequence of characters and is iterable

index = 1
myList.insert(index,item) #inserts a new item at the given index, moving the items from that index onward one position to the right
print(f'myList after insertion: {myList}')

myList.remove(item) #removes the first instance of the specified item
print(f'myList after remove: {myList}')

myList.pop(index) #removes the item at the specified index, if left empty, it removes the last item in the list
print(f'myList after pop: {myList}')

#pop() also returns the popped item so
popped = myList.pop(index)
print(f'Popped: {popped}\nMyList: {myList}')

del myList[index] #deletes the item at the specified index
print(f'myList after deletion of index: {myList}')

del myList # deletes the entire list 
print(f'myList after deletion of entire list: {myList}')
#notice how you get an error since the list no longer exists!

###sort and reverse
We can also sort and reverse lists

In [None]:
mylist = [4,3,78,34,54,23,1]

print(mylist)
mylist.reverse()
print(f'myList reversed: {mylist}')
mylist.sort()
print(f'myList sorted: {mylist}')
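
Both .sort() and .reverse() change the list in place and return None. If you want to keep the original order as well, the built-in sorted() returns a new list instead. A small sketch with a made-up list called `numbers`:

```python
numbers = [4, 3, 78, 34, 54, 23, 1]

ordered_copy = sorted(numbers)              # returns a new sorted list
descending = sorted(numbers, reverse=True)  # same, but largest first
print(numbers)                              # still in the original order

result = numbers.sort()                     # sorts in place and returns None
print(ordered_copy)                         # [1, 3, 4, 23, 34, 54, 78]
print(descending)                           # [78, 54, 34, 23, 4, 3, 1]
print(result)                               # None
```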

##Dictionaries  
Dictionaries are associative variables. They contain unique keys, with which different values can be associated. Values belonging to different keys do not have to be unique. Only immutable values, such as strings, numbers, and tuples, can be keys. Dictionaries are much like real dictionaries in that you have certain keys (like a lexical item) that correspond to values (the definition of the word in a dictionary).

Read more about dictionaries here: https://www.w3schools.com/python/python_dictionaries.asp

Python for DH also has a good [video on creating and working with dictionaries](https://pythonhumanities.com/index.php/lesson-07-python-dictionaries/).


In [None]:
d = {'cat': 'singular', 'cats': 'plural', 'kittens': 'plural'}
d2 = {'cat': ('singular','mammal'), 'cats': 'plural', 'kittens': 'plural'}

Dictionaries are indexed by keys. You can access any value by its key.

In [None]:
d2['cat']

There are many things that can be done with dictionaries in Python, so for now we will only cover the basics.

In [None]:
# Add a new key: value pair
# Notice that no matter how many times you rerun this cell, only one pair will be added. 
# This is because a dictionary can't have multiple pairs with the same key, so each time you rerun this cell, the 'puppy''s value is rewritten, not added again.
d['puppy'] = 'singular'
print(d)

# Add multiple new pairs
dd = {'humans': 'plural', 'puppies': 'plur'}
d.update(dd)
print(d)

# Rewrite a value
d['puppies'] = 'plural'
print(d)

# Delete a key: value pair
del d['humans']
print(d)

# Display all key: value pairs using the .items() method
print("All keys and values: ", d.items())
# You can also use .keys() method to access all keys, and .values() method to access all values
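
The .keys(), .values() and .items() methods mentioned in the cell above can be sketched as follows. The example also uses .get(), which is not covered elsewhere in this notebook; it is a standard dictionary method that returns a fallback value instead of raising a KeyError.

```python
d = {'cat': 'singular', 'cats': 'plural', 'kittens': 'plural'}

print(list(d.keys()))    # ['cat', 'cats', 'kittens']
print(list(d.values()))  # ['singular', 'plural', 'plural']

# Loop over the key: value pairs
for word, number in d.items():
  print(word, 'is', number)

# .get() returns the fallback for a missing key instead of raising a KeyError
print(d.get('dog', 'unknown'))  # unknown
```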

More information can always be found, for example, in the official Python guides and tutorials: https://docs.python.org/3/

###defaultdict

The built-in Python dictionary is a very useful data type, but if you try to access a key that does not exist, you get a KeyError. KeyErrors can be helpful, but it can also be beneficial to have the ability to automatically create a key-value pair if the key you are trying to access does not exist.

This can be done with the help of ***defaultdict***, which needs to be imported from collections.

In [None]:
from collections import defaultdict
 
test_dic = defaultdict(int) #we set the defaultdict's default value for when a Key does not exist
#this default value can be a data type (defaultdict(int)) or a function

test_dic["cat"] = 1
test_dic["dog"] = 2
 
print(test_dic["cat"])
print(test_dic["dog"])
print(test_dic["parrot"])
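
A common use of defaultdict(int) in text analysis is counting words. The following is a minimal sketch with a made-up sentence; the names `text` and `counts` are only for illustration.

```python
from collections import defaultdict

text = "the cat saw the dog and the dog saw the cat"
counts = defaultdict(int)  # missing keys start at int(), which is 0

for word in text.split():
  counts[word] += 1        # no KeyError the first time a word is seen

print(dict(counts))
```

Keep in mind that simply reading a missing key from a defaultdict, as with "parrot" in the cell above, also adds that key to the dictionary.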

## Sets and Tuples

Sets are similar to lists with a few exceptions:

* Sets are unordered
* Set items are unchangeable: you cannot modify an item in place, but you can add and remove items
* Sets can not hold duplicate values


You can convert a list to a set using set(myList). This removes duplicate values.
		


In [None]:
mylist = [3,5,5,7,8,3,4,5,6]
print(f'mylist as list: {mylist} and mylist as set: {set(mylist)}')
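
Beyond removing duplicates, sets are handy for membership checks and for comparing two collections. A small sketch with made-up data:

```python
words = {"cat", "dog"}
words.add("parrot")    # add a new item
words.add("cat")       # already present, so nothing changes
words.discard("dog")   # remove an item

print(len(words))      # 2
print("cat" in words)  # True

other = {"parrot", "horse"}
print(words & other)   # intersection: {'parrot'}
print(words | other)   # union; the printed order may vary
```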

As for tuples, they are [variables that are used to store multiple values](https://www.w3schools.com/python/python_tuples.asp). You can understand them better by watching [this Python for DH video](https://pythonhumanities.com/lesson-05-python-tuples/).

Tuples are ordered, unchangeable, and unlike sets can hold duplicate values.

Just like lists, you can access individual values in a tuple by their indexes that start at 0.

In [None]:
mytuple = (1,"cat",4)
print(len(mytuple))
print(mytuple[1])

You can do most things with tuples and sets that you can do with lists. The main difference is that sets do not allow duplicates and tuples are unchangeable which means that you cannot change the values after the tuple has been created.
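
What "unchangeable" means in practice can be sketched like this. The example uses try and except, which are explained in the Troubleshooting section further down; the names `aslist` and `newtuple` are made up.

```python
mytuple = (1, "cat", 4)

try:
  mytuple[1] = "dog"  # tuples do not support item assignment
except TypeError as error:
  print("Cannot change a tuple:", error)

# To get a changed version, build a new tuple, for example via a list
aslist = list(mytuple)
aslist[1] = "dog"
newtuple = tuple(aslist)
print(newtuple)  # (1, 'dog', 4)
print(mytuple)   # (1, 'cat', 4)
```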

#List comprehension
List comprehension is Python at its most Python.

List comprehension is a more concise, and often faster, way of writing certain types of for loops.

Sometimes a traditional for loop works better, so don't use list comprehension just for the sake of it.

In [None]:
#this is a typical for loop
mylist = [1,2,3,4,5,6,7,8,9,0]
newlist = []

for element in mylist:
  if element%2: #this is the same as element%2 != 0, because 0 counts as False and any other number counts as True
    newlist.append(element)

print(newlist)

In [None]:
#we can write the above code using list comprehension
mylist = [1,2,3,4,5,6,7,8,9,0]

newlist = [element for element in mylist if element%2]

print(newlist)

We can also do nested loops and several if statements in one.

In [None]:
mylist = [1,2,3,4,5,6,7,8,9,0]
newlist = [element*2 for element in mylist if element > 2 and element <8]
print(newlist)

In [None]:
mylist = [1,2,3,4,5]
mylist2 = ["a","b","c"]
newlist = [element*element2 for element in mylist for element2 in mylist2 if element %2]
print(newlist)

###Dictionary comprehension
Similar to list comprehension, we can do dictionary comprehension too.

In [None]:
#rather than doing 
d = {}
list1 = [1,2,3,4,5,6,7]
list2 = [54,12,876,34,123,756,435]

for item in list1:
  d[item] = list2[list1.index(item)] #we are checking the index of item

print(d)

In [None]:
#or
d = {}
list1 = [1,2,3,4,5,6,7]
list2 = [54,12,876,34,123,756,435]
mergedlist = list(zip(list1,list2)) #we are using zip to zip together, merge, two lists

for i,j in mergedlist:
  d[i] = j

print(d)

In [None]:
#we can use dictionary comprehension
list1 = [1,2,3,4,5,6,7]
list2 = [54,12,876,34,123,756,435]
d = {i:j for i,j in list(zip(list1,list2))}
print(d)
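
For the specific case of pairing up two lists, the built-in dict() can also take the zipped pairs directly. A comprehension becomes more useful once you want to transform or filter the pairs along the way. A short sketch with shortened versions of the lists above (`paired` and `large_only` are made-up names):

```python
list1 = [1, 2, 3, 4]
list2 = [54, 12, 876, 34]

# dict() accepts an iterable of (key, value) pairs
paired = dict(zip(list1, list2))
print(paired)      # {1: 54, 2: 12, 3: 876, 4: 34}

# A comprehension lets us filter while building the dictionary
large_only = {i: j for i, j in zip(list1, list2) if j > 50}
print(large_only)  # {1: 54, 3: 876}
```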

#Troubleshooting
Getting errors when programming is one of the best ways to learn. Errors force you to troubleshoot and solve problems.

There are a few steps to troubleshooting that go something like:
1. Check your code for obvious mistakes like typos, using the wrong variable in the wrong place, an extra comma or parenthesis somewhere. 

  If the problem is that the output is not what you expected, print out the values of variables along the way, for example before, during, and after loops. This shows you where things go wrong.

2. Read the error message. Try to understand it. It could be a very simple fix. If not, google it. StackOverflow is your best friend and will remain so until the day you stop programming.

3. One of the easiest ways to deal with certain types of errors in your code is by using exceptions. Exceptions allow you to check if something works, and if it doesn't, keep going. Error types can be specified, or you can leave the type blank to treat all errors the same way. The danger of leaving it blank is that you might silently hide errors that you should be noticing, so that your output is not what it's supposed to be. It is recommended to always specify the error or errors that you want to catch.

In [None]:
divisorlist = [1,2,3,0,1]
n = 1234
try:
  for i in divisorlist:
    print(n/i)
except:
  pass

In [None]:
try:
  for i in divisorlist:
    print(n/i)
except ZeroDivisionError:
  print("Don't divide with zero!")

In [None]:
try:
  for i in divisorlist:
    print(n/i)
except ZeroDivisionError:
  print("Don't divide with zero!")
finally: #we can also add a finally block that is executed at the end, whether or not an exception occurred
  print("All done!")

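Did you notice that once an exception is raised, the rest of the try block is skipped? The loop stops at the zero and never reaches the last divisor. To keep going, place the try and except inside the loop so that each division is attempted on its own.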
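
A minimal sketch of a loop that survives a failing item, reusing the divisor list from the cells above. The `results` list is a made-up name added here to collect the successful divisions.

```python
divisorlist = [1, 2, 3, 0, 1]
n = 1234
results = []

for i in divisorlist:
  try:
    results.append(n / i)  # only this one division is attempted per try
  except ZeroDivisionError:
    print("Skipping division by zero")

print(results)  # four results: the zero is skipped, the last divisor is still used
```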






Also note how we used ***pass*** under except in the first example. If there is nothing we want to do with the exception, we use pass. Within loops we can choose between:
* ***continue*** which skips the rest of the current iteration and returns to the top of the loop for the next one
* ***pass*** which literally does nothing. If there is code after pass, it will be executed
* ***break*** which exits the loop

In [None]:
for letter in 'hello':
  if(letter == 'e'):
    print('pass to be executed')
    pass
  print(letter)

In [None]:
for letter in 'hello':
  if(letter == 'e'):
    print('continue to be executed')
    continue
  print(letter)

In [None]:
for letter in 'hello':
  if(letter == 'e'):
    print('break to be executed')
    break
  print(letter)

#Regular Expressions
"A Regular Expression (RegEx) is a sequence of characters that defines a search pattern." ([source](https://www.programiz.com/python-programming/regex))

To truly understand regular expressions, I recommend a tutorial like [RegexOne](https://regexone.com/). It is interactive and goes through many different aspects and functions of regular expressions or regex.

When regex are used with programming languages, they vary a little bit with each language. This variation is often referred to as the flavor of regex. With Python we of course use the Python flavor of regex.

Regular expressions are find or find and replace functions. We use them when we want to find (and possibly replace) specific things in a text. Perhaps we want to find all the words that start with an A or all the words that end in -ing. Regex makes this easy.
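
The two searches mentioned above can be sketched as follows. The sample sentence and the variable names are made up, and the patterns are just one possible way to write them: `\b` marks a word boundary and `\w` matches a letter, digit or underscore.

```python
import re

sequence = "Alice was beginning to get very tired of sitting by her sister and of having nothing to do"

# Words that start with an upper or lower case a
starts_with_a = re.findall(r"\b[Aa]\w*", sequence)
print(starts_with_a)  # ['Alice', 'and']

# Words that end in -ing
ends_in_ing = re.findall(r"\b\w+ing\b", sequence)
print(ends_in_ing)    # ['beginning', 'sitting', 'having', 'nothing']
```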

For this part, if you are new to regex, I recommend that you do the RegexOne tutorial first and then continue on with the code in order to be able to construct your own regular expressions. Also, regular expressions are a complex topic and could easily be an entire course of their own. This notebook only covers a few of the most common use cases, so I recommend you do the tutorial and watch the tutorial videos linked to below for a comprehensive view of regex.

Python for DH has two videos that deal with regex. You can find them [here](https://pythonhumanities.com/lesson-14-python-and-regex-part-01/) and [here](https://pythonhumanities.com/lesson-15-python-and-regex-part-02/).

Another source for Python-specific regex is [W3Schools](https://www.w3schools.com/python/python_regex.asp).

When creating regular expressions, it can be a good idea to check your code on something like [regex101](https://regex101.com/). This is an online, interactive regex checker. Write your pattern in the pattern box, choose Python as your regex flavor, write or copy paste some sample text in the text box and you're ready to test your pattern. Once it does exactly what you want it to do, you can copy paste your pattern back to your Python code.

In [None]:
#The simplest regex patterns consist of alphanumeric characters
import re
#you can use both match and search
#the difference is that match matches only the beginning of the string, search matches in the whole string

#let's call all our regular expressions "pattern"s. Note the r before the quotation marks
pattern = r"cookie" #change this to "The" and see what happens
sequence = r"The biggest monster is the Cookie Monster" #let's call all the string sequences we are searching "sequence"s

if re.match(pattern, sequence):
  print("Match!")
else: 
  print("Not a match!")

#you can do the same with re.search() 
if re.search(pattern, sequence):
  print("Found!")
else: 
  print("Not found!")
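
To make the difference between match and search easier to see, here is a small side-by-side sketch on the same sentence. It also shows that matching is case-sensitive unless you pass the re.IGNORECASE flag.

```python
import re

sequence = "The biggest monster is the Cookie Monster"

at_start = bool(re.match(r"The", sequence))       # True: the string starts with "The"
not_at_start = bool(re.match(r"Cookie", sequence))  # False: "Cookie" is not at the start
anywhere = bool(re.search(r"Cookie", sequence))   # True: found later in the string
wrong_case = bool(re.search(r"cookie", sequence)) # False: matching is case-sensitive
ignore_case = bool(re.search(r"cookie", sequence, re.IGNORECASE))  # True

print(at_start, not_at_start, anywhere, wrong_case, ignore_case)
```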

In [None]:
#Sometimes we want to know exactly what was found with our pattern
pattern = r'\w+'
sequence = 'This is just an example.'
matches = re.findall(pattern,sequence) #findall returns a list
matches

In [None]:
#Another common thing we might want to do is replace
pattern = r'is'
replacewith = r'are'
sequence = 'Data is knowledge'
matches = re.sub(pattern, replacewith,sequence)
matches

A "neat" thing you can do with regex is finding something and changing it. I.e. you are not replacing it with something completely new, but instead changing or adding to the original.


In [None]:
pattern = r'(\w+)(\s{2,})(\.)' #Match instances where a word is followed by at least two space type characters and a period
replacewith = r'\g<1> !'
sequence = 'This is a silly     . sentence pattern   . unlikely to be found in the wild     .'
matches = re.sub(pattern, replacewith,sequence)
matches

We use parentheses to create groups. These groups get an index from left to right starting with 1. The first group in our pattern matches any word, the second group matches two or more spaces, and the third group matches a literal period (Remember that \ is used to escape special characters).

So when we want to replace a run of two or more spaces with a single space, and the period with an exclamation point, we can reference the first group with \g<1>. This allows us to keep the word the same while still matching it.
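
To see what each group captured, you can inspect a single match object. This is a small sketch on a shortened, made-up version of the sentence above. It also shows that `\1` is a shorter way of writing `\g<1>` when the reference is not directly followed by a digit.

```python
import re

pattern = r'(\w+)(\s{2,})(\.)'
sequence = 'This is a silly' + ' ' * 5 + '. sentence'

m = re.search(pattern, sequence)
print(m.group(1))       # silly
print(len(m.group(2)))  # 5 (the run of spaces)
print(m.group(3))       # .

fixed = re.sub(pattern, r'\1 !', sequence)
print(fixed)            # This is a silly ! sentence
```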

**NB!** Regular expressions are extremely useful and important in programming and computer science in general. Many text editors and word processors support them in find and replace, and so do a number of search tools. It might be the most difficult part of your programming journey so far, but I highly recommend that you work hard on learning to use regex. Whatever you end up programming in the future, I can almost guarantee that you will end up having to use regex at some point.

#Text processing and corpus linguistics with Python

When we are working with real world texts, that is texts that are not just clean example files created for teaching purposes, there are many preprocessing steps that need to be done.

Some common preprocessing steps include removing or replacing whitespace at the beginning and end of a file.
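
A minimal sketch of this kind of whitespace cleanup, using only built-in string methods on a made-up sample string:

```python
raw = "\n\n   It was the best of times,   it was the worst of times.  \n\t\n"

# Remove whitespace (spaces, tabs, newlines) from both ends
stripped = raw.strip()
print(repr(stripped))

# Additionally collapse every run of whitespace inside the text to one space
collapsed = " ".join(raw.split())
print(repr(collapsed))
```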

Often our data is in txt or csv format. Usually we want to take a look at our data before we load it into Python. For text files a good program is [Sublime Text](https://www.sublimetext.com/download) as it is free to download and evaluate and is available for Windows, OS X, and *nix. For CSV files, I recommend [TAD](https://www.tadviewer.com/). TAD is also free and available for all the most common operating systems. With TAD you can open large csv files that are too large for Excel and similar programs. You cannot edit files with TAD, but you can filter data and export the filtered results.

Encoding is also often an issue and can easily lead to garbled data. Try to use utf-8 encoding where possible.
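
The sketch below shows how the wrong encoding garbles text. It writes a short made-up string to a temporary file (the file name `sample.txt` is just an example) and reads it back twice: once with the matching utf-8 encoding and once with latin-1.

```python
import os
import tempfile

text = "café naïve Brontë"

with tempfile.TemporaryDirectory() as folder:
  path = os.path.join(folder, "sample.txt")  # example file name
  # Always state the encoding explicitly when writing and reading
  with open(path, "w", encoding="utf-8") as f:
    f.write(text)
  with open(path, encoding="utf-8") as f:
    good = f.read()
  # Reading the same bytes with the wrong encoding gives garbled text
  with open(path, encoding="latin-1") as f:
    garbled = f.read()

print(good)     # the original text
print(garbled)  # accented letters turn into two-character sequences
```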

##NLTK
###NLTK BOOK

The Natural Language Toolkit is a comprehensive module for Python with many built-in tools that will help you automatically process and analyze texts.

We will work with it in later weeks, but if you are really interested in language technology and natural language processing, feel free to read [the book](https://www.nltk.org/book/) and complete the exercises within.

Parts of this NLTK section have been inspired by Dr. Na-Rae Han's [excellent tutorial](https://sites.pitt.edu/~naraehan/) on NLTK.

NLTK comes preinstalled on Colab, but on other platforms you have to pip install it before you can use it.

Libraries and modules (sometimes also called packages) that are not built-in parts of Python need to be imported. This is done by using the import command. You can watch this Python for DH video for more information on [Python modules and libraries](https://pythonhumanities.com/lesson-13-python-modules-and-libraries/).

In [None]:
import nltk
#nltk comes with many different data sets, corpora, and tools that need to be separately downloaded and installed
nltk.download() #you can see a list with this command

In [None]:
#We can also just download everything
nltk.download('all')

Unfortunately, the downside of using Colab is that we have to install these extra NLTK parts every time we restart a runtime. But now we are good to go!

###Tokenization

Tokenization is a crucial part of almost any computational text analysis task. Tokenization means splitting the text into meaningful units like words or sentences.

In [None]:
sample = "This is a text. It contains a whole bunch of different words."
tokens = nltk.word_tokenize(sample)
tokens
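
To get a feeling for what a tokenizer does, here is a deliberately simplified sketch that uses only regular expressions. It is not how NLTK's tokenizers work internally, and it will mishandle things like abbreviations and contractions, but it shows the basic idea of splitting text into words and sentences.

```python
import re

sample = "This is a text. It contains a whole bunch of different words."

# \w+ matches runs of word characters, [^\w\s] matches a single punctuation mark
simple_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(simple_tokens)

# A naive sentence split: break at whitespace that follows ., ! or ?
simple_sentences = re.split(r"(?<=[.!?])\s+", sample)
print(simple_sentences)
```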

Most of NLTK's built-in corpora are pre-tokenized. NLTK also provides many text analysis tools that often require the text to be of NLTK's own text data type. You can convert text like this:

In [None]:
ttext = nltk.Text(tokens)
ttext

Let's take a look at the State of the Union corpus that comes with NLTK.

In [None]:
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
#Here we created a list of words from the corpus for all the words that are alphabetic, i.e. removed numbers and special characters.
#We can also lower case everything in one go
words = [w.lower() for w in nltk.corpus.state_union.words() if w.isalpha()]

### stopwords
Stopwords are very common words, such as "the" or "and", that are considered to carry little meaning on their own. Depending on the purpose of your analysis, you might want to remove stopwords from your data. NLTK provides stopword lists for many different languages and you can also create your own.


In [None]:
stopwords = nltk.corpus.stopwords.words("english")
words = [w.lower() for w in words if w.lower() not in stopwords]
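
Creating your own stopword list can be as simple as writing a set of words. The sketch below uses a made-up list of stopwords and tokens and does not need NLTK. A set is used because checking membership in a set is fast.

```python
custom_stopwords = {"the", "a", "of", "and", "is"}  # made-up stopword list
sample_tokens = ["The", "history", "of", "the", "book", "is", "a", "history", "of", "readers"]

content_words = [w.lower() for w in sample_tokens if w.lower() not in custom_stopwords]
print(content_words)  # ['history', 'book', 'history', 'readers']
```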

We can also add part-of-speech tags to our text.

In [None]:
pos_text = nltk.pos_tag(tokens)
pos_text

Let's use the original state of the union words to look at some of NLTK's built-in functions.

In [None]:
fd = nltk.FreqDist(words)
fd #what you should see is a list of all the unique words in the text and the number of times they appear

In [None]:
fd.most_common(5) #the 5 most common words
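
FreqDist behaves much like Counter from Python's built-in collections module, so you can try the same idea on a small made-up word list without downloading any corpus:

```python
from collections import Counter

sample_words = ["people", "nation", "people", "world", "nation", "people"]
counts = Counter(sample_words)

print(counts.most_common(2))  # [('people', 3), ('nation', 2)]
print(counts["world"])        # 1
print(counts["freedom"])      # 0, missing words simply have a count of zero
```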

In [None]:
nltk.Text(words).similar("people") #Which words appear in similar contexts to "people"?

In [None]:
nltk.Text(words).concordance("people") #With what words do "people" co-occur?

##Spacy
