# Week 3: Getting creative with text (the Accumulator Pattern)

Last week, we introduced techniques like if/then statements, for loops, and slicing to manipulate and interact with text.


This week, we're going to take that further and focus on using looping, or iteration, to answer a variety of questions. We will introduce something called the **accumulator pattern**, which allows us to keep track of specific data as we explore a text or collection of texts. We'll also introduce the concept of a stop word list.

## Review: working with strings

**Strings** are a type of variable we use often with Python. A string is essentially an ordered list of characters - typically letters, numbers, and punctuation. 

In [2]:
our_string = "Here is an example of a string"

We can print out strings, which is a convenient way to test whether they contain the value we expect them to contain.

In [4]:
print(our_string)

Here is an example of a string


Strings also behave a lot like another type of variable called a **list**, although the two are not the same type. A list is a collection of items. For instance, here's a list of disciplines:

In [5]:
disciplines = ['anthropology', 'history', 'microbiology', 'science and technology studies']

When working with lists, you often want to focus on just one item, or a subset of the list. To focus on an individual item, give the index of that item using the list[index] format. Indexes start at 0, so the first item is at index 0.

To practice, print out the 0th element in the disciplines list:

In [6]:
# replace with the expression to print out the element from the disciplines list located at index 0

We can do the same thing with strings. Can you remember how to indicate the index of the last item in a list? (Hint, it involves a negative number...) Let's print the final character of our_string above:

In [7]:
# replace with the expression to print out the final character from the our_string string

What about printing the first four characters of the our_string string? To do this, we can use **slice notation**. You can write a slice using list[starting_index_inclusive:ending_index_exclusive]. For example, with a list:

In [8]:
print(disciplines[0:2])

['anthropology', 'history']


Write a statement to print the first four characters of our_string (remember, strings behave like lists of characters).

In [9]:
# replace with the expression to print out the first four characters in our_string
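As a quick recap of indexing and slicing, here is a small sketch on a separate example. The names colors and greeting are hypothetical and are not part of the exercises above.

```python
# Hypothetical example values, separate from the exercise variables above
colors = ['red', 'green', 'blue', 'violet']
greeting = "Good morning"

first_color = colors[0]        # index 0 is the first item
last_color = colors[-1]        # index -1 is the last item
middle_colors = colors[1:3]    # items at index 1 and 2 (index 3 is excluded)

last_char = greeting[-1]       # strings can be indexed like lists
first_word = greeting[0:4]     # the first four characters

print(first_color, last_color, middle_colors)
print(last_char, first_word)
```

The same bracket notation works on both types, which is why the list exercises and the string exercises look so similar.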

We can also add a substring into a string using the **.format()** string method. This is convenient especially in print statements. For instance:

In [11]:
print("My undergraduate major was {}... but sometimes I wish I had majored in {}! Oh well!".format(disciplines[0], disciplines[-1]))

My undergraduate major was anthropology... but sometimes I wish I had majored in science and technology studies! Oh well!


The .format() method takes as many parameters as you have {} in the string, which is a convenient system!

## Iterating over lists and strings

We often will want to loop over a list or a string, check for a certain condition, and take action if that condition is met. Consider the code snippet below:

In [13]:
for discipline in disciplines:
    if 'ology' in discipline:
        print('{} is an -ology discipline'.format(discipline))
    else:
        print('No -ology in {}'.format(discipline))

anthropology is an -ology discipline
No -ology in history
microbiology is an -ology discipline
science and technology studies is an -ology discipline


What's going on here? [Discuss]

Now what happens if we want to add more disciplines to the disciplines list? We can use the snazzy **.append()** method. For instance:

In [14]:
disciplines.append('religious studies')
disciplines.append('zoology')
disciplines.append('astronomy')

Now try running the "for discipline in disciplines" cell again.

We can also solve this problem by using what's called the **accumulator pattern**. In this pattern, we also loop over a list and check for a condition, but instead of printing right away, we store any matching values in a list as we go. Here's an example of this same script using the accumulator pattern:

In [17]:
ology_disciplines = []
studies_disciplines = []

for discipline in disciplines:
    if 'ology' in discipline:
        ology_disciplines.append(discipline)
    elif 'studies' in discipline:
        studies_disciplines.append(discipline)
    else:
        pass

print("Here are the -ology disciplines:")
print(ology_disciplines)
print("And here are the studies disciplines:")
print(studies_disciplines)

Here are the -ology disciplines:
['anthropology', 'microbiology', 'science and technology studies', 'zoology']
And here are the studies disciplines:
['religious studies']


Take a look at the output here. What do you notice? Is there any strange behavior jumping out at you? [Discuss]
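One thing to look for: 'science and technology studies' contains 'ology' inside 'technology', so the first condition matches and the elif branch is never checked for that item. The sketch below is one possible variation, assuming we want a discipline to be able to land in both lists. It uses two independent if statements instead of if/elif. It is only one way to handle the situation, not the only correct one.

```python
disciplines = ['anthropology', 'history', 'microbiology',
               'science and technology studies', 'religious studies',
               'zoology', 'astronomy']

ology_disciplines = []
studies_disciplines = []

for discipline in disciplines:
    # Two separate if statements: each condition is checked on its own,
    # so a single discipline can be appended to both lists
    if 'ology' in discipline:
        ology_disciplines.append(discipline)
    if 'studies' in discipline:
        studies_disciplines.append(discipline)

print(ology_disciplines)
print(studies_disciplines)
```

Whether an item should appear in one list or in both depends on the question you are asking, so it is worth deciding that before choosing between elif and a second if.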

Let's take a new example. Here is an abstract of a recent popular article in Frontiers in Digital Humanities:

## A map for big data research in digital humanities
### Frédéric Kaplan

This article is an attempt to represent Big Data research in digital humanities as a structured research field. A division in three concentric areas of study is presented. Challenges in the first circle – focusing on the processing and interpretations of large cultural datasets – can be organized linearly following the data processing pipeline. Challenges in the second circle – concerning digital culture at large – can be structured around the different relations linking massive datasets, large communities, collective discourses, global actors, and the software medium. Challenges in the third circle – dealing with the experience of big data – can be described within a continuous space of possible interfaces organized around three poles: immersion, abstraction, and language. By identifying research challenges in all these domains, the article illustrates how this initial cartography could be helpful to organize the exploration of the various dimensions of Big Data Digital Humanities research.

Let's analyze the text of the abstract. To do this, we'll need to use a couple of additional tools. 

### #1
First, we can create a punctuation_remover with the string library, and apply it to our string using the .translate() method.

In [20]:
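import string
# Build a translation table that deletes every character in string.punctuation.
# (The older Python 2 idiom was s.translate(None, string.punctuation).)
punctuation_remover = str.maketrans('', '', string.punctuation)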

In [22]:
test_string = "Here's a string!!! with a bunch? of? punctuation!?!?"
print(test_string)
test_string = test_string.translate(punctuation_remover)
print(test_string)

Here's a string!!! with a bunch? of? punctuation!?!?
Heres a string with a bunch of punctuation
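To see what the punctuation_remover actually contains, the sketch below inspects the translation table. The sample sentence is a hypothetical test value.

```python
import string

# string.punctuation is a plain string of ASCII punctuation characters
print(string.punctuation)

# The third argument to str.maketrans lists the characters to delete
punctuation_remover = str.maketrans('', '', string.punctuation)

# The table maps each character's code point to None, meaning "remove it"
print(punctuation_remover[ord('!')])

sample = "Wait... what?!"   # hypothetical test sentence
cleaned = sample.translate(punctuation_remover)
print(cleaned)
```

Note that string.punctuation covers only ASCII punctuation, so characters such as the en dash (–) that appears in the abstract above are left in place by this table.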


### #2

Second, we might want to split a very long string into separate words. The **.split()** method is great for this purpose. Here's an example:

In [23]:
long_sentence = "Here is a long sentence full of words we'd like to split up"

In [24]:
split_sentence = long_sentence.split()

In [25]:
print(split_sentence)

['Here', 'is', 'a', 'long', 'sentence', 'full', 'of', 'words', "we'd", 'like', 'to', 'split', 'up']


### #3

Third, we might want to treat a given word as lowercase. To do this, we can call **.lower()** on our word. Here's an example:

In [26]:
word = "Capitalized"
word_l = word.lower()
print(word_l)

capitalized


Now, let's try to combine these techniques together in the accumulator pattern recipe below:

In [28]:
abstract_as_string = #fill in abstract here
# note: to keep quotation marks inside the abstract from ending your string early, you can use the notation
# for a long excerpt, which is three single quotation marks surrounding the text. E.g.: '''text'''

instances_of_data = []

abstract_without_punctuation = # write expression to remove punctuation from abstract
abstract_as_words = # write expression to split the abstract-sans-punctuation into a list of words

for word in # fill in with appropriate list :
    word_lower = # make this current word lowercase
    if 'data' in word_lower:
        # append this word_lower to our list of accumulated values


SyntaxError: invalid syntax (<ipython-input-28-c5b51d847c5e>, line 1)

In [29]:
print(instances_of_data)

NameError: name 'instances_of_data' is not defined
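The errors above appear because the recipe still has blanks to fill in. For comparison, here is a minimal sketch of the same three steps (remove punctuation, split, lowercase) run on a different, hypothetical sample text and a different search term, so the names sample_text and instances_of_text are illustrations only.

```python
import string

# Hypothetical sample text, standing in for a longer passage
sample_text = '''Text mining turns texts into data. "Context" matters, and so does subtext!'''

punctuation_remover = str.maketrans('', '', string.punctuation)

instances_of_text = []   # the accumulator starts out empty

# Step 1: strip punctuation; step 2: split into a list of words
sample_without_punctuation = sample_text.translate(punctuation_remover)
sample_as_words = sample_without_punctuation.split()

for word in sample_as_words:
    word_lower = word.lower()              # step 3: compare in lowercase
    if 'text' in word_lower:
        instances_of_text.append(word_lower)

print(instances_of_text)
```

Because the check is a substring test, words such as "context" and "subtext" are collected along with "text" itself.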

## Challenge

You've implemented an accumulator pattern to analyze text! Now, let's spend some time taking this same pattern and applying it to a new problem.

1. Pick any article that catches your interest from [this list](https://www.frontiersin.org/journals/digital-humanities/sections/digital-history#articles)

2. Instead of focusing just on the abstract, grab the entire article and put it into a variable (Jupyter can handle this just fine!)

3. Decide ahead of time - are there any specific words, prefixes, suffixes, etc. you are curious about counting? Note that if you want to check for an exact match instead of a substring, you can use the syntax if word == "data" instead of if "data" in word. Try to accumulate two lists with different queries, and remember to think about how you formulate your if/elif/else statements!

4. Write your script below and print out your results.
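To illustrate the difference between an exact match and a substring match mentioned in step 3, here is a small sketch. The word list is hypothetical and is assumed to be already lowercased and stripped of punctuation.

```python
# Hypothetical word list, already lowercased and stripped of punctuation
words = ['data', 'database', 'big', 'metadata', 'data', 'update']

exact_matches = []
substring_matches = []

for word in words:
    if word == 'data':          # exact match: the whole word must be 'data'
        exact_matches.append(word)
    if 'data' in word:          # substring match: 'data' appears anywhere in the word
        substring_matches.append(word)

# len() turns each accumulated list into a count
print(len(exact_matches), exact_matches)
print(len(substring_matches), substring_matches)
```

The two queries give different counts on the same words, so it helps to decide which kind of match answers your question before you start accumulating.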