# Working with strings

It is possible to extract quite a lot of interesting, structured information from text data simply by using string processing techniques.

In this session, we'll see how to do some of these things, specifically calculating word frequencies and showing keywords in context (concordances). We'll do this for individual files, and then you'll work together to write Python code which does this for a larger corpus of texts.

In [4]:

import os # operating system
import re # regex
import string # string processing tools 
from collections import Counter, OrderedDict

__Loading text files__

We start by defining a filepath using ```os.path.join()``` like we saw last week.

In [5]:
filename = os.path.join("..", "data", "Dickens_Expectations_1861.txt")
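As a small illustration of what ```os.path.join()``` does, the sketch below builds the same relative path and compares it with a version joined by hand. The name `example_path` is only used here for illustration.

```python
import os

# build the relative path piece by piece; the separator is chosen for the current OS
parts = ["..", "data", "Dickens_Expectations_1861.txt"]
example_path = os.path.join(*parts)

# joining by hand with os.sep gives the same string on this machine
manual = os.sep.join(parts)

print(example_path)
print(example_path == manual)
```

The advantage of ```os.path.join()``` is that the same line of code works whether the notebook runs on Windows, macOS or Linux.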

We then need to load the file that we want to work with.

There are a number of ways to do this in Python, but the following should be considered "best practice".

In [6]:
with open(filename, "r", encoding="utf-8-sig") as file:  # open filename in read mode, bound to the variable name file
    text = file.read()

# the first attempt showed something odd in the text, so encoding="utf-8-sig" was added;
# it drops a leading byte order mark (BOM) if the file has one
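To show what the ```utf-8-sig``` encoding changes, here is a minimal sketch that writes a tiny temporary file starting with a byte order mark and reads it back both ways. The file name `sample.txt` and its contents are made up for the example; whether the Dickens file actually contains a BOM is an assumption.

```python
import os
import tempfile

# write a tiny file that starts with a UTF-8 byte order mark (BOM)
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")
    with open(path, "wb") as f:
        f.write(b"\xef\xbb\xbfGreat Expectations")

    # plain utf-8 keeps the BOM as an invisible first character
    with open(path, "r", encoding="utf-8") as f:
        plain = f.read()

    # utf-8-sig removes the BOM if it is present
    with open(path, "r", encoding="utf-8-sig") as f:
        clean = f.read()

print(repr(plain[:6]))
print(repr(clean[:6]))
```

Files without a BOM are read identically by both encodings, so ```utf-8-sig``` is a safe default for text of unknown origin.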

When we load the text file, we just have a simple string object which can be indexed and sliced.

In [7]:
# print(text[:300]) would render the line breaks;
# evaluating the slice directly shows the raw string instead, with the "\n" characters visible
text[:300]

"REAT EXPECTATIONS\n 1867 Edition \nby Charles Dickens\nChapter I\nMy father's family name being Pirrip, and my Christian name Philip, my\ninfant tongue could make of both names nothing longer or more explicit\nthan Pip. So, I called myself Pip, and came to be called Pip.\nI give Pirrip as my father's famil"

You can see that there are some formatting things that are a little funky, such as lots of newline breaks.

We can get rid of those by using the ```.replace()``` method on strings.

In [8]:
text.replace("\n", " ")  # returns a new string; text itself is unchanged unless we reassign it

ibited the same fat five fingers. "Hah!" he went on, handing me the bread and butter. "And air you a going to Joseph?" "In heaven\'s name," said I, firing in spite of myself, "what does it matter to you where I am going? Leave that teapot alone." It was the worst course I could have taken, because it gave Pumblechook the opportunity he wanted. "Yes, young man," said he, releasing the handle of the article in question, retiring a step or two from my table, and speaking for the behoof of the landlord and waiter at the door, "I will leave that teapot alone. You are right, young man. For once you are right. I forgit myself when I take such an interest in your breakfast, as to wish your frame, exhausted by the debilitating effects of prodigygality, to be stimilated by the \'olesome nourishment of your forefathers. And yet," said Pumblechook, turning to the landlord and waiter, and pointing me out at arm\'s length, "this is him as I ever sported with in his days of happy infancy! Tell me not
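Strings in Python are immutable, so ```.replace()``` never edits the original. A short sketch with a made-up `sample` string shows the difference between calling the method and keeping its result:

```python
sample = "My father's family name\nbeing Pirrip"

# .replace() builds and returns a new string
flattened = sample.replace("\n", " ")

# the original still contains the newline
print(repr(sample))
print(repr(flattened))
```

To keep the cleaned version, assign the result to a name (as with `flattened` here) or back to the original variable.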

__Tokenize text__

So far, we have one long string of characters. But we want to be able to work with individual words. To do that, we have to *tokenize* our data - in other words, to split it into individual tokens (or words).

In [9]:
tokens = text.split()  # with no argument, split() splits on any run of whitespace (spaces, tabs, newlines)
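The difference between ```.split()``` with and without an argument is easy to miss. The following sketch uses a made-up `sample` string with a double space and a newline in it:

```python
sample = "My  father's\nfamily name"

# no argument: any run of whitespace counts as one separator
print(sample.split())

# explicit single space: the double space produces an empty string,
# and the newline is not treated as a separator at all
print(sample.split(" "))
```

This is why the tokens come out clean even though the newlines were never removed from `text`.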

In [10]:
tokens 

['REAT',
 'EXPECTATIONS',
 '1867',
 'Edition',
 'by',
 'Charles',
 'Dickens',
 'Chapter',
 'I',
 'My',
 "father's",
 'family',
 'name',
 'being',
 'Pirrip,',
 'and',
 'my',
 'Christian',
 'name',
 'Philip,',
 'my',
 'infant',
 'tongue',
 'could',
 'make',
 'of',
 'both',
 'names',
 'nothing',
 'longer',
 'or',
 'more',
 'explicit',
 'than',
 'Pip.',
 'So,',
 'I',
 'called',
 'myself',
 'Pip,',
 'and',
 'came',
 'to',
 'be',
 'called',
 'Pip.',
 'I',
 'give',
 'Pirrip',
 'as',
 'my',
 "father's",
 'family',
 'name,',
 'on',
 'the',
 'authority',
 'of',
 'his',
 'tombstone',
 'and',
 'my',
 'sister,',
 '-',
 'Mrs.',
 'Joe',
 'Gargery,',
 'who',
 'married',
 'the',
 'blacksmith.',
 'As',
 'I',
 'never',
 'saw',
 'my',
 'father',
 'or',
 'my',
 'mother,',
 'and',
 'never',
 'saw',
 'any',
 'likeness',
 'of',
 'either',
 'of',
 'them',
 'for',
 'their',
 'days',
 'were',
 'long',
 'before',
 'the',
 'days',
 'of',
 'photographs',
 ',',
 'my',
 'first',
 'fancies',
 'regarding',
 'what',
 't

__Get sentences with regex__

We can use a similar logic to split the data into separate sentences.

This time we use a bit of ```regex``` to do our string splitting.

In [20]:
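sentences = re.split(r'[.?!]\s*', text)

# split wherever one of the punctuation marks listed in [] occurs,
# together with any whitespace (zero or more characters) that follows it;
# the matched punctuation is dropped from the results
# this is a rough way of splitting into sentences: abbreviations such as "Mrs." also trigger a split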
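To see how this pattern behaves, it helps to try it on a short string. The `sample` text below is invented for the example; it shows both the split on an abbreviation and the empty string left over at the end.

```python
import re

sample = "Mrs. Joe was not pleased. Was she ever? No!"
parts = re.split(r"[.?!]\s*", sample)
print(parts)

# drop the empty string left over after the final "!"
sample_sentences = [p for p in parts if p]
print(sample_sentences)
```

A dedicated sentence tokenizer would handle abbreviations better, but for a first look at the data this simple split is often good enough.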

## Find word frequencies

We can count how many times an individual word appears manually, simply by iterating over the list of tokens and using a counter. 

To do this, we loop over the list with a plain ```for``` loop and add one to the counter each time a token matches the keyword.

In [21]:
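counter = 0
keyword = "love"

for token in tokens:
    # strip any punctuation left at either end of the token
    stripped = token.strip(string.punctuation)
    # make lower case
    lowered = stripped.lower()
    # note: the raw token is compared here, so only the exact string "love" is counted;
    # compare lowered == keyword instead to also count forms such as "Love" or "love,"
    if token == keyword:
        counter += 1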

In [14]:
print(counter)

43


In [22]:
tokens.count("love")

43

In [23]:
cleaned = []
for token in tokens:
    # remove punctuation from both ends of the token
    stripped = token.strip(string.punctuation)
    # make lower case
    lowered = stripped.lower()
    # add to new list
    cleaned.append(lowered)
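To see why the cleaning step matters for counting, here is a minimal sketch on a made-up list of tokens (`sample_tokens` is not part of the novel):

```python
import string

sample_tokens = "Love, love; LOVE! I love you.".split()

# counting the raw tokens only finds the exact string "love"
raw_count = sample_tokens.count("love")

# after stripping punctuation and lowercasing, all four forms match
sample_cleaned = [t.strip(string.punctuation).lower() for t in sample_tokens]
cleaned_count = sample_cleaned.count("love")

print(raw_count, cleaned_count)
```

Note that ```.strip()``` only removes characters from the two ends of a token, so the apostrophe inside a word like "father's" is left alone.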


In [18]:
Counter(cleaned)
# returns a Counter, a dictionary subclass that maps each token to its count

# note that 'i' mixes the pronoun with the Roman numeral used in chapter headings

Counter(cleaned).most_common()
# now a list of (token, count) tuples, sorted from most to least frequent

[('the', 8143),
 ('and', 7078),
 ('i', 6484),
 ('to', 5079),
 ('of', 4431),
 ('a', 4041),
 ('in', 3025),
 ('that', 2988),
 ('was', 2836),
 ('it', 2669),
 ('he', 2208),
 ('you', 2184),
 ('had', 2093),
 ('my', 2070),
 ('me', 1996),
 ('his', 1858),
 ('as', 1773),
 ('with', 1760),
 ('at', 1639),
 ('on', 1419),
 ('for', 1381),
 ('', 1377),
 ('said', 1349),
 ('her', 1172),
 ('him', 1150),
 ('have', 1084),
 ('but', 1068),
 ('not', 1067),
 ('be', 1034),
 ('she', 888),
 ('when', 882),
 ('by', 809),
 ('so', 794),
 ('were', 794),
 ('out', 784),
 ('if', 781),
 ('we', 761),
 ('this', 747),
 ('all', 734),
 ('mr', 711),
 ('joe', 692),
 ('there', 690),
 ('is', 655),
 ('no', 643),
 ('what', 634),
 ('up', 612),
 ('been', 609),
 ('would', 599),
 ('from', 579),
 ('or', 566),
 ('which', 548),
 ('one', 494),
 ('into', 492),
 ('could', 483),
 ('do', 457),
 ('now', 453),
 ('an', 445),
 ('then', 438),
 ('more', 404),
 ('very', 398),
 ('your', 388),
 ('miss', 383),
 ('know', 382),
 ('come', 374),
 ('little', 37
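The `('', 1377)` entry in the output above most likely comes from tokens that consist only of punctuation, such as the lone `-` in the token list, which become empty strings once the punctuation is stripped. A small sketch on invented data shows how these could be filtered out, and how ```most_common()``` accepts a number to limit the output:

```python
import string
from collections import Counter

sample_tokens = "the cat - and the dog , and the bird".split()
sample_cleaned = [t.strip(string.punctuation).lower() for t in sample_tokens]

# keep only tokens that are non-empty after cleaning
non_empty = [t for t in sample_cleaned if t]

# how many empty strings the cleaning step produced
print(Counter(sample_cleaned)[""])

# only the two most frequent tokens
print(Counter(non_empty).most_common(2))
```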

We can use a similar logic to find all sentences where a certain keyword appears.

In [24]:
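# reassigning the loop variable would not change the list, so build a new lowercased list instead
lowered_sentences = [sentence.lower() for sentence in sentences]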

In [31]:
keyword = "love"
# add white space around the keyword so that words like "lovely" do not match
modified_kw = " " + keyword + " "

# enumerate() yields (index, item) pairs; the name idx is arbitrary but conventional
for idx, sentence in enumerate(sentences):
    # make everything lowercase
    lowered = sentence.lower()
    # strip punctuation from both ends of the sentence
    stripped = lowered.strip(string.punctuation)
    # search the cleaned sentence, but print the original
    if modified_kw in stripped:
        print(idx, sentence)

 pillow, "I
love her, I love her, I love her
5611 "Herbert," said I, laying my hand upon his knee, "I love - I
adore - Estella
6898 I thought I saw him leer in an ugly way at me while the
decanters were going round, but as there was no love lost between us,
that might easily be
6931 To the present moment, I believe it to have been referable
to some pure fire of generosity and disinterestedness in my love for
her, that I could not endure the thought of her stooping to that hound
7223 Isn't  there bright eyes somewheres, wot you love the
thoughts on
8107 "
"Estella," said I, turning to her now, and trying to command my
trembling voice, "you know I love you
8127 When you say you love me, I know what you mean, as a form
of words; but nothing more
8146 "
"You cannot love him, Estella
8409 But what a blessing it is for the son of my father and mother to love a
girl who has no relations, and who can never bother herself or anybody
else about her family
9993 "
"Herbert, I shall always need you

In [30]:
for idx,token in enumerate(tokens):
    print(idx, token)

38 to
186639 me,"
186640 returned
186641 Estella,
186642 very
186643 earnestly,
186644 "'God
186645 bless
186646 you,
186647 God
186648 forgive
186649 you!'
186650 And
186651 if
186652 you
186653 could
186654 say
186655 that
186656 to
186657 me
186658 then,
186659 you
186660 will
186661 not
186662 hesitate
186663 to
186664 say
186665 that
186666 to
186667 me
186668 now,
186669 -
186670 now,
186671 when
186672 suffering
186673 has
186674 been
186675 stronger
186676 than
186677 all
186678 other
186679 teaching,
186680 and
186681 has
186682 taught
186683 me
186684 to
186685 understand
186686 what
186687 your
186688 heart
186689 used
186690 to
186691 be.
186692 I
186693 have
186694 been
186695 bent
186696 and
186697 broken,
186698 but
186699 -
186700 I
186701 hope
186702 -
186703 into
186704 a
186705 better
186706 shape.
186707 Be
186708 as
186709 considerate
186710 and
186711 good
186712 to
186713 me
186714 as
186715 you
186716 were,
186717 and
186718 tell
186719 me
186720 we
186721 are
1
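For a closer look at what ```enumerate()``` produces, here is a minimal sketch on a short invented list. The optional `start` argument is part of the built-in function; the variable names are only for illustration.

```python
sample_tokens = ["My", "father's", "family", "name"]

# enumerate() pairs each item with its position, counting from 0 by default
pairs = list(enumerate(sample_tokens))
print(pairs)

# the optional start argument changes the first number
for number, word in enumerate(sample_tokens, start=1):
    print(number, word)
```

The index is what later makes it possible to look at the tokens just before and just after a match.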

In [34]:
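# my attempt at what we were asked: collect every sentence that contains the keyword
love_sentences = []
keyword = "love"

for sentence in sentences:
    # re.search() takes the pattern first and then the string to search in
    # (a bare keyword also matches longer words such as "lovely")
    if re.search(keyword, sentence):
        love_sentences.append(sentence)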
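Searching for the keyword padded with spaces misses some occurrences, for example at the very end of a sentence or directly before a comma. One possible alternative, sketched here on invented sentences, is a regex with word boundaries (```\b```); the names `sample_kw`, `padded_hits` and `boundary_hits` are only for illustration.

```python
import re

sample_sentences = [
    "I love her",
    "She is my love",
    "What a lovely day",
    "Love, he said",
]
sample_kw = "love"

# approach 1: keyword padded with spaces, checked against the lowercased sentence
padded = " " + sample_kw + " "
padded_hits = [s for s in sample_sentences if padded in s.lower()]

# approach 2: \b marks a word boundary; re.escape() guards against
# special regex characters in the keyword
pattern = re.compile(r"\b" + re.escape(sample_kw) + r"\b", re.IGNORECASE)
boundary_hits = [s for s in sample_sentences if pattern.search(s)]

print(padded_hits)
print(boundary_hits)
```

The word-boundary version finds the keyword at the start and end of a sentence and next to punctuation, while still skipping "lovely".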

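Python also has some built-in tools for counting how many times a token appears in a list, such as the ```.count()``` method on lists and ```Counter``` from ```collections```, both used above.

There are some problems, though! Raw tokens keep their punctuation and upper-case letters, so forms such as "love," and "Love" are counted separately from "love" unless the tokens are cleaned first.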

## Viewing keywords in context (KWIC, concordancing)

In [35]:
keyword = "love"

# every token (word) and its index
for idx, token in enumerate(cleaned):
    # if the word is our keyword, love
    if token == keyword:
        # join the 5 tokens before the keyword (not including it) into one string;
        # max() stops the slice start from going negative near the start of the list
        before = ' '.join(cleaned[max(0, idx-5):idx])
        # join the 5 tokens after the keyword, not including it
        after = ' '.join(cleaned[idx+1:idx+6])
        # put them in the order we want
        full = [before, token, after]
        # format the three parts as columns padded to widths 50, 20 and 50 (str.format() is an older style than f-strings)
        print(idx, "{:50} {:20} {:50}".format(*full))  # *full unpacks the list into three separate arguments
        # drop idx from print() if the index number is not wanted

ou i have loved you                              
138382 comprehend when you say you                        love                 me i know what you                                
138590 replied quite true you cannot                      love                 him estella her fingers stopped                   
138790 to the few who truly                               love                 you among those few there                         
143523 my father and mother to                            love                 a girl who has no                                 
145192 a little affair of true                            love                 i felt as if the                                  
146996 for the genius of youthful                         love                 being in want of assistance                       
153258 little girl to rear and                            love                 and save from my fate                             
172831 you because i shall always       
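Two details of the KWIC code are worth trying out on a small scale: how a slice behaves when its start would be negative, and how the numbers in the format string control column width and alignment. The sketch below uses an invented token list; the right-aligned left context (`{:>20}`) is a common concordance layout, not what the code above does.

```python
sample_cleaned = "you love me and i know it".split()
pos = 1  # position of "love", close to the start of the list

# a negative start counts from the end of the list, so this slice comes back empty
unclamped = sample_cleaned[pos - 5:pos]
# clamping the start at 0 keeps whatever context is available
clamped = sample_cleaned[max(0, pos - 5):pos]
print(unclamped, clamped)

# {:>20} right-aligns within 20 characters, {:^8} centres, {:<20} left-aligns
line = "{:>20} {:^8} {:<20}".format("say you", "love", "me")
print(repr(line))
```

Right-aligning the left context puts the keyword in the same column on every line, which makes the output easier to scan.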

## Exercises

In groups, work on the following exercises in class. 

I've left these somewhat underspecified, so you're welcome to solve them in whatever way you please, and to save the results in whatever format you think works best.

- Write some code which searches through *all* of the novels in the folder called *100 English Novels* and shows how many times a given keyword appears in each novel.
   - Save your results in a way which 
- Turn the KWIC code above into a function which can be used to show *all* occurrences of a keyword in the corpus. 
  - Bonus: Your output should match the results above, but with an additional column showing the filename
  - Bonus: Write your function in such a way that a user can define the context window size to display.