# Introduction to Language Processing

Patterns within written text are not the same across all authors or languages.
This allows linguists to study the language of origin
or potential authorship of texts where these characteristics are not
directly known, such as the Federalist Papers of the American Revolution.
In this case study, we will examine the properties of individual books
in a book collection from various authors and various languages.
More specifically, we will look at book lengths, number of unique words,
and how these attributes cluster by language or authorship.

Project Gutenberg is the oldest digital library of books.
It aims to digitize and archive cultural works and, at present,
contains over 50,000 books, all previously published
and now available electronically.

We have downloaded a collection of over 100 titles
from Project Gutenberg for analysis as a sample library for this case study.

The downloaded sample consists of several nested folders.
At the top level, we have four languages: English, French, German,
and Portuguese.
For each language, we have one to four authors, 13 authors in total.
For each author, we have 1 to 16 books, 102 books in total.
Some authors appear in several language categories
because their books are available as translations in several languages.

Our goal is to write a function that, given
a string of text, counts the number of times each unique word appears.

In [1]:
text="This is my test text. We're keeping this short"
def count_words(text):
    """
    Count the number of times each word occurs in text (str).
    Return a dictionary where keys are unique words and values are word counts.
    """
    word_counts={}
    for word in text.split(" "):
        #known word
        if word in word_counts:
            word_counts[word]+=1
        #unknown word
        else:
            word_counts[word]=1
    return word_counts
count_words(text) 

{'This': 1,
 'is': 1,
 'my': 1,
 'test': 1,
 'text.': 1,
 "We're": 1,
 'keeping': 1,
 'this': 1,
 'short': 1}

We still have to address capitalization and punctuation, both of which can make the same word count as different words (for example, 'This' and 'this', or 'text.' and 'text').

In [2]:
def count_words(text):
    """
    Count the number of times each word occurs in text (str).
    Return a dictionary where keys are unique words and values are word counts.
    Case is ignored and the punctuation marks listed in skips are removed.
    """
    text=text.lower()
    #a double quote is written inside single quotes (and vice versa) so Python reads it as a string
    skips=[".",",",";",":","'",'"']
    word_counts={}
    for ch in skips:
        text=text.replace(ch,"")
    for word in text.split(" "):
        #known word
        if word in word_counts:
            word_counts[word]+=1
        #unknown word
        else:
            word_counts[word]=1
    return word_counts
count_words(text)

{'this': 2,
 'is': 1,
 'my': 1,
 'test': 1,
 'text': 1,
 'were': 1,
 'keeping': 1,
 'short': 1}
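
The punctuation step above can also be written with `str.translate`. The following is only a sketch of that alternative, not the function used in the rest of this case study: `count_words_translate` is a hypothetical name, and it removes every character in `string.punctuation`, which is a larger set than the `skips` list, so its counts on real books would differ from the ones reported here.

```python
import string

def count_words_translate(text):
    """Hypothetical variant of count_words that strips punctuation with str.translate."""
    # Translation table that deletes every ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    cleaned = text.lower().translate(table)
    word_counts = {}
    # split() with no argument splits on any run of whitespace and drops empty strings.
    for word in cleaned.split():
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts

sample = "This is my test text. We're keeping this short"
result = count_words_translate(sample)
print(result)
```

On this short test string the result should match the dictionary shown above; the two approaches only diverge once the text contains punctuation outside the `skips` list or repeated whitespace.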

Now we implement the same function using the Counter class from Python's standard collections module.

In [3]:
from collections import Counter

def count_words_fast(text):
    """
    Count the number of times each word occurs in text (str).
    Return a Counter where keys are unique words and values are word counts.
    Case is ignored and the punctuation marks listed in skips are removed.
    """
    text=text.lower()
    #a double quote is written inside single quotes (and vice versa) so Python reads it as a string
    skips=[".",",",";",":","'",'"']
    for ch in skips:
        text=text.replace(ch,"")
    word_counts=Counter(text.split(" "))
    return word_counts
count_words_fast(text)

Counter({'this': 2,
         'is': 1,
         'my': 1,
         'test': 1,
         'text': 1,
         'were': 1,
         'keeping': 1,
         'short': 1})

In [4]:
count_words_fast(text)==count_words(text)

True

In [5]:
len(count_words("This comprehension check is to check for comprehension."))

6

In [6]:
count_words(text) is count_words_fast(text)

False
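
The two comparisons above give different answers because `==` compares contents while `is` compares object identity. A small sketch with made-up counts (the dictionaries below are illustrative, not taken from a book):

```python
from collections import Counter

plain = {"this": 2, "short": 1}
fast = Counter({"this": 2, "short": 1})

# Counter is a subclass of dict, so it can be compared with a plain dict.
print(isinstance(fast, dict))
# == is True when both objects hold the same keys and values.
print(plain == fast)
# is checks whether both names refer to the very same object.
print(plain is fast)
# Counter also adds conveniences such as most_common().
print(fast.most_common(1))
```

Each call to `count_words` or `count_words_fast` builds a new object, which is why the identity check returns False even though the contents agree.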

### Reading in a book

In [7]:
def read_book(title_path):
    """
    Read a book and return it as a string
    """
    with open(title_path,"r",encoding="utf-8") as current_file:
        text=current_file.read()
        text=text.replace("\n","").replace("\r","")
    return text


In [8]:
text=read_book("./Books/English/shakespeare/Romeo and Juliet.txt")
len(text)

169275
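
One detail of `read_book` worth being aware of: it deletes line breaks rather than replacing them with spaces. If a line ends directly after a word and the next line starts directly with one, the two words are fused. The sketch below shows the effect on a made-up string and a possible alternative; it is an illustration under that assumption, not a change to the function used in this case study.

```python
raw = "a rose\nby any other name\r\nwould smell"

# Deleting line breaks (as read_book does) can fuse neighbouring words.
joined = raw.replace("\n", "").replace("\r", "")
print(joined.split(" "))

# Hypothetical alternative: turn line breaks into spaces, then split on any whitespace.
spaced = raw.replace("\r", " ").replace("\n", " ")
print(spaced.split())
```

Whether this matters for a given book depends on how its lines are laid out, and switching approaches would change the word counts reported in the rest of this notebook.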

Finding a famous passage in the play

In [9]:
ind=text.find("What's in a name?")
ind

42757
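
`str.find` returns the index of the first match, or -1 when the phrase does not occur. A short sketch on a made-up string shows both cases and why the -1 case should be checked before slicing:

```python
passage = "What's in a name? That which we call a rose"

start = passage.find("name")      # index of the first match
missing = passage.find("Juliet")  # -1 signals "not found"
print(start, missing)

# Check for -1 before slicing; a negative start index would not raise an error,
# it would just return the wrong slice.
if missing == -1:
    print("phrase not found")
```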

In [10]:
sample_text=text[ind:ind+1000]
sample_text

"What's in a name? That which we call a rose    By any other name would smell as sweet.    So Romeo would, were he not Romeo call'd,    Retain that dear perfection which he owes    Without that title. Romeo, doff thy name;    And for that name, which is no part of thee,    Take all myself.  Rom. I take thee at thy word.    Call me but love, and I'll be new baptiz'd;    Henceforth I never will be Romeo.  Jul. What man art thou that, thus bescreen'd in night,    So stumblest on my counsel?  Rom. By a name    I know not how to tell thee who I am.    My name, dear saint, is hateful to myself,    Because it is an enemy to thee.    Had I it written, I would tear the word.  Jul. My ears have yet not drunk a hundred words    Of that tongue's utterance, yet I know the sound.    Art thou not Romeo, and a Montague?  Rom. Neither, fair saint, if either thee dislike.  Jul. How cam'st thou hither, tell me, and wherefore?    The orchard walls are high and hard to climb,    And the place death, consid

### Computing Word Frequency Statistics

In [11]:
def word_stats(word_counts):
    """Return the number of unique words and the word frequencies."""
    num_unique=len(word_counts) #number of unique words in a text
    counts=word_counts.values() #frequencies of each word in the text
    return (num_unique,counts)
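
To see what `word_stats` returns, here is a quick check on a toy dictionary. The function is repeated so the snippet runs on its own, and `toy_counts` is made-up data.

```python
def word_stats(word_counts):
    """Same logic as the cell above, repeated so this snippet is self-contained."""
    num_unique = len(word_counts)
    counts = word_counts.values()
    return (num_unique, counts)

toy_counts = {"this": 2, "is": 1, "short": 1}
num_unique, counts = word_stats(toy_counts)
print(num_unique)    # 3 distinct words
print(sum(counts))   # 4 words in total
# counts is a dict view, not a list; wrap it in list() if a snapshot is needed.
print(list(counts))
```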

In [12]:
text=read_book("./Books/English/shakespeare/Romeo and Juliet.txt")

In [13]:
word_counts=count_words(text)

In [14]:
(num_unique,counts)=word_stats(word_counts)

In [15]:
num_unique

5118

To find the total number of words in the text, we sum the counts:

In [16]:
sum(counts)

40776

Comparing Romeo and Juliet to its German translation

In [17]:
text=read_book("./Books/German/shakespeare/Romeo und Julia.txt")
word_counts=count_words(text)

In [18]:
(num_unique,counts)=word_stats(word_counts)

In [19]:
num_unique

7527

In [20]:
sum(counts)

20311
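
One simple way to put the two versions side by side is the share of unique words among all words. The sketch below only does the arithmetic on the four numbers printed above; the dictionary layout and the name `ratios` are illustrative choices.

```python
# Figures copied from the outputs above.
books = {
    "Romeo and Juliet (English)": {"unique": 5118, "total": 40776},
    "Romeo und Julia (German)": {"unique": 7527, "total": 20311},
}

# Share of unique words = number of unique words / total number of words.
ratios = {name: b["unique"] / b["total"] for name, b in books.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

The German version has the higher share, but part of that may come from how words are split and which punctuation is removed rather than from the language itself, so the comparison is best treated as a rough indication.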

## Reading multiple files

In [21]:
import os
book_dir="./Books"

We want to walk through every language and author folder in our book directory, print the path of each book, and read it:

In [22]:
for language in os.listdir(book_dir):
    for author in os.listdir(book_dir+"/"+language):
        for title in os.listdir(book_dir+"/"+language+"/"+author):
            inputfile=book_dir+"/"+language+"/"+author+"/"+title
            print(inputfile)
            text=read_book(inputfile)
            (num_unique,stats)=word_stats(count_words(text))

./Books/English/shakespeare/A Midsummer Night's Dream.txt
./Books/English/shakespeare/Hamlet.txt
./Books/English/shakespeare/Macbeth.txt
./Books/English/shakespeare/Othello.txt
./Books/English/shakespeare/Richard III.txt
./Books/English/shakespeare/Romeo and Juliet.txt
./Books/English/shakespeare/The Merchant of Venice.txt
./Books/French/chevalier/L'enfer et le paradis de l'autre monde.txt
./Books/French/chevalier/L'île de sable.txt
./Books/French/chevalier/La capitaine.txt
./Books/French/chevalier/La fille des indiens rouges.txt
./Books/French/chevalier/La fille du pirate.txt
./Books/French/chevalier/Le chasseur noir.txt
./Books/French/chevalier/Les derniers Iroquois.txt
./Books/French/de Maupassant/Boule de Suif.txt
./Books/French/de Maupassant/Claire de Lune.txt
./Books/French/de Maupassant/Contes de la Becasse.txt
./Books/French/de Maupassant/L'inutile beauté.txt
./Books/French/de Maupassant/La Main Gauche.txt
./Books/French/de Maupassant/La Maison Tellier.txt
./Books/French/de Mau
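
The same language/author/title traversal can be written with `os.path.join`, which picks the right path separator for the operating system. The sketch below builds a tiny temporary folder tree so it runs anywhere; the folder and file names inside it are placeholders, not the real collection.

```python
import os
import tempfile

# Build a small stand-in for the Books folder (placeholder names).
with tempfile.TemporaryDirectory() as book_dir:
    for language, author, title in [
        ("English", "shakespeare", "Hamlet.txt"),
        ("German", "shakespeare", "Hamlet.txt"),
    ]:
        folder = os.path.join(book_dir, language, author)
        os.makedirs(folder, exist_ok=True)
        with open(os.path.join(folder, title), "w", encoding="utf-8") as f:
            f.write("placeholder text")

    found = []
    # sorted() makes the order deterministic; os.listdir gives no ordering guarantee.
    for language in sorted(os.listdir(book_dir)):
        for author in sorted(os.listdir(os.path.join(book_dir, language))):
            for title in sorted(os.listdir(os.path.join(book_dir, language, author))):
                found.append((language, author, title))

print(found)
```

Note that on Windows `os.path.join` produces backslashes, so printed paths would look different from the forward-slash paths shown above.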

In [23]:
num_unique

9700

#### pandas
pandas gets its name from "panel data".

In [24]:
import pandas as pd

We create a table in pandas using a DataFrame:

In [25]:
table=pd.DataFrame(columns=("name","age"))

Now we add new entries to our table:

In [26]:
table.loc[1]="James",22

In [27]:
table.loc[2]="Jess",32

In [28]:
table

|   | name | age |
|---|------|-----|
| 1 | James | 22 |
| 2 | Jess | 32 |


In [29]:
#to get columns
table.columns

Index(['name', 'age'], dtype='object')
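
Adding rows one at a time with `.loc` is fine for a small table. A commonly used alternative, sketched here with the same two entries, is to collect the rows in a list first and build the DataFrame in one step; the variable name `rows` is an illustrative choice.

```python
import pandas as pd

# Collect the rows in a plain list, then build the table in one go.
rows = [
    {"name": "James", "age": 22},
    {"name": "Jess", "age": 32},
]
table = pd.DataFrame(rows, index=[1, 2])
print(table)

# Individual cells are still reachable by row label and column name.
print(table.loc[2, "age"])
```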

Now we will use a pandas DataFrame to keep track of our books.

In [30]:
import pandas as pd
import os

#create empty table with 5 columns
stats=pd.DataFrame(columns=("language","author","title","length","unique"))
title_num=1


In [31]:
for language in os.listdir(book_dir):
    for author in os.listdir(book_dir + "/" + language):
        for title in os.listdir(book_dir + "/" + language + '/' + author):
            inputfile = book_dir + '/' + language + '/' + author + '/' + title
            print(inputfile)
            text = read_book(inputfile)
            (num_unique, counts) = word_stats(count_words(text))
            stats.loc[title_num] = language, author, title, sum(counts), num_unique
            title_num += 1

In [32]:
stats

|   | language | author | title | length | unique |
|---|----------|--------|-------|--------|--------|
| 1 | English | shakespeare | A Midsummer Night's Dream.txt | 16103 | 4345 |
| 2 | English | shakespeare | Hamlet.txt | 28551 | 6776 |
| 3 | English | shakespeare | Macbeth.txt | 16874 | 4780 |
| 4 | English | shakespeare | Othello.txt | 26590 | 5898 |
| 5 | English | shakespeare | Richard III.txt | 48315 | 5449 |
| ... | ... | ... | ... | ... | ... |
| 98 | Portuguese | Queirós | O crime do padre Amaro.txt | 128630 | 29300 |
| 99 | Portuguese | Queirós | O Mandarim.txt | 21440 | 7836 |
| 100 | Portuguese | Queirós | O Primo Bazilio.txt | 107303 | 27644 |
| 101 | Portuguese | Queirós | Os Maias.txt | 195771 | 40665 |


In [33]:
#shows top 5 lines
stats.head()

|   | language | author | title | length | unique |
|---|----------|--------|-------|--------|--------|
| 1 | English | shakespeare | A Midsummer Night's Dream.txt | 16103 | 4345 |
| 2 | English | shakespeare | Hamlet.txt | 28551 | 6776 |
| 3 | English | shakespeare | Macbeth.txt | 16874 | 4780 |
| 4 | English | shakespeare | Othello.txt | 26590 | 5898 |
| 5 | English | shakespeare | Richard III.txt | 48315 | 5449 |


In [35]:
#shows bottom 5 lines
stats.tail()

|   | language | author | title | length | unique |
|---|----------|--------|-------|--------|--------|
| 98 | Portuguese | Queirós | O crime do padre Amaro. | 128630 | 29300 |
| 99 | Portuguese | Queirós | O Mandarim. | 21440 | 7836 |
| 100 | Portuguese | Queirós | O Primo Bazilio. | 107303 | 27644 |
| 101 | Portuguese | Queirós | Os Maias. | 195771 | 40665 |
| 102 | Portuguese | Shakespeare | Hamlet. | 30567 | 9700 |


In [37]:
import pandas as pd
import os

#create empty table with 5 columns
stats=pd.DataFrame(columns=("language","author","title","length","unique"))
title_num=1
for language in os.listdir(book_dir):
    for author in os.listdir(book_dir + "/" + language):
        for title in os.listdir(book_dir + "/" + language + '/' + author):
            inputfile = book_dir + '/' + language + '/' + author + '/' + title
            print(inputfile)
            text = read_book(inputfile)
            (num_unique, counts) = word_stats(count_words(text))
            stats.loc[title_num] = language, author.capitalize(), title.replace(".txt",""), sum(counts), num_unique
            title_num += 1

In [38]:
stats.head()

|   | language | author | title | length | unique |
|---|----------|--------|-------|--------|--------|
| 1 | English | Shakespeare | A Midsummer Night's Dream | 16103 | 4345 |
| 2 | English | Shakespeare | Hamlet | 28551 | 6776 |
| 3 | English | Shakespeare | Macbeth | 16874 | 4780 |
| 4 | English | Shakespeare | Othello | 26590 | 5898 |
| 5 | English | Shakespeare | Richard III | 48315 | 5449 |


In [39]:
stats.tail()

|   | language | author | title | length | unique |
|---|----------|--------|-------|--------|--------|
| 98 | Portuguese | Queirós | O crime do padre Amaro | 128630 | 29300 |
| 99 | Portuguese | Queirós | O Mandarim | 21440 | 7836 |
| 100 | Portuguese | Queirós | O Primo Bazilio | 107303 | 27644 |
| 101 | Portuguese | Queirós | Os Maias | 195771 | 40665 |
| 102 | Portuguese | Shakespeare | Hamlet | 30567 | 9700 |
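
With the table in place, rows can be filtered and summarised by language, which is the kind of comparison this case study is aiming at. The sketch below uses only four Shakespeare rows copied from the tables above, so its numbers describe that small sample and not the full 102-book table; the names `sample`, `english`, and `summary` are illustrative.

```python
import pandas as pd

# Four rows copied from the tables above; the full table has 102 rows.
sample = pd.DataFrame(
    [
        ("English", "Shakespeare", "Hamlet", 28551, 6776),
        ("English", "Shakespeare", "Macbeth", 16874, 4780),
        ("English", "Shakespeare", "Othello", 26590, 5898),
        ("Portuguese", "Shakespeare", "Hamlet", 30567, 9700),
    ],
    columns=("language", "author", "title", "length", "unique"),
)

# Boolean indexing selects the rows for one language.
english = sample[sample.language == "English"]
print(english[["title", "length", "unique"]])

# groupby + mean gives one summary row per language.
summary = sample.groupby("language")[["length", "unique"]].mean()
print(summary)
```

The same two operations should work unchanged on the full `stats` table built above.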
