## Language Processing
Examine Properties of individual books
    - book lenth
    - number of unique words
    - how attributes cluster by language or authorship

<b> Project Gutenberg </b> - oldest digital library of books with over 50,000 books
are in a public domain - can be downloaded and read for free

In [1]:
text = "This is a sentence. First words of sentences start with a capital letter and end with a period, question mark, or exclamation in the English language."

In [2]:
def count_words(text):
    """
    count the number of times each word occurs in text.
    Returns dictionary key = words and values = word count_words
    """
    word_counts = {}
    for word in text.split(" "):
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1
    return word_counts

print (count_words(text))

{'This': 1, 'is': 1, 'a': 3, 'sentence.': 1, 'First': 1, 'words': 1, 'of': 1, 'sentences': 1, 'start': 1, 'with': 2, 'capital': 1, 'letter': 1, 'and': 1, 'end': 1, 'period,': 1, 'question': 1, 'mark,': 1, 'or': 1, 'exclamation': 1, 'in': 1, 'the': 1, 'English': 1, 'language.': 1}


Periods and puncuations are counted as words along with mixed case words"
Can address mixed case words my making the string all lowercase.
Addressing punctuaction - specify all the punctuations to skip then loop over that container
and replace every occurrence of a punctuaction with an empty string

In [3]:
def count_words2(text):
    """
    count the number of times each word occurs in text.
    Returns dictionary key = words and values = word count_words
    accounts for mixed case words and punctuaction
    """
    text = text.lower()
    skips = [".", ",", ":", ";", "?", "!", '""', "''" ]
    for ch in skips:
        text = text.replace(ch, "")
    word_counts = {}
    for word in text.split(" "):
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1
    return word_counts

print (count_words2(text))


{'this': 1, 'is': 1, 'a': 3, 'sentence': 1, 'first': 1, 'words': 1, 'of': 1, 'sentences': 1, 'start': 1, 'with': 2, 'capital': 1, 'letter': 1, 'and': 1, 'end': 1, 'period': 1, 'question': 1, 'mark': 1, 'or': 1, 'exclamation': 1, 'in': 1, 'the': 1, 'english': 1, 'language': 1}


Python provides a counter tool from the Collections module"

In [7]:
from collections import Counter
def count_words_faster(text):
    text = text.lower()
    skips = [".", ",", ":", ";", "?", "!", '""', "''" ]
    for ch in skips:
        text = text.replace(ch, "")
    word_counts = Counter(text.split(" "))
    return word_counts

print (count_words_faster(text))

print (count_words2(text) == count_words_faster(text))

print (len(count_words2("This comprehension check is to check for comprehension.")))

print (count_words2(text) is count_words_faster(text))

Counter({'a': 3, 'with': 2, 'this': 1, 'is': 1, 'sentence': 1, 'first': 1, 'words': 1, 'of': 1, 'sentences': 1, 'start': 1, 'capital': 1, 'letter': 1, 'and': 1, 'end': 1, 'period': 1, 'question': 1, 'mark': 1, 'or': 1, 'exclamation': 1, 'in': 1, 'the': 1, 'english': 1, 'language': 1})
True
6
False


### Reading in a Book
Character encoding = how a computer reads characters
UTF-8 dominant encoding

In [16]:
def read_book(title_path):
    """
    Read a book and return it as a string
    """
    with open(title_path, "r", encoding="utf-8") as current_file:
        text = current_file.read()
        text.replace("\n", "").replace("\r","")
    return text

In [9]:
text_book = read_book("./Books/English/shakespeare/Romeo and Juliet.txt")

print ("Number of characters in Romeo and Juliet:",len(text_book))

Number of characters in Romeo and Juliet: 174128


In [10]:
ind = text_book.find("What's in a name?")
print (ind)

sample_text = text_book[ind:ind + 1000]
print (sample_text)

44114
What's in a name? That which we call a rose
    By any other name would smell as sweet.
    So Romeo would, were he not Romeo call'd,
    Retain that dear perfection which he owes
    Without that title. Romeo, doff thy name;
    And for that name, which is no part of thee,
    Take all myself.

  Rom. I take thee at thy word.
    Call me but love, and I'll be new baptiz'd;
    Henceforth I never will be Romeo.

  Jul. What man art thou that, thus bescreen'd in night,
    So stumblest on my counsel?

  Rom. By a name
    I know not how to tell thee who I am.
    My name, dear saint, is hateful to myself,
    Because it is an enemy to thee.
    Had I it written, I would tear the word.

  Jul. My ears have yet not drunk a hundred words
    Of that tongue's utterance, yet I know the sound.
    Art thou not Romeo, and a Montague?

  Rom. Neither, fair saint, if either thee dislike.

  Jul. How cam'st thou hither, tell me, and wherefore?
    The orchard walls are high and hard to clim

### Computing Word Frequency Statistics
how many unique words and frequency in a given book

In [51]:
def count_words_faster(text):
    text = text.lower()
    skips = [".", ",", ":", ";", "?", "!", '""', "''" ]
    for ch in skips:
        text = text.replace(ch, "")
    word_counts = Counter(text.split(" "))
    return word_counts

def word_stats(t):
    """return number of unique words and frequencies"""
    num_unique = len(word_counts)
    counts = word_counts.values()
    return (num_unique, counts)

word_counts = count_words_faster(text_book)
(num_unique, counts) = word_stats(word_counts)

print ("Number of words in Romeo and Juliet:",(num_unique))
print ("Number of unique words in Romeo and Juliet:", sum(counts))

Number of words in Romeo and Juliet: 5926
Number of unique words in Romeo and Juliet: 40776


In [22]:
text_book_German = read_book("./Books/German/shakespeare/Romeo und Julia.txt")
print ("Number of characters in Romeo und Julia:",len(text_book_German))

Number of characters in Romeo und Julia: 142630


In [23]:
def word_stats(t):
    """return number of unique words and frequencies"""
    num_unique = len(word_counts)
    counts = word_counts.values()
    return (num_unique, counts)

word_counts = count_words_faster(text_book_German)
(num_unique, counts) = word_stats(word_counts)

print ("Number of words in Romeo und Julia:",num_unique)
print ("Number of unique words in Romeo und Julia:", sum(counts))

Number of words in Romeo und Julia: 7407
Number of unique words in Romeo und Julia: 20311


### Reading Multiple Files
read every book in the various subdirectories of the Books folder <br/>
`import os`

In [24]:
import os

In [25]:
book_dir = "./Books"
os.listdir(book_dir)

['English', 'French', 'German', 'Portuguese']

In [50]:
for language in os.listdir(book_dir):
    for author in os.listdir(book_dir + "/" + language):
        for title in os.listdir(book_dir + "/" + language + "/" + author):
            inputfile = book_dir + "/" + language + "/" + author + "/" + title
            #print (inputfile)
            text = read_book(inputfile)
            (num_unique, counts) = word_stats(count_words(text))
            # takes text - feeds into count_words function
            # returns an object that gets pushed into the words text function
            # which returns as an object thats returned as a tuple
            # unpacking a tuple


### Pandas


In [34]:
import pandas as pd

In [41]:
table = pd.DataFrame(columns = ("name", "age"))
table.loc[1] = "Amber" , 30
table.loc[2] = "Kumani", 6
print (table.columns)
table

Index(['name', 'age'], dtype='object')


Unnamed: 0,name,age
1,Amber,30.0
2,Kumani,6.0


In [53]:
def count_words_faster(text):
    text = text.lower()
    skips = [".", ",", ":", ";", "?", "!", '""', "''" ]
    for ch in skips:
        text = text.replace(ch, "")
    word_counts = Counter(text.split(" "))
    return word_counts

def word_stats(t):
    """return number of unique words and frequencies"""
    num_unique = len(word_counts)
    counts = word_counts.values()
    return (num_unique, counts)

stats = pd.DataFrame(columns = ("language", "author", "title", "length", "unique"))
title_num = 1
for language in os.listdir(book_dir):
    for author in os.listdir(book_dir + "/" + language):
        for title in os.listdir(book_dir + "/" + language + "/" + author):
            inputfile = book_dir + "/" + language + "/" + author + "/" + title
            # print (inputfile)
            text = read_book(inputfile)
            (num_unique, counts) = word_stats(count_words_faster(text))
            stats.loc[title_num] = language, author, title, sum(counts), num_unique
            title_num += 1
stats    

Unnamed: 0,language,author,title,length,unique
1,English,shakespeare,A Midsummer Night's Dream.txt,40776.0,5926.0
2,English,shakespeare,Hamlet.txt,40776.0,5926.0
3,English,shakespeare,Macbeth.txt,40776.0,5926.0
4,English,shakespeare,Othello.txt,40776.0,5926.0
5,English,shakespeare,Richard III.txt,40776.0,5926.0
6,English,shakespeare,Romeo and Juliet.txt,40776.0,5926.0
7,English,shakespeare,The Merchant of Venice.txt,40776.0,5926.0
8,French,chevalier,L'a╠èle de sable.txt,40776.0,5926.0
9,French,chevalier,L'enfer et le paradis de l'autre monde.txt,40776.0,5926.0
10,French,chevalier,La capitaine.txt,40776.0,5926.0


In [55]:
def count_words_faster(text):
    text = text.lower()
    skips = [".", ",", ":", ";", "?", "!", '""', "''" ]
    for ch in skips:
        text = text.replace(ch, "")
        word_counts = Counter(text.split(" "))
    return word_counts

def word_stats(t):
    """return number of unique words and frequencies"""
    num_unique = len(word_counts)
    counts = word_counts.values()
    return (num_unique, counts)

stats = pd.DataFrame(columns = ("language", "author", "title", "length", "unique"))
title_num = 1
for language in os.listdir(book_dir):
    for author in os.listdir(book_dir + "/" + language):
        for title in os.listdir(book_dir + "/" + language + "/" + author):
            inputfile = book_dir + "/" + language + "/" + author + "/" + title
            # print (inputfile)
            text = read_book(inputfile)
            (num_unique, counts) = word_stats(count_words(text))
            # modifications capitalize first letter in title and remove .txt extension
            stats.loc[title_num] = language, author.capitalize(), title.replace(".txt", ""), sum(counts), num_unique
            title_num += 1
stats

Unnamed: 0,language,author,title,length,unique
1,English,Shakespeare,A Midsummer Night's Dream,40776.0,5926.0
2,English,Shakespeare,Hamlet,40776.0,5926.0
3,English,Shakespeare,Macbeth,40776.0,5926.0
4,English,Shakespeare,Othello,40776.0,5926.0
5,English,Shakespeare,Richard III,40776.0,5926.0
6,English,Shakespeare,Romeo and Juliet,40776.0,5926.0
7,English,Shakespeare,The Merchant of Venice,40776.0,5926.0
8,French,Chevalier,L'a╠èle de sable,40776.0,5926.0
9,French,Chevalier,L'enfer et le paradis de l'autre monde,40776.0,5926.0
10,French,Chevalier,La capitaine,40776.0,5926.0


### Plotting Book Statistics

In [48]:
stats.length

1      20311.0
2      20311.0
3      20311.0
4      20311.0
5      20311.0
6      20311.0
7      20311.0
8      20311.0
9      20311.0
10     20311.0
11     20311.0
12     20311.0
13     20311.0
14     20311.0
15     20311.0
16     20311.0
17     20311.0
18     20311.0
19     20311.0
20     20311.0
21     20311.0
22     20311.0
23     20311.0
24     20311.0
25     20311.0
26     20311.0
27     20311.0
28     20311.0
29     20311.0
30     20311.0
        ...   
73     20311.0
74     20311.0
75     20311.0
76     20311.0
77     20311.0
78     20311.0
79     20311.0
80     20311.0
81     20311.0
82     20311.0
83     20311.0
84     20311.0
85     20311.0
86     20311.0
87     20311.0
88     20311.0
89     20311.0
90     20311.0
91     20311.0
92     20311.0
93     20311.0
94     20311.0
95     20311.0
96     20311.0
97     20311.0
98     20311.0
99     20311.0
100    20311.0
101    20311.0
102    20311.0
Name: length, dtype: float64

In [49]:
stats.unique

1      7407.0
2      7407.0
3      7407.0
4      7407.0
5      7407.0
6      7407.0
7      7407.0
8      7407.0
9      7407.0
10     7407.0
11     7407.0
12     7407.0
13     7407.0
14     7407.0
15     7407.0
16     7407.0
17     7407.0
18     7407.0
19     7407.0
20     7407.0
21     7407.0
22     7407.0
23     7407.0
24     7407.0
25     7407.0
26     7407.0
27     7407.0
28     7407.0
29     7407.0
30     7407.0
        ...  
73     7407.0
74     7407.0
75     7407.0
76     7407.0
77     7407.0
78     7407.0
79     7407.0
80     7407.0
81     7407.0
82     7407.0
83     7407.0
84     7407.0
85     7407.0
86     7407.0
87     7407.0
88     7407.0
89     7407.0
90     7407.0
91     7407.0
92     7407.0
93     7407.0
94     7407.0
95     7407.0
96     7407.0
97     7407.0
98     7407.0
99     7407.0
100    7407.0
101    7407.0
102    7407.0
Name: unique, dtype: float64