# Session 1: Setup

This introductory session gets us all started on the same page. Today, we will download and install Python, the Classical Language Toolkit (CLTK), the Natural Language Toolkit (NLTK), and a few helper packages that we will need later on! These steps are only necessary if you are using your own computer for this workgroup. The second half of this first session is an exercise that demonstrates a perhaps impractical application of computational text analysis.

## So let's get started: Anaconda, I choose you!

Please install the newest version of Anaconda (https://www.anaconda.com/download). This software lets us use Jupyter notebooks (what you're reading from right now!), which are a great way to test code in a modular format, allowing for speedy changes and far less frustration!

## Next up: Installing the CLTK & NLTK

Let's install these awesome packages:

In [None]:
# This cell installs the CLTK into the same Python environment that this notebook runs in
import sys
!{sys.executable} -m pip install cltk

# Depending on your machine, you may need to open a terminal and install the CLTK manually.
# Come chat with me if this happens.

In [None]:
# Let's do the same for the NLTK
!{sys.executable} -m pip install -U nltk

## Now we just need to install some dependencies

The following packages will help us out later on as we start doing more advanced things!

In [None]:
# NumPy provides fast arrays and the numerical routines we might need
!{sys.executable} -m pip install numpy

# pandas is good for working with and analyzing tabular data. We'll do more with both packages later.
!{sys.executable} -m pip install pandas
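
Before moving on, it can help to confirm that the installs actually landed in the environment the notebook is running in. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is made up for this example and is not part of any of the packages above.

```python
from importlib import metadata

def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

# Report on each package this session asks us to install.
for name in ['cltk', 'nltk', 'numpy', 'pandas']:
    version = installed_version(name)
    print(f'{name}: {version if version else "NOT INSTALLED"}')
```

If any line reports `NOT INSTALLED`, rerun the matching install cell or install that package from a terminal.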

# Introduction: Latin Palindromes

To whet our appetite for *computational text analysis*, let's play around with finding Latin palindromes. This exercise comes from Patrick Burns' blog *Disiecta Membra*.
Link here: https://disiectamembra.wordpress.com/2017/03/26/finding-palindromes-in-the-latin-library/

In [None]:
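# Imports
# Note: the CLTK import paths below follow the older (pre-1.0) package layout.
# If they raise an ImportError, your installed CLTK is probably a newer release.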
import string
import re
from collections import Counter
from pprint import pprint
from cltk.tokenize.word import WordTokenizer
from cltk.stem.latin.j_v import JVReplacer

# These statements set up tools that help us normalize the texts.
# They will be discussed in more detail in the next session.
word_tokenizer = WordTokenizer('latin')
replacer = JVReplacer()

# This function uses the previously defined tools to preprocess the texts.
# It comes, essentially unmodified, from Patrick Burns.
def preprocess(text):

    # Normalizing ligatures
    text = re.sub(r'&aelig;','ae',text)
    text = re.sub(r'&AElig;','AE',text)
    text = re.sub(r'&oelig;','oe',text)
    text = re.sub(r'&OElig;','OE',text)
    
    # Replacing null characters with spaces
    text = re.sub('\x00', ' ', text)

    # Lowercasing all the text
    text = text.lower()

    # Replacing j's with i's and v's with u's
    text = replacer.replace(text)

    # More normalizing work
    text = re.sub(r'&lt;', '<', text)
    text = re.sub(r'&gt;', '>', text)
    
    # Getting rid of punctuation
    punctuation = "\"#$%&'()*+,-/:;<=>@[\\]^_`{|}~.?!"
    translator = str.maketrans({key: " " for key in punctuation})
    text = text.translate(translator)
    
    translator = str.maketrans({key: " " for key in '0123456789'})
    text = text.translate(translator)

    # Getting rid of some standard Latin Library titles
    remove_list = [r'\bthe latin library\b',
                   r'\bthe classics page\b',
                   r'\bneo latin\b',  # the hyphen has already been replaced by a space above
                   r'\bmedieval latin\b',
                   r'\bchristian latin\b',
                   r'\bthe miscellany\b'
                  ]

    for pattern in remove_list:
        text = re.sub(pattern, '', text)
    
    text = re.sub(r'[ ]+', ' ', text) # Remove double spaces
    text = re.sub(r'\s+\n+\s+', '\n', text) # Remove double lines and trim spaces around new lines
    
    return text
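
To see what the punctuation and digit steps in `preprocess` do, here is a small standalone sketch of the same `str.maketrans` / `translate` technique. It is not Burns' code: it assumes Python's built-in `string.punctuation` and `string.digits` in place of the hand-written lists above, and the sample line is chosen only for illustration.

```python
import re
import string

sample = 'Arma virumque cano, Troiae qui primus ab oris (1.1)!'

# Map every punctuation mark and digit to a space.
translator = str.maketrans({char: ' ' for char in string.punctuation + string.digits})
cleaned = sample.lower().translate(translator)

# Collapse the runs of spaces left behind and trim the ends.
cleaned = re.sub(r' +', ' ', cleaned).strip()

print(cleaned)
```

Replacing characters with spaces, rather than deleting them, keeps words apart when the source text has punctuation but no space between them.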

In [None]:
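from cltk.corpus.utils.importer import CorpusImporter

# List the Latin corpora that are available for download
corpus_importer = CorpusImporter('latin')
print(corpus_importer.list_corpora)

# Download the CLTK's Latin models and the Latin Library texts
corpus_importer.import_corpus('latin_models_cltk')
corpus_importer.import_corpus('latin_text_latin_library')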

In [None]:
# Get the Latin Library corpus

from cltk.corpus.readers import get_corpus_reader
ll = get_corpus_reader(language='latin', 
                       corpus_name='latin_text_latin_library')
files = ll.fileids()
print(files[:50]) # The first 50 files in the corpus

In [None]:
# Stats

file_count = len(files)
print(f'There are {file_count} files in this corpus.')

In [None]:
# Importing the raw text of the entire Latin Library
latinlibrary_whole = ll.raw()
print(latinlibrary_whole[:100])

In [None]:
# Now we use our handy-dandy function from P.B. to process the raw text.
ll_text = preprocess(latinlibrary_whole)
print(ll_text[:100])

In [None]:
# This line splits the text based on whitespace. We don't need a fancy method
# for splitting enclitics or anything here, since we are only interested in 
# whether a word, even with an enclitic, forms a palindrome.
ll_tokens = ll_text.split()

# We remove all tokens (words) that are shorter than 3 characters.
ll_tokens = [token for token in ll_tokens if len(token) > 2]

# We remove tokens made up of a single repeated character (e.g. 'iii'),
# which would otherwise count as trivial palindromes.
ll_tokens = [token for token in ll_tokens if token != len(token)*token[0]]
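
The two filters are easier to follow on a handful of words. This is a toy sketch on an invented sample string, not on the Latin Library itself; the variable names are chosen only for the example.

```python
sample_text = 'et in arcadia ego iii non sum ut ama'
tokens = sample_text.split()

# Keep tokens of three or more characters.
tokens = [token for token in tokens if len(token) > 2]

# Drop tokens that repeat one character, such as the numeral 'iii'.
tokens = [token for token in tokens if token != len(token) * token[0]]

print(tokens)
```

The expression `len(token) * token[0]` builds a string of the token's first character repeated to the token's length, so a match means every character in the token is the same.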

In [None]:
# Let's define a function to check if a word is a palindrome or not:
def is_palindrome(token):
    return token == token[::-1]
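
A quick check on a few Latin words shows the function at work. The function is restated so that this sketch runs on its own, and the word list is just an illustrative sample.

```python
def is_palindrome(token):
    # A palindrome equals its own reverse; the slice [::-1] reverses a string.
    return token == token[::-1]

for word in ['esse', 'ibi', 'roma', 'sumus']:
    print(word, is_palindrome(word))
```

Note that `roma` reversed is `amor`: a valid word, but not a palindrome.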

In [None]:
# Now we filter the tokens (words) from the Latin Library, keeping only
# the palindromes. This line makes a list of every palindrome token in
# this corpus, repeats included.
palindromes = [token for token in ll_tokens if is_palindrome(token)]

In [None]:
# How many palindrome tokens are there in total, counting repeats?
print(len(palindromes))

In [None]:
# We can determine the most common ones:
c = Counter(palindromes)
print(c.most_common(10))

In [None]:
# We can make a list of the unique palindromes, longest first
palindromes = sorted(c, key=len, reverse=True)

# This line lets us see how many unique palindromes exist in this corpus
print(len(palindromes))
print(palindromes[:10])
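
The difference between counting every occurrence and counting unique palindromes is easiest to see on a tiny list. The sketch below uses a made-up list of tokens in place of the real corpus results.

```python
from collections import Counter

toy_palindromes = ['non', 'esse', 'non', 'ibi', 'sumus', 'esse', 'non']

# Count how often each palindrome occurs and show the two most frequent.
counts = Counter(toy_palindromes)
print(counts.most_common(2))

# Iterating over a Counter yields each distinct item once.
unique_by_length = sorted(counts, key=len, reverse=True)
print(len(toy_palindromes), len(unique_by_length))
print(unique_by_length)
```

Here seven tokens reduce to four unique palindromes. Because Python's sort is stable, words of equal length keep the order in which they first appeared.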