## Welcome to Day Two: 
# Text Analysis and Text Mining, The Liberator

Digital technologies have made vast amounts of text available to researchers, and this same technological moment has provided us with the capacity to analyze that text faster than humanly possible. The first step in that analysis is to transform texts designed for human consumption into a form a computer can analyze. Using Python and the Natural Language Toolkit (commonly called NLTK), this morning session introduces strategies to turn qualitative texts into quantitative objects. Along the way, we will present a variety of strategies for simple analysis of text-based data.

In this morning session, you will learn to:

- Compare the frequency distributions of words in a text
- Clean and standardize your data, using powerful tools such as stemmers and lemmatizers
- Prepare texts for computational analysis, including strategies for transforming texts into numbers
- Tokenize your data and put it in a format compatible with the Natural Language Toolkit
- Use NLTK methods such as `concordance` and `similar`
- Understand stop words and remove them when needed

# About The Liberator

![J.L. Edmond photo](https://images.squarespace-cdn.com/content/v1/5a35aa12e5dd5b6cf4dee74e/1518408666587-SISWY5K1X7AQSS5B732A/jledmond-home-story-cover-01.png?format=2500w)

![Liberator GIF](https://images.squarespace-cdn.com/content/v1/5a35aa12e5dd5b6cf4dee74e/1518401679225-F6TB2P29TLCTY805JPMG/jle-archiveGIF.gif?format=1000w)

## Text as Data

![text_as_data.jpeg](attachment:text_as_data.jpeg)

When we think of “data,” we often think of numbers, things that can be summarized, statisticized, and graphed. Rarely when I ask people “what is data?” do they respond “*Invisible Man*.” And yet, more and more, text is data. Whether it is _Invisible Man_, every romance novel written since 1750, or today’s newspaper or Twitter feed, we are able to transform written (and spoken) language into data that can be quantified and visualized. This has been done for a while, but now we can do it on a much larger scale and much faster.
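
As a small illustration of text becoming data, the sketch below turns one sentence into a table of word counts. It uses only Python's standard library, and the sample sentence is made up for this example rather than taken from any workshop text.

```python
from collections import Counter

# A made-up sample sentence (an assumption for illustration only).
text = "The press is free. The press is the voice of the people."

# Lowercase the text, split it on whitespace, and strip punctuation
# from the edges of each word.
words = [w.strip(".,;:!?\"'") for w in text.lower().split()]

# Count how often each word appears: the sentence is now numbers.
counts = Counter(words)

print(len(words))             # total number of words
print(counts.most_common(3))  # the three most frequent words
```

Once a text is in this form, it can be summarized, compared, and graphed like any other dataset.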

# How do we turn a print book into machine-readable text? 

![Liberator_Print_vs_OCR.png](attachment:Liberator_Print_vs_OCR.png)

   - Scanning and Optical Character Recognition (OCR) 
   
Where do we find machine-readable texts? 
   - Archives: Internet Archive, Project Gutenberg, Chronicling America, HathiTrust, JSTOR 

Is there an easier way to analyze texts? 
   - Yes! You can use ready-made tools, such as Voyant. 

We are taking you inside the black box to demystify the process. You can then build the skills to use more sophisticated tools and methods. 

## Definitions

![Definitions.png](attachment:Definitions.png)

Let's unpack this definition a bit. 

1. In the case of text mining, "*__computational analysis__*" means using computer algorithms to analyze text(s). There are a variety of methods you can employ to analyze your corpus computationally.

2. Next we have "*__vast quantities__*" of texts. How much data (or texts) do we need when undertaking a text mining project? There’s no exact threshold you need to meet, but generally, the more data you compile, the more meaningful your results are. 

3. Finally, we need digital, free-form, natural language, *__unstructured texts__*. The key term here is unstructured. Data can be divided into structured and unstructured formats. Unstructured texts are data not formatted according to an encoding structure, like HTML or XML, whereas structured data generally takes the form of a spreadsheet or is encoded according to some standard. To undertake a text mining project, your data (i.e., the text) must be unstructured.

# Text Analysis Workflow

![Workflow.png](attachment:Workflow.png)

 
1. Text cleaning/parsing (to select relevant portions of text(s) for analysis)
   - Clean OCR errors
   - Remove extraneous content (e.g., front & back matter, chapter headings, indices, etc.)
   - Remove potential encoding structures (such as stripping the HTML or XML markup from web-based texts)
   
2. Pre-processing (to standardize your corpus)
    - Tokenizing/segmenting
    - Stemming/lemmatization
    
3. Analysis
    - Keyword/feature extraction
    - NLP (Natural Language Processing): part-of-speech tagging, named entity recognition
    - Stylometry/information-theoretical analysis
    - Similarity measurements and clustering
    - Text classification
    - Sentiment analysis
    - Topic modeling, vector-space analysis
    
4. Visualization
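
The first three steps of this workflow can be sketched in a few lines of standard-library Python. This is a minimal sketch, not the workshop's actual pipeline: the sample HTML is made up, and `TagStripper`, `clean`, `tokenize`, and `crude_stem` are hypothetical helper names. In particular, `crude_stem` is a deliberately naive stand-in for a real stemmer or lemmatizer.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text between tags, discarding the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def clean(raw_html):
    """Step 1: remove encoding structures and tidy up whitespace."""
    stripper = TagStripper()
    stripper.feed(raw_html)
    stripper.close()
    return " ".join(" ".join(stripper.chunks).split())


def tokenize(text):
    """Step 2a: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Step 2b: naive suffix stripping (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


# Made-up sample input (an assumption, not from the workshop corpus).
raw = (
    "<html><body><h1>Notice</h1>"
    "<p>Printers printed the papers. Readers read the paper.</p>"
    "</body></html>"
)

text = clean(raw)
tokens = tokenize(text)
stems = [crude_stem(t) for t in tokens]
frequencies = Counter(stems)  # Step 3: a simple frequency analysis

print(text)
print(frequencies.most_common(3))
```

Notice that "papers" and "paper" are counted together after stemming, which is the point of standardizing a corpus before analysis.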

## Python’s Core Text Mining Modules

![Python_TM_Modules%20Modules-2.png](attachment:Python_TM_Modules%20Modules-2.png)

Python is a programming language with simple, easy-to-learn syntax that emphasizes readability. We can extend Python's functionality by importing libraries. Libraries are collections of pre-written code that Python can use to perform specialized functions. One such library is the Natural Language Toolkit (NLTK). NLTK is a rich library of natural language processing *tools* and *datasets*. Because it is written for Python, it allows users to write powerful natural language processing programs with relatively short sections of code.
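
To show what importing a library looks like, here is a minimal sketch that imports one NLTK function, `wordpunct_tokenize`, and uses it to split a sentence into tokens. The sample sentence is made up. The `try`/`except` wrapper is an assumption added only so the sketch still runs on a machine where NLTK has not been installed yet; the fallback uses a similar regular expression to split words from punctuation.

```python
import re

try:
    # NLTK is a third-party library: install it first with `pip install nltk`.
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback for this sketch only, so it runs without NLTK installed.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)

sentence = "Liberty for all, and liberty now."  # made-up sample sentence

# Split the sentence into word tokens and punctuation tokens.
tokens = wordpunct_tokenize(sentence)
print(tokens)
```

In a notebook where NLTK is already installed, the single `from nltk.tokenize import wordpunct_tokenize` line is all that is needed.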