<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Project-1:-Text-Processing" data-toc-modified-id="Project-1:-Text-Processing-1">Project 1: Text Processing</a></span></li><li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes-2">Learning Outcomes</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#By-the-end-of-this-project,-you-should-be-able-to:" data-toc-modified-id="By-the-end-of-this-project,-you-should-be-able-to:-2.0.1">By the end of this project, you should be able to:</a></span></li></ul></li></ul></li><li><span><a href="#Task-1.-Load-data-" data-toc-modified-id="Task-1.-Load-data--3">Task 1. Load data </a></span></li><li><span><a href="#Task-2.-Data-cleaning" data-toc-modified-id="Task-2.-Data-cleaning-4">Task 2. Data cleaning</a></span></li><li><span><a href="#Task-3-Count-words" data-toc-modified-id="Task-3-Count-words-5">Task 3 Count words</a></span></li><li><span><a href="#Task-4.-Calculate-probabilities-of-individual-words" data-toc-modified-id="Task-4.-Calculate-probabilities-of-individual-words-6">Task 4. Calculate probabilities of individual words</a></span></li><li><span><a href="#Task-5.-Calculate-probabilities-of-a-group-of-words" data-toc-modified-id="Task-5.-Calculate-probabilities-of-a-group-of-words-7">Task 5. Calculate probabilities of a group of words</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-8">Summary</a></span></li></ul></div>

Project 1: Text Processing
-----

Text is a very valuable source of data. Think of the information in books, reports, emails, and on websites. However, that value can be difficult to extract, for several reasons. Text is less structured than tabular data. Also, data in string format is harder for a computer to process than data in numerical format.

The first step in most Natural Language Processing projects is text processing. This project guides you through implementing fundamental text processing algorithms from scratch.

Learning Outcomes
----

#### By the end of this project, you should be able to:

- Work with text data in Python.
- Efficiently process large amounts of text data.
- Apply your knowledge of Python: assignments, expressions, if and loop statements, functions, lists, and libraries. 
- Write the following text processing functions from scratch:
    1. Automatically load data from subfolders.
    2. Split text into separate words.
    3. Count the occurrences of words.
    4. Calculate probabilities of individual words.
    5. Calculate probabilities of a group of words.

In [None]:
reset -fs

Task 1. Load data 
-----

You'll need a corpus for NLP. A corpus is a collection of related documents.

For this project, we are going to use a collection of [slate.com](https://slate.com) articles. The data has already been downloaded for you.

For example, in folder `./text/1` there is `Article247_4.txt`. The web version is [here](https://slate.com/culture/1999/03/harmonic-convergences.html).

The first function is the `load_data` function.

Write code that implements the following steps:

1. We need to walk all of the subfolders and load all of the text data, efficiently and reliably. A good tool for this is [pathlib](https://docs.python.org/3/library/pathlib.html) from the standard library.

1. As we do that, create an index that keeps track of each document's name. Tracking metadata is very important in a data science project.

1. For each document, remove all line breaks (`"\n"`) and split the continuous string into separate words (called tokenization) using only `str` methods. A mature NLP system would use an established text processing package or regular expressions. For now, `str` methods will work.

As always, do not import anything else.
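
The steps above can be sketched on a tiny, made-up folder tree. This is only an illustration under stated assumptions: the folder names, file names, and file contents are hypothetical, and `tempfile` is imported solely to make the sketch self-contained (it is not needed in the project solution).

```python
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Hypothetical layout: two subfolders with one small text file each
    samples = [("1", "a.txt", "Hello\nworld"), ("2", "b.txt", "Good  bye\n")]
    for folder, name, text in samples:
        (root / folder).mkdir()
        (root / folder / name).write_text(text)

    # rglob("*.txt") is equivalent to glob("**/*.txt"): it walks every subfolder
    paths = sorted(root.rglob("*.txt"))

    # The index of document names, kept parallel to the token lists
    names = [p.name for p in paths]

    # split() with no arguments splits on any run of whitespace,
    # so line breaks disappear and no empty strings are produced
    tokens = [p.read_text().split() for p in paths]

print(names)   # ['a.txt', 'b.txt']
print(tokens)  # [['Hello', 'world'], ['Good', 'bye']]
```

Sorting the paths makes the order of the index deterministic, since the order in which files are yielded is not guaranteed.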

In [123]:
from pathlib import Path

path = Path('./text/')

In [124]:
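from typing import List, Tuple

def load_data(path: Path) -> Tuple[List[Path], List[List[str]]]:
    """Load every .txt file under `path` and return (title_index, docs)."""
    title_index = []  # path of each document, in load order
    docs = []         # tokens of each document, parallel to title_index

    # '**/*.txt' matches text files in `path` and in all of its subfolders
    for file_path in path.glob('**/*.txt'):
        with open(file_path) as f:
            title_index.append(file_path)
            # split() with no arguments splits on any whitespace,
            # which also removes the "\n" line breaks
            docs.append(f.read().split())

    return title_index, docs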

In [125]:
"""
Test code for load_data function.
5 points
This cell should NOT give any errors when it is run.
"""

# Run function to load documents into memory
title_index, docs = load_data(path)

# Sort both lists by path, aka title_index
title_index, docs = [list(x) for x in zip(*sorted(zip(title_index, docs), key=lambda pair: pair[0]))]

n_docs = 4_530 
assert len(docs)   == n_docs
assert len(title_index) == n_docs

i = 0 
assert str(title_index[i]) == 'text/1/Article247_4.txt'
assert len(docs[i]) == 586
assert docs[i][:10] == ['Harmonic',
                         'Convergences',
                         "You're",
                         'right,',
                         "Maxim's",
                         'strong',
                         'point',
                         'is',
                         'that',
                         "it's"]
i = -1 
assert str(title_index[i]) == 'text/9/Article247_3299.txt'
assert len(docs[i]) == 756
assert docs[i][-10:] == ['of',
                         'some',
                         'future',
                         '14-volume',
                         'biography',
                         'of',
                         'Rosalynn',
                         'Carter.',
                         'Yours,',
                         'Chris']

Task 2. Data cleaning
-----

Now that we have loaded the data from disk into memory, we can process it further.

Write code that implements the following:

1. Lowercase all words
1. Remove any remaining line breaks
1. Remove punctuation
1. Remove any empty strings (e.g., strings that were just punctuation and are now empty)

Do not handle apostrophes specially: remove them like any other punctuation (e.g., `you're` becomes `youre`).

Use only `str` processing methods.

Try to write efficient code. The two best ways to write efficient code are:

1. Use the best data structure for the problem. Review Python's built-in types [here](https://docs.python.org/3/library/stdtypes.html).
1. Minimize the number of complete passes through the data. 

Your code should run in less than 10 seconds (my code takes less than 5 seconds on my machine). If your code times out during autograding, you'll lose a lot of points.

Hints: 

1. Write helper functions. 
1. Then process the documents with list comprehensions.
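
As a rough sketch of these hints on made-up tokens: a helper function cleans one word, and a single list comprehension cleans and filters a document. The names `PUNCTUATION` and `strip_punctuation` and the sample tokens are assumptions for illustration, and this is only one of several possible approaches.

```python
from string import punctuation

# A set gives constant-time membership tests for each character
PUNCTUATION = set(punctuation)

def strip_punctuation(word: str) -> str:
    """Lowercase `word` and drop every punctuation character."""
    return "".join(ch for ch in word.lower() if ch not in PUNCTUATION)

# Hypothetical tokens, not taken from the corpus
doc = ["You're", "right,", "--", "14-volume", "Carter."]

# Clean each word and keep only the non-empty results
cleaned = [w for w in map(strip_punctuation, doc) if w]

print(cleaned)  # ['youre', 'right', '14volume', 'carter']
```

Filtering inside the comprehension avoids repeatedly calling `list.remove("")`, which rescans the list on every call.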

In [129]:
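from string import punctuation

# Translation table that deletes every punctuation character in one pass
remove_punctuation = str.maketrans('', '', punctuation)

def clean_word(word: str) -> str:
    """Lowercase a word and strip all punctuation from it."""
    return word.lower().translate(remove_punctuation)

# Clean each document in place, dropping words that become empty strings.
# Line breaks were already removed by split() when the data was loaded.
for i, doc in enumerate(docs):
    cleaned = (clean_word(word) for word in doc)
    docs[i] = [word for word in cleaned if word]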

In [130]:
"""
Test code for data cleaning.
5 points
This cell should NOT give any errors when it is run.
"""

i = 0 
assert len(docs[i]) == 586
assert docs[i][:10] == ['harmonic',
                         'convergences',
                         'youre',
                         'right',
                         'maxims',
                         'strong',
                         'point',
                         'is',
                         'that',
                         'its']
i = -1 
assert len(docs[i]) == 753
assert docs[i][-10:] == ['of',
                         'some',
                         'future',
                         '14volume',
                         'biography',
                         'of',
                         'rosalynn',
                         'carter',
                         'yours',
                         'chris']

Task 3 Count words
-----

Data science is mostly counting stuff (e.g., clicks, users, …). In NLP, we count words.

Write a function that counts the occurrence of each word across all documents.
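
A minimal sketch of counting across documents with `collections.Counter`, using two made-up documents (`toy_docs` and `toy_counts` are hypothetical names for illustration):

```python
from collections import Counter

# Hypothetical, already-cleaned documents
toy_docs = [["the", "cat", "sat"], ["the", "dog", "sat", "the"]]

toy_counts = Counter()
for doc in toy_docs:
    toy_counts.update(doc)  # add this document's word counts to the running total

print(toy_counts.most_common(2))  # [('the', 3), ('sat', 2)]
print(toy_counts["lambda"])       # 0, because missing words count as zero
```

A `Counter` returns 0 for a word it has never seen instead of raising a `KeyError`, which is convenient when looking up arbitrary words.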

In [133]:
from collections import Counter
from typing import List  # do not import typing.Counter: it would shadow collections.Counter

def count(docs: List[List[str]]) -> Counter:
    """Count how often each word occurs across all documents."""
    word_counts = Counter()

    # update() adds each document's counts without building one giant list
    for doc in docs:
        word_counts.update(doc)

    return word_counts

In [134]:
"""
Test code for count function
5 points
This cell should NOT give any errors when it is run.
"""

word_counts = count(docs)

assert word_counts.most_common(3)  == [('the', 263_910), 
                                       ('of',  115_388), 
                                       ('to',  107_005)]
assert word_counts['brian']  == 110 
assert word_counts['lambda'] == 0

Task 4. Calculate probabilities of individual words
-----

Now that we have counts, we can normalize the counts to create probabilities. Probability is the foundation of Statistics.

Write a function `word_prob` that finds the probability of a given word occurring compared to all other words.

__Modeling Sidebar__

We are building a [probability mass function](https://en.wikipedia.org/wiki/Probability_mass_function) (PMF) for the entire corpus. 

We could have also built a PMF per document and compared the relative frequency of words between documents. 

```python
word_count_doc_0 = Counter(docs[0])
word_count_doc_1 = Counter(docs[1])
```
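
To make the per-document idea concrete, here is a small sketch that builds one PMF per document and compares the relative frequency of a word between them. The helper `pmf` and the two sample documents are hypothetical and not part of the project code.

```python
from collections import Counter

def pmf(doc):
    """Map each word in `doc` to its relative frequency within that document."""
    counts = Counter(doc)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Hypothetical cleaned documents
doc_a = ["the", "cat", "sat", "the"]
doc_b = ["the", "dog", "ran", "far", "away"]

pmf_a, pmf_b = pmf(doc_a), pmf(doc_b)

# "the" makes up half of doc_a but only a fifth of doc_b
print(pmf_a["the"], pmf_b["the"])  # 0.5 0.2
```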

In [137]:
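def word_prob(counter, word):
    """Return the relative frequency of `word` among all counted words."""
    # Total number of word occurrences in the corpus
    total = sum(counter.values())

    # A Counter returns 0 for unseen words, so their probability is 0.0
    return counter[word] / total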

In [138]:
"""
Test code for word_prob function
3 points
This cell should NOT give any errors when it is run.
"""

from math import isclose

assert isclose(word_prob(word_counts, 'the'), 0.06324061988316144)
assert isclose(word_prob(word_counts, 'brian'), 2.6359244390692887e-05)
assert isclose(word_prob(word_counts, 'lambda'), 0.0)

Task 5. Calculate probabilities of a group of words
-----

We can calculate the probability of a sequence of words using the joint probability formula:

$$P(w_1 \ldots w_n) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_1 w_2) \times \cdots \times P(w_n \mid w_1 \ldots w_{n-1})$$

Often we make a simplifying assumption: each word is drawn from the bag *independently* of the others, so $P(w_2 \mid w_1) = P(w_2)$. This is called a [bag of words model](https://en.wikipedia.org/wiki/Bag-of-words_model).  

That assumption simplifies our joint probability formula to just the product of the individual word probabilities:

$$P(w_1 \ldots w_n) = P(w_1) \times P(w_2) \times P(w_3) \times \cdots \times P(w_n) = \prod_{i=1}^{n} P(w_i)$$


First, write a `prod` function that multiplies the items in a sequence (just like `sum` adds the items in a sequence).

Then, write the `words_prob` function that finds the bag-of-words probability for a given sequence.
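
As a worked example of the product formula, consider a made-up four-word corpus. The names `toy_counts` and `toy_prob` are hypothetical, and `math.prod` (available from Python 3.8) stands in here for the function you are asked to write.

```python
import math
from collections import Counter

# Hypothetical corpus of four words: "the" appears twice
toy_counts = Counter(["the", "the", "quick", "fox"])
total = sum(toy_counts.values())  # 4

def toy_prob(word):
    """Relative frequency of `word` in the toy corpus."""
    return toy_counts[word] / total

# P("the fox") = P("the") * P("fox") = 0.5 * 0.25
p_seen = math.prod(toy_prob(w) for w in ["the", "fox"])

# A single unseen word has probability 0, so the whole product is 0
p_unseen = math.prod(toy_prob(w) for w in ["the", "house-elf"])

print(p_seen)    # 0.125
print(p_unseen)  # 0.0
```

Because the probabilities are multiplied, one word that never occurs in the corpus drives the probability of the whole sequence to zero.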

In [139]:
from typing import Sequence

def prod(nums: Sequence[float]) -> float:
    
    product = 1
    
    for each in nums:
        product *= each
    
    return product

In [140]:
"""
Test code for prod function.
1 points
This cell should NOT give any errors when it is run.
"""

from numpy import prod as product  # numpy.product was removed in NumPy 2.0

assert prod([0]) == product([0]) == 0
assert prod([1]) == product([1]) == 1 
assert prod([-2, -3]) == product([-2, -3]) == 6

In [141]:
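def words_prob(words):
    """Return the bag-of-words probability of a sequence of words."""
    # Corpus-wide word counts from Task 3
    counter = count(docs)

    # Multiply the individual word probabilities (independence assumption)
    return prod(word_prob(counter, word) for word in words)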

In [142]:
"""
Test code for words_prob function.
3 points
This cell should NOT give any errors when it is run.
"""

from math import isclose

word = 'the'
assert word_prob(word_counts, word) == words_prob([word])

words = 'The quick brown fox'.lower().split()
assert isclose(words_prob(words), 3.238175022726325e-14)

words = 'The quick brown house-elf'.lower().split()
assert words_prob(words) == 0

Summary
-----

We wrote a series of functions that take in a collection of documents, process the documents, and calculate descriptive statistics.

Given the modularity of the code, we could apply the same functions to another corpus, for example New York Times articles or emails. We would have to update the tests to make sure we handle edge cases in that new corpus.
