# Challenge: Text Into Data

```yaml
Course:   DS 5001 
Module:   02 Text Models
Topic:    Text into Data Challenge
Author:   R.C. Alvarado
Date:     14 October 2022 (revised)
```

## Purpose

We import a text using the Clip, Chunk, and Split pattern.

Demonstrate how to tokenize a raw text and map an OHCO onto the resulting dataframe of tokens.

In this notebook, we use the pattern from `M02_01` on a new text.

## Recipe

### Create TOKEN table

1. Inspect source text, taking note of where it begins and ends and the header patterns.
2. Import the source text into a dataframe of line strings.
3. Extract the title.
4. Clip the cruft by using regexes for the beginning and end of the actual text.
5. Chunk by using a regex for chapter headings, assign lines, and group.
6. Split into paragraphs using new lines.
7. Split into sentences using regex.
8. Split into tokens using regex.

### Create VOCAB table

1. Get token value counts and save as a dataframe.

## Set Up

In [1]:
import pandas as pd

### Import Config

In [2]:
import configparser
config = configparser.ConfigParser()
config.read("../../../env.ini")
data_home = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']
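
The cell above assumes that `env.ini` has a `[DEFAULT]` section with `data_home` and `output_dir` keys. A minimal sketch of that layout, parsed from a string rather than a file; the two paths and the `demo_` names are hypothetical placeholders:

```python
import configparser

# Hypothetical contents of env.ini; the paths are placeholders, not the course's real ones.
example_ini = """
[DEFAULT]
data_home = /path/to/data
output_dir = /path/to/output
"""

demo_config = configparser.ConfigParser()
demo_config.read_string(example_ini)

# Same lookups as in the notebook cell, under different variable names
demo_data_home = demo_config['DEFAULT']['data_home']
demo_output_dir = demo_config['DEFAULT']['output_dir']
print(demo_data_home, demo_output_dir)
```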

In [3]:
text_file = f"{data_home}/gutenberg/pg161.txt"
csv_file = f"{output_dir}/austen-sense-and-sensibility.csv" # The file we will create

In [4]:
OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']

## Import file into a dataframe

In [5]:
# Read the text file into a dataframe with one row per line, in a column named "line_str".
# The 'utf-8-sig' encoding strips a leading byte-order mark if the file has one.
LINES = pd.DataFrame(open(text_file, 'r', encoding='utf-8-sig').readlines(), columns=['line_str'])

# Name the index so that it serves as the line number
LINES.index.name = 'line_num'

# Replace line breaks with spaces, then strip surrounding whitespace
LINES.line_str = LINES.line_str.str.replace(r'\n+', ' ', regex=True).str.strip()

In [6]:
LINES.sample(20)

line_num,line_str
4249,"sister-in-law's mother, Mrs. Ferrars?"""
6621,"Dashwood's communication, in such an instantan..."
1890,"evening with them, and Margaret, by being left..."
12639,
6390,grieves me to see her! And I declare if she i...
5026,"to go wherever I do, well and good, you may al..."
9831,"takes place, depend upon it his mother will fe..."
10404,"Half an hour passed away, and the favourable s..."
10931,"""You are very wrong, Mr. Willoughby, very blam..."
12163,"any amends for the defect of the style."""


## Extract Title 

In [8]:
title = LINES.loc[0].line_str.replace('The Project Gutenberg EBook of ', '')

In [9]:
print(title)

Sense and Sensibility, by Jane Austen


## Clip Cruft

In [10]:
# Regular expressions that mark the start and end of the actual text
clip_pats = [
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT",
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT"
]

In [11]:
#Search for lines that match the regular expressions
pat_a = LINES.line_str.str.match(clip_pats[0])
pat_b = LINES.line_str.str.match(clip_pats[1])

In [12]:
# Line numbers just inside the START and END marker lines
line_a = LINES.loc[pat_a].index[0] + 1
line_b = LINES.loc[pat_b].index[0] - 1

In [13]:
# First and last line numbers of the story
line_a, line_b

(20, 12666)

In [14]:
# Select just the lines that contain the story
LINES = LINES.loc[line_a : line_b]
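
The clipping steps above can be sketched on a toy example. Note that `str.match` anchors at the start of each string, so only lines that begin with a marker are flagged. The lines and the `demo_` names below are made up for illustration:

```python
import pandas as pd

# Made-up lines that imitate the layout of a Project Gutenberg file
demo_lines = pd.DataFrame({'line_str': [
    "The Project Gutenberg EBook of A Toy Text",
    "*** START OF THIS PROJECT GUTENBERG EBOOK A TOY TEXT ***",
    "First line of the text.",
    "Last line of the text.",
    "*** END OF THIS PROJECT GUTENBERG EBOOK A TOY TEXT ***",
    "License boilerplate.",
]})

start_pat = r"\*\*\*\s*START OF (?:THE|THIS) PROJECT"
end_pat = r"\*\*\*\s*END OF (?:THE|THIS) PROJECT"

# Boolean vectors marking the marker lines
is_start = demo_lines.line_str.str.match(start_pat)
is_end = demo_lines.line_str.str.match(end_pat)

# Step one line inside each marker, then slice (.loc slices include both ends)
first = demo_lines.loc[is_start].index[0] + 1
last = demo_lines.loc[is_end].index[0] - 1
demo_body = demo_lines.loc[first:last]
print(demo_body)
```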

## Chunk by chapter

### Find all chapter headers

The regex will depend on the source text. You need to investigate the source text to figure this out.
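
One way to investigate is to try a candidate regex on a few heading styles before running it on the whole file. A sketch with made-up headings; the pattern is the one this notebook uses, and the sample lines are assumptions:

```python
import pandas as pd

# Made-up candidate lines, some of which should not count as headers
candidates = pd.Series([
    "CHAPTER 1",
    "Chapter 12",
    "  letter 3",
    "CHAPTER IV",                       # Roman numerals are not matched by \d+
    "In this chapter 2 things happen",  # keyword is not at the start of the line
])

chap_pat = r"^\s*(?:chapter|letter)\s+\d+"
hits = candidates.str.match(chap_pat, case=False)
print(pd.DataFrame({'line_str': candidates, 'is_header': hits}))
```

If a source uses Roman numerals or a different keyword, the pattern has to change accordingly.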

In [15]:
#Regular expression to find chapters
chap_pat = r"^\s*(?:chapter|letter)\s+\d+"

In [16]:
#A vector of lines where there are chapters
chap_lines = LINES.line_str.str.match(chap_pat, case=False) # Returns a truth vector

In [17]:
#Filter the LINES data frame to find chapters
LINES.loc[chap_lines] # Use as filter for dataframe

line_num,line_str
42,CHAPTER 1
196,CHAPTER 2
399,CHAPTER 3
561,CHAPTER 4
756,CHAPTER 5
858,CHAPTER 6
986,CHAPTER 7
1112,CHAPTER 8
1244,CHAPTER 9
1448,CHAPTER 10


### Assign numbers to chapters

In [21]:
# Number the chapter header lines consecutively, starting at 1
LINES.loc[chap_lines, 'chap_num'] = [i+1 for i in range(LINES.loc[chap_lines].shape[0])]

In [22]:
LINES.loc[chap_lines]

line_num,line_str,chap_num
42,CHAPTER 1,1.0
196,CHAPTER 2,2.0
399,CHAPTER 3,3.0
561,CHAPTER 4,4.0
756,CHAPTER 5,5.0
858,CHAPTER 6,6.0
986,CHAPTER 7,7.0
1112,CHAPTER 8,8.0
1244,CHAPTER 9,9.0
1448,CHAPTER 10,10.0


### Forward-fill chapter numbers to following text lines

`ffill()` will replace null values with the previous non-null value.

In [23]:
# ffill (forward fill) drags each chapter number down to the following rows that hold NaNs
LINES.chap_num = LINES.chap_num.ffill()
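
A toy illustration of the forward-fill step, using made-up lines rather than the novel. Rows before the first header stay null, which is why they can be dropped afterwards:

```python
import pandas as pd

nan = float('nan')

# Made-up lines: only the header rows carry a chapter number at first
demo = pd.DataFrame({
    'line_str': ['preface', 'CHAPTER 1', 'text a', 'text b', 'CHAPTER 2', 'text c'],
    'chap_num': [nan, 1, nan, nan, 2, nan],
})

# Each null takes the most recent non-null value above it
demo['chap_num'] = demo['chap_num'].ffill()
print(demo)
```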

In [24]:
LINES.sample(10)

line_num,line_str,chap_num
9160,"dear Mrs. Jennings, I spent two happy hours with",38.0
9615,"a look so serious, so earnest, so uncheerful, ...",40.0
8276,"five minutes after it stopped at the door, a p...",36.0
4969,Elinor was soon called to the card-table by th...,24.0
6770,"a short one. On such a subject,"" sighing heav...",31.0
12428,as rapidly as before. With apprehensive cauti...,50.0
3088,"is not often really merry.""",17.0
7859,"said in a low, but eager, voice,",34.0
3737,wished to appear. His temper might perhaps be...,20.0
4466,"us at Longstaple, to go to you, that I was afr...",22.0


### Clean up

In [25]:
LINES = LINES.dropna(subset=['chap_num']) # Remove everything before Chapter 1
# LINES = LINES.loc[~LINES.chap_num.isna()] # Remove everything before Chapter 1 (alternate method)
LINES = LINES.loc[~chap_lines] # Remove chapter heading lines; their work is done
LINES.chap_num = LINES.chap_num.astype('int') # Convert chap_num from float to int

### Group lines into chapters

In [26]:
OHCO[:1]

['chap_num']

In [27]:
# Join each chapter's lines into one big string; each row of the resulting dataframe is a chapter
CHAPS = LINES.groupby(OHCO[:1])\
    .line_str.apply(lambda x: '\n'.join(x))\
    .to_frame('chap_str')

In [28]:
CHAPS.head(10)

chap_num,chap_str
1,\n\nThe family of Dashwood had long been settl...
2,\n\nMrs. John Dashwood now installed herself m...
3,\n\nMrs. Dashwood remained at Norland several ...
4,"\n\n""What a pity it is, Elinor,"" said Marianne..."
5,"\n\nNo sooner was her answer dispatched, than ..."
6,\n\nThe first part of their journey was perfor...
7,\n\nBarton Park was about half a mile from the...
8,\n\nMrs. Jennings was a widow with an ample jo...
9,\n\nThe Dashwoods were now settled at Barton w...
10,"\n\nMarianne's preserver, as Margaret, with mo..."


## Split chapters into paragraphs

We use the convenient Pandas `.str.split()` method with `expand=True`, followed by `.stack()`.
Note that this creates zero-based indexes, so numbering starts at 0 within each chapter.
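
A small sketch of the split-then-stack idiom on two made-up chapters. The explicit `.dropna()` is an addition for this example: depending on the pandas version, `.stack()` may already discard the padding cells that `expand=True` creates.

```python
import pandas as pd

# Two made-up chapters; the first has two paragraphs, the second has one
demo_chaps = pd.Series(
    ["Para one.\n\nPara two.", "Only para."],
    index=pd.Index([1, 2], name='chap_num'),
    name='chap_str',
)

# expand=True gives one column per paragraph, padding shorter chapters with nulls
wide = demo_chaps.str.split(r'\n\n+', expand=True)

# stack() moves the column labels (0, 1, ...) into a second index level
demo_paras = wide.stack().dropna().to_frame('para_str')
demo_paras.index.names = ['chap_num', 'para_num']
print(demo_paras)
```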

In [29]:
#Identify the pattern of paragraphs
para_pat = r'\n\n+'

In [30]:
PARAS = CHAPS['chap_str'].str.split(para_pat, expand=True).stack()\
    .to_frame('para_str').sort_index()
PARAS.index.names = OHCO[:2]

In [31]:
PARAS.head()

chap_num,para_num,para_str
1,0,
1,1,The family of Dashwood had long been settled i...
1,2,"By a former marriage, Mr. Henry Dashwood had o..."
1,3,"The old gentleman died: his will was read, and..."
1,4,"Mr. Dashwood's disappointment was, at first, s..."


## Split paragraphs into sentences

In [32]:
# Consider how arbitrary it is to define sentences 
# sent_pat = r'[.?!;:"]+'
sent_pat = r'[.?!;:]+'
SENTS = PARAS['para_str'].str.split(sent_pat, expand=True).stack()\
    .to_frame('sent_str')
SENTS.index.names = OHCO[:3]

In [33]:
SENTS = SENTS[~SENTS['sent_str'].str.match(r'^\s*$')] # Remove empty sentences
SENTS.sent_str = SENTS.sent_str.str.strip() # CRUCIAL TO REMOVE BLANK TOKENS
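
To see why sentence boundaries are called arbitrary here, note that the pattern treats every period as a boundary, including the one in an abbreviation such as "Mrs.". A sketch using Python's `re` module on a made-up line:

```python
import re

sent_pat = r'[.?!;:]+'
sample = "Mrs. Dashwood was surprised; she said nothing. Why?"

# Split on the pattern, then drop empty pieces and trim whitespace
pieces = [s.strip() for s in re.split(sent_pat, sample) if s.strip()]
print(pieces)
# → ['Mrs', 'Dashwood was surprised', 'she said nothing', 'Why']
```

The abbreviation becomes its own "sentence", and the clause after the semicolon is split off as well.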

In [34]:
SENTS.head()

chap_num,para_num,sent_num,sent_str
1,1,0,The family of Dashwood had long been settled i...
1,1,1,"Their estate\nwas large, and their residence w..."
1,1,2,The late owner of this estate was a single\nma...
1,1,3,"But her\ndeath, which happened ten years befor..."
1,1,4,"for to supply her loss, he invited and receive..."


## Split sentences into tokens

In [35]:
token_pat = r"[\s',-]+"
TOKENS = SENTS['sent_str'].str.split(token_pat, expand=True).stack()\
    .to_frame('token_str')

In [36]:
TOKENS.index.names = OHCO[:4]
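
The token pattern splits on whitespace, apostrophes, commas, and hyphens, so hyphenated and possessive forms break into several tokens. A sketch with `re.split` on a made-up phrase:

```python
import re

token_pat = r"[\s',-]+"
sample = "her sister-in-law's mother, Mrs"

# Each run of separator characters counts as a single split point
demo_toks = re.split(token_pat, sample)
print(demo_toks)
# → ['her', 'sister', 'in', 'law', 's', 'mother', 'Mrs']
```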

In [37]:
TOKENS

chap_num,para_num,sent_num,token_num,token_str
1,1,0,0,The
1,1,0,1,family
1,1,0,2,of
1,1,0,3,Dashwood
1,1,0,4,had
...,...,...,...,...
50,23,0,8,and
50,23,0,9,Sensibility
50,23,0,10,by
50,23,0,11,Jane


## Extract Vocabulary

In [38]:
TOKENS['term_str'] = TOKENS.token_str.replace(r'[\W_]+', '', regex=True).str.lower()
VOCAB = TOKENS.term_str.value_counts().to_frame('n').reset_index().rename(columns={'index':'term_str'})
VOCAB.index.name = 'term_id'
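
A toy version of the same normalize-and-count step, to show how differently written tokens collapse into one term. The token strings are made up:

```python
import pandas as pd

# Made-up tokens: case, punctuation, and underscores vary
demo_tokens = pd.DataFrame({'token_str': ['The', 'family', '"the', 'Family_', 'of']})

# Strip non-word characters and underscores, then lowercase
demo_tokens['term_str'] = demo_tokens.token_str.replace(r'[\W_]+', '', regex=True).str.lower()

# Count each term; the rename only matters on pandas versions that name the column 'index'
demo_vocab = demo_tokens.term_str.value_counts().to_frame('n')\
    .reset_index().rename(columns={'index': 'term_str'})
demo_vocab.index.name = 'term_id'
print(demo_vocab)
```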

In [39]:
VOCAB

term_id,term_str,n
0,to,4115
1,the,4105
2,of,3574
3,and,3490
4,her,2543
...,...,...
6275,prefer,1
6276,dissolving,1
6277,beset,1
6278,effectually,1


## Gathering by Content Object

In [40]:
# This function re-concatenates tokens at a higher level of the OHCO
def gather(ohco_level):
    global TOKENS
    level_name = OHCO[ohco_level-1].split('_')[0]
    df = TOKENS.groupby(OHCO[:ohco_level])\
        .token_str.apply(lambda x: x.str.cat(sep=' '))\
        .to_frame(f"{level_name}_str")
    return df

In [41]:
gather(1)

chap_num,chap_str
1,The family of Dashwood had long been settled i...
2,Mrs John Dashwood now installed herself mistre...
3,Mrs Dashwood remained at Norland several month...
4,"""What a pity it is Elinor "" said Marianne ""tha..."
5,No sooner was her answer dispatched than Mrs D...
6,The first part of their journey was performed ...
7,Barton Park was about half a mile from the cot...
8,Mrs Jennings was a widow with an ample jointur...
9,The Dashwoods were now settled at Barton with ...
10,Marianne s preserver as Margaret with more ele...
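
A sketch of how the same gathering works at the paragraph level. `gather_from` is a hypothetical variant of `gather()` that takes the token table as an argument so that the example runs on its own; the toy tokens are made up:

```python
import pandas as pd

OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']

# Toy token table with the same index layout as TOKENS
idx = pd.MultiIndex.from_tuples(
    [(1, 1, 0, 0), (1, 1, 0, 1), (1, 1, 1, 0), (1, 2, 0, 0)], names=OHCO)
toy_tokens = pd.DataFrame({'token_str': ['Hello', 'world', 'Again', 'Bye']}, index=idx)

def gather_from(tokens, ohco_level):
    # 'para_num' -> 'para', used to name the output column
    level_name = OHCO[ohco_level - 1].split('_')[0]
    return tokens.groupby(OHCO[:ohco_level])\
        .token_str.apply(lambda x: x.str.cat(sep=' '))\
        .to_frame(f"{level_name}_str")

toy_paras = gather_from(toy_tokens, 2)  # level 2 groups by chapter and paragraph
print(toy_paras)
```

Because the tokens are rejoined with single spaces, the original punctuation and line breaks are not restored.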


## Save work to CSV

This is important: the CSV will be used for homework.

In [42]:
TOKENS.to_csv(csv_file)
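
When the file is loaded again later, the OHCO columns can be restored as the index by passing them to `index_col`. A sketch that reads a tiny in-memory stand-in for the saved file; the two rows are made up, and `restored` is a hypothetical name:

```python
import io
import pandas as pd

OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']

# Stand-in for the saved CSV: index levels come first, then the data columns
demo_csv = io.StringIO(
    "chap_num,para_num,sent_num,token_num,token_str,term_str\n"
    "1,1,0,0,The,the\n"
    "1,1,0,1,family,family\n"
)

# index_col rebuilds the four-level index from the named columns
restored = pd.read_csv(demo_csv, index_col=OHCO)
print(restored)
```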