
In [1]:
# Yesterday: list, tuple, set, dictionary
# Ordered: list and tuple
# Indexed: list, tuple, dictionary (by key)
# Read-only: tuple
# Key-value pairs: dictionary
# Range: generates numbers in a given range, with a given step
#        e.g. range(10, 2, -1) counts down from 10 to 3
# Slicing and dicing: data selection and collection
# Data structures


# Exploratory Text Analysis

## What kinds of text analysis are there?

* analyst knows the pattern
    * regular expressions
* analyst does not know the pattern
    * natural language processing
        * compares historical examples to judge novel cases
            * comparisons are statistical and approximate
            

### Examples of Analysis

When you know the pattern:

In [None]:
pattern = '£ ?[0-9][0-9]?' # £ then SPACE-optional then digit then digit-optional 

document = 'My eggs cost £3, bread cost £2, vodka cost £35'

In [None]:
import re

In [None]:
re.findall(pattern, document)
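
As a quick check, the cells above can be run as one self-contained snippet. This is a minimal sketch that reuses the `pattern` and `document` already defined; the name `matches` is introduced here only for illustration.

```python
import re

# £, then an optional space, then one digit, then an optional second digit
pattern = '£ ?[0-9][0-9]?'
document = 'My eggs cost £3, bread cost £2, vodka cost £35'

matches = re.findall(pattern, document)
print(matches)  # ['£3', '£2', '£35']
```

Because the pattern allows at most two digits, a price such as `£350` would be cut short to `£35`.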

When you don't know the pattern:

* sentiment analysis
    * how positive/negative is this (new) review?
* topic analysis 
    * what is this document about?

## What can I do if I know what pattern I want to find?

* finding ("extracting")
    * what matches the pattern?
* matching ("validating")
    * does the entire document match YES/NO?
* substituting ("replacing")
    * replace a part that matches a pattern with another...
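
The three operations above can be sketched with Python's built-in `re` module. The sample text and the names `found`, `is_price`, `is_text_a_price` and `masked` are made up for illustration.

```python
import re

text = 'bread cost £2, vodka cost £35'
price = '£[0-9]+'  # £ followed by one or more digits

# Finding: every part of the text that matches the pattern
found = re.findall(price, text)

# Matching: does the ENTIRE string match the pattern?
is_price = re.fullmatch(price, '£35') is not None
is_text_a_price = re.fullmatch(price, text) is not None

# Substituting: swap each match for a placeholder
masked = re.sub(price, '<price>', text)

print(found)            # ['£2', '£35']
print(is_price)         # True
print(is_text_a_price)  # False
print(masked)           # bread cost <price>, vodka cost <price>
```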

## How do I validate text with pandas?

In [2]:
import pandas as pd

In [3]:
ti = pd.read_csv('https://raw.githubusercontent.com/a-forty-two/DFEData2/main/titanic.csv')
ti.sample(1)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
418,0,2,male,30.0,0,0,13.0,S,Second,man,True,,Southampton,no,True


In [4]:
ti['ticket'] = "Ticket: " + ti['class'] + "; Price: $ " + ti['fare'].astype(str) + "; Port: " + ti['embark_town'] + ";"

In [5]:
ti[['class', 'fare', 'embark_town', 'ticket']].head(3)

Unnamed: 0,class,fare,embark_town,ticket
0,Third,7.25,Southampton,Ticket: Third; Price: $ 7.25; Port: Southampton;
1,First,71.2833,Cherbourg,Ticket: First; Price: $ 71.2833; Port: Cherbourg;
2,Third,7.925,Southampton,Ticket: Third; Price: $ 7.925; Port: Southampton;


In [6]:
pattern = '(First|Second)'

ti['class'].str.match(pattern)

0      False
1       True
2      False
3       True
4      False
       ...  
886     True
887     True
888    False
889     True
890    False
Name: class, Length: 891, dtype: bool

In [10]:
survival_percentage = ti.loc[ti['class'].str.match(pattern), 'survived'].mean() * 100
print("First and Second class people who survived = " + str(survival_percentage) + "%")

First and Second class people who survived = 55.75%
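
One detail worth knowing: `str.match` only checks that the pattern matches at the start of each value, whereas `str.fullmatch` (available in pandas 1.1 and later) requires the whole value to match. A small sketch on a made-up Series; the values, including `'First-ish'`, are illustrative and not taken from the Titanic data.

```python
import pandas as pd

classes = pd.Series(['First', 'Second', 'Third', 'First-ish'])
pattern = '(First|Second)'

starts_with = classes.str.match(pattern)      # anchored at the start only
whole_value = classes.str.fullmatch(pattern)  # the entire value must match

print(starts_with.tolist())  # [True, True, False, True]
print(whole_value.tolist())  # [True, True, False, False]
```

For the `class` column the two give the same answer, since every value is exactly `First`, `Second` or `Third`.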


## How do I extract data with pandas?

In [11]:
ti[['class', 'fare', 'embark_town', 'ticket']].head(3)

Unnamed: 0,class,fare,embark_town,ticket
0,Third,7.25,Southampton,Ticket: Third; Price: $ 7.25; Port: Southampton;
1,First,71.2833,Cherbourg,Ticket: First; Price: $ 71.2833; Port: Cherbourg;
2,Third,7.925,Southampton,Ticket: Third; Price: $ 7.925; Port: Southampton;


In [12]:
pattern = '([0-9.]+)'

ti['ticket'].str.extract(pattern).sample(4)

Unnamed: 0,0
639,16.1
535,26.25
597,0.0
404,8.6625
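
`str.extract` returns text, so the prices still need converting before any arithmetic. A minimal sketch, assuming two hand-copied ticket strings in place of the full `ti['ticket']` column; `expand=False` asks for a Series rather than a one-column DataFrame.

```python
import pandas as pd

tickets = pd.Series([
    'Ticket: Third; Price: $ 7.25; Port: Southampton;',
    'Ticket: First; Price: $ 71.2833; Port: Cherbourg;',
])

# The first run of digits and dots in each ticket is the price
prices = tickets.str.extract('([0-9.]+)', expand=False).astype(float)

print(prices.tolist())  # [7.25, 71.2833]
```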


## How do I substitute text with pandas?

In [13]:
ti['ticket'].str.replace('$', '€', regex=False).sample(1)

373    Ticket: First; Price: € 135.6333; Port: Cherbo...
Name: ticket, dtype: object
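
Since `$` has a special meaning in regular expressions, it helps to state whether `str.replace` should treat its first argument as plain text or as a pattern. A minimal sketch on two made-up price strings:

```python
import pandas as pd

prices = pd.Series(['Price: $ 7.25;', 'Price: $ 71.2833;'])

# regex=False: '$' is treated as plain text
literal = prices.str.replace('$', '€', regex=False)

# regex=True: '\$' is an escaped dollar sign, ' ?' is an optional space
pattern_based = prices.str.replace(r'\$ ?', '€', regex=True)

print(literal.tolist())        # ['Price: € 7.25;', 'Price: € 71.2833;']
print(pattern_based.tolist())  # ['Price: €7.25;', 'Price: €71.2833;']
```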

## What are regular expressions?

Regular expressions are a language for describing patterns in text. 

They are separate from Python, but can be used within a Python program (and elsewhere, e.g. often in SQL).

They are notoriously difficult to read and write and, being a separate language, are one more tool to learn.

## What regular expression patterns can I use?

* literals
    * `a`, find me an `a`
    * `£`, find `£`
    * `!` means `!` 
    * ... most symbols simply mean "find me this symbol"
* `.`
    * find any **single** symbol 
* character classes -- find a **single** symbol
    * `[abc]` $\rightarrow$ **either** a, b, or c
    * `[0-9]` $\rightarrow$ **either** 0, 1, 2, 3, ..., 9
    * `[A-Z]` $\rightarrow$ **either** capital A, B, ..., Z
    * inversions
        * `[^abc]` $\rightarrow$ **is not** `a` OR `b` OR `c`
        * `[^a-zA-Z0-9 ]`  $\rightarrow$ **is not** a letter, a digit, or a space
    
* alternatives -- find the character**s** given by...
    * `(May|June|July)`  $\rightarrow$ **the whole words** May OR June OR July
    

In [None]:
ti['ticket'].str.extract('(Ticket: (First|Second))')

In [None]:
ti['ticket'].str.extract('( [0-9][0-9])')

In [None]:
ti['ticket'].sample(1)

In [None]:
ti['ticket'].str.extract('(Ticket: [A-Z])').sample(2)

In [None]:
ti['ticket'].str.extract('(T........)').sample(3)

In [None]:
ti['ticket'].str.extract('(Price: [^0-9A-Za-z] ..)').sample(3)

In [None]:
ti['ticket'].str.extract('(Port: (Cherbourg|Southampton))').sample(3)
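
To see what each of these patterns picks out, here is a minimal sketch that applies them to a single ticket string with plain `re`. The ticket text is copied from row 1 of the table above; the names `patterns` and `results` are introduced for illustration.

```python
import re

ticket = 'Ticket: First; Price: $ 71.2833; Port: Cherbourg;'

patterns = [
    'Ticket: (First|Second)',         # alternatives
    ' [0-9][0-9]',                    # a space, then two digits
    'Ticket: [A-Z]',                  # one capital letter
    'T........',                      # T, then any eight symbols
    'Price: [^0-9A-Za-z] ..',         # a non-alphanumeric symbol, a space, any two symbols
    'Port: (Cherbourg|Southampton)',  # alternatives again
]

# group(0) is the whole matched text, whatever groups the pattern contains
results = [re.search(p, ticket).group(0) for p in patterns]

for p, r in zip(patterns, results):
    print(repr(p), '->', repr(r))
# 'Ticket: (First|Second)' -> 'Ticket: First'
# ' [0-9][0-9]' -> ' 71'
# 'Ticket: [A-Z]' -> 'Ticket: F'
# 'T........' -> 'Ticket: F'
# 'Price: [^0-9A-Za-z] ..' -> 'Price: $ 71'
# 'Port: (Cherbourg|Southampton)' -> 'Port: Cherbourg'
```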

* repetitions
    * optional `?`
        * an optional number: `[0-9]?`
    * one or more `+`
        * one or more spaces: ` +`  
    * zero or more `*`
        * a space, one or two digits, any symbol, then zero or more digits: ` [0-9][0-9]?.[0-9]*`
    

In [None]:
ti['ticket'].str.extract('([0-9][0-9]?.[0-9]*)').sample(3)

In [None]:
ti['ticket'].str.extract('(Ticket: [a-zA-Z]+)').sample(3)

In [None]:
row = 0
match = 1 # second match

# extractall is indexed by (row, match); column 0 holds the captured group
ti['ticket'].str.extractall('([a-zA-Z]+: [a-zA-Z]+)').loc[(row, match), 0]

In [None]:
ti['ticket'].str.extract('([a-zA-Z]+tow?n)')
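
The repetition patterns above can be checked on a single ticket with plain `re`. This is a minimal sketch: the ticket text is copied from row 0 of the table earlier, and the space-separated list of towns is made up to show `w?` at work.

```python
import re

ticket = 'Ticket: Third; Price: $ 7.25; Port: Southampton;'

# ? makes the second digit optional; * allows zero or more digits after the separator
price = re.search('[0-9][0-9]?.[0-9]*', ticket).group(0)

# + means one or more letters
ticket_class = re.search('Ticket: [a-zA-Z]+', ticket).group(0)
pairs = re.findall('[a-zA-Z]+: [a-zA-Z]+', ticket)

# w? matches both "...ton" and "...town"
towns = re.findall('[a-zA-Z]+tow?n', 'Southampton Queenstown Cherbourg')

print(price)         # 7.25
print(ticket_class)  # Ticket: Third
print(pairs)         # ['Ticket: Third', 'Port: Southampton']
print(towns)         # ['Southampton', 'Queenstown']
```

`Price: $` is missing from `pairs` because `$` is not a letter, which is why the second match is the port.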

* EXTRA: 
    * escaping
        * How do I say, literally, the `.` symbol?
        * `\.`
    

In [None]:
# raw string (r'...') so that Python passes the backslashes through to the regex unchanged
ti['ticket'].str.extract(r'(\$ [0-9]+\.[0-9]+)').sample(2)
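
The difference between `.` and `\.` is easiest to see side by side. A minimal sketch on a made-up string; the patterns are written as raw strings (`r'...'`) so that Python leaves the backslash for the regex engine.

```python
import re

text = 'v1.5 and 1x5 and 125'

# Unescaped: . matches ANY single symbol, so all three candidates match
any_symbol = re.findall(r'[0-9]+.[0-9]+', text)

# Escaped: \. matches only a literal full stop
literal_dot = re.findall(r'[0-9]+\.[0-9]+', text)

print(any_symbol)   # ['1.5', '1x5', '125']
print(literal_dot)  # ['1.5']
```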

* positional matching
    * `^` means **at the beginning**
    * `$` means **at the end**

In [None]:
ti['ticket'].str.extractall('([a-zA-Z]+: [a-zA-Z]+;$)').sample(1)

In [None]:
ti['ticket'].str.extractall('(^[a-zA-Z]+: [a-zA-Z]+;)').sample(1)

Unnamed: 0,match,0
2,0,Ticket: Third;
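
The effect of the two anchors can be sketched on one ticket with plain `re`. The ticket text is copied from row 0 of the table earlier; `pair`, `anywhere`, `at_start` and `at_end` are names introduced for illustration.

```python
import re

ticket = 'Ticket: Third; Price: $ 7.25; Port: Southampton;'
pair = '[a-zA-Z]+: [a-zA-Z]+;'  # word, colon, space, word, semicolon

anywhere = re.findall(pair, ticket)
at_start = re.findall('^' + pair, ticket)  # must begin at the start of the text
at_end = re.findall(pair + '$', ticket)    # must finish at the end of the text

print(anywhere)  # ['Ticket: Third;', 'Port: Southampton;']
print(at_start)  # ['Ticket: Third;']
print(at_end)    # ['Port: Southampton;']
```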


## Next Steps

* review a "Regex Cheat Sheet"
    * also, eg., https://en.wikipedia.org/wiki/Regular_expression#Examples

## Exercise (30 min) - ONLY FOR THOSE WHO FINISH EARLY!

* find all the words in the tickets 
    * HINT: a word is one or more letters followed by a space or a colon
    * HINT: `[ :]` means a space or a colon
* find all the USD prices
    * HINT: `\$` and repeated digits
    
* find all the high-price tickets
    * HINT: consider `\$`, a triple-digit number, `\.`

In [None]:
#Solution

In [None]:
ti['ticket'].str.findall('([a-zA-Z]+[ :])').sample(10)

372    [Ticket:, Price:, Port:]
258    [Ticket:, Price:, Port:]
752    [Ticket:, Price:, Port:]
369    [Ticket:, Price:, Port:]
478    [Ticket:, Price:, Port:]
517    [Ticket:, Price:, Port:]
461    [Ticket:, Price:, Port:]
770    [Ticket:, Price:, Port:]
455    [Ticket:, Price:, Port:]
657    [Ticket:, Price:, Port:]
Name: ticket, dtype: object
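
The hint-based pattern only picks up the labels, because `Third` and `Southampton` are followed by `;` rather than a space or a colon. One possible alternative, sketched here on a single ticket with plain `re`, is to look for runs of letters and ignore whatever follows them.

```python
import re

ticket = 'Ticket: Third; Price: $ 7.25; Port: Southampton;'

# One or more letters followed by a space or a colon (the hinted pattern)
labels = re.findall('[a-zA-Z]+[ :]', ticket)

# Any run of one or more letters
words = re.findall('[a-zA-Z]+', ticket)

print(labels)  # ['Ticket:', 'Price:', 'Port:']
print(words)   # ['Ticket', 'Third', 'Price', 'Port', 'Southampton']
```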

In [None]:
ti['ticket'].str.extract(r'(\$ [0-9]+\.[0-9]+)').sample(2)

Unnamed: 0,0
644,$ 19.2583
790,$ 7.75


In [None]:
# three or more digits before the decimal point
ti['ticket'].str.extractall(r'(\$ [0-9][0-9][0-9]+\.[0-9]+)')

Unnamed: 0,match,0
27,0,$ 263.0
31,0,$ 146.5208
88,0,$ 263.0
118,0,$ 247.5208
195,0,$ 146.5208
215,0,$ 113.275
258,0,$ 512.3292
268,0,$ 153.4625
269,0,$ 135.6333
297,0,$ 151.55
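
Counting digits in the pattern is one way to find high prices; another is to capture the number, convert it, and compare numerically. A minimal sketch of both on three illustrative ticket strings, using plain `re`; the helper `price_of` and the threshold of 100 are assumptions made for this example.

```python
import re

tickets = [
    'Ticket: Third; Price: $ 7.25; Port: Southampton;',
    'Ticket: First; Price: $ 71.2833; Port: Cherbourg;',
    'Ticket: First; Price: $ 263.0; Port: Southampton;',
]

# Pattern approach: three or more digits before the decimal point
by_pattern = [t for t in tickets
              if re.search(r'\$ [0-9][0-9][0-9]+\.[0-9]+', t)]

# Numeric approach: capture the number, convert it, then compare
def price_of(ticket):
    """Return the price in a ticket string as a float (hypothetical helper)."""
    return float(re.search(r'\$ ([0-9]+\.[0-9]+)', ticket).group(1))

by_value = [t for t in tickets if price_of(t) >= 100]

print(by_pattern == by_value)  # True
print(by_value)  # ['Ticket: First; Price: $ 263.0; Port: Southampton;']
```

The numeric version is easier to adapt when the threshold is not a round power of ten.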
