# Elements Of Data Processing (2021S1) - Week 5


### Case Folding
- Case folding removes all case distinctions in a string, so upper-case and lower-case forms compare as equal.
- It is used for caseless matching when the input text isn't guaranteed to use consistent capitalisation.
- Essentially, case folding is a more aggressive version of the `str.lower()` method that handles *many more* Unicode characters so that strings become comparable.
- `str.lower()` is sufficient when the text is purely ASCII, but you should use `str.casefold()` when comparing Unicode text or user input.
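
The difference between the two methods only shows up on non-ASCII text. A minimal sketch, using the German word "Straße" as an illustrative input (not taken from the exercises):

```python
# "ß" is already lower case, so lower() leaves it unchanged,
# while casefold() expands it to "ss".
a = "Straße"
b = "STRASSE"

print(a.lower() == b.lower())        # False
print(a.casefold() == b.casefold())  # True
```

For pure ASCII strings, such as the one in Exercise 1, `lower()` and `casefold()` give the same result.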


### <span style="color:blue"> Exercise 1 </span>

Use appropriate functions to convert `"Whereof one cannot speak, thereof one must be silent."` into:
- Lower case.
- Upper case.
- Casefold.

In [1]:
s = "Whereof one cannot speak, thereof one must be silent."

print(s.lower())
print(s.upper())
print(s.casefold())

### Natural Language Processing (NLP)
- Preprocessing steps for NLP can be done using the `nltk` library.
- It provides useful functions for tokenizing, stemming, lemmatizing, and vectorizing text fields.
- We don't always need to remove punctuation - sometimes you want to keep the natural language features to help split apart [contractions](https://www.thoughtco.com/contractions-commonly-used-informal-english-1692651).
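
To see why stripping punctuation too early can hurt, here is a small sketch using only the standard `re` module. The sample sentence and both regular expressions are assumptions made for illustration; `nltk`'s own tokenizers treat contractions differently.

```python
import re

text = "We can't dedicate, we don't consecrate."

# Stripping all punctuation first merges each contraction into one odd token.
stripped = re.sub(r"[^\w\s]", "", text).split()

# Keeping apostrophes inside words preserves the contraction for later splitting.
kept = re.findall(r"\w+(?:'\w+)?", text)

print(stripped)  # ['We', 'cant', 'dedicate', 'we', 'dont', 'consecrate']
print(kept)      # ['We', "can't", 'dedicate', 'we', "don't", 'consecrate']
```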

The example below parses the `speech` string and outputs a frequency dictionary.

In [2]:
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

from collections import Counter

In [3]:
speech = """Four score and seven years ago our fathers brought forth on this continent, 
a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. 
Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, 
can long endure. We are met on a great battle-field of that war. 
We have come to dedicate a portion of that field, as a final resting place for those who here 
gave their lives that that nation might live. It is altogether fitting and proper that we should do this. 
But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. 
The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. 
The world will little note, nor long remember what we say here, but it can never forget what they did here.
It is for us the living, rather, to be dedicated here to the unfinished work which they who 
fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the 
great task remaining before us -- that from these honored dead we take increased devotion to 
that cause for which they gave the last full measure of devotion -- that we here highly resolve 
that these dead shall not have died in vain -- that this nation, under God, shall have a new birth 
of freedom -- and that government of the people, by the people, for the people, shall not perish from the earth."""

In [4]:
# tokenize words - similar to str.split(), but punctuation becomes separate tokens
words = nltk.word_tokenize(speech)

# create a set of stopwords (e.g. and, or, is, it, etc.)
stop_words = set(stopwords.words('english'))

# initialise the porter stemmer function
ps = PorterStemmer()

In [5]:
stemmed_words = [ps.stem(word) for word in words if word not in stop_words]
print(stemmed_words)

['four', 'score', 'seven', 'year', 'ago', 'father', 'brought', 'forth', 'contin', ',', 'new', 'nation', ',', 'conceiv', 'liberti', ',', 'dedic', 'proposit', 'men', 'creat', 'equal', '.', 'now', 'engag', 'great', 'civil', 'war', ',', 'test', 'whether', 'nation', ',', 'nation', 'conceiv', 'dedic', ',', 'long', 'endur', '.', 'We', 'met', 'great', 'battle-field', 'war', '.', 'We', 'come', 'dedic', 'portion', 'field', ',', 'final', 'rest', 'place', 'gave', 'live', 'nation', 'might', 'live', '.', 'It', 'altogeth', 'fit', 'proper', '.', 'but', ',', 'larger', 'sens', ',', 'dedic', '--', 'consecr', '--', 'hallow', '--', 'ground', '.', 'the', 'brave', 'men', ',', 'live', 'dead', ',', 'struggl', ',', 'consecr', ',', 'far', 'poor', 'power', 'add', 'detract', '.', 'the', 'world', 'littl', 'note', ',', 'long', 'rememb', 'say', ',', 'never', 'forget', '.', 'It', 'us', 'live', ',', 'rather', ',', 'dedic', 'unfinish', 'work', 'fought', 'thu', 'far', 'nobli', 'advanc', '.', 'It', 'rather', 'us', 'dedic'

In [6]:
freq_stem = Counter(stemmed_words)
for word, freq in sorted(freq_stem.items(), key=lambda x: -x[1]):
    print(word, freq)

, 22
. 10
-- 7
dedic 6
nation 5
live 4
great 3
It 3
dead 3
us 3
shall 3
peopl 3
new 2
conceiv 2
men 2
war 2
long 2
We 2
gave 2
consecr 2
the 2
far 2
rather 2
devot 2
four 1
score 1
seven 1
year 1
ago 1
father 1
brought 1
forth 1
contin 1
liberti 1
proposit 1
creat 1
equal 1
now 1
engag 1
civil 1
test 1
whether 1
endur 1
met 1
battle-field 1
come 1
portion 1
field 1
final 1
rest 1
place 1
might 1
altogeth 1
fit 1
proper 1
but 1
larger 1
sens 1
hallow 1
ground 1
brave 1
struggl 1
poor 1
power 1
add 1
detract 1
world 1
littl 1
note 1
rememb 1
say 1
never 1
forget 1
unfinish 1
work 1
fought 1
thu 1
nobli 1
advanc 1
task 1
remain 1
honor 1
take 1
increas 1
caus 1
last 1
full 1
measur 1
highli 1
resolv 1
die 1
vain 1
god 1
birth 1
freedom 1
govern 1
perish 1
earth 1
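
In the counts above, `We` and `It` survive the filter, and `The` is kept and stemmed to `the`. The likely cause is that the stop-word list is lower case while the membership test is case-sensitive. A minimal sketch of casefolding each token before the test, using a small hand-written stop-word set and token list (both are assumptions standing in for the `nltk` data):

```python
from collections import Counter

# Small hand-written stop-word set (an assumption; nltk's list is longer).
stop_words = {"we", "it", "is", "the", "are", "for"}

tokens = ["We", "are", "met", ".", "It", "is", "for", "us", "the", "living"]

# Case-sensitive filter: capitalised stop words slip through.
case_sensitive = [t for t in tokens if t not in stop_words]

# Casefold each token before the membership test.
case_insensitive = [t for t in tokens if t.casefold() not in stop_words]

print(Counter(case_sensitive))
print(Counter(case_insensitive))
```

The same idea applies to the punctuation tokens (`,`, `.`, `--`), which could be dropped with an additional `t.isalpha()` condition if they are not wanted in the counts.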


### <span style="color:blue"> Exercise 2 </span>

- Modify the example above to use a `WordNet` Lemmatizer instead of a Porter Stemmer.
- What are the differences?

In [7]:
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()

lemmatized_words = [lem.lemmatize(word) for word in words if word not in stop_words]

freq_lemma = Counter(lemmatized_words)
for word, freq in sorted(freq_lemma.items(), key=lambda x: -x[1]):
    print(word, freq)

, 22
. 10
-- 7
nation 5
dedicated 4
great 3
It 3
dead 3
u 3
shall 3
people 3
new 2
conceived 2
men 2
war 2
long 2
We 2
dedicate 2
gave 2
The 2
living 2
far 2
rather 2
devotion 2
Four 1
score 1
seven 1
year 1
ago 1
father 1
brought 1
forth 1
continent 1
Liberty 1
proposition 1
created 1
equal 1
Now 1
engaged 1
civil 1
testing 1
whether 1
endure 1
met 1
battle-field 1
come 1
portion 1
field 1
final 1
resting 1
place 1
life 1
might 1
live 1
altogether 1
fitting 1
proper 1
But 1
larger 1
sense 1
consecrate 1
hallow 1
ground 1
brave 1
struggled 1
consecrated 1
poor 1
power 1
add 1
detract 1
world 1
little 1
note 1
remember 1
say 1
never 1
forget 1
unfinished 1
work 1
fought 1
thus 1
nobly 1
advanced 1
task 1
remaining 1
honored 1
take 1
increased 1
cause 1
last 1
full 1
measure 1
highly 1
resolve 1
died 1
vain 1
God 1
birth 1
freedom 1
government 1
perish 1
earth 1


In [8]:
# (word, count) pairs that appear in the lemmatized counts but not in the stemmed counts
set(freq_lemma.items()).difference(set(freq_stem.items()))

{('But', 1),
 ('Four', 1),
 ('God', 1),
 ('Liberty', 1),
 ('Now', 1),
 ('The', 2),
 ('advanced', 1),
 ('altogether', 1),
 ('cause', 1),
 ('conceived', 2),
 ('consecrate', 1),
 ('consecrated', 1),
 ('continent', 1),
 ('created', 1),
 ('dedicate', 2),
 ('dedicated', 4),
 ('devotion', 2),
 ('died', 1),
 ('endure', 1),
 ('engaged', 1),
 ('fitting', 1),
 ('government', 1),
 ('highly', 1),
 ('honored', 1),
 ('increased', 1),
 ('life', 1),
 ('little', 1),
 ('live', 1),
 ('living', 2),
 ('measure', 1),
 ('nobly', 1),
 ('people', 3),
 ('proposition', 1),
 ('remaining', 1),
 ('remember', 1),
 ('resolve', 1),
 ('resting', 1),
 ('sense', 1),
 ('struggled', 1),
 ('testing', 1),
 ('thus', 1),
 ('u', 3),
 ('unfinished', 1)}
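
The outputs above suggest the main differences. The stemmer lower-cases most tokens and cuts them down to stems that are often not real words (`dedic`, `peopl`, `liberti`), merging `dedicate` and `dedicated` into one entry. The lemmatizer returns dictionary words and keeps the original case, but when called without a part-of-speech argument it treats every token as a noun, so verb forms such as `dedicated` stay unchanged and `us` is reduced to `u`. Comparing only the keys of the two counters, in both directions, makes the vocabulary differences easier to read than comparing `(word, count)` pairs. A sketch on hypothetical excerpts of the two frequency tables:

```python
from collections import Counter

# Hypothetical excerpts of the two frequency tables above.
freq_stem = Counter({"dedic": 6, "nation": 5, "peopl": 3})
freq_lemma = Counter({"dedicated": 4, "dedicate": 2, "nation": 5, "people": 3})

# Iterating a Counter yields its keys, so set() gives the vocabulary.
only_in_lemma = sorted(set(freq_lemma) - set(freq_stem))
only_in_stem = sorted(set(freq_stem) - set(freq_lemma))
shared = sorted(set(freq_lemma) & set(freq_stem))

print(only_in_lemma)  # ['dedicate', 'dedicated', 'people']
print(only_in_stem)   # ['dedic', 'peopl']
print(shared)         # ['nation']
```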

### Dataset: Smokers
In the first twenty rows, there are seven errors that all fall into one of the following categories:
- Semantic Errors
- Range Errors
- Format Errors

Identify the errors and what category they fall into. 
- Where possible, fix the errors manually and save the new spreadsheet as `smoking_data_us_1995_2010_corrected.csv`
- Suggest how you would write a program to detect them.

In [9]:
import pandas as pd
from IPython.display import display

df = pd.read_csv('smoking_data_us_1995_2010.csv')
display(df.tail())
display(df.dtypes)

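|     | Year | State         | Smoke everyday | Smoke some days | Former smoker | Never smoked |
|-----|------|---------------|----------------|-----------------|---------------|--------------|
| 871 | 1995 | Virginia      | 18.70%         | 2.70%           | 25.20%        | 53.50%       |
| 872 | 1995 | Washington    | 17.50%         | 2.40%           | 29.90%        | 50.20%       |
| 873 | 1995 | West Virginia | 23.70%         | 1.90%           | 23.30%        | 51.10%       |
| 874 | 1995 | Wisconsin     | 18.20%         | 3.50%           | 27.60%        | 50.70%       |
| 875 | 1995 | Wyoming       | 19.10%         | 2.90%           | 26.80%        | 51.20%       |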


Year                int64
State              object
Smoke everyday     object
Smoke some days    object
Former smoker      object
Never smoked       object
dtype: object

In [10]:
# a valid percentage string is at most 6 characters long (XX.XX%); longer values are printed for inspection
DECIMAL_LENGTH = 6

for col in df.columns[2:]:
    print(df[col].apply(lambda x: float(x.rstrip('%')) if len(x) <= DECIMAL_LENGTH else print(x)).max())

29.1
8.5
105.00%
twenty-three point 6 percent
33.4
83.7
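
The loop above works, but it relies on `print` inside a `lambda` and on string length as a proxy for validity. An alternative sketch uses `pd.to_numeric` with `errors="coerce"`, which turns unparseable values into `NaN`, so format errors and range errors can be separated. The sample column below is hypothetical and only mimics the two bad values found above.

```python
import pandas as pd

# Hypothetical sample standing in for one column of the smoking data.
col = pd.Series(["25.20%", "105.00%", "twenty-three point 6 percent", "23.30%"])

# Strip the % sign and coerce anything non-numeric to NaN.
numeric = pd.to_numeric(col.str.rstrip("%"), errors="coerce")

# Values that could not be parsed at all.
format_errors = col[numeric.isna()]

# Values that parsed, but lie outside 0-100.
range_errors = col[(numeric < 0) | (numeric > 100)]

print(format_errors.tolist())  # ['twenty-three point 6 percent']
print(range_errors.tolist())   # ['105.00%']
```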


In [11]:
df['Year'].describe()

count     876.000000
mean     2002.683790
std         5.640406
min      1995.000000
25%      1999.000000
50%      2003.000000
75%      2007.000000
max      2100.000000
Name: Year, dtype: float64
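
The summary shows a maximum of 2100, which cannot be right for data that the file name says covers 1995 to 2010. A range check along these lines could flag such rows. The bounds are an assumption taken from the file name, and the sample series is hypothetical.

```python
import pandas as pd

# Hypothetical sample of the Year column.
years = pd.Series([1995, 2003, 2100, 2010], name="Year")

# Assumed valid range, taken from the file name.
FIRST_YEAR, LAST_YEAR = 1995, 2010

# between() is inclusive on both ends by default.
out_of_range = years[~years.between(FIRST_YEAR, LAST_YEAR)]

print(out_of_range.tolist())  # [2100]
```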

In [12]:
from word2number import w2n
import re

word_num = 'twenty-three point 6 percent'

# strip the trailing " percent"
word_num = re.sub(r'\s*percent$', r'', word_num)
print(word_num)

w2n.word_to_num(word_num)

twenty-three point 6


23

- As you can see, libraries can help but aren't always perfect: here `w2n.word_to_num` returns 23 rather than the intended 23.6.
- Part of your work may be to discuss such issues with the client and Business Analysts and to fix them manually.
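
When a value has to be fixed by hand, one option is to record the agreed correction in a lookup table so that the fix is documented and repeatable. A minimal sketch; the `MANUAL_FIXES` table and the `apply_manual_fixes` helper are hypothetical names introduced here for illustration.

```python
# Hypothetical corrections agreed after manual review; keys are the raw cell values.
MANUAL_FIXES = {
    "twenty-three point 6 percent": "23.60%",
}

def apply_manual_fixes(values, fixes=MANUAL_FIXES):
    """Replace known bad values and leave everything else untouched."""
    return [fixes.get(v, v) for v in values]

raw = ["25.20%", "twenty-three point 6 percent", "23.30%"]
cleaned = apply_manual_fixes(raw)
print(cleaned)  # ['25.20%', '23.60%', '23.30%']
```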

### <span style="color:blue"> Exercise 3 </span>

Write Python code for the following tasks:
- Import your file `smoking_data_us_1995_2010_corrected.csv` into a pandas data frame
- Remove the percentage symbols from the data. 
- For removing/replacing characters see [here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html)
- After the removals, convert all the strings to numeric values.

In [13]:
import pandas as pd

df = pd.read_csv('smoking_data_us_1995_2010_corrected.csv')
df.tail()

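|     | Year | State         | Smoke everyday | Smoke some days | Former smoker | Never smoked |
|-----|------|---------------|----------------|-----------------|---------------|--------------|
| 871 | 1995 | Virginia      | 18.70%         | 2.70%           | 25.20%        | 53.50%       |
| 872 | 1995 | Washington    | 17.50%         | 2.40%           | 29.90%        | 50.20%       |
| 873 | 1995 | West Virginia | 23.70%         | 1.90%           | 23.30%        | 51.10%       |
| 874 | 1995 | Wisconsin     | 18.20%         | 3.50%           | 27.60%        | 50.70%       |
| 875 | 1995 | Wyoming       | 19.10%         | 2.90%           | 26.80%        | 51.20%       |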


In [14]:
df.dtypes

Year                int64
State              object
Smoke everyday     object
Smoke some days    object
Former smoker      object
Never smoked       object
dtype: object

In [15]:
for col in df.columns[2:]:
    df[col] = df[col].str.rstrip('%').astype(float)
    # df[col] = df[col].apply(lambda x: float(x.strip('%')))
    
df.tail()

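|     | Year | State         | Smoke everyday | Smoke some days | Former smoker | Never smoked |
|-----|------|---------------|----------------|-----------------|---------------|--------------|
| 871 | 1995 | Virginia      | 18.7           | 2.7             | 25.2          | 53.5         |
| 872 | 1995 | Washington    | 17.5           | 2.4             | 29.9          | 50.2         |
| 873 | 1995 | West Virginia | 23.7           | 1.9             | 23.3          | 51.1         |
| 874 | 1995 | Wisconsin     | 18.2           | 3.5             | 27.6          | 50.7         |
| 875 | 1995 | Wyoming       | 19.1           | 2.9             | 26.8          | 51.2         |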


In [16]:
df.dtypes

Year                 int64
State               object
Smoke everyday     float64
Smoke some days    float64
Former smoker      float64
Never smoked       float64
dtype: object
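
The exercise hint points to `DataFrame.replace`, so here is a sketch of the same conversion done that way instead of with `str.rstrip`. The two-row frame is a hypothetical sample with the same column layout as the corrected CSV.

```python
import pandas as pd

# Hypothetical two-row sample with the same column layout as the corrected CSV.
df = pd.DataFrame({
    "Year": [1995, 1995],
    "State": ["Virginia", "Washington"],
    "Smoke everyday": ["18.70%", "17.50%"],
    "Smoke some days": ["2.70%", "2.40%"],
    "Former smoker": ["25.20%", "29.90%"],
    "Never smoked": ["53.50%", "50.20%"],
})

# With regex=True the pattern is matched inside each string,
# so every "%" is removed before the cast to float.
pct_cols = df.columns[2:]
df[pct_cols] = df[pct_cols].replace("%", "", regex=True).astype(float)

print(df.dtypes)
```

Both approaches assume the file has already been corrected; any remaining non-numeric value would make `astype(float)` raise an error.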