# Chapter 3: Processing and Understanding Text

* Babel Fish / Babel Bot: Text Mining [1,2]
* 김무성

# Contents
* Text Tokenization
* Text Normalization
* Understanding Text Syntax and Structure 

#### References
* [1] Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from your Data - https://www.amazon.com/Text-Analytics-Python-Real-World-Actionable/dp/148422387X/
* [2] (github) dipanjanS/text-analytics-with-python - https://github.com/dipanjanS/text-analytics-with-python

# Text Tokenization
* Sentence Tokenization
* Word Tokenization

#### References
* [4] CS 124: From Languages to Information / Winter 2018 / Basic Text Processing - https://web.stanford.edu/class/cs124/lec/textprocessingboth.pdf

-------------------

## Sentence Tokenization

* sent_tokenize
* PunktSentenceTokenizer
* RegexpTokenizer
* Pre-trained sentence tokenization models

In [2]:
# pip install nltk

In [1]:
import nltk
from nltk.corpus import gutenberg
from pprint import pprint

In [3]:
# loading text corpora
alice = gutenberg.raw(fileids='carroll-alice.txt')
sample_text = 'We will discuss briefly about the basic syntax,\
 structure and design philosophies. \
 There is a defined hierarchical syntax for Python code which you should remember \
 when writing code! Python is a really powerful programming language!'

In [4]:
# Total characters in Alice in Wonderland
len(alice)

144395

In [5]:
# First 100 characters in the corpus
alice[0:100]

"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I. Down the Rabbit-Hole\n\nAlice was"

In [6]:
## default sentence tokenizer
default_st = nltk.sent_tokenize
alice_sentences = default_st(text=alice)
sample_sentences = default_st(text=sample_text)

In [7]:
print('Total sentences in sample_text:', len(sample_sentences))
print('Sample text sentences :-')
pprint(sample_sentences)
print('\nTotal sentences in alice:', len(alice_sentences))
print('First 5 sentences in alice:-')
pprint(alice_sentences[0:5])

Total sentences in sample_text: 3
Sample text sentences :-
['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember  when writing code!',
 'Python is a really powerful programming language!']

Total sentences in alice: 1625
First 5 sentences in alice:-
["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I.",
 'Down the Rabbit-Hole\n'
 '\n'
 'Alice was beginning to get very tired of sitting by her sister on the\n'
 'bank, and of having nothing to do: once or twice she had peeped into the\n'
 'book her sister was reading, but it had no pictures or conversations in\n'
 "it, 'and what is the use of a book,' thought Alice 'without pictures or\n"
 "conversation?'",
 'So she was considering in her own mind (as well as she could, for the\n'
 'hot day made her feel very sleepy and stupid), whether the pleasure\n'
 'of making a daisy-chain would be worth the 

--------------------

In [8]:
## Other languages sentence tokenization
from nltk.corpus import europarl_raw

In [11]:
nltk.download('europarl_raw')

[nltk_data] Downloading package europarl_raw to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/europarl_raw.zip.


True

In [12]:
german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')
# Total characters in the corpus
print(len(german_text))
# First 100 characters in the corpus
print(german_text[0:100])

157171
 
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit


In [13]:
# default sentence tokenizer 
german_sentences_def = default_st(text=german_text, language='german')

# loading german text tokenizer into a PunktSentenceTokenizer instance  
german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')
german_sentences = german_tokenizer.tokenize(german_text)

In [14]:
# verify the type of german_tokenizer
# should be PunktSentenceTokenizer
print(type(german_tokenizer))

<class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>


In [17]:
# check if results of both tokenizers match
# should be True
print(german_sentences_def == german_sentences)

True


In [18]:
# print first 5 sentences of the corpus
for sent in german_sentences[0:5]:
    print(sent)

 
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .
Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .
Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .
Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .
Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .


In [16]:
## using PunktSentenceTokenizer for sentence tokenization
punkt_st = nltk.tokenize.PunktSentenceTokenizer()
sample_sentences = punkt_st.tokenize(sample_text)
pprint(sample_sentences)

['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember  when writing code!',
 'Python is a really powerful programming language!']


In [19]:
## using RegexpTokenizer for sentence tokenization
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
regex_st = nltk.tokenize.RegexpTokenizer(
            pattern=SENTENCE_TOKENS_PATTERN,
            gaps=True)
sample_sentences = regex_st.tokenize(sample_text)
pprint(sample_sentences)  

['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 ' There is a defined hierarchical syntax for Python code which you should '
 'remember  when writing code!',
 'Python is a really powerful programming language!']
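To see what each lookbehind in SENTENCE_TOKENS_PATTERN contributes, here is a minimal sketch that uses only the standard `re` module. The demo sentence and the `split_sentences` helper are made up for illustration, and `re.split` is assumed to behave like `RegexpTokenizer(gaps=True)` for a pattern without capture groups.

```python
import re

# Same pattern as SENTENCE_TOKENS_PATTERN: split on one whitespace character
# that follows '.', '?' or '!' unless the preceding text looks like an abbreviation.
SENTENCE_BOUNDARY = re.compile(
    r'(?<!\w\.\w.)'        # not after "e.g." / "p.m." style abbreviations
    r'(?<![A-Z][a-z]\.)'   # not after titles such as "Dr." or "Mr."
    r'(?<![A-Z]\.)'        # not after single initials such as "J."
    r'(?<=\.|\?|\!)'       # must follow sentence-ending punctuation
    r'\s'                  # the single whitespace character that is consumed
)

def split_sentences(text):
    """Split text at the boundaries matched by SENTENCE_BOUNDARY."""
    return [s for s in SENTENCE_BOUNDARY.split(text) if s]

demo = "Dr. Smith arrived at 5 p.m. today. He was late! Was J. Doe there?"
sentences = split_sentences(demo)
for s in sentences:
    print(s)
```

Because the pattern consumes only one whitespace character, a double space after a period leaves a leading space on the next sentence. That is why the second sentence in the RegexpTokenizer output above starts with a space.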


## Word Tokenization

* word_tokenize
* TreebankWordTokenizer
* RegexpTokenizer
* Inherited tokenizers from RegexpTokenizer

In [32]:
## WORD TOKENIZATION
sentence = "The brown fox wasn't that quick and he couldn't win the race"

In [33]:
# default word tokenizer
default_wt = nltk.word_tokenize
words = default_wt(sentence)
words

['The',
 'brown',
 'fox',
 'was',
 "n't",
 'that',
 'quick',
 'and',
 'he',
 'could',
 "n't",
 'win',
 'the',
 'race']

In [34]:
# treebank word tokenizer
treebank_wt = nltk.TreebankWordTokenizer()
words = treebank_wt.tokenize(sentence)
words

['The',
 'brown',
 'fox',
 'was',
 "n't",
 'that',
 'quick',
 'and',
 'he',
 'could',
 "n't",
 'win',
 'the',
 'race']

In [35]:
# regex word tokenizer
TOKEN_PATTERN = r'\w+'        
regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN,
                                gaps=False)
words = regex_wt.tokenize(sentence)
words

['The',
 'brown',
 'fox',
 'wasn',
 't',
 'that',
 'quick',
 'and',
 'he',
 'couldn',
 't',
 'win',
 'the',
 'race']

In [36]:
GAP_PATTERN = r'\s+'        
regex_wt = nltk.RegexpTokenizer(pattern=GAP_PATTERN,
                                gaps=True)
words = regex_wt.tokenize(sentence)
words

['The',
 'brown',
 'fox',
 "wasn't",
 'that',
 'quick',
 'and',
 'he',
 "couldn't",
 'win',
 'the',
 'race']

In [37]:
word_indices = list(regex_wt.span_tokenize(sentence))
word_indices

[(0, 3),
 (4, 9),
 (10, 13),
 (14, 20),
 (21, 25),
 (26, 31),
 (32, 35),
 (36, 38),
 (39, 47),
 (48, 51),
 (52, 55),
 (56, 60)]

In [38]:
print([sentence[start:end] for start, end in word_indices])

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
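span_tokenize returns character offsets instead of strings. A minimal sketch of the same idea with only the standard `re` module, assuming whitespace is the only separator, could look like this:

```python
import re

sentence = "The brown fox wasn't that quick and he couldn't win the race"

# Each run of non-whitespace characters is a token; the match object
# carries the (start, end) offsets into the original string.
spans = [(m.start(), m.end()) for m in re.finditer(r'\S+', sentence)]
tokens = [sentence[start:end] for start, end in spans]

print(spans[:4])
print(tokens[:4])
```

Keeping offsets rather than strings is useful when tokens need to be mapped back to their position in the original text, for example for highlighting.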


In [39]:
# derived regex tokenizers
wordpunkt_wt = nltk.WordPunctTokenizer()
words = wordpunkt_wt.tokenize(sentence)
words

['The',
 'brown',
 'fox',
 'wasn',
 "'",
 't',
 'that',
 'quick',
 'and',
 'he',
 'couldn',
 "'",
 't',
 'win',
 'the',
 'race']

In [40]:
whitespace_wt = nltk.WhitespaceTokenizer()
words = whitespace_wt.tokenize(sentence)
words

['The',
 'brown',
 'fox',
 "wasn't",
 'that',
 'quick',
 'and',
 'he',
 "couldn't",
 'win',
 'the',
 'race']

------------------

## Exercise - Korean Examples

#### References
* [3] KoNLPy: Korean NLP in Python - http://konlpy-ko.readthedocs.io/ko/v0.4.3/


In [41]:
sample_text = '''Trying to let you know
Sign을 보내 signal 보내
I must let you know
Sign을 보내 signal 보내
Sign을 보내 signal 보내
Sign을 보내 signal 보내
Sign을 보내 signal 보내
I must let you know
Sign을 보내 signal 보내
근데 전혀 안 통해
눈빛을 보내 눈치를 주네
근데 못 알아듣네
답답해서 미치겠다 정말
왜 그런지 모르겠다 정말
다시 한 번 힘을 내서
Sign을 보내 signal 보내
눈짓도 손짓도 어떤 표정도
소용이 없네 하나도 안 통해
눈치도 코치도 전혀 없나 봐
더 이상 어떻게 내 맘을 표현해'''

In [42]:
# Sentence Tokenization
# Simply use the string split() method, treating each line as one sentence
sample_text.split('\n')

['Trying to let you know',
 'Sign을 보내 signal 보내',
 'I must let you know',
 'Sign을 보내 signal 보내',
 'Sign을 보내 signal 보내',
 'Sign을 보내 signal 보내',
 'Sign을 보내 signal 보내',
 'I must let you know',
 'Sign을 보내 signal 보내',
 '근데 전혀 안 통해',
 '눈빛을 보내 눈치를 주네',
 '근데 못 알아듣네',
 '답답해서 미치겠다 정말',
 '왜 그런지 모르겠다 정말',
 '다시 한 번 힘을 내서',
 'Sign을 보내 signal 보내',
 '눈짓도 손짓도 어떤 표정도',
 '소용이 없네 하나도 안 통해',
 '눈치도 코치도 전혀 없나 봐',
 '더 이상 어떻게 내 맘을 표현해']

In [45]:
from konlpy.tag import Kkma
from konlpy.utils import pprint
kkma = Kkma()

# Split the text into sentences with kkma.sentences()
pprint(kkma.sentences(sample_text))

['Trying to let you know Sign을 보내',
 'signal 보내',
 'I must let you know Sign을 보내',
 'signal 보내',
 'Sign을 보내',
 'signal 보내',
 'Sign을 보내',
 'signal 보내',
 'Sign을 보내',
 'signal 보내',
 'I must let you know Sign을 보내',
 'signal 보내',
 '근데 전혀 안 통해 눈빛을 보내',
 '눈치를 주네',
 '근데 못 알아듣네',
 '답답해서 미치겠다 정말 왜 그런지 모르겠다 정말 다시 한 번 힘을 내서 Sign을 보내',
 'signal 보내',
 '눈짓도 손짓도 어떤 표정도 소용이 없네',
 '하나도 안 통해 눈치도 코치도 전혀 없나',
 '봐 더 이상 어떻게 내 맘을 표현해']

In [60]:
# Word Tokenization
# sentence = "눈빛을 보내 눈치를 주네"
sentence = "소용이 없네 하나도 안 통해 '눈치'도 '코치'도 전혀 없나 봐"

In [57]:
# Plain Python string split()
sentence.split()

['소용이', '없네', '하나도', '안', '통해', "'눈치'도", "'코치'도", '전혀', '없나', '봐']

In [58]:
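# nltk.RegexpTokenizer, keeping runs of word characters as tokens
nltk.RegexpTokenizer(pattern=r'\w+', gaps=False).tokenize(sentence)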

['소용이', '없네', '하나도', '안', '통해', '눈치', '도', '코치', '도', '전혀', '없나', '봐']

In [59]:
# Morpheme-level tokens with POS tags from kkma.pos()
pprint(kkma.pos(sentence))

[('소용', 'NNG'),
 ('이', 'JKS'),
 ('없', 'VA'),
 ('네', 'EFN'),
 ('하나', 'NNG'),
 ('도', 'JX'),
 ('안', 'MAG'),
 ('통해', 'NNG'),
 ("'", 'SS'),
 ('눈치', 'NNG'),
 ("'", 'SS'),
 ('도', 'NNG'),
 ("'", 'SS'),
 ('코치', 'NNG'),
 ("'", 'SS'),
 ('도', 'NNG'),
 ('전혀', 'MAG'),
 ('없', 'VA'),
 ('나', 'EFQ'),
 ('보', 'VV'),
 ('아', 'ECS')]

# Text Normalization
* Cleaning Text
* Tokenizing Text
* Removing Special Characters
* Expanding Contractions
* Case Conversions
* Removing Stopwords
* Correcting Words
* Stemming
* Lemmatization

In [61]:
import nltk
import re
import string
from pprint import pprint

corpus = ["The brown fox wasn't that quick and he couldn't win the race",
          "Hey that's a great deal! I just bought a phone for $199",
          "@@You'll (learn) a **lot** in the book. Python is an amazing language!@@"]


## Cleaning Text

* HTML tags
* XML
* JSON feeds
* etc.

Use clean_html() from nltk (no longer implemented in NLTK 3.x, which points users to BeautifulSoup instead) or the BeautifulSoup library.
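As a dependency-free illustration of this clean-up step, the sketch below strips tags with the standard library's `html.parser`. The names `TextExtractor` and `strip_html` and the sample markup are hypothetical; for real pages a dedicated library such as BeautifulSoup is likely the more robust choice.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the text nodes and collapse the whitespace left by removed tags
        return ' '.join(' '.join(self._chunks).split())

def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()

html_doc = ("<html><body><h1>Python</h1><p>is an <b>amazing</b> language!</p>"
            "<script>var x = 1;</script></body></html>")
print(strip_html(html_doc))
```

Text nodes are joined with a space, so a word that is split by an inline tag would come out as two tokens. That trade-off keeps the sketch short.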

## Tokenizing Text

In [62]:
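def tokenize_text(text):
    # Split into sentences first, then split each sentence into word tokens
    sentences = nltk.sent_tokenize(text)
    word_tokens = [nltk.word_tokenize(sentence) for sentence in sentences]
    return word_tokens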

In [63]:
token_list = [tokenize_text(text) 
              for text in corpus]
pprint(token_list)

[[['The',
   'brown',
   'fox',
   'was',
   "n't",
   'that',
   'quick',
   'and',
   'he',
   'could',
   "n't",
   'win',
   'the',
   'race']],
 [['Hey', 'that', "'s", 'a', 'great', 'deal', '!'],
  ['I', 'just', 'bought', 'a', 'phone', 'for', '$', '199']],
 [['@',
   '@',
   'You',
   "'ll",
   '(',
   'learn',
   ')',
   'a',
   '**lot**',
   'in',
   'the',
   'book',
   '.'],
  ['Python', 'is', 'an', 'amazing', 'language', '!'],
  ['@', '@']]]


## Removing Special Characters

In [74]:
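def remove_characters_after_tokenization(tokens):
    # Strip punctuation from every token, then drop tokens that end up empty
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_tokens = [t for t in (pattern.sub('', token) for token in tokens) if t]
    return filtered_tokens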

In [76]:
filtered_list_1 =  [list(filter(None,[remove_characters_after_tokenization(tokens) 
                                for tokens in sentence_tokens])) 
                    for sentence_tokens in token_list]

pprint(filtered_list_1)

[[['The',
   'brown',
   'fox',
   'was',
   'nt',
   'that',
   'quick',
   'and',
   'he',
   'could',
   'nt',
   'win',
   'the',
   'race']],
 [['Hey', 'that', 's', 'a', 'great', 'deal'],
  ['I', 'just', 'bought', 'a', 'phone', 'for', '199']],
 [['You', 'll', 'learn', 'a', 'lot', 'in', 'the', 'book'],
  ['Python', 'is', 'an', 'amazing', 'language']]]


In [77]:
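def remove_characters_before_tokenization(sentence,
                                          keep_apostrophes=False):
    sentence = sentence.strip()
    # With keep_apostrophes=True only a fixed set of symbols is removed;
    # otherwise everything except letters, digits and spaces is removed
    pattern = r'[?|$&*%@()~]' if keep_apostrophes else r'[^a-zA-Z0-9 ]'
    filtered_sentence = re.sub(pattern, '', sentence)
    return filtered_sentence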

In [78]:
filtered_list_2 = [remove_characters_before_tokenization(sentence) 
                    for sentence in corpus]    
pprint(filtered_list_2)

['The brown fox wasnt that quick and he couldnt win the race',
 'Hey thats a great deal I just bought a phone for 199',
 'Youll learn a lot in the book Python is an amazing language']


In [79]:
cleaned_corpus = [remove_characters_before_tokenization(sentence, keep_apostrophes=True) 
                  for sentence in corpus]
pprint(cleaned_corpus)

["The brown fox wasn't that quick and he couldn't win the race",
 "Hey that's a great deal! I just bought a phone for 199",
 "You'll learn a lot in the book. Python is an amazing language!"]


## Expanding Contractions

In [81]:
CONTRACTION_MAP = {
"ain't": "is not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"I'd": "I would",
"I'd've": "I would have",
"I'll": "I will",
"I'll've": "I will have",
"I'm": "I am",
"I've": "I have",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}

In [82]:
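def expand_contractions(sentence, contraction_mapping):
    # Exercise: replace every contraction in sentence with its expansion
    # from contraction_mapping and store the result in expanded_sentence
    return expanded_sentence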
 

In [83]:
expanded_corpus = [expand_contractions(sentence, CONTRACTION_MAP) 
                    for sentence in cleaned_corpus]    
pprint(expanded_corpus)

['The brown fox was not that quick and he could not win the race',
 'Hey that is a great deal! I just bought a phone for 199',
 'You will learn a lot in the book. Python is an amazing language!']
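One possible way to complete the expand_contractions exercise is sketched below. It is an assumption about the approach, not the reference solution: a single regular expression is built from the mapping keys, and the first character of each match is kept to preserve its case. `DEMO_CONTRACTIONS` is a small subset of CONTRACTION_MAP, repeated here only so that the sketch runs on its own.

```python
import re

# Small subset of CONTRACTION_MAP, repeated so the sketch is self-contained
DEMO_CONTRACTIONS = {
    "wasn't": "was not",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "that's": "that is",
    "you'll": "you will",
}

def expand_contractions(sentence, contraction_mapping):
    # Try longer keys first so "couldn't've" is not cut short by "couldn't"
    keys = sorted(contraction_mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys),
                         flags=re.IGNORECASE)

    def expand_match(match):
        contraction = match.group(0)
        expanded = (contraction_mapping.get(contraction)
                    or contraction_mapping.get(contraction.lower()))
        # Keep the first character so "You'll" becomes "You will"
        return contraction[0] + expanded[1:]

    return pattern.sub(expand_match, sentence)

print(expand_contractions("You'll see that's fine; he couldn't've won",
                          DEMO_CONTRACTIONS))
```

The pattern has no word-boundary checks, so with the full map a key could in principle match inside a longer word. Adding `\b` anchors would be one way to tighten it.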


## Case Conversions

In [84]:
# case conversion    
print(corpus[0].lower())
print(corpus[0].upper())

the brown fox wasn't that quick and he couldn't win the race
THE BROWN FOX WASN'T THAT QUICK AND HE COULDN'T WIN THE RACE


## Removing Stopwords

In [85]:
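# removing stopwords
def remove_stopwords(tokens):
    # Drop tokens found in NLTK's English stopword list (case-sensitive match)
    stopword_list = nltk.corpus.stopwords.words('english')
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    return filtered_tokens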

In [86]:
expanded_corpus_tokens = [tokenize_text(text)
                          for text in expanded_corpus]    
filtered_list_3 =  [[remove_stopwords(tokens) 
                        for tokens in sentence_tokens] 
                        for sentence_tokens in expanded_corpus_tokens]
print(filtered_list_3)

[[['The', 'brown', 'fox', 'quick', 'could', 'win', 'race']], [['Hey', 'great', 'deal', '!'], ['I', 'bought', 'phone', '199']], [['You', 'learn', 'lot', 'book', '.'], ['Python', 'amazing', 'language', '!']]]


## Correcting Words
* Correcting Repeating Characters
* Correcting Spellings

### Correcting Repeating Characters

In [87]:
# removing repeated characters
sample_sentence = 'My schooool is realllllyyy amaaazingggg'
sample_sentence_tokens = tokenize_text(sample_sentence)[0]

In [88]:
from nltk.corpus import wordnet

In [89]:
# Hint: wordnet.synsets() tells you whether a word is known
def remove_repeated_characters(tokens):
    # Exercise: build correct_tokens by trimming repeated characters
    return correct_tokens

In [90]:
pprint(remove_repeated_characters(sample_sentence_tokens)) 

['My', 'school', 'is', 'really', 'amazing']
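A possible solution sketch for the repeated-characters exercise follows. To stay runnable without the WordNet corpus, it checks words against a small hypothetical vocabulary (`KNOWN_WORDS`) instead of calling wordnet.synsets(). The `vocabulary` parameter is likewise an addition for illustration.

```python
import re

# Hypothetical stand-in for a real dictionary lookup such as wordnet.synsets()
KNOWN_WORDS = {'my', 'school', 'is', 'really', 'amazing'}

# Matches a word that contains the same character twice in a row
REPEAT_PATTERN = re.compile(r'(\w*)(\w)\2(\w*)')

def remove_repeated_characters(tokens, vocabulary=KNOWN_WORDS):
    def shrink(word):
        # Stop as soon as the word is recognised, so "school" keeps its "oo"
        if word.lower() in vocabulary:
            return word
        # Drop one character from one doubled pair and try again
        shorter = REPEAT_PATTERN.sub(r'\1\2\3', word)
        return shrink(shorter) if shorter != word else shorter
    return [shrink(token) for token in tokens]

sample_tokens = ['My', 'schooool', 'is', 'realllllyyy', 'amaaazingggg']
print(remove_repeated_characters(sample_tokens))
```

Removing one character per step and checking the vocabulary after every step is what keeps legitimate double letters intact.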


### Correcting Spellings

In [91]:
import re, collections

def tokens(text): 
    """
    Get all words from the corpus
    """
    pass  # Exercise: return all words found in text


In [94]:
fn = "data/big.txt"
WORDS = tokens(open(fn).read())
WORD_COUNTS = collections.Counter(WORDS)

# top 10 words in corpus
print(WORD_COUNTS.most_common(10))

[('the', 80030), ('of', 40025), ('and', 38313), ('to', 28766), ('in', 22050), ('a', 21155), ('that', 12512), ('he', 12401), ('was', 11410), ('it', 10681)]


In [95]:
def known(words):
    """
    Return the subset of words that are actually 
    in our WORD_COUNTS dictionary.
    """
    pass  # Exercise: return the words that occur in WORD_COUNTS

In [96]:
def edits0(word): 
    """
    Return all strings that are zero edits away 
    from the input word (i.e., the word itself).
    """
    pass  # Exercise: return a set containing only the word itself

In [97]:
def edits1(word):
    """
    Return all strings that are one edit away 
    from the input word.
    """
    # Exercise: build the deletes, transposes, replaces and inserts lists
    return set(deletes + transposes + replaces + inserts)

In [98]:
def edits2(word):
    """Return all strings that are two edits away 
    from the input word.
    """
    pass  # Exercise: apply edits1 to every result of edits1(word)

In [99]:
def correct(word):
    """
    Get the best correct spelling for the input word
    """
    # Priority is for edit distance 0, then 1, then 2
    # else defaults to the input word itself.
    # Exercise: pick the most frequent candidate
    pass

In [100]:
def correct_match(match):
    """
    Spell-correct word in match, 
    and preserve proper upper/lower/title case.
    """
    
    pass  # Exercise: correct match.group() and restore its original casing

In [101]:
def correct_text_generic(text):
    """
    Correct all the words within a text, 
    returning the corrected text.
    """
    pass  # Exercise: apply correct_match to every word in text

In [102]:
print(correct_text_generic('fianlly'))

finally
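One possible way to fill in the spelling-correction stubs is sketched here, in the style of Peter Norvig's well-known spelling corrector. `DEMO_TEXT` is a tiny made-up corpus standing in for data/big.txt, so the word counts are illustrative only.

```python
import re
import collections

# Tiny made-up corpus standing in for data/big.txt
DEMO_TEXT = "finally the race was over and the fox finally won the race"

def tokens(text):
    """Get all lowercase words from the corpus."""
    return re.findall(r'[a-z]+', text.lower())

WORD_COUNTS = collections.Counter(tokens(DEMO_TEXT))

def known(words):
    """Subset of words that appear in WORD_COUNTS."""
    return {w for w in words if w in WORD_COUNTS}

def edits0(word):
    return {word}

def edits1(word):
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    # Every way of splitting the word into a prefix and a suffix
    pairs = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in pairs if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in pairs if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in pairs for c in alphabet if b]
    inserts = [a + c + b for a, b in pairs for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

def correct(word):
    # Prefer the word itself, then one edit away, then two edits away
    candidates = (known(edits0(word)) or known(edits1(word))
                  or known(edits2(word)) or {word})
    return max(candidates, key=lambda w: WORD_COUNTS[w])

def correct_match(match):
    word = match.group()
    corrected = correct(word.lower())
    # Re-apply the casing style of the original word
    if word.isupper():
        return corrected.upper()
    if word.istitle():
        return corrected.title()
    return corrected

def correct_text_generic(text):
    return re.sub(r'[a-zA-Z]+', correct_match, text)

print(correct_text_generic('fianlly'))
print(correct_text_generic('Teh fox'))
```

The `or` chain in `correct` is what gives smaller edit distances priority: candidates two edits away are only generated when nothing closer is found in the corpus.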


## Stemming

<img src="figures/cap1.png" width=400 />

In [110]:
# porter stemmer
from nltk.stem import PorterStemmer
ps = PorterStemmer()

In [111]:
print(ps.stem('jumping'), ps.stem('jumps'), ps.stem('jumped'))
print(ps.stem('lying'))
print(ps.stem('strange'))

jump jump jump
lie
strang


In [112]:
# lancaster stemmer
from nltk.stem import LancasterStemmer
ls = LancasterStemmer()

In [113]:
print(ls.stem('jumping'), ls.stem('jumps'), ls.stem('jumped'))
print(ls.stem('lying'))
print(ls.stem('strange'))

jump jump jump
lying
strange


In [114]:
# regex stemmer
from nltk.stem import RegexpStemmer
rs = RegexpStemmer('ing$|s$|ed$', min=4)

In [115]:
print(rs.stem('jumping'), rs.stem('jumps'), rs.stem('jumped'))
print(rs.stem('lying'))
print(rs.stem('strange'))

jump jump jump
ly
strange
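The RegexpStemmer results above can be reproduced with a few lines of plain Python. This is a rough sketch of the idea, assuming the stemmer simply deletes whatever the pattern matches from words that have at least `min` characters. `regexp_stem` and `min_length` are hypothetical names.

```python
import re

def regexp_stem(word, pattern=r'ing$|s$|ed$', min_length=4):
    """Strip a matching suffix from words of at least min_length characters."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, '', word)

for word in ['jumping', 'jumps', 'jumped', 'lying', 'strange', 'is']:
    print(word, '->', regexp_stem(word))
```

The 'lying' -> 'ly' result shows the limitation of purely pattern-based stemming: the rule has no notion of whether the remainder is a meaningful stem.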


In [116]:
# snowball stemmer
from nltk.stem import SnowballStemmer
ss = SnowballStemmer("german")

In [118]:
print('Supported Languages:', SnowballStemmer.languages)

# autobahnen -> highways
# autobahn -> highway
print(ss.stem('autobahnen'))

# springen -> jumping
# spring -> jump
print(ss.stem('springen'))

Supported Languages: ('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
autobahn
spring


## Lemmatization

In [119]:
# lemmatization
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

In [120]:
# lemmatize nouns
print(wnl.lemmatize('cars', 'n'))
print(wnl.lemmatize('men', 'n'))

car
men


In [121]:
# lemmatize verbs
print(wnl.lemmatize('running', 'v'))
print(wnl.lemmatize('ate', 'v'))

run
eat


In [122]:
# lemmatize adjectives
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('fancier', 'a'))

sad
fancy


In [123]:
# ineffective lemmatization
print(wnl.lemmatize('ate', 'n'))
print(wnl.lemmatize('fancier', 'v'))

ate
fancier


# Understanding Text Syntax and Structure
* Installing Necessary Dependencies
* Important Machine Learning Concepts
* Parts of Speech (POS) Tagging
* Shallow Parsing
* Dependency-based Parsing
* Constituency-based Parsing

We will focus on implementing the following techniques:
* Parts of speech (POS) tagging
* Shallow parsing
* Dependency-based parsing
* Constituency-based parsing

## Installing Necessary Dependencies

* The nltk library, preferably version 3.1 or 3.2.1
* The spacy library
* The pattern library
* The Stanford parser
* Graphviz and its supporting libraries

## Important Machine Learning Concepts

## Parts of Speech (POS) Tagging

## Shallow Parsing

## Dependency-based Parsing

## Constituency-based Parsing

# References
* [1] Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from your Data - https://www.amazon.com/Text-Analytics-Python-Real-World-Actionable/dp/148422387X/
* [2] (github) dipanjanS/text-analytics-with-python - https://github.com/dipanjanS/text-analytics-with-python
* [3] KoNLPy: Korean NLP in Python - http://konlpy-ko.readthedocs.io/ko/v0.4.3/
* [4] CS 124: From Languages to Information / Winter 2018 / Basic Text Processing - https://web.stanford.edu/class/cs124/lec/textprocessingboth.pdf
* Configuring Stanford Parser and Stanford NER Tagger with NLTK in python on Windows and Linux -  https://blog.manash.me/configuring-stanford-parser-and-stanford-ner-tagger-with-nltk-in-python-on-windows-f685483c374a