#Task 1: Extracting important words from a text (Tokenization-Stopwords)

Write a Python script in Google Colab:
1. Take the whole Pancasila text as input (e.g. from https://en.wikibooks.org/wiki/Indonesian/Texts/Pancasila).
2. Tokenize the input text.
3. List all Indonesian stopwords using the nltk library.
4. Remove the stopwords from the input text.

In [None]:
import nltk
nltk.download("popular")
from nltk.tokenize import sent_tokenize, word_tokenize

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

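**Tokenization** is the process of splitting a text into smaller units called tokens, such as sentences (`sent_tokenize`) or words and punctuation marks (`word_tokenize`).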

In [None]:
example_text = "Ketuhanan Yang Maha Esa, Kemanusiaan Yang Adil dan Beradab, Persatuan Indonesia, Kerakyatan Yang Dipimpin oleh Hikmat Kebijaksanaan, Dalam Permusyawaratan dan Perwakilan, Keadilan Sosial bagi seluruh Rakyat Indonesia"

# sent_tokenize: split the text into sentences
sentences = sent_tokenize(example_text)
print('sent_tokenize :', sentences)
# word_tokenize: split the text into words and punctuation marks
words = word_tokenize(example_text)
print('word_tokenize :', words)

sent_tokenize : ['Ketuhanan Yang Maha Esa, Kemanusiaan Yang Adil dan Beradab, Persatuan Indonesia, Kerakyatan Yang Dipimpin oleh Hikmat Kebijaksanaan, Dalam Permusyawaratan dan Perwakilan, Keadilan Sosial bagi seluruh Rakyat Indonesia']
word_tokenize : ['Ketuhanan', 'Yang', 'Maha', 'Esa', ',', 'Kemanusiaan', 'Yang', 'Adil', 'dan', 'Beradab', ',', 'Persatuan', 'Indonesia', ',', 'Kerakyatan', 'Yang', 'Dipimpin', 'oleh', 'Hikmat', 'Kebijaksanaan', ',', 'Dalam', 'Permusyawaratan', 'dan', 'Perwakilan', ',', 'Keadilan', 'Sosial', 'bagi', 'seluruh', 'Rakyat', 'Indonesia']
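
To make the idea of word-level tokenization concrete, here is a minimal sketch that uses only a regular expression. The helper name `simple_word_tokenize` is hypothetical and this is not how `nltk.word_tokenize` works internally; it only illustrates splitting text into word tokens and separate punctuation tokens.

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+(?:-\w+)* keeps hyphenated words together;
    # [^\w\s] matches one punctuation mark at a time
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

sample = "Ketuhanan Yang Maha Esa, Kemanusiaan Yang Adil dan Beradab"
tokens = simple_word_tokenize(sample)
print(tokens)
# ['Ketuhanan', 'Yang', 'Maha', 'Esa', ',', 'Kemanusiaan', 'Yang', 'Adil', 'dan', 'Beradab']
```

For this sample the result matches the beginning of the `word_tokenize` output above, but the two approaches will likely differ on harder cases such as abbreviations or contractions.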


In general, **stopwords** are words in a language that add little meaning to a sentence (for example, `dan` and `oleh` in Indonesian). In NLP, stopwords are usually removed because they carry little information for analyzing the data.

In [None]:
!pip install PySastrawi

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Stopwords
from nltk.corpus import stopwords
print(stopwords.words('indonesian'))

['ada', 'adalah', 'adanya', 'adapun', 'agak', 'agaknya', 'agar', 'akan', 'akankah', 'akhir', 'akhiri', 'akhirnya', 'aku', 'akulah', 'amat', 'amatlah', 'anda', 'andalah', 'antar', 'antara', 'antaranya', 'apa', 'apaan', 'apabila', 'apakah', 'apalagi', 'apatah', 'artinya', 'asal', 'asalkan', 'atas', 'atau', 'ataukah', 'ataupun', 'awal', 'awalnya', 'bagai', 'bagaikan', 'bagaimana', 'bagaimanakah', 'bagaimanapun', 'bagi', 'bagian', 'bahkan', 'bahwa', 'bahwasanya', 'baik', 'bakal', 'bakalan', 'balik', 'banyak', 'bapak', 'baru', 'bawah', 'beberapa', 'begini', 'beginian', 'beginikah', 'beginilah', 'begitu', 'begitukah', 'begitulah', 'begitupun', 'bekerja', 'belakang', 'belakangan', 'belum', 'belumlah', 'benar', 'benarkah', 'benarlah', 'berada', 'berakhir', 'berakhirlah', 'berakhirnya', 'berapa', 'berapakah', 'berapalah', 'berapapun', 'berarti', 'berawal', 'berbagai', 'berdatangan', 'beri', 'berikan', 'berikut', 'berikutnya', 'berjumlah', 'berkali-kali', 'berkata', 'berkehendak', 'berkeinginan'

In [None]:
from nltk.corpus import stopwords
text = "Ketuhanan Yang Maha Esa, Kemanusiaan Yang Adil dan Beradab, Persatuan Indonesia, Kerakyatan Yang Dipimpin oleh Hikmat Kebijaksanaan, Dalam Permusyawaratan dan Perwakilan, Keadilan Sosial bagi seluruh Rakyat Indonesia"
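words = word_tokenize(text)
# Build the stopword set once instead of reloading the list for every token
indonesian_stopwords = set(stopwords.words('indonesian'))
# The comparison is case-sensitive, so capitalized tokens such as 'Yang' are kept
words_without_stopwords = [word for word in words if word not in indonesian_stopwords]
print(words_without_stopwords)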

['Ketuhanan', 'Yang', 'Maha', 'Esa', ',', 'Kemanusiaan', 'Yang', 'Adil', 'Beradab', ',', 'Persatuan', 'Indonesia', ',', 'Kerakyatan', 'Yang', 'Dipimpin', 'Hikmat', 'Kebijaksanaan', ',', 'Dalam', 'Permusyawaratan', 'Perwakilan', ',', 'Keadilan', 'Sosial', 'Rakyat', 'Indonesia']
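
The output above still contains `Yang` and `Dalam`. A likely reason is that the stopword list holds lowercase entries only, so capitalized tokens never match. The sketch below shows a case-insensitive variant. The set `demo_stopwords` is a hypothetical mini list used only so that the example runs without downloads; in the notebook it would be replaced by `set(stopwords.words('indonesian'))`, assuming that list contains these lowercase forms.

```python
# Hypothetical mini stopword list for illustration only;
# in the notebook, use set(stopwords.words('indonesian')) instead.
demo_stopwords = {"yang", "dan", "oleh", "dalam", "bagi", "seluruh"}

def remove_stopwords(tokens, stopword_set):
    """Drop tokens whose lowercase form is a stopword; keep the original casing of the rest."""
    return [tok for tok in tokens if tok.lower() not in stopword_set]

tokens = ["Kerakyatan", "Yang", "Dipimpin", "oleh", "Hikmat", "Kebijaksanaan",
          ",", "Dalam", "Permusyawaratan", "dan", "Perwakilan"]
filtered = remove_stopwords(tokens, demo_stopwords)
print(filtered)
# ['Kerakyatan', 'Dipimpin', 'Hikmat', 'Kebijaksanaan', ',', 'Permusyawaratan', 'Perwakilan']
```

Punctuation tokens such as `,` remain; they could be filtered in the same pass with an extra condition like `tok.isalnum()` if only words are wanted.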


#Task 2: Text classification based on Bag-of-Words

Write a Python script in Google Colab:
1. Define three possible text topics: Health, Sport, Finance.
2. Find texts/articles to serve as the training basis for the three topics.
3. Compute the Bag-of-Words vector for each topic.
4. Classify this text into the correct topic:

“Cristiano Ronaldo came off the bench to earn Manchester United a hard-fought 2-1 victory at Everton in the Premier League on Sunday, taking his career goal tally to 700 in the process. Just as United did last weekend in their derby mauling at the hands of local rivals Manchester City, they again found themselves behind early on at Goodison Park after Alex Iwobi curled a sublime strike into the net from 20 metres.”

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()  # Create the lemmatizer object

# Three training texts, one per labeled topic
text_sport = "by the way Liverpool held off a late charge from Tottenham after Mohamed Salah struck twice to win 2-1 in north London and lift themselves back into the Premier League top-four race. Jurgen Klopp side had suffered shock defeats to relegation-threatened pair Nottingham Forest and Leeds in their last two league outings but started fast against Spurs, with a sharp touch and finish into the bottom corner from in-form Salah giving them the lead on 11 minutes. Ivan Perisic had a header deflected onto the post by Liverpool goalkeeper Alisson and Ryan Sessegnon saw a penalty shout from a challenge by Trent Alexander-Arnold waved away as Spurs came to life - but the Reds struck again just before the break thanks to a gift from Eric Dier. The centre-back miscued a header towards his own goal and Salah (40) raced through to chip in his ninth goal in eight games. Tottenham were sent out early for the second half and - not for the first time this season - were better after the break, with Alisson again pushing a Perisic effort onto the woodwork before Harry Kane (70) fired home a brilliant strike when played in by sub Dejan Kulusevski. Rodrigo Bentancur went close with a couple of headers and Kane glanced wide as Spurs desperately sought another late goal but Liverpool clung on for their first away win in the Premier League this season to move up to eighth and within seven points of fourth-placed Spurs, with a game in hand."
text_medical = "There is growing popularity of the Hospital Incident Command System (HICS) as an organizational tool for hospital management in the COVID-19 pandemic. We specifically describe implementation of HICS at the Isfahan province reference hospital (Isabn-e-Maryam) during the COVID-19 pandemic and try to explore performance of it. Methods: To document the actions taken during the COVID-19 pandemic, standard, open-ended interviews were conducted with individuals occupying activated HICS leadership positions during the event. A checklist based on the job action sheets of the HICS was used for performance assessment. Results: With the onset of the pandemic, hospital director revised ICS structure that adheres to span of better control of COVID-19. Methods of expanding hospital inpatient capacity to enable surge capacity were considered. The highest performance score was in the field of planning. Performance was intermediate in Financial/Administration section and good in other fields. Discussion: In the current COVID-19 pandemic, establishing HICS with some consideration about long-standing events can help improve communication, resource use, staff and patient protection, and maintenance of roles."
text_finance = "According to Fidelity’s Financial Resolutions survey, saving more money was the number one-resolution for respondents. Close to half (43%) said this was a goal they wanted to work toward in the new year. Building a nest egg can help you pay for big ticket items — like a house, a vacation, a wedding or even just an expensive item you really want — without taking on additional debt. Having savings can also come in handy if an emergency expense comes up. If you’re looking to increase your savings in the new year, it helps to start small — even if you’re only transferring $10 a week into your savings account. Starting small helps you build a muscle for saving. This way, when you receive salary bumps, bonuses and gift money, you’ve already gotten into the habit of saving, and you’ll be more likely to transfer that money to your savings account. You may also want to consider automating that process instead of just manually moving money into your savings. Relying on manual transfers leaves a lot of room for procrastination — and before we know it, we’ve spent the money we intended to save. But when you set up automatic transfers into your savings account, you take away the need to make that decision altogether. You can usually schedule automatic transfers through your bank’s mobile app. Lastly, if you want to see your savings grow just a little faster, you can opt for a high-yield savings account instead of a traditional savings account. High-yield savings accounts — like the Marcus by Goldman Sachs Online Savings Account or the Ally Online Savings Account — pay you more in interest each month compared to traditional savings accounts."

texts = [text_sport, text_medical, text_finance]
english_stopwords = set(stopwords.words('english'))  # Build the set once
bow_keys = []
corpus_texts = []
for text in texts:
    words = word_tokenize(text)
    # Drop stopwords and lemmatize the rest; a separate name avoids shadowing `texts`
    tokens = [lemmatizer.lemmatize(word) for word in words if word not in english_stopwords]
    bow_keys += tokens
    corpus_texts.append(' '.join(tokens))
bow_keys = set(bow_keys)
print(bow_keys)       # Vocabulary of the cleaned training texts
print(corpus_texts)   # Cleaned training texts

{'try', 'structure', '43', 'corner', 'two', 'Rodrigo', 'post', 'Ivan', 'With', 'last', 'Reds', 'break', 'assessment', 'used', 'held', 'centre-back', 'respondent', 'Bentancur', 'fast', 'taking', 'System', 'lead', 'instead', 'little', 'province', 'spent', 'relegation-threatened', 'checklist', '-', 'interest', 'opt', 'enable', 'The', 'individual', 'close', 'patient', 'season', '40', 'establishing', 'communication', 'make', 'leadership', 'gotten', 'big', 'transferring', 'strike', 'According', 'manually', 'expense', 'field', 'sent', 'planning', 'section', 'manual', 'increase', 'surge', 'Ryan', 'bump', 'Spurs', 'point', 'open-ended', 'popularity', 'another', 'Kulusevski', 'You', 'money', 'expanding', 'eight', 'Account', 'may', 'league', 'pushing', ')', 'toward', 'know', 'bank', 'item', 'automatic', 'growing', 'current', 'bottom', 'Isfahan', 'number', 'capacity', 'muscle', 'year', 'Marcus', 'better', 'Mohamed', 'month', 'lift', 'implementation', 'fired', 'woodwork', 'sought', 'finish', 'glanc

In [None]:
# A new text to be classified based on topic
query_text = "“Cristiano Ronaldo came off the bench to earn Manchester United a hard-fought 2-1 victory at Everton in the Premier League on Sunday, taking his career goal tally to 700 in the process. Just as United did last weekend in their derby mauling at the hands of local rivals Manchester City, they again found themselves behind early on at Goodison Park after Alex Iwobi curled a sublime strike into the net from 20 metres.”"
query_words = word_tokenize(query_text)
english_stopwords = set(stopwords.words('english'))
query_words_clean = [lemmatizer.lemmatize(word) for word in query_words if word not in english_stopwords]
# Keep only the words that also occur in the training vocabulary
query_words_corpus = [word for word in query_words_clean if word in bow_keys]
query_text_corpus = ' '.join(query_words_corpus)
corpus_texts.append(query_text_corpus)

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()  # Create the CountVectorizer object
bow_vectors = cv.fit_transform(corpus_texts).toarray()  # One row per text; the last row is the query
print(bow_vectors)
print(len(bow_vectors[0]))  # Vocabulary size

[[0 1 0 ... 0 0 0]
 [0 0 5 ... 0 0 0]
 [1 0 0 ... 2 2 2]
 [0 0 0 ... 0 0 0]]
330
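
To show what a Bag-of-Words vector contains, the following sketch builds one by hand with `collections.Counter`. The two toy documents and the helper `bow_vector` are made up for illustration. `CountVectorizer` additionally lowercases and tokenizes the text itself, so this is a simplified picture, not a reimplementation.

```python
from collections import Counter

# Toy documents (hypothetical), already cleaned and space-separated
docs = {
    "sport": "goal win league goal",
    "finance": "money savings account money",
}

# The vocabulary is the sorted set of all words across the documents
vocabulary = sorted({word for text in docs.values() for word in text.split()})

def bow_vector(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

vectors = {topic: bow_vector(text, vocabulary) for topic, text in docs.items()}
print(vocabulary)          # ['account', 'goal', 'league', 'money', 'savings', 'win']
print(vectors["sport"])    # [0, 2, 1, 0, 0, 1]
print(vectors["finance"])  # [1, 0, 0, 2, 1, 0]
```

Each position in a vector corresponds to one vocabulary word, which is why all vectors in the notebook output have the same length.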


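Classification based on maximum similarity

Each BoW vector is scaled to unit length, so the dot product of two vectors equals their cosine similarity. The query is assigned to the topic with the highest similarity.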

In [None]:
# Normalize the BoW vectors to unit length
# (assumes every vector has at least one nonzero count; otherwise length is 0)
bow_texts_norm = []
for bow in bow_vectors:
    length = (sum(i*i for i in bow)) ** 0.5
    bow_norm = bow / length
    bow_texts_norm.append(bow_norm)

# Compute the similarity between the query (last row) and each topic using the dot product
similarity_vector = []
bow_norm_query = bow_texts_norm[3]
for bow in bow_texts_norm[:3]:
    similarity_vector.append(sum(i*j for i, j in zip(bow, bow_norm_query)))
print(similarity_vector)

# Find the topic with the highest similarity
topics = ["Sport", "Medical", "Finance"]  # Same order as `texts`
id_max_sim = similarity_vector.index(max(similarity_vector))
print("The query text is classified as:", topics[id_max_sim])

[0.2722550403896021, 0.0, 0.04234180777032276]
The query text is classified as: Sport
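
The normalization and dot product used above can also be wrapped in one function. The sketch below is a plain-Python version under the assumption that vectors are lists of counts. The function name, the toy vectors, and the zero-vector guard are additions for illustration; the guard covers the case where a query shares no words with the vocabulary, which would otherwise lead to a division by zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors; 0.0 if either is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # No words in common with the vocabulary
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

# Toy BoW vectors (hypothetical), one per topic
topic_vectors = {
    "Sport":   [2, 1, 0, 0],
    "Medical": [0, 0, 3, 0],
    "Finance": [0, 0, 0, 2],
}
query = [1, 1, 0, 0]

scores = {topic: cosine_similarity(vec, query) for topic, vec in topic_vectors.items()}
best_topic = max(scores, key=scores.get)
print(scores)
print("The query text is classified as:", best_topic)
```

Because cosine similarity ignores vector length, a long training article and a short query can still be compared fairly.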


#Task 3: Understanding the challenge of NLP

Briefly explain the challenges that make it difficult for NLP to reach human-level understanding.

#Answer

First, the same words and phrases can have different meanings depending on the context of the sentence and the surrounding words. Second, synonyms can cause problems because we use many different words to express the same idea. Third, irony and sarcasm present a problem for machine learning models because they commonly use words and phrases that, by definition, may be positive or negative, but actually connote the opposite. Fourth, there is informal language. Everyday expressions may not have a "dictionary definition" at all, and they may even mean different things in different geographic areas. In addition, slang is constantly changing and developing, so new words appear all the time. Together, these factors make it difficult for NLP to reach human-level understanding.
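
The first two challenges can be illustrated with a word-overlap check like the one behind the Bag-of-Words classifier in Task 2. This is a minimal sketch with made-up sentences and a hypothetical helper `token_overlap`; it only shows that matching on surface words misses synonyms and cannot tell word senses apart.

```python
def token_overlap(a, b):
    """Return the set of lowercase words that two sentences share."""
    return set(a.lower().split()) & set(b.lower().split())

# Synonyms: similar meaning, but no shared words
synonym_overlap = token_overlap("the doctor is wealthy", "a physician has money")

# Ambiguity: 'bank' is shared, but refers to different things
ambiguity_overlap = token_overlap("deposit cash at the bank", "fishing on the river bank")

print(len(synonym_overlap))         # 0
print("bank" in ambiguity_overlap)  # True
```

A purely count-based model would rate the first pair as unrelated and the second pair as related, which is the opposite of what a human reader would conclude.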
