# Natural Language Processing. Lab 1
**Professor**: Vladimir Ivanov

**Teaching Assistant**: Aidar Valeev

**Labs**:
- Wednesdays, 16:10 - 17:40
- Mondays, 16:10 - 17:40
- you can visit either one

**Deadline**: Tuesday, 31 January 2023, 11:59 PM

**Grading**:
- Solve any two tasks to get the full lab point.
- Solve 3 or 4 tasks for an extra 0.5 or 1 bonus point, respectively.
- Labs are part of Lab participation; one lab weighs ~0.5% of the course total.

## Task 1
Write a Python program that does the following:
1. Retrieve the content of an English Wikipedia page on a topic of your choice.
  - Raw text or HTML format: it is up to you.
  - You can copy and paste it manually.
2. Retrieve the content of a Wikipedia page on a topic of your choice in another language of your choice (e.g. Russian, French) (optional).
3. Preprocess the data.
  - Apply any transformations you need to accomplish part 4.
4. Print:
  - distinct word statistics
  - number of chapters
  - number of sentences
  - number of numerical values
  - number of entities with links (optional)

In [69]:
! pip install wikipedia



In [70]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

import nltk
from nltk.stem import PorterStemmer
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\bouab\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [71]:
source = urlopen('https://en.wikipedia.org/wiki/Russia').read()

# Make a soup (name the parser explicitly to avoid bs4's parser-guessing warning)
soup = BeautifulSoup(source, 'html.parser')

# count the <p> elements (these are paragraphs, not chapters;
# the section headings are collected in `heads` below)
print("Number of paragraphs :", len(soup.find_all('p')))

# extract paragraphs
paras = [str(p.text) for p in soup.find_all('p')]

# extract section headings
heads = [str(h.text) for h in soup.find_all('span', class_='mw-headline')]

# interleave paragraphs and headings, one per line
# (zip stops at the shorter of the two lists)
text = "\n".join([val for pair in zip(paras, heads) for val in pair])

# remove bracketed markers such as citation references, e.g. [1]
text = re.sub(r"\[.*?\]+", '', text)

# replace every whitespace character (newline, tab, ...) with a plain space
text = re.sub(r'\s', ' ', text)

text = text[:100000] # limit the text to 100000 characters

print(text[:1000])

Number of paragraphs : 130
     Pages for logged out editors learn more  Etymology   History Russia (Russian: Россия, Rossiya, ), or the Russian Federation, is a transcontinental country spanning Eastern Europe and Northern Asia. It is the largest country in the world, with its internationally recognised territory covering 17,098,246 square kilometres (6,601,670 sq mi), and encompassing one-eighth of Earth's inhabitable landmass. Russia extends across eleven time zones and shares land boundaries with fourteen countries. It is the world's ninth-most populous country and Europe's most populous country, with a population of over 147 million people. The country's capital and largest city is Moscow. Saint Petersburg is Russia's cultural centre and second-largest city. Other major urban areas include Novosibirsk, Yekaterinburg, Nizhny Novgorod, and Kazan.  Early history The East Slavs emerged as a recognisable group in Europe between the 3rd and 8th centuries CE. The first East Slavic state, K

In [72]:
# tokenize the text with the NLTK word tokenizer
words = nltk.tokenize.word_tokenize(text)
# number of tokens (punctuation marks are counted as tokens too)
print("the count of words: ", len(words)) 
# stem the words
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(w) for w in words]

# 4.1 statistics for distinct words (stemmed ones)
fdist1 = nltk.FreqDist(stemmed_words)
# keep alphabetic stems longer than one character
filtered_word_freq = {word: freq for word, freq in fdist1.items() if word.isalpha() and len(word) > 1}
print(filtered_word_freq)

the count of words:  5774
{'page': 1, 'for': 18, 'log': 1, 'out': 5, 'editor': 1, 'learn': 1, 'more': 3, 'etymolog': 1, 'histori': 5, 'russia': 61, 'russian': 63, 'россия': 1, 'rossiya': 1, 'or': 4, 'the': 538, 'feder': 5, 'is': 10, 'transcontinent': 1, 'countri': 22, 'span': 2, 'eastern': 11, 'europ': 13, 'and': 188, 'northern': 5, 'asia': 3, 'it': 38, 'largest': 8, 'in': 167, 'world': 21, 'with': 27, 'intern': 6, 'recognis': 3, 'territori': 7, 'cover': 1, 'squar': 1, 'kilometr': 1, 'sq': 1, 'mi': 1, 'encompass': 1, 'of': 235, 'earth': 2, 'inhabit': 2, 'landmass': 1, 'extend': 4, 'across': 5, 'eleven': 1, 'time': 5, 'zone': 1, 'share': 1, 'land': 5, 'boundari': 2, 'fourteen': 2, 'popul': 9, 'most': 8, 'over': 11, 'million': 16, 'peopl': 6, 'capit': 6, 'citi': 5, 'moscow': 14, 'saint': 2, 'petersburg': 2, 'cultur': 5, 'centr': 3, 'other': 6, 'major': 8, 'urban': 1, 'area': 2, 'includ': 10, 'novosibirsk': 1, 'yekaterinburg': 2, 'nizhni': 1, 'novgorod': 9, 'kazan': 2, 'earli': 14, 'east'
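
The cells above print the word statistics only. The remaining items from the task list (chapters, sentences, numerical values) could be sketched roughly as follows. This is a minimal illustration under several assumptions: it uses the standard-library `html.parser` instead of BeautifulSoup, it treats every `<h2>` element as one chapter, and it splits sentences with a naive regular expression. The names `HeadingCounter`, `page_stats` and `sample_html` are hypothetical and do not come from the lab.

```python
import re
from html.parser import HTMLParser


class HeadingCounter(HTMLParser):
    """Counts <h2> elements and collects the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.n_headings = 0
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.n_headings += 1      # assumption: one <h2> per chapter
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data


def page_stats(html):
    parser = HeadingCounter()
    parser.feed(html)
    text = " ".join(parser.paragraphs)
    # naive sentence split: break after '.', '!' or '?' followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # numbers: digit runs, optionally joined by ',' or '.' (e.g. 17,098,246)
    numbers = re.findall(r"\d+(?:[.,]\d+)*", text)
    return {
        "chapters": parser.n_headings,
        "sentences": len(sentences),
        "numbers": len(numbers),
    }


sample_html = """
<h2>History</h2>
<p>Russia spans 11 time zones. Its area is 17,098,246 square kilometres.</p>
<h2>Geography</h2>
<p>It borders 14 countries!</p>
"""
stats = page_stats(sample_html)
print(stats)
```

On a real Wikipedia page the heading markup may differ, and abbreviations such as "sq mi." would fool the sentence regex, so a tokenizer like `nltk.sent_tokenize` is probably the better choice there.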

## Task 2

Write a Python program that does the following:
1. Retrieve the data from sklearn (`from sklearn.datasets import fetch_20newsgroups`).
2. Preprocess the data.
3. Do classification using any classical machine learning method.
  - `TfidfVectorizer` from sklearn might be a good choice.

In [73]:
# first retrieve data from sklearn 
from sklearn.datasets import fetch_20newsgroups
# define the categories to be used
categories = ['alt.atheism','comp.graphics', 'sci.med']
# load the training subset for these categories
train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=11)

In [74]:
# let's first try the tokenization mechanisms provided by scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
count_vec = CountVectorizer()
train_counts = count_vec.fit_transform(train.data)
print(f"counts: {train_counts.shape}")

# let's convert the raw counts to TF-IDF weights using the tfidf transformer
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
train_freq = transformer.fit_transform(train_counts) 
print(f"frequencies: {train_freq.shape}")

counts: (1658, 30277)
frequencies: (1658, 30277)
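
To make the counts-to-TF-IDF step less of a black box, here is a small pure-Python sketch of the computation. It assumes the smoothed variant `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, which is how scikit-learn documents its default settings. The whitespace tokenizer and the function name `tfidf` are simplifications introduced for this illustration only.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF weights with smoothed IDF and L2-normalised rows (one dict per document)."""
    tokenized = [doc.lower().split() for doc in docs]   # assumption: whitespace tokens
    n_docs = len(tokenized)
    # document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:                                    # leave empty documents empty
            weights = {term: w / norm for term, w in weights.items()}
        rows.append(weights)
    return rows


docs = ["the cat sat", "the dog sat", "the bird flew"]
rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, {term: round(w, 3) for term, w in row.items()})
```

A term that occurs in every document ("the") receives the lowest weight, while a term unique to one document ("cat") receives the highest, which is the property the classifier benefits from.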


In [75]:
# let's put everything into a pipeline as it is easier to work with
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

text_clf = Pipeline([('count', CountVectorizer()), 
        ('tfidf', TfidfTransformer()),
        ('classifier', LogisticRegression())])
text_clf.fit(train.data, train.target)


In [76]:
# time to test the pipeline
from sklearn.metrics import balanced_accuracy_score
# first import test dataset
test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=11)
pred = text_clf.predict(test.data)

print(f"balanced accuracy for the current pipeline on the test data is: {round(balanced_accuracy_score(test.target, pred), 3)}")

balanced accuracy for the current pipeline on the test data is: 0.913
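
Balanced accuracy is the mean of the per-class recalls, so a class with few samples counts as much as a large one. The sketch below recomputes the metric by hand on made-up labels; the function name `balanced_accuracy` and the toy data are illustrative assumptions, not part of the lab.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Average the recall of each class that occurs in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# toy labels: class 0 has four samples, class 1 has two
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(f"plain accuracy: {plain:.3f}, balanced accuracy: {balanced:.3f}")
```

Here the recalls are 3/4 and 1/2, so the balanced accuracy is 0.625, slightly below the plain accuracy of 4/6.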


## Task 3

Write a Python program that does the following:
1. Preprocess the given data
2. Find entities in the data using regular expressions: dates, names, locations
3. Anonymise the names of US presidents
4. Highlight the locations
5. Sort by dates

In [77]:
TEXT = [
    "Barack Obama was the 44th president of the US and he followed George W. Bush and was followed by Donald Trump in 2017",
    "As a young man, George H.W. Bush served in World War II as a fighter pilot. In 1944, he was shot down and had to parachute to safety.",
    "Before he was president, George W. Bush was a cheerleader, a fraternity brother, an oilman, an owner of a professional baseball team, and a governor. After leaving office in 2009, Bush learned to paint.",
    "Here's something else you probably didn't know about John Adams: He died on the Fourth of July. And he wasn't the only commander in chief to do so. In fact, three of the nation's five founding fathers—Adams, Thomas Jefferson, and James Monroe—died on Independence Day. Adams and Jefferson even passed on the same exact day: July 4, 1826, which happened to be the 50th anniversary of the adoption of the Declaration of Independence.",
    "At 6 feet 4 inches tall, Abraham Lincoln and Lyndon B. Johnson were America's tallest presidents. But what about America's shortest president? That distinction goes to founding father James Madison (1809-1817), who, at 5 feet 4 inches tall, was a full foot shorter than his tallest peers.",
    "That changed, however, in October 1860, when Lincoln received a letter from an 11-year-old girl named Grace Bedell. 'If you will let your whiskers grow I will try and get [my brothers] to vote for you,' Bedell wrote to Lincoln. 'You would look a great deal better for your face is so thin. All the ladies like whiskers and they would tease their husbands to vote for you and then you would be president.'",
    "Richard Nixon was hardly the first president who liked to unwind by rolling a few strikes. Harry S. Truman also enjoyed bowling, and opened the first White House bowling alley in 1947. ",
    "If you had to bet on which U.S. president was the biggest movie fan, you'd probably put your money on America's actor-turned-president, Ronald Reagan (1981-1989). And that would be a great guess. Reagan reportedly watched 363 movies during his two terms in office.",
    "Thomas Jefferson offered to sell his personal library when the Library of Congress was burned by the British during the War of 1812. He sold them 6487 books from his own collection, the largest in America at the time.",
    "Born in New York in 1782, Martin Van Buren was the first president to have been born after the American Revolution, technically making him the first American-born president.",

    "Benjamin Harrison had a tight-knit family and loved to amuse and dote on his grandchildren. He put up the first recorded White House Christmas tree in 1889, and was known to put on the Santa suit for entertainment.",
    "A 16-year-old Bill Clinton managed to shake hands with President John F. Kennedy at a Boys Nation event in 1963. This would take place just four months before Kennedy's assassination.",
    "In 1993—two years before he became the governor of Texas—George W. Bush ran the Houston marathon, finishing with a time of 3:44:52. He is the only president to have ever run a marathon.",
]

In [78]:
import re

def preprocess(text):
    """Drop every character that is not a word character, whitespace or a period, then collapse whitespace runs."""
    new_text = re.sub(r'[^\w\s.]', '', text)
    new_text = re.sub(r'\s+', ' ', new_text)
    
    return new_text

preprocess("Every ### da%y learn,,,ing a new th!!ing.")

'Every day learning a new thing.'

In [79]:
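import re
# let's create regexes for names made out of at least 2 components
# regex_name_atleast_2_words = r'([A-Z]+[a-z]* [A-Z]+[a-z]*\. [A-Z]+[a-z]*)| ([A-Z]+[a-z]* ){1,}([A-Z]+[a-z]*)'
# a name with a middle initial, e.g. "George W. Bush"
regex_name_with_point = r'([A-Z]+[a-z]* [A-Z]+[a-z]*\. [A-Z]+[a-z]*)'
# two capitalised words, e.g. "Barack Obama"
regex_name_2_words = r"[A-Z]+[a-z]* [A-Z]+[a-z]*"
# three or more capitalised words; with re.findall only the last two words survive,
# because a repeated group keeps just its final repetition
regex_name_at_least_3_words = r"([A-Z]+[a-z]* ){2,}([A-Z]+[a-z]*)"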

locations = ["US", "America", "White House", "Library of Congress", "Texas", "New York", 'Houston']

locations_regex = "(" + "|".join(locations) + ")"

print(locations_regex)


(US|America|White House|Library of Congress|Texas|New York|Houston)


In [80]:
names = []
for s in TEXT:
    n1 = set(re.findall(regex_name_2_words, s))
    n2 = set(re.findall(regex_name_at_least_3_words, s))
    n3 = set(re.findall(regex_name_with_point, s))

    # findall returns tuples when the pattern has several groups: join each tuple into one string
    n2 = set(["".join(group) for group in n2])
    
    n3.update(n1)
    n3.update(n2)
    
    names.extend(list(n3))

# let's further improve the results: if one element is a substring of a later element, remove it
# (only later elements are checked, so a short form listed after its long form survives)
copy_names = names.copy()
for i1 in range(len(names)):
    for i2 in range(i1 + 1, len(names)):
        if names[i1] in names[i2]:
            copy_names.remove(names[i1])
            break

# filter some results manually
filtered = ['War II', 'World War', 'Independence Day', 'New York', 'American Revolution', 'White House', 'House Christmas', 'Boys Nation', 'Independence Day. Adams']

final_names = [name for name in copy_names if name not in filtered]

print(final_names)


['Barack Obama', 'Donald Trump', 'George H', 'John Adams', 'James Monroe', 'Lyndon B. Johnson', 'Abraham Lincoln', 'Lyndon B', 'James Madison', 'Grace Bedell', 'Richard Nixon', 'Harry S. Truman', 'Ronald Reagan', 'Thomas Jefferson', 'Van Buren', 'Martin Van', 'Benjamin Harrison', 'President John', 'John F. Kennedy', 'Bill Clinton', 'George W. Bush']


In [81]:
for s in TEXT:
    print(s)
    for n in final_names:
        s = s.replace(n, "*" * len(n))

    print("after transformation")
    print(s)


Barack Obama was the 44th president of the US and he followed George W. Bush and was followed by Donald Trump in 2017
after transformation
************ was the 44th president of the US and he followed ************** and was followed by ************ in 2017
As a young man, George H.W. Bush served in World War II as a fighter pilot. In 1944, he was shot down and had to parachute to safety.
after transformation
As a young man, ********.W. Bush served in World War II as a fighter pilot. In 1944, he was shot down and had to parachute to safety.
Before he was president, George W. Bush was a cheerleader, a fraternity brother, an oilman, an owner of a professional baseball team, and a governor. After leaving office in 2009, Bush learned to paint.
after transformation
Before he was president, ************** was a cheerleader, a fraternity brother, an oilman, an owner of a professional baseball team, and a governor. After leaving office in 2009, Bush learned to paint.
Here's something else you p

In [82]:
# let's check the locations
locations = []
for s in TEXT:
    locations.extend(re.findall(locations_regex, s))
print(list(set(locations)))

['America', 'Library of Congress', 'White House', 'US', 'Houston', 'New York', 'Texas']
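
The remaining Task 3 steps (finding dates, highlighting locations, sorting by dates) could be approached along the following lines. This is only a sketch under stated assumptions: a date is reduced to its first four-digit year between 1600 and 2099, highlighting means wrapping the location in square brackets, and the names `YEAR_RE`, `LOC_RE`, `highlight_locations`, `first_year` and the sample sentences are all hypothetical.

```python
import re

# assumption: a "date" is identified by its first four-digit year (1600-2099)
YEAR_RE = re.compile(r"\b(1[6-9]\d{2}|20\d{2})\b")

LOCATIONS = ["US", "America", "White House", "Texas"]
# re.escape guards against regex metacharacters; \b keeps "America" from matching inside "American"
LOC_RE = re.compile(r"\b(" + "|".join(map(re.escape, LOCATIONS)) + r")\b")


def highlight_locations(sentence):
    """Wrap every known location in square brackets."""
    return LOC_RE.sub(r"[\1]", sentence)


def first_year(sentence):
    """Return the first four-digit year as an int, or None if there is none."""
    match = YEAR_RE.search(sentence)
    return int(match.group(1)) if match else None


sample = [
    "Donald Trump took office in 2017 in the US.",
    "The White House bowling alley opened in 1947.",
    "No year is mentioned here.",
]

# sentences without a year cannot be ordered, so they are left out of the sort
dated = [s for s in sample if first_year(s) is not None]
ordered = sorted(dated, key=first_year)
for s in ordered:
    print(first_year(s), highlight_locations(s))
```

Sorting on an integer year rather than on the raw string avoids surprises once the matched text contains month names or ranges such as "1809-1817".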


## Task 4

Write a Python program that does the following:
1. Preprocess the data from Task 3
2. Find entities in the data using [Gazetteers](https://gatenlp.readthedocs.io/en/latest/gazetteers/) from gatenlp: dates, names, locations
3. Anonymise the names of US presidents
4. Highlight the locations
5. Sort by dates
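
The cells that follow use spaCy's named-entity recogniser rather than a gazetteer. For comparison, the gazetteer idea itself (looking up a fixed list of known strings, preferring the longest match) can be sketched in plain Python. Note that this is not gatenlp's API: the dictionary `GAZETTEER` and the helpers `build_matcher`, `annotate` and `anonymise_persons` are hypothetical names used only for this illustration.

```python
import re

# a toy gazetteer: known surface form -> entity type
GAZETTEER = {
    "George H.W. Bush": "PERSON",
    "George W. Bush": "PERSON",
    "Barack Obama": "PERSON",
    "White House": "LOCATION",
    "US": "LOCATION",
}


def build_matcher(gazetteer):
    """Compile one alternation, longest entries first, matching whole words only."""
    entries = sorted(gazetteer, key=len, reverse=True)
    pattern = "|".join(r"(?<!\w)" + re.escape(e) + r"(?!\w)" for e in entries)
    return re.compile(pattern)


def annotate(text, gazetteer):
    """Return (matched text, entity type, start offset) for every gazetteer hit."""
    matcher = build_matcher(gazetteer)
    return [(m.group(0), gazetteer[m.group(0)], m.start()) for m in matcher.finditer(text)]


def anonymise_persons(text, gazetteer, char="*"):
    """Mask PERSON entries with `char`, leaving other entity types untouched."""
    matcher = build_matcher(gazetteer)

    def mask(m):
        if gazetteer[m.group(0)] == "PERSON":
            return char * len(m.group(0))
        return m.group(0)

    return matcher.sub(mask, text)


sentence = "Barack Obama followed George W. Bush in the US"
hits = annotate(sentence, GAZETTEER)
masked = anonymise_persons(sentence, GAZETTEER)
print(hits)
print(masked)
```

A lookup list like this never confuses a location with a person, but it only finds what has been listed in advance, which is the usual trade-off against a statistical recogniser.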


In [83]:
! pip install spacy
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     --------------------------------------- 12.8/12.8 MB 13.3 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [84]:
import spacy
nlp = spacy.load("en_core_web_sm")

def extract_features(text):
    """Return the PERSON and GPE (geopolitical entity) mentions that spaCy finds in `text`."""
    doc = nlp(text)
    ns = [pt.text for pt in doc.ents if pt.label_ == 'PERSON']
    locs = [pt.text for pt in doc.ents if pt.label_ == 'GPE']

    return ns, locs

names = []
locations = []  
for s in TEXT:
    ns, locs = extract_features(s)
    names.extend(ns)
    locations.extend(locs)

In [85]:

# the library sometimes confuses the name of a location with the name of a person, so it is our time to intervene
# let's first define the set of locations

true_locations = ['America', 'Library of Congress', 'White House', 'US', 'New York', 'Texas', 'Houston']

names = [n for n in names if n not in true_locations]

def anonymise(sentence, to_replace, char='*'):
    """Mask every occurrence of each string in `to_replace` with `char`."""
    s = sentence
    for n in to_replace:
        s = s.replace(n, char * len(n))
    return s

for s in TEXT:
    print(s)
    print("after transformation")
    print(anonymise(s, names))


Barack Obama was the 44th president of the US and he followed George W. Bush and was followed by Donald Trump in 2017
after transformation
************ was the 44th president of the US and he followed ************** and was followed by ************ in 2017
As a young man, George H.W. Bush served in World War II as a fighter pilot. In 1944, he was shot down and had to parachute to safety.
after transformation
As a young man, **************** served in World War II as a fighter pilot. In 1944, he was shot down and had to parachute to safety.
Before he was president, George W. Bush was a cheerleader, a fraternity brother, an oilman, an owner of a professional baseball team, and a governor. After leaving office in 2009, Bush learned to paint.
after transformation
Before he was president, ************** was a cheerleader, a fraternity brother, an oilman, an owner of a professional baseball team, and a governor. After leaving office in 2009, **** learned to paint.
Here's something else you p

In [86]:
def retrieve_date(text):
    """Return the DATE mentions that spaCy finds in `text`."""
    doc = nlp(text)
    dates = [pt.text for pt in doc.ents if pt.label_ == 'DATE']

    return dates

dates = []
for s in TEXT:
    dates.extend(retrieve_date(s))
print(dates)
# let's filter the results further:
# keep only the dates that contain a four-digit year

final_dates = [d for d in dates if re.search(r'\d{4}', d)]


# the dates present in the text, sorted by their first four-digit year
print("\ndates filtered and sorted\n")
print(sorted(final_dates, key=lambda x: re.search(r'\d{4}', x).group()))

['US', 'America', 'America', 'U.S.', 'America', 'America', 'New York', 'Texas', 'Houston']
['2017', '1944', '2009', 'the same exact day', 'July 4, 1826', '1809-1817', 'October 1860', '11-year-old', '1947', '1981-1989', '1782', 'Christmas', '1889', '16-year-old', '1963', 'just four months', '1993', 'two years', '3:44:52']

dates filtered and sorted

['1782', '1809-1817', 'July 4, 1826', 'October 1860', '1889', '1944', '1947', '1963', '1981-1989', '1993', '2009', '2017']
