# Basic NLP exercises

* During these exercises, you will learn basic Python skills required in NLP, for example:
  * Reading and processing language data
  * Segmenting text
  * Calculating word frequencies and idf weights

* The exercises are based on tweets downloaded using the Twitter API. Both Finnish and English tweets are available; you are free to choose which language you want to work with.


> Finnish: http://dl.turkunlp.org/intro-to-nlp/finnish-tweets-sample.jsonl.gz

> English: http://dl.turkunlp.org/intro-to-nlp/english-tweets-sample.jsonl.gz


* Both files include 10,000 tweets. If processing the whole file takes too much time, you can also read just a subset of the data, for example only 1,000 tweets.


## 1) Read tweets in Python

* Download the file, and read the data in Python
* **The outcome of this exercise** should be a list of tweets, where each tweet is a dictionary including different (key, value) pairs

In [None]:
# write your code here and run the cell to see your output
print("Hello world!")
print("A")

Hello world!
A


In [14]:
!wget -nc http://dl.turkunlp.org/intro-to-nlp/english-tweets-sample.jsonl.gz
# write your code here and run the cell to see your output

import gzip
import itertools
import json

# Read only the first 1,000 lines; each line of the .jsonl file is one JSON-encoded tweet
with gzip.open("english-tweets-sample.jsonl.gz", "rt", encoding="utf-8") as f:
    lines = list(itertools.islice(f, 1000))

# Parse each line into a dictionary
tweets = [json.loads(line) for line in lines]

print(len(lines))
print(type(tweets[0]))
print(lines[0])

print(len(tweets))
print(type(tweets[0]))
print(tweets[0])

File ‘english-tweets-sample.jsonl.gz’ already there; not retrieving.

1000
<class 'dict'>
{"created_at":"Tue Dec 26 14:16:22 +0000 2017","id":945659557480611840,"id_str":"945659557480611840","text":"Check out my class in #GranblueFantasy! https:\/\/t.co\/pAvXn8diJr","display_text_range":[0,39],"source":"\u003ca href=\"http:\/\/granbluefantasy.jp\/\" rel=\"nofollow\"\u003e\u30b0\u30e9\u30f3\u30d6\u30eb\u30fc \u30d5\u30a1\u30f3\u30bf\u30b8\u30fc\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":883980236655779840,"id_str":"883980236655779840","name":"Pc Kwok","screen_name":"jensenpck","location":null,"url":null,"description":null,"translator_type":"none","protected":false,"verified":false,"followers_count":0,"friends_count":1,"listed_count":0,"favourites_count":0,"statuses_count":42,"created_at":"Sun Jul 09 09:24:46 +0000 2017","utc_offset":n

In [15]:
tweets

[{'contributors': None,
  'coordinates': None,
  'created_at': 'Tue Dec 26 14:16:22 +0000 2017',
  'display_text_range': [0, 39],
  'entities': {'hashtags': [{'indices': [22, 38], 'text': 'GranblueFantasy'}],
   'media': [{'display_url': 'pic.twitter.com/pAvXn8diJr',
     'expanded_url': 'https://twitter.com/jensenpck/status/945659557480611840/photo/1',
     'id': 945659555123404801,
     'id_str': '945659555123404801',
     'indices': [40, 63],
     'media_url': 'http://pbs.twimg.com/media/DR-oWuWVoAEUZJd.jpg',
     'media_url_https': 'https://pbs.twimg.com/media/DR-oWuWVoAEUZJd.jpg',
     'sizes': {'large': {'h': 512, 'resize': 'fit', 'w': 1024},
      'medium': {'h': 512, 'resize': 'fit', 'w': 1024},
      'small': {'h': 340, 'resize': 'fit', 'w': 680},
      'thumb': {'h': 150, 'resize': 'crop', 'w': 150}},
     'type': 'photo',
     'url': 'https://t.co/pAvXn8diJr'}],
   'symbols': [],
   'urls': [],
   'user_mentions': []},
  'extended_entities': {'media': [{'display_url': 'pic.t

## 2) Extract texts from the tweet jsons

* During these exercises we need only the actual tweet text. Inspect the dictionary and extract the actual text field for each tweet.
* When carefully inspecting the dictionary keys and values, you may see the old Twitter character limit causing unexpected behavior in the text field. In these cases, are you able to extract the full text?
* **The outcome of this exercise** should be a list of tweets, where each tweet is a string.
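One way to approach the truncation question is sketched below. It assumes the sample follows the classic Twitter streaming layout, in which a long tweet has `"truncated": true` and keeps its complete text under `extended_tweet` → `full_text`. The helper name `get_full_text` and the two example dictionaries are made up for illustration; check the keys in your own data before relying on this.

```python
def get_full_text(tweet: dict) -> str:
    """Return the longest available text for a tweet dictionary."""
    # Assumed layout: long tweets keep the complete text in
    # tweet["extended_tweet"]["full_text"]; "text" holds a truncated copy.
    extended = tweet.get("extended_tweet")
    if extended and "full_text" in extended:
        return extended["full_text"]
    return tweet["text"]

# Hypothetical, hand-made examples that mimic the assumed layout.
short_tweet = {"text": "Hello world", "truncated": False}
long_tweet = {
    "text": "A very long tweet that was cut… https://t.co/xyz",
    "truncated": True,
    "extended_tweet": {"full_text": "A very long tweet that was cut off by the old limit"},
}

full_texts = [get_full_text(t) for t in (short_tweet, long_tweet)]
print(full_texts)
```

Retweets are a separate case: their `text` starts with `RT @user:` and may also be cut short, and the original tweet is then usually nested under a `retweeted_status` key, so the same helper could be applied to that nested dictionary.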

In [4]:
# How many documents does the dataset have?
print("Number of documents:", len(tweets))

documents = [document["text"] for document in tweets] # right now we only need the text field for each document
print(len(documents))
print(documents[599])

Number of documents: 1000
1000
#Christmasgift Gift Card in a Gift Bag #nowplaying, #jk, #humanrightsday #ps4share #asksrk, #ikon #giveaway… https://t.co/0pnVck1SQq


## 3) Segment tweets

* Segment tweets using the UDPipe machine-learned model. Remember to select the correct language.

> English model: https://github.com/TurkuNLP/intro-to-nlp/raw/master/Data/en.segmenter.udpipe

> Finnish model: https://github.com/TurkuNLP/intro-to-nlp/raw/master/Data/fi.segmenter.udpipe

* Note that the segmentation model was not trained on tweets, so it may have difficulties in some cases. Inspect the output to get an idea of how well it performs on tweets.
* Note: If the notebook cell dies while trying to load or run the model, the most typical reason is a wrong file path or name, or an incorrectly downloaded model.
* **The output of this exercise** should be a list of segmented tweets, where each tweet is a string.

In [None]:
!wget -nc https://github.com/TurkuNLP/intro-to-nlp/raw/master/Data/en.segmenter.udpipe

!pip3 install ufal.udpipe

import ufal.udpipe as udpipe

model = udpipe.Model.load("en.segmenter.udpipe")
pipeline = udpipe.Pipeline(model,"tokenize","none","none","horizontal") # horizontal: returns one sentence per line, with words separated by a single space

segmented_document = pipeline.process(documents[599])

print(segmented_document)
    


## 4) Calculate word frequencies

* Calculate a word frequency list (how many times each word appears) based on the tweets. Which are the most common words appearing in the data?
* Calculate the size of your vocabulary (how many unique words there are).
* **The output of this exercise** should be a sorted list of X most common words and their frequencies, and the number of unique words in the data.

In [None]:
from collections import Counter

token_counter = Counter()
for doc in documents[:1000]: # Tweets documents
    tokenized = pipeline.process(doc)
    tokens = tokenized.split() # after segmenter, we can do whitespace splitting
    token_counter.update(tokens)

print("Most common tokens:", token_counter.most_common(20))
print("Vocabulary size:", len(token_counter))

Most common tokens: [('@', 717), (':', 716), ('RT', 612), ('.', 360), ('the', 281), ('#', 280), (',', 247), ('…', 244), ('a', 241), ('to', 228), ('I', 203), ('and', 192), ('you', 178), ('of', 152), ('in', 150), ('for', 137), ('is', 137), ('-', 123), ('!', 108), ('on', 97)]
Vocabulary size: 6078


Extra Task: Stop words

In [None]:
import nltk
nltk.download('stopwords') # download the stopwords dataset

from nltk.corpus import stopwords

# take the 150 most common tokens in the tweet data and filter out stop words and punctuation
english_stopwords = set(stopwords.words("english")) # build the set once instead of on every iteration
filtered_tokens = []
punctuation_chars = '. , : ( ) ! ? " = & - ; ... \\ '.split() # list of punctuation symbols to ignore
for word, count in token_counter.most_common(150):
    if word.lower() in english_stopwords or word in punctuation_chars:
        continue
    filtered_tokens.append((word, count))
print("Number of tokens:", len(filtered_tokens))
print("Tokens:", filtered_tokens)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Number of tokens: 58
Tokens: [('@', 717), ('RT', 612), ('#', 280), ('…', 244), ("'s", 79), ('’s', 53), ('Christmas', 44), ('n’t', 40), ('people', 40), ("n't", 40), ('•', 39), ('one', 38), ('amp', 38), ('like', 36), ('love', 32), ('1', 31), ("'", 30), ('“', 28), ('year', 28), ('much', 27), ('2017', 23), ('get', 23), ('/', 22), ('2', 19), ('2018', 19), ('person', 19), ('got', 19), ('life', 19), ('need', 18), ('’m', 18), ('back', 18), ('day', 17), ('”', 17), ('go', 17), ('see', 17), ('family', 17), ('u', 17), ('today', 17), ("'m", 16), ('really', 16), ('ever', 16), ('never', 16), ('even', 16), ('time', 15), ("'re", 15), ('3', 15), ('know', 15), ('’re', 14), ('hate', 14), ('hope', 14), ('last', 14), ('Thank', 13), ('right', 13), ('could', 13), ('girl', 13), ('everyone', 13), ('big', 12), ('10', 12)]


## 5) Calculate idf weights

* Calculate the idf weight for each word appearing in the data (one tweet = one document), and print the top 20 words with the lowest and highest idf values.
* Can you think of a reason why someone could claim that tf does not have a high impact when processing tweets?
* **The output of this exercise** should be a list of words sorted by their idf weights.
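As a reference point, the plain textbook definition idf(w) = log(N / df(w)) can be computed directly from document frequencies. The sketch below uses that unsmoothed formula on three made-up documents; the function name `idf_weights` and the toy data are illustrative assumptions, not part of the exercise material.

```python
import math
from collections import Counter

def idf_weights(tokenized_docs):
    """Map each word to log(N / df), where df is the number of documents containing it."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word at most once per document
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

# Toy documents, already tokenized by whitespace.
toy_docs = [
    "RT the cat sat".split(),
    "RT the dog barked".split(),
    "RT a bird sang".split(),
]
idf = idf_weights(toy_docs)
by_idf = sorted(idf.items(), key=lambda item: item[1])
print("lowest idf:", by_idf[:3])
print("highest idf:", by_idf[-3:])
```

A word that occurs in every document (here `RT`) gets idf 0, and words that occur in a single document get the maximum value. scikit-learn's `TfidfVectorizer` with `smooth_idf=True` uses a slightly different formula, ln((1 + N) / (1 + df)) + 1, so its numbers will not match these exactly, although the ordering of words is the same.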


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(smooth_idf=True, use_idf=True)
transformed_documents = vectorizer.fit_transform(documents)

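# Note: row 0 holds the tf-idf weights of the first tweet only, not corpus-level idf values
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
df = pd.DataFrame(transformed_documents[0].T.todense(), index = vectorizer.get_feature_names_out(), columns = ['IDF'])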
df = df.sort_values('IDF', ascending=False)

print(df.head(20))

                      IDF
pavxn8dijr       0.482401
granbluefantasy  0.482401
class            0.436060
check            0.374801
out              0.297039
my               0.226436
in               0.196949
co               0.114070
https            0.112465
pet              0.000000
pettymamii       0.000000
petridishes      0.000000
peterjmarshall   0.000000
peter            0.000000
00               0.000000
pesto            0.000000
pettyspice       0.000000
personal         0.000000
person           0.000000
persisting       0.000000


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(smooth_idf=True, use_idf=True)
vectors = vectorizer.fit_transform(documents)

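feature_names = vectorizer.get_feature_names_out() # get_feature_names() was removed in scikit-learn 1.2

#for col in vectors.nonzero()[1]:
#    print (feature_names[col], ' - ', vectors[0, col])

# vectors.nonzero()[1] lists the column indices of nonzero entries in ALL tweets,
# while vectors[0, col] reads the weight in the first tweet only, so words that
# do not occur in the first tweet print 0.0
print('total elements:', len(vectors.nonzero()[1]))
for n, col in zip(range(20), vectors.nonzero()[1]):
  print('\n', n, '\\col', col) # escaped backslash: '\c' is not a valid escape sequence
  print(feature_names[col], ' - ', vectors[0,col])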


total elements: 13943

 0 \col 3375
pavxn8dijr  -  0.48240065007302313

 1 \col 984
co  -  0.11406957348391668

 2 \col 2155
https  -  0.11246497245526342

 3 \col 1925
granbluefantasy  -  0.48240065007302313

 4 \col 2243
in  -  0.1969493145949549

 5 \col 957
class  -  0.4360601796521979

 6 \col 3045
my  -  0.22643619063076748

 7 \col 3310
out  -  0.2970387683897221

 8 \col 897
check  -  0.3748014098726122

 9 \col 1134
cu7on7g1si  -  0.0

 10 \col 4926
world  -  0.0

 11 \col 4420
the  -  0.0

 12 \col 3314
over  -  0.0

 13 \col 334
all  -  0.0

 14 \col 3354
partner  -  0.0

 15 \col 1015
community  -  0.0

 16 \col 3308
our  -  0.0

 17 \col 4506
to  -  0.0

 18 \col 5032
you  -  0.0

 19 \col 4411
thank  -  0.0


## 6) Duplicates or near duplicates

* Check whether we have duplicate tweets (in terms of the text field only) in our dataset. A duplicate tweet means here that exactly the same tweet text appears more than once in our dataset.
* Note: It makes sense to check for duplicates using the original tweet texts, as they were before segmentation. Using the full dataset of 10,000 tweets is also recommended here, to get a higher chance of seeing duplicates (this does not require heavy computing).
* Try to check whether tweets have additional near-duplicates. A near-duplicate means here that the tweet text is almost the same in two or more tweets. Ponder what kinds of near-duplicates there could be and how to find them. Start by considering, for example, different normalization techniques. Implement some of the techniques you considered.
* **The outcome of this exercise** should be the number of unique tweets in our dataset (possibly also counting which duplicates are the most common), as well as the number of unique tweets after also removing near-duplicates.
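A minimal sketch of the normalization idea is shown below: lowercase the text, drop a leading retweet prefix, drop URLs, and collapse whitespace before comparing. The specific rules, the function name `normalize`, and the example tweets are assumptions chosen for illustration; which normalizations are appropriate depends on what you want to count as "the same" tweet.

```python
import re

def normalize(text: str) -> str:
    """Reduce a tweet to a canonical form for near-duplicate comparison."""
    text = text.lower()
    text = re.sub(r"^rt @\w+:\s*", "", text)  # drop a leading "RT @user:" prefix
    text = re.sub(r"https?://\S+", "", text)  # drop URLs (shortened links differ per tweet)
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    return text.strip()

# Made-up tweets: the first three differ only in prefix, URL, case and spacing.
toy_tweets = [
    "Merry Christmas everyone! https://t.co/abc123",
    "RT @someone: Merry Christmas everyone! https://t.co/xyz789",
    "merry  christmas everyone!",
    "Happy new year",
]
unique_exact = set(toy_tweets)
unique_normalized = {normalize(t) for t in toy_tweets}
print(len(unique_exact), len(unique_normalized))
```

Exact matching sees four distinct texts here, while the normalized forms collapse to two.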

In [33]:
import nltk
nltk.download('wordnet') # data required by WordNetLemmatizer; SnowballStemmer ships with NLTK itself
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

def stem(doc):
  return " ".join(stemmer.stem(w) for w in doc.split())
def lemmatize(doc):
  return " ".join(lemmatizer.lemmatize(w) for w in doc.split())



In [34]:
duplicates = []
filtered_duplicated_tweets = []
seen = {}

In [35]:
for text in documents: # use a new name so the `tweets` list from exercise 1 is not overwritten
    if text not in seen:
        seen[text] = 1
        filtered_duplicated_tweets.append(text)
    else:
        duplicates.append(text) # every repeated occurrence counts as a duplicate
        seen[text] += 1

In [36]:
print(len(filtered_duplicated_tweets))

962


In [None]:
def get_shingles(twts, k):
  shingles = set()
  
  tweets_length = len(twts)

  for i in range( tweets_length-k+1):
    twts_shingle = twts[i:i + k]
    shingles.add(twts_shingle)
  return shingles 


In [None]:
print("number of shingles: {}".format(len(get_shingles(documents[599], k=5))))
print("number of shingles: {}".format(get_shingles(documents[599], k=5)))

number of shingles: 126
number of shingles: {' #hum', ' #giv', 'sksrk', 'y… ht', 'k, #h', ' Card', 'a Gif', 'e #as', 'owpla', 'nVck1', 'ing, ', 'ying,', 'iveaw', ' #now', 'day #', 'Card ', 'n a G', ' #jk,', 'hrist', 'ard i', 'htsda', 'k1SQq', 'asksr', '… htt', 'ps4sh', 'manri', 'sgift', 'givea', 'playi', 're #a', 'share', ' in a', '.co/0', '#ps4s', 'y #ps', '://t.', '0pnVc', 'gift ', 'rd in', 'wplay', '/0pnV', 'ift C', 'layin', 'ksrk,', 'rk, #', ' Gift', 'ights', 'Vck1S', 'masgi', 'veawa', 'o/0pn', 'Bag #', '#Chri', 'ps://', ' a Gi', 'pnVck', '#nowp', 'jk, #', 'tsday', ', #jk', 'stmas', 'k, #i', 'ikon ', 'are #', 'n #gi', 'Gift ', 'istma', 'Chris', 'way… ', 't Car', 'ay #p', 'ay… h', 't Gif', 'ag #n', 'ift B', '//t.c', ' http', 'on #g', 'tps:/', '/t.co', 'away…', 'ghtsd', 'd in ', 'aying', 's4sha', 'ft Gi', ' Bag ', 'ft Ca', 'asgif', 'ttps:', 'ristm', '#jk, ', 'umanr', 'tmasg', ', #hu', ' #iko', 'human', 'right', 'in a ', 't.co/', 'co/0p', 'nrigh', ', #ik', '#asks', 'eaway', 'ft Ba', '

In [None]:
def jaccard_similarity_score(x, y):

    intersection_cardinality = len(set(x).intersection(set(y)))
    union_cardinality = len(set(x).union(set(y)))
    return intersection_cardinality / float(union_cardinality)

In [None]:
shingles_vectors = []

for doc in documents:
    sh = list(get_shingles(doc, k=5))
    shingles_vectors.append(sh)

print("jaccard_similarity_score: {}".format(jaccard_similarity_score(shingles_vectors[0], shingles_vectors[1])))

jaccard_similarity_score: 0.07971014492753623


In [None]:
%%time
import itertools

s = 0.9
candidates = []

for pair in itertools.combinations(documents,2):
    js = jaccard_similarity_score(get_shingles(pair[0], k=5),get_shingles(pair[1], k=5))
    
    if js > s:
        print(pair)
        candidates.append(pair)

('RT @_alinaangel: Our dad passed away earlier this summer so my mom and I decided to surprise my sisters with bears with his favorite cologn…', 'RT @_alinaangel: Our dad passed away earlier this summer so my mom and I decided to surprise my sisters with bears with his favorite cologn…')
('RT @rosieposii_: My lil Filipino mom thought her iPhone X was perfume and I cry everytime I watch it. https://t.co/dmfM5HQCs8', 'RT @rosieposii_: My lil Filipino mom thought her iPhone X was perfume and I cry everytime I watch it. https://t.co/dmfM5HQCs8')
('RT @thebaemarcus: George Lopez\nMy Wife and Kids\nEverybkdy Hates Chris\nThe Nanny\nFresh Prince of BelAir\nRick &amp; Morty\nDegrassi\nLaw &amp; Order\nTh…', 'RT @thebaemarcus: George Lopez\nMy Wife and Kids\nEverybkdy Hates Chris\nThe Nanny\nFresh Prince of BelAir\nRick &amp; Morty\nDegrassi\nLaw &amp; Order\nTh…')
('RT @TeamJuJu: RT if you knew what my celebration was 😂😂 https://t.co/RCwVn6gSIz', 'RT @TeamJuJu: RT if you knew what my celebrati

In [None]:
print("Number of similar items: {}".format(len(candidates)))

Number of similar items: 63
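The pairwise comparison grows quadratically with the number of tweets, and recomputing the shingles inside the loop repeats the same work for every pair. A possible variant, sketched below under the same settings (character 5-shingles, Jaccard threshold 0.9), builds each shingle set once and compares the precomputed sets. The function names and toy texts are illustrative and not taken from the cells above.

```python
import itertools

def char_shingles(text, k=5):
    """Return the set of all character k-grams in the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def near_duplicate_pairs(texts, k=5, threshold=0.9):
    """Return index pairs (i, j) whose shingle sets exceed the similarity threshold."""
    shingle_sets = [char_shingles(t, k) for t in texts]  # computed once per text
    pairs = []
    for i, j in itertools.combinations(range(len(texts)), 2):
        if jaccard(shingle_sets[i], shingle_sets[j]) > threshold:
            pairs.append((i, j))
    return pairs

# Made-up texts: the first two differ by a single trailing character.
toy_texts = [
    "RT @user: what a great day for a walk in the park",
    "RT @user: what a great day for a walk in the park!",
    "completely different content here",
]
pairs = near_duplicate_pairs(toy_texts)
print(pairs)
```

Returning index pairs instead of text pairs makes it easy to tell exact duplicates (identical strings) apart from true near-duplicates afterwards. The loop is still quadratic, so for much larger collections a technique such as MinHash with locality-sensitive hashing would be the usual next step.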
