
# Text and Media Analytics
Seminar 1 Lab

This is the Google Colab notebook accompanying the second lecture of the Applied Data Science Text and Media Analytics course at Utrecht University. It covers:

1.   Text preprocessing
2.   Counting words
3.   TF-IDF & K-means

Some of the code is © Joris Veerbeek, taken from the [course manual](https://jveerbeek.gitlab.io/data-mining/index.html#) for this course in 2021.


# Tokenization
Every analysis needs a foundation. For text analytics, that foundation is *tokenization*.

Before we can do the interesting part (the analysis itself), the first step in most textual analysis is to tokenize the text. That is, we want to convert a string, or a list of strings, into a list of tokens.

This might seem like a simple task, but there are several ways to do it, each with slightly different results. The core idea is to use the space as a natural separator (delimiter) between words, and additionally to separate punctuation from words. You may want to keep some strings with punctuation together, such as abbreviations, email addresses, URLs, and words with hashtags or '@' in front of them.

And that is not all; the tokenization task also blends in with the sentence detection task (not all sentences end in ., ? or !, not everything is a sentence) and the detection of pairs of matching quotes. In the end, tokenization requires language-specific knowledge on abbreviations and specific punctuation rules. An example of an elaborate rule-based tokenizer is [Ucto](https://languagemachines.github.io/ucto/). We will get to Ucto later.
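
To make the "keep some strings together" idea concrete, here is a minimal sketch using only Python's `re` module. The pattern and the name `regex_tokenize` are illustrative assumptions, not Ucto's rules: the pattern tries URLs, email addresses, and @mentions/#hashtags before falling back to plain words and single punctuation marks.

```python
import re

# Hypothetical pattern (not Ucto's rules): alternatives are tried left to right,
# so the "keep together" cases must come before the generic word rule.
TOKEN_PATTERN = re.compile(
    r"""
      https?://[^\s)]+             # URLs (stop at whitespace or a closing parenthesis)
    | [\w.+-]+@[\w-]+\.[\w.-]+     # email addresses
    | [@#]\w+                      # @mentions and #hashtags
    | \w+(?:'\w+)?                 # words, optionally with an apostrophe part
    | [^\w\s]                      # any other single non-space character
    """,
    re.VERBOSE,
)


def regex_tokenize(text):
    """Return all non-overlapping matches of TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)


sample = "Ask @anna about #nlp (see https://example.org/a-b), or mail me@example.com!"
tokens = regex_tokenize(sample)
print(tokens)
```

This sketch is deliberately naive: a URL that ends a sentence would still swallow the final period, and abbreviations are not handled at all, which is exactly the kind of language-specific knowledge a rule-based tokenizer encodes.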

# (Almost) pure Python
The simplest way to tokenize a text is to “just” use spaces as boundaries. Let’s say we have some text:

In [1]:
text = "This is a simple sentence. Simple methods work great on simple sentences!"

Then we can just split it using Python’s `split()`.

In [2]:
tokenized_text = text.split()
print(tokenized_text)

['This', 'is', 'a', 'simple', 'sentence.', 'Simple', 'methods', 'work', 'great', 'on', 'simple', 'sentences!']


This looks reasonable, but not great. In most cases, we at least want the punctuation removed. So let’s do that:

In [3]:
punctuations = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
for punctuation in punctuations:
    text = text.replace(punctuation, '')
tokenized_text_nopunct = text.split()
print(tokenized_text_nopunct)


['This', 'is', 'a', 'simple', 'sentence', 'Simple', 'methods', 'work', 'great', 'on', 'simple', 'sentences']
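
As an alternative sketch, the same punctuation removal can be done in a single pass with `str.translate`. The standard library's `string.punctuation` holds the same characters as the literal above; the variable names `sentence`, `remove_punct`, and `tokens_nopunct` are only illustrative. Unlike the loop, this version does not overwrite the original string.

```python
import string

sentence = "This is a simple sentence. Simple methods work great on simple sentences!"

# str.maketrans("", "", chars) builds a table that deletes every character in chars.
remove_punct = str.maketrans("", "", string.punctuation)

# translate() returns a new string, so `sentence` itself stays unchanged.
tokens_nopunct = sentence.translate(remove_punct).split()
print(tokens_nopunct)
```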


Finally, we want all words converted to lowercase:

In [4]:
tokenized_text_lwc = [word.lower() for word in tokenized_text_nopunct]

Wrapping this up in a nice function:

In [19]:
def tokenize(text):
    punctuations = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    for punctuation in punctuations:
        text = text.replace(punctuation, '')
    text = text.lower()
    text = text.split()
    return text

tokenized_text = tokenize(text)
print(tokenized_text)


['this', 'is', 'a', 'simple', 'sentence', 'simple', 'methods', 'work', 'great', 'on', 'simple', 'sentences']


# Counting words

Once we have a tokenized text, we can count the occurrences of a given word using the list method `count()`.

In [6]:
count_this = tokenized_text.count('simple')
print(count_this)

3


To get the most frequent words of a tokenized text, we can use Python’s `Counter` objects. A `Counter` is a dictionary-like object with the words stored as keys and the counts as values:

In [9]:
from collections import Counter
tokenized_text = ['this', 'text', 'is', 'a', 'tokenized', 'text', 'A']

word_counts = Counter(tokenized_text)
word_counts


Counter({'this': 1, 'text': 2, 'is': 1, 'a': 1, 'tokenized': 1, 'A': 1})

We can then sort this object using `most_common` to get the most frequent words:

In [10]:
word_counts.most_common(20)  # n = 20

[('text', 2), ('this', 1), ('is', 1), ('a', 1), ('tokenized', 1), ('A', 1)]
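
Note that `'a'` and `'A'` are counted as two different words here. A minimal sketch of folding case before counting, assuming that is what the analysis needs (the names `tokens` and `folded_counts` are illustrative):

```python
from collections import Counter

tokens = ['this', 'text', 'is', 'a', 'tokenized', 'text', 'A']

# Lowercase each token on the fly so that 'a' and 'A' share one key.
folded_counts = Counter(token.lower() for token in tokens)

# most_common(n) returns the n highest counts; ties keep first-seen order.
top_two = folded_counts.most_common(2)
print(top_two)
```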

When we want to get the most frequent words in a collection of texts, we have to concatenate these texts first. We can do this using the following `flatten` function:

In [11]:
flatten = lambda t: [item for sublist in t for item in sublist]

tokenized_texts =  [['this', 'text', 'is', 'a', 'tokenized', 'text'],
                ['this', 'is', 'also', 'a', 'tokenized', 'text']]
tokenized_texts_concat = flatten(tokenized_texts)

word_counts = Counter(tokenized_texts_concat)
word_counts.most_common(20)  # n = 20

[('text', 3), ('this', 2), ('is', 2), ('a', 2), ('tokenized', 2), ('also', 1)]
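
The same concatenation can be sketched with the standard library's `itertools.chain.from_iterable`, which walks the nested lists lazily instead of building an intermediate list. The names `docs` and `corpus_counts` are illustrative.

```python
from collections import Counter
from itertools import chain

docs = [['this', 'text', 'is', 'a', 'tokenized', 'text'],
        ['this', 'is', 'also', 'a', 'tokenized', 'text']]

# chain.from_iterable yields every token of every document, one after another.
corpus_counts = Counter(chain.from_iterable(docs))
print(corpus_counts.most_common(3))
```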

# Ucto

[Ucto](https://languagemachines.github.io/ucto/) is a rule-based tokenizer. It has a Python binding that first needs to be installed.

In [12]:
!pip install python-ucto
import ucto
ucto.installdata()

Collecting python-ucto
  Downloading python_ucto-0.6.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (22.3 MB)
Installing collected packages: python-ucto
Successfully installed python-ucto-0.6.7


Installation of uctodata 0.9.1 complete
Language detection will not be available unless you install libexttextcat and rerun installdata()


Let us initialize an English tokenizer. The configuration files supplied with ucto are named `tokconfig-xxx`, where `xxx` is a three-letter ISO 639-3 language code. There is also a `tokconfig-generic` file that has no language-specific rules. Alternatively, you can make and supply your own configuration file. Note that for older versions of ucto you may need to provide the absolute path, but the latest versions will find the configurations supplied with ucto automatically. See [here](https://github.com/LanguageMachines/uctodata/tree/master/config) for a list of the configurations available in the latest version.

In [13]:
import ucto
configurationfile = "tokconfig-eng"
tokenizer = ucto.Tokenizer(configurationfile)

Now let's take the sample text from before and tokenize it. Ucto separates words delimited by spaces, and separates punctuation from letter strings. It also detects the beginning and end of sentences, and remembers for each token whether it was preceded by a space.

Furthermore, it detects certain language-specific contractions that are usually split, such as "isn't" into "is" and "n't" and "There's" into "There" and "'s". Note again that Ucto remembers that the second tokens from these contractions were not preceded by a space in the original sentence; in this sample output this is expressed by not printing a space in front.

In [14]:
#pass the text (a str); process() may be called multiple times
text = "This is a simple sentence, but the next one isn't. There's more to tokenization than just separating on spaces--never forget that!"
tokenizer.process(text)

#read the tokenised data
for token in tokenizer:
    #token is an instance of ucto.Token, serialise to string using str()
    print(str(token))

    #tokens remember whether they are followed by a space
    if token.isendofsentence():
        print()
    elif not token.nospace():
        print(" ",end="")

This
 is
 a
 simple
 sentence
,
 but
 the
 next
 one
 is
n't
.

There
's
 more
 to
 tokenization
 than
 just
 separating
 on
 spaces
--
never
 forget
 that
!



Ucto can also detect pairs of quotes and the beginnings and ends of paragraphs. It can lowercase or uppercase all text. It also has several options for reading and writing one sentence per line; for example, sometimes your text is already split into one sentence per line and you do not need Ucto to figure that out. Furthermore, Ucto is quite fast and can also be run on the command line. See the [webpage](https://languagemachines.github.io/ucto/) for further information.

# Seminar 1 Exercises

A) Simple text processing

B) Pre-processing & Exploring tweets



##### A.1) Take a look at the following paragraph from a news article:

***But the origins of these reptiles have remained murky because of a lack of fossils from the earliest fliers. “The oldest pterosaur we have already had wings and were capable fliers,” said Davide Foffa, a paleontologist at Virginia Tech, which makes it difficult to chart their aerial evolution (https://www.nytimes.com/2022/10/05/science/pterosaurs-reptiles-wings.html).***

First, tokenize the paragraph manually (i.e., convert it into tokens by hand, without any code, including the URL).

In [22]:
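#answer
text_to_tokenise = "But the origins of these reptiles have remained murky because of a lack of fossils from the earliest fliers. “The oldest pterosaur we have already had wings and were capable fliers,” said Davide Foffa, a paleontologist at Virginia Tech, which makes it difficult to chart their aerial evolution (https://www.nytimes.com/2022/10/05/science/pterosaurs-reptiles-wings.html)."

punctuations = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
# work on a copy of the paragraph, so text_to_tokenise itself stays unchanged
text_no_punct = text_to_tokenise
for punctuation in punctuations:
    text_no_punct = text_no_punct.replace(punctuation, '')
text_to_tokenise_no_punct = text_no_punct.split()
print(text_to_tokenise_no_punct)

['But', 'the', 'origins', 'of', 'these', 'reptiles', 'have', 'remained', 'murky', 'because', 'of', 'a', 'lack', 'of', 'fossils', 'from', 'the', 'earliest', 'fliers', '“The', 'oldest', 'pterosaur', 'we', 'have', 'already', 'had', 'wings', 'and', 'were', 'capable', 'fliers”', 'said', 'Davide', 'Foffa', 'a', 'paleontologist', 'at', 'Virginia', 'Tech', 'which', 'makes', 'it', 'difficult', 'to', 'chart', 'their', 'aerial', 'evolution', 'httpswwwnytimescom20221005sciencepterosaursreptileswingshtml']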


A.2) Now use two different methods for tokenization. Compare the results and discuss the differences. Explain them based on how each method works (see manual for methods).

In [30]:
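from nltk.tokenize import RegexpTokenizer

# method 1: the pure-Python tokenize() function defined earlier
tokenised_ans = tokenize(text_to_tokenise)
print(tokenised_ans)

# method 2: keep every run of word characters (raw string avoids an invalid-escape warning)
tokenizer = RegexpTokenizer(r'\w+')
tokenized_text = tokenizer.tokenize(text_to_tokenise.lower())
print(tokenized_text)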


['but', 'the', 'origins', 'of', 'these', 'reptiles', 'have', 'remained', 'murky', 'because', 'of', 'a', 'lack', 'of', 'fossils', 'from', 'the', 'earliest', 'fliers', '“the', 'oldest', 'pterosaur', 'we', 'have', 'already', 'had', 'wings', 'and', 'were', 'capable', 'fliers”', 'said', 'davide', 'foffa', 'a', 'paleontologist', 'at', 'virginia', 'tech', 'which', 'makes', 'it', 'difficult', 'to', 'chart', 'their', 'aerial', 'evolution', 'httpswwwnytimescom20221005sciencepterosaursreptileswingshtml']
['but', 'the', 'origins', 'of', 'these', 'reptiles', 'have', 'remained', 'murky', 'because', 'of', 'a', 'lack', 'of', 'fossils', 'from', 'the', 'earliest', 'fliers', 'the', 'oldest', 'pterosaur', 'we', 'have', 'already', 'had', 'wings', 'and', 'were', 'capable', 'fliers', 'said', 'davide', 'foffa', 'a', 'paleontologist', 'at', 'virginia', 'tech', 'which', 'makes', 'it', 'difficult', 'to', 'chart', 'their', 'aerial', 'evolution', 'https', 'www', 'nytimes', 'com', '2022', '10', '05', 'science', 'pt
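
One way to see where the two outputs diverge is to apply both ideas to the URL alone. The sketch below uses only the standard library: `re.findall(r"\w+", ...)` is used here as a stand-in for what a `\w+` pattern does (it returns every run of word characters), while deleting punctuation glues the pieces together. The variable names are illustrative.

```python
import re
import string

url = "https://www.nytimes.com/2022/10/05/science/pterosaurs-reptiles-wings.html"

# Every maximal run of letters, digits, or underscores becomes its own token.
word_runs = re.findall(r"\w+", url)

# Deleting punctuation instead leaves one long string with no spaces to split on.
glued = url.translate(str.maketrans("", "", string.punctuation)).split()

print(len(word_runs), len(glued))
```

The curly quotes behave differently too: they are not in the ASCII punctuation list, so the first method keeps them attached to the neighbouring word, whereas `\w+` never matches them.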

A.3) Compare the length of the paragraph from A.1) for different tokenization methods. Which method do you think is "correct"? How do you explain the differences?

In [32]:
#code & answer
len(tokenized_text) - len(tokenised_ans)

11

A.4) Compare the ten most frequent words for the paragraph in A.1) per tokenization method (including your manual one).

In [None]:
#code & answer

B.1) Next, you will analyse some Twitter data. Load the data as a DataFrame (`df`), using the CTTW.csv from Team. It is a random sample of 10,000 tweets that mention "China" and "tech". Below are some imports that may come in handy, but feel free to choose your own!

In [33]:
#imports
import pandas as pd
import re
import matplotlib.pyplot as plt
from datetime import datetime
from collections import Counter, defaultdict
from wordcloud import WordCloud
from gensim import corpora, models
import geopandas as gpd
import nltk
from nltk.corpus import stopwords
import networkx as nx

In [45]:
#code
df = pd.read_excel("CTTW.xlsx")

B.2) Inspect the dataframe's first 20 rows and shape. It is always good to explore the data first by browsing through some of it.

In [46]:
#code
df.head(20)
#Unnamed:
#"author id"
#"geo"
#id
#users
#in_reply_to
#source

Unnamed: 0.1,Unnamed: 0,author id,username,created_at,geo,id,lang,like_count,quote_count,reply_count,retweet_count,source,tweet,in_reply_to,users,followers,tweet count,country,tags
0,0,395944523,arson03,2012-09-24 03:47:59+00:00,,250079103636631008,en,0,0,0,0,http://gadgetchinos.gq,"Really ãƒ… nice!: UNI-T UT ."" LCD Digital Ther...",,"['ChemarieMonica', 'Zite', 'designtaxi', 'WSJ'...",1060,17888,China,[]
1,1,351057439,mikethenerd,2012-06-10 00:49:51+00:00,,211621164303400000,en,0,0,0,0,twitterfeed,#tech China announces plans for major mission ...,,"['girlsintech_uk', 'freaklabs', 'TechLabsSA', ...",596,4041,China,['#tech']
2,2,130473117,greenwillydot,2010-12-01 00:34:31+00:00,,9767261288009720,en,0,0,0,1,Twitter Web Client,RT @YS_KARASU: WikiLeaks: Great firewall of Ch...,,"['TechCrunch', 'addthis', 'Techme101', 'Metalb...",18,13960,China,[]
3,3,430757506,GradyGroup,2012-10-17 21:33:02+00:00,,258682054873604000,en,0,0,0,0,dlvr.it,"Tech firm Huwei did not spy for China, White H...",,"['OffbeatChina', 'Tech_Eater', 'chaz1944', 'ed...",3173,18942,China,[]
4,4,68717270,darcelchoy,2017-05-24 20:03:01+00:00,,867471013340163968,en,1,0,0,0,Sprout Social,Check it out: World's largest floating #solar ...,,"['AP', 'mikel_maria', 'mikellomealy', 'calesto...",1998,52710,China,['#solar']
5,5,211366198,ttscottw,2013-03-28 15:38:47+00:00,,317299738250715008,en,0,0,0,0,i-love-china-tech,Rajoo X Wired Folding Headphones w/ Microphone...,,"['CNNMoney', 'CNNMoney', 'FortuneMagazine', 'C...",84,105122,China,[]
6,6,2371631,ProLifePoint,2019-11-25 17:18:23+00:00,,1199014480530870016,en,7,0,0,2,Twitter for Android,[Every time another detail of China's high-tec...,,"['SenSchumer', 'AP', 'HPC_Guru', 'AP', 'RealSa...",121,6620,China,[]
7,7,1047341849529250048,davereaboi,2021-08-27 19:33:50+00:00,,1431339189631680000,en,0,0,0,0,Twitter Web App,China is doing what every country should have ...,1.047342e+18,"['AbhishBanerj', 'Alevskey', 'Techmeme', 'Winn...",168938,210136,China,[]
8,8,256368478,newsfeeedflash,2011-11-06 16:50:44+00:00,,133224811798343008,en,0,0,0,0,twitterfeed,Tech firms agree to stricter online regulation...,,"['topnepalnews', 'cnni', 'cnbcworld', 'andrewb...",697,129836,China,[]
9,9,813064693,EnergiewendeGER,2013-12-29 03:36:18+00:00,,417136970544857024,en,0,0,0,0,dlvr.it,must-read tech stories in China this week htt...,,"['IsraelinChina', 'karmel80', 'karmel80', 'tec...",20789,31878,China,[]


B.3) There might be a column that is not very useful. Remove it from the dataframe.

In [60]:
#code
# drop() returns a new DataFrame, so assign the result back to keep the change
df = df.drop(columns=["geo", "source", "author id", "id", "users", "in_reply_to"])
df

Unnamed: 0.1,Unnamed: 0,username,created_at,lang,like_count,quote_count,reply_count,retweet_count,tweet,followers,tweet count,country,tags
0,0,arson03,2012-09-24 03:47:59+00:00,en,0,0,0,0,"Really ãƒ… nice!: UNI-T UT ."" LCD Digital Ther...",1060,17888,China,[]
1,1,mikethenerd,2012-06-10 00:49:51+00:00,en,0,0,0,0,#tech China announces plans for major mission ...,596,4041,China,['#tech']
2,2,greenwillydot,2010-12-01 00:34:31+00:00,en,0,0,0,1,RT @YS_KARASU: WikiLeaks: Great firewall of Ch...,18,13960,China,[]
3,3,GradyGroup,2012-10-17 21:33:02+00:00,en,0,0,0,0,"Tech firm Huwei did not spy for China, White H...",3173,18942,China,[]
4,4,darcelchoy,2017-05-24 20:03:01+00:00,en,1,0,0,0,Check it out: World's largest floating #solar ...,1998,52710,China,['#solar']
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9995,masterstuff2,2018-01-27 06:01:43+00:00,en,0,0,0,0,"China's police state ""shames jay walkers"" usi...",849,89515,China,"['#DAVOS', '#China', '#Globalists', '#Globalis..."
9996,9996,AppleInvestNews,2020-02-28 12:23:21+00:00,en,0,0,0,0,"#VivoZ #G announced in China ."" FHD+ LCD SD G ...",411,84861,China,"['#VivoZ6', '#5G', '#Vivo', '#Tech', '#TechNew..."
9997,9997,ensredshirt,2013-11-09 18:02:44+00:00,en,0,0,0,0,Retro Skull Pattern Long-Sleeve Women's Sweate...,688,39211,China,[]
9998,9998,ElisafromCA,2017-02-26 05:50:03+00:00,en,7,0,0,1,#technology &amp; #digital #entrepreneurship: ...,1202,59721,China,"['#technology', '#digital', '#entrepreneurship..."


B.4) Inspect the dataframe again and discuss: what kind of research questions and/or hypotheses could you investigate?

In [None]:
#answer

B.5) Create a timeline of tweets over years: first convert the date-time strings to datetime, then create a line graph. Take some notes of your observations.


In [None]:
#code
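
One possible starting point, sketched with the standard library only: parse the timestamp strings and count tweets per year. The three sample values are copied from the `created_at` column shown above; the names `created_at`, `tweets_per_year`, and `timeline` are illustrative. With the dataframe itself, `pd.to_datetime` on the column would play the role of the parsing step, and the yearly counts could then be drawn as a line graph.

```python
from collections import Counter
from datetime import datetime

# A few timestamp strings in the same format as the created_at column.
created_at = [
    "2012-09-24 03:47:59+00:00",
    "2012-06-10 00:49:51+00:00",
    "2017-05-24 20:03:01+00:00",
]

# fromisoformat() parses the date, the time, and the +00:00 UTC offset.
timestamps = [datetime.fromisoformat(value) for value in created_at]

# Count tweets per calendar year and sort chronologically for plotting.
tweets_per_year = Counter(ts.year for ts in timestamps)
timeline = sorted(tweets_per_year.items())
print(timeline)
```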

***Note: Exercises B.6) to B.11) are all about pre-processing and "cleaning" text. If you're advanced in your Python for text analysis, feel free to address all of the steps below in one piece of code.***

B.6) Convert the words in the tweets to lowercase. Create a new column for that to keep the original tweets.

In [None]:
#code

B.7) Remove the numbers from the tweets.

In [None]:
#code

B.8) Remove punctuation from the tweets.

In [None]:
#code

B.9) Strip excessive white spaces from the tweets.

In [None]:
#code

B.10) Remove the stopwords from the tweets. Compare your "cleaned-up" tweets with the originals.

In [None]:
#code

B.11) Tokenize the tweets.


In [None]:
#code
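
For those who want to combine steps B.6) to B.11), the following is a minimal sketch of one cleaning function on plain strings. Everything here is an assumption for illustration: the function name `clean_tweet`, the order of the steps, and especially the tiny `STOPWORDS` set, which stands in for a full list such as the one NLTK provides. Applied to the dataframe, such a function could be passed to `apply` on the tweet column.

```python
import re
import string

# Hypothetical mini stopword list; a real stopword list is much longer.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}

# Translation table that deletes every ASCII punctuation character.
REMOVE_PUNCT = str.maketrans("", "", string.punctuation)


def clean_tweet(tweet):
    """Lowercase, strip numbers and punctuation, tokenize, drop stopwords."""
    text = tweet.lower()                       # B.6: lowercase
    text = re.sub(r"\d+", "", text)            # B.7: remove numbers
    text = text.translate(REMOVE_PUNCT)        # B.8: remove punctuation
    text = re.sub(r"\s+", " ", text).strip()   # B.9: collapse extra whitespace
    tokens = text.split()                      # B.11: tokenize on spaces
    return [t for t in tokens if t not in STOPWORDS]  # B.10: drop stopwords


example = "#tech China announces plans for  major mission in 2013!"
cleaned = clean_tweet(example)
print(cleaned)
```

Note that removing punctuation also strips the `#` from hashtags and the `@` from mentions; whether that is desirable depends on the research question.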

B.12) Create a wordcloud visual based on hashtags. A for-loop might do the trick.

In [None]:
#code

B.13) Discuss the wordcloud - what can it tell you, what are limitations? What hashtags would you filter out and why?

In [None]:
#answer

B.14) Create a list of the top 10 most frequent hashtags.

In [None]:
#code
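
A possible approach, sketched on a few hand-written cells: this assumes the `tags` column is stored as strings that look like Python lists (as the output above suggests), which would need to be checked on the real data. `ast.literal_eval` turns each string back into a list, and a `Counter` tallies the lowercased hashtags. The names `tag_cells` and `hashtag_counts` are illustrative.

```python
import ast
from collections import Counter

# Hand-written cells mimicking the tags column (assumed to hold list-like strings).
tag_cells = ["['#tech']", "[]", "['#solar']", "['#Tech', '#China']"]

hashtag_counts = Counter()
for cell in tag_cells:
    # literal_eval safely parses the string "['#tech']" into the list ['#tech'].
    for tag in ast.literal_eval(cell):
        hashtag_counts[tag.lower()] += 1  # fold case so #Tech and #tech merge

top_hashtags = hashtag_counts.most_common(10)
print(top_hashtags)
```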

B.15) Create a list of the top 10, 20, and 50 user accounts. Who is tweeting about "China" & "tech"?

In [None]:
#code

B.16) Bonus: try simple topic modeling with the hashtags using a for-loop and LdaModel. Can you discern distinct topics?

In [None]:
#code