# Pre-processing text data


Most text data is cleaned by following the steps below.<br>

- Remove punctuation and unwanted symbols
- Tokenization - Converting a sentence into a list of words
- Remove stop words
- Normalization (putting everything in a consistent form)
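
As a rough preview, the steps above can be sketched with the standard library alone. The sample sentence and the tiny stop word set are made up for illustration, and lowercasing is applied before the stop word filter so that "The" matches "the". The rest of this notebook uses NLTK for the real thing.

```python
import string

# Assumption: a tiny hand-picked stop word set, only for illustration
STOP_WORDS = {"the", "is", "a", "to", "and"}

def preprocess(sentence):
    # 1. Remove punctuation and unwanted symbols
    no_punct = "".join(ch for ch in sentence if ch not in string.punctuation)
    # 2. Tokenization: here simply split on whitespace
    tokens = no_punct.split()
    # 3. Normalization: put every token in lowercase
    tokens = [t.lower() for t in tokens]
    # 4. Remove stop words
    return [t for t in tokens if t not in STOP_WORDS]

result = preprocess("The rocket is ready to launch, and the crew is excited!")
print(result)
```

Each of these steps is covered in more detail in the sections that follow.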

## Orientation: Where am I?
<img src='res/launch_1.jpg'>


<i>Credits: Kerbal Space Program: Falcon 9 SpaceX</i>

## What is the Natural Language Toolkit?
<img src='res/NLTK.png'>

NLTK is a Python library for working with written language data.


NLTK is free and extensively documented [here](http://www.nltk.org/).
> Note: NLTK provides tools for tasks ranging from very simple (counting words in a text) to very complex (writing and training parsers, etc.)



## From previous Introduction workshop: Read data from disk

In [None]:
import os

text_path = 'datasets/twcs.csv'
# open the file with a context manager so it is closed automatically
with open(os.path.join(text_path), "r", encoding='UTF-8') as file:
    text = file.read()
# split the raw text into lines
lines=text.split('\n')

We can inspect a couple of the lines we just read in Python:

In [None]:
lines[1:3]
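
One caveat with splitting on `'\n'`: a CSV field may itself contain a line break inside quotes, so one record can end up spread over several "lines". The sketch below uses a made-up two-row sample (not the real dataset) to compare the naive split with the standard library's `csv` module.

```python
import csv
import io

# Hypothetical sample: the second tweet contains a line break inside the quotes
sample = 'tweet_id,text\n1,"@some_user hello"\n2,"first line\nsecond line"\n'

# Naive approach: split the raw text on newlines
naive_lines = sample.split('\n')

# CSV-aware approach: the reader keeps quoted line breaks inside the field
rows = list(csv.DictReader(io.StringIO(sample)))

print(len(naive_lines))  # header, broken records and a trailing empty string
print(len(rows))         # one dictionary per record
print(repr(rows[1]['text']))
```

This is one reason a CSV-aware reader is usually a safer choice than splitting the raw text by hand.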

We can use another Python library, pandas, to make this easier and access only the column we need. (More about this in the Pandas introductory workshops at [Python Resplat Community](https://research.unimelb.edu.au/infrastructure/research-platform-services/training/python))

In [None]:
import pandas as pd
df=pd.read_csv(text_path)
df.head()

We are interested in the column "text". Let's put it in a variable:

In [None]:
text=df['text'].tolist()
print(type(text))
print(len(text))

This means we have roughly 2.8M tweets to analyse!!

Let's take a look at the first tweet:

In [None]:
text[0]

I'm going to take a portion of this text:

In [None]:
tweets=text[0:1000]

## Pre-processing tasks
<img src='res/textprocessing.png'>

We can notice some special characters: @ denotes a username on Twitter, and we will probably need to get rid of these. We can also remove punctuation.

## Work with strings: Find unwanted characters with regular expressions

### Can you find a pattern??
<img src='res/pat.jpg'>

### Regular expressions = Find patterns in text

Look at a group of tweets and try to identify some patterns

In [None]:
import re

# here we create a regular expression: an @ followed by one or more letters, numbers or underscores
regex_username=re.compile(r'@([A-Za-z0-9_]+)')

In [None]:
print("Original tweet:",tweets[0])
print("Pattern found:",re.findall(regex_username,tweets[0]))
my_new_tweet=regex_username.sub('',tweets[0]).strip()
print("Text removing the pattern:",my_new_tweet)

<img src='res/regex.png'>
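
To make the behaviour of the pattern concrete, here is a small sketch on a made-up tweet (the usernames are hypothetical). Because the pattern has a capture group, `findall` returns only the part inside the parentheses (without the @), while `sub` replaces the whole match (including the @).

```python
import re

regex_username = re.compile(r'@([A-Za-z0-9_]+)')

# Made-up tweet, for illustration only
sample_tweet = "@some_user and @HelpDesk42 my phone is still not working"

# findall returns the captured group, i.e. the name without the @
found = regex_username.findall(sample_tweet)

# sub removes the full match, @ included; strip trims the ends
cleaned = regex_username.sub('', sample_tweet).strip()

# removing a username in the middle leaves a double space; this collapses it
tidy = " ".join(cleaned.split())

print(found)
print(tidy)
```

Collapsing the leftover whitespace is optional, but it keeps the anonymized text tidy.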

Let's extract all usernames from the tweets and put them in a list so we can use them later:

In [None]:
users_list=list()
for tweet in tweets:
    users_list+=re.findall(regex_username,tweet)


In [None]:
# create sets for users. A set will remove the duplicates
users=set(users_list)
print(len(users))

Suppose we want to remove all the usernames from our sample tweets to anonymize them:

In [None]:
anonymous_tweets=list()
for tweet in tweets:
    new_tweet=regex_username.sub('',tweet).strip()
    anonymous_tweets.append(new_tweet)
print("Original:",tweets[0:10])
print('*****')
print("Anonymous",anonymous_tweets[0:10])


### Challenge: Extract all topics 

In [None]:
regex_topics=re.compile(...) # fill in
topic_list=list()
for tweet in tweets:
    topic_list+=... # fill
    
# print the total number of topics (remember the duplicates)
print(...) # fill in


## Work with strings: Punctuation and symbols

Is there a ready-made list of symbols I can start with?
Answer: yes!

In [None]:
import string
string.punctuation

We can use this string of symbols to go through our entire text and remove every character that appears in it. Let's see how:

In [None]:
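text_no_punct=[]
# use a new variable name so we do not overwrite the list of tweets stored in `text`
sentence='Capacity must be shown (in other work); in the law, concealment of it will do'
for letter in sentence:
    # keep only the characters that are not punctuation
    if letter not in string.punctuation:
        text_no_punct.append(letter)
        
''.join(text_no_punct)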
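
The same result can be obtained without an explicit loop. The sketch below is an alternative, not the approach used in the rest of this notebook: it builds a translation table with `str.maketrans` that maps every punctuation character to "nothing" and applies it with `str.translate`.

```python
import string

sentence = 'Capacity must be shown (in other work); in the law, concealment of it will do'

# the third argument lists the characters to delete
remove_punct = str.maketrans('', '', string.punctuation)
cleaned = sentence.translate(remove_punct)

print(cleaned)
```

For long texts this tends to be faster than appending character by character, although the loop is easier to read when you are starting out.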

### Challenge: Build anonymous tweet dataset without punctuation

In [None]:
no_punct_tweets=list()
punct=string.punctuation
for tweet in anonymous_tweets:
    new_tweet=[]
    for letter in tweet:
        if letter not in punct:
            new_tweet.append(letter)
    no_punct_tweets.append(''.join(new_tweet))      
    
no_punct_tweets[100:150]

## Tokenization

We are interested in words and sentences. We can extract them with an operation called "tokenization".

**Tokenization = cut the text into pieces like sentences or words**
<br><br>
<img src="res/token.jpg"/>

Just as you can cut paper with scissors, you can cut text into words, sentences and other combinations. NLTK has several tokenizers (including support for different languages) which you can explore [here](https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize).<br>
Let's explore the most basic ones:

In [None]:
from nltk import word_tokenize,sent_tokenize

### Word tokenization

In [None]:
tokens=word_tokenize(tweets[5]) # try different tokenizers
tokens

Now, collect all tokens into a list so we can use them later:

In [None]:
word_tokens_list=list()
for tweet in anonymous_tweets:
    word_tokens_list+=word_tokenize(tweet)

print(len(word_tokens_list))
word_tokens_list[100:150]

### Sentence tokenizer


In [None]:
sent_tokens_list=list()
for tweet in anonymous_tweets:
    sent_tokens_list+=sent_tokenize(tweet)

print(len(sent_tokens_list))
sent_tokens_list[0:10]

Now, notice you can have tokens which can be exactly the same, for example, you can see the same word with different case.
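
A quick way to see the effect of case is to count tokens before and after lowercasing. The token list below is made up for illustration.

```python
from collections import Counter

# Made-up tokens containing the same words in different cases
tokens = ["We", "love", "we", "WE", "Love", "data"]

raw_counts = Counter(tokens)
lower_counts = Counter(t.lower() for t in tokens)

print(len(raw_counts))     # number of distinct tokens before lowercasing
print(len(lower_counts))   # number of distinct tokens after lowercasing
print(lower_counts["we"])  # all case variants of "we" are now counted together
```

This is the motivation for the normalization steps later in the notebook.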

## Challenge: Try the wordpunct_tokenize and TweetTokenizer

In [None]:
from nltk import wordpunct_tokenize,TweetTokenizer

# your code here

In [None]:
# this is the word punct tokenizer
punct_tokens_list=list()
for tweet in anonymous_tweets:
    punct_tokens_list+=wordpunct_tokenize(tweet)

print(len(punct_tokens_list))
punct_tokens_list[100:150]

In [None]:
# this is a special tokenizer for tweets
tweet_tokenize = TweetTokenizer()
tokens=tweet_tokenize.tokenize(tweets[5]) 
tokens
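
To get a feeling for what a punctuation-aware tokenizer does, here is a minimal regular-expression sketch that needs no NLTK data. It is similar in spirit to `wordpunct_tokenize` but is only an illustration (the function name `simple_wordpunct` and the sample text are made up), and it does not treat hashtags or emoticons specially the way a tweet-aware tokenizer may.

```python
import re

def simple_wordpunct(text):
    # a token is either a run of word characters
    # or a run of characters that are neither word characters nor spaces
    return re.findall(r"\w+|[^\w\s]+", text)

sample = "Can't wait!!! #launch day :)"

regex_tokens = simple_wordpunct(sample)
split_tokens = sample.split()

print(regex_tokens)
print(split_tokens)
```

Comparing the two outputs shows the trade-off: splitting on whitespace keeps "Can't" and "#launch" whole but glues "!!!" to "wait", while the regex separates every punctuation run.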

## Normalization: Set the ground rules

Strings have built-in functions to help us transform the text. Our goal is to have the text in a normalized form so we can start analysing it.
1. Convert letters to lowercase or uppercase
2. Remove blanks (leading and trailing whitespace)
3. Remove unwanted characters: punctuation, symbols, accents and diacritics (e.g. ', `)
4. Expand abbreviations (not covered)
5. Remove stop words

Let's remember some of the functions we can use for these tasks:

In [None]:
# lowercase
word='We'
print(word.lower())

In [None]:
# remove spaces
word='We '
print(word.strip())

In [None]:
# remove unwanted characters such as apostrophes
word='We\'d'
new_word=''
for letter in word:
    if letter not in string.punctuation:
        new_word+=letter
print(new_word)
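
The loop above removes punctuation such as apostrophes, but accented letters like "é" are not in `string.punctuation`. One possible way to handle them, sketched here with the standard library's `unicodedata` module, is to decompose each character into a base letter plus combining marks and then drop the marks. The function name `strip_accents` is made up for this example.

```python
import unicodedata

def strip_accents(text):
    # NFD splits "é" into "e" followed by a combining acute accent
    decomposed = unicodedata.normalize("NFD", text)
    # keep everything that is not a combining mark
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

plain = strip_accents("café naïve résumé")
print(plain)
```

Whether removing accents is a good idea depends on the language: in some languages it changes the meaning of a word.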

In general, we can use all the string functions Python has:<br>
<img src='res/strings.png'>

## Removing stop words
Stop words = the most common words in a language, which we usually do not want to include in the analysis

In [None]:
# remove stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)
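
Filtering with a stop word set is a one-line list comprehension. The sketch below uses a small hand-written set as a stand-in for `stop_words` so that it runs without downloading the NLTK corpus; the tokens are made up. The NLTK English list is in lowercase, so tokens are lowercased before the comparison.

```python
# Assumption: a small hand-written set standing in for stopwords.words('english')
sample_stop_words = {"i", "am", "the", "of", "a", "to", "is"}

# Made-up tokens for illustration
tokens = ["I", "am", "the", "owner", "of", "a", "broken", "phone"]

# compare in lowercase so that "I" matches "i"
filtered = [t for t in tokens if t.lower() not in sample_stop_words]

print(filtered)
```

Using a `set` rather than a list for the stop words keeps each membership check fast, which matters when there are millions of tokens.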

### Challenge: Normalization
Choose one of the token lists (word/punct) and normalize every token by lowercasing and removing stop words, symbols and duplicates

In [None]:
# write your code here

## Lemmatization and stemming
We have tried to normalise our text: we converted the word "We" to "we" so that both forms are counted as the same word. Now, how do we handle a situation like "child" and "children"?

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:<br>
am, are, is $\Rightarrow$ be <br>
car, cars, car's, cars' $\Rightarrow$ car

<i>(From Stanford NLP)</i>
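
The two examples above hint at two different strategies: looking up irregular forms in a dictionary, and stripping regular endings with rules. The toy sketch below combines both to reproduce those examples. It is deliberately naive (the lookup table and the function `toy_base_form` are made up, and a word like "class" would be wrongly cut to "clas"); it is not how the NLTK stemmer or lemmatizer are implemented.

```python
# Assumption: a tiny hand-written table of irregular forms
IRREGULAR = {"am": "be", "are": "be", "is": "be", "children": "child"}

def toy_base_form(word):
    word = word.lower()
    # dictionary lookup for irregular forms (the lemmatization idea)
    if word in IRREGULAR:
        return IRREGULAR[word]
    # rule-based suffix stripping (the stemming idea)
    word = word.replace("'", "")  # drop possessive apostrophes
    if word.endswith("s") and len(word) > 3:
        word = word[:-1]          # drop a final "s"
    return word

forms = ["am", "are", "is", "car", "cars", "car's", "cars'"]
base_forms = [toy_base_form(w) for w in forms]
print(base_forms)
```

Real stemmers use many more rules, and real lemmatizers use a full dictionary plus the part of speech.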

### Stemming <br>
<img src="res/Stemming_Words_print.png" width='50%'/>

In [None]:
from nltk.stem.porter import PorterStemmer

# Create stemmer
porter = PorterStemmer()

# Apply stemmer
for word in word_tokens_list[100:150]:
    stem_word=porter.stem(word)
    print(word,'->',stem_word)

### Lemmatisation <br>
Grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.<br>
<img src='res/lemma.png'>

In [None]:
import nltk
from nltk.corpus import wordnet as wn
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()

# try words like 'better' (adj), 'cats' (noun)
print(lemmatizer.lemmatize('better',wn.ADJ))

## Challenge: Create a new list of tokens by lemmatizing as verb, noun and adjective

In [None]:
# hint:  check if the word changes with verb, if not try noun, and so on...

### Spelling correction
<img src='res/spelit.gif'>


Our goal: identify misspelled words and try to replace them with the correct ones. There are several approaches. Today we will focus on the most basic one, but we can try more advanced techniques in future meetups:
1. Using regular expressions
2. Using measures of similarity in words, language models and meanings

In [None]:
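# three or more repetitions of the same character, e.g. "sooooo"
pattern = re.compile(r"(.)\1{2,}")

for word in word_tokens_list:
    # collapse the repetition to two characters: "sooooo" -> "soo"
    new_word=pattern.sub(r"\1\1", word)
    if (word!=new_word):
        print(word,'->',new_word)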
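
The second approach in the list above, similarity between words, can be sketched with the standard library's `difflib`. This is only a minimal illustration: the vocabulary is a tiny hand-made list, the helper `suggest` is hypothetical, and the cutoff of 0.8 is an arbitrary choice. A real spell checker would use a large dictionary and word frequencies.

```python
import difflib

# Assumption: a tiny hand-made vocabulary of correctly spelled words
vocabulary = ["hello", "help", "phone", "service", "thanks"]

def suggest(word, vocab=vocabulary):
    # return the closest vocabulary word, or the word itself if nothing is close enough
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=0.8)
    return matches[0] if matches else word

for misspelled in ["helo", "servce", "xyz"]:
    print(misspelled, '->', suggest(misspelled))
```

Lowering the cutoff makes the function more willing to correct a word, at the risk of "correcting" words that were simply missing from the vocabulary.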

## Visualization
Remember the frequency distribution plot from the Introduction to NLTK workshop? Here we will use a different, more visual way to plot it.

### Word cloud
A word cloud represents the importance of each token by drawing the most important ones (here, the most frequent) in a bigger size.
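
The idea behind the sizes can be sketched with a simple frequency count. The words below are made up, and scaling every count by the largest one is only a rough illustration of "bigger means more frequent"; the sizing and layout done by a word cloud library are more involved.

```python
from collections import Counter

# Made-up text for illustration
words = "rocket launch rocket orbit rocket launch moon".split()

counts = Counter(words)
top_word, top_count = counts.most_common(1)[0]

# relative weight of each word: 1.0 for the most frequent one
weights = {word: count / top_count for word, count in counts.items()}

print(top_word, top_count)
print(weights)
```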

<img src='res/wordle_love_song.jpg'>


We need to use an extra library for this, called <i>wordcloud</i>.
<br>
To install, run this code in your command line:<br>
`conda install -c conda-forge wordcloud`

If you are using pip:<br>
`pip install wordcloud`

Now you can use the library!



In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

We need to prepare the text first. Let's try the username list. We want to put all usernames in lowercase so that the same username written in different cases is not counted as different words.

In [None]:
# set all users in one string
flat_users=' '.join(users_list)

# normalize by putting lowercase
flat_users=flat_users.lower()

In [None]:
wordcloud = WordCloud(width=1024, height=768,max_font_size=40,collocations=False).generate(flat_users)
plt.figure(figsize = (15, 15), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
plt.show() 

Can we make it prettier?


In [None]:
import numpy as np
import os
from PIL import Image

mask = np.array(Image.open("res/heart.jpg"))
wordcloud = WordCloud(max_font_size=40,collocations=False,mask=mask).generate(flat_users)
plt.figure(figsize = (15, 15), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
plt.show() 

## Challenge:  Build a wordcloud 
Create two word clouds, one with the topics and another with the whole set of cleaned tweets

## Bonus!!!
Play with plotting frequency distributions using your final list of tokens!!

In [None]:
## Frequency distributions
from nltk import Text
from nltk.probability import FreqDist

my_text=Text(...) #fill in
fdist=FreqDist(...) #fill in

More on visualization? Come to the next Python meetup: [Python Data Viz with Matplotlib, the Godparent of Python Plotting Libraries](https://www.eventbrite.com.au/e/python-data-viz-with-matplotlib-the-godparent-of-python-plotting-libraries-tickets-48771750619)