<a href="https://colab.research.google.com/github/steve-wilson/ds32019/blob/master/02_Noisy_Text_Processing_DS3Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fundamentals of Text Analysis for User Generated Content @ [DS3](https://www.ds3-datascience-polytechnique.fr/)

# Part 2: Noisy Text Processing

[<- Previous: Text Processing Basics](https://colab.research.google.com/drive/1_RjgWX5FfVeaipdDIS1zHsDUrfYnBT60)

[-> Next: Data Collection](https://colab.research.google.com/drive/1Jjx7t3cAkNTtCcKP4Qkkp5uOqd8wQTB3)

Dates: June 27-28, 2019

Facilitator: [Steve Wilson](https://steverw.com)

(To edit this notebook: File -> Open in Playground Mode)

---



## Initial Setup

- **Run "Setup" below first.**

    - This will load libraries and download some resources that we'll use throughout the tutorial.

    - You will see a message reading "Done with setup!" when this process completes.



In [0]:
#@title Setup (click the "run" button to the left) {display-mode: "form"}

## Setup ##

# imports

# built-in Python libraries
# -------------------------

# counting and data management
import collections
# operating system utils
import os
# regular expressions
import re
# additional string functions
import string
# system utilities
import sys
# request() will be used to load web content
import urllib.request


# 3rd party libraries
# -------------------

# Natural Language Toolkit (https://www.nltk.org/)
import nltk

# download punctuation related NLTK functions
# (needed for sent_tokenize())
nltk.download('punkt')
# download NLTK part-of-speech tagger
# (needed for pos_tag())
nltk.download('averaged_perceptron_tagger')
# download wordnet
# (needed for lemmatization)
nltk.download('wordnet')
# download stopword lists
# (needed for stopword removal)
nltk.download('stopwords')
# dictionary of English words
nltk.download('words')

# numpy: matrix library for Python
import numpy as np

# scipy: scientific operations
# works with numpy objects
import scipy

# matplotlib (and pyplot) for visualizations
import matplotlib
import matplotlib.pyplot as plt

# sklearn for basic machine learning operations
import sklearn
import sklearn.manifold
import sklearn.cluster

# for spelling correction
!pip install pyspellchecker
import spellchecker

!pip install emoji
import emoji

!pip install spacy
import spacy
    
# redefine some functions from before
NLP = spacy.load('en',disable=['ner','parser'])
def text_to_lemma_frequencies(text):
    doc = NLP(text)
    # token.lemma_ is the lemma as a string (token.lemma is only its integer hash)
    words = [token.lemma_ for token in doc if token.is_stop != True and token.is_punct != True]
    return collections.Counter(words)
    
# quick test:
test_doc = "This is a test. Does this work?"
result = text_to_lemma_frequencies(test_doc)
passed = result == nltk.probability.FreqDist(["test","work"])
if passed:
    print ("Test passed!")
else:
    print("Test did not pass yet.")
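    # text_to_lemma_frequencies() returns a collections.Counter
    # (nltk's FreqDist is a subclass of Counter, so it would pass too)
    if isinstance(result, collections.Counter):
        print("got these words:", result.keys(),\
              "\nwith these counts:", result.values())
    else:
        print("Did not return a Counter object.")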

print()
print("Done with setup!")
print("If you'd like, you can click the (X) button to the left to clear this output.")

---
## Dealinggg w/ #NoisyText 💬 ➡️ 💻 

(depending on your OS/browser, emoji may be [displayed differently](https://unicode.org/emoji/charts/full-emoji-list.html))

### Spelling Correction

- One of the most common issues when dealing with user-generated content is typos or alternate spellings.
- Ideally, our text analysis systems will be robust to this sort of thing.
- We could:
    1. ignore it.
    2. use a pre-defined vocabulary and remove any unknown words.
    3. use a character based model that will be less sensitive to it.
    4. try to automatically correct it.
- Let's look at how we might implement the last approach.

In [0]:
typo_text = "This is some sameple text that probabbly contais some typos."

spellcheck = spellchecker.SpellChecker()
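# correction() may return None when no candidate is found (in newer
# versions of pyspellchecker), so fall back to the original word
corrected_text = [spellcheck.correction(word) or word for word in typo_text.split()]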
print(' '.join(corrected_text))

- This spellchecker module uses the notion of [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) between words.
    - Other more sophisticated spell checking tools might also look for things like keyboard key distance or lists of common mistakes.

- You may also want to add rules to handle emerging linguistic phenomena such as the repetition of the last (or other) letters of a word.
    - These kinds of alterations are difficult to handle with edit distance because they may contain many extra characters.
    - e.g., "sorryyyyyy", "thanksssss", or "byeeeeee"
    - It is worth deciding whether or not this information is worth recording. You could:
        - Try to completely remove the additional letters by checking for repetitions, removing letters until a match is found in a dictionary.
            - This means that "sorryyyyyy" and "sorry" will be treated as the same word!
        - Remove them, but include a token like \<rep\> or \<rep-N\> at the end of the word to denote N repeats of the preceding letter.
            - This way, sorry can be matched with sorry\<rep-3\> if desired, and total repetitions per sentence/document/user can be easily computed.
        - Leave these words as they are
            - This means that "sorryyyy" and "sorryyy" will be treated as different words!
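To make the edit-distance idea concrete, here is a minimal sketch of Levenshtein distance in plain Python. The function name `levenshtein` is ours and this is not the code used inside the spellchecker module; it only illustrates why a single typo is cheap to fix while an elongated word like "sorryyyyyy" is many edits away from "sorry".

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]

print(levenshtein("sameple", "sample"))    # one extra letter
print(levenshtein("sorryyyyyy", "sorry"))  # five extra letters
```

A spelling corrector that only considers candidates within one or two edits would handle the first case but not the second, which is one reason to treat letter repetition with a separate rule.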

In [0]:
# load a set of English words that we can check against
dictionary = set(nltk.corpus.words.words())
sentence = "Ahhhhhh I can't believeee it! Yessssssss"

# let's just check at the end of words for now
def normalize_repititions(word, dictionary):
    counter = 0
    # the length check avoids an IndexError on one-character words
    while len(word) > 1 and word not in dictionary and word[-1] == word[-2]:
        word = word[:-1]
        counter += 1
    if counter:
        return word + " <rep-" + str(counter) + '>'
    else:
        return word
    
print(" ".join([normalize_repititions(word,dictionary) for word in sentence.split()]))

### When special characters matter ¯\\_(ツ)_/¯ 

- Before, we just removed punctuation marks and special characters.
    - As with many of the decisions we are making, there are times when this is desirable, but what about when dealing with user-generated content?
    - Special characters often contain meaningful semantic content, so they might be worth keeping.
- One popular example is [emoji](http://www.unicode.org/emoji/charts/full-emoji-list.html):

In [0]:
# Use the Python emoji library
print(emoji.emojize("We're at DS3 in :France:!"))

- You can check the exact names of the emoji [here](https://github.com/carpedm20/emoji/blob/master/emoji/unicode_codes.py).
- Perhaps more useful for us is going the other direction:

In [0]:
print(emoji.demojize('I just saw the new Avengers movie 👍👍👍'))

- Consider how the meaning of that sentence would change given different emoji.
- By demojizing our input, we can retain the emoji information and still continue to process our data as plain text.
    - *Note:* It's still possible to read and handle the emoji themselves, but it will be much easier to do things like `the_emoji==":thumbs_up:"` in our code if we don't have an easy way to type emoji (and are left copy-pasting everything).
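As a small sketch of why the plain-text form is convenient, the snippet below counts emoji names in text that has already been demojized. It assumes the input string is what `emoji.demojize()` returns for the Avengers example (names wrapped in colons); the names `EMOJI_NAME` and `count_emoji_names` are ours, and the pattern is only a heuristic.

```python
import collections
import re

# Assumption: this is the demojized form of the Avengers example above,
# with each emoji replaced by its name wrapped in colons.
demojized = "I just saw the new Avengers movie :thumbs_up::thumbs_up::thumbs_up:"

# Heuristic pattern for ":name:" tokens. It can also match ordinary text
# such as the ":30:" in "10:30:00", so treat matches as candidates only.
EMOJI_NAME = re.compile(r":[A-Za-z0-9_\-]+:")

def count_emoji_names(text):
    """Count how often each demojized emoji name occurs in text."""
    return collections.Counter(EMOJI_NAME.findall(text))

counts = count_emoji_names(demojized)
print(counts[":thumbs_up:"])
```

Counts like these could be used directly as features, for example as a rough signal of how enthusiastic a message is.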

ϞϞ(๑⚈ ․̫ ⚈๑)∩

- We also might be interested in emoji's predecessor: [emoticons](https://en.wikipedia.org/wiki/List_of_emoticons)
    - However, it has been shown that emoticon usage on platforms like Twitter decreases as more emoji are introduced.
- How could we handle these?
    1. Having a dictionary of these can be useful, since we can't necessarily consider any random combination of non-alphanumeric characters to be an emoticon.
    2. Alternatively, we could try to [write regular expressions](https://github.com/aritter/twitter_nlp/blob/master/python/emoticons.py) to detect the \<eyes\> (\<nose\>) \<mouth\> structure of typical Western emoticons.
    3. A third option would be to learn from the data: we can build a list of non-alphanumeric sequences that appear with at least some predefined frequency.
- People are creative! New combinations are appearing every day, even [mixing emoji with other characters](https://www.fastemoji.com/).

In [0]:
# find possible emoticons

def find_nonalphanum_seqs(text):
    return re.findall(r"[^\w\d\s]{2,}",text)

text = "There's an emoticon in this sentence ＼(＾o＾)／ can you find it?"
print(find_nonalphanum_seqs(text))

- Even the above approach has issues because of the letter **o** in the middle of the celebrating emoticon.
    - There is no "rule" on what can be an emoticon
        - it is actually based on the visual resemblance between a sequence of characters and some real-world expression/object.
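The third option listed earlier (learning candidates from the data) can be sketched as follows. The corpus here is a made-up toy list, and `frequent_symbol_sequences` and `min_count` are illustrative names rather than part of any library.

```python
import collections
import re

def frequent_symbol_sequences(texts, min_count=2):
    """Return non-alphanumeric sequences (length >= 2) that occur
    at least min_count times across all texts."""
    counts = collections.Counter()
    for text in texts:
        counts.update(re.findall(r"[^\w\s]{2,}", text))
    return [seq for seq, n in counts.most_common() if n >= min_count]

# Toy corpus, invented for illustration only
toy_corpus = [
    "great flight :) thanks!!",
    "delayed again :( ...",
    "thanks for the upgrade :)",
    "lost my bag :( not happy...",
]

print(frequent_symbol_sequences(toy_corpus))
```

Note that the ellipsis `...` passes the frequency threshold along with `:)` and `:(`, so a frequency cutoff alone does not separate emoticons from ordinary punctuation; a manual review or a punctuation filter would still be needed.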

### Social elements of user-generated content


- Language is inherently social.
- User-generated content, especially on social media platforms, often makes the social component of the text more explicit.
    - We should try to leverage this information whenever possible!
- Take the following example:

In [0]:
a_tweet = "What word do you first think of when you hear the word “bath”? " +\
          "@radamihalcea from @michigan_AI presents work at #NAACL19 " +\
          "#NLPCSS showing how your culture and gender are associated with " +\
          "your responses to these kinds of questions, and how #NLProc models " +\
          "can capture this phenomenon."

# raw strings avoid invalid escape sequence warnings; @ and # need no escaping
mentions = re.findall(r"@\w+",a_tweet)
hashtags = re.findall(r"#\w+",a_tweet)

print("mentions:",mentions)
print("hashtags:",hashtags)

- Even with *only the text data*, we can extract some meaningful information about the social context of this text.
    - We could do even more with metadata like likes, retweets, timestamp, location, user-level information, images, etc.
        - That is all available from the Twitter API, and we'll cover that in more detail in the next section.
- We can also analyze links (and follow them for even more content).
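A minimal sketch of link extraction, in the same spirit as the mention and hashtag patterns: the regular expression below is a deliberately simple assumption (it only catches links that start with `http://` or `https://`), the example post is invented, and `extract_links` is our own helper name.

```python
import re
from urllib.parse import urlparse

# Simple pattern: "http://" or "https://" followed by non-whitespace characters
URL_PATTERN = re.compile(r"https?://\S+")

def extract_links(text):
    """Return (url, domain) pairs for every http(s) link found in text."""
    links = []
    for raw in URL_PATTERN.findall(text):
        # strip punctuation that belongs to the sentence, not the URL
        url = raw.rstrip(".,;:!?)")
        links.append((url, urlparse(url).netloc))
    return links

# Invented example post
post = ("We used https://www.nltk.org/, and the emoji list is at "
        "http://www.unicode.org/emoji/charts/full-emoji-list.html.")

for url, domain in extract_links(post):
    print(domain, "<-", url)
```

The domain alone is often a useful feature (e.g., news site vs. video platform), even without following the link.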

### Putting it together: noisy text processing

**Exercise 3**: Noisy text processor

- Write a function that takes a sentence as input and produces a "standard text processing" friendly output. It should (at a minimum)
    - convert emoji to strings describing them,
    - correct small typos,
    - remove standard punctuation like sentence ending '.', but keep possible emoticons.
- As a bonus, you could:
    - convert between American and British English spellings
        - collapse into one spelling per word, but add a tag to the beginning or end of the sentence indicating which dialect was used
    - split hashtags into meaningful word sequences
        - use capitalization if possible, but use word-matching otherwise
            - e.g., `#wordsoftheday` should become something like: `<#> words of the day <#>`. (example [from Jack Reuter](http://cs.uccs.edu/~jkalita/work/reu/REU2015/FinalPapers/05Reuter.pdf))
    - in addition to the cleaned sentence, output a dictionary of useful features that could be passed to a machine learning model, like "mentions", "possible emoticons", etc.
    - replace character repetitions found anywhere in words (not just at the end)
        - e.g., converting "Oh wooooooow that's craaaaazyyyy" -> "Oh wow that's crazy".
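As a hint for the hashtag bonus, here is one possible sketch of the "capitalization first, word-matching otherwise" idea. The helper names (`segment`, `split_hashtag`) and the tiny vocabulary are assumptions made for illustration; a real word list such as the `dictionary` set built earlier could be substituted, and the recursive search is only suitable for short strings.

```python
import re

def segment(text, vocabulary):
    """Split text into vocabulary words (longest word first, with
    backtracking). Returns a list of words, or None if no split exists."""
    if not text:
        return []
    for end in range(len(text), 0, -1):
        if text[:end] in vocabulary:
            rest = segment(text[end:], vocabulary)
            if rest is not None:
                return [text[:end]] + rest
    return None

def split_hashtag(hashtag, vocabulary):
    """Split a hashtag into words using capitalization when available,
    falling back to vocabulary matching, then to the unsplit hashtag body."""
    body = hashtag.lstrip("#")
    is_mixed_case = body != body.lower() and body != body.upper()
    if is_mixed_case:
        # runs of capitals, Capitalized or lowercase words, and digit runs
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", body)
        return [part.lower() for part in parts]
    words = segment(body.lower(), vocabulary)
    return words if words is not None else [body]

# Hypothetical toy vocabulary, for illustration only
toy_vocabulary = {"word", "words", "of", "the", "day", "soft", "he"}

print(split_hashtag("#NoisyText", toy_vocabulary))
print(split_hashtag("#wordsoftheday", toy_vocabulary))
```

Preferring longer words first is a design choice, not a guarantee of the most natural split; ambiguous hashtags may need word frequencies to pick between alternatives.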

- Note: The (real) example tweets used below come from the [Kaggle Twitter Airline Sentiment Dataset](https://www.kaggle.com/crowdflower/twitter-airline-sentiment/downloads/twitter-airline-sentiment.zip/2)
    - the handles of airlines have been replaced with `@airline` so as to avoid expressing opinions about specific companies.

In [0]:
def clean_text(noisy_text):
    
# ------------- Exercise 3 -------------- #

    cleaned_text = noisy_text

# ---------------- End ------------------ #

    return cleaned_text

- Use the following cell to test your function:

In [0]:
# from the Kaggle Twitter Airline Sentiment Dataset
tweets =  ["@airline disappointed that u didnt honor my $100 credit given to me " +\
          "for ur mistakes. Taking my business elsewhere  ✌️ out.", \
          "@airline Awwweesssooomee!", \
          "@airline on hold for TWO HOURS now, pick up the PHONEEEEE", \
          "@airline I never made a reward reservation becuase no one ever " +\
          "answered the phone. The online one I made got Cancelled Flighted " +\
          "and I can't change", \
          "@airline I will. Thank you for at least tweeting me back :) better " +\
          "than most. 👌"]

cleaned = []
for tweet in tweets:
    cleaned.append(clean_text(tweet))
    
print("Cleaned tweets:\n" + '\n'.join(cleaned))

# quick test
passed = True
print("\nTesting...")
if "✌️" in cleaned[0]:
    passed = False
    print("Did not remove emoji.")
# re.search looks anywhere in the string; the raw string keeps \b as a word boundary
if re.search(r"ee\b",cleaned[1]):
    passed = False
    print("Did not remove repetitions at word endings.")
if "becuase" in cleaned[3]:
    passed = False
    print("Did not correct spelling")
if passed:
    print("Passed all tests.")
    

- From the results, you should start to get a feel for how tricky it can be to work with this kind of data.
    - For example, the spell checking dictionary doesn't have an entry for "tweeting", which leads to an erroneous correction.
- Every type of cleaning is a design decision and forces a tradeoff between *natural text* and *machine-friendly text*.
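One way to soften that tradeoff is to guard the spelling corrector so that it skips tokens it is likely to damage. The sketch below is only an illustration: `guarded_correction`, `PROTECTED_PATTERN`, and the stand-in `toy_correct` function are hypothetical names, and the toy corrections are invented rather than real spellchecker output. In practice, `correct` could be any function that maps a word to a suggestion (or `None`).

```python
import re

# Tokens that should not be spell-corrected: mentions, hashtags,
# and demojized emoji names such as ":thumbs_up:"
PROTECTED_PATTERN = re.compile(r"^(@\w+|#\w+|:[A-Za-z0-9_\-]+:)$")

def guarded_correction(word, correct, protected_words=frozenset()):
    """Apply correct(word) unless the word is protected.
    Falls back to the original word if correct() returns nothing."""
    if PROTECTED_PATTERN.match(word) or word.lower() in protected_words:
        return word
    suggestion = correct(word)
    return suggestion if suggestion else word

# Stand-in corrector with invented suggestions (NOT real spellchecker output)
toy_suggestions = {"becuase": "because", "tweeting": "meeting"}

def toy_correct(word):
    return toy_suggestions.get(word)

print(guarded_correction("becuase", toy_correct))
print(guarded_correction("tweeting", toy_correct, protected_words={"tweeting"}))
print(guarded_correction("@airline", toy_correct))
```

The protected word list is itself a design decision: every word added keeps more of the natural text but also lets more genuine typos through.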

In [0]:
#@title Sample Solution (double-click to view) {display-mode: "form"}

def clean_text(noisy_text):
    
# ------------- Exercise 3 -------------- #
    # create a spellchecker object
    spellcheck = spellchecker.SpellChecker()
    # load dictionary of english words
    dictionary = set(nltk.corpus.words.words())
    
    cleaned_words = []
    # get sentences
    sentences = nltk.sent_tokenize(noisy_text)
    for sentence in sentences:        
        
        demojified = emoji.demojize(sentence)
        # and get words
        words = nltk.word_tokenize(demojified) 
        is_emoji = False
        for word in words:
            
            # correct word-level spelling
            corrected = word
            
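            # don't spell correct emoji names (the token right after a ":")
            if not is_emoji:
                # correction() may return None when no candidate is found
                # (in newer versions of pyspellchecker), so keep the word
                corrected = spellcheck.correction(word) or word
                     
            # remove repetitions at the end of the word
            # (the length check inside the loop avoids an IndexError on tokens like "...")
            counter = 0
            while len(corrected) > 2 and corrected not in dictionary and corrected[-1] == corrected[-2]:
                corrected = corrected[:-1]
                counter += 1
            if counter:
                cleaned_words += [corrected, "<rep-" + str(counter) + '>']
            else:
                cleaned_words += [corrected]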
                
            is_emoji = word == ":"
                
    cleaned_text = ' '.join(cleaned_words)

# ---------------- End ------------------ #

    return cleaned_text
    

- Next, we will look at how we can collect some (potentially noisy) text data.

- [-> Next: Data Collection](https://colab.research.google.com/drive/1Jjx7t3cAkNTtCcKP4Qkkp5uOqd8wQTB3)