# Text Data in Python

## Exercise 1

  1. Read the `hillary_tweets.txt` file located in the "data" directory of the workshop repository. Alternatively, you can read it directly from the URL https://raw.githubusercontent.com/boyko/text-analytics-script/main/data/hillary_tweets.txt
  2. Split the string on the newline character.
  3. Compute the frequency distribution of all words over all tweets using `nltk.FreqDist`.
  4. Compute the frequency distribution of all words in the first tweet using `nltk.FreqDist`.
  5. Normalise each tweet by:
    - compressing whitespace and removing leading and trailing whitespace
    - removing the punctuation using regular expressions
    - tokenizing the tweet using the `nltk.word_tokenize` function
    - removing the stopwords
    - applying the Porter stemming algorithm
  6. Create a pandas dataframe with one column containing the tweets.
  7. Summarise each tweet by:
    - counting the number of characters in each tweet
    - counting the number of sentences in each tweet
    - counting the number of hashtags in each tweet
    - extracting the first mention (@) in each tweet into a column called `first_mention`
    - extracting the datetime of each tweet into a column called `timestamp`

In [16]:
## Imports
import os
import nltk
import pandas as pd
import string
import re

## File path to the data
tweets_path = os.path.join("..", "data", "hillary_tweets.txt")

In [2]:
## 1) Read the data
## For the relative path to work, make sure that the Jupyter process runs in the
## directory of this file

with open(tweets_path, 'r') as f:
    text = f.read()

In [3]:
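## 2) Number of tweets
## The tweets are separated by newline characters, so we split the text on "\n"
## to obtain a list of tweets. Using f.readlines() instead of f.read() in the
## previous step gives a similar list, except that each element keeps its
## trailing newline character.
## Note: if the text ends with a newline, split("\n") returns an empty string
## as its last element, and that element is included in the count below.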

tweets = text.split("\n")
len(tweets)

6

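A minimal sketch, using a made-up two-line string rather than the workshop file, of how `str.split("\n")` and `str.splitlines()` differ when the text ends with a newline. The sample text is hypothetical; it only illustrates why a count obtained with `split` can be one higher than the number of tweets.

```python
# Hypothetical sample text standing in for the tweets file
# (assumption: the text ends with a newline character)
sample_text = "first tweet\nsecond tweet\n"

# split("\n") keeps the empty string that follows the final newline
by_split = sample_text.split("\n")

# splitlines() does not produce that trailing empty element
by_splitlines = sample_text.splitlines()

# Dropping empty strings makes the split("\n") result match splitlines()
non_empty = [t for t in by_split if t]

print(by_split)       # ['first tweet', 'second tweet', '']
print(by_splitlines)  # ['first tweet', 'second tweet']
```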
In [19]:
## 3) Frequency of words
## We need to collect the words in a single list in order
## to pass them to nltk.FreqDist.

## For example, we can split the entire text on blanks
tweet_words_blanks = text.split(" ")
freqs_by_blanks = nltk.FreqDist(tweet_words_blanks)
freqs_by_blanks.most_common(3)

## Or use the nltk sentence and word tokenizers

sentences = nltk.sent_tokenize(text)
words = []
for sent in sentences:
    words.extend(nltk.word_tokenize(sent))

freq = nltk.FreqDist(words)
freq.most_common(3)

[(',', 16), ('the', 10), ('@', 10)]

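To see why splitting on blanks and tokenizing give different counts, here is a small sketch on a made-up sentence. It uses `collections.Counter` (which `nltk.FreqDist` subclasses) and a simple regular expression as a stand-in for `nltk.word_tokenize`, so it runs without NLTK. The sample text and the regex tokenizer are assumptions for illustration and do not reproduce NLTK's tokenization rules.

```python
import re
from collections import Counter

# Made-up sample text (not taken from the tweets file)
sample = "the cat, the dog, and the bird."

# Splitting on blanks leaves punctuation attached to the words
by_blanks = Counter(sample.split(" "))

# A simple regex tokenizer: runs of word characters, or single
# non-space, non-word characters (a rough stand-in for nltk.word_tokenize)
tokens = re.findall(r"\w+|[^\w\s]", sample)
by_tokens = Counter(tokens)

print(by_blanks.most_common(2))  # [('the', 3), ('cat,', 1)]
print(by_tokens.most_common(2))  # [('the', 3), (',', 2)]
```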
In [21]:
## 4) Frequency distribution of words in the first tweet

## To keep the code DRY, we can package the word-splitting in a function

def tokenize_text(txt):
    txt_sents = nltk.sent_tokenize(txt)
    txt_words = []
    for s in txt_sents:
        txt_words.extend(nltk.word_tokenize(s))

    return txt_words

words_tweet1 = tokenize_text(tweets[0])

freq_tweet1 = nltk.FreqDist(words_tweet1)
freq_tweet1.most_common(3)

[("''", 2), ('the', 2), ('&', 2)]

In [23]:
## 5) Text normalization

## The steps are: clean up the whitespace, tokenize the tweet,
## remove the stopwords and the punctuation, and apply the Porter stemmer.

## We can wrap the sequence of transformations in a small function

stemmer = nltk.PorterStemmer()
stop_words = set(nltk.corpus.stopwords.words('english'))

def normalize_text(txt):
    ## Compress whitespace and remove leading and trailing whitespace
    txt_next = txt.strip()
    txt_next = re.sub(r"\s+", " ", txt_next)

    ## Remove stopwords
    ## First we convert the string to lowercase, because the stopwords
    ## in nltk are also in lowercase
    txt_next = txt_next.lower()

    ## Next, get the words
    txt_words = tokenize_text(txt_next)

    ## Filter the stopwords
    txt_words = [w for w in txt_words if w not in stop_words]
    ## Remove the punctuation
    ## Note: this is a substring test against string.punctuation, so it drops
    ## single punctuation characters but keeps tokens such as "''" and "..."
    txt_words = [w for w in txt_words if w not in string.punctuation]
    ## Apply the stemmer
    stems = [stemmer.stem(w) for w in txt_words]

    return stems

normalize_text(tweets[0])


['9.11.201610pm',
 "''",
 'littl',
 'girl',
 'watch',
 '...',
 'never',
 'doubt',
 'valuabl',
 'power',
 'deserv',
 'everi',
 'chanc',
 'opportun',
 'world',
 "''"]

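The exercise asks for punctuation to be removed with regular expressions. A minimal sketch of one way to do that on a list of tokens follows. The token list is hand-written to resemble tokenizer output, and the helper name `drop_punctuation_tokens` is hypothetical. It also shows why a membership test against `string.punctuation` lets tokens such as `''` and `...` through.

```python
import re
import string

# Hand-written tokens resembling tokenizer output (assumption, for illustration)
tokens = ["''", "little", "girls", "watching", "...", ",", "never", "doubt", "!"]

# Matches tokens that consist only of non-word, non-space characters
PUNCT_ONLY = re.compile(r"[^\w\s]+")

def drop_punctuation_tokens(toks):
    """Hypothetical helper: keep tokens that are not pure punctuation."""
    return [t for t in toks if not PUNCT_ONLY.fullmatch(t)]

# "t not in string.punctuation" is a substring test, so multi-character
# tokens that are not substrings of string.punctuation survive it
by_membership = [t for t in tokens if t not in string.punctuation]
by_regex = drop_punctuation_tokens(tokens)

print(by_membership)  # ["''", 'little', 'girls', 'watching', '...', 'never', 'doubt']
print(by_regex)       # ['little', 'girls', 'watching', 'never', 'doubt']
```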
In [52]:
## 6) Pandas dataframe from the list of tweets
dt = pd.DataFrame(tweets, columns=["tweet"])

## 7) Summarise each tweet
## Count the number of characters
dt["num_chars"] = dt.tweet.str.len()

## Count the number of sentences
dt["num_sents"] = dt.tweet.apply(lambda x: len(nltk.sent_tokenize(x)))

## Number of hashtags
dt["num_hashtags"] = dt.tweet.str.count("#")

## First mention
## The regular expression matches an @ followed by one or more word
## characters. The capture group keeps only the name after the @, and
## str.extract returns the first match only.
dt["first_mention"] = dt.tweet.str.extract(r"@(\w+)")

## Extract the datetime
## The regexp for dates in this format
## matches one or two digits (day) followed by a dot
## followed by one or two digits (month) followed by a dot
## followed by four digits (year)
## followed by one or two digits (hour)
## followed by two characters that can each be "a", "m" or "p"
dt["timestamp"] = dt.tweet.str.extract(r"(\d{1,2}\.\d{1,2}\.\d{4}\d{1,2}[amp]{2})")
dt[["tweet", "timestamp"]]

|   | tweet | timestamp |
|---|---|---|
| 0 | 9.11.201610pm"To all the little girls watching... | 9.11.201610pm |
| 1 | 9.11.20169amWe're delighted to have spent the ... | 9.11.20169am |
| 2 | 11.11.20164pmMeet the people who are suing the... | 11.11.20164pm |
| 3 | 14.11.20163pmHillary Clinton at Wednesday's @d... | 14.11.20163pm |
| 4 | 10.11.20168pmI'm so proud of @OnwardTogether p... | 10.11.20168pm |
| 5 |  |  |

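The `timestamp` column holds strings such as `9.11.201610pm`. If real datetime values are needed, one possible approach is sketched here with the standard library: a stricter pattern with named groups and a manual 12-hour to 24-hour conversion. The helper name `parse_tweet_timestamp` is hypothetical, the day-month-year order is an assumption based on values such as `14.11.2016`, and the minutes are set to zero because the strings carry only an hour.

```python
import re
from datetime import datetime

# Day.month.year followed directly by the hour and an am/pm suffix.
# "[ap]m" is stricter than "[amp]{2}", which would also accept e.g. "mm".
TIMESTAMP_RE = re.compile(
    r"(?P<day>\d{1,2})\.(?P<month>\d{1,2})\.(?P<year>\d{4})"
    r"(?P<hour>\d{1,2})(?P<ampm>[ap]m)"
)

def parse_tweet_timestamp(text):
    """Hypothetical helper: return a datetime for the leading timestamp, or None."""
    m = TIMESTAMP_RE.match(text)
    if m is None:
        return None
    # Convert the 12-hour clock to 24 hours (12am -> 0, 12pm -> 12)
    hour = int(m["hour"]) % 12
    if m["ampm"] == "pm":
        hour += 12
    return datetime(int(m["year"]), int(m["month"]), int(m["day"]), hour)

print(parse_tweet_timestamp("9.11.201610pm"))  # 2016-11-09 22:00:00
print(parse_tweet_timestamp("9.11.20169am"))   # 2016-11-09 09:00:00
print(parse_tweet_timestamp(""))               # None
```

Applied to the dataframe, something like `dt["tweet"].apply(parse_tweet_timestamp)` would run the helper row by row; rows without a timestamp, such as the empty last one, yield no value.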

## Exercise 2

The following data frame contains 5 appointment records.

1. Replace each weekday mentioned in the records with its three-letter abbreviation.
2. Extract the time of each appointment (keep the am/pm suffix). Use named capture groups and the
    `str.extractall` method of the `text` column.


In [56]:
time_sentences = ["Monday: The doctor's appointment is at 2:45pm.",
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df

|   | text |
|---|---|
| 0 | Monday: The doctor's appointment is at 2:45pm. |
| 1 | Tuesday: The dentist's appointment is at 11:30... |
| 2 | Wednesday: At 7:00pm, there is a basketball game! |
| 3 | Thursday: Be back home by 11:15 pm at the latest. |
| 4 | Friday: Take the train at 08:10 am, arrive at ... |


In [60]:
## Extract the times

df.text.str.extractall(r"(?P<hour>\d{1,2}):(?P<minute>\d{1,2})\s*(?P<ampm>[amp]{2})")

|   | match | hour | minute | ampm |
|---|---|---|---|---|
| 0 | 0 | 2 | 45 | pm |
| 1 | 0 | 11 | 30 | am |
| 2 | 0 | 7 | 00 | pm |
| 3 | 0 | 11 | 15 | pm |
| 4 | 0 | 08 | 10 | am |
| 4 | 1 | 09 | 00 | am |

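For readers who want to see what `str.extractall` does for a single string, the same pattern can be run with the standard library's `re.finditer`. This is only an illustrative equivalent for one row, not how pandas implements it. Each match yields a dictionary keyed by the named groups, and the captured values are strings.

```python
import re

# The same pattern as in the cell above, compiled once
TIME_RE = re.compile(r"(?P<hour>\d{1,2}):(?P<minute>\d{1,2})\s*(?P<ampm>[amp]{2})")

sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."

# One dictionary per match; the keys are the names of the capture groups
matches = [m.groupdict() for m in TIME_RE.finditer(sentence)]

for i, groups in enumerate(matches):
    print(i, groups)
# 0 {'hour': '08', 'minute': '10', 'ampm': 'am'}
# 1 {'hour': '09', 'minute': '00', 'ampm': 'am'}
```

The position of each dictionary in the list plays the role of the `match` index level in the pandas result.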

In [66]:
## Replace the weekdays (Monday, Tuesday, etc.) with their three-letter abbreviations
## This exercise demonstrates the use of functions when replacing text

## The function receives a match object as its argument
def abbreviate_daynames(m):
    matched_day = m.group(1)
    return matched_day[:3]

## It is important to notice that the function "abbreviate_daynames" itself
## is passed as the replacement argument of "replace"; it is not called here.
## "replace" calls it for every match it finds. A callable replacement
## requires regex=True, which is no longer the default in pandas 2.0 and later.

df.text.str.replace(r"(\w+day)", abbreviate_daynames, regex=True)


0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
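
The pattern `\w+day` matches any word ending in "day", so it would also shorten words such as "holiday". A hedged sketch of a stricter variant, using the standard library's `re.sub` with a callable replacement, is given here. The sample sentence and the helper name `abbreviate` are made up for illustration.

```python
import re

# Explicit weekday names, anchored at word boundaries
WEEKDAY_RE = re.compile(
    r"\b(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\b"
)

def abbreviate(match):
    """Return the first three letters of the matched weekday name."""
    return match.group(1)[:3]

# Made-up sentence containing a non-weekday word that ends in "day"
sentence = "Monday: The holiday party is moved to Friday."

# The loose pattern also shortens "holiday"; the strict one does not
loose = re.sub(r"(\w+day)", abbreviate, sentence)
strict = WEEKDAY_RE.sub(abbreviate, sentence)

print(loose)   # Mon: The hol party is moved to Fri.
print(strict)  # Mon: The holiday party is moved to Fri.
```

For the five appointment records both patterns give the same result; the stricter one only matters once the text contains other words ending in "day".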