# SETUP

- Do not change the content of the cells under __SETUP__ and __TESTS__.
- Work only in the __YOUR WORK__ area.
- Rename the notebook with your group number at the end (substitute XX with your group number).
- Assign the result of each numbered question to the appropriate test variable. For example, the answer to `1.` should be assigned to `test_1`.
- Rounding: use the supplied function `hround` to round decimal numbers when instructed. It's important to use this function because there are [multiple ways to round numbers in Python](https://www.knowledgehut.com/blog/programming/python-rounding-numbers) and they may not result in the same value that the tester is testing against.
- Ensure you run the cells under __SETUP__ before you run your work.
- Before you submit your work, ensure you clean up your notebook. Your notebook has to run without an error in order to be tested. The easiest way to check this is `Kernel->Restart & Run All`.
- Answers are provided below for your convenience.
- You will need to write a program to calculate the answers. Setting the answers to their correct values without solving them is considered *hardcoding* and will result in a zero grade for the assignment as well as a potential academic honesty violation.
- You can also test your submission using [the online code tester](https://notebook-tester.safadi-puzzler.com/)
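
To illustrate why the rounding note matters, here is a minimal sketch comparing Python's built-in `round` with a half-up rule. The supplied `hround` is not shown in this notebook, so the `half_up` helper below is a hypothetical stand-in and may not behave the same way.

```python
from decimal import Decimal, ROUND_HALF_UP

# Built-in round() rounds ties to the nearest even number.
builtin_results = [round(0.5), round(1.5), round(2.5)]

def half_up(value, places=0):
    """Hypothetical helper (not the supplied hround): round ties away from zero."""
    step = Decimal(10) ** -places          # e.g. places=2 -> Decimal('0.01')
    # Going through str() avoids carrying over binary floating-point noise.
    return float(Decimal(str(value)).quantize(step, rounding=ROUND_HALF_UP))

half_up_results = [half_up(0.5), half_up(1.5), half_up(2.5)]
```

The two lists differ on `0.5` and `2.5`, which is the kind of mismatch a tester comparing exact values would flag.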


In [1]:
# DO NOT EDIT OR CHANGE THE CONTENT OF THIS CELL
scenario = 0

In [2]:
import pandas as pd
from textblob import TextBlob
from sklearn.feature_extraction import _stop_words
stop_words = set(_stop_words.ENGLISH_STOP_WORDS)

# Text analytics

In this question we will further analyze the tweet data set.
We will extract word collocations from the tweet text.
Word collocations are sequences of words that frequently occur together in text. For example, "big data" is a collocation because the words "big" and "data" frequently appear together.

1. First we will simplify the tweets. In this analysis, we do not care about the content of hashtags, mentions, or links. However, we want to keep their positions because they are part of the tweet. Therefore, we want to replace each hashtag with `HASHTAG` verbatim, each mention with `MENTION`, and each link with `LINK`. Take a look at `test_1` to see what the first tweet looks like. You need to simplify all tweets for the next step.
2. Next, create a function `bigrams` that takes a string as a parameter and returns the bigrams in the string. A bigram is a pair of consecutive words. In the function, use TextBlob to obtain the list of words in the text. Then write a loop that goes over the words and combines each pair of consecutive words into a tuple. The tuples are added to the list that the function returns. Take a look at `test_2` for what the function returns on a sample text. Report the outcome of invoking this function on `"A bigram is a pair of consecutive written words"`.
3. Apply the function to the simplified text of each tweet. What are the bigrams of the first tweet?
4. Count the frequencies of the bigrams. Report the ten most frequent bigrams in a list.
5. As you can see, the most frequent bigrams all contain hashtags, mentions, links, or common English words. We want to filter out any bigram that contains `HASHTAG`, `MENTION`, `LINK`, or a common English word that is in the variable `stop_words`. Filter the bigrams and report the ten most frequent ones again.
6. We are starting to see some topics emerging from this analysis. We want to extend the previous analysis to report trigrams, which are sequences of three consecutive words. Write a function `trigrams` that is similar to the previous function but returns a list of the trigrams in the passed string. Report the outcome of invoking the function on `"A trigram is a three consecutive written words"`.
7. What are the trigrams of the first tweet?
8. What are the ten most frequent trigrams?
9. What are the ten most frequent trigrams after filtering out `HASHTAG`, `MENTION`, `LINK`, and common English words?
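
The replacement described in step 1 can be tried on a single string before applying it to the whole column. This is a minimal sketch using the standard `re` module; the `simplify` helper and the sample tweet are made up for illustration, and the link pattern only covers `http:` URLs.

```python
import re

def simplify(text):
    """Replace mentions, hashtags, and http links with fixed placeholders."""
    text = re.sub(r'@\w+', 'MENTION', text)        # @username
    text = re.sub(r'#\w+', 'HASHTAG', text)        # #topic
    text = re.sub(r'http:[./\w]+', 'LINK', text)   # http://... up to the next space
    return text

# Hypothetical tweet, not taken from tweets.csv
sample = "RT @someone: reading about #bigdata at http://example.com/post via @other"
simplified = simplify(sample)
```

Here `simplified` keeps the sentence structure while every mention, hashtag, and link collapses to its placeholder.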


In [3]:
data = pd.read_csv('tweets.csv')
data.head()

| Unnamed: 0 | text | timestamp | tweetid | tweeter | level | hits | score |
|---|---|---|---|---|---|---|---|
| 0 | Do we need another CxO? RT @KirkDBorne: Hail t... | 2014-02-11 19:05:23 | 433315847696687104 | johnweathington | Individual | 1 | 2.0867 |
| 1 | To the #analysts planning their future also on... | 2012-05-25 11:08:55 | 205978750607822849 | 79dirk | Individual | 1 | 2.222976 |
| 2 | #datascientist #bigdata what proportion of Big... | 2014-01-10 16:27:51 | 421679791385751553 | brwood | Individual | 1 | 2.213131 |
| 3 | Could teams with complementary skills tackle t... | 2014-07-16 11:38:53 | 489373601754124288 | accenture_irl | Organization | 2 | 2.434355 |
| 4 | Last week @hmason spoke about #innovation thro... | 2014-10-18 10:12:01 | 523416199674494976 | cdelancray | Individual | 2 | 3.066636 |


### Notes
- Write code to calculate the following answers. Store each answer in a variable `test_#`. For example, the answer to the first question should be stored in `test_1`.
- Before you submit your work, ensure you clean up your notebook. Your notebook has to run without an error in order to be tested. The easiest way to check this is `Kernel->Restart & Run All`.
- Answers are provided below for your convenience.
- __AGAIN__ Don't change anything in the __SETUP__ and __TESTS__ sections.

In [4]:
test_1=test_2=test_3=test_4=test_5=test_6=test_7=test_8=test_9=0.0

# YOUR WORK

In [5]:
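# regex=True makes the patterns regular expressions (pandas 2.0+ defaults to literal matching)
data['filter'] = data['text'].str.replace(r'@\w+', 'MENTION', regex=True).str.replace(r'#\w+', 'HASHTAG', regex=True).str.replace(r'http:[./\w]+', 'LINK', regex=True)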

In [6]:
test_1 = data['filter'].iloc[0]

In [7]:
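def bigrams(text):
    """Return the list of (word, next_word) tuples found in text."""
    words = TextBlob(text).words  # word tokens produced by TextBlob
    res = []
    # Pair each word with its successor; the last word has none
    for i in range(len(words) - 1):
        res.append((words[i], words[i+1]))
    return res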

In [8]:
test_2 = bigrams("A bigram is a pair of consecutive written words")

In [9]:
data['bigrams']=data['filter'].apply(bigrams)

In [10]:
test_3 = data['bigrams'].iloc[0]

In [11]:
from collections import Counter

In [12]:
freq = Counter(data['bigrams'].sum())

In [13]:
df = pd.DataFrame(freq.items(), columns = ['bigram', 'frequency'])
df = df.sort_values('frequency', ascending=False)
test_4 = list(df.head(10)['bigram'].values)
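
The counting and ranking step can also be sketched without a DataFrame. The example below uses made-up bigram lists in place of `data['bigrams']` and relies on `Counter.most_common`; note that ties may come out in a different order than with `sort_values`, so this is an illustration rather than a drop-in replacement for the cell above.

```python
from collections import Counter
from itertools import chain

# Hypothetical per-tweet bigram lists standing in for data['bigrams']
per_tweet = [
    [("big", "data"), ("data", "science")],
    [("big", "data"), ("data", "scientist")],
    [("big", "data"), ("data", "science")],
]

# chain.from_iterable walks all the lists in order without concatenating them
counts = Counter(chain.from_iterable(per_tweet))

# most_common returns (item, count) pairs, so keep only the bigram
top_two = [bigram for bigram, _ in counts.most_common(2)]
```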

In [14]:
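# or, without a DataFrame (keep only the bigram from each (bigram, count) pair):
#sorted_freq = sorted(freq.items(), key=lambda x: -x[1])
#test_4 = [bigram for bigram, count in sorted_freq[:10]]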

In [15]:
placeholders = {"HASHTAG", "MENTION", "LINK"}
# Keep a bigram only if neither word is a stop word or a placeholder (case-sensitive check)
mask = df['bigram'].apply(lambda x: all(w not in stop_words and w not in placeholders for w in x))
test_5 = list(df[mask].head(10)['bigram'].values)
test_5

[('Data', 'Scientist'),
 ('data', 'scientist'),
 ('’', 's'),
 ('Big', 'Data'),
 ('Data', 'Scientists'),
 ('convoluted', 'world'),
 ('The', 'convoluted'),
 ('data', 'scientists'),
 ('data', 'science'),
 ('Data', 'Science')]
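
The list above still contains `('The', 'convoluted')`, presumably because the membership test is case-sensitive while the stop words are lowercase. If a case-insensitive filter were wanted, it could look like the sketch below. This is an assumption for illustration only: the expected answers under TESTS come from the case-sensitive check, and the small `stop_words` set and the `keep` helper here are hypothetical stand-ins.

```python
# Small stand-in for the notebook's stop_words set (hypothetical subset)
stop_words = {"the", "a", "of", "is"}
placeholders = {"HASHTAG", "MENTION", "LINK"}

def keep(ngram):
    """True when no word is a placeholder or, ignoring case, a stop word."""
    return all(w not in placeholders and w.lower() not in stop_words for w in ngram)

candidates = [("The", "convoluted"), ("convoluted", "world"),
              ("a", "HASHTAG"), ("Big", "Data")]
kept = [g for g in candidates if keep(g)]
```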

In [16]:
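def trigrams(text):
    """Return the list of three-word tuples of consecutive words in text."""
    words = TextBlob(text).words  # word tokens produced by TextBlob
    res = []
    # Stop two words before the end so words[i+2] always exists
    for i in range(len(words) - 2):
        res.append((words[i], words[i+1], words[i+2]))
    return res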
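
Since `bigrams` and `trigrams` differ only in window size, both can be viewed as one sliding-window operation. The sketch below assumes a hypothetical `ngrams` helper and uses plain `str.split` instead of TextBlob, so punctuation handling will differ from the functions above.

```python
def ngrams(words, n):
    """Return every run of n consecutive items in words as a tuple."""
    # zip over n staggered views of the list: words[0:], words[1:], ...
    return list(zip(*(words[i:] for i in range(n))))

words = "A trigram is a three consecutive written words".split()
pairs = ngrams(words, 2)
triples = ngrams(words, 3)
```

With eight words, `pairs` has seven entries and `triples` has six. An input shorter than `n` yields an empty list.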

In [17]:
test_6 = trigrams("A trigram is a three consecutive written words")

In [18]:
data['trigrams']=data['filter'].apply(trigrams)

In [19]:
test_7 = data['trigrams'].iloc[0]

In [20]:
freq3 = Counter(data['trigrams'].sum())

In [21]:
df3 = pd.DataFrame(freq3.items(), columns = ['trigram', 'frequency'])
df3 = df3.sort_values('frequency', ascending=False)
test_8 = list(df3.head(10)['trigram'].values)

In [22]:
test_8

[('HASHTAG', 'HASHTAG', 'HASHTAG'),
 ('LINK', 'HASHTAG', 'HASHTAG'),
 ('HASHTAG', 'HASHTAG', 'LINK'),
 ('HASHTAG', 'LINK', 'HASHTAG'),
 ('a', 'HASHTAG', 'LINK'),
 ('LINK', 'via', 'MENTION'),
 ('MENTION', 'HASHTAG', 'HASHTAG'),
 ('via', 'MENTION', 'HASHTAG'),
 ('be', 'a', 'HASHTAG'),
 ('MENTION', 'LINK', 'HASHTAG')]

In [23]:
placeholders = {"HASHTAG", "MENTION", "LINK"}
mask = df3['trigram'].apply(lambda x: all(w not in stop_words and w not in placeholders for w in x))
test_9 = list(df3[mask].head(10)['trigram'].values)
test_9

[('The', 'convoluted', 'world'),
 ('data', 'scientist', 'A'),
 ('2015', "'s", 'hottest'),
 ("'s", 'hottest', 'profession'),
 ('data', 'science', 'teams'),
 ('scientist', 'A', 'guide'),
 ('Scientist', '’', 'In'),
 ('Social', 'Media', 'Scientist'),
 ('’', 's', '2014'),
 ('s', '2014', 'Top')]

# TESTS

In [24]:
## TEST 1
test_1

'Do we need another CxO? RT MENTION: Hail to the Chief Data Officer HASHTAG LINK via MENTION HASHTAG HASHTAG'

In [25]:
## TEST 2
test_2

[('A', 'bigram'),
 ('bigram', 'is'),
 ('is', 'a'),
 ('a', 'pair'),
 ('pair', 'of'),
 ('of', 'consecutive'),
 ('consecutive', 'written'),
 ('written', 'words')]

In [26]:
## TEST 3
test_3

[('Do', 'we'),
 ('we', 'need'),
 ('need', 'another'),
 ('another', 'CxO'),
 ('CxO', 'RT'),
 ('RT', 'MENTION'),
 ('MENTION', 'Hail'),
 ('Hail', 'to'),
 ('to', 'the'),
 ('the', 'Chief'),
 ('Chief', 'Data'),
 ('Data', 'Officer'),
 ('Officer', 'HASHTAG'),
 ('HASHTAG', 'LINK'),
 ('LINK', 'via'),
 ('via', 'MENTION'),
 ('MENTION', 'HASHTAG'),
 ('HASHTAG', 'HASHTAG')]

In [27]:
## TEST 4
test_4

[('HASHTAG', 'HASHTAG'),
 ('LINK', 'HASHTAG'),
 ('HASHTAG', 'LINK'),
 ('a', 'HASHTAG'),
 ('MENTION', 'HASHTAG'),
 ('via', 'MENTION'),
 ('MENTION', 'LINK'),
 ('the', 'HASHTAG'),
 ('MENTION', 'MENTION'),
 ('HASHTAG', 'MENTION')]

In [28]:
## TEST 5
test_5

[('Data', 'Scientist'),
 ('data', 'scientist'),
 ('’', 's'),
 ('Big', 'Data'),
 ('Data', 'Scientists'),
 ('convoluted', 'world'),
 ('The', 'convoluted'),
 ('data', 'scientists'),
 ('data', 'science'),
 ('Data', 'Science')]

In [29]:
## TEST 6
test_6

[('A', 'trigram', 'is'),
 ('trigram', 'is', 'a'),
 ('is', 'a', 'three'),
 ('a', 'three', 'consecutive'),
 ('three', 'consecutive', 'written'),
 ('consecutive', 'written', 'words')]

In [30]:
## TEST 7
test_7

[('Do', 'we', 'need'),
 ('we', 'need', 'another'),
 ('need', 'another', 'CxO'),
 ('another', 'CxO', 'RT'),
 ('CxO', 'RT', 'MENTION'),
 ('RT', 'MENTION', 'Hail'),
 ('MENTION', 'Hail', 'to'),
 ('Hail', 'to', 'the'),
 ('to', 'the', 'Chief'),
 ('the', 'Chief', 'Data'),
 ('Chief', 'Data', 'Officer'),
 ('Data', 'Officer', 'HASHTAG'),
 ('Officer', 'HASHTAG', 'LINK'),
 ('HASHTAG', 'LINK', 'via'),
 ('LINK', 'via', 'MENTION'),
 ('via', 'MENTION', 'HASHTAG'),
 ('MENTION', 'HASHTAG', 'HASHTAG')]

In [31]:
## TEST 8
test_8

[('HASHTAG', 'HASHTAG', 'HASHTAG'),
 ('LINK', 'HASHTAG', 'HASHTAG'),
 ('HASHTAG', 'HASHTAG', 'LINK'),
 ('HASHTAG', 'LINK', 'HASHTAG'),
 ('a', 'HASHTAG', 'LINK'),
 ('LINK', 'via', 'MENTION'),
 ('MENTION', 'HASHTAG', 'HASHTAG'),
 ('via', 'MENTION', 'HASHTAG'),
 ('be', 'a', 'HASHTAG'),
 ('MENTION', 'LINK', 'HASHTAG')]

In [32]:
## TEST 9
test_9

[('The', 'convoluted', 'world'),
 ('data', 'scientist', 'A'),
 ('2015', "'s", 'hottest'),
 ("'s", 'hottest', 'profession'),
 ('data', 'science', 'teams'),
 ('scientist', 'A', 'guide'),
 ('Scientist', '’', 'In'),
 ('Social', 'Media', 'Scientist'),
 ('’', 's', '2014'),
 ('s', '2014', 'Top')]