> * Text analysis is an essential part of social network analysis.
> * Imagine you have a social media dataset with tweets. You want to analyze the text to find nodes and edges, as well as what the users are talking about.
> * You can use text analysis to find the most frequent words, hashtags, and mentions.

In [1]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.util import ngrams

> * We are going to use the `nltk` library to analyze text.
> * In `from nltk.tokenize import word_tokenize`, `from` names the package or module to look in (here `nltk.tokenize`).
> * `import` names the specific function, class, or submodule to bring in from it (here `word_tokenize`).
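
> * As a quick illustration of the two import forms, here is a small sketch. It uses the standard-library `collections` module instead of `nltk` (an assumption made only so that it runs without extra installs).

```python
# Form 1: import the whole module, then reach into it with a dot.
import collections

# Form 2: pull one name out of the module so it can be used directly.
from collections import Counter

counts_a = collections.Counter("banana")
counts_b = Counter("banana")

print(counts_a == counts_b)  # both forms reach the same Counter class
print(counts_b["a"])         # how many times 'a' appears in "banana"
```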

In [2]:
data=pd.read_csv('../week3/Political-media-DFE.csv', encoding='latin1')

> * Let's import the data from week 3.

In [3]:
data.columns

Index(['_unit_id', '_golden', '_unit_state', '_trusted_judgments',
       '_last_judgment_at', 'audience', 'audience:confidence', 'bias',
       'bias:confidence', 'message', 'message:confidence', 'orig__golden',
       'audience_gold', 'bias_gold', 'bioid', 'embed', 'id', 'label',
       'message_gold', 'source', 'text'],
      dtype='object')

In [4]:
data.dtypes

_unit_id                 int64
_golden                   bool
_unit_state             object
_trusted_judgments       int64
_last_judgment_at       object
audience                object
audience:confidence    float64
bias                    object
bias:confidence        float64
message                 object
message:confidence     float64
orig__golden           float64
audience_gold          float64
bias_gold              float64
bioid                   object
embed                   object
id                      object
label                   object
message_gold           float64
source                  object
text                    object
dtype: object

> * `.dtypes` is used to check the data type of each column.

> * Let's subset the data to have who posted, where they posted (social media platform), and what they posted.

In [21]:
content=data[['label', 'source', 'text']].copy() # .copy() makes an independent DataFrame, so adding columns later is safe

In [6]:
content

Unnamed: 0,label,source,text
0,From: Trey Radel (Representative from Florida),twitter,RT @nowthisnews: Rep. Trey Radel (R- #FL) slam...
1,From: Mitch McConnell (Senator from Kentucky),twitter,VIDEO - #Obamacare: Full of Higher Costs and ...
2,From: Kurt Schrader (Representative from Oregon),twitter,Please join me today in remembering our fallen...
3,From: Michael Crapo (Senator from Idaho),twitter,RT @SenatorLeahy: 1st step toward Senate debat...
4,From: Mark Udall (Senator from Colorado),twitter,.@amazon delivery #drones show need to update ...
...,...,...,...
4995,From: Ted Yoho (Representative from Florida),facebook,I applaud Governor PerryÛªs recent decision t...
4996,From: Ted Yoho (Representative from Florida),facebook,"Today, I voted in favor of H.R. 5016 - Financi..."
4997,From: Ted Yoho (Representative from Florida),facebook,(Taken from posted WOKV interview) Congressm...
4998,From: Ted Yoho (Representative from Florida),facebook,Join me next week for a town hall in Ocala! I'...


> * We have content from both Twitter and Facebook, with the body of each post in the `text` column.

In [7]:
content['source'].unique()

array(['twitter', 'facebook'], dtype=object)

> * Let's print out some of the contents.

In [8]:
content['text'].iloc[0]

'RT @nowthisnews: Rep. Trey Radel (R- #FL) slams #Obamacare. #politics https://t.co/zvywMG8yIH'

In [9]:
content['text'].iloc[1]

'VIDEO - #Obamacare:  Full of Higher Costs and Broken Promises: http://t.co/dn3vzqIrWF'

In [20]:
content['text'].iloc[2]

'Please join me today in remembering our fallen heroes and honoring the men and women currently in military service for their sacrifices.'

> * Q. What additional type of text do you see in the contents?

```YOUR ANSWER HERE```

> * Yes, we see hashtags, retweets, mentions, and URLs.
> * The third row is an example of a post that contains only plain text.

> * In text analysis, it is important to make the text clean (remove unnecessary words, symbols, etc.) and to make the text uniform (lowercase, no punctuation, etc.).

> * Let's lowercase all the text.

In [23]:
pd.options.mode.chained_assignment = None  # default='warn'

content['text-lower']=content['text'].str.lower()

In [24]:
content['text-lower'].iloc[2]

'please join me today in remembering our fallen heroes and honoring the men and women currently in military service for their sacrifices.'

> * We can separate each post into tokens (words, hashtags, mentions, etc.).
> * Separating text into tokens is called tokenization.
> * We can use the `word_tokenize` function from the `nltk` library to tokenize the contents.
> * There is also a `TweetTokenizer` class in the `nltk` library that is designed specifically for tweets.
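
> * To see why a tweet-aware tokenizer matters, here is a rough sketch with two hand-written regular expressions. Both patterns are simplified stand-ins invented for illustration; they are not what `word_tokenize` or `TweetTokenizer` do internally.

```python
import re

text = "rt @nowthisnews: rep. trey radel slams #obamacare"

# Generic tokenizer: words and single punctuation marks become separate tokens,
# so '@' and '#' are split off from the names that follow them.
generic = re.findall(r"\w+|[^\w\s]", text)

# Tweet-aware tokenizer: an optional leading '@' or '#' stays attached to the word.
tweet_aware = re.findall(r"[@#]?\w+|[^\w\s]", text)

print(generic)
print(tweet_aware)
```

> * Keeping `@nowthisnews` and `#obamacare` as single tokens is what makes mentions and hashtags countable later on.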

> * `.apply` applies a function to every value in a column, so you don't have to write a for loop over the rows.

> * There are two ways to tokenize the contents. One is to call `apply()` on the lowercased text column.
> * Called on a single column (a Series), `apply()` runs the function on each value; called on a DataFrame, it runs the function along an axis.
> * The other way is to iterate through the rows and tokenize each lowercased text.

In [25]:
content['tokenized_unigrams']=content['text-lower'].apply(word_tokenize)

In [26]:
iterated_unigrams=[]
for idx, row in content.iterrows():
    iterated_unigrams.append(word_tokenize(row['text-lower']))

In [27]:
content['iterated_unigrams']=iterated_unigrams 

> * Iterating through each row and calling `apply()` with `word_tokenize` both produce one list of tokens per row, and the results are identical. Let's check the first row.

In [28]:
content.loc[0,'iterated_unigrams'] == content.loc[0,'tokenized_unigrams']

True

> * One token is called a unigram.
> * We can try to find bigrams (two tokens) and trigrams (three tokens) as well. 
> * All of these tokens are called n-grams.
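
> * A minimal sketch of how n-grams are built from a list of tokens. `simple_ngrams` is a hypothetical helper written for this example; it is meant to produce the same kind of tuples as `nltk.util.ngrams`, not to reproduce NLTK's implementation.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n shifted copies of the list: tokens[0:], tokens[1:], ...
    # zip stops at the shortest copy, so incomplete runs at the end are dropped.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["please", "join", "me", "today"]

unigrams = simple_ngrams(tokens, 1)
bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```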


In [29]:
#bigrams lambda function
content['tokenized_bigrams']=content['text-lower'].apply(TweetTokenizer().tokenize).apply(lambda x: list(ngrams(x, 2)))
#It first applies the 'tokenize' method of the TweetTokenizer class to the 'text-lower' column,
#and then applies the lambda function to each resulting list of tokens to build its bigrams.

> * A lambda function is created with the `lambda` keyword, followed by the input variable(s), a colon, and a single expression.
> * A lambda is a small, anonymous, inline function.
> * `lambda arguments : expression`
> * e.g., `lambda x: x+1`

In [30]:
#practicing lambda functions
def square(x):
    return x**2

In [31]:
#equivalent lambda function
square_lambda = lambda x: x**2

In [32]:
function_result = square(5)
lambda_result = square_lambda(5)

In [33]:
print(function_result)
print(lambda_result)

25
25


> * The result of the lambda function and regular function is the same.

In [34]:
#bigram iteration
iterated_bigrams=[]
for idx, row in content.iterrows():
    iterated_bigrams.append(list(ngrams(row['tokenized_unigrams'], 2)))

In [35]:
content['iterated_bigrams']=iterated_bigrams

> * We import `Counter` from Python's built-in `collections` module to count the frequency of the tokens.

In [36]:
from collections import Counter

> * Let's see how bigrams and unigrams are saved in each column.

In [37]:
content['tokenized_bigrams']

0       [(rt, @nowthisnews), (@nowthisnews, :), (:, re...
1       [(video, -), (-, #obamacare), (#obamacare, :),...
2       [(please, join), (join, me), (me, today), (tod...
3       [(rt, @senatorleahy), (@senatorleahy, :), (:, ...
4       [(., @amazon), (@amazon, delivery), (delivery,...
                              ...                        
4995    [(i, applaud), (applaud, governor), (governor,...
4996    [(today, ,), (,, i), (i, voted), (voted, in), ...
4997    [((, taken), (taken, from), (from, posted), (p...
4998    [(join, me), (me, next), (next, week), (week, ...
4999    [(foreign, affairs), (affairs, committee), (co...
Name: tokenized_bigrams, Length: 5000, dtype: object

In [38]:
content['tokenized_unigrams']

0       [rt, @, nowthisnews, :, rep., trey, radel, (, ...
1       [video, -, #, obamacare, :, full, of, higher, ...
2       [please, join, me, today, in, remembering, our...
3       [rt, @, senatorleahy, :, 1st, step, toward, se...
4       [., @, amazon, delivery, #, drones, show, need...
                              ...                        
4995    [i, applaud, governor, perryûªs, recent, deci...
4996    [today, ,, i, voted, in, favor, of, h.r, ., 50...
4997    [(, taken, from, posted, wokv, interview, ), c...
4998    [join, me, next, week, for, a, town, hall, in,...
4999    [foreign, affairs, committee, hearing, on, syr...
Name: tokenized_unigrams, Length: 5000, dtype: object

> * Let's count the most frequent unigrams.

In [39]:
Counter([item for row in content['tokenized_unigrams'] for item in row]).most_common(10)

[('the', 7668),
 ('.', 6857),
 ('to', 5913),
 (',', 4927),
 (':', 3763),
 ('and', 3678),
 ('of', 3371),
 ('in', 2892),
 ('#', 2518),
 ('a', 2508)]
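
> * The one-line count above does two things: it flattens a column of token lists into one long list, then counts it. A small sketch of the same idea on made-up data (the `rows` list is a toy stand-in for the DataFrame column):

```python
from collections import Counter

# Toy stand-in for a column of token lists (one list per post).
rows = [
    ["the", "house", "passed", "the", "bill"],
    ["the", "senate", "debated", "the", "bill"],
]

# Step 1: flatten the list of lists into one long list of tokens.
all_tokens = [token for row in rows for token in row]

# Step 2: count every token and ask for the most frequent ones.
counts = Counter(all_tokens)
print(counts.most_common(2))
```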

> * Let's count the most frequent bigrams.

In [40]:
Counter([item for row in content['tokenized_bigrams'] for item in row]).most_common(10)

[(('of', 'the'), 766),
 (('\x89', 'ûªs'), 684),
 (('in', 'the'), 547),
 ((',', 'and'), 547),
 (('on', 'the'), 416),
 (('to', 'the'), 398),
 (('.', 'i'), 328),
 (('for', 'the'), 307),
 (('\x89', 'û'), 305),
 (('at', 'the'), 305)]

> * However, we see frequent words include function words (e.g., the, and, is, etc.) and punctuation.
> * We can remove function words and punctuation to find the content words (e.g., nouns, verbs, adjectives, etc.).

In [41]:
content.head(2)

Unnamed: 0,label,source,text,text-lower,tokenized_unigrams,iterated_unigrams,tokenized_bigrams,iterated_bigrams
0,From: Trey Radel (Representative from Florida),twitter,RT @nowthisnews: Rep. Trey Radel (R- #FL) slam...,rt @nowthisnews: rep. trey radel (r- #fl) slam...,"[rt, @, nowthisnews, :, rep., trey, radel, (, ...","[rt, @, nowthisnews, :, rep., trey, radel, (, ...","[(rt, @nowthisnews), (@nowthisnews, :), (:, re...","[(rt, @), (@, nowthisnews), (nowthisnews, :), ..."
1,From: Mitch McConnell (Senator from Kentucky),twitter,VIDEO - #Obamacare: Full of Higher Costs and ...,video - #obamacare: full of higher costs and ...,"[video, -, #, obamacare, :, full, of, higher, ...","[video, -, #, obamacare, :, full, of, higher, ...","[(video, -), (-, #obamacare), (#obamacare, :),...","[(video, -), (-, #), (#, obamacare), (obamacar..."


> * The `nltk` library provides a list of stopwords (function words) that we can remove from the contents.

In [42]:
stop=stopwords.words('english')

In [43]:
stop[1:10] #slice from index 1 up to (not including) index 10: nine stopwords, skipping the one at index 0

['me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [44]:
stop[-10:] #use negative index to slice the last 10 stopwords

['shouldn',
 "shouldn't",
 'wasn',
 "wasn't",
 'weren',
 "weren't",
 'won',
 "won't",
 'wouldn',
 "wouldn't"]

> * We want to remove the stopwords from the `text-lower` column.

In [45]:
content['stopword']=content['text-lower'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
#The lambda function takes each row of the 'text-lower' column, splits it into a list of words, 
#and then joins the words back together into a string, excluding any words that are in the 'stop' list.
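
> * One thing to keep in mind with this approach: `split()` only splits on whitespace, so a word with punctuation attached is not recognized as a stopword. A short sketch with a tiny made-up stopword set (`toy_stop` is an assumption for illustration; NLTK's list is much longer):

```python
# A tiny, made-up stopword set for illustration only.
toy_stop = {"i", "in", "of", "me", "the", "today"}

text = "today, i voted in favor of the bill today"

# str.split() keeps the comma attached to the first "today",
# so "today," is not equal to the stopword "today" and survives the filter.
kept = [word for word in text.split() if word not in toy_stop]
result = " ".join(kept)

print(result)
```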
    

> * Let's compare the result of removing stopwords and the original text-lower column

In [46]:
content['stopword']

0       rt @nowthisnews: rep. trey radel (r- #fl) slam...
1       video - #obamacare: full higher costs broken p...
2       please join today remembering fallen heroes ho...
3       rt @senatorleahy: 1st step toward senate debat...
4       .@amazon delivery #drones show need update law...
                              ...                        
4995    applaud governor perryûªs recent decision dep...
4996    today, voted favor h.r. 5016 - financial servi...
4997    (taken posted wokv interview) congressman yoho...
4998    join next week town hall ocala! i'll answer qu...
4999    foreign affairs committee hearing syria. remai...
Name: stopword, Length: 5000, dtype: object

In [47]:
content['text-lower']

0       rt @nowthisnews: rep. trey radel (r- #fl) slam...
1       video - #obamacare:  full of higher costs and ...
2       please join me today in remembering our fallen...
3       rt @senatorleahy: 1st step toward senate debat...
4       .@amazon delivery #drones show need to update ...
                              ...                        
4995    i applaud governor perryûªs recent decision t...
4996    today, i voted in favor of h.r. 5016 - financi...
4997    (taken from posted wokv interview)   congressm...
4998    join me next week for a town hall in ocala! i'...
4999    foreign affairs committee hearing on syria. i ...
Name: text-lower, Length: 5000, dtype: object

> * We'll get unigrams for the text after we remove function words.

In [48]:
content['stop_tokenized_unigrams']=content['stopword'].apply(word_tokenize)

> * We'll get bigrams for the text after we remove function words.

In [49]:
content['stop_tokenized_bigrams']=content['stopword'].apply(TweetTokenizer().tokenize).apply(lambda x: list(ngrams(x, 2)))

> * Here are the most frequent unigrams and bigrams after removing function words.

In [50]:
Counter([item for row in content['stop_tokenized_unigrams'] for item in row]).most_common(10)

[('.', 6863),
 (',', 4927),
 (':', 3763),
 ('#', 2518),
 ('http', 2162),
 ('@', 1877),
 ('!', 995),
 ("'s", 812),
 ('today', 784),
 ('&', 589)]

In [51]:
Counter([item for row in content['stop_tokenized_bigrams'] for item in row]).most_common(10)

[(('\x89', 'ûªs'), 684),
 (('\x89', 'û'), 305),
 (('û', '\x9d'), 249),
 (('.', '\x89'), 198),
 (('s', '.'), 193),
 (('u', '.'), 192),
 (('.', 's'), 190),
 (('here', ':'), 183),
 (('.', '"'), 154),
 (('\x89', 'ûò'), 149)]

> * We still see irrelevant punctuation. Let's get rid of it.


In [52]:
content['punct_tokenized_unigrams']=content['stop_tokenized_unigrams'].apply(lambda x: [word for word in x if word.isalnum()])
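
> * `str.isalnum()` keeps a token only if every character in it is a letter or a digit. That removes punctuation marks, but it also removes tokens with punctuation inside them. A small sketch on hand-picked tokens:

```python
tokens = ["h.r", "5016", "'s", "1st", "#", "house", "u.s."]

# Show the isalnum() verdict for each token.
for token in tokens:
    print(repr(token), token.isalnum())

# The same filter as in the cell above, applied to one list.
kept = [token for token in tokens if token.isalnum()]
print(kept)
```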

In [53]:
punct_iterated_bigrams=[]
for idx, row in content.iterrows():
    punct_iterated_bigrams.append(list(ngrams(row['punct_tokenized_unigrams'], 2)))

In [54]:
content['punct_tokenized_bigrams']=punct_iterated_bigrams

> * Let's count the most frequent unigrams and bigrams after removing punctuation.

In [55]:
Counter([item for row in content['punct_tokenized_unigrams'] for item in row]).most_common(10)

[('http', 2162),
 ('today', 784),
 ('house', 435),
 ('amp', 431),
 ('great', 396),
 ('new', 361),
 ('bill', 324),
 ('president', 317),
 ('act', 294),
 ('congress', 289)]

In [56]:
Counter([item for row in content['punct_tokenized_bigrams'] for item in row]).most_common(10)

[(('here', 'http'), 159),
 (('health', 'care'), 97),
 (('president', 'obama'), 83),
 (('united', 'states'), 75),
 (('town', 'hall'), 67),
 (('high', 'school'), 66),
 (('immigration', 'reform'), 61),
 (('small', 'business'), 47),
 (('last', 'night'), 46),
 (('make', 'sure'), 46)]

> * But we want to do additional cleaning. 
> * When counting the most frequent words, different inflected forms of the same word are counted as different words.
> * For example, "run" and "running" are counted as different words.
> * We can use lemmatization to convert words to their base form.

In [57]:
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer() #Initialize lemmatizer
from nltk.corpus import wordnet

In [58]:
content['lemma']=content['punct_tokenized_unigrams'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

In [59]:
content['lemma_str']=content['lemma'].apply(lambda x: ' '.join(x))

> * Let's compare the lemmatized text with the original lowercased text.

In [60]:
content.loc[5, 'lemma_str']

'called usdotfra release info inspection casseltonderailment review quality rail'

In [61]:
content.loc[5, 'text-lower']

'called on the @usdotfra to release info about inspections before the #casseltonderailment to review quality of rails. (1/2)'

> * Q. Do you see any different results? What tokens have changed after lemmatization?

The punctuation, symbols, and stopwords have been removed so that the lemmatized string is limited to relevant words

In [62]:
lemmatizer.lemmatize('better')

'better'

In [63]:
lemmatizer.lemmatize('better', pos=wordnet.ADJ)

'good'

In [64]:
lemmatizer.lemmatize('cars')

'car'

In [65]:
lemmatizer.lemmatize('cars', pos=wordnet.VERB)

'cars'

> * NLTK's `WordNetLemmatizer` is not perfect.
> * By default, it treats every word as a noun, so verbs and adjectives such as "called" or "better" are left unchanged.
> * Therefore, we need to specify the part of speech (POS) for each token.

In [66]:
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'): #ADJECTIVE
        return wordnet.ADJ
    elif nltk_tag.startswith('V'): #VERB
        return wordnet.VERB
    elif nltk_tag.startswith('N'): #NOUN        
        return wordnet.NOUN
    elif nltk_tag.startswith('R'): #ADVERB
        return wordnet.ADV
    else:          
        return None
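
> * The function above looks only at the first letter of each Penn Treebank tag. Here is a compact sketch of the same mapping that runs without downloading WordNet. The strings `'a'`, `'v'`, `'n'`, and `'r'` are written out by hand as stand-ins for `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV`, and `penn_to_wordnet` is a hypothetical name.

```python
# Plain-string stand-ins for wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV.
PREFIX_TO_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unmapped."""
    # tag[:1] is the first letter (or '' for an empty tag).
    return PREFIX_TO_POS.get(tag[:1])

# JJR = comparative adjective, VBD = past-tense verb, NNS = plural noun,
# RB = adverb, IN = preposition (no WordNet equivalent).
for tag in ["JJR", "VBD", "NNS", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```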

In [67]:
def lemmatize_sentence(sentence):
    # Tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    # Tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged) 
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            # If no tag was found, then use the word as is
            lemmatized_sentence.append(word)
        else:        
            # Else use the tag to lemmatize the word
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

In [68]:
content['lemmatizer_str']=content['lemma'].apply(lambda x: lemmatize_sentence(' '.join(x)))

In [69]:
content.loc[5, 'lemma_str']

'called usdotfra release info inspection casseltonderailment review quality rail'

In [70]:
content.loc[5, 'lemmatizer_str']

'call usdotfra release info inspection casseltonderailment review quality rail'

> * Let's extract hashtags and mentions from the contents.
> * We can use regular expressions (regex) to extract hashtags and mentions.

In [71]:
import re

> * There are three useful regex functions in Python: `findall`, `search`, and `match`.

> * `findall` is used to find all matches of a pattern in a string.
> * `search` is used to find the first match of a pattern in a string.
> * `match` is used to match a pattern at the beginning of a string.

> * In this class, we are mostly going to use `findall` to extract hashtags and mentions.

> * The result of `findall` is a list of strings.
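
> * A side-by-side sketch of the three functions on one made-up sentence, to show what each one returns:

```python
import re

text = "3 cats and 12 dogs"
pattern = re.compile(r"\d+")

all_numbers = pattern.findall(text)   # every match, as a list of strings
first = pattern.search(text)          # first match anywhere, as a match object
at_start = pattern.match(text)        # match object only if the string starts with a match
not_at_start = pattern.match("cats: 3")  # None, because this string starts with a letter

print(all_numbers)
print(first.group())
print(at_start.group())
print(not_at_start)
```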

> * Basic regex patterns:
> * `.` matches any character except a newline.
> * `*` matches 0 or more repetitions of the preceding regex pattern.
> * `+` matches 1 or more repetitions of the preceding regex pattern.
> * `?` matches 0 or 1 repetition of the preceding regex pattern.
> * `^` matches the start of a string.
> * `$` matches the end of a string.
> * `[]` matches any one of the characters inside the square brackets.
> * `\` is used to escape special characters.
> * `|` is used to match either the regex pattern on the left or the right.

> * `[a-z]` matches any lowercase letter.
> * `[A-Z]` matches any uppercase letter.
> * `[0-9]` matches any digit.
> * `\d` matches any digit.
> * `\D` matches any non-digit.
> * `\w` matches any word character (alphanumeric and underscore).
> * `\W` matches any non-word character.
> * `\s` matches any whitespace character.
> * `\S` matches any non-whitespace character.

> * `[a-zA-Z]` matches any alphabet character.
> * `[a-zA-Z0-9]` matches any alphanumeric character.
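
> * A few of the patterns listed above, tried on short made-up strings:

```python
import re

optional_u = re.findall(r"colou?r", "color colour")       # ? makes the 'u' optional
digit_runs = re.findall(r"[0-9]+", "room 324, floor 3")   # + means one or more digits
word_chars = re.findall(r"\w+", "is_324 rocks!")          # \w covers letters, digits, underscore
either = re.findall(r"cat|dog", "cat, dog, bird")         # | means either alternative
first_word = re.findall(r"^\S+", "hello world")           # ^ anchors at the start of the string
literal_dots = re.findall(r"\.", "a.b.c")                 # \. is a literal dot, not "any character"

for result in (optional_u, digit_runs, word_chars, either, first_word, literal_dots):
    print(result)
```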


<img src="../week4/Regex-Cheat-Sheet.png" width=500px height=800px />

> * Let's practice regex with the `findall` function.
> * `findall` returns all non-overlapping matches of the pattern in the string, as a list of strings or tuples. The string is scanned left-to-right, and matches are returned in the order found.

In [72]:
pattern = re.compile(r'\d+') # \d+ is the pattern to search for, it means any digit 0-9 and the + means one or more times
sample1='I have 2 dogs and 3 cats'
result=re.findall(pattern, sample1)
print(result)

['2', '3']


In [73]:
pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}') 
#[a-zA-Z0-9._%+-]+ is the pattern to search for, it means any letter a-z or A-Z, 
#any digit 0-9, and the characters ._%+-, and the + means one or more times
#@ is the pattern to search for, it means the character @
#[a-zA-Z0-9.-]+ is the pattern to search for, it means any letter a-z or A-Z, 
#any digit 0-9, and the characters .-, and the + means one or more times
#\ is the escape character, it is used to escape the . character, so it is not interpreted as a special character
#[a-zA-Z]{2,} is the pattern to search for, it means any letter a-z or A-Z, and the {2,} means two or more times
sample2='''My primary email is is324socialnetworkanalysis@illinois.edu. Please contact me! 
If I am not replying, you can contact me at jaihyunpark@illinois.edu'''
result=re.findall(pattern, sample2)
print(result)

['is324socialnetworkanalysis@illinois.edu', 'jaihyunpark@illinois.edu']


In [74]:
pattern = re.compile(r'https?://\S+')
#http is the pattern to search for, it means the characters http
#s? is the pattern to search for, it means the character s, and the ? means zero or one time
#:// is the pattern to search for, it means the characters ://
#\S+ is the pattern to search for, it means any character that is not a white space, and the + means one or more times
sample3='Look at what is happening in Washington! https://is324.com'
result=re.findall(pattern, sample3)
print(result)

['https://is324.com']


In [75]:
pattern = re.compile(r'#[a-zA-Z0-9]+')
# '#' is the pattern to search for, it means the literal character #
#[a-zA-Z0-9]+ is the pattern to search for, it means any letter a-z or A-Z, 
#any digit 0-9, and the + means one or more times
sample4='I am so excited for the class #IS324'
result=re.findall(pattern, sample4)
print(result)

['#IS324']


In [76]:
pattern = re.compile(r'@[a-zA-Z0-9]+')
#@ is the pattern to search for, it means the character @
#[a-zA-Z0-9]+ is the pattern to search for, it means any letter a-z or A-Z, 
#any digit 0-9, and the + means one or more times
sample5='@jaihyunpark is the instructor for #IS324'
result=re.findall(pattern, sample5)
print(result)

['@jaihyunpark']


In [77]:
pattern = re.compile(r'(rt\s+@[a-zA-Z0-9]+|@[a-zA-Z0-9]+)')
#The pattern is composed of two subpatterns, separated by the | character, 
#which means "or" and grouped by parentheses.
#There are no spaces around |, because a space inside a pattern matches a literal space.
#rt is the pattern to search for, it means the characters rt
#\s+ is the pattern to search for, it means any white space, and the + means one or more times
#@ is the pattern to search for, it means the character @
#[a-zA-Z0-9]+ is the pattern to search for, it means any letter a-z or A-Z, 
#any digit 0-9, and the + means one or more times

#Another subpattern is looking for mentions without the rt prefix
#@ is the pattern to search for, it means the character @
#[a-zA-Z0-9]+ is the pattern to search for, it means any letter a-z or A-Z, 
#any digit 0-9, and the + means one or more times
sample6='rt @jaihyunpark: I am so excited for the class #IS324 @socialmedia'
result=re.findall(pattern, sample6)
print(result)

['rt @jaihyunpark', '@socialmedia']


> * Let's practice regex with the `search` function.
> * Scan through the string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object.

In [78]:
pattern = re.compile(r'\d+') # \d+ is the pattern to search for, it means 
#any digit 0-9 and the + means one or more times
sample1='I have 2 dogs and 3 cats'
result=re.search(pattern, sample1) #search method returns a match object for the first match
print(result)
# span=(7, 8) means that the match starts at index 7 and ends just before index 8 (indices start at 0)

<re.Match object; span=(7, 8), match='2'>


In [79]:
print(sample1[result.span()[0]:result.span()[1]])

2
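
> * Slicing with `span()` works, but a match object can also hand back the matched text directly. A short sketch using the match object's `group()`, `start()`, and `end()` methods:

```python
import re

sample = "I have 2 dogs and 3 cats"
m = re.search(r"\d+", sample)

# search returns None when nothing matches, so check before using the result.
if m is not None:
    print(m.group())           # the matched text
    print(m.start(), m.end())  # the same two numbers that m.span() returns
```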


In [80]:
pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}') 
#Matches one or more alphanumeric characters, dots, underscores, percent signs, plus signs, or hyphens. 
#The @ symbol is also matched.
sample2='''My primary email is is324socialnetworkanalysis@illinois.edu. 
Please contact me! If I am not replying, you can contact me at jaihyunpark@illinois.edu'''
result=re.search(pattern, sample2)
print(result)

<re.Match object; span=(20, 59), match='is324socialnetworkanalysis@illinois.edu'>


In [81]:
print(sample2[result.span()[0]:result.span()[1]])

is324socialnetworkanalysis@illinois.edu


In [82]:
pattern = re.compile(r'https?://\S+')
#Matches http or https, followed by ://, followed by any non-whitespace characters.
sample3='Look at what is happening in Washington! https://is324.com'
result=re.search(pattern, sample3)
print(result)

<re.Match object; span=(41, 58), match='https://is324.com'>


In [83]:
print(sample3[result.span()[0]:result.span()[1]])

https://is324.com


In [84]:
pattern = re.compile(r'#[a-zA-Z0-9]+')
#Matches hashtags that start with #, followed by one or more alphanumeric characters.
sample4='I am so excited for the class #IS324'
result=re.search(pattern, sample4)
print(result)

<re.Match object; span=(30, 36), match='#IS324'>


In [85]:
print(sample4[result.span()[0]:result.span()[1]])

#IS324


In [86]:
pattern = re.compile(r'@[a-zA-Z0-9]+')
#Matches mentions that start with @, followed by one or more alphanumeric characters.
sample5='@jaihyunpark is the instructor for #IS324'
result=re.search(pattern, sample5) #search method returns a match object for the first match
print(result)

<re.Match object; span=(0, 12), match='@jaihyunpark'>


In [87]:
print(sample5[result.span()[0]:result.span()[1]])

@jaihyunpark


In [88]:
pattern = re.compile(r'(rt\s+@[a-zA-Z0-9]+|@[a-zA-Z0-9]+)')
#Matches retweets that start with rt, followed by one or more white spaces, 
#followed by @, followed by one or more alphanumeric characters,
#or a mention on its own. There are no spaces around |, because a space would match a literal space.
sample6='rt @jaihyunpark: I am so excited for the class #IS324 @socialmedia'
result=re.search(pattern, sample6)
print(result)

<re.Match object; span=(0, 15), match='rt @jaihyunpark'>


In [89]:
print(sample6[result.span()[0]:result.span()[1]])

rt @jaihyunpark


> * Let's practice regex with the `match` function.
> * If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object.

In [90]:
pattern = re.compile(r'\d+') # \d+ is the pattern to search for, 
#it means any digit 0-9 and the + means one or more times
sample1='I have 2 dogs and 3 cats'
result=re.match(pattern, sample1) #check if the pattern is at the beginning of the string
print(result)

None
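
> * Because `match` only looks at the beginning of the string, it is handy for questions like "does this post start as a retweet?". A minimal sketch on three made-up posts (the pattern and the `posts` list are assumptions for illustration):

```python
import re

# 'rt', then whitespace, then a mention, all at the very start of the post.
retweet_start = re.compile(r"rt\s+@\w+")

posts = [
    "rt @nowthisnews: rep. trey radel slams #obamacare",
    "thanks for the rt @amazon",
    "please join me today",
]

# match returns a match object or None; compare against None to get True/False.
is_retweet = [retweet_start.match(post) is not None for post in posts]
print(is_retweet)
```

> * The second post contains "rt @amazon", but not at the beginning, so `match` does not find it; `search` would.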


> * Let's extract hashtags and mentions from the content DataFrame.

> * First, we are extracting mentions.

In [91]:
pattern = re.compile(r'@[a-zA-Z0-9]+')
content['mentions']=content['text-lower'].apply(lambda x: re.findall(pattern, x))

> * Let's check if mention is extracted correctly.

In [92]:
content['text-lower'].iloc[0]

'rt @nowthisnews: rep. trey radel (r- #fl) slams #obamacare. #politics https://t.co/zvywmg8yih'

In [93]:
content['mentions'].iloc[0]

['@nowthisnews']

In [94]:
content['text-lower'].iloc[1]

'video - #obamacare:  full of higher costs and broken promises: http://t.co/dn3vzqirwf'

In [95]:
content['mentions'].iloc[1]

[]

> * Let's extract hashtags.

In [96]:
pattern = re.compile(r'#([a-zA-Z0-9]+)')
content['hashtags']=content['text-lower'].apply(lambda x: re.findall(pattern, x))

In [97]:
content['text-lower'].iloc[0]

'rt @nowthisnews: rep. trey radel (r- #fl) slams #obamacare. #politics https://t.co/zvywmg8yih'

In [98]:
content['hashtags'].iloc[0]

['fl', 'obamacare', 'politics']

In [99]:
content['text-lower'].iloc[1]

'video - #obamacare:  full of higher costs and broken promises: http://t.co/dn3vzqirwf'

In [100]:
content['hashtags'].iloc[1]

['obamacare']
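
> * The hashtag pattern wraps the name in parentheses, and that changes what `findall` returns: with one capturing group, it returns only the text inside the group. A small sketch comparing the two forms on one string:

```python
import re

text = "rep. trey radel (r- #fl) slams #obamacare. #politics"

# No group: findall returns the whole match, including the '#'.
with_symbol = re.findall(r"#[a-z0-9]+", text)

# One capturing group: findall returns only what the parentheses captured.
without_symbol = re.findall(r"#([a-z0-9]+)", text)

print(with_symbol)
print(without_symbol)
```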

> * We can also extract URLs (both `http` and `https` links).

In [101]:
pattern=re.compile(r'https?://\S+')
content['http']=content['text-lower'].apply(lambda x: re.findall(pattern, x))

In [102]:
content['text-lower'].iloc[0]

'rt @nowthisnews: rep. trey radel (r- #fl) slams #obamacare. #politics https://t.co/zvywmg8yih'

In [103]:
content['http'].iloc[0]

['https://t.co/zvywmg8yih']

>* A Pandas Series also has a `str` accessor that applies string methods to every value.
>* Useful methods include `str.contains`, `str.replace`, and `str.findall`.

> * `str.contains` is used to check if a pattern is contained in each string of the Series.

In [104]:
content['text-lower'].str.contains('http')

0        True
1        True
2       False
3       False
4        True
        ...  
4995    False
4996    False
4997     True
4998    False
4999    False
Name: text-lower, Length: 5000, dtype: bool

> * How many posts contain a URL? The first number in `.shape` is the number of matching rows.

In [105]:
content[content['text-lower'].str.contains('http')].shape

(2171, 19)

>* `str.replace` is used to replace a pattern with another pattern in each string of the Series.

In [106]:
content['no-url']=content['text-lower'].str.replace(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', regex=True)

In [107]:
content['text-lower'].iloc[0]

'rt @nowthisnews: rep. trey radel (r- #fl) slams #obamacare. #politics https://t.co/zvywmg8yih'

In [108]:
content['no-url'].iloc[0]

'rt @nowthisnews: rep. trey radel (r- #fl) slams #obamacare. #politics '
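
> * Outside of Pandas, the same kind of replacement can be done on a single string with `re.sub`. This sketch assumes the shorter URL pattern `https?://\S+` used earlier in this notebook rather than the longer one above, and adds a `strip()` to drop the trailing space that is left behind.

```python
import re

text = "slams #obamacare. #politics https://t.co/zvywmg8yih"
url_pattern = re.compile(r"https?://\S+")

# Replace every URL with an empty string.
no_url = url_pattern.sub("", text)

# Removing the URL leaves the space that came before it; strip() trims it.
cleaned = no_url.strip()

print(repr(no_url))
print(repr(cleaned))
```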

>* `str.findall` is used to find all matches of a pattern in each string of the Series.

In [109]:
content['mentions-str']=content['text-lower'].str.findall(r'@([a-zA-Z0-9]+)')

> * Pandas `str.findall` works like the regular expression `findall` function. The two results below differ only because this pattern puts the name in a capturing group, so the `@` is left out.

In [110]:
content['mentions-str'].iloc[2277]

['huckabeeshow',
 'foxnews',
 'huckabeeshow',
 'grahamblog',
 'replankford',
 'kimguilfoyle',
 'markgeragos',
 'gopblackchick',
 'rcamposduffy']

In [111]:
content['mentions'].iloc[2277]

['@huckabeeshow',
 '@foxnews',
 '@huckabeeshow',
 '@grahamblog',
 '@replankford',
 '@kimguilfoyle',
 '@markgeragos',
 '@gopblackchick',
 '@rcamposduffy']

Practice

**Let's rename the column name for 'label' into 'from'**

In [117]:
#YOUR CODE HERE
#rename returns a new DataFrame; assign it back (content = content.rename(...)) to keep the change
content.rename(columns={"label" : "from"})

Unnamed: 0,from,source,text,text-lower,tokenized_unigrams,iterated_unigrams,tokenized_bigrams,iterated_bigrams,stopword,stop_tokenized_unigrams,...,punct_tokenized_unigrams,punct_tokenized_bigrams,lemma,lemma_str,lemmatizer_str,mentions,hashtags,http,no-url,mentions-str
0,From: Trey Radel (Representative from Florida),twitter,RT @nowthisnews: Rep. Trey Radel (R- #FL) slam...,rt @nowthisnews: rep. trey radel (r- #fl) slam...,"[rt, @, nowthisnews, :, rep., trey, radel, (, ...","[rt, @, nowthisnews, :, rep., trey, radel, (, ...","[(rt, @nowthisnews), (@nowthisnews, :), (:, re...","[(rt, @), (@, nowthisnews), (nowthisnews, :), ...",rt @nowthisnews: rep. trey radel (r- #fl) slam...,"[rt, @, nowthisnews, :, rep., trey, radel, (, ...",...,"[rt, nowthisnews, trey, radel, fl, slams, obam...","[(rt, nowthisnews), (nowthisnews, trey), (trey...","[rt, nowthisnews, trey, radel, fl, slam, obama...",rt nowthisnews trey radel fl slam obamacare po...,rt nowthisnews trey radel fl slam obamacare po...,[@nowthisnews],"[fl, obamacare, politics]",[https://t.co/zvywmg8yih],rt @nowthisnews: rep. trey radel (r- #fl) slam...,[nowthisnews]
1,From: Mitch McConnell (Senator from Kentucky),twitter,VIDEO - #Obamacare: Full of Higher Costs and ...,video - #obamacare: full of higher costs and ...,"[video, -, #, obamacare, :, full, of, higher, ...","[video, -, #, obamacare, :, full, of, higher, ...","[(video, -), (-, #obamacare), (#obamacare, :),...","[(video, -), (-, #), (#, obamacare), (obamacar...",video - #obamacare: full higher costs broken p...,"[video, -, #, obamacare, :, full, higher, cost...",...,"[video, obamacare, full, higher, costs, broken...","[(video, obamacare), (obamacare, full), (full,...","[video, obamacare, full, higher, cost, broken,...",video obamacare full higher cost broken promis...,video obamacare full high cost broken promise ...,[],[obamacare],[http://t.co/dn3vzqirwf],video - #obamacare: full of higher costs and ...,[]
2,From: Kurt Schrader (Representative from Oregon),twitter,Please join me today in remembering our fallen...,please join me today in remembering our fallen...,"[please, join, me, today, in, remembering, our...","[please, join, me, today, in, remembering, our...","[(please, join), (join, me), (me, today), (tod...","[(please, join), (join, me), (me, today), (tod...",please join today remembering fallen heroes ho...,"[please, join, today, remembering, fallen, her...",...,"[please, join, today, remembering, fallen, her...","[(please, join), (join, today), (today, rememb...","[please, join, today, remembering, fallen, her...",please join today remembering fallen hero hono...,please join today remember fall hero honor men...,[],[],[],please join me today in remembering our fallen...,[]
3,From: Michael Crapo (Senator from Idaho),twitter,RT @SenatorLeahy: 1st step toward Senate debat...,rt @senatorleahy: 1st step toward senate debat...,"[rt, @, senatorleahy, :, 1st, step, toward, se...","[rt, @, senatorleahy, :, 1st, step, toward, se...","[(rt, @senatorleahy), (@senatorleahy, :), (:, ...","[(rt, @), (@, senatorleahy), (senatorleahy, :)...",rt @senatorleahy: 1st step toward senate debat...,"[rt, @, senatorleahy, :, 1st, step, toward, se...",...,"[rt, senatorleahy, 1st, step, toward, senate, ...","[(rt, senatorleahy), (senatorleahy, 1st), (1st...","[rt, senatorleahy, 1st, step, toward, senate, ...",rt senatorleahy 1st step toward senate debate ...,rt senatorleahy 1st step toward senate debate ...,[@senatorleahy],[vawa],[],rt @senatorleahy: 1st step toward senate debat...,[senatorleahy]
4,From: Mark Udall (Senator from Colorado),twitter,.@amazon delivery #drones show need to update ...,.@amazon delivery #drones show need to update ...,"[., @, amazon, delivery, #, drones, show, need...","[., @, amazon, delivery, #, drones, show, need...","[(., @amazon), (@amazon, delivery), (delivery,...","[(., @), (@, amazon), (amazon, delivery), (del...",.@amazon delivery #drones show need update law...,"[., @, amazon, delivery, #, drones, show, need...",...,"[amazon, delivery, drones, show, need, update,...","[(amazon, delivery), (delivery, drones), (dron...","[amazon, delivery, drone, show, need, update, ...",amazon delivery drone show need update law pro...,amazon delivery drone show need update law pro...,[@amazon],"[drones, innovation, privacy, uas]",[http://t.co/l9ta5skq6u],.@amazon delivery #drones show need to update ...,[amazon]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,From: Ted Yoho (Representative from Florida),facebook,I applaud Governor PerryÛªs recent decision t...,i applaud governor perryûªs recent decision t...,"[i, applaud, governor, perryûªs, recent, deci...","[i, applaud, governor, perryûªs, recent, deci...","[(i, applaud), (applaud, governor), (governor,...","[(i, applaud), (applaud, governor), (governor,...",applaud governor perryûªs recent decision dep...,"[applaud, governor, perryûªs, recent, decisio...",...,"[applaud, governor, recent, decision, deploy, ...","[(applaud, governor), (governor, recent), (rec...","[applaud, governor, recent, decision, deploy, ...",applaud governor recent decision deploy nation...,applaud governor recent decision deploy nation...,[],[],[],i applaud governor perryûªs recent decision t...,[]
4996,From: Ted Yoho (Representative from Florida),facebook,"Today, I voted in favor of H.R. 5016 - Financi...","today, i voted in favor of h.r. 5016 - financi...","[today, ,, i, voted, in, favor, of, h.r, ., 50...","[today, ,, i, voted, in, favor, of, h.r, ., 50...","[(today, ,), (,, i), (i, voted), (voted, in), ...","[(today, ,), (,, i), (i, voted), (voted, in), ...","today, voted favor h.r. 5016 - financial servi...","[today, ,, voted, favor, h.r, ., 5016, -, fina...",...,"[today, voted, favor, 5016, financial, service...","[(today, voted), (voted, favor), (favor, 5016)...","[today, voted, favor, 5016, financial, service...",today voted favor 5016 financial service gener...,today vote favor 5016 financial service genera...,[],[],[],"today, i voted in favor of h.r. 5016 - financi...",[]
4997,From: Ted Yoho (Representative from Florida),facebook,(Taken from posted WOKV interview) Congressm...,(taken from posted wokv interview) congressm...,"[(, taken, from, posted, wokv, interview, ), c...","[(, taken, from, posted, wokv, interview, ), c...","[((, taken), (taken, from), (from, posted), (p...","[((, taken), (taken, from), (from, posted), (p...",(taken posted wokv interview) congressman yoho...,"[(, taken, posted, wokv, interview, ), congres...",...,"[taken, posted, wokv, interview, congressman, ...","[(taken, posted), (posted, wokv), (wokv, inter...","[taken, posted, wokv, interview, congressman, ...",taken posted wokv interview congressman yoho s...,take post wokv interview congressman yoho say ...,[],[],[http://www.wokv.com/news/news/local/yoho-airs...,(taken from posted wokv interview) congressm...,[]
4998,From: Ted Yoho (Representative from Florida),facebook,Join me next week for a town hall in Ocala! I'...,join me next week for a town hall in ocala! i'...,"[join, me, next, week, for, a, town, hall, in,...","[join, me, next, week, for, a, town, hall, in,...","[(join, me), (me, next), (next, week), (week, ...","[(join, me), (me, next), (next, week), (week, ...",join next week town hall ocala! i'll answer qu...,"[join, next, week, town, hall, ocala, !, i, 'l...",...,"[join, next, week, town, hall, ocala, i, answe...","[(join, next), (next, week), (week, town), (to...","[join, next, week, town, hall, ocala, i, answe...",join next week town hall ocala i answer questi...,join next week town hall ocala i answer questi...,[],[],[],join me next week for a town hall in ocala! i'...,[]


**We stored the mentions from each post in the 'mentions' column. How many unique accounts have been mentioned?**

> * You might want to flatten the list of mentions and then count the frequency of each mention.
> * The code below flattens the list of mentions into a single list.
> * `[item for sublist in content['mentions'] for item in sublist]`

In [118]:
[item for sublist in content['mentions'] for item in sublist]

['@nowthisnews',
 '@senatorleahy',
 '@amazon',
 '@usdotfra',
 '@bbcworld',
 '@mha',
 '@genevrapittman',
 '@medcitynews',
 '@skitchp',
 '@corybooker',
 '@senbookerofc',
 '@accesshealthct',
 '@ktvb',
 '@senatorreid',
 '@marcthiessen',
 '@uscapitol',
 '@houseofcards',
 '@flxwinecountry',
 '@niagwinetrail',
 '@politico',
 '@ftmeademd',
 '@askgeorge',
 '@edworkforcedems',
 '@grahamblog',
 '@lyndawoolard',
 '@aaas',
 '@fanniemae',
 '@fixthedebt',
 '@thecankicksback',
 '@whitehouse',
 '@repjoegarcia',
 '@wolfblitzer',
 '@changinghands',
 '@foxmorningnews',
 '@jaketapper',
 '@freebeacon',
 '@rtenews',
 '@haroldmeyerson',
 '@washingtonpost',
 '@househomeland',
 '@mccaulpressshop',
 '@cbsthismorning',
 '@mayornoak',
 '@lewisuniversity',
 '@seanhannity',
 '@wyoguard',
 '@washingtonpost',
 '@mydesert',
 '@utahpoliticohub',
 '@matthewwymt',
 '@michellewymt',
 '@wymt',
 '@repsusandavis',
 '@tenygross',
 '@hanscom',
 '@kevinscholla',
 '@repdavid',
 '@sarahnetradio',
 '@robbhanrahancbs',
 '@cbs21news'
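
> * The flattened list still contains duplicates (for example, `@washingtonpost` appears more than once above), so it does not answer the question on its own. One possible way to finish, sketched on a small made-up DataFrame rather than the real `content`: a `set` keeps each account once, and `Counter` counts how often each one appears. The name `toy` and its rows are assumptions for illustration only.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for content['mentions']: one list of mentions per post.
toy = pd.DataFrame({
    "mentions": [
        ["@nowthisnews"],
        [],
        ["@washingtonpost", "@politico"],
        ["@washingtonpost"],
    ]
})

# Flatten the list of lists into one flat list of mentions.
flat_mentions = [item for sublist in toy["mentions"] for item in sublist]

# A set keeps each account only once, so its length is the number of unique accounts.
n_unique = len(set(flat_mentions))

# Counter tallies how many times each account was mentioned.
mention_counts = Counter(flat_mentions)

print(n_unique)
print(mention_counts.most_common(1))
```

> * Swapping `toy` for `content` should give the count for the real data, assuming the 'mentions' column holds Python lists and not strings that only look like lists.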

**We stored the hashtags from each post in the 'hashtags' column. How many unique hashtags have been used?**

> * You might want to flatten the list of hashtags.
> * The code below flattens the list of hashtags into a single list.
> * `[item for sublist in content['hashtags'] for item in sublist]`

In [134]:
[item for sublist in content['hashtags'] for item in sublist]

['fl',
 'obamacare',
 'politics',
 'obamacare',
 'vawa',
 'drones',
 'innovation',
 'privacy',
 'uas',
 'casseltonderailment',
 'bringbackourgirls',
 'joinrepwilson',
 'fmla',
 'hoyerheadlines',
 'renewui',
 'getenrolled',
 'defundobamacare',
 'idpol',
 'idaho150',
 'otd',
 'actonclimate',
 'nca2014',
 'yourtime',
 'tx32',
 'wv',
 '4jobs',
 'energy',
 'jobs',
 'obamacare',
 'renewui',
 'trainwreck',
 'scnews',
 'florida',
 'makegovernmentwork',
 'problemsolvers',
 'fixnotfight',
 'jerusalem',
 'israel',
 'gopbudget',
 'wi',
 'medicare',
 'oped',
 'gsereform',
 'kxl',
 'climatechange',
 'kentucky',
 'coleg',
 'venezuela',
 'sonyhack',
 'colorado30',
 'thirteensoldiers',
 'tempe',
 'arizona',
 'requireaplan',
 'va06',
 'rke',
 'bergdahl',
 'irs',
 'syria',
 'irs',
 'irsscandal',
 'benghazi',
 'strikefastfood',
 'boston',
 'willcounty',
 'fl17',
 'sayfie',
 'vawa',
 'obamacare',
 'bahrain',
 'cir',
 'childcareprotection',
 'utpol',
 'lasen',
 'rxdrugabuse',
 'arts',
 'landmines',
 'tcot',
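
> * As an alternative to the list comprehension, pandas can flatten a column of lists with `.explode()`. The sketch below uses a small made-up DataFrame (`toy` is an assumption, not the real `content`) and then counts unique hashtags with `.nunique()` and their frequencies with `.value_counts()`.

```python
import pandas as pd

# Hypothetical stand-in for content['hashtags']: one list of hashtags per post.
toy = pd.DataFrame({
    "hashtags": [
        ["fl", "obamacare", "politics"],
        ["obamacare"],
        [],
        ["vawa"],
    ]
})

# explode() puts each list element on its own row; empty lists become NaN,
# so dropna() removes the posts that had no hashtags.
exploded = toy["hashtags"].explode().dropna()

# Number of distinct hashtags.
n_unique_hashtags = exploded.nunique()

# How many times each hashtag was used, most frequent first.
hashtag_counts = exploded.value_counts()

print(n_unique_hashtags)
print(hashtag_counts)
```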

**Who mentioned the most accounts in one tweet?**

In [146]:
mentions = [[content["mentions-str"]]]
mentions


[[0        [nowthisnews]
  1                   []
  2                   []
  3       [senatorleahy]
  4             [amazon]
               ...      
  4995                []
  4996                []
  4997                []
  4998                []
  4999                []
  Name: mentions-str, Length: 5000, dtype: object]]
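
> * Wrapping the column in extra brackets only nests the same Series inside two lists; it does not count anything. One possible approach, sketched on a made-up DataFrame: take the length of each list to get the number of mentions per post, then use `.idxmax()` to find the row with the largest count. The names `toy`, `n_mentions`, and the rows themselves are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows that mimic the columns used in this notebook.
toy = pd.DataFrame({
    "from": ["Politician A", "Politician B", "Politician C"],
    "text-lower": ["rt @x: hello", "no mentions here", "thanks @x @y @z"],
    "mentions-str": [["x"], [], ["x", "y", "z"]],
})

# The length of each list is the number of accounts mentioned in that post.
n_mentions = toy["mentions-str"].apply(len)

# idxmax() returns the index label of the first row with the largest count.
top_idx = n_mentions.idxmax()

# Look up who posted it and what the post said.
top_row = toy.loc[top_idx, ["from", "text-lower"]]

print(n_mentions.max())
print(top_row)
```

> * The same pattern should work for the 'hashtags' column. Note that `.idxmax()` returns only the first row if several posts tie for the maximum.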

**Looks like `YOUR ANSWER` mentioned nine accounts in one tweet. What is the content of that tweet? Use the 'text-lower' column.**

In [None]:
#YOUR CODE HERE

**Who used hashtags the most in one tweet?**

In [None]:
#YOUR CODE HERE

**Looks like `YOUR ANSWER` used eight hashtags in one tweet. What is the content of the tweet that used the most hashtags? Use the 'text-lower' column.**

In [None]:
#YOUR CODE HERE

**In the 'from' column, each politician's name is preceded by 'From: ' and followed by whether the politician is a Representative or a Senator and the state they represent. Let's clean this.**

**First, clean the 'From: ' in the 'from' column.**

In [None]:
#YOUR CODE HERE
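
> * A minimal sketch of one way to strip the prefix, shown on a two-row sample copied from the table above (the name `toy` is an assumption). `.str.replace()` with `regex=True` removes 'From: ' only when it appears at the start of the string.

```python
import pandas as pd

# Two sample values taken from the 'from' column shown earlier.
toy = pd.DataFrame({
    "from": [
        "From: Trey Radel (Representative from Florida)",
        "From: Mark Udall (Senator from Colorado)",
    ]
})

# ^ anchors the pattern to the start; \s* also removes the space after the colon.
cleaned = toy["from"].str.replace(r"^From:\s*", "", regex=True)

print(cleaned.tolist())
```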

**Second, extract where they are from (the name of the State) and create a new column called 'state'.**

In [None]:
#YOUR CODE HERE
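
> * One possible way to pull out the state is `.str.extract()` with a capture group. The sketch below assumes every value ends with '(Representative from STATE)' or '(Senator from STATE)', as in the rows shown above; `toy` is a made-up sample, not the real `content`.

```python
import pandas as pd

# Sample values taken from the 'from' column shown earlier.
toy = pd.DataFrame({
    "from": [
        "From: Mitch McConnell (Senator from Kentucky)",
        "From: Kurt Schrader (Representative from Oregon)",
    ]
})

# The parentheses around [^)]+ capture everything between "from " and ")",
# so multi-word state names would be captured as well.
toy["state"] = toy["from"].str.extract(
    r"\((?:Representative|Senator) from ([^)]+)\)", expand=False
)

print(toy["state"].tolist())
```

> * Rows that do not match the pattern get `NaN` in 'state', so checking `toy["state"].isna().sum()` is a quick way to spot values with a different format.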

**Third, extract the name of the politician and create a new column called 'politician'.**

In [None]:
#YOUR CODE HERE
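
> * The name sits between the optional 'From: ' prefix and the opening parenthesis, so a lazy capture group can pick it out. This is a sketch on a made-up sample (`toy` is an assumption), and it works whether or not 'From: ' has already been removed.

```python
import pandas as pd

# One value with the prefix and one without, based on rows shown earlier.
toy = pd.DataFrame({
    "from": [
        "From: Michael Crapo (Senator from Idaho)",
        "Ted Yoho (Representative from Florida)",
    ]
})

# (?:From:\s*)? skips the prefix if present; (.+?) lazily captures the name
# up to the whitespace before the first opening parenthesis.
toy["politician"] = toy["from"].str.extract(
    r"^(?:From:\s*)?(.+?)\s*\(", expand=False
)

print(toy["politician"].tolist())
```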