> * Text analysis is an essential part of social network analysis.
> * Imagine you have a social media dataset with tweets. You want to analyze text to find nodes and edges as well as what they are talking about.
> * You can use text analysis to find the most frequent words, hashtags, and mentions.

In [None]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.util import ngrams

> * We are going to use the `nltk` library to analyze text.
> * `from nltk.tokenize import word_tokenize` tells Python to look inside the `nltk.tokenize` module and import only the name `word_tokenize`.
> * `from` names the module to look in, and `import` names what to bring in from it.
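> * As a quick illustration of the two import styles, here is a sketch using the standard-library `collections` module (chosen only because it needs no installation; the variable names are made up for this example).

```python
# Style 1: import the whole module, then reach inside it with a dot.
import collections
counts_a = collections.Counter(["a", "b", "a"])

# Style 2: import one name from the module and use it directly.
from collections import Counter
counts_b = Counter(["a", "b", "a"])

# Both styles give access to the same object.
print(counts_a == counts_b)
print(collections.Counter is Counter)
```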

In [None]:
data=pd.read_csv('../week3/Political-media-DFE.csv', encoding='latin1')

> * Let's import the data from week 3.

In [None]:
data.columns

In [None]:
data.dtypes

> * `.dtypes` is used to check the data type of each column.

> * Let's subset the data to have who posted, where they posted (social media platform), and what they posted.

In [None]:
content=data[['label', 'source', 'text']].copy() #.copy() lets us add new columns later without a SettingWithCopyWarning

In [None]:
content

> * We have contents from both Twitter and Facebook, and the `text` column holds the contents.

In [None]:
content['source'].unique()

> * Let's print out some of the contents.

In [None]:
content['text'].iloc[0]

In [None]:
content['text'].iloc[1]

In [None]:
content['text'].iloc[2]

> * Q. What additional type of text do you see in the contents?

```YOUR ANSWER HERE```

> * Yes, we see hashtags, retweets, mentions, and URLs.
> * In the third row, we see an example of content that contains only plain text.

> * In text analysis, it is important to make the text clean (remove unnecessary words, symbols, etc.) and to make the text uniform (lowercase, no punctuation, etc.).

> * Let's lowercase all the text.

In [None]:
content['text-lower']=content['text'].str.lower()

In [None]:
content['text-lower'].iloc[2]

> * We can separate the contents into tokens (words, hashtags, mentions, etc.).
> * Separating the contents into tokens is called tokenization.
> * We can use the `word_tokenize` function from the `nltk` library to tokenize the contents.
> * There is also a `TweetTokenizer` class in the `nltk` library that is designed specifically for tweets.

> * There are two ways to tokenize the contents. One is to use the `apply()` method on the lowercased text column.
> * `apply()` calls a function on every value in a column, so you don't have to write a for loop.
> * The other way is to iterate through the lowercased text and tokenize each row.
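> * A toy comparison of the two approaches. This is only a sketch: the DataFrame `toy` is made up, and `str.split` stands in for `word_tokenize` so the example runs without downloading NLTK data.

```python
import pandas as pd

toy = pd.DataFrame({"text-lower": ["hello world", "social network analysis"]})

# Approach 1: apply() calls the function once per value in the column.
applied = toy["text-lower"].apply(str.split)

# Approach 2: loop over the rows and collect the results in a list.
looped = []
for idx, row in toy.iterrows():
    looped.append(row["text-lower"].split())

# Both approaches produce the same list of token lists.
print(applied.tolist() == looped)
```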

In [None]:
content['tokenized_unigrams']=content['text-lower'].apply(word_tokenize)

In [None]:
iterated_unigrams=[]
for idx, row in content.iterrows():
    iterated_unigrams.append(word_tokenize(row['text-lower']))

In [None]:
content['iterated_unigrams']=iterated_unigrams 

> * Iterating through each row and applying `word_tokenize` gives the same result as using `apply()`: in both cases, each row holds a list of tokens.

In [None]:
content.loc[0,'iterated_unigrams'] == content.loc[0,'tokenized_unigrams']

> * One token is called a unigram.
> * We can try to find bigrams (two tokens) and trigrams (three tokens) as well. 
> * All of these tokens are called n-grams.
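> * To see what an n-gram is without any library, here is a minimal sketch. The helper `simple_ngrams` is a hypothetical name written for this example; `nltk.util.ngrams`, used in this notebook, yields the same kind of tuples.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["we", "love", "network", "analysis"]
unigrams = simple_ngrams(tokens, 1)
bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```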


In [None]:
#bigrams lambda function
content['tokenized_bigrams']=content['text-lower'].apply(TweetTokenizer().tokenize).apply(lambda x: list(ngrams(x, 2)))
#It first applies the 'tokenize' method of the TweetTokenizer class to the 'text-lower' column,
#and then applies the lambda function to each resulting list of tokens to build the bigrams.

> * A lambda function is created using the `lambda` keyword, followed by the input variable(s), a colon, and a single expression.
> * Lambda is a small, anonymous, and inline function.
> * `lambda arguments : expression`
> * e.g., `lambda x: x+1`

In [None]:
#practicing lambda functions
def square(x):
    return x**2

In [None]:
#equivalent lambda function
square_lambda = lambda x: x**2

In [None]:
function_result = square(5)
lambda_result = square_lambda(5)

In [None]:
print(function_result)
print(lambda_result)

> * The result of the lambda function and regular function is the same.

In [None]:
#bigram iteration
iterated_bigrams=[]
for idx, row in content.iterrows():
    iterated_bigrams.append(list(ngrams(row['tokenized_unigrams'], 2)))

In [None]:
content['iterated_bigrams']=iterated_bigrams

> * We are importing `Counter` from `collections`, a module in Python's standard library, to count the frequency of the tokens.

In [None]:
from collections import Counter

> * Let's see how bigrams and unigrams are saved in each column.

In [None]:
content['tokenized_bigrams']

In [None]:
content['tokenized_unigrams']

> * Let's count the most frequent unigrams.
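> * The cell that follows flattens a column of token lists into one long list and hands it to `Counter`. Here is a sketch of the same idea on made-up data (the list `rows` is hypothetical).

```python
from collections import Counter

rows = [["the", "vote", "is", "today"], ["vote", "early"], ["the", "vote"]]

# Flatten: take every token from every row, in order.
flat = [token for row in rows for token in row]

counts = Counter(flat)
top_two = counts.most_common(2)  # list of (token, count) pairs, highest count first
print(top_two)
```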

In [None]:
Counter([item for row in content['tokenized_unigrams'] for item in row]).most_common(10)

> * Let's count the most frequent bigrams.

In [None]:
Counter([item for row in content['tokenized_bigrams'] for item in row]).most_common(10)

> * However, we see frequent words include function words (e.g., the, and, is, etc.) and punctuation.
> * We can remove function words and punctuation to find the content words (e.g., nouns, verbs, adjectives, etc.).

In [None]:
content.head(2)

> * In the `nltk` library, there is a list of stopwords (function words) that we can use to remove from the contents.

In [None]:
stop=stopwords.words('english')

In [None]:
stop[:10] #use a slice to show only the first 10 stopwords

In [None]:
stop[-10:] #use negative index to slice the last 10 stopwords

> * We want to remove the stopwords from the `text-lower` column.
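> * A small sketch of the filtering step. The three-word `toy_stop` set is made up so the example runs without NLTK data; the notebook itself uses the full `stopwords.words('english')` list.

```python
toy_stop = {"the", "is", "and"}  # hypothetical mini stopword list
sentence = "the senate is voting, and the house is waiting"

# Keep only the words that are not in the stopword set.
kept = [word for word in sentence.split() if word not in toy_stop]
cleaned = " ".join(kept)
print(cleaned)
```

> * Because `split()` only breaks on whitespace, punctuation stays attached to words (`voting,`), so a stopword followed directly by a comma would not be removed.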

In [None]:
content['stopword']=content['text-lower'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
#The lambda function takes each row of the 'text-lower' column, splits it into a list of words, 
#and then joins the words back together into a string, excluding any words that are in the 'stop' list.
    

> * Let's compare the result of removing stopwords with the original `text-lower` column.

In [None]:
content['stopword']

In [None]:
content['text-lower']

> * We'll get unigrams for the text after we remove function words.

In [None]:
content['stop_tokenized_unigrams']=content['stopword'].apply(word_tokenize)

> * We'll get bigrams for the text after we remove function words.

In [None]:
content['stop_tokenized_bigrams']=content['stopword'].apply(TweetTokenizer().tokenize).apply(lambda x: list(ngrams(x, 2)))

> * Here are the most frequent unigrams and bigrams after removing function words.

In [None]:
Counter([item for row in content['stop_tokenized_unigrams'] for item in row]).most_common(10)

In [None]:
Counter([item for row in content['stop_tokenized_bigrams'] for item in row]).most_common(10)

> * We still see irrelevant punctuation. Let's get rid of it.
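> * The cell that follows keeps a token only if `str.isalnum()` is true. A quick sketch on made-up tokens shows what that keeps and what it drops.

```python
tokens = ["vote", "!", "2016", "#", "don't", "u.s.", "tax"]

# isalnum() is True only when every character is a letter or a digit.
kept = [t for t in tokens if t.isalnum()]
dropped = [t for t in tokens if not t.isalnum()]

print(kept)
print(dropped)
```

> * Note that tokens with internal punctuation, such as `don't` or `u.s.`, are dropped along with pure punctuation.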


In [None]:
content['punct_tokenized_unigrams']=content['stop_tokenized_unigrams'].apply(lambda x: [word for word in x if word.isalnum()])

In [None]:
punct_iterated_bigrams=[]
for idx, row in content.iterrows():
    punct_iterated_bigrams.append(list(ngrams(row['punct_tokenized_unigrams'], 2)))

In [None]:
content['punct_tokenized_bigrams']=punct_iterated_bigrams

> * Let's count the most frequent unigrams and bigrams after removing punctuation.

In [None]:
Counter([item for row in content['punct_tokenized_unigrams'] for item in row]).most_common(10)

In [None]:
Counter([item for row in content['punct_tokenized_bigrams'] for item in row]).most_common(10)

> * But we want to do additional cleaning. 
> * When counting the most frequent words, different inflected forms of the same word are counted as different words.
> * For example, "run", "runs", and "running" are counted as three different words.
> * We can use lemmatization to convert words to their base form.

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer() #Initialize lemmatizer
from nltk.corpus import wordnet

In [None]:
content['lemma']=content['punct_tokenized_unigrams'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

In [None]:
content['lemma_str']=content['lemma'].apply(lambda x: ' '.join(x))

> * Let's compare the results of lemmatization and without lemmatization.

In [None]:
content.loc[5, 'lemma_str']

In [None]:
content.loc[5, 'text-lower']

> * Q. Do you see any different results? What tokens have changed after lemmatization?

`YOUR ANSWER`

In [None]:
lemmatizer.lemmatize('better')

In [None]:
lemmatizer.lemmatize('better', pos=wordnet.ADJ)

In [None]:
lemmatizer.lemmatize('cars')

In [None]:
lemmatizer.lemmatize('cars', pos=wordnet.VERB)

> * NLTK's `WordNetLemmatizer` is not perfect.
> * By default, it treats every word as a noun, so verbs and adjectives are often left unchanged.
> * Therefore, we need to specify the part of speech (POS) for each token.

In [None]:
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'): #ADJECTIVE
        return wordnet.ADJ
    elif nltk_tag.startswith('V'): #VERB
        return wordnet.VERB
    elif nltk_tag.startswith('N'): #NOUN        
        return wordnet.NOUN
    elif nltk_tag.startswith('R'): #ADVERB
        return wordnet.ADV
    else:          
        return None

In [None]:
def lemmatize_sentence(sentence):
    # Tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    # Tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged) 
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            # If no tag was found, then use the word as is
            lemmatized_sentence.append(word)
        else:        
            # Else use the tag to lemmatize the word
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

In [None]:
content['lemmatizer_str']=content['lemma'].apply(lambda x: lemmatize_sentence(' '.join(x)))

In [None]:
content.loc[5, 'lemma_str']

In [None]:
content.loc[5, 'lemmatizer_str']

> * Let's extract hashtags and mentions from the contents.
> * We have to use regular expressions to extract hashtags and mentions.

In [None]:
import re

> * There are three useful regex functions in Python: `findall`, `search`, and `match`.

> * `findall` is used to find all matches of a pattern in a string.
> * `search` is used to find the first match of a pattern in a string.
> * `match` is used to match a pattern at the beginning of a string.

> * In this class, we are mostly going to use `findall` to extract hashtags and mentions.

> * The result of `findall` is a list of strings.
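> * A side-by-side sketch of the three functions on one made-up string.

```python
import re

text = "room 12 and room 345"

all_numbers = re.findall(r"\d+", text)  # every match, as a list of strings
first_number = re.search(r"\d+", text)  # match object for the first match, or None
at_start = re.match(r"\d+", text)       # only tries the beginning of the string

print(all_numbers)
print(first_number.group())
print(at_start)
```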

> * Basic regex patterns:
> * `.` matches any character except a newline.
> * `*` matches 0 or more repetitions of the preceding regex pattern.
> * `+` matches 1 or more repetitions of the preceding regex pattern.
> * `?` matches 0 or 1 repetition of the preceding regex pattern.
> * `^` matches the start of a string.
> * `$` matches the end of a string.
> * `[]` matches any one of the characters inside the square brackets.
> * `\` is used to escape special characters.
> * `|` is used to match either the regex pattern on the left or the right.

> * `[a-z]` matches any lowercase letter.
> * `[A-Z]` matches any uppercase letter.
> * `[0-9]` matches any digit.
> * `\d` matches any digit.
> * `\D` matches any non-digit.
> * `\w` matches any word character (alphanumeric and underscore).
> * `\W` matches any non-word character.
> * `\s` matches any whitespace character.
> * `\S` matches any non-whitespace character.

> * `[a-zA-Z]` matches any alphabet character.
> * `[a-zA-Z0-9]` matches any alphanumeric character.
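> * A few of these building blocks in action. The sample string is made up for illustration.

```python
import re

text = "Call 555-0199 or email Bob_99 at 9am"

digits = re.findall(r"\d+", text)               # runs of digits
words = re.findall(r"\w+", text)                # runs of letters, digits, or underscore
capitalized = re.findall(r"[A-Z][a-z]+", text)  # one uppercase letter, then lowercase letters
starts_with_call = re.findall(r"^Call", text)   # ^ anchors the pattern to the start
optional_s = re.findall(r"emails?", text)       # ? makes the trailing s optional

print(digits)
print(words)
print(capitalized)
```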


<img src="../week4/Regex-Cheat-Sheet.png" width=500px height=800px />

> * Let's practice regex with the `findall` function.
> * `findall` returns all non-overlapping matches of the pattern in the string, as a list of strings or tuples. The string is scanned left-to-right, and matches are returned in the order found.

In [None]:
pattern = re.compile(r'\d+') # \d+ is the pattern to search for, it means any digit 0-9 and the + means one or more times
sample1='I have 2 dogs and 3 cats'
result=re.findall(pattern, sample1)
print(result)

In [None]:
pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}') 
#[a-zA-Z0-9._%+-]+ is the pattern to search for, it means any letter a-z or A-Z, 
#any digit 0-9, and the characters ._%+-, and the + means one or more times
#@ is the pattern to search for, it means the character @
#[a-zA-Z0-9.-]+ is the pattern to search for, it means any letter a-z or A-Z, 
#any digit 0-9, and the characters .-, and the + means one or more times
#\ is the escape character, it is used to escape the . character, so it is not interpreted as a special character
#[a-zA-Z]{2,} is the pattern to search for, it means any letter a-z or A-Z, and the {2,} means two or more times
sample2='''My primary email is is324socialnetworkanalysis@illinois.edu. Please contact me! 
If I am not replying, you can contact me at jaihyunpark@illinois.edu'''
result=re.findall(pattern, sample2)
print(result)

In [None]:
pattern = re.compile(r'https?://\S+')
#http is the pattern to search for, it means the characters http
#s? is the pattern to search for, it means the character s, and the ? means zero or one time
#:// is the pattern to search for, it means the characters ://
#\S+ is the pattern to search for, it means any character that is not a white space, and the + means one or more times
sample3='Look at what is happening in Washington! https://is324.com'
result=re.findall(pattern, sample3)
print(result)

In [None]:
pattern = re.compile(r'#[a-zA-Z0-9]+')
## is the pattern to search for, it means the character #
#[a-zA-Z0-9]+ is the pattern to search for, it means any letter a-z or A-Z, 
#any digit 0-9, and the + means one or more times
sample4='I am so excited for the class #IS324'
result=re.findall(pattern, sample4)
print(result)

In [None]:
pattern = re.compile(r'@[a-zA-Z0-9]+')
#@ is the pattern to search for, it means the character @
#[a-zA-Z0-9]+ is the pattern to search for, it means any letter a-z or A-Z, 
#any digit 0-9, and the + means one or more times
sample5='@jaihyunpark is the instructor for #IS324'
result=re.findall(pattern, sample5)
print(result)

In [None]:
pattern = re.compile(r'(rt\s+@[a-zA-Z0-9]+|@[a-zA-Z0-9]+)')
#Note: there are no spaces around |, because a space inside a regex is matched literally.
#The pattern is composed of two subpatterns, separated by the | character, 
#which means "or" and grouped by parentheses.
#rt is the pattern to search for, it means the characters rt
#\s+ is the pattern to search for, it means any white space, and the + means one or more times
#@ is the pattern to search for, it means the character @
#[a-zA-Z0-9]+ is the pattern to search for, it means any letter a-z or A-Z, 
#any digit 0-9, and the + means one or more times

#Another subpattern is looking for mentions without the rt prefix
#@ is the pattern to search for, it means the character @
#[a-zA-Z0-9]+ is the pattern to search for, it means any letter a-z or A-Z, 
#any digit 0-9, and the + means one or more times
sample6='rt @jaihyunpark: I am so excited for the class #IS324 @socialmedia'
result=re.findall(pattern, sample6)
print(result)

> * Let's practice regex with the `search` function.
> * Scan through the string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object.
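> * `search` returns a match object, or `None` when nothing matches. Besides `span()`, the match object offers `group()`, `start()`, and `end()`. A sketch with made-up strings:

```python
import re

text = "budget passed with 61 votes"
m = re.search(r"\d+", text)

if m is not None:  # always check: search returns None when there is no match
    matched_text = m.group()         # the matched substring
    start, end = m.start(), m.end()  # the same two numbers that m.span() returns
    print(matched_text, start, end)

no_match = re.search(r"\d+", "no digits here")
print(no_match)
```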

In [None]:
pattern = re.compile(r'\d+') # \d+ is the pattern to search for, it means 
#any digit 0-9 and the + means one or more times
sample1='I have 2 dogs and 3 cats'
result=re.search(pattern, sample1) #search returns a match object for the first match (or None if there is no match)
print(result)
# span=(7, 8) means that the match starts at index 7 and ends just before index 8 (indexing starts at 0)

In [None]:
print(sample1[result.span()[0]:result.span()[1]])

In [None]:
pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}') 
#Matches one or more alphanumeric characters, dots, underscores, percent signs, plus signs, or hyphens. 
#The @ symbol is also matched.
sample2='''My primary email is is324socialnetworkanalysis@illinois.edu. 
Please contact me! If I am not replying, you can contact me at jaihyunpark@illinois.edu'''
result=re.search(pattern, sample2)
print(result)

In [None]:
print(sample2[result.span()[0]:result.span()[1]])

In [None]:
pattern = re.compile(r'https?://\S+')
#Matches http or https, followed by ://, followed by any non-whitespace characters.
sample3='Look at what is happening in Washington! https://is324.com'
result=re.search(pattern, sample3)
print(result)

In [None]:
print(sample3[result.span()[0]:result.span()[1]])

In [None]:
pattern = re.compile(r'#[a-zA-Z0-9]+')
#Matches hashtags that start with #, followed by one or more alphanumeric characters.
sample4='I am so excited for the class #IS324'
result=re.search(pattern, sample4)
print(result)

In [None]:
print(sample4[result.span()[0]:result.span()[1]])

In [None]:
pattern = re.compile(r'@[a-zA-Z0-9]+')
#Matches mentions that start with @, followed by one or more alphanumeric characters.
sample5='@jaihyunpark is the instructor for #IS324'
result=re.search(pattern, sample5) #search returns a match object for the first match
print(result)

In [None]:
print(sample5[result.span()[0]:result.span()[1]])

In [None]:
pattern = re.compile(r'(rt\s+@[a-zA-Z0-9]+|@[a-zA-Z0-9]+)')
#Matches retweets that start with rt, followed by one or more white spaces, 
#followed by @, followed by one or more alphanumeric characters.
sample6='rt @jaihyunpark: I am so excited for the class #IS324 @socialmedia'
result=re.search(pattern, sample6)
print(result)

In [None]:
print(sample6[result.span()[0]:result.span()[1]])

> * Let's practice regex with the `match` function.
> * If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object.
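> * The cell that follows returns `None` because its string does not start with a digit. For contrast, here is a sketch where `match` succeeds (both strings are made up).

```python
import re

pattern = re.compile(r"\d+")

starts_with_digits = pattern.match("324 students enrolled")
starts_with_letters = pattern.match("enrolled: 324 students")

print(starts_with_digits.group())
print(starts_with_letters)
```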

In [None]:
pattern = re.compile(r'\d+') # \d+ is the pattern to search for, 
#it means any digit 0-9 and the + means one or more times
sample1='I have 2 dogs and 3 cats'
result=re.match(pattern, sample1) #check if the pattern is at the beginning of the string
print(result)

> * Let's extract hashtags and mentions from the content DataFrame.

> * First, we are extracting mentions.

In [None]:
pattern = re.compile(r'@[a-zA-Z0-9]+')
content['mentions']=content['text-lower'].apply(lambda x: re.findall(pattern, x))

> * Let's check whether the mentions were extracted correctly.

In [None]:
content['text-lower'].iloc[0]

In [None]:
content['mentions'].iloc[0]

In [None]:
content['text-lower'].iloc[1]

In [None]:
content['mentions'].iloc[1]

> * Let's extract hashtags.

In [None]:
pattern = re.compile(r'#([a-zA-Z0-9]+)')
content['hashtags']=content['text-lower'].apply(lambda x: re.findall(pattern, x))

In [None]:
content['text-lower'].iloc[0]

In [None]:
content['hashtags'].iloc[0]

In [None]:
content['text-lower'].iloc[1]

In [None]:
content['hashtags'].iloc[1]

> * We can also extract URLs (both http and https links).

In [None]:
pattern=re.compile(r'https?://\S+')
content['http']=content['text-lower'].apply(lambda x: re.findall(pattern, x))

In [None]:
content['text-lower'].iloc[0]

In [None]:
content['http'].iloc[0]

> * A pandas Series also has a `str` accessor for applying string methods to every value.
> * Useful methods include `str.contains`, `str.replace`, and `str.findall`.
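> * A compact sketch of the three methods on a made-up Series (`posts` is hypothetical). The patterns are the same simple ones used earlier in this notebook.

```python
import pandas as pd

posts = pd.Series([
    "rt @alice: vote today https://example.com",
    "town hall tonight #ia01",
])

has_url = posts.str.contains("http")                         # boolean Series
no_url = posts.str.replace(r"https?://\S+", "", regex=True)  # remove URLs
mentions = posts.str.findall(r"@[a-zA-Z0-9]+")               # one list per row

print(has_url.tolist())
print(no_url.tolist())
print(mentions.tolist())
```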

> * `str.contains` is used to check if a pattern is contained in each string of the Series.

In [None]:
content['text-lower'].str.contains('http')

> * How many posts contain a URL?

In [None]:
content[content['text-lower'].str.contains('http')].shape

> * `str.replace` is used to replace a pattern with another string in each string of the Series.

In [None]:
content['no-url']=content['text-lower'].str.replace(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', regex=True)

In [None]:
content['text-lower'].iloc[0]

In [None]:
content['no-url'].iloc[0]

> * `str.findall` is used to find all matches of a pattern in each string of the Series.

In [None]:
content['mentions-str']=content['text-lower'].str.findall(r'@([a-zA-Z0-9]+)')

> * `str.findall` in pandas works the same way as `re.findall`. Here the pattern puts the username in parentheses, so the results match the `mentions` column except that the leading `@` is left out.
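> * One detail worth knowing: when a pattern contains a capture group `( )`, `findall` returns only the text inside the group. A sketch with a made-up string:

```python
import re

text = "thanks @alice and @bob42"

with_symbol = re.findall(r"@[a-zA-Z0-9]+", text)       # no group: the whole match is returned
without_symbol = re.findall(r"@([a-zA-Z0-9]+)", text)  # one group: only the captured part is returned

print(with_symbol)
print(without_symbol)
```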

In [None]:
content['mentions-str'].iloc[2277]

In [None]:
content['mentions'].iloc[2277]

Practice

**Let's rename the 'label' column to 'from'.**

In [None]:
#YOUR CODE HERE

**We stored the mentions in the 'mentions' column. How many unique accounts have been mentioned?**

> * You might want to flatten the list of mentions and then count the frequency of each mention.
> * This code below will flatten the list of mentions.
> * `[item for sublist in content['mentions'] for item in sublist]`

In [None]:
#YOUR CODE HERE

**We stored the hashtags in the 'hashtags' column. How many unique hashtags have been used?**

> * You might want to flatten the list of hashtags.
> * This code below will flatten the list of hashtags.
> * `[item for sublist in content['hashtags'] for item in sublist]`

In [None]:
#YOUR CODE HERE

**Who mentioned the most accounts in a single tweet?**

In [None]:
#YOUR CODE HERE

**Looks like `YOUR ANSWER` mentioned nine accounts in one tweet. What is the content of the tweet? Use 'text-lower' column**

In [None]:
#YOUR CODE HERE

**Who used hashtags the most in one tweet?**

In [None]:
#YOUR CODE HERE

**Looks like `YOUR ANSWER` used eight hashtags in one tweet. What is the content of the tweet that used the most hashtags? Use 'text-lower' column**

In [None]:
#YOUR CODE HERE

**In the 'from' column, the name of the politician comes with 'From: ' in front of the name and whether the politician is Representative or Senator. Let's clean this.**

**First, clean the 'From: ' in the 'from' column.**

In [None]:
#YOUR CODE HERE

**Second, extract where they are from (the name of the State) and create a new column called 'state'.**

In [None]:
#YOUR CODE HERE

**Third, extract the name of the politician and create a new column called 'politician'.**

In [None]:
#YOUR CODE HERE