# ADS 509 Module 3: Group Comparison 

The task of comparing two groups of text is fundamental to textual analysis. There are innumerable applications: survey respondents from different segments of customers, speeches by different political parties, words used in Tweets by different constituencies, etc. In this assignment you will build code to effect comparisons between groups of text data, using the ideas learned in reading and lecture.

This assignment asks you to analyze the lyrics and Twitter descriptions for the two artists you selected in Module 1. If the results from that pull were not to your liking, you are welcome to use the zipped data from the “Assignment Materials” section. Specifically, you are asked to do the following: 

* Read in the data, normalize the text, and tokenize it. When you tokenize your Twitter descriptions, keep hashtags and emojis in your token set. 
* Calculate descriptive statistics on the two sets of lyrics and compare the results. 
* For each of the four corpora, find the words that are unique to that corpus. 
* Build word clouds for all four corpora. 

Each one of the analyses has a section dedicated to it below. Before beginning the analysis there is a section for you to read in the data and do your cleaning (tokenization and normalization). 


In [1]:
import os
import re
import emoji
import pandas as pd

from collections import Counter, defaultdict
from nltk.corpus import stopwords
from string import punctuation
from wordcloud import WordCloud 

from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer


In [2]:
# Use this space for any additional import statements you need
import ast


In [61]:
# Place any additional functions or constants you need here. 
# The functions defined below form a pipeline that normalizes and tokenizes text while keeping hashtags and emojis

# Some punctuation variations
punctuation = set(punctuation) # speeds up comparison; a set of all punctuation
tw_punct = punctuation - {"#"} # this omits the hashtag from the punctuation set

# define stopwords
sw = set(stopwords.words("english"))

# Two useful regex
# first to ID whitespace and second to ID hashtags
whitespace_pattern = re.compile(r"\s+")
hashtag_pattern = re.compile(r"^#[0-9a-zA-Z]+")

# It's handy to have a full set of emojis
all_language_emojis = set()

# UNICODE_EMOJI is deprecated in the emoji package, so emoji.is_emoji is used
# for emoji checks instead (see is_emoji below). emoji.is_emoji is a function,
# not a per-language dictionary, so the original loop that filled
# all_language_emojis no longer applies; the set is left empty and unused.

# and now our functions
def descriptive_stats(tokens, num_tokens=10, verbose=True):
    """
        Given a list of tokens, print number of tokens, number of unique tokens, 
        number of characters, lexical diversity, and the num_tokens most common
        tokens. Return a list of [total tokens, unique tokens, lexical diversity,
        characters].
    """

    # Place your Module 2 solution here
    # use a separate name so the num_tokens argument (how many top tokens to show) is not overwritten
    total_tokens = len(tokens) # len counts the items in a list
    # see https://www.geeksforgeeks.org/how-to-count-unique-values-inside-a-list/
    num_unique_tokens = len(set(tokens)) # a set drops duplicates, so its length is the number of unique tokens
    # guard against an empty token list
    lexical_diversity = num_unique_tokens / total_tokens if total_tokens else 0.0
    # see https://stackoverflow.com/questions/25934586/finding-the-amount-of-characters-of-all-words-in-a-list-in-python
    num_characters = sum(len(tok) for tok in tokens)
    
    if verbose:
        print(f"There are {total_tokens} tokens in the data.")
        print(f"There are {num_unique_tokens} unique tokens in the data.")
        print(f"There are {num_characters} characters in the data.")
        print(f"The lexical diversity is {lexical_diversity:.3f} in the data.")

        # print the num_tokens most common tokens
        # see https://www.geeksforgeeks.org/find-k-frequent-words-data-set-python/
        print(Counter(tokens).most_common(num_tokens))
        
    return [total_tokens, num_unique_tokens,
            lexical_diversity,
            num_characters]

# lexical diversity is a measure of the number of unique words in a text
# lexical diversity = number of unique words/ total number of words   
# code from descriptive_stats defined function taken from Module 2 Assignment, cell #4

# see slack from Prof. Marbut for change
def is_emoji(s):
    return(emoji.is_emoji(s))

def contains_emoji(s):
    s = str(s)
    emojis = [ch for ch in s if is_emoji(ch)]
    return(len(emojis) > 0)

# below is a function to remove stop tokens
# see https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
def remove_stop(tokens) :
    # modify this function to remove stopwords
    return[w for w in tokens if w not in sw]
 
def remove_punctuation(text, punct_set=tw_punct) : 
    return("".join([ch for ch in text if ch not in punct_set]))

def tokenize(text) : 
    """ Splitting on whitespace rather than the book's tokenize function. That 
        function will drop tokens like '#hashtag' or '2A', which we need for Twitter. """
    # whitespace_pattern was compiled earlier; return a list (not a generator)
    # so that len() and repeated iteration work on the result
    return whitespace_pattern.split(text)

# let's define a lowercase function
# see https://www.programiz.com/python-programming/methods/string/casefold

# def lowercase(text):
    # return(text.casefold())
    # was going to define lowercase function, but already defined a few cells down

# everything ^^^ is to prepare to load into vvv; a pipeline of functions to leave us with
# twitter data that includes hashtags and emojis
# no need to define a string conversion as it is already provided
def prepare(text, pipeline) : 
    tokens = str(text)
    for transform in pipeline : 
        tokens = transform(tokens)    
    return(tokens)


## Data Ingestion

Use this section to ingest your data into the data structures you plan to use. Typically this will be a dictionary or a pandas DataFrame.

In [4]:
# Feel free to use the below cells as an example or read in the data in a way you prefer
# we'll use the data pulled from Module 1
data_location = "/Users/evachow/Documents/GitHub/ADS509/ADS509_Module_1/"
twitter_folder = "twitter/"
lyrics_folder = "lyrics/"

artist_files = {'mychemicalromance':'MCRofficial_followers.txt',
                'missy':'MissyElliott_followers.txt'}


In [5]:
twitter_data = pd.read_csv(data_location + twitter_folder + artist_files['mychemicalromance'],
                           sep='\t',
                           lineterminator='\n', # set explicitly because the file failed to parse otherwise
                           on_bad_lines='warn', # warn about and skip lines with extra fields; replaces the deprecated error_bad_lines=False (pandas >= 1.3)
                           quoting=3)

twitter_data['artist'] = "mychemicalromance"



b'Skipping line 75449: expected 7 fields, saw 8\nSkipping line 76732: expected 7 fields, saw 8\nSkipping line 86300: expected 7 fields, saw 11\nSkipping line 107413: expected 7 fields, saw 8\n'


In [6]:
twitter_data_2 = pd.read_csv(data_location + twitter_folder + artist_files['missy'],
                             sep='\t',
                             lineterminator='\n',
                             on_bad_lines='warn', # replaces the deprecated error_bad_lines=False (pandas >= 1.3)
                             quoting=3)
twitter_data_2['artist'] = "missy"

twitter_data = pd.concat([
    twitter_data,twitter_data_2])
    
del(twitter_data_2)



b'Skipping line 66847: expected 7 fields, saw 8\nSkipping line 69573: expected 7 fields, saw 8\nSkipping line 73464: expected 7 fields, saw 23\n'


In [7]:
# let's check the twitter data dataframes
twitter_data.head(5)

Unnamed: 0,id,username,name,location,follower_count,following,description,artist
0,1132811754176765952,nutman71234668,quaintqueef420,,26.0,509.0,i smelly,mychemicalromance
1,1570075902817583106,DemoLoversMCR,Demolition Lovers Gang - MCR,"Newark, New Jersey",1.0,6.0,Official petition. We want to hear Demoition L...,mychemicalromance
2,1537290582556454914,Meooowcy,CréamyLatté,"Lungsod ng Valenzuela, Pambans",2.0,65.0,,mychemicalromance
3,1232241829191593984,jamieexisted,jamie,"he/him, 19",33.0,310.0,🏳️‍🌈🏳️‍⚧️,mychemicalromance
4,1570087687683555328,KarmenWeaks,Karmen Weaks,,0.0,89.0,,mychemicalromance


In [8]:
# read in the lyrics here
# this is taken from assignment submitted for Module 2 with modifications
# define our unique artists
artist = ['mychemicalromance', 'missy']

# define the path (define the lyric folder then the folder for each artist)
path_lyrics = data_location + lyrics_folder

# see https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory
path_lyrics_artist = [f for f in os.listdir(path_lyrics) if not f.startswith('.')]

# collect the artist and the lyrics of each song
singer = [] # named singer so it does not clash with the artist list defined above
lyric = []

# first, a loop to go through each artist folder
for artist_name in path_lyrics_artist:
    # os.path.join avoids the doubled slash from path_lyrics + '/' + name
    artist_folder = os.path.join(path_lyrics, artist_name)
    unique_song = os.listdir(artist_folder)
    
    # now a loop to go through each of that artist's songs
    for music in unique_song:
        path_songs = os.path.join(artist_folder, music)
        # see https://www.adamsmith.haus/python/answers/how-to-read-a-text-file-into-a-list-in-python
        with open(path_songs, encoding="utf-8") as the_path:
            the_songs = the_path.readlines()
            
        singer.append(artist_name)
        # skip the first line of each file (assumed to hold the song title)
        lyric.append(''.join(the_songs[1:]))

In [9]:
# dataframe to check read in
headers = {'artist': singer,
          'lyric': lyric}
lyrics_data = pd.DataFrame(headers)
lyrics_data.head(5)

Unnamed: 0,artist,lyric
0,mychemicalromance,\n\n\nGravity don't mean too much to me\nI'm w...
1,mychemicalromance,\n\n\nAnd we can run from the backdrop of thes...
2,mychemicalromance,\n\n\nWe could be perfect one last night\nAnd ...
3,mychemicalromance,"\n\n\nYou're not in this alone, let me break t..."
4,mychemicalromance,"\n\n\n""They're, they're these terrors, and it'..."


In [10]:
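# remove the newline characters that appear in the lyrics
# note: replacing '\n' with '' fuses the last word of one line with the first
# word of the next (e.g. "meI'm" below); replacing with ' ' would avoid this
lyrics_data = lyrics_data.replace(r'\n', '', regex=True)
lyrics_data.head(5)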

Unnamed: 0,artist,lyric
0,mychemicalromance,Gravity don't mean too much to meI'm who I've ...
1,mychemicalromance,And we can run from the backdrop of these gear...
2,mychemicalromance,We could be perfect one last nightAnd die like...
3,mychemicalromance,"You're not in this alone, let me break this aw..."
4,mychemicalromance,"""They're, they're these terrors, and it's like..."


## Tokenization and Normalization

In this next section, tokenize and normalize your data. We recommend the following cleaning. 

**Lyrics** 

* Remove song titles
* Casefold to lowercase
* Remove punctuation
* Split on whitespace
* Remove stopwords (optional)

Removal of stopwords is up to you. Your descriptive statistic comparison will be different if you include stopwords, though TF-IDF should still find interesting features for you.

**Twitter Descriptions** 

* Casefold to lowercase
* Remove punctuation other than emojis or hashtags
* Split on whitespace
* Remove stopwords

Removing stopwords seems sensible for the Twitter description data. Remember to leave in emojis and hashtags, since you analyze those. 
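The Twitter cleaning steps above can be sketched on a single string as follows. This is a minimal illustration, not the notebook's pipeline: `clean_description`, `DEMO_STOPWORDS`, and `TW_PUNCT` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, and the guard for missing values is an addition that the notebook's own functions do not include.

```python
from string import punctuation

# Hypothetical, tiny stopword list for illustration only;
# the notebook itself uses NLTK's English stopword list.
DEMO_STOPWORDS = {"i", "we", "to", "the", "a", "and"}

# Keep "#" out of the punctuation set so hashtags survive removal.
# Emojis are not in string.punctuation, so they are kept as well.
TW_PUNCT = set(punctuation) - {"#"}

def clean_description(text, stopwords=DEMO_STOPWORDS):
    """Casefold, strip punctuation (except '#'), split on whitespace, drop stopwords."""
    # Treat missing values (None or float NaN) as empty
    # instead of turning them into the string "nan".
    if text is None or (isinstance(text, float) and text != text):
        return []
    text = str(text).casefold()
    text = "".join(ch for ch in text if ch not in TW_PUNCT)
    return [tok for tok in text.split() if tok not in stopwords]

print(clean_description("Official petition. We want #MCR!"))
# ['official', 'petition', 'want', '#mcr']
print(clean_description(float("nan")))
# []
```

Returning an empty list for missing descriptions is one possible design choice; it keeps a literal `nan` token out of the counts.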

In [11]:
# apply the `pipeline` techniques from BTAP Ch 1 or 5
# we'll apply the pipeline functions to both, so stopwords will be removed for twitter and lyrics data

my_pipeline = [str.lower, remove_punctuation, tokenize, remove_stop]

lyrics_data["tokens"] = lyrics_data["lyric"].apply(prepare, pipeline=my_pipeline)
lyrics_data["num_tokens"] = lyrics_data["tokens"].map(len) 

twitter_data["tokens"] = twitter_data["description"].apply(prepare,pipeline=my_pipeline)
twitter_data["num_tokens"] = twitter_data["tokens"].map(len) 


In [12]:
twitter_data['has_emoji'] = twitter_data["description"].apply(contains_emoji)

In [13]:
twitter_data.head(5)

Unnamed: 0,id,username,name,location,follower_count,following,description,artist,tokens,num_tokens,has_emoji
0,1132811754176765952,nutman71234668,quaintqueef420,,26.0,509.0,i smelly,mychemicalromance,[smelly],1,False
1,1570075902817583106,DemoLoversMCR,Demolition Lovers Gang - MCR,"Newark, New Jersey",1.0,6.0,Official petition. We want to hear Demoition L...,mychemicalromance,"[official, petition, want, hear, demoition, lo...",15,False
2,1537290582556454914,Meooowcy,CréamyLatté,"Lungsod ng Valenzuela, Pambans",2.0,65.0,,mychemicalromance,[nan],1,False
3,1232241829191593984,jamieexisted,jamie,"he/him, 19",33.0,310.0,🏳️‍🌈🏳️‍⚧️,mychemicalromance,[🏳️‍🌈🏳️‍⚧️],1,True
4,1570087687683555328,KarmenWeaks,Karmen Weaks,,0.0,89.0,,mychemicalromance,[nan],1,False


Let's take a quick look at some descriptions with emojis.

In [14]:
twitter_data[twitter_data.has_emoji].sample(10)[["artist","description","tokens","num_tokens"]]

Unnamed: 0,artist,description,tokens,num_tokens
78548,mychemicalromance,24 ♎️ | 🌱 | PA | @asking_andrew ‘s wife 💍 | Wi...,"[24, ♎️, 🌱, pa, askingandrew, ‘s, wife, 💍, win...",16
20660,missy,clearing and forwarding Agent@PERO RELIABLE NI...,"[clearing, forwarding, agentpero, reliable, ni...",13
57944,mychemicalromance,photography (protect the environment) (cat da...,"[photography, protect, environment, cat, dad❤️]",5
105341,missy,Mr k👀l GUY,"[mr, k👀l, guy]",3
70745,mychemicalromance,Just a gender fluid witch🌑🌒🌓🌔🌕🌖🌗🌘 trying figur...,"[gender, fluid, witch🌑🌒🌓🌔🌕🌖🌗🌘, trying, figure,...",10
84851,mychemicalromance,blockers since 10/25/21| genderless nightmares...,"[blockers, since, 102521, genderless, nightmar...",11
111286,missy,None of your business🥰,"[none, business🥰]",2
34745,missy,Jayce 🤎🤎🤎,"[jayce, 🤎🤎🤎]",2
95129,mychemicalromance,🥀🩸 𝖇𝖔𝖗𝖓 𝖜𝖎𝖙𝖍 𝖍𝖔𝖗𝖓𝖘,"[🥀🩸, 𝖇𝖔𝖗𝖓, 𝖜𝖎𝖙𝖍, 𝖍𝖔𝖗𝖓𝖘]",4
75474,mychemicalromance,I love my friends in the puter • language nerd...,"[love, friends, puter, •, language, nerd, japa...",16


In [15]:
lyrics_tok_nom = lyrics_data[['artist', 'lyric', 'tokens', 'num_tokens']]
lyrics_tok_nom.sample(10)

Unnamed: 0,artist,lyric,tokens,num_tokens
135,missy,"Any Give Sunday babyOww! Yo, yo, yoOh zigi-zig...","[give, sunday, babyoww, yo, yo, yooh, zigizigi...",319
81,mychemicalromance,"Love, was just a passing phaseA fever that ran...","[love, passing, phasea, fever, ran, away, til,...",139
171,missy,[Intro: Missy]This is a Misdemeanor exclusiveI...,"[intro, missythis, misdemeanor, exclusiveif, r...",313
163,missy,[Verse One:]You never know a good thing till I...,"[verse, oneyou, never, know, good, thing, till...",160
34,mychemicalromance,"When I was a young boy, my fatherTook me into ...","[young, boy, fathertook, city, see, marching, ...",242
12,mychemicalromance,See the man who stands upon the hillHe dreams ...,"[see, man, stands, upon, hillhe, dreams, battl...",136
16,mychemicalromance,みんあ！くるま わ まんたん だしすーつけーす に わ ばくだん つめてーいるじんせい わ ...,"[みんあ！くるま, わ, まんたん, だしすーつけーす, に, わ, ばくだん, つめてーい...",283
61,mychemicalromance,It's the tearing sound of love notesDrowning o...,"[tearing, sound, love, notesdrowning, gray, st...",215
28,mychemicalromance,Late dawns and early sunsetsJust like my favor...,"[late, dawns, early, sunsetsjust, like, favori...",152
32,mychemicalromance,"Make room! Make room!Down on the coffin, there...","[make, room, make, roomdown, coffin, theres, c...",164


With the data processed, we can now start work on the assignment questions. 

Q: What is one area of improvement to your tokenization that you could theoretically carry out? (No need to actually do it; let's not make perfect the enemy of good enough.)

A: A consecutive series of emojis is treated as a single token. With more time, each emoji sequence could be split (much like the letters of a word) so that every emoji becomes its own token. Similarly, text and emojis written together without spacing form a single token, as in `business🥰`; these could be separated so that the word and the emoji each count as a token. Finally, Missy Elliott's lyrics may call for a customized stopword list that also omits "filler" words such as "uh" and "ohhh", which are sung to add groove.
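A rough sketch of the emoji-splitting idea in the answer above is shown here. The names `split_emojis` and `rough_is_emoji` are hypothetical, and the default predicate is only an approximation (it treats Unicode "Symbol, other" characters as emojis); in the notebook one could pass `emoji.is_emoji` as the predicate instead. The sketch works one code point at a time, so multi-code-point emojis such as flags or ZWJ sequences would be broken apart.

```python
import unicodedata

def rough_is_emoji(ch):
    # Rough stand-in (an assumption, not the emoji package):
    # treat characters in Unicode category "So" (Symbol, other) as emojis.
    return unicodedata.category(ch) == "So"

def split_emojis(token, is_emoji_char=rough_is_emoji):
    """Split a token into runs of non-emoji text and individual emoji characters."""
    pieces = []
    buffer = ""  # collects consecutive non-emoji characters
    for ch in token:
        if is_emoji_char(ch):
            if buffer:
                pieces.append(buffer)
                buffer = ""
            pieces.append(ch)  # each emoji character becomes its own token
        else:
            buffer += ch
    if buffer:
        pieces.append(buffer)
    return pieces

print(split_emojis("k👀l"))
print(split_emojis("guy"))
```

Applied after `tokenize`, a step like this would turn one fused token into several, which would also change the token counts reported later.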

## Calculate descriptive statistics on the two sets of lyrics and compare the results. 


In [52]:
# the following returns a dataframe with tokens for respective artist lyrics data
mcr_tokens = lyrics_tok_nom[lyrics_tok_nom['artist'] == 'mychemicalromance']
missy_tokens = lyrics_tok_nom[lyrics_tok_nom['artist'] == 'missy']

# this results in a column where every row is a list
# now we need to make a list for each artists' tokens
mcr_tokens_list = []
missy_tokens_list = []

In [53]:
# we define each of the tokens
# extend adds our tokens from the column of lists of tokens for that artist
# to the list of all tokens
# see https://www.programiz.com/python-programming/methods/list/extend
for token in mcr_tokens['tokens']:
    mcr_tokens_list.extend(token)
    
for token in missy_tokens['tokens']:
    missy_tokens_list.extend(token)

In [62]:
print('\nDescriptive statistics for My Chemical Romance lyrics:')
descriptive_stats(mcr_tokens_list, verbose=True)


Descriptive statistics for My Chemical Romance lyrics:
There are 15355 tokens in the data.
There are 4377 unique tokens in the data.
There are 86365 characters in the data.
The lexical diversity is 0.285 in the data.
[('dont', 225), ('never', 205), ('na', 198), ('im', 179), ('get', 174), ('like', 144), ('got', 140), ('go', 117), ('well', 110), ('want', 102)]


[15355, 4377, 0.28505372842722243, 86365]

In [63]:
print('\nDescriptive statistics for Missy Elliott lyrics:')
descriptive_stats(missy_tokens_list, verbose=True)


Descriptive statistics for Missy Elliott lyrics:
There are 29517 tokens in the data.
There are 9560 unique tokens in the data.
There are 161824 characters in the data.
The lexical diversity is 0.324 in the data.
[('like', 556), ('im', 449), ('dont', 362), ('get', 349), ('got', 307), ('know', 267), ('make', 197), ('wanna', 168), ('missy', 163), ('come', 158)]


[29517, 9560, 0.32388115323372973, 161824]

Q: What observations do you make about these data? 

A: One might assume that, as a rapper, Missy Elliott would show a much higher lexical diversity than My Chemical Romance. In fact the two are fairly close: 0.32 for Missy Elliott versus 0.29 for My Chemical Romance, so her language is only slightly more diverse by this measure. The larger difference is in volume: after stopword removal, Missy Elliott's lyrics contain roughly twice as many tokens (29,517 vs. 15,355) and a little more than twice as many unique tokens (9,560 vs. 4,377).


## Find tokens uniquely related to a corpus

Typically we would use TF-IDF to find unique tokens in documents. Unfortunately, we either have too few documents (if we view each data source as a single document) or too many (if we view each description as a separate document). In the latter case, our problem will be that descriptions tend to be short, so our matrix would be too sparse to support analysis. 

To avoid these problems, we will create a custom statistic to identify words that are uniquely related to each corpus. The idea is to find words that occur often in one corpus and infrequently in the other(s). Since corpora can be of different lengths, we will focus on the _concentration_ of tokens within a corpus. "Concentration" is simply the count of the token divided by the total corpus length. For instance, if a corpus had length 100,000 and a word appeared 1,000 times, then the concentration would be $\frac{1000}{100000} = 0.01$. If the same token had a concentration of $0.005$ in another corpus, then the concentration ratio would be $\frac{0.01}{0.005} = 2$. Very rare words can easily create infinite ratios, so you will also add a cutoff to your code so that a token must appear at least $n$ times for you to return it. 

An example of these calculations can be found in [this spreadsheet](https://docs.google.com/spreadsheets/d/1P87fkyslJhqXFnfYezNYrDrXp_GS8gwSATsZymv-9ms). Please don't hesitate to ask questions if this is confusing. 

In this section find 10 tokens for each of your four corpora that meet the following criteria: 

1. The token appears at least `n` times in all corpora
1. The tokens are in the top 10 for the highest ratio of appearances in a given corpus vs appearances in other corpora.

You will choose a cutoff for yourself based on the size of the corpus you're working with. If you're working with the Robyn-Cher corpora provided, `n=5` seems to perform reasonably well.
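The concentration-ratio statistic described in this section could be sketched as follows. This is one possible reading of the description, not the reference solution: the function name `concentration_ratios`, its parameters, and the toy corpora are assumptions made for illustration. It compares one corpus against a single other corpus; with four corpora, the "other" side could be the remaining corpora pooled together or each one in turn.

```python
from collections import Counter

def concentration_ratios(tokens_a, tokens_b, min_count=5, top_n=10):
    """Rank tokens by concentration in corpus A relative to corpus B.

    Returns up to top_n tuples (token, conc_a, conc_b, ratio), highest ratio first.
    A token is kept only if it appears at least min_count times in BOTH corpora.
    """
    if min_count < 1:
        # A cutoff of at least 1 guarantees a nonzero denominator below.
        raise ValueError("min_count must be at least 1")

    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    len_a, len_b = len(tokens_a), len(tokens_b)

    rows = []
    for token, count_a in counts_a.items():
        count_b = counts_b.get(token, 0)
        if count_a < min_count or count_b < min_count:
            continue
        conc_a = count_a / len_a  # concentration = count / corpus length
        conc_b = count_b / len_b
        rows.append((token, conc_a, conc_b, conc_a / conc_b))

    rows.sort(key=lambda row: row[3], reverse=True)
    return rows[:top_n]

# Toy corpora (made up): "love" is twice as concentrated in corpus A.
corpus_a = ["love"] * 10 + ["night"] * 10
corpus_b = ["love"] * 5 + ["night"] * 15
for token, conc_a, conc_b, ratio in concentration_ratios(corpus_a, corpus_b):
    print(f"{token}: {conc_a:.3f} vs {conc_b:.3f}, ratio {ratio:.2f}")
```

In the notebook, the inputs would be flat token lists such as `mcr_tokens_list` and `missy_tokens_list`; swapping the two arguments gives the ranking for the other artist.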

In [18]:
# let's view the top 10 tokens for each artist and see the concentration of that token in 
# their respective artist as well as the other artist for their lyrics and follower descriptions

Q: What are some observations about the top tokens? Do you notice any interesting items on the list? 

A: 

## Build word clouds for all four corpora. 

For building wordclouds, we'll follow exactly the code of the text. The code in this section can be found [here](https://github.com/blueprints-for-text-analytics-python/blueprints-text/blob/master/ch01/First_Insights.ipynb). If you haven't already, you should absolutely clone the repository that accompanies the book. 


In [19]:
from matplotlib import pyplot as plt

def wordcloud(word_freq, title=None, max_words=200, stopwords=None):

    wc = WordCloud(width=800, height=400, 
                   background_color= "black", colormap="Paired", 
                   max_font_size=150, max_words=max_words)
    
    # convert data frame into dict
    if type(word_freq) == pd.Series:
        counter = Counter(word_freq.fillna(0).to_dict())
    else:
        counter = word_freq

    # filter stop words in frequency counter
    if stopwords is not None:
        counter = {token:freq for (token, freq) in counter.items() 
                              if token not in stopwords}
    wc.generate_from_frequencies(counter)
 
    plt.title(title) 

    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    
    
def count_words(df, column='tokens', preprocess=None, min_freq=2):

    # process tokens and update counter
    def update(doc):
        tokens = doc if preprocess is None else preprocess(doc)
        counter.update(tokens)

    # create counter and run through all data
    counter = Counter()
    df[column].map(update)

    # transform counter into data frame
    freq_df = pd.DataFrame.from_dict(counter, orient='index', columns=['freq'])
    freq_df = freq_df.query('freq >= @min_freq')
    freq_df.index.name = 'token'
    
    return freq_df.sort_values('freq', ascending=False)
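To draw one cloud per corpus, each corpus needs its own token-frequency mapping. The sketch below shows one way to build those mappings using only the standard library; `corpus_frequencies` and the toy rows are hypothetical. In the notebook, the rows could come from something like `zip(lyrics_data["artist"], lyrics_data["tokens"])`, and each resulting dictionary could then be passed to the `wordcloud` function defined above, which accepts a plain dictionary of frequencies as well as a pandas Series.

```python
from collections import Counter

def corpus_frequencies(rows, min_freq=2):
    """Build {corpus_name: {token: frequency}} from (corpus_name, tokens) pairs.

    Tokens that occur fewer than min_freq times within a corpus are dropped.
    """
    counters = {}
    for corpus_name, tokens in rows:
        # One Counter per corpus, updated document by document.
        counters.setdefault(corpus_name, Counter()).update(tokens)
    return {
        name: {tok: n for tok, n in counter.items() if n >= min_freq}
        for name, counter in counters.items()
    }

# Toy rows (made up) standing in for (artist, tokens) pairs from a DataFrame.
rows = [
    ("missy", ["get", "like", "like"]),
    ("mychemicalromance", ["never", "never", "na"]),
    ("missy", ["get"]),
]
freqs = corpus_frequencies(rows)
print(freqs["missy"])
print(freqs["mychemicalromance"])
```

The `min_freq` filter mirrors the one in `count_words`; lowering it to 1 keeps every token.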

Q: What observations do you have about these (relatively straightforward) wordclouds? 

A: 