# ADS 509 Module 3: Group Comparison 

The task of comparing two groups of text is fundamental to textual analysis. There are innumerable applications: survey respondents from different segments of customers, speeches by different political parties, words used in Tweets by different constituencies, etc. In this assignment you will build code to effect comparisons between groups of text data, using the ideas learned in reading and lecture.

This assignment asks you to analyze the lyrics for the two artists you selected in Module 1 and the Twitter descriptions pulled for Robyn and Cher. If the results from that pull were not to your liking, you are welcome to use the zipped data from the “Assignment Materials” section. Specifically, you are asked to do the following: 

* Read in the data, normalize the text, and tokenize it. When you tokenize your Twitter descriptions, keep hashtags and emojis in your token set. 
* Calculate descriptive statistics on the two sets of lyrics and compare the results. 
* For each of the four corpora, find the words that are unique to that corpus. 
* Build word clouds for all four corpora. 

Each one of the analyses has a section dedicated to it below. Before beginning the analysis there is a section for you to read in the data and do your cleaning (tokenization and normalization). 


In [22]:
import os
import re
import emoji
import pandas as pd

from collections import Counter, defaultdict
from nltk.corpus import stopwords
from string import punctuation
from wordcloud import WordCloud 

from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

In [23]:
# Use this space for any additional import statements you need
import warnings

In [31]:
# Place any additional functions or constants you need here. 

# Some punctuation variations
punctuation = set(punctuation) # speeds up comparison
tw_punct = punctuation - {"#"}

# Stopwords (stored as a set so membership checks are fast)
sw = set(stopwords.words("english"))

# Two useful regex
whitespace_pattern = re.compile(r"\s+")
hashtag_pattern = re.compile(r"^#[0-9a-zA-Z]+")

# It's handy to have a full set of emojis
# In emoji >= 2.0, EMOJI_DATA is keyed by the emoji strings themselves,
# so its keys are the full emoji set
all_language_emojis = set(emoji.EMOJI_DATA)

# and now our functions
def descriptive_stats(tokens, num_tokens = 5, verbose=True) :
    """
        Given a list of tokens, print number of tokens, number of unique tokens, 
        number of characters, lexical diversity, and num_tokens most common
        tokens. Return a list of [number of tokens, number of unique tokens,
        lexical diversity, number of characters].
    """

    # Keep the requested top-N separate from the total token count
    top_n = num_tokens
    total_tokens = len(tokens)
    num_unique_tokens = len(set(tokens))
    # Guard against division by zero for an empty token list
    lexical_diversity = num_unique_tokens / total_tokens if total_tokens else 0.0
    num_characters = sum(len(token) for token in tokens)
    
    if verbose :        
        print(f"There are {total_tokens} tokens in the data.")
        print(f"There are {num_unique_tokens} unique tokens in the data.")
        print(f"There are {num_characters} characters in the data.")
        print(f"The lexical diversity is {lexical_diversity:.3f} in the data.")
    
        # Print the top_n most common tokens
        print(Counter(tokens).most_common(top_n))
        
    return([total_tokens, num_unique_tokens,
            lexical_diversity,
            num_characters])


    
def contains_emoji(s):
    
    s = str(s)
    emojis = [ch for ch in s if emoji.is_emoji(ch)]

    return(len(emojis) > 0)


def remove_stop(tokens) :
    
    tokens = [word for word in tokens if word not in sw]
    
    return(tokens)
 
def remove_punctuation(text, punct_set=tw_punct) : 
    return("".join([ch for ch in text if ch not in punct_set]))


def tokenize(text) : 
    """ Splitting on whitespace rather than the book's tokenize function. That 
        function will drop tokens like '#hashtag' or '2A', which we need for Twitter. """
    
    return text.split()

def prepare(text, pipeline) : 
    tokens = str(text)
    
    for transform in pipeline : 
        tokens = transform(tokens)
        
    return(tokens)

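# Function to remove the song title from the lyrics
def remove_title(text):
   
    # The song title is everything before the first \n.
    # partition avoids an IndexError when the text has no newline
    # (in that case the returned lyrics are empty).
    _, _, lyrics = text.partition('\n')
            
    return lyrics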


## Data Ingestion

Use this section to ingest your data into the data structures you plan to use. Typically this will be a dictionary or a pandas DataFrame.

In [32]:
# Data location
data_location = "/Users/clairebentzen/Desktop/MDAS/ADS 509 - Applied Text Mining/Module 2/Assignment2.1/M1 Results/"

# Subfolders data from the Module 1 assignment
twitter_folder = "twitter/"
lyrics_folder = "lyrics/"

# Specify artist_files for twitter data
artist_files = {'cher':'cher_followers_data.txt',
                'robyn':'robynkonichiwa_followers_data.txt'}

In [33]:
# Read cher twitter data
twitter_data = pd.read_csv(data_location + twitter_folder + artist_files['cher'],
                           sep="\t",
                           quoting=3)

twitter_data['artist'] = "cher"

In [34]:
# Read robyn twitter 
twitter_data_2 = pd.read_csv(data_location + twitter_folder + artist_files['robyn'],
                             sep="\t",
                             quoting=3)
twitter_data_2['artist'] = "robyn"

# Concat twitter dataframes
twitter_data = pd.concat([
    twitter_data,twitter_data_2])
    
del(twitter_data_2)

In [35]:
# Read in the lyrics data
# Specify pathway to lyrics folder
lyrics_path = data_location + lyrics_folder

# Collect one dict per song, then build the dataframe in a single step
# (DataFrame.append is deprecated and was removed in pandas 2.0)
rows = []

# Iterate through each artist folder in the lyrics folder
for artist in os.listdir(lyrics_path):
    artist_path = os.path.join(lyrics_path, artist)

    # Skip stray files (for example .DS_Store) that are not artist folders
    if not os.path.isdir(artist_path):
        continue
    
    # Iterate through each file in the artist folders
    for song in os.listdir(artist_path):
        song_path = os.path.join(artist_path, song)
        rem_prefix = song.removeprefix(f'{artist}_')
        song_title = rem_prefix.removesuffix('.txt')

        # Open and read the contents of the file (song)
        with open(song_path, 'r') as file:
            contents = file.read()

        # Prepare a row of data for the dataframe
        rows.append({'artist': artist, 'song': song_title, 'lyrics': contents})

# Create a dataframe to store results
lyrics_data = pd.DataFrame(rows, columns=['artist', 'song', 'lyrics'])

## Tokenization and Normalization

In this next section, tokenize and normalize your data. We recommend the following cleaning. 

**Lyrics** 

* Remove song titles
* Casefold to lowercase
* Remove stopwords (optional)
* Remove punctuation
* Split on whitespace

Removal of stopwords is up to you. Your descriptive statistic comparison will be different if you include stopwords, though TF-IDF should still find interesting features for you. Note that we remove stopwords before removing punctuation because the stopword list includes words that contain punctuation (contractions such as "don't").
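To make the effect of that ordering concrete, here is a minimal sketch that runs both orders on a toy sentence. The `demo_stopwords` set and the function names `stop_then_punct` and `punct_then_stop` are hypothetical stand-ins for illustration; the real NLTK stopword list is much longer.

```python
from string import punctuation

# Hypothetical, tiny stopword set used only for this illustration
demo_stopwords = {"i'm", "don't", "you"}
punct = set(punctuation)


def strip_punct(text):
    """Drop every ASCII punctuation character from a string."""
    return "".join(ch for ch in text if ch not in punct)


def stop_then_punct(text):
    """Casefold, split, remove stopwords, then strip punctuation per token."""
    tokens = text.lower().split()
    tokens = [tok for tok in tokens if tok not in demo_stopwords]
    tokens = [strip_punct(tok) for tok in tokens]
    # Discard tokens that were nothing but punctuation
    return [tok for tok in tokens if tok]


def punct_then_stop(text):
    """Casefold, strip punctuation, split, then remove stopwords."""
    cleaned = strip_punct(text.lower())
    return [tok for tok in cleaned.split() if tok not in demo_stopwords]


sentence = "I'm sure you don't know"
print(stop_then_punct(sentence))
print(punct_then_stop(sentence))
```

When punctuation is stripped first, "i'm" becomes "im" and no longer matches the stopword list, so it survives as a token. Removing stopwords first has its own gap: a token with trailing punctuation such as "you," would not match "you" either.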

**Twitter Descriptions** 

* Casefold to lowercase
* Remove stopwords
* Remove punctuation other than emojis or hashtags
* Split on whitespace

Removing stopwords seems sensible for the Twitter description data. Remember to leave in emojis and hashtags, since you analyze those. 

In [36]:
# Lyrics pipeline: remove title, casefold, remove punctuation, split on whitespace, remove stopwords
lyrics_pipeline = [remove_title, str.lower, remove_punctuation, tokenize, remove_stop]

# Twitter pipeline: casefold, remove punctuation (keeping "#"), split on whitespace, remove stopwords
twitter_pipeline = [str.lower, remove_punctuation, tokenize, remove_stop]

# Apply pipeline to lyrics
lyrics_data["tokens"] = lyrics_data["lyrics"].apply(prepare, pipeline=lyrics_pipeline)
# Calculate number of tokens in each row
lyrics_data["num_tokens"] = lyrics_data["tokens"].map(len) 

# Apply pipeline to twitter descriptions
twitter_data["tokens"] = twitter_data["description"].apply(prepare,pipeline=twitter_pipeline)
# Calculate number of tokens in each row
twitter_data["num_tokens"] = twitter_data["tokens"].map(len) 

In [43]:
# Check which twitter descriptions have emojis
twitter_data['has_emoji'] = twitter_data["description"].apply(contains_emoji)

Let's take a quick look at some descriptions with emojis.

In [45]:
twitter_data[twitter_data.has_emoji].sample(10)[["artist","description","tokens"]]

Unnamed: 0,artist,description,tokens
1277283,cher,Mom who is sick of all these damn racists/bigo...,"[mom, sick, damn, racistsbigotsxenophobes, bul..."
312403,cher,I am low key a living meme 😃,"[low, key, living, meme, 😃]"
1284757,cher,World citizen. Kind of shy here. ✌️,"[world, citizen, kind, shy, ✌️]"
1475074,cher,"I ❤my family, my little dogs. I like to bake f...","[❤my, family, little, dogs, like, bake, scratc..."
1315427,cher,"maybe just hit 50 but age, mheeee 🤷🏻‍♂️, educa...","[maybe, hit, 50, age, mheeee, 🤷🏻‍♂️, educates,..."
24391,cher,🇵🇭 l 📚 Epistemophile l 🎥 Primera Cinema Produc...,"[🇵🇭, l, 📚, epistemophile, l, 🎥, primera, cinem..."
1098649,cher,just here to fulfill my fantasies 😋,"[fulfill, fantasies, 😋]"
263416,cher,💚💛❤️,[💚💛❤️]
188180,robyn,♥ ♥ ♥ ♥ ♥ ♥ ♥ ♥ ♥ ♥ ♥ ♥ ♥ ♥ ♥ ♥ ♥ ♥ ♥ ♥ ♥ ♥ ♥ ...,"[♥, ♥, ♥, ♥, ♥, ♥, ♥, ♥, ♥, ♥, ♥, ♥, ♥, ♥, ♥, ..."
2109999,cher,"mostly high thots, LPT 🦋she/her","[mostly, high, thots, lpt, 🦋sheher]"


With the data processed, we can now start work on the assignment questions. 

Q: What is one area of improvement to your tokenization that you could theoretically carry out? (No need to actually do it; let's not make perfect the enemy of good enough.)

A: One improvement to my tokenization concerns emojis. Emojis are currently tokenized individually only when they are surrounded by whitespace. If an emoji is adjacent to another emoji or to any other character, it is not split into a separate token (for example, `💚💛❤️` and `🦋sheher` each remain a single token). Ideally, each emoji would be its own token regardless of whether there is a space next to it.
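One way the improvement described above might look is sketched here. It uses only the standard library and treats any character in Unicode category `So` ("Symbol, other") as a rough proxy for an emoji, which is an assumption: that category also contains non-emoji symbols such as `©`, and flag emojis (pairs of regional indicators) would be split in two. The function name `split_emoji_tokens` and its helpers are hypothetical.

```python
import unicodedata

ZWJ = "\u200d"                 # zero-width joiner used in compound emojis
VARIATION_SELECTOR = "\ufe0f"  # requests emoji-style rendering


def _is_emoji_base(ch):
    # Rough heuristic (assumption): "Symbol, other" covers most emojis
    return unicodedata.category(ch) == "So"


def _is_emoji_modifier(ch):
    # Joiners, variation selectors, and skin-tone modifiers attach to an emoji
    return ch in (ZWJ, VARIATION_SELECTOR) or "\U0001F3FB" <= ch <= "\U0001F3FF"


def split_emoji_tokens(token):
    """Split one whitespace-delimited token so each emoji stands alone."""
    pieces = []
    current = ""
    current_is_emoji = False
    prev = ""
    for ch in token:
        if _is_emoji_base(ch):
            if current_is_emoji and prev == ZWJ:
                # Continue a joined sequence such as person + ZWJ + gender sign
                current += ch
            else:
                if current:
                    pieces.append(current)
                current = ch
                current_is_emoji = True
        elif _is_emoji_modifier(ch) and current_is_emoji:
            current += ch
        else:
            if current_is_emoji:
                # An emoji just ended; start a fresh text piece
                pieces.append(current)
                current = ""
                current_is_emoji = False
            current += ch
        prev = ch
    if current:
        pieces.append(current)
    return pieces


# "\U0001F98B" is the butterfly emoji from the sample output above
print(split_emoji_tokens("\U0001F98Bsheher"))
```

In a pipeline, this could run after `tokenize` by flattening the pieces returned for each token. The `emoji` package imported at the top of this notebook would likely give more accurate boundaries than the category heuristic.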

## Calculate descriptive statistics on the two sets of lyrics and compare the results. 


In [50]:
# Cher Lyrics Stats
# Subset cher lyrics (copy so the new column is not set on a slice of lyrics_data)
cher_lyrics = lyrics_data[lyrics_data['artist'] == 'cher'].copy()
# Concat cleaned lyrics into one string
cher_lyrics['cleaned_lyrics'] = cher_lyrics['tokens'].apply(lambda x: ' '.join(x))
cher_lyrics_str = cher_lyrics['cleaned_lyrics'].str.cat(sep=' ')

# Calculate descriptive stats
cher_lyrics_desc = descriptive_stats(cher_lyrics_str.split())
cher_lyrics_desc

There are 35233 tokens in the data.
There are 3684 unique tokens in the data.
There are 169244 characters in the data.
The lexical diversity is 0.105 in the data.
[('love', 966), ('im', 511), ('know', 480), ('dont', 430), ('youre', 332)]

[35233, 3684, 0.10456106491073709, 169244]

In [51]:
# Robyn Lyrics Stats
# Subset robyn lyrics (copy so the new column is not set on a slice of lyrics_data)
robyn_lyrics = lyrics_data[lyrics_data['artist'] == 'robyn'].copy()
# Concat cleaned lyrics into one string
robyn_lyrics['cleaned_lyrics'] = robyn_lyrics['tokens'].apply(lambda x: ' '.join(x))
robyn_lyrics_str = robyn_lyrics['cleaned_lyrics'].str.cat(sep=' ')

# Calculate descriptive stats
robyn_lyrics_desc = descriptive_stats(robyn_lyrics_str.split())
robyn_lyrics_desc

There are 15041 tokens in the data.
There are 2139 unique tokens in the data.
There are 72804 characters in the data.
The lexical diversity is 0.142 in the data.
[('know', 305), ('im', 299), ('dont', 297), ('love', 269), ('got', 249)]

[15041, 2139, 0.1422112891430091, 72804]

Q: What observations do you make about these data? 

A: The Cher lyrics contain more tokens (35,233 vs. 15,041), more unique tokens (3,684 vs. 2,139), and more characters (169,244 vs. 72,804), but their lexical diversity is lower than that of the Robyn lyrics (0.105 vs. 0.142). The two artists also share four of their five most common tokens (love, know, im, and dont), though in a different order. The fifth token is youre for Cher and got for Robyn.

## Find tokens uniquely related to a corpus

Typically we would use TF-IDF to find unique tokens in documents. Unfortunately, we either have too few documents (if we view each data source as a single document) or too many (if we view each description as a separate document). In the latter case, our problem will be that descriptions tend to be short, so our matrix would be too sparse to support analysis. 

To avoid these problems, we will create a custom statistic to identify words that are uniquely related to each corpus. The idea is to find words that occur often in one corpus and infrequently in the other(s). Since corpora can be of different lengths, we will focus on the _concentration_ of tokens within a corpus. "Concentration" is simply the count of the token divided by the total corpus length. For instance, if a corpus had length 100,000 and a word appeared 1,000 times, then the concentration would be $\frac{1000}{100000} = 0.01$. If the same token had a concentration of $0.005$ in another corpus, then the concentration ratio would be $\frac{0.01}{0.005} = 2$. Very rare words can easily create infinite ratios, so you will also add a cutoff to your code so that a token must appear at least $n$ times for you to return it. 

An example of these calculations can be found in [this spreadsheet](https://docs.google.com/spreadsheets/d/1P87fkyslJhqXFnfYezNYrDrXp_GS8gwSATsZymv-9ms). Please don't hesitate to ask questions if this is confusing. 

In this section find 10 tokens for each of your four corpora that meet the following criteria: 

1. The token appears at least `n` times in all corpora
1. The token is in the top 10 for the highest ratio of appearances in a given corpus vs appearances in the other corpora.

You will choose a cutoff for yourself based on the size of the corpus you're working with. If you're working with the Robyn-Cher corpora provided, `n=5` seems to perform reasonably well.
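The concentration-ratio idea can be sketched as follows. This is one possible reading of the criteria, not a reference solution: the function name `concentration_ratios` and the parameters `min_count` and `top_n` are hypothetical, and pooling all the other corpora into a single `other_tokens` list is an assumption.

```python
from collections import Counter


def concentration_ratios(target_tokens, other_tokens, min_count=5, top_n=10):
    """Rank tokens by concentration in the target corpus relative to the rest.

    Concentration is a token's count divided by the corpus length. Tokens
    must appear at least `min_count` times on both sides, which also keeps
    the denominator from being zero.
    """
    target_counts = Counter(target_tokens)
    other_counts = Counter(other_tokens)
    target_len = len(target_tokens)
    other_len = len(other_tokens)

    ratios = {}
    for token, count in target_counts.items():
        other_count = other_counts.get(token, 0)
        if count < min_count or other_count < min_count:
            continue
        target_conc = count / target_len
        other_conc = other_count / other_len
        ratios[token] = target_conc / other_conc

    # Highest ratios first, truncated to the requested number of tokens
    ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Toy corpora mirroring the worked example: concentrations 0.01 and 0.005
target = ["love"] * 10 + ["filler_a"] * 990
other = ["love"] * 5 + ["filler_b"] * 995
print(concentration_ratios(target, other, min_count=5))
```

For the four corpora in this assignment, each corpus would take a turn as `target_tokens`, with the tokens of the remaining corpora flattened into `other_tokens`.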

In [None]:
# your code here

Q: What are some observations about the top tokens? Do you notice any interesting items on the list? 

A: 

## Build word clouds for all four corpora. 

For building wordclouds, we'll follow exactly the code of the text. The code in this section can be found [here](https://github.com/blueprints-for-text-analytics-python/blueprints-text/blob/master/ch01/First_Insights.ipynb). If you haven't already, you should absolutely clone the repository that accompanies the book. 


In [None]:
from matplotlib import pyplot as plt

def wordcloud(word_freq, title=None, max_words=200, stopwords=None):

    wc = WordCloud(width=800, height=400, 
                   background_color= "black", colormap="Paired", 
                   max_font_size=150, max_words=max_words)
    
    # convert data frame into dict
    if type(word_freq) == pd.Series:
        counter = Counter(word_freq.fillna(0).to_dict())
    else:
        counter = word_freq

    # filter stop words in frequency counter
    if stopwords is not None:
        counter = {token:freq for (token, freq) in counter.items() 
                              if token not in stopwords}
    wc.generate_from_frequencies(counter)
 
    plt.title(title) 

    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    
    
def count_words(df, column='tokens', preprocess=None, min_freq=2):

    # process tokens and update counter
    def update(doc):
        tokens = doc if preprocess is None else preprocess(doc)
        counter.update(tokens)

    # create counter and run through all data
    counter = Counter()
    df[column].map(update)

    # transform counter into data frame
    freq_df = pd.DataFrame.from_dict(counter, orient='index', columns=['freq'])
    freq_df = freq_df.query('freq >= @min_freq')
    freq_df.index.name = 'token'
    
    return freq_df.sort_values('freq', ascending=False)
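As a dependency-free illustration of the counting step that `count_words` performs, the sketch below tallies tokens across a list of token lists and applies the same minimum-frequency filter. The name `token_frequencies` is hypothetical. Because the `wordcloud` function above passes non-Series input straight to `generate_from_frequencies`, a plain dict like the one returned here should be usable as its `word_freq` argument.

```python
from collections import Counter


def token_frequencies(token_lists, min_freq=2):
    """Count tokens across documents and keep those seen at least min_freq times."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    return {token: freq for token, freq in counter.items() if freq >= min_freq}


# Toy documents standing in for the "tokens" column of one artist
docs = [["love", "know", "love"], ["know", "dance"]]
freqs = token_frequencies(docs, min_freq=2)
print(freqs)

# With the notebook's objects this might look like (not run here):
# wordcloud(token_frequencies(cher_lyrics["tokens"]), title="Cher lyrics")
```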

Q: What observations do you have about these (relatively straightforward) wordclouds? 

A: 