# Problem Set 1

We will use the dataset containing songs that we used in Class 3.

### Importing the data
Make sure that you upload the following file to your drive (or if you are solving the problem set on your laptop, download the file):

https://drive.google.com/file/d/1-hv2yowSJ34DgDG_WdKpfW4hyVoApnZO/view?usp=drive_link

This file contains songs from Spotify, the same set of songs we used in Class 3.

In [None]:
import pandas as pd
import spacy
import re
import nltk
import ast
import numpy as np

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df = pd.read_csv('/content/drive/MyDrive/Songs_PS1.csv',
                 converters={'tokens': ast.literal_eval})  # parse each stringified token list back into a Python list

### Question 1:
1. Which columns are in this dataset? How many rows?

In [None]:
print(list(df.columns))  # column names
print(df.shape[0])  # number of rows
df.head(-10)  # every row except the last 10 (the notebook truncates the display)

['artist', 'song', 'text', 'tokens']
57650


Unnamed: 0,artist,song,text,tokens
0,ABBA,Ahe's My Kind Of Girl,"Look at her face, it's a wonderful face \nAnd...","[look, at, her, face, ,, it, be, a, wonderful,..."
1,ABBA,"Andante, Andante","Take it easy with me, please \nTouch me gentl...","[take, it, easy, with, i, ,, please, touch, i,..."
2,ABBA,As Good As New,I'll never know why I had to go \nWhy I had t...,"[i, will, never, know, why, i, have, to, go, w..."
3,ABBA,Bang,Making somebody happy is a question of give an...,"[make, somebody, happy, be, a, question, of, g..."
4,ABBA,Bang-A-Boomerang,Making somebody happy is a question of give an...,"[make, somebody, happy, be, a, question, of, g..."
...,...,...,...,...
57635,Zebrahead,Let Me Go,Well some wear their feelings right on their s...,"[well, some, wear, their, feeling, right, on, ..."
57636,Zebrahead,Livin' Libido Loco,Enrique played in a band \nDown at the sand ...,"[enrique, play, in, a, band, down, at, the, sa..."
57637,Zebrahead,Lobotomy For Dummies,You can lie to me and say it's you I adore \n...,"[you, can, lie, to, i, and, say, it, be, you, ..."
57638,Zebrahead,Mental Health,Let's go \nThe lights are on but there is no ...,"[let, us, go, the, light, be, on, but, there, ..."


### Question 2:
The column `tokens` was created using the following code:

```
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

def process_texts(texts):
    cleaned_texts = (re.sub(r'\s+', ' ', text).strip() for text in texts)
    docs = nlp.pipe(cleaned_texts, n_process=-1)
    return [[token.lemma_.lower() for token in doc] for doc in docs]

# Process the entire dataframe
df['tokens'] = process_texts(df['text'])
```

Part of the code is there to make the computation faster. Focus on the line:

```
return [[token.lemma_.lower() for token in doc] for doc in docs]
```
Can you explain in your own words what this line does?

The documentation for spaCy could be useful to answer: https://spacy.io/usage/linguistic-features and https://spacy.io/api/doc

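This line builds one list per song. For each `doc` produced by `nlp.pipe`, it takes every token, replaces it with its lemma (`token.lemma_`, the dictionary form of the word, which is not the same as a stem), and converts that lemma to lowercase. Each `doc` is a spaCy `Doc`, that is, a sequence of tokens, and `docs` is the stream of `Doc` objects that `nlp.pipe` yields, not a dataframe. The result is a list of lists of lowercase lemmas, with one inner list per song.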
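
To make the nested comprehension concrete, the sketch below unrolls it into explicit loops. To avoid loading a spaCy model, it uses a hypothetical `FakeToken` stand-in that only carries a `lemma_` attribute, and the lemmas are chosen by hand; only the attribute name comes from the code above, everything else is an assumption for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Token: only the lemma_ attribute is modelled.
FakeToken = namedtuple("FakeToken", ["lemma_"])

# Two toy "docs". In the real code, docs is a generator returned by nlp.pipe;
# a list is used here so that it can be iterated twice.
docs = [
    [FakeToken("take"), FakeToken("it"), FakeToken("easy"), FakeToken("with"), FakeToken("I")],
    [FakeToken("New"), FakeToken("York")],
]

# The one-liner from the question.
compact = [[token.lemma_.lower() for token in doc] for doc in docs]

# The same logic written as explicit loops.
unrolled = []
for doc in docs:                          # outer loop: one song at a time
    lemmas = []
    for token in doc:                     # inner loop: one token at a time
        lemmas.append(token.lemma_.lower())
    unrolled.append(lemmas)

print(unrolled)
```

Both versions produce the same list of lists; the comprehension is simply the more compact way to write it.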

### Question 3:

Look at the tokens for the song titled *Andante, Andante* and compare them with the original text of the song. Why do we have `'i'` as the fifth token?

Because the fifth word of the song is "me": spaCy lemmatizes "me" to "I", the base form of the first-person singular pronoun, and `.lower()` then turns it into "i".

## Identifying bigrams

### Question 4:
Before we can proceed, we need to further simplify the text. Follow a similar approach to what we did in class. For example, you may want to remove characters that are not word characters, remove digits, and keep only strings with length greater than 1.

In [None]:
# Step 1: Remove non-word characters except spaces from each token
df['tokens_simple'] = df['tokens'].apply(lambda x: [re.sub(r'[^\w\s]', '', i) for i in x])

# Step 2: Remove digits and non-alphabetic characters from each token
df['tokens_simple'] = df['tokens_simple'].apply(lambda x: [re.sub(r'\d+', '', i) for i in x])
df['tokens_simple'] = df['tokens_simple'].apply(lambda x: [re.sub(r'[^a-zA-Z]', '', i) for i in x])

# Step 3: Remove strings of length 1 or 0
df['tokens_simple'] = df['tokens_simple'].apply(lambda x: [i for i in x if len(i)>1])
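
As a quick check of what these steps do, the sketch below applies the same regular expressions to one made-up token list. The helper name `simplify_tokens` and the example tokens are assumptions for illustration, not part of the assignment.

```python
import re

def simplify_tokens(tokens):
    """Apply the three cleaning steps above to a single list of tokens."""
    tokens = [re.sub(r'[^\w\s]', '', t) for t in tokens]    # step 1: drop punctuation
    tokens = [re.sub(r'\d+', '', t) for t in tokens]        # step 2a: drop digits
    tokens = [re.sub(r'[^a-zA-Z]', '', t) for t in tokens]  # step 2b: keep ASCII letters only
    return [t for t in tokens if len(t) > 1]                # step 3: drop strings of length 0 or 1

example = ['look', ',', 'it', 'be', 'a', '2nd', "don't", 'i', 'café']
print(simplify_tokens(example))
# ['look', 'it', 'be', 'nd', 'dont', 'caf']
```

Note that the last substitution in step 2 keeps only the letters a to z, so accented characters are stripped as well ("café" becomes "caf"), which affects the non-English songs in the dataset.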

### Question 5:
Use the code seen in Class 3 to identify bigrams.
Start by filtering bigrams with at least 5 occurrences, and print the 20 with the highest PMI score. Then move to filter with at least 50 occurrences, then with at least 500 occurrences.
What do you notice?
What changes with respect to the results we saw in class? How do you explain this difference?



In [None]:
text_tokens_all = [token for song in df['tokens_simple'] for token in song]

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(text_tokens_all)
finder.apply_freq_filter(4)  # removes bigrams seen fewer than 4 times; use 5 to require at least 5 occurrences
finder.nbest(bigram_measures.pmi, 20)

[('baaaaarar', 'baaaaaarar'),
 ('bekymmer', 'suddar'),
 ('beng', 'deng'),
 ('courant', 'fugitif'),
 ('craise', 'finton'),
 ('dirsi', 'addio'),
 ('ingonyama', 'nengw'),
 ('inscru', 'tability'),
 ('kastiyong', 'buhangin'),
 ('kingman', 'barstow'),
 ('lilywhite', 'lilith'),
 ('locamente', 'enamorado'),
 ('meshugana', 'moils'),
 ('nakayang', 'ipadama'),
 ('nannimo', 'shinpai'),
 ('nengw', 'enamabala'),
 ('oingo', 'boingo'),
 ('ojig', 'neoegeman'),
 ('owain', 'glyndwr'),
 ('pactum', 'fraudi')]

In [None]:
finder.apply_freq_filter(49)  # removes bigrams seen fewer than 49 times; use 50 to require at least 50 occurrences
finder.nbest(bigram_measures.pmi, 20)

[('bala', 'bala'),
 ('deja', 'vu'),
 ('helter', 'skelter'),
 ('diddit', 'diddit'),
 ('ly', 'ly'),
 ('baa', 'baa'),
 ('steve', 'millikan'),
 ('nicki', 'minaj'),
 ('los', 'angeles'),
 ('blah', 'blah'),
 ('scumbag', 'scumbag'),
 ('wim', 'weh'),
 ('bla', 'bla'),
 ('spa', 'baba'),
 ('pag', 'ibig'),
 ('barbar', 'ann'),
 ('dolly', 'parton'),
 ('wu', 'tang'),
 ('sen', 'sen'),
 ('voulez', 'vous')]

In [None]:
finder.apply_freq_filter(499)  # removes bigrams seen fewer than 499 times; use 500 to require at least 500 occurrences
finder.nbest(bigram_measures.pmi, 20)

[('santa', 'claus'),
 ('nah', 'nah'),
 ('pum', 'pum'),
 ('oo', 'oo'),
 ('ba', 'ba'),
 ('doo', 'doo'),
 ('bye', 'bye'),
 ('ha', 'ha'),
 ('uh', 'huh'),
 ('da', 'da'),
 ('na', 'na'),
 ('new', 'york'),
 ('talkin', 'bout'),
 ('brand', 'new'),
 ('merry', 'christmas'),
 ('whoa', 'whoa'),
 ('uh', 'uh'),
 ('la', 'la'),
 ('ah', 'ah'),
 ('each', 'other')]

The higher the occurrence threshold, the more recognizable and meaningful in English the bigrams become. Compared to what we saw in class, the occurrence filters are different, and we print the top 20 bigrams by PMI instead of the top 10. In class we saw many numbers and foreign words paired with each other. Foreign word pairs also dominate the 5-occurrence filter here, because we did not exclude the foreign-language songs and PMI is very sensitive to rare word pairs: two words that occur only a few times, and always together, receive the highest scores. With the 500-occurrence filter, compared with the 100-occurrence filter, even more of the bigrams are meaningful expressions. Instead of contractions that are common in songwriting, like "gon na", "wan na", and "got ta", the 500-occurrence list shows "santa claus", "brand new", "new york", "talkin bout", and "merry christmas".
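
The sensitivity of PMI to rare pairs can be illustrated with a few lines of arithmetic. The sketch below uses the textbook definition of PMI with probabilities estimated as simple count ratios; the function name `bigram_pmi` and all counts are made up for illustration, and this is a simplified stand-in rather than the exact computation inside NLTK.

```python
import math

def bigram_pmi(count_xy, count_x, count_y, n_total):
    """PMI in bits: log2(p(x, y) / (p(x) * p(y))), with each p estimated as count / n_total."""
    p_xy = count_xy / n_total
    p_x = count_x / n_total
    p_y = count_y / n_total
    return math.log2(p_xy / (p_x * p_y))

n_total = 1_000_000  # made-up corpus size

# A rare pair: both words occur 5 times, always next to each other.
rare = bigram_pmi(count_xy=5, count_x=5, count_y=5, n_total=n_total)

# A frequent pair: the words co-occur often but also appear in many other contexts.
frequent = bigram_pmi(count_xy=500, count_x=2000, count_y=3000, n_total=n_total)

print(f"rare pair: {rare:.2f} bits, frequent pair: {frequent:.2f} bits")
```

The rare pair scores far higher even though it is backed by only five observations, which is why a low frequency threshold lets obscure or foreign-language pairs fill the top of the ranking.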

## Document-feature matrix and Pointwise Mutual Information

To see a simple application of this method, you can look at [this comparison](https://www.pewresearch.org/decoded/2022/07/13/analyzing-text-for-distinctive-terms-using-pointwise-mutual-information/) between liberals and conservatives.

### Question 6:
Start by creating a new dataframe called df_filter that contains only the songs by the two artists that you want to compare.

In [None]:
df_filter = df[(df['artist'] == 'ABBA' ) | (df['artist'] == 'Avril Lavigne')]

### Question 7:
Create the document-feature matrix using df_filter. Add options to filter by frequency so that the selected features appear in at least 0.5% of the documents and in no more than 90% of the documents, and so that there are no more than 10000 features.


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(tokenizer=None,
                             binary=True,
                             min_df=0.005,  #minimum document frequency
                             max_df=0.90, #maximum document frequency
                             max_features=10000 #not more than 10000
                             )

X = vectorizer.fit_transform(df_filter['tokens_simple'].apply(lambda tokens: " ".join(tokens)))
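
To see what the two document-frequency thresholds do, here is a pure-Python sketch of the same idea on a toy corpus. The function `filter_by_document_frequency` is a hypothetical helper written for this illustration; it is a simplified stand-in that ignores `max_features` and the vectorizer's own tokenization, not a reimplementation of `CountVectorizer`.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep tokens whose share of documents lies between min_df and max_df (inclusive)."""
    n_docs = len(docs)
    # Count each token once per document, mirroring the binary=True setting.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return sorted(tok for tok, count in doc_freq.items()
                  if min_df <= count / n_docs <= max_df)

# Toy corpus: 'love' is in every song, 'you' in half of them, the rest in one song each.
docs = [['love', 'you'], ['love', 'me'], ['love', 'rain'], ['love', 'you', 'sun']]

vocab = filter_by_document_frequency(docs, min_df=0.5, max_df=0.9)
print(vocab)
# ['you']
```

Tokens that appear almost everywhere cannot distinguish one artist from another, and tokens that appear in very few songs give unreliable estimates, so both ends are trimmed before the PMI step.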

Now we want to calculate PMI scores to see which words are associated with each of the two artists. Complete the following code by inserting the names of the artists you chose:

In [None]:
# Convert categories to a binary format (for a single category in this example)
a = np.where(df_filter['artist'] == 'ABBA', 1, 0)

# Calculate PMI for each feature with the category
n_docs = len(df_filter)
p_a = np.sum(a) / n_docs
pmi_scores = []

for i, token in enumerate(vectorizer.get_feature_names_out()):
    p_x = np.sum(X[:, i]) / n_docs #Probability of token
    p_xa = np.sum(X[a == 1, i]) / n_docs  # Probability of token and artist
    if p_x * p_a > 0 and p_xa > 0:
        pmi = np.log2((p_xa / (p_x * p_a)))  # the check above guarantees a positive argument, so log(0) cannot occur

        pmi_scores.append((token, pmi))

# Sort tokens by PMI scores
pmi_scores.sort(key=lambda x: x[1], reverse=True)

# Print PMI scores
for token, pmi in pmi_scores[0:30]:
    print(f"{token}: {pmi}")

action: 1.1798210375848124
ad: 1.1798210375848124
afar: 1.1798210375848124
affair: 1.1798210375848124
agnetha: 1.1798210375848124
agree: 1.1798210375848124
aha: 1.1798210375848124
ai: 1.1798210375848124
alice: 1.1798210375848124
alley: 1.1798210375848124
almost: 1.1798210375848124
among: 1.1798210375848124
angry: 1.1798210375848124
arrive: 1.1798210375848124
autumn: 1.1798210375848124
await: 1.1798210375848124
ball: 1.1798210375848124
bang: 1.1798210375848124
bank: 1.1798210375848124
beast: 1.1798210375848124
bet: 1.1798210375848124
bill: 1.1798210375848124
bird: 1.1798210375848124
bitter: 1.1798210375848124
book: 1.1798210375848124
boomerang: 1.1798210375848124
breakfast: 1.1798210375848124
bus: 1.1798210375848124
buses: 1.1798210375848124
business: 1.1798210375848124


In [None]:
# Convert categories to a binary format (for a single category in this example)
a = np.where(df_filter['artist'] == 'Avril Lavigne', 1, 0)

# Calculate PMI for each feature with the category
n_docs = len(df_filter)
p_a = np.sum(a) / n_docs
pmi_scores = []

for i, token in enumerate(vectorizer.get_feature_names_out()):
    p_x = np.sum(X[:, i]) / n_docs #Probability of token
    p_xa = np.sum(X[a == 1, i]) / n_docs  # Probability of token and artist
    if p_x * p_a > 0 and p_xa > 0:
        pmi = np.log2((p_xa / (p_x * p_a)))  # the check above guarantees a positive argument, so log(0) cannot occur

        pmi_scores.append((token, pmi))

# Sort tokens by PMI scores
pmi_scores.sort(key=lambda x: x[1], reverse=True)

# Print PMI scores
for token, pmi in pmi_scores[0:30]:
    print(f"{token}: {pmi}")

actually: 0.8401286632216106
against: 0.8401286632216106
already: 0.8401286632216106
american: 0.8401286632216106
argue: 0.8401286632216106
askin: 0.8401286632216106
asleep: 0.8401286632216106
ass: 0.8401286632216106
attention: 0.8401286632216106
avril: 0.8401286632216106
aware: 0.8401286632216106
bail: 0.8401286632216106
bedroom: 0.8401286632216106
behavior: 0.8401286632216106
below: 0.8401286632216106
blast: 0.8401286632216106
bleed: 0.8401286632216106
bored: 0.8401286632216106
boring: 0.8401286632216106
bottle: 0.8401286632216106
bottom: 0.8401286632216106
boyfriend: 0.8401286632216106
brag: 0.8401286632216106
brand: 0.8401286632216106
breathing: 0.8401286632216106
brown: 0.8401286632216106
carefully: 0.8401286632216106
cell: 0.8401286632216106
choir: 0.8401286632216106
choke: 0.8401286632216106


### Question 8:
Explain the method above and the results.

Can you think of another similar approach that would not need to filter for the two artists you chose?

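This method first keeps only the rows where the artist is one of the two I picked, and transforms their simplified tokens into a binary document-feature matrix (1 if the token appears in the song, 0 otherwise). For each token it then calculates the PMI with the artist from three quantities: the probability that a song contains the token, the probability that a song is by the artist, and the probability that a song is by the artist and contains the token. Finally, it prints the 30 tokens with the highest PMI.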


The same procedure was then applied to the other artist.


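The results were not very informative: every token in the top 30 has the same PMI, and the tokens look like random words in alphabetical order. This happens because each of these tokens appears only in songs by that artist, so the joint probability equals the token probability and the PMI reduces to log2(1 / p_a), the highest value it can take. The sort then leaves the tied tokens in the alphabetical order of the vectorizer's vocabulary.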
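
The identical scores can be checked with a small calculation. The sketch below uses made-up document counts and a hypothetical helper `artist_token_pmi`; the only numbers taken from the notebook are the two printed PMI values at the end.

```python
import math

def artist_token_pmi(n_docs, n_artist, n_token, n_both):
    """PMI between 'song is by the artist' and 'song contains the token', from document counts."""
    p_a = n_artist / n_docs
    p_x = n_token / n_docs
    p_xa = n_both / n_docs
    return math.log2(p_xa / (p_x * p_a))

n_docs, n_artist = 10, 4  # made-up: 10 songs in total, 4 of them by the artist

# Tokens found only in the artist's songs reach the same ceiling, however rare they are.
exclusive_rare = artist_token_pmi(n_docs, n_artist, n_token=1, n_both=1)
exclusive_common = artist_token_pmi(n_docs, n_artist, n_token=3, n_both=3)
# A token shared with the other artist scores lower.
shared = artist_token_pmi(n_docs, n_artist, n_token=4, n_both=2)
ceiling = math.log2(n_docs / n_artist)  # log2(1 / p_a)

print(exclusive_rare, exclusive_common, shared, ceiling)

# The two tied scores printed above imply artist shares of songs that add up to one.
share_abba = 2 ** -1.1798210375848124
share_avril = 2 ** -0.8401286632216106
print(share_abba + share_avril)
```

Because any token that occurs exclusively in one artist's songs hits this ceiling, the top of the ranking says which words are exclusive to the artist but cannot rank them against each other.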


An alternative that does not require filtering for two artists is to calculate the PMI between every artist and every token on the full dataset (using a loop, or a lambda function as we did in class) and to store the results in a dictionary with the artist as the key. It is then possible to look up the 30 tokens with the highest PMI for any artist.
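
A minimal sketch of this alternative, in pure Python and on a made-up corpus, could look as follows. The artist names, tokens, and variable names are all assumptions for illustration; on the real data the counts would come from the document-feature matrix built on the full dataframe.

```python
import math
from collections import defaultdict

# Toy corpus of (artist, set of distinct tokens in the song); all values are made up.
songs = [
    ("A", {"love", "dance"}),
    ("A", {"love", "queen"}),
    ("B", {"love", "skater"}),
    ("B", {"skater", "boy"}),
    ("C", {"love", "rain"}),
]

n_docs = len(songs)
doc_freq = defaultdict(int)     # number of songs containing each token
artist_docs = defaultdict(int)  # number of songs per artist
joint = defaultdict(int)        # number of songs by the artist that contain the token

for artist, tokens in songs:
    artist_docs[artist] += 1
    for tok in tokens:
        doc_freq[tok] += 1
        joint[(artist, tok)] += 1

# One list of (token, PMI) pairs per artist, stored in a dictionary keyed by artist.
pmi_by_artist = defaultdict(list)
for (artist, tok), n_xa in joint.items():
    p_xa = n_xa / n_docs
    p_x = doc_freq[tok] / n_docs
    p_a = artist_docs[artist] / n_docs
    pmi_by_artist[artist].append((tok, math.log2(p_xa / (p_x * p_a))))

for artist, scores in pmi_by_artist.items():
    scores.sort(key=lambda pair: pair[1], reverse=True)
    print(artist, scores[:3])
```

Here each artist is compared against all other songs rather than against one chosen rival, so a word like "love" that is spread across artists ends up at the bottom of every list.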

### Dictionary Methods: the use of pronouns
In this exercise we will try to see whether artists differ in their use of pronouns. We will use dictionaries provided by [Harvard-IV-4](https://inquirer.sites.fas.harvard.edu/homecat.htm), which distinguish between pronouns referring to the _singular self_ and pronouns referring to the _inclusive self_.

In [None]:
singular_self=['i', "i'm", 'me', 'mine', 'my', 'myself', 'oneself']
inclusive_self=["let's", 'our', 'ours', 'ourselves', 'us', 'we']

To correctly identify pronouns, the best approach would be to use a part-of-speech tagger such as the one provided by spaCy and briefly shown in class. Here we will try a simpler and less precise approach.

### Question 9:
1. Use the 'text' column of the dataset again. The tokens are not usable here, as we are focusing on words that we discarded.
2. Create a function to split into words, force lowercase and remove digits.
3. Apply this function to the dataset.

In [None]:
def simple_clean(text):
    # Split text into words, force lowercase, and remove digits
    text = re.sub(r'[^\w\s]', '', text)  # removes punctuation
    text = re.sub(r'\d+', '', text).lower()  # Remove digits and lowercase
    words = text.split()  # Split into words
    return words

In [None]:
df['words'] = df['text'].apply(simple_clean)  # apply the function to the 'text' column

In [None]:
print(df.loc[0,'words'])

['look', 'at', 'her', 'face', 'its', 'a', 'wonderful', 'face', 'and', 'it', 'means', 'something', 'special', 'to', 'me', 'look', 'at', 'the', 'way', 'that', 'she', 'smiles', 'when', 'she', 'sees', 'me', 'how', 'lucky', 'can', 'one', 'fellow', 'be', 'shes', 'just', 'my', 'kind', 'of', 'girl', 'she', 'makes', 'me', 'feel', 'fine', 'who', 'could', 'ever', 'believe', 'that', 'she', 'could', 'be', 'mine', 'shes', 'just', 'my', 'kind', 'of', 'girl', 'without', 'her', 'im', 'blue', 'and', 'if', 'she', 'ever', 'leaves', 'me', 'what', 'could', 'i', 'do', 'what', 'could', 'i', 'do', 'and', 'when', 'we', 'go', 'for', 'a', 'walk', 'in', 'the', 'park', 'and', 'she', 'holds', 'me', 'and', 'squeezes', 'my', 'hand', 'well', 'go', 'on', 'walking', 'for', 'hours', 'and', 'talking', 'about', 'all', 'the', 'things', 'that', 'we', 'plan', 'shes', 'just', 'my', 'kind', 'of', 'girl', 'she', 'makes', 'me', 'feel', 'fine', 'who', 'could', 'ever', 'believe', 'that', 'she', 'could', 'be', 'mine', 'shes', 'just',
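
One consequence of this cleaning is worth checking: because punctuation is removed, contractions lose their apostrophe ('im', 'its', 'shes' in the output above), so dictionary entries that contain an apostrophe can no longer match. The sketch below repeats the function and the two word lists from this notebook on a made-up sentence; the sentence and the name `unreachable` are assumptions for illustration.

```python
import re

def simple_clean(text):
    """Same steps as the function above: strip punctuation, strip digits, lowercase, split."""
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text).lower()
    return text.split()

singular_self = ['i', "i'm", 'me', 'mine', 'my', 'myself', 'oneself']
inclusive_self = ["let's", 'our', 'ours', 'ourselves', 'us', 'we']

words = simple_clean("I'm blue, so let's go; it means something to me")
print(words)

# Dictionary entries that can never appear in the cleaned words.
unreachable = [w for w in singular_self + inclusive_self if "'" in w]
print(unreachable)
# ["i'm", "let's"]
```

If those contractions should count, one option would be to match 'im' and 'lets' instead, at the cost of some false positives; the counts reported in this notebook do not include them.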

### Question 10:
Calculate the frequency of the words in the two dictionaries above, dividing by the total number of words of each artist.

In [None]:
from collections import Counter

#Create a column that contains the counters
df['counters'] = df['words'].apply(Counter)

#Calculate the sum of words in singular_self
df['singular'] = df['counters'].apply(lambda token_counts: sum(token_counts[word] for word in singular_self if word in token_counts))

#Calculate the sum of words in inclusive_self
df['inclusive'] = df['counters'].apply(lambda token_counts: sum(token_counts[word] for word in inclusive_self if word in token_counts))

#Calculate total number of words
df['total'] = df['counters'].apply(lambda token_counts: sum(token_counts.values()))

In [None]:
artists = df[['artist', 'singular', 'inclusive', 'total']].groupby('artist').sum()

In [None]:
artists['frequency_singular']=artists['singular']/artists['total']
artists['frequency_inclusive']=artists['inclusive']/artists['total']
artists['frequency_singular'].head(10)
#df[['counters', 'singular']].head()
df[['counters', 'inclusive']].head()

Unnamed: 0,counters,inclusive
0,"{'look': 2, 'at': 2, 'her': 3, 'face': 2, 'its...",2
1,"{'take': 2, 'it': 2, 'easy': 1, 'with': 2, 'me...",0
2,"{'ill': 1, 'never': 3, 'know': 2, 'why': 2, 'i...",4
3,"{'making': 1, 'somebody': 1, 'happy': 1, 'is':...",0
4,"{'making': 1, 'somebody': 1, 'happy': 1, 'is':...",0
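
The same calculation can be traced by hand on a tiny example. The sketch below uses plain Python instead of pandas, with made-up artists and word lists; only `singular_self` is copied from the notebook.

```python
from collections import Counter

singular_self = ['i', "i'm", 'me', 'mine', 'my', 'myself', 'oneself']

# Toy cleaned word lists for two made-up artists.
songs = [
    ("Artist A", ['i', 'love', 'my', 'guitar']),
    ("Artist A", ['we', 'sing', 'to', 'me']),
    ("Artist B", ['we', 'are', 'the', 'ones']),
]

singular_counts = Counter()  # dictionary hits per artist
total_counts = Counter()     # all words per artist
for artist, words in songs:
    counts = Counter(words)
    # A Counter returns 0 for missing keys, so no membership test is needed.
    singular_counts[artist] += sum(counts[w] for w in singular_self)
    total_counts[artist] += sum(counts.values())

frequency_singular = {a: singular_counts[a] / total_counts[a] for a in total_counts}
print(frequency_singular)
# {'Artist A': 0.375, 'Artist B': 0.0}
```

Summing hits and totals per artist before dividing, as the `groupby` above does, weights long songs more heavily than averaging per-song frequencies would.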


### Question 11:
Compare the artists that have the highest frequency of singular_self words with the artists that have the highest frequency of inclusive_self words. Consider only artists with a high enough number of words, for example 2000 in total.

Which artists stand out? Are you familiar with any of them? Which possible explanations can you suggest?

In what sense is what you have just done a simple application of a dictionary method?

In [None]:
artists_filtered = artists[artists['total'] >= 2000]
# Get the top 10 artists by frequency of singular_self and inclusive_self
top_singular = artists_filtered.nlargest(10, 'frequency_singular')[['frequency_singular']]
top_inclusive = artists_filtered.nlargest(10, 'frequency_inclusive')[['frequency_inclusive']]
print(top_singular)
print(top_inclusive)
common_artists = top_singular.join(top_inclusive, how='inner', lsuffix='_singular', rsuffix='_inclusive')
print("\nArtists appearing in both top lists:")
print(common_artists)

                 frequency_singular
artist                             
Israel Houghton            0.111725
Planetshakers              0.104037
Evanescence                0.101528
Britney Spears             0.099807
Freddie King               0.097668
Xscape                     0.096639
Sam Smith                  0.095873
Veruca Salt                0.095141
Whitney Houston            0.094815
Walk The Moon              0.094807
                     frequency_inclusive
artist                                  
Don Moen                        0.041565
Matt Redman                     0.036125
High School Musical             0.034530
Unearth                         0.033781
Lorde                           0.027750
Independence Day                0.025481
Jose Mari Chan                  0.023782
Youth Of Today                  0.023619
Yes                             0.022976
Manowar                         0.022803

Artists appearing in both top lists:
Empty DataFrame
Columns: [frequency_si

The five artists with the highest frequency of singular-self words are Israel Houghton, Planetshakers, Evanescence, Britney Spears, and Freddie King. The five artists with the highest frequency of inclusive-self words are Don Moen, Matt Redman, High School Musical, Unearth, and Lorde.

I am only familiar with High School Musical. It is about high school life, where friends share their time at school while having different loves and dreams. In this context it makes sense that the songs use "let's", "our", "ours", "ourselves", "us", and "we" a lot.


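Counter turns each song into a mapping with every single word as a key and the count of that word as the value. I summed the counts for the keys that belong to the predefined word lists (`singular_self` and `inclusive_self`), then did the filtering and ordering on the resulting frequencies. This is a dictionary method in the text-analysis sense: a fixed list of words stands in for a concept (here, reference to the singular or the inclusive self), and the concept is measured by how often words from that list occur in each artist's lyrics.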
