# <font color='#FFBD33'>**Assignment 6 - Recommend Song by Mood**</font> 

This is the <font color='cyan'>Assignment 6</font> for the LING360 - Computational Methods in Lingustics course and it is worth a total of  <font color='cyan'>**10 points** + **5 points (Bonus)**</font>.

In this assignment, we are going to create a novel application, which is a song recommender website which will recommend a list of songs generated on the spot using the emotion detection model and text similarity.

Our general plan is to:
1. Train an **emotion detection model**
2. Find emotions in **a song corpus**
3. Implement **web app** which can recommend songs by looking at the **similarity between input text and songs**. Similarity metrics are going to be the
    1. Emotion vector similarity    
    1. Count vector similarity

The topics include:
1. Naive Bayes
2. Streamlit


There's a total of  <font color='cyan'>**2 main tasks**</font> and <font color='cyan'>**x subs tasks**</font>. For each task, please write your code between the following lines:

```
## YOUR CODE STARTS



## YOUR CODE ENDS
```

Before working on the assignment, please copy this notebook to your own drive. You can use ```Save a copy in Drive``` under the ```File``` menu on top left.

Please, run every cell in your code to make sure that it works properly before submitting it. 

Once you are ready to submit, download two versions of your code:

*   Download .ipynb
*   Download .py

These are both available under the ```File``` menu on top left. 

Then, compress your files (zip, rar, or whatever) and upload the compressed file to Moodle.

If you have any questions, please contact with karahan.sahin@boun.edu.tr


In [None]:
# Run this block first!
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from scipy.spatial.distance import cosine
from sklearn.metrics import pairwise_distances

In [None]:
# Steps to get the data:
# 1. Go to the link: https://drive.google.com/drive/folders/1ZNKHlrBxqbePdNElohyMtVgrWNaKFQhD?usp=sharing
# 2. Click on the folder name and click to `Copy to Drive` option (copy to root folder!)
# 3. Then give permissions after running this line
# 4. After giving permission, run the next line to copy files to your collab session.

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
! cp "/content/drive/MyDrive/Assignment 6/main.py" -t .
! cp "/content/drive/MyDrive/Assignment 6/song_dataset.csv" -t .
! cp "/content/drive/MyDrive/Assignment 6/train_emotion.csv" -t .
! cp "/content/drive/MyDrive/Assignment 6/test_emotion.csv" -t .

## <font color='#FFBD33'>**Q1:** Emotion Classifier</font> `10 points`

Train a model for emotion recognition using Naive Bayes Classifiers

### <font color='#FFBD33'>Q1.1: Import and Clean Data</font> `2 points`

First we need to get our data and preprocess it accordingly.

<font color='#FFBD33'>**Instructions:**</font>

1. Import train and test sets namely, `train_emotion.csv` and `test_emotion.csv`.
2. Read line by line to get (sentence, label) pairs by separating each line with `\t` character.
2. For training data, add sentences to a variable called `train_sentences` and add labels to a variable called `train_labels`.
2. For test data, add sentences to a variable called `test_sentences` and add labels to a variable called `test_labels`.
3. Then translate the numeric labels into names `{ 0: "sadness", 1: "joy" , 2:  "love" , 3: "anger", 4: "fear", 5: 'surprise'}` for train and test labels.

<font color='#FFBD33'>**Notes:**</font>

1. Don't forget to open your file with `encoding="utf-8"` parameter.

In [18]:
## YOUR CODE STARTS

train_emotion = 'train_emotion.csv'
test_emotion = 'test_emotion.csv'

columns = ['sentence', 'label']
train_emotion_data = pd.read_csv(train_emotion, sep='\t', names=columns, encoding="utf-8")
test_emotion_data = pd.read_csv(test_emotion, sep='\t', names=columns, encoding="utf-8")

train_sentences = train_emotion_data['sentence']
test_sentences = test_emotion_data['sentence']

train_labels = train_emotion_data['label']
test_labels = test_emotion_data['label']

label_to_int = {'sadness':0,'joy':1,'love':2,'anger':3,'fear':4,'surprise':5}
int_to_label = {0:'sadness',1:'joy',2:'love',3:'anger',4:'fear',5:'surprise'}

#train_labels = train_emotion_data['label'].map(label_to_int)
#test_labels = test_emotion_data['label'].map(label_to_int)

## YOUR CODE ENDS

### <font color='#FFBD33'>Q1.2: Train Model/Vectorizer</font> `3 points`

Process the train and test datasets and train your Naive Bayes model with train set.

<font color='#FFBD33'>**Instructions:**</font>

2. Then using `bow.fit_transform()` method, fit the `train_sentences` and assign it into a variable called `X_train`
3. Generate an instance of `MultinomialNB()` object, and assign it into a variable called `model`
4. Then using `model.fit()` method, fit the `X_train` and `train_labels`.
5. Finally run the last line to see your model results.

In [19]:
## YOUR CODE STARTS
bow = CountVectorizer() 

X_train = bow.fit_transform(train_sentences)
model = MultinomialNB()
model.fit(X_train, train_labels)

## YOUR CODE ENDS

In [20]:
X_test = bow.transform(test_sentences)
y_pred = model.predict(X_test)
print(classification_report(test_labels, y_pred))

              precision    recall  f1-score   support

       anger       0.91      0.55      0.68       912
        fear       0.86      0.51      0.64       749
         joy       0.71      0.96      0.81      2192
        love       0.88      0.21      0.34       547
     sadness       0.73      0.93      0.82      1959
    surprise       1.00      0.02      0.05       241

    accuracy                           0.74      6600
   macro avg       0.85      0.53      0.56      6600
weighted avg       0.78      0.74      0.71      6600



### <font color='#FFBD33'>Q1.3: Run on Song Corpus</font> `2 points`

Use emotion detection to find the emotions of songs in corpus. Using that model, we are going to emotion features for each song.

<font color='#FFBD33'>**Instructions:**</font>

1. Import train and test sets namely, `song_dataset.csv`.
2. Read line by line to get (song_name, lyrics) pairs by separating each line with `\t` character
2. For song data, add song_name to a variable called `song_titles` and add lyrics to a variable called `song_lyrics`.
3. Then using `bow.transform()` method, fit the `song_lyrics` and assign it into a variable called `X_text`
4. Then using `model.predict_proba()` method, predict the `song_lyrics` and assign it into variable called `X_mood`

In [None]:
example_sentences = [
    "I don't know you, but it will be okay", 
    "I am so angry today", 
    "Really sad day:( "
]
X_example = bow.transform(example_sentences)
print(model.predict_proba(X_example))

[[2.36264742e-02 2.04338778e-02 5.21180504e-01 1.34826548e-02
  4.21170078e-01 1.06410642e-04]
 [4.36998612e-01 4.45664929e-02 1.03311494e-01 6.89331113e-03
  4.06675570e-01 1.55452033e-03]
 [7.07038448e-02 5.69751005e-02 1.32374041e-01 5.08793669e-03
  7.32605983e-01 2.25309388e-03]]


As you can see, now our feature vectors are become the emotion scores extracted from the trained Naive Bayes Model

|sentence       |   anger      |   fear       |      joy       |     love       |    sadness     |   surprise    |
|--------------|--------------|--------------|----------------|----------------|----------------|---------------|
|I don't know you, but it will be okay|2.35078678e-02|2.04108623e-02| 5.20536354e-01 | 1.34679763e-02 | 4.21970631e-01 | 1.06308399e-04|
|I am so angry today|4.37101484e-01|4.45618521e-02| 1.03295068e-01 | 6.89272322e-03 | 4.06594385e-01 | 1.55448722e-03|
|Really sad day:( |7.06614679e-02|5.69343769e-02| 1.32273982e-01 | 5.08437190e-03 | 7.32794178e-01 | 2.25162335e-03|

In [21]:
## YOUR CODE STARTS
song_dataset = 'song_dataset.csv'
song_dataset_data = pd.read_csv(song_dataset, sep='\t', names=['song_name', 'lyrics'], encoding="utf-8")

song_titles = song_dataset_data['song_name']
song_lyrics = song_dataset_data['lyrics']

X_text = bow.transform(song_lyrics)
X_mood = model.predict_proba(X_test)

## YOUR CODE ENDS

In [None]:
song_dataset_data.head()

Unnamed: 0,song_name,lyrics
0,Careless Whisper - Ivete Sangalo,I feel so unsure As I take your hand and lead ...
1,Could You Be Loved / Citação Musical do Rap: S...,"Don't let them fool, ya Or even try to school,..."
2,Cruisin' (Part. Saulo) - Ivete Sangalo,"Baby, let's cruise, away from here Don't be co..."
3,Easy - Ivete Sangalo,"Know it sounds funny But, I just can't stand t..."
4,For Your Babies (The Voice cover) - Ivete Sangalo,You've got that look again The one I hoped I h...


### <font color='#FFBD33'>Q1.4: Recommend Song with Metrics</font> `2 points`

Calculate scores of song corpus and get highest scoring topK songs for given sentence.

<font color='#FFBD33'>**Instructions:**</font>

1. First transform your text into bow features and assign it into a variable called `x_text`
2. Then predict its emotion features using `.predict_proba()` and assign it into a variable called `x_mood`
3. Then extract the similarities emotion scores `input_text`, namely `x_mood` between all songs `X_mood`  using `calculateSimilarity()` and assign it to variable called `sim_mood`
3. Then extract the similarities bow scores `input_text`, namely `x_text` between all songs `X_text`  using `calculateSimilarity()` and assign it to variable called `sim_text`
4. The create an empty list called `mean_scores`
5. Then using `zip()`, zip mood and text scores, and iterate over scores to calculate mean score
    ```python
    list1 = [1,2,3]
    list2 = [4,5,6]
    list(zip(list1, list2))
    # Output: [ (1,4), (2,5), (3,6) ] 
    ```
 
6. Then zip the song_names and mean scores, turn the zip object into a dictionary, and sort them by the descending score values.
7. Finally return the `topk` keys of `sorted_scores` dictionary.

In [None]:
def calculateSimilarity(document_vector, all_vectors):
    """The function that checks similarity between input and all instance of features of corpus
    
    """
    return (1 - pairwise_distances(all_vectors,document_vector, metric="cosine")).T[0]

In [24]:
from typing import KeysView
def getSongRecommendation(text, topK=10):
    """Function returns a 
    
    :args:
        text (str): text of explaining current mood
        topK (int): number of song titles to be returned
        
    :returns:
        top_tracks (list): list containing `topK` number of song titles
    """
    
    ## YOUR CODE STARTS
    x_text = bow.transform([text])
    x_mood = model.predict_proba(x_text)

    sim_mood = calculateSimilarity(x_mood, X_mood)
    sim_text = calculateSimilarity(x_text, X_text)
    
    mean_scores = []
    for mood, text in list(zip(sim_mood, sim_text)):
        mean_score = (mood + text) / 2 
        mean_scores.append(mean_score)

    scores = dict(zip(song_titles, mean_scores))
    sorted_scores = dict(sorted(scores.items(), key=lambda x: x[1], reverse=True))

    top_tracks = [(key, sorted_scores[key]) for key in list(sorted_scores.keys())[:topK]]
    
    ## YOUR CODE ENDS
    
    return top_tracks

In [25]:
test = getSongRecommendation("""Very angry, want to punch everything""")
test

[('Hush Hush; Hush Hush - The Pussycat Dolls', 0.6345537825863967),
 ('Free - Michael Jackson', 0.6306096321725132),
 ('Is It Cool To Fuck? - Tupac Shakur', 0.6256453044917794),
 ('Want Some More - Nicki Minaj', 0.6230290249418718),
 ('Half Back - Nicki Minaj', 0.6047007230282915),
 ('Why, When, How - Justin Timberlake', 0.5916326271198478),
 ('Good Night - Kanye West', 0.5871296236545812),
 ('Getting Money - Tupac Shakur', 0.5865765149027218),
 ("Thug Love - Destiny's Child", 0.586064096779096),
 ('To Live and Die in LA - Tupac Shakur', 0.5845426826677821)]

In [17]:
assert getSongRecommendation("""Very angry, want to punch everything""") == [
    ('I Want To Be Old - The Cure', 0.4865881400301677), 
    ("Somebody's Everything - Dolly Parton", 0.46629072991070464), 
    ("Don't Wanna - Electric Light Orchestra", 0.46627239736234016), 
    ('Cross Oceans - First Aid Kit', 0.46423834544262976), 
    ('Kill Everybody - Skrillex', 0.45675276421899946), 
    ('Stand By - Jeremy Camp', 0.4564354645876385), 
    ('Without Reason - The Fray', 0.4543368996115371), 
    ('Slug - U2', 0.451413762608203), 
    ('Sundress - Ben Kweller', 0.4478342947514802), 
    ('Rock Star - Everclear', 0.4448005087202802)
  ]

AssertionError: ignored

## <font color='#FFBD33'>**BONUS:** Show on Streamlit</font> `5 points`

Use Streamlit to showcase what your model can do! The design is shown below:

<img src="https://lh3.googleusercontent.com/drive-viewer/AFGJ81ogzqu5cYKmGwe439OsZpByzxQqU5yxLl9_89Wo1IEcC0GOfzR7iW7dCq_mLQsG4Tr6nnFQZF40iYuWl4Zs1EBUQLIxPA=s1600" width="400" height="300"/>

<font color='#FFBD33'>**Instructions:**</font>

1. Click on `main.py` file
1. Run lines below.
1. Then generate user interface according to the instructions in the file and the mockup above.

In [None]:
!pip install streamlit pyngrok --quiet

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.9/8.9 MB[0m [31m41.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m681.2/681.2 kB[0m [31m35.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m164.8/164.8 kB[0m [31m15.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m184.3/184.3 kB[0m [31m16.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.8/4.8 MB[0m [31m77.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m82.1/82.1 kB[0m [31m8.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.7/62.7 kB[0m [31m6.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for pyngrok (setup

In [None]:
from pyngrok import ngrok
ngrok.set_auth_token("2PbTPSScMjMdmG4IRQJD3t7hKRM_83GUaT86FFv6pKShrzkeR") # Add your auth token in here



In [16]:
!streamlit run /content/main.py --server.port 8000 & npx localtunnel --port 8000

[..................] / rollbackFailedOptional: verb npm-session c51aec2b64ad105[0m[K
Collecting usage statistics. To deactivate, set browser.gatherUsageStats to False.
[0m
[K[?25hnpx: installed 22 in 8.106s
your url is: https://twenty-wings-notice.loca.lt
[0m
[34m[1m  You can now view your Streamlit app in your browser.[0m
[0m
[34m  Network URL: [0m[1mhttp://172.28.0.12:8000[0m
[34m  External URL: [0m[1mhttp://35.237.175.122:8000[0m
[0m
2023-06-04 15:20:13.332 Uncaught app exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
  File "/content/main.py", line 53, in <module>
    st.session_state["bow"] = bow
NameError: name 'bow' is not defined
2023-06-04 15:25:19.732 Uncaught app exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/streamlit/runtime/scriptrunner/script_runner.py", line