## Roadmap
    
### Sources

    Collect sentiment dictionary
    - positive-words.txt
    - negative-words.txt
    
    Collect the stemmed word -> unstemmed word mapping
    - mxm_reverse_mapping.txt
    
    Collect lyrics bag-of-words data file
    - mxm_dataset.db
    
    
    
### `mood_df`
#### Map each stemmed word to a mood score, if available
    
    * INPUTS
    
      + STEMMED   | UNSTEMMED    mxm_reverse_mapping.txt
      + UNSTEMMED | MOOD (=1)    positive-words.txt
      + UNSTEMMED | MOOD (=-1)   negative-words.txt
            
            
    * OUTPUT
    
      + STEMMED | MOOD
            
           
           
### `lyrics_df`
#### DataFrame with one row per track: counts of positive words, negative words, and words with no mood info

    * INPUTS
    
      + TRACK_ID | STEMMED | COUNT | ISTEST      mxm_dataset.db        
      + STEMMED | MOOD                           mood_df
           
           
    * OUTPUT
    
      + TRACK_ID | COUNT_POSITIVE | COUNT_NEGATIVE | COUNT_OTHER
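
The aggregation outlined above could be sketched as follows. This is only an illustration under assumptions: the column names (`track_id`, `word`, `count`, `Stemmed`, `Mood`) and the helper `count_moods` are hypothetical, and the small in-line frames stand in for the `lyrics` table and `mood_df`.

```python
import pandas as pd

def count_moods(lyrics, mood):
    """Per-track totals of positive, negative, and unscored word counts.

    Hypothetical helper. Assumed columns:
      lyrics: track_id, word (stemmed), count
      mood:   Stemmed, Mood (1 or -1)
    """
    # Keep one mood per stem (the first one seen), so the left join
    # cannot duplicate lyric rows
    mood_unique = mood.drop_duplicates(subset='Stemmed')
    merged = lyrics.merge(mood_unique, how='left', left_on='word', right_on='Stemmed')

    # Route each row's count into exactly one of the three columns
    merged['COUNT_POSITIVE'] = merged['count'].where(merged['Mood'] > 0, 0)
    merged['COUNT_NEGATIVE'] = merged['count'].where(merged['Mood'] < 0, 0)
    merged['COUNT_OTHER'] = merged['count'].where(merged['Mood'].isna(), 0)

    cols = ['COUNT_POSITIVE', 'COUNT_NEGATIVE', 'COUNT_OTHER']
    return merged.groupby('track_id', as_index=False)[cols].sum()

# Toy stand-ins for the lyrics table and mood_df
lyrics = pd.DataFrame({'track_id': ['T1', 'T1', 'T1', 'T2'],
                       'word': ['pardon', 'hate', 'yellow', 'hate'],
                       'count': [3, 1, 10, 2]})
mood = pd.DataFrame({'Stemmed': ['pardon', 'hate'], 'Mood': [1, -1]})

counts = count_moods(lyrics, mood)
print(counts)
```

A left join is used so that stems without a mood score are kept and end up in `COUNT_OTHER` instead of being dropped.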


### `final_df`

    * INPUT
    
      + TRACK_ID | COUNT_POSITIVE | COUNT_NEGATIVE | COUNT_OTHER      lyrics_df
      
      
    * OUTPUT
    
      + TRACK_ID | TRACK_MOOD  lyrics_terms_df
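
The roadmap does not state how `TRACK_MOOD` is derived from the three counts. As one possible reading, the sketch below takes the sign of `COUNT_POSITIVE - COUNT_NEGATIVE`, giving 1 (mostly positive), -1 (mostly negative), or 0 (tie or no scored words). Both this rule and the helper name `track_mood` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def track_mood(counts):
    """Assumed rule: TRACK_MOOD = sign(COUNT_POSITIVE - COUNT_NEGATIVE)."""
    diff = counts['COUNT_POSITIVE'] - counts['COUNT_NEGATIVE']
    return pd.DataFrame({'TRACK_ID': counts['TRACK_ID'],
                         'TRACK_MOOD': np.sign(diff).astype(int)})

# Toy input with the columns listed under INPUT
example_counts = pd.DataFrame({'TRACK_ID': ['A', 'B', 'C'],
                               'COUNT_POSITIVE': [5, 1, 0],
                               'COUNT_NEGATIVE': [2, 4, 0],
                               'COUNT_OTHER': [40, 30, 12]})

moods = track_mood(example_counts)
print(moods['TRACK_MOOD'].tolist())  # [1, -1, 0]
```

`COUNT_OTHER` is ignored by this rule; a variant could use it to normalize the difference by track length.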


## Data Sources

### Opinion Lexicon: Positive & Negative

`positive-words.txt` contains a list of POSITIVE opinion words (or sentiment words).
`negative-words.txt` contains a list of NEGATIVE opinion words (or sentiment words).

These files and the papers cited below can all be downloaded from
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

Citations:

+ Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.
+ Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web." Proceedings of the 14th International World Wide Web conference (WWW-2005), May 10-14, 2005, Chiba, Japan.

### Mapping Stemmed word -> Unstemmed word

`mxm_reverse_mapping.txt` maps each stemmed word to an unstemmed form, one `stemmed<SEP>unstemmed` pair per line.

This file can be downloaded from 
http://labrosa.ee.columbia.edu/millionsong/sites/default/files/mxm_reverse_mapping.txt

Citation:
musiXmatch dataset, the official lyrics collection for the Million Song Dataset.
Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. "The Million Song Dataset."

### Download sources under `<path>`:

+ [mxm_reverse_mapping.txt]:
http://labrosa.ee.columbia.edu/millionsong/sites/default/files/mxm_reverse_mapping.txt

+ [positive-words.txt] + [negative-words.txt]:            
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
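
Before running the cells that follow, it can help to confirm that every source file is actually present under `<path>`. A minimal sketch, where `missing_sources` is a hypothetical helper and the file list is taken from the Sources part of the roadmap:

```python
import os

# File names as listed in the roadmap
REQUIRED_FILES = ['positive-words.txt',
                  'negative-words.txt',
                  'mxm_reverse_mapping.txt',
                  'mxm_dataset.db']

def missing_sources(path, required=REQUIRED_FILES):
    """Return the required file names that are not found under `path`."""
    return [name for name in required
            if not os.path.isfile(os.path.join(path, name))]
```

Calling `missing_sources(path)` with the data directory returns an empty list when everything is in place.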

### Preparation of the 'mood' data using Pandas

In [1]:
import re
import pandas as pd

path = '/home/eolus/Documents/MA755_data/LyricsData/'

# Read the stemmed/unstemmed pairs (one `stemmed<SEP>unstemmed` entry per line) into a list
with open(path + 'mxm_reverse_mapping.txt', 'r') as f_1:
    lines_1 = [line.rstrip('\n').split('<SEP>') for line in f_1]

# Put list in pandas df
df_1 = pd.DataFrame(lines_1, columns=['Stemmed', 'Unstemmed'])

# Keep only the stems made up entirely of ASCII letters
df_stem_mapping = df_1[df_1['Stemmed'].str.match(r'^[a-zA-Z]+$', na=False)]
df_stem_mapping.head()

|   | Stemmed | Unstemmed |
|---|---------|-----------|
| 1 | pido    | pido      |
| 2 | hatr    | hatred    |
| 3 | pide    | pide      |
| 4 | yellow  | yellow    |
| 5 | four    | four      |


In [2]:
# Read the positive and negative unstemmed words into lists (one word per line)
with open(path + 'positive-words.txt', 'r', encoding='ISO-8859-1') as f_2:
    lines_2 = [line.rstrip('\n') for line in f_2]

with open(path + 'negative-words.txt', 'r', encoding='ISO-8859-1') as f_3:
    lines_3 = [line.rstrip('\n') for line in f_3]

In [6]:
# Put the positive words into a pandas DataFrame and tag them with Mood = 1
df_2 = pd.DataFrame(lines_2, columns=['Unstemmed'])
df_2['Mood'] = 1
df_2.head()

|   | Unstemmed | Mood |
|---|-----------|------|
| 0 | a+        | 1    |
| 1 | abound    | 1    |
| 2 | abounds   | 1    |
| 3 | abundance | 1    |
| 4 | abundant  | 1    |


In [5]:
# Put the negative words into a pandas DataFrame and tag them with Mood = -1
df_3 = pd.DataFrame(lines_3, columns=['Unstemmed'])
df_3['Mood'] = -1
df_3.head()

|   | Unstemmed  | Mood |
|---|------------|------|
| 0 | 2-faced    | -1   |
| 1 | 2-faces    | -1   |
| 2 | abnormal   | -1   |
| 3 | abolish    | -1   |
| 4 | abominable | -1   |


In [9]:
# Stack the positive and negative DataFrames on top of each other
# (each keeps its own row labels, so the index restarts at 0 for the negative words)
df_mood = pd.concat([df_2, df_3], axis=0)
df_mood.tail()

|      | Unstemmed | Mood |
|------|-----------|------|
| 4778 | zaps      | -1   |
| 4779 | zealot    | -1   |
| 4780 | zealous   | -1   |
| 4781 | zealously | -1   |
| 4782 | zombie    | -1   |
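
Note that `pd.concat` keeps each frame's original row labels, so the labels in `df_mood` restart at 0 where the negative words begin and are not unique. If unique labels are preferred, `ignore_index=True` renumbers the rows. A small sketch on toy data (the two frames are illustrative stand-ins for `df_2` and `df_3`):

```python
import pandas as pd

# Toy stand-ins for the positive and negative frames
pos = pd.DataFrame({'Unstemmed': ['abound', 'abundant'], 'Mood': 1})
neg = pd.DataFrame({'Unstemmed': ['abnormal', 'abolish', 'zombie'], 'Mood': -1})

# Default behaviour: row labels of each input are preserved
stacked = pd.concat([pos, neg], axis=0)
# With ignore_index=True the result gets a fresh 0..n-1 index
renumbered = pd.concat([pos, neg], axis=0, ignore_index=True)

print(list(stacked.index))     # [0, 1, 0, 1, 2]
print(list(renumbered.index))  # [0, 1, 2, 3, 4]
```

The duplicated labels are harmless for the merge on `Unstemmed`, which ignores the index.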


In [14]:
# Outer join of df_stem_mapping and df_mood on the `Unstemmed` column
df_outter_join = pd.merge(df_stem_mapping, df_mood, on='Unstemmed', how='outer')
df_outter_join.head()

|   | Stemmed | Unstemmed | Mood |
|---|---------|-----------|------|
| 0 | pido    | pido      | NaN  |
| 1 | hatr    | hatred    | -1.0 |
| 2 | pide    | pide      | NaN  |
| 3 | yellow  | yellow    | NaN  |
| 4 | four    | four      | NaN  |


In [11]:
# Drop rows with NaN in `Stemmed` (word not in the lyrics bag-of-words) or in `Mood` (word not in the sentiment lexicon)
df_stem = df_outter_join[df_outter_join.Stemmed.notnull() & df_outter_join.Mood.notnull()]
print('Stemmed words tagged with a mood value: %d' %(len(df_stem.index)))
df_stem.head()

Stemmed words tagged with a mood value: 737


|    | Stemmed | Unstemmed | Mood |
|----|---------|-----------|------|
| 1  | hatr    | hatred    | -1   |
| 7  | thirst  | thirst    | -1   |
| 10 | hate    | hate      | -1   |
| 17 | pardon  | pardon    | 1    |
| 20 | sorri   | sorry     | -1   |
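
Since both the unmatched stems and the unmatched lexicon words are discarded, the outer join followed by the two `notnull()` filters should select the same rows as an inner join on `Unstemmed`. The sketch below checks this on toy frames (the data are illustrative, not read from the real files):

```python
import pandas as pd

# Toy stand-ins for df_stem_mapping and df_mood
stem_map = pd.DataFrame({'Stemmed': ['hatr', 'yellow', 'sorri'],
                         'Unstemmed': ['hatred', 'yellow', 'sorry']})
mood = pd.DataFrame({'Unstemmed': ['hatred', 'sorry', 'abound'],
                     'Mood': [-1, -1, 1]})

# Approach used in this notebook: outer join, then drop incomplete rows
outer = pd.merge(stem_map, mood, on='Unstemmed', how='outer')
filtered = outer[outer.Stemmed.notnull() & outer.Mood.notnull()]

# Alternative: inner join keeps only words present in both frames
inner = pd.merge(stem_map, mood, on='Unstemmed', how='inner')

same_rows = (set(zip(filtered.Stemmed, filtered.Mood))
             == set(zip(inner.Stemmed, inner.Mood)))
print(same_rows)  # True
```

One practical difference: the inner join never introduces NaN, so `Mood` stays an integer column, whereas the outer join converts it to float.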


In [12]:
# Save `df_stem` to pickle file:
save_load_path = '/home/eolus/Documents/MA755_data/myPickles'
df_stem.to_pickle(save_load_path+'/df_stem.pkl')
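
A later session can reload the frame with `pd.read_pickle`. The round trip below uses a temporary directory and a toy frame so that it runs anywhere; with the real data, the file would be read from `save_load_path`.

```python
import os
import tempfile
import pandas as pd

# Toy frame with the same columns as df_stem
df = pd.DataFrame({'Stemmed': ['hatr', 'pardon'],
                   'Unstemmed': ['hatred', 'pardon'],
                   'Mood': [-1, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_file = os.path.join(tmp_dir, 'df_stem.pkl')
    df.to_pickle(pkl_file)               # save
    restored = pd.read_pickle(pkl_file)  # load

print(restored.equals(df))  # True
```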







### Preparation of the 'lyrics' data using SQLite and Pandas

In [None]:
import sqlite3

# Define path
lyrics_path = '/home/eolus/Documents/MA755_data/LyricsData'
pickle_path = '/home/eolus/Documents/MA755_data/myPickles'

# Load the first 100 rows of the `lyrics` table from `mxm_dataset.db` into the lyrics_df pandas DataFrame
con = sqlite3.connect(lyrics_path +'/mxm_dataset.db')
lyrics_df = pd.read_sql_query("SELECT * from lyrics LIMIT 100", con)
con.close()

In [None]:
# Save the lyrics_df to .pkl file 
lyrics_df.to_pickle(pickle_path+'/df_lyrics.pkl')
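
The `LIMIT 100` query earlier in this section only loads a sample. To process the whole table without holding it in memory at once, `pd.read_sql_query` accepts a `chunksize` argument and then yields one DataFrame per chunk. The sketch below demonstrates this against an in-memory SQLite database; the table layout (`track_id`, `word`, `count`, `is_test`) is an assumption modeled on the roadmap columns, not read from `mxm_dataset.db`.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for mxm_dataset.db (assumed schema)
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE lyrics (track_id TEXT, word TEXT, count INTEGER, is_test INTEGER)")
rows = [('T1', 'hate', 2, 0), ('T1', 'yellow', 5, 0), ('T2', 'pardon', 1, 0),
        ('T2', 'hate', 3, 1), ('T3', 'four', 4, 0)]
con.executemany("INSERT INTO lyrics VALUES (?, ?, ?, ?)", rows)
con.commit()

total_rows = 0
words_per_track = {}
# With chunksize set, read_sql_query returns an iterator of DataFrames
for chunk in pd.read_sql_query("SELECT * FROM lyrics", con, chunksize=2):
    total_rows += len(chunk)
    # Accumulate per-track word totals across chunks
    for track_id, n in chunk.groupby('track_id')['count'].sum().items():
        words_per_track[track_id] = words_per_track.get(track_id, 0) + int(n)
con.close()

print(total_rows)  # 5
```

Because a track's rows can be split across chunks, per-track results have to be accumulated across iterations, as the dictionary does here.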