# Lyrics sentiment analysis and prediction using PySpark

https://www.kaggle.com/datasets/cakiki/muse-the-musical-sentiment-dataset

https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information

This notebook reads two CSV files: one with song lyrics and one with songs classified by sentiment.


## Adding dependencies

In [1]:
from IPython import display
import math
import pandas as pd
import numpy as np

from pyspark.sql import SQLContext
from pyspark import SparkContext
from pyspark.sql.types import *

## Creating Spark context

In [2]:
sc = SparkContext()
sqlContext = SQLContext(sc)

24/03/21 12:36:45 WARN Utils: Your hostname, af-Inspiron-7566 resolves to a loopback address: 127.0.1.1; using 192.168.1.65 instead (on interface wlp3s0)
24/03/21 12:36:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/21 12:36:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Creating Schema for Emotions file

In [3]:
shcemaEmotions = StructType([
    StructField("lastfm_url", StringType()),
    StructField("title", StringType()),
    StructField("artist", StringType()),
    StructField("seeds", StringType()),
    StructField("number_of_emotion_tags", StringType()),
    StructField("valence_tags", StringType()),
    StructField("arousal_tags", StringType()),
    StructField("dominance_tags", StringType()),
    StructField("mbid", StringType()),
    StructField("spotify_id", StringType()),
    StructField("genre", StringType())
])

## Creating Schema for Lyrics file

In [4]:
schemaLyrics = StructType([
    StructField("title", StringType()),
    StructField("tag", StringType()),
    StructField("artist", StringType()),
    StructField("year", StringType()),
    StructField("views", StringType()),
    StructField("features", StringType()),
    StructField("lyrics", StringType()),
    StructField("id", StringType()),
    StructField("language_cld3", StringType()),
    StructField("language_ft", StringType()),
    StructField("language", StringType())
])

## Definition of files

In [5]:
# song_lyrics.csv contains data for 3093218 songs
classificationFile = '/home/af/Desktop/Spark/songs_clasification.csv'
lyricsFile = '/home/af/Desktop/Spark/song_lyrics.csv'
outputFile = '/home/af/Desktop/Spark/output.csv'

## Reading CSV files for Emotions and Lyrics

In [6]:
dfE = (
    sqlContext.read.format("csv")
    .option("header", "true")
    .schema(shcemaEmotions)
    .load(classificationFile)
)  # .limit(2000)
dfL = (
    sqlContext.read.format("csv")
    .option("ignoreLeadingWhiteSpace", "true")
    .option("multiline", "true")
    .option("quote", '"')
    .option("escape", '"')
    .option("header", "true")
    .schema(schemaLyrics)
    .load(lyricsFile)
)  # .limit(2000)

In [7]:
columns_to_drop = ['lastfm_url', 'mbid', 'spotify_id']
dfE = dfE.drop(*columns_to_drop)
dfE.show(5)
dfE.count()

+----------------+---------+--------------------+----------------------+-----------------+------------------+-----------------+-------+
|           title|   artist|               seeds|number_of_emotion_tags|     valence_tags|      arousal_tags|   dominance_tags|  genre|
+----------------+---------+--------------------+----------------------+-----------------+------------------+-----------------+-------+
|'Till I Collapse|   Eminem|      ['aggressive']|                     6|             4.55| 5.273124999999999|         5.690625|    rap|
|       St. Anger|Metallica|      ['aggressive']|                     8|             3.71| 5.832999999999999|5.427250000000002|  metal|
|        Speedin'|Rick Ross|      ['aggressive']|                     1|             3.08|              5.87|             5.49|    rap|
|    Bamboo Banga|   M.I.A.|['aggressive', 'f...|                    13|6.555071428571428|5.5372142857142865|5.691357142857143|hip-hop|
|      Die MF Die|     Dope|      ['aggressive']

90001
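
The `seeds` column is read as a plain string that looks like a Python list (for example `['aggressive']`). If the individual tags are needed later, one option is to parse that string back into a list. The sketch below is a minimal, standard-library helper; the name `parse_seeds` is hypothetical and not part of the notebook, and it assumes every value is either a valid list literal or something that should be treated as "no tags".

```python
import ast


def parse_seeds(raw):
    """Turn a string such as "['aggressive', 'fun']" into a list of tags.

    Returns an empty list when the value is missing or is not a list literal.
    """
    if raw is None:
        return []
    try:
        # literal_eval only evaluates literals, so it is safe on untrusted text.
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(value, (list, tuple)):
        return []
    return [str(tag) for tag in value]


print(parse_seeds("['aggressive', 'fun']"))
print(parse_seeds("not a list"))
```

A function like this could be wrapped with `pyspark.sql.functions.udf` to apply it to the DataFrame column, at the cost of running Python code per row.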

In [8]:
dfL = dfL.where(dfL.language == "en")
columns_to_drop = ['views', 'tag', 'features', 'id', 'language_cld3', 'language_ft', 'language']
dfL = dfL.drop(*columns_to_drop)
dfL.show(5)
dfL.count()

+-----------------+---------+----+--------------------+
|            title|   artist|year|              lyrics|
+-----------------+---------+----+--------------------+
|        Killa Cam|  Cam'ron|2004|[Chorus: Opera St...|
|       Can I Live|    JAY-Z|1996|[Produced by Irv ...|
|Forgive Me Father| Fabolous|2003|Maybe cause I'm e...|
|     Down and Out|  Cam'ron|2004|[Produced by Kany...|
|           Fly In|Lil Wayne|2005|[Intro]\nSo they ...|
+-----------------+---------+----+--------------------+
only showing top 5 rows

3374198
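
The lyrics shown above contain bracketed annotations such as `[Intro]` or `[Produced by ...]` that are not part of the sung text. Whether removing them helps a sentiment model is an assumption, not something this notebook tests. A minimal standard-library sketch of such a cleanup step follows; `strip_section_markers` is a hypothetical helper and the sample text is made up for illustration.

```python
import re

# Matches single-line bracketed annotations such as "[Intro]" or "[Verse 1]".
SECTION_MARKER = re.compile(r"\[[^\[\]\n]*\]")


def strip_section_markers(lyrics):
    """Remove bracketed annotations and drop the empty lines they leave behind."""
    if lyrics is None:
        return ""
    without_markers = SECTION_MARKER.sub("", lyrics)
    lines = [line.strip() for line in without_markers.splitlines()]
    return "\n".join(line for line in lines if line)


# Made-up sample, not taken from the dataset.
sample = "[Intro]\nfirst line\n\n[Verse 1]\nsecond line"
print(strip_section_markers(sample))
```

Text in parentheses is left untouched, since it may be part of the lyrics themselves.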

## Joining Emotions and Lyrics on artist and title

In [9]:
innerJoin = dfE.join(dfL, ["artist", "title"],"inner")
innerJoin.show(5)

+------------+--------------------+------------+----------------------+-----------------+------------------+-----------------+-----------------+----+--------------------+
|      artist|               title|       seeds|number_of_emotion_tags|     valence_tags|      arousal_tags|   dominance_tags|            genre|year|              lyrics|
+------------+--------------------+------------+----------------------+-----------------+------------------+-----------------+-----------------+----+--------------------+
|     Afroman|                Hush|['positive']|                     1|             7.57|               5.5|             7.26|          hip-hop|2000|[Hook] (Afroman t...|
|  Aimee Mann|              You Do|  ['smooth']|                    15|5.512301587301589|3.2575396825396825|5.478571428571429|singer-songwriter|1999|[Verse 1]\nYou st...|
|  Air Supply|Even the Nights A...|['romantic']|                     4|7.420000000000001|            4.9625|5.911666666666666|        soft rock|1

                                                                                

In [10]:
from pyspark.sql.functions import col
innerJoin.groupBy("artist").count().orderBy(col("count").desc()).show(50)

+--------------------+-----+
|              artist|count|
+--------------------+-----+
|           Bob Dylan|   82|
|        Warren Zevon|   76|
|They Might Be Giants|   74|
|     Robbie Williams|   73|
|            The Cure|   73|
|           Radiohead|   67|
|         The Beatles|   67|
|Manic Street Prea...|   63|
|       Chelsea Wolfe|   59|
|  The Mountain Goats|   59|
|         of Montreal|   58|
|           Kate Bush|   55|
|           Tori Amos|   54|
|       Elliott Smith|   53|
|    Barenaked Ladies|   53|
|         Bright Eyes|   52|
|   Animal Collective|   51|
|         Yo La Tengo|   51|
|      Regina Spektor|   50|
|       Kylie Minogue|   49|
|           Tom Waits|   49|
| The Magnetic Fields|   48|
|           Cat Power|   47|
|            Coldplay|   46|
|                Cold|   46|
|         David Bowie|   46|
|        Jack Johnson|   45|
|                Beck|   45|
|      Sufjan Stevens|   45|
|        The National|   43|
|The Smashing Pump...|   42|
|        Lana 

                                                                                

In [11]:
innerJoin.count()

                                                                                

27349
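
The join above compares `artist` and `title` exactly, so differences in case, accents, or spacing between the two datasets prevent a match. How many songs are lost for that reason here is not known. One possible mitigation is to join on normalized keys instead; the sketch below shows such a normalization with the standard library only. `normalize_key` is a hypothetical helper and the example names are illustrative.

```python
import re
import unicodedata


def normalize_key(text):
    """Case-fold, strip accents, and collapse whitespace for caseless matching."""
    if text is None:
        return ""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() intended for caseless comparison.
    return re.sub(r"\s+", " ", no_accents).strip().casefold()


print(normalize_key("  JAY-Z "))
print(normalize_key("Beyoncé") == normalize_key("BEYONCE"))
```

Normalization can also merge names that are genuinely different, so the resulting matches would be worth spot-checking before training on them.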

## Generating file to train model

In [12]:
innerJoin.toPandas().to_csv(outputFile, index=False)

                                                                                

In [13]:
innerJoin.printSchema()

root
 |-- artist: string (nullable = true)
 |-- title: string (nullable = true)
 |-- seeds: string (nullable = true)
 |-- number_of_emotion_tags: string (nullable = true)
 |-- valence_tags: string (nullable = true)
 |-- arousal_tags: string (nullable = true)
 |-- dominance_tags: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- year: string (nullable = true)
 |-- lyrics: string (nullable = true)
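
Every column in the schema above is a string, so the exported CSV carries the valence, arousal, and dominance scores as text. In Spark the usual remedy would be a `cast` on each numeric column before exporting. As an alternative, the sketch below converts the values when the file is read back for training, using only the standard library. `to_float` and `read_typed_rows` are hypothetical helpers, the in-memory sample stands in for `output.csv`, and treating `number_of_emotion_tags` as a float is a simplification.

```python
import csv
import io

NUMERIC_COLUMNS = (
    "number_of_emotion_tags",
    "valence_tags",
    "arousal_tags",
    "dominance_tags",
)


def to_float(value):
    """Return value as a float, or None when it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def read_typed_rows(handle):
    """Yield CSV rows as dicts with the numeric columns converted to floats."""
    for row in csv.DictReader(handle):
        for name in NUMERIC_COLUMNS:
            if name in row:
                row[name] = to_float(row[name])
        yield row


# Tiny in-memory stand-in for output.csv, using values from the join output.
sample = io.StringIO(
    "artist,title,valence_tags,arousal_tags\n"
    "Afroman,Hush,7.57,5.5\n"
)
rows = list(read_typed_rows(sample))
print(rows[0]["valence_tags"], rows[0]["arousal_tags"])
```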

