<h1 style="text-align:center;color:red;">Wilco</h1>
<p style="text-align:center;">By Maycon Cypriano Batestin</p>


### About the Dataset

The objective of this project is to analyze the lyrics of the band Wilco throughout their career (the same approach applies to any other band with a long history) and to predict, using machine learning and NLP, when the group's next songs will arrive and what they will be like.

- **Original source:** Spotify
- **Released by:** Maycon Batestin
- **License:** Creative Commons Attribution-ShareAlike 4.0 International ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))



<h1 style="text-align:center;color:red;">Glossary</h1>


| Field        | Type   | Description                              |
|--------------|:------:|------------------------------------------|
| artist       | string | Name of the artist                       |
| album        | string | Name of the album                        |
| track        | string | Name of a song on the album              |
| year         | int    | Release year of the album                |
| lyrics       | string | Lyrics of the song                       |
| duration_ms  | int    | Duration of the song in milliseconds     |
| count_letter | int    | Number of letters in the song's lyrics   |
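
The schema in the glossary can be made concrete with a small typed record. This is only a sketch: `TrackRecord` is a hypothetical name that the notebook itself does not use, and the lyrics string is a placeholder rather than real lyrics.

```python
from dataclasses import dataclass


@dataclass
class TrackRecord:  # hypothetical helper that mirrors the glossary fields
    artist: str
    album: str
    track: str
    year: int
    lyrics: str
    duration_ms: int
    count_letter: int


placeholder_lyrics = "la la la"  # placeholder text, not actual lyrics
record = TrackRecord(
    artist="Wilco",
    album="Being There",
    track="Misunderstood",
    year=1996,
    lyrics=placeholder_lyrics,
    duration_ms=388267,
    # count_letter is derived from the lyrics: letters per word, summed
    count_letter=sum(len(word) for word in placeholder_lyrics.split()),
)
print(record.count_letter)
```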





<h1 style="text-align:center;color:red;">Getting the Dataset </h1>


In [52]:
artist = "Wilco"
artist = artist.replace(" ","_")

In [53]:
!clear
!python /Users/mayconcyprianobatestin/Documents/repositorios/DATA_SCIENCE/MUSIC/scripts/create_dataset.py $artist


  0%|                                                    | 0/20 [00:00<?, ?it/s]Searching for "Infinite Surprise" by wilco...
Done.
Searching for "Ten Dead" by wilco...
Done.
Searching for "Levee" by wilco...
Done.
Searching for "Evicted" by wilco...
Done.
Searching for "Sunlight Ends" by wilco...
Done.
Searching for "A Bowl and A Pudding" by wilco...
Done.
Searching for "Cousin" by wilco...
Done.
Searching for "Pittsburgh" by wilco...
Done.
Searching for "Soldier Child" by wilco...
  0%|                                                    | 0/20 [00:24<?, ?it/s]
Traceback (most recent call last):
  File "/Users/mayconcyprianobatestin/opt/anaconda3/lib/python3.9/site-packages/urllib3/connectionpool.py", line 466, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/Users/mayconcyprianobatestin/opt/anaconda3/lib/python3.9/site-packages/urllib3/connectionpool.py", line 461, in _make_request
    httplib_response = conn.getresponse()
  File "/Users/

<h1 style="text-align:center;color:red;">Libraries </h1>


In [54]:
### Libraries

import pandas as pd
import numpy as np
import plotly.graph_objs as go
import plotly.offline as pyo
import plotly.express as px
import plotly.subplots as sp
from plotly.subplots import make_subplots
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from wordcloud import WordCloud
import nltk 
from nltk import tokenize, RSLPStemmer
import matplotlib.pyplot as plt
from string import punctuation
import unidecode
import re




<h1 style="text-align:center;color:red;">Treating & Preparing Data </h1>


In [95]:
path = f'/Users/mayconcyprianobatestin/Documents/repositorios/DATA_SCIENCE/MUSIC/dataset/dataset_{artist.lower()}.csv'
df = pd.read_csv(path)

df = df[['album', 'track', 'year', 'lyrics', 'duration_ms']].groupby("year").apply(lambda x: x.sort_values('year')).reset_index(drop=True)

df.head()

Unnamed: 0,album,track,year,lyrics,duration_ms
0,Being There,Misunderstood,1996,17 ContributorsMisunderstood Lyrics[Intro]\nWh...,388267
1,Being There,Why Would You Wanna Live,1996,6 ContributorsWhy Would You Wanna Live Lyrics[...,256293
2,Being There,(Was I) In Your Dreams,1996,5 Contributors(Was I) In Your Dreams LyricsWas...,210867
3,Being There,Kingpin,1996,9 ContributorsKingpin LyricsI want to be your ...,316853
4,Being There,Someone Else's Song,1996,6 ContributorsSomeone Else’s Song LyricsI can'...,201480


In [96]:
# check for NaN values and drop the affected rows

def checkNAN(df):
    if df.isnull().values.any():
        df.dropna(inplace=True) 
        df.reset_index(drop=True, inplace=True)
        print("Checking for NaN values and fixing!.")
    else:
        print("There no NaN values on your dataset")

checkNAN(df)




Checking for NaN values and fixing!.


In [97]:
# removing duplicate rows

def remove_duplicates_from_dataframe(df):
    # inplace=True drops the duplicated rows from df itself;
    # the remaining rows keep their original index labels
    df.drop_duplicates(inplace=True)
    return df

remove_duplicates_from_dataframe(df)

Unnamed: 0,album,track,year,lyrics,duration_ms
0,Being There,Misunderstood,1996,17 ContributorsMisunderstood Lyrics[Intro]\nWh...,388267
1,Being There,Why Would You Wanna Live,1996,6 ContributorsWhy Would You Wanna Live Lyrics[...,256293
2,Being There,(Was I) In Your Dreams,1996,5 Contributors(Was I) In Your Dreams LyricsWas...,210867
3,Being There,Kingpin,1996,9 ContributorsKingpin LyricsI want to be your ...,316853
4,Being There,Someone Else's Song,1996,6 ContributorsSomeone Else’s Song LyricsI can'...,201480
...,...,...,...,...,...
289,Cousin,A Bowl and A Pudding,2023,2 ContributorsA Bowl and A Pudding Lyrics[Vers...,243493
290,Cousin,Cousin,2023,2 ContributorsCousin Lyrics[Verse 1]\nI cut in...,250640
291,Cousin,Pittsburgh,2023,"2 ContributorsPittsburgh Lyrics[Verse 1]\nOh, ...",313907
292,Cousin,Soldier Child,2023,3 ContributorsSoldier Child Lyrics[Verse]\nSo ...,257347
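
By default, `DataFrame.drop_duplicates` returns a new frame and leaves the original untouched, so the result has to be kept (or `inplace=True` passed) for the removal to take effect. A minimal sketch on made-up rows:

```python
import pandas as pd

# made-up rows: the first two are exact duplicates
toy = pd.DataFrame({"track": ["A", "A", "B"], "year": [1996, 1996, 1999]})

deduped = toy.drop_duplicates()  # new frame; toy itself is unchanged
print(len(toy), len(deduped))
print(list(deduped.index))       # surviving rows keep their original labels
```

Because the surviving rows keep their labels, any later step that drops rows by index label still refers to the same rows.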


In [98]:
# cleaning the lyrics column

def clean_lyrics(text):

    # entries that are documents rather than song lyrics are blanked out
    if text.startswith('Investigation of the Ferguson Police'):
        return ''
    if text.startswith('Manifesto of the Communist Party In'):
        return ''
    if text.startswith('FREEDOM! Contents Introduction 1.'):
        return ''
    cleaned_text = re.sub(r'\[[^\]]+\]', '', text)                # section tags such as [Intro]
    cleaned_text = re.sub(r'\d+ Contributors', '', cleaned_text)  # contributor count
    cleaned_text = re.sub(r'\\n', ' ', cleaned_text)              # literal backslash-n sequences
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)              # collapse whitespace and newlines
    # non-greedy match, so only the leading "<title> Lyrics" header is removed
    cleaned_text = re.sub(r'^.*?Lyrics', ' ', cleaned_text)
    return cleaned_text.strip()

df['lyrics'] = df['lyrics'].apply(clean_lyrics)
checkNAN(df)


There no NaN values on your dataset
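
To make the cleaning steps concrete, here is a condensed sketch of the same regex pipeline applied to one made-up string shaped like the raw `lyrics` values. It omits the `startswith` checks, and it uses a non-greedy match for the header so that a later occurrence of the word "Lyrics" inside a song would not be swallowed.

```python
import re


def clean_lyrics(text):
    text = re.sub(r'\[[^\]]+\]', '', text)        # drop section tags like [Intro]
    text = re.sub(r'\d+ Contributors', '', text)  # drop the contributor count
    text = re.sub(r'\s+', ' ', text)              # collapse whitespace and newlines
    text = re.sub(r'^.*?Lyrics', '', text)        # drop the "<title> Lyrics" header
    return text.strip()


# made-up example in the shape of the scraped text
raw = "17 ContributorsMisunderstood Lyrics[Intro]\nmade-up first line\nmade-up second line"
print(clean_lyrics(raw))
```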


In [99]:
# creating a new column with the letter count of each lyric

def count_letters(text):
    words = text.split()
    total_letters = sum(len(word) for word in words)
    return total_letters

df['count_letter'] = df['lyrics'].apply(count_letters)
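
A short worked example of the count: the text is split on whitespace and the lengths of the pieces are summed, so spaces are excluded but punctuation attached to a word is counted.

```python
def count_letters(text):
    # sum of the lengths of the whitespace-separated pieces
    return sum(len(word) for word in text.split())


sample = "I want to be your"
print(count_letters(sample))        # 1 + 4 + 2 + 2 + 4
print(count_letters("can't stop"))  # the apostrophe counts as a character
```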

In [100]:
# normalize each album by dropping selected rows (identified by index label)
df = df.drop([145, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 77, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 245, 246, 247])


<h1 style="text-align:center;color:red;">Visualization </h1>


### 
<p> In this visualization, we see Wilco's albums over the years, starting in 1996 with Being There and ending in 2023 with Cousin. </p>



In [112]:
def album_years(df):
    try:
        fig = px.scatter(df, x="year", color="album", symbol="album")
        fig.update_traces(marker_size=10)
        fig.update_layout(xaxis=dict(type='category'))
        fig.update_xaxes(tickangle=45)
        fig.update_layout(title_text='Albums over the years')
        fig.update_yaxes(tickangle=45, showticklabels=False, title_text='')

        return fig.show()
    except Exception as e:
        return f"Something went wrong with your dataframe: {e}"
album_years(df)

### 
<p> In this visualization, we see the number of tracks on each album across the band's releases. </p>

In [113]:
def track_album_years(df):
    try: 
        album_track_counts = df.groupby(['album', 'year']).size().reset_index(name='num_tracks')
        line = px.bar(album_track_counts, y='num_tracks', color='album')
        line.update_layout(xaxis=dict(type='category'))
        line.update_xaxes(tickangle=45, showticklabels=False, title_text=' ')
        line.update_layout(title_text='Tracks per album over the years')
        line.update_yaxes(tickangle=45, showticklabels=False, title_text='')
        for i, count in enumerate(album_track_counts['num_tracks']):
            # annotation text is a string property, so convert the count explicitly
            line.add_annotation(text=str(count), x=album_track_counts.index[i], y=count)
        line.update_traces(textposition='outside')
        line.add_annotation(
            text="Number of Tracks",
            xref="paper",
            yref="paper",
            x=0.5,  
            y=-0.1, 
            showarrow=False
        )
        
        line.show()
    except Exception as e:
        return f"Something went wrong: {str(e)}"

track_album_years(df[['album', 'track', 'year']])
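
The per-album counts behind the bar chart come from a `groupby(...).size()` call. A minimal sketch on three rows (album, track and year values taken from the tables shown earlier) illustrates the shape of the result:

```python
import pandas as pd

toy = pd.DataFrame({
    "album": ["Being There", "Being There", "Cousin"],
    "track": ["Misunderstood", "Kingpin", "Pittsburgh"],
    "year": [1996, 1996, 2023],
})

# one row per (album, year) pair, with the number of tracks in that group
counts = toy.groupby(["album", "year"]).size().reset_index(name="num_tracks")
print(counts)
```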



### 
<p> In this view, we see the ten longest and the ten shortest songs, each shown as a percentage of the combined duration of its group of ten. </p>

In [139]:
def create_pie_charts(df):
    def ms_to_min_sec(ms):
        minutes, seconds = divmod(ms // 1000, 60)
        return f'{minutes}m{seconds}s'

    # .copy() makes each top-10 frame independent of df, so adding a column
    # to it does not raise pandas' SettingWithCopyWarning
    df = df.sort_values(by='duration_ms', ascending=False)
    top_10_longest_tracks = df.head(10).copy()
    top_10_longest_tracks['duration_formatted'] = top_10_longest_tracks['duration_ms'].apply(ms_to_min_sec)
    df = df.sort_values(by='duration_ms')
    top_10_shortest_tracks = df.head(10).copy()
    top_10_shortest_tracks['duration_formatted'] = top_10_shortest_tracks['duration_ms'].apply(ms_to_min_sec)
    fig1 = px.pie(top_10_longest_tracks, names='track', values='duration_ms', color='track',
                 hover_data=['track', 'duration_formatted'],
                 title='Top 10 Longest Tracks')

    fig1.update_traces(textinfo='label+percent', pull=[0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    fig2 = px.pie(top_10_shortest_tracks, names='track', values='duration_ms',
                 hover_data=['track', 'duration_formatted'],
                 title='Top 10 Shortest Tracks')
    fig2.update_traces(textinfo='label+percent', pull=[0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

    return fig1, fig2
fig1, fig2 = create_pie_charts(df)
fig1.show()
fig2.show()



### 
<p> In this view, we see the total duration of each of the band's albums. </p>

In [141]:
def create_album_duration(df):
    def ms_to_min_sec(ms):
        minutes, seconds = divmod(ms // 1000, 60)
        return f'{minutes}m{seconds}s'

    # sum the track durations per album so each slice and its hover text show the album total
    df_album_duration = (
        df.groupby('album', as_index=False)['duration_ms'].sum()
          .sort_values(by='duration_ms', ascending=False)
    )
    df_album_duration['duration_formatted'] = df_album_duration['duration_ms'].apply(ms_to_min_sec)
    fig = px.pie(df_album_duration, names='album', values='duration_ms',
                 hover_data=['album', 'duration_formatted'],
                 title='Album Duration')
    # pull out only the first (longest) slice, whatever the number of albums
    pull = [0.1] + [0] * (len(df_album_duration) - 1)
    fig.update_traces(textinfo='label+percent', pull=pull)
    return fig.show()
create_album_duration(df)
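
The duration formatting used in these charts is plain integer arithmetic: milliseconds are truncated to whole seconds and then split into minutes and seconds. The sketch below runs it on three `duration_ms` values that appear in this notebook; `ms_to_clock` is a hypothetical zero-padded variant, not something the notebook defines.

```python
def ms_to_min_sec(ms):
    # truncate to whole seconds, then split into (minutes, seconds)
    minutes, seconds = divmod(ms // 1000, 60)
    return f'{minutes}m{seconds}s'


def ms_to_clock(ms):  # hypothetical variant with zero-padded seconds
    minutes, seconds = divmod(ms // 1000, 60)
    return f'{minutes}:{seconds:02d}'


for ms in (388267, 901333, 22880):
    print(ms, ms_to_min_sec(ms), ms_to_clock(ms))
```

Note that the original format does not pad the seconds, so 901333 ms is rendered as `15m1s`.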

### 
<p> In this visualization, we see the number of letters in each song's lyrics, grouped by album. </p>

In [63]:
# letter count per track, one funnel chart per album

def count_by_letter(df):
    name_album = df['album'].unique()
    for i in name_album:
        album_name = df[df['album'] == i]
        album_name = album_name.sort_values(by='count_letter', ascending=False)
        fig = px.funnel(album_name, x='count_letter', y='track', color="track")
        fig.update_xaxes(tickangle=45, showticklabels=False, title_text=' ')
        fig.update_yaxes(tickangle=45, showticklabels=False, title_text=' ')
        fig.update_traces(orientation='h')
        fig.add_annotation(
                    text=f"Album - {i}",
                    xref="paper",
                    yref="paper",
                    x=0.5,  
                    y=-0.1, 
                    showarrow=False
                )
        fig.show()

    return None
count_by_letter(df)


In [142]:
df.describe()

Unnamed: 0,year,duration_ms,count_letter
count,268.0,268.0,268.0
mean,2009.156716,232383.216418,1012.996269
std,8.387037,87743.499256,4749.449489
min,1996.0,22880.0,21.0
25%,2002.0,185586.5,481.0
50%,2009.0,218387.0,634.5
75%,2016.0,256433.0,848.0
max,2023.0,901333.0,77694.0
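
In the summary above, the mean of `count_letter` (about 1013) is well above the median (634.5), and the maximum is 77694. One common way to quantify this is Tukey's 1.5 × IQR rule. The notebook does not use it; it is brought in here only as an illustration, using the quartiles printed by `describe()`.

```python
# quartiles of count_letter, copied from the describe() output
q1, q3 = 481.0, 848.0
max_count = 77694.0

iqr = q3 - q1                  # interquartile range
upper_fence = q3 + 1.5 * iqr   # values above this are flagged by Tukey's rule
lower_fence = q1 - 1.5 * iqr

print(iqr, lower_fence, upper_fence)
print(max_count > upper_fence)
```

Under that rule, the maximum sits far above the upper fence, which suggests that a handful of entries are much longer than ordinary song lyrics.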


In [64]:
# accuracy
# features: bag-of-words over the album name; target: the release year

vetor = CountVectorizer()
bag = vetor.fit_transform(df['album'])
train, test, class_train, class_test = train_test_split(bag, df['year'])
regres_logistc = LogisticRegression()
regres_logistc.fit(train, class_train)
accuracy = regres_logistc.score(test, class_test)
final = {"accuracy": accuracy}
graph = pd.DataFrame.from_dict(final, orient='index', columns=['Value'])
graph.head()

fig = go.Figure()

fig.add_trace(go.Indicator(
        mode="number+gauge+delta",
        value=graph['Value'].iloc[0],  # positional access; the index label is 'accuracy'
        title={'text': "accuracy"},
        domain={'row': 0, 'column': 0}
    ))

fig.update_layout(
        title="Accuracy score",
        height=300,
    )
fig.update_traces(uirevision="top center")

fig.show()
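
For a classifier, scikit-learn's `score` method reports mean accuracy: the fraction of test samples whose predicted label equals the true label. A minimal sketch of that computation, with made-up labels and a hypothetical `accuracy` helper:

```python
def accuracy(y_true, y_pred):
    # fraction of positions where the prediction matches the true label
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# made-up labels for illustration only
y_true = [1996, 1999, 2002, 2023]
y_pred = [1996, 1999, 2004, 2023]
print(accuracy(y_true, y_pred))  # 3 of 4 predictions match
```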


In [85]:
# Encode each album as an integer id, in order of first appearance
df['album_id'] = pd.factorize(df['album'])[0]

reg = LinearRegression()

X = df[['album_id']].values
y = df['year'].values

reg.fit(X, y)


query = df[df['album'] == 'Cousin']
number_album = query.iloc[0]['album_id']
name = query.iloc[0]['album']
# Note: this evaluates the fitted line at Cousin's own id, not at the id of a future album
album_name = np.array([[number_album]])

# predict returns a one-element array; take the scalar before truncating to int
predicted_year = int(reg.predict(album_name)[0])

print(f"Based on the album {name}, Wilco will release the next album in the year {predicted_year}")

Based on the album Cousin, Wilco will release the next album in the year 2022
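
A fitted line only forecasts a future album when it is evaluated at an id beyond the last one seen. The sketch below shows this with a hand-written least-squares fit on made-up, evenly spaced release years. `fit_line` is a hypothetical helper, and the numbers are not Wilco's actual discography.

```python
def fit_line(xs, ys):
    # ordinary least squares for a single feature: y = slope * x + intercept
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


album_ids = [0, 1, 2, 3]
years = [1996, 1999, 2002, 2005]  # made-up years, three apart

slope, intercept = fit_line(album_ids, years)
next_id = album_ids[-1] + 1       # id of the album that does not exist yet
predicted = slope * next_id + intercept
print(round(predicted))
```

Evaluating the line at the last existing id would instead return the fitted year of an album that has already been released.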


In [93]:
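import pandas as pd
import numpy as np
from datetime import date
# statsmodels 0.13 removed statsmodels.tsa.arima_model; the current class lives here
from statsmodels.tsa.arima.model import ARIMA

current_year = date.today().year
# .copy() so that adding album_id below does not write into a slice of the old frame
df = df[df['year'] <= current_year].copy()

df['album_id'] = pd.factorize(df['album'])[0]

model = ARIMA(df['year'], order=(1, 1, 1))
model_fit = model.fit()  # the current API takes no disp argument

album_name = 'Cousin'
album_id = df[df['album'] == album_name].iloc[0]['album_id']
years_to_predict = range(current_year + 1, current_year + 6)

for year in years_to_predict:
    forecast = model_fit.forecast(steps=1)
    # np.asarray handles both Series and array results
    predicted_year = int(np.asarray(forecast)[0])
# Note: this print runs once, after the loop, and does not include predicted_year
print(f"Based on the album {album_name}, the forecast for the year {year}")


Based on the album Cousin, the forecast for the year 2028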



An unsupported index was provided and will be ignored when e.g. forecasting.


An unsupported index was provided and will be ignored when e.g. forecasting.

