# Steps in Creating a Data Science Project

1. Defining the problem
**Technical problem description**

2. Getting the data
**Setup environment**
* Import libraries
* Define constants and control variables

3. Exploratory data analysis

4. Preparing the data for Machine Learning Algorithms
* Data cleaning
* Feature engineering
* Data preprocessing

5. Creating and evaluating multiple machine learning models

6. Tuning and selecting final models

7. Presenting findings and/or solutions

8. Launch, monitoring, and maintenance of system

# 1. Defining the problem

    **Technical problem description:**
    
    We will use various NLP machine learning models in an attempt to analyze song lyrics and identify the artist who authored them.

# 2. Getting the data

# Setup environment

    # Import libraries

In [86]:
import pandas as pd

# used to split our data into test and training sets
from sklearn.model_selection import train_test_split

# a simple "Count" vectorizer
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.svm import LinearSVC

from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import classification_report

from sklearn.metrics import confusion_matrix

# library helper
# run: importnb-install from Conda before using
from importnb import Notebook
with Notebook(): 
    import Utility

# custom helper class (from jupyter notebook)
helper = Utility.Helper()

Class 'Helper' v1.3 has been loaded


    Function to reload changes in Jupyter notebooks

In [87]:
# reload changes in Jupyter notebooks
from importlib import reload
with Notebook():
    reload(Utility)

    Define constants and control variables

In [88]:
DATA_PATH = '../../data/'

FIGSIZE = (20,15)

In [89]:
lyrics_file = DATA_PATH + "artist_scrape_with_lyrics.csv"
    
# lyrics_df = helper.create_df(lyrics_file)

# temporary call: the data has not been fully scraped into the CSV yet, so the lyrics column holds str values (where lyrics have been scraped) and NaN values (where scraping is pending)
lyrics_df = pd.read_csv(lyrics_file, dtype={'Song Lyrics': str})
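
    As a minimal sketch of what the dtype argument does here: pandas still reads empty cells as NaN when a column is forced to str, while numeric-looking values stay strings. The three-row CSV below is a hypothetical stand-in for the scraped file, not real data.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the scraped CSV (not real data)
sample_csv = io.StringIO(
    "Artist Name,Song Title,Song Lyrics\n"
    "Artist A,Song 1,La la la\n"
    "Artist A,Song 2,\n"
    "Artist B,Song 3,1999\n"
)

sample_df = pd.read_csv(sample_csv, dtype={"Song Lyrics": str})

# Empty cells are still read as NaN, even though the column is typed as str
missing_count = int(sample_df["Song Lyrics"].isna().sum())

# A numeric-looking lyric stays a string instead of being parsed as a number
numeric_looking = sample_df.loc[2, "Song Lyrics"]

print(missing_count, repr(numeric_looking))
```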

# 3. Exploratory data analysis

In [90]:
lyrics_df.head(-10)

| | Artist Name | Artist URL | Song Title | Song URL | Song Lyrics |
|---|---|---|---|---|---|
| 0 | A B | artist/A-B/472398 | Con el Tic Tac del Reloj | /lyric/3455846/A+B/Con+el+Tic+Tac+del+Reloj | Era tan facil soñar Que te podias quedar Eras... |
| 1 | A Bad Think | artist/A-Bad-Think/2137849593 | Now You Know | /lyric/36417131/A+Bad+Think/Now+You+Know | Burning candles at all ends, to mend your hear... |
| 2 | A Baffled Republic | artist/A-Baffled-Republic/2137849643 | Bad Boys (Move in Silence) | /lyric/2262594/A+Baffled+Republic/Bad+Boys+%28... | Why, why, why, why Why, why, why, why Why, wh... |
| 3 | A Banca 021 | artist/A-Banca-021/2137850524 | Cor de Mel | /lyric/37798632/A+Banca+021/Cor+de+Mel | Você me lembra o azul do céu Pintando as nuven... |
| 4 | A Band Called "O" | artist/A-Band-Called-%22O%22/19641 | Sleeping | /lyric/637199/A+Band+Called+%22O%22/Sleeping | For the life we chose in the evening we rose, ... |
| ... | ... | ... | ... | ... | ... |
| 1352573 | ZZ Ward | artist/ZZ-Ward/2633167 | Ride | /lyric/33934100/ZZ+Ward/Ride | |
| 1352574 | ZZ Ward | artist/ZZ-Ward/2633167 | Ride [*] | /lyric/34157509/ZZ+Ward/Ride+%5B%2A%5D | |
| 1352575 | ZZ Ward | artist/ZZ-Ward/2633167 | Love 3x | /lyric/31899204/ZZ+Ward/Love+3x | |
| 1352576 | ZZ Ward | artist/ZZ-Ward/2633167 | LOVE 3X | /lyric/32112984/ZZ+Ward/LOVE+3X | |


In [91]:
lyrics_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1352588 entries, 0 to 1352587
Data columns (total 5 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   Artist Name  1352588 non-null  object
 1   Artist URL   1352588 non-null  object
 2   Song Title   1352588 non-null  object
 3   Song URL     1352588 non-null  object
 4   Song Lyrics  82102 non-null    object
dtypes: object(5)
memory usage: 51.6+ MB
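
    The info() output above shows that only a small share of rows has lyrics so far. A quick calculation with the two counts reported there (the variable names are illustrative only):

```python
# Counts taken from the info() output above
total_rows = 1_352_588
rows_with_lyrics = 82_102

# Share of rows whose lyrics have already been scraped
coverage = rows_with_lyrics / total_rows

print(f"{coverage:.2%} of rows have lyrics")
```

    Roughly 6% of the songs are usable for now, so any result at this stage reflects a small, alphabetically early slice of the artists.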


    Temporary data split, as dataframe has not been fully populated yet

In [92]:
# create dataframe where Song Lyrics are not null
lyrics_df = lyrics_df.dropna(subset=['Song Lyrics'])

In [93]:
lyrics = lyrics_df['Song Lyrics']
artists = lyrics_df['Artist Name']

X_train, X_test, y_train, y_test = train_test_split(lyrics, artists, test_size=0.1, random_state=1337)
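
    With a random 10% test split, artists who have only a handful of songs can end up entirely on one side of the split, which would be consistent with the support values of 0 and 1 in the reports further down. A minimal sketch for checking how many songs each artist contributes, using a hypothetical toy series in place of lyrics_df['Artist Name']:

```python
import pandas as pd

# Hypothetical artist labels standing in for lyrics_df['Artist Name']
toy_artists = pd.Series(["A", "A", "A", "B", "B", "C"], name="Artist Name")

# Number of songs per artist, most frequent first
songs_per_artist = toy_artists.value_counts()

# Artists with a single song cannot appear in both the training and the test set
single_song_artists = songs_per_artist[songs_per_artist == 1].index.tolist()

print(songs_per_artist.to_dict())
print(single_song_artists)
```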

# 4. Preparing the data for Machine Learning Algorithms

# Data cleaning

# Feature engineering

# Data preprocessing

# 5. Creating and evaluating multiple machine learning models

    The next steps are to vectorize the dataset, train the classifier, and compare the predicted artists against the actual ones.
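
    The same three steps can be sketched on toy data. This is an illustration under assumptions, not the notebook's own cell: the lyrics and artist names are invented, and the sketch uses scikit-learn's make_pipeline so that the vectorizer is fitted on the training text only and then reused for prediction.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy lyrics and artist labels (hypothetical, for illustration only)
toy_lyrics = [
    "love love heart tonight",
    "heart love forever tonight",
    "money cars street hustle",
    "street money hustle grind",
]
toy_artists = ["Balladeer", "Balladeer", "Rapper", "Rapper"]

# Step 1 and 2: count-vectorize the text, then train the linear SVM
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(toy_lyrics, toy_artists)

# Step 3: predict on unseen text; the pipeline applies the fitted vocabulary
toy_pred = model.predict(["love heart", "money street"])
print(list(toy_pred))
```

    Wrapping both steps in one pipeline also avoids overwriting the raw text variables, so the cell can be re-run without reloading the data.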

In [94]:
vectorizer = CountVectorizer()
svm = LinearSVC()

X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
_ = svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)




In [95]:
print(classification_report(y_test, y_pred_svm))

  _warn_prf(average, modifier, msg_start, len(result))


                                                         precision    recall  f1-score   support

                                         A Band of Bees       0.33      1.00      0.50         1
                                               A Boogie       0.00      0.00      0.00         0
                                 A Boogie wit da Hoodie       0.67      0.36      0.47        11
                                                 A Camp       1.00      1.00      1.00         1
                                        A Certain Ratio       1.00      1.00      1.00         1
                                       A Change of Pace       0.00      0.00      0.00         1
                                           A Cor Do Som       1.00      0.67      0.80         3
                                                A Davey       1.00      1.00      1.00         1
                                      A Day to Remember       0.50      0.50      0.50         6
                             

  _warn_prf(average, modifier, msg_start, len(result))
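
    The truncated _warn_prf lines appear to be the tail of scikit-learn's undefined-metric warning, which is raised when a label is never predicted (precision undefined) or never occurs in the true labels (recall undefined). A minimal sketch, assuming a scikit-learn version that supports the zero_division parameter, showing how to set those scores explicitly on toy labels:

```python
from sklearn.metrics import classification_report

# Toy labels: "B" and "C" are never predicted, "D" never occurs in y_true_toy
y_true_toy = ["A", "A", "B", "C"]
y_pred_toy = ["A", "A", "A", "D"]

# zero_division=0 reports undefined scores as 0 instead of emitting a warning
report = classification_report(
    y_true_toy, y_pred_toy, zero_division=0, output_dict=True
)

print(report["B"]["precision"], report["D"]["recall"], report["D"]["support"])
```

    This only silences the warning; the zeros themselves still indicate artists with too few songs to evaluate.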


    Let's compare our results against a simpler and more easily interpretable classifier: the decision tree.

In [96]:
dt = DecisionTreeClassifier()

dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print(classification_report(y_test, y_pred_dt))
print(confusion_matrix(y_test, y_pred_dt))

  _warn_prf(average, modifier, msg_start, len(result))


                                                                  precision    recall  f1-score   support

                                                  A Band of Bees       0.50      1.00      0.67         1
                                                        A Boogie       0.00      0.00      0.00         0
                                          A Boogie wit da Hoodie       0.62      0.45      0.53        11
                                                          A Camp       1.00      1.00      1.00         1
                                                 A Certain Ratio       1.00      1.00      1.00         1
                                                A Change of Pace       0.00      0.00      0.00         1
                                                       A Chicken       0.00      0.00      0.00         0
                                                    A Cor Do Som       1.00      0.67      0.80         3
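
    With thousands of artists, the full confusion matrix is too large to read. One option, sketched here on toy labels, is to pass a short list of artists through the labels parameter of confusion_matrix so that only those rows and columns are returned. Which artists to inspect is an assumption left to the reader.

```python
from sklearn.metrics import confusion_matrix

# Toy true and predicted labels (hypothetical)
y_true_toy = ["A", "B", "A", "C", "B", "A"]
y_pred_toy = ["A", "A", "A", "C", "B", "B"]

# Restrict the matrix to the artists of interest; rows are true labels,
# columns are predicted labels, both in the order given here
labels_of_interest = ["A", "B"]
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels_of_interest)

print(cm)
```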
                                             

# 6. Tuning and selecting final models

    Use TfidfVectorizer with character n-grams instead of words, together with a larger ngram_range. Performance should then improve to around 80%.
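
    A minimal sketch of that configuration on toy data. The analyzer="char" setting and the ngram_range=(1, 5) value are assumptions chosen for illustration; the best range has to be found by tuning, and this toy example says nothing about the score on the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy lyrics and artist labels (hypothetical, for illustration only)
toy_lyrics = [
    "love love heart tonight",
    "heart love forever tonight",
    "money cars street hustle",
    "street money hustle grind",
]
toy_artists = ["Balladeer", "Balladeer", "Rapper", "Rapper"]

# Character n-grams of length 1 to 5, weighted by TF-IDF (assumed settings)
char_model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 5)),
    LinearSVC(),
)
char_model.fit(toy_lyrics, toy_artists)

# The learned vocabulary now consists of character sequences, not words
vocab = char_model.named_steps["tfidfvectorizer"].vocabulary_
toy_pred = char_model.predict(["lovely heart", "street money"])

print(len(vocab), list(toy_pred))
```

    Character n-grams capture spelling habits, slang, and partial words, which is one plausible reason they help with authorship-style tasks.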

# 7. Presenting findings and/or solutions