# Steps in Creating a Data Science Project

1. Defining the problem
**Technical problem description**

2. Getting the data
**Setup environment**
*     Import libraries
*     Define constants and control variables    

3. Exploratory data analysis

4. Preparing the data for Machine Learning Algorithms
* Data cleaning
* Feature engineering
* preprocessing

5. Creating and evaluating multiple machine learning models

6. Tuning and selecting final models

7. Presenting findings and/or solutions

8. Launch, monitoring, and maintenance of system

# 1. Defining the problem

    **Technical problem description:**
    
    We will use various NLP machine learning models in an attempt to analyze song lyrics and identify the artist who authored them.
    
    Known sometimes formally as: Authorship attribution

# 2. Getting the data

# Setup environment

    # Import libraries

In [1]:
import pandas as pd

# used to split our data into test and training sets
from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline

from sklearn.preprocessing import StandardScaler

# a simple "Count" vectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.svm import LinearSVC

from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import classification_report

from sklearn.metrics import confusion_matrix

# library helper
# run: importnb-install from Conda before using
from importnb import Notebook
with Notebook(): 
    import Utility

# custom helper class (from jupyter notebook)
helper = Utility.Helper()

Class 'Helper' v1.3 has been loaded


    Function to reload changes in Jupyter notebooks

In [2]:
# reload changes in Jupyter notebooks
from importlib import reload
with Notebook(): __name__ == '__main__' and reload(Utility)

    Define constants and control variables

In [3]:
DATA_PATH = '../../data/'

FIGSIZE = (20,15)

In [4]:
lyrics_file = DATA_PATH + "artist_scrape_with_lyrics.csv"
    
# lyrics_df = helper.create_df(lyrics_file)

# temporary call as data has not been fully scraped and captured in CSV, so lyrics column contains both str values (where lyrics have been scraped) and NaN values where this is pendingdtype={'user_id': int}
lyrics_df = pd.read_csv(lyrics_file, dtype={'Song Lyrics': str})

# 3. Exploratory data analysis

In [5]:
lyrics_df.head(-10)

Unnamed: 0,Artist Name,Artist URL,Song Title,Song URL,Song Lyrics
0,A B,artist/A-B/472398,Con el Tic Tac del Reloj,/lyric/3455846/A+B/Con+el+Tic+Tac+del+Reloj,Era tan facil soÃ±ar Que te podias quedar Eras...
1,A Bad Think,artist/A-Bad-Think/2137849593,Now You Know,/lyric/36417131/A+Bad+Think/Now+You+Know,"Burning candles at all ends, to mend your hear..."
2,A Baffled Republic,artist/A-Baffled-Republic/2137849643,Bad Boys (Move in Silence),/lyric/2262594/A+Baffled+Republic/Bad+Boys+%28...,"Why, why, why, why Why, why, why, why Why, wh..."
3,A Banca 021,artist/A-Banca-021/2137850524,Cor de Mel,/lyric/37798632/A+Banca+021/Cor+de+Mel,Você me lembra o azul do céu Pintando as nuven...
4,"A Band Called ""O""",artist/A-Band-Called-%22O%22/19641,Sleeping,/lyric/637199/A+Band+Called+%22O%22/Sleeping,"For the life we chose in the evening we rose, ..."
...,...,...,...,...,...
1352573,ZZ Ward,artist/ZZ-Ward/2633167,Ride,/lyric/33934100/ZZ+Ward/Ride,
1352574,ZZ Ward,artist/ZZ-Ward/2633167,Ride [*],/lyric/34157509/ZZ+Ward/Ride+%5B%2A%5D,
1352575,ZZ Ward,artist/ZZ-Ward/2633167,Love 3x,/lyric/31899204/ZZ+Ward/Love+3x,
1352576,ZZ Ward,artist/ZZ-Ward/2633167,LOVE 3X,/lyric/32112984/ZZ+Ward/LOVE+3X,


In [6]:
lyrics_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1352588 entries, 0 to 1352587
Data columns (total 5 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   Artist Name  1352588 non-null  object
 1   Artist URL   1352588 non-null  object
 2   Song Title   1352588 non-null  object
 3   Song URL     1352588 non-null  object
 4   Song Lyrics  282997 non-null   object
dtypes: object(5)
memory usage: 51.6+ MB


    Temporary data split, as dataframe has not been fully populated yet

In [7]:
# create dataframe where Song Lyrics are not null
lyrics_df = lyrics_df.loc[~lyrics_df['Song Lyrics'].isnull()]

In [8]:
lyrics_df.head(-10)

Unnamed: 0,Artist Name,Artist URL,Song Title,Song URL,Song Lyrics
0,A B,artist/A-B/472398,Con el Tic Tac del Reloj,/lyric/3455846/A+B/Con+el+Tic+Tac+del+Reloj,Era tan facil soÃ±ar Que te podias quedar Eras...
1,A Bad Think,artist/A-Bad-Think/2137849593,Now You Know,/lyric/36417131/A+Bad+Think/Now+You+Know,"Burning candles at all ends, to mend your hear..."
2,A Baffled Republic,artist/A-Baffled-Republic/2137849643,Bad Boys (Move in Silence),/lyric/2262594/A+Baffled+Republic/Bad+Boys+%28...,"Why, why, why, why Why, why, why, why Why, wh..."
3,A Banca 021,artist/A-Banca-021/2137850524,Cor de Mel,/lyric/37798632/A+Banca+021/Cor+de+Mel,Você me lembra o azul do céu Pintando as nuven...
4,"A Band Called ""O""",artist/A-Band-Called-%22O%22/19641,Sleeping,/lyric/637199/A+Band+Called+%22O%22/Sleeping,"For the life we chose in the evening we rose, ..."
...,...,...,...,...,...
299985,Desmond Dekker,artist/Desmond-Dekker/2878,0.0.7 (Shanty Town),/lyric/26959285/Desmond+Dekker/0.0.7+%28Shanty...,"0-0-7, 0-0-7 At ocean eleven And now rude bo..."
299986,Desmond Dekker,artist/Desmond-Dekker/2878,You Can Get It If You Really Want,/lyric/19957037/Desmond+Dekker/You+Can+Get+It+...,You can get it if you really want You can get ...
299987,Desmond Dekker,artist/Desmond-Dekker/2878,You Can Get It If You Really Want,/lyric/20984184/Desmond+Dekker/You+Can+Get+It+...,You can get it if you really want You can get ...
299988,Desmond Dekker,artist/Desmond-Dekker/2878,You Can Get It If You Really Want,/lyric/21993371/Desmond+Dekker/You+Can+Get+It+...,You can get it if you really want You can get ...


In [9]:
lyrics = lyrics_df['Song Lyrics']
artists = lyrics_df['Artist Name']

X_train, X_test, y_train, y_test = train_test_split(lyrics, artists, test_size=0.1, random_state=1337)

# 4. Preparing the data for Machine Learning Algorithms

# Feature engineering

    n/a

# Data cleaning, preprocessing

In [10]:
cnt_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('std_scaler', StandardScaler(with_mean=False))
])

    Use min_df as an integer to remove rare-occurring words. If they only occur once or twice, they won't add much value and are usually really obscure. Furthermore, there's generally a lot of them so ignoring them with say min_df=5 can greatly reduce your memory consumption and data size
    Use stop_words to remove less-meaningful english words
    Set ngram_range to (1,1) for outputting only one-word tokens, (1,2) for one-word and two-word tokens, (2, 3) for two-word and three-word tokens, etc
    We will update the token_pattern for words to include apostrophes, i.e. to prevent "don't" being converted to "don"

In [11]:
tfidf_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(min_df=5, stop_words='english', ngram_range=(1, 2), token_pattern=r"\b\w[\w\']+\b"))
])

In [12]:
X_train_count = cnt_pipeline.fit_transform(X_train)
X_test_count = cnt_pipeline.transform(X_test)

X_train_tfidf = tfidf_pipeline.fit_transform(X_train)
X_test_tfidf = tfidf_pipeline.transform(X_test)

# 5. Creating and evaluating multiple machine learning models

    The next steps are to vectorize our dataset, train the classifier, and check the predicted results against our actual results

In [13]:
svm = LinearSVC()

In [1]:
_ = svm.fit(X_train_count, y_train)
y_pred_svm = svm.predict(X_test_count)

NameError: name 'svm' is not defined

    from: https://stackoverflow.com/questions/52670012/convergencewarning-liblinear-failed-to-converge-increase-the-number-of-iterati

    Normally when an optimization algorithm does not converge, it is usually because the problem is not well-conditioned, perhaps due to a poor scaling of the decision variables. There are a few things you can try.

        1. Normalize your training data so that the problem hopefully becomes more well conditioned, which in turn can speed up convergence. One possibility is to scale your data to 0 mean, unit standard deviation using Scikit-Learn's StandardScaler for an example. Note that you have to apply the StandardScaler fitted on the training data to the test data. Also, if you have discrete features, make sure they are transformed properly so that scaling them makes sense.

        2. Related to 1), make sure the other arguments such as regularization weight, C, is set appropriately. C has to be > 0. Typically one would try various values of C in a logarithmic scale (1e-5, 1e-4, 1e-3, ..., 1, 10, 100, ...) before finetuning it at finer granularity within a particular interval. These days, it probably make more sense to tune parameters using, for e.g., Bayesian Optimization using a package such as Scikit-Optimize.
        
        3. Set max_iter to a larger value. The default is 1000. This should be your last resort. If the optimization process does not converge within the first 1000 iterations, having it converge by setting a larger max_iter typically masks other problems such as those described in 1) and 2). It might even indicate that you have some in appropriate features or strong correlations in the features. Debug those first before taking this easy way out.
        
        4. Set dual = True if number of features > number of examples and vice versa. This solves the SVM optimization problem using the dual formulation. Thanks @Nino van Hooff for pointing this out, and @JamesKo for spotting my mistake.

In [None]:
# Regularization parameter
base = 10
exp = -5

reg_prm = base ** exp

In [2]:
svm_con = LinearSVC(C=reg_prm)

_ = svm.fit(X_train_count, y_train)
y_pred_svm = svm.predict(X_test_count)

NameError: name 'LinearSVC' is not defined

In [35]:
print(classification_report(y_test, y_pred_svm, zero_division=0))

                                                                  precision    recall  f1-score   support

                                                  A Band of Bees       1.00      1.00      1.00         1
                                          A Boogie wit da Hoodie       1.00      0.25      0.40         8
                                                          A Camp       1.00      1.00      1.00         1
                                                 A Certain Ratio       1.00      1.00      1.00         1
                                                    A Cor Do Som       1.00      1.00      1.00         2
                                            A Covenant of Thorns       0.00      0.00      0.00         0
                                               A Day to Remember       0.67      0.40      0.50         5
                                                  A Dozen Furies       0.00      0.00      0.00         1
                                             

    Let's compare our results against a simpler and more easily interpretable classifier — the Decision Tree

In [19]:
dt = DecisionTreeClassifier()

dt.fit(X_train_count, y_train)
y_pred_dt = dt.predict(X_test_count)

In [20]:
# Sets the value to return when there is a zero division
print(classification_report(y_test, y_pred_dt, zero_division=0))
print(confusion_matrix(y_test, y_pred_dt))

                                                                  precision    recall  f1-score   support

                                                  A Band of Bees       0.25      1.00      0.40         1
                                                        A Boogie       0.00      0.00      0.00         0
                                          A Boogie wit da Hoodie       0.60      0.38      0.46         8
                                                          A Camp       0.33      1.00      0.50         1
                                                 A Certain Ratio       1.00      1.00      1.00         1
                                                A Change of Pace       0.00      0.00      0.00         0
                                       A Consommer de Préférence       0.00      0.00      0.00         0
                                                    A Cor Do Som       1.00      1.00      1.00         2
                                             

# 6. Tuning and selecting final models

    Use the TfidfVectorizer with character ngrams instead of words and a larger ngram_range

In [33]:
dt.fit(X_train_tfidf, y_train)
y_pred_dt_tfidf = dt.predict(X_test_tfidf)

In [34]:
# Sets the value to return when there is a zero division
print(classification_report(y_test, y_pred_dt_tfidf, zero_division=0))
print(confusion_matrix(y_test, y_pred_dt_tfidf))

                                                                  precision    recall  f1-score   support

                                                  A Band of Bees       0.50      1.00      0.67         1
                                                        A Boogie       0.00      0.00      0.00         0
                                          A Boogie wit da Hoodie       0.50      0.50      0.50         8
                                                          A Camp       1.00      1.00      1.00         1
                                                 A Certain Ratio       1.00      1.00      1.00         1
                                                       A Chicken       0.00      0.00      0.00         0
                                                    A Cor Do Som       0.67      1.00      0.80         2
                                                         A Davey       0.00      0.00      0.00         0
                                             

# 7. Presenting findings and/or solutions