**Goal:** Classify each Taylor Swift lyric into its correct album using the text data alone

In [None]:
# calculation packages
import numpy as np 
import pandas as pd 

# textual analysis packages 
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# loading in dataset
data = pd.read_csv('thisone.csv')

![](https://i.imgur.com/AkFSOvF.png)

# TF-IDF Feature Engineering
Before I can use my data in a machine learning model, I first have to transform my lyrical text features into numerical features that the model can read. To do so, I will be using TF-IDF vectorization. <br>
<br>
I am choosing TF-IDF vectorization based on my knowledge of her lyrics as well as what I observed in my data exploration: I need a representation that down-weights the words shared across the whole catalog while emphasizing the words that make each album distinctive.
<br>
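To make the down-weighting concrete, here is a minimal sketch of the inverse document frequency term, using the smoothed formula that scikit-learn applies by default, `ln((1 + n) / (1 + df)) + 1`. The three toy lines and the helper name `smoothed_idf` are made up for illustration and are not part of the dataset or the notebook.

```python
import math

# Toy corpus (made-up lines, not taken from the dataset).
docs = [
    "you and me in the rain",
    "you were in my dreams",
    "you kept the red scarf",
]

def smoothed_idf(term, docs):
    """IDF with scikit-learn's default smoothing: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    # df = number of documents that contain the term at least once
    df = sum(term in doc.split() for doc in docs)
    return math.log((1 + n) / (1 + df)) + 1

idf_you = smoothed_idf("you", docs)      # appears in every line
idf_scarf = smoothed_idf("scarf", docs)  # appears in only one line
print(round(idf_you, 3), round(idf_scarf, 3))
```

A word that appears in every line gets the minimum IDF of 1, while a word confined to one line gets a larger multiplier, which is the behaviour I want for catalog-wide words versus album-specific ones.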


**Stop Words**
- The stop words that come preloaded with scikit-learn are not appropriate for my dataset. They contain words like "mine", "ours", and "fifteen", which are important words in Taylor Swift's lyrics. Because of this, I will be creating my own custom list.
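A quick way to check this claim is to look the words up in scikit-learn's built-in English list, which is exposed as `ENGLISH_STOP_WORDS`. The variable names below are only for illustration.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# words that matter in the lyrics but would be dropped by the built-in list
lyric_words = ["mine", "ours", "fifteen"]

in_builtin = {word: word in ENGLISH_STOP_WORDS for word in lyric_words}
print(in_builtin)
```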

In [3]:
# stop words
stop_words = ['a', 'an', 'and', 'are', 'as', 'at', 'be', 'but', 'by', 'for', 'if', 'in', 'into', 'is', 'it', 
              'not', 'of', 'on', 'or', 'such', 'that', 'the', 'their', 'then', 'there', 'these', 'they', 'this', 'to', 'was', 'will', 'with']

# isolating the textual data
lyric_text = data['lyrics']

# creating tf-idf object, removing stop words and converting accented characters to their ASCII equivalents
vectorizer = TfidfVectorizer(strip_accents = 'ascii', stop_words = stop_words)

# fitting and transforming the text data
X = vectorizer.fit_transform(lyric_text)

# checking shape
print(X.shape)

# getting feature names for further analysis
feature_names = vectorizer.get_feature_names_out()


(4584, 3907)


In [7]:
# getting more information about my matrix
X

<4584x3907 sparse matrix of type '<class 'numpy.float64'>'
	with 40591 stored elements in Compressed Sparse Row format>

**Shape**
- This shape (4584, 3907) shows that I have 4584 data points and a feature space of 3907. This means that I have 3907 unique features in my dataset.


**Matrix**
- TF-IDF vectorization creates a sparse matrix, which means that most of the values within the matrix are 0. The sparsity of the matrix arises from the nature of TF-IDF vectorization. In a TF-IDF matrix, each row corresponds to a document, or in my case, a lyrical snippet, and each column corresponds to a unique feature. Since most documents/lyrics only contain a handful of those features, most of the features for the given data point will be 0.
- To explain this further, I will look closer at my first datapoint
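Before looking at that first row, the overall sparsity can be put into numbers using the counts printed above (40591 stored elements in a 4584 x 3907 matrix). This is a small sketch; `sparsity_summary` is a hypothetical helper, and on the real matrix the same inputs would come from `X.nnz` and `X.shape`.

```python
def sparsity_summary(nnz, shape):
    """Return (density, average stored values per row) for a sparse matrix."""
    rows, cols = shape
    density = nnz / (rows * cols)   # fraction of cells that are stored (non-zero)
    avg_per_row = nnz / rows        # typical number of features per lyric
    return density, avg_per_row

# numbers taken from the matrix summary printed above
density, avg_per_row = sparsity_summary(40591, (4584, 3907))
print(f"density: {density:.4%}, average non-zero features per lyric: {avg_per_row:.2f}")
```

Well under 1% of the cells hold a value, and a typical lyric uses roughly nine of the 3907 features.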


In [8]:
# this shows that out of the 3907 features, my first datapoint only has 13 with a TF-IDF value greater than 0
X[0]

<1x3907 sparse matrix of type '<class 'numpy.float64'>'
	with 13 stored elements in Compressed Sparse Row format>

In [9]:
# those features are:
print(X[0])

# which are TF-IDF representations of the words from the lyric from the first data point:
print(data['lyrics'][0])

# only 13 of the lyric's 16 words end up as non-zero features because three of them ('the', 'to', and 'that') are in my stop word list

  (0, 2242)	0.21791958139285975
  (0, 2918)	0.3133642375920805
  (0, 3197)	0.3133642375920805
  (0, 1390)	0.42213123800187724
  (0, 3437)	0.2897126203042577
  (0, 2597)	0.2666172045126449
  (0, 2938)	0.3551549893266243
  (0, 1145)	0.2376855262647906
  (0, 343)	0.2897126203042577
  (0, 2194)	0.13630087493371815
  (0, 3729)	0.2312342039416239
  (0, 2811)	0.21102944450978636
  (0, 1560)	0.19939999199169645
He said the way my blue eyes shined Put those Georgia stars to shame that night


- Now that I have applied TF-IDF vectorization to my text data, I want to be able to look at the most important features in the dataset. This will not only allow me to visualize the top features, but I will also be able to remove unnecessary features that do not make sense.

In [4]:
# function that will allow me to visualize the top features
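def looking_at_top(X, features):
    '''
    Takes in vectorized data and feature names to show the top 20 most 
    important features (ranked by summed TF-IDF weight)
    Inputs:
        X: (sparse matrix) tf-idf vectorized features, one row per lyric
        features: (numpy array) an array of the names of the features
    Returns:
        df: (DataFrame) feature names and their summed TF-IDF weights
    '''
    
    # summing each column gives one total weight per feature
    feature_np = np.asarray(np.sum(X, axis = 0))
    feature_np = feature_np.reshape(-1)

    top = []

    # indices of the 20 largest totals, in descending order
    for x in np.argsort(feature_np)[::-1][:20]:
        top.append((features[x], feature_np[x]))

    # building the DataFrame once, after the loop has finished
    df = pd.DataFrame(top)
    
    return df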
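As an aside, the same ranking can be written without an explicit loop. This is only a sketch of an alternative, not a replacement for the function above: `top_features` is a hypothetical name, and the 3 x 3 matrix is toy data chosen so the column sums are easy to check by hand.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def top_features(X, features, n=20):
    """Sum each TF-IDF column and return the n largest totals as a Series."""
    totals = np.asarray(X.sum(axis=0)).ravel()
    return pd.Series(totals, index=features).nlargest(n)

# toy matrix: rows are lyrics, columns are the features named below
toy_X = csr_matrix(np.array([[0.6, 0.0, 0.8],
                             [0.0, 1.0, 0.0],
                             [0.6, 0.8, 0.0]]))
toy_features = np.array(["you", "me", "oh"])

top2 = top_features(toy_X, toy_features, n=2)
print(top2)
```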

In [11]:
looking_at_top(X, feature_names)

Unnamed: 0,0,1
0,you,325.207925
1,me,162.340023
2,my,134.966603
3,oh,119.146428
4,your,110.488174
5,we,110.443744
6,all,103.57004
7,like,97.357143
8,know,93.512426
9,so,80.27518


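- From this we can see a lot more about which features are important to the dataset. 
- It seems as though some of the words got cut off during vectorization. This happens at the tokenization step rather than the ASCII stripping: the vectorizer's default tokenizer splits on apostrophes, so a contraction like "you're" becomes "you" and "re". Those fragments probably need to be removed.
- However, I need to look closely at words before I remove them.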
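One way to check where fragments such as "re" come from is to apply the vectorizer's default token pattern, `(?u)\b\w\w+\b`, directly with Python's `re` module. The `tokenize` helper is a hypothetical stand-in for what the vectorizer does after lowercasing; the lyric is one of the lines from the dataset.

```python
import re

# scikit-learn's default token_pattern: runs of two or more word characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    """Lowercase the text and split it the way the default pattern would."""
    return re.findall(TOKEN_PATTERN, text.lower())

tokens = tokenize("You never let me drive You're a redneck heartbreak")
print(tokens)
```

The apostrophe is not a word character, so "You're" is split into "you" and "re", and single-letter pieces such as "a" (or the "t" in "don't") are dropped because the pattern requires at least two characters.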

In [5]:
# finding where the feature is located
print("\'re\' can be found:", np.where(feature_names == 're'))

# and looking at two of the text representations of that feature
np.array(lyric_text)[np.where(X[:, 2637].toarray()>0)[0][:2]]

# from here, we can see that it does indeed come from a contraction ("you're") that was split at the apostrophe during tokenization.
# I will be adding this feature to my stop words

're' can be found: (array([2637]),)


array(["So go and tell your friends that I'm obsessive and crazy That's fine I'll tell mine that you're gay!",
       "You never let me drive You're a redneck heartbreak"], dtype=object)

In [6]:
# repeating for the other words

# finding where the feature is located
print("\'don\' can be found:", np.where(feature_names == 'don'))
# and looking at two of the text representations of that feature
np.array(lyric_text)[np.where(X[:, 963].toarray()>0)[0][:2]]
# same thing happened here ("don't" was split at the apostrophe), adding to stop words

'don' can be found: (array([963]),)


array(["He's the song in the car I keep singing. Don't know why I do Drew walks by me",
       "The only thing that keeps me wishing on a wishing star He's the song in the car I keep singing. Don't know why I do"],
      dtype=object)

In [7]:
stop_words.extend(['re', 'don'])

# since I have added to stop words, I will now rerun the TF-IDF vectorization with the new stop words and then run the top features function to see if the top words make sense
vectorizer = TfidfVectorizer(strip_accents = 'ascii', stop_words = stop_words)

# fitting and transforming the text data
X = vectorizer.fit_transform(lyric_text)

# checking shape
print(X.shape)

# getting feature names for further analysis
feature_names = vectorizer.get_feature_names_out()

# checking top features
looking_at_top(X, feature_names)

(4584, 3905)


Unnamed: 0,0,1
0,you,328.488988
1,me,163.383899
2,my,135.400737
3,oh,119.578644
4,we,111.157969
5,your,110.884543
6,all,103.901335
7,like,97.841402
8,know,94.489719
9,so,80.845722


In [8]:
# there are still words that do not make sense, so I will be repeating my process again
# finding where the feature is located
ve_loc = np.where(feature_names == 've')
print("\'ve\' can be found:", ve_loc)
# looking at two of the text representations of that feature
# (using the index found above, since column positions shift whenever the stop words change)
np.array(lyric_text)[np.where(X[:, ve_loc[0][0]].toarray()>0)[0][:2]]

# finding where the feature is located
ll_loc = np.where(feature_names == 'll')
print("\'ll\' can be found:", ll_loc)
# looking at two of the text representations of that feature
np.array(lyric_text)[np.where(X[:, ll_loc[0][0]].toarray()>0)[0][:2]]

# extending the list ('isn' is the same kind of fragment, left over from "isn't")
stop_words.extend(['ve', 'll', 'isn'])

've' can be found: (array([3641]),)
'll' can be found: (array([1953]),)


In [16]:
# since I have added to stop words, I will now rerun the TF-IDF vectorization with the new stop words and then run the top features function to see if the top words make sense
vectorizer = TfidfVectorizer(strip_accents = 'ascii', stop_words = stop_words)

# fitting and transforming the text data
X = vectorizer.fit_transform(lyric_text)

# checking shape
print(X.shape)

# getting feature names for further analysis
feature_names = vectorizer.get_feature_names_out()

# checking top features
looking_at_top(X, feature_names)

(4584, 3902)


Unnamed: 0,0,1
0,you,330.548768
1,me,164.06799
2,my,136.058706
3,oh,119.832002
4,we,111.687749
5,your,111.220208
6,all,104.576143
7,like,97.988772
8,know,95.218517
9,so,81.211411


- I want to experiment with the length of the n-grams, so I will also be creating a TF-IDF vectorization with `ngram_range = (1, 3)`, which keeps the single words and adds two-word and three-word phrases
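To get a feel for how `ngram_range` changes the feature space, here is a small sketch on a single six-word line. No stop words are used here (an assumption made to keep the counts easy to verify), and `n_features` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

line = ["we are never getting back together"]

def n_features(ngram_range):
    """Number of features produced for the toy line with a given n-gram range."""
    vec = TfidfVectorizer(ngram_range=ngram_range)
    return vec.fit_transform(line).shape[1]

# (1, 1): unigrams only, (2, 2): bigrams only, (1, 3): unigrams through trigrams
counts = {r: n_features(r) for r in [(1, 1), (2, 2), (1, 3)]}
print(counts)
```

Six words give 6 unigrams, 5 bigrams, and 4 trigrams, so `(1, 3)` produces 15 features from one line. The same growth explains why the full dataset jumps to tens of thousands of columns, and since each row is L2-normalized by default, spreading a lyric's weight over more features is a likely reason the single-word totals shrink.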

In [9]:
vectorizer = TfidfVectorizer(strip_accents = 'ascii', stop_words = stop_words, ngram_range= (1, 3))

# fitting
X = vectorizer.fit_transform(lyric_text)

# checking shape
print(X.shape)

# checking features
feature_names = vectorizer.get_feature_names_out()

looking_at_top(X, feature_names)

(4584, 49457)


Unnamed: 0,0,1
0,you,148.890813
1,me,73.944305
2,my,63.418646
3,oh,59.750744
4,we,52.2669
5,your,52.208628
6,all,47.852716
7,like,46.305066
8,know,42.28858
9,so,37.412549


In [26]:
vectorizer = TfidfVectorizer(strip_accents = 'ascii', stop_words = stop_words, ngram_range= (1, 3))

# fitting
X_13 = vectorizer.fit_transform(lyric_text)

# checking shape
print(X_13.shape)

# checking features
feature_names_13 = vectorizer.get_feature_names_out()

looking_at_top(X_13, feature_names_13)

(4584, 49457)


Unnamed: 0,0,1
0,you,148.890813
1,me,73.944305
2,my,63.418646
3,oh,59.750744
4,we,52.2669
5,your,52.208628
6,all,47.852716
7,like,46.305066
8,know,42.28858
9,so,37.412549
