# Pre-Processing and Modeling

In this notebook I pre-process my data with a train test split and bootstrap the positive class so that my classes are balanced. I also create a model that predicts whether or not a song was a hit based on its lyrics and the year it was released.

In [24]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import pandas as pd
import pickle

### Pre-processing

Reading in my TF-IDF data. (I ended up going with TF-IDF because it performed better than count vectorization.)

In [25]:
df_token = pd.read_csv('./df_token.csv',index_col=0)

In [26]:
df = pd.read_csv('./clean_data_years.csv',index_col=0)

Creating my feature matrix, X, and my target array, y.

In [27]:
X = df_token
y = df['hot100']

Train test splitting

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
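To illustrate what `stratify=y` does in the split above, here is a minimal sketch on made-up data (the toy frame and the names `X_toy`, `y_toy` are assumptions for illustration, not part of my song dataset). It checks that both splits keep roughly the same share of positives as the full target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy target with 10% positives (not the song data)
y_toy = pd.Series([1] * 100 + [0] * 900)
X_toy = pd.DataFrame({"feature": np.arange(len(y_toy))})

# Same call pattern as above: stratify on the target, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42
)

# Share of positives in each split; stratification keeps these close
# to the share in the full target
print("full :", y_toy.mean())
print("train:", y_tr.mean())
print("test :", y_te.mean())
```

Without `stratify`, a rare positive class can end up over- or under-represented in the test set purely by chance.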

### Bootstrapping

Since my data is badly unbalanced, I will bootstrap my positive class (sample it with replacement) so that it is roughly equal in size to my negative class. I will do this by grabbing the indices of the positive observations in my train set and then filtering my train set so I can bootstrap only the positive observations.

In [29]:
positive_train = y_train[y_train == 1].index

Grabbing the index from my test's positive class

In [30]:
positive_test = y_test[y_test == 1].index

Testing my filter for the train

In [31]:
X_train.loc[positive_train, :].head()

Unnamed: 0,00,000,007,01,02,03,04,05,06,07,...,zoovier,zoowap,zorro,zu,zucchi,zuckerberg,zulu,zy,ándale,date_year
14611,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2017.0
474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2012.0
14531,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2015.0
5690,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2016.0
15858,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2000.0


Now I will bootstrap my train and test positive classes. It is important to do this after train test splitting so that copies of the same row cannot end up in both the train and the test set, which would artificially increase the model's score.

In [32]:
X_train_bootstrap = X_train.loc[positive_train, :].sample(n=12000,replace=True, random_state=42)

In [33]:
y_train_bootstrap = y_train.loc[X_train_bootstrap.index]

In [34]:
X_test_bootstrap = X_test.loc[positive_test, :].sample(n=4000,replace=True, random_state=42)

In [35]:
y_test_bootstrap = y_test.loc[X_test_bootstrap.index]

Now I am concatenating my bootstrapped observations with my regular observations.

In [36]:
y_test_boot = pd.concat([y_test_bootstrap,y_test])

In [37]:
y_train_boot = pd.concat([y_train_bootstrap,y_train])

In [38]:
X_train_boot = pd.concat([X_train_bootstrap,X_train])

In [39]:
X_test_boot = pd.concat([X_test_bootstrap,X_test])
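The bootstrapping steps above can be sketched end to end on a tiny made-up frame. This is only an illustration under assumptions: the data and the names `X_tr`, `y_tr`, `n_extra` are hypothetical, and computing the sample size from the class counts is my shortcut here, whereas the cells above use fixed sizes (12000 and 4000).

```python
import pandas as pd

# Hypothetical toy training set: 3 positives, 9 negatives
X_tr = pd.DataFrame({"f": range(12)}, index=range(100, 112))
y_tr = pd.Series([1, 1, 1] + [0] * 9, index=X_tr.index)

# Indices of the positive class, as in the cells above
pos_idx = y_tr[y_tr == 1].index

# Number of extra positive rows needed to match the negative class
n_extra = int((y_tr == 0).sum() - (y_tr == 1).sum())

# Sample positive rows with replacement, then look up their labels by index
X_extra = X_tr.loc[pos_idx].sample(n=n_extra, replace=True, random_state=42)
y_extra = y_tr.loc[X_extra.index]

# Stack the bootstrapped rows on top of the original rows
X_bal = pd.concat([X_extra, X_tr])
y_bal = pd.concat([y_extra, y_tr])

print(y_bal.value_counts())
```

Because the extra rows are drawn only from this split's own positives, none of them can leak into the other split.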

Saving y_test_boot and y_train_boot

In [42]:
pd.DataFrame(y_test_boot).to_csv('./y_test_boot.csv')

In [43]:
pd.DataFrame(y_train_boot).to_csv('./y_train_boot.csv')
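As a small sanity check on saving a target this way, here is a sketch of a CSV round trip. The series `y_demo` and the temporary file are assumptions for illustration only. Note that `index_col=0` belongs to `pd.read_csv` when loading the file back, not to `to_csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical target with a repeated index label, as bootstrapping produces
y_demo = pd.Series([1, 0, 1], index=[14611, 474, 14611], name="hot100")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "y_demo.csv")
    # The index is written as the first column of the file
    y_demo.to_frame().to_csv(path)
    # index_col=0 restores that first column as the index on load
    y_back = pd.read_csv(path, index_col=0)["hot100"]

print(y_back)
```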

### SVD

To reduce the number of features and combat overfitting, I am going to decompose my matrix through singular value decomposition. This will create components that explain the most variance in my data, which will become my new features and essentially act as topics. I also found SVD (and NMF, for that matter) to perform better than just fitting a model on a non-decomposed matrix.

In [160]:
from sklearn.decomposition import TruncatedSVD

Fitting my SVD model on my bootstrapped X train and then transforming both my X train and X test. I found 100 components and the 'arpack' algorithm to work best.

In [161]:
SVD = TruncatedSVD(n_components=100, algorithm='arpack')
X_train_svd = SVD.fit_transform(X_train_boot)
X_test_svd = SVD.transform(X_test_boot)

In [142]:
X_train_svd_og = SVD.transform(X_train)
X_test_svd_og = SVD.transform(X_test)
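The fit-on-train, transform-on-test pattern used above can be sketched on random data. Everything here is an assumption for illustration (the random matrices, the 10 components, and the names `svd_demo`, `Z_train`, `Z_test`). The point is only the call pattern and how to inspect how much variance the components retain.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Hypothetical sparse-ish, non-negative matrices standing in for TF-IDF features
X_train_demo = rng.random((200, 500)) * (rng.random((200, 500)) < 0.05)
X_test_demo = rng.random((50, 500)) * (rng.random((50, 500)) < 0.05)

svd_demo = TruncatedSVD(n_components=10, algorithm="arpack", random_state=0)

# Learn the components from the training matrix only...
Z_train = svd_demo.fit_transform(X_train_demo)
# ...then project the test matrix onto those same components
Z_test = svd_demo.transform(X_test_demo)

print(Z_train.shape, Z_test.shape)
# Fraction of the training variance retained by the kept components
print(svd_demo.explained_variance_ratio_.sum())
```

Fitting on the training matrix alone keeps information from the test rows out of the learned components.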

In [162]:
with open('./SVD.pkl', 'wb+') as f:
    pickle.dump(SVD, f)

After extensive experimentation I found the following logistic regression parameters to be the best.

In [163]:
lr = LogisticRegression(C=5000, penalty='l2', tol = .0001, random_state=1)

C: the inverse of the regularization strength = $5000$, so there will be very little regularization
___________________________________________________________
Penalty: the penalty used for regularization = L2 norm
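To make the effect of `C` concrete, here is a rough sketch on synthetic data (the dataset from `make_classification` and the values 0.01 and 1 are assumptions for comparison, only 5000 comes from my model). With an L2 penalty, a larger `C` means a weaker penalty, so the fitted coefficients are allowed to grow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data, not the song features
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

norms = {}
for C in (0.01, 1, 5000):
    model = LogisticRegression(C=C, penalty="l2", solver="liblinear", random_state=1)
    model.fit(X_demo, y_demo)
    # Size of the coefficient vector under this amount of regularization
    norms[C] = np.linalg.norm(model.coef_)

print(norms)
```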

In [164]:
lr.fit(X_train_svd,y_train_boot)

LogisticRegression(C=5000, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=1, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Saving my SVD-transformed matrices and my fitted model

In [129]:
pd.DataFrame(X_train_svd).to_csv('./X_train_svd.csv')

In [130]:
pd.DataFrame(X_test_svd).to_csv('./X_test_svd.csv')

In [165]:
with open('./lr.pkl', 'wb+') as f:
    pickle.dump(lr, f)
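For completeness, a pickled model can be loaded back with `pickle.load`. The sketch below uses a tiny made-up model and a temporary file (both assumptions for illustration) and checks that the restored model predicts the same labels as the original.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature dataset, just enough to fit a model
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_demo.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Pickles are generally only safe to load with the same scikit-learn version that wrote them, and only from sources you trust.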

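Fitting the same model inside a grid search, but with the scoring metric set to area under the ROC curve. This metric summarizes the trade-off between the true positive rate and the false positive rate across all decision thresholds, and thus might be more suitable for our scenario since our classes are so unbalanced. Because the parameter grid is empty, the search simply cross-validates this one model; the scoring metric changes how the model is evaluated, not how its coefficients are fit.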

In [116]:
gs = GridSearchCV(lr, param_grid={},scoring='roc_auc')

In [143]:
gs.fit(X_train_svd,y_train_boot)

GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=5000, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=1, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1, param_grid={},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)
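As a sketch of what a grid search with an empty `param_grid` amounts to, the example below compares it with `cross_val_score` on synthetic data. The dataset, the explicit `cv=5`, and the names `gs_demo` and `cv_scores` are assumptions for illustration. With the same estimator, folds, and scoring, the two should report the same mean ROC AUC.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Hypothetical imbalanced data (about 10% positives), not the song features
X_demo, y_demo = make_classification(
    n_samples=400, weights=[0.9, 0.1], random_state=0
)
clf = LogisticRegression(C=5000, solver="liblinear", random_state=1)

# Empty grid: a single candidate, scored by cross-validated ROC AUC
gs_demo = GridSearchCV(clf, param_grid={}, scoring="roc_auc", cv=5)
gs_demo.fit(X_demo, y_demo)

# The same evaluation written as plain cross-validation
cv_scores = cross_val_score(clf, X_demo, y_demo, scoring="roc_auc", cv=5)

print(gs_demo.best_score_, cv_scores.mean())
```

After fitting, `gs_demo.best_estimator_` is the same kind of model refit on all of the data passed to `fit`.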

In [120]:
with open('./gs_svd.pkl', 'wb+') as f:
    pickle.dump(gs, f)