# Loyal Health Data Science Coding Challenge

Instructions: The following questions are designed to assess your understanding of common data science concepts with which you should be familiar. We’ll have you complete some basic analysis over text reviews and their metadata from the popular music review site Pitchfork (https://pitchfork.com/). The data can be downloaded here (https://www.kaggle.com/nolanbconaway/pitchfork-data) in the form of a SQLite database.  We expect this to take around 2 hours (at most 3 hours) to complete. Although the completion of the assignment will not be strictly timed, please do not go over the allotted time. If time is an issue, focus the most on problems 2, 4, and 5. 

Write all of your code in this Jupyter notebook. When you’ve completed the assessment, please create a GitHub repository, and email us a link to this repository.


In [162]:
# Import here
import sqlite3
import re
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

from scipy.stats import spearmanr
from textblob import TextBlob
from collections import Counter

## 1. Cursory Data Analysis:
a) Compute the number of albums belonging to each genre. You should notice that some albums have multiple genres listed (e.g. Folk/Country,Pop/R&B,Rock) separated by commas. Consider albums with multiple genres as belonging to each of those genres (i.e. an album with Rap,Rock as its genres will be counted as one Rap album and one Rock album).

b) Compute the number of albums released each year.

c) Compute the ten artists with the highest number of albums reviewed in the data set.

d) Compute the mean, median, standard deviation, minimum, and maximum album scores. 

e) Compute the average score by each review author and return the result in a dataframe sorted in descending order.

f) Compute the average album score per artist and return the result in a dataframe with an additional column for the number of albums they’ve had reviewed.
    i) Return the artists with the top 10 highest average scores
    ii) Return the artists with the top 10 lowest average scores


In [163]:
'''
a) Use Counter to tally the number of albums belonging to each genre.
'''
con = sqlite3.connect("database.sqlite")

cur = con.cursor()

mydata = []
# available tables: artists, content, genres, labels, reviews, years
for row in cur.execute('SELECT * FROM genres;'):
    mydata.append(row)

# Be sure to close the connection
con.close()

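# Note: splitting on '/' treats compound labels such as 'pop/r&b' and 'folk/country'
# as two separate genres, so both halves of each pair get identical counts below.
# Rows with no genre are counted under the string 'None'.
albumsType = [str(x[1]).split('/') for x in mydata]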
flattenType = sum(albumsType, [])
albumsFreq = Counter(flattenType)
albumsFreq

Counter({'electronic': 3874,
         'metal': 860,
         'rock': 9436,
         'None': 2367,
         'rap': 1559,
         'experimental': 1815,
         'pop': 1432,
         'r&b': 1432,
         'folk': 685,
         'country': 685,
         'jazz': 435,
         'global': 217})
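
The same tally can also be pushed into SQL. The following is a minimal sketch, assuming the `genres` table stores one row per (reviewid, genre) pair; the in-memory table and its rows are made-up stand-ins, not the Pitchfork data. Here a compound label such as `pop/r&b` is kept as a single genre rather than split.

```python
import sqlite3

# Toy stand-in for the genres table (hypothetical rows, not the real data).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE genres (reviewid INTEGER, genre TEXT)")
con.executemany(
    "INSERT INTO genres VALUES (?, ?)",
    [(1, "rock"), (1, "rap"), (2, "rock"), (3, "pop/r&b"), (4, None)],
)

# One row per (review, genre) means a multi-genre album is counted once per genre.
genre_counts = dict(
    con.execute(
        "SELECT genre, COUNT(*) FROM genres "
        "WHERE genre IS NOT NULL GROUP BY genre"
    ).fetchall()
)
con.close()
print(genre_counts)
```

Whether to drop the rows without a genre (as the `WHERE` clause does here) or report them as their own bucket is a judgment call.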

In [164]:
'''
b) Use Counter to compute the number of albums released each year.
'''
con = sqlite3.connect("database.sqlite")

cur = con.cursor()

mydata = []
# available tables: artists, content, genres, labels, reviews, years
for row in cur.execute('SELECT * FROM years;'):
    mydata.append(row)

# Be sure to close the connection
con.close()


yearData = [x[1] for x in mydata]
yearData
Counter(yearData)


Counter({1998: 23,
         2016: 1205,
         2017: 1,
         1996: 33,
         1966: 8,
         1991: 18,
         1968: 16,
         2006: 1182,
         1993: 19,
         1988: 16,
         1962: 3,
         1984: 11,
         2003: 1030,
         1976: 10,
         1994: 26,
         1979: 25,
         1971: 27,
         1975: 13,
         1980: 25,
         1974: 11,
         1969: 14,
         1978: 11,
         1989: 14,
         1997: 24,
         1970: 18,
         2002: 966,
         1995: 19,
         1982: 17,
         2000: 220,
         1973: 13,
         1990: 21,
         1999: 116,
         1992: 25,
         1986: 7,
         1972: 15,
         1964: 5,
         1977: 25,
         1987: 12,
         2011: 1140,
         1985: 19,
         1965: 7,
         1983: 16,
         2009: 1149,
         1981: 25,
         1960: 3,
         2015: 1153,
         2010: 1139,
         2012: 1179,
         2014: 1134,
         2013: 1200,
         2004: 1046,
         2008

In [165]:
'''
c) Compute the ten artists with the highest number of albums reviewed in the data set.
'''
con = sqlite3.connect("database.sqlite")

cur = con.cursor()

mydata = []
# available tables: artists, content, genres, labels, reviews, years
for row in cur.execute('SELECT * FROM artists;'):
    mydata.append(row)

# Be sure to close the connection
con.close()

artistsData = [x[1] for x in mydata]
Counter(artistsData).most_common(10)

[('various artists', 688),
 ('neil young', 23),
 ('guided by voices', 23),
 ('bonnie prince billy', 22),
 ('david bowie', 21),
 ('the beatles', 21),
 ('gucci mane', 20),
 ('of montreal', 20),
 ('mogwai', 20),
 ('robert pollard', 19)]

In [166]:
'''
d) Compute the mean, median, standard deviation, minimum, and maximum album scores.
I assume the scores are stored in the "reviews" table.
From the example row below, 9.3 (index 4) appears to be the score field.

[(22703,
  'mezzanine',
  'massive attack',
  'http://pitchfork.com/reviews/albums/22703-mezzanine/',
  9.3,
  0,
  'nate patrin',
  'contributor',
  '2017-01-08',
  6,
  8,
  1,
  2017),
'''
con = sqlite3.connect("database.sqlite")

cur = con.cursor()

mydata = []
# available tables: artists, content, genres, labels, reviews, years
for row in cur.execute('SELECT * FROM reviews;'):
    mydata.append(row)

# Be sure to close the connection
con.close()

#to use the mean, median, etc from pandas
df = pd.DataFrame([x[4] for x in mydata])

#mean
print('mean :', df[0].mean())

#median
print('median :', df[0].median())

#standard deviation
print('standard deviation :', df[0].std())

#minimum
print('minimum :', df[0].min())

#maximum
print('maximum :', df[0].max())


mean : 7.00577937258735
median : 7.2
standard deviation : 1.2936745021540692
minimum : 0.0
maximum : 10.0
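
As a cross-check on the pandas calls, the same five statistics can be computed with the standard library. This is a small sketch on a hypothetical handful of scores, not the real data. Note that `statistics.stdev` is the sample standard deviation (divides by n - 1), which matches the default of pandas `.std()`.

```python
import statistics

# Hypothetical sample of album scores.
scores = [9.3, 7.9, 7.3, 9.0, 0.0, 10.0]

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "std": statistics.stdev(scores),  # sample std (n - 1), like pandas .std()
    "min": min(scores),
    "max": max(scores),
}
print(summary)
```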


In [167]:
'''
e) Compute the average score by each review author and return the result in a dataframe sorted in descending order.
'''
con = sqlite3.connect("database.sqlite")

cur = con.cursor()
mydata = []
# available tables: artists, content, genres, labels, reviews, years
for row in cur.execute('SELECT author, avg(score) FROM reviews GROUP BY author ORDER BY avg(score) DESC '
                      ):
    mydata.append(row)

# Be sure to close the connection
con.close()

df = pd.DataFrame(mydata)
df = df.rename(columns = {0:'author', 1:'average score'})
df.head()

Unnamed: 0,author,average score
0,nelson george,10.0
1,maura johnston,10.0
2,carvell wallace,9.833333
3,dorian lynskey,9.5
4,rollie pemberton & nick sylvester,9.4


In [168]:

'''
f) Compute the average album score per artist and return the result in a dataframe with an additional 
column for the number of albums they’ve had reviewed. 
i) Return the artists with the top 10 highest average scores 
ii) Return the artists with the top 10 lowest average scores

It is not clear to me what "with an additional column" is asking for. Should i) and ii) be combined in one column?
I assume the result should be a data frame with the following column names:
   1           2                          3                                   4
artist, average score, artist with 10 highest average score, artist with 10 lowest average score,

'''
con = sqlite3.connect("database.sqlite")

#1&2
cur = con.cursor()

mydata = []
# available tables: artists, content, genres, labels, reviews, years
for row in cur.execute('SELECT artist, avg(score) FROM reviews GROUP BY artist ORDER BY avg(score) DESC;'):
    mydata.append(row)

#3
top10 = []
# available tables: artists, content, genres, labels, reviews, years
for row in cur.execute('SELECT artist FROM reviews GROUP BY artist ORDER BY avg(score) DESC LIMIT 10;'):
    top10.append(row)

#4
bot10 = []
# available tables: artists, content, genres, labels, reviews, years
for row in cur.execute('SELECT artist FROM reviews GROUP BY artist ORDER BY avg(score) ASC LIMIT 10;'):
    bot10.append(row)

# Be sure to close the connection
con.close()

df = pd.concat([pd.DataFrame(mydata).rename(columns = {0:'artist', 1:'average score'}), 
                pd.DataFrame(top10).rename(columns = {0:'artist with 10 highest average score'}),
                pd.DataFrame(bot10).rename(columns = {0:'artist with 10 lowest average score'})], axis = 1)
df.head()

Unnamed: 0,artist,average score,artist with 10 highest average score,artist with 10 lowest average score
0,the stone roses,10.0,the stone roses,travis morrison
1,television,10.0,television,push kings
2,talk talk,10.0,talk talk,dan le sac vs. scroobius pip
3,stevie wonder,10.0,stevie wonder,shat
4,slint,10.0,slint,liars academy
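
Another reading of the question is that the "additional column" is simply the number of reviewed albums per artist. A minimal sketch of that reading follows, assuming one row per review in `reviews`; the table contents are hypothetical, and the `n_albums` and `avg_score` aliases are names chosen here for illustration.

```python
import sqlite3

# Toy stand-in for the reviews table (hypothetical rows).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE reviews (reviewid INTEGER, artist TEXT, score REAL)")
con.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [(1, "a", 10.0), (2, "a", 8.0), (3, "b", 6.0),
     (4, "c", 2.0), (5, "c", 3.0), (6, "c", 4.0)],
)

# Average score and review count per artist; only the sort direction changes.
query = """
SELECT artist, AVG(score) AS avg_score, COUNT(*) AS n_albums
FROM reviews
GROUP BY artist
ORDER BY avg_score {direction}
LIMIT 10
"""
top10 = con.execute(query.format(direction="DESC")).fetchall()
bottom10 = con.execute(query.format(direction="ASC")).fetchall()
con.close()
print(top10)
print(bottom10)
```

Keeping the count next to the average makes it easy to see when a perfect 10.0 average rests on a single review.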


## 2) SQL:

Merge the database tables into a dataframe containing all of the relevant metadata.


In [169]:
#try to merge all table into a dataframe
con = sqlite3.connect("database.sqlite")

cur = con.cursor()
mydata = []
# available tables: artists, content, genres, labels, reviews, years
query = '''
SELECT r.reviewid, r.title, r.artist, r.url, r.score, r.best_new_music,
       r.author, r.author_type, r.pub_date, r.pub_weekday,
       c.content, g.genre, l.label, y.year
FROM reviews AS r
JOIN content AS c ON r.reviewid = c.reviewid
JOIN genres AS g ON c.reviewid = g.reviewid
JOIN labels AS l ON g.reviewid = l.reviewid
JOIN years AS y ON g.reviewid = y.reviewid
'''
for row in cur.execute(query):
    mydata.append(row)

# Be sure to close the connection
con.close()

df = pd.DataFrame(mydata).rename(columns = {0:'reviewid', 1:'title', 2:'artist', 
                             3:'url', 4:'score', 5:'best_new_music', 
                             6:'author', 7:'author_type', 8:'pub_date', 
                             9:'pub_weekday', 10:'content', 11:'genre', 12:'label', 13:'year'})
df.head()

Unnamed: 0,reviewid,title,artist,url,score,best_new_music,author,author_type,pub_date,pub_weekday,content,genre,label,year
0,22703,mezzanine,massive attack,http://pitchfork.com/reviews/albums/22703-mezz...,9.3,0,nate patrin,contributor,2017-01-08,6,"“Trip-hop” eventually became a ’90s punchline,...",electronic,virgin,1998.0
1,22721,prelapsarian,krallice,http://pitchfork.com/reviews/albums/22721-prel...,7.9,0,zoe camp,contributor,2017-01-07,5,"Eight years, five albums, and two EPs in, the ...",metal,hathenter,2016.0
2,22659,all of them naturals,uranium club,http://pitchfork.com/reviews/albums/22659-all-...,7.3,0,david glickman,contributor,2017-01-07,5,Minneapolis’ Uranium Club seem to revel in bei...,rock,fashionable idiots,2016.0
3,22659,all of them naturals,uranium club,http://pitchfork.com/reviews/albums/22659-all-...,7.3,0,david glickman,contributor,2017-01-07,5,Minneapolis’ Uranium Club seem to revel in bei...,rock,static shock,2016.0
4,22661,first songs,"kleenex, liliput",http://pitchfork.com/reviews/albums/22661-firs...,9.0,1,jenn pelly,associate reviews editor,2017-01-06,4,Kleenex began with a crash. It transpired one ...,rock,kill rock stars,2016.0
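
Rows 2 and 3 above share reviewid 22659 and differ only in the label: joining one-to-many tables repeats a review once per matching row. If one row per review is preferred, the many-side can be aggregated before joining. This is a sketch under that assumption, using hypothetical toy tables and SQLite's `GROUP_CONCAT`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE reviews (reviewid INTEGER, title TEXT);
CREATE TABLE labels (reviewid INTEGER, label TEXT);
INSERT INTO reviews VALUES (1, 'album one'), (2, 'album two');
INSERT INTO labels VALUES (1, 'label a'), (1, 'label b'), (2, 'label c');
""")

# A plain join repeats review 1 once per label.
joined = con.execute(
    "SELECT r.reviewid, l.label FROM reviews AS r "
    "JOIN labels AS l ON r.reviewid = l.reviewid"
).fetchall()

# Aggregating the labels first keeps one row per review.
per_review = con.execute("""
    SELECT r.reviewid, r.title, agg.labels
    FROM reviews AS r
    LEFT JOIN (
        SELECT reviewid, GROUP_CONCAT(label, ',') AS labels
        FROM labels
        GROUP BY reviewid
    ) AS agg ON r.reviewid = agg.reviewid
    ORDER BY r.reviewid
""").fetchall()
con.close()
print(len(joined), len(per_review))
```

Which shape is better depends on the later steps: per-row statistics and regressions weight duplicated reviews more heavily, while the long format is convenient for per-label or per-genre counts.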


## 3) Dataframe Manipulation (using the DataFrame from part 2, create new DataFrames based on the stipulations below):

a) Create a new DataFrame excluding all artists with names that start with the letter “M” (either upper or lowercase).

b) Create a new DataFrame excluding albums with a score less than 4.0.

c) Create a new DataFrame excluding albums from the label Columbia

d) Create a new DataFrame excluding albums that belong to the metal genre.

e) Create a new DataFrame excluding albums where the artist’s name contains an even number of characters (including whitespace as characters)

f) Combine these DataFrames into one where each album meets the conditions required for each.


In [170]:
'''
a)Create a new DataFrame excluding all artists with names that start with the letter “M” (either upper or lowercase).
'''
df[[not str(x).lower().startswith('m') for x in df['artist']]].head()

Unnamed: 0,reviewid,title,artist,url,score,best_new_music,author,author_type,pub_date,pub_weekday,content,genre,label,year
1,22721,prelapsarian,krallice,http://pitchfork.com/reviews/albums/22721-prel...,7.9,0,zoe camp,contributor,2017-01-07,5,"Eight years, five albums, and two EPs in, the ...",metal,hathenter,2016.0
2,22659,all of them naturals,uranium club,http://pitchfork.com/reviews/albums/22659-all-...,7.3,0,david glickman,contributor,2017-01-07,5,Minneapolis’ Uranium Club seem to revel in bei...,rock,fashionable idiots,2016.0
3,22659,all of them naturals,uranium club,http://pitchfork.com/reviews/albums/22659-all-...,7.3,0,david glickman,contributor,2017-01-07,5,Minneapolis’ Uranium Club seem to revel in bei...,rock,static shock,2016.0
4,22661,first songs,"kleenex, liliput",http://pitchfork.com/reviews/albums/22661-firs...,9.0,1,jenn pelly,associate reviews editor,2017-01-06,4,Kleenex began with a crash. It transpired one ...,rock,kill rock stars,2016.0
5,22661,first songs,"kleenex, liliput",http://pitchfork.com/reviews/albums/22661-firs...,9.0,1,jenn pelly,associate reviews editor,2017-01-06,4,Kleenex began with a crash. It transpired one ...,rock,mississippi,2016.0
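
Parts b) to f) can follow the same boolean-mask pattern. The sketch below uses a hypothetical toy frame with the same column names as the merged DataFrame; it assumes the columns contain no nulls and that each row carries a single genre, as in the merged output above.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "artist": ["massive attack", "krallice", "uranium club", "slint"],
    "score": [9.3, 7.9, 3.5, 10.0],
    "label": ["virgin", "columbia", "static shock", "touch and go"],
    "genre": ["electronic", "metal", "rock", "rock"],
})

keep_not_m = ~toy["artist"].str.lower().str.startswith("m")  # a) drop names starting with M/m
keep_score = toy["score"] >= 4.0                              # b) drop scores below 4.0
keep_label = toy["label"].str.lower() != "columbia"           # c) drop the Columbia label
keep_genre = toy["genre"] != "metal"                          # d) drop metal albums
keep_odd_len = toy["artist"].str.len() % 2 == 1               # e) drop even-length names

# f) rows that satisfy every condition at once
combined = toy[keep_not_m & keep_score & keep_label & keep_genre & keep_odd_len]
print(combined)
```

Combining the masks with `&` is equivalent to intersecting the five filtered DataFrames, and avoids merging them back together afterwards.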


## 4) Feature Engineering:

a) Construct a Pandas DataFrame (see problem 2) containing all album reviews and metadata. Remove any rows that have null values in any column.

b) Add a column to the dataframe for each genre. The entry in this column should be a 1 if the album/row in question belongs to that genre and 0 otherwise. Remember that albums can belong to multiple genres.

c) Add an additional two columns with categorical variables for 1) the author of the review and 2) the role of the author.

d) Create a column for the number of words in the review.

e) Create a column containing the sentiment score of the review. Treat the review as a single string and take the TextBlob polarity score (https://textblob.readthedocs.io/en/dev/quickstart.html).

In [171]:
'''
a)Construct a Pandas DataFrame (see problem 2) containing all album reviews and metadata. 
Remove any rows that have null values in any column.
'''
df = df.dropna()

'''
b) Add a column to the dataframe for each genre. 
The entry in this column should be a 1 if the album/row in question belongs to that genre and 0 otherwise. 
Remember that albums can belong to multiple genres.

From question 1 we know the genres take the following values:
'electronic','metal','rock','rap','experimental','pop','r&b','folk','country','jazz','global'
'''
genreList = ['electronic','metal','rock','rap','experimental','pop',
             'r&b','folk','country','jazz','global']

#insert each genre type with value 0
#then change value to 1 if condition met
for i in genreList:
    #     index = ['electronic' in x for x in df['genre']]
    #     df.loc[index, 'electronic'] = 1
    df[i] = 0
    index = [i in x for x in df['genre']]
    df.loc[index, i] = 1
    
# print('check column names :', df.columns)

# print('check if jazz has value change to 1 if condition met :', df['jazz'].value_counts())

# test = df[df['folk'] == 1]
# print('check one example to see if it has the case of both folk and country set to 1 :', 
#       test[test['country'] == 1].head(1))

'''
c) Add an additional two columns with categorical variables for 
1) the author of the review and 
2) the role of the author.

Create two additional columns with categorical variables? We already have "author" and "author_type",
and both are already categorical, so I am not sure what this question is asking.
'''

'''
d) Create a column for the number of words in the review.
Number of words in the review? The reviews table has multiple columns, 
so it is unclear whether this means one column or all columns of the reviews table together.

I assume "the review" here means the "content" column of the dataframe.
'''
df['review words'] = [len(re.split(r'\W+', x)) for x in df['content']]


'''
e) Create a column containing the sentiment score of the review. Treat the review as a single string and 
take the TextBlob polarity score (https://textblob.readthedocs.io/en/dev/quickstart.html).
'''

df['sentiment_score'] = [TextBlob(x).sentiment.polarity for x in df['content']]
df.head()


Unnamed: 0,reviewid,title,artist,url,score,best_new_music,author,author_type,pub_date,pub_weekday,...,rap,experimental,pop,r&b,folk,country,jazz,global,review words,sentiment_score
0,22703,mezzanine,massive attack,http://pitchfork.com/reviews/albums/22703-mezz...,9.3,0,nate patrin,contributor,2017-01-08,6,...,0,0,0,0,0,0,0,0,1593,0.097281
1,22721,prelapsarian,krallice,http://pitchfork.com/reviews/albums/22721-prel...,7.9,0,zoe camp,contributor,2017-01-07,5,...,0,0,0,0,0,0,0,0,449,0.04164
2,22659,all of them naturals,uranium club,http://pitchfork.com/reviews/albums/22659-all-...,7.3,0,david glickman,contributor,2017-01-07,5,...,0,0,0,0,0,0,0,0,624,0.123304
3,22659,all of them naturals,uranium club,http://pitchfork.com/reviews/albums/22659-all-...,7.3,0,david glickman,contributor,2017-01-07,5,...,0,0,0,0,0,0,0,0,624,0.123304
4,22661,first songs,"kleenex, liliput",http://pitchfork.com/reviews/albums/22661-firs...,9.0,1,jenn pelly,associate reviews editor,2017-01-06,4,...,0,0,0,0,0,0,0,0,1337,0.161576
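
One caveat on the word count: `re.split(r'\W+', ...)` returns an empty string whenever the text starts or ends with punctuation, so the count can be off by one or two. A short comparison on the opening sentence of one review shown above:

```python
import re

text = "Kleenex began with a crash."

# re.split leaves an empty string after the trailing period.
split_tokens = re.split(r"\W+", text)
# re.findall returns only the word-like tokens themselves.
found_tokens = re.findall(r"\w+", text)
# Whitespace splitting keeps punctuation attached but gives the same count here.
whitespace_tokens = text.split()

print(len(split_tokens), len(found_tokens), len(whitespace_tokens))
```

The offset is small next to reviews of several hundred words, so it is unlikely to change the regression results, but `re.findall` or `str.split` gives the cleaner count.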


## 5) Logistic Regression: 

You will now use the features you constructed in the previous exercise to complete a binary logistic regression task accounting for whether an album receives Pitchfork’s designation of “Best New Music.” This is represented by the binary “bnm” variable in the dataset. 

a) Scale all non-categorical variables as needed.

b) Perform your logistic regression model using the statsmodel library (https://www.pythonfordatascience.org/logistic-regression-python/ ). Treat the best new music variable as your dependent variable and use the release year, word count, sentiment, all genre binary variables, author, and author role as your independent variables. 

c) Calculate the odds ratios for your independent variables

d) What features are most/least predictive of a best new music designation and why do you think that is?

e) If you were to engineer an additional feature for the regression, what would it be? Describe how you would approach constructing that feature.


In [172]:
# df['best_new_music']
'''
a) Scale all non-categorical variables as needed.

For this question, I scale the non-categorical variables where needed.
I also check that the numeric categorical variables fall within their expected ranges and contain no typos.
'''
# check non-categorical
#check score
df.score.value_counts()  # score does not seem to need scaling

#check review words
sorted(df['review words'].unique())  # word counts range from 1 to several thousand, so standardize this column (z-score)
df['normalized_review_words'] = (df['review words'] - df['review words'].mean())/df['review words'].std()


#check categorical
#check pub_weekday
df.pub_weekday.value_counts()#from 0-6 which is fine

#check pub_date
df.pub_date.value_counts()
pd.Series([len(x) for x in df.pub_date]).value_counts()#seems fine, all have the same length. 

#check year
df.year = [int(x) for x in df['year']]
sorted(df.year.unique())  # there are no albums from 1961, which should be fine

#check electronic
df.electronic.value_counts()#no problem as expected

df['rnb'] = df['r&b']

df=df.reset_index(drop=True)

'''
b) Perform your logistic regression model using the statsmodel library 
(https://www.pythonfordatascience.org/logistic-regression-python/ ). 
Treat the best new music variable as your dependent variable and use the 
release year, word count, sentiment, all genre binary variables, author, and author role as 
your independent variables.

genre list:'electronic','metal','rock','rap','experimental','pop',
             'r&b','folk','country','jazz','global'

Before running the model, I want to check how balanced the data is. 

Let's then check the correlation among the variables, especially the genre indicators.
'''
corr = df[['electronic','metal','rock','rap','experimental','pop',
             'r&b','folk','country','jazz','global']].corr()
# corr = df.corr()
corr

Unnamed: 0,electronic,metal,rock,rap,experimental,pop,r&b,folk,country,jazz,global
electronic,1.0,-0.094636,-0.445967,-0.146329,-0.152262,-0.136137,-0.136137,-0.090926,-0.090926,-0.068205,-0.055666
metal,-0.094636,1.0,-0.180366,-0.059181,-0.061581,-0.055059,-0.055059,-0.036774,-0.036774,-0.027585,-0.022513
rock,-0.445967,-0.180366,1.0,-0.278888,-0.290197,-0.259463,-0.259463,-0.173295,-0.173295,-0.129991,-0.106094
rap,-0.146329,-0.059181,-0.278888,1.0,-0.095218,-0.085134,-0.085134,-0.056861,-0.056861,-0.042652,-0.034811
experimental,-0.152262,-0.061581,-0.290197,-0.095218,1.0,-0.088586,-0.088586,-0.059167,-0.059167,-0.044382,-0.036223
pop,-0.136137,-0.055059,-0.259463,-0.085134,-0.088586,1.0,1.0,-0.052901,-0.052901,-0.039681,-0.032387
r&b,-0.136137,-0.055059,-0.259463,-0.085134,-0.088586,1.0,1.0,-0.052901,-0.052901,-0.039681,-0.032387
folk,-0.090926,-0.036774,-0.173295,-0.056861,-0.059167,-0.052901,-0.052901,1.0,1.0,-0.026503,-0.021631
country,-0.090926,-0.036774,-0.173295,-0.056861,-0.059167,-0.052901,-0.052901,1.0,1.0,-0.026503,-0.021631
jazz,-0.068205,-0.027585,-0.129991,-0.042652,-0.044382,-0.039681,-0.039681,-0.026503,-0.026503,1.0,-0.016226
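
The correlations of exactly 1.0 between pop and r&b (and between folk and country) follow from how the indicator columns were built. The sketch below assumes, as the example in the task statement suggests, that `pop/r&b` and `folk/country` are stored as single genre labels; the rows are hypothetical. A substring test then switches both halves of a pair on together.

```python
# Hypothetical genre values, one per row.
genres = ["rock", "pop/r&b", "folk/country", "pop/r&b", "jazz"]

# Same substring test as the indicator loop in section 4.
pop = [int("pop" in g) for g in genres]
rnb = [int("r&b" in g) for g in genres]
folk = [int("folk" in g) for g in genres]
country = [int("country" in g) for g in genres]

print(pop == rnb, folk == country)
```

Identical columns carry no extra information, so keeping both in a regression would make the design matrix singular.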


In [173]:
'''
From the above table:
1. pop and r&b are perfectly correlated (correlation of 1.0), so we only need one of them.
   This is expected: "pop/r&b" is a single genre label, and the substring match sets both columns together.
2. folk and country are perfectly correlated for the same reason, so we only need one of them.
3. rock and electronic have a moderate negative correlation (-0.45).
4. The other pairs have low correlations.


Next, I do a stepwise forward selection to check which variables are significant for our outcome.
'''
#1) run single variable year
model = smf.logit("best_new_music~ year", data = df).fit()

model.summary()

Optimization terminated successfully.
         Current function value: 0.272732
         Iterations 7


0,1,2,3
Dep. Variable:,best_new_music,No. Observations:,17904.0
Model:,Logit,Df Residuals:,17902.0
Method:,MLE,Df Model:,1.0
Date:,"Mon, 04 Oct 2021",Pseudo R-squ.:,0.001333
Time:,20:09:57,Log-Likelihood:,-4883.0
converged:,True,LL-Null:,-4889.5
Covariance Type:,nonrobust,LLR p-value:,0.000305

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,25.5467,7.490,3.411,0.001,10.867,40.226
year,-0.0140,0.004,-3.741,0.000,-0.021,-0.007
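
Part c) asks for odds ratios, which are the exponentiated coefficients. The sketch below applies that to the `year` row of the summary above, using only the printed (rounded) numbers. With a fitted statsmodels result, the equivalent would be `np.exp(model.params)` and `np.exp(model.conf_int())`.

```python
import math

# Coefficient and 95% confidence bounds for `year`, copied from the summary above.
coef, ci_low, ci_high = -0.0140, -0.021, -0.007

odds_ratio = math.exp(coef)
odds_ratio_ci = (math.exp(ci_low), math.exp(ci_high))

print(round(odds_ratio, 4), tuple(round(b, 4) for b in odds_ratio_ci))
```

An odds ratio just below 1 means each additional release year multiplies the odds of a best new music designation by slightly less than 1, i.e. a decrease of roughly 1.4% per year in this single-variable model.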


In [188]:
'''
The result from the year-only model above is interesting. The p-value is below 0.001 (< 0.05). It indicates that for every 
one-unit increase in year, the log odds of best new music decrease by 0.0140. One possible reading is that reviewers 
set higher expectations for best new music year by year.
'''
''

''

In [175]:
#2). run single variable normalized_review_words

model = smf.logit("best_new_music~ normalized_review_words", data = df).fit()

model.summary()

Optimization terminated successfully.
         Current function value: 0.228672
         Iterations 7


0,1,2,3
Dep. Variable:,best_new_music,No. Observations:,17904.0
Model:,Logit,Df Residuals:,17902.0
Method:,MLE,Df Model:,1.0
Date:,"Mon, 04 Oct 2021",Pseudo R-squ.:,0.1627
Time:,20:09:57,Log-Likelihood:,-4094.1
converged:,True,LL-Null:,-4889.5
Covariance Type:,nonrobust,LLR p-value:,0.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-2.8264,0.035,-81.526,0.000,-2.894,-2.758
normalized_review_words,0.9149,0.025,36.372,0.000,0.866,0.964
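
Log odds are easier to read once converted to probabilities with the logistic function. A small sketch using the two coefficients printed above; because the predictor is standardized, the intercept corresponds to a review of average length and adding the slope corresponds to a review one standard deviation longer.

```python
import math

def sigmoid(z):
    """Logistic function: maps log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Intercept and slope copied from the summary above.
intercept, slope = -2.8264, 0.9149

p_mean_length = sigmoid(intercept)          # review of average length
p_plus_one_sd = sigmoid(intercept + slope)  # review one standard deviation longer

print(round(p_mean_length, 3), round(p_plus_one_sd, 3))
```

Under this single-variable model the predicted probability goes from about 0.06 to about 0.13, i.e. it roughly doubles for a review one standard deviation longer than average.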


In [176]:
'''
The p-value is below 0.001 (< 0.05). It indicates that for every one-unit increase
in normalized_review_words (one standard deviation of word count), the log odds of best new music increase by 0.9149.
'''




In [177]:
#3). run single variable sentiment_score

model = smf.logit("best_new_music~ sentiment_score", data = df).fit()

model.summary()

Optimization terminated successfully.
         Current function value: 0.272826
         Iterations 7


0,1,2,3
Dep. Variable:,best_new_music,No. Observations:,17904.0
Model:,Logit,Df Residuals:,17902.0
Method:,MLE,Df Model:,1.0
Date:,"Mon, 04 Oct 2021",Pseudo R-squ.:,0.0009884
Time:,20:09:57,Log-Likelihood:,-4884.7
converged:,True,LL-Null:,-4889.5
Covariance Type:,nonrobust,LLR p-value:,0.001878

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-2.6635,0.068,-39.129,0.000,-2.797,-2.530
sentiment_score,1.5382,0.496,3.102,0.002,0.566,2.510
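
The header statistics in these summaries can be reproduced from the two log-likelihoods. This sketch uses the rounded values printed above, so the results only approximately match the reported Pseudo R-squ. and LLR p-value; the chi-square tail formula used here is specific to one degree of freedom (one added predictor).

```python
import math

# Log-Likelihood and LL-Null copied from the summary above.
llf, llnull = -4884.7, -4889.5

llr = 2.0 * (llf - llnull)      # likelihood-ratio statistic
pseudo_r2 = 1.0 - llf / llnull  # McFadden's pseudo R-squared

# Chi-square survival function for 1 degree of freedom.
p_value = math.erfc(math.sqrt(llr / 2.0))

print(round(llr, 2), round(pseudo_r2, 6), round(p_value, 5))
```

This makes the contrast between models concrete: sentiment improves the log-likelihood by only a few units over the null model, while the review-length model above improves it by several hundred.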


In [189]:
'''
The p-value is 0.002 < 0.05. It indicates that for every one-unit increase
in sentiment_score, the log odds of best new music increase by 1.5382.
'''
''

''

In [179]:
#4). run single variable C(author_type)

model = smf.logit("best_new_music~ C(author_type)", data = df).fit()

model.summary()

         Current function value: 0.265241
         Iterations: 35




0,1,2,3
Dep. Variable:,best_new_music,No. Observations:,17904.0
Model:,Logit,Df Residuals:,17889.0
Method:,MLE,Df Model:,14.0
Date:,"Mon, 04 Oct 2021",Pseudo R-squ.:,0.02876
Time:,20:09:57,Log-Likelihood:,-4748.9
converged:,False,LL-Null:,-4889.5
Covariance Type:,nonrobust,LLR p-value:,9.384e-52

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-1.7918,0.624,-2.873,0.004,-3.014,-0.570
C(author_type)[T.associate editor],-29.9437,1.09e+06,-2.75e-05,1.000,-2.14e+06,2.14e+06
C(author_type)[T.associate features editor],-0.1542,0.980,-0.157,0.875,-2.075,1.767
C(author_type)[T.associate reviews editor],1.3535,0.686,1.972,0.049,0.008,2.699
C(author_type)[T.associate staff writer],-1.3083,0.773,-1.692,0.091,-2.824,0.207
C(author_type)[T.contributing editor],-0.4691,0.663,-0.708,0.479,-1.768,0.830
C(author_type)[T.contributor],-0.8528,0.624,-1.366,0.172,-2.077,0.371
C(author_type)[T.deputy news editor],-1.1939,0.774,-1.543,0.123,-2.711,0.323
C(author_type)[T.editor-in-chief],0.1178,0.698,0.169,0.866,-1.251,1.487
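
A coefficient near -30 with a standard error around 1e+06, together with `converged: False`, is the pattern usually produced when a category contains no positive outcomes at all, so its log odds have no finite estimate. Whether that is what happens for the associate editor level here is an assumption; a quick tabulation like the sketch below (on hypothetical rows) would confirm it.

```python
from collections import Counter

# Hypothetical (author_type, best_new_music) pairs.
rows = [
    ("contributor", 0), ("contributor", 1), ("contributor", 0),
    ("associate editor", 0), ("associate editor", 0),
    ("editor-in-chief", 1), ("editor-in-chief", 0),
]

totals = Counter(role for role, _ in rows)
positives = Counter(role for role, bnm in rows if bnm == 1)

# Levels with no positives (or only positives) have no finite log-odds estimate.
separated = sorted(
    role for role in totals
    if positives[role] == 0 or positives[role] == totals[role]
)
print(separated)
```

Common remedies are to merge such sparse levels into a broader group or to drop them before fitting.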


In [180]:
'''
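For most levels, author type is not significantly associated with our outcome best_new_music.
Only C(author_type)[T.associate reviews editor] has a p-value below 0.05 (0.049); 
C(author_type)[T.associate staff writer] is marginal (0.091). It also does not make much sense for these 
levels to stand out while something like senior staff writer has no significant p-value.
Note that this fit did not converge (converged: False) and the associate editor level has a huge 
standard error (1.09e+06), so these estimates should be read with caution.

Also, from the Spearman correlation below, author_type appears to have a very small correlation with best_new_music. 
Because author_type is nominal, its rank order is arbitrary, so this is only a rough check.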
'''

corr, _ = spearmanr(df['best_new_music'], df['author_type'])
corr

0.08068350886487367

In [181]:
#5). run single variable C(author)

'''
If we simply run the regression below, it raises a singular matrix error. 
This is because we have many different authors: relative to our data size, there are too many dummy variables.
model = smf.logit("best_new_music~ C(author)", data = df).fit()
model.summary()

df['author'].value_counts() #we have 246 different authors 

Instead of using all the authors, 
we can try grouping them by number of reviews: >200: high, 101-200: moderate, <=100: low
'''
# Bin each author by how many reviews they have written.
# A vectorized assignment avoids the chained-indexing SettingWithCopyWarning.
author_counts = df['author'].map(df['author'].value_counts())
df['cat_author'] = np.select([author_counts > 200, author_counts > 100],
                             ['high', 'moderate'], default='low')

model = smf.logit("best_new_music~ C(cat_author)", data = df).fit()
model.summary()

Optimization terminated successfully.
         Current function value: 0.273093
         Iterations 6


0,1,2,3
Dep. Variable:,best_new_music,No. Observations:,17904.0
Model:,Logit,Df Residuals:,17901.0
Method:,MLE,Df Model:,2.0
Date:,"Mon, 04 Oct 2021",Pseudo R-squ.:,1.035e-05
Time:,20:10:00,Log-Likelihood:,-4889.5
converged:,True,LL-Null:,-4889.5
Covariance Type:,nonrobust,LLR p-value:,0.9507

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-2.4675,0.035,-71.074,0.000,-2.536,-2.399
C(cat_author)[T.low],-0.0189,0.072,-0.264,0.792,-0.159,0.121
C(cat_author)[T.moderate],-0.0181,0.079,-0.230,0.818,-0.173,0.137


In [190]:
'''
From the above result, we can conclude that even after grouping the authors into 3 types, 
there is no significant relationship between author and outcome.
'''
''

''

In [183]:
#6). we then run the genre variables

'''
Recall the earlier conclusions:
1. pop and r&b are perfectly correlated (correlation of 1.0), so we only need one of them.
2. folk and country have a correlation of 1.0, so we only need one of them.
3. rock and electronic have a moderate correlation.
'''

model = smf.logit("best_new_music~ C(rock)", data = df).fit()
'''
The p-value < 0.05. The log odds of best new music are 0.1654 higher for rock albums.
'''

model = smf.logit("best_new_music~ C(rap)", data = df).fit()
'''
not significant
'''

model = smf.logit("best_new_music~ C(experimental)", data = df).fit()
'''
The p-value < 0.05. The log odds of best new music are 0.2253 higher for experimental albums.
'''

model = smf.logit("best_new_music~ C(pop)", data = df).fit()
'''
The p-value < 0.05. The log odds of best new music are 0.2017 higher for pop albums.
'''

model = smf.logit("best_new_music~ C(folk)", data = df).fit()
'''
The p-value < 0.05. The log odds of best new music are 0.5399 lower for folk albums.
folk is unfavorable for best_new_music
'''

model = smf.logit("best_new_music~ C(jazz)", data = df).fit()
'''
not significant
'''

model = smf.logit("best_new_music~ C(metal)", data = df).fit()
'''
The p-value < 0.05. The log odds of best new music are 0.5875 lower for metal albums.
metal is unfavorable for best_new_music
'''

model.summary()

Optimization terminated successfully.
         Current function value: 0.272852
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.273033
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.272931
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.272987
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.272838
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.273085
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.272775
         Iterations 7


0,1,2,3
Dep. Variable:,best_new_music,No. Observations:,17904.0
Model:,Logit,Df Residuals:,17902.0
Method:,MLE,Df Model:,1.0
Date:,"Mon, 04 Oct 2021",Pseudo R-squ.:,0.001175
Time:,20:10:01,Log-Likelihood:,-4883.8
converged:,True,LL-Null:,-4889.5
Covariance Type:,nonrobust,LLR p-value:,0.0006994

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-2.4570,0.028,-86.993,0.000,-2.512,-2.402
C(metal)[T.1],-0.5875,0.189,-3.109,0.002,-0.958,-0.217


In [191]:
dir(model)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_cache',
 '_data_attr',
 '_data_in_cache',
 '_get_endog_name',
 '_get_robustcov_results',
 '_use_t',
 'aic',
 'bic',
 'bse',
 'conf_int',
 'cov_kwds',
 'cov_params',
 'cov_type',
 'df_model',
 'df_resid',
 'f_test',
 'fittedvalues',
 'get_margeff',
 'initialize',
 'k_constant',
 'llf',
 'llnull',
 'llr',
 'llr_pvalue',
 'load',
 'mle_retvals',
 'mle_settings',
 'model',
 'nobs',
 'normalized_cov_params',
 'params',
 'pred_table',
 'predict',
 'prsquared',
 'pvalues',
 'remove_data',
 'resid_dev',
 'resid_generalized',
 'resid_pearson',
 'resid_response',
 'save',
 'scale',
 'set_null_options',
 'summary'

In [192]:
# Next, combine the significant variables step by step (forward selection), for example:
'''
model = smf.logit("best_new_music~ C(rock) + C(experimental)", data = df).fit()

model = smf.logit("best_new_music~ C(rock) + C(experimental) + C(pop)", data = df).fit()

model = smf.logit("best_new_music~ C(rock) + C(experimental) + C(pop)+ normalized_review_words", data = df).fit()

model = smf.logit("best_new_music~ C(rock) + C(experimental) + C(pop)+ normalized_review_words + sentiment_score", data = df).fit()

Some steps are omitted here.
'''
# The best model so far; the optimizer converges.
model = smf.logit("best_new_music~ C(rock) + C(experimental)+ \
normalized_review_words + sentiment_score + year +C(rnb) + C(electronic)", data = df).fit()

# Interaction terms are not added here because of the time limit, and because they usually make the model harder to interpret.
model.summary()

Optimization terminated successfully.
         Current function value: 0.226075
         Iterations 8


Dep. Variable:,best_new_music,No. Observations:,17904.0
Model:,Logit,Df Residuals:,17896.0
Method:,MLE,Df Model:,7.0
Date:,"Mon, 04 Oct 2021",Pseudo R-squ.:,0.1722
Time:,20:11:26,Log-Likelihood:,-4047.7
converged:,True,LL-Null:,-4889.5
Covariance Type:,nonrobust,LLR p-value:,0.0

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-38.2514,8.004,-4.779,0.000,-53.938,-22.564
C(rock)[T.1],0.5222,0.092,5.654,0.000,0.341,0.703
C(experimental)[T.1],0.8290,0.124,6.700,0.000,0.586,1.071
C(rnb)[T.1],0.6035,0.133,4.551,0.000,0.344,0.863
C(electronic)[T.1],0.6057,0.109,5.561,0.000,0.392,0.819
normalized_review_words,0.9552,0.026,36.495,0.000,0.904,1.007
sentiment_score,2.4647,0.588,4.191,0.000,1.312,3.617
year,0.0172,0.004,4.331,0.000,0.009,0.025

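The stepwise comparison above relies on fit statistics reported in each summary. As a rough sketch, the log-likelihoods printed in this notebook are enough to reproduce McFadden's pseudo R-squared and to compute a likelihood-ratio statistic and AIC by hand. The helper names below are ours, and small differences from the printed values come from the rounded log-likelihoods:

```python
# Log-likelihoods copied from the summaries in this notebook.
ll_null = -4889.5   # intercept-only model (LL-Null)
ll_metal = -4883.8  # best_new_music ~ C(metal), 2 parameters
ll_full = -4047.7   # final model, 8 parameters (intercept + 7 predictors)


def mcfadden_r2(ll_model: float, ll_null: float) -> float:
    """McFadden's pseudo R-squared: 1 - LL_model / LL_null."""
    return 1.0 - ll_model / ll_null


def lr_statistic(ll_small: float, ll_big: float) -> float:
    """Likelihood-ratio statistic for a model nested inside a bigger one."""
    return 2.0 * (ll_big - ll_small)


def aic(ll_model: float, n_params: int) -> float:
    """Akaike information criterion; lower is better."""
    return 2.0 * n_params - 2.0 * ll_model


r2_metal = mcfadden_r2(ll_metal, ll_null)
r2_full = mcfadden_r2(ll_full, ll_null)
lr_full = lr_statistic(ll_null, ll_full)  # compared against chi-square, df = 7
aic_metal = aic(ll_metal, 2)
aic_full = aic(ll_full, 8)

print(f"pseudo R2: metal={r2_metal:.4f}, full={r2_full:.4f}")
print(f"LR statistic, full vs. null: {lr_full:.1f}")
print(f"AIC: metal={aic_metal:.1f}, full={aic_full:.1f}")
```

The likelihood-ratio statistic only applies to nested models, which is why both models are compared against the intercept-only model here. The metal model is not nested in the final model, so AIC is the more appropriate comparison between those two.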

In [193]:
'''
c) Calculate the odds ratios for your independent variables
'''
model_odds = pd.DataFrame(np.exp(model.params), columns=['OR'])
# model.pvalues holds p-values, not z statistics, so the column is labeled accordingly.
model_odds['p-value'] = model.pvalues
model_odds[['2.5%', '97.5%']] = np.exp(model.conf_int())

model_odds

,OR,p-value,2.5%,97.5%
Intercept,2.4414510000000003e-17,1.759677e-06,3.757651e-24,1.586279e-10
C(rock)[T.1],1.685794,1.56896e-08,1.406631,2.020359
C(experimental)[T.1],2.290978,2.087065e-11,1.797629,2.919723
C(rnb)[T.1],1.828494,5.350701e-06,1.409962,2.371264
C(electronic)[T.1],1.832589,2.684649e-08,1.480287,2.268738
normalized_review_words,2.599221,1.330547e-291,2.469245,2.73604
sentiment_score,11.76016,2.775265e-05,3.714012,37.23776
year,1.017385,1.484223e-05,1.009481,1.025352

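Each odds ratio in the table above is per one-unit change in its variable, and the variables are measured on very different scales. As a hedged illustration, the coefficients from the final model can be rescaled to changes that are easier to picture. The change sizes in `deltas` are hypothetical choices for illustration, not values derived from the data:

```python
import math

# Coefficients (log odds per one-unit change) copied from the final model summary.
coefs = {
    "sentiment_score": 2.4647,
    "year": 0.0172,
    "normalized_review_words": 0.9552,
}

# Hypothetical change sizes, chosen only for illustration.
deltas = {
    "sentiment_score": 0.1,          # a small shift in sentiment
    "year": 10,                      # one decade
    "normalized_review_words": 1.0,  # one unit of the normalized length
}

# Odds ratio for a change of size delta is exp(coef * delta).
rescaled_or = {name: math.exp(coefs[name] * deltas[name]) for name in coefs}

for name, value in rescaled_or.items():
    print(f"{name}: odds ratio {value:.3f} per change of {deltas[name]}")
```

Rescaling does not change the model. It only changes the unit in which each effect is read, which matters when ranking features by odds ratio.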

In [194]:
'''
d) What features are most/least predictive of a best new music designation and why do you think that is?

1) Some features were removed during the forward stepwise selection and the correlation analysis.
The reasons were discussed in the steps above, so I am not going to repeat them here.

2) All features remaining in the best model above are significant.
Among them, sentiment_score has the largest odds ratio (about 11.76 per one-unit increase), so it looks like a very strong feature.
This makes sense, because the best new music designation should really be based on the content of the album, which the review sentiment reflects.

In terms of review length, normalized_review_words has an odds ratio of about 2.60 and by far the largest z statistic (36.5): reviewers are more likely to write longer reviews for albums they love.

Also, some genres such as rock, R&B, electronic, and experimental might be more favorable than others, for example jazz, which is probably more of a niche genre and was not significant on its own.
'''

In [195]:
'''
e) If you were to engineer an additional feature for the regression, what would it be? 
Describe how you would approach constructing that feature.

1) I would probably consider the publication month. Some months may be closer to the music season,
and music published around that time might get more attention and be more likely to receive a better review.

2) I might also create a feature that captures the artist's rating.
To build it, I could scrape artist reviews, descriptions, etc. from sources like Twitter.

3) Instead of using that many separate genre classes (rock, R&B, jazz, etc.), we might group them
by popularity: low, moderate, and high popularity groups based on the number of data points.

'''

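Two of the feature ideas above can be sketched without external data. The example below derives a publication-month feature and, as a stand-in for the scraped artist rating, the mean score of an artist's earlier reviews. This swaps the Twitter-based idea for a within-dataset proxy. The rows, field names, and the `add_features` helper are hypothetical and only illustrate the construction:

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (artist, publication date, score). Not real Pitchfork data.
reviews = [
    ("artist_a", date(2010, 3, 1), 7.0),
    ("artist_b", date(2011, 6, 15), 6.0),
    ("artist_a", date(2012, 11, 20), 8.0),
    ("artist_a", date(2015, 1, 5), 9.0),
]


def add_features(rows):
    """Return one dict per review with a month feature and the artist's prior mean score."""
    totals = defaultdict(float)  # running sum of earlier scores per artist
    counts = defaultdict(int)    # number of earlier reviews per artist
    features = []
    # Process reviews in date order so each row only sees earlier reviews.
    for artist, pub_date, score in sorted(rows, key=lambda row: row[1]):
        prior_mean = totals[artist] / counts[artist] if counts[artist] else None
        features.append({
            "artist": artist,
            "pub_month": pub_date.month,
            "prior_reviews": counts[artist],
            "prior_mean_score": prior_mean,
        })
        # Update the running statistics only after the row has been emitted.
        totals[artist] += score
        counts[artist] += 1
    return features


features = add_features(reviews)
for row in features:
    print(row)
```

Using only earlier reviews keeps the current album's own score out of its feature, which would otherwise leak the outcome into the predictor. A first review has no history, so its prior mean is left as `None` and would need an explicit fill value or indicator before modeling.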
## 6) Data Visualization (Optional): 

Using the results from your regression and data analysis create a visualization that tells a story about the data. Feel free to take personal liberties with this and be as creative as you like. 