# Regression Analysis
This is the second part of my project on Japanese web novel title length. In this notebook, I use regression analysis to answer the following questions:<br>
- Are longer titles correlated with more views?<br>
- Are longer titles correlated with more readers who actually read the novel (measured here by bookmarks)?<br>
- Are longer titles correlated with a better rating?<br>

## Import the Dataset

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display
pd.options.display.max_columns = None

In [2]:
df = pd.read_excel('japanesewebnovel.xlsx')
df = df[['ID','Title_Length','Weekly_Viewers','Bookmarks','Average_Score','Genre','Complete','Year','Latest_Update','Word_Count','Reviews','Parts','Average_Length']]
df['No_update'] = np.where(((df['Complete'] == 'On Going') & (df['Latest_Update'] < 2021)) , 1, 0)
df['Complete'] = df['Complete'].replace(['Completed','On Going'], [1,0])
df = df.replace({'Genre': 'Non-genre'}, 'Others')
df.head()

,ID,Title_Length,Weekly_Viewers,Bookmarks,Average_Score,Genre,Complete,Year,Latest_Update,Word_Count,Reviews,Parts,Average_Length,No_update
0,N0680D,9,438,3150,9.45,Fantasy,1,2007,2020,2838217,4,328,8653.1,0
1,N5702B,12,270,2521,9.5109,Fantasy,0,2007,2012,563349,2,88,6401.7,1
2,N5914B,5,171,55,9.4,Others,0,2007,2021,13184,0,18,732.4,0
3,N0523D,18,122,250,9.4062,Romance,1,2007,2021,233453,1,116,2012.5,0
4,N2436C,7,108,86,9.2222,Others,0,2007,2008,55785,0,28,1992.3,1


**List of variables in this dataset:**<br>
**Title_Length:** Length of the novel title. (The maximum is 100 Japanese characters.)<br>
**Weekly_Viewers:** Number of unique readers who have clicked on this novel in the past week (the week before the dataset was collected, which was about mid-July 2021).<br>
**Bookmarks:** Number of readers who have bookmarked this novel.<br>
**Average_Score:** Average rating the readers give to this novel. (The rating is out of 10.)<br>
**Genre:** A categorical variable that shows the genre of the novel.<br>
**Complete:** A dummy variable that shows whether this novel is completed or ongoing. (1 = completed, 0 = ongoing)<br>
**Year:** The year the novel was first published on this website.<br>
**Latest_Update:** The year the novel was last updated. <br>
**Word_Count:** Total length of the novel in Japanese characters (Average_Length equals Word_Count divided by Parts).<br>
**Reviews:** Number of readers who have written a review for this novel.<br>
**Parts:** Number of parts/chapters.<br>
**Average_Length:** Average length (in Japanese characters) per part.<br>
**No_update:** A dummy variable that shows whether this novel is an ongoing work with no update for over 6 months. (The data was collected in July 2021. Ongoing stories that have not had any update in 2021 are treated as 'no update'.)<br>

## Summary statistics

In [3]:
df.agg({
    "Title_Length": ["min", "max", "median", "mean", "std"],
    "Bookmarks": ["min", "max", "median", "mean", "std"],
    "Average_Score": ["min", "max", "median", "mean", "std"],
    "Word_Count": ["min", "max", "median", "mean", "std"],
    "Weekly_Viewers": ["min", "max", "median", "mean", "std"],
}).round(2)

,Title_Length,Bookmarks,Average_Score,Word_Count,Weekly_Viewers
min,1.0,0.0,2.0,3.0,100.0
max,100.0,275746.0,10.0,22748426.0,310790.0
median,12.0,140.0,8.84,66225.5,100.0
mean,16.76,1741.2,8.59,207708.43,1073.66
std,14.69,6962.82,1.15,459418.12,7747.77


*The minimum of Weekly_Viewers is 100 because the website only shows "100 or less" for novels with fewer than 100 views in one week.
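
Because a value of 100 really means "100 or fewer", those rows are censored rather than exact. The snippet below is a minimal sketch of how the censored share could be checked. The toy values and the `Censored` column name are assumptions made for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy values for illustration only; not taken from the real dataset.
toy = pd.DataFrame({"Weekly_Viewers": [100, 100, 438, 100, 270]})

# A displayed value of 100 stands for "100 or fewer", so flag it as censored.
toy["Censored"] = toy["Weekly_Viewers"] <= 100

# Share of rows whose true weekly view count is unknown.
censored_share = toy["Censored"].mean()

# Rows with an exact view count.
uncensored = toy[~toy["Censored"]]

print(censored_share, len(uncensored))
```

On the real data, the same two lines applied to `df` would show how much of the sample sits at the floor, which matters for any model that uses Weekly_Viewers as the outcome.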

### Summary statistics by genre

In [4]:
df.groupby(["Genre"]).agg({
    "Title_Length": ["count","min", "max", "median", "mean", "std"],
    "Bookmarks": ["min", "max", "median", "mean", "std"],
    "Average_Score": ["min", "max", "median", "mean", "std"],
    "Weekly_Viewers": ["min", "max", "median", "mean", "std"],
}).round(1)

,Title_Length,Title_Length,Title_Length,Title_Length,Title_Length,Title_Length,Bookmarks,Bookmarks,Bookmarks,Bookmarks,Bookmarks,Average_Score,Average_Score,Average_Score,Average_Score,Average_Score,Weekly_Viewers,Weekly_Viewers,Weekly_Viewers,Weekly_Viewers,Weekly_Viewers
Genre,count,min,max,median,mean,std,min,max,median,mean,std,min,max,median,mean,std,min,max,median,mean,std
Fantasy,12468,1,100,18,24.0,19.7,0,275746,604,4046.6,12051.8,2.0,10.0,8.8,8.6,0.9,100,310790,190,2543.0,13135.7
Literature,4083,1,100,13,17.3,13.4,0,157710,207,1275.9,4532.5,2.0,10.0,9.0,8.7,1.2,100,201663,100,815.2,5265.2
Others,22223,1,100,9,10.7,7.2,0,82301,2,227.3,1223.8,2.0,10.0,8.8,8.4,1.5,100,11889,100,109.8,166.9
Romance,11335,1,100,15,19.8,14.8,0,163764,734,2226.9,5840.9,2.0,10.0,8.9,8.8,0.8,100,248818,150,1335.9,7577.9
Sci-fi,1831,1,100,17,20.9,14.6,0,109623,649,2448.5,6685.2,2.0,10.0,8.8,8.6,1.0,100,217034,109,1719.4,9133.1


### Summary statistics by year

In [5]:
df.groupby(["Year"]).agg({
    "Title_Length": ["count","min", "max", "median", "mean", "std"],
    "Bookmarks": ["min", "max", "median", "mean", "std"],
    "Average_Score": ["min", "max", "median", "mean", "std"],
    "Weekly_Viewers": ["min", "max", "median", "mean", "std"],
}).round(2)

,Title_Length,Title_Length,Title_Length,Title_Length,Title_Length,Title_Length,Bookmarks,Bookmarks,Bookmarks,Bookmarks,Bookmarks,Average_Score,Average_Score,Average_Score,Average_Score,Average_Score,Weekly_Viewers,Weekly_Viewers,Weekly_Viewers,Weekly_Viewers,Weekly_Viewers
Year,count,min,max,median,mean,std,min,max,median,mean,std,min,max,median,mean,std,min,max,median,mean,std
2007,2796,1,59,8,9.13,6.02,0,3150,1.0,12.13,98.42,2.0,10.0,8.4,7.87,2.2,100,438,100,100.22,7.29
2008,3365,1,62,8,9.76,6.26,0,13958,1.0,29.69,332.16,2.0,10.0,8.6,8.07,2.0,100,972,100,100.64,16.9
2009,3622,1,86,8,9.84,6.54,0,64697,2.0,85.86,1219.67,2.0,10.0,8.89,8.41,1.65,100,27234,100,109.51,452.91
2010,3883,1,100,9,10.12,6.63,0,107228,10.0,287.02,2777.91,2.0,10.0,8.82,8.47,1.46,100,37706,100,133.33,862.64
2011,4014,1,93,9,10.52,6.81,0,157710,34.0,565.84,4073.99,2.0,10.0,8.74,8.44,1.27,100,114574,100,183.97,2138.77
2012,4004,1,71,10,11.31,6.96,0,216174,79.0,1201.67,7333.75,2.0,10.0,8.76,8.51,1.23,100,218673,100,354.52,4536.28
2013,3979,1,78,10,12.4,7.8,0,275746,130.0,1680.29,9502.29,2.0,10.0,8.79,8.54,1.19,100,263264,100,630.92,7968.98
2014,4017,1,93,11,13.84,8.97,0,111144,218.0,1820.42,6806.56,2.0,10.0,8.8,8.56,1.18,100,105667,100,416.44,3355.07
2015,3709,1,98,13,15.37,10.0,0,195295,377.0,2653.57,9706.09,2.0,10.0,8.78,8.62,0.97,100,222695,100,778.3,7426.52
2016,3279,1,100,14,17.3,10.84,0,203045,729.0,3195.53,9877.55,2.0,10.0,8.82,8.7,0.75,100,310790,100,1124.57,9821.18


## Robust regression analysis
I used robust linear regression models (`rlm` from statsmodels) rather than plain OLS to estimate the association between title length and weekly viewers, average score, and the number of bookmarks. Each regression table contains the results of four regression models:

1. Simple linear regression of the outcome variable (which is weekly viewers, average score, or number of bookmarks) on title length.
2. Multiple linear regression of the outcome variable on title length, genre, published year, number of parts, whether or not the work is completed, and whether or not the work is an ongoing novel with no update for over 6 months.
3. Multiple linear regression of the outcome variable on title length, **title length squared**, genre, published year, number of parts, whether or not the work is completed, and whether or not the work is an ongoing novel with no update for over 6 months.
4. Multiple linear regression of the outcome variable on title length, **title length squared**, genre, published year, number of parts, whether or not the work is completed, whether or not the work is an ongoing novel with no update for over 6 months, **interaction terms between the genre and title length**, and **interaction terms between the genre and title length squared**.

※ The coefficients of the published-year dummies are omitted from the regression tables to save space.
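
Written out, specification 4 takes roughly the following form. The notation is shorthand introduced here for illustration and does not appear in the notebook: $y_i$ is the outcome for novel $i$, $L_i$ its title length, $G_{ig}$ the genre dummies (with Others as the baseline), and $Y_{it}$ the published-year dummies.

$$
y_i = \beta_0 + \beta_1 L_i + \beta_2 L_i^2
+ \sum_{g} \gamma_g G_{ig}
+ \sum_{g} \left( \lambda_g L_i G_{ig} + \mu_g L_i^2 G_{ig} \right)
+ \sum_{t} \delta_t Y_{it}
+ \theta_1 \text{Parts}_i + \theta_2 \text{Complete}_i + \theta_3 \text{No\_update}_i
+ \varepsilon_i,
\qquad g \in \{\text{Fantasy}, \text{Literature}, \text{Romance}, \text{Sci-fi}\}
$$

Specification 3 is the same equation with all $\lambda_g = \mu_g = 0$, specification 2 additionally sets $\beta_2 = 0$, and specification 1 keeps only $\beta_0$ and $\beta_1$.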

In [6]:
from stargazer.stargazer import Stargazer
from IPython.display import HTML
from statsmodels.formula.api import rlm

### Create dummy variables and interaction terms

In [7]:
dummy = pd.get_dummies(df['Genre'])
dummy = dummy.rename(columns={"Sci-fi": "Sci_fi"})
dummy.head()

,Fantasy,Literature,Others,Romance,Sci_fi
0,1,0,0,0,0
1,1,0,0,0,0
2,0,0,1,0,0
3,0,0,0,1,0
4,0,0,1,0,0


In [8]:
# dummy variables for genres ('Others' is the baseline category, so its dummy is dropped)
df1 = pd.concat([df, dummy], axis=1)
df1 = df1.drop(columns=['Others'])

# interaction terms (title_length * genre)
df1['tlength_fantasy'] = df1['Title_Length']*df1['Fantasy']
df1['tlength_literature'] = df1['Title_Length']*df1['Literature']
df1['tlength_romance'] = df1['Title_Length']*df1['Romance']
df1['tlength_scifi'] = df1['Title_Length']*df1['Sci_fi']

# interaction terms (title_length_squared * genre)
df1['tlength_squared_fantasy'] = df1['Title_Length']*df1['Title_Length']*df1['Fantasy']
df1['tlength_squared_literature'] = df1['Title_Length']*df1['Title_Length']*df1['Literature']
df1['tlength_squared_romance'] = df1['Title_Length']*df1['Title_Length']*df1['Romance']
df1['tlength_squared_scifi'] = df1['Title_Length']*df1['Title_Length']*df1['Sci_fi']
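
The eight interaction columns above all follow one pattern, so they could also be generated in a loop. The following is a minimal sketch on a toy frame. The toy values and the `suffixes` mapping are assumptions for illustration, and in the notebook the mapping would also need entries for `Literature` and `Sci_fi` (the latter with the suffix `scifi`).

```python
import pandas as pd

# Toy frame for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "Title_Length": [9, 12, 5],
    "Genre": ["Fantasy", "Romance", "Others"],
})

# One 0/1 column per genre (cast to int because newer pandas returns booleans).
dummies = pd.get_dummies(toy["Genre"]).astype(int)
toy = pd.concat([toy, dummies], axis=1)

# Hypothetical mapping from dummy column name to the suffix used in the notebook.
suffixes = {"Fantasy": "fantasy", "Romance": "romance"}

for genre, suffix in suffixes.items():
    # title_length x genre
    toy[f"tlength_{suffix}"] = toy["Title_Length"] * toy[genre]
    # title_length squared x genre
    toy[f"tlength_squared_{suffix}"] = toy["Title_Length"] ** 2 * toy[genre]

print(toy[["tlength_fantasy", "tlength_squared_romance"]])
```

The loop produces the same column names as the explicit assignments, so the regression formulas below would not need to change.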

### 1. Weekly Viewers

In [9]:
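# Keep only novels with more than 100 weekly viewers: a value of 100 means "100 or less" on the website, so those rows are censored
df2 = df1[df1['Weekly_Viewers'] > 100]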

model1 = rlm('Weekly_Viewers ~ Title_Length',data=df2).fit()
model2 = rlm('Weekly_Viewers ~ Title_Length + Fantasy + Literature + Romance + Sci_fi + C(Year) + Parts + Complete + No_update',data=df2).fit()
model3 = rlm('Weekly_Viewers ~ Title_Length + np.power(Title_Length, 2) + Fantasy + Literature + Romance + Sci_fi + C(Year) + Parts + Complete + No_update',data=df2).fit()
model4 = rlm('Weekly_Viewers ~ Title_Length + np.power(Title_Length, 2) + Fantasy + Literature + Romance + Sci_fi + C(Year) + Parts + Complete + No_update + tlength_fantasy + tlength_literature + tlength_romance + tlength_scifi + tlength_squared_fantasy + tlength_squared_literature + tlength_squared_romance + tlength_squared_scifi',data=df2).fit()

stargazer = Stargazer([model1, model2, model3, model4])
stargazer.rename_covariates({'np.power(Title_Length, 2)': 'Title_length_squared',
                             'tlength_fantasy': 'Title length x Fantasy',
                             'tlength_literature': 'Title length x Literature',
                             'tlength_romance': 'Title length x Romance',
                             'tlength_scifi': 'Title length x Sci-fi',
                             'tlength_squared_fantasy':'Title length Squared x Fantasy',
                             'tlength_squared_literature':'Title length Squared x Literature',
                             'tlength_squared_romance':'Title length Squared x Romance',
                             'tlength_squared_scifi':'Title length Squared x Sci-fi'})
stargazer.covariate_order(['Intercept','Title_Length','np.power(Title_Length, 2)','Fantasy','Literature','Romance','Sci_fi','tlength_fantasy','tlength_literature','tlength_romance','tlength_scifi','tlength_squared_fantasy','tlength_squared_literature','tlength_squared_romance','tlength_squared_scifi','Parts','Complete','No_update'])
HTML(stargazer.render_html())

Dependent variable: Weekly_Viewers
,(1),(2),(3),(4)
Intercept,413.674***,170.729,151.175,252.979
,(8.333),(239.577),(240.277),(245.438)
Title_Length,9.481***,3.939***,6.328***,-4.624
,(0.258),(0.285),(0.810),(5.497)
Title_length_squared,,,-0.031***,0.089
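
With a squared term in the model, the association between title length and the outcome is no longer a single number: the slope at title length $L$ is $\beta_1 + 2\beta_2 L$, and the fitted curve turns at $L = -\beta_1 / (2\beta_2)$. The sketch below applies this to the rounded coefficients from column (3). The function names are hypothetical helpers, and because the inputs are rounded, the results are approximate.

```python
def marginal_effect(beta_linear, beta_squared, title_length):
    """Slope of beta_linear*L + beta_squared*L**2 with respect to L."""
    return beta_linear + 2 * beta_squared * title_length


def turning_point(beta_linear, beta_squared):
    """Title length at which the fitted quadratic changes direction."""
    return -beta_linear / (2 * beta_squared)


# Rounded coefficients from column (3) of the Weekly_Viewers table.
b1, b2 = 6.328, -0.031

# 12 is the median title length in the full dataset (summary statistics above).
effect_at_median = marginal_effect(b1, b2, 12)
peak = turning_point(b1, b2)

print(round(effect_at_median, 3), round(peak, 1))
```

Taken at face value, these coefficients put the turning point at roughly 102 characters, just beyond the 100-character maximum, which would mean the fitted relationship rises over the whole observed range but with a shrinking slope.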



### 2. Average Score

In [10]:
model5 = rlm('Average_Score ~ Title_Length',data=df1).fit()
model6 = rlm('Average_Score ~ Title_Length + Fantasy + Literature + Romance + Sci_fi + C(Year) + Parts + Complete + No_update',data=df1).fit()
model7 = rlm('Average_Score ~ Title_Length + np.power(Title_Length, 2) + Fantasy + Literature + Romance + Sci_fi + C(Year) + Parts + Complete + No_update',data=df1).fit()
model8 = rlm('Average_Score ~ Title_Length + np.power(Title_Length, 2) + Fantasy + Literature + Romance + Sci_fi + C(Year) + Parts + Complete + No_update + tlength_fantasy + tlength_literature + tlength_romance + tlength_scifi + tlength_squared_fantasy + tlength_squared_literature + tlength_squared_romance + tlength_squared_scifi',data=df1).fit()

stargazer = Stargazer([model5, model6, model7, model8])
stargazer.rename_covariates({'np.power(Title_Length, 2)': 'Title_length_squared',
                             'tlength_fantasy': 'Title length x Fantasy',
                             'tlength_literature': 'Title length x Literature',
                             'tlength_romance': 'Title length x Romance',
                             'tlength_scifi': 'Title length x Sci-fi',
                             'tlength_squared_fantasy':'Title length Squared x Fantasy',
                             'tlength_squared_literature':'Title length Squared x Literature',
                             'tlength_squared_romance':'Title length Squared x Romance',
                             'tlength_squared_scifi':'Title length Squared x Sci-fi'})
stargazer.covariate_order(['Intercept','Title_Length','np.power(Title_Length, 2)','Fantasy','Literature','Romance','Sci_fi','tlength_fantasy','tlength_literature','tlength_romance','tlength_scifi','tlength_squared_fantasy','tlength_squared_literature','tlength_squared_romance','tlength_squared_scifi','Parts','Complete','No_update'])
stargazer.significant_digits(5)
HTML(stargazer.render_html())

Dependent variable: Average_Score
,(1),(2),(3),(4)
Intercept,8.74603***,8.43907***,8.46229***,8.49075***
,(0.00556),(0.02518),(0.02562),(0.02852)
Title_Length,0.00129***,-0.00044,-0.00346***,-0.00688***
,(0.00023),(0.00028),(0.00068),(0.00182)
Title_length_squared,,,0.00004***,0.00008**



### 3. Number of Bookmarks

In [11]:
model9 = rlm('Bookmarks ~ Title_Length',data=df1).fit()
model10 = rlm('Bookmarks ~ Title_Length + Fantasy + Literature + Romance + Sci_fi + C(Year) + Parts + Complete + No_update',data=df1).fit()
model11 = rlm('Bookmarks ~ Title_Length + np.power(Title_Length, 2) + Fantasy + Literature + Romance + Sci_fi + C(Year) + Parts + Complete + No_update',data=df1).fit()
model12 = rlm('Bookmarks ~ Title_Length + np.power(Title_Length, 2) + Fantasy + Literature + Romance + Sci_fi + C(Year) + Parts + Complete + No_update + tlength_fantasy + tlength_literature + tlength_romance + tlength_scifi + tlength_squared_fantasy + tlength_squared_literature + tlength_squared_romance + tlength_squared_scifi',data=df1).fit()

stargazer = Stargazer([model9, model10, model11, model12])
stargazer.rename_covariates({'np.power(Title_Length, 2)': 'Title_length_squared',
                             'tlength_fantasy': 'Title length x Fantasy',
                             'tlength_literature': 'Title length x Literature',
                             'tlength_romance': 'Title length x Romance',
                             'tlength_scifi': 'Title length x Sci-fi',
                             'tlength_squared_fantasy':'Title length Squared x Fantasy',
                             'tlength_squared_literature':'Title length Squared x Literature',
                             'tlength_squared_romance':'Title length Squared x Romance',
                             'tlength_squared_scifi':'Title length Squared x Sci-fi'})
stargazer.covariate_order(['Intercept','Title_Length','np.power(Title_Length, 2)','Fantasy','Literature','Romance','Sci_fi','tlength_fantasy','tlength_literature','tlength_romance','tlength_scifi','tlength_squared_fantasy','tlength_squared_literature','tlength_squared_romance','tlength_squared_scifi','Parts','Complete','No_update'])
HTML(stargazer.render_html())

Dependent variable: Bookmarks
,(1),(2),(3),(4)
Intercept,67.915***,-315.947***,-330.036***,-262.895***
,(3.601),(11.966),(12.305),(13.127)
Title_Length,20.704***,7.558***,9.725***,2.544***
,(0.162),(0.174),(0.410),(0.912)
Title_length_squared,,,-0.036***,-0.027




### Summary

In [12]:
stargazer = Stargazer([model4, model8, model12])
stargazer.rename_covariates({'np.power(Title_Length, 2)': 'Title_length_squared',
                             'tlength_fantasy': 'Title length x Fantasy',
                             'tlength_literature': 'Title length x Literature',
                             'tlength_romance': 'Title length x Romance',
                             'tlength_scifi': 'Title length x Sci-fi',
                             'tlength_squared_fantasy':'Title length Squared x Fantasy',
                             'tlength_squared_literature':'Title length Squared x Literature',
                             'tlength_squared_romance':'Title length Squared x Romance',
                             'tlength_squared_scifi':'Title length Squared x Sci-fi'})
stargazer.covariate_order(['Intercept','Title_Length','np.power(Title_Length, 2)','Fantasy','Literature','Romance','Sci_fi','tlength_fantasy','tlength_literature','tlength_romance','tlength_scifi','tlength_squared_fantasy','tlength_squared_literature','tlength_squared_romance','tlength_squared_scifi','Parts','Complete','No_update'])
stargazer.custom_columns(['Weekly Viewers', 'Average Score', 'Bookmarks'], [1, 1, 1])
HTML(stargazer.render_html())

,Weekly Viewers,Average Score,Bookmarks
,(1),(2),(3)
Intercept,252.979,8.491***,-262.895***
,(245.438),(0.029),(13.127)
Title_Length,-4.624,-0.007***,2.544***
,(5.497),(0.002),(0.912)
Title_length_squared,0.089,0.000**,-0.027
