A Kaggle version of this notebook is available <a href="https://www.kaggle.com/benjamincabalonajr/decent-score-using-simpler-models">here</a>. Running this notebook locally (instead of on Kaggle) is recommended, because the interactive widgets only work in a live session. I've used a Python package (voila) that converts this notebook into a dashboard, which is best for presenting to non-technical users.


### Steps to get the most out of this notebook:

- Install the packages listed in the cell below.
- In your CLI, run `voila dashboard.ipynb`.

In [None]:
# !pip install pandas
# !pip install numpy
# !pip install seaborn
# !pip install matplotlib
# !pip install ipywidgets
# !pip install scipy
# !pip install voila
# !pip install -U altair vega_datasets notebook vega

In [1]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import warnings
import matplotlib.pyplot as plt
import altair as alt
from ipywidgets import interact
from scipy.stats import pearsonr

%matplotlib inline
alt.renderers.enable('notebook')
warnings.filterwarnings('ignore')
sns.set(style='ticks')

In [2]:
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

In [3]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head()

Unnamed: 0,Cost,Ad Responses,Location Participation,Revenue
0,2468.194218,7809,0.98141,19998.644579
1,2479.934984,7974,0.709877,20631.784811
2,2422.831211,7944,0.256007,20396.132898
3,2477.629499,8064,0.746282,20730.528821
4,2467.967981,7903,0.925235,18621.556046


Only Ad Responses shows a slight correlation with Revenue. Intuitively, though, I think Cost should still be an important predictor. The plots are produced manually here because `interact` does not work in the rendered (static) notebook.
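As a rough companion to the plots, the same correlation check can be tabulated for every feature at once. This is a minimal sketch on synthetic data: the `demo` frame, its values, and the `correlation_table` helper are assumptions made for illustration and are not taken from `train.csv`.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Synthetic stand-in for train.csv (all values are made up for illustration)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'Cost': rng.normal(2450, 30, n),
    'Ad Responses': rng.normal(7900, 100, n),
    'Location Participation': rng.uniform(0, 1, n),
})
# By construction, Revenue depends only on Ad Responses in this toy data
demo['Revenue'] = 2.5 * demo['Ad Responses'] + rng.normal(0, 200, n)

def correlation_table(df, target='Revenue'):
    """Return Pearson r and p-value of each feature against the target."""
    rows = []
    for col in df.columns.drop(target):
        r, p = pearsonr(df[col], df[target])
        rows.append({'feature': col, 'pearson_r': r, 'p_value': p})
    return pd.DataFrame(rows).set_index('feature')

corr = correlation_table(demo)
print(corr)
```

On the real data, calling `correlation_table(train)` should give the same numbers that the joint plots annotate, assuming `train` holds only numeric columns.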

In [4]:
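def plot(column):
    # Joint regression plot of one feature against Revenue
    g = sns.jointplot(x=column, y="Revenue", data=train, kind='reg', joint_kws={'line_kws': {'color': 'cyan'}})
    # JointGrid.annotate was removed in seaborn 0.11, so annotate the Pearson r manually
    r, p = pearsonr(train[column], train['Revenue'])
    g.ax_joint.annotate(f'pearsonr = {r:.2f}; p = {p:.2g}', xy=(0.05, 0.95), xycoords='axes fraction')
    plt.show()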

interactive = interact(plot,column=train.columns)

interactive(children=(Dropdown(description='column', options=('Cost', 'Ad Responses', 'Location Participation'…

In [7]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn import metrics

## Using all available features

In [8]:
X = train.drop('Revenue',axis=1)
y = train['Revenue']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [9]:
model = LinearRegression(fit_intercept=False)
model.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None, normalize=False)

In [10]:
cv = np.mean(cross_val_score(model, X_train, y_train, cv=5,scoring='neg_mean_squared_error'))
print ("Model RMSE with 5 cross validation :",np.sqrt(-cv))
y_predict_test = model.predict(X_test)
score_test = np.sqrt(metrics.mean_squared_error(y_test, y_predict_test))
print('Test RMSE',score_test)

Model RMSE with 5 cross validation : 881.3798183523224
Test RMSE 809.2991133586107
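The cross-validated figure above is the square root of the mean fold MSE, which is not quite the same as the mean of the per-fold RMSEs. The sketch below compares the two on synthetic data; `X_demo`, `y_demo`, and `model_demo` are hypothetical names, and the `'neg_root_mean_squared_error'` scorer assumes scikit-learn 0.22 or later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic single-feature regression problem (values are made up)
rng = np.random.default_rng(42)
X_demo = rng.normal(7900, 100, size=(300, 1))
y_demo = 2.5 * X_demo[:, 0] + rng.normal(0, 800, 300)

model_demo = LinearRegression(fit_intercept=False)

# Variant used in this notebook: average the fold MSEs, then take the root
neg_mse = cross_val_score(model_demo, X_demo, y_demo, cv=5,
                          scoring='neg_mean_squared_error')
rmse_of_mean_mse = np.sqrt(-neg_mse.mean())

# Alternative: compute an RMSE per fold, then average those
neg_rmse = cross_val_score(model_demo, X_demo, y_demo, cv=5,
                           scoring='neg_root_mean_squared_error')
mean_of_fold_rmse = -neg_rmse.mean()

print('sqrt(mean fold MSE):', rmse_of_mean_mse)
print('mean fold RMSE     :', mean_of_fold_rmse)
```

The mean of fold RMSEs can never exceed the root of the mean fold MSE, so the two numbers should only be compared with others computed the same way.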


## Removing the Uncorrelated features

In [11]:
X = train[['Ad Responses']]
y = train['Revenue']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
model = LinearRegression(fit_intercept=False)
model.fit(X_train,y_train)
cv = np.mean(cross_val_score(model, X_train, y_train, cv=5,scoring='neg_mean_squared_error'))
print ("Model RMSE with 5 cross validation :",np.sqrt(-cv))
y_predict_test = model.predict(X_test)
score_test = np.sqrt(metrics.mean_squared_error(y_test, y_predict_test))
print('Test RMSE',score_test)

Model RMSE with 5 cross validation : 889.8860619138013
Test RMSE 727.6002668395048


## Using Ad Responses + Cost

In [12]:
X = train[['Ad Responses','Cost']]
y = train['Revenue']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
model = LinearRegression(fit_intercept=False)
model.fit(X_train,y_train)
cv = np.mean(cross_val_score(model, X_train, y_train, cv=5,scoring='neg_mean_squared_error'))
print ("Model RMSE with 5 cross validation :",np.sqrt(-cv))
y_predict_test = model.predict(X_test)
score_test = np.sqrt(metrics.mean_squared_error(y_test, y_predict_test))
print('Test RMSE',score_test)

Model RMSE with 5 cross validation : 892.939120412048
Test RMSE 727.991082426755
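The three experiments above repeat the same split, fit, and scoring steps with different columns. One way to condense them, sketched here on synthetic data, is a small helper that takes a feature list. The `evaluate_features` function and the `demo` frame are hypothetical and only mirror the column names used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

def evaluate_features(df, features, target='Revenue', seed=42):
    """Split, cross-validate, and score a no-intercept linear model."""
    X, y = df[features], df[target]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=seed)
    model = LinearRegression(fit_intercept=False)
    # Mean MSE over 5 folds of the training split, reported as a root
    cv_mse = -cross_val_score(model, X_tr, y_tr, cv=5,
                              scoring='neg_mean_squared_error').mean()
    model.fit(X_tr, y_tr)
    test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    return {'features': ', '.join(features),
            'cv_rmse': np.sqrt(cv_mse),
            'test_rmse': test_rmse}

# Synthetic stand-in for train.csv (values are made up for illustration)
rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame({
    'Cost': rng.normal(2450, 30, n),
    'Ad Responses': rng.normal(7900, 100, n),
    'Location Participation': rng.uniform(0, 1, n),
})
demo['Revenue'] = 2.5 * demo['Ad Responses'] + rng.normal(0, 800, n)

feature_sets = [
    ['Cost', 'Ad Responses', 'Location Participation'],
    ['Ad Responses'],
    ['Ad Responses', 'Cost'],
]
results = pd.DataFrame([evaluate_features(demo, fs) for fs in feature_sets])
print(results)
```

Because every row uses the same `random_state`, the feature sets are scored on the same held-out rows, which keeps the comparison like for like.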


For our final solution: as shown above, the best test RMSE was achieved using only the correlated feature (Ad Responses). We'll retrain the model on the entire training data.

In [13]:
model.fit(X[['Ad Responses']],y)
pred = model.predict(test[['Ad Responses']])

In [14]:
predictions= test[['index']].copy()
predictions['Revenue'] = pred
predictions.to_csv('submission.csv', index=False)

### An even simpler model: The Mean of Training Revenue.

In [15]:
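# Baseline: predict the mean Revenue for every held-out row.
# Note: train['Revenue'] also contains the held-out rows in y_test, so this
# baseline is slightly optimistic; y_train.mean() would avoid that.
prediction = [train['Revenue'].mean()]*len(y_test)
score_test = np.sqrt(metrics.mean_squared_error(y_test, prediction))
print('RMSE:',score_test)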

RMSE: 772.4861146812424
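A mean baseline like this can also be expressed with scikit-learn's `DummyRegressor`, fitted on the training split only so that held-out rows do not influence the mean. The sketch below uses synthetic data; `X_demo`, `y_demo`, and `baseline` are hypothetical names, and its RMSE is not comparable to the figure above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (values are made up); the feature is ignored by the baseline
rng = np.random.default_rng(7)
X_demo = rng.normal(7900, 100, size=(400, 1))
y_demo = rng.normal(20000, 800, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

# strategy='mean' always predicts the mean of the targets seen in fit()
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_tr, y_tr)

baseline_rmse = np.sqrt(mean_squared_error(y_te, baseline.predict(X_te)))
print('Baseline RMSE:', baseline_rmse)
```

Any model worth keeping should beat this number on the same held-out rows.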
