# Prediction Problem

This notebook frames name popularity as a prediction problem: minimize the squared error of one-year-ahead predictions. Data through 2000 is used to train a simple linear predictor that conditions on the most recently observed popularity and a local estimate of its derivative (the latest year-over-year change). Data after 2000 is held out as the test set.
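
Written out, the predictor fitted below with the formula `count ~ count_lag1 + delta_lag1` has the following form. The symbols $p_{n,t}$ (percent share of name $n$ in year $t$) and $\beta_0, \beta_1, \beta_2$ are notation introduced here for illustration and do not appear in the notebook itself.

$$
\hat{p}_{n,t} = \beta_0 + \beta_1\, p_{n,t-1} + \beta_2\,\bigl(p_{n,t-1} - p_{n,t-2}\bigr)
$$

Ordinary least squares picks the coefficients that minimize $\sum_{n,t} (p_{n,t} - \hat{p}_{n,t})^2$ over the training rows. The naive baseline evaluated at the end of the notebook, which predicts last year's value unchanged, corresponds to the special case $\beta_0 = 0$, $\beta_1 = 1$, $\beta_2 = 0$.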

In [6]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_feather('data/names.feather').set_index(['year', 'name', 'gender'])

In [16]:
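# Each row's share of its year's total, in percent
pct_df = 100 * df / df.groupby('year').transform('sum')
# Combine genders so each (year, name) pair has a single popularity value
pct_df = pct_df.groupby(['year', 'name']).sum()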
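
To make the normalization concrete, here is a minimal sketch on made-up data. The frame `toy`, its names, and its counts are hypothetical; only the index layout and the two operations mirror the cell above.

```python
import pandas as pd

# Hypothetical toy counts indexed by (year, name, gender), as in the notebook.
toy = pd.DataFrame(
    {
        "year": [1990, 1990, 1990, 1991, 1991],
        "name": ["Alex", "Alex", "Sam", "Alex", "Sam"],
        "gender": ["F", "M", "M", "M", "M"],
        "count": [10, 30, 60, 50, 50],
    }
).set_index(["year", "name", "gender"])

# Share of each row in its year's total, in percent.
toy_pct = 100 * toy / toy.groupby("year").transform("sum")

# Collapse gender so each (year, name) has one popularity value.
toy_pct = toy_pct.groupby(["year", "name"]).sum()
print(toy_pct)
```

After this step the `count` column holds percentages rather than raw counts, so the values within each year add up to 100.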

In [48]:
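# Popularity one and two years back for each name (relies on the index being ordered by year)
pct_df['count_lag1'] = pct_df.groupby('name')['count'].shift(1)
pct_df['count_lag2'] = pct_df.groupby('name')['count'].shift(2)
# Most recent year-over-year change: the local derivative estimate
pct_df['delta_lag1'] = pct_df['count_lag1'] - pct_df['count_lag2']
pct_df['count_ratio'] = pct_df['count'] / pct_df['count_lag1']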
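
The lag construction can be checked by hand on a small example. The sketch below uses hypothetical values for two made-up names and assumes, as the cell above implicitly does, that every name has one row per consecutive year.

```python
import pandas as pd

# Hypothetical popularity series for two names over four years.
toy = pd.DataFrame(
    {
        "year": [2000, 2001, 2002, 2003] * 2,
        "name": ["Ann"] * 4 + ["Bob"] * 4,
        "count": [1.0, 2.0, 4.0, 7.0, 5.0, 5.0, 3.0, 3.0],
    }
).set_index(["year", "name"]).sort_index()

# shift(k) moves values k rows down within each name's group.
toy["count_lag1"] = toy.groupby("name")["count"].shift(1)
toy["count_lag2"] = toy.groupby("name")["count"].shift(2)
toy["delta_lag1"] = toy["count_lag1"] - toy["count_lag2"]

# The first two years of each name have no complete lag history.
features = toy.dropna()
print(features)
```

If a name is missing from some year, `shift` would silently pair non-adjacent years, so the consecutive-year assumption is worth verifying on real data.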

In [49]:
# Label slices are inclusive: train ends at 2000, test starts at 2001
train = pct_df.dropna().loc[:2000,:]
test = pct_df.dropna().loc[2001:,:]
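
Because `.loc` slices by label and includes both endpoints, the year 2000 itself lands in the training set. A minimal sketch on a hypothetical four-row frame, assuming a sorted `(year, name)` index like the one used above:

```python
import pandas as pd

# Hypothetical frame with a sorted (year, name) MultiIndex.
toy = pd.DataFrame(
    {"count": [1.0, 2.0, 3.0, 4.0]},
    index=pd.MultiIndex.from_tuples(
        [(1999, "Ann"), (2000, "Ann"), (2001, "Ann"), (2002, "Ann")],
        names=["year", "name"],
    ),
)

# Slicing on the first index level: both bounds are inclusive.
toy_train = toy.loc[:2000, :]
toy_test = toy.loc[2001:, :]

print(toy_train.index.get_level_values("year").tolist())
print(toy_test.index.get_level_values("year").tolist())
```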

In [34]:
fit = smf.ols('count ~ count_lag1 + delta_lag1', data=train).fit()

In [45]:
print(fit.summary())

                            OLS Regression Results                            
Dep. Variable:                  count   R-squared:                       0.996
Model:                            OLS   Adj. R-squared:                  0.996
Method:                 Least Squares   F-statistic:                 1.414e+08
Date:                Sat, 03 Nov 2018   Prob (F-statistic):               0.00
Time:                        20:19:18   Log-Likelihood:             4.3578e+06
No. Observations:             1110584   AIC:                        -8.715e+06
Df Residuals:                 1110581   BIC:                        -8.715e+06
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   6.832e-05   4.58e-06     14.912      0.0

In [46]:
# Baseline RMSE on the test set: predict each value with last year's value
np.sqrt(np.mean((test['count'] - test['count_lag1'])**2))

0.0020220767594426189

In [47]:
# Model RMSE on the test set, using the linear predictor fitted on the training data
np.sqrt(np.mean((test['count'] - fit.predict(test))**2))

0.0016642949411842993
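
The two numbers above are root-mean-squared errors in percentage points, first for the naive baseline and then for the fitted model. The sketch below wraps the repeated expression in a helper and computes the relative reduction. The names `rmse` and `relative_reduction` are introduced here for illustration, and the two constants are copied from the outputs above.

```python
import numpy as np

def rmse(actual, predicted):
    """Root-mean-squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# RMSE values copied from the two notebook outputs above.
baseline_rmse = 0.0020220767594426189
model_rmse = 0.0016642949411842993

# Fraction by which the model lowers the baseline's RMSE.
relative_reduction = 1 - model_rmse / baseline_rmse
print(f"{relative_reduction:.1%}")
```

On these figures the linear predictor's test RMSE is roughly 18% lower than that of simply carrying last year's value forward.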