# Machine Learning: Regression

## Regression analysis of the number of posts in the heise newsticker

### Linear regression with `scipy`

In [None]:
import pandas as pd
heise_monthly = pd.read_csv("heise-monthly.csv", parse_dates=["month"], index_col="month")

In [None]:
heise_monthly["count"].plot()

In [None]:
from scipy.stats import linregress
lrc = linregress(range(len(heise_monthly)), heise_monthly["count"].values)
lrc
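For intuition, the slope and intercept reported by `linregress` are the ordinary least-squares estimates. The following is only a minimal pure-Python sketch of those two formulas; it uses made-up numbers instead of the heise data, and the helper name `ols_fit` is hypothetical.

```python
def ols_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data lying exactly on y = 2x + 3 (not the heise counts)
xs = list(range(5))
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = ols_fit(xs, ys)
print(slope, intercept)
```

With noisy data the two numbers should agree with `lrc.slope` and `lrc.intercept` up to floating-point error.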

Add the fitted line to the `DataFrame` as a new column:

In [None]:
heise_monthly["predict_count"] = [i*lrc.slope+lrc.intercept for i in range(len(heise_monthly))]

In [None]:
heise_monthly[["count", "predict_count"]].plot()

### Linear regression with `scikit-learn`

In [None]:
from sklearn import linear_model
slrt = linear_model.LinearRegression()
X = [[i] for i in range(len(heise_monthly))]
Y = heise_monthly["count"].values
slrt.fit(X, Y)

In [None]:
heise_monthly["predict_count_sklearn_linear"] = slrt.predict(X)

In [None]:
heise_monthly[["count", "predict_count_sklearn_linear"]].plot()

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
def print_scores(ground_truth, predict):
    print('Mean squared error: %.2f' % mean_squared_error(ground_truth, predict))
    print('Coefficient of determination: %.2f' % r2_score(ground_truth, predict))
    
print_scores(Y, heise_monthly["predict_count_sklearn_linear"])
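To make the two scores less of a black box, here is a rough pure-Python sketch of what they measure. The helper names `mse` and `r2` and the numbers are illustrative assumptions, not part of the notebook; in practice the `sklearn.metrics` functions should be used.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_good = [1.5, 2.0, 3.0, 3.5]
y_bad = [10.0, 10.0, 10.0, 10.0]

print(mse(y_true, y_good), r2(y_true, y_good))
print(r2(y_true, y_bad))  # worse than predicting the mean
```

Note that the coefficient of determination is not bounded below by zero: it becomes negative whenever the prediction is worse than always returning the mean of the ground truth, which can easily happen on held-out data.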

Perform a train/test split: hold out the last 50 months as the test set and fit the model on the remaining months only.

In [None]:
(X_train, X_test) = (X[:-50], X[-50:])
(Y_train, Y_test) = (Y[:-50], Y[-50:])
slrt.fit(X_train, Y_train)

In [None]:
print_scores(Y_test, slrt.predict(X_test))
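Because the data is a time series, the split above is chronological rather than random: the model is fitted on the past and scored on the most recent months. Below is a small sketch of the same idea with a hypothetical helper `chronological_split`. It also guards against the `values[:-0]` pitfall, which yields an empty list when the test size is zero.

```python
def chronological_split(values, n_test):
    """Split a sequence into (train, test), where test is the last n_test items."""
    if not 0 < n_test < len(values):
        raise ValueError("n_test must be between 1 and len(values) - 1")
    # Use a positive split index so that the slicing never degenerates
    split = len(values) - n_test
    return values[:split], values[split:]


# Illustrative sequence standing in for the monthly counts
values = list(range(10))
train, test = chronological_split(values, 3)
print(train)
print(test)
```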

In [None]:
heise_monthly["predict_count_sklearn_linear"] = slrt.predict(X)

In [None]:
heise_monthly["predict_count_sklearn_linear_train"] = list(slrt.predict(X_train)) + [None]*len(X_test)
heise_monthly["predict_count_sklearn_linear_test"] = [None]*len(X_train) + list(slrt.predict(X_test))
heise_monthly[["count", "predict_count_sklearn_linear_train", "predict_count_sklearn_linear_test"]].plot()

In [None]:
from sklearn.tree import DecisionTreeRegressor

dtt = DecisionTreeRegressor(max_depth=4)

dtt.fit(X_train, Y_train)
print_scores(Y_test, dtt.predict(X_test))

In [None]:
heise_monthly["predict_count_sklearn_dt_train"] = list(dtt.predict(X_train)) + [None]*len(X_test)
heise_monthly["predict_count_sklearn_dt_test"] = [None]*len(X_train) + list(dtt.predict(X_test))
heise_monthly[["count", "predict_count_sklearn_dt_train", "predict_count_sklearn_dt_test"]].plot()
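The test-range prediction of the tree is a flat line, and that is expected: a regression tree returns one constant per leaf, so every input beyond the training range falls into the same outermost leaf. The sketch below shows this on synthetic data (a perfectly linear series, not the heise counts). The same limitation applies to ensembles built from such trees.

```python
from sklearn.tree import DecisionTreeRegressor

# Synthetic, perfectly linear training data: y = 2 * x for x in 0..19
X_demo = [[i] for i in range(20)]
y_demo = [2.0 * i for i in range(20)]

tree = DecisionTreeRegressor(max_depth=4, random_state=0)
tree.fit(X_demo, y_demo)

# All of these inputs lie to the right of every split threshold
beyond = tree.predict([[20], [30], [100]])
print(beyond)
```

All three predictions are identical and cannot exceed the largest training target, even though the underlying trend keeps rising.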

In [None]:
from sklearn.ensemble import AdaBoostRegressor

abt = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                        n_estimators=300, random_state=42)

abt.fit(X_train, Y_train)
print_scores(Y_test, abt.predict(X_test))

In [None]:
heise_monthly["predict_count_sklearn_ab_train"] = list(abt.predict(X_train)) + [None]*len(X_test)
heise_monthly["predict_count_sklearn_ab_test"] = [None]*len(X_train) + list(abt.predict(X_test))
heise_monthly[["count", "predict_count_sklearn_ab_train", "predict_count_sklearn_ab_test"]].plot()

In [None]:
from sklearn import ensemble
gbt = ensemble.GradientBoostingRegressor()
gbt.fit(X_train, Y_train)
print_scores(Y_test, gbt.predict(X_test))

In [None]:
heise_monthly["predict_count_sklearn_gb_train"] = list(gbt.predict(X_train)) + [None]*len(X_test)
heise_monthly["predict_count_sklearn_gb_test"] = [None]*len(X_train) + list(gbt.predict(X_test))
heise_monthly[["count", "predict_count_sklearn_gb_train", "predict_count_sklearn_gb_test"]].plot()

## Improve prediction with better software/algorithms

Specialized packages such as [`prophet`](https://facebook.github.io/prophet/) are designed for exactly this kind of time-series forecasting.

In [None]:
from prophet import Prophet

The data has to be prepared slightly differently for this: `Prophet` expects a `DataFrame` with a date column `ds` and a value column `y`:

In [None]:
pa = pd.DataFrame()
pa["ds"] = heise_monthly.index.values
pa["y"] = heise_monthly["count"].values
pa
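As an alternative to filling the columns one by one, the same two-column layout can be produced by resetting the index and renaming. This is only a sketch on a tiny made-up frame; the variable `monthly` stands in for `heise_monthly` and assumes the index is named `month`, as in the CSV loaded earlier.

```python
import pandas as pd

# Tiny stand-in for heise_monthly (values are made up)
index = pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]).rename("month")
monthly = pd.DataFrame({"count": [10, 12, 9]}, index=index)

# Move the index into a column, then rename to the names Prophet expects
pa_demo = monthly["count"].reset_index().rename(columns={"month": "ds", "count": "y"})
print(pa_demo)
```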

In [None]:
m = Prophet()
m.fit(pa)

Create a `DataFrame` for future values. Since the data is monthly, you need a *monthly frequency*:

In [None]:
future = m.make_future_dataframe(periods=20, freq='M')
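Which monthly alias fits depends on whether the timestamps in the history sit at the start or the end of each month, which cannot be told from the notebook alone. The sketch below, on synthetic dates, shows one way to check this with `pd.infer_freq` and how month-start (`"MS"`) dates continue a month-start history. The assumption that the heise index uses month starts is exactly that, an assumption.

```python
import pandas as pd

# Synthetic history with one timestamp at the start of each month
history = pd.date_range("2020-01-01", periods=6, freq="MS")
detected = pd.infer_freq(history)
print(detected)

# Continue the series with the same frequency, dropping the overlapping first date
next_months = pd.date_range(history[-1], periods=4, freq=detected)[1:]
print(next_months)
```

If the detected alias differs from the one passed as `freq`, the future dates will not line up with the historical ones.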

In [None]:
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

In [None]:
m.plot(forecast)

`Prophet` can also break the forecast down into its long-term trend and its seasonality:

In [None]:
m.plot_components(forecast)