# Basics [statsmodels]
---
- Author: Diego Inácio
- GitHub: [github.com/diegoinacio](https://github.com/diegoinacio)
- Notebook: [basics_Numba.ipynb](https://github.com/diegoinacio/machine-learning-notebooks/blob/master/Tips-and-Tricks/basics_Numba.ipynb)
---
Basic functions and statistical models using [statsmodels](https://www.statsmodels.org/stable/index.html).

In [None]:
import numpy as np
import pandas as pd

In [None]:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (16, 8)

## Installation
---

[Installation](https://www.statsmodels.org/stable/install.html) commands for *conda* and *pip*:

```bash
$ conda install -c conda-forge statsmodels
```

or

```bash
$ pip install statsmodels
```

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

## Datasets
---
*Statsmodels* provides a list of [available datasets](https://www.statsmodels.org/dev/datasets/index.html#available-datasets) that can be loaded as *pandas DataFrames* for testing.

In [None]:
# Load dataset
df1 = sm.datasets.cancer.load_pandas().data

# Plot dataset
df1.plot.scatter(x="population", y="cancer", s=10)
plt.show()

## Examples
---

### Simple linear model
---
A simple linear model fitted by [ordinary least squares](https://en.wikipedia.org/wiki/Ordinary_least_squares), regressing `cancer` on `population`.

In [None]:
model = smf.ols("cancer ~ population", data=df1)
result = model.fit()

result.summary()

In [None]:
x = np.linspace(0, df1.population.max()*1.1, 2)

# Fitted parameters of the line y = b + m*x:
# b is the intercept, m is the slope for population
b, m = result.params
y_hat = b + m*x

# Visualization
df1.plot.scatter(x="population", y="cancer", s=10)
plt.plot(x, y_hat, c="red")
plt.show()

### Time series analysis
---
Weekly atmospheric CO2 samples from the `co2` dataset, smoothed with a centered rolling mean.

In [None]:
# Load dataset
df2 = sm.datasets.co2.load_pandas().data
df2 = df2[df2.index >= "1995"]

# Plot dataset
df2.plot()

# Plot a centered rolling mean over 7 consecutive weekly samples
(
    df2.co2
    .rolling(7, center=True)
    .mean()
    .plot(c="red", label="rolling mean")
)

plt.legend()
plt.show()