# Table of Contents
1. [Using statsmodels](#Using-statsmodels)

In [1]:
import pandas as pd

In [2]:
import seaborn as sns

In [3]:
tips = sns.load_dataset('tips')

In [4]:
tips.head()

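| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |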


In [5]:
from sklearn import linear_model

In [6]:
lr = linear_model.LinearRegression()

In [7]:
lr.fit(X=tips[['total_bill', 'size']], y=tips['tip'])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [8]:
lr.coef_

array([0.09271334, 0.19259779])

In [9]:
lr.intercept_

0.6689447408125027

In [10]:
lr = linear_model.LinearRegression()
lr.fit(X=tips[['total_bill', 'size', 'time']], y=tips['tip'])

ValueError: could not convert string to float: 'Dinner'

In [11]:
tips_dummy = pd.get_dummies(tips, drop_first=True)

In [12]:
tips_dummy.columns

Index(['total_bill', 'tip', 'size', 'sex_Female', 'smoker_No', 'day_Fri',
       'day_Sat', 'day_Sun', 'time_Dinner'],
      dtype='object')
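
The effect of `drop_first=True` is easier to see on a tiny frame. The sketch below uses a few made-up rows (the `toy` frame is hypothetical, not the tips data) and compares the columns produced with and without the option. Dropping one level per categorical column avoids dummy columns that sum to the intercept column.

```python
import pandas as pd

# Made-up rows for illustration only (not the real tips data).
toy = pd.DataFrame({
    "size": [2, 3, 4, 2],
    "time": ["Dinner", "Lunch", "Dinner", "Lunch"],
})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category dropped

print(list(full.columns))
print(list(reduced.columns))
```

The dropped level becomes the reference category: each remaining dummy coefficient is read as a difference from that level.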

In [13]:
lr = linear_model.LinearRegression()
lr.fit(X=tips_dummy[['total_bill', 'size', 'sex_Female', 'smoker_No', 'day_Fri',
       'day_Sat', 'day_Sun', 'time_Dinner']], y=tips_dummy['tip'])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [14]:
lr.coef_

array([ 0.09448701,  0.175992  ,  0.03244094,  0.08640832,  0.1622592 ,
        0.04080082,  0.13677854, -0.0681286 ])
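
The bare array from `lr.coef_` carries no labels, so matching each number to its column has to be done by position. One way to make this less error-prone, sketched here on made-up data (the `toy` frame and the names `features`, `toy_lr`, and `coef_table` are assumptions for illustration), is to wrap the coefficients in a `pandas` Series indexed by the feature names.

```python
import pandas as pd
from sklearn import linear_model

# Made-up data (not the tips dataset): y = 1 + 2*a + 3*b exactly.
toy = pd.DataFrame({
    "a": [0.0, 1.0, 2.0, 3.0, 4.0],
    "b": [1.0, 0.0, 2.0, 1.0, 3.0],
})
toy["y"] = 1 + 2 * toy["a"] + 3 * toy["b"]

features = ["a", "b"]
toy_lr = linear_model.LinearRegression()
toy_lr.fit(X=toy[features], y=toy["y"])

# coef_ is ordered like the columns passed to fit, so the names line up.
coef_table = pd.Series(toy_lr.coef_, index=features, name="coef")
print(coef_table)
print("intercept:", toy_lr.intercept_)
```

The same pattern should apply to the fitted `lr` above by passing the list of column names used in `fit` as the index.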

# Using statsmodels

One benefit of `statsmodels` is that you can specify the model with a formula string, such as `'tip ~ total_bill + sex'`.
The formula interface also creates dummy variables for categorical columns automatically.

You also get a formatted statistical summary (standard errors, t-statistics, p-values, and confidence intervals) instead of pulling out the coefficients manually.

One drawback is that `scikit-learn` offers a wider range of model types.
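
As a minimal sketch of the automatic dummy coding, the example below fits a formula on made-up data (the `toy` frame and its `x`, `group`, and `y` columns are hypothetical). The string column `group` is passed as-is, and the fitted parameters include a term for the non-reference level.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: y = 1 + 2*x, plus 5 when group is "b".
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "group": ["a", "b", "a", "b", "a", "b"],
    "y": [3.0, 10.0, 7.0, 14.0, 11.0, 18.0],
})

# The string column goes straight into the formula; no get_dummies call.
toy_results = smf.ols("y ~ x + group", data=toy).fit()

# params is a labeled Series, so each coefficient is looked up by name.
print(toy_results.params)
```

The level `a` acts as the reference here, so the `group[T.b]` coefficient is the estimated shift for `b` relative to `a`.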

In [15]:
import statsmodels.formula.api as smf

In [16]:
model = smf.ols('tip ~ total_bill', data=tips)
results = model.fit()
results.summary()


| | | | |
|---|---|---|---|
| Dep. Variable: | tip | R-squared: | 0.457 |
| Model: | OLS | Adj. R-squared: | 0.454 |
| Method: | Least Squares | F-statistic: | 203.4 |
| Date: | Fri, 16 Aug 2019 | Prob (F-statistic): | 6.69e-34 |
| Time: | 18:02:31 | Log-Likelihood: | -350.54 |
| No. Observations: | 244 | AIC: | 705.1 |
| Df Residuals: | 242 | BIC: | 712.1 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.9203 | 0.160 | 5.761 | 0.000 | 0.606 | 1.235 |
| total_bill | 0.1050 | 0.007 | 14.260 | 0.000 | 0.091 | 0.120 |

| | | | |
|---|---|---|---|
| Omnibus: | 20.185 | Durbin-Watson: | 2.151 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 37.75 |
| Skew: | 0.443 | Prob(JB): | 6.35e-09 |
| Kurtosis: | 4.711 | Cond. No. | 53.0 |


In [17]:
results = smf.ols('tip ~ total_bill + sex', data=tips).fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | tip | R-squared: | 0.457 |
| Model: | OLS | Adj. R-squared: | 0.452 |
| Method: | Least Squares | F-statistic: | 101.3 |
| Date: | Fri, 16 Aug 2019 | Prob (F-statistic): | 1.18e-32 |
| Time: | 18:02:31 | Log-Likelihood: | -350.52 |
| No. Observations: | 244 | AIC: | 707.0 |
| Df Residuals: | 241 | BIC: | 717.5 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.9067 | 0.175 | 5.182 | 0.000 | 0.562 | 1.251 |
| sex[T.Female] | 0.0266 | 0.138 | 0.192 | 0.848 | -0.246 | 0.299 |
| total_bill | 0.1052 | 0.007 | 14.110 | 0.000 | 0.091 | 0.120 |

| | | | |
|---|---|---|---|
| Omnibus: | 20.499 | Durbin-Watson: | 2.149 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 38.652 |
| Skew: | 0.447 | Prob(JB): | 4.05e-09 |
| Kurtosis: | 4.733 | Cond. No. | 63.0 |
