# Regression functions demo notebook

If you have not already done so, run the following command to install the statsmodels package:

`pip install -U statsmodels`

Run the following commands to install scipy and scikit-learn:

`conda install scipy`

`conda install scikit-learn`

Use the data cleaning package to import a data set:

In [None]:
from data_cleaning_utils import import_data
dat = import_data('../Data/mississippi/pool82014-10-02cleaned.csv')

The following function runs a random model with a random dependent variable y and four random covariates, using both the statsmodels and scikit-learn packages, so that the user can compare the output from the two tools.

In [None]:
from regression import compare_OLS
compare_OLS(dat)

The two models produce the same results. 

sklearn does not produce a standard regression-table output. However, it offers more features for prediction through its machine learning functionality. For that reason, we will likely want to use both packages, for different purposes.

The `user_model` function prompts the user to input a model formula for an OLS regression, then runs the model in `statsmodels`, and outputs the model results and a plot of the y data vs. the model's fitted values.

At the prompt, you may either input your own model formula, or copy and paste the following formula as an example:

`CO2uM ~ TempC + ChlAugL + SPCuScm + NITRATEM`

In [None]:
%matplotlib inline
from regression import user_model
user_model(data=dat)
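For readers who want to see what fitting such a formula might look like without the interactive prompt, here is a minimal sketch using the statsmodels formula interface. The data frame `demo` is synthetic: its column names match the example formula, but the values are random stand-ins, and this is not the implementation of `user_model`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: column names follow the example formula,
# but the values are random and are NOT the Mississippi measurements.
rng = np.random.default_rng(1)
n = 100
demo = pd.DataFrame({
    "TempC": rng.normal(20, 2, n),
    "ChlAugL": rng.normal(5, 1, n),
    "SPCuScm": rng.normal(400, 30, n),
    "NITRATEM": rng.normal(0.1, 0.02, n),
})
demo["CO2uM"] = (10 + 0.8 * demo["TempC"] - 1.2 * demo["ChlAugL"]
                 + rng.normal(0, 0.5, n))

formula = "CO2uM ~ TempC + ChlAugL + SPCuScm + NITRATEM"
fit = smf.ols(formula, data=demo).fit()

print(fit.params)          # intercept plus one coefficient per covariate
fitted = fit.fittedvalues  # fitted values, aligned with demo.index
```

The `fitted` series can then be plotted against `demo["CO2uM"]` to compare the observed and fitted values.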

In [1]:
from __future__ import print_function

import os
import pandas as pd
import numpy as np
import random
import math
import matplotlib.pyplot as plt

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.predstd import wls_prediction_std

from sklearn import linear_model
from sklearn.linear_model import LinearRegression

import seaborn as sns; sns.set(style="ticks", color_codes=True)

%matplotlib inline

from data_cleaning_utils import import_data
data = import_data('../Data/mississippi/pool82014-10-02cleaned.csv')

from regression import correl

Index(['FID', 'time', 'XCO2Dpp', 'XCH4Dpp', 'TempC', 'SPCuScm', 'ChlARFU',
       'ChlAugL', 'BGAPCRF', 'BGAPCuL', 'TurbFNU', 'fDOMRFU', 'fDOMQSU',
       'ODOsat', 'ODOmgL', 'pH', 'Pressps', 'NITRATEU', 'NITRATEM', 'ABS254',
       'ABS350', 'TINT', 'TSPEC', 'TLAMP', 'Temp_ta', 'SPC_tau', 'pH_tau',
       'fDOM_QS', 'ODO_tau', 'ChlA_ta', 'Turb_ta', 'CO2_tau', 'CH4uM',
       'CH4Sat', 'CO2uM', 'CO2Sat', 'CO2uM_t', 'CO2St_t'],
      dtype='object')
datetime column name? time


In [30]:
floatVars = list(data.columns[data.dtypes == 'float64'])
# floatVars.remove('FID')
floatData = data[floatVars]
nParams = len(floatVars)

largeCorrX = []
largeCorrY = []

for i in range(0, nParams):
    for j in range(i+1, nParams):
        corr = correl(floatData.iloc[:,i], floatData.iloc[:,j])
        if 0.2 < abs(corr) < 0.9:
            largeCorrX.append(floatVars[i])
            largeCorrY.append(floatVars[j])
uniqueX = list(np.unique(largeCorrX))

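# Note: pd.DataFrame(largeCorrX, largeCorrY) uses largeCorrY as the index,
# so after reset_index() column 'x' holds the largeCorrY names and
# column 'y' holds the largeCorrX names.
corrs = pd.DataFrame(largeCorrX, largeCorrY).reset_index()
corrs.columns = ['x','y']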
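The pairwise screening above can also be sketched with the correlation matrix that pandas computes directly. This assumes `correl` returns a Pearson correlation, which is not confirmed here. `moderate_pairs` is a hypothetical helper, not part of the `regression` module.

```python
import pandas as pd

def moderate_pairs(frame, low=0.2, high=0.9):
    """Return one row (x, y, r) per column pair with low < |r| < high."""
    corr = frame.corr()  # Pearson correlation by default
    cols = list(corr.columns)
    rows = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            # NaN correlations fail both comparisons and are skipped.
            if low < abs(r) < high:
                rows.append((cols[i], cols[j], r))
    return pd.DataFrame(rows, columns=["x", "y", "r"])

# Small made-up frame for illustration only.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12],   # perfectly correlated with a
    "c": [1, 3, 2, 5, 4, 6],     # moderately correlated with a and b
    "d": [1, -1, 0, 0, -1, 1],   # uncorrelated with a, b and c
})
pairs = moderate_pairs(toy)
print(pairs)
```

In this sketch, column `x` always holds the earlier column of each pair and `y` the later one.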

In [87]:
setlist = []
for i in range(0, len(uniqueX)):
    # names paired with uniqueX[i] in corrs, followed by uniqueX[i] itself
    sets = list(corrs.y[corrs.x == uniqueX[i]])
    sets.append(uniqueX[i])
    setlist.append(sets)
#    sns.pairplot(data[set])

In [117]:
i = 0
while i < len(setlist):
    if len(setlist[i]) == 1:
        setlist.remove(setlist[i])
    else:
        i = i+1

In [132]:
i = 0
while i < len(setlist):
    j = 0
    while j < len(setlist):
        if j == i:
            j = j+1
        else:
            if all(x in setlist[j] for x in setlist[i]):
                setlist.remove(setlist[i])
            j = j+1
    i = i+1
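The loop above removes entries from `setlist` while indexing into it, a pattern that can in general skip elements after a removal. One alternative is to build a new list, as in the minimal sketch below. `drop_subsets` is a hypothetical helper, not part of the notebook's modules, and it may not return the same sets as the loop above.

```python
def drop_subsets(groups):
    """Keep only the groups that are not contained in another group."""
    as_sets = [set(g) for g in groups]
    kept = []
    for i, g in enumerate(as_sets):
        contained = False
        for j, other in enumerate(as_sets):
            if i == j:
                continue
            # Drop strict subsets, and exact duplicates of an earlier group.
            if g < other or (g == other and j < i):
                contained = True
                break
        if not contained:
            kept.append(groups[i])
    return kept

# Made-up groups for illustration only.
example = [['a', 'b'], ['a', 'b', 'c'], ['d'], ['a', 'b', 'c'], ['c', 'd', 'e']]
pruned = drop_subsets(example)
print(pruned)
```

Because the input list is never modified, the result does not depend on the order in which removals happen.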

In [133]:
len(setlist)

14

In [134]:
setlist

[['XCO2Dpp', 'ODOsat', 'ODOmgL', 'ODO_tau', 'CH4uM'],
 ['XCH4Dpp',
  'SPCuScm',
  'ODOsat',
  'ODOmgL',
  'pH',
  'Pressps',
  'SPC_tau',
  'pH_tau',
  'ODO_tau',
  'CH4uM',
  'CH4Sat',
  'CO2Sat'],
 ['XCH4Dpp',
  'SPCuScm',
  'ODOsat',
  'ODOmgL',
  'pH',
  'Pressps',
  'SPC_tau',
  'pH_tau',
  'ODO_tau',
  'CH4uM',
  'CH4Sat',
  'CO2uM'],
 ['XCO2Dpp',
  'XCH4Dpp',
  'TempC',
  'SPCuScm',
  'ChlARFU',
  'ChlAugL',
  'BGAPCRF',
  'BGAPCuL',
  'fDOMRFU',
  'fDOMQSU',
  'pH',
  'Temp_ta',
  'SPC_tau',
  'pH_tau',
  'fDOM_QS',
  'ODO_tau'],
 ['XCO2Dpp',
  'XCH4Dpp',
  'TempC',
  'SPCuScm',
  'ChlARFU',
  'ChlAugL',
  'BGAPCRF',
  'BGAPCuL',
  'fDOMRFU',
  'fDOMQSU',
  'ODOmgL'],
 ['XCO2Dpp',
  'XCH4Dpp',
  'TempC',
  'SPCuScm',
  'ChlARFU',
  'ChlAugL',
  'BGAPCRF',
  'BGAPCuL',
  'fDOMRFU',
  'fDOMQSU',
  'ODOsat'],
 ['XCO2Dpp', 'fDOMRFU', 'fDOMQSU', 'Pressps'],
 ['XCO2Dpp',
  'TempC',
  'fDOMRFU',
  'fDOMQSU',
  'ODOsat',
  'ODOmgL',
  'pH',
  'Temp_ta',
  'SPC_tau'],
 ['SPCuScm', 'fDOM