--- 
# Naive Model 
***

The overall goal of this section is to use a simple linear regression to find a naive model for mortality rate using food consumption data. First, we will find a null model that represents the 'average' input and gives a baseline estimate to improve upon. Then we will fit a multiple linear regression to all of the predictors (all livestock and all crop predictors) and find a cross-validated $R^2$ for this naive model.

To summarize, our null model achieved an $R^2$ score of approximately 0 for all three diseases. Our naive model achieved a cross-validated score of **PUT THE R^2 HERE** for diabetes, strongly negative fold scores (roughly $-13$ to $-921$) for cancer, and **PUT THE R^2 HERE** for cardiovascular diseases.

## Null Model

Before fitting the linear regression, we will find a simple null model for global food consumption data. To calculate the null model, we find the average of each predictor column in the DataFrame. This gives us a 'global average' consumption for each predictor. We can then use the null model to establish a baseline $R^2$ to improve upon with our linear regression models.
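To make the baseline concrete, here is a minimal sketch of why a model that always predicts the mean of the response scores an $R^2$ of zero. The `r_squared` helper and the mortality rates are hypothetical and written only for this illustration; they are not part of the project code or data.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical mortality rates, made up purely for illustration.
rates = [120.0, 95.0, 143.0, 110.0, 132.0]

# The null model ignores the predictors and always predicts the mean.
mean_rate = sum(rates) / len(rates)
null_predictions = [mean_rate] * len(rates)

null_r2 = r_squared(rates, null_predictions)
print(null_r2)
```

Because the residual sum of squares of a mean-only prediction equals the total sum of squares, the ratio is 1 and the score is 0. Any model that scores above 0 on held-out data is therefore doing better than this baseline.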

In [20]:
import numpy as np
import matplotlib 
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.cm as cmx
import matplotlib.colors as colors
import pandas as pd
import math
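# sklearn.cross_validation was deprecated in scikit-learn 0.18 and removed in 0.20;
# newer releases provide these names in sklearn.model_selection, where KFold
# takes n_splits instead of (n, n_folds).
from sklearn.cross_validation import cross_val_score, KFold
from sklearn.linear_model import LinearRegression as LinReg
from sklearn.cross_validation import train_test_split as sk_split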
import statsmodels.api as sm

%matplotlib inline

In [9]:
# Code for null model: read in the food consumption predictors
x_df = pd.read_csv('datasets/predictors_filled.csv')

# Read in disease mortality rates
diabetes_df = pd.read_csv('datasets/diabetes_df.csv', index_col=0)
cardio_df = pd.read_csv('datasets/cardio_df.csv', index_col=0)
cancer_df = pd.read_csv('datasets/cancer_df.csv', index_col=0)

### Null Model testing:

As expected, testing the null model on the data gives us an $R^2$ of approximately zero for all three diseases.

#### Cancer: 
The null model for cancer will always predict the mean cancer mortality rate. Testing on cancer, we get an $R^2$ of $3.33 \times 10^{-16}$, which is effectively 0:

In [12]:
# Null Model Cancer
null_model = LinReg()
null_model.fit(x_df, [np.mean(cancer_df['Cancer Mortality Rate'])]*x_df.shape[0])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [13]:
# Test Cancer.
null_model.score(x_df, cancer_df)

3.3306690738754696e-16

#### Diabetes
Testing on diabetes, we also get an $R^2$ of 0.

In [14]:
# Fit Diabetes Null Model
null_model = LinReg()
null_model.fit(x_df, [np.mean(diabetes_df['Diabetes Mortality Rate'])]*x_df.shape[0])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [15]:
# Test Diabetes.
null_model.score(x_df, diabetes_df)

0.0

#### Cardiovascular Diseases
Testing on cardiovascular diseases, we again get an $R^2$ of 0.

In [16]:
# Fit Cardiovascular Diseases Null Model
null_model = LinReg()
null_model.fit(x_df, [np.mean(cardio_df['Cardio Mortality Rate'])]*x_df.shape[0])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [18]:
# Test cardiovascular diseases.
null_model.score(x_df, cardio_df)

0.0

# Cancer LinReg

Now, we will fit a simple multiple linear regression to all of the food consumption inputs for each of the diseases. First, for cancer, our regression has an initial $R^2$ on the training set of 0.86, but its three-fold cross-validated $R^2$ scores are strongly negative (roughly $-49.8$, $-920.6$, and $-13.3$), so the fit does not generalize to held-out data.

In [31]:
linreg = LinReg()
linreg.fit(x_df, cancer_df)
linreg.score(x_df, cancer_df)

0.85858267482230188

In [39]:
cross_val_score(LinReg(), x_df, cancer_df, cv=KFold(151, 3), scoring="r2")

array([ -49.79566569, -920.64809763,  -13.32859554])
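A high training $R^2$ paired with strongly negative held-out scores is the usual signature of overfitting, and one plausible explanation here is that the model has many predictors relative to the number of rows. The sketch below reproduces that pattern on synthetic data, using NumPy least squares and a hand-written, unshuffled three-fold split. The data, the helper names (`fit_ols`, `predict_ols`, `kfold_r2`), and the sizes are assumptions chosen for illustration, not the project's dataset or pipeline.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def predict_ols(coef, X):
    """Predictions from coefficients produced by fit_ols."""
    return np.column_stack([np.ones(len(X)), X]) @ coef


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def kfold_r2(X, y, n_folds=3):
    """R^2 on each held-out fold, using contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    scores = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        coef = fit_ols(X[train_mask], y[train_mask])
        scores.append(r_squared(y[test_idx], predict_ols(coef, X[test_idx])))
    return np.array(scores)


# Synthetic data (an assumption for illustration): 90 samples and 30
# predictors that are pure noise, so there is nothing real to learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 30))
y = rng.normal(size=90)

train_r2 = r_squared(y, predict_ols(fit_ols(X, y), X))
cv_r2 = kfold_r2(X, y, n_folds=3)
print("training R^2:", train_r2)
print("held-out R^2 per fold:", cv_r2)
```

Even though the predictors carry no signal, the training score is positive because least squares can always fit some of the noise, while the held-out scores fall below the null model's 0. The same comparison on the real predictors would show whether reducing the number of predictors or regularizing is worth trying.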

To further examine the accuracy of this model, the map below displays the fractional difference between the model estimates and the actual cancer data on a world map. As we can see, the vast majority of countries are colored dark blue/purple, indicating a low fractional difference. Countries colored a brighter purple/pink indicate an overestimate, while countries colored a brighter blue indicate an underestimate.
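The write-up does not spell out how the fractional difference is computed, so the sketch below assumes the common definition (estimate minus actual, divided by actual), which matches the sign convention described above: positive values are overestimates and negative values are underestimates. The country names and rates are hypothetical.

```python
def fractional_difference(estimate, actual):
    """Signed error relative to the actual value (assumed definition)."""
    return (estimate - actual) / actual


# Hypothetical country names and rates, made up for illustration.
actual_rates = {"Country A": 100.0, "Country B": 80.0, "Country C": 125.0}
estimated_rates = {"Country A": 110.0, "Country B": 60.0, "Country C": 125.0}

frac_diff = {
    country: fractional_difference(estimated_rates[country], actual_rates[country])
    for country in actual_rates
}

for country, value in frac_diff.items():
    if value > 0:
        label = "overestimate"
    elif value < 0:
        label = "underestimate"
    else:
        label = "exact"
    print(country, round(value, 3), label)
```

A dictionary keyed by country like `frac_diff` is a convenient shape to hand to a map-plotting routine, where the value drives the color scale.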

In [11]:
# PUT GRAPH HERE

# Diabetes LinReg
For diabetes, our regression has an initial $R^2$ on the training set of 0.83, and a cross-validated $R^2$ of **PUT THE R^2 HERE**.

In [40]:
linreg = LinReg()
linreg.fit(x_df, diabetes_df)
linreg.score(x_df, diabetes_df)

0.83436874953586637

Again, we can examine a world map to see the fractional differences. Dark blue/purple colors indicate an accurate estimate, brighter purples/pinks indicate an overestimate, and brighter blues indicate an underestimate. We tentatively note that this model has slightly worse performance than our cancer model, perhaps indicating that diabetes is less related to dietary intake than **SOME OTHER DISEASE**.

# Cardiovascular Diseases LinReg
For cardiovascular diseases, our regression has an initial $R^2$ on the training set of 0.86, and a cross-validated $R^2$ of **PUT THE R^2 HERE**.

In [41]:
linreg = LinReg()
linreg.fit(x_df, cardio_df)
linreg.score(x_df, cardio_df)

0.85620305109453243

Again, we can examine a world map to see the fractional differences. Dark blue/purple colors indicate an accurate estimate, brighter purples/pinks indicate an overestimate, and brighter blues indicate an underestimate. While this model appears to be fairly accurate for *THESE COUNTRIES*, it could be improved for *THESE*. This might be due to certain predictors, such as *SOME RANDOM PREDICTOR*, that are more heavily weighted for larger countries than for the countries that show a larger fractional difference.