$R^2$ goes up when the model takes both `PurchasePower` and `QualityLife` into account.

In [None]:
sns.pairplot(df_quality_living, hue='City')

### QualityLife * PurchasePower Interaction Effects
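In a formula string, `QualityLife * PurchasePower` expands to both main effects plus their product term (`QualityLife + PurchasePower + QualityLife:PurchasePower`). As a rough sketch of what that means, the snippet below builds the same four design-matrix columns by hand with NumPy. The data and the "true" coefficients are synthetic and purely illustrative; they are not taken from the MoveHub dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
quality_life = rng.uniform(0, 100, n)    # synthetic stand-in for QualityLife
purchase_power = rng.uniform(0, 100, n)  # synthetic stand-in for PurchasePower

# Hypothetical coefficients: intercept, two main effects, interaction
true_beta = np.array([50.0, 0.2, 0.3, 0.01])


def design_matrix(q, p):
    """Columns that 'y ~ q * p' expands to: intercept, q, p, and q*p."""
    return np.column_stack([np.ones_like(q), q, p, q * p])


X = design_matrix(quality_life, purchase_power)
rating = X @ true_beta  # noise-free, so least squares recovers true_beta

beta_hat, *_ = np.linalg.lstsq(X, rating, rcond=None)
print(beta_hat)
```

With an interaction term, the effect of one variable on the response depends on the level of the other, so the main-effect coefficients can no longer be read in isolation.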

In [None]:
model = smf.ols(formula = 'MovehubRating ~ QualityLife * PurchasePower', data=df_quality_living).fit()
model.summary()

In [None]:
model.params

In [None]:
model.tvalues

In [None]:
model.rsquared

74% of the variability of `MovehubRating` is captured by the linear model.



In [None]:
print('QualityLife:')
print("\t- coefficient =", model.params.QualityLife)
print("\t- std error =", model.bse.QualityLife)
print("\t- t-value =", model.tvalues.QualityLife)
print("\t- p-value =", model.pvalues.QualityLife)

confidence_interval = model.conf_int().loc['QualityLife']

print("\t- 95% confidence interval = [{}, {}]".format(confidence_interval[0], confidence_interval[1]))

In [None]:
print('PurchasePower:')
print("\t- coefficient =", model.params.PurchasePower)
print("\t- std error =", model.bse.PurchasePower)
print("\t- t-value =", model.tvalues.PurchasePower)
print("\t- p-value =", model.pvalues.PurchasePower)

confidence_interval = model.conf_int().loc['PurchasePower']

print("\t- 95% confidence interval = [{}, {}]".format(confidence_interval[0], confidence_interval[1]))

In [None]:
df_quality_living.plot(kind='scatter', x='QualityLife', y='MovehubRating')

In [None]:
# Rename column names to not have spaces
df_cost_living.columns = ['City', 'Coffee', 'Cinema', 'Wine', 'Gas', 'AvgRent', 'AvgDispIncome']

In [None]:
df_cost_living[['City','AvgRent']]

In [None]:
# Sort by highest rent

# df_cost_living[df_cost_living.AvgRent >= 1000].sort(['Av'])
df_cost_living.sort_values(['AvgRent'], ascending=False)

### Graph pipelining for quality of living

In [None]:
plt.figure(figsize=(20,10))

df_quality_living = pd.read_csv('datasets/movehubqualityoflife.csv')

cost = pd.melt(df_quality_living, "City", var_name="Attributes")

swarm_plot = sns.swarmplot(x="Attributes", y="value", hue="City", data=cost)
box = swarm_plot.get_position()
swarm_plot.set_position([box.x0 - 0.09, box.y0, box.width * 0.8, box.height])
plt.legend(bbox_to_anchor=(1.05, 1.08), loc=2, borderaxespad=0., ncol=5)

plt.show()

In [None]:
df_quality_living.columns = ['City', 'MovehubRating', 'PurchasePower', 'HealthCare', 'Pollution', 'QualityLife', 'CrimeRating']

In [None]:
#df_quality_living

### What dependent variable am I trying to predict?

### What does the MoveHub rating mean?
Can we use it to build an initial predictive model for smart cities? From MoveHub:

> Our data comes from a number of different sources and is always improving. We combine data from www.numbeo.com, data from the CIA World Factbook, Census data from several governments, data from the WHO and our own vast database of real international moves to come up with cost of living figures, crime rates, quality of life, pollution, purchasing power and our overall MoveHub rating (a balance of all of the scores)



### Correlations
Let's try to answer the following questions, with the goal of interpretation:
- Which two variables seem to affect `MovehubRating` the most?
- Can we use these two variables simultaneously? Why or why not?

In [None]:
# Try out some correlations ('City' is a text column, so it is left out)
df_quality_living[['MovehubRating', 'PurchasePower', 'HealthCare', 'Pollution', 'QualityLife', 'CrimeRating']].corr()

In [None]:
# Correlation matrix over the numeric columns only ('City' is text)
corr = df_quality_living.drop(columns='City').corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)


### Correlations: PurchasePower and QualityLife
`PurchasePower` and `QualityLife` are the most highly correlated with `MovehubRating`, which makes sense, but it's as if `Pollution` and `CrimeRating` barely matter.

We cannot use these two variables simultaneously because they are highly correlated with each other (0.845 correlation), which makes their individual coefficients hard to interpret.

They're more correlated with each other than either is with `MovehubRating`.
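One common way to quantify how much this collinearity hurts is the variance inflation factor (VIF). This is a small worked calculation rather than part of the original analysis: when a model has exactly two predictors, the VIF of either one is $1 / (1 - r^2)$, where $r$ is their correlation. The helper name below is made up for illustration.

```python
import math


def vif_two_predictors(r):
    """VIF of either predictor in a two-predictor model with correlation r."""
    return 1.0 / (1.0 - r ** 2)


r = 0.845  # correlation between PurchasePower and QualityLife reported above
vif = vif_two_predictors(r)

# The standard error of each coefficient is inflated by sqrt(VIF)
# relative to the case of uncorrelated predictors.
se_inflation = math.sqrt(vif)

print(round(vif, 2), round(se_inflation, 2))
```

A VIF of roughly 3.5 means each coefficient's standard error is nearly twice what it would be if the two predictors were uncorrelated, which matters when the goal is interpretation.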

In [None]:
sns.lmplot(x='QualityLife', y='MovehubRating', data=df_quality_living)

### Residuals

In [None]:
sm.qqplot(model.resid, line='s')


In [None]:
df_quality_living.columns

## Logistic Regression

Logistic regression can't be applied as is, because the response vector (`MovehubRating`) is continuous. I could binarize it with a threshold instead, e.g. mark cities with a rating > 100 as 1 (a high-score city) and all others as 0. Then I could define the feature matrix and response vector as usual.
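A minimal sketch of that thresholding step, using NumPy only. The rows below are made-up numbers, not values from the dataset, and the cutoff of 100 is just the example limit mentioned above.

```python
import numpy as np

THRESHOLD = 100  # example cutoff from the text; an arbitrary choice

# Made-up rows for illustration: [PurchasePower, QualityLife]
X = np.array([
    [30.0, 40.0],
    [45.0, 60.0],
    [70.0, 85.0],
    [90.0, 95.0],
])
# Made-up MovehubRating values for the same four rows
ratings = np.array([65.2, 85.7, 101.3, 110.0])

# Binary response: 1 = high-score city, 0 = everything else
y = (ratings > THRESHOLD).astype(int)

print(X.shape, y)
```

`X` and `y` now have the shapes a classifier expects, so they could be passed to something like scikit-learn's `LogisticRegression().fit(X, y)`. Note that the result depends heavily on where the threshold is placed.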

### Notes on the model above
* **Dependent variable:** MovehubRating
* **Independent variable:** PurchasePower
* **Association:** the reported coefficient is 26%?
* **Is this relationship statistically significant?:** Yes. It has a t-value well over 2, a p-value < .025, and a confidence interval that doesn't cross 0.
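The three significance checks in the last bullet are closely related, which a small worked example can make concrete. The coefficient and standard error below are hypothetical, and the interval uses the normal approximation (1.96) rather than the exact t-distribution quantile that `model.conf_int()` uses.

```python
def t_value(coef, std_err):
    """t-statistic for the null hypothesis that the coefficient is 0."""
    return coef / std_err


def approx_conf_int(coef, std_err, z=1.96):
    """Approximate 95% confidence interval (normal approximation)."""
    half_width = z * std_err
    return coef - half_width, coef + half_width


coef, std_err = 0.5, 0.1  # hypothetical values, not from the fitted model

t = t_value(coef, std_err)
low, high = approx_conf_int(coef, std_err)

# A |t| well above 2 and an interval that excludes 0 point the same way.
crosses_zero = low <= 0 <= high
print(t, (round(low, 3), round(high, 3)), crosses_zero)
```

Because the interval is centered on the coefficient with a half-width of about two standard errors, a t-value above roughly 2 and an interval that excludes 0 are two views of the same evidence.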

In [None]:
model.resid.plot(kind='hist', bins=250)

In [None]:
sm.qqplot(model.resid, line='s')
pass

In [None]:
sm.graphics.plot_regress_exog(model, 'PurchasePower', fig = plt.figure(figsize = (12, 8)))

pass

## Linear Regression: QualityLife
`MovehubRating ~ QualityLife`


In [None]:
model = smf.ols(formula = 'MovehubRating ~ QualityLife', data=df_quality_living).fit()
model.summary()

In [None]:
sm.graphics.plot_regress_exog(model, 'QualityLife', fig = plt.figure(figsize = (12, 8)))

pass

$R^2$ for `QualityLife` model: 0.55 <br />
55% of the variability of `MovehubRating` is captured by the linear model <br />
$R^2$ for `PurchasePower` model: 0.68 <br />
68% of the variability of `MovehubRating` is captured by the linear model <br />
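To make the "variability captured" reading concrete, $R^2$ can be computed by hand as $1 - SS_{res} / SS_{tot}$. The sketch below does this with NumPy on a tiny made-up dataset (not the MoveHub data); the helper `r_squared` is a name introduced here for illustration.

```python
import numpy as np


def r_squared(y, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y - y_pred) ** 2)    # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Made-up, nearly linear data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Fit a straight line by least squares and score it
slope, intercept = np.polyfit(x, y, 1)
r2 = r_squared(y, slope * x + intercept)
print(r2)
```

Predicting the mean of `y` for every row gives an $R^2$ of 0, and a perfect fit gives 1, so 0.55 and 0.68 sit between those two extremes.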
        

In [None]:
X = df_quality_living[['PurchasePower', 'QualityLife']]
y = df_quality_living.MovehubRating

model = linear_model.LinearRegression().fit(X,y)

print(model.intercept_)
print(model.coef_)

In [None]:
model.score(X, y)