Use the *FemPreg* dataset to compare the birth weights of first-born and other babies (Task 1), then fit a linear regression model of the relationship between two variables: the mother's age at pregnancy (*agepreg*) and birth weight (*totalwgt_lb*) (Task 2).

In [None]:
url = 'https://github.com/otu-yz/stats-2020/blob/master/FemPreg2002.csv?raw=true'

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
preg = pd.read_csv(url)

In [None]:
preg.agepreg

In [None]:
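# Rescale agepreg by a factor of 100 (presumably to express age in years,
# which the per-year slope interpretation in Task 2 assumes)
preg.agepreg = preg.agepreg*100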

### Task 1
We test whether the mean birth weight of first-born babies differs from that of babies who are not first-born.

In [None]:
# keep live births only (outcome == 1), then split them by birth order
live = preg[preg.outcome==1]
first = live[live.birthord==1]
other = live[live.birthord!=1]

In [None]:
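# drop missing weights before plotting, and label the axes
plt.hist([first.totalwgt_lb.dropna(), other.totalwgt_lb.dropna()], label=['first', 'other'])
plt.xlabel('birth weight (lb)')
plt.legend()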

In [None]:
mean1 = first.totalwgt_lb.mean()
mean2 = other.totalwgt_lb.mean()
print("mean birth weight of first-born: %f lbs" % mean1)
print("mean birth weight of not first-born: %f lbs" % mean2)

Conduct a hypothesis test to determine whether there is an **actual** (statistically significant) difference between the mean birth weights of first-born and not-first-born babies.

1. calculate the standard error $SE=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$
2. calculate the $t$ test statistic $T=\frac{(\bar{x}_1-\bar{x}_2)-0}{SE}$
3. calculate the two-sided p-value

In [None]:
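# np.std uses ddof=0 here (population standard deviation) and skips missing values
s1 = np.std(first.totalwgt_lb)
s2 = np.std(other.totalwgt_lb)
# len counts every row, including any with a missing birth weight
n1 = len(first)
n2 = len(other)
# standard error of the difference between the two means
SE = np.sqrt(s1**2/n1 + s2**2/n2)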
SE
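
A note on the estimator used above: `np.std` applied to a pandas Series uses `ddof=0` (the population formula), whereas $s_1$ and $s_2$ in the SE formula are usually sample standard deviations (`ddof=1`). Likewise, `len` counts rows whose weight is missing, while `.count()` counts only recorded values. The toy Series below uses made-up numbers, not the FemPreg data, and is only a minimal sketch of the difference; with thousands of rows the effect on SE is expected to be small.

```python
import numpy as np
import pandas as pd

# Made-up weights with one missing value, for illustration only
weights = pd.Series([1.0, 2.0, 3.0, np.nan])

pop_std = np.std(weights)      # ddof=0; pandas skips the NaN
sample_std = weights.std()     # pandas default ddof=1; NaN skipped
n_rows = len(weights)          # counts the NaN row too
n_valid = weights.count()      # counts non-missing values only

print(pop_std, sample_std, n_rows, n_valid)
```

In the notebook, the analogous calls would be `first.totalwgt_lb.std()` and `first.totalwgt_lb.count()`.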

$H_0$: $\mu_1-\mu_2=0$

$H_A$: $\mu_1-\mu_2\neq0$

In [None]:
T = ((mean1 - mean2) - 0) / SE
T

In [None]:
from scipy.stats import t

In [None]:
# conservative degrees of freedom: the smaller sample size minus 1
df = min(n1, n2) - 1
df

In [None]:
# two-sided p-value; -abs(T) keeps the result valid whichever mean is larger
pvalue = 2*t.cdf(-abs(T), df)
pvalue
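
As a cross-check, the manual steps can be compared with `scipy.stats.ttest_ind` using `equal_var=False` (Welch's test). The sketch below runs on synthetic samples, not the FemPreg data; the sample sizes, means and seed are arbitrary assumptions. The test statistic should agree when sample standard deviations are used. The p-values differ slightly because scipy uses the Welch-Satterthwaite degrees of freedom rather than the conservative `min(n1, n2) - 1`.

```python
import numpy as np
from scipy import stats

# Synthetic samples standing in for the two weight columns (not the FemPreg data)
rng = np.random.default_rng(0)
a = rng.normal(loc=7.2, scale=1.4, size=400)
b = rng.normal(loc=7.3, scale=1.4, size=450)

# Manual calculation, following the three steps with sample standard deviations
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
t_manual = (a.mean() - b.mean()) / se
df_conservative = min(len(a), len(b)) - 1
p_manual = 2 * stats.t.sf(abs(t_manual), df_conservative)

# Welch's test in scipy: same statistic, Welch-Satterthwaite degrees of freedom
res = stats.ttest_ind(a, b, equal_var=False)

print(t_manual, p_manual)
print(res.statistic, res.pvalue)
```

On the real data, missing weights would need to be dropped first, for example with `first.totalwgt_lb.dropna()`.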

Because the p-value is very small (p-value = $0.0000233 < \alpha = 0.05$), we reject the null hypothesis and conclude that there is a **statistically significant difference** between the mean birth weights of the two populations of babies.

### Task 2
We do a linear regression analysis of the relationship between mother's age and the birth weight of the baby.

The hypothesis test above showed a statistically significant difference between the mean birth weights of first-born and not-first-born babies. Mothers are older, on average, at not-first births than at first births. If there is a positive linear association between mother's age and birth weight, then older mothers tend to have heavier babies, which could explain some of the difference in mean birth weights.

In [None]:
import statsmodels.formula.api as smf

*agepreg* is the explanatory variable (x) and *totalwgt_lb* is the response variable (y).

In [None]:
formula = 'totalwgt_lb ~ agepreg'
model = smf.ols(formula, data=live)
results = model.fit()

In [None]:
results.summary()

The linear regression line is: $y = 6.8304 + 0.0175x$

In [None]:
intercept = results.params.Intercept
intercept

In [None]:
slope = results.params.agepreg
slope

The slope is $0.0175$, which means that for every one-year increase in mother's age, the linear regression model predicts an increase of 0.0175 lb in expected birth weight.
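
To make the slope concrete, the fitted line can be evaluated at two ages. This is a small sketch using the coefficients reported in the summary above; the ages 25 and 35 and the helper name `predict_weight` are arbitrary choices for illustration.

```python
# Coefficients as reported in the regression summary
intercept = 6.8304
slope = 0.0175

def predict_weight(age_years):
    """Predicted birth weight (lb) for a mother of the given age (years)."""
    return intercept + slope * age_years

w25 = predict_weight(25)   # 6.8304 + 0.0175 * 25
w35 = predict_weight(35)   # 6.8304 + 0.0175 * 35
print(w25, w35, w35 - w25)
```

A ten-year difference in age corresponds to a predicted difference of $10 \times 0.0175 = 0.175$ lb.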

In [None]:
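# R = sqrt(R-squared), positive because the slope is positive;
# take R-squared from the fitted model rather than typing it in by hand
results.rsquared**(1/2)

The correlation coefficient is $R=\sqrt{R^2}$. With $R^2=0.05$ this gives $R=0.224$, which indicates a weak positive linear correlation; if the summary reports a different $R^2$, use the value computed above.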
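
For simple linear regression, $R$ obtained from $R^2$ (with the sign of the slope) should match the correlation coefficient computed directly from the data. The sketch below checks this on synthetic data, not the FemPreg data; the trend, noise level and seed are arbitrary assumptions.

```python
import numpy as np

# Synthetic data: a weak positive linear trend plus noise
rng = np.random.default_rng(1)
x = rng.uniform(15, 45, size=500)
y = 6.8 + 0.02 * x + rng.normal(scale=1.4, size=500)

# Least-squares line, then R-squared = 1 - SS_res / SS_tot
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)
r_squared = 1 - residuals.var() / y.var()

# R recovered from R-squared, taking the sign of the slope
r_from_r2 = np.sign(slope) * np.sqrt(r_squared)

# R computed directly from the data
r_direct = np.corrcoef(x, y)[0, 1]

print(r_from_r2, r_direct)
```

In the notebook, the direct value could be obtained with `live.agepreg.corr(live.totalwgt_lb)`.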

Let's calculate the difference in mean mother's age between first-born and not-first-born babies, and the difference in birth weight that the linear regression model predicts from that age difference.

In [None]:
# birth-weight difference (lb) that the model attributes to the difference in mean mother's age
slope * (other.agepreg.mean() - first.agepreg.mean())
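
One way to read this number is as a share of the observed gap in mean birth weight. The sketch below shows the arithmetic only. The slope is the value reported above, but the age gap and the observed weight gap are hypothetical placeholders, not results from the FemPreg data. The helper name `explained_fraction` is also made up for this example.

```python
def explained_fraction(slope, age_gap_years, observed_gap_lb):
    """Return the weight gap the line attributes to age, and its share of the observed gap."""
    expected_gap_lb = slope * age_gap_years
    return expected_gap_lb, expected_gap_lb / observed_gap_lb

# slope comes from the regression; the two gaps are hypothetical placeholders
expected, fraction = explained_fraction(slope=0.0175, age_gap_years=3.0, observed_gap_lb=0.12)
print(expected, fraction)
```

In the notebook, the real inputs would be `other.agepreg.mean() - first.agepreg.mean()` and `mean2 - mean1`.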

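Calculate the correlation coefficient of two variables with `np.corrcoef`, which returns the 2×2 correlation matrix; the off-diagonal entries are the correlation between x and y.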

In [None]:
x = [1, 3, 4, 5, 7]
y = [5, 9, 7, 1, 13]
np.corrcoef(x, y)
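
The single coefficient can be extracted from the matrix and checked against the definition $r=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum (x_i-\bar{x})^2 \sum (y_i-\bar{y})^2}}$. A minimal sketch using the same x and y as above:

```python
import numpy as np

x = np.array([1, 3, 4, 5, 7])
y = np.array([5, 9, 7, 1, 13])

# Off-diagonal entry of the 2x2 correlation matrix
r_matrix = np.corrcoef(x, y)[0, 1]

# Same value from the definition, using deviations from the means
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

print(r_matrix, r_manual)   # both 0.4, up to floating-point rounding
```

Here the sum of products of deviations is 16 and the two sums of squares are 20 and 80, so $r = 16/\sqrt{1600} = 0.4$.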