## Module 11: Intro to Correlation and Linear Regression 

***

You've learned to explore your dataset features, clean up messy data, and use descriptive statistics to summarize the characteristics of your dataset. 

We are now moving forward with exploring the relationship <b>between</b> variables. Determining how your variables interact is one of the most useful skills you will learn while working with data. 

The first step is understanding the dependency of your variables. 

***

In [14]:
import pandas as pd
import numpy as np
import scipy.stats as stats

## <font color=DODGERBLUE>Independent and Dependent Variables</font>

When you start to look at the relationship between variables, there are two classifications of variables that need to be considered. 

***

### Independent Variables (IV)
An independent variable, also known as a predictor variable, is a variable that is independent of the other variables in your dataset. Consider this variable the "cause". Independent variables are commonly represented with "x".  

### Dependent Variables (DV)
A dependent variable, also known as an outcome variable, is a variable whose value is dependent on the values of the other variables in your dataset. Consider this variable the "effect". We assume that changes in the values of the independent variable(s) will result in changes to the dependent variable. Dependent variables are commonly represented with "y". 

***

### Examples 
The distinction between IV and DV is an important concept to master because several analyses require you to identify which variables are IVs and which are DVs. In this framing, the IV is assumed to cause a change in the DV, and not the other way around.

#### Do flowers grow fastest under fluorescent or natural light?

    - IV: Type of light flowers are grown under
    - DV: Rate of flower growth

#### What is the effect of diet and regular soda on blood sugar levels?

    - IV: Type of soda
    - DV: Blood sugar levels

#### How does cellphone use before bedtime influence sleep?

    - IV: Amount of phone use before bedtime
    - DV: Hours of sleep, quality of sleep, restfulness, etc. 

In [None]:
df = pd.read_csv("EduGradeData.csv")

df.head()

## which variables can be considered independent and dependent?

## <font color=GOLDENROD>Your Turn</font>

    1. Import the "insurance.csv" file; name the dataset 'ins'. Preview the first 5 rows. 
    2. In the space below, determine what type of variable is in each column (qualitative/quantitative). 
    3. What variable(s) could be considered dependent?

# <font color=DODGERBLUE>Introduction to Correlation & Association</font>

***

### Correlation
***
Correlation is a tool that can be used to determine the strength of a relationship between two numeric variables. Correlation describes the direction (+/-) and magnitude (how large) of the <b>linear relationship</b> that exists between two numeric variables. This is the initial check of whether your numeric variables have any kind of meaningful relationship. 

Correlation values range from "-1" to "+1". Positively correlated variables have values above 0 (the closer to "+1", the stronger the relationship), and negatively correlated variables have values below 0 (the closer to "-1", the stronger the relationship). 

#### Positively correlated variables move in the same direction - as one variable increases, the other variable increases in the same direction.

        - Increase in study time >> increases in test score
        - Decrease in sugar consumption >> decrease in blood sugar levels
        
#### Negatively correlated variables move in the opposite direction - as one variable increases, the other variable decreases (or vice versa). 

        - Decrease in daily spending >> increase in total savings
        - Increase in weight >> decrease in mobility
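
As a minimal sketch of what the correlation coefficient measures, the snippet below computes Pearson's r by hand for two small made-up datasets. The numbers and the helper name `pearson_r` are assumptions for illustration only; on a real DataFrame, pandas calculates the same quantity for every pair of numeric columns.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # How x and y move together around their means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Spread of each variable around its own mean
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Made-up data: study time vs. test score, and daily spending vs. savings
study_hours = [1, 2, 3, 4, 5]
test_scores = [55, 61, 70, 74, 85]
daily_spending = [10, 20, 30, 40, 50]
total_savings = [500, 430, 390, 300, 260]

r_positive = pearson_r(study_hours, test_scores)
r_negative = pearson_r(daily_spending, total_savings)
print(round(r_positive, 3), round(r_negative, 3))
```

The first pair moves in the same direction, so its coefficient lands close to +1; the second pair moves in opposite directions, so its coefficient lands close to -1.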

In [None]:
df.head()

In [None]:
## Create a correlation matrix

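## numeric_only=True skips text columns such as gender (pandas 2.0+ raises an error on them otherwise)
df.corr(numeric_only=True)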

### Strength of Relationship between Variables
***
The strength of the relationship between variables is determined by the value of the correlation coefficient. To interpret its value, see which of the following values your correlation coefficient is closest to:

- <b>Exactly –1</b>: A perfect downhill (negative) linear relationship
- <b>–0.70</b>: A strong downhill (negative) linear relationship
- <b>–0.50</b>: A moderate downhill (negative) linear relationship
- <b>–0.30</b>: A weak downhill (negative) linear relationship
- <b>0</b>: No linear relationship
- <b>+0.30</b>: A weak uphill (positive) linear relationship
- <b>+0.50</b>: A moderate uphill (positive) linear relationship
- <b>+0.70</b>: A strong uphill (positive) linear relationship
- <b>Exactly +1</b>: A perfect uphill (positive) linear relationship
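
The "closest reference point" rule from the list above can be sketched as a small helper. The function name `describe_strength` and the short labels are assumptions made for this illustration; they are not part of pandas or any other library.

```python
# Reference points taken from the list above
REFERENCE_POINTS = {
    -1.0: "perfect negative",
    -0.7: "strong negative",
    -0.5: "moderate negative",
    -0.3: "weak negative",
    0.0: "no linear relationship",
    0.3: "weak positive",
    0.5: "moderate positive",
    0.7: "strong positive",
    1.0: "perfect positive",
}

def describe_strength(r):
    """Return the label of the reference point closest to r."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    # "Perfect" is reserved for values of exactly -1 or +1
    candidates = [p for p in REFERENCE_POINTS if abs(p) < 1 or p == r]
    closest = min(candidates, key=lambda p: abs(p - r))
    return REFERENCE_POINTS[closest]

print(describe_strength(0.41))   # closest reference point is +0.50
print(describe_strength(-0.04))  # closest reference point is 0
```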

#### It is important to understand that <u>correlation is not the same as causation</u>. Identifying a correlation between two factors does not automatically mean one factor causes another factor to occur.

***

## <font color=GOLDENROD>Your Turn</font>

    1. Create a correlation matrix with the insurance dataset. 
    2. What can you say about the relationship between the independent variables and the dependent variable?

### Qualitative Variables and Association
***
The correlation matrix is great for examining the relationship between two numeric variables. However, this method does not work when you want to assess the relationship between qualitative variables, or between one qualitative and one numeric variable. 

For these situations, we are interested in determining the differences between groups. These differences include variations in counts or means. If we identify notable differences between groups (for example, the average test score for male students is 89, and the average test score for female students is 92), this is evidence that there may be some underlying relationship that can be further explored. 

In [None]:
## Two qualitative variables

pd.crosstab(df["gender"], df["level_of_fit"])

In [None]:
## One qualitative variable, one numeric

df["grade"].groupby(df["gender"]).mean()

## <font color=GOLDENROD>Your Turn</font>

    1. Using the insurance dataset, determine if there is any association between the categorical independent variables and the dependent variable. 

### Variance and your Dependent Variable
***
Understanding which variables have a relationship with your dependent variable is the first step. The next step is to better understand what is influencing that relationship, or the variance seen in your dependent variable. <B>Variance</B> describes the variation in values for a specific variable. 

We know that some people are taller than others - not everyone is exactly the same height. If we were to ask 100 people their specific height, we would not receive 100 identical answers; instead, we would likely receive different values (with some overlap). This variation in values is variance. When variance is observed, the next step is to determine what factors are behind it - in other words, what factors are responsible for the difference in height between the 100 people? Gender is likely an important consideration, as males tend to be taller on average. But is that the only reason some people are taller or shorter than others? Maybe ethnicity is influencing this variance, as average height differs between some ethnic groups. Genetics might also contribute, as taller parents typically have taller children.

Gender, ethnicity, genetics, etc - are all factors that may contribute to the variation in height. In this example, these factors are independent variables and the height of the individual is the dependent variable. The key question is: <b>how do changes in the independent variables explain the variance in the dependent variable?</b> An additional question is: <b>how much of the variation in the dependent variable can be attributed to each independent variable?</b>

Once we understand the factors that are influencing the dependent variable, we can purposefully manipulate the values of the independent variables to predict how that will change the dependent variable. This is where regression comes in handy! 
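
To make "how much of the variation can be attributed to a factor" concrete, here is a rough sketch that splits the total variation in height into the part that sits between two groups and the part left over within them. The heights and group names are made up for illustration, and this simple between-group share is only a stand-in for what a regression model reports more formally.

```python
# Made-up heights in inches for two groups (illustrative only)
heights = {
    "group_a": [64, 65, 66],
    "group_b": [69, 70, 71],
}

all_values = [h for group in heights.values() for h in group]
grand_mean = sum(all_values) / len(all_values)

# Total variation: squared distance of every person from the overall mean
ss_total = sum((h - grand_mean) ** 2 for h in all_values)

# Between-group variation: squared distance of each group mean from the
# overall mean, counted once per person in that group
ss_between = sum(
    len(group) * (sum(group) / len(group) - grand_mean) ** 2
    for group in heights.values()
)

share_explained = ss_between / ss_total
print(f"{share_explained:.1%} of the variation is between groups")
```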

In [None]:
df.head(10)

# <font color=DODGERBLUE>Introduction to Linear Regression</font>
***

<b>Linear Regression</b> is a powerful statistical tool that allows for a closer examination of the linear relationship between <b>a continuous dependent variable</b> and various independent variables. 

Linear regression allows you to:
1. Determine if the relationship between the dependent and independent variables is statistically significant (i.e. unlikely to be due to chance alone).
2. Identify how much of the variation in the dependent variable is explained by the selection of independent variables. 
3. Determine the direction and magnitude of the relationships between variables, and 
4. Predict what the value of the dependent variable would be given specific input from the independent variables. 

You can use linear regression to predict the salary of a lawyer (DV) based on the number of years they practiced law (IV). You could also determine just how much of the variation in salary is attributed to the number of years they have practiced law.

***

A simple linear regression models the relationships between a single dependent and independent variable, <b>where the independent variable is predicting the value of the dependent variable</b>. A linear regression model is mathematically represented by the formula of a line:
### \begin{align}  y = mx + b \end{align}
Where “y” is the value of the dependent variable, “m” is the slope (also known as the coefficient), “x” represents the value of the independent variable, and “b” is the y-intercept (also known as the constant), which is the value of “y” when “x” is equal to 0. 

<b>Linear regression models will determine the line-of-best-fit, also known as the regression line, which is the best fitting straight line through your data points.</b> Most commonly, the best fitting line is the line that minimizes the sum of the squared errors, where an error is the vertical distance between a data point and the line. The equation for the regression line is what is used to make predictions for your dependent variable. 

<center><img src='https://s3.amazonaws.com/stackabuse/media/linear-regression-python-scikit-learn-1.png'></center>
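
As a minimal sketch of how a line-of-best-fit is found, the snippet below applies the standard least-squares formulas for the slope and intercept to the lawyer salary example. The salary figures and the helper name `fit_line` are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares slope (m) and intercept (b) for y = m*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (c - mean_y) for a, c in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    m = s_xy / s_xx
    b = mean_y - m * mean_x
    return m, b

# Made-up data: years practicing law and salary in thousands of dollars
years = [1, 3, 5, 7, 9]
salary = [62, 75, 90, 101, 117]

m, b = fit_line(years, salary)
predicted = m * 6 + b  # predicted salary after 6 years of practice
print(f"slope={m:.2f}, intercept={b:.2f}, prediction={predicted:.2f}")
```

The slope is the expected change in salary for each additional year of practice, and the intercept is the predicted salary at zero years.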

***
## Things to Consider when using Linear Regression

***

### Feature Selection

Once we get a sense of the relationship between our variables, we need to make some decisions on which variables to include in our regression analyses. <b>We only want to include the variables that have some kind of relationship with your dependent variable(s) of interest.</b> This allows us to cut back on the empty "noise" in your dataset and focus on the variables that are meaningful. 

### Redundancy
Be careful with including multiple variables that show similar information. For example, if you have a variable "Hours of Study" which represents the total hours that a student studied, and you've also binned that data and created a new variable "Study Level" which groups students into "high, med, low" - you now have two variables in your dataset that show similar information, although one variable is a lot more specific. When you have this situation, you should elect to keep the variable that has the more detailed information (i.e. the specific hours of study). 

### Multiple Groups
Variables with a lot of different levels (e.g. State of Residence) are hard to work with in regression models, because every level gets its own coefficient. If you have a variable that has a large number of different groups, unless this variable is vital to your analyses, you should work to reduce the number of overall groups (e.g. State can become Region), or leave the variable out of the analyses. It is possible to include these complicated variables, but the results are not always easy to interpret.
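
One way to reduce the number of groups is to map each detailed level to a broader one. The sketch below assumes a hypothetical `state` column and a hand-written state-to-region dictionary; neither comes from the datasets used in this module.

```python
import pandas as pd

# Hypothetical data and mapping, for illustration only
students = pd.DataFrame({"state": ["CA", "OR", "NY", "NJ", "TX"]})
state_to_region = {
    "CA": "West", "OR": "West",
    "NY": "Northeast", "NJ": "Northeast",
    "TX": "South",
}

# map() looks up each state in the dictionary; unmapped states become NaN
students["region"] = students["state"].map(state_to_region)
print(students["region"].value_counts())
```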

### Confounding Variables
When we perform regression analyses, we are interested in further understanding the relationship between the dependent and independent variable. Simply put, we are interested in investigating how the independent variable affects the dependent variable. However, it's rarely this simple and there are other factors that need to be understood and controlled. A confounding variable is a third variable that influences both the independent and dependent variable to some degree. It is important to acknowledge this third variable to ensure that the results of your analyses are valid. 

For example, you collect data on sunburns and ice cream consumption. You find that higher ice cream consumption is associated with a higher rate of sunburns. Does this mean that eating ice cream causes sunburns? Absolutely not, there are several other factors that can be attributed to this trend -- but most likely, the confounding variable is temperature. Hot temperatures cause people to eat more ice cream and result in people spending more time outdoors, which can result in more sunburns. Without accounting for the confounding variable(s), you may find relationships between variables that might not actually exist. To control for the potential effects of confounding variables, you simply have to include them in your regression model as another independent variable. 

When you include all of your assumed confounding variables in your regression model, you are controlling for the effects of all of them. If you still find a relationship between a specific independent variable and your dependent variable, you can be more confident that the relationship isn't simply being driven by these other factors (although confounders you did not measure can still play a role).
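
The ice cream example can be simulated to show what "controlling for" a confounder does. In this sketch the data are randomly generated so that temperature drives both ice cream consumption and sunburns, while ice cream has no direct effect. The helper `ols_coefficients` is an assumption for illustration and uses NumPy's least-squares solver rather than statsmodels.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Simulated data: temperature drives both ice cream and sunburns;
# ice cream has no direct effect on sunburns in this setup
temperature = rng.normal(size=n)
ice_cream = temperature + rng.normal(scale=0.5, size=n)
sunburns = temperature + rng.normal(scale=0.5, size=n)

def ols_coefficients(y, *predictors):
    """Least-squares coefficients: intercept first, then one per predictor."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

naive = ols_coefficients(sunburns, ice_cream)
controlled = ols_coefficients(sunburns, ice_cream, temperature)

print("ice cream coef, temperature ignored: ", round(naive[1], 2))
print("ice cream coef, temperature included:", round(controlled[1], 2))
```

With temperature left out, ice cream appears to have a sizable effect on sunburns; once temperature is included, the ice cream coefficient shrinks toward zero.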

In [None]:
## new library alert! ##

## Import the StatsModels library for our regression analyses

import statsmodels.formula.api as sm

### Creating a Linear Regression Model

* <b>result</b> is the name that we are assigning to the fitted regression model
* <b>sm</b> is the shorthand we gave the statsmodels formula library when we imported it
* <b>ols</b> is Ordinary Least Squares, the most common method of calculating the regression line
* the regression equation starts with the dependent variable on the left, followed by "~" and then the independent variables
* independent variables are separated by "+"
* categorical variables are placed in parentheses and annotated with a "C", e.g. C(gender); this is required when the categories are stored as numbers
* data = is where you specify the dataset that you're pulling variables from
* <b>.fit()</b> function uses the data to calculate the best linear regression line
* <b>.summary()</b> function will show the calculated values (slopes and y-intercept) for the linear regression formula, along with the fit statistics

In [None]:
## create the regression model
result = sm.ols('grade ~ hours + age + exercise + C(gender)', data = df).fit()

## print the regression model summary
result.summary()

# <font color=DODGERBLUE>Interpreting Linear Regression Results</font>
***
### Determining Model Fit
***
Linear regression calculates the regression line (equation) that minimizes the total squared distance between the regression line and all the data points.  If our data points fall close to the generated regression line, we consider the model to be a good fit. 

<img src='https://blog.minitab.com/hubfs/Imported_Blog_Media/residual_illustration-1.gif'>

But what does that mean non-graphically? Linear regression may not always be the right technique to use for the specific set of data. The fit of the model describes how well your variables explain the variance in the dependent variable. 

To assess the model fit, we look at the adjusted R-squared (Adj. R-squared). The Adj. R-squared is a statistical measure of how close the data are to the fitted regression line. <B>The Adj. R-squared is the percentage of variation in the dependent variable that can be explained by all the independent variables included in the model, adjusted for the number of independent variables.</B> For example, how much of the variation in student grades (e.g. grades ranging from 32-100) can be explained by hours of study, age, hours of exercise, gender, etc? Values typically range from 0 to 1 (the adjusted value can fall slightly below 0 for a very poor model); the higher the value, the better the fit. 
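
The arithmetic behind R-squared and its adjusted version can be sketched with the standard textbook formulas. The grades, the predicted values, and the function names below are made up for illustration; in practice the model summary reports both numbers for you.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total

def adjusted_r_squared(r2, n_rows, n_predictors):
    """Penalize R-squared for the number of independent variables used."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)

# Made-up grades and the values a hypothetical model predicted for them
actual_grades = [60, 70, 80, 90, 100]
predicted_grades = [62, 69, 81, 88, 100]

r2 = r_squared(actual_grades, predicted_grades)
adj_r2 = adjusted_r_squared(r2, n_rows=5, n_predictors=2)
print(round(r2, 3), round(adj_r2, 3))
```

The adjustment lowers the value a little for every extra independent variable, so adding a variable only raises the adjusted figure if it explains enough additional variation.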

***
### Intercept
***
The y-intercept, or constant, is the predicted value of the dependent variable when all independent variables are equal to 0. For example, when all independent variables are equal to 0, the expected student grade is approximately 58. Don't worry too much about this interpretation. Oftentimes it won't make practical sense (for example, a student whose age equals 0), but it's important to know how to read it. 

***
### Coefficients (coef)
***
These values show how changes in the independent variable influence the dependent variable, and in what direction. 

#### <b> Interpreting Numeric Variable Coefficients </b>
When you are looking at numeric variables (i.e. age), each coefficient represents the numeric change in the dependent variable given a one-unit change in the independent variable. For example, for every one hour increase in study time (hours), grade increases by 1.9 points.

* <b>INTERCEPT</b>: when all other IV's are zero, expected grade is around 58

* <b>AGE</b>: for every one year increase in age, grade increases by 0.04 points (when controlling for hours of study, exercise, and gender)

* <b>EXERCISE</b>: for every one hour increase in exercise, grade increases by approximately 1 point (when controlling for age, hours of study, and gender)

* <b>HOURS</b>: for every one hour increase in study time, grade increases by 1.9 points (when controlling for age, hours of exercise, and gender)

If a variable has a negative coefficient, the independent and dependent variables move in opposite directions. For example, say the coefficient for age was <b>-0.0405</b>; you would interpret it in the following way: 

* <b>AGE</b>: for every one year increase in age, grade <u>decreases</u> by 0.04 points. 

#### <b> Interpreting Categorical Variable Coefficients </b>

When working with categorical variables, the model will automatically take one of the categories and use it as a reference (i.e. comparison category). This reference category will not show up in the listed coefficients, and that is how you will be able to identify which category is serving as the reference. In our model, gender is the only categorical variable, and can be interpreted: 

<b>GENDER</b>:
* Female is the reference category
* On average, Male students have a grade 0.44 points lower than Female students (when controlling for age, hours of study, and exercise) 

The grade is lower because the coefficient is negative in this model. If we have more than 2 categories for our categorical variable, we would still compare each level to the reference. For example, if we had a group of students who chose not to disclose their gender, the 'undisclosed' group would show up in our model with its own coefficient and we would compare those results to the Female category. 
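
To see why the reference category has no coefficient of its own, it can help to look at the 0/1 indicator columns that stand in for a categorical variable. The sketch below uses pandas' `get_dummies` on made-up values; it approximates the kind of coding a formula-based model applies internally and is not a view into the model itself.

```python
import pandas as pd

# Made-up values; the category labels are assumptions for illustration
gender = pd.Series(["Female", "Male", "Undisclosed", "Female", "Male"])

# drop_first=True removes the first category (alphabetically), which then
# acts as the reference that the remaining columns are compared against
dummies = pd.get_dummies(gender, drop_first=True)
print(list(dummies.columns))
```

Only the non-reference categories get a column, so only they get a coefficient; a row with every indicator switched off belongs to the reference group.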

#### Standard Error (std err)

The standard error reflects the level of precision of the coefficient estimates. The lower the value, the more precise the estimate.

***
### Statistical Significance 
***
#### p-value (P>|t|)
When trying to determine if the results we received are statistically significant, we need to consider the p-value. The p-value is the probability of seeing a relationship at least as strong as the one in our data if, in reality, there were no relationship at all (i.e. the result arose solely by chance). 

Because a small p-value means our results would be unlikely under pure chance, we always want to minimize this value. If the p-value was .5 (50%), results like ours would show up about half the time even if the variable had no real effect, which is weak evidence of a relationship. The p-value requires strict interpretation. A commonly used cut-off is 0.05 or below (a 5% chance of seeing results like these when no relationship exists). 
* If the p-value is less than or equal to 0.05, we can deem our results to be statistically significant. 
* If the p-value is greater than 0.05, our results are not statistically significant. 
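
One way to build intuition for the p-value is a shuffle test: scramble one variable many times to break any real link, and count how often chance alone produces a correlation at least as strong as the observed one. This is a sketch on made-up data; the regression summary calculates its p-values differently (with a t-test), so this illustrates the idea rather than reproducing that output.

```python
import math
import random

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sx * sy)

def permutation_p_value(x, y, n_shuffles=2000, seed=0):
    """Share of shuffles whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # +1 in numerator and denominator counts the observed data itself
    return (hits + 1) / (n_shuffles + 1)

# Made-up data with a clear upward trend
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
grade = [58, 63, 62, 68, 70, 69, 75, 78, 77, 83]

p_value = permutation_p_value(hours, grade)
print(p_value < 0.05)
```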

***
### What do I do with non-significant variables?
***
Regression analyses are rarely a "one and done" situation. After you run your analyses, you should tweak your model (equation) and see if you can improve the model fit. A good place to start is by removing non-significant variables from our model. In this example, gender and age are not significant - let's try a model without them!

In [None]:
## Removing non-significant variables 

result2 = sm.ols('grade ~ hours + exercise', data = df).fit()

result2.summary()

# <font color=DODGERBLUE>Comparing Models and Making Predictions</font>

When comparing two models, the first thing you want to check is if there are any changes in the adjusted R-squared. In this example, the Adj. R-squared has not changed at all between the two models. The next item you want to check is the p-value of the remaining variables in the model. Did removing variables increase/decrease the p-values for the remaining variables?

You do not need to continue running your model until all your variables are significant. You should focus on maximizing your Adj. R-squared, regardless of the significance of your variables. 

### What can I do with this information?

Now that you have a better understanding of how your variables interact, you are in a better place to describe your data. You now know that an increase of just one hour of studying is associated, on average, with a significant increase in a student's final grade.  You can also make predictions about future grades...

### Making Predictions based on the Regression Results

Recall that a simple linear regression model is mathematically represented by the formula of a line: <b> y = mx + b</b>. When you are creating a multiple linear regression equation, the formula is similar - with some added features: <b> y = b + (m1 x X1) + (m2 x X2) + (m3 x X3) + ....etc. </b> where each "m x X" term is one of the coefficients in your model multiplied by the value of its variable, and "b" represents the intercept. Once you have your regression output, you can plug these values in to make predictions on your dependent variable given specific values for your independent variables.

    grade(y) = intercept(b) + [hours of study coef(m1) x hours of study(x1)] + [hours of exercise coef(m2) x hours exercise(x2)]

    grade(y) = 58.5316 + [1.9162 x hours of study(x1)] + [0.9892 x hours exercise(x2)]

### What grade can we expect from a student that studied 8 hours and exercised 4.3 hours?

    grade(y) = 58.5316 + [1.9162 x hours of study(8)] + [0.9892 x hours exercise(4.3)]

    grade(y) = 58.5316 + [1.9162 x (8)] + [0.9892 x (4.3)]

    grade(y) = 58.5316 + [15.3296] + [4.25356]

    grade = 78.11476
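
The same arithmetic can be wrapped in a small function so that other scenarios are easy to try. The coefficients are copied from the regression equation above; the function name `predict_grade` is an assumption for illustration.

```python
# Coefficients copied from the regression equation above
INTERCEPT = 58.5316
COEF_HOURS = 1.9162
COEF_EXERCISE = 0.9892

def predict_grade(hours, exercise):
    """Plug values into: grade = b + m1*hours + m2*exercise."""
    return INTERCEPT + COEF_HOURS * hours + COEF_EXERCISE * exercise

print(round(predict_grade(hours=8, exercise=4.3), 5))
```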

## <font color=SALMON>Ta Da! You can now predict the future!</font>

<LEFT><img src='https://i0.wp.com/www.learning-mind.com/wp-content/uploads/2019/10/psychic-spiritual-energy.jpg?resize=768%2C512&ssl=1'></LEFT>

## Making Predictions: the simple way!

What grade can we expect from a 16 year old female student that studied 8 hours and exercised 5.7 hours? 

In [None]:
## We can use the predict function (from the statsmodels library) to predict the outcome given specific input
## call predict on the fitted model you want to use, so that model's coefficients are applied

# model_name.predict({'variable1_name':value1, 'variable2_name':value2, ...})

result.predict({
    'hours': 8, 
    'age': 16, 
    'exercise': 5.7, 
    'gender': "female"})

In [None]:
## What about another scenario?

result.predict({
    'hours': 14, 
    'age': 18, 
    'exercise': 12, 
    'gender': "male"})

# <font color=DODGERBLUE>Module 11 Exercises</font>

#### 1. Complete the code below to import the four libraries we've used most commonly.

In [None]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("datasets/babies.csv")
df.head()

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
0,1,120,284.0,0,27.0,62.0,100.0,0.0
1,2,113,282.0,0,33.0,64.0,135.0,0.0
2,3,128,279.0,0,28.0,64.0,115.0,1.0
3,4,123,,0,36.0,69.0,190.0,0.0
4,5,108,282.0,0,23.0,67.0,125.0,1.0


#### 2. Import the "babies.csv" file and name it df. 

<b>Background Info</b>

    The Child Health and Development Studies considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. The goal is to model the weight of the infants (bwt, in ounces) using variables including length of pregnancy in days (gestation), mother's age in years (age), mother's height in inches (height), whether the child was the first born (parity), mother's pregnancy weight in pounds (weight), and whether the mother was a smoker (smoke).

<b>Variables</b>

    case - id number
    bwt - birthweight, in ounces
    gestation - length of gestation, in days
    parity - binary indicator for a first pregnancy (0=first pregnancy)
    age - mother's age in years
    height - mother's height in inches
    weight - mother's weight in pounds
    smoke - binary indicator for whether the mother smokes

#### 3. Check the shape of the dataset. How many columns and rows are there?

In [4]:
df.shape

(1236, 8)

#### 4. Check the first 10 rows and the last 10 rows. Drop the column "case". 

In [5]:
df.head(10)

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
0,1,120,284.0,0,27.0,62.0,100.0,0.0
1,2,113,282.0,0,33.0,64.0,135.0,0.0
2,3,128,279.0,0,28.0,64.0,115.0,1.0
3,4,123,,0,36.0,69.0,190.0,0.0
4,5,108,282.0,0,23.0,67.0,125.0,1.0
5,6,136,286.0,0,25.0,62.0,93.0,0.0
6,7,138,244.0,0,33.0,62.0,178.0,0.0
7,8,132,245.0,0,23.0,65.0,140.0,0.0
8,9,120,289.0,0,25.0,62.0,125.0,0.0
9,10,143,299.0,0,30.0,66.0,136.0,1.0


In [7]:
df.tail(10)

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
1226,1227,109,244.0,1,21.0,63.0,102.0,1.0
1227,1228,103,278.0,0,30.0,60.0,87.0,1.0
1228,1229,118,276.0,0,34.0,64.0,116.0,0.0
1229,1230,127,290.0,0,27.0,65.0,121.0,0.0
1230,1231,132,270.0,0,27.0,65.0,126.0,0.0
1231,1232,113,275.0,1,27.0,60.0,100.0,0.0
1232,1233,128,265.0,0,24.0,67.0,120.0,0.0
1233,1234,130,291.0,0,30.0,65.0,150.0,1.0
1234,1235,125,281.0,1,21.0,65.0,110.0,0.0
1235,1236,117,297.0,0,38.0,65.0,129.0,0.0
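
The second half of this exercise, dropping the "case" column, is not shown in the cells above, and the later outputs in this notebook still include that column. A minimal sketch of the drop on a small stand-in DataFrame (the name `babies` and its three rows are assumptions for illustration) could look like this:

```python
import pandas as pd

# Small stand-in for the babies data (first rows shown above)
babies = pd.DataFrame({
    "case": [1, 2, 3],
    "bwt": [120, 113, 128],
    "gestation": [284.0, 282.0, 279.0],
})

# drop() returns a new DataFrame; the original keeps its "case" column
babies_no_case = babies.drop(columns=["case"])
print(list(babies_no_case.columns))
```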


#### 5. Is there any missing data? Check!

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1236 entries, 0 to 1235
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   case       1236 non-null   int64  
 1   bwt        1236 non-null   int64  
 2   gestation  1223 non-null   float64
 3   parity     1236 non-null   int64  
 4   age        1234 non-null   float64
 5   height     1214 non-null   float64
 6   weight     1200 non-null   float64
 7   smoke      1226 non-null   float64
dtypes: float64(5), int64(3)
memory usage: 77.4 KB


#### 6. The amount of missing data is small considering the size of our dataset. Drop all rows in the dataset that have missing data.

In [9]:
df.dropna(inplace = True)
df

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
0,1,120,284.0,0,27.0,62.0,100.0,0.0
1,2,113,282.0,0,33.0,64.0,135.0,0.0
2,3,128,279.0,0,28.0,64.0,115.0,1.0
4,5,108,282.0,0,23.0,67.0,125.0,1.0
5,6,136,286.0,0,25.0,62.0,93.0,0.0
...,...,...,...,...,...,...,...,...
1231,1232,113,275.0,1,27.0,60.0,100.0,0.0
1232,1233,128,265.0,0,24.0,67.0,120.0,0.0
1233,1234,130,291.0,0,30.0,65.0,150.0,1.0
1234,1235,125,281.0,1,21.0,65.0,110.0,0.0


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1174 entries, 0 to 1235
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   case       1174 non-null   int64  
 1   bwt        1174 non-null   int64  
 2   gestation  1174 non-null   float64
 3   parity     1174 non-null   int64  
 4   age        1174 non-null   float64
 5   height     1174 non-null   float64
 6   weight     1174 non-null   float64
 7   smoke      1174 non-null   float64
dtypes: float64(5), int64(3)
memory usage: 82.5 KB


#### 7. Check each of your numeric columns for outliers - pick one method and use it for all the columns. 

In [17]:
dfz = df.copy()
print(dfz.shape)

(1174, 8)


In [18]:
dfz["zbwt"] = np.abs(stats.zscore(dfz["bwt"]))
dfz.head()

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke,zbwt
0,1,120,284.0,0,27.0,62.0,100.0,0.0,0.029337
1,2,113,282.0,0,33.0,64.0,135.0,0.0,0.352741
2,3,128,279.0,0,28.0,64.0,115.0,1.0,0.465998
4,5,108,282.0,0,23.0,67.0,125.0,1.0,0.625654
5,6,136,286.0,0,25.0,62.0,93.0,0.0,0.902658


In [20]:
z_outliers = dfz.loc[dfz["zbwt"] > 3].index
## Preview list of index values
print(z_outliers)

Int64Index([632, 829, 912, 978, 1139], dtype='int64')
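
The cells above check only the "bwt" column. To apply the same z-score rule to every numeric column, one option is a small loop like the sketch below. The helper name `zscore_outliers` and the sample data are assumptions for illustration; the z-score is computed directly with pandas so the example does not depend on SciPy.

```python
import pandas as pd

def zscore_outliers(frame, threshold=3):
    """Map each numeric column to the index labels with |z-score| > threshold."""
    outliers = {}
    for column in frame.select_dtypes(include="number").columns:
        values = frame[column]
        # ddof=0 matches the default of scipy.stats.zscore
        z = (values - values.mean()) / values.std(ddof=0)
        outliers[column] = z[z.abs() > threshold].index.tolist()
    return outliers

# Made-up data: one extreme birthweight in row 19, no extreme ages
sample = pd.DataFrame({
    "bwt": [119, 121] * 9 + [119, 300],
    "age": [25, 27, 29, 31] * 5,
})
print(zscore_outliers(sample))
```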


#### 8. Print the descriptive statistics for each numeric column. What is the average age of the mothers? What is the average gestation period?

In [21]:
df["age"].mean()

27.228279386712096

In [22]:
df["gestation"].mean()

279.1013628620102
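
The two cells above report single means. For the full set of descriptive statistics the exercise asks for, `describe()` summarizes every numeric column at once. The sketch below runs it on a small stand-in DataFrame built from a few of the previewed rows, so its numbers will not match the full dataset.

```python
import pandas as pd

# Small stand-in for the babies data (values from the preview above)
babies = pd.DataFrame({
    "age": [27.0, 33.0, 28.0, 23.0, 25.0],
    "gestation": [284.0, 282.0, 279.0, 282.0, 286.0],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column
summary = babies.describe()
print(summary.loc["mean"])
```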

#### 9. Let's model birthweight based on the characteristics of the mother. But first... 

We want to easily distinguish between the numeric and categorical variables. Replace the values 0/1 in the "parity" and "smoke" column with meaningful labels (i.e. smokes, doesn't smoke).

In [25]:
## map the 0/1 codes to labels and assign the result back; in-place replace on a single column may not update df in newer pandas
df["smoke"] = df["smoke"].replace({0.0: "No", 1.0: "Yes"})
df.head()

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
0,1,120,284.0,0,27.0,62.0,100.0,No
1,2,113,282.0,0,33.0,64.0,135.0,No
2,3,128,279.0,0,28.0,64.0,115.0,Yes
4,5,108,282.0,0,23.0,67.0,125.0,Yes
5,6,136,286.0,0,25.0,62.0,93.0,No


#### 10. Run a correlation matrix with your dataset. Which variables are correlated with birthweight? 

Describe the strength of the correlation between all the numeric variables and birthweight. 

In [26]:
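## numeric_only=True skips the relabeled "smoke" text column (pandas 2.0+ raises an error on it otherwise)
df.corr(numeric_only=True)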

Unnamed: 0,case,bwt,gestation,parity,age,height,weight
case,1.0,-0.064523,0.020642,0.197342,0.013212,-0.032239,-0.060275
bwt,-0.064523,1.0,0.407543,-0.043908,0.026983,0.203704,0.155923
gestation,0.020642,0.407543,1.0,0.080916,-0.053425,0.07047,0.023655
parity,0.197342,-0.043908,0.080916,1.0,-0.351041,0.043543,-0.096362
age,0.013212,0.026983,-0.053425,-0.351041,1.0,-0.006453,0.147322
height,-0.032239,0.203704,0.07047,0.043543,-0.006453,1.0,0.435287
weight,-0.060275,0.155923,0.023655,-0.096362,0.147322,0.435287,1.0


#### 11. Determine the relationship between birthweight and the categorical variables: parity and smoke. 

Use the groupby function to determine if there are any differences between birthweight and the different groups.  Does it seem like there is a relationship between these variables and birthweight?

In [27]:
pd.crosstab(df["bwt"], df["parity"])

parity,0,1
bwt,Unnamed: 1_level_1,Unnamed: 2_level_1
55,1,0
58,1,0
62,1,0
63,0,1
65,2,0
...,...,...
169,1,0
170,0,1
173,1,0
174,3,0


In [28]:
df["bwt"].groupby(df["smoke"]).mean()

smoke
No     123.085315
Yes    113.819172
Name: bwt, dtype: float64
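
A crosstab of birthweight against parity produces one row per distinct birthweight, which is hard to read. A group mean for parity, in the same spirit as the smoke comparison above, may be easier to interpret. The sketch below uses a small made-up DataFrame, so its numbers are illustrative only.

```python
import pandas as pd

# Small stand-in for the babies data (made-up rows for illustration)
babies = pd.DataFrame({
    "bwt": [120, 113, 128, 109, 113, 125],
    "parity": [0, 0, 0, 1, 1, 1],
})

# Mean and count of birthweight within each parity group
by_parity = babies.groupby("parity")["bwt"].agg(["mean", "count"])
print(by_parity)
```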

#### 12. Let's construct your regression model. Firstly, which variables do you plan to include in your model, and why? 

In the space below, write your justification for why you are including each variable. 

#### 13. Construct your regression model and print the summary. 

Write out your full interpretation of the regression results. If you are not happy with the results, tweak your model and run it again. 

#### 14. Create three scenarios (i.e. make up specific values) and predict the birthweight given these factors. Use the information in the model summary to make these predictions. 