# Week 9: Regression
## Exercises from Think Stats chapter 11 

## Exercise 11-1

Suppose one of your co-workers is expecting a baby and you are participating in an office pool to predict the date of birth. Assuming that bets are placed during the 30th week of pregnancy, what variables could you use to make the best prediction? You should limit yourself to variables that are known before the birth, and likely to be available to the people in the pool.

In [36]:
# Suppress pandas FutureWarning messages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [38]:
import first
import statsmodels.formula.api as smf

# Keep live births from pregnancies that lasted beyond week 30
live, firsts, others = first.MakeFrames()
live = live[live.prglngth > 30]

# Model 1: all candidate predictors (race enters as a numeric code here)
model_1 = smf.ols('prglngth ~ birthord + race + nbrnaliv + agepreg + babysex', data=live)
results = model_1.fit()
print(results.summary())

# Model 2: only birth order and number of babies born alive
model_2 = smf.ols('prglngth ~ birthord + nbrnaliv', data=live)
results = model_2.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:               prglngth   R-squared:                       0.006
Model:                            OLS   Adj. R-squared:                  0.005
Method:                 Least Squares   F-statistic:                     10.77
Date:                Wed, 05 Feb 2025   Prob (F-statistic):           2.40e-10
Time:                        10:48:39   Log-Likelihood:                -18261.
No. Observations:                8878   AIC:                         3.653e+04
Df Residuals:                    8872   BIC:                         3.658e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     39.6460      0.168    235.690      0.0

### Solution
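Based on these models, I would use birth order (birthord) and the number of babies born alive from the pregnancy (nbrnaliv), which flags multiple births. Even so, the five-variable model's R-squared is only 0.006, so these variables explain very little of the variation in pregnancy length.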
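As a rough check on how little the model explains, the adjusted R-squared in the summary can be reproduced from the reported R-squared, the number of observations, and the number of predictors. The sketch below assumes the standard adjustment formula and uses the rounded values printed in the summary; the function name is illustrative.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read from the OLS summary above (R-squared is rounded to 3 decimals there)
adj = adjusted_r_squared(r_squared=0.006, n_obs=8878, n_predictors=5)
print(f"Adjusted R-squared: {adj:.3f}")
```

The denominator, n - p - 1 = 8872, matches the "Df Residuals" entry in the summary.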

## Exercise 11-2

The Trivers-Willard hypothesis suggests that for many mammals the sex ratio depends on “maternal condition”; that is, factors like the mother’s age, size, health, and social status.

Some studies have shown this effect among humans, but results are mixed. In this chapter we tested some variables related to these factors, but didn’t find any with a statistically significant effect on sex ratio.

As an exercise, use a data mining approach to test the other variables in the pregnancy and respondent files. Can you find any factors with a substantial effect?

In [100]:
import pandas as pd
import statsmodels.formula.api as smf
import nsfg

# Load the data from the NSFG pregnancy and respondent datasets
preg = nsfg.ReadFemPreg()  # Pregnancy dataset
resp = nsfg.ReadFemResp()  # Respondent dataset

# Merge the datasets based on 'caseid'
merged_data = pd.merge(preg, resp, on='caseid', how='inner')

# Check for missing data and print a summary
print("Missing data summary before cleaning:")
print(merged_data.isnull().sum())

# Drop rows with missing target 'babysex'
merged_data = merged_data.dropna(subset=['babysex'])

# Ensure 'babysex' exists in the dataframe and check for unique values
print(f"Unique values in 'babysex': {merged_data['babysex'].unique()}")

def explore_significant_factors(df, p_value_threshold=0.05):
    """Searches for factors influencing the likelihood of a baby being a boy.

    df: DataFrame containing pregnancy and respondent records
    p_value_threshold: Significance level to filter predictors

    returns: list of (p-value, predictor variable name) pairs for significant variables
    """
    # Create a binary target column 'is_boy' where 1 = boy, 0 = not boy
    df['is_boy'] = (df['babysex'] == 1).astype(int)

    significant_results = []

    # Iterate over candidate predictors, skipping the target, the column it is
    # derived from, and the control variable already in the formula
    for column in df.columns:
        try:
            if column in ('is_boy', 'babysex', 'agepreg'):
                continue

            # Treat string-valued columns as categorical; skip numeric columns
            # with (near-)zero variance
            if df[column].dtype == 'object':
                formula = f'is_boy ~ agepreg + C({column})'
            elif df[column].var() < 1e-7:
                continue
            else:
                formula = f'is_boy ~ agepreg + {column}'

            logit_model = smf.logit(formula, data=df)

            # Ensure that the dataset has enough observations to perform the test
            total_observations = len(logit_model.endog)
            if total_observations < len(df) / 2:
                continue

            # Fit the logistic regression model
            fitted_model = logit_model.fit()

            # Get p-values for each variable
            p_values = fitted_model.pvalues

            # Check if the p-value for the predictor variable is below the threshold
            if p_values[column] < p_value_threshold:
                significant_results.append((p_values[column], column))

        except Exception as e:
            print(f"An error occurred while processing '{column}': {e}")
            continue

    return significant_results

# Pass the merged data to the explore_significant_factors function
significant_factors = explore_significant_factors(merged_data)

# Output the significant factors and their corresponding p-values
if significant_factors:
    print ("Significant factors:")
    for p_value, factor in significant_factors:
        print(f"{factor}: {p_value:.4f}")
else:
    print("No significant factors found.")


Missing data summary before cleaning:
caseid            0
pregordr          0
howpreg_n     13241
howpreg_p     13244
moscurrp_x    13241
              ...  
sest_y            0
cmintvw_y         0
cmlstyr           0
screentime        0
intvlngth         0
Length: 3330, dtype: int64
Unique values in 'babysex': [1. 2.]
Optimization terminated successfully.
         Current function value: 0.693005
         Iterations 3
Optimization terminated successfully.
         Current function value: 0.692931
         Iterations 3
An error occurred while processing 'howpreg_n': zero-size array to reduction operation maximum which has no identity
An error occurred while processing 'howpreg_p': zero-size array to reduction operation maximum which has no identity
An error occurred while processing 'moscurrp_x': zero-size array to reduction operation maximum which has no identity
An error occurred while processing 'nowprgdk': zero-size array to reduction operation maximum which has no identity
...
An error occurred while processing 'ltrimest': zero-size array to reduction operation maximum which has no identity
...
         Current function value: inf
         Iterations: 35
An error occurred while processing 'paydeliv_i': Singular matrix
...
An error occurred while processing 'sourcem4': zero-size array to reduction operation maximum which has no identity
...

### Solution
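The search flagged several variables with p-values below 0.05, including "cmintstr", "fmarout5", "intr_ec3", and "educmom". Because it fits one model per column across thousands of columns, some of these are likely to be false positives, so they are best treated as candidates rather than confirmed predictors of the baby's sex.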
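A screen like this runs one test per column, so some predictors will pass the 0.05 threshold by chance alone. The sketch below estimates how many, assuming every one of the 3,330 columns in the merged frame was tested and that none had a real effect (both are simplifying assumptions, since the loop skips some columns). It also shows a Bonferroni-corrected threshold as one common, conservative alternative. The helper names are illustrative.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of tests passing by chance if no predictor has a real effect."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests

n_columns = 3330  # column count reported in the missing-data summary above
print(f"Expected chance hits at 0.05: {expected_false_positives(n_columns):.1f}")
print(f"Bonferroni threshold: {bonferroni_threshold(n_columns):.2e}")
```

Passing a stricter value as p_value_threshold to explore_significant_factors would be one way to apply such a correction.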

## Exercise 11-3

If the quantity you want to predict is a count, you can use Poisson regression, which is implemented in StatsModels with a function called poisson. It works the same way as ols and logit. As an exercise, let’s use it to predict how many children a woman has borne; in the NSFG dataset, this variable is called numbabes.

Suppose you meet a woman who is 35 years old, black, and a college graduate whose annual household income exceeds $75,000. How many children would you predict she has borne?

In [158]:
import nsfg
import pandas as pd
import statsmodels.formula.api as smf
import numpy as np

# Load the data
preg = nsfg.ReadFemPreg()  # Pregnancy dataset
resp = nsfg.ReadFemResp()  # Respondent dataset

# Join pregnancies longer than 30 weeks to their respondent records on 'caseid'
live = preg[preg.prglngth > 30]
resp.index = resp.caseid
join = live.join(resp, on='caseid', rsuffix='_r')

# Replace missing values in 'numbabes' (code 97 as missing)
join['numbabes'] = join.numbabes.replace([97], np.nan)

# Create additional age-related features
join['age2'] = join.age_r**2

# Define the Poisson regression formula
formula = 'numbabes ~ age_r + age2 + C(race) + totincr + educat'

# Fit the model
model = smf.poisson(formula, data=join)
results = model.fit()

# Display model summary
print(results.summary())

# Predict for a 35-year-old black woman, college graduate, with income > $75,000
# Codes: race=1 (black), totincr=14 (income > $75,000), educat=16 (college graduate)
columns = ['age_r', 'age2', 'race', 'totincr', 'educat']
new_data = pd.DataFrame([[35, 35**2, 1, 14, 16]], columns=columns)

# Make the prediction
predicted_children = results.predict(new_data)
print(f"Predicted number of children: {predicted_children.iloc[0]:.2f}")


Optimization terminated successfully.
         Current function value: 1.676846
         Iterations 7
                          Poisson Regression Results                          
Dep. Variable:               numbabes   No. Observations:                 9011
Model:                        Poisson   Df Residuals:                     9004
Method:                           MLE   Df Model:                            6
Date:                Wed, 05 Feb 2025   Pseudo R-squ.:                 0.03789
Time:                        13:58:15   Log-Likelihood:                -15110.
converged:                       True   LL-Null:                       -15705.
Covariance Type:            nonrobust   LLR p-value:                6.705e-254
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -1.1284      0.168     -6.713      0.000      -1.458      -0.799
C(race)[T.2]    -0.1379

### Solution
The model predicts an expected count of about 2.49 children, so this woman has most likely borne 2 or 3.
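A Poisson model's prediction is an expected count, so it helps to look at the whole distribution it implies. The sketch below takes the predicted mean of 2.49 reported above and computes the probability of each count, under the model's assumption that counts follow a Poisson distribution; the function name is illustrative.

```python
import math

def poisson_pmf(k, mean):
    """Probability of observing exactly k events when the expected count is `mean`."""
    return math.exp(-mean) * mean**k / math.factorial(k)

predicted_mean = 2.49  # prediction reported above
probs = {k: poisson_pmf(k, predicted_mean) for k in range(7)}
for k, p in probs.items():
    print(f"P({k} children) = {p:.3f}")

most_likely = max(probs, key=probs.get)
print(f"Most likely count: {most_likely}")
```

Under this assumption, 2 is the single most likely count and 3 is the next most likely.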

## Exercise 11-4

If the quantity you want to predict is categorical, you can use multinomial logistic regression, which is implemented in StatsModels with a function called mnlogit. As an exercise, let’s use it to guess whether a woman is married, cohabitating, widowed, divorced, separated, or never married; in the NSFG dataset, marital status is encoded in a variable called rmarital.

Suppose you meet a woman who is 25 years old, white, and a high school graduate whose annual household income is about $45,000. What is the probability that she is married, cohabitating, and so on?

In [172]:
import nsfg
import pandas as pd
import statsmodels.formula.api as smf
import numpy as np

# Load the data
preg = nsfg.ReadFemPreg()  # Pregnancy dataset
resp = nsfg.ReadFemResp()  # Respondent dataset

# Join the datasets based on 'caseid'
resp.index = resp.caseid  # Set 'caseid' as index for the respondent data
live = preg[preg.prglngth > 30]  # Filter out pregnancies with length <= 30 weeks
join = live.join(resp, on='caseid', rsuffix='_r')  # Join the datasets on 'caseid'

# Check the shape of the resulting dataframe
print(f"Shape of the joined dataset: {join.shape}")

# Create additional features (age squared for polynomial term)
join['age2'] = join.age_r**2

# Define the multinomial logistic regression formula for marital status prediction
formula = 'rmarital ~ age_r + age2 + C(race) + totincr + educat'

# Fit the multinomial logistic regression model
model = smf.mnlogit(formula, data=join)
results = model.fit()

# Display the model summary
print(results.summary())

# Prediction for a woman who is 25 years old, white, a high school graduate, and has an income of about $45,000
columns = ['age_r', 'age2', 'race', 'totincr', 'educat']
new_data = pd.DataFrame([[25, 25**2, 2, 11, 12]], columns=columns)  # Age=25, race=2 (white), education=12 (high school), income=11 (~$45,000)

# Get predicted probabilities for each marital status category
predicted_probs = results.predict(new_data)

# Label the output columns in order of the rmarital codes 1 through 6
marital_categories = ['Married', 'Living with opposite-sex partner', 'Widowed',
                      'Divorced', 'Separated', 'Never married']

# Print predicted probabilities for each marital status category
print("\nPredicted probabilities for marital status categories:")
for i, prob in enumerate(predicted_probs.iloc[0]):
    print(f"{marital_categories[i]}: {prob * 100:.2f}%")

Shape of the joined dataset: (9011, 3331)
Optimization terminated successfully.
         Current function value: 1.084199
         Iterations 8
                          MNLogit Regression Results                          
Dep. Variable:               rmarital   No. Observations:                 9011
Model:                        MNLogit   Df Residuals:                     8976
Method:                           MLE   Df Model:                           30
Date:                Wed, 05 Feb 2025   Pseudo R-squ.:                  0.1691
Time:                        14:11:27   Log-Likelihood:                -9769.7
converged:                       True   LL-Null:                       -11758.
Covariance Type:            nonrobust   LLR p-value:                     0.000
  rmarital=2       coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        9.0147      0.796     11.323      0.000      

### Solution
The model estimates that this woman has about a 74.88% chance of being currently married, 12.73% of living with an opposite-sex partner, 0.15% of being widowed, 3.34% of being divorced, 2.11% of being separated, and 6.79% of never having been married.
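Multinomial logistic regression turns one linear score per non-reference category into probabilities with a softmax in which the reference category's score is fixed at zero. The sketch below shows that step in isolation; the scores are made-up numbers for illustration, not the fitted coefficients from the summary above, and the function name is hypothetical.

```python
import math

def mnlogit_probabilities(scores):
    """Convert linear scores for the non-reference categories into probabilities.

    The reference category (listed first in the result) has an implicit score of 0.
    """
    all_scores = [0.0] + list(scores)
    # Subtract the maximum before exponentiating for numerical stability
    top = max(all_scores)
    weights = [math.exp(s - top) for s in all_scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical scores for five non-reference categories (illustrative only)
example_scores = [-1.8, -6.2, -3.1, -3.6, -2.4]
probs = mnlogit_probabilities(example_scores)
for i, p in enumerate(probs):
    print(f"Category {i + 1}: {p * 100:.2f}%")
print(f"Total: {sum(probs) * 100:.2f}%")
```

Because the probabilities come from a single normalization, they always sum to 100%, as the six percentages reported above do.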