### a) Correlation Coefficient Sign Expectation
Explanation: We expect the correlation coefficient between the number of members and the median income to be positive. Areas with higher median incomes have residents with more disposable income, who are more likely to be able to afford sports club membership fees.
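The expected sign can be checked directly against the data. The sketch below is a minimal illustration, assuming the membership and income columns from the data set in part b); the names `clubs`, `members`, and `income` are introduced here for illustration only.

```python
import pandas as pd

# Membership and median income for the 16 clubs listed in part b)
members = [1258, 1479, 1480, 1701, 2014, 2271, 2615, 2632,
           2737, 2810, 3563, 3765, 3792, 4069, 4393, 4787]
income = [32223, 34975, 43187, 44337, 52167, 57521, 58347, 60960,
          62201, 67993, 68770, 81289, 83902, 84594, 86855, 88381]

clubs = pd.DataFrame({"members": members, "income": income})

# Pearson correlation coefficient (the pandas default method)
r = clubs["members"].corr(clubs["income"])
print(f"r = {r:.3f}")
print("sign:", "positive" if r > 0 else "non-positive")
```

A positive `r` is consistent with the expectation stated above, although the sign alone says nothing about why the two variables move together.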

### b) Create three dummy variables, pool, courts, and classes, that are equal to 1 if the observation contains this feature and equal to 0 if the observation does not contain this feature

In [16]:
import pandas as pd

# Define the data as a list of lists
data = [
    [1258, 32223, 'No', 'No', 'No'],
    [1479, 34975, 'No', 'No', 'No'],
    [1480, 43187, 'No', 'Yes', 'No'],
    [1701, 44337, 'No', 'No', 'No'],
    [2014, 52167, 'No', 'No', 'Yes'],
    [2271, 57521, 'No', 'No', 'Yes'],
    [2615, 58347, 'No', 'Yes', 'No'],
    [2632, 60960, 'Yes', 'No', 'No'],
    [2737, 62201, 'Yes', 'No', 'Yes'],
    [2810, 67993, 'No', 'No', 'Yes'],
    [3563, 68770, 'No', 'No', 'Yes'],
    [3765, 81289, 'Yes', 'Yes', 'Yes'],
    [3792, 83902, 'No', 'No', 'Yes'],
    [4069, 84594, 'Yes', 'No', 'Yes'],
    [4393, 86855, 'Yes', 'Yes', 'Yes'],
    [4787, 88381, 'Yes', 'Yes', 'Yes']
]

# Define the column names
columns = ['Number of Members', 'Median Income ($)', 'Pool?', 'Racquetball Courts?', 'Fitness Classes?']

# Create DataFrame
df = pd.DataFrame(data, columns=columns)

# Encode each Yes/No column as a 0/1 dummy variable
df['Pool'] = (df['Pool?'] == 'Yes').astype(int)
df['Courts'] = (df['Racquetball Courts?'] == 'Yes').astype(int)
df['Classes'] = (df['Fitness Classes?'] == 'Yes').astype(int)

# Drop the original categorical columns
df.drop(columns=['Pool?', 'Racquetball Courts?', 'Fitness Classes?'], inplace=True)



In [17]:
df.head()

| | Number of Members | Median Income ($) | Pool | Courts | Classes |
|---|---|---|---|---|---|
| 0 | 1258 | 32223 | 0 | 0 | 0 |
| 1 | 1479 | 34975 | 0 | 0 | 0 |
| 2 | 1480 | 43187 | 0 | 1 | 0 |
| 3 | 1701 | 44337 | 0 | 0 | 0 |
| 4 | 2014 | 52167 | 0 | 0 | 1 |
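As an alternative to encoding the columns one at a time, all three dummies can be built in a single vectorized step. This is a small sketch on three rows copied from the data set; the names `raw`, `rename`, and `dummies` are assumptions made for this example.

```python
import pandas as pd

# Three rows in the same Yes/No layout as the raw data (illustrative subset)
raw = pd.DataFrame({
    "Pool?": ["No", "Yes", "Yes"],
    "Racquetball Courts?": ["Yes", "No", "Yes"],
    "Fitness Classes?": ["No", "No", "Yes"],
})

# Map each source column to its dummy name
rename = {
    "Pool?": "Pool",
    "Racquetball Courts?": "Courts",
    "Fitness Classes?": "Classes",
}

# Compare every cell to "Yes", cast True/False to 1/0, then rename the columns
dummies = raw[list(rename)].eq("Yes").astype(int).rename(columns=rename)
print(dummies)
```

Any value other than the exact string "Yes" becomes 0, which matches the behaviour of the cell above.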


### c) Use statistical software to estimate the following regression models. In each case, write the estimated regression equation and state whether the coefficient of the independent variable is significant at the 0.10 level. (Make sure to include the following in your answers: hypotheses $H_0$ and $H_a$, test statistic value, p-value, conclusion.)
i) Members = $b_0 + b_1(\text{Pool}) + e_i$
ii) Members = $b_0 + b_1(\text{Courts}) + e_i$
iii) Members = $b_0 + b_1(\text{Classes}) + e_i$

In [18]:
import statsmodels.api as sm

# Model i) Members = b0 + b1(Pool) + ei
model_i = sm.OLS(df['Number of Members'], sm.add_constant(df['Pool']))
results_i = model_i.fit()

# Model ii) Members = b0 + b1(Courts) + ei
model_ii = sm.OLS(df['Number of Members'], sm.add_constant(df['Courts']))
results_ii = model_ii.fit()

# Model iii) Members = b0 + b1(Classes) + ei
model_iii = sm.OLS(df['Number of Members'], sm.add_constant(df['Classes']))
results_iii = model_iii.fit()

# Print summary for each model
print("Model i:")
print(results_i.summary())
print("\nModel ii:")
print(results_ii.summary())
print("\nModel iii:")
print(results_iii.summary())


Model i:
                            OLS Regression Results                            
Dep. Variable:      Number of Members   R-squared:                       0.413
Model:                            OLS   Adj. R-squared:                  0.371
Method:                 Least Squares   F-statistic:                     9.863
Date:                Sun, 31 Mar 2024   Prob (F-statistic):            0.00723
Time:                        20:49:05   Log-Likelihood:                -130.17
No. Observations:                  16   AIC:                             264.3
Df Residuals:                      14   BIC:                             265.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2298.3000    279.269      8.230



### Now, let's interpret the results for each model and determine the significance of the coefficients:

### Model i) Members = $b_0 + b_1(\text{Pool}) + e_i$
- **Hypotheses**:
  - $H_0: b_1 = 0$ (The number of members is not affected by the presence of a pool)
  - $H_a: b_1 \neq 0$ (The number of members is affected by the presence of a pool)

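**Conclusion:**
- Estimated equation: $\widehat{\text{Members}} = 2298.3 + 1432.2(\text{Pool})$. The intercept is the `const` value in the output; the slope is the difference between the mean membership of clubs with a pool (3730.5) and clubs without one (2298.3).
- Test statistic and p-value: with a single regressor, $t^2 = F$ and the slope's p-value equals Prob (F-statistic), so $t = \sqrt{9.863} \approx 3.14$ and $p \approx 0.0072$.
- Because $0.0072 < 0.10$, we reject $H_0$. The coefficient of 'Pool' is significant at the 0.10 level: clubs with a pool have, on average, about 1,432 more members than clubs without one.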

### Model ii) Members = $b_0 + b_1(\text{Courts}) + e_i$
- **Hypotheses**:
  - $H_0: b_1 = 0$ (The number of members is not affected by the presence of racquetball courts)
  - $H_a: b_1 \neq 0$ (The number of members is affected by the presence of racquetball courts)

**Conclusion:**
- Test statistic and p-value: read the t statistic and the p-value from the 'Courts' row of the Model ii summary.
- If the p-value is below 0.10, reject $H_0$ and conclude that the coefficient of 'Courts' is significant at the 0.10 level, meaning membership differs significantly between clubs with and without racquetball courts. Otherwise, fail to reject $H_0$.

### Model iii) Members = $b_0 + b_1(\text{Classes}) + e_i$
- **Hypotheses**:
  - $H_0: b_1 = 0$ (The number of members is not affected by the offering of fitness classes)
  - $H_a: b_1 \neq 0$ (The number of members is affected by the offering of fitness classes)


**Conclusion:**
- Test statistic and p-value: read the t statistic and the p-value from the 'Classes' row of the Model iii summary.
- If the p-value is below 0.10, reject $H_0$ and conclude that the coefficient of 'Classes' is significant at the 0.10 level, meaning membership differs significantly between clubs with and without fitness classes. Otherwise, fail to reject $H_0$.

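In each of these models the intercept $b_0$ estimates the mean membership of clubs without the feature, and the slope $b_1$ estimates how much the mean membership differs for clubs with it. The t test on $b_1$ is therefore a test of whether the two group means differ.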

### d) Estimate the following multiple regression model.
Members = $b_0 + b_1(\text{Income}) + b_2(\text{Pool}) + b_3(\text{Courts}) + b_4(\text{Classes}) + e_i$
Write the estimated regression equation.

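### Estimated Equation:
$\widehat{\text{Members}} = b_0 + b_1(\text{Income}) + b_2(\text{Pool}) + b_3(\text{Courts}) + b_4(\text{Classes})$

Each $b_j$ is replaced by the matching value in the `coef` column of the regression output below. The portion of the output shown gives $b_1 \approx 0.0579$ for income, that is, about 58 additional members per \$1,000 of median income with the amenities held fixed.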

In [19]:
# Add constant term for the intercept
df['Intercept'] = 1

# Define independent variables (features)
X = df[['Median Income ($)', 'Pool', 'Courts', 'Classes', 'Intercept']]

# Define dependent variable (target)
y = df['Number of Members']

# Fit the multiple regression model
model = sm.OLS(y, X).fit()

# Print the summary of the regression model
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:      Number of Members   R-squared:                       0.957
Model:                            OLS   Adj. R-squared:                  0.942
Method:                 Least Squares   F-statistic:                     61.81
Date:                Sun, 31 Mar 2024   Prob (F-statistic):           1.81e-07
Time:                        20:49:05   Log-Likelihood:                -109.19
No. Observations:                  16   AIC:                             228.4
Df Residuals:                      11   BIC:                             232.2
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Median Income ($)     0.0579      0.00
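Since the summary table is long, the estimated equation can also be assembled directly from the fitted coefficients. The sketch below refits the same model with the statsmodels formula interface, which adds the intercept automatically; the shortened column names and the names `clubs`, `fit`, and `equation` are assumptions made for this example.

```python
import pandas as pd
import statsmodels.formula.api as smf

# (Members, Income, Pool, Courts, Classes) for the 16 clubs
rows = [
    (1258, 32223, 0, 0, 0), (1479, 34975, 0, 0, 0),
    (1480, 43187, 0, 1, 0), (1701, 44337, 0, 0, 0),
    (2014, 52167, 0, 0, 1), (2271, 57521, 0, 0, 1),
    (2615, 58347, 0, 1, 0), (2632, 60960, 1, 0, 0),
    (2737, 62201, 1, 0, 1), (2810, 67993, 0, 0, 1),
    (3563, 68770, 0, 0, 1), (3765, 81289, 1, 1, 1),
    (3792, 83902, 0, 0, 1), (4069, 84594, 1, 0, 1),
    (4393, 86855, 1, 1, 1), (4787, 88381, 1, 1, 1),
]
clubs = pd.DataFrame(rows, columns=["Members", "Income", "Pool", "Courts", "Classes"])

# Multiple regression of Members on income and the three dummies
fit = smf.ols("Members ~ Income + Pool + Courts + Classes", data=clubs).fit()

# Build the estimated equation; the + format flag prints each sign explicitly
b = fit.params
equation = (
    f"Members_hat = {b['Intercept']:.2f} {b['Income']:+.4f}(Income) "
    f"{b['Pool']:+.2f}(Pool) {b['Courts']:+.2f}(Courts) {b['Classes']:+.2f}(Classes)"
)
print(equation)
print(f"R-squared = {fit.rsquared:.3f}")
```

The coefficients are the same as those from `sm.OLS` with a manually added intercept column; only the way the design matrix is specified differs.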



### e) Are any of the coefficients of the indicator variables significant at the 0.10 level?

##### To determine whether any indicator-variable coefficient is significant at the 0.10 level, compare the p-value (`P>|t|`) in the Pool, Courts, and Classes rows of the multiple regression output with 0.10. A coefficient is significant at that level if its p-value is below 0.10. These tests hold income and the other amenities fixed, so their conclusions can differ from those of the simple regressions in part c).
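The comparison described above can be automated by pulling the three p-values out of the fitted model. This is a sketch under the same assumptions as before: the shortened column names and the names `clubs`, `fit`, `ALPHA`, and `indicator_p` are introduced for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# (Members, Income, Pool, Courts, Classes) for the 16 clubs
rows = [
    (1258, 32223, 0, 0, 0), (1479, 34975, 0, 0, 0),
    (1480, 43187, 0, 1, 0), (1701, 44337, 0, 0, 0),
    (2014, 52167, 0, 0, 1), (2271, 57521, 0, 0, 1),
    (2615, 58347, 0, 1, 0), (2632, 60960, 1, 0, 0),
    (2737, 62201, 1, 0, 1), (2810, 67993, 0, 0, 1),
    (3563, 68770, 0, 0, 1), (3765, 81289, 1, 1, 1),
    (3792, 83902, 0, 0, 1), (4069, 84594, 1, 0, 1),
    (4393, 86855, 1, 1, 1), (4787, 88381, 1, 1, 1),
]
clubs = pd.DataFrame(rows, columns=["Members", "Income", "Pool", "Courts", "Classes"])

fit = smf.ols("Members ~ Income + Pool + Courts + Classes", data=clubs).fit()

ALPHA = 0.10
# Two-sided p-values for the three indicator variables only
indicator_p = fit.pvalues[["Pool", "Courts", "Classes"]]

for name, p in indicator_p.items():
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{name}: p = {p:.4f} -> {verdict} at the {ALPHA:.2f} level")
```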

### f) Explain why it is important to include the income variable in the regression model.

##### It is important to include the income variable because the CEO believes that median income in the area is a major factor in how many people join a neighborhood sports club. Including it lets us assess the impact of income on membership while controlling for the presence of a pool, racquetball courts, or fitness classes. It also protects the amenity coefficients from omitted variable bias: in this data set the amenities are concentrated in higher-income areas (every club with a pool is in an area with a median income of at least \$60,960), so a model without income would attribute part of the income effect to the amenities.
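One way to see the role of income is to compare the Pool coefficient with and without income in the model. The sketch below is illustrative only; the shortened column names and the names `clubs`, `simple`, `full`, and `income_by_pool` are assumptions made for this example.

```python
import pandas as pd
import statsmodels.formula.api as smf

# (Members, Income, Pool, Courts, Classes) for the 16 clubs
rows = [
    (1258, 32223, 0, 0, 0), (1479, 34975, 0, 0, 0),
    (1480, 43187, 0, 1, 0), (1701, 44337, 0, 0, 0),
    (2014, 52167, 0, 0, 1), (2271, 57521, 0, 0, 1),
    (2615, 58347, 0, 1, 0), (2632, 60960, 1, 0, 0),
    (2737, 62201, 1, 0, 1), (2810, 67993, 0, 0, 1),
    (3563, 68770, 0, 0, 1), (3765, 81289, 1, 1, 1),
    (3792, 83902, 0, 0, 1), (4069, 84594, 1, 0, 1),
    (4393, 86855, 1, 1, 1), (4787, 88381, 1, 1, 1),
]
clubs = pd.DataFrame(rows, columns=["Members", "Income", "Pool", "Courts", "Classes"])

# Mean median income for clubs without (0) and with (1) a pool
income_by_pool = clubs.groupby("Pool")["Income"].mean()
print(income_by_pool)

# Pool coefficient when income is omitted vs. when it is controlled for
simple = smf.ols("Members ~ Pool", data=clubs).fit()
full = smf.ols("Members ~ Income + Pool + Courts + Classes", data=clubs).fit()
print(f"Pool coefficient without income: {simple.params['Pool']:.1f}")
print(f"Pool coefficient with income:    {full.params['Pool']:.1f}")
```

If the two Pool coefficients differ noticeably, the simple regression was picking up part of the income effect, which is the reason for keeping income in the model.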

### g) After studying these regression results, how would you suggest the management of the sports club chain go about building their new location? Should they use any of the regression models you have estimated? Explain why or why not.

##### After studying the regression results, management should weigh the following when planning the new location:

- Of the models estimated, the multiple regression from part d) is the most useful. It explains about 96% of the variation in membership ($R^2 = 0.957$, adjusted $R^2 = 0.942$), compared with about 41% for the pool-only model ($R^2 = 0.413$), and it estimates the effect of each amenity while holding income fixed.
- The coefficients and p-values of the indicator variables (pool, racquetball courts, fitness classes) in that model show which amenities are significantly associated with membership once income is controlled for.
- The income coefficient indicates how much membership tends to rise with the median income of the area, which matters for choosing the site.
- Correlation does not imply causation. Even if an amenity or income level is associated with higher membership in existing clubs, building a new location with those features does not guarantee higher membership.
- Factors not included in the model, such as location, marketing, and competition from other clubs, may also influence membership.
- The model is based on only 16 clubs, so its estimates are imprecise and should not be applied to income levels outside the observed range.

##### The regression results are therefore best used as one input to the decision, alongside industry knowledge, market research, and business objectives.
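If management does use the multiple regression model, one practical step is to predict membership for candidate designs. The sketch below compares a club with no amenities to one with all three; the median income of 70,000 is a hypothetical value chosen inside the observed range, and the names `clubs`, `fit`, `new_sites`, and `pred` are assumptions made for this example.

```python
import pandas as pd
import statsmodels.formula.api as smf

# (Members, Income, Pool, Courts, Classes) for the 16 clubs
rows = [
    (1258, 32223, 0, 0, 0), (1479, 34975, 0, 0, 0),
    (1480, 43187, 0, 1, 0), (1701, 44337, 0, 0, 0),
    (2014, 52167, 0, 0, 1), (2271, 57521, 0, 0, 1),
    (2615, 58347, 0, 1, 0), (2632, 60960, 1, 0, 0),
    (2737, 62201, 1, 0, 1), (2810, 67993, 0, 0, 1),
    (3563, 68770, 0, 0, 1), (3765, 81289, 1, 1, 1),
    (3792, 83902, 0, 0, 1), (4069, 84594, 1, 0, 1),
    (4393, 86855, 1, 1, 1), (4787, 88381, 1, 1, 1),
]
clubs = pd.DataFrame(rows, columns=["Members", "Income", "Pool", "Courts", "Classes"])

fit = smf.ols("Members ~ Income + Pool + Courts + Classes", data=clubs).fit()

# Two hypothetical designs for the same (hypothetical) neighborhood income
new_sites = pd.DataFrame(
    {"Income": [70000, 70000], "Pool": [0, 1], "Courts": [0, 1], "Classes": [0, 1]},
    index=["no amenities", "all amenities"],
)

# Point predictions with 90% prediction intervals for an individual club
pred = fit.get_prediction(new_sites).summary_frame(alpha=0.10)
print(pred[["mean", "obs_ci_lower", "obs_ci_upper"]])
```

The width of the prediction intervals gives a sense of how much uncertainty remains with only 16 observations, which is worth reporting alongside any point forecast.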