## Regression Analysis

In [2]:
import piplite
await piplite.install(['numpy', 'pandas'])
import numpy as np
import pandas as pd
import statsmodels.api as sm
from js import fetch
import io

URL = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork/labs/teachingratings.csv'
resp = await fetch(URL)
ratings_url = io.BytesIO((await resp.arrayBuffer()).to_py())

In [3]:
ratings_df = pd.read_csv(ratings_url)

### Regression with T-test
#### Hypothesis:
#### H_0: beta1 = 0 (Gender has no effect on teaching evaluation scores)
#### H_1: beta1 is not equal to 0 (Gender has an effect on teaching evaluation scores)

In [4]:
# x is the independent (input) variable
x = ratings_df['female']
# y is the dependent variable
y = ratings_df['eval']

x = sm.add_constant(x)

model = sm.OLS(y,x).fit()
predictions = model.predict(x)

model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | eval | R-squared: | 0.022 |
| Model: | OLS | Adj. R-squared: | 0.02 |
| Method: | Least Squares | F-statistic: | 10.56 |
| Date: | Tue, 20 Feb 2024 | Prob (F-statistic): | 0.00124 |
| Time: | 18:56:41 | Log-Likelihood: | -378.5 |
| No. Observations: | 463 | AIC: | 761.0 |
| Df Residuals: | 461 | BIC: | 769.3 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.0690 | 0.034 | 121.288 | 0.000 | 4.003 | 4.135 |
| female | -0.1680 | 0.052 | -3.250 | 0.001 | -0.270 | -0.066 |

| | | | |
|---|---|---|---|
| Omnibus: | 17.625 | Durbin-Watson: | 1.209 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 18.97 |
| Skew: | -0.496 | Prob(JB): | 7.6e-05 |
| Kurtosis: | 2.981 | Cond. No. | 2.47 |
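
Reading the summary above: the `female` coefficient is -0.1680 with p = 0.001, which is below 0.05, so H_0 is rejected at the 5% level. Assuming `female` is coded 0/1, the intercept is the mean score of the reference group and the slope is the difference between the two group means, so the slope's t statistic should match a pooled two-sample t-test. The sketch below checks this on synthetic data; the sample size, effect size, and variable names are assumptions for illustration and are not taken from `teachingratings.csv`.

```python
import numpy as np

# Synthetic stand-in data (assumed values, not the teaching ratings dataset)
rng = np.random.default_rng(0)
n = 200
female = rng.integers(0, 2, size=n)                 # 0/1 indicator
eval_score = 4.0 - 0.2 * female + rng.normal(0, 0.5, size=n)

# OLS by least squares: columns are [constant, female]
X = np.column_stack([np.ones(n), female])
beta, *_ = np.linalg.lstsq(X, eval_score, rcond=None)

# Group means for comparison
mean_0 = eval_score[female == 0].mean()
mean_1 = eval_score[female == 1].mean()

# t statistic of the slope from the OLS covariance matrix
resid = eval_score - X @ beta
s2 = resid @ resid / (n - 2)                        # residual variance
cov = s2 * np.linalg.inv(X.T @ X)
t_ols = beta[1] / np.sqrt(cov[1, 1])

# Pooled two-sample t statistic computed directly from the groups
n0, n1 = (female == 0).sum(), (female == 1).sum()
ss0 = ((eval_score[female == 0] - mean_0) ** 2).sum()
ss1 = ((eval_score[female == 1] - mean_1) ** 2).sum()
sp2 = (ss0 + ss1) / (n - 2)                         # pooled variance
t_pooled = (mean_1 - mean_0) / np.sqrt(sp2 * (1 / n0 + 1 / n1))

print("intercept vs mean of group 0:", beta[0], mean_0)
print("slope vs difference in means:", beta[1], mean_1 - mean_0)
print("t (OLS) vs t (pooled):", t_ols, t_pooled)
```

Each printed pair should agree up to floating-point error, which is why a regression on a single binary variable can stand in for a two-sample t-test.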


### Regression with ANOVA
#### Hypothesis:
#### H_0: mu1 = mu2 = mu3 (the mean beauty scores of the three age groups are equal)
#### H_1: At least one of the means differs

In [5]:
ratings_df.loc[(ratings_df['age'] <= 40), 'age_group'] = '40 years and younger'
ratings_df.loc[(ratings_df['age'] > 40)&(ratings_df['age'] < 57), 'age_group'] = 'between 40 and 57 years'
ratings_df.loc[(ratings_df['age'] >= 57), 'age_group'] = '57 years and older'
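
The three `.loc` assignments above can also be written as a single labelling step. The following is a sketch of one alternative using `np.select`, with the same boundary conditions; the `demo` frame and its ages are hypothetical values picked to hit each boundary, and `'unknown'` is an assumed fallback label for rows that match no condition (for example, missing ages).

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to exercise each boundary
demo = pd.DataFrame({'age': [29, 40, 41, 56, 57, 73]})

# Same conditions as the three .loc assignments
conditions = [
    demo['age'] <= 40,
    (demo['age'] > 40) & (demo['age'] < 57),
    demo['age'] >= 57,
]
labels = [
    '40 years and younger',
    'between 40 and 57 years',
    '57 years and older',
]

# The first matching condition wins; unmatched rows get the fallback label
demo['age_group'] = np.select(conditions, labels, default='unknown')
print(demo)
```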

In [6]:
from statsmodels.formula.api import ols
# Regress beauty on the categorical age_group, then build the ANOVA table
lm = ols('beauty ~ age_group', data=ratings_df).fit()
table = sm.stats.anova_lm(lm)
print(table)

              df      sum_sq    mean_sq          F        PR(>F)
age_group    2.0   20.422744  10.211372  17.597559  4.322549e-08
Residual   460.0  266.925153   0.580272        NaN           NaN


**Conclusion:** The regression-based ANOVA table gives the same values as before. Because the p-value (4.32e-08) is less than 0.05, we reject the null hypothesis: there is significant evidence that at least one of the age-group means differs.
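
The columns of the ANOVA table are linked by simple arithmetic: each mean square is its sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. The short check below recomputes F from the `sum_sq` and `df` values printed above; the variable names are only illustrative.

```python
# Values copied from the ANOVA table above
ss_group, df_group = 20.422744, 2
ss_resid, df_resid = 266.925153, 460

ms_group = ss_group / df_group          # mean square between groups
ms_resid = ss_resid / df_resid          # mean square within groups (residual)
f_stat = ms_group / ms_resid            # F statistic

# Share of total variation explained by age_group
r_squared = ss_group / (ss_group + ss_resid)

print(round(f_stat, 3), round(r_squared, 3))
```

The recomputed F agrees with the 17.597559 in the table, and the explained share of about 0.071 is the R-squared of the same model.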

### Regression with ANOVA option 2


Create dummy variables. A dummy variable is a numeric variable that represents categorical data, such as gender or race. Dummy variables are dichotomous, i.e., they can take only two values (0 or 1).


In [9]:
X = pd.get_dummies(ratings_df[['age_group']])

In [10]:
y = ratings_df['beauty']
## add an intercept (beta_0) to our model
X = sm.add_constant(X) 

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | beauty | R-squared: | 0.071 |
| Model: | OLS | Adj. R-squared: | 0.065 |
| Method: | Least Squares | F-statistic: | 11.71 |
| Date: | Tue, 20 Feb 2024 | Prob (F-statistic): | 2.1e-07 |
| Time: | 18:57:35 | Log-Likelihood: | -529.47 |
| No. Observations: | 463 | AIC: | 1067.0 |
| Df Residuals: | 459 | BIC: | 1083.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.0138 | 0.028 | 0.495 | 0.621 | -0.041 | 0.069 |
| age_group_40 years and younger | 0.3224 | 0.058 | 5.568 | 0.000 | 0.209 | 0.436 |
| age_group_57 years and older | -0.2596 | 0.056 | -4.616 | 0.000 | -0.370 | -0.149 |
| age_group_between 40 and 57 years | -0.0489 | 0.045 | -1.080 | 0.281 | -0.138 | 0.040 |

| | | | |
|---|---|---|---|
| Omnibus: | 11.586 | Durbin-Watson: | 0.434 |
| Prob(Omnibus): | 0.003 | Jarque-Bera (JB): | 12.114 |
| Skew: | 0.394 | Prob(JB): | 0.00234 |
| Kurtosis: | 2.913 | Cond. No. | 1.12e+15 |
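
The design matrix above contains a constant plus a dummy for every one of the three age groups. Because the three dummies sum to the constant column, the columns are perfectly collinear (the dummy variable trap), which is consistent with the very large condition number in the summary (about 1.12e+15). It may also explain why this summary reports an F-statistic of 11.71 on 3 model degrees of freedom while the ANOVA table reports 17.60 on 2. The usual remedy is to drop one category so that it becomes the reference level. A minimal sketch on a small made-up frame follows; the `demo` data and its labels are hypothetical, while `drop_first` and `dtype` are existing `pd.get_dummies` parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical three-level categorical variable
demo = pd.DataFrame({'age_group': ['young', 'middle', 'old'] * 3})

# All three dummies plus a constant: the dummies sum to the constant column
full = pd.get_dummies(demo[['age_group']], dtype=float)
full.insert(0, 'const', 1.0)

# Dropping the first level leaves two dummies plus the constant
reduced = pd.get_dummies(demo[['age_group']], drop_first=True, dtype=float)
reduced.insert(0, 'const', 1.0)

rank_full = np.linalg.matrix_rank(full.to_numpy())
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())

# A design matrix is full rank only when rank equals the number of columns
print("full:   ", full.shape[1], "columns, rank", rank_full)
print("reduced:", reduced.shape[1], "columns, rank", rank_reduced)
```

Applied to this notebook, that would mean calling `pd.get_dummies(ratings_df[['age_group']], drop_first=True)`. The remaining coefficients would then be differences from the dropped group, and the summary output would change accordingly.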


### Correlation: Using the teachers' rating dataset, is the teaching evaluation score correlated with the beauty score?


In [11]:
## X is the input (independent) variable
X = ratings_df['beauty']
## y is the target/dependent variable
y = ratings_df['eval']
## add an intercept (beta_0) to our model
X = sm.add_constant(X) 

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | eval | R-squared: | 0.036 |
| Model: | OLS | Adj. R-squared: | 0.034 |
| Method: | Least Squares | F-statistic: | 17.08 |
| Date: | Tue, 20 Feb 2024 | Prob (F-statistic): | 4.25e-05 |
| Time: | 18:57:57 | Log-Likelihood: | -375.32 |
| No. Observations: | 463 | AIC: | 754.6 |
| Df Residuals: | 461 | BIC: | 762.9 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.9983 | 0.025 | 157.727 | 0.000 | 3.948 | 4.048 |
| beauty | 0.1330 | 0.032 | 4.133 | 0.000 | 0.070 | 0.196 |

| | | | |
|---|---|---|---|
| Omnibus: | 15.399 | Durbin-Watson: | 1.238 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 16.405 |
| Skew: | -0.453 | Prob(JB): | 0.000274 |
| Kurtosis: | 2.831 | Cond. No. | 1.27 |


**Conclusion:** Because the p-value (4.25e-05) is less than 0.05, there is evidence of a correlation between beauty and evaluation scores.
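
In a simple regression with one predictor, R-squared equals the square of the Pearson correlation coefficient. The R-squared of 0.036 reported above therefore corresponds to a correlation of roughly 0.19 (positive, since the slope is positive): statistically significant, but weak. The sketch below verifies the identity on synthetic data; the generated values and variable names are assumptions for illustration only.

```python
import numpy as np

# Synthetic stand-in data (assumed values, not the teaching ratings dataset)
rng = np.random.default_rng(1)
beauty = rng.normal(0, 0.8, size=300)
eval_score = 4.0 + 0.13 * beauty + rng.normal(0, 0.55, size=300)

# Pearson correlation coefficient
r = np.corrcoef(beauty, eval_score)[0, 1]

# Simple linear regression fit and its R-squared
slope, intercept = np.polyfit(beauty, eval_score, deg=1)
fitted = intercept + slope * beauty
ss_res = np.sum((eval_score - fitted) ** 2)
ss_tot = np.sum((eval_score - eval_score.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("r:", r)
print("r squared:", r ** 2)
print("R-squared from the regression:", r_squared)
```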
