# **Regression Analysis**

## Import Libraries

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

In [3]:
#read CSV
url = 'https://github.com/YannadatchO/Workspace/raw/main/teachingratings.csv'
ratings_df = pd.read_csv(url)

## Lab Exercises

In this section, you will learn how to run regression analysis in place of the t-test, ANOVA, and correlation

### Regression with T-test: Using the teachers rating data set, does gender affect teaching evaluation rates?


Initially, we had used the t-test to test if there was a statistical difference in evaluations for males and females, we are now going to use regression. We will state the null hypothesis:

-   $H_0: β1$ = 0 (Gender has no effect on teaching evaluation scores)
-   $H_1: β1$ is not equal to 0 (Gender has an effect on teaching evaluation scores)


We will use the female variable. female = 1 and male = 0 ##index from column 'female'

In [4]:
ratings_df.head(5)

Unnamed: 0,minority,age,gender,credits,beauty,eval,division,native,tenure,students,allstudents,prof,PrimaryLast,vismin,female,single_credit,upper_division,English_speaker,tenured_prof
0,yes,36,female,more,0.289916,4.3,upper,yes,yes,24,43,1,0,1,1,0,1,1,1
1,yes,36,female,more,0.289916,3.7,upper,yes,yes,86,125,1,0,1,1,0,1,1,1
2,yes,36,female,more,0.289916,3.6,upper,yes,yes,76,125,1,0,1,1,0,1,1,1
3,yes,36,female,more,0.289916,4.4,upper,yes,yes,77,123,1,1,1,1,0,1,1,1
4,no,59,male,more,-0.737732,4.5,upper,yes,yes,17,20,2,0,0,0,0,1,1,1


In [5]:
## X is the input variables (or independent variables) 
X = ratings_df['female']

## y is the target/dependent variable
y = ratings_df['eval']

## add an intercept (beta_0) to our model
X = sm.add_constant(X)

model = sm.OLS(y,X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,eval,R-squared:,0.022
Model:,OLS,Adj. R-squared:,0.02
Method:,Least Squares,F-statistic:,10.56
Date:,"Tue, 26 Jan 2021",Prob (F-statistic):,0.00124
Time:,17:56:43,Log-Likelihood:,-378.5
No. Observations:,463,AIC:,761.0
Df Residuals:,461,BIC:,769.3
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,4.0690,0.034,121.288,0.000,4.003,4.135
female,-0.1680,0.052,-3.250,0.001,-0.270,-0.066

0,1,2,3
Omnibus:,17.625,Durbin-Watson:,1.209
Prob(Omnibus):,0.0,Jarque-Bera (JB):,18.97
Skew:,-0.496,Prob(JB):,7.6e-05
Kurtosis:,2.981,Cond. No.,2.47


**Conclusion:** Like the t-test, the p-value is less than the alpha (α) level = 0.05, so we reject the null hypothesis as there is evidence that there is a difference in mean evaluation scores based on gender. The coefficient -0.1680 means that females get 0.168 scores less than men.


### Regression with ANOVA: Using the teachers' rating data set, does beauty  score for instructors  differ by age?

State the Hypothesis:

-   $H_0: µ1 = µ2 = µ3$ (the three population means are equal)
-   $H_1:$ At least one of the means differ


Then we group the data like we did with ANOVA

In [6]:
ratings_df.loc[(ratings_df['age'] <= 40), 'age_group'] = '40 years and younger'
ratings_df.loc[(ratings_df['age'] > 40)&(ratings_df['age'] < 57), 'age_group'] = 'between 40 and 57 years'
ratings_df.loc[(ratings_df['age'] >= 57), 'age_group'] = '57 years and older'

Use OLS function from the statsmodel library

In [7]:
from statsmodels.formula.api import ols
lm = ols('beauty ~ age_group', data =ratings_df).fit()
table = sm.stats.anova_lm(lm)
print(table)

              df      sum_sq    mean_sq          F        PR(>F)
age_group    2.0   20.422744  10.211372  17.597559  4.322549e-08
Residual   460.0  266.925153   0.580272        NaN           NaN


**Conclusion:** We can also see the same values for ANOVA like before and we will reject the null hypothesis since the p-value is less than 0.05 there is significant evidence that at least one of the means differ

### Regression with ANOVA option 2


Create dummy variables - A dummy variable is a numeric variable that represents categorical data, such as gender, race, etc. Dummy variables are dichotomous, i.e they can take on only two quantitative values.


In [8]:
X = pd.get_dummies(ratings_df['age_group'])
X

Unnamed: 0,40 years and younger,57 years and older,between 40 and 57 years
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,0,1,0
...,...,...,...
458,1,0,0
459,0,1,0
460,0,0,1
461,1,0,0


In [9]:
y = ratings_df['beauty']

X = sm.add_constant(X)
model =sm.OLS(y,X).fit()
predictions = model.predict(X)

model.summary()

0,1,2,3
Dep. Variable:,beauty,R-squared:,0.071
Model:,OLS,Adj. R-squared:,0.067
Method:,Least Squares,F-statistic:,17.6
Date:,"Tue, 26 Jan 2021",Prob (F-statistic):,4.32e-08
Time:,17:56:43,Log-Likelihood:,-529.47
No. Observations:,463,AIC:,1065.0
Df Residuals:,460,BIC:,1077.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0138,0.028,0.496,0.620,-0.041,0.069
40 years and younger,0.3224,0.058,5.574,0.000,0.209,0.436
57 years and older,-0.2596,0.056,-4.621,0.000,-0.370,-0.149
between 40 and 57 years,-0.0489,0.045,-1.081,0.280,-0.138,0.040

0,1,2,3
Omnibus:,11.586,Durbin-Watson:,0.434
Prob(Omnibus):,0.003,Jarque-Bera (JB):,12.114
Skew:,0.394,Prob(JB):,0.00234
Kurtosis:,2.913,Cond. No.,5950000000000000.0


### Correlation: Using the teachers' rating dataset, Is teaching evaluation score correlated with beauty score?

State the hypothesis:

-   $H_0:$ Teaching evaluation score is not correlated with beauty score
-   $H_1:$ Teaching evaluation score is correlated with beauty score


In [10]:
## X is the input variables (or independent variables)
X = ratings_df['beauty']

## y is the target/dependent variable
y = ratings_df['eval']

## add an intercept (beta_0) to our model
X = sm.add_constant(X)

model = sm.OLS(y,X).fit()
prediction = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,eval,R-squared:,0.036
Model:,OLS,Adj. R-squared:,0.034
Method:,Least Squares,F-statistic:,17.08
Date:,"Tue, 26 Jan 2021",Prob (F-statistic):,4.25e-05
Time:,17:56:43,Log-Likelihood:,-375.32
No. Observations:,463,AIC:,754.6
Df Residuals:,461,BIC:,762.9
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,3.9983,0.025,157.727,0.000,3.948,4.048
beauty,0.1330,0.032,4.133,0.000,0.070,0.196

0,1,2,3
Omnibus:,15.399,Durbin-Watson:,1.238
Prob(Omnibus):,0.0,Jarque-Bera (JB):,16.405
Skew:,-0.453,Prob(JB):,0.000274
Kurtosis:,2.831,Cond. No.,1.27


<span style = 'color:red'>R-squared: <code>Sqrt(0.036 )</code> = Pearson R - Correlation coefficient = 0.189039
<span style = 'color:red'>Prob (F-statistic):	4.25e-05 = Pearson R - P value >> Reject Ho : Teaching evaluation score is correlated with beauty score

## Practice Questions

### Question 1: Using the teachers' rating data set, does tenure affect beauty scores?

-   Use α = 0.05


## hypothesis
-   $H_0: β1$ = 0 (tenure no effect on  beauty scores)
-   $H_1: β1$ is not equal to 0 (tenure has  effect on  beauty scores)

In [11]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 463 entries, 0 to 462
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   minority         463 non-null    object 
 1   age              463 non-null    int64  
 2   gender           463 non-null    object 
 3   credits          463 non-null    object 
 4   beauty           463 non-null    float64
 5   eval             463 non-null    float64
 6   division         463 non-null    object 
 7   native           463 non-null    object 
 8   tenure           463 non-null    object 
 9   students         463 non-null    int64  
 10  allstudents      463 non-null    int64  
 11  prof             463 non-null    int64  
 12  PrimaryLast      463 non-null    int64  
 13  vismin           463 non-null    int64  
 14  female           463 non-null    int64  
 15  single_credit    463 non-null    int64  
 16  upper_division   463 non-null    int64  
 17  English_speaker 

In [12]:
X = ratings_df['tenured_prof']
y = ratings_df['beauty']

X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
prediction = model.predict(X)

model.summary()

0,1,2,3
Dep. Variable:,beauty,R-squared:,0.0
Model:,OLS,Adj. R-squared:,-0.002
Method:,Least Squares,F-statistic:,0.1689
Date:,"Tue, 26 Jan 2021",Prob (F-statistic):,0.681
Time:,17:56:43,Log-Likelihood:,-546.45
No. Observations:,463,AIC:,1097.0
Df Residuals:,461,BIC:,1105.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0284,0.078,0.363,0.717,-0.125,0.182
tenured_prof,-0.0364,0.089,-0.411,0.681,-0.210,0.138

0,1,2,3
Omnibus:,23.184,Durbin-Watson:,0.461
Prob(Omnibus):,0.0,Jarque-Bera (JB):,23.229
Skew:,0.507,Prob(JB):,9.03e-06
Kurtosis:,2.583,Cond. No.,4.05


<span style = 'color : red'>P value :	0.681  > Alpha 0.005 : Accpet Ho = there is no evidence that the mean difference of tenured and untenured instructors are different

### Question 2: Using the teachers' rating data set, does being an English speaker affect the number of students assigned to professors?

-   Use "allstudents"
-   Use α = 0.05 and α = 0.1 


## hypothesis
-   $H_0: β1$ = 0 (English speaker no effect on  number of students assigned to professors)
-   $H_1: β1$ is not equal to 0 (English speaker has  effect on  number of students assigned to professors)

In [13]:
X = ratings_df['vismin']
y= ratings_df['allstudents']

X = sm.add_constant(X)

model = sm.OLS(y,X).fit()
prediction = model.predict(X)

model.summary()

0,1,2,3
Dep. Variable:,allstudents,R-squared:,0.009
Model:,OLS,Adj. R-squared:,0.007
Method:,Least Squares,F-statistic:,4.378
Date:,"Tue, 26 Jan 2021",Prob (F-statistic):,0.0369
Time:,17:56:43,Log-Likelihood:,-2653.7
No. Observations:,463,AIC:,5311.0
Df Residuals:,461,BIC:,5320.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,58.0902,3.745,15.513,0.000,50.731,65.449
vismin,-21.0746,10.072,-2.092,0.037,-40.867,-1.282

0,1,2,3
Omnibus:,428.368,Durbin-Watson:,0.716
Prob(Omnibus):,0.0,Jarque-Bera (JB):,10382.823
Skew:,4.112,Prob(JB):,0.0
Kurtosis:,24.693,Cond. No.,2.96


<span style = 'color : red'>P value :	0.0629  > Alpha 0.05 : Accpet Ho = no evidence that being a native English speaker or a non-native English speaker affects the number of students assigned to an instructor

####

<span style = 'color : red'>P value :	0.0629  < Alpha 0.1 : reject  Ho = there is evidence that there is a significant difference of mean number of students assigned to native English speakers vs non-native English speakers.

### Question 3: Using the teachers' rating data set, what is the correlation between the number of students who participated in the evaluation survey and evaluation scores?

-   Use "students" variable


In [14]:
X = ratings_df['students']
y = ratings_df['eval']

X = sm.add_constant(X)

model = sm.OLS(y,X).fit()
predict  = model.predict(X)

model.summary()

0,1,2,3
Dep. Variable:,eval,R-squared:,0.001
Model:,OLS,Adj. R-squared:,-0.001
Method:,Least Squares,F-statistic:,0.5806
Date:,"Tue, 26 Jan 2021",Prob (F-statistic):,0.446
Time:,17:56:43,Log-Likelihood:,-383.46
No. Observations:,463,AIC:,770.9
Df Residuals:,461,BIC:,779.2
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,3.9823,0.033,119.689,0.000,3.917,4.048
students,0.0004,0.001,0.762,0.446,-0.001,0.002

0,1,2,3
Omnibus:,15.259,Durbin-Watson:,1.198
Prob(Omnibus):,0.0,Jarque-Bera (JB):,16.283
Skew:,-0.456,Prob(JB):,0.000291
Kurtosis:,2.888,Cond. No.,74.8


<span style = 'color:red'>R-square is 0.001, R will be √0.001, correlation coefficient is 0.03 (close to 0). There is a very weak correlation between the number of students who participated in the evaluation survey and evaluation scores