# Problem Set 3: Univariate Regressions

## 5 pts per question, 100 points out of 90 in total, with 10 bonus points.

### Summary and Motivation

Problem Set 3 is designed to introduce students to the basics of regression analysis, a fundamental tool in data analytics and econometrics. Using the NLSY97 dataset, students will explore relationships between key variables such as education, gender, and annual income. This problem set emphasizes practical experience with Python and the statsmodels library, reinforcing theoretical knowledge through hands-on data manipulation and analysis. By completing these exercises, students will gain a deeper understanding of univariate regression techniques, preparing them for more advanced multivariate analyses and real-world applications in economics and business.

### Instructions
The dataset `data_NLSY97.xlsx`, available on CANVAS, contains all necessary information. These data are sourced from the NLSY97, a nationally representative database of individuals born in the early 1980s in the United States. Below are some variables we will use:
- educ: number of years of education completed
- gender: denotes the gender of the individual
- minority: 1 if the individual belongs to a minority group, 0 otherwise.
- mother_educ: 1 if the individual’s mother has a college degree, 0 otherwise.
- annual_income: annual income from wages.
- gpain8: GPA in 8th grade.
- retention: 1 if the individual was required to repeat a grade during middle school, 0 otherwise.

Please follow the questions and instructions below to complete this problem set. For questions that ask for code, write and execute Python code in Code cells, and comment your code to explain each step. Other questions call for a written discussion: provide a detailed discussion of your results, including interpretations and answers to the questions, in Raw cells.

Once you have completed the assignment, save your Jupyter notebook with the following naming convention: ECN394_ProblemSetX_LastName_FirstName.ipynb (replace X with the assignment number).

### Exercise 1: Load and Explore the Dataset

- This is the same database we used for our previous problem set. Each row contains information for one individual in the sample (check the previous problem set for the definitions of the variables).
- Load the needed libraries and the dataset `data_NLSY97.xlsx`.
- Display the first 10 rows of the dataset.

1. Code that loads the needed libraries and dataset.

In [1]:
# Please write your executable code here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# import statsmodels
import statsmodels.api as sm
# import the module that allows us to use the formula notation
import statsmodels.formula.api as smf
df = pd.read_excel('data_NLSY97.xlsx')


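`pd.read_excel` raises `FileNotFoundError` when the path does not resolve relative to the notebook's working directory. A minimal sketch of a loader that fails early with a more readable message is shown here; `load_nlsy` is a hypothetical helper name, and the sketch assumes the workbook has been downloaded next to the notebook. Reading `.xlsx` files with pandas also typically requires the `openpyxl` package to be installed.

```python
from pathlib import Path

import pandas as pd


def load_nlsy(path="data_NLSY97.xlsx"):
    """Read the workbook, failing early with a readable message if it is missing.

    Hypothetical helper for illustration; the default path is an assumption.
    """
    file_path = Path(path)
    if not file_path.exists():
        raise FileNotFoundError(
            f"{file_path.name} not found in {Path.cwd()}; "
            "download it and place it next to the notebook."
        )
    return pd.read_excel(file_path)
```

Calling `df = load_nlsy()` then behaves like the `pd.read_excel` call above whenever the file is present.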

2. Code that displays the dataset's variable names and types and the first 10 rows.

In [None]:
# Please write your executable code here
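print(df.dtypes)  # variable names and their data types
df.head(10)       # first 10 rows; head() alone returns only 5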


3. Print information about the dataset that shows how many non-missing values each column has and the data type of each column.

In [None]:
df.info()

4. Does everything look as expected? `float64` columns hold numbers with decimal points. `object` columns hold non-numeric data. `int64` columns hold integers, or numbers without decimal points. Comment on what you see below.
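To make the three data types concrete, here is a small sketch on a made-up table (the values and the `toy` name are hypothetical, not taken from NLSY97). It also shows one common reason a column that "should" be an integer appears as `float64`: a missing value.

```python
import pandas as pd

# Hypothetical toy data for illustration only (NOT the NLSY97 file)
toy = pd.DataFrame({
    "educ": [12, 16, 14],                    # whole numbers -> integer dtype
    "gpa": [3.1, 3.8, 2.9],                  # decimals -> float64
    "gender": ["male", "female", "female"],  # text -> object (a string dtype in newer pandas)
})
print(toy.dtypes)

# A missing value forces an otherwise whole-number column to float64
educ_with_gap = pd.Series([12, None, 14])
print(educ_with_gap.dtype)
```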

5. Provide summary (descriptive) statistics for the dataset.

In [None]:
# Please write your executable code here
df.describe()

### Exercise 2: Create New Variables

6. Create a new variable named `male` (using the variable gender) that takes value 1 if the individual is a male, 0 otherwise.

In [None]:
# Please write your executable code here
# Compare gender to the label 'male' and convert True/False to 1/0
df['male'] = (df['gender'] == 'male').astype(int)
df['male']

7. Create a new variable named `college` that takes value 1 if the individual completed at least 13 years of education, 0 otherwise.

In [None]:
# Please write your executable code here
df['college'] = (df['educ'] >= 13).astype(int)
df['college']
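A quick way to sanity-check newly created indicators is to cross-tabulate each one against its source variable. The sketch below uses a made-up four-row table (the `toy` data and its gender labels are assumptions for illustration; the labels in the actual file may differ).

```python
import pandas as pd

# Hypothetical toy data for illustration only (NOT the NLSY97 file)
toy = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "educ": [12, 13, 16, 10],
})

# Same construction as in the exercise: boolean comparison, then cast to 0/1
toy["male"] = (toy["gender"] == "male").astype(int)
toy["college"] = (toy["educ"] >= 13).astype(int)

# Each gender label should map to exactly one value of the indicator
check = pd.crosstab(toy["gender"], toy["male"])
print(check)

# Within each college group, the educ range should respect the 13-year cutoff
print(toy.groupby("college")["educ"].agg(["min", "max"]))
```

If any label lands in both columns of the cross-tabulation, or the `educ` ranges overlap the cutoff, the indicator was built incorrectly.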


### Exercise 4: Perform a Univariate Regression Analysis

- Using statsmodels, perform a univariate regression analysis with `annual_income` as the dependent variable and `educ` as the independent variable.
- Print the summary of the regression model.

8. Code for fitting the regression model using the formula notation

In [None]:
# Please write your executable code here
results1 = smf.ols('annual_income ~ educ', data=df).fit()

9. Write code to print a summary of the output

In [None]:
# Please write your code here
print(results1.summary())

### Exercise 5: Interpret the Regression Output

- Based on the regression output from Exercise 4, answer the following:


10. Interpret the coefficient corresponding to the constant.

11. Interpret the coefficient on `educ`.

12. Interpret the value of R-squared.

13. Is the relationship between education and annual income statistically significant?  Explain how you came to this conclusion. 
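The quantities asked about in questions 10 to 13 can also be read directly off a fitted statsmodels results object instead of the printed table. The following is a minimal sketch on synthetic data: the numbers are generated for illustration and are not NLSY97 results, and `toy`, `income`, and `fit` are hypothetical names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only (NOT the NLSY97 sample)
rng = np.random.default_rng(0)
n = 500
educ = rng.integers(8, 21, size=n)                      # years of schooling, 8..20
income = 5000 + 3000 * educ + rng.normal(0, 10000, n)   # true slope is 3000
toy = pd.DataFrame({"educ": educ, "income": income})

fit = smf.ols("income ~ educ", data=toy).fit()

intercept = fit.params["Intercept"]   # predicted income when educ == 0
slope = fit.params["educ"]            # change in predicted income per extra year
p_value = fit.pvalues["educ"]         # p-value for the null hypothesis slope == 0
r_squared = fit.rsquared              # share of the variance in income explained
ci_low, ci_high = fit.conf_int().loc["educ"]  # 95% confidence interval for the slope

print(f"intercept={intercept:.1f}, slope={slope:.1f}")
print(f"p-value={p_value:.3g}, R^2={r_squared:.3f}")
print(f"95% CI for slope: [{ci_low:.1f}, {ci_high:.1f}]")
```

A slope is conventionally called statistically significant at the 5% level when its p-value is below 0.05, or equivalently when its 95% confidence interval excludes zero.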

14. Create a residual plot for this regression, then discuss any patterns or issues you observe in the plot.

*Hint*:
- The statsmodels results object does not include a built-in plot method for this, but the residuals are an attribute of the results. If you named the object that contains the regression results `results`, then `results.resid` returns a pandas Series of residuals.
- You can plot these using matplotlib. As usual, start with `fig, ax = plt.subplots()` and then follow with `ax.plot(results.resid, 'o')`. The `'o'` draws the plot as dots instead of a line.

In [None]:
# Please write your executable code here
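fig, ax = plt.subplots()
ax.plot(results1.resid, 'o')  # each residual as a dot, plotted against its row index
ax.axhline(0, color='black', linewidth=1)  # reference line at zero
ax.set_xlabel('Observation index')
ax.set_ylabel('Residual')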
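The hint above plots residuals against the row index. A common alternative, sketched here as an option and not as the required answer, plots residuals against the fitted values (`fittedvalues` is an attribute of the statsmodels results object), which makes non-constant variance easier to spot. The data are synthetic and built so that the noise grows with `x`; `toy`, `x`, and `y` are hypothetical names.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Synthetic data for illustration only: the noise spread grows with x
rng = np.random.default_rng(1)
x = rng.uniform(8, 20, size=300)
y = 2.0 + 1.5 * x + rng.normal(0, 0.3 * x)   # heteroskedastic by construction
toy = pd.DataFrame({"x": x, "y": y})

fit = smf.ols("y ~ x", data=toy).fit()

fig, ax = plt.subplots()
ax.scatter(fit.fittedvalues, fit.resid, s=10)   # residuals vs. fitted values
ax.axhline(0, color="black", linewidth=1)       # reference line at zero
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
ax.set_title("Residuals vs. fitted (synthetic data)")
```

A fan shape that widens to the right, which this synthetic example produces by construction, suggests that the error variance is not constant.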

### Exercise 6: Perform a Univariate Regression Analysis

- Perform a univariate regression analysis with educ as the dependent variable and male as the independent variable.
- Perform another univariate regression analysis with `annual_income` as the dependent variable and `male` as the independent variable.
- Print the summary of each regression model.
- Interpret the results (i.e., coefficients, statistical significance, and R-squared).

15. Write the code for fitting a regression model with `educ` as the dependent variable and `male` as the independent variable using formula notation.

In [None]:
# Please write your executable code here
results2 = smf.ols('educ ~ male', data=df).fit()

16. Print the output of the regression model.

In [None]:
# Please write your code to generate the summary here
print(results2.summary())

17. Interpret the results, including coefficients, statistical significance, and $R^2$.
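When the only regressor is a 0/1 indicator such as `male`, the OLS intercept equals the sample mean of the outcome for the group coded 0, and the slope equals the difference in sample means between the two groups. The sketch below checks this on synthetic data; the group gap of -0.4 and all other numbers are made up for illustration and say nothing about the NLSY97 results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only (NOT the NLSY97 sample)
rng = np.random.default_rng(2)
male = rng.integers(0, 2, size=400)                  # 0/1 indicator
educ = 13.5 - 0.4 * male + rng.normal(0, 2.5, 400)   # made-up group gap
toy = pd.DataFrame({"male": male, "educ": educ})

fit = smf.ols("educ ~ male", data=toy).fit()

# Group means computed directly, without any regression
group_means = toy.groupby("male")["educ"].mean()
mean_gap = group_means.loc[1] - group_means.loc[0]

print(f"intercept = {fit.params['Intercept']:.4f}, "
      f"mean for male == 0 = {group_means.loc[0]:.4f}")
print(f"slope     = {fit.params['male']:.4f}, "
      f"difference in means  = {mean_gap:.4f}")
```

The two numbers on each printed line should agree up to floating-point rounding.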

18. Write the code for fitting a regression model with `annual_income` as the dependent variable and `male` as the independent variable using formula notation.

In [None]:
# Please write your executable code here
results3 = smf.ols('annual_income ~ male', data=df).fit()


19. Print the output of the regression model.

In [None]:
# Please write your code to generate the summary here
print(results3.summary())

20. Interpret the results, including coefficients, statistical significance, and $R^2$.