# Problem Set 3: Univariate Regressions

### Summary and Motivation

Problem Set 3 is designed to introduce students to the basics of regression analysis, a fundamental tool in data analytics and econometrics. By using the NLSY97 dataset, students will explore relationships between key variables such as education, gender, and annual income. This problem set emphasizes practical experience with Python and the statsmodels library, reinforcing theoretical knowledge through hands-on data manipulation and analysis. By completing these exercises, students will gain a deeper understanding of univariate regression techniques, preparing them for more advanced multivariate analyses and real-world applications in economics and business.

### Instructions
The dataset data_NLSY97.xlsx, available on CANVAS, contains all the necessary information. These data are sourced from the NLSY97, a nationally representative database of individuals born in the early 1980s in the United States. Below are some variables we will use:
- educ: number of years of education completed
- Annual_Income: annual income that these individuals received as adults
- gender: denotes the gender of the individual
- minority: 1 if the individual belongs to a minority group, 0 otherwise.
- m_degree_4: 1 if the individual’s mother has a college degree, 0 otherwise.
- family income: Annual family income when these individuals were teenagers, reported in thousands.
- gpain8: GPA in 8th grade.
- retention: 1 if the individual was required to repeat a grade during middle school, 0 otherwise.

Please follow the questions and instructions below to complete this problem set. Some questions require you to write and execute Python code for data analysis in Cell mode; comment your code to explain each step. Other questions require a written discussion; for these, provide a detailed discussion of your results, including interpretations and answers to the questions, in Raw mode.

Once you have completed the assignment, save your Jupyter notebook with the following naming convention: ECN310_ProblemSetX_LastName_FirstName.ipynb (replace X with the assignment number).

### Exercise 1: Load and Explore the Dataset

- This is the same database we used for our previous problem set. Each row contains information for one individual in the sample (see the previous problem set for the variable definitions).
- Load the needed libraries and dataset data_NLSY97.xlsx.
- Display the first 10 rows of the dataset.

1. Code that loads the needed libraries and dataset.

In [1]:
# Please write your executable code here
import pandas as pd
import numpy as np

df = pd.read_excel("data_NLSY97.xlsx", index_col=0)

2. Code that displays the dataset's variable names and types and displays the first 10 rows.

In [2]:
# Please write your executable code here
# print all columns. Source - https://stackoverflow.com/questions/49188960/how-to-show-all-columns-names-on-a-large-pandas-dataframe
print(df.columns.tolist())
# Print first 10 rows - I am not using print because the display is not as neat and goes to new line.
df[0:10]

['educ', 'gpa_grade_8', 'retention', 'annual_income', 'total_weeks_exper', 'black', 'hispanic', 'white', 'mother_educ', 'minority', 'gender']


PUBID - YTH ID CODE 1997,educ,gpa_grade_8,retention,annual_income,total_weeks_exper,black,hispanic,white,mother_educ,minority,gender
1,16.0,3.0,0.0,,,0.0,0.0,1.0,0.0,0.0,female
2,14.0,3.5,0.0,115000.0,965.0,0.0,1.0,0.0,0.0,1.0,male
3,14.0,3.0,0.0,,776.0,0.0,1.0,0.0,0.0,1.0,female
4,12.0,4.0,0.0,45000.0,1008.0,0.0,1.0,0.0,0.0,1.0,female
5,12.0,2.5,0.0,150000.0,890.0,0.0,1.0,0.0,0.0,1.0,male
6,12.0,1.5,0.0,25000.0,791.0,0.0,1.0,0.0,0.0,1.0,female
7,11.0,2.0,0.0,,192.0,0.0,1.0,0.0,0.0,1.0,male
8,16.0,4.0,0.0,70000.0,906.0,0.0,0.0,1.0,0.0,0.0,female
9,17.0,3.5,0.0,120000.0,883.0,0.0,0.0,1.0,0.0,0.0,male
10,16.0,4.0,0.0,135000.0,680.0,0.0,0.0,1.0,0.0,0.0,male


3. Provide summary (descriptive) statistics for the dataset.

In [3]:
# Please write your executable code here

# we can remove print for a neater view.
print(df.describe())

              educ  gpa_grade_8    retention  annual_income  \
count  8151.000000  8601.000000  8969.000000    5201.000000   
mean     13.031162     2.803511     0.059204   62888.006729   
std       2.589030     0.848489     0.236019   59616.290306   
min       5.000000     0.500000     0.000000       0.000000   
25%      12.000000     2.000000     0.000000   30000.000000   
50%      12.000000     3.000000     0.000000   50000.000000   
75%      15.000000     3.500000     0.000000   79000.000000   
max      20.000000     4.000000     1.000000  380288.000000   

       total_weeks_exper        black     hispanic        white  mother_educ  \
count        8030.000000  8649.000000  8649.000000  8649.000000  8290.000000   
mean          668.770984     0.269973     0.219794     0.510232     0.169240   
std           323.591953     0.443971     0.414131     0.499924     0.374986   
min             0.000000     0.000000     0.000000     0.000000     0.000000   
25%           443.250000     0.0

### Exercise 2: Data Cleaning

- Check for string variables.

2. Convert string variables into integer ones (you will need to create a new variable).

In [4]:
# Please write your executable code here

# I used a pandas function to find which columns are strings.
# this function returns columns of Object type.
# In our case strings are the only objects in the data.
# Source - https://saturncloud.io/blog/how-to-check-pandas-dataframe-column-for-string-type/#using-the-select_dtypes-method
string_columns = df.select_dtypes(include=['object']).columns
print(f"String columns: {string_columns}")

# gender is the only string variable
df['gender_as_a_number'] = df['gender'].map({'female': 0, 'male': 1})

String columns: Index(['gender'], dtype='object')
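
After a dictionary-based `map`, it can be worth checking that no category was left unmapped, because pandas returns NaN for any value that is missing from the dictionary. The following is a minimal sketch on a small made-up Series; the values, including the `'Male'` capitalization variant, are hypothetical and do not come from the NLSY97 file.

```python
import pandas as pd

# Hypothetical toy data, not from data_NLSY97.xlsx
gender = pd.Series(['female', 'male', 'Male', None])

mapping = {'female': 0, 'male': 1}
as_number = gender.map(mapping)

# Values present in the data but absent from the mapping become NaN.
# Exclude rows that were already missing before the mapping.
unmapped = gender[as_number.isna() & gender.notna()].unique().tolist()
print(unmapped)
```

An empty list would suggest that every non-missing category was covered by the mapping.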


### Exercise 3: Create New Variables

- Create a new variable named `male` (using the variable gender) that takes value 1 if the individual is male, 0 otherwise.
- Create a new variable named `college` that takes value 1 if the individual completed at least 13 years of education, 0 otherwise.

1. Code that creates `male` variable

In [5]:
# Please write your executable code here

# takes value 1 if the individual is a male, 0 otherwise (same as what I did in the last cell)
df['male'] = df['gender'].map({'female': 0, 'male': 1})

2. Code that creates `college` variable

In [6]:
# Please write your executable code here
def college(years):
    # A missing value (NaN) fails the >= comparison, so rows with missing educ are coded 0 here.
    return 1 if years >= 13 else 0

# takes value 1 if the individual completed at least 13 years of education, 0 otherwise
df['college'] = df['educ'].apply(college)
df

PUBID - YTH ID CODE 1997,educ,gpa_grade_8,retention,annual_income,total_weeks_exper,black,hispanic,white,mother_educ,minority,gender,gender_as_a_number,male,college
1,16.0,3.0,0.0,,,0.0,0.0,1.0,0.0,0.0,female,0,0,1
2,14.0,3.5,0.0,115000.0,965.0,0.0,1.0,0.0,0.0,1.0,male,1,1,1
3,14.0,3.0,0.0,,776.0,0.0,1.0,0.0,0.0,1.0,female,0,0,1
4,12.0,4.0,0.0,45000.0,1008.0,0.0,1.0,0.0,0.0,1.0,female,0,0,0
5,12.0,2.5,0.0,150000.0,890.0,0.0,1.0,0.0,0.0,1.0,male,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9018,8.0,2.0,0.0,90000.0,942.0,0.0,0.0,1.0,0.0,0.0,female,0,0,0
9019,13.0,1.5,1.0,43000.0,824.0,0.0,1.0,0.0,0.0,1.0,male,1,1,1
9020,14.0,3.5,0.0,,494.0,0.0,0.0,1.0,0.0,0.0,male,1,1,1
9021,13.0,1.5,0.0,47000.0,965.0,0.0,0.0,1.0,0.0,0.0,male,1,1,1
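
An alternative way to build the `college` dummy is a vectorized comparison that keeps missing education values missing instead of coding them as 0. This is only a sketch under the assumption that missing `educ` should stay missing; the small `toy` DataFrame is hypothetical and stands in for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from data_NLSY97.xlsx
toy = pd.DataFrame({'educ': [16.0, 12.0, 13.0, np.nan]})

# Vectorized comparison gives True/False; cast to float so NaN can be stored
toy['college'] = (toy['educ'] >= 13).astype(float)
# Restore missingness: NaN >= 13 is False, which would otherwise be coded 0
toy.loc[toy['educ'].isna(), 'college'] = np.nan

print(toy)
```

Whether missing values should be kept or coded as 0 is a judgment call for the analysis; the sketch only makes the choice explicit.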


### Exercise 4: Perform a Univariate Regression Analysis

- Using statsmodels, perform a univariate regression analysis with annual income as the dependent variable and educ as the independent variable.
- Print the summary of the regression model.

1. Code for fitting the regression model.

In [7]:
# Please write your executable code here
import statsmodels.api as sm
import statsmodels.formula.api as smf

# run simple linear regression
results = smf.ols('annual_income ~ educ', data=df).fit()

2. Output of the regression model summary.

In [8]:
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:          annual_income   R-squared:                       0.116
Model:                            OLS   Adj. R-squared:                  0.116
Method:                 Least Squares   F-statistic:                     662.0
Date:                Sun, 06 Oct 2024   Prob (F-statistic):          2.74e-137
Time:                        22:59:56   Log-Likelihood:                -62448.
No. Observations:                5054   AIC:                         1.249e+05
Df Residuals:                    5052   BIC:                         1.249e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -4.549e+04   4288.417    -10.607      0.0

### Exercise 5: Interpret the Regression Output

- Based on the regression output from Exercise 4, answer the following:
    - Interpret the coefficient corresponding to the constant; why does it not make sense?
    - Interpret the coefficient for educ
    - What is the R-squared value, and what does it tell you about the model?
    - Is the relationship between annual income and educ statistically significant?
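
The quantities these questions ask about can also be read directly from a fitted statsmodels result rather than from the printed table. The sketch below uses synthetic data (the variables `x` and `y` and all numbers are made up for illustration, not taken from the NLSY97 sample) to show which attributes hold the intercept, slope, R-squared, p-value, and confidence interval.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only (not the NLSY97 sample)
rng = np.random.default_rng(0)
x = rng.integers(8, 21, size=200)
y = 1000 + 500 * x + rng.normal(0, 2000, size=200)
toy = pd.DataFrame({'x': x, 'y': y})

fit = smf.ols('y ~ x', data=toy).fit()

intercept = fit.params['Intercept']        # predicted y when x = 0
slope = fit.params['x']                    # change in predicted y per one-unit increase in x
r_squared = fit.rsquared                   # share of the variance in y explained by x
p_value = fit.pvalues['x']                 # p-value for the null hypothesis slope = 0
ci_low, ci_high = fit.conf_int().loc['x']  # 95% confidence interval for the slope

print(f"intercept={intercept:.1f}, slope={slope:.1f}")
print(f"R-squared={r_squared:.3f}, p-value={p_value:.3g}")
print(f"95% CI for slope: [{ci_low:.1f}, {ci_high:.1f}]")
```

The same attribute names should apply to the `results` object from Exercise 4, with `'educ'` in place of `'x'`.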

1. Interpretation of the coefficients.

2. Explanation of the R-squared value.

3. Discussion on the statistical significance of the relationship between annual_income and educ.

4. Create the residual plot for this regression and discuss any patterns or issues you observe in the plot. (Optional question)

In [9]:
# Please write your executable code here

# OPTIONAL LEFT BLANK
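
For reference, one common way to draw a residual plot is to scatter the residuals against the fitted values. The sketch below is not the submitted answer; it assumes matplotlib is installed and uses synthetic data whose noise grows with `x`, so the variable names and numbers are illustrative only.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Synthetic data for illustration only (not the NLSY97 sample)
rng = np.random.default_rng(1)
x = rng.uniform(8, 20, size=300)
# Noise grows with x so that the plot shows a visible fan shape
y = 2000 * x + rng.normal(0, 300 * x, size=300)
toy = pd.DataFrame({'x': x, 'y': y})

fit = smf.ols('y ~ x', data=toy).fit()

# Residuals on the vertical axis, fitted values on the horizontal axis
fig, ax = plt.subplots()
ax.scatter(fit.fittedvalues, fit.resid, s=10, alpha=0.5)
ax.axhline(0, color='black', linewidth=1)  # reference line at zero residual
ax.set_xlabel('Fitted values')
ax.set_ylabel('Residuals')
ax.set_title('Residuals vs. fitted values')
```

A spread of residuals that widens as the fitted values increase, as in this synthetic case, is a typical sign of heteroskedasticity.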

### Exercise 6: Perform a Univariate Regression Analysis

- Now, perform a univariate regression analysis with educ as the dependent variable and male as the independent variable.
- Print the summary of the regression model.
- Interpret the results (i.e., coefficients, statistical significance, and R-squared)
- Perform another univariate regression analysis with annual income as the dependent variable and male as the independent variable.
- Print the summary of the regression model.
- Interpret the results (i.e., coefficients, statistical significance, and R-squared)

_Hints_: the variable male has a value of either 0 or 1. The "one unit increase in x" argument becomes much easier: "if you change x from 0 to 1, what happens to y?" 
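
To make the hint concrete: with a 0/1 regressor, the OLS intercept equals the mean of y in the group coded 0, and the slope equals the difference between the two group means. The sketch below checks this on a tiny made-up dataset (the `toy` DataFrame and its values are hypothetical, not from the NLSY97 file).

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data, not from data_NLSY97.xlsx
toy = pd.DataFrame({
    'male': [0, 0, 0, 1, 1, 1],
    'y':    [10.0, 12.0, 14.0, 20.0, 22.0, 24.0],
})

fit = smf.ols('y ~ male', data=toy).fit()
group_means = toy.groupby('male')['y'].mean()

# Intercept = mean of y when male == 0
# Slope     = mean of y when male == 1 minus mean of y when male == 0
print(fit.params)
print(group_means)
```

Here the group means are 12 and 22, so the fitted intercept is 12 and the coefficient on `male` is 10.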

1. Code for fitting the regression model (educ ~ male).

In [10]:
# Please write your executable code here

# run simple linear regression
results = smf.ols('educ ~ male', data=df).fit()

2. Output of the regression model summary (educ ~ male).

In [11]:
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                   educ   R-squared:                       0.014
Model:                            OLS   Adj. R-squared:                  0.014
Method:                 Least Squares   F-statistic:                     116.1
Date:                Sun, 06 Oct 2024   Prob (F-statistic):           6.72e-27
Time:                        22:59:56   Log-Likelihood:                -19262.
No. Observations:                8151   AIC:                         3.853e+04
Df Residuals:                    8149   BIC:                         3.854e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     13.3441      0.041    328.104      0.0

3. Interpretation of the results (educ ~ male).

4. Code for fitting the regression model (annual income ~ male).

In [12]:
# Please write your executable code here

# run simple linear regression
results = smf.ols('annual_income ~ male', data=df).fit()

5. Output of the regression model summary (annual income ~ male).

In [13]:
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:          annual_income   R-squared:                       0.027
Model:                            OLS   Adj. R-squared:                  0.027
Method:                 Least Squares   F-statistic:                     145.2
Date:                Sun, 06 Oct 2024   Prob (F-statistic):           5.20e-33
Time:                        22:59:56   Log-Likelihood:                -64496.
No. Observations:                5201   AIC:                         1.290e+05
Df Residuals:                    5199   BIC:                         1.290e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   5.296e+04   1159.095     45.691      0.0

6. Interpretation of the results (annual income ~ male).