In [30]:
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

#set up a large font for our pyplot
font = {'family' : 'Arial',
        'weight' : 'bold',
        'size'   : 22}

plt.rc('font', **font)

# IPython's default traceback mode is verbose, so let's set it to 'plain'
%xmode plain

Exception reporting mode: Plain


### Removing null values from dataframe and sorting

In [31]:
data = pd.read_csv("salary.csv")
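# Drop the row at position 208 (the row with the missing value).
# data.dropna() is a more robust alternative to a hard-coded position.
data = data.drop(data.index[208])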
data = data.sort_values('yearsworked')
data.tail()

Unnamed: 0,salary,exprior,yearsworked,yearsrank,market,degree,otherqual,position,male,Field,yearsabs
326,53686.0,0,31,17,0.78,1,0,3,1,1,1
365,82508.0,0,32,23,0.93,0,1,3,1,2,0
376,72152.0,0,33,22,1.0,1,0,3,1,2,2
364,66334.0,0,35,23,0.78,1,0,3,1,1,0
382,64109.0,0,41,28,0.91,1,0,3,1,2,0
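Rather than hard-coding a row position, the rows with missing values can be located and dropped programmatically. The following is a minimal sketch on a made-up toy frame (the `toy` values are invented for illustration and are not taken from `salary.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for salary.csv (values are made up for illustration)
toy = pd.DataFrame({
    "salary": [53000.0, np.nan, 61000.0, 48000.0],
    "yearsworked": [10, 4, 15, 2],
})

# Locate rows that contain at least one missing value
rows_with_nulls = toy[toy.isnull().any(axis=1)]
print(rows_with_nulls.index.tolist())

# Drop rows where the regression variables are missing, then sort
clean = toy.dropna(subset=["salary", "yearsworked"]).sort_values("yearsworked")
print(clean)
```

Restricting `dropna` to the columns used in the regression avoids discarding rows that are only missing values in unrelated columns.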


### a)	Run a simple linear regression for Salary with one predictor variable: Years Worked

In [32]:
x = data['yearsworked'] # Years worked (independent variable) as x-axis
y = data['salary'] # Salary (dependent variable) as y-axis

# Calculate slope, intercept, r-value, p-value, and standard error using scipy
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

# Calculate standard deviation of x and y
x_std = stats.tstd(x)
y_std = stats.tstd(y)

# Output all values
output  = "slope | intercept | r_value | p_value | x_std | y_std    | std_err\n"
output += '{:.2f}'.format(slope) + '|' + '{:9.2f}'.format(intercept) + '{:>3}'.format('|') + '{:5.2f}'.format(r_value)
output += '{:>5}'.format('|') + '{:5.2f}'.format(p_value) + '{:>5}'.format('|') + '{:5.2f}'.format(x_std)
output += '{:>3}'.format('|') + '{:9.2f}'.format(y_std) + '{:>2}'.format('|') + '{:6.2f}'.format(std_err)
print(output)

slope | intercept | r_value | p_value | x_std | y_std    | std_err
837.33| 40115.01  | 0.62    | 0.00    | 9.45  | 12685.13 | 46.44


### i.	What percentage of the variance in employees’ salaries is accounted for by the number of years they have worked?

In [33]:
# Percentage of variance explained is r-squared times 100
r_value ** 2 * 100

38.88630734995997

### >>>> About 38.9% of the variance in employees’ salaries is accounted for by the number of years they have worked. The relationship is statistically significant, because the p-value is well below our significance threshold of 0.05.
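To see why the squared correlation can be read as "variance explained", it may help to compute it from the residuals directly. The sketch below uses synthetic data (the `x_demo` and `y_demo` arrays are invented, not the salary dataset) and checks that 1 − SS_res / SS_tot matches the squared r-value from `stats.linregress`:

```python
import numpy as np
import scipy.stats as stats

# Synthetic data, loosely shaped like a years-worked vs salary relationship
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 40, size=200)
y_demo = 40000 + 800 * x_demo + rng.normal(0, 10000, size=200)

fit = stats.linregress(x_demo, y_demo)
predicted = fit.intercept + fit.slope * x_demo

# Unexplained variation (residuals) and total variation around the mean
ss_res = np.sum((y_demo - predicted) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(r_squared, fit.rvalue ** 2)
```

The two printed numbers should agree up to floating-point rounding.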

### ii.	Does the model significantly predict the dependent variable? Report the amount of variance explained (R2) and significance value (p) to support your answer.

In [34]:
print("R-value: " + str(r_value))
print("P-value: " + str(p_value))

R-value: 0.6235888657598047
P-value: 1.2873351342922937e-56


#### >>> Yes, the model significantly predicts the dependent variable: the p-value (1.29e-56) is well below our significance threshold of 0.05. The coefficient of determination is R² = 0.389, so years worked explains about 38.9% of the variance in salary.
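The p-value reported by `stats.linregress` comes from a t-test on the slope, with t = slope / std_err and n − 2 degrees of freedom. For the values reported above, t ≈ 837.33 / 46.44 ≈ 18.03. The following sketch reproduces the calculation on synthetic data (the arrays are invented for illustration), assuming the default two-sided test:

```python
import numpy as np
import scipy.stats as stats

# Synthetic data standing in for years worked and salary
rng = np.random.default_rng(1)
x_demo = rng.uniform(0, 40, size=100)
y_demo = 40000 + 800 * x_demo + rng.normal(0, 10000, size=100)

fit = stats.linregress(x_demo, y_demo)
n = len(x_demo)

# t-statistic for the null hypothesis that the slope is zero
t_stat = fit.slope / fit.stderr

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(t_stat, p_manual, fit.pvalue)
```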

### iii) Report and interpret the standard error of the estimate. (Hint: Compare the standard error of the estimate to the standard deviation for Salary, which you would have calculated in Question 1.a))

In [35]:
print('    std :', y_std)
print('std_err :', std_err)

    std : 12685.132357963685
std_err : 46.43634621012887


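#### >>> Note that the `std_err` returned by `stats.linregress` (46.44) is the standard error of the slope, not the standard error of the estimate. The standard error of the estimate is the typical distance between the actual and predicted salaries, and it can be approximated from the values above as y_std × sqrt(1 − R²) ≈ 12685.13 × sqrt(1 − 0.3889), which is roughly \$9 900. This is smaller than the standard deviation of salary (\$12 685), so using "yearsworked" improves on simply predicting the mean salary, although the typical prediction error remains large.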
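The standard error of the estimate is not returned by `stats.linregress`, but it can be computed from the residuals as sqrt(SS_res / (n − 2)). The sketch below does this on synthetic data (invented arrays, not the salary dataset) and also shows how it relates to the slope's standard error that `linregress` does return:

```python
import numpy as np
import scipy.stats as stats

# Synthetic data standing in for years worked and salary
rng = np.random.default_rng(2)
x_demo = rng.uniform(0, 40, size=150)
y_demo = 40000 + 800 * x_demo + rng.normal(0, 10000, size=150)

fit = stats.linregress(x_demo, y_demo)
residuals = y_demo - (fit.intercept + fit.slope * x_demo)
n = len(x_demo)

# Standard error of the estimate: typical size of a residual
see = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Standard error of the slope, derived from the standard error of the estimate
slope_se = see / np.sqrt(np.sum((x_demo - x_demo.mean()) ** 2))

print("standard error of the estimate:", see)
print("standard deviation of y       :", np.std(y_demo, ddof=1))
print("slope standard error          :", slope_se, fit.stderr)
```

Comparing `see` with the standard deviation of `y_demo` shows how much the regression line reduces the typical prediction error relative to always predicting the mean.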

### b) Examine the output from the Coefficients table:
### (i) What does the unstandardized coefficient (B) tell you about the relationship between Years Worked and Salary?

In [43]:
sample = pd.DataFrame(data["yearsworked"]).sort_values('yearsworked').reset_index()
sample = sample.loc[[6,44, 50, 100, 130, 150]]
predicted_salary = 0

def predict_salary(yearsworked):
    # Predicted salary from the fitted regression line
    return slope * yearsworked + intercept

for year in sample.yearsworked:
    last_salary = predicted_salary
    predicted_salary = predict_salary(year)
    difference = predicted_salary - last_salary
    print("Years Worked: " + str(year) + " | Predicted Salary = " + str(predicted_salary) + " | Difference = " + str(difference))

Years Worked: 0 | Predicted Salary = 40115.013787591495 | Difference = 40115.013787591495
Years Worked: 1 | Predicted Salary = 40952.34557759234 | Difference = 837.3317900008478
Years Worked: 2 | Predicted Salary = 41789.67736759318 | Difference = 837.3317900008406
Years Worked: 3 | Predicted Salary = 42627.00915759403 | Difference = 837.3317900008478
Years Worked: 4 | Predicted Salary = 43464.34094759487 | Difference = 837.3317900008406
Years Worked: 5 | Predicted Salary = 44301.67273759572 | Difference = 837.3317900008478


### >>> It tells us that each additional year worked is associated with a predicted increase of about \$837 in annual salary.

### (ii) Compare the standardised coefficient (Beta) for Years Worked to the Pearson’s (zero-order) correlation coefficient from question 1.c). What do you notice? Explain why this is so.

In [46]:
std_x = np.std(x)
std_y = np.std(y)

# Standardised coefficient: rescale the slope by the ratio of the standard deviations.
# The intercept is not part of this calculation.
beta = slope * std_x / std_y
print(round(beta, 4))
print(round(r_value, 4))

0.6236
0.6236


### >>> We notice that the standardised coefficient (Beta) is equal to Pearson's correlation coefficient (r ≈ 0.62). This is always the case in a simple linear regression with one predictor: Beta = B × (std of x / std of y), and the slope itself is B = r × (std of y / std of x), so the standard deviations cancel and Beta = r. In other words, a 1 standard deviation increase in years worked is associated with a 0.62 standard deviation increase in salary, which indicates a fairly strong positive relationship.
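The relationship between the standardised coefficient and Pearson's r can be checked numerically. The sketch below uses synthetic data (invented arrays, not the salary dataset) and computes Beta in two ways: by rescaling the unstandardised slope, and by regressing z-scores on z-scores.

```python
import numpy as np
import scipy.stats as stats

# Synthetic data standing in for years worked and salary
rng = np.random.default_rng(3)
x_demo = rng.uniform(0, 40, size=200)
y_demo = 40000 + 800 * x_demo + rng.normal(0, 10000, size=200)

fit = stats.linregress(x_demo, y_demo)

# Route 1: rescale the slope by the ratio of standard deviations
beta = fit.slope * np.std(x_demo) / np.std(y_demo)

# Route 2: standardise both variables, then fit the regression again
zx = (x_demo - x_demo.mean()) / x_demo.std()
zy = (y_demo - y_demo.mean()) / y_demo.std()
z_fit = stats.linregress(zx, zy)

print(beta, z_fit.slope, fit.rvalue)
```

All three printed values should agree up to floating-point rounding, and the intercept of the standardised fit should be effectively zero.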

### (iii) Calculate the expected salary for someone who graduated 12 years ago.

In [45]:
Predicted_salary_12 = slope * 12 + intercept
print(Predicted_salary_12 )

50162.99526760163


### >>> Using the trend line to make a prediction, a person with 12 years of experience is expected to earn about \$50 163 per annum.

### (iv) Calculate the expected salary for someone who graduated 80 years ago. Are there any problems with this prediction? If so, what are they?

In [44]:
p80 = slope * 80 + intercept
print(p80)

107101.55698765906


### >>> Following the same trend line, a person with 80 years of working experience is predicted to earn about \$107 102 per annum. However, this prediction is an extrapolation far beyond the data: the longest tenure in the dataset is 41 years, so the model gives no evidence that the linear trend continues up to 80 years. The scenario is also unrealistic in itself, because a person who has worked for 80 years is likely to be around 100 years old.
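One way to guard against this kind of problem is to flag predictions that fall outside the observed range of the predictor. The sketch below is illustrative only: `predict_with_range_check` is a hypothetical helper, the slope and intercept are the rounded values reported earlier, and the range of 0 to 41 years is read from the sorted data shown above.

```python
# Rounded regression values reported earlier in the notebook
SLOPE = 837.33
INTERCEPT = 40115.01

# Observed range of yearsworked, as seen in the sorted data
X_MIN, X_MAX = 0, 41


def predict_with_range_check(years, slope, intercept, x_min, x_max):
    """Return (predicted salary, True if the input is outside the observed range)."""
    prediction = slope * years + intercept
    is_extrapolated = not (x_min <= years <= x_max)
    return prediction, is_extrapolated


pred_12, extrap_12 = predict_with_range_check(12, SLOPE, INTERCEPT, X_MIN, X_MAX)
pred_80, extrap_80 = predict_with_range_check(80, SLOPE, INTERCEPT, X_MIN, X_MAX)

print("12 years:", round(pred_12, 2), "| extrapolated:", extrap_12)
print("80 years:", round(pred_80, 2), "| extrapolated:", extrap_80)
```

The 12-year prediction lies inside the observed range, while the 80-year prediction is flagged as an extrapolation.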

### c) We have only looked at the number of years an employee has worked. What other employee characteristics might influence their salary?

### >>> Position, Field of work, gender, marital status, level of education, experience prior to current position, market value and years worked at current position.