# Linear Regression Model for Salary Prediction 

##### The following notebook uses a linear regression model, a supervised machine learning technique, to predict the salaries of employees based on the number of years they have worked. Because the number of years worked is the only feature considered, this is known as univariate linear regression. Linear regression is based on the straight line equation y = bX + a, where b is the slope coefficient and a is the constant term (intercept).
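##### As a minimal sketch of the straight line equation, the slope b and intercept a can be estimated with ordinary least squares. The arrays below are made-up illustrative values, not the notebook's salary data, and the variable names are assumptions.

```python
import numpy as np

# Hypothetical toy data (not from salary.csv): years worked and salary
years = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
salary = np.array([40000.0, 42000.0, 44000.0, 46000.0, 48000.0])

# Ordinary least squares for a single feature:
# b = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
# a = mean_y - b * mean_x
x_mean, y_mean = years.mean(), salary.mean()
b = np.sum((years - x_mean) * (salary - y_mean)) / np.sum((years - x_mean) ** 2)
a = y_mean - b * x_mean

print(f"slope b = {b:.2f}, intercept a = {a:.2f}")
```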

##### Notebook Contents 
##### 1. Exploratory Data Analysis 
##### 1.1 Data description 
##### 1.2 Exploratory Data Analysis: the data is also visualized for distributions and checked for linearity to apply the linear regression model. Visualizations are checked for distributions, patterns and outliers 

##### 2. Linear Regression Model
##### 2.1 Scikit-learn Linear Regression model: "Years worked" is the feature and "salary" is the target value

##### 3. Scoring, Predictions and Evaluation
##### 3.1 Scoring and Predictions: The model is scored and used to predict the salaries of people with 2, 12 and 80 years work experience, respectively 
##### 3.2 Model Evaluation: the model metrics are analysed 
##### 3.3 Model Improvement 

##### 4. Conclusion 
##### The analysis conclusion can be found in the write up

## Exploratory Data Analysis 
### Data description 

In [None]:
# Packages are imported 
import pandas as pd
import matplotlib.pyplot as plt   
import numpy as np
import seaborn as sns
%matplotlib inline
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

In [None]:
# A dataframe is created from the salary data 
salaries = pd.read_csv("salary.csv")
salaries.head()

In [None]:
salaries.isnull().sum()

In [None]:
nan_salary = salaries[salaries['salary'].isnull()]
nan_salary

##### The null value is replaced with the median salary values of people who have the same number of years worked, degree, position, and gender as the missing salary value. This ensures the profile or demographics are as close as possible to that of the missing value. Also, the Linear Regression model does not handle NaN values.

In [None]:
# The null value is replaced with an estimated guess
estimated_salary = salaries['salary'].loc[(salaries['yearsworked'] == nan_salary['yearsworked'].values[0]) & 
                            (salaries['degree'] == nan_salary['degree'].values[0]) & 
                            (salaries['position'] == nan_salary['position'].values[0]) &
                            (salaries['male'] == nan_salary['male'].values[0])].median()

# Assign the result back instead of using inplace=True on a column selection
salaries['salary'] = salaries['salary'].fillna(estimated_salary)

In [None]:
# The mean salary for no full years of work experience is calculated 
salary_with_no_work_experience = salaries['salary'].loc[(salaries['yearsworked'] == 0)]
salary_with_no_work_experience.mean()

### Exploratory Data Analysis: the data is also visualized for distributions and checked for linearity to apply the linear regression model. Visualizations are checked for distributions, patterns and outliers 

In [None]:
def reshape(scalar_series):
    '''Reshape a single sample (int or float) or a pandas Series into the 2D array
    expected by the modelling methods.'''
    if isinstance(scalar_series, (int, float)):
        # A single sample becomes one row with one feature
        scalar_series = np.array(scalar_series).reshape(1, -1)
    else:
        assert isinstance(scalar_series, pd.Series)
        # A Series becomes one column with one row per sample
        scalar_series = np.array(scalar_series).reshape(-1, 1)

    return scalar_series

In [None]:
# Visualize the data to check whether a linear relationship is observed
X = salaries.yearsworked
y = salaries.salary

plt.title('The relationship between salaries and the number of years worked')
# Keyword arguments are used because newer seaborn releases no longer accept positional x and y
sns.regplot(x = X, y = y);

##### The scatterplot and trendline above show that a linear relationship exists between the salaries earned and the number of years worked by employees. This makes the data a reasonable fit for a linear regression model. However, the data also seems to suggest possible categories, as seen in the vertical groupings along the x-axis (years worked).

In [None]:
# The distribution is now visualized (histplot replaces the deprecated distplot)
sns.histplot(salaries['yearsworked'], kde = True, color = 'red')
plt.title('The distribution of the number of Years Worked', fontsize = 18)
plt.xlabel('Years Worked (years)', fontsize = 16)
plt.ylabel('Frequency', fontsize = 16)
plt.show();

##### What is observed in the distribution above is a bimodal distribution. This implies that there are two different groups of employees within the dataset: one group who have worked for less than 10 years and another who have worked for more than 20 years. Linear regression does not require the feature to be normally distributed, and the standard scaler does not change the shape of the distribution; it only rescales the data to zero mean and unit variance so that the feature is on a standard scale.
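##### A minimal sketch, using made-up bimodal data, of what standardization does: it rescales values to zero mean and unit variance but leaves the shape of the distribution (measured here by skewness) unchanged. The manual formula mirrors the default behaviour of StandardScaler; the sample and the skewness helper are assumptions for illustration.

```python
import numpy as np

# Hypothetical bimodal sample: two clusters of years worked (made-up values)
rng = np.random.default_rng(0)
years = np.concatenate([rng.normal(5, 2, 500), rng.normal(25, 3, 500)])

# Standardization: subtract the mean and divide by the standard deviation
standardized = (years - years.mean()) / years.std()

def skewness(values):
    """Third standardized moment, a simple measure of distribution shape."""
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.std(values) ** 3

print(f"mean = {standardized.mean():.3f}, std = {standardized.std():.3f}")
print(f"skewness before = {skewness(years):.3f}, after = {skewness(standardized):.3f}")
```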

## Linear Regression Model
### Scikit-learn Linear Regression model: "Years worked" is the feature and "salary" is the target value

In [None]:
X = reshape(X)
y = reshape(y)

In [None]:
# The data is split into a test and training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [None]:
# The scaler is fitted on the training set only and then applied to both sets
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# Scikit Learn is used to train a Linear Regression model
lr = LinearRegression()
lr_model = lr.fit(X_train, y_train)

In [None]:
print(f'The model coefficient is {lr_model.coef_} and the model intercept is {lr_model.intercept_}.')

##### The coefficients simply mean that the model line can be plotted using the function f(x) = 822.97x + 40 722.35, where x is the number of years worked and f(x) is the predicted salary value.
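##### As a worked example, the fitted line can be evaluated by hand for the three inputs used later in the notebook. The slope and intercept are the values quoted in the text above; the function name is a hypothetical helper.

```python
# Coefficient and intercept as quoted in the text above
slope = 822.97
intercept = 40722.35

def predicted_salary(years_worked):
    """Evaluate the fitted line f(x) = slope * x + intercept."""
    return slope * years_worked + intercept

for years in (2, 12, 80):
    print(f"{years} years worked -> predicted salary {predicted_salary(years):.2f}")
```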

##  Scoring, Predictions and Evaluation
### Scoring and Predictions: The model is scored and used to predict the salaries of people with 2, 12 and 80 years work experience, respectively

In [None]:
#Predictions are now made and the model is scored
predictions = lr_model.predict(X_test)
r2_score = lr_model.score(X_test, y_test)
print(f'The model score is {r2_score}.')

##### The model score (R²) is the proportion of the variance in y (salary) that is explained by the independent variables in the model. It indicates the goodness of fit of the model and, because it is computed on the test set, how well unseen data is likely to be predicted. The score means that the model does not fit the data very well, explaining only 45.44% of the variance in salaries. It can be said that the model is underfitting the data.

##### The random state parameter is altered a few times to try to improve the score, and very little positive change results from this. The model is further evaluated and other attempts are made at improvement.
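##### A minimal sketch of how the score is computed, using made-up actual and predicted salaries rather than the notebook's test set: one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Hypothetical actual and predicted salaries (made-up values)
y_true = np.array([40000.0, 45000.0, 50000.0, 55000.0, 60000.0])
y_pred = np.array([42000.0, 44000.0, 51000.0, 53000.0, 61000.0])

# Residual sum of squares: variation the model leaves unexplained
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: variation of y around its mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")
```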

In [None]:
pred_vs_actual = pd.DataFrame({'Years_worked': X_test.flatten(), 'Actual': y_test.flatten(), 'Predicted': predictions.flatten()})
pred_vs_actual.head()

In [None]:
# The actual and predicted values are plotted
plt.scatter(pred_vs_actual['Years_worked'], pred_vs_actual['Actual'], c = 'b', label = 'Actual salaries')
plt.plot(pred_vs_actual['Years_worked'], pred_vs_actual['Predicted'], c = 'r', label = 'Predicted salaries regressor')
plt.xlabel('Years worked')
plt.ylabel('Salary')
plt.title('The Actual and Predicted salaries based on the number of years worked')
plt.legend()
plt.show();

In [None]:
# Predictions are made for people with 2, 12 and 80 years work experience 
predicted_salary_for_2_years_work_experience = lr_model.predict(reshape(2))
actual_salary_for_2_years_work_experience = (salaries['salary'].loc[(salaries['yearsworked'] == 2)]).mean()
print(f'The predicted salary for 2 years work experience is {predicted_salary_for_2_years_work_experience} and the actual mean salary for the same years worked is {actual_salary_for_2_years_work_experience}.')

In [None]:
predicted_salary_for_12_years_work_experience = lr_model.predict(reshape(12))
actual_salary_for_12_years_work_experience = (salaries['salary'].loc[(salaries['yearsworked'] == 12)]).mean()
print(f'The predicted salary for 12 years work experience is {predicted_salary_for_12_years_work_experience} and the actual mean salary for the same years worked is {actual_salary_for_12_years_work_experience}.')

In [None]:
predicted_salary_for_80_years_work_experience = lr_model.predict(reshape(80))
actual_salary_for_80_years_work_experience = (salaries['salary'].loc[(salaries['yearsworked'] == 80)]).mean()
print(f'The predicted salary for 80 years work experience is {predicted_salary_for_80_years_work_experience} and the actual mean salary for the same years worked is {actual_salary_for_80_years_work_experience}.')

##### The predicted and actual salaries are not too far off for 2 and 12 years worked. For 80 years worked, however, there is no actual data to compare against, so the prediction is an extrapolation well beyond the observed range. It is made using the model's fitted line which, as already observed, scores quite poorly.

### Model Evaluation: the model metrics are analysed 

In [None]:
mean_absolute_error = metrics.mean_absolute_error(y_test, predictions)
print(mean_absolute_error)

##### This metric measures the sum of the absolute difference between each predicted value and the true value in the dataset and then divides this sum by the total number of samples in the dataset. It measures the overall error in the model. The value seen above is very large and further confirms that the model is not ideal.

In [None]:
mean_squared_error = metrics.mean_squared_error(y_test, predictions)
print(mean_squared_error)

##### This metric sums the squared differences between each predicted value and the true value in the dataset and then divides this sum by the total number of samples in the dataset. Like the mean absolute error, it measures the overall error in the model. Squaring exaggerates large errors, which makes it easy to pick up when a model is performing poorly, as seen above.
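##### A minimal sketch of both error metrics on made-up values, where one prediction is deliberately far off to show how squaring amplifies a large error. The arrays and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical actual and predicted salaries (made-up values)
y_true = np.array([40000.0, 45000.0, 50000.0, 55000.0])
y_pred = np.array([41000.0, 44000.0, 51000.0, 45000.0])  # last prediction is far off

errors = y_true - y_pred

mae = np.mean(np.abs(errors))  # mean absolute error
mse = np.mean(errors ** 2)     # mean squared error
rmse = np.sqrt(mse)            # root of the MSE, back in salary units

print(f"MAE = {mae:.1f}, MSE = {mse:.1f}, RMSE = {rmse:.1f}")
```

##### The root of the squared error is larger than the absolute error here because the single large miss dominates the squared sum.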

In [None]:
explained_variance_score = metrics.explained_variance_score(y_test, predictions)
print(explained_variance_score)

##### This metric has a maximum value of 1, which occurs when the variance of the differences between the actual and predicted values, divided by the variance of the actual values, is equal to zero. The explained variance of the model is 0.48. This means that the unexplained variance ratio is 0.52, and thus the model has still not accounted for enough of the variance in the dataset.
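##### A minimal sketch of the explained variance calculation on made-up values: one minus the variance of the residuals divided by the variance of the actual values. The arrays are assumptions, not the notebook's test set.

```python
import numpy as np

# Hypothetical actual and predicted salaries (made-up values)
y_true = np.array([40000.0, 45000.0, 50000.0, 55000.0, 60000.0])
y_pred = np.array([42000.0, 44000.0, 51000.0, 53000.0, 61000.0])

# Variance of the prediction errors and of the actual values
residual_variance = np.var(y_true - y_pred)
total_variance = np.var(y_true)

explained_variance = 1 - residual_variance / total_variance
print(f"explained variance = {explained_variance:.4f}")
```

##### Unlike the model score, this metric uses the variance of the residuals, so a constant offset in the predictions does not lower it.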

### Model Improvement

##### In an attempt to improve the model, the following is done. First, an end-to-end pipeline is created that scales the data and creates an instance of the regressor. The pipeline is then passed into a function that prints the r2 cross validation scores. Finally, the GridSearch method is used to find the scores, estimators and parameters that would improve the model and return the best results. The GridSearch does not apply cross validation initially.

In [None]:
def pipeline():
    '''Create an end-to-end pipeline (scaler followed by regressor) that will be passed
    into the cross validation and GridSearch methods.'''
    steps = [('scaler', StandardScaler()), ('regressor', LinearRegression())]
    # The pipeline is returned unfitted; the caller fits it on the training data
    return Pipeline(steps)

In [None]:
# Create an instance of pipeline in order to run a cross validation
pipe = pipeline().fit(X_train, y_train)

# Get the parameters of the instance created
parameters = pipe.get_params()
print(parameters)

In [None]:
# Create a function to display the scores for each fold during cross validation
def print_scores(pipeline):
    scores = cross_val_score(pipeline, X_train, y_train, scoring = 'r2', cv = 3)
    print(scores)

In [None]:
print_scores(pipe)

##### In the first model, the data is split 80/20, with 80% of the data allocated to training and 20% used for testing, and the model returns a score of 45.44%. Since that model is trained on more data than each of the cross validation models below, a higher score is expected from it.

##### A pipeline is created and the same 80% is used to run a cross validation of the model with 3 folds. Each model built using the cross_val_score method now uses about 66% of that data for training and 33% for testing. Because the cross validation runs on the training portion of the original 80/20 split, each model built during validation uses 0.8*0.66=0.528 (52.8%) of the original data.

##### The number of samples in the dataset is such that any value of k greater than 2 divides the data unevenly, so the folds will not all contain the same number of samples. The target is continuous, and thus the k-fold method is not stratified.
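##### A minimal sketch of how k-fold splitting divides the training portion, assuming a hypothetical dataset of 100 samples (the real dataset size is not used here). The helper name is an assumption; the split mimics folds that are as equal as possible.

```python
import numpy as np

def fold_sizes(n_samples, k):
    """Sizes of k folds that are as equal as possible (hypothetical helper)."""
    return [len(fold) for fold in np.array_split(np.arange(n_samples), k)]

n_total = 100                 # hypothetical dataset size, not the real one
n_train = int(n_total * 0.8)  # the 80/20 split keeps 80 samples for training

for k in (2, 3):
    sizes = fold_sizes(n_train, k)
    # Each round trains on k - 1 folds and validates on the held-out fold
    train_share = (n_train - max(sizes)) / n_total
    print(f"k = {k}: fold sizes {sizes}, training share of all data {train_share:.2f}")
```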

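##### Grid search is now implemented to search for optimal parameters. This is done with no cross validation so that the model is exposed to more data during the training process. When cv=None, or when it is not passed as an argument, GridSearchCV defaults to k-fold cross validation (3 folds in older scikit-learn releases, 5 folds from version 0.22). An integer cv cannot be set to 1, so a single split in which all of the training data is used for both fitting and scoring is passed instead to avoid any cross validation during the grid search.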

In [None]:
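# GridSearch
# Note: the `normalize` option was removed from LinearRegression in scikit-learn 1.2;
# scaling is handled by the StandardScaler step of the pipeline instead.
parameter_grid = {'regressor__fit_intercept': [True, False]}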

grid_lr = GridSearchCV(pipe, parameter_grid, scoring = 'r2', cv = [(slice(None), slice(None))], verbose = 0)
grid_search_lr_model = grid_lr.fit(X_train, y_train)
grid_search_predictions = grid_lr.predict(X_test)
grid_estimator = grid_lr.best_estimator_
grid_score = grid_lr.best_score_
grid_params = grid_lr.best_params_

In [None]:
# The GridSearch results are displayed and interpreted
print(f'Best estimator:{grid_estimator}.')
print(f'Best score:{grid_score}.')
print(f'Best parameters:{grid_params}.')

##### The GridSearch above returns a best score of 37.08%, which is lower than the original score from splitting the data 80/20. This model still grossly underfits the data, which leads me to believe that the selected model may not be the best one and that it does not accurately explain the variance in the data. I would recommend that another model (maybe a classification model) be considered for this dataset. It also speaks to how the linearity observed could be misleading. The correlation matrix below shows that there is a stronger positive correlation between salary and position than between salary and years worked/ranked, as suggested in the original assignment.

In [None]:
# A correlation matrix is drawn
salaries.corr()