# Cross Validation and Simple Linear Regression

This notebook walks through the process that gives us the internal and the cross-validation measures of predictive accuracy for a simple linear regression. The data are randomly assigned to a number of "folds", which in our context are the training and test folds. Each fold is removed in turn, while the remaining data are used to re-fit the regression model and to predict at the deleted observations.

We will predict employee salaries from different employee characteristics (or features). We are going to use a simple supervised learning technique: linear regression. We want to build a simple model to determine how well Years Worked predicts an employee’s salary. 
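
The fold procedure described above can be sketched as follows. This is only a minimal illustration and not the pipeline used in this notebook: the synthetic `yearsworked` and `salary_values` arrays, the choice of 5 folds and the random seeds are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption): salary grows roughly linearly with years worked
rng = np.random.default_rng(0)
yearsworked = rng.uniform(0, 40, size=200).reshape(-1, 1)
salary_values = 40000 + 800 * yearsworked.ravel() + rng.normal(0, 5000, size=200)

# Assign the observations to 5 folds at random
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_rmse = []
for train_idx, test_idx in kfold.split(yearsworked):
    # Re-fit the regression on the remaining folds
    fold_model = LinearRegression().fit(yearsworked[train_idx], salary_values[train_idx])
    # Predict at the held-out observations and score the predictions
    predictions = fold_model.predict(yearsworked[test_idx])
    fold_rmse.append(np.sqrt(mean_squared_error(salary_values[test_idx], predictions)))

cv_rmse = float(np.mean(fold_rmse))
print('RMSE per fold:', np.round(fold_rmse, 1))
print('Cross-validated RMSE:', round(cv_rmse, 1))
```

Averaging the per-fold errors gives one estimate of how the model might perform on data it has not seen, which is the quantity a single train/test split also tries to approximate.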

## 1. Importing

In [None]:
# Importing libraries
from math import sqrt

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
from scipy import stats
from scipy.stats import iqr, pearsonr
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [None]:
# Reading data into a dataframe
salary = pd.read_csv('salary.csv')
salary.head()

<h1><center>Type of data</center></h1> 

| Continuous | Categorical | Binary |
| --- | --- | --- |
| salary | position | degree |
| exprior | Field | otherqual |
| yearsworked | - | male |
| market | - | - |
| yearsrank | - | - |
| yearsabs | - | - |

<h1><center>Feature description</center></h1> 

| Feature | Description |
| --- | --- |
| exprior | Years of experience prior to working in this field |
| yearsworked | Years worked in this field |
| yearsrank | Years worked at current rank |
| market | Market value (1 = salary at market value for position,<br> <1 = salary lower than market value for position,<br> >1 = salary higher than market value for position) |
| degree | Has degree (0 = no, 1 = yes) |
| otherqual | Has other post-secondary qualification (0 = no, 1 = yes) |
| position | Position (1 = Junior Employee, 2 = Manager, 3 = Executive) |
| male | Is male (0 = no, 1 = yes) |
| Field | Field of work (1 = Engineering, 2 = Finance, 3 = Human Resources, 4 = Marketing) |
| yearsabs | Years absent from work (e.g. due to illness / child rearing / personal reasons) |

<h1><center>Response description</center></h1> 

| Response | Description |
| --- | --- |
| <font color='black'> salary </font>| <font color='black'>  Annual salary in dollars </font> | 

## 2. Cleaning data
Data cleaning is the process of detecting and correcting corrupt or inaccurate data in a dataset, table, or database. It involves identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting them.
### 2.1 Examining missing values

In [None]:
#Checking for missing values for our dataset

salary.isnull().sum()

In [None]:
#Filling the missing values in the salary column with the average of that column
salary['salary'] = salary['salary'].fillna(salary['salary'].mean())
salary.head()

##### What we did with the missing values in the data

- We <b>filled</b> the missing value that appeared in the <b>salary</b> column with the <b>mean</b> of that column.

### 2.2 Examining duplicates within the dataset

In [None]:
#Dropping duplicates within the dataset
salary = salary.drop_duplicates()

##### Dealing with duplicates within the dataset

We dropped any duplicates that may exist within the dataset.

### 2.3 Examining outliers within the dataset
In statistics, an outlier is an observation point that is distant from other observations.
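
One common way to make "distant from other observations" concrete is the 1.5 × IQR rule, which is also the default convention boxplots use for their whiskers. The sketch below applies it to a small set of hypothetical salaries; the values and the names `salaries`, `lower_fence` and `upper_fence` are assumptions made for illustration, not data from `salary.csv`.

```python
import pandas as pd

# Hypothetical salaries (assumption): mostly clustered, with one extreme value
salaries = pd.Series([38000, 40000, 41000, 42000, 43000, 45000, 47000, 120000])

# Interquartile range: the spread of the middle 50% of the data
q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr_value = q3 - q1

# Points more than 1.5 * IQR beyond the quartiles are flagged as outliers
lower_fence = q1 - 1.5 * iqr_value
upper_fence = q3 + 1.5 * iqr_value
outliers = salaries[(salaries < lower_fence) | (salaries > upper_fence)]

print('Fences:', lower_fence, upper_fence)
print(outliers)
```

Here only the 120000 value falls outside the fences, so it is the only point flagged.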


In [None]:
#Checking for extreme values in salary and years worked
sns.set(style="whitegrid")
fig, axes = plt.subplots(1,2, figsize=(15, 15))
sns.boxplot(x=salary["salary"], ax=axes[0])
sns.boxplot(x=salary["yearsworked"], ax=axes[1])


In [None]:
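#Flagging salaries between the 5th and 95th percentiles (values outside this range are treated as outliers)
within_range = salary['salary'].between(salary['salary'].quantile(.05), salary['salary'].quantile(.95))

#Plotting the data with the flagged outliers excluded
salary[within_range].plot()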

##### Outliers

We have observed <b> two outliers</b> for the salary values, and <b>none</b> for the yearsworked values.

We intend to deal with the outliers by removing them, as they may....

## 3. Splitting data
Training and test sets are two important concepts in data science and data analysis, and are used as tools to prevent (or at least minimize) overfitting. We usually fit the model on a training set in order to make predictions on data that the model was not trained on (general data).

In [None]:
#Split our data
x = salary[['exprior','yearsworked','yearsrank','market','degree','otherqual','position', 'male','Field','yearsabs']]
y = salary['salary']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20)




In [None]:
#Converting the training response into a dataframe
y_train = pd.DataFrame(y_train)

### 3.1 Viewing the split data

In [None]:
#showcasing the first 5 observations of the X_train dataset 
X_train.head()


In [None]:
#showcasing the first 5 observations of the y_train dataset 
y_train.head()

In [None]:
#Describing the dataset of y_train
y_train.describe()


Talk about the distribution: mean, median and interquartile range.

### Describing the datasets
 - The training dataset contains <b>80% of the overall dataset</b> and has been divided into two datasets: the <b>features dataset</b> and the <b>response variable dataset</b>.
 - Talk about the <b>mean</b>, <b>median</b> and the <b>mode</b>.
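
As a small illustration of the summary statistics mentioned above, the sketch below computes the mean, median and interquartile range of a response column. The `y_example` values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical training responses (assumption), in dollars
y_example = pd.Series([35000, 38000, 40000, 42000, 47000, 60000, 89000], name='salary')

mean_salary = y_example.mean()
median_salary = y_example.median()
# Interquartile range: 75th percentile minus 25th percentile
iqr_salary = y_example.quantile(0.75) - y_example.quantile(0.25)

print('Mean:', round(mean_salary, 2))
print('Median:', median_salary)
print('Interquartile range:', iqr_salary)
```

In this example the mean sits above the median because a few high salaries pull it upwards, which is the usual sign of a right-skewed distribution.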
 

## 4. Distribution of the data
The distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of the data and how often they occur.

In [None]:
#Creating histograms with density line

sns.set(style="whitegrid")
fig, axes = plt.subplots(1, 2,figsize=(20, 6))
#histplot with kde=True draws the histogram together with the density line
sns.histplot(X_train["yearsworked"], kde=True, stat="density",
             bins=int(180/5), color = 'darkblue',
             edgecolor='black',
             line_kws={'linewidth': 4}, ax=axes[0])

sns.histplot(y_train["salary"], kde=True, stat="density",
             bins=int(180/5), color = 'darkblue',
             edgecolor='black',
             line_kws={'linewidth': 4}, ax=axes[1])

axes[0].set_title('The histogram for years worked')
axes[1].set_title('The histogram for salary')
plt.show()

##### Comment on distribution of the response and the feature

For salaries, the peak is between 37000 dollars and 43000 dollars, and the histogram shows that a lot of people earn between 35000 dollars and 47000 dollars. Only a few earn between 83000 dollars and 89000 dollars, and these are potential outliers. These few are most likely to be executives, while a lot of people hold junior positions.

## 5. Correlation
Correlation is a statistical technique that can show whether and how strongly pairs of variables are related.

In [None]:
#Creating scatterplot with a line of best fit

#Joining the training features and response so both variables are in one dataframe
combined = pd.concat([X_train, y_train], axis=1)
sns.lmplot(x='yearsworked', y='salary', data=combined)

Are the data appropriate for linear regression? Is there anything that needs to be transformed or edited first?

The line of best fit shows a strong relationship between years worked and the salary earned. There are some outliers which are the furthest points from the line of best fit.

In [None]:
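#Run a simple linear regression model using statsmodels on the training data

#The formula needs salary and yearsworked in the same dataframe
train = pd.concat([X_train, y_train], axis=1)
model = smf.ols(formula = 'salary~yearsworked', data=train).fit()
model.summary()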

### Interpreting correlation graph

- The correlation between two variables describes the <b>strength and direction</b> of their relationship. This helps us select the features that have a significant impact on the response variable, and therefore helps us predict the salaries.

- The correlation between the response variable salary and the feature variable yearsworked is <b>0.623589</b>. This is a <b>fairly strong positive relationship</b>, so we can expect the value of salary to increase as the number of years worked increases.
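
A correlation coefficient like the one quoted above can be computed with `scipy.stats.pearsonr`. The sketch below uses synthetic data rather than the notebook's dataset; the arrays `years` and `pay` and the parameters used to generate them are assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic data (assumption): salary rises with years worked, plus noise
rng = np.random.default_rng(1)
years = rng.uniform(0, 40, size=300)
pay = 40000 + 800 * years + rng.normal(0, 10000, size=300)

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = pearsonr(years, pay)
print('r =', round(r, 3))
print('p-value =', p_value)
```

The sign of `r` gives the direction of the relationship and its magnitude (between 0 and 1) gives the strength.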

## Does the model significantly predict the dependent variable? Report the amount of variance explained (R^2) and significance value (p) to support your answer.
## What percentage of the variance in employees’ salaries is accounted for by the number of years they have worked?



In [None]:
model.rsquared

R^2

What percentage of the variance in employees’ salaries is accounted for by the number of years they have worked?
The R^2 value reveals that about 37% of the variance in employees’ salaries is accounted for by the number of years they have worked.
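
To show where an R^2 value comes from, the sketch below computes it by hand as one minus the ratio of unexplained to total variation, and checks that in a simple linear regression it equals the squared correlation. The data are synthetic and the names `years`, `pay` and `fitted` are assumptions for the example.

```python
import numpy as np

# Synthetic data (assumption)
rng = np.random.default_rng(2)
years = rng.uniform(0, 40, size=300)
pay = 40000 + 800 * years + rng.normal(0, 12000, size=300)

# Least-squares line: pay is approximated by intercept + slope * years
slope, intercept = np.polyfit(years, pay, deg=1)
fitted = intercept + slope * years

ss_res = np.sum((pay - fitted) ** 2)       # variation left unexplained by the line
ss_tot = np.sum((pay - pay.mean()) ** 2)   # total variation around the mean
r_squared = 1 - ss_res / ss_tot

# In simple linear regression, R^2 equals the squared Pearson correlation
r = np.corrcoef(years, pay)[0, 1]
print('R^2 =', round(r_squared, 3))
print('r^2 =', round(r ** 2, 3))
```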

overall significance

# interpret coefficient of Years Worked and Salary

The coefficient of 827.1461 means that for each additional year worked, the predicted salary increases by 827.1461 dollars. The coefficient is expressed in the units of the observed values, i.e. dollars per year.

#### answer

What do the 95% confidence intervals [0.025, 0.975] mean?

A 95% confidence interval is a range of values that we can be 95% confident contains the true mean of the population. More precisely, if we repeated the sampling many times, about 95% of the intervals built this way would contain the true population mean. If a confidence interval does not include a particular value, we can say that it is not likely that the particular value is the true population mean. However, even if a particular value is within the interval, we shouldn't conclude that the population mean equals that specific value.

The confidence interval can also be used for the coefficients of the regression model: we use it to assess the estimate of the population coefficient for each term in the model.

The coefficient for the years worked is 827.1461 and its 95% confidence interval is [714.150, 940.143]. We can be 95% confident that this interval contains the value of the coefficient for the population. Because the interval does not include zero, the relationship between years worked and salary is statistically significant at the 5% level.

The same applies for the constant coefficient.
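
The intervals discussed above can also be read directly from a fitted statsmodels result with `conf_int`. The sketch below fits a model to synthetic data, so the dataframe `example` and the numbers it produces are assumptions and will not match the interval reported in this notebook.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption)
rng = np.random.default_rng(3)
example = pd.DataFrame({'yearsworked': rng.uniform(0, 40, size=300)})
example['salary'] = 40000 + 800 * example['yearsworked'] + rng.normal(0, 10000, size=300)

example_model = smf.ols('salary ~ yearsworked', data=example).fit()

# conf_int returns one row per coefficient with the lower and upper bounds
intervals = example_model.conf_int(alpha=0.05)
lower, upper = intervals.loc['yearsworked']
estimate = example_model.params['yearsworked']

print('Estimate:', round(estimate, 2))
print('95% interval:', round(lower, 2), 'to', round(upper, 2))
# The slope is significant at the 5% level when the interval excludes zero
print('Interval excludes zero:', not (lower <= 0 <= upper))
```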

In [None]:
#Calculate expected salary for someone who worked for 12 years

experience = pd.DataFrame({'yearsworked':[12]})
predict_salary=model.predict(experience)
p=predict_salary.iloc[0]
print('The expected salary of a person with 12 years experience is:',p,'dollars')

In [None]:
#Calculate expected salary for someone who worked for 80 years

experience = pd.DataFrame({'yearsworked':[80]})
predict_salary=model.predict(experience)
p=predict_salary.iloc[0]
print('The expected salary of a person with 80 years experience is:',p,'dollars')

#### answer
Are there any problems with this prediction? If so, what are they?


The model produced this prediction without any error, which is itself a concern: 80 years of experience is an extreme value that the data used to fit this regression is unlikely to cover, so the model is extrapolating. What does this mean for the predictive model?
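
One simple safeguard is to compare a new value with the range seen during training before trusting the prediction. The helper `is_extrapolation` and the `observed_years` values below are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical training values for years worked (assumption)
observed_years = pd.Series([0, 2, 5, 8, 12, 15, 20, 26, 33, 41])

def is_extrapolation(value, observed):
    """Return True when value lies outside the range seen during training."""
    return bool(value < observed.min() or value > observed.max())

print('12 years:', is_extrapolation(12, observed_years))
print('80 years:', is_extrapolation(80, observed_years))
```

A linear model will extend its line to any input, so a check like this has to be done outside the model itself.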

#### feature selection
We have only looked at the number of years an employee has worked. What other employee characteristics might influence their salary?


According to the correlation map shown in the beginning, position has a good correlation with salary, which is even better than that of years worked. Years worked at current rank also has a good correlation with salary. The rest of the features have a correlation with salary below 0.5. It would not make sense to build a model with features that have no or a very weak relationship with salary.
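
Ranking features by their correlation with the response can be sketched as follows. The dataframe `frame` is synthetic, `yearsabs` is generated to be unrelated to salary, and the 0.5 threshold is an assumption taken from the discussion above rather than a fixed rule.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption)
rng = np.random.default_rng(4)
n = 300
frame = pd.DataFrame({
    'yearsworked': rng.uniform(0, 40, size=n),
    'yearsabs': rng.uniform(0, 10, size=n),   # unrelated to salary by construction
})
frame['salary'] = 40000 + 800 * frame['yearsworked'] + rng.normal(0, 8000, size=n)

# Correlation of every feature with salary, strongest first
correlations = frame.corr()['salary'].drop('salary').sort_values(ascending=False)
print(correlations)

# Keep the features whose absolute correlation reaches the chosen threshold
selected = correlations[correlations.abs() >= 0.5].index.tolist()
print('Selected features:', selected)
```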

In [None]:
#Predicting salaries for the test set

predicted = model.predict(X_test)
predict = predicted.rename('Predicted salary')
result = pd.concat([y_test, predict], axis=1, sort=False)

How does your model compare when running it on the test set - what is the difference in the Root Mean Square Error (RMSE) between the training and test sets? Is there any evidence of overfitting?
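
The comparison asked for above can be sketched by computing the RMSE on both sets and looking at the gap. The data below are synthetic and the names `data`, `train`, `test` and `fit` are assumptions, so the numbers are not the notebook's results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption)
rng = np.random.default_rng(5)
data = pd.DataFrame({'yearsworked': rng.uniform(0, 40, size=400)})
data['salary'] = 40000 + 800 * data['yearsworked'] + rng.normal(0, 9000, size=400)

# 80/20 split, then fit on the training rows only
train, test = train_test_split(data, test_size=0.20, random_state=0)
fit = smf.ols('salary ~ yearsworked', data=train).fit()

# RMSE: square root of the mean squared difference between observed and predicted
rmse_train = np.sqrt(mean_squared_error(train['salary'], fit.predict(train)))
rmse_test = np.sqrt(mean_squared_error(test['salary'], fit.predict(test)))

print('Training RMSE:', round(rmse_train, 1))
print('Test RMSE:', round(rmse_test, 1))
print('Difference:', round(rmse_test - rmse_train, 1))
```

A test RMSE that is much larger than the training RMSE would point towards overfitting, while similar values suggest the model generalises about as well as it fits.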