Linear Regression - A predictive modeling technique that helps us predict a continuous numerical value.

a) Linear Regression fits a line of best fit through the data. The vertical distance between a data point and the line is called the error (or residual). The line of best fit is drawn so that the total squared error across all data points is as small as possible; an individual point can still lie some distance from the line.

b) The line of best fit is represented as Y = mX + C for a simple linear regression and Y = m1X1 + m2X2 + C for a multiple linear regression with two independent variables

    Where Y is the target or dependent variable, X is the independent variable, m is the slope and C is the intercept

c) Y has to be a continuous numerical variable. X is also treated as numerical here; a categorical X has to be encoded as numbers before it can be used

d) The Linear Regression model outputs a single predicted value for each input. In these notes the prediction is reported as a range of values, predicted value ± RMSE, to reflect the typical size of the model's error

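e) Each Linear Regression model has an R Square value (and an Adjusted R Square value, which corrects R Square for the number of independent variables). R Square represents the proportion of the variance in Y that is explained by X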

f) A Linear Regression model with a HIGH value of R Square and a LOW RMSE is an ideal model.
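
To make points (a) to (f) concrete, here is a minimal sketch that fits a line by ordinary least squares with plain NumPy and then computes R Square, Adjusted R Square and RMSE by hand. The data is synthetic and the helper name `fit_line` is an assumption made for illustration; it is not part of the notebook that follows.

```python
import numpy as np

def fit_line(x, y):
    """Return slope m and intercept c that minimise the sum of squared errors."""
    x_mean, y_mean = x.mean(), y.mean()
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    c = y_mean - m * x_mean
    return m, c

# Synthetic data (assumed for illustration): y is roughly 2x + 1 plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.5, size=x.size)

m, c = fit_line(x, y)
y_hat = m * x + c            # points on the line of best fit
residuals = y - y_hat        # the "error" of each data point

ss_res = np.sum(residuals ** 2)          # variation left unexplained
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation in y
r2 = 1 - ss_res / ss_tot

n, p = x.size, 1                         # observations, independent variables
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

rmse = np.sqrt(np.mean(residuals ** 2))

print(f"slope={m:.3f}, intercept={c:.3f}")
print(f"R-squared={r2:.4f}, adjusted R-squared={adj_r2:.4f}, RMSE={rmse:.3f}")
```

With a single independent variable and a reasonable number of rows, Adjusted R Square sits only slightly below R Square; the gap grows as more independent variables are added.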

Steps for building the model in Python

Step 1: Read and inspect the data
Step 2: Identify the variables (i.e. the dependent and independent variables)
Step 3: Divide the data into training and test data
Step 4: Create the model using the training data set
Step 5: Find the values of slope, intercept and R Square
Step 6: Predict the values on test data using your model
Step 7: Calculate the RMSE value from test data
Step 8: Predict the values for your validation data using the model and RMSE
### Problem Statement: ABC Inc. is an IT firm operating in the telecom domain. They have around 192 employees in India. Lately they have started benchmarking the CTC offered to employees against their last CTC. HR has the CTC data. Help them perform an analysis that shows the relationship between Last CTC and CTC offered. Predict what CTC can be offered to a candidate with a last CTC of 6 lakhs.

In [1]:
#importing the libraries

import pandas as pd
import numpy as np

import warnings

warnings.filterwarnings("ignore")


In [2]:
#Step 1: reading the dataset

df = pd.read_csv("./DATA/CTCdata.csv")
df.head()

Unnamed: 0,CTCoffered,LastCTC,Interview rating,Skill Set Index,Highest qualification,Total years of work exp
0,19,18,4,3,3,8.5
1,17,16,4,3,3,7.7
2,17,16,4,3,3,7.9
3,9,8,3,1,2,2.7
4,10,9,5,4,4,9.7


In [3]:
# Step 2: Keeping only the target and the single predictor used in this model
df_new = df[['CTCoffered','LastCTC']]
df_new.head()

Unnamed: 0,CTCoffered,LastCTC
0,19,18
1,17,16
2,17,16
3,9,8
4,10,9


In [4]:
# Defining input and output variables

X = df_new[['LastCTC']]
y = df_new[['CTCoffered']]

In [5]:
#Step 3: Splitting the dataset into train and test data

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=123)
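
As a quick illustration of what `train_size=0.8` and `random_state=123` do, the sketch below splits a stand-in frame twice with the same settings. The frame `demo` and its values are made up; the row count of 190 is an assumption based on the index running from 0 to 189 in a later output.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with 190 rows (values are invented for illustration)
demo = pd.DataFrame({"LastCTC": np.arange(190, dtype=float)})
demo["CTCoffered"] = demo["LastCTC"] + 1

X_demo, y_demo = demo[["LastCTC"]], demo[["CTCoffered"]]

# Two splits with the same random_state pick exactly the same rows
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, train_size=0.8, random_state=123)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, train_size=0.8, random_state=123)

print(a_train.shape, a_test.shape)           # sizes of the 80% / 20% split
print(a_test.index.equals(b_test.index))     # same rows both times
```

Fixing `random_state` is what makes the slope, intercept and RMSE reproducible from run to run.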


In [6]:
#Step 4: Building the simple linear regression model

from sklearn.linear_model import LinearRegression

# Building the model
slr = LinearRegression() 

#Fitting the model
model = slr.fit(X_train,y_train) 

model


LinearRegression()

In [7]:
#Step 5: Finding the slope,intercept and R-square

# 1) Slope

print("Slope: {}".format(model.coef_))

# 2) Intercept

print("Intercept: {}".format(model.intercept_))

# 3) R-square (computed on the training data)

print("R-square: {}".format(model.score(X_train,y_train)))

Slope: [[0.94325853]]
Intercept: [1.76208839]
R-square: 0.9723571733067193


Y = mX + C (Y is the dependent/output variable, X is the input variable, m is the slope and C is the intercept)

CTCoffered = (0.94325853 * LastCTC) + 1.76208839
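
The fitted equation can be checked by hand. The sketch below plugs a last CTC of 6 lakhs (the candidate from the problem statement) into the equation, using the slope and intercept copied from the printed output; the function name `predict_ctc` is an illustrative assumption.

```python
# Coefficients copied from the printed output of Step 5
slope = 0.94325853
intercept = 1.76208839

def predict_ctc(last_ctc):
    """Apply CTCoffered = slope * LastCTC + intercept."""
    return slope * last_ctc + intercept

print(round(predict_ctc(6), 4))   # 7.4216
```

Because the slope is just under 1 and the intercept is about 1.76, the model offers roughly the last CTC plus 1 to 1.5 lakhs over the range of values in this data.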

In [8]:
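#Step 6: Predict the values on test data using the model

# Work on a copy so that adding a column does not modify a slice of the original data
y_test = y_test.copy()
# predict() returns an (n, 1) array here, so flatten it; round to 2 decimals for display
y_test['Predicted_CTC_Offered'] = model.predict(X_test).ravel().round(2)
y_test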

Unnamed: 0,CTCoffered,Predicted_CTC_Offered
171,21,19.68
26,15,15.91
41,10,10.25
4,10,10.25
85,9,9.31
141,9,8.36
62,17,16.85
145,20,19.68
53,11,11.19
100,11,10.25


In [9]:
#Step 7: Calculate the RMSE on the test data

from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test['CTCoffered'],y_test['Predicted_CTC_Offered']))
rmse

0.8051803429141727
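
RMSE is the square root of the mean of the squared differences between actual and predicted values. The sketch below applies that formula by hand to the ten rows displayed in the test-set table above. The notebook's figure of 0.805 is computed on the full test set, so the value for these ten rows alone need not match it.

```python
import numpy as np

# The ten actual / predicted pairs displayed in the test-set table above
actual = np.array([21, 15, 10, 10, 9, 9, 17, 20, 11, 11], dtype=float)
predicted = np.array([19.68, 15.91, 10.25, 10.25, 9.31,
                      8.36, 16.85, 19.68, 11.19, 10.25])

errors = actual - predicted                  # residual for each row
squared_errors = errors ** 2                 # squaring removes the sign
rmse_shown_rows = np.sqrt(squared_errors.mean())

print(rmse_shown_rows)
```

Because the errors are squared before averaging, one large miss (such as the first row, off by 1.32) contributes far more to the RMSE than several small ones.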

In [10]:
# Step 8: Predicting the values for new (validation) data using the model

# Creating a dataframe of LastCTC values to score
val_data = pd.DataFrame({"LastCTC":[6,6.5,8.1,15.5]})

# Predicting the CTC offered based on LastCTC
predicted_CTC = model.predict(val_data)
predicted_CTC

array([[ 7.42163957],
       [ 7.89326884],
       [ 9.40248249],
       [16.38259562]])
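
To turn these point predictions into the range described in the notes, subtract and add the test RMSE from Step 7. The sketch below does this with the printed values copied in as constants. Treating prediction ± RMSE as the range is the convention of these notes, not a formal prediction interval.

```python
import numpy as np

last_ctc = [6, 6.5, 8.1, 15.5]
# Predictions and RMSE copied from the outputs above
pred = np.array([7.42163957, 7.89326884, 9.40248249, 16.38259562])
rmse = 0.8051803429141727

lower = pred - rmse   # minimum CTC
upper = pred + rmse   # maximum CTC

for ctc, lo, hi in zip(last_ctc, lower, upper):
    print(f"LastCTC {ctc}: {lo:.2f} to {hi:.2f}")
```

For the candidate with a last CTC of 6 lakhs, this gives a range of roughly 6.62 to 8.23 lakhs.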

### Problem Statement: Find the range of CTC that can be offered to a candidate whose "Total years of work exp" is 10 years. Also comment on the RMSE and R-squared value of your model.

In [11]:
df_new_2 = df[['CTCoffered','Total years of work exp']]
df_new_2

Unnamed: 0,CTCoffered,Total years of work exp
0,19,8.5
1,17,7.7
2,17,7.9
3,9,2.7
4,10,9.7
...,...,...
186,7,5.5
187,21,5.3
188,14,10.3
189,10,9.5


In [13]:
#Step 1: Defining input and output variables

X = df_new_2[['Total years of work exp']]
y = df_new_2[['CTCoffered']]

In [14]:
#Step 2: Splitting the dataset into train and test data

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=123)

In [15]:
#Step 3: Building the model

slr = LinearRegression()
model = slr.fit(X_train, y_train)
model


LinearRegression()

In [16]:
#Step 4: Finding the slope,intercept and R-Squared

#Slope

print("Slope:{}".format(model.coef_))

#Intercept

print("Intercept: {}".format(model.intercept_))

#R-square
print("R-squared: {}".format(model.score(X_train,y_train)))

Slope:[[0.63242017]]
Intercept: [9.55399786]
R-squared: 0.17467611462820076


In [18]:
#Step 5: Predicting the values based on test data

y_test = y_test.copy()  # avoid modifying a slice of the original data
y_test["Predicted_CTC_Offered"] = model.predict(X_test).ravel()
y_test

Unnamed: 0,CTCoffered,Predicted_CTC_Offered
171,21,17.016556
26,15,16.257652
41,10,13.475003
4,10,15.688474
85,9,15.435505
141,9,15.309021
62,17,12.526373
145,20,14.929569
53,11,13.538245
100,11,11.451258


In [20]:
#Find the RMSE

rmse = np.sqrt(mean_squared_error(y_test["CTCoffered"], y_test["Predicted_CTC_Offered"]))
rmse


4.225627534272566

In [22]:
#Predicting the values to validate the model

val = pd.DataFrame({"Total years of work exp":[6,7,10,12,15]})

pred_ctc = model.predict(val)
pred_ctc

array([[13.3485189 ],
       [13.98093908],
       [15.87819959],
       [17.14303994],
       [19.04030046]])

In [23]:
#Minimum CTC: prediction minus RMSE
minimum_ctc = pred_ctc - rmse
minimum_ctc

array([[ 9.12289137],
       [ 9.75531154],
       [11.65257206],
       [12.91741241],
       [14.81467292]])

In [25]:
#Maximum CTC: prediction plus RMSE
maximum_ctc = pred_ctc + rmse
maximum_ctc

array([[17.57414644],
       [18.20656661],
       [20.10382713],
       [21.36866747],
       [23.26592799]])
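
To answer the problem statement, the sketch below pulls out the range for 10 years of experience and places the two models side by side. All figures are copied from the outputs above; the dictionary layout is just an illustrative way to hold them.

```python
# Figures copied from the printed outputs of the two models
models = {
    "LastCTC": {"r2_train": 0.9723571733067193, "rmse_test": 0.8051803429141727},
    "Total years of work exp": {"r2_train": 0.17467611462820076, "rmse_test": 4.225627534272566},
}

for name, m in models.items():
    print(f"{name}: R-squared={m['r2_train']:.3f}, RMSE={m['rmse_test']:.3f}")

# Range for a candidate with 10 years of experience (third validation row)
pred_10_years = 15.87819959
rmse = models["Total years of work exp"]["rmse_test"]
low, high = pred_10_years - rmse, pred_10_years + rmse
print(f"10 years of experience: {low:.2f} to {high:.2f}")
```

The range for 10 years of experience is about 11.65 to 20.10 lakhs. With an R-squared of about 0.17 and an RMSE of about 4.23, work experience alone explains only a small part of the variation in CTC offered and gives a wide range, whereas the LastCTC model (R-squared about 0.97, RMSE about 0.81) is much closer to the HIGH R Square, LOW RMSE ideal described at the start.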