# Linear Regression
 

Linear regression is a statistical method for modeling the relationship between a dependent variable (often denoted \(y\)) and one or more independent variables (often denoted \(x\)). It assumes the relationship is linear: a one-unit change in an independent variable is associated with a constant change in the expected value of the dependent variable.
The goal is to find the straight line (or, with several independent variables, the hyperplane) that best fits the data, usually by minimizing the sum of squared differences between the observed values of the dependent variable and the values predicted by the model.
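
In symbols, for \(n\) observations and \(p\) independent variables, the model and the ordinary least squares criterion can be written as follows. The coefficient symbols \(\beta_j\), the error term \(\varepsilon\), and the indices \(i\) and \(j\) are notation assumed here for illustration:

\[
y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon,
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
\]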

Linear regression can be used for various purposes, including:
1. **Prediction**: Given a set of independent variables, linear regression can be used to predict the value of the dependent variable.
2. **Inference**: Linear regression can help in understanding the relationship between the independent and dependent variables, such as identifying which independent variables have a significant impact on the dependent variable.
3. **Control**: In some cases, linear regression can be used to control for the effects of certain variables when studying the relationship between others.

***Data requirements for linear regression include:***
1. **Continuous Variables**: The dependent variable should be continuous. Independent variables are typically continuous as well, but categorical independent variables can be used after appropriate encoding (for example, one-hot encoding).
2. **Linear Relationship**: There should be a linear relationship between the independent and dependent variables. This can be assessed visually using scatter plots or statistically using correlation coefficients.
3. **No or Little Multicollinearity**: If multiple independent variables are used, they should not be highly correlated with each other, because strong correlation makes the individual coefficient estimates unstable and hard to interpret.
4. **Homoscedasticity**: The variance of the residuals (the differences between the observed and predicted values) should be constant across all levels of the independent variables.
5. **Independence of Observations**: Observations should be independent of each other, meaning that the value of one observation should not be influenced by another observation.
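
Two of the requirements above, multicollinearity and homoscedasticity, can be checked roughly in code. The sketch below is only illustrative: it uses synthetic data (not the salary dataset), and the column names and the "split fitted values at the median" heuristic are assumptions made for this example rather than a standard test.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration, not the salary dataset).
rng = np.random.default_rng(0)
n = 500
experience = rng.uniform(0, 30, n)
age = 22 + experience + rng.normal(0, 3, n)  # deliberately correlated with experience
salary = 30_000 + 5_000 * experience + rng.normal(0, 10_000, n)
data = pd.DataFrame({"age": age, "experience": experience, "salary": salary})

# Multicollinearity: pairwise correlation between the independent variables.
feature_corr = data["age"].corr(data["experience"])

# Homoscedasticity: compare the residual spread in the lower and upper
# halves of the fitted values; a ratio near 1 suggests constant variance.
X = data[["age", "experience"]]
check_model = LinearRegression().fit(X, data["salary"])
fitted = check_model.predict(X)
residuals = data["salary"] - fitted
low_half = fitted <= np.median(fitted)
spread_ratio = residuals[~low_half].std() / residuals[low_half].std()

print(f"corr(age, experience) = {feature_corr:.2f}")
print(f"residual spread ratio (upper / lower half) = {spread_ratio:.2f}")
```

A correlation close to 1 between two features is a warning sign for multicollinearity, and a spread ratio far from 1 hints at non-constant residual variance. A residual-versus-fitted scatter plot is the usual visual companion to the second check.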

## Import Libraries

In [168]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

## Import Dataset

In [169]:
df = pd.read_csv('D:\\Data Practice JN\\Pre-Processing\\Wrangled Data of Salary_Dataset.csv')
df

,Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
0,0,32.0,0,Bachelor's,Software Engineer,5.000000,115296.421831
1,1,28.0,1,Master's,Data Analyst,3.000000,65000.000000
2,2,45.0,0,PhD,Software Engineer,8.090834,150000.000000
3,3,36.0,0,Bachelor's,Sales Associate,7.000000,60000.000000
4,4,52.0,0,Master's,Director,8.000000,200000.000000
...,...,...,...,...,...,...,...
6545,6698,49.0,1,PhD,Director of Marketing,20.000000,200000.000000
6546,6699,32.0,0,High School,Sales Associate,3.000000,50000.000000
6547,6700,30.0,1,Bachelor's,Financial Manager,4.000000,55000.000000
6548,6701,46.0,0,Master's,Marketing Manager,14.000000,140000.000000


In [170]:
df.drop(columns=['Unnamed: 0'], inplace=True)
df.head()

,Age,Gender,Education Level,Job Title,Years of Experience,Salary
0,32.0,0,Bachelor's,Software Engineer,5.0,115296.421831
1,28.0,1,Master's,Data Analyst,3.0,65000.0
2,45.0,0,PhD,Software Engineer,8.090834,150000.0
3,36.0,0,Bachelor's,Sales Associate,7.0,60000.0
4,52.0,0,Master's,Director,8.0,200000.0
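
An `Unnamed: 0` column like the one dropped above typically appears when a CSV was saved with its index and then read back with default settings. As a hedged alternative to dropping it afterwards, `pd.read_csv` accepts `index_col=0`. The sketch below uses a tiny in-memory CSV with hypothetical rows rather than the original file:

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a file saved with DataFrame.to_csv(),
# which writes the index as an unnamed first column (hypothetical rows).
csv_text = (
    ",Age,Years of Experience,Salary\n"
    "0,32.0,5.0,115000.0\n"
    "1,28.0,3.0,65000.0\n"
)

# Default read: the saved index comes back as an 'Unnamed: 0' column.
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```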


## Model Building

### Define Features and Labels

In [171]:
# Features (independent variables) and label (dependent variable)
x = df[['Age', 'Years of Experience']]
y = df['Salary']

In [172]:
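# As shown, this fits on all rows, before the train/test split below; unless the
# model is refit on x_train, the test score is not a true held-out estimate.
model = LinearRegression()
model.fit(x, y)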

### Train Test Split the Data

In [173]:
from sklearn.model_selection import train_test_split

In [174]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, random_state=0)
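
In the cells shown, `model.fit(x, y)` runs on the full dataset before this split, so the test rows also take part in fitting. A common pattern is to split first and fit only on the training portion. The following is a minimal sketch of that pattern on synthetic stand-in data. The data, the variable names, and `held_out_model` are assumptions for illustration, not the notebook's own objects:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): one feature with a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 30, size=(1000, 1))
y_demo = 30_000 + 5_000 * X[:, 0] + rng.normal(0, 10_000, size=1000)

# Split first, then fit on the training portion only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y_demo, test_size=0.3, random_state=0)
held_out_model = LinearRegression().fit(X_tr, y_tr)

# The test rows played no part in fitting, so this score is a held-out estimate.
train_r2 = held_out_model.score(X_tr, y_tr)
test_r2 = held_out_model.score(X_te, y_te)
print("train R^2:", train_r2)
print("test R^2: ", test_r2)
```

Applied to this notebook, the equivalent change would be calling `model.fit(x_train, y_train)` after the split. The coefficients and scores would then differ from the outputs recorded here.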

In [175]:
x_train

,Age,Years of Experience
3679,33.0,6.0
4908,26.0,5.0
3559,25.0,1.0
5723,39.0,16.0
2111,27.0,3.0
...,...,...
4931,25.0,3.0
3264,34.0,3.0
1653,43.0,13.0
2607,24.0,2.0


In [176]:
y_train

3679     75000.0
4908     85000.0
3559     35000.0
5723    200000.0
2111     80000.0
          ...   
4931     65000.0
3264     50000.0
1653    185000.0
2607     55000.0
2732    106218.0
Name: Salary, Length: 4585, dtype: float64

In [177]:
x_test

,Age,Years of Experience
4660,36.0,11.0
1363,48.0,16.0
4032,29.0,1.0
3779,25.0,1.0
6269,26.0,2.0
...,...,...
562,24.0,1.0
95,39.0,12.0
3554,34.0,6.0
1050,28.0,5.0


In [178]:
y_test

4660    135000.0
1363    190000.0
4032     26000.0
3779     35000.0
6269     40000.0
          ...   
562      90000.0
95       65000.0
3554     75000.0
1050    150000.0
121      50000.0
Name: Salary, Length: 1965, dtype: float64

In [182]:
model.coef_

array([-3020.45306574, 11208.31453113])

In [183]:
model.intercept_

126960.82079141666

### Prediction

In [186]:
# Input order matches the training columns: [Age, Years of Experience]
model.predict([[20, 0]])

array([66551.75947671])

In [189]:
# Age = 25, Years of Experience = 3
model.predict([[25, 3]])

array([85074.43774142])

In [190]:
# Age = 62, Years of Experience = 30
model.predict([[62, 30]])

array([275942.16664969])
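
Each prediction above is simply the intercept plus the coefficient-weighted sum of the inputs. The sketch below recomputes the three predictions by hand with NumPy, using the values printed by `model.coef_` and `model.intercept_`. The variable names are chosen here for illustration, and the results should agree with the outputs above up to rounding of the printed coefficients:

```python
import numpy as np

# Coefficients and intercept copied from the model.coef_ and model.intercept_ outputs.
coef = np.array([-3020.45306574, 11208.31453113])  # [Age, Years of Experience]
intercept = 126960.82079141666

# The same three inputs that were passed to model.predict.
inputs = np.array([[20, 0], [25, 3], [62, 30]])

# A linear model's prediction is intercept + sum(coef_j * x_j).
manual_predictions = intercept + inputs @ coef
print(manual_predictions)
```

Read this way, the fitted model adds roughly 11,208 to the predicted salary per additional year of experience and subtracts roughly 3,020 per additional year of age, holding the other variable fixed. Because age and experience tend to rise together, the negative sign on `Age` may be a symptom of multicollinearity rather than a real effect.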

### Evaluation of Model

In [192]:
# score() returns the coefficient of determination, R^2 (1.0 is a perfect fit)
print('Training Score of Model =',model.score(x_train, y_train))
print('Testing Score of Model = ',model.score(x_test, y_test))

Training Score of Model = 0.6864738328030133
Testing Score of Model =  0.6897123076395237
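
For a `LinearRegression` model, `score` reports \(R^2\), the share of the variance in the dependent variable that the model explains, so values near 0.69 mean roughly 69% is accounted for. The sketch below, on synthetic data assumed purely for illustration, shows how \(R^2\) can be computed by hand and compared with scikit-learn's result. It also computes RMSE, which expresses the typical error in the units of the target:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data (assumption): a noisy linear relationship.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 30, size=(200, 1))
y_true = 30_000 + 5_000 * X_demo[:, 0] + rng.normal(0, 20_000, size=200)

demo_model = LinearRegression().fit(X_demo, y_true)
y_pred = demo_model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Root mean squared error, in the same units as the target.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print("manual R^2: ", r2_manual)
print("sklearn R^2:", r2_score(y_true, y_pred))
print("RMSE:       ", rmse)
```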
