## Linear Regression

### Importing the dataset

In [32]:
import pandas as pd
# pd.read_csv is a function that loads the CSV file into a DataFrame, stored here in the variable `dataset`
dataset = pd.read_csv('Salary_Data_cleaned.csv')

In [33]:
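dataset.head()  # display the first 5 rows of the dataset

# 6,705 rows
# Columns, apart from the leftover index column `Unnamed: 0`:
#   Age, Gender, Education Level, Job Title, Years of Experience, Salary
# Independent variables: Age, Gender, Education Level, Job Title, Years of Experience
# Dependent variable: Salary
# Note on Gender: 0 = Female; 1 = Male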

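| Unnamed: 0 | Age | Gender | Education Level | Job Title | Years of Experience | Salary |
|---|---|---|---|---|---|---|
| 0 | 32 | 1 | 1 | 177 | 5.0 | 90000 |
| 1 | 28 | 0 | 3 | 18 | 3.0 | 65000 |
| 2 | 45 | 1 | 4 | 145 | 15.0 | 150000 |
| 3 | 36 | 0 | 1 | 116 | 7.0 | 60000 |
| 4 | 52 | 1 | 3 | 26 | 20.0 | 200000 |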
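
The `Unnamed: 0` column in the output above looks like a row index that was written into the CSV when it was saved. The feature arrays later in this notebook have only five columns, so it was presumably dropped in a cell that is not shown. One possible way to avoid loading it as a feature is sketched below; the in-memory CSV is a stand-in built from the five rows shown above, not the real file.

```python
import io
import pandas as pd

# Stand-in for the CSV file: only the five rows shown above, not the full dataset.
csv_text = """,Age,Gender,Education Level,Job Title,Years of Experience,Salary
0,32,1,1,177,5.0,90000
1,28,0,3,18,3.0,65000
2,45,1,4,145,15.0,150000
3,36,0,1,116,7.0,60000
4,52,1,3,26,20.0,200000
"""

# index_col=0 tells pandas to use the first column as the row index
# instead of loading it as a feature named "Unnamed: 0".
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.shape)           # (rows, columns)
print(list(sample.columns))   # column names without the index column
```

With the real file, the same `index_col=0` argument could be passed to the `pd.read_csv` call above, assuming the first column really is just a saved index.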


### Getting the inputs and output

In [35]:
# iloc selects by position as [rows, columns]:
# all rows, and every column except the last one (Salary)
X = dataset.iloc[:, :-1].values
X

array([[ 32.,   1.,   1., 177.,   5.],
       [ 28.,   0.,   3.,  18.,   3.],
       [ 45.,   1.,   4., 145.,  15.],
       ...,
       [ 30.,   0.,   1.,  42.,   4.],
       [ 46.,   1.,   3.,  97.,  14.],
       [ 26.,   0.,   2., 118.,   1.]])

In [36]:
y = dataset.iloc[:,-1].values
y

array([ 90000,  65000, 150000, ...,  55000, 140000,  35000])
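
To make the `[rows, columns]` slicing concrete, here is a small sketch on a toy DataFrame. The frame `toy` and the names `X_toy` and `y_toy` are made up for illustration; the values come from the first three rows shown earlier.

```python
import pandas as pd

# Toy frame using three of the columns shown earlier.
toy = pd.DataFrame({
    "Age": [32, 28, 45],
    "Years of Experience": [5.0, 3.0, 15.0],
    "Salary": [90000, 65000, 150000],
})

X_toy = toy.iloc[:, :-1].values   # all rows, every column except the last
y_toy = toy.iloc[:, -1].values    # all rows, only the last column

print(X_toy.shape)  # 2-D feature matrix: (rows, features)
print(y_toy.shape)  # 1-D target vector: (rows,)
```

The features end up as a 2-D array and the target as a 1-D array, which is the layout scikit-learn estimators expect.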

### Creating the Training Set and the Test Set

In [37]:
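# scikit-learn (imported as sklearn) is the library
# model_selection is a module within it
# train_test_split is a function in that module
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state=0 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)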

In [38]:
X_train

array([[ 34.,   0.,   4., 114.,  11.],
       [ 45.,   1.,   4.,  20.,  18.],
       [ 25.,   1.,   2.,  83.,   1.],
       ...,
       [ 42.,   1.,   4., 169.,  12.],
       [ 55.,   0.,   1., 159.,  30.],
       [ 37.,   1.,   1.,  18.,  11.]])

In [39]:
X_test

array([[ 35.,   1.,   3., 156.,  10.],
       [ 36.,   0.,   3., 169.,  13.],
       [ 28.,   1.,   2.,  83.,   2.],
       ...,
       [ 33.,   0.,   3., 107.,  11.],
       [ 43.,   0.,   1.,  97.,  12.],
       [ 36.,   1.,   4.,  28.,  12.]])

In [40]:
y_train

array([160000, 180000,  30000, ..., 170000, 183020, 160000])

In [41]:
y_test

array([120000, 140010,  35000, ..., 198000, 120000, 170000])
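
A quick sanity check after splitting is to confirm the sizes of the two sets and that each row of features still lines up with its target. The sketch below uses a small synthetic array (not the salary data) so the result is easy to read; all `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10 rows, 2 features. Row i is [2*i, 2*i + 1] and its target is i.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)   # 80% of the rows for training, 20% for testing

# Features and targets are shuffled together, so each row keeps its own target.
print((X_tr[:, 0] // 2 == y_tr).all())
```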

## Part 2: Building and Training the Model

### Building the model

In [42]:
# linear_model is a module of scikit-learn.
# LinearRegression is a class within that module: a pre-coded blueprint
# from which linear regression model objects are created.
# model is one such object (an instance of the class).
from sklearn.linear_model import LinearRegression
model = LinearRegression()

### Training the Model

In [43]:
# fit is a method of the LinearRegression class (a function attached to the object);
# it learns the model's coefficients from the training data.
model.fit(X_train, y_train)
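
After `fit`, a `LinearRegression` object stores what it learned in the `coef_` and `intercept_` attributes. The sketch below illustrates this on synthetic data generated from a known formula, so the recovered values can be checked by eye; the data and the `demo_model` name are assumptions for illustration, not part of the salary example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 3 + 2*x1 - 1*x2, with no noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3 + 2 * X_demo[:, 0] - X_demo[:, 1]

demo_model = LinearRegression().fit(X_demo, y_demo)

print(demo_model.intercept_)  # close to 3
print(demo_model.coef_)       # close to [2, -1], one coefficient per feature
```

On the salary model, `model.coef_` would hold one coefficient per feature column, in the same order as the columns of `X`.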

### Inference

In [44]:
y_pred = model.predict(X_test)
y_pred

array([136159.57802425, 152910.05246902,  75744.25712024, ...,
       142822.89429251, 119060.28627865, 160486.22950092])

#### Making the prediction of a single data point with Age = 32, Gender = 1, Education Level = 1, Job Title = 177, Years of Experience = 5

In [48]:
model.predict([[32,1,1,177,5]])


array([85745.79735184])
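
The double brackets are needed because `predict` expects a 2-D input: a list of rows, even when there is only one row. Under the hood, a linear regression prediction is the intercept plus the dot product of the coefficients and the feature values. A minimal sketch on synthetic data (all names and numbers here are made up for illustration) shows that the two agree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with five features, mirroring the number of features used above.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 5))
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_demo = X_demo @ true_coef + 10 + rng.normal(scale=0.1, size=40)

demo_model = LinearRegression().fit(X_demo, y_demo)

x_new = np.array([[0.2, -1.0, 0.3, 1.5, 0.7]])   # 2-D: one row, five features

by_predict = demo_model.predict(x_new)
by_hand = demo_model.intercept_ + x_new @ demo_model.coef_

print(by_predict, by_hand)   # the two values should match
```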

## Part 3: Evaluating the Model

### R-Squared

In [46]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
r2

0.6706263076019943
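
R-squared compares the model's squared errors with those of a baseline that always predicts the mean: it is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand and compares it with `r2_score`; the predictions in `y_hat` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([90000, 65000, 150000, 60000, 200000])
y_hat = np.array([85000, 70000, 140000, 75000, 190000])   # made-up predictions

ss_res = np.sum((y_true - y_hat) ** 2)            # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # squared error of predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))   # the two values should match
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean.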

### Adjusted R-Squared

In [47]:
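k = X_test.shape[1]  # number of independent variables (features)
n = X_test.shape[0]  # number of observations in the test set
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)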
adj_r2

0.6693926982671703
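
Adjusted R-squared penalizes R-squared for the number of features, so adding a feature only raises it if the feature improves the fit by more than chance would. The same formula can be wrapped in a small helper, sketched below. The function name `adjusted_r2` is hypothetical, and `n = 1341` and `k = 5` are assumptions (20% of the 6,705 rows noted earlier, and five feature columns).

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k features."""
    if n - k - 1 <= 0:
        raise ValueError("n must be greater than k + 1")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# r2 is the value reported earlier; n and k are assumed as described above.
adj = adjusted_r2(0.6706263076019943, n=1341, k=5)
print(adj)
```

Because `n` is much larger than `k` here, the penalty is small and the adjusted value stays close to the plain R-squared.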