Examples

Simple Linear Regression

Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables:

  • One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.
  • The other variable, denoted y, is regarded as the response, outcome, or dependent variable.

The example proceeds as follows.

First, set the working directory and import the dataset:

getwd()
setwd("/home/chris/Documents/itt/Enero_Junio_2020/Mineria_de_datos/DataMining/MachineLearning/SimpleLinearRegression")
getwd()
dataset <- read.csv('Salary_Data.csv')
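
A quick look at the data before modeling can be taken with base R (an optional check, not part of the original script):

str(dataset)    # column types for YearsExperience and Salary
head(dataset)   # first six rows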

The dataset is split with caTools at a ratio of 2/3; the result is saved as a training set and a test set.

library(caTools)
set.seed(123)
split <- sample.split(dataset$Salary, SplitRatio = 2/3)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)

A linear model is fitted with lm(); the formula Salary ~ YearsExperience reads as "Salary as a function of YearsExperience". Fitting on the training set keeps the test set unseen:

regressor = lm(formula = Salary ~ YearsExperience,
               data = training_set)
summary(regressor)
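
The fitted line has the form Salary = b0 + b1 * YearsExperience; the intercept b0 and the slope b1 can be read directly from the model (a quick check, not part of the original script):

coef(regressor)   # (Intercept) and the YearsExperience slope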

Predictions are then made on the test set.

y_pred = predict(regressor, newdata = test_set)
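
To judge the predictions at a glance, they can be placed next to the actual salaries (an optional check, not part of the original script):

data.frame(Actual = test_set$Salary, Predicted = y_pred)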

The training data is visualized with ggplot2. The points come directly from the training set, with years of experience on the x-axis and salary on the y-axis; the line plots the model's predictions over the same x values. The points are drawn in red and the line in blue, and a title and axis labels are added.

library(ggplot2)
ggplot() +
  geom_point(aes(x = training_set$YearsExperience, y = training_set$Salary),
             color = 'red') +
  geom_line(aes(x = training_set$YearsExperience,
                y = predict(regressor, newdata = training_set)),
            color = 'blue') +
  ggtitle('Salary vs Experience (Training Set)') +
  xlab('Years of experience') +
  ylab('Salary')

Figure: Salary vs Experience (Training Set)

The same is done for the test set: the points now come from the test set, while the line is still the one fitted on the training set.

ggplot() +
  geom_point(aes(x = test_set$YearsExperience, y = test_set$Salary),
             color = 'red') +
  geom_line(aes(x = training_set$YearsExperience,
                y = predict(regressor, newdata = training_set)),
            color = 'blue') +
  ggtitle('Salary vs Experience (Test Set)') +
  xlab('Years of experience') +
  ylab('Salary')

Figure: Salary vs Experience (Test Set)

Multiple Linear Regression

Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. The objective is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable; the fitted model has the form y = b0 + b1x1 + b2x2 + ... + bnxn.

The example proceeds as follows.

First, set the working directory and import the dataset:

getwd()
setwd("/home/chris/Documents/itt/Enero_Junio_2020/Mineria_de_datos/DataMining/MachineLearning/MultipleLinearRegression")
getwd()
dataset <- read.csv('50_Startups.csv')

To work with this data, the categorical State column must be encoded as a factor with numeric labels.

dataset$State = factor(dataset$State,
                       levels = c('New York', 'California', 'Florida'),
                       labels = c(1, 2, 3))

Now the caTools library is used to split the dataset into two parts based on the Profit column: a training_set and a test_set. A SplitRatio of 0.8 means that fraction of the data goes into the training set.

library(caTools)
set.seed(123)
split <- sample.split(dataset$Profit, SplitRatio = 0.8)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)

A multiple linear regression is fitted to the training set; the formula Profit ~ . regresses Profit on all the other columns.

regressor = lm(formula = Profit ~ ., data = training_set)
summary(regressor)
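
The coefficient table shown by summary(), including the Pr(>|t|) column that the elimination below relies on, can also be extracted programmatically:

coef(summary(regressor))   # estimate, std. error, t value, Pr(>|t|) per term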

Predictions from this fit are obtained on the test set.

y_pred = predict(regressor, newdata = test_set)
y_pred

The model does not fit as well as it could, so an improved formula is created.

It can be done in two ways.

Method 1

Predictors are eliminated and checked one by one (backward elimination), verifying which fields carry the least weight and could distort the result. One predictor is removed per check: the candidate for removal is the one with the fewest significance asterisks, i.e., the highest p-value.

regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
               data = dataset)
summary(regressor)

regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
               data = dataset)
summary(regressor)

regressor = lm(formula = Profit ~ R.D.Spend + Marketing.Spend,
               data = dataset)
summary(regressor)


With this you get a better prediction.

y_pred = predict(regressor, newdata = test_set)
y_pred

Method 2

The other way is to write a function that automates the procedure just performed. The function only needs the dataset and the significance level below which a predictor counts as important.

backwardElimination <- function(x, sl) {

Inside the function, the number of columns of the dataset is obtained first.

numVars = length(x)

A for loop then iterates once per column.

for (i in c(1:numVars)) {

As before, a regression is fitted on the columns that remain.

regressor = lm(formula = Profit ~ ., data = x)

Then the largest p-value is found by extracting the coefficient table of that fit and keeping only the Pr(>|t|) column.

maxVar = max(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"])

If that value exceeds the significance level, the index j of the offending column is located and the column is dropped from x.

if (maxVar > sl) {
  j = which(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"] == maxVar)
  x = x[, -j]
}

numVars is decremented by one.

numVars = numVars - 1
}

Finally, the summary of the last fitted model, containing only the predictors that survived the significance threshold, is returned.

return(summary(regressor))
}
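
Assembled from the steps above, the whole function reads as follows; the call at the end is a usage sketch consistent with the text, using a significance level of 0.05:

backwardElimination <- function(x, sl) {
  numVars = length(x)
  for (i in c(1:numVars)) {
    # refit on the columns that remain
    regressor = lm(formula = Profit ~ ., data = x)
    # largest p-value among the predictors (row 1 is the intercept)
    maxVar = max(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"])
    if (maxVar > sl) {
      # locate and drop the least significant column
      j = which(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"] == maxVar)
      x = x[, -j]
    }
    numVars = numVars - 1
  }
  return(summary(regressor))
}

SL <- 0.05
backwardElimination(training_set, SL)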

Logistic Regression

In statistics, logistic regression is a type of regression analysis used to predict the outcome of a categorical variable (a variable that can take on a limited number of categories) based on the independent or predictor variables. It is useful for modeling the probability of an event occurring as a function of other factors: the model applies the logistic (sigmoid) function to a linear combination of the predictors.

The example shows the following

It is indicated first where the work will be done and the dataset is imported

getwd()
setwd( "/home/chris/Documents/itt/Enero_Junio_2020/Mineria_de_datos/DataM
ining/MachineLearning/LogisticRegression" )
getwd()
dataset <- read.csv( 'Social_Network_Ads.csv' )
dataset <- dataset[, 3 : 5 ]

The dataset is split with caTools at a ratio of 0.75; the result is saved as a training set and a test set.

library(caTools)
set.seed(123)
split <- sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)

The feature columns are scaled.

training_set[, 1:2] <- scale(training_set[, 1:2])
test_set[, 1:2] <- scale(test_set[, 1:2])

A logistic regression classifier is fitted to the training set.

classifier = glm(formula = Purchased ~ ., family = binomial, data = training_set)

Predictions are made on the test set: probabilities above 0.5 are mapped to class 1, otherwise to 0.

prob_pred = predict(classifier, type = 'response', newdata = test_set[-3])
prob_pred
y_pred = ifelse(prob_pred > 0.5, 1, 0)
y_pred

The confusion matrix is computed.

cm = table(test_set[, 3], y_pred)
cm
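
From the confusion matrix, the accuracy on the test set is the proportion of correct predictions (a quick check, not part of the original script):

accuracy = sum(diag(cm)) / sum(cm)
accuracy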

ggplot2 is used to show how the logistic curve looks against the points. Each plot sets different x and y aesthetics for the points, while stat_smooth() is given the fitting method and the family that the data belongs to.

library(ggplot2)
ggplot(training_set, aes(x = EstimatedSalary, y = Purchased)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)
ggplot(training_set, aes(x = Age, y = Purchased)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)
ggplot(test_set, aes(x = EstimatedSalary, y = Purchased)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)
ggplot(test_set, aes(x = Age, y = Purchased)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)

Now the decision regions are displayed. For this, the ElemStatLearn library is used.

library(ElemStatLearn)

The training set is passed to a variable

set = training_set

The area where it will be graphed is created, indicating the minimum and maximum value of the grid.

X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)

The grid columns are named to match the two predictors.

colnames(grid_set) = c('Age', 'EstimatedSalary')

A prediction is made over the whole grid with the logistic regression fitted earlier.

prob_set = predict(classifier, type = 'response', newdata = grid_set)

The values are thresholded: if they are greater than 0.5 the class is 1, otherwise 0.

y_grid = ifelse(prob_set > 0.5, 1, 0)

The plot is drawn, indicating the title, the x-axis and y-axis labels, and the limits of X and Y.

plot(set[, -3],
     main = 'Logistic Regression (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))

The decision boundary between the two regions is drawn with contour().

contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)

Points are drawn for each dataset value.

points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Figure: Logistic Regression (Training set)

The same is done for the test data: only the dataset assigned to set changes.

library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
prob_set = predict(classifier, type = 'response', newdata = grid_set)
y_grid = ifelse(prob_set > 0.5, 1, 0)
plot(set[, -3],
     main = 'Logistic Regression (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Figure: Logistic Regression (Test set)

KNN

K-Nearest Neighbors (K-NN) is a supervised, instance-based machine learning algorithm. It can be used to classify new samples (discrete values) or to predict (regression, continuous values).

The example proceeds as follows.

The dataset is imported and only columns 3 through 5 are kept.

dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]

The target feature is encoded as a factor.

dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))

The dataset is split into a training set and a test set with caTools.

library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

The feature columns are scaled.

training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])

K-NN is fitted to the training set and the test set results are predicted in a single call.

library(class)
y_pred = knn(train = training_set[, -3], test = test_set[, -3],
             cl = training_set[, 3], k = 5, prob = TRUE)
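
Because prob = TRUE was passed, the proportion of neighbor votes behind each prediction can be read from the prob attribute of the result:

attr(y_pred, "prob")   # winning-class vote share for each test point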

Creating the confusion matrix.

cm = table(test_set[, 3 ], y_pred)

Now the decision regions are displayed. For this, the ElemStatLearn library is used.

library(ElemStatLearn)

The training set is passed to a variable

set = training_set

The area where it will be graphed is created, indicating the minimum and maximum value of the grid.

X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)

The grid columns are named to match the two predictors.

colnames(grid_set) = c('Age', 'EstimatedSalary')

knn() is run again to classify every point of the grid, using the training set as the reference.

y_grid = knn(train = training_set[, -3], test = grid_set,
             cl = training_set[, 3], k = 5)

The plot is drawn, indicating the title, the x-axis and y-axis labels, and the limits of X and Y.

plot(set[, -3],
     main = 'K-NN (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))

The decision boundary between the two regions is drawn with contour().

contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)

Points are drawn for each dataset value

points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Figure: K-NN (Training set)

To visualize the test dataset, the same procedure is performed.

library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = knn(train = training_set[, -3], test = grid_set,
             cl = training_set[, 3], k = 5)
plot(set[, -3],
     main = 'K-NN (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Figure: K-NN (Test set)

Decision Tree Classification

First, set the working directory:

getwd()
setwd("C:/Users/Hp/Downloads/DataMining-master/MachineLearning/DesicionThree")
getwd()

Importing the dataset

dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]

Encoding the target feature as factor

dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))

Splitting the dataset into the Training set and Test set

library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

Feature Scaling

training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])

Fitting Decision Tree Classification to the Training set

library(rpart)
classifier = rpart(formula = Purchased ~ .,
                   data = training_set)

Predicting the Test set results

y_pred = predict(classifier, newdata = test_set[-3], type = 'class')
y_pred

Making the Confusion Matrix

cm = table(test_set[, 3], y_pred)
cm

Visualising the Training set results

library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3],
     main = 'Decision Tree Classification (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Figure: Decision Tree Classification (Training set)

Visualising the Test set results

library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3], main = 'Decision Tree Classification (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Figure: Decision Tree Classification (Test set)

Plotting the tree

plot(classifier)
text(classifier, cex = 0.6)

Figure: the fitted decision tree
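
The base plot can be hard to read; as an optional alternative (not part of the original script), the rpart.plot package draws a cleaner tree:

install.packages('rpart.plot')
library(rpart.plot)
rpart.plot(classifier)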

Random Forest

First, set the working directory:

getwd()
setwd("C:/Users/Hp/Downloads/DataMining-master/MachineLearning/RandomForest")
getwd()

Importing the dataset

dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]

Encoding the target feature as factor

dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))

Splitting the dataset into the Training set and Test set

install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

Feature Scaling

training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])

Fitting Random Forest Classification to the Training set

install.packages('randomForest')
library(randomForest)
set.seed(123)
classifier = randomForest(x = training_set[-3],
                          y = training_set$Purchased,
                          ntree = 10)

Predicting the Test set results

y_pred = predict(classifier, newdata = test_set[-3])
y_pred

Making the Confusion Matrix

cm = table(test_set[, 3], y_pred)
cm

Visualising the Training set results

library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, grid_set)
plot(set[, -3],
     main = 'Random Forest Classification (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Figure: Random Forest Classification (Training set)

Visualising the Test set results

library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, grid_set)
plot(set[, -3], main = 'Random Forest Classification (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Figure: Random Forest Classification (Test set)

Choosing the number of trees

Plotting the classifier shows how the error evolves as trees are added.

plot(classifier)
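
The forest can also report how much each predictor contributes; importance() and varImpPlot() from randomForest show variable importance (an optional check, not part of the original script):

importance(classifier)   # mean decrease in Gini per predictor
varImpPlot(classifier)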

SVM

Importing the dataset

dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]

Encoding the target feature as factor

dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))

Splitting the dataset into the Training set and Test set

library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

Feature Scaling

training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])

Fitting SVM to the Training set

install.packages('e1071')
library(e1071)
classifier = svm(formula = Purchased ~ .,
                 data = training_set,
                 type = 'C-classification',
                 kernel = 'linear')
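
To inspect the fitted model, for example the kernel used and the number of support vectors, summary() can be called on the classifier (an optional check):

summary(classifier)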

Predicting the Test set results

y_pred = predict(classifier, newdata = test_set[-3])
y_pred

Making the Confusion Matrix

cm = table(test_set[, 3], y_pred)
cm

Visualising the Training set results

library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3],
     main = 'SVM (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Figure: SVM (Training set)

Visualising the Test set results

library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3], main = 'SVM (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Figure: SVM (Test set)

K-Means

First, set the working directory:

getwd()
setwd("C:/Users/Hp/Downloads/DataMining-master/MachineLearning/K-Means")
getwd()

Importing the dataset

dataset = read.csv('Mall_Customers.csv')
dataset = dataset[4:5]

Using the elbow method to find the optimal number of clusters

set.seed(6)
wcss = vector()
for (i in 1:10) wcss[i] = sum(kmeans(dataset, i)$withinss)
plot(1:10,
     wcss,
     type = 'b',
     main = paste('The Elbow Method'),
     xlab = 'Number of clusters',
     ylab = 'WCSS')

Fitting K-Means to the dataset

set.seed(29)
kmeans = kmeans(x = dataset, centers = 5)
y_kmeans = kmeans$cluster
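
The centroid of each cluster is stored in the centers component of the kmeans object (a quick check, not part of the original script):

kmeans$centers   # one row per cluster: mean Annual Income and Spending Score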

Visualising the clusters

install.packages('cluster')
library(cluster)
clusplot(dataset,
         y_kmeans,
         lines = 0,
         shade = TRUE,
         color = TRUE,
         labels = 2,
         plotchar = FALSE,
         span = TRUE,
         main = paste('Clusters of customers'),
         xlab = 'Annual Income',
         ylab = 'Spending Score')
         

Figure: clusters of customers