# Exploratory Data Analysis Using R
## Sukhjit Singh Sehra
### Guru Nanak Dev Engineering College, Ludhiana

Exploratory data analysis with visualization is a useful way to detect patterns and anomalies in data.
A visualization gives a succinct, holistic view of the data that may be difficult to grasp from the
numbers and summaries alone. For example, the variables x and y of the data frame data, created below,
can be visualized in a scatterplot, which readily depicts the relationship between the two variables.

In [None]:
# simulate 50 points where y depends linearly on x, plus random noise
x <- rnorm(50)
y <- x + rnorm(50, mean=0, sd=0.5)
data <- data.frame(x, y)


In [None]:
library(ggplot2)
ggplot(data, aes(x=x, y=y)) +
  geom_point(size=2) +
  ggtitle("Scatterplot of X and Y") +
  theme(axis.text = element_text(size=12),
        axis.title = element_text(size=14),
        plot.title = element_text(size=20, face="bold"))


## Loading the Built-in anscombe Data Set

In [None]:
data(anscombe)

## Visualization of Data Before Analysis

In [None]:
summary(anscombe)

In [None]:
# generates levels to indicate which group each data point belongs to
levels <- gl(4, nrow(anscombe))


In [None]:
levels

In [None]:
# Stack the four x/y pairs of anscombe into one long data frame
mydata <- with(anscombe, data.frame(x=c(x1,x2,x3,x4), y=c(y1,y2,y3,y4),
                                    mygroup=levels))


In [None]:
# Make scatterplots using the ggplot2 package
library(ggplot2)
# set plot color theme
theme_set(theme_bw())



In [None]:
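# create the four plots, one panel per group, each with a fitted regression line
# (fill=NA hides the confidence band around the line)
ggplot(mydata, aes(x,y)) +
  geom_point(size=4) +
  geom_smooth(method="lm", fill=NA, fullrange=TRUE) +
  facet_wrap(~mygroup)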
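
The fitted lines look alike in all four panels because the four data sets share almost the same summary statistics. The following is a minimal sketch to check this numerically; the object name `stats` and the column labels `set1` to `set4` are chosen here purely for illustration.

```r
# Anscombe's quartet ships with base R (datasets package)
data(anscombe)

# For each of the four x/y pairs, compute the usual summary statistics
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    var_x = var(x), var_y = var(y),
    cor_xy = cor(x, y))
})
colnames(stats) <- paste0("set", 1:4)

# One column per data set; the rows are close to identical across columns
print(round(stats, 2))
```

Seen as numbers alone, the four sets are hard to tell apart, which is why plotting them before any analysis matters.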


### Dirty Data

This section shows how dirty data can be detected in the data exploration phase with visualizations. In general, analysts should look for anomalies, verify the data with domain knowledge, and decide on the most appropriate approach to clean the data.
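
Before plotting, two quick numeric checks often point to the same problems: counting missing values and flagging values outside a plausible range. The sketch below uses a small made-up data frame named `people` (hypothetical, not the CSV file used in this section) with a `Height_cm` column, and assumes that heights between 50 and 250 cm are plausible; both the values and the range are assumptions for illustration.

```r
# Hypothetical sample; the values are made up for illustration only
people <- data.frame(
  Height_cm = c(172, 165, NA, 17.8, 180, 1650, 158)
)

# Count the missing values in each column
na_counts <- colSums(is.na(people))
print(na_counts)

# Row positions of values outside the assumed plausible range (50 to 250 cm);
# which() silently skips NA, so missing values are not reported here
out_of_range <- which(people$Height_cm < 50 | people$Height_cm > 250)
print(people[out_of_range, , drop = FALSE])
```

Whether a flagged value is a unit mix-up, a typing error, or a real measurement is a question for domain knowledge, not for the code.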


In [None]:
dirtydata <- read.csv(file='sampledatasets/internationaldirtydata.csv', header=TRUE, sep=',')

In [None]:
summary(dirtydata)
head(dirtydata)

In [None]:
hist(dirtydata$Height_cm, breaks=50, main="Height Distribution of Account Holders",
     xlab="Height (cm)", ylab="Frequency", col="gray")


In [None]:
is.na(dirtydata$Importance_Internet_access)

In [None]:
# mean() returns NA if the column contains any missing value
mean(dirtydata$Importance_Internet_access)

In [None]:
# na.rm=TRUE drops the missing values before averaging
mean(dirtydata$Importance_Internet_access, na.rm=TRUE)
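
The effect of `na.rm` is easier to see on a tiny vector. This is a minimal sketch with a made-up vector `scores` (a hypothetical name and hypothetical values, not taken from the data set):

```r
# Hypothetical ratings with one missing value
scores <- c(4, 5, NA, 3)

# How many values are missing?
n_missing <- sum(is.na(scores))

# A single NA makes the plain mean NA
m_all <- mean(scores)

# With na.rm = TRUE the mean is taken over the three observed values
m_clean <- mean(scores, na.rm = TRUE)

print(c(n_missing = n_missing, m_all = m_all, m_clean = m_clean))
```

Note that `na.rm = TRUE` changes the denominator as well: the mean is computed over the observed values only.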

## Use na.exclude() to Remove Rows with Missing Values

In [5]:
DF <- data.frame(x = c(1, 2, 3), y = c(10, 20, NA))

In [None]:
DF1 <- na.exclude(DF)
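
To see what `na.exclude()` actually did, it helps to inspect the result. The sketch below repeats the small example so that it runs on its own; `dropped` and `kept` are names introduced here for illustration.

```r
DF <- data.frame(x = c(1, 2, 3), y = c(10, 20, NA))

# na.exclude() drops every row that contains at least one NA
DF1 <- na.exclude(DF)
print(DF1)

# The positions of the dropped rows are stored in the "na.action" attribute
dropped <- as.integer(attr(DF1, "na.action"))
print(dropped)

# complete.cases() gives the same rows through plain logical indexing
kept <- DF[complete.cases(DF), ]
print(kept)
```

For a data frame, `na.omit()` removes the same rows; the two differ mainly in how modelling functions later pad residuals and predictions.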

In [None]:
# check whether the value 170 occurs anywhere in Height_cm
is.element(170, dirtydata$Height_cm)

## Remove Particular Values from a Vector

In [None]:
# The %in% operator tells you which elements are among the numbers to remove:

a <- sample(1:10)
a
remove <- c(2, 3, 5)
a %in% remove
a[!a %in% remove]
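
Because `sample()` shuffles the values differently on each run, the following sketch uses a fixed vector so the outcome is easy to follow. It also contrasts `%in%` with `setdiff()`, an alternative that the text above does not use; the vectors and the name `kept` are made up for illustration.

```r
# Fixed example vector (contains a repeated 2)
a <- c(7, 2, 9, 3, 2, 5, 10)
remove <- c(2, 3, 5)

# Logical filtering keeps order and keeps any duplicates that survive
kept <- a[!a %in% remove]
print(kept)

# setdiff() also removes the unwanted values, but it returns unique values only
with_dups <- c(7, 7, 2)
print(with_dups[!with_dups %in% remove])  # both 7s are kept
print(setdiff(with_dups, remove))         # a single 7
```

The `%in%` form is usually the safer choice when repeated values carry meaning, as they often do in a data column.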
