# Exploratory Data Analysis of European Sales Dataset

In [None]:
library(corrplot)

In [None]:
# Load the data and look at the first five records.

In [None]:
europeanSalesData <- read.csv("EuropeanSales.csv", header = TRUE)

head(europeanSalesData,5)

In [None]:
# Summarize the dataset

In [None]:
summary(europeanSalesData)

In [None]:
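# Correlation between attributes

# The first column is dropped before computing correlations (presumably a non-numeric identifier column).
corrs <- cor(europeanSalesData[, -1])
corrplot(corrs, type = "full", method = "ellipse", tl.col = "black", tl.srt = 30, addCoef.col = "black")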

![Correlation matrix plot](attachment:Ads%C4%B1z.png)

In [None]:
# According to the correlation matrix, SalesPerCapita is most correlated with GDPperHead and EducationSpending.

# ComputerSales is most correlated with Population.
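Reading the strongest correlates off the plot can also be done programmatically. The sketch below is one possible way to do it: `top_correlates` is a hypothetical helper (not part of corrplot or base R), and the `toy` data frame holds made-up values used only to show the mechanics.

```r
# Hypothetical helper: rank the variables most correlated with a target column.
top_correlates <- function(corr_matrix, target, n = 2) {
  # Correlations of every variable with the target, excluding the target itself
  r <- corr_matrix[, target]
  r <- r[names(r) != target]
  # Rank by absolute strength but keep the sign in the result
  r[order(abs(r), decreasing = TRUE)][seq_len(min(n, length(r)))]
}

# Made-up toy data, only to illustrate the call
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 10.5),
  c = c(5, 1, 4, 2, 3)
)

top_correlates(cor(toy), "a", n = 2)
```

With the notebook's own matrix, the equivalent calls would be `top_correlates(corrs, "SalesPerCapita")` and `top_correlates(corrs, "ComputerSales")`.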

In [None]:
# SalesPerCapita

# We start with the kitchen sink model for SalesPerCapita, i.e. a model that uses all features.
# Fit the model with all attributes as predictors.

In [None]:
model <- lm(SalesPerCapita ~ Population + GDPperHead + UnemploymentRate + EducationSpending + ComputerSales, data=europeanSalesData)
summary(model)

In [None]:
# It seems the attributes do not show a strong relationship with SalesPerCapita. Adjusted R-squared is not bad, but the t-values are not significant for the kitchen sink model.

# When we try other combinations, eliminating insignificant features first (such as UnemploymentRate),
# we can see that the best model for SalesPerCapita includes only 2 features
# (GDPperHead and EducationSpending are enough to explain nearly 50% of the variance).

In [None]:
model <- lm(SalesPerCapita ~ GDPperHead + EducationSpending, data=europeanSalesData)
summary(model)

In [None]:
# Adjusted R-squared (above 46%) is still close to that of the kitchen sink model, and
# the t-values are more significant than in the kitchen sink model. When the scores are similar, we prefer the simpler model
# (this one includes only 2 features).

# So the result is SalesPerCapita ~ GDPperHead + EducationSpending
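The choice between a full and a reduced model can be checked more formally than by eyeballing the two summaries. The following is a minimal sketch on synthetic data (the `toy` data frame and its columns `x1`, `x2`, `noise`, `y` are invented for illustration): it compares adjusted R-squared and runs a partial F-test with `anova()` on the nested models.

```r
# Synthetic data: y depends on x1 and x2, while `noise` is unrelated to y
set.seed(42)
n <- 40
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + 0.5 * toy$x2 + rnorm(n)

full    <- lm(y ~ x1 + x2 + noise, data = toy)  # "kitchen sink" on the toy data
reduced <- lm(y ~ x1 + x2, data = toy)          # candidate simpler model

# Adjusted R-squared penalizes extra predictors, so it is comparable across model sizes
adj_r2 <- c(full    = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)

# Partial F-test: does the full model fit significantly better than the reduced one?
f_test  <- anova(reduced, full)
p_value <- f_test[["Pr(>F)"]][2]

print(adj_r2)
print(p_value)
```

A large p-value here means the dropped terms do not improve the fit significantly, which supports keeping the simpler model. On the notebook's data, the same comparison would be run on the kitchen sink fit and the two-feature fit, stored under different names.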

In [None]:
# ComputerSales

# We start with the kitchen sink model for ComputerSales, i.e. a model that uses all features.
# Fit the model with all attributes as predictors.

In [None]:
model <- lm(ComputerSales ~ Population + GDPperHead + UnemploymentRate + EducationSpending + SalesPerCapita, data=europeanSalesData)
summary(model)

In [None]:
# It seems some attributes do not show a strong relationship with ComputerSales.
# Adjusted R-squared is not bad (about 70%), but some t-values are not significant for the kitchen sink model.

# We eliminate features one at a time, starting with the least significant one (UnemploymentRate).

In [None]:
model <- lm(ComputerSales ~ Population + GDPperHead + EducationSpending + SalesPerCapita, data=europeanSalesData)
summary(model)

In [None]:
# Next, eliminate the least significant remaining feature (EducationSpending).
model <- lm(ComputerSales ~ Population + GDPperHead + SalesPerCapita, data=europeanSalesData)
summary(model)

In [None]:
# Next, eliminate the least significant remaining feature (GDPperHead).
model <- lm(ComputerSales ~ Population + SalesPerCapita, data=europeanSalesData)
summary(model)

In [None]:
# We can see that the best model for ComputerSales includes only 2 features (Population and SalesPerCapita).

# Adjusted R-squared is higher than in the previous models, and all features are statistically significant.

# So the result is ComputerSales ~ Population + SalesPerCapita
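The manual elimination steps above follow a backward-elimination pattern that can be sketched as a loop. This is an illustrative sketch, not the procedure actually run in this notebook: `backward_eliminate` is a hypothetical helper, the 0.05 threshold is an assumed cutoff, all predictors are assumed to be numeric, and the data is synthetic.

```r
# Hypothetical helper: repeatedly drop the predictor with the largest p-value
# until every remaining predictor is significant at level `alpha`.
# Assumes all predictors are numeric (so coefficient names match column names).
backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    # p-values of the predictors only (intercept row excluded)
    p <- coefs[rownames(coefs) != "(Intercept)", "Pr(>|t|)", drop = FALSE]
    worst <- which.max(p[, 1])
    # Stop when the weakest predictor is significant or only one predictor remains
    if (p[worst, 1] <= alpha || length(predictors) == 1) break
    predictors <- setdiff(predictors, rownames(p)[worst])
  }
  fit
}

# Synthetic data: y depends strongly on x1; x2 and x3 are unrelated to y
set.seed(7)
n <- 50
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 3 * toy$x1 + rnorm(n)

final <- backward_eliminate(toy, "y")
summary(final)
```

On the notebook's data, the call would look like `backward_eliminate(europeanSalesData[, -1], "ComputerSales")`. Purely p-value-driven selection can overfit on small samples, so the resulting model is best treated as a candidate to sanity-check rather than a final answer.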