<a href="https://colab.research.google.com/github/apmire3/curly-parakeet/blob/master/Allen_Mire_copy_data_R_practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Practice session
===
Data Analysis in R

##Warming up: review working folder basics
*  R works best if you have a dedicated folder for each separate project - the working folder. Put all data files in the working folder (or in subfolders).


Show current working folder:

In [0]:
getwd()

Create a new folder:

In [0]:
dir.create("data")

Go to the new folder:

In [0]:
setwd("data")

Show current working folder:

In [0]:
getwd()
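
The same steps can be sketched in a way that is safe to re-run. This is only an illustration: the folder name `demo_project` and the use of `tempdir()` are assumptions, not part of the exercise.

```r
old_wd <- getwd()                              # remember the starting folder
proj <- file.path(tempdir(), "demo_project")   # hypothetical project folder
dir.create(proj, showWarnings = FALSE)         # no warning if it already exists
setwd(proj)                                    # go to the new folder
current <- basename(getwd())                   # name of the current working folder
setwd(old_wd)                                  # return to the starting folder
```

Building paths with `file.path()` avoids hard-coding the path separator, and restoring the old working folder keeps later cells unaffected.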

##Case Study: 2019 Forbes Global list
*   The `forbes` dataset consists of 2000 rows (observations) on 8 variables describing companies’ rank, name, country, category, sales, profits, assets and market value. 
http://www.hpc.lsu.edu/training/weekly-materials/Downloads/Forbes2000_2019.csv.zip
> * **`rank`** the ranking of the company
> * **`name`** the name of the company
> * **`country`** the country the company is situated in
> * **`category`** the products the company produces
> * **`sales`** the amount of sales of the company in billion USD
> * **`profits`** the profit of the company in billion USD
> * **`assets`** the assets of the company in billion USD
> * **`marketvalue`** the market value of the company in billion USD


# Step by step Data Analysis in R


1. Get data
2. Read data
3. Inspect data
4. Preprocess data (missing and dubious values, discard columns not needed etc.)
5. Analyze data
6. Generate report







## 1. Getting Data


Raw data 
* http://www.hpc.lsu.edu/training/weekly-materials/Downloads/Forbes2000_2019.csv.zip

* Choose one of the following ways to get the raw data file into the working folder:
> * Download the file manually, then upload it to Colab
> * or use the R function `download.file()`


In [0]:
# Fill the blanks in the download command
download.file("http://www.hpc.lsu.edu/training/weekly-materials/Downloads/Forbes2000.csv.zip", "Forbes2000.csv.zip")
list.files() # List files in current folder

* Unzip with the `unzip()` function

In [0]:
# Fill the blanks in the unzip command
unzip("Forbes2000.csv.zip","Forbes2000.csv")
list.files()   # List files in current folder
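
When a notebook is re-run, it can help to skip the download if the file is already there. The helper below is a minimal sketch; `fetch_if_missing()` is a hypothetical name introduced here, not a base R function.

```r
# Hypothetical helper: download a file only when it is not already present
fetch_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" writes the file in binary mode, which matters for zip archives on Windows
    download.file(url, destfile, mode = "wb")
  }
  invisible(destfile)
}

# Usage (commented out because it needs network access):
# fetch_if_missing("http://www.hpc.lsu.edu/training/weekly-materials/Downloads/Forbes2000.csv.zip",
#                  "Forbes2000.csv.zip")
# unzip("Forbes2000.csv.zip", list = TRUE)   # list the archive contents without extracting
```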

##2. Reading data
* R understands many different data formats and has lots of ways of reading/writing them (csv, xml, excel, sql, json etc.)

>Input | Output | Purpose
>--- | --- | ---
>read.table (read.csv) | write.table (write.csv) | for reading/writing tabular data
>readLines | writeLines | for reading/writing lines of a text file
>source | dump | for reading/writing in R code files
>dget | dput | for reading/writing in R code files
>load | save | for reading in/saving workspaces

* `read.csv()` is a wrapper around `read.table()` with defaults suited to comma-separated files, e.g. `sep = ","` and `header = TRUE`.

In [0]:
# Fill the blank in the read.csv command
forbes <- read.csv("Forbes2000.csv",header=T,stringsAsFactors = FALSE,na.strings ="NA",sep=",")
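
To see what the `na.strings` argument does without needing the data file, here is a small sketch that reads CSV content from a string through the `text` argument. The three-row table is made up for illustration.

```r
# Made-up CSV content; "?" and "NA" both stand for missing values here
csv_text <- "name,sales,profits
Alpha,10.5,1.2
Beta,?,0.4
Gamma,7.0,NA"

toy <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE,
                na.strings = c("NA", "?"))
str(toy)
n_missing_sales <- sum(is.na(toy$sales))       # the "?" was read as NA
n_missing_profits <- sum(is.na(toy$profits))   # the "NA" was read as NA
```

Because `"?"` is listed in `na.strings`, the `sales` column is still read as numeric rather than as character.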

##3. Inspecting data
* `class()`: check object class
* `dim()`: dimension of the data
* `head()`: print on screen the first few lines of data, may use n as argument
* `tail()`: print the last few lines of data
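
A quick sketch of these functions on a made-up three-row data frame (the data are hypothetical and only serve to show the calls):

```r
toy <- data.frame(rank = 1:3,
                  name = c("Alpha", "Beta", "Gamma"),
                  sales = c(10.5, 8.2, 7.0),
                  stringsAsFactors = FALSE)
class(toy)                    # object class
dim(toy)                      # number of rows, number of columns
first_two <- head(toy, n = 2) # first two rows
last_one <- tail(toy, n = 1)  # last row
```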

In [0]:
# Fill the blanks in the following commands
class(forbes)
dim(forbes)
head(forbes,n=50)

| | rank | name | country | category | sales | profits | assets | marketvalue |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1 | Citigroup | United States | Banking | 94.71 | 17.85 | 1264.03 | 255.3 |
| 2 | 2 | General Electric | United States | Conglomerates | 134.19 | 15.59 | 626.93 | 328.54 |
| 3 | 3 | American Intl Group | United States | Insurance | 76.66 | 6.46 | 647.66 | 194.87 |
| 4 | 4 | ExxonMobil | United States | Oil & gas operations | 222.88 | 20.96 | 166.99 | 277.02 |
| 5 | 5 | BP | United Kingdom | Oil & gas operations | 232.57 | 10.27 | 177.57 | 173.54 |
| 6 | 6 | Bank of America | United States | Banking | 49.01 | 10.81 | 736.45 | 117.55 |
| 7 | 7 | HSBC Group | United Kingdom | Banking | 44.33 | 6.66 | 757.6 | 177.96 |
| 8 | 8 | Toyota Motor | Japan | Consumer durables | 135.82 | 7.99 | 171.71 | 115.4 |
| 9 | 9 | Fannie Mae | United States | Diversified financials | 53.13 | 6.48 | 1019.17 | 76.84 |
| 10 | 10 | Wal-Mart Stores | United States | Retailing | 256.33 | 9.05 | 104.91 | 243.74 |


In [0]:
# Fill the blank in the summary command
summary(forbes)

      rank            name             country            category        
 Min.   :   1.0   Length:2000        Length:2000        Length:2000       
 1st Qu.: 500.8   Class :character   Class :character   Class :character  
 Median :1000.5   Mode  :character   Mode  :character   Mode  :character  
 Mean   :1000.5                                                           
 3rd Qu.:1500.2                                                           
 Max.   :2000.0                                                           
                                                                          
     sales            profits             assets          marketvalue    
 Min.   :  0.010   Min.   :-25.8300   Min.   :   0.270   Min.   :  0.02  
 1st Qu.:  2.018   1st Qu.:  0.0800   1st Qu.:   4.025   1st Qu.:  2.72  
 Median :  4.365   Median :  0.2000   Median :   9.345   Median :  5.15  
 Mean   :  9.697   Mean   :  0.3811   Mean   :  34.042   Mean   : 11.88  
 3rd Qu.:  9.547   3rd Qu.:  0

##4. Preprocess data 


### 4.1 Preprocessing - missing values
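* Missing values are denoted in R by `NA`; `NaN` ("not a number") marks the result of an undefined mathematical operation such as `0/0`.
> * `is.na(x)` tests each element of `x` and returns `TRUE` where the value is missing (`NaN` also counts as missing)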
> * Which one is NA? `which(is.na(x))`
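
A minimal sketch on a made-up vector shows how these tests behave:

```r
x <- c(1.5, NA, 3.0, 0/0)    # 0/0 evaluates to NaN
na_flags <- is.na(x)         # TRUE for both NA and NaN
nan_flags <- is.nan(x)       # TRUE only for NaN
na_positions <- which(na_flags)   # positions of the missing values
```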

In [0]:
Use one of the commands in the last session to 

In [0]:
# Fill the blank to find out which ones are NAs on Sales
which(is.na(forbes$sales))

* more about missing value inspection
> * How many NAs? `table(is.na(x))`
> * List the observations with missing values on a variable, e.g. profits: `x[is.na(x$profits), ]`
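
The difference between counting and listing missing values can be sketched on a small made-up data frame (the company names and values are hypothetical):

```r
toy <- data.frame(name = c("Alpha", "Beta", "Gamma", "Delta"),
                  profits = c(1.2, NA, 0.4, NA),
                  stringsAsFactors = FALSE)
na_counts <- table(is.na(toy$profits))   # how many values are missing / present
na_rows <- toy[is.na(toy$profits), ]     # the observations with missing profits
na_per_column <- colSums(is.na(toy))     # missing values in every column at once
```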



In [0]:
# Fill the blank to find out how many NAs on assets
table(is.na(forbes$assets))
# Fill the blank to find out how many NAs on profits
table(is.na(forbes$profits))


FALSE 
 2000 


FALSE  TRUE 
 1995     5 

* remember many R functions also have a logical “`na.rm`” option
> * `na.rm=TRUE` means the NA values should be discarded


In [0]:
#Calculate the mean value of profits
mean(forbes$profits)  # will get NA
mean(forbes$profits, na.rm=T)

* **Note: Not all missing values are marked with “NA” in the raw data!**


* The simplest way to deal with the missing values is to remove them. 
> * If a column (variable) has a high percentage of the missing value, remove the whole column or just don’t use it for the analysis.
> * If a row (observation) has a missing value, remove the row with `na.omit()`. e.g. 


In [0]:
# remove the observations with missing values, save it to object "forbes2"
forbes2  <- na.omit(forbes)
# find out the dimensions of forbes2
dim(forbes2)
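
As a small illustration of what `na.omit()` removes, the sketch below uses a made-up data frame. `complete.cases()` gives the logical vector behind the same selection.

```r
toy <- data.frame(sales = c(10.5, NA, 7.0, 3.3),
                  profits = c(1.2, 0.4, NA, 0.1))
toy_complete <- na.omit(toy)    # drops every row that contains at least one NA
keep <- complete.cases(toy)     # TRUE for rows without any NA
dim(toy_complete)
```

Note that a row is dropped when any of its columns is missing, so removing rows can discard far more data than the NA count of a single column suggests.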

* Alternatively, the missing values can be replaced by basic statistics e.g. 
> * replace by mean 


In [0]:
# Replace the missing values in each column with that column's mean
for (col in c("profits", "sales", "assets")) {
  missing <- is.na(forbes[[col]])
  forbes[[col]][missing] <- mean(forbes[[col]], na.rm = TRUE)
}
dim(forbes)

###4.2 Preprocessing - subsetting data
* In most cases we do not need all of the raw data
* There are a number of methods of extracting a subset of R objects
* Subsetting data can be done either by row or by column 


#### 4.2.1 Subsetting by row: use conditions
Fill the blanks to find all companies with negative profit:


In [0]:
forbes[forbes$profits < 0,c("name","sales","profits","assets")]

| | name | sales | profits | assets |
| --- | --- | --- | --- | --- |
| | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 350 | Allianz Worldwide | 96.88 | -1.23 | 851.24 |
| 354 | Vodafone | 47.99 | -15.51 | 256.28 |
| 364 | Deutsche Telekom | 56.40 | -25.83 | 132.01 |
| 372 | Credit Suisse Group | 38.01 | -2.40 | 683.44 |
| 374 | France Telecom | 57.99 | -21.78 | 107.86 |
| 382 | Generali Group | 57.90 | -0.79 | 239.21 |
| 396 | Sumitomo Mitsui Financial | 29.17 | -3.94 | 868.42 |
| 397 | E.ON | 37.95 | -0.73 | 115.57 |
| 398 | Mitsubishi Tokyo Finl | 20.65 | -1.37 | 827.48 |
| 402 | Aviva | 52.46 | -0.86 | 287.58 |


Fill the blanks to find the number of companies in each country with profits above 5 billion US dollars

In [0]:
forbes3 <- forbes[forbes$profits > 5, c("name","country","profits")]
table(forbes3[,"country"])


                      China                      France 
                          1                           1 
                    Germany                       Japan 
                          1                           1 
Netherlands/ United Kingdom                 South Korea 
                          1                           1 
                Switzerland              United Kingdom 
                          3                           3 
              United States 
                         20 

Find the three companies with the largest sales volume:


In [0]:
companies <- forbes$name  # or companies <- forbes[,"name"]
#Note to self: This works because order returns the INDEX of the rows so when
#you call order_sales later it gives the index of the companies to select 
order_sales <- order(forbes$sales, decreasing=T)
#company names
companies[order_sales[1:3]]
#company sales
head(sort(forbes$sales,decreasing=T),n=3)
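
The difference between `order()` and `sort()` is easy to see on a tiny made-up example: `order()` returns positions, which can then index any parallel vector, while `sort()` returns the sorted values themselves.

```r
sales <- c(30, 10, 20)
companies <- c("Alpha", "Beta", "Gamma")          # hypothetical names
idx <- order(sales, decreasing = TRUE)            # positions of the values, largest first
ranked_names <- companies[idx]                    # names in order of decreasing sales
sorted_sales <- sort(sales, decreasing = TRUE)    # the sorted values
```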

Fill the blanks below to find first 50 companies in the Forbes dataset with the highest profit


In [0]:
companies <- forbes$name   
order_profit <- order(forbes$profits, decreasing=T)   # order() returns the indices of the vector in sorted order
companies[order_profit[1:50]]
head(sort(forbes$profits,decreasing=T),n=50)

####4.2.2 Subsetting by row: use `subset()` function
Fill the blanks below to find all German companies with negative profit


In [0]:
#My solution
#Germanycomp <- subset(forbes, forbes$country == "Germany" & forbes$profits < 0)
#Germanycomp[  ,c("name","sales","profits","assets")]
Germanycomp <- subset(forbes, forbes$country == "Germany")
Germanycomp[Germanycomp$profits < 0  ,c("name","sales","profits","assets")]

| | name | sales | profits | assets |
| --- | --- | --- | --- | --- |
| | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 350 | Allianz Worldwide | 96.88 | -1.23 | 851.24 |
| 364 | Deutsche Telekom | 56.4 | -25.83 | 132.01 |
| 397 | E.ON | 37.95 | -0.73 | 115.57 |
| 431 | HVB-HypoVereinsbank | 40.52 | -0.87 | 705.36 |
| 500 | Commerzbank | 22.43 | -0.31 | 437.86 |
| 798 | Infineon Technologies | 7.18 | -0.51 | 11.79 |
| 869 | BHW Holding | 7.46 | -0.38 | 117.96 |
| 926 | Bankgesellschaft Berlin | 9.43 | -0.74 | 182.69 |
| 1034 | W&W-Wustenrot | 7.57 | -0.08 | 56.44 |
| 1187 | mg technologies | 8.54 | -0.13 | 6.45 |
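
`subset()` can also apply the row condition and the column selection in a single call. The following is a minimal sketch on made-up data, not the only way to write the exercise:

```r
toy <- data.frame(name = c("Alpha", "Beta", "Gamma"),
                  country = c("Germany", "Germany", "Japan"),
                  profits = c(-0.5, 1.1, -2.0),
                  stringsAsFactors = FALSE)
# Column names can be used directly inside subset(), without the toy$ prefix
losses <- subset(toy, country == "Germany" & profits < 0,
                 select = c(name, profits))
```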


##Solution
Find all German companies with negative profit

In [0]:
Germanycomp <- subset(forbes, country == "Germany")
Germanycomp[Germanycomp$profits < 0,c("name","sales","profits","assets")]

Find the number of companies in each country with profits above 5 billion US dollars

In [0]:
forbes3 <- forbes[forbes$profits > 5,c("name","country","profits")]
table(forbes3[,"country"])


                      China                      France 
                          1                           1 
                    Germany                       Japan 
                          1                           1 
Netherlands/ United Kingdom                 South Korea 
                          1                           1 
                Switzerland              United Kingdom 
                          3                           3 
              United States 
                         20 

Find first 50 companies in the Forbes dataset with the highest profit


In [0]:
companies <- forbes$name   
order_profit <- order(forbes$profits, decreasing=T)   # order() returns the indices of the vector in sorted order
companies[order_profit[1:50]]
head(sort(forbes$profits,decreasing=T),n=50)

####4.2.3 Subsetting by column
Create another dataframe with only numeric variables (i.e. only keep columns of sales, profits, assets and marketvalue)

In [0]:
#use data.frame function
forbes3 <- data.frame(sales=forbes$sales,profits=forbes$profits,
           assets=forbes$assets, mvalue=forbes$marketvalue)
str(forbes3)

# Fill the blank below
#use subset() function
forbes4 <- subset(forbes,select = c(sales, profits, assets, marketvalue))
str(forbes4)

#or simply use indexing
# Fill the blank below
forbes5 <- forbes[, 5:8]   # leave the row index empty to keep all rows
str(forbes5)

'data.frame':	2000 obs. of  4 variables:
 $ sales  : num  94.7 134.2 76.7 222.9 232.6 ...
 $ profits: num  17.85 15.59 6.46 20.96 10.27 ...
 $ assets : num  1264 627 648 167 178 ...
 $ mvalue : num  255 329 195 277 174 ...
'data.frame':	2000 obs. of  4 variables:
 $ sales      : num  94.7 134.2 76.7 222.9 232.6 ...
 $ profits    : num  17.85 15.59 6.46 20.96 10.27 ...
 $ assets     : num  1264 627 648 167 178 ...
 $ marketvalue: num  255 329 195 277 174 ...
'data.frame':	2000 obs. of  4 variables:
 $ sales      : num  94.7 134.2 76.7 222.9 232.6 ...
 $ profits    : num  17.85 15.59 6.46 20.96 10.27 ...
 $ assets     : num  1264 627 648 167 178 ...
 $ marketvalue: num  255 329 195 277 174 ...


## Solution

In [0]:
#use data.frame function
forbes3 <- data.frame(sales=forbes$sales,profits=forbes$profits,
           assets=forbes$assets, mvalue=forbes$marketvalue)
str(forbes3)

#use subset() function
forbes4 <- subset(forbes,select=c(sales,profits,assets,marketvalue))
str(forbes4)

#or simply use indexing
forbes5 <- forbes[,c(5:8)]
str(forbes5)

'data.frame':	2000 obs. of  4 variables:
 $ sales  : num  94.7 134.2 76.7 222.9 232.6 ...
 $ profits: num  17.85 15.59 6.46 20.96 10.27 ...
 $ assets : num  1264 627 648 167 178 ...
 $ mvalue : num  255 329 195 277 174 ...
'data.frame':	2000 obs. of  4 variables:
 $ sales      : num  94.7 134.2 76.7 222.9 232.6 ...
 $ profits    : num  17.85 15.59 6.46 20.96 10.27 ...
 $ assets     : num  1264 627 648 167 178 ...
 $ marketvalue: num  255 329 195 277 174 ...
'data.frame':	2000 obs. of  4 variables:
 $ sales      : num  94.7 134.2 76.7 222.9 232.6 ...
 $ profits    : num  17.85 15.59 6.46 20.96 10.27 ...
 $ assets     : num  1264 627 648 167 178 ...
 $ marketvalue: num  255 329 195 277 174 ...


### 4.3 Preprocessing – Factors
* factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables
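
A short sketch on a made-up character vector shows what a factor stores: a set of levels plus an integer code for each element.

```r
country <- c("Japan", "Germany", "Japan", "Canada")   # hypothetical values
f <- factor(country)
lev <- levels(f)          # the distinct categories
codes <- as.integer(f)    # position of each element's level within levels(f)
counts <- table(f)        # frequency of each level
```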


Convert characters to (unordered) factors:

In [0]:
forbes$country<-factor(forbes$country)
str(forbes)

'data.frame':	2000 obs. of  8 variables:
 $ rank       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ name       : chr  "Citigroup" "General Electric" "American Intl Group" "ExxonMobil" ...
 $ country    : Factor w/ 61 levels "Africa","Australia",..: 60 60 60 60 56 60 56 28 60 60 ...
 $ category   : chr  "Banking" "Conglomerates" "Insurance" "Oil & gas operations" ...
 $ sales      : num  94.7 134.2 76.7 222.9 232.6 ...
 $ profits    : num  17.85 15.59 6.46 20.96 10.27 ...
 $ assets     : num  1264 627 648 167 178 ...
 $ marketvalue: num  255 329 195 277 174 ...


* Small classes can be merged into a larger class. Why?
> * For better model performance. E.g. Classification and Regression Trees tend to favor splits on variables with many categories.
> * Practical needs: some categories have just a few observations.

In [0]:
table(forbes$country)


                      Africa                    Australia 
                           2                           37 
   Australia/ United Kingdom                      Austria 
                           2                            8 
                     Bahamas                      Belgium 
                           0                            9 
                     Bermuda                       Brazil 
                           0                            0 
                      Canada               Cayman Islands 
                          56                            0 
                       Chile                        China 
                           0                           25 
              Czech Republic                      Denmark 
                           2                           10 
                     Finland                       France 
                          11                           63 
      France/ United Kingdom                      Germa

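* Merge small classes into larger classes

Merge the Latin American and Caribbean countries into the existing level "Venezuela" (a value assigned to a factor must already be one of its levels; otherwise R stores `NA` and issues a warning):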

In [0]:
latin_america <- c("Bahamas", "Bermuda", "Brazil", "Cayman Islands", "Chile",
                   "Panama/ United Kingdom", "Peru")
forbes$country[forbes$country %in% latin_america] <- "Venezuela"

Merge the remaining small classes into larger classes

In [0]:
# European countries -> "United Kingdom"
europe <- c("Austria", "Belgium", "Czech Republic", "Denmark", "Finland", "France",
            "Germany", "Greece", "Hungary", "Ireland", "Italy", "Luxembourg",
            "Netherlands", "Norway", "Poland", "Portugal", "Russia", "Spain",
            "Sweden", "Switzerland", "Turkey", "France/ United Kingdom",
            "United Kingdom/ Netherlands", "Netherlands/ United Kingdom")
forbes$country[forbes$country %in% europe] <- "United Kingdom"
# East and Southeast Asian countries -> "Thailand"
asia <- c("China", "Hong Kong/China", "Indonesia", "Japan", "Kong/China", "Korea",
          "Malaysia", "Philippines", "Singapore", "South Korea", "Taiwan")
forbes$country[forbes$country %in% asia] <- "Thailand"
# Remaining small classes -> "United Kingdom/ South Africa"
other <- c("Africa", "Australia", "India", "Australia/ United Kingdom", "Islands",
           "Israel", "Jordan", "Liberia", "Mexico", "New Zealand", "Pakistan",
           "South Africa", "United Kingdom/ Australia")
forbes$country[forbes$country %in% other] <- "United Kingdom/ South Africa"

* Drop those levels with zero counts

Use `droplevels()` function:


In [0]:
forbes$country<-droplevels(forbes$country)
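
The sketch below, on made-up factors, illustrates why the merge above assigns an existing level, and shows an alternative idiom: assigning the same name to several entries of `levels()` merges those levels directly. It is an illustration under these assumptions, not a replacement for the cells above.

```r
f <- factor(c("Peru", "Chile", "Japan", "Peru"))

# Assigning a value that is not an existing level stores NA (with a warning)
g <- f
suppressWarnings(g[1] <- "Latin America")
became_na <- is.na(g[1])

# Renaming several levels to the same name merges them
levels(f)[levels(f) %in% c("Peru", "Chile")] <- "Latin America"
merged_levels <- levels(f)

# droplevels() removes levels that no longer occur in the data
h <- factor(c("a", "b", "c"))[1:2]
before <- nlevels(h)
after <- nlevels(droplevels(h))
```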

Now we can check the new frequency tables:

In [0]:
table(forbes$country)


                      Canada                     Thailand 
                          56                          499 
              United Kingdom United Kingdom/ South Africa 
                         531                          115 
               United States                    Venezuela 
                         751                            1 

* Rename each class

In [0]:
levels(forbes$country)<-c("Canada","East/Southeast Asia","Europe","Other","United States","Latin America")
levels(forbes$country)

###4.4 Export the cleaned dataset (Important!)
* Save forbes to Forbes2000_2019_clean.csv

In [0]:
write.csv(forbes,"Forbes2000_2019_clean.csv",row.names=FALSE)
list.files()
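
The effect of `row.names=FALSE` can be checked with a small round trip. This sketch writes a made-up data frame to a temporary file (the path from `tempfile()` is an assumption for illustration) and reads it back.

```r
toy <- data.frame(name = c("Alpha", "Beta"),
                  sales = c(94.71, 134.19),
                  stringsAsFactors = FALSE)
path <- tempfile(fileext = ".csv")

write.csv(toy, path, row.names = FALSE)
back <- read.csv(path, stringsAsFactors = FALSE)
same <- isTRUE(all.equal(toy, back))     # the round trip preserves the data

write.csv(toy, path)                     # default row.names = TRUE
cols_with_rownames <- names(read.csv(path))   # an extra first column appears
```

Without `row.names = FALSE`, the row names are written as an unnamed first column, which `read.csv()` then imports as an additional variable.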

## 5. Data analysis


###5.2 Import the cleaned dataset (Optional)
* Read the cleaned data back into a data frame

In [120]:
forbes_clean <- read.csv("Forbes2000_2019_clean.csv",header=T,stringsAsFactors = T,na.strings ="?",sep=",")
str(forbes_clean)

'data.frame':	2000 obs. of  8 variables:
 $ rank       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ name       : Factor w/ 2000 levels "Aareal Bank",..: 438 747 100 659 311 219 870 1827 663 1921 ...
 $ country    : Factor w/ 7 levels "Canada","East/Southeast Asia",..: 7 7 7 7 3 7 3 2 7 7 ...
 $ category   : Factor w/ 27 levels "Aerospace & defense",..: 2 6 16 19 19 2 2 8 9 20 ...
 $ sales      : num  94.7 134.2 76.7 222.9 232.6 ...
 $ profits    : num  17.85 15.59 6.46 20.96 10.27 ...
 $ assets     : num  1264 627 648 167 178 ...
 $ marketvalue: num  255 329 195 277 174 ...


###5.3 Extract Variables 
* Create another data frame with only numeric variables + country

Hint: Refer to 4.2.3 Subsetting by column

In [121]:
# Fill the blank below:
forbes_clean <- subset(forbes_clean, select = c("country", "sales", "profits","assets","marketvalue"))
str(forbes_clean)

'data.frame':	2000 obs. of  5 variables:
 $ country    : Factor w/ 7 levels "Canada","East/Southeast Asia",..: 7 7 7 7 3 7 3 2 7 7 ...
 $ sales      : num  94.7 134.2 76.7 222.9 232.6 ...
 $ profits    : num  17.85 15.59 6.46 20.96 10.27 ...
 $ assets     : num  1264 627 648 167 178 ...
 $ marketvalue: num  255 329 195 277 174 ...


###5.4 Training Set and Test Set
* Dataset could be randomly split into two parts: training set and test set. 


In [0]:
set.seed(1) # set the random seed so the split is reproducible
indx <- sample(1:1995,size=1995,replace=F)
forbes.train <- forbes_clean[indx[1:1600],]
forbes.test <- forbes_clean[indx[1601:1995],]
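
A variant of the same split that derives the sizes from `nrow()` instead of hard-coded numbers is sketched below. The toy data and the 80/20 proportion are assumptions chosen for illustration.

```r
set.seed(1)
toy <- data.frame(x = rnorm(10), y = rnorm(10))   # hypothetical data
n <- nrow(toy)
n_train <- floor(0.8 * n)                         # 80% of the rows for training
indx <- sample(seq_len(n))                        # random permutation of all row numbers
toy.train <- toy[indx[seq_len(n_train)], ]
toy.test <- toy[indx[(n_train + 1):n], ]
```

Deriving the sizes from the data keeps every row in exactly one of the two sets, even if the number of rows changes after preprocessing.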

##Exercise 
1. Use the `lm()` function to perform a multiple linear regression with profits as the response and all other numeric variables as the predictors. Use the `summary()` function to print the results. 


In [0]:
forbes_clean2 <- forbes_clean[,c(2:5)]  # create a new dataframe with only numeric variables included
set.seed(3) 
indx <- sample(1:1995,size=1995,replace=F)
forbes.train <- forbes_clean2[indx[1:1600],]
forbes.test <- forbes_clean2[indx[1601:1995],]
str(forbes.train)

'data.frame':	1600 obs. of  4 variables:
 $ sales      : num  2.86 4.72 9.91 1.16 3.95 ...
 $ profits    : num  0.5 0.05 -1.8 0.25 0.79 0.33 0.21 0.37 0.23 0.3 ...
 $ assets     : num  7.33 4.71 27.76 8.86 7.72 ...
 $ marketvalue: num  6.45 1.42 11.46 6.52 13.98 ...


In [0]:
lm <- lm(profits ~ ., data = forbes.train)
summary(lm)


Call:
lm(formula = profits ~ ., data = forbes.train)

Residuals:
     Min       1Q   Median       3Q      Max 
-29.2532  -0.0266   0.1169   0.2190   9.1538 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.1211415  0.0439806  -2.754  0.00595 ** 
sales        0.0115652  0.0027422   4.217 2.61e-05 ***
assets      -0.0012144  0.0004237  -2.866  0.00421 ** 
marketvalue  0.0362598  0.0019938  18.186  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.537 on 1596 degrees of freedom
Multiple R-squared:  0.3164,	Adjusted R-squared:  0.3151 
F-statistic: 246.3 on 3 and 1596 DF,  p-value: < 2.2e-16


2. Comment on the output. For instance:  Is there a relationship between the predictors and the response? 


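Yes. The overall F-statistic (246.3 on 3 and 1596 DF, p-value < 2.2e-16) gives strong evidence that at least one predictor is related to profits, although the model explains only about 32% of the variance (multiple R-squared 0.3164).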

3. Which predictors appear to have a statistically significant relationship to the response? 


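All three predictors are statistically significant: sales (p = 2.61e-05) and marketvalue (p < 2e-16) at the 0.001 level, and assets (p = 0.00421) at the 0.01 level. These p-values measure the evidence that each coefficient differs from zero, not the strength of a correlation.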

4. What does the coefficient for the sales variable suggest?

The estimate of 0.0116 suggests that, holding assets and market value fixed, each additional billion USD in sales is associated with about 0.0116 billion USD (roughly 11.6 million USD) more profit on average.


### Bonus questions:

5. Use the `^` operator to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant? 
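
In an R formula, `^` does not mean arithmetic power: `(a + b + c)^2` expands to the main effects plus all two-way interactions, and `^3` adds the three-way interaction. The sketch below inspects the expansion with `terms()`; the variable names `a`, `b`, `c` and `y` are placeholders.

```r
# terms() expands a formula symbolically, so no data are needed
two_way <- attr(terms(y ~ (a + b + c)^2), "term.labels")
three_way <- attr(terms(y ~ (a + b + c)^3), "term.labels")
two_way
three_way
```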


In [127]:
lm2 <- lm(profits ~(sales+assets+marketvalue)^3, data = forbes.train)
summary(lm2)


Call:
lm(formula = profits ~ (sales + assets + marketvalue)^3, data = forbes.train)

Residuals:
     Min       1Q   Median       3Q      Max 
-28.6771  -0.0850   0.0129   0.1823   9.4978 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               1.163e-01  5.252e-02   2.214 0.026970 *  
sales                    -3.909e-04  4.089e-03  -0.096 0.923848    
assets                   -5.699e-03  9.631e-04  -5.917 3.99e-09 ***
marketvalue               2.471e-02  2.832e-03   8.725  < 2e-16 ***
sales:assets              6.229e-05  1.740e-05   3.580 0.000354 ***
sales:marketvalue         1.205e-04  3.312e-05   3.638 0.000284 ***
assets:marketvalue        5.301e-05  1.853e-05   2.860 0.004293 ** 
sales:assets:marketvalue -3.190e-07  1.835e-07  -1.739 0.082249 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.492 on 1592 degrees of freedom
Multiple R-squared:  0.3571,	Adjusted R-squared:  0.3543 


6. Compare two models you just fitted with MAD and RMSE. Which model has better predictive results in terms of MAD and RMSE?

In [128]:
# MLR model without interactions 
lm.yhat <- predict(lm, newdata = forbes.test)
lm.y <- forbes.test$profits   # compare predictions with the response (profits)
lm.rmse <- sqrt(mean((lm.y - lm.yhat)^2))
lm.rmse
lm.mad <- mean(abs(lm.y - lm.yhat))   # mean absolute deviation
lm.mad

# MLR model with all interactions
lm2.yhat <- predict(lm2, newdata = forbes.test)
lm2.y <- forbes.test$profits
lm2.rmse <- sqrt(mean((lm2.y - lm2.yhat)^2))
lm2.rmse
lm2.mad <- mean(abs(lm2.y - lm2.yhat))
lm2.mad
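
The two error measures can be wrapped in small helper functions so that both models are scored the same way. The function names `rmse()` and `mean_abs_dev()` are hypothetical and introduced here only as a sketch. The second name avoids clashing with base R's `mad()`, which computes the median absolute deviation, a different quantity.

```r
# Root mean squared error between observed values y and predictions yhat
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

# Mean absolute deviation between observed values y and predictions yhat
mean_abs_dev <- function(y, yhat) mean(abs(y - yhat))

# Made-up example: errors are 0, 0 and -2
y <- c(1, 2, 3)
yhat <- c(1, 2, 5)
rmse(y, yhat)           # sqrt(4/3)
mean_abs_dev(y, yhat)   # 2/3
```

RMSE penalizes large errors more heavily than MAD because the errors are squared before averaging.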

##Solution 
1. Use the `lm()` function to perform a multiple linear regression with profits as the response and all other numeric variables as the predictors. Use the `summary()` function to print the results. 


In [0]:
lm <- lm(profits ~ ., data = forbes.train)
summary(lm)

5. Use the `^` operator to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant? 


In [0]:
lm2 <- lm(profits ~ (sales + assets + marketvalue)^3, data = forbes.train)
summary(lm2)

6. Compare two models you just fitted with MAD and RMSE. Which model has better predictive results in terms of MAD and RMSE?

In [0]:
# MLR model without interactions 
lm.yhat <- predict(lm, newdata = forbes.test)
lm.y <- forbes.test$profits
lm.rmse <- sqrt(mean((lm.y - lm.yhat)^2))
lm.rmse
lm.mad <- mean(abs(lm.y - lm.yhat))   # same as sum(abs(...))/395 for the 395 test rows
lm.mad

# MLR model with all interactions
lm2.yhat <- predict(lm2, newdata = forbes.test)
lm2.y <- forbes.test$profits
lm2.rmse <- sqrt(mean((lm2.y - lm2.yhat)^2))
lm2.rmse
lm2.mad <- mean(abs(lm2.y - lm2.yhat))
lm2.mad