## Multivariate Reduction Practice

Let's continue our discussion of multivariate data reduction. We will focus on dimensionality reduction, mainly principal component analysis (PCA). The code cells are partially complete; you may have to debug, modify, or complete the code to generate the desired output.

**Load the data** into movies_data dataframe.

In [None]:
movies_data <- read.csv("../../../datasets/movies/movie_metadata.csv", header = TRUE, sep = ",")
head(movies_data)

Remove the rows that contain any NA values.

In [None]:
#count number of rows in the dataset
nrow(movies_data)
#Omit rows from  the dataset that contain NA values
movies_data=na.omit(movies_data)
#count number of rows again in the dataset
nrow(movies_data)

#Form a new dataframe called less_data that keeps only the numeric columns of movies_data
#(is.numeric works whether text columns were read as factor or as character)
less_data <- movies_data[sapply(movies_data, is.numeric)]

In [None]:
cor(less_data)

#### Correlation Plot

__Reference__: https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html

In [None]:
# install.packages('corrplot',repos='http://cran.us.r-project.org')
library('corrplot') #package corrplot
cors <- cor(less_data) # get the correlations for less_data
#The output of the function cor() is the correlation coefficients between each and every variable combination in the dataset. 
#correlation to itself is always 1.

corrplot(cors, method = "number",number.cex=0.75) #plot the correlation of variables in the form of a matrix
# method takes different inputs like "number", "circle", "ellipse" etc. We chose number here, as we want to see correlation 
#between each variable.

The variables movie_facebook_likes, num_user_for_reviews, num_voted_users, num_critic_for_reviews and duration are, relatively speaking, the ones most correlated with imdb_score.
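
One way to read off which variables are most correlated with a target is to take that column of the correlation matrix and sort it by absolute value. The following is a minimal sketch on the built-in mtcars data, where mpg stands in for imdb_score; the dataset and the names cors_demo, target_cors and ranked are assumptions of this example, not part of the exercise.

```r
# mtcars ships with R; "mpg" plays the role of the target here
cors_demo <- cor(mtcars)

# Correlations of every other variable with the target
target_cors <- cors_demo[, "mpg"]
target_cors <- target_cors[names(target_cors) != "mpg"]

# Strongest relationships first, whether positive or negative
ranked <- target_cors[order(abs(target_cors), decreasing = TRUE)]
head(ranked, 5)
```

Sorting by absolute value matters because a strong negative correlation is as informative as a strong positive one.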

* The "cast_total_facebook_likes" has a strong positive correlation with the "actor_1_facebook_likes", and has smaller positive correlation with both "actor_2_facebook_likes" and "actor_3_facebook_likes"

* The "movie_facebook_likes" has strong correlation with "num_critic_for_reviews", suggesting that a movie's popularity on social networks goes together with critic attention (correlation alone does not show that critics drive it)

* The "movie_facebook_likes" has decent amount of correlation with the "num_voted_users"

* The movie "gross" has strong positive correlation with the "num_voted_users"

##### Contradicting correlations
---------------------------

* The "imdb_score" has very small positive correlation with "director_facebook_likes". So we can't guarantee that a popular director's movie will be great.

* The "imdb_score" has very small positive correlation with the "actor_1_facebook_likes". Just as with a famous director, we can't guarantee that a popular actor's movie will be great.

* The "imdb_score" has a small but positive correlation with "duration". Highly rated movies tend to be longer in duration.

* The variables "num_voted_users" and "num_user_for_reviews" have small positive correlation with "imdb_score". Maybe good movies attract more votes and reviews.

* The "imdb_score" has almost no correlation with "budget". Big budget movies will not necessarily turn out great.

**Question 1.a:** Use the information.gain function in the FSelector package to find the information gain for all variables in the movies_data dataset.

**Question 1.b:** Identify the variables which have an information gain of 0.7

### Feature Selection

In [None]:
# # install.packages("FSelector",repo="http://cran.uk.r-project.org")
library(FSelector)
weight_gains <- information.gain(<what goes in here>~., movies_data)

print(weight_gains)

subset <- cutoff.k(weight_gains, 6)

formula <- as.simple.formula(subset, "Prices")

print(formula)
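
To make the quantity concrete, information gain for a discrete feature can be computed by hand as the entropy of the target minus the entropy of the target within each feature level, H(y) - H(y | x). The sketch below uses base R only; the helper names entropy and info_gain and the toy vectors are hypothetical and are not FSelector's implementation, and the handling of continuous columns is not shown.

```r
# Shannon entropy (natural log) of a discrete vector
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Information gain of a discrete feature x about a discrete target y:
# H(y) - H(y | x), where H(y | x) is the size-weighted entropy of y
# within each level of x
info_gain <- function(x, y) {
  h_cond <- sum(sapply(split(y, x), function(ys) {
    length(ys) / length(y) * entropy(ys)
  }))
  entropy(y) - h_cond
}

y       <- c("hi", "hi", "lo", "lo")
perfect <- c("a", "a", "b", "b")   # determines y completely
useless <- c("a", "b", "a", "b")   # tells us nothing about y

info_gain(perfect, y)   # equals entropy(y), i.e. log(2)
info_gain(useless, y)   # equals 0
```

A feature that fully determines the target recovers all of the target's entropy, while a feature independent of the target gains nothing.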

From information gain function, 
````
        Answer for 1.b
        "Write the variable names identified here"
````

variables have information gain of over 0.77. 

However, the cor() function pointed to a completely different set of variables (movie_facebook_likes, num_user_for_reviews, num_voted_users, num_critic_for_reviews and duration) as the most correlated with imdb_score.

Let's continue our discussion with PCA. As we saw in the lab notebook, we have to standardize the variables first.

**Question 2:** Use the scale() function to standardize the numeric variables in movies_data and assign the new data to a variable called standard_vars.

In [None]:
standard_vars <- as.data.frame(<what goes in here>(less_data))
dim(standard_vars)
head(standard_vars)
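
As a quick check of what standardizing does, each scaled column should end up with mean 0 and standard deviation 1. A minimal sketch on three columns of the built-in mtcars data (the dataset and the name scaled_demo are assumptions of this example):

```r
# Standardize three numeric columns: subtract the mean, divide by the sd
scaled_demo <- as.data.frame(scale(mtcars[, c("mpg", "hp", "wt")]))

# Column means are 0 up to floating-point error
round(colMeans(scaled_demo), 10)

# Column standard deviations are 1
apply(scaled_demo, 2, sd)
```

Without this step, variables measured on large scales would dominate the principal components simply because of their units.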

**Question 3:** Run the prcomp() function on standard_vars created above and assign the result to movies_data_pca

In [None]:
# Compute the Principal Components. Run prcomp() function on standardized variables created above.
movies_data_pca <- <what goes in here>(<what goes in here>)

**If you go to the help page for `prcomp` you will find,**

`The calculation is done by a singular value decomposition of the (centered and possibly scaled) data matrix, not by using eigen on the covariance matrix. This is generally the preferred method for numerical accuracy.` (Centering and scaling is what we did above when standardizing the variables.)

For `princomp()` you will see,

`The calculation is done using eigen on the correlation or covariance matrix, as determined by cor. This is done for compatibility with the S-PLUS result. A preferred method of calculation is to use svd on x, as is done in prcomp.`
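
The two routes describe the same decomposition, so on standardized data the component variances from prcomp() should match the eigenvalues of the correlation matrix. A small sketch on the built-in mtcars data illustrates this; the dataset and the names x, pca_svd and eig_vals are assumptions of this example.

```r
x <- scale(mtcars)                      # centered, unit-variance columns

pca_svd  <- prcomp(x)                   # SVD route
eig_vals <- eigen(cor(mtcars))$values   # eigen route on the correlation matrix

# Component variances agree up to floating-point error
all.equal(pca_svd$sdev^2, eig_vals)
```

The difference between the two functions is about numerical accuracy, not about the quantity being estimated.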

In [None]:
summary(movies_data_pca)

In [None]:
screeplot(movies_data_pca, type="lines")
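
The numbers behind summary() and the scree plot can also be computed directly: the variance of each component is its sdev squared, and dividing by the total gives the proportion of variance explained. A minimal sketch on the built-in mtcars data (the dataset, the 90% threshold and the names pca_demo, var_explained and cum_explained are assumptions of this example):

```r
pca_demo <- prcomp(mtcars, scale. = TRUE)

# Proportion of total variance carried by each component
var_explained <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)

# Running total, useful for deciding how many components to keep
cum_explained <- cumsum(var_explained)
round(head(cum_explained, 4), 3)

# Smallest number of components covering at least 90% of the variance
which(cum_explained >= 0.9)[1]
```

A sharp bend in the scree plot corresponds to the point where var_explained drops off and cum_explained starts to flatten.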

**Question 4:** What are your observations from the plot? Write a few words below.

In [None]:
biplot(movies_data_pca) 

````

        Answer for question 4 goes here

````

Look at the dimensions of the PCA we ran. We are interested in the x part of movies_data_pca for the dimensions.

In [None]:
dim(movies_data_pca$x)

**Question 5:** Fit a multiple regression model to predict imdb_score in movies_data using the first 4 principal components created above.

In [None]:
#movies_data_pca$x is a matrix of scores with one column per principal component. You can access components by column index: [,1], [,2], [,3] and so on
fit = lm(<what goes in here>~movies_data_pca$x[,1]+movies_data_pca$x[,2]+movies_data_pca$x[,3]+movies_data_pca$x[,4])
summary(fit)
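
An alternative to indexing the score matrix inside the formula is to collect the response and the first few components in one data frame, which gives readable coefficient names. The sketch below uses the built-in mtcars data with mpg as the response. The dataset and the names pca_demo, pc_df and fit_demo are assumptions of this example, as is the choice to leave the response out of the PCA, which the notebook does not specify.

```r
# PCA on the predictors only; mpg is held out as the response
pca_demo <- prcomp(mtcars[, -1], scale. = TRUE)

# Response plus the first four component scores (columns PC1..PC4)
pc_df <- data.frame(mpg = mtcars$mpg, pca_demo$x[, 1:4])

fit_demo <- lm(mpg ~ PC1 + PC2 + PC3 + PC4, data = pc_df)
summary(fit_demo)$r.squared
```

Because the components are uncorrelated with each other, their coefficients do not change when further components are added to the model.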

**Question 6:** Use ggplot to draw a scatter plot of principal components 1 and 2.

#### Scatter plots of Principal components

In [None]:
library(ggplot2)
pca_comp1_comp2 <- ggplot(<what goes in here>, aes(x=<what goes in here>,y=<what goes in here>))

pca_comp1_comp2+geom_point(alpha = 0.8)

Let's try to fit a multiple linear regression model using the most correlated variables we found.

**Question 7.a:** Fit a multiple regression model to predict imdb_score in movies_data using the variables movie_facebook_likes, num_user_for_reviews, num_voted_users, num_critic_for_reviews and duration.

In [None]:
fit1=lm(<what goes in here>~<what goes in here>+ <what goes in here>+ <what goes in here>+ <what goes in here>+ <what goes in here>,
       data=movies_data)
summary(fit1)

**Question 7.b:** Compare the $R^2$ value for models fit1 and fit. Write your opinion about the models in a line.

````

        Answer for question 7.b goes here

````

**Question 8:** Build a model to predict imdb_score using all the independent features of movies_data.

In [None]:
fit2=lm(imdb_score~ <what goes in here>, data=less_data)
summary(fit2)

The difference looks substantial when you compare the $R^2$ of the model built using all 16 numeric variables in the dataset with that of the model built using principal components.

**Question 9:** Run the factanal() function to generate 2 factors for less_data.

In [None]:
factors <- factanal(less_data, <what goes in here>, rotation=<what goes in here>)
print(factors, digits=2, cutoff=0.3, sort=TRUE)
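
For a feel of what factanal() returns before running it on less_data, the sketch below fits a two-factor model to ability.cov, a covariance matrix of six ability tests that ships with R. The dataset, the name fa_demo and the use of varimax (the function's default rotation) are assumptions of this example, not the intended answer to the question.

```r
# Two-factor model fitted from a covariance matrix instead of raw data
fa_demo <- factanal(factors = 2, covmat = ability.cov, rotation = "varimax")

# Same print options as in the cell above: hide loadings below 0.3
print(fa_demo, digits = 2, cutoff = 0.3, sort = TRUE)

# One row per variable, one column per factor
dim(loadings(fa_demo))
```

In the printed loadings, variables that load on the same factor can be read as measuring a shared underlying trait.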

Reference: [Factor Analysis](http://www.statpower.net/Content/312/R%20Stuff/Exploratory%20Factor%20Analysis%20with%20R.pdf)

# SAVE YOUR NOTEBOOK

In [None]:
# Add code here to save your work into version control
# Hint: The file name is "Practice_Multivariate_Reduction.ipynb"

