## Multivariate Exploration Practice

We will use a movies dataset to illustrate multivariate data exploration. 
The dataset is the result of scraping 5000+ movies from the Internet Movie Database (IMDb) website using a Python library called __scrapy__. 
It has 28 variables for 5043 movies, spanning 100 years and 66 countries. 
There are 2399 unique director names and thousands of actors and actresses.

### Loading data

Load the data file `\datasets\movies\movie_metadata.csv` into R and name the dataframe `movies_data`.

In [None]:
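# stringsAsFactors = TRUE keeps text columns as factors (not the default since R 4.0.0); later factor-based steps appear to assume this
movies_data <- read.csv("../../../datasets/movies/movie_metadata.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)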

As usual, let's take a quick look at the data to check that it was read into the dataframe correctly.

In [None]:
head(movies_data)

In [None]:
# Display the structure of the dataframe
str(movies_data)

Now that we have an overall sense of what the dataset contains, let's start the multivariate analysis to gain insights from the data.

**Question 1.a:** What are the different categories of movies based on the color attribute?

In [None]:
<what goes in here>(movies_data$<what goes in here>)

There are 19 movies with no information for color of the movie. 

In [None]:
summary(movies_data)

The summary of all variables shows that many rows have no value for the variables director_name and color. 
NA values also exist for multiple variables. 
NA values can be treated in several ways. 
One approach is to replace the NA values with the average value of that column. 
A second approach is to remove the rows that contain NA values. 

The choice between them is usually driven by your goals, and by whether approximating a column value or discarding data is more acceptable for the task at hand.
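
As a minimal sketch of the two approaches, the snippet below applies both to a small made-up data frame. The data frame `toy` and its values are hypothetical and are not taken from the movies dataset.

```r
# Toy data frame with made-up values (hypothetical, not from the movies dataset)
toy <- data.frame(
  budget = c(10, NA, 30, 40),
  gross  = c(5, 15, NA, 35)
)

# Approach 1: replace each NA with the mean of its column
imputed <- toy
for (col in names(imputed)) {
  col_mean <- mean(imputed[[col]], na.rm = TRUE)
  imputed[[col]][is.na(imputed[[col]])] <- col_mean
}

# Approach 2: drop every row that contains at least one NA
dropped <- na.omit(toy)

nrow(imputed)  # all 4 rows are kept, with the NAs filled in
nrow(dropped)  # only the 2 complete rows remain
```

Imputation keeps every row but replaces unknown values with an approximation, while removal keeps only observed values at the cost of fewer rows.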

----
Let's see how many missing values are present per row and per column. 
R has the built-in functions rowSums() and colSums(), which sum the values in each row and in each column. 
**is.na()** returns a logical value, TRUE or FALSE, for each cell depending on whether it holds an NA value; when summed, TRUE counts as 1 and FALSE as 0. 
**rowSums()** adds up these 1's and 0's in each row. 
**colSums()** adds up these 1's and 0's in each column.
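
To make the counting concrete, here is a small sketch on a made-up data frame. The names `toy` and `na_flags` are illustrative only.

```r
# Toy data frame with made-up values (hypothetical, not from the movies dataset)
toy <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, 6),
  c = c(7, 8, 9)
)

# Logical matrix: TRUE where a cell is NA, FALSE otherwise
na_flags <- is.na(toy)

# TRUE counts as 1 and FALSE as 0 when summed
rowSums(na_flags)  # the rows have 1, 2 and 0 missing values
colSums(na_flags)  # columns a, b and c have 1, 2 and 0 missing values
```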

**Reference:** [colSums()](https://stat.ethz.ch/R-manual/R-devel/library/base/html/colSums.html)

**Reference:** [rowSums()](https://stat.ethz.ch/R-manual/R-devel/library/base/html/colSums.html)

In [None]:
rowSums(is.na(movies_data))

In [None]:
colSums(is.na(movies_data))

In [None]:
# Use factor() to recode the labels of nominal data
movies_data$color <- factor(movies_data$color, levels = c(""," Black and White","Color"), labels = c(1,2,3)) 
head(movies_data)

**Reference:** [factor()](https://www.r-bloggers.com/data-types-part-3-factors/)

Let's convert the labels of the color column back to their original values.

In [None]:
movies_data$color <- factor(movies_data$color, levels = c(1,2,3), labels = c(""," Black and White","Color")) 
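
The round trip between text labels and numeric codes can be tried on a short vector first. This is a sketch only: the vector `shade` is made up, and it reuses the same three level strings as the cells above.

```r
# Made-up vector using the same three level strings as the color column
shade <- c("Color", " Black and White", "", "Color")

# Recode the text labels to 1, 2, 3
coded <- factor(shade,
                levels = c("", " Black and White", "Color"),
                labels = c(1, 2, 3))
as.character(coded)

# Map the codes back to the original text labels
restored <- factor(coded,
                   levels = c(1, 2, 3),
                   labels = c("", " Black and White", "Color"))
identical(as.character(restored), shade)
```

`levels` lists the values to look for in the input and `labels` gives the name each of them receives, in the same order.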

Let's generate a two-way table for the color and language variables and calculate the proportions of movies made in each language by picture color. 

In [None]:
movies_by_country_color <- table(movies_data$color,movies_data$language)
# table() above cross-tabulates the frequency counts of color (rows) against language (columns).
# prop.table() converts the frequency counts into proportions.
# Its second argument selects the margin: 1 gives row-wise proportions, 2 gives column-wise proportions.
# round() rounds its input to the given number of decimal places, here 2.

round(prop.table(movies_by_country_color,1),2)

**Reference:** [table()](http://www.cyclismo.org/tutorial/R/tables.html)
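
The difference between the two margins is easier to see on a tiny table. The sketch below uses made-up values; `counts`, `by_row` and `by_col` are illustrative names.

```r
# Made-up two-way table (hypothetical values, not from the movies dataset)
counts <- table(
  color    = c("Color", "Color", "Color", "B&W", "B&W"),
  language = c("English", "English", "French", "English", "French")
)

# Margin 1: each row sums to 1 (share of each language within a color)
by_row <- prop.table(counts, 1)
round(by_row, 2)

# Margin 2: each column sums to 1 (share of each color within a language)
by_col <- prop.table(counts, 2)
round(by_col, 2)
```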

----
If you want to convert the above proportions into percentages, simply multiply the values by 100 as shown below.

In [None]:
round(100*prop.table(movies_by_country_color,1),2)

**Question 1.b:** Perform a chi-squared test on the language and color variables using the two-way table `movies_by_country_color` we created above. 

In [None]:
chisq.test(<what goes in here>)

**Question 2.a:** Subset the movies_data dataset on the NA values of `gross`. Name the subset `na_data`; it should contain all rows in which `gross` is NA.

In [None]:
na_data=movies_data[<what goes in here>(movies_data$gross),]

**Question 2.b:** What is the distribution of NA values in the `budget` variable of the na_data dataframe? 

In [None]:
table(<what goes in here>(na_data$budget))

For our analysis, let's go ahead and remove the rows that contain any NA values.

In [None]:
nrow(movies_data)
movies_data=na.omit(movies_data)
nrow(movies_data)

The number of rows dropped from 5043 to 3801 after excluding all NA values from the dataset. 
We lost 1242 rows, roughly 25% of the data, by removing them. 
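
The share of rows lost can be checked with a little arithmetic on the two row counts reported above. The variable names are illustrative.

```r
# Row counts reported before and after na.omit()
rows_before <- 5043
rows_after  <- 3801

rows_lost <- rows_before - rows_after
pct_lost  <- 100 * rows_lost / rows_before

rows_lost           # number of rows removed
round(pct_lost, 1)  # percentage of the original rows removed
```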

In [None]:
library(gridExtra)
library(ggplot2)

## grid.arrange(x1, x2, x3, ..., xn, ncol = x, nrow = y)
## arranges the plots x1, x2, ..., xn in a grid layout with the specified number of rows and columns.

# bins sets how many intervals each histogram's range is split into; the bar heights are the
# number of observations that fall into each interval. Choose a number of bins that suits the
# data; if you are not sure, trial and error is a good way to find it.

grid.arrange(qplot(movies_data$duration,bins = 100,xlab='Movie duration'),
             qplot(movies_data$director_facebook_likes,bins = 50,xlab='director facebook likes'),
             qplot(movies_data$actor_3_facebook_likes,bins = 50,xlab='actor_3_facebook_likes'),
             qplot(movies_data$actor_1_facebook_likes,bins = 50,xlab='actor_1_facebook_likes'),
             qplot(movies_data$gross,bins = 50,xlab='gross'),
             qplot(movies_data$cast_total_facebook_likes,bins = 50,xlab='cast_total_facebook_likes'),
             qplot(movies_data$budget,bins = 50,xlab='budget'),
             qplot(movies_data$actor_2_facebook_likes,bins = 50,xlab='actor_2_facebook_likes'),
             qplot(movies_data$movie_facebook_likes,bins = 50,xlab='movie_facebook_likes'),
             ncol = 3)

Looking at the histograms, most of the variables are positively skewed, except for movie duration, which looks roughly normally distributed. The maximum duration of a movie is 511 minutes, which is odd: that would be a movie of about 8 hours 30 minutes. The extreme values in the rest of the variables do not seem to be errors, as some movies, directors and actors are far more famous than the rest. A box plot might give us more information about these variables.  

Let's plot box plots for these variables.

In [None]:
# library(ggplot2)
# require(gridExtra)

grid.arrange(qplot(y=movies_data$duration, x= 1, geom = "boxplot",ylab='Duration'),
             qplot(y=movies_data$budget, x= 1, geom = "boxplot",ylab='Budget'),
             qplot(y=movies_data$gross, x= 1, geom = "boxplot",ylab='Gross'),
             qplot(y=movies_data$director_facebook_likes, x= 1, geom = "boxplot",ylab='Director fb likes'),
             qplot(y=movies_data$actor_1_facebook_likes, x= 1, geom = "boxplot",ylab='actor 1 fb likes'),
             qplot(y=movies_data$actor_2_facebook_likes, x= 1, geom = "boxplot",ylab='actor 2 fb likes'),
             qplot(y=movies_data$actor_3_facebook_likes, x= 1, geom = "boxplot",ylab='actor 3 fb likes'),
             qplot(y=movies_data$cast_total_facebook_likes, x= 1, geom = "boxplot",ylab='cast fb likes'),
             qplot(y=movies_data$movie_facebook_likes, x= 1, geom = "boxplot",ylab='movie fb likes'),
             ncol=3,nrow=3)

The box plots above do not show any interesting patterns. The data has extreme outliers, so the plots are not very telling. The only column in the dataset that tells us whether a movie is good or not is imdb_score. 

Movies with a rating above 8 typically end up in the IMDB Top 250 and are considered good by the critics. Movies rated 7 to 8 are probably still good movies. Movies rated 6 to 7 can be OK to watch, but viewers may not gain much from them. Finally, movies rated 1 to 5 are generally considered bad movies.
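
The rating bands described above can be sketched with cut(). The scores below are made up, the band labels are illustrative, and the handling of the boundaries is an assumption. The text does not classify ratings between 5 and 6, so that band is labelled as such.

```r
# Made-up scores (hypothetical, not from the movies dataset)
scores <- c(8.5, 7.2, 6.4, 5.5, 3.1)

# Intervals are (lower, upper]; include.lowest = TRUE also keeps a score of exactly 1.
# Assumption: a score exactly on a boundary falls into the lower band.
bands <- cut(scores,
             breaks = c(1, 5, 6, 7, 8, 10),
             labels = c("bad", "not classified in the text", "ok",
                        "probably good", "good"),
             include.lowest = TRUE)

data.frame(score = scores, band = bands)
```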

**Question 3:** For every row in the dataset, add the values in the columns director_facebook_likes, actor_3_facebook_likes, actor_1_facebook_likes, cast_total_facebook_likes, actor_2_facebook_likes and movie_facebook_likes to find the total Facebook likes a movie can claim.  

Hint: Use the appropriate function from the apply() family to add the values.

In [None]:
result = <what goes in here>(sum, movies_data$director_facebook_likes, movies_data$actor_3_facebook_likes,
                movies_data$actor_1_facebook_likes,movies_data$cast_total_facebook_likes, 
       movies_data$actor_2_facebook_likes, movies_data$movie_facebook_likes)
head(result)

**Question 4:** Summarize the data for the columns country and color using a two-way table. Add row and column sums at the end using the addmargins() function. 

In [None]:
color_country <- table(<what goes in here>,<what goes in here>)
addmargins(<what goes in here>)

**Question 5:** Find the correlation between all variables in the movies_data dataframe, excluding all the factor variables in the dataset.

In [None]:
# Create a dataframe that excludes all factor variables in movies_data.
# sapply() returns the class of each column. %in% checks each class against "factor" and returns TRUE or FALSE.
# The negated TRUE/FALSE vector determines which columns are kept when assigning movies_data to less_data.
less_data=movies_data[!<what goes in here>(movies_data,class) %<what goes in here>% c("<what goes in here>")]
cor(less_data)

imdb_score is the variable of interest in our movies dataset. The correlation matrix above tells us that none of the continuous variables is highly correlated with imdb_score. The following table shows how the different variables correlate with imdb_score. 

|Feature|Correlation with imdb_score (population)|
|-|-------------------------|
|num_critic_for_reviews|0.3438808|
|duration|0.36612369|
|director_facebook_likes|0.19083814|
|actor_3_facebook_likes|0.06497354|
|actor_1_facebook_likes|0.09313142|
|gross|0.21212439|
|num_voted_users|0.47791732|
|cast_total_facebook_likes|0.1062587|
|num_user_for_reviews|0.32252237|
|budget|0.02904057|
|title_year|-0.12926516|
|actor_2_facebook_likes|0.10206038|
|aspect_ratio|0.02845372|
|movie_facebook_likes|0.2794777|

With a threshold of 0.2 for correlation, the 6 variables num_critic_for_reviews, duration, gross, num_voted_users, num_user_for_reviews and movie_facebook_likes look like promising predictors of imdb_score.
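
The threshold check can be reproduced directly from the values in the table. In this sketch the vector name `cor_with_score` is illustrative, and comparing the absolute value against 0.2 is an assumption, since the text does not say how negative correlations should be treated.

```r
# Correlations with imdb_score, copied from the table above
cor_with_score <- c(
  num_critic_for_reviews    = 0.3438808,
  duration                  = 0.36612369,
  director_facebook_likes   = 0.19083814,
  actor_3_facebook_likes    = 0.06497354,
  actor_1_facebook_likes    = 0.09313142,
  gross                     = 0.21212439,
  num_voted_users           = 0.47791732,
  cast_total_facebook_likes = 0.1062587,
  num_user_for_reviews      = 0.32252237,
  budget                    = 0.02904057,
  title_year                = -0.12926516,
  actor_2_facebook_likes    = 0.10206038,
  aspect_ratio              = 0.02845372,
  movie_facebook_likes      = 0.2794777
)

# Keep the variables whose absolute correlation exceeds the threshold
threshold <- 0.2
promising <- names(cor_with_score)[abs(cor_with_score) > threshold]
promising
```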

**Question 6:** Draw a plot with imdb_score on the x-axis and num_voted_users on the y-axis for the movies_data dataset. Use the color variable for the color parameter and duration for the size parameter. 

In [None]:
options(scipen=999)  

ggplot(movies_data,
       aes(x=imdb_score,        # independent variable, feature 1
           y=num_voted_users,              # dependent variable, feature 2
           color=<what goes in here>,  # independent variable, feature 3
           size=duration)       # independent variable, feature 4
      ) +
       # axis labels and title are added as layers, not passed as arguments to ggplot()
       xlab("IMDB Score") +
       ylab("Number of voted users") +
       labs(title="IMDB rating vs No. of voted users") +
       <what goes in here>

**Question 7:** Draw a plot with imdb_score on the x-axis and director_facebook_likes on the y-axis for the movies_data dataset. Use the color variable for the color parameter and gross for the size parameter. Debug/modify the code to generate the plot.

In [None]:
options(scipen=999)  
library(ggplot2)
ggplot(movies_data,
       aes(x=imdb_score,        # independent variable, feature 1
           y=director_facebook_likes,              # dependent variable, feature 2
           color=<what goes in here>,  # independent variable, feature 3
           size=<what goes in here>)  +    # independent variable, feature 4
       xlab("IMDB Score") +
       ylab("Director fb likes")+
       labs(title="IMDB rating vs Director fb likes")
       +geom_point()

**Question 8.a:** Draw a plot with imdb_score on the x-axis and director_facebook_likes on the y-axis for the movies_data dataset. Use the movie_facebook_likes variable for the color parameter and gross for the size parameter. 

In [None]:
options(scipen=999)  
library(ggplot2)
ggplot(movies_data,
       aes(x=imdb_score,        # independent variable, feature 1
           y=<what goes in here>,              # dependent variable, feature 2
           <what goes in here>=movie_facebook_likes,  # independent variable, feature 3
           size=gross)          # independent variable, feature 4
      ) +
       # axis labels and title are added as layers after the ggplot() call
       xlab("IMDB Score") +
       ylab("Director fb likes") +
       labs(title="IMDB rating vs Director fb likes") +
       geom_point()

**Question 8.b:** Describe your observations about the plot in a few words.

````

        Write your answer for 8.b here

````

**Question 9.a:** Draw a plot with imdb_score on the x-axis and num_voted_users on the y-axis for the movies_data dataset. Use the num_user_for_reviews variable for the color parameter, gross for the size parameter and the color variable for the shape parameter. 

In [None]:
options(scipen=999)  
library(ggplot2)
ggplot(movies_data,
       aes(x=imdb_score,        # independent variable, feature 1
           y=num_voted_users,              # dependent variable, feature 2
           color=num_user_for_reviews,  # independent variable, feature 3
           size=gross,                  # independent variable, feature 4
          <what goes in here>=<what goes in here>)      # independent variable, feature 5
      ) +
       # axis labels and title are added as layers, not passed as arguments to ggplot()
       xlab("IMDB Score") +
       ylab("No. of voted users") +
       labs(title="IMDB rating vs No. of voted users") +
       geom_point()

**Question 9.b:** Describe your observations about the plot in a few words.

````

        Write your answer for 9.b here

````

**Question 10.a:** Draw a 3D scatterplot with imdb_score on the x-axis, num_voted_users on the y-axis and num_user_for_reviews on the z-axis for the movies_data dataset. Use the imdb_score variable for the color parameter. Modify/complete the code to generate the plot.

In [None]:
library(scatterplot3d)
# Assign a color to each range of imdb_score
# The first range covers every score below 3, so that no score between 2 and 3 is left without a color
movies_data$colors[movies_data$imdb_score<3] <- "green"
movies_data$colors[movies_data$imdb_score>=3 & movies_data$imdb_score<4] <- "magenta"
movies_data$colors[movies_data$imdb_score>=4 & movies_data$imdb_score<5] <- "red"
movies_data$colors[movies_data$imdb_score>=5 & movies_data$imdb_score<6] <- "blue"
movies_data$colors[movies_data$imdb_score>=6 & movies_data$imdb_score<7] <- "orange"
movies_data$colors[movies_data$imdb_score>=7 & movies_data$imdb_score<8] <- "cyan"
movies_data$colors[movies_data$imdb_score>=8 & movies_data$imdb_score<9] <- "purple"
movies_data$colors[movies_data$imdb_score>=9 & movies_data$imdb_score<10] <- "black"


<what goes in here>(movies_data, {
   # The 3D scatter plot was introduced in the extra / optional notebooks last week.
   # This function produces a three-dimensional plot of points, using 3 variables for position instead of just 2.
                 #   x,        y,   and    z axis
   scatterplot3d(imdb_score, num_voted_users, num_user_for_reviews,       
                 <what goes in here>=<what goes in here>,             # put lines on the horizontal plane
                 angle = 45,           # viewing angle that sets how the graph is oriented
                 pch = 16,             # plotting symbol used for the points (16 is a filled circle)
                 color=colors,         # point colors, taken from the colors column created above
                 main="IMDB score vs user votes & reviews",        
                 xlab="IMDB Score",
                 ylab="No. of voted users",
                 zlab="No. of user reviews")

legend("topleft", inset=.05,      # location where the legend should be positioned on the graph
    bty="n", cex=.5,              # suppress legend box, shrink text 50%
    title="IMDB score", 
    c("< 3", "3 - 3.9", "4 - 4.9", "5 - 5.9", "6 - 6.9", "7 - 7.9", "8 - 8.9", "9 - 9.9"), 
       fill=c("green", "magenta", "red", "blue", "orange", "cyan", "purple", "black"))
}) # ends the context of the with()

**Question 10.b:** Describe your observations about the plot in a few words.

````

        Write your answer for 10.b here

````

# SAVE YOUR NOTEBOOK

In [None]:
# Add code here to save your work into version control
# Hint: The file name is "multivariate_exploration.ipynb"

