# Data Visualization

---

### Topics Covered
- Types of Data
- Creating and Reading Tables
- Table Methods: Select, Drop, Sort
- Accessing Columns and Column Arithmetic
- Bar chart


### Table of Contents
1 - [Data Types](#section1)<br>

2 - [Data Visualization - Bar Charts](#section2)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a) [Fruits Example](#subsection1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; b) [Movies Example](#subsection2)<br>

### Learning Objectives: 

By the end of this notebook you should be able to:
* Understand the difference between categorical and numerical data
* Understand what distinguishes a good visualization from a bad one
* Understand how to use bar charts
* Understand how to make bar charts




**Dependencies:**

In [None]:
from datascience import *
import numpy as np
from IPython.lib.display import YouTubeVideo

import matplotlib.pyplot as plt
%matplotlib inline

___
### Data Types  <a id='section1'></a>

#### Please watch the following 4-minute video



In [None]:
# Run this cell

YouTubeVideo('EHRg9ojcVRQ') 

#### Summary and terminology: 

Two important data types: 

1) **Numerical** - Each value is from a numerical scale
* Can do arithmetic on them (e.g. difference, addition, average, etc.)
* Are ordered
   
2) **Categorical** - Each value is from a fixed inventory
* May or may not have order 

3) **Individuals**: Objects/ subjects whose **features** are recorded

4) **Variables**: Features 
* Variables have different **values** 
* **Values** are numerical OR categorical (and their sub-types)
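
As a minimal sketch of the distinction, here is a plain-Python example. The values and variable names are made up for illustration and are not part of any dataset in this notebook. Arithmetic and ordering are meaningful for the numerical values, while the categorical values are only checked against a fixed inventory:

```python
# Hypothetical values, for illustration only.
heights_cm = [150.0, 160.0, 170.0]        # values of a numerical variable
pet_inventory = {"cat", "dog", "fish"}    # fixed inventory of categories
pets = ["dog", "cat", "dog"]              # values of a categorical variable

# Numerical values support arithmetic and have an order.
average_height = sum(heights_cm) / len(heights_cm)
tallest = max(heights_cm)

# Categorical values each come from the fixed inventory.
all_valid = all(pet in pet_inventory for pet in pets)

print(average_height, tallest, all_valid)
```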

---
#### Test your knowledge! 

For the following, please type the data type and write down why you selected that answer.

#### A) 

student_ID = 12345 
        

_YOUR ANSWER HERE_

#### B)

ice_cream_flavors = chocolate 

_YOUR ANSWER HERE_

#### C)

account_balance = 14.59 

_YOUR ANSWER HERE_

---

## Data Visualization <a id='section2'></a>

---
To understand our data better, a tool you might want to use is data visualization. Data visualization allows you to further your analysis, draw insights, and communicate trends in data that are otherwise hard to see from just looking at data tables.

---

As we learned above, data generally falls into two main groups: numerical and categorical. Categorical data does not necessarily have an order. Here are some examples:

* Colors: individuals are my color pencils, and the variable is the color of the pencil
* Dogs: individuals are dogs in my house, and the variable is the breed of the dog
* Movies: the individuals are years, and the variable is the name of the highest grossing studio of the year. 

### Fruits Example<a id='subsection1'></a>

Jesse loves fruit! In fact, this week she bought 30 fruits at her local supermarket. For this question, you will be creating a table with the following information and saving it in a variable called "fruits". She bought 5 *types* of fruit. There are 5 mangos, 10 apples, 8 peaches, 5 bananas, and 2 pineapples.

In [None]:

fruits = Table().with_columns(
    'name', make_array('mango', 'apple', 'peach', 'banana', 'pineapple'),
    'quantity', make_array(5, 10, 8, 5, 2))
fruits


The _values_ of the categorical variable "name" are mango, apple, peach, banana, and pineapple.<br>
The table above shows the quantity for each type of fruit. This is the _distribution_ table. A distribution displays all of the values of a variable and their frequencies.
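
To make the idea of a distribution concrete, the sketch below rebuilds the same frequencies from a raw list with one entry per fruit. It uses plain Python's `collections.Counter` rather than the `datascience` table, and the `purchases` list is an assumed reconstruction of Jesse's basket from the counts given above:

```python
from collections import Counter

# One entry per fruit Jesse bought (counts taken from the text above).
purchases = (["mango"] * 5 + ["apple"] * 10 + ["peach"] * 8
             + ["banana"] * 5 + ["pineapple"] * 2)

# A distribution pairs each value of the variable with its frequency.
distribution = Counter(purchases)

for name, quantity in distribution.items():
    print(name, quantity)
```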

---

#### Bar Charts <a id='section3'></a>


We can use bar charts to visualize categorical distributions. Bar charts display a bar for each category. A bar chart for the fruits table would display 5 bars since there are 5 categories. These bars are equally spaced and equally wide, and the length of each bar is proportional to the frequency of its category.

To emphasize, when drawing these charts keep the following in mind:

* One axis shows a categorical variable, while the other shows numerical frequencies.
* The length of each bar is proportional to the frequency of its category.
* **Distributions describe frequencies of the variables**
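
The proportionality rule can be sketched without any plotting library. The example below is a rough text-only illustration, where `text_bar` is a hypothetical helper (not part of `datascience`) that returns one `#` per unit of frequency:

```python
# Quantities from the fruits table above.
quantities = {"mango": 5, "apple": 10, "peach": 8, "banana": 5, "pineapple": 2}

def text_bar(frequency, unit="#"):
    """Return a bar whose length equals the frequency (hypothetical helper)."""
    return unit * frequency

for name, quantity in quantities.items():
    # Pad the labels so every bar starts in the same column.
    print(f"{name:<10}{text_bar(quantity)}")
```

The bar for apple (10) comes out twice as long as the bar for mango (5), which is the relationship a real bar chart encodes.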


Let's draw a bar chart for our fruits table. 

In [None]:
# Method (1)

fruits.barh('name', 'quantity')

In [None]:
# Method (2) 

fruits.barh('name')

Can you see why **Method (2)** worked?

Whenever we have a table that has just **one** column with categories (e.g. fruit name) and **one** column with frequencies (e.g. fruit quantity), we can call the method with only the column of _categories_, and the method will draw the remaining column as the frequencies.

Another cool thing about bar graphs is that they can be drawn in any order since the categories do not hold a universal rank. This gives us the flexibility to arrange them in an order that makes sense to you and your analysis. 
* Universal rank: 1, 2, 7, 10 
* No universal rank: "banana" , "pineapple", "mango" , etc.



Let's redo our bar graph, but now we want to display the bars in decreasing order of quantity.

Remember the method **sort**? This is the time to use it.

In [None]:
fruits.sort('quantity', descending = True).barh('name', 'quantity')

This plot contains the same information as before, but now we can easily tell which fruit Jesse has the most of.
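
To see what the sort does to the rows before they are drawn, here is a rough plain-Python equivalent. It uses the built-in `sorted` rather than the table method, so treat it as an illustration of the ordering only:

```python
# (name, quantity) rows from the fruits table above.
rows = [("mango", 5), ("apple", 10), ("peach", 8), ("banana", 5), ("pineapple", 2)]

# Order the rows by quantity, largest first.
by_quantity = sorted(rows, key=lambda row: row[1], reverse=True)

for name, quantity in by_quantity:
    print(f"{name:<10}{'#' * quantity}")
```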

Now, it's your turn to try some visualizations!

--- 

### Movies Example<a id='subsection2'></a>

In this section you will be working with the `actors` dataset. The source of the data is the [Internet Movie Database](https://www.imdb.com/), or IMDb, an online service that stores information about movies, television shows, video games, and more. [Box Office Mojo](https://www.boxofficemojo.com/) is a site that provides summaries of IMDb data. We will be using data from this site.

The table `actors` contains the following data: 

|Column Name|Description|
|------|--------|
|Actor	|Name of actor|
|Total Gross|	Total gross domestic box office receipt, in millions of dollars, of all of the actor's movies|
|Number of Movies|	The number of movies the actor has been in|
|Average per Movie|	Total gross divided by number of movies|
|#1 Movie|	The highest grossing movie the actor has been in|
|Gross|	Gross domestic box office receipt, in millions of dollars, of the actor's #1 Movie|

<div class="alert alert-warning">
    
**Question 1.** 

Read the dataset by completing and running the following cell. Save the output to a new variable, `actors`.


In [None]:
... = Table.read_table("../data/actors.csv")
...

In [None]:
# ANSWER KEY
actors = Table.read_table("../data/actors.csv")
actors

Great! Now take a minute to observe the data. Can you guess which variables are categorical? What about numerical?
<br><br>
For the next couple of questions, we are going to plot the top 10 actors according to different numerical variables. In this section you will use the table method **take**, which will be introduced in a later section. **take** is the row version of the method **select**. Recall that **select** returns a table with only some columns; **take** returns a table with only some rows.
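
The difference between choosing columns and choosing rows can be sketched in plain Python. The helpers `select_columns` and `take_rows` below are hypothetical stand-ins written for this illustration; they are not the `datascience` methods themselves, and the table is stored as a simple dictionary of lists:

```python
# A tiny table stored as column label -> list of values (fruits data from above).
table = {
    "name": ["mango", "apple", "peach", "banana", "pineapple"],
    "quantity": [5, 10, 8, 5, 2],
}

def select_columns(table, *labels):
    """Keep only the named columns (hypothetical stand-in for select)."""
    return {label: table[label] for label in labels}

def take_rows(table, indices):
    """Keep only the rows at the given positions (hypothetical stand-in for take)."""
    return {label: [values[i] for i in indices] for label, values in table.items()}

names_only = select_columns(table, "name")
first_two = take_rows(table, range(0, 2))  # rows 0 and 1; the end index is excluded

print(names_only)
print(first_two)
```

Row positions start at 0 and the end of the range is excluded, so a range from 0 to 10 picks out the first 10 rows.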

<div class="alert alert-warning">
    
**Question 2.**

Make a bar plot of the top 10 actors based on their total gross income. Recall that total gross income is the total gross domestic box office receipt, in millions of dollars, of all of the actor's movies.
    


**A)** Create a new table by following these steps:

1) First, let's create a new table that contains the actor's name and their total gross income. <br>
2) Call this table **by_total_gross**. <br>
3) Sort in descending order (since we want the information for the top 10 earners). <br>

Hint: Use the methods **select** and **sort**.

In [None]:
by_total_gross = actors....("Actor", "...").sort("...", descending = ...)
by_total_gross

In [None]:
# ANSWER KEY
by_total_gross = actors.select("Actor", "Total Gross").sort("Total Gross", descending = True)
by_total_gross

**B)** Now we will use the method **take** to select only the top 10 earners, and we will save the result in a variable called **top_10_total_gross**.

In [None]:
top_10_total_gross = by_total_gross.take(np.arange(0,10))
top_10_total_gross

Who is the number one top earner? Who came in 10th place?

_YOUR ANSWER HERE_

**C)** Make a bar plot of the top 10 actors based on their total gross income.

Hint: Use the method **barh**. 

In [None]:
top_10_total_gross...

In [None]:
# ANSWER KEY
top_10_total_gross.barh("Actor","Total Gross")

Do you notice anything interesting about this chart?

_YOUR ANSWER HERE_

<div class="alert alert-warning">

**Question 3.**

Make a bar plot of the top 10 actors based on the number of movies the actor has been in.
    

**A)** Create a new table by following these steps:

1) First, let's create a new table that contains the actor's name and the number of movies the actor has been in. <br>
2) Call this table **by_num_movies**. <br>
3) Sort in descending order (since we want the information for the 10 actors with the most movies). <br>

Hint: Use the methods **select** and **sort**.

In [None]:
# YOUR CODE HERE
... = actors.select("...", "...").sort("...", ...)
...

In [None]:
# ANSWER KEY
by_num_movies = actors.select("Actor", "Number of Movies").sort("Number of Movies", descending = True)
by_num_movies

**B)** Now we will use the method **take** to select only the 10 actors with the most movies, and we will save the result in a variable called **top_10_num_movies**.

In [None]:
... = by_num_movies.take(np.arange(0,10))
...

In [None]:
# ANSWER KEY
top_10_num_movies = by_num_movies.take(np.arange(0,10))
top_10_num_movies

**C)** Make a bar plot of the top 10 actors based on the number of movies they have been in.

Hint: Use the method **barh**. 

In [None]:
# YOUR CODE HERE 
... 

In [None]:
# ANSWER KEY
top_10_num_movies.barh("Actor","Number of Movies")

Do you notice anything interesting about this chart?

_YOUR ANSWER HERE_

---

## Bibliography

The Foundations of Data Science by Ani Adhikari and John DeNero with contributions by David Wagner and Henry Milner (Data 8 Textbook - Chapter 7) https://www.inferentialthinking.com/chapters/07/Visualization.html


---
Notebook developed by: Ashley Quiterio, Alleanna Clark and Karla Palos

Data Science Modules: http://data.berkeley.edu/education/modules
