# Learning Pandas' internal Series 
A walkthrough to better understand how pandas Series objects work under the hood.

The three key data structures in Pandas are:
- Series objects (collections of values)
- DataFrames (collections of Series objects)
- Panels (collections of DataFrame objects; deprecated and removed in later pandas releases)

In this mission, we will focus on the Series Object. 

Series objects use NumPy arrays for fast computation, but add valuable features to them for analyzing data. 
While NumPy arrays use an integer index, for example, Series objects can use other index types, such as a string index. Series objects also allow for mixed data types, and use the NaN value (a special floating-point value, available as `np.nan`) for handling missing values.
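As a minimal sketch of these two features, the cell below builds a tiny Series with a string index and one missing value. The labels and numbers are made up for illustration and do not come from the data set used later.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): a string index and one missing value.
toy = pd.Series([1.5, np.nan, 3.0], index=["a", "b", "c"])

print(toy["c"])          # label-based lookup -> 3.0
print(toy.isna().sum())  # number of missing values -> 1
print(toy.dtype)         # float64, because NaN is a floating-point value
```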

A Series object can hold many data types, including: 
- `float` - For float values
- `int` - For integer values
- `bool` - For Boolean values
- `datetime64[ns]` - For date & time, without time zone
- `datetime64[ns, tz]` - For date & time, with time zone
- `timedelta64[ns]` - For representing differences in dates & times (seconds, minutes, etc.)
- `category` - For categorical values
- `object` - For string values
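To see these types in practice, the following sketch creates one small Series per kind of data and prints the dtype pandas picks. All values are made up for illustration, and the exact dtype names printed (for example the integer width, the datetime resolution, or the string dtype) may differ between pandas versions and platforms.

```python
import pandas as pd

# One tiny Series per kind of data (values are illustrative only).
examples = {
    "float": pd.Series([1.5, 2.5]),
    "int": pd.Series([1, 2]),
    "bool": pd.Series([True, False]),
    "datetime": pd.Series(pd.to_datetime(["2015-01-01", "2015-06-01"])),
    "timedelta": pd.Series(pd.to_timedelta(["1 days", "2 days"])),
    "category": pd.Series(["a", "b"], dtype="category"),
    "text": pd.Series(["x", "y"]),
}

# Print the dtype pandas assigned to each Series.
for name, series in examples.items():
    print(name, series.dtype)
```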



## Fandango

Before we go into further depth, let's introduce the data set we'll be working with. The FiveThirtyEight team recently released a data set containing scores for all movies that have substantive user and critic reviews on IMDB, Rotten Tomatoes, Metacritic, and Fandango. **We'll be working with the file `fandango_score_comparison.csv`**, which you can download from their GitHub repository. Here are some of the columns in the data set:

- `FILM` - Film name
- `RottenTomatoes` - Average critic score on Rotten Tomatoes
- `RottenTomatoes_User` - Average user score on Rotten Tomatoes
- `RT_norm` - Average critic score on Rotten Tomatoes (normalized to a 0 to 5-point system)
- `RT_user_norm` - Average user score on Rotten Tomatoes (normalized to a 0 to 5-point system)
- `Metacritic` - Average critic score on Metacritic
- `Metacritic_User` - Average user score on Metacritic

The full list of columns, along with their descriptions, is available on the [GitHub repository](https://github.com/fivethirtyeight/data/tree/master/fandango).

**Instructions:**
- Use the pd.read_csv() function to read "fandango_score_comparison.csv" into a DataFrame object called fandango.
- Then, use the .head() method to print the first two rows.


In [1]:
import pandas as pd
import numpy as np
fandango = pd.read_csv("fandango_score_comparison.csv")
fandango.head(2)

Unnamed: 0,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
0,Avengers: Age of Ultron (2015),74,86,66,7.1,7.8,5.0,4.5,3.7,4.3,...,3.9,3.5,4.5,3.5,3.5,4.0,1330,271107,14846,0.5
1,Cinderella (2015),85,80,67,7.5,7.1,5.0,4.5,4.25,4.0,...,3.55,4.5,4.0,3.5,4.0,3.5,249,65709,12640,0.5


DataFrames use Series objects to represent columns. When we select a single column from a DataFrame, pandas returns the Series object representing that column. By default, pandas gives each Series in a DataFrame an integer index, so each value has a unique integer label that matches its position. Like most Python data structures, the Series object uses 0-indexing: the index ranges from 0 to n-1, where n is the number of rows. We can use an integer index to select an individual value in a Series if we know its position.

With both NumPy arrays and Series objects, we can pass integer indexes into bracket notation to slice and select values. With Series objects, however, we can also specify custom indexes.
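A minimal sketch of integer selection on both kinds of object is shown below. It uses the first five `RottenTomatoes` scores from the file, typed in by hand so that the cell runs on its own; the variable names are chosen for this example only.

```python
import numpy as np
import pandas as pd

# First five RottenTomatoes scores, entered by hand for a self-contained example.
values = np.array([74, 85, 80, 18, 14])
series = pd.Series(values)  # gets the default integer index 0..4

print(values[1], series[1])  # 85 85
print(values[1:3])           # [85 80]
print(series[1:3].tolist())  # [85, 80]
```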

To explore this idea further, let's use two Series objects representing the film names and Rotten Tomatoes scores.

**Instruction**
- Select the FILM column, assign it to the variable series_film, and print the first five values.
- Then, select the RottenTomatoes column, assign it to the variable series_rt, and print the first five values.

In [2]:
series_film = fandango["FILM"]
print(series_film[0:5])
print()
series_rt = fandango["RottenTomatoes"]
print(series_rt[0:5])

0    Avengers: Age of Ultron (2015)
1                 Cinderella (2015)
2                    Ant-Man (2015)
3            Do You Believe? (2015)
4     Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object

0    74
1    85
2    80
3    18
4    14
Name: RottenTomatoes, dtype: int64


## Custom index

Both of these Series objects use the same integer index, which means that the value at index 5, for example, describes the same film in both Series objects (The Water Diviner (2015)). If we had a movie in mind, we would need the integer index corresponding to that movie to look up information about it.

If we were given just these two Series objects and we wanted to look up the Rotten Tomatoes score for Minions (2015) and Leviathan (2014), we'd have to:

- Find the integer index corresponding to Minions (2015) in series_film.
- Look up the value at that integer index from series_rt.
- Find the integer index corresponding to Leviathan (2014) in series_film.
- Look up the value at that integer index from series_rt.
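The two-step lookup described above can be sketched as follows. To keep the cell self-contained, it uses only the first five films and scores printed earlier, under the example names `films_sample` and `rt_sample`; the helper `lookup_score` is a hypothetical function written for this illustration.

```python
import pandas as pd

# First five films and scores, entered by hand (both use the default integer index).
films_sample = pd.Series([
    "Avengers: Age of Ultron (2015)",
    "Cinderella (2015)",
    "Ant-Man (2015)",
    "Do You Believe? (2015)",
    "Hot Tub Time Machine 2 (2015)",
])
rt_sample = pd.Series([74, 85, 80, 18, 14])

def lookup_score(title):
    # Step 1: find the integer index of the row whose film name matches.
    position = films_sample[films_sample == title].index[0]
    # Step 2: read the score stored under that integer index in the other Series.
    return rt_sample[position]

print(lookup_score("Ant-Man (2015)"))          # 80
print(lookup_score("Do You Believe? (2015)"))  # 18
```

Every additional film needs another pair of steps, which is the overhead a string index removes.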

This becomes especially cumbersome as we scale up the problem to looking up information for a larger number of movies. What we really want is a way to look up the Rotten Tomatoes scores for many movies at a time using just one command (and one Series object). To accomplish this, we need to find a way to move away from using an integer index corresponding to the row number and instead use a string index corresponding to the film name. Then we can utilize bracket notation to just pass in a list of strings matching the film names and get back the Rotten Tomatoes scores:

    series_custom[['Minions (2015)', 'Leviathan (2014)']]





**Instruction**
- Create a new Series object named series_custom that has a string index (based on the values from film_names), and contains all of the Rotten Tomatoes scores from series_rt.
    - To create a new Series object:
        - Import Series from pandas.
        - Instantiate a new Series object, which takes in a data parameter and an index parameter. See the [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html#pandas.Series) for help.
        - Both of these parameters need to be list-like (for example, Python lists or NumPy arrays) and of the same length.


In [3]:
from pandas import Series
import pandas as pd 

fandango = pd.read_csv("fandango_score_comparison.csv")
film_names = fandango["FILM"].values # or series_film.values
rt_scores = fandango["RottenTomatoes"].values
type(film_names)

series_custom = Series(data=rt_scores, index=film_names)
print(type(series_custom))
series_custom

<class 'pandas.core.series.Series'>


Avengers: Age of Ultron (2015)                     74
Cinderella (2015)                                  85
Ant-Man (2015)                                     80
Do You Believe? (2015)                             18
Hot Tub Time Machine 2 (2015)                      14
The Water Diviner (2015)                           63
Irrational Man (2015)                              42
Top Five (2014)                                    86
Shaun the Sheep Movie (2015)                       99
Love & Mercy (2015)                                89
Far From The Madding Crowd (2015)                  84
Black Sea (2015)                                   82
Leviathan (2014)                                   99
Unbroken (2014)                                    51
The Imitation Game (2014)                          90
Taken 3 (2015)                                      9
Ted 2 (2015)                                       46
Southpaw (2015)                                    59
Night at the Museum: Secret 

Even though we specified that the Series object uses a custom string index, the object still has an internal integer index that we can use for selection. When it comes to indexes, Series objects act like both dictionaries and lists. We can access values with our custom index (like the keys in a dictionary), or the integer index (like the index in a list).
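Plain bracket notation accepts both kinds of key, but pandas also provides the explicit accessors `.loc` (by label) and `.iloc` (by position), which avoid any ambiguity about which one we mean. The sketch below uses a three-film sample entered by hand under the example name `sample`.

```python
import pandas as pd

# Three films and their scores, entered by hand for a self-contained example.
sample = pd.Series(
    [74, 85, 80],
    index=["Avengers: Age of Ultron (2015)", "Cinderella (2015)", "Ant-Man (2015)"],
)

print(sample.loc["Cinderella (2015)"])  # dictionary-like access by label -> 85
print(sample.iloc[1])                   # list-like access by position -> 85
print(sample.iloc[0:2].tolist())        # positional slice -> [74, 85]
```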

**Instruction**
- Assign the values in series_custom at indexes 5 through 10 to the variable fiveten. Then, print fiveten to verify that you can still use integer values for selection.

In [4]:
series_custom[["Minions (2015)", "Leviathan (2014)"]]

Minions (2015)      54
Leviathan (2014)    99
dtype: int64

In [5]:
fiveten = series_custom[5:11]
fiveten

The Water Diviner (2015)             63
Irrational Man (2015)                42
Top Five (2014)                      86
Shaun the Sheep Movie (2015)         99
Love & Mercy (2015)                  89
Far From The Madding Crowd (2015)    84
dtype: int64

## Reindexing 
Reindexing is the pandas way of modifying the alignment between labels (indexes) and the data (values). The reindex() method allows us to specify a different order for the labels (indexes) in a Series object. This method takes in a list of strings corresponding to the order we'd like for that Series object.
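Here is a small sketch of `reindex()` on a three-film sample entered by hand. It also shows what happens when we ask for a label that does not exist in the original index; the label `"Not In The Data (hypothetical)"` is invented for this purpose.

```python
import pandas as pd

# Three films and their scores, entered by hand for a self-contained example.
sample = pd.Series(
    [74, 85, 80],
    index=["Avengers: Age of Ultron (2015)", "Cinderella (2015)", "Ant-Man (2015)"],
)

# Reorder the labels; each value moves together with its label.
reordered = sample.reindex(
    ["Ant-Man (2015)", "Avengers: Age of Ultron (2015)", "Cinderella (2015)"]
)
print(reordered.tolist())  # [80, 74, 85]

# A label missing from the original index is filled with NaN.
with_missing = sample.reindex(["Ant-Man (2015)", "Not In The Data (hypothetical)"])
print(with_missing.isna().tolist())  # [False, True]
```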

We can use the reindex() method to sort series_custom alphabetically by film. To accomplish this, we need to:

- Return a list representation of the current index using tolist().
- Sort the index with sorted().
- Use reindex() to set the newly-ordered index.

The following code cell handles the first step by retrieving the current index (sorted() accepts the Index object directly, so an explicit tolist() call is optional). We'll leave it up to you to finish the rest.

**Instructions**

 - The variable original_index holds the original index. Sort it using Python's built-in sorted() function, then pass the result to the Series method reindex().
 - Store the result in a variable named sorted_by_index.


In [6]:
original_index = series_custom.index 
original_index

Index(['Avengers: Age of Ultron (2015)', 'Cinderella (2015)', 'Ant-Man (2015)',
       'Do You Believe? (2015)', 'Hot Tub Time Machine 2 (2015)',
       'The Water Diviner (2015)', 'Irrational Man (2015)', 'Top Five (2014)',
       'Shaun the Sheep Movie (2015)', 'Love & Mercy (2015)',
       ...
       'The Woman In Black 2 Angel of Death (2015)', 'Danny Collins (2015)',
       'Spare Parts (2015)', 'Serena (2015)', 'Inside Out (2015)',
       'Mr. Holmes (2015)', ''71 (2015)', 'Two Days, One Night (2014)',
       'Gett: The Trial of Viviane Amsalem (2015)',
       'Kumiko, The Treasure Hunter (2015)'],
      dtype='object', length=146)

In [7]:
sorted_by_index = series_custom.reindex(index=sorted(original_index))
sorted_by_index.index

Index([''71 (2015)', '5 Flights Up (2015)', 'A Little Chaos (2015)',
       'A Most Violent Year (2014)', 'About Elly (2015)', 'Aloha (2015)',
       'American Sniper (2015)', 'American Ultra (2015)', 'Amy (2015)',
       'Annie (2014)',
       ...
       'Unbroken (2014)', 'Unfinished Business (2015)', 'Unfriended (2015)',
       'Vacation (2015)', 'Welcome to Me (2015)',
       'What We Do in the Shadows (2015)', 'When Marnie Was There (2015)',
       'While We're Young (2015)', 'Wild Tales (2014)',
       'Woman in Gold (2015)'],
      dtype='object', length=146)

We just learned how to sort a Series object by the index using the `reindex()` method. This can be cumbersome if we just want to do some quick exploratory data analysis, or reorder by rating instead of film name.

To make sorting easier, pandas comes with a `sort_index()` method that sorts a Series by index, and a `sort_values()` method that sorts a Series by its values. Since the values representing the Rotten Tomatoes scores are integers, sorting by values will return the data in numerically ascending order (low to high).

In both cases, pandas preserves the link between each element's index (film name) and value (score). We call this data alignment, which is a key tenet of pandas that's incredibly important when analyzing data. Pandas allows us to assume the linking will be preserved, unless we specifically change a value or an index.
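The sketch below checks this on a three-film sample entered by hand: after either sort, each film is still paired with its own score. The `ascending=False` argument is included as an optional extra for sorting from high to low.

```python
import pandas as pd

# Three films and their scores, entered by hand for a self-contained example.
sample = pd.Series(
    [74, 85, 80],
    index=["Avengers: Age of Ultron (2015)", "Cinderella (2015)", "Ant-Man (2015)"],
)

by_label = sample.sort_index()    # alphabetical by film name
by_score = sample.sort_values()   # ascending by score
by_score_desc = sample.sort_values(ascending=False)  # descending by score

print(by_label.index.tolist())
print(by_score.tolist())          # [74, 80, 85]
print(by_score_desc.tolist())     # [85, 80, 74]

# The label/value pairs are unchanged by sorting.
print(by_score["Cinderella (2015)"])  # 85
```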

**Instruction**
- Sort series_custom by index using sort_index(), and assign the result to the variable sc2.
- Sort series_custom by values, and assign the result to the variable sc3.
- Finally, print the first 10 values in sc2 and the first 10 values in sc3.


In [8]:
sc2 = series_custom.sort_index()
sc3 = series_custom.sort_values()
print(sc2[0:10])
print()
print(sc3[0:10])

'71 (2015)                    97
5 Flights Up (2015)           52
A Little Chaos (2015)         40
A Most Violent Year (2014)    90
About Elly (2015)             97
Aloha (2015)                  19
American Sniper (2015)        72
American Ultra (2015)         46
Amy (2015)                    97
Annie (2014)                  27
dtype: int64

Paul Blart: Mall Cop 2 (2015)     5
Hitman: Agent 47 (2015)           7
Hot Pursuit (2015)                8
Fantastic Four (2015)             9
Taken 3 (2015)                    9
The Boy Next Door (2015)         10
The Loft (2015)                  11
Unfinished Business (2015)       11
Mortdecai (2015)                 12
Seventh Son (2015)               12
dtype: int64


## Vectorized Operations
A column is really a vector of values. For this reason, we often want to transform an entire column in a data set. Series objects offer robust support for vectorized operations, which enable us to run computations over an entire column very quickly.

Since pandas builds on NumPy, it takes advantage of NumPy's vectorization capabilities. These operations run in optimized, precompiled C code that loops over the values, rather than in the Python interpreter. Using a traditional for loop would be much slower, especially for large data sets.
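To make the contrast concrete, this sketch computes the same result twice on synthetic scores from 0 to 100: once with an explicit Python loop and once with a single vectorized expression. It only checks that the results match; it does not measure speed, which depends on the machine and data size.

```python
import numpy as np
import pandas as pd

# Synthetic scores 0..100 (not taken from the data set).
scores = pd.Series(np.arange(0, 101))

# Explicit Python loop: one division per iteration in the interpreter.
looped = pd.Series([value / 10 for value in scores], index=scores.index)

# Vectorized: a single expression applied to the whole Series.
vectorized = scores / 10

print(looped.equals(vectorized))  # True
```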

We can use any of the standard Python arithmetic operators (+, -, *, and /) to transform each of the values in a Series object. If we wanted to transform the Rotten Tomatoes scores from a 100-point scale to a 10-point scale, for example, we could use the Python division operator (/) to divide the Series by 10:






In [9]:
series_custom/10

Avengers: Age of Ultron (2015)                     7.4
Cinderella (2015)                                  8.5
Ant-Man (2015)                                     8.0
Do You Believe? (2015)                             1.8
Hot Tub Time Machine 2 (2015)                      1.4
The Water Diviner (2015)                           6.3
Irrational Man (2015)                              4.2
Top Five (2014)                                    8.6
Shaun the Sheep Movie (2015)                       9.9
Love & Mercy (2015)                                8.9
Far From The Madding Crowd (2015)                  8.4
Black Sea (2015)                                   8.2
Leviathan (2014)                                   9.9
Unbroken (2014)                                    5.1
The Imitation Game (2014)                          9.0
Taken 3 (2015)                                     0.9
Ted 2 (2015)                                       4.6
Southpaw (2015)                                    5.9
Night at t

This will return a new Series object where each value is 1/10 of the original value. We can even use NumPy functions to transform and run calculations over Series objects:

In [10]:
# Add each value to itself, element by element
np.add(series_custom, series_custom)
# Apply sine function to each value
np.sin(series_custom)
# Return the highest value (will return a single value, not a Series)
np.max(series_custom)

100

The values in a Series object are part of an `ndarray`, the core data type in NumPy. Applying some NumPy functions to a Series object will return a new Series object, while other functions will return a single value. NumPy's [documentation](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sin.html#numpy.sin) gives us a good sense of the return value for each function. If a particular NumPy function usually returns an `ndarray`, it will return a Series object instead when we apply it to a Series.
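A short sketch of the two kinds of return value, using a three-film sample entered by hand: an element-wise function such as `np.add` gives back a Series with the same index, while a reduction such as `np.max` gives back a single number.

```python
import numpy as np
import pandas as pd

# Three films and their scores, entered by hand for a self-contained example.
sample = pd.Series(
    [74, 85, 80],
    index=["Avengers: Age of Ultron (2015)", "Cinderella (2015)", "Ant-Man (2015)"],
)

doubled = np.add(sample, sample)  # element-wise: returns a Series
highest = np.max(sample)          # reduction: returns a single value

print(type(doubled).__name__)              # Series
print(doubled.index.equals(sample.index))  # True
print(highest)                             # 85
```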

The original DataFrame contains the column `RT_norm`, which represents a normalized score (from 0 to 5) of the Rotten Tomatoes average critic score. Let's use vectorized operations to normalize `series_custom` back to the 0-5 scale.

**Instruction**
- Normalize series_custom (which is currently on a 0 to 100-point scale) to a 0 to 5-point scale by dividing each value by 20.
- Assign the new normalized Series object to series_normalized.


In [12]:
series_normalized = series_custom / 20 
series_normalized

Avengers: Age of Ultron (2015)                    3.70
Cinderella (2015)                                 4.25
Ant-Man (2015)                                    4.00
Do You Believe? (2015)                            0.90
Hot Tub Time Machine 2 (2015)                     0.70
The Water Diviner (2015)                          3.15
Irrational Man (2015)                             2.10
Top Five (2014)                                   4.30
Shaun the Sheep Movie (2015)                      4.95
Love & Mercy (2015)                               4.45
Far From The Madding Crowd (2015)                 4.20
Black Sea (2015)                                  4.10
Leviathan (2014)                                  4.95
Unbroken (2014)                                   2.55
The Imitation Game (2014)                         4.50
Taken 3 (2015)                                    0.45
Ted 2 (2015)                                      2.30
Southpaw (2015)                                   2.95
Night at t

## Comparing and Filtering
Pandas uses vectorized operations for many tasks, such as filtering values within a single Series object and comparing two different Series objects. For example, suppose we want to find all films with an average critic rating above 50 on Rotten Tomatoes. Running:

In [14]:
# Returns a Series object with a Boolean value for each film (the film names remain the index)
series_custom > 50

Avengers: Age of Ultron (2015)                     True
Cinderella (2015)                                  True
Ant-Man (2015)                                     True
Do You Believe? (2015)                            False
Hot Tub Time Machine 2 (2015)                     False
The Water Diviner (2015)                           True
Irrational Man (2015)                             False
Top Five (2014)                                    True
Shaun the Sheep Movie (2015)                       True
Love & Mercy (2015)                                True
Far From The Madding Crowd (2015)                  True
Black Sea (2015)                                   True
Leviathan (2014)                                   True
Unbroken (2014)                                    True
The Imitation Game (2014)                          True
Taken 3 (2015)                                    False
Ted 2 (2015)                                      False
Southpaw (2015)                                 

will actually return a Series object with a Boolean value for each film. That's because pandas applies the comparison (> 50) to each value in the Series object. To retrieve the films and scores that pass the test, we pass this Boolean Series into the original Series object using bracket notation.

In [16]:
series_greater_than_50 = series_custom[series_custom > 50]
series_greater_than_50

Avengers: Age of Ultron (2015)                                             74
Cinderella (2015)                                                          85
Ant-Man (2015)                                                             80
The Water Diviner (2015)                                                   63
Top Five (2014)                                                            86
Shaun the Sheep Movie (2015)                                               99
Love & Mercy (2015)                                                        89
Far From The Madding Crowd (2015)                                          84
Black Sea (2015)                                                           82
Leviathan (2014)                                                           99
Unbroken (2014)                                                            51
The Imitation Game (2014)                                                  90
Southpaw (2015)                                                 

Pandas returns Boolean Series objects that serve as intermediate representations of the logic. These objects make it easier to separate complex logic into modular pieces. We can specify filtering criteria in different variables, then combine them element by element with the `&` (and) operator or the `|` (or) operator; Python's `and` and `or` keywords do not work on Series objects. When writing comparisons inline, wrap each one in parentheses, because `&` and `|` bind more tightly than `>` and `<`. Finally, we can use a Series object's bracket notation to pass in an expression representing a Boolean Series object and get back the filtered data set.
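As a brief sketch of combining criteria, the cell below uses four films and scores entered by hand. The names `above_50` and `below_75` are chosen for this example; the second filter shows the `|` operator with inline, parenthesized comparisons.

```python
import pandas as pd

# Four films and their scores, entered by hand for a self-contained example.
sample = pd.Series(
    [74, 85, 18, 51],
    index=[
        "Avengers: Age of Ultron (2015)",
        "Cinderella (2015)",
        "Do You Believe? (2015)",
        "Unbroken (2014)",
    ],
)

# Criteria stored in variables, then combined with & (both must be True).
above_50 = sample > 50
below_75 = sample < 75
print(sample[above_50 & below_75].tolist())  # [74, 51]

# Inline comparisons combined with | (either may be True); note the parentheses.
print(sample[(sample < 20) | (sample > 80)].tolist())  # [85, 18]
```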

**Instruction**

- In the following code cell, the criteria_one and criteria_two statements return Boolean Series objects.
- Return a filtered Series object named both_criteria that only contains the values where both criteria are true. Use bracket notation and the & operator to obtain this Series object.


In [18]:
criteria_one = series_custom > 50
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two]
both_criteria

Avengers: Age of Ultron (2015)                                            74
The Water Diviner (2015)                                                  63
Unbroken (2014)                                                           51
Southpaw (2015)                                                           59
Insidious: Chapter 3 (2015)                                               59
The Man From U.N.C.L.E. (2015)                                            68
Run All Night (2015)                                                      60
5 Flights Up (2015)                                                       52
Welcome to Me (2015)                                                      71
Saint Laurent (2015)                                                      51
Maps to the Stars (2015)                                                  60
Pitch Perfect 2 (2015)                                                    67
The Age of Adaline (2015)                                                 54

## Data alignment
One of pandas' core tenets is data alignment. Series objects align along indices, and DataFrame objects align along both indices and columns. With Series objects, pandas implicitly preserves the link between the index labels and the values across operations and transformations, unless we explicitly break it. With DataFrame objects, the values link to the index labels and the column labels. Pandas also preserves these links, unless we explicitly break them (by reassigning or editing a column or index label, for example).

This core tenet allows us to use pandas effectively when working with data, and offers a big advantage over using NumPy objects. For Series objects in particular, this means we can use the standard Python arithmetic operators (+, -, *, and /) to add, subtract, multiply, and divide the values at each index label for two different Series objects.
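The sketch below illustrates alignment with subtraction on three films entered by hand. The two Series, named `critics` and `users` for this example, list the same films in a different order, yet pandas still matches values by label rather than by position.

```python
import pandas as pd

# Critic and user scores for the same three films, deliberately in different orders.
critics = pd.Series(
    [74, 85, 80],
    index=["Avengers: Age of Ultron (2015)", "Cinderella (2015)", "Ant-Man (2015)"],
)
users = pd.Series(
    [90, 86, 80],
    index=["Ant-Man (2015)", "Avengers: Age of Ultron (2015)", "Cinderella (2015)"],
)

# Subtraction pairs values by film name, not by row position.
difference = critics - users

print(difference["Avengers: Age of Ultron (2015)"])  # 74 - 86 = -12
print(difference["Cinderella (2015)"])               # 85 - 80 = 5
print(difference["Ant-Man (2015)"])                  # 80 - 90 = -10
```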

Let's use this functionality to calculate the mean ratings from both critics and users on Rotten Tomatoes.

**Instruction**
- rt_critics and rt_users are Series objects containing the average ratings from critics and users for each film.
- Both Series objects use the same custom string index, which they base on the film names. Use the Python arithmetic operators to return a new Series object, rt_mean, that contains the mean ratings from both critics and users for each film.


In [25]:
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_critics.head(3)

FILM
Avengers: Age of Ultron (2015)    74
Cinderella (2015)                 85
Ant-Man (2015)                    80
dtype: int64

In [24]:
rt_users.head(3)

FILM
Avengers: Age of Ultron (2015)    86
Cinderella (2015)                 80
Ant-Man (2015)                    90
dtype: int64

In [None]:
rt_mean = (rt_critics + rt_users) / 2