# Lecture 6 Pandas and Obtaining Data from File
__Math 3080: Fundamentals of Data Science__

Reading:
* [McKinney, *Python for Data Analysis*, Chapter 5](https://wesmckinney.com/book/pandas-basics)

Class notes are found through GitHub. As changes are made, they will automatically be uploaded to GitHub. A link to the repository is on Canvas.

-----
## Outline
* Pandas
  * Series and DataFrame objects
  * Setting Index
  * Setting Columns
  * Sorting
  * Statistics
    * `.info()` and `.describe()`
    * Mean and Standard Deviation
  * Dropping rows and columns
  * Subsetting
  * Applying a function to a column
* Loading Data from a File
-----

## Pandas

Pandas is short for "*Pan*el *Da*ta". It was created in 2008 by Wes McKinney (the author of our textbook) and has become a staple of data science in Python.

Pandas has two main tools: the *Series* and the *DataFrame*. A Series is much like a one-dimensional NumPy array, but it adds an index of labels and much more functionality. To see this, let's take this dataset of the books in "The Chronicles of Narnia" by C.S. Lewis.

In [46]:
import pandas as pd
import numpy as np

data = {'title': ["The Lion, The Witch, and the Wardrobe", "Prince Caspian","The Voyage of the Dawn Treader",
                  "The Silver Chair","The Horse and His Boy","The Magician's Nephew","The Last Battle"],
        'publish_date': [1950,1951,1952,1953,1954,1955,1956],
        'order': [2, 4, 5, 6, 3, 1, 7],
        'book_rating': [4.7, 4.6, 4.7, 4.6, 4.6, 4.6, 4.7],
        'book_num_ratings': [18015, 3813, 3177, 3133, 4593, 9561, 3120],
        'movie_date': [2005, 2008, 2011, np.nan, np.nan, np.nan, np.nan],
        'movie_rating': [6.9, 6.5, 6.3, np.nan, np.nan, np.nan, np.nan],
        'movie_num_ratings': [423000, 223000, 165000, np.nan, np.nan, np.nan, np.nan]
        }


### Series

To create a series:

In [47]:
narnia_titles = pd.Series(data['book_rating'])
narnia_titles

0    4.7
1    4.6
2    4.7
3    4.6
4    4.6
5    4.6
6    4.7
dtype: float64

At first, this looks just like an array. We can do many of the same calculations as we can with an array.

In [51]:
narnia_titles.mean()

4.642857142857144

In [52]:
narnia_titles.std()

0.05345224838248516
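One difference from NumPy is worth keeping in mind. `Series.std()` uses the sample formula (dividing by n - 1), while `np.std()` defaults to the population formula (dividing by n). Here is a minimal sketch using the same ratings; the variable names are ours, chosen for illustration.

```python
import numpy as np
import pandas as pd

# Same book ratings as above, rebuilt so this cell runs on its own.
ratings = pd.Series([4.7, 4.6, 4.7, 4.6, 4.6, 4.6, 4.7])

sample_std = ratings.std()                    # pandas default: ddof=1 (divide by n - 1)
population_std = np.std(ratings.to_numpy())   # NumPy default: ddof=0 (divide by n)
matched_std = ratings.std(ddof=0)             # ask pandas for the population version

print(sample_std, population_std, matched_std)
```

The first value matches the output above; the other two agree with each other and are slightly smaller.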

However, a Series also allows us to add information to help us with our data processing. To start, let's change the index.

In [49]:
narnia_ratings = pd.Series(data['book_rating'], index=data['title'])
narnia_ratings

The Lion, The Witch, and the Wardrobe    4.7
Prince Caspian                           4.6
The Voyage of the Dawn Treader           4.7
The Silver Chair                         4.6
The Horse and His Boy                    4.6
The Magician's Nephew                    4.6
The Last Battle                          4.7
dtype: float64

Now we can look up an element by its title instead of its position.

In [53]:
narnia_ratings['Prince Caspian']

4.6

Let's also say that we only want the books with more than 5000 ratings. We can easily filter the books.

In [56]:
narnia_num_ratings = pd.Series(data['book_num_ratings'], index=data['title'])
narnia_num_ratings

The Lion, The Witch, and the Wardrobe    18015
Prince Caspian                            3813
The Voyage of the Dawn Treader            3177
The Silver Chair                          3133
The Horse and His Boy                     4593
The Magician's Nephew                     9561
The Last Battle                           3120
dtype: int64

In [57]:
narnia_num_ratings > 5000

The Lion, The Witch, and the Wardrobe     True
Prince Caspian                           False
The Voyage of the Dawn Treader           False
The Silver Chair                         False
The Horse and His Boy                    False
The Magician's Nephew                     True
The Last Battle                          False
dtype: bool

In [58]:
narnia_num_ratings[narnia_num_ratings > 5000]

The Lion, The Witch, and the Wardrobe    18015
The Magician's Nephew                     9561
dtype: int64
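The same pattern extends to more than one condition. As a sketch (the thresholds and the names `mid_range` and `not_popular` are ours, chosen for illustration), conditions are combined with `&`, `|`, and `~`, and each condition needs its own parentheses.

```python
import pandas as pd

# Same counts as narnia_num_ratings above, rebuilt so this cell runs on its own.
num_ratings = pd.Series(
    [18015, 3813, 3177, 3133, 4593, 9561, 3120],
    index=["The Lion, The Witch, and the Wardrobe", "Prince Caspian",
           "The Voyage of the Dawn Treader", "The Silver Chair",
           "The Horse and His Boy", "The Magician's Nephew", "The Last Battle"],
)

# & means "and", | means "or", ~ means "not"; wrap every condition in parentheses.
mid_range = num_ratings[(num_ratings > 3500) & (num_ratings < 10000)]
not_popular = num_ratings[~(num_ratings > 5000)]

print(mid_range)
print(not_popular)
```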

### DataFrames

There are a lot of things we could do with Series, but an even more useful tool is a DataFrame. A DataFrame is basically a table where each column is a Series.

In [100]:
narnia = pd.DataFrame(data)
narnia

| | title | publish_date | order | book_rating | book_num_ratings | movie_date | movie_rating | movie_num_ratings |
|---|---|---|---|---|---|---|---|---|
| 0 | The Lion, The Witch, and the Wardrobe | 1950 | 2 | 4.7 | 18015 | 2005.0 | 6.9 | 423000.0 |
| 1 | Prince Caspian | 1951 | 4 | 4.6 | 3813 | 2008.0 | 6.5 | 223000.0 |
| 2 | The Voyage of the Dawn Treader | 1952 | 5 | 4.7 | 3177 | 2011.0 | 6.3 | 165000.0 |
| 3 | The Silver Chair | 1953 | 6 | 4.6 | 3133 | NaN | NaN | NaN |
| 4 | The Horse and His Boy | 1954 | 3 | 4.6 | 4593 | NaN | NaN | NaN |
| 5 | The Magician's Nephew | 1955 | 1 | 4.6 | 9561 | NaN | NaN | NaN |
| 6 | The Last Battle | 1956 | 7 | 4.7 | 3120 | NaN | NaN | NaN |


There are a couple of things we can do with this. Just as with a Series, we can change the index.

`df.set_index('column')` returns a new DataFrame and leaves the original unchanged, so on its own the change lasts only for that one line of code. If we want the change to be permanent, we add the `inplace=True` argument (or assign the result back to the variable).

In [101]:
narnia.set_index('title', inplace=True)
narnia

| title | publish_date | order | book_rating | book_num_ratings | movie_date | movie_rating | movie_num_ratings |
|---|---|---|---|---|---|---|---|
| The Lion, The Witch, and the Wardrobe | 1950 | 2 | 4.7 | 18015 | 2005.0 | 6.9 | 423000.0 |
| Prince Caspian | 1951 | 4 | 4.6 | 3813 | 2008.0 | 6.5 | 223000.0 |
| The Voyage of the Dawn Treader | 1952 | 5 | 4.7 | 3177 | 2011.0 | 6.3 | 165000.0 |
| The Silver Chair | 1953 | 6 | 4.6 | 3133 | NaN | NaN | NaN |
| The Horse and His Boy | 1954 | 3 | 4.6 | 4593 | NaN | NaN | NaN |
| The Magician's Nephew | 1955 | 1 | 4.6 | 9561 | NaN | NaN | NaN |
| The Last Battle | 1956 | 7 | 4.7 | 3120 | NaN | NaN | NaN |
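To see the difference between the temporary and the permanent version, here is a small sketch on a two-row table (the names `books`, `indexed`, and `restored` are ours). Without `inplace=True`, `set_index` hands back a new DataFrame, and `reset_index` undoes the change.

```python
import pandas as pd

books = pd.DataFrame({"title": ["Prince Caspian", "The Silver Chair"],
                      "publish_date": [1951, 1953]})

# set_index returns a NEW DataFrame; books itself keeps its 0, 1 index.
indexed = books.set_index("title")
print(list(books.index))    # original index is untouched
print(list(indexed.index))  # titles are now the index

# reset_index moves the index back into an ordinary column.
restored = indexed.reset_index()
print(list(restored.columns))
```

Assigning the result to a name, as done here, is an alternative to `inplace=True`.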


Notice the column names. They work as they are, but if we want to rename them, we can assign a new list of names to the `.columns` attribute.

In [102]:
narnia.columns

Index(['publish_date', 'order', 'book_rating', 'book_num_ratings',
       'movie_date', 'movie_rating', 'movie_num_ratings'],
      dtype='object')

In [103]:
narnia.columns = ['Published', 'Chronological Order', 'Book Rating', 'Number of Book Ratings',
                  'Movie Release Date', 'Movie Rating', 'Number of Movie Ratings']
narnia

| title | Published | Chronological Order | Book Rating | Number of Book Ratings | Movie Release Date | Movie Rating | Number of Movie Ratings |
|---|---|---|---|---|---|---|---|
| The Lion, The Witch, and the Wardrobe | 1950 | 2 | 4.7 | 18015 | 2005.0 | 6.9 | 423000.0 |
| Prince Caspian | 1951 | 4 | 4.6 | 3813 | 2008.0 | 6.5 | 223000.0 |
| The Voyage of the Dawn Treader | 1952 | 5 | 4.7 | 3177 | 2011.0 | 6.3 | 165000.0 |
| The Silver Chair | 1953 | 6 | 4.6 | 3133 | NaN | NaN | NaN |
| The Horse and His Boy | 1954 | 3 | 4.6 | 4593 | NaN | NaN | NaN |
| The Magician's Nephew | 1955 | 1 | 4.6 | 9561 | NaN | NaN | NaN |
| The Last Battle | 1956 | 7 | 4.7 | 3120 | NaN | NaN | NaN |
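Assigning to `.columns` replaces every name at once, so the list must have one entry per column, in order. If only one or two names need to change, the `.rename()` method with a dictionary is an alternative. A minimal sketch on a small table (the names `books` and `renamed` are ours):

```python
import pandas as pd

books = pd.DataFrame({"publish_date": [1950, 1951], "order": [2, 4]},
                     index=["The Lion, The Witch, and the Wardrobe", "Prince Caspian"])

# Map old name -> new name; columns not mentioned keep their names.
renamed = books.rename(columns={"order": "Chronological Order"})

print(list(renamed.columns))
print(list(books.columns))  # rename returned a new DataFrame; books is unchanged
```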


In [128]:
narnia['Published']

title
The Lion, The Witch, and the Wardrobe    1950
Prince Caspian                           1951
The Voyage of the Dawn Treader           1952
The Silver Chair                         1953
The Horse and His Boy                    1954
The Magician's Nephew                    1955
The Last Battle                          1956
Name: Published, dtype: int64

In [129]:
narnia.loc['Prince Caspian']

Published                    1951.0
Chronological Order             4.0
Book Rating                     4.6
Number of Book Ratings       3813.0
Movie Release Date           2008.0
Movie Rating                    6.5
Number of Movie Ratings    223000.0
Name: Prince Caspian, dtype: float64
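Notice that the row above shows `1951.0` even though `Published` holds integers. A row mixes integer and float columns, so pandas returns it as a single float64 Series. To pull out one cell with its original type, we can give `.loc` both the row and the column. A small sketch on a two-column table (the names `books`, `row`, and `published` are ours):

```python
import numpy as np
import pandas as pd

books = pd.DataFrame({"Published": [1950, 1951], "Movie Rating": [6.9, 6.5]},
                     index=["The Lion, The Witch, and the Wardrobe", "Prince Caspian"])

# A whole row: int and float values are stored together, so the row is float64.
row = books.loc["Prince Caspian"]
print(row.dtype)

# A single cell: row label first, then column label; the integer type is kept.
published = books.loc["Prince Caspian", "Published"]
print(published)
```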

As we said earlier, we can do regular calculations for each series (or column of the DataFrame).

In [104]:
narnia['Movie Rating'].mean()

6.566666666666666
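Only three of the seven books have a movie rating, and the mean above is the average of those three. By default, pandas skips `NaN` values in calculations like this. A minimal sketch of that behavior (the variable names are ours):

```python
import numpy as np
import pandas as pd

# Same movie ratings as in the data dictionary above.
movie_rating = pd.Series([6.9, 6.5, 6.3, np.nan, np.nan, np.nan, np.nan])

mean_default = movie_rating.mean()             # NaN values are skipped
mean_strict = movie_rating.mean(skipna=False)  # any NaN makes the result NaN
n_valid = movie_rating.count()                 # number of non-missing values

print(mean_default, mean_strict, n_valid)
```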

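Much of this basic information is collected in two methods: `.info()` lists each column's data type and number of non-null values, and `.describe()` reports summary statistics (count, mean, standard deviation, and quartiles) for each numeric column.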

In [130]:
narnia.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, The Lion, The Witch, and the Wardrobe to The Last Battle
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Published                7 non-null      int64  
 1   Chronological Order      7 non-null      int64  
 2   Book Rating              7 non-null      float64
 3   Number of Book Ratings   7 non-null      int64  
 4   Movie Release Date       3 non-null      float64
 5   Movie Rating             3 non-null      float64
 6   Number of Movie Ratings  3 non-null      float64
dtypes: float64(4), int64(3)
memory usage: 748.0+ bytes


In [131]:
narnia.describe()

| | Published | Chronological Order | Book Rating | Number of Book Ratings | Movie Release Date | Movie Rating | Number of Movie Ratings |
|---|---|---|---|---|---|---|---|
| count | 7.0 | 7.0 | 7.0 | 7.0 | 3.0 | 3.0 | 3.0 |
| mean | 1953.0 | 4.0 | 4.642857 | 6487.428571 | 2008.0 | 6.566667 | 270333.333333 |
| std | 2.160247 | 2.160247 | 0.053452 | 5577.094161 | 3.0 | 0.305505 | 135356.319887 |
| min | 1950.0 | 1.0 | 4.6 | 3120.0 | 2005.0 | 6.3 | 165000.0 |
| 25% | 1951.5 | 2.5 | 4.6 | 3155.0 | 2006.5 | 6.4 | 194000.0 |
| 50% | 1953.0 | 4.0 | 4.6 | 3813.0 | 2008.0 | 6.5 | 223000.0 |
| 75% | 1954.5 | 5.5 | 4.7 | 7077.0 | 2009.5 | 6.7 | 323000.0 |
| max | 1956.0 | 7.0 | 4.7 | 18015.0 | 2011.0 | 6.9 | 423000.0 |


Because the data are in a table, we can also do calculations across different variables, such as a correlation.

In [132]:
narnia.corr()

| | Published | Chronological Order | Book Rating | Number of Book Ratings | Movie Release Date | Movie Rating | Number of Movie Ratings |
|---|---|---|---|---|---|---|---|
| Published | 1.0 | 0.25 | -0.1443376 | -0.439537 | 1.0 | -0.981981 | -0.95304 |
| Chronological Order | 0.25 | 1.0 | 0.2886751 | -0.698642 | 0.9819805 | -1.0 | -0.993099 |
| Book Rating | -0.144338 | 0.288675 | 1.0 | 0.271138 | -3.845925e-15 | 0.188982 | 0.302844 |
| Number of Book Ratings | -0.439537 | -0.698642 | 0.2711383 | 1.0 | -0.884356 | 0.95664 | 0.984198 |
| Movie Release Date | 1.0 | 0.981981 | -3.845925e-15 | -0.884356 | 1.0 | -0.981981 | -0.95304 |
| Movie Rating | -0.981981 | -1.0 | 0.1889822 | 0.95664 | -0.9819805 | 1.0 | 0.993099 |
| Number of Movie Ratings | -0.95304 | -0.993099 | 0.3028441 | 0.984198 | -0.9530401 | 0.993099 | 1.0 |
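Some of these values deserve caution. `.corr()` leaves out missing values pair by pair, so every correlation that involves a movie column is computed from only the three books that have a movie. A sketch with two of the original columns (the names `df`, `r`, and `pairs_used` are ours):

```python
import numpy as np
import pandas as pd

# 'order' and 'movie_rating' from the data dictionary at the top of the lecture.
df = pd.DataFrame({
    "order": [2, 4, 5, 6, 3, 1, 7],
    "movie_rating": [6.9, 6.5, 6.3, np.nan, np.nan, np.nan, np.nan],
})

# Series.corr drops any pair in which either value is missing.
r = df["order"].corr(df["movie_rating"])

# Number of complete pairs that the correlation is actually based on.
pairs_used = df.dropna().shape[0]

print(r, pairs_used)
```

With so few points, a correlation of -1.0 says very little about the series as a whole.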


#### Sorting and Filtering

In [105]:
narnia.sort_values(by='Chronological Order')

| title | Published | Chronological Order | Book Rating | Number of Book Ratings | Movie Release Date | Movie Rating | Number of Movie Ratings |
|---|---|---|---|---|---|---|---|
| The Magician's Nephew | 1955 | 1 | 4.6 | 9561 | NaN | NaN | NaN |
| The Lion, The Witch, and the Wardrobe | 1950 | 2 | 4.7 | 18015 | 2005.0 | 6.9 | 423000.0 |
| The Horse and His Boy | 1954 | 3 | 4.6 | 4593 | NaN | NaN | NaN |
| Prince Caspian | 1951 | 4 | 4.6 | 3813 | 2008.0 | 6.5 | 223000.0 |
| The Voyage of the Dawn Treader | 1952 | 5 | 4.7 | 3177 | 2011.0 | 6.3 | 165000.0 |
| The Silver Chair | 1953 | 6 | 4.6 | 3133 | NaN | NaN | NaN |
| The Last Battle | 1956 | 7 | 4.7 | 3120 | NaN | NaN | NaN |


In [106]:
narnia[narnia['Number of Book Ratings'] > 5000]

| title | Published | Chronological Order | Book Rating | Number of Book Ratings | Movie Release Date | Movie Rating | Number of Movie Ratings |
|---|---|---|---|---|---|---|---|
| The Lion, The Witch, and the Wardrobe | 1950 | 2 | 4.7 | 18015 | 2005.0 | 6.9 | 423000.0 |
| The Magician's Nephew | 1955 | 1 | 4.6 | 9561 | NaN | NaN | NaN |
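Sorting and filtering can be chained. As a sketch (the name `top3` is ours), sorting in descending order with `ascending=False` and then taking `.head(3)` gives the three most-rated books.

```python
import pandas as pd

# One column of the table above, rebuilt so this cell runs on its own.
books = pd.DataFrame(
    {"Number of Book Ratings": [18015, 3813, 3177, 3133, 4593, 9561, 3120]},
    index=["The Lion, The Witch, and the Wardrobe", "Prince Caspian",
           "The Voyage of the Dawn Treader", "The Silver Chair",
           "The Horse and His Boy", "The Magician's Nephew", "The Last Battle"],
)

# Largest values first, then keep the first three rows.
top3 = books.sort_values(by="Number of Book Ratings", ascending=False).head(3)
print(top3)
```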


#### Dropping Data
Sometimes, it will be useful to drop certain rows or columns. 
* To drop a row, include `axis=0` as an argument
* To drop a column, include `axis=1` as an argument

In [107]:
# Select only rows where a movie has been made.
narnia.drop(['The Silver Chair','The Horse and His Boy',"The Magician's Nephew",'The Last Battle'], axis=0)

| title | Published | Chronological Order | Book Rating | Number of Book Ratings | Movie Release Date | Movie Rating | Number of Movie Ratings |
|---|---|---|---|---|---|---|---|
| The Lion, The Witch, and the Wardrobe | 1950 | 2 | 4.7 | 18015 | 2005.0 | 6.9 | 423000.0 |
| Prince Caspian | 1951 | 4 | 4.6 | 3813 | 2008.0 | 6.5 | 223000.0 |
| The Voyage of the Dawn Treader | 1952 | 5 | 4.7 | 3177 | 2011.0 | 6.3 | 165000.0 |


In [108]:
# If we don't need the Movie Release Date, drop that column.
narnia.drop('Movie Release Date', axis=1)

| title | Published | Chronological Order | Book Rating | Number of Book Ratings | Movie Rating | Number of Movie Ratings |
|---|---|---|---|---|---|---|
| The Lion, The Witch, and the Wardrobe | 1950 | 2 | 4.7 | 18015 | 6.9 | 423000.0 |
| Prince Caspian | 1951 | 4 | 4.6 | 3813 | 6.5 | 223000.0 |
| The Voyage of the Dawn Treader | 1952 | 5 | 4.7 | 3177 | 6.3 | 165000.0 |
| The Silver Chair | 1953 | 6 | 4.6 | 3133 | NaN | NaN |
| The Horse and His Boy | 1954 | 3 | 4.6 | 4593 | NaN | NaN |
| The Magician's Nephew | 1955 | 1 | 4.6 | 9561 | NaN | NaN |
| The Last Battle | 1956 | 7 | 4.7 | 3120 | NaN | NaN |


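Notice that when we dropped the rows, it wasn't permanent: all seven books are back in the second output. `.drop()` returns a new DataFrame and leaves `narnia` unchanged. If we want the drop to be permanent, we add the `inplace=True` argument.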
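Listing row labels by hand works for seven books but not for large tables. As a sketch of an alternative (the names `books`, `with_movies`, and `no_movie_col` are ours), `.dropna()` removes the rows that are missing a value in a chosen column, and `.drop(columns=...)` is another way to write `axis=1`.

```python
import numpy as np
import pandas as pd

books = pd.DataFrame(
    {"Book Rating": [4.7, 4.6, 4.6], "Movie Rating": [6.9, 6.5, np.nan]},
    index=["The Lion, The Witch, and the Wardrobe", "Prince Caspian", "The Silver Chair"],
)

# Keep only rows where 'Movie Rating' is present.
with_movies = books.dropna(subset=["Movie Rating"])

# Same effect as books.drop('Movie Rating', axis=1).
no_movie_col = books.drop(columns="Movie Rating")

print(with_movies)
print(no_movie_col)
print(books.shape)  # the original table is unchanged
```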

#### Using Functions on a DataFrame
Another thing we can do with a DataFrame is apply a function to every column at once. We do this using the `.apply()` method, which by default passes each column to the function as a Series.

In [109]:
def rating_difference(x):
    return x-x.mean()

narnia.apply(rating_difference)

| title | Published | Chronological Order | Book Rating | Number of Book Ratings | Movie Release Date | Movie Rating | Number of Movie Ratings |
|---|---|---|---|---|---|---|---|
| The Lion, The Witch, and the Wardrobe | -3.0 | -2.0 | 0.057143 | 11527.571429 | -3.0 | 0.333333 | 152666.666667 |
| Prince Caspian | -2.0 | 0.0 | -0.042857 | -2674.428571 | 0.0 | -0.066667 | -47333.333333 |
| The Voyage of the Dawn Treader | -1.0 | 1.0 | 0.057143 | -3310.428571 | 3.0 | -0.266667 | -105333.333333 |
| The Silver Chair | 0.0 | 2.0 | -0.042857 | -3354.428571 | NaN | NaN | NaN |
| The Horse and His Boy | 1.0 | -1.0 | -0.042857 | -1894.428571 | NaN | NaN | NaN |
| The Magician's Nephew | 2.0 | -3.0 | -0.042857 | 3073.571429 | NaN | NaN | NaN |
| The Last Battle | 3.0 | 3.0 | 0.057143 | -3367.428571 | NaN | NaN | NaN |


The same operation can be written more compactly by passing a `lambda` to the `.apply()` method:

In [119]:
narnia.apply(lambda x: x-x.mean())

| title | Published | Chronological Order | Book Rating | Number of Book Ratings | Movie Release Date | Movie Rating | Number of Movie Ratings |
|---|---|---|---|---|---|---|---|
| The Lion, The Witch, and the Wardrobe | -3.0 | -2.0 | 0.057143 | 11527.571429 | -3.0 | 0.333333 | 152666.666667 |
| Prince Caspian | -2.0 | 0.0 | -0.042857 | -2674.428571 | 0.0 | -0.066667 | -47333.333333 |
| The Voyage of the Dawn Treader | -1.0 | 1.0 | 0.057143 | -3310.428571 | 3.0 | -0.266667 | -105333.333333 |
| The Silver Chair | 0.0 | 2.0 | -0.042857 | -3354.428571 | NaN | NaN | NaN |
| The Horse and His Boy | 1.0 | -1.0 | -0.042857 | -1894.428571 | NaN | NaN | NaN |
| The Magician's Nephew | 2.0 | -3.0 | -0.042857 | 3073.571429 | NaN | NaN | NaN |
| The Last Battle | 3.0 | 3.0 | 0.057143 | -3367.428571 | NaN | NaN | NaN |


In [120]:
narnia.apply(rating_difference)['Book Rating']

title
The Lion, The Witch, and the Wardrobe    0.057143
Prince Caspian                          -0.042857
The Voyage of the Dawn Treader           0.057143
The Silver Chair                        -0.042857
The Horse and His Boy                   -0.042857
The Magician's Nephew                   -0.042857
The Last Battle                          0.057143
Name: Book Rating, dtype: float64
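When only one column is needed, there is no need to transform the whole table first. Since `rating_difference` expects a Series, we can hand it the column directly. A sketch on a three-row table (the names `books`, `direct`, and `via_apply` are ours):

```python
import numpy as np
import pandas as pd

def rating_difference(x):
    # Difference between each value and the mean of its column.
    return x - x.mean()

books = pd.DataFrame({"Book Rating": [4.7, 4.6, 4.7],
                      "Number of Book Ratings": [18015, 3813, 3177]})

# Call the function on one column ...
direct = rating_difference(books["Book Rating"])

# ... or apply it to every column and then select one.
via_apply = books.apply(rating_difference)["Book Rating"]

print(np.allclose(direct, via_apply))
```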

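The `.apply()` method can also work row by row. With `axis="columns"`, the function receives one row at a time, so it can combine values from different columns.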

In [127]:
narnia.apply(lambda x: x['Movie Rating'] - x['Book Rating'], axis="columns")

title
The Lion, The Witch, and the Wardrobe    2.2
Prince Caspian                           1.9
The Voyage of the Dawn Treader           1.6
The Silver Chair                         NaN
The Horse and His Boy                    NaN
The Magician's Nephew                    NaN
The Last Battle                          NaN
dtype: float64
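For simple arithmetic like this, subtracting one column from another gives the same result without `.apply()`. A sketch comparing the two on a three-row table (the names `ratings`, `row_wise`, and `vectorized` are ours):

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame(
    {"Book Rating": [4.7, 4.6, 4.6], "Movie Rating": [6.9, 6.5, np.nan]},
    index=["The Lion, The Witch, and the Wardrobe", "Prince Caspian", "The Silver Chair"],
)

# Row by row, as in the cell above.
row_wise = ratings.apply(lambda row: row["Movie Rating"] - row["Book Rating"],
                         axis="columns")

# Whole columns at once; missing values stay NaN.
vectorized = ratings["Movie Rating"] - ratings["Book Rating"]

print(vectorized)
```

Row-wise `.apply()` remains useful when the calculation cannot be written as column arithmetic.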

-----
## Other functions that are useful to know
There are many other methods available to Series and DataFrames. We won't go over all of them here, but here is a list of helpful methods we will use:
* `.unique()`
* `.value_counts()`
* `.sort_index()`

These methods are described in Chapter 5 of the textbook. We will discuss them as needed throughout the semester.
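As a quick preview, here is a minimal sketch of the three methods on the book ratings (the names `distinct`, `counts`, and `by_value` are ours):

```python
import pandas as pd

# Same book ratings as in the data dictionary at the top of the lecture.
book_rating = pd.Series([4.7, 4.6, 4.7, 4.6, 4.6, 4.6, 4.7])

distinct = book_rating.unique()      # distinct values, in order of first appearance
counts = book_rating.value_counts()  # how often each value occurs, most frequent first
by_value = counts.sort_index()       # the same counts, ordered by rating instead

print(distinct)
print(counts)
print(by_value)
```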