# Lecture 6: Pandas and Obtaining Data from a File
__Math 3080: Fundamentals of Data Science__

Reading:
* [McKinney, *Python for Data Analysis*, Chapter 5](https://wesmckinney.com/book/pandas-basics)

Class notes are found through GitHub. As changes are made, they will automatically be uploaded to GitHub. A link to the repository is on Canvas.

-----
## Outline
* Pandas
  * Series and DataFrame objects
  * Setting Index
  * Setting Columns
  * Sorting
  * Statistics
    * `.info()` and `.describe()`
    * Mean and Standard Deviation
  * Dropping rows and columns
  * Subsetting
  * Applying a function to a column
* Loading Data from a File
-----

## Pandas

Pandas is short for "*Pan*el *Da*ta". It was created in 2008 by Wes McKinney (the author of our textbook) and has become a staple of data science in Python.

Pandas has two main data structures: the *Series* and the *DataFrame*. A Series is much like a one-dimensional NumPy array, but each value also carries an index label, which adds a lot of functionality. To see this, let's take this dataset of books from "The Chronicles of Narnia" by C.S. Lewis.

In [1]:
import pandas as pd
import numpy as np

data = {'title': ["The Lion, The Witch, and the Wardrobe", "Prince Caspian","The Voyage of the Dawn Treader",
                  "The Silver Chair","The Horse and His Boy","The Magician's Nephew","The Last Battle"],
        'publish_date': [1950,1951,1952,1953,1954,1955,1956],
        'order': [2, 4, 5, 6, 3, 1, 7],
        'book_rating': [4.7, 4.6, 4.7, 4.6, 4.6, 4.6, 4.7],
        'book_num_ratings': [18015, 3813, 3177, 3133, 4593, 9561, 3120],
        'movie_date': [2005, 2008, 2011, np.nan, np.nan, np.nan, np.nan],
        'movie_rating': [6.9, 6.5, 6.3, np.nan, np.nan, np.nan, np.nan],
        'movie_num_ratings': [423000, 223000, 165000, np.nan, np.nan, np.nan, np.nan]
        }


### Series

To create a series:

In [2]:
book_ratings = pd.Series(data['book_rating'])
book_ratings

0    4.7
1    4.6
2    4.7
3    4.6
4    4.6
5    4.6
6    4.7
dtype: float64

At first, this looks just like an array. We can do many of the same calculations as we can with an array.

In [3]:
book_ratings.mean()

np.float64(4.642857142857144)

In [4]:
book_ratings.std()

np.float64(0.05345224838248516)

However, a series also allows us to add information to help us with our data processing. To start, let's change the index.

In [5]:
narnia_ratings = pd.Series(data['book_rating'], index=data['title'])
narnia_ratings

The Lion, The Witch, and the Wardrobe    4.7
Prince Caspian                           4.6
The Voyage of the Dawn Treader           4.7
The Silver Chair                         4.6
The Horse and His Boy                    4.6
The Magician's Nephew                    4.6
The Last Battle                          4.7
dtype: float64

Now, we can reference an element simply by its title.

In [6]:
narnia_ratings['Prince Caspian']

np.float64(4.6)
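As a small illustration of label-based access, the sketch below rebuilds a shortened copy of the ratings series (three titles only, so it runs on its own) and contrasts `.loc`, which looks up by label, with `.iloc`, which looks up by position. The variable names `ratings`, `by_label`, and `by_position` are made up for this example.

```python
import pandas as pd

# Shortened copy of the ratings series, so this cell runs on its own
ratings = pd.Series([4.7, 4.6, 4.7],
                    index=["The Lion, The Witch, and the Wardrobe",
                           "Prince Caspian",
                           "The Voyage of the Dawn Treader"])

by_label = ratings.loc["Prince Caspian"]   # look up by index label
by_position = ratings.iloc[1]              # look up by integer position

print(by_label, by_position)
```

Both lines reach the same element; `.loc` is usually easier to read when the index carries meaningful labels.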

Let's also say that we only want the books with more than 5000 ratings. We can easily filter the books.

In [7]:
narnia_num_ratings = pd.Series(data['book_num_ratings'], index=data['title'])
narnia_num_ratings

The Lion, The Witch, and the Wardrobe    18015
Prince Caspian                            3813
The Voyage of the Dawn Treader            3177
The Silver Chair                          3133
The Horse and His Boy                     4593
The Magician's Nephew                     9561
The Last Battle                           3120
dtype: int64

In [8]:
narnia_num_ratings > 5000

The Lion, The Witch, and the Wardrobe     True
Prince Caspian                           False
The Voyage of the Dawn Treader           False
The Silver Chair                         False
The Horse and His Boy                    False
The Magician's Nephew                     True
The Last Battle                          False
dtype: bool

In [9]:
narnia_num_ratings[narnia_num_ratings > 5000]

The Lion, The Witch, and the Wardrobe    18015
The Magician's Nephew                     9561
dtype: int64
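Boolean filters like this can also be combined. The following is a minimal sketch, using a shortened copy of the counts and the hypothetical names `num_ratings` and `mid_range`, of selecting books whose number of ratings falls between two bounds.

```python
import pandas as pd

# Shortened copy of the rating counts, so this cell runs on its own
num_ratings = pd.Series([18015, 3813, 4593, 9561],
                        index=["The Lion, The Witch, and the Wardrobe",
                               "Prince Caspian",
                               "The Horse and His Boy",
                               "The Magician's Nephew"])

# Each condition needs its own parentheses; use & for "and", | for "or"
mid_range = num_ratings[(num_ratings > 4000) & (num_ratings < 10000)]
print(mid_range)
```

Python's `and` and `or` keywords do not work element by element on a Series, which is why `&` and `|` are used here.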

### DataFrames

There are a lot of things we could do with Series, but an even more useful tool is a DataFrame. A DataFrame is basically a table where each column is a Series.

In [10]:
narnia = pd.DataFrame(data)
narnia

| | title | publish_date | order | book_rating | book_num_ratings | movie_date | movie_rating | movie_num_ratings |
|---|---|---|---|---|---|---|---|---|
| 0 | The Lion, The Witch, and the Wardrobe | 1950 | 2 | 4.7 | 18015 | 2005.0 | 6.9 | 423000.0 |
| 1 | Prince Caspian | 1951 | 4 | 4.6 | 3813 | 2008.0 | 6.5 | 223000.0 |
| 2 | The Voyage of the Dawn Treader | 1952 | 5 | 4.7 | 3177 | 2011.0 | 6.3 | 165000.0 |
| 3 | The Silver Chair | 1953 | 6 | 4.6 | 3133 | NaN | NaN | NaN |
| 4 | The Horse and His Boy | 1954 | 3 | 4.6 | 4593 | NaN | NaN | NaN |
| 5 | The Magician's Nephew | 1955 | 1 | 4.6 | 9561 | NaN | NaN | NaN |
| 6 | The Last Battle | 1956 | 7 | 4.7 | 3120 | NaN | NaN | NaN |


Here are a couple of things we can do with this. Just as with the series, we can change the index.

By default, `df.set_index('column')` returns a new DataFrame and leaves the original unchanged, so the change lasts only for that one line of code unless we save the result. If we want to change the DataFrame itself, we add the `inplace=True` argument.

In [11]:
narnia.set_index('title', inplace=True)
narnia

| title | publish_date | order | book_rating | book_num_ratings | movie_date | movie_rating | movie_num_ratings |
|---|---|---|---|---|---|---|---|
| The Lion, The Witch, and the Wardrobe | 1950 | 2 | 4.7 | 18015 | 2005.0 | 6.9 | 423000.0 |
| Prince Caspian | 1951 | 4 | 4.6 | 3813 | 2008.0 | 6.5 | 223000.0 |
| The Voyage of the Dawn Treader | 1952 | 5 | 4.7 | 3177 | 2011.0 | 6.3 | 165000.0 |
| The Silver Chair | 1953 | 6 | 4.6 | 3133 | NaN | NaN | NaN |
| The Horse and His Boy | 1954 | 3 | 4.6 | 4593 | NaN | NaN | NaN |
| The Magician's Nephew | 1955 | 1 | 4.6 | 9561 | NaN | NaN | NaN |
| The Last Battle | 1956 | 7 | 4.7 | 3120 | NaN | NaN | NaN |
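To see the difference between a temporary and a saved index change, here is a small sketch on a two-row stand-in table. The names `books`, `indexed`, and `restored` are made up for this example; `reset_index` is the pandas method that moves an index back into an ordinary column.

```python
import pandas as pd

# Two-row stand-in for the narnia table
books = pd.DataFrame({"title": ["Prince Caspian", "The Silver Chair"],
                      "publish_date": [1951, 1953]})

# Without inplace=True, set_index returns a new DataFrame
indexed = books.set_index("title")

print(books.columns.tolist())    # the original still has 'title' as a column
print(indexed.index.tolist())    # the new DataFrame uses the titles as its index

# reset_index moves the index back into an ordinary column
restored = indexed.reset_index()
print(restored.columns.tolist())
```

Saving the result under a new name, as done here, is an alternative to `inplace=True` that keeps the original table available.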


Notice the column names. These names are just fine. But if we want to rename them, we use the `.columns` attribute.

In [12]:
narnia.columns

Index(['publish_date', 'order', 'book_rating', 'book_num_ratings',
       'movie_date', 'movie_rating', 'movie_num_ratings'],
      dtype='object')

In [13]:
narnia.columns = ['Published', 'Chronological Order', 'Book Rating', 'Number of Book Ratings',
                  'Movie Release Date', 'Movie Rating', 'Number of Movie Ratings']
narnia

| title | Published | Chronological Order | Book Rating | Number of Book Ratings | Movie Release Date | Movie Rating | Number of Movie Ratings |
|---|---|---|---|---|---|---|---|
| The Lion, The Witch, and the Wardrobe | 1950 | 2 | 4.7 | 18015 | 2005.0 | 6.9 | 423000.0 |
| Prince Caspian | 1951 | 4 | 4.6 | 3813 | 2008.0 | 6.5 | 223000.0 |
| The Voyage of the Dawn Treader | 1952 | 5 | 4.7 | 3177 | 2011.0 | 6.3 | 165000.0 |
| The Silver Chair | 1953 | 6 | 4.6 | 3133 | NaN | NaN | NaN |
| The Horse and His Boy | 1954 | 3 | 4.6 | 4593 | NaN | NaN | NaN |
| The Magician's Nephew | 1955 | 1 | 4.6 | 9561 | NaN | NaN | NaN |
| The Last Battle | 1956 | 7 | 4.7 | 3120 | NaN | NaN | NaN |
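Assigning to `.columns` replaces every name at once, so the new list has to be in the right order. When only a few names need to change, the `.rename()` method with a dictionary is one alternative. The sketch below uses a small stand-in table with the hypothetical names `books` and `renamed`.

```python
import pandas as pd

# Small stand-in table with the original column names
books = pd.DataFrame({"publish_date": [1950, 1951],
                      "order": [2, 4],
                      "book_rating": [4.7, 4.6]})

# Rename only the columns listed in the mapping; the others keep their names
renamed = books.rename(columns={"publish_date": "Published",
                                "order": "Chronological Order"})
print(renamed.columns.tolist())
```

Like `set_index`, `rename` returns a new DataFrame and leaves `books` as it was.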


In [14]:
narnia['Published']

title
The Lion, The Witch, and the Wardrobe    1950
Prince Caspian                           1951
The Voyage of the Dawn Treader           1952
The Silver Chair                         1953
The Horse and His Boy                    1954
The Magician's Nephew                    1955
The Last Battle                          1956
Name: Published, dtype: int64

In [15]:
narnia.loc['Prince Caspian']

Published                    1951.0
Chronological Order             4.0
Book Rating                     4.6
Number of Book Ratings       3813.0
Movie Release Date           2008.0
Movie Rating                    6.5
Number of Movie Ratings    223000.0
Name: Prince Caspian, dtype: float64
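Notice that the row comes back as a Series with a single dtype, which is why the integer columns print as `1951.0` and `4.0`. The sketch below, on a three-row stand-in table with the hypothetical names `books`, `published`, and `subset`, shows that `.loc` also accepts a column label, so a single cell or a smaller table can be picked out directly.

```python
import pandas as pd
import numpy as np

# Three-row stand-in for the narnia table
books = pd.DataFrame({"Published": [1950, 1951, 1953],
                      "Movie Rating": [6.9, 6.5, np.nan]},
                     index=["The Lion, The Witch, and the Wardrobe",
                            "Prince Caspian",
                            "The Silver Chair"])

# One cell: row label first, then column label
published = books.loc["Prince Caspian", "Published"]

# Lists of labels select several rows and columns and return a DataFrame
subset = books.loc[["Prince Caspian", "The Silver Chair"], ["Published"]]

print(published)
print(subset)
```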

As we said earlier, we can do regular calculations for each series (or column of the DataFrame).

In [16]:
narnia['Movie Rating'].mean()

np.float64(6.566666666666666)
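Only three of the seven books have a movie rating, and the mean above is computed from those three values because pandas skips missing values by default. A minimal sketch of that behavior, using a shortened stand-in series named `movie_rating` (a name chosen for this example):

```python
import pandas as pd
import numpy as np

# Stand-in for the Movie Rating column, including missing values
movie_rating = pd.Series([6.9, 6.5, 6.3, np.nan, np.nan])

n_present = movie_rating.count()                # number of non-missing values
mean_default = movie_rating.mean()              # NaN values are skipped by default
mean_strict = movie_rating.mean(skipna=False)   # NaN if any value is missing

print(n_present, mean_default, mean_strict)
```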

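Many of the basic summaries we would normally calculate are collected in two methods: `.info()` lists each column's data type and number of non-null values, and `.describe()` reports the count, mean, standard deviation, and quartiles of each numeric column.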

In [17]:
narnia.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, The Lion, The Witch, and the Wardrobe to The Last Battle
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Published                7 non-null      int64  
 1   Chronological Order      7 non-null      int64  
 2   Book Rating              7 non-null      float64
 3   Number of Book Ratings   7 non-null      int64  
 4   Movie Release Date       3 non-null      float64
 5   Movie Rating             3 non-null      float64
 6   Number of Movie Ratings  3 non-null      float64
dtypes: float64(4), int64(3)
memory usage: 748.0+ bytes


In [18]:
narnia.describe()

| | Published | Chronological Order | Book Rating | Number of Book Ratings | Movie Release Date | Movie Rating | Number of Movie Ratings |
|---|---|---|---|---|---|---|---|
| count | 7.0 | 7.0 | 7.0 | 7.0 | 3.0 | 3.0 | 3.0 |
| mean | 1953.0 | 4.0 | 4.642857 | 6487.428571 | 2008.0 | 6.566667 | 270333.333333 |
| std | 2.160247 | 2.160247 | 0.053452 | 5577.094161 | 3.0 | 0.305505 | 135356.319887 |
| min | 1950.0 | 1.0 | 4.6 | 3120.0 | 2005.0 | 6.3 | 165000.0 |
| 25% | 1951.5 | 2.5 | 4.6 | 3155.0 | 2006.5 | 6.4 | 194000.0 |
| 50% | 1953.0 | 4.0 | 4.6 | 3813.0 | 2008.0 | 6.5 | 223000.0 |
| 75% | 1954.5 | 5.5 | 4.7 | 7077.0 | 2009.5 | 6.7 | 323000.0 |
| max | 1956.0 | 7.0 | 4.7 | 18015.0 | 2011.0 | 6.9 | 423000.0 |


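Because the data are in a table, we can also do calculations across different variables, such as a correlation. Keep in mind that the movie columns have only three non-missing values, so correlations involving them (including the values of exactly 1 and -1) are based on just three books.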

In [19]:
narnia.corr()

| | Published | Chronological Order | Book Rating | Number of Book Ratings | Movie Release Date | Movie Rating | Number of Movie Ratings |
|---|---|---|---|---|---|---|---|
| Published | 1.0 | 0.25 | -0.1443376 | -0.439537 | 1.0 | -0.981981 | -0.95304 |
| Chronological Order | 0.25 | 1.0 | 0.2886751 | -0.698642 | 0.9819805 | -1.0 | -0.993099 |
| Book Rating | -0.144338 | 0.288675 | 1.0 | 0.271138 | -3.845925e-15 | 0.188982 | 0.302844 |
| Number of Book Ratings | -0.439537 | -0.698642 | 0.2711383 | 1.0 | -0.884356 | 0.95664 | 0.984198 |
| Movie Release Date | 1.0 | 0.981981 | -3.845925e-15 | -0.884356 | 1.0 | -0.981981 | -0.95304 |
| Movie Rating | -0.981981 | -1.0 | 0.1889822 | 0.95664 | -0.9819805 | 1.0 | 0.993099 |
| Number of Movie Ratings | -0.95304 | -0.993099 | 0.3028441 | 0.984198 | -0.9530401 | 0.993099 | 1.0 |
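A single entry of this matrix can also be computed with the `.corr()` method of a Series. The sketch below uses a four-row stand-in table (the names `books`, `r`, and `n_used` are made up for this example) and also counts how many rows actually enter the calculation once missing values are dropped.

```python
import pandas as pd
import numpy as np

# Four-row stand-in: one book has no movie rating
books = pd.DataFrame({"Published": [1950, 1951, 1952, 1953],
                      "Movie Rating": [6.9, 6.5, 6.3, np.nan]})

# Series.corr ignores rows where either value is missing
r = books["Published"].corr(books["Movie Rating"])

# Number of complete rows available for the calculation
n_used = books[["Published", "Movie Rating"]].dropna().shape[0]

print(round(r, 6), n_used)
```

With so few complete rows, a correlation near -1 says very little on its own, which is worth remembering when reading the matrix.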


#### Sorting and Filtering

In [20]:
narnia.sort_values(by='Chronological Order')

| title | Published | Chronological Order | Book Rating | Number of Book Ratings | Movie Release Date | Movie Rating | Number of Movie Ratings |
|---|---|---|---|---|---|---|---|
| The Magician's Nephew | 1955 | 1 | 4.6 | 9561 | NaN | NaN | NaN |
| The Lion, The Witch, and the Wardrobe | 1950 | 2 | 4.7 | 18015 | 2005.0 | 6.9 | 423000.0 |
| The Horse and His Boy | 1954 | 3 | 4.6 | 4593 | NaN | NaN | NaN |
| Prince Caspian | 1951 | 4 | 4.6 | 3813 | 2008.0 | 6.5 | 223000.0 |
| The Voyage of the Dawn Treader | 1952 | 5 | 4.7 | 3177 | 2011.0 | 6.3 | 165000.0 |
| The Silver Chair | 1953 | 6 | 4.6 | 3133 | NaN | NaN | NaN |
| The Last Battle | 1956 | 7 | 4.7 | 3120 | NaN | NaN | NaN |
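`sort_values` also accepts a list of columns and an `ascending` argument. As a sketch, the stand-in table below (hypothetical names `books` and `ranked`) is sorted from highest to lowest book rating, with ties broken by the number of ratings.

```python
import pandas as pd

# Four-row stand-in for the narnia table
books = pd.DataFrame({"Book Rating": [4.7, 4.6, 4.7, 4.6],
                      "Number of Book Ratings": [18015, 3813, 3177, 9561]},
                     index=["The Lion, The Witch, and the Wardrobe",
                            "Prince Caspian",
                            "The Voyage of the Dawn Treader",
                            "The Magician's Nephew"])

# Highest rating first; ties are broken by the number of ratings
ranked = books.sort_values(by=["Book Rating", "Number of Book Ratings"],
                           ascending=False)
print(ranked)
```

As with the other methods shown so far, the sorted table is a new DataFrame; `books` keeps its original row order.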


In [21]:
narnia[narnia['Number of Book Ratings'] > 5000]

| title | Published | Chronological Order | Book Rating | Number of Book Ratings | Movie Release Date | Movie Rating | Number of Movie Ratings |
|---|---|---|---|---|---|---|---|
| The Lion, The Witch, and the Wardrobe | 1950 | 2 | 4.7 | 18015 | 2005.0 | 6.9 | 423000.0 |
| The Magician's Nephew | 1955 | 1 | 4.6 | 9561 | NaN | NaN | NaN |


#### Dropping Data
Sometimes it is useful to drop certain rows or columns with the `.drop()` method.
* To drop rows, include `axis=0` as an argument
* To drop columns, include `axis=1` as an argument

In [22]:
# Select only rows where a movie has been made.
narnia.drop(['The Silver Chair','The Horse and His Boy',"The Magician's Nephew",'The Last Battle'], axis=0)

| title | Published | Chronological Order | Book Rating | Number of Book Ratings | Movie Release Date | Movie Rating | Number of Movie Ratings |
|---|---|---|---|---|---|---|---|
| The Lion, The Witch, and the Wardrobe | 1950 | 2 | 4.7 | 18015 | 2005.0 | 6.9 | 423000.0 |
| Prince Caspian | 1951 | 4 | 4.6 | 3813 | 2008.0 | 6.5 | 223000.0 |
| The Voyage of the Dawn Treader | 1952 | 5 | 4.7 | 3177 | 2011.0 | 6.3 | 165000.0 |


In [23]:
# Drop the Movie Release Date column if we don't need it.
narnia.drop('Movie Release Date', axis=1)

| title | Published | Chronological Order | Book Rating | Number of Book Ratings | Movie Rating | Number of Movie Ratings |
|---|---|---|---|---|---|---|
| The Lion, The Witch, and the Wardrobe | 1950 | 2 | 4.7 | 18015 | 6.9 | 423000.0 |
| Prince Caspian | 1951 | 4 | 4.6 | 3813 | 6.5 | 223000.0 |
| The Voyage of the Dawn Treader | 1952 | 5 | 4.7 | 3177 | 6.3 | 165000.0 |
| The Silver Chair | 1953 | 6 | 4.6 | 3133 | NaN | NaN |
| The Horse and His Boy | 1954 | 3 | 4.6 | 4593 | NaN | NaN |
| The Magician's Nephew | 1955 | 1 | 4.6 | 9561 | NaN | NaN |
| The Last Battle | 1956 | 7 | 4.7 | 3120 | NaN | NaN |


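Notice that when we dropped the rows and the column, the change wasn't permanent: `narnia` still has all of its rows and columns. If we want the drop to be permanent, we add the `inplace=True` argument or assign the result back to a variable.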
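Typing out the titles to drop works for seven books but not for a large table. One alternative, sketched below on a three-row stand-in table, is `.dropna()`, which removes rows that have missing values in the chosen columns. The names `books`, `with_movie`, and `books_no_rating` are made up for this example.

```python
import pandas as pd
import numpy as np

# Three-row stand-in: one book has no movie rating
books = pd.DataFrame({"Published": [1950, 1951, 1953],
                      "Movie Rating": [6.9, 6.5, np.nan]},
                     index=["The Lion, The Witch, and the Wardrobe",
                            "Prince Caspian",
                            "The Silver Chair"])

# Keep only the rows that have a movie rating, without typing out titles
with_movie = books.dropna(subset=["Movie Rating"])

# drop returns a new DataFrame; saving it under a name keeps the result
books_no_rating = books.drop(columns="Movie Rating")

print(with_movie.index.tolist())
print(books_no_rating.columns.tolist())
print(books.shape)   # the original table is unchanged
```

Here `columns="Movie Rating"` is an equivalent way of writing `drop('Movie Rating', axis=1)`.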

#### Using Functions on a DataFrame

Sometimes we want to do some math with the columns of our data. For example, let's compare the Book Ratings to the Movie Ratings. We immediately see a problem: Book Ratings are on a scale of 1-5 and Movie Ratings are on a scale of 1-10.

How can we solve this? There are a number of ways. Let's shift both scales so they're on a scale of 0-1 (this process is called __normalization__). We do this as follows:
$$x' = \frac{x-\min}{\max-\min}$$
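As a quick worked instance of the formula, take the movie ratings in the data above, whose minimum is 6.3 and maximum is 6.9. For *Prince Caspian* (rating 6.5):

$$x' = \frac{6.5-6.3}{6.9-6.3} = \frac{0.2}{0.6} \approx 0.333$$

The lowest-rated movie maps to 0 and the highest-rated movie maps to 1.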

In [24]:
narnia['Norm Book Rating'] = (narnia['Book Rating'] - narnia['Book Rating'].min()) / (narnia['Book Rating'].max() - narnia['Book Rating'].min())
narnia['Norm Movie Rating'] = (narnia['Movie Rating'] - narnia['Movie Rating'].min()) / (narnia['Movie Rating'].max() - narnia['Movie Rating'].min())
narnia

| title | Published | Chronological Order | Book Rating | Number of Book Ratings | Movie Release Date | Movie Rating | Number of Movie Ratings | Norm Book Rating | Norm Movie Rating |
|---|---|---|---|---|---|---|---|---|---|
| The Lion, The Witch, and the Wardrobe | 1950 | 2 | 4.7 | 18015 | 2005.0 | 6.9 | 423000.0 | 1.0 | 1.0 |
| Prince Caspian | 1951 | 4 | 4.6 | 3813 | 2008.0 | 6.5 | 223000.0 | 0.0 | 0.333333 |
| The Voyage of the Dawn Treader | 1952 | 5 | 4.7 | 3177 | 2011.0 | 6.3 | 165000.0 | 1.0 | 0.0 |
| The Silver Chair | 1953 | 6 | 4.6 | 3133 | NaN | NaN | NaN | 0.0 | NaN |
| The Horse and His Boy | 1954 | 3 | 4.6 | 4593 | NaN | NaN | NaN | 0.0 | NaN |
| The Magician's Nephew | 1955 | 1 | 4.6 | 9561 | NaN | NaN | NaN | 0.0 | NaN |
| The Last Battle | 1956 | 7 | 4.7 | 3120 | NaN | NaN | NaN | 1.0 | NaN |


Nah, I don't like that. This method works well for large datasets. However, since our 7 observations all have very similar book ratings (every one is either 4.6 or 4.7), the normalization maps each book to either 0 or 1, which creates some false implications. So, let's do something simpler: divide each variable by the maximum possible value on its scale (5 for books, 10 for movies). That will also put the two variables on the same scale.

In [25]:
narnia['Norm Book Rating'] = narnia['Book Rating'] / 5.0
narnia['Norm Movie Rating'] = narnia['Movie Rating'] / 10.0
narnia

| title | Published | Chronological Order | Book Rating | Number of Book Ratings | Movie Release Date | Movie Rating | Number of Movie Ratings | Norm Book Rating | Norm Movie Rating |
|---|---|---|---|---|---|---|---|---|---|
| The Lion, The Witch, and the Wardrobe | 1950 | 2 | 4.7 | 18015 | 2005.0 | 6.9 | 423000.0 | 0.94 | 0.69 |
| Prince Caspian | 1951 | 4 | 4.6 | 3813 | 2008.0 | 6.5 | 223000.0 | 0.92 | 0.65 |
| The Voyage of the Dawn Treader | 1952 | 5 | 4.7 | 3177 | 2011.0 | 6.3 | 165000.0 | 0.94 | 0.63 |
| The Silver Chair | 1953 | 6 | 4.6 | 3133 | NaN | NaN | NaN | 0.92 | NaN |
| The Horse and His Boy | 1954 | 3 | 4.6 | 4593 | NaN | NaN | NaN | 0.92 | NaN |
| The Magician's Nephew | 1955 | 1 | 4.6 | 9561 | NaN | NaN | NaN | 0.92 | NaN |
| The Last Battle | 1956 | 7 | 4.7 | 3120 | NaN | NaN | NaN | 0.94 | NaN |


Much better. Now we can see how the books compare to the movies.
* The books consistently had higher ratings than the movies
* The Voyage of the Dawn Treader had a higher book rating than Prince Caspian, but the order is reversed for the movie ratings

Another thing we can do with a DataFrame is apply a function to its columns. We do this using the `.apply()` method. Any function will work with this method, but short lambda functions are a convenient way to write it.

Let's say that instead of a scale of 0-1, we want to __standardize__ our ratings. That is, we rescale our numbers using the mean and standard deviation of the data.
$$x' = \frac{x-\bar{x}}{s}$$

Doing this has some advantages:
* Negative values indicate below average, while positive values indicate above average
* Each value is a z-score: how many standard deviations the original value is from the mean
* All variables will be put onto a similar scale

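As a lambda function acting on a whole column `x`, this is `lambda x: (x-x.mean())/x.std()`. When `.apply()` is called on a single Series, as in the cells below, the function receives one value at a time, so the mean and standard deviation are taken from the column by name instead.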

In [35]:
narnia['Z-Score Book Rating'] = narnia['Norm Book Rating'].apply(lambda x: (x - narnia['Norm Book Rating'].mean()) / narnia['Norm Book Rating'].std())
narnia['Z-Score Book Rating']

title
The Lion, The Witch, and the Wardrobe    1.069045
Prince Caspian                          -0.801784
The Voyage of the Dawn Treader           1.069045
The Silver Chair                        -0.801784
The Horse and His Boy                   -0.801784
The Magician's Nephew                   -0.801784
The Last Battle                          1.069045
Name: Z-Score Book Rating, dtype: float64

In [36]:
narnia['Z-Score Movie Rating'] = narnia['Norm Movie Rating'].apply(lambda x: (x - narnia['Norm Movie Rating'].mean()) / narnia['Norm Movie Rating'].std())
narnia['Z-Score Movie Rating']

title
The Lion, The Witch, and the Wardrobe    1.091089
Prince Caspian                          -0.218218
The Voyage of the Dawn Treader          -0.872872
The Silver Chair                              NaN
The Horse and His Boy                         NaN
The Magician's Nephew                         NaN
The Last Battle                               NaN
Name: Z-Score Movie Rating, dtype: float64
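The same z-scores can be obtained without computing the mean and standard deviation once per value. The sketch below is one possible alternative, not the method used in the cells above: a helper named `z_score` (a name chosen for this example) is passed to `DataFrame.apply`, which hands it each column as a whole Series. It uses the raw ratings, since dividing a column by a constant does not change its z-scores.

```python
import pandas as pd
import numpy as np

def z_score(column):
    """Standardize a Series using its own mean and sample standard deviation."""
    return (column - column.mean()) / column.std()

# Raw book and movie ratings from the narnia data
ratings = pd.DataFrame({"Book Rating": [4.7, 4.6, 4.7, 4.6, 4.6, 4.6, 4.7],
                        "Movie Rating": [6.9, 6.5, 6.3, np.nan, np.nan, np.nan, np.nan]})

# DataFrame.apply passes each column to the function as a whole Series
z = ratings.apply(z_score)
print(z.round(6))
```

Missing movie ratings stay missing, because arithmetic with `NaN` produces `NaN`.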

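The `.apply()` method can also work across columns: with `axis="columns"`, the function receives one row at a time, so it can combine values from several columns.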

In [37]:
narnia.apply(lambda x: x['Z-Score Movie Rating'] - x['Z-Score Book Rating'], axis="columns")

title
The Lion, The Witch, and the Wardrobe    0.022044
Prince Caspian                           0.583566
The Voyage of the Dawn Treader          -1.941917
The Silver Chair                              NaN
The Horse and His Boy                         NaN
The Magician's Nephew                         NaN
The Last Battle                               NaN
dtype: float64
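For simple arithmetic like this difference, subtracting the two columns directly gives the same numbers as the row-by-row `.apply()`. The sketch below checks this on a three-row stand-in table; the z-scores are copied from the outputs above, and the names `z`, `row_wise`, and `vectorized` are made up for this example.

```python
import pandas as pd
import numpy as np

# Stand-in table with z-scores copied from the outputs above
z = pd.DataFrame({"Z-Score Book Rating": [1.069045, -0.801784, -0.801784],
                  "Z-Score Movie Rating": [1.091089, -0.218218, np.nan]},
                 index=["The Lion, The Witch, and the Wardrobe",
                        "Prince Caspian",
                        "The Silver Chair"])

# Row by row: the lambda receives each row as a Series
row_wise = z.apply(lambda row: row["Z-Score Movie Rating"] - row["Z-Score Book Rating"],
                   axis="columns")

# Column arithmetic: pandas lines the values up by index label
vectorized = z["Z-Score Movie Rating"] - z["Z-Score Book Rating"]

print(vectorized)
```

Row-wise `.apply()` is most useful when the calculation cannot be written as plain column arithmetic.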

-----
## Other functions that are useful to know
There are many other methods available for Series and DataFrames. We won't go over all of them here, but here is a list of helpful methods we will use:
* `.unique()`
* `.value_counts()`
* `.sort_index()`

These methods are described in Chapter 5 of the textbook. We will discuss them as needed throughout the semester.
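As a brief preview, the sketch below tries the three methods on the book ratings from this lecture. The variable names `ratings`, `distinct`, `counts`, and `by_title` are made up for this example.

```python
import pandas as pd

# Book ratings from the narnia data, indexed by title
ratings = pd.Series([4.7, 4.6, 4.7, 4.6, 4.6, 4.6, 4.7],
                    index=["The Lion, The Witch, and the Wardrobe", "Prince Caspian",
                           "The Voyage of the Dawn Treader", "The Silver Chair",
                           "The Horse and His Boy", "The Magician's Nephew",
                           "The Last Battle"])

distinct = ratings.unique()        # distinct values, in order of first appearance
counts = ratings.value_counts()    # how often each value occurs
by_title = ratings.sort_index()    # rows reordered alphabetically by title

print(distinct)
print(counts)
print(by_title)
```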