![](https://snag.gy/h9Xwf1.jpg)

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Introduction to `pandas`

_Authors: Tim Book_

---

`pandas` is the most popular Python package for managing datasets and is used extensively by data scientists.

### Learning Objectives

- Define the anatomy of DataFrames.
- Explore data with DataFrames.
- Practice plotting with pandas.

### After this lesson, you're strongly encouraged to

- Use pandas for your spreadsheet data manipulations at work!
- Once you see and feel the power of pandas, you'll slowly phase out heavy spreadsheet use!

### Lesson Guide

- What is `pandas`?
- Reading data
- Exploring data
    - Filtering
    - Sorting
- Split-Apply-Combine _(introduce intuition from sac-visual pdf)_
- Missing Values
- Merging

<a id='introduction'></a>

### What is `pandas`?

---

- Data analysis library whose name comes from **panel data** (it doesn't actually have anything to do with the animal, sorry).
- Created by Wes McKinney and open-sourced by AQR Capital Management, LLC in 2009.
- Implemented in highly optimized Python/Cython.
- Most popular tool used to start data analysis projects within the Python scientific ecosystem.


### Pandas Use Cases

---

- Working with Tabular data
- Cleaning data / Munging (data manipulation)
- Exploratory Data Analysis (EDA)
- Structuring data for plots or tabular display
- Joining disparate sources
- Filtering, extracting, or transforming 

### Real world context and similarity with Excel
- In real-world data analytics, this is where the majority of the groundwork lies. An application-focused data professional could start here and pick up the necessary theory on the go.
- For those of us coming from an Excel spreadsheet background, this will feel directly familiar.
- As we will see later, a pandas DataFrame holds data much as it would appear on a spreadsheet: structured in rows and columns.
- So any operation we would have done in Excel, including VLOOKUPs, pivot tables, and filtering, is doable in pandas, with much more functionality and room to scale into more complex topics like machine learning.

## Importing the Dynamic Trio
From here on out, we'll begin pretty much all of our notebooks with the following three imports.

* **pandas**: The library we'll be using to do pretty much all data manipulation.
* **numpy**: The library we'll need to do various other computations. Even if you don't think you'll need it to start, you'll probably end up using it later.
* **matplotlib**: The library we'll use most for plotting. More on this another day. *(This is the most traditional Python plotting library, followed by seaborn, which is designed to create more visually appealing plots. Simpler, lower-code options such as [plotly express](https://plotly.com/python/plotly-express/) are also available and worth exploring for your work.)*

In [1]:
import numpy as np # for mathematical computations
import pandas as pd # for data processing and analysis
import matplotlib.pyplot as plt # for plotting

### Discussion: Where do you think a data scientist spends most of their time?

`/poll "Where do you think a data scientist spends most of their time?" "Moving data" "Cleaning data" "Exploring data" "Plotting data" "Predictive modeling" anonymous limit 1`

<a id='loading_csvs'></a>

### Loading a csv into a DataFrame

---

Pandas can load many types of files, but one of the most common file types for storing data is `.csv`. In industry, especially when dealing with large datasets, data is often stored as a `.csv` file instead of an Excel (`.xlsx`) file. Let's load a dataset on UFO sightings from the `./datasets` directory:

In [2]:
# read a CSV file into a pandas DataFrame (pd.read_excel does the same for Excel files; check out its arguments!)
# note: this works because the datasets folder sits in the same place as this Jupyter notebook
# use ../ in the file path to reference a file one folder up
ufo = pd.read_csv('datasets/ufo.csv')

In [3]:
type(ufo)

pandas.core.frame.DataFrame

This creates a pandas object called a **DataFrame**. These are powerful containers for data with many built-in functions to explore and manipulate data.

We will barely scratch the surface of DataFrame functionality in this lesson, but over the course of this class you will become an expert at using them.
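If the dataset file is not at hand, the loading step can still be tried on a small in-memory CSV. The sketch below is only an illustration: the rows and the names `csv_text` and `sample` are made up, and `io.StringIO` simply stands in for a file on disk.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a file on disk (made-up rows)
csv_text = """City,State,Shape Reported
Springfield,IL,DISK
Riverton,WY,LIGHT
Lakeside,CA,
"""

# pd.read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))   # a pandas DataFrame
print(sample.shape)   # (rows, columns)
print(sample.head())
```

The empty last field in the final row is read as a missing value, which is how gaps in a CSV usually show up in a DataFrame.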

In [5]:
# .head() returns the top 5 rows (note the index on the left). try verifying with a spreadsheet app!
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


If we want to see the last part of our data, we can use the ```.tail()``` method in the same way.

In [5]:
# bottom 5 rows
ufo.tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
80538,Neligh,,CIRCLE,NE,9/4/2014 23:20
80539,Uhrichsville,,LIGHT,OH,9/5/2014 1:14
80540,Tucson,RED BLUE,,AZ,9/5/2014 2:40
80541,Orland park,RED,LIGHT,IL,9/5/2014 3:43
80542,Loughman,,LIGHT,FL,9/5/2014 5:30


<a id='data_dimensions'></a>

### Data dimensions

---

It's good to check the dimensions of your data. The ```.shape``` property gives the row and column counts of your DataFrame.

In [6]:
ufo.shape

(80543, 5)

Tip: after loading a file into a DataFrame, check `.shape` and `.head()` to confirm the import worked, as below.

In [7]:
# Typical check when loading any new dataset
ufo = pd.read_csv('datasets/ufo.csv')
print(ufo.shape)
ufo.head()

(80543, 5)


Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


`/poll "In terms of rows, is this the largest dataset you've ever worked with?" "Yes" "No" "Not sure, but I'm not impressed anyway" anonymous limit 1`

You will notice that this operates the same as `.shape` for numpy arrays/matrices. Pandas makes use of numpy under the hood for optimization and speed.

Look at the names of your columns with the ```.columns``` property.

In [7]:
ufo.columns

Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time'], dtype='object')

Accessing a specific column is easy. You can use the same bracket syntax you use with Python lists and dictionaries, passing the string name of the column to extract it.

In [12]:
# Accessing columns as a series uses []
ufo['City']

0                      Ithaca
1                 Willingboro
2                     Holyoke
3                     Abilene
4        New York Worlds Fair
                 ...         
80538                  Neligh
80539            Uhrichsville
80540                  Tucson
80541             Orland park
80542                Loughman
Name: City, Length: 80543, dtype: object

You can look at a column's unique values with `.unique()` and count them with `.nunique()`.

In [10]:
ufo['Colors Reported'].unique()

array([nan, 'RED', 'GREEN', 'BLUE', 'ORANGE', 'YELLOW', 'ORANGE YELLOW',
       'RED GREEN', 'RED BLUE', 'RED ORANGE', 'RED GREEN BLUE',
       'RED YELLOW GREEN', 'RED YELLOW', 'GREEN BLUE',
       'ORANGE GREEN BLUE', 'ORANGE GREEN', 'YELLOW GREEN',
       'RED YELLOW BLUE', 'ORANGE BLUE', 'RED YELLOW GREEN BLUE',
       'YELLOW GREEN BLUE', 'RED ORANGE YELLOW', 'RED ORANGE YELLOW BLUE',
       'YELLOW BLUE', 'RED ORANGE GREEN', 'RED ORANGE BLUE',
       'ORANGE YELLOW GREEN', 'ORANGE YELLOW BLUE',
       'RED ORANGE GREEN BLUE', 'RED ORANGE YELLOW GREEN',
       'ORANGE YELLOW GREEN BLUE', 'RED ORANGE YELLOW GREEN BLUE'],
      dtype=object)

Tip: see the missing value above *(shown as `nan`)*? Spotting these is one of the data checks that should be part of any analysis.

In [11]:
ufo['Colors Reported'].nunique() # does not include nulls

31
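The difference between `.unique()` and `.nunique()` when values are missing can be seen on a small made-up Series. This is a sketch on illustrative data (the name `colors` is hypothetical), not on the UFO dataset itself.

```python
import numpy as np
import pandas as pd

colors = pd.Series(['RED', 'BLUE', np.nan, 'RED', np.nan])

print(colors.unique())               # distinct values, NaN included
print(colors.nunique())              # distinct non-null values
print(colors.nunique(dropna=False))  # counts NaN as one more value
print(colors.isna().sum())           # how many entries are missing
```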

In [12]:
ufo['City'].head() # returns first 5 rows only from 'City' column

0                  Ithaca
1             Willingboro
2                 Holyoke
3                 Abilene
4    New York Worlds Fair
Name: City, dtype: object

In [13]:
# Try to refrain from doing this...
# THREAD: Why shouldn't you rely on this? (There are several good reasons).
ufo.City.head()

0                  Ithaca
1             Willingboro
2                 Holyoke
3                 Abilene
4    New York Worlds Fair
Name: City, dtype: object

In [14]:
# one good reason to stick with ['colname'] instead of .colname --> lost in space! (column names with spaces break dot notation)
ufo.Colors Reported.head()

SyntaxError: invalid syntax (947627348.py, line 2)

In [None]:
ufo['Colors Reported'].head()
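Spaces are not the only problem with dot notation. The sketch below, on a small hypothetical DataFrame `df`, shows two more: a column whose name collides with a DataFrame method, and dot assignment, which does not create a column.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'count': [3, 5], 'City': ['A', 'B']})

# 'count' is also a DataFrame method, so dot access returns the method
print(callable(df.count))      # True: this is the method, not the column
print(df['count'].tolist())    # [3, 5]: brackets always give the column

# Dot assignment sets a plain Python attribute instead of adding a column
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # pandas emits a UserWarning here
    df.State = ['NY', 'NJ']
print('State' in df.columns)   # False

# Bracket assignment creates the column as expected
df['State'] = ['NY', 'NJ']
print('State' in df.columns)   # True
```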

Note that the ```.head()``` method also works on a single column, which is represented as a pandas Series object.

You can also access a column (as a DataFrame instead of a Series) or multiple columns with a list of strings. Even a single column needs to be passed in a list `[]`.

In [14]:
# Accessing columns as a dataframe uses [[]]. 
ufo[['City', 'State']].head()

Unnamed: 0,City,State
0,Ithaca,NY
1,Willingboro,NJ
2,Holyoke,CO
3,Abilene,KS
4,New York Worlds Fair,NY


In [16]:
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


<a id='dataframe_series'></a>

### DataFrame vs. Series

---

We've been playing with them, so I guess we should define them formally:

* A **`Series`** is a one-dimensional array of values **with an index**.
* A **`DataFrame`** is a two-dimensional array of values **with both a row and column index**.
* It turns out - each column of a `DataFrame` is actually a `Series`!

![](./assets/series-vs-df.png)

There is an important difference between using a list of strings and just a string with a column's name: a list returns another **DataFrame**, while a bare string returns a pandas **Series** object. *(We saw this in the code above.)*

In [17]:
print(type(ufo['City']))

print(type(ufo[['City']]))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


<a id='info'></a>

### Examining your data with `.info()`

---

The `.info()` method should be the first thing you look at when getting acquainted with a new dataset.

**Types** are very important. They affect how data is represented in our machine learning models, how data can be joined, whether math operators can be applied, and when you might encounter unexpected results. In short, understanding the types helps you choose the appropriate techniques for the data on hand.

> _Typical problems when working with new datasets_:
> - Missing values *(like we saw on 'Colors Reported' column above)*
> - Unexpected types (string/object instead of int/float)
> - Dirty data (commas, dollar signs, unexpected characters, etc)
> - Blank values that are actually "non-null" or single white-space characters

`.info()` is a method that is available on every **DataFrame** object. It gives you information about:

- Name of column / variable attribute
- Type of index (RangeIndex is default)
- Count of non-null values by column / attribute
- Type of data contained in column / attribute
- Unique counts of dtypes (pandas data types)
- Memory usage of our dataset


In [18]:
ufo.info() # observe: if all the columns had non-null values, the non-null count would be the same

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80543 entries, 0 to 80542
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   City             80496 non-null  object
 1   Colors Reported  17034 non-null  object
 2   Shape Reported   72141 non-null  object
 3   State            80543 non-null  object
 4   Time             80543 non-null  object
dtypes: object(5)
memory usage: 3.1+ MB
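The typical problems listed earlier (unexpected types, dirty characters, blanks that are not really null) can be checked with a few lines. This is a minimal sketch on a made-up DataFrame; the name `raw` and its values are assumptions for illustration only.

```python
import pandas as pd

raw = pd.DataFrame({
    'price': ['1,200', '$950', '430'],   # numbers stored as text
    'city': ['Ithaca', ' ', None],       # a blank string and a true null
})

# a text (object) dtype on a numeric-looking column is a warning sign
print(raw.dtypes)

# Strip the stray characters, then convert to a numeric dtype
cleaned = raw['price'].str.replace(r'[$,]', '', regex=True)
raw['price'] = pd.to_numeric(cleaned)

# Whitespace-only strings count as non-null, so look for them explicitly
blank = raw['city'].str.strip() == ''
print(blank.sum())               # blank-but-non-null entries
print(raw['city'].isna().sum())  # genuinely missing entries
```

Running `.info()` before and after a cleanup like this is a quick way to confirm that the dtype actually changed.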


## Aside:  Working with "Big Data"

---

The term **Big Data** has become a little bit of a buzzword with no clear, consensus definition. The most common definition is that **Big Data are data that are too big to fit in your computer's memory.**

![](https://snag.gy/UGNamo.jpg)

Pandas runs on a single computer, so it is limited by that computer's memory (RAM). On a machine with 8GB of RAM we could, in theory, load at most 8GB of data, and only if no other application used any RAM. This definition is useful because once your data size exceeds your RAM, you need a different set of tools to solve your problems. For example: 

* Spark (Later Week!)
* Hadoop
* Being clever with how you read and use data
    - Separate it into small chunks for example.
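One way to separate data into small chunks is the `chunksize` argument of `pd.read_csv`, which yields the file piece by piece instead of loading it whole. The sketch below uses a tiny made-up in-memory CSV (`csv_text`) in place of a file that would really be too large for memory.

```python
import io
import pandas as pd

# Made-up data standing in for a file too large to load at once
csv_text = "price\n" + "\n".join(str(p) for p in [100, 250, 300, 50, 400])

total = 0
n_rows = 0
# chunksize makes read_csv yield DataFrames of at most that many rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk['price'].sum()
    n_rows += len(chunk)

print(total / n_rows)  # mean computed without holding all rows at once
```

Only running totals are kept between chunks, so memory use depends on the chunk size rather than on the size of the file.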

<a id='describe'></a>

## Quick Summaries

---

The `.describe()` method is very useful for taking a quick look at your data. It gives you some of the basic descriptive statistics.

You can use `.value_counts()` to get a good tabular view of a categorical variable. Called on a column, it returns a Series containing the count of each unique value.

In [15]:
# Let's read in the diamonds data set.
diamonds = pd.read_csv("datasets/diamonds.csv")
print(diamonds.shape) # note: print() is needed so that both the shape and the head are displayed. can be made nicer with f-strings!
diamonds.head()

(53940, 10)


Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [20]:
# Let's describe the price
diamonds['price'].describe()

count    53940.000000
mean      3932.799722
std       3989.439738
min        326.000000
25%        950.000000
50%       2401.000000
75%       5324.250000
max      18823.000000
Name: price, dtype: float64

In [21]:
# lets try the info function again
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  object 
 2   color    53940 non-null  object 
 3   clarity  53940 non-null  object 
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64  
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB


In [22]:
# We can even do it to the whole DataFrame - what does that look like?
# What's missing? --> categorical columns (captured as "object" Dtype above)
diamonds.describe()

Unnamed: 0,carat,depth,table,price,x,y,z
count,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
mean,0.79794,61.749405,57.457184,3932.799722,5.731157,5.734526,3.538734
std,0.474011,1.432621,2.234491,3989.439738,1.121761,1.142135,0.705699
min,0.2,43.0,43.0,326.0,0.0,0.0,0.0
25%,0.4,61.0,56.0,950.0,4.71,4.72,2.91
50%,0.7,61.8,57.0,2401.0,5.7,5.71,3.53
75%,1.04,62.5,59.0,5324.25,6.54,6.54,4.04
max,5.01,79.0,95.0,18823.0,10.74,58.9,31.8


- While describing a DataFrame, by default, only numeric fields are returned
- To describe all columns of a DataFrame regardless of data type, pass `include='all'` as a parameter to the `describe` function
    - For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency
    - For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median
- Note that the returned stats are excluding NaN values

Tip: try searching online to read more about any function and its parameters. The pandas documentation is extremely informative! [example: describe function documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)

In [23]:
diamonds.describe(include='all')

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
count,53940.0,53940,53940,53940,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
unique,,5,7,8,,,,,,
top,,Ideal,G,SI1,,,,,,
freq,,21551,11292,13065,,,,,,
mean,0.79794,,,,61.749405,57.457184,3932.799722,5.731157,5.734526,3.538734
std,0.474011,,,,1.432621,2.234491,3989.439738,1.121761,1.142135,0.705699
min,0.2,,,,43.0,43.0,326.0,0.0,0.0,0.0
25%,0.4,,,,61.0,56.0,950.0,4.71,4.72,2.91
50%,0.7,,,,61.8,57.0,2401.0,5.7,5.71,3.53
75%,1.04,,,,62.5,59.0,5324.25,6.54,6.54,4.04


In [24]:
# Let's count up the cuts --> try comparing with your spreadsheet app to confirm you're doing it correctly!
diamonds['cut'].value_counts()

Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
Name: cut, dtype: int64

In [25]:
# Let's do the same thing, but normalized --> Return proportions out of total, rather than frequencies
diamonds['cut'].value_counts(normalize=True)

Ideal        0.399537
Premium      0.255673
Very Good    0.223990
Good         0.090953
Fair         0.029848
Name: cut, dtype: float64

```.describe()``` gives us these statistics:

- **count**, the number of non-null values in the column (equal to the number of rows when nothing is missing)
- **mean**, the average of the values in the column
- **std**, which is the standard deviation
- **min**, the minimum value
- **25%**, the 25th percentile of the values 
- **50%**, the 50th percentile of the values, which is the equivalent to the median
- **75%**, the 75th percentile of the values
- **max**, the maximum value
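Two of these statistics are worth checking by hand: **count** skips missing values, and **50%** is the median. A small sketch on a made-up Series `s`:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 20.0, np.nan, 40.0])

summary = s.describe()
print(summary)

print(len(s))            # 4 entries in total
print(summary['count'])  # 3.0, because the NaN is skipped
print(summary['50%'] == s.median())  # True
```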

There are built-in math functions that work on all of the columns of a DataFrame at once, or on subsets of the data.

For example, `.mean()` works on a single column, as below with `diamonds['price']`. Called on a DataFrame of numeric columns, it returns the mean of every column.

In [26]:
diamonds['price'].mean() # this is the average of all values in the 'price' column

3932.799721913237

In [27]:
diamonds['price'].median() # mid-value (50th percentile) when data is ordered from least to greatest

2401.0

In [28]:
diamonds['price'].quantile([0.025, 0.975, 0.25, 0.5]) # Return values at the given quantile

0.025      478.000
0.975    15618.525
0.250      950.000
0.500     2401.000
Name: price, dtype: float64

The 25th percentile is the value below which 25% of the data fall; the other quantiles above are read the same way.

<a id='independent_practice'></a>

### <span style="color:blue">***Time for you to practise!***</span> 

---

Now that we know a little bit about basic DataFrame use, let's practice on a new dataset.

> Pro tip:  You can use the "tab" key to browse filesystem resources when your cursor is in a string to get a relative reference to the files that can be loaded in Jupyter notebook.  Remember, you have to use your arrow keys to navigate the files populated in the UI. 

<img src="https://snag.gy/IlLNm9.jpg">

<span style="color:blue">***Task:***</span> 
1. Read in the `cars.csv` dataset. (call the dataframe, `cars`)
1. What is the mean `mpg` for cars in this dataset?

In [18]:
cars = pd.read_csv('datasets/cars.csv')
print(cars.shape)
cars.head(2)

(32, 11)


Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4


Tip: notice above that you can change the number of rows displayed by passing a parameter to `head()`. You aren't restricted to the default first 5 rows; you could just as well show 10.

In [30]:
cars['mpg'].mean()

20.090624999999996

## Filtering
We usually don't need to operate on the _whole_ dataset. A very common task is to pare it down to only the pieces we need.
- Think about the steps to filter data on a spreadsheet. Here, we are just translating those steps to code.

In [31]:
cars.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [32]:
# A numpy array is a grid of values that can be initialized like Python lists []
v = np.array([12, 98, 9, 50, 23]) 
v

array([12, 98,  9, 50, 23])

In [33]:
# What do you think the result of this cell is?
v[[True, False, True, False, True]]

array([12,  9, 23])

In [34]:
# How about this? --> element by element check & filtering --> boolean results based on condition check
v < 40

array([ True, False,  True, False,  True])

In [35]:
# So...    --> condition checked against each array element and only True results are returned
# Read as "v where v is less than 40"
v[v < 40]

array([12,  9, 23])

In [36]:
# And this? --> checks for condition in each row for column 'mpg' in dataframe cars
cars['mpg'] > 30

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17     True
18     True
19     True
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27     True
28    False
29    False
30    False
31    False
Name: mpg, dtype: bool

In [21]:
# Finally... --> filtered rows from entire dataframe based on specific condition check
# "cars where cars mpg is greater than 30"
cars[(cars['mpg'] > 30)]

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
17,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
18,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
19,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
27,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2


Filtering in pandas uses vectors of booleans to describe inclusion or exclusion. `True` means you're in, `False` means you're out.

In [20]:
# This functions identically to the code above, and can sometimes feel a little cleaner
# Variables that serve this function are sometimes called "masks"
high_mpg = cars['mpg'] > 30
high_mpg.head() # tip: good idea to just put head() to reduce output printout space, yet ensure that the filter condition works

0    False
1    False
2    False
3    False
4    False
Name: mpg, dtype: bool

In [39]:
cars[high_mpg] # same as cars[cars['mpg'] > 30]

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
17,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
18,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
19,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
27,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2


### Multiple Filters
Often we want to filter based on multiple conditions. We can use the usual "and" and "or" logic, but the symbols change for mystical (read: annoying) Python reasons. 

*Note that ordinary `if`/`elif`/`else` statements still use the usual `and` and `or` keywords. This difference takes some practise to get used to.*
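To see why the symbols change and why the parentheses matter, here is a sketch on a small made-up DataFrame `df` (not the cars data). It shows `&` working element by element, then the errors raised by `and` and by missing parentheses.

```python
import pandas as pd

df = pd.DataFrame({'cyl': [4, 6, 4, 8], 'am': [1, 1, 0, 0]})

# & compares the two boolean Series element by element
both = (df['cyl'] == 4) & (df['am'] == 1)
print(both.tolist())  # [True, False, False, False]

# Python's `and` asks for a single True/False from a whole Series,
# which pandas refuses to guess
try:
    (df['cyl'] == 4) and (df['am'] == 1)
except ValueError as err:
    print('ValueError:', err)

# Without parentheses, & binds tighter than ==, so this is parsed as
# df['cyl'] == (4 & df['am']) == 1, a chained comparison that also fails
try:
    df[df['cyl'] == 4 & df['am'] == 1]
except ValueError as err:
    print('ValueError:', err)
```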

In [40]:
# "And" logic - use ampersand (&)
# Note parentheses mandatory! --> parentheses are actually good practise to keep even without multiple conditions (readability!)
cars[(cars['cyl'] == 4) & (cars['am'] == 1)] # Important note: condition check uses ==, while an assignment uses =

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
17,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
18,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
19,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
25,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1
26,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
27,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
31,21.4,4,121.0,109,4.11,2.78,18.6,1,1,4,2


In [41]:
# "Or" logic - use pipe (|)
cars[(cars['mpg'] < 14) | (cars['mpg'] > 30)]

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
14,10.4,8,472.0,205,2.93,5.25,17.98,0,0,3,4
15,10.4,8,460.0,215,3.0,5.424,17.82,0,0,3,4
17,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
18,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
19,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
23,13.3,8,350.0,245,3.73,3.84,15.41,0,0,3,4
27,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2


### <span style="color:blue">***Time for you to practise!***</span> 
<span style="color:blue">***Task:***</span>

Show me all the UFO sightings in a particular City and State [you may use .unique() to see the values]

In [42]:
ufo[(ufo['City'] == 'Towaco') & (ufo['State'] == 'NJ')]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
1585,Towaco,,CIRCLE,NJ,5/20/1968 19:00
45630,Towaco,,TRIANGLE,NJ,8/13/2008 1:00
71134,Towaco,,OVAL,NJ,7/15/2013 22:00


### Aside: Some shortcuts

In [43]:
cars[cars['mpg'].between(24, 30)] # condition includes boundary values by default

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
7,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
25,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1
26,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
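As a quick check that `.between()` is only a shortcut, the sketch below compares it with the longhand mask on a made-up Series `mpg`. With its default settings both boundary values are included.

```python
import pandas as pd

mpg = pd.Series([21.0, 24.0, 27.3, 30.0, 33.9])

shortcut = mpg.between(24, 30)
longhand = (mpg >= 24) & (mpg <= 30)

print(shortcut.tolist())          # [False, True, True, True, False]
print(shortcut.equals(longhand))  # True
print((~shortcut).tolist())       # [True, False, False, False, True]
```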


In [22]:
cars[~cars['mpg'].between(24, 30)] # ~ (tilde) inverts condition --> not between

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
5,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
6,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
8,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
9,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
10,17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4


In [24]:
ufo[ufo['City'].isin(['Towaco', 'Montville'])] # checks and returns elements contained in a list. This is super useful!

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
1585,Towaco,,CIRCLE,NJ,5/20/1968 19:00
29123,Montville,,VARIOUS,OH,6/10/2004 21:00
34461,Montville,,CONE,OH,10/20/2005 20:00
45630,Towaco,,TRIANGLE,NJ,8/13/2008 1:00
55349,Montville,,DISK,CT,11/10/2010 19:40
71134,Towaco,,OVAL,NJ,7/15/2013 22:00


In [46]:
ufo[~ufo['City'].isin(['Towaco', 'Montville'])] # ~ inversion applies here too

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00
...,...,...,...,...,...
80538,Neligh,,CIRCLE,NE,9/4/2014 23:20
80539,Uhrichsville,,LIGHT,OH,9/5/2014 1:14
80540,Tucson,RED BLUE,,AZ,9/5/2014 2:40
80541,Orland park,RED,LIGHT,IL,9/5/2014 3:43


<a id='indexing'></a>

## Pandas Indexing: `.loc` and `.iloc`

---

So far we've learned how to select both rows and columns. The savvy and skeptical student would have noticed a problem here. We have ambiguous notation! What does this do:

```python
dataframe[something]
```
- Recall the `cars[cars['mpg'] > 30]` code we ran above: it returned *all* the columns of the `cars` DataFrame, with only a row-level filter applied on the 'mpg' column.

We can't tell! Is `something` a mask or a string? One selects rows, the other selects columns. **What if we wanted to filter rows and select columns at the same time?!** 

Pandas has two properties that you can use for indexing: *(remember, DataFrames are 2-D)*

- **`.loc`** indexes with the _labels_ of the row and column axes.
- **`.iloc`** indexes with the _integer positions_ of the row and column axes.
> There used to be a third, `.ix`, which was deprecated and has since been removed from pandas, so it shan't ever be used again.

## `.loc` is Most Common
The syntax of `.loc` is pretty intuitive:

```python
dataframe.loc[rows, columns]
```

Where `rows` is often a filter (i.e., a **mask**), and `columns` is a list of columns, or even just `:` to select all columns. *(Recall that we used `:` while covering slicing.)*

In [47]:
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [48]:
# lets look at the unique values in 'State'
ufo['State'].unique()

array(['NY', 'NJ', 'CO', 'KS', 'ND', 'CA', 'MI', 'AK', 'OR', 'AL', 'SC',
       'IA', 'GA', 'TN', 'NE', 'LA', 'KY', 'WV', 'NM', 'UT', 'RI', 'FL',
       'VA', 'NC', 'TX', 'WA', 'ME', 'IL', 'AZ', 'OH', 'PA', 'MN', 'WI',
       'MD', 'SD', 'NV', 'ID', 'MO', 'OK', 'IN', 'CT', 'MS', 'AR', 'WY',
       'MA', 'MT', 'DE', 'NH', 'VT', 'HI', 'Ca', 'Fl'], dtype=object)

In [49]:
# applying loc to filter rows with 'State' = 'TX' and only return columns 'City' & 'Shape Reported'
ufo.loc[ufo['State'] == 'TX', ['City', 'Shape Reported']].head()

Unnamed: 0,City,Shape Reported
37,Dallas,SPHERE
43,Alice,DISK
49,Conroe,OTHER
92,Borger,DISK
114,Post,DISK


In [50]:
# same row filter as above, but this time, return ALL dataframe columns
ufo.loc[ufo['State'] == 'TX', :].head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
37,Dallas,,SPHERE,TX,7/15/1945 14:00
43,Alice,,DISK,TX,3/15/1946 15:30
49,Conroe,,OTHER,TX,1/10/1947 20:00
92,Borger,,DISK,TX,6/15/1948 16:00
114,Post,,DISK,TX,9/15/1949 21:00


### Acccctually.....
![](assets/actually.png)
According to **_The Zen of Python_**, explicit is better than implicit. `.loc` is explicit. **Most people choose to always use `.loc` instead of the ambiguous `dataframe[something]` notation! This is a pretty good idea! When in doubt, use `.loc`!**

### `.iloc` is rare, but useful
The `i` stands for "integer": `.iloc` selects by zero-indexed integer position rather than by label. *(Recall that indexing starts at 0 in Python.)*

The syntax is very similar in structure to `.loc`
```python
dataframe.iloc[rows_index, columns_index]
```
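With the default `RangeIndex`, labels and positions happen to coincide, which hides the difference between `.loc` and `.iloc`. The sketch below uses a made-up DataFrame `df` whose index labels deliberately differ from the positions.

```python
import pandas as pd

df = pd.DataFrame(
    {'mpg': [21.0, 22.8, 18.7], 'cyl': [6, 4, 8]},
    index=[10, 20, 30],   # labels deliberately differ from positions
)

print(df.loc[10, 'mpg'])    # label 10, column 'mpg'  -> 21.0
print(df.iloc[0, 0])        # first row, first column -> 21.0

# Label slices include the end label; position slices exclude the stop
print(df.loc[10:20, 'mpg'].tolist())   # [21.0, 22.8]
print(df.iloc[0:1, 0].tolist())        # [21.0]
```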

In [51]:
cars.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [52]:
# using iloc to filter rows 0, 1, 2 (recap start:stop-1 slicing concept), and return with ALL columns
cars.iloc[:3, :]

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1


In [53]:
# same row filter as above, but this time, returning only selective columns
cars.iloc[:3, 3:6]

Unnamed: 0,hp,drat,wt
0,110,3.9,2.62
1,110,3.9,2.875
2,93,3.85,2.32


In [25]:
cars.columns

Index(['mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear',
       'carb'],
      dtype='object')

In [26]:
# when in doubt about which column sits at a given position
cars.columns[3]

'hp'

## Sorting

In [55]:
# We can sort individual Series...
cars['mpg'].sort_values().head() # notice that default sort is ascending

15    10.4
14    10.4
23    13.3
6     14.3
16    14.7
Name: mpg, dtype: float64

In [56]:
cars['mpg'].min() # 10.4 is the smallest value in 'mpg'

10.4

In [57]:
cars['mpg'].sort_values(ascending=False).head()

19    33.9
17    32.4
27    30.4
18    30.4
25    27.3
Name: mpg, dtype: float64

In [58]:
cars['mpg'].max() # 33.9 is the largest value in 'mpg'

33.9

In [59]:
# Or we can sort the entire DataFrame --> shift+tab on sort_values reveals 'mpg' is actually taken as the 'by' parameter
cars.sort_values('mpg').head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
15,10.4,8,460.0,215,3.0,5.424,17.82,0,0,3,4
14,10.4,8,472.0,205,2.93,5.25,17.98,0,0,3,4
23,13.3,8,350.0,245,3.73,3.84,15.41,0,0,3,4
6,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
16,14.7,8,440.0,230,3.23,5.345,17.42,0,0,3,4


In [60]:
# dataframe sort will NOT run without 'by' --> throws error
cars.sort_values().head()

TypeError: sort_values() missing 1 required positional argument: 'by'
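The `by` parameter also accepts a list of columns, and `ascending` can take a matching list of booleans. The sketch below is an illustration on a small made-up DataFrame `df`, not on the cars data.

```python
import pandas as pd

df = pd.DataFrame({
    'cyl': [8, 4, 8, 4],
    'mpg': [10.4, 33.9, 14.3, 22.8],
})

# Sort by cyl ascending, then by mpg descending within each cyl group
ordered = df.sort_values(by=['cyl', 'mpg'], ascending=[True, False])
print(ordered)

# sort_values returns a new DataFrame; the original order is unchanged
print(df['mpg'].tolist())  # [10.4, 33.9, 14.3, 22.8]
```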

In [61]:
ufo.info() # helpful summary with col, num of non-missing data count, dtype

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80543 entries, 0 to 80542
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   City             80496 non-null  object
 1   Colors Reported  17034 non-null  object
 2   Shape Reported   72141 non-null  object
 3   State            80543 non-null  object
 4   Time             80543 non-null  object
dtypes: object(5)
memory usage: 3.1+ MB


In [28]:
# to_datetime function enables conversion to datetime type
# we see from above that Time is currently not identified in the right format
ufo['Time'] = pd.to_datetime(ufo['Time']) 
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,1930-06-01 22:00:00
1,Willingboro,,OTHER,NJ,1930-06-30 20:00:00
2,Holyoke,,OVAL,CO,1931-02-15 14:00:00
3,Abilene,,DISK,KS,1931-06-01 13:00:00
4,New York Worlds Fair,,LIGHT,NY,1933-04-18 19:00:00


In [29]:
ufo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80543 entries, 0 to 80542
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   City             80496 non-null  object        
 1   Colors Reported  17034 non-null  object        
 2   Shape Reported   72141 non-null  object        
 3   State            80543 non-null  object        
 4   Time             80543 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(4)
memory usage: 3.1+ MB


### <span style="color:blue">***Time for you to practise!***</span> 
<span style="color:blue">***Task:***</span>

Get the 5 most recent UFO sightings in Roswell, New Mexico.

You'll need to filter and use .sort_values()

This is a hard one! --> breaking it down into steps, one at a time, will simplify getting to the solution.

In [64]:
# Filter to only Roswell, New Mexico
ufo.loc[(ufo.State == 'NM') & (ufo.City == 'Roswell'), :]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
40,Roswell,,SPHERE,NM,1945-12-15 02:00:00
71,Roswell,,,NM,1947-07-07 00:00:00
73,Roswell,,,NM,1947-07-11 00:00:00
120,Roswell,RED,,NM,1950-03-22 00:00:00
233,Roswell,,,NM,1953-06-15 00:00:00
562,Roswell,,,NM,1959-08-15 15:00:00
4472,Roswell,,,NM,1980-12-01 00:01:00
5941,Roswell,,DISK,NM,1988-09-01 17:30:00
6087,Roswell,,,NM,1989-06-01 23:00:00
10406,Roswell,,CIRCLE,NM,1997-06-17 04:55:00


In [65]:
# Sort by "Time"
ufo.loc[(ufo.State == 'NM') & (ufo.City == 'Roswell'), :].sort_values('Time', ascending=False)

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
80459,Roswell,ORANGE,CIRCLE,NM,2014-09-01 20:00:00
80056,Roswell,,OTHER,NM,2014-08-18 15:00:00
67040,Roswell,,FLASH,NM,2012-11-08 22:30:00
58960,Roswell,,LIGHT,NM,2011-09-04 19:50:00
55978,Roswell,,EGG,NM,2011-01-16 14:30:00
49948,Roswell,ORANGE,LIGHT,NM,2009-08-05 20:45:00
49122,Roswell,,SPHERE,NM,2009-06-10 21:10:00
45995,Roswell,BLUE,VARIOUS,NM,2008-09-15 18:30:00
45579,Roswell,,LIGHT,NM,2008-08-09 19:03:00
45502,Roswell,,,NM,2008-08-04 21:53:00


In [66]:
# Take top 5 rows
ufo.loc[(ufo.State == 'NM') & (ufo.City == 'Roswell'), :].sort_values('Time', ascending=False).head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
80459,Roswell,ORANGE,CIRCLE,NM,2014-09-01 20:00:00
80056,Roswell,,OTHER,NM,2014-08-18 15:00:00
67040,Roswell,,FLASH,NM,2012-11-08 22:30:00
58960,Roswell,,LIGHT,NM,2011-09-04 19:50:00
55978,Roswell,,EGG,NM,2011-01-16 14:30:00


## Split-Apply-Combine _(let's also review the sac-visual deck for additional intuition)_
---
What if we want summary statistics _with respect to some categorical variable?_ For example, the price of a diamond probably varies widely between different diamond cuts. To tackle this problem, we'll use the **Split-Apply-Combine** technique. (It is sometimes likened to **MapReduce**; it is better thought of as a special case of it.)



* **Split**: Separate your data into different DataFrames, one for each category.
* **Apply**: On each split-up DataFrame, apply some function or transformation (for example, the mean).
* **Combine**: Take the results and combine the split-up DataFrames back into one aggregate DataFrame.

This might sound complicated, but it's actually only two commands in pandas (the **Combine** step is done for us).

- Tip: Relate this to how you would do it in **Excel**: Pivot Table! 'cut'--> Rows, 'price'--> Values (average)
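
To make the three steps concrete, here is a sketch that performs them by hand and then compares the result with the pandas shortcut. The frame `toy` and its prices are made up for illustration. The manual loop is only there to show the idea; it is not how pandas implements `groupby` internally.

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    'cut': ['Fair', 'Good', 'Fair', 'Ideal', 'Good'],
    'price': [300, 400, 500, 350, 600],
})

# Split: one sub-DataFrame per category
pieces = {cut: toy[toy['cut'] == cut] for cut in toy['cut'].unique()}

# Apply: compute the mean price of each piece
means = {cut: part['price'].mean() for cut, part in pieces.items()}

# Combine: gather the per-category results into one Series
manual = pd.Series(means, name='price').sort_index()

# the pandas version performs all three steps in one expression
shortcut = toy.groupby('cut')['price'].mean()
```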

In [67]:
# let's recap what's there in the 'diamonds' dataframe
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [68]:
# What is the mean price by diamond cut?
diamonds.groupby('cut')['price'].mean()

cut
Fair         4358.757764
Good         3928.864452
Ideal        3457.541970
Premium      4584.257704
Very Good    3981.759891
Name: price, dtype: float64

In [69]:
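# 'cut' can also be passed explicitly to the 'by' parameter of groupby()
# selecting a column (or a list of columns) after groupby restricts the result to those columns;
# without a selection, the statistic is computed for every numeric column
# (pandas 2.0+ needs numeric_only=True here because 'color' and 'clarity' hold text)
diamonds.groupby(by='cut').mean(numeric_only=True)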

Unnamed: 0_level_0,carat,depth,table,price,x,y,z
cut,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Fair,1.046137,64.041677,59.053789,4358.757764,6.246894,6.182652,3.98277
Good,0.849185,62.365879,58.694639,3928.864452,5.838785,5.850744,3.639507
Ideal,0.702837,61.709401,55.951668,3457.54197,5.507451,5.52008,3.401448
Premium,0.891955,61.264673,58.746095,4584.257704,5.973887,5.944879,3.647124
Very Good,0.806381,61.818275,57.95615,3981.759891,5.740696,5.770026,3.559801


A groupby operation involves some combination of splitting the object, applying a function, and combining the results.

- For those of us familiar with SQL, a groupby looks something like this:


```sql
SELECT Column1, Column2, AVG(Column3)
FROM SomeTable
GROUP BY Column1, Column2
```
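
A possible pandas counterpart of that query is sketched below. The table `some_table` and its values are invented to mirror the placeholder names in the SQL. `as_index=False` is one way to keep the group keys as ordinary columns, similar to a SQL result set.

```python
import pandas as pd

# hypothetical table mirroring the placeholder names in the SQL above
some_table = pd.DataFrame({
    'Column1': ['a', 'a', 'b', 'b', 'b'],
    'Column2': ['x', 'x', 'x', 'y', 'y'],
    'Column3': [1.0, 3.0, 5.0, 7.0, 9.0],
})

# GROUP BY Column1, Column2 -> pass both names as a list
# as_index=False keeps the group keys as regular columns
result = some_table.groupby(['Column1', 'Column2'], as_index=False)['Column3'].mean()
```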

In [70]:
# groupby by itself creates a grouped object that can be assigned to a variable
# for downstream statistic calculation
diamonds.groupby(by='cut')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7faade624590>

In [71]:
# Can we just describe each price by cut? 
# --> recap that describe on numerical columns gives various useful stats by default
diamonds.groupby('cut')['price'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
cut,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Fair,1610.0,4358.757764,3560.386612,337.0,2050.25,3282.0,5205.5,18574.0
Good,4906.0,3928.864452,3681.589584,327.0,1145.0,3050.5,5028.0,18788.0
Ideal,21551.0,3457.54197,3808.401172,326.0,878.0,1810.0,4678.5,18806.0
Premium,13791.0,4584.257704,4349.204961,326.0,1046.0,3185.0,6296.0,18823.0
Very Good,12082.0,3981.759891,3935.862161,336.0,912.0,2648.0,5372.75,18818.0


In [72]:
# What if I want my own recipe of statistics? --> .agg with list of statistics to compute
diamonds.groupby('cut')['price'].agg(['count', 'mean', 'median'])

Unnamed: 0_level_0,count,mean,median
cut,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Fair,1610,4358.757764,3282.0
Good,4906,3928.864452,3050.5
Ideal,21551,3457.54197,1810.0
Premium,13791,4584.257704,3185.0
Very Good,12082,3981.759891,2648.0


### <span style="color:blue">***Time for you to practise!***</span> 
<span style="color:blue">***Task:***</span>

What is the mean miles per gallon for each cylinder size?

In [73]:
# check dataframe's head to recap -->tip: we want to use cyl and mpg columns for our groupby
cars.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [74]:
cars.groupby('cyl')['mpg'].mean()

cyl
4    26.663636
6    19.742857
8    15.100000
Name: mpg, dtype: float64

### Advanced Split-Apply-Combining
Feel free to skip!

In [75]:
# What if I want my own home-spun aggregate function?

# Maybe the mean of the log-price is interesting to you?
def log_mean(p):
    return np.mean(np.log(p))

diamonds.groupby('cut')['price'].agg(['count', 'mean', log_mean])

Unnamed: 0_level_0,count,mean,log_mean
cut,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Fair,1610,4358.757764,8.093441
Good,4906,3928.864452,7.842809
Ideal,21551,3457.54197,7.639467
Premium,13791,4584.257704,7.950795
Very Good,12082,3981.759891,7.798664


In [32]:
# What if I want functions of different columns?
diamonds.groupby('cut').agg({
    'price' : ['count', 'mean'],
    'carat' : ['mean']
})

Unnamed: 0_level_0,price,price,carat
Unnamed: 0_level_1,count,mean,mean
cut,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Fair,1610,4358.757764,1.046137
Good,4906,3928.864452,0.849185
Ideal,21551,3457.54197,0.702837
Premium,13791,4584.257704,0.891955
Very Good,12082,3981.759891,0.806381
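
The dictionary form above produces two-level column headers. If flat column names are preferred, named aggregation is an alternative. The sketch below uses a made-up frame (`toy`), and the output column names `n`, `mean_price`, and `mean_carat` are arbitrary choices.

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    'cut': ['Fair', 'Good', 'Fair', 'Good'],
    'price': [300, 400, 500, 600],
    'carat': [0.2, 0.3, 0.4, 0.5],
})

# named aggregation: new_column=(source_column, function)
summary = toy.groupby('cut').agg(
    n=('price', 'count'),
    mean_price=('price', 'mean'),
    mean_carat=('carat', 'mean'),
)
```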


## Adding, Dropping, Renaming, and `inplace` Methods

In [78]:
# Adding a column is easy, just define it!
# What if I wanted km per gal instead of miles per gal?
# kmpg is the new column created in our dataframe
cars['kmpg'] = cars['mpg'] * 1.61
cars.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,kmpg
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,33.81
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,33.81
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,36.708
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,34.454
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,30.107


In [79]:
# Oops - that actually doesn't make sense since they'd be using liters anyway.
# Let's drop it. axis=1 tells drop() that 'kmpg' is a column, not a row label.
cars.drop('kmpg', axis=1).head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [80]:
# But... it's not gone?
cars.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,kmpg
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,33.81
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,33.81
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,36.708
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,34.454
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,30.107


In [81]:
cars.drop('kmpg', axis=1, inplace=True)
cars.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [82]:
# re-create the column so that we can drop it again, this time with the 'columns' parameter
cars['kmpg'] = cars['mpg'] * 1.61
cars.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,kmpg
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,33.81
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,33.81
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,36.708
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,34.454
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,30.107


In [83]:
# alternative drop continued..
# multiple columns to drop can just be passed as a list to columns parameter
cars.drop(columns='kmpg', inplace=True) 
cars.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


###  Inplace Methods!
There are several methods in pandas that don't "stick" unless you tell them to: they return a modified copy and leave the original untouched. These methods have `inplace=False` by default. If you want to run a method and have it "stick", pass `inplace=True` (or assign the result back to the variable).

For example...

### Renaming Columns

In [84]:
# Yuck - I hate spaces and capital letters --> poor naming practice
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,1930-06-01 22:00:00
1,Willingboro,,OTHER,NJ,1930-06-30 20:00:00
2,Holyoke,,OVAL,CO,1931-02-15 14:00:00
3,Abilene,,DISK,KS,1931-06-01 13:00:00
4,New York Worlds Fair,,LIGHT,NY,1933-04-18 19:00:00


In [85]:
# Lowercaseifying is easy:
# The "columns" attribute of a DataFrame is a pandas Index, which offers the same .str methods as a Series.
ufo.columns = ufo.columns.str.lower()
# alternate way to check above code's result instead of dataframe.head()
ufo.columns

Index(['city', 'colors reported', 'shape reported', 'state', 'time'], dtype='object')

In [86]:
ufo.head()

Unnamed: 0,city,colors reported,shape reported,state,time
0,Ithaca,,TRIANGLE,NY,1930-06-01 22:00:00
1,Willingboro,,OTHER,NJ,1930-06-30 20:00:00
2,Holyoke,,OVAL,CO,1931-02-15 14:00:00
3,Abilene,,DISK,KS,1931-06-01 13:00:00
4,New York Worlds Fair,,LIGHT,NY,1933-04-18 19:00:00


In [87]:
# The .rename method--> dict {old_colname: new_colname}
# if inplace=True is not done here, the name changes will need to be assigned back to ufo dataframe
ufo.rename(columns={
    'colors reported': 'colors',
    'shape reported': 'shape'
}, inplace=True)

In [88]:
ufo.head()

Unnamed: 0,city,colors,shape,state,time
0,Ithaca,,TRIANGLE,NY,1930-06-01 22:00:00
1,Willingboro,,OTHER,NJ,1930-06-30 20:00:00
2,Holyoke,,OVAL,CO,1931-02-15 14:00:00
3,Abilene,,DISK,KS,1931-06-01 13:00:00
4,New York Worlds Fair,,LIGHT,NY,1933-04-18 19:00:00
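
The difference between getting a copy back and changing the object in place can be checked on a tiny frame. This is a sketch with made-up data; `toy`, `renamed`, and `returned` are illustrative names.

```python
import pandas as pd

# hypothetical one-row frame
toy = pd.DataFrame({'Colors Reported': ['RED'], 'State': ['NM']})

# without inplace=True the method returns a new DataFrame and leaves `toy` untouched
renamed = toy.rename(columns={'Colors Reported': 'colors'})

# inplace=True modifies `toy` itself and returns None
returned = toy.rename(columns={'State': 'state'}, inplace=True)
```

After running this, `renamed` carries the new `colors` column while `toy` still has `Colors Reported`. The second call changed `toy` directly, which is why there is nothing useful to assign from it.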


### Aside: `str` and `dt` methods
There are a lot of familiar string and date operations we can perform on columns. They live in accessor namespaces rather than on the Series itself, so they have to be prefixed with `.str` and `.dt` respectively.

In [89]:
# the same .str.lower() we used above on the column names, now applied to the values of one column
ufo['shape'].str.lower().head() 

0    triangle
1       other
2        oval
3        disk
4       light
Name: shape, dtype: object

In [90]:
# replace('string-to-replace', 'new-string')
ufo['shape'].str.replace('O', 'BRO').head() 

0    TRIANGLE
1     BROTHER
2      BROVAL
3        DISK
4       LIGHT
Name: shape, dtype: object

Read also: [dataframe replace](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html)

In [33]:
ufo.head()

Unnamed: 0,city,colors,shape,state,time
0,Ithaca,,TRIANGLE,NY,1930-06-01 22:00:00
1,Willingboro,,OTHER,NJ,1930-06-30 20:00:00
2,Holyoke,,OVAL,CO,1931-02-15 14:00:00
3,Abilene,,DISK,KS,1931-06-01 13:00:00
4,New York Worlds Fair,,LIGHT,NY,1933-04-18 19:00:00


In [91]:
# We already did this above, but datetime variables need to be converted specially.
# ufo['time'] = pd.to_datetime(ufo['time'])
# dt.year extracts only the 'year' element from date-time column
ufo['time'].dt.year.head()

0    1930
1    1930
2    1931
3    1931
4    1933
Name: time, dtype: int64

Try also: [dt.month](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.month.html)
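
Other components are available through the same `.dt` accessor. The sketch below builds a small Series from two timestamps that appear in the outputs above; the variable names are illustrative.

```python
import pandas as pd

# two timestamps borrowed from the ufo outputs shown earlier
times = pd.Series(pd.to_datetime(['1930-06-01 22:00:00', '1947-07-07 00:00:00']))

months = times.dt.month          # month number, 1-12
hours = times.dt.hour            # hour of day, 0-23
weekdays = times.dt.day_name()   # weekday name as text
```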

## Missing Values

In [92]:
s = pd.Series([5, 7, np.nan, 2, 10])

# alternate missing value declaration (case sensitive) instead of using np.nan
pd.Series([5, 7, None, 2, 10])

0     5.0
1     7.0
2     NaN
3     2.0
4    10.0
dtype: float64

Note: other ways to declare missing values, like the empty string `''`, will convert the entire Series to object dtype and should be avoided. A bare `nan` instead of `np.nan` will not work either, because the name `nan` is not defined unless you import it.

In [93]:
s

0     5.0
1     7.0
2     NaN
3     2.0
4    10.0
dtype: float64

In [94]:
# Hmm...this doesn't work (because nan is not equal to nan!)
s == np.nan

0    False
1    False
2    False
3    False
4    False
dtype: bool

To check for missing values in a Series or a DataFrame, we can use the methods `isnull()` and `notnull()`. `isnull()` marks missing elements with `True`; `notnull()` marks the non-missing ones.

In [95]:
s.isnull()

0    False
1    False
2     True
3    False
4    False
dtype: bool

In [96]:
s.notnull()

0     True
1     True
2    False
3     True
4     True
dtype: bool

In [97]:
# let's apply these functions on the ufo dataframe--> we see straightaway, colors has nulls 
ufo.isnull().head()

Unnamed: 0,city,colors,shape,state,time
0,False,True,False,False,False
1,False,True,False,False,False
2,False,True,False,False,False
3,False,True,False,False,False
4,False,True,False,False,False


In [98]:
# above was doing a quick check using head() on only the first 5 rows
# effective way to check null count across all columns in the dataframe would be as below
ufo.isnull().sum()

city         47
colors    63509
shape      8402
state         0
time          0
dtype: int64

In [99]:
# alternative to above using isna()
ufo.isna().sum()

city         47
colors    63509
shape      8402
state         0
time          0
dtype: int64

In [36]:
len(ufo)

80543

In [37]:
# Fraction of missing values in each column (multiply by 100 for a percentage)
ufo.isna().sum()/len(ufo)

city      0.000584
colors    0.788510
shape     0.104317
state     0.000000
time      0.000000
dtype: float64

In [101]:
# Easy way to filter out missings from dataframe!
ufo.loc[ufo['colors'].notnull(), :].head()

Unnamed: 0,city,colors,shape,state,time
12,Belton,RED,SPHERE,SC,1939-06-30 20:00:00
19,Bering Sea,RED,OTHER,AK,1943-04-30 23:00:00
36,Portsmouth,RED,FORMATION,VA,1945-07-10 01:30:00
44,Blairsden,GREEN,SPHERE,CA,1946-06-30 19:00:00
66,Wexford,BLUE,,PA,1947-07-01 20:00:00


In [102]:
# alternative method --> recap tilde used previously!
ufo[~ufo['colors'].isna()].head()

Unnamed: 0,city,colors,shape,state,time
12,Belton,RED,SPHERE,SC,1939-06-30 20:00:00
19,Bering Sea,RED,OTHER,AK,1943-04-30 23:00:00
36,Portsmouth,RED,FORMATION,VA,1945-07-10 01:30:00
44,Blairsden,GREEN,SPHERE,CA,1946-06-30 19:00:00
66,Wexford,BLUE,,PA,1947-07-01 20:00:00


In [104]:
# Alternative method to drop missing values from the dataframe. Note: Use inplace=True if you want the drop to "stick"
# Tip! There are multiple ways to achieve the same result in coding!!!
ufo.dropna(subset=['colors']).head()

Unnamed: 0,city,colors,shape,state,time
12,Belton,RED,SPHERE,SC,1939-06-30 20:00:00
19,Bering Sea,RED,OTHER,AK,1943-04-30 23:00:00
36,Portsmouth,RED,FORMATION,VA,1945-07-10 01:30:00
44,Blairsden,GREEN,SPHERE,CA,1946-06-30 19:00:00
66,Wexford,BLUE,,PA,1947-07-01 20:00:00
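
Dropping is not the only option: missing values can also be filled. This is a minimal sketch on the same small Series used earlier in this section. Whether filling is appropriate, and with what value, depends on the data and is left as a judgment call.

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 7, np.nan, 2, 10])

# replace missing values with a constant
filled_zero = s.fillna(0)

# or with a statistic of the observed values (mean() skips NaN by default)
filled_mean = s.fillna(s.mean())
```

Like `dropna()`, `fillna()` returns a new object; `s` itself still contains the missing value.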


## Exporting Data
We can read data, but how do we save it so we can send it out? pandas has several methods of the form `.to_*()`.

- Note: The code below needs a folder called 'output' next to this notebook. There is a programmatic way to create a directory that does not exist yet. The code below creates a directory called my_folder in the same place as this notebook:
```python
import os
if not os.path.exists('my_folder'):
    os.makedirs('my_folder')
```
- Tip: the file being saved is a filtered dataframe that can be a separate dataframe assignment during initial filtering, like mpg_above30_cars. Then the syntax to save would just be: 
```python
mpg_above30_cars.to_csv('filepath')
```
- Try: 
    - what happens when you don't pass index=False, note that the default is True
    - you can use f-strings in the path to insert variables into exported file name in { }
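
Both suggestions can be tried together in a sketch like the one below. It writes to a temporary folder so that nothing is left behind next to the notebook. The frame `toy`, the variable `threshold`, and the file name pattern are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({'mpg': [32.4, 30.4, 33.9], 'cyl': [4, 4, 4]})

threshold = 30                 # hypothetical value to embed in the file name
out_dir = tempfile.mkdtemp()   # temporary folder used only for this sketch
path = os.path.join(out_dir, f'cars-above-{threshold}-mpg.csv')

# index=False leaves the row labels out of the file
toy.to_csv(path, index=False)

# reading the file back gives the original columns, with no extra 'Unnamed: 0' column
round_trip = pd.read_csv(path)
```

With the default `index=True`, the row labels are written as an unnamed first column, which is where the `Unnamed: 0` column comes from when such a file is read back.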

In [40]:
import os
if not os.path.exists('output'):
    os.makedirs('output')
cars.loc[cars['mpg'] > 30, :].to_csv('output/efficient-cars.csv')

<a id='review'></a>

### Review

---

 - What would we do with a dataset when we first acquire it?
 - What's important to consider when first looking at a dataset? 
 - What are some common problems we can run into with new data?
 - What are some common operations with DataFrames?
 - How do we slice? Index? Filter?

# EXTRA MATERIALS
![](assets/biohazard.png)

Everything that follows is considered advanced or "too much" for our first session with pandas, and may not be explicitly covered by the instructor. If the instructor _does_ cover it, please don't worry that you don't understand this on your first pass.

**THAT DOES NOT MEAN THESE TOPICS ARE UNIMPORTANT OR RARELY USED!** We highly _highly_ recommend you take a look at these on your own time.

### Merging

In [105]:
movies = pd.read_csv(
    'datasets/movies.tbl',
    sep='|', # pipe symbol is used as separator instead of default ','.
    encoding='latin1',
    header=None,
    names=['movie_id', 'title'],
    usecols=[0, 1]
)
movies.head()

Unnamed: 0,movie_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [106]:
ratings = pd.read_csv(
    'datasets/movie_ratings.tsv',
    sep='\t', # tab symbol is used as separator instead of default ','
    header=None,
    names=['user_id', 'movie_id', 'rating', 'timestamp']
)
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [107]:
movie_reviews = pd.merge(ratings, movies, how='left')
movie_reviews.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,title
0,196,242,3,881250949,Kolya (1996)
1,186,302,3,891717742,L.A. Confidential (1997)
2,22,377,1,878887116,Heavyweights (1994)
3,244,51,2,880606923,Legends of the Fall (1994)
4,166,346,1,886397596,Jackie Brown (1997)


In [108]:
# alternative: call .merge() as a method of the main dataframe instead of going through pd.merge()
movie_ratings = ratings.merge(movies, how='left')
movie_ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,title
0,196,242,3,881250949,Kolya (1996)
1,186,302,3,891717742,L.A. Confidential (1997)
2,22,377,1,878887116,Heavyweights (1994)
3,244,51,2,880606923,Legends of the Fall (1994)
4,166,346,1,886397596,Jackie Brown (1997)


In [109]:
print(movies.shape)
print(ratings.shape)
print(movie_reviews.shape)

(1682, 2)
(100000, 4)
(100000, 5)


- What we did above can be equated to a `vlookup` operation in Excel, using 'movie_id' as the lookup key because it is common to both dataframes, movies and ratings

There are many ways to achieve joining:
- Other than `merge`, we can also use `join` to stick 2 dataframes together *column-wise*. Join sticks based on index
- Another dataframe sticking function could be `concat` that can be used to stick dataframes together *row-wise* one below the other (based on default axis = 0 parameter, though it allows *column* level concatenation by changing this axis too)
- The 'how' parameter in a merge or join function follows the exact same logic as [SQL joins](https://www.w3schools.com/sql/sql_join.asp)
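
The options listed above can be compared side by side on two tiny frames. Everything in this sketch is made up for illustration (`toy_movies`, `toy_ratings`, and their contents); one rating deliberately points at a `movie_id` that has no title.

```python
import pandas as pd

# hypothetical toy frames
toy_movies = pd.DataFrame({'movie_id': [1, 2, 3], 'title': ['A', 'B', 'C']})
toy_ratings = pd.DataFrame({
    'user_id': [10, 11, 12],
    'movie_id': [1, 1, 4],
    'rating': [5, 3, 4],
})

# left join: keep every rating; a missing title becomes NaN
left = toy_ratings.merge(toy_movies, on='movie_id', how='left')

# inner join: keep only ratings whose movie_id exists in toy_movies
inner = toy_ratings.merge(toy_movies, on='movie_id', how='inner')

# outer join with indicator=True adds a '_merge' column showing each row's origin
outer = toy_ratings.merge(toy_movies, on='movie_id', how='outer', indicator=True)

# concat stacks frames row-wise by default (axis=0)
stacked = pd.concat([toy_movies, toy_movies], ignore_index=True)
```

Passing `on='movie_id'` explicitly is optional here, since pandas joins on the shared column names by default, but it makes the lookup key visible to the reader.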

### "Categorical" Variables
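Despite the name, a pandas "Categorical" does more than label categories: it also fixes the order in which the categories are listed, and it can optionally be declared "Ordinal" (`ordered=True`) so that categories can be compared and ranked.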

For example, check out the following crosstab *(try googling for pandas crosstab documentation):*

In [110]:
# recap what's in the diamonds dataframe
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [111]:
# do a cross tabulation between the 'cut' and 'color' columns
pd.crosstab(diamonds['cut'], diamonds['color'])

color,D,E,F,G,H,I,J
cut,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Fair,163,224,312,314,303,175,119
Good,662,933,909,871,702,522,307
Ideal,2834,3903,3826,4884,3115,2093,896
Premium,1603,2337,2331,2924,2360,1428,808
Very Good,1513,2400,2164,2299,1824,1204,678


- Pandas crosstab by default computes a frequency table of the factors unless an array of values and an aggregation function are passed.
    - what this means for above is there are 163 value counts with the combination 'cut' = Fair and 'color' = D, and so on

The "cuts" are not in the right order! They're actually in alphabetical order. We can fix this by telling pandas that there really is an important ordering here.

In [112]:
diamonds['cut'] = pd.Categorical(diamonds['cut'], categories=['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'])
pd.crosstab(diamonds['cut'], diamonds['color'])

color,D,E,F,G,H,I,J
cut,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Fair,163,224,312,314,303,175,119
Good,662,933,909,871,702,522,307
Very Good,1513,2400,2164,2299,1824,1204,678
Premium,1603,2337,2331,2924,2360,1428,808
Ideal,2834,3903,3826,4884,3115,2093,896
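
Listing the categories, as above, controls the display order. Adding `ordered=True` additionally lets pandas rank and compare the categories. The sketch below uses a short made-up Series rather than the full `diamonds` data.

```python
import pandas as pd

cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
cuts = pd.Series(['Ideal', 'Fair', 'Premium', 'Good'])  # hypothetical sample

# ordered=True tells pandas the categories have a rank, not just a display order
ordered_cuts = pd.Series(pd.Categorical(cuts, categories=cut_order, ordered=True))

# sorting follows the category order rather than the alphabet
sorted_cuts = ordered_cuts.sort_values().tolist()

# comparisons against a category are allowed for ordered categoricals
at_least_premium = (ordered_cuts >= 'Premium').tolist()
```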


### Categorizing with `.map()`
[pandas documentation for function](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html)

In [113]:
# recap what's inside cars dataframe
cars.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [114]:
# take a look at the values in 'cyl' column
cars.groupby('cyl')['cyl'].count()

cyl
4    11
6     7
8    14
Name: cyl, dtype: int64

In [115]:
# creating a new column in the cars dataframe, called 'cyl_word'
cars['cyl_word'] = cars['cyl'].map({4: 'Four', 6: 'Six', 8: 'Eight'})

# recap that we do value_counts to count categories
cars['cyl_word'].value_counts() 

Eight    14
Four     11
Six       7
Name: cyl_word, dtype: int64

In [116]:
# new column added to cars dataframe
cars.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,cyl_word
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,Six
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,Six
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,Four
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,Six
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,Eight


In [117]:
# defining a custom function
def is_efficient(x):
    if x > 20:
        return "Efficient"
    else:
        return "Wasteful"

# using custom function to define new dataframe column based on condition check outcome
cars['fuel_economy'] = cars['mpg'].map(is_efficient)
cars['fuel_economy'].value_counts()

Wasteful     18
Efficient    14
Name: fuel_economy, dtype: int64

In [118]:
cars.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,cyl_word,fuel_economy
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,Six,Efficient
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,Six,Efficient
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,Four,Efficient
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,Six,Efficient
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,Eight,Wasteful


### Advanced Data Manipulation with `.apply()`
The `.apply()` method is very similar to `.map()`, except more advanced. You can apply a function along any axis of a `DataFrame`. `.apply()` is our "Swiss army knife" for data manipulation - if something can't be solved with ordinary means, it might be time for a `.apply()`.

- We can do the same operation done above replacing `.map()` by `.apply` as below:
```python
cars['fuel_economy'] = cars['mpg'].apply(is_efficient)
```

In [119]:
# defining a Series with integers and strings
sizes = pd.Series([8, 4, 5, 'L', 2, 12, 16, 8, 'XL'])

In [120]:
# defining custom function to convert int-->float and strings-->missing (nan)
def to_num(x):
    try:
        out = float(x)
    # catch only conversion failures instead of every possible exception
    except (TypeError, ValueError):
        out = np.nan
    return out

In [121]:
# calling apply to apply custom function on Series
sizes.apply(to_num)

0     8.0
1     4.0
2     5.0
3     NaN
4     2.0
5    12.0
6    16.0
7     8.0
8     NaN
dtype: float64
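
For this particular conversion pandas also offers a built-in route, `pd.to_numeric`, shown here as an alternative to the custom function rather than a replacement for learning `.apply()`.

```python
import pandas as pd

sizes = pd.Series([8, 4, 5, 'L', 2, 12, 16, 8, 'XL'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
numeric_sizes = pd.to_numeric(sizes, errors='coerce')
```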

In [122]:
# let's try something on cars dataframe
cars.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,cyl_word,fuel_economy
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,Six,Efficient
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,Six,Efficient
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,Four,Efficient
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,Six,Efficient
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,Eight,Wasteful


In [123]:
# a more advanced application of apply
def describe_car(row):
    efficiency = row['fuel_economy'].lower()
    cyl = row['cyl_word'].lower()
    # in this dataset (mtcars), am is 0 for automatic and 1 for manual
    auto = 'automatic' if row['am'] == 0 else 'manual'
    print(f"This {cyl} cylinder car has {auto} transmission and a(n) {efficiency} fuel economy.")

In [124]:
cars.head().apply(describe_car, axis=1)

This six cylinder car has manual transmission and a(n) efficient fuel economy.
This six cylinder car has manual transmission and a(n) efficient fuel economy.
This four cylinder car has manual transmission and a(n) efficient fuel economy.
This six cylinder car has automatic transmission and a(n) efficient fuel economy.
This eight cylinder car has automatic transmission and a(n) wasteful fuel economy.


0    None
1    None
2    None
3    None
4    None
dtype: object

BONUS QUESTION: Why are there 5 "None" values in the above output?

SOLUTION#1: `describe_car` prints its message but returns nothing, so `apply` collects a `None` for each row and the notebook displays that Series. Assigning the result to a variable stores the Nones instead of displaying them. Printing x will output the Nones.

In [125]:
x = cars.head().apply(describe_car, axis=1)

This six cylinder car has manual transmission and a(n) efficient fuel economy.
This six cylinder car has manual transmission and a(n) efficient fuel economy.
This four cylinder car has manual transmission and a(n) efficient fuel economy.
This six cylinder car has automatic transmission and a(n) efficient fuel economy.
This eight cylinder car has automatic transmission and a(n) wasteful fuel economy.


SOLUTION#2: Return some output from the function and assign it to columns in the dataframe

In [126]:
# same custom function definition, but with returns
def describe_car(row):
    efficiency = row['fuel_economy'].lower()
    cyl = row['cyl_word'].lower()
    # in this dataset (mtcars), am is 0 for automatic and 1 for manual
    auto = 'automatic' if row['am'] == 0 else 'manual'
    print(f"This {cyl} cylinder car has {auto} transmission and a(n) {efficiency} fuel economy.")
    return efficiency, cyl, auto

In [127]:
# creating new columns in dataframe based on custom function output
# note: 'cyl' already exists, so its numeric values are overwritten with the lowercase words
cars[['efficiency', 'cyl', 'auto']] = cars.apply(describe_car, axis=1, result_type='expand')
cars.head()

This six cylinder car has manual transmission and a(n) efficient fuel economy.
This six cylinder car has manual transmission and a(n) efficient fuel economy.
This four cylinder car has manual transmission and a(n) efficient fuel economy.
This six cylinder car has automatic transmission and a(n) efficient fuel economy.
This eight cylinder car has automatic transmission and a(n) wasteful fuel economy.
This six cylinder car has automatic transmission and a(n) wasteful fuel economy.
This eight cylinder car has automatic transmission and a(n) wasteful fuel economy.
This four cylinder car has automatic transmission and a(n) efficient fuel economy.
This four cylinder car has automatic transmission and a(n) efficient fuel economy.
This six cylinder car has automatic transmission and a(n) wasteful fuel economy.
This six cylinder car has automatic transmission and a(n) wasteful fuel economy.
This eight cylinder car has automatic transmission and a(n) wasteful fuel economy.
This eight cylinder car has automatic trans

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,cyl_word,fuel_economy,efficiency,auto
0,21.0,six,160.0,110,3.9,2.62,16.46,0,1,4,4,Six,Efficient,efficient,manual
1,21.0,six,160.0,110,3.9,2.875,17.02,0,1,4,4,Six,Efficient,efficient,manual
2,22.8,four,108.0,93,3.85,2.32,18.61,1,1,4,1,Four,Efficient,efficient,manual
3,21.4,six,258.0,110,3.08,3.215,19.44,1,0,3,1,Six,Efficient,efficient,automatic
4,18.7,eight,360.0,175,3.15,3.44,17.02,0,0,3,2,Eight,Wasteful,wasteful,automatic


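Rule of thumb with the `axis` parameter in apply(): on a DataFrame, `axis=1` passes each row to the function, which is what you need when the function reads several columns of the same row, so it MUST be specified in that case. The default `axis=0` passes each column instead. When applying to a single column (a Series), the function receives one value at a time and no `axis` is needed.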
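
The three cases in this rule of thumb can be checked on a small made-up frame. The names `toy`, `col_totals`, `row_totals`, and `doubled` are illustrative only.

```python
import pandas as pd

# hypothetical toy frame
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 (default): the function receives each column as a Series
col_totals = toy.apply(lambda col: col.sum())

# axis=1: the function receives each row as a Series
row_totals = toy.apply(lambda row: row['a'] + row['b'], axis=1)

# Series.apply works element by element and takes no axis argument
doubled = toy['a'].apply(lambda x: x * 2)
```

Here `col_totals` has one entry per column, while `row_totals` and `doubled` have one entry per row.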