# Working with the Pandas DataFrame

-----

In this notebook, we first load a DataFrame of airport data from a CSV file. When reading the file, pandas uses the comma as the default delimiter, and we explicitly indicate that the index column is 'iata', which is the airport code. Using a data-specific column as a row index can often simplify data processing. After loading this file, we demonstrate basic DataFrame functionality, including descriptive analysis, basic indexing, fancy indexing, grouping, and stacking.


## Table of Contents
[Load and Examine the DataSet](#Load-and-Examine-the-DataSet)  

[DataFrame Indexing](#DataFrame-Indexing)  
- [Slice Rows](#Slice-Rows)  
- [Slice Columns](#Slice-Columns)  
- [Slice Rows and Columns](#Slice-Rows-and-Columns)  
- [Masking](#Masking)  

[DataFrame Operations](#DataFrame-Operations)  
- [Sort DataFrame](#Sort-DataFrame)  
- [Groupby](#Groupby)  
- [Pivot Table](#Pivot-Table)  
- [Stacking](#Stacking)  


-----
[[Back to TOC]](#Table-of-Contents)


## Load and Examine the DataSet

After loading a dataset into a Pandas DataFrame, we first take a peek at the data. There are three DataFrame methods we can use:
- `head()`: Returns the first 5 rows of the DataFrame by default. You may also pass an integer to return that many rows from the top.
- `tail()`: Returns the last 5 rows of the DataFrame by default. You may also pass an integer to return that many rows from the bottom.
- `sample()`: Returns one randomly selected row by default. You may also pass an integer to return that many randomly selected rows.
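
The three methods above can be tried on a small stand-in table. This is only a sketch: the `toy` DataFrame is hypothetical data made up for illustration, not the airport file used in this notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the airport file (an assumption)
toy = pd.DataFrame(
    {
        "city": ["Bay Springs", "Livingston", "Colorado Springs",
                 "Perry", "Hilliard", "Dover", "Zuni"],
        "state": ["MS", "TX", "CO", "NY", "FL", "DE", "NM"],
    },
    index=pd.Index(["00M", "00R", "00V", "01G", "01J", "DOV", "ZUN"], name="iata"),
)

first_five = toy.head()     # default: first 5 rows
last_two = toy.tail(2)      # last 2 rows
one_random = toy.sample()   # default: 1 random row

# Passing random_state makes the random selection repeatable
three_random = toy.sample(3, random_state=0)
```

Fixing `random_state` is optional; it is useful when a notebook should show the same rows each time it is run.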

In [1]:
import pandas as pd

# Read data from CSV file, and display subset

dfa = pd.read_csv('airport-data.csv', index_col='iata')


In [2]:
# Extract the first 5 rows
dfa.head()

iata,airport,city,state,country,lat,long
00M,Thigpen,Bay Springs,MS,USA,31.953765,-89.234505
00R,Livingston Municipal,Livingston,TX,USA,30.685861,-95.017928
00V,Meadow Lake,Colorado Springs,CO,USA,38.945749,-104.569893
01G,Perry-Warsaw,Perry,NY,USA,42.741347,-78.052081
01J,Hilliard Airpark,Hilliard,FL,USA,30.688012,-81.905944


In [3]:
# Extract the last four rows
dfa.tail(4)

iata,airport,city,state,country,lat,long
ZER,Schuylkill Cty/Joe Zerbey,Pottsville,PA,USA,40.706449,-76.373147
ZPH,Zephyrhills Municipal,Zephyrhills,FL,USA,28.228065,-82.155916
ZUN,Black Rock,Zuni,NM,USA,35.083227,-108.791777
ZZV,Zanesville Municipal,Zanesville,OH,USA,39.944458,-81.892105


In [4]:
# Extract 3 random rows
dfa.sample(3)

iata,airport,city,state,country,lat,long
HLC,Hill City Municipal,Hill City,KS,USA,39.378836,-99.831494
PGV,Pitt-Greenville,Greenville,NC,USA,35.635239,-77.38532
RWL,Rawlins Muni,Rawlins,WY,USA,41.805597,-107.19994


-----
### Examine the DataFrame

We then examine the basic information of the DataFrame with the **`info()`** function. The `info()` function prints a concise summary of a DataFrame, including the number of rows, column names, column dtypes, and non-null counts. As `info()` shows, this DataFrame has 3376 rows and 6 columns. Two columns, `city` and `state`, have some missing values (the number of non-null values is smaller than the number of rows). The data set contains mixed data, both numeric and text, and thus requires different column storage: `lat` and `long` are numeric columns, while the other four columns are text data, represented by the data type `object`. This affects the default behavior of specific Pandas functions, such as the `describe` function, since by default summary statistics are calculated only for numeric data.

-----

In [5]:
dfa.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3376 entries, 00M to ZZV
Data columns (total 6 columns):
airport    3376 non-null object
city       3364 non-null object
state      3364 non-null object
country    3376 non-null object
lat        3376 non-null float64
long       3376 non-null float64
dtypes: float64(2), object(4)
memory usage: 184.6+ KB


The **`describe()`** function generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding missing values (`NaN`).

In [6]:
# Display a summary of the numerical information in the DataFrame

dfa.describe()

,lat,long
count,3376.0,3376.0
mean,40.036524,-98.621205
std,8.329559,22.869458
min,7.367222,-176.646031
25%,34.688427,-108.761121
50%,39.434449,-93.599425
75%,43.372612,-84.137519
max,71.285448,145.621384
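
To see how mixed column types affect `describe`, the following sketch uses a hypothetical four-row table (made-up values, not the airport data). It assumes only standard pandas calls: `describe()` with its default behavior, `describe(include="all")`, and `isna()`.

```python
import pandas as pd

# Hypothetical toy data with one missing city (an assumption for illustration)
toy = pd.DataFrame({
    "city": ["Dover", "Dover", None, "Zuni"],
    "state": ["DE", "DE", "AK", "NM"],
    "lat": [39.2, 39.1, 51.9, 35.1],
})

# Default: only numeric columns are summarized
numeric_summary = toy.describe()

# include="all" also summarizes text columns (count, unique, top, freq)
all_summary = toy.describe(include="all")

# Count missing values per column, complementing what info() reports
missing_per_column = toy.isna().sum()
```

In `all_summary`, statistics that do not apply to a column (for example, `mean` for `city`) are shown as `NaN`.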


-----

## DataFrame Indexing

Since this new `DataFrame` was created with a labelled row index, we can use row labels to slice rows from the `DataFrame`. The following code cells demonstrate basic slicing and indexing of this `DataFrame` by using both explicit indices (row and column labels) and implicit indices (integer row and column positions).


---
### Slice Rows
- With explicit row index - loc
- With implicit row index - iloc

In [7]:
# Slice rows by using the indicated label from the index column

dfa.loc[['11J', '11R', '12C']]

iata,airport,city,state,country,lat,long
11J,Early County,Blakely,GA,USA,31.396986,-84.895257
11R,Brenham Municipal,Brenham,TX,USA,30.219,-96.374278
12C,Rochelle Municipal,Rochelle,IL,USA,41.893001,-89.07829


We need `loc` to slice rows with the explicit row index because the syntax `dfa[['11J', '11R', '12C']]` is interpreted as selecting 3 columns from `dfa`. When we slice with the implicit row index, we use `iloc`, as shown below.

The standard format for `loc` is `df.loc[rows, columns]`. In the above example, we only specify the list of rows to slice. When the columns part is missing, all columns are selected, so `dfa.loc[['11J', '11R', '12C']]` is equivalent to `dfa.loc[['11J', '11R', '12C'], :]`. The same logic applies to `iloc`, which uses the implicit index, as shown below.

In [8]:
# Slice rows by using the row implicit index

dfa.iloc[[99, 100, 101]]

iata,airport,city,state,country,lat,long
11J,Early County,Blakely,GA,USA,31.396986,-84.895257
11R,Brenham Municipal,Brenham,TX,USA,30.219,-96.374278
12C,Rochelle Municipal,Rochelle,IL,USA,41.893001,-89.07829
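
One difference between `loc` and `iloc` that is easy to miss appears when slicing with a range: a label slice includes its end label, while a position slice excludes its end position. The sketch below illustrates this on a hypothetical four-row table (the values are copied from the outputs above purely for illustration).

```python
import pandas as pd

# Hypothetical toy data (an assumption, not the full airport file)
toy = pd.DataFrame(
    {
        "city": ["Blakely", "Brenham", "Rochelle", "Tower"],
        "state": ["GA", "TX", "IL", "MN"],
    },
    index=pd.Index(["11J", "11R", "12C", "12D"], name="iata"),
)

by_label = toy.loc["11J":"12C"]   # label slice: the end label '12C' is included
by_position = toy.iloc[0:3]       # position slice: position 3 is excluded
same_rows = by_label.equals(by_position)

# Translate a row label into its integer position
position_of_12c = toy.index.get_loc("12C")
```

`Index.get_loc` is handy when you know a label but need the matching position for `iloc`.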


---
### Slice Columns
- Select one column as Pandas Series
- Select multiple columns as Pandas DataFrame

In [9]:
# Select one column as Series
city = dfa['city']
city.head()

iata
00M         Bay Springs
00R          Livingston
00V    Colorado Springs
01G               Perry
01J            Hilliard
Name: city, dtype: object

In [10]:
# Select multiple columns as DataFrame
airport_city = dfa[['airport', 'city']]
airport_city.head()

iata,airport,city
00M,Thigpen,Bay Springs
00R,Livingston Municipal,Livingston
00V,Meadow Lake,Colorado Springs
01G,Perry-Warsaw,Perry
01J,Hilliard Airpark,Hilliard


---
### Slice Rows and Columns
We can select specific ranges of our data in both the row and column directions using either the explicit or the implicit (integer-based) index.

- loc: indexing via explicit index
- iloc: indexing via implicit index

In [11]:
# Slice rows and columns by using explicit row and column labels

dfa.loc[['11R', '12C', '12D'], ['airport', 'city', 'state', 'country']]

iata,airport,city,state,country
11R,Brenham Municipal,Brenham,TX,USA
12C,Rochelle Municipal,Rochelle,IL,USA
12D,Tower Municipal,Tower,MN,USA


In [12]:
# Slice rows and columns by using implicit row and column indices

dfa.iloc[100:103, 0:4]

iata,airport,city,state,country
11R,Brenham Municipal,Brenham,TX,USA
12C,Rochelle Municipal,Rochelle,IL,USA
12D,Tower Municipal,Tower,MN,USA


-----

### Masking

Pandas supports selecting rows based on column values, which is known as _masking_. This is performed by specifying tests on columns that result in `True` or `False`, and only the rows where the result is `True` are returned. Thus, a row mask is formed: masked rows are hidden and unmasked rows are selected. These tests must follow the rules of Boolean logic, but can involve multiple column comparisons that are combined into one final result.

For example, the first code cell below selects all airports in the state of Delaware by specifying the test `dfa['state'] == 'DE'`. This test selects those rows that have `DE` in the `state` column of the `dfa` `DataFrame`. A later code cell involves a more complicated expression that selects those airports whose latitude is greater than `48` and whose longitude is less than `-170`. In this case, the two expressions are enclosed in parentheses and combined with a Boolean _and_ (`&`) to generate the final test result. The last code cell demonstrates how to use a Boolean _or_ (`|`).

-----

In [13]:
dfa[dfa['state'] == 'DE']

iata,airport,city,state,country,lat,long
33N,Delaware Airpark,Dover,DE,USA,39.218376,-75.596427
DOV,Dover Air Force Base,Dover,DE,USA,39.130113,-75.46631
EVY,Summit Airpark,Middletown,DE,USA,39.520389,-75.720444
GED,Sussex Cty Arpt,Georgetown,DE,USA,38.689194,-75.358889
ILG,New Castle County,Wilmington,DE,USA,39.678722,-75.606528


If a column name is a valid Python identifier (for example, a single word without white space) and does not collide with an existing DataFrame attribute or method, we can also refer to the column as an attribute of the DataFrame, i.e., `dfa.state`. This is more convenient, so one-word column names are recommended. The code below does the same thing as the cell above.

In [14]:
dfa[dfa.state == 'DE']

iata,airport,city,state,country,lat,long
33N,Delaware Airpark,Dover,DE,USA,39.218376,-75.596427
DOV,Dover Air Force Base,Dover,DE,USA,39.130113,-75.46631
EVY,Summit Airpark,Middletown,DE,USA,39.520389,-75.720444
GED,Sussex Cty Arpt,Georgetown,DE,USA,38.689194,-75.358889
ILG,New Castle County,Wilmington,DE,USA,39.678722,-75.606528
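
Attribute access has a pitfall worth a quick illustration: when a column shares its name with a built-in DataFrame attribute, the attribute wins. The sketch below uses a hypothetical three-row table whose `size` column collides with the `DataFrame.size` property, and whose `total bill` column contains a space.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration)
toy = pd.DataFrame({
    "tip": [1.0, 2.5, 3.0],
    "size": [2, 3, 2],
    "total bill": [10.0, 20.0, 15.0],
})

tip_by_attribute = toy.tip     # works: returns the 'tip' column as a Series

# 'size' is a built-in DataFrame property (rows * columns),
# so this is the integer 9, not the 'size' column
size_attribute = toy.size

# Bracket notation always returns the column
size_column = toy["size"]
bill_column = toy["total bill"]   # names with spaces need brackets
```

When in doubt, bracket notation is the safer choice because it never depends on the column name.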


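When combining multiple conditions in a DataFrame mask, we use `&` for _and_ and `|` for _or_. Each individual condition must be enclosed in parentheses, because `&` and `|` bind more tightly than comparison operators such as `>` and `==`.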

In [15]:
# Find airports with latitude > 48 and longitude < -170
dfa[(dfa.lat > 48) & (dfa.long < -170)]

iata,airport,city,state,country,lat,long
ADK,Adak,Adak,AK,USA,51.877964,-176.646031
AKA,Atka,Atka,AK,USA,52.220348,-174.20635
GAM,Gambell,Gambell,AK,USA,63.766766,-171.732824
SNP,St. Paul,St. Paul,AK,USA,57.167333,-170.220444
SVA,Savoonga,Savoonga,AK,USA,63.686394,-170.492636


In [16]:
# Find airports in DC or AS
dfa[(dfa.state == 'DC') | (dfa.state == 'AS')]

iata,airport,city,state,country,lat,long
09W,South Capitol Street,Washington,DC,USA,38.868723,-77.007476
FAQ,Fitiuta,Fitiuta Village,AS,USA,14.215776,-169.423906
PPG,Pago Pago International,Pago Pago,AS,USA,14.331023,-170.710526
Z08,Ofu,Ofu Village,AS,USA,14.184351,-169.670024
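
The masking patterns above can be condensed into a short sketch. It uses a hypothetical five-row table with rounded, made-up coordinates, and additionally shows two standard pandas idioms not used in the cells above: `Series.isin` as a compact alternative to chaining several `|` tests, and `~` to negate a mask.

```python
import pandas as pd

# Hypothetical toy data with rounded coordinates (an assumption)
toy = pd.DataFrame(
    {
        "state": ["AK", "AK", "DC", "AS", "DE"],
        "lat": [51.9, 63.8, 38.9, 14.3, 39.2],
        "long": [-176.6, -171.7, -77.0, -170.7, -75.6],
    },
    index=pd.Index(["ADK", "GAM", "09W", "PPG", "DOV"], name="iata"),
)

# A mask is just a Boolean Series aligned with the rows
mask = (toy["lat"] > 48) & (toy["long"] < -170)
far_northwest = toy[mask]

# isin replaces (state == 'DC') | (state == 'AS')
dc_or_as = toy[toy["state"].isin(["DC", "AS"])]

# ~ inverts a mask: every row that is NOT in Alaska
not_alaska = toy[~(toy["state"] == "AK")]
```

Storing the mask in a variable first, as with `mask` here, can make long conditions easier to read and reuse.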


-----
[[Back to TOC]](#Table-of-Contents)


## DataFrame Operations




### Sort DataFrame

We can sort a DataFrame by index or values.

1. `sort_index`: sorts the `DataFrame` based on the values in the index
2. `sort_values`: sorts the `DataFrame` by the column specified in the `by` parameter

Note that the sort functions return a new `DataFrame`; to sort a `DataFrame` in place you must set the `inplace` parameter to `True`. In addition, the sort functions take an `ascending` parameter that specifies whether the sort should be in ascending or descending order.

-----

In [17]:
dfa.sort_index().head()

iata,airport,city,state,country,lat,long
00M,Thigpen,Bay Springs,MS,USA,31.953765,-89.234505
00R,Livingston Municipal,Livingston,TX,USA,30.685861,-95.017928
00V,Meadow Lake,Colorado Springs,CO,USA,38.945749,-104.569893
01G,Perry-Warsaw,Perry,NY,USA,42.741347,-78.052081
01J,Hilliard Airpark,Hilliard,FL,USA,30.688012,-81.905944


In [18]:
dfa.sort_values(by='lat', ascending=False).head()

iata,airport,city,state,country,lat,long
BRW,Wiley Post Will Rogers Memorial,Barrow,AK,USA,71.285448,-156.766002
AWI,Wainwright,Wainwright,AK,USA,70.638,-159.99475
ATK,Atqasuk,Atqasuk,AK,USA,70.467276,-157.435736
AQT,Nuiqsut,Nuiqsut,AK,USA,70.209953,-151.005561
SCC,Deadhorse,Deadhorse,AK,USA,70.194756,-148.465161


-----

We can sort a DataFrame on multiple columns by passing the columns as a list, with the order of columns in the list indicating the sort priority. For example, the following code cell first sorts the airports by state in ascending order, and then by city in descending order within each state.

-----

In [19]:
dfa.sort_values(by=['state', 'city'], ascending=[True, False]).head()

iata,airport,city,state,country,lat,long
2Y3,Yakutat SPB,Yakutat,AK,USA,59.562477,-139.741099
YAK,Yakutat,Yakutat,AK,USA,59.503361,-139.660226
68A,Wrangell SPB,Wrangell,AK,USA,56.466325,-132.380018
WRG,Wrangell,Wrangell,AK,USA,56.484326,-132.369824
WSM,Wiseman,Wiseman,AK,USA,67.404573,-150.122742
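
The difference between returning a sorted copy and sorting in place can be checked with a small sketch. The four-row `toy` table is hypothetical data chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration)
toy = pd.DataFrame(
    {
        "state": ["TX", "AK", "AK", "DE"],
        "city": ["Brenham", "Atka", "Yakutat", "Dover"],
    },
    index=pd.Index(["11R", "AKA", "YAK", "DOV"], name="iata"),
)

# Returns a new, sorted DataFrame; 'toy' itself keeps its original order
sorted_copy = toy.sort_values(by=["state", "city"], ascending=[True, False])

# With inplace=True the DataFrame is modified and the call returns None
in_place = toy.copy()
result = in_place.sort_index(inplace=True)
```

Because an in-place sort returns `None`, writing `df = df.sort_index(inplace=True)` would replace the DataFrame with `None`; use either assignment or `inplace=True`, not both.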


-----

<font color='red' size = '5'> Student Exercise </font>

In the empty **code** cell below, first extract all airports in the state of California. Second, apply a mask to select only those rows with a latitude between `38` and `40`. Finally, compute and display the average and standard deviation of the longitude for these masked rows.

-----

---
### Groupby

Data comes in a number of different types that determine what kinds of operations can be applied to it. The most basic distinction is that between continuous and categorical data.

- **Categorical** data contain a finite number of categories or distinct groups. Categorical data might not have a logical order. Examples include gender, material type, and payment method.
- **Continuous** data are numeric variables that have an infinite number of values between any two values. A continuous variable can be numeric or date/time. Examples include the length of a part or the date and time a payment is received.

Some of the most important uses of a Pandas `DataFrame` involve grouping related data by categorical features and operating on the _grouped_ subsets independently. For example, data may be grouped by a categorical variable, such as state or county, and sales totals accumulated for each grouped region. To demonstrate this functionality, we turn to a second data set of restaurant data that is provided along with the _seaborn_ Python module. The seaborn module is presented in a separate lesson and provides support for advanced visualizations. Right now, however, we simply want to process this data, so we load the data and display several randomly selected rows.



In [20]:
# Load the 'tips' data set into a DataFrame

import seaborn as sns

dft = pd.read_csv('tips.csv')
dft.sample(5)

,total_bill,tip,sex,smoker,day,time,size
184,40.55,3.0,Male,Yes,Sun,Dinner,2
32,15.06,3.0,Female,No,Sat,Dinner,2
120,11.69,2.31,Male,No,Thur,Lunch,2
69,15.01,2.09,Male,Yes,Sat,Dinner,2
234,15.53,3.0,Male,Yes,Sat,Dinner,2


-----

To aggregate rows together, we employ the `groupby` method to create groups of rows that should be aggregated into a subset. The column (or columns) used to separate the rows are specified as a parameter to the `groupby` method, as shown in the following code cell, where the _tips_ data set is grouped on the `time` column into a new `DataFrameGroupBy` object called `dg`. This new `dg` object can be operated on much like a normal `DataFrame`, with the exception that it contains subsets that are treated independently. This is shown in the next two code cells, where the `head` and `tail` functions are used to show the first and last few rows of each group. Notice how the same number of rows is shown for each _grouped_ data set.

-----

In [21]:
# Group the DataFrame by the time column
dg = dft.groupby('time')
type(dg)

pandas.core.groupby.generic.DataFrameGroupBy

In [22]:
# Display first two rows from each group
dg.head(2)

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
77,27.2,4.0,Male,No,Thur,Lunch,4
78,22.76,3.0,Male,No,Thur,Lunch,2


In [23]:
# Display last three rows from each group
dg.tail(3)

,total_bill,tip,sex,smoker,day,time,size
224,13.42,1.58,Male,Yes,Fri,Lunch,2
225,16.27,2.5,Female,Yes,Fri,Lunch,2
226,10.09,2.0,Female,Yes,Fri,Lunch,2
241,22.67,2.0,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2
243,18.78,3.0,Female,No,Thur,Dinner,2
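
To make the idea of independent subsets more concrete, the sketch below inspects a grouped object directly. It uses a hypothetical five-row table (not the full _tips_ data) and three standard pandas features: `size()` to count rows per group, iteration over the groups, and `get_group` to pull out one subset.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration)
toy = pd.DataFrame({
    "time": ["Dinner", "Lunch", "Dinner", "Lunch", "Dinner"],
    "total_bill": [16.99, 27.20, 10.34, 22.76, 21.01],
})

grouped = toy.groupby("time")

# Number of rows in each group
group_sizes = grouped.size()

# Iterating yields one (group name, sub-DataFrame) pair per group
for name, subset in grouped:
    print(name, len(subset))

# Extract a single group as an ordinary DataFrame
dinner_rows = grouped.get_group("Dinner")
```

Each `subset` keeps the original row labels, which is why `head` and `tail` on a grouped object show the original index values.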


-----

#### Operating on Groups

The `DataFrame` groups can be operated on by using techniques similar to those for a normal `DataFrame`. For example, statistical quantities such as the median or standard deviation can be computed for each group, as shown in the next few code cells. Multiple functions can be computed at once by using the `aggregate` method, which takes a list of the statistical functions to apply to each group.

-----

In [24]:
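# Compute the median for each numeric column in each group
# (recent pandas versions need numeric_only=True to skip text columns)
dg.median(numeric_only=True)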

time,total_bill,tip,size
Dinner,18.39,3.0,2
Lunch,15.965,2.25,2


In [25]:
# Compute the standard deviation for each numeric column in each group
# (recent pandas versions need numeric_only=True to skip text columns)
dg.std(numeric_only=True)

time,total_bill,tip,size
Dinner,9.142029,1.436243,0.910241
Lunch,7.713882,1.205345,1.040024


You can also apply different aggregate functions to different columns. In the cell below, we show the total sales for Lunch and Dinner, the average tip for Lunch and Dinner, and the count of meals for Lunch and Dinner. The aggregate function `count` counts the non-null values of a column in each group, so applying it to any column without missing values gives the number of rows in the group. Here we count on the column `sex`.

In [26]:
dg.agg({'total_bill':'sum', 'tip':'mean', 'sex':'count'})

time,total_bill,tip,sex
Dinner,3660.3,3.10267,176
Lunch,1167.47,2.728088,68


The column names of the above output are misleading, since the output reuses the original column names directly. We can change them to more meaningful names with the `rename` function, as shown below.

In [27]:
dg.agg({'total_bill':'sum', 'tip':'mean', 'sex':'count'}).rename(columns={'total_bill':'total_sales', 'tip':'average_tip', 'sex':'meal_count'})

time,total_sales,average_tip,meal_count
Dinner,3660.3,3.10267,176
Lunch,1167.47,2.728088,68


To combine everything, we have the following code. Notice that the `\` at the end of the first line is Python's line-continuation character, which indicates that the statement continues on the next line. This avoids a long line that requires scrolling to see.

In [28]:
dft.groupby('time').agg({'total_bill':'sum', 'tip':'mean', 'sex':'count'})\
.rename(columns={'total_bill':'total_sales', 'tip':'average_tip', 'sex':'meal_count'})

time,total_sales,average_tip,meal_count
Dinner,3660.3,3.10267,176
Lunch,1167.47,2.728088,68


The result of the above cell is a DataFrame. Notice that the grouped-by column `time` is the index, whose name is displayed one level below the column names. Sometimes you may want the grouped-by column as a regular column; to get this, set `as_index=False` as shown below.

In [29]:
dft.groupby('time', as_index=False).agg({'total_bill':'sum', 'tip':'mean', 'sex':'count'})\
.rename(columns={'total_bill':'total_sales', 'tip':'average_tip', 'sex':'meal_count'})

,time,total_sales,average_tip,meal_count
0,Dinner,3660.3,3.10267,176
1,Lunch,1167.47,2.728088,68
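
As an alternative to aggregating and then renaming, pandas (version 0.25 and later) also supports _named aggregation_, where each output column name is given as a keyword and paired with a `(column, function)` tuple. The sketch below is a swapped-in variant of the cells above, shown on a hypothetical four-row table rather than the real _tips_ data.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration)
toy = pd.DataFrame({
    "time": ["Dinner", "Lunch", "Dinner", "Lunch"],
    "total_bill": [20.0, 10.0, 30.0, 14.0],
    "tip": [3.0, 2.0, 5.0, 1.0],
})

# Named aggregation: output_name=(input_column, function)
summary = toy.groupby("time", as_index=False).agg(
    total_sales=("total_bill", "sum"),
    average_tip=("tip", "mean"),
    meal_count=("tip", "count"),
)
```

This produces the descriptive column names in one step, so no separate `rename` call is needed.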


---
### Pivot Table

Pivot tables are one of Excel's most powerful features. A pivot table allows you to extract the significance from a large, detailed data set. With pandas, you can create spreadsheet-style pivot tables as DataFrames.

In the following cell, we will examine the average spending on lunch and dinner by men and women.

In [30]:
ptable = pd.pivot_table(dft, values='total_bill', index=['time'], columns=['sex'], aggfunc='mean')
ptable

sex,Female,Male
time,,
Dinner,19.213077,21.461452
Lunch,16.339143,18.048485
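
A pivot table of this kind is closely related to grouping: grouping on both columns, aggregating, and then moving one grouping level into the columns gives the same layout. The sketch below compares the two routes on a hypothetical five-row table (made-up values, not the _tips_ data).

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration)
toy = pd.DataFrame({
    "time": ["Dinner", "Dinner", "Lunch", "Lunch", "Dinner"],
    "sex": ["Female", "Male", "Female", "Male", "Male"],
    "total_bill": [18.0, 22.0, 16.0, 19.0, 20.0],
})

# Route 1: pivot_table with mean as the aggregate function
pivoted = pd.pivot_table(toy, values="total_bill", index="time",
                         columns="sex", aggfunc="mean")

# Route 2: group on both columns, take the mean, then
# move the 'sex' level from the row index into the columns
grouped = toy.groupby(["time", "sex"])["total_bill"].mean().unstack("sex")
```

Both results have `time` as the row index and one column per value of `sex`.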


-----

<font color='red' size = '5'> Student Exercise </font>

In the empty **code** cell below, first group the `dft` `DataFrame` by the `sex` column. Next, compute and display the minimum, maximum, and median values for the new grouped `DataFrame` by using the `aggregate` function correctly. 

-----

-----

### Stacking

Given two or more `DataFrame` objects, a common task is joining them together. Pandas supports joins across two `DataFrame` objects. But for two `DataFrame` objects that have the same structure, the process can be simplified by employing either horizontal stacking (where columns are combined) or vertical stacking (where rows are combined). These operations both use the Pandas `concat` function, which by default assumes `axis=0`, which implies vertical stacking. Specifying `axis=1` implies horizontal stacking, where columns from each subsequent `DataFrame` are added to the previous columns. Note that this operation generates a new `DataFrame` with the concatenated data. 

Both of these operations are demonstrated in the following code cells. First, we split the original _tips_ data into two `DataFrame` objects, `tr1` and `tr2`, based on the implicit row index (notice how the end of the first new `DataFrame` aligns with the start of the second new `DataFrame` via the index). Next, the `concat` method is called to vertically stack (or combine) these two objects into a new `DataFrame`. Afterwards, the first few rows are displayed with the `head` method, and the complete summary statistics are shown for the new `DataFrame` to facilitate comparison with the original data (and with the result of the horizontal stacking example).

-----

In [31]:
# Chop the 'tips' DataFrame into two sets based on rows
tr1 = dft.iloc[:200]
tr2 = dft.iloc[200:]

In [32]:
# End of first new DataFrame
tr1.tail(2)

,total_bill,tip,sex,smoker,day,time,size
198,13.0,2.0,Female,Yes,Thur,Lunch,2
199,13.51,2.0,Male,Yes,Thur,Lunch,2


In [33]:
# Start of second new DataFrame 
tr2.head(2)

,total_bill,tip,sex,smoker,day,time,size
200,18.71,4.0,Male,Yes,Thur,Lunch,3
201,12.74,2.01,Female,Yes,Thur,Lunch,2


In [34]:
# Vertical stacking
tc = pd.concat([tr1, tr2])

In [35]:
# Display the first few rows of the stacked data
tc.head(4)

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2


In [36]:
# Compute and display the summary statistics of the stacked data
tc.describe()

,total_bill,tip,size
count,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672
std,8.902412,1.383638,0.9511
min,3.07,1.0,1.0
25%,13.3475,2.0,2.0
50%,17.795,2.9,2.0
75%,24.1275,3.5625,3.0
max,50.81,10.0,6.0


------

The second example splits the _tips_ data into two `DataFrame` objects, `tc1` and `tc2`, based on the implicit column index. Next, the `concat` method is called with `axis=1` to horizontally stack (or combine) these two objects into a new `DataFrame` (notice how the columns in the two new `DataFrame` objects align for the same row index). Afterwards, the same two functions are called to show that the new `DataFrame` is the same as the original (and the vertically stacked) `DataFrame`.

------

In [37]:
# Chop the 'tips' DataFrame into two sets based on columns

tc1 = dft.iloc[:,:2]
tc2 = dft.iloc[:,2:]

In [38]:
# Display columns in first new DataFrame
tc1.head(4)

,total_bill,tip
0,16.99,1.01
1,10.34,1.66
2,21.01,3.5
3,23.68,3.31


In [39]:
# Display columns in second new DataFrame
tc2.head(4)

,sex,smoker,day,time,size
0,Female,No,Sun,Dinner,2
1,Male,No,Sun,Dinner,3
2,Male,No,Sun,Dinner,3
3,Male,No,Sun,Dinner,2


In [40]:
# Horizontal stacking
tm = pd.concat([tc1, tc2], axis=1)

In [41]:
# Display the first few rows of the stacked data
tm.head(4)

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2


In [42]:
# Compute and display the summary statistics of the stacked data
tm.describe()

,total_bill,tip,size
count,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672
std,8.902412,1.383638,0.9511
min,3.07,1.0,1.0
25%,13.3475,2.0,2.0
50%,17.795,2.9,2.0
75%,24.1275,3.5625,3.0
max,50.81,10.0,6.0


-----

<font color='red' size = '5'> Student Exercise </font>

In the empty **code** cell below, first split the tips `DataFrame` into three new `DataFrame` objects with roughly equal numbers of rows. Second, vertically stack only two of these three `DataFrame` objects into a new `DataFrame`.

-----

---
## Useful Tips

### Avoid printing out the whole DataFrame
First, take a look at the following code. Do you see any problem with it?
```
dfa = pd.read_csv('http://stat-computing.org/dataexpo/2009/airports.csv', delimiter=',', index_col='iata')
dfa
```

The code itself has no problem: it loads a dataset from a CSV file on a website. But since the line `dfa` is the last line in the code cell, the cell's output is the **whole** DataFrame. The output of code cells is saved as part of the notebook file. If the dataset is really large, printing out the whole DataFrame will dramatically increase the size of the notebook file. It not only takes a lot of disk space, it can also slow down your computer or even crash the Jupyter server. We should always avoid printing out the whole DataFrame. When you want to take a peek at the DataFrame, call `dfa.head()`, `dfa.tail()` or `dfa.sample()`.
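
A few lightweight ways to inspect a large table without displaying all of it are sketched below. The `big` DataFrame is hypothetical data generated for illustration; `shape` and `head` were introduced above, while `pd.option_context` with the `display.max_rows` option is a standard pandas setting that temporarily limits how many rows a display shows.

```python
import pandas as pd

# Hypothetical large table (an assumption for illustration)
big = pd.DataFrame({"value": range(10_000)})

# The shape tells you how big the table is without printing any rows
n_rows, n_cols = big.shape

# A small preview is usually enough
preview = big.head(3)

# Temporarily cap the number of rows shown when a DataFrame is displayed
with pd.option_context("display.max_rows", 6):
    text = repr(big)   # truncated text form, with a row-count footer
```

The option set inside the `with` block is restored to its previous value when the block ends.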


-----

## Ancillary Information

The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.

1. [Pandas documentation][pdd]
2. A complete Pandas [tutorial][pdt]
3. The [Pandas chapter][pdc] from the book _Python Data Science Handbook_ by Jake VanderPlas

-----

[pdd]: http://pandas.pydata.org/pandas-docs/stable/index.html
[pdt]: https://github.com/TomAugspurger/effective-pandas
[pdc]: http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.00-Introduction-to-Pandas.ipynb

**&copy; 2019: Gies College of Business at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode