# `pandas` - Single Table Verbs

## Contents
1. Setup
1. Rename columns
1. Modify columns
1. Sort rows
1. Sample rows
1. Filter rows

The summarization and grouping verbs are described in the `Summarization` notebook.

## Reference
- http://pandas.pydata.org/pandas-docs/stable/index.html
- https://pandas.pydata.org/pandas-docs/stable/indexing.html
- https://pandas.pydata.org/pandas-docs/stable/dsintro.html

## 1. Setup

Load libraries.

In [3]:
import pandas as pd
import numpy as np
(pd.__version__,
 np.__version__
)

('0.24.2', '1.16.4')

The most common way to create a DataFrame is to read a CSV file with the pandas `read_csv` function.
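
As a minimal sketch of `read_csv`, the cell below reads from an in-memory string instead of a file on disk. The CSV text and the name `demo_df` are made up for illustration; the `na_values` argument is the same one used later in this notebook to treat `?` as missing.

```python
import io
import pandas as pd

# Hypothetical CSV content; "?" marks a missing value.
csv_text = "make,horsepower\naudi,110\nsubaru,?\n"

# read_csv accepts a path or a file-like object; StringIO stands in for a file here.
demo_df = pd.read_csv(io.StringIO(csv_text), na_values=['?'])

print(demo_df)
print(demo_df.dtypes)  # horsepower is read as float because it contains a missing value
```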

Another common technique is to call the `DataFrame` constructor, which takes these three parameters:
1. `data`, which is a numpy array, a dictionary, a list of lists or another DataFrame
1. `index`, which is a list of the row labels
1. `columns`, which is a list of the column names

Two additional parameters, `dtype` and `copy`, are not described here. For those details and more see:
- http://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.DataFrame.html#pandas.DataFrame
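
The three kinds of `data` listed above can be sketched as follows. The row labels (`row_0`, `row_1`) and the variable names are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# 1. From a dictionary: the keys become the column names.
df_dict = pd.DataFrame(data={'col_a': [100, 101], 'col_b': [200, 201]},
                       index=['row_0', 'row_1'])

# 2. From a numpy array: row and column labels are supplied explicitly.
df_array = pd.DataFrame(data=np.array([[100, 200], [101, 201]]),
                        index=['row_0', 'row_1'],
                        columns=['col_a', 'col_b'])

# 3. From another DataFrame: `columns` selects and reorders existing columns.
df_from_df = pd.DataFrame(data=df_dict, columns=['col_b', 'col_a'])

print(df_dict)
print(df_array)
print(df_from_df)
```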

Create a sample dataframe for the demonstrations below.

In [4]:
df_col = pd.DataFrame(data=[[100, 200, 300, 400],
                            [101, 201, 301, 401],
                            [102, 202, 302, 402]], 
                      columns=['col_a', 'col_b', 'col_c', 'col_d']
                     )
df_col

Unnamed: 0,col_a,col_b,col_c,col_d
0,100,200,300,400
1,101,201,301,401
2,102,202,302,402


In [5]:
%%sh
git clone https://github.com/datalab-datasets/file-samples.git

fatal: destination path 'file-samples' already exists and is not an empty directory.


In [6]:
%ls /content/file-samples/imports-85.csv

/content/file-samples/imports-85.csv


## 2. Rename Columns
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html

Read a dataframe from a CSV file, replacing `-` with `_` in the column names and treating `?` as a missing-value indicator.

In [0]:
column_names = ['symboling', 'normalized_losses', 'make', 'fuel-type',
                'aspiration', 'num_of_doors', 'body_style', 'drive_wheels',
                'engine_location', 'wheel_base', 'length', 'width',
                'height', 'curb_weight', 'engine_type', 'num_of_cylinders',
                'engine_size', 'fuel_system', 'bore', 'stroke',
                'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg',
                'highway_mpg', 'price']
import_df = pd.read_csv('/content/file-samples/imports-85.csv',
                        names=[string.replace('-','_') for string in column_names],
                        na_values=['?']
                       )

Rename columns by passing the `rename` method a dict that maps old names to new names. The `columns` attribute holds the column names as an `Index`.

In [8]:
rename_df = import_df.rename(columns={'city_mpg'   : 'mpg_city',
                                      'highway_mpg': 'mpg_highway'
                                     })
rename_df.columns

Index(['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration',
       'num_of_doors', 'body_style', 'drive_wheels', 'engine_location',
       'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_type',
       'num_of_cylinders', 'engine_size', 'fuel_system', 'bore', 'stroke',
       'compression_ratio', 'horsepower', 'peak_rpm', 'mpg_city',
       'mpg_highway', 'price'],
      dtype='object')
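
Besides a dict, `rename` also accepts a callable that is applied to every column name. The sketch below uses a small made-up frame (`hyphen_df`) to show the same `-` to `_` replacement that was applied to the name list before reading the CSV file.

```python
import pandas as pd

# Hypothetical frame with hyphenated column names.
hyphen_df = pd.DataFrame({'fuel-type': ['gas'], 'city-mpg': [21]})

# The callable receives each column name and returns the new name.
clean_df = hyphen_df.rename(columns=lambda name: name.replace('-', '_'))

print(list(clean_df.columns))
# rename returns a new object, so the original names are untouched.
print(list(hyphen_df.columns))
```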

## 3. Modify Columns

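From the documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html

> Assigning multiple columns within the same assign is possible, but you cannot reference other columns created within the same assign call.

This restriction applies to older releases. Since pandas 0.23 (with Python 3.6 or later), keyword arguments are evaluated in order, so a callable passed later in an `assign` call can refer to a column created earlier in the same call.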

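Assign new columns to the sample dataframe `df_col` using the `assign` method. It returns a new object (a copy) with the new columns added to the original ones; `df_col` itself is unchanged. The first cell below passes the new values directly, and the second passes callables that receive the dataframe, which is convenient inside a method chain.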

In [9]:
df_col.assign(apb = df_col.col_a    + df_col.col_b, 
              ctd = df_col['col_c'] * df_col['col_d'])

Unnamed: 0,col_a,col_b,col_c,col_d,apb,ctd
0,100,200,300,400,300,120000
1,101,201,301,401,302,120701
2,102,202,302,402,304,121404


In [10]:
df_col.assign(apb = lambda df: df.col_a + df.col_b, 
              ctd = lambda df: df['col_c'] * df['col_d'])

Unnamed: 0,col_a,col_b,col_c,col_d,apb,ctd
0,100,200,300,400,300,120000
1,101,201,301,401,302,120701
2,102,202,302,402,304,121404
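
The callable form shown above also allows a later column to build on an earlier one within a single `assign` call. This is a sketch that assumes pandas 0.23 or later on Python 3.6 or later, where keyword order is preserved; the frame `df_demo` and the column `apb_half` are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({'col_a': [100, 101, 102],
                        'col_b': [200, 201, 202]})

# `apb_half` is computed after `apb`, so its callable can already see `apb`.
result = df_demo.assign(apb=lambda df: df.col_a + df.col_b,
                        apb_half=lambda df: df.apb / 2)

print(result)
# assign returns a copy, so df_demo still has only its two original columns.
print(list(df_demo.columns))
```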


## 4. Sort Rows

- http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.sort_values.html

The `sort_values` method in the following code cells sorts the rows by `horsepower` in ascending order. The `loc` indexer then selects four columns (`horsepower`, `make`, `city_mpg`, `highway_mpg`), and the `head()` method returns the first 5 rows. The two cells are identical apart from how the method chain is laid out.

In [11]:
import_df \
  .sort_values(by='horsepower',
               axis=0, 
               ascending=True
              ) \
  .loc[:,['horsepower','make','city_mpg','highway_mpg']] \
  .head()

Unnamed: 0,horsepower,make,city_mpg,highway_mpg
18,48.0,chevrolet,47,53
182,52.0,volkswagen,37,46
184,52.0,volkswagen,37,46
90,55.0,nissan,45,50
158,56.0,toyota,34,36


In [12]:
import_df.sort_values(by='horsepower',
                      axis=0, 
                      ascending=True
                     ) \
         .loc[:,['horsepower','make','city_mpg','highway_mpg']] \
         .head()

Unnamed: 0,horsepower,make,city_mpg,highway_mpg
18,48.0,chevrolet,47,53
182,52.0,volkswagen,37,46
184,52.0,volkswagen,37,46
90,55.0,nissan,45,50
158,56.0,toyota,34,36
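
`sort_values` can also sort by several columns with a separate direction for each. The sketch below uses a small made-up frame (`cars`) rather than `import_df`, and shows where missing values end up.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing horsepower value.
cars = pd.DataFrame({'make': ['audi', 'honda', 'toyota', 'volvo'],
                     'horsepower': [110.0, np.nan, 56.0, 110.0],
                     'city_mpg': [19, 27, 34, 26]})

# Sort by horsepower descending, breaking ties by city_mpg ascending.
# na_position='last' keeps rows with missing sort keys at the end.
sorted_cars = cars.sort_values(by=['horsepower', 'city_mpg'],
                               ascending=[False, True],
                               na_position='last')

print(sorted_cars)
```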


## 5. Sample Rows
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html

See also the `frac`, `replace` and `weights` parameters.

The `sample` method draws 10 random rows (since `n=10`) from the `import_df` dataframe.

In [13]:
import_df \
  .sample(n=10) \
  .loc[:,['horsepower','make','city_mpg','highway_mpg']]

Unnamed: 0,horsepower,make,city_mpg,highway_mpg
139,73.0,subaru,26,31
201,160.0,volvo,19,25
170,116.0,toyota,24,30
7,110.0,audi,19,25
39,86.0,honda,27,33
35,76.0,honda,30,34
104,160.0,nissan,19,25
107,97.0,peugot,19,24
61,84.0,mazda,26,32
203,106.0,volvo,26,27
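
The other sampling parameters can be sketched on a small made-up frame (`cars`). The `random_state` argument is added here only to make the draws repeatable; the weights are arbitrary illustrative values.

```python
import pandas as pd

cars = pd.DataFrame({'make': ['audi', 'honda', 'toyota', 'volvo'],
                     'city_mpg': [19, 27, 34, 26]})

# frac: sample a fraction of the rows instead of a fixed count.
half = cars.sample(frac=0.5, random_state=0)

# replace=True: rows may be drawn more than once, so n can exceed the row count.
bootstrap = cars.sample(n=6, replace=True, random_state=0)

# weights: larger weights make a row more likely; a zero weight excludes it.
weighted = cars.sample(n=2, weights=[0, 1, 1, 0], random_state=0)

print(half)
print(bootstrap)
print(weighted)
```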


## 6. Filter Rows
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.filter.html
- https://pythonspot.com/pandas-filter/

Filter rows by indexing with a boolean expression.

In [14]:
import_df[import_df.make=="toyota"][['make','body_style','city_mpg','highway_mpg']].head()

Unnamed: 0,make,body_style,city_mpg,highway_mpg
150,toyota,hatchback,35,39
151,toyota,hatchback,31,38
152,toyota,hatchback,31,38
153,toyota,wagon,31,37
154,toyota,wagon,27,32


In [15]:
import_df \
  .loc[lambda df: df.make == "toyota"] \
  .loc[:,['make','body_style','city_mpg','highway_mpg']] \
  .head()

Unnamed: 0,make,body_style,city_mpg,highway_mpg
150,toyota,hatchback,35,39
151,toyota,hatchback,31,38
152,toyota,hatchback,31,38
153,toyota,wagon,31,37
154,toyota,wagon,27,32


In [16]:
import_df[(import_df.make == "toyota") & (import_df.city_mpg == 35)][['make','body_style','city_mpg','highway_mpg']].head()

Unnamed: 0,make,body_style,city_mpg,highway_mpg
150,toyota,hatchback,35,39
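
Boolean masks can be combined with `&` (and), `|` (or) and `~` (not). The sketch below uses a small made-up frame (`cars`); each comparison is wrapped in parentheses because `&` and `|` bind more tightly than `==` and `>` in Python.

```python
import pandas as pd

cars = pd.DataFrame({'make': ['audi', 'toyota', 'toyota', 'volvo'],
                     'city_mpg': [19, 35, 31, 26]})

# Either condition may hold.
toyota_or_efficient = cars[(cars.make == 'toyota') | (cars.city_mpg > 25)]

# Negate a mask with ~.
not_toyota = cars[~(cars.make == 'toyota')]

# isin tests membership in a list of values.
audi_or_volvo = cars[cars.make.isin(['audi', 'volvo'])]

print(toyota_or_efficient)
print(not_toyota)
print(audi_or_volvo)
```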


Filter rows by calling the `query` method with a boolean expression. The expression is a string that refers to columns by name. The `query` method returns a new, filtered dataframe.

In [17]:
import_df.query('make=="toyota"')[['make','body_style','city_mpg','highway_mpg']].head()

Unnamed: 0,make,body_style,city_mpg,highway_mpg
150,toyota,hatchback,35,39
151,toyota,hatchback,31,38
152,toyota,hatchback,31,38
153,toyota,wagon,31,37
154,toyota,wagon,27,32


In [18]:
import_df.query('make == "toyota" | city_mpg == 35 & highway_mpg == 39')[['make','body_style','city_mpg','highway_mpg']].head()

Unnamed: 0,make,body_style,city_mpg,highway_mpg
150,toyota,hatchback,35,39
151,toyota,hatchback,31,38
152,toyota,hatchback,31,38
153,toyota,wagon,31,37
154,toyota,wagon,27,32
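
Two details of `query` are worth a short sketch on a made-up frame (`cars`): a Python variable can be referenced with the `@` prefix, and inside a query string `&` is evaluated before `|`, so explicit parentheses help make the intended grouping visible. The variable name `target_make` is illustrative.

```python
import pandas as pd

cars = pd.DataFrame({'make': ['audi', 'toyota', 'toyota', 'volvo'],
                     'city_mpg': [19, 35, 31, 26],
                     'highway_mpg': [25, 39, 38, 27]})

# Reference a local variable inside the expression with @.
target_make = 'toyota'
by_variable = cars.query('make == @target_make')

# Parentheses spell out that the & part is evaluated as one unit.
grouped = cars.query('make == "volvo" | (city_mpg == 35 & highway_mpg == 39)')

print(by_variable)
print(grouped)
```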


__The End__