# Managing data with Pandas.

## (Part 2)

One of the most common tasks in data analysis is extracting subsets of interest from a dataset. Although the Pandas syntax for selecting data is fairly complex, it offers powerful ways to select particular rows and/or columns from a DataFrame, or values from a Series.

The types of subset selection in Pandas include _selecting specific columns_, _selecting specific rows_, and _selecting rows and columns simultaneously_. Each can be done by label or by integer location. Using labels means explicitly naming the rows or columns we are interested in. Using integer location means identifying the desired row or column with a number between 0 and `n-1`, where n is the total number of rows/columns in the object. The official pandas documentation refers to integer location as position; however, integer location is more explicit.

__Indexing__ is another, more technical, term that refers to __*subset selection*__. While the first is more widely used in documentation, the second is more descriptive of what is actually happening. To select subsets we will use three indexers, `[ ]`, _loc_, and _iloc_, each of which follows different rules.
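
As a quick illustration of label versus integer location, the sketch below selects the same cell both ways. It assumes a hand-built copy of a few rows and columns of the sample data used later in this chapter, so it runs without the CSV file; the variable names are illustrative only.

```python
import pandas as pd

# Assumption: part of the sample data, rebuilt by hand instead of read from the CSV
sample = pd.DataFrame(
    {
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese"],
        "age": [30, 2, 12, 4, 32],
        "height": [165, 70, 120, 80, 180],
    },
    index=pd.Index(["Jane", "Niko", "Aaron", "Penelope", "Dean"], name="name"),
)

# Selection by label: row 'Niko', column 'height'
by_label = sample.loc["Niko", "height"]

# Selection by integer location: second row (1), third column (2)
by_position = sample.iloc[1, 2]

# The last row sits at integer location n - 1
n = len(sample)
last_row = sample.iloc[n - 1]

print(by_label, by_position, last_row.name)
```

Both selections point at the same cell; only the way the row and column are identified differs.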

In this chapter we will be exploring how to handle and prepare data for later visualization through querying, aggregating, filtering and cleaning operations. 

### Content
In the second part of this introduction to pandas we are covering the following content:

- Querying data.
- Aggregating Methods.
- Filtering the data.
- Data Cleaning. 

In [1]:
import pandas as pd

## 2.1. Querying the data (Extracting subsets)

__Terminology:__ When brackets are placed directly after a DataFrame/Series name, the term _just the brackets_ will be used to differentiate them from the brackets after _loc_ and _iloc_. All three indexers work under different rules; the details are given briefly below.

`NOTE:` Be careful to use brackets and not parentheses. In Python, brackets are the universal operator for selecting subsets of data regardless of the type of object, so avoid the mistake of appending parentheses to _loc_ or _iloc_. These are NOT methods, even though they are accessed in the same manner, through dot notation.
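
The difference can be checked with a minimal sketch. It assumes a hand-built copy of a few rows of the sample data loaded in the next cell, and it deliberately does not assume which exception pandas raises when parentheses are used; it only shows that the call does not perform the selection.

```python
import pandas as pd

# Assumption: a few rows of the sample data, rebuilt by hand
sample = pd.DataFrame(
    {"age": [30, 2, 12], "height": [165, 70, 120]},
    index=pd.Index(["Jane", "Niko", "Aaron"], name="name"),
)

# Correct: brackets after the indexer
age_of_jane = sample.loc["Jane", "age"]
print(age_of_jane)

# Incorrect: parentheses after the indexer do not perform the selection
parentheses_failed = False
try:
    sample.loc("Jane", "age")
except Exception as err:  # the exact exception type is not assumed here
    parentheses_failed = True
    print(type(err).__name__)
```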

In [2]:
JustBrakets = pd.read_csv(r"C:\Users\jober\OneDrive\Desktop\Data Science\Data Science - Study notes\Data_used\sample_data.csv", index_col='name')
JustBrakets.head()

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8


### Using just the brackets __[ ]__

This indexer uses the square brackets `[ ]` as its operator. Data can be extracted from a DataFrame through the name of a column (predictor) or through an auxiliary variable. This auxiliary variable can hold a single column name or a list of names.

_Just the brackets_ cannot select a single cell, and it does not treat numbers as integer locations for picking columns. Slice notation with " : " does not select columns either: pandas applies a slice placed inside just the brackets to the rows, so `.loc[ ]` and `.iloc[ ]` are the clearer choice whenever rows are involved.

In [3]:
# A single column can be selected from a DataFrame as:
JustBrakets['food']
# Selecting only one column returns a Series object.

name
Jane          Steak
Niko           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: food, dtype: object

In [4]:
# If a list (an inner pair of square brackets) is passed to the indexer, a DataFrame is returned.
JustBrakets[['food']]

name,food
Jane,Steak
Niko,Lamb
Aaron,Mango
Penelope,Apple
Dean,Cheese
Christina,Melon
Cornelia,Beans


In [5]:
# We can get a DataFrame by indicating several columns as
JustBrakets[['height','food']]
# It's important to notice the double square brackets: the inner pair is a list placed inside the operator.
# The order of the elements in the list matters, since it sets the column order of the resulting DataFrame. The same holds when the list is stored in an auxiliary variable.

name,height,food
Jane,165,Steak
Niko,70,Lamb
Aaron,120,Mango
Penelope,80,Apple
Dean,180,Cheese
Christina,172,Melon
Cornelia,150,Beans


In [6]:
# The same kind of selection can be made through an auxiliary variable holding the list of columns
columns = ['food','height']
JustBrakets[columns]

name,food,height
Jane,Steak,165
Niko,Lamb,70
Aaron,Mango,120
Penelope,Apple,80
Dean,Cheese,180
Christina,Melon,172
Cornelia,Beans,150
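
The return types described above can be verified directly. The following is a minimal sketch that assumes a hand-built copy of the first five rows of the sample data (three columns only); `weight` is a hypothetical column name that does not exist in the data and is used only to show what happens with an unknown label.

```python
import pandas as pd

# Assumption: first five rows of the sample data, rebuilt by hand
sample = pd.DataFrame(
    {
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese"],
        "age": [30, 2, 12, 4, 32],
        "height": [165, 70, 120, 80, 180],
    },
    index=pd.Index(["Jane", "Niko", "Aaron", "Penelope", "Dean"], name="name"),
)

one_column = sample["food"]                # a string returns a Series
one_column_df = sample[["food"]]           # a one-element list returns a DataFrame
two_columns = sample[["height", "food"]]   # the list order sets the column order

print(type(one_column).__name__)
print(type(one_column_df).__name__)
print(list(two_columns.columns))

# A label that is not a column name raises a KeyError
missing_column = False
try:
    sample["weight"]  # hypothetical, non-existent column
except KeyError:
    missing_column = True
print(missing_column)
```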


### Using __.loc[ ]__

Data can be extracted from a DataFrame through the `.loc[ ]` indexer too. It builds on the _just the brackets_ operator, and can select rows and columns simultaneously.

> The formal structure of this indexer is __*.loc[rows,columns]*__

In [7]:
loc_sample = pd.read_csv(r"C:\Users\jober\OneDrive\Desktop\Data Science\Data Science - Study notes\Data_used\sample_data.csv", index_col='name')
loc_sample.head()

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8


The _just the brackets_ operator cannot take a row label or a list of row labels, whereas _.loc[ ]_ and _.iloc[ ]_ are designed to select rows. `.loc[ ]` primarily selects subsets by the labels of the rows and columns. This is done by separating the row and column selections with a _comma_.

In [8]:
# We can get a DataFrame by indicating several rows and columns as
loc_sample.loc[['Niko', 'Penelope'],['food','height']]
# It's important to notice the inner square brackets: each pair is a list placed inside the operator.

name,food,height
Niko,Lamb,70
Penelope,Apple,80


`.loc[ ]` accepts row and column labels either written directly or stored in an auxiliary variable. This auxiliary variable can hold a single label or a list of labels.

In [9]:
# We can get a DataFrame by indicating several rows and columns through auxiliary variables
rows = ['Niko', 'Penelope']
columns = ['food','height']
loc_sample.loc[rows,columns]

name,food,height
Niko,Lamb,70
Penelope,Apple,80



Unlike _just the brackets_, `.loc[ ]` can select a specific element (a single cell). It interprets its arguments as labels, not as integer locations. Slice notation is allowed within the loc indexer, and a bare "` : `" selects all elements. Unlike standard Python slicing, slice notation in this indexer includes the stop label.

In [10]:
# We can get several elements by indicating the starting and ending point of the subset through " : "
columns = ['food','height']
loc_sample.loc['Niko':'Penelope',columns]

name,food,height
Niko,Lamb,70
Aaron,Mango,120
Penelope,Apple,80
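
A bare colon can stand in for either side of the comma. The sketch below is illustrative only and assumes a hand-built copy of the first five rows of the sample data with three of its columns.

```python
import pandas as pd

# Assumption: first five rows of the sample data, rebuilt by hand
sample = pd.DataFrame(
    {
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese"],
        "age": [30, 2, 12, 4, 32],
        "height": [165, 70, 120, 80, 180],
    },
    index=pd.Index(["Jane", "Niko", "Aaron", "Penelope", "Dean"], name="name"),
)

# All rows, two chosen columns
all_rows = sample.loc[:, ["food", "height"]]

# A label slice of rows (stop label included), all columns
all_columns = sample.loc["Niko":"Penelope", :]

print(all_rows.shape)
print(list(all_columns.index))
```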


A Series is returned when a single row or column label is given without wrapping it in square brackets. When a row is selected this way, it is displayed vertically; this only reflects how a Series is shown.

In [11]:
columns = 'height'
loc_sample.loc['Niko':,columns]

name
Niko          70
Aaron        120
Penelope      80
Dean         180
Christina    172
Cornelia     150
Name: height, dtype: int64

A DataFrame is returned instead when the single column label is wrapped in a list:

In [12]:
columns = ['height']
loc_sample.loc['Niko':,columns]

name,height
Niko,70
Aaron,120
Penelope,80
Dean,180
Christina,172
Cornelia,150


When working with slice notation we can change the step size by adding a second colon followed by the step:

In [13]:
# This example extracts every second row, starting from 'Aaron'
columns = ['height','age','score']
loc_sample.loc['Aaron'::2,columns]

name,height,age,score
Aaron,120,12,9.0
Dean,180,32,1.8
Cornelia,150,69,2.2


If the row and column selections are both a single label, then a scalar value and NOT a DataFrame or Series is returned.

In [14]:
# The single age of Dean is obtained as:
rows = 'Dean'
columns = 'age'
loc_sample.loc[rows,columns]

32

In [15]:
# A single value can be returned as a DataFrame by wrapping both inputs in lists:
rows = ['Dean']
columns = ['age']
loc_sample.loc[rows,columns]

name,age
Dean,32


### Using __.iloc[ ]__

The indexer `.iloc[ ]` stands for _integer location_. It works with integers instead of the labels used by _[ ]_ and _.loc[ ]_. The `.iloc[ ]` indexer can select rows and columns simultaneously, just like `.loc[ ]`.

> The formal structure of this indexer is        __*.iloc[rows,columns]*__

In [16]:
iloc_sample = pd.read_csv(r"C:\Users\jober\OneDrive\Desktop\Data Science\Data Science - Study notes\Data_used\sample_data.csv", index_col='name')
iloc_sample.head()

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8


`.iloc[ ]` accepts integers to reference columns (predictors) and rows (records). The first row/column is referenced by the integer 0 and each subsequent one by the next integer. The last row/column is referenced by n - 1, where n is the number of rows/columns. The result follows the order in which the integers are given for rows and columns.

Just like `.loc[ ]`, `.iloc[ ]` accepts single integers and lists, written directly or stored in auxiliary variables. Slice notation works the same way as for the loc indexer, except that it does not include the stop integer location, which is how slicing works with Python lists, tuples, and strings.

In [17]:
# A subset can be extracted as:
icols = [4,2]
iloc_sample.iloc[2:4,icols]

name,height,food
Aaron,120,Mango
Penelope,80,Apple
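
The two slicing rules can be compared side by side. This is a minimal sketch under the assumption of a hand-built copy of the first five rows of the sample data (two columns only); it contrasts a label slice, an integer-location slice, and a plain Python list slice.

```python
import pandas as pd

# Assumption: first five rows of the sample data, rebuilt by hand
sample = pd.DataFrame(
    {
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese"],
        "height": [165, 70, 120, 80, 180],
    },
    index=pd.Index(["Jane", "Niko", "Aaron", "Penelope", "Dean"], name="name"),
)

# .loc slice: the stop label 'Penelope' is included
by_label = sample.loc["Niko":"Penelope"]

# .iloc slice: the stop integer location 3 is excluded
by_position = sample.iloc[1:3]

# A plain Python list follows the same rule as .iloc
names = list(sample.index)
list_slice = names[1:3]

print(list(by_label.index))
print(list(by_position.index))
print(list_slice)
```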


`.iloc[ ]` requires the rows argument; leaving it out raises an error. The columns argument is optional: when no columns are given, all columns are returned.

In [18]:
irows = [0,2,4]
iloc_sample.iloc[irows]

Unnamed: 0_level_0,state,color,food,age,height,score
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Jane,NY,blue,Steak,30,165,4.6
Aaron,FL,red,Mango,12,120,9.0
Dean,AK,gray,Cheese,32,180,1.8


In [19]:
# The rows argument is required: leaving it empty is a syntax error
icols = [0,2,4]
iloc_sample.iloc[,icols]

SyntaxError: invalid syntax (352876024.py, line 3)
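
When every row is wanted for a set of columns, a bare colon can fill the rows position. The sketch below assumes a hand-built copy of the first five rows of the sample data and is meant only as an illustration.

```python
import pandas as pd

# Assumption: first five rows of the sample data, rebuilt by hand
sample = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK"],
        "color": ["blue", "green", "red", "white", "gray"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese"],
        "age": [30, 2, 12, 4, 32],
        "height": [165, 70, 120, 80, 180],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8],
    },
    index=pd.Index(["Jane", "Niko", "Aaron", "Penelope", "Dean"], name="name"),
)

# The colon means "all rows"; the list picks columns by integer location
icols = [0, 2, 4]
all_rows = sample.iloc[:, icols]

print(list(all_rows.columns))
print(all_rows.shape)
```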

In [None]:
# A single row and a single column return a scalar value
irows = 0
icols = 4
iloc_sample.iloc[irows,icols]

165

In [None]:
# A DataFrame of size 1 x 1 is returned when both inputs are wrapped in lists
irows = [0]
icols = [4]
iloc_sample.iloc[irows,icols]

name,height
Jane,165


In [None]:
# A Series is returned when one of the inputs is a single integer
irows = [0,2,4]
icols = 4
iloc_sample.iloc[irows,icols]
# If a DataFrame is required instead, wrap the column in square brackets: [4].

name
Jane     165
Aaron    120
Dean     180
Name: height, dtype: int64

### Selecting Subsets of Data from Series

The three indexers, `[ ]`, `loc`, and `iloc`, are also available for selecting subsets of a Series. They work much as they do on a DataFrame.
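
A short sketch of the three indexers on a Series follows. It assumes a hand-built Series holding the `height` values of the first five rows of the sample data; since a Series has a single dimension, only one selection (no comma) is given.

```python
import pandas as pd

# Assumption: the 'height' column of the first five sample rows, rebuilt by hand
heights = pd.Series(
    [165, 70, 120, 80, 180],
    index=pd.Index(["Jane", "Niko", "Aaron", "Penelope", "Dean"], name="name"),
    name="height",
)

# Just the brackets with a label returns a single value
niko_height = heights["Niko"]

# .loc with a label slice includes the stop label
label_slice = heights.loc["Niko":"Penelope"]

# .iloc with an integer slice excludes the stop location
position_slice = heights.iloc[0:2]

print(niko_height)
print(list(label_slice))
print(list(position_slice.index))
```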

## 2.2. Filtering the data (Boolean Selection)

Boolean selection, also referred to as Boolean indexing, is the process of selecting subsets of rows from DataFrames (or Series) based on the actual values and NOT on labels or integer locations. Rows are chosen with a strict __logical condition__ that is checked for every row against a value of interest. Each row therefore gets a True or False value corresponding to the outcome of the condition, and is kept or discarded accordingly.

The primary purpose of _just the brackets_ for a DataFrame is to select one or more columns by using either a string or a list of strings. However, it can also select rows through Boolean selection: the rows for which the condition is True are kept.

In [20]:
# Let's read a file
bikes = pd.read_csv(r"C:\Users\jober\OneDrive\Desktop\Data Science\Data Science - Study notes\Data_used\bikes.csv",parse_dates=['starttime','stoptime'])
bikes.head()

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
0,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,11.0,Michigan Ave & Oak St,15.0,73.9,12.7,mostlycloudy
1,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,31.0,Wells St & Walton St,19.0,69.1,6.9,partlycloudy
2,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,15.0,Dearborn St & Monroe St,23.0,73.0,16.1,mostlycloudy
3,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,19.0,Clark St & Randolph St,31.0,72.0,16.1,mostlycloudy
4,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,19.0,Damen Ave & Pierce Ave,19.0,73.0,17.3,partlycloudy


By far the most common way to create a boolean Series is from the values of one particular column. We test a condition using one of the six comparison operators:
- `<`, `<=`, `>`, `>=`, `==`, `!=`

Filtering data using boolean selection is a two-step procedure: 
1. *Create a Boolean Series:* make the comparison to determine which rows fulfill the condition. It returns a new Series of the same length as the column under analysis, holding the boolean outcome of the comparison for each row.
> `filt = bikes['tripduration'] > 1000`
2. *Complete the Boolean selection:* select only those rows with a _True_ result. This is done by placing the boolean filter inside _just the brackets_ as follows:
> `bikes[bikes['tripduration'] > 1000]`
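
The two steps can be traced on a handful of rows. The sketch below assumes a hand-built frame holding only the `gender` and `tripduration` values of the first five rows shown by `bikes.head()`; the remaining columns are left out for brevity.

```python
import pandas as pd

# Assumption: two columns of the first five rows of bikes, rebuilt by hand
bikes_head = pd.DataFrame(
    {
        "gender": ["Male", "Male", "Male", "Male", "Male"],
        "tripduration": [993, 623, 1040, 667, 130],
    }
)

# Step 1: the comparison returns a boolean Series, one value per row
filt = bikes_head["tripduration"] > 1000
print(filt.tolist())

# Summing a boolean Series counts the rows that satisfy the condition
matches = int(filt.sum())

# Step 2: the boolean Series inside just the brackets keeps the True rows
long_trips = bikes_head[filt]
print(long_trips)
```

Only the row with a trip duration of 1040 passes the filter, and it keeps its original index label.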

In [26]:
# A Boolean filter can be implemented using auxiliary variables:
filt = bikes['tripduration'] > 1000
subset = bikes[filt]
subset.head()

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
2,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,15.0,Dearborn St & Monroe St,23.0,73.0,16.1,mostlycloudy
8,Male,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,31.0,Wood St & Division St,15.0,71.1,0.0,cloudy
10,Male,2013-07-04 17:17:00,2013-07-04 17:42:00,1523,Morgan St & 18th St,15.0,Damen Ave & Pierce Ave,19.0,79.0,9.2,mostlycloudy
11,Male,2013-07-04 18:13:00,2013-07-04 18:42:00,1697,Ashland Ave & Armitage Ave,15.0,Lincoln Ave & Armitage Ave,19.0,79.0,10.4,mostlycloudy
12,Male,2013-07-05 10:02:00,2013-07-05 10:40:00,2263,Jefferson St & Monroe St,19.0,Jefferson St & Monroe St,19.0,79.0,0.0,partlycloudy


The result is a smaller subset of the original DataFrame _bikes_. It contains only the records with a trip duration greater than 1000:

In [25]:
#The number of rows in the original dataset is:
len(bikes)

50089

In [27]:
#The number of rows in the filtered dataset is:
len(subset)

10178

Boolean indexing can also be written in a single line by placing the Boolean Series directly inside the _just the brackets_ indexer:

In [28]:
bikes[bikes['tripduration'] > 1000].head()

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
2,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,15.0,Dearborn St & Monroe St,23.0,73.0,16.1,mostlycloudy
8,Male,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,31.0,Wood St & Division St,15.0,71.1,0.0,cloudy
10,Male,2013-07-04 17:17:00,2013-07-04 17:42:00,1523,Morgan St & 18th St,15.0,Damen Ave & Pierce Ave,19.0,79.0,9.2,mostlycloudy
11,Male,2013-07-04 18:13:00,2013-07-04 18:42:00,1697,Ashland Ave & Armitage Ave,15.0,Lincoln Ave & Armitage Ave,19.0,79.0,10.4,mostlycloudy
12,Male,2013-07-05 10:02:00,2013-07-05 10:40:00,2263,Jefferson St & Monroe St,19.0,Jefferson St & Monroe St,19.0,79.0,0.0,partlycloudy
