# Managing data with Pandas.

## (Part 2)

One of the most common tasks in data analysis is extracting subsets of interest from a dataset. Although the pandas syntax for selecting data is fairly complex, it offers powerful tools for selecting particular rows and/or columns from a DataFrame, or values from a Series.

Subset selection in pandas comes in three forms: _selecting specific columns_, _selecting specific rows_, and _selecting rows and columns simultaneously_. Each can be done by label or by integer location. Selecting by label means explicitly naming the row or column we are interested in. Selecting by integer location means using a number between 0 and `n-1` to identify the desired row or column, where `n` is the total number of rows or columns in the object. The official pandas documentation refers to integer location as position; however, integer location is more explicit.

__Indexing__ is another, more technical, term that refers to __*subset selection*__. While the first is more widely used in documentation, the second is more descriptive of what is actually happening. To select subsets we will use three indexers, `[ ]`, _loc_, and _iloc_, each of which follows different rules.
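To make the difference between label and integer location concrete, here is a minimal sketch on a small, hypothetical DataFrame. The name `df` and the three rows are illustrative only; the values are borrowed from the sample data used later in this chapter. It reaches the same cell once by label and once by integer location.

```python
import pandas as pd

# Hypothetical miniature dataset; values borrowed from the sample data used later.
df = pd.DataFrame(
    {"food": ["Steak", "Lamb", "Mango"], "height": [165, 70, 120]},
    index=["Jane", "Niko", "Aaron"],
)

column = df["height"]                 # just the brackets: a column chosen by its label
by_label = df.loc["Niko", "height"]   # loc: row label and column label
by_position = df.iloc[1, 1]           # iloc: second row, second column, counted from 0

# get_loc translates a label into its integer location.
row_pos = df.index.get_loc("Niko")
col_pos = df.columns.get_loc("height")

print(by_label, by_position, row_pos, col_pos)
```

Both `by_label` and `by_position` point at the same cell; only the way of addressing it differs.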

In this chapter we will be exploring how to handle and prepare data for later visualization through querying, aggregating, filtering and cleaning operations. 

### Content
In the second part of this introduction to pandas we are covering the following content:

- Querying data.
- Aggregating Methods.
- Filtering the data.
- Data Cleaning. 

In [3]:
import pandas as pd

## 2.1. Querying the data (Extracting subsets)

__Terminology:__ When brackets are placed directly after a DataFrame/Series name, the term _just the brackets_ will be used to differentiate them from the brackets after _loc_ and _iloc_. All three indexers work under different rules, which are described briefly in the sections below.

`NOTE:` Be careful to use brackets rather than parentheses. In Python, brackets are the universal operator for selecting subsets of data regardless of the type of object, so avoid the mistake of appending parentheses to _loc_ or _iloc_. These are NOT methods, although they are accessed in the same manner, through dot notation.

In [7]:
JustBrakets = pd.read_csv(r"C:\Users\jober\OneDrive\Desktop\Data Science\Data Science - Study notes\Data_used\sample_data.csv", index_col='name')
JustBrakets.head()

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
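Because the CSV path above is local to the author's machine, the following sketch rebuilds only the five rows displayed by `head()` as inline CSV text, so the examples can be reproduced elsewhere. The names `csv_text` and `sample` are hypothetical, and the real file contains more rows than are shown here. The sketch also illustrates what `index_col='name'` does.

```python
import io

import pandas as pd

# The five rows displayed by head() above, written out as CSV text.
# Assumption: the real file has more rows; only these five are fully known.
csv_text = """name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
"""

# index_col='name' turns the 'name' column into the row labels (the index),
# so it no longer appears among the regular columns.
sample = pd.read_csv(io.StringIO(csv_text), index_col="name")

print(sample.index.tolist())
print(sample.columns.tolist())
```

With the names as the index, rows can later be selected by label, for example with `sample.loc["Niko"]`.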


### Using just the brackets __[ ]__

This indexer uses square brackets `[ ]` as its operator. Data can be extracted from a DataFrame by giving the name of a column (predictor) directly or through an auxiliary variable. This auxiliary variable can hold a single column name or a list of column names.

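_Just the brackets_ is used here only to select columns by name. It cannot select a specific element (a single cell), and an integer is treated as a column label rather than as an integer location. A slice such as `JustBrakets[0:2]` is the one exception, since it selects rows, but _loc_ and _iloc_ are the clearer tools for that.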

In [15]:
# Selecting a single column from a DataFrame:
JustBrakets['food']
# A single column name returns a Series object.

name
Jane          Steak
Niko           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: food, dtype: object

In [16]:
# If a list is passed to the indexer (note the double square brackets), a DataFrame is returned.
JustBrakets[['food']]

name,food
Jane,Steak
Niko,Lamb
Aaron,Mango
Penelope,Apple
Dean,Cheese
Christina,Melon
Cornelia,Beans


In [19]:
# Passing a list of several column names also returns a DataFrame.
JustBrakets[['height','food']]
# Note the double square brackets: the inner pair is a list placed inside the indexer.
# The order of the names in the list sets the column order of the resulting DataFrame. The same applies when the list is stored in an auxiliary variable.

name,height,food
Jane,165,Steak
Niko,70,Lamb
Aaron,120,Mango
Penelope,80,Apple
Dean,180,Cheese
Christina,172,Melon
Cornelia,150,Beans


In [14]:
# The list of column names can also be stored in an auxiliary variable first.
columns = ['food','height']
JustBrakets[columns]
# The variable already holds a list, so a single pair of brackets is enough here.

name,food,height
Jane,Steak,165
Niko,Lamb,70
Aaron,Mango,120
Penelope,Apple,80
Dean,Cheese,180
Christina,Melon,172
Cornelia,Beans,150
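The rules for _just the brackets_ can be summarized in one runnable sketch. Since the CSV file is not available outside the author's machine, the DataFrame `people` below is a hypothetical reconstruction of three columns from the first five rows shown earlier; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical reconstruction of three columns from the first five rows shown earlier.
people = pd.DataFrame(
    {
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese"],
        "age": [30, 2, 12, 4, 32],
        "height": [165, 70, 120, 80, 180],
    },
    index=pd.Index(["Jane", "Niko", "Aaron", "Penelope", "Dean"], name="name"),
)

one_column = people["food"]             # a single label returns a Series
one_column_df = people[["food"]]        # a one-item list returns a DataFrame
reordered = people[["height", "food"]]  # the list order sets the column order

print(type(one_column).__name__)
print(type(one_column_df).__name__)
print(reordered.columns.tolist())

# An integer is looked up as a column label, not as an integer location,
# so it fails here because no column is named 0.
try:
    people[0]
except KeyError as err:
    print("KeyError:", err)
```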


### Using __.loc[ ]__

Data can also be extracted from a DataFrame with the `.loc[ ]` indexer. It builds on the _just the brackets_ indexer.

> The formal structure of this indexer is __*.loc[rows,columns]*__

`.loc[ ]` accepts the label of a row or column directly or through an auxiliary variable. This auxiliary variable can hold a single label or a list of labels.

Unlike _just the brackets_, `.loc[ ]` can select a specific element (a single cell) by giving both a row label and a column label. It selects by label; selection by integer location is the job of `.iloc[ ]`.
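As a minimal sketch of label-based selection with `.loc[rows, columns]`, the example below uses a hypothetical DataFrame `people` rebuilt from the first five rows of the sample data; the variable names are illustrative. It shows a single row, a single cell, a selection with lists of labels, and a label slice.

```python
import pandas as pd

# Hypothetical reconstruction of three columns from the first five rows of the sample data.
people = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese"],
        "age": [30, 2, 12, 4, 32],
    },
    index=pd.Index(["Jane", "Niko", "Aaron", "Penelope", "Dean"], name="name"),
)

row = people.loc["Niko"]                                # one row label returns a Series
cell = people.loc["Niko", "food"]                       # row label plus column label returns one value
block = people.loc[["Dean", "Jane"], ["food", "age"]]   # lists of labels return a DataFrame

# Label slices include both endpoints, unlike ordinary Python slices.
label_slice = people.loc["Niko":"Penelope", "state":"food"]

print(cell)
print(block)
print(label_slice)
```

The rows of `block` come back in the order requested (`Dean`, then `Jane`), just as the column order follows the list in _just the brackets_.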

### Using __.iloc[ ]__