# Selecting and Viewing Data in Pandas

In data science, being able to select and inspect the right part of a dataset is the starting point for exploration and analysis. pandas provides several tools for this. The examples below use a small car sales dataset loaded from a CSV file.

- [Selecting Data](#selecting-data)
  - [Loc and iLoc](#loc-and-iloc)
  - [Selecting Columns](#selecting-columns)
- [Viewing Data](#viewing-data)
  - [Head](#head)
  - [Tail](#tail)
  - [Filtering Data](#filtering-data)
  - [Crosstab](#crosstab)
  - [Groupby](#groupby)


In [40]:
# Importing pandas and reading csv
import pandas as pd

car_sales = pd.read_csv('dataframes/car-sales.csv')

# Selecting Data

pandas offers several ways to select rows and columns from a DataFrame (DF).


## Loc and iLoc

`.loc` and `.iloc` are the two fundamental indexers for accessing and modifying data within DataFrames and Series. They are used with square brackets rather than called like methods. Both select data, but they differ in how they interpret what is passed to them.

### `.loc`

This indexer is label-based: it selects data using the DF's index labels (row labels or column labels). In other words, `.loc` finds the item or items whose **index label** matches the value passed.

### `.iloc`

This indexer is position-based: it selects data using integer positions within the DF, counting from 0, regardless of what the index labels are. In other words, `.iloc` finds the item located at the **position** passed.


In [41]:
# Getting the row whose index label is 3 (the 4th row here, since labels start at 0)
car_sales.loc[3]

Make                    BMW
Colour                Black
Odometer (KM)         11179
Doors                     5
Price            $22,000.00
Name: 3, dtype: object

In [42]:
# Getting the value at index label 3 in the Price column
car_sales.loc[3, 'Price']

'$22,000.00'

In [43]:
# Creating animals Series
animals = pd.Series(
    ['cat', 'dog', 'bird', 'panda', 'snake'],
    index=[0, 3, 9, 8, 3],
)

# Showing Series
animals

0      cat
3      dog
9     bird
8    panda
3    snake
dtype: object

In [44]:
# iloc returns the item at the 4th position (positions start at 0)
animals.iloc[3]

'panda'

In [45]:
# loc returns every item whose index label is 3 (the label appears twice here)
animals.loc[3]

3      dog
3    snake
dtype: object
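
The difference between labels and positions also shows up when slicing and when assigning values. The sketch below uses a made-up `prices` Series (not part of the car sales data) whose labels are deliberately not `0..n-1`, so the two indexers cannot be confused.

```python
import pandas as pd

# Hypothetical stand-in data; the index labels are 10, 20, 30, 40
prices = pd.Series([4000, 5000, 7000, 22000], index=[10, 20, 30, 40])

# .loc slices by label and includes BOTH endpoints
by_label = prices.loc[10:30].tolist()      # labels 10, 20 and 30

# .iloc slices by position and excludes the stop, like a Python list
by_position = prices.iloc[0:2].tolist()    # positions 0 and 1

# Both indexers can also appear on the left-hand side of an assignment
prices.loc[40] = 21000    # set the value whose label is 40
prices.iloc[0] = 4500     # set the value at the first position

print(by_label)           # [4000, 5000, 7000]
print(by_position)        # [4000, 5000]
print(prices.tolist())    # [4500, 5000, 7000, 21000]
```

Note that the `.loc` slice returns three items while the `.iloc` slice returns two, even though both slices look similar.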

## Selecting Columns

pandas offers two main ways of selecting a single column. The first is to use the column's name as a key, as in `my_df['name']`. The second is to use the column's name as an attribute of the DataFrame, as in `my_df.name`. For most column names, both return the same Series.

The two are not fully interchangeable, though. Attribute access can only be written when the column's name is a valid Python identifier, so a name containing whitespace (such as `Odometer (KM)`) cannot follow the dot. It also does not return the column when the name clashes with an existing DataFrame attribute or method. Key access works for every column name.


In [46]:
# Accessing as a key
car_sales['Make']

0    Toyota
1     Honda
2    Toyota
3       BMW
4    Nissan
5    Toyota
6     Honda
7     Honda
8    Toyota
9    Nissan
Name: Make, dtype: object

In [47]:
# Accessing as an attribute
car_sales.Make

0    Toyota
1     Honda
2    Toyota
3       BMW
4    Nissan
5    Toyota
6     Honda
7     Honda
8    Toyota
9    Nissan
Name: Make, dtype: object
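
The cases where the two notations diverge can be sketched as follows. The `cars` DataFrame is a small hand-built stand-in for `car_sales`, and the `count` column is a hypothetical addition chosen only because it shares its name with a DataFrame method.

```python
import pandas as pd

# Hypothetical stand-in for car_sales, plus an illustrative "count" column
cars = pd.DataFrame({
    "Make": ["Toyota", "Honda", "BMW"],
    "Odometer (KM)": [150043, 87899, 11179],
    "count": [1, 2, 3],
})

# Key access works for any column name, including names with spaces
odometer = cars["Odometer (KM)"]

# Passing a list of names selects several columns and returns a DataFrame
subset = cars[["Make", "Odometer (KM)"]]

# Attribute access finds the DataFrame.count method, not the column
print(callable(cars.count))      # True
print(cars["count"].tolist())    # [1, 2, 3]
```

Using key access throughout avoids both problems, at the cost of a few extra characters.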

# Viewing Data

pandas offers many ways of inspecting and summarizing the data contained in its DFs and Series.


## Head

`.head(n)` returns the first `n` rows (5 by default) of a DataFrame or Series. It is a quick way to glance at the beginning of the data and get a sense of its structure and content.

It is available on both of pandas' basic data structures, DFs and Series, and is called by appending it to the object's name, as in `car_sales.head()`.


In [48]:
# Getting DF's head
car_sales.head()

| | Make | Colour | Odometer (KM) | Doors | Price |
|---|---|---|---|---|---|
| 0 | Toyota | White | 150043 | 4 | $4,000.00 |
| 1 | Honda | Red | 87899 | 4 | $5,000.00 |
| 2 | Toyota | Blue | 32549 | 3 | $7,000.00 |
| 3 | BMW | Black | 11179 | 5 | $22,000.00 |
| 4 | Nissan | White | 213095 | 4 | $3,500.00 |


## Tail

Where `.head(n)` shows the first `n` rows of a DF, `.tail(n)` shows the last `n` rows (5 by default). It takes the same argument and is called the same way.


In [49]:
# Getting DF's tail
car_sales.tail()

| | Make | Colour | Odometer (KM) | Doors | Price |
|---|---|---|---|---|---|
| 5 | Toyota | Green | 99213 | 4 | $4,500.00 |
| 6 | Honda | Blue | 45698 | 4 | $7,500.00 |
| 7 | Honda | Blue | 54738 | 4 | $7,000.00 |
| 8 | Toyota | White | 60000 | 4 | $6,250.00 |
| 9 | Nissan | White | 31600 | 4 | $9,700.00 |
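
Both methods accept an explicit row count and work on a Series as well as a DF. The following is a minimal sketch using a hand-built `cars` DataFrame as a stand-in for `car_sales`, so it runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for car_sales with seven rows
cars = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW", "Nissan", "Toyota", "Honda"],
    "Doors": [4, 4, 3, 5, 4, 4, 4],
})

first_three = cars.head(3)         # rows at positions 0, 1, 2
last_two = cars.tail(2)            # the final two rows
default_head = cars.head()         # no argument: 5 rows
make_head = cars["Make"].head(2)   # the same method on a Series

print(first_three.index.tolist())  # [0, 1, 2]
print(last_two.index.tolist())     # [5, 6]
print(len(default_head))           # 5
print(make_head.tolist())          # ['Toyota', 'Honda']
```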


## Filtering Data

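There are also multiple ways of filtering data in pandas. The simplest is boolean indexing: write a condition on a column, which produces a Series of `True`/`False` values, and pass it inside square brackets to keep only the rows where the condition holds.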


In [50]:
# Filtering Data
car_sales[car_sales['Odometer (KM)'] > 100000]

| | Make | Colour | Odometer (KM) | Doors | Price |
|---|---|---|---|---|---|
| 0 | Toyota | White | 150043 | 4 | $4,000.00 |
| 4 | Nissan | White | 213095 | 4 | $3,500.00 |
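
A few of the other filtering options can be sketched like this. The `cars` DataFrame is a hand-built stand-in for the first five rows of `car_sales`, and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the first five rows of car_sales
cars = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW", "Nissan"],
    "Odometer (KM)": [150043, 87899, 32549, 11179, 213095],
    "Doors": [4, 4, 3, 5, 4],
})

# A comparison on a column produces a boolean Series (the mask)
mask = cars["Odometer (KM)"] > 100000

# Conditions combine with & (and), | (or) and ~ (not);
# each comparison needs its own parentheses
high_km_toyota = cars[(cars["Make"] == "Toyota") & mask]

# .isin() keeps rows whose value is in a given list
toyota_or_nissan = cars[cars["Make"].isin(["Toyota", "Nissan"])]

# .query() expresses the condition as a string
four_doors = cars.query("Doors == 4")

print(mask.tolist())                    # [True, False, False, False, True]
print(high_km_toyota.index.tolist())    # [0]
print(toyota_or_nissan.index.tolist())  # [0, 2, 4]
print(four_doors.index.tolist())        # [0, 1, 4]
```

In every case the result keeps the original index labels, which makes it easy to trace rows back to the full DF.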


## Crosstab

`pd.crosstab` is a tool for analysing relationships between categorical variables. It builds a contingency table (also called a cross-tabulation), which counts how often each combination of values occurs across two or more variables.

The function takes the values to group by in the rows as its first argument and the values to group by in the columns as its second. By default, each cell holds the number of rows with that combination.


In [51]:
# Cross-tabulating the Make and Doors columns
pd.crosstab(car_sales['Make'], car_sales['Doors'])

| Make \ Doors | 3 | 4 | 5 |
|---|---|---|---|
| BMW | 0 | 0 | 1 |
| Honda | 0 | 3 | 0 |
| Nissan | 0 | 2 | 0 |
| Toyota | 1 | 3 | 0 |
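
Two optional parameters of `pd.crosstab` are often useful alongside the plain counts: `margins=True` adds row and column totals under the label `All`, and `normalize="index"` turns each row into proportions. The sketch below assumes a hand-built `cars` DataFrame with six rows rather than the full CSV.

```python
import pandas as pd

# Hypothetical stand-in for car_sales with six rows
cars = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW", "Nissan", "Toyota"],
    "Doors": [4, 4, 3, 5, 4, 4],
})

# Plain counts: rows are makes, columns are door counts
counts = pd.crosstab(cars["Make"], cars["Doors"])

# margins=True appends an "All" row and column holding the totals
with_totals = pd.crosstab(cars["Make"], cars["Doors"], margins=True)

# normalize="index" divides each row by its own total
row_share = pd.crosstab(cars["Make"], cars["Doors"], normalize="index")

print(counts.loc["Toyota", 4])          # 2
print(with_totals.loc["Toyota", "All"]) # 3
print(with_totals.loc["All", "All"])    # 6
print(round(row_share.loc["Toyota", 4], 2))  # 0.67
```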


## Groupby

`groupby` is a fundamental tool in pandas for working with data grouped by categories. It splits a DataFrame into distinct groups based on the values of one or more columns and then applies an operation to each group. This makes it well suited to data analysis and manipulation tasks such as aggregation, filtering, and transformation.


In [53]:
# Grouping by Make and averaging the numeric columns
car_sales.groupby(['Make']).mean(numeric_only=True)

| Make | Odometer (KM) | Doors |
|---|---|---|
| BMW | 11179.0 | 5.0 |
| Honda | 62778.333333 | 4.0 |
| Nissan | 122347.5 | 4.0 |
| Toyota | 85451.25 | 3.75 |
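
The three kinds of task mentioned above (aggregation, transformation, and filtering) can each be sketched on a grouped object. The `cars` DataFrame here is hypothetical, with rounded odometer values chosen to keep the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical data with rounded values, not the real car_sales rows
cars = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW", "Honda", "Toyota"],
    "Odometer (KM)": [150000, 80000, 30000, 10000, 40000, 60000],
})

grouped = cars.groupby("Make")

# Aggregation: one row per group, one column per statistic
summary = grouped["Odometer (KM)"].agg(["mean", "max"])

# Transformation: one value per ORIGINAL row, here the mean of its group
make_mean = grouped["Odometer (KM)"].transform("mean")

# Filtering: keep only the rows of groups that pass a test
at_least_two = grouped.filter(lambda g: len(g) >= 2)

# size() counts the rows in each group
group_sizes = grouped.size()

print(summary.loc["Toyota", "mean"])   # 80000.0
print(make_mean.tolist())
print(at_least_two.index.tolist())     # [0, 1, 2, 4, 5]
print(group_sizes.to_dict())           # {'BMW': 1, 'Honda': 2, 'Toyota': 3}
```

Aggregation shrinks the data to one row per group, while transformation and filtering return rows aligned with the original DF.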
