In [1]:
import numpy as np
import pandas as pd

# Data Indexing and Selection

## in Series

### as a Dictionary

In [2]:
data=pd.Series([0.25,0.5,0.75,1.0],index=['a','b','c','d'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [3]:
data['b']

0.5

We can use dictionary-like Python expressions and methods to examine the keys (index labels) and values:

In [4]:
'a' in data

True

In [5]:
data.keys(),data.values

(Index(['a', 'b', 'c', 'd'], dtype='object'), array([0.25, 0.5 , 0.75, 1.  ]))

In [6]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

A Series can be extended just as a dictionary can: assigning a value to a new index label adds a new entry.

In [7]:
data['e']=1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

### as a one-dimensional array

In [8]:
# slicing by explicit index
data['b':'d']

b    0.50
c    0.75
d    1.00
dtype: float64

In [9]:
# slicing by implicit integer index
data[1:4]

b    0.50
c    0.75
d    1.00
dtype: float64

In [10]:
# masking
data[data>0.75]

d    1.00
e    1.25
dtype: float64

In [11]:
# fancy indexing
data[['b','a','c','e']]

b    0.50
a    0.25
c    0.75
e    1.25
dtype: float64

Note a potential point of confusion: when slicing with explicit index labels (`data['b':'d']`), the final label is included in the slice, whereas when slicing with implicit integer positions (`data[1:4]`), the final position is excluded, as with standard Python lists.
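The difference can be checked directly. The following is a minimal sketch on a fresh Series (the names `s`, `by_label` and `by_position` are illustrative and not part of the notebook above):

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Label-based slice: the end label 'c' is included
by_label = s['a':'c']

# Position-based slice: the end position 2 is excluded
by_position = s[0:2]

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```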

### using Indexers: loc, iloc

Suppose a Series has an explicit integer index running from 1 to 10 (10 elements). Then `data[1]` uses the explicit index labels, whereas `data[1:5]` uses the implicit Python-style positions.

In [12]:
data=pd.Series([str((i+1)*15) for i in range(10)], index=[i+1 for i in range(10)])
print(data)
print(data[1])
print(data[1:5])

1      15
2      30
3      45
4      60
5      75
6      90
7     105
8     120
9     135
10    150
dtype: object
15
2    30
3    45
4    60
5    75
dtype: object


Because of this potential confusion, pandas provides special indexer attributes that explicitly expose certain indexing styles.

First, the `loc` attribute allows indexing and slicing that always refer to the explicit index:

In [13]:
data.loc[1]

'15'

In [14]:
data.loc[2:6]

2    30
3    45
4    60
5    75
6    90
dtype: object

Second, the `iloc` attribute allows indexing and slicing that always refer to the implicit Python-style position:

In [15]:
data.iloc[7:9]

8    120
9    135
dtype: object

A common convention is to prefer these explicit indexers (`loc` and `iloc`) over plain bracket indexing, especially with integer indexes, because they make the intended indexing style clear.
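As a compact illustration of why the explicit indexers help, here is a sketch on a small Series whose integer labels do not match the positions (the Series `s` and its values are made up for this example):

```python
import pandas as pd

# Labels 1, 3, 5 sit at positions 0, 1, 2
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

print(s.loc[1])    # label 1    -> 'a'
print(s.iloc[1])   # position 1 -> 'b'

label_slice = s.loc[1:3]       # labels 1 through 3, end label included
position_slice = s.iloc[1:3]   # positions 1 and 2, end position excluded

print(list(label_slice))       # ['a', 'b']
print(list(position_slice))    # ['b', 'c']
```

With `loc` and `iloc` the reader never has to guess whether an integer means a label or a position.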

## in DataFrame

### as a Dictionary

In [16]:
area=pd.Series([423967,695662,141297,170312,149995],index=['California','Texas','New York','Florida','Illinois'])
pop=pd.Series({'California': 38332521, 'Texas': 26448193,'New York': 19651127, 'Florida': 19552860,'Illinois': 12882135})
Data=pd.DataFrame({'Population':pop,'Area':area})
Data

            Population    Area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


Each column can be accessed with dictionary-style indexing on the column name, or with attribute-style access using the column name:

In [17]:
print(Data['Area'])
print(Data.Population)

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: Area, dtype: int64
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: Population, dtype: int64


Both forms return the same column. In this session they even return the very same object:

In [18]:
Data['Area'] is Data.Area

True

Attribute-style access is convenient but not reliable: it does not work when a column name conflicts with an existing DataFrame method or attribute, or when the name is not a valid Python identifier.

For example, had we named the `Population` column `pop` instead, `Data.pop` would refer to the DataFrame's `pop()` method rather than to the column.
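The conflict is easy to reproduce. This sketch uses a hypothetical two-row frame called `states` (not the `Data` frame from this notebook) with a column deliberately named `pop`:

```python
import pandas as pd

# Hypothetical frame whose column name collides with DataFrame.pop()
states = pd.DataFrame(
    {'pop': [38332521, 26448193], 'area': [423967, 695662]},
    index=['California', 'Texas'],
)

# Attribute access finds the method, not the column
print(callable(states.pop))        # True

# Dictionary-style access always returns the column
print(type(states['pop']).__name__)  # Series

# A non-conflicting name works either way
print(states.area.equals(states['area']))  # True
```

Dictionary-style indexing is therefore the safer default, particularly for assignment.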

Dictionary-style indexing can also be used to define a new column:

In [19]:
Data['Density']=Data['Population']/Data['Area']
Data

            Population    Area     Density
California    38332521  423967   90.413926
Texas         26448193  695662   38.018740
New York      19651127  141297  139.076746
Florida       19552860  170312  114.806121
Illinois      12882135  149995   85.883763


### as a 2-D Array

The underlying data of a DataFrame, without the row and column labels, can be accessed as a 2-D array through the `values` attribute.

In [20]:
Data.values

array([[3.83325210e+07, 4.23967000e+05, 9.04139261e+01],
       [2.64481930e+07, 6.95662000e+05, 3.80187404e+01],
       [1.96511270e+07, 1.41297000e+05, 1.39076746e+02],
       [1.95528600e+07, 1.70312000e+05, 1.14806121e+02],
       [1.28821350e+07, 1.49995000e+05, 8.58837628e+01]])
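Note that the integer columns appear as floats in the array above. A NumPy array holds a single dtype, so once the float `Density` column is present the whole array becomes float. The sketch below illustrates this on a small hypothetical frame (`df` is not part of this notebook). It uses `to_numpy()`, which the pandas documentation recommends over `values`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'Population': [38332521, 26448193], 'Area': [423967, 695662]},
    index=['California', 'Texas'],
)

# Only integer columns: the array keeps an integer dtype
ints = df.to_numpy()
print(ints.dtype.kind)   # 'i'

# Adding a float column forces a common float dtype for the whole array
df['Density'] = df['Population'] / df['Area']
mixed = df.to_numpy()
print(mixed.dtype)       # float64
print(mixed.shape)       # (2, 3)
```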

From this perspective, many familiar array-like operations can be applied to the DataFrame itself.

In [21]:
# transpose the DataFrame, swapping rows and columns
Data.T

            California       Texas    New York     Florida    Illinois
Population  38332520.0  26448190.0  19651130.0  19552860.0  12882140.0
Area          423967.0    695662.0    141297.0    170312.0    149995.0
Density       90.41393    38.01874    139.0767    114.8061    85.88376


A <b>single row</b> of the underlying array can be accessed by passing a single index:

In [22]:
print(Data.values[4])
print(Data.values[4, 1])  # accessing a specific element

[1.28821350e+07 1.49995000e+05 8.58837628e+01]
149995.0


...and columns can be accessed with the dictionary-style indexing discussed previously:

In [23]:
Data[['Area','Density']], Data['Population']

(              Area     Density
 California  423967   90.413926
 Texas       695662   38.018740
 New York    141297  139.076746
 Florida     170312  114.806121
 Illinois    149995   85.883763,
 California    38332521
 Texas         26448193
 New York      19651127
 Florida       19552860
 Illinois      12882135
 Name: Population, dtype: int64)

### using indexers: loc, iloc, ix

For array-style indexing, the same `loc` and `iloc` indexers shown previously can be used. Unlike `values`, they keep the index and column labels in the result.

In [24]:
Data.iloc[:3,:2]

            Population    Area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297


In [25]:
Data.loc['Texas':'Florida','Population':'Area']

          Population    Area
Texas       26448193  695662
New York    19651127  141297
Florida     19552860  170312


In older versions of pandas, the `ix` indexer allowed a hybrid of these two approaches:

In [26]:
Data.ix[1:5,:'Area']

AttributeError: 'DataFrame' object has no attribute 'ix'

<b>IMPORTANT</b>

The `.ix` indexer was deprecated in pandas 0.20.0 (released in May 2017) and removed entirely in pandas 1.0.0 (released in January 2020), which is why the call above raises an `AttributeError`.

The deprecation of `.ix` was due to ambiguity in its usage: because it accepted both labels and integer positions, its behavior was hard to predict, particularly with integer indexes. The pandas developers recommend the more explicit indexers instead:

- `.loc` for label-based indexing
- `.iloc` for integer-based indexing
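A hybrid selection like the failed `ix` call (rows by position, columns by label) can still be written with these two indexers. The sketch below shows three equivalent spellings on a rebuilt copy of the state data; the names `df`, `option_a`, `option_b` and `option_c` are illustrative, and this is one possible approach rather than an official replacement recipe.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Population': [38332521, 26448193, 19651127, 19552860, 12882135],
        'Area': [423967, 695662, 141297, 170312, 149995],
    },
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'],
)
df['Density'] = df['Population'] / df['Area']

# 1. Chain the indexers: rows by position, then columns by label
option_a = df.iloc[1:5].loc[:, :'Area']

# 2. Translate positions into labels, then use loc only
option_b = df.loc[df.index[1:5], :'Area']

# 3. Translate the column label into a position, then use iloc only
#    (+ 1 because positional slices exclude the end)
option_c = df.iloc[1:5, :df.columns.get_loc('Area') + 1]

print(option_a.equals(option_b) and option_a.equals(option_c))  # True
```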

These indexers can be combined with NumPy-style data access patterns, such as masking and fancy indexing:

In [None]:
Data.loc[Data.Density>100,['Population','Density']]

          Population     Density
New York    19651127  139.076746
Florida     19552860  114.806121
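The same indexers can also be used to set values, not just read them. The following is a minimal sketch with made-up numbers; the frame `df` and its labels are hypothetical and chosen only to keep the result easy to verify.

```python
import pandas as pd

# Made-up numbers, for illustration only
df = pd.DataFrame(
    {'Population': [100, 200, 300], 'Area': [10, 20, 30]},
    index=['A', 'B', 'C'],
)

df.loc['B', 'Area'] = 25                    # set by row and column label
df.iloc[0, 0] = 110                         # set by position (row 0, column 0)
df.loc[df['Area'] > 20, 'Population'] = 0   # set by mask plus column label

print(df['Population'].tolist())  # [110, 0, 0]
print(df['Area'].tolist())        # [10, 25, 30]
```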
