# Accessing Data within Pandas

## Introduction
In this lesson we're going to dig into various methods for accessing data from our Pandas Series and DataFrames.

## Objectives

You will be able to:
* Understand and explain some key Pandas methods
* Access DataFrame data by using the label
* Perform boolean indexing on both Series and DataFrames
* Use simple selectors for series
* Set new Series and DataFrame inputs

#### _Our goals today are to be able to_:

Use the pandas library to:

- Get summary info about a dataset and its variables
  - Apply and use info, describe and dtypes
  - Use mean, min, max, and value_counts 
- Use apply and applymap to transform columns and create new values

- Explain lambda functions and use them to use an apply on a DataFrame
- Explain what a groupby object is and split a DataFrame using a groupby
- Reshape a DataFrame using joins, merges, pivoting, stacking, and melting

## Importing pandas and the data

First, let's make sure we import `pandas` as `pd`.

In [1]:
import pandas as pd

In [11]:
student_dict = {
    'name': ['Nati', 'Sami', 'Dani', 'Sara'],
    'age': ['25', '21', '26', '21'],
    'city': ['Houston', 'Seattle', 'New york', 'Atlanta'],
    'state': ['Texas', 'Washington', 'New York', 'Georgia']
}

students_df = pd.DataFrame(student_dict)

In [5]:
students_df

Unnamed: 0,name,age,city,state
0,Nati,25,Houston,Texas
1,Sami,21,Seattle,Washington
2,Dani,26,New york,New York
3,Sara,21,Atlanta,Georgia


In [10]:
students_df['name']  # select the 'name' column; returns a Series of strings

0    Nati
1    Sami
2    Dani
3    Sara
Name: name, dtype: object

In [11]:
students_df['name'] == 'Sara'  # == compares every value and returns booleans

0    False
1    False
2    False
3     True
Name: name, dtype: bool

The expression `students_df['name'] == 'Sara'` produces a pandas Series with a `True`/`False` value for every row in the `students_df` DataFrame. The value is `True` only for the rows where the name is `Sara`.

This type of boolean Series can be passed directly to the `.loc` indexer.

In [14]:
students_df.loc[students_df['name'] == 'Sara']

Unnamed: 0,name,age,city,state
3,Sara,21,Atlanta,Georgia


What if we only want the `city` and `state` of the students named `Sara`?

In [15]:
students_df.loc[students_df['name'] == 'Sara', ['city', 'state']]

Unnamed: 0,city,state
3,Atlanta,Georgia


In [16]:
students_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    4 non-null      object
 1   age     4 non-null      object
 2   city    4 non-null      object
 3   state   4 non-null      object
dtypes: object(4)
memory usage: 256.0+ bytes


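What if we want to select students of a specific age, say `21`? The `.info()` output above shows that `age` has dtype `object`, so the ages are stored as strings and we compare against the string `'21'`.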

In [17]:
students_df['age'] == '21'

0    False
1     True
2    False
3     True
Name: age, dtype: bool
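
Because the ages are strings, numeric comparisons such as "older than 21" need a conversion first. The following is a minimal sketch, assuming every age string is a valid integer; the `students` frame and the `age_int` column name are illustrative and not part of the lesson's data.

```python
import pandas as pd

# Illustrative copy of the small student table
students = pd.DataFrame({
    'name': ['Nati', 'Sami', 'Dani', 'Sara'],
    'age': ['25', '21', '26', '21'],   # ages stored as strings
})

# Convert the strings to integers so that > and < compare numbers
students['age_int'] = students['age'].astype(int)

# Names of the students older than 21
older_than_21 = students.loc[students['age_int'] > 21, 'name'].tolist()
print(older_than_21)
```

Keeping the converted values in a separate column leaves the original strings untouched, which can be handy while exploring.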

In [21]:
students_df

Unnamed: 0,name,age,city,state
0,Nati,25,Houston,Texas
1,Sami,21,Seattle,Washington
2,Dani,26,New york,New York
3,Sara,21,Atlanta,Georgia


What if we want to combine conditions, for example students who are `21` and live in `Atlanta`? Wrap each condition in parentheses and join them with `&`.

In [25]:
students_df.loc[(students_df['age'] == '21') & (students_df['city'] == 'Atlanta')]

Unnamed: 0,name,age,city,state
3,Sara,21,Atlanta,Georgia


In [35]:
# What should be returned? | means OR, & means AND
students_df.loc[(students_df['age'] == '25') | (students_df['city'] == 'Atlanta')]

Unnamed: 0,name,age,city,state
0,Nati,25,Houston,Texas
3,Sara,21,Atlanta,Georgia
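
Two related tools are often used next to `&` and `|`: `.isin()` for membership tests and `~` for negating a mask. The sketch below is illustrative only; the `students` frame repeats part of the lesson's data and the variable names are assumptions.

```python
import pandas as pd

students = pd.DataFrame({
    'name': ['Nati', 'Sami', 'Dani', 'Sara'],
    'city': ['Houston', 'Seattle', 'New york', 'Atlanta'],
})

# isin() tests membership in a list, replacing a chain of | comparisons
selected = students['city'].isin(['Houston', 'Atlanta'])
selected_names = students.loc[selected, 'name'].tolist()

# ~ negates a boolean mask (NOT)
other_names = students.loc[~selected, 'name'].tolist()
print(selected_names, other_names)
```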


## Switch gears

Before we answer those questions about the animal shelter data, let's practice on a simpler dataset.
Read about this dataset here: https://www.kaggle.com/ronitf/heart-disease-uci

![](images/heartbloodpres.jpeg)

In [45]:
# Load the heart disease data; keep the default integer index so that
# `age` stays a regular column (later cells filter on df['age'])
df = pd.read_csv('data/heart.csv')

In [46]:
type(df)

pandas.core.frame.DataFrame

In [47]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [96]:
df[df['trestbps'] > 175].shape

(7, 14)

Great! Our data set is now stored in the variable `df`. As you know, you can look at its elements by using `df` or `print(df)`.

In [21]:
print(df)

     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \
0     63    1   3       145   233    1        0      150      0      2.3   
1     37    1   2       130   250    0        1      187      0      3.5   
2     41    0   1       130   204    0        0      172      0      1.4   
3     56    1   1       120   236    0        1      178      0      0.8   
4     57    0   0       120   354    0        1      163      1      0.6   
..   ...  ...  ..       ...   ...  ...      ...      ...    ...      ...   
298   57    0   0       140   241    0        1      123      1      0.2   
299   45    1   3       110   264    0        1      132      0      1.2   
300   68    1   0       144   193    1        1      141      0      3.4   
301   57    1   0       130   131    0        1      115      1      1.2   
302   57    0   1       130   236    0        0      174      0      0.0   

     slope  ca  thal  target  
0        0   0     1       1  
1        0   0     2     

Now what if you want to see only a few rows of the data, based on certain constraints? You'll learn how to access data in this lesson!

## Methods and attributes to access data information

It won't be a surprise that our `df` object is a pandas DataFrame object. Let's verify this using the `type()` function:

In [53]:
type(df)

pandas.core.frame.DataFrame

There are some methods and attributes associated with pandas objects (both DataFrames *and* Series!) which make retrieving information from the data particularly easy. Some commonly used methods:
- `.head()`
- `.tail()`
- `.info()`

And attributes:
- `.index`
- `.columns`
- `.dtypes`
- `.shape`

### Some methods: `.head()`, `.tail()` and `.info()`

By using `.head()` and `.tail()`, you can select the first or last $n$ rows of your DataFrame, respectively. The default $n$ is 5, but you can change this value inside the parentheses. For example:

In [26]:
# First 10 rows of df
df.head(10)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
5,57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,1
8,52,1,2,172,199,1,1,162,0,0.5,2,0,3,1
9,57,1,2,150,168,0,1,174,0,1.6,2,0,2,1


In [38]:
df.shape

(303, 14)

In [43]:
df[['sex','age']].head()

Unnamed: 0,sex,age
0,1,63
1,1,37
2,0,41
3,1,56
4,0,57


In [29]:
# Last row of df
df.tail(1)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
302,57,0,1,130,236,0,0,174,0,0.0,1,1,2,0
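
As a quick check of how the argument to `.head()` and `.tail()` behaves, here is a small sketch on a made-up ten-row frame (the `demo` frame and its values are illustrative, not taken from the heart data):

```python
import pandas as pd

# Small illustrative frame with the default index 0..9
demo = pd.DataFrame({'x': range(10)})

first_three = demo.head(3)   # rows at positions 0, 1, 2
last_two = demo.tail(2)      # rows at positions 8, 9
default_head = demo.head()   # n defaults to 5

print(first_three.index.tolist(), last_two.index.tolist(), len(default_head))
```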


To get a concise summary of the dataframe you can use `.info()`

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


### Some attributes

Using `.index` you can access the index or row labels of the DataFrame.

In [45]:
df.index

RangeIndex(start=0, stop=303, step=1)

Using `.columns`, you can access the column labels of the DataFrame.

In [54]:
val = list(df.columns)
print(val)

['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']


`.dtypes` returns the dtype of each column in the DataFrame (compare with `.info()`!).

In [59]:
df.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

`.shape` returns a tuple representing the dimensionality of the DataFrame, in the form `(rows, columns)`.

In [55]:
df.shape

(303, 14)
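
The attributes above return ordinary Python or pandas objects, so they can be unpacked and reused in code. A minimal sketch on a made-up three-row frame (the `demo` frame and variable names are assumptions for illustration):

```python
import pandas as pd

demo = pd.DataFrame({'age': [63, 37, 41], 'oldpeak': [2.3, 3.5, 1.4]})

n_rows, n_cols = demo.shape          # shape is a (rows, columns) tuple
column_names = list(demo.columns)    # column labels as a plain list
row_labels = list(demo.index)        # default index: 0, 1, 2

# select_dtypes() filters columns by dtype, building on what .dtypes reports
float_cols = demo.select_dtypes(include='float64').columns.tolist()

print(n_rows, n_cols, column_names, row_labels, float_cols)
```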

## Selecting dataframe information

In the previous section, we deliberately omitted 2 very important attributes:
- `.iloc`, which is a pandas dataframe indexer used for integer-location based indexing / selection by position.
- `.loc`, which has 2 use cases:
  - Selecting by label / index
  - Selecting with a boolean / conditional lookup


### `.iloc`

You can use `.iloc` to select single rows. To select the 4th row, you can use `.iloc[3]` like:

In [56]:
df.iloc[3]

age          56.0
sex           1.0
cp            1.0
trestbps    120.0
chol        236.0
fbs           0.0
restecg       1.0
thalach     178.0
exang         0.0
oldpeak       0.8
slope         2.0
ca            0.0
thal          2.0
target        1.0
Name: 3, dtype: float64

You can use a colon to select several rows. Note that you'll use a structure `.iloc[a:b]` where the row with index `a` will be included in the selection and the row with index `b` is excluded.

In [69]:
df.iloc[5:8]  # rows at positions 5, 6 and 7; the general form is .iloc[rows, columns]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
5,57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,1


In [79]:
# df[df['age'] == 65][['chol','target']]
df.loc[df['age'] == 65, ['chol','target']]

Unnamed: 0,chol,target
28,417,1
31,177,1
38,269,1
39,360,1
181,225,0
199,248,0
218,254,0
222,282,0


Next, you can add a `,` followed by a second selector to perform *column* selections by position as well. The command below selects all rows of the columns at positions 3 through 6:

In [60]:
df.iloc[:,3:7]

Unnamed: 0,trestbps,chol,fbs,restecg
0,145,233,1,0
1,130,250,0,1
2,130,204,0,0
3,120,236,0,1
4,120,354,0,1
...,...,...,...,...
298,140,241,0,1
299,110,264,0,1
300,144,193,1,1
301,130,131,0,1


Last but not least, you can perform column and row selections at once:

In [62]:
df.iloc[5:10,3:9]

Unnamed: 0,trestbps,chol,fbs,restecg,thalach,exang
5,140,192,0,1,148,0
6,140,294,0,0,153,0
7,120,263,0,1,173,0
8,172,199,1,1,162,0
9,150,168,0,1,174,0


### `.loc`

#### a) `.loc` label-based indexing

You can use `.loc` to select data based on row labels and column names. Examples:

In [63]:
df.loc[:,"chol"]

0      233
1      250
2      204
3      236
4      354
      ... 
298    241
299    264
300    193
301    131
302    236
Name: chol, Length: 303, dtype: int64

An alternative method here is simply calling `df["chol"]`!

In [64]:
df.loc[7:16,"chol"]

7     263
8     199
9     168
10    239
11    275
12    266
13    211
14    283
15    219
16    340
Name: chol, dtype: int64
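
Note that the `.loc` slice above includes the row labeled 16, whereas an `.iloc` slice excludes its end position. The sketch below contrasts the two on a made-up five-row frame (the `demo` frame is illustrative; its values happen to match the first rows of `chol`):

```python
import pandas as pd

demo = pd.DataFrame({'chol': [233, 250, 204, 236, 354]})

# .loc slices by label and includes both endpoints
by_label = demo.loc[1:3, 'chol'].tolist()

# .iloc slices by position and excludes the end position
by_position = demo.iloc[1:3, 0].tolist()

print(by_label, by_position)
```

With the default integer index, labels and positions look identical, which makes this difference easy to overlook.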

#### b) boolean indexing using `.loc`

Sometimes you'd like to select certain rows in your data set based on the value for a certain variable. Imagine you'd like to create a new DataFrame that only contains the patients with a resting blood pressure (`trestbps`) below 100. This can be done as follows:

In [70]:
df.loc[df["trestbps"]<100]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
71,51,1,2,94,227,0,1,154,1,0.0,2,1,3,1
124,39,0,2,94,199,0,1,179,0,0.0,2,0,2,1


You can verify that `df[df["trestbps"]<100]` returns the same result!

However, the `.loc` attribute is useful if you only want certain columns for the selected rows, for example only the `sex` column for the patients with a `trestbps` below 100. You can obtain the result as follows:

In [80]:
df.loc[df["trestbps"]<100, ["sex"]]

Unnamed: 0,sex
71,1
124,0
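
Whether the column selector is a list or a single label changes the type of the result. A small sketch on made-up data (the `demo` frame and variable names are illustrative):

```python
import pandas as pd

demo = pd.DataFrame({'trestbps': [145, 94, 130], 'sex': [1, 1, 0]})
low = demo['trestbps'] < 100

as_frame = demo.loc[low, ['sex']]   # list of columns -> DataFrame
as_series = demo.loc[low, 'sex']    # single label -> Series

print(type(as_frame).__name__, type(as_series).__name__)
```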


## Selectors for series

Until now we've only really discussed pandas DataFrames. Most of these methods and selectors are also applicable to pandas Series. See how selecting a single column of a DataFrame gives you a pandas Series:

In [81]:
# Let's save the trestbps column in a variable `trestbps`
trestbps = df["trestbps"]

In [84]:
type(trestbps)

pandas.core.series.Series

Note how `trestbps` is now a pandas *Series*.

Many of the commands discussed before are readily applicable to Series:

In [85]:
trestbps[0:3]

0    145
1    130
2    130
Name: trestbps, dtype: int64

In [86]:
trestbps[trestbps > 175] # or trestbps.loc[trestbps>175]

101    178
110    180
203    180
223    200
248    192
260    178
266    180
Name: trestbps, dtype: int64
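
A filtered Series keeps its original index labels, as the output above shows. The following sketch, on a made-up four-value Series, illustrates how `.iloc` and `.loc` then address the same element differently (the `bps` Series is an assumption for illustration):

```python
import pandas as pd

bps = pd.Series([145, 130, 180, 200], name='trestbps')

high = bps[bps > 175]        # keeps the original labels 2 and 3
first_high = high.iloc[0]    # first element by position
same_value = high.loc[2]     # the same element by its original label

print(high.index.tolist(), first_high, same_value)
```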

In [78]:
# Call the .describe() method on our dataset. What do you observe?
df.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


In [87]:
# Use the code below. How does the output differ from info() ?
df.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

## Changing and setting values in DataFrames and series

### Changing values

Imagine that for some reason you're not interested in the exact `trestbps` values above 175, and simply want to set every value bigger than 175 to 176. You can use a selector and then assign it a new value, just like this:

In [97]:
df.loc[df["trestbps"] > 175, "trestbps"] = 176

In [100]:
df[df['trestbps'] == 176]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
101,59,1,3,176,270,0,0,145,0,4.2,0,0,3,1
110,64,0,0,176,325,0,1,154,1,0.0,2,0,2,1
203,68,1,2,176,274,1,0,150,1,1.6,1,0,3,0
223,56,0,0,176,288,1,0,133,1,4.0,0,2,3,0
248,54,1,1,176,283,0,0,195,0,0.0,2,1,3,0
260,66,0,0,176,228,1,1,165,1,1.0,1,2,3,0
266,55,0,0,176,327,0,2,117,1,3.4,1,0,2,0
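
For capping values, `Series.clip` is a possible alternative to assigning through `.loc`. The sketch below compares the two on made-up integer data. They agree here only because the cap (176) sits directly above the threshold (175) and the values are integers. The `demo` frame and variable names are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({'trestbps': [145, 178, 200, 130]})

# Assignment through .loc, as in the lesson
capped_loc = demo.copy()
capped_loc.loc[capped_loc['trestbps'] > 175, 'trestbps'] = 176

# Series.clip limits values to an upper bound in one call
capped_clip = demo['trestbps'].clip(upper=176)

print(capped_loc['trestbps'].tolist(), capped_clip.tolist())
```

Working on a copy keeps the original frame intact, which makes it easy to compare before and after.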


In [101]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


### Creating new columns

Now imagine that we want to create a new column named "status" which has the value "Panicking" when `trestbps` is above 175, "Dying" when it is 100 or below, and "Fine" for everything in between. This can be done as follows:

In [107]:
df.loc[df["trestbps"] > 175, "status"] = "Panicking"
df.loc[df["trestbps"] <= 100, "status"] = "Dying"
df.loc[(df["trestbps"] <= 175) & (df["trestbps"] > 100), "status"] = "Fine"

In [108]:
df[['age','trestbps','status']].head()

Unnamed: 0,age,trestbps,status
0,63,145,Fine
1,37,130,Fine
2,41,130,Fine
3,56,120,Fine
4,57,120,Fine


In [129]:
df1 = df[['status']]
df1.status.value_counts()

Fine         290
Panicking      7
Dying          6
Name: status, dtype: int64
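
The three `.loc` assignments can also be written as a single call to `numpy.select`, which picks a label per row from a list of conditions. This is a sketch of an alternative, not the approach used in this lesson; the `demo` frame and its values are made up.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'trestbps': [94, 130, 176, 100, 150]})

# Conditions are checked in order; the first match wins
conditions = [demo['trestbps'] > 175, demo['trestbps'] <= 100]
labels = ['Panicking', 'Dying']

# Rows matching neither condition get the default label
demo['status'] = np.select(conditions, labels, default='Fine')

counts = demo['status'].value_counts().to_dict()
print(counts)
```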

In [132]:
f"Panicking: {df.loc[df['status'] == 'Panicking'].shape[0]}"


'Panicking: 7'

In [144]:
# Common Python types, for reference:
# str   -> ''
# int   -> 90
# float -> 9.99
# list  -> [1, 2, 3]
# dict  -> {'key': 'val'}
# tuple -> (9, 'sis', 4.5)
type(df.loc[df['status'] == 'Panicking'].shape)

tuple

In [149]:
df.loc[df['status'] == 'Panicking'].shape # (no_row, no_col)

(7, 15)

In [162]:
ls = (9, 'sis', 4.5)
ls1 = [9,'sis',4.5]
# immutable: tuple
# mutable: list, dict

In [163]:
ls1.append('in')

In [164]:
ls1

[9, 'sis', 4.5, 'in']

In [165]:
ls10 = list(ls)

In [167]:
ls10[-1] ='99'

In [168]:
ls10

[9, 'sis', '99']
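
The cells above convert the tuple to a list before changing an element. The short sketch below shows why: assigning to an element of a tuple raises a `TypeError`. The variable names are illustrative.

```python
point = (9, 'sis', 4.5)        # tuple: immutable

try:
    point[-1] = '99'           # item assignment is not allowed on tuples
    changed = True
except TypeError:
    changed = False

as_list = list(point)          # copy into a list to edit it
as_list[-1] = '99'

print(changed, as_list, point)
```

This is also why the tuple returned by `.shape` can be read and unpacked but not modified in place.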

In [142]:
'Panicking: ' + str(df.loc[df['status'] == 'Panicking'].shape)

'Panicking: (7, 15)'

In [134]:
f"Dying: {df.loc[df['status'] == 'Dying'].shape[0]}"

'Dying: 6'

In [135]:
f'Fine: {df.loc[df["status"] == "Fine"].shape[0]}'

'Fine: 290'

Have another look at `df`. `status` has been added as the 15th column!

## Summary

We've introduced a range of techniques for accessing information in Pandas Series and DataFrames, selecting rows and columns, changing values, and creating new columns! Now, it's time for some practice! Let's start working on a lab where you will get a chance to combine some of these methods!