## Accessing DataFrame columns

* You can access a DataFrame column using bracket notation or dot notation
* Dot notation only works when the column name is a valid Python identifier (no spaces, special characters, etc.) and does not clash with an existing DataFrame attribute or method
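
To make the two restrictions on dot notation concrete, here is a minimal sketch. The toy DataFrame and its column names (`unit price`, `count`) are hypothetical and chosen only to trigger each case; they are not part of the retail dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the retail CSV
df = pd.DataFrame({
    "family": ["BOOKS", "BEAUTY"],
    "unit price": [12.5, 8.0],  # name contains a space
    "count": [3, 7],            # same name as the DataFrame.count method
})

by_dot = df.family              # works: valid identifier, no name clash
by_bracket = df["family"]       # bracket notation always works

unit_price = df["unit price"]   # "df.unit price" would be a SyntaxError

count_attr = df.count           # the bound method, not the column
count_col = df["count"]         # the actual column
```

Because bracket notation behaves the same for every column name, it is the safer default.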

In [1]:
import numpy as np
import pandas as pd

In [2]:
retail_df = pd.read_csv('../DataFrames/retail_2016_2017.csv')
retail_df

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.000,0
1,1945945,2016-01-01,1,BABY CARE,0.000,0
2,1945946,2016-01-01,1,BEAUTY,0.000,0
3,1945947,2016-01-01,1,BEVERAGES,0.000,0
4,1945948,2016-01-01,1,BOOKS,0.000,0
...,...,...,...,...,...,...
1054939,3000883,2017-08-15,9,POULTRY,438.133,0
1054940,3000884,2017-08-15,9,PREPARED FOODS,154.553,1
1054941,3000885,2017-08-15,9,PRODUCE,2419.729,148
1054942,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8


In [3]:
retail_df["family"] # access a column with brackets, quotes, and the column name as it appears

0                          AUTOMOTIVE
1                           BABY CARE
2                              BEAUTY
3                           BEVERAGES
4                               BOOKS
                      ...            
1054939                       POULTRY
1054940                PREPARED FOODS
1054941                       PRODUCE
1054942    SCHOOL AND OFFICE SUPPLIES
1054943                       SEAFOOD
Name: family, Length: 1054944, dtype: object

In [4]:
retail_df.family # Dot notation also works, but only if the column name is a valid Python identifier. Brackets are more practical.

0                          AUTOMOTIVE
1                           BABY CARE
2                              BEAUTY
3                           BEVERAGES
4                               BOOKS
                      ...            
1054939                       POULTRY
1054940                PREPARED FOODS
1054941                       PRODUCE
1054942    SCHOOL AND OFFICE SUPPLIES
1054943                       SEAFOOD
Name: family, Length: 1054944, dtype: object

### You can use Series operations on DataFrame columns. Each column is a Series.

In [5]:
retail_df["family"].nunique()

33

In [6]:
retail_df["sales"].mean()

457.72248700136413

In [7]:
retail_df["family"].value_counts().iloc[:5]

family
AUTOMOTIVE                    31968
HOME APPLIANCES               31968
SCHOOL AND OFFICE SUPPLIES    31968
PRODUCE                       31968
PREPARED FOODS                31968
Name: count, dtype: int64

In [8]:
retail_df["sales"].sum().round()

482871591.0
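
The same pattern can be checked on a small in-memory DataFrame. This is only a sketch with made-up values; it confirms that a single column is a Series and that the Series methods used above apply to it.

```python
import pandas as pd

# Made-up values for illustration only
df = pd.DataFrame({
    "family": ["BOOKS", "BEAUTY", "BOOKS"],
    "sales": [10.0, 0.0, 5.0],
})

sales = df["sales"]
is_series = isinstance(sales, pd.Series)     # a single column is a Series

n_families = df["family"].nunique()          # number of distinct values
mean_sales = df["sales"].mean()              # arithmetic mean
top_family = df["family"].value_counts().iloc[:1]  # most frequent value
total_sales = df["sales"].sum().round()      # rounded total
```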

### Selecting multiple columns with a list of column names between brackets
* Ideal for selecting non-consecutive columns in a DataFrame

In [9]:
retail_df[['family', 'store_nbr']]

Unnamed: 0,family,store_nbr
0,AUTOMOTIVE,1
1,BABY CARE,1
2,BEAUTY,1
3,BEVERAGES,1
4,BOOKS,1
...,...,...
1054939,POULTRY,9
1054940,PREPARED FOODS,9
1054941,PRODUCE,9
1054942,SCHOOL AND OFFICE SUPPLIES,9


In [10]:
retail_df[['family', 'store_nbr']].iloc[:5]

Unnamed: 0,family,store_nbr
0,AUTOMOTIVE,1
1,BABY CARE,1
2,BEAUTY,1
3,BEVERAGES,1
4,BOOKS,1


### More examples

In [11]:
oil = pd.read_csv("../DataFrames/oil.csv")
oil

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.20
...,...,...
1213,2017-08-25,47.65
1214,2017-08-28,46.40
1215,2017-08-29,46.46
1216,2017-08-30,45.96


In [17]:
oil.columns = ['date', 'price'] # assigning a list to the .columns attribute renames all columns, in order
oil.head()

Unnamed: 0,date,price
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2
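
Assigning to `.columns` replaces every label at once, so the list must have one entry per column. As a sketch of an alternative, `DataFrame.rename` can change selected columns by name. The small `oil_demo` frame below is a hypothetical stand-in for the oil data.

```python
import pandas as pd

# Hypothetical stand-in for the oil DataFrame
oil_demo = pd.DataFrame({
    "date": ["2013-01-02", "2013-01-03"],
    "dcoilwtico": [93.14, 92.97],
})

# Option 1: replace all labels positionally (length must match the column count)
renamed_all = oil_demo.copy()
renamed_all.columns = ["date", "price"]

# Option 2: rename only the named column; returns a new DataFrame
renamed_one = oil_demo.rename(columns={"dcoilwtico": "price"})
```

`rename` leaves the original DataFrame unchanged unless the result is assigned back.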


In [14]:
oil.date # returns a pandas Series

0       2013-01-01
1       2013-01-02
2       2013-01-03
3       2013-01-04
4       2013-01-07
           ...    
1213    2017-08-25
1214    2017-08-28
1215    2017-08-29
1216    2017-08-30
1217    2017-08-31
Name: date, Length: 1218, dtype: object

In [18]:
oil.price # returns a pandas Series

0         NaN
1       93.14
2       92.97
3       93.12
4       93.20
        ...  
1213    47.65
1214    46.40
1215    46.46
1216    45.96
1217    47.26
Name: price, Length: 1218, dtype: float64

In [19]:
oil[["price"]] # passing a list of column names into the brackets returns a DataFrame, even for a single column

Unnamed: 0,price
0,
1,93.14
2,92.97
3,93.12
4,93.20
...,...
1213,47.65
1214,46.40
1215,46.46
1216,45.96
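
The difference between single and double brackets can be verified directly. This is a minimal sketch on a hypothetical two-row frame.

```python
import pandas as pd

# Hypothetical two-row frame for illustration
df = pd.DataFrame({
    "date": ["2013-01-02", "2013-01-03"],
    "price": [93.14, 92.97],
})

as_series = df["price"]    # single brackets with a string: one-dimensional Series
as_frame = df[["price"]]   # brackets with a list: two-dimensional DataFrame

series_shape = as_series.shape
frame_shape = as_frame.shape
```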


### Accessing data with iloc
The .iloc accessor selects rows and columns of a DataFrame by their integer positions. It is used with square brackets, not parentheses
* The first argument selects rows, and the second selects columns
* Remember that .iloc works on positional indices, not labels
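
Position and label only coincide when the index is the default 0, 1, 2, ... range. The sketch below uses a hypothetical frame with a custom integer index to show that `.iloc` ignores the labels.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions
df = pd.DataFrame(
    {"family": ["BOOKS", "BEAUTY", "PRODUCE"], "sales": [1.0, 2.0, 3.0]},
    index=[10, 20, 30],
)

first_row = df.iloc[0]    # row at position 0, whose label happens to be 10
cell = df.iloc[0, 1]      # row position 0, column position 1 ("sales")
same_row = df.loc[10]     # the same row, selected by its label instead
```

Here `df.iloc[10]` would raise an `IndexError`, because there are only three row positions.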

In [21]:
retail_df.iloc[:5, :] # the first 5 rows with [:5] and all columns with [:]

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0


In [23]:
retail_df.iloc[:, 1:4] # a subset of columns and all rows: [:] selects every row, and [1:4] selects column positions 1 through 3 (the stop position is exclusive)

Unnamed: 0,date,store_nbr,family
0,2016-01-01,1,AUTOMOTIVE
1,2016-01-01,1,BABY CARE
2,2016-01-01,1,BEAUTY
3,2016-01-01,1,BEVERAGES
4,2016-01-01,1,BOOKS
...,...,...,...
1054939,2017-08-15,9,POULTRY
1054940,2017-08-15,9,PREPARED FOODS
1054941,2017-08-15,9,PRODUCE
1054942,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES


### Accessing data with .loc
The .loc accessor selects rows and columns of a DataFrame by their labels. Like .iloc, it is used with square brackets
* The first argument selects rows, and the second selects columns
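
Label-based selection is easiest to see when the row labels are not integers. The following sketch uses a hypothetical frame indexed by product family.

```python
import pandas as pd

# Hypothetical frame with string row labels
df = pd.DataFrame(
    {"sales": [1.0, 2.0, 3.0], "onpromotion": [0, 1, 0]},
    index=["BOOKS", "BEAUTY", "PRODUCE"],
)

value = df.loc["BEAUTY", "sales"]               # one row label, one column label
subset = df.loc[["BOOKS", "PRODUCE"], "sales"]  # list of row labels, one column
```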

In [24]:
retail_df.loc[:, "date"] # accessing all rows and the "date" column. This returns a pandas Series

0          2016-01-01
1          2016-01-01
2          2016-01-01
3          2016-01-01
4          2016-01-01
              ...    
1054939    2017-08-15
1054940    2017-08-15
1054941    2017-08-15
1054942    2017-08-15
1054943    2017-08-15
Name: date, Length: 1054944, dtype: object

In [26]:
retail_df.loc[:, ["date"]] # passing a list of labels to .loc returns a DataFrame instead of a Series

Unnamed: 0,date
0,2016-01-01
1,2016-01-01
2,2016-01-01
3,2016-01-01
4,2016-01-01
...,...
1054939,2017-08-15
1054940,2017-08-15
1054941,2017-08-15
1054942,2017-08-15


In [28]:
retail_df.loc[:, ["date", "sales"]] # add more labels to the list to select more columns

Unnamed: 0,date,sales
0,2016-01-01,0.000
1,2016-01-01,0.000
2,2016-01-01,0.000
3,2016-01-01,0.000
4,2016-01-01,0.000
...,...,...
1054939,2017-08-15,438.133
1054940,2017-08-15,154.553
1054941,2017-08-15,2419.729
1054942,2017-08-15,121.000


In [29]:
retail_df.loc[:, "date":"sales"] # label slices also work. Unlike .iloc, the stop label is inclusive

Unnamed: 0,date,store_nbr,family,sales
0,2016-01-01,1,AUTOMOTIVE,0.000
1,2016-01-01,1,BABY CARE,0.000
2,2016-01-01,1,BEAUTY,0.000
3,2016-01-01,1,BEVERAGES,0.000
4,2016-01-01,1,BOOKS,0.000
...,...,...,...,...
1054939,2017-08-15,9,POULTRY,438.133
1054940,2017-08-15,9,PREPARED FOODS,154.553
1054941,2017-08-15,9,PRODUCE,2419.729
1054942,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000
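
The two slicing conventions can be compared side by side. This sketch uses a hypothetical two-row frame that only borrows the column names of the retail data.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the retail data
df = pd.DataFrame({
    "id": [1, 2],
    "date": ["2016-01-01", "2016-01-01"],
    "store_nbr": [1, 1],
    "family": ["AUTOMOTIVE", "BABY CARE"],
    "sales": [0.0, 0.0],
})

by_position = df.iloc[:, 1:4]          # positions 1, 2, 3 (stop is exclusive)
by_label = df.loc[:, "date":"sales"]   # "date" through "sales" (stop is inclusive)

position_cols = list(by_position.columns)
label_cols = list(by_label.columns)
```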


### More loc and iloc examples

In [31]:
oil['euro_price'] = oil['price'] * 1.1 # create a new column called euro_price with adjusted prices
oil.head()

Unnamed: 0,date,price,euro_price
0,2013-01-01,,
1,2013-01-02,93.14,102.454
2,2013-01-03,92.97,102.267
3,2013-01-04,93.12,102.432
4,2013-01-07,93.2,102.52


In [38]:
oil.iloc[:10, -2:] # the first 10 rows and the last two columns (the prices), excluding the date column

Unnamed: 0,price,euro_price
0,,
1,93.14,102.454
2,92.97,102.267
3,93.12,102.432
4,93.2,102.52
5,93.21,102.531
6,93.08,102.388
7,93.81,103.191
8,93.6,102.96
9,94.27,103.697


In [39]:
oil.loc[:5, ["date", "euro_price"]] # rows labeled 0 through 5 (inclusive) and the columns named in the list, skipping the price column

Unnamed: 0,date,euro_price
0,2013-01-01,
1,2013-01-02,102.454
2,2013-01-03,102.267
3,2013-01-04,102.432
4,2013-01-07,102.52
5,2013-01-08,102.531
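
The output above has six rows because `.loc[:5]` includes the row labeled 5. A small sketch, assuming a default integer index and a hypothetical `prices` frame, shows how this differs from `.iloc[:5]`.

```python
import pandas as pd

# Hypothetical frame with the default 0..6 integer index
prices = pd.DataFrame({"price": [93.14, 92.97, 93.12, 93.20, 93.21, 93.08, 93.81]})

by_label = prices.loc[:5]       # labels 0 through 5, stop label included
by_position = prices.iloc[:5]   # positions 0 through 4, stop position excluded

n_label_rows = len(by_label)
n_position_rows = len(by_position)
```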


In [40]:
oil.loc[:5, ["euro_price", "date"]] # order matters. We can reorganize by putting the labels in the list order we want.

Unnamed: 0,euro_price,date
0,,2013-01-01
1,102.454,2013-01-02
2,102.267,2013-01-03
3,102.432,2013-01-04
4,102.52,2013-01-07
5,102.531,2013-01-08
