# <font color='MidnightBlue'> Pandas DataFrames

**Pandas DataFrames are the Python equivalent of an Excel or SQL table**, which we'll use to store and analyze data. They are a tabular data structure made up of columns and rows.

* Each column of data in a DataFrame is a Pandas Series that shares the same row index

* The column headers work as a column index that contains the Series names

**Topics covered:**
* DataFrame Basics
* Exploring DataFrames
* Accessing & Dropping Data
* Blank & Duplicate Values
* Sorting & Filtering
* Modifying Columns
* Pandas Data Types
* Memory Optimization



![Capture.PNG](https://i.ibb.co/NyMsBxw/Capture.png)

## <font color='MidnightBlue'> DataFrame Basics

DataFrames have these key properties:

* **shape** – the number of rows and columns in a DataFrame (the index is not considered a column)
* **index** – the row index in a DataFrame, represented as a range of integers (axis=0)
* **columns** – the column index in a DataFrame, represented by the Series names (axis=1)
* **axes** – the row and column indices in a DataFrame
* **dtypes** – the data type for each Series in a DataFrame (they can be different!)


In [1]:
import numpy as np
import pandas as pd

# A common practice is to create a path variable to pass to read_csv
path = r"C:\Users\andre\OneDrive\Ambiente de Trabalho\NumPy & Pandas\Pandas Course Resources\retail\retail_2016_2017.csv"

retail_df = pd.read_csv(path)
retail_df.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0


In [2]:
retail_df.shape

(1054944, 6)

In [3]:
retail_df.index

RangeIndex(start=0, stop=1054944, step=1)

In [4]:
retail_df.columns

Index(['id', 'date', 'store_nbr', 'family', 'sales', 'onpromotion'], dtype='object')

In [5]:
retail_df.axes

[RangeIndex(start=0, stop=1054944, step=1),
 Index(['id', 'date', 'store_nbr', 'family', 'sales', 'onpromotion'], dtype='object')]

In [6]:
retail_df.dtypes

id               int64
date            object
store_nbr        int64
family          object
sales          float64
onpromotion      int64
dtype: object

**Data types:**
When we import our data from an external source like SQL or a flat file like CSV or Excel, Pandas will try to guess the data types when creating the DataFrame, so we should always check the data types and change them if needed.
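
As a minimal sketch of checking and changing data types, the toy DataFrame below (hypothetical values that only mimic the retail columns) starts with a date and a numeric column stored as text and converts them with `pd.to_datetime()` and `.astype()`:

```python
import pandas as pd

# Toy data (hypothetical values) mimicking the retail columns
toy_df = pd.DataFrame({
    "date": ["2016-01-01", "2016-01-02"],
    "store_nbr": [1, 2],
    "sales": ["0.0", "13.5"],   # numbers stored as text
})

# date and sales are text here (shown as object in most Pandas versions)
print(toy_df.dtypes)

# Convert the text columns to more useful types
toy_df["date"] = pd.to_datetime(toy_df["date"])
toy_df["sales"] = toy_df["sales"].astype("float64")

print(toy_df.dtypes)
```

After the conversion we can do arithmetic on `sales` and use datetime operations on `date`.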

## <font color='MidnightBlue'> Exploring a DataFrame

After we read in the data, several helpful DataFrame methods let us quickly understand what's inside the DataFrame and get a sense of the statistics for our columns.

![Capture.PNG](https://i.ibb.co/qnQynCp/Capture.png)

In [7]:
retail_df.tail(3)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
1054941,3000885,2017-08-15,9,PRODUCE,2419.729,148
1054942,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.0,8
1054943,3000887,2017-08-15,9,SEAFOOD,16.0,0


In [8]:
retail_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1054944 entries, 0 to 1054943
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   id           1054944 non-null  int64  
 1   date         1054944 non-null  object 
 2   store_nbr    1054944 non-null  int64  
 3   family       1054944 non-null  object 
 4   sales        1054944 non-null  float64
 5   onpromotion  1054944 non-null  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 48.3+ MB


In [9]:
retail_df.describe()

Unnamed: 0,id,store_nbr,sales,onpromotion
count,1054944.0,1054944.0,1054944.0,1054944.0
mean,2473416.0,27.5,457.7225,5.937977
std,304536.2,15.58579,1317.155,18.08632
min,1945944.0,1.0,0.0,0.0
25%,2209680.0,14.0,2.0,0.0
50%,2473416.0,27.5,24.0,0.0
75%,2737151.0,41.0,262.0,3.0
max,3000887.0,54.0,124717.0,741.0


In [10]:
# We can use include='all' to return statistics for all columns, not only numeric columns
retail_df.describe(include='all').round()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
count,1054944.0,1054944,1054944.0,1054944,1054944.0,1054944.0
unique,,592,,33,,
top,,2017-02-19,,LAWN AND GARDEN,,
freq,,1782,,31968,,
mean,2473416.0,,28.0,,458.0,6.0
std,304536.0,,16.0,,1317.0,18.0
min,1945944.0,,1.0,,0.0,0.0
25%,2209680.0,,14.0,,2.0,0.0
50%,2473416.0,,28.0,,24.0,0.0
75%,2737151.0,,41.0,,262.0,3.0


## <font color='MidnightBlue'> Accessing DataFrame Columns

We can access a DataFrame column using bracket notation. Once we've accessed a single column, we can use Pandas Series operations on it, since each column is a Series.

In [11]:
retail_df['family']

0                          AUTOMOTIVE
1                           BABY CARE
2                              BEAUTY
3                           BEVERAGES
4                               BOOKS
                      ...            
1054939                       POULTRY
1054940                PREPARED FOODS
1054941                       PRODUCE
1054942    SCHOOL AND OFFICE SUPPLIES
1054943                       SEAFOOD
Name: family, Length: 1054944, dtype: object

In [12]:
retail_df['family'].nunique()

33

In [13]:
retail_df['family'].value_counts().loc[['AUTOMOTIVE','DELI']]

AUTOMOTIVE    31968
DELI          31968
Name: family, dtype: int64

We can also select multiple columns with a list of column names between brackets.
* This is ideal for selecting non-consecutive columns in a DataFrame

In [14]:
retail_df[['family','onpromotion']]

Unnamed: 0,family,onpromotion
0,AUTOMOTIVE,0
1,BABY CARE,0
2,BEAUTY,0
3,BEVERAGES,0
4,BOOKS,0
...,...,...
1054939,POULTRY,0
1054940,PREPARED FOODS,1
1054941,PRODUCE,148
1054942,SCHOOL AND OFFICE SUPPLIES,8


## <font color='MidnightBlue'> Accessing DataFrame Columns with `.iloc` and `.loc`

In general, the `.iloc` and `.loc` accessors are the suggested way to access DataFrame data, particularly when accessing more than one column.
They work exactly as in Pandas Series and allow us to access rows/columns, either by their positional index (`.iloc`) or by their labels (`.loc`).

In [15]:
retail_df.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0


In [16]:
retail_df.iloc[:5,:2]

Unnamed: 0,id,date
0,1945944,2016-01-01
1,1945945,2016-01-01
2,1945946,2016-01-01
3,1945947,2016-01-01
4,1945948,2016-01-01


In [17]:
retail_df.loc[:5,['id','sales']]

Unnamed: 0,id,sales
0,1945944,0.0
1,1945945,0.0
2,1945946,0.0
3,1945947,0.0
4,1945948,0.0
5,1945949,0.0


In [18]:
retail_df.loc[:2,:'family']

Unnamed: 0,id,date,store_nbr,family
0,1945944,2016-01-01,1,AUTOMOTIVE
1,1945945,2016-01-01,1,BABY CARE
2,1945946,2016-01-01,1,BEAUTY
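
One detail worth illustrating is how the two accessors treat the end of a slice. The sketch below uses a small toy DataFrame (hypothetical values) to show that `.iloc` excludes the stop position while `.loc` includes the stop label, which is why `retail_df.loc[:5, ...]` above returned six rows:

```python
import pandas as pd

# Toy data (hypothetical values) with the default integer index 0..3
toy_df = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "sales": [0.0, 5.0, 2.5, 7.0],
})

by_position = toy_df.iloc[:2, :]   # positions 0 and 1 (stop is excluded)
by_label = toy_df.loc[:2, :]       # labels 0, 1 and 2 (stop is included)

print(len(by_position), len(by_label))
```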


## <font color='MidnightBlue'>Dropping Rows & Columns

The `.drop()` method drops rows and columns from a DataFrame

* Specify axis=0 to drop rows by label and axis=1 to drop columns

We should drop unnecessary columns early in our workflow to save memory and make DataFrames more manageable. Ideally they shouldn't even be imported to the DataFrame.

In [19]:
retail_df.drop('id'
              ,axis=1
              ,inplace=True)

retail_df.head()

Unnamed: 0,date,store_nbr,family,sales,onpromotion
0,2016-01-01,1,AUTOMOTIVE,0.0,0
1,2016-01-01,1,BABY CARE,0.0,0
2,2016-01-01,1,BEAUTY,0.0,0
3,2016-01-01,1,BEVERAGES,0.0,0
4,2016-01-01,1,BOOKS,0.0,0


In [20]:
retail_df.drop(range(5)
              ,axis=0).head()

Unnamed: 0,date,store_nbr,family,sales,onpromotion
5,2016-01-01,1,BREAD/BAKERY,0.0,0
6,2016-01-01,1,CELEBRATION,0.0,0
7,2016-01-01,1,CLEANING,0.0,0
8,2016-01-01,1,DAIRY,0.0,0
9,2016-01-01,1,DELI,0.0,0
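
To avoid importing unnecessary columns in the first place, `pd.read_csv()` accepts a `usecols` argument. The sketch below reads from an in-memory string (hypothetical CSV content standing in for a real file) so it can run on its own:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = (
    "id,date,store_nbr,sales\n"
    "1,2016-01-01,1,0.0\n"
    "2,2016-01-01,2,3.5\n"
)

# usecols limits the import to the listed columns, so 'id' is never loaded
small_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["date", "store_nbr", "sales"],
)

print(small_df.columns.tolist())
```

With a real file we would pass the path instead of the `io.StringIO` object.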


## <font color='MidnightBlue'>Identifying Duplicate Rows</font>

When we read in data we might find many duplicates. The `.duplicated()` method identifies duplicate rows of data.

* Specify subset=column(s) to look for duplicates across a subset of columns instead of all columns

* Use `.drop_duplicates()` method to drop duplicate rows from a DataFrame

![Capture.PNG](https://i.ibb.co/Gdb2TB3/Capture.png)

In [21]:
# Look for duplicate rows across all columns

retail_df.duplicated().sum()

0

In [22]:
# Look for duplicate values within a single column

retail_df.duplicated('family')

0          False
1          False
2          False
3          False
4          False
           ...  
1054939     True
1054940     True
1054941     True
1054942     True
1054943     True
Length: 1054944, dtype: bool
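
Since the retail data has no fully duplicated rows, here is a minimal sketch of `.drop_duplicates()` on a toy DataFrame (hypothetical values), including the `subset` and `keep` arguments:

```python
import pandas as pd

# Toy data (hypothetical values): rows 0 and 1 are identical
toy_df = pd.DataFrame({
    "family": ["BOOKS", "BOOKS", "DAIRY", "BOOKS"],
    "sales": [1.0, 1.0, 4.0, 2.0],
})

# Count rows that repeat an earlier row across all columns
n_full_dupes = toy_df.duplicated().sum()

# Drop fully duplicated rows (the first occurrence is kept by default)
deduped = toy_df.drop_duplicates()

# Keep only the last row seen for each family
last_per_family = toy_df.drop_duplicates(subset="family", keep="last")

print(n_full_dupes, len(deduped), last_per_family.index.tolist())
```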

## <font color='MidnightBlue'>Identifying Missing Data</font>

It's possible to identify missing data by column using the `.isna()` and `.sum()` methods. Additionally, the `.info()` method can also help identify null values.

Like with Pandas Series, the `.dropna()` and `.fillna()` methods allow us to **handle missing data** in a DataFrame by either removing them or replacing them with other values

In [23]:
retail_df.loc[:0,['family']] = np.nan
retail_df.head()

Unnamed: 0,date,store_nbr,family,sales,onpromotion
0,2016-01-01,1,,0.0,0
1,2016-01-01,1,BABY CARE,0.0,0
2,2016-01-01,1,BEAUTY,0.0,0
3,2016-01-01,1,BEVERAGES,0.0,0
4,2016-01-01,1,BOOKS,0.0,0


In [24]:
retail_df.info()
# We can see that there is a null entry for family column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1054944 entries, 0 to 1054943
Data columns (total 5 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   date         1054944 non-null  object 
 1   store_nbr    1054944 non-null  int64  
 2   family       1054943 non-null  object 
 3   sales        1054944 non-null  float64
 4   onpromotion  1054944 non-null  int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 40.2+ MB


In [25]:
retail_df.isna().sum()

date           0
store_nbr      0
family         1
sales          0
onpromotion    0
dtype: int64

In [26]:
retail_df.fillna({'family':'AUTOMOTIVE'}
                ,inplace=True)

retail_df.head(5)

Unnamed: 0,date,store_nbr,family,sales,onpromotion
0,2016-01-01,1,AUTOMOTIVE,0.0,0
1,2016-01-01,1,BABY CARE,0.0,0
2,2016-01-01,1,BEAUTY,0.0,0
3,2016-01-01,1,BEVERAGES,0.0,0
4,2016-01-01,1,BOOKS,0.0,0


In [27]:
retail_df.loc[:2,'sales'] = np.nan
retail_df.head()

Unnamed: 0,date,store_nbr,family,sales,onpromotion
0,2016-01-01,1,AUTOMOTIVE,,0
1,2016-01-01,1,BABY CARE,,0
2,2016-01-01,1,BEAUTY,,0
3,2016-01-01,1,BEVERAGES,0.0,0
4,2016-01-01,1,BOOKS,0.0,0


In [28]:
# We can also fill the na values with the mean 
retail_df.fillna({'sales':round(retail_df.loc[:,'sales'].mean(),2)}).head()

Unnamed: 0,date,store_nbr,family,sales,onpromotion
0,2016-01-01,1,AUTOMOTIVE,457.72,0
1,2016-01-01,1,BABY CARE,457.72,0
2,2016-01-01,1,BEAUTY,457.72,0
3,2016-01-01,1,BEVERAGES,0.0,0
4,2016-01-01,1,BOOKS,0.0,0


In [29]:
retail_df['sales'].fillna(round(retail_df.loc[:,'sales'].mean(),2))

0           457.720
1           457.720
2           457.720
3             0.000
4             0.000
             ...   
1054939     438.133
1054940     154.553
1054941    2419.729
1054942     121.000
1054943      16.000
Name: sales, Length: 1054944, dtype: float64
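
The cells above only fill missing values. As a minimal sketch of the other option, `.dropna()`, the toy DataFrame below (hypothetical values) drops rows with any missing value and then only rows where a chosen column is missing:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing family and one missing sale
toy_df = pd.DataFrame({
    "family": ["BOOKS", None, "DAIRY"],
    "sales": [1.0, 2.0, np.nan],
})

# Drop rows that have a missing value in any column
any_missing_dropped = toy_df.dropna()

# Only drop rows where 'sales' is missing
sales_missing_dropped = toy_df.dropna(subset=["sales"])

print(len(any_missing_dropped), len(sales_missing_dropped))
```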

## <font color='MidnightBlue'>Filtering DataFrames</font>


We can **filter the rows in a DataFrame** by passing a logical test into the `.loc[]` accessor, just like filtering a Series or a NumPy array

`.loc[rows,columns]`

In [30]:
retail_df.loc[retail_df['family'] == 'AUTOMOTIVE',:]

Unnamed: 0,date,store_nbr,family,sales,onpromotion
0,2016-01-01,1,AUTOMOTIVE,,0
33,2016-01-01,10,AUTOMOTIVE,0.0,0
66,2016-01-01,11,AUTOMOTIVE,0.0,0
99,2016-01-01,12,AUTOMOTIVE,0.0,0
132,2016-01-01,13,AUTOMOTIVE,0.0,0
...,...,...,...,...,...
1054779,2017-08-15,54,AUTOMOTIVE,8.0,0
1054812,2017-08-15,6,AUTOMOTIVE,7.0,0
1054845,2017-08-15,7,AUTOMOTIVE,5.0,0
1054878,2017-08-15,8,AUTOMOTIVE,4.0,0


In [31]:
retail_df.loc[retail_df['family'] == 'AUTOMOTIVE',['family','sales']]

Unnamed: 0,family,sales
0,AUTOMOTIVE,
33,AUTOMOTIVE,0.0
66,AUTOMOTIVE,0.0
99,AUTOMOTIVE,0.0
132,AUTOMOTIVE,0.0
...,...,...
1054779,AUTOMOTIVE,8.0
1054812,AUTOMOTIVE,7.0
1054845,AUTOMOTIVE,5.0
1054878,AUTOMOTIVE,4.0


In [32]:
# We can apply multiple filters by joining the logical tests with an & operator

retail_df.loc[retail_df['family'].isin(['BABY CARE','BOOKS']) & (retail_df['sales'] > 0),:]

Unnamed: 0,date,store_nbr,family,sales,onpromotion
1981,2016-01-02,15,BABY CARE,2.0,0
2179,2016-01-02,20,BABY CARE,3.0,0
2575,2016-01-02,31,BABY CARE,2.0,0
2641,2016-01-02,33,BABY CARE,1.0,0
2905,2016-01-02,40,BABY CARE,1.0,0
...,...,...,...,...,...
1053361,2017-08-15,15,BABY CARE,1.0,0
1053460,2017-08-15,18,BABY CARE,4.0,0
1053493,2017-08-15,19,BABY CARE,1.0,0
1053790,2017-08-15,27,BABY CARE,1.0,0


#### Filtering example: 

1 - Calculate the percentage of times store 25 had more than 2000 transactions

2 - Calculate the sum of transactions on these days

3 - Sum the transactions for stores 25 and 31 that occurred in May or June and had fewer than 2000 transactions


In [33]:
#1 - calculate the percentage of times store 25 had more than 2000 transactions

mask = (
    (retail_df['store_nbr'] == 25)
    & (retail_df['sales'] > 2000)
)

round(
    (retail_df.loc[mask,'sales'].count()
     / retail_df.loc[retail_df['store_nbr'] == 25,'sales'].count()) * 100
    ,2)

2.97

In [34]:
#2 - sum of transactions on these days

round(retail_df.loc[mask,'sales'].sum(),2)

1791673.73

In [35]:
#3 - sum the transactions for stores 25 and 31, that occurred in May or June, and had less than 2000 transactions

mask2 = (
    (retail_df['store_nbr'].isin([25,31]))
    & (retail_df['sales'] < 2000)
    & (retail_df['date'].str[5:7].isin(['05','06']))
)


round(retail_df.loc[mask2,'sales'].sum(),2)

1402054.57
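
The month filter above slices the date string. An alternative sketch, assuming the dates parse cleanly, converts the column with `pd.to_datetime()` and uses the `.dt.month` accessor instead (toy data with hypothetical values):

```python
import pandas as pd

# Toy data (hypothetical values)
toy_df = pd.DataFrame({
    "date": ["2016-05-03", "2016-07-10", "2017-06-21"],
    "sales": [100.0, 50.0, 25.0],
})

# Extract the month number from parsed dates rather than from string positions
months = pd.to_datetime(toy_df["date"]).dt.month

# Sum the sales that occurred in May or June
may_june_sales = toy_df.loc[months.isin([5, 6]), "sales"].sum()

print(may_june_sales)
```

This does not depend on the exact character positions of the month within the string.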

## <font color='MidnightBlue'>SQL syntax to filter DataFrames</font>

The `.query()` method lets us filter DataFrames with SQL-like syntax, combining any number of filtering conditions with the `and` and `or` keywords.

In [36]:
retail_df.query(
    "family in ['BOOKS','BEAUTY'] and sales > 5")

Unnamed: 0,date,store_nbr,family,sales,onpromotion
563,2016-01-01,25,BEAUTY,13.0,0
1850,2016-01-02,11,BEAUTY,10.0,0
2180,2016-01-02,20,BEAUTY,10.0,0
2312,2016-01-02,24,BEAUTY,9.0,1
2345,2016-01-02,25,BEAUTY,8.0,0
...,...,...,...,...,...
1054748,2017-08-15,53,BEAUTY,7.0,1
1054814,2017-08-15,6,BEAUTY,6.0,1
1054847,2017-08-15,7,BEAUTY,11.0,2
1054880,2017-08-15,8,BEAUTY,8.0,2


In [37]:
avg_sales = retail_df.loc[:,'sales'].mean()
avg_sales

457.7237886546999

We can also reference variables by using the `@` symbol

In [38]:
retail_df.query(
    "family in ['CLEANING','DAIRY'] and sales > @avg_sales")

Unnamed: 0,date,store_nbr,family,sales,onpromotion
568,2016-01-01,25,CLEANING,734.0,0
569,2016-01-01,25,DAIRY,1033.0,11
1789,2016-01-02,1,CLEANING,526.0,3
1790,2016-01-02,1,DAIRY,627.0,15
1822,2016-01-02,10,CLEANING,1216.0,4
...,...,...,...,...,...
1054853,2017-08-15,7,DAIRY,1279.0,25
1054885,2017-08-15,8,CLEANING,1198.0,13
1054886,2017-08-15,8,DAIRY,1330.0,24
1054918,2017-08-15,9,CLEANING,1439.0,25
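
To connect `.query()` back to the `.loc[]` filters from the previous section, this sketch runs the same filter both ways on a toy DataFrame (hypothetical values, and `min_sales` is an illustrative variable name) and checks that the results match:

```python
import pandas as pd

# Toy data (hypothetical values)
toy_df = pd.DataFrame({
    "family": ["BOOKS", "BEAUTY", "DAIRY", "BOOKS"],
    "sales": [2.0, 9.0, 30.0, 8.0],
})
min_sales = 5

# Filter with .query(), referencing the Python variable with @
with_query = toy_df.query("family in ['BOOKS', 'BEAUTY'] and sales > @min_sales")

# Same filter written as a Boolean mask passed to .loc[]
mask = toy_df["family"].isin(["BOOKS", "BEAUTY"]) & (toy_df["sales"] > min_sales)
with_loc = toy_df.loc[mask]

print(with_query.equals(with_loc))
```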


## <font color='MidnightBlue'>Sorting DataFrames</font>

We can sort a DataFrame by its indices using the `.sort_index()` method.

* This sorts rows (axis=0) by default but we can specify axis=1 to sort the columns

Additionally, we can also sort a DataFrame by its values using the `.sort_values()` method

* We can either sort by a single column or by multiple columns

In [39]:
retail_df.sort_index(ascending=False)

Unnamed: 0,date,store_nbr,family,sales,onpromotion
1054943,2017-08-15,9,SEAFOOD,16.000,0
1054942,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8
1054941,2017-08-15,9,PRODUCE,2419.729,148
1054940,2017-08-15,9,PREPARED FOODS,154.553,1
1054939,2017-08-15,9,POULTRY,438.133,0
...,...,...,...,...,...
4,2016-01-01,1,BOOKS,0.000,0
3,2016-01-01,1,BEVERAGES,0.000,0
2,2016-01-01,1,BEAUTY,,0
1,2016-01-01,1,BABY CARE,,0


In [40]:
retail_df.sort_index(axis=1)
#columns ordered in alphabetical order

Unnamed: 0,date,family,onpromotion,sales,store_nbr
0,2016-01-01,AUTOMOTIVE,0,,1
1,2016-01-01,BABY CARE,0,,1
2,2016-01-01,BEAUTY,0,,1
3,2016-01-01,BEVERAGES,0,0.000,1
4,2016-01-01,BOOKS,0,0.000,1
...,...,...,...,...,...
1054939,2017-08-15,POULTRY,0,438.133,9
1054940,2017-08-15,PREPARED FOODS,1,154.553,9
1054941,2017-08-15,PRODUCE,148,2419.729,9
1054942,2017-08-15,SCHOOL AND OFFICE SUPPLIES,8,121.000,9


In [41]:
retail_df.sort_values(['family','sales'],ascending=[True,False])

Unnamed: 0,date,store_nbr,family,sales,onpromotion
270138,2016-05-31,39,AUTOMOTIVE,255.0,0
532092,2016-10-25,39,AUTOMOTIVE,150.0,0
467940,2016-09-19,39,AUTOMOTIVE,98.0,0
165264,2016-04-02,46,AUTOMOTIVE,59.0,0
466389,2016-09-18,45,AUTOMOTIVE,57.0,0
...,...,...,...,...,...
1052270,2017-08-14,33,SEAFOOD,0.0,0
1052369,2017-08-14,36,SEAFOOD,0.0,0
1053029,2017-08-14,54,SEAFOOD,0.0,0
1054151,2017-08-15,36,SEAFOOD,0.0,0


## <font color='MidnightBlue'>Renaming Columns</font>

We can rename columns in two ways:

* In place, by assigning new names to the `columns` property
* With the `.rename()` method, using a dictionary that maps the old column names to the new column names

In [42]:
retail_df.columns = [col.upper() for col in retail_df.columns]
retail_df.columns

Index(['DATE', 'STORE_NBR', 'FAMILY', 'SALES', 'ONPROMOTION'], dtype='object')

In [43]:
retail_df.columns = ['date', 'store_number', 'family', 'sales', 'onpromotion']
retail_df.columns

Index(['date', 'store_number', 'family', 'sales', 'onpromotion'], dtype='object')

In [44]:
retail_df.rename(columns={'onpromotion':'on_promotion'}
                 ,inplace=True)
retail_df.columns

Index(['date', 'store_number', 'family', 'sales', 'on_promotion'], dtype='object')

## <font color='MidnightBlue'>Reordering Columns</font>

We can reorder columns with the `.reindex()` method by passing the column names in the desired order. This is useful when sorting won't suffice, since `.sort_index(axis=1)` can only order the column index alphabetically.

In [45]:
retail_df.head()

Unnamed: 0,date,store_number,family,sales,on_promotion
0,2016-01-01,1,AUTOMOTIVE,,0
1,2016-01-01,1,BABY CARE,,0
2,2016-01-01,1,BEAUTY,,0
3,2016-01-01,1,BEVERAGES,0.0,0
4,2016-01-01,1,BOOKS,0.0,0


In [46]:
retail_df.reindex(['date','family','store_number','sales','on_promotion']
                 ,axis=1
                 ).head(5)

Unnamed: 0,date,family,store_number,sales,on_promotion
0,2016-01-01,AUTOMOTIVE,1,,0
1,2016-01-01,BABY CARE,1,,0
2,2016-01-01,BEAUTY,1,,0
3,2016-01-01,BEVERAGES,1,0.0,0
4,2016-01-01,BOOKS,1,0.0,0


## <font color='MidnightBlue'> Arithmetic & Boolean Column Creation</font>

A very common task is creating new columns for further analysis. Creating columns from arithmetic between a Series and a scalar, or between two Series in our DataFrame, is straightforward in Pandas. We can **create columns with arithmetic** by assigning them Series operations:

* Simply specify the new column name and assign the operation of interest

In [47]:
retail_df.fillna({'sales':10}
                 ,inplace=True)

retail_df.head()

Unnamed: 0,date,store_number,family,sales,on_promotion
0,2016-01-01,1,AUTOMOTIVE,10.0,0
1,2016-01-01,1,BABY CARE,10.0,0
2,2016-01-01,1,BEAUTY,10.0,0
3,2016-01-01,1,BEVERAGES,0.0,0
4,2016-01-01,1,BOOKS,0.0,0


In [48]:
retail_df['tax'] = 0.23
retail_df.head()

Unnamed: 0,date,store_number,family,sales,on_promotion,tax
0,2016-01-01,1,AUTOMOTIVE,10.0,0,0.23
1,2016-01-01,1,BABY CARE,10.0,0,0.23
2,2016-01-01,1,BEAUTY,10.0,0,0.23
3,2016-01-01,1,BEVERAGES,0.0,0,0.23
4,2016-01-01,1,BOOKS,0.0,0,0.23


It's also possible to **create Boolean columns** by assigning them a logical test

In [49]:
retail_df['taxable_family'] = retail_df['family'] != 'BABY CARE'
retail_df.head()

Unnamed: 0,date,store_number,family,sales,on_promotion,tax,taxable_family
0,2016-01-01,1,AUTOMOTIVE,10.0,0,0.23,True
1,2016-01-01,1,BABY CARE,10.0,0,0.23,False
2,2016-01-01,1,BEAUTY,10.0,0,0.23,True
3,2016-01-01,1,BEVERAGES,0.0,0,0.23,True
4,2016-01-01,1,BOOKS,0.0,0,0.23,True


In [50]:
retail_df['sales_tax'] = np.where(retail_df['taxable_family']
                                 ,retail_df['sales'] * (1 + retail_df['tax'])
                                 ,retail_df['sales'])

retail_df.head()

Unnamed: 0,date,store_number,family,sales,on_promotion,tax,taxable_family,sales_tax
0,2016-01-01,1,AUTOMOTIVE,10.0,0,0.23,True,12.3
1,2016-01-01,1,BABY CARE,10.0,0,0.23,False,10.0
2,2016-01-01,1,BEAUTY,10.0,0,0.23,True,12.3
3,2016-01-01,1,BEVERAGES,0.0,0,0.23,True,0.0
4,2016-01-01,1,BOOKS,0.0,0,0.23,True,0.0


In [51]:
# Since Boolean values are interpreted as 1/0, we can also use them in arithmetic operations
# Note that non-taxable rows evaluate to 0 here instead of keeping their untaxed sales value

retail_df['sales_tax2'] = retail_df['taxable_family'] * retail_df['sales'] * (1 + retail_df['tax'])

retail_df.head()

Unnamed: 0,date,store_number,family,sales,on_promotion,tax,taxable_family,sales_tax,sales_tax2
0,2016-01-01,1,AUTOMOTIVE,10.0,0,0.23,True,12.3,12.3
1,2016-01-01,1,BABY CARE,10.0,0,0.23,False,10.0,0.0
2,2016-01-01,1,BEAUTY,10.0,0,0.23,True,12.3,12.3
3,2016-01-01,1,BEVERAGES,0.0,0,0.23,True,0.0,0.0
4,2016-01-01,1,BOOKS,0.0,0,0.23,True,0.0,0.0


## <font color='MidnightBlue'> PRO TIP: NumPy `select()`</font>

NumPy's `select()` function lets us create columns based on multiple conditions.

* This is more flexible than NumPy's `where()` function or Pandas' `where()` method

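We specify a list of **conditions** and a matching list of outcomes (**choices**), one per condition. Then we call `np.select()`, passing in the conditions, the choices, and an optional default outcome used when none of the conditions are met. When more than one condition is true for a row, the first matching condition determines the outcome.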

In [52]:
# Create a random sample

sample_df = retail_df.sample(n=5
                ,random_state=12)

sample_df

Unnamed: 0,date,store_number,family,sales,on_promotion,tax,taxable_family,sales_tax,sales_tax2
900772,2017-05-21,33,BOOKS,0.0,0,0.23,True,0.0,0.0
111531,2016-03-03,38,MEATS,157.181,1,0.23,True,193.33263,193.33263
846754,2017-04-21,18,CLEANING,569.0,41,0.23,True,699.87,699.87
207234,2016-04-26,23,PLAYERS AND ELECTRONICS,4.0,0,0.23,True,4.92,4.92
68304,2016-02-08,25,PLAYERS AND ELECTRONICS,3.0,0,0.23,True,3.69,3.69


In [53]:
# Use the np.select() function

conditions = [
    (sample_df['family'] == 'PLAYERS AND ELECTRONICS') & (sample_df['store_number'].isin([20,21,22,23,24,25]))
    ,pd.to_datetime(sample_df['date']).dt.year == 2017
]

choices = ['Electronic 20s Store','2017 big sale']

sample_df['Sale_Name'] = np.select(conditions
                                   ,choices
                                   ,default='No Sale'
                                  )

sample_df

Unnamed: 0,date,store_number,family,sales,on_promotion,tax,taxable_family,sales_tax,sales_tax2,Sale_Name
900772,2017-05-21,33,BOOKS,0.0,0,0.23,True,0.0,0.0,2017 big sale
111531,2016-03-03,38,MEATS,157.181,1,0.23,True,193.33263,193.33263,No Sale
846754,2017-04-21,18,CLEANING,569.0,41,0.23,True,699.87,699.87,2017 big sale
207234,2016-04-26,23,PLAYERS AND ELECTRONICS,4.0,0,0.23,True,4.92,4.92,Electronic 20s Store
68304,2016-02-08,25,PLAYERS AND ELECTRONICS,3.0,0,0.23,True,3.69,3.69,Electronic 20s Store
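
Because `np.select()` uses the first condition that is true for each row, the order of the conditions matters. A minimal sketch on toy data (hypothetical values and band names):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy_df = pd.DataFrame({"sales": [5, 50, 500]})

conditions = [
    toy_df["sales"] > 100,   # checked first
    toy_df["sales"] > 10,    # only decides rows where the first test failed
]
choices = ["high", "medium"]

# 500 satisfies both conditions but gets 'high' because that test comes first
toy_df["sales_band"] = np.select(conditions, choices, default="low")

print(toy_df)
```

If the two conditions were swapped, the row with 500 would be labelled `medium`, so overlapping conditions are usually listed from most to least specific.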


## <font color='MidnightBlue'> Mapping Values to Columns</font>

The `.map()` method maps values to a column or an entire DataFrame

* We can pass a dictionary with existing values as the keys and the new values as the values


In [72]:
mapping_dict = {'AUTOMOTIVE':'Outside',
                'BABY CARE':'Inside',
                'BEAUTY':'Inside',
                'BEVERAGES':'Inside',
                'BOOKS':'Inside'
                }


retail_df['Outside?'] = retail_df['family'].map(mapping_dict)

retail_df.head()
                

Unnamed: 0,date,store_number,family,sales,on_promotion,tax,taxable_family,sales_tax,sales_tax2,Outside?
0,2016-01-01,1,AUTOMOTIVE,10.0,0,0.23,True,12.3,12.3,Outside
1,2016-01-01,1,BABY CARE,10.0,0,0.23,False,10.0,0.0,Inside
2,2016-01-01,1,BEAUTY,10.0,0,0.23,True,12.3,12.3,Inside
3,2016-01-01,1,BEVERAGES,0.0,0,0.23,True,0.0,0.0,Inside
4,2016-01-01,1,BOOKS,0.0,0,0.23,True,0.0,0.0,Inside
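
Note that `mapping_dict` above only covers five families. Values without a matching key are mapped to `NaN`, as this small sketch shows (toy Series, and `'Unknown'` is a hypothetical fallback label):

```python
import pandas as pd

families = pd.Series(["AUTOMOTIVE", "BOOKS", "SEAFOOD"])
mapping_dict = {"AUTOMOTIVE": "Outside", "BOOKS": "Inside"}

# SEAFOOD has no key in the dictionary, so it becomes NaN
mapped = families.map(mapping_dict)

# One option is to replace the unmapped values with a fallback label
mapped_filled = mapped.fillna("Unknown")

print(mapped_filled.tolist())
```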


## <font color='MidnightBlue'> PRO TIP: Column Creation with `.assign()`</font>

The `.assign()` method **creates multiple columns** at once and returns a DataFrame

* This can be chained together with other data processing methods

In [74]:
path = r"C:\Users\andre\OneDrive\Ambiente de Trabalho\NumPy & Pandas\Pandas Course Resources\retail\retail_2016_2017.csv"

retail_df2 = pd.read_csv(path)
retail_df2.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0


In [97]:
retail_df2.assign(
    tax_amount = round(retail_df2['sales'] * 0.05,2).map(lambda x: f'{x} €'),
    on_promotion_flag = retail_df2['onpromotion'] > 0,
    year = retail_df2['date'].str[:4].astype('int')
).query("year == 2017 and family == 'AUTOMOTIVE'").head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,tax_amount,on_promotion_flag,year
650430,2596374,2017-01-01,1,AUTOMOTIVE,0.0,0,0.0 €,False,2017
650463,2596407,2017-01-01,10,AUTOMOTIVE,0.0,0,0.0 €,False,2017
650496,2596440,2017-01-01,11,AUTOMOTIVE,0.0,0,0.0 €,False,2017
650529,2596473,2017-01-01,12,AUTOMOTIVE,0.0,0,0.0 €,False,2017
650562,2596506,2017-01-01,13,AUTOMOTIVE,0.0,0,0.0 €,False,2017
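
`.assign()` also accepts functions. Each function receives the DataFrame as it stands at that point in the call, so a later column can build on one created earlier in the same `.assign()`. A minimal sketch on toy data (hypothetical values and column names):

```python
import pandas as pd

# Toy data (hypothetical values)
toy_df = pd.DataFrame({"sales": [100.0, 200.0]})

result = toy_df.assign(
    # Uses an existing column
    tax_amount=lambda df: df["sales"] * 0.05,
    # Uses the column created just above
    sales_with_tax=lambda df: df["sales"] + df["tax_amount"],
)

print(result)
```

The original `toy_df` is left unchanged, which is what makes `.assign()` convenient inside method chains.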


## <font color='MidnightBlue'> Memory optimization </font>

Memory Optimization Best Practices:

* 1 - Drop unnecessary columns (when possible, avoid reading them in at all)
* 2 - Convert object types to numeric or datetime datatypes where possible
* 3 - Downcast numeric data to the smallest appropriate bit size
* 4 - Use the categorical datatype for columns where the number of unique values < rows / 2
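
The practices above can be sketched on a toy DataFrame (hypothetical values that mimic the retail columns). This is an illustration rather than a tuned recipe, and downcasting floats trades precision for memory, so it should only be used where that is acceptable:

```python
import pandas as pd

# Toy data (hypothetical values), repeated to make memory differences visible
toy_df = pd.DataFrame({
    "date": ["2016-01-01", "2016-01-02"] * 500,
    "store_nbr": [1, 54] * 500,
    "family": ["BOOKS", "DAIRY"] * 500,
    "sales": [0.0, 13.5] * 500,
})

# deep=True also counts the memory used by the strings themselves
before = toy_df.memory_usage(deep=True).sum()

optimized = toy_df.assign(
    date=pd.to_datetime(toy_df["date"]),                                # text -> datetime
    store_nbr=pd.to_numeric(toy_df["store_nbr"], downcast="integer"),   # int64 -> smallest int
    family=toy_df["family"].astype("category"),                         # few unique values
    sales=pd.to_numeric(toy_df["sales"], downcast="float"),             # float64 -> float32
)

after = optimized.memory_usage(deep=True).sum()

print(before, after)
print(optimized.dtypes)
```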