<a href="https://colab.research.google.com/github/acedesci/scanalytics/blob/master/S04_Data_Structures_2/S4_LectureEx_Notebook_with_Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# S4 - Python Data Structures II - DataFrame examples (with code)
Programming topics covered in this section:
* Creating `DataFrame` and `Series` objects
* Reading and writing data using pandas
* Indexing, selecting and assigning
* Summary functions and operations
* Grouping and sorting
* Data types and missing values
* Renaming and combining

Examples include:
* Importing and analyzing data of cereal sales

----
## Preliminaries

**pandas** is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. This notebook provides a summary of the different components you have seen during the [Kaggle Pandas course](https://www.kaggle.com/learn/pandas). 

There are two core objects in pandas: 
* `Series`: A Series is a sequence of indexed values. This is similar to a dictionary, but a Series is much more flexible, and pandas offers a large number of pre-built functions and methods for it. You can think of a Series as a DataFrame with a single column.
* `DataFrame`: A DataFrame is a data table (similar to a simple data table in Excel with rows and columns). It contains a set of columns that share the same index. Each column holds the values that correspond to the labels in that index.

**Note that most operations in pandas return a new object; they are generally not done "in-place". Thus, if you want to keep a result, you need to assign it to a variable (examples are provided in this notebook).**
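
As a minimal sketch of this behaviour, consider the toy table below. The name `toy` and its `price` column are made up for illustration and are not part of the cereal data; `sort_values()` is covered later in this notebook.

```python
import pandas as pd

# Hypothetical toy table, not part of the cereal dataset
toy = pd.DataFrame({'price': [3.0, 1.5, 2.0]})

# sort_values() returns a new DataFrame; `toy` itself is left untouched
toy.sort_values(by='price')
print(toy['price'].tolist())         # still in the original order

# To keep the result, assign it to a variable
toy_sorted = toy.sort_values(by='price')
print(toy_sorted['price'].tolist())  # sorted copy
```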

---
## 1. Creating `DataFrame` and `Series` objects

To use the pandas library, you first need to import it; you only need to do this once per Jupyter Notebook. As with other libraries, there are several ways to do this. The first approach is the following:
``` python
import pandas
```
By doing so, you then need to write `pandas.` in front of every pandas element you want to access. For example, to create a `DataFrame`, you would need to do the following:
``` python
pandas.DataFrame({'Yes':[50, 21], 'No':[131, 2]})
```
---
The second approach consists of importing the library under an alias (i.e., a different name). A convention when importing pandas is to import it under the name `pd`. This reduces the amount of typing afterwards. For example, the previous code would reduce to:
``` python
import pandas as pd
pd.DataFrame({'Yes':[50, 21], 'No':[131, 2]})
```
---
The third approach consists of importing only the elements you need from the library. For example, if we just need the `DataFrame` object, we could do the following:
``` python
from pandas import DataFrame
DataFrame({'Yes':[50, 21], 'No':[131, 2]})
```
---
Ok, let's stick with the convention and import `pandas` with the following (**NOTE**: this step is important as we need to import pandas before using it!):

In [None]:
import pandas as pd

### `Series`
It is possible to create a `Series` by providing a list. A `Series` can be considered as one column of a `DataFrame`.

In [None]:
pd.Series([4, 5, 2, 9])

### `DataFrame`
The two main objects of interest in the `pandas` library are the `DataFrame` and `Series` objects. It is possible to create a `DataFrame` by passing a dictionary where the keys are the column labels and the values are lists containing the elements of the corresponding column. Note that all lists should be of the same length.

In [None]:
pd.DataFrame({'Yes':[50, 21], 'No':[131, 2]})
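
Both constructors also accept an `index` argument that replaces the default row labels 0, 1, 2, ..., and a `Series` can be given a `name`. The sketch below uses made-up labels purely for illustration.

```python
import pandas as pd

# Hypothetical row labels chosen for illustration
votes = pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]},
                     index=['Product A', 'Product B'])

# A Series has a single column, so it takes one overall name instead of column labels
sales = pd.Series([30, 35, 40],
                  index=['2015 Sales', '2016 Sales', '2017 Sales'],
                  name='Product A')

print(votes)
print(sales)
```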

---
## 2. Reading and writing data using pandas

### Importing Data
One common use of pandas is to import the data from a file containing a data table (which can be prepared in Excel). 

**Importing file in Jupyter Notebook**: This can be done using the commands `read_csv()` for CSV (Comma-Separated Values) files or `read_excel()` for Excel files. Other options are also available in the pandas documentation.

**If you use Jupyter Notebook**, you can run the code below. 

In [None]:
# Import the column WEEK_END_DATE as dates
df = pd.read_csv('salesCerealsOriginal.csv', parse_dates=['WEEK_END_DATE']) 
df.shape

**Importing file in Colab Notebook**: Since Colab runs in the cloud, one simple way is to upload the file directly. **If you use Colab**, run the code below and then click on "Choose Files" to upload it.

In [None]:
from google.colab import files
import io 

uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['salesCerealsOriginal.csv']), parse_dates=['WEEK_END_DATE']) 
df.shape

We can quickly get an overview of the data using the `.head()` method.

In [None]:
df.head()

In [None]:
df.dtypes


**Note:**
* `read_csv()`: a pandas function that reads a CSV file into a `DataFrame`
* `parse_dates`: a parameter of the `read_csv()` function. It converts the specified columns to the datetime data type. Check [this page](https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.read_csv.html) for more information about the parameters of the `read_csv()` function
* `df.shape`: returns a tuple representing the dimensions of the DataFrame `df`, i.e., (number of rows, number of columns)
* `.head()`: returns the first `n` rows of the object based on position. It is useful for quickly checking that your object has the right type of data in it. By default, `n=5`, so `df.head()` returns only the first 5 rows of the CSV file read and assigned to `df`
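
To see the effect of `parse_dates` without needing the file, here is a small sketch that reads the same CSV text twice. The two-row text and its date format are assumptions made for illustration; `io.StringIO` simply stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for a file on disk
csv_text = "WEEK_END_DATE,UNITS\n2009-01-14,10\n2009-01-21,12\n"

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['WEEK_END_DATE'])

print(raw.dtypes)     # WEEK_END_DATE is read as text
print(parsed.dtypes)  # WEEK_END_DATE is read as a datetime column
print(parsed.shape)   # (number of rows, number of columns)
```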

Here is a description of the variables in the previous `DataFrame`. This file will also be reused in subsequent sessions.

| VARIABLE NAME | DESCRIPTION | 
|:----|:----|
|WEEK_END_DATE|week ending date|
|STORE_NUM|store number|
|UPC|(Universal Product Code) product specific identifier|
|UNITS|units sold|
|VISITS|number of unique purchases (baskets) that included the product|
|HHS|# of purchasing households|
|SPEND|total spend (i.e., $ sales)|
|PRICE|actual amount charged for the product at shelf|
|BASE_PRICE|base price of item|
|FEATURE|product was in in-store circular|
|DISPLAY|product was a part of in-store promotional display|
|TPR_ONLY|temporary price reduction only (i.e., shelf tag only, product was reduced in price but not on display or in an advertisement)|
|DESCRIPTION|product description|
|CATEGORY|category of product|
|SUB_CATEGORY|sub-category of product|



### Note: Exporting Data
**In Jupyter Notebook**: Exporting data can be done with methods such as `to_csv()` and `to_excel()`.



In [None]:
df.to_csv("downloaded_data.csv")

**In Colab**: After running the block above, you can download using the following function of the module `files`

In [None]:
files.download("downloaded_data.csv")
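
By default, `to_csv()` also writes the row index as the first column of the file. The sketch below, on a made-up two-row table, compares the default output with `index=False`; calling `to_csv()` without a path returns the CSV text instead of writing a file.

```python
import io
import pandas as pd

# Hypothetical toy table, not part of the cereal dataset
toy = pd.DataFrame({'UPC': [111, 222], 'UNITS': [5, 7]})

with_index = toy.to_csv()                # header starts with an empty index column
without_index = toy.to_csv(index=False)  # only the data columns are written

print(with_index)
print(without_index)

# Reading the second version back gives the same two columns
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip.columns.tolist())
```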

## 3. Indexing, Selecting and Assigning

You can access any column of a `DataFrame` by using either the dot notation or square brackets `[]`. Then, you can access a specific row within this column by using the brackets `[]` again with the row number (more precisely, the row's index label). Note that this second indexing can sometimes lead to surprising results depending on how the index is set.

Let's take a look at the UPC column:

In [None]:
df.UPC

In [None]:
df.UPC[0]

In [None]:
df['UPC'][0]

Note that we can use either `df.UPC[0]` or `df['UPC'][0]` to access the  Universal Product Code of the first item in the list.
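
The surprising results mentioned above come from the fact that the second bracket looks up an index *label*, not a position. A minimal sketch, assuming a made-up table whose index does not start at 0:

```python
import pandas as pd

# Hypothetical table whose index labels are 10, 20 and 30
toy = pd.DataFrame({'UPC': [111, 222, 333]}, index=[10, 20, 30])

print(toy.UPC[10])       # 111: the bracket looks up the index label 10
print(toy.UPC.iloc[0])   # 111: iloc always counts positions from 0

# There is no row labelled 0, so this lookup fails with a KeyError
try:
    toy.UPC[0]
    found = True
except KeyError:
    found = False
print(found)
```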

### Index-based selection
Another way to access elements of a `DataFrame` is **index-based selection**, i.e., the `iloc[row_number, col_number]` indexer, which counts rows and columns from 0. For example, to access the first row of the third column (i.e., the column 'UPC'), we do

In [None]:
df.iloc[0,2]

### Label-based selection
A third option is **label-based selection**, where we select by the row and column labels, i.e., the `.at[]` and `.loc[]` indexers. Continuing our last example, to access a *single* cell using the index and column names, you can use `.at[index_name, col_name]` as follows.

In [None]:
df.at[0, 'UPC']

Alternatively, we can also use `.loc[]` as follows. 

In [None]:
df.loc[0,'UPC']

Note that the difference between `loc[]` and `at[]` is that `.loc[]` is much more flexible: we can use it to obtain the values from more than one cell. In particular, we can use `loc[]` with conditional statements (boolean masks). For example, to show all the rows related to the UPC 1111085319, we can do the following.

In [None]:
df.loc[df.UPC == 1111085319]

These boolean expressions can be combined with the symbol `&` to denote an element-wise 'and', and the symbol `|` to denote an element-wise 'or'. Each condition must be wrapped in parentheses. Please note that we cannot directly use the keywords `and` and `or`, as pandas accepts only the symbols for element-wise boolean expressions (see [link](https://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#boolean-indexing)).

In [None]:
df.loc[(df.UPC == 1111085319) & (df.FEATURE == True)]
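
The sketch below illustrates both operators on a made-up three-row table, and shows what happens when the keyword `and` is used instead of `&`.

```python
import pandas as pd

# Hypothetical toy table, not part of the cereal dataset
toy = pd.DataFrame({'UPC': [111, 111, 222], 'FEATURE': [1, 0, 1]})

both = toy.loc[(toy.UPC == 111) & (toy.FEATURE == 1)]     # element-wise 'and'
either = toy.loc[(toy.UPC == 111) | (toy.FEATURE == 1)]   # element-wise 'or'
print(len(both), len(either))

# The keyword `and` needs a single True/False value for a whole Series,
# which is ambiguous, so pandas raises a ValueError
try:
    toy.loc[(toy.UPC == 111) and (toy.FEATURE == 1)]
    keyword_and_works = True
except ValueError:
    keyword_and_works = False
print(keyword_and_works)
```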

### Selecting a subset of columns and rows
We can also use `.loc[row_range, [col_names]]` to slice a subset of rows and columns. We can also assign this to a new DataFrame object.

In [None]:
df_col_row_subset = df.loc[:5,['UPC','UNITS']]
df_col_row_subset # display the result
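
One detail worth checking when slicing rows: `.loc[]` slices by label and includes the end label, whereas `.iloc[]` slices by position and excludes the end position. With the default integer index, `df.loc[:5, ...]` therefore returns six rows. A small sketch on a made-up table:

```python
import pandas as pd

# Hypothetical toy table, not part of the cereal dataset
toy = pd.DataFrame({'UPC': [111, 222, 333, 444], 'UNITS': [5, 7, 2, 9]})

by_label = toy.loc[:2, ['UPC', 'UNITS']]   # labels 0, 1 and 2
by_position = toy.iloc[:2, [0, 1]]         # positions 0 and 1 only

print(len(by_label))      # 3
print(len(by_position))   # 2
```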

We can also use double brackets `[[]]` to indicate the columns we want to select (this will select all rows)

In [None]:
df_col_subset = df[['UPC','UNITS']]
df_col_subset # display the new object of DataFrame

## 4. Summary Functions 

Some interesting summary functions are the `describe()`, `unique()` and `value_counts()` methods. By running these methods below, we are able to find that there is only one store in this dataset and that only 7 different UPCs are in the dataset. We also find that there is data for 156 weeks.
These functions are very useful to do a preliminary analysis of our data.

* `describe()`: This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the data type of the input (string or numerical).

In [None]:
df.describe()
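
To illustrate what "type-aware" means, the sketch below calls `describe()` on a numerical column and on a text column of a made-up table (the category values are assumptions, not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy table, not part of the cereal dataset
toy = pd.DataFrame({'PRICE': [1.5, 2.0, 2.5],
                    'CATEGORY': ['COLD CEREAL', 'COLD CEREAL', 'OATMEAL']})

price_summary = toy.PRICE.describe()        # count, mean, std, min, quartiles, max
category_summary = toy.CATEGORY.describe()  # count, unique, top, freq

print(price_summary)
print(category_summary)
```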

* `unique()`: shows a list of unique values

NOTE: `df.COLUMN_NAME` can be used to select a *single* column. However, this only works when the column name contains no spaces or special characters. Otherwise, you can use `df['col_name']` instead.

In [None]:
df.STORE_NUM.unique()

* `.value_counts()`: shows a list of unique values and how often they occur in the dataset

In [None]:
df.UPC.value_counts()

In [None]:
df.WEEK_END_DATE.value_counts()

### Computing values for a new column

We can also apply standard calculations to all the elements in a column and add the results to a new column.

In [None]:
df['zNormSPEND'] = (df.SPEND - df.SPEND.mean()) / df.SPEND.std()
df['zNormSPEND']
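
As a quick sanity check of this normalization, the sketch below applies the same formula to three made-up values. Note that pandas' `std()` computes the sample standard deviation (dividing by n - 1).

```python
import pandas as pd

# Hypothetical toy values, not part of the cereal dataset
toy = pd.DataFrame({'SPEND': [10.0, 20.0, 30.0]})
toy['zNormSPEND'] = (toy.SPEND - toy.SPEND.mean()) / toy.SPEND.std()

print(toy['zNormSPEND'].tolist())   # [-1.0, 0.0, 1.0]
print(toy['zNormSPEND'].mean())     # 0.0
print(toy['zNormSPEND'].std())      # 1.0
```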

---
### Apply
The `apply()` method applies a function to each row or each column. You can review more details here: [link](https://www.w3resource.com/pandas/dataframe/dataframe-apply.php) (NOTE: this function is not a basic one, but it is quite useful in practice. It is fine if you do not fully understand it).

As an example, let's compute the price rebate in percentage $\left(\frac{\mathit{BASE\_PRICE} - \mathit{PRICE}}{\mathit{BASE\_PRICE}}\right) $ when the item is in the circular ($\mathit{FEATURE}$  is equal to 1). We then add this computation to the column `REBATE_PERC`.

In [None]:
def rebate_perc(row):
    if row.FEATURE == 1:
        rebate = (row.BASE_PRICE - row.PRICE) / row.BASE_PRICE
        return rebate
    else:
        return 0

# axis='columns' applies the function to each row
# axis='index' applies the function to each column
df['REBATE_PERC'] = df.apply(rebate_perc, axis='columns')
df.head()

Alternatively, we can use a **list comprehension** to compute each element one by one (if you use this option, please make sure you generate an element for every row). The code below is equivalent to the one above. This option is not recommended for complex operations, as the code becomes difficult to read.

In [None]:
df['REBATE_PERC_V2'] = [(df.at[i,'BASE_PRICE'] - df.at[i,'PRICE']) / df.at[i,'BASE_PRICE'] \
                        if df.at[i,'FEATURE'] == 1 else 0 for i in df.index]
df.head()
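
A further option, not used elsewhere in this notebook, is to compute the rebate for the whole column at once and then keep it only where the condition holds, using the `where()` method of a `Series`. The sketch below is one possible way to do this on a made-up table; the column name `REBATE_PERC_V3` is hypothetical.

```python
import pandas as pd

# Hypothetical toy table, not part of the cereal dataset
toy = pd.DataFrame({'BASE_PRICE': [4.0, 4.0, 2.0],
                    'PRICE': [3.0, 4.0, 1.5],
                    'FEATURE': [1, 0, 0]})

# Rebate computed for every row, regardless of FEATURE
rebate = (toy.BASE_PRICE - toy.PRICE) / toy.BASE_PRICE

# where() keeps the value where the condition holds and uses 0 elsewhere
toy['REBATE_PERC_V3'] = rebate.where(toy.FEATURE == 1, 0)

print(toy['REBATE_PERC_V3'].tolist())   # [0.25, 0.0, 0.0]
```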

---
## 5. Grouping and Sorting

### Group by
The `groupby()` method is similar to PivotTables in Excel. It allows you to group the items by one or several dimensions (the Rows and Columns area in the PivotTable). After grouping the items, you can then compute summary functions over these groups (the Values area in the PivotTable). These summary functions can be your own, defined by a `lambda` expression or a standard function definition (using `def ...`).

As a first example, let's compute the mean selling price of each `UPC` over the data.

In [None]:
df.groupby('UPC').PRICE.mean()

As a second example, let's compute the number of times that each `UPC` appeared in the circular. Remember that this variable is a binary variable.

In [None]:
df.groupby('UPC').FEATURE.sum()
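
The sketch below shows, on a made-up table, how several summaries can be computed at once with `agg()`, and how a user-defined summary can be passed as a `lambda` expression.

```python
import pandas as pd

# Hypothetical toy table, not part of the cereal dataset
toy = pd.DataFrame({'UPC': [111, 111, 222, 222],
                    'PRICE': [2.0, 3.0, 5.0, 4.0]})

# Several built-in summaries at once: one row per UPC, one column per summary
summary = toy.groupby('UPC').PRICE.agg(['min', 'max', 'mean'])
print(summary)

# A user-defined summary: the price range within each group
price_range = toy.groupby('UPC').PRICE.agg(lambda s: s.max() - s.min())
print(price_range)
```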

### Sort
It is also possible to sort `DataFrame` or `Series` by one or multiple variables by using the `sort_values()` method. If you want to instead sort the index, you need to use the `sort_index()` method.

For example, below, we sort first by `PRICE` and, if there are any ties, then by `BASE_PRICE`. The sorting is done in descending order.

In [None]:
df.sort_values(by=['PRICE', 'BASE_PRICE'], ascending=False)
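
The `ascending` argument also accepts one value per sorting column. The sketch below, on a made-up table, sorts `PRICE` in ascending order and breaks ties by descending `BASE_PRICE`, then uses `sort_index()` to return to the order of the row labels.

```python
import pandas as pd

# Hypothetical toy table, not part of the cereal dataset
toy = pd.DataFrame({'PRICE': [2.0, 3.0, 2.0], 'BASE_PRICE': [2.5, 3.0, 2.0]})

# Ascending PRICE, and descending BASE_PRICE among rows with equal PRICE
mixed = toy.sort_values(by=['PRICE', 'BASE_PRICE'], ascending=[True, False])
print(mixed)

# sort_index() orders the rows by their index labels again
print(mixed.sort_index())
```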

## 6.  Data Types and Missing Values

Pandas assigns different types to the different columns. Some common types are:
- `int` denoted by `int64`
- `float` denoted by `float64`
- `str` denoted by `object`
- dates denoted by `datetime64[ns]`

Pandas may denote your dates as `object` if it doesn't understand that these are dates. It doesn't make a big difference unless you want to use some of the nice functions available in pandas to manipulate dates.

It is possible to check the type of a column by using the `dtype` property. It is also possible to check the type of all columns by using the `dtypes` property as below.

In [None]:
df.dtypes

If you want to convert the type of a column, you can use the `astype()` method.

For example, if we want to convert the type of `STORE_NUM` from `int64` to `float64`, we do the following:

In [None]:
df.STORE_NUM.astype('float64')  # or just use float
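
The same idea applies to dates that were read as text. As a sketch (with made-up dates), `pd.to_datetime()` converts such a column, after which the date-specific accessors under `.dt` become available. Like `astype()`, it returns a new object, so the result must be assigned.

```python
import pandas as pd

# Hypothetical toy table with dates stored as text
toy = pd.DataFrame({'WEEK_END_DATE': ['2009-01-14', '2009-01-21']})
print(toy.WEEK_END_DATE.dtype)              # a text type, not a date type

# Assign the converted column back to keep the result
toy['WEEK_END_DATE'] = pd.to_datetime(toy.WEEK_END_DATE)
print(toy.WEEK_END_DATE.dtype)              # a datetime64 type
print(toy.WEEK_END_DATE.dt.year.tolist())   # [2009, 2009]
```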

### Missing data
It is possible to check whether there are null values in a column of a `DataFrame` (or a full `DataFrame`) by using the `isnull()` function (or its companion `notnull()`). These return a boolean mask indicating, for each element, whether it is null (denoted by `NaN`). Note that `NaN` is a floating-point value, so a column of integers that contains `NaN` is typically stored as `float64`.

Let's check below if we have any missing values in the column `SPEND`.

In [None]:
pd.isnull(df.SPEND)

Let's now check if there are any missing values in any columns.

In [None]:
pd.isnull(df).sum()
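
Since the cereal data has no missing values, here is a small sketch on a made-up `Series` with one missing entry, showing the mask returned by `isnull()` and the effect of `NaN` on the data type.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
units = pd.Series([4, np.nan, 7])

print(units.dtype)               # float64: the integers were converted to hold NaN
print(units.isnull().tolist())   # [False, True, False]
print(units.isnull().sum())      # 1 missing value
print(units.notnull().sum())     # 2 available values
```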

If we had found any missing values, we could have replaced them by using the `fillna()` method. It is also possible to replace other values by using the `replace()` method.

Since the table does not contain any null values, we first manually add a new column with no values (which will automatically contain `None`).

In [None]:
df['NewColumn'] = None
df['NewColumn'] # display this column

In [None]:
df[['UPC', 'UNITS', 'NewColumn']].isnull() # the new column return True

If you want to replace such values by something else, you can do so by simply calling the method `fillna()`. Note that if you want to keep the results, you need to assign it to a new variable (which can be the same variable of this original DataFrame object if you want to replace it)

In [None]:
df = df.fillna('Unavailable')
df[['NewColumn']] # the values would now change
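
Note that `df.fillna('Unavailable')` fills the missing values of *every* column with the same text. To use a different replacement per column, `fillna()` also accepts a dictionary. A minimal sketch on a made-up table:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one missing value per column
toy = pd.DataFrame({'UNITS': [4.0, np.nan], 'DESCRIPTION': ['OATS', None]})

# The dictionary maps each column name to its own replacement value
filled = toy.fillna({'UNITS': 0, 'DESCRIPTION': 'Unavailable'})
print(filled)
```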

Alternatively, you can replace specific values. For example, if you want to replace the `STORE_NUM` 367 with some text, you can use the `replace()` method (note that there is only one store number in the data).

In [None]:
df['STORE_TXT'] = df.STORE_NUM.replace(367, 'Kwik-E-Mart')
df

---
## 7. Renaming and Combining

It is also possible to rename indexes or columns by using the `rename()` method. An elegant way to use this method is by providing a dictionary where the keys are the current index/column labels and the values are the new labels.

For example, let's rename the `WEEK_END_DATE` column to `END_DATE`.

In [None]:
df.rename(columns={'WEEK_END_DATE': 'END_DATE'}) 
# Note that the resulting DataFrame is not assigned back to the original one if we don't indicate df = ...

As another example, let's rename the index 0 to 'First row' and then the index 1 to 'Second row'.

In [None]:
df.rename(index={0:'First row', 1:'Second row'})
# Note that the resulting DataFrame is not assigned back to the original one if we don't indicate df = ...
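
Combining is not demonstrated in the cells above. As a minimal sketch, assuming two made-up tables with the same columns, `pd.concat()` stacks their rows into a single `DataFrame`:

```python
import pandas as pd

# Hypothetical weekly tables with identical columns
week1 = pd.DataFrame({'UPC': [111, 222], 'UNITS': [5, 7]})
week2 = pd.DataFrame({'UPC': [111, 222], 'UNITS': [6, 4]})

# Stack the rows; ignore_index=True renumbers the rows from 0
combined = pd.concat([week1, week2], ignore_index=True)
print(combined)
```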

---
## 8. Plotting data
Plotting methods allow for a handful of plot styles other than the default line plot. These styles can be selected with the `kind` keyword argument of `plot()`, and include:

* `'bar'` or `'barh'` for bar plots
* `'hist'` for histogram
* `'box'` for boxplot
* `'kde'` or `'density'` for density plots
* `'area'` for area plots
* `'scatter'` for scatter plots
* `'hexbin'` for hexagonal bin plots
* `'pie'` for pie plots

You can check [this page](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) for more information.

As an example, let's visualize the number of units (`'UNITS'` column in our data frame). As `.plot()` is a method of `DataFrame` objects, we use double brackets `[[]]` to indicate the columns we want to visualize. A line plot is the default option of the `.plot()` method.

In the following code, we choose the `UNITS` (and `VISITS`) of one product to be plotted. Since the index is used for the x-axis, we set `WEEK_END_DATE` as the index.

In [None]:
single_prod_df = df.loc[(df.UPC == 1111085319)][['UNITS', 'VISITS', 'WEEK_END_DATE']]
# we can set the index to be 'WEEK_END_DATE' (note that there is no duplicated value of 'WEEK_END_DATE' for each product)
single_prod_df = single_prod_df.set_index('WEEK_END_DATE')
single_prod_df.plot()

In [None]:
# we can also change the size of the plot
single_prod_df.plot(figsize=(12,3))

In [None]:
# We can also plot a single series by using:
single_prod_df.UNITS.plot()

We can combine some of the functions and methods of `DataFrame` objects to visualize meaningful information. For instance, if we are interested in visualizing the number of sold units of each product (`'UPC'`), we can do:

In [None]:
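# Aggregate only the column we plot: summing every column can fail on
# date or text columns in recent pandas versions
df2 = df.groupby('UPC')[['UNITS']].sum()
df2.plot(y='UNITS', kind='barh')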

Or if we are interested in visualizing the average price and base price for each product, we can do:

In [None]:
# Average only the two price columns; text columns have no mean
df3 = df.groupby('UPC')[['PRICE', 'BASE_PRICE']].mean()
df3.plot(y=['PRICE', 'BASE_PRICE'], kind='barh')