<img src="https://miro.medium.com/max/1080/1*_oSOImPmBFeKj8vqE4FCkQ.jpeg" title="Pandas" width="150" height="100"/>

[Pandas](https://pandas.pydata.org/) is a fast, powerful, flexible and easy to use **open source Python package for data management**, built on top of the Python programming language.

The word <i>pandas</i> is a creative abbreviation for Python Data Analysis Library. The name derives from the term "panel data", used in econometrics to describe observations related to the same individuals over different periods of time. Pandas was developed by Wes McKinney in 2008.

Pandas is one of the most widely used tools in data science, particularly for *data wrangling*. It provides fast, easy-to-use functions for nearly every aspect of **data manipulation**: **reading data, modifying data (data wrangling and cleansing), analyzing data and writing data**.

Pandas is built on top of the 'NumPy' package, meaning a lot of the structure of NumPy is used or replicated in Pandas. Data in pandas is often used to feed plotting functions from `Matplotlib`, and machine learning algorithms in `Scikit-learn`.

There are two types of data structures that are created in Pandas: <i>Series</i> and <i>DataFrames</i> (We will see their difference in the paragraphs below).  By creating these data structures with rows and columns, the data becomes much easier to work with.

This notebook will cover the following topics:

1.	Getting Started
    a.	Viewing Pandas Version that's been installed in AW
    b.	Importing Pandas
2.	Pandas Series 
    a.	Creating a series
    b.	Identifying Missing values within a Series
    c.	Adding Series together
3.	Pandas Dataframes
    a.	Creating a dataframe
    b.	Summarizing dataframe info
4.	Row & Column Selection (filtering) from a dataframe
    a.	Indices
    b.	Indexing operator
    c.	loc & iloc
    d.	Row selection using a boolean mask
    e.	Multi-indexed dataframes (with loc and iloc)
5.	Modifying Dataframes by Adding New Columns or Removing Columns/Rows
6.	Function application and mapping
7.	DataFrame sorting and ranking
8.	Data Aggregation (GroupBy operations)


<b>Note:</b> This section is intended to be more "hands-on" so that the user can become more familiar with these data structures in Python.

Since Pandas is all about working with Data, we will be illustrating a lot of the Pandas concepts using a dataset. An entire modeling methodology is explored, starting from the basics of data exploration and treatment and ending by exploring different techniques for predictive analytics (logistic regression, decision trees, gradient boosting, etc.) The dataset we will use is a credit risk dataset containing credit card default information of clients in Taiwan. 

What follows is a brief description of the 25 variables in the dataset:
<b>ID</b>: ID of each client
<b>LIMIT_BAL</b>: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
<b>SEX</b>: Gender (1 = male; 2 = female).
<b>EDUCATION</b>: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
<b>MARRIAGE</b>: Marital status (1 = married; 2 = single; 3 = others).
<b>AGE</b>: Age (year).

History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows:

<b>PAY_0</b>:  the repayment status in September, 2005;
<b>PAY_2</b>: the repayment status in August, 2005; . . .;
<b>PAY_3</b>: . . .
<b>PAY_4</b>: . . .
<b>PAY_5</b>: . . .
<b>PAY_6</b>: the repayment status in April, 2005. 
The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

Amount of bill statement (NT dollar).

<b>BILL_AMT1</b>: amount of bill statement in September, 2005;
<b>BILL_AMT2</b>: amount of bill statement in August, 2005; . . .;
<b>BILL_AMT3</b>: . . .;
<b>BILL_AMT4</b>: . . .;
<b>BILL_AMT5</b>: . . .;
<b>BILL_AMT6</b>: amount of bill statement in April, 2005.

Amount of previous payment (NT dollar).

<b>PAY_AMT1</b>: amount paid in September, 2005;
<b>PAY_AMT2</b>: amount paid in August, 2005; . . .;
<b>PAY_AMT3</b>: . . .;
<b>PAY_AMT4</b>: . . .;
<b>PAY_AMT5</b>: . . .;
<b>PAY_AMT6</b>: amount paid in April, 2005;
<b>default.payment.next.month</b>: payment default (1 = yes; 0 = no)

Source: UCI ML Repository: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients





Let's start by confirming that the `Pandas` library has already been installed in this Analytics Workbench Tenant. Don't worry about understanding the code in the next paragraph; it's a popular snippet you'll want to copy and re-use:

In [3]:
import platform, sys, subprocess, re
print("Platform OS: {} version {}".format(platform.system(),platform.release()))
print("Python version: {}".format(sys.version))

cmd = [sys.executable, '-m', 'pip', 'freeze']
pkgs = []
my_keys = ['Module', 'Version']
for line in subprocess.check_output(cmd).decode("utf-8").splitlines():
    # Split "name==version"; lines without "==" (e.g. URL installs) get an empty version
    name, _, version = line.partition('==')
    pkgs.append([name, version])

# print in a nice Zeppelin table
print("These {} modules are available in Python:".format(len(pkgs)))
print("%table\n{}\t{}".format(*my_keys))
for i in pkgs:
    print("{}\t{}".format(*i))
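
If only the pandas version is needed, a shorter check is possible. The sketch below assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library; the variable name `pandas_version` is only illustrative.

```python
from importlib.metadata import version

# Look up the installed pandas version without importing pandas itself
pandas_version = version("pandas")
print("pandas version:", pandas_version)
```

Once pandas has been imported, `pd.__version__` gives the same information.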

Next, let's import the `pandas` library into our notebook. Pandas is almost always imported as **pd**, so it's best to stick with this convention (it makes it easier to run code somebody else wrote).

In [5]:
import pandas as pd

A Series in `pandas` is a one-dimensional array (vector) of data. Series are one of the 2 fundamental data structures in `Pandas` and can store values of any type (integers, strings, etc.), with an index label attached to each value. A Series can be thought of as a single column in a table (or dataframe, which we'll learn more about in a moment). Under the hood, the values of a Series are typically stored in a NumPy `ndarray` (see the <i>NumPy</i> class). The general form of the constructor is: 

````
import pandas as pd
pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
````
In the real world, a Pandas Series is usually created by loading data from existing storage (for example, a CSV file). A Series can also be created from lists, dictionaries, etc. Below, we will provide example code to do the following:

* Generate a `pandas` Series object from a range object, using a manually entered list as the index.
* Transform a dictionary to a series (and vice-versa)
* Operations on series:
    * Returning a flag indicating which values are null
    * Adding 2 series together

A key concept mentioned above is that of an *index*:
"Pandas is a best friend to a Data Scientist, and index is the invisible soul behind pandas"- Some wise person on the internet

The concept of an index is critical for many pandas methods such as loc, iloc, filtering, stack/unstack, concat, merge, pivot, etc. You can think of an index as a set of labels applied to your rows and columns. For rows, the index holds the row labels, which usually identify each row uniquely. For columns, the index holds the column names (headers).
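
As a minimal sketch of the two kinds of index, the snippet below builds a tiny dataframe and prints its row and column labels. The data and the names `toy`, `client_a` and `client_b` are made up for illustration only.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the two indexes
toy = pd.DataFrame(
    {"income": [20000, 40000], "age": [31, 45]},
    index=["client_a", "client_b"],
)

print(toy.index)    # row labels
print(toy.columns)  # column labels (also an Index object)

# A (row label, column label) pair identifies a single cell
cell = toy.loc["client_b", "income"]
print(cell)
```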

We'll then quickly move on to dataframes, because that's where the real fun is.


In [7]:
#Series is a one dimensional array-like object. It also contains the index along with each value- by default, the indices are the same as the position number (starting at 0)
obj = pd.Series(range(5))
obj


In [8]:
# We can also provide our own index name into a Series.
obj = pd.Series(range(5), index=['a','b','c','d','e'])
obj


 
Considering that the values of a Series are stored in an `ndarray`, what is the output of this?
````
import numpy as np
obj = pd.Series(np.arange(5),index=['a','b','c','d','e'])
obj[['c','e']]
````
#### <font color=red>PLEASE *DON'T* CODE</font>


In [10]:
import numpy as np
obj = pd.Series(np.arange(5),index=['a','b','c','d','e'])
obj[['c','e']]

A Series can be transformed to a dictionary and vice-versa. The general expression is as follows: 

````
# from Series to dictionary
Series.to_dict(into=<class 'dict'>)

# from dictionary to Series
pd.Series(dictionary)
````

For example, we can convert the following dictionary to a Series.
````
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
````

In [12]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

obj = pd.Series(sdata)

obj


Let's consider the following Series.
````
sdata = {'Ohio': np.nan, 'Texas': 71000, 'Oregon': np.nan, 'Utah': 5000}
obj = pd.Series(sdata)
````
In the NumPy section, we saw that a missing value (`np.nan`) can be identified with the following: 
````
np.isnan(x)
````
In `pandas` we could do something similar: 
````
pd.isnull(x)
````
We can identify all the missing values in the Series above, as shown below:


In [14]:
sdata = {'Ohio': np.nan, 'Texas': 71000, 'Oregon': np.nan, 'Utah': 5000}

obj = pd.Series(sdata)

pd.isnull(obj)
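
Because the result of `pd.isnull` is itself a boolean Series, it can be summed or used as a filter. The following is a small sketch along those lines; the names `mask`, `n_missing` and `present` are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series({'Ohio': np.nan, 'Texas': 71000, 'Oregon': np.nan, 'Utah': 5000})

mask = obj.isnull()      # boolean Series, True where the value is missing
n_missing = mask.sum()   # True counts as 1, so the sum is the number of missing values
present = obj[~mask]     # keep only the non-missing entries

print("missing values:", n_missing)
print(present)
```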


We can perform other operations on Series objects as well. For example, let's consider the following two Series:
````
sdata1 = {'Ohio': np.nan, 'Texas': 71000, 'Oregon': np.nan, 'Utah': 5000}
obj1 = pd.Series(sdata1)
sdata2 = {'Ohio': np.nan, 'Texas': 71000, 'Oregon': np.nan, 'Utah': 5000}
obj2 = pd.Series(sdata2)
````
What will be the output of obj1 + obj2?
#### <font color=red>PLEASE *DON'T* CODE</font>

In [16]:
sdata1 = {'Ohio': np.nan, 'Texas': 71000, 'Oregon': np.nan, 'Utah': 5000}
obj1 = pd.Series(sdata1)
sdata2 = {'Ohio': np.nan, 'Texas': 71000, 'Oregon': np.nan, 'Utah': 5000}
obj2 = pd.Series(sdata2)
obj1 + obj2
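
In `obj1 + obj2` both Series share the same labels, so the only surprise is that missing values stay missing. When the labels differ, pandas first aligns the two Series on their index. The sketch below uses two small made-up Series (`s1`, `s2`) to show this, along with the `add` method and its `fill_value` argument as one possible way to avoid the resulting NaNs.

```python
import pandas as pd

s1 = pd.Series({'Ohio': 1, 'Texas': 2})
s2 = pd.Series({'Texas': 10, 'Utah': 20})

# Labels present in only one Series produce NaN
plain = s1 + s2

# Treat a label that is missing from one side as 0 before adding
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```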


A DataFrame is a 2-dimensional, table-like structure made up of rows and columns. Each column has a name (like a column header) and each row has an index (row label). It resembles a SQL table and can be seen as a dictionary of Series objects (each column can be viewed as a `pandas` Series). It is the most used pandas object and can be created in many ways: from existing dictionaries, 2D ndarrays, Series or other DataFrames. Most commonly, DataFrames are created from a (CSV) file. The general form of the constructor is: 

````
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
````

The dataframe is the primary pandas data structure and from now on we will mostly work with DataFrames for data wrangling, data cleansing, data enrichment and model development. 

The key difference between a pandas DataFrame object and a pandas Series object is that a Series is the data structure for a *single* column of a DataFrame (conceptually, a DataFrame behaves like a collection of Series that share the same row index).
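
The relationship between the two structures can be sketched in a few lines: a dictionary of Series becomes a DataFrame, and selecting one column gives a Series back. The names `ids`, `names` and `df_small` are illustrative.

```python
import pandas as pd

ids = pd.Series([1, 2, 3])
names = pd.Series(["Calvin", "Ron", "Daisy"])

# Build a DataFrame from a dictionary of Series (keys become column names)
df_small = pd.DataFrame({"id": ids, "name": names})

# Selecting a single column returns a Series that shares the DataFrame's row index
col = df_small["name"]

print(type(df_small).__name__)
print(type(col).__name__)
```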

In this section we will provide examples to do the following: 

* Create a DataFrame from an existing dictionary
* Create a Dataframe from external data (a CSV file)
* Summarize the information in a dataframe

We'll also introduce a couple of useful methods to manipulate the dataframe (in a pretty blunt way).

In [18]:
my_dict = {'id':[1,2,3,4,5],
           'name':['Calvin','Ron','Daisy','Jasmine','Juan'],
           'income':[20000,40000,20000,10000,50000]}

df = pd.DataFrame(my_dict)#,columns=['id','income','name'])

df


Notice that in the above dataframe, the values in the column index (top row) are the same as the column names and the values in the row index (first column) are the row position number (starting at 0).

We can also specify the values that are used for the row index:

In [20]:
my_dict = {'id':[1,2,3,4,5],
           'name':['Calvin','Ron','Daisy','Jasmine','Juan'],
           'income':[20000,40000,20000,10000,50000]}

df = pd.DataFrame(my_dict,columns=['id','income','name'],index=['row0','row1','row2','row3','row4',])

df


Pandas has many functions for reading external data into a dataframe:

```Python
pd.read_csv(loc)
pd.read_json(loc)
pd.read_html(loc)
pd.read_clipboard()
pd.read_excel(loc)
pd.read_sas(loc)
pd.read_pickle(loc)
pd.read_sql(loc)
pd.read_spss(loc)

#And others
```
Here (and most of the time in the real world) we are going to use ``pd.read_csv()`` to read external data, which has the following syntax:

```Python
pd.read_csv(filepath_or_buffer, sep,  decimal, names, dtype,  header, index_col, prefix, na_values )
```

It can read ".txt" and ".csv" files and has many more options than this, but here are the most useful:

| **Parameter** | **Default Value** | **Definition** |
| --- | --- | --- |
| filepath_or_buffer | required value | The full file path in string format |
| sep | ',' | A string that will be used to detect the column delimiter |
| decimal | '.' | Character to recognize as decimal point |
| names | None | A list containing the unique column names (if they are not in the data) |
| dtype | None | An optional dictionary to explicitly tell pandas the data type of each column |
| header | 'infer' | Row number to use as column names. The default behaves like ``0`` when `names` is not given; set it to ``None`` if the file has no header row |
| index_col | None | Optional column name/number (or list of names/numbers) to be used as index |
| prefix | None | Optional string to add as a prefix to every column name when there is no header (deprecated in pandas 1.4 and removed in 2.0) |
| na_values | None | Optional list containing numbers or strings to be identified as ``null`` values |
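
To see several of these parameters working together, here is a small self-contained sketch. It reads from an in-memory string instead of a real file, and the data, the `'missing'` marker and the name `df_demo` are all made up for illustration.

```python
import io
import pandas as pd

# Made-up data: ';' as column delimiter, ',' as decimal mark, 'missing' as a null marker
raw = io.StringIO(
    "id;name;income\n"
    "1;Calvin;20000,5\n"
    "2;Ron;missing\n"
    "3;Daisy;20000,0\n"
)

df_demo = pd.read_csv(
    raw,
    sep=';',                 # column delimiter
    decimal=',',             # read 20000,5 as 20000.5
    na_values=['missing'],   # treat this string as a null value
    index_col='id',          # use the id column as the row index
)

print(df_demo)
print(df_demo.dtypes)
```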


Inside AW3.0 Python can read data directly from S3. 

In [22]:
print("Importing required libraries")
import pandas as pd
import os, boto3, subprocess, re, sys, gc
from botocore.client import Config

print("All libraries successfully loaded!")

kms_key = os.environ['AW_S3_ENCRYPTION_KEY']

bucket_name = os.environ['AW_S3_STORAGE_BUCKET']
storage_key = os.environ['AW_S3_STORAGE_KEY'] + '/awdata/rawfiles/'
full_s3_location = 's3://' + bucket_name + '/' + storage_key 
print("full_s3_location: '{}'".format(full_s3_location))

df_twn= pd.read_csv(full_s3_location + "UCI_Credit_Card.csv",nrows=100)
z.show(df_twn.head(10))

Shown below are some common methods and attributes for getting quick information about the data contained in a dataframe (`describe()`, `info()`, `shape`, `index`, `columns.values`).


In [24]:
df_twn.describe()

In [25]:
df_twn.info()


In [26]:
df_twn.shape
print(df_twn.shape[0])

In [27]:
df_twn.index
# print(df_twn)

In [28]:
df_twn.columns.values

In [29]:
df_twn.dtypes

In [30]:
## find all integers
df_twn.select_dtypes('int64')

## find all numerics
# df_twn.select_dtypes(['int64','float64'])
## find all strings
# df_twn.select_dtypes(['object'])

In [31]:
# transform the data format
# Transform all data to strings (astype returns a new DataFrame; df_twn is left unchanged)
df_twn.astype('str')
# or:
# 'float64'
# 'int64'
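
`astype` also accepts a dictionary, which converts only the listed columns. The sketch below uses a small made-up dataframe (`df_demo`) rather than the Taiwan data, and shows that the original is left untouched unless the result is assigned.

```python
import pandas as pd

df_demo = pd.DataFrame({"SEX": [1, 2, 2], "AGE": [24, 35, 41]})

# Convert only the SEX column; astype returns a new DataFrame
converted = df_demo.astype({"SEX": "str"})

print(df_demo.dtypes)     # the original dataframe is unchanged
print(converted.dtypes)   # SEX is now a string column, AGE is still an integer
```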

To select a single column from a DataFrame (as a Series), simply access it by name. For example, given a DataFrame "df" with a column "col", the following code returns that column as a pandas Series:  

````
df['col']
````

Another, equivalent, way to do this is the following.  
````
df.col
````
However, the above code requires "col" to be a valid Python name, so `df.column` works, but `df.column name` would cause an error and you would need to use `df['column name']`.
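
A quick sketch of both access styles on a made-up dataframe (`df_cols`), including a column whose name contains a space:

```python
import pandas as pd

df_cols = pd.DataFrame({"income": [20000, 40000], "column name": [1, 2]})

# Attribute access and bracket access return the same Series
same = df_cols.income.equals(df_cols["income"])
print(same)

# A name containing a space is not a valid attribute, so only brackets work
spaced = df_cols["column name"]
print(spaced)
```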

To return a *dataframe* instead of a series, you can pass in 'col1' as a 1-element list:
````
df[['col1']]
````

Of course, you can also pass in multi-element lists, e.g. ['col1','col2','col3'], which return a dataframe containing *multiple columns*, a subset of the original:
````
df[['col1','col2','col3']]
````

To select a subset of columns *and* rows by their index values (labels), we use the method `loc[]`. To select a subset of columns *and* rows by their numeric locations (starting at 0) we use the method `iloc[]`: 

````
df.loc[row_labels, col_labels]
df.iloc[row_positions, col_positions]
````
Finally, we will show how to select rows using a boolean mask: 
````
df[boolean_mask]
````

We will see each of these in more depth in the following sections.


The `loc` method is used to fetch the specified rows and columns by labels. First we specify the row labels we want to keep, then the column labels we want to keep.

In [34]:
my_dict = {'id':[1,2,3,4,5],
           'name':['Calvin','Ron','Daisy','Jasmine','Juan'],
           'income':[20000,40000,20000,10000,50000]}

df = pd.DataFrame(my_dict,columns=['id','income','name'],index=['row0','row1','row2','row3','row4',])

df

Predict the output:

````
df.loc[['row0'],['id','income']]
````

#### <font color=red>PLEASE *DON'T* CODE</font>


In [36]:
df.loc[['row0'],['id','income']]

The `iloc` method is used to fetch the specified rows and columns by integer position. First we specify the numeric positions of the rows we want to keep, then the numeric positions of the columns we want to keep.



Predict the output:

````
df.iloc[[0,1],[1]]
````

#### <font color=red>PLEASE *DON'T* CODE</font>

In [39]:
df.iloc[[0,1],[1]]


We can also pass a slice to both loc and iloc. 

* For `loc`, the slice is left-inclusive and right- __inclusive__
* For `iloc`, the slice is left-inclusive and right- __exclusive__

In [41]:
df

In [42]:
# df.loc['row2':'row3','income':'name']
df.iloc[1:2,1:3]
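
The difference in slice endpoints is easiest to see by counting rows. The sketch below rebuilds a small labelled dataframe (named `df_s` here so it does not overwrite `df`) and slices it both ways.

```python
import pandas as pd

df_s = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5], "income": [20000, 40000, 20000, 10000, 50000]},
    index=["row0", "row1", "row2", "row3", "row4"],
)

by_label = df_s.loc["row1":"row3"]   # label slice: row3 is included
by_position = df_s.iloc[1:3]         # position slice: stops before position 3

print(len(by_label), len(by_position))
```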

In the *extreme* case where you don't specify any index values or column names, loc and iloc give the same results:

In [44]:
 
# create a 4x4 matrix of numbers- note that the labels and the numeric indices are the same
df = pd.DataFrame(np.arange(16).reshape(4,4))

df


In [45]:
df.loc[[0,3],[0,1,2]]

In [46]:
df.iloc[[0,3],[0,1,2]]

If we consider the following DataFrame, how can we access the values of the <b>income</b> column? 


|   | id | name    |income|
| - |:--:| --------:|:---:|
| 0 | 1  | Calvin   |20000|
| 1 | 2  | Ron      |40000|
| 2 | 3  | Daisy    |20000|
| 3 | 4  | Jasmine  |10000|
| 4 | 5  | Juan     |50000|

In [48]:
df = pd.DataFrame({'id':[1,2,3,4,5],'name':['Calvin','Ron','Daisy','Jasmine','Juan'],'income':[20000,40000,20000,10000,50000]})
print(df)
print("method 1:", df['income'])
print("method 2:",df.income)
print("method 3:",df.loc[:,'income'])
print("method 4:",df.iloc[:,2])


Considering our Taiwan dataset described above (df_twn), use pandas to answer the following questions about the data:

1. How many columns and rows does the data have?
2. How much memory (RAM) do you need to read this data?
3. Are there any columns with incorrect default data types? If so, correct them.
4. How many variables do we have by type (numerical, string) (answer this *without* using ``.info()`` or ``.dtypes``)
5. Create two `describe()` summaries, one for numeric variables and one for string variables.

In [51]:
print("Number of rows:", df_twn.shape[0])
print("Number of columns:", df_twn.shape[1])
# shape is a (rows, columns) tuple: the first element is the number of rows and the second is the number of columns

In [52]:
df_twn.info()
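
Besides the summary printed by `info()`, memory can also be measured per column with `memory_usage`. The sketch below runs on a small made-up dataframe (`df_mem`); the same calls should apply to `df_twn`.

```python
import pandas as pd

df_mem = pd.DataFrame({"id": [1, 2, 3], "name": ["Calvin", "Ron", "Daisy"]})

# Bytes used by the index and by each column; deep=True also measures the string contents
per_column = df_mem.memory_usage(deep=True)
total_bytes = per_column.sum()

print(per_column)
print("total bytes:", total_bytes)
```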

In [53]:
z.show(df_twn.head(20))

In [54]:
should_be_str = ['SEX','EDUCATION', 'MARRIAGE','PAY_0','PAY_2','PAY_3','PAY_4','PAY_5','PAY_6','default.payment.next.month']
df_twn[should_be_str] = df_twn[should_be_str].astype('str')

In [55]:
print('Number of numerical variables:', len(df_twn.select_dtypes(['int64','float64']).columns.values))
print('Number of string variables:', len(df_twn.select_dtypes(['object']).columns.values))

In [56]:
print(df_twn.select_dtypes(['float64','int64']).describe())


# Unfortunately, if you use the z.show() it will not show the row index. Because of that, we need to use the .reset_index() method which creates a new default index and (as long as you keep the default setting of 'drop=False')  shifts the old index to the first data column (and shifts all existing columns).

z.show(df_twn.select_dtypes(['float64','int64']).describe().reset_index())

# print(df_twn.select_dtypes(['float64','int64']).describe().reset_index())

In [57]:
z.show(df_twn.select_dtypes(['object']).describe().reset_index())

We can select (filter) rows from a dataframe using a Boolean mask.

What will be the output of the following?
````
df[df.income > 10000]
````
On this dataframe:

|   | id | name    |income|
| - |:--:| --------:|:---:|
| 0 | 1  | Calvin   |20000|
| 1 | 2  | Ron      |40000|
| 2 | 3  | Daisy    |20000|
| 3 | 4  | Jasmine  |10000|
| 4 | 5  | Juan     |50000|


In [60]:
df = pd.DataFrame({'id':[1,2,3,4,5],'name':['Calvin','Ron','Daisy','Jasmine','Juan'],'income':[20000,40000,20000,10000,50000]})
print("df: ",df)
print("\nboolean mask\n", df.income > 10000)
print("\ndf filtered on the boolean mask\n",  df[df.income > 10000])

Considering the (raw) Taiwan data (df_twn) and using one line of code and the `.loc` method, show only the numeric columns for the first 4 records which have **LIMIT_BAL**  > 300000 and **AGE** > 30.


In [63]:
# print("first boolean mask: ",df_twn.LIMIT_BAL>300000)
# print("second boolean mask: ",df_twn.AGE>30)
# print("final boolean mask: ",(df_twn.LIMIT_BAL>300000) & (df_twn.AGE>30))
# print ("use loc to select rows based on final boolean mask: ", df_twn.loc[(df_twn.LIMIT_BAL>300000) & (df_twn.AGE>30),:])
#Finally, use select_dtypes to select just the numeric columns and include head in the z.show
z.show(df_twn.loc[(df_twn.LIMIT_BAL>300000) & (df_twn.AGE>30),:].select_dtypes(["int64", 'float64']).head(4))

1. Create a boolean mask to indicate which rows to keep (True) and drop (False)
2. Use loc with the boolean vector to select the rows to keep
3. Select the numeric columns from the above result
4. Show the first 4 rows from the above result
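
The steps above can be sketched on a small made-up dataframe that runs without access to the Taiwan data. The values in `toy` are invented, and `select_dtypes("number")` is used here as a shorthand for listing the numeric types one by one.

```python
import pandas as pd

toy = pd.DataFrame({
    "LIMIT_BAL": [500000, 200000, 360000, 400000],
    "AGE": [45, 50, 28, 33],
    "SEX": ["1", "2", "2", "1"],
})

# Step 1: each comparison needs its own parentheses when combined with &
mask = (toy.LIMIT_BAL > 300000) & (toy.AGE > 30)

# Steps 2 to 4: filter rows, keep the numeric columns, show at most 4 rows
result = toy.loc[mask, :].select_dtypes("number").head(4)
print(result)
```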

In [65]:
df = pd.DataFrame(np.random.randn(10,5),
                  index=[['a','a','a','b','b','b','c','c','d','d'],
                         [ 1 , 2 , 3 , 1 , 2 , 3 , 1 , 2 , 1 , 2]])

print("Index: ", df.index)
print("df",df)                         
# df


In [66]:
## Accessing all the values within the index value of 'a'

# print("df.loc['a',:] ", df.loc['a',:])

## Accessing the 2nd & 3rd row 

print("\ndf.iloc[1:3,:] ",df.iloc[1:3,:])



In [67]:
df

What's the output of the following command? 
````
df.loc['a',].loc[1,:]
````
#### <font color=red>PLEASE *DON'T* CODE</font>

In [69]:
df.loc['a',].loc[1,:]
# df.swaplevel(0,1).loc[1,:]
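
Chaining two `loc` calls is one way to reach a row of a multi-indexed dataframe. As a sketch of two alternatives, the snippet below passes both levels to `loc` as a tuple and uses `xs` to select on the inner level. It uses deterministic data in a separate dataframe (`df_mi`) so the results are reproducible.

```python
import numpy as np
import pandas as pd

df_mi = pd.DataFrame(
    np.arange(20).reshape(10, 2),
    index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
           [1, 2, 3, 1, 2, 3, 1, 2, 1, 2]],
)

# A tuple addresses both index levels at once and returns one row as a Series
row = df_mi.loc[('a', 1), :]

# xs selects every row whose inner-level label equals 1
inner = df_mi.xs(1, level=1)

print(row)
print(inner)
```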


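Can you guess how many rows the output will have given the expression below? 
````
df.groupby(level=0).sum()
````

#### <font color=red>PLEASE *DON'T* CODE</font>

In [72]:
print(df)
# Sum within each value of the outer index level (df.sum(level=0) was removed in pandas 2.0)
print(df.groupby(level=0).sum())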
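
Since the dataframe above is filled with random numbers, the row count is easier to check on deterministic data. The sketch below assumes a dataframe of ones (`df_lvl`) with the same two-level index and groups on the outer level, which yields one row per distinct outer label.

```python
import numpy as np
import pandas as pd

df_lvl = pd.DataFrame(
    np.ones((10, 2)),
    index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
           [1, 2, 3, 1, 2, 3, 1, 2, 1, 2]],
)

# One output row per distinct label in the outer index level
totals = df_lvl.groupby(level=0).sum()

print(totals)
print(totals.shape)
```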

Please continue on to the 04b. Pandas (With Solutions) Notebook
