# Learning Pandas with examples  

Dataframes (DF) are used extensively for loading data from files. Tabular data of this kind is easier to work with if you are already familiar with spreadsheets.

I have used the famous Oracle `scott/tiger` datasets `EMP` and `DEPT`. I believe that working with 8 columns and 14 rows of known data makes it easier to understand what is changing in the underlying dataset.

The 2 files [emp.csv](emp.csv) and [dept.csv](dept.csv) are also present here for easy access. I have changed the format of a few values in the hiredate column to make the cleaning noticeable.

1. Basic Exploration of the dataset  
    1.1. List the data type of the columns  
    1.2. List the names of the columns in the dataframe  
    1.3. Count of rows in the DF  
    1.4. Count of columns in the DF  
    1.5. Explore values in a dataframe  
2. Select a subset of the dataframe to create a new dataframe  
    2.1. Select a subset of columns  
    2.2. Select a subset of rows based on a filter  
    2.3. Select specific rows and columns  
3. Create a new column  
4. Creating a pivot table  
5. Join 2 dataframes  
6. Convert hiredate to_date  
7. Writing dataframe back to disk  

## Import the Pandas library and load the emp dataset

In [1]:
import pandas as pd
import numpy as np       # used in a later example

emp_df = pd.read_csv('emp.csv')
emp_df.head(20)        # the dataset is small, so ask for all of it; head() returns 5 rows by default

Unnamed: 0,empno,ename,job,mgr,hiredate,sal,com,deptno
0,7369,SMITH,CLERK,7902.0,17-12-1980,800,,20
1,7499,ALLEN,SALESMAN,7698.0,02-20-1981,1600,300.0,30
2,7521,WARD,SALESMAN,7698.0,22-FEB-1981,1250,500.0,30
3,7566,JONES,MANAGER,7839.0,2-APR-1981,2975,,20
4,7654,MARTIN,SALESMAN,7698.0,28/SEP/1981,1250,1400.0,30
5,7698,BLAKE,MANAGER,7839.0,1-MAY-1981,2850,,30
6,7782,CLARK,MANAGER,7839.0,9-JUN-1981,2450,,10
7,7788,SCOTT,ANALYST,7566.0,09/12/1982,3000,,20
8,7839,KING,PRESIDENT,,17-NOV-1981,5000,,10
9,7844,TURNER,SALESMAN,7698.0,8-SEP-1981,1500,0.0,30


This is the complete data of the emp table, with format changes in a few of the hiredates.

### 1. Basic Exploration of the dataset
1. List the data type of the columns
2. List the names of the columns in the dataframe
3. Count of rows in the DF
4. Count of columns in the DF
5. Explore values in a dataframe

In [2]:
print(emp_df.dtypes)        # Gives the data types of the columns
print()
print(emp_df.columns)       # Gives List of the column names
print()
print(emp_df.shape)         # Gives a tuple with count of rows and columns

empno         int64
ename        object
job          object
mgr         float64
hiredate     object
sal           int64
com         float64
deptno        int64
dtype: object

Index(['empno', 'ename', 'job', 'mgr', 'hiredate', 'sal', 'com', 'deptno'], dtype='object')

(14, 8)
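
Items 3 and 4 of the list ask for the row and column counts separately. A minimal sketch of how the `shape` tuple can be unpacked, using a small stand-in dataframe (`sample_df` is a hypothetical name, with three rows copied from the table above):

```python
import pandas as pd

# A tiny stand-in for emp_df (hypothetical name, three rows from the emp table).
sample_df = pd.DataFrame({
    "empno": [7369, 7499, 7521],
    "ename": ["SMITH", "ALLEN", "WARD"],
    "sal": [800, 1600, 1250],
})

n_rows, n_cols = sample_df.shape   # unpack the (rows, columns) tuple
print(n_rows)                      # count of rows
print(n_cols)                      # count of columns
print(len(sample_df))              # len() of a dataframe is also the row count
print(len(sample_df.columns))      # length of the column index is the column count
```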


#### 1.5 Explore values in a dataframe
By default the describe() method summarises only the numeric columns, but by passing the parameter `include='all'` we can make it cover columns of every data type.

In [3]:
emp_df.describe(include='all')

Unnamed: 0,empno,ename,job,mgr,hiredate,sal,com,deptno
count,14.0,14,14,13.0,14,14.0,4.0,14.0
unique,,14,5,,13,,,
top,,TURNER,SALESMAN,,3-DEC-1981,,,
freq,,1,4,,2,,,
mean,7726.571429,,,7739.307692,,2073.214286,550.0,22.142857
std,178.294361,,,103.71466,,1182.503224,602.771377,8.017837
min,7369.0,,,7566.0,,800.0,0.0,10.0
25%,7588.0,,,7698.0,,1250.0,225.0,20.0
50%,7785.0,,,7698.0,,1550.0,400.0,20.0
75%,7868.0,,,7839.0,,2943.75,725.0,30.0


From the above we can see that describe() is best used when the data conveys numeric information. It is useful to know the mean `sal` of the dataset, but there is no insight in the average of identifiers such as `deptno` or `empno`.
We can use describe() on a single column too.

In [4]:
emp_df.sal.describe()

count      14.000000
mean     2073.214286
std      1182.503224
min       800.000000
25%      1250.000000
50%      1550.000000
75%      2943.750000
max      5000.000000
Name: sal, dtype: float64

### 2. Select a subset of the dataframe to create a new dataframe
#### 2.1 Select a subset of columns

In [5]:
subset_cols = ["empno","ename","deptno"]
subset_df = emp_df[subset_cols]
subset_df

Unnamed: 0,empno,ename,deptno
0,7369,SMITH,20
1,7499,ALLEN,30
2,7521,WARD,30
3,7566,JONES,20
4,7654,MARTIN,30
5,7698,BLAKE,30
6,7782,CLARK,10
7,7788,SCOTT,20
8,7839,KING,10
9,7844,TURNER,30
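
One detail worth seeing next to the example above: passing a list of column names gives back a dataframe, while passing a single name gives back a Series. A small sketch, assuming a stand-in dataframe (`sample_df` is a hypothetical name):

```python
import pandas as pd

# Stand-in for emp_df (hypothetical name, two rows from the emp table).
sample_df = pd.DataFrame({
    "empno": [7369, 7499],
    "ename": ["SMITH", "ALLEN"],
    "deptno": [20, 30],
})

names_series = sample_df["ename"]     # a single label returns a Series
names_frame = sample_df[["ename"]]    # a list of labels returns a DataFrame, even with one column

print(type(names_series).__name__)
print(type(names_frame).__name__)
```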


#### 2.2 Select a subset of rows based on a filter

In [6]:
less_sal_df = emp_df[emp_df['sal'] <= 1200]
less_sal_df

Unnamed: 0,empno,ename,job,mgr,hiredate,sal,com,deptno
0,7369,SMITH,CLERK,7902.0,17-12-1980,800,,20
10,7876,ADAMS,CLERK,7788.0,12-JAN-1983,1100,,20
11,7900,JAMES,CLERK,7698.0,3-DEC-1981,950,,30
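
The same pattern extends to more than one condition. The sketch below is an illustration on a stand-in dataframe (`sample_df` is a hypothetical name): each condition goes in its own parentheses and the conditions are combined with `&` (and) or `|` (or), while `isin()` checks membership in a list of values.

```python
import pandas as pd

# Stand-in for emp_df (hypothetical name, four rows from the emp table).
sample_df = pd.DataFrame({
    "ename": ["SMITH", "ALLEN", "ADAMS", "JAMES"],
    "job": ["CLERK", "SALESMAN", "CLERK", "CLERK"],
    "sal": [800, 1600, 1100, 950],
    "deptno": [20, 30, 20, 30],
})

# AND: both conditions must hold; the parentheses around each one are required
low_paid_dept20 = sample_df[(sample_df["sal"] <= 1200) & (sample_df["deptno"] == 20)]

# isin(): keep rows whose deptno is one of the listed values
in_depts = sample_df[sample_df["deptno"].isin([10, 30])]

print(low_paid_dept20)
print(in_depts)
```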


#### 2.3 Select specific rows and columns
Here we use the `.loc` property. `.loc` is used to access a group of rows and columns by label. In the example below we want only the rows present at `index 0,1,3` and only the 2 columns `empno` and `ename`.
This concept is taken further in the next example, where a filter condition selects the rows and, after the comma, we pass a list of columns.

In [7]:
emp_df.loc[[0,1,3],["empno","ename"]]

Unnamed: 0,empno,ename
0,7369,SMITH
1,7499,ALLEN
3,7566,JONES


In [8]:
specific_df = emp_df.loc[emp_df["com"].notna(), ["empno","ename","deptno"]]
specific_df

Unnamed: 0,empno,ename,deptno
1,7499,ALLEN,30
2,7521,WARD,30
4,7654,MARTIN,30
9,7844,TURNER,30


Now we will use the `.iloc` property, which selects purely by integer position.  
Below is an example of selecting rows `0,1,3` and columns `0,1,2,3`.

In [9]:
iloc_df = emp_df.iloc[[0,1,3],[0,1,2,3]]
iloc_df

Unnamed: 0,empno,ename,job,mgr
0,7369,SMITH,CLERK,7902.0
1,7499,ALLEN,SALESMAN,7698.0
3,7566,JONES,MANAGER,7839.0
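
With the default index `0,1,2,...` the labels and the positions happen to coincide, so `.loc` and `.iloc` look alike. The sketch below shows the difference on a stand-in dataframe whose index is set to `empno`; that index choice and the name `sample_df` are assumptions made only for this illustration.

```python
import pandas as pd

# Stand-in dataframe with empno as the index labels (an assumption for this sketch).
sample_df = pd.DataFrame(
    {"ename": ["SMITH", "ALLEN", "WARD", "JONES"], "sal": [800, 1600, 1250, 2975]},
    index=[7369, 7499, 7521, 7566],
)

by_label = sample_df.loc[7499, "ename"]    # looked up by index label and column name
by_position = sample_df.iloc[1, 0]         # looked up by row position and column position

label_slice = sample_df.loc[7499:7521]     # label slices include both end points
position_slice = sample_df.iloc[1:2]       # position slices exclude the stop, like Python lists

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```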


### 3. Create a new column
Create a column `totalSalary` which is the sum of `sal` and `com`.  
A new column is created by assigning the result to the DF with the new column name in between the `[]`.

In [10]:
emp_df['totalSalary'] = emp_df['sal'] + emp_df['com']
emp_df.head()

Unnamed: 0,empno,ename,job,mgr,hiredate,sal,com,deptno,totalSalary
0,7369,SMITH,CLERK,7902.0,17-12-1980,800,,20,
1,7499,ALLEN,SALESMAN,7698.0,02-20-1981,1600,300.0,30,1900.0
2,7521,WARD,SALESMAN,7698.0,22-FEB-1981,1250,500.0,30,1750.0
3,7566,JONES,MANAGER,7839.0,2-APR-1981,2975,,20,
4,7654,MARTIN,SALESMAN,7698.0,28/SEP/1981,1250,1400.0,30,2650.0


We see that adding NaN to a number results in NaN in the new column, so we should replace NaN with zero while computing.

In [11]:
emp_df['totalSalary'] = emp_df['sal'] + emp_df['com'].fillna(0)
emp_df.head()

Unnamed: 0,empno,ename,job,mgr,hiredate,sal,com,deptno,totalSalary
0,7369,SMITH,CLERK,7902.0,17-12-1980,800,,20,800.0
1,7499,ALLEN,SALESMAN,7698.0,02-20-1981,1600,300.0,30,1900.0
2,7521,WARD,SALESMAN,7698.0,22-FEB-1981,1250,500.0,30,1750.0
3,7566,JONES,MANAGER,7839.0,2-APR-1981,2975,,20,2975.0
4,7654,MARTIN,SALESMAN,7698.0,28/SEP/1981,1250,1400.0,30,2650.0
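
An alternative way to get the same result is the `add()` method with `fill_value=0`, which treats a missing value on either side as zero during the addition. A minimal sketch on a stand-in dataframe (`sample_df` is a hypothetical name):

```python
import numpy as np
import pandas as pd

# Stand-in for emp_df (hypothetical name, three rows from the emp table).
sample_df = pd.DataFrame({
    "ename": ["SMITH", "ALLEN", "TURNER"],
    "sal": [800, 1600, 1500],
    "com": [np.nan, 300.0, 0.0],
})

# fill_value=0 substitutes 0 for a NaN on either side before adding
sample_df["totalSalary"] = sample_df["sal"].add(sample_df["com"], fill_value=0)
print(sample_df)
```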


### 4. Creating a pivot table
The pivot table is a very powerful feature in Excel. The same functionality is available in pandas through pivot_table().
Excel gives us 4 areas for working with a dataset. These are:
1. FILTERS: Filtering out rows.
2. COLUMNS: The distinct values in this column of the DF will become the new column(s)
3. ROWS: The distinct values in this column of the DF will become the rows
4. VALUES: The numeric column on which the aggregation is to be applied. We can choose the aggregation to be applied.

pivot_table() has no filter parameter, so filtering needs to be done at the DF level before the DF is passed in. The remaining areas map to these parameters:
1. index=: distinct values that appear as rows
2. columns=: distinct values that appear as columns
3. values=: numeric column that is to be aggregated
4. aggfunc=: method of aggregation. The example below uses the average (mean)
5. fill_value=: the value to show where the result would be NaN; we use zero. E.g. `deptno=10` does not have any analysts or salesmen

In [12]:
pd.pivot_table(emp_df[emp_df['ename'] != 'KING'],
               index='job',
               columns='deptno',
               values='totalSalary',
               aggfunc=np.average, 
               fill_value=0)

deptno,10,20,30
job,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ANALYST,0,3000,0
CLERK,1300,950,950
MANAGER,2450,2975,2850
SALESMAN,0,0,1950
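
The mapping from the Excel areas to the parameters can be sketched on a handful of rows. This is an illustration only: `sample_df` is a hypothetical stand-in, and the aggregation is given by its string name `"mean"`, which pandas also accepts and which needs no numpy import.

```python
import pandas as pd

# Stand-in rows (hypothetical name; values taken from the emp table above).
sample_df = pd.DataFrame({
    "job": ["CLERK", "CLERK", "CLERK", "MANAGER", "MANAGER", "ANALYST", "PRESIDENT"],
    "deptno": [20, 20, 30, 20, 30, 20, 10],
    "totalSalary": [800.0, 1100.0, 950.0, 2975.0, 2850.0, 3000.0, 5000.0],
})

pivot = pd.pivot_table(
    sample_df[sample_df["job"] != "PRESIDENT"],  # FILTERS: applied to the DF beforehand
    index="job",                                 # ROWS
    columns="deptno",                            # COLUMNS
    values="totalSalary",                        # VALUES
    aggfunc="mean",                              # aggregation applied to VALUES
    fill_value=0,                                # shown where a job/deptno pair has no rows
)
print(pivot)
```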


### 5. Join 2 dataframes
We will read the `dept` dataframe and merge it with `emp` as an inner join. In pandas, a join can be done using `.join()` or `.merge()`; `.merge()` is the more versatile of the two.

In [13]:
dept_df = pd.read_csv('dept.csv')
dept_df.head()

Unnamed: 0,deptno,dname,loc
0,10,ACCOUNTING,NEW YORK
1,20,RESEARCH,DALLAS
2,30,SALES,CHICAGO
3,40,OPERATIONS,BOSTON


In [14]:
merged_df = emp_df.merge(dept_df, how='inner', on='deptno')   # both frames share the key name, so on= is enough
merged_df

Unnamed: 0,empno,ename,job,mgr,hiredate,sal,com,deptno,totalSalary,dname,loc
0,7369,SMITH,CLERK,7902.0,17-12-1980,800,,20,800.0,RESEARCH,DALLAS
1,7566,JONES,MANAGER,7839.0,2-APR-1981,2975,,20,2975.0,RESEARCH,DALLAS
2,7788,SCOTT,ANALYST,7566.0,09/12/1982,3000,,20,3000.0,RESEARCH,DALLAS
3,7876,ADAMS,CLERK,7788.0,12-JAN-1983,1100,,20,1100.0,RESEARCH,DALLAS
4,7902,FORD,ANALYST,7566.0,3-DEC-1981,3000,,20,3000.0,RESEARCH,DALLAS
5,7499,ALLEN,SALESMAN,7698.0,02-20-1981,1600,300.0,30,1900.0,SALES,CHICAGO
6,7521,WARD,SALESMAN,7698.0,22-FEB-1981,1250,500.0,30,1750.0,SALES,CHICAGO
7,7654,MARTIN,SALESMAN,7698.0,28/SEP/1981,1250,1400.0,30,2650.0,SALES,CHICAGO
8,7698,BLAKE,MANAGER,7839.0,1-MAY-1981,2850,,30,2850.0,SALES,CHICAGO
9,7844,TURNER,SALESMAN,7698.0,8-SEP-1981,1500,0.0,30,1500.0,SALES,CHICAGO
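
An inner join drops `deptno=40` (OPERATIONS), because no employee belongs to it. To see which rows fail to match, one option is a different `how=` together with `indicator=True`, which adds a `_merge` column. A minimal sketch on stand-in dataframes (`emp_sample` and `dept_sample` are hypothetical names):

```python
import pandas as pd

# Stand-ins for emp_df and dept_df (hypothetical names, values from the tables above).
emp_sample = pd.DataFrame({
    "ename": ["SMITH", "ALLEN", "CLARK"],
    "deptno": [20, 30, 10],
})
dept_sample = pd.DataFrame({
    "deptno": [10, 20, 30, 40],
    "dname": ["ACCOUNTING", "RESEARCH", "SALES", "OPERATIONS"],
})

inner = emp_sample.merge(dept_sample, how="inner", on="deptno")

# how="right" keeps every department; indicator=True records where each row came from
right = emp_sample.merge(dept_sample, how="right", on="deptno", indicator=True)
unmatched = right.loc[right["_merge"] == "right_only", "dname"].tolist()

print(len(inner), len(right))
print(unmatched)
```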


### 6. Convert hiredate to_date
`read_csv` does not parse dates unless asked to, and we have also changed the string format in a few rows, so hiredate is loaded as an object `dtype('O')`. We want to convert it to a date.

In [15]:
emp_df.hiredate.dtype

dtype('O')

In [16]:
emp_df['hiredate'] = pd.to_datetime(emp_df['hiredate'])

In [17]:
emp_df.dtypes

empno                   int64
ename                  object
job                    object
mgr                   float64
hiredate       datetime64[ns]
sal                     int64
com                   float64
deptno                  int64
totalSalary           float64
dtype: object

In [18]:
emp_df.head()    # we see that hiredate is converted to ISO date format

Unnamed: 0,empno,ename,job,mgr,hiredate,sal,com,deptno,totalSalary
0,7369,SMITH,CLERK,7902.0,1980-12-17,800,,20,800.0
1,7499,ALLEN,SALESMAN,7698.0,1981-02-20,1600,300.0,30,1900.0
2,7521,WARD,SALESMAN,7698.0,1981-02-22,1250,500.0,30,1750.0
3,7566,JONES,MANAGER,7839.0,1981-04-02,2975,,20,2975.0
4,7654,MARTIN,SALESMAN,7698.0,1981-09-28,1250,1400.0,30,2650.0
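
Depending on the pandas version, a single `pd.to_datetime` call may warn about, or reject, a column whose rows use different formats. A more explicit alternative is to try a list of known formats one by one and keep the first that parses. The sketch below is an illustration under stated assumptions: the format list and its order are my own, and `09/12/1982` is read day-first (9 December), which the source does not confirm.

```python
import pandas as pd

# Sample strings copied from the hiredate column shown earlier.
raw = pd.Series(["17-12-1980", "02-20-1981", "22-FEB-1981", "28/SEP/1981", "09/12/1982"])

# Candidate formats, tried in this order (the list and the order are assumptions).
formats = ["%d-%m-%Y", "%m-%d-%Y", "%d-%b-%Y", "%d/%b/%Y", "%d/%m/%Y"]

# errors="coerce" turns strings that do not match the format into NaT,
# so each later format only fills the rows that are still unparsed.
parsed = pd.to_datetime(raw, format=formats[0], errors="coerce")
for fmt in formats[1:]:
    parsed = parsed.fillna(pd.to_datetime(raw, format=fmt, errors="coerce"))

print(parsed)
```

Because the first matching format wins, an ambiguous string such as `01-02-1981` would be read day-first here; the order of the list is where that decision is made.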


### 7. Writing dataframe back to disk
We read the data from `csv` and we are writing it back to disk as an Excel file.

In [19]:
emp_df.to_excel("emp.xlsx", sheet_name='empInformation', index=False)
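
`to_excel` relies on an Excel writer package such as `openpyxl` being installed. Where that is not available, writing back to CSV works with pandas alone. The sketch below is a minimal round trip on a stand-in dataframe; `sample_df` and the file name `emp_sample.csv` are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

# Stand-in for emp_df after the date conversion (hypothetical name).
sample_df = pd.DataFrame({
    "empno": [7369, 7499],
    "ename": ["SMITH", "ALLEN"],
    "hiredate": pd.to_datetime(["1980-12-17", "1981-02-20"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "emp_sample.csv")   # hypothetical file name
    sample_df.to_csv(path, index=False)              # index=False leaves out the 0,1,2,... index column
    # parse_dates asks read_csv to convert the listed columns while loading
    restored = pd.read_csv(path, parse_dates=["hiredate"])

print(restored.dtypes)
```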