# Advanced Pandas - Remaining Functionality

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

According to the Wikipedia page on pandas, the name is derived from the term "panel data", an econometrics term for multidimensional structured data sets. <br>
pandas consists of the following elements:

* A set of labeled array data structures, the primary of which are Series and DataFrame
* Index objects enabling both simple axis indexing and multi-level / hierarchical axis indexing
* An integrated group by engine for aggregating and transforming data sets
* Date range generation (date_range) and custom date offsets enabling the implementation of customized frequencies
* Input/Output tools: loading tabular data from flat files (CSV, delimited, Excel 2003), and saving and loading pandas objects from the fast and efficient PyTables/HDF5 format.
* Memory-efficient “sparse” versions of the standard data structures for storing data that is mostly missing or mostly constant (some fixed value)
* Moving window statistics (rolling mean, rolling standard deviation, etc.)
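
A few of the elements listed above can be sketched on toy data. The values and the names `prices`, `df`, `totals` and `rolling_mean` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# A Series is a labeled one-dimensional array; here the labels are dates
# generated with date_range
prices = pd.Series(
    [10.0, 12.0, 11.0, 13.0],
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)

# A DataFrame is a labeled two-dimensional table
df = pd.DataFrame({"city": ["A", "B", "A", "B"], "sales": [1, 2, 3, 4]})

# Group-by engine: total sales per city
totals = df.groupby("city")["sales"].sum()

# Moving-window statistics: 2-day rolling mean (the first value has no
# full window, so it is NaN)
rolling_mean = prices.rolling(window=2).mean()

print(totals)
print(rolling_mean)
```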

### Some quick references

__[10 Minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html#min)__

### Managing your working directory in Python

In [None]:
import os

In [None]:
os.getcwd()

**Set your current working directory**

In [None]:
# list down all objects in the current working directory
os.listdir()

In [None]:
os.chdir('C:\\Users\\raggar10\\Desktop\\Study Material\\Data Science Training\\AcadGild\\DSM_Online_July14\\session8\\Pandas Session1')

In [None]:
# list down all objects in the current working directory
os.listdir()
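
The working-directory calls above can be tried without touching a real project folder. This is a minimal sketch that uses a temporary directory; the file name `example.txt` is hypothetical.

```python
import os
import tempfile

original_dir = os.getcwd()          # remember where we started

with tempfile.TemporaryDirectory() as tmp_dir:
    os.chdir(tmp_dir)               # switch the working directory
    with open("example.txt", "w") as handle:
        handle.write("hello")
    files_seen = os.listdir()       # relative paths now resolve inside tmp_dir
    os.chdir(original_dir)          # switch back before the folder is removed

print(files_seen)
```

Restoring the original directory at the end keeps later relative paths, such as the one passed to `pd.read_csv`, pointing where you expect.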

## Create Data in Pandas DataFrame

In [None]:
import pandas as pd
#import numpy as np
#import matplotlib.pyplot as plt

## Titanic Kaggle Data __[Data Description](https://www.kaggle.com/c/titanic/data)__

In [None]:
titanic_train = pd.read_csv("titanic.csv",sep=',')   

In [None]:
titanic_train.shape

In [None]:
titanic_train.head(20)
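
If `titanic.csv` is not at hand, the same `read_csv`, `shape` and `head` calls can be tried on an in-memory CSV. The three rows below are invented for illustration and are not taken from the Kaggle file.

```python
import io

import pandas as pd

# A tiny stand-in for a CSV file on disk
csv_text = "PassengerId,Survived,Name\n1,0,Ann\n2,1,Bob\n3,1,Cid\n"

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text), sep=",")

print(sample.shape)    # (rows, columns)
print(sample.head(2))  # first two rows
```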

In [None]:
titanic_train.Name[:10]

In [None]:
type(titanic_train[["PassengerId", 'Pclass','Name']][:10])

In [None]:
titanic_train[["PassengerId", 'Pclass','Name']][:10]

In [None]:
type(titanic_train.Pclass)

In [None]:
titanic_train[:5]

In [None]:
#type(titanic_train.PassengerId)
#type(titanic_train.PassengerId.values)
(titanic_train.Name.values)

### DataFrame Index

In [None]:
titanic_train.index

In [None]:
titanic_train.index = titanic_train.PassengerId +1000

In [None]:
titanic_train.head()

**Get a list of columns**

In [None]:
titanic_train.columns

In [None]:
titanic_train.columns = ['PassengerId', 'Survived', 'Passengerclass', 'PassName', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'EmbarkedPort']

In [None]:
titanic_train.head()

**Get Data Type for each column**

In [None]:
titanic_train.dtypes
#type(titanic_train.dtypes)
#titanic_train.dtypes.index

In [None]:
titanic_train.info()

In [None]:
titanic_train.dtypes

In [None]:
type(titanic_train.dtypes)

In [None]:
titanic_train.dtypes['Age']

In [None]:
print("Categorical Columns")
titanic_train.dtypes == 'object'

In [None]:
type(titanic_train.dtypes == 'object')

In [None]:
titanic_train.columns[titanic_train.dtypes == 'object']

In [None]:
titanic_train.columns

In [None]:
titanic_train.columns[titanic_train.dtypes != 'object']
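
As an alternative to comparing `dtypes` against `'object'`, `select_dtypes` can split columns into numeric and non-numeric groups. This is a sketch on a toy frame; the column values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "Age": [22.0, 38.0, 26.0],
    "Survived": [1, 0, 1],
})

# Columns with a numeric dtype (integers and floats)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else, here the text column
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```

Selecting by `"number"` avoids depending on exactly how text columns are stored.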

**Get Descriptive Statistics for Numeric Columns**

In [None]:
?pd.read_csv

In [None]:
print( titanic_train.describe() )

**Get Descriptive Statistics for Categorical Columns**

In [None]:
?pd.DataFrame.describe

In [None]:
categorical = titanic_train.dtypes[titanic_train.dtypes == "object"].index
print(categorical)

titanic_train[categorical].describe()

In [None]:
titanic_train.describe(include='all')
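
The difference between the default `describe()` and `describe(include='all')` is easier to see on a small frame. The data below is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0],
    "Fare": [10.0, 20.0, 60.0],
    "Sex": ["male", "female", "female"],
})

# Default: numeric columns only (count, mean, std, min, quartiles, max)
numeric_summary = toy.describe()

# include="all": adds unique/top/freq rows for the non-numeric columns
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```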

In [None]:
titanic_train.shape

In [None]:
titanic_train.info()

**Individual Statistics**

In [None]:
## How do we get individual column statistics?
titanic_train['Age'].mean()

In [None]:
titanic_train['Age'].max()

In [None]:
titanic_train['Age'].min()

In [None]:
import numpy as np

In [None]:
np.max(titanic_train.Age)

In [None]:
np.max(titanic_train['Age'])
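
Columns such as `Age` may contain missing values, and the pandas reductions skip them by default. A minimal sketch with made-up ages:

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 40.0])

mean_default = ages.mean()             # NaN is skipped
mean_strict = ages.mean(skipna=False)  # NaN propagates into the result
missing_count = ages.isna().sum()      # number of missing entries

print(mean_default, mean_strict, missing_count)
```

Checking `isna().sum()` first shows how many values a statistic is silently leaving out.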

### Two ways to remove a column from a DataFrame

In [None]:
titanic_train.head()

In [None]:
del titanic_train["PassengerId"]     # Remove PassengerId

In [None]:
titanic_train.head()

In [None]:
titanic_train.drop(['PassName'],axis=1).head()

In [None]:
titanic_train.head()

In [None]:
titanic_train.drop(['PassName'],axis=1,inplace=True)

In [None]:
titanic_train.head()
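
The two removal styles behave differently: `drop` returns a new DataFrame unless `inplace=True` is passed, while `del` always modifies the original. A sketch on a toy frame with hypothetical columns:

```python
import pandas as pd

toy = pd.DataFrame({
    "Id": [1, 2],
    "Name": ["Ann", "Bob"],
    "Fare": [7.25, 71.28],
})

# drop returns a copy without the column; toy itself is unchanged
without_name = toy.drop(columns=["Name"])

# del removes the column from toy in place
del toy["Id"]

print(without_name.columns.tolist())
print(toy.columns.tolist())
```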

### Sorting Values

In [None]:
titanic_train.head()

In [None]:
titanic_train.sort_values(by='Age', inplace=True)

In [None]:
titanic_train.head()

In [None]:
titanic_train.sort_values(by='Age',ascending=False ,inplace=True)

In [None]:
titanic_train.head()

In [None]:
titanic_train["Ticket"][:10]

In [None]:
sorted(titanic_train["Ticket"])[0:15]   # Check the first 15 sorted tickets

In [None]:
# Sorting by multiple columns
titanic_train.sort_values(by=['Age','Ticket'],ascending=[False,True] ,inplace=True)

In [None]:
titanic_train.head()
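
To see how a multi-column sort breaks ties, here is a small sketch with invented ages and ticket codes. Rows are ordered by `Age` descending first, then by `Ticket` ascending within each age.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [30, 22, 30, 22],
    "Ticket": ["B2", "A9", "A1", "C3"],
})

# One ascending flag per sort column
ordered = toy.sort_values(by=["Age", "Ticket"], ascending=[False, True])

print(ordered)
```

Without `inplace=True` the original frame keeps its order and the sorted result is returned.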

### Dealing with Categorical Data

In [None]:
titanic_train["Cabin"].describe()  # Check number of unique cabins

**We will now deal with data that is stored as numbers but is categorical in nature**

In [None]:
titanic_train["Survived"].dtypes

In [None]:
titanic_train["Survived"].describe()

In [None]:
titanic_train["Survived"].unique()

In [None]:
new_survived = pd.Categorical(titanic_train["Survived"])

In [None]:
new_survived[0:10]

In [None]:
type(new_survived)

In [None]:
new_survived.describe()

In [None]:
?new_survived.rename_categories

In [None]:
new_survived = new_survived.rename_categories(["Died","Survived"])              

new_survived.describe()

In [None]:
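# new_survived already uses the labels "Died" and "Survived", so the mapping must start from those labels
new_survived2 = new_survived.rename_categories({"Died":"Passenger_Died","Survived":"Passenger_Survived"})
new_survived2.describe()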

In [None]:
titanic_train['Survived_New']=new_survived2

In [None]:
titanic_train.head(10)
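
When `rename_categories` is given a dictionary, the keys must match the current category values; keys that match nothing are ignored and unmatched categories pass through unchanged. A sketch with made-up 0/1 codes:

```python
import pandas as pd

survived = pd.Categorical([0, 1, 1, 0])

# Keys 0 and 1 match the current categories, so both are renamed
labels = survived.rename_categories({0: "Died", 1: "Survived"})

# The categories are now strings, so a mapping keyed on 0 changes nothing
unchanged = labels.rename_categories({0: "Passenger_Died"})

print(list(labels))
print(list(unchanged.categories))
```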

In [None]:
titanic_train.Pclass.unique()

In [None]:
new_Pclass = pd.Categorical(titanic_train["Pclass"],
                           ordered=True)

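# Map each class number to its own label; a plain list is applied in sorted category order (1, 2, 3)
new_Pclass = new_Pclass.rename_categories({1:"Class1",2:"Class2",3:"Class3"})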

new_Pclass.describe()

In [None]:
titanic_train.Sex.unique()

In [None]:
new_Gender = pd.Categorical(titanic_train["Sex"],
                           ordered=True)

new_Gender = new_Gender.rename_categories(["0","1"])     

new_Gender.describe()

In [None]:
new_Gender.unique()

In [None]:
titanic_train["Pclass"] = new_Pclass
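
Passing `ordered=True` gives the categories a rank, which enables comparisons and `min`/`max`. This is a minimal sketch with invented class values; the labels `Class1` to `Class3` are illustrative.

```python
import pandas as pd

# Explicit category order: 1 < 2 < 3
pclass = pd.Categorical([3, 1, 2, 1], categories=[1, 2, 3], ordered=True)

# A dict keeps each label attached to the intended value
pclass = pclass.rename_categories({1: "Class1", 2: "Class2", 3: "Class3"})

lowest = pclass.min()            # smallest category present in the data
above_first = pclass > "Class1"  # element-wise comparison using the order

print(lowest)
print(list(above_first))
```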

In [None]:
titanic_train["Cabin"].unique()   # Check unique cabins

In [None]:
?pd.Categorical

## Indexing and Slicing DataFrames

We can use column names or row labels to index data points in a DataFrame.

In [None]:
titanic_train.head()

In [None]:
titanic_train[10:15]

In [None]:
titanic_train[1:2]

In [None]:
titanic_train[['Fare']].head()

In [None]:
titanic_train[['Fare','Age']].head()

**For advanced indexing we can use the two methods below**
* loc
* iloc

**loc uses the actual row and column labels for slicing data**

In [None]:
# General form (not runnable as written): dataframe.loc[row_labels, column_labels]

In [None]:
titanic_train.head()

In [None]:
titanic_train.loc[[1022,1038],:]

In [None]:
titanic_train.loc[1001:1020,:]

In [None]:
titanic_train.loc[1010:1015, ['Fare','Age']]
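
Label-based slicing with `loc` includes both endpoints, and a list of labels returns rows in the order requested. A sketch on a toy frame whose index values mimic the shifted passenger ids:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Fare": [7.25, 71.28, 7.92, 53.10], "Age": [22, 38, 26, 35]},
    index=[1001, 1002, 1003, 1004],
)

# Slice by label: rows 1002 and 1003 are both included
subset = toy.loc[1002:1003, ["Fare", "Age"]]

# List of labels: rows come back in the order given
picked = toy.loc[[1004, 1001], :]

print(subset)
print(picked)
```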

In [None]:
titanic_train.index=titanic_train['PassName']

In [None]:
titanic_train.head(5)

In [None]:
titanic_train.loc['Heikkinen, Miss. Laina':'Allen, Mr. William Henry',:]

In [None]:
titanic_train.index=titanic_train.PassengerId

In [None]:
titanic_train.head()

In [None]:
titanic_train.loc[[1,100,90,80],:]

In [None]:
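# Slicing by name is expected to fail here because the index now holds PassengerId values, not names
titanic_train.loc['Allen, Mr. William Henry':'Heikkinen, Miss. Laina',:]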

In [None]:
#titanic_train.loc[1:5,:]

**Similarly, iloc uses integer positions to slice data**

In [None]:
titanic_train.index=titanic_train.PassName

In [None]:
titanic_train.head()

In [None]:
titanic_train.iloc[2:4,:]
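
The contrast between `iloc` and `loc` shows up most clearly at the end of a slice: positions exclude the end, labels include it. A sketch with made-up names as the index:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Fare": [7.25, 71.28, 7.92, 53.10]},
    index=["Ann", "Bob", "Cid", "Dee"],
)

# Positions 1 and 2; position 3 is excluded, as in Python list slicing
by_position = toy.iloc[1:3, :]

# Labels "Bob" through "Dee"; the end label is included
by_label = toy.loc["Bob":"Dee", :]

print(by_position.index.tolist())
print(by_label.index.tolist())
```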

# Let's Do It Together

In [None]:
## Read HR_Employee_Attrition_Data.csv file as pandas DataFrame

In [None]:
## Read first 6 and last 7 rows 

In [None]:
## List all columns and data types for each

In [None]:
## Find descriptive statistics for all columns

In [None]:
## Find average age of people whose Attrition is Yes and No separately

In [None]:
## Find average daily rate for people whose Attrition is Yes and No separately

In [None]:
## Find the Department where Attrition rate is highest

In [None]:
## For people who travel rarely and are in the Sales department, what is the average daily rate?
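
The grouped questions above follow one pattern: split by a column, then aggregate. The sketch below uses a tiny invented table. The column names `Attrition`, `Age`, `Department`, `DailyRate` and `BusinessTravel` and the value `Travel_Rarely` are assumptions about the HR file, so check `columns` and `unique()` on the real data first.

```python
import pandas as pd

hr = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes", "No", "No"],
    "Age": [30, 40, 50, 20, 30, 40],
    "Department": ["Sales", "Sales", "R&D", "R&D", "R&D", "R&D"],
    "DailyRate": [100, 200, 300, 400, 500, 600],
    "BusinessTravel": ["Travel_Rarely", "Travel_Rarely", "Travel_Frequently",
                       "Travel_Rarely", "Travel_Rarely", "Travel_Rarely"],
})

# Average age for each Attrition value
mean_age = hr.groupby("Attrition")["Age"].mean()

# Attrition rate per department: share of rows where Attrition is "Yes"
attrition_rate = (hr["Attrition"] == "Yes").groupby(hr["Department"]).mean()
top_department = attrition_rate.idxmax()

# Filter on two conditions with a boolean mask, then aggregate
mask = (hr["BusinessTravel"] == "Travel_Rarely") & (hr["Department"] == "Sales")
sales_rare_rate = hr.loc[mask, "DailyRate"].mean()

print(mean_age)
print(top_department, sales_rare_rate)
```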