# Basic Pandas Commands
In this guide, we start with the very basics of Pandas and go through the following steps:
1. Import Pandas library
2. Load dataset
3. Print a concise summary of the dataset
4. Check if the dataframe is empty
5. View the row index and column headers
6. Understand the dimensions of the dataset (How many rows or columns does it have?)
7. Visualize the dataset
8. Understand the datatype of each column


We practice these commands on the Titanic train dataset (train.csv).
The dataset can be downloaded from the Kaggle website:

https://www.kaggle.com/c/titanic/data?select=train.csv

### List of methods and properties discussed in this notebook

**Load the data**
- pd.DataFrame()
- pd.read_csv()

**Basic overview of data**
- df.info()
- df.describe()
- df.head()
- df.tail()

**Check rows and columns**
- df.keys()
- df.axes
- df.index
- df.columns

**Size and shape of the dataset**
- df.size
- df.shape
- df.ndim

**Datatypes in the dataset**
- df.dtypes
- df.select_dtypes(include=['float64'])
- df.select_dtypes(exclude=['float64'])

**Others**
- df.empty




In [2]:
#Here is a basic guide to using Pandas for Data Pre-Processing, Exploration and Manipulation in Python
#We will start with the basic usage of the Pandas library and will slowly move to advanced functions

#So, without any further delay...let's start with how to load a dataset in Pandas

## 1. Import Pandas library

In [3]:
#First, import the Pandas library
import pandas as pd

## Creating Data

There are two core objects in pandas: the DataFrame and the Series.

<b>What is a DataFrame?</b><br>
A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. 

Like Series, DataFrame accepts many different kinds of input:
1. Dict of 1D ndarrays, lists, dicts, or Series
2. 2-D numpy.ndarray
3. Structured or record ndarray
4. Series
5. Another DataFrame

For example, consider the following simple DataFrame:

In [5]:
#Creating a basic dataframe
pd.DataFrame({"A": [10,20], "B":[30,40]})

,A,B
0,10,30
1,20,40


In [14]:
#Setting the index in the dataframe
pd.DataFrame({"FirstName": ["David","Zlatan"], "LastName": ["Beckham","Ibrahimovic"]}, index=["Player0","Player1"])

,FirstName,LastName
Player0,David,Beckham
Player1,Zlatan,Ibrahimovic
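The two cells above build a DataFrame from a dict of lists. As a minimal sketch of some of the other input types listed earlier (dict of Series, 2-D NumPy array, a single Series, another DataFrame), the toy values and the labels `r0` and `r1` below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Dict of Series: the shared index labels become the row labels
from_series = pd.DataFrame({
    "A": pd.Series([10, 20], index=["r0", "r1"]),
    "B": pd.Series([30, 40], index=["r0", "r1"]),
})

# 2-D NumPy array: column names are passed separately
from_array = pd.DataFrame(np.array([[10, 30], [20, 40]]), columns=["A", "B"])

# A single named Series becomes a one-column DataFrame
from_one_series = pd.DataFrame(pd.Series([10, 20], name="A"))

# Another DataFrame: copy=True makes the new frame independent of the source
from_frame = pd.DataFrame(from_array, copy=True)
from_frame.loc[0, "A"] = 99   # does not touch from_array

print(from_series)
print(from_one_series.columns.tolist())  # ['A']
print(from_array.loc[0, "A"])            # 10
```

Whichever input is used, the result is the same kind of labeled table, so the rest of this guide applies to all of them.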


<b>What is a Series?</b><br>
A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list:

In [15]:
pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [17]:
pd.Series([450,490,530],index=["Sales 2018","Sales 2019","Sales 2020"],name="Company ABC Revenue")

Sales 2018    450
Sales 2019    490
Sales 2020    530
Name: Company ABC Revenue, dtype: int64
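The link between the two objects can be sketched as follows: selecting a single column of a DataFrame gives back a Series that carries the same row index. The example reuses the small player table from above; the variable names are only illustrative.

```python
import pandas as pd

players = pd.DataFrame(
    {"FirstName": ["David", "Zlatan"], "LastName": ["Beckham", "Ibrahimovic"]},
    index=["Player0", "Player1"],
)

# Selecting one column returns a Series that shares the DataFrame's index
last_names = players["LastName"]
print(type(last_names).__name__)    # Series
print(last_names.name)              # LastName
print(last_names["Player1"])        # Ibrahimovic (label-based lookup)

# A Series can be turned back into a one-column DataFrame
print(last_names.to_frame().shape)  # (2, 1)
```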

## 2. Load dataset

In [25]:
#Next, let's load the dataset into a Pandas dataframe
df = pd.read_csv('train.csv') 

#Since the dataset is stored in a CSV file, we use pd.read_csv().
#Pandas provides a different reader function for each type of data source.
#More details can be found here:
#https://pandas.pydata.org/pandas-docs/stable/reference/io.html

#Pandas provides readers for the following formats:
#Table, CSV, Clipboard, Excel, JSON, HTML, XML, HDFStore: PyTables (HDF5), Feather, Parquet, ORC, SAS, SPSS, SQL, Google BigQuery and Stata
#(LaTeX is supported for output only, through to_latex().)
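pd.read_csv() also accepts optional parameters that control what gets loaded. The sketch below is only an illustration: it reads a hypothetical three-row sample from an in-memory string (so it runs without train.csv) and uses the index_col and usecols parameters.

```python
import io
import pandas as pd

# Hypothetical three-row sample that mimics a few columns of train.csv
csv_text = """PassengerId,Survived,Pclass,Age
1,0,3,22
2,1,1,38
3,1,3,
"""

sample = pd.read_csv(
    io.StringIO(csv_text),                       # a file path works here as well
    index_col="PassengerId",                     # use this column as the row index
    usecols=["PassengerId", "Survived", "Age"],  # load only these columns
)

print(sample.shape)         # (3, 2)
print(sample["Age"].dtype)  # float64, because the empty field is read as NaN
```

With the real file, the same call would take 'train.csv' in place of the io.StringIO object.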


In [26]:
#Let's view the data from Titanic dataset in the dataframe
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [27]:
#Displaying df gives a rough overview of the dataset.
#The DataFrame also has properties and methods that help us explore the data in a better way.

## 3. Print a concise summary of the dataset

In [28]:
#Get a summary of the dataset: column names, non-null counts and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
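The Non-Null Count column tells us indirectly how many values are missing: in the output above, Age has 891 - 714 = 177 missing values, Cabin has 891 - 204 = 687 and Embarked has 891 - 889 = 2. A minimal sketch of getting these counts directly, using a small made-up frame instead of the Titanic data:

```python
import numpy as np
import pandas as pd

# Toy data with a few missing entries (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": ["C85", None, None, None],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

missing = toy.isna().sum()   # number of missing values per column
non_null = toy.count()       # the non-null count that info() reports

print(missing)
print(bool((missing + non_null == len(toy)).all()))  # True
```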


In [45]:
#Summary statistics (count, mean, std, min, quartiles, max) for the numeric columns in the dataframe
df.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
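By default describe() skips the text columns when numeric columns are present. As a sketch of how to include them, the include="all" argument adds the rows unique, top and freq; the small frame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Fare": [7.25, 71.28, 7.92, 53.1, 8.05],
})

numeric_summary = toy.describe()            # numeric columns only (default)
full_summary = toy.describe(include="all")  # text and numeric columns together

print(numeric_summary.columns.tolist())   # ['Fare']
print(full_summary.loc["unique", "Sex"])  # 2 distinct values
print(full_summary.loc["top", "Sex"])     # female (the most frequent value)
print(full_summary.loc["freq", "Sex"])    # 3 (how often the top value occurs)
```

Statistics that do not apply to a column, such as the mean of Sex, show up as NaN in the combined summary.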


## 4. Check if the dataframe is empty

In [29]:
#True if the DataFrame has no elements (at least one axis has length 0), otherwise False
df.empty

False
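One detail worth knowing: a dataframe that contains only missing values is not considered empty, because the rows still exist. A small sketch with two made-up frames:

```python
import numpy as np
import pandas as pd

no_rows = pd.DataFrame(columns=["A", "B"])        # column headers but zero rows
only_nan = pd.DataFrame({"A": [np.nan, np.nan]})  # two rows, all values missing

print(no_rows.empty)            # True
print(only_nan.empty)           # False: NaN values still count as elements
print(only_nan.dropna().empty)  # True once the NaN rows are dropped
```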

## 5. View the row index and column headers

In [30]:
#Return a list representing the axes of the DataFrame.
df.axes

[RangeIndex(start=0, stop=891, step=1),
 Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
        'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
       dtype='object')]

In [31]:
#The index (row labels) of the DataFrame.
df.index

RangeIndex(start=0, stop=891, step=1)

In [32]:
#The column labels of the DataFrame.
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [33]:
#df.keys() returns the same column labels as df.columns
df.keys()

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
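The four commands in this section are closely related. As a quick sketch on a toy frame, df.axes is simply the row index followed by the column labels, and df.keys() matches df.columns:

```python
import pandas as pd

toy = pd.DataFrame({"A": [10, 20], "B": [30, 40]})

row_axis, column_axis = toy.axes
print(row_axis.equals(toy.index))       # True
print(column_axis.equals(toy.columns))  # True
print(toy.keys().equals(toy.columns))   # True
print(list(toy.columns))                # ['A', 'B']
print("A" in toy)                       # True: membership tests check column labels
```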

## 6. Understand the dimensions of the dataset

In [34]:
#Total number of elements in the dataframe
df.size

10692

In [35]:
#Return a tuple representing the dimensionality of the DataFrame.
df.shape

(891, 12)

In [36]:
#Return an int representing the number of axes / array dimensions.
df.ndim

2
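These three attributes fit together: size is the product of the two numbers in shape (for the Titanic data, 891 * 12 = 10692), and ndim is the length of the shape tuple. A short sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"A": [10, 20, 30], "B": [40, 50, 60]})

n_rows, n_cols = toy.shape
print(n_rows, n_cols)                 # 3 2
print(toy.size == n_rows * n_cols)    # True
print(toy.ndim == len(toy.shape))     # True
print(len(toy))                       # 3: len() gives the number of rows

# A single column is a Series, which has one dimension
print(toy["A"].ndim, toy["A"].shape)  # 1 (3,)
```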

## 7. Visualize the dataset

Displaying df on its own prints a truncated view of the dataframe. Below, we also look at the values attribute, which exposes the data as a NumPy array, and at the head() and tail() methods, which show the first and last rows.

In [37]:
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [38]:
#Return a NumPy representation of the DataFrame (df.to_numpy() is the recommended equivalent)
df.values

array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ...,
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)
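The dtype=object in the output above appears because the Titanic columns mix numbers and text, so NumPy falls back to a general object array. A minimal sketch with a made-up frame shows the difference when only numeric columns are converted:

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, 38.0],
    "Fare": [7.25, 71.2833],
    "Sex": ["male", "female"],
})

mixed = toy.to_numpy()                     # numbers and text together
numeric = toy[["Age", "Fare"]].to_numpy()  # float columns only

print(mixed.dtype)    # object
print(numeric.dtype)  # float64
print(numeric.shape)  # (2, 2)
```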

In [39]:
#Look at the first 5 rows of the dataset.
#head(n) takes the number of rows to show as a parameter; the default is 5.
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [40]:
#Look at the last 5 rows of the dataset.
#tail(n) takes the number of rows to show as a parameter; the default is 5.
df.tail()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q
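As a short sketch of the row-count parameter, the toy frame below has ten rows with made-up PassengerId values from 1 to 10. A negative count is also accepted: head(-n) returns everything except the last n rows.

```python
import pandas as pd

toy = pd.DataFrame({"PassengerId": range(1, 11)})

print(toy.head(3)["PassengerId"].tolist())   # [1, 2, 3]
print(toy.tail(2)["PassengerId"].tolist())   # [9, 10]
print(toy.head(-8)["PassengerId"].tolist())  # [1, 2]: all rows except the last 8
print(len(toy.head(50)))                     # 10: asking for more rows than exist is fine
```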


## 8. Understand the datatype of each column

In [41]:
#Return the dtypes in the DataFrame.
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
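Because df.dtypes is itself a Series, the usual Series methods work on it. The sketch below counts how many columns share each dtype (similar to the last line of the df.info() output) and converts one column with astype(); the toy values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Age": [22.0, 38.0],
})

# Count how many columns share each dtype
dtype_counts = {str(dtype): int(n) for dtype, n in toy.dtypes.value_counts().items()}
print(dtype_counts)

# astype() returns a converted copy; the original column is left unchanged
survived_flag = toy["Survived"].astype(bool)
print(survived_flag.tolist())  # [False, True]
print(toy["Survived"].dtype)   # int64
```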

In [42]:
#Include example: keep only the float64 columns
df.select_dtypes(include=['float64'])

,Age,Fare
0,22.0,7.2500
1,38.0,71.2833
2,26.0,7.9250
3,35.0,53.1000
4,35.0,8.0500
...,...,...
886,27.0,13.0000
887,19.0,30.0000
888,,23.4500
889,26.0,30.0000


In [43]:
#Exclude example: keep every column except the float64 ones
df.select_dtypes(exclude=['float64'])

,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,0,0,373450,,S
...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,0,0,211536,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,0,0,112053,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,1,2,W./C. 6607,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,0,0,111369,C148,C


#### Notes on selecting dtypes with select_dtypes()

1. To select all numeric types, use np.number or 'number'
2. To select strings you must use the object dtype, but note that this will return all object dtype columns, not only strings (see the NumPy dtype hierarchy)
3. To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
4. To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
5. To select Pandas categorical dtypes, use 'category'
6. To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'
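A few of these selectors can be sketched on a small made-up frame. The Boarded column and its dates are hypothetical and exist only to give the example a datetime column; they are not part of train.csv.

```python
import pandas as pd

toy = pd.DataFrame({
    "Pclass": [3, 1],
    "Fare": [7.25, 71.2833],
    "Embarked": pd.Categorical(["S", "C"]),
    "Boarded": pd.to_datetime(["1912-04-10", "1912-04-10"]),  # hypothetical column
})

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
category_cols = toy.select_dtypes(include="category").columns.tolist()
datetime_cols = toy.select_dtypes(include="datetime").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)      # ['Pclass', 'Fare']
print(category_cols)     # ['Embarked']
print(datetime_cols)     # ['Boarded']
print(non_numeric_cols)  # ['Embarked', 'Boarded']
```

Note that 'number' picks up both the integer and the float column, which is often more convenient than listing 'int64' and 'float64' separately.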

# **End of Sheet**