1. Load Dataset
2. Create Series / Dataframe
3. Basic commands to get an overview of the dataset
4. Select / index / filter / navigate the dataset
5. Manipulate the dataset - Apply functions (custom, descriptive statistics), add/del rows and columns, create custom columns, missing values, duplicate
6. Data Wrangling - group, sort, merge, pivot, reshape

# Summary of key Pandas operations
In this guide, we summarize the key operations available in the Pandas library:
1. Load dataset - I/O
2. Basic overview of the dataset
3. Data Selection (Basic / Conditional)
4. Data Update (Basic / Conditional)

5. Boolean Indexing
6. Update
7. Summary statistics
8. Group
9. Filter
10. Sort
11. Merge, Reshape
12. Duplicate
13. Missing values
14. Visualization



We practice these commands on the Titanic training dataset, which can be downloaded from the Kaggle website.

https://www.kaggle.com/c/titanic/data?select=train.csv

In [3]:
import pandas as pd

## 1. Load dataset - I/O

In [4]:
#Creating a basic dataframe
pd.DataFrame({"A": [10,20], "B":[30,40]})

|   | A | B |
|---|---|---|
| 0 | 10 | 30 |
| 1 | 20 | 40 |


In [5]:
#Setting the index in the dataframe
pd.DataFrame({"FirstName": ["David","Zlatan"], "LastName": ["Beckham","Ibrahimovic"]}, index=["Player0","Player1"])

|   | FirstName | LastName |
|---|---|---|
| Player0 | David | Beckham |
| Player1 | Zlatan | Ibrahimovic |


<b>What is a Series?</b><br>
A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list:

In [6]:
pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [7]:
pd.Series([450,490,530],index=["Sales 2018","Sales 2019","Sales 2020"],name="Company ABC Revenue")

Sales 2018    450
Sales 2019    490
Sales 2020    530
Name: Company ABC Revenue, dtype: int64

In [8]:
#Next, let's load the dataset into a Pandas dataframe
df = pd.read_csv('train.csv') 

#Since the dataset is stored in a CSV file, we use pd.read_csv().
#Pandas provides a different function for each type of data source we may want to load into a dataframe.
#More details can be found here:
#https://pandas.pydata.org/pandas-docs/stable/reference/io.html

#Pandas provides I/O support for the following formats (a few, such as LaTeX, can only be written, not read):
#Table, CSV, Clipboard, Excel, JSON, HTML, XML, LaTeX, HDFStore: PyTables (HDF5), Feather, Parquet, ORC, SAS, SPSS, SQL, Google BigQuery and STATA
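
The loading step can be sketched without the Kaggle file on disk. The example below reads a tiny CSV from an in-memory string; the three rows are made up for illustration and are not real Titanic records. It also shows two common `pd.read_csv` parameters, `usecols` and `index_col`.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for train.csv (values are made up).
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,
"""

# pd.read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

# usecols limits the columns that are loaded; index_col picks the row labels.
indexed_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["PassengerId", "Survived", "Age"],
    index_col="PassengerId",
)

print(sample_df.shape)
print(indexed_df.head())
```

The empty `Age` field in the last row is read as a missing value (`NaN`).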

<hr style="border:2px solid gray"> </hr>.

## 2. Basic overview of the dataset

**2.1 Data Overview**
- df.info()
- df.describe()
- df.head()
- df.tail()
------

**2.2 Check rows and columns**
- df.keys()
- df.axes
- df.index
- df.columns
- df.rename()
--------

**2.3 Size and shape of the dataset**
- df.size
- df.shape
- df.ndim
--------

**2.4 Datatypes in the dataset**
- df.dtypes
- df.select_dtypes(include=['float64'])
- df.select_dtypes(exclude=['float64'])

-----

**2.5 Others**
- df.empty

-----------
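
As a rough illustration of the overview commands in this section, the sketch below runs them on a small hand-built frame. The column names mimic the Titanic data, but the values are made up.

```python
import pandas as pd

# Hypothetical mini-frame with Titanic-like columns (values are made up).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

df.info()                          # column names, non-null counts, dtypes
print(df.describe())               # summary statistics for numeric columns
print(df.head(2))                  # first two rows
print(df.shape, df.size, df.ndim)  # (rows, columns), total cells, dimensions

# Split the columns by dtype.
float_cols = df.select_dtypes(include=["float64"]).columns.tolist()
other_cols = df.select_dtypes(exclude=["float64"]).columns.tolist()
print(float_cols, other_cols)
```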

<hr style="border:2px solid gray"> </hr>.

## 3. Data Selection - Basic

**3.1 Selecting a single column**
- df.columnName
- df["columnName"]
-----

**3.2 Selecting multiple columns**
- df[["column1","column2", "column3"]]
------

**3.3 Selecting rows/columns based on index/column number - .iloc**
- df.iloc[row_number, column_number]
- df.iloc[2,3]
------

**3.4 Selecting a range of rows and columns using slice operator - .iloc**
- df.iloc[2:5,3:10]
- df.iloc[1:4, 1:]
- df.iloc[:, 2:5]
- df.iloc[0:10,:]
------

**3.5 Selecting specific rows and columns using index - .iloc**
- df.iloc[[2,6],[3,10]]
- df.iloc[[0,10]]
------

**3.6 Selecting specific rows and columns based on name - .loc**
- df.loc[0:3,["PassengerId","Survived","Pclass","Name","Sex","Age"]]
----

**3.7 Access a single value for a row/column label pair - .at**
- df.at[4,"Name"]
------

**3.8 Access a single value for a row/column pair by integer position - .iat**
- df.iat[4,3]
-------
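
One difference between `.iloc` and `.loc` that is easy to miss: a positional slice excludes its end point, while a label slice includes it. A minimal sketch on a made-up frame with a non-default index:

```python
import pandas as pd

# Hypothetical frame; the index labels 10..40 are chosen to differ from positions.
df = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cid", "Dee"], "Age": [22, 38, 26, 35]},
    index=[10, 20, 30, 40],
)

by_position = df.iloc[0:2]   # positions 0 and 1; the end position is excluded
by_label = df.loc[10:30]     # labels 10, 20 and 30; the end label is included

single_label = df.at[30, "Name"]  # scalar access by row label and column name
single_pos = df.iat[2, 0]         # scalar access by row and column position

print(len(by_position), len(by_label), single_label, single_pos)
```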

<hr style="border:2px solid gray"> </hr>.

## 4. Data Selection - Conditional

**4.1 Boolean Indexing**

Conditional Selection
- df.loc[condition for rows, columns to be selected]
- df.loc[df['Sex']=="male",:]
----

Multiple conditions
- df.loc[(df['Sex']=="male") & (df["Pclass"] ==3),:]
-------

Using bool selector for multiple conditions
- bool_selector = (df['Sex']=="male") & (df["Pclass"] ==3)
- df.loc[bool_selector,:]
-----
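
A short sketch of combining conditions, on a made-up frame. Each condition needs its own parentheses, and the element-wise operators `&`, `|` and `~` are used instead of Python's `and`, `or` and `not`.

```python
import pandas as pd

# Hypothetical frame with Titanic-like columns (values are made up).
df = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Pclass": [3, 1, 1, 3],
    "Age": [22, 38, 54, 2],
})

# AND: male passengers in third class, all columns.
male_third = df.loc[(df["Sex"] == "male") & (df["Pclass"] == 3), :]

# OR: female passengers or first class, two columns only.
female_or_first = df.loc[
    (df["Sex"] == "female") | (df["Pclass"] == 1), ["Sex", "Pclass"]
]

# NOT: everyone who is not male.
not_male = df.loc[~(df["Sex"] == "male"), :]

print(male_third.index.tolist())
print(female_or_first.index.tolist())
print(not_male.index.tolist())
```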

**4.2 Using custom function**
- df.loc[lambda df: df[column_name] > 0, :]
- df.loc[:, lambda df: ['A', 'B']]

---------

**4.3 Indexing with .isin()**<br>
1. `.isin()` allows you to select rows where one or more columns have values you want:<br>
`values = {'ids': ['a', 'b'], 'vals': [1, 3]}`<br>
`df.isin(values)`<br><br>

2. Combine DataFrame's `isin` with the `any()` and `all()` methods to quickly select the subset of your data that meets a given criterion. To select the rows where each column meets its own criterion:<br>
`row_mask = df.isin(values).all(axis=1)`<br>
`df[row_mask]`

--------
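
The sketch below works through `isin` on a made-up frame. The extra column `other` is hypothetical and is there to show one pitfall: columns that are not keys in `values` come back as `False`, so `all(axis=1)` over the whole frame matches nothing unless the frame is first restricted to the keyed columns.

```python
import pandas as pd

# Hypothetical frame; "other" is not mentioned in the values dict.
df = pd.DataFrame({
    "ids": ["a", "b", "f", "n"],
    "vals": [1, 2, 3, 4],
    "other": ["x", "y", "z", "w"],
})
values = {"ids": ["a", "b"], "vals": [1, 3]}

# Over the whole frame, "other" is False everywhere, so no row passes all().
full_mask = df.isin(values).all(axis=1)

# Restrict to the keyed columns before combining.
keyed = df[list(values)]
row_mask = keyed.isin(values).all(axis=1)   # every keyed column matches
any_mask = keyed.isin(values).any(axis=1)   # at least one keyed column matches

print(df[row_mask])
print(df[any_mask])
```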

**4.4 Indexing with .where()**<br>
Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that selection output has the same shape as the original data, you can use the where method in Series and DataFrame.

In addition, where takes an optional other argument for replacement of values where the condition is False, in the returned copy.

`df.where(df < 0, -df)`

By default, where returns a modified copy of the data. There is an optional parameter inplace so that the original data can be modified without creating a copy:
`df_orig.where(df > 0, -df, inplace=True)`

-----
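
To make the shape-preserving behaviour concrete, here is a small sketch on a made-up Series comparing plain boolean selection with `where`:

```python
import pandas as pd

s = pd.Series([-2, -1, 0, 1, 2])

subset = s[s > 0]               # boolean selection keeps only matching rows
same_shape = s.where(s > 0)     # same length; NaN where the condition is False
replaced = s.where(s > 0, 0)    # False positions replaced by a constant
flipped = s.where(s > 0, -s)    # False positions replaced by another Series

print(len(subset), len(same_shape))
print(replaced.tolist())
print(flipped.tolist())
```

Note that `where` keeps the values where the condition is True and replaces the rest, which is the opposite of how the word "where" is sometimes read.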

**4.5 Index with np.where()**<br>
An alternative to where() is to use numpy.where(). Combined with setting a new column, you can use it to enlarge a DataFrame where the values are determined conditionally.

Suppose each row of the following DataFrame should receive one of two values: a new column `color` is set to 'green' when `col2` is 'Z' and to 'red' otherwise. Assuming NumPy is imported as `np`, you can do the following:

`df = pd.DataFrame({'col1': list('ABBC'), 'col2': list('ZZXY')})`<br>
`df['color'] = np.where(df['col2'] == 'Z', 'green', 'red')`

------

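**4.6 Index multiple conditions with np.select()**<br>
`np.select()` checks the conditions in order and, for each row, returns the choice paired with the first condition that is true; rows that match no condition receive `default`.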

conditions = [
    (df['col2'] == 'Z') & (df['col1'] == 'A'),
    (df['col2'] == 'Z') & (df['col1'] == 'B'),
    (df['col1'] == 'B')
]

choices = ['yellow', 'blue', 'purple']

df['color'] = np.select(conditions, choices, default='black')

--------
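
Putting 4.5 and 4.6 together, the following self-contained sketch runs both calls on the four-row frame from 4.5. The column names `two_way` and `multi_way` are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": list("ABBC"), "col2": list("ZZXY")})

# Two-way choice: 'green' where col2 is 'Z', otherwise 'red'.
df["two_way"] = np.where(df["col2"] == "Z", "green", "red")

# Multi-way choice: the first true condition wins, default covers the rest.
conditions = [
    (df["col2"] == "Z") & (df["col1"] == "A"),
    (df["col2"] == "Z") & (df["col1"] == "B"),
    (df["col1"] == "B"),
]
choices = ["yellow", "blue", "purple"]
df["multi_way"] = np.select(conditions, choices, default="black")

print(df)
```

The second row satisfies both the second and the third condition; it receives 'blue' because the earlier condition takes precedence.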

**4.7 Indexing with query()**<br>
DataFrame objects have a query() method that allows selection using an expression.

For example, you can select the rows where the value in column b lies strictly between the values in columns a and c:

- Pure Python<br>
`df[(df['a'] < df['b']) & (df['b'] < df['c'])]`
<br><br>
- Query<br>
`df.query('(a < b) & (b < c)')`

<hr style="border:2px solid gray"> </hr>.

## 5. Data Selection - Random Samples

**Sampling series**<br>
- When no arguments are passed, returns 1 row:<br>
`s.sample()`
<br><br>
- One may specify either a number of rows:<br>
`s.sample(n=3)`
<br><br>
- Or a fraction of the rows:<br>
`s.sample(frac=0.5)`
<br><br>
- With replacement:<br>
`s.sample(n=6, replace=True)`
<br><br>
- With a given seed, the sample will always draw the same rows:<br>
`df4.sample(n=2, random_state=2)`
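
The sampling options can be tried on any Series. The sketch below uses a made-up ten-element Series and checks only the sizes and the reproducibility, since the rows drawn are random.

```python
import pandas as pd

s = pd.Series(range(10))

one = s.sample()                        # a single random row
three = s.sample(n=3)                   # exactly three rows
half = s.sample(frac=0.5)               # half of the rows
oversized = s.sample(n=15, replace=True)  # with replacement, n may exceed len(s)

# The same random_state yields the same rows on every call.
first = s.sample(n=2, random_state=2)
second = s.sample(n=2, random_state=2)

print(len(one), len(three), len(half), len(oversized))
print(first.equals(second))
```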

<hr style="border:2px solid gray"> </hr>.

## 6. Data Update - Basic

**Update an entire column with a constant**
- df["column_name"] = 5
------

**Update an entire column based on some other column**
- df["column_name"] = df["someothercolumn"] * 10
-------

<hr style="border:2px solid gray"> </hr>.

## 7. Data Update - Custom and Conditional

**Custom function using apply**
- df["column_name"] = df["column_name"].apply(lambda x: 10 if x == 0 else 20)

-------
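
For a simple two-way rule like the one above, `apply` with a lambda and a vectorized `np.where` give the same result; the vectorized form is usually faster on large columns. A minimal sketch, with a made-up column and hypothetical output column names:

```python
import numpy as np
import pandas as pd

# Hypothetical column; values are made up.
df = pd.DataFrame({"SibSp": [0, 1, 0, 3]})

# Row-by-row custom function.
df["via_apply"] = df["SibSp"].apply(lambda x: 10 if x == 0 else 20)

# Vectorized equivalent of the same rule.
df["via_where"] = np.where(df["SibSp"] == 0, 10, 20)

print(df)
```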

**Conditional Update**

If-then on one column
- df.loc[df.AAA >= 5, "BBB"] = -1

If-then with assignment to two columns
- df.loc[df.AAA >= 5, ["BBB", "CCC"]] = 555

--------

**Swap column values**
- df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()
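
The conditional updates and the swap can be checked on a small made-up frame. The sketch also shows why the swap needs `.to_numpy()`: when the right-hand side is a DataFrame, pandas aligns it on column labels before assigning, so nothing is swapped.

```python
import pandas as pd

# Hypothetical frame; values are made up.
df = pd.DataFrame({
    "AAA": [4, 5, 6, 7],
    "BBB": [10, 20, 30, 40],
    "CCC": [100, 50, -30, -50],
})

# If-then on one column.
one_col = df.copy()
one_col.loc[one_col["AAA"] >= 5, "BBB"] = -1

# If-then with assignment to two columns.
two_col = df.copy()
two_col.loc[two_col["AAA"] >= 5, ["BBB", "CCC"]] = 555

# Without to_numpy(), label alignment leaves the values where they were.
aligned = df.copy()
aligned.loc[:, ["BBB", "AAA"]] = aligned[["AAA", "BBB"]]

# With to_numpy(), the raw values are assigned by position and the swap works.
swapped = df.copy()
swapped.loc[:, ["BBB", "AAA"]] = swapped[["AAA", "BBB"]].to_numpy()

print(one_col, two_col, aligned, swapped, sep="\n\n")
```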

<hr style="border:2px solid gray"> </hr>.

## 8. Remove Duplicates

If you want to identify and remove duplicate rows in a DataFrame, there are two methods that will help: `duplicated` and `drop_duplicates`. Each takes as an argument the columns to use to identify duplicated rows.<br>

- `duplicated` returns a boolean vector whose length is the number of rows, and which indicates whether a row is duplicated.

- `drop_duplicates` removes duplicate rows.

**Checking Duplicates in a single column**
- df2.duplicated('a', keep=False)
- df2.drop_duplicates('a')

------

**Checking Duplicates in multiple columns**
- df2.duplicated(['a', 'b'])
- df2.drop_duplicates(['a', 'b'])
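
The `keep` parameter decides which occurrence counts as the duplicate: `'first'` (the default) marks every occurrence after the first, `'last'` marks every occurrence before the last, and `False` marks all of them. A sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical frame; values are made up.
df2 = pd.DataFrame({
    "a": ["one", "one", "two", "two"],
    "b": ["x", "y", "x", "x"],
})

dup_first = df2.duplicated(subset="a")               # default keep='first'
dup_last = df2.duplicated(subset="a", keep="last")
dup_all = df2.duplicated(subset="a", keep=False)

unique_a = df2.drop_duplicates(subset="a")           # one row per value of a
dup_ab = df2.duplicated(subset=["a", "b"])           # both columns must repeat
unique_ab = df2.drop_duplicates(subset=["a", "b"])

print(dup_first.tolist(), dup_last.tolist(), dup_all.tolist())
print(unique_a.index.tolist(), unique_ab.index.tolist())
```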

<hr style="border:2px solid gray"> </hr>.

## 9. Merge