## About Pandas:

Pandas is an open-source Python library for working with data sets. It is free to use and easy to learn, and it provides functions for analyzing, cleaning, exploring, and manipulating data. Wes McKinney created it in 2008, and the library has been updated many times since then.

Pandas, NumPy, and scikit-learn are among the most popular Python libraries for data science and analysis. NumPy is used for lower-level scientific computation, Pandas is built on top of NumPy, and scikit-learn comes with several machine learning models.

You can install Pandas on your computer by running the command "pip install pandas", and then import it into your code. Pandas has two main data structures, Series and DataFrame, which help organize and manipulate data. People who work in data science often use Pandas because it works well with other libraries they use, such as Matplotlib and scikit-learn. Jupyter Notebook is a good environment for Pandas because it makes it easy to run code and visualize data.

## Explaining important functions in Pandas:


In [2]:
# To Install Pandas, uncomment the code below and run
# !pip install pandas

In [1]:
# Import Pandas
import pandas as pd

### Pandas Data Structures
**1. Series**

---
It is a one-dimensional labeled array. It can hold any data type.

In [5]:
s = pd.Series([2, -4, 6, 3, None], index=['A', 'B', 'C', 'D', 'E'])
s

A    2.0
B   -4.0
C    6.0
D    3.0
E    NaN
dtype: float64
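
In the output above, the integers appear as floats because the missing value is stored as NaN, which is a float. The following is a small sketch of how the dtype follows the data; the variable names are chosen only for illustration.

```python
import pandas as pd

# All integers: pandas keeps an integer dtype
ints = pd.Series([2, -4, 6])
print(ints.dtype)

# A missing value (None) is stored as NaN, so the Series becomes float64
with_missing = pd.Series([2, -4, None])
print(with_missing.dtype)  # float64

# Mixed Python types fall back to the generic object dtype
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)  # object
```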

**2. DataFrame**

---
It is a two-dimensional labeled data structure. Each column can hold a different data type.

In [6]:
data = {'RollNo' : [101, 102, 75, 99],
        'Name' : ['Mithlesh', 'Ram', 'Rudra', 'Mithlesh'],
        'Course' : ['Nodejs', None, 'Nodejs', 'JavaScript']
}

df = pd.DataFrame(data, columns=['RollNo', 'Name', 'Course'])

df.head()

Unnamed: 0,RollNo,Name,Course
0,101,Mithlesh,Nodejs
1,102,Ram,
2,75,Rudra,Nodejs
3,99,Mithlesh,JavaScript


### Importing Data
Pandas can read data from many file types and sources into your notebook. Some examples are given below.

In [None]:
# Import a CSV file
pd.read_csv(filename)

# Import a TSV (tab-separated) file
pd.read_table(filename)

# Import an Excel file
pd.read_excel(filename)

# Import from a SQL table/database
pd.read_sql(query, connection_object)

# Import a JSON file
pd.read_json(filename)

# Import tables from an HTML page (returns a list of DataFrames)
pd.read_html(url)

# Read text from the clipboard into a DataFrame
pd.read_clipboard()

# From a dict of column names mapped to lists of values
pd.DataFrame(data)
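
The readers above take placeholders such as `filename`. As a minimal runnable sketch, the snippet below reads CSV text from an in-memory buffer instead of a file on disk. The sample text and the names `csv_text` and `df_csv` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "RollNo,Name,Course\n101,Mithlesh,Nodejs\n102,Ram,\n"

# read_csv accepts a path or a file-like object, so StringIO works here
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.shape)                      # (2, 3)
print(df_csv["Course"].isna().tolist())  # [False, True]
```

The empty `Course` field in the second row is read as a missing value.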

### Exporting Data
Pandas can export data in various formats. Some examples are given below.

In [None]:
# Export as a CSV file
df.to_csv(filename)

# Export as an Excel file
df.to_excel(filename)

# Export as a SQL table
df.to_sql(table_name, connection_object)

# Export as a JSON file
df.to_json(filename)

# Export as a HTML table
df.to_html(filename)

# Write to the clipboard
df.to_clipboard()
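
As a hedged illustration of a write followed by a read, the sketch below exports a small DataFrame to CSV text and loads it back. The frame `df_out` and its values are assumptions for the example. Passing `index=False` keeps the row index out of the exported text.

```python
import io
import pandas as pd

df_out = pd.DataFrame({"RollNo": [101, 102], "Name": ["Mithlesh", "Ram"]})

# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = df_out.to_csv(index=False)
print(csv_text)

# Reading the text back recovers the same columns and values
df_back = pd.read_csv(io.StringIO(csv_text))
print(df_back["Name"].tolist())  # ['Mithlesh', 'Ram']
```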

### Data Cleaning
You may need to remove null or duplicate values from your Series or DataFrame. You can use the functions described below.

In [15]:
data = {'RollNo' : [101, 102, 75, 99],
        'Name' : ['Mithlesh', 'Ram', 'Rudra', 'Mithlesh'],
        'Course' : ['Nodejs', None, 'Nodejs', 'JavaScript']
}
df = pd.DataFrame(data)
df

Unnamed: 0,RollNo,Name,Course
0,101,Mithlesh,Nodejs
1,102,Ram,
2,75,Rudra,Nodejs
3,99,Mithlesh,JavaScript


In [7]:
# Rename several columns at once (returns a new DataFrame)
df = df.rename(columns={'RollNo': 'ID', 'Name': 'Student_Name'})

# Or rename in place on the same DataFrame instead of on a copy
df.rename(columns={'RollNo': 'ID', 'Name': 'Student_Name'}, inplace=True)
df.head()

Unnamed: 0,ID,Student_Name,Course
0,101,Mithlesh,Nodejs
1,102,Ram,
2,75,Rudra,Nodejs
3,99,Mithlesh,JavaScript


In [45]:
# Flag rows that duplicate an earlier row in the given columns
data = {'RollNo' : [101, 102, 75, 99],
        'Name' : ['Mithlesh', 'Ram', 'Rudra', 'Mithlesh'],
        'Course' : ['Nodejs', None, 'Nodejs', 'Nodejs']
}
df = pd.DataFrame(data)

df.duplicated(subset=['Name', 'Course'])

0    False
1    False
2    False
3     True
dtype: bool
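
To turn the flags above into a count, the boolean result can be summed. This is a minimal sketch on a trimmed copy of the sample data; the name `students` is an assumption for illustration.

```python
import pandas as pd

students = pd.DataFrame({
    "Name": ["Mithlesh", "Ram", "Rudra", "Mithlesh"],
    "Course": ["Nodejs", None, "Nodejs", "Nodejs"],
})

# duplicated() marks repeats as True, and True counts as 1 when summed
n_dupes = students.duplicated(subset=["Name", "Course"]).sum()
print(n_dupes)  # 1

# keep=False marks every member of a duplicate group, not only the repeats
all_members = students.duplicated(subset=["Name", "Course"], keep=False)
print(all_members.tolist())  # [True, False, False, True]
```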

In [19]:
# Remove entire rows that are duplicates in the given columns
df.drop_duplicates(subset=['Name', 'Course'])

Unnamed: 0,RollNo,Name,Course
0,101,Mithlesh,Nodejs
1,102,Ram,
2,75,Rudra,Nodejs


In [20]:
# You can choose which duplicate to keep - the default is 'first'
df.drop_duplicates(subset=['Name', 'Course'], keep='last')

Unnamed: 0,RollNo,Name,Course
1,102,Ram,
2,75,Rudra,Nodejs
3,99,Mithlesh,Nodejs


In [24]:
# Checks for null values
df.isnull()

Unnamed: 0,RollNo,Name,Course
0,False,False,False
1,False,False,True
2,False,False,False
3,False,False,False


In [25]:
# Checks for missing values - isna() is an alias of isnull()
df.isna()

Unnamed: 0,RollNo,Name,Course
0,False,False,False
1,False,False,True
2,False,False,False
3,False,False,False


In [26]:
# Checks for non-Null Values - reverse of isnull()
df.notnull()

Unnamed: 0,RollNo,Name,Course
0,True,True,True
1,True,True,False
2,True,True,True
3,True,True,True


In [23]:
# Drops all rows that contain null values
df.dropna()

Unnamed: 0,RollNo,Name,Course
0,101,Mithlesh,Nodejs
2,75,Rudra,Nodejs
3,99,Mithlesh,Nodejs


In [28]:
df

Unnamed: 0,RollNo,Name,Course
0,101,Mithlesh,Nodejs
1,102,Ram,
2,75,Rudra,Nodejs
3,99,Mithlesh,Nodejs


In [9]:
# Drops all columns that contain null values
df.dropna(axis=1)

Unnamed: 0,RollNo,Name
0,101,Mithlesh
1,102,Ram
2,75,Rudra
3,99,Mithlesh


In [11]:
# Replaces all null values with 'temp'
df.fillna('temp')

Unnamed: 0,RollNo,Name,Course
0,101,Mithlesh,Nodejs
1,102,Ram,temp
2,75,Rudra,Nodejs
3,99,Mithlesh,Nodejs


In [13]:
# Replaces null values in RollNo with the column mean
x = df["RollNo"].mean()
df["RollNo"].fillna(x)
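
The sample `RollNo` column has no missing entries, so the effect of a mean fill is easier to see on a column that does. The following is a minimal sketch with a made-up `marks` frame; it fills only the numeric column so that text columns are left untouched.

```python
import pandas as pd

# Hypothetical marks column with one missing entry
marks = pd.DataFrame({
    "Name": ["Mithlesh", "Ram", "Rudra"],
    "Marks": [80.0, None, 90.0],
})

# mean() skips missing values by default, so this is (80 + 90) / 2
mean_marks = marks["Marks"].mean()

# Fill only the numeric column with its own mean
marks["Marks"] = marks["Marks"].fillna(mean_marks)
print(marks["Marks"].tolist())  # [80.0, 85.0, 90.0]
```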

In [44]:
# Converts the datatype of the Series to integer
df["RollNo"].astype(int)

0    101
1    102
2     75
3     99
Name: RollNo, dtype: int32

In [28]:
# Replaces all values equal to 6 with 'Six'
s.replace(6,'Six')

A    2.0
B   -4.0
C    Six
D    3.0
E    NaN
dtype: object

In [30]:
# Replaces all 2 with 'Two' and 6 with 'Six'
s.replace([2,6],['Two','Six'])

A    Two
B   -4.0
C    Six
D    3.0
E    NaN
dtype: object

In [32]:
# s1 points to the same Series as s
s1 = s

# s_copy is a copy of s, not the same Series
s_copy = s.copy()

# df1 points to the same DataFrame as df
df1 = df

# df_copy is a copy of df, not the same DataFrame
df_copy = df.copy()
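
The difference between a second name and a copy shows up once the data is modified. Below is a small sketch with hypothetical names (`original`, `alias`, `independent`) that are not part of the notebook.

```python
import pandas as pd

original = pd.Series([1, 2, 3], index=["A", "B", "C"])

alias = original               # second name for the same object
independent = original.copy()  # separate object with its own data

# Modify the Series through the alias
alias["A"] = 100

print(original["A"])      # 100, the change is visible through both names
print(independent["A"])   # 1, the copy is unaffected
print(alias is original)  # True
```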

### Filter, Sort and Group By
The following functions can be used to filter, sort, and group a Series or DataFrame.

In [18]:
df

Unnamed: 0,RollNo,Name,Course
0,101,Mithlesh,Nodejs
1,102,Ram,
2,75,Rudra,Nodejs
3,99,Mithlesh,Nodejs


In [19]:
# Filter rows where RollNo is greater than 100
df[df['RollNo'] > 100]

Unnamed: 0,RollNo,Name,Course
0,101,Mithlesh,Nodejs
1,102,Ram,


In [21]:
# Filter rows where RollNo is greater than 70 and Name is "Ram"
df[(df['RollNo'] > 70) & (df['Name'] == "Ram")]

Unnamed: 0,RollNo,Name,Course
1,102,Ram,


In [None]:
# Sorts values in ascending order
s.sort_values()

In [None]:
# Sorts values in descending order
s.sort_values(ascending=False)

In [None]:
# Sorts values by RollNo in ascending order
df.sort_values('RollNo')

In [None]:
# Sorts values by RollNo in descending order
df.sort_values('RollNo', ascending=False)
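
The cells above cover filtering and sorting. For grouping, a minimal sketch could look like the following; it rebuilds the sample data under the assumed name `students` so that it runs on its own.

```python
import pandas as pd

students = pd.DataFrame({
    "RollNo": [101, 102, 75, 99],
    "Name": ["Mithlesh", "Ram", "Rudra", "Mithlesh"],
    "Course": ["Nodejs", None, "Nodejs", "JavaScript"],
})

# Number of rows per course; rows with a missing Course are skipped by default
per_course = students.groupby("Course").size()
print(per_course)

# Largest RollNo within each Name group
max_roll = students.groupby("Name")["RollNo"].max()
print(max_roll)
```

Here `size()` counts rows in each group, while selecting `["RollNo"]` before `max()` limits the aggregation to that one column.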

### Selection:

**1. Series:**

---

In [None]:
# Accessing one element from Series
s['D']

In [None]:
# Accessing all elements between two given index labels (both included)
s['A':'C']

In [None]:
# Accessing all elements from the start up to the given index label
s[:'C']

In [None]:
# Accessing all elements from the given index label to the end
s['B':]


**2. DataFrame:**

---

In [None]:
# Accessing one column
df['Name']

In [None]:
# Accessing all rows from the given position onward
df[1:]

In [None]:
# Accessing rows before the given position
df[:1]

In [None]:
# Accessing rows between two positions (end position excluded)
df[1:2]

### Selecting, Boolean Indexing and Setting
**1. By Position**

---


In [None]:
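# Select a single value by row and column position
df.iloc[0, 1]

In [None]:
# Same lookup with iat, which is meant for single values
df.iat[0, 1]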

**2. By Label**

---

In [None]:
# Select rows and columns by label (returns a DataFrame)
df.loc[[0], ['Name']]

**3. By Label/Position**

---

In [None]:
# Both return the same row here because the default index labels equal the positions
df.loc[2]
df.iloc[2]
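
Label-based and position-based lookups only coincide when the index is the default 0, 1, 2, and so on. The sketch below uses a hypothetical frame `labeled` with string labels to show where they differ.

```python
import pandas as pd

# Hypothetical string labels as the index instead of the default 0, 1, 2
labeled = pd.DataFrame(
    {"RollNo": [101, 102, 75], "Name": ["Mithlesh", "Ram", "Rudra"]},
    index=["x", "y", "z"],
)

by_label = labeled.loc["z"]    # row whose label is "z"
by_position = labeled.iloc[2]  # third row, counting from 0

print(by_label.equals(by_position))  # True: "z" happens to be the third row

# loc looks up labels, so the integer 2 is not found in this index
try:
    labeled.loc[2]
except KeyError:
    print("loc[2] fails: there is no row labeled 2")
```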

**4. Boolean Indexing**

---

In [None]:
# Series s where value is > 0
s[(s > 0)]

In [None]:
# Series s where value is < -2 or > 1
s[(s < -2) | (s > 1)]

In [None]:
# Use filter to adjust DataFrame
df[df['RollNo']>100]

In [None]:
# Set the value at index 'D' of Series s to 10
s['D'] = 10
s.head()

### Dropping

In [None]:
# Drop values from rows (axis=0)
s.drop(['B',  'D'])

In [None]:
# Drop values from columns(axis=1)
df.drop('Name', axis=1)

In [None]:
# Sort by labels along an axis
df.sort_index()

In [None]:
# Sort by the values along an axis
df.sort_values(by='RollNo')

In [None]:
# Assign ranks to entries
df.rank()

### Retrieving Information
**1. Basic information**

---



In [None]:
# Counting all elements in Series
len(s)

In [None]:
# Counting rows in DataFrame
len(df)

In [None]:
# Prints number of rows and columns in dataframe
df.shape

In [None]:
# Returns the first 10 rows (the default is 5 if no value is set)
df.head(10)

In [None]:
# Returns the last 10 rows (the default is 5 if no value is set)
df.tail(10)

In [None]:
# For counting non-Null values column-wise
df.count()

In [None]:
# For range of index
df.index

In [None]:
# For name of attributes/columns
df.columns

In [None]:
# Index, Datatype and Memory information
df.info()

In [None]:
# Datatypes of each column
df.dtypes

In [None]:
# Summary statistics for numerical columns
df.describe()

**2. Summary**

In [None]:
# For adding all values column-wise
df.sum()

In [None]:
# For min column-wise
df.min()

In [None]:
# For max column-wise
df.max()

In [None]:
# For mean value of numeric columns
df.mean(numeric_only=True)

In [None]:
# For median value of numeric columns
df.median(numeric_only=True)

In [None]:
# Count non-Null values
s.count()

In [None]:
# Count non-Null values
df.count()

In [None]:
# Return the values of the given column as a list
df['Name'].tolist()

In [None]:
# Name of columns
df.columns.tolist()

In [None]:
# Creating subset
df[['Name', 'Course']]

In [None]:
# Return number of non-null values per column in each group
df.groupby('Name').count()

### Applying Functions

In [None]:
# Define function
f = lambda x: x*5

In [None]:
# Apply this function on given Series - For each value
s.apply(f)

In [None]:
# Apply this function on given DataFrame - column by column
df.apply(f)

**1. Internal Data Alignment**

In [None]:
# NA values for indices that don't overlap
s2 = pd.Series([8, -1, 4],  index=['A',  'C',  'D'])
s + s2

**2. Arithmetic Operations with Fill Methods**

In [None]:
# Fill values that don't overlap
s.add(s2, fill_value=0)
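
To make the two behaviours concrete, here is a self-contained sketch with small Series named `s_a` and `s_b` (assumed names, chosen so the example does not depend on earlier cells).

```python
import pandas as pd

s_a = pd.Series([2, -4, 6], index=["A", "B", "C"])
s_b = pd.Series([8, -1, 4], index=["A", "C", "D"])

# Plain + aligns on labels; labels present on only one side give NaN
total = s_a + s_b
print(total)
# A    10.0
# B     NaN
# C     5.0
# D     NaN

# add() with fill_value=0 treats the missing side as 0
filled = s_a.add(s_b, fill_value=0)
print(filled)
# A    10.0
# B    -4.0
# C     5.0
# D     4.0
```

If a label is missing from both Series, the result for that label stays NaN even with `fill_value`.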