# **Pandas**

Pandas is a Python library for working with data sets. It provides functions for analyzing, cleaning, exploring, and manipulating data. The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

**Data Structures :**

Pandas provides and works with the following data structures :

* **Series :**
  It is a one-dimensional structure storing homogeneous (same data type), mutable (modifiable) data.

  **Syntax :** pandas.Series(data, index, name)

* **DataFrame :**
  It is a two-dimensional structure storing heterogeneous (different data types), mutable data.

  **Syntax :** pandas.DataFrame(data, index, columns)
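
As a minimal sketch of the two constructors described above (the sample values and the names `s` and `frame` are made up for illustration):

```python
import pandas as pd

# A Series: one-dimensional, a single dtype, with an index and an optional name
s = pd.Series(data=[10, 20, 30], index=['a', 'b', 'c'], name='scores')

# A DataFrame: two-dimensional, each column may have its own dtype
frame = pd.DataFrame(
    data={'name': ['x', 'y', 'z'], 'score': [1.5, 2.5, 3.5]},
    index=['a', 'b', 'c'],
)

print(s.dtype)        # one dtype for the whole Series
print(frame.dtypes)   # one dtype per column

# Both structures are mutable: values can be changed after creation
s['a'] = 99
frame.loc['a', 'score'] = 0.0
```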

In [2]:
# importing pandas library
import pandas as pd

# importing numpy library
import numpy as np

## **Creating Pandas DataFrame**

We can create a pandas DataFrame from the following constructs :

* Lists
* Series
* Dictionary
* Numpy ndarray

**Using Lists**

In [8]:
# creating a list
lst = [10, 20, 30, 40, 50]

# creating dataframe
indexes = ['Row1', 'Row2', 'Row3', 'Row4', 'Row5']
columns = ['col1']
df = pd.DataFrame(data=lst, index=indexes, columns=columns)
df

Unnamed: 0,col1
Row1,10
Row2,20
Row3,30
Row4,40
Row5,50


**Using Dictionary**

In [10]:
# method 1: dictionary of lists
# creating a dictionary
student = {
    "Name": ['Anas', 'Yugam', 'Nikhil'],
    'Age': [23, 24, 21]
}

# creating dataframe
df = pd.DataFrame(data=student)
df

Unnamed: 0,Name,Age
0,Anas,23
1,Yugam,24
2,Nikhil,21


In [15]:
# method 2: dictionary of series
name = pd.Series(['Anas', 'Yugam', 'Rinku'])
roll_no = pd.Series([1, 2, 3])
marks = pd.Series([100, 99, 98])
stu_result = {
    'Name': name,
    'Roll No': roll_no,
    'Marks': marks
}

# creating dataframe
df = pd.DataFrame(data=stu_result)
df

Unnamed: 0,Name,Roll No,Marks
0,Anas,1,100
1,Yugam,2,99
2,Rinku,3,98


In [16]:
# method 3: List of dictionaries
# creating a list of dictionaries
lst = [{'Anas': 100, 'Yugam': 99, 'Rinku': 98},
       {'Anas': 99, 'Yugam': 98, 'Rinku': 97},
       {'Anas': 98, 'Yugam': 97, 'Harsh': 100}]

# creating dataframe
df = pd.DataFrame(data=lst)
df

Unnamed: 0,Anas,Yugam,Rinku,Harsh
0,100,99,98.0,
1,99,98,97.0,
2,98,97,,100.0
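
In the output above, a key that is missing from one of the dictionaries becomes `NaN` in that row, which is why the `Rinku` and `Harsh` columns are displayed as floats. A small sketch of the same behavior, using made-up keys `a` and `b`:

```python
import pandas as pd

records = [{'a': 1, 'b': 2}, {'a': 3}]   # the second record has no 'b'
frame = pd.DataFrame(records)

print(frame)
print(frame.dtypes)   # 'a' stays integer, 'b' becomes float to hold NaN
```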


**Using Numpy ndarray**

In [17]:
# creating a numpy array
data = np.arange(0, 20).reshape(5,4)

# creating dataframe
indexes = ['Row1', 'Row2', 'Row3', 'Row4', 'Row5']
columns = ['Col1', 'Col2', 'Col3', 'Col4']
df = pd.DataFrame(data=data, index=indexes, columns=columns)
df

Unnamed: 0,Col1,Col2,Col3,Col4
Row1,0,1,2,3
Row2,4,5,6,7
Row3,8,9,10,11
Row4,12,13,14,15
Row5,16,17,18,19


In [7]:
print(type(df))

<class 'pandas.core.frame.DataFrame'>


## **Attributes and Methods**

In [20]:
# creating a numpy array
data = np.arange(0, 20).reshape(5,4)

# creating dataframe
indexes = ['Row1', 'Row2', 'Row3', 'Row4', 'Row5']
columns = ['Col1', 'Col2', 'Col3', 'Col4']
df = pd.DataFrame(data=data, index=indexes, columns=columns)

**index Attribute**

It returns the row labels (the index) of the dataframe

In [18]:
df.index

Index(['Row1', 'Row2', 'Row3', 'Row4', 'Row5'], dtype='object')

**columns Attribute**

It returns the column labels of the dataframe

In [21]:
df.columns

Index(['Col1', 'Col2', 'Col3', 'Col4'], dtype='object')

**shape Attribute**

It returns a tuple with the number of rows and the number of columns in the dataframe

In [22]:
df.shape

(5, 4)

**head() Method**

It returns the first n rows (n defaults to 5)

In [23]:
df.head(2)

Unnamed: 0,Col1,Col2,Col3,Col4
Row1,0,1,2,3
Row2,4,5,6,7


**tail() Method**

It returns the last n rows (n defaults to 5)

In [24]:
df.tail(2)

Unnamed: 0,Col1,Col2,Col3,Col4
Row4,12,13,14,15
Row5,16,17,18,19


**info() Method**

It prints a concise summary of the dataframe: index range, column names, non-null counts, dtypes, and memory usage

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, Row1 to Row5
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Col1    5 non-null      int64
 1   Col2    5 non-null      int64
 2   Col3    5 non-null      int64
 3   Col4    5 non-null      int64
dtypes: int64(4)
memory usage: 372.0+ bytes


**describe() Method**

It returns descriptive statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns of the dataframe

In [26]:
df.describe()

Unnamed: 0,Col1,Col2,Col3,Col4
count,5.0,5.0,5.0,5.0
mean,8.0,9.0,10.0,11.0
std,6.324555,6.324555,6.324555,6.324555
min,0.0,1.0,2.0,3.0
25%,4.0,5.0,6.0,7.0
50%,8.0,9.0,10.0,11.0
75%,12.0,13.0,14.0,15.0
max,16.0,17.0,18.0,19.0


## **Indexing or Slicing**

**Using column names**

In [28]:
df['Col1']

Row1     0
Row2     4
Row3     8
Row4    12
Row5    16
Name: Col1, dtype: int64

In [27]:
df[['Col1', 'Col2', 'Col3']]

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,4,5,6
Row3,8,9,10
Row4,12,13,14
Row5,16,17,18


**Using loc**

loc retrieves data based on the labels of rows and columns. A label slice includes its end label.

In [30]:
# retrieving a single row
df.loc['Row2']

Col1    4
Col2    5
Col3    6
Col4    7
Name: Row2, dtype: int64

In [32]:
# slicing through specified row names
df.loc['Row1':'Row3']

Unnamed: 0,Col1,Col2,Col3,Col4
Row1,0,1,2,3
Row2,4,5,6,7
Row3,8,9,10,11


In [35]:
# slicing through specified row and column names
df.loc['Row1':'Row2', 'Col1':'Col2']

Unnamed: 0,Col1,Col2
Row1,0,1
Row2,4,5


In [39]:
# selecting all rows for specific columns by name
df.loc[:, ['Col1', 'Col4']]

Unnamed: 0,Col1,Col4
Row1,0,3
Row2,4,7
Row3,8,11
Row4,12,15
Row5,16,19


**Using iloc**

iloc retrieves data based on integer position. The end position of a slice is excluded.

In [36]:
# slicing through specified row indexes
df.iloc[0:3]

Unnamed: 0,Col1,Col2,Col3,Col4
Row1,0,1,2,3
Row2,4,5,6,7
Row3,8,9,10,11


In [37]:
# slicing through specified row and column index
df.iloc[0:2, 0:2]

Unnamed: 0,Col1,Col2
Row1,0,1
Row2,4,5


In [38]:
# selecting all rows for specific columns by position
df.iloc[:, [0, 3]]

Unnamed: 0,Col1,Col4
Row1,0,3
Row2,4,7
Row3,8,11
Row4,12,15
Row5,16,19
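
The difference between label-based and position-based slicing shown above can be checked directly. This sketch rebuilds the same 5x4 data under the illustrative name `frame`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
    columns=['Col1', 'Col2', 'Col3', 'Col4'],
)

by_label = frame.loc['Row1':'Row3']   # label slice: 'Row3' is included
by_position = frame.iloc[0:3]         # position slice: position 3 is excluded

# Both selections contain Row1, Row2 and Row3
print(by_label.equals(by_position))

# Single cell: row label + column label vs. row position + column position
print(frame.loc['Row2', 'Col3'], frame.iloc[1, 2])
```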


## **Convert DataFrame to ndarray**

In [40]:
# we can convert a dataframe to an ndarray by using the values attribute
df.values

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])
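
The pandas documentation recommends `DataFrame.to_numpy()` over the `values` attribute for new code. A minimal sketch on the same data (the names `frame` and `arr` are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(0, 20).reshape(5, 4),
                     columns=['Col1', 'Col2', 'Col3', 'Col4'])

arr = frame.to_numpy()        # same data as frame.values, as a NumPy array
print(type(arr), arr.shape)
```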

## **Basic Operations**

In [81]:
# creating a list of dictionaries
lst = [{'Anas': 100, 'Yugam': 99, 'Rinku': 98},
       {'Anas': 99, 'Yugam': 98, 'Rinku': 97},
       {'Anas': 98, 'Yugam': 97, 'Harsh': 100},
       {'Anas': 100, 'Yugam': 99, 'Harsh': 99, 'Rinku': 99}]

# creating dataframe
df = pd.DataFrame(data=lst)
df

Unnamed: 0,Anas,Yugam,Rinku,Harsh
0,100,99,98.0,
1,99,98,97.0,
2,98,97,,100.0
3,100,99,99.0,99.0


**Check null values**

In [82]:
df.isnull().sum()

Anas     0
Yugam    0
Rinku    1
Harsh    2
dtype: int64

**Retrieving NaN records**

In [83]:
df[df.isnull().any(axis=1)]

Unnamed: 0,Anas,Yugam,Rinku,Harsh
0,100,99,98.0,
1,99,98,97.0,
2,98,97,,100.0


**Creating new column**

In [84]:
df['Nikhil'] = [100, 90, np.nan, 80]
df

Unnamed: 0,Anas,Yugam,Rinku,Harsh,Nikhil
0,100,99,98.0,,100.0
1,99,98,97.0,,90.0
2,98,97,,100.0,
3,100,99,99.0,99.0,80.0


**Unique Values**

In [86]:
df['Anas'].unique()

array([100,  99,  98])

**Total Number of unique values**

In [87]:
df['Anas'].nunique()

3

**Unique value counts**

In [88]:
df['Anas'].value_counts()

100    2
99     1
98     1
Name: Anas, dtype: int64

**Filling Null values**

In [94]:
# assign the result back instead of calling fillna(inplace=True) on a column selection
df['Harsh'] = df['Harsh'].fillna(df['Harsh'].mean())
df

Unnamed: 0,Anas,Yugam,Rinku,Harsh,Nikhil
0,100,99,98.0,99.5,100.0
1,99,98,97.0,99.5,90.0
2,98,97,,100.0,
3,100,99,99.0,99.0,80.0
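
Missing values can also be handled for the whole dataframe at once. The sketch below uses a small made-up frame with columns `x` and `y`; both calls return a new DataFrame and leave the original unchanged:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 4.0, 6.0]})

filled = frame.fillna(frame.mean())   # each column is filled with its own mean
dropped = frame.dropna()              # keep only rows with no missing value

print(filled)
print(dropped)
```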


## **Manipulation Methods**

In [73]:
# creating a numpy array
data = np.arange(0, 20).reshape(5,4)

# creating dataframe
indexes = ['Row1', 'Row2', 'Row3', 'Row4', 'Row5']
columns = ['Col1', 'Col2', 'Col3', 'Col4']
df = pd.DataFrame(data=data, index=indexes, columns=columns)

**insert() Method**

It is used to insert a new column, together with its data, at a particular column position

In [74]:
df.insert(1, 'new_col', [7,8,9,10,11])
df

Unnamed: 0,Col1,new_col,Col2,Col3,Col4
Row1,0,7,1,2,3
Row2,4,8,5,6,7
Row3,8,9,9,10,11
Row4,12,10,13,14,15
Row5,16,11,17,18,19


**drop() Method**

It is used to drop one or more columns from the dataframe

In [76]:
# columns= already names the axis, so axis=1 is not needed
df.drop(columns='new_col', inplace=True)
df

Unnamed: 0,Col1,Col2,Col3,Col4
Row1,0,1,2,3
Row2,4,5,6,7
Row3,8,9,10,11
Row4,12,13,14,15
Row5,16,17,18,19
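
`drop()` also accepts a list of labels and can remove rows through the `index` argument. A minimal sketch on the same 5x4 data, without `inplace`, so each call returns a new DataFrame (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
    columns=['Col1', 'Col2', 'Col3', 'Col4'],
)

fewer_cols = frame.drop(columns=['Col2', 'Col4'])   # drop several columns
fewer_rows = frame.drop(index=['Row1', 'Row5'])     # drop rows by label

print(fewer_cols.columns.tolist())
print(fewer_rows.index.tolist())
```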


**sum() Method**

It returns the sum of the numerical data in a particular column

In [77]:
df['Col1'].sum()

40
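
Called on the whole dataframe, `sum()` totals every column, and `axis=1` totals every row instead. A short sketch on the same data (`col_totals` and `row_totals` are illustrative names):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
    columns=['Col1', 'Col2', 'Col3', 'Col4'],
)

col_totals = frame.sum()         # one total per column
row_totals = frame.sum(axis=1)   # one total per row

print(col_totals)
print(row_totals)
```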

**set_index() Method**

It is used to make a column the index

In [78]:
df.set_index('Col1', inplace=True)
df

Col1,Col2,Col3,Col4
0,1,2,3
4,5,6,7
8,9,10,11
12,13,14,15
16,17,18,19


**reset_index() Method**

It is used to reset the index to the default integer index; the old index becomes a regular column

In [79]:
df.reset_index(inplace=True)
df

Unnamed: 0,Col1,Col2,Col3,Col4
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19
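
Both methods can also be used without `inplace=True`, in which case they return a new DataFrame. The sketch below uses a made-up frame with columns `id` and `value`, and shows `drop=True`, which discards the old index instead of turning it back into a column:

```python
import pandas as pd

frame = pd.DataFrame({'id': [10, 20, 30], 'value': ['a', 'b', 'c']})

indexed = frame.set_index('id')               # 'id' becomes the index
restored = indexed.reset_index()              # 'id' is a column again
discarded = indexed.reset_index(drop=True)    # 'id' is dropped entirely

print(indexed.index.name)
print(restored.columns.tolist())
print(discarded.columns.tolist())
```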


**sort_values() Method**

It is used to sort the rows by the values of a column in ascending or descending order

In [80]:
df.sort_values('Col1', ascending=False)

Unnamed: 0,Col1,Col2,Col3,Col4
4,16,17,18,19
3,12,13,14,15
2,8,9,10,11
1,4,5,6,7
0,0,1,2,3
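
`sort_values()` returns a new, sorted DataFrame and leaves the original order untouched unless `inplace=True` is passed. It also accepts several columns, each with its own direction. A sketch with made-up columns `group` and `score`:

```python
import pandas as pd

frame = pd.DataFrame({'group': ['b', 'a', 'b', 'a'], 'score': [1, 4, 3, 2]})

# Sort by group ascending, then by score descending within each group
ordered = frame.sort_values(['group', 'score'], ascending=[True, False])

print(ordered)
print(frame.index.tolist())   # the original row order is unchanged
```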
