<a href="https://colab.research.google.com/github/avinash2302res07/Python-Data-Science-library-Tutorial/blob/main/Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Pandas is one of the most widely used open-source libraries in Python, largely because of the powerful data structures it offers.

**It is built on top of NumPy.**


* Provides high-level data structures such as Series and DataFrame for data analysis, data manipulation, and preparing data for ML models.
* Builds on NumPy arrays, which keeps it compatible with numerical operations and other scientific libraries.
* Helps you efficiently manage, analyze, and visualize large datasets (for example, spreadsheets and CSV files).
* Widely used in data science and machine learning to turn raw data into meaningful insights.
* Frequently used with libraries such as Matplotlib and Seaborn for data visualization.

## Getting Started With Pandas:

Once Pandas is installed on your system, you need to import the library, just as you would with NumPy. It is conventionally imported as follows:



```
import pandas as pd
```



## Data Structures in Pandas:

Pandas has two main data structures:
1.   Series
2.   DataFrame

### Series:

A Series is a one-dimensional, labeled sequence of data values.

If a DataFrame is a table, a Series is a list.

And in fact you can create Series with nothing more than a list:

Syntax:

       pd.Series(data)

In [None]:
#creating a series from List
import pandas as pd
data = [10, 20, 30, 40]
data_series = pd.Series(data)
data_series


    0
0  10
1  20
2  30
3  40


A Series is, in essence, a single column of a DataFrame. **You can assign row labels to a Series using the index parameter**. However, **a Series does not have a column name**; it has only one overall name.

**(1) Assign Labels: Create a Series with custom labels**

Syntax:
     
     index = ['A', 'B', 'C', 'D']
     s = pd.Series([10, 20, 30, 40], index=index)


In [None]:
#Assign Labels: Create a series with custom labels.
data = [30, 35, 40]
Index = ['2015 Sales', '2016 Sales', '2017 Sales']
data_series1 = pd.Series(data, index = Index)
data_series1

             0
2015 Sales  30
2016 Sales  35
2017 Sales  40


In [None]:
data_series1 = pd.Series(data, index = Index, name = 'Product A')
data_series1

            Product A
2015 Sales         30
2016 Sales         35
2017 Sales         40
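
Once a Series has labels and a name, you can read them back and look up values by label. The following is a small sketch that reuses the values from the example above; the variable names `sales`, `value_2016`, `labels`, and `series_name` are chosen here only for illustration.

```python
import pandas as pd

# Same values, labels, and name as the example above
sales = pd.Series([30, 35, 40],
                  index=['2015 Sales', '2016 Sales', '2017 Sales'],
                  name='Product A')

value_2016 = sales['2016 Sales']   # look up one value by its row label
labels = list(sales.index)         # the row labels assigned through index=
series_name = sales.name           # the single overall name of the Series

print(value_2016)
print(labels)
print(series_name)
```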



### DataFrame:

A DataFrame is a table with rows and columns: each row has an index label, and each column has a meaningful name.

There are various ways of creating DataFrames.

For example:
* Creating an empty DataFrame
* Creating a DataFrame from a list
* Creating a DataFrame from a dict of ndarrays/lists
* Creating a DataFrame from lists using a dictionary
* Reading from an external file (.txt, .csv, Excel, JSON, etc.)



**Creating an Empty Dataframe**

In [None]:
# import pandas as pd
import pandas as pd

# Calling DataFrame constructor
df = pd.DataFrame()

print(df)

Empty DataFrame
Columns: []
Index: []


**Creating Dataframe Using List**

In [None]:

# import pandas as pd
import pandas as pd

# list of strings (named `words` to avoid shadowing the built-in `list`)
words = ['Geeks', 'For', 'Geeks', 'is',
         'portal', 'for', 'Geeks']

# Calling DataFrame constructor on the list
df = pd.DataFrame(words)
print(df)

        0
0   Geeks
1     For
2   Geeks
3      is
4  portal
5     for
6   Geeks
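
In the output above, the single column is labeled `0` because no column name was given. As a minimal sketch, the `columns` parameter of the DataFrame constructor can supply one; the column name `word` and the shortened list are assumptions made for this illustration.

```python
import pandas as pd

# A shortened version of the list used above
words = ['Geeks', 'For', 'Geeks']

# columns= gives the single column a name instead of the default 0
df_words = pd.DataFrame(words, columns=['word'])
print(df_words)
```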


**Creating DataFrame from dict of NumPy Arrays or Lists**

We can also create a Pandas DataFrame using a dictionary of NumPy arrays or lists.
* Each key in the dictionary represents a column name and the corresponding array or list provides the values for that column. All arrays or lists must have the same length.

In [1]:
import pandas as pd

# Initialize the data as a dictionary of lists.
data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
        'Age': [20, 21, 19, 18]}

# Create DataFrame
df = pd.DataFrame(data)

print(df)

    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18
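
The same construction works when the dictionary values are NumPy arrays rather than plain lists. A short sketch with the same data; the variable names `data_np` and `df_np` are hypothetical.

```python
import numpy as np
import pandas as pd

# Same data as above, but each column is a NumPy array
data_np = {
    'Name': np.array(['Tom', 'nick', 'krish', 'jack']),
    'Age': np.array([20, 21, 19, 18]),
}

# Each key becomes a column name; each array becomes that column's values
df_np = pd.DataFrame(data_np)
print(df_np)
```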


**Creating a DataFrame from Lists Using a Dictionary:**

If your data is stored in Python lists, you can combine the lists into a dictionary and create the DataFrame from it.

* The ‘key’ in the dictionary acts as the column name and the ‘values’ stored are the entries under the column.

Syntax:


```
pd.DataFrame(data)
```


In [None]:
#Example - 1 : Create a Data Frame cars using raw data stored in a dictionary
cars_per_cap = [809, 731, 588, 18, 200, 70, 45]
country = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
drives_right = [True, False, False, False, True, True, True]

In [None]:
# The raw data above is given as lists. We first combine the lists into a dictionary, then create the DataFrame from that dictionary.
# Let's first create a dictionary named car
car = {"cars_per_cap":cars_per_cap,"country":country,"drives_right":drives_right}
car

{'cars_per_cap': [809, 731, 588, 18, 200, 70, 45],
 'country': ['United States',
  'Australia',
  'Japan',
  'India',
  'Russia',
  'Morocco',
  'Egypt'],
 'drives_right': [True, False, False, False, True, True, True]}

In [None]:
#Now we create dataframe from dictionary car
car_df = pd.DataFrame(car)
car_df

   cars_per_cap        country  drives_right
0           809  United States          True
1           731      Australia         False
2           588          Japan         False
3            18          India         False
4           200         Russia          True
5            70        Morocco          True
6            45          Egypt          True


In [None]:
type(car_df)

**Creating a DataFrame from an External File (CSV, Excel, JSON):**

Another method to create dataframes is to load data from external files.
Data may not necessarily be available in the form of lists.
Mostly, you will have to load the data stored in the form of a CSV file, text file, etc.

Syntax:
     
     pd.read_csv(filepath, sep=',', header='infer')

Note:

* sep: the separator between values (by default ‘,’)
* header: the row to use for the column names (the top row by default, if not specified)
* names: a list of column names to assign
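
To see how these three parameters interact without needing a file on disk, the sketch below reads from an in-memory string through `io.StringIO`. The semicolon-separated text and the column names `code`, `country`, and `cars_per_cap` are made up for this illustration.

```python
import io
import pandas as pd

# Hypothetical file contents: no header row, values separated by ';'
raw_text = "US;United States;809\nAUS;Australia;731\nJAP;Japan;588\n"

# Default header handling: the first row is consumed as column names
df_default = pd.read_csv(io.StringIO(raw_text), sep=';')

# header=None keeps every row as data; names= supplies the column names
df_named = pd.read_csv(io.StringIO(raw_text), sep=';', header=None,
                       names=['code', 'country', 'cars_per_cap'])

print(df_default)
print(df_named)
```

With the default settings one data row is lost to the header, which is why `df_default` has one row fewer than `df_named`.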



In [None]:
from google.colab import drive
drive.mount("/content/drive/")


Mounted at /content/drive/


In [None]:
#Example 2 : Creating a DataFrame by importing cars data from cars.csv
import pandas as pd
cars_df = pd.read_csv("/content/drive/MyDrive/cars.csv")
cars_df

    USCA   US  United States    809  FALSE
0  ASPAC  AUS      Australia  731.0   True
1  ASPAC  JAP          Japan  588.0   True
2  ASPAC   IN          India   18.0   True
3  ASPAC   RU         Russia  200.0  False
4  LATAM  MOR        Morocco   70.0  False
5    AFR   EG          Egypt   45.0  False
6    EUR  ENG        England    NaN   True


Pandas can load data from many different sources and provides a separate reader function for each of them.
For the full list, refer to: https://pandas.pydata.org/pandas-docs/stable/reference/io.html
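
As one example of a reader other than `read_csv`, the sketch below uses `pd.read_json` on an in-memory JSON string. The JSON text and the name `df_json` are assumptions made for this illustration; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# Hypothetical JSON content: a list of records, one per row
json_text = ('[{"country": "Japan", "cars_per_cap": 588}, '
             '{"country": "India", "cars_per_cap": 18}]')

# read_json turns each record into a row and each key into a column
df_json = pd.read_json(io.StringIO(json_text))
print(df_json)
```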

The most common files that you will work with are csv files.


In the output above, pandas used the first line of the file (USCA, US, United States, ...) as the header, even though it is actually a data row. The file has no header row, so we need to tell pandas not to treat the first line as one.

In [None]:
#Read the file without treating the first row as a header
cars_df = pd.read_csv("/content/drive/MyDrive/cars.csv" , header=None)
cars_df

       0    1              2      3      4
0   USCA   US  United States  809.0  False
1  ASPAC  AUS      Australia  731.0   True
2  ASPAC  JAP          Japan  588.0   True
3  ASPAC   IN          India   18.0   True
4  ASPAC   RU         Russia  200.0  False
5  LATAM  MOR        Morocco   70.0  False
6    AFR   EG          Egypt   45.0  False
7    EUR  ENG        England    NaN   True


# **Pandas Row & Columns:**

**Row(Index):**

In [None]:
#find the row labels (index)
cars_df.index

RangeIndex(start=0, stop=8, step=1)

In [None]:
#Example : Read file and set 1st column as index
cars_df = pd.read_csv("/content/drive/MyDrive/cars.csv" , header=None, index_col = 0)
cars_df

         1              2      3      4
0
USCA    US  United States  809.0  False
ASPAC  AUS      Australia  731.0   True
ASPAC  JAP          Japan  588.0   True
ASPAC   IN          India   18.0   True
ASPAC   RU         Russia  200.0  False
LATAM  MOR        Morocco   70.0  False
AFR     EG          Egypt   45.0  False
EUR    ENG        England    NaN   True


**Column(Index):**

>find the columns

Syntax:
       
       df.columns

In [None]:

# Example : find the columns of the above car dataframe
cars_df.columns

Index([1, 2, 3, 4], dtype='int64')

>rename the header/column name

Syntax:

       df.columns = ["new_name", "new_name", "new_name"]    
       print(df.columns)


In [None]:
#Example: rename the header/column name of the above car dataframe
cars_df.columns = ['region', 'country', 'cars_per_cap', 'drive_height']
cars_df

      region        country  cars_per_cap  drive_height
0
USCA      US  United States         809.0         False
ASPAC    AUS      Australia         731.0          True
ASPAC    JAP          Japan         588.0          True
ASPAC     IN          India          18.0          True
ASPAC     RU         Russia         200.0         False
LATAM    MOR        Morocco          70.0         False
AFR       EG          Egypt          45.0         False
EUR      ENG        England           NaN          True
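
Assigning to `df.columns` replaces every column name at once, so the new list must have one entry per column. When only some columns need new names, `DataFrame.rename` is an alternative. A small sketch on made-up data; the frame `df_small` and the new name `country` are chosen here for illustration.

```python
import pandas as pd

# A tiny frame with default-style integer column labels
df_small = pd.DataFrame({1: ['US', 'AUS'], 2: ['United States', 'Australia']})

# rename() maps old names to new ones and returns a new DataFrame;
# columns not mentioned in the mapping keep their names
renamed = df_small.rename(columns={2: 'country'})

print(renamed.columns.tolist())
print(df_small.columns.tolist())   # the original frame is unchanged
```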


In [None]:
#Remove the index name (the 0 shown above the row labels)
cars_df.index.name = None
cars_df


# **Hierarchical Indexing:**
It is also possible to create multilevel indexing for your DataFrame; this is known as hierarchical indexing.

In [None]:
from google.colab import drive
drive.mount("/content/drive/")

Mounted at /content/drive/


In [None]:
import pandas as pd
car_df = pd.read_csv("/content/drive/MyDrive/cars.csv", header=None)
car_df

       0    1              2      3      4
0   USCA   US  United States  809.0  False
1  ASPAC  AUS      Australia  731.0   True
2  ASPAC  JAP          Japan  588.0   True
3  ASPAC   IN          India   18.0   True
4  ASPAC   RU         Russia  200.0  False
5  LATAM  MOR        Morocco   70.0  False
6    AFR   EG          Egypt   45.0  False
7    EUR  ENG        England    NaN   True


In [None]:
car_df.columns = ['Country_code', 'region', 'country', 'cars_per_cap', 'drive_height']
car_df

  Country_code region        country  cars_per_cap  drive_height
0         USCA     US  United States         809.0         False
1        ASPAC    AUS      Australia         731.0          True
2        ASPAC    JAP          Japan         588.0          True
3        ASPAC     IN          India          18.0          True
4        ASPAC     RU         Russia         200.0         False
5        LATAM    MOR        Morocco          70.0         False
6          AFR     EG          Egypt          45.0         False
7          EUR    ENG        England           NaN          True


In [None]:
#Example (hierarchical indexing): set region & Country_code from the above data as a hierarchical index
car_df.set_index(['region','Country_code'], inplace = True)
car_df

                           country  cars_per_cap  drive_height
region Country_code
US     USCA          United States         809.0         False
AUS    ASPAC             Australia         731.0          True
JAP    ASPAC                 Japan         588.0          True
IN     ASPAC                 India          18.0          True
RU     ASPAC                Russia         200.0         False
MOR    LATAM               Morocco          70.0         False
EG     AFR                   Egypt          45.0         False
ENG    EUR                 England           NaN          True
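
A hierarchical index is mostly useful for selecting rows by one or more of its levels. The sketch below builds a small frame in memory rather than reading cars.csv; the data, the column names `region` and `code`, and the variable names are assumptions made for this illustration.

```python
import pandas as pd

# Small made-up frame, sorted so the two-level index is in order
df_multi = pd.DataFrame({
    'region': ['AFR', 'ASPAC', 'ASPAC'],
    'code': ['EG', 'IN', 'JAP'],
    'cars_per_cap': [45, 18, 588],
}).set_index(['region', 'code'])

# Selecting on the first level returns all rows in that group
aspac_rows = df_multi.loc['ASPAC']

# A tuple with one label per level selects a single row
india_value = df_multi.loc[('ASPAC', 'IN'), 'cars_per_cap']

# reset_index() turns the index levels back into ordinary columns
flat = df_multi.reset_index()

print(aspac_rows)
print(india_value)
```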


# **Describing Data:**

# **Indexing & Slicing:**

Case Study - Sales Data :

In [None]:
#Import all relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
from google.colab import drive
drive.mount("/content/drive/")

Mounted at /content/drive/


In [None]:
#Loading Sales dataset in dataframe
Sales = pd.read_excel("/content/drive/MyDrive/sales.xlsx")
Sales

         Market             Region  No_of_Orders     Profit      Sales
0        Africa     Western Africa           251  -12901.51   78476.06
1        Africa    Southern Africa            85   11768.58    51319.5
2        Africa       North Africa           182   21643.08   86698.89
3        Africa     Eastern Africa           110    8013.04    44182.6
4        Africa     Central Africa           103    15606.3   61689.99
5  Asia Pacific       Western Asia           382   -16766.9  124312.24
6  Asia Pacific      Southern Asia           469   67998.76   351806.6
7  Asia Pacific  Southeastern Asia           533   20948.84  329751.38
8  Asia Pacific            Oceania           646   54734.02  408002.98
9  Asia Pacific       Eastern Asia           414    72805.1  315390.77


In [None]:
#Example: Read file and set 2nd column as index
Sales = pd.read_excel("/content/drive/MyDrive/sales.xlsx", index_col=[1])
Sales.head()

                 Market  No_of_Orders     Profit     Sales
Region
Western Africa   Africa           251  -12901.51  78476.06
Southern Africa  Africa            85   11768.58   51319.5
North Africa     Africa           182   21643.08  86698.89
Eastern Africa   Africa           110    8013.04   44182.6
Central Africa   Africa           103    15606.3  61689.99


Example 1: Column indexing

In [None]:
#Display the Sales column
#Method 1: attribute access
Sales.Sales

                       Sales
Region
Western Africa      78476.06
Southern Africa      51319.5
North Africa        86698.89
Eastern Africa       44182.6
Central Africa      61689.99
Western Asia       124312.24
Southern Asia       351806.6
Southeastern Asia  329751.38
Oceania            408002.98
Eastern Asia       315390.77


In [None]:
#Method2 :
Sales['Sales']

                       Sales
Region
Western Africa      78476.06
Southern Africa      51319.5
North Africa        86698.89
Eastern Africa       44182.6
Central Africa      61689.99
Western Asia       124312.24
Southern Asia       351806.6
Southeastern Asia  329751.38
Oceania            408002.98
Eastern Asia       315390.77


In [None]:
#type(): find the type of an object
type(Sales['Sales'])

In [None]:
#Display Sales and Profit Together
Sales[["Sales", "Profit"]]

                       Sales     Profit
Region
Western Africa      78476.06  -12901.51
Southern Africa      51319.5   11768.58
North Africa        86698.89   21643.08
Eastern Africa       44182.6    8013.04
Central Africa      61689.99    15606.3
Western Asia       124312.24   -16766.9
Southern Asia       351806.6   67998.76
Southeastern Asia  329751.38   20948.84
Oceania            408002.98   54734.02
Eastern Asia       315390.77    72805.1
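
The number of brackets decides what kind of object comes back: a single column name returns a Series, while a list of names returns a DataFrame, even if the list has only one entry. The sketch below shows this on a two-row frame built in memory with values taken from the table above; the name `sales_small` and the other variable names are hypothetical.

```python
import pandas as pd

# Two rows copied from the Sales table above
sales_small = pd.DataFrame(
    {'Sales': [78476.06, 51319.5], 'Profit': [-12901.51, 11768.58]},
    index=['Western Africa', 'Southern Africa'],
)

one_column = sales_small['Sales']              # single name  -> Series
one_column_frame = sales_small[['Sales']]      # list of one  -> DataFrame
two_columns = sales_small[['Sales', 'Profit']] # list of two  -> DataFrame

print(type(one_column))
print(type(one_column_frame))
print(type(two_columns))
```

Attribute access such as `Sales.Sales` only works when the column name is a valid Python identifier, so bracket notation is the more general choice.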


Example 2: Row Indexing

In [None]:
#Example : Display data for "Southern Asia"

Method 1: (loc method)

The loc accessor takes a row label and, optionally, a column label.

>Syntax: df.loc['row_name']

In [None]:
Sales.loc["Southern Asia"]

              Southern Asia
Market         Asia Pacific
No_of_Orders            469
Profit             67998.76
Sales              351806.6



In [None]:
#Question : How do we find a specific column (Sales) in the Southern Asia row?
#Syntax : df.loc['row_name', 'column_name']

In [None]:
Sales.loc['Southern Asia','Sales']

351806.6

You can use the loc method to extract rows and columns from a DataFrame based on their labels:

Syntax:

>dataframe.loc[[list_of_row_labels], [list_of_column_labels]]
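
A short sketch of this list-based form, using three rows copied from the Sales table above and built in memory so it runs without the Excel file. The names `sales_part` and `subset` are hypothetical.

```python
import pandas as pd

# Three rows copied from the Sales table, indexed by region
sales_part = pd.DataFrame(
    {
        'Market': ['Asia Pacific', 'Asia Pacific', 'Asia Pacific'],
        'Profit': [-16766.9, 67998.76, 54734.02],
        'Sales': [124312.24, 351806.6, 408002.98],
    },
    index=['Western Asia', 'Southern Asia', 'Oceania'],
)

# Rows and columns are both chosen by label; the result keeps
# the order in which the labels are listed
subset = sales_part.loc[['Southern Asia', 'Oceania'], ['Sales', 'Profit']]
print(subset)
```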


This is called label-based indexing over DataFrames. Labels can be awkward to work with, for example when they are long or when you do not know them in advance.

In such cases, you can fetch data based on the row or column number instead.

Method 2: (iloc method)

Another method for indexing a dataframe is the iloc method, which uses the row or column number instead of labels.

>Syntax:

>>dataframe.iloc[rows_number, columns_number ]

In [None]:
#Syntax: df.iloc[row_number]

In [None]:
Sales.iloc[6]

              Southern Asia
Market         Asia Pacific
No_of_Orders            469
Profit             67998.76
Sales              351806.6


In [None]:
#Question : Find the Sales value for Southern Asia in the Sales DataFrame using the iloc method
#Syntax: df.iloc[row_number, column_number]

In [None]:
Sales.iloc[6,3]

351806.6
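
Counting positions by hand is error-prone, so it can help to look them up from the labels. The sketch below uses `Index.get_loc` for that and also shows `iloc` with lists of positions. It runs on three rows copied from the Sales table, so the positions differ from those in the full dataset; all variable names are hypothetical.

```python
import pandas as pd

# Three rows copied from the Sales table, indexed by region
sales_part = pd.DataFrame(
    {
        'Market': ['Asia Pacific', 'Asia Pacific', 'Asia Pacific'],
        'Profit': [-16766.9, 67998.76, 54734.02],
        'Sales': [124312.24, 351806.6, 408002.98],
    },
    index=['Western Asia', 'Southern Asia', 'Oceania'],
)

# Translate labels into integer positions (counting starts at 0)
row_pos = sales_part.index.get_loc('Southern Asia')
col_pos = sales_part.columns.get_loc('Sales')

# Position-based lookup of a single value
value = sales_part.iloc[row_pos, col_pos]

# Lists of positions select several rows and columns at once
block = sales_part.iloc[[0, 2], [1, 2]]

print(row_pos, col_pos, value)
print(block)
```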

# Slicing